Dear all,

In the latest version of Spark there is a feature called automatic partition discovery and schema merging for Parquet. As far as I understand it, this makes it possible to split a DataFrame across several Parquet files, and by simply loading the parent directory one gets back the merged (global) schema of the parent DataFrame.
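To make sure I understand the feature, here is a rough model of what I expect schema merging to do. This is plain Python, not Spark, and the file contents are hypothetical; it is just how I picture the behavior: the merged schema is the union of the columns across files, and a column absent from a given file comes back as null for that file's rows.

```python
def merge_schemas(files):
    """Union the column names of every file, preserving first-seen order."""
    merged = []
    for rows in files:
        for row in rows:
            for col in row:
                if col not in merged:
                    merged.append(col)
    return merged

def load_merged(files):
    """Load all rows under the merged schema, filling missing columns with None."""
    schema = merge_schemas(files)
    return [{col: row.get(col) for col in schema} for rows in files for row in rows]

# Hypothetical contents of three Parquet files under one parent directory.
raw_data = [{"imageId": 1, "imageRawData": b"\x00\x01"}]
feature1 = [{"imageId": 1, "feature1": [0.1, 0.2]}]
feature2 = [{"imageId": 1, "feature2": b"\xff"}]

rows = load_merged([raw_data, feature1, feature2])
```

In this model, columns that a given file does not define simply come back as None for that file's rows.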
I'm trying to use this feature for the following problem, but I'm running into trouble. I want to perform a series of feature extractions on a set of images. At the first step, my DataFrame has just two columns: imageId, imageRawData. I then transform the imageRawData column with different image feature extractors, whose results can be of different types: for example, one feature could be an mllib.Vector, and another could be an Array[Byte]. Each feature extractor stores its output as a Parquet file with two columns: imageId and a column for that feature. At the end, I have the following files:

- features/rawData.parquet
- features/feature1.parquet
- features/feature2.parquet

When I load all the features with:

sqlContext.load("features")

it seems to work, and in this example I get a DataFrame with 4 columns: imageId, imageRawData, feature1, feature2. But when I try to read the values, for example with show(), some columns contain null fields, and I just can't figure out what is going wrong. Any ideas?

Best,
Jao