Dear all,

In the latest version of Spark there's a feature called automatic
partition discovery and schema merging for Parquet. As far as I
understand it, this makes it possible to split a DataFrame across
several Parquet files and, by simply loading the parent directory, get
back the merged (global) schema of the whole DataFrame.
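
For reference, here is how I understood the feature, along the lines of
the schema-merging example in the Spark SQL docs (a minimal sketch,
assuming an existing SparkContext sc):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Two DataFrames with overlapping but different schemas,
    // saved under a common parent directory.
    val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
    df1.saveAsParquetFile("data/test_table/key=1")

    val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
    df2.saveAsParquetFile("data/test_table/key=2")

    // Loading the parent directory yields the merged schema:
    // single, double, triple, plus the partition column key.
    val merged = sqlContext.parquetFile("data/test_table")
    merged.printSchema()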

I'm trying to use this feature for the following problem, but I'm
running into trouble. I want to perform a series of feature extractions
on a set of images. Initially, my DataFrame has just two columns:
imageId and imageRawData. I then transform the imageRawData column with
different image feature extractors, whose outputs can have different
types: for example, one feature could be an mllib.Vector and another an
Array[Byte]. Each feature extractor stores its output as a Parquet file
with two columns, imageId and its feature column (see the sketch after
the list below). At the end, I have the following files:

- features/rawData.parquet
- features/feature1.parquet
- features/feature2.parquet
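
Concretely, each extractor does something like this (a simplified
sketch: extractFeature1 and its placeholder logic are made up for
illustration, rawData is the initial two-column DataFrame, and
sqlContext.implicits._ is assumed to be in scope):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Hypothetical extractor: raw image bytes -> mllib.Vector.
    def extractFeature1(raw: Array[Byte]): Vector =
      Vectors.dense(raw.take(8).map(_.toDouble)) // placeholder logic

    // rawData has columns (imageId: String, imageRawData: Array[Byte]).
    val feature1 = rawData.map { row =>
      (row.getAs[String](0), extractFeature1(row.getAs[Array[Byte]](1)))
    }.toDF("imageId", "feature1")

    feature1.saveAsParquetFile("features/feature1.parquet")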

When I load all the features with:

sqlContext.load("features")

this seems to work: in this example I get a DataFrame with four columns
(imageId, imageRawData, feature1, feature2). But when I try to read the
values, for example with show, some columns contain null fields, and I
just can't figure out what's going wrong.
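
To make the symptom concrete, here is what I run and roughly what I see
(the schema matches the example above; the null pattern is what puzzles
me):

    // The default data source is parquet, so load() on the parent
    // directory picks up all three files.
    val features = sqlContext.load("features")

    features.printSchema()
    // root
    //  |-- imageId: string (nullable = true)
    //  |-- imageRawData: binary (nullable = true)
    //  |-- feature1: vector (nullable = true)
    //  |-- feature2: binary (nullable = true)

    // Many fields show up as null here:
    features.show()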

Any ideas?


Best,


Jao
