I am reading JSON data whose schema varies from record to record. That
is, a field that would otherwise have a null value is simply absent from
the record (and therefore from its schema).

I would like to use the DataFrame API to select specific fields from this
data, and for fields that are missing from a record, to default to null or
an empty string.

Is this possible, or can DataFrames only handle a single consistent schema
throughout the data?

One thing I noticed is that the schema of the DataFrame is the union of
the schemas of all the records in it, so if record A has field X but
record B does not, X will show up in B as null because it's part of the
DataFrame's schema (because A has it). But if none of the records have
field X, then referencing that field results in an error about not being
able to resolve that column.
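
To make this concrete, here is a rough sketch of one workaround I can
imagine: compare the names I want against df.columns and substitute a null
literal for anything the inferred schema lacks. The helper names
(missing_fields, select_with_nulls) are made up for illustration, the cast
to string is an assumption about my field types, and it only looks at
top-level columns, not nested ones.

```python
def missing_fields(available, wanted):
    """Return the wanted field names that are not in the inferred schema."""
    available = set(available)
    return [name for name in wanted if name not in available]


def select_with_nulls(df, wanted):
    """Select `wanted` from a Spark DataFrame, filling absent columns with null.

    Only top-level columns are considered; nested fields are not handled.
    """
    # Imported lazily so missing_fields() can be used without Spark installed.
    from pyspark.sql.functions import col, lit

    absent = set(missing_fields(df.columns, wanted))
    return df.select([
        # Assumption: a missing field should appear as a null string column.
        lit(None).cast("string").alias(name) if name in absent else col(name)
        for name in wanted
    ])
```

With this, select_with_nulls(df, ["A", "X"]) would not fail when no
record has X, but I don't know whether this is the intended way to do it.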

If I know the schema of all possible fields and the order in which they
occur, it may be possible to get the RDD from the DataFrame and build my
own DataFrame by calling createDataFrame with my fabricated super-schema.
However, this is brittle, as the super-schema is not under my control and
may change in the future.
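
The super-schema idea would look roughly like the sketch below. The
function names (normalize, with_super_schema) are hypothetical, and I am
assuming every field can be treated as a nullable string, which may not
hold for the real data.

```python
def normalize(record, field_names):
    """Return the record's values in schema order, with None for absent fields."""
    return tuple(record.get(name) for name in field_names)


def with_super_schema(session, df, field_names):
    """Rebuild `df` against a fixed schema listing every expected field.

    `session` is a SparkSession or SQLContext; both expose createDataFrame.
    """
    # Imported lazily so normalize() can be exercised without Spark installed.
    from pyspark.sql.types import StructType, StructField, StringType

    # Assumption: every field in the super-schema is a nullable string.
    schema = StructType(
        [StructField(name, StringType(), True) for name in field_names]
    )
    # Reorder each row to match the schema, padding missing fields with None.
    rows = df.rdd.map(lambda row: normalize(row.asDict(), field_names))
    return session.createDataFrame(rows, schema)
```

The hard-coded field_names list is exactly the part I would like to
avoid maintaining by hand.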

Thanks for any suggestions,
Alex.
