I am reading JSON data in which each record can have a different schema. That is, when a field would have a null value in a given record, it is simply absent from that record (and therefore from that record's schema).
I would like to use the DataFrame API to select specific fields from this data and, for fields that are missing from a record, to default to null or an empty string. Is this possible, or can DataFrames only handle a single consistent schema throughout the data?

One thing I noticed is that the schema of the DataFrame is the superset of the schemas of all the records in it. So if record A has field X but record B does not, X shows up in B as null, because it is part of the DataFrame's schema (since A has it). But if none of the records have field X, then referencing that field results in an error about not being able to resolve that column.

If I know the schema of all possible fields and the order in which they occur, it may be possible to get the RDD from the DataFrame and build my own DataFrame with createDataFrame, passing it my fabricated super-schema. However, this is brittle, as the super-schema is not in my control and may change in the future.

Thanks for any suggestions,
Alex.
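
P.S. To make the desired behaviour concrete, here is a minimal plain-Python sketch of the semantics I am after. It does not use Spark at all; it only models the idea. The sample records and the field names id, x and y are made up for illustration, and superset_schema and select_with_defaults are hypothetical helpers, not part of any Spark API.

```python
import json

# Hypothetical sample input: one JSON object per line, with null-valued
# fields simply left out of the record.
RAW_LINES = [
    '{"id": 1, "x": "a"}',
    '{"id": 2}',
]


def superset_schema(records):
    """Return every field name seen in any record, in first-seen order."""
    fields = []
    for rec in records:
        for name in rec:
            if name not in fields:
                fields.append(name)
    return fields


def select_with_defaults(records, wanted, default=None):
    """Project each record onto `wanted`, filling absent fields with `default`."""
    return [{name: rec.get(name, default) for name in wanted} for rec in records]


records = [json.loads(line) for line in RAW_LINES]

# "x" appears because the first record has it; "y" appears in no record.
schema = superset_schema(records)

# The behaviour I want: asking for "y" yields the default instead of an error.
rows = select_with_defaults(records, ["id", "x", "y"])
```

Here "x" comes back as None for the second record, which matches what the DataFrame already does, while "y" also comes back as None instead of failing, which is the part I have not found a way to express with the DataFrame API.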