Hi there. Spark has supported JSON document processing for a long time. In most examples I have seen, each line of the sample file is a JSON object. That is the easiest case. But how can we process a JSON document that does not conform to this standard format (one JSON object per line)? Here is the document I am working on.
First of all, the file is one single big JSON object spread over multiple lines. The real file can be 20+ GB. That single JSON object contains many name/value pairs. Each name is some kind of ID value. Each value is the actual JSON object that I would like to be part of the DataFrame. Is there any way to do that? I appreciate any input.

{
  "id1": {
    "Title":"title1",
    "Author":"Tom",
    "Source":{ "Date":"20160506", "Type":"URL" },
    "Data":" blah blah"},
  "id2": {
    "Title":"title2",
    "Author":"John",
    "Source":{ "Date":"20150923", "Type":"URL" },
    "Data":" blah blah "},
  "id3": {
    "Title":"title3",
    "Author":"John",
    "Source":{ "Date":"20150902", "Type":"URL" },
    "Data":" blah blah "}
}
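To make the target shape concrete, here is a minimal sketch in plain Python of the result I am after: each name/value pair of the big object becomes one single-line JSON record, which is the line-per-object format that `spark.read.json` handles in the usual examples. The function name `to_json_lines` and the `id` field name are my own placeholders, not anything from Spark. This sketch loads the whole object into memory, so it only illustrates the transformation on a small sample. It would not work as-is on the real 20+ GB file, which would presumably need some kind of incremental parser.

```python
import json


def to_json_lines(doc, id_field="id"):
    """Turn {"id1": {...}, "id2": {...}} into one JSON string per record.

    id_field is a hypothetical column name used to keep the original key.
    """
    lines = []
    for key, record in doc.items():
        row = {id_field: key}   # keep the top-level name as a field
        row.update(record)      # then copy the record's own fields
        lines.append(json.dumps(row))  # json.dumps emits a single line
    return lines


# Small sample taken from the document above (assumes it fits in memory).
sample = json.loads("""
{
  "id1": {"Title": "title1", "Author": "Tom",
          "Source": {"Date": "20160506", "Type": "URL"},
          "Data": " blah blah"},
  "id2": {"Title": "title2", "Author": "John",
          "Source": {"Date": "20150923", "Type": "URL"},
          "Data": " blah blah "}
}
""")

for line in to_json_lines(sample):
    print(line)
```

Each printed line is a complete JSON object with the original key stored under `id`, and the nested `Source` object is left intact.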