There is a good write-up on this here: http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files
If there are a lot of small files, this approach works reasonably well in a distributed manner, but I am worried about the case of a single large file. There, the work would run in a single executor, which I think would end up with an OutOfMemoryError. The Spark JSON data source does not support multi-line JSON as input, due to the limitations of TextInputFormat and LineRecordReader. You may have to extract the values yourself after reading the file with textFile.

2016-07-07 14:48 GMT+09:00 Lan Jiang <ljia...@gmail.com>:

> Hi, there
>
> Spark has provided JSON document processing for a long time. In most
> examples I see, each line is a JSON object in the sample file. That is
> the easiest case. But how can we process a JSON document that does not
> conform to this standard format (one line per JSON object)? Here is the
> document I am working on.
>
> First of all, it is multiple lines for one single big JSON object. The
> real file can be as long as 20+ GB. Within that one single JSON object,
> it contains many name/value pairs. The name is some kind of id value.
> The value is the actual JSON object that I would like to be part of a
> dataframe. Is there any way to do that? Appreciate any input.
>
>
> {
>   "id1": {
>     "Title": "title1",
>     "Author": "Tom",
>     "Source": {
>       "Date": "20160506",
>       "Type": "URL"
>     },
>     "Data": "blah blah"},
>
>   "id2": {
>     "Title": "title2",
>     "Author": "John",
>     "Source": {
>       "Date": "20150923",
>       "Type": "URL"
>     },
>     "Data": "blah blah"},
>
>   "id3": {
>     "Title": "title3",
>     "Author": "John",
>     "Source": {
>       "Date": "20150902",
>       "Type": "URL"
>     },
>     "Data": "blah blah"}
> }
>
>
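To illustrate the "extract the values yourself" idea, here is a minimal sketch in plain Python (no Spark) of what the flattening step looks like: parse the one big outer object, then turn each id/value pair into a record that carries its id, which could subsequently be parallelized into a dataframe. The `doc` sample and the `explode_records` helper are hypothetical names for illustration; a real 20+ GB file would need a streaming JSON parser rather than loading the whole text into memory.

```python
import json

# Sample shaped like the document in the question: one big JSON object
# whose keys are ids and whose values are the records we actually want.
doc = '''
{
  "id1": {"Title": "title1", "Author": "Tom",
          "Source": {"Date": "20160506", "Type": "URL"},
          "Data": "blah blah"},
  "id2": {"Title": "title2", "Author": "John",
          "Source": {"Date": "20150923", "Type": "URL"},
          "Data": "blah blah"}
}
'''

def explode_records(text):
    """Flatten {id: record, ...} into a list of records, each tagged with its id."""
    outer = json.loads(text)
    # json.loads preserves key order, so records come out in document order.
    return [dict(record, id=key) for key, record in outer.items()]

rows = explode_records(doc)
for row in rows:
    print(row["id"], row["Title"], row["Author"])
```

The same flattening could be done on an RDD of parsed objects once the file is read in (for example via wholeTextFiles for many small files), but for one single huge file the parse itself is the bottleneck, which is the concern raised above.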