There is a good link for this here,
http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files

If there are a lot of small files, this approach works reasonably well in a
distributed manner, but I am worried about the case of a single large file.

In that case, the work would run in a single executor, which I think will
end up with an OutOfMemoryError.

The Spark JSON data source does not support multi-line JSON as input, due to
the limitations of TextInputFormat and LineRecordReader.
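To see why the line-based reader is the problem: TextInputFormat hands Spark one line at a time, and when a single JSON object spans several lines, no individual line is valid JSON by itself. A small local sketch (plain Python, no Spark, with a made-up three-line document) illustrates this:

```python
import json

# A single JSON object pretty-printed across three lines,
# the way a multi-line JSON file would arrive line by line.
multiline = '{\n  "Title": "title1"\n}'

parsed = []
for line in multiline.splitlines():
    try:
        parsed.append(json.loads(line))
    except json.JSONDecodeError:
        # Each fragment ("{", '  "Title": "title1"', "}") fails on its own.
        pass

print(len(parsed))  # no line parses as standalone JSON
```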

You may have to extract the values yourself after reading the file with
textFile.
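The extraction itself is straightforward once you have the whole document as one string: parse the top-level object, then turn each name/value pair into a record, carrying the id along as a field. A minimal local sketch (plain Python, using a made-up sample that mirrors the structure in the question; in Spark you would get the string from wholeTextFiles or similar before this step):

```python
import json

# Hypothetical sample: one big JSON object whose keys are ids
# and whose values are the records we actually want as rows.
raw = '''
{
  "id1": {"Title": "title1", "Author": "Tom",
          "Source": {"Date": "20160506", "Type": "URL"},
          "Data": "blah blah"},
  "id2": {"Title": "title2", "Author": "John",
          "Source": {"Date": "20150923", "Type": "URL"},
          "Data": "blah blah"}
}
'''

doc = json.loads(raw)

# Flatten the name/value pairs: one dict per id, with the id as a column.
rows = [{"id": key, **value} for key, value in doc.items()]

print(rows[0]["id"], rows[0]["Title"])  # id1 title1
```

A list of dicts like this can then be handed to createDataFrame to get the dataframe the original question asks for.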


2016-07-07 14:48 GMT+09:00 Lan Jiang <ljia...@gmail.com>:

> Hi, there
>
> Spark has provided json document processing feature for a long time. In
> most examples I see, each line is a json object in the sample file. That is
> the easiest case. But how can we process a json document, which does not
> conform to this standard format (one line per json object)? Here is the
> document I am working on.
>
> First of all, it is multiple lines for one single big json object. The
> real file can be as long as 20+ G. Within that one single json object, it
> contains many name/value pairs. The name is some kind of id values. The
> value is the actual json object that I would like to be part of dataframe.
> Is there any way to do that? Appreciate any input.
>
>
> {
>   "id1": {
>     "Title": "title1",
>     "Author": "Tom",
>     "Source": {
>       "Date": "20160506",
>       "Type": "URL"
>     },
>     "Data": " blah blah"
>   },
>
>   "id2": {
>     "Title": "title2",
>     "Author": "John",
>     "Source": {
>       "Date": "20150923",
>       "Type": "URL"
>     },
>     "Data": " blah blah "
>   },
>
>   "id3": {
>     "Title": "title3",
>     "Author": "John",
>     "Source": {
>       "Date": "20150902",
>       "Type": "URL"
>     },
>     "Data": " blah blah "
>   }
> }
>
>
