Hi there,

Thank you all for your input. @Hyukjin, as a matter of fact, I had read
the blog link you posted before asking the question on the forum. As you
pointed out, the link uses wholeTextFiles(), which is bad in my case,
because my JSON file can be as large as 20G+ and an OOM might occur. I am
also not sure how to extract the values with a textFile() call, since it
creates an RDD of strings and treats each line independently, without
ordering, which destroys the JSON context.
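
For reference, this is roughly what the wholeTextFiles() approach from the
blog looks like (just a sketch, assuming the sc provided by spark-shell;
the path is a placeholder). It makes it clear why the whole file has to
fit on a single executor:

import org.apache.spark.sql.SQLContext

// sc is the SparkContext provided by spark-shell
val sqlContext = new SQLContext(sc)

// wholeTextFiles() yields one (path, content) record per file, so a 20G+
// file must be held in memory by a single task -- the OOM risk above.
val raw = sc.wholeTextFiles("/path/to/multiline.json").values

// Each file's full text is parsed as one JSON value.
val df = sqlContext.read.json(raw)
df.printSchema()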

Large multi-line JSON files with a parent node are very common in the real
world. Take the common employees JSON example below: assuming we had
millions of employees and it were a very large JSON document, how could
Spark handle it? This should be a common pattern, shouldn't it? In the real
world, JSON documents do not always come as cleanly formatted as the Spark
examples require. (A rough sketch of what flattening this would look like
follows the example.)

{
  "employees": [
    {
      "firstName": "John",
      "lastName": "Doe"
    },
    {
      "firstName": "Anna",
      "lastName": "Smith"
    },
    {
      "firstName": "Peter",
      "lastName": "Jones"
    }
  ]
}
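
To make the question concrete, here is a rough sketch of how the employees
document could be flattened if the whole file fit in memory (the path and
column names are placeholders, not my real data):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.explode

// sc is the SparkContext provided by spark-shell
val sqlContext = new SQLContext(sc)

// One record for the whole file -- exactly the part that cannot scale.
val doc = sc.wholeTextFiles("/path/to/employees.json").values

// A single row with an "employees" array column.
val df = sqlContext.read.json(doc)

// Explode the array into one row per employee.
val employees = df
  .select(explode(df("employees")).as("employee"))
  .select("employee.firstName", "employee.lastName")

employees.show()

The explode step is fine; the problem is that wholeTextFiles() cannot split
the file, so the read itself falls over on a 20G+ document.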



On Thu, Jul 7, 2016 at 1:47 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> The link uses the wholeTextFiles() API, which treats each file as a single record.
>
>
> 2016-07-07 15:42 GMT+09:00 Jörn Franke <jornfra...@gmail.com>:
>
>> This does not necessarily need to be the case. If you look at the Hadoop
>> FileInputFormat architecture, you can even split large multi-line JSONs
>> without issues. I would need to have a look at it, but one large file does
>> not mean one executor, independent of the underlying format.
>>
>> On 07 Jul 2016, at 08:12, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>> There is a good link for this here,
>> http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files
>>
>> If there are a lot of small files, then it would work pretty well in a
>> distributed manner, but I am worried about the case of a single large file.
>>
>> In that case, this would only work on a single executor, which I think will
>> end up with an OutOfMemoryException.
>>
>> The Spark JSON data source does not support multi-line JSON as input due to
>> the limitations of TextInputFormat and LineRecordReader.
>>
>> You may have to just extract the values after reading the file with textFile().
>>
>>
>> 2016-07-07 14:48 GMT+09:00 Lan Jiang <ljia...@gmail.com>:
>>
>>> Hi there,
>>>
>>> Spark has provided a JSON document processing feature for a long time. In
>>> most examples I see, each line in the sample file is a JSON object. That is
>>> the easiest case. But how can we process a JSON document that does not
>>> conform to this standard format (one line per JSON object)? Here is the
>>> document I am working on.
>>>
>>> First of all, it is one single big JSON object spanning multiple lines. The
>>> real file can be as large as 20+ G. That single JSON object contains many
>>> name/value pairs. Each name is some kind of id value, and each value is the
>>> actual JSON object that I would like to become part of the dataframe.
>>> Is there any way to do that? I appreciate any input.
>>>
>>>
>>> {
>>> "id1": {
>>> "Title":"title1",
>>> "Author":"Tom",
>>> "Source":{
>>> "Date":"20160506",
>>> "Type":"URL"
>>> },
>>> "Data":" blah blah"},
>>>
>>> "id2": {
>>> "Title":"title2",
>>> "Author":"John",
>>> "Source":{
>>> "Date":"20150923",
>>> "Type":"URL"
>>> },
>>> "Data":" blah blah "},
>>>
>>> "id3: {
>>> "Title":"title3",
>>> "Author":"John",
>>> "Source":{
>>> "Date":"20150902",
>>> "Type":"URL"
>>> },
>>> "Data":" blah blah "}
>>> }
>>>
>>>
>>
>
