The problem is for the Hadoop input format to identify the record delimiter. If the
whole JSON record is on one line, then the natural record delimiter is the
newline character.
Keep in mind that in a distributed file system, the file split position is most
likely NOT on a record delimiter. The input format implementation has to scan
backward or forward in the byte stream looking for the next record delimiter,
which may live in a block on another node.
Without a reliable record delimiter, you just have to parse the whole file,
since the file boundary is the only record delimiter you can trust.
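To make that concrete, here is a toy sketch of the boundary rule, assuming a
newline delimiter (the helper is mine for illustration; Hadoop's
LineRecordReader applies the same rule to a stream rather than an in-memory
array):

// The first split owns the record at offset 0. Every other split discards
// the partial record at its head, because the previous split's reader keeps
// reading past its own end until it finishes that record.
def firstRecordStart(bytes: Array[Byte], splitStart: Int): Int = {
  if (splitStart == 0) return 0
  var i = splitStart
  while (i < bytes.length && bytes(i) != '\n'.toByte) i += 1
  i + 1  // just past the delimiter; past the end means no record starts here
}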
JSON is never a good format to store on a big data platform. If your source
JSON looks like this, you have to preprocess it, or write your own input
format implementation that handles the record delimiter for your particular
JSON structure. But good luck with that: there is no perfect generic solution
for every kind of JSON data you may want to handle.
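If you do go the preprocessing route, a streaming parser keeps memory flat no
matter how large the file is. A rough sketch with Jackson against the id-keyed
document quoted at the bottom of this thread (file names are placeholders):

import java.io.{File, PrintWriter}
import com.fasterxml.jackson.core.JsonToken
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import com.fasterxml.jackson.databind.node.ObjectNode

// Stream the huge id-keyed document and write one JSON object per line,
// so a line-oriented input format gets a reliable record delimiter.
val mapper = new ObjectMapper()
val parser = mapper.getFactory.createParser(new File("big.json"))
val out    = new PrintWriter("records.jsonl")

parser.nextToken()                                   // consume the top-level START_OBJECT
while (parser.nextToken() == JsonToken.FIELD_NAME) { // one field per "id" entry
  val id = parser.getCurrentName
  parser.nextToken()                                 // advance onto the entry's value
  val record = mapper.readTree[JsonNode](parser).asInstanceOf[ObjectNode]
  record.put("id", id)                               // fold the key into the record
  out.println(record.toString)                       // exactly one record per line
}
out.close()
parser.close()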
Yong

From: ljia...@gmail.com
Date: Thu, 7 Jul 2016 11:57:26 -0500
Subject: Re: Processing json document
To: gurwls...@gmail.com
CC: jornfra...@gmail.com; user@spark.apache.org

Hi, there,
Thank you all for your input. @Hyukjin, as a matter of fact, I had read the
blog link you posted before asking the question on the forum. As you pointed
out, the link uses wholeTextFiles(), which is bad in my case because my JSON
file can be as large as 20 GB+ and an OOM might occur. I am not sure how to
extract the values using a textFile() call, as it creates an RDD of strings
and treats each line independently, with no ordering. That destroys the JSON
context.
Large multi-line JSON files with a parent node are very common in the real
world. Take the common employees JSON example below: assuming we have millions
of employees and it is a super large JSON document, how can Spark handle it?
This should be a common pattern, shouldn't it? In the real world, JSON
documents do not always come as cleanly formatted as the Spark examples
require.
{"employees":[    {      "firstName":"John",       "lastName":"Doe"    },    {  
    "firstName":"Anna",        "lastName":"Smith"    },    {       
"firstName":"Peter",         "lastName":"Jones"}]}


On Thu, Jul 7, 2016 at 1:47 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
The link uses the wholeTextFiles() API, which treats each file as a single record.

2016-07-07 15:42 GMT+09:00 Jörn Franke <jornfra...@gmail.com>:
This does not necessarily need to be the case. If you look at the Hadoop
FileInputFormat architecture, you can even split large multi-line JSONs
without issues. I would need to have a look at it, but one large file does not
have to mean one executor, independent of the underlying format.
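One concrete form of this, as a sketch: Hadoop's TextInputFormat honors a
configurable record delimiter, so if the JSON contains a boundary string that
never occurs inside a record, splits stay safe. The delimiter below is an
assumption about the data, not a general recipe:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Each split scans forward to the next delimiter occurrence, which is
// exactly the seek-past-the-split-boundary behavior discussed above.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "},\n")  // assumed record boundary
val records = sc.newAPIHadoopFile("big.json",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map(_._2.toString)  // note: the delimiter itself is stripped from each record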
On 07 Jul 2016, at 08:12, Hyukjin Kwon <gurwls...@gmail.com> wrote:

There is a good link for this here:
http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files
If there are a lot of small files, then it would work pretty okay in a
distributed manner, but I am worried about the case of a single large file. In
that case, this would only work in a single executor, which I think will end
up with an OutOfMemoryException.
Spark's JSON data source does not support multi-line JSON as input due to the
limitation of TextInputFormat and LineRecordReader. You may have to just
extract the values after reading it with textFile().

2016-07-07 14:48 GMT+09:00 Lan Jiang <ljia...@gmail.com>:
Hi, there
Spark has provided a JSON document processing feature for a long time. In most
examples I see, each line of the sample file is a JSON object; that is the
easiest case. But how can we process a JSON document that does not conform to
this standard format (one JSON object per line)? Here is the document I am
working on.
First of all, it is one single big JSON object spanning multiple lines. The
real file can be as large as 20+ GB. Within that one single JSON object, there
are many name/value pairs. The name is some kind of id value; the value is the
actual JSON object that I would like to become part of a dataframe. Is there
any way to do that? Appreciate any input.

{    "id1": {    "Title":"title1",    "Author":"Tom",    "Source":{        
"Date":"20160506",        "Type":"URL"    },    "Data":" blah blah"},
    "id2": {    "Title":"title2",    "Author":"John",    "Source":{        
"Date":"20150923",        "Type":"URL"    },    "Data":"  blah blah "},
    "id3: {    "Title":"title3",    "Author":"John",    "Source":{        
"Date":"20150902",        "Type":"URL"    },    "Data":" blah blah "}}