Yeah, I totally agree with Yong. Anyway, this might not be a great idea, but you might want to take a look at this: http://pivotal-field-engineering.github.io/pmr-common/pmr/apidocs/com/gopivotal/mapreduce/lib/input/JsonInputFormat.html
This does not recognise nested structures, but I assume you might be able to work around that by, for example, removing the first "{" and last "}" in your large file and then loading it, so that each id object in your data can be recognised as a row, or by modifying the JsonInputFormat. After that, you might be able to load it through the SparkContext.hadoopFile or SparkContext.newAPIHadoopFile API as an RDD in which each element is one JSON doc. Then there is the SQLContext.read.json API, which takes an RDD in which each element is a JSON document string. I know this is rough and not the best idea, but it is the only way I can currently think of.

The problem is for the Hadoop input format to identify the record delimiter. If the whole JSON record is on one line, then the natural record delimiter is the newline character. Keep in mind that in a distributed file system, the file split position most likely is NOT on a record delimiter. The input format implementation has to go backward or forward in the byte array, looking for the next record delimiter, which may sit on another node. Without a reliable record delimiter, you just have to parse the whole file, as the file boundary is the only reliable record delimiter you have. JSON is never a good format to store on a big data platform. If your source JSON looks like this, then you have to preprocess it, or write your own implementation that handles the record delimiter for your particular JSON data. But good luck with that: there is no perfect generic solution for every kind of JSON data you might want to handle.

Yong

------------------------------
From: ljia...@gmail.com
Date: Thu, 7 Jul 2016 11:57:26 -0500
Subject: Re: Processing json document
To: gurwls...@gmail.com
CC: jornfra...@gmail.com; user@spark.apache.org

Hi there,

Thank you all for your input.

@Hyukjin, as a matter of fact, I had read the blog link you posted before asking the question on the forum.
As you pointed out, the link uses wholeTextFiles(), which is bad in my case because my JSON file can be as large as 20G+ and OOM might occur. I am not sure how to extract the values using the textFile call, as it creates an RDD of strings and treats each line without ordering, which destroys the JSON context. Large multi-line JSON files with a parent node are very common in the real world. Take the common employees JSON example below: assuming we have millions of employees and it is a super large JSON document, how can Spark handle this? This should be a common pattern, shouldn't it? In the real world, JSON documents do not always come as cleanly formatted as the Spark examples require.

{ "employees": [
    { "firstName":"John", "lastName":"Doe" },
    { "firstName":"Anna", "lastName":"Smith" },
    { "firstName":"Peter", "lastName":"Jones" }
] }

On Thu, Jul 7, 2016 at 1:47 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

The link uses the wholeTextFiles() API, which treats each file as a single record.

2016-07-07 15:42 GMT+09:00 Jörn Franke <jornfra...@gmail.com>:

This does not necessarily need to be the case: if you look at the Hadoop FileInputFormat architecture, you can even split large multi-line JSONs without issues. I would need to have a look at it, but one large file does not mean one executor, independent of the underlying format.

On 07 Jul 2016, at 08:12, Hyukjin Kwon <gurwls...@gmail.com> wrote:

There is a good link for this here: http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files

If there are a lot of small files, then it would work pretty well in a distributed manner, but I am worried if it is a single large file. In that case, this would only work in a single executor, which I think will end up with an OutOfMemoryException. The Spark JSON data source does not support multi-line JSON as input due to the limitation of TextInputFormat and LineRecordReader. You may have to just extract the values after reading it with textFile.
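The record-delimiter point raised in this thread can be illustrated with a small sketch. This is plain Python, not Hadoop's actual LineRecordReader code, but it mimics the same idea: a reader handed an arbitrary byte-offset split discards the (possibly partial) first line and reads one record past its own end, so that every newline-delimited record is emitted exactly once even though splits land mid-record. This is precisely why the approach only works when each JSON doc sits on its own line.

```python
def read_split(data: bytes, start: int, end: int):
    """Yield newline-delimited records from the split [start, end).

    Mimics the convention of Hadoop's LineRecordReader: a reader whose
    split does not begin at offset 0 discards everything up to the first
    newline (the previous split's reader finishes that record by reading
    past its own end offset).
    """
    pos = start
    if start != 0:
        # We may have landed mid-record: skip ahead past the next delimiter.
        nl = data.find(b"\n", start)
        if nl == -1:
            return
        pos = nl + 1
    # Keep reading while the record *starts* at or before the split end;
    # this lets the last record spill over the boundary.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            yield data[pos:]       # final record without trailing newline
            return
        yield data[pos:nl]
        pos = nl + 1

if __name__ == "__main__":
    data = b'{"id":1}\n{"id":2}\n{"id":3}\n'
    # Split the file at byte 12, which falls inside the second record...
    first = list(read_split(data, 0, 12))
    second = list(read_split(data, 12, len(data)))
    # ...yet every record is emitted exactly once, with no duplicates.
    print(first)   # [b'{"id":1}', b'{"id":2}']
    print(second)  # [b'{"id":3}']
```

Without such a delimiter (i.e. one big multi-line JSON object), there is no safe place for a split to resynchronise, which is why the whole file ends up parsed by one reader.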
2016-07-07 14:48 GMT+09:00 Lan Jiang <ljia...@gmail.com>:

Hi there,

Spark has provided a JSON document processing feature for a long time. In most examples I see, each line is a JSON object in the sample file. That is the easiest case. But how can we process a JSON document that does not conform to this standard format (one line per JSON object)?

Here is the document I am working on. First of all, it is multiple lines for one single big JSON object. The real file can be as large as 20G+. Within that one single JSON object, it contains many name/value pairs. The name is some kind of id value. The value is the actual JSON object that I would like to be part of a dataframe. Is there any way to do that? Appreciate any input.

{
  "id1": { "Title":"title1", "Author":"Tom", "Source":{ "Date":"20160506", "Type":"URL" }, "Data":" blah blah" },
  "id2": { "Title":"title2", "Author":"John", "Source":{ "Date":"20150923", "Type":"URL" }, "Data":" blah blah " },
  "id3": { "Title":"title3", "Author":"John", "Source":{ "Date":"20150902", "Type":"URL" }, "Data":" blah blah " }
}
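To make a document shaped like the one above digestible by Spark's line-oriented JSON reader, the preprocessing idea discussed in the thread amounts to flattening the single top-level object into JSON Lines: one top-level value per line, with its key folded in as a field. A minimal Python sketch of that target shape (stdlib only; note that json.loads parses the entire document in memory, so a 20G+ file would need a streaming parser instead, and the "id" field name here is my own choice for illustration):

```python
import json

def to_json_lines(doc: str) -> str:
    """Flatten a single top-level {"id1": {...}, "id2": {...}} object
    into JSON Lines: one self-contained JSON object per line, with the
    top-level key preserved as an "id" field, so a line-oriented JSON
    reader can split and parse each record independently."""
    top = json.loads(doc)
    lines = []
    for key, value in top.items():
        record = dict(value)
        record["id"] = key  # keep the original top-level key as a column
        lines.append(json.dumps(record))
    return "\n".join(lines)

if __name__ == "__main__":
    doc = '''{
      "id1": { "Title":"title1", "Author":"Tom" },
      "id2": { "Title":"title2", "Author":"John" }
    }'''
    print(to_json_lines(doc))
```

Once the output is written back to HDFS in this one-doc-per-line form, it can be loaded distributedly with a plain textFile/read.json pipeline, since the newline now serves as the record delimiter the input format needs.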