Yeah, I totally agree with Yong. Anyway, this might not be a great idea, but you might want to take a look at this: http://pivotal-field-engineering.github.io/pmr-common/pmr/apidocs/com/gopivotal/mapreduce/lib/input/JsonInputFormat.html
This does not recognise nested structures, but I assume you might be able to work around that by, for example, removing the first "{" and last "}" in your large file and then loading it, so that each id object in your data can be recognised as a row, or by modifying the JsonInputFormat. After that, you might be able to load it through the SparkContext.hadoopFile or SparkContext.newAPIHadoopFile API as an RDD in which each element is one JSON doc. Then there is the SQLContext.read.json API, which takes an RDD in which each element is a JSON document string. I know this is rough and not the best idea, but it is the only way I can currently think of.

The problem is for the Hadoop input format to identify the record delimiter. If the whole JSON record is on one line, then the natural record delimiter is the newline character. Keep in mind that in a distributed file system, the file split position most likely is NOT on a record delimiter. The input format implementation has to go backward or forward in the byte array, looking for the next record delimiter, which may sit on another node. Without a reliable record delimiter, you just have to parse the whole file, as the file boundary is the only reliable record delimiter you have. JSON is never a good format to store on a big data platform. If your source JSON looks like this, then you have to preprocess it, or write your own implementation that handles the record delimiter for your particular JSON data. But good luck with that: there is no perfect generic solution for every kind of JSON data you might want to handle.

Yong

------------------------------
From: ljia...@gmail.com
Date: Thu, 7 Jul 2016 11:57:26 -0500
Subject: Re: Processing json document
To: gurwls...@gmail.com
CC: jornfra...@gmail.com; user@spark.apache.org

Hi there,

Thank you all for your input.

@Hyukjin, as a matter of fact, I had read the blog link you posted before asking the question on the forum.
As you pointed out, the link uses wholeTextFiles(), which is bad in my case because my JSON file can be as large as 20G+ and OOM might occur. I am not sure how to extract the values using the textFile call, as it creates an RDD of strings and treats each line without ordering, which destroys the JSON context. Large multi-line JSON files with a parent node are very common in the real world. Take the common employees JSON example below: assuming we have millions of employees and it is a super large JSON document, how can Spark handle this? This should be a common pattern, shouldn't it? In the real world, JSON documents do not always come as cleanly formatted as the Spark examples require.

{ "employees": [
    { "firstName":"John", "lastName":"Doe" },
    { "firstName":"Anna", "lastName":"Smith" },
    { "firstName":"Peter", "lastName":"Jones" }
] }

On Thu, Jul 7, 2016 at 1:47 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

The link uses the wholeTextFiles() API, which treats each file as a single record.

2016-07-07 15:42 GMT+09:00 Jörn Franke <jornfra...@gmail.com>:

This does not necessarily need to be the case: if you look at the Hadoop FileInputFormat architecture, you can even split large multi-line JSONs without issues. I would need to have a look at it, but one large file does not mean one executor, independent of the underlying format.

On 07 Jul 2016, at 08:12, Hyukjin Kwon <gurwls...@gmail.com> wrote:

There is a good link for this here: http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files

If there are a lot of small files, then it would work pretty well in a distributed manner, but I am worried if it is a single large file. In that case, this would only work in a single executor, which I think will end up with an OutOfMemoryException. The Spark JSON data source does not support multi-line JSON as input due to the limitation of TextInputFormat and LineRecordReader. You may have to just extract the values after reading it with textFile.
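The record-delimiter point raised in this thread can be illustrated with a small sketch. This is plain Python, not Hadoop's actual LineRecordReader code, but it mimics the same idea: a reader handed an arbitrary byte-offset split discards the (possibly partial) first line and reads one record past its own end, so that every newline-delimited record is emitted exactly once even though splits land mid-record. This is precisely why the approach only works when each JSON doc sits on its own line.

```python
def read_split(data: bytes, start: int, end: int):
    """Yield newline-delimited records from the split [start, end).

    Mimics the convention of Hadoop's LineRecordReader: a reader whose
    split does not begin at offset 0 discards everything up to the first
    newline (the previous split's reader finishes that record by reading
    past its own end offset).
    """
    pos = start
    if start != 0:
        # We may have landed mid-record: skip ahead past the next delimiter.
        nl = data.find(b"\n", start)
        if nl == -1:
            return
        pos = nl + 1
    # Keep reading while the record *starts* at or before the split end;
    # this lets the last record spill over the boundary.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            yield data[pos:]       # final record without trailing newline
            return
        yield data[pos:nl]
        pos = nl + 1

if __name__ == "__main__":
    data = b'{"id":1}\n{"id":2}\n{"id":3}\n'
    # Split the file at byte 12, which falls inside the second record...
    first = list(read_split(data, 0, 12))
    second = list(read_split(data, 12, len(data)))
    # ...yet every record is emitted exactly once, with no duplicates.
    print(first)   # [b'{"id":1}', b'{"id":2}']
    print(second)  # [b'{"id":3}']
```

Without such a delimiter (i.e. one big multi-line JSON object), there is no safe place for a split to resynchronise, which is why the whole file ends up parsed by one reader.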
2016-07-07 14:48 GMT+09:00 Lan Jiang <ljia...@gmail.com>:

Hi there,

Spark has provided a JSON document processing feature for a long time. In most examples I see, each line is a JSON object in the sample file. That is the easiest case. But how can we process a JSON document that does not conform to this standard format (one line per JSON object)?

Here is the document I am working on. First of all, it is multiple lines for one single big JSON object. The real file can be as large as 20G+. Within that one single JSON object, it contains many name/value pairs. The name is some kind of id value. The value is the actual JSON object that I would like to be part of a dataframe. Is there any way to do that? Appreciate any input.

{
  "id1": { "Title":"title1", "Author":"Tom", "Source":{ "Date":"20160506", "Type":"URL" }, "Data":" blah blah" },
  "id2": { "Title":"title2", "Author":"John", "Source":{ "Date":"20150923", "Type":"URL" }, "Data":" blah blah " },
  "id3": { "Title":"title3", "Author":"John", "Source":{ "Date":"20150902", "Type":"URL" }, "Data":" blah blah " }
}
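To make a document shaped like the one above digestible by Spark's line-oriented JSON reader, the preprocessing idea discussed in the thread amounts to flattening the single top-level object into JSON Lines: one top-level value per line, with its key folded in as a field. A minimal Python sketch of that target shape (stdlib only; note that json.loads parses the entire document in memory, so a 20G+ file would need a streaming parser instead, and the "id" field name here is my own choice for illustration):

```python
import json

def to_json_lines(doc: str) -> str:
    """Flatten a single top-level {"id1": {...}, "id2": {...}} object
    into JSON Lines: one self-contained JSON object per line, with the
    top-level key preserved as an "id" field, so a line-oriented JSON
    reader can split and parse each record independently."""
    top = json.loads(doc)
    lines = []
    for key, value in top.items():
        record = dict(value)
        record["id"] = key  # keep the original top-level key as a column
        lines.append(json.dumps(record))
    return "\n".join(lines)

if __name__ == "__main__":
    doc = '''{
      "id1": { "Title":"title1", "Author":"Tom" },
      "id2": { "Title":"title2", "Author":"John" }
    }'''
    print(to_json_lines(doc))
```

Once the output is written back to HDFS in this one-doc-per-line form, it can be loaded distributedly with a plain textFile/read.json pipeline, since the newline now serves as the record delimiter the input format needs.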