Re: Why the json file used by sparkSession.read.json must be a valid json object per line

Wangjianfei Wed, 19 Oct 2016 17:45:16 -0700

Yeah, the design is mainly because of HDFS.
 
------------------ Original Message ------------------
From: "Jakob Odersky"<ja...@odersky.com>;
Sent: Thursday, October 20, 2016, 4:46 AM
To: "Hyukjin Kwon"<gurwls...@gmail.com>;
Cc: "Daniel Barclay"<danielbarclay....@gmail.com>; "Koert Kuipers"<ko...@tresata.com>;
"user @spark"<user@spark.apache.org>; "Wangjianfei"<1004910...@qq.com>;
Subject: Re: Why the json file used by sparkSession.read.json must be a valid json
object per line



Another reason I could imagine is that files are often read from HDFS,
which by default uses line terminators to separate records.

It is possible to implement your own HDFS delimiter finder; however,
for arbitrary JSON data, finding that delimiter would require stateful
parsing of the file and would be difficult to parallelize across a
cluster.
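
To make the point about line terminators concrete, here is a minimal
Python sketch (not Spark's or Hadoop's actual code) of how a reader that
is handed an arbitrary byte range can still find record boundaries when
every record sits on its own line. The function name read_split and the
sample data are hypothetical, and the ownership rule used here (skip the
partial first line, then read every line that begins at or before the end
offset) is an assumption chosen for illustration.

```python
import json


def read_split(data: bytes, start: int, end: int) -> list:
    """Parse the JSON records owned by the byte range starting at `start`.

    Hypothetical helper: `data` stands in for a file, and (start, end) for
    a byte range handed to one worker without regard to record boundaries.
    """
    pos = start
    if start != 0:
        # We may have landed mid-record. Skip ahead to the next newline;
        # the worker for the previous range reads the line we skipped.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1

    records = []
    # Any line that begins at or before `end` belongs to this range,
    # even if it finishes after `end`.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        line = data[pos:line_end]
        if line.strip():
            records.append(json.loads(line))
        pos = line_end + 1
    return records


if __name__ == "__main__":
    sample = b'{"a": 1}\n{"a": 2}\n{"a": 3}\n'
    # Cut the bytes at offsets that fall in the middle of records.
    for lo, hi in [(0, 5), (5, 20), (20, len(sample))]:
        print((lo, hi), read_split(sample, lo, hi))
```

Each worker only needs a local scan for the next newline, so no state has
to be shared between byte ranges. With a pretty-printed JSON document that
spans several lines, the same scan would land inside an object, which is
the stateful-parsing problem described above.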

On Tue, Oct 18, 2016 at 4:40 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> Regarding his recent PR[1], I guess he meant multi-line JSON.
>
> As far as I know, single-line JSON also complies with the standard. I left a
> comment with the RFC in the PR, but please let me know if I am wrong at any
> point.
>
> Thanks!
>
> [1]https://github.com/apache/spark/pull/15511
>
>
> On 19 Oct 2016 7:00 a.m., "Daniel Barclay" <danielbarclay....@gmail.com>
> wrote:
>>
>> Koert,
>>
>> Koert Kuipers wrote:
>>
>> A single json object would mean for most parsers it needs to fit in memory
>> when reading or writing
>>
>> Note that codlife didn't seem to be asking about single-object JSON
>> files, but about standard-format JSON files.
>>
>>
>> On Oct 15, 2016 11:09, "codlife" <1004910...@qq.com> wrote:
>>>
>>> Hi:
>>>    I have a question about the design of spark.read.json: why is the JSON
>>> file not a standard JSON file? Who can tell me the internal reason? Any
>>> advice is appreciated.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-the-json-file-used-by-sparkSession-read-json-must-be-a-valid-json-object-per-line-tp27907.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>
