> On 19 Oct 2016, at 21:46, Jakob Odersky <ja...@odersky.com> wrote:
> 
> Another reason I could imagine is that files are often read from HDFS,
> which by default uses line terminators to separate records.
> 
> It is possible to implement your own hdfs delimiter finder, however
> for arbitrary json data, finding that delimiter would require stateful
> parsing of the file and would be difficult to parallelize across a
> cluster.
> 


Good point.

If you are creating your own files from a list of JSON records, then you could do 
your own encoding, say with a header for each record ('J'+'S'+'O'+'N' + an int64 
length) and split on that: you don't need to scan a record to know its length, 
and you can count the records in a large file simply through a sequence of 
skip + read(byte[8]) operations.
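A minimal local sketch of that framing in Scala (the object name, the exact 'JSON' magic bytes, and the file paths are illustrative assumptions, not an existing Spark or HDFS API):

    import java.io.{DataInputStream, DataOutputStream, FileInputStream, FileOutputStream}
    import java.nio.charset.StandardCharsets

    object LengthPrefixedJson {
      // Hypothetical 4-byte magic marking the start of each record.
      private val Magic: Array[Byte] = "JSON".getBytes(StandardCharsets.US_ASCII)

      // Write each JSON record prefixed with the magic and an int64 payload length.
      def writeRecords(path: String, records: Seq[String]): Unit = {
        val out = new DataOutputStream(new FileOutputStream(path))
        try {
          records.foreach { json =>
            val bytes = json.getBytes(StandardCharsets.UTF_8)
            out.write(Magic)             // 'J' 'S' 'O' 'N'
            out.writeLong(bytes.length)  // int64 length, big-endian
            out.write(bytes)             // the record itself
          }
        } finally out.close()
      }

      // Count records by reading each header and skipping the payload;
      // no JSON parsing is needed.
      def countRecords(path: String): Long = {
        val in = new DataInputStream(new FileInputStream(path))
        try {
          var count = 0L
          var first = in.read()          // -1 signals end of file
          while (first != -1) {
            val rest = new Array[Byte](3)
            in.readFully(rest)
            require(first == Magic(0) && rest.sameElements(Magic.drop(1)),
              "corrupt or misaligned record header")
            val len = in.readLong()
            var toSkip = len
            while (toSkip > 0) {         // skip the payload without reading it
              val skipped = in.skip(toSkip)
              require(skipped > 0, "truncated record")
              toSkip -= skipped
            }
            count += 1
            first = in.read()
          }
          count
        } finally in.close()
      }
    }

With this layout a splitter only has to look for the magic + length header to find record boundaries, so splits can be computed without parsing any JSON; the sketch above shows the local read/write side, not an HDFS InputFormat.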
