Hi,

I understand.

With XML, if you know the tag you want to group by, you can use a multi-line 
input format and just advance into the split until you find that tag.
That is much more difficult with JSON.
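
As a rough illustration of that XML approach (not from this thread), here is a 
minimal PySpark sketch that overrides Hadoop's record delimiter so each split 
record runs up to a closing tag. The file name data.xml and the <record> tag 
are made-up placeholders:

# Hypothetical sketch: treat everything up to each closing </record> tag as
# one record by overriding Hadoop's text record delimiter.
conf = {"textinputformat.record.delimiter": "</record>"}
xmlRDD = sc.newAPIHadoopFile(
    "data.xml",                                               # assumed path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf,
).map(lambda kv: kv[1])  # drop the byte offset, keep the text chunk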




On Nov 3, 2016, at 2:41 AM, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:

I agree this can be a little annoying. The reason it is done this way is to 
enable cases where the JSON file is huge. To allow splitting it, a separator is 
needed, and newline is the separator used (as is done for all text files in 
Hadoop and Spark).
I always wondered why support has not been implemented for cases where each 
file is small (e.g. has one object), but the implementation currently assumes 
each line holds a legal JSON object.
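
For reference, a tiny sketch of the one-object-per-line (JSON Lines) layout 
that spark.read.json expects by default; the file name records.json is made up:

# Each line of the input file must be a complete JSON object, e.g.:
#   {"id": 1, "name": "a"}
#   {"id": 2, "name": "b"}
df = spark.read.json("records.json")  # hypothetical path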

What I do to overcome this is to use RDDs (using PySpark):

# Get an RDD of the text content. The map is needed because wholeTextFiles
# returns a tuple of (filename, file content).
jsonRDD = sc.wholeTextFiles(filename).map(lambda x: x[1])

# Remove whitespace. This can actually be too aggressive, as it also removes
# whitespace inside string values, so you may want to strip only the
# line-ending characters (e.g. \r, \n).
import re
js = jsonRDD.map(lambda x: re.sub(r"\s+", "", x, flags=re.UNICODE))

# Convert the RDD to a DataFrame. If you have your own schema, this is where
# you should add it.
df = spark.read.json(js)
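
If you do want to pass an explicit schema at that last step, a minimal sketch 
might look like this (the field names here are made up):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema; replace the fields with your own.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("count", IntegerType(), True),
])
df = spark.read.json(js, schema=schema)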

Assaf.

From: Michael Segel [mailto:msegel_had...@hotmail.com]
Sent: Wednesday, November 02, 2016 9:39 PM
To: Daniel Siegmann
Cc: user @spark
Subject: Re: Quirk in how Spark DF handles JSON input records?


On Nov 2, 2016, at 2:22 PM, Daniel Siegmann <dsiegm...@securityscorecard.io> wrote:

Yes, it needs to be on a single line. Spark (or Hadoop really) treats newlines 
as a record separator by default. While it is possible to use a different 
string as a record separator, what would you use in the case of JSON?
If you do some Googling, I suspect you'll find some possible solutions. 
Personally, I would just use a separate JSON library (e.g. json4s) to parse 
this metadata into an object, rather than trying to read it in through Spark.
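
A rough equivalent of that suggestion using Python's built-in json module 
(rather than json4s, since the code earlier in the thread is PySpark); the file 
name and key below are made up:

import json

# Parse the metadata on the driver as a plain Python object, not a DataFrame.
with open("metadata.json") as f:          # hypothetical file name
    metadata = json.load(f)

column_descriptors = metadata.get("columns", [])  # hypothetical key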


Yeah, that’s the basic idea.

This JSON is metadata to help drive the process, not row records… although the 
column descriptors are row records, so in the short term I could cheat and just 
store those in a file.

:-(


--
Daniel Siegmann
Senior Software Engineer
SecurityScorecard Inc.
214 W 29th Street, 5th Floor
New York, NY 10001
