Re: Processing json document

2016-07-08 Thread Jörn Franke
> From: ljia...@gmail.com > Date: Thu, 7 Jul 2016 11:57:26 -0500 > Subject: Re: Processing json document > To: gurwls...@gmail.com > CC: jornfra...@gmail.com; user@spark.apache.org > > Hi, there, > > Thank you all for your input. @Hyukjin, as a matter of fact, I have read the blog link you posted before…

RE: Processing json document

2016-07-07 Thread Hyukjin Kwon
From: ljia...@gmail.com Date: Thu, 7 Jul 2016 11:57:26 -0500 Subject: Re: Processing json document To: gurwls...@gmail.com CC: jornfra...@gmail.com; user@spark.apache.org Hi, there, Thank you all for your input. @Hyukjin, as a matter of fact, I have read the blog link you posted before…

RE: Processing json document

2016-07-07 Thread Yong Zhang
… Yong From: ljia...@gmail.com Date: Thu, 7 Jul 2016 11:57:26 -0500 Subject: Re: Processing json document To: gurwls...@gmail.com CC: jornfra...@gmail.com; user@spark.apache.org Hi, there, Thank you all for your input. @Hyukjin, as a matter of fact, I have read the blog link you posted before…

Re: Processing json document

2016-07-07 Thread Lan Jiang
Hi, there, Thank you all for your input. @Hyukjin, as a matter of fact, I have read the blog link you posted before asking the question on the forum. As you pointed out, the link uses wholeTextFiles(), which is bad in my case, because my json file can be as large as 20G+ and OOM might occur. I am…
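
For reference, a minimal sketch (in Scala, with a hypothetical path) of the wholeTextFiles() approach from the linked blog; each file comes back as a single (path, content) record, which is exactly why a 20G+ file risks an OOM on one executor:

    import org.apache.spark.{SparkConf, SparkContext}

    object WholeFileJson {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("whole-file-json"))

        // One record per file: fine for many small files, but a single huge
        // file is materialized in memory on a single executor.
        val files = sc.wholeTextFiles("hdfs:///data/json/*.json")

        val sizes = files.map { case (path, content) =>
          // Parse `content` with a JSON library of your choice here.
          (path, content.length)
        }
        sizes.take(5).foreach(println)
        sc.stop()
      }
    }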

Re: Processing json document

2016-07-07 Thread Hyukjin Kwon
The link uses the wholeTextFiles() API, which treats each file as a single record. 2016-07-07 15:42 GMT+09:00 Jörn Franke: > This does not necessarily need to be the case: if you look at the Hadoop FileInputFormat architecture, you can even split large multi-line JSONs without…

Re: Processing json document

2016-07-07 Thread Jörn Franke
This does not necessarily need to be the case: if you look at the Hadoop FileInputFormat architecture, you can even split large multi-line JSONs without issues. I would need to have a look at it, but one large file does not necessarily mean one executor, independent of the underlying format. > On 07 Jul 2016,…
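
One concrete way to make a large multi-line JSON file splittable along these lines (a sketch only, assuming the records inside the file are separated by a known boundary; the path and delimiter below are hypothetical, not necessarily what Jörn has in mind) is to point Hadoop's TextInputFormat at a custom record delimiter and read the file through newAPIHadoopFile:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object SplittableMultiLineJson {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("splittable-json"))

        val conf = new Configuration(sc.hadoopConfiguration)
        // Hypothetical boundary between records; the delimiter itself is
        // consumed, so the stripped braces must be restored before parsing.
        conf.set("textinputformat.record.delimiter", "}\n{")

        val records = sc.newAPIHadoopFile(
            "hdfs:///data/big.json",
            classOf[TextInputFormat],
            classOf[LongWritable],
            classOf[Text],
            conf)
          .map { case (_, text) => text.toString }

        println(records.count())
        sc.stop()
      }
    }

Because the splits are computed by FileInputFormat, the file is read in parallel chunks instead of landing whole on a single executor.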

Re: Processing json document

2016-07-07 Thread Hyukjin Kwon
There is a good link for this: http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files If there are a lot of small files, then it would work pretty well in a distributed manner, but I am worried if it is a single large file. In this case, this would only work in…
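
As an aside (a sketch under the assumption that the input can be rewritten as JSON Lines, one complete object per line; the path is hypothetical), Spark's built-in JSON reader splits such a file by line across executors, which sidesteps the single-large-file problem entirely:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object JsonLinesRead {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("json-lines"))
        val sqlContext = new SQLContext(sc)

        // read.json expects one JSON object per line and is splittable.
        val df = sqlContext.read.json("hdfs:///data/events.jsonl")
        df.printSchema()
        df.show(5)
        sc.stop()
      }
    }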

Re: Processing json document

2016-07-07 Thread Jean Georges Perrin
Do you want id1, id2, id3 to be processed similarly? The Java code I use is: df = df.withColumn(K.NAME, df.col("fields.premise_name")); The original structure is something like {"fields":{"premise_name":"ccc"}}. Hope it helps. > On Jul 7, 2016, at 1:48 AM, Lan Jiang…
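
A self-contained version of that snippet, rendered in Scala here for consistency with the sketches above; only "fields.premise_name" and the sample document come from the message, while the output column name "name" and the session setup are assumptions:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object NestedFieldExtract {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("nested-field"))
        val sqlContext = new SQLContext(sc)

        // Same shape as the example above: {"fields":{"premise_name":"ccc"}}
        val df = sqlContext.read.json(
          sc.parallelize(Seq("""{"fields":{"premise_name":"ccc"}}""")))

        // Dot notation pulls the nested field up into a top-level column.
        val withName = df.withColumn("name", df.col("fields.premise_name"))
        withName.show()
        sc.stop()
      }
    }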