Re: jsonFile function in SQLContext does not work

2014-06-26 Thread Yin Huai
Yes. It will be added in later versions.

Thanks,
Yin

On Wed, Jun 25, 2014 at 3:39 PM, durin wrote:
> Hi Yin and Aaron,
>
> thanks for your help, this was indeed the problem. I've counted 1233 blank
> lines using grep, and the code snippet below works with those.
>
> From what you said, I guess

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread durin
Hi Yin and Aaron,

thanks for your help, this was indeed the problem. I've counted 1233 blank lines using grep, and the code snippet below works with those.

From what you said, I guess that skipping faulty lines will be possible in later versions?

Kind regards,
Simon

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Yin Huai
Hi Durin,

I guess that blank lines caused the problem (like Aaron said). Right now, jsonFile does not skip faulty lines. Can you first use sc.textFile to load the file as an RDD[String] and then use filter to remove the blank lines (code snippet can be found below)?

val sqlContext = new org.ap
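For readers of the archive, the workaround described in this message can be sketched as follows. This is a minimal illustration and not the snippet from the original post, which is cut off in this listing. It assumes the Spark 1.x API of the time (`SQLContext.jsonRDD`, which was removed in later major versions); the object name `JsonBlankLineFilter`, the helper `isNonBlank`, the default input path, and the `local[*]` master are all hypothetical choices made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical object name, chosen only for this sketch.
object JsonBlankLineFilter {

  // True for lines that contain something other than whitespace.
  def isNonBlank(line: String): Boolean = line.trim.nonEmpty

  def main(args: Array[String]): Unit = {
    // Hypothetical input path; pass the real location as the first argument.
    val inputPath = if (args.nonEmpty) args(0) else "data/input.json"

    // Assumption: a local master, convenient for trying the sketch out.
    val conf = new SparkConf()
      .setAppName("JsonBlankLineFilter")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    try {
      val sqlContext = new SQLContext(sc)

      // Load the file as plain text, one JSON object per line.
      val lines = sc.textFile(inputPath)

      // Drop blank lines before the JSON parser sees them.
      val nonBlank = lines.filter(isNonBlank)

      // Infer the schema from the remaining lines (Spark 1.x API).
      val table = sqlContext.jsonRDD(nonBlank)
      table.printSchema()
    } finally {
      sc.stop()
    }
  }
}
```

Note that this only removes empty or whitespace-only lines. Lines that are non-blank but contain malformed JSON would still reach the parser, which is the "skipping faulty lines" behaviour discussed elsewhere in the thread.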

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Aaron Davidson
Is it possible you have blank lines in your input? Not that this should be an error condition, but it may be what's causing it.

On Wed, Jun 25, 2014 at 11:57 AM, durin wrote:
> Hi Zongheng Yang,
>
> thanks for your response. Reading your answer, I did some more tests and
> realized that analyzi

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread durin
Hi Zongheng Yang,

thanks for your response. Reading your answer, I did some more tests and realized that analyzing very small parts of the dataset (which is ~130GB in ~4.3M lines) works fine. The error occurs when I analyze larger parts. Using 5% of the whole data, the error is the same as posted

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Zongheng Yang
Hi durin,

I just tried this example (nice data, by the way!), *with each JSON object on one line*, and it worked fine:

scala> rdd.printSchema()
root
 |-- entities: org.apache.spark.sql.catalyst.types.StructType$@13b6cdef
 |    |-- friends: ArrayType[org.apache.spark.sql.catalyst.types.StructType$