Hi Eran,

Can you try 1.6? With the change in https://github.com/apache/spark/pull/10288, the JSON data source will no longer throw a runtime exception when it hits a record it cannot parse. Instead, it puts the entire record into the "_corrupt_record" column.
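For example, something like this (a minimal sketch against 1.6 in spark-shell, reusing your path, and assuming the file has at least one unparseable record so the column shows up in the inferred schema):

// read as before; bad records no longer fail the job at count() time
val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/sample2.json")

// rows that parsed cleanly
df.filter(df("_corrupt_record").isNull).count()

// raw text of the records that could not be parsed
df.filter(df("_corrupt_record").isNotNull).select("_corrupt_record").show(false)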
Thanks,
Yin

On Sun, Dec 20, 2015 at 9:37 AM, Eran Witkon <eranwit...@gmail.com> wrote:
> Thanks for this!
> This was the problem...
>
> On Sun, 20 Dec 2015 at 18:49 Chris Fregly <ch...@fregly.com> wrote:
>
>> hey Eran, I run into this all the time with JSON.
>>
>> the problem is likely that your JSON is "too pretty" and extends beyond
>> a single line, which trips up the JSON reader.
>>
>> my solution is usually to de-pretty the JSON - either manually or through
>> an ETL step - by stripping all whitespace before pointing my
>> DataFrame/JSON reader at the file.
>>
>> this tool is handy for one-off scenarios: http://jsonviewer.stack.hu
>>
>> for streaming use cases, you'll want a light de-pretty ETL step either
>> within the Spark Streaming job after ingestion - or upstream using
>> something like a Flume interceptor, a NiFi processor (I love NiFi), or a
>> Kafka transformation, assuming those exist by now.
>>
>> a similar problem exists for XML, btw. there are lots of wonky workarounds
>> for this that use mapPartitions and all kinds of craziness. the best
>> option, in my opinion, is to just ETL/flatten the data to make the
>> DataFrame reader happy.
>>
>> On Dec 19, 2015, at 4:55 PM, Eran Witkon <eranwit...@gmail.com> wrote:
>>
>> Hi,
>> I tried the following code in spark-shell on Spark 1.5.2:
>>
>> val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/sample2.json")
>> df.count()
>>
>> 15/12/19 23:49:40 ERROR Executor: Managed memory leak detected; size =
>> 67108864 bytes, TID = 3
>> 15/12/19 23:49:40 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 3)
>> java.lang.RuntimeException: Failed to parse a value for data type
>> StructType() (current token: VALUE_STRING).
>>   at scala.sys.package$.error(package.scala:27)
>>   at org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:172)
>>   at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:251)
>>   at org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:246)
>>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>   at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:365)
>>   at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
>>   at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
>>
>> Am I doing something wrong?
>> Eran
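For reference, a minimal sketch of the de-pretty step Chris describes, assuming each file under the sample directory (hypothetical path below) holds a single pretty-printed JSON document: wholeTextFiles keeps a multi-line document together as one string, and collapsing the newlines produces the one-record-per-line form that sqlContext.read.json expects.

// (path is an assumption: the directory that holds the pretty-printed sample files)
// wholeTextFiles yields (path, fullFileContents) pairs, so each document stays whole
val oneLinePerRecord = sc
  .wholeTextFiles("/home/eranw/Workspace/JSON/sample")
  .map { case (_, content) => content.replaceAll("\n", " ") }  // JSON strings cannot contain raw newlines, so this is safe

val df = sqlContext.read.json(oneLinePerRecord)
df.count()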