Hi Zongheng Yang,

Thanks for your response. Prompted by your answer, I ran some more tests and
realized that analyzing very small parts of the dataset (which is ~130 GB in
~4.3M lines) works fine.
The error occurs when I analyze larger parts. With 5% of the whole dataset,
I get the same exception as posted before for certain TIDs; however, I do
still get the schema that was determined up to that point as a result.

The Spark WebUI shows the following:

Job aborted due to stage failure: Task 6.0:11 failed 4 times, most recent
failure: Exception failure in TID 108 on host foo.bar.com:
com.fasterxml.jackson.databind.JsonMappingException: No content to map due
to end-of-input at [Source: java.io.StringReader@3697781f; line: 1, column:
1]
com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:164)
com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3029)
com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2971)
com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2091)
org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:261)
org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:261)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:172)
scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:823)
org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:821)
org.apache.spark.SparkContext$$anonfun$24.apply(SparkContext.scala:1132)
org.apache.spark.SparkContext$$anonfun$24.apply(SparkContext.scala:1132)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662) Driver stacktrace:



Is the only possible explanation that some of these 4.3 million JSON objects
are not valid JSON, or could there be another cause?
And if invalid JSON is the reason, is there some way to tell the function to
simply skip faulty lines?
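
For example, I imagine something like the following rough sketch. It assumes
the failures come from blank or truncated lines (the "No content to map due
to end-of-input at line: 1, column: 1" message suggests empty input to the
parser), and that filtering them out before parsing avoids the exception.
The path is a placeholder:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Drop empty lines before the JSON parser ever sees them; jsonFile feeds
// every line to Jackson, and an empty line yields "No content to map".
val cleaned = sc.textFile("hdfs:///path/to/data.json")  // placeholder path
  .filter(line => line.trim.nonEmpty)

// jsonRDD accepts an RDD[String], so the pre-filtered lines can be used
// in place of jsonFile.
val schemaRdd = sqlContext.jsonRDD(cleaned)
schemaRdd.printSchema()
```

This would only help if the faulty records really are empty lines, though;
lines that are non-empty but malformed JSON would still need some other
kind of skipping.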


Thanks,
Durin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/jsonFile-function-in-SQLContext-does-not-work-tp8273p8278.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
