jsonFile function in SQLContext does not work

2014-06-25 Thread durin
I'm using Spark 1.0.0-SNAPSHOT (downloaded and compiled on 2014/06/23).
I'm trying to execute the following code:

import org.apache.spark.SparkContext._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val table = sqlContext.jsonFile("hdfs://host:9100/user/myuser/data.json")
table.printSchema()

data.json looks like this (3 shortened lines shown here):

{"field1":"content","id":12312213,"read":false,"user":{"id":121212,"name":"E. Stark","num_heads":0},"place":"Winterfell","entities":{"weapons":[],"friends":[{"name":"R. Baratheon","id":23234,"indices":[0,16]}]},"lang":"en"}
{"field1":"content","id":56756765,"read":false,"user":{"id":121212,"name":"E. Stark","num_heads":0},"place":"Winterfell","entities":{"weapons":[],"friends":[{"name":"R. Baratheon","id":23234,"indices":[0,16]}]},"lang":"en"}
{"field1":"content","id":56765765,"read":false,"user":{"id":121212,"name":"E. Stark","num_heads":0},"place":"Winterfell","entities":{"weapons":[],"friends":[{"name":"R. Baratheon","id":23234,"indices":[0,16]}]},"lang":"en"}

The JSON object on each line is valid according to the JSON validator I use,
and as jsonFile is defined as

def jsonFile(path: String): SchemaRDD
Loads a JSON file (one object per line), returning the result as a
SchemaRDD.

I would assume this should work. However, executing this code returns this
error:

14/06/25 10:05:09 WARN scheduler.TaskSetManager: Lost TID 11 (task 0.0:11)
14/06/25 10:05:09 WARN scheduler.TaskSetManager: Loss was due to
com.fasterxml.jackson.databind.JsonMappingException
com.fasterxml.jackson.databind.JsonMappingException: No content to map due
to end-of-input
 at [Source: java.io.StringReader@238df2e4; line: 1, column: 1]
at
com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:164)
...


Does anyone know where the problem lies?





Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Zongheng Yang
Hi durin,

I just tried this example (nice data, by the way!), *with each JSON
object on one line*, and it worked fine:

scala> rdd.printSchema()
root
 |-- entities: org.apache.spark.sql.catalyst.types.StructType$@13b6cdef
 |    |-- friends: ArrayType[org.apache.spark.sql.catalyst.types.StructType$@13b6cdef]
 |    |    |-- id: IntegerType
 |    |    |-- indices: ArrayType[IntegerType]
 |    |    |-- name: StringType
 |    |-- weapons: ArrayType[StringType]
 |-- field1: StringType
 |-- id: IntegerType
 |-- lang: StringType
 |-- place: StringType
 |-- read: BooleanType
 |-- user: org.apache.spark.sql.catalyst.types.StructType$@13b6cdef
 |    |-- id: IntegerType
 |    |-- name: StringType
 |    |-- num_heads: IntegerType
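
By the way, once the schema is inferred you should also be able to register
the SchemaRDD as a table and run SQL over it. Something like this (an
untested sketch; "tweets" is just an example table name, the field names come
from your sample lines, and it assumes the SQLContext is in scope as
sqlContext):

rdd.registerAsTable("tweets")
// pick a couple of top-level fields and filter on one of them
val english = sqlContext.sql("SELECT id, place FROM tweets WHERE lang = 'en'")
english.collect().foreach(println)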

On Wed, Jun 25, 2014 at 10:57 AM, durin m...@simon-schaefer.net wrote:
 I'm using Spark 1.0.0-SNAPSHOT (downloaded and compiled on 2014/06/23).
 I'm trying to execute the following code:

 import org.apache.spark.SparkContext._
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val table = sqlContext.jsonFile("hdfs://host:9100/user/myuser/data.json")
 table.printSchema()

 data.json looks like this (3 shortened lines shown here):

 {"field1":"content","id":12312213,"read":false,"user":{"id":121212,"name":"E. Stark","num_heads":0},"place":"Winterfell","entities":{"weapons":[],"friends":[{"name":"R. Baratheon","id":23234,"indices":[0,16]}]},"lang":"en"}
 {"field1":"content","id":56756765,"read":false,"user":{"id":121212,"name":"E. Stark","num_heads":0},"place":"Winterfell","entities":{"weapons":[],"friends":[{"name":"R. Baratheon","id":23234,"indices":[0,16]}]},"lang":"en"}
 {"field1":"content","id":56765765,"read":false,"user":{"id":121212,"name":"E. Stark","num_heads":0},"place":"Winterfell","entities":{"weapons":[],"friends":[{"name":"R. Baratheon","id":23234,"indices":[0,16]}]},"lang":"en"}

 The JSON object on each line is valid according to the JSON validator I use,
 and as jsonFile is defined as

 def jsonFile(path: String): SchemaRDD
 Loads a JSON file (one object per line), returning the result as a
 SchemaRDD.

 I would assume this should work. However, executing this code returns this
 error:

 14/06/25 10:05:09 WARN scheduler.TaskSetManager: Lost TID 11 (task 0.0:11)
 14/06/25 10:05:09 WARN scheduler.TaskSetManager: Loss was due to
 com.fasterxml.jackson.databind.JsonMappingException
 com.fasterxml.jackson.databind.JsonMappingException: No content to map due
 to end-of-input
  at [Source: java.io.StringReader@238df2e4; line: 1, column: 1]
 at
 com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:164)
 ...


 Does anyone know where the problem lies?





Re: jsonFile function in SQLContext does not work

2014-06-25 Thread durin
Hi Zongheng Yang,

thanks for your response. After reading your answer I ran some more tests and
realized that analyzing very small parts of the dataset (which is ~130 GB in
~4.3M lines) works fine.
The error occurs when I analyze larger parts. With 5% of the whole data, the
error is the same as posted before for certain TIDs. However, even at 5% I do
get the schema that was inferred up to that point as a result.

The Spark WebUI shows the following:

Job aborted due to stage failure: Task 6.0:11 failed 4 times, most recent
failure: Exception failure in TID 108 on host foo.bar.com:
com.fasterxml.jackson.databind.JsonMappingException: No content to map due
to end-of-input at [Source: java.io.StringReader@3697781f; line: 1, column:
1]
com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:164)
com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3029)
com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2971)
com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2091)
org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:261)
org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:261)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:172)
scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:823)
org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:821)
org.apache.spark.SparkContext$$anonfun$24.apply(SparkContext.scala:1132)
org.apache.spark.SparkContext$$anonfun$24.apply(SparkContext.scala:1132)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662) Driver stacktrace:



Is the only possible reason that some of these 4.3 million JSON objects are
not valid JSON, or could there be another explanation?
And if that is the reason, is there some way to tell the function to just
skip faulty lines?


Thanks,
Durin





Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Aaron Davidson
Is it possible you have blank lines in your input? Not that this should be
an error condition, but it may be what's causing it.
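
A quick way to check for blank lines would be something like this from the
spark-shell (untested sketch, using the HDFS path from your first mail):

val lines = sc.textFile("hdfs://host:9100/user/myuser/data.json")
// count lines that are empty or contain only whitespace
val blanks = lines.filter(_.trim.isEmpty).count()
println("blank lines: " + blanks)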


On Wed, Jun 25, 2014 at 11:57 AM, durin m...@simon-schaefer.net wrote:

 Hi Zongheng Yang,

 thanks for your response. Reading your answer, I did some more tests and
 realized that analyzing very small parts of the dataset (which is ~130GB in
 ~4.3M lines) works fine.
 The error occurs when I analyze larger parts. Using 5% of the whole data,
 the error is the same as posted before for certain TIDs. However, I get the
 structure determined so far as a result when using 5%.

 The Spark WebUI shows the following:

 Job aborted due to stage failure: Task 6.0:11 failed 4 times, most recent
 failure: Exception failure in TID 108 on host foo.bar.com:
 com.fasterxml.jackson.databind.JsonMappingException: No content to map due
 to end-of-input at [Source: java.io.StringReader@3697781f; line: 1,
 column:
 1]

 com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:164)

 com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3029)

 com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2971)

 com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2091)

 org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:261)

 org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:261)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:172)
 scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
 org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:823)
 org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:821)
 org.apache.spark.SparkContext$$anonfun$24.apply(SparkContext.scala:1132)
 org.apache.spark.SparkContext$$anonfun$24.apply(SparkContext.scala:1132)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662) Driver stacktrace:



 Is the only possible reason that some of these 4.3 Million JSON-Objects are
 not valid JSON, or could there be another explanation?
 And if it is the reason, is there some way to tell the function to just
 skip
 faulty lines?


 Thanks,
 Durin






Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Yin Huai
Hi Durin,

I guess that blank lines caused the problem (like Aaron said). Right now,
jsonFile does not skip faulty lines. Can you first use sc.textFile to load
the file as an RDD[String] and then use filter to remove those blank lines
(a code snippet can be found below)?

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rdd = sc.textFile("hdfs://host:9100/user/myuser/data.json").filter(r => r.trim != "")
val table = sqlContext.jsonRDD(rdd)
table.printSchema()
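
If it turns out there are malformed lines other than blank ones, a more
general pre-filter could drop anything Jackson cannot parse. A rough,
untested sketch (the ObjectMapper is created inside mapPartitions because it
is not serializable):

import com.fasterxml.jackson.databind.ObjectMapper
import scala.util.Try

val parseable = sc.textFile("hdfs://host:9100/user/myuser/data.json").mapPartitions { iter =>
  val mapper = new ObjectMapper()  // one mapper per partition
  iter.filter(line => line.trim.nonEmpty && Try(mapper.readTree(line)).isSuccess)
}
val table2 = sqlContext.jsonRDD(parseable)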

Thanks,

Yin



On Wed, Jun 25, 2014 at 1:08 PM, Aaron Davidson ilike...@gmail.com wrote:

 Is it possible you have blank lines in your input? Not that this should be
 an error condition, but it may be what's causing it.


 On Wed, Jun 25, 2014 at 11:57 AM, durin m...@simon-schaefer.net wrote:

 Hi Zongheng Yang,

 thanks for your response. Reading your answer, I did some more tests and
 realized that analyzing very small parts of the dataset (which is ~130GB
 in
 ~4.3M lines) works fine.
 The error occurs when I analyze larger parts. Using 5% of the whole data,
 the error is the same as posted before for certain TIDs. However, I get
 the
 structure determined so far as a result when using 5%.

 The Spark WebUI shows the following:

 Job aborted due to stage failure: Task 6.0:11 failed 4 times, most recent
 failure: Exception failure in TID 108 on host foo.bar.com:
 com.fasterxml.jackson.databind.JsonMappingException: No content to map due
 to end-of-input at [Source: java.io.StringReader@3697781f; line: 1,
 column:
 1]

 com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:164)

 com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3029)

 com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2971)

 com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2091)

 org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:261)

 org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:261)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:172)
 scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
 org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:823)
 org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:821)
 org.apache.spark.SparkContext$$anonfun$24.apply(SparkContext.scala:1132)
 org.apache.spark.SparkContext$$anonfun$24.apply(SparkContext.scala:1132)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662) Driver stacktrace:



 Is the only possible reason that some of these 4.3 Million JSON-Objects
 are
 not valid JSON, or could there be another explanation?
 And if it is the reason, is there some way to tell the function to just
 skip
 faulty lines?


 Thanks,
 Durin








Re: jsonFile function in SQLContext does not work

2014-06-25 Thread durin
Hi Yin and Aaron,

thanks for your help, this was indeed the problem. I counted 1233 blank
lines using grep, and your code snippet works fine with those filtered out.

From what you said, I guess that skipping faulty lines will be possible in
later versions?


Kind regards,
Simon


