Re: debug jsonRDD problem?
On Wed, May 27, 2015 at 02:06:16PM -0700, Ted Yu wrote:
> Looks like the exception was caused by resolved.get(prefix ++ a) returning
> None:
>
>   a => StructField(a.head, resolved.get(prefix ++ a).get, nullable = true)
>
> There are three occurrences of resolved.get() in createSchema() - None
> should be better handled in these places.
>
> My two cents.

Here's the simplest test case I've come up with:

  sqlContext.jsonRDD(sc.parallelize(Array("{\"'```'\":\"\"}"))).count()

Mike Stone
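A defensive variant of the line quoted above might look like the sketch
below. It only illustrates the "handle None better" idea; the helper name
and the StringType fallback are assumptions made for illustration, not the
project's actual fix.

    import org.apache.spark.sql.types.{DataType, StringType, StructField}

    // Hypothetical reworking of the failing step in createSchema():
    //   a => StructField(a.head, resolved.get(prefix ++ a).get, nullable = true)
    // Rather than calling .get on a possibly-empty Option, fall back to a
    // default type (StringType, as an assumption) so an unusual field name
    // degrades gracefully instead of throwing NoSuchElementException.
    def safeField(resolved: Map[Seq[String], DataType],
                  prefix: Seq[String],
                  a: Seq[String]): StructField =
      StructField(a.head, resolved.getOrElse(prefix ++ a, StringType),
        nullable = true)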
Re: debug jsonRDD problem?
Looks like the exception was caused by resolved.get(prefix ++ a) returning
None:

  a => StructField(a.head, resolved.get(prefix ++ a).get, nullable = true)

There are three occurrences of resolved.get() in createSchema() - None
should be better handled in these places.

My two cents.

On Wed, May 27, 2015 at 1:46 PM, Michael Stone wrote:
> On Wed, May 27, 2015 at 01:13:43PM -0700, Ted Yu wrote:
>> Can you tell us a bit more about (schema of) your JSON ?
>
> It's fairly simple, consisting of 22 fields with values that are mostly
> strings or integers, except that some of the fields are objects with
> http header/value pairs. I'd guess it's something in those latter fields
> that is causing the problems. The data is 800M rows that I didn't create
> in the first place, and I'm in the process of making a simpler test case.
> What I was mostly wondering is whether there is an obvious mechanism that
> I'm just missing to get jsonRDD to spit out more information about which
> specific rows it's having problems with.
>
>> You can find sample JSON in
>> sql/core/src/test/scala/org/apache/spark/sql/json/TestJsonData.scala
>
> I know jsonRDD works in general; I've used it before without problems.
> It even works on subsets of this data.
>
> Mike Stone
Re: debug jsonRDD problem?
On Wed, May 27, 2015 at 01:13:43PM -0700, Ted Yu wrote:
> Can you tell us a bit more about (schema of) your JSON ?

It's fairly simple, consisting of 22 fields with values that are mostly
strings or integers, except that some of the fields are objects with
http header/value pairs. I'd guess it's something in those latter fields
that is causing the problems. The data is 800M rows that I didn't create
in the first place, and I'm in the process of making a simpler test case.
What I was mostly wondering is whether there is an obvious mechanism that
I'm just missing to get jsonRDD to spit out more information about which
specific rows it's having problems with.

> You can find sample JSON in
> sql/core/src/test/scala/org/apache/spark/sql/json/TestJsonData.scala

I know jsonRDD works in general; I've used it before without problems.
It even works on subsets of this data.

Mike Stone
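There is no built-in way to make jsonRDD report the offending rows, but a
brute-force workaround is to run schema inference on one sampled row at a
time and keep the rows that throw. A minimal sketch, reusing the rdd, sc,
and sqlContext names from this thread; the 0.001 sample fraction is an
arbitrary assumption:

    import scala.util.Try

    // Brute-force row isolation: infer a schema for each sampled line on
    // its own and keep the lines where inference throws. Running jsonRDD
    // once per row is slow, so restrict it to a small sample; this is only
    // meant to narrow down a reproducible test case.
    val sample = rdd.sample(withReplacement = false, fraction = 0.001).collect()
    val badRows = sample.filter { line =>
      Try(sqlContext.jsonRDD(sc.parallelize(Seq(line))).count()).isFailure
    }
    badRows.take(5).foreach(println)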
Re: debug jsonRDD problem?
Can you tell us a bit more about (schema of) your JSON ?

You can find sample JSON in
sql/core/src/test/scala/org/apache/spark/sql/json/TestJsonData.scala

Cheers

On Wed, May 27, 2015 at 12:33 PM, Michael Stone wrote:
> Can anyone provide some suggestions on how to debug this? Using Spark
> 1.3.1. The JSON itself seems to be valid (other programs can parse it),
> and the problem seems to lie in jsonRDD trying to describe & use a schema.
>
> scala> sqlContext.jsonRDD(rdd).count()
> java.util.NoSuchElementException: None.get
>     at scala.None$.get(Option.scala:313)
>     at scala.None$.get(Option.scala:311)
>     at org.apache.spark.sql.json.JsonRDD$$anonfun$14.apply(JsonRDD.scala:105)
>     at org.apache.spark.sql.json.JsonRDD$$anonfun$14.apply(JsonRDD.scala:101)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>     at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>     at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>     at org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$makeStruct$1(JsonRDD.scala:101)
>     at org.apache.spark.sql.json.JsonRDD$$anonfun$14.apply(JsonRDD.scala:104)
>     at org.apache.spark.sql.json.JsonRDD$$anonfun$14.apply(JsonRDD.scala:101)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>     at scala.collection.immutable.Map$Map2.foreach(Map.scala:130)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>     at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>     at org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$makeStruct$1(JsonRDD.scala:101)
>     at org.apache.spark.sql.json.JsonRDD$.createSchema(JsonRDD.scala:132)
>     at org.apache.spark.sql.json.JsonRDD$.inferSchema(JsonRDD.scala:56)
>     at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:635)
>     at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:581)
> [...]
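Since jsonRDD expects one self-contained JSON document per line, a quick
first check is to count the lines Jackson itself rejects; that separates
malformed input from a schema-inference problem like the one in the trace
above. A minimal sketch, assuming rdd is the same RDD[String] passed to
jsonRDD and that Jackson databind is on the classpath (it ships with Spark):

    import com.fasterxml.jackson.databind.ObjectMapper
    import scala.util.Try

    // Count lines that plain Jackson cannot parse. If this is zero, the
    // input is well-formed JSON and the None.get comes from schema
    // inference itself rather than from bad input.
    val unparseable = rdd.mapPartitions { lines =>
      val mapper = new ObjectMapper() // one mapper per partition, kept out of the closure
      lines.filter(line => Try(mapper.readTree(line)).isFailure)
    }
    println(s"unparseable lines: ${unparseable.count()}")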