RE: MatchError in JsonRDD.toLong
Yes, that is exactly what I mean. And maybe you missed my last response: you can use the API jsonRDD(json: RDD[String], schema: StructType) to specify your schema explicitly. For numbers bigger than Long, we can use DecimalType.

Thanks,
Daoyuan

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Tuesday, January 20, 2015 9:26 AM
To: Wang, Daoyuan
Cc: user
Subject: Re: MatchError in JsonRDD.toLong

Hi,

On Fri, Jan 16, 2015 at 6:14 PM, Wang, Daoyuan <daoyuan.w...@intel.com> wrote:
> The second parameter of jsonRDD is the sampling ratio when we infer schema.

OK, I was aware of this, but I guess I understand the problem now. My sampling ratio is so low that the sample contains only Long-sized values, so the column is inferred as a Long. When I then encounter data that is actually larger than a Long can hold, I get the error I posted; basically it is the same situation as when specifying a wrong schema manually.

So is there any way around this other than increasing the sampling ratio so that the BigDecimal-sized numbers are also discovered?

Thanks
Tobias
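A sketch of what the suggested workaround could look like for the dataset discussed below in this thread. The field list, nullability, and types other than DisplayURL are assumptions taken from the sample records, and DecimalType usage may need adjusting depending on the Spark version (this is Spark 1.2-era API, where the type classes lived in org.apache.spark.sql):

```scala
import org.apache.spark.sql._

// Hypothetical explicit schema for the ad-click records in this thread;
// field names come from the sample data, everything else is an assumption.
val sqlc = new SQLContext(sc)
val schema = StructType(
  StructField("Click", StringType, nullable = true) ::
  StructField("Impression", IntegerType, nullable = true) ::
  // DecimalType instead of LongType, so values above Long.MaxValue still fit
  StructField("DisplayURL", DecimalType.Unlimited, nullable = true) ::
  StructField("AdId", LongType, nullable = true) :: Nil)

// No sampling-based inference: the schema is taken as given
val json = sqlc.jsonRDD(rdd, schema)
```

With an explicit schema the sampling ratio no longer matters, since inference is skipped entirely.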
Re: MatchError in JsonRDD.toLong
Hi,

On Fri, Jan 16, 2015 at 6:14 PM, Wang, Daoyuan <daoyuan.w...@intel.com> wrote:
> The second parameter of jsonRDD is the sampling ratio when we infer schema.

OK, I was aware of this, but I guess I understand the problem now. My sampling ratio is so low that the sample contains only Long-sized values, so the column is inferred as a Long. When I then encounter data that is actually larger than a Long can hold, I get the error I posted; basically it is the same situation as when specifying a wrong schema manually.

So is there any way around this other than increasing the sampling ratio so that the BigDecimal-sized numbers are also discovered?

Thanks
Tobias
Re: MatchError in JsonRDD.toLong
Hi again,

On Fri, Jan 16, 2015 at 4:25 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Now I'm wondering where this comes from (I haven't touched this component
> in a while, nor upgraded Spark etc.) [...]

So the reason that the error is showing up now is that suddenly data from a different dataset is showing up in my test dataset... don't ask me... anyway, this different dataset contains data like

  {"Click": "nonclicked", "Impression": 1, "DisplayURL": 4401798909506983219, "AdId": 21215341, ...}
  {"Click": "nonclicked", "Impression": 1, "DisplayURL": 14452800566866169008, "AdId": 10587781, ...}

and the DisplayURL seems to be too large for a Long, while the column is still inferred as Long.

So, what to do about this? Is jsonRDD inherently incapable of handling such long numbers, or is it just an issue in the schema inference and I should file a JIRA issue?

Thanks
Tobias
RE: MatchError in JsonRDD.toLong
Hi Tobias,

Can you provide how you create the JsonRDD?

Thanks,
Daoyuan

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Friday, January 16, 2015 4:01 PM
To: user
Subject: Re: MatchError in JsonRDD.toLong

Hi again,

On Fri, Jan 16, 2015 at 4:25 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Now I'm wondering where this comes from (I haven't touched this component
> in a while, nor upgraded Spark etc.) [...]

So the reason that the error is showing up now is that suddenly data from a different dataset is showing up in my test dataset... don't ask me... anyway, this different dataset contains data like

  {"Click": "nonclicked", "Impression": 1, "DisplayURL": 4401798909506983219, "AdId": 21215341, ...}
  {"Click": "nonclicked", "Impression": 1, "DisplayURL": 14452800566866169008, "AdId": 10587781, ...}

and the DisplayURL seems to be too large for a Long, while the column is still inferred as Long.

So, what to do about this? Is jsonRDD inherently incapable of handling such long numbers, or is it just an issue in the schema inference and I should file a JIRA issue?

Thanks
Tobias
Re: MatchError in JsonRDD.toLong
Hi,

On Fri, Jan 16, 2015 at 5:55 PM, Wang, Daoyuan <daoyuan.w...@intel.com> wrote:
> Can you provide how you create the JsonRDD?

This should be reproducible in the Spark shell:

  import org.apache.spark.sql._
  val sqlc = new SQLContext(sc)
  val rdd = sc.parallelize(
    """{"Click": "nonclicked", "Impression": 1, "DisplayURL": 4401798909506983219, "AdId": 21215341}""" ::
    """{"Click": "nonclicked", "Impression": 1, "DisplayURL": 14452800566866169008, "AdId": 10587781}""" :: Nil)

  // works fine
  val json = sqlc.jsonRDD(rdd)
  json.registerTempTable("test")
  sqlc.sql("SELECT * FROM test").collect

  // -> MatchError
  val json2 = sqlc.jsonRDD(rdd, 0.1)
  json2.registerTempTable("test2")
  sqlc.sql("SELECT * FROM test2").collect

I guess the issue in the latter case is that the column is inferred as Long when some rows actually hold values too big for a Long...

Thanks
Tobias
RE: MatchError in JsonRDD.toLong
The second parameter of jsonRDD is the sampling ratio when we infer schema.

Thanks,
Daoyuan

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Friday, January 16, 2015 5:11 PM
To: Wang, Daoyuan
Cc: user
Subject: Re: MatchError in JsonRDD.toLong

Hi,

On Fri, Jan 16, 2015 at 5:55 PM, Wang, Daoyuan <daoyuan.w...@intel.com> wrote:
> Can you provide how you create the JsonRDD?

This should be reproducible in the Spark shell:

  import org.apache.spark.sql._
  val sqlc = new SQLContext(sc)
  val rdd = sc.parallelize(
    """{"Click": "nonclicked", "Impression": 1, "DisplayURL": 4401798909506983219, "AdId": 21215341}""" ::
    """{"Click": "nonclicked", "Impression": 1, "DisplayURL": 14452800566866169008, "AdId": 10587781}""" :: Nil)

  // works fine
  val json = sqlc.jsonRDD(rdd)
  json.registerTempTable("test")
  sqlc.sql("SELECT * FROM test").collect

  // -> MatchError
  val json2 = sqlc.jsonRDD(rdd, 0.1)
  json2.registerTempTable("test2")
  sqlc.sql("SELECT * FROM test2").collect

I guess the issue in the latter case is that the column is inferred as Long when some rows actually hold values too big for a Long...

Thanks
Tobias
RE: MatchError in JsonRDD.toLong
And you can use jsonRDD(json: RDD[String], schema: StructType) to specify your schema explicitly. For numbers larger than Long, we can use DecimalType.

Thanks,
Daoyuan

From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Friday, January 16, 2015 5:14 PM
To: Tobias Pfeiffer
Cc: user
Subject: RE: MatchError in JsonRDD.toLong

The second parameter of jsonRDD is the sampling ratio when we infer schema.

Thanks,
Daoyuan

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Friday, January 16, 2015 5:11 PM
To: Wang, Daoyuan
Cc: user
Subject: Re: MatchError in JsonRDD.toLong

Hi,

On Fri, Jan 16, 2015 at 5:55 PM, Wang, Daoyuan <daoyuan.w...@intel.com> wrote:
> Can you provide how you create the JsonRDD?

This should be reproducible in the Spark shell:

  import org.apache.spark.sql._
  val sqlc = new SQLContext(sc)
  val rdd = sc.parallelize(
    """{"Click": "nonclicked", "Impression": 1, "DisplayURL": 4401798909506983219, "AdId": 21215341}""" ::
    """{"Click": "nonclicked", "Impression": 1, "DisplayURL": 14452800566866169008, "AdId": 10587781}""" :: Nil)

  // works fine
  val json = sqlc.jsonRDD(rdd)
  json.registerTempTable("test")
  sqlc.sql("SELECT * FROM test").collect

  // -> MatchError
  val json2 = sqlc.jsonRDD(rdd, 0.1)
  json2.registerTempTable("test2")
  sqlc.sql("SELECT * FROM test2").collect

I guess the issue in the latter case is that the column is inferred as Long when some rows actually hold values too big for a Long...

Thanks
Tobias
MatchError in JsonRDD.toLong
Hi,

I am experiencing a weird error that suddenly popped up in my unit tests. I have a couple of HDFS files in JSON format, and my test basically creates a JsonRDD and then issues a very simple SQL query over it. This used to work fine, but now suddenly I get:

  15:58:49.039 [Executor task launch worker-1] ERROR executor.Executor - Exception in task 1.0 in stage 29.0 (TID 117)
  scala.MatchError: 14452800566866169008 (of class java.math.BigInteger)
    at org.apache.spark.sql.json.JsonRDD$.toLong(JsonRDD.scala:282)
    at org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:353)
    at org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:381)
    at scala.Option.map(Option.scala:145)
    at org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:380)
    at org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:365)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:365)
    at org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:38)
    at org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:38)
    ...
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

The stack trace contains none of my classes, so it's a bit hard to track down where this starts. The code of JsonRDD.toLong is in fact

  private def toLong(value: Any): Long = {
    value match {
      case value: java.lang.Integer => value.asInstanceOf[Int].toLong
      case value: java.lang.Long => value.asInstanceOf[Long]
    }
  }

so if value is a BigInteger, toLong doesn't work.
Now I'm wondering where this comes from (I haven't touched this component in a while, nor upgraded Spark etc.), but in particular I would like to know how to work around this. Thanks Tobias
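For readers hitting the same error: the failure mode can be reproduced without Spark at all. The following is a standalone sketch, not Spark's actual source, that mirrors the match shape quoted above. A match covering only Integer and Long throws scala.MatchError the moment the JSON parser hands it a java.math.BigInteger, which is exactly what happens for integers above Long.MaxValue:

```scala
// Standalone sketch mirroring the pattern quoted above (not Spark's code):
// the match covers java.lang.Integer and java.lang.Long, but nothing handles
// java.math.BigInteger, which is what out-of-Long-range JSON integers become.
def toLong(value: Any): Long = value match {
  case v: java.lang.Integer => v.asInstanceOf[Int].toLong
  case v: java.lang.Long    => v.asInstanceOf[Long]
}

// 14452800566866169008 > Long.MaxValue (9223372036854775807), so it only
// fits in a BigInteger, and toLong falls through to a MatchError.
val tooBig = new java.math.BigInteger("14452800566866169008")
val hitMatchError =
  try { toLong(tooBig); false } catch { case _: MatchError => true }

println(hitMatchError)                               // true
println(BigInt("4401798909506983219").isValidLong)   // true: first DisplayURL fits in a Long
println(BigInt("14452800566866169008").isValidLong)  // false: second one does not
```

This also explains why the sampling ratio matters: only the second record forces inference to widen the column beyond LongType, so a sample that misses it yields a schema the full data cannot satisfy.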