RE: MatchError in JsonRDD.toLong

2015-01-19 Thread Wang, Daoyuan
Yes, actually that is what I mean exactly. And maybe you missed my last 
response, you can use the API:
jsonRDD(json:RDD[String], schema:StructType)
to specify your schema explicitly. For numbers bigger than Long, we can use 
DecimalType.
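[Editor's note: for context, the reason DecimalType is suggested is that the DisplayURL values discussed in this thread exceed Long.MaxValue, so a LongType column cannot hold them. A quick check in a plain Scala REPL (no Spark required):]

```scala
// Long.MaxValue is 9223372036854775807 (about 9.2e18). The DisplayURL
// value 14452800566866169008 from the dataset exceeds it, so it cannot
// be represented as a 64-bit signed Long.
val displayUrl = BigDecimal("14452800566866169008")
assert(displayUrl > BigDecimal(Long.MaxValue))

// A BigDecimal (the JVM type behind Spark SQL's DecimalType) holds the
// value exactly, with no truncation.
assert(displayUrl.toBigInt.toString == "14452800566866169008")
```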

Thanks,
Daoyuan


From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Tuesday, January 20, 2015 9:26 AM
To: Wang, Daoyuan
Cc: user
Subject: Re: MatchError in JsonRDD.toLong

Hi,

On Fri, Jan 16, 2015 at 6:14 PM, Wang, Daoyuan daoyuan.w...@intel.com wrote:
The second parameter of jsonRDD is the sampling ratio when we infer schema.

OK, I was aware of this, but I guess I understand the problem now. My sampling 
ratio is so low that I only see the Long values of data items and infer it's a 
Long. When I meet the data that's actually longer than Long, I get the error I 
posted; basically it's the same situation as when specifying a wrong schema 
manually.

So is there any way around this other than increasing the sampling ratio so 
that the BigDecimal-sized numbers are also discovered?

Thanks
Tobias



Re: MatchError in JsonRDD.toLong

2015-01-19 Thread Tobias Pfeiffer
Hi,

On Fri, Jan 16, 2015 at 6:14 PM, Wang, Daoyuan daoyuan.w...@intel.com
wrote:

 The second parameter of jsonRDD is the sampling ratio when we infer schema.


OK, I was aware of this, but I guess I understand the problem now. My
sampling ratio is so low that I only see the Long values of data items and
infer it's a Long. When I meet the data that's actually longer than Long, I
get the error I posted; basically it's the same situation as when
specifying a wrong schema manually.

So is there any way around this other than increasing the sampling ratio so
that the BigDecimal-sized numbers are also discovered?

Thanks
Tobias


Re: MatchError in JsonRDD.toLong

2015-01-16 Thread Tobias Pfeiffer
Hi again,

On Fri, Jan 16, 2015 at 4:25 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Now I'm wondering where this comes from (I haven't touched this component
 in a while, nor upgraded Spark etc.) [...]


So the reason that the error is showing up now is that suddenly data from a
different dataset is showing up in my test dataset... don't ask me...
anyway, this different dataset contains data like

  {"Click":"nonclicked", "Impression":1,
   "DisplayURL":4401798909506983219, "AdId":21215341, ...}
  {"Click":"nonclicked", "Impression":1,
   "DisplayURL":14452800566866169008, "AdId":10587781, ...}

and the DisplayURL seems to be too long for Long, while it is still
inferred as a Long column.

So, what to do about this? Is jsonRDD inherently incapable of handling
those long numbers or is it just an issue in the schema inference and I
should file a JIRA issue?

Thanks
Tobias


RE: MatchError in JsonRDD.toLong

2015-01-16 Thread Wang, Daoyuan
Hi Tobias,

Can you provide how you create the JsonRDD?

Thanks,
Daoyuan


From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Friday, January 16, 2015 4:01 PM
To: user
Subject: Re: MatchError in JsonRDD.toLong

Hi again,

On Fri, Jan 16, 2015 at 4:25 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Now I'm wondering where this comes from (I haven't touched this component in a 
while, nor upgraded Spark etc.) [...]

So the reason that the error is showing up now is that suddenly data from a 
different dataset is showing up in my test dataset... don't ask me... anyway, 
this different dataset contains data like

  {"Click":"nonclicked", "Impression":1,
   "DisplayURL":4401798909506983219, "AdId":21215341, ...}
  {"Click":"nonclicked", "Impression":1,
   "DisplayURL":14452800566866169008, "AdId":10587781, ...}

and the DisplayURL seems to be too long for Long, while it is still inferred as 
a Long column.

So, what to do about this? Is jsonRDD inherently incapable of handling those 
long numbers or is it just an issue in the schema inference and I should file a 
JIRA issue?

Thanks
Tobias


Re: MatchError in JsonRDD.toLong

2015-01-16 Thread Tobias Pfeiffer
Hi,

On Fri, Jan 16, 2015 at 5:55 PM, Wang, Daoyuan daoyuan.w...@intel.com
wrote:

 Can you provide how you create the JsonRDD?


This should be reproducible in the Spark shell:

-
import org.apache.spark.sql._
val sqlc = new SQLContext(sc)
val rdd = sc.parallelize("""{"Click":"nonclicked", "Impression":1,
  "DisplayURL":4401798909506983219, "AdId":21215341}""" ::
  """{"Click":"nonclicked", "Impression":1,
  "DisplayURL":14452800566866169008, "AdId":10587781}""" :: Nil)

// works fine
val json = sqlc.jsonRDD(rdd)
json.registerTempTable("test")
sqlc.sql("SELECT * FROM test").collect

// => MatchError
val json2 = sqlc.jsonRDD(rdd, 0.1)
json2.registerTempTable("test2")
sqlc.sql("SELECT * FROM test2").collect
-

I guess the issue in the latter case is that the column is inferred as Long
when some rows actually are too big for Long...
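[Editor's note: one possible workaround, following the suggestion elsewhere in this thread, is to skip sampling-based inference entirely and pass an explicit schema. This is an untested sketch against the Spark 1.2-era API, run inside a Spark shell; the field list and the choice of `DecimalType.Unlimited` for DisplayURL are assumptions, not the thread author's code:]

```scala
import org.apache.spark.sql._

// Sketch: declare the schema up front so DisplayURL is DecimalType
// regardless of which rows the sampling happens to see. Assumes `sqlc`
// and `rdd` from the snippet above.
val schema = StructType(Seq(
  StructField("Click", StringType, nullable = true),
  StructField("Impression", IntegerType, nullable = true),
  StructField("DisplayURL", DecimalType.Unlimited, nullable = true),
  StructField("AdId", LongType, nullable = true)))

val json3 = sqlc.jsonRDD(rdd, schema)
json3.registerTempTable("test3")
sqlc.sql("SELECT * FROM test3").collect()
```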

Thanks
Tobias


RE: MatchError in JsonRDD.toLong

2015-01-16 Thread Wang, Daoyuan
The second parameter of jsonRDD is the sampling ratio when we infer schema.

Thanks,
Daoyuan

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Friday, January 16, 2015 5:11 PM
To: Wang, Daoyuan
Cc: user
Subject: Re: MatchError in JsonRDD.toLong

Hi,

On Fri, Jan 16, 2015 at 5:55 PM, Wang, Daoyuan daoyuan.w...@intel.com wrote:
Can you provide how you create the JsonRDD?

This should be reproducible in the Spark shell:

-
import org.apache.spark.sql._
val sqlc = new SQLContext(sc)
val rdd = sc.parallelize("""{"Click":"nonclicked", "Impression":1,
  "DisplayURL":4401798909506983219, "AdId":21215341}""" ::
  """{"Click":"nonclicked", "Impression":1,
  "DisplayURL":14452800566866169008, "AdId":10587781}""" :: Nil)

// works fine
val json = sqlc.jsonRDD(rdd)
json.registerTempTable("test")
sqlc.sql("SELECT * FROM test").collect

// => MatchError
val json2 = sqlc.jsonRDD(rdd, 0.1)
json2.registerTempTable("test2")
sqlc.sql("SELECT * FROM test2").collect
-

I guess the issue in the latter case is that the column is inferred as Long 
when some rows actually are too big for Long...

Thanks
Tobias



RE: MatchError in JsonRDD.toLong

2015-01-16 Thread Wang, Daoyuan
And you can use jsonRDD(json:RDD[String], schema:StructType) to specify your 
schema explicitly. For numbers bigger than Long, we can use DecimalType.

Thanks,
Daoyuan

From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Friday, January 16, 2015 5:14 PM
To: Tobias Pfeiffer
Cc: user
Subject: RE: MatchError in JsonRDD.toLong

The second parameter of jsonRDD is the sampling ratio when we infer schema.

Thanks,
Daoyuan

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Friday, January 16, 2015 5:11 PM
To: Wang, Daoyuan
Cc: user
Subject: Re: MatchError in JsonRDD.toLong

Hi,

On Fri, Jan 16, 2015 at 5:55 PM, Wang, Daoyuan daoyuan.w...@intel.com wrote:
Can you provide how you create the JsonRDD?

This should be reproducible in the Spark shell:

-
import org.apache.spark.sql._
val sqlc = new SQLContext(sc)
val rdd = sc.parallelize("""{"Click":"nonclicked", "Impression":1,
  "DisplayURL":4401798909506983219, "AdId":21215341}""" ::
  """{"Click":"nonclicked", "Impression":1,
  "DisplayURL":14452800566866169008, "AdId":10587781}""" :: Nil)

// works fine
val json = sqlc.jsonRDD(rdd)
json.registerTempTable("test")
sqlc.sql("SELECT * FROM test").collect

// => MatchError
val json2 = sqlc.jsonRDD(rdd, 0.1)
json2.registerTempTable("test2")
sqlc.sql("SELECT * FROM test2").collect
-

I guess the issue in the latter case is that the column is inferred as Long 
when some rows actually are too big for Long...

Thanks
Tobias



MatchError in JsonRDD.toLong

2015-01-15 Thread Tobias Pfeiffer
Hi,

I am experiencing a weird error that suddenly popped up in my unit tests. I
have a couple of HDFS files in JSON format and my test is basically
creating a JsonRDD and then issuing a very simple SQL query over it. This
used to work fine, but now suddenly I get:

15:58:49.039 [Executor task launch worker-1] ERROR executor.Executor -
Exception in task 1.0 in stage 29.0 (TID 117)
scala.MatchError: 14452800566866169008 (of class java.math.BigInteger)
at org.apache.spark.sql.json.JsonRDD$.toLong(JsonRDD.scala:282)
at org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:353)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:381)
at scala.Option.map(Option.scala:145)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:380)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:365)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:365)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:38)
at
org.apache.spark.sql.json.JsonRDD$$anonfun$jsonStringToRow$1.apply(JsonRDD.scala:38)
...
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

The stack trace contains none of my classes, so it's a bit hard to track
down where this starts.

The code of JsonRDD.toLong is in fact

  private def toLong(value: Any): Long = {
    value match {
      case value: java.lang.Integer => value.asInstanceOf[Int].toLong
      case value: java.lang.Long => value.asInstanceOf[Long]
    }
  }

so if value is a BigInteger, toLong doesn't work. Now I'm wondering where
this comes from (I haven't touched this component in a while, nor upgraded
Spark etc.), but in particular I would like to know how to work around this.
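[Editor's note: the failure mode can be reproduced without Spark at all. Below is a standalone sketch of the same non-exhaustive match; `toLongSketch` is a hypothetical stand-in for JsonRDD.toLong, not the actual Spark source:]

```scala
// Same shape as JsonRDD.toLong: the match has no case for BigInteger.
def toLongSketch(value: Any): Long = value match {
  case v: java.lang.Integer => v.longValue
  case v: java.lang.Long    => v.longValue
}

// Boxed Int and Long values convert fine...
assert(toLongSketch(Int.box(42)) == 42L)
assert(toLongSketch(Long.box(42L)) == 42L)

// ...but a BigInteger matches neither case and throws scala.MatchError,
// exactly as in the stack trace above.
val threw =
  try { toLongSketch(new java.math.BigInteger("14452800566866169008")); false }
  catch { case _: MatchError => true }
assert(threw)
```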

Thanks
Tobias