Re: Spark SQL - Exception only when using cacheTable

2014-10-13 Thread poiuytrez
This is how the table was created:

from pyspark.sql import Row

# customer_id is built as a Python long; the other numeric columns are int/float
transactions = parts.map(lambda p: Row(customer_id=long(p[0]),
    chain=int(p[1]), dept=int(p[2]), category=int(p[3]), company=int(p[4]),
    brand=int(p[5]), date=str(p[6]), productsize=float(p[7]),
    productmeasure=str(p[8]), purchasequantity=int(p[9]),
    purchaseamount=float(p[10])))

# Infer the schema, and register the SchemaRDD as a table
schemaTransactions = sqlContext.inferSchema(transactions)
schemaTransactions.registerTempTable("transactions")
sqlContext.cacheTable("transactions")

t = sqlContext.sql("SELECT * FROM transactions WHERE purchaseamount = 50")
t.count()
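
One quick sanity check (not part of the original message) is to look at what inferSchema actually decided for each column before caching; assuming SchemaRDD.printSchema() is available in this Spark version, something like:

# Print the inferred column types; note customer_id is built with long(p[0]) above
schemaTransactions.printSchema()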


Thank you,
poiuytrez



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Exception-only-when-using-cacheTable-tp16031p16262.html



Re: Spark SQL - Exception only when using cacheTable

2014-10-11 Thread Cheng Lian
How was the table created? Would you mind sharing the related code? It 
seems that the underlying type of the |customer_id| field is actually 
long, but the schema says it is integer; in other words, it is a type 
mismatch error.


The first query succeeds because |SchemaRDD.count()| is translated to 
something equivalent to |SELECT COUNT(1) FROM ...| and doesn't actually 
touch the field. In the second case, however, Spark tries to materialize 
the whole underlying table into the in-memory columnar format because you 
asked to cache the table, and that is when the type mismatch is detected.
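
If the root cause is indeed an int/long ambiguity during schema inference, one way to rule it out is to declare the schema explicitly instead of relying on inferSchema. The sketch below is only an illustration against the Spark 1.1 Python API (StructType/StructField/applySchema); it reuses the thread's parts and sqlContext names, and picks LongType for customer_id and DoubleType for the float columns:

from pyspark.sql import StructType, StructField, LongType, IntegerType, StringType, DoubleType

# Pin every column type up front so nothing depends on schema inference.
schema = StructType([
    StructField("customer_id", LongType(), True),
    StructField("chain", IntegerType(), True),
    StructField("dept", IntegerType(), True),
    StructField("category", IntegerType(), True),
    StructField("company", IntegerType(), True),
    StructField("brand", IntegerType(), True),
    StructField("date", StringType(), True),
    StructField("productsize", DoubleType(), True),
    StructField("productmeasure", StringType(), True),
    StructField("purchasequantity", IntegerType(), True),
    StructField("purchaseamount", DoubleType(), True),
])

# Plain tuples in the same order as the schema fields above
# (parts and sqlContext are the RDD and context from the thread).
rows = parts.map(lambda p: (long(p[0]), int(p[1]), int(p[2]), int(p[3]),
                            int(p[4]), int(p[5]), str(p[6]), float(p[7]),
                            str(p[8]), int(p[9]), float(p[10])))

schemaTransactions = sqlContext.applySchema(rows, schema)
schemaTransactions.registerTempTable("transactions")
sqlContext.cacheTable("transactions")

With the schema fixed explicitly, any remaining mismatch between the declared types and the actual values is a data problem rather than an inference problem.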


On 10/10/14 8:28 PM, poiuytrez wrote:


Hi Cheng,

I am using Spark 1.1.0.
This is the stack trace:
[...]

Re: Spark SQL - Exception only when using cacheTable

2014-10-10 Thread visakh
Can you try checking whether the table is actually being cached? You can use
the isCached method. More details are here:
http://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/sql/SQLContext.html



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Exception-only-when-using-cacheTable-tp16031p16123.html



Re: Spark SQL - Exception only when using cacheTable

2014-10-10 Thread Cheng Lian
Hi Poiuytrez, what version of Spark are you using? Exception details 
like the stack trace are really needed to investigate this issue. You can 
find them in the executor logs, or just browse the application's 
stderr/stdout links from the Spark Web UI.


On 10/9/14 9:37 PM, poiuytrez wrote:

Hello,

I have a weird issue; this request works fine:
sqlContext.sql("SELECT customer_id FROM transactions WHERE purchaseamount = 200").count()

However, when I cache the table before making the request:
sqlContext.cacheTable("transactions")
sqlContext.sql("SELECT customer_id FROM transactions WHERE purchaseamount = 200").count()

I am getting an exception on one of the tasks:
: org.apache.spark.SparkException: Job aborted due to stage failure: Task
120 in stage 104.0 failed 4 times, most recent failure: Lost task 120.3 in
stage 104.0 (TID 20537, spark-w-0.c.internal): java.lang.ClassCastException:

(I have no details after the ':')

Any ideas of what could be wrong?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Exception-only-when-using-cacheTable-tp16031.html







Re: Spark SQL - Exception only when using cacheTable

2014-10-10 Thread poiuytrez
I am using the Python API. Unfortunately, I cannot find the isCached method,
or an equivalent, in the SQLContext section of the documentation:
https://spark.apache.org/docs/1.1.0/api/python/index.html
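
In case it helps, one possible workaround (an assumption about PySpark internals, not a public Python API) is to call the underlying Scala SQLContext through the internal py4j handle, since the Scala side does expose isCached:

# _ssql_ctx is PySpark's internal reference to the JVM SQLContext (may change between versions)
print sqlContext._ssql_ctx.isCached("transactions")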



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Exception-only-when-using-cacheTable-tp16031p16137.html



Re: Spark SQL - Exception only when using cacheTable

2014-10-10 Thread poiuytrez
Hi Cheng, 

I am using Spark 1.1.0. 
This is the stack trace:
14/10/10 12:17:40 WARN TaskSetManager: Lost task 120.0 in stage 7.0 (TID
2235, spark-w-0.c.db.internal): java.lang.ClassCastException: java.lang.Long
cannot be cast to java.lang.Integer
    scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
    org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:146)
    org.apache.spark.sql.columnar.INT$.getField(ColumnType.scala:105)
    org.apache.spark.sql.columnar.INT$.getField(ColumnType.scala:92)
    org.apache.spark.sql.columnar.BasicColumnBuilder.appendFrom(ColumnBuilder.scala:72)
    org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$NullableColumnBuilder$$super$appendFrom(ColumnBuilder.scala:88)
    org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:57)
    org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:88)
    org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:76)
    org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:88)
    org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryColumnarTableScan.scala:65)
    org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryColumnarTableScan.scala:50)
    org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:236)
    org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
    org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)


This was also printed on the driver:

14/10/10 12:17:43 ERROR TaskSetManager: Task 120 in stage 7.0 failed 4
times; aborting job
14/10/10 12:17:43 INFO TaskSchedulerImpl: Cancelling stage 7
14/10/10 12:17:43 INFO TaskSchedulerImpl: Stage 7 was cancelled
14/10/10 12:17:43 INFO DAGScheduler: Failed to run collect at
SparkPlan.scala:85
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hadoop/spark-install/python/pyspark/sql.py", line 1606, in count
    return self._jschema_rdd.count()
  File "/home/hadoop/spark-install/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/home/hadoop/spark-install/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o100.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task
120 in stage 7.0 failed 4 times, most recent failure: Lost task 120.3 in
stage 7.0 (TID 2248, spark-w-0.c.db.internal): java.lang.ClassCastException: 

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at