Re: Spark SQL - Exception only when using cacheTable
This is how the table was created:

    transactions = parts.map(lambda p: Row(
        customer_id=long(p[0]),
        chain=int(p[1]),
        dept=int(p[2]),
        category=int(p[3]),
        company=int(p[4]),
        brand=int(p[5]),
        date=str(p[6]),
        productsize=float(p[7]),
        productmeasure=str(p[8]),
        purchasequantity=int(p[9]),
        purchaseamount=float(p[10])))

    # Infer the schema, and register the SchemaRDD as a table
    schemaTransactions = sqlContext.inferSchema(transactions)
    schemaTransactions.registerTempTable("transactions")

    sqlContext.cacheTable("transactions")
    t = sqlContext.sql("SELECT * FROM transactions WHERE purchaseamount = 50")
    t.count()

Thank you,
poiuytrez
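For what it's worth, given the type-mismatch diagnosis Cheng posts below, one way to take schema inference out of the picture entirely is to declare the schema by hand. A minimal, untested sketch against the Spark 1.1 Python API (StructType plus applySchema), reusing the parts RDD above; DoubleType is used for the float columns since Python floats are doubles:

    from pyspark.sql import StructType, StructField, LongType, IntegerType, StringType, DoubleType

    # Pin every column to exactly one type instead of relying on inferSchema.
    schema = StructType([
        StructField("customer_id", LongType(), True),
        StructField("chain", IntegerType(), True),
        StructField("dept", IntegerType(), True),
        StructField("category", IntegerType(), True),
        StructField("company", IntegerType(), True),
        StructField("brand", IntegerType(), True),
        StructField("date", StringType(), True),
        StructField("productsize", DoubleType(), True),
        StructField("productmeasure", StringType(), True),
        StructField("purchasequantity", IntegerType(), True),
        StructField("purchaseamount", DoubleType(), True),
    ])

    # applySchema takes an RDD of tuples whose positions match the schema fields.
    rows = parts.map(lambda p: (long(p[0]), int(p[1]), int(p[2]), int(p[3]),
                                int(p[4]), int(p[5]), str(p[6]), float(p[7]),
                                str(p[8]), int(p[9]), float(p[10])))
    schemaTransactions = sqlContext.applySchema(rows, schema)
    schemaTransactions.registerTempTable("transactions")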
Re: Spark SQL - Exception only when using cacheTable
How was the table created? Would you mind sharing the related code? It seems that the underlying type of the customer_id field is actually long, but the schema says it's integer; basically, it's a type mismatch error. The first query succeeds because SchemaRDD.count() is translated to something equivalent to SELECT COUNT(1) FROM ... and doesn't actually touch the field. In the second case, however, Spark tries to materialize the whole underlying table into the in-memory columnar format because you asked to cache the table, and that is when the type mismatch is detected.

On 10/10/14 8:28 PM, poiuytrez wrote:
> [stack trace snipped; poiuytrez's full message appears later in this thread]
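To make the failure mode concrete: a hypothetical repro sketch for the pyspark shell, assuming Spark 1.1's inferSchema peeks only at the first row (the table name t and the values here are made up). If the first row's customer_id is a Python int but a later row's is a Python long, the inferred schema says IntegerType while the data carries Longs, and only building the columnar cache touches every value:

    from pyspark.sql import SQLContext, Row

    sqlContext = SQLContext(sc)  # sc is the ambient SparkContext in the shell

    # First row: Python int -> column inferred as IntegerType.
    # Second row: Python long -> arrives on the JVM as java.lang.Long.
    rdd = sc.parallelize([Row(customer_id=1), Row(customer_id=long(2 ** 40))])
    table = sqlContext.inferSchema(rdd)
    table.registerTempTable("t")

    sqlContext.sql("SELECT COUNT(1) FROM t").collect()  # works: no column is read

    sqlContext.cacheTable("t")
    # The first scan after cacheTable builds the in-memory columnar cache,
    # reading every value via getInt -- this is where
    # java.lang.ClassCastException: Long cannot be cast to Integer surfaces.
    sqlContext.sql("SELECT customer_id FROM t").count()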
Re: Spark SQL - Exception only when using cacheTable
Can you try checking whether the table is actually being cached? You can use the isCached method. More details are here: http://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/sql/SQLContext.html
Re: Spark SQL - Exception only when using cacheTable
Hi Poiuytrez, what version of Spark are you using? Exception details like the stack trace are really needed to investigate this issue. You can find them in the executor logs, or just browse the application's stderr/stdout links from the Spark Web UI.

On 10/9/14 9:37 PM, poiuytrez wrote:
Hello,

I have a weird issue; this request works fine:

    sqlContext.sql("SELECT customer_id FROM transactions WHERE purchaseamount = 200").count()

However, when I cache the table before making the request:

    sqlContext.cacheTable("transactions")
    sqlContext.sql("SELECT customer_id FROM transactions WHERE purchaseamount = 200").count()

I am getting an exception on one of the tasks:

    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 120 in stage 104.0 failed 4 times, most recent failure: Lost task 120.3 in stage 104.0 (TID 20537, spark-w-0.c.internal): java.lang.ClassCastException:

(I have no details after the ':'.) Any ideas of what could be wrong?
Re: Spark SQL - Exception only when using cacheTable
I am using the Python API. Unfortunately, I cannot find an equivalent of the isCached method in the SQLContext section of the documentation: https://spark.apache.org/docs/1.1.0/api/python/index.html
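One possible workaround (untested, and relying on the private _ssql_ctx handle that the Python SQLContext keeps on the underlying Scala SQLContext, so not a public API) would be to call the JVM-side method directly:

    sqlContext.cacheTable("transactions")
    # _ssql_ctx is an undocumented implementation detail and may change
    # between releases; it forwards to the Scala SQLContext, which does
    # expose isCached(tableName).
    print sqlContext._ssql_ctx.isCached("transactions")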
Re: Spark SQL - Exception only when using cacheTable
Hi Cheng,

I am using Spark 1.1.0. This is the stack trace:

14/10/10 12:17:40 WARN TaskSetManager: Lost task 120.0 in stage 7.0 (TID 2235, spark-w-0.c.db.internal): java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
    scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
    org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:146)
    org.apache.spark.sql.columnar.INT$.getField(ColumnType.scala:105)
    org.apache.spark.sql.columnar.INT$.getField(ColumnType.scala:92)
    org.apache.spark.sql.columnar.BasicColumnBuilder.appendFrom(ColumnBuilder.scala:72)
    org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$NullableColumnBuilder$$super$appendFrom(ColumnBuilder.scala:88)
    org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:57)
    org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:88)
    org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:76)
    org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:88)
    org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryColumnarTableScan.scala:65)
    org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryColumnarTableScan.scala:50)
    org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:236)
    org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
    org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)

This was also printed on the driver:

14/10/10 12:17:43 ERROR TaskSetManager: Task 120 in stage 7.0 failed 4 times; aborting job
14/10/10 12:17:43 INFO TaskSchedulerImpl: Cancelling stage 7
14/10/10 12:17:43 INFO TaskSchedulerImpl: Stage 7 was cancelled
14/10/10 12:17:43 INFO DAGScheduler: Failed to run collect at SparkPlan.scala:85
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hadoop/spark-install/python/pyspark/sql.py", line 1606, in count
    return self._jschema_rdd.count()
  File "/home/hadoop/spark-install/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/home/hadoop/spark-install/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o100.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 120 in stage 7.0 failed 4 times, most recent failure: Lost task 120.3 in stage 7.0 (TID 2248, spark-w-0.c.db.internal): java.lang.ClassCastException:
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at