[ https://issues.apache.org/jira/browse/SPARK-6082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Armbrust resolved SPARK-6082.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.3.0

Issue resolved by pull request 4842
[https://github.com/apache/spark/pull/4842]

> SparkSQL should fail gracefully when input data format doesn't match
> expectations
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-6082
>                 URL: https://issues.apache.org/jira/browse/SPARK-6082
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.1
>            Reporter: Kay Ousterhout
>             Fix For: 1.3.0
>
>
> I have a UDF that creates a tab-delimited table. If any of the column
> values contains a tab, SQL fails with an ArrayIndexOutOfBoundsException
> (pasted below). It would be great if SQL failed gracefully here, with a
> helpful exception (something like "One row contained too many values").
> It looks like this can be done quite easily, by checking here whether
> i > columnBuilders.size and, if so, throwing a nicer exception:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala#L124
> One thing that makes this problem especially annoying to debug is that
> "CREATE TABLE foo AS SELECT TRANSFORM(..." followed by "CACHE TABLE foo"
> works fine; it only fails if you do "CACHE TABLE foo AS SELECT
> TRANSFORM(...". Because of this, it would be great if the problem were
> more transparent to users.
> Stack trace:
> java.lang.ArrayIndexOutOfBoundsException: 3
>   at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:125)
>   at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:112)
>   at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
>   at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:220)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
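
For illustration, a minimal sketch of the kind of guard the description
suggests. This is not the actual change in pull request 4842: the loop shape,
the appendFrom call, and the exception message are assumptions based on the
linked InMemoryColumnarTableScan.scala, and `row` and `columnBuilders` come
from its surrounding method. Note that a zero-based index needs >= rather
than the description's >:

    // Sketch only: assumed shape of the row-write loop near the linked line.
    var i = 0
    while (i < row.length) {
      // Assumed guard: fail with a descriptive message instead of letting
      // columnBuilders(i) throw a bare ArrayIndexOutOfBoundsException.
      // (>= rather than >, since i is a zero-based field index.)
      if (i >= columnBuilders.length) {
        throw new IllegalStateException(
          s"Row has ${row.length} fields but the cached schema has only " +
          s"${columnBuilders.length} columns; one row contained too many values.")
      }
      columnBuilders(i).appendFrom(row, i)
      i += 1
    }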
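
And a hypothetical repro of the CACHE asymmetry described above, assuming a
HiveContext bound to sqlContext and an invented transform script
my_script.py whose tab-delimited output can itself contain tabs (table,
column, and script names are all invented for illustration):

    // Materializing the transform output first and then caching works fine:
    sqlContext.sql(
      "CREATE TABLE foo AS SELECT TRANSFORM (value) USING 'my_script.py' AS (a, b, c) FROM src")
    sqlContext.sql("CACHE TABLE foo")

    // Caching the transform result directly fails while building the
    // in-memory columns, with the ArrayIndexOutOfBoundsException pasted above:
    sqlContext.sql(
      "CACHE TABLE bar AS SELECT TRANSFORM (value) USING 'my_script.py' AS (a, b, c) FROM src")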