Reminder that Snappy is not a splittable format. I've had success with Hive + LZF (splittable) and bzip2 (also splittable).
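For example, here is one way to route Hive output through bzip2 from Spark (a minimal sketch, not a drop-in fix: it assumes a Spark 1.5-era HiveContext, and the table names events_snappy and events_bz2 are placeholders for illustration, not names from this thread):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)  // sc: the spark-shell's SparkContext

    // Ask Hive to compress job output, using the block-splittable bzip2 codec.
    hiveContext.setConf("hive.exec.compress.output", "true")
    hiveContext.setConf("mapreduce.output.fileoutputformat.compress.codec",
      "org.apache.hadoop.io.compress.BZip2Codec")

    // Rewrite the Snappy-compressed table into a bzip2-compressed copy.
    hiveContext.sql("CREATE TABLE events_bz2 AS SELECT * FROM events_snappy")

Each resulting file can then be split across many tasks instead of being pinned to a single reader.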
Gzip is also not splittable, so you won't be utilizing your cluster to process this data in parallel: only one task can read and process an unsplittable file, versus many tasks spread across the cluster.

On Wed, Dec 30, 2015 at 6:45 AM, Dawid Wysakowicz <wysakowicz.da...@gmail.com> wrote:

> Hasn't anyone used Spark with ORC and Snappy compression?
>
> 2015-12-29 18:25 GMT+01:00 Dawid Wysakowicz <wysakowicz.da...@gmail.com>:
>
>> Hi,
>>
>> I have a table in Hive stored as ORC with compression = Snappy. I try to
>> execute a query on that table and it fails. (I previously ran the same
>> query against the table in ORC+ZLIB and in Parquet, so the query itself
>> is not the problem.)
>>
>> I did manage to execute this query with Hive on Tez against the same tables.
>>
>> The exception I get is as follows:
>>
>> 15/12/29 17:16:46 WARN scheduler.DAGScheduler: Creating new stage failed due to exception - job: 3
>> java.lang.RuntimeException: serious problem
>>   at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
>>   at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>   at scala.Option.getOrElse(Option.scala:120)
>>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>   at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>   at scala.Option.getOrElse(Option.scala:120)
>>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>   at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>   at scala.Option.getOrElse(Option.scala:120)
>>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>   at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.getPartitions(MapPartitionsWithPreparationRDD.scala:40)
>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>   at scala.Option.getOrElse(Option.scala:120)
>>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>   at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>   at scala.Option.getOrElse(Option.scala:120)
>>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>   at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82)
>>   at org.apache.spark.sql.execution.ShuffledRowRDD.getDependencies(ShuffledRowRDD.scala:59)
>>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:226)
>>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:224)
>>   at scala.Option.getOrElse(Option.scala:120)
>>   at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
>>   at org.apache.spark.scheduler.DAGScheduler.visit$2(DAGScheduler.scala:388)
>>   at org.apache.spark.scheduler.DAGScheduler.getAncestorShuffleDependencies(DAGScheduler.scala:405)
>>   at org.apache.spark.scheduler.DAGScheduler.registerShuffleDependencies(DAGScheduler.scala:370)
>>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:253)
>>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:354)
>>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:351)
>>   at scala.collection.immutable.List.foreach(List.scala:318)
>>   at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:351)
>>   at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:363)
>>   at org.apache.spark.scheduler.DAGScheduler.getParentStagesAndId(DAGScheduler.scala:266)
>>   at org.apache.spark.scheduler.DAGScheduler.newResultStage(DAGScheduler.scala:300)
>>   at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:734)
>>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1466)
>>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> Caused by: java.util.concurrent.ExecutionException: java.lang.IndexOutOfBoundsException: Index: 0
>>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>   at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1016)
>>   ... 48 more
>> Caused by: java.lang.IndexOutOfBoundsException: Index: 0
>>   at java.util.Collections$EmptyList.get(Collections.java:4454)
>>   at org.apache.hadoop.hive.ql.io.orc.OrcProto$Type.getSubtypes(OrcProto.java:12240)
>>   at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.getColumnIndicesFromNames(ReaderImpl.java:651)
>>   at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.getRawDataSizeOfColumns(ReaderImpl.java:634)
>>   at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:927)
>>   at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:836)
>>   at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:702)
>>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>   at java.lang.Thread.run(Thread.java:745)
>>
>> I would be glad for any help on this matter.
>>
>> Regards,
>> Dawid Wysakowicz

--
Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
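One note on the trace above: the failure happens on the driver while OrcInputFormat is still computing splits (populateAndCacheStripeDetails reads each file's footer), and the IndexOutOfBoundsException at OrcProto$Type.getSubtypes suggests at least one ORC file whose footer carries an empty schema, as some writers produce for empty files. A workaround that may be worth trying (a hedged sketch, not a confirmed fix: it assumes the Hive 1.2.x libraries bundled with Spark honor hive.exec.orc.split.strategy, and the table name is a placeholder) is to force the "BI" split strategy so split generation works from file sizes alone instead of reading footers:

    // hiveContext as in the earlier sketch: new HiveContext(sc)

    // "BI" skips footer reads during ORC split generation, avoiding the
    // code path that throws above.
    hiveContext.setConf("hive.exec.orc.split.strategy", "BI")

    // Re-run the failing query against the ORC+Snappy table.
    hiveContext.sql("SELECT COUNT(*) FROM events_snappy").show()

If that works, locating and deleting the zero-row ORC files (or rewriting the table) would address the root cause rather than the symptom.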