Reminder that Snappy is not a splittable format.

I've had success with Hive + LZF (splittable) and bzip2 (also splittable).
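
For bzip2, a rough sketch of what that looks like from a Spark HiveContext
(these are standard Hive/Hadoop properties; the table names are placeholders,
not something from this thread):

// Sketch only: assumes a Spark 1.5-era HiveContext named sqlContext.
// BZip2Codec ships with Hadoop and produces splittable output files.
sqlContext.sql("SET hive.exec.compress.output=true")
sqlContext.sql("SET mapreduce.output.fileoutputformat.compress.codec=" +
  "org.apache.hadoop.io.compress.BZip2Codec")
sqlContext.sql("CREATE TABLE events_bz2 AS SELECT * FROM events")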

Gzip is also not splittable, so you won't be using your cluster to process
this data in parallel: only one task can read and process an unsplittable
file, versus many tasks spread across the cluster.
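
A quick way to see the difference (paths are placeholders): Spark's textFile
uses Hadoop's TextInputFormat, which splits bzip2 files but has to hand a
whole gzip file to a single task.

// bzip2 implements Hadoop's SplittableCompressionCodec; gzip does not.
val bz = sc.textFile("hdfs:///data/events.txt.bz2")
val gz = sc.textFile("hdfs:///data/events.txt.gz")
println(bz.partitions.length)  // > 1 once the file spans several HDFS blocks
println(gz.partitions.length)  // always 1, however large the file is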

On Wed, Dec 30, 2015 at 6:45 AM, Dawid Wysakowicz <wysakowicz.da...@gmail.com> wrote:

> Hasn't anyone used Spark with ORC and Snappy compression?
>
> 2015-12-29 18:25 GMT+01:00 Dawid Wysakowicz <wysakowicz.da...@gmail.com>:
>
>> Hi,
>>
>> I have a table in Hive stored as ORC with compression = snappy. When I
>> try to execute a query on that table, it fails (I previously ran the same
>> query against the table in ORC-zlib format and in Parquet, so the query
>> itself is not the problem).
>>
>> I managed to execute this query with Hive on Tez on those tables.
>>
>> The exception I get is as follows:
>>
>> 15/12/29 17:16:46 WARN scheduler.DAGScheduler: Creating new stage failed due to exception - job: 3
>> java.lang.RuntimeException: serious problem
>>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
>>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>>     at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>     at scala.Option.getOrElse(Option.scala:120)
>>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>     at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>     at scala.Option.getOrElse(Option.scala:120)
>>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>     at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>     at scala.Option.getOrElse(Option.scala:120)
>>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>     at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.getPartitions(MapPartitionsWithPreparationRDD.scala:40)
>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>     at scala.Option.getOrElse(Option.scala:120)
>>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>     at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>     at scala.Option.getOrElse(Option.scala:120)
>>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>     at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82)
>>     at org.apache.spark.sql.execution.ShuffledRowRDD.getDependencies(ShuffledRowRDD.scala:59)
>>     at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:226)
>>     at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:224)
>>     at scala.Option.getOrElse(Option.scala:120)
>>     at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
>>     at org.apache.spark.scheduler.DAGScheduler.visit$2(DAGScheduler.scala:388)
>>     at org.apache.spark.scheduler.DAGScheduler.getAncestorShuffleDependencies(DAGScheduler.scala:405)
>>     at org.apache.spark.scheduler.DAGScheduler.registerShuffleDependencies(DAGScheduler.scala:370)
>>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:253)
>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:354)
>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:351)
>>     at scala.collection.immutable.List.foreach(List.scala:318)
>>     at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:351)
>>     at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:363)
>>     at org.apache.spark.scheduler.DAGScheduler.getParentStagesAndId(DAGScheduler.scala:266)
>>     at org.apache.spark.scheduler.DAGScheduler.newResultStage(DAGScheduler.scala:300)
>>     at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:734)
>>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1466)
>>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> Caused by: java.util.concurrent.ExecutionException: java.lang.IndexOutOfBoundsException: Index: 0
>>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1016)
>>     ... 48 more
>> Caused by: java.lang.IndexOutOfBoundsException: Index: 0
>>     at java.util.Collections$EmptyList.get(Collections.java:4454)
>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$Type.getSubtypes(OrcProto.java:12240)
>>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.getColumnIndicesFromNames(ReaderImpl.java:651)
>>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.getRawDataSizeOfColumns(ReaderImpl.java:634)
>>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:927)
>>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:836)
>>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:702)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> I would be grateful for any help on this matter.
>>
>> Regards
>> Dawid Wysakowicz
>>
>
>
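
If you need to stay on ORC + Snappy, two things worth trying - untested
against your exact setup, so treat this as a sketch. The trace dies while
OrcInputFormat reads stripe/footer metadata during split generation, and the
BI split strategy (Hive 1.2+) skips those footer reads. Alternatively, read
the files directly with Spark's ORC data source instead of going through the
Hive table (I haven't verified that the second path avoids this particular
code path):

// Sketch only - not verified against this failure. The warehouse path is a
// placeholder for wherever your table's ORC files actually live.
sc.hadoopConfiguration.set("hive.exec.orc.split.strategy", "BI")

val df = sqlContext.read.orc("/user/hive/warehouse/my_orc_snappy_table")
df.registerTempTable("my_orc_snappy_table_direct")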


-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
