(Forgot to cc the user mailing list)
On 11/16/14 4:59 PM, Cheng Lian wrote:
Hey Sadhan,
Thanks for the additional information, this is helpful. It seems that
some Parquet internal contract was broken, but I'm not sure whether it
was caused by Spark SQL or Parquet, or whether the Parquet file itself
was somehow damaged. I'm investigating this. In the meantime, would you
mind helping to narrow down the problem by trying to scan exactly the
same Parquet file with another system (e.g. Hive or Impala)? If other
systems can read it, then there must be something wrong with Spark SQL.
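If Hive or Impala isn't handy, another rough check (just a sketch, not tested; the path below is a placeholder for the actual .lzo.parquet file, and it assumes parquet-mr's example GroupReadSupport API and the LZO codec being on the classpath) is to read the file directly with parquet-mr, bypassing Spark SQL entirely:

import org.apache.hadoop.fs.Path
import parquet.hadoop.ParquetReader
import parquet.hadoop.example.GroupReadSupport

// Placeholder path -- substitute the actual file from the error message.
val file = new Path("hdfs://namenode:9000/event_logs/xyz/20141109/part-....lzo.parquet")
val reader = new ParquetReader(file, new GroupReadSupport())
var count = 0L
var record = reader.read()            // returns null once the file is exhausted
while (record != null) {
  count += 1
  record = reader.read()
}
reader.close()
println(s"Read $count records outside Spark SQL")

If this direct read also fails, the file (or the Parquet library) is the more likely culprit; if it succeeds, the problem is probably on the Spark SQL side.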
Cheng
On Sun, Nov 16, 2014 at 1:19 PM, Sadhan Sood <sadhan.s...@gmail.com> wrote:
Hi Cheng,
Thanks for your response. Here is the stack trace from the YARN logs:
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.ArrayList.elementData(ArrayList.java:418)
at java.util.ArrayList.get(ArrayList.java:431)
at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
at parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:80)
at parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:74)
at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:282)
at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
at parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
... 26 more
On Sat, Nov 15, 2014 at 9:28 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
Hi Sadhan,
Could you please provide the stack trace of the
|ArrayIndexOutOfBoundsException| (if any)? The reason the first
query succeeds is that Spark SQL doesn't bother reading all the
data from the table to answer |COUNT(*)|. In the second case,
however, the whole table is marked to be cached lazily via the
|cacheTable| call, so the next query scans it to build the
in-memory columnar cache. Something then went wrong while scanning
this LZO-compressed Parquet file, but unfortunately the stack
trace at hand doesn't indicate the root cause.
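As a quick sanity check (only a sketch, using the temp table name from your snippet below), you could also force a full scan without touching the cache; if this fails too, the problem is in the Parquet read path itself rather than in the in-memory columnar cache:

// Force a full scan of the registered temp table without caching it.
// count() on the resulting SchemaRDD materializes every row, so all columns are read.
val allRows = sqlContext.sql("SELECT * FROM xyz_20141109")
allRows.count()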
Cheng
On 11/15/14 5:28 AM, Sadhan Sood wrote:
While testing Spark SQL on a bunch of Parquet files (which together
used to be a partition of one of our Hive tables), I encountered
this error:
import org.apache.spark.sql.SchemaRDD
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parquetFileRDD = sqlContext.parquetFile(parquetFile)
parquetFileRDD.registerTempTable("xyz_20141109")
sqlContext.sql("SELECT count(*) FROM xyz_20141109").collect() // works fine
sqlContext.cacheTable("xyz_20141109")
sqlContext.sql("SELECT count(*) FROM xyz_20141109").collect() // fails with an exception
parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://xxxxxxxx::9000/event_logs/xyz/20141109/part-00009359b87ae-a949-3ded-ac3e-3a6bda3a4f3a-r-00009.lzo.parquet
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.hasNext(InMemoryColumnarTableScan.scala:136)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:248)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException