Hi,
When I try to read Parquet data generated by Spark in Cascading, it throws the following error:

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file ""
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
    at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat$RecordReaderWrapper.<init>(DeprecatedParquetInputFormat.java:103)
    at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat.getRecordReader(DeprecatedParquetInputFormat.java:47)
    at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:253)
    at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:248)
    at cascading.util.Util.retry(Util.java:1044)
    at cascading.tap.hadoop.io.MultiInputFormat.getRecordReader(MultiInputFormat.java:247)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:394)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
    at java.util.ArrayList.elementData(ArrayList.java:418)
    at java.util.ArrayList.get(ArrayList.java:431)
    at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:98)
    at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:98)
    at org.apache.parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:83)
    at org.apache.parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:77)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:293)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)

This is mostly seen when the Parquet data has nested structures. I didn't find any solution to this.

I see some JIRA issues like https://issues.apache.org/jira/browse/SPARK-10434 (Parquet compatibility/interoperability issues), where reading Parquet files in Spark 1.4 failed when the files were generated by Spark 1.5. This was fixed in later versions, but was it fixed in Cascading?

I am not sure whether this has something to do with the Parquet version, whether Cascading has a bug, or whether Spark is doing something with the Parquet files that Cascading does not accept.

Note: I am trying to read Parquet with an Avro schema in Cascading.

I have posted this to the Cascading mailing list too.

--
Thanks
Vikas Gandham