I tried setting spark.sql.parquet.writeLegacyFormat to true, but the issue still persists.
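For anyone following along, here is how that setting is typically applied. This is just a minimal configuration sketch, not code from the original job; the app name and output path are placeholders I made up:

```python
# Sketch: enable spark.sql.parquet.writeLegacyFormat so Spark writes Parquet
# in the older (Spark 1.4 / Hive-compatible) layout. App name and output
# path below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("legacy-parquet-writer")  # placeholder name
    .config("spark.sql.parquet.writeLegacyFormat", "true")
    .getOrCreate()
)

df = spark.range(10)  # any DataFrame; nested schemas are where the layout matters
df.write.mode("overwrite").parquet("/tmp/legacy_parquet_out")  # placeholder path
```

The same flag can also be passed at submit time with `spark-submit --conf spark.sql.parquet.writeLegacyFormat=true`; either way it must be set before the write, since it only affects files Spark produces, not how it reads them.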
Thanks
Vikas Gandham

On Thu, Nov 16, 2017 at 10:25 AM, Yong Zhang <java8...@hotmail.com> wrote:

> I don't have experience with Cascading, but we saw a similar issue when
> importing data generated by Spark into Hive.
>
> Did you try setting "spark.sql.parquet.writeLegacyFormat" to true?
>
> https://stackoverflow.com/questions/44279870/why-cant-impala-read-parquet-files-after-spark-sqls-write
>
> ------------------------------
> *From:* Vikas Gandham <g.73vi...@gmail.com>
> *Sent:* Wednesday, November 15, 2017 2:30 PM
> *To:* user@spark.apache.org
> *Subject:* Parquet files from spark not readable in Cascading
>
> Hi,
>
> When I tried reading parquet data that was generated by Spark in
> Cascading, it throws the following error:
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file ""
>     at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
>     at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>     at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat$RecordReaderWrapper.<init>(DeprecatedParquetInputFormat.java:103)
>     at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat.getRecordReader(DeprecatedParquetInputFormat.java:47)
>     at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:253)
>     at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:248)
>     at cascading.util.Util.retry(Util.java:1044)
>     at cascading.tap.hadoop.io.MultiInputFormat.getRecordReader(MultiInputFormat.java:247)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:394)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
>     at java.util.ArrayList.elementData(ArrayList.java:418)
>     at java.util.ArrayList.get(ArrayList.java:431)
>     at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:98)
>     at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:98)
>     at org.apache.parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:83)
>     at org.apache.parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:77)
>     at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:293)
>     at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
>     at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
>     at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
>     at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
>     at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
>     at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>
> This is mostly seen when the parquet file has nested structures.
>
> I didn't find any solution to this.
>
> I see JIRA issues like https://issues.apache.org/jira/browse/SPARK-10434
> (parquet compatibility/interoperability issues), where Spark 1.4 could not
> read parquet files generated by Spark 1.5. This was fixed in later Spark
> versions, but was it fixed in Cascading?
>
> I am not sure whether this is a Parquet version issue, a bug in Cascading,
> or Spark doing something with the Parquet files that Cascading does not
> accept.
>
> Note: I am trying to read Parquet with an Avro schema in Cascading.
>
> I have posted to the Cascading mailing list too.
>
> --
> Thanks
> Vikas Gandham