Is there a tool with which I can read a specific row group, column, or page?
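The parquet-tools jar that ships with parquet-mr gets me partway there (its "meta" and "dump" commands print row group and page metadata, and "dump" can be limited to a single column, if I remember the options right), but I would like to decode one specific row group and column programmatically. Something like the following untested sketch against the parquet-mr 1.6.0 API is what I have in mind; the ParquetFileReader constructor and footer calls are from memory, so please verify them against your build:

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import parquet.column.ColumnDescriptor;
import parquet.column.page.PageReadStore;
import parquet.hadoop.ParquetFileReader;
import parquet.hadoop.metadata.BlockMetaData;
import parquet.hadoop.metadata.ParquetMetadata;
import parquet.schema.MessageType;

public class ReadOneRowGroup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);                 // the suspect file
    int rowGroupIndex = Integer.parseInt(args[1]); // e.g. 0 for "block 0"

    // The footer alone gives per-row-group, per-column offsets and value
    // counts, without decoding any data pages.
    ParquetMetadata footer = ParquetFileReader.readFooter(conf, path);
    MessageType schema = footer.getFileMetaData().getSchema();
    BlockMetaData block = footer.getBlocks().get(rowGroupIndex);
    System.out.println("row group " + rowGroupIndex + " has "
        + block.getRowCount() + " rows");

    // Restrict reading to one row group and one column, e.g. the
    // [other_properties, map, value] column from the trace below.
    ColumnDescriptor column = schema.getColumnDescription(
        new String[] {"other_properties", "map", "value"});
    ParquetFileReader reader = new ParquetFileReader(
        conf, path,
        Collections.singletonList(block),
        Collections.singletonList(column));
    try {
      // Pulls in only the pages of that column in that row group;
      // rowGroup.getPageReader(column) then yields them one by one, which
      // is where a decoding error should reproduce in isolation.
      PageReadStore rowGroup = reader.readNextRowGroup();
      System.out.println("rows in this row group: " + rowGroup.getRowCount());
    } finally {
      reader.close();
    }
  }
}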
Thanks,

Tongjie

On Sat, Nov 22, 2014 at 5:56 PM, Tongjie Chen <[email protected]> wrote:

> Actually, the stack trace looks different.
>
> In my case there seems to be a bad entry in the parquet file (although I
> could write the file without any error): in some row group, in some page,
> entry 19072 out of 36318 in that page cannot be read.
>
> On Sat, Nov 22, 2014 at 5:48 PM, Cheng Lian <[email protected]> wrote:
>
>> The problem mentioned in [this thread][1] looks similar to yours.
>>
>> [1]: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-exception-on-cached-parquet-table-tt18978.html#a19020
>>
>> On 11/23/14 4:22 AM, Tongjie Chen wrote:
>>
>>> Hi,
>>>
>>> Does anyone find the following message familiar? It looks like a data
>>> corruption issue, but when we wrote this parquet file we saw no errors.
>>> We are using Parquet version 1.6.0rc3.
>>>
>>> Thanks,
>>>
>>> Tongjie
>>>
>>> 2014-11-22 18:55:28,970 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 511538 in block 0 in file s3n://..../dateint=20141122/hour=16/batchid=merged_20141122T171928_1/542f393b-57f8-441b-8591-2c0169f15d14_000072
>>>     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>>>     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>>>     at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:302)
>>>     at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:218)
>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
>>>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:415)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>>>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
>>> Caused by: java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 511538 in block 0 in file s3n://..../dateint=20141122/hour=16/batchid=merged_20141122T171928_1/542f393b-57f8-441b-8591-2c0169f15d14_000072
>>>     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>>>     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>>>     at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
>>>     at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
>>>     at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
>>>     at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
>>>     at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:300)
>>>     ... 11 more
>>> Caused by: parquet.io.ParquetDecodingException: Can not read value at 511538 in block 0 in file s3n://..../dateint=20141122/hour=16/batchid=merged_20141122T171928_1/542f393b-57f8-441b-8591-2c0169f15d14_000072
>>>     at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
>>>     at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>>>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:157)
>>>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:45)
>>>     at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
>>>     ... 15 more
>>> Caused by: parquet.io.ParquetDecodingException: Can't read value in column [other_properties, map, value] BINARY at value 20433392 out of 27896945, 19072 out of 36318 in currentPage. repetition level: 1, definition level: 3
>>>     at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:450)
>>>     at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:352)
>>>     at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402)
>>>     at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
>>>     ... 19 more
>>> Caused by: parquet.io.ParquetDecodingException: could not read bytes at offset 1599090621
>>>     at parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:43)
>>>     at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:295)
>>>     at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:446)
>>>     ... 22 more
>>> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1599090621
>>>     at parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:54)
>>>     at parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:36)
>>>     ... 24 more
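For what it's worth, the innermost frames sit in the PLAIN BINARY decoder. If I read BinaryPlainValuesReader and BytesUtils correctly, each value in a plain-encoded BINARY page is a 4-byte little-endian length followed by that many bytes, so a single corrupt length is enough to send the reader to a nonsense offset like 1599090621. A simplified sketch of that layout (paraphrased from the class names in the trace, not the actual parquet code):

public class PlainBinaryLayout {

  // Same wire format BytesUtils.readIntLittleEndian decodes: four bytes,
  // least significant first.
  static int readIntLittleEndian(byte[] in, int offset) {
    return (in[offset] & 0xff)
        | (in[offset + 1] & 0xff) << 8
        | (in[offset + 2] & 0xff) << 16
        | (in[offset + 3] & 0xff) << 24;
  }

  // Walk a PLAIN-encoded BINARY page: [len][bytes][len][bytes]...
  static void walkPlainBinaryPage(byte[] page, int valueCount) {
    int offset = 0;
    for (int i = 0; i < valueCount; i++) {
      // A corrupt length here comes back as garbage; the next access lands
      // far past the end of the array and the decoder dies with the
      // ArrayIndexOutOfBoundsException seen above.
      int length = readIntLittleEndian(page, offset);
      offset += 4 + length;
    }
  }
}

That would also be consistent with the write finishing cleanly: the writer never re-reads the lengths it emits, so a corrupt byte would only surface when the page is decoded.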
