Re: parquet data corruption
Hi Cheng,

Please find the answers below.

1. Tool used to write the Parquet files: it's custom code in Spark that uses ParquetOutputFormat from hive-exec 0.13; the code snippet is under point 4.

2. The tool used to read those Parquet files: Hive 0.13.

3. What is the "complex" query?
   select count(1) from table group by colname having count(1) > 1;

4. If possible, the code snippet you used to write the files:

    // Set up the output format with Hive's write support and the target schema.
    ParquetOutputFormat<ArrayWritable> output =
        new ParquetOutputFormat<ArrayWritable>(new DataWritableWriteSupport());
    Configuration conf = new Configuration();
    MessageType messageType = MessageTypeParser.parseMessageType(writeSchema);
    DataWritableWriteSupport.setSchema(messageType, conf);

    // One record writer per partition, keyed by the partition path fragment.
    Map<String, RecordWriter<Void, ArrayWritable>> recordWriterMap =
        new HashMap<String, RecordWriter<Void, ArrayWritable>>();
    RecordWriter<Void, ArrayWritable> writer;
    HDFSDelegate hdfsDelegate = new HDFSDelegate(hdfsUser);

    while (t.hasNext()) {
        // PartitionKey stands in for our key type (not shown in this mail);
        // it exposes getPartitionPart(), the partition path fragment.
        Tuple2<PartitionKey, ArrayWritable> record = t.next();
        writer = recordWriterMap.get(record._1.getPartitionPart());
        if (writer == null) {
            Path filePath = new Path(hdfsDir + record._1.getPartitionPart() + "/" + filePrefix + ".parquet");
            hdfsDelegate.delete(filePath);  // replace any stale file from a previous run
            writer = output.getRecordWriter(conf, filePath, CompressionCodecName.SNAPPY);
            recordWriterMap.put(record._1.getPartitionPart(), writer);
        }
        writer.write(null, record._2);
    }

    // Close every writer so Parquet can write each file's footer.
    for (Entry<String, RecordWriter<Void, ArrayWritable>> recWriter : recordWriterMap.entrySet()) {
        recWriter.getValue().close(null);
    }

5. Did you move files written somewhere else to the target directory? Yes, the files are first written to another directory and then moved into the Hive table.

Thanks

On Fri, Apr 22, 2016 at 10:04 AM, Cheng Lian wrote:
> (cc dev@parquet.apache.org)
>
> Hey Shushant,
>
> This kind of error can be tricky to debug. Could you please provide the
> following information:
>
> - The tool used to write those Parquet files (possibly Hive 0.13, since
>   you mentioned hive-exec 0.13?)
> - The tool used to read those Parquet files (should be Hive according to
>   the stack trace, but which version?)
> - What is the "complex" query?
> - The schema of those Parquet files (can be checked using parquet-tools),
>   as well as the corresponding schema of the user application (the table
>   schema for Hive)
> - If possible, the code snippet you used to write the files
> - Are files of different schemata mixed up? Some tools, like Hive, don't
>   handle schema evolution well.
>
> I see that the file name in the stack trace contains a timestamp, which
> isn't the naming convention used by Hive. Did you move files written
> somewhere else into the target directory?
>
> Cheng
>
> On 4/22/16 10:56 AM, Shushant Arora wrote:
> > Hi
> >
> > I am writing to a parquet table using parquet.hadoop.ParquetOutputFormat
> > (from hive-exec 0.13). Data is being written correctly, and when I do
> > count(1) or select * with a limit I get the proper result.
> > But when I do some complex query on the table it throws the exception
> > below:
> >
> > Diagnostic Messages for this Task:
> > Error: java.io.IOException: java.io.IOException: parquet.io.ParquetDecodingException:
> > Can not read value at 18 in block 0 in file
> > hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet
> > [...]
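Regarding answer 5 above, here is a minimal sketch of that write-then-move flow (stagingDir, tableDir, and the class name are placeholders, and plain org.apache.hadoop.fs.FileSystem stands in for the HDFSDelegate wrapper). The key ordering constraint: every record writer must be closed before its file is moved, because Parquet only writes the file footer on close(), and a file moved or read before that point is truncated and undecodable.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StagedMove {
        // Move every finished Parquet file from a staging directory under the
        // Hive table directory. Call this only after ALL writers are closed.
        public static void moveIntoTable(Configuration conf, Path stagingDir, Path tableDir)
                throws IOException {
            FileSystem fs = stagingDir.getFileSystem(conf);
            for (FileStatus f : fs.listStatus(stagingDir)) {
                Path target = new Path(tableDir, f.getPath().getName());
                fs.delete(target, false); // replace any stale copy of the same file
                if (!fs.rename(f.getPath(), target)) {
                    throw new IOException("rename failed for " + f.getPath());
                }
            }
        }
    }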
parquet data corruption
Hi

I am writing to a parquet table using parquet.hadoop.ParquetOutputFormat (from hive-exec 0.13). Data is being written correctly, and when I do count(1) or select * with a limit I get the proper result.

But when I do some complex query on the table it throws the exception below:

Diagnostic Messages for this Task:
Error: java.io.IOException: java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 18 in block 0 in file hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:255)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:170)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 18 in block 0 in file hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:344)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:122)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:253)
        ... 11 more
Caused by: parquet.io.ParquetDecodingException: Can not read value at 18 in block 0 in file hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet
        at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:216)
        at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:144)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:159)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:48)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:339)
        ... 15 more
Caused by: parquet.io.ParquetDecodingException: Can't read value in column [sessionid] BINARY at value 18 out of 18, 18 out of 18 in currentPage. repetition level: 0, definition level: 1
        at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:450)
        at parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:352)
        at parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402)
        at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:197)
        ... 19 more
Caused by: parquet.io.ParquetDecodingException: could not read bytes at offset 726
        at parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:43)
        at parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:295)
        at parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:446)
        ... 22 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 726
        at parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:54)
        at parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:36)
        ... 24 more

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

What's the reason for this error?
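One way to isolate where the corruption lies, independent of Hive, is to read the suspect file back record by record with the same parquet.* reader classes bundled in hive-exec 0.13. Below is a minimal sketch (the class name and argument handling are hypothetical); if the file itself is bad, it should hit the same ParquetDecodingException at value 18.

    import org.apache.hadoop.fs.Path;
    import parquet.example.data.Group;
    import parquet.hadoop.ParquetReader;
    import parquet.hadoop.example.GroupReadSupport;

    public class ParquetFileCheck {
        public static void main(String[] args) throws Exception {
            // args[0]: path of the suspect file, e.g. .../20160421032223.parquet
            ParquetReader<Group> reader =
                new ParquetReader<Group>(new Path(args[0]), new GroupReadSupport());
            long n = 0;
            try {
                // Materializing each record as a Group forces every column
                // value, including [sessionid], to be decoded.
                while (reader.read() != null) {
                    n++;
                }
                System.out.println("read " + n + " records without error");
            } finally {
                reader.close();
            }
        }
    }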
Re: parquet data corruption
(cc dev@parquet.apache.org)

Hey Shushant,

This kind of error can be tricky to debug. Could you please provide the following information:

- The tool used to write those Parquet files (possibly Hive 0.13, since you mentioned hive-exec 0.13?)
- The tool used to read those Parquet files (should be Hive according to the stack trace, but which version?)
- What is the "complex" query?
- The schema of those Parquet files (can be checked using parquet-tools), as well as the corresponding schema of the user application (the table schema for Hive)
- If possible, the code snippet you used to write the files
- Are files of different schemata mixed up? Some tools, like Hive, don't handle schema evolution well.

I see that the file name in the stack trace contains a timestamp, which isn't the naming convention used by Hive. Did you move files written somewhere else into the target directory?

Cheng

On 4/22/16 10:56 AM, Shushant Arora wrote:
> Hi
>
> I am writing to a parquet table using parquet.hadoop.ParquetOutputFormat
> (from hive-exec 0.13). Data is being written correctly, and when I do
> count(1) or select * with a limit I get the proper result.
>
> But when I do some complex query on the table it throws the exception
> below:
>
> Diagnostic Messages for this Task:
> Error: java.io.IOException: java.io.IOException: parquet.io.ParquetDecodingException:
> Can not read value at 18 in block 0 in file
> hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet
> [...]
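On the parquet-tools point above: if the parquet-tools jar isn't handy, a rough Java equivalent of `parquet-tools schema` (hypothetical class name, same parquet.* classes as in the earlier sketch) prints the file's footer schema for comparison against the Hive table schema.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import parquet.hadoop.ParquetFileReader;
    import parquet.hadoop.metadata.ParquetMetadata;

    public class PrintParquetSchema {
        public static void main(String[] args) throws Exception {
            // args[0]: path of a Parquet file
            ParquetMetadata footer =
                ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));
            // The footer holds the MessageType schema the file was written with.
            System.out.println(footer.getFileMetaData().getSchema());
        }
    }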