Re: parquet data corruption

2016-04-22 Thread Shushant Arora
Hi Cheng

Please find answers below.

1. Tool used to write Parquet files: custom code in Spark that uses
ParquetOutputFormat from hive-exec 0.13; the code snippet is given below.

2. The tool used to read those Parquet files: Hive 0.13

3. What is the "complex" query? select count(1) from table group by colname
having count(1) > 1;

4. If possible, code snippet you used to write the files:

// One Parquet RecordWriter per partition directory, using Hive 0.13's write support.
// Note: "PartitionKey" (and the String map key) stands in for the actual partition key class.
ParquetOutputFormat<ArrayWritable> output =
        new ParquetOutputFormat<ArrayWritable>(new DataWritableWriteSupport());
Configuration conf = new Configuration();
MessageType messageType = MessageTypeParser.parseMessageType(writeSchema);
DataWritableWriteSupport.setSchema(messageType, conf);
Map<String, RecordWriter<Void, ArrayWritable>> recordWriterMap =
        new HashMap<String, RecordWriter<Void, ArrayWritable>>();
RecordWriter<Void, ArrayWritable> writer;
HDFSDelegate hdfsDelegate = new HDFSDelegate(hdfsUser);
while (t.hasNext()) {
    Tuple2<PartitionKey, ArrayWritable> record = t.next();
    writer = recordWriterMap.get(record._1.getPartitionPart());
    if (writer == null) {
        // One file per partition; delete any stale file before writing a new one.
        Path filePath = new Path(
                hdfsDir + record._1.getPartitionPart() + "/" + filePrefix + ".parquet");
        hdfsDelegate.delete(filePath);
        writer = output.getRecordWriter(conf, filePath, CompressionCodecName.SNAPPY);
        recordWriterMap.put(record._1.getPartitionPart(), writer);
    }
    writer.write(null, record._2);
}
// Close every writer so each file gets its Parquet footer written.
for (Entry<String, RecordWriter<Void, ArrayWritable>> recWriter : recordWriterMap.entrySet()) {
    recWriter.getValue().close(null);
}
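
One way to narrow down where the corruption happens would be to read every file
back right after its writer is closed, before the files are moved (point 5 below).
A minimal sketch of such a check, assuming the ParquetReader and GroupReadSupport
classes from the parquet-hadoop jar bundled with hive-exec 0.13 are on the
classpath (the class and method names here are only placeholders):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import parquet.example.data.Group;
import parquet.hadoop.ParquetReader;
import parquet.hadoop.example.GroupReadSupport;

public final class ParquetReadBackCheck {
    // Re-reads every record of a freshly written file; decoding each value exercises
    // the same code path that later fails in Hive, so a corrupt page should surface
    // here as a ParquetDecodingException instead of at query time.
    public static long countRecords(Path filePath) throws IOException {
        ParquetReader<Group> reader =
                new ParquetReader<Group>(filePath, new GroupReadSupport());
        try {
            long records = 0;
            while (reader.read() != null) {
                records++;
            }
            return records;
        } finally {
            reader.close();
        }
    }
}

If the files read back cleanly here (and the record counts match what was written)
but still fail after the move, the writer is probably not the culprit; checking the
files with parquet-tools, as Cheng suggests (for example its schema, meta and cat
commands), is the equivalent check from the command line.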



5. Did you move files written somewhere else to the target directory? Yes, the
files are first written to another directory and then moved into the Hive table location.
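
If the move stays within one HDFS namespace, it can be a plain FileSystem.rename,
which only changes metadata and should not alter file contents. A rough sketch of
that step follows; the paths and names are placeholders, not the actual
HDFSDelegate code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class MoveIntoWarehouse {
    // Moves one staged Parquet file into the Hive partition directory.
    // Only call this after the corresponding RecordWriter has been closed,
    // otherwise the file has no footer yet and Hive cannot read it.
    public static void move(Configuration conf, Path staged, Path target) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        fs.mkdirs(target.getParent());      // make sure the partition directory exists
        if (!fs.rename(staged, target)) {   // metadata-only move within one namespace
            throw new IOException("rename failed: " + staged + " -> " + target);
        }
    }
}

If the move ever copies data between clusters instead of renaming, comparing file
lengths before and after the copy is a cheap way to rule the move out as the source
of the corruption.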


Thanks



On Fri, Apr 22, 2016 at 10:04 AM, Cheng Lian wrote:

> (cc dev@parquet.apache.org)
>
> Hey Shushant,
>
> This kind of error can be tricky to debug. Could you please provide the
> following information:
>
> - The tool used to write those Parquet files (possibly Hive 0.13 since you
> mentioned hive-exec 0.13?)
> - The tool used to read those Parquet files (should be Hive according to
> the stack trace, but what version?)
> - What is the "complex" query?
> - Schema of those Parquet files (can be checked using parquet-tools), as
> well as corresponding schema of the user application (table schema for Hive)
> - If possible, code snippet you used to write the files
> - Are there files of different schemata mixed up? Some tools, like Hive,
> don't handle schema evolution well.
>
> I saw the file name in the stack trace consists of a timestamp. This isn't
> the naming convention used by Hive. Did you move files written somewhere
> else to the target directory?
>
> Cheng
>
>
> On 4/22/16 10:56 AM, Shushant Arora wrote:
>
> Hi
>
> I am writing to a parquet table
> using parquet.hadoop.ParquetOutputFormat (from hive-exec 0.13).
> Data is written correctly, and when I do count(1) or select * with
> limit I get the proper result.
>
> But when I do some complex query on the table it throws the exception below:
>
> Diagnostic Messages for this Task:
> Error: java.io.IOException: java.io.IOException:
> parquet.io.ParquetDecodingException: Can not read value at 18 in block 0 in
> file
> hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet
> at
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
> at
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
> at
> org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:255)
> at
> org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:170)
> at
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
> at
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> Caused by: java.io.IOException: parquet.io.ParquetDecodingException: Can
> not read value at 18 in block 0 in file
>
> hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet
> at
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)

parquet data corruption

2016-04-21 Thread Shushant Arora
Hi

I am writing to a parquet table
using parquet.hadoop.ParquetOutputFormat (from hive-exec 0.13).
Data is written correctly, and when I do count(1) or select * with
limit I get the proper result.

But when I do some complex query on the table it throws the exception below:

Diagnostic Messages for this Task:
Error: java.io.IOException: java.io.IOException:
parquet.io.ParquetDecodingException: Can not read value at 18 in block 0 in
file
hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet
at
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
at
org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
at
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:255)
at
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:170)
at
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
at
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.IOException: parquet.io.ParquetDecodingException: Can
not read value at 18 in block 0 in file
hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet
at
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
at
org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
at
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:344)
at
org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
at
org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
at
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:122)
at
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:253)
... 11 more
Caused by: parquet.io.ParquetDecodingException: Can not read value at 18 in
block 0 in file
hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet
at
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:216)
at
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:144)
at
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:159)
at
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:48)
at
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:339)
... 15 more
Caused by: parquet.io.ParquetDecodingException: Can't read value in column
[sessionid] BINARY at value 18 out of 18, 18 out of 18 in currentPage.
repetition level: 0, definition level: 1
at
parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:450)
at
parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:352)
at
parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402)
at
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:197)
... 19 more
Caused by: parquet.io.ParquetDecodingException: could not read bytes at
offset 726
at
parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:43)
at
parquet.column.impl.ColumnReaderImpl$2$6.read(ColumnReaderImpl.java:295)
at
parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:446)
... 22 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 726
at parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:54)
at
parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:36)
... 24 more


FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.mr.MapRedTask





What's the reason for this error? Why

Re: parquet data corruption

2016-04-21 Thread Cheng Lian

(cc dev@parquet.apache.org)

Hey Shushant,

This kind of error can be tricky to debug. Could you please provide the 
following information:


- The tool used to write those Parquet files (possibly Hive 0.13 since
you mentioned hive-exec 0.13?)
- The tool used to read those Parquet files (should be Hive according to
the stack trace, but what version?)
- What is the "complex" query?
- Schema of those Parquet files (can be checked using parquet-tools), as
well as corresponding schema of the user application (table schema for Hive)
- If possible, code snippet you used to write the files
- Are there files of different schemata mixed up? Some tools, like Hive,
don't handle schema evolution well.


I saw the file name in the stack trace consists of a timestamp. This 
isn't the naming convention used by Hive. Did you move files written 
somewhere else to the target directory?


Cheng

On 4/22/16 10:56 AM, Shushant Arora wrote:

Hi

I am writing to a parquet table
using parquet.hadoop.ParquetOutputFormat (from hive-exec 0.13).
Data is written correctly, and when I do count(1) or select * with
limit I get the proper result.

But when I do some complex query on the table it throws the exception below:

Diagnostic Messages for this Task:
Error: java.io.IOException: java.io.IOException: 
parquet.io.ParquetDecodingException: Can not read value at 18 in block 
0 in file 
hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:255)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:170)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)

at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)

 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.IOException: parquet.io.ParquetDecodingException: 
Can not read value at 18 in block 0 in file

hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
at 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:344)
at 
org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
at 
org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
at 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:122)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:253)

... 11 more
Caused by: parquet.io.ParquetDecodingException: Can not read value at 
18 in block 0 in file

hdfs://nameservice1/user/hive/warehouse/dbname.db/tablename/partitionname/20160421032223.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:216)
at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:144)
at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:159)
at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:48)
at 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:339)

... 15 more
Caused by: parquet.io.ParquetDecodingException: Can't read value in 
column [sessionid] BINARY at value 18 out of 18, 18 out of 18 in 
currentPage. repetition level: 0, definition level: 1
at 
parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:450)
at 
parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:352)