[jira] [Commented] (HIVE-5994) ORC RLEv2 encodes wrongly for large negative BIGINTs (64 bits )

2014-02-20 Thread Puneet Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906887#comment-13906887
 ] 

Puneet Gupta commented on HIVE-5994:


Hi Prasanth,
I also tested with the patch mentioned in 
https://reviews.apache.org/r/16148/diff/ by merging the code into 0.12.0. It 
solves the issue :-).

Thanks for the help.



> ORC RLEv2 encodes wrongly for large negative BIGINTs  (64 bits )
> 
>
> Key: HIVE-5994
> URL: https://issues.apache.org/jira/browse/HIVE-5994
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Prasanth J
>Assignee: Prasanth J
>  Labels: orcfile
> Fix For: 0.13.0
>
> Attachments: HIVE-5994.1.patch
>
>
> For large negative BIGINTs, zigzag encoding will yield large value (64bit 
> value) with MSB set to 1. This value is interpreted as negative value in 
> SerializationUtils.findClosestNumBits(long value) function. This resulted in 
> wrong computation of total number of bits required which results in wrong 
> encoding/decoding of values.
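To make the failure mode concrete, here is a minimal illustrative sketch (not the actual Hive source; the bit-counting loops are simplified stand-ins for SerializationUtils.findClosestNumBits(long)): zigzag-encoding a large negative long yields a 64-bit value with the MSB set, and a loop that compares that value as signed exits immediately and undercounts the bits required.

{code}
// Illustrative sketch only -- not the Hive implementation of
// SerializationUtils.findClosestNumBits(long).
public class ZigzagBitCountSketch {

    // Standard zigzag encoding for signed integers (as used by ORC RLEv2):
    // moves the sign information into the least significant bit.
    static long zigzag(long v) {
        return (v << 1) ^ (v >> 63);
    }

    // Buggy variant: 'value > 0' treats an MSB-set encoding as a negative
    // number and exits immediately, so far too few bits are reported.
    static int countBitsSigned(long value) {
        int count = 0;
        while (value > 0) { count++; value >>>= 1; }
        return count;
    }

    // Corrected variant: 'value != 0' with an unsigned shift counts all
    // 64 bits of the encoded value.
    static int countBitsUnsigned(long value) {
        int count = 0;
        while (value != 0) { count++; value >>>= 1; }
        return count;
    }

    public static void main(String[] args) {
        long v = Long.MIN_VALUE + 1;      // a large negative BIGINT
        long z = zigzag(v);               // MSB of the encoding is 1
        System.out.println(countBitsSigned(z));   // 0  -> wrong bit width
        System.out.println(countBitsUnsigned(z)); // 64 -> correct bit width
    }
}
{code}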





[jira] [Commented] (HIVE-5994) ORC RLEv2 encodes wrongly for large negative BIGINTs (64 bits )

2014-02-18 Thread Puneet Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905058#comment-13905058
 ] 

Puneet Gupta commented on HIVE-5994:


Hi Prasanth

This is the code I used to reproduce the issue:
1. I am using the Hive binary from "hive-0.12.0.tar.gz".
2. I am using an old Hadoop version, "hadoop-core-1.0.0.jar" 
(http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core).
3. In the code below, if ROWS_TO_TEST is set to 1 or to more than 10, the 
problem does not occur.

---
package hive;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.CompressionKind;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.OrcFile.WriterOptions;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.RecordReader;
import org.apache.hadoop.hive.ql.io.orc.Writer;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;

public class TestLong {

    public static void main(String[] args) throws IOException {
        int ROWS_TO_TEST = 10;   // with 1 or >10 the problem does not occur
        Path path = new Path("E:/Test/file.orc");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        if (fs.exists(path)) {
            fs.delete(path, true);
        }

        ObjectInspector inspector = ObjectInspectorFactory
                .getReflectionObjectInspector(MyData.class,
                        ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

        WriterOptions options = OrcFile.writerOptions(conf)
                .inspector(inspector)
                .compress(CompressionKind.SNAPPY);

        // Write the same long value ROWS_TO_TEST times.
        Writer writer = OrcFile.createWriter(path, options);
        for (int i = 0; i < ROWS_TO_TEST; i++) {
            writer.addRow(new MyData());
        }
        writer.close();

        // Read the rows back; each printed value should be 470327563395383.
        Reader reader = OrcFile.createReader(fs, path);
        RecordReader rows = reader.rows(null);
        Object row = null;
        while (rows.hasNext()) {
            row = rows.next(row);
            System.out.println(row);
        }
    }

    private static class MyData {
        long data = 470327563395383L;
    }
}
---
OUTPUT
{112}
{112}
{112}
{112}
{112}
{112}
{112}
{112}
{112}
{112}




[jira] [Commented] (HIVE-5994) ORC RLEv2 encodes wrongly for large negative BIGINTs (64 bits )

2014-02-18 Thread Puneet Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904141#comment-13904141
 ] 

Puneet Gupta commented on HIVE-5994:


Will it affect positive values also? I am trying to write the long 
470327563395383L and I see some issues while reading it back.
When I write 10 rows of the same long value and read them back, I get the 
value 112 instead.
When I write 1 or 100 rows of the same long value and read them back, I get 
the correct value! Not sure why.





[jira] [Commented] (HIVE-5922) In orc.InStream.CompressedStream, the desired position passed to seek can equal offsets[i] + bytes[i].remaining() when ORC predicate pushdown is enabled

2014-02-16 Thread Puneet Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902956#comment-13902956
 ] 

Puneet Gupta commented on HIVE-5922:


From what I know, 0.12.0 does not have vectorization support, so that cannot 
be the issue. Also, this happens only on seeking while predicate push-down is 
enabled; normal iteration is fine.
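
For what it's worth, the boundary condition named in this issue's title can be illustrated with a hedged sketch (this is not the actual InStream source; the range check is simplified): if the check only accepts positions strictly below offsets[i] + bytes[i].remaining(), a seek that lands exactly on the end of a range is rejected as "Seek outside of data".

{code}
// Hedged sketch of the boundary condition in the issue title -- not the
// actual InStream source.
import java.io.IOException;
import java.nio.ByteBuffer;

public class SeekBoundarySketch {

    static int findRange(long[] offsets, ByteBuffer[] bytes, long desired)
            throws IOException {
        for (int i = 0; i < bytes.length; ++i) {
            // Buggy: the strict '<' rejects desired == offsets[i] +
            // bytes[i].remaining(), which can legitimately occur when
            // predicate push-down skips to a range boundary.
            if (offsets[i] <= desired
                    && desired < offsets[i] + bytes[i].remaining()) {
                return i;
            }
        }
        throw new IOException("Seek outside of data at position " + desired);
    }

    public static void main(String[] args) {
        long[] offsets = {0, 100};
        ByteBuffer[] bytes = {ByteBuffer.allocate(100), ByteBuffer.allocate(50)};
        try {
            findRange(offsets, bytes, 150);  // 150 == offsets[1] + remaining
        } catch (IOException e) {
            System.out.println(e.getMessage()); // boundary seek wrongly rejected
        }
    }
}
{code}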

> In orc.InStream.CompressedStream, the desired position passed to seek can 
> equal offsets[i] + bytes[i].remaining() when ORC predicate pushdown is enabled
> 
>
> Key: HIVE-5922
> URL: https://issues.apache.org/jira/browse/HIVE-5922
> Project: Hive
>  Issue Type: Bug
>  Components: File Formats
>Reporter: Yin Huai
>
> Two stack traces ...
> {code}
> java.io.IOException: IO error in map input file 
> hdfs://10.38.55.204:8020/user/hive/warehouse/ssdb_bin_compress_orc_large_0_13.db/cycle/04_0
>   at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
>   at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.io.IOException: java.io.IOException: Seek outside of data in 
> compressed stream Stream for column 9 kind DATA position: 21496054 length: 
> 33790900 range: 2 offset: 1048588 limit: 1048588 range 0 = 13893791 to 
> 1048588;  range 1 = 17039555 to 1310735;  range 2 = 20447466 to 1048588;  
> range 3 = 23855377 to 1048588;  range 4 = 27263288 to 1048588;  range 5 = 
> 30409052 to 1310735 uncompressed: 262144 to 262144 to 21496054
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
>   at 
> org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
>   at 
> org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
>   at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
>   ... 9 more
> Caused by: java.io.IOException: Seek outside of data in compressed stream 
> Stream for column 9 kind DATA position: 21496054 length: 33790900 range: 2 
> offset: 1048588 limit: 1048588 range 0 = 13893791 to 1048588;  range 1 = 
> 17039555 to 1310735;  range 2 = 20447466 to 1048588;  range 3 = 23855377 to 
> 1048588;  range 4 = 27263288 to 1048588;  range 5 = 30409052 to 1310735 
> uncompressed: 262144 to 262144 to 21496054
>   at 
> org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.seek(InStream.java:328)
>   at 
> org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:161)
>   at 
> org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:205)
>   at 
> org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readInts(SerializationUtils.java:450)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readDirectValues(RunLengthIntegerReaderV2.java:240)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:53)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:288)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$IntTreeReader.next(RecordReaderImpl.java:510)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.next(RecordReaderImpl.java:1581)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:2707)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:110)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:86)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAw

[jira] [Commented] (HIVE-5922) In orc.InStream.CompressedStream, the desired position passed to seek can equal offsets[i] + bytes[i].remaining() when ORC predicate pushdown is enabled

2014-02-15 Thread Puneet Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902358#comment-13902358
 ] 

Puneet Gupta commented on HIVE-5922:


Hi Prasanth

I am using the Hive binary from "hive-0.12.0-bin.tar.gz"
(http://apache.claz.org/hive/hive-0.12.0/).

I am using only the ORC file format part to store my data; it is not used 
along with Hive.


[jira] [Commented] (HIVE-5922) In orc.InStream.CompressedStream, the desired position passed to seek can equal offsets[i] + bytes[i].remaining() when ORC predicate pushdown is enabled

2014-02-15 Thread Puneet Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902353#comment-13902353
 ] 

Puneet Gupta commented on HIVE-5922:


I got a similar exception (on seeking to row 9,103,258):

{code}
java.io.IOException: Seek outside of data in compressed stream Stream for column 65 kind DATA position: 1572882 length: 2116178 range: 1 offset: 1048588 limit: 1048588 range 0 = 0 to 0;  range 1 = 524294 to 1048588;  range 2 = 1835029 to 262147 uncompressed: 1048588 to 1048588 to 1572882
  at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.seek(InStream.java:277)
  at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:153)
  at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:197)
  at org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readInts(SerializationUtils.java:450)
  at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readPatchedBaseValues(RunLengthIntegerReaderV2.java:161)
  at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:54)
  at org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.skip(RunLengthIntegerReaderV2.java:318)
  at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$IntTreeReader.skipRows(RecordReaderImpl.java:427)
  at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.skipRows(RecordReaderImpl.java:1181)
  at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:2183)
  at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.seekToRow(RecordReaderImpl.java:2284)
{code}

Some observations:

1. I have used Snappy for compression.

2. There are 75 columns in the file (mostly numbers: int, long, byte, short, 
and a few strings). The exception always happens for column 65, which is an 
int. If I remove this column from the include-column list, seek works fine.

3. This issue happens only when I am seeking to a row using 
RecordReader.seekToRow(long). In this flow the RecordReader is created using 
Reader.rows(long, long, boolean[], SearchArgument, String[]). The 
SearchArgument uses an "IN" construct with 200 long values, which are actually 
the row numbers I want to retrieve 
(SearchArgument.FACTORY.newBuilder().startOr().in(colName, 200 long 
values).end().build()); see the sketch after this list. The exception happens 
on seeking to row 9103258 (the file has about 13 million rows). I tried a 
SearchArgument with just one IN value of 9103258 and, bingo, got the same 
exception. The problem can be reproduced for any row seek between 9103258 and 
9103279; rows after this seem to work fine.

4. I face no exceptions if the RecordReader is created using Reader.rows(null) 
and the entire file is iterated using RecordReader.hasNext() and 
RecordReader.next().

5. I face no exceptions if the RecordReader is created using Reader.rows(long, 
long, boolean[], SearchArgument, String[]) and the SearchArgument is passed as 
null. The required data (about 200 rows) is then retrieved using 
RecordReader.seekToRow(long) and RecordReader.next().

6. The obvious workaround is not to use predicate push-down. In my case, since 
I know the row numbers to seek to, the performance letdown is not very drastic:

Read/seekTo 167 rows in 3609 ms : existing usage with predicate push-down in ORC
Read/seekTo 167 rows in 4626 ms : workaround without predicate/SearchArgument push-down

The difference of 1017 ms is roughly 6 ms per row (around 80% of the values 
are fetched from different strides).
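
For reference, here is a hedged sketch of the failing access pattern from observation 3, assembled from the 0.12-era API calls quoted in this comment; the file path and the column name "col65" are illustrative, not from the actual data set.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.RecordReader;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;

public class SeekWithSargSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        Path path = new Path("file.orc");             // illustrative path

        Reader reader = OrcFile.createReader(fs, path);

        // IN predicate over the row numbers to fetch; a single value is
        // enough to reproduce. "col65" stands in for the real column name.
        SearchArgument sarg = SearchArgument.FACTORY.newBuilder()
                .startOr()
                .in("col65", 9103258L)
                .end()
                .build();

        // Creating the RecordReader with a non-null SearchArgument and then
        // calling seekToRow() is the combination that raised the
        // "Seek outside of data" exception above.
        RecordReader rows = reader.rows(0, Long.MAX_VALUE, null, sarg,
                new String[] { "col65" });
        rows.seekToRow(9103258L);
        Object row = rows.next(null);
        System.out.println(row);
        rows.close();
    }
}
{code}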

