Re: Review Request 45238: HIVE-9660 store end offset of compressed data for RG in RowIndex in ORC

Sergey Shelukhin Wed, 06 Apr 2016 21:21:07 -0700


> On April 6, 2016, 11:27 p.m., Prasanth_J wrote:
> > common/src/java/org/apache/hadoop/hive/conf/HiveConf.java, lines 1205-1206
> > <https://reviews.apache.org/r/45238/diff/1/?file=1312350#file1312350line1205>
> >
> >     IMO, this should not be made an option for the user. Should be enabled 
> > as long as index is enabled. If for some reason, this is wrong, then disk 
> > range computation should fallback to old code path using WORST_CASE_SLOP (2 
> > x buffer size). Will be better to have config for fallback path as opposed 
> > to tracking lengths.


this is needed for the write path - what if the write path is problematic 
because of the length tracking?


> On April 6, 2016, 11:27 p.m., Prasanth_J wrote:
> > orc/src/java/org/apache/orc/TypeDescription.java, line 384
> > <https://reviews.apache.org/r/45238/diff/1/?file=1312356#file1312356line384>
> >
> >     Is this related?

yes


> On April 6, 2016, 11:27 p.m., Prasanth_J wrote:
> > orc/src/java/org/apache/orc/impl/RunLengthIntegerWriterV2.java, line 192
> > <https://reviews.apache.org/r/45238/diff/1/?file=1312362#file1312362line192>
> >
> >     By flush does this mean writing to OutStream? What if numLiterals 
> > straddles compression buffers (first CB is flushed and closed and second CB 
> > contains some literals)? Is that still considered to be flushed?

yes, just writing to outstream. It doesn't matter for this class what happens 
there for the most part (see big comment in WriterImpl); the flush count is 
always updated after writing out to OutStream, so in any callback from CBs 
created during the write, we'll still see the old flush count. Only when that 
2nd CB is written, we will get a callback and find the new flush count.


> On April 6, 2016, 11:27 p.m., Prasanth_J wrote:
> > orc/src/java/org/apache/orc/impl/WriterImpl.java, line 595
> > <https://reviews.apache.org/r/45238/diff/1/?file=1312364#file1312364line595>
> >
> >     I think it will be easier if we maintain a Map<String, LengthsBuilder> 
> > that maps stream name and proto builder of lengths. The callback interface 
> > can be reportCompressionBuffer(String name, int cbSize). When this callback 
> > is called, length information can be appended to the lengths builder for 
> > the corresponding stream. When flushing the stripe, using the same stream 
> > name we can check if the stream is suppressed or not. With that we can also 
> > avoid the index adjustment. 
> >     
> >     I am just seeing if there are ways to simplify the state maintenance 
> > here. Thoughts?

the map by enum might make sense... do you think this is acceptable in terms of 
extra overhead?


> On April 6, 2016, 11:27 p.m., Prasanth_J wrote:
> > orc/src/protobuf/orc_proto.proto, line 85
> > <https://reviews.apache.org/r/45238/diff/1/?file=1312365#file1312365line85>
> >
> >     If this is a required field, it will break backward compatibility. 
> > Isn't it?

it's repeated, not required :)


> On April 6, 2016, 11:27 p.m., Prasanth_J wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderUtils.java, line 
> > 280
> > <https://reviews.apache.org/r/45238/diff/1/?file=1312378#file1312378line280>
> >
> >     I don't see any changes that bypasses this when lengths are available. 
> > Is that going into separate jira?

see estimateRgEndOffset calls. It calls the fn right now to log the message, 
but uses lengths if available. I can remove it when the patch is done... or 
only do it if info is enabled


- Sergey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/45238/#review126974
-----------------------------------------------------------


On March 23, 2016, 7:08 p.m., Sergey Shelukhin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/45238/
> -----------------------------------------------------------
> 
> (Updated March 23, 2016, 7:08 p.m.)
> 
> 
> Review request for hive and Prasanth_J.
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> see jira
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java c14df20 
>   
> llap-server/src/java/org/apache/hadoop/hive/llap/cli/LlapOptionsProcessor.java
>  c292b37 
>   
> llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
>  eb251a8 
>   
> llap-server/src/java/org/apache/hadoop/hive/llap/io/metadata/OrcStripeMetadata.java
>  82187bd 
>   orc/src/java/org/apache/orc/OrcConf.java 6fcbb72 
>   orc/src/java/org/apache/orc/OrcFile.java 3945a5d 
>   orc/src/java/org/apache/orc/TypeDescription.java bd900ac 
>   orc/src/java/org/apache/orc/impl/BitFieldWriter.java aa5f886 
>   orc/src/java/org/apache/orc/impl/IntegerWriter.java 419054f 
>   orc/src/java/org/apache/orc/impl/OutStream.java 81662cc 
>   orc/src/java/org/apache/orc/impl/RunLengthByteWriter.java 09108b2 
>   orc/src/java/org/apache/orc/impl/RunLengthIntegerWriter.java 3e5f2e2 
>   orc/src/java/org/apache/orc/impl/RunLengthIntegerWriterV2.java fab2801 
>   orc/src/java/org/apache/orc/impl/SerializationUtils.java c1162e4 
>   orc/src/java/org/apache/orc/impl/WriterImpl.java 6497ecf 
>   orc/src/protobuf/orc_proto.proto f4935b4 
>   orc/src/test/org/apache/orc/impl/TestBitFieldReader.java e4c6f6b 
>   orc/src/test/org/apache/orc/impl/TestBitPack.java f2d3d64 
>   orc/src/test/org/apache/orc/impl/TestInStream.java 9e65345 
>   orc/src/test/org/apache/orc/impl/TestIntegerCompressionReader.java 399f35e 
>   orc/src/test/org/apache/orc/impl/TestOutStream.java e9614d5 
>   orc/src/test/org/apache/orc/impl/TestRunLengthByteReader.java a14bef1 
>   orc/src/test/org/apache/orc/impl/TestRunLengthIntegerReader.java 28239ba 
>   ql/src/java/org/apache/hadoop/hive/llap/DebugUtils.java ea626d7 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezJobMonitor.java 67f9da8 
>   ql/src/java/org/apache/hadoop/hive/ql/hooks/PostExecOrcFileDump.java 
> d5d1370 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/FileDump.java 9c2f88f 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/JsonFileDump.java 00de545 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderUtils.java 8a73948 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedReader.java 
> 96af96a 
>   ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedReaderImpl.java 
> 29b51ec 
>   ql/src/test/queries/clientpositive/orc_lengths.q PRE-CREATION 
>   ql/src/test/results/clientpositive/orc_lengths.q.out PRE-CREATION 
>   
> storage-api/src/java/org/apache/hadoop/hive/common/io/encoded/EncodedColumnBatch.java
>  ddba889 
> 
> Diff: https://reviews.apache.org/r/45238/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Sergey Shelukhin
> 
>

Re: Review Request 45238: HIVE-9660 store end offset of compressed data for RG in RowIndex in ORC

Reply via email to