[ 
https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=541286&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-541286
 ]

ASF GitHub Bot logged work on HIVE-23553:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 25/Jan/21 20:20
            Start Date: 25/Jan/21 20:20
    Worklog Time Spent: 10m 
      Work Description: mustafaiman commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r563956041



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       You say this is changed to be compatible with the ORC setting. I do not understand why this is necessary or what its impact is. This looks like a change that should not be taken lightly.

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/LlapRecordReaderUtils.java
##########
@@ -0,0 +1,438 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.llap.io.encoded;
+
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.common.io.DiskRangeList;
+import org.apache.hadoop.hive.ql.io.orc.encoded.LlapDataReader;
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcProto;
+import org.apache.orc.StripeInformation;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.impl.BufferChunk;
+import org.apache.orc.impl.DataReaderProperties;
+import org.apache.orc.impl.DirectDecompressionCodec;
+import org.apache.orc.impl.HadoopShims;
+import org.apache.orc.impl.HadoopShimsFactory;
+import org.apache.orc.impl.InStream;
+import org.apache.orc.impl.OrcCodecPool;
+import org.apache.orc.impl.OrcIndex;
+import org.apache.orc.impl.RecordReaderUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.function.Supplier;
+
+public class LlapRecordReaderUtils {
+
+  private static final HadoopShims SHIMS = HadoopShimsFactory.get();
+  private static final Logger LOG = LoggerFactory.getLogger(LlapRecordReaderUtils.class);
+
+  static HadoopShims.ZeroCopyReaderShim createZeroCopyShim(FSDataInputStream file, CompressionCodec codec, RecordReaderUtils.ByteBufferAllocatorPool pool) throws IOException {
+    return codec != null && (!(codec instanceof DirectDecompressionCodec) || !((DirectDecompressionCodec)codec).isAvailable()) ? null : SHIMS.getZeroCopyReader(file, pool);

Review comment:
       Can you invert this condition? There are a lot of negatives, which makes it hard to understand.
   `codec == null || (codec instanceof DirectDecompressionCodec && ((DirectDecompressionCodec) codec).isAvailable()) ? SHIMS.getZeroCopyReader(file, pool) : null` is much more understandable.
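
   For reference, a minimal sketch of how the method might read with the condition inverted as suggested (same behavior, just fewer negations; the signature is the one from the diff above):

```java
static HadoopShims.ZeroCopyReaderShim createZeroCopyShim(FSDataInputStream file,
    CompressionCodec codec, RecordReaderUtils.ByteBufferAllocatorPool pool) throws IOException {
  // Zero-copy reads are possible when there is no codec at all, or when the codec
  // supports direct decompression and that support is actually available.
  boolean zeroCopyPossible = codec == null
      || (codec instanceof DirectDecompressionCodec
          && ((DirectDecompressionCodec) codec).isAvailable());
  return zeroCopyPossible ? SHIMS.getZeroCopyReader(file, pool) : null;
}
```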

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -443,7 +444,8 @@ public void setBaseAndInnerReader(
       return new OrcRawRecordMerger.KeyInterval(null, null);
     }
 
-    OrcTail orcTail = getOrcTail(orcSplit.getPath(), conf, cacheTag, orcSplit.getFileKey()).orcTail;
+    VectorizedOrcAcidRowBatchReader.ReaderData orcReaderData =

Review comment:
       I am not sure about this. Previously we did not create the full reader. Why do we need to create the reader now? All calls from here use orcTail anyway, except `List<StripeStatistics> stats = orcReaderData.reader.getVariantStripeStatistics(null);`, and we can build that from the info in the tail as well.

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
##########
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
           OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
           counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
           FileTail tail = orcTail.getFileTail();
-          stats = orcTail.getStripeStatisticsProto();
+          CompressionKind compressionKind = orcTail.getCompressionKind();
+          InStream.StreamOptions options = null;
+          if (compressionKind != CompressionKind.NONE) {
+            options = InStream.options()
+                .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+          }
+          InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),

Review comment:
       Please extract "stripe stats" to a constant.
   I understand that the correct way to get the stripe stats is via OrcReader.getStripeStats. However, we want to avoid creating a reader when we already have the metadata in the cache, which, as far as I understand, is why this is done here. It would be better to extract this part into a helper method with some javadoc explaining why we do it this way.
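
   One possible shape for such a helper, purely as a sketch of the suggestion (the constant and method names below are made up; the body simply mirrors the lines in the diff above):

```java
private static final String STRIPE_STATS_STREAM_NAME = "stripe stats";

/**
 * Parses the stripe-level statistics directly out of a cached ORC tail buffer.
 * Used on the metadata-cache hit path so that we do not have to construct a full
 * ORC Reader just to obtain the stripe statistics.
 */
private static List<OrcProto.StripeStatistics> readStripeStatsFromTail(OrcTail orcTail)
    throws IOException {
  CompressionKind compressionKind = orcTail.getCompressionKind();
  InStream.StreamOptions options = null;
  if (compressionKind != CompressionKind.NONE) {
    options = InStream.options()
        .withCodec(OrcCodecPool.getCodec(compressionKind))
        .withBufferSize(orcTail.getCompressionBufferSize());
  }
  InStream stream = InStream.create(STRIPE_STATS_STREAM_NAME, orcTail.getTailBuffer(),
      orcTail.getMetadataOffset(), orcTail.getMetadataSize(), options);
  return OrcProto.Metadata.parseFrom(InStream.createCodedInputStream(stream)).getStripeStatsList();
}
```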

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
##########
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
           OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
           counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
           FileTail tail = orcTail.getFileTail();
-          stats = orcTail.getStripeStatisticsProto();
+          CompressionKind compressionKind = orcTail.getCompressionKind();
+          InStream.StreamOptions options = null;
+          if (compressionKind != CompressionKind.NONE) {
+            options = InStream.options()
+                .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+          }
+          InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),
+              orcTail.getMetadataOffset(), orcTail.getMetadataSize(), options);
+          stats = OrcProto.Metadata.parseFrom(InStream.createCodedInputStream(stream)).getStripeStatsList();
           stripes = new ArrayList<>(tail.getFooter().getStripesCount());
+          int stripeIdx = 0;
           for (OrcProto.StripeInformation stripeProto : tail.getFooter().getStripesList()) {
-            stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto));
+            stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto, stripeIdx++, -1, null));

Review comment:
       I am confused by this part. Are the previous stripe id and the encryption keys never relevant here?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/LocalCache.java
##########
@@ -82,8 +82,7 @@ public void put(Path path, OrcTail tail) {
     if (bb.capacity() != bb.remaining()) {
       throw new RuntimeException("Bytebuffer allocated for path: " + path + " has remaining: " + bb.remaining() + " != capacity: " + bb.capacity());
     }
-    cache.put(path, new TailAndFileData(tail.getFileTail().getFileLength(),
-        tail.getFileModificationTime(), bb.duplicate()));
+    cache.put(path, new TailAndFileData(bb.limit(), tail.getFileModificationTime(), bb.duplicate()));

Review comment:
       getFileLength() is still available. Why this change?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -2003,6 +2005,22 @@ private static IntegerColumnStatistics deserializeIntColumnStatistics(List<OrcPr
    * @param colStats The statistics array
    * @return The min record key
    */
+  private static OrcRawRecordMerger.KeyInterval getKeyInterval(ColumnStatistics[] colStats) {

Review comment:
       This and `getKeyInterval(List<OrcProto.ColumnStatistics> colStats)` are 
almost the same. Can you get rid of this duplication?
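
   Just to illustrate the kind of dedup meant here, a sketch in which the proto-based overload converts its input and delegates to the array overload (the `deserialize` call below is a hypothetical stand-in for whatever proto-to-ColumnStatistics conversion the class already uses):

```java
// Hypothetical sketch: keep the min/max key extraction logic in the
// ColumnStatistics[] overload only, and let the proto-based overload delegate.
private static OrcRawRecordMerger.KeyInterval getKeyInterval(List<OrcProto.ColumnStatistics> protoStats) {
  ColumnStatistics[] colStats = new ColumnStatistics[protoStats.size()];
  for (int i = 0; i < protoStats.size(); i++) {
    colStats[i] = deserialize(protoStats.get(i)); // assumed conversion from proto to ColumnStatistics
  }
  return getKeyInterval(colStats);
}
```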




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 541286)
    Time Spent: 3.5h  (was: 3h 20m)

> Upgrade ORC version to 1.6.7
> ----------------------------
>
>                 Key: HIVE-23553
>                 URL: https://issues.apache.org/jira/browse/HIVE-23553
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Panagiotis Garefalakis
>            Assignee: Panagiotis Garefalakis
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
>  Apache Hive is currently on the ORC 1.5.X line, and in order to take advantage of the latest ORC improvements, such as column encryption, we have to bump to 1.6.X.
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343288&styleName=&projectId=12318320&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED_4ae78f19321c7fb1e7f337fba1dd90af751d8810_lin
> Even though the ORC reader could work out of the box, Hive LLAP depends heavily on internal ORC APIs, e.g. to retrieve and store file footers, tails, and streams, and to un/compress RG data. As there were many internal changes from 1.5 to 1.6 (input stream offsets, relative BufferChunks, etc.), the upgrade is not straightforward.
> This umbrella Jira tracks the upgrade effort.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
