[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=548277&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-548277 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 05/Feb/21 10:05 Start Date: 05/Feb/21 10:05 Worklog Time Spent: 10m Work Description: dongjoon-hyun commented on pull request #1823: URL: https://github.com/apache/hive/pull/1823#issuecomment-773584083 Thank you so much all! It's great! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 548277) Time Spent: 10h (was: 9h 50m) > Upgrade ORC version to 1.6.7 > > > Key: HIVE-23553 > URL: https://issues.apache.org/jira/browse/HIVE-23553 > Project: Hive > Issue Type: Improvement >Reporter: Panagiotis Garefalakis >Assignee: Panagiotis Garefalakis >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 10h > Remaining Estimate: 0h > > Apache Hive is currently on 1.5.X version and in order to take advantage of > the latest ORC improvements such as column encryption we have to bump to > 1.6.X. > https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343288&styleName=&projectId=12318320&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED_4ae78f19321c7fb1e7f337fba1dd90af751d8810_lin > Even though ORC reader could work out of the box, HIVE LLAP is heavily > depending on internal ORC APIs e.g., to retrieve and store File Footers, > Tails, streams – un/compress RG data etc. As there were many internal changes > from 1.5 to 1.6 (Input stream offsets, relative BufferChunks etc.) the > upgrade is not straightforward. > This Umbrella Jira tracks this upgrade effort. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=548129&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-548129 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 05/Feb/21 09:48 Start Date: 05/Feb/21 09:48 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r570105800 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java ## @@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex, .setColumnEncoding(columnEncoding) .setVectors(vectors) .setContext(context) +.setIsInstant(columnType.getCategory() == TypeDescription.Category.TIMESTAMP_INSTANT) Review comment: Sure thing! This is now tracked as HIVE-24735 Issue Time Tracking --- Worklog Id: (was: 548129) Time Spent: 9h 50m (was: 9h 40m)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=547850&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547850 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 04/Feb/21 20:28 Start Date: 04/Feb/21 20:28 Worklog Time Spent: 10m Work Description: dongjoon-hyun commented on pull request #1823: URL: https://github.com/apache/hive/pull/1823#issuecomment-773584083 Thank you so much all! It's great! Issue Time Tracking --- Worklog Id: (was: 547850) Time Spent: 9h 40m (was: 9.5h)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=547552&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547552 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 04/Feb/21 10:18 Start Date: 04/Feb/21 10:18 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r570105800 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java ## @@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex, .setColumnEncoding(columnEncoding) .setVectors(vectors) .setContext(context) +.setIsInstant(columnType.getCategory() == TypeDescription.Category.TIMESTAMP_INSTANT) Review comment: Sure thing! This is now tracked as HIVE-24735 Issue Time Tracking --- Worklog Id: (was: 547552) Time Spent: 9.5h (was: 9h 20m)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=547351&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547351 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 04/Feb/21 00:02 Start Date: 04/Feb/21 00:02 Worklog Time Spent: 10m Work Description: jcamachor merged pull request #1823: URL: https://github.com/apache/hive/pull/1823 Issue Time Tracking --- Worklog Id: (was: 547351) Time Spent: 9h 20m (was: 9h 10m)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=547313&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547313 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 03/Feb/21 22:17 Start Date: 03/Feb/21 22:17 Worklog Time Spent: 10m Work Description: mustafaiman commented on pull request #1823: URL: https://github.com/apache/hive/pull/1823#issuecomment-772864136 Looks good to me. Thanks for the effort @pgaref Issue Time Tracking --- Worklog Id: (was: 547313) Time Spent: 9h 10m (was: 9h)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=547303&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547303 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 03/Feb/21 21:55 Start Date: 03/Feb/21 21:55 Worklog Time Spent: 10m Work Description: jcamachor commented on pull request #1823: URL: https://github.com/apache/hive/pull/1823#issuecomment-772850624 Thanks for addressing the comments @pgaref . I am fine from my side, +1. I'd like to hear from @mustafaiman , if it's fine from his side, we can merge it. Issue Time Tracking --- Worklog Id: (was: 547303) Time Spent: 9h (was: 8h 50m)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=547300&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547300 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 03/Feb/21 21:52 Start Date: 03/Feb/21 21:52 Worklog Time Spent: 10m Work Description: jcamachor commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r569775411 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java ## @@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex, .setColumnEncoding(columnEncoding) .setVectors(vectors) .setContext(context) +.setIsInstant(columnType.getCategory() == TypeDescription.Category.TIMESTAMP_INSTANT) Review comment: @pgaref , can we create a follow-up JIRA to implement TIMESTAMP WITH LOCAL TIME ZONE integration with ORC so we do not forget about it? Issue Time Tracking --- Worklog Id: (was: 547300) Time Spent: 8h 50m (was: 8h 40m)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=547292&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-547292 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 03/Feb/21 21:40 Start Date: 03/Feb/21 21:40 Worklog Time Spent: 10m Work Description: pgaref commented on pull request #1823: URL: https://github.com/apache/hive/pull/1823#issuecomment-772842805 Gentle ping @mustafaiman @jcamachor -- any further comments here? Issue Time Tracking --- Worklog Id: (was: 547292) Time Spent: 8h 40m (was: 8.5h)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=546411&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546411 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 03/Feb/21 01:01 Start Date: 03/Feb/21 01:01 Worklog Time Spent: 10m Work Description: pgaref commented on pull request #1823: URL: https://github.com/apache/hive/pull/1823#issuecomment-771227835 > All q.out files show data size increase for tables. Since most of them are consistently additional 4 bytes per row, that seems like not a bug. However, I found some irregular increases too like 16 bytes per row. Can you explain why data size increased so we can check the irregularities and make sure they are expected? Hey @mustafaiman -- the main size differences are on Timestamp columns, where we now support nanosecond precision (using 2 extra variables for the lower and the upper precision as part of the stats -- see [ORC-611](https://issues.apache.org/jira/browse/ORC-611)). Other than that, there are other changes that can also affect size, such as trimming StringStatistics minimum and maximum values as part of ORC-203, or the List and Map column statistics that were recently added as part of ORC-398. Happy to check further if you have doubts about a particular query. Issue Time Tracking --- Worklog Id: (was: 546411) Time Spent: 8.5h (was: 8h 20m)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=546366&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546366 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 03/Feb/21 00:57 Start Date: 03/Feb/21 00:57 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r568204044 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java ## @@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex, .setColumnEncoding(columnEncoding) .setVectors(vectors) .setContext(context) +.setIsInstant(columnType.getCategory() == TypeDescription.Category.TIMESTAMP_INSTANT) Review comment: Even though TimeStamp with local timezone was added as part of [ORC-189](https://issues.apache.org/jira/browse/ORC-189) ## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ## @@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" + "padded to minimum allocation. For ORC, should generally be the same as the expected\n" + "compression buffer size, or next lowest power of 2. 
Must be a power of 2."), -LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(), +LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(), Review comment: LLAP_ALLOCATOR_MAX_ALLOC is used both for the LowLevelCacheImpl (buddyAllocator) and bufferSize on [WriterOptions](https://github.com/apache/hive/blob/da1aa077716a65c2a02d850828b16cdeece1f574/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java#L1553) Please check how this is propagated from [SerDeEncodedDataReader](https://github.com/apache/hive/blob/da1aa077716a65c2a02d850828b16cdeece1f574/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java#L248) LLAP is tightly coupled to ORC, thus it could make sense to use the same buffer size for serialized Buffers and the ORC writer, as we would not need to split/merge them -- however I have nothing against splitting the conf or checking whether the 8Mb limit is a hard one. All I am trying to say here is that this is orthogonal to the ORC version bump. ## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ## @@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" + "padded to minimum allocation. For ORC, should generally be the same as the expected\n" + "compression buffer size, or next lowest power of 2. 
Must be a power of 2."), -LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(), +LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(), Review comment: Sure, addressed as part of 2eca3b9de1f2332beabc3bde9ac0f89d62ec1527 -- also opened HIVE-24721 to investigate this further ## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ## @@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" + "padded to minimum allocation. For ORC, should generally be the same as the expected\n" + "compression buffer size, or next lowest power of 2. Must be a power of 2."), -LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(), +LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(), Review comment: Sure, addressed this as part of 2eca3b9de1f2332beabc3bde9ac0f89d62ec1527 -- also opened HIVE-24721 to investigate this further ## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ## @@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" + "padded to minimum allocation. For ORC, should generally be the same as the expected\n" + "compression buffer size, or next lowest power of 2. Must be a power of 2."), -LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(), +LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator()
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=546005&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546005 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 02/Feb/21 12:39 Start Date: 02/Feb/21 12:39 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r568567405 ## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ## @@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" + "padded to minimum allocation. For ORC, should generally be the same as the expected\n" + "compression buffer size, or next lowest power of 2. Must be a power of 2."), -LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(), +LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(), Review comment: Sure, addressed this as part of 2eca3b9de1f2332beabc3bde9ac0f89d62ec1527 -- also opened HIVE-24721 to investigate this further ## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ## @@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" + "padded to minimum allocation. For ORC, should generally be the same as the expected\n" + "compression buffer size, or next lowest power of 2. Must be a power of 2."), -LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(), +LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(), Review comment: Sure, addressed this as part of 2eca3b9de1f2332beabc3bde9ac0f89d62ec1527 -- also opened HIVE-24721 to investigate further Issue Time Tracking --- Worklog Id: (was: 546005) Time Spent: 8h 10m (was: 8h)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=546003&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546003 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 02/Feb/21 12:36 Start Date: 02/Feb/21 12:36 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r568567405 ## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ## @@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" + "padded to minimum allocation. For ORC, should generally be the same as the expected\n" + "compression buffer size, or next lowest power of 2. Must be a power of 2."), -LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(), +LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(), Review comment: Sure, addressed as part of 2eca3b9de1f2332beabc3bde9ac0f89d62ec1527 -- also opened HIVE-24721 to investigate this further This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 546003) Time Spent: 8h (was: 7h 50m) > Upgrade ORC version to 1.6.7 > > > Key: HIVE-23553 > URL: https://issues.apache.org/jira/browse/HIVE-23553 > Project: Hive > Issue Type: Improvement >Reporter: Panagiotis Garefalakis >Assignee: Panagiotis Garefalakis >Priority: Major > Labels: pull-request-available > Time Spent: 8h > Remaining Estimate: 0h > > Apache Hive is currently on 1.5.X version and in order to take advantage of > the latest ORC improvements such as column encryption we have to bump to > 1.6.X. 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343288&styleName=&projectId=12318320&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED_4ae78f19321c7fb1e7f337fba1dd90af751d8810_lin > Even though ORC reader could work out of the box, HIVE LLAP is heavily > depending on internal ORC APIs e.g., to retrieve and store File Footers, > Tails, streams – un/compress RG data etc. As there were many internal changes > from 1.5 to 1.6 (Input stream offsets, relative BufferChunks etc.) the > upgrade is not straightforward. > This Umbrella Jira tracks this upgrade effort. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=545970&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545970 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 02/Feb/21 11:11 Start Date: 02/Feb/21 11:11 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r568174337 ## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ## @@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" + "padded to minimum allocation. For ORC, should generally be the same as the expected\n" + "compression buffer size, or next lowest power of 2. Must be a power of 2."), -LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(), +LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(), Review comment: LLAP_ALLOCATOR_MAX_ALLOC is used both for the LowLevelCacheImpl (buddyAllocator) and as the bufferSize on [WriterOptions](https://github.com/apache/hive/blob/da1aa077716a65c2a02d850828b16cdeece1f574/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java#L1553). Please check how this is propagated from [SerDeEncodedDataReader](https://github.com/apache/hive/blob/da1aa077716a65c2a02d850828b16cdeece1f574/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java#L248). LLAP is tightly coupled to ORC, so it could make sense to use the same buffer size for serialized buffers and the ORC writer, as we would then not need to split/merge them. However, I have nothing against splitting the conf or checking whether the 8Mb limit is a hard one. All I am trying to say here is that this is orthogonal to the ORC version bump.
Issue Time Tracking --- Worklog Id: (was: 545970) Time Spent: 7h 50m (was: 7h 40m)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=545689&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545689 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 01/Feb/21 23:16 Start Date: 01/Feb/21 23:16 Worklog Time Spent: 10m Work Description: pgaref commented on pull request #1823: URL: https://github.com/apache/hive/pull/1823#issuecomment-771227835 > All q.out files show data size increase for tables. Since most of them are consistently additional 4 bytes per row, that seems like not a bug. However, I found some irregular increases too like 16 bytes per row. Can you explain why data size increased so we can check the irregularities and make sure they are expected? Hey @mustafaiman -- the main size differences are on Timestamp columns, where we now support nanosecond precision (using 2 extra variables for the lower and the upper precision as part of the stats -- see [ORC-611](https://issues.apache.org/jira/browse/ORC-611)). Other changes can also affect size, such as trimming StringStatistics minimum and maximum values as part of ORC-203, or the List and Map column statistics that were recently added as part of ORC-398. Happy to check further if you have doubts about a particular query.
Issue Time Tracking --- Worklog Id: (was: 545689) Time Spent: 7h 40m (was: 7.5h)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=545684&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545684 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 01/Feb/21 23:05 Start Date: 01/Feb/21 23:05 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r568204044 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java ## @@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex, .setColumnEncoding(columnEncoding) .setVectors(vectors) .setContext(context) +.setIsInstant(columnType.getCategory() == TypeDescription.Category.TIMESTAMP_INSTANT) Review comment: Even though TimeStamp with local timezone was added as part of [ORC-189](https://issues.apache.org/jira/browse/ORC-189) Issue Time Tracking --- Worklog Id: (was: 545684) Time Spent: 7.5h (was: 7h 20m)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=545680&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545680 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 01/Feb/21 22:58 Start Date: 01/Feb/21 22:58 Worklog Time Spent: 10m Work Description: mustafaiman commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r568200811 ## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ## @@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" + "padded to minimum allocation. For ORC, should generally be the same as the expected\n" + "compression buffer size, or next lowest power of 2. Must be a power of 2."), -LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(), +LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(), Review comment: I see ORC strictly enforces this now. I would set the appropriate setting at the Hive-ORC boundary and leave LLAP_ALLOCATOR_MAX_ALLOC as it is, i.e. Math.min(llap.allocator.max, what ORC enforces). If you think we should set LLAP_ALLOCATOR_MAX_ALLOC to be the same as what ORC enforces, that can be done in a separate ticket. Like you said, this is orthogonal to the ORC version bump and should therefore be discussed in its own ticket.
Issue Time Tracking --- Worklog Id: (was: 545680) Time Spent: 7h 20m (was: 7h 10m)
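The suggestion in the review comment above (clamp the value at the Hive-ORC boundary instead of lowering the LLAP allocator default) could look roughly like the following. This is only a sketch: `OrcBufferSizeSketch` and `effectiveOrcBufferSize` are hypothetical names, not Hive or ORC APIs, and the ORC-side limit is passed in as a parameter because the thread only mentions an "8Mb limit" without confirming its exact value or whether it is a hard one.

```java
/** Sketch: derive the ORC writer buffer size from the LLAP allocator setting. */
final class OrcBufferSizeSketch {

  /**
   * Returns the buffer size to hand to the ORC writer.
   * Both arguments are in bytes. orcEnforcedMax is an assumption supplied by the
   * caller; this sketch does not know what ORC actually enforces.
   */
  static long effectiveOrcBufferSize(long llapAllocatorMaxAlloc, long orcEnforcedMax) {
    if (llapAllocatorMaxAlloc <= 0 || orcEnforcedMax <= 0) {
      throw new IllegalArgumentException("sizes must be positive");
    }
    // The LLAP allocator keeps its own (possibly larger) maximum;
    // only the value passed on to the ORC writer is clamped.
    return Math.min(llapAllocatorMaxAlloc, orcEnforcedMax);
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    // With the old 16Mb default and an assumed 8Mb ORC limit, the writer gets 8Mb.
    System.out.println(effectiveOrcBufferSize(16 * mb, 8 * mb) / mb + "Mb");
    // With a 4Mb allocator setting the value passes through unchanged.
    System.out.println(effectiveOrcBufferSize(4 * mb, 8 * mb) / mb + "Mb");
  }
}
```

With this shape the allocator default would not need to change for the version bump; whether the two settings should stay coupled is the question the thread defers to a separate ticket.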
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=545643&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545643 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 01/Feb/21 22:09 Start Date: 01/Feb/21 22:09 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r568174337 ## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ## @@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" + "padded to minimum allocation. For ORC, should generally be the same as the expected\n" + "compression buffer size, or next lowest power of 2. Must be a power of 2."), -LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(), +LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(), Review comment: LLAP_ALLOCATOR_MAX_ALLOC is used both for the LowLevelCacheImpl (buddyAllocator) and as the bufferSize on [WriterOptions](https://github.com/apache/hive/blob/da1aa077716a65c2a02d850828b16cdeece1f574/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java#L1553). Please check how this is propagated from [SerDeEncodedDataReader](https://github.com/apache/hive/blob/da1aa077716a65c2a02d850828b16cdeece1f574/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java#L248). LLAP is tightly coupled to ORC, so it could make sense to use the same buffer size for serialized buffers and the ORC writer, as we would then not need to split/merge them. However, I have nothing against splitting the conf or checking whether the 8Mb limit is a hard one. All I am trying to say here is that this is orthogonal to the ORC version bump.
Issue Time Tracking --- Worklog Id: (was: 545643) Time Spent: 7h (was: 6h 50m)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=545644&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545644 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 01/Feb/21 22:09 Start Date: 01/Feb/21 22:09 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r568174337 ## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ## @@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" + "padded to minimum allocation. For ORC, should generally be the same as the expected\n" + "compression buffer size, or next lowest power of 2. Must be a power of 2."), -LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(), +LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(), Review comment: LLAP_ALLOCATOR_MAX_ALLOC is used both for the LowLevelCacheImpl (buddyAllocator) and as the bufferSize on [WriterOptions](https://github.com/apache/hive/blob/da1aa077716a65c2a02d850828b16cdeece1f574/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java#L1553). Please check how this is propagated from [SerDeEncodedDataReader](https://github.com/apache/hive/blob/da1aa077716a65c2a02d850828b16cdeece1f574/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java#L248). LLAP is tightly coupled to ORC, so it could make sense to use the same buffer size for serialized buffers and the ORC writer, as we would then not need to split/merge them. However, I have nothing against splitting the conf or checking whether the 8Mb limit is a hard one. All I am trying to say here is that this is orthogonal to the ORC version bump.
Issue Time Tracking --- Worklog Id: (was: 545644) Time Spent: 7h 10m (was: 7h)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=545623&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545623 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 01/Feb/21 21:54 Start Date: 01/Feb/21 21:54 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r568166455 ## File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/LlapRecordReaderUtils.java ## @@ -0,0 +1,440 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hadoop.hive.llap.io.encoded; + +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.hive.common.io.DiskRangeList; +import org.apache.hadoop.hive.ql.io.orc.encoded.LlapDataReader; +import org.apache.orc.CompressionCodec; +import org.apache.orc.CompressionKind; +import org.apache.orc.OrcFile; +import org.apache.orc.OrcProto; +import org.apache.orc.StripeInformation; +import org.apache.orc.TypeDescription; +import org.apache.orc.impl.BufferChunk; +import org.apache.orc.impl.DataReaderProperties; +import org.apache.orc.impl.DirectDecompressionCodec; +import org.apache.orc.impl.HadoopShims; +import org.apache.orc.impl.HadoopShimsFactory; +import org.apache.orc.impl.InStream; +import org.apache.orc.impl.OrcCodecPool; +import org.apache.orc.impl.OrcIndex; +import org.apache.orc.impl.RecordReaderUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import java.util.function.Supplier; + +public class LlapRecordReaderUtils { + + private static final HadoopShims SHIMS = HadoopShimsFactory.get(); + private static final Logger LOG = LoggerFactory.getLogger(LlapRecordReaderUtils.class); + + static HadoopShims.ZeroCopyReaderShim createZeroCopyShim(FSDataInputStream file, CompressionCodec codec, + RecordReaderUtils.ByteBufferAllocatorPool pool) throws IOException { +return codec == null || (codec instanceof DirectDecompressionCodec && ((DirectDecompressionCodec) codec) +.isAvailable()) ? null : SHIMS.getZeroCopyReader(file, pool); Review comment: Good catch, FIXed thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 545623) Time Spent: 6h 50m (was: 6h 40m)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=545617&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545617 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 01/Feb/21 21:45 Start Date: 01/Feb/21 21:45 Worklog Time Spent: 10m Work Description: mustafaiman commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r568161096 ## File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/LlapRecordReaderUtils.java ## @@ -0,0 +1,440 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hadoop.hive.llap.io.encoded; + +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.hive.common.io.DiskRangeList; +import org.apache.hadoop.hive.ql.io.orc.encoded.LlapDataReader; +import org.apache.orc.CompressionCodec; +import org.apache.orc.CompressionKind; +import org.apache.orc.OrcFile; +import org.apache.orc.OrcProto; +import org.apache.orc.StripeInformation; +import org.apache.orc.TypeDescription; +import org.apache.orc.impl.BufferChunk; +import org.apache.orc.impl.DataReaderProperties; +import org.apache.orc.impl.DirectDecompressionCodec; +import org.apache.orc.impl.HadoopShims; +import org.apache.orc.impl.HadoopShimsFactory; +import org.apache.orc.impl.InStream; +import org.apache.orc.impl.OrcCodecPool; +import org.apache.orc.impl.OrcIndex; +import org.apache.orc.impl.RecordReaderUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import java.util.function.Supplier; + +public class LlapRecordReaderUtils { + + private static final HadoopShims SHIMS = HadoopShimsFactory.get(); + private static final Logger LOG = LoggerFactory.getLogger(LlapRecordReaderUtils.class); + + static HadoopShims.ZeroCopyReaderShim createZeroCopyShim(FSDataInputStream file, CompressionCodec codec, + RecordReaderUtils.ByteBufferAllocatorPool pool) throws IOException { +return codec == null || (codec instanceof DirectDecompressionCodec && ((DirectDecompressionCodec) codec) +.isAvailable()) ? null : SHIMS.getZeroCopyReader(file, pool); Review comment: I think this was equivalent to `codec == null || (codec instanceof DirectDecompressionCodec && ((DirectDecompressionCodec) codec).isAvailable()) ? SHIMS.getZeroCopyReader(file, pool) : null` before. Looks like `null: SHIMS.getZeroCopyReader` thing got inverted. This is an automated message from the Apache Git Service. 
Issue Time Tracking --- Worklog Id: (was: 545617) Time Spent: 6h 40m (was: 6.5h)
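The inversion the reviewer points out is easy to miss because `||` binds tighter than `?:`, so the whole disjunction is the ternary condition and swapping the two branches flips the result for every input. The sketch below uses stand-ins only: two booleans replace the `codec == null` and `DirectDecompressionCodec` availability checks, and a string replaces the value returned by `SHIMS.getZeroCopyReader(file, pool)`. It does not claim which variant is the intended one; it only shows that the two forms quoted in the thread are exact opposites.

```java
/** Sketch of the two ternary variants quoted in the review (stand-in types only). */
final class ZeroCopyTernarySketch {

  // Hypothetical stand-in for the shim returned by SHIMS.getZeroCopyReader(file, pool).
  static final String SHIM = "zero-copy shim";

  /** Form shown in the PR diff: when the condition holds, null is returned. */
  static String asInDiff(boolean codecIsNull, boolean directCodecAvailable) {
    // Parsed as (codecIsNull || directCodecAvailable) ? null : SHIM
    return codecIsNull || directCodecAvailable ? null : SHIM;
  }

  /** Form the reviewer describes as the earlier behaviour: branches swapped. */
  static String asReviewerDescribes(boolean codecIsNull, boolean directCodecAvailable) {
    return codecIsNull || directCodecAvailable ? SHIM : null;
  }

  public static void main(String[] args) {
    boolean[] values = {false, true};
    for (boolean codecIsNull : values) {
      for (boolean available : values) {
        System.out.printf("codecIsNull=%b available=%b diff=%s reviewer=%s%n",
            codecIsNull, available,
            asInDiff(codecIsNull, available),
            asReviewerDescribes(codecIsNull, available));
      }
    }
  }
}
```

For each of the four input combinations exactly one of the two forms returns the shim, which is why the swap changes behaviour everywhere rather than in a corner case.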
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=545610&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545610 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 01/Feb/21 21:38 Start Date: 01/Feb/21 21:38 Worklog Time Spent: 10m Work Description: mustafaiman commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r568157322 ## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ## @@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" + "padded to minimum allocation. For ORC, should generally be the same as the expected\n" + "compression buffer size, or next lowest power of 2. Must be a power of 2."), -LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(), +LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(), Review comment: I still do not understand why we need to change the LLAP allocator's maximum allocation size. Does the LLAP allocator serve only ORC writers? I think it is used for other buffer needs too. Hive depends on ORC, so I don't understand how ORC uses LLAP_ALLOCATOR_MAX_ALLOC for anything. We pass ORC writers the appropriate configs. If ORC writers need a smaller buffer, we can configure that for those writers via WriterOptions. There is no need to change the LLAP allocator's settings for that.
Issue Time Tracking --- Worklog Id: (was: 545610) Time Spent: 6.5h (was: 6h 20m)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=545493&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545493 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 01/Feb/21 18:19 Start Date: 01/Feb/21 18:19 Worklog Time Spent: 10m Work Description: pgaref commented on pull request #1823: URL: https://github.com/apache/hive/pull/1823#issuecomment-771057135 Tests just passed and comments are addressed above. @mustafaiman @jcamachor please take another look and let me know what you think :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 545493) Time Spent: 6h 20m (was: 6h 10m) > Upgrade ORC version to 1.6.7 > > > Key: HIVE-23553 > URL: https://issues.apache.org/jira/browse/HIVE-23553 > Project: Hive > Issue Type: Improvement >Reporter: Panagiotis Garefalakis >Assignee: Panagiotis Garefalakis >Priority: Major > Labels: pull-request-available > Time Spent: 6h 20m > Remaining Estimate: 0h > > Apache Hive is currently on 1.5.X version and in order to take advantage of > the latest ORC improvements such as column encryption we have to bump to > 1.6.X. > https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343288&styleName=&projectId=12318320&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED_4ae78f19321c7fb1e7f337fba1dd90af751d8810_lin > Even though ORC reader could work out of the box, HIVE LLAP is heavily > depending on internal ORC APIs e.g., to retrieve and store File Footers, > Tails, streams – un/compress RG data etc. As there ware many internal changes > from 1.5 to 1.6 (Input stream offsets, relative BufferChunks etc.) the > upgrade is not straightforward. > This Umbrella Jira tracks this upgrade effort. 
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=545316&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545316 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 01/Feb/21 13:07 Start Date: 01/Feb/21 13:07 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r566908774 ## File path: ql/src/test/results/clientpositive/tez/orc_merge12.q.out ## @@ -162,8 +162,8 @@ Stripe Statistics: Column 6: count: 9174 hasNull: true min: -16379.0 max: 9763215.5639 sum: 5.62236530305E7 Column 7: count: 12288 hasNull: false min: 00020767-dd8f-4f4d-bd68-4b7be64b8e44 max: fffa3516-e219-4027-b0d3-72bb2e676c52 sum: 442368 Column 8: count: 12288 hasNull: false min: 000976f7-7075-4f3f-a564-5a375fafcc101416a2b7-7f64-41b7-851f-97d15405037e max: fffd0642-5f01-48cd-8d97-3428faee49e9b39f2b4c-efdc-4e5f-9ab5-4aa5394cb156 sum: 884736 -Column 9: count: 9173 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.808 -Column 10: count: 9174 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.808 +Column 9: count: 9173 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.80899 Review comment: Yes, this is expected as we are now supporting Nanosecond precision for Timestamps: https://issues.apache.org/jira/browse/ORC-663 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 545316) Time Spent: 6h 10m (was: 6h)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=545313&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545313 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 01/Feb/21 13:05 Start Date: 01/Feb/21 13:05 Worklog Time Spent: 10m Work Description: pgaref edited a comment on pull request #1823: URL: https://github.com/apache/hive/pull/1823#issuecomment-769755146
> I only partially reviewed this. Will continue reviewing.
> One question: I see we do not care about column encryption related arguments in multiple places. Is it because column encryption is not supported?
Hey @mustafaiman, good question with a complicated answer -- while creating this I also did some digging to find out what is supported and what is not. To sum up my findings:
- It looks like we are currently able to encrypt entire tables and/or data on HDFS using KMS: HIVE-8065
- Support for column-level encryption/decryption (passing some encryption settings to the table properties and letting Hive take care of the rest) started quite a while ago as part of HIVE-6329
- There was a community discussion as part of HIVE-21848 to unify encryption table properties (at least for ORC and Parquet), which concluded with a set of accepted options
- However, these properties are still not propagated to the tables: HIVE-21849
I believe part of the reason is that Hive already integrates with Apache Ranger, which can restrict user access to particular columns and also adds data masking on top. However, I am more than happy to discuss the revival of column encryption at some point.
Issue Time Tracking --- Worklog Id: (was: 545313) Time Spent: 6h (was: 5h 50m)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=545312&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545312 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 01/Feb/21 13:03 Start Date: 01/Feb/21 13:03 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r567807669 ## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ## @@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" + "padded to minimum allocation. For ORC, should generally be the same as the expected\n" + "compression buffer size, or next lowest power of 2. Must be a power of 2."), -LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(), +LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(), Review comment: The issue here is that LLAP_ALLOCATOR_MAX_ALLOC is also used as the ORC Writer buffer size (thus the change). The initial buffer size check was introduced in [ORC-238](https://github.com/apache/orc/pull/171/files), although it was only applied when the buffer size was enforced from table properties. Later, in ORC 1.6, the check was enforced for the [Writer buffer size in general](https://github.com/apache/orc/blob/0128f817b0ab28fa2d0660737234ac966f0f5c50/java/core/src/java/org/apache/orc/impl/WriterImpl.java#L171). The bufferSize argument must stay below 2^(3*8 - 1) = 2^23 bytes, meaning less than 8Mb, and since we enforce the size to be a power of 2, the largest allowed value is 4Mb. The main question here is whether there is a reason for the upper limit to be < 8Mb (cc @prasanthj, who might know more here) -- or whether we should decouple the two configurations (LLAP alloc and ORC Writer buffer size).
I believe the best thing to do for now is open a new Ticket to track this (as this will either require more work on LLAP, or a new release on ORC) -- and I do not expect this to cause any major issues until then. @mustafaiman what do you think?
Issue Time Tracking --- Worklog Id: (was: 545312) Time Spent: 5h 50m (was: 5h 40m)
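The buffer-size arithmetic in the HiveConf.java review comment above can be checked with a small sketch. The 2^(3*8 - 1) bound and the power-of-two rule are taken from that comment; the class and method names are hypothetical and are not part of Hive or ORC.

```java
// Sketch only: the exclusive 2^23 bound and the power-of-two requirement come
// from the review comment; nothing here is Hive or ORC API.
final class WriterBufferLimit {
    /** Exclusive upper bound quoted in the comment: 2^(3*8 - 1) bytes (8 MiB). */
    static final int EXCLUSIVE_LIMIT = 1 << (3 * 8 - 1);

    private WriterBufferLimit() {
    }

    /** Largest power of two that is strictly below the given exclusive limit. */
    static int largestPowerOfTwoBelow(int exclusiveLimit) {
        if (exclusiveLimit <= 1) {
            throw new IllegalArgumentException("limit must be greater than 1");
        }
        // The top set bit of (limit - 1) is the largest power of two still allowed.
        return Integer.highestOneBit(exclusiveLimit - 1);
    }

    /** True when size is a power of two that stays below the exclusive limit. */
    static boolean isValid(int size) {
        return size > 0 && Integer.bitCount(size) == 1 && size < EXCLUSIVE_LIMIT;
    }
}
```

Under these assumptions, `WriterBufferLimit.isValid(16 << 20)` and `WriterBufferLimit.isValid(8 << 20)` are both false, while `WriterBufferLimit.isValid(4 << 20)` is true, which matches the default moving from 16Mb to 4Mb.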
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=545304&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545304 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 01/Feb/21 12:51 Start Date: 01/Feb/21 12:51 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r567800512 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/LocalCache.java ## @@ -82,8 +82,7 @@ public void put(Path path, OrcTail tail) { if (bb.capacity() != bb.remaining()) { throw new RuntimeException("Bytebuffer allocated for path: " + path + " has remaining: " + bb.remaining() + " != capacity: " + bb.capacity()); } -cache.put(path, new TailAndFileData(tail.getFileTail().getFileLength(), -tail.getFileModificationTime(), bb.duplicate())); +cache.put(path, new TailAndFileData(bb.limit(), tail.getFileModificationTime(), bb.duplicate())); Review comment: But I agree, the cache should be populated with the original **getFileTail().getFileLength()**, as it is used afterwards for comparison (thus I reverted this change) -- however, the call to ReaderImpl.extractFileTail now uses the buffer size instead.
Issue Time Tracking --- Worklog Id: (was: 545304) Time Spent: 5h 40m (was: 5.5h)
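Why the cached entry should carry the file length and not the buffer size can be illustrated with a minimal sketch. It assumes, as the LocalCache.java comment above suggests, that a cached tail is only reused when the recorded length and modification time still match the file. This is not Hive's LocalCache; every name below is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Hive's LocalCache: a tail cache whose entries are
// validated against the file's current length and modification time.
final class TailCacheSketch {
    /** Per-path record: file length and mtime observed when the tail was read. */
    private static final class Entry {
        final long fileLength;
        final long modificationTime;
        final ByteBuffer tail;

        Entry(long fileLength, long modificationTime, ByteBuffer tail) {
            this.fileLength = fileLength;
            this.modificationTime = modificationTime;
            this.tail = tail;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    void put(String path, long fileLength, long modificationTime, ByteBuffer tail) {
        cache.put(path, new Entry(fileLength, modificationTime, tail.duplicate()));
    }

    /** Returns the cached tail, or null (evicting the entry) if the file changed. */
    ByteBuffer getIfFresh(String path, long currentLength, long currentModificationTime) {
        Entry entry = cache.get(path);
        if (entry == null) {
            return null;
        }
        if (entry.fileLength != currentLength
                || entry.modificationTime != currentModificationTime) {
            cache.remove(path);
            return null;
        }
        return entry.tail.duplicate();
    }
}
```

In this sketch, storing `tail.limit()` in place of the real file length would make every later lookup miss, because the tail buffer is normally much smaller than the file it was read from.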
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=545302&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545302 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 01/Feb/21 12:48 Start Date: 01/Feb/21 12:48 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r567799059 ## File path: ql/src/test/results/clientpositive/llap/dynamic_semijoin_reduction_multicol.q.out ## @@ -355,7 +355,7 @@ Stage-1 HIVE COUNTERS: RECORDS_OUT_OPERATOR_TS_3: 800 TOTAL_TABLE_ROWS_WRITTEN: 0 Stage-1 LLAP IO COUNTERS: - CACHE_HIT_BYTES: 138344 + CACHE_MISS_BYTES: 138342 Review comment: This was a bit more complex: CacheWriter.getSparseOrcIndexFromDenseDest was called with colId = 0 from SerDeEncodedDataReader, causing an IndexOutOfBounds and the cache not being populated. This is now addressed by https://github.com/apache/hive/pull/1823/commits/da1aa077716a65c2a02d850828b16cdeece1f574
Issue Time Tracking --- Worklog Id: (was: 545302) Time Spent: 5.5h (was: 5h 20m)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=544478&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-544478 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 29/Jan/21 20:40 Start Date: 29/Jan/21 20:40 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r566867490 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ## @@ -443,7 +444,8 @@ public void setBaseAndInnerReader( return new OrcRawRecordMerger.KeyInterval(null, null); } -OrcTail orcTail = getOrcTail(orcSplit.getPath(), conf, cacheTag, orcSplit.getFileKey()).orcTail; +VectorizedOrcAcidRowBatchReader.ReaderData orcReaderData = Review comment: This is one of the breaking ORC changes introduced by encryption support. As Tail and thus StripeStatistics may be encrypted, we always need a reader instance to retrieve them. OrcTail maintained the API call for backwards compatibility but it still expects a reader to actually retrieve the stats: https://github.com/apache/orc/blob/d78cc39a9299b62bc8a5d8f5c3fac9201e03cb8b/java/core/src/java/org/apache/orc/impl/OrcTail.java#L210 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 544478) Time Spent: 5h 20m (was: 5h 10m)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=544339&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-544339 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 29/Jan/21 16:29 Start Date: 29/Jan/21 16:29 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r566941929 ## File path: ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcFile.java ## @@ -325,7 +326,7 @@ public void testReadFormat_0_11() throws Exception { + "binary,string1:string,middle:struct>>,list:array>," + "map:map>,ts:timestamp," -+ "decimal1:decimal(38,18)>", readerInspector.getTypeName()); ++ "decimal1:decimal(38,10)>", readerInspector.getTypeName()); Review comment: It seems that the type scale was not properly propagated before, and HiveDecimal.SYSTEM_DEFAULT_SCALE, which is 18, was used instead: https://github.com/apache/hive/blob/ff6f3565e50148b7bcfbcf19b970379f2bd59290/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcStruct.java#L607 This is actually in line with the ORC tests for the same file: https://github.com/apache/orc/blob/b54d10cedf5ec1529cf06d77268510c216402cba/java/core/src/test/org/apache/orc/impl/TestRecordReaderImpl.java#L141
Issue Time Tracking --- Worklog Id: (was: 544339) Time Spent: 5h (was: 4h 50m)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=544340&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-544340 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 29/Jan/21 16:29 Start Date: 29/Jan/21 16:29 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r566941929 ## File path: ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcFile.java ## @@ -325,7 +326,7 @@ public void testReadFormat_0_11() throws Exception { + "binary,string1:string,middle:struct>>,list:array>," + "map:map>,ts:timestamp," -+ "decimal1:decimal(38,18)>", readerInspector.getTypeName()); ++ "decimal1:decimal(38,10)>", readerInspector.getTypeName()); Review comment: It seems that the type scale was not properly propagated before, and HiveDecimal.SYSTEM_DEFAULT_SCALE, which is 18, was used instead: https://github.com/apache/hive/blob/ff6f3565e50148b7bcfbcf19b970379f2bd59290/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcStruct.java#L607 The current schema (decimal with scale 10) is actually in line with the ORC tests for the same file: https://github.com/apache/orc/blob/b54d10cedf5ec1529cf06d77268510c216402cba/java/core/src/test/org/apache/orc/impl/TestRecordReaderImpl.java#L141
Issue Time Tracking --- Worklog Id: (was: 544340) Time Spent: 5h 10m (was: 5h)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=544327&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-544327 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 29/Jan/21 15:56 Start Date: 29/Jan/21 15:56 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r566919513 ## File path: ql/src/test/results/clientpositive/llap/orc_file_dump.q.out ## @@ -249,15 +249,15 @@ Stripes: Entry 1: numHashFunctions: 4 bitCount: 6272 popCount: 182 loadFactor: 0.029 expectedFpp: 7.090246E-7 Stripe level merge: numHashFunctions: 4 bitCount: 6272 popCount: 1772 loadFactor: 0.2825 expectedFpp: 0.0063713384 Row group indices for column 9: - Entry 0: count: 1000 hasNull: false min: 2013-03-01 09:11:58.703 max: 2013-03-01 09:11:58.703 positions: 0,0,0,0,0,0 - Entry 1: count: 49 hasNull: false min: 2013-03-01 09:11:58.703 max: 2013-03-01 09:11:58.703 positions: 0,7,488,0,1538,488 + Entry 0: count: 1000 hasNull: false min: 2013-03-01 09:11:58.70307 max: 2013-03-01 09:11:58.703325 positions: 0,0,0,0,0,0 + Entry 1: count: 49 hasNull: false min: 2013-03-01 09:11:58.703076 max: 2013-03-01 09:11:58.703325 positions: 0,7,488,0,1538,488 Bloom filters for column 9: Entry 0: numHashFunctions: 4 bitCount: 6272 popCount: 4 loadFactor: 0.0006 expectedFpp: 1.6543056E-13 Entry 1: numHashFunctions: 4 bitCount: 6272 popCount: 4 loadFactor: 0.0006 expectedFpp: 1.6543056E-13 Stripe level merge: numHashFunctions: 4 bitCount: 6272 popCount: 4 loadFactor: 0.0006 expectedFpp: 1.6543056E-13 Row group indices for column 10: - Entry 0: count: 1000 hasNull: false min: 8 max: 9994 sum: 5118211 positions: 0,0,0,0,0 - Entry 1: count: 49 hasNull: false min: 248 max: 9490 sum: 246405 positions: 0,2194,0,4,488 + Entry 0: count: 1000 hasNull: false min: 0.08 max: 99.94 sum: 51182.11 positions: 0,0,0,0,0 Review comment: yeah, this is related to the 
TS conversion issue described above
Issue Time Tracking --- Worklog Id: (was: 544327) Time Spent: 4h 50m (was: 4h 40m)
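In the orc_file_dump.q.out hunk above, the old and new statistics for column 10 differ by exactly a factor of 100 (8 vs 0.08, 9994 vs 99.94, 5118211 vs 51182.11). One reading, which is an assumption here and not something the review comment states, is that the old output printed the unscaled integer while the new output applies a decimal scale of 2. A minimal sketch of that relationship, with a hypothetical class name:

```java
import java.math.BigDecimal;

// Sketch only: shows how an unscaled integer and a scale combine into the
// decimal values seen in the new output. The scale of 2 is an assumption
// inferred from the before/after numbers in the diff.
final class DecimalScaleSketch {
    private DecimalScaleSketch() {
    }

    /** Interprets a stored unscaled integer using the given decimal scale. */
    static BigDecimal withScale(long unscaled, int scale) {
        return BigDecimal.valueOf(unscaled, scale);
    }
}
```

With a scale of 2, `DecimalScaleSketch.withScale(8, 2)`, `withScale(9994, 2)` and `withScale(5118211, 2)` print as 0.08, 99.94 and 51182.11, the values on the added line of the hunk.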
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=544318&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-544318 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 29/Jan/21 15:41 Start Date: 29/Jan/21 15:41 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r566909094 ## File path: ql/src/test/results/clientpositive/llap/orc_file_dump.q.out ## @@ -111,7 +111,7 @@ Stripe Statistics: Column 6: count: 1049 hasNull: false bytesOnDisk: 3323 min: 0.02 max: 49.85 sum: 26286.3477 Column 7: count: 1049 hasNull: false bytesOnDisk: 137 true: 526 Column 8: count: 1049 hasNull: false bytesOnDisk: 3430 min: max: zach zipper sum: 13443 -Column 9: count: 1049 hasNull: false bytesOnDisk: 1802 min: 2013-03-01 09:11:58.703 max: 2013-03-01 09:11:58.703 +Column 9: count: 1049 hasNull: false bytesOnDisk: 1802 min: 2013-03-01 09:11:58.70307 max: 2013-03-01 09:11:58.703325 Review comment: Yes, this is expected as we are now supporting Nanosecond precision for Timestamps: https://issues.apache.org/jira/browse/ORC-663
Issue Time Tracking --- Worklog Id: (was: 544318) Time Spent: 4h 40m (was: 4.5h)
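The change in the timestamp statistics above (09:11:58.703 becoming 09:11:58.70307 and 09:11:58.703325) is what sub-millisecond precision looks like once it is no longer truncated. The sketch below reproduces the effect with java.sql.Timestamp from the JDK; it does not use ORC's statistics classes, and the helper name is hypothetical.

```java
import java.sql.Timestamp;

// Sketch only: mimics the old (millisecond) and new (finer-grained) rendering
// of the same instant with java.sql.Timestamp. Not ORC code.
final class TimestampPrecisionSketch {
    private TimestampPrecisionSketch() {
    }

    /** Returns a copy with everything below millisecond precision dropped. */
    static Timestamp truncateToMillis(Timestamp ts) {
        // getTime() only carries milliseconds, so the extra nanos are lost here.
        return new Timestamp(ts.getTime());
    }
}
```

For `Timestamp.valueOf("2013-03-01 09:11:58.703325")`, `truncateToMillis(...)` prints as `2013-03-01 09:11:58.703`, while the untruncated value prints as `2013-03-01 09:11:58.703325`. Timestamp's string form also drops trailing zeros, which is consistent with the minimum in the diff showing five fractional digits (`.70307`).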
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=544317&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-544317 ] ASF GitHub Bot logged work on HIVE-23553: - Author: ASF GitHub Bot Created on: 29/Jan/21 15:40 Start Date: 29/Jan/21 15:40 Worklog Time Spent: 10m Work Description: pgaref commented on a change in pull request #1823: URL: https://github.com/apache/hive/pull/1823#discussion_r566908774 ## File path: ql/src/test/results/clientpositive/tez/orc_merge12.q.out ## @@ -162,8 +162,8 @@ Stripe Statistics: Column 6: count: 9174 hasNull: true min: -16379.0 max: 9763215.5639 sum: 5.62236530305E7 Column 7: count: 12288 hasNull: false min: 00020767-dd8f-4f4d-bd68-4b7be64b8e44 max: fffa3516-e219-4027-b0d3-72bb2e676c52 sum: 442368 Column 8: count: 12288 hasNull: false min: 000976f7-7075-4f3f-a564-5a375fafcc101416a2b7-7f64-41b7-851f-97d15405037e max: fffd0642-5f01-48cd-8d97-3428faee49e9b39f2b4c-efdc-4e5f-9ab5-4aa5394cb156 sum: 884736 -Column 9: count: 9173 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.808 -Column 10: count: 9174 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.808 +Column 9: count: 9173 hasNull: true min: 1969-12-31 15:59:30.929 max: 1969-12-31 16:00:30.80899 Review comment: Yes, this is expected as we are now supporting Nanosecond precision for Timestamps: https://issues.apache.org/jira/browse/ORC-663
Issue Time Tracking --- Worklog Id: (was: 544317) Time Spent: 4.5h (was: 4h 20m)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=544315&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-544315 ]

ASF GitHub Bot logged work on HIVE-23553:
---
Author: ASF GitHub Bot
Created on: 29/Jan/21 15:39
Start Date: 29/Jan/21 15:39
Worklog Time Spent: 10m
Work Description: pgaref commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r566784743

## File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java ##
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
 OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
 counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
 FileTail tail = orcTail.getFileTail();
-stats = orcTail.getStripeStatisticsProto();
+CompressionKind compressionKind = orcTail.getCompressionKind();
+InStream.StreamOptions options = null;
+if (compressionKind != CompressionKind.NONE) {
+  options = InStream.options()
+      .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+}
+InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),

Review comment:
Sure, makes sense -- extracted method above.

## File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java ##
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
 OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
 counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
 FileTail tail = orcTail.getFileTail();
-stats = orcTail.getStripeStatisticsProto();
+CompressionKind compressionKind = orcTail.getCompressionKind();
+InStream.StreamOptions options = null;
+if (compressionKind != CompressionKind.NONE) {
+  options = InStream.options()
+      .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+}
+InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),
+    orcTail.getMetadataOffset(), orcTail.getMetadataSize(), options);
+stats = OrcProto.Metadata.parseFrom(InStream.createCodedInputStream(stream)).getStripeStatsList();
 stripes = new ArrayList<>(tail.getFooter().getStripesCount());
+int stripeIdx = 0;
 for (OrcProto.StripeInformation stripeProto : tail.getFooter().getStripesList()) {
-  stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto));
+  stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto, stripeIdx++, -1, null));

Review comment:
ORC 1.5 does not support encryption. Since ORC-523, stripe info can also be encrypted, hence the extra arguments here. As we are not yet using encryption on LLAP, the last 2 params can be left unset (-1 and null), while I am keeping an incremental StripeId as it is used in a couple of places like the StripePlanner.

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/LlapDataReader.java ##
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.io.orc.encoded;
+
+import org.apache.hadoop.hive.common.io.DiskRangeList;
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcProto;
+import org.apache.orc.StripeInformation;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.impl.OrcIndex;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+/** An abstract data reader that IO formats can use to read bytes from underlying storage. */
+public interface LlapDataReader extends AutoCloseable, Cloneable {
+
+  /** Opens the DataReader, making it ready to use. */
+  void open() throws IOException;
+
+  OrcIndex readRowIndex(StripeInformation stripe,
+ T
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=544234&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-544234 ]

ASF GitHub Bot logged work on HIVE-23553:
---
Author: ASF GitHub Bot
Created on: 29/Jan/21 11:41
Start Date: 29/Jan/21 11:41
Worklog Time Spent: 10m
Work Description: pgaref commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-769755146

> I only partially reviewed this. Will continue reviewing.
> One question: I see we do not care about column encryption related arguments in multiple places. Is it because column encryption is not supported?

Hey @mustage good question with a complicated answer -- while creating this I also did some digging to find out what is supported and what is not. To sum up my findings:
- It looks like we are currently able to encrypt entire tables and/or data on HDFS using KMS: HIVE-8065
- Support for column-level encryption/decryption (passing some encryption settings to the table props and letting Hive take care of the rest) started quite a while ago as part of HIVE-6329
- There was a community discussion as part of HIVE-21848 to unify encryption table properties (at least for ORC and Parquet) that concluded with a set of accepted options
- However, these properties are still not propagated to the tables: HIVE-21849

I believe part of the reason is that Hive already integrates with Apache Ranger, which can restrict user access to particular columns and also adds data masking on top. However, I am more than happy to discuss the revival of column encryption at some point.

Issue Time Tracking
---
Worklog Id: (was: 544234)
Time Spent: 4h 10m (was: 4h)
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=542604&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-542604 ]

ASF GitHub Bot logged work on HIVE-23553:
---
Author: ASF GitHub Bot
Created on: 27/Jan/21 03:28
Start Date: 27/Jan/21 03:28
Worklog Time Spent: 10m
Work Description: jcamachor commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r564992785

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFileFormatProxy.java ##
@@ -47,12 +47,14 @@ public SplitInfos applySargToMetadata(
 OrcTail orcTail = ReaderImpl.extractFileTail(fileMetadata);
 OrcProto.Footer footer = orcTail.getFooter();
 int stripeCount = footer.getStripesCount();
-boolean writerUsedProlepticGregorian = footer.hasCalendar()
-    ? footer.getCalendar() == OrcProto.CalendarKind.PROLEPTIC_GREGORIAN
-    : OrcConf.PROLEPTIC_GREGORIAN_DEFAULT.getBoolean(conf);
+// Always convert To PROLEPTIC_GREGORIAN

Review comment:
Why is it OK to always use the proleptic calendar here? Could we leave a short explanation in the comment for when we need to revisit this code?

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedReaderImpl.java ##
@@ -282,6 +280,56 @@ public String toString() {
 }
 }
+ public static boolean[] findPresentStreamsByColumn(

Review comment:
Can we add javadoc for these public static utility methods? If they are used only in this class, should we change their visibility?

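For readers unfamiliar with what the `PROLEPTIC_GREGORIAN` flag in the OrcFileFormatProxy diff above distinguishes, the sketch below compares the hybrid Julian/Gregorian calendar (the default of `java.util.GregorianCalendar`) with the proleptic Gregorian calendar (used by `java.time.LocalDate`). It is only an illustration of the calendar difference; it is not Hive's or ORC's conversion code, and the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class CalendarSketch {
    private static final long MILLIS_PER_DAY = 86_400_000L;

    /** Epoch day of <year>-01-01 read in the hybrid Julian/Gregorian calendar. */
    static long hybridEpochDay(int year) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();                       // zero all fields, including time of day
        cal.set(year, Calendar.JANUARY, 1);
        return Math.floorDiv(cal.getTimeInMillis(), MILLIS_PER_DAY);
    }

    /** Epoch day of the same date label read in the proleptic Gregorian calendar. */
    static long prolepticEpochDay(int year) {
        return LocalDate.of(year, 1, 1).toEpochDay();
    }

    public static void main(String[] args) {
        for (int year : new int[] {1500, 2000}) {
            long diff = hybridEpochDay(year) - prolepticEpochDay(year);
            System.out.println(year + "-01-01 differs by " + diff + " day(s)");
        }
    }
}
```

Dates after the 1582 cutover map to the same day in both calendars, so the choice only matters for older values; that is the case a code comment explaining the "always proleptic" decision would presumably need to cover.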
## File path: ql/src/test/results/clientpositive/llap/schema_evol_orc_nonvec_part_all_primitive.q.out ##
@@ -687,11 +687,11 @@ POSTHOOK: Input: default@part_change_various_various_timestamp_n6
 POSTHOOK: Input: default@part_change_various_various_timestamp_n6@part=1
 A masked pattern was here
 insert_num  part  c1  c2  c3  c4  c5  c6  c7  c8  c9  c10  c11  c12  b
-101  1  1970-01-01 00:00:00.001  1969-12-31 23:59:59.872  NULL  1969-12-07 03:28:36.352  NULL  NULL  NULL  NULL  6229-06-28 02:54:28.970117179  6229-06-28 02:54:28.97011  6229-06-28 02:54:28.97011  1950-12-18 00:00:00  original
-102  1  1970-01-01 00:00:00  1970-01-01 00:00:00.127  1970-01-01 00:00:32.767  1970-01-25 20:31:23.647  NULL  NULL  NULL  NULL  5966-07-09 03:30:50.597  5966-07-09 03:30:50.597  5966-07-09 03:30:50.597  2049-12-18 00:00:00  original
+101  1  1970-01-01 00:00:01  1969-12-31 23:57:52  NULL  1901-12-13 20:45:52  NULL  NULL  NULL  NULL  6229-06-28 02:54:28.970117179  6229-06-28 02:54:28.97011  6229-06-28 02:54:28.97011  1950-12-18 00:00:00  original

Review comment:
This shift in timestamp values does not seem right (or at least I cannot make sense of it). Could you explain what is going on here? Some of the shifts are significant: for those, I remember there were some backwards-incompatible changes in schema evolution in 1.6.x; could it be related to that? However, other shifts seem a bit more suspicious, e.g., 1 second, ~2 minutes?

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java ##
@@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex,
 .setColumnEncoding(columnEncoding)
 .setVectors(vectors)
 .setContext(context)
+.setIsInstant(columnType.getCategory() == TypeDescription.Category.TIMESTAMP_INSTANT)

Review comment:
As @mustafaiman mentioned, I think this should indeed always be false: TIMESTAMP_INSTANT is equivalent to the TIMESTAMP_WITH_LOCAL_TIME_ZONE type in Hive. AFAIK, support for reading/writing timestamp with local time zone in ORC is not implemented yet.

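As a worked check of the shifted values in the schema_evol_orc_nonvec_part_all_primitive.q.out diff above: the old and new timestamps in row 101 are arithmetically consistent with the same integer being read once as milliseconds and once as seconds since the epoch. This reading is not confirmed anywhere in the thread, and the sample integers below (1, the smallest byte, the smallest int) are assumptions picked because they reproduce the printed values in UTC; the class and method names are hypothetical.

```java
import java.time.Instant;

public class EpochUnitSketch {
    /** Interprets an integer value as milliseconds since 1970-01-01T00:00:00Z. */
    static Instant asMillis(long value) {
        return Instant.ofEpochMilli(value);
    }

    /** Interprets the same integer value as seconds since 1970-01-01T00:00:00Z. */
    static Instant asSeconds(long value) {
        return Instant.ofEpochSecond(value);
    }

    public static void main(String[] args) {
        // Assumed source values; the thread does not show the original integer columns.
        long[] samples = {1L, Byte.MIN_VALUE, Integer.MIN_VALUE};
        for (long v : samples) {
            System.out.println(v + ": as ms = " + asMillis(v) + ", as s = " + asSeconds(v));
        }
    }
}
```

Under that assumption, -128 gives 23:59:59.872 as milliseconds and 23:57:52 as seconds (the "~2 minutes" shift), while the smallest int gives 1969-12-07 03:28:36.352 and 1901-12-13 20:45:52 respectively.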
## File path: ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcFile.java ## @@ -325,7 +326,7 @@ public void testReadFormat_0_11() throws Exception { + "binary,string1:string,middle:struct>>,list:array>," + "map:map>,ts:timestamp," -+ "decimal1:decimal(38,18)>", readerInspector.getTypeName()); ++ "decimal1:decimal(38,10)>", readerInspector.getTypeName()); Review comment: Change in decimal scale. Expected? ## File path: ql/src/test/results/clientpositive/llap/orc_file_dump.q.out ## @@ -249,15 +249,15 @@ Stripes: Entry 1: numHashFunctions: 4 bitCount: 6272 popCount: 182 loadFactor: 0.029 expectedFpp: 7.090246E-7 Stripe level merge: numHashFunctions: 4 bitCount: 6272 popCount: 1772 loadFactor: 0.2825 expectedFpp: 0.0063713384 Row group indices for column 9: - Entry 0: count: 1000 hasNull: false min: 2013-03-01 09:11:58.703 max: 2013-03-01 09:11:58.703 positions: 0,0,0,0,0,0 -
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=541832&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-541832 ]

ASF GitHub Bot logged work on HIVE-23553:
---
Author: ASF GitHub Bot
Created on: 26/Jan/21 04:24
Start Date: 26/Jan/21 04:24
Worklog Time Spent: 10m
Work Description: mustafaiman commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r563956041

## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java ##
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal
 "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
 "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
 "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
You say this is changed to be compatible with the ORC setting. I do not understand why this is necessary and what its impact is. This looks like a change that is not to be taken lightly.

## File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/LlapRecordReaderUtils.java ##
@@ -0,0 +1,438 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.llap.io.encoded;
+
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.common.io.DiskRangeList;
+import org.apache.hadoop.hive.ql.io.orc.encoded.LlapDataReader;
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcProto;
+import org.apache.orc.StripeInformation;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.impl.BufferChunk;
+import org.apache.orc.impl.DataReaderProperties;
+import org.apache.orc.impl.DirectDecompressionCodec;
+import org.apache.orc.impl.HadoopShims;
+import org.apache.orc.impl.HadoopShimsFactory;
+import org.apache.orc.impl.InStream;
+import org.apache.orc.impl.OrcCodecPool;
+import org.apache.orc.impl.OrcIndex;
+import org.apache.orc.impl.RecordReaderUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.function.Supplier;
+
+public class LlapRecordReaderUtils {
+
+  private static final HadoopShims SHIMS = HadoopShimsFactory.get();
+  private static final Logger LOG = LoggerFactory.getLogger(LlapRecordReaderUtils.class);
+
+  static HadoopShims.ZeroCopyReaderShim createZeroCopyShim(FSDataInputStream file, CompressionCodec codec, RecordReaderUtils.ByteBufferAllocatorPool pool) throws IOException {
+    return codec != null && (!(codec instanceof DirectDecompressionCodec) || !((DirectDecompressionCodec)codec).isAvailable()) ? null : SHIMS.getZeroCopyReader(file, pool);

Review comment:
Can you invert this condition? There are a lot of negatives, which makes it hard to understand. `codec == null || (codec instanceof DirectDecompressionCodec && ((DirectDecompressionCodec) codec).isAvailable()) ? SHIMS.getZeroCopyReader(file, pool) : null` is much more understandable.

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -443,7 +444,8 @@ public void setBaseAndInnerReader(
 return new OrcRawRecordMerger.KeyInterval(null, null);
 }
-OrcTail orcTail = getOrcTail(orcSplit.getPath(), conf, cacheTag, orcSplit.getFileKey()).orcTail;
+VectorizedOrcAcidRowBatchReader.ReaderData orcReaderData =

Review comment:
I am not sure about this. Previously we did not create the full reader. Why do we need to create the reader now? All calls from here use orcTail anyway except `List stats = orcReaderData.reader.
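The inversion proposed in the `createZeroCopyShim` review comment above is an application of De Morgan's laws: negating `A && (!B || !C)` gives `!A || (B && C)`, and the two branches of the ternary swap. The sketch below checks that equivalence on every kind of input. `Codec` and `DirectCodec` are hypothetical stand-ins for ORC's `CompressionCodec` and `DirectDecompressionCodec`, so this is an illustration of the logic only, not code from the PR.

```java
public class ZeroCopyConditionSketch {
    /** Hypothetical stand-in for ORC's CompressionCodec. */
    interface Codec {}

    /** Hypothetical stand-in for ORC's DirectDecompressionCodec. */
    interface DirectCodec extends Codec {
        boolean isAvailable();
    }

    /** Condition as written in the diff: true means "return null" (no zero-copy reader). */
    static boolean skipsZeroCopyOriginal(Codec codec) {
        return codec != null
            && (!(codec instanceof DirectCodec) || !((DirectCodec) codec).isAvailable());
    }

    /** Inverted condition from the review: true means "create the zero-copy reader". */
    static boolean usesZeroCopySuggested(Codec codec) {
        return codec == null
            || (codec instanceof DirectCodec && ((DirectCodec) codec).isAvailable());
    }

    /** Returns true when the two forms are exact complements for all four input kinds. */
    static boolean formsAgree() {
        Codec[] cases = {
            null,                        // no compression
            new Codec() {},              // codec without direct decompression
            (DirectCodec) () -> true,    // direct codec, available
            (DirectCodec) () -> false    // direct codec, unavailable
        };
        for (Codec c : cases) {
            if (skipsZeroCopyOriginal(c) == usesZeroCopySuggested(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("forms agree: " + formsAgree());
    }
}
```

Because the suggested form states the positive case directly, the zero-copy reader is created only when there is no codec or when a direct codec is available, which matches the behaviour of the original expression.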
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=541348&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-541348 ]

ASF GitHub Bot logged work on HIVE-23553:
---
Author: ASF GitHub Bot
Created on: 25/Jan/21 22:50
Start Date: 25/Jan/21 22:50
Worklog Time Spent: 10m
Work Description: mustafaiman commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r564077084

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java ##
@@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex,
 .setColumnEncoding(columnEncoding)
 .setVectors(vectors)
 .setContext(context)
+.setIsInstant(columnType.getCategory() == TypeDescription.Category.TIMESTAMP_INSTANT)

Review comment:
Isn't this always `false`? Don't we need another case for TIMESTAMP_INSTANT?

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/LlapDataReader.java ##
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.io.orc.encoded;
+
+import org.apache.hadoop.hive.common.io.DiskRangeList;
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcProto;
+import org.apache.orc.StripeInformation;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.impl.OrcIndex;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+/** An abstract data reader that IO formats can use to read bytes from underlying storage. */
+public interface LlapDataReader extends AutoCloseable, Cloneable {
+
+  /** Opens the DataReader, making it ready to use. */
+  void open() throws IOException;
+
+  OrcIndex readRowIndex(StripeInformation stripe,
+      TypeDescription fileSchema,
+      OrcProto.StripeFooter footer,
+      boolean ignoreNonUtf8BloomFilter,
+      boolean[] included,
+      OrcProto.RowIndex[] indexes,
+      boolean[] sargColumns,
+      OrcFile.WriterVersion version,
+      OrcProto.Stream.Kind[] bloomFilterKinds,
+      OrcProto.BloomFilterIndex[] bloomFilterIndices
+  ) throws IOException;
+
+  OrcProto.StripeFooter readStripeFooter(StripeInformation stripe) throws IOException;
+
+  /** Reads the data.
+   *
+   * Note that for the cases such as zero-copy read, caller must release the disk ranges
+   * produced after being done with them. Call isTrackingDiskRanges to find out if this is needed.
+   * @param range List if disk ranges to read. Ranges with data will be ignored.
+   * @param baseOffset Base offset from the start of the file of the ranges in disk range list.
+   * @param doForceDirect Whether the data should be read into direct buffers.
+   * @return New or modified list of DiskRange-s, where all the ranges are filled with data.
+   */
+  DiskRangeList readFileData(
+      DiskRangeList range, long baseOffset, boolean doForceDirect) throws IOException;
+
+
+  /**
+   * Whether the user should release buffers created by readFileData. See readFileData javadoc.
+   */
+  boolean isTrackingDiskRanges();
+
+  /**
+   * Releases buffers created by readFileData. See readFileData javadoc.
+   * @param toRelease The buffer to release.
+   */
+  void releaseBuffer(ByteBuffer toRelease);
+
+  /**
+   * Clone the entire state of the DataReader with the assumption that the
+   * clone will be closed at a different time. Thus, any file handles in the
+   * implementation need to be cloned.
+   * @return a new instance
+   */
+  LlapDataReader clone();
+
+  @Override
+  void close() throws IOException;
+
+  /**
+   * Returns the compression codec used by this datareader.
+   * We should consider removing this from the interface.
+   * @return the compression codec
+   */
+  CompressionCodec getCompressionCodec();

Review comment:
This interface looks like a copy of ORC's DataReader except for this method. ORC's DataReader returns StreamOptions instead of CompressionCodec. As far as I understand, StreamOptions includes the
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=541286&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-541286 ]

ASF GitHub Bot logged work on HIVE-23553:
---
Author: ASF GitHub Bot
Created on: 25/Jan/21 20:20
Start Date: 25/Jan/21 20:20
Worklog Time Spent: 10m
Work Description: mustafaiman commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r563956041
[jira] [Work logged] (HIVE-23553) Upgrade ORC version to 1.6.7
[ https://issues.apache.org/jira/browse/HIVE-23553?focusedWorklogId=540403&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-540403 ]

ASF GitHub Bot logged work on HIVE-23553:
---
Author: ASF GitHub Bot
Created on: 22/Jan/21 23:47
Start Date: 22/Jan/21 23:47
Worklog Time Spent: 10m
Work Description: dongjoon-hyun commented on pull request #1823:
URL: https://github.com/apache/hive/pull/1823#issuecomment-765763006

Apache ORC 1.6.7 is officially released. Could you update the PR to use the official one, @pgaref?

Issue Time Tracking
---
Worklog Id: (was: 540403)
Time Spent: 3h 20m (was: 3h 10m)