[jira] [Work logged] (HIVE-26692) Check for the expected thrift version before compiling
[ https://issues.apache.org/jira/browse/HIVE-26692?focusedWorklogId=831308=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831308 ] ASF GitHub Bot logged work on HIVE-26692: - Author: ASF GitHub Bot Created on: 06/Dec/22 07:55 Start Date: 06/Dec/22 07:55 Worklog Time Spent: 10m Work Description: sonarcloud[bot] commented on PR #3820: URL: https://github.com/apache/hive/pull/3820#issuecomment-1338922096 Kudos, SonarCloud Quality Gate passed! 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 0 Code Smells, No Coverage information, No Duplication information. Issue Time Tracking --- Worklog Id: (was: 831308) Time Spent: 2h 40m (was: 2.5h) > Check for the expected thrift version before compiling > -- > > Key: HIVE-26692 > URL: https://issues.apache.org/jira/browse/HIVE-26692 > Project: Hive > Issue Type: Task > Components: Thrift API > Affects Versions: 4.0.0-alpha-2 > Reporter: Alessandro Solimando > Assignee: Alessandro Solimando > Priority: Major > Labels: pull-request-available > Time Spent: 2h 40m > Remaining Estimate: 0h > > At the moment we don't check for the thrift version before launching thrift, > the error messages are often cryptic upon mismatches.
> An explicit check with a clear error message would be nice, like what parquet > does: > [https://github.com/apache/parquet-mr/blob/master/parquet-thrift/pom.xml#L247-L268] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
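The check proposed in HIVE-26692 boils down to comparing the version string that `thrift --version` prints against the release the build expects, and failing fast with a clear message. A minimal sketch of that comparison logic (names are illustrative, not the actual build change; the banner format assumed is thrift's usual "Thrift version X.Y.Z"):

```java
// Hypothetical sketch of the version-match logic a pre-compile check could use.
public final class ThriftVersionCheck {

    /** Extracts the bare version from thrift's banner and compares it to the expected release. */
    static boolean matchesExpected(String versionBanner, String expectedVersion) {
        // `thrift --version` prints e.g. "Thrift version 0.16.0"
        String actual = versionBanner.replace("Thrift version", "").trim();
        return actual.equals(expectedVersion);
    }

    public static void main(String[] args) {
        if (!matchesExpected("Thrift version 0.16.0", "0.16.0")) {
            throw new IllegalStateException(
                "Expected thrift 0.16.0 but found a different version; install the matching release.");
        }
        System.out.println("thrift version OK");
    }
}
```

In a Maven build this comparison would sit behind an enforcer or exec step, as in the parquet-mr pom the issue links to.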
[jira] [Work logged] (HIVE-26770) Make "end of loop" compaction logs appear more selectively
[ https://issues.apache.org/jira/browse/HIVE-26770?focusedWorklogId=831304=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831304 ] ASF GitHub Bot logged work on HIVE-26770: - Author: ASF GitHub Bot Created on: 06/Dec/22 07:50 Start Date: 06/Dec/22 07:50 Worklog Time Spent: 10m Work Description: deniskuzZ commented on code in PR #3832: URL: https://github.com/apache/hive/pull/3832#discussion_r1040607921 ## ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorThread.java: ## @@ -61,6 +61,13 @@ public abstract class CompactorThread extends Thread implements Configurable { protected String hostName; protected String runtimeVersion; + //Time threshold for compactor thread log + //In milliseconds: + protected Integer MAX_WARN_LOG_TIME = 120; //20 min + + protected long checkInterval; + + public enum CompactorThreadType {INITIATOR, WORKER, CLEANER} @Override Review Comment: new line + should it be public or package-private is enough? Issue Time Tracking --- Worklog Id: (was: 831304) Time Spent: 5h 40m (was: 5.5h) > Make "end of loop" compaction logs appear more selectively > -- > > Key: HIVE-26770 > URL: https://issues.apache.org/jira/browse/HIVE-26770 > Project: Hive > Issue Type: Improvement >Affects Versions: 4.0.0-alpha-1 >Reporter: Akshat Mathur >Assignee: Akshat Mathur >Priority: Major > Labels: pull-request-available > Time Spent: 5h 40m > Remaining Estimate: 0h > > Currently Initiator, Worker, and Cleaner threads log something like "finished > one loop" on INFO level. > This is useful to figure out if one of these threads is taking too long to > finish a loop, but expensive in general. 
> > Suggested Time: 20mins > Logging this should be changed in the following way > # If the loop finished within a predefined amount of time, level should be DEBUG > and message should look like: *Initiator loop took \{elapsedTime} seconds to > finish.* > # If the loop ran longer than this predefined amount, level should be WARN and > message should look like: *Possible Initiator slowdown, loop took > \{elapsedTime} seconds to finish.* -- This message was sent by Atlassian Jira (v8.20.10#820010)
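The two-level rule described in the HIVE-26770 issue can be sketched as a small pure function (names and the threshold value are illustrative, not the PR's actual code): DEBUG below a predefined duration, WARN with a "possible slowdown" message above it.

```java
// Minimal sketch of the proposed "end of loop" logging rule.
public final class CompactorLoopLog {

    /** Builds the log line for one finished loop of the given compactor thread. */
    static String logLineFor(String threadName, long elapsedSeconds, long warnThresholdSeconds) {
        if (elapsedSeconds > warnThresholdSeconds) {
            return "WARN Possible " + threadName + " slowdown, loop took "
                + elapsedSeconds + " seconds to finish.";
        }
        return "DEBUG " + threadName + " loop took " + elapsedSeconds + " seconds to finish.";
    }

    public static void main(String[] args) {
        System.out.println(logLineFor("Initiator", 30, 1200));   // fast loop -> DEBUG level
        System.out.println(logLineFor("Initiator", 2400, 1200)); // slow loop -> WARN level
    }
}
```

In the real code the branch would of course select the logger method (`LOG.debug` vs `LOG.warn`) rather than prefix a string.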
[jira] [Work logged] (HIVE-26770) Make "end of loop" compaction logs appear more selectively
[ https://issues.apache.org/jira/browse/HIVE-26770?focusedWorklogId=831300=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831300 ] ASF GitHub Bot logged work on HIVE-26770: - Author: ASF GitHub Bot Created on: 06/Dec/22 07:47 Start Date: 06/Dec/22 07:47 Worklog Time Spent: 10m Work Description: deniskuzZ commented on code in PR #3832: URL: https://github.com/apache/hive/pull/3832#discussion_r1040605656 ## ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Worker.java: ## @@ -138,6 +142,7 @@ public void run() { @Override public void init(AtomicBoolean stop) throws Exception { super.init(stop); +checkInterval = 0; Review Comment: set it to 0 in the declaration Issue Time Tracking --- Worklog Id: (was: 831300) Time Spent: 5.5h (was: 5h 20m) > Make "end of loop" compaction logs appear more selectively > -- > > Key: HIVE-26770 > URL: https://issues.apache.org/jira/browse/HIVE-26770 > Project: Hive > Issue Type: Improvement > Affects Versions: 4.0.0-alpha-1 > Reporter: Akshat Mathur > Assignee: Akshat Mathur > Priority: Major > Labels: pull-request-available > Time Spent: 5.5h > Remaining Estimate: 0h > > Currently Initiator, Worker, and Cleaner threads log something like "finished > one loop" on INFO level. > This is useful to figure out if one of these threads is taking too long to > finish a loop, but expensive in general. > > Suggested Time: 20mins > Logging this should be changed in the following way > # If the loop finished within a predefined amount of time, level should be DEBUG > and message should look like: *Initiator loop took \{elapsedTime} seconds to > finish.* > # If the loop ran longer than this predefined amount, level should be WARN and > message should look like: *Possible Initiator slowdown, loop took > \{elapsedTime} seconds to finish.* -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-23891) Using UNION sql clause and speculative execution can cause file duplication in Tez
[ https://issues.apache.org/jira/browse/HIVE-23891?focusedWorklogId=831287=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831287 ] ASF GitHub Bot logged work on HIVE-23891: - Author: ASF GitHub Bot Created on: 06/Dec/22 07:31 Start Date: 06/Dec/22 07:31 Worklog Time Spent: 10m Work Description: sonarcloud[bot] commented on PR #3836: URL: https://github.com/apache/hive/pull/3836#issuecomment-1338903701 Kudos, SonarCloud Quality Gate passed! 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 8 Code Smells, No Coverage information, No Duplication information. Issue Time Tracking --- Worklog Id: (was: 831287) Time Spent: 2h 50m (was: 2h 40m) > Using UNION sql clause and speculative execution can cause file duplication > in Tez > -- > > Key: HIVE-23891 > URL: https://issues.apache.org/jira/browse/HIVE-23891 > Project: Hive > Issue Type: Bug > Reporter: George Pachitariu > Assignee: George Pachitariu > Priority: Major > Labels: pull-request-available > Attachments: HIVE-23891.1.patch > > Time Spent: 2h 50m > Remaining Estimate: 0h > > Hello, > the specific scenario when this can happen: > - the execution engine is Tez; > - speculative execution is on; > - the query inserts into a table and the last step is a UNION sql clause; > The problem is that Tez creates an extra layer of subdirectories when
there > is a UNION. Later, when deduplicating, Hive doesn't take that into account > and only deduplicates folders but not the files inside. > So for a query like this: > {code:sql} > insert overwrite table union_all > select * from union_first_part > union all > select * from union_second_part; > {code} > The folder structure afterwards will be like this (a possible example): > {code:java} > .../union_all/HIVE_UNION_SUBDIR_1/00_0 > .../union_all/HIVE_UNION_SUBDIR_1/00_1 > .../union_all/HIVE_UNION_SUBDIR_2/00_1 > {code} > The attached patch increases the number of folder levels that Hive will check > recursively for duplicates when we have a UNION in Tez.
[jira] [Assigned] (HIVE-26810) Replace HiveFilterSetOpTransposeRule onMatch method with Calcite's built-in implementation
[ https://issues.apache.org/jira/browse/HIVE-26810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alessandro Solimando reassigned HIVE-26810: --- > Replace HiveFilterSetOpTransposeRule onMatch method with Calcite's built-in > implementation > -- > > Key: HIVE-26810 > URL: https://issues.apache.org/jira/browse/HIVE-26810 > Project: Hive > Issue Type: Task > Components: CBO > Affects Versions: 4.0.0-alpha-2 > Reporter: Alessandro Solimando > Assignee: Alessandro Solimando > Priority: Major > > After HIVE-26762, the _onMatch_ method is now the same as in the Calcite > implementation, so we can drop Hive's override in order to avoid the risk of > them drifting apart again. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics
[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=831286=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831286 ] ASF GitHub Bot logged work on HIVE-26221: - Author: ASF GitHub Bot Created on: 06/Dec/22 07:28 Start Date: 06/Dec/22 07:28 Worklog Time Spent: 10m Work Description: dengzhhu653 commented on code in PR #3137: URL: https://github.com/apache/hive/pull/3137#discussion_r1040592119 ## standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/StatisticsTestUtils.java: ## @@ -109,4 +135,116 @@ public static HyperLogLog createHll(String... values) { } return hll; } + + /** + * Creates an HLL object initialized with the given values. + * @param values the values to be added + * @return an HLL object initialized with the given values. + */ + public static HyperLogLog createHll(double... values) { +HyperLogLog hll = HyperLogLog.builder().build(); +Arrays.stream(values).forEach(hll::addDouble); +return hll; + } + + /** + * Creates a KLL object initialized with the given values. + * @param values the values to be added + * @return a KLL object initialized with the given values. + */ + public static KllFloatsSketch createKll(float... values) { +KllFloatsSketch kll = new KllFloatsSketch(); +for (float value : values) { + kll.update(value); +} +return kll; + } + + /** + * Creates a KLL object initialized with the given values. + * @param values the values to be added + * @return a KLL object initialized with the given values. + */ + public static KllFloatsSketch createKll(double... values) { +KllFloatsSketch kll = new KllFloatsSketch(); +for (double value : values) { + kll.update(Double.valueOf(value).floatValue()); +} +return kll; + } + + /** + * Creates a KLL object initialized with the given values. + * @param values the values to be added + * @return a KLL object initialized with the given values. + */ + public static KllFloatsSketch createKll(long... 
values) { +KllFloatsSketch kll = new KllFloatsSketch(); +for (long value : values) { + kll.update(value); +} +return kll; + } + + /** + * Checks if expected and computed statistics data are equal. + * @param expected expected statistics data + * @param computed computed statistics data + */ + public static void assertEqualStatistics(ColumnStatisticsData expected, ColumnStatisticsData computed) { +if (expected.getSetField() != computed.getSetField()) { + throw new IllegalArgumentException("Expected data is of type " + expected.getSetField() + + " while computed data is of type " + computed.getSetField()); +} + +Class dataClass = null; +switch (expected.getSetField()) { +case DATE_STATS: + dataClass = DateColumnStatsData.class; + break; +case LONG_STATS: + dataClass = LongColumnStatsData.class; + break; +case DOUBLE_STATS: + dataClass = DoubleColumnStatsData.class; + break; +case DECIMAL_STATS: + dataClass = DecimalColumnStatsData.class; + break; +case TIMESTAMP_STATS: + dataClass = TimestampColumnStatsData.class; + break; +default: + // it's an unsupported class for KLL, no special treatment needed + Assert.assertEquals(expected, computed); + return; +} +assertEqualStatistics(expected, computed, dataClass); + } + + private static void assertEqualStatistics( Review Comment: This function only compares the `histogram`, and does not tell us much when either `computedHasHistograms` or `expectedHasHistograms` is false. Could we compare the `ColumnStatisticsData` by `Assert.assertEquals(expected, computed);` as we did in Line 219? 
Issue Time Tracking --- Worklog Id: (was: 831286) Time Spent: 4.5h (was: 4h 20m) > Add histogram-based column statistics > - > > Key: HIVE-26221 > URL: https://issues.apache.org/jira/browse/HIVE-26221 > Project: Hive > Issue Type: Improvement > Components: CBO, Metastore, Statistics > Affects Versions: 4.0.0-alpha-2 > Reporter: Alessandro Solimando > Assignee: Alessandro Solimando > Priority: Major > Labels: pull-request-available > Time Spent: 4.5h > Remaining Estimate: 0h > > Hive does not support histogram statistics, which are particularly useful for > skewed data (which is very common in practice) and range predicates. > Hive's current selectivity estimation for range predicates is based on a > hard-coded value of 1/3 (see > [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics
[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=831280=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831280 ] ASF GitHub Bot logged work on HIVE-26221: - Author: ASF GitHub Bot Created on: 06/Dec/22 07:10 Start Date: 06/Dec/22 07:10 Worklog Time Spent: 10m Work Description: asolimando commented on code in PR #3137: URL: https://github.com/apache/hive/pull/3137#discussion_r1040578295 ## standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/columnstats/ColStatsBuilder.java: ## @@ -103,6 +105,12 @@ public ColStatsBuilder hll(String... values) { return this; } + public ColStatsBuilder hll(double... values) { +HyperLogLog hll = StatisticsTestUtils.createHll(values); +this.bitVector = hll.serialize(); Review Comment: No, HLL is different from KLL; it's used for counting distinct values. The method naming is different because HLL has an in-house implementation in Hive, while KLL comes from the Apache DataSketches library. Issue Time Tracking --- Worklog Id: (was: 831280) Time Spent: 4h 20m (was: 4h 10m) > Add histogram-based column statistics > - > > Key: HIVE-26221 > URL: https://issues.apache.org/jira/browse/HIVE-26221 > Project: Hive > Issue Type: Improvement > Components: CBO, Metastore, Statistics > Affects Versions: 4.0.0-alpha-2 > Reporter: Alessandro Solimando > Assignee: Alessandro Solimando > Priority: Major > Labels: pull-request-available > Time Spent: 4h 20m > Remaining Estimate: 0h > > Hive does not support histogram statistics, which are particularly useful for > skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a > hard-coded value of 1/3 (see > [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]). > The current proposal aims at integrating histograms as an additional column > statistic, stored into the Hive metastore at the table (or partition) level. > The main requirements for histogram integration are the following: > * efficiency: the approach must scale and support billions of rows > * merge-ability: partition-level histograms have to be merged to form > table-level histograms > * explicit and configurable trade-off between memory footprint and accuracy > Hive already integrates a [KLL data > sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF. > Datasketches are small, stateful programs that process massive data-streams > and can provide approximate answers, with mathematical guarantees, to > computationally difficult queries orders-of-magnitude faster than > traditional, exact methods. > We propose to use KLL, and more specifically the cumulative distribution > function (CDF), as the underlying data structure for our histogram statistics. > The current proposal targets numeric data types (float, integer and numeric > families) and temporal data types (date and timestamp). -- This message was sent by Atlassian Jira (v8.20.10#820010)
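The CDF idea at the heart of the HIVE-26221 proposal can be illustrated with a self-contained sketch (ours, not Hive's code): an empirical CDF over sampled column values answers a range-predicate selectivity question directly, where a KLL sketch would give the same answer approximately and with bounded memory.

```java
import java.util.Arrays;

// Illustrative sketch: an exact empirical CDF replacing the hard-coded 1/3
// selectivity for a range predicate such as "col < x".
public final class CdfSelectivity {

    /** Empirical CDF: fraction of values strictly less than x. */
    static double selectivityLessThan(float[] values, float x) {
        float[] sorted = values.clone();
        Arrays.sort(sorted);
        int count = 0;
        while (count < sorted.length && sorted[count] < x) {
            count++;
        }
        return (double) count / sorted.length;
    }

    public static void main(String[] args) {
        // A skewed column, common in practice; a fixed 1/3 estimate would be badly off.
        float[] skewed = {1f, 2f, 2f, 2f, 2f, 3f, 50f, 100f};
        System.out.println(selectivityLessThan(skewed, 3f)); // 0.625
    }
}
```

The merge-ability requirement is where the sketch pays off: two KLL sketches built on different partitions can be merged into one table-level sketch, which is not possible with raw sorted samples of bounded size.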
[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics
[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=831279=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831279 ] ASF GitHub Bot logged work on HIVE-26221: - Author: ASF GitHub Bot Created on: 06/Dec/22 07:07 Start Date: 06/Dec/22 07:07 Worklog Time Spent: 10m Work Description: dengzhhu653 commented on code in PR #3137: URL: https://github.com/apache/hive/pull/3137#discussion_r1040575953 ## standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/columnstats/ColStatsBuilder.java: ## @@ -103,6 +105,12 @@ public ColStatsBuilder hll(String... values) { return this; } + public ColStatsBuilder hll(double... values) { +HyperLogLog hll = StatisticsTestUtils.createHll(values); +this.bitVector = hll.serialize(); Review Comment: This is meant to be `this.kll = kll.toByteArray();`? Issue Time Tracking --- Worklog Id: (was: 831279) Time Spent: 4h 10m (was: 4h) > Add histogram-based column statistics > - > > Key: HIVE-26221 > URL: https://issues.apache.org/jira/browse/HIVE-26221 > Project: Hive > Issue Type: Improvement > Components: CBO, Metastore, Statistics > Affects Versions: 4.0.0-alpha-2 > Reporter: Alessandro Solimando > Assignee: Alessandro Solimando > Priority: Major > Labels: pull-request-available > Time Spent: 4h 10m > Remaining Estimate: 0h > > Hive does not support histogram statistics, which are particularly useful for > skewed data (which is very common in practice) and range predicates. > Hive's current selectivity estimation for range predicates is based on a > hard-coded value of 1/3 (see > [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]). > The current proposal aims at integrating histograms as an additional column > statistic, stored into the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following: > * efficiency: the approach must scale and support billions of rows > * merge-ability: partition-level histograms have to be merged to form > table-level histograms > * explicit and configurable trade-off between memory footprint and accuracy > Hive already integrates [KLL data > sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF. > Datasketches are small, stateful programs that process massive data-streams > and can provide approximate answers, with mathematical guarantees, to > computationally difficult queries orders-of-magnitude faster than > traditional, exact methods. > We propose to use KLL, and more specifically the cumulative distribution > function (CDF), as the underlying data structure for our histogram statistics. > The current proposal targets numeric data types (float, integer and numeric > families) and temporal data types (date and timestamp). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26799) Make authorizations on custom UDFs involved in tables/view configurable.
[ https://issues.apache.org/jira/browse/HIVE-26799?focusedWorklogId=831272=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831272 ] ASF GitHub Bot logged work on HIVE-26799: - Author: ASF GitHub Bot Created on: 06/Dec/22 06:49 Start Date: 06/Dec/22 06:49 Worklog Time Spent: 10m Work Description: dengzhhu653 commented on code in PR #3821: URL: https://github.com/apache/hive/pull/3821#discussion_r1040564092 ## ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java: ## @@ -12550,6 +12550,20 @@ private ParseResult rewriteASTWithMaskAndFilter(TableMask tableMask, ASTNode ast } } + void gatherUserSuppliedFunctions(ASTNode ast) { +int tokenType = ast.getToken().getType(); +if (tokenType == HiveParser.TOK_FUNCTION || +tokenType == HiveParser.TOK_FUNCTIONDI || +tokenType == HiveParser.TOK_FUNCTIONSTAR) { + if (ast.getChild(0).getType() == HiveParser.Identifier) { + this.userSuppliedFunctions.add(unescapeIdentifier(ast.getChild(0).getText())); Review Comment: Could we add the lower-cased function names into `userSuppliedFunctions`? I wonder about queries like `select MIN(a) from table_example`. Does it handle cast properly? For example: `select cast(a as int) from table_example`. Issue Time Tracking --- Worklog Id: (was: 831272) Time Spent: 1h 40m (was: 1.5h) > Make authorizations on custom UDFs involved in tables/view configurable. > > > Key: HIVE-26799 > URL: https://issues.apache.org/jira/browse/HIVE-26799 > Project: Hive > Issue Type: New Feature > Components: HiveServer2, Security > Affects Versions: 4.0.0-alpha-2 > Reporter: Sai Hemanth Gantasala > Assignee: Sai Hemanth Gantasala > Priority: Major > Labels: pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > > When Hive is using Ranger/Sentry as an authorization service, consider the > following scenario.
> {code:java} > > create table test_udf(st string); // privileged user operation > > create function Udf_UPPER as 'openkb.hive.udf.MyUpper' using jar > > 'hdfs:///tmp/MyUpperUDF-1.0.0.jar'; // privileged user operation > > create view v1_udf as select udf_upper(st) from test_udf; // privileged > > user operation > //unprivileged user test_user is given select permissions on view v1_udf > > select * from v1_udf; {code} > It is expected that test_user needs to have select privilege on v1_udf and > select permissions on the udf_upper custom UDF in order to run a select query on > the view. > This patch introduces a configuration > "hive.security.authorization.functions.in.view"=false which disables > authorization on UDFs associated with views/tables during the select query. > In this mode, only UDFs explicitly stated in the query would still be > authorized as they are currently. > The reason for making these custom UDFs associated with views/tables > authorizable is that currently test_user will need to be granted select > permissions on the custom UDF, and test_user can then use this UDF in queries > against any other table, which is a security concern. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-23891) Using UNION sql clause and speculative execution can cause file duplication in Tez
[ https://issues.apache.org/jira/browse/HIVE-23891?focusedWorklogId=831269=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831269 ] ASF GitHub Bot logged work on HIVE-23891: - Author: ASF GitHub Bot Created on: 06/Dec/22 06:39 Start Date: 06/Dec/22 06:39 Worklog Time Spent: 10m Work Description: dengzhhu653 opened a new pull request, #3836: URL: https://github.com/apache/hive/pull/3836 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Issue Time Tracking --- Worklog Id: (was: 831269) Time Spent: 2h 40m (was: 2.5h) > Using UNION sql clause and speculative execution can cause file duplication > in Tez > -- > > Key: HIVE-23891 > URL: https://issues.apache.org/jira/browse/HIVE-23891 > Project: Hive > Issue Type: Bug >Reporter: George Pachitariu >Assignee: George Pachitariu >Priority: Major > Labels: pull-request-available > Attachments: HIVE-23891.1.patch > > Time Spent: 2h 40m > Remaining Estimate: 0h > > Hello, > the specific scenario when this can happen: > - the execution engine is Tez; > - speculative execution is on; > - the query inserts into a table and the last step is a UNION sql clause; > The problem is that Tez creates an extra layer of subdirectories when there > is a UNION. Later, when deduplicating, Hive doesn't take that into account > and only deduplicates folders but not the files inside. > So for a query like this: > {code:sql} > insert overwrite table union_all > select * from union_first_part > union all > select * from union_second_part; > {code} > The folder structure afterwards will be like this (a possible example): > {code:java} > .../union_all/HIVE_UNION_SUBDIR_1/00_0 > .../union_all/HIVE_UNION_SUBDIR_1/00_1 > .../union_all/HIVE_UNION_SUBDIR_2/00_1 > {code} > The attached patch increases the number of folder levels that Hive will check > recursively for duplicates when we have a UNION in Tez. 
> Feel free to reach out if you have any questions :). -- This message was sent by Atlassian Jira (v8.20.10#820010)
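The HIVE-23891 report above hinges on how speculative-attempt duplicates are detected: attempt files of the same task share the prefix before the final `_` in the file name, so duplicates must be found per directory, and UNION's extra `HIVE_UNION_SUBDIR_*` level means the check must descend one level deeper. A toy model of that deduplication (our illustrative sketch, not Hive's implementation):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: keep only the first attempt seen per (directory, task id) pair.
public final class UnionAttemptDedup {

    /** Deduplicates attempt files by the task-id prefix of the file name, per directory. */
    static List<String> dedup(List<String> relativePaths) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String path : relativePaths) {
            int slash = path.lastIndexOf('/');
            String dir = path.substring(0, slash + 1);
            String file = path.substring(slash + 1);
            String taskId = file.substring(0, file.lastIndexOf('_'));
            if (seen.add(dir + taskId)) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "HIVE_UNION_SUBDIR_1/000000_0",
            "HIVE_UNION_SUBDIR_1/000000_1",  // speculative duplicate of the previous file
            "HIVE_UNION_SUBDIR_2/000000_1"); // same task id but a different subdirectory
        System.out.println(dedup(files));
        // [HIVE_UNION_SUBDIR_1/000000_0, HIVE_UNION_SUBDIR_2/000000_1]
    }
}
```

The bug described above corresponds to running this kind of check only at the table-directory level, where the subdirectories look distinct and the duplicate files inside them are never compared.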
[jira] [Updated] (HIVE-26569) Support renewal and recreation of LLAP_TOKENs
[ https://issues.apache.org/jira/browse/HIVE-26569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] László Bodor updated HIVE-26569: Summary: Support renewal and recreation of LLAP_TOKENs (was: LlapTokenRenewer: TezAM (LlapTaskCommunicator) to renew LLAP_TOKENs) > Support renewal and recreation of LLAP_TOKENs > - > > Key: HIVE-26569 > URL: https://issues.apache.org/jira/browse/HIVE-26569 > Project: Hive > Issue Type: Improvement >Reporter: László Bodor >Assignee: László Bodor >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26788) Update stats of table/partition after minor compaction using noscan operation
[ https://issues.apache.org/jira/browse/HIVE-26788?focusedWorklogId=831260=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831260 ] ASF GitHub Bot logged work on HIVE-26788: - Author: ASF GitHub Bot Created on: 06/Dec/22 04:56 Start Date: 06/Dec/22 04:56 Worklog Time Spent: 10m Work Description: SourabhBadhya commented on code in PR #3812: URL: https://github.com/apache/hive/pull/3812#discussion_r1040462553 ## ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/StatsUpdater.java: ## @@ -52,10 +52,6 @@ public final class StatsUpdater { */ public void gatherStats(CompactionInfo ci, HiveConf conf, String userName, String compactionQueueName) { try { -if (!ci.isMajorCompaction()) { Review Comment: I thought this was a problem but I did some investigation. There is an if-else statement which decides whether an MR or Tez task needs to be created. For the `NOSCAN` operation, it does not generate an MR or a Tez task. https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java#L88-L108 (If basic stats are OK to be used, then no MR or Tez task is created) https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java#L128-L134 (If no scan is used, then the MapRedTask is removed from the plan). AFAIK Tez sessions are created only when a Tez task is executed. 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java#L207-L208 Issue Time Tracking --- Worklog Id: (was: 831260) Time Spent: 1h 40m (was: 1.5h) > Update stats of table/partition after minor compaction using noscan operation > - > > Key: HIVE-26788 > URL: https://issues.apache.org/jira/browse/HIVE-26788 > Project: Hive > Issue Type: Improvement >Reporter: Sourabh Badhya >Assignee: Sourabh Badhya >Priority: Major > Labels: pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > > Currently, statistics are not updated for minor compaction since minor > compaction makes only small updates to the statistics (such as the number of files > in the table/partition & the total size of the table/partition). It is better to > utilize the NOSCAN operation for minor compaction since a NOSCAN operation > performs a faster update of statistics and updates the relevant fields such as > the number of files & total sizes of the table/partitions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
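The branching described in the review comment can be condensed into a small decision sketch. The method and flag names below are illustrative, not the actual GenMRTableScan1 API:

```java
// Hedged sketch of the stats-task decision described above.
// Names are hypothetical; the real logic lives in GenMRTableScan1.
class StatsScanPlanner {
    /**
     * Decide whether an MR/Tez task must be created to gather statistics.
     * If basic stats are sufficient, or NOSCAN is requested, no task is needed.
     */
    static boolean needsScanTask(boolean basicStatsSufficient, boolean noScan) {
        if (basicStatsSufficient) {
            return false; // stats can be taken from metadata, no task created
        }
        if (noScan) {
            return false; // NOSCAN: the MapRedTask is removed from the plan
        }
        return true; // a full scan task (MR or Tez) is required
    }
}
```

Since neither branch above creates a task for NOSCAN, no Tez session would be started either, which matches the observation that Tez sessions are created only when a Tez task executes.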
[jira] [Commented] (HIVE-26770) Make "end of loop" compaction logs appear more selectively
[ https://issues.apache.org/jira/browse/HIVE-26770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643658#comment-17643658 ] Akshat Mathur commented on HIVE-26770: -- Test passed. Had to close the old PR and create a new one. > Make "end of loop" compaction logs appear more selectively > -- > > Key: HIVE-26770 > URL: https://issues.apache.org/jira/browse/HIVE-26770 > Project: Hive > Issue Type: Improvement >Affects Versions: 4.0.0-alpha-1 >Reporter: Akshat Mathur >Assignee: Akshat Mathur >Priority: Major > Labels: pull-request-available > Time Spent: 5h 20m > Remaining Estimate: 0h > > Currently the Initiator, Worker, and Cleaner threads log something like "finished > one loop" at INFO level. > This is useful for figuring out if one of these threads is taking too long to > finish a loop, but expensive in general. > > Suggested Time: 20mins > Logging should be changed in the following way: > # If the loop finishes within a predefined amount of time, the level should be DEBUG > and the message should look like: *Initiator loop took \{elapsedTime} seconds to > finish.* > # If the loop runs longer than this predefined amount, the level should be WARN and > the message should look like: *Possible Initiator slowdown, loop took > \{elapsedTime} seconds to finish.* -- This message was sent by Atlassian Jira (v8.20.10#820010)
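The proposed logging rule can be sketched as follows. The threshold parameter and method names are assumptions; the message templates come from the ticket:

```java
// Hedged sketch of the proposed "end of loop" logging rule.
// Threshold and method names are hypothetical; message wording follows the ticket.
class LoopLogHelper {
    /** Pick the log level: DEBUG at or under the threshold, WARN over it. */
    static String levelFor(long elapsedSeconds, long thresholdSeconds) {
        return elapsedSeconds <= thresholdSeconds ? "DEBUG" : "WARN";
    }

    /** Build the message for an Initiator loop per the ticket's templates. */
    static String messageFor(long elapsedSeconds, long thresholdSeconds) {
        if (elapsedSeconds <= thresholdSeconds) {
            return "Initiator loop took " + elapsedSeconds + " seconds to finish.";
        }
        return "Possible Initiator slowdown, loop took " + elapsedSeconds
                + " seconds to finish.";
    }
}
```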
[jira] [Comment Edited] (HIVE-26770) Make "end of loop" compaction logs appear more selectively
[ https://issues.apache.org/jira/browse/HIVE-26770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643124#comment-17643124 ] Akshat Mathur edited comment on HIVE-26770 at 12/6/22 4:42 AM: --- Due to timeouts the tests are failing, blocking the merge was (Author: JIRAUSER298271): Due to timeouts the tests are failing, blocking the merge > Make "end of loop" compaction logs appear more selectively > -- > > Key: HIVE-26770 > URL: https://issues.apache.org/jira/browse/HIVE-26770 > Project: Hive > Issue Type: Improvement >Affects Versions: 4.0.0-alpha-1 >Reporter: Akshat Mathur >Assignee: Akshat Mathur >Priority: Major > Labels: pull-request-available > Time Spent: 5h 20m > Remaining Estimate: 0h > > Currently the Initiator, Worker, and Cleaner threads log something like "finished > one loop" at INFO level. > This is useful for figuring out if one of these threads is taking too long to > finish a loop, but expensive in general. > > Suggested Time: 20mins > Logging should be changed in the following way: > # If the loop finishes within a predefined amount of time, the level should be DEBUG > and the message should look like: *Initiator loop took \{elapsedTime} seconds to > finish.* > # If the loop runs longer than this predefined amount, the level should be WARN and > the message should look like: *Possible Initiator slowdown, loop took > \{elapsedTime} seconds to finish.* -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HIVE-26806) Precommit tests in CI are timing out after HIVE-26796
[ https://issues.apache.org/jira/browse/HIVE-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643656#comment-17643656 ] Akshat Mathur commented on HIVE-26806: -- [~zabetak] Closing PR-3803 and opening a new one worked, thanks. Run for new PR: http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-3832/1/pipeline/ > Precommit tests in CI are timing out after HIVE-26796 > - > > Key: HIVE-26806 > URL: https://issues.apache.org/jira/browse/HIVE-26806 > Project: Hive > Issue Type: Bug > Components: Testing Infrastructure >Reporter: Stamatis Zampetakis >Assignee: Stamatis Zampetakis >Priority: Major > > http://ci.hive.apache.org/job/hive-precommit/job/master/1506/ > {noformat} > Cancelling nested steps due to timeout > 15:22:08 Sending interrupt signal to process > 15:22:08 Killing processes > 15:22:09 kill finished with exit code 0 > 15:22:19 Terminated > 15:22:19 script returned exit code 143 > [Pipeline] } > [Pipeline] // withEnv > [Pipeline] } > 15:22:19 Deleting 1 temporary files > [Pipeline] // configFileProvider > [Pipeline] } > [Pipeline] // stage > [Pipeline] stage > [Pipeline] { (PostProcess) > [Pipeline] sh > [Pipeline] sh > [Pipeline] sh > [Pipeline] junit > 15:22:25 Recording test results > 15:22:32 [Checks API] No suitable checks publisher found. > [Pipeline] } > [Pipeline] // stage > [Pipeline] } > [Pipeline] // container > [Pipeline] } > [Pipeline] // node > [Pipeline] } > [Pipeline] // timeout > [Pipeline] } > [Pipeline] // podTemplate > [Pipeline] } > 15:22:32 Failed in branch split-01 > [Pipeline] // parallel > [Pipeline] } > [Pipeline] // stage > [Pipeline] stage > [Pipeline] { (Archive) > [Pipeline] podTemplate > [Pipeline] { > [Pipeline] timeout > 15:22:33 Timeout set to expire in 6 hr 0 min > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26799) Make authorizations on custom UDFs involved in tables/view configurable.
[ https://issues.apache.org/jira/browse/HIVE-26799?focusedWorklogId=831255=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831255 ] ASF GitHub Bot logged work on HIVE-26799: - Author: ASF GitHub Bot Created on: 06/Dec/22 03:57 Start Date: 06/Dec/22 03:57 Worklog Time Spent: 10m Work Description: sonarcloud[bot] commented on PR #3821: URL: https://github.com/apache/hive/pull/3821#issuecomment-1338706082 Kudos, SonarCloud Quality Gate passed! 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 1 Code Smell; no coverage or duplication information.
Issue Time Tracking --- Worklog Id: (was: 831255) Time Spent: 1.5h (was: 1h 20m) > Make authorizations on custom UDFs involved in tables/view configurable. > > > Key: HIVE-26799 > URL: https://issues.apache.org/jira/browse/HIVE-26799 > Project: Hive > Issue Type: New Feature > Components: HiveServer2, Security >Affects Versions: 4.0.0-alpha-2 >Reporter: Sai Hemanth Gantasala >Assignee: Sai Hemanth Gantasala >Priority: Major > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > When Hive is using Ranger/Sentry as an authorization service, consider the > following scenario. 
> {code:java} > > create table test_udf(st string); // privileged user operation > > create function Udf_UPPER as 'openkb.hive.udf.MyUpper' using jar > > 'hdfs:///tmp/MyUpperUDF-1.0.0.jar'; // privileged user operation > > create view v1_udf as select udf_upper(st) from test_udf; // privileged > > user operation > //unprivileged user test_user is given select permissions on view v1_udf > > select * from v1_udf; {code} > It is expected that test_user needs to have the select privilege on v1_udf and > select permissions on the udf_upper custom UDF in order to run a select query on the > view. > This patch introduces a configuration > "hive.security.authorization.functions.in.view"=false which disables > authorization on views associated with views/tables during the select query.
[jira] [Work logged] (HIVE-23559) Optimise Hive::moveAcidFiles for cloud storage
[ https://issues.apache.org/jira/browse/HIVE-23559?focusedWorklogId=831250=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831250 ] ASF GitHub Bot logged work on HIVE-23559: - Author: ASF GitHub Bot Created on: 06/Dec/22 02:35 Start Date: 06/Dec/22 02:35 Worklog Time Spent: 10m Work Description: sonarcloud[bot] commented on PR #3795: URL: https://github.com/apache/hive/pull/3795#issuecomment-1338645658 Kudos, SonarCloud Quality Gate passed! 1 Bug, 0 Vulnerabilities, 0 Security Hotspots, 7 Code Smells; no coverage or duplication information.
Issue Time Tracking --- Worklog Id: (was: 831250) Time Spent: 50m (was: 40m) > Optimise Hive::moveAcidFiles for cloud storage > -- > > Key: HIVE-23559 > URL: https://issues.apache.org/jira/browse/HIVE-23559 > Project: Hive > Issue Type: Improvement >Reporter: Rajesh Balamohan >Assignee: Dmitriy Fingerman >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L4752] > It ends up transferring the DELTA, DELETE_DELTA, and BASE prefixes sequentially from > the staging to the final location. > This causes delays even with simple update statements, which update a small > number of records in cloud storage. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
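One way to avoid transferring the DELTA, DELETE_DELTA, and BASE prefixes one after another is to submit the moves to a thread pool. This is only an illustrative sketch of that idea, not the change actually made in the PR:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hedged sketch: run independent move operations concurrently instead of
// sequentially, which helps on high-latency cloud storage.
class ParallelMove {
    /** Run the given move operations concurrently and wait for them to finish. */
    static void moveAll(List<Runnable> moves) {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, Math.min(moves.size(), 8)));
        for (Runnable move : moves) {
            pool.submit(move); // e.g. one task per DELTA/DELETE_DELTA/BASE prefix
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With renames that are cheap locally but slow against object stores, overlapping the per-prefix moves bounds the wall-clock cost by the slowest prefix rather than the sum of all three.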
[jira] [Commented] (HIVE-25327) Mapjoins in HiveServer2 fail when jmxremote is used
[ https://issues.apache.org/jira/browse/HIVE-25327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643627#comment-17643627 ] hansonhe commented on HIVE-25327: - The same problem happened to me. My environment: hive-3.1.2, hadoop-3.1.4 > Mapjoins in HiveServer2 fail when jmxremote is used > --- > > Key: HIVE-25327 > URL: https://issues.apache.org/jira/browse/HIVE-25327 > Project: Hive > Issue Type: Bug > Components: HiveServer2 >Affects Versions: 3.1.2 > Environment: apache hadoop 3.1.3 > apache hive 3.1.2 > java version 1.8.0_282 > OS: RedHat 8.2 > >Reporter: louiechen >Assignee: loushang >Priority: Major > > I also encountered the same problem. Although this problem was closed in > a previous version, and the source code of the current version contains related > corrections, the problem remains. > The following is the main content of the previous issue [HIVE-11369]: > having hive.auto.convert.join set to true works in the CLI with no issue, but > fails in HiveServer2 when jmx options are passed to the service on startup. > This (in hive-env.sh) is enough to make it fail: > {noformat} > -Dcom.sun.management.jmxremote > -Dcom.sun.management.jmxremote.authenticate=false > -Dcom.sun.management.jmxremote.ssl=false > -Dcom.sun.management.jmxremote.port=8009 > {noformat} > As soon as I remove the line, it works properly. I have *no* idea... 
> Here's the log from the service: > {noformat} > 2015-07-24 17:19:27,457 INFO [HiveServer2-Handler-Pool: Thread-22]: > ql.Driver (SessionState.java:printInfo(912)) - Query ID = > hive_20150724171919_aaa88a89-dc6d-490b-821c-4eec6d4c0421 > 2015-07-24 17:19:27,457 INFO [HiveServer2-Handler-Pool: Thread-22]: > ql.Driver (SessionState.java:printInfo(912)) - Total jobs = 1 > 2015-07-24 17:19:27,465 INFO [HiveServer2-Handler-Pool: Thread-22]: > ql.Driver (Driver.java:launchTask(1638)) - Starting task > [Stage-4:MAPREDLOCAL] in serial mode > 2015-07-24 17:19:27,467 INFO [HiveServer2-Handler-Pool: Thread-22]: > mr.MapredLocalTask (MapredLocalTask.java:executeInChildVM(159)) - Generating > plan file > file:/tmp/hive/8932c206-5420-4b6f-9f1f-5f1706f30df8/hive_2015-07-24_17-19-26_552_5082133674120283907-1/-local-10005/plan.xml > 2015-07-24 17:19:27,625 WARN [HiveServer2-Handler-Pool: Thread-22]: > conf.HiveConf (HiveConf.java:initialize(2620)) - HiveConf of name > hive.files.umask.value does not exist > 2015-07-24 17:19:27,708 INFO [HiveServer2-Handler-Pool: Thread-22]: > mr.MapredLocalTask (MapredLocalTask.java:executeInChildVM(288)) - Executing: > /usr/lib/hadoop/bin/hadoop jar > /usr/lib/hive/lib/hive-common-1.1.0-cdh5.4.3.jar > org.apache.hadoop.hive.ql.exec.mr.ExecDriver -localtask -plan > file:/tmp/hive/8932c206-5420-4b6f-9f1f-5f1706f30df8/hive_2015-07-24_17-19-26_552_5082133674120283907-1/-local-10005/plan.xml >-jobconffile > file:/tmp/hive/8932c206-5420-4b6f-9f1f-5f1706f30df8/hive_2015-07-24_17-19-26_552_5082133674120283907-1/-local-10006/jobconf.xml > 2015-07-24 17:19:28,499 ERROR [HiveServer2-Handler-Pool: Thread-22]: > exec.Task (SessionState.java:printError(921)) - Execution failed with exit > status: 1 > 2015-07-24 17:19:28,500 ERROR [HiveServer2-Handler-Pool: Thread-22]: > exec.Task (SessionState.java:printError(921)) - Obtaining error information > 2015-07-24 17:19:28,500 ERROR [HiveServer2-Handler-Pool: Thread-22]: > exec.Task 
(SessionState.java:printError(921)) - > Task failed! > Task ID: > Stage-4 > Logs: > 2015-07-24 17:19:28,501 ERROR [HiveServer2-Handler-Pool: Thread-22]: > exec.Task (SessionState.java:printError(921)) - > /tmp/hiveserver2_manual/hive-server2.log > 2015-07-24 17:19:28,501 ERROR [HiveServer2-Handler-Pool: Thread-22]: > mr.MapredLocalTask (MapredLocalTask.java:executeInChildVM(308)) - Execution > failed with exit status: 1 > 2015-07-24 17:19:28,518 ERROR [HiveServer2-Handler-Pool: Thread-22]: > ql.Driver (SessionState.java:printError(921)) - FAILED: Execution Error, > return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask > 2015-07-24 17:19:28,599 WARN [HiveServer2-Handler-Pool: Thread-22]: > security.UserGroupInformation (UserGroupInformation.java:doAs(1674)) - > PriviledgedActionException as:hive (auth:SIMPLE) > cause:org.apache.hive.service.cli.HiveSQLException: Error while processing > statement: FAILED: Execution Error, return code 1 from > org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask > 2015-07-24 17:19:28,600 WARN [HiveServer2-Handler-Pool: Thread-22]: > thrift.ThriftCLIService (ThriftCLIService.java:ExecuteStatement(496)) - Error > executing statement: > org.apache.hive.service.cli.HiveSQLException: Error while processing > statement: FAILED: Execution Error, return code 1 from > org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask > at
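The failure pattern in the log above is consistent with the child VM spawned for the local map-join task inheriting the parent's jmxremote options and then failing to bind the already-used JMX port. A defensive sketch of one mitigation, hypothetical and not the actual MapredLocalTask fix, would filter those flags out of the options passed to the child:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hedged sketch: strip inherited jmxremote flags so a child VM does not try to
// rebind the JMX port the parent HiveServer2 process already holds.
class ChildJvmOpts {
    /** Return the JVM options with all com.sun.management.jmxremote flags removed. */
    static List<String> withoutJmxRemote(List<String> jvmOpts) {
        return jvmOpts.stream()
                .filter(opt -> !opt.startsWith("-Dcom.sun.management.jmxremote"))
                .collect(Collectors.toList());
    }
}
```

The same effect can often be had operationally by setting a distinct (or no) jmxremote port for child processes, since two JVMs cannot share port 8009.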
[jira] [Work logged] (HIVE-26799) Make authorizations on custom UDFs involved in tables/view configurable.
[ https://issues.apache.org/jira/browse/HIVE-26799?focusedWorklogId=831246=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831246 ] ASF GitHub Bot logged work on HIVE-26799: - Author: ASF GitHub Bot Created on: 06/Dec/22 02:20 Start Date: 06/Dec/22 02:20 Worklog Time Spent: 10m Work Description: saihemanth-cloudera commented on code in PR #3821: URL: https://github.com/apache/hive/pull/3821#discussion_r1040352141 ## ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java: ## @@ -12550,6 +12550,21 @@ private ParseResult rewriteASTWithMaskAndFilter(TableMask tableMask, ASTNode ast } } + void gatherUserSuppliedFunctions(ASTNode ast) { +int tokenType = ast.getToken().getType(); +if (tokenType == HiveParser.TOK_FUNCTION || +tokenType == HiveParser.TOK_FUNCTIONDI || +tokenType == HiveParser.TOK_FUNCTIONSTAR) { + if (ast.getChild(0).getType() == HiveParser.Identifier) { +// maybe user supplied +this.userSuppliedFunctions.add(ast.getChild(0).getText()); Review Comment: Ack Issue Time Tracking --- Worklog Id: (was: 831246) Time Spent: 1h 20m (was: 1h 10m) > Make authorizations on custom UDFs involved in tables/view configurable. > > > Key: HIVE-26799 > URL: https://issues.apache.org/jira/browse/HIVE-26799 > Project: Hive > Issue Type: New Feature > Components: HiveServer2, Security >Affects Versions: 4.0.0-alpha-2 >Reporter: Sai Hemanth Gantasala >Assignee: Sai Hemanth Gantasala >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > When Hive is using Ranger/Sentry as an authorization service, consider the > following scenario. 
> {code:java} > > create table test_udf(st string); // privileged user operation > > create function Udf_UPPER as 'openkb.hive.udf.MyUpper' using jar > > 'hdfs:///tmp/MyUpperUDF-1.0.0.jar'; // privileged user operation > > create view v1_udf as select udf_upper(st) from test_udf; // privileged > > user operation > //unprivileged user test_user is given select permissions on view v1_udf > > select * from v1_udf; {code} > It is expected that test_user needs to have the select privilege on v1_udf and > select permissions on the udf_upper custom UDF in order to run a select query on the > view. > This patch introduces a configuration > "hive.security.authorization.functions.in.view"=false which disables > authorization on views associated with views/tables during the select query. > In this mode, only UDFs explicitly stated in the query would still be > authorized, as is currently the case. > The reason for making these custom UDFs associated with views/tables > authorizable is that currently test_user would need to be granted select > permissions on the custom UDF, and test_user could then use this UDF in queries > against any other table, which is a security concern. -- This message was sent by Atlassian Jira (v8.20.10#820010)
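The effect of the proposed flag reduces to a small truth table, sketched below with a hypothetical method name (the real gating lives in the semantic analyzer and authorizer):

```java
// Hedged sketch of the authorization gate described in the ticket.
// Method and parameter names are hypothetical, not the Hive API.
class UdfAuthzGate {
    /**
     * Whether a UDF should be authorized for the querying user.
     * UDFs explicitly stated in the query are always authorized; UDFs only
     * embedded in a referenced view are authorized unless the proposed
     * hive.security.authorization.functions.in.view flag is set to false.
     */
    static boolean shouldAuthorize(boolean explicitlyInQuery, boolean authorizeFunctionsInView) {
        return explicitlyInQuery || authorizeFunctionsInView;
    }
}
```

Only the (embedded-in-view, flag=false) combination skips authorization, which is exactly the scenario where test_user was granted select on v1_udf but not on udf_upper.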
[jira] [Work logged] (HIVE-26758) Allow use scratchdir for staging final job
[ https://issues.apache.org/jira/browse/HIVE-26758?focusedWorklogId=831231=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831231 ] ASF GitHub Bot logged work on HIVE-26758: - Author: ASF GitHub Bot Created on: 06/Dec/22 00:33 Start Date: 06/Dec/22 00:33 Worklog Time Spent: 10m Work Description: sonarcloud[bot] commented on PR #3831: URL: https://github.com/apache/hive/pull/3831#issuecomment-1338488053 Kudos, SonarCloud Quality Gate passed! 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 2 Code Smells; no coverage or duplication information.
Issue Time Tracking --- Worklog Id: (was: 831231) Time Spent: 3h 50m (was: 3h 40m) > Allow use scratchdir for staging final job > -- > > Key: HIVE-26758 > URL: https://issues.apache.org/jira/browse/HIVE-26758 > Project: Hive > Issue Type: New Feature > Components: Query Planning >Affects Versions: 4.0.0-alpha-2 >Reporter: Yi Zhang >Assignee: Yi Zhang >Priority: Minor > Labels: pull-request-available > Time Spent: 3h 50m > Remaining Estimate: 0h > > The query results are staged in a stagingdir that is relative to the > destination path // > during blob storage optimization (HIVE-17620) the final job was set to use the stagingdir. 
> HIVE-15215 mentioned the possibility of using the scratchdir for staging when writing > to S3, but that was a long time ago and saw no activity. > > This is to allow the final job to use hive.exec.scratchdir like the interim jobs, > gated by a configuration: > hive.use.scratchdir.for.staging > This is useful for cross-filesystem writes: the user can stage on the local source > filesystem instead of the remote filesystem. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26809) Upgrade ORC to 1.8.0
[ https://issues.apache.org/jira/browse/HIVE-26809?focusedWorklogId=831229&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831229 ] ASF GitHub Bot logged work on HIVE-26809: - Author: ASF GitHub Bot Created on: 06/Dec/22 00:21 Start Date: 06/Dec/22 00:21 Worklog Time Spent: 10m Work Description: TuroczyX commented on PR #3833: URL: https://github.com/apache/hive/pull/3833#issuecomment-1338477492 like it :) Issue Time Tracking --- Worklog Id: (was: 831229) Time Spent: 0.5h (was: 20m) > Upgrade ORC to 1.8.0 > > > Key: HIVE-26809 > URL: https://issues.apache.org/jira/browse/HIVE-26809 > Project: Hive > Issue Type: Improvement > Affects Versions: 4.0.0 > Reporter: Dmitriy Fingerman > Assignee: Dmitriy Fingerman > Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26758) Allow use scratchdir for staging final job
[ https://issues.apache.org/jira/browse/HIVE-26758?focusedWorklogId=831226&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831226 ] ASF GitHub Bot logged work on HIVE-26758: - Author: ASF GitHub Bot Created on: 05/Dec/22 23:39 Start Date: 05/Dec/22 23:39 Worklog Time Spent: 10m Work Description: yigress commented on PR #3831: URL: https://github.com/apache/hive/pull/3831#issuecomment-1338364244 thanks @sunchao for the review! addressed comments Issue Time Tracking --- Worklog Id: (was: 831226) Time Spent: 3h 40m (was: 3.5h) > Allow use scratchdir for staging final job > -- > > Key: HIVE-26758 > URL: https://issues.apache.org/jira/browse/HIVE-26758 > Project: Hive > Issue Type: New Feature > Components: Query Planning > Affects Versions: 4.0.0-alpha-2 > Reporter: Yi Zhang > Assignee: Yi Zhang > Priority: Minor > Labels: pull-request-available > Time Spent: 3h 40m > Remaining Estimate: 0h > > The query results are staged in a stagingdir that is relative to the > destination path // > during blobstorage optimization HIVE-17620 the final job is set to use the stagingdir. > HIVE-15215 mentioned the possibility of using the scratchdir for staging when writing > to S3, but that was a long time ago and saw no activity. > > This is to allow the final job to use hive.exec.scratchdir like the interim jobs, > gated by a configuration: > hive.use.scratchdir.for.staging > This is useful for cross-filesystem writes: the user can stage on the local source > filesystem instead of the remote filesystem. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26758) Allow use scratchdir for staging final job
[ https://issues.apache.org/jira/browse/HIVE-26758?focusedWorklogId=831223=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831223 ] ASF GitHub Bot logged work on HIVE-26758: - Author: ASF GitHub Bot Created on: 05/Dec/22 23:23 Start Date: 05/Dec/22 23:23 Worklog Time Spent: 10m Work Description: sunchao commented on code in PR #3831: URL: https://github.com/apache/hive/pull/3831#discussion_r1040208683 ## ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java: ## @@ -1971,13 +1971,19 @@ public static Path createMoveTask(Task currTask, boolean chDir, * 2. INSERT operation on full ACID table */ if (!isMmTable && !isDirectInsert) { -// generate the temporary file -// it must be on the same file system as the current destination Context baseCtx = parseCtx.getContext(); -// Create the required temporary file in the HDFS location if the destination -// path of the FileSinkOperator table is a blobstore path. -Path tmpDir = baseCtx.getTempDirForFinalJobPath(fileSinkDesc.getDestPath()); +// Choose location of required temporary file +Path tmpDir = null; +if (hconf.getBoolVar(ConfVars.HIVE_USE_SCRATCHDIR_FOR_STAGING)) { + tmpDir = baseCtx.getTempDirForInterimJobPath(fileSinkDesc.getDestPath()); +} else { + tmpDir = baseCtx.getTempDirForFinalJobPath(fileSinkDesc.getDestPath()); +} +DynamicPartitionCtx dpCtx = fileSinkDesc.getDynPartCtx(); +if (dpCtx != null && dpCtx.getSPPath() != null) { +tmpDir = new Path(tmpDir, dpCtx.getSPPath()); Review Comment: nit: 2 space indentation ## common/src/java/org/apache/hadoop/hive/conf/HiveConf.java: ## @@ -5629,6 +5629,10 @@ public static enum ConfVars { "This is a performance optimization that forces the final FileSinkOperator to write to the blobstore.\n" + "See HIVE-15121 for details."), +HIVE_USE_SCRATCHDIR_FOR_STAGING("hive.use.scratchdir.for.staging", false, +"Use ${hive.exec.scratchdir} for query results instead of ${hive.exec.stagingdir}.\n" + +"This stages query results in 
${hive.exec.scratchdir} before move to final destination."), Review Comment: nit: move -> moving ## ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java: ## @@ -2608,8 +2608,8 @@ private Partition loadPartitionInternal(Path loadPath, Table tbl, Map Allow use scratchdir for staging final job > -- > > Key: HIVE-26758 > URL: https://issues.apache.org/jira/browse/HIVE-26758 > Project: Hive > Issue Type: New Feature > Components: Query Planning >Affects Versions: 4.0.0-alpha-2 >Reporter: Yi Zhang >Assignee: Yi Zhang >Priority: Minor > Labels: pull-request-available > Time Spent: 3.5h > Remaining Estimate: 0h > > The query results are staged in stagingdir that is relative to the > destination path // > during blobstorage optimzation HIVE-17620 final job is set to use stagingdir. > HIVE-15215 mentioned the possibility of using scratch for staging when write > to S3 but it was long time ago and no activity. > > This is to allow final job to use hive.exec.scratchdir as the interim jobs, > with a configuration > hive.use.scratchdir.for.staging > This is useful for cross Filesystem, user can use local source filesystem > instead of remote filesystem for the staging. -- This message was sent by Atlassian Jira (v8.20.10#820010)
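The hunk under review picks the temporary directory from the new flag and then appends the static-partition subpath when a dynamic-partition context carries one. A minimal stand-alone sketch of that selection logic follows; it uses plain java.nio paths instead of Hadoop's Path, and the directory layouts returned by interimDir/finalDir are illustrative assumptions, not Hive's actual resolution of hive.exec.scratchdir and hive.exec.stagingdir.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class StagingDirSelector {
    // Hypothetical stand-ins for Context.getTempDirForInterimJobPath /
    // getTempDirForFinalJobPath in the patch; the returned locations are
    // made up for illustration.
    static Path interimDir(Path dest) {
        return Paths.get("/tmp/hive/scratch").resolve(dest.getFileName());
    }

    static Path finalDir(Path dest) {
        return dest.resolveSibling(".hive-staging");
    }

    // Mirrors the reviewed hunk: choose the scratch-based location when the
    // flag is on, then append the static-partition subpath when one exists.
    static Path chooseTmpDir(Path destPath, boolean useScratchForStaging, String spPath) {
        Path tmpDir = useScratchForStaging ? interimDir(destPath) : finalDir(destPath);
        if (spPath != null) {
            tmpDir = tmpDir.resolve(spPath); // e.g. "ds=2022-12-05"
        }
        return tmpDir;
    }
}
```

With this shape, the flag only swaps the base directory; the static-partition suffix handling is identical on both branches, which is the structure the reviewer's indentation nit applies to.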
[jira] [Work logged] (HIVE-26809) Upgrade ORC to 1.8.0
[ https://issues.apache.org/jira/browse/HIVE-26809?focusedWorklogId=831219&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831219 ] ASF GitHub Bot logged work on HIVE-26809: - Author: ASF GitHub Bot Created on: 05/Dec/22 23:06 Start Date: 05/Dec/22 23:06 Worklog Time Spent: 10m Work Description: sonarcloud[bot] commented on PR #3833: URL: https://github.com/apache/hive/pull/3833#issuecomment-1338304065 Kudos, SonarCloud Quality Gate passed! 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 1 Code Smell, No Coverage information, No Duplication information Issue Time Tracking --- Worklog Id: (was: 831219) Time Spent: 20m (was: 10m) > Upgrade ORC to 1.8.0 > > > Key: HIVE-26809 > URL: https://issues.apache.org/jira/browse/HIVE-26809 > Project: Hive > Issue Type: Improvement > Affects Versions: 4.0.0 > Reporter: Dmitriy Fingerman > Assignee: Dmitriy Fingerman > Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-26809) Upgrade ORC to 1.8.0
[ https://issues.apache.org/jira/browse/HIVE-26809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-26809: -- Labels: pull-request-available (was: ) > Upgrade ORC to 1.8.0 > > > Key: HIVE-26809 > URL: https://issues.apache.org/jira/browse/HIVE-26809 > Project: Hive > Issue Type: Improvement >Affects Versions: 4.0.0 >Reporter: Dmitriy Fingerman >Assignee: Dmitriy Fingerman >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26809) Upgrade ORC to 1.8.0
[ https://issues.apache.org/jira/browse/HIVE-26809?focusedWorklogId=831207=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831207 ] ASF GitHub Bot logged work on HIVE-26809: - Author: ASF GitHub Bot Created on: 05/Dec/22 22:11 Start Date: 05/Dec/22 22:11 Worklog Time Spent: 10m Work Description: difin opened a new pull request, #3833: URL: https://github.com/apache/hive/pull/3833 ### What changes were proposed in this pull request? Upgrading ORC version to currently latest version 1.8.0. This PR is based on the changes proposed in unfinished PR https://github.com/apache/hive/pull/2853 (ticket https://issues.apache.org/jira/browse/HIVE-25497 - Bump ORC to 1.7.2) with changes on top of it which enabled CI to pass. Changes done in HIVE-25497: "LLAP EncodedTreeReaderFactory is implementing its own TreeReaderFactory Issue Time Tracking --- Worklog Id: (was: 831207) Remaining Estimate: 0h Time Spent: 10m > Upgrade ORC to 1.8.0 > > > Key: HIVE-26809 > URL: https://issues.apache.org/jira/browse/HIVE-26809 > Project: Hive > Issue Type: Improvement >Affects Versions: 4.0.0 >Reporter: Dmitriy Fingerman >Assignee: Dmitriy Fingerman >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-26809) Upgrade ORC to 1.8.0
[ https://issues.apache.org/jira/browse/HIVE-26809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Fingerman updated HIVE-26809: - Affects Version/s: 4.0.0 > Upgrade ORC to 1.8.0 > > > Key: HIVE-26809 > URL: https://issues.apache.org/jira/browse/HIVE-26809 > Project: Hive > Issue Type: Improvement >Affects Versions: 4.0.0 >Reporter: Dmitriy Fingerman >Assignee: Dmitriy Fingerman >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HIVE-26809) Upgrade ORC to 1.8.0
[ https://issues.apache.org/jira/browse/HIVE-26809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Fingerman reassigned HIVE-26809: > Upgrade ORC to 1.8.0 > > > Key: HIVE-26809 > URL: https://issues.apache.org/jira/browse/HIVE-26809 > Project: Hive > Issue Type: Improvement >Reporter: Dmitriy Fingerman >Assignee: Dmitriy Fingerman >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-23559) Optimise Hive::moveAcidFiles for cloud storage
[ https://issues.apache.org/jira/browse/HIVE-23559?focusedWorklogId=831201=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831201 ] ASF GitHub Bot logged work on HIVE-23559: - Author: ASF GitHub Bot Created on: 05/Dec/22 21:51 Start Date: 05/Dec/22 21:51 Worklog Time Spent: 10m Work Description: ramesh0201 commented on code in PR #3795: URL: https://github.com/apache/hive/pull/3795#discussion_r1040134949 ## ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java: ## @@ -5208,55 +5208,94 @@ private static void moveAcidFiles(String deltaFileType, PathFilter pathFilter, F } LOG.debug("Acid move found " + deltaStats.length + " " + deltaFileType + " files"); +List> futures = new LinkedList<>(); +final ExecutorService pool = conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 25) > 0 ? + Executors.newFixedThreadPool(conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 25), +new ThreadFactoryBuilder().setDaemon(true).setNameFormat("Move-Acid-Files-Thread-%d").build()) : null; + for (FileStatus deltaStat : deltaStats) { - Path deltaPath = deltaStat.getPath(); - // Create the delta directory. Don't worry if it already exists, - // as that likely means another task got to it first. Then move each of the buckets. - // it would be more efficient to try to move the delta with it's buckets but that is - // harder to make race condition proof. - Path deltaDest = new Path(dst, deltaPath.getName()); - try { -if (!createdDeltaDirs.contains(deltaDest)) { - try { -if(fs.mkdirs(deltaDest)) { - try { - fs.rename(AcidUtils.OrcAcidVersion.getVersionFilePath(deltaStat.getPath()), -AcidUtils.OrcAcidVersion.getVersionFilePath(deltaDest)); - } catch (FileNotFoundException fnf) { -// There might be no side file. Skip in this case. 
- } + + if (null == pool) { +moveAcidFilesForDelta(deltaFileType, fs, dst, createdDeltaDirs, newFiles, deltaStat); + } else { +futures.add(pool.submit(new Callable() { + @Override + public Void call() throws HiveException { +try { + moveAcidFilesForDelta(deltaFileType, fs, dst, createdDeltaDirs, newFiles, deltaStat); +} catch (Exception e) { + final String poolMsg = + "Unable to move source " + deltaStat.getPath().getName() + " to destination " + dst.getName(); + throw getHiveException(e, poolMsg); } -createdDeltaDirs.add(deltaDest); - } catch (IOException swallowIt) { -// Don't worry about this, as it likely just means it's already been created. -LOG.info("Unable to create " + deltaFileType + " directory " + deltaDest + -", assuming it already exists: " + swallowIt.getMessage()); +return null; } +})); + } +} + +if (null != pool) { + pool.shutdown(); + for (Future future : futures) { +try { + future.get(); +} catch (Exception e) { + throw handlePoolException(pool, e); } -FileStatus[] bucketStats = fs.listStatus(deltaPath, AcidUtils.bucketFileFilter); -LOG.debug("Acid move found " + bucketStats.length + " bucket files"); -for (FileStatus bucketStat : bucketStats) { - Path bucketSrc = bucketStat.getPath(); - Path bucketDest = new Path(deltaDest, bucketSrc.getName()); - final String msg = "Unable to move source " + bucketSrc + " to destination " + - bucketDest; - LOG.info("Moving bucket " + bucketSrc.toUri().toString() + " to " + - bucketDest.toUri().toString()); - try { -fs.rename(bucketSrc, bucketDest); -if (newFiles != null) { - newFiles.add(bucketDest); + } +} + } + + private static void moveAcidFilesForDelta(String deltaFileType, FileSystem fs, +Path dst, Set createdDeltaDirs, +List newFiles, FileStatus deltaStat) throws HiveException { + +Path deltaPath = deltaStat.getPath(); +// Create the delta directory. Don't worry if it already exists, +// as that likely means another task got to it first. Then move each of the buckets. 
+// it would be more efficient to try to move the delta with it's buckets but that is +// harder to make race condition proof. +Path deltaDest = new Path(dst, deltaPath.getName()); +try { + if (!createdDeltaDirs.contains(deltaDest)) { +try { + if(fs.mkdirs(deltaDest)) { +try { +
[jira] [Work logged] (HIVE-23559) Optimise Hive::moveAcidFiles for cloud storage
[ https://issues.apache.org/jira/browse/HIVE-23559?focusedWorklogId=831200=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831200 ] ASF GitHub Bot logged work on HIVE-23559: - Author: ASF GitHub Bot Created on: 05/Dec/22 21:50 Start Date: 05/Dec/22 21:50 Worklog Time Spent: 10m Work Description: ramesh0201 commented on code in PR #3795: URL: https://github.com/apache/hive/pull/3795#discussion_r1040133418 ## ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java: ## @@ -5208,55 +5208,94 @@ private static void moveAcidFiles(String deltaFileType, PathFilter pathFilter, F } LOG.debug("Acid move found " + deltaStats.length + " " + deltaFileType + " files"); +List> futures = new LinkedList<>(); +final ExecutorService pool = conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 25) > 0 ? + Executors.newFixedThreadPool(conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 25), +new ThreadFactoryBuilder().setDaemon(true).setNameFormat("Move-Acid-Files-Thread-%d").build()) : null; + for (FileStatus deltaStat : deltaStats) { - Path deltaPath = deltaStat.getPath(); - // Create the delta directory. Don't worry if it already exists, - // as that likely means another task got to it first. Then move each of the buckets. - // it would be more efficient to try to move the delta with it's buckets but that is - // harder to make race condition proof. - Path deltaDest = new Path(dst, deltaPath.getName()); - try { -if (!createdDeltaDirs.contains(deltaDest)) { - try { -if(fs.mkdirs(deltaDest)) { - try { - fs.rename(AcidUtils.OrcAcidVersion.getVersionFilePath(deltaStat.getPath()), -AcidUtils.OrcAcidVersion.getVersionFilePath(deltaDest)); - } catch (FileNotFoundException fnf) { -// There might be no side file. Skip in this case. 
- } + + if (null == pool) { +moveAcidFilesForDelta(deltaFileType, fs, dst, createdDeltaDirs, newFiles, deltaStat); + } else { +futures.add(pool.submit(new Callable() { + @Override + public Void call() throws HiveException { +try { + moveAcidFilesForDelta(deltaFileType, fs, dst, createdDeltaDirs, newFiles, deltaStat); +} catch (Exception e) { + final String poolMsg = + "Unable to move source " + deltaStat.getPath().getName() + " to destination " + dst.getName(); + throw getHiveException(e, poolMsg); } -createdDeltaDirs.add(deltaDest); - } catch (IOException swallowIt) { -// Don't worry about this, as it likely just means it's already been created. -LOG.info("Unable to create " + deltaFileType + " directory " + deltaDest + -", assuming it already exists: " + swallowIt.getMessage()); +return null; } +})); + } +} Review Comment: I think we need to handle the thread interruption. We might need to cancel the running futures and and interrupt the current thread. Issue Time Tracking --- Worklog Id: (was: 831200) Time Spent: 0.5h (was: 20m) > Optimise Hive::moveAcidFiles for cloud storage > -- > > Key: HIVE-23559 > URL: https://issues.apache.org/jira/browse/HIVE-23559 > Project: Hive > Issue Type: Improvement >Reporter: Rajesh Balamohan >Assignee: Dmitriy Fingerman >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L4752] > It ends up transferring DELTA, DELETE_DELTA, BASE prefixes sequentially from > staging to final location. > This causes delays even with simple updates statements, which updates smaller > number of records in cloud storage. -- This message was sent by Atlassian Jira (v8.20.10#820010)
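The patch discussed above fans the per-delta moves out to a fixed thread pool (falling back to sequential moves when the pool is disabled) and then waits on the futures; the review comment asks that interruption be handled by cancelling outstanding futures and restoring the interrupt flag. A self-contained sketch of that pattern, using plain Callables in place of the actual filesystem renames (all names here are illustrative, not Hive's):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelMover {
    // Runs the move tasks on a fixed pool, or inline when threads <= 0
    // (mirroring the "null pool" branch in the patch). On interruption it
    // cancels the outstanding futures and restores the interrupt flag,
    // which is what the review comment asks for.
    static void moveAll(List<Callable<Void>> moves, int threads) throws ExecutionException {
        if (threads <= 0) {
            for (Callable<Void> m : moves) {
                try {
                    m.call();
                } catch (Exception e) {
                    throw new ExecutionException(e);
                }
            }
            return;
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Void>> futures = new ArrayList<>();
        for (Callable<Void> m : moves) {
            futures.add(pool.submit(m));
        }
        pool.shutdown(); // no new tasks; submitted ones keep running
        try {
            for (Future<Void> f : futures) {
                f.get(); // surface the first failure
            }
        } catch (InterruptedException ie) {
            for (Future<Void> f : futures) {
                f.cancel(true); // stop work that has not completed yet
            }
            Thread.currentThread().interrupt(); // preserve interrupt status
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The sequential branch matters for object stores too: when renames are cheap (HDFS) a pool size of 0 avoids thread overhead, while on blobstores the parallel branch hides per-object rename latency.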
[jira] [Commented] (HIVE-26806) Precommit tests in CI are timing out after HIVE-26796
[ https://issues.apache.org/jira/browse/HIVE-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643560#comment-17643560 ] Stamatis Zampetakis commented on HIVE-26806: [~asolimando] For documentation purposes, can you elaborate on what happened after deleting all successful builds? What was the problem that you observed? > Precommit tests in CI are timing out after HIVE-26796 > - > > Key: HIVE-26806 > URL: https://issues.apache.org/jira/browse/HIVE-26806 > Project: Hive > Issue Type: Bug > Components: Testing Infrastructure > Reporter: Stamatis Zampetakis > Assignee: Stamatis Zampetakis > Priority: Major > > http://ci.hive.apache.org/job/hive-precommit/job/master/1506/ > {noformat} > Cancelling nested steps due to timeout > 15:22:08 Sending interrupt signal to process > 15:22:08 Killing processes > 15:22:09 kill finished with exit code 0 > 15:22:19 Terminated > 15:22:19 script returned exit code 143 > [Pipeline] } > [Pipeline] // withEnv > [Pipeline] } > 15:22:19 Deleting 1 temporary files > [Pipeline] // configFileProvider > [Pipeline] } > [Pipeline] // stage > [Pipeline] stage > [Pipeline] { (PostProcess) > [Pipeline] sh > [Pipeline] sh > [Pipeline] sh > [Pipeline] junit > 15:22:25 Recording test results > 15:22:32 [Checks API] No suitable checks publisher found. > [Pipeline] } > [Pipeline] // stage > [Pipeline] } > [Pipeline] // container > [Pipeline] } > [Pipeline] // node > [Pipeline] } > [Pipeline] // timeout > [Pipeline] } > [Pipeline] // podTemplate > [Pipeline] } > 15:22:32 Failed in branch split-01 > [Pipeline] // parallel > [Pipeline] } > [Pipeline] // stage > [Pipeline] stage > [Pipeline] { (Archive) > [Pipeline] podTemplate > [Pipeline] { > [Pipeline] timeout > 15:22:33 Timeout set to expire in 6 hr 0 min > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26770) Make "end of loop" compaction logs appear more selectively
[ https://issues.apache.org/jira/browse/HIVE-26770?focusedWorklogId=831177&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831177 ] ASF GitHub Bot logged work on HIVE-26770: - Author: ASF GitHub Bot Created on: 05/Dec/22 20:49 Start Date: 05/Dec/22 20:49 Worklog Time Spent: 10m Work Description: sonarcloud[bot] commented on PR #3832: URL: https://github.com/apache/hive/pull/3832#issuecomment-1338144472 Kudos, SonarCloud Quality Gate passed! 1 Bug (rating C), 0 Vulnerabilities, 0 Security Hotspots, 10 Code Smells, No Coverage information, No Duplication information Issue Time Tracking --- Worklog Id: (was: 831177) Time Spent: 5h 20m (was: 5h 10m) > Make "end of loop" compaction logs appear more selectively > -- > > Key: HIVE-26770 > URL: https://issues.apache.org/jira/browse/HIVE-26770 > Project: Hive > Issue Type: Improvement > Affects Versions: 4.0.0-alpha-1 > Reporter: Akshat Mathur > Assignee: Akshat Mathur > Priority: Major > Labels: pull-request-available > Time Spent: 5h 20m > Remaining Estimate: 0h > > Currently Initiator, Worker, and Cleaner threads log something like "finished > one loop" on INFO level. > This is useful to figure out if one of these threads is taking too long to > finish a loop, but expensive in general.
> > Suggested Time: 20mins > Logging this should be changed in the following way: > # If the loop finished within a predefined amount of time, the level should be DEBUG > and the message should look like: *Initiator loop took \{elapsedTime} seconds to > finish.* > # If the loop ran longer than this predefined amount, the level should be WARN and > the message should look like: *Possible Initiator slowdown, loop took > \{elapsedTime} seconds to finish.* -- This message was sent by Atlassian Jira (v8.20.10#820010)
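The two cases suggested in the ticket boil down to picking the log level and message from the elapsed time. A tiny sketch, with a hypothetical threshold (the ticket leaves the "predefined amount of time" unspecified) and a returned string standing in for the actual SLF4J call:

```java
public class LoopLogger {
    // Hypothetical threshold; the ticket does not fix a value.
    static final long SLOW_LOOP_SECONDS = 60;

    // Returns the line that would be logged: DEBUG for a fast loop,
    // WARN for a suspiciously slow one, per the suggested messages.
    static String loopFinished(String component, long elapsedSeconds) {
        if (elapsedSeconds > SLOW_LOOP_SECONDS) {
            return "WARN: Possible " + component + " slowdown, loop took "
                + elapsedSeconds + " seconds to finish.";
        }
        return "DEBUG: " + component + " loop took "
            + elapsedSeconds + " seconds to finish.";
    }
}
```

In the real patch the same branch would select between LOG.debug and LOG.warn, so quiet clusters stop paying for per-loop INFO lines while slow loops still surface.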
[jira] [Work logged] (HIVE-26758) Allow use scratchdir for staging final job
[ https://issues.apache.org/jira/browse/HIVE-26758?focusedWorklogId=831165&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831165 ] ASF GitHub Bot logged work on HIVE-26758: - Author: ASF GitHub Bot Created on: 05/Dec/22 19:28 Start Date: 05/Dec/22 19:28 Worklog Time Spent: 10m Work Description: sonarcloud[bot] commented on PR #3831: URL: https://github.com/apache/hive/pull/3831#issuecomment-1338025065 Kudos, SonarCloud Quality Gate passed! 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 2 Code Smells, No Coverage information, No Duplication information Issue Time Tracking --- Worklog Id: (was: 831165) Time Spent: 3h 20m (was: 3h 10m) > Allow use scratchdir for staging final job > -- > > Key: HIVE-26758 > URL: https://issues.apache.org/jira/browse/HIVE-26758 > Project: Hive > Issue Type: New Feature > Components: Query Planning > Affects Versions: 4.0.0-alpha-2 > Reporter: Yi Zhang > Assignee: Yi Zhang > Priority: Minor > Labels: pull-request-available > Time Spent: 3h 20m > Remaining Estimate: 0h > > The query results are staged in a stagingdir that is relative to the > destination path // > during blobstorage optimization HIVE-17620 the final job is set to use the stagingdir.
> HIVE-15215 mentioned the possibility of using the scratchdir for staging when writing > to S3, but that was a long time ago and saw no activity. > > This is to allow the final job to use hive.exec.scratchdir like the interim jobs, > gated by a configuration: > hive.use.scratchdir.for.staging > This is useful for cross-filesystem writes: the user can stage on the local source > filesystem instead of the remote filesystem. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HIVE-26762) Remove operand pruning in HiveFilterSetOpTransposeRule
[ https://issues.apache.org/jira/browse/HIVE-26762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Kasa resolved HIVE-26762. --- Resolution: Fixed Merged to master. Thanks [~asolimando] for the patch. > Remove operand pruning in HiveFilterSetOpTransposeRule > -- > > Key: HIVE-26762 > URL: https://issues.apache.org/jira/browse/HIVE-26762 > Project: Hive > Issue Type: Task > Components: CBO, Query Planning > Affects Versions: 4.0.0-alpha-2 > Reporter: Alessandro Solimando > Assignee: Alessandro Solimando > Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > HiveFilterSetOpTransposeRule, when applied to UNION ALL operands, checks if > the newly pushed filter simplifies to FALSE (due to the predicates holding on > the input). > If this is true and there is more than one UNION ALL operand, the operand gets pruned. > After HIVE-26524 ("Use Calcite to remove sections of a query plan known never > produces rows"), this is possibly redundant and we could drop this feature > and let the other rules take care of the pruning. > In such a case, it might even be possible to drop the Hive-specific rule and > rely on the Calcite one (the difference is just the operand pruning at the > moment of writing), similarly to what HIVE-26642 did for > HiveReduceExpressionRule. Writing it here as a reminder, but it's recommended > to tackle this in a separate ticket after verifying that this is feasible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26762) Remove operand pruning in HiveFilterSetOpTransposeRule
[ https://issues.apache.org/jira/browse/HIVE-26762?focusedWorklogId=831161=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831161 ] ASF GitHub Bot logged work on HIVE-26762: - Author: ASF GitHub Bot Created on: 05/Dec/22 19:17 Start Date: 05/Dec/22 19:17 Worklog Time Spent: 10m Work Description: kasakrisz merged PR #3825: URL: https://github.com/apache/hive/pull/3825 Issue Time Tracking --- Worklog Id: (was: 831161) Time Spent: 1h 10m (was: 1h) > Remove operand pruning in HiveFilterSetOpTransposeRule > -- > > Key: HIVE-26762 > URL: https://issues.apache.org/jira/browse/HIVE-26762 > Project: Hive > Issue Type: Task > Components: CBO, Query Planning >Affects Versions: 4.0.0-alpha-2 >Reporter: Alessandro Solimando >Assignee: Alessandro Solimando >Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > HiveFilterSetOpTransposeRule, when applied to UNION ALL operands, checks if > the newly pushed filter simplifies to FALSE (due to the predicates holding on > the input). > If this is true and there is more than one UNION ALL operand, it gets pruned. > After HIVE-26524 ("Use Calcite to remove sections of a query plan known never > produces rows"), this is possibly redundant and we could drop this feature > and let the other rules take care of the pruning. > In such a case, it might be even possible to drop the Hive specific rule and > relies on the Calcite one (the difference is just the operand pruning at the > moment of writing), similarly to what HIVE-26642 did for > HiveReduceExpressionRule. Writing it here as a reminder, but it's recommended > to tackle this in a separate ticket after verifying that is feasible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics
[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=831158=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831158 ] ASF GitHub Bot logged work on HIVE-26221: - Author: ASF GitHub Bot Created on: 05/Dec/22 19:08 Start Date: 05/Dec/22 19:08 Worklog Time Spent: 10m Work Description: sonarcloud[bot] commented on PR #3137: URL: https://github.com/apache/hive/pull/3137#issuecomment-1337993922 Kudos, SonarCloud Quality Gate passed! 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 38 Code Smells, no coverage information, no duplication information. Issue Time Tracking --- Worklog Id: (was: 831158) Time Spent: 4h (was: 3h 50m) > Add histogram-based column statistics > - > > Key: HIVE-26221 > URL: https://issues.apache.org/jira/browse/HIVE-26221 > Project: Hive > Issue Type: Improvement > Components: CBO, Metastore, Statistics >Affects Versions: 4.0.0-alpha-2 >Reporter: Alessandro Solimando >Assignee: Alessandro Solimando >Priority: Major > Labels: pull-request-available > Time Spent: 4h > Remaining Estimate: 0h > > Hive does not support histogram statistics, which are particularly useful for > skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a > hard-coded value of 1/3 (see > [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]). > The current proposal aims at integrating histograms as an additional column > statistic, stored into the Hive metastore at the table (or partition) level. > The main requirements for histogram integration are the following: > * efficiency: the approach must scale and support billions of rows > * merge-ability: partition-level histograms have to be merged to form > table-level histograms > * explicit and configurable trade-off
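One simple structure that satisfies the efficiency and merge-ability requirements above is a fixed-bin equi-width histogram. This is purely illustrative (the actual HIVE-26221 design may use a different sketch); it shows partition-level histograms merging into a table-level one and a range-predicate selectivity estimate that replaces the hard-coded 1/3:

```python
# Illustrative only: a fixed-bin equi-width histogram that is cheap to build,
# mergeable across partitions, and usable for range-predicate selectivity.

def build_hist(values, lo, hi, nbins):
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        counts[min(int((v - lo) / width), nbins - 1)] += 1
    return counts

def merge_hists(hists):
    # Partition-level histograms with identical bins merge by summing counts.
    return [sum(col) for col in zip(*hists)]

def le_selectivity(counts, lo, hi, x):
    # Estimated selectivity of "col <= x", assuming uniformity within a bin.
    width = (hi - lo) / len(counts)
    mass = 0.0
    for i, c in enumerate(counts):
        b_lo = lo + i * width
        if x >= b_lo + width:
            mass += c                      # whole bin below x
        elif x > b_lo:
            mass += c * (x - b_lo) / width  # partial bin
    return mass / sum(counts)

p1 = build_hist(range(0, 50), 0, 100, 10)    # histogram for partition 1
p2 = build_hist(range(50, 100), 0, 100, 10)  # histogram for partition 2
table = merge_hists([p1, p2])
sel = le_selectivity(table, 0, 100, 50)      # data-driven, not a fixed 1/3
```

Merge-ability is what makes the approach scale: each partition scan builds its own histogram, and table-level statistics are assembled without rescanning.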
[jira] [Work logged] (HIVE-26770) Make "end of loop" compaction logs appear more selectively
[ https://issues.apache.org/jira/browse/HIVE-26770?focusedWorklogId=831154=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831154 ] ASF GitHub Bot logged work on HIVE-26770: - Author: ASF GitHub Bot Created on: 05/Dec/22 18:59 Start Date: 05/Dec/22 18:59 Worklog Time Spent: 10m Work Description: akshat0395 closed pull request #3803: HIVE-26770: Make end of loop compaction logs appear more selectively and reduce code duplication URL: https://github.com/apache/hive/pull/3803 Issue Time Tracking --- Worklog Id: (was: 831154) Time Spent: 5h (was: 4h 50m) > Make "end of loop" compaction logs appear more selectively > -- > > Key: HIVE-26770 > URL: https://issues.apache.org/jira/browse/HIVE-26770 > Project: Hive > Issue Type: Improvement >Affects Versions: 4.0.0-alpha-1 >Reporter: Akshat Mathur >Assignee: Akshat Mathur >Priority: Major > Labels: pull-request-available > Time Spent: 5h > Remaining Estimate: 0h > > Currently Initiator, Worker, and Cleaner threads log something like "finished > one loop" on INFO level. > This is useful to figure out if one of these threads is taking too long to > finish a loop, but expensive in general. > > Suggested Time: 20mins > Logging this should be changed in the following way > # If the loop finished within a predefined amount of time, the level should be DEBUG > and the message should look like: *Initiator loop took \{elapsedTime} seconds to > finish.* > # If the loop ran longer than this predefined amount, the level should be WARN and > the message should look like: *Possible Initiator slowdown, loop took > \{elapsedTime} seconds to finish.* -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26770) Make "end of loop" compaction logs appear more selectively
[ https://issues.apache.org/jira/browse/HIVE-26770?focusedWorklogId=831155=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831155 ] ASF GitHub Bot logged work on HIVE-26770: - Author: ASF GitHub Bot Created on: 05/Dec/22 18:59 Start Date: 05/Dec/22 18:59 Worklog Time Spent: 10m Work Description: akshat0395 opened a new pull request, #3832: URL: https://github.com/apache/hive/pull/3832 ### What changes were proposed in this pull request? Make "end of loop" compaction logs appear more selectively and move duplicate code from Compactor threads to base class, more details can be found in the following ticket [HIVE-26770](https://issues.apache.org/jira/browse/HIVE-26770) ### Why are the changes needed? Improved logging for Compactor threads to reduce noise and share time based stats ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Issue Time Tracking --- Worklog Id: (was: 831155) Time Spent: 5h 10m (was: 5h) > Make "end of loop" compaction logs appear more selectively > -- > > Key: HIVE-26770 > URL: https://issues.apache.org/jira/browse/HIVE-26770 > Project: Hive > Issue Type: Improvement >Affects Versions: 4.0.0-alpha-1 >Reporter: Akshat Mathur >Assignee: Akshat Mathur >Priority: Major > Labels: pull-request-available > Time Spent: 5h 10m > Remaining Estimate: 0h > > Currently Initiator, Worker, and Cleaner threads log something like "finished > one loop" on INFO level. > This is useful to figure out if one of these threads is taking too long to > finish a loop, but expensive in general. 
> > Suggested Time: 20mins > Logging this should be changed in the following way > # If the loop finished within a predefined amount of time, the level should be DEBUG > and the message should look like: *Initiator loop took \{elapsedTime} seconds to > finish.* > # If the loop ran longer than this predefined amount, the level should be WARN and > the message should look like: *Possible Initiator slowdown, loop took > \{elapsedTime} seconds to finish.* -- This message was sent by Atlassian Jira (v8.20.10#820010)
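The proposed policy is simple enough to sketch. Illustrative Python only, not the actual Hive Initiator/Worker/Cleaner code; the logger name and threshold parameter are made up:

```python
# Sketch of the proposed end-of-loop logging policy: DEBUG for a fast loop
# iteration, WARN once the elapsed time exceeds a configured threshold.
import logging

def loop_log_level(elapsed_s, threshold_s):
    """Level for the end-of-loop message: DEBUG when fast, WARN when slow."""
    return logging.DEBUG if elapsed_s <= threshold_s else logging.WARNING

def log_loop_duration(name, elapsed_s, threshold_s):
    level = loop_log_level(elapsed_s, threshold_s)
    template = ("Possible %s slowdown, loop took %.1f seconds to finish."
                if level == logging.WARNING
                else "%s loop took %.1f seconds to finish.")
    logging.getLogger("compactor").log(level, template, name, elapsed_s)
    return level
```

This keeps the slow-loop diagnostic (now more visible at WARN) while removing the per-iteration INFO noise.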
[jira] [Work logged] (HIVE-26758) Allow use scratchdir for staging final job
[ https://issues.apache.org/jira/browse/HIVE-26758?focusedWorklogId=831150=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831150 ] ASF GitHub Bot logged work on HIVE-26758: - Author: ASF GitHub Bot Created on: 05/Dec/22 18:37 Start Date: 05/Dec/22 18:37 Worklog Time Spent: 10m Work Description: yigress opened a new pull request, #3831: URL: https://github.com/apache/hive/pull/3831 ### What changes were proposed in this pull request? 1. add a hive configuration hive.use.scratchdir.for.staging 2. for native table, no-mm, no-direct-insert, no-acid, change dynamic partition staging directory layout from /// to /// 3. when hive.use.scratchdir.for.staging=true, FileSinkOperator's dirName, DynamicContext's sourcePath change from /{hive.exec.stagingdir} to for example for query insert into/overwrite table partition(year=2001, season) select... before the change, the FileSinkOperator conf has /year=2001/.staging_dir/season=xxx after the change, it has /.staging_dir/year=2001/season=xxx This change allows swapping with another path such as , and the moveTask will move into ### Why are the changes needed? In the S3 blobstorage optimization, HIVE-15121 / HIVE-17620 changed the interim job path to use hive.exec.scratchdir, and the final job to use hive.exec.stagingdir. https://issues.apache.org/jira/browse/HIVE-15215 is open on whether to use scratch for the staging dir for S3. However, for blobstorage where 'rename' is slow and there is no encryption, it can help performance to use the scratchdir to stage query results and use the MoveTask to copy to blobstorage. This is especially true when there is a FileMerge task. This may also help cross-filesystem use, when a user wants to use the local cluster filesystem to stage query results and move the results to the destination filesystem. ### Does this PR introduce _any_ user-facing change? This adds a new hive configuration. ### How was this patch tested?
Issue Time Tracking --- Worklog Id: (was: 831150) Time Spent: 3h 10m (was: 3h) > Allow use scratchdir for staging final job > -- > > Key: HIVE-26758 > URL: https://issues.apache.org/jira/browse/HIVE-26758 > Project: Hive > Issue Type: New Feature > Components: Query Planning >Affects Versions: 4.0.0-alpha-2 >Reporter: Yi Zhang >Assignee: Yi Zhang >Priority: Minor > Labels: pull-request-available > Time Spent: 3h 10m > Remaining Estimate: 0h > > The query results are staged in stagingdir that is relative to the > destination path // > during blobstorage optimzation HIVE-17620 final job is set to use stagingdir. > HIVE-15215 mentioned the possibility of using scratch for staging when write > to S3 but it was long time ago and no activity. > > This is to allow final job to use hive.exec.scratchdir as the interim jobs, > with a configuration > hive.use.scratchdir.for.staging > This is useful for cross Filesystem, user can use local source filesystem > instead of remote filesystem for the staging. -- This message was sent by Atlassian Jira (v8.20.10#820010)
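The before/after layouts quoted in the PR description can be sketched as path construction. The helper below is hypothetical, not Hive's actual FileSinkOperator logic:

```python
# Hypothetical sketch of the two staging-dir layouts described in the PR:
# nesting the staging dir under the static partition dirs (old) versus
# putting it above all partition dirs (proposed).
import posixpath

def staging_path(dest, static_parts, dyn_parts, staging=".staging_dir",
                 staging_above_partitions=True):
    if staging_above_partitions:  # proposed layout: one movable staging tree
        parts = [f"{k}={v}" for k, v in static_parts + dyn_parts]
        return posixpath.join(dest, staging, *parts)
    # old layout: staging nested under the static partition dirs
    static = [f"{k}={v}" for k, v in static_parts]
    dyn = [f"{k}={v}" for k, v in dyn_parts]
    return posixpath.join(dest, *static, staging, *dyn)

old = staging_path("/warehouse/t", [("year", 2001)], [("season", "xxx")],
                   staging_above_partitions=False)
new = staging_path("/warehouse/t", [("year", 2001)], [("season", "xxx")])
# old: /warehouse/t/year=2001/.staging_dir/season=xxx
# new: /warehouse/t/.staging_dir/year=2001/season=xxx
```

With the staging dir above the partition dirs, everything under `.staging_dir` forms one subtree that can live on hive.exec.scratchdir (possibly a different filesystem) and be moved to the destination by a single MoveTask.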
[jira] [Work logged] (HIVE-26758) Allow use scratchdir for staging final job
[ https://issues.apache.org/jira/browse/HIVE-26758?focusedWorklogId=831142=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831142 ] ASF GitHub Bot logged work on HIVE-26758: - Author: ASF GitHub Bot Created on: 05/Dec/22 18:21 Start Date: 05/Dec/22 18:21 Worklog Time Spent: 10m Work Description: yigress closed pull request #3781: HIVE-26758: Allow use scratchdir for staging final job URL: https://github.com/apache/hive/pull/3781 Issue Time Tracking --- Worklog Id: (was: 831142) Time Spent: 3h (was: 2h 50m) > Allow use scratchdir for staging final job > -- > > Key: HIVE-26758 > URL: https://issues.apache.org/jira/browse/HIVE-26758 > Project: Hive > Issue Type: New Feature > Components: Query Planning >Affects Versions: 4.0.0-alpha-2 >Reporter: Yi Zhang >Assignee: Yi Zhang >Priority: Minor > Labels: pull-request-available > Time Spent: 3h > Remaining Estimate: 0h > > The query results are staged in stagingdir that is relative to the > destination path // > during blobstorage optimzation HIVE-17620 final job is set to use stagingdir. > HIVE-15215 mentioned the possibility of using scratch for staging when write > to S3 but it was long time ago and no activity. > > This is to allow final job to use hive.exec.scratchdir as the interim jobs, > with a configuration > hive.use.scratchdir.for.staging > This is useful for cross Filesystem, user can use local source filesystem > instead of remote filesystem for the staging. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26758) Allow use scratchdir for staging final job
[ https://issues.apache.org/jira/browse/HIVE-26758?focusedWorklogId=831141=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831141 ] ASF GitHub Bot logged work on HIVE-26758: - Author: ASF GitHub Bot Created on: 05/Dec/22 18:21 Start Date: 05/Dec/22 18:21 Worklog Time Spent: 10m Work Description: yigress commented on PR #3781: URL: https://github.com/apache/hive/pull/3781#issuecomment-1337898840 close this one and to create a new PR due to testing issue https://issues.apache.org/jira/browse/HIVE-26806 Issue Time Tracking --- Worklog Id: (was: 831141) Time Spent: 2h 50m (was: 2h 40m) > Allow use scratchdir for staging final job > -- > > Key: HIVE-26758 > URL: https://issues.apache.org/jira/browse/HIVE-26758 > Project: Hive > Issue Type: New Feature > Components: Query Planning >Affects Versions: 4.0.0-alpha-2 >Reporter: Yi Zhang >Assignee: Yi Zhang >Priority: Minor > Labels: pull-request-available > Time Spent: 2h 50m > Remaining Estimate: 0h > > The query results are staged in stagingdir that is relative to the > destination path // > during blobstorage optimzation HIVE-17620 final job is set to use stagingdir. > HIVE-15215 mentioned the possibility of using scratch for staging when write > to S3 but it was long time ago and no activity. > > This is to allow final job to use hive.exec.scratchdir as the interim jobs, > with a configuration > hive.use.scratchdir.for.staging > This is useful for cross Filesystem, user can use local source filesystem > instead of remote filesystem for the staging. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-26685) Improve Path name escaping / unescaping performance
[ https://issues.apache.org/jira/browse/HIVE-26685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Zheng updated HIVE-26685: - Assignee: James Petty Resolution: Fixed Status: Resolved (was: Patch Available) > Improve Path name escaping / unescaping performance > --- > > Key: HIVE-26685 > URL: https://issues.apache.org/jira/browse/HIVE-26685 > Project: Hive > Issue Type: Improvement > Components: Hive >Affects Versions: All Versions >Reporter: James Petty >Assignee: James Petty >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-26685.1.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > When escaping / unescaping partition path part names, the existing logic > incurs significant avoidable overhead by copying each character sequentially > into a new StringBuilder even when no escaping/unescaping is necessary as > well as using String.format to escape characters inside of the inner loop. > > The included patch to improve the performance of these operations refactors > two static method implementations, but requires no external API surface or > user-visible behavior changes. This change is applicable and portable to a > wide range of Hive versions from branch-0.6 onward when the initial method > implementations were added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
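The described optimization (skip the copy entirely when nothing needs escaping, and precompute escape codes instead of formatting inside the inner loop) can be sketched like this. Illustrative Python, not Hive's Java implementation; the special-character set is hypothetical, not Hive's real list:

```python
# Illustrative fast-path version: scan first and return the input untouched
# when no escapable character is present; precompute escape codes rather than
# calling a String.format-style routine per character.
SPECIALS = set(':/#?=%')  # hypothetical set; Hive's actual list differs
ESCAPES = {c: '%%%02X' % ord(c) for c in SPECIALS}  # e.g. '=' -> '%3D'

def escape_path_name(s):
    if not any(c in SPECIALS for c in s):
        return s  # fast path: no StringBuilder-style copy at all
    return ''.join(ESCAPES.get(c, c) for c in s)
```

The same shape applies to unescaping: look for a '%' first, and only allocate a new string when one is actually found.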
[jira] [Updated] (HIVE-26685) Improve Path name escaping / unescaping performance
[ https://issues.apache.org/jira/browse/HIVE-26685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Zheng updated HIVE-26685: - Fix Version/s: 4.0.0 > Improve Path name escaping / unescaping performance > --- > > Key: HIVE-26685 > URL: https://issues.apache.org/jira/browse/HIVE-26685 > Project: Hive > Issue Type: Improvement > Components: Hive >Affects Versions: All Versions >Reporter: James Petty >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-26685.1.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > When escaping / unescaping partition path part names, the existing logic > incurs significant avoidable overhead by copying each character sequentially > into a new StringBuilder even when no escaping/unescaping is necessary as > well as using String.format to escape characters inside of the inner loop. > > The included patch to improve the performance of these operations refactors > two static method implementations, but requires no external API surface or > user-visible behavior changes. This change is applicable and portable to a > wide range of Hive versions from branch-0.6 onward when the initial method > implementations were added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26685) Improve Path name escaping / unescaping performance
[ https://issues.apache.org/jira/browse/HIVE-26685?focusedWorklogId=831136=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831136 ] ASF GitHub Bot logged work on HIVE-26685: - Author: ASF GitHub Bot Created on: 05/Dec/22 18:04 Start Date: 05/Dec/22 18:04 Worklog Time Spent: 10m Work Description: weiatwork commented on PR #3721: URL: https://github.com/apache/hive/pull/3721#issuecomment-1337872037 Thanks Zoltan! That helped a lot. Going to merge this PR. Issue Time Tracking --- Worklog Id: (was: 831136) Time Spent: 1h (was: 50m) > Improve Path name escaping / unescaping performance > --- > > Key: HIVE-26685 > URL: https://issues.apache.org/jira/browse/HIVE-26685 > Project: Hive > Issue Type: Improvement > Components: Hive >Affects Versions: All Versions >Reporter: James Petty >Priority: Minor > Labels: pull-request-available > Attachments: HIVE-26685.1.patch > > Time Spent: 1h > Remaining Estimate: 0h > > When escaping / unescaping partition path part names, the existing logic > incurs significant avoidable overhead by copying each character sequentially > into a new StringBuilder even when no escaping/unescaping is necessary as > well as using String.format to escape characters inside of the inner loop. > > The included patch to improve the performance of these operations refactors > two static method implementations, but requires no external API surface or > user-visible behavior changes. This change is applicable and portable to a > wide range of Hive versions from branch-0.6 onward when the initial method > implementations were added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26685) Improve Path name escaping / unescaping performance
[ https://issues.apache.org/jira/browse/HIVE-26685?focusedWorklogId=831137=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831137 ] ASF GitHub Bot logged work on HIVE-26685: - Author: ASF GitHub Bot Created on: 05/Dec/22 18:04 Start Date: 05/Dec/22 18:04 Worklog Time Spent: 10m Work Description: weiatwork merged PR #3721: URL: https://github.com/apache/hive/pull/3721 Issue Time Tracking --- Worklog Id: (was: 831137) Time Spent: 1h 10m (was: 1h) > Improve Path name escaping / unescaping performance > --- > > Key: HIVE-26685 > URL: https://issues.apache.org/jira/browse/HIVE-26685 > Project: Hive > Issue Type: Improvement > Components: Hive >Affects Versions: All Versions >Reporter: James Petty >Priority: Minor > Labels: pull-request-available > Attachments: HIVE-26685.1.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > When escaping / unescaping partition path part names, the existing logic > incurs significant avoidable overhead by copying each character sequentially > into a new StringBuilder even when no escaping/unescaping is necessary as > well as using String.format to escape characters inside of the inner loop. > > The included patch to improve the performance of these operations refactors > two static method implementations, but requires no external API surface or > user-visible behavior changes. This change is applicable and portable to a > wide range of Hive versions from branch-0.6 onward when the initial method > implementations were added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26685) Improve Path name escaping / unescaping performance
[ https://issues.apache.org/jira/browse/HIVE-26685?focusedWorklogId=831129=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831129 ] ASF GitHub Bot logged work on HIVE-26685: - Author: ASF GitHub Bot Created on: 05/Dec/22 17:34 Start Date: 05/Dec/22 17:34 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on PR #3721: URL: https://github.com/apache/hive/pull/3721#issuecomment-1337801725 @weiatwork you should follow something like [this](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=142642065) to link your asf/github accounts together; after you have access you should see this group: https://github.com/orgs/apache/teams/hive-committers Issue Time Tracking --- Worklog Id: (was: 831129) Time Spent: 50m (was: 40m) > Improve Path name escaping / unescaping performance > --- > > Key: HIVE-26685 > URL: https://issues.apache.org/jira/browse/HIVE-26685 > Project: Hive > Issue Type: Improvement > Components: Hive >Affects Versions: All Versions >Reporter: James Petty >Priority: Minor > Labels: pull-request-available > Attachments: HIVE-26685.1.patch > > Time Spent: 50m > Remaining Estimate: 0h > > When escaping / unescaping partition path part names, the existing logic > incurs significant avoidable overhead by copying each character sequentially > into a new StringBuilder even when no escaping/unescaping is necessary as > well as using String.format to escape characters inside of the inner loop. > > The included patch to improve the performance of these operations refactors > two static method implementations, but requires no external API surface or > user-visible behavior changes. This change is applicable and portable to a > wide range of Hive versions from branch-0.6 onward when the initial method > implementations were added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26685) Improve Path name escaping / unescaping performance
[ https://issues.apache.org/jira/browse/HIVE-26685?focusedWorklogId=831125=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831125 ] ASF GitHub Bot logged work on HIVE-26685: - Author: ASF GitHub Bot Created on: 05/Dec/22 17:22 Start Date: 05/Dec/22 17:22 Worklog Time Spent: 10m Work Description: weiatwork commented on PR #3721: URL: https://github.com/apache/hive/pull/3721#issuecomment-1337775392 @kgyrtkirk I don't seem to have write access on Github, although I can still push to the ASF Git repo directly I believe. Anyway for me to get write access here (as a committer, so that I can merge people's PRs)? Issue Time Tracking --- Worklog Id: (was: 831125) Time Spent: 40m (was: 0.5h) > Improve Path name escaping / unescaping performance > --- > > Key: HIVE-26685 > URL: https://issues.apache.org/jira/browse/HIVE-26685 > Project: Hive > Issue Type: Improvement > Components: Hive >Affects Versions: All Versions >Reporter: James Petty >Priority: Minor > Labels: pull-request-available > Attachments: HIVE-26685.1.patch > > Time Spent: 40m > Remaining Estimate: 0h > > When escaping / unescaping partition path part names, the existing logic > incurs significant avoidable overhead by copying each character sequentially > into a new StringBuilder even when no escaping/unescaping is necessary as > well as using String.format to escape characters inside of the inner loop. > > The included patch to improve the performance of these operations refactors > two static method implementations, but requires no external API surface or > user-visible behavior changes. This change is applicable and portable to a > wide range of Hive versions from branch-0.6 onward when the initial method > implementations were added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (HIVE-26806) Precommit tests in CI are timing out after HIVE-26796
[ https://issues.apache.org/jira/browse/HIVE-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643470#comment-17643470 ] Alessandro Solimando edited comment on HIVE-26806 at 12/5/22 5:19 PM: -- It looks that deleting all green past runs did not fix for [https://github.com/apache/hive/pull/3137]. That's a big deal since the PR is huge and review is in progress, I don't think I can close and re-open it. Is there a way to tweak timeout for that PR alone [~zabetak]? EDIT: there is, I am using "Replay" in Jenkins so I can change the JenkinsFile for the given run without any change in Git, hopefully that will do the trick. was (Author: asolimando): It looks that deleting all green past runs did not fix for [https://github.com/apache/hive/pull/3137]. That's a big deal since the PR is huge and review is in progress, I don't think I can close and re-open it. Is there a way to tweak timeout for that PR alone [~zabetak]? > Precommit tests in CI are timing out after HIVE-26796 > - > > Key: HIVE-26806 > URL: https://issues.apache.org/jira/browse/HIVE-26806 > Project: Hive > Issue Type: Bug > Components: Testing Infrastructure >Reporter: Stamatis Zampetakis >Assignee: Stamatis Zampetakis >Priority: Major > > http://ci.hive.apache.org/job/hive-precommit/job/master/1506/ > {noformat} > ancelling nested steps due to timeout > 15:22:08 Sending interrupt signal to process > 15:22:08 Killing processes > 15:22:09 kill finished with exit code 0 > 15:22:19 Terminated > 15:22:19 script returned exit code 143 > [Pipeline] } > [Pipeline] // withEnv > [Pipeline] } > 15:22:19 Deleting 1 temporary files > [Pipeline] // configFileProvider > [Pipeline] } > [Pipeline] // stage > [Pipeline] stage > [Pipeline] { (PostProcess) > [Pipeline] sh > [Pipeline] sh > [Pipeline] sh > [Pipeline] junit > 15:22:25 Recording test results > 15:22:32 [Checks API] No suitable checks publisher found. 
> [Pipeline] } > [Pipeline] // stage > [Pipeline] } > [Pipeline] // container > [Pipeline] } > [Pipeline] // node > [Pipeline] } > [Pipeline] // timeout > [Pipeline] } > [Pipeline] // podTemplate > [Pipeline] } > 15:22:32 Failed in branch split-01 > [Pipeline] // parallel > [Pipeline] } > [Pipeline] // stage > [Pipeline] stage > [Pipeline] { (Archive) > [Pipeline] podTemplate > [Pipeline] { > [Pipeline] timeout > 15:22:33 Timeout set to expire in 6 hr 0 min > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HIVE-26806) Precommit tests in CI are timing out after HIVE-26796
[ https://issues.apache.org/jira/browse/HIVE-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643470#comment-17643470 ] Alessandro Solimando commented on HIVE-26806: - It looks that deleting all green past runs did not fix for [https://github.com/apache/hive/pull/3137]. That's a big deal since the PR is huge and review is in progress, I don't think I can close and re-open it. Is there a way to tweak timeout for that PR alone [~zabetak]? > Precommit tests in CI are timing out after HIVE-26796 > - > > Key: HIVE-26806 > URL: https://issues.apache.org/jira/browse/HIVE-26806 > Project: Hive > Issue Type: Bug > Components: Testing Infrastructure >Reporter: Stamatis Zampetakis >Assignee: Stamatis Zampetakis >Priority: Major > > http://ci.hive.apache.org/job/hive-precommit/job/master/1506/ > {noformat} > ancelling nested steps due to timeout > 15:22:08 Sending interrupt signal to process > 15:22:08 Killing processes > 15:22:09 kill finished with exit code 0 > 15:22:19 Terminated > 15:22:19 script returned exit code 143 > [Pipeline] } > [Pipeline] // withEnv > [Pipeline] } > 15:22:19 Deleting 1 temporary files > [Pipeline] // configFileProvider > [Pipeline] } > [Pipeline] // stage > [Pipeline] stage > [Pipeline] { (PostProcess) > [Pipeline] sh > [Pipeline] sh > [Pipeline] sh > [Pipeline] junit > 15:22:25 Recording test results > 15:22:32 [Checks API] No suitable checks publisher found. > [Pipeline] } > [Pipeline] // stage > [Pipeline] } > [Pipeline] // container > [Pipeline] } > [Pipeline] // node > [Pipeline] } > [Pipeline] // timeout > [Pipeline] } > [Pipeline] // podTemplate > [Pipeline] } > 15:22:32 Failed in branch split-01 > [Pipeline] // parallel > [Pipeline] } > [Pipeline] // stage > [Pipeline] stage > [Pipeline] { (Archive) > [Pipeline] podTemplate > [Pipeline] { > [Pipeline] timeout > 15:22:33 Timeout set to expire in 6 hr 0 min > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26683) Sum over window produces 0 when row contains null
[ https://issues.apache.org/jira/browse/HIVE-26683?focusedWorklogId=831115=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831115 ] ASF GitHub Bot logged work on HIVE-26683: - Author: ASF GitHub Bot Created on: 05/Dec/22 16:58 Start Date: 05/Dec/22 16:58 Worklog Time Spent: 10m Work Description: ramesh0201 merged PR #3800: URL: https://github.com/apache/hive/pull/3800 Issue Time Tracking --- Worklog Id: (was: 831115) Time Spent: 2h 10m (was: 2h) > Sum over window produces 0 when row contains null > - > > Key: HIVE-26683 > URL: https://issues.apache.org/jira/browse/HIVE-26683 > Project: Hive > Issue Type: Bug > Components: HiveServer2 >Reporter: Steve Carlin >Assignee: Steve Carlin >Priority: Major > Labels: pull-request-available > Time Spent: 2h 10m > Remaining Estimate: 0h > > Ran the following sql: > > {code:java} > create table sum_window_test_small (id int, tinyint_col tinyint); > insert into sum_window_test_small values (5,5), (10, NULL), (11,1); > select id, > tinyint_col, > sum(tinyint_col) over (order by id nulls last rows between 1 following and 1 > following) > from sum_window_test_small order by id; > select id, > tinyint_col, > sum(tinyint_col) over (order by id nulls last rows between current row and 1 > following) > from sum_window_test_small order by id; > {code} > The result is > {code:java} > +-+--+---+ > | id | tinyint_col | sum_window_0 | > +-+--+---+ > | 5 | 5 | 0 | > | 10 | NULL | 1 | > | 11 | 1 | NULL | > +-+--+---+{code} > The first row should have the sum as NULL > -- This message was sent by Atlassian Jira (v8.20.10#820010)
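The expected NULL semantics from the ticket can be mirrored outside Hive (illustrative Python, with None standing in for SQL NULL): a SUM over a window frame whose values are all NULL must be NULL, never 0.

```python
# Mirror of the expected windowed-SUM semantics: an all-NULL frame yields
# NULL (None), matching the ticket's expectation for the first row.

def windowed_sum(values, lead_lo, lead_hi):
    """values are ordered by the window's ORDER BY; the frame is ROWS BETWEEN
    lead_lo FOLLOWING AND lead_hi FOLLOWING (0 = CURRENT ROW)."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i + lead_lo): i + lead_hi + 1]
        non_null = [v for v in frame if v is not None]
        out.append(sum(non_null) if non_null else None)  # all-NULL -> NULL
    return out

# The repro above: tinyint_col = [5, None, 1] ordered by id = [5, 10, 11].
first = windowed_sum([5, None, 1], 1, 1)   # 1 FOLLOWING .. 1 FOLLOWING
second = windowed_sum([5, None, 1], 0, 1)  # CURRENT ROW .. 1 FOLLOWING
# first == [None, 1, None]: row id=5 sees only the NULL row, so NULL, not 0.
```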
[jira] [Work logged] (HIVE-26762) Remove operand pruning in HiveFilterSetOpTransposeRule
[ https://issues.apache.org/jira/browse/HIVE-26762?focusedWorklogId=831109&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831109 ] ASF GitHub Bot logged work on HIVE-26762: - Author: ASF GitHub Bot Created on: 05/Dec/22 16:46 Start Date: 05/Dec/22 16:46 Worklog Time Spent: 10m Work Description: asolimando commented on PR #3825: URL: https://github.com/apache/hive/pull/3825#issuecomment-1337709244 @kasakrisz, tests are green, can we merge this? Issue Time Tracking --- Worklog Id: (was: 831109) Time Spent: 1h (was: 50m) > Remove operand pruning in HiveFilterSetOpTransposeRule > -- > > Key: HIVE-26762 > URL: https://issues.apache.org/jira/browse/HIVE-26762 > Project: Hive > Issue Type: Task > Components: CBO, Query Planning >Affects Versions: 4.0.0-alpha-2 >Reporter: Alessandro Solimando >Assignee: Alessandro Solimando >Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > HiveFilterSetOpTransposeRule, when applied to UNION ALL operands, checks if > the newly pushed filter simplifies to FALSE (due to the predicates holding on > the input). > If this is true and there is more than one UNION ALL operand, the operand gets pruned. > After HIVE-26524 ("Use Calcite to remove sections of a query plan known never > produces rows"), this is possibly redundant and we could drop this feature > and let the other rules take care of the pruning. > In such a case, it might even be possible to drop the Hive-specific rule and > rely on the Calcite one (the difference is just the operand pruning at the > moment of writing), similarly to what HIVE-26642 did for > HiveReduceExpressionRule. Writing it here as a reminder, but it's recommended > to tackle this in a separate ticket after verifying that it is feasible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics
[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=831063&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831063 ] ASF GitHub Bot logged work on HIVE-26221: - Author: ASF GitHub Bot Created on: 05/Dec/22 15:10 Start Date: 05/Dec/22 15:10 Worklog Time Spent: 10m Work Description: sonarcloud[bot] commented on PR #3137: URL: https://github.com/apache/hive/pull/3137#issuecomment-1337544903 Kudos, SonarCloud Quality Gate passed! 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 38 Code Smells, no coverage or duplication information. Issue Time Tracking --- Worklog Id: (was: 831063) Time Spent: 3h 50m (was: 3h 40m) > Add histogram-based column statistics > - > > Key: HIVE-26221 > URL: https://issues.apache.org/jira/browse/HIVE-26221 > Project: Hive > Issue Type: Improvement > Components: CBO, Metastore, Statistics >Affects Versions: 4.0.0-alpha-2 >Reporter: Alessandro Solimando >Assignee: Alessandro Solimando >Priority: Major > Labels: pull-request-available > Time Spent: 3h 50m > Remaining Estimate: 0h > > Hive does not support histogram statistics, which are particularly useful for > skewed data (which is very common in practice) and range predicates. 
> Hive's current selectivity estimation for range predicates is based on a > hard-coded value of 1/3 (see > [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]). > The current proposal aims at integrating histograms as an additional column > statistic, stored in the Hive metastore at the table (or partition) level. > The main requirements for histogram integration are the following: > * efficiency: the approach must scale and support billions of rows > * merge-ability: partition-level histograms have to be merged to form > table-level histograms > * explicit and configurable
[jira] [Work started] (HIVE-26808) Port Iceberg catalog changes
[ https://issues.apache.org/jira/browse/HIVE-26808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HIVE-26808 started by Zsolt Miskolczi. -- > Port Iceberg catalog changes > > > Key: HIVE-26808 > URL: https://issues.apache.org/jira/browse/HIVE-26808 > Project: Hive > Issue Type: Improvement > Components: Iceberg integration >Reporter: Zsolt Miskolczi >Assignee: Zsolt Miskolczi >Priority: Major > > The last round of porting happened in April 2022; since then there have been a couple of > changes, especially in HiveTableOperations, worth porting into iceberg-catalog. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HIVE-26808) Port Iceberg catalog changes
[ https://issues.apache.org/jira/browse/HIVE-26808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zsolt Miskolczi reassigned HIVE-26808: -- Assignee: Zsolt Miskolczi > Port Iceberg catalog changes > > > Key: HIVE-26808 > URL: https://issues.apache.org/jira/browse/HIVE-26808 > Project: Hive > Issue Type: Improvement > Components: Iceberg integration >Reporter: Zsolt Miskolczi >Assignee: Zsolt Miskolczi >Priority: Major > > The last round of porting happened in April 2022; since then there have been a couple of > changes, especially in HiveTableOperations, worth porting into iceberg-catalog. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26762) Remove operand pruning in HiveFilterSetOpTransposeRule
[ https://issues.apache.org/jira/browse/HIVE-26762?focusedWorklogId=831021&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831021 ] ASF GitHub Bot logged work on HIVE-26762: - Author: ASF GitHub Bot Created on: 05/Dec/22 13:50 Start Date: 05/Dec/22 13:50 Worklog Time Spent: 10m Work Description: sonarcloud[bot] commented on PR #3825: URL: https://github.com/apache/hive/pull/3825#issuecomment-1337399540 Kudos, SonarCloud Quality Gate passed! 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 0 Code Smells, no coverage or duplication information. Issue Time Tracking --- Worklog Id: (was: 831021) Time Spent: 50m (was: 40m) > Remove operand pruning in HiveFilterSetOpTransposeRule > -- > > Key: HIVE-26762 > URL: https://issues.apache.org/jira/browse/HIVE-26762 > Project: Hive > Issue Type: Task > Components: CBO, Query Planning >Affects Versions: 4.0.0-alpha-2 >Reporter: Alessandro Solimando >Assignee: Alessandro Solimando >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > HiveFilterSetOpTransposeRule, when applied to UNION ALL operands, checks if > the newly pushed filter simplifies to FALSE (due to the predicates holding on > the input). > If this is true and there is more than one UNION ALL operand, the operand gets pruned. > After HIVE-26524 ("Use Calcite to remove sections of a query plan known never > produces rows"), this is possibly redundant and we could drop this feature > and let the other rules take care of the pruning. > In such a case, it might even be possible to drop the Hive-specific rule and > rely on the Calcite one (the difference is just the operand pruning at the > moment of writing), similarly to what HIVE-26642 did for > HiveReduceExpressionRule. Writing it here as a reminder, but it's recommended > to tackle this in a separate ticket after verifying that it is feasible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-22589) Add storage support for ProlepticCalendar in ORC, Parquet, and Avro
[ https://issues.apache.org/jira/browse/HIVE-22589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mengkai Liu updated HIVE-22589: --- Description: Hive recently moved its processing to the proleptic calendar, which has created some issues for users who have dates before 1580 AD. HIVE-22405 extended the column vectors for times & dates to encode which calendar they are using. This issue is to support proleptic calendar in ORC, Parquet, and Avro, when files are written/read by Hive. To preserve compatibility with other engines until they upgrade their readers, files will be written using hybrid calendar by default. Default behavior when files do not contain calendar information in their metadata is configurable. was: Hive recently moved its processing to the proleptic calendar, which has created some issues for users who have dates before 1580 AD. HIVE-22405 extended the column vectors for times & dates to encode which calendar they are using. This issue is to support proleptic calendar in ORC, Parquet, and Avro, when files are written/read by Hive. To preserve compatibility with other engines until they upgrade their readers, files will be written using hybrid calendar by default. Default behavior when files do not contain calendar information in their metadata is configurable. 
> Add storage support for ProlepticCalendar in ORC, Parquet, and Avro > --- > > Key: HIVE-22589 > URL: https://issues.apache.org/jira/browse/HIVE-22589 > Project: Hive > Issue Type: Bug > Components: Avro, ORC, Parquet >Reporter: Jesus Camacho Rodriguez >Assignee: Jesus Camacho Rodriguez >Priority: Major > Labels: compatibility, datetime > Fix For: 4.0.0-alpha-1 > > Attachments: HIVE-22589.01.patch, HIVE-22589.02.patch, > HIVE-22589.03.patch, HIVE-22589.04.patch, HIVE-22589.05.patch, > HIVE-22589.06.patch, HIVE-22589.07.patch, HIVE-22589.07.patch, > HIVE-22589.07.patch, HIVE-22589.07.patch, HIVE-22589.08.patch, > HIVE-22589.08.patch, HIVE-22589.patch, HIVE-22589.patch > > > Hive recently moved its processing to the proleptic calendar, which has > created some issues for users who have dates before 1580 AD. > HIVE-22405 extended the column vectors for times & dates to encode which > calendar they are using. > This issue is to support proleptic calendar in ORC, Parquet, and Avro, when > files are written/read by Hive. To preserve compatibility with other engines > until they upgrade their readers, files will be written using hybrid calendar > by default. Default behavior when files do not contain calendar information > in their metadata is configurable. -- This message was sent by Atlassian Jira (v8.20.10#820010)
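To make the hybrid-vs-proleptic gap concrete: the hybrid calendar reads pre-1582 dates with Julian rules, so the same date string can name a different physical day than a proleptic-Gregorian reader would pick. A small Python sketch using the standard Julian-day-number formulas (illustrative only, not Hive's converter code):

```python
def jdn_julian(y, m, d):
    # Julian day number of a date interpreted with Julian-calendar rules
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - 32083

def jdn_gregorian(y, m, d):
    # Julian day number of the same date under proleptic-Gregorian rules
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return (d + (153 * mm + 2) // 5 + 365 * yy
            + yy // 4 - yy // 100 + yy // 400 - 32045)

# A file storing "1500-03-01" under hybrid (Julian) rules names a day
# 10 days away from a proleptic-Gregorian reading of the same string:
print(jdn_julian(1500, 3, 1) - jdn_gregorian(1500, 3, 1))  # 10
# At the 1582 cutover, Julian Oct 4 and Gregorian Oct 15 are consecutive:
print(jdn_gregorian(1582, 10, 15) - jdn_julian(1582, 10, 4))  # 1
```

This 10-day (and growing, further back in time) shift is exactly why files written with one calendar must not be silently read with the other.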
[jira] [Work logged] (HIVE-26794) Explore changing TxnHandler#connPoolMutex to NoPoolConnectionPool
[ https://issues.apache.org/jira/browse/HIVE-26794?focusedWorklogId=831019&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831019 ] ASF GitHub Bot logged work on HIVE-26794: - Author: ASF GitHub Bot Created on: 05/Dec/22 13:44 Start Date: 05/Dec/22 13:44 Worklog Time Spent: 10m Work Description: sonarcloud[bot] commented on PR #3817: URL: https://github.com/apache/hive/pull/3817#issuecomment-1337381330 Kudos, SonarCloud Quality Gate passed! 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 2 Code Smells, no coverage or duplication information. Issue Time Tracking --- Worklog Id: (was: 831019) Time Spent: 1h 50m (was: 1h 40m) > Explore changing TxnHandler#connPoolMutex to NoPoolConnectionPool > - > > Key: HIVE-26794 > URL: https://issues.apache.org/jira/browse/HIVE-26794 > Project: Hive > Issue Type: Improvement > Components: Standalone Metastore >Reporter: Zhihua Deng >Priority: Major > Labels: pull-request-available > Time Spent: 1h 50m > Remaining Estimate: 0h > > Instead of creating a fixed-size connection pool for TxnHandler#MutexAPI, the > pool can be assigned to NoPoolConnectionPool because: > * TxnHandler#MutexAPI is primarily designed to provide coarse-grained mutex > support to maintenance tasks running inside the Metastore; these tasks are > not user-facing; > * A fixed-size connection pool, the same as the pool used in ObjectStore, is a > waste for the other non-leaders in the warehouse. > The NoPoolConnectionPool provides connections on demand, and > TxnHandler#MutexAPI only uses the getConnection method to fetch a connection from > the pool, so it's doable to change the pool to NoPoolConnectionPool; this > would make the HMS more scalable. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics
[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=831012&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831012 ] ASF GitHub Bot logged work on HIVE-26221: - Author: ASF GitHub Bot Created on: 05/Dec/22 13:05 Start Date: 05/Dec/22 13:05 Worklog Time Spent: 10m Work Description: dengzhhu653 commented on code in PR #3137: URL: https://github.com/apache/hive/pull/3137#discussion_r1039572781 ## standalone-metastore/metastore-server/src/main/sql/mysql/hive-schema-4.0.0.mysql.sql: ## @@ -768,6 +769,7 @@ CREATE TABLE IF NOT EXISTS `PART_COL_STATS` ( `NUM_NULLS` bigint(20) NOT NULL, `NUM_DISTINCTS` bigint(20), `BIT_VECTOR` blob, + `HISTOGRAM` blob, Review Comment: Should this column `HISTOGRAM` also be placed at the end? Issue Time Tracking --- Worklog Id: (was: 831012) Time Spent: 3h 40m (was: 3.5h) > Add histogram-based column statistics > - > > Key: HIVE-26221 > URL: https://issues.apache.org/jira/browse/HIVE-26221 > Project: Hive > Issue Type: Improvement > Components: CBO, Metastore, Statistics >Affects Versions: 4.0.0-alpha-2 >Reporter: Alessandro Solimando >Assignee: Alessandro Solimando >Priority: Major > Labels: pull-request-available > Time Spent: 3h 40m > Remaining Estimate: 0h > > Hive does not support histogram statistics, which are particularly useful for > skewed data (which is very common in practice) and range predicates. > Hive's current selectivity estimation for range predicates is based on a > hard-coded value of 1/3 (see > [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]). > The current proposal aims at integrating histograms as an additional column > statistic, stored in the Hive metastore at the table (or partition) level. 
> The main requirements for histogram integration are the following: > * efficiency: the approach must scale and support billions of rows > * merge-ability: partition-level histograms have to be merged to form > table-level histograms > * explicit and configurable trade-off between memory footprint and accuracy > Hive already integrates [KLL data > sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF. > Datasketches are small, stateful programs that process massive data-streams > and can provide approximate answers, with mathematical guarantees, to > computationally difficult queries orders-of-magnitude faster than > traditional, exact methods. > We propose to use KLL, and more specifically the cumulative distribution > function (CDF), as the underlying data structure for our histogram statistics. > The current proposal targets numeric data types (float, integer and numeric > families) and temporal data types (date and timestamp). -- This message was sent by Atlassian Jira (v8.20.10#820010)
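As a rough illustration of why a CDF beats the hard-coded 1/3: given any sample (or sketch) of a column, the selectivity of a range predicate is just the difference of two CDF evaluations. The sketch below approximates the CDF from a plain sorted sample in pure Python; a real KLL sketch answers the same query with bounded error and far less memory. Names here are illustrative, not Hive's or DataSketches' API:

```python
import bisect
import random

def cdf_estimate(sorted_sample, x):
    """Approximate CDF F(x) = P(col <= x) from a sorted sample --
    the quantity a KLL sketch exposes; here it is simply rank/size."""
    return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

def range_selectivity(sorted_sample, lo, hi):
    # Selectivity of (lo < col <= hi) as F(hi) - F(lo), replacing the
    # hard-coded 1/3 guess with a data-driven estimate.
    return cdf_estimate(sorted_sample, hi) - cdf_estimate(sorted_sample, lo)

random.seed(0)
data = sorted(random.uniform(0, 100) for _ in range(10_000))
print(range_selectivity(data, 25, 75))  # close to 0.5 for uniform data
```

For skewed data the two CDF evaluations track the actual mass in the range, which is precisely where the fixed 1/3 estimate goes wrong.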
[jira] [Work logged] (HIVE-26692) Check for the expected thrift version before compiling
[ https://issues.apache.org/jira/browse/HIVE-26692?focusedWorklogId=831007&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831007 ] ASF GitHub Bot logged work on HIVE-26692: - Author: ASF GitHub Bot Created on: 05/Dec/22 12:39 Start Date: 05/Dec/22 12:39 Worklog Time Spent: 10m Work Description: sonarcloud[bot] commented on PR #3820: URL: https://github.com/apache/hive/pull/3820#issuecomment-1337272737 Kudos, SonarCloud Quality Gate passed! 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 0 Code Smells, no coverage or duplication information. Issue Time Tracking --- Worklog Id: (was: 831007) Time Spent: 2.5h (was: 2h 20m) > Check for the expected thrift version before compiling > -- > > Key: HIVE-26692 > URL: https://issues.apache.org/jira/browse/HIVE-26692 > Project: Hive > Issue Type: Task > Components: Thrift API >Affects Versions: 4.0.0-alpha-2 >Reporter: Alessandro Solimando >Assignee: Alessandro Solimando >Priority: Major > Labels: pull-request-available > Time Spent: 2.5h > Remaining Estimate: 0h > > At the moment we don't check for the thrift version before launching thrift, > the error messages are often cryptic upon mismatches. 
> An explicit check with a clear error message would be nice, like what parquet > does: > [https://github.com/apache/parquet-mr/blob/master/parquet-thrift/pom.xml#L247-L268] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HIVE-26807) Investigate test running times before/after Zookeeper upgrade to 3.6.3
[ https://issues.apache.org/jira/browse/HIVE-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643313#comment-17643313 ] Stamatis Zampetakis commented on HIVE-26807: First of all, I extracted the test results into CSV files with the following structure (testname@classname@time). {noformat} zgrep -a " /tmp/master-1514.csv zgrep -a " /tmp/master-1495.csv {noformat} To facilitate the analysis, I imported the CSV files into Postgres tables. {code:sql} CREATE TABLE master_1514 (testname VARCHAR, classname VARCHAR, time DECIMAL); CREATE TABLE master_1495 (testname VARCHAR, classname VARCHAR, time DECIMAL); COPY master_1514 FROM '/tmp/master-1514.csv' WITH DELIMITER '@'; COPY master_1495 FROM '/tmp/master-1495.csv' WITH DELIMITER '@'; {code} The combination of (testname, classname) is not unique due to parameterized tests, so we need a way to distinguish duplicate tests if we want to perform joins. The trick is to use the ROW_NUMBER window function and assign incrementing integers to seemingly duplicate tests; it is not 100% precise but satisfactory for our needs. {code:sql} SELECT testname, classname, time, ROW_NUMBER() OVER (PARTITION BY testname, classname ORDER BY time) as rnum FROM master_1514 {code} I used the following query to get an overview of the situation before and after the upgrade. 
{code:sql} SELECT COUNT(*), MAX(diff), MIN(diff), AVG(diff), sum(ntime)/60/60 as total_hours_1514, sum(otime)/60/60 as total_hours_1495 FROM (SELECT n.testname, n.classname, n.time as ntime, o.time as otime, n.time-o.time as diff FROM (SELECT testname, classname, time, ROW_NUMBER() OVER (PARTITION BY testname, classname ORDER BY time) as rnum FROM master_1514) n INNER JOIN (SELECT testname, classname, time, ROW_NUMBER() OVER (PARTITION BY testname, classname ORDER BY time) as rnum FROM master_1495) o ON n.testname=o.testname AND n.classname = o.classname AND n.rnum = o.rnum) compare {code} {noformat} count | max | min | avg | total_hours_1514 | total_hours_1495 -------+---------+---------+------------------------+------------------+------------------ 47530 | 130.627 | -58.070 | 0.14675221965074689670 | 25.43901639 | 23.50147944 {noformat} Observe that the total duration of the tests has increased by 8% (cumulatively ~2h), which is noticeable but maybe not problematic at this stage. The tests are running in parallel splits, so the general slowdown per split is in the order of a few minutes. Moreover, there are tests that are much slower (see max) but also tests that are much faster (see min), so there is nothing justifying a revert of the Zookeeper upgrade. Nevertheless, it may be interesting to investigate further the tests that became much slower to see if there is anything that could be done to save some CI resources. I used the following query to find the 1000 tests that were seemingly affected the most after the upgrade. 
{code:sql} COPY ( SELECT n.testname, n.classname, n.time as B_1514, o.time as B_1495, n.time-o.time as diff FROM (SELECT testname, classname, time, ROW_NUMBER() OVER (PARTITION BY testname, classname ORDER BY time) as rnum FROM master_1514) n INNER JOIN (SELECT testname, classname, time, ROW_NUMBER() OVER (PARTITION BY testname, classname ORDER BY time) as rnum FROM master_1495) o ON n.testname=o.testname AND n.classname = o.classname AND n.rnum = o.rnum ORDER BY diff DESC LIMIT 1000) TO '/tmp/testtimes-diff-1514-1495.csv' WITH DELIMITER '@'; {code} The results are attached in [^diff-1514-1495.csv]. > Investigate test running times before/after Zookeeper upgrade to 3.6.3 > -- > > Key: HIVE-26807 > URL: https://issues.apache.org/jira/browse/HIVE-26807 > Project: Hive > Issue Type: Task > Components: Testing Infrastructure >Reporter: Stamatis Zampetakis >Assignee: Stamatis Zampetakis >Priority: Major > Attachments: diff-1514-1495.csv, test-results-1495.tgz, > test-results-1514.tgz > > > During the investigation of the CI timing out (HIVE-26806) there were some > concerns that the Zookeeper (HIVE-26763) upgrade caused some significant > slowdown. > The goal of this issue is to analyse the test results from the following > builds: > * [Build-1495|http://ci.hive.apache.org/job/hive-precommit/job/master/1495/], > commit just before the Zookeeper upgrade; > * > [Build-1514|http://ci.hive.apache.org/job/hive-precommit/job/master/1514/], > commit after the Zookeeper upgrade with skipped tests (HIVE-26796) and CI > timeouts (HIVE-26806) fixed; > and reason about the impact of the Zookeeper upgrade on test execution. -- This message was sent by Atlassian Jira
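The ROW_NUMBER dedup-join trick above can be reproduced with Python's stdlib sqlite3 (it needs SQLite >= 3.25 for window functions); table and column names below are illustrative, not the actual CI data:

```python
import sqlite3

# Duplicate (testname, classname) pairs from parameterized tests get
# distinct rnum values within each partition, so the self-join between
# the two runs stays one-to-one.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE new_run (testname TEXT, classname TEXT, time REAL);
CREATE TABLE old_run (testname TEXT, classname TEXT, time REAL);
INSERT INTO new_run VALUES ('t1','C',1.0), ('t1','C',3.0), ('t2','C',2.0);
INSERT INTO old_run VALUES ('t1','C',0.5), ('t1','C',2.5), ('t2','C',2.0);
""")
rows = con.execute("""
SELECT n.testname, n.time - o.time AS diff
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY testname, classname
                                   ORDER BY time) AS rnum FROM new_run) n
JOIN (SELECT *, ROW_NUMBER() OVER (PARTITION BY testname, classname
                                   ORDER BY time) AS rnum FROM old_run) o
  ON n.testname = o.testname AND n.classname = o.classname
 AND n.rnum = o.rnum
ORDER BY diff DESC
""").fetchall()
print(rows)  # [('t1', 0.5), ('t1', 0.5), ('t2', 0.0)]
```

Without the rnum equality the join would produce 2x2 = 4 rows for the duplicated test, double-counting the slowdown.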
[jira] [Updated] (HIVE-26807) Investigate test running times before/after Zookeeper upgrade to 3.6.3
[ https://issues.apache.org/jira/browse/HIVE-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stamatis Zampetakis updated HIVE-26807: --- Attachment: diff-1514-1495.csv -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HIVE-26807) Investigate test running times before/after Zookeeper upgrade to 3.6.3
[ https://issues.apache.org/jira/browse/HIVE-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643300#comment-17643300 ] Stamatis Zampetakis commented on HIVE-26807: I uploaded the test results from the 1495 and 1514 builds to the JIRA in case the results of the builds are not available in the future. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-26807) Investigate test running times before/after Zookeeper upgrade to 3.6.3
[ https://issues.apache.org/jira/browse/HIVE-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stamatis Zampetakis updated HIVE-26807: --- Attachment: test-results-1495.tgz -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-26807) Investigate test running times before/after Zookeeper upgrade to 3.6.3
[ https://issues.apache.org/jira/browse/HIVE-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stamatis Zampetakis updated HIVE-26807: --- Attachment: test-results-1514.tgz -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HIVE-26807) Investigate test running times before/after Zookeeper upgrade to 3.6.3
[ https://issues.apache.org/jira/browse/HIVE-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stamatis Zampetakis reassigned HIVE-26807: -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26578) Enable Iceberg storage format for materialized views
[ https://issues.apache.org/jira/browse/HIVE-26578?focusedWorklogId=830992=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830992 ] ASF GitHub Bot logged work on HIVE-26578: - Author: ASF GitHub Bot Created on: 05/Dec/22 10:58 Start Date: 05/Dec/22 10:58 Worklog Time Spent: 10m Work Description: sonarcloud[bot] commented on PR #3823: URL: https://github.com/apache/hive/pull/3823#issuecomment-1337134994 Kudos, SonarCloud Quality Gate passed! 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 0 Code Smells, no coverage information, no duplication information. Issue Time Tracking --- Worklog Id: (was: 830992) Time Spent: 40m (was: 0.5h) > Enable Iceberg storage format for materialized views > > > Key: HIVE-26578 > URL: https://issues.apache.org/jira/browse/HIVE-26578 > Project: Hive > Issue Type: Improvement > Components: Materialized views >Reporter: Krisztian Kasa >Assignee: Krisztian Kasa >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > {code} > create materialized view mat1 stored by iceberg stored as orc tblproperties > ('format-version'='1') as > select tbl_ice.b, tbl_ice.c from tbl_ice where tbl_ice.c > 52; > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HIVE-26806) Precommit tests in CI are timing out after HIVE-26796
[ https://issues.apache.org/jira/browse/HIVE-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643247#comment-17643247 ] Stamatis Zampetakis commented on HIVE-26806: *Important note:* The test timings from master *are not used* to split tests in PRs. The master branch and PR branches have separate Jenkins jobs so one does not use the other as a reference. The splitting of tests on the first run of a PR (or a PR without a previous successful build) is more or less random. > Precommit tests in CI are timing out after HIVE-26796 > - > > Key: HIVE-26806 > URL: https://issues.apache.org/jira/browse/HIVE-26806 > Project: Hive > Issue Type: Bug > Components: Testing Infrastructure >Reporter: Stamatis Zampetakis >Assignee: Stamatis Zampetakis >Priority: Major > > http://ci.hive.apache.org/job/hive-precommit/job/master/1506/ > {noformat} > ancelling nested steps due to timeout > 15:22:08 Sending interrupt signal to process > 15:22:08 Killing processes > 15:22:09 kill finished with exit code 0 > 15:22:19 Terminated > 15:22:19 script returned exit code 143 > [Pipeline] } > [Pipeline] // withEnv > [Pipeline] } > 15:22:19 Deleting 1 temporary files > [Pipeline] // configFileProvider > [Pipeline] } > [Pipeline] // stage > [Pipeline] stage > [Pipeline] { (PostProcess) > [Pipeline] sh > [Pipeline] sh > [Pipeline] sh > [Pipeline] junit > 15:22:25 Recording test results > 15:22:32 [Checks API] No suitable checks publisher found. > [Pipeline] } > [Pipeline] // stage > [Pipeline] } > [Pipeline] // container > [Pipeline] } > [Pipeline] // node > [Pipeline] } > [Pipeline] // timeout > [Pipeline] } > [Pipeline] // podTemplate > [Pipeline] } > 15:22:32 Failed in branch split-01 > [Pipeline] // parallel > [Pipeline] } > [Pipeline] // stage > [Pipeline] stage > [Pipeline] { (Archive) > [Pipeline] podTemplate > [Pipeline] { > [Pipeline] timeout > 15:22:33 Timeout set to expire in 6 hr 0 min > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HIVE-26806) Precommit tests in CI are timing out after HIVE-26796
[ https://issues.apache.org/jira/browse/HIVE-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643222#comment-17643222 ] Alessandro Solimando commented on HIVE-26806: - Thanks [~zabetak], as you say the issue now affects only existing PRs. I am trying option 2 to see if it works; otherwise I will go for option 1. I will keep you posted here. Setting aside the old affected PRs, I am OK with reducing the timeout to the previous value, since it now works. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics
[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=830981=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830981 ] ASF GitHub Bot logged work on HIVE-26221: - Author: ASF GitHub Bot Created on: 05/Dec/22 10:14 Start Date: 05/Dec/22 10:14 Worklog Time Spent: 10m Work Description: asolimando commented on code in PR #3137: URL: https://github.com/apache/hive/pull/3137#discussion_r1039397581 ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/LongColumnStatsAggregator.java: ## @@ -51,73 +52,94 @@ public ColumnStatisticsObj aggregate(List colStatsWit checkStatisticsList(colStatsWithSourceInfo); ColumnStatisticsObj statsObj = null; -String colType = null; +String colType; String colName = null; // check if all the ColumnStatisticsObjs contain stats and all the ndv are // bitvectors boolean doAllPartitionContainStats = partNames.size() == colStatsWithSourceInfo.size(); NumDistinctValueEstimator ndvEstimator = null; +KllHistogramEstimator histogramEstimator = null; +boolean areAllNDVEstimatorsMergeable = true; +boolean areAllHistogramEstimatorsMergeable = true; for (ColStatsObjWithSourceInfo csp : colStatsWithSourceInfo) { ColumnStatisticsObj cso = csp.getColStatsObj(); if (statsObj == null) { colName = cso.getColName(); colType = cso.getColType(); statsObj = ColumnStatsAggregatorFactory.newColumnStaticsObj(colName, colType, cso.getStatsData().getSetField()); -LOG.trace("doAllPartitionContainStats for column: {} is: {}", colName, -doAllPartitionContainStats); +LOG.trace("doAllPartitionContainStats for column: {} is: {}", colName, doAllPartitionContainStats); } - LongColumnStatsDataInspector longColumnStatsData = longInspectorFromStats(cso); - if (longColumnStatsData.getNdvEstimator() == null) { -ndvEstimator = null; -break; - } else { -// check if all of the bit vectors can merge -NumDistinctValueEstimator estimator = longColumnStatsData.getNdvEstimator(); + 
LongColumnStatsDataInspector columnStatsData = longInspectorFromStats(cso); + + // check if we can merge NDV estimators + if (columnStatsData.getNdvEstimator() == null) { +areAllNDVEstimatorsMergeable = false; + } else if (areAllNDVEstimatorsMergeable) { +NumDistinctValueEstimator estimator = columnStatsData.getNdvEstimator(); if (ndvEstimator == null) { ndvEstimator = estimator; } else { - if (ndvEstimator.canMerge(estimator)) { -continue; - } else { -ndvEstimator = null; -break; + if (!ndvEstimator.canMerge(estimator)) { +areAllNDVEstimatorsMergeable = false; + } +} + } + // check if we can merge histogram estimators + if (columnStatsData.getHistogramEstimator() == null) { Review Comment: You are right, I have double checked and indeed it can be simplified as you suggest: I have added those four lines you cited right before the `return` statement of the `aggregate` method, and it's enough. Issue Time Tracking --- Worklog Id: (was: 830981) Time Spent: 3.5h (was: 3h 20m) > Add histogram-based column statistics > - > > Key: HIVE-26221 > URL: https://issues.apache.org/jira/browse/HIVE-26221 > Project: Hive > Issue Type: Improvement > Components: CBO, Metastore, Statistics >Affects Versions: 4.0.0-alpha-2 >Reporter: Alessandro Solimando >Assignee: Alessandro Solimando >Priority: Major > Labels: pull-request-available > Time Spent: 3.5h > Remaining Estimate: 0h > > Hive does not support histogram statistics, which are particularly useful for > skewed data (which is very common in practice) and range predicates. 
> Hive's current selectivity estimation for range predicates is based on a > hard-coded value of 1/3 (see > [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]). > The current proposal aims at integrating histograms as an additional column > statistic, stored in the Hive metastore at the table (or partition) level. > The main requirements for histogram integration are the following: > * efficiency: the approach must scale and support billions of rows > * merge-ability: partition-level histograms have to be merged to form > table-level histograms > * explicit and configurable trade-off between
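To illustrate why a histogram beats a hard-coded 1/3 selectivity on skewed data, here is a minimal equi-depth histogram sketch. This is an invented toy example, not Hive's (KLL-sketch-based) implementation; all names and data in it are hypothetical:

```python
def equi_depth_bounds(values, num_buckets):
    """Upper bound of each bucket, with ~equal row counts per bucket."""
    values = sorted(values)
    n = len(values)
    return [values[min(n - 1, i * n // num_buckets - 1)]
            for i in range(1, num_buckets + 1)]

def selectivity_less_than(bounds, v):
    """Estimate P(col < v) as the fraction of buckets entirely below v."""
    return sum(1 for b in bounds if b < v) / len(bounds)

# Skewed column: 90% small values, 10% large outliers.
data = [1] * 90 + [1000] * 10
bounds = equi_depth_bounds(data, 10)

est = selectivity_less_than(bounds, 500)
print(est)      # 0.9, matching the true selectivity of col < 500
print(1 / 3)    # the fixed fallback, badly off for this predicate
```

Because each bucket holds roughly the same number of rows, an equi-depth histogram adapts to skew automatically: the nine buckets full of small values translate directly into a 0.9 estimate, where a fixed 1/3 cannot.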
[jira] [Comment Edited] (HIVE-26806) Precommit tests in CI are timing out after HIVE-26796
[ https://issues.apache.org/jira/browse/HIVE-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643213#comment-17643213 ] Stamatis Zampetakis edited comment on HIVE-26806 at 12/5/22 10:13 AM: -- The recent builds on master (1513, 1514) are now back to normal and each split takes at most ~2h. [~asolimando] [~ayushtkn] I am planning to revert the timeout back to 6h by committing directly to master in a few hours. Please speak up if there is any reason not to do this. [~akshatm] The Jenkins plugin that is used to split the tests into buckets uses the last successful build of the job as a guide. Each PR corresponds to a separate Jenkins job (http://ci.hive.apache.org/job/hive-precommit/view/change-requests/). The last successful build for your PR is http://ci.hive.apache.org/job/hive-precommit/job/PR-3803/8/ so this is what will be used to split the tests. This is not good because the successful run has 3K fewer tests than what exists in master, so the splitting will be pretty bad. I see three ways to unblock the current situation and overcome the problem: # Close PR-3803 and open a new one. # Manually delete every successful build for job PR-3803 and start a new one. # Increase the timeout in the Jenkinsfile and try again. None of these is perfect but I have higher hopes for 1 and 2. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HIVE-26806) Precommit tests in CI are timing out after HIVE-26796
[ https://issues.apache.org/jira/browse/HIVE-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643213#comment-17643213 ] Stamatis Zampetakis commented on HIVE-26806: The recent builds on master (1513, 1514) are now back to normal and each split takes at most ~2h. [~asolimando] [~ayushtkn] I am planning to revert the timeout back to 6h by committing directly to master in a few hours. Please speak up if there is any reason not to do this. [~akshatm] The Jenkins plugin that is used to split the tests into buckets uses the last successful build of the job as a guide. Each PR corresponds to a separate Jenkins job (http://ci.hive.apache.org/job/hive-precommit/view/change-requests/). The last successful build for your PR is http://ci.hive.apache.org/job/hive-precommit/job/PR-3803/8/ so this is what will be used to split the tests. This is not good because the successful run has 3K fewer tests than what exists in master, so the splitting will be pretty bad. I see three ways to unblock the current situation and overcome the problem: # Close PR-3803 and open a new one. # Manually delete every successful build for job PR-3803 and start a new one. # Increase the timeout in the Jenkinsfile and try again. None of these is perfect but I have higher hopes for 1 and 2. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26569) LlapTokenRenewer: TezAM (LlapTaskCommunicator) to renew LLAP_TOKENs
[ https://issues.apache.org/jira/browse/HIVE-26569?focusedWorklogId=830980=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830980 ] ASF GitHub Bot logged work on HIVE-26569: - Author: ASF GitHub Bot Created on: 05/Dec/22 10:12 Start Date: 05/Dec/22 10:12 Worklog Time Spent: 10m Work Description: sonarcloud[bot] commented on PR #3626: URL: https://github.com/apache/hive/pull/3626#issuecomment-1337077515 Kudos, SonarCloud Quality Gate passed! 0 Bugs, 0 Vulnerabilities, 0 Security Hotspots, 7 Code Smells, no coverage information, no duplication information. Issue Time Tracking --- Worklog Id: (was: 830980) Time Spent: 1h 20m (was: 1h 10m) > LlapTokenRenewer: TezAM (LlapTaskCommunicator) to renew LLAP_TOKENs > --- > > Key: HIVE-26569 > URL: https://issues.apache.org/jira/browse/HIVE-26569 > Project: Hive > Issue Type: Improvement >Reporter: László Bodor >Assignee: László Bodor >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work logged] (HIVE-26788) Update stats of table/partition after minor compaction using noscan operation
[ https://issues.apache.org/jira/browse/HIVE-26788?focusedWorklogId=830976=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830976 ] ASF GitHub Bot logged work on HIVE-26788: - Author: ASF GitHub Bot Created on: 05/Dec/22 09:58 Start Date: 05/Dec/22 09:58 Worklog Time Spent: 10m Work Description: SourabhBadhya commented on code in PR #3812: URL: https://github.com/apache/hive/pull/3812#discussion_r1039378601 ## ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/StatsUpdater.java: ## @@ -73,6 +69,9 @@ public void gatherStats(CompactionInfo ci, HiveConf conf, String userName, Strin sb.append(")"); } sb.append(" compute statistics"); +if (ci.isMinorCompaction()) { +sb.append(" noscan"); Review Comment: Minor compaction is not expected to compact too many files, so in most scenarios only the number of files changes after a minor compaction. Large updates like major compaction, on the other hand, need to update all statistics (since they happen only once in a while) to keep the metadata up to date. Therefore the idea was to do a fast update of statistics on a minor compaction & do a complete update in the case of a major compaction. Issue Time Tracking --- Worklog Id: (was: 830976) Time Spent: 1h (was: 50m) > Update stats of table/partition after minor compaction using noscan operation > - > > Key: HIVE-26788 > URL: https://issues.apache.org/jira/browse/HIVE-26788 > Project: Hive > Issue Type: Improvement >Reporter: Sourabh Badhya >Assignee: Sourabh Badhya >Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > Currently, statistics are not updated for minor compaction since minor > compaction performs few updates on the statistics (such as number of files > in table/partition & total size of the table/partition). 
It is better to > utilize the NOSCAN operation for minor compaction since NOSCAN operations > perform a faster update of statistics and update the relevant fields, such as > the number of files & total sizes of the table/partitions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
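The statement assembly shown in the review diff above (" compute statistics" always, " noscan" appended only for minor compactions) can be sketched as follows. The function is a hypothetical Python mock of what StatsUpdater builds, not the actual Hive code:

```python
def build_stats_query(table, partition=None, minor_compaction=False):
    # Mirrors the shape in the review diff: " compute statistics" always,
    # " noscan" only when the compaction is minor.
    parts = [f"analyze table {table}"]
    if partition is not None:
        parts.append(f" partition ({partition})")
    parts.append(" compute statistics")
    if minor_compaction:
        parts.append(" noscan")
    return "".join(parts)

print(build_stats_query("t1", minor_compaction=True))
# analyze table t1 compute statistics noscan
print(build_stats_query("t1", partition="ds='2022-12-05'"))
# analyze table t1 partition (ds='2022-12-05') compute statistics
```

With noscan, the metastore only refreshes cheap file-level statistics (file count and total size) instead of rescanning the data, which is exactly the trade-off the comment above describes for minor compaction.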
[jira] [Commented] (HIVE-14305) To/From UTC timestamp may return incorrect result because of DST
[ https://issues.apache.org/jira/browse/HIVE-14305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643196#comment-17643196 ] David Scarlatti commented on HIVE-14305: it seems solved in Hive3. > To/From UTC timestamp may return incorrect result because of DST > > > Key: HIVE-14305 > URL: https://issues.apache.org/jira/browse/HIVE-14305 > Project: Hive > Issue Type: Sub-task >Reporter: Rui Li >Assignee: Rui Li >Priority: Major > Labels: timestamp > > If the machine's local timezone involves DST, the UDFs return incorrect > results. > For example: > {code} > select to_utc_timestamp('2005-04-03 02:01:00','UTC'); > {code} > returns {{2005-04-03 03:01:00}}. Correct result should be {{2005-04-03 > 02:01:00}}. > {code} > select to_utc_timestamp('2005-04-03 10:01:00','Asia/Shanghai'); > {code} > returns {{2005-04-03 03:01:00}}. Correct result should be {{2005-04-03 > 02:01:00}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
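The shifted result in the report comes from the spring-forward gap: on a machine whose local zone observes US DST (assumed here as America/Los_Angeles for illustration, independent of Hive's UDF code), 2005-04-03 02:01 never existed on the wall clock, and java.time resolves such gap times by pushing them forward:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;

public class DstGapDemo {
    // Resolve a wall-clock time in a zone; times inside a DST gap are
    // adjusted forward by the length of the gap.
    static LocalDateTime resolveLocal(LocalDateTime wall, String zone) {
        return wall.atZone(ZoneId.of(zone)).toLocalDateTime();
    }

    public static void main(String[] args) {
        // US DST started 2005-04-03 at 02:00 (clocks jumped to 03:00), so
        // 02:01 is inside the gap and resolves to 03:01 -- mirroring the
        // incorrect 03:01:00 results in the bug report.
        System.out.println(resolveLocal(LocalDateTime.of(2005, 4, 3, 2, 1),
                                        "America/Los_Angeles"));
    }
}
```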
[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics
[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=830970=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830970 ] ASF GitHub Bot logged work on HIVE-26221: - Author: ASF GitHub Bot Created on: 05/Dec/22 09:20 Start Date: 05/Dec/22 09:20 Worklog Time Spent: 10m Work Description: asolimando commented on code in PR #3137: URL: https://github.com/apache/hive/pull/3137#discussion_r1039339301 ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatObjectConverter.java: ## @@ -1064,6 +1118,9 @@ public static void setFieldsIntoOldStats(ColumnStatisticsObj oldStatObj, if (newDecimalStatsData.isSetBitVectors()) { oldDecimalStatsData.setBitVectors(newDecimalStatsData.getBitVectors()); } + if (newDecimalStatsData.isSetHistogram()) { +oldDecimalStatsData.setHistogram(newDecimalStatsData.getHistogram()); + } Review Comment: Yes, absolutely, thanks for catching that, added. Issue Time Tracking --- Worklog Id: (was: 830970) Time Spent: 3h 20m (was: 3h 10m) > Add histogram-based column statistics > - > > Key: HIVE-26221 > URL: https://issues.apache.org/jira/browse/HIVE-26221 > Project: Hive > Issue Type: Improvement > Components: CBO, Metastore, Statistics >Affects Versions: 4.0.0-alpha-2 >Reporter: Alessandro Solimando >Assignee: Alessandro Solimando >Priority: Major > Labels: pull-request-available > Time Spent: 3h 20m > Remaining Estimate: 0h > > Hive does not support histogram statistics, which are particularly useful for > skewed data (which is very common in practice) and range predicates. 
> Hive's current selectivity estimation for range predicates is based on a > hard-coded value of 1/3 (see > [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]). > The current proposal aims at integrating histogram as an additional column > statistics, stored into the Hive metastore at the table (or partition) level. > The main requirements for histogram integration are the following: > * efficiency: the approach must scale and support billions of rows > * merge-ability: partition-level histograms have to be merged to form > table-level histograms > * explicit and configurable trade-off between memory footprint and accuracy > Hive already integrates [KLL data > sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF. > Datasketches are small, stateful programs that process massive data-streams > and can provide approximate answers, with mathematical guarantees, to > computationally difficult queries orders-of-magnitude faster than > traditional, exact methods. > We propose to use KLL, and more specifically the cumulative distribution > function (CDF), as the underlying data structure for our histogram statistics. > The current proposal targets numeric data types (float, integer and numeric > families) and temporal data types (date and timestamp). -- This message was sent by Atlassian Jira (v8.20.10#820010)
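The proposal's core idea, replacing the hard-coded 1/3 with a CDF-based estimate, can be sketched with a plain empirical CDF. Hive's actual implementation uses KLL sketches; this standalone toy only shows how a CDF turns a range predicate into a selectivity:

```java
import java.util.Arrays;

// Illustrative only: selectivity of "lo < x <= hi" as cdf(hi) - cdf(lo),
// the role a KLL sketch's CDF plays in the proposal.
public class CdfSelectivity {
    static double selectivity(double[] sortedValues, double lo, double hi) {
        return cdf(sortedValues, hi) - cdf(sortedValues, lo);
    }

    // Empirical CDF: fraction of values <= x.
    static double cdf(double[] sorted, double x) {
        int idx = Arrays.binarySearch(sorted, x);
        if (idx < 0) return (double) (-idx - 1) / sorted.length; // insertion point = #values < x
        while (idx < sorted.length && sorted[idx] == x) idx++;   // include all duplicates of x
        return (double) idx / sorted.length;
    }

    public static void main(String[] args) {
        double[] skewed = {1, 1, 1, 1, 1, 1, 2, 3, 50, 100};
        // A fixed 1/3 guess badly misestimates both ranges on this skewed column:
        System.out.println(selectivity(skewed, 0, 1));    // 0.6
        System.out.println(selectivity(skewed, 10, 200)); // ~0.2
    }
}
```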
[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics
[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=830969=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830969 ] ASF GitHub Bot logged work on HIVE-26221: - Author: ASF GitHub Bot Created on: 05/Dec/22 09:16 Start Date: 05/Dec/22 09:16 Worklog Time Spent: 10m Work Description: asolimando commented on code in PR #3137: URL: https://github.com/apache/hive/pull/3137#discussion_r1039335554 ## ql/src/java/org/apache/hadoop/hive/ql/exec/DDLPlanUtils.java: ## @@ -395,29 +404,46 @@ public void addDoubleStats(ColumnStatisticsData cd, List ls) { ls.add(lowValue + dc.getLowValue() + "'"); } + public String checkHistogram(ColumnStatisticsData cd) { +byte[] buffer = null; + +if (cd.isSetDoubleStats() && cd.getDoubleStats().isSetHistogram()) { Review Comment: Good catch, we need to handle all the other supported data types here, I have added that Issue Time Tracking --- Worklog Id: (was: 830969) Time Spent: 3h 10m (was: 3h) > Add histogram-based column statistics > - > > Key: HIVE-26221 > URL: https://issues.apache.org/jira/browse/HIVE-26221 > Project: Hive > Issue Type: Improvement > Components: CBO, Metastore, Statistics >Affects Versions: 4.0.0-alpha-2 >Reporter: Alessandro Solimando >Assignee: Alessandro Solimando >Priority: Major > Labels: pull-request-available > Time Spent: 3h 10m > Remaining Estimate: 0h > > Hive does not support histogram statistics, which are particularly useful for > skewed data (which is very common in practice) and range predicates. 
[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics
[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=830968=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830968 ] ASF GitHub Bot logged work on HIVE-26221: - Author: ASF GitHub Bot Created on: 05/Dec/22 09:14 Start Date: 05/Dec/22 09:14 Worklog Time Spent: 10m Work Description: asolimando commented on code in PR #3137: URL: https://github.com/apache/hive/pull/3137#discussion_r1039333194 ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/BinaryColumnStatsAggregator.java: ## @@ -60,4 +60,8 @@ public ColumnStatisticsObj aggregate(List colStatsWit statsObj.setStatsData(columnStatisticsData); return statsObj; } + + @Override protected ColumnStatisticsData initColumnStatisticsData() { +throw new UnsupportedOperationException("initColumnStatisticsData not supported for binary statistics"); Review Comment: You are right, the method does not do much for `binary` and `boolean`, but it still makes sense, so I have: - removed the exception, replaced with `return new ColumnStatisticsData();` - used the method to actually initialize the empty `ColumnStatisticsData` for those two data types Issue Time Tracking --- Worklog Id: (was: 830968) Time Spent: 3h (was: 2h 50m) > Add histogram-based column statistics > - > > Key: HIVE-26221 > URL: https://issues.apache.org/jira/browse/HIVE-26221 > Project: Hive > Issue Type: Improvement > Components: CBO, Metastore, Statistics >Affects Versions: 4.0.0-alpha-2 >Reporter: Alessandro Solimando >Assignee: Alessandro Solimando >Priority: Major > Labels: pull-request-available > Time Spent: 3h > Remaining Estimate: 0h > > Hive does not support histogram statistics, which are particularly useful for > skewed data (which is very common in practice) and range predicates. 
[jira] [Work logged] (HIVE-26799) Make authorizations on custom UDFs involved in tables/view configurable.
[ https://issues.apache.org/jira/browse/HIVE-26799?focusedWorklogId=830967=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830967 ] ASF GitHub Bot logged work on HIVE-26799: - Author: ASF GitHub Bot Created on: 05/Dec/22 09:11 Start Date: 05/Dec/22 09:11 Worklog Time Spent: 10m Work Description: dengzhhu653 commented on code in PR #3821: URL: https://github.com/apache/hive/pull/3821#discussion_r1039330370 ## ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java: ## @@ -12550,6 +12550,21 @@ private ParseResult rewriteASTWithMaskAndFilter(TableMask tableMask, ASTNode ast } } + void gatherUserSuppliedFunctions(ASTNode ast) { +int tokenType = ast.getToken().getType(); +if (tokenType == HiveParser.TOK_FUNCTION || +tokenType == HiveParser.TOK_FUNCTIONDI || +tokenType == HiveParser.TOK_FUNCTIONSTAR) { + if (ast.getChild(0).getType() == HiveParser.Identifier) { +// maybe user supplied +this.userSuppliedFunctions.add(ast.getChild(0).getText()); Review Comment: The `ast.getChild(0).getText()` should be trimmed by `unescapeIdentifier(expressionTree.getChild(0).getText())`. Issue Time Tracking --- Worklog Id: (was: 830967) Time Spent: 1h 10m (was: 1h) > Make authorizations on custom UDFs involved in tables/view configurable. > > > Key: HIVE-26799 > URL: https://issues.apache.org/jira/browse/HIVE-26799 > Project: Hive > Issue Type: New Feature > Components: HiveServer2, Security >Affects Versions: 4.0.0-alpha-2 >Reporter: Sai Hemanth Gantasala >Assignee: Sai Hemanth Gantasala >Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > When Hive is using Ranger/Sentry as an authorization service, consider the > following scenario. 
> {code:java} > > create table test_udf(st string); // privileged user operation > > create function Udf_UPPER as 'openkb.hive.udf.MyUpper' using jar > > 'hdfs:///tmp/MyUpperUDF-1.0.0.jar'; // privileged user operation > > create view v1_udf as select udf_upper(st) from test_udf; // privileged > > user operation > //unprivileged user test_user is given select permissions on view v1_udf > > select * from v1_udf; {code} > It is expected that test_user needs to have select privilege on v1_udf and > select permissions on udf_upper custom UDF in order to do a select query on > view. > This patch introduces a configuration > "hive.security.authorization.functions.in.view"=false which disables > authorization on views associated with views/tables during the select query. > In this mode, only UDFs explicitly stated in the query would still be > authorized as it is currently. > The reason for making these custom UDFs associated with view/tables > authorizable is that currently, test_user will need to be granted select > permissions on the custom udf. and the test_user can use this UDF and query > against any other table, which is a security concern. -- This message was sent by Atlassian Jira (v8.20.10#820010)
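The review comment above concerns collecting candidate user-supplied function names while walking the parse tree. A toy sketch of that traversal, outside Hive's actual ASTNode/HiveParser API (the node class, token constants and the backtick-stripping stand-in for `unescapeIdentifier` are all illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: walk a parse tree and collect identifiers appearing as
// the first child of a function node, unescaping backtick quoting.
public class FunctionCollector {
    static final int TOK_FUNCTION = 1, IDENTIFIER = 2, OTHER = 0;

    static class Node {
        final int type; final String text; final List<Node> children = new ArrayList<>();
        Node(int type, String text) { this.type = type; this.text = text; }
        Node add(Node c) { children.add(c); return this; }
    }

    static List<String> collect(Node root) {
        List<String> out = new ArrayList<>();
        gather(root, out);
        return out;
    }

    static void gather(Node n, List<String> out) {
        if (n.type == TOK_FUNCTION && !n.children.isEmpty()
                && n.children.get(0).type == IDENTIFIER) {
            // strip backtick quoting -- the "unescapeIdentifier" step the review asks for
            out.add(n.children.get(0).text.replace("`", ""));
        }
        for (Node c : n.children) gather(c, out); // recurse into arguments and subtrees
    }

    public static void main(String[] args) {
        Node call = new Node(TOK_FUNCTION, null)
            .add(new Node(IDENTIFIER, "`udf_upper`"))
            .add(new Node(OTHER, "st"));
        System.out.println(collect(new Node(OTHER, null).add(call))); // [udf_upper]
    }
}
```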
[jira] [Work logged] (HIVE-26762) Remove operand pruning in HiveFilterSetOpTransposeRule
[ https://issues.apache.org/jira/browse/HIVE-26762?focusedWorklogId=830966=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830966 ] ASF GitHub Bot logged work on HIVE-26762: - Author: ASF GitHub Bot Created on: 05/Dec/22 09:10 Start Date: 05/Dec/22 09:10 Worklog Time Spent: 10m Work Description: kasakrisz commented on code in PR #3825: URL: https://github.com/apache/hive/pull/3825#discussion_r1039321050 ## ql/src/test/results/clientpositive/llap/union_all_filter_transpose_pruned_operands.q.out: ## @@ -0,0 +1,140 @@ +PREHOOK: query: CREATE EXTERNAL TABLE t (a string, b string) +PREHOOK: type: CREATETABLE +PREHOOK: Output: database:default +PREHOOK: Output: default@t +POSTHOOK: query: CREATE EXTERNAL TABLE t (a string, b string) +POSTHOOK: type: CREATETABLE +POSTHOOK: Output: database:default +POSTHOOK: Output: default@t +PREHOOK: query: INSERT INTO t VALUES ('1000', 'b1') +PREHOOK: type: QUERY +PREHOOK: Input: _dummy_database@_dummy_table +PREHOOK: Output: default@t +POSTHOOK: query: INSERT INTO t VALUES ('1000', 'b1') +POSTHOOK: type: QUERY +POSTHOOK: Input: _dummy_database@_dummy_table +POSTHOOK: Output: default@t +POSTHOOK: Lineage: t.a SCRIPT [] +POSTHOOK: Lineage: t.b SCRIPT [] +PREHOOK: query: INSERT INTO t VALUES ('1001', 'b1') +PREHOOK: type: QUERY +PREHOOK: Input: _dummy_database@_dummy_table +PREHOOK: Output: default@t +POSTHOOK: query: INSERT INTO t VALUES ('1001', 'b1') +POSTHOOK: type: QUERY +POSTHOOK: Input: _dummy_database@_dummy_table +POSTHOOK: Output: default@t +POSTHOOK: Lineage: t.a SCRIPT [] +POSTHOOK: Lineage: t.b SCRIPT [] +PREHOOK: query: INSERT INTO t VALUES ('1002', 'b1') +PREHOOK: type: QUERY +PREHOOK: Input: _dummy_database@_dummy_table +PREHOOK: Output: default@t +POSTHOOK: query: INSERT INTO t VALUES ('1002', 'b1') +POSTHOOK: type: QUERY +POSTHOOK: Input: _dummy_database@_dummy_table +POSTHOOK: Output: default@t +POSTHOOK: Lineage: t.a SCRIPT [] +POSTHOOK: Lineage: t.b SCRIPT [] +PREHOOK: query: INSERT INTO t 
VALUES ('2000', 'b2') +PREHOOK: type: QUERY +PREHOOK: Input: _dummy_database@_dummy_table +PREHOOK: Output: default@t +POSTHOOK: query: INSERT INTO t VALUES ('2000', 'b2') +POSTHOOK: type: QUERY +POSTHOOK: Input: _dummy_database@_dummy_table +POSTHOOK: Output: default@t +POSTHOOK: Lineage: t.a SCRIPT [] +POSTHOOK: Lineage: t.b SCRIPT [] +PREHOOK: query: SELECT * FROM ( + SELECT + a, + b + FROM t + UNION ALL + SELECT + a, + b + FROM t + WHERE a = 1001 +UNION ALL + SELECT + a, + b + FROM t + WHERE a = 1002) AS t2 +WHERE a = 1000 +PREHOOK: type: QUERY +PREHOOK: Input: default@t + A masked pattern was here +POSTHOOK: query: SELECT * FROM ( + SELECT + a, + b + FROM t + UNION ALL + SELECT + a, + b + FROM t + WHERE a = 1001 +UNION ALL + SELECT + a, + b + FROM t + WHERE a = 1002) AS t2 +WHERE a = 1000 +POSTHOOK: type: QUERY +POSTHOOK: Input: default@t + A masked pattern was here +1000 b1 +PREHOOK: query: EXPLAIN CBO +SELECT * FROM ( + SELECT + a, + b + FROM t + UNION ALL + SELECT + a, + b + FROM t + WHERE a = 1001 +UNION ALL + SELECT + a, + b + FROM t + WHERE a = 1002) AS t2 +WHERE a = 1000 +PREHOOK: type: QUERY +PREHOOK: Input: default@t + A masked pattern was here +POSTHOOK: query: EXPLAIN CBO +SELECT * FROM ( + SELECT + a, + b + FROM t + UNION ALL + SELECT + a, + b + FROM t + WHERE a = 1001 +UNION ALL + SELECT + a, + b + FROM t + WHERE a = 1002) AS t2 +WHERE a = 1000 +POSTHOOK: type: QUERY +POSTHOOK: Input: default@t + A masked pattern was here +CBO PLAN: +HiveProject(a=[$0], b=[$1]) + HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)]) Review Comment: This test does not intend testing the automatic casting for comparison but pruning empty result union branches. Could you please change the literals to string in the predicates. 
Issue Time Tracking --- Worklog Id: (was: 830966) Time Spent: 40m (was: 0.5h) > Remove operand pruning in HiveFilterSetOpTransposeRule > -- > > Key: HIVE-26762 > URL: https://issues.apache.org/jira/browse/HIVE-26762 > Project: Hive > Issue Type: Task > Components: CBO, Query Planning >Affects Versions: 4.0.0-alpha-2 >Reporter: Alessandro Solimando >Assignee: Alessandro Solimando >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > HiveFilterSetOpTransposeRule, when applied to UNION ALL operands, checks if > the newly pushed filter simplifies to FALSE (due to the predicates holding on > the input). > If this is true and there is more than one UNION ALL operand, it gets pruned. > After
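The rule's pruning step can be modelled very simply: push the outer equality into each UNION ALL branch and drop branches whose combined predicate is unsatisfiable. This is a toy model only; the real rule works on Calcite RexNodes with RexSimplify, and the review's point is that string-vs-numeric literals additionally introduce a CAST to DOUBLE:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: each branch constrains column "a" to one constant (null = no
// constraint); a branch with "a = c" contradicts the pushed "a = outerEq"
// unless the constants match, so it simplifies to FALSE and is pruned.
public class UnionBranchPruning {
    static List<Integer> pushAndPrune(List<Integer> branchEq, int outerEq) {
        List<Integer> kept = new ArrayList<>();
        for (Integer eq : branchEq) {
            if (eq == null || eq == outerEq) kept.add(eq);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> branches = new ArrayList<>();
        branches.add(null);  // unfiltered branch -- survives with pushed filter
        branches.add(1001);  // WHERE a = 1001, contradicts outer a = 1000
        branches.add(1002);  // WHERE a = 1002, contradicts outer a = 1000
        System.out.println(pushAndPrune(branches, 1000).size()); // 1 surviving branch
    }
}
```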
[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics
[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=830964=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830964 ] ASF GitHub Bot logged work on HIVE-26221: - Author: ASF GitHub Bot Created on: 05/Dec/22 09:09 Start Date: 05/Dec/22 09:09 Worklog Time Spent: 10m Work Description: asolimando commented on code in PR #3137: URL: https://github.com/apache/hive/pull/3137#discussion_r1039328901 ## standalone-metastore/metastore-server/src/main/sql/mysql/upgrade-4.0.0-alpha-2-to-4.0.0.mysql.sql: ## @@ -1,5 +1,9 @@ SELECT 'Upgrading MetaStore schema from 4.0.0-alpha-2 to 4.0.0' AS MESSAGE; + Issue Time Tracking --- Worklog Id: (was: 830964) Time Spent: 2h 50m (was: 2h 40m) > Add histogram-based column statistics > - > > Key: HIVE-26221 > URL: https://issues.apache.org/jira/browse/HIVE-26221 > Project: Hive > Issue Type: Improvement > Components: CBO, Metastore, Statistics >Affects Versions: 4.0.0-alpha-2 >Reporter: Alessandro Solimando >Assignee: Alessandro Solimando >Priority: Major > Labels: pull-request-available > Time Spent: 2h 50m > Remaining Estimate: 0h > > Hive does not support histogram statistics, which are particularly useful for > skewed data (which is very common in practice) and range predicates. > Hive's current selectivity estimation for range predicates is based on a > hard-coded value of 1/3 (see > [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).]) > The current proposal aims at integrating histogram as an additional column > statistics, stored into the Hive metastore at the table (or partition) level. 
[jira] [Work logged] (HIVE-26754) Implement array_distinct UDF to return an array after removing duplicates in it
[ https://issues.apache.org/jira/browse/HIVE-26754?focusedWorklogId=830962=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830962 ] ASF GitHub Bot logged work on HIVE-26754: - Author: ASF GitHub Bot Created on: 05/Dec/22 09:00 Start Date: 05/Dec/22 09:00 Worklog Time Spent: 10m Work Description: tarak271 commented on PR #3806: URL: https://github.com/apache/hive/pull/3806#issuecomment-1336982493 Test failure seems unrelated to these changes. The test 'orc_ppd_basic.q' is failing even without my changes Issue Time Tracking --- Worklog Id: (was: 830962) Time Spent: 3h 50m (was: 3h 40m) > Implement array_distinct UDF to return an array after removing duplicates in > it > --- > > Key: HIVE-26754 > URL: https://issues.apache.org/jira/browse/HIVE-26754 > Project: Hive > Issue Type: Sub-task > Components: Hive >Reporter: Taraka Rama Rao Lethavadla >Assignee: Taraka Rama Rao Lethavadla >Priority: Major > Labels: pull-request-available > Time Spent: 3h 50m > Remaining Estimate: 0h > > *array_distinct(array(obj1, obj2,...))* - The function returns an array of > the same type as the input argument where all duplicate values have been > removed. > Example: > > SELECT array_distinct(array('b', 'd', 'd', 'a')) FROM src LIMIT 1; > ['b', 'd', 'a'] -- This message was sent by Atlassian Jira (v8.20.10#820010)
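The intended semantics, deduplication keeping the first occurrence of each element, can be sketched in plain Java, outside Hive's GenericUDF machinery (this is a reference sketch, not the UDF's implementation):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Reference semantics for array_distinct: remove duplicates while keeping
// first-seen order.
public class ArrayDistinct {
    static <T> List<T> arrayDistinct(List<T> in) {
        // LinkedHashSet drops duplicates but preserves insertion order
        return new ArrayList<>(new LinkedHashSet<>(in));
    }

    public static void main(String[] args) {
        System.out.println(arrayDistinct(List.of("b", "d", "d", "a"))); // [b, d, a]
    }
}
```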
[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics
[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=830961=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830961 ] ASF GitHub Bot logged work on HIVE-26221: - Author: ASF GitHub Bot Created on: 05/Dec/22 08:52 Start Date: 05/Dec/22 08:52 Worklog Time Spent: 10m Work Description: dengzhhu653 commented on code in PR #3137: URL: https://github.com/apache/hive/pull/3137#discussion_r1039312469 ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/LongColumnStatsAggregator.java: ## @@ -51,73 +52,94 @@ public ColumnStatisticsObj aggregate(List colStatsWit checkStatisticsList(colStatsWithSourceInfo); ColumnStatisticsObj statsObj = null; -String colType = null; +String colType; String colName = null; // check if all the ColumnStatisticsObjs contain stats and all the ndv are // bitvectors boolean doAllPartitionContainStats = partNames.size() == colStatsWithSourceInfo.size(); NumDistinctValueEstimator ndvEstimator = null; +KllHistogramEstimator histogramEstimator = null; +boolean areAllNDVEstimatorsMergeable = true; +boolean areAllHistogramEstimatorsMergeable = true; for (ColStatsObjWithSourceInfo csp : colStatsWithSourceInfo) { ColumnStatisticsObj cso = csp.getColStatsObj(); if (statsObj == null) { colName = cso.getColName(); colType = cso.getColType(); statsObj = ColumnStatsAggregatorFactory.newColumnStaticsObj(colName, colType, cso.getStatsData().getSetField()); -LOG.trace("doAllPartitionContainStats for column: {} is: {}", colName, -doAllPartitionContainStats); +LOG.trace("doAllPartitionContainStats for column: {} is: {}", colName, doAllPartitionContainStats); } - LongColumnStatsDataInspector longColumnStatsData = longInspectorFromStats(cso); - if (longColumnStatsData.getNdvEstimator() == null) { -ndvEstimator = null; -break; - } else { -// check if all of the bit vectors can merge -NumDistinctValueEstimator estimator = longColumnStatsData.getNdvEstimator(); + 
LongColumnStatsDataInspector columnStatsData = longInspectorFromStats(cso); + + // check if we can merge NDV estimators + if (columnStatsData.getNdvEstimator() == null) { +areAllNDVEstimatorsMergeable = false; + } else if (areAllNDVEstimatorsMergeable) { +NumDistinctValueEstimator estimator = columnStatsData.getNdvEstimator(); if (ndvEstimator == null) { ndvEstimator = estimator; } else { - if (ndvEstimator.canMerge(estimator)) { -continue; - } else { -ndvEstimator = null; -break; + if (!ndvEstimator.canMerge(estimator)) { +areAllNDVEstimatorsMergeable = false; + } +} + } + // check if we can merge histogram estimators + if (columnStatsData.getHistogramEstimator() == null) { Review Comment: To keep things simple, can we call ```java // merge what can be merged and keep the one with the biggest cardinality KllHistogramEstimator mergedKllHistogramEstimator = mergeHistograms(colStatsWithSourceInfo); if (mergedKllHistogramEstimator != null) { columnStatisticsData.getLongStats().setHistogram(mergedKllHistogramEstimator.serialize()); } ``` directly to aggregate the histogram statistics instead of introducing `areAllHistogramEstimatorsMergeable` and `histogramEstimator` via iterating over the `colStatsWithSourceInfo`? Issue Time Tracking --- Worklog Id: (was: 830961) Time Spent: 2h 40m (was: 2.5h) > Add histogram-based column statistics > - > > Key: HIVE-26221 > URL: https://issues.apache.org/jira/browse/HIVE-26221 > Project: Hive > Issue Type: Improvement > Components: CBO, Metastore, Statistics >Affects Versions: 4.0.0-alpha-2 >Reporter: Alessandro Solimando >Assignee: Alessandro Solimando >Priority: Major > Labels: pull-request-available > Time Spent: 2h 40m > Remaining Estimate: 0h > > Hive does not support histogram statistics, which are particularly useful for > skewed data (which is very common in practice) and range predicates. 
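The reviewer's suggestion, "merge what can be merged and keep the one with the biggest cardinality" instead of tracking an all-or-nothing mergeability flag, can be sketched as follows. `Estimator`, `canMerge`, `mergeWith` and the cardinality field are stand-ins for Hive's KllHistogramEstimator API, not its actual signatures:

```java
import java.util.Arrays;
import java.util.List;

// Toy sketch of per-partition histogram aggregation: compatible sketches are
// merged; an incompatible one replaces the result only if it covers more items.
public class HistogramMerge {
    static class Estimator {
        final int k; // sketch parameter; only equal-k sketches are mergeable here
        long n;      // number of items summarized (the "cardinality")
        Estimator(int k, long n) { this.k = k; this.n = n; }
        boolean canMerge(Estimator o) { return o != null && o.k == k; }
        void mergeWith(Estimator o) { n += o.n; }
    }

    static Estimator mergeHistograms(List<Estimator> perPartition) {
        Estimator merged = null;
        for (Estimator e : perPartition) {
            if (e == null) continue;                 // partition without a histogram
            if (merged == null) merged = new Estimator(e.k, e.n);
            else if (merged.canMerge(e)) merged.mergeWith(e);
            else if (e.n > merged.n) merged = new Estimator(e.k, e.n); // keep biggest
        }
        return merged;
    }

    public static void main(String[] args) {
        Estimator m = mergeHistograms(Arrays.asList(
            new Estimator(200, 100), null, new Estimator(200, 50)));
        System.out.println(m.n); // 150
    }
}
```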
[jira] [Comment Edited] (HIVE-26737) Subquery returning wrong results when database has materialized views
[ https://issues.apache.org/jira/browse/HIVE-26737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643166#comment-17643166 ] Krisztian Kasa edited comment on HIVE-26737 at 12/5/22 8:51 AM: [#3761|https://github.com/apache/hive/pull/3761] merged to master. Thanks [~scarlin] for the patch. was (Author: kkasa): Merged to master. Thanks [~scarlin] for the patch. > Subquery returning wrong results when database has materialized views > - > > Key: HIVE-26737 > URL: https://issues.apache.org/jira/browse/HIVE-26737 > Project: Hive > Issue Type: Bug > Components: HiveServer2 >Reporter: Steve Carlin >Assignee: Steve Carlin >Priority: Major > Labels: pull-request-available > Time Spent: 2h 50m > Remaining Estimate: 0h > > When HS2 has materialized views in its registry, subqueries with correlated > variables may return wrong results. > An example of this: > > {code:java} > CREATE TABLE t_test1( > id int, > int_col int, > year int, > month int > ); > CREATE TABLE t_test2( > id int, > int_col int, > year int, > month int > ); > CREATE TABLE dummy ( > id int > ) stored as orc TBLPROPERTIES ('transactional'='true'); > CREATE MATERIALIZED VIEW need_a_mat_view_in_registry AS > SELECT * FROM dummy where id > 5; > INSERT INTO t_test1 VALUES (1, 1, 2009, 1), (10,0, 2009, 1); > INSERT INTO t_test2 VALUES (1, 1, 2009, 1); > select id, int_col, year, month from t_test1 s where s.int_col = (select > count(*) from t_test2 t where s.id = t.id) order by id; > {code} > The select statement should produce 2 rows, but it is only producing one. > The CBO plan produced has an inner join instead of a left join. 
> {code:java} > HiveSortLimit(sort0=[$0], dir0=[ASC]) > HiveProject(id=[$0], int_col=[$1], year=[$2], month=[$3]) > HiveJoin(condition=[AND(=($0, $5), =($4, $6))], joinType=[inner], > algorithm=[none], cost=[not available]) > HiveProject(id=[$0], int_col=[$1], year=[$2], month=[$3], > CAST=[CAST($1):BIGINT]) > HiveFilter(condition=[AND(IS NOT NULL($0), IS NOT > NULL(CAST($1):BIGINT))]) > HiveTableScan(table=[[default, t_test1]], table:alias=[s]) > HiveProject(id=[$0], $f1=[$1]) > HiveFilter(condition=[IS NOT NULL($1)]) > HiveAggregate(group=[{0}], agg#0=[count()]) > HiveFilter(condition=[IS NOT NULL($0)]) > HiveTableScan(table=[[default, t_test2]], table:alias=[t]){code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
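The inner-vs-left join distinction in the plan above matters because of how the correlated count is decorrelated: ids absent from t_test2 must still surface with a count of 0. A toy simulation (plain Java, not Hive's planner) of the two join choices on the reported data:

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration: "s.int_col = (select count(*) from t where s.id = t.id)"
// needs a LEFT join against the per-id counts; an inner join drops rows with
// no matching group, which is the wrong-result bug reported here.
public class CorrelatedCountJoin {
    static int matches(int[][] s, Map<Integer, Long> countsById, boolean leftJoin) {
        int rows = 0;
        for (int[] row : s) {               // row = {id, int_col}
            Long cnt = countsById.get(row[0]);
            if (cnt == null) {
                if (!leftJoin) continue;    // inner join: unmatched row dropped
                cnt = 0L;                   // left join: missing group counts as 0
            }
            if (row[1] == cnt) rows++;
        }
        return rows;
    }

    public static void main(String[] args) {
        int[][] t_test1 = {{1, 1}, {10, 0}};
        Map<Integer, Long> counts = new HashMap<>();
        counts.put(1, 1L);                  // t_test2 has a single row with id = 1
        System.out.println(matches(t_test1, counts, true));  // 2 (correct)
        System.out.println(matches(t_test1, counts, false)); // 1 (the bug)
    }
}
```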
[jira] [Resolved] (HIVE-26737) Subquery returning wrong results when database has materialized views
[ https://issues.apache.org/jira/browse/HIVE-26737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Kasa resolved HIVE-26737. --- Resolution: Fixed Merged to master. Thanks [~scarlin] for the patch. > Subquery returning wrong results when database has materialized views > - > > Key: HIVE-26737 > URL: https://issues.apache.org/jira/browse/HIVE-26737 > Project: Hive > Issue Type: Bug > Components: HiveServer2 >Reporter: Steve Carlin >Assignee: Steve Carlin >Priority: Major > Labels: pull-request-available > Time Spent: 2h 50m > Remaining Estimate: 0h > > When HS2 has materialized views in its registry, subqueries with correlated > variables may return wrong results. > An example of this: > > {code:java} > CREATE TABLE t_test1( > id int, > int_col int, > year int, > month int > ); > CREATE TABLE t_test2( > id int, > int_col int, > year int, > month int > ); > CREATE TABLE dummy ( > id int > ) stored as orc TBLPROPERTIES ('transactional'='true'); > CREATE MATERIALIZED VIEW need_a_mat_view_in_registry AS > SELECT * FROM dummy where id > 5; > INSERT INTO t_test1 VALUES (1, 1, 2009, 1), (10,0, 2009, 1); > INSERT INTO t_test2 VALUES (1, 1, 2009, 1); > select id, int_col, year, month from t_test1 s where s.int_col = (select > count(*) from t_test2 t where s.id = t.id) order by id; > {code} > The select statement should produce 2 rows, but it is only producing one. > The CBO plan produced has an inner join instead of a left join. 
[jira] [Work logged] (HIVE-26737) Subquery returning wrong results when database has materialized views
[ https://issues.apache.org/jira/browse/HIVE-26737?focusedWorklogId=830960=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830960 ] ASF GitHub Bot logged work on HIVE-26737: - Author: ASF GitHub Bot Created on: 05/Dec/22 08:49 Start Date: 05/Dec/22 08:49 Worklog Time Spent: 10m Work Description: kasakrisz merged PR #3761: URL: https://github.com/apache/hive/pull/3761 Issue Time Tracking --- Worklog Id: (was: 830960) Time Spent: 2h 50m (was: 2h 40m) > Subquery returning wrong results when database has materialized views > - > > Key: HIVE-26737 > URL: https://issues.apache.org/jira/browse/HIVE-26737 > Project: Hive > Issue Type: Bug > Components: HiveServer2 >Reporter: Steve Carlin >Assignee: Steve Carlin >Priority: Major > Labels: pull-request-available > Time Spent: 2h 50m > Remaining Estimate: 0h > > When HS2 has materialized views in its registry, subqueries with correlated > variables may return wrong results. > An example of this: > > {code:java} > CREATE TABLE t_test1( > id int, > int_col int, > year int, > month int > ); > CREATE TABLE t_test2( > id int, > int_col int, > year int, > month int > ); > CREATE TABLE dummy ( > id int > ) stored as orc TBLPROPERTIES ('transactional'='true'); > CREATE MATERIALIZED VIEW need_a_mat_view_in_registry AS > SELECT * FROM dummy where id > 5; > INSERT INTO t_test1 VALUES (1, 1, 2009, 1), (10,0, 2009, 1); > INSERT INTO t_test2 VALUES (1, 1, 2009, 1); > select id, int_col, year, month from t_test1 s where s.int_col = (select > count(*) from t_test2 t where s.id = t.id) order by id; > {code} > The select statement should produce 2 rows, but it is only producing one. > The CBO plan produced has an inner join instead of a left join. 
> {code:java}
> HiveSortLimit(sort0=[$0], dir0=[ASC])
>   HiveProject(id=[$0], int_col=[$1], year=[$2], month=[$3])
>     HiveJoin(condition=[AND(=($0, $5), =($4, $6))], joinType=[inner], algorithm=[none], cost=[not available])
>       HiveProject(id=[$0], int_col=[$1], year=[$2], month=[$3], CAST=[CAST($1):BIGINT])
>         HiveFilter(condition=[AND(IS NOT NULL($0), IS NOT NULL(CAST($1):BIGINT))])
>           HiveTableScan(table=[[default, t_test1]], table:alias=[s])
>       HiveProject(id=[$0], $f1=[$1])
>         HiveFilter(condition=[IS NOT NULL($1)])
>           HiveAggregate(group=[{0}], agg#0=[count()])
>             HiveFilter(condition=[IS NOT NULL($0)])
>               HiveTableScan(table=[[default, t_test2]], table:alias=[t])
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
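The plan above joins the aggregate side back to `t_test1` with an inner join, which is exactly what loses the second row: a correlated scalar `count(*)` subquery is defined to yield 0 (not "no row") for outer rows with no match, so the decorrelating rewrite needs a left outer join with the missing count treated as 0. A minimal sketch of those semantics in Python (illustrative only, not Hive code; the helper names are made up):

```python
# Rows mimicking the repro: (id, int_col, year, month)
t_test1 = [(1, 1, 2009, 1), (10, 0, 2009, 1)]
t_test2 = [(1, 1, 2009, 1)]

def counts_by_id(rows):
    """Aggregate side of the decorrelated subquery: count(*) grouped by id."""
    out = {}
    for r in rows:
        out[r[0]] = out.get(r[0], 0) + 1
    return out

def correct_left_join(outer, counts):
    """Left-join semantics: an outer row with no match gets count 0,
    which is what SQL defines for a correlated scalar count(*) subquery."""
    return sorted(r for r in outer if r[1] == counts.get(r[0], 0))

def buggy_inner_join(outer, counts):
    """Inner-join semantics (the plan Hive produced): unmatched rows vanish."""
    return sorted(r for r in outer if r[0] in counts and r[1] == counts[r[0]])

counts = counts_by_id(t_test2)
print(correct_left_join(t_test1, counts))  # both rows: id=10 has count 0 == int_col 0
print(buggy_inner_join(t_test1, counts))   # only the id=1 row survives
```

Row `(10, 0, 2009, 1)` has no match in `t_test2`, so its count is 0, which equals its `int_col`; the inner-join variant drops it, reproducing the one-row result from the report.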
[jira] [Work logged] (HIVE-26770) Make "end of loop" compaction logs appear more selectively
[ https://issues.apache.org/jira/browse/HIVE-26770?focusedWorklogId=830959=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830959 ]

ASF GitHub Bot logged work on HIVE-26770:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 05/Dec/22 08:36
Start Date: 05/Dec/22 08:36
Worklog Time Spent: 10m
Work Description: sonarcloud[bot] commented on PR #3803
URL: https://github.com/apache/hive/pull/3803#issuecomment-1336957318

Kudos, SonarCloud Quality Gate passed!

* 1 Bug
* 0 Vulnerabilities
* 0 Security Hotspots
* 10 Code Smells
* No Coverage information
* No Duplication information

Issue Time Tracking
-------------------
    Worklog Id: (was: 830959)
    Time Spent: 4h 50m  (was: 4h 40m)

> Make "end of loop" compaction logs appear more selectively
> ----------------------------------------------------------
>
>                 Key: HIVE-26770
>                 URL: https://issues.apache.org/jira/browse/HIVE-26770
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 4.0.0-alpha-1
>            Reporter: Akshat Mathur
>            Assignee: Akshat Mathur
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> Currently the Initiator, Worker, and Cleaner threads log something like
> "finished one loop" on INFO level.
> This is useful to figure out if one of these threads is taking too long to
> finish a loop, but expensive in general.
>
> Suggested Time: 20mins
>
> Logging this should be changed in the following way:
> # If the loop finished within a predefined amount of time, the level should be DEBUG and the message should look like: *Initiator loop took \{elapsedTime} seconds to finish.*
> # If the loop ran longer than this predefined amount, the level should be WARN and the message should look like: *Possible Initiator slowdown, loop took \{elapsedTime} seconds to finish.*

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
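The two-level rule above amounts to comparing the loop's elapsed time against a threshold before choosing a log level. A sketch of that idea in Python (illustrative only; the threshold constant and function are made up, and the real Hive change would be Java with a configurable threshold):

```python
import logging

# Illustrative threshold; in Hive this would come from configuration.
LOOP_WARN_THRESHOLD_SECONDS = 60

log = logging.getLogger("compactor")

def log_loop_duration(thread_name, elapsed_seconds,
                      threshold=LOOP_WARN_THRESHOLD_SECONDS):
    """DEBUG for ordinary iterations; WARN only when the loop is suspiciously slow.

    Returns the level chosen, which makes the policy easy to test.
    """
    if elapsed_seconds > threshold:
        log.warning("Possible %s slowdown, loop took %s seconds to finish.",
                    thread_name, elapsed_seconds)
        return "WARN"
    log.debug("%s loop took %s seconds to finish.", thread_name, elapsed_seconds)
    return "DEBUG"
```

With this policy a healthy cluster emits nothing at default (INFO) verbosity, while a slow Initiator, Worker, or Cleaner iteration still surfaces as a warning.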
[jira] [Work logged] (HIVE-26788) Update stats of table/partition after minor compaction using noscan operation
[ https://issues.apache.org/jira/browse/HIVE-26788?focusedWorklogId=830957=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830957 ]

ASF GitHub Bot logged work on HIVE-26788:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 05/Dec/22 08:21
Start Date: 05/Dec/22 08:21
Worklog Time Spent: 10m
Work Description: deniskuzZ commented on code in PR #3812
URL: https://github.com/apache/hive/pull/3812#discussion_r1039282031

##########
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/StatsUpdater.java:
##########

@@ -52,10 +52,6 @@ public final class StatsUpdater {
    */
   public void gatherStats(CompactionInfo ci, HiveConf conf, String userName, String compactionQueueName) {
     try {
-      if (!ci.isMajorCompaction()) {

Review Comment:
   How much overhead could we get on a production cluster? AFAIK, when multiple workers are used, those would try to initiate a new Tez session.

Issue Time Tracking
-------------------
    Worklog Id: (was: 830957)
    Time Spent: 50m  (was: 40m)

> Update stats of table/partition after minor compaction using noscan operation
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-26788
>                 URL: https://issues.apache.org/jira/browse/HIVE-26788
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sourabh Badhya
>            Assignee: Sourabh Badhya
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently, statistics are not updated for minor compaction, since minor
> compaction changes the statistics only slightly (such as the number of files
> in a table/partition and the total size of the table/partition). It is
> better to use the NOSCAN operation for minor compaction, since NOSCAN
> updates statistics faster and refreshes exactly the relevant fields: the
> number of files and the total size of the table/partition.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Work logged] (HIVE-26788) Update stats of table/partition after minor compaction using noscan operation
[ https://issues.apache.org/jira/browse/HIVE-26788?focusedWorklogId=830956=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830956 ]

ASF GitHub Bot logged work on HIVE-26788:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 05/Dec/22 08:17
Start Date: 05/Dec/22 08:17
Worklog Time Spent: 10m
Work Description: deniskuzZ commented on code in PR #3812
URL: https://github.com/apache/hive/pull/3812#discussion_r1039279154

##########
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/StatsUpdater.java:
##########

@@ -73,6 +69,9 @@ public void gatherStats(CompactionInfo ci, HiveConf conf, String userName, String compactionQueueName) {
         sb.append(")");
       }
       sb.append(" compute statistics");
+      if (ci.isMinorCompaction()) {
+        sb.append(" noscan");

Review Comment:
   Why is `noscan` used only in the case of minor compaction?

Issue Time Tracking
-------------------
    Worklog Id: (was: 830956)
    Time Spent: 40m  (was: 0.5h)

> Update stats of table/partition after minor compaction using noscan operation
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-26788
>                 URL: https://issues.apache.org/jira/browse/HIVE-26788
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sourabh Badhya
>            Assignee: Sourabh Badhya
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, statistics are not updated for minor compaction, since minor
> compaction changes the statistics only slightly (such as the number of files
> in a table/partition and the total size of the table/partition). It is
> better to use the NOSCAN operation for minor compaction, since NOSCAN
> updates statistics faster and refreshes exactly the relevant fields: the
> number of files and the total size of the table/partition.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
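The effect of the diff under review is easiest to see as the statement StatsUpdater ends up building. A Python sketch of that string assembly (illustrative only; the helper and its parameters are made up, mirroring the Java StringBuilder logic in `gatherStats`):

```python
def build_stats_command(table, partition_spec=None, minor_compaction=False):
    """Assemble an ANALYZE TABLE ... COMPUTE STATISTICS command.

    For minor compaction the patch appends 'noscan', which refreshes only
    file counts and total sizes -- the fields minor compaction actually
    changes -- without rescanning the data.
    """
    parts = ["analyze table", table]
    if partition_spec:
        cols = ", ".join(f"{k}='{v}'" for k, v in partition_spec.items())
        parts.append(f"partition({cols})")
    parts.append("compute statistics")
    if minor_compaction:
        parts.append("noscan")
    return " ".join(parts)

print(build_stats_command("default.t", minor_compaction=True))
# analyze table default.t compute statistics noscan
```

Major compaction keeps the full-scan form, which also recomputes column-level statistics; minor compaction gets the cheap `noscan` variant.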