[jira] [Work logged] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
[ https://issues.apache.org/jira/browse/HIVE-23880?focusedWorklogId=470580&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470580 ] ASF GitHub Bot logged work on HIVE-23880: - Author: ASF GitHub Bot Created on: 14/Aug/20 05:37 Start Date: 14/Aug/20 05:37 Worklog Time Spent: 10m Work Description: abstractdog commented on a change in pull request #1280: URL: https://github.com/apache/hive/pull/1280#discussion_r470419947 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java ## @@ -1126,6 +1137,7 @@ protected void initializeOp(Configuration hconf) throws HiveException { VectorAggregateExpression vecAggrExpr = null; try { vecAggrExpr = ctor.newInstance(vecAggrDesc); + vecAggrExpr.withConf(hconf); Review comment: Sadly, I have to agree about conf abuse in the (Hive) codebase :) Somehow I don't really like the instanceof approach here just for a single expression; moreover, I wanted to find a general way to provide configuration to expressions, as this patch showed that they might need it (in the future). On the other hand, explicitly calling a specific constructor for different types could serve as documentation in one place about how to instantiate these expressions. I'm about to refactor this logic into a separate method in VectorGroupByOperator and let this patch go! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470580) Time Spent: 7h 50m (was: 7h 40m)
> Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
> ---
>
> Key: HIVE-23880
> URL: https://issues.apache.org/jira/browse/HIVE-23880
> Project: Hive
> Issue Type: Improvement
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Labels: pull-request-available
> Attachments: lipwig-output3605036885489193068.svg
>
> Time Spent: 7h 50m
> Remaining Estimate: 0h
>
> Merging bloom filters in semijoin reduction can become the main bottleneck in case of a large number of source mapper tasks (~1000, Map 1 in the example below) and a large number of expected entries (50M) in the bloom filters.
> For example, in TPCDS Q93:
> {code}
> select /*+ semi(store_returns, sr_item_sk, store_sales, 7000)*/ ss_customer_sk
>       ,sum(act_sales) sumsales
> from (select ss_item_sk
>             ,ss_ticket_number
>             ,ss_customer_sk
>             ,case when sr_return_quantity is not null
>                   then (ss_quantity - sr_return_quantity) * ss_sales_price
>                   else (ss_quantity * ss_sales_price) end act_sales
>       from store_sales
>       left outer join store_returns on (sr_item_sk = ss_item_sk
>                                     and sr_ticket_number = ss_ticket_number)
>           ,reason
>       where sr_reason_sk = r_reason_sk
>         and r_reason_desc = 'reason 66') t
> group by ss_customer_sk
> order by sumsales, ss_customer_sk
> limit 100;
> {code}
> On 10TB-30TB scale there is a chance that from 3-4 mins of query runtime, 1-2 mins are spent merging bloom filters (Reducer 2), as in:
> [^lipwig-output3605036885489193068.svg]
> {code}
> ----------------------------------------------------------------------------------------
>       VERTICES   MODE   STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------------
> Map 3 ........   llap   SUCCEEDED      1          1        0        0       0       0
> Map 1 ........   llap   SUCCEEDED   1263       1263        0        0       0       0
> Reducer 2        llap   RUNNING        1          0        1        0       0       0
> Map 4            llap   RUNNING     6154          0      207     5947       0       0
> Reducer 5        llap   INITED        43          0        0       43       0       0
> Reducer 6        llap   INITED         1          0        0        1       0       0
> ----------------------------------------------------------------------------------------
> VERTICES: 02/06  [>>--]  16%  ELAPSED TIME: 149.98 s
> ----------------------------------------------------------------------------------------
> {code}
> For example, 70M entries in a bloom filter lead to 436,465,696 bits, so merging 1263 bloom filters means running ~1263 * 436,465,696 bitwise OR operations, which is a very hot codepath, but can be parallelized.
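The merge that dominates Reducer 2 is embarrassingly parallel: each 64-bit word of the target bloom filter is OR-combined independently of every other word. A minimal standalone sketch of the idea (plain Java arrays; a hypothetical class, not Hive's actual VectorUDAFBloomFilterMerge implementation):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class BloomMergeSketch {

    // OR-merge many equally sized bloom filter bit sets into a fresh copy
    // of the first one. The word range is split across threads; since each
    // thread owns a disjoint set of word indices, no locking is needed.
    public static long[] mergeAll(long[][] filters) {
        long[] target = Arrays.copyOf(filters[0], filters[0].length);
        IntStream.range(0, target.length).parallel().forEach(w -> {
            for (int f = 1; f < filters.length; f++) {
                target[w] |= filters[f][w];
            }
        });
        return target;
    }
}
```

At 436,465,696 bits per filter (about 6.8 million long words), merging 1263 filters touches roughly 8.6 billion words, which is why spreading the word range over several threads can cut the Reducer 2 wall time.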
[jira] [Work logged] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
[ https://issues.apache.org/jira/browse/HIVE-23880?focusedWorklogId=470581&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470581 ] ASF GitHub Bot logged work on HIVE-23880: - Author: ASF GitHub Bot Created on: 14/Aug/20 05:37 Start Date: 14/Aug/20 05:37 Worklog Time Spent: 10m Work Description: mustafaiman commented on pull request #1280: URL: https://github.com/apache/hive/pull/1280#issuecomment-673894751 @abstractdog I missed that call. I think that covers it. Good work. +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470581) Time Spent: 8h (was: 7h 50m) > Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge > --- > > Key: HIVE-23880 > URL: https://issues.apache.org/jira/browse/HIVE-23880 > Project: Hive > Issue Type: Improvement >Reporter: László Bodor >Assignee: László Bodor >Priority: Major > Labels: pull-request-available > Attachments: lipwig-output3605036885489193068.svg > > Time Spent: 8h > Remaining Estimate: 0h > > Merging bloom filters in semijoin reduction can become the main bottleneck in > case of large number of source mapper tasks (~1000, Map 1 in below example) > and a large amount of expected entries (50M) in bloom filters. 
> For example, 70M entries in a bloom filter lead to 436,465,696 bits, so merging 1263 bloom filters means running ~1263 * 436,465,696 bitwise OR operations, which is a very hot codepath, but can be parallelized. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
[ https://issues.apache.org/jira/browse/HIVE-23880?focusedWorklogId=470577&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470577 ] ASF GitHub Bot logged work on HIVE-23880: - Author: ASF GitHub Bot Created on: 14/Aug/20 05:29 Start Date: 14/Aug/20 05:29 Worklog Time Spent: 10m Work Description: abstractdog commented on pull request #1280: URL: https://github.com/apache/hive/pull/1280#issuecomment-673892379 > @abstractdog > I am almost ok with this patch. However, I still don't understand how this integrates with `ProcessingModeHashAggregate`. Since there are multiple VectorAggregationBufferRows in hash mode, I think we should `finish` each of them as we process them. Otherwise, we pass to the next operator in the pipeline without completing the bloom filter. Also, since hash mode dynamically allocates and frees VectorAggregationBufferRows, these `finish`es should happen as we deallocate each of them, rather than only at the end of the operator. Good point. I created this patch with a focus on finishing buffers correctly, and I think I've already taken care of this; please take a look: https://github.com/apache/hive/pull/1280/commits/0ada66534a937b8f4492d14f508903fa98402aed#diff-07c28d3f5c72db581b9cd4fa424a0ecbR675 As you can see, I'm calling finish before every instance of writeSingleRow. I'm assuming that writeSingleRow is a point where a buffer should be finished for writing. In ProcessingModeHashAggregate, the above part is enclosed in an iteration over buffers in the flush method. Are you aware of any other places where I should finish a buffer? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470577) Time Spent: 7h 40m (was: 7.5h)
> Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
> ---
>
> Key: HIVE-23880
> URL: https://issues.apache.org/jira/browse/HIVE-23880
> Project: Hive
> Issue Type: Improvement
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Labels: pull-request-available
> Attachments: lipwig-output3605036885489193068.svg
>
> Time Spent: 7h 40m
> Remaining Estimate: 0h
>
> Merging bloom filters in semijoin reduction can become the main bottleneck in case of a large number of source mapper tasks (~1000) and a large number of expected entries (50M) in the bloom filters.
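The contract discussed above, finishing every aggregation buffer before its row is emitted, can be shown with a toy buffer (a hypothetical class, not Hive code): mergeAsync queues OR work in the background, and finish() must run before the merged bits are read, just as finish is called before every writeSingleRow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncMergeBuffer {
    // Merge work is queued asynchronously while rows stream in;
    // finish() blocks until every queued merge has been applied,
    // so the buffer's value is complete before it is written out.
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final List<Future<?>> pending = new ArrayList<>();
    private final long[] bits;

    public AsyncMergeBuffer(int words) { bits = new long[words]; }

    public synchronized void mergeAsync(long[] other) {
        pending.add(pool.submit(() -> {
            synchronized (bits) {
                for (int i = 0; i < bits.length; i++) bits[i] |= other[i];
            }
        }));
    }

    // Must run before the merged value is read (the writeSingleRow point).
    public long[] finish() {
        for (Future<?> f : pending) {
            try { f.get(); } catch (Exception e) { throw new RuntimeException(e); }
        }
        pool.shutdown();
        return bits;
    }
}
```

Reading bits without calling finish() first is exactly the bug the reviewer is worried about: the row would be passed downstream with an incomplete bloom filter.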
[jira] [Work logged] (HIVE-24032) Remove hadoop shims dependency and use FileSystem Api directly from standalone metastore
[ https://issues.apache.org/jira/browse/HIVE-24032?focusedWorklogId=470547&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470547 ] ASF GitHub Bot logged work on HIVE-24032: - Author: ASF GitHub Bot Created on: 14/Aug/20 03:10 Start Date: 14/Aug/20 03:10 Worklog Time Spent: 10m Work Description: aasha commented on a change in pull request #1396: URL: https://github.com/apache/hive/pull/1396#discussion_r470385260 ## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java ## @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+package org.apache.hadoop.hive.metastore.utils;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.client.HdfsAdmin;
+import org.apache.hadoop.hdfs.protocol.EncryptionZone;
+
+import java.io.IOException;
+import java.net.URI;
+
+public class EncryptionFileUtils {
+
+  public static boolean isPathEncrypted(Path path, Configuration conf) throws IOException {
+    Path fullPath;
+    if (path.isAbsolute()) {
+      fullPath = path;
+    } else {
+      fullPath = path.getFileSystem(conf).makeQualified(path);
+    }
+    if (!"hdfs".equalsIgnoreCase(path.toUri().getScheme())) {
+      return false;
+    }
+    return (EncryptionFileUtils.getEncryptionZoneForPath(fullPath, conf) != null);
+  }
+
+  public static EncryptionZone getEncryptionZoneForPath(Path path, Configuration conf) throws IOException {
+    URI uri = path.getFileSystem(conf).getUri();
+    if ("hdfs".equals(uri.getScheme())) {
+      HdfsAdmin hdfsAdmin = new HdfsAdmin(uri, conf);
+      if (path.getFileSystem(conf).exists(path)) {
+        return hdfsAdmin.getEncryptionZoneForPath(path);
+      } else if (!path.getParent().equals(path)) {
Review comment: This is an exit condition. When the path is the root, path.getParent() should equal the path, and the recursion exits. These utils are picked from the Hadoop shims code (Hadoop23Shims.java). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470547) Time Spent: 1h 10m (was: 1h) > Remove hadoop shims dependency and use FileSystem Api directly from > standalone metastore > > > Key: HIVE-24032 > URL: https://issues.apache.org/jira/browse/HIVE-24032 > Project: Hive > Issue Type: Task >Reporter: Aasha Medhi >Assignee: Aasha Medhi >Priority: Major > Labels: pull-request-available > Attachments: HIVE-24032.01.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
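The parent-walk recursion being reviewed can be sketched outside Hadoop. The loop below uses java.nio.file.Path (whose getParent() returns null at the root) and a hypothetical zones map standing in for hdfsAdmin.getEncryptionZoneForPath, to show how the walk finds the nearest enclosing zone and terminates:

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

public class AncestorLookup {
    // Walk from a (possibly nonexistent) path up toward the root, returning
    // the first value registered for an ancestor, or null if none exists.
    // The root's parent is null, which terminates the walk; this mirrors the
    // recursion on getParent() in getEncryptionZoneForPath.
    public static String findNearest(Path path, Map<Path, String> zones) {
        for (Path p = path; p != null; p = p.getParent()) {
            if (zones.containsKey(p)) {
                return zones.get(p);
            }
        }
        return null;
    }
}
```

Note that the zones map is purely illustrative; in the real utility the lookup is an HDFS NameNode call, which is why the scheme check in isPathEncrypted short-circuits before reaching it.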
[jira] [Work logged] (HIVE-24032) Remove hadoop shims dependency and use FileSystem Api directly from standalone metastore
[ https://issues.apache.org/jira/browse/HIVE-24032?focusedWorklogId=470536&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470536 ] ASF GitHub Bot logged work on HIVE-24032: - Author: ASF GitHub Bot Created on: 14/Aug/20 02:26 Start Date: 14/Aug/20 02:26 Worklog Time Spent: 10m Work Description: aasha commented on a change in pull request #1396: URL: https://github.com/apache/hive/pull/1396#discussion_r470374769 ## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java ## @@ -0,0 +1,65 @@
+    if (!"hdfs".equalsIgnoreCase(path.toUri().getScheme())) {
Review comment: If the scheme itself is not hdfs, we needn't make a call to getEncryptionZoneForPath. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470536) Time Spent: 1h (was: 50m)
> Remove hadoop shims dependency and use FileSystem Api directly from
> standalone metastore
>
> Key: HIVE-24032
> URL: https://issues.apache.org/jira/browse/HIVE-24032
> Project: Hive
> Issue Type: Task
> Reporter: Aasha Medhi
> Assignee: Aasha Medhi
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-24032.01.patch
>
> Time Spent: 1h
> Remaining Estimate: 0h
> -- This message was sent by Atlassian Jira (v8.3.4#803005)
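The early return under discussion reduces to a scheme test on the path's URI before any file system RPC is made. A trivial sketch of just that check (not the metastore code itself):

```java
import java.net.URI;

public class SchemeCheck {
    // Only HDFS paths can sit inside an encryption zone, so any other
    // scheme (s3a, file, ...) can skip the NameNode call entirely.
    public static boolean mightBeEncrypted(URI uri) {
        return "hdfs".equalsIgnoreCase(uri.getScheme());
    }
}
```

The saved call matters because getEncryptionZoneForPath is a round trip to the NameNode, whereas the scheme check is purely local.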
[jira] [Work logged] (HIVE-24032) Remove hadoop shims dependency and use FileSystem Api directly from standalone metastore
[ https://issues.apache.org/jira/browse/HIVE-24032?focusedWorklogId=470535&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470535 ] ASF GitHub Bot logged work on HIVE-24032: - Author: ASF GitHub Bot Created on: 14/Aug/20 02:24 Start Date: 14/Aug/20 02:24 Worklog Time Spent: 10m Work Description: aasha commented on a change in pull request #1396: URL: https://github.com/apache/hive/pull/1396#discussion_r470374769 ## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java ## @@ -0,0 +1,65 @@
+    if (!"hdfs".equalsIgnoreCase(path.toUri().getScheme())) {
Review comment: If the scheme itself is not hdfs, we needn't make a call to getEncryptionZoneForPath. This will save a file system call and will be faster.
## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java ## @@ -0,0 +1,65 @@
+    return (EncryptionFileUtils.getEncryptionZoneForPath(fullPath, conf) != null);
Review comment: It's a static method and hence used with the class name. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470535) Time Spent: 50m (was: 40m)
> Remove hadoop shims dependency and use FileSystem Api directly from
> standalone metastore
>
> Key: HIVE-24032
> URL: https://issues.apache.org/jira/browse/HIVE-24032
> Project: Hive
> Issue Type: Task
> Reporter: Aasha Medhi
>
[jira] [Work logged] (HIVE-24032) Remove hadoop shims dependency and use FileSystem Api directly from standalone metastore
[ https://issues.apache.org/jira/browse/HIVE-24032?focusedWorklogId=470534&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470534 ] ASF GitHub Bot logged work on HIVE-24032: - Author: ASF GitHub Bot Created on: 14/Aug/20 02:20 Start Date: 14/Aug/20 02:20 Worklog Time Spent: 10m Work Description: aasha commented on a change in pull request #1396: URL: https://github.com/apache/hive/pull/1396#discussion_r470373805 ## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java ## @@ -0,0 +1,65 @@
+  public static void createEncryptionZone(Path path, String keyName, Configuration conf) throws IOException {
Review comment: Better to keep it as part of the same utility class. Tests can also use this from the utility. It's not test code; it's a utility method. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470534) Time Spent: 40m (was: 0.5h) > Remove hadoop shims dependency and use FileSystem Api directly from > standalone metastore > > > Key: HIVE-24032 > URL: https://issues.apache.org/jira/browse/HIVE-24032 > Project: Hive > Issue Type: Task >Reporter: Aasha Medhi >Assignee: Aasha Medhi >Priority: Major > Labels: pull-request-available > Attachments: HIVE-24032.01.patch > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24032) Remove hadoop shims dependency and use FileSystem Api directly from standalone metastore
[ https://issues.apache.org/jira/browse/HIVE-24032?focusedWorklogId=470533&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470533 ] ASF GitHub Bot logged work on HIVE-24032: - Author: ASF GitHub Bot Created on: 14/Aug/20 02:19 Start Date: 14/Aug/20 02:19 Worklog Time Spent: 10m Work Description: aasha commented on a change in pull request #1396: URL: https://github.com/apache/hive/pull/1396#discussion_r470373619 ## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java ## @@ -0,0 +1,65 @@
+    return (EncryptionFileUtils.getEncryptionZoneForPath(fullPath, conf) != null);
Review comment: It's a static method. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470533) Time Spent: 0.5h (was: 20m)
> Remove hadoop shims dependency and use FileSystem Api directly from
> standalone metastore
>
> Key: HIVE-24032
> URL: https://issues.apache.org/jira/browse/HIVE-24032
> Project: Hive
> Issue Type: Task
> Reporter: Aasha Medhi
> Assignee: Aasha Medhi
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-24032.01.patch
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
> -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
[ https://issues.apache.org/jira/browse/HIVE-23880?focusedWorklogId=470493&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470493 ] ASF GitHub Bot logged work on HIVE-23880: - Author: ASF GitHub Bot Created on: 14/Aug/20 00:02 Start Date: 14/Aug/20 00:02 Worklog Time Spent: 10m Work Description: mustafaiman commented on pull request #1280: URL: https://github.com/apache/hive/pull/1280#issuecomment-673767435 @abstractdog I am almost ok with this patch. However, I still don't understand how this integrates with `ProcessingModeHashAggregate`. Since there are multiple VectorAggregationBufferRows in hash mode, I think we should `finish` each of them as we process them. Otherwise, we pass to the next operator in the pipeline without completing the bloom filter. Also, since hash mode dynamically allocates and frees VectorAggregationBufferRows, these `finish`es should happen as we deallocate each of them, rather than only at the end of the operator. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470493) Time Spent: 7.5h (was: 7h 20m) > Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge > --- > > Key: HIVE-23880 > URL: https://issues.apache.org/jira/browse/HIVE-23880 > Project: Hive > Issue Type: Improvement > Reporter: László Bodor > Assignee: László Bodor > Priority: Major > Labels: pull-request-available > Attachments: lipwig-output3605036885489193068.svg > > Time Spent: 7.5h > Remaining Estimate: 0h > > Merging bloom filters in semijoin reduction can become the main bottleneck in case of a large number of source mapper tasks (~1000, Map 1 in the example below) and a large number of expected entries (50M) in the bloom filters.
> For example, 70M entries in a bloom filter lead to 436,465,696 bits, so merging 1263 bloom filters means running ~1263 * 436,465,696 bitwise OR operations, which is a very hot codepath, but can be parallelized. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
[ https://issues.apache.org/jira/browse/HIVE-23880?focusedWorklogId=470489&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470489 ] ASF GitHub Bot logged work on HIVE-23880: - Author: ASF GitHub Bot Created on: 13/Aug/20 23:56 Start Date: 13/Aug/20 23:56 Worklog Time Spent: 10m Work Description: mustafaiman commented on a change in pull request #1280: URL: https://github.com/apache/hive/pull/1280#discussion_r470310851 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java ## @@ -1126,6 +1137,7 @@ protected void initializeOp(Configuration hconf) throws HiveException { VectorAggregateExpression vecAggrExpr = null; try { vecAggrExpr = ctor.newInstance(vecAggrDesc); + vecAggrExpr.withConf(hconf); Review comment: I think making `VectorUDAFBloomFilterMerge` construction a special case and supplying the single int to that constructor is much cleaner. While trying to avoid that specialization, you are injecting the conf object to all the other classes. I specifically despise passing conf object around in Hive as it is abused so much in every part of the codebase. I'd prefer the other way but I won't insist on it. It is not a big deal for this patch. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470489) Time Spent: 7h 20m (was: 7h 10m) > Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge > --- > > Key: HIVE-23880 > URL: https://issues.apache.org/jira/browse/HIVE-23880 > Project: Hive > Issue Type: Improvement >Reporter: László Bodor >Assignee: László Bodor >Priority: Major > Labels: pull-request-available > Attachments: lipwig-output3605036885489193068.svg > > Time Spent: 7h 20m > Remaining Estimate: 0h > > Merging bloom filters in semijoin reduction can become the main bottleneck in > case of large number of source mapper tasks (~1000, Map 1 in below example) > and a large amount of expected entries (50M) in bloom filters.
> For example in TPCDS Q93:
> {code}
> select /*+ semi(store_returns, sr_item_sk, store_sales, 7000)*/ ss_customer_sk
>       ,sum(act_sales) sumsales
> from (select ss_item_sk
>             ,ss_ticket_number
>             ,ss_customer_sk
>             ,case when sr_return_quantity is not null then (ss_quantity-sr_return_quantity)*ss_sales_price
>                   else (ss_quantity*ss_sales_price) end act_sales
>       from store_sales left outer join store_returns on (sr_item_sk = ss_item_sk
>                                                          and sr_ticket_number = ss_ticket_number)
>           ,reason
>       where sr_reason_sk = r_reason_sk
>         and r_reason_desc = 'reason 66') t
> group by ss_customer_sk
> order by sumsales, ss_customer_sk
> limit 100;
> {code}
> On 10TB-30TB scale there is a chance that from 3-4 mins of query runtime 1-2 mins are spent with merging bloom filters (Reducer 2), as in:
> [^lipwig-output3605036885489193068.svg]
> {code}
> ----------------------------------------------------------------------------------
>  VERTICES      MODE     STATUS    TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------------
>  Map 3 .....   llap  SUCCEEDED       1          1        0        0       0       0
>  Map 1 .....   llap  SUCCEEDED    1263       1263        0        0       0       0
>  Reducer 2     llap    RUNNING       1          0        1        0       0       0
>  Map 4         llap    RUNNING    6154          0      207     5947       0       0
>  Reducer 5     llap     INITED      43          0        0       43       0       0
>  Reducer 6     llap     INITED       1          0        0        1       0       0
> ----------------------------------------------------------------------------------
> VERTICES: 02/06  16%  ELAPSED TIME: 149.98 s
> ----------------------------------------------------------------------------------
> {code}
> For example, 70M entries in a bloom filter lead to 436 465 696 bits, so merging 1263 bloom filters means running ~1263 * 436 465 696 bitwise OR operations, which is a very hot codepath, but can be parallelized. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-24039) Update jquery version to mitigate CVE-2020-11023
[ https://issues.apache.org/jira/browse/HIVE-24039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajkumar Singh updated HIVE-24039: -- Summary: Update jquery version to mitigate CVE-2020-11023 (was: update jquery version to mitigate CVE-2020-11023) > Update jquery version to mitigate CVE-2020-11023 > > > Key: HIVE-24039 > URL: https://issues.apache.org/jira/browse/HIVE-24039 > Project: Hive > Issue Type: Bug > Components: HiveServer2 >Reporter: Rajkumar Singh >Assignee: Rajkumar Singh >Priority: Major > > there is a known vulnerability in the jquery version used by hive; the plan with this jira is to upgrade to jquery version 3.5.0, where it has been fixed. more > details about the vulnerability can be found here: > https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-11023 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24032) Remove hadoop shims dependency and use FileSystem Api directly from standalone metastore
[ https://issues.apache.org/jira/browse/HIVE-24032?focusedWorklogId=470429&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470429 ] ASF GitHub Bot logged work on HIVE-24032: - Author: ASF GitHub Bot Created on: 13/Aug/20 21:00 Start Date: 13/Aug/20 21:00 Worklog Time Spent: 10m Work Description: pkumarsinha commented on a change in pull request #1396: URL: https://github.com/apache/hive/pull/1396#discussion_r470230243 ## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java ## @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.hadoop.hive.metastore.utils; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.hdfs.client.HdfsAdmin; +import org.apache.hadoop.hdfs.protocol.EncryptionZone; + +import java.io.IOException; +import java.net.URI; + +public class EncryptionFileUtils { + Review comment: Add a private constructor to avoid any accidental object creation ## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java ## @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.hadoop.hive.metastore.utils; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.hdfs.client.HdfsAdmin; +import org.apache.hadoop.hdfs.protocol.EncryptionZone; + +import java.io.IOException; +import java.net.URI; + +public class EncryptionFileUtils { Review comment: nit: Can we rename it to EncryptionZoneUtils ## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/EncryptionFileUtils.java ## @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.hive.metastore.utils; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.hdfs.client.HdfsAdmin; +import org.apache.hadoop.hdfs.protocol.EncryptionZone; + +import java.io.IOException; +import java.net.URI; + +public class EncryptionFileUtils { + + public static boolean isPathEncrypted(Path path, Configuration conf) throws IOException { +Path fullPath; +if (path.isAbsolute()) { + fullPath = path; +} else { + fullPath = path.getFileSystem(conf).makeQualified(path); +} +if(!"hdfs".equalsIgnoreCase(path.toUri().getScheme())) { + return
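The first review comment above asks for a private constructor so the utility class cannot be accidentally instantiated. A minimal sketch of that pattern, using the reviewer's suggested `EncryptionZoneUtils` name and a simplified scheme check mirroring the hunk's `"hdfs".equalsIgnoreCase(...)` guard (the real class also talks to `HdfsAdmin`, which is omitted here):

```java
public final class EncryptionZoneUtils {

    // Private constructor: this is a static utility holder, so instantiation
    // (even a reflective attempt) fails fast.
    private EncryptionZoneUtils() {
        throw new AssertionError("no instances");
    }

    // Encryption zones only exist on HDFS, so any other scheme short-circuits
    // before touching HdfsAdmin.
    public static boolean isHdfsScheme(String scheme) {
        return "hdfs".equalsIgnoreCase(scheme);
    }
}
```

Marking the class `final` alongside the private constructor also documents that it is not meant to be subclassed.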
[jira] [Assigned] (HIVE-24039) update jquery version to mitigate CVE-2020-11023
[ https://issues.apache.org/jira/browse/HIVE-24039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajkumar Singh reassigned HIVE-24039: - > update jquery version to mitigate CVE-2020-11023 > > > Key: HIVE-24039 > URL: https://issues.apache.org/jira/browse/HIVE-24039 > Project: Hive > Issue Type: Bug > Components: HiveServer2 >Reporter: Rajkumar Singh >Assignee: Rajkumar Singh >Priority: Major > > there is a known vulnerability in the jquery version used by hive; the plan with this jira is to upgrade to jquery version 3.5.0, where it has been fixed. more > details about the vulnerability can be found here: > https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-11023 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-23972) Add external client ID to LLAP external client
[ https://issues.apache.org/jira/browse/HIVE-23972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Dere updated HIVE-23972: -- Fix Version/s: 4.0.0 Resolution: Fixed Status: Resolved (was: Patch Available) Fix merged by [~prasanthj] > Add external client ID to LLAP external client > -- > > Key: HIVE-23972 > URL: https://issues.apache.org/jira/browse/HIVE-23972 > Project: Hive > Issue Type: Bug > Components: llap >Reporter: Jason Dere >Assignee: Jason Dere >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > There currently is not a good way to tell which currently running LLAP tasks > are from external LLAP clients, and also no good way to know which > application is submitting these external LLAP requests. > One possible solution for this is to add an option for the external LLAP > client to pass in an external client ID, which can get logged by HiveServer2 > during the getSplits request, as well as displayed from the LLAP > executorsStatus. > cc [~ShubhamChaurasia] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-23972) Add external client ID to LLAP external client
[ https://issues.apache.org/jira/browse/HIVE-23972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Dere updated HIVE-23972: -- Status: Patch Available (was: Open) > Add external client ID to LLAP external client > -- > > Key: HIVE-23972 > URL: https://issues.apache.org/jira/browse/HIVE-23972 > Project: Hive > Issue Type: Bug > Components: llap >Reporter: Jason Dere >Assignee: Jason Dere >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > There currently is not a good way to tell which currently running LLAP tasks > are from external LLAP clients, and also no good way to know which > application is submitting these external LLAP requests. > One possible solution for this is to add an option for the external LLAP > client to pass in an external client ID, which can get logged by HiveServer2 > during the getSplits request, as well as displayed from the LLAP > executorsStatus. > cc [~ShubhamChaurasia] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-23972) Add external client ID to LLAP external client
[ https://issues.apache.org/jira/browse/HIVE-23972?focusedWorklogId=470407&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470407 ] ASF GitHub Bot logged work on HIVE-23972: - Author: ASF GitHub Bot Created on: 13/Aug/20 20:31 Start Date: 13/Aug/20 20:31 Worklog Time Spent: 10m Work Description: prasanthj merged pull request #1350: URL: https://github.com/apache/hive/pull/1350 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470407) Time Spent: 40m (was: 0.5h) > Add external client ID to LLAP external client > -- > > Key: HIVE-23972 > URL: https://issues.apache.org/jira/browse/HIVE-23972 > Project: Hive > Issue Type: Bug > Components: llap >Reporter: Jason Dere >Assignee: Jason Dere >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > There currently is not a good way to tell which currently running LLAP tasks > are from external LLAP clients, and also no good way to know which > application is submitting these external LLAP requests. > One possible solution for this is to add an option for the external LLAP > client to pass in an external client ID, which can get logged by HiveServer2 > during the getSplits request, as well as displayed from the LLAP > executorsStatus. > cc [~ShubhamChaurasia] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HIVE-23965) Improve plan regression tests using TPCDS30TB metastore dump and custom configs
[ https://issues.apache.org/jira/browse/HIVE-23965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176406#comment-17176406 ] Jesus Camacho Rodriguez edited comment on HIVE-23965 at 8/13/20, 3:57 PM: -- +1 on removing old driver, since the new one fixes issues with the existing one. I do not think having the old one around adds much value and updating all those q files will be a pain. [~zabetak], [~kgyrtkirk], if this PR is ready to be merged, I think the removal can be done in a follow-up. was (Author: jcamachorodriguez): +1 on removing old driver, since it fixes issues with the existing one. I do not think having the old one around adds much value and updating all those q files will be a pain. [~zabetak], [~kgyrtkirk], if this PR is ready to be merged, I think the removal can be done in a follow-up. > Improve plan regression tests using TPCDS30TB metastore dump and custom > configs > --- > > Key: HIVE-23965 > URL: https://issues.apache.org/jira/browse/HIVE-23965 > Project: Hive > Issue Type: Improvement >Reporter: Stamatis Zampetakis >Assignee: Stamatis Zampetakis >Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > The existing regression tests (HIVE-12586) based on TPC-DS have certain > shortcomings: > The table statistics do not reflect cardinalities from a specific TPC-DS > scale factor (SF). Some tables are from a 30TB dataset, others from 200GB > dataset, and others from a 3GB dataset. This mix leads to plans that may > never appear when using an actual TPC-DS dataset. > The existing statistics do not contain information about partitions something > that can have a big impact on the resulting plans. > The existing regression tests rely on more or less on the default > configuration (hive-site.xml). In real-life scenarios though some of the > configurations differ and may impact the choices of the optimizer. 
> This issue aims to address the above shortcomings by using a curated > TPCDS30TB metastore dump along with some custom hive configurations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24015) Disable query-based compaction on MR execution engine
[ https://issues.apache.org/jira/browse/HIVE-24015?focusedWorklogId=470303&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470303 ] ASF GitHub Bot logged work on HIVE-24015: - Author: ASF GitHub Bot Created on: 13/Aug/20 15:57 Start Date: 13/Aug/20 15:57 Worklog Time Spent: 10m Work Description: klcopp merged pull request #1375: URL: https://github.com/apache/hive/pull/1375 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470303) Time Spent: 20m (was: 10m) > Disable query-based compaction on MR execution engine > - > > Key: HIVE-24015 > URL: https://issues.apache.org/jira/browse/HIVE-24015 > Project: Hive > Issue Type: Task >Reporter: Karen Coppage >Assignee: Karen Coppage >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Major compaction can be run when the execution engine is MR. This can cause > data loss a la HIVE-23703 (the fix for data loss when the execution engine is > MR was reverted by HIVE-23763). > Currently minor compaction can only be run when the execution engine is Tez, > otherwise it falls back to MR (non-query-based) compaction. We should extend > this functionality to major compaction as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24020) Automatic Compaction not working in existing partitions for Streaming Ingest with Dynamic Partition
[ https://issues.apache.org/jira/browse/HIVE-24020?focusedWorklogId=470224&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470224 ] ASF GitHub Bot logged work on HIVE-24020: - Author: ASF GitHub Bot Created on: 13/Aug/20 13:18 Start Date: 13/Aug/20 13:18 Worklog Time Spent: 10m Work Description: vpnvishv commented on pull request #1382: URL: https://github.com/apache/hive/pull/1382#issuecomment-673473084 @pvary @laszlopinter86 @klcopp Can you please review. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470224) Time Spent: 20m (was: 10m) > Automatic Compaction not working in existing partitions for Streaming Ingest > with Dynamic Partition > --- > > Key: HIVE-24020 > URL: https://issues.apache.org/jira/browse/HIVE-24020 > Project: Hive > Issue Type: Bug > Components: Streaming, Transactions >Affects Versions: 4.0.0, 3.1.2 >Reporter: Vipin Vishvkarma >Assignee: Vipin Vishvkarma >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > This issue happens when we try to do streaming ingest with dynamic partition on already existing partitions. I checked the code; we have the following check in the AbstractRecordWriter.
> {code:java}
> PartitionInfo partitionInfo = conn.createPartitionIfNotExists(partitionValues);
> // collect the newly added partitions. connection.commitTransaction() will
> // report the dynamically added partitions to TxnHandler
> if (!partitionInfo.isExists()) {
>   addedPartitions.add(partitionInfo.getName());
> } else {
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("Partition {} already exists for table {}", partitionInfo.getName(), fullyQualifiedTableName);
>   }
> }
> {code}
> Above *addedPartitions* is passed to *addDynamicPartitions* during TransactionBatch commit. So in case of already existing partitions, *addedPartitions* will be empty and *addDynamicPartitions* will not move entries from TXN_COMPONENTS to COMPLETED_TXN_COMPONENTS. This results in the Initiator not being able to trigger auto compaction.
> Another issue which has been observed is that we are not clearing *addedPartitions* on writer close, which results in information flowing across transactions.
> -- This message was sent by Atlassian Jira (v8.3.4#803005)
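A toy model of the writer logic quoted above (hypothetical names, not the real `AbstractRecordWriter`) shows why writes into pre-existing partitions are never reported on commit, and why the tracking set must be cleared between batches:

```java
import java.util.HashSet;
import java.util.Set;

public class PartitionTracker {

    // Partitions collected for the current transaction batch; reported to
    // the metastore on commit so compaction can see the activity.
    final Set<String> addedPartitions = new HashSet<>();

    // Stand-in for the metastore's partition list.
    final Set<String> existing = new HashSet<>();

    void write(String partition) {
        // models createPartitionIfNotExists(): true only when newly created
        boolean created = existing.add(partition);
        if (created) {
            addedPartitions.add(partition);
        }
        // else-branch: partition already existed, so it is NOT collected --
        // this is the gap the ticket describes.
    }

    Set<String> commit() {
        // models addDynamicPartitions(); clearing afterwards prevents state
        // from leaking into the next transaction batch
        Set<String> report = new HashSet<>(addedPartitions);
        addedPartitions.clear();
        return report;
    }
}
```

In this model a write into an already-existing partition produces an empty commit report, so nothing moves to COMPLETED_TXN_COMPONENTS and the Initiator never sees the partition as compaction-worthy.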
[jira] [Comment Edited] (HIVE-12679) Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf
[ https://issues.apache.org/jira/browse/HIVE-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177001#comment-17177001 ] Noritaka Sekiyama edited comment on HIVE-12679 at 8/13/20, 1:14 PM: I talked with Austin, and I submitted a new pull-request based on the patch which has been already uploaded to this issue. https://github.com/apache/hive/pull/1402 Hive committers - can you review the patch again? was (Author: moomindani): I talked with Austin, and I submitted a new pull-request based on the patch which has been already uploaded to this issue. Hive committers - can you review the patch again? > Allow users to be able to specify an implementation of IMetaStoreClient via > HiveConf > > > Key: HIVE-12679 > URL: https://issues.apache.org/jira/browse/HIVE-12679 > Project: Hive > Issue Type: Improvement > Components: Configuration, Metastore, Query Planning >Reporter: Austin Lee >Priority: Minor > Labels: metastore, pull-request-available > Attachments: HIVE-12679.1.patch, HIVE-12679.2.patch, > HIVE-12679.branch-1.2.patch, HIVE-12679.branch-2.3.patch, HIVE-12679.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Hi, > I would like to propose a change that would make it possible for users to > choose an implementation of IMetaStoreClient via HiveConf, i.e. > hive-site.xml. Currently, in Hive the choice is hard coded to be > SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive. There > is no other direct reference to SessionHiveMetaStoreClient other than the > hard coded class name in Hive.java and the QL component operates only on the > IMetaStoreClient interface so the change would be minimal and it would be > quite similar to how an implementation of RawStore is specified and loaded in > hive-metastore. One use case this change would serve would be one where a > user wishes to use an implementation of this interface without the dependency > on the Thrift server. 
> > Thank you, > Austin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-12679) Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf
[ https://issues.apache.org/jira/browse/HIVE-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177001#comment-17177001 ] Noritaka Sekiyama commented on HIVE-12679: -- I talked with Austin, and I submitted a new pull-request based on the patch which has been already uploaded to this issue. Hive committers - can you review the patch again? > Allow users to be able to specify an implementation of IMetaStoreClient via > HiveConf > > > Key: HIVE-12679 > URL: https://issues.apache.org/jira/browse/HIVE-12679 > Project: Hive > Issue Type: Improvement > Components: Configuration, Metastore, Query Planning >Reporter: Austin Lee >Priority: Minor > Labels: metastore, pull-request-available > Attachments: HIVE-12679.1.patch, HIVE-12679.2.patch, > HIVE-12679.branch-1.2.patch, HIVE-12679.branch-2.3.patch, HIVE-12679.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Hi, > I would like to propose a change that would make it possible for users to > choose an implementation of IMetaStoreClient via HiveConf, i.e. > hive-site.xml. Currently, in Hive the choice is hard coded to be > SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive. There > is no other direct reference to SessionHiveMetaStoreClient other than the > hard coded class name in Hive.java and the QL component operates only on the > IMetaStoreClient interface so the change would be minimal and it would be > quite similar to how an implementation of RawStore is specified and loaded in > hive-metastore. One use case this change would serve would be one where a > user wishes to use an implementation of this interface without the dependency > on the Thrift server. > > Thank you, > Austin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-12679) Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf
[ https://issues.apache.org/jira/browse/HIVE-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-12679: -- Labels: metastore pull-request-available (was: metastore) > Allow users to be able to specify an implementation of IMetaStoreClient via > HiveConf > > > Key: HIVE-12679 > URL: https://issues.apache.org/jira/browse/HIVE-12679 > Project: Hive > Issue Type: Improvement > Components: Configuration, Metastore, Query Planning >Reporter: Austin Lee >Priority: Minor > Labels: metastore, pull-request-available > Attachments: HIVE-12679.1.patch, HIVE-12679.2.patch, > HIVE-12679.branch-1.2.patch, HIVE-12679.branch-2.3.patch, HIVE-12679.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Hi, > I would like to propose a change that would make it possible for users to > choose an implementation of IMetaStoreClient via HiveConf, i.e. > hive-site.xml. Currently, in Hive the choice is hard coded to be > SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive. There > is no other direct reference to SessionHiveMetaStoreClient other than the > hard coded class name in Hive.java and the QL component operates only on the > IMetaStoreClient interface so the change would be minimal and it would be > quite similar to how an implementation of RawStore is specified and loaded in > hive-metastore. One use case this change would serve would be one where a > user wishes to use an implementation of this interface without the dependency > on the Thrift server. > > Thank you, > Austin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-12679) Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf
[ https://issues.apache.org/jira/browse/HIVE-12679?focusedWorklogId=470221&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470221 ] ASF GitHub Bot logged work on HIVE-12679: - Author: ASF GitHub Bot Created on: 13/Aug/20 13:11 Start Date: 13/Aug/20 13:11 Worklog Time Spent: 10m Work Description: moomindani opened a new pull request #1402: URL: https://github.com/apache/hive/pull/1402 ### What changes were proposed in this pull request? This change makes it possible for users to choose an implementation of IMetaStoreClient via HiveConf, i.e. hive-site.xml. This patch is to retry merging the patch of HIVE-12679. I talked with Austin, the original contributor for this issue, and he said it is okay to submit this patch again. https://issues.apache.org/jira/browse/HIVE-12679 ### Why are the changes needed? Currently, in Hive the choice is hard coded to be SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive. There is no other direct reference to SessionHiveMetaStoreClient other than the hard coded class name in Hive.java and the QL component operates only on the IMetaStoreClient interface so the change would be minimal and it would be quite similar to how an implementation of RawStore is specified and loaded in hive-metastore. One use case this change would serve would be one where a user wishes to use an implementation of this interface without the dependency on the Thrift server. ### Does this PR introduce _any_ user-facing change? Yes. User will be able to specify own implementation of IMetaStoreClient. (e.g. IMetaStoreClient for AWS Glue Data Catalog). ### How was this patch tested? Unit test and integration test with Glue Data Catalog client. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470221) Remaining Estimate: 0h Time Spent: 10m > Allow users to be able to specify an implementation of IMetaStoreClient via > HiveConf > > > Key: HIVE-12679 > URL: https://issues.apache.org/jira/browse/HIVE-12679 > Project: Hive > Issue Type: Improvement > Components: Configuration, Metastore, Query Planning >Reporter: Austin Lee >Priority: Minor > Labels: metastore > Attachments: HIVE-12679.1.patch, HIVE-12679.2.patch, > HIVE-12679.branch-1.2.patch, HIVE-12679.branch-2.3.patch, HIVE-12679.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Hi, > I would like to propose a change that would make it possible for users to > choose an implementation of IMetaStoreClient via HiveConf, i.e. > hive-site.xml. Currently, in Hive the choice is hard coded to be > SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive. There > is no other direct reference to SessionHiveMetaStoreClient other than the > hard coded class name in Hive.java and the QL component operates only on the > IMetaStoreClient interface so the change would be minimal and it would be > quite similar to how an implementation of RawStore is specified and loaded in > hive-metastore. One use case this change would serve would be one where a > user wishes to use an implementation of this interface without the dependency > on the Thrift server. > > Thank you, > Austin -- This message was sent by Atlassian Jira (v8.3.4#803005)
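The mechanism the PR describes, loading an `IMetaStoreClient` implementation named in configuration the way `RawStore` is wired, boils down to reflective construction against an interface. A hedged sketch with stand-in types (these are illustrative names, not Hive's actual classes):

```java
import java.lang.reflect.Constructor;

public class ClientLoader {

    // Stand-in for IMetaStoreClient: callers program only against this.
    interface MetaClient {
        String backend();
    }

    // Stand-in default implementation, analogous to SessionHiveMetaStoreClient.
    static class ThriftClient implements MetaClient {
        public String backend() { return "thrift"; }
    }

    // Resolve the implementation class named in config, falling back to the
    // default when the key is unset -- mirroring how RawStore is loaded.
    static MetaClient load(String configuredClassName) throws ReflectiveOperationException {
        String name = (configuredClassName != null)
                ? configuredClassName
                : ThriftClient.class.getName();
        Class<?> cls = Class.forName(name);
        Constructor<?> ctor = cls.getDeclaredConstructor();
        return (MetaClient) ctor.newInstance();
    }
}
```

Because callers only see the interface, an alternative backend (for example a Glue Data Catalog client, as the PR mentions) can be swapped in purely via configuration, with no dependency on the Thrift server.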
[jira] [Updated] (HIVE-24038) Exception after changing a column's data type in a Hive table
[ https://issues.apache.org/jira/browse/HIVE-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zuo Junhao updated HIVE-24038: -- Description: A table has two columns (user_role, label_type) that were defined as int when the (managed) table was created. The data inserted into the table had these two columns as strings; because of Hive's schema-on-read validation this raised no error, but a subsequent select returned null for those columns. I then ran alter table xxx change column xxx xxx string and inserted the data again; now select fails with the following exception: Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable > Exception after changing a column's data type in a Hive table > --- > > Key: HIVE-24038 > URL: https://issues.apache.org/jira/browse/HIVE-24038 > Project: Hive > Issue Type: Bug >Reporter: Zuo Junhao >Priority: Major > > A table has two columns (user_role, label_type) that were defined as int when the (managed) table was created. The data inserted into the table had these two columns as strings; because of Hive's schema-on-read validation this raised no error, but a subsequent select returned null for those columns. I then ran alter table xxx change column xxx xxx string and inserted the data again; now select fails with the following exception: Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-23966) Minor query-based compaction always results in delta dirs with minWriteId=1
[ https://issues.apache.org/jira/browse/HIVE-23966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karen Coppage updated HIVE-23966: - Fix Version/s: 4.0.0 > Minor query-based compaction always results in delta dirs with minWriteId=1 > --- > > Key: HIVE-23966 > URL: https://issues.apache.org/jira/browse/HIVE-23966 > Project: Hive > Issue Type: Bug >Reporter: Karen Coppage >Assignee: Karen Coppage >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Minor compaction after major/IOW will result in directories that look like: > * base_z_v > * delta_1_y_v > * delete_delta_1_y_v > Should be: > * base_z_v > * delta_(z+1)_y_v > * delete_delta_(z+1)_y_v > Issues this causes: > For example, after running insert overwrite, then minor compaction, major > compaction will fail with the following error: > {noformat} > Found 2 equal splits: OrcSplit > [hdfs://.../warehouse/tablespace/managed/hive/bucketed/delta_001_006_v0001058/bucket_4, > start=0, length=722, isOriginal=false, fileLength=722, hasFooter=false, > hasBase=true, deltas=1] and OrcSplit > [hdfs://.../warehouse/tablespace/managed/hive/bucketed/base_001/bucket_4_0, > start=0, length=811, isOriginal=false, fileLength=811, hasFooter=false, > hasBase=true, deltas=1] > {noformat} > or it can fail with: > {noformat} > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Wrong sort order > of Acid rows detected for the rows: > org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@201be62b > an > d > org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@5f97bd3f > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
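The fix described above changes the compacted delta's minimum write ID from 1 to z+1, where z is the base's write ID, so the delta range no longer overlaps the base. A small sketch of the expected naming; the `delta_min_max_vVisibility` shape is taken from the error message above, and the 7-digit zero-padding is an assumption for illustration:

```java
public class CompactedDeltaName {

    // After minor compaction above base_z, the new delta should cover
    // write IDs (z+1)..y rather than 1..y, so its name starts past the base.
    static String minorDeltaName(long baseWriteId, long maxWriteId, long visibilityTxnId) {
        long minWriteId = baseWriteId + 1;
        return String.format("delta_%07d_%07d_v%07d", minWriteId, maxWriteId, visibilityTxnId);
    }
}
```

With a base at write ID 1, a compacted delta covering write IDs up to 6 should start at 2; a name starting at 1 is what made later major compaction see two "equal" splits covering the same write IDs.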
[jira] [Updated] (HIVE-24001) Don't cache MapWork in tez/ObjectCache during query-based compaction
[ https://issues.apache.org/jira/browse/HIVE-24001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karen Coppage updated HIVE-24001: - Fix Version/s: 4.0.0 > Don't cache MapWork in tez/ObjectCache during query-based compaction > > > Key: HIVE-24001 > URL: https://issues.apache.org/jira/browse/HIVE-24001 > Project: Hive > Issue Type: Bug >Reporter: Karen Coppage >Assignee: Karen Coppage >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Query-based major compaction can fail intermittently with the following issue: > {code:java} > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: One writer is > supposed to handle only one bucket. We saw these 2 different buckets: 1 and 6 > at > org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder.evaluate(GenericUDFValidateAcidSortOrder.java:77) > {code} > This is consistently preceded in the application log with: > {code:java} > [INFO] [TezChild] |tez.ObjectCache|: Found > hive_20200804185133_f04cca69-fa30-4f1b-a5fe-80fc2d749f48_Map 1__MAP_PLAN__ in > cache with value: org.apache.hadoop.hive.ql.plan.MapWork@74652101 > {code} > Alternatively, when MapRecordProcessor doesn't find mapWork in > tez/ObjectCache (but instead caches mapWork), major compaction succeeds. > The failure happens because, if MapWork is reused, > GenericUDFValidateAcidSortOrder (which is called during compaction) is also > reused on splits belonging to two different buckets, which produces an error. > Solution is to avoid storing MapWork in the ObjectCache during query-based > compaction. -- This message was sent by Atlassian Jira (v8.3.4#803005)
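The failure mode described above, a stateful expression surviving in the ObjectCache across splits, can be illustrated with a self-contained sketch. BucketValidator below is a hypothetical miniature of GenericUDFValidateAcidSortOrder's bucket check, not Hive's actual class; it only shows why a fresh instance per split succeeds where a cached, reused instance fails.

```java
// Hypothetical miniature of a per-split validator: it remembers the first
// bucket it sees and rejects any other, so its state must not outlive a split.
class BucketValidator {
    private Integer seenBucket; // state that must be scoped to one split

    void validate(int bucket) {
        if (seenBucket == null) {
            seenBucket = bucket;
        } else if (seenBucket != bucket) {
            throw new IllegalStateException(
                "One writer is supposed to handle only one bucket. We saw these 2 different buckets: "
                + seenBucket + " and " + bucket);
        }
    }
}

class CacheReuseDemo {
    public static void main(String[] args) {
        // Fresh instance per split: fine.
        new BucketValidator().validate(1);
        new BucketValidator().validate(6);

        // Reused (cached) instance across splits of different buckets: fails,
        // which is the compaction error quoted above.
        BucketValidator cached = new BucketValidator();
        cached.validate(1);
        try {
            cached.validate(6);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```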
[jira] [Resolved] (HIVE-24024) Improve logging around CompactionTxnHandler
[ https://issues.apache.org/jira/browse/HIVE-24024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karen Coppage resolved HIVE-24024. -- Fix Version/s: 4.0.0 Resolution: Fixed Submitted to master. Thanks [~lpinter] for the review! > Improve logging around CompactionTxnHandler > --- > > Key: HIVE-24024 > URL: https://issues.apache.org/jira/browse/HIVE-24024 > Project: Hive > Issue Type: Improvement >Reporter: Karen Coppage >Assignee: Karen Coppage >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > CompactionTxnHandler often doesn't log the preparedStatement parameters, > which is really painful when compaction isn't working the way it should. Also > expand logging around compaction Cleaner, Initiator, Worker. And some > formatting cleanup. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24024) Improve logging around CompactionTxnHandler
[ https://issues.apache.org/jira/browse/HIVE-24024?focusedWorklogId=470194&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470194 ] ASF GitHub Bot logged work on HIVE-24024: - Author: ASF GitHub Bot Created on: 13/Aug/20 11:59 Start Date: 13/Aug/20 11:59 Worklog Time Spent: 10m Work Description: klcopp merged pull request #1389: URL: https://github.com/apache/hive/pull/1389 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470194) Time Spent: 20m (was: 10m) > Improve logging around CompactionTxnHandler > --- > > Key: HIVE-24024 > URL: https://issues.apache.org/jira/browse/HIVE-24024 > Project: Hive > Issue Type: Improvement >Reporter: Karen Coppage >Assignee: Karen Coppage >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > CompactionTxnHandler often doesn't log the preparedStatement parameters, > which is really painful when compaction isn't working the way it should. Also > expand logging around compaction Cleaner, Initiator, Worker. And some > formatting cleanup. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-23993) Handle irrecoverable errors
[ https://issues.apache.org/jira/browse/HIVE-23993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anishek Agarwal updated HIVE-23993: --- Resolution: Fixed Status: Resolved (was: Patch Available) Committed to master, Thanks for the patch [~aasha] and review [~pkumarsinha] > Handle irrecoverable errors > --- > > Key: HIVE-23993 > URL: https://issues.apache.org/jira/browse/HIVE-23993 > Project: Hive > Issue Type: Task >Reporter: Aasha Medhi >Assignee: Aasha Medhi >Priority: Major > Labels: pull-request-available > Attachments: HIVE-23993.01.patch, HIVE-23993.02.patch, > HIVE-23993.03.patch, HIVE-23993.04.patch, HIVE-23993.05.patch, > HIVE-23993.06.patch, HIVE-23993.07.patch, HIVE-23993.08.patch, Retry Logic > for Replication.pdf > > Time Spent: 2h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-24037: -- Labels: pull-request-available (was: ) > Parallelize hash table constructions in map joins > - > > Key: HIVE-24037 > URL: https://issues.apache.org/jira/browse/HIVE-24037 > Project: Hive > Issue Type: Improvement >Reporter: Ramesh Kumar Thangarajan >Assignee: Ramesh Kumar Thangarajan >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Parallelize hash table constructions in map joins -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?focusedWorklogId=470163&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470163 ] ASF GitHub Bot logged work on HIVE-24037: - Author: ASF GitHub Bot Created on: 13/Aug/20 09:46 Start Date: 13/Aug/20 09:46 Worklog Time Spent: 10m Work Description: ramesh0201 opened a new pull request #1401: URL: https://github.com/apache/hive/pull/1401 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470163) Remaining Estimate: 0h Time Spent: 10m > Parallelize hash table constructions in map joins > - > > Key: HIVE-24037 > URL: https://issues.apache.org/jira/browse/HIVE-24037 > Project: Hive > Issue Type: Improvement >Reporter: Ramesh Kumar Thangarajan >Assignee: Ramesh Kumar Thangarajan >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Parallelize hash table constructions in map joins -- This message was sent by Atlassian Jira (v8.3.4#803005)
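As a general illustration of the idea in the title, and not Hive's actual implementation, here is a minimal Java sketch: partition the build-side rows by key hash and construct one hash table per partition on its own thread. Because each task only inserts keys that hash to its own partition, the per-partition tables need no locking; the probe side picks a table with the same hash. All names are invented for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelHashBuild {
    // Build `partitions` hash tables in parallel over the small-table rows.
    static List<Map<Integer, String>> build(List<Map.Entry<Integer, String>> rows,
                                            int partitions) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            List<Future<Map<Integer, String>>> futures = new ArrayList<>();
            for (int p = 0; p < partitions; p++) {
                final int part = p;
                futures.add(pool.submit(() -> {
                    Map<Integer, String> table = new HashMap<>();
                    for (Map.Entry<Integer, String> e : rows) {
                        // each task inserts only its own partition's keys,
                        // so no two threads ever touch the same table
                        if (Math.floorMod(e.getKey().hashCode(), partitions) == part) {
                            table.put(e.getKey(), e.getValue());
                        }
                    }
                    return table;
                }));
            }
            List<Map<Integer, String>> tables = new ArrayList<>();
            for (Future<Map<Integer, String>> f : futures) {
                tables.add(f.get()); // wait for every partition to finish
            }
            return tables;
        } finally {
            pool.shutdown();
        }
    }

    // Probe uses the same hash to find the one table that can hold the key.
    static String probe(List<Map<Integer, String>> tables, int key) {
        return tables.get(Math.floorMod(Integer.hashCode(key), tables.size())).get(key);
    }

    public static void main(String[] args) throws Exception {
        List<Map.Entry<Integer, String>> rows = List.of(
            new AbstractMap.SimpleEntry<>(1, "a"),
            new AbstractMap.SimpleEntry<>(2, "b"),
            new AbstractMap.SimpleEntry<>(7, "c"));
        List<Map<Integer, String>> tables = build(rows, 4);
        System.out.println(probe(tables, 7)); // prints "c"
    }
}
```

The trade-off of this scheme is one full scan of the input per thread; an alternative is to split the input rows instead and use concurrent tables.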
[jira] [Resolved] (HIVE-23981) Use task counter enum to get the approximate counter value
[ https://issues.apache.org/jira/browse/HIVE-23981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mahesh kumar behera resolved HIVE-23981. Resolution: Fixed > Use task counter enum to get the approximate counter value > -- > > Key: HIVE-23981 > URL: https://issues.apache.org/jira/browse/HIVE-23981 > Project: Hive > Issue Type: Bug >Reporter: mahesh kumar behera >Assignee: mahesh kumar behera >Priority: Major > Labels: pull-request-available > > The value for APPROXIMATE_INPUT_RECORDS should be obtained using the enum > name instead of static string. Once Tez release is done with the specific > information we should change it to > org.apache.tez.common.counters.TaskCounter.APPROXIMATE_INPUT_RECORDS. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HIVE-24037) Parallelize hash table constructions in map joins
[ https://issues.apache.org/jira/browse/HIVE-24037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramesh Kumar Thangarajan reassigned HIVE-24037: --- > Parallelize hash table constructions in map joins > - > > Key: HIVE-24037 > URL: https://issues.apache.org/jira/browse/HIVE-24037 > Project: Hive > Issue Type: Improvement >Reporter: Ramesh Kumar Thangarajan >Assignee: Ramesh Kumar Thangarajan >Priority: Major > > Parallelize hash table constructions in map joins -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-23938) LLAP: JDK11 - some GC log file rotation related jvm arguments cannot be used anymore
[ https://issues.apache.org/jira/browse/HIVE-23938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] László Bodor updated HIVE-23938: Attachment: gc_2020-07-29-12.jdk8.log > LLAP: JDK11 - some GC log file rotation related jvm arguments cannot be used > anymore > > > Key: HIVE-23938 > URL: https://issues.apache.org/jira/browse/HIVE-23938 > Project: Hive > Issue Type: Improvement >Reporter: László Bodor >Assignee: László Bodor >Priority: Major > Attachments: gc_2020-07-27-13.log, gc_2020-07-29-12.jdk8.log > > > https://github.com/apache/hive/blob/master/llap-server/bin/runLlapDaemon.sh#L55 > {code} > JAVA_OPTS_BASE="-server -Djava.net.preferIPv4Stack=true -XX:+UseNUMA > -XX:+PrintGCDetails -verbose:gc -XX:+UseGCLogFileRotation > -XX:NumberOfGCLogFiles=4 -XX:GCLogFileSize=100M -XX:+PrintGCDateStamps" > {code} > on JDK11 I got something like: > {code} > + exec /usr/lib/jvm/jre-11-openjdk/bin/java -Dproc_llapdaemon -Xms32000m > -Xmx64000m -Dhttp.maxConnections=17 -XX:+UseG1GC -XX:+ResizeTLAB -XX:+UseNUMA > -XX:+AggressiveOpts -XX:MetaspaceSize=1024m > -XX:InitiatingHeapOccupancyPercent=80 -XX:MaxGCPauseMillis=200 > -XX:+PreserveFramePointer -XX:AllocatePrefetchStyle=2 > -Dhttp.maxConnections=10 -Dasync.profiler.home=/grid/0/async-profiler -server > -Djava.net.preferIPv4Stack=true -XX:+UseNUMA -XX:+PrintGCDetails -verbose:gc > -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=4 -XX:GCLogFileSize=100M > -XX:+PrintGCDateStamps > -Xloggc:/grid/2/yarn/container-logs/application_1595375468459_0113/container_e26_1595375468459_0113_01_09/gc_2020-07-27-12.log > > ... > org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon > OpenJDK 64-Bit Server VM warning: Option AggressiveOpts was deprecated in > version 11.0 and will likely be removed in a future release. > Unrecognized VM option 'UseGCLogFileRotation' > Error: Could not create the Java Virtual Machine. > Error: A fatal exception has occurred. Program will exit. 
> {code} > These are not valid in JDK11: > {code} > -XX:+UseGCLogFileRotation > -XX:NumberOfGCLogFiles > -XX:GCLogFileSize > -XX:+PrintGCTimeStamps > -XX:+PrintGCDateStamps > {code} > Instead something like: > {code} > -Xlog:gc*,safepoint:gc.log:time,uptime:filecount=4,filesize=100M > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
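The substitution above can be wrapped in a small version check so one launcher script works on both JDK 8 and JDK 11. This is a sketch under the ticket's suggested -Xlog spelling; the function name is made up and this is not code from any shipped Hive script.

```shell
# Pick GC logging flags compatible with the running JDK major version
# (sketch; pick_gc_opts is a hypothetical helper, not from runLlapDaemon.sh).
pick_gc_opts() {
  local jdk_major="$1"
  if [ "$jdk_major" -ge 9 ]; then
    # JDK 9+ unified logging replaces the removed 8-era rotation flags
    echo "-Xlog:gc*,safepoint:gc.log:time,uptime:filecount=4,filesize=100M"
  else
    echo "-XX:+PrintGCDetails -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=4 -XX:GCLogFileSize=100M -XX:+PrintGCDateStamps"
  fi
}

pick_gc_opts 11
pick_gc_opts 8
```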
[jira] [Commented] (HIVE-20593) Load Data for partitioned ACID tables fails with bucketId out of range: -1
[ https://issues.apache.org/jira/browse/HIVE-20593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176849#comment-17176849 ] Bernard commented on HIVE-20593: Hi, Is there a workaround for this one without updating Hive? We've tried recreating the table but we're still getting this error. Thanks, Bernard > Load Data for partitioned ACID tables fails with bucketId out of range: -1 > -- > > Key: HIVE-20593 > URL: https://issues.apache.org/jira/browse/HIVE-20593 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.1.0 >Reporter: Deepak Jaiswal >Assignee: Deepak Jaiswal >Priority: Major > Fix For: 4.0.0, 3.2.0, 3.1.2 > > Attachments: HIVE-20593.1.patch, HIVE-20593.2.patch, > HIVE-20593.3.patch > > > Load data for ACID tables is failing to load ORC files when it is converted > to an IAS job. > > The tempTblObj is inherited from the target table. However, the only table > property which needs to be inherited is the bucketing version. Properties like > transactional etc. should be ignored. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization
[ https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Syed Shameerur Rahman updated HIVE-18284: - Environment: (was: EMR) > NPE when inserting data with 'distribute by' clause with dynpart sort > optimization > -- > > Key: HIVE-18284 > URL: https://issues.apache.org/jira/browse/HIVE-18284 > Project: Hive > Issue Type: Bug > Components: Query Processor >Affects Versions: 2.3.1, 2.3.2 >Reporter: Aki Tanaka >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > A Null Pointer Exception occurs when inserting data with 'distribute by' > clause. The following snippet query reproduces this issue: > *(non-vectorized , non-llap mode)* > {code:java} > create table table1 (col1 string, datekey int); > insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1); > create table table2 (col1 string) partitioned by (datekey int); > set hive.vectorized.execution.enabled=false; > set hive.optimize.sort.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nonstrict; > insert into table table2 > PARTITION(datekey) > select col1, > datekey > from table1 > distribute by datekey ; > {code} > I could run the insert query without the error if I remove Distribute By or > use Cluster By clause. > It seems that the issue happens because Distribute By does not guarantee > clustering or sorting properties on the distributed keys. > FileSinkOperator removes the previous fsp. FileSinkOperator will remove the > previous fsp which might be re-used when we use Distribute By. > https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972 > The following stack trace is logged. 
> {code:java} > Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, > diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_00, > diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( > failure ) : > attempt_1513111717879_0056_1_01_00_0:java.lang.RuntimeException: > org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while > processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}} > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime > Error while processing row (tag=0) > {"key":{},"value":{"_col0":"ROW3","_col1":1}} > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365) > at > 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250) > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185) > ... 14 more > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762) > at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897) > at > org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95) > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356) > ... 17 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005
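The root cause described in the report, a sink that tears down the previous partition's writer and then sees that partition key again because Distribute By does not sort within the reducer, can be modeled in a few lines. GroupedSink below is an illustrative toy, not FileSinkOperator: it assumes each key's rows arrive contiguously, which sorted (Cluster By) input guarantees and distribute-by-only input does not.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a partition-file sink that assumes grouped input
// (hypothetical class; mirrors the failure shape, not Hive's code).
class GroupedSink {
    private final Map<String, StringBuilder> closed = new HashMap<>();
    private String currentKey;
    private StringBuilder currentWriter;

    void process(String key, String row) {
        if (!key.equals(currentKey)) {
            if (currentWriter != null) {
                // tear down the previous partition's writer -- safe only if
                // that partition key never reappears later in the stream
                closed.put(currentKey, currentWriter);
                currentWriter = null;
            }
            currentKey = key;
            if (closed.containsKey(key)) {
                // key came back after its writer was removed; in Hive this
                // surfaces as the NPE in FileSinkOperator.process
                throw new NullPointerException("writer for partition " + key + " already closed");
            }
            currentWriter = new StringBuilder();
        }
        currentWriter.append(row).append('\n');
    }
}

class DistributeByDemo {
    public static void main(String[] args) {
        // Grouped (sorted) order works: datekey=1, datekey=1, datekey=2.
        GroupedSink ok = new GroupedSink();
        ok.process("datekey=1", "ROW1");
        ok.process("datekey=1", "ROW3");
        ok.process("datekey=2", "ROW2");

        // Interleaved order, as Distribute By alone may deliver, fails.
        GroupedSink bad = new GroupedSink();
        bad.process("datekey=1", "ROW1");
        bad.process("datekey=2", "ROW2");
        try {
            bad.process("datekey=1", "ROW3");
        } catch (NullPointerException e) {
            System.out.println(e.getMessage());
        }
    }
}
```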
[jira] [Updated] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization
[ https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Syed Shameerur Rahman updated HIVE-18284: - Description: A Null Pointer Exception occurs when inserting data with 'distribute by' clause. The following snippet query reproduces this issue: *(non-vectorized , non-llap mode)* {code:java} create table table1 (col1 string, datekey int); insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1); create table table2 (col1 string) partitioned by (datekey int); set hive.vectorized.execution.enabled=false; set hive.optimize.sort.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; insert into table table2 PARTITION(datekey) select col1, datekey from table1 distribute by datekey ; {code} I could run the insert query without the error if I remove Distribute By or use Cluster By clause. It seems that the issue happens because Distribute By does not guarantee clustering or sorting properties on the distributed keys. FileSinkOperator removes the previous fsp. FileSinkOperator will remove the previous fsp which might be re-used when we use Distribute By. https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972 The following stack trace is logged. 
{code:java} Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_00, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1513111717879_0056_1_01_00_0:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}} at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}} at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250) at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185) ... 14 more Caused by: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897) at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356) ... 17 more {code} was: A Null Pointer Exception occurs when inserting data with 'distribute by' clause. The following snippet query reproduces this issue: {code:java} create table table1 (col1 string, datekey int); insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1); create table table2 (col1 string) partitioned by (datekey int); set hive.vectorized.execution.enabled=false; set hive.optimize.sort.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; insert into table table2 PARTITION(datekey) select col1, datekey from table1 distribute by datekey ; {code} I could run the insert query without the error if I remove Distribute By or use Cluster By clause. It seems that the issue happens because Distribute By does not guarantee clustering or sorting properties on the distributed keys.
[jira] [Commented] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization
[ https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176832#comment-17176832 ] Syed Shameerur Rahman commented on HIVE-18284: -- *PR:* https://github.com/apache/hive/pull/1400 [~jcamachorodriguez] can you please review? Thanks! > NPE when inserting data with 'distribute by' clause with dynpart sort > optimization > -- > > Key: HIVE-18284 > URL: https://issues.apache.org/jira/browse/HIVE-18284 > Project: Hive > Issue Type: Bug > Components: Query Processor >Affects Versions: 2.3.1, 2.3.2 > Environment: EMR >Reporter: Aki Tanaka >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > A Null Pointer Exception occurs when inserting data with 'distribute by' > clause. The following snippet query reproduces this issue: > {code:java} > create table table1 (col1 string, datekey int); > insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1); > create table table2 (col1 string) partitioned by (datekey int); > set hive.exec.dynamic.partition.mode=nonstrict; > insert into table table2 > PARTITION(datekey) > select col1, > datekey > from table1 > distribute by datekey ; > {code} > I could run the insert query without the error if I remove Distribute By or > use Cluster By clause. > It seems that the issue happens because Distribute By does not guarantee > clustering or sorting properties on the distributed keys. > FileSinkOperator removes the previous fsp. FileSinkOperator will remove the > previous fsp which might be re-used when we use Distribute By. > https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972 > The following stack trace is logged. 
> {code:java} > Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, > diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_00, > diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( > failure ) : > attempt_1513111717879_0056_1_01_00_0:java.lang.RuntimeException: > org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while > processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}} > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime > Error while processing row (tag=0) > {"key":{},"value":{"_col0":"ROW3","_col1":1}} > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365) > at > 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250) > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185) > ... 14 more > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762) > at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897) > at > org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95) > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356) > ... 17 more > {code} -- This message was sent by Atlassian
[jira] [Updated] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization
[ https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Syed Shameerur Rahman updated HIVE-18284: - Description: A Null Pointer Exception occurs when inserting data with 'distribute by' clause. The following snippet query reproduces this issue: {code:java} create table table1 (col1 string, datekey int); insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1); create table table2 (col1 string) partitioned by (datekey int); set hive.vectorized.execution.enabled=false; set hive.optimize.sort.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; insert into table table2 PARTITION(datekey) select col1, datekey from table1 distribute by datekey ; {code} I could run the insert query without the error if I remove Distribute By or use Cluster By clause. It seems that the issue happens because Distribute By does not guarantee clustering or sorting properties on the distributed keys. FileSinkOperator removes the previous fsp. FileSinkOperator will remove the previous fsp which might be re-used when we use Distribute By. https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972 The following stack trace is logged. 
{code:java} Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_00, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1513111717879_0056_1_01_00_0:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}} at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}} at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250) at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185) ... 14 more Caused by: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897) at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356) ... 17 more {code} was: A Null Pointer Exception occurs when inserting data with 'distribute by' clause. The following snippet query reproduces this issue: {code:java} create table table1 (col1 string, datekey int); insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1); create table table2 (col1 string) partitioned by (datekey int); set hive.exec.dynamic.partition.mode=nonstrict; insert into table table2 PARTITION(datekey) select col1, datekey from table1 distribute by datekey ; {code} I could run the insert query without the error if I remove Distribute By or use Cluster By clause. It seems that the issue happens because Distribute By does not guarantee clustering or sorting properties on the distributed keys. FileSinkOperator removes the previous fsp. FileSinkOperator will remove the previous fsp which might be re-used when we use Distribute By.
[jira] [Updated] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization
[ https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Syed Shameerur Rahman updated HIVE-18284: - Summary: NPE when inserting data with 'distribute by' clause with dynpart sort optimization (was: NPE when inserting data with 'distribute by' clause) > NPE when inserting data with 'distribute by' clause with dynpart sort > optimization > -- > > Key: HIVE-18284 > URL: https://issues.apache.org/jira/browse/HIVE-18284 > Project: Hive > Issue Type: Bug > Components: Query Processor >Affects Versions: 2.3.1, 2.3.2 > Environment: EMR >Reporter: Aki Tanaka >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > A Null Pointer Exception occurs when inserting data with 'distribute by' > clause. The following snippet query reproduces this issue: > {code:java} > create table table1 (col1 string, datekey int); > insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1); > create table table2 (col1 string) partitioned by (datekey int); > set hive.exec.dynamic.partition.mode=nonstrict; > insert into table table2 > PARTITION(datekey) > select col1, > datekey > from table1 > distribute by datekey ; > {code} > I could run the insert query without the error if I remove Distribute By or > use Cluster By clause. > It seems that the issue happens because Distribute By does not guarantee > clustering or sorting properties on the distributed keys. > FileSinkOperator removes the previous fsp. FileSinkOperator will remove the > previous fsp which might be re-used when we use Distribute By. > https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972 > The following stack trace is logged. 
> {code:java} > Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, > diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_00, > diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( > failure ) : > attempt_1513111717879_0056_1_01_00_0:java.lang.RuntimeException: > org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while > processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}} > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime > Error while processing row (tag=0) > {"key":{},"value":{"_col0":"ROW3","_col1":1}} > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365) > at > 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250) > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185) > ... 14 more > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762) > at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897) > at > org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95) > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356) > ... 17 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-18284) NPE when inserting data with 'distribute by' clause
[ https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-18284: -- Labels: pull-request-available (was: ) > NPE when inserting data with 'distribute by' clause > --- > > Key: HIVE-18284 > URL: https://issues.apache.org/jira/browse/HIVE-18284 > Project: Hive > Issue Type: Bug > Components: Query Processor >Affects Versions: 2.3.1, 2.3.2 > Environment: EMR >Reporter: Aki Tanaka >Assignee: Lynch Lee >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HIVE-18284) NPE when inserting data with 'distribute by' clause
[ https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Syed Shameerur Rahman reassigned HIVE-18284: Assignee: Syed Shameerur Rahman (was: Lynch Lee) > NPE when inserting data with 'distribute by' clause > --- > > Key: HIVE-18284 > URL: https://issues.apache.org/jira/browse/HIVE-18284 > Project: Hive > Issue Type: Bug > Components: Query Processor >Affects Versions: 2.3.1, 2.3.2 > Environment: EMR >Reporter: Aki Tanaka >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-18284) NPE when inserting data with 'distribute by' clause
[ https://issues.apache.org/jira/browse/HIVE-18284?focusedWorklogId=470133&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470133 ] ASF GitHub Bot logged work on HIVE-18284: - Author: ASF GitHub Bot Created on: 13/Aug/20 08:18 Start Date: 13/Aug/20 08:18 Worklog Time Spent: 10m Work Description: shameersss1 opened a new pull request #1400: URL: https://github.com/apache/hive/pull/1400 … ### What changes were proposed in this pull request? When hive.optimize.sort.dynamic.partition=true, we expect the keys to be sorted on the reducer side so that reducers can keep only one record writer open at any time, thereby reducing the memory pressure on the reducers (HIVE-6455). But in the case of non-vectorized, non-LLAP execution the keys are not sorted and the query fails with an NPE. Refer: https://issues.apache.org/jira/browse/HIVE-18284?focusedCommentId=17173124&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17173124 Caused by: https://issues.apache.org/jira/browse/HIVE-13260 ### Why are the changes needed? Changes are required in ReduceSinkDeduplication to properly merge the child reduce sink operator and the parent reduce sink operator. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a qtest This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470133) Remaining Estimate: 0h Time Spent: 10m > NPE when inserting data with 'distribute by' clause > --- > > Key: HIVE-18284 > URL: https://issues.apache.org/jira/browse/HIVE-18284 > Project: Hive > Issue Type: Bug > Components: Query Processor >Affects Versions: 2.3.1, 2.3.2 > Environment: EMR >Reporter: Aki Tanaka >Assignee: Lynch Lee >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h
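The sorted-keys expectation described in the pull request above can be illustrated with a small sketch. This is a hypothetical simplification, not Hive's actual classes: a sink keeps a single writer open and permanently drops a partition's state once the key changes, mirroring how FileSinkOperator removes the previous fsp under the dynpart sort optimization.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for FileSinkOperator's one-open-writer behavior under
// hive.optimize.sort.dynamic.partition: each partition key must arrive as one
// contiguous run, otherwise already-removed state would have to be re-used.
public class OneWriterSink {
    private String openKey;
    private final Set<String> closedKeys = new HashSet<>();

    public void process(String partitionKey) {
        if (partitionKey.equals(openKey)) {
            return; // same partition, writer already open
        }
        if (openKey != null) {
            closedKeys.add(openKey); // drop the previous fsp
        }
        if (closedKeys.contains(partitionKey)) {
            // In Hive this surfaces as the NPE in FileSinkOperator.process
            throw new IllegalStateException(
                "partition " + partitionKey + " reappeared after its state was dropped");
        }
        openKey = partitionKey;
    }

    public static void main(String[] args) {
        OneWriterSink sorted = new OneWriterSink();
        sorted.process("1"); sorted.process("1"); sorted.process("2"); // sorted run: ok
        OneWriterSink unsorted = new OneWriterSink();
        unsorted.process("1"); unsorted.process("2");
        try {
            unsorted.process("1"); // Distribute By may deliver this order
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Sorted input such as `1, 1, 2` (what Cluster By effectively guarantees here) passes; the unsorted `1, 2, 1` that Distribute By may deliver trips the check, matching the reproduction in the issue description.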
[jira] [Work logged] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
[ https://issues.apache.org/jira/browse/HIVE-23880?focusedWorklogId=470126&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470126 ] ASF GitHub Bot logged work on HIVE-23880: - Author: ASF GitHub Bot Created on: 13/Aug/20 07:55 Start Date: 13/Aug/20 07:55 Worklog Time Spent: 10m Work Description: abstractdog commented on a change in pull request #1280: URL: https://github.com/apache/hive/pull/1280#discussion_r469765404 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/aggregates/VectorUDAFBloomFilterMerge.java ## @@ -77,6 +75,211 @@ public void reset() { // Do not change the initial bytes which contain NumHashFunctions/NumBits! Arrays.fill(bfBytes, BloomKFilter.START_OF_SERIALIZED_LONGS, bfBytes.length, (byte) 0); } + +public boolean mergeBloomFilterBytesFromInputColumn(BytesColumnVector inputColumn, +int batchSize, boolean selectedInUse, int[] selected, Configuration conf) { + // already set in previous iterations, no need to call initExecutor again + if (numThreads == 0) { +return false; + } + if (executor == null) { +initExecutor(conf, batchSize); +if (!isParallel) { + return false; +} + } + + // split every bloom filter (represented by a part of a byte[]) across workers + for (int j = 0; j < batchSize; j++) { +if (!selectedInUse && inputColumn.noNulls) { + splitVectorAcrossWorkers(workers, inputColumn.vector[j], inputColumn.start[j], + inputColumn.length[j]); +} else if (!selectedInUse) { + if (!inputColumn.isNull[j]) { +splitVectorAcrossWorkers(workers, inputColumn.vector[j], inputColumn.start[j], +inputColumn.length[j]); + } +} else if (inputColumn.noNulls) { + int i = selected[j]; + splitVectorAcrossWorkers(workers, inputColumn.vector[i], inputColumn.start[i], + inputColumn.length[i]); +} else { + int i = selected[j]; + if (!inputColumn.isNull[i]) { +splitVectorAcrossWorkers(workers, inputColumn.vector[i], inputColumn.start[i], +inputColumn.length[i]); + } +} + } + + return true; +} + +private void 
initExecutor(Configuration conf, int batchSize) { + numThreads = conf.getInt(HiveConf.ConfVars.TEZ_BLOOM_FILTER_MERGE_THREADS.varname, + HiveConf.ConfVars.TEZ_BLOOM_FILTER_MERGE_THREADS.defaultIntVal); + LOG.info("Number of threads used for bloom filter merge: {}", numThreads); + + if (numThreads < 0) { +throw new RuntimeException( +"invalid number of threads for bloom filter merge: " + numThreads); + } + if (numThreads == 0) { // disable parallel feature +return; // this will leave isParallel=false + } + isParallel = true; + executor = Executors.newFixedThreadPool(numThreads); + + workers = new BloomFilterMergeWorker[numThreads]; + for (int f = 0; f < numThreads; f++) { +workers[f] = new BloomFilterMergeWorker(bfBytes, 0, bfBytes.length); + } + + for (int f = 0; f < numThreads; f++) { +executor.submit(workers[f]); + } +} + +public int getNumberOfWaitingMergeTasks(){ + int size = 0; + for (BloomFilterMergeWorker w : workers){ +size += w.queue.size(); + } + return size; +} + +public int getNumberOfMergingWorkers() { + int working = 0; + for (BloomFilterMergeWorker w : workers) { +if (w.isMerging.get()) { + working += 1; +} + } + return working; +} + +private static void splitVectorAcrossWorkers(BloomFilterMergeWorker[] workers, byte[] bytes, +int start, int length) { + if (bytes == null || length == 0) { +return; + } + /* + * This will split a byte[] across workers as below: + * let's say there are 10 workers for 7813 bytes, in this case + * length: 7813, elementPerBatch: 781 + * bytes assigned to workers: inclusive lower bound, exclusive upper bound + * 1. worker: 5 -> 786 + * 2. worker: 786 -> 1567 + * 3. worker: 1567 -> 2348 + * 4. worker: 2348 -> 3129 + * 5. worker: 3129 -> 3910 + * 6. worker: 3910 -> 4691 + * 7. worker: 4691 -> 5472 + * 8. worker: 5472 -> 6253 + * 9. worker: 6253 -> 7034 + * 10. 
worker: 7034 -> 7813 (last element per batch is: 779) + * + * This way, a particular worker will be given with the same part + * of all bloom filters along with the shared base bloom filter, + * so the bitwise OR function will not be a subject of threading/sync issues. + */ + int elementPerBatch = + (int) Math.ceil((double) (length - START_OF_
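The splitting arithmetic in the (truncated) hunk above can be sketched standalone. This assumes the cut-off expression continues as ceil((length - START_OF_SERIALIZED_LONGS) / numWorkers) with START_OF_SERIALIZED_LONGS = 5, which is consistent with the worked example in the comment (7813 bytes across 10 workers gives elementPerBatch = 781):

```java
import java.util.ArrayList;
import java.util.List;

public class BloomSplitSketch {
    // Assumed value of BloomKFilter.START_OF_SERIALIZED_LONGS: the first bytes
    // of a serialized filter hold NumHashFunctions/NumBits and are not merged.
    static final int START_OF_SERIALIZED_LONGS = 5;

    /** Returns one [lowerInclusive, upperExclusive) byte range per worker. */
    static List<int[]> splitRanges(int length, int numWorkers) {
        int elementPerBatch =
            (int) Math.ceil((double) (length - START_OF_SERIALIZED_LONGS) / numWorkers);
        List<int[]> ranges = new ArrayList<>();
        int lower = START_OF_SERIALIZED_LONGS;
        for (int w = 0; w < numWorkers; w++) {
            int upper = Math.min(lower + elementPerBatch, length);
            ranges.add(new int[] {lower, upper});
            lower = upper;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Reproduces the worked example from the code comment: 5 -> 786 ... 7034 -> 7813
        for (int[] r : splitRanges(7813, 10)) {
            System.out.println(r[0] + " -> " + r[1]);
        }
    }
}
```

Because every worker always owns the same disjoint byte range of every incoming filter, the bitwise OR into the shared base array needs no synchronization, as the comment notes.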
[jira] [Work logged] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
[ https://issues.apache.org/jira/browse/HIVE-23880?focusedWorklogId=470123&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470123 ] ASF GitHub Bot logged work on HIVE-23880: - Author: ASF GitHub Bot Created on: 13/Aug/20 07:50 Start Date: 13/Aug/20 07:50 Worklog Time Spent: 10m Work Description: abstractdog commented on a change in pull request #1280: URL: https://github.com/apache/hive/pull/1280#discussion_r469762209 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/aggregates/VectorUDAFBloomFilterMerge.java ##
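The four-way branch in mergeBloomFilterBytesFromInputColumn quoted earlier follows the standard Hive vectorized-batch iteration pattern; a compact equivalent of the dispatch (a sketch, collapsing the branches the real code unrolls so the flags are tested only once per batch) is:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the selectedInUse/noNulls dispatch: iterate the batch's logical
// rows, map through `selected` when a selection vector is in use, and skip
// null entries unless the column is declared null-free.
public class BatchIteration {
    interface RowConsumer { void accept(int physicalRow); }

    static void forEachRow(int batchSize, boolean selectedInUse, int[] selected,
                           boolean noNulls, boolean[] isNull, RowConsumer consumer) {
        for (int j = 0; j < batchSize; j++) {
            int i = selectedInUse ? selected[j] : j;
            if (noNulls || !isNull[i]) {
                consumer.accept(i);
            }
        }
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        // selection picks physical rows 0, 2, 4 out of a batch of 5; row 2 is null
        boolean[] isNull = {false, false, true, false, false};
        forEachRow(3, true, new int[] {0, 2, 4}, false, isNull, rows::add);
        System.out.println(rows); // prints [0, 4]
    }
}
```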
[jira] [Work logged] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
[ https://issues.apache.org/jira/browse/HIVE-23880?focusedWorklogId=470122&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-470122 ] ASF GitHub Bot logged work on HIVE-23880: - Author: ASF GitHub Bot Created on: 13/Aug/20 07:49 Start Date: 13/Aug/20 07:49 Worklog Time Spent: 10m Work Description: abstractdog commented on a change in pull request #1280: URL: https://github.com/apache/hive/pull/1280#discussion_r469761824 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java ## @@ -252,6 +258,13 @@ protected VectorAggregationBufferRow allocateAggregationBuffer() throws HiveExce return bufferSet; } +protected void finishAggregators(boolean aborted) { Review comment: I'll take care of this in next patch This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 470122) Time Spent: 6h 50m (was: 6h 40m) > Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge > --- > > Key: HIVE-23880 > URL: https://issues.apache.org/jira/browse/HIVE-23880 > Project: Hive > Issue Type: Improvement >Reporter: László Bodor >Assignee: László Bodor >Priority: Major > Labels: pull-request-available > Attachments: lipwig-output3605036885489193068.svg > > Time Spent: 6h 50m > Remaining Estimate: 0h > > Merging bloom filters in semijoin reduction can become the main bottleneck in > case of large number of source mapper tasks (~1000, Map 1 in below example) > and a large amount of expected entries (50M) in bloom filters. 
> For example in TPCDS Q93: > {code} > select /*+ semi(store_returns, sr_item_sk, store_sales, 7000)*/ > ss_customer_sk > ,sum(act_sales) sumsales > from (select ss_item_sk > ,ss_ticket_number > ,ss_customer_sk > ,case when sr_return_quantity is not null then > (ss_quantity-sr_return_quantity)*ss_sales_price > else > (ss_quantity*ss_sales_price) end act_sales > from store_sales left outer join store_returns on (sr_item_sk = ss_item_sk > and sr_ticket_number = ss_ticket_number) > ,reason > where sr_reason_sk = r_reason_sk > and r_reason_desc = 'reason 66') t > group by ss_customer_sk > order by sumsales, ss_customer_sk > limit 100; > {code} > On 10TB-30TB scale there is a chance that from the 3-4 mins of query runtime, 1-2 mins are spent merging bloom filters (Reducer 2), as in: > [^lipwig-output3605036885489193068.svg] > {code}
> ----------------------------------------------------------------------------
> VERTICES    MODE  STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> ----------------------------------------------------------------------------
> Map 3 ....  llap  SUCCEEDED      1          1        0        0       0       0
> Map 1 ....  llap  SUCCEEDED   1263       1263        0        0       0       0
> Reducer 2   llap  RUNNING        1          0        1        0       0       0
> Map 4       llap  RUNNING     6154          0      207     5947       0       0
> Reducer 5   llap  INITED        43          0        0       43       0       0
> Reducer 6   llap  INITED         1          0        0        1       0       0
> ----------------------------------------------------------------------------
> VERTICES: 02/06  [>>--]  16%  ELAPSED TIME: 149.98 s
> ----------------------------------------------------------------------------
> {code} > For example, 70M entries in a bloom filter leads to 436,465,696 bits, so merging 1263 bloom filters means running ~1263 * 436,465,696 bitwise OR operations, which is a very hot codepath, but can be parallelized. -- This message was sent by Atlassian Jira (v8.3.4#803005)
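The bitwise-OR merge the issue identifies as the hot codepath can be sketched as follows. This is a minimal illustration that assumes the first 5 header bytes of a serialized filter are skipped; the actual patch additionally shards the [lower, upper) range across worker threads so each thread owns a disjoint slice of the base array.

```java
// Minimal sketch of merging one serialized bloom filter into a base filter:
// a bitwise OR over the payload bytes. With ~1263 mappers and ~436M bits per
// filter, this loop dominates Reducer 2's runtime unless it is parallelized.
public class BloomOrMerge {
    static final int HEADER_BYTES = 5; // assumed serialized-header size

    static void mergeRange(byte[] base, byte[] incoming, int lower, int upper) {
        for (int i = lower; i < upper; i++) {
            base[i] |= incoming[i]; // OR is associative/commutative: order-independent
        }
    }

    public static void main(String[] args) {
        byte[] base     = {1, 2, 3, 4, 5, 0b0001, 0b0100};
        byte[] incoming = {1, 2, 3, 4, 5, 0b0010, 0b0100};
        mergeRange(base, incoming, HEADER_BYTES, base.length);
        System.out.println(base[5] + " " + base[6]); // prints 3 4
    }
}
```

Because OR is associative and commutative, the incoming filters can be merged in any order and any partitioning, which is what makes the per-range worker split safe.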
[jira] [Commented] (HIVE-23927) Cast to Timestamp generates different output for Integer & Float values
[ https://issues.apache.org/jira/browse/HIVE-23927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176800#comment-17176800 ] Renukaprasad C commented on HIVE-23927: --- Thanks [~jcamachorodriguez] & [~pgaref]. We will follow the same implementation as the other integer datatype conversions (as suggested by [~pgaref]: maybe we should make this configurable as well, as we do in the longToTimestamp method) in *PrimitiveObjectInspectorUtils.getTimestamp(Object, PrimitiveObjectInspector, boolean).* > Cast to Timestamp generates different output for Integer & Float values > > > Key: HIVE-23927 > URL: https://issues.apache.org/jira/browse/HIVE-23927 > Project: Hive > Issue Type: Bug >Reporter: Renukaprasad C >Priority: Major > > A Double value is treated as SECONDS and converted into millis internally, > whereas an Integer value is treated as MILLIS, producing different output. > org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getTimestamp(Object, > PrimitiveObjectInspector, boolean) handles integral and decimal values > differently, which causes the issue. 
> 0: jdbc:hive2://localhost:1> select cast(1.204135216E9 as timestamp) > Double2TimeStamp, cast(1204135216 as timestamp) Int2TimeStamp from abc > tablesample(1 rows); > OK > INFO : Compiling > command(queryId=renu_20200724140642_70132390-ee12-4214-a2ca-a7e10556fc14): > select cast(1.204135216E9 as timestamp) Double2TimeStamp, cast(1204135216 as > timestamp) Int2TimeStamp from abc tablesample(1 rows) > INFO : Concurrency mode is disabled, not creating a lock manager > INFO : Semantic Analysis Completed (retrial = false) > INFO : Returning Hive schema: > Schema(fieldSchemas:[FieldSchema(name:double2timestamp, type:timestamp, > comment:null), FieldSchema(name:int2timestamp, type:timestamp, > comment:null)], properties:null) > INFO : Completed compiling > command(queryId=renu_20200724140642_70132390-ee12-4214-a2ca-a7e10556fc14); > Time taken: 0.175 seconds > INFO : Concurrency mode is disabled, not creating a lock manager > INFO : Executing > command(queryId=renu_20200724140642_70132390-ee12-4214-a2ca-a7e10556fc14): > select cast(1.204135216E9 as timestamp) Double2TimeStamp, cast(1204135216 as > timestamp) Int2TimeStamp from abc tablesample(1 rows) > INFO : Completed executing > command(queryId=renu_20200724140642_70132390-ee12-4214-a2ca-a7e10556fc14); > Time taken: 0.001 seconds > INFO : OK > INFO : Concurrency mode is disabled, not creating a lock manager
> +------------------------+--------------------------+
> | double2timestamp       | int2timestamp            |
> +------------------------+--------------------------+
> | 2008-02-27 18:00:16.0  | 1970-01-14 22:28:55.216  |
> +------------------------+--------------------------+
-- This message was sent by Atlassian Jira (v8.3.4#803005)
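The two wall-clock values in the output above correspond to interpreting the same number as seconds vs. milliseconds since the epoch, which can be checked directly (shown in UTC; beeline renders the session time zone, so the printed digits may shift by an offset):

```java
import java.time.Instant;

public class CastToTimestampDemo {
    public static void main(String[] args) {
        long v = 1204135216L;
        // double path: the value is interpreted as seconds since the epoch
        System.out.println(Instant.ofEpochSecond(v)); // 2008-02-27T18:00:16Z
        // integer path: the value is interpreted as milliseconds since the epoch
        System.out.println(Instant.ofEpochMilli(v));  // 1970-01-14T22:28:55.216Z
    }
}
```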