[jira] [Created] (HIVE-27142) Map Join not working as expected when joining non-native tables with native tables
Syed Shameerur Rahman created HIVE-27142:

Summary: Map Join not working as expected when joining non-native tables with native tables
Key: HIVE-27142
URL: https://issues.apache.org/jira/browse/HIVE-27142
Project: Hive
Issue Type: Bug
Affects Versions: All Versions
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
Fix For: 4.0.0

*1. Issue:*
When *_hive.auto.convert.join=true_* and the underlying query joins a large non-native Hive table with a small native Hive table, the map join happens on the wrong side, i.e. on the map tasks that process the small native Hive table. This can lead to OOM when the non-native table is really large, since only a few map tasks are spawned to scan the small native Hive table.

*2. Why is this happening?*
This happens due to improper stats collection/computation for non-native Hive tables. The data of a non-native Hive table is actually stored in a different location that Hive does not know about; the temporary path visible to Hive when the non-native table is created does not hold the actual data. The stats collection logic therefore tends to underestimate the data size / row count, which causes the map join to be planned on the wrong side.

*3. Potential Solutions*
3.1 Set *_hive.auto.convert.join=false_*. This can negatively impact a query that performs multiple joins, i.e. one join involving a non-native table and another join where both tables are native.
3.2 Compute stats for the non-native table by firing the ANALYZE TABLE <> command before joining native and non-native tables. The user may or may not choose to do this.
3.3 Don't collect/estimate stats for non-native Hive tables by default (preferred solution).

-- This message was sent by Atlassian Jira (v8.20.10#820010)
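The wrong-side pick described in (2) can be illustrated with a toy sketch; all names and numbers below are hypothetical, not Hive's actual planner code. The planner broadcasts whichever input has the smaller *estimated* size, so an underestimated non-native table ends up hashed in memory:

```java
public class MapJoinSideSketch {
    // Illustrative only: map-join planning broadcasts (hashes in memory) the
    // input whose ESTIMATED size is smaller. With bogus stats for a non-native
    // table, the truly large input wins the "small" slot.
    static String chooseBroadcastSide(long estLeftBytes, long estRightBytes) {
        return estLeftBytes <= estRightBytes ? "left" : "right";
    }

    public static void main(String[] args) {
        long nativeSmallTable = 10_000_000L;   // 10 MB native table, accurate stats
        long nonNativeHugeTable = 1_000L;      // huge non-native table, bogus 1 KB estimate
        // The non-native table is wrongly chosen as the broadcast side -> OOM risk.
        System.out.println(chooseBroadcastSide(nonNativeHugeTable, nativeSmallTable)); // left
    }
}
```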
[jira] [Created] (HIVE-26787) Pushdown Timestamp data type to metastore via direct sql / JDO
Syed Shameerur Rahman created HIVE-26787:

Summary: Pushdown Timestamp data type to metastore via direct sql / JDO
Key: HIVE-26787
URL: https://issues.apache.org/jira/browse/HIVE-26787
Project: Hive
Issue Type: Improvement
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman

Make the TIMESTAMP data type push down to the Hive metastore during partition pruning. This is along the same lines as https://issues.apache.org/jira/browse/HIVE-26778
[jira] [Created] (HIVE-26778) Pushdown Date data type to metastore via direct sql / JDO
Syed Shameerur Rahman created HIVE-26778:

Summary: Pushdown Date data type to metastore via direct sql / JDO
Key: HIVE-26778
URL: https://issues.apache.org/jira/browse/HIVE-26778
Project: Hive
Issue Type: Bug
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman

The original feature to push down the DATE data type during partition pruning via direct SQL / JDO was added as part of https://issues.apache.org/jira/browse/HIVE-5679

The behavior of Hive has since changed with CBO: when CBO is turned on, DATE values are no longer pushed down to the metastore, because CBO prefixes the original filter literal with the extra keyword DATE. The metastore filter parser is not written to handle this extra keyword, so parsing fails and the DATE predicate is not pushed down to the metastore.

{code:java}
select * from test_table where date_col = '2022-01-01';
{code}

When CBO is turned on, the generated filter predicate is date_col=DATE'2022-01-01'
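The fix essentially has to make the filter tolerate the CBO-added DATE keyword before the literal. A minimal sketch of that normalization (helper name and regex are illustrative, not the actual Hive patch):

```java
import java.util.regex.Pattern;

public class DateLiteralNormalizer {
    // Hypothetical helper: strip the CBO-added DATE keyword in front of quoted
    // date literals so the metastore filter parser sees the plain string form.
    private static final Pattern DATE_KEYWORD = Pattern.compile("(?i)\\bDATE\\s*(')");

    static String normalize(String filter) {
        return DATE_KEYWORD.matcher(filter).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(normalize("date_col = DATE'2022-01-01'"));
        // date_col = '2022-01-01'
    }
}
```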
[jira] [Created] (HIVE-26467) SessionState should be accessible inside ThreadPool
Syed Shameerur Rahman created HIVE-26467:

Summary: SessionState should be accessible inside ThreadPool
Key: HIVE-26467
URL: https://issues.apache.org/jira/browse/HIVE-26467
Project: Hive
Issue Type: Improvement
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
Fix For: 4.0.0

Currently SessionState.get() returns null when called inside a thread pool. If any custom third-party component leverages SessionState.get() for operations such as getting the session state or the session config, it will get null, because the session state is thread-local (https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java#L622) and ThreadLocal variables are not inherited by child threads / thread pools. So one solution is to make the thread-local variable inheritable, so that the SessionState gets propagated to child threads.
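A minimal, self-contained sketch of the mechanism behind the proposed fix (plain Java, not Hive code): an InheritableThreadLocal value is copied into threads created after the value is set, while a plain ThreadLocal stays invisible to them.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalDemo {
    static final ThreadLocal<String> plain = new ThreadLocal<>();
    static final InheritableThreadLocal<String> inheritable = new InheritableThreadLocal<>();

    // Returns what a pool thread sees: { via plain ThreadLocal, via InheritableThreadLocal }
    static String[] probe() throws Exception {
        plain.set("session");
        inheritable.set("session");
        // Pool thread is created lazily on first submit, i.e. AFTER set(),
        // so it inherits the inheritable value but not the plain one.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            return new String[] {
                pool.submit(plain::get).get(),        // null
                pool.submit(inheritable::get).get()   // "session"
            };
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        String[] seen = probe();
        System.out.println(seen[0] + " / " + seen[1]); // null / session
    }
}
```

Note the caveat: inheritance happens at thread-creation time, so values set after a pool thread already exists remain invisible to it; the same caveat would apply to a SessionState fix based on inheritable thread-locals.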
[jira] [Created] (HIVE-25942) Upgrade commons-io to 2.8.0 due to CVE-2021-29425
Syed Shameerur Rahman created HIVE-25942:

Summary: Upgrade commons-io to 2.8.0 due to CVE-2021-29425
Key: HIVE-25942
URL: https://issues.apache.org/jira/browse/HIVE-25942
Project: Hive
Issue Type: Bug
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
Fix For: 4.0.0

Due to [CVE-2021-29425|https://nvd.nist.gov/vuln/detail/CVE-2021-29425], all commons-io versions below 2.7 are affected. Tez and Hadoop have upgraded commons-io to 2.8.0 in [TEZ-4353|https://issues.apache.org/jira/browse/TEZ-4353] and [HADOOP-17683|https://issues.apache.org/jira/browse/HADOOP-17683] respectively, and it would be good for Hive to follow suit.
[jira] [Created] (HIVE-25907) IOW Directory queries fails to write data to final path when query result cache is enabled
Syed Shameerur Rahman created HIVE-25907:

Summary: IOW Directory queries fails to write data to final path when query result cache is enabled
Key: HIVE-25907
URL: https://issues.apache.org/jira/browse/HIVE-25907
Project: Hive
Issue Type: Bug
Components: Hive
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
Fix For: 4.0.0

INSERT OVERWRITE DIRECTORY queries fail to write data to the specified directory location when the query result cache is enabled.

*Steps to reproduce*
{code:java}
1. Create a data file with the following data
1 abc 10.5
2 def 11.5

2. Create a table pointing to that data
create external table iowd(strct struct)
row format delimited
fields terminated by '\t'
collection items terminated by ' '
location '';

3. Run the following query
set hive.query.results.cache.enabled=true;
INSERT OVERWRITE DIRECTORY "" SELECT * FROM iowd;
{code}

After executing the above query, the destination directory is expected to contain the data from table iowd, but due to HIVE-21386 this no longer happens.
[jira] [Created] (HIVE-25680) Authorize #get_table_meta HiveMetastore Server API to use any of the HiveMetastore Authorization model
Syed Shameerur Rahman created HIVE-25680:

Summary: Authorize #get_table_meta HiveMetastore Server API to use any of the HiveMetastore Authorization model
Key: HIVE-25680
URL: https://issues.apache.org/jira/browse/HIVE-25680
Project: Hive
Issue Type: Bug
Affects Versions: All Versions
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
Fix For: 4.0.0
Attachments: Screenshot 2021-11-08 at 2.39.30 PM.png

The #get_table_meta HiveMetastore server API, used by Apache Hue and other applications, is not gated by any of the authorization models which HiveMetastore provides. For more information on the storage-based authorization model, see https://cwiki.apache.org/confluence/display/Hive/HCatalog+Authorization

You can easily reproduce this with Apache Hive + Apache Hue:

{code:java}
hive.security.metastore.authorization.manager = org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
hive.security.metastore.authenticator.manager = org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator
hive.metastore.pre.event.listeners = org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener
{code}

{code:java}
#!/bin/bash
set -x
hdfs dfs -mkdir /datasets
hdfs dfs -mkdir /datasets/database1
hdfs dfs -mkdir /datasets/database1/table1
echo "stefano,1992" | hdfs dfs -put - /datasets/database1/table1/file1.csv
hdfs dfs -chmod -R 700 /datasets/database1
sudo tee -a setup.hql > /dev/null <
{code}

Then, in Hue:
# Create the first user called "admin" and provide a password; access the Hive Editor.
# In the SQL section on the left, under Databases, you should see default and database1 listed. Click on database1.
# A table called table1 is listed => this should not be possible, as our admin user has no HDFS grants on /datasets/database1.
# Run the following query from the Hive editor: SHOW TABLES; The output shows a Permission denied error => this is the expected behavior.
[jira] [Created] (HIVE-25443) Arrow SerDe Cannot serialize/deserialize complex data types When there are more than 1024 values
Syed Shameerur Rahman created HIVE-25443:

Summary: Arrow SerDe Cannot serialize/deserialize complex data types When there are more than 1024 values
Key: HIVE-25443
URL: https://issues.apache.org/jira/browse/HIVE-25443
Project: Hive
Issue Type: Bug
Components: Serializers/Deserializers
Affects Versions: 3.1.2, 3.1.1, 3.0.0, 3.1.0
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
Fix For: 4.0.0

Complex data types like MAP and STRUCT cannot be serialized/deserialized using the Arrow SerDe when there are more than 1024 values. This happens because the ColumnVector is always initialized with a size of 1024.

Issue #1: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/arrow/ArrowColumnarBatchSerDe.java#L213
Issue #2: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/arrow/ArrowColumnarBatchSerDe.java#L215

Sample unit test to reproduce the case in TestArrowColumnarBatchSerDe:
{code:java}
@Test
public void testListBooleanWithMoreThan1024Values() throws SerDeException {
  String[][] schema = {
      {"boolean_list", "array"},
  };

  Object[][] rows = new Object[1025][1];
  for (int i = 0; i < 1025; i++) {
    rows[i][0] = new BooleanWritable(true);
  }
  initAndSerializeAndDeserialize(schema, toList(rows));
}
{code}
[jira] [Created] (HIVE-24690) GlobalLimitOptimizer Fails To Identify Some Queries With LIMIT Operator
Syed Shameerur Rahman created HIVE-24690:

Summary: GlobalLimitOptimizer Fails To Identify Some Queries With LIMIT Operator
Key: HIVE-24690
URL: https://issues.apache.org/jira/browse/HIVE-24690
Project: Hive
Issue Type: Bug
Components: Query Planning
Affects Versions: 3.1.0, 2.1.0, 1.1.0
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman

As per https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GlobalLimitOptimizer.java#L88, queries like
{code:java}
CREATE TABLE ... AS SELECT col1, col2 FROM tbl LIMIT ..
INSERT OVERWRITE TABLE ... SELECT col1, hash(col2), split(col1) FROM ... LIMIT ..
{code}
fall under the category of qualified queries, but after HIVE-9444 they do not. On investigating this issue, it was found that for a
{code:java}
CREATE TABLE ... AS SELECT col1, col2 FROM tbl LIMIT ..
{code}
query the operator tree looks like *TS -> SEL -> LIM -> RS -> SEL -> LIM -> FS*. Since only one LIMIT operator is allowed as per https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GlobalLimitOptimizer.java#L196, the *GlobalLimitOptimizer* fails to identify such queries.

*Steps To Reproduce*
{code:java}
set hive.limit.optimize.enable=true;
create table t1 (a int);
create table t2 as select * from t1 LIMIT 10;
{code}
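The qualification check can be sketched as follows (a deliberate simplification, not the real GlobalLimitOptimizer code): with the CTAS operator tree above, two LIM operators are present, so the check bails out.

```java
import java.util.Arrays;
import java.util.List;

public class LimitCheckSketch {
    // Illustrative only: the optimizer's qualification check (reduced here to a
    // count) rejects any pipeline containing more than one LIM operator, which
    // is exactly what a CTAS-with-LIMIT plan produces after HIVE-9444.
    static boolean qualifies(List<String> operatorPipeline) {
        long limits = operatorPipeline.stream().filter("LIM"::equals).count();
        return limits <= 1;
    }

    public static void main(String[] args) {
        List<String> ctasPlan = Arrays.asList("TS", "SEL", "LIM", "RS", "SEL", "LIM", "FS");
        System.out.println(qualifies(ctasPlan)); // false
    }
}
```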
[jira] [Created] (HIVE-23851) MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions
Syed Shameerur Rahman created HIVE-23851:

Summary: MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions
Key: HIVE-23851
URL: https://issues.apache.org/jira/browse/HIVE-23851
Project: Hive
Issue Type: Bug
Affects Versions: 4.0.0
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman

*Steps to reproduce:*
# Create an external table
# Run the msck command to sync all the partitions with the metastore
# Remove one of the partition paths
# Run msck repair with partition filtering

*Stack Trace:*
{code:java}
2020-07-15T02:10:29,045 ERROR [4dad298b-28b1-4e6b-94b6-aa785b60c576 main] ppr.PartitionExpressionForMetastore: Failed to deserialize the expression
java.lang.IndexOutOfBoundsException: Index: 110, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:657) ~[?:1.8.0_192]
    at java.util.ArrayList.get(ArrayList.java:433) ~[?:1.8.0_192]
    at org.apache.hive.com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hive.com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:857) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:707) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:211) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeObjectFromKryo(SerializationUtilities.java:806) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpressionFromKryo(SerializationUtilities.java:775) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.deserializeExpr(PartitionExpressionForMetastore.java:96) [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.convertExprToFilter(PartitionExpressionForMetastore.java:52) [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.metastore.PartFilterExprUtil.makeExpressionTree(PartFilterExprUtil.java:48) [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExprInternal(ObjectStore.java:3593) [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.metastore.VerifyingObjectStore.getPartitionsByExpr(VerifyingObjectStore.java:80) [hive-standalone-metastore-server-4.0.0-SNAPSHOT-tests.jar:4.0.0-SNAPSHOT]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_192]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_192]
{code}

*Cause:* In the case of msck repair with partition filtering, we expect the expression proxy class to be set to PartitionExpressionForMetastore (https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/ddl/misc/msck/MsckAnalyzer.java#L78). While dropping partitions, we serialize the drop-partition filter expression as in https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java#L589, which is incompatible with the deserialization happening in PartitionExpressionForMetastore (https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionExpressionForMetastore.java#L52); hence the query fails with "Failed to deserialize the expression".

*Solutions:* I could think of two approaches to this problem:
# Since PartitionExpressionForMetastore is required only during the partition pruning step, we can switch the expression proxy class back to MsckPartitionExpressionProxy once the partition pruning step is done.
# The other solution is to make the serialization of the msck drop-partition filter expression compatible with PartitionExpressionForMetastore. We can do this via reflection, since the drop-partition serialization happens in the Msck class (standalone-metastore). This way we can completely remove the MsckPartitionExpressionProxy class, and it also reduces the complexity of getting MSCK REPAIR with partition filtering to work (no need to set the expression proxy class config).

I am personally inclined towards the 2nd approach. Before moving on, I want to know whether this is the best approach, or whether there is a better/easier way to solve this problem.

PS: The qtest added in HIVE-22957 mainly focused on adding missing partitions; a case for dropping partitions was left out.
[jira] [Created] (HIVE-23751) QTest: Override #mkdirs() method in ProxyFileSystem To Align After HADOOP-16582
Syed Shameerur Rahman created HIVE-23751:

Summary: QTest: Override #mkdirs() method in ProxyFileSystem To Align After HADOOP-16582
Key: HIVE-23751
URL: https://issues.apache.org/jira/browse/HIVE-23751
Project: Hive
Issue Type: Task
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
Fix For: 4.0.0, 3.2.0

HADOOP-16582 changed the way mkdirs() works:

*Before HADOOP-16582:*
All calls to mkdirs(p) were fast-tracked to FileSystem.mkdirs, which re-routed them to the mkdirs(p, permission) method. For ProxyFileSystem the call would look like
{code:java}
FileUtils.mkdir(p) -> FileSystem.mkdirs(p) -> ProxyFileSystem.mkdirs(p, permission)
{code}
An implementation of FileSystem only needed to implement mkdirs(p, permission).

*After HADOOP-16582:*
Since FilterFileSystem now overrides the mkdirs(p) method, the new call path to ProxyFileSystem looks like
{code:java}
FileUtils.mkdir(p) -> FilterFileSystem.mkdirs(p) ->
{code}
This makes all the qtests fail with the following exception:
{code:java}
Caused by: java.lang.IllegalArgumentException: Wrong FS: pfile:/media/ebs1/workspace/hive-3.1-qtest/group/5/label/HiveQTest/hive-1.2.0/itests/qtest/target/warehouse/dest1, expected: file:///
{code}
Note: We will hit this issue when we bump up the Hadoop version in Hive. So, as per the discussion in HADOOP-16963, ProxyFileSystem needs to override the mkdirs(p) method in order to solve the above problem. The new flow would look like
{code:java}
FileUtils.mkdir(p) -> ProxyFileSystem.mkdirs(p) -> ProxyFileSystem.mkdirs(p, permission) ->
{code}
[jira] [Created] (HIVE-23737) LLAP: Reuse dagDelete Feature Of Tez Custom Shuffle Handler Instead Of LLAP's dagDelete
Syed Shameerur Rahman created HIVE-23737:

Summary: LLAP: Reuse dagDelete Feature Of Tez Custom Shuffle Handler Instead Of LLAP's dagDelete
Key: HIVE-23737
URL: https://issues.apache.org/jira/browse/HIVE-23737
Project: Hive
Issue Type: Improvement
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman

LLAP has a dagDelete feature, added as part of HIVE-9911. Now that Tez has added support for dagDelete in its custom shuffle handler (TEZ-3362), we could reuse that feature in LLAP. There are some added advantages of using Tez's dagDelete feature over LLAP's current one:
# We can easily extend the feature to accommodate upcoming work such as vertex-level and failed-task-attempt shuffle data clean-up. Refer to TEZ-3363 and TEZ-4129.
# It will be easier to maintain the feature by separating it out from Hive's code path.
[jira] [Created] (HIVE-23606) LLAP: Delay In DirectByteBuffer Clean Up For EncodedReaderImpl
Syed Shameerur Rahman created HIVE-23606:

Summary: LLAP: Delay In DirectByteBuffer Clean Up For EncodedReaderImpl
Key: HIVE-23606
URL: https://issues.apache.org/jira/browse/HIVE-23606
Project: Hive
Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
Fix For: 4.0.0

DirectByteBuffers are cleaned up only on a full GC or when the cleaner method of DirectByteBuffer is invoked manually. Since a full GC may take some time to kick in, the native memory usage of the LLAP daemon process can shoot up in the meanwhile, and this will force the YARN pmem monitor to kill the container running the daemon.

HIVE-16180 tried to solve this problem, but the code structure was broken by HIVE-15665: the IdentityHashMap (toRelease) is initialized in https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedReaderImpl.java#L409, but it is re-initialized inside the method getDataFromCacheAndDisk() (https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedReaderImpl.java#L633), which makes it local to that method; hence the original toRelease IdentityHashMap remains empty.
[jira] [Created] (HIVE-23085) LLAP: Support Multiple NVMe-SSD disk Locations While Using SSD Cache
Syed Shameerur Rahman created HIVE-23085:

Summary: LLAP: Support Multiple NVMe-SSD disk Locations While Using SSD Cache
Key: HIVE-23085
URL: https://issues.apache.org/jira/browse/HIVE-23085
Project: Hive
Issue Type: Improvement
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
Fix For: 4.0.0

Currently only one SSD location can be configured for the SSD cache in LLAP. This prevents some machines from using their disk capacity to the fullest. For example, *AWS* provides the *r5d.4xlarge* series, which comes with *2 x 300 GB NVMe SSD disks*; with the current design only one of the mounted *NVMe SSD* disks can be used for caching. Hence, add support for caching data at multiple SSD mount locations.
[jira] [Created] (HIVE-22957) Support For FilterExp In MSCK Command
Syed Shameerur Rahman created HIVE-22957:

Summary: Support For FilterExp In MSCK Command
Key: HIVE-22957
URL: https://issues.apache.org/jira/browse/HIVE-22957
Project: Hive
Issue Type: Improvement
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
Fix For: 4.0.0

Currently the MSCK command supports a full repair of the table (all partitions) or some subset of partitions based on a partitionSpec. The aim of this jira is to introduce a filterExp (=, !=, <, >, >=, <=, LIKE) in the MSCK command, so that a larger subset of partitions can be recovered (added/deleted) without firing a full repair, which might take time if the number of partitions is huge.

*Approach:* The initial approach is to add a where clause to the MSCK command, e.g.:
MSCK REPAIR TABLE ... ADD|DROP|SYNC PARTITIONS WHERE ... AND ...

*Flow:*
# Parse the where clause and generate the filter expression
# Fetch all the partitions from the metastore which match the filter expression
# Fetch all the partition paths from the filesystem
# Remove all the partition paths which do not match the filter expression
# Based on ADD | DROP | SYNC, do the remaining steps.
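Steps 2 to 5 of the flow above can be sketched as pure set arithmetic. In this sketch (not the actual Msck code) a partition is just its spec string and the parsed filterExp is a Predicate:

```java
import java.util.*;
import java.util.function.Predicate;

public class MsckFilterSketch {
    // Illustrative only: given the partitions known to the metastore and the
    // partition directories found on the filesystem, apply the parsed filter
    // and compute what ADD and DROP would touch.
    static Map<String, Set<String>> plan(Set<String> metastoreParts,
                                         Set<String> fsParts,
                                         Predicate<String> filter) {
        Set<String> msFiltered = new TreeSet<>();
        for (String p : metastoreParts) if (filter.test(p)) msFiltered.add(p);
        Set<String> fsFiltered = new TreeSet<>();
        for (String p : fsParts) if (filter.test(p)) fsFiltered.add(p);

        Set<String> toAdd = new TreeSet<>(fsFiltered);   // on disk, not in metastore
        toAdd.removeAll(msFiltered);
        Set<String> toDrop = new TreeSet<>(msFiltered);  // in metastore, not on disk
        toDrop.removeAll(fsFiltered);

        Map<String, Set<String>> plan = new HashMap<>();
        plan.put("ADD", toAdd);
        plan.put("DROP", toDrop);
        return plan;
    }

    public static void main(String[] args) {
        Set<String> ms = new HashSet<>(Arrays.asList("dt=2020-01-01", "dt=2020-01-02"));
        Set<String> fs = new HashSet<>(Arrays.asList("dt=2020-01-02", "dt=2020-01-03"));
        // e.g. the parsed form of: WHERE dt > '2020-01-01'
        Predicate<String> filter = p -> p.compareTo("dt=2020-01-01") > 0;
        System.out.println(plan(ms, fs, filter));
    }
}
```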
[jira] [Created] (HIVE-22900) Predicate Push Down Of Like Filter While Fetching Partition Data From MetaStore
Syed Shameerur Rahman created HIVE-22900:

Summary: Predicate Push Down Of Like Filter While Fetching Partition Data From MetaStore
Key: HIVE-22900
URL: https://issues.apache.org/jira/browse/HIVE-22900
Project: Hive
Issue Type: New Feature
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
Fix For: 4.0.0

Currently PPD is disabled for the LIKE filter while fetching partition data from the metastore. The following patch covers all the test cases mentioned in HIVE-5134.
[jira] [Created] (HIVE-22891) Skip PartitionDesc Extraction In CombineHiveRecordReader For Non-LLAP Execution Mode
Syed Shameerur Rahman created HIVE-22891:

Summary: Skip PartitionDesc Extraction In CombineHiveRecordReader For Non-LLAP Execution Mode
Key: HIVE-22891
URL: https://issues.apache.org/jira/browse/HIVE-22891
Project: Hive
Issue Type: Task
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
Fix For: 4.0.0

{code:java}
try {
  // TODO: refactor this out
  if (pathToPartInfo == null) {
    MapWork mrwork;
    if (HiveConf.getVar(conf, HiveConf.ConfVars.HIVE_EXECUTION_ENGINE).equals("tez")) {
      mrwork = (MapWork) Utilities.getMergeWork(jobConf);
      if (mrwork == null) {
        mrwork = Utilities.getMapWork(jobConf);
      }
    } else {
      mrwork = Utilities.getMapWork(jobConf);
    }
    pathToPartInfo = mrwork.getPathToPartitionInfo();
  }

  PartitionDesc part = extractSinglePartSpec(hsplit);
  inputFormat = HiveInputFormat.wrapForLlap(inputFormat, jobConf, part);
} catch (HiveException e) {
  throw new IOException(e);
}
{code}

The above piece of code in CombineHiveRecordReader.java was introduced in HIVE-15147. It overwrites inputFormat based on the PartitionDesc, which is not required in the non-LLAP mode of execution, since HiveInputFormat.wrapForLlap() simply returns the previously defined inputFormat in the non-LLAP case. The call to extractSinglePartSpec() has serious performance implications: if there are a large number of small files, each call to extractSinglePartSpec() takes approximately 2 - 3 seconds. Hence the same query that runs in Hive 1.x / Hive 2 is much faster than on the latest Hive.

{code:java}
2020-02-11 07:15:04,701 INFO [main] org.apache.hadoop.hive.ql.io.orc.ReaderImpl: Reading ORC rows from
2020-02-11 07:15:06,468 WARN [main] org.apache.hadoop.hive.ql.io.CombineHiveRecordReader: Multiple partitions found; not going to pass a part spec to LLAP IO: {{logdate=2020-02-03, hour=01, event=win}} and {{logdate=2020-02-03, hour=02, event=act}}
2020-02-11 07:15:06,468 INFO [main] org.apache.hadoop.hive.ql.io.CombineHiveRecordReader: succeeded in getting org.apache.hadoop.mapred.FileSplit
{code}
[jira] [Created] (HIVE-22433) Hive JDBC Storage Handler: Incorrect results fetched from BOOLEAN and TIMESTAMP DataType From JDBC Data Source
Syed Shameerur Rahman created HIVE-22433:

Summary: Hive JDBC Storage Handler: Incorrect results fetched from BOOLEAN and TIMESTAMP DataType From JDBC Data Source
Key: HIVE-22433
URL: https://issues.apache.org/jira/browse/HIVE-22433
Project: Hive
Issue Type: Bug
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman
Fix For: 4.0.0

Steps to Reproduce:
{code:java}
// Derby table:
create table testtbl(a BOOLEAN, b TIMESTAMP);
// Insert into the table via mysql connector; data in db:
// true  2019-11-11 12:00:00

// Hive table:
CREATE EXTERNAL TABLE `hive_table`(
  a BOOLEAN,
  b TIMESTAMP
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  'hive.sql.database.type'='DERBY',
  'hive.sql.dbcp.password'='',
  'hive.sql.dbcp.username'='',
  'hive.sql.jdbc.driver'='',
  'hive.sql.jdbc.url'='',
  'hive.sql.table'='testtbl');

// Hive query:
select * from hive_table;

// Result from the select query:
// false  2019-11-11 20:00:00
{code}
[jira] [Created] (HIVE-22431) Hive JDBC Storage Handler: java.lang.ClassCastException on accessing TINYINT, SMALLINT Data Type From JDBC Data Source
Syed Shameerur Rahman created HIVE-22431:

Summary: Hive JDBC Storage Handler: java.lang.ClassCastException on accessing TINYINT, SMALLINT Data Type From JDBC Data Source
Key: HIVE-22431
URL: https://issues.apache.org/jira/browse/HIVE-22431
Project: Hive
Issue Type: Bug
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman

Steps to Reproduce:
{code:java}
// MySQL table:
create table testtbl(a TINYINT, b SMALLINT);
// Insert data into the table

// Hive table:
CREATE EXTERNAL TABLE `hive_table`(
  a TINYINT,
  b SMALLINT
)
ROW FORMAT SERDE 'org.apache.hive.storage.jdbc.JdbcSerDe'
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  'hive.sql.database.type'='MYSQL',
  'hive.sql.dbcp.password'='hive',
  'hive.sql.dbcp.username'='hive',
  'hive.sql.jdbc.driver'='com.mysql.jdbc.Driver',
  'hive.sql.jdbc.url'='jdbc:mysql://hadoop/test',
  'hive.sql.table'='testtbl');

// Hive query:
select * from hive_table;
{code}

*Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Byte*

*Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Short*
[jira] [Created] (HIVE-22409) Logging: Implement QueryID Based Hive Logging
Syed Shameerur Rahman created HIVE-22409:

Summary: Logging: Implement QueryID Based Hive Logging
Key: HIVE-22409
URL: https://issues.apache.org/jira/browse/HIVE-22409
Project: Hive
Issue Type: Improvement
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman

Currently all Hive logs are written to ${sys:hive.log.dir}/${sys:hive.log.file}, which is a single log file. Over time it becomes tedious to search the logs, since the logs of multiple Hive queries are interleaved in one file. Hence we propose queryID-based Hive logging, where the logs of different queries are written to separate log files based on their queryID.

CC [~prasanth_j] [~gopalv] [~sseth]
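One possible shape for this (a hedged sketch, not the committed implementation) is a Log4j2 RoutingAppender keyed on a queryId entry that Hive would place in the ThreadContext; the appender name, file name pattern, and layout below are illustrative:

```xml
<!-- Illustrative log4j2 fragment: route each query's log lines to its own
     file, keyed on a "queryId" value set in the ThreadContext by Hive. -->
<Routing name="query-routing">
  <Routes pattern="$${ctx:queryId}">
    <Route>
      <File name="query-file"
            fileName="${sys:hive.log.dir}/hive_${ctx:queryId}.log">
        <PatternLayout pattern="%d{ISO8601} %5p [%t] %c{2}: %m%n"/>
      </File>
    </Route>
  </Routes>
</Routing>
```

Log lines emitted before a queryId is set would need a fallback; log4j2 supports a key-less default Route inside Routes for exactly that case.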
[jira] [Created] (HIVE-22392) Hive JDBC Storage Handler: Support For Writing Data to JDBC Data Source
Syed Shameerur Rahman created HIVE-22392:

Summary: Hive JDBC Storage Handler: Support For Writing Data to JDBC Data Source
Key: HIVE-22392
URL: https://issues.apache.org/jira/browse/HIVE-22392
Project: Hive
Issue Type: New Feature
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman

The JDBC Storage Handler supports reading from a JDBC data source in Hive, but writing to a JDBC data source is currently not supported. Hence, add support for simple INSERT queries so that data can be written back to the JDBC data source.
[jira] [Created] (HIVE-21454) Tez default configs get overwritten by MR default configs
Syed Shameerur Rahman created HIVE-21454:

Summary: Tez default configs get overwritten by MR default configs
Key: HIVE-21454
URL: https://issues.apache.org/jira/browse/HIVE-21454
Project: Hive
Issue Type: Bug
Reporter: Syed Shameerur Rahman

Due to changes done in HIVE-17781, Tez default configs such as tez.counters.max, which has a default value of 1200, get overwritten by mapreduce.job.counters.max, which has a default value of 120.