[jira] [Created] (HIVE-26188) Query level cache and HMS local cache doesn't work locally and with Explain statements.

2022-04-28 Thread Soumyakanti Das (Jira)
Soumyakanti Das created HIVE-26188:
--

 Summary: Query level cache and HMS local cache doesn't work 
locally and with Explain statements.
 Key: HIVE-26188
 URL: https://issues.apache.org/jira/browse/HIVE-26188
 Project: Hive
  Issue Type: Bug
Reporter: Soumyakanti Das
Assignee: Soumyakanti Das


{{ExplainSemanticAnalyzer}} should override the {{startAnalysis()}} method that 
creates the query-level cache. This is important because, after 
https://issues.apache.org/jira/browse/HIVE-25918, the HMS local cache only 
works if the query-level cache is also initialized.

Also, the HMS cache properties in {{data/conf/llap/hive-site.xml}} are 
incorrect and should be fixed so that the cache is enabled during qtests.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HIVE-26187) Set operations and time travel is not working

2022-04-28 Thread Jira
Zoltán Borók-Nagy created HIVE-26187:


 Summary: Set operations and time travel is not working
 Key: HIVE-26187
 URL: https://issues.apache.org/jira/browse/HIVE-26187
 Project: Hive
  Issue Type: Bug
Reporter: Zoltán Borók-Nagy


Set operations don't work well with time travel queries.

Repro:

{noformat}
select * from t FOR SYSTEM_VERSION AS OF <snapshot_id_1>

MINUS

select * from t FOR SYSTEM_VERSION AS OF <snapshot_id_2>;
{noformat}

This returns 0 rows because both selects use the same snapshot id instead of 
snapshot_id_1 and snapshot_id_2.

There are probably issues with other queries as well whenever the same table 
is used multiple times with different snapshot ids.





[jira] [Created] (HIVE-26186) Resultset returned by getTables does not order data per JDBC specification

2022-04-28 Thread N Campbell (Jira)
N Campbell created HIVE-26186:
-

 Summary: Resultset returned by getTables does not order data per 
JDBC specification
 Key: HIVE-26186
 URL: https://issues.apache.org/jira/browse/HIVE-26186
 Project: Hive
  Issue Type: Bug
  Components: JDBC
Affects Versions: 3.1.3
 Environment: !HiveMeta.png!
Reporter: N Campbell
 Attachments: HiveMeta.png

The JDBC specification states that the rows of the ResultSet returned by 
getTables must be ordered by TABLE_TYPE, TABLE_CAT, TABLE_SCHEM and TABLE_NAME.

A simple Java program issues a request to getTables:
ResultSet rs = dbMeta.getTables(null, "cert", "%", null);

The ResultSet is not ordered per the JDBC spec:
[https://docs.oracle.com/javase/8/docs/api/java/sql/DatabaseMetaData.html#getTables-java.lang.String-java.lang.String-java.lang.String-java.lang.String:A-]

Happens with various releases including

hive-jdbc-3.1.3000.7.1.7.0-551

hive-jdbc-3.1.3000.7.1.6.0-297
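
The ordering expectation can be checked mechanically. The following is a minimal sketch, not part of Hive or its JDBC driver: {{TableOrderCheck}}, {{Row}}, {{read}} and {{isOrdered}} are hypothetical names introduced here. It compares consecutive rows on the four columns named in the getTables Javadoc. Treating nulls as sorting first and comparing strings with {{String.compareTo}} are assumptions, since the Javadoc does not spell out a collation.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not part of Hive or the JDBC driver). */
final class TableOrderCheck {

    /** The four ordering columns of one getTables() row. */
    static final class Row {
        final String type;
        final String cat;
        final String schem;
        final String name;

        Row(String type, String cat, String schem, String name) {
            this.type = type;
            this.cat = cat;
            this.schem = schem;
            this.name = name;
        }
    }

    // Assumption: nulls sort first and strings compare by String.compareTo.
    private static final Comparator<String> NULLS_FIRST =
            Comparator.nullsFirst(Comparator.naturalOrder());

    /** Ordering documented for getTables: TABLE_TYPE, TABLE_CAT, TABLE_SCHEM, TABLE_NAME. */
    static final Comparator<Row> SPEC_ORDER =
            Comparator.comparing((Row r) -> r.type, NULLS_FIRST)
                    .thenComparing(r -> r.cat, NULLS_FIRST)
                    .thenComparing(r -> r.schem, NULLS_FIRST)
                    .thenComparing(r -> r.name, NULLS_FIRST);

    /** Copies the ordering columns out of a getTables() result set. */
    static List<Row> read(ResultSet rs) throws SQLException {
        List<Row> rows = new ArrayList<>();
        while (rs.next()) {
            rows.add(new Row(rs.getString("TABLE_TYPE"), rs.getString("TABLE_CAT"),
                    rs.getString("TABLE_SCHEM"), rs.getString("TABLE_NAME")));
        }
        return rows;
    }

    /** Returns true if no row sorts before its predecessor under SPEC_ORDER. */
    static boolean isOrdered(List<Row> rows) {
        for (int i = 1; i < rows.size(); i++) {
            if (SPEC_ORDER.compare(rows.get(i - 1), rows.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }
}
```

With a live connection, the check would be {{TableOrderCheck.isOrdered(TableOrderCheck.read(dbMeta.getTables(null, "cert", "%", null)))}}.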





[jira] [Created] (HIVE-26185) Need support for metadataonly operations with iceberg (e.g select distinct on partition column)

2022-04-28 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-26185:
---

 Summary: Need support for metadataonly operations with iceberg 
(e.g select distinct on partition column)
 Key: HIVE-26185
 URL: https://issues.apache.org/jira/browse/HIVE-26185
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Reporter: Rajesh Balamohan


{noformat}
select distinct ss_sold_date_sk from store_sales
{noformat}

This query scans 1800+ rows in Hive ACID, but it takes a long time to process 
in NullScanOptimizer during the compilation phase 
(https://issues.apache.org/jira/browse/HIVE-24262).

{noformat}
Hive ACID

INFO  : Executing 
command(queryId=hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14): 
select distinct ss_sold_date_sk from store_sales
INFO  : Compute 'ndembla-test2' is active.
INFO  : Query ID = hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Subscribed to counters: [] for queryId: 
hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14
INFO  : Tez session hasn't been created yet. Opening session
INFO  : Dag name: select distinct ss_sold_date_s...store_sales (Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id 
application_1651102345385_)

INFO  : Status: DAG finished successfully in 1.81 seconds
INFO  : DAG ID: dag_1651102345385__5
INFO  :
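INFO  : Query Execution Summary
INFO  : -----------------------------------------------------------------------------------------
INFO  : OPERATION                             DURATION
INFO  : -----------------------------------------------------------------------------------------
INFO  : Compile Query                           55.47s
INFO  : Prepare Plan                             2.32s
INFO  : Get Query Coordinator (AM)               0.13s
INFO  : Submit Plan                              0.03s
INFO  : Start DAG                                0.09s
INFO  : Run DAG                                  1.80s
INFO  : -----------------------------------------------------------------------------------------
INFO  :
INFO  : Task Execution Summary
INFO  : -----------------------------------------------------------------------------------------
INFO  :   VERTICES    DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
INFO  : -----------------------------------------------------------------------------------------
INFO  :      Map 1         1009.00              0              0           1,824            1,824
INFO  :  Reducer 2            0.00              0              0           1,824                0
INFO  : -----------------------------------------------------------------------------------------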
INFO  :

{noformat}




However, the same query scans *2.8 billion records* in Iceberg format. This can 
be fixed.

{noformat}

INFO  : Executing 
command(queryId=hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72): 
select distinct ss_sold_date_sk from store_sales
INFO  : Compute 'ndembla-test2' is active.
INFO  : Query ID = hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Subscribed to counters: [] for queryId: 
hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72
INFO  : Tez session hasn't been created yet. Opening session
INFO  : Dag name: select distinct ss_sold_date_s...store_sales (Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id 
application_1651102345385_)

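----------------------------------------------------------------------------------------
VERTICES       MODE     STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------
Map 1 ..       llap  SUCCEEDED   7141       7141        0        0       0       0
Reducer 2 ..   llap  SUCCEEDED      2          2        0        0       0       0
----------------------------------------------------------------------------------------
VERTICES: 02/02  [==>>] 100%  ELAPSED TIME: 18.48 s
----------------------------------------------------------------------------------------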
INFO  : Status: DAG finished successfully in 17.97 seconds
INFO  : DAG ID: dag_1651102345385__4
INFO  :
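INFO  : Query Execution Summary
INFO  : -----------------------------------------------------------------------------------------
INFO  : OPERATION                             DURATION
INFO  : -----------------------------------------------------------------------------------------
INFO  : Compile Query                            1.81s
INFO  : Prepare Plan                             0.04s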
INFO  : Get Query Coordinator 

[jira] [Created] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed

2022-04-28 Thread okumin (Jira)
okumin created HIVE-26184:
-

 Summary: COLLECT_SET with GROUP BY is very slow when some keys are 
highly skewed
 Key: HIVE-26184
 URL: https://issues.apache.org/jira/browse/HIVE-26184
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 3.1.3, 2.3.8
Reporter: okumin
Assignee: okumin


I observed that some reducers spend 98% of their CPU time invoking 
`java.util.HashMap#clear`.

Looking into the details, I found that COLLECT_SET reuses a LinkedHashSet, and 
its `clear` can be quite heavy when a relation has a small number of highly 
skewed keys.

 

To reproduce the issue, first create rows that share a single skewed key.
{code:sql}
INSERT INTO test_collect_set
SELECT '----' AS key, CAST(UUID() AS VARCHAR) AS value
FROM table_with_many_rows
LIMIT 10;{code}
Then create many non-skewed rows.
{code:sql}
INSERT INTO test_collect_set
SELECT UUID() AS key, UUID() AS value
FROM sample_datasets.nasdaq
LIMIT 500;{code}
The issue appears when we aggregate values by `key`.
{code:sql}
SELECT key, COLLECT_SET(value) FROM test_collect_set GROUP BY key{code}
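
The cost pattern described above can be reproduced outside Hive. In OpenJDK, `HashMap#clear` nulls every bucket of the backing table, and the table never shrinks, so a set that once held a huge group keeps paying for that capacity on every later `clear`. The following is a minimal sketch under that assumption. `CollectSetSketch`, `reuseAndClear` and `freshSetPerGroup` are hypothetical names and are not Hive's COLLECT_SET code. Allocating a fresh set per group is shown only for comparison, not as the fix adopted for this issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-alone illustration; not Hive's COLLECT_SET implementation. */
final class CollectSetSketch {

    /** Reuses one set for all groups and clears it between groups. */
    static List<List<String>> reuseAndClear(List<List<String>> groups) {
        Set<String> buffer = new LinkedHashSet<>();
        List<List<String>> out = new ArrayList<>();
        for (List<String> group : groups) {
            // clear() walks the whole backing table, whose capacity is set by
            // the largest group seen so far, not by the current element count.
            buffer.clear();
            buffer.addAll(group);
            out.add(new ArrayList<>(buffer));
        }
        return out;
    }

    /** Allocates a fresh set per group, so capacity tracks each group's size. */
    static List<List<String>> freshSetPerGroup(List<List<String>> groups) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> group : groups) {
            Set<String> buffer = new LinkedHashSet<>(group);
            out.add(new ArrayList<>(buffer));
        }
        return out;
    }

    public static void main(String[] args) {
        // One heavily skewed group followed by many single-value groups.
        // The sizes are arbitrary choices for this sketch.
        List<List<String>> groups = new ArrayList<>();
        List<String> skewed = new ArrayList<>();
        for (int i = 0; i < 200_000; i++) {
            skewed.add("v" + i);
        }
        groups.add(skewed);
        for (int i = 0; i < 5_000; i++) {
            List<String> single = new ArrayList<>();
            single.add("w" + i);
            groups.add(single);
        }

        long t0 = System.nanoTime();
        reuseAndClear(groups);
        long t1 = System.nanoTime();
        freshSetPerGroup(groups);
        long t2 = System.nanoTime();
        System.out.printf("reuse + clear: %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("fresh set:     %d ms%n", (t2 - t1) / 1_000_000);
    }
}
```

Both methods return the same de-duplicated values per group. Only the time spent resetting the buffer differs, and the gap grows with the size of the skewed group and the number of groups processed after it.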


