[jira] [Updated] (HIVE-27176) EXPLAIN SKEW

Jira Sun, 26 Mar 2023 00:44:07 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-27176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


László Bodor updated HIVE-27176:
--------------------------------
    Description: 
Thinking about a new explain feature, which is actually not an explain, instead 
a set of analytical queries: considering a very complicated and large SQL 
statement (this below is a simple one, just for example's sake):
{code}
SELECT a FROM (SELECT b ... JOIN c on b.x = c.y) d JOIN e ON d.v = e.w
{code}

EXPLAIN SKEW under the hood should run a query like:
{code}
SELECT "b", "x", x, count (distinct b.x) as count order by count desc limit 50
UNION ALL
SELECT "c", "y", y, count (distinct c.y) as count order by count desc limit 50
UNION ALL
SELECT "d", "v", v count (distinct d.v) as count order by count desc limit 50
UNION ALL
SELECT "e", "w", w, count (distinct e.w) as count order by count desc limit 50
{code}

collecting some cardinality info about all the join columns found in the query, 
so result might be like:

{code}
table_name column_name column_value count
b "x" x_skew_value1 100431234
b "x" x_skew_value2 234
c "y" y_skew_value1 350000
c "y" x_skew_value2 459999
c "y" x_skew_value3 42
...
{code}
this doesn't solve the problem, instead shows data skew immediately for further 
analysis, also it doesn't suffer from incomplete stats problem, as it really 
has to query data on the cluster
 

  was:
Thinking about a new explain feature, which is actually not an explain, instead 
a set of analytical queries: considering a very complicated and large SQL 
statement (this below is a simple one, just for example's sake):
{code}
SELECT a FROM (SELECT b ... JOIN c on b.x = c.y) d JOIN e ON d.v = e.w
{code}

EXPLAIN SKEW under the hood should run a query like:
{code}
SELECT "b", "x", x, count (distinct b.x) as count order by count desc limit 50
UNION ALL
SELECT "c", "y", y, count (distinct c.y) as count order by count desc limit 50
UNION ALL
SELECT "d", "v", v count (distinct d.v) as count order by count desc limit 50
UNION ALL
SELECT "e", "w", w, count (distinct e.w) as count order by count desc limit 50
{code}

collecting some cardinality info about all the join columns found in the query, 
so result might be like:

{code}
table_name column_name column_value count
b "x" x_skew_value1 100431234
b "x" x_skew_value2 234
c "y" y_skew_value1 350000
c "y" x_skew_value2 459999
c "y" x_skew_value3 42
...
{code}
this doesn't solve the problem, instead shows data skew immediately for further 
analysis

 


> EXPLAIN SKEW <query>
> --------------------
>
>                 Key: HIVE-27176
>                 URL: https://issues.apache.org/jira/browse/HIVE-27176
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Priority: Major
>
> Thinking about a new explain feature, which is actually not an explain, 
> instead a set of analytical queries: considering a very complicated and large 
> SQL statement (this below is a simple one, just for example's sake):
> {code}
> SELECT a FROM (SELECT b ... JOIN c on b.x = c.y) d JOIN e ON d.v = e.w
> {code}
> EXPLAIN SKEW under the hood should run a query like:
> {code}
> SELECT "b", "x", x, count (distinct b.x) as count order by count desc limit 50
> UNION ALL
> SELECT "c", "y", y, count (distinct c.y) as count order by count desc limit 50
> UNION ALL
> SELECT "d", "v", v count (distinct d.v) as count order by count desc limit 50
> UNION ALL
> SELECT "e", "w", w, count (distinct e.w) as count order by count desc limit 50
> {code}
> collecting some cardinality info about all the join columns found in the 
> query, so result might be like:
> {code}
> table_name column_name column_value count
> b "x" x_skew_value1 100431234
> b "x" x_skew_value2 234
> c "y" y_skew_value1 350000
> c "y" x_skew_value2 459999
> c "y" x_skew_value3 42
> ...
> {code}
> this doesn't solve the problem, instead shows data skew immediately for 
> further analysis, also it doesn't suffer from incomplete stats problem, as it 
> really has to query data on the cluster
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HIVE-27176) EXPLAIN SKEW

Reply via email to