[jira] [Updated] (PHOENIX-3670) KeyRange.intersect(List , List) is inefficient,especially for join dynamic filter

chenglei (JIRA) Tue, 14 Feb 2017 07:19:26 -0800

     [ 
https://issues.apache.org/jira/browse/PHOENIX-3670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


chenglei updated PHOENIX-3670:
------------------------------
    Description: 
In my business system, there is a following join SQL(which is simplified), 
fact_table is a fact table,  joining dimension table dim_table1 and dim_table2 
: 

{code:borderStyle=solid} 
select /*+ SKIP_SCAN */ sum(t.click)  from fact_table t join dim_table1 d1 on  
t.cust_id=d1.id  join dim_table2 d2 on t.cust_id =d2.id  where t.date between 
'2016-01-01' and '2017-01-01' and d1.code = 2008 and d2.region = 'us';
{code} 

I use /*+ SKIP_SCAN */ hint to enable join dynamic filter. For some small 
dataset, the sql executes quickly, but when the dataset is bigger, the sql 
become very slowly. When the  row count of fact_table is 30 million,dim_table1 
is 300 thousand and dim_table2 is 100 thousand, the above query  costs 17s.

When I debug the SQL executing, I find RHS1 return 5523 rows:
{code:borderStyle=solid} 
   select d1.id from dim_table1 d1 where d1.code = 2008
{code} 

and RHS2 return 23881 rows: 
{code:borderStyle=solid}
   select d2.id from dim_table2 d2 where d2.region='us'
{code}  

then HashJoinPlan uses  KeyRange.intersect(List<KeyRange> , List<KeyRange> ) 
method to compute RHS1 intersecting RHS2 for dynamic filter, narrowing down 
fact_table.cust_id should be. 

Surprisingly,the KeyRange.intersect method costs 11s ! although the whole sql 
execution only costs 17s.After I read the code of  KeyRange.intersect method,I 
find following two problem:

1. The double loop is inefficient in line 521 and line 522,when keyRanges  size 
is M, keyRanges2 size is N, the time complexity is O(M*N), for my example,is 
5523*23881: 

{code:borderStyle=solid} 
519 public static List<KeyRange> intersect(List<KeyRange> keyRanges,  
List<KeyRange> keyRanges2) {
520        List<KeyRange> tmp = new ArrayList<KeyRange>();
521        for (KeyRange r1 : keyRanges) {
522            for (KeyRange r2 : keyRanges2) {
523                KeyRange r = r1.intersect(r2);
524                if (EMPTY_RANGE != r) {
525                    tmp.add(r);
526                }
527            }
528        }
{code}  

2. line 540 shoule be r = r.union(tmp.get( i )), not intersect,just as 
KeyRange.coalesce method does:

{code:borderStyle=solid} 
532        Collections.sort(tmp, KeyRange.COMPARATOR);
533        List<KeyRange> tmp2 = new ArrayList<KeyRange>();
534        KeyRange r = tmp.get(0);
535        for (int i=1; i<tmp.size(); i++) {
536            if (EMPTY_RANGE == r.intersect(tmp.get(i))) {
537                tmp2.add(r);
538                r = tmp.get(i);
539            } else {
540                r = r.intersect(tmp.get(i));
541            }
542        }
{code}

and it seems that no unit tests for  KeyRange.intersect(List<KeyRange> , 
List<KeyRange>) method. 

  was:
In my business system, there is a following join SQL(which is simplified), 
fact_table is a fact table,  joining dimension table dim_table1 and dim_table2 
: 

{code:borderStyle=solid} 
select /*+ SKIP_SCAN */ sum(t.click)  from fact_table t join dim_table1 d1 on  
t.cust_id=d1.id  join dim_table2 d2 on t.cust_id =d2.id  where t.date between 
'2016-01-01' and '2017-01-01' and d1.code = 2008 and d2.region = 'us';
{code} 

I use /*+ SKIP_SCAN */ hint to enable join dynamic filter. For some small 
dataset, the sql executes quickly, but when the dataset is bigger, the sql 
become very slowly. When the  row count of fact_table is 30 million,dim_table1 
is 300 thousand and dim_table2 is 100 thousand, the above query  costs 17s.

When I debug the SQL executing, I find RHS1 return 5523 rows:
{code:borderStyle=solid} 
   select d1.id from dim_table1 d1 where d1.code = 2008
{code} 

and RHS2 return 23881 rows: 
{code:borderStyle=solid}
   select d2.id from dim_table2 d2 where d2.region='us'
{code}  

then HashJoinPlan uses  KeyRange.intersect(List<KeyRange> , List<KeyRange> ) 
method to compute RHS1 intersecting RHS2 for dynamic filter, narrowing down 
fact_table.cust_id should be. 

Surprisingly,the KeyRange.intersect method costs 11s ! although the whole sql 
execution only costs 17s.After I read the code of  KeyRange.intersect method,I 
find following two problem:

1. The double loop is inefficient in line 521 and line 522,when keyRanges  size 
is M, keyRanges2 size is N, the time complexity is O(M*N), for my example,is 
5523*23881: 

{code:borderStyle=solid} 
519 public static List<KeyRange> intersect(List<KeyRange> keyRanges,  
List<KeyRange> keyRanges2) {
520        List<KeyRange> tmp = new ArrayList<KeyRange>();
521        for (KeyRange r1 : keyRanges) {
522            for (KeyRange r2 : keyRanges2) {
523                KeyRange r = r1.intersect(r2);
524                if (EMPTY_RANGE != r) {
525                    tmp.add(r);
526                }
527            }
528        }
{code}  

2. line 540 shoule be r = r.union(tmp.get(i)), not intersect,just as 
KeyRange.coalesce method does:

{code:borderStyle=solid} 
532        Collections.sort(tmp, KeyRange.COMPARATOR);
533        List<KeyRange> tmp2 = new ArrayList<KeyRange>();
534        KeyRange r = tmp.get(0);
535        for (int i=1; i<tmp.size(); i++) {
536            if (EMPTY_RANGE == r.intersect(tmp.get(i))) {
537                tmp2.add(r);
538                r = tmp.get(i);
539            } else {
540                r = r.intersect(tmp.get(i));
541            }
542        }
{code}

and it seems that no unit tests for  KeyRange.intersect(List<KeyRange> , 
List<KeyRange>) method. 


> KeyRange.intersect(List<KeyRange> , List<KeyRange>) is inefficient,especially 
> for join dynamic filter
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-3670
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3670
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.9.0
>            Reporter: chenglei
>
> In my business system, there is a following join SQL(which is simplified), 
> fact_table is a fact table,  joining dimension table dim_table1 and 
> dim_table2 : 
> {code:borderStyle=solid} 
> select /*+ SKIP_SCAN */ sum(t.click)  from fact_table t join dim_table1 d1 on 
>  t.cust_id=d1.id  join dim_table2 d2 on t.cust_id =d2.id  where t.date 
> between '2016-01-01' and '2017-01-01' and d1.code = 2008 and d2.region = 'us';
> {code} 
> I use /*+ SKIP_SCAN */ hint to enable join dynamic filter. For some small 
> dataset, the sql executes quickly, but when the dataset is bigger, the sql 
> become very slowly. When the  row count of fact_table is 30 
> million,dim_table1 is 300 thousand and dim_table2 is 100 thousand, the above 
> query  costs 17s.
> When I debug the SQL executing, I find RHS1 return 5523 rows:
> {code:borderStyle=solid} 
>    select d1.id from dim_table1 d1 where d1.code = 2008
> {code} 
> and RHS2 return 23881 rows: 
> {code:borderStyle=solid}
>    select d2.id from dim_table2 d2 where d2.region='us'
> {code}  
> then HashJoinPlan uses  KeyRange.intersect(List<KeyRange> , List<KeyRange> ) 
> method to compute RHS1 intersecting RHS2 for dynamic filter, narrowing down 
> fact_table.cust_id should be. 
> Surprisingly,the KeyRange.intersect method costs 11s ! although the whole sql 
> execution only costs 17s.After I read the code of  KeyRange.intersect 
> method,I find following two problem:
> 1. The double loop is inefficient in line 521 and line 522,when keyRanges  
> size is M, keyRanges2 size is N, the time complexity is O(M*N), for my 
> example,is 5523*23881: 
> {code:borderStyle=solid} 
> 519 public static List<KeyRange> intersect(List<KeyRange> keyRanges,  
> List<KeyRange> keyRanges2) {
> 520        List<KeyRange> tmp = new ArrayList<KeyRange>();
> 521        for (KeyRange r1 : keyRanges) {
> 522            for (KeyRange r2 : keyRanges2) {
> 523                KeyRange r = r1.intersect(r2);
> 524                if (EMPTY_RANGE != r) {
> 525                    tmp.add(r);
> 526                }
> 527            }
> 528        }
> {code}  
> 2. line 540 shoule be r = r.union(tmp.get( i )), not intersect,just as 
> KeyRange.coalesce method does:
> {code:borderStyle=solid} 
> 532        Collections.sort(tmp, KeyRange.COMPARATOR);
> 533        List<KeyRange> tmp2 = new ArrayList<KeyRange>();
> 534        KeyRange r = tmp.get(0);
> 535        for (int i=1; i<tmp.size(); i++) {
> 536            if (EMPTY_RANGE == r.intersect(tmp.get(i))) {
> 537                tmp2.add(r);
> 538                r = tmp.get(i);
> 539            } else {
> 540                r = r.intersect(tmp.get(i));
> 541            }
> 542        }
> {code}
> and it seems that no unit tests for  KeyRange.intersect(List<KeyRange> , 
> List<KeyRange>) method. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Updated] (PHOENIX-3670) KeyRange.intersect(List , List) is inefficient,especially for join dynamic filter

Reply via email to