[ https://issues.apache.org/jira/browse/PHOENIX-3670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
chenglei updated PHOENIX-3670: ------------------------------ Description: In my business system, there is a following join SQL(which is simplified), fact_table is a fact table, joining dimension table dim_table1 and dim_table2 : {code:borderStyle=solid} select /*+ SKIP_SCAN */ sum(t.click) from fact_table t join dim_table1 d1 on t.cust_id=d1.id join dim_table2 d2 on t.cust_id =d2.id where t.date between '2016-01-01' and '2017-01-01' and d1.code = 2008 and d2.region = 'us'; {code} I use /*+ SKIP_SCAN */ hint to enable join dynamic filter. For some small dataset, the sql executes quickly, but when the dataset is bigger, the sql become very slowly. When the row count of fact_table is 30 million,dim_table1 is 300 thousand and dim_table2 is 100 thousand, the above query costs 17s. When I debug the SQL executing, I find RHS1 return 5523 rows: {code:borderStyle=solid} select d1.id from dim_table1 d1 where d1.code = 2008 {code} and RHS2 return 23881 rows: {code:borderStyle=solid} select d2.id from dim_table2 d2 where d2.region='us' {code} then HashJoinPlan uses KeyRange.intersect(List<KeyRange> , List<KeyRange> ) method to compute RHS1 intersecting RHS2 for dynamic filter, narrowing down fact_table.cust_id should be. Surprisingly,the KeyRange.intersect method costs 11s ! although the whole sql execution only costs 17s.After I read the code of KeyRange.intersect method,I find following two problem: 1. The double loop is inefficient in line 521 and line 522,when keyRanges size is M, keyRanges2 size is N, the time complexity is O(M*N), for my example,is 5523*23881: {code:borderStyle=solid} 519 public static List<KeyRange> intersect(List<KeyRange> keyRanges, List<KeyRange> keyRanges2) { 520 List<KeyRange> tmp = new ArrayList<KeyRange>(); 521 for (KeyRange r1 : keyRanges) { 522 for (KeyRange r2 : keyRanges2) { 523 KeyRange r = r1.intersect(r2); 524 if (EMPTY_RANGE != r) { 525 tmp.add(r); 526 } 527 } 528 } {code} 2. line 540 shoule be r = r.union(tmp.get( i )), not intersect,just as KeyRange.coalesce method does: {code:borderStyle=solid} 532 Collections.sort(tmp, KeyRange.COMPARATOR); 533 List<KeyRange> tmp2 = new ArrayList<KeyRange>(); 534 KeyRange r = tmp.get(0); 535 for (int i=1; i<tmp.size(); i++) { 536 if (EMPTY_RANGE == r.intersect(tmp.get(i))) { 537 tmp2.add(r); 538 r = tmp.get(i); 539 } else { 540 r = r.intersect(tmp.get(i)); 541 } 542 } {code} and it seems that no unit tests for KeyRange.intersect(List<KeyRange> , List<KeyRange>) method. was: In my business system, there is a following join SQL(which is simplified), fact_table is a fact table, joining dimension table dim_table1 and dim_table2 : {code:borderStyle=solid} select /*+ SKIP_SCAN */ sum(t.click) from fact_table t join dim_table1 d1 on t.cust_id=d1.id join dim_table2 d2 on t.cust_id =d2.id where t.date between '2016-01-01' and '2017-01-01' and d1.code = 2008 and d2.region = 'us'; {code} I use /*+ SKIP_SCAN */ hint to enable join dynamic filter. For some small dataset, the sql executes quickly, but when the dataset is bigger, the sql become very slowly. When the row count of fact_table is 30 million,dim_table1 is 300 thousand and dim_table2 is 100 thousand, the above query costs 17s. When I debug the SQL executing, I find RHS1 return 5523 rows: {code:borderStyle=solid} select d1.id from dim_table1 d1 where d1.code = 2008 {code} and RHS2 return 23881 rows: {code:borderStyle=solid} select d2.id from dim_table2 d2 where d2.region='us' {code} then HashJoinPlan uses KeyRange.intersect(List<KeyRange> , List<KeyRange> ) method to compute RHS1 intersecting RHS2 for dynamic filter, narrowing down fact_table.cust_id should be. Surprisingly,the KeyRange.intersect method costs 11s ! although the whole sql execution only costs 17s.After I read the code of KeyRange.intersect method,I find following two problem: 1. The double loop is inefficient in line 521 and line 522,when keyRanges size is M, keyRanges2 size is N, the time complexity is O(M*N), for my example,is 5523*23881: {code:borderStyle=solid} 519 public static List<KeyRange> intersect(List<KeyRange> keyRanges, List<KeyRange> keyRanges2) { 520 List<KeyRange> tmp = new ArrayList<KeyRange>(); 521 for (KeyRange r1 : keyRanges) { 522 for (KeyRange r2 : keyRanges2) { 523 KeyRange r = r1.intersect(r2); 524 if (EMPTY_RANGE != r) { 525 tmp.add(r); 526 } 527 } 528 } {code} 2. line 540 shoule be r = r.union(tmp.get(i)), not intersect,just as KeyRange.coalesce method does: {code:borderStyle=solid} 532 Collections.sort(tmp, KeyRange.COMPARATOR); 533 List<KeyRange> tmp2 = new ArrayList<KeyRange>(); 534 KeyRange r = tmp.get(0); 535 for (int i=1; i<tmp.size(); i++) { 536 if (EMPTY_RANGE == r.intersect(tmp.get(i))) { 537 tmp2.add(r); 538 r = tmp.get(i); 539 } else { 540 r = r.intersect(tmp.get(i)); 541 } 542 } {code} and it seems that no unit tests for KeyRange.intersect(List<KeyRange> , List<KeyRange>) method. > KeyRange.intersect(List<KeyRange> , List<KeyRange>) is inefficient,especially > for join dynamic filter > ----------------------------------------------------------------------------------------------------- > > Key: PHOENIX-3670 > URL: https://issues.apache.org/jira/browse/PHOENIX-3670 > Project: Phoenix > Issue Type: Improvement > Affects Versions: 4.9.0 > Reporter: chenglei > > In my business system, there is a following join SQL(which is simplified), > fact_table is a fact table, joining dimension table dim_table1 and > dim_table2 : > {code:borderStyle=solid} > select /*+ SKIP_SCAN */ sum(t.click) from fact_table t join dim_table1 d1 on > t.cust_id=d1.id join dim_table2 d2 on t.cust_id =d2.id where t.date > between '2016-01-01' and '2017-01-01' and d1.code = 2008 and d2.region = 'us'; > {code} > I use /*+ SKIP_SCAN */ hint to enable join dynamic filter. For some small > dataset, the sql executes quickly, but when the dataset is bigger, the sql > become very slowly. When the row count of fact_table is 30 > million,dim_table1 is 300 thousand and dim_table2 is 100 thousand, the above > query costs 17s. > When I debug the SQL executing, I find RHS1 return 5523 rows: > {code:borderStyle=solid} > select d1.id from dim_table1 d1 where d1.code = 2008 > {code} > and RHS2 return 23881 rows: > {code:borderStyle=solid} > select d2.id from dim_table2 d2 where d2.region='us' > {code} > then HashJoinPlan uses KeyRange.intersect(List<KeyRange> , List<KeyRange> ) > method to compute RHS1 intersecting RHS2 for dynamic filter, narrowing down > fact_table.cust_id should be. > Surprisingly,the KeyRange.intersect method costs 11s ! although the whole sql > execution only costs 17s.After I read the code of KeyRange.intersect > method,I find following two problem: > 1. The double loop is inefficient in line 521 and line 522,when keyRanges > size is M, keyRanges2 size is N, the time complexity is O(M*N), for my > example,is 5523*23881: > {code:borderStyle=solid} > 519 public static List<KeyRange> intersect(List<KeyRange> keyRanges, > List<KeyRange> keyRanges2) { > 520 List<KeyRange> tmp = new ArrayList<KeyRange>(); > 521 for (KeyRange r1 : keyRanges) { > 522 for (KeyRange r2 : keyRanges2) { > 523 KeyRange r = r1.intersect(r2); > 524 if (EMPTY_RANGE != r) { > 525 tmp.add(r); > 526 } > 527 } > 528 } > {code} > 2. line 540 shoule be r = r.union(tmp.get( i )), not intersect,just as > KeyRange.coalesce method does: > {code:borderStyle=solid} > 532 Collections.sort(tmp, KeyRange.COMPARATOR); > 533 List<KeyRange> tmp2 = new ArrayList<KeyRange>(); > 534 KeyRange r = tmp.get(0); > 535 for (int i=1; i<tmp.size(); i++) { > 536 if (EMPTY_RANGE == r.intersect(tmp.get(i))) { > 537 tmp2.add(r); > 538 r = tmp.get(i); > 539 } else { > 540 r = r.intersect(tmp.get(i)); > 541 } > 542 } > {code} > and it seems that no unit tests for KeyRange.intersect(List<KeyRange> , > List<KeyRange>) method. -- This message was sent by Atlassian JIRA (v6.3.15#6346)