[ https://issues.apache.org/jira/browse/PHOENIX-3670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
chenglei updated PHOENIX-3670: ------------------------------ Summary: KeyRange.intersect(List<KeyRange> , List<KeyRange>) is inefficient,especially for join dynamic filter (was: KeyRange.intersect(List<KeyRange> , List<KeyRange> ) is inefficient,especially for join dynamic filter) > KeyRange.intersect(List<KeyRange> , List<KeyRange>) is inefficient,especially > for join dynamic filter > ----------------------------------------------------------------------------------------------------- > > Key: PHOENIX-3670 > URL: https://issues.apache.org/jira/browse/PHOENIX-3670 > Project: Phoenix > Issue Type: Improvement > Affects Versions: 4.9.0 > Reporter: chenglei > > In my business system, there is a following join SQL(which is simplified), > fact_table is a fact table, joining dimension table dim_table1 and > dim_table2 : > {code:borderStyle=solid} > select /*+ SKIP_SCAN */ sum(t.click) from fact_table t join dim_table1 d1 on > t.cust_id=d1.id join dim_table2 d2 on t.cust_id =d2.id where t.date > between '2016-01-01' and '2017-01-01' and d1.code = 2008 and d2.region = 'us'; > {code} > I use /*+ SKIP_SCAN */ hint to enable join dynamic filter. For some small > dataset, the sql executes quickly, but when the dataset is bigger, the sql > become very slowly. When the row count of fact_table is 30 > million,dim_table1 is 300 thousand and dim_table2 is 100 thousand, the above > query costs 17s. > When I debug the SQL executing, I find RHS1 return 5523 rows: > {code:borderStyle=solid} > select d1.id from dim_table1 d1 where d1.code = 2008 > {code} > and RHS2 return 23881 rows: > {code:borderStyle=solid} > select d2.id from dim_table2 d2 where d2.region='us' > {code} > then HashJoinPlan uses KeyRange.intersect(List<KeyRange> , List<KeyRange> ) > method to compute RHS1 intersecting RHS2 for dynamic filter, narrowing down > fact_table.cust_id should be. > Surprisingly,the KeyRange.intersect method costs 11s ! although the whole sql > execution only costs 17s.After I read the code of KeyRange.intersect > method,I find following two problem: > 1. The double loop is inefficient in line 521 and line 522,when keyRanges > size is M, keyRanges2 size is N, the time complexity is O(M*N), for my > example,is 5523*23881: > {code:borderStyle=solid} > 519 public static List<KeyRange> intersect(List<KeyRange> keyRanges, > List<KeyRange> keyRanges2) { > 520 List<KeyRange> tmp = new ArrayList<KeyRange>(); > 521 for (KeyRange r1 : keyRanges) { > 522 for (KeyRange r2 : keyRanges2) { > 523 KeyRange r = r1.intersect(r2); > 524 if (EMPTY_RANGE != r) { > 525 tmp.add(r); > 526 } > 527 } > 528 } > {code} > 2. line 540 shoule be r = r.union(tmp.get(i)), not intersect,just as > KeyRange.coalesce method does: > {code:borderStyle=solid} > 532 Collections.sort(tmp, KeyRange.COMPARATOR); > 533 List<KeyRange> tmp2 = new ArrayList<KeyRange>(); > 534 KeyRange r = tmp.get(0); > 535 for (int i=1; i<tmp.size(); i++) { > 536 if (EMPTY_RANGE == r.intersect(tmp.get(i))) { > 537 tmp2.add(r); > 538 r = tmp.get(i); > 539 } else { > 540 r = r.intersect(tmp.get(i)); > 541 } > 542 } > {code} > and it seems that no unit tests for KeyRange.intersect(List<KeyRange> , > List<KeyRange>) method. -- This message was sent by Atlassian JIRA (v6.3.15#6346)