[ 
https://issues.apache.org/jira/browse/PHOENIX-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632007#comment-14632007
 ] 

Ankit Singhal edited comment on PHOENIX-2126 at 7/17/15 10:45 PM:
------------------------------------------------------------------

PFB, the improvement observed on table of 150B records by using multithreaded 
mergesort with minheap.

||cardinality||improvement||
|High cardinality(10k+) |3x faster|
|Low cardinality(10-100)        |1.31x faster|

I have created a patch for above changes but facing a issue in automating 
number of threads by using salt bucket as in ScanPlan class , I am getting 
table.getBucketNum() as always 'null' even for salted table when the query run 
on hbase cluster but somehow on local integration cluster it returns expected 
salt buckets for the table.

Regards,
Ankit Singhal


was (Author: ankit.singhal):
I have implemented the above approach and found great improvement.
||cardinality||improvement||
|High cardinality(10k+) |3x faster|
|Low cardinality(10-100)        |1.31x faster|

I have created a patch for phoenix-4.2 with multi-threaded and minheap 
implementation of merge-sort but facing a issue in automating number of threads 
by using salt bucket as in ScanPlan class , I am getting table.getBucketNum() 
as 'null' even for salted table when run on HBase cluster but somehow it runs 
fine during local integration test.

Regards,
Ankit Singhal

> Improving performance of merge sort by multi-threaded implementation
> --------------------------------------------------------------------
>
>                 Key: PHOENIX-2126
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2126
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.1.0, 4.2.0
>            Reporter: Ankit Singhal
>         Attachments: PHOENIX-2126_v1.0.patch
>
>
> {code}
> CREATE TABLE IF NOT EXISTS DSP_VIEW_DAILY1 (
> dim1 INTEGER NOT NULL,
> A.B INTEGER,
> A.M DECIMAL,
> CONSTRAINT PK PRIMARY KEY
> (dim1))
> SALT_BUCKETS =256,DEFAULT_COLUMN_FAMILY='A';
> {code}
> *Query to benchmark:-*
> {code}
> select dim1,sum(b),sum(m) from table where Datemth>=201505 and 
> Datemth<=201505 and dim1 IS NOT NULL  group by dim1 order by sum(m) desc 
> nulls last limit 10;
> {code}
> *current scenario:-*
> *CASE 1: * consider the case when dim1 is high cardinality attribute (10K+) 
> and table have salt bucket set to 256, we will get 256 iterators from above 
> query at the client and MergeSortRowKeyResultIterator has to merge these 256 
> iterators with single thread. So let's say each iterator has 10k tuples 
> returned, then merge sort needs to merge 2.5M tuples which will be costly if 
> it is done with single thread and the query spend most of its time on client
> *CASE 2: * consider the case when dim1 is high cardinality attribute (10K+) 
> and table have salt bucket set to 1, we will get 1 iterator from  above query 
> at the client and MergeSortRowKeyResultIterator doesn't need to merge 
> anything. Here, it is fine with single thread.
> *CASE 3: * consider the case when dim1 is low cardinality attribute (10-100) 
> and table have salt bucket set to 256, we will get 256 iterator from  above 
> query at the client and MergeSortRowKeyResultIterator has to merge these 256 
> iterators with single thread. here the single thread is also fine as he has 
> to merge only 2560 tuples.
> *Solution for case1 problem is:-*
> Optimized the implementation of merging 'n'-sorted iterators(having 'm' 
> tuples)  by using "min heap" which optimizes the time complexity from 
> O(n2m) to O(nmLogn) (as heapify takes (Logn) time).
> And, By using multiple-threads('t') to process group of iterators which 
> further optimized the complexity to 
> T(nm)=T(nm)/t+T(t)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to