[ https://issues.apache.org/jira/browse/PHOENIX-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632007#comment-14632007 ]
Ankit Singhal edited comment on PHOENIX-2126 at 7/17/15 10:37 PM:
------------------------------------------------------------------
I have implemented the above approach and found a great improvement.
||cardinality||improvement||
|High cardinality (10k+)|3x faster|
|Low cardinality (10-100)|1.31x faster|
I have created a patch for phoenix-4.2 with a multi-threaded, min-heap implementation of merge sort, but I am facing an issue in automating the number of threads from the salt-bucket count: in the ScanPlan class, table.getBucketNum() returns 'null' even for a salted table when run on an HBase cluster, yet it somehow works fine during the local integration test.
Regards,
Ankit Singhal

> Improving performance of merge sort by multi-threaded implementation
> --------------------------------------------------------------------
>
>                 Key: PHOENIX-2126
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2126
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.1.0, 4.2.0
>            Reporter: Ankit Singhal
>         Attachments: PHOENIX-2126_v1.0.patch
>
> {code}
> CREATE TABLE IF NOT EXISTS DSP_VIEW_DAILY1 (
>     dim1 INTEGER NOT NULL,
>     A.B INTEGER,
>     A.M DECIMAL,
>     CONSTRAINT PK PRIMARY KEY (dim1))
>     SALT_BUCKETS = 256, DEFAULT_COLUMN_FAMILY = 'A';
> {code}
> *Query to benchmark:*
> {code}
> select dim1, sum(b), sum(m) from table
> where Datemth >= 201505 and Datemth <= 201505 and dim1 IS NOT NULL
> group by dim1 order by sum(m) desc nulls last limit 10;
> {code}
> *Current scenario:*
> *CASE 1:* Consider the case when dim1 is a high-cardinality attribute (10K+) and the table has SALT_BUCKETS set to 256. The above query yields 256 iterators at the client, and MergeSortRowKeyResultIterator has to merge these 256 iterators with a single thread. If each iterator returns 10K tuples, the merge sort must merge 2.5M tuples, which is costly with a single thread, and the query spends most of its time on the client.
> *CASE 2:* Consider the case when dim1 is a high-cardinality attribute (10K+) and the table has SALT_BUCKETS set to 1. The above query yields 1 iterator at the client, so MergeSortRowKeyResultIterator does not need to merge anything. Here, a single thread is fine.
> *CASE 3:* Consider the case when dim1 is a low-cardinality attribute (10-100) and the table has SALT_BUCKETS set to 256. The above query yields 256 iterators at the client, and MergeSortRowKeyResultIterator has to merge these 256 iterators with a single thread. A single thread is also fine here, as it only has to merge 2,560 tuples.
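The min-heap merge that addresses CASE 1 can be sketched roughly as follows. This is a hypothetical, standalone Java illustration, not the actual Phoenix patch: the class name MinHeapMerge and the use of plain Integer rows instead of Phoenix's Tuple/ResultIterator types are my own simplifications.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch (not the Phoenix patch): merging n sorted iterators
// with a min-heap, so producing each of the n*m tuples costs O(log n)
// instead of scanning all n iterator heads (O(n) each, O(n^2*m) in total).
public class MinHeapMerge {

    // Heap entry: the current head value of one iterator plus the iterator itself.
    private record Head(Integer value, Iterator<Integer> rest) {}

    public static List<Integer> merge(List<Iterator<Integer>> iterators) {
        PriorityQueue<Head> heap =
                new PriorityQueue<>(Comparator.comparing(Head::value));
        for (Iterator<Integer> it : iterators) {
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it)); // seed the heap with each head
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head smallest = heap.poll();          // O(log n)
            merged.add(smallest.value());
            if (smallest.rest().hasNext()) {      // refill from the same iterator
                heap.add(new Head(smallest.rest().next(), smallest.rest()));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Iterator<Integer>> its = List.of(
                List.of(1, 4, 7).iterator(),
                List.of(2, 5, 8).iterator(),
                List.of(3, 6, 9).iterator());
        System.out.println(merge(its)); // [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

With n iterators and m tuples each, every tuple costs one heap poll and at most one insert, giving the O(nm log n) bound quoted in the solution below.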
> *Solution for the CASE 1 problem:*
> Optimize the merging of 'n' sorted iterators (each having 'm' tuples) by using a "min heap", which improves the time complexity from O(n^2*m) to O(nm log n) (as each heap operation takes O(log n) time).
> And by using multiple threads ('t') to process groups of iterators, the cost is further reduced to
> T(nm) = T(nm)/t + T(t)
> i.e., the single-threaded merge time divided across the 't' threads, plus the cost of merging the 't' partial results.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
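The two-level scheme behind the T(nm) = T(nm)/t + T(t) estimate can be sketched as follows. Again a hypothetical, standalone illustration rather than the Phoenix patch: ParallelMerge, the fixed thread pool, and the plain Integer lists are my own simplifications. Each of 't' tasks min-heap-merges its group of sorted lists, and the caller merges the 't' partial runs at the end.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelMerge {

    // Single-threaded k-way merge via a min-heap (same idea as the CASE 1 fix).
    static List<Integer> heapMerge(List<List<Integer>> lists) {
        record Head(int value, int listIdx, int pos) {}
        PriorityQueue<Head> heap =
                new PriorityQueue<>(Comparator.comparingInt(Head::value));
        for (int i = 0; i < lists.size(); i++) {
            if (!lists.get(i).isEmpty()) {
                heap.add(new Head(lists.get(i).get(0), i, 0));
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head h = heap.poll();
            out.add(h.value());
            List<Integer> src = lists.get(h.listIdx());
            if (h.pos() + 1 < src.size()) {
                heap.add(new Head(src.get(h.pos() + 1), h.listIdx(), h.pos() + 1));
            }
        }
        return out;
    }

    // Two-level merge: t threads each merge ~n/t lists, then the caller
    // merges the t partial runs -- the T(nm)/t + T(t) shape described above.
    static List<Integer> parallelMerge(List<List<Integer>> lists, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int groupSize = (lists.size() + threads - 1) / threads;
            List<Future<List<Integer>>> partials = new ArrayList<>();
            for (int start = 0; start < lists.size(); start += groupSize) {
                List<List<Integer>> group =
                        lists.subList(start, Math.min(start + groupSize, lists.size()));
                partials.add(pool.submit(() -> heapMerge(group))); // one group per task
            }
            List<List<Integer>> runs = new ArrayList<>();
            for (Future<List<Integer>> f : partials) {
                runs.add(f.get());
            }
            return heapMerge(runs); // final merge of the t sorted runs
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> lists = List.of(
                List.of(1, 5, 9), List.of(2, 6), List.of(3, 7), List.of(4, 8));
        System.out.println(parallelMerge(lists, 2)); // [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

The final single-threaded merge over only 't' runs is the T(t) term; it stays cheap as long as 't' is much smaller than the salt-bucket count 'n'.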