[jira] [Commented] (PHOENIX-1779) Parallelize fetching of next batch of records for scans corresponding to queries with no order by

Eli Levine (JIRA) Fri, 10 Jul 2015 09:58:41 -0700

    [ 
https://issues.apache.org/jira/browse/PHOENIX-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622586#comment-14622586
 ]


Eli Levine commented on PHOENIX-1779:
-------------------------------------

Going through some code that uses row-value constructors got me thinking: How 
does the fact that rows are no longer guaranteed to be returned in rowkey order 
impact row-value constructors in Phoenix in general? At the end of 
http://phoenix.apache.org/paged.html we suggest the user grab values from the 
last row processed and use them in the next RVC call. After PHOENIX-1779 this 
is no longer guaranteed to work, right? Does the optimization for PHOENIX-1779 
make sense with RVC at all? I see a few options:
1. Force user to supply ORDER BY whenever they use RVCs. Seems pretty onerous.
2. Don't do PHOENIX-1779's optimization in the presence of RVCs.
3. Instruct user to use the largest (or lowest, depending on PK sort order) PK 
value seen in the previous result, instead of just grabbing values from the 
last row to use in the RVC. Also pretty onerous for users IMHO.
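
To make option 3 concrete, here is a rough sketch of what the caller would 
have to do on their side. It assumes a composite PK of ascending string 
columns, and the class and method names (RvcPaging, compareKeys, largestKey) 
are made up for illustration, not Phoenix APIs. Java's String.compareTo is 
used as a stand-in for Phoenix's rowkey ordering, which may not match it in 
every case. Whether the resulting bound is enough for correct paging also 
depends on which rows the unordered query returned, which the sketch does not 
address.

```java
import java.util.Arrays;
import java.util.List;

public class RvcPaging {
    // Column-by-column comparison of two composite keys, mirroring how an
    // RVC comparison such as (a, b) > (?, ?) behaves for ascending columns.
    // Assumption: String.compareTo approximates the rowkey sort order.
    public static int compareKeys(String[] x, String[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int c = x[i].compareTo(y[i]);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(x.length, y.length);
    }

    // Scan a batch that may arrive in any order and keep the largest PK
    // seen, instead of trusting the last row returned.
    public static String[] largestKey(List<String[]> batch) {
        String[] max = null;
        for (String[] key : batch) {
            if (max == null || compareKeys(key, max) > 0) {
                max = key;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // Rows deliberately out of rowkey order; the last row is NOT the largest.
        List<String[]> batch = Arrays.asList(
            new String[] {"b", "2"},
            new String[] {"c", "1"},
            new String[] {"a", "9"});
        System.out.println(Arrays.toString(largestKey(batch))); // prints [c, 1]
    }
}
```

The values returned by largestKey would then be bound into the next RVC 
instead of the values from the last row.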

Imagine this simple use case: somebody is writing code for paging over Phoenix 
results. The first query does not use RVCs. Subsequent queries, if any, will 
use RVCs with values filled in based on previous results. Ideally, the caller 
could tell Phoenix to "run this query with or without RVCs and return results 
in row-key order" because they want to use the results for paging and easily 
grab the last PK values to use for subsequent RVCs.
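
For reference, the one way a caller can ask for that ordering today is an 
explicit ORDER BY on the PK columns, which is what option 1 amounts to. A 
minimal sketch of the two query shapes follows. The table name LIBRARY, the PK 
columns TITLE and AUTHOR, and the helper PagedQuery.pageSql are all 
hypothetical and only serve to show the strings a paging client would build.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PagedQuery {
    // Builds the SQL for one page. The first page has no RVC; later pages
    // compare the PK columns against bind values taken from the prior page.
    // An explicit ORDER BY on the PK columns asks for row-key order.
    public static String pageSql(String table, List<String> pkCols,
                                 int pageSize, boolean firstPage) {
        String cols = String.join(", ", pkCols);
        StringBuilder sql = new StringBuilder("SELECT * FROM ").append(table);
        if (!firstPage) {
            // One "?" placeholder per PK column, e.g. (?, ?)
            String binds = String.join(", ",
                Collections.nCopies(pkCols.size(), "?"));
            sql.append(" WHERE (").append(cols).append(") > (")
               .append(binds).append(")");
        }
        sql.append(" ORDER BY ").append(cols)
           .append(" LIMIT ").append(pageSize);
        return sql.toString();
    }

    public static void main(String[] args) {
        List<String> pk = Arrays.asList("TITLE", "AUTHOR");
        System.out.println(pageSql("LIBRARY", pk, 20, true));
        System.out.println(pageSql("LIBRARY", pk, 20, false));
    }
}
```

With the ORDER BY in place, the last row of each page carries the PK values to 
bind into the next query.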

Maybe the right thing to do is: (1) Force row-key ordered results in the 
presence of RVCs and (2) Allow users to pass in a query hint that forces 
ordered results for use in the first paged query with no RVCs.

[~samarthjain], [~jamestaylor], thoughts?

CC [~jfernando_sfdc]

> Parallelize fetching of next batch of records for scans corresponding to 
> queries with no order by 
> --------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-1779
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1779
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Samarth Jain
>            Assignee: Samarth Jain
>             Fix For: 5.0.0, 4.4.0
>
>         Attachments: PHOENIX-1779.patch, PHOENIX-1779_v2.patch, 
> PHOENIX-1779_v3.patch, wip.patch, wip3.patch, wipwithsplits.patch
>
>
> Today in Phoenix we parallelize only the first execution of scans, i.e. we 
> load only the first batch of records, up to the scan's cache size, in 
> parallel. Loading of subsequent batches of records in scanners is 
> essentially serial. This could be improved, especially for queries that do 
> not need any kind of merge sort on the client, including the ones with no 
> order by clauses. This could also potentially improve the performance of 
> UPSERT SELECT statements that load data from one table and insert into 
> another. One such use case is creating immutable indexes for tables that 
> already have data. It could also potentially improve the performance of our 
> MapReduce solution for bulk loading data by improving the speed of the 
> loading/mapping phase. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)