[
https://issues.apache.org/jira/browse/PHOENIX-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622586#comment-14622586
]
Eli Levine commented on PHOENIX-1779:
-------------------------------------
Going through some code that uses row-value constructors got me thinking: How
does the fact that rows are no longer guaranteed to be returned in rowkey order
impact row-value constructors in Phoenix in general? At the end of
http://phoenix.apache.org/paged.html we suggest the user grab values from the
last row processed and use them in the next RVC call. After PHOENIX-1779 this
is no longer guaranteed to work, right? Does the optimization for PHOENIX-1779
make sense with RVC at all? I see a few options:
1. Force user to supply ORDER BY whenever they use RVCs. Seems pretty onerous.
2. Don't do PHOENIX-1779's optimization in the presence of RVCs.
3. Instruct user to use previous result's largest (or lowest, depending on PK
sort order) PK value seen, instead of just grabbing values from last row to use
in RVC. Also pretty onerous for users IMHO.
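
To make option 3 concrete, here is a minimal sketch of what the caller would have to do, assuming a hypothetical table whose PK is (tenant_id, event_id) with both columns ascending. The rows, column names and helper functions are made up for illustration and are not Phoenix APIs. Python's tuple comparison is used as a stand-in for RVC comparison, which only holds if the client-side ordering of the values matches Phoenix's row-key ordering.

```python
# Hypothetical rows from one page, NOT in row-key order
# (as may happen once results are no longer returned in row-key order).
# Each row is (tenant_id, event_id, payload); the first two columns
# are assumed to form the primary key.
rows = [
    ("t1", 7, "c"),
    ("t1", 3, "a"),
    ("t2", 1, "d"),
    ("t1", 9, "b"),
]


def pk_of(row, pk_width=2):
    """Return the leading PK columns of a row as a tuple."""
    return tuple(row[:pk_width])


def next_page_anchor(rows, pk_width=2, descending=False):
    """Return the PK tuple to plug into the next RVC, or None for an empty page.

    Tuples compare lexicographically, column by column, which mirrors how
    (a, b) > (x, y) is evaluated for a row-value constructor.
    """
    if not rows:
        return None
    keys = [pk_of(r, pk_width) for r in rows]
    return min(keys) if descending else max(keys)


last_row_pk = pk_of(rows[-1])   # what "grab the last row" would use
anchor = next_page_anchor(rows)  # what option 3 asks the user to compute
```

In this made-up page the last row's PK is ("t1", 9) while the largest PK seen is ("t2", 1), so the two approaches produce different RVC values. The sketch only covers picking the anchor; it says nothing about whether an unordered page is contiguous in key space.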
Imagine this simple use case: somebody is writing code for paging over Phoenix
results. The first query does not use RVCs. Subsequent queries, if any, will
use RVCs with values filled in based on previous results. Ideally, caller could
tell Phoenix to "run this query in Phoenix with or without RVCs and return
results in row-key order" because they want to use the results for paging and
easily grab the last PK values to use for subsequent RVCs.
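
A rough sketch of that paging pattern, assuming a hypothetical EVENTS table with PK columns (TENANT_ID, EVENT_ID). The helper name and the table are illustrative only. It spells out ORDER BY on the PK columns, which is the explicit way to get row-key-ordered pages without relying on any hint, and it adds the RVC predicate only from the second page onward.

```python
def build_page_query(table, pk_cols, limit, after=None):
    """Build (sql, params) for one page.

    `after` is the PK tuple of the last row of the previous page,
    or None for the first page (which then carries no RVC).
    """
    cols = ", ".join(pk_cols)
    sql = f"SELECT * FROM {table}"
    params = []
    if after is not None:
        if len(after) != len(pk_cols):
            raise ValueError("anchor must supply one value per PK column")
        placeholders = ", ".join("?" for _ in pk_cols)
        # Row-value constructor: resume strictly after the previous page.
        sql += f" WHERE ({cols}) > ({placeholders})"
        params = list(after)
    # Explicit ORDER BY so the last row of each page holds the largest PK.
    sql += f" ORDER BY {cols} LIMIT {limit}"
    return sql, params


pk = ["TENANT_ID", "EVENT_ID"]
first_sql, first_params = build_page_query("EVENTS", pk, 20)
next_sql, next_params = build_page_query("EVENTS", pk, 20, after=("t1", 9))
```

The returned string and parameter list would be handed to a prepared statement by whatever client is in use. With the explicit ORDER BY the caller can keep grabbing the last row's PK values, at the cost of spelling out the ordering in every paged query.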
Maybe the right thing to do is: (1) Force row-key ordered results in the
presence of RVCs and (2) Allow users to pass in a query hint that forces
ordered results for use in the first paged query with no RVCs.
[~samarthjain], [~jamestaylor], thoughts?
CC [~jfernando_sfdc]
> Parallelize fetching of next batch of records for scans corresponding to
> queries with no order by
> --------------------------------------------------------------------------------------------------
>
> Key: PHOENIX-1779
> URL: https://issues.apache.org/jira/browse/PHOENIX-1779
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Samarth Jain
> Assignee: Samarth Jain
> Fix For: 5.0.0, 4.4.0
>
> Attachments: PHOENIX-1779.patch, PHOENIX-1779_v2.patch,
> PHOENIX-1779_v3.patch, wip.patch, wip3.patch, wipwithsplits.patch
>
>
> Today in Phoenix we parallelize the first execution of scans i.e. we load
> only the first batch of records up to the scan's cache size in parallel.
> Loading of subsequent batches of records in scanners is essentially serial.
> This could be improved especially for queries, including the ones with no
> order by clauses, that do not need any kind of merge sort on the client.
> This could also potentially improve the performance of UPSERT SELECT
> statements that load data from one table and insert into another. One such
> use case being creating immutable indexes for tables that already have data.
> It could also potentially improve the performance of our MapReduce solution
> for bulk loading data by improving the speed of the loading/mapping phase.