[jira] [Commented] (PHOENIX-6608) DISCUSS: Rethink MapReduce split generation

Istvan Toth (Jira) Thu, 09 Dec 2021 22:22:06 -0800


    [ 
https://issues.apache.org/jira/browse/PHOENIX-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456903#comment-17456903
 ]


Istvan Toth commented on PHOENIX-6608:
--------------------------------------

Yes, this job runs against a petabyte-sized table, and it uses JDBC.
One solution that we are considering is to disable loading guideposts and 
split by regions only.

> DISCUSS: Rethink MapReduce split generation
> -------------------------------------------
>
>                 Key: PHOENIX-6608
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-6608
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Lars Hofhansl
>            Priority: Major
>
> I just ran into an issue with Trino, which uses Phoenix' M/R integration to 
> generate splits for its worker nodes.
> See: [https://github.com/trinodb/trino/issues/10143]
> And a fix: [https://github.com/trinodb/trino/pull/10153]
> In short, the issue is that with a large data size and guideposts enabled 
> (the default), Phoenix' RoundRobinResultIterator starts scanning as soon as 
> tasks are submitted to the queue. For large datasets (per client) this fills 
> the heap with pre-fetched HBase result objects.
> The MapReduce (and Spark) integrations presumably have the same issue.
> My proposed solution is to create more splits instead of allowing Phoenix 
> to do intra-split parallelism (the fix above groups 20 scans into a split; 
> 20 turned out to be a good number).
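
The grouping idea in the description above (many small splits rather than a few large ones that are parallelized internally) can be sketched roughly as follows. This is only an illustration and not the code from the Trino fix or from Phoenix: the class name ScanGrouping, the method group, and the use of a plain generic list in place of real scan objects are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Phoenix or Trino.
final class ScanGrouping {

    // Partition the scans, in their original order, into consecutive groups
    // of at most scansPerSplit elements. Each group would back one split.
    static <T> List<List<T>> group(List<T> scans, int scansPerSplit) {
        if (scansPerSplit <= 0) {
            throw new IllegalArgumentException("scansPerSplit must be positive");
        }
        List<List<T>> splits = new ArrayList<>();
        for (int start = 0; start < scans.size(); start += scansPerSplit) {
            int end = Math.min(start + scansPerSplit, scans.size());
            // Copy the sub-list so each group is independent of the input list.
            splits.add(new ArrayList<>(scans.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Stand-in "scans": integers 0..44 instead of real scan objects.
        List<Integer> scans = new ArrayList<>();
        for (int i = 0; i < 45; i++) {
            scans.add(i);
        }
        // 20 per split is the value mentioned in the issue description.
        for (List<Integer> split : group(scans, 20)) {
            System.out.println("split with " + split.size() + " scans");
        }
    }
}
```

With 45 stand-in scans and 20 per split, this yields three groups of 20, 20 and 5 scans. The intent is that each worker then reads only the scans in its own group, so the amount of pre-fetched data held per client is bounded by the group size rather than by the whole table.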



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
