Lars Hofhansl created PHOENIX-6608:
--------------------------------------

             Summary: DISCUSS: Rethink MapReduce split generation
                 Key: PHOENIX-6608
                 URL: https://issues.apache.org/jira/browse/PHOENIX-6608
             Project: Phoenix
          Issue Type: Improvement
            Reporter: Lars Hofhansl


I just ran into an issue with Trino, which uses Phoenix' M/R integration to 
generate splits for its worker nodes.

See: [https://github.com/trinodb/trino/issues/10143]

And a fix: [https://github.com/trinodb/trino/pull/10153]

In short, the issue is that with large data sizes and guideposts enabled 
(the default), Phoenix' RoundRobinResultIterator starts scanning when tasks are 
submitted to the queue. For large datasets (per client) this fills the heap 
with pre-fetched HBase result objects.

The MapReduce (and Spark) integrations presumably have the same issue.

My proposed solution is that, instead of allowing Phoenix to do intra-split 
parallelism, we create more splits (the fix above groups 20 scans into a split - 
20 turned out to be a good number).
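
To make the grouping idea concrete, here is a minimal sketch of chunking a 
flat list of scans into splits of at most 20. It is not the code from the 
Trino fix and does not use any Phoenix API: the class ScanGrouper, the method 
group, and the constant SCANS_PER_SPLIT are hypothetical names, and plain 
integers stand in for the real scan objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Phoenix or Trino.
final class ScanGrouper {
    // Assumed default, taken from the "20 scans per split" figure above.
    static final int SCANS_PER_SPLIT = 20;

    // Chunks the scans, in order, into groups of at most scansPerSplit.
    // Each group would back one input split, so a worker holds results
    // for only a bounded number of scans at a time.
    static <T> List<List<T>> group(List<T> scans, int scansPerSplit) {
        if (scansPerSplit <= 0) {
            throw new IllegalArgumentException("scansPerSplit must be positive");
        }
        List<List<T>> splits = new ArrayList<>();
        for (int i = 0; i < scans.size(); i += scansPerSplit) {
            int end = Math.min(i + scansPerSplit, scans.size());
            // Copy the sublist so each split owns its own list.
            splits.add(new ArrayList<>(scans.subList(i, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Integers stand in for scan objects in this sketch.
        List<Integer> scans = new ArrayList<>();
        for (int i = 0; i < 45; i++) {
            scans.add(i);
        }
        List<List<Integer>> splits = group(scans, SCANS_PER_SPLIT);
        System.out.println(splits.size() + " splits");
    }
}
```

With 45 scans this yields three splits of sizes 20, 20 and 5. The trade-off 
is more, smaller splits for the engine to schedule in exchange for a bounded 
number of scans pre-fetching per worker.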



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
