Lars Hofhansl created PHOENIX-6608:
--------------------------------------

             Summary: DISCUSS: Rethink MapReduce split generation
                 Key: PHOENIX-6608
                 URL: https://issues.apache.org/jira/browse/PHOENIX-6608
             Project: Phoenix
          Issue Type: Improvement
            Reporter: Lars Hofhansl

I just ran into an issue with Trino, which uses Phoenix's M/R integration to generate splits for its worker nodes.

See: [https://github.com/trinodb/trino/issues/10143]
And a fix: [https://github.com/trinodb/trino/pull/10153]

In short, the issue is that with large data sizes and guideposts enabled (the default), Phoenix's RoundRobinResultIterator starts scanning as soon as tasks are submitted to the queue. For large datasets (per client) this fills the heap with prefetched HBase result objects.

The MapReduce (and Spark) integrations presumably have the same issue.

My proposed solution: instead of allowing Phoenix to do intra-split parallelism, we create more splits (the fix above groups 20 scans into a split; 20 turned out to be a good number).

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
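
To make the proposed grouping concrete, here is a minimal sketch of the idea only. It is not the Phoenix or Trino code (the actual fix is in the pull request linked above). The function name `group_scans_into_splits` and the parameter `scans_per_split` are hypothetical, and the "scans" are stand-ins for whatever scan objects the split generator produces. The default of 20 is the group size mentioned above.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def group_scans_into_splits(
    scans: Sequence[T], scans_per_split: int = 20
) -> List[List[T]]:
    """Partition an ordered sequence of scans into consecutive groups.

    Each returned group stands for one split. Hypothetical helper for
    illustration; the scan order is preserved and the last group may be
    smaller than scans_per_split.
    """
    if scans_per_split < 1:
        raise ValueError("scans_per_split must be at least 1")
    return [
        list(scans[i:i + scans_per_split])
        for i in range(0, len(scans), scans_per_split)
    ]


if __name__ == "__main__":
    # Placeholder integers stand in for real scan objects.
    splits = group_scans_into_splits(list(range(45)))
    print([len(s) for s in splits])
```

With 45 placeholder scans this yields three splits of 20, 20 and 5 scans. The point of the smaller groups is that each worker only ever has a bounded number of scans in flight, rather than one large split whose scans are all started at once.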