[ https://issues.apache.org/jira/browse/PHOENIX-6608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456903#comment-17456903 ]
Istvan Toth commented on PHOENIX-6608:
--------------------------------------

Yes, this job is for a petabyte-sized table, and it uses JDBC.
One solution that we are considering is disabling the loading of guideposts and splitting by regions only.

> DISCUSS: Rethink MapReduce split generation
> -------------------------------------------
>
>                 Key: PHOENIX-6608
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-6608
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Lars Hofhansl
>            Priority: Major
>
> I just ran into an issue with Trino, which uses Phoenix' M/R integration to
> generate splits for its worker nodes.
> See: [https://github.com/trinodb/trino/issues/10143]
> And a fix: [https://github.com/trinodb/trino/pull/10153]
> In short, the issue is that with large data sizes and guideposts enabled
> (the default), Phoenix' RoundRobinResultIterator starts scanning when tasks are
> submitted to the queue. For large datasets (per client) this fills the heap
> with pre-fetched HBase result objects.
> MapReduce (and Spark) integrations presumably have the same issue.
> My proposed solution is that, instead of allowing Phoenix to do intra-split
> parallelism, we create more splits (the fix above groups 20 scans into a split;
> 20 turned out to be a good number).

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
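
The grouping idea from the quoted proposal (many small splits of about 20 scans each, rather than a few large splits that Phoenix parallelizes internally) might look roughly like the sketch below. This is not the Phoenix or Trino code: `ScanGrouping`, `group`, and `scansPerSplit` are hypothetical names, the scans are stand-in strings rather than HBase `Scan` objects, and the sketch assumes the scans arrive already ordered so that consecutive scans belong together.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Phoenix or Trino.
final class ScanGrouping {

    // Partition an ordered list of scans into consecutive groups of at most
    // scansPerSplit elements. Each group would become one input split.
    static <T> List<List<T>> group(List<T> scans, int scansPerSplit) {
        if (scansPerSplit <= 0) {
            throw new IllegalArgumentException("scansPerSplit must be positive");
        }
        List<List<T>> splits = new ArrayList<>();
        for (int start = 0; start < scans.size(); start += scansPerSplit) {
            int end = Math.min(start + scansPerSplit, scans.size());
            // Copy the sublist so each split owns its scans independently.
            splits.add(Collections.unmodifiableList(
                    new ArrayList<>(scans.subList(start, end))));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Placeholder scans; real code would carry HBase Scan objects.
        List<String> scans = new ArrayList<>();
        for (int i = 0; i < 45; i++) {
            scans.add("scan-" + i);
        }
        List<List<String>> splits = group(scans, 20);
        for (List<String> split : splits) {
            System.out.println(split.size() + " scans, first = " + split.get(0));
        }
    }
}
```

With a fixed group size, the number of scans a single worker pre-fetches at once is bounded by the group size rather than by the size of the table, which is the heap concern raised in the issue. The value 20 is the one reported in the linked fix; whether it suits other workloads is an open question.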