[ https://issues.apache.org/jira/browse/IMPALA-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710273#comment-16710273 ]
Peter Ebert commented on IMPALA-2424:
-------------------------------------

This is becoming increasingly important for scaling and for separating storage from compute. If Impala is installed on a subset of nodes, or on distinct compute-only nodes, remote reads would be essentially random and cross-rack traffic may become saturated. This is especially a problem at large scale, where network over-subscription is common. With rack-aware scheduling and a proper distribution of Impala and storage nodes per rack, traffic could be kept within the TOR switches, improving performance.

> Rack-aware scheduling
> ---------------------
>
>                 Key: IMPALA-2424
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2424
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Distributed Exec
>    Affects Versions: Impala 2.2.4
>            Reporter: Marcel Kornacker
>            Priority: Minor
>              Labels: scalability, scheduling
>
> Currently, Impala makes an effort to schedule plan fragments local to the
> data that is being scanned; when no collocated impalad is available, the plan
> fragment is placed randomly.
> In order to support configurations where Impala is run on a subset of the
> nodes in a cluster, we should schedule fragments within the same rack that
> holds the assigned scan ranges (if a collocated impalad isn't available).
> See https://issues.apache.org/jira/browse/HADOOP-692 for details of how rack
> locality is recorded in hdfs.
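
For illustration only, the fallback order proposed in the issue description (a collocated impalad first, then an impalad in the same rack as a replica, then a random one) might be sketched roughly as below. This is not Impala's scheduler code: the function `pick_executor`, its parameters, and the `rack_of` host-to-rack map are all hypothetical names, and the sketch assumes rack information is already available as a simple dictionary.

```python
import random
from typing import Dict, List, Optional


def pick_executor(
    replica_hosts: List[str],
    executors: List[str],
    rack_of: Dict[str, str],
    rng: Optional[random.Random] = None,
) -> str:
    """Choose an executor host for one scan range (hypothetical sketch).

    replica_hosts: hosts that store a replica of the block being scanned.
    executors:     hosts that run an impalad.
    rack_of:       assumed host -> rack mapping, e.g. {"dn1": "/rack1"}.
    """
    if not executors:
        raise ValueError("no executors available")
    rng = rng or random.Random()

    # 1. Prefer an executor collocated with a replica (local read).
    local = [h for h in executors if h in replica_hosts]
    if local:
        return rng.choice(local)

    # 2. Otherwise prefer an executor in the same rack as some replica,
    #    so the remote read stays behind the top-of-rack switch.
    replica_racks = {rack_of[h] for h in replica_hosts if h in rack_of}
    same_rack = [h for h in executors if rack_of.get(h) in replica_racks]
    if same_rack:
        return rng.choice(same_rack)

    # 3. Fall back to random placement, as described for the current behavior.
    return rng.choice(executors)
```

With a map such as `{"dn1": "/rack1", "c1": "/rack1", "c2": "/rack3"}`, a scan range whose only replica is on `dn1` would go to the compute-only node `c1` rather than `c2`, since `c1` shares a rack with the replica.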