[ https://issues.apache.org/jira/browse/DRILL-4706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15625907#comment-15625907 ]
Padma Penumarthy commented on DRILL-4706:
-----------------------------------------

For the data mentioned in the description of the problem, 4 nodes have 16 files each, 3 nodes have 17 files, and the other 3 nodes have 15 files, i.e. the data is not distributed equally among all nodes. With the soft affinity parallelizer, we allocate 16 fragments on each node, so each of the 3 nodes that have only 15 parquet files locally must do a remote read for one of its fragments. Those 3 remote reads of 3 row groups (512 MB * 3 ~ 1.5 GB) explain the 2% (of 70 GB) read remotely. With the local affinity parallelizer, we schedule 16 fragments on the 4 nodes, 17 on the 3 nodes, and 15 on the other 3 nodes, matching the local data. There were no remote reads in this case.

> Fragment planning causes Drillbits to read remote chunks when local copies
> are available
> ----------------------------------------------------------------------------------------
>
>                 Key: DRILL-4706
>                 URL: https://issues.apache.org/jira/browse/DRILL-4706
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.6.0
>         Environment: CentOS, RHEL
>            Reporter: Kunal Khatua
>            Assignee: Sorabh Hamirwasia
>              Labels: performance, planning
>
> When a table (datasize=70GB) of 160 parquet files (each having a single
> rowgroup and fitting within one chunk) is available on a 10-node setup with
> replication=3, a pure data scan query causes about 2% of the data to be read
> remotely.
> Even with the creation of metadata cache, the planner is selecting a
> sub-optimal plan of executing the SCAN fragments such that some of the data
> is served from a remote server.
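To make the arithmetic in the comment concrete, here is a minimal standalone sketch, not Drill's actual parallelizer code: the AffinityDemo class and node names are hypothetical, and the two strategies are reduced to simple counting. It reproduces the remote-read counts for the data layout described above (4 nodes with 16 local row groups, 3 with 17, 3 with 15).

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy comparison of the two fragment-assignment strategies described in
 * the comment. Illustrative only; not Drill's parallelizer implementation.
 */
public class AffinityDemo {
    public static void main(String[] args) {
        // Local row-group counts from the comment: 160 row groups total.
        Map<String, Integer> localRowGroups = new LinkedHashMap<>();
        for (int i = 1; i <= 4; i++) localRowGroups.put("node-a" + i, 16);
        for (int i = 1; i <= 3; i++) localRowGroups.put("node-b" + i, 17);
        for (int i = 1; i <= 3; i++) localRowGroups.put("node-c" + i, 15);

        int totalRowGroups =
            localRowGroups.values().stream().mapToInt(Integer::intValue).sum();
        int nodes = localRowGroups.size();

        // Soft affinity: spread fragments evenly, 160 / 10 = 16 per node.
        // A node with fewer local row groups than fragments reads remotely.
        int perNode = totalRowGroups / nodes;
        int softRemote = 0;
        for (int local : localRowGroups.values()) {
            softRemote += Math.max(0, perNode - local);
        }

        // Local affinity: one fragment per locally stored row group,
        // so every fragment reads local data.
        int localRemote = 0;

        System.out.println("soft affinity remote reads:  " + softRemote);   // 3
        System.out.println("local affinity remote reads: " + localRemote);  // 0
    }
}
{code}

The 3 remote reads under soft affinity come from the 3 nodes holding only 15 local row groups but assigned 16 fragments each, which is exactly the 512 MB * 3 ~ 1.5 GB (about 2% of 70 GB) measured in the description.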