Paul Rogers created DRILL-7789:
----------------------------------

             Summary: Exchanges are slow on large systems & queries
                 Key: DRILL-7789
                 URL: https://issues.apache.org/jira/browse/DRILL-7789
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.16.0
            Reporter: Paul Rogers
A user with a moderate-sized cluster and query has experienced extreme slowness in exchanges: up to 11/12 of the total time is spent waiting in one query, and 3/4 in another. We suspect that exchanges are somehow serializing across the cluster.

Cluster:

* Drill 1.16 (MapR version)
* MapR-FS
* Data stored in an 8 GB Parquet file, which unpacks to about 80 GB, 20B records
* 4 Drillbits
* Each node has 56 cores and 400 GB of memory
* Drill queries run with 40 fragments (70% of CPU) and 80 GB of memory

The query is, essentially:

{noformat}
Parquet writer
- Hash Join
- Scan
- Window, Sort
- Window, Sort
- Hash Join
- Scan
- Scan
{noformat}

In the above, each line represents a fragment boundary. The plan includes mux exchanges between the two "lower" scans and the hash join. (A rough SQL sketch of this plan shape appears at the end of this description.)

The total query time is 6 hours. Of that, 30 minutes is spent working; the other 5.5 hours is spent waiting. (The 30 minutes is obtained by summing the "Avg Runtime" column in the profile.)

When checking resource usage with "top", we found that only a small amount of CPU was used. We should have seen 4000% (40 cores), but we actually saw just around 300-400%. This again indicates that the query spent most of its time doing nothing: not using CPU.

In particular, the sender spends about 5 hours waiting for the receiver, which in turn spends about 5 hours waiting for the sender. This pattern occurs in every exchange in the "main" data path (the 20B records).

As an experiment, the user disabled mux exchanges (see the session-option sketch below). The system became overloaded at 40 fragments per node, so parallelism was reduced to 20. Now the partition sender waited for the unordered receiver and vice versa.

The original query incurred spilling. We hypothesized that the spilling caused delays which somehow rippled through the DAG. However, the user revised the query to eliminate spilling and reduced it to just the "bottom" hash join. That query ran for an hour, of which 3/4 of the time was again spent with senders and receivers waiting for each other.

We have eliminated a number of potential causes:

* The system has sufficient memory.
* The MapR-FS file system has plenty of spindles and plenty of I/O capacity.
* The network is fast.
* There is no other load on the nodes.
* The query was simplified down to the simplest possible form: a single join (with exchanges).
* If the query is simplified further (scan and write to Parquet, no join), it completes in just a few minutes: about as fast as the disk I/O rate.

The query profile does not provide sufficient information to dig further. The profile provides aggregate wait times, but does not, say, tell us which fragments wait for which other fragments, or for how long. (See the profile-query sketch below for one way to pull per-operator wait times out of a saved profile.)

We believe that, if the exchange delays are fixed, the query which takes six hours should complete in less than half an hour -- even with shuffles, spilling, reading from Parquet, and writing to Parquet.
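For readers mapping the plan outline above to SQL, the query shape corresponds roughly to a CTAS of the following form. This is a sketch only: every table path and column name here is invented for illustration, since the user's actual query is not shown in this report.

{noformat}
-- Rough shape only; all names below are hypothetical.
CREATE TABLE dfs.tmp.`exchange_repro` AS          -- Parquet writer
SELECT d.name, f.w1, f.w2
FROM dfs.`/data/dim.parquet` d                    -- "upper" scan
JOIN (                                            -- "upper" hash join
  SELECT k1,
         SUM(v) OVER (PARTITION BY k2 ORDER BY ts) AS w2,  -- second Window, Sort
         w1
  FROM (
    SELECT a.k1, a.k2, a.ts, a.v,
           ROW_NUMBER() OVER (PARTITION BY a.k1 ORDER BY a.ts) AS w1  -- first Window, Sort
    FROM dfs.`/data/big_a.parquet` a              -- "lower" scan
    JOIN dfs.`/data/big_b.parquet` b              -- "bottom" hash join (the 20B-row path)
      ON a.k2 = b.k2                              -- mux exchanges feed this join
  ) windowed
) f ON d.k1 = f.k1;
{noformat}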
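The mux-exchange experiment described above corresponds to standard Drill session options. A minimal sketch, assuming the experiment was run at the session level (the two option names are real Drill options; the exact statements the user issued are an assumption):

{noformat}
-- Turn off mux exchanges; each shuffle then uses plain
-- partition-sender / unordered-receiver pairs.
ALTER SESSION SET `planner.enable_mux_exchange` = false;

-- Cap parallelism at 20 fragments per node, since 40 per node
-- overloaded the system once mux exchanges were disabled.
ALTER SESSION SET `planner.width.max_per_node` = 20;
{noformat}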
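Finally, one way to dig past the aggregate wait times is to point Drill at a saved copy of the profile itself. The sketch below assumes the standard profile JSON layout (a "fragmentProfile" array of major fragments, each holding a "minorFragmentProfile" array whose entries carry an "operatorProfile" array with "processNanos" and "waitNanos" fields); the file path is a placeholder.

{noformat}
-- Per-operator run vs. wait time from an exported query profile.
-- Field names assume the standard profile JSON; the path is a placeholder.
SELECT t3.major_id,
       t3.op.operatorId                        AS op_id,
       SUM(t3.op.processNanos) / 1000000000.0  AS run_sec,
       SUM(t3.op.waitNanos)    / 1000000000.0  AS wait_sec
FROM (
  SELECT t2.major_id,
         FLATTEN(t2.minor.operatorProfile) AS op
  FROM (
    SELECT t1.major.majorFragmentId AS major_id,
           FLATTEN(t1.major.minorFragmentProfile) AS minor
    FROM (
      SELECT FLATTEN(fragmentProfile) AS major
      FROM dfs.`/tmp/profile.json`
    ) t1
  ) t2
) t3
GROUP BY t3.major_id, t3.op.operatorId
ORDER BY wait_sec DESC;
{noformat}

High wait times on senders and receivers, grouped this way, would at least show which exchanges in which major fragments dominate the stall.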