[ https://issues.apache.org/jira/browse/IMPALA-5706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabor Kaszab resolved IMPALA-5706. ---------------------------------- Resolution: Fixed Fix Version/s: Impala 3.1.0 https://gerrit.cloudera.org/#/c/9943/ > Parallelise read I/O in sorter > ------------------------------ > > Key: IMPALA-5706 > URL: https://issues.apache.org/jira/browse/IMPALA-5706 > Project: IMPALA > Issue Type: Sub-task > Components: Backend > Affects Versions: Impala 2.10.0 > Reporter: Tim Armstrong > Assignee: Gabor Kaszab > Priority: Major > Labels: resource-management, spill > Fix For: Impala 3.1.0 > > > IMPALA-3200 offers an opportunity to improve the spilling sort algorithm: > * Use the reliability of reservations to select the most efficient order to > conduct merges in (rather than greedily trying to maximise the fan-in of the > current merge). We want to minimise the depth of the merge tree, then > structure the tree based on the preferred fan-in. > * Do multiple-buffering of the stream being written (this happens > automatically if there are free buffers in the BufferPool client). > * Do multiple-buffering of the streams being read, instead of blocking on > read I/O frequently. > More concretely, the idea is to implement double-buffering of spilled input > runs by calling BufferPool::Pin() early to prefetch the second page in each > input Run. Currently only one page per input run is pinned, which means that > the sorter frequently blocks on I/O. > I'd suggest doing this in two steps. > The first step is to change how the fan-in of each merge run is selected. We > know the number of runs to be merged and the buffer reservation that is > available, so we can compute the maximum possible fan-in of each merge step > (assuming 1 buffer for the output run and 1 buffer for each input run to the > merge). We can then calculate the minimum number of rounds of merging > required and, based on that, decide how the runs should be merged (you could > think about it as a tree of merge operations). I think we want to reduce the > number of bytes written to disk. E.g. if we have 5 buffers and 8 input runs, > we should merge input runs (1,2,3,4) then merge that intermediate runs with > runs (5,6,7). It's reasonable to assume that the input runs are all > approximate the same size. > ee53ddb389549247f5bfe760d446dc7b3b963a29 actually removed some logic along > those lines because it didn't work with the old buffer management scheme. The > logic before that commit might provide some ideas. There are also some > related TODOs in Sorter::MergeIntermediateRuns() and Sorter::CreateMerger() > to simplify how the number of input runs is decided and how the merger is set > up: > {code} > // TODO: once we have reliable reservations (IMPALA-3200), we should > calculate this > // based on the available reservations. > .... > // TODO: this isn't optimal: we could defer creating the merged run if we > have > // reliable reservations (IMPALA-3200). > ... > // TODO: IMPALA-3200: we should not need this logic once we have > reliable > // reservations (IMPALA-3200). > {code} > The second step would be to adjust the logic from the first step to reserve 2 > buffers per input and output run and then implement the logic to call Pin() > earlier to prefetch the page after the current page. -- This message was sent by Atlassian JIRA (v7.6.3#76005)