Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3202: implement spill-to-disk in new buffer pool
......................................................................
Patch Set 12:

(22 comments)

http://gerrit.cloudera.org:8080/#/c/5584/9/be/src/runtime/bufferpool/buffer-pool-internal.h
File be/src/runtime/bufferpool/buffer-pool-internal.h:

PS9, Line 255: int64_t len;
> But the callback does have to do something special for both cases. And see
This was removed.

http://gerrit.cloudera.org:8080/#/c/5584/10/be/src/runtime/bufferpool/buffer-pool-internal.h
File be/src/runtime/bufferpool/buffer-pool-internal.h:

PS10, Line 44: an unpinned page
> i think saying: "a dirty unpinned page" would help make the transitions and
Done

PS10, Line 56: can
> can be
Done

http://gerrit.cloudera.org:8080/#/c/5584/12/be/src/runtime/bufferpool/buffer-pool-internal.h
File be/src/runtime/bufferpool/buffer-pool-internal.h:

Line 59: /// unused reservation >= dirty unpinned pages
I realised that we can't maintain this invariant if a client decreases its reservation. I was thinking about refactoring a bit so that the client's ReservationTracker is internal to the client. DecreaseReservation() would then go through the client interface and could decrease the # of dirty unpinned pages before decreasing the reservation. I'd prefer to do this as a follow-up patch to avoid making this change even bigger. Added a TODO here.

http://gerrit.cloudera.org:8080/#/c/5584/9/be/src/runtime/bufferpool/buffer-pool.cc
File be/src/runtime/bufferpool/buffer-pool.cc:

Line 333: buffer.Reset();
> If you're not comfortable using DCHECK, let's at least add a quick comment
I added the comment. My intuition is that there are a lot of local invariants that need to be maintained for this global invariant to hold, so there's a higher chance of error. Also, if we violated an invariant in a release build and overcommitted memory, we'd actually get a diagnostic this way, instead of unexplained strange behaviour.

Line 469: cl.lock();
> You missed these questions.
I removed this anyway. My concern about atomicity was to do with the cascade of checks of page states in a couple of other places.
It's harder to reason about if there are intermediate states where the page is between lists. I had some early iterations like that and really struggled to reason about the possible races.

http://gerrit.cloudera.org:8080/#/c/5584/10/be/src/runtime/bufferpool/buffer-pool.cc
File be/src/runtime/bufferpool/buffer-pool.cc:

PS10, Line 534: The async. writes may have hit an error.
> Isn't it more that *initiating* the writes may have hit an error, as oppose
It could have been a write that was already in flight. Reworded to make this clearer.

http://gerrit.cloudera.org:8080/#/c/5584/12/be/src/runtime/bufferpool/buffer-pool.cc
File be/src/runtime/bufferpool/buffer-pool.cc:

I also folded in a few fixes that I added when experimenting with building things on top of this - clarified some aspects of concurrency (e.g. it's safe to free a buffer from a different thread) and added some counters for the spilling path.

Line 191: client->reservation_->AllocateFrom(page->len);
> is the order of doing the pin_count / reservation bookkeeping and MoveToPin
Only if we go down the RETURN_IF_ERROR path above. Added a comment since it's subtle.

Line 217: while (page_handle->pin_count() > 1) Unpin(client, page_handle);
> this is okay, but why would a client want to extract a buffer that it had m
It could make sense on clean-up paths, e.g. if you want to take a list of pages and attach the buffers to a row batch. We also use this via DestroyPage() - destroying a pinned page is implemented via ExtractBuffer.

Line 331: bytes_evicted += buffer.len();
> what's the difference between total_len and bytes_evicted?
Nothing. I think there was a distinction in an earlier iteration but I missed removing the redundancy. I combined them and renamed the variable to bytes_found.

PS12, Line 387: move it between lists.
> shouldn't this really say "remove it from the in-flight list."? though see
This code is removed now.
Line 412: file_group_->DestroyWriteHandle(move(page->write_handle));
> why did we do this on line 395 as well? and why didn't that case need to ho
That duplication was removed now that we're just waiting for the page to exit the write state. It was meant to cancel the write quickly, but we don't have any kind of fast-cancel mechanism now, so there wasn't a real benefit. The lock is only acquired here because I was concerned about other threads being able to get a reference to a page via a list and then dropping the list lock. I don't think that's necessary now, so I removed it and renamed Page->lock to Page->buffer_lock, since it's now just protecting that one field.

Line 464: page->repin_after_write = true;
> i guess this is just an optimization? as in, alternatively, we could just w
I was thinking more about contention for clean_pages_lock_. But it should be pretty rare and not a major problem (compared to other possible things we could optimise).

PS12, Line 470: atomically
> atomically with respect to what? what would be wrong if we moved it to pinn
Removed this code.

PS12, Line 500: Safe to touch the
: // page state without holding the lock variables below because no concurrent operations
: // can modify evicted pages.
> this comment applies to line 496 as well, right? So maybe move it up (or at
I removed the "below" and clarified that it was talking about page->buffer. IsEvicted() actually acquires the lock, so the comment doesn't apply to it.

PS12, Line 528: min_inflight_writes
> min_inflight_bytes? "writes" sounds like number of I/Os.
Done

Line 560: * file_group_->tmp_file_mgr()->NumActiveTmpDevices();
> given that we don't know which tmp device a page is mapped to, how does thi
It's really determined by the FileGroup's allocation policy. Initially it will round-robin across disks, but if it's recycling file ranges then it's more random. We might miss some opportunities to maximise throughput, but hopefully this will work well enough. I added a comment to explain.
Line 611: in_flight_write_pages_.Remove(page);
> DCHECK that only one of the after_write flags is set?
Removed the flags.

PS12, Line 614: !destroy_after_write
> why do we need this guard? why couldn't we just have DestroyPageInternal r
That's true. This does save acquiring clean_pages_lock_, but the case is probably rare enough that it's not worth optimising. I'll remove it for now.

PS12, Line 629: write_complete_cv_
I thought about this and realised that spurious wakeups from other writes finishing could result in quite a lot of context switches (e.g. if you have 2 * num disks writes in flight and you're waiting for a specific write to finish). I added a per-page CV to avoid this.

Line 635: }
> I don't feel too strongly, but why not just put the code from DebugStringLo
This was left over from an earlier iteration. I didn't realise DebugStringLocked() had no callers.

--
To view, visit http://gerrit.cloudera.org:8080/5584
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c6119a4557da853fefa02e17c49d8be0e9fbe01
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-HasComments: Yes