Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3202: implement spill-to-disk in new buffer pool
......................................................................
Patch Set 12:

(22 comments)

http://gerrit.cloudera.org:8080/#/c/5584/9/be/src/runtime/bufferpool/buffer-pool-internal.h
File be/src/runtime/bufferpool/buffer-pool-internal.h:

PS9, Line 255: int64_t len;
> But the callback does have to do something special for both cases. And see
This was removed.

http://gerrit.cloudera.org:8080/#/c/5584/10/be/src/runtime/bufferpool/buffer-pool-internal.h
File be/src/runtime/bufferpool/buffer-pool-internal.h:

PS10, Line 44: an unpinned page
> i think saying: "a dirty unpinned page" would help make the transitions and
Done

PS10, Line 56: can
> can be
Done

http://gerrit.cloudera.org:8080/#/c/5584/12/be/src/runtime/bufferpool/buffer-pool-internal.h
File be/src/runtime/bufferpool/buffer-pool-internal.h:

Line 59: /// unused reservation >= dirty unpinned pages
I realised that we can't maintain this invariant if a client decreases its reservation. I was thinking about refactoring a bit so that the client's ReservationTracker is internal to the client. DecreaseReservation() would then go through the client interface and could decrease the # of dirty unpinned pages before decreasing the reservation. I'd prefer to do this as a follow-up patch to avoid making this change even bigger. Added a TODO here.

http://gerrit.cloudera.org:8080/#/c/5584/9/be/src/runtime/bufferpool/buffer-pool.cc
File be/src/runtime/bufferpool/buffer-pool.cc:

Line 333: buffer.Reset();
> If you're not comfortable using DCHECK, let's at least add a quick comment
I added the comment. My intuition is that there are a lot of local invariants that need to be maintained for this global invariant to hold, so there's a higher chance of error. Also, if we violated an invariant in a release build and overcommitted memory, we'd actually get a diagnostic this way, instead of unexplained strange behaviour.

Line 469: cl.lock();
> You missed these questions.
I removed this anyway. My concern about atomicity was to do with the cascade of checks of page states in a couple of other places.
It's harder to reason about if there are intermediate states where the page is between lists. I had some early iterations like that and really struggled to reason about the possible races.

http://gerrit.cloudera.org:8080/#/c/5584/10/be/src/runtime/bufferpool/buffer-pool.cc
File be/src/runtime/bufferpool/buffer-pool.cc:

PS10, Line 534: The async. writes may have hit an error.
> Isn't it more that *initiating* the writes may have hit an error, as oppose
It could have been a write that was already in flight. Reworded to make this clearer.

http://gerrit.cloudera.org:8080/#/c/5584/12/be/src/runtime/bufferpool/buffer-pool.cc
File be/src/runtime/bufferpool/buffer-pool.cc:

I also folded in a few fixes that I added when experimenting with building things on top of this - clarified some aspects of concurrency (e.g. it's safe to free a buffer from a different thread) and added some counters for the spilling path.

Line 191: client->reservation_->AllocateFrom(page->len);
> is the order of doing the pin_count / reservation bookkeeping and MoveToPin
Only if we go down the RETURN_IF_ERROR path above. Added a comment since it's subtle.

Line 217: while (page_handle->pin_count() > 1) Unpin(client, page_handle);
> this is okay, but why would a client want to extract a buffer that it had m
It could make sense on clean-up paths, e.g. if you want to take a list of pages and attach the buffers to a row batch. We also use this via DestroyPage() - destroying a pinned page is implemented via ExtractBuffer.

Line 331: bytes_evicted += buffer.len();
> what's the difference between total_len and bytes_evicted?
Nothing. I think there was a distinction in an earlier iteration but I missed removing the redundancy. I combined them and renamed the variable to bytes_found.

PS12, Line 387: move it between lists.
> shouldn't this really say "remove it from the in-flight list."? though see
This code is removed now.
Line 412: file_group_->DestroyWriteHandle(move(page->write_handle));
> why did we do this on line 395 as well? and why didn't that case need to ho
That duplication was removed now that we're just waiting for the page to exit the write state. It was meant to cancel the write quickly, but we don't have any kind of fast-cancel mechanism now, so there wasn't a real benefit. The lock is only acquired here because I was concerned about other threads being able to get a reference to a page via a list and then dropping the list lock. I don't think that's necessary now, so I removed it and renamed Page->lock to Page->buffer_lock, since it's now just protecting that one field.

Line 464: page->repin_after_write = true;
> i guess this is just an optimization? as in, alternatively, we could just w
I was thinking more about contention for clean_pages_lock_. But it should be pretty rare and not a major problem (compared to other possible things we could optimise).

PS12, Line 470: atomically
> atomically with respect to what? what would be wrong if we moved it to pinn
Removed this code.

PS12, Line 500: Safe to touch the
: // page state without holding the lock variables below because no concurrent operations
: // can modify evicted pages.
> this comment applies to line 496 as well, right? So maybe move it up (or at
I removed the "below" and clarified that it was talking about page->buffer. IsEvicted() actually acquires the lock, so the comment doesn't apply to it.

PS12, Line 528: min_inflight_writes
> min_inflight_bytes? "writes" sounds like number of I/Os.
Done

Line 560: * file_group_->tmp_file_mgr()->NumActiveTmpDevices();
> given that we don't know which tmp device a page is mapped to, how does thi
It's really determined by the FileGroup's allocation policy. Initially it will round-robin across disks, but if it's recycling file ranges then it's more random. We might miss some opportunities to maximise throughput, but hopefully this will work well enough. I added a comment to explain.
Line 611: in_flight_write_pages_.Remove(page);
> DCHECK that only one of the after_write flags is set?
Removed the flags.

PS12, Line 614: !destroy_after_write
> why do we need this guard? why couldn't we just have DestroyPageInternal r
That's true. This does save acquiring clean_pages_lock_, but the case is probably rare enough that it's not worth optimising. I'll remove it for now.

PS12, Line 629: write_complete_cv_
I thought about this and realised that spurious wakeups from other writes finishing could result in quite a lot of context switches (e.g. if you have 2 * num disks writes in flight and you're waiting for a specific write to finish). I added a per-page CV to avoid this.

Line 635: }
> I don't feel too strongly, but why not just put the code from DebugStringLo
This was left over from an earlier iteration. I didn't realise DebugStringLocked() had no callers.

--
To view, visit http://gerrit.cloudera.org:8080/5584
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c6119a4557da853fefa02e17c49d8be0e9fbe01
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-HasComments: Yes