Henry Robinson has posted comments on this change.

Change subject: IMPALA-3977: TransmitData() should not block
......................................................................


Patch Set 1:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/5491/1/be/src/runtime/data-stream-mgr.cc
File be/src/runtime/data-stream-mgr.cc:

PS1, Line 53: STREAM_EXPIRATION_TIME_MS
Do you think we still need the stream cache? In the case where a receiver has 
completed before it receives an initial RPC, we should be able to rely on 
cancellation being detected on the sender side. The cost is that we might retry 
that initial RPC a couple of times before we receive cancellation from the 
coordinator.

Can you think this through and work out whether there's a state where the 
stream cache is really worth the extra complexity?


PS1, Line 99: if (acquire_lock) lock_.lock();
Can we just require now that the caller takes the lock?


http://gerrit.cloudera.org:8080/#/c/5491/1/be/src/runtime/data-stream-sender.cc
File be/src/runtime/data-stream-sender.cc:

PS1, Line 209: TTransmitDataResult res;
Let's move this inside the loop so it's freshly initialized on each iteration. 
That avoids any bugs where one RPC returns EAGAIN and sets a field in the 
result that - for some reason - a subsequent successful RPC does not overwrite.


PS1, Line 217: DoTransmitDataRpc
Since a row batch can potentially be many megabytes large, I wonder if it makes 
sense to have an initial 'probe' RPC that has no payload, but just confirms 
that the receiver is ready. Then we can limit the retry logic to that probe 
only, and if a subsequent RPC returns EAGAIN, we know that the receiver is 
closed and don't need to retry.


Line 224:     if (res.status.status_code != 
TErrorCode::DATASTREAM_RECEIVER_EAGAIN) break;
Is cancellation detected during this loop? That's part of the point of moving 
control back to the sender on EAGAIN, so that cancellation can be detected. 
Otherwise this blocks from the sender's perspective in the same way that it did 
before.


PS1, Line 225: VLOG_RPC << "Datastream receiver not yet ready. Retrying";
Not very informative without the fragment ID / destination address etc.


PS1, Line 226: FLAGS_datastream_sender_timeout_ms
rather than have the sleep period be a function of the total timeout, it makes 
more sense to have a fixed retry interval and exit this loop if the timeout has 
been exceeded.

For example, if someone sets the timeout to ten minutes (because they're being 
ultra conservative on a heavy cluster), this will retry every two minutes.


Line 228: 
I think it's a bit confusing to have this method return EAGAIN upon failure. 
Can you replace the status with timeout if it times-out of the loop?


-- 
To view, visit http://gerrit.cloudera.org:8080/5491
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I47f9d6c1de95a72ab73471dcd2a3118ef13e7db0
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sail...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to