[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
Tianyi Wang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/8226 Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches .. IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches This patch replaces BufferedTupleStream::GetRows with GetRowBatches, which returns multiple batches instead of one. BufferedTupleStrem::GetRows pins a stream and read all the rows into a single batch, ignoring batch_size configuration. It is not a good API and may cause problems in memory management and row batch processing. Change-Id: I3831c38994da2b69775a9809ff01de5d23584414 --- M be/src/exec/partitioned-hash-join-node.cc M be/src/exec/partitioned-hash-join-node.h M be/src/runtime/buffered-tuple-stream-test.cc M be/src/runtime/buffered-tuple-stream.cc M be/src/runtime/buffered-tuple-stream.h 5 files changed, 43 insertions(+), 68 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/8226/1 -- To view, visit http://gerrit.cloudera.org:8080/8226 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414 Gerrit-Change-Number: 8226 Gerrit-PatchSet: 1 Gerrit-Owner: Tianyi Wang
[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 ) Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG@14 PS1, Line 14: Any testing? http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/exec/partitioned-hash-join-node.cc File be/src/exec/partitioned-hash-join-node.cc: http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/exec/partitioned-hash-join-node.cc@950 PS1, Line 950: break; I think this changes the behavior here - this will only break out of the FOREACH_ROW loop, but what we want is to stop iterating over the entire set of build rows (that was previously one batch, but is now the same set of rows split into multiple batches). http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h File be/src/runtime/buffered-tuple-stream.h: http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h@339 PS1, Line 339: vector>* batch Can you move this to be after 'batch_size'? We generally keep out parameters as the trailing parameters for clarity. -- To view, visit http://gerrit.cloudera.org:8080/8226 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414 Gerrit-Change-Number: 8226 Gerrit-PatchSet: 1 Gerrit-Owner: Tianyi Wang Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 06 Oct 2017 18:24:58 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 ) Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG@10 PS1, Line 10: which returns multiple batches instead of one. that will cause us to still have to pin the entire stream in memory. Instead, I think the ask of the JIRA is the change the callers to use the GetNext() interface, so that they only need to keep a single batch in memory at a time, since I believe all callers are really just iterating through the stream anyway. -- To view, visit http://gerrit.cloudera.org:8080/8226 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414 Gerrit-Change-Number: 8226 Gerrit-PatchSet: 1 Gerrit-Owner: Tianyi Wang Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 06 Oct 2017 18:47:32 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
Tianyi Wang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 ) Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG@10 PS1, Line 10: which returns multiple batches instead of one. > that will cause us to still have to pin the entire stream in memory. Inste The callers iterates through the build rows for every row in null_probe_rows/null_aware_probe_partition_. If we don't pin the stream we need to read the stream multi times, for every prob batch. Sure about that? -- To view, visit http://gerrit.cloudera.org:8080/8226 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414 Gerrit-Change-Number: 8226 Gerrit-PatchSet: 1 Gerrit-Owner: Tianyi Wang Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tianyi Wang Gerrit-Comment-Date: Fri, 06 Oct 2017 19:03:27 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 ) Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches .. Patch Set 1: > (1 comment) Correct, it will unpin/pin multiple times. But if the memory is available the buffers will stay in memory even though it was unpinned (you'll want to take a look through the buffer pool code starting with the header comments). If the memory is not available, then we will read from disk but compare that to today where we'd run out of memory and fail. It does mean, however, that rows are "unflattened" multiple times. It would be good to sync up with Tim about this JIRA to ask his intention when he returns on Monday. I believe the JIRA was filed a while back before we did a big rework of these layers (I think the JIRA still makes sense, but good to sync up). -- To view, visit http://gerrit.cloudera.org:8080/8226 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414 Gerrit-Change-Number: 8226 Gerrit-PatchSet: 1 Gerrit-Owner: Tianyi Wang Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tianyi Wang Gerrit-Comment-Date: Fri, 06 Oct 2017 21:54:16 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
Tianyi Wang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 ) Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches .. Patch Set 1: > Patch Set 1: > > > (1 comment) > > Correct, it will unpin/pin multiple times. But if the memory is available the > buffers will stay in memory even though it was unpinned (you'll want to take > a look through the buffer pool code starting with the header comments). If > the memory is not available, then we will read from disk but compare that to > today where we'd run out of memory and fail. It does mean, however, that rows > are "unflattened" multiple times. > > It would be good to sync up with Tim about this JIRA to ask his intention > when he returns on Monday. I believe the JIRA was filed a while back before > we did a big rework of these layers (I think the JIRA still makes sense, but > good to sync up). Thanks. I will abandon this patch for now. -- To view, visit http://gerrit.cloudera.org:8080/8226 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414 Gerrit-Change-Number: 8226 Gerrit-PatchSet: 1 Gerrit-Owner: Tianyi Wang Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tianyi Wang Gerrit-Comment-Date: Fri, 06 Oct 2017 21:56:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
Tianyi Wang has abandoned this change. ( http://gerrit.cloudera.org:8080/8226 ) Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches .. Abandoned Need to sync up with Tim before moving forward. -- To view, visit http://gerrit.cloudera.org:8080/8226 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: abandon Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414 Gerrit-Change-Number: 8226 Gerrit-PatchSet: 1 Gerrit-Owner: Tianyi Wang Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tianyi Wang
[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 ) Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG@10 PS1, Line 10: which returns multiple batches instead of one. > The callers iterates through the build rows for every row in null_probe_row I'm ok with the stream staying pinned, the main goal was to remove the GetRows() API, which uses RowBatch in a weird way. I think keeping the stream unpinned to handle many nulls on the build side is just a special case of https://issues.apache.org/jira/browse/IMPALA-4857. I think that is harder to implement well - we'd want to consider doing some kind of blocked algorithm to reduce I/O. http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h File be/src/runtime/buffered-tuple-stream.h: http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h@332 PS1, Line 332: Status GetNext( My intent was to remove the GetRows() API altogether and switch callers to calling GetNext() directly. Creating many row batches is somewhat better, but there's still a lot of unnecessary memory overhead with all of the tuple_ptrs_. -- To view, visit http://gerrit.cloudera.org:8080/8226 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414 Gerrit-Change-Number: 8226 Gerrit-PatchSet: 1 Gerrit-Owner: Tianyi Wang Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 10 Oct 2017 20:34:32 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 ) Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG@10 PS1, Line 10: which returns multiple batches instead of one. > I'm ok with the stream staying pinned, the main goal was to remove the GetR WFM, agree the important step here is to eliminate GetRows(). -- To view, visit http://gerrit.cloudera.org:8080/8226 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414 Gerrit-Change-Number: 8226 Gerrit-PatchSet: 1 Gerrit-Owner: Tianyi Wang Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 11 Oct 2017 16:55:26 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
Tianyi Wang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 ) Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h File be/src/runtime/buffered-tuple-stream.h: http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h@332 PS1, Line 332: Status GetNext( > My intent was to remove the GetRows() API altogether and switch callers to Do you mean to keep only one rowbatch and assemble the row batches multiple times but yet let the stream stay pinned? -- To view, visit http://gerrit.cloudera.org:8080/8226 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414 Gerrit-Change-Number: 8226 Gerrit-PatchSet: 1 Gerrit-Owner: Tianyi Wang Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 18 Oct 2017 23:36:17 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8226 ) Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h File be/src/runtime/buffered-tuple-stream.h: http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h@332 PS1, Line 332: Status GetNext( > Do you mean to keep only one rowbatch and assemble the row batches multiple Yes that's what I meant - have one RowBatch and iterate over the stream by calling GetNext() repeatedly. -- To view, visit http://gerrit.cloudera.org:8080/8226 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414 Gerrit-Change-Number: 8226 Gerrit-PatchSet: 1 Gerrit-Owner: Tianyi Wang Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 20 Oct 2017 20:47:38 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches
Tianyi Wang has restored this change. ( http://gerrit.cloudera.org:8080/8226 ) Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches .. Restored -- To view, visit http://gerrit.cloudera.org:8080/8226 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: restore Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414 Gerrit-Change-Number: 8226 Gerrit-PatchSet: 1 Gerrit-Owner: Tianyi Wang Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tianyi Wang Gerrit-Reviewer: Tim Armstrong