[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

2017-10-06 Thread Tianyi Wang (Code Review)
Tianyi Wang has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/8226


Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning 
multi batches
..

IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

This patch replaces BufferedTupleStream::GetRows with GetRowBatches,
which returns multiple batches instead of one.
BufferedTupleStrem::GetRows pins a stream and read all the rows into a
single batch, ignoring batch_size configuration. It is not a good API
and may cause problems in memory management and row batch processing.

Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
---
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
5 files changed, 43 insertions(+), 68 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/8226/1
--
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang 


[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

2017-10-06 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning 
multi batches
..


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG@14
PS1, Line 14:
Any testing?


http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/exec/partitioned-hash-join-node.cc@950
PS1, Line 950: break;
I think this changes the behavior here - this will only break out of the 
FOREACH_ROW loop, but what we want is to stop iterating over the entire set of 
build rows (that was previously one batch, but is now the same set of rows 
split into multiple batches).


http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h
File be/src/runtime/buffered-tuple-stream.h:

http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h@339
PS1, Line 339: vector>* batch
Can you move this to be after 'batch_size'? We generally keep out parameters as 
the trailing parameters for clarity.



--
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 06 Oct 2017 18:24:58 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

2017-10-06 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning 
multi batches
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG@10
PS1, Line 10: which returns multiple batches instead of one.
that will cause us to still have to pin the entire stream in memory.  Instead, 
I think the ask of the JIRA is the change the callers to use the GetNext() 
interface, so that they only need to keep a single batch in memory at a time, 
since I believe all callers are really just iterating through the stream anyway.



--
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Comment-Date: Fri, 06 Oct 2017 18:47:32 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

2017-10-06 Thread Tianyi Wang (Code Review)
Tianyi Wang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning 
multi batches
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG@10
PS1, Line 10: which returns multiple batches instead of one.
> that will cause us to still have to pin the entire stream in memory.  Inste
The callers iterates through the build rows for every row in 
null_probe_rows/null_aware_probe_partition_. If we don't pin the stream we need 
to read the stream multi times, for every prob batch. Sure about that?



--
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Comment-Date: Fri, 06 Oct 2017 19:03:27 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

2017-10-06 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning 
multi batches
..


Patch Set 1:

> (1 comment)

Correct, it will unpin/pin multiple times. But if the memory is available the 
buffers will stay in memory even though it was unpinned (you'll want to take a 
look through the buffer pool code starting with the header comments). If the 
memory is not available, then we will read from disk but compare that to today 
where we'd run out of memory and fail. It does mean, however, that rows are 
"unflattened" multiple times.

It would be good to sync up with Tim about this JIRA to ask his intention when 
he returns on Monday.  I believe the JIRA was filed a while back before we did 
a big rework of these layers (I think the JIRA still makes sense, but good to 
sync up).


--
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Comment-Date: Fri, 06 Oct 2017 21:54:16 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

2017-10-06 Thread Tianyi Wang (Code Review)
Tianyi Wang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning 
multi batches
..


Patch Set 1:

> Patch Set 1:
>
> > (1 comment)
>
> Correct, it will unpin/pin multiple times. But if the memory is available the 
> buffers will stay in memory even though it was unpinned (you'll want to take 
> a look through the buffer pool code starting with the header comments). If 
> the memory is not available, then we will read from disk but compare that to 
> today where we'd run out of memory and fail. It does mean, however, that rows 
> are "unflattened" multiple times.
>
> It would be good to sync up with Tim about this JIRA to ask his intention 
> when he returns on Monday.  I believe the JIRA was filed a while back before 
> we did a big rework of these layers (I think the JIRA still makes sense, but 
> good to sync up).

Thanks. I will abandon this patch for now.


--
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Comment-Date: Fri, 06 Oct 2017 21:56:40 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

2017-10-06 Thread Tianyi Wang (Code Review)
Tianyi Wang has abandoned this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning 
multi batches
..


Abandoned

Need to sync up with Tim before moving forward.
--
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tianyi Wang 


[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

2017-10-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning 
multi batches
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG@10
PS1, Line 10: which returns multiple batches instead of one.
> The callers iterates through the build rows for every row in null_probe_row
I'm ok with the stream staying pinned, the main goal was to remove the 
GetRows() API, which uses RowBatch in a weird way.

I think keeping the stream unpinned to handle many nulls on the build side is 
just a special case of
https://issues.apache.org/jira/browse/IMPALA-4857. I think that is harder to 
implement well - we'd want to consider doing some kind of blocked algorithm to 
reduce I/O.


http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h
File be/src/runtime/buffered-tuple-stream.h:

http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h@332
PS1, Line 332:   Status GetNext(
My intent was to remove the GetRows() API altogether and switch callers to 
calling GetNext() directly.

Creating many row batches is somewhat better, but there's still a lot of 
unnecessary memory overhead with all of the tuple_ptrs_.



--
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 10 Oct 2017 20:34:32 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

2017-10-11 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning 
multi batches
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8226/1//COMMIT_MSG@10
PS1, Line 10: which returns multiple batches instead of one.
> I'm ok with the stream staying pinned, the main goal was to remove the GetR
WFM, agree the important step here is to eliminate GetRows().



--
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 11 Oct 2017 16:55:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

2017-10-18 Thread Tianyi Wang (Code Review)
Tianyi Wang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning 
multi batches
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h
File be/src/runtime/buffered-tuple-stream.h:

http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h@332
PS1, Line 332:   Status GetNext(
> My intent was to remove the GetRows() API altogether and switch callers to
Do you mean to keep only one rowbatch and assemble the row batches multiple 
times but yet let the stream stay pinned?



--
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 18 Oct 2017 23:36:17 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

2017-10-20 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning 
multi batches
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h
File be/src/runtime/buffered-tuple-stream.h:

http://gerrit.cloudera.org:8080/#/c/8226/1/be/src/runtime/buffered-tuple-stream.h@332
PS1, Line 332:   Status GetNext(
> Do you mean to keep only one rowbatch and assemble the row batches multiple
Yes that's what I meant - have one RowBatch and iterate over the stream by 
calling GetNext() repeatedly.



--
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 20 Oct 2017 20:47:38 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2758: Change BufferedTupleStream::GetRows to returning multi batches

2017-10-25 Thread Tianyi Wang (Code Review)
Tianyi Wang has restored this change. ( http://gerrit.cloudera.org:8080/8226 )

Change subject: IMPALA-2758: Change BufferedTupleStream::GetRows to returning 
multi batches
..


Restored
--
To view, visit http://gerrit.cloudera.org:8080/8226
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: restore
Gerrit-Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Gerrit-Change-Number: 8226
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tianyi Wang 
Gerrit-Reviewer: Tim Armstrong