Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17979 )

Change subject: IMPALA-10791 Add batching reading for remote temporary files
......................................................................


Patch Set 6:

(10 comments)

Looks very good!

Another thing we may need to worry about is the enhancement to FE to include 
the extra system memory blocks proposed to use in this patch for spill-able 
nodes, such as HJB, SORT or ANALYTICEVAL nodes.

http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h
File be/src/runtime/io/disk-file.h:

http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@206
PS5, Line 206:
             :   virtual ~DiskFile() {}
> Previously, a read buffer and a memory block are the same thing in our patc
Okay. That is a good reason.


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.h@326
PS5, Line 326: id UpdateReadBufferMetaData(int64_t offset) {
> It depends on the file size.
Please see my comments at 
https://gerrit.cloudera.org/#/c/17979/6/be/src/runtime/tmp-file-mgr.cc@411


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.cc
File be/src/runtime/io/disk-file.cc:

http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-file.cc@116
PS5, Line 116:
> Think for a while, maybe DISABLED is better, because it means the block doe
Done


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-io-mgr.cc@424
PS5, Line 424:           read_buffer_bloc, MemBlockStatus::DISABLED, dstl, 
&read_buffer_lock)) {
> Because the memory of the block is released, if this case happens, we can't
Well, if we are here, it means we want to a memory block to hold some data, and 
query is not cancelled or all pages of the block are read, right?

The transition here is actually from DISABLED back to ALLOC to facilitate a 
reuse.

I am thinking that there are limited number of mem blocks to use (say 10) that 
are not enough to cover the entire file. See my other comment on when the 
available memory from system is not enough for the entire file.


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-io-mgr.cc@430
PS5, Line 430: RN_IF_ERROR(
This could be a performance hit to open/close a file to read one data block. Is 
it possible to keep the file open?


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-io-mgr.cc@436
PS5, Line 436: atus = Status(TErrorCode::DISK_IO_ERROR, GetBackendStri
> The read timer works when it is destructed. It counts the time interval dur
Done


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-io-mgr.cc@445
PS5, Line 445: queue->read_size()->Update(local_file_size);
             :       disk_file_dst_->SetReadBufferBlockStatus(
> Do you mean delete the memory block if fails? Right now it is handled by th
See my comment above. Basically, I have a good mem block and there is a read 
error for a particular offset. The question is whether we can use the block on 
other offsets.

But if the reading error here is fatal (say for HJ builder, we can't read some 
block of data), then we should terminate the execution for the query.


http://gerrit.cloudera.org:8080/#/c/17979/5/be/src/runtime/io/disk-io-mgr.cc@452
PS5, Line 452:
This code could mask out a read error set at line 445.  Better to return the 
read error.


http://gerrit.cloudera.org:8080/#/c/17979/6/be/src/runtime/tmp-file-mgr.cc
File be/src/runtime/tmp-file-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/17979/6/be/src/runtime/tmp-file-mgr.cc@402
PS6, Line 402: TmpFileMgr::SetUpReadBufferParams
Since the work in this method is done for tmp_dirs_remote_ctrl_, suggest to 
move it to class TmpDirRemoteCtrl.


http://gerrit.cloudera.org:8080/#/c/17979/6/be/src/runtime/tmp-file-mgr.cc@411
PS6, Line 411:  tmp_dirs_remote_ctrl_.num_read_buffer_blocks_per_file_ =
             :       
static_cast<int>(tmp_dirs_remote_ctrl_.remote_tmp_file_size_
             :           / tmp_dirs_remote_ctrl_.read_buffer_block_size_);
This assumes that there is enough memory to hold the entire file. When it is 
not the case as handled at line 424 below, I wonder if we need to adjust 
num_read_buffer_blocks_per_file_ accordingly. A different strategy is to take 
the max allowed memory 'max_allow_bytes' and use it to figure out 
num_read_buffer_blocks_per_file_.



--
To view, visit http://gerrit.cloudera.org:8080/17979
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1dcc5d0881ffaeff09c5c514306cd668373ad31b
Gerrit-Change-Number: 17979
Gerrit-PatchSet: 6
Gerrit-Owner: Yida Wu <wydbaggio...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Yida Wu <wydbaggio...@gmail.com>
Gerrit-Comment-Date: Wed, 10 Nov 2021 14:56:01 +0000
Gerrit-HasComments: Yes

Reply via email to