Michael Smith has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17979 )

Change subject: IMPALA-10791 Add batch reading for remote temporary files
......................................................................


Patch Set 13:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/disk-file.h
File be/src/runtime/io/disk-file.h:

http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/disk-file.h@55
PS12, Line 55: /// The caller may need to maintain the status of the MemBlock 
and make sure the block is
> It is quite a simple class working closely with the DiskFile and TmpFile st
Ack


http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/disk-file.h@220
PS12, Line 220:     SpinLock read_buffer_ctrl_lock_;
> Maybe personal preference. Prefer a fixed-size array if the size is fixed a
Ack


http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr-test.cc
File be/src/runtime/io/disk-io-mgr-test.cc:

http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr-test.cc@191
PS13, Line 191:   void TestMemBlockStatus(MemBlockStatus last_status);
Naming this "ValidateMemBlockStatus" seems more consistent with naming test 
helper functions.


http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr-test.cc@2494
PS13, Line 2494:   ObjectPool pool;
ObjectPool is useful when you need to generate a bunch or variable number of 
objects that need to persist passed the creation scope. That's not the case 
here, you could use a local variable, as in

    MemBlock block(block_id);


http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr-test.cc@2525
PS13, Line 2525: TEST_F(DiskIoMgrTest, MemBlockTest) {
Does disk-io-mgr-test contain a bunch of unit tests for DiskFile too? This 
would make more sense to me in a new file: disk-file-test.cc.

I'd also be interested in negative tests, but it looks like we only enforce 
order via DCHECK so the negative tests wouldn't really work.


http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr.cc@264
PS13, Line 264:     if (disk_file_->IsBatchReadEnabled()) {
The condition check is unnecessary because UpdateReadBufferMetaData returns 
immediately when !IsBatchReadEnabled.


http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr.cc@353
PS13, Line 353:       // is_to_delete flag.
nit: line length limit is 90, this doesn't need to wrap.


http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr.cc@407
PS13, Line 407:   const char* remote_file_path = disk_file_src_->path().c_str();
Storing the result of `c_str()` is usually a bad idea, since it has no 
guaranteed lifetime. In this case it looks like it's fine, but also 
unnecessary. I'd personally get the string& instead and use that for later 
calls.


http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr.cc@411
PS13, Line 411:   int ret = 0;
Initializing this up-front is a very old-school C idiom. Compilers can often 
optimize better if you declare the variable in the scope where it's used, so 
I'd prefer to move this to the else block at line 444 where it's actually 
needed.


http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr.cc@447
PS13, Line 447:       ret = hdfsPreadFully(hdfs_conn, remote_hdfs_file, 
remote_file_offset,
If we wanted to get fancy we could use a lambda to return the read result. I 
think that adds no additional overhead, but also entirely unnecessary.

    int ret = [&]() {
      ScopedHistogramTimer read_timer(queue->read_latency());
      return hdfsPreadFully(hdfs_conn, remote_hdfs_file, remote_file_offset, 
read_buffer_bloc->data(), local_file_size);
    }();


http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/scan-range.cc
File be/src/runtime/io/scan-range.cc:

http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/scan-range.cc@156
PS12, Line 156:     // Set the correct reader to read the range if the memory 
buffer is not available.
> Added some comments. The logic here is to set the correct file reader if it
Ack


http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/tmp-file-mgr-test.cc
File be/src/runtime/tmp-file-mgr-test.cc:

http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/tmp-file-mgr-test.cc@2101
PS13, Line 2101: TEST_F(TmpFileMgrTest, TestBatchReadingFromRemote) {
Can we add a test case where the remote file is deleted while DoUpload is in 
progress, and verify it stops early? Not sure if that belongs here or in 
disk-io-mgr-test.cc.


http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/tmp-file-mgr.cc
File be/src/runtime/tmp-file-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/tmp-file-mgr.cc@1019
PS13, Line 1019:     // For other status, will be treated as disabled.
So if MemBlock is reserved/allocated, that means another process has already 
called this function but the async RemoteOperRange hasn't finished yet? Might 
be nice to lay that out explicitly in the comment.


http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/tmp-file-mgr.cc@1022
PS13, Line 1022:       return;
Ok, took me a sec to see why these checks make sense even though the only 
caller - GetReadBufferFile - also does them before calling. These are done 
within the lock, so the caller is short-circuiting with unlocked calls to see 
if the memblock has advanced to written/disabled yet.

That would probably be clearer to me if all the checks were in this function 
rather than split across function and caller.



--
To view, visit http://gerrit.cloudera.org:8080/17979
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1dcc5d0881ffaeff09c5c514306cd668373ad31b
Gerrit-Change-Number: 17979
Gerrit-PatchSet: 13
Gerrit-Owner: Yida Wu <wydbaggio...@gmail.com>
Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Michael Smith <michael.sm...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Yida Wu <wydbaggio...@gmail.com>
Gerrit-Comment-Date: Tue, 28 Jun 2022 19:28:27 +0000
Gerrit-HasComments: Yes

Reply via email to