Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/17979 )
Change subject: IMPALA-10791 Add batch reading for remote temporary files ...................................................................... Patch Set 13: (14 comments) http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/disk-file.h File be/src/runtime/io/disk-file.h: http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/disk-file.h@55 PS12, Line 55: /// The caller may need to maintain the status of the MemBlock and make sure the block is > It is quite a simple class working closely with the DiskFile and TmpFile st Ack http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/disk-file.h@220 PS12, Line 220: SpinLock read_buffer_ctrl_lock_; > Maybe personal preference. Prefer a fixed-size array if the size is fixed a Ack http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr-test.cc File be/src/runtime/io/disk-io-mgr-test.cc: http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr-test.cc@191 PS13, Line 191: void TestMemBlockStatus(MemBlockStatus last_status); Naming this "ValidateMemBlockStatus" seems more consistent with naming test helper functions. http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr-test.cc@2494 PS13, Line 2494: ObjectPool pool; ObjectPool is useful when you need to generate a bunch or variable number of objects that need to persist passed the creation scope. That's not the case here, you could use a local variable, as in MemBlock block(block_id); http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr-test.cc@2525 PS13, Line 2525: TEST_F(DiskIoMgrTest, MemBlockTest) { Does disk-io-mgr-test contain a bunch of unit tests for DiskFile too? This would make more sense to me in a new file: disk-file-test.cc. I'd also be interested in negative tests, but it looks like we only enforce order via DCHECK so the negative tests wouldn't really work. http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr.cc File be/src/runtime/io/disk-io-mgr.cc: http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr.cc@264 PS13, Line 264: if (disk_file_->IsBatchReadEnabled()) { The condition check is unnecessary because UpdateReadBufferMetaData returns immediately when !IsBatchReadEnabled. http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr.cc@353 PS13, Line 353: // is_to_delete flag. nit: line length limit is 90, this doesn't need to wrap. http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr.cc@407 PS13, Line 407: const char* remote_file_path = disk_file_src_->path().c_str(); Storing the result of `c_str()` is usually a bad idea, since it has no guaranteed lifetime. In this case it looks like it's fine, but also unnecessary. I'd personally get the string& instead and use that for later calls. http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr.cc@411 PS13, Line 411: int ret = 0; Initializing this up-front is a very old-school C idiom. Compilers can often optimize better if you declare the variable in the scope where it's used, so I'd prefer to move this to the else block at line 444 where it's actually needed. http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/io/disk-io-mgr.cc@447 PS13, Line 447: ret = hdfsPreadFully(hdfs_conn, remote_hdfs_file, remote_file_offset, If we wanted to get fancy we could use a lambda to return the read result. I think that adds no additional overhead, but also entirely unnecessary. int ret = [&]() { ScopedHistogramTimer read_timer(queue->read_latency()); return hdfsPreadFully(hdfs_conn, remote_hdfs_file, remote_file_offset, read_buffer_bloc->data(), local_file_size); }(); http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/scan-range.cc File be/src/runtime/io/scan-range.cc: http://gerrit.cloudera.org:8080/#/c/17979/12/be/src/runtime/io/scan-range.cc@156 PS12, Line 156: // Set the correct reader to read the range if the memory buffer is not available. > Added some comments. The logic here is to set the correct file reader if it Ack http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/tmp-file-mgr-test.cc File be/src/runtime/tmp-file-mgr-test.cc: http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/tmp-file-mgr-test.cc@2101 PS13, Line 2101: TEST_F(TmpFileMgrTest, TestBatchReadingFromRemote) { Can we add a test case where the remote file is deleted while DoUpload is in progress, and verify it stops early? Not sure if that belongs here or in disk-io-mgr-test.cc. http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/tmp-file-mgr.cc File be/src/runtime/tmp-file-mgr.cc: http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/tmp-file-mgr.cc@1019 PS13, Line 1019: // For other status, will be treated as disabled. So if MemBlock is reserved/allocated, that means another process has already called this function but the async RemoteOperRange hasn't finished yet? Might be nice to lay that out explicitly in the comment. http://gerrit.cloudera.org:8080/#/c/17979/13/be/src/runtime/tmp-file-mgr.cc@1022 PS13, Line 1022: return; Ok, took me a sec to see why these checks make sense even though the only caller - GetReadBufferFile - also does them before calling. These are done within the lock, so the caller is short-circuiting with unlocked calls to see if the memblock has advanced to written/disabled yet. That would probably be clearer to me if all the checks were in this function rather than split across function and caller. -- To view, visit http://gerrit.cloudera.org:8080/17979 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1dcc5d0881ffaeff09c5c514306cd668373ad31b Gerrit-Change-Number: 17979 Gerrit-PatchSet: 13 Gerrit-Owner: Yida Wu <wydbaggio...@gmail.com> Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Michael Smith <michael.sm...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Yida Wu <wydbaggio...@gmail.com> Gerrit-Comment-Date: Tue, 28 Jun 2022 19:28:27 +0000 Gerrit-HasComments: Yes