Yida Wu has uploaded a new patch set (#36). ( http://gerrit.cloudera.org:8080/16318 )
Change subject: IMPALA-9867: Add Support for Spilling to S3: Milestone 1
......................................................................

IMPALA-9867: Add Support for Spilling to S3: Milestone 1

Major Features
1) Local files as buffers for spilling to S3.
2) Async upload of remote files.
3) Synchronous deletion of remote files after the query ends.
4) Local buffer file management.
5) Compatibility between spilling to local and remote storage.
6) All errors from HDFS/S3 terminate the query.

Changes on TmpFile:
* TmpFile is separated into two implementations, TmpFileLocal and
  TmpFileRemote. TmpFileLocal is used for spilling to the local file
  system. TmpFileRemote is a new type for spilling to remote storage; it
  contains two DiskFiles, one for the local buffer and one for the
  remote file.
* A DiskFile is an object that carries the information of a physical
  file, passed to the DiskIOMgr to execute IO operations on that
  specific file. The DiskFile also tracks the status of the file, one of
  DiskFileStatus::INWRITING/PERSISTED/DELETED. When a DiskFile is
  initialized, it is in INWRITING status. Once the file is persisted to
  the file system, it moves to PERSISTED. If the file is deleted, for
  example when the local buffer is evicted, the buffer file's DiskFile
  status becomes DELETED. After that, if the file is being fetched from
  the remote, the buffer file's status returns to INWRITING, and then to
  PERSISTED once the fetch finishes successfully.

Implementation Details:
1) A new enum type specifies the disk type of a file, indicating where
   the file physically resides. The types are
   DiskFileType::LOCAL/LOCAL_BUFFER/DFS/S3. DiskFileType::LOCAL
   indicates the file is in the local file system.
   DiskFileType::LOCAL_BUFFER indicates the file is in the local file
   system and is the buffer of a remote scratch file.
   DiskFileType::DFS/S3 indicates the file is in HDFS/S3.
The local buffer allows the buffer pool to pin (read) pages, but for
remote files the buffer pool mainly pins (reads) the page from the
remote file system.
2) Two disk queues are added to run file operation jobs:
   RemoteS3DiskFileOper and RemoteDfsDiskFileOper. File operations on
   the remote disk, such as upload and fetch, are done in these queues.
   The purpose of the queues is to isolate file operations from normal
   read/write IO operations. This improves the efficiency of file
   operations, which have relatively long execution times and should not
   be interrupted, and gives more accurate control over the number of
   threads working on file operation jobs. RemoteOperRange is the new
   type that carries file operation jobs. Previously, we had the request
   types READ and WRITE; FILE_FETCH and FILE_UPLOAD are now added.
3) The tmp files are physically deleted when the tmp file group is
   destructed. For remote files, the entire directory is deleted.
4) Local buffer file management controls the total size of local buffer
   files and evicts files if needed. A local buffer file can be evicted
   once the temporary file has uploaded a copy to the remote disk or the
   query ends. Two modes decide the order in which files are evicted:
   the default is LIFO, the other is FIFO, controlled by the startup
   option remote_tmp_files_avail_pool_lifo. Also, a thread
   TmpFileSpaceReserveThreadLoop in TmpFileMgr reserves buffer file
   space asynchronously to avoid deadlocks.
   Startup option allow_spill_to_hdfs is added. By default an HDFS path
   is not allowed, but the option can be set to true to allow HDFS paths
   as scratch space for testing only.
   Startup option wait_for_spill_buffer_timeout_s is added to control
   the maximum duration to wait for a buffer in the TmpFileBufferPool.
   The default value is 60, which stands for 60 seconds.
5) Spilling to local storage has higher priority than spilling to
   remote.
If no local scratch space is available, temporary data is spilled to
remote storage. If any remote directory is configured, the first
available local directory is used as the local buffer for spilling to
remote. If a remote directory is configured without any available local
scratch space, an error is returned during initialization. This design
simplifies the implementation in milestone 1 with fewer changes to the
configuration.

Example (setting remote scratch space):
Assume the directories we have for scratch space are:
* Local dirs: /tmp/local_buffer, /tmp/local, /tmp/local_sec
* Remote dir: s3a://tmp/remote

The scratch space path is configured via startup options and can take
three types of configuration:
1. Pure local scratch space
   --scratch_dirs="/tmp/local"
2. Pure remote scratch space
   --scratch_dirs="s3a://tmp/remote,/tmp/local_buffer:16GB"
3. Mixed local and remote scratch space
   --scratch_dirs="s3a://tmp/remote:200GB,/tmp/local_buffer:1GB,/tmp/local:2GB,/tmp/local_sec:16GB"

* Type 1: a pure local scratch space with unlimited size.
* Type 2: a pure remote scratch space with a 16GB local buffer.
* Type 3: a mixed local and remote scratch space; the local buffer for
  the remote directory is 1GB, the local scratch spaces are 2GB and
  16GB, and the remote scratch space byte limit is 200GB. Remote scratch
  space is used only when all of the local spaces are at capacity.
* Note: if a remote scratch space is registered, the first local
  directory is the local buffer path.

Limitations:
* Only one remote scratch dir is supported.
* If a remote scratch dir exists, the first local scratch dir is used as
  the buffer for the remote scratch space.

Testcases:
* Ran pre-review-test.
* Unit tests added to tmp-file-mgr-test/disk-io-mgr-test/buffer-pool-test.
* E2E tests added to custom_cluster/test_scratch_disk.py.
* Ran unit tests:
  $IMPALA_HOME/be/build/debug/runtime/buffered-tuple-stream-test
  $IMPALA_HOME/be/build/debug/runtime/tmp-file-mgr-test
  $IMPALA_HOME/be/build/debug/runtime/bufferpool/buffer-pool-test
  $IMPALA_HOME/be/build/debug/runtime/io/disk-io-mgr-test
* Ran E2E tests: custom_cluster/test_scratch_disk.py

Change-Id: I419b1d5dbbfe35334d9f964c4b65e553579fdc89
---
M be/src/runtime/bufferpool/buffer-pool-test.cc
M be/src/runtime/hdfs-fs-cache.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/disk-file.cc
A be/src/runtime/io/disk-file.h
M be/src/runtime/io/disk-io-mgr-test.cc
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
A be/src/runtime/io/file-writer.h
M be/src/runtime/io/local-file-system.cc
M be/src/runtime/io/local-file-system.h
A be/src/runtime/io/local-file-writer.cc
A be/src/runtime/io/local-file-writer.h
M be/src/runtime/io/request-context.cc
M be/src/runtime/io/request-context.h
M be/src/runtime/io/request-ranges.h
M be/src/runtime/io/scan-range.cc
M be/src/runtime/tmp-file-mgr-internal.h
M be/src/runtime/tmp-file-mgr-test.cc
M be/src/runtime/tmp-file-mgr.cc
M be/src/runtime/tmp-file-mgr.h
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/metrics.json
M tests/custom_cluster/test_scratch_disk.py
26 files changed, 4,336 insertions(+), 279 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/18/16318/36
--
To view, visit http://gerrit.cloudera.org:8080/16318
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I419b1d5dbbfe35334d9f964c4b65e553579fdc89
Gerrit-Change-Number: 16318
Gerrit-PatchSet: 36
Gerrit-Owner: Yida Wu <wydbaggio...@gmail.com>
Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Yida Wu <wydbaggio...@gmail.com>