Yida Wu has uploaded a new patch set (#36). ( http://gerrit.cloudera.org:8080/16318 )

Change subject: IMPALA-9867: Add Support for Spilling to S3: Milestone 1
......................................................................

IMPALA-9867: Add Support for Spilling to S3: Milestone 1

Major Features
1) Local files as buffers for spilling to S3.
2) Asynchronous upload of remote files.
3) Synchronous deletion of remote files after the query ends.
4) Management of local buffer files.
5) Compatibility between spilling to local and to remote storage.
6) Any error from HDFS/S3 terminates the query.

Changes to TmpFile:
* TmpFile is split into two implementations, TmpFileLocal and
  TmpFileRemote.
  TmpFileLocal is used for spilling to the local file system.
  TmpFileRemote is a new type for spilling to remote storage. It
  contains two DiskFiles, one for the local buffer and one for the
  remote file.
* A DiskFile is an object that holds the information about a physical
  file and is passed to the DiskIOMgr to execute the IO operations on
  that specific file. The DiskFile also tracks the status of the file,
  one of DiskFileStatus::INWRITING/PERSISTED/DELETED.
  When a DiskFile is initialized, it is in the INWRITING status. Once
  the file is persisted to the file system, it becomes PERSISTED. If
  the file is deleted, for example when the local buffer is evicted,
  the status of the buffer file becomes DELETED. After that, if the
  file is being fetched from the remote, the status of the buffer file
  goes back to INWRITING, and then to PERSISTED once the fetch finishes
  successfully.
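
The status transitions above can be summarized as follows. This is a
minimal sketch: the status names come from this change, while the C++
declaration around them is illustrative only.

  // Illustrative sketch; the value names are from this patch, the
  // enum declaration itself is an assumption.
  enum class DiskFileStatus { INWRITING, PERSISTED, DELETED };

  // Lifecycle of the local buffer file of a remote scratch file:
  //   INWRITING -> PERSISTED   file written and persisted to the file system
  //   PERSISTED -> DELETED     local buffer evicted after the upload
  //   DELETED   -> INWRITING   page is fetched back from the remote
  //   INWRITING -> PERSISTED   fetch finishes successfully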

Implementation Details:
1) A new enum type is added to specify the disk type of a file,
   indicating where the file is physically located.
   The types are DiskFileType::LOCAL/LOCAL_BUFFER/DFS/S3.
   DiskFileType::LOCAL indicates the file is on the local file system.
   DiskFileType::LOCAL_BUFFER indicates the file is on the local file
   system and is the buffer of a remote scratch file.
   DiskFileType::DFS/S3 indicates the file is on HDFS/S3.
   The buffer pool is allowed to pin (read) pages from the local
   buffer, but for remote files it mainly pins (reads) the page from
   the remote file system. (See the sketch after this list.)
2) Two disk queues have been added to handle file operation jobs.
   Queue names: RemoteS3DiskFileOper/RemoteDfsDiskFileOper
   File operations on the remote disk, such as upload and fetch, are
   done in these queues. The purpose of the queues is to isolate the
   file operations from normal read/write IO operations. This makes
   the file operations more efficient, because they are not
   interrupted during their relatively long execution time, and it
   gives more precise control over the number of threads working on
   file operation jobs.
   RemoteOperRange is the new type that carries the file operation
   jobs. Previously we had the request types READ and WRITE; now
   FILE_FETCH/FILE_UPLOAD are added (see the sketch after this list).
3) The tmp files are physically deleted when the tmp file group is
   destructed. For remote files, the entire directory is deleted.
4) Local buffer file management controls the total size of the local
   buffer files and evicts files if needed.
   A local buffer file can be evicted once the temporary file has been
   uploaded to the remote disk or the query ends.
   There are two modes that decide the order in which files are chosen
   for eviction: the default is LIFO, the other is FIFO. This is
   controlled by the startup option remote_tmp_files_avail_pool_lifo.
   Also, a thread TmpFileSpaceReserveThreadLoop in TmpFileMgr reserves
   buffer file space asynchronously to avoid deadlocks.
   The startup option allow_spill_to_hdfs is added. By default an HDFS
   path is not allowed, but for testcases the option can be set to
   true to allow an HDFS path as scratch space for testing only.
   The startup option wait_for_spill_buffer_timeout_s is added to
   control the maximum time to wait for a buffer in the
   TmpFileBufferPool. The default value is 60, which stands for 60
   seconds. (An example of these options follows this list.)
5) Spilling to local scratch space has higher priority than spilling
   to remote. If no local scratch space is available, temporary data
   is spilled to remote.
   If any remote directory is configured, the first available local
   directory is used as the local buffer for spilling to remote.
   If a remote directory is configured without any available local
   scratch space, an error is returned during initialization.
   The purpose of this design is to simplify the implementation in
   milestone 1 with fewer changes to the configuration.
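
As referenced in 1) and 2) above, here is a minimal sketch of the new
enum values. The value names are taken from this change; the exact
declarations are assumptions shown for illustration only.

  // Illustrative only; the value names come from this patch, the
  // declarations themselves are assumed.
  enum class DiskFileType {
    LOCAL,         // file is on the local file system
    LOCAL_BUFFER,  // local file acting as the buffer of a remote scratch file
    DFS,           // file is on HDFS
    S3             // file is on S3
  };

  // Request types handled by the IO manager. READ/WRITE already
  // existed; FILE_FETCH/FILE_UPLOAD are added by this patch and are
  // carried by RemoteOperRange on the RemoteS3DiskFileOper/
  // RemoteDfsDiskFileOper queues.
  struct RequestType {
    enum type { READ, WRITE, FILE_FETCH, FILE_UPLOAD };
  };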
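
As referenced in 4) above, a hypothetical set of startup options that
switches eviction to FIFO and raises the buffer wait timeout could look
like the following (the values are illustrative only):
  --remote_tmp_files_avail_pool_lifo=false
  --wait_for_spill_buffer_timeout_s=120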

Example (setting remote scratch space):
Assume the following directories are available for scratch space:
* Local dir: /tmp/local_buffer, /tmp/local, /tmp/local_sec
* Remote dir: s3a://tmp/remote
The scratch space paths are configured via the startup options, and
there are three types of configurations:
1. Pure local scratch space
  --scratch_dirs="/tmp/local"
2. Pure remote scratch space
  --scratch_dirs="s3a://tmp/remote,/tmp/local_buffer:16GB"
3. Mixed local and remote scratch space
  --scratch_dirs="s3a://tmp/romote:200GB,/tmp/local_buffer:1GB,
/tmp/local:2GB, /tmp/local_sec:16GB"
* Type 1: a pure local scratch space with unlimited size.
* Type 2: a pure remote scratch space with a 16GB local buffer.
* Type 3: a mixed local and remote scratch space. The local buffer for
the remote directory is 1GB, the local scratch spaces are 2GB and 16GB,
and the remote scratch space byte limit is 200GB. The remote scratch
space is used only when all of the local spaces are at capacity.
* Note: The first local directory is used as the local buffer path if a
remote scratch space is registered.

Limitations:
* Only one remote scratch dir is supported.
* The first local scratch dir is used as the buffer for the remote
  scratch space if a remote scratch dir exists.

Testcases:
* Ran pre-review-test
* Unit Tests added to
  tmp-file-mgr-test/disk-io-mgr-test/buffer-pool-test.
* E2E Tests added to custom_cluster/test_scratch_disk.py.
* Ran Unit Tests:
$IMPALA_HOME/be/build/debug/runtime/buffered-tuple-stream-test
$IMPALA_HOME/be/build/debug/runtime/tmp-file-mgr-test
$IMPALA_HOME/be/build/debug/runtime/bufferpool/buffer-pool-test
$IMPALA_HOME/be/build/debug/runtime/io/disk-io-mgr-test
* Ran E2E Tests:
custom_cluster/test_scratch_disk.py

Change-Id: I419b1d5dbbfe35334d9f964c4b65e553579fdc89
---
M be/src/runtime/bufferpool/buffer-pool-test.cc
M be/src/runtime/hdfs-fs-cache.cc
M be/src/runtime/hdfs-fs-cache.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/disk-file.cc
A be/src/runtime/io/disk-file.h
M be/src/runtime/io/disk-io-mgr-test.cc
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
A be/src/runtime/io/file-writer.h
M be/src/runtime/io/local-file-system.cc
M be/src/runtime/io/local-file-system.h
A be/src/runtime/io/local-file-writer.cc
A be/src/runtime/io/local-file-writer.h
M be/src/runtime/io/request-context.cc
M be/src/runtime/io/request-context.h
M be/src/runtime/io/request-ranges.h
M be/src/runtime/io/scan-range.cc
M be/src/runtime/tmp-file-mgr-internal.h
M be/src/runtime/tmp-file-mgr-test.cc
M be/src/runtime/tmp-file-mgr.cc
M be/src/runtime/tmp-file-mgr.h
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M common/thrift/metrics.json
M tests/custom_cluster/test_scratch_disk.py
26 files changed, 4,336 insertions(+), 279 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/18/16318/36
--
To view, visit http://gerrit.cloudera.org:8080/16318
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I419b1d5dbbfe35334d9f964c4b65e553579fdc89
Gerrit-Change-Number: 16318
Gerrit-PatchSet: 36
Gerrit-Owner: Yida Wu <wydbaggio...@gmail.com>
Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Yida Wu <wydbaggio...@gmail.com>
