[Impala-ASF-CR] IMPALA-10943: Add test to verify support for multiple resource and executor pools
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17891 ) Change subject: IMPALA-10943: Add test to verify support for multiple resource and executor pools .. Patch Set 3: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7584/ -- To view, visit http://gerrit.cloudera.org:8080/17891 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If76d386d8de5730da937674ddd9a69aa1aa1355e Gerrit-Change-Number: 17891 Gerrit-PatchSet: 3 Gerrit-Owner: Bikramjeet Vig Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Andrew Sherman Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Comment-Date: Tue, 02 Nov 2021 05:23:22 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10923: Fine grained table refreshing at partition level events for transactional tables
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17858 ) Change subject: IMPALA-10923: Fine grained table refreshing at partition level events for transactional tables .. Patch Set 12: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/9706/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/17858 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6ba07c9a338a25614690e314335ee4b801486da9 Gerrit-Change-Number: 17858 Gerrit-PatchSet: 12 Gerrit-Owner: Yu-Wen Lai Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Fucun Chu Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sourabh Goyal Gerrit-Reviewer: Vihang Karajgaonkar Gerrit-Reviewer: Yu-Wen Lai Gerrit-Comment-Date: Tue, 02 Nov 2021 05:19:33 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10923: Fine grained table refreshing at partition level events for transactional tables
Yu-Wen Lai has uploaded a new patch set (#12). ( http://gerrit.cloudera.org:8080/17858 ) Change subject: IMPALA-10923: Fine grained table refreshing at partition level events for transactional tables .. IMPALA-10923: Fine grained table refreshing at partition level events for transactional tables To enable fine-grained table refreshing, there are three main changes in this commit. 1. Maintain validWriteIdList in Catalogd for transactional tables. We will keep track of write id changes for partitioned tables through AllocWriteIdEvents, CommitTxnEvents, and AbortTxnEvents. 2. Conduct partition level refreshing for transactional tables' addPartitionEvents, dropPartitionEvents, and AlterPartitionEvents. 3. Introduce a config hms_event_incremental_refresh_transactional_table, which can switch on/off the fine-grained table refreshing. Performance Tests: A simple test was performed by running an insert into one partition of a partitioned ACID table (50,000 partitions). Below is the time taken to refresh this table via the event.
Storage   Before    After
=============================
S3        50 secs   50 msecs
local     3 secs    3 msecs

Change-Id: I6ba07c9a338a25614690e314335ee4b801486da9 --- M be/src/catalog/catalog-server.cc M be/src/util/backend-gflag-util.cc M common/thrift/BackendGflags.thrift M fe/src/main/java/org/apache/impala/catalog/Catalog.java M fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/catalog/Table.java A fe/src/main/java/org/apache/impala/catalog/TableWriteId.java M fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java M fe/src/main/java/org/apache/impala/hive/common/MutableValidReaderWriteIdList.java M fe/src/main/java/org/apache/impala/hive/common/MutableValidWriteIdList.java M fe/src/main/java/org/apache/impala/service/BackendConfig.java M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java A fe/src/test/java/org/apache/impala/catalog/CatalogTableWriteIdTest.java M fe/src/test/java/org/apache/impala/catalog/CatalogTest.java M fe/src/test/java/org/apache/impala/catalog/events/MetastoreEventsProcessorTest.java M fe/src/test/java/org/apache/impala/hive/common/MutableValidReaderWriteIdListTest.java 17 files changed, 1,002 insertions(+), 58 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/17858/12 -- To view, visit http://gerrit.cloudera.org:8080/17858 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6ba07c9a338a25614690e314335ee4b801486da9 Gerrit-Change-Number: 17858 Gerrit-PatchSet: 12 Gerrit-Owner: Yu-Wen Lai Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Fucun Chu Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sourabh Goyal Gerrit-Reviewer: Vihang Karajgaonkar Gerrit-Reviewer: Yu-Wen Lai
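Editorial illustration for the IMPALA-10923 commit message above: the first change it lists (tracking write id changes through alloc, commit, and abort events) could be pictured roughly as below. This is only a sketch under assumptions. The class WriteIdTracker and its method names are hypothetical and are not the API of MutableValidWriteIdList or TableWriteId in the patch; the validity rule (at or below the high watermark, and neither open nor aborted) follows the usual Hive valid-write-id-list convention.

```python
class WriteIdTracker:
    """Hypothetical per-table tracker of ACID write ids (not Impala's real class)."""

    def __init__(self):
        self.high_watermark = 0   # highest write id allocated so far
        self.open_ids = set()     # allocated, but neither committed nor aborted yet
        self.aborted_ids = set()  # write ids whose transactions were aborted

    def on_alloc(self, write_id):
        # Corresponds to seeing an alloc-write-id event for this table.
        self.open_ids.add(write_id)
        self.high_watermark = max(self.high_watermark, write_id)

    def on_commit(self, write_id):
        # Corresponds to a commit-txn event: the write id becomes readable.
        self.open_ids.discard(write_id)

    def on_abort(self, write_id):
        # Corresponds to an abort-txn event: the write id must stay invisible.
        self.open_ids.discard(write_id)
        self.aborted_ids.add(write_id)

    def is_valid(self, write_id):
        # Readable only if allocated, not still open, and not aborted.
        return (write_id <= self.high_watermark
                and write_id not in self.open_ids
                and write_id not in self.aborted_ids)


tracker = WriteIdTracker()
for wid in (1, 2, 3):
    tracker.on_alloc(wid)
tracker.on_commit(1)   # write id 1 committed
tracker.on_abort(2)    # write id 2 aborted, write id 3 still open
visible = [wid for wid in (1, 2, 3, 4) if tracker.is_valid(wid)]
```

With this bookkeeping kept up to date from events, only the affected partition needs reloading, which is the idea behind the partition-level refresh the commit describes.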
[Impala-ASF-CR] IMPALA-10923: Fine grained table refreshing at partition level events for transactional tables
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17858 ) Change subject: IMPALA-10923: Fine grained table refreshing at partition level events for transactional tables .. Patch Set 12: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7585/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/17858 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6ba07c9a338a25614690e314335ee4b801486da9 Gerrit-Change-Number: 17858 Gerrit-PatchSet: 12 Gerrit-Owner: Yu-Wen Lai Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Fucun Chu Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sourabh Goyal Gerrit-Reviewer: Vihang Karajgaonkar Gerrit-Reviewer: Yu-Wen Lai Gerrit-Comment-Date: Tue, 02 Nov 2021 04:59:13 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10926: Improve catalogd consistency and self events detection
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17859 ) Change subject: IMPALA-10926: Improve catalogd consistency and self events detection .. Patch Set 26: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7582/ -- To view, visit http://gerrit.cloudera.org:8080/17859 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36364e401911352c474eb98c8d61bbaae9b9 Gerrit-Change-Number: 17859 Gerrit-PatchSet: 26 Gerrit-Owner: Sourabh Goyal Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sourabh Goyal Gerrit-Reviewer: Vihang Karajgaonkar Gerrit-Reviewer: Yu-Wen Lai Gerrit-Comment-Date: Tue, 02 Nov 2021 03:50:28 + Gerrit-HasComments: No
[Impala-ASF-CR] WiP: IMPALA-10798 : Prototype for JSON reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17771 ) Change subject: WiP: IMPALA-10798 : Prototype for JSON reader .. Patch Set 7: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/9705/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/17771 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If79364a421d862d0d837f9be694911e388d4d629 Gerrit-Change-Number: 17771 Gerrit-PatchSet: 7 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Tue, 02 Nov 2021 03:43:46 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10934: Enable table definition over a single file
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/17878 ) Change subject: IMPALA-10934: Enable table definition over a single file .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/17878/2/be/src/runtime/io/disk-io-mgr.cc File be/src/runtime/io/disk-io-mgr.cc: http://gerrit.cloudera.org:8080/#/c/17878/2/be/src/runtime/io/disk-io-mgr.cc@142 PS2, Line 142: // The maximum number of SFS I/O threads. : DEFINE_int32(num_sfs_io_threads, 16, "Number of SFS I/O threads"); > Agree that turning off file handle caching for the SFS case should not hurt My understanding is that the file handle cache will be disabled for SFS unless we explicitly try to enable it. That's probably ok. The path to enabling the file handle cache would be to understand the distinction between SFS+S3 vs SFS+HDFS vs whatnot and map them to the right thread pools. That probably isn't that hard if we want to go that way, and it could be done in the backend. -- To view, visit http://gerrit.cloudera.org:8080/17878 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I32be936243aa4c8320f5d06d2b7fbf98822f82e7 Gerrit-Change-Number: 17878 Gerrit-PatchSet: 2 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Comment-Date: Tue, 02 Nov 2021 03:31:20 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WiP: IMPALA-10798 : Prototype for JSON reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17771 ) Change subject: WiP: IMPALA-10798 : Prototype for JSON reader .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/17771/7/bin/bootstrap_toolchain.py File bin/bootstrap_toolchain.py: http://gerrit.cloudera.org:8080/#/c/17771/7/bin/bootstrap_toolchain.py@469 PS7, Line 469: ) flake8: E501 line too long (91 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/17771 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If79364a421d862d0d837f9be694911e388d4d629 Gerrit-Change-Number: 17771 Gerrit-PatchSet: 7 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Tue, 02 Nov 2021 03:22:49 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WiP: IMPALA-10798 : Prototype for JSON reader
Hello Quanlong Huang, Aman Sinha, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/17771 to look at the new patch set (#7). Change subject: WiP: IMPALA-10798 : Prototype for JSON reader .. WiP: IMPALA-10798 : Prototype for JSON reader This prototype allows a user to create a table stored as jsonfile and query it. Steps to test: - create a json table with a schema specified using eligible datatypes (int8/16/32/64/float/double/string/varchar/char/timestamp/boolean) - add your json file (with eligible datatypes and the same column names as the schema specified in the create command) to an hdfs location - add this 'location' to your table - run a select statement Fix: - arrow library is included wherever required - json format is added to scan node base class. - json scanner files are added that implement methods to read the json file from the specified file location Change-Id: If79364a421d862d0d837f9be694911e388d4d629 --- M CMakeLists.txt M be/CMakeLists.txt M be/src/exec/CMakeLists.txt A be/src/exec/hdfs-json-scanner.cc A be/src/exec/hdfs-json-scanner.h M be/src/exec/hdfs-scan-node-base.cc M bin/bootstrap_toolchain.py M bin/impala-config.sh A cmake_modules/FindArrow.cmake 9 files changed, 612 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/17771/7 -- To view, visit http://gerrit.cloudera.org:8080/17771 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: If79364a421d862d0d837f9be694911e388d4d629 Gerrit-Change-Number: 17771 Gerrit-PatchSet: 7 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] WiP: IMPALA-10798 : Prototype for JSON reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17771 ) Change subject: WiP: IMPALA-10798 : Prototype for JSON reader .. Patch Set 6: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/9704/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/17771 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If79364a421d862d0d837f9be694911e388d4d629 Gerrit-Change-Number: 17771 Gerrit-PatchSet: 6 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Tue, 02 Nov 2021 02:48:49 + Gerrit-HasComments: No
[Impala-ASF-CR] WiP: IMPALA-10798 : Prototype for JSON reader
Hello Quanlong Huang, Aman Sinha, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/17771 to look at the new patch set (#6). Change subject: WiP: IMPALA-10798 : Prototype for JSON reader .. WiP: IMPALA-10798 : Prototype for JSON reader This prototype allows a user to create a table stored as jsonfile and query it. Steps to test: - create a json table with a schema specified using eligible datatypes (int8/16/32/64/float/double/string/varchar/char/timestamp) - add your json file (with eligible datatypes and the same column names as the schema specified in the create command) to an hdfs location - add this 'location' to your table - run a select statement Fix: - arrow library is included wherever required - json format is added to scan node base class. - json scanner files are added that implement methods to read the json file from the specified file location Change-Id: If79364a421d862d0d837f9be694911e388d4d629 --- M CMakeLists.txt M be/CMakeLists.txt M be/src/exec/CMakeLists.txt A be/src/exec/hdfs-json-scanner.cc A be/src/exec/hdfs-json-scanner.h M be/src/exec/hdfs-scan-node-base.cc M bin/bootstrap_toolchain.py M bin/impala-config.sh A cmake_modules/FindArrow.cmake 9 files changed, 607 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/17771/6 -- To view, visit http://gerrit.cloudera.org:8080/17771 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: If79364a421d862d0d837f9be694911e388d4d629 Gerrit-Change-Number: 17771 Gerrit-PatchSet: 6 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] WiP: IMPALA-10798 : Prototype for JSON reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17771 ) Change subject: WiP: IMPALA-10798 : Prototype for JSON reader .. Patch Set 6: (1 comment) http://gerrit.cloudera.org:8080/#/c/17771/6/bin/bootstrap_toolchain.py File bin/bootstrap_toolchain.py: http://gerrit.cloudera.org:8080/#/c/17771/6/bin/bootstrap_toolchain.py@469 PS6, Line 469: ) flake8: E501 line too long (91 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/17771 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If79364a421d862d0d837f9be694911e388d4d629 Gerrit-Change-Number: 17771 Gerrit-PatchSet: 6 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Tue, 02 Nov 2021 02:26:44 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10791 Add batching reading for remote temporary files
Yida Wu has posted comments on this change. ( http://gerrit.cloudera.org:8080/17979 ) Change subject: IMPALA-10791 Add batching reading for remote temporary files .. Patch Set 4: (9 comments) http://gerrit.cloudera.org:8080/#/c/17979/1/be/src/runtime/io/disk-file.h File be/src/runtime/io/disk-file.h: http://gerrit.cloudera.org:8080/#/c/17979/1/be/src/runtime/io/disk-file.h@195 PS1, Line 195: > Can we use MemBlockState state_ here? Because the naming in DiskFile is "file_status_", maybe just keep them the same, otherwise it may be good to change all of them, including the interface names, but should be some work. http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/request-context.cc File be/src/runtime/io/request-context.cc: http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/request-context.cc@201 PS3, Line 201: unstarted_remote_file_op_ranges_; > Maybe named to unstarted_remote_file_op_ranges_? Done http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/request-ranges.h File be/src/runtime/io/request-ranges.h: http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/request-ranges.h@118 PS3, Line 118: WRITE, > May need to explain what it is. Done http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/request-ranges.h@702 PS3, Line 702: > nit. upload the file to a remote location. Done http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/request-ranges.h@708 PS3, Line 708: > nit. the fetch file operation from a remote site. Done http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/scan-range.cc File be/src/runtime/io/scan-range.cc: http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/scan-range.cc@171 PS3, Line 171: the range : // is supposed to be read in one round. > Suggest to remove as there is no guarantee. Changed to "supposed". http://gerrit.cloudera.org:8080/#/c/17979/3/be/src/runtime/io/scan-range.cc@178 PS3, Line 178: read_status > need to check the status. 
Done http://gerrit.cloudera.org:8080/#/c/17979/2/be/src/runtime/tmp-file-mgr.cc File be/src/runtime/tmp-file-mgr.cc: http://gerrit.cloudera.org:8080/#/c/17979/2/be/src/runtime/tmp-file-mgr.cc@257 PS2, Line 257: Status setup_read_buffer_status = SetUpReadBufferParams(); : if (!setup_read_buffer_status.ok()) { > If handling the rare case is simple task, I feel we should do so. Changed. If the file size is smaller than the max block size, set the block size as file size. Otherwise block size is the max block size, which is 16MB. http://gerrit.cloudera.org:8080/#/c/17979/2/be/src/runtime/tmp-file-mgr.cc@1039 PS2, Line 1039: read_buffer_block->NotifyAllWaits(); > In practice, the read buffer memory is always full during the big queries ( Have a simple test today (15x tpcds, q67, c5d.4xlarge 16u32g, 1G read buffer). 1. Disabled is set: (Time: 135s) (Data Read: 13.8GB) 2. Disabled is not set: (Time: 150s) (Data Read: 17.7GB) As expected, if the disabled is not set, performance is worse because more data is read (more duplicated read). It could be a little different for other queries, but if the read buffer is not available (full) for most of the time, which is quite likely when spilling large amount of data, disabling the file from batching read when failing to reserve space could be a better solution. I think the next optimization is to make the read buffer more available, maybe using a better eviction policy. -- To view, visit http://gerrit.cloudera.org:8080/17979 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1dcc5d0881ffaeff09c5c514306cd668373ad31b Gerrit-Change-Number: 17979 Gerrit-PatchSet: 4 Gerrit-Owner: Yida Wu Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Yida Wu Gerrit-Comment-Date: Tue, 02 Nov 2021 01:05:56 + Gerrit-HasComments: Yes
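Editorial illustration for the q67 measurements quoted in the comment above (135s / 13.8GB with the disable behaviour, 150s / 17.7GB without it): the relative differences can be worked out as below. The helper percent_change is a hypothetical name introduced only for this arithmetic, and the run with the block disabled on reservation failure is taken as the baseline.

```python
def percent_change(baseline, other):
    """Relative change of `other` versus `baseline`, in percent."""
    return (other - baseline) / baseline * 100.0


# Baseline: block is disabled when the reservation fails (135s, 13.8GB read).
# Comparison: block is not disabled (150s, 17.7GB read).
time_increase = percent_change(135.0, 150.0)
data_increase = percent_change(13.8, 17.7)

print(f"time: +{time_increase:.1f}%, data read: +{data_increase:.1f}%")
```

On these two numbers alone, not disabling the block costs roughly 11% more time and roughly 28% more data read, which matches the direction of the comment's conclusion; it is a single query on a single configuration, as the comment itself notes.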
[Impala-ASF-CR] IMPALA-10791 Add batching reading for remote temporary files
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17979 ) Change subject: IMPALA-10791 Add batching reading for remote temporary files .. Patch Set 4: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/9703/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/17979 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1dcc5d0881ffaeff09c5c514306cd668373ad31b Gerrit-Change-Number: 17979 Gerrit-PatchSet: 4 Gerrit-Owner: Yida Wu Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Yida Wu Gerrit-Comment-Date: Tue, 02 Nov 2021 01:03:17 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10791 Add batching reading for remote temporary files
Yida Wu has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/17979 ) Change subject: IMPALA-10791 Add batching reading for remote temporary files .. IMPALA-10791 Add batching reading for remote temporary files The patch adds batched reading from a remote temporary file in order to improve read performance for spilled remote data. Originally, the design used the local disk file as the buffer for batched reads from the remote file, but in practice that did not improve performance. Therefore, the design was changed to use memory as the read buffer. Currently, each TmpFileRemote has two DiskFiles: one for the remote file and one for the local buffer. The patch adds MemBlocks to the local buffer file. Each local buffer file is divided evenly into several MemBlocks, but to guarantee that a page is never split across two blocks, the block sizes can differ slightly from each other in practice. The default block size is the minimum of 1/4 of the default file size and MAX_REMOTE_READ_MEM_BLOCK_THRESHOLD_MB, which is 16MB. When pinning a page, the system checks whether there is enough memory for the block that holds the page; if not, it reads the page directly and disables this block, because that may help avoid duplicated reads of the same content from the remote fs. If the system decides to fetch a block, the block is kept in memory until all of the pages in the block have been read or the query ends. One challenge of using memory for the buffer is that the system may already be short of memory when it needs to spill data. So the read buffer is limited to 5% of total memory: the impala process currently reserves 20% of memory as unused by default, so using 5% for an emergency case like spilling is reasonable. Two startup options have been added for the new feature. 1. remote_batching_read. Default is false. If set to true, batched reading is enabled. 2. remote_read_memory_buffer_size. Default is 1G. The maximum memory that can be used by the read buffer. The value is also capped by the total system memory and cannot exceed 5% of it. The patch also increases MAX_REMOTE_TMPFILE_SIZE_THRESHOLD_MB from 256 to 512. Tests: Ran core and exhaustive tests. Added and ran TmpFileMgrTest::TestBatchingReadFromRemote. Added e2e test test_scratch_dirs_batch_reading. Change-Id: I1dcc5d0881ffaeff09c5c514306cd668373ad31b --- M be/src/runtime/io/disk-file.cc M be/src/runtime/io/disk-file.h M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/request-context.cc M be/src/runtime/io/request-ranges.h M be/src/runtime/io/scan-range.cc M be/src/runtime/tmp-file-mgr-internal.h M be/src/runtime/tmp-file-mgr-test.cc M be/src/runtime/tmp-file-mgr.cc M be/src/runtime/tmp-file-mgr.h M common/thrift/metrics.json M tests/custom_cluster/test_scratch_disk.py 12 files changed, 1,110 insertions(+), 151 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/79/17979/4 -- To view, visit http://gerrit.cloudera.org:8080/17979 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I1dcc5d0881ffaeff09c5c514306cd668373ad31b Gerrit-Change-Number: 17979 Gerrit-PatchSet: 4 Gerrit-Owner: Yida Wu Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Yida Wu
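Editorial illustration for the IMPALA-10791 commit message above: the sizing rules it states (block size is the minimum of 1/4 of the file size and 16MB; the read buffer is the configured size capped at 5% of total memory; a block is fetched only if it can be reserved) might be sketched as follows. All function names are hypothetical and not taken from the patch, and the 512MB file size and 16GB total memory used in the example are assumed inputs, not values the commit message gives as defaults.

```python
MB = 1024 * 1024
# 16MB, the value the commit message gives for MAX_REMOTE_READ_MEM_BLOCK_THRESHOLD_MB.
MAX_BLOCK_BYTES = 16 * MB


def default_block_size(default_file_bytes):
    """Minimum of 1/4 of the default file size and the 16MB threshold."""
    return min(default_file_bytes // 4, MAX_BLOCK_BYTES)


def read_buffer_limit(configured_bytes, total_memory_bytes):
    """Configured buffer size, capped at 5% of total memory."""
    return min(configured_bytes, total_memory_bytes * 5 // 100)


def try_reserve(block_bytes, used_bytes, limit_bytes):
    """Return (fetch_block, new_used_bytes).

    If the reservation fails, the caller would read the page directly and
    disable batching for this block, as the commit message describes.
    """
    if used_bytes + block_bytes <= limit_bytes:
        return True, used_bytes + block_bytes
    return False, used_bytes


# Assumed inputs: a 512MB temporary file, a 1G configured buffer, 16GB of memory.
block = default_block_size(512 * MB)
limit = read_buffer_limit(1024 * MB, 16 * 1024 * MB)
fetched, used = try_reserve(block, 0, limit)
```

With these assumed inputs the block size comes out at the 16MB threshold, and the 5% cap (about 819MB) is what limits the buffer rather than the 1G setting.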
[Impala-ASF-CR] IMPALA-10943: Add test to verify support for multiple resource and executor pools
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17891 ) Change subject: IMPALA-10943: Add test to verify support for multiple resource and executor pools .. Patch Set 3: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7584/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/17891 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If76d386d8de5730da937674ddd9a69aa1aa1355e Gerrit-Change-Number: 17891 Gerrit-PatchSet: 3 Gerrit-Owner: Bikramjeet Vig Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Andrew Sherman Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Comment-Date: Mon, 01 Nov 2021 22:56:44 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10943: Add test to verify support for multiple resource and executor pools
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/17891 ) Change subject: IMPALA-10943: Add test to verify support for multiple resource and executor pools .. Patch Set 3: unrelated flaky tests failed in last GVO, starting another one. -- To view, visit http://gerrit.cloudera.org:8080/17891 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If76d386d8de5730da937674ddd9a69aa1aa1355e Gerrit-Change-Number: 17891 Gerrit-PatchSet: 3 Gerrit-Owner: Bikramjeet Vig Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Andrew Sherman Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Comment-Date: Mon, 01 Nov 2021 22:56:36 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10943: Add test to verify support for multiple resource and executor pools
Bikramjeet Vig has removed a vote on this change. Change subject: IMPALA-10943: Add test to verify support for multiple resource and executor pools .. Removed Verified-1 by Impala Public Jenkins -- To view, visit http://gerrit.cloudera.org:8080/17891 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: deleteVote Gerrit-Change-Id: If76d386d8de5730da937674ddd9a69aa1aa1355e Gerrit-Change-Number: 17891 Gerrit-PatchSet: 3 Gerrit-Owner: Bikramjeet Vig Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Andrew Sherman Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell
[Impala-ASF-CR] IMPALA-10984: Improve TimestampValue to String casting
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/17980 ) Change subject: IMPALA-10984: Improve TimestampValue to String casting .. Patch Set 5: (16 comments) http://gerrit.cloudera.org:8080/#/c/17980/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17980/3//COMMIT_MSG@17 PS3, Line 17: adds method > nit reimplements Done http://gerrit.cloudera.org:8080/#/c/17980/3//COMMIT_MSG@22 PS3, Line 22: format. The chosen DateTimeFormatContext then is passed to > nit is passed Done http://gerrit.cloudera.org:8080/#/c/17980/3//COMMIT_MSG@32 PS3, Line 32: > nit. duplicated in before/after. Probably should be mentioned the para at l Done http://gerrit.cloudera.org:8080/#/c/17980/3//COMMIT_MSG@33 PS3, Line 33: > nit. this column can be removed? Done http://gerrit.cloudera.org:8080/#/c/17980/3//COMMIT_MSG@38 PS3, Line 38: 2.31 > nit. not aligned with the rest of the values in this column. Done http://gerrit.cloudera.org:8080/#/c/17980/3/be/src/runtime/datetime-iso-sql-format-tokenizer.h File be/src/runtime/datetime-iso-sql-format-tokenizer.h: http://gerrit.cloudera.org:8080/#/c/17980/3/be/src/runtime/datetime-iso-sql-format-tokenizer.h@111 PS3, Line 111: Iterates throug > nit. fmt_out_max_len_? Removed. This is now written directly to DateTimeFormatContext.fmt_out_len. http://gerrit.cloudera.org:8080/#/c/17980/3/be/src/runtime/datetime-simple-date-format-parser.cc File be/src/runtime/datetime-simple-date-format-parser.cc: http://gerrit.cloudera.org:8080/#/c/17980/3/be/src/runtime/datetime-simple-date-format-parser.cc@401 PS3, Line 401: } : } : return nullptr; : } : > inline? Done http://gerrit.cloudera.org:8080/#/c/17980/3/be/src/runtime/timestamp-parse-util.h File be/src/runtime/timestamp-parse-util.h: http://gerrit.cloudera.org:8080/#/c/17980/3/be/src/runtime/timestamp-parse-util.h@79 PS3, Line 79: /// max_length -- the maximum length of characters that 'dst' can hold. Only used for : /// assertion in debug build. 
> I need to update this comment in next patch set, since we're enforcing the Done http://gerrit.cloudera.org:8080/#/c/17980/2/be/src/runtime/timestamp-parse-util.cc File be/src/runtime/timestamp-parse-util.cc: http://gerrit.cloudera.org:8080/#/c/17980/2/be/src/runtime/timestamp-parse-util.cc@305 PS2, Line 305: CATOR: { > optional: Done. Benchmarked it with expression "cast(now() as string format 'Y .S')". Compared patch set 3 vs 4, the (10%ile, 50%ile, 90%ile) increased from (19.9, 20.1, 20.3) to (61.1, 61.3, 61.6). 3X increase. http://gerrit.cloudera.org:8080/#/c/17980/2/be/src/runtime/timestamp-parse-util.cc@351 PS2, Line 351: DCHECK(!d.is_special()); > After changing AppendToBuffer() now we can't this dcheck, even if we want t Done http://gerrit.cloudera.org:8080/#/c/17980/3/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: http://gerrit.cloudera.org:8080/#/c/17980/3/be/src/runtime/timestamp-value.cc@83 PS3, Line 83: st.clear(); > UNLIKELY? Done http://gerrit.cloudera.org:8080/#/c/17980/3/be/src/runtime/timestamp-value.cc@213 PS3, Line 213: } > Can we update any remaining callers to the new variant and eliminate this f Unfortunately, there are couple call sites to this function. Especially the output stream operator of TimestampValue. In patch set 4, I change the signature, asking the caller to supply a string output argument. Add a comment as well in the header file warning caller to reuse the output string. http://gerrit.cloudera.org:8080/#/c/17980/3/be/src/runtime/timestamp-value.cc@222 PS3, Line 222: > So the space is bounded by the row batch size * max_length? Approx how much Yes, I think batch size * max_length is the approximation. There is also 8 bytes alignment and power2 round up thing in mem-pool code I have not fully understand yet. 
http://gerrit.cloudera.org:8080/#/c/17980/3/be/src/runtime/timestamp-value.cc@224 PS3, Line 224: int64_t t_in_nano_sec = t.total_nanoseconds(); > Should use C++ cast Done http://gerrit.cloudera.org:8080/#/c/17980/3/be/src/runtime/timestamp-value.cc@225 PS3, Line 225: > unlikely? Done http://gerrit.cloudera.org:8080/#/c/17980/3/be/src/runtime/timestamp-value.cc@230 PS3, Line 230: int64_t days = total_in_nano_sec / NANOS_PER_DAY; : int64_t nano_secs_remaining = total_in_nano_sec % NANOS_PER_DAY; : return TimestampValue(date_ + boost::gregorian::date_duration(days), : boost::posix_time::time_duration(0, 0, 0, nano_secs_remaining)); : > nit This method probably can be inlined. Done. Moved to timestamp-value.inline.h as well. -- To view, visit http://gerrit.cloudera.org:8080/17980 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: commen
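Editorial illustration for the lines quoted at timestamp-value.cc@230 above: the split of a nanosecond total into whole days and leftover nanoseconds can be reproduced as below. This is a sketch of those two quoted lines only. The function name split_nanos is hypothetical, and the explicit sign handling is there because C++ integer division truncates toward zero while Python's // rounds down; how the surrounding Impala code treats negative totals is not shown in the quoted snippet.

```python
NANOS_PER_DAY = 24 * 60 * 60 * 1_000_000_000


def split_nanos(total_in_nano_sec):
    """Split a nanosecond count into (days, nano_secs_remaining).

    Mirrors C++ `/` and `%` on int64_t, which truncate toward zero, so a
    negative total yields a negative (or zero) remainder.
    """
    days = abs(total_in_nano_sec) // NANOS_PER_DAY
    if total_in_nano_sec < 0:
        days = -days
    nano_secs_remaining = total_in_nano_sec - days * NANOS_PER_DAY
    return days, nano_secs_remaining


one_day_and_a_bit = split_nanos(NANOS_PER_DAY + 5)
negative_case = split_nanos(-(NANOS_PER_DAY + 5))
```

The two parts always recombine to the input (days * NANOS_PER_DAY + nano_secs_remaining), which is what lets the quoted code add the day count to the date and carry the remainder as the time of day.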
[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 ) Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table. .. Patch Set 19: (1 comment) http://gerrit.cloudera.org:8080/#/c/17860/12//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17860/12//COMMIT_MSG@24 PS12, Line 24: TPCH scale 42 > I think it would be good to execute the whole benchmark with bin/single_nod

Hi Zoltan, Sorry for the delay with the benchmark. I ran the entire TPC-H benchmark at scale 42. This is the summary of the report (Delta is the change).

Report Generated on 2021-10-28
Run Description: "78ce235db6d5b720f3e3319ff571a2da054a2602 vs c46d765dccd5739c848d8c1c82043e72394b8397"
Cluster Name: UNKNOWN
Lab Run Info: UNKNOWN
Impala Version: impalad version 4.1.0-SNAPSHOT RELEASE (2021-10-28)
Baseline Impala Version: impalad version 4.1.0-SNAPSHOT RELEASE (2021-10-27)

+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(42) | parquet / none / none | 12.83   | -1.54%     | 8.26       | -1.48%         |
+----------+-----------------------+---------+------------+------------+----------------+

Very slight improvement overall and major improvements in these 2 queries:

(I) Improvement: TPCH(42) TPCH-Q6 [parquet / none / none] (1.85s -> 1.72s [-7.30%])
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+-------+-----------+
| Operator     | % of Query | Avg   | Base Avg | Delta(Avg) | StdDev(%) | Max   | Base Max | Delta(Max) | #Hosts | #Inst | #Rows | Est #Rows |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+-------+-----------+
| 00:SCAN HDFS | 94.83%     | 1.50s | 1.62s    | -7.75%     | 2.07%     | 1.56s | 1.73s    | -9.58%     | 1      | 1     | 4.79M | 29.96M    |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+-------+-----------+

(I) Improvement: TPCH(42) TPCH-Q19 [parquet / none / none] (4.73s -> 4.18s [-11.72%])
+--------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg      | Base Avg | Delta(Avg) | StdDev(%) | Max      | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+-------+--------+-----------+
| 01:SCAN HDFS | 22.68%     | 729.91ms | 736.69ms | -0.92%     | 1.61%     | 751.55ms | 747.34ms | +0.56%     | 1      | 1     | 20.33K | 1.50M     |
| 00:SCAN HDFS | 74.84%     | 2.41s    | 2.97s    | -18.98%    | 0.67%     | 2.44s    | 3.00s    | -18.70%    | 1      | 1     | 13.07K | 29.96M    |
+--------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+-------+--------+-----------+

No regressions were reported, just these 2 improvements and a couple of queries with high runtime variability (not related to our change).

-- To view, visit http://gerrit.cloudera.org:8080/17860 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60 Gerrit-Change-Number: 17860 Gerrit-PatchSet: 19 Gerrit-Owner: Amogh Margoor Gerrit-Reviewer: Amogh Margoor Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Kurt Deschler Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 01 Nov 2021 17:51:22 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10926: Improve catalogd consistency and self events detection
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17859 ) Change subject: IMPALA-10926: Improve catalogd consistency and self events detection .. Patch Set 26: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7582/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/17859 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36364e401911352c474eb98c8d61bbaae9b9 Gerrit-Change-Number: 17859 Gerrit-PatchSet: 26 Gerrit-Owner: Sourabh Goyal Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sourabh Goyal Gerrit-Reviewer: Vihang Karajgaonkar Gerrit-Reviewer: Yu-Wen Lai Gerrit-Comment-Date: Mon, 01 Nov 2021 17:50:18 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10923: Fine grained table refreshing at partition level events for transactional tables
Sourabh Goyal has posted comments on this change. ( http://gerrit.cloudera.org:8080/17858 ) Change subject: IMPALA-10923: Fine grained table refreshing at partition level events for transactional tables .. Patch Set 11: (2 comments) http://gerrit.cloudera.org:8080/#/c/17858/11/be/src/catalog/catalog-server.cc File be/src/catalog/catalog-server.cc: http://gerrit.cloudera.org:8080/#/c/17858/11/be/src/catalog/catalog-server.cc@117 PS11, Line 117: "catalog server will refresh transactional tables incrementally for partition level " nit: instead of "catalog server", we should say "event processor". http://gerrit.cloudera.org:8080/#/c/17858/11/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java: http://gerrit.cloudera.org:8080/#/c/17858/11/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@4353 PS11, Line 4353: // set write id as committed before reload the partitions so that we can get How does this help in getting up-to-date file metadata? Also, what happens if hdfsTable.reloadPartitionsFromEvent() throws an exception? What would happen to the newly committed writeIds in the table? Would these writeIds still give correct information about the partition metadata? -- To view, visit http://gerrit.cloudera.org:8080/17858 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6ba07c9a338a25614690e314335ee4b801486da9 Gerrit-Change-Number: 17858 Gerrit-PatchSet: 11 Gerrit-Owner: Yu-Wen Lai Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Fucun Chu Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sourabh Goyal Gerrit-Reviewer: Vihang Karajgaonkar Gerrit-Reviewer: Yu-Wen Lai Gerrit-Comment-Date: Mon, 01 Nov 2021 17:49:34 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10984: Improve TimestampValue to String casting
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17980 ) Change subject: IMPALA-10984: Improve TimestampValue to String casting .. Patch Set 4: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/9702/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/17980 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4fcb4545d9c9a3fdb38c4db58bb4b1321a429d61 Gerrit-Change-Number: 17980 Gerrit-PatchSet: 4 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Kurt Deschler Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Riza Suminto Gerrit-Comment-Date: Mon, 01 Nov 2021 17:14:43 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10984: Improve TimestampValue to String casting
Kurt Deschler has posted comments on this change. ( http://gerrit.cloudera.org:8080/17980 ) Change subject: IMPALA-10984: Improve TimestampValue to String casting .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/17980/3/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: http://gerrit.cloudera.org:8080/#/c/17980/3/be/src/runtime/timestamp-value.cc@222 PS3, Line 222: StringVal result(ctx, max_length); > The allocation comes from exps_results_pool_. So the space is bounded by the row batch size * max_length? Approx how much?? -- To view, visit http://gerrit.cloudera.org:8080/17980 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4fcb4545d9c9a3fdb38c4db58bb4b1321a429d61 Gerrit-Change-Number: 17980 Gerrit-PatchSet: 3 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Kurt Deschler Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Riza Suminto Gerrit-Comment-Date: Mon, 01 Nov 2021 17:07:21 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10984: Improve TimestampValue to String casting
Hello Qifan Chen, Kurt Deschler, Csaba Ringhofer, Bikramjeet Vig, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/17980 to look at the new patch set (#5). Change subject: IMPALA-10984: Improve TimestampValue to String casting .. IMPALA-10984: Improve TimestampValue to String casting TimestampValue::ToString was implemented by concatenating boost::gregorian::to_iso_extended_string and boost::posix_time::to_simple_string using stringstream. This involves multiple string allocations, copying, and might hit lock within tcmalloc::CentralFreeList. FROM_UNIXTIME and CAST expression that touches this function can be inefficient if the expression is being evaluated for millions of rows. This patch adds method TimestampValue::ToStringVal and reimplements TimestampValue::ToString by supplying default DateTimeFormatContext if no pattern was specified. "-MM-dd HH:mm:ss" will be picked as the default format if the time_ component does not have fractional seconds. Otherwise, "-MM-dd HH:mm:ss.S" will be picked as the default format. The chosen DateTimeFormatContext then is passed to TimestampParser::Format along with date_ and time_ to be formatted into the string representation. Int to string parsing method is replaced with FastInt32ToBufferLeft in TimestampParser::Format. We ran a set of expression benchmarks in a machine with Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz. This patch gives > 10X performance improvement for CAST timestamp to string and FROM_UNIXTIME without a date-time pattern. Following are the detailed results before and after the patch. 
Before the patch:

FromUnixCodegen:
Function                            10%ile  50%ile  90%ile  10%ile     50%ile     90%ile
                                                            (relative) (relative) (relative)
---
literal                             36.7    37      37.3    1X         1X         1X
cast(now() as string)               2.31    2.31    2.33    0.0628X    0.0623X    0.0626X
cast(now() as string format 'Y .S') 16.9    17.5    17.5    0.459X     0.472X     0.471X
from_unixtime(0,'-MM-dd HH:mm:ss')  6.3     6.3     6.37    0.171X     0.17X      0.171X
from_unixtime(0,'-MM-dd')           11.8    11.8    12      0.32X      0.32X      0.322X
from_unixtime(0)                    2.36    2.4     2.4     0.0644X    0.0648X    0.0644X

After the patch:

FromUnixCodegen:
Function                            10%ile  50%ile  90%ile  10%ile     50%ile     90%ile
                                                            (relative) (relative) (relative)
---
literal                             37.7    38.1    38.4    1X         1X         1X
cast(now() as string)               29.9    30.1    30.2    0.794X     0.79X      0.787X
cast(now() as string format 'Y .S') 61.1    61.3    61.6    1.62X      1.61X      1.61X
from_unixtime(0,'-MM-dd HH:mm:ss')  33.6    33.8    34.2    0.892X     0.887X     0.892X
from_unixtime(0,'-MM-dd')           50.5    50.6    50.9    1.34X      1.33X      1.33X
from_unixtime(0)                    34      34.2    34.5    0.902X     0.896X     0.898X

The literal expression used as the baseline in this benchmark is "cast('2012-01-01 09:10:11.123456789' as timestamp)". This patch also updates the numbers in expr-benchmark for BenchmarkTimestampFunctions and tidies up expr-benchmark a bit to clear its MemPool between benchmark iterations so that it does not run out of memory. Testing: - Pass core tests.
Change-Id: I4fcb4545d9c9a3fdb38c4db58bb4b1321a429d61 --- M be/src/benchmarks/expr-benchmark.cc M be/src/exec/kudu-util-ir.cc M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc M be/src/runtime/date-parse-util.cc M be/src/runtime/datetime-iso-sql-format-tokenizer.cc M be/src/runtime/datetime-iso-sql-format-tokenizer.h M be/src/runtime/datetime-parser-common.cc M be/src/runtime/datetime-parser-common.h M be/src/runtime/datetime-simple-date-format-parser.cc M be/src/runtime/datetime-simple-date-format-parser.h M be/src/runtime/timestamp-parse-util.cc M be/src/runtime/timestamp-parse-util.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/client-request-state.cc M be/src/util/min-max-filter.cc 22 files changed, 316 insertions(+), 213 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17
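The default-format choice described in the commit message above (one pattern when the time has no fractional seconds, another when it does) can be sketched as follows. This is an illustrative Python sketch, not the C++ implementation: `to_default_string` is a hypothetical name, the nanosecond part is passed separately because Python's `datetime` only stores microseconds, and the nine-digit fraction width is an assumption.

```python
from datetime import datetime


def to_default_string(ts, nanos=0):
    """Format ts with a seconds-only pattern, adding a fraction only when nanos != 0."""
    # Fixed-width integer formatting, written straight into one string
    # instead of concatenating separately formatted date and time parts.
    base = (f"{ts.year:04d}-{ts.month:02d}-{ts.day:02d} "
            f"{ts.hour:02d}:{ts.minute:02d}:{ts.second:02d}")
    if nanos == 0:
        return base  # no fractional seconds: pick the shorter default format
    return f"{base}.{nanos:09d}"  # assumed nine-digit nanosecond fraction


print(to_default_string(datetime(2012, 1, 1, 9, 10, 11)))
print(to_default_string(datetime(2012, 1, 1, 9, 10, 11), nanos=123456789))
```

The point of the sketch is only the branch on the fractional part; the performance gain reported in the commit message comes from the C++ changes (avoiding stringstream and extra allocations), which Python cannot reproduce.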
[Impala-ASF-CR] IMPALA-10984: Improve TimestampValue to String casting
Hello Qifan Chen, Kurt Deschler, Csaba Ringhofer, Bikramjeet Vig, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/17980 to look at the new patch set (#4). Change subject: IMPALA-10984: Improve TimestampValue to String casting .. IMPALA-10984: Improve TimestampValue to String casting TimestampValue::ToString was implemented by concatenating boost::gregorian::to_iso_extended_string and boost::posix_time::to_simple_string using stringstream. This involves multiple string allocations, copying, and might hit lock within tcmalloc::CentralFreeList. FROM_UNIXTIME and CAST expression that touches this function can be inefficient if the expression is being evaluated for millions of rows. This patch reimplement TimestampValue::ToString by supplying default DateTimeFormatContext if no pattern was specified. "-MM-dd HH:mm:ss" will be picked as the default format if the time_ component does not have fractional seconds. Otherwise, "-MM-dd HH:mm:ss.S" will be picked as the default format. The chosen DateTimeFormatContext then passed to TimestampParser::Format along with date_ and time_ to be formatted into the string representation. Int to string parsing method is replaced with FastInt32ToBufferLeft in TimestampParser::Format. We ran a set of expression benchmarks in a machine with Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz. This patch gives > 10X performance improvement for CAST timestamp to string and FROM_UNIXTIME without a date-time pattern. Following are the detailed results before and after the patch. 
Before the patch:

FromUnixCodegen:
Function                            10%ile  50%ile  90%ile  10%ile     50%ile     90%ile
                                                            (relative) (relative) (relative)
---
literal                             36.7    37      37.3    1X         1X         1X
cast(now() as string)               2.31    2.31    2.33    0.0628X    0.0623X    0.0626X
cast(now() as string format 'Y .S') 16.9    17.5    17.5    0.459X     0.472X     0.471X
from_unixtime(0,'-MM-dd HH:mm:ss')  6.3     6.3     6.37    0.171X     0.17X      0.171X
from_unixtime(0,'-MM-dd')           11.8    11.8    12      0.32X      0.32X      0.322X
from_unixtime(0)                    2.36    2.4     2.4     0.0644X    0.0648X    0.0644X

After the patch:

FromUnixCodegen:
Function                            10%ile  50%ile  90%ile  10%ile     50%ile     90%ile
                                                            (relative) (relative) (relative)
---
literal                             37.7    38.1    38.4    1X         1X         1X
cast(now() as string)               29.9    30.1    30.2    0.794X     0.79X      0.787X
cast(now() as string format 'Y .S') 61.1    61.3    61.6    1.62X      1.61X      1.61X
from_unixtime(0,'-MM-dd HH:mm:ss')  33.6    33.8    34.2    0.892X     0.887X     0.892X
from_unixtime(0,'-MM-dd')           50.5    50.6    50.9    1.34X      1.33X      1.33X
from_unixtime(0)                    34      34.2    34.5    0.902X     0.896X     0.898X

The literal expression used as the baseline in this benchmark is "cast('2012-01-01 09:10:11.123456789' as timestamp)". This patch also updates the numbers in expr-benchmark for BenchmarkTimestampFunctions and tidies up expr-benchmark a bit to clear its MemPool between benchmark iterations so that it does not run out of memory. Testing: - Pass core tests.
Change-Id: I4fcb4545d9c9a3fdb38c4db58bb4b1321a429d61 --- M be/src/benchmarks/expr-benchmark.cc M be/src/exec/kudu-util-ir.cc M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc M be/src/runtime/date-parse-util.cc M be/src/runtime/datetime-iso-sql-format-tokenizer.cc M be/src/runtime/datetime-iso-sql-format-tokenizer.h M be/src/runtime/datetime-parser-common.cc M be/src/runtime/datetime-parser-common.h M be/src/runtime/datetime-simple-date-format-parser.cc M be/src/runtime/datetime-simple-date-format-parser.h M be/src/runtime/timestamp-parse-util.cc M be/src/runtime/timestamp-parse-util.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/client-request-state.cc M be/src/util/min-max-filter.cc 22 files changed, 316 insertions(+), 213 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/17980/4 -- To view, visit http://gerrit.cloudera.o
[Impala-ASF-CR] IMPALA-10997: Refactor Java Hive UDF code.
Steve Carlin has posted comments on this change. ( http://gerrit.cloudera.org:8080/17986 ) Change subject: IMPALA-10997: Refactor Java Hive UDF code. .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/17986/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17986/1//COMMIT_MSG@19 PS1, Line 19: HiveUdfExecutor: Abstract base class that contains code that is common to : the legacy UDF.class and the GenericUDF.class when it is eventually created. : HiveUdfExecutorLegacy: Implementation of the code that is UDF.class specific. > nit: each line should have 72 or fewer characters if possible. Done http://gerrit.cloudera.org:8080/#/c/17986/1/fe/src/main/java/org/apache/impala/hive/executor/UdfExecutor.java File fe/src/main/java/org/apache/impala/hive/executor/UdfExecutor.java: http://gerrit.cloudera.org:8080/#/c/17986/1/fe/src/main/java/org/apache/impala/hive/executor/UdfExecutor.java@105 PS1, Line 105: classLoaderClosed_ = true; > Why not use classLoader_ = null? Done -- To view, visit http://gerrit.cloudera.org:8080/17986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic1b981aed3021aef08c87e7cdbf7c6af95906754 Gerrit-Change-Number: 17986 Gerrit-PatchSet: 1 Gerrit-Owner: Steve Carlin Gerrit-Reviewer: Fucun Chu Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Steve Carlin Gerrit-Comment-Date: Mon, 01 Nov 2021 13:39:36 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10994: Normalize the pip package name part of download URL.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17987 ) Change subject: IMPALA-10994: Normalize the pip package name part of download URL. .. Patch Set 5: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/9701/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/17987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I479df0ad7acf3c650b8f5317372261d5e2840864 Gerrit-Change-Number: 17987 Gerrit-PatchSet: 5 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Fucun Chu Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Mon, 01 Nov 2021 12:47:41 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10994: Normalize the pip package name part of download URL.
yx91...@126.com has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/17987 ) Change subject: IMPALA-10994: Normalize the pip package name part of download URL. .. IMPALA-10994: Normalize the pip package name part of download URL. According to PEP-0503, pip repo servers don't support unnormalized URL access. Some package names in 'infra/python/deps/*requirements.txt' are unnormalized, e.g. 'Cython', and pip_download.py concatenates $PYPI_MIRROR and the package name directly to build the download URL, which may therefore be unnormalized. Fix this by normalizing the package name in the download URL using the method recommended in PEP-0503. Change-Id: I479df0ad7acf3c650b8f5317372261d5e2840864 --- M infra/python/deps/pip_download.py 1 file changed, 2 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/17987/5 -- To view, visit http://gerrit.cloudera.org:8080/17987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I479df0ad7acf3c650b8f5317372261d5e2840864 Gerrit-Change-Number: 17987 Gerrit-PatchSet: 5 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Fucun Chu Gerrit-Reviewer: Impala Public Jenkins
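For reference, the name normalization that PEP 503 recommends is a single regular-expression substitution followed by lowercasing. The sketch below shows it alongside a hypothetical `download_url` helper; the helper name, the `/simple/` path layout, and the example mirror URL are assumptions for illustration and are not taken from pip_download.py.

```python
import re


def normalize(name):
    """PEP 503 normalization: collapse runs of '-', '_', '.' into '-' and lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


def download_url(mirror, name):
    # Hypothetical helper: the "/simple/<name>/" layout is an assumption
    # about the mirror, not something stated in the commit message.
    return f"{mirror.rstrip('/')}/simple/{normalize(name)}/"


print(normalize("Cython"))
print(download_url("https://pypi.example.org", "Cython"))
```

With this in place, a requirements entry such as 'Cython' and its lowercase form map to the same URL path.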
[Impala-ASF-CR] Impala-10994: Normalize pip package name
Fucun Chu has posted comments on this change. ( http://gerrit.cloudera.org:8080/17987 ) Change subject: Impala-10994: Normalize pip package name .. Patch Set 4: (2 comments) http://gerrit.cloudera.org:8080/#/c/17987/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17987/4//COMMIT_MSG@7 PS4, Line 7: Impala-10994 The ticket address must be uppercase, IMPALA-10994. http://gerrit.cloudera.org:8080/#/c/17987/4//COMMIT_MSG@8 PS4, Line 8: Please add a message that is exactly long enough to explain what the problem was, and how it was fixed. Each should have 72 or fewer characters if possible. see: https://cwiki.apache.org/confluence/display/IMPALA/Contributing+to+Impala -- To view, visit http://gerrit.cloudera.org:8080/17987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I479df0ad7acf3c650b8f5317372261d5e2840864 Gerrit-Change-Number: 17987 Gerrit-PatchSet: 4 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Fucun Chu Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Mon, 01 Nov 2021 11:13:46 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10926: Improve catalogd consistency and self events detection
Hello Vihang Karajgaonkar, kis...@cloudera.com, Yu-Wen Lai, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/17859 to look at the new patch set (#26). Change subject: IMPALA-10926: Improve catalogd consistency and self events detection ..

IMPALA-10926: Improve catalogd consistency and self events detection

In the current design, the catalogd cache gets updated from 2 sources:
1. Impala shell
2. MetastoreEventProcessor

The updates from the Impala shell are applied in place, whereas MetastoreEventProcessor runs as a background thread, polls HMS events and applies them asynchronously. These two streams of updates cause consistency issues. For example, consider the following sequence of alter table events on a table t1 as per HMS:
1. alter table t1 from source s1, say another Impala cluster
2. alter table t1 from source s2, say another Hive cluster
3. alter table t1 from the local Impala cluster

The #3 alter table ddl operation would get reflected in the local cache immediately. However, later on the event processor would process the events from #1 and #2 above and try to alter the table. In an ideal scenario, these alters should have been applied before #3, i.e. in the same order as they appear in the HMS notification log. This leaves table t1 in an inconsistent state.

Proposed solution: The main idea of the solution is to keep track, in the Table object, of the last event id that the catalogd has synced to for a given table, as eventId. The events processor ignores any event whose EVENT_ID is less than or equal to the eventId stored in the table. Once the events processor successfully processes a given event, it updates the value of eventId in the table before releasing the table lock. Also, any DDL or refresh operation on the catalogd (from both the catalog HMS endpoint and the Impala shell) will follow these steps to update the event id for the table:
1. Acquire write lock on the table
2. Perform ddl operation in HMS
3. Sync table till the latest event id (as per HMS) since its last synced event id

The above steps ensure that any concurrent updates applied on the same db/table from multiple sources like Hive, Impala or, say, multiple Impala clusters get reflected in the local catalogd cache (in the same order as they appear in HMS), thus removing any inconsistencies. The solution also relies on the existing locking mechanism in the catalogd to prevent any other concurrent updates to the table (even via EventsProcessor). In case of database objects, we will also have a similar eventId which represents the events on the database object (CREATE, DROP, ALTER database) to which the catalogd has synced.

This patch addresses the following:
- Add a new flag enable_sync_to_latest_event_on_ddls to enable/disable this improvement. It is turned off by default.
- Sync db/table to latest event id for ddls from catalog HMS endpoints. A subsequent patch would address the same for DDLs executed from Impala shell
- Event processor skips processing an event if db/table is already synced till that event id and sets that event id in db/table in case the event is processed
- When EventProcessor detects a self event, it sets the last synced event id in db/table before skipping an event
- Full table refresh sets the last event processed in table cache

Testing:
1. Added new unit tests and modified existing ones
2. Ran exhaustive tests with the flag both turned on and off

Change-Id: I36364e401911352c474eb98c8d61bbaae9b9
---
M be/src/catalog/catalog-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java
M fe/src/main/java/org/apache/impala/catalog/Db.java
M fe/src/main/java/org/apache/impala/catalog/Table.java
M fe/src/main/java/org/apache/impala/catalog/TableLoader.java
M fe/src/main/java/org/apache/impala/catalog/events/EventFactory.java
M fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java
M fe/src/main/java/org/apache/impala/catalog/events/MetastoreEventsProcessor.java
M fe/src/main/java/org/apache/impala/catalog/events/NoOpEventProcessor.java
M fe/src/main/java/org/apache/impala/catalog/metastore/CatalogMetastoreServiceHandler.java
M fe/src/main/java/org/apache/impala/catalog/metastore/HmsApiNameEnum.java
M fe/src/main/java/org/apache/impala/catalog/metastore/MetastoreServiceHandler.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/JniCatalog.java
M fe/src/test/java/org/apache/impala/catalog/AlterDatabaseTest.java
A fe/src/test/java/org/apache/impala/catalog/MetastoreApiTestUtils.java
M fe/src/test/java/org/apache/impala/catalog/events/EventsProcessorStressTest.java
M fe/src/test/java/org/apache/impala/catalog/events/MetastoreEventsProcessorTest.java
M fe/src/test
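The event-id bookkeeping proposed in the IMPALA-10926 change description can be sketched as follows. This is a minimal Python illustration, not the catalogd implementation (which is Java): `Table`, `process_event`, `run_ddl`, and the in-memory `hms_log` list are all hypothetical stand-ins, and a single `threading.Lock` stands in for the table write lock.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    last_synced_event_id: int = -1           # the "eventId" kept on the table
    applied: list = field(default_factory=list)
    lock: threading.Lock = field(default_factory=threading.Lock, repr=False)


def process_event(table, event_id, change):
    """Events-processor path: skip anything the table has already synced to."""
    with table.lock:
        if event_id <= table.last_synced_event_id:
            return False                      # already reflected in the cache
        table.applied.append(change)
        table.last_synced_event_id = event_id  # updated before the lock is released
        return True


def run_ddl(table, hms_log, change):
    """DDL path: lock the table, apply the DDL in 'HMS', then sync to the latest event."""
    with table.lock:                          # 1. acquire the write lock
        new_id = hms_log[-1][0] + 1 if hms_log else 1
        hms_log.append((new_id, change))      # 2. perform the DDL in the stand-in HMS log
        for event_id, ch in hms_log:          # 3. replay events after the last synced id
            if event_id > table.last_synced_event_id:
                table.applied.append(ch)
                table.last_synced_event_id = event_id


# Two remote alters are already in the log when the local DDL arrives.
hms_log = [(1, "alter from s1"), (2, "alter from s2")]
t1 = Table("t1")
run_ddl(t1, hms_log, "alter from local")
print(t1.applied, t1.last_synced_event_id)
print(process_event(t1, 1, "alter from s1"))
```

In this sketch the local DDL forces the two earlier remote alters to be applied first, so the cache ends up in log order, and the background processor later skips events 1 and 2 because their ids are not greater than the stored id.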
[Impala-ASF-CR] IMPALA-10997: Refactor Java Hive UDF code.
Fucun Chu has posted comments on this change. ( http://gerrit.cloudera.org:8080/17986 ) Change subject: IMPALA-10997: Refactor Java Hive UDF code. .. Patch Set 1: (2 comments) This looks good, I only had some minor comments. http://gerrit.cloudera.org:8080/#/c/17986/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17986/1//COMMIT_MSG@19 PS1, Line 19: HiveUdfExecutor: Abstract base class that contains code that is common to : the legacy UDF.class and the GenericUDF.class when it is eventually created. : HiveUdfExecutorLegacy: Implementation of the code that is UDF.class specific. nit: each line should have 72 or fewer characters if possible. http://gerrit.cloudera.org:8080/#/c/17986/1/fe/src/main/java/org/apache/impala/hive/executor/UdfExecutor.java File fe/src/main/java/org/apache/impala/hive/executor/UdfExecutor.java: http://gerrit.cloudera.org:8080/#/c/17986/1/fe/src/main/java/org/apache/impala/hive/executor/UdfExecutor.java@105 PS1, Line 105: classLoaderClosed_ = true; Why not use classLoader_ = null? -- To view, visit http://gerrit.cloudera.org:8080/17986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic1b981aed3021aef08c87e7cdbf7c6af95906754 Gerrit-Change-Number: 17986 Gerrit-PatchSet: 1 Gerrit-Owner: Steve Carlin Gerrit-Reviewer: Fucun Chu Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Mon, 01 Nov 2021 08:21:15 + Gerrit-HasComments: Yes