[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

Patch Set 26:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/19699/26//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/26//COMMIT_MSG@7
PS26, Line 7: Prototype a simple JSON File reader
Let's change the title to something like "Initial support for reading JSON files"

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@37
PS22, Line 37:
> > It seems to be tightly coupled with hdfs-json-scanner so I think we
I took a look. The failed job is https://jenkins.impala.io/job/clang-tidy-ub2004/99/console
It fails when running bin/run_clang_tidy.sh. There are some errors in the output file: https://jenkins.impala.io/job/clang-tidy-ub2004/99/artifact/tidylog.txt
E.g. one of the errors:
/home/ubuntu/Impala/be/src/exec/json/json-parser.cc:233:16: error: explicit instantiation of 'impala::JsonParser' must occur in namespace 'impala'
template class JsonParser;
               ^

--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 26
Gerrit-Owner: Zihao Ye
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Quanlong Huang
Gerrit-Reviewer: Zihao Ye
Gerrit-Comment-Date: Mon, 21 Aug 2023 01:07:03 +
Gerrit-HasComments: Yes
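For readers unfamiliar with the diagnostic quoted above: one common way to satisfy it is to place the explicit instantiation inside the namespace in which the template is declared. The following is a minimal sketch of that layout, not the code under review. The names `demo`, `MiniParser`, `CountingHandler`, and `CountValues` are hypothetical, and the header and .cc contents are shown in a single file for brevity.

```cpp
#include <cassert>
#include <string>

namespace demo {

// Hypothetical handler type; it only counts the values it is given.
struct CountingHandler {
  int values = 0;
  void OnValue(const std::string&) { ++values; }
};

// Declaration, as it would appear in a header file.
template <class Handler>
class MiniParser {
 public:
  explicit MiniParser(Handler* handler) : handler_(handler) {}
  // Defined out of line below, as it would be in a .cc file.
  void Parse(const std::string& input);

 private:
  Handler* handler_;
};

// Out-of-line definition: splits the input on commas and reports each piece.
template <class Handler>
void MiniParser<Handler>::Parse(const std::string& input) {
  std::string::size_type start = 0;
  while (start <= input.size()) {
    std::string::size_type end = input.find(',', start);
    if (end == std::string::npos) end = input.size();
    handler_->OnValue(input.substr(start, end - start));
    start = end + 1;
  }
}

// Explicit instantiation, placed inside the template's own namespace.
// Moving this line after the closing brace of 'demo' (and leaving the name
// unqualified) is the kind of placement the quoted error complains about.
template class MiniParser<CountingHandler>;

}  // namespace demo

// Small helper so the sketch can be exercised.
int CountValues(const std::string& input) {
  demo::CountingHandler handler;
  demo::MiniParser<demo::CountingHandler> parser(&handler);
  parser.Parse(input);
  return handler.values;
}
```

With the definition and the explicit instantiation both in the .cc file, other translation units only need the declaration from the header, which is what makes the header/implementation split discussed in this thread possible for a class template.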
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Zihao Ye has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

Patch Set 22:

(1 comment)

Thank you once again for the code review. I have been a little busy lately, but I managed to find some time to move json-parser.h and split the implementation out into json-parser.cc. I will find time later to finish the remaining task of adding new test cases.

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@37
PS22, Line 37:
> It seems to be tightly coupled with hdfs-json-scanner so I think we
> can put it in /json unless it can be reused in other places.
>
> Moving the implementation codes to json-parser.cc helps to speedup
> recompilation when you have code changes. Also helps to make this
> header file shorter and easier for going through. You can keep some
> short methods in the header file and just move large methods like
> Parse().
Done. It compiles successfully in my own environment, but I'm not sure why it keeps failing to build here.

--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 22
Gerrit-Owner: Zihao Ye
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Quanlong Huang
Gerrit-Reviewer: Zihao Ye
Gerrit-Comment-Date: Fri, 18 Aug 2023 09:51:08 +
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 26: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/13774/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 26 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Fri, 18 Aug 2023 09:40:47 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#26).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files.

The scanning of JSON data is mainly completed by two parts working together. The first part is the JsonParser responsible for parsing the JSON object, which is implemented based on the SAX-style API of rapidjson. It reads data from the char stream, parses it, and calls the corresponding callback function when encountering the corresponding JSON element. See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner and provides callback functions for the JsonParser. The callback functions are responsible for providing data buffers to the Parser and converting and materializing the Parser's parsing results into RowBatch.

It should be noted that the parser returns numeric values as strings to the scanner. The scanner uses the TextConverter class to convert the strings to the desired types, similar to how the HdfsTextScanner works. This is an advantage compared to using number value provided by rapidjson directly, as it eliminates concerns about inconsistencies in converting decimals (e.g. losing precision).

Limitations
 - Multiline json objects are not fully supported yet. It is ok when each file has only one scan range. However, when a file has multiple scan ranges, there is a small probability of incomplete scanning of multiline JSON objects that span ScanRange boundaries (in such cases, parsing errors may be reported). For more details, please refer to the comments in the 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py for testing multiline, malformed, and overflow in JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
A be/src/exec/json/json-parser-test.cc
A be/src/exec/json/json-parser.cc
A be/src/exec/json/json-parser.h
M be/src/exec/text-converter.inline.h
M bin/rat_exclude_files.txt
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
A testdata/data/json_test/complex.json
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
A testdata/data/json_test/overflow.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/QueryTest/complex_json.test
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
A testdata/workloads/functional-query/queries/QueryTest/overflow_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
42 files changed, 1,498 insertions(+), 44 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/26
--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 26
Gerrit-Owner: Zihao Ye
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Quanlong Huang
Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 24: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/13773/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 24 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Fri, 18 Aug 2023 08:39:56 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#24).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files.

The scanning of JSON data is mainly completed by two parts working together. The first part is the JsonParser responsible for parsing the JSON object, which is implemented based on the SAX-style API of rapidjson. It reads data from the char stream, parses it, and calls the corresponding callback function when encountering the corresponding JSON element. See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner and provides callback functions for the JsonParser. The callback functions are responsible for providing data buffers to the Parser and converting and materializing the Parser's parsing results into RowBatch.

It should be noted that the parser returns numeric values as strings to the scanner. The scanner uses the TextConverter class to convert the strings to the desired types, similar to how the HdfsTextScanner works. This is an advantage compared to using number value provided by rapidjson directly, as it eliminates concerns about inconsistencies in converting decimals (e.g. losing precision).

Limitations
 - Multiline json objects are not fully supported yet. It is ok when each file has only one scan range. However, when a file has multiple scan ranges, there is a small probability of incomplete scanning of multiline JSON objects that span ScanRange boundaries (in such cases, parsing errors may be reported). For more details, please refer to the comments in the 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py for testing multiline, malformed, and overflow in JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
A be/src/exec/json/json-parser-test.cc
A be/src/exec/json/json-parser.cc
A be/src/exec/json/json-parser.h
M be/src/exec/text-converter.inline.h
M bin/rat_exclude_files.txt
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
A testdata/data/json_test/complex.json
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
A testdata/data/json_test/overflow.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/QueryTest/complex_json.test
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
A testdata/workloads/functional-query/queries/QueryTest/overflow_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
42 files changed, 1,498 insertions(+), 44 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/24
--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 24
Gerrit-Owner: Zihao Ye
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Quanlong Huang
Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

Patch Set 23:

(7 comments)

The patch is in good shape. I ran the following command to go through all tests on text format:

  git grep file_format tests | grep "'text'"

I identified some tests that would be good to add for json.

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@37
PS22, Line 37:
> > Can we move this file into be/src/exec/json if it's only used by
It seems to be tightly coupled with hdfs-json-scanner, so I think we can put it in /json unless it can be reused in other places.

Moving the implementation code to json-parser.cc helps speed up recompilation when you have code changes. It also makes this header file shorter and easier to go through. You can keep some short methods in the header file and just move large methods like Parse().

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@300
PS22, Line 300: // TODO: Support Invoke CodeGend WriteSlot
> > A follow-up question for https://gerrit.cloudera.org/c/19699/18/be/src/ex
I checked other scanners and they have the same behavior. I think it's ok to ignore failures and just set nulls on the slot as long as we can report the mem issue in other places, e.g. report the mem issue when we want to read more data. Currently, we have 'buffer_status_' that can report this. So I think the current implementation is ok.

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/data_errors/test_data_errors.py
File tests/data_errors/test_data_errors.py:

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/data_errors/test_data_errors.py@128
PS23, Line 128: self.run_test_case('DataErrorsTest/hdfs-scan-node-errors', vector)
Can we add a similar test for json?

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_cancellation.py
File tests/query_test/test_cancellation.py:

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_cancellation.py@113
PS23, Line 113: 'text'
Let's add json here

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_chars.py
File tests/query_test/test_chars.py:

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_chars.py@37
PS23, Line 37: 'text'
Let's test json here

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_chars.py@68
PS23, Line 68: 'text'
Let's test json here as well

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_date_queries.py
File tests/query_test/test_date_queries.py:

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_date_queries.py@45
PS23, Line 45: 'text'
Let's add json here. Please also update the above comment. DATE type is also supported in orc and json?

--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 23
Gerrit-Owner: Zihao Ye
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Quanlong Huang
Gerrit-Reviewer: Zihao Ye
Gerrit-Comment-Date: Tue, 15 Aug 2023 01:08:55 +
Gerrit-HasComments: Yes
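The error-handling approach described in the review above (tolerate a failed slot write by storing NULL, and surface unrecoverable problems through a separate status that is checked when more data is requested) can be sketched roughly as follows. This is an illustration under stated assumptions, not Impala's implementation. `Status`, `Slot`, `SketchScanner`, and `ScanAll` are hypothetical, and only the member name `buffer_status_` is borrowed from the discussion.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified status type.
struct Status {
  bool ok = true;
  std::string msg;
};

// Hypothetical materialized slot: either NULL or an integer value.
struct Slot {
  bool is_null;
  int value;
};

class SketchScanner {
 public:
  explicit SketchScanner(int buffers_available)
      : buffers_available_(buffers_available) {}

  // Per-value callback. A value that cannot be converted becomes NULL and
  // parsing continues, so this always returns true.
  bool OnValue(const std::string& text) {
    const bool convertible =
        !text.empty() && text.size() <= 9 &&
        text.find_first_not_of("0123456789") == std::string::npos;
    if (convertible) {
      slots_.push_back(Slot{false, std::stoi(text)});
    } else {
      slots_.push_back(Slot{true, 0});
    }
    return true;
  }

  // Buffer callback. A failure here is recorded in a sticky status and
  // returning false tells the caller to stop.
  bool GetNextBuffer() {
    if (buffers_available_ == 0) {
      buffer_status_.ok = false;
      buffer_status_.msg = "could not obtain the next buffer";
      return false;
    }
    --buffers_available_;
    return true;
  }

  const Status& buffer_status() const { return buffer_status_; }
  const std::vector<Slot>& slots() const { return slots_; }

 private:
  int buffers_available_;
  Status buffer_status_;
  std::vector<Slot> slots_;
};

// Drives the scanner over pre-split buffers and returns the sticky status.
Status ScanAll(SketchScanner* scanner,
               const std::vector<std::vector<std::string>>& buffers) {
  for (const std::vector<std::string>& buffer : buffers) {
    if (!scanner->GetNextBuffer()) break;
    for (const std::string& value : buffer) scanner->OnValue(value);
  }
  return scanner->buffer_status();
}
```

The design choice mirrored here is that a bad value costs one NULL slot, while a resource failure ends the scan and is reported once, at the point where more input is requested.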
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 23: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 23 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 14 Aug 2023 12:18:37 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 23: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9595/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 23 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 14 Aug 2023 07:55:47 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 23: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/9594/ -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 23 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 14 Aug 2023 06:23:16 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 23: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9594/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 23 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 14 Aug 2023 06:16:38 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 23: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/13631/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 23 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 25 Jul 2023 13:00:36 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Zihao Ye has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

Patch Set 23:

(8 comments)

Thanks for the review! I manually tested reading a JSON file with some large strings under a low mem_limit, and got the following error:

ERROR: Memory limit exceeded: Failed to allocate row batch
EXCHANGE_NODE (id=1) could not allocate 8.00 KB without exceeding limit.
Error occurred on backend 4620ee0acf3b:27000
Memory left in process limit: 11.41 GB
Memory left in query limit: -125.32 MB
Query(264ddc1e2c291b18:f1bec1c5): memory limit exceeded. Limit=100.00 MB Reservation=12.00 MB ReservationLimit=68.00 MB OtherMemory=213.32 MB Total=225.32 MB Peak=225.33 MB
Unclaimed reservations: Reservation=4.00 MB OtherMemory=0 Total=4.00 MB Peak=12.00 MB
Fragment 264ddc1e2c291b18:f1bec1c50001: Reservation=8.00 MB OtherMemory=213.31 MB Total=221.31 MB Peak=221.32 MB
HDFS_SCAN_NODE (id=0): Reservation=8.00 MB OtherMemory=8.00 KB Total=8.01 MB Peak=95.03 MB
Queued Batches: Total=8.00 KB Peak=71.03 MB
KrpcDataStreamSender (dst_id=1): Total=142.28 MB Peak=142.29 MB
RowBatchSerialization: Total=142.28 MB Peak=142.28 MB
Fragment 264ddc1e2c291b18:f1bec1c5: Reservation=0 OtherMemory=8.00 KB Total=8.00 KB Peak=8.00 KB
EXCHANGE_NODE (id=1): Reservation=0 OtherMemory=0 Total=0 Peak=0
KrpcDeferredRpcs: Total=0 Peak=0
PLAN_ROOT_SINK: Total=0 Peak=0
CodeGen: Total=0 Peak=0
CodeGen: Total=0 Peak=0

It seems that the EXCHANGE_NODE is unable to allocate memory, rather than the HDFS_SCAN_NODE. Is this expected?

Also, patch set 23 resolves a bug where scanning complex types would hit a DCHECK (thanks to test_scanners_fuzz.py for finding it), so an additional related test, complex_json.test, was added.

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@37
PS22, Line 37:
> Can we move this file into be/src/exec/json if it's only used by
> hdfs-json-scanner?
> Also do you plan to move the implementation codes into a
> json-parser.cc file?
Of course, it can be moved to /json. I initially placed it outside mainly to mirror the location of delimited-text-parser.h/.cc. If you think it's necessary, I can put it in /json later.
I had not thought about moving the implementation code into json-parser.cc. Please let me know if you think it's necessary.

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@59
PS22, Line 59: /// must succeed. Functions with bool return type return true on succeed, and return false
> nit: "The following functions materialize output tuples. Functions with voi
Done

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@213
PS22, Line 213: f
> nit: "been"
Done

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.h
File be/src/exec/json/hdfs-json-scanner.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.h@47
PS22, Line 47: /// exactly one scanner.
> Let's also mention the error handling, i.e. how different kinds of errors a
Done

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@188
PS22, Line 188: string data_view = string(data, std::min(len, max_view_len));
> It'd be more helpful to print the column name and table name, e.g.
Done. Now we can see the specific column where the error occurred, along with the corresponding data (truncated to a length limit).

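The reply above describes an error message that names the column and shows a length-limited view of the offending data. A minimal sketch of that idea follows. Only the `data_view` expression comes from the quoted line 188. The function name `FormatConvertError`, its parameters, and the message wording are hypothetical and are not taken from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Builds a conversion-error message that identifies the table and column and
// includes at most max_view_len bytes of the raw value.
std::string FormatConvertError(const std::string& table,
                               const std::string& column, const char* data,
                               std::size_t len, std::size_t max_view_len) {
  // Same truncation idea as the quoted line: never copy more than the limit.
  const std::string data_view(data, std::min(len, max_view_len));
  std::ostringstream msg;
  msg << "Error converting column: " << table << "." << column << ", data: '"
      << data_view << "'";
  // Mark truncated values so the reader knows the view is partial.
  if (len > max_view_len) msg << "...";
  return msg.str();
}
```

Capping the view keeps log lines bounded even when a single JSON string value is very large, which matters for the low mem_limit scenario discussed in this thread.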
http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@211
PS22, Line 211: stream_->filename(), stream_->scan_range()->offset() + offset,
> It seems we should return a non-ok status for unrecoverable
> situations, e.g. running out of memory.
Yes, that would be better, but how can we detect that an unrecoverable error has occurred here? Maybe state_->CheckQueryState()?

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@300
PS22, Line 300: // TODO: Support Invoke CodeGend WriteSlot
> A follow-up question for
> https://gerrit.cloudera.org/c/19699/18/be/src/exec/json-parser.h#303
>
> This can return false in copying strings when we run out of memory:
> https://github.com/apache/impala/blob/af3f56e6d1605a56f7bd02b0af35be980a7e4c63/be/src/exec/text-converter.inline.h#L96
>
> It seems we will return true in HandleConvertError() and let
> RapidJSON continue parsing. Can we stop it and report the mem
> issue? Or did I miss something?
Yes, it will continue parsing. Similar to the issue in HandleError(), how can we determine if memory has run out here? Relying solely on the return value of WriteSlot is not sufficient, I also tried
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#23).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files.

The scanning of JSON data is mainly completed by two parts working together. The first part is the JsonParser responsible for parsing the JSON object, which is implemented based on the SAX-style API of rapidjson. It reads data from the char stream, parses it, and calls the corresponding callback function when encountering the corresponding JSON element. See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner and provides callback functions for the JsonParser. The callback functions are responsible for providing data buffers to the Parser and converting and materializing the Parser's parsing results into RowBatch.

It should be noted that the parser returns numeric values as strings to the scanner. The scanner uses the TextConverter class to convert the strings to the desired types, similar to how the HdfsTextScanner works. This is an advantage compared to using number value provided by rapidjson directly, as it eliminates concerns about inconsistencies in converting decimals (e.g. losing precision).

Limitations
 - Multiline json objects are not fully supported yet. It is ok when each file has only one scan range. However, when a file has multiple scan ranges, there is a small probability of incomplete scanning of multiline JSON objects that span ScanRange boundaries (in such cases, parsing errors may be reported). For more details, please refer to the comments in the 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py for testing multiline, malformed, and overflow in JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M be/src/exec/text-converter.inline.h
M bin/rat_exclude_files.txt
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
A testdata/data/json_test/complex.json
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
A testdata/data/json_test/overflow.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/QueryTest/complex_json.test
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
A testdata/workloads/functional-query/queries/QueryTest/overflow_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
41 files changed, 1,442 insertions(+), 44 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/23
--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 23
Gerrit-Owner: Zihao Ye
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Quanlong Huang
Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader

Patch Set 22:

(9 comments)

The patch is looking good! I mostly looked into the error handling in this round. I'm not sure whether we have tests for exceeding mem_limit. If not, we can add a test that writes a JSON file with huge strings and reads it in a query whose low mem_limit will be exceeded.

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@37
PS22, Line 37:
Can we move this file into be/src/exec/json if it's only used by hdfs-json-scanner? Also, do you plan to move the implementation code into a json-parser.cc file?

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@59
PS22, Line 59: /// means return true on success.
nit: "The following functions materialize output tuples. Functions with void return type must succeed. Functions with bool return type return true on success, and return false to stop parsing the whole scan range."

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@213
PS22, Line 213: be
nit: "been"

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.h
File be/src/exec/json/hdfs-json-scanner.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.h@47
PS22, Line 47: /// exactly one scanner.
Let's also mention the error handling, i.e. how different kinds of errors are handled.

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@188
PS22, Line 188: << desc->col_pos() - scan_node_->num_partition_keys()
It'd be more helpful to print the column name and table name, e.g.
Error converting column 'key1' of table my_tbl to bigint

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@211
PS22, Line 211: return Status::OK();
It seems we should return a non-ok status for unrecoverable situations, e.g. running out of memory.

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@300
PS22, Line 300: if (LIKELY(text_converter_->WriteSlot(slot_desc, _type, tuple_, data, len, true,
A follow-up question for https://gerrit.cloudera.org/c/19699/18/be/src/exec/json-parser.h#303
This can return false when copying strings if we run out of memory:
https://github.com/apache/impala/blob/af3f56e6d1605a56f7bd02b0af35be980a7e4c63/be/src/exec/text-converter.inline.h#L96
It seems we will return true in HandleConvertError() and let RapidJSON continue parsing. Can we stop it and report the memory issue? Or did I miss something?

http://gerrit.cloudera.org:8080/#/c/19699/22/testdata/data/json_test/malformed.json
File testdata/data/json_test/malformed.json:

http://gerrit.cloudera.org:8080/#/c/19699/22/testdata/data/json_test/malformed.json@2
PS22, Line 2: {"bool_col":False,"int_col":1,"float_col":0.1,"string_col":abc123}
Can we also add these cases?
{ } [ ] ( ) {"string_col":"abc123"} ["string_col", "abc123"]

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json
File testdata/data/json_test/multiline.json:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@4
PS21, Line 4: 1234 : 567
> > Just curious, what the behavior if this is parsed as a numeric
Ack

--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 22
Gerrit-Owner: Zihao Ye
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Quanlong Huang
Gerrit-Reviewer: Zihao Ye
Gerrit-Comment-Date: Mon, 24 Jul 2023 11:32:52 +
Gerrit-HasComments: Yes
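A note on the error-handling comments for lines 211 and 300 above: the distinction being asked for is between a recoverable conversion error (null the slot, keep parsing) and an unrecoverable one such as running out of memory (stop parsing and surface a non-OK status). The following is only a minimal sketch under assumed names. `WriteResult`, `ScanState`, and `HandleWriteResult` are hypothetical and are not Impala's actual API; the real patch works with `Status`, `TextConverter::WriteSlot`, and `HandleConvertError`.

```cpp
#include <cassert>  // used by the checks in the usage note
#include <string>

// Hypothetical outcome of writing one slot (not Impala's real API).
enum class WriteResult { kOk, kConvertError, kOutOfMemory };

// Hypothetical per-scan state. An empty 'status' stands in for Status::OK().
struct ScanState {
  bool abort_on_error = false;
  bool slot_is_null = false;
  std::string status;
};

// Returns true to let the parser continue and false to stop parsing.
bool HandleWriteResult(WriteResult result, ScanState* state) {
  switch (result) {
    case WriteResult::kOk:
      return true;
    case WriteResult::kConvertError:
      // Recoverable: null out the slot; only stop if abort_on_error is set.
      state->slot_is_null = true;
      if (state->abort_on_error) {
        state->status = "conversion error";
        return false;
      }
      return true;
    case WriteResult::kOutOfMemory:
      // Unrecoverable: always stop, regardless of abort_on_error.
      state->status = "memory limit exceeded";
      return false;
  }
  return false;
}
```

With `abort_on_error` left at false, a `kConvertError` call returns true and leaves `status` empty, while a `kOutOfMemory` call returns false and records a non-empty status that the caller can propagate.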
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Zihao Ye has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader

Patch Set 21:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG@22
PS21, Line 22: converting and materializing the Parser's parsing results into RowBatch.
> Could you mention that numeric values are parsed from strings using the sam
Done

http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG@25
PS21, Line 25: ,
> nit: period "."
Done

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/hdfs-scanner.cc@831
PS21, Line 831: if (scan_node_->skip_header_line_count() > 1) {
> I think don't need this for JSON tables. We can make sure 'skip_header_line
Done. This part is indeed unnecessary; I have copied the rest of the code directly into HandleConvertError.

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h@188
PS21, Line 188: kParseStopWhenDoneFlag
> Does this mean only parsing the first json object in each line and
> skip the others (if there are)?
Not quite. Without this flag, the parser checks whether the stream has ended after parsing an object and, if it hasn't, reports the error "kParseErrorDocumentRootNotSingular". The flag skips this check, so the parser can parse multiple objects in one stream without reporting this error.

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h@189
PS21, Line 189: reader_.Parse(stream_, *this);
> Could you add a comment above this for the parsing mechanism? E.g.
Done

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@303
PS18, Line 303: current_field_idx_ = -1;
> What's the behavior if we return false here? Will RapidJSON skip
> the whole json object or just skip this value?
This returns false only when HandleConvertError returns false (i.e., abort_on_error must be true), in which case the query is aborted. If abort_on_error is false, this never returns false: even if the slot has a conversion error, we just set it to null and return true. See the HandleConvertError function.

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.h
File be/src/exec/json/hdfs-json-scanner.h:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.h@39
PS21, Line 39: This
> nit: "this"
Done

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.cc@288
PS21, Line 288: if (LIKELY(text_converter_->WriteSlot(slot_desc, _type, tuple_, data, len, true,
: false, current_pool_))) return true;
> nit: please write this in multi-lines (with brackets)
Done

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/malformed.json
File testdata/data/json_test/malformed.json:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/malformed.json@1
PS21, Line 1: {"bool_col":true,"int_col":0,"float_col":"abc","string_col":"abc123"}
> Could you add a line that misses the right bracket, and one more line with
Done

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json
File testdata/data/json_test/multiline.json:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@1
PS21, Line 1: {"Id": 1, "Key": "normal object", "Value": "abcdefg"}
> Could you add a line of two json objects?
Done

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@4
PS21, Line 4: 1234 : 567
> Just curious, what the behavior if this is parsed as a numeric
> column, e.g. bigint?
We get a bigint 1234 and two reported errors: one is Missing '}' after '1234', the other is Invalid value '567}'. This is because the parser considers a number to end when it encounters a newline character.

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/datasets/functional/functional_schema_template.sql@1388
PS21, Line 1388: LOAD DATA LOCAL INPATH '{impala_home}/testdata/data/overflow.txt' OVERWRITE INTO TABLE {db_name}{db_suffix}.{table_name};
> Can we add a similar table for json and test the overflow
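For readers following the kParseStopWhenDoneFlag exchange in this reply: the idea is that each parse call consumes exactly one root value and leaves the stream positioned after it, so a driver loop can keep calling until the stream ends. The toy below is not RapidJSON and does not validate JSON. It is a sketch under that assumption, with hypothetical helpers `NextObject` and `SplitRootObjects` that only track brace depth and string quoting.

```cpp
#include <cassert>  // used by the checks in the usage note
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for one "stop when done" parse call: starting at *pos, skip
// whitespace, consume one balanced {...} value, store it in *obj and advance
// *pos past it. Returns false at end of input or on an unterminated object.
bool NextObject(const std::string& in, std::size_t* pos, std::string* obj) {
  std::size_t i = *pos;
  while (i < in.size() && std::isspace(static_cast<unsigned char>(in[i]))) ++i;
  if (i >= in.size() || in[i] != '{') return false;
  const std::size_t start = i;
  int depth = 0;
  bool in_string = false;
  for (; i < in.size(); ++i) {
    const char c = in[i];
    if (in_string) {
      if (c == '\\') {
        ++i;  // skip the escaped character
      } else if (c == '"') {
        in_string = false;
      }
    } else if (c == '"') {
      in_string = true;
    } else if (c == '{') {
      ++depth;
    } else if (c == '}' && --depth == 0) {
      *obj = in.substr(start, i - start + 1);
      *pos = i + 1;
      return true;
    }
  }
  return false;
}

// Driver loop in the spirit of "while the stream has data, parse one object".
std::vector<std::string> SplitRootObjects(const std::string& in) {
  std::vector<std::string> out;
  std::size_t pos = 0;
  std::string obj;
  while (NextObject(in, &pos, &obj)) out.push_back(obj);
  return out;
}
```

Two objects on one line, or objects separated by newlines, each come back as their own entry. A parser that insisted on a single root would instead stop with an error after the first object, which is the check the flag disables.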
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 22: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/13609/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 22 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Fri, 21 Jul 2023 06:43:19 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#22). Change subject: IMPALA-10798: Prototype a simple JSON File reader .. IMPALA-10798: Prototype a simple JSON File reader Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. The scanning of JSON data is mainly completed by two parts working together. The first part is the JsonParser responsible for parsing the JSON object, which is implemented based on the SAX-style API of rapidjson. It reads data from the char stream, parses it, and calls the corresponding callback function when encountering the corresponding JSON element. See the comments of the JsonParser class for more details. The other part is the HdfsJsonScanner, which inherits from HdfsScanner and provides callback functions for the JsonParser. The callback functions are responsible for providing data buffers to the Parser and converting and materializing the Parser's parsing results into RowBatch. It should be noted that the parser returns numeric values as strings to the scanner. The scanner uses the TextConverter class to convert the strings to the desired types, similar to how the HdfsTextScanner works. This is an advantage compared to using number value provided by rapidjson directly, as it eliminates concerns about inconsistencies in converting decimals (e.g. losing precision). Limitations - Multiline json objects are not fully supported yet. It is ok when each file has only one scan range. However, when a file has multiple scan ranges, there is a small probability of incomplete scanning of multiline JSON objects that span ScanRange boundaries (in such cases, parsing errors may be reported). For more details, please refer to the comments in the 'multiline_json.test'. - Compressed JSON files are not supported yet. - Complex types are not supported yet. 
Tests - Most of the existing end-to-end tests can run on JSON format. - Add TestQueriesJsonTables in test_queries.py for testing multiline, malformed, and overflow in JSON. Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M be/src/exec/text-converter.inline.h M bin/rat_exclude_files.txt M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py A testdata/data/json_test/malformed.json A testdata/data/json_test/multiline.json A testdata/data/json_test/overflow.json M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test A testdata/workloads/functional-query/queries/QueryTest/overflow_json.test M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_decimal_queries.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M 
tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 39 files changed, 1,377 insertions(+), 44 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/22 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 22 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
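The commit message for this patch set says the parser hands numeric values to the scanner as strings so that conversion does not lose precision. A small sketch of why that matters, using only the C++ standard library: the function names `ViaDouble` and `ViaString` are hypothetical, and this does not use Impala's TextConverter. It only contrasts converting the raw text directly with going through a double first.

```cpp
#include <cassert>  // used by the checks in the usage note
#include <cstdint>
#include <cstdlib>
#include <string>

// Path A: the JSON library has already turned the token into a double and
// the scanner casts it to the column type.
int64_t ViaDouble(const std::string& raw) {
  return static_cast<int64_t>(std::strtod(raw.c_str(), nullptr));
}

// Path B: the scanner receives the raw token text and converts it itself.
int64_t ViaString(const std::string& raw) {
  return static_cast<int64_t>(std::strtoll(raw.c_str(), nullptr, 10));
}
```

For the token 9007199254740993 (2^53 + 1), `ViaString` returns the exact value, while `ViaDouble` returns 9007199254740992 because that integer is not representable as an IEEE 754 double. The same concern applies to decimals with many digits.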
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader

Patch Set 21:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG@22
PS21, Line 22: converting and materializing the Parser's parsing results into RowBatch.
Could you mention that numeric values are parsed from strings using the same functionality as the text scanners? This is an advantage over using RapidJSON's numeric values directly, so we don't need to worry about inconsistency in converting decimals (e.g. losing precision).

http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG@25
PS21, Line 25: ,
nit: period "."

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/hdfs-scanner.cc@831
PS21, Line 831: if (scan_node_->skip_header_line_count() > 1) {
I think we don't need this for JSON tables. We can make sure 'skip_header_line_count' is always 0 for JSON scan nodes. It comes from FE:
https://github.com/apache/impala/blob/97e44c11923f3d28e08aba1b5dd66b8a35465deb/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java#L290

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h@188
PS21, Line 188: kParseStopWhenDoneFlag
Does this mean we only parse the first JSON object in each line and skip the others (if there are any)?

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h@189
PS21, Line 189: reader_.Parse(stream_, *this);
Could you add a comment above this for the parsing mechanism? E.g.
Reads characters from the stream, and publishes events to this handler (JsonParser).

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@169
PS18, Line 169: while (!stream_.Eos()) {
> > What about kParseInsituFlag? Can we use it in our scenario?
I see. Thanks for the explanation!

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@303
PS18, Line 303: current_field_idx_ = -1;
What's the behavior if we return false here? Will RapidJSON skip the whole JSON object or just skip this value?

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.h
File be/src/exec/json/hdfs-json-scanner.h:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.h@39
PS21, Line 39: This
nit: "this"

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.cc@288
PS21, Line 288: if (LIKELY(text_converter_->WriteSlot(slot_desc, _type, tuple_, data, len, true,
: false, current_pool_))) return true;
nit: please write this on multiple lines (with brackets)

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/malformed.json
File testdata/data/json_test/malformed.json:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/malformed.json@1
PS21, Line 1: {"bool_col":true,"int_col":0,"float_col":"abc","string_col":"abc123"}
Could you add a line that is missing the right bracket, and one more line with duplicated keys?

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json
File testdata/data/json_test/multiline.json:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@1
PS21, Line 1: {"Id": 1, "Key": "normal object", "Value": "abcdefg"}
Could you add a line containing two JSON objects?

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@4
PS21, Line 4: 1234 : 567
Just curious, what is the behavior if this is parsed as a numeric column, e.g. bigint?

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/datasets/functional/functional_schema_template.sql@1388
PS21, Line 1388: LOAD DATA LOCAL INPATH '{impala_home}/testdata/data/overflow.txt' OVERWRITE INTO TABLE {db_name}{db_suffix}.{table_name};
Can we add a similar table for JSON and test the overflow behaviors?

--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 21
Gerrit-Owner: Zihao Ye
Gerrit-Reviewer: Impala
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Zihao Ye has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader

Patch Set 21:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/19699/18//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/18//COMMIT_MSG@23
PS18, Line 23:
> Could you add a section for the current limitations? E.g.
Done

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@45
PS18, Line 45: 'begin' and '
> nit: add quotes on var names, i.e. 'begin' and 'end'
Done

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@49
PS18, Line 49: func
> nit: remove "some"
Done

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@68
PS18, Line 68: /// bool AddNumber(int index, const char* str, uint32_t len);
> nit: it'd be helpful to give a doc link here. E.g.
Done

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@143
PS18, Line 143: current_field_idx_ = -1;
> This is not that readable. Could you add some comments and break them into
Done

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@169
PS18, Line 169: while (!stream_.Eos()) {
> What about kParseInsituFlag? Can we use it in our scenario?
Unfortunately, we cannot use kParseInsituFlag here because it requires our char stream to support input and output simultaneously. Specifically, it needs to write data back to an earlier position after reading ahead. However, our char stream gets its buffer in chunks (see GetNextBuffer), so it is difficult to go back to an old buffer and write data.

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@230
PS18, Line 230: inline void GetNextBuffer(const char** begin, const char** end) {
> nit: This is a bit confusing. It might be better to use DCHECK_EQ(current_f
Done

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@242
PS18, Line 242: /// 2. Call Key() upon encountering a key to find its index of the row in the schema and
> nit: could you add a comment before each methods mentioning they are interf
Done

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json/hdfs-json-scanner.cc@114
PS18, Line 114: the previous scan ran
> nit: "the previous scan range in the same file" might be more clear
Done

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json/hdfs-json-scanner.cc@160
PS18, Line 160: if
> nit: add a space after "if"
Done

--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 21
Gerrit-Owner: Zihao Ye
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Quanlong Huang
Gerrit-Reviewer: Zihao Ye
Gerrit-Comment-Date: Wed, 19 Jul 2023 08:32:49 +
Gerrit-HasComments: Yes
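The kParseInsituFlag answer in this reply turns on the stream being read-only and chunked. A rough sketch of such a stream follows. `ChunkedCharStream`, `NextBufferFn`, and `ReadAll` are hypothetical names, and the class only mimics the read side (Peek/Take/Tell) of a SAX-style input stream; it is not the stream class from the patch. Once a chunk is exhausted its pointer is replaced, so there is nowhere to write decoded data back to.

```cpp
#include <cassert>  // used by the checks in the usage note
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical read-only stream that pulls its input chunk by chunk through a
// callback, in the spirit of the GetNextBuffer() callback discussed above.
class ChunkedCharStream {
 public:
  using NextBufferFn = std::function<void(const char** begin, const char** end)>;

  explicit ChunkedCharStream(NextBufferFn next) : next_(std::move(next)) {}

  // True once the current chunk is consumed and the callback has no more data.
  bool Eos() {
    Refill();
    return cur_ == end_;
  }
  // Returns the next character without consuming it ('\0' at end of stream).
  char Peek() {
    Refill();
    return cur_ == end_ ? '\0' : *cur_;
  }
  // Consumes and returns the next character ('\0' at end of stream).
  char Take() {
    Refill();
    if (cur_ == end_) return '\0';
    ++pos_;
    return *cur_++;
  }
  // Number of characters consumed so far.
  std::size_t Tell() const { return pos_; }

 private:
  // Fetches the next chunk when the current one is used up. The old chunk's
  // pointers are overwritten, which is why in-situ write-back is not possible.
  void Refill() {
    if (cur_ != end_ || exhausted_) return;
    next_(&cur_, &end_);
    if (cur_ == end_) exhausted_ = true;
  }

  NextBufferFn next_;
  const char* cur_ = nullptr;
  const char* end_ = nullptr;
  std::size_t pos_ = 0;
  bool exhausted_ = false;
};

// Drains the stream into a string; handy for checking chunk stitching.
std::string ReadAll(ChunkedCharStream* stream) {
  std::string out;
  while (!stream->Eos()) out.push_back(stream->Take());
  return out;
}
```

Feeding the stream two chunks that split an object in the middle and calling `ReadAll` yields the concatenated text, with `Tell()` equal to the total number of characters consumed.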
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 21: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/13578/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 21 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Wed, 19 Jul 2023 08:30:48 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#21). Change subject: IMPALA-10798: Prototype a simple JSON File reader .. IMPALA-10798: Prototype a simple JSON File reader Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. The scanning of JSON data is mainly completed by two parts working together. The first part is the JsonParser responsible for parsing the JSON object, which is implemented based on the SAX-style API of rapidjson. It reads data from the char stream, parses it, and calls the corresponding callback function when encountering the corresponding JSON element. See the comments of the JsonParser class for more details. The other part is the HdfsJsonScanner, which inherits from HdfsScanner and provides callback functions for the JsonParser. The callback functions are responsible for providing data buffers to the Parser and converting and materializing the Parser's parsing results into RowBatch. Limitations - Multiline json objects are not fully supported yet, It is ok when each file has only one scan range. However, when a file has multiple scan ranges, there is a small probability of incomplete scanning of multiline JSON objects that span ScanRange boundaries (in such cases, parsing errors may be reported). For more details, please refer to the comments in the 'multiline_json.test'. - Compressed JSON files are not supported yet. - Complex types are not supported yet. Tests - Most of the existing end-to-end tests can run on JSON format. - Add TestQueriesJsonTables in test_queries.py for testing multiline and malformed JSON. 
Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M be/src/exec/text-converter.inline.h M bin/rat_exclude_files.txt M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py A testdata/data/json_test/malformed.json A testdata/data/json_test/multiline.json M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_decimal_queries.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 37 files changed, 1,273 insertions(+), 44 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/21 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit 
http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 21 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 20: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/13577/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 20 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Wed, 19 Jul 2023 07:55:13 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#20). Change subject: IMPALA-10798: Prototype a simple JSON File reader .. IMPALA-10798: Prototype a simple JSON File reader Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. The scanning of JSON data is mainly completed by two parts working together. The first part is the JsonParser responsible for parsing the JSON object, which is implemented based on the SAX-style API of rapidjson. It reads data from the char stream, parses it, and calls the corresponding callback function when encountering the corresponding JSON element. See the comments of the JsonParser class for more details. The other part is the HdfsJsonScanner, which inherits from HdfsScanner and provides callback functions for the JsonParser. The callback functions are responsible for providing data buffers to the Parser and converting and materializing the Parser's parsing results into RowBatch. Limitations - Multiline json objects are not fully supported yet, It is ok when each file has only one scan range. However, when a file has multiple scan ranges, there is a small probability of incomplete scanning of multiline JSON objects that span ScanRange boundaries (in such cases, parsing errors may be reported). For more details, please refer to the comments in the 'multiline_json.test'. - Compressed JSON files are not supported yet. - Complex types are not supported yet. Tests - Most of the existing end-to-end tests can run on JSON format. - Add TestQueriesJsonTables in test_queries.py for testing multiline and malformed JSON. 
Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M be/src/exec/text-converter.inline.h M bin/rat_exclude_files.txt M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py A testdata/data/json_test/malformed.json A testdata/data/json_test/multiline.json M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_decimal_queries.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 37 files changed, 1,273 insertions(+), 44 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/20 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit 
http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 20 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
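Editor's note on the scan-range limitation in the commit message above: the following is a minimal sketch of a common way to read newline-delimited rows from byte ranges, assuming one JSON object per line. It is not taken from the patch; the function name `read_range` and the rule "a row belongs to the range that contains its first byte" are assumptions made for illustration.

```python
import json


def read_range(data: bytes, offset: int, length: int) -> list:
    """Return the JSON rows whose first byte lies in [offset, offset + length)."""
    end = offset + length
    if offset == 0:
        start = 0
    else:
        # Skip the tail of a row that began in the previous range by moving
        # just past the first newline found at or after offset - 1.
        newline = data.find(b"\n", offset - 1)
        start = len(data) if newline == -1 else newline + 1
    rows = []
    while start < end and start < len(data):
        newline = data.find(b"\n", start)
        stop = len(data) if newline == -1 else newline
        line = data[start:stop]
        if line.strip():
            # The last row of a range may extend past 'end'; it is still
            # read in full because it started inside the range.
            rows.append(json.loads(line))
        start = stop + 1
    return rows


data = b'{"a": 1}\n{"a": 2}\n{"a": 3}\n'
ranges = [(0, 10), (10, 10), (20, 7)]
rows = [row for offset, length in ranges for row in read_range(data, offset, length)]
```

Under this scheme every single-line row is read exactly once, whichever range it straddles. If an object contained raw newlines, the skip-to-newline step could land inside that object, which is consistent with the kind of parsing error the Limitations section describes for multiline objects.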
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#19).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner, implemented on top of rapidjson, which supports scanning data from splittable JSON files.

JSON scanning is done by two parts working together. The first is the JsonParser, which parses JSON objects and is built on the SAX-style API of rapidjson. It reads data from the char stream, parses it, and calls the matching callback function for each JSON element it encounters. See the comments of the JsonParser class for more details. The other part is the HdfsJsonScanner, which inherits from HdfsScanner and provides the callback functions for the JsonParser. The callbacks are responsible for providing data buffers to the parser and for converting and materializing the parsing results into a RowBatch.

Limitations
- Multiline JSON objects are not fully supported yet. They work when each file has only one scan range. However, when a file has multiple scan ranges, there is a small probability that multiline JSON objects spanning ScanRange boundaries are scanned incompletely (in such cases, parsing errors may be reported). For more details, please refer to the comments in 'multiline_json.test'.
- Compressed JSON files are not supported yet.
- Complex types are not supported yet.

Tests
- Most of the existing end-to-end tests can run on JSON format.
- Add TestQueriesJsonTables in test_queries.py for testing multiline and malformed JSON.
Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M be/src/exec/text-converter.inline.h M bin/rat_exclude_files.txt M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py A testdata/data/json_test/malformed.json A testdata/data/json_test/multiline.json M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_decimal_queries.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 37 files changed, 1,240 insertions(+), 44 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/19 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit 
http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 19 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

Patch Set 18:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/19699/18//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/18//COMMIT_MSG@23
PS18, Line 23:
Could you add a section for the current limitations? E.g.

Limitations
- multiline json objects are not supported
- compressed json files are not supported

Does this patch support parsing complex types from json files?

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@45
PS18, Line 45: begin and end
nit: add quotes around var names, i.e. 'begin' and 'end'

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@49
PS18, Line 49: some
nit: remove "some"

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@68
PS18, Line 68: /// bool AddNumber(int index, const char* str, uint32_t len);
nit: it'd be helpful to give a doc link here. E.g.
/// See more in https://rapidjson.org/classrapidjson_1_1_handler.html

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@143
PS18, Line 143: !memcmp(field_found_.data(), field_found_.data() + 1, field_found_.size() - 1)));
This is not that readable. Could you add some comments and break it into smaller conditions?

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@169
PS18, Line 169: rapidjson::kParseStopWhenDoneFlag;
What about kParseInsituFlag? Can we use it in our scenario?

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@230
PS18, Line 230: DCHECK(!IsRequiredField());
nit: This is a bit confusing. It might be better to use DCHECK_EQ(current_field_idx_, -1) directly.

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@242
PS18, Line 242: bool StartObject() {
nit: could you add a comment before these methods mentioning that they are the handler interface used by RapidJSON? e.g.
/// Handler methods used in Rapidjson

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json/hdfs-json-scanner.cc@114
PS18, Line 114: the scan range before
nit: "the previous scan range in the same file" might be clearer

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json/hdfs-json-scanner.cc@160
PS18, Line 160: if(
nit: add a space after "if"

--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 18
Gerrit-Owner: Zihao Ye
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Quanlong Huang
Gerrit-Reviewer: Zihao Ye
Gerrit-Comment-Date: Tue, 18 Jul 2023 13:41:06 +
Gerrit-HasComments: Yes
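Editor's note on the readability comment for json-parser.h line 143: comparing a buffer with itself shifted by one element, as the quoted memcmp call does, appears to be a compact way of asking "are all elements equal?". The sketch below restates that idea in Python rather than C++, purely as an illustration; the function names are hypothetical and the surrounding condition in the patch is not shown in this thread, so how the result is used there is not known.

```python
def all_equal_shifted(flags: bytes) -> bool:
    # Mirrors memcmp(p, p + 1, n - 1) == 0: every element must equal its
    # right-hand neighbour, which implies that all elements are the same.
    return flags[:-1] == flags[1:]


def all_equal_explicit(flags: bytes) -> bool:
    # The same condition written out, arguably easier to read and comment.
    return all(flag == flags[0] for flag in flags)


samples = [b"", b"\x00", b"\x00\x00\x00", b"\x01\x01", b"\x00\x01\x00", b"\x01\x00"]
agree = all(all_equal_shifted(s) == all_equal_explicit(s) for s in samples)
```

Both helpers give the same answer on the sample inputs; the explicit form trades the memcmp shortcut for a condition that states its intent, which is roughly what the reviewer asks for.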
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 18: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/13536/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 18 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Thu, 13 Jul 2023 09:55:15 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#18).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner, implemented on top of rapidjson, which supports scanning data from splittable JSON files.

JSON scanning is done by two parts working together. The first is the JsonParser, which parses JSON objects and is built on the SAX-style API of rapidjson. It reads data from the char stream, parses it, and calls the matching callback function for each JSON element it encounters. See the comments of the JsonParser class for more details. The other part is the HdfsJsonScanner, which inherits from HdfsScanner and provides the callback functions for the JsonParser. The callbacks are responsible for providing data buffers to the parser and for converting and materializing the parsing results into a RowBatch.

Tests
- Most of the existing end-to-end tests can run on JSON format.
- Add TestQueriesJsonTables in test_queries.py for testing multiline and malformed JSON.
Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M be/src/exec/text-converter.inline.h M bin/rat_exclude_files.txt M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py A testdata/data/json_test/malformed.json A testdata/data/json_test/multiline.json M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_decimal_queries.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 37 files changed, 1,240 insertions(+), 44 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/18 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit 
http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 18 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 17: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/13392/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 17 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 26 Jun 2023 11:57:36 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Zihao Ye has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

Patch Set 15:

(5 comments)

Thanks for the review!

http://gerrit.cloudera.org:8080/#/c/19699/15//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/15//COMMIT_MSG@11
PS15, Line 11:
> Could you summarize the high level design?
Done

http://gerrit.cloudera.org:8080/#/c/19699/15/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/19699/15/testdata/bin/generate-schema-statements.py@209
PS15, Line 209: 'json': "JSONFILE"
> nit: add a comma at the end
Done

http://gerrit.cloudera.org:8080/#/c/19699/15/testdata/datasets/functional/schema_constraints.csv
File testdata/datasets/functional/schema_constraints.csv:

http://gerrit.cloudera.org:8080/#/c/19699/15/testdata/datasets/functional/schema_constraints.csv@198
PS15, Line 198: table_name:decimal_tiny, constraint:restrict_to, table_format:orc/def/block
> I think we need to add json here and for several other tables.
> Otherwise, some tables are missing in JSON format.
Done, but I'm not sure whether any other tables need to be added, because I'm not familiar with their purpose.

http://gerrit.cloudera.org:8080/#/c/19699/15/tests/query_test/test_queries.py
File tests/query_test/test_queries.py:

http://gerrit.cloudera.org:8080/#/c/19699/15/tests/query_test/test_queries.py@217
PS15, Line 217: class TestQueriesTextTables(ImpalaTestSuite):
> Can we add some tests like these for JSON? E.g. tests for
> multi-line strings, multi-line json objects, malformed json
> objects.
Done. If any other tests need to be added, please let me know.

http://gerrit.cloudera.org:8080/#/c/19699/15/tests/query_test/test_scanners_fuzz.py
File tests/query_test/test_scanners_fuzz.py:

http://gerrit.cloudera.org:8080/#/c/19699/15/tests/query_test/test_scanners_fuzz.py@98
PS15, Line 98: elif table_format.file_format in ('rc', 'seq', 'json'):
> Why do we skip JSON here?
I had not generated any JSON tables whose names start with 'decimal_', because a comment says "Decimal can only be tested on formats Impala can write to (text and parquet)", so I skipped it. Now that we have these tables, the test passes, so it is no longer skipped.

--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 15
Gerrit-Owner: Zihao Ye
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Quanlong Huang
Gerrit-Reviewer: Zihao Ye
Gerrit-Comment-Date: Mon, 26 Jun 2023 11:53:38 +
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#17).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner, implemented on top of rapidjson, which supports scanning data from splittable JSON files.

JSON scanning is done by two parts working together. The first is the JsonParser, which parses JSON objects and is built on the SAX-style API of rapidjson. It reads data from the char stream, parses it, and calls the matching callback function for each JSON element it encounters. See the comments of the JsonParser class for more details. The other part is the HdfsJsonScanner, which inherits from HdfsScanner and provides the callback functions for the JsonParser. The callbacks are responsible for providing data buffers to the parser and for converting and materializing the parsing results into a RowBatch.

Tests
- Most of the existing end-to-end tests can run on JSON format.
- Add TestQueriesJsonTables in test_queries.py for testing multiline and malformed JSON.
Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M be/src/exec/text-converter.inline.h M bin/rat_exclude_files.txt M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py A testdata/data/json_test/malformed.json A testdata/data/json_test/multiline.json M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_decimal_queries.py M tests/query_test/test_queries.py M tests/query_test/test_scanners.py M tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 37 files changed, 1,236 insertions(+), 44 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/17 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit 
http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 17 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 17: (2 comments) http://gerrit.cloudera.org:8080/#/c/19699/17/tests/common/test_dimensions.py File tests/common/test_dimensions.py: http://gerrit.cloudera.org:8080/#/c/19699/17/tests/common/test_dimensions.py@124 PS17, Line 124: def create_uncompressed_json_dimension(workload): flake8: E302 expected 2 blank lines, found 1 http://gerrit.cloudera.org:8080/#/c/19699/17/tests/query_test/test_queries.py File tests/query_test/test_queries.py: http://gerrit.cloudera.org:8080/#/c/19699/17/tests/query_test/test_queries.py@263 PS17, Line 263: class TestQueriesJsonTables(ImpalaTestSuite): flake8: E302 expected 2 blank lines, found 1 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 17 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Mon, 26 Jun 2023 11:35:24 + Gerrit-HasComments: Yes
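Editor's note on the two flake8 E302 reports above: E302 fires when a top-level `def` or `class` is preceded by fewer than two blank lines. The sketch below shows the spacing that satisfies the check. The two names come from the flagged lines; the bodies and the stand-in base class are hypothetical placeholders, not the code from the patch.

```python
class ImpalaTestSuite(object):
    """Hypothetical stand-in for the real base class, so the sketch runs alone."""


def create_uncompressed_json_dimension(workload):
    # Hypothetical body: the real helper builds a test dimension for 'workload'.
    return {"workload": workload, "file_format": "json", "compression": "none"}


class TestQueriesJsonTables(ImpalaTestSuite):
    # Two blank lines separate this class from the function above (fixes E302).
    def test_placeholder(self):
        return True
```

Methods inside a class need only one blank line between them; the two-blank-line rule applies to top-level definitions.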
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 16: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/13355/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 16 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Wed, 21 Jun 2023 09:38:02 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

Patch Set 16:

(5 comments)

Thanks for working on this! Add some test requirements first.

http://gerrit.cloudera.org:8080/#/c/19699/15//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/15//COMMIT_MSG@11
PS15, Line 11:
Could you summarize the high level design?

http://gerrit.cloudera.org:8080/#/c/19699/15/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/19699/15/testdata/bin/generate-schema-statements.py@209
PS15, Line 209: 'json': "JSONFILE"
nit: add a comma at the end

http://gerrit.cloudera.org:8080/#/c/19699/15/testdata/datasets/functional/schema_constraints.csv
File testdata/datasets/functional/schema_constraints.csv:

http://gerrit.cloudera.org:8080/#/c/19699/15/testdata/datasets/functional/schema_constraints.csv@198
PS15, Line 198: table_name:decimal_tiny, constraint:restrict_to, table_format:orc/def/block
I think we need to add json here and for several other tables. Otherwise, some tables are missing in JSON format.

http://gerrit.cloudera.org:8080/#/c/19699/15/tests/query_test/test_queries.py
File tests/query_test/test_queries.py:

http://gerrit.cloudera.org:8080/#/c/19699/15/tests/query_test/test_queries.py@217
PS15, Line 217: class TestQueriesTextTables(ImpalaTestSuite):
Can we add some tests like these for JSON? E.g. tests for multi-line strings, multi-line json objects, malformed json objects.

http://gerrit.cloudera.org:8080/#/c/19699/15/tests/query_test/test_scanners_fuzz.py
File tests/query_test/test_scanners_fuzz.py:

http://gerrit.cloudera.org:8080/#/c/19699/15/tests/query_test/test_scanners_fuzz.py@98
PS15, Line 98: elif table_format.file_format in ('rc', 'seq', 'json'):
Why do we skip JSON here?

--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 16
Gerrit-Owner: Zihao Ye
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Quanlong Huang
Gerrit-Comment-Date: Wed, 21 Jun 2023 09:17:45 +
Gerrit-HasComments: Yes
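Editor's note on the schema_constraints.csv comment above: the quoted line is a list of comma-separated `key:value` pairs. The sketch below splits such a line into its parts so the restriction is easier to read. It is an illustration only, not the parser the Impala test tooling uses; the function name is hypothetical, and reading the `table_format` triple as file format, codec, and compression type is an assumption.

```python
def parse_constraint(line: str) -> dict:
    """Split 'key:value, key:value, ...' into a dict of stripped strings."""
    fields = {}
    for part in line.split(","):
        key, _, value = part.strip().partition(":")
        fields[key] = value
    return fields


line = "table_name:decimal_tiny, constraint:restrict_to, table_format:orc/def/block"
constraint = parse_constraint(line)
# Assumed layout of the triple: file format / codec / compression type.
file_format, codec, compression_type = constraint["table_format"].split("/")
```

Read this way, the quoted line restricts `decimal_tiny` to one ORC variant, which is why the reviewer asks for a corresponding JSON entry so that the table is also generated in JSON format.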
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#16). Change subject: IMPALA-10798: Prototype a simple JSON File reader .. IMPALA-10798: Prototype a simple JSON File reader Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. Tests - Most of the end-to-end tests can run on JSON format. Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M be/src/exec/text-converter.inline.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_scanners.py M tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 28 files changed, 1,100 insertions(+), 41 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/16 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit 
http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 16 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 15: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/9418/ -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 15 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Wed, 21 Jun 2023 07:39:47 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 15: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9418/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 15 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Wed, 21 Jun 2023 02:12:05 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 15: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/13274/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 15 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Tue, 13 Jun 2023 10:15:17 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#15). Change subject: IMPALA-10798: Prototype a simple JSON File reader .. IMPALA-10798: Prototype a simple JSON File reader Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. Tests - Most of the end-to-end tests can run on JSON format. Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M be/src/exec/text-converter.inline.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_decimal_queries.py M tests/query_test/test_scanners.py M tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 29 files changed, 1,099 insertions(+), 41 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/15 -- To view, visit 
http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 15 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 14: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/13254/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 14 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Sat, 10 Jun 2023 02:42:45 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#14). Change subject: IMPALA-10798: Prototype a simple JSON File reader .. IMPALA-10798: Prototype a simple JSON File reader Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. Tests - Most of the end-to-end tests can run on JSON format. Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M be/src/exec/text-converter.inline.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_decimal_queries.py M tests/query_test/test_scanners.py M tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 29 files changed, 1,090 insertions(+), 41 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/14 -- To view, visit 
http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 14 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 12: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/13244/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 12 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Fri, 09 Jun 2023 10:56:48 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#12). Change subject: IMPALA-10798: Prototype a simple JSON File reader .. IMPALA-10798: Prototype a simple JSON File reader Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. Tests - Most of the end-to-end tests can run on JSON format. Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_decimal_queries.py M tests/query_test/test_scanners.py M tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 28 files changed, 1,090 insertions(+), 40 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/12 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit 
http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 12 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 11: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/13146/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 11 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Wed, 31 May 2023 03:15:26 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#11). Change subject: IMPALA-10798: Prototype a simple JSON File reader .. IMPALA-10798: Prototype a simple JSON File reader Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. Tests - Most of the end-to-end tests can run on JSON format. Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_decimal_queries.py M tests/query_test/test_scanners.py M tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 28 files changed, 1,173 insertions(+), 40 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/11 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit 
http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 11 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 10: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/12963/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 10 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Mon, 08 May 2023 07:35:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#10). Change subject: IMPALA-10798: Prototype a simple JSON File reader .. IMPALA-10798: Prototype a simple JSON File reader Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. Tests - Most of the end-to-end tests can run on JSON format. Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_decimal_queries.py M tests/query_test/test_scanners.py M tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 28 files changed, 1,166 insertions(+), 43 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/10 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit 
http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 10 Gerrit-Owner: Zihao Ye Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/19699/10/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/10/be/src/exec/json/hdfs-json-scanner.cc@103
PS10, Line 103: // the entire scan range without finding a single tuple. The bytes will be picked up
line has trailing whitespace

--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 10
Gerrit-Owner: Zihao Ye
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Quanlong Huang
Gerrit-Comment-Date: Mon, 08 May 2023 07:15:28 +
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 9: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/12930/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 9 Gerrit-Owner: Anonymous Coward <18770832...@163.com> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Thu, 04 May 2023 10:14:47 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#9). Change subject: IMPALA-10798: Prototype a simple JSON File reader .. IMPALA-10798: Prototype a simple JSON File reader Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. Tests - Most of the end-to-end tests can run on JSON format. Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_decimal_queries.py M tests/query_test/test_scanners.py M tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 28 files changed, 1,165 insertions(+), 43 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/9 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit 
http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 9 Gerrit-Owner: Anonymous Coward <18770832...@163.com> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 8: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/12928/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 8 Gerrit-Owner: Anonymous Coward <18770832...@163.com> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Thu, 04 May 2023 03:06:29 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#8). Change subject: IMPALA-10798: Prototype a simple JSON File reader .. IMPALA-10798: Prototype a simple JSON File reader Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_decimal_queries.py M tests/query_test/test_scanners.py M tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 28 files changed, 1,167 insertions(+), 45 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/8 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF 
Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 8 Gerrit-Owner: Anonymous Coward <18770832...@163.com> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 7: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/12888/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 7 Gerrit-Owner: Anonymous Coward <18770832...@163.com> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Fri, 28 Apr 2023 09:00:47 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#7). Change subject: IMPALA-10798: Prototype a simple JSON File reader .. IMPALA-10798: Prototype a simple JSON File reader Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_decimal_queries.py M tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 27 files changed, 1,164 insertions(+), 45 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/7 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: 
newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 7 Gerrit-Owner: Anonymous Coward <18770832...@163.com> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 6: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/12887/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 6 Gerrit-Owner: Anonymous Coward <18770832...@163.com> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Fri, 28 Apr 2023 06:39:08 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#6). Change subject: IMPALA-10798: Prototype a simple JSON File reader .. IMPALA-10798: Prototype a simple JSON File reader Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_decimal_queries.py M tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 27 files changed, 1,167 insertions(+), 45 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/6 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: 
newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 6 Gerrit-Owner: Anonymous Coward <18770832...@163.com> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 5: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/12879/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 5 Gerrit-Owner: Anonymous Coward <18770832...@163.com> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Thu, 27 Apr 2023 08:48:36 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19699 to look at the new patch set (#5). Change subject: IMPALA-10798: Prototype a simple JSON File reader .. IMPALA-10798: Prototype a simple JSON File reader Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 --- M be/CMakeLists.txt M be/src/exec/CMakeLists.txt M be/src/exec/hdfs-scan-node-base.cc A be/src/exec/json-parser-test.cc A be/src/exec/json-parser.h A be/src/exec/json/CMakeLists.txt A be/src/exec/json/hdfs-json-scanner.cc A be/src/exec/json/hdfs-json-scanner.h M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/bin/generate-schema-statements.py M testdata/workloads/functional-query/functional-query_core.csv M testdata/workloads/functional-query/functional-query_dimensions.csv M testdata/workloads/functional-query/functional-query_exhaustive.csv M testdata/workloads/functional-query/functional-query_pairwise.csv M testdata/workloads/tpcds/tpcds_core.csv M testdata/workloads/tpcds/tpcds_exhaustive.csv M testdata/workloads/tpcds/tpcds_pairwise.csv M testdata/workloads/tpch/tpch_core.csv M testdata/workloads/tpch/tpch_dimensions.csv M testdata/workloads/tpch/tpch_exhaustive.csv M testdata/workloads/tpch/tpch_pairwise.csv M tests/common/test_dimensions.py M tests/metadata/test_hms_integration.py M tests/query_test/test_decimal_queries.py M tests/query_test/test_scanners_fuzz.py M tests/query_test/test_tpch_queries.py 27 files changed, 1,160 insertions(+), 45 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/5 -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: 
newpatchset Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 5 Gerrit-Owner: Anonymous Coward <18770832...@163.com> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 4: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/12878/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 4 Gerrit-Owner: Anonymous Coward <18770832...@163.com> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Thu, 27 Apr 2023 08:04:32 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

Patch Set 4:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json-parser.h@334
PS4, Line 334:
line has trailing whitespace

http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.h
File be/src/exec/json/hdfs-json-scanner.h:

http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.h@123
PS4, Line 123: /// This is used to indicate whether an error has occurred in the currently parsed row.
line has trailing whitespace

http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.h@134
PS4, Line 134: /// JsonParse comment.
line has trailing whitespace

http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.h@138
PS4, Line 138: /// specific uses described in the JsonParse comment.
line has trailing whitespace

http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.cc@42
PS4, Line 42: scanner_state_(CREATED),
line has trailing whitespace

http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.cc@202
PS4, Line 202:
line has trailing whitespace

http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.cc@221
PS4, Line 221: // due to BreakParse().
line has trailing whitespace

http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.cc@233
PS4, Line 233: // the parser that eos has been reached.
line has trailing whitespace

http://gerrit.cloudera.org:8080/#/c/19699/4/tests/query_test/test_scanners_fuzz.py
File tests/query_test/test_scanners_fuzz.py:

http://gerrit.cloudera.org:8080/#/c/19699/4/tests/query_test/test_scanners_fuzz.py@80
PS4, Line 80: a
flake8: W504 line break after binary operator

http://gerrit.cloudera.org:8080/#/c/19699/4/tests/query_test/test_tpch_queries.py
File tests/query_test/test_tpch_queries.py:

http://gerrit.cloudera.org:8080/#/c/19699/4/tests/query_test/test_tpch_queries.py@41
PS4, Line 41: s
flake8: E501 line too long (96 > 90 characters)

--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 4
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Quanlong Huang
Gerrit-Comment-Date: Thu, 27 Apr 2023 07:53:06 +
Gerrit-HasComments: Yes
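For readers who want to reproduce the two most common findings in the comments above before uploading a patch set, here is a minimal sketch in Python. It is not the script the Jenkins job runs; the function name `check_lines` and its return shape are hypothetical, and the only value taken from the thread is the 90-character limit quoted in the E501 message.

```python
# Hypothetical local pre-check; NOT the actual Impala/Jenkins review script.
MAX_LINE_LENGTH = 90  # limit quoted in the E501 message ("96 > 90 characters")


def check_lines(text, max_len=MAX_LINE_LENGTH):
    """Return (line_number, message) pairs for trailing whitespace and long lines."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        # A line has trailing whitespace if stripping its right side changes it.
        if line != line.rstrip():
            findings.append((number, "line has trailing whitespace"))
        # Length is measured on the physical line without its newline.
        if len(line) > max_len:
            findings.append(
                (number, "line too long (%d > %d characters)" % (len(line), max_len))
            )
    return findings
```

Calling `check_lines(open(path).read())` on each touched file would list candidate lines to clean up; flake8 itself remains the authority for the Python-specific codes such as W504.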
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#4).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
27 files changed, 1,158 insertions(+), 45 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/4
--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 4
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Quanlong Huang
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 ) Change subject: IMPALA-10798: Prototype a simple JSON File reader .. Patch Set 3: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/12845/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 3 Gerrit-Owner: Anonymous Coward <18770832...@163.com> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Mon, 24 Apr 2023 12:05:18 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

Patch Set 3:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json-parser-test.cc
File be/src/exec/json-parser-test.cc:

http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json-parser-test.cc@109
PS3, Line 109:
line has trailing whitespace

http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.h
File be/src/exec/json/hdfs-json-scanner.h:

http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.h@125
PS3, Line 125: /// This is used to indicate whether an error has occurred in the currently parsed row.
line has trailing whitespace

http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.h@136
PS3, Line 136: /// JsonParse comment.
line has trailing whitespace

http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.h@140
PS3, Line 140: /// specific uses described in the JsonParse comment.
line has trailing whitespace

http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.cc@42
PS3, Line 42: scanner_state_(CREATED),
line has trailing whitespace

http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.cc@200
PS3, Line 200:
line has trailing whitespace

http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.cc@219
PS3, Line 219: // due to BreakParse().
line has trailing whitespace

http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.cc@231
PS3, Line 231: // the parser that eos has been reached.
line has trailing whitespace

http://gerrit.cloudera.org:8080/#/c/19699/3/tests/query_test/test_tpch_queries.py
File tests/query_test/test_tpch_queries.py:

http://gerrit.cloudera.org:8080/#/c/19699/3/tests/query_test/test_tpch_queries.py@41
PS3, Line 41: s
flake8: E501 line too long (96 > 90 characters)

--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 3
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Comment-Date: Mon, 24 Apr 2023 11:45:40 +
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader
18770832...@163.com has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_tpch_queries.py
26 files changed, 1,180 insertions(+), 42 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/3
--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 3
Gerrit-Owner: Anonymous Coward <18770832...@163.com>