[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-09-05 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..

IMPALA-10798: Initial support for reading JSON files

Prototype of HdfsJsonScanner, implemented on top of rapidjson, which
supports scanning data from splittable JSON files.

Scanning JSON data is handled by two parts working together. The first
part is the JsonParser, which parses JSON objects and is built on the
SAX-style API of rapidjson. It reads data from the char stream, parses
it, and calls the matching callback function whenever it encounters a
JSON element. See the comments of the JsonParser class for more details.
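
As a rough illustration of the callback pattern described above, here is
a minimal Python sketch. It is not the JsonParser from this change. The
names RowCollector and parse_stream are hypothetical, and because the
Python standard library has no SAX-style JSON reader, the sketch decodes
each top-level object first and then replays it to the handler as
events, which only approximates true streaming callbacks.

```python
import json


class RowCollector:
    """Hypothetical handler that turns parse events into rows."""

    def __init__(self):
        self.rows = []
        self._current = None

    def start_object(self):
        self._current = {}

    def key_value(self, key, value):
        self._current[key] = value

    def end_object(self):
        self.rows.append(self._current)
        self._current = None


def parse_stream(text, handler):
    """Decode consecutive top-level JSON objects and replay each as events."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        handler.start_object()
        for key, value in obj.items():
            handler.key_value(key, value)
        handler.end_object()


collector = RowCollector()
parse_stream('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n', collector)
print(collector.rows)
```

The parsing loop knows nothing about rows; the handler decides what to
do with each event, which is the same division of labor the commit
message describes between parser and scanner.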

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides callback functions for the JsonParser. The callback
functions supply data buffers to the parser, then convert the parser's
results and materialize them into a RowBatch. Note that the parser
returns numeric values to the scanner as strings. The scanner uses the
TextConverter class to convert the strings to the desired types, similar
to how the HdfsTextScanner works. This is an advantage over using the
number values provided by rapidjson directly, as it avoids
inconsistencies when converting decimals (e.g. losing precision).
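
The precision concern can be shown with a small Python sketch. It uses
Python's json and decimal modules as stand-ins (an assumption made for
illustration; the change itself uses rapidjson and TextConverter). The
parse_float hook hands the decoder's numeric text to the caller
unchanged, much like a parser that returns numbers as strings.

```python
import json
from decimal import Decimal

raw = '{"price": 0.1234567890123456789012345}'

# Default behaviour: the number goes through a binary double first.
as_float = json.loads(raw)["price"]
lossy = Decimal(repr(as_float))

# Keep the original digits as text and convert them directly.
as_text = json.loads(raw, parse_float=str)["price"]
exact = Decimal(as_text)

print(lossy)
print(exact)
```

Converting from the original text keeps every digit, while the value
that passed through a double has already been rounded.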

Added a startup flag, enable_json_scanner, so that this feature can be
disabled if we hit critical bugs in production.
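
A feature flag of this kind usually reduces to a single check before a
scan is planned. The Python sketch below shows only the general shape.
The function scan_allowed is hypothetical, and where and how the real
check happens in Impala is not shown here.

```python
def scan_allowed(file_format, enable_json_scanner):
    """Return False only for JSON tables when the flag is turned off."""
    return file_format.upper() != "JSON" or enable_json_scanner


print(scan_allowed("JSON", False))
print(scan_allowed("PARQUET", False))
```

Other file formats are unaffected by the flag in this sketch; only JSON
scans are gated.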

Limitations
 - Multiline JSON objects are not fully supported yet. They work when
   each file has only one scan range. However, when a file has multiple
   scan ranges, there is a small probability that multiline JSON objects
   spanning ScanRange boundaries are scanned incompletely (in such
   cases, parsing errors may be reported). For more details, refer to
   the comments in 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.
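
To see why multiline objects are awkward once a file is split, consider
the Python sketch below. It is a deliberately naive illustration and not
the logic used by HdfsJsonScanner. The function range_parses is
hypothetical, and it assumes a range that does not begin the file
resynchronizes at the next line break. The sample data is ASCII, so
character offsets equal byte offsets.

```python
import json


def range_parses(data, start):
    """Hypothetical naive scanner for a range beginning at offset `start`."""
    # A range that does not begin the file skips to just past the next newline.
    pos = 0 if start == 0 else data.index("\n", start) + 1
    try:
        json.JSONDecoder().raw_decode(data, pos)
        return True
    except json.JSONDecodeError:
        return False


single_line = '{"id": 1}\n{"id": 2}\n'
multi_line = '{"id": 1}\n{\n  "id": 2\n}\n'

print(range_parses(single_line, 3))
print(range_parses(multi_line, 12))
```

With one object per line, the position after a line break is always the
start of an object. With multiline objects it can fall in the middle of
one, so a line-break heuristic alone is not enough. The comments in
'multiline_json.test' describe what the actual scanner does.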

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py to test multiline,
   malformed, and overflow cases in JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Reviewed-on: http://gerrit.cloudera.org:8080/19699
Reviewed-by: Quanlong Huang 
Tested-by: Impala Public Jenkins 
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
A be/src/exec/json/json-parser-test.cc
A be/src/exec/json/json-parser.cc
A be/src/exec/json/json-parser.h
M be/src/exec/text-converter.inline.h
M be/src/util/backend-gflag-util.cc
M bin/rat_exclude_files.txt
M common/thrift/BackendGflags.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/load-dependent-tables.sql
A testdata/data/chars-formats.json
A testdata/data/json_test/complex.json
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
A testdata/data/json_test/overflow.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/DataErrorsTest/hdfs-json-scan-node-errors.test
A testdata/workloads/functional-query/queries/QueryTest/complex_json.test
A testdata/workloads/functional-query/queries/QueryTest/disable-json-scanner.test
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
A testdata/workloads/functional-query/queries/QueryTest/overflow_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/custom_cluster/test_disable_features.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_date_queries.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanne

[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-09-05 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 32: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 32
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Tue, 05 Sep 2023 16:55:39 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-09-05 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 32:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9669/ 
DRY_RUN=false


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 32
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Tue, 05 Sep 2023 12:28:14 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-09-05 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 32: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/9667/


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 32
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Tue, 05 Sep 2023 07:08:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-09-04 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 32: Code-Review+2

LGTM. The patch has a clean boundary with the existing code. Given the wide 
code coverage and the added feature flag to turn it off, it's safe for us to 
ship it.

Thanks for contributing this great feature, Zihao!


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 32
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Tue, 05 Sep 2023 02:48:46 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-09-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 32:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9667/ 
DRY_RUN=false


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 32
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Tue, 05 Sep 2023 02:49:42 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-09-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 32:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13912/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 32
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Tue, 05 Sep 2023 02:32:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-09-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 31:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/13911/ : Initial code 
review checks failed. See linked job for details on the failure.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 31
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Tue, 05 Sep 2023 02:13:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-09-04 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#32).

Change subject: IMPALA-10798: Initial support for reading JSON files
..

---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
A be/src/exec/json/json-parser-test.cc
A be/src/exec/json/json-parser.cc
A be/src/exec/json/json-parser.h
M be/src/exec/text-converter.inline.h
M be/src/util/backend-gflag-util.cc
M bin/rat_exclude_files.txt
M common/thrift/BackendGflags.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/load-dependent-tables.sql
A testdata/data/chars-formats.json
A testdata/data/json_test/complex.json
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
A testdata/data/json_test/overflow.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/DataErrorsTest/hdfs-json-scan-node-errors.test
A testdata/workloads/functional-query/queries/QueryTest/complex_json.test
A testdata/workloads/functional-query/queries/QueryTest/disable-json-scanner.test
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
A testdata/workloads/functional-query/queries/QueryTest/overflow_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/custom_cluster/test_disable_features.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_date_queries.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M 

[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-09-04 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#31).

Change subject: IMPALA-10798: Initial support for reading JSON files
..


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-09-04 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 30: Code-Review+2


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 30
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Tue, 05 Sep 2023 01:55:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-08-28 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 30: Code-Review+1


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 30
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Mon, 28 Aug 2023 09:30:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-08-25 Thread Zihao Ye (Code Review)
Zihao Ye has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 30:

Done. The startup flag enable_json_scanner has been added, along with a 
related test.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 30
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Fri, 25 Aug 2023 08:58:56 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-08-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 30:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13840/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 30
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Fri, 25 Aug 2023 02:55:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-08-24 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#30).

Change subject: IMPALA-10798: Initial support for reading JSON files
..


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-08-24 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 29: Code-Review+1

Overall LGTM.

Let's add a startup flag, enable_json_scanner (just like enable_orc_scanner 
when we first added the orc-scanner), to be able to disable this feature if we 
hit critical bugs in production.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 29
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Fri, 25 Aug 2023 00:02:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-08-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 29: Verified+1


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 29
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Thu, 24 Aug 2023 18:35:39 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-08-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 29:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9630/ 
DRY_RUN=true


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 29
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Thu, 24 Aug 2023 14:16:41 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-08-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 29:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13831/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 29
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Thu, 24 Aug 2023 11:36:02 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-08-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 28:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13830/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 28
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Thu, 24 Aug 2023 11:29:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-08-24 Thread Zihao Ye (Code Review)
Zihao Ye has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 29:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/19699/26//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/26//COMMIT_MSG@7
PS26, Line 7: Initial support for reading JSON fi
> Let's change the title to something like "Initial support for reading JSON 
Done


http://gerrit.cloudera.org:8080/#/c/19699/23/tests/data_errors/test_data_errors.py
File tests/data_errors/test_data_errors.py:

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/data_errors/test_data_errors.py@128
PS23, Line 128: self.run_test_case('DataErrorsTest/hdfs-scan-node-errors', vector)
> Can we add a similar test for json?
Done


http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_cancellation.py
File tests/query_test/test_cancellation.py:

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_cancellation.py@113
PS23, Line 113: 'text'
> Let's add json here
Done


http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_chars.py
File tests/query_test/test_chars.py:

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_chars.py@37
PS23, Line 37: ptions
> Let's test json here
Done


http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_chars.py@68
PS23, Line 68:
> Let's test json here as well
Done


http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_date_queries.py
File tests/query_test/test_date_queries.py:

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_date_queries.py@45
PS23, Line 45:
> Let's add json here. Please also update the above comment. DATE type is als
Done



--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 29
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Thu, 24 Aug 2023 11:10:37 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-08-24 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#29).

Change subject: IMPALA-10798: Initial support for reading JSON files
..

IMPALA-10798: Initial support for reading JSON files

Prototype of HdfsJsonScanner, implemented on top of rapidjson, which
supports scanning data from splittable JSON files.

JSON scanning is done by two parts working together. The first part is
the JsonParser, which parses JSON objects and is built on the SAX-style
API of rapidjson. It reads data from the char stream, parses it, and
invokes the matching callback whenever it encounters a JSON element.
See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides the callback functions for the JsonParser. The callbacks
supply data buffers to the parser and convert and materialize the
parser's results into a RowBatch. Note that the parser returns numeric
values to the scanner as strings. The scanner uses the TextConverter
class to convert the strings to the desired types, similar to how the
HdfsTextScanner works. This is an advantage over using the numeric
values provided by rapidjson directly, as it avoids inconsistencies
when converting decimals (e.g. losing precision).
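
The point about handing numbers to the scanner as strings can be
illustrated with a small sketch. This is only an analogy written with
Python's standard json module, not rapidjson or Impala's TextConverter;
the sample record and the CONVERTERS table are hypothetical.

```python
import json
from decimal import Decimal

# Hypothetical input row; the price has more digits than a double can hold.
RAW = '{"id": 7, "price": 12345678901234567890.123456789}'

# Default behaviour: the parser converts the number to a binary float,
# so digits are lost before the consumer ever sees the value.
as_float = json.loads(RAW)["price"]

# Keeping the number text verbatim leaves the conversion to the consumer.
as_text = json.loads(RAW, parse_float=str, parse_int=str)

# Hypothetical per-column converters, standing in for a type-aware
# conversion step that knows the declared column types.
CONVERTERS = {"id": int, "price": Decimal}
row = {col: CONVERTERS[col](text) for col, text in as_text.items()}

print(repr(as_float))  # rounded to double precision
print(row["price"])    # exact decimal text preserved
```

Because the consumer sees the original digits, a DECIMAL column can be
filled without an intermediate double.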

Limitations
 - Multiline JSON objects are not fully supported yet. They work when
   each file has only one scan range. However, when a file has multiple
   scan ranges, there is a small probability that multiline JSON objects
   spanning ScanRange boundaries are scanned incompletely (in such
   cases, parsing errors may be reported). For more details, please
   refer to the comments in 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.
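
To make the multiline limitation concrete, here is a simplified model of
range-based scanning. It is a sketch under stated assumptions, not
Impala's actual logic: scan_range is a hypothetical helper, offsets are
character indices into an in-memory string, and each range aligns itself
to the first line that begins at or after its start offset.

```python
import json

_DECODER = json.JSONDecoder()


def scan_range(data, start, length):
    """Return (rows, errors) for objects that begin in [start, start + length)."""
    end = min(start + length, len(data))
    pos = start
    if start > 0:
        # Align to the first line that begins at or after `start`; anything
        # before it belongs to the previous range.
        newline = data.find("\n", start - 1)
        if newline == -1:
            return [], 0
        pos = newline + 1
    rows, errors = [], 0
    while pos < end:
        if data[pos].isspace():
            pos += 1
            continue
        try:
            # An object may extend past `end`; the range that owns its first
            # character reads it to completion.
            obj, next_pos = _DECODER.raw_decode(data, pos)
            if not isinstance(obj, dict):
                raise ValueError("not a JSON object")
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            errors += 1
            newline = data.find("\n", pos)
            if newline == -1:
                break
            pos = newline + 1  # resynchronize at the next line
            continue
        rows.append(obj)
        pos = next_pos
    return rows, errors


MULTI = '{"id": 1}\n{\n  "id": 2\n}\n{"id": 3}\n'

# One range over the whole file: every object parses cleanly.
whole_rows, whole_errors = scan_range(MULTI, 0, len(MULTI))

# Two ranges whose boundary falls inside the multiline object.
rows_a, errors_a = scan_range(MULTI, 0, 12)
rows_b, errors_b = scan_range(MULTI, 12, len(MULTI) - 12)
```

In this model the single-range scan is clean, while the second of the two
ranges starts in the middle of the multiline object and reports parse
errors before it resynchronizes. The real scanner's behavior at range
boundaries may differ.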

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py to test multiline JSON,
   malformed JSON, and overflow handling.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
A be/src/exec/json/json-parser-test.cc
A be/src/exec/json/json-parser.cc
A be/src/exec/json/json-parser.h
M be/src/exec/text-converter.inline.h
M bin/rat_exclude_files.txt
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/load-dependent-tables.sql
A testdata/data/chars-formats.json
A testdata/data/json_test/complex.json
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
A testdata/data/json_test/overflow.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/DataErrorsTest/hdfs-json-scan-node-errors.test
A testdata/workloads/functional-query/queries/QueryTest/complex_json.test
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
A testdata/workloads/functional-query/queries/QueryTest/overflow_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_date_queries.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
50 files changed, 1,719 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/29
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Chan

[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-08-24 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#28).

Change subject: IMPALA-10798: Initial support for reading JSON files
..

IMPALA-10798: Initial support for reading JSON files

Prototype of HdfsJsonScanner, implemented on top of rapidjson, which
supports scanning data from splittable JSON files.

JSON scanning is done by two parts working together. The first part is
the JsonParser, which parses JSON objects and is built on the SAX-style
API of rapidjson. It reads data from the char stream, parses it, and
invokes the matching callback whenever it encounters a JSON element.
See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides the callback functions for the JsonParser. The callbacks
supply data buffers to the parser and convert and materialize the
parser's results into a RowBatch. Note that the parser returns numeric
values to the scanner as strings. The scanner uses the TextConverter
class to convert the strings to the desired types, similar to how the
HdfsTextScanner works. This is an advantage over using the numeric
values provided by rapidjson directly, as it avoids inconsistencies
when converting decimals (e.g. losing precision).

Limitations
 - Multiline JSON objects are not fully supported yet. They work when
   each file has only one scan range. However, when a file has multiple
   scan ranges, there is a small probability that multiline JSON objects
   spanning ScanRange boundaries are scanned incompletely (in such
   cases, parsing errors may be reported). For more details, please
   refer to the comments in 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py to test multiline JSON,
   malformed JSON, and overflow handling.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
A be/src/exec/json/json-parser-test.cc
A be/src/exec/json/json-parser.cc
A be/src/exec/json/json-parser.h
M be/src/exec/text-converter.inline.h
M bin/rat_exclude_files.txt
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/load-dependent-tables.sql
A testdata/data/chars-formats.json
A testdata/data/json_test/complex.json
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
A testdata/data/json_test/overflow.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/DataErrorsTest/hdfs-json-scan-node-errors.test
A testdata/workloads/functional-query/queries/QueryTest/complex_json.test
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
A testdata/workloads/functional-query/queries/QueryTest/overflow_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_date_queries.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
50 files changed, 1,716 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/28
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Chan

[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-08-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 28:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/19699/28/tests/data_errors/test_data_errors.py
File tests/data_errors/test_data_errors.py:

http://gerrit.cloudera.org:8080/#/c/19699/28/tests/data_errors/test_data_errors.py@162
PS28, Line 162: \
flake8: E502 the backslash is redundant between brackets


http://gerrit.cloudera.org:8080/#/c/19699/28/tests/query_test/test_chars.py
File tests/query_test/test_chars.py:

http://gerrit.cloudera.org:8080/#/c/19699/28/tests/query_test/test_chars.py@39
PS28, Line 39: a
flake8: W504 line break after binary operator


http://gerrit.cloudera.org:8080/#/c/19699/28/tests/query_test/test_chars.py@83
PS28, Line 83: a
flake8: W504 line break after binary operator
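
For reference, the two flake8 findings above usually come down to where a
line is broken. The sketch below shows one way to satisfy both checks; the
function and variable names are hypothetical and are not taken from the
patch.

```python
# Hypothetical names; only the line-breaking style matters here.
FORMATS = ['text', 'json']


def build_query(table, file_format):
    # E502: inside brackets a line continues implicitly, so a trailing
    # backslash adds nothing. flake8 would flag:
    #     return ("select * from {0}_{1}" \
    #             .format(table, file_format))
    return ("select * from {0}_{1}"
            .format(table, file_format))


def is_supported(file_format, compression):
    # W504: the operator is moved to the start of the continuation line
    # instead of being left at the end of the previous one.
    return (file_format in FORMATS
            and compression == 'none')
```

Note that flake8 also defines W503 for the opposite style (a break before
a binary operator), so which of the two is enforced depends on the
project's flake8 configuration.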



--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 28
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Thu, 24 Aug 2023 11:05:07 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-08-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Initial support for reading JSON files
..


Patch Set 27:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13819/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 27
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Wed, 23 Aug 2023 06:52:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Initial support for reading JSON files

2023-08-22 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#27).

Change subject: IMPALA-10798: Initial support for reading JSON files
..

IMPALA-10798: Initial support for reading JSON files

Prototype of HdfsJsonScanner, implemented on top of rapidjson, which
supports scanning data from splittable JSON files.

JSON scanning is done by two parts working together. The first part is
the JsonParser, which parses JSON objects and is built on the SAX-style
API of rapidjson. It reads data from the char stream, parses it, and
invokes the matching callback whenever it encounters a JSON element.
See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides the callback functions for the JsonParser. The callbacks
supply data buffers to the parser and convert and materialize the
parser's results into a RowBatch. Note that the parser returns numeric
values to the scanner as strings. The scanner uses the TextConverter
class to convert the strings to the desired types, similar to how the
HdfsTextScanner works. This is an advantage over using the numeric
values provided by rapidjson directly, as it avoids inconsistencies
when converting decimals (e.g. losing precision).

Limitations
 - Multiline JSON objects are not fully supported yet. They work when
   each file has only one scan range. However, when a file has multiple
   scan ranges, there is a small probability that multiline JSON objects
   spanning ScanRange boundaries are scanned incompletely (in such
   cases, parsing errors may be reported). For more details, please
   refer to the comments in 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py to test multiline JSON,
   malformed JSON, and overflow handling.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
A be/src/exec/json/json-parser-test.cc
A be/src/exec/json/json-parser.cc
A be/src/exec/json/json-parser.h
M be/src/exec/text-converter.inline.h
M bin/rat_exclude_files.txt
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
A testdata/data/json_test/complex.json
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
A testdata/data/json_test/overflow.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/QueryTest/complex_json.test
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
A testdata/workloads/functional-query/queries/QueryTest/overflow_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
42 files changed, 1,498 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/27
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 27
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye