Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#29).

Change subject: IMPALA-10798: Initial support for reading JSON files
......................................................................

IMPALA-10798: Initial support for reading JSON files

Add a prototype of HdfsJsonScanner, implemented on top of rapidjson,
which supports scanning data from splittable JSON files.

The scanning of JSON data is handled by two parts working together.
The first part is the JsonParser, which is responsible for parsing JSON
objects and is implemented on top of the SAX-style API of rapidjson. It
reads data from the char stream, parses it, and invokes the matching
callback function for each JSON element it encounters. See the comments
of the JsonParser class for more details.
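
The callback flow described above can be sketched without rapidjson.
The toy parser below is not the JsonParser from this patch: Handler,
EventLog, and ParseFlatObject are hypothetical names, and the parser
only accepts flat objects with unescaped string and number values. It
is meant only to show how a SAX-style parser pushes events to a handler
instead of building a document tree.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical handler interface, loosely modeled on SAX-style JSON handlers.
struct Handler {
  virtual ~Handler() = default;
  virtual void StartObject() = 0;
  virtual void Key(const std::string& name) = 0;
  virtual void String(const std::string& value) = 0;
  virtual void RawNumber(const std::string& text) = 0;  // number kept as text
  virtual void EndObject() = 0;
};

// Example handler that records the events it receives, in order.
struct EventLog : Handler {
  std::vector<std::string> events;
  void StartObject() override { events.push_back("start"); }
  void Key(const std::string& n) override { events.push_back("key:" + n); }
  void String(const std::string& v) override { events.push_back("str:" + v); }
  void RawNumber(const std::string& t) override { events.push_back("num:" + t); }
  void EndObject() override { events.push_back("end"); }
};

// Toy event-driven parser for ONE flat object. No escapes, nesting, booleans,
// or nulls, and number text is not validated. Returns false on malformed input.
bool ParseFlatObject(const std::string& in, Handler* h) {
  size_t i = 0;
  auto is_ws = [&](size_t p) {
    return std::isspace(static_cast<unsigned char>(in[p])) != 0;
  };
  auto skip_ws = [&] { while (i < in.size() && is_ws(i)) ++i; };
  auto read_quoted = [&](std::string* out) {
    if (i >= in.size() || in[i] != '"') return false;
    size_t close = in.find('"', i + 1);
    if (close == std::string::npos) return false;
    *out = in.substr(i + 1, close - i - 1);
    i = close + 1;
    return true;
  };

  skip_ws();
  if (i >= in.size() || in[i] != '{') return false;
  ++i;
  h->StartObject();
  skip_ws();
  if (i < in.size() && in[i] == '}') { h->EndObject(); return true; }

  while (true) {
    std::string key;
    skip_ws();
    if (!read_quoted(&key)) return false;
    h->Key(key);
    skip_ws();
    if (i >= in.size() || in[i] != ':') return false;
    ++i;
    skip_ws();
    if (i < in.size() && in[i] == '"') {
      std::string value;
      if (!read_quoted(&value)) return false;
      h->String(value);
    } else {
      // Anything else is treated as a number and passed through as text.
      size_t start = i;
      while (i < in.size() && in[i] != ',' && in[i] != '}' && !is_ws(i)) ++i;
      if (i == start) return false;
      h->RawNumber(in.substr(start, i - start));
    }
    skip_ws();
    if (i < in.size() && in[i] == ',') { ++i; continue; }
    if (i < in.size() && in[i] == '}') { h->EndObject(); return true; }
    return false;
  }
}
```

In this sketch the handler, not the parser, decides what to do with each
event, which mirrors the division of labor between a parser and a
scanner described above.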

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides callback functions for the JsonParser. The callback
functions are responsible for providing data buffers to the parser and
for converting and materializing the parser's results into a RowBatch.
Note that the parser returns numeric values to the scanner as strings.
The scanner uses the TextConverter class to convert the strings to the
desired types, similar to how the HdfsTextScanner works. This is an
advantage over using the numeric values provided by rapidjson directly,
as it eliminates concerns about inconsistencies in converting decimals
(e.g. losing precision).
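
As a general illustration of the precision concern (this is not code
from the patch, and RoundTripThroughDouble is a hypothetical helper),
the sketch below shows what happens when a 16-digit integer literal is
eagerly converted to a double before the target column type is known.
Keeping the literal as text avoids this lossy intermediate step.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Converts a numeric literal to double and formats it back without fraction
// digits, imitating a parser that eagerly turns every number into a double.
std::string RoundTripThroughDouble(const std::string& literal) {
  double d = std::strtod(literal.c_str(), nullptr);
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%.0f", d);
  return buf;
}

// 2^53 + 1 is the smallest positive integer a double cannot represent:
// RoundTripThroughDouble("9007199254740993") returns "9007199254740992",
// whereas the original text still holds the exact value.
```

A converter that works on the original text can parse the value
directly into the destination type and report overflow or malformed
input itself.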

Limitations
 - Multiline JSON objects are not fully supported yet. Files with a
   single scan range are read correctly. However, when a file has
   multiple scan ranges, there is a small probability that multiline
   JSON objects spanning ScanRange boundaries are scanned incompletely
   (in such cases, parsing errors may be reported). For more details,
   please refer to the comments in 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.
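
For intuition about the multiline limitation, here is a simplified
model of newline-based range splitting. It is an assumption for
illustration, not the logic in this patch: LinesInRange is a
hypothetical helper, and the rule that a range owns every line whose
first byte falls inside it is a common convention for splittable text
formats. The real scanner is more capable than this model (it handles
multiline objects within a single range); the authoritative description
is in the comments of 'multiline_json.test'.

```cpp
#include <string>
#include <vector>

// Returns the lines owned by the byte range [begin, end) of `data`.
// A line is owned by the range that contains its first byte, so a range that
// starts mid-line skips ahead to the next newline, and the last owned line
// may extend past `end`.
std::vector<std::string> LinesInRange(const std::string& data, size_t begin,
                                      size_t end) {
  std::vector<std::string> out;
  size_t pos = begin;
  // Skip the partial line unless the range starts exactly on a line boundary.
  if (pos > 0 && data[pos - 1] != '\n') {
    size_t nl = data.find('\n', pos);
    if (nl == std::string::npos) return out;
    pos = nl + 1;
  }
  while (pos < end && pos < data.size()) {
    size_t nl = data.find('\n', pos);
    if (nl == std::string::npos) nl = data.size();
    out.push_back(data.substr(pos, nl - pos));
    pos = nl + 1;
  }
  return out;
}
```

With one object per line, every object lands in exactly one range no
matter where the split falls. With an object spread over several lines,
a range that begins inside the object only sees a fragment such as "}",
which is the kind of input that can surface as a parsing error.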

Tests
 - Most of the existing end-to-end tests can now run on the JSON format.
 - Add TestQueriesJsonTables in test_queries.py to test multiline,
   malformed, and overflowing values in JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
A be/src/exec/json/json-parser-test.cc
A be/src/exec/json/json-parser.cc
A be/src/exec/json/json-parser.h
M be/src/exec/text-converter.inline.h
M bin/rat_exclude_files.txt
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/load-dependent-tables.sql
A testdata/data/chars-formats.json
A testdata/data/json_test/complex.json
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
A testdata/data/json_test/overflow.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/DataErrorsTest/hdfs-json-scan-node-errors.test
A testdata/workloads/functional-query/queries/QueryTest/complex_json.test
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
A testdata/workloads/functional-query/queries/QueryTest/overflow_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/data_errors/test_data_errors.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_cancellation.py
M tests/query_test/test_chars.py
M tests/query_test/test_date_queries.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
50 files changed, 1,719 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/29
--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 29
Gerrit-Owner: Zihao Ye <eyiz...@163.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Zihao Ye <eyiz...@163.com>
