[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-08-20 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 26:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/19699/26//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/26//COMMIT_MSG@7
PS26, Line 7: Prototype a simple JSON File reader
Let's change the title to something like "Initial support for reading JSON 
files"


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@37
PS22, Line 37:
> > It seems to be tightly coupled with hdfs-json-scanner so I think we
I took a look. The failed job is 
https://jenkins.impala.io/job/clang-tidy-ub2004/99/console
It fails while running bin/run_clang_tidy.sh. There are some errors in the output 
file: https://jenkins.impala.io/job/clang-tidy-ub2004/99/artifact/tidylog.txt

E.g. one of the errors:

/home/ubuntu/Impala/be/src/exec/json/json-parser.cc:233:16: error: explicit 
instantiation of 'impala::JsonParser' must occur in namespace 'impala'
template class JsonParser;
   ^



--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 26
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Mon, 21 Aug 2023 01:07:03 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-08-18 Thread Zihao Ye (Code Review)
Zihao Ye has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 22:

(1 comment)

Thank you once again for the code review. I have been a little busy lately, but 
I managed to find some time to move json-parser.h and separate 
the implementation into json-parser.cc. As for the remaining task of adding new 
test cases, I will finish it when I next have time.

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@37
PS22, Line 37:
> It seems to be tightly coupled with hdfs-json-scanner so I think we
 > can put it in /json unless it can be reused in other places.
 >
 > Moving the implementation code to json-parser.cc helps speed up
 > recompilation when the code changes. It also makes this
 > header file shorter and easier to read. You can keep some
 > short methods in the header file and just move large methods like
 > Parse().

Done. It compiles successfully in my own environment, but I'm not sure 
why it keeps failing to build here.



--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 22
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Fri, 18 Aug 2023 09:51:08 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-08-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 26:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/13774/ : Initial code 
review checks failed. See linked job for details on the failure.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 26
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Fri, 18 Aug 2023 09:40:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-08-18 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#26).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner implemented based on rapidjson, which
supports scanning data from splitting json files.

The scanning of JSON data is mainly completed by two parts working
together. The first part is the JsonParser responsible for parsing the
JSON object, which is implemented based on the SAX-style API of
rapidjson. It reads data from the char stream, parses it, and calls the
corresponding callback function when encountering the corresponding JSON
element. See the comments of the JsonParser class for more details.
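
To make the callback flow described above concrete, here is a small Python
sketch. It is not the C++ code in this patch: RowHandler and emit_events are
hypothetical names, and the events are emulated by walking an already-parsed
object instead of streaming from a char stream the way a SAX-style reader
does. It only shows the division of labour between a parser that fires events
and a handler that materializes rows.

```python
import json


class RowHandler:
    """Hypothetical handler: collects one dict per top-level JSON object."""

    def __init__(self):
        self.rows = []
        self._current = None
        self._key = None

    def start_object(self):
        self._current = {}

    def key(self, name):
        self._key = name

    def value(self, raw):
        # The handler, not the parser, decides how to convert the raw value.
        self._current[self._key] = raw

    def end_object(self):
        self.rows.append(self._current)
        self._current = None


def emit_events(line, handler):
    """Fire handler callbacks for one flat JSON object given as text."""
    # Numbers are handed over as their original text (assumption: flat objects).
    obj = json.loads(line, parse_int=str, parse_float=str)
    handler.start_object()
    for name, raw in obj.items():
        handler.key(name)
        handler.value(raw)
    handler.end_object()


handler = RowHandler()
for line in ['{"id": 1, "name": "a"}', '{"id": 2, "name": "b"}']:
    emit_events(line, handler)
```

After the loop, handler.rows holds one dict per input object, with the numeric
fields still in string form for the handler to convert.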

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides callback functions for the JsonParser. The callback
functions are responsible for providing data buffers to the Parser and
converting and materializing the Parser's parsing results into RowBatch.
It should be noted that the parser returns numeric values as strings to
the scanner. The scanner uses the TextConverter class to convert the
strings to the desired types, similar to how the HdfsTextScanner works.
This is an advantage compared to using number value provided by
rapidjson directly, as it eliminates concerns about inconsistencies in
converting decimals (e.g. losing precision).
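
The precision concern can be illustrated outside Impala with Python's standard
json module. This is only an analogy, not the TextConverter code path: the
parse_int and parse_float hooks keep each numeric token as text, so the caller
can convert it to the target type itself. The sample row is made up.

```python
import json
from decimal import Decimal

row = '{"id": 7, "price": 1234567890.12345678901234567890}'

# Default parsing converts the literal to a binary double first.
as_double = json.loads(row)["price"]

# Keeping the token as a string defers conversion to the caller,
# which can then build an exact decimal from the original digits.
as_text = json.loads(row, parse_int=str, parse_float=str)
exact = Decimal(as_text["price"])

# The double cannot carry all the digits of the original literal.
lossy = Decimal(repr(as_double))
```

Comparing lossy with exact shows that the two values differ, which is the kind
of inconsistency that string-based conversion avoids.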

Limitations
 - Multiline json objects are not fully supported yet. It is ok when
   each file has only one scan range. However, when a file has multiple
   scan ranges, there is a small probability of incomplete scanning of
   multiline JSON objects that span ScanRange boundaries (in such cases,
   parsing errors may be reported). For more details, please refer to
   the comments in the 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.
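
The multiline limitation can be pictured with a toy model in Python. This is
an assumption-laden sketch, not the scanner's actual algorithm: scan_range and
scan_file are hypothetical helpers, each range starts parsing at the first
line boundary it owns, and a reader may run past the end of its range to
finish an object it has started.

```python
import json


def scan_range(text, start, end):
    """Parse objects that begin in [start, end); return (rows, error_count)."""
    decoder = json.JSONDecoder()
    limit = min(end, len(text))
    pos = start
    if start > 0:
        # Skip the tail of a line that began in the previous range.
        nl = text.find("\n", start - 1)
        if nl == -1:
            return [], 0
        pos = nl + 1
    rows, errors = [], 0
    while pos < limit:
        while pos < limit and text[pos].isspace():
            pos += 1
        if pos >= limit:
            break
        try:
            obj, nxt = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            obj, nxt = None, pos
        if isinstance(obj, dict):
            rows.append(obj)
            pos = nxt
        else:
            # Not a complete object: count an error and resync at the next line.
            errors += 1
            nl = text.find("\n", pos)
            if nl == -1:
                break
            pos = nl + 1
    return rows, errors


def scan_file(text, range_size):
    """Scan the whole text as consecutive ranges of range_size characters."""
    rows, errors = [], 0
    for start in range(0, len(text), range_size):
        r, e = scan_range(text, start, start + range_size)
        rows.extend(r)
        errors += e
    return rows, errors


multiline = '{"id": 1}\n{\n"id": 2\n}\n{"id": 3}\n'
one_range = scan_file(multiline, len(multiline))
two_ranges = scan_file(multiline, 16)
```

In this model a single range parses all three objects cleanly, while two
ranges of 16 characters make the second reader start inside the multiline
object and report one spurious parse error.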

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py for testing multiline,
   malformed, and overflow in JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
A be/src/exec/json/json-parser-test.cc
A be/src/exec/json/json-parser.cc
A be/src/exec/json/json-parser.h
M be/src/exec/text-converter.inline.h
M bin/rat_exclude_files.txt
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
A testdata/data/json_test/complex.json
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
A testdata/data/json_test/overflow.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/QueryTest/complex_json.test
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
A testdata/workloads/functional-query/queries/QueryTest/overflow_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
42 files changed, 1,498 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/26
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 26
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-08-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 24:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/13773/ : Initial code 
review checks failed. See linked job for details on the failure.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 24
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Fri, 18 Aug 2023 08:39:56 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-08-18 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#24).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/24
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 24
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-08-14 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 23:

(7 comments)

The patch is in good shape. I ran the following command to go through all tests 
on text format:

 git grep file_format tests | grep "'text'"

Identified some tests that would be good to add for json.

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@37
PS22, Line 37:
> > Can we move this file into be/src/exec/json if it's only used by
It seems to be tightly coupled with hdfs-json-scanner so I think we can put it 
in /json unless it can be reused in other places.

Moving the implementation code to json-parser.cc helps speed up 
recompilation when the code changes. It also makes this header file 
shorter and easier to read. You can keep some short methods in the 
header file and just move large methods like Parse().


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@300
PS22, Line 300:   // TODO: Support Invoke CodeGend WriteSlot
> > A follow-up question for https://gerrit.cloudera.org/c/19699/18/be/src/ex
I checked other scanners and they have the same behavior. I think it's ok to 
ignore failures and just set nulls on the slot as long as we can report the mem 
issue in other places, e.g. report mem issue when we want to read more data. 
Currently, we have 'buffer_status_' that can report this. So I think the 
current implementation is ok.


http://gerrit.cloudera.org:8080/#/c/19699/23/tests/data_errors/test_data_errors.py
File tests/data_errors/test_data_errors.py:

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/data_errors/test_data_errors.py@128
PS23, Line 128: self.run_test_case('DataErrorsTest/hdfs-scan-node-errors', 
vector)
Can we add a similar test for json?


http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_cancellation.py
File tests/query_test/test_cancellation.py:

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_cancellation.py@113
PS23, Line 113: 'text'
Let's add json here


http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_chars.py
File tests/query_test/test_chars.py:

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_chars.py@37
PS23, Line 37: 'text'
Let's test json here


http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_chars.py@68
PS23, Line 68: 'text'
Let's test json here as well


http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_date_queries.py
File tests/query_test/test_date_queries.py:

http://gerrit.cloudera.org:8080/#/c/19699/23/tests/query_test/test_date_queries.py@45
PS23, Line 45: 'text'
Let's add json here. Please also update the above comment. DATE type is also 
supported in orc and json?



--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 23
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Tue, 15 Aug 2023 01:08:55 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-08-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 23: Verified+1


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 23
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Mon, 14 Aug 2023 12:18:37 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-08-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 23:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9595/ 
DRY_RUN=true


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 23
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Mon, 14 Aug 2023 07:55:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-08-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 23: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/9594/


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 23
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Mon, 14 Aug 2023 06:23:16 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-08-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 23:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9594/ 
DRY_RUN=true


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 23
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Mon, 14 Aug 2023 06:16:38 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-25 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 23:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13631/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 23
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Tue, 25 Jul 2023 13:00:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-25 Thread Zihao Ye (Code Review)
Zihao Ye has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 23:

(8 comments)

Thanks for the review!
I manually tested reading a JSON file with some large strings under a low 
mem_limit, and got the following error:

ERROR: Memory limit exceeded: Failed to allocate row batch
EXCHANGE_NODE (id=1) could not allocate 8.00 KB without exceeding limit.
Error occurred on backend 4620ee0acf3b:27000
Memory left in process limit: 11.41 GB
Memory left in query limit: -125.32 MB
Query(264ddc1e2c291b18:f1bec1c5): memory limit exceeded. Limit=100.00 
MB Reservation=12.00 MB ReservationLimit=68.00 MB OtherMemory=213.32 MB 
Total=225.32 MB Peak=225.33 MB
Unclaimed reservations: Reservation=4.00 MB OtherMemory=0 Total=4.00 MB 
Peak=12.00 MB
Fragment 264ddc1e2c291b18:f1bec1c50001: Reservation=8.00 MB 
OtherMemory=213.31 MB Total=221.31 MB Peak=221.32 MB
HDFS_SCAN_NODE (id=0): Reservation=8.00 MB OtherMemory=8.00 KB Total=8.01 MB 
Peak=95.03 MB
Queued Batches: Total=8.00 KB Peak=71.03 MB
KrpcDataStreamSender (dst_id=1): Total=142.28 MB Peak=142.29 MB
RowBatchSerialization: Total=142.28 MB Peak=142.28 MB
Fragment 264ddc1e2c291b18:f1bec1c5: Reservation=0 OtherMemory=8.00 KB 
Total=8.00 KB Peak=8.00 KB
EXCHANGE_NODE (id=1): Reservation=0 OtherMemory=0 Total=0 Peak=0
KrpcDeferredRpcs: Total=0 Peak=0
PLAN_ROOT_SINK: Total=0 Peak=0
CodeGen: Total=0 Peak=0
CodeGen: Total=0 Peak=0

It seems that the EXCHANGE_NODE is unable to allocate memory, rather than the 
HDFS_SCAN_NODE. Is this expected?
Patch set 23 also fixes a bug where scanning complex types would hit a 
DCHECK (thanks to test_scanners_fuzz.py for finding it), so a related test, 
complex_json.test, was added.

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@37
PS22, Line 37:
> Can we move this file into be/src/exec/json if it's only used by
 > hdfs-json-scanner?
 > Also do you plan to move the implementation codes into a
 > json-parser.cc file?

Of course, it can be moved to /json. I initially placed it outside mainly to 
mirror the location of delimited-text-parser.h/.cc. If you think it's 
necessary, I can put it in /json later.
I had not thought about moving the implementation code into json-parser.cc. 
Please let me know if you think it's necessary.


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@59
PS22, Line 59: /// must succeed. Functions with bool return type return true on 
succeed, and return false
> nit: "The following functions materialize output tuples. Functions with voi
Done


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@213
PS22, Line 213: f
> nit: "been"
Done


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.h
File be/src/exec/json/hdfs-json-scanner.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.h@47
PS22, Line 47: /// exactly one scanner.
> Let's also mention the error handling, i.e. how different kinds of errors a
Done


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@188
PS22, Line 188: string data_view = string(data, std::min(len, 
max_view_len));
> It'd be more helpful to print the column name and table name, e.g.
Done. We can now see the specific column where the error occurred and the 
corresponding data, truncated to a length limit.


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@211
PS22, Line 211: stream_->filename(), stream_->scan_range()->offset() + 
offset,
> It seems we should return a non-ok status for unrecoverable
 > situations, e.g. running out of memory.

Yes, that would be better, but how can we detect that an unrecoverable error 
has occurred here? Maybe state_->CheckQueryState()?


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@300
PS22, Line 300:   // TODO: Support Invoke CodeGend WriteSlot
> A follow-up question for 
> https://gerrit.cloudera.org/c/19699/18/be/src/exec/json-parser.h#303
 >
 > This can return false in copying strings when we run out of memory:
 > https://github.com/apache/impala/blob/af3f56e6d1605a56f7bd02b0af35be980a7e4c63/be/src/exec/text-converter.inline.h#L96
 >
 > It seems we will return true in HandleConvertError() and let
 > RapidJSON continue parsing. Can we stop it and report the mem
 > issue? Or did I miss something?

Yes, it will continue parsing. Similar to the issue in HandleError(), how can 
we determine if memory has run out here? Relying solely on the return value of 
WriteSlot is not sufficient, I also tried 

[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-25 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#23).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M be/src/exec/text-converter.inline.h
M bin/rat_exclude_files.txt
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
A testdata/data/json_test/complex.json
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
A testdata/data/json_test/overflow.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/QueryTest/complex_json.test
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
A testdata/workloads/functional-query/queries/QueryTest/overflow_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
41 files changed, 1,442 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/23
--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 23
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-24 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 22:

(9 comments)

The patch is looking good! I mostly looked into the error handling in this round.

Not sure if we have tests about exceeding mem_limit. If not, we can add a test 
to write a json file with huge strings and read it in a query with a low 
mem_limit that will be exceeded.

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@37
PS22, Line 37:
Can we move this file into be/src/exec/json if it's only used by 
hdfs-json-scanner?
Also do you plan to move the implementation code into a json-parser.cc file?


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@59
PS22, Line 59: /// means return true on success.
nit: "The following functions materialize output tuples. Functions with void 
return type must succeed. Functions with bool return type return true on 
success, and return false to stop parsing the whole scan range."


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@213
PS22, Line 213: be
nit: "been"


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.h
File be/src/exec/json/hdfs-json-scanner.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.h@47
PS22, Line 47: /// exactly one scanner.
Let's also mention the error handling, i.e. how different kinds of errors are 
handled.


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@188
PS22, Line 188:<< desc->col_pos() - scan_node_->num_partition_keys()
It'd be more helpful to print the column name and table name, e.g.
Error converting column 'key1' of table my_tbl to bigint


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@211
PS22, Line 211:   return Status::OK();
It seems we should return a non-ok status for unrecoverable situations, e.g. 
running out of memory.


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@300
PS22, Line 300:   if (LIKELY(text_converter_->WriteSlot(slot_desc, _type, 
tuple_, data, len, true,
A follow-up question for 
https://gerrit.cloudera.org/c/19699/18/be/src/exec/json-parser.h#303

This can return false when copying strings if we run out of memory:
https://github.com/apache/impala/blob/af3f56e6d1605a56f7bd02b0af35be980a7e4c63/be/src/exec/text-converter.inline.h#L96

It seems we will return true in HandleConvertError() and let RapidJSON continue 
parsing. Can we stop it and report the mem issue? Or did I miss something?


http://gerrit.cloudera.org:8080/#/c/19699/22/testdata/data/json_test/malformed.json
File testdata/data/json_test/malformed.json:

http://gerrit.cloudera.org:8080/#/c/19699/22/testdata/data/json_test/malformed.json@2
PS22, Line 2: {"bool_col":False,"int_col":1,"float_col":0.1,"string_col":abc123}
Can we also add these cases?

{ }
[ ]
( )
{"string_col":"abc123"}
["string_col", "abc123"]


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json
File testdata/data/json_test/multiline.json:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@4
PS21, Line 4: 1234
: 567
> > Just curious, what the behavior if this is parsed as a numeric
Ack



--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 22
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Mon, 24 Jul 2023 11:32:52 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-21 Thread Zihao Ye (Code Review)
Zihao Ye has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 21:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG@22
PS21, Line 22: converting and materializing the Parser's parsing results into 
RowBatch.
> Could you mention that numeric values are parsed from strings using the sam
Done


http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG@25
PS21, Line 25: ,
> nit: period "."
Done


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/hdfs-scanner.cc@831
PS21, Line 831: if (scan_node_->skip_header_line_count() > 1) {
> I think don't need this for JSON tables. We can make sure 'skip_header_line
Done. This part is indeed unnecessary, so I have copied the rest of the code 
into HandleConvertError directly.


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h@188
PS21, Line 188: kParseStopWhenDoneFlag
> Does this mean only parsing the first json object in each line and
 > skip the others (if there are)?

Not quite. Without this flag, the parser checks whether the stream has ended 
after parsing an object and, if it hasn't, reports the error 
"kParseErrorDocumentRootNotSingular". The flag skips this check, so the parser 
can parse multiple objects in one stream without reporting that error.
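
The effect of that end-of-stream check can be illustrated with a toy model. 
This is a sketch only and not rapidjson code: ToyResult, ParseOneObject and 
CountObjects are hypothetical names, and the "parser" merely balances braces 
and skips string literals. It shows why a stream holding several objects fails 
when the check is on and succeeds when the check is skipped.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result codes for this toy model (not rapidjson's error codes).
enum class ToyResult { kOk, kRootNotSingular, kInvalid };

// Consumes one brace-balanced object starting at *pos. When 'stop_when_done'
// is false, anything other than whitespace after the object is an error,
// mimicking the "document root must be singular" check described above.
ToyResult ParseOneObject(const std::string& s, std::size_t* pos,
                         bool stop_when_done) {
  std::size_t i = *pos;
  while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i]))) ++i;
  if (i >= s.size() || s[i] != '{') return ToyResult::kInvalid;
  int depth = 0;
  bool in_string = false;
  bool closed = false;
  for (; i < s.size(); ++i) {
    char c = s[i];
    if (in_string) {
      if (c == '\\') {
        ++i;  // skip the escaped character
      } else if (c == '"') {
        in_string = false;
      }
    } else if (c == '"') {
      in_string = true;
    } else if (c == '{') {
      ++depth;
    } else if (c == '}' && --depth == 0) {
      ++i;  // step past the closing brace
      closed = true;
      break;
    }
  }
  if (!closed) return ToyResult::kInvalid;
  *pos = i;
  if (!stop_when_done) {
    while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i]))) ++i;
    if (i < s.size()) return ToyResult::kRootNotSingular;
  }
  return ToyResult::kOk;
}

// Repeatedly parses objects until the stream is exhausted. Returns the number
// of objects parsed, or -1 on the first error.
int CountObjects(const std::string& s, bool stop_when_done) {
  std::size_t pos = 0;
  int count = 0;
  while (true) {
    while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos]))) {
      ++pos;
    }
    if (pos >= s.size()) return count;
    if (ParseOneObject(s, &pos, stop_when_done) != ToyResult::kOk) return -1;
    ++count;
  }
}
```

With the check skipped, CountObjects can walk a stream that holds one object 
per line, or several objects on one line. With the check enabled, the first 
call already fails because more input follows the first object.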


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h@189
PS21, Line 189:   reader_.Parse(stream_, *this);
> Could you add a comment above this for the parsing mechanism? E.g.
Done


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@303
PS18, Line 303:   current_field_idx_ = -1;
> What's the behavior if we return false here? Will RapidJSON skip
 > the whole json object or just skip this value?

This part returns false only when HandleConvertError returns false, which 
requires abort_on_error to be true. In that case the query is aborted. If 
abort_on_error is false, this part never returns false: even if the slot has a 
conversion error, we just set it to null and return true. See the 
HandleConvertError function.
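
The policy described above can be sketched as follows. This is a simplified 
model under stated assumptions, not the code under review: Slot, ToyScanner 
and AddBigint are hypothetical stand-ins, and only the relationship between 
abort_on_error and the callback's return value is taken from the reply.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-in for a tuple slot.
struct Slot {
  bool is_null = false;
  long long value = 0;
};

// Hypothetical stand-in for the scanner side of the parser callbacks.
struct ToyScanner {
  bool abort_on_error = false;
  std::vector<std::string> errors;

  // Records the error. Returns false (stop parsing) only when abort_on_error
  // is set; otherwise nulls the slot and returns true so parsing continues.
  bool HandleConvertError(const std::string& column, Slot* slot) {
    errors.push_back("Error converting column '" + column + "' to bigint");
    if (abort_on_error) return false;
    slot->is_null = true;
    return true;
  }

  // Converts 'text' to a bigint. The return value is what the parser callback
  // would propagate: true to continue, false to stop.
  bool AddBigint(const std::string& column, const std::string& text,
                 Slot* slot) {
    errno = 0;
    char* end = nullptr;
    long long v = std::strtoll(text.c_str(), &end, 10);
    bool ok = !text.empty() && end == text.c_str() + text.size() &&
              errno != ERANGE;
    if (!ok) return HandleConvertError(column, slot);
    slot->value = v;
    slot->is_null = false;
    return true;
  }
};
```

In this model a bad value with abort_on_error unset yields a null slot, a 
recorded error, and a true return, while the same value with abort_on_error 
set yields a false return. Failures other than conversion errors are not 
modeled here.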


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.h
File be/src/exec/json/hdfs-json-scanner.h:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.h@39
PS21, Line 39: This
> nit: "this"
Done


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.cc@288
PS21, Line 288:   if (LIKELY(text_converter_->WriteSlot(slot_desc, _type, 
tuple_, data, len, true,
  :   false, current_pool_))) return true;
> nit: please write this in multi-lines (with brackets)
Done


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/malformed.json
File testdata/data/json_test/malformed.json:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/malformed.json@1
PS21, Line 1: 
{"bool_col":true,"int_col":0,"float_col":"abc","string_col":"abc123"}
> Could you add a line that misses the right bracket, and one more line with
Done


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json
File testdata/data/json_test/multiline.json:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@1
PS21, Line 1: {"Id": 1, "Key": "normal object", "Value": "abcdefg"}
> Could you add a line of two json objects?
Done


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@4
PS21, Line 4: 1234
: 567
> Just curious, what the behavior if this is parsed as a numeric
 > column, e.g. bigint?

We will get a bigint 1234 and two reported errors: one is Missing '}' after 
'1234', the other is Invalid value '567}'. This is because the parser treats a 
number as ended when it encounters a newline character.


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/datasets/functional/functional_schema_template.sql@1388
PS21, Line 1388: LOAD DATA LOCAL INPATH 
'{impala_home}/testdata/data/overflow.txt' OVERWRITE INTO TABLE 
{db_name}{db_suffix}.{table_name};
> Can we add a similar table for json and test the overflow 

[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 22:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13609/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 22
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Fri, 21 Jul 2023 06:43:19 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-21 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#22).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner implemented based on rapidjson, which
supports scanning data from splittable JSON files.

The scanning of JSON data is mainly completed by two parts working
together. The first part is the JsonParser responsible for parsing the
JSON object, which is implemented based on the SAX-style API of
rapidjson. It reads data from the char stream, parses it, and calls the
corresponding callback function when encountering the corresponding JSON
element. See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides callback functions for the JsonParser. The callback
functions are responsible for providing data buffers to the Parser and
converting and materializing the Parser's parsing results into RowBatch.
It should be noted that the parser returns numeric values as strings to
the scanner. The scanner uses the TextConverter class to convert the
strings to the desired types, similar to how the HdfsTextScanner works.
This is an advantage over using the numeric values provided by
rapidjson directly, as it eliminates concerns about inconsistencies in
converting decimals (e.g. losing precision).

Limitations
 - Multiline json objects are not fully supported yet. It is ok when
   each file has only one scan range. However, when a file has multiple
   scan ranges, there is a small probability of incomplete scanning of
   multiline JSON objects that span ScanRange boundaries (in such cases,
   parsing errors may be reported). For more details, please refer to
   the comments in the 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py for testing multiline,
   malformed, and overflow in JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M be/src/exec/text-converter.inline.h
M bin/rat_exclude_files.txt
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
A testdata/data/json_test/overflow.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
A testdata/workloads/functional-query/queries/QueryTest/overflow_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
39 files changed, 1,377 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/22
--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 22
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-19 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 21:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG@22
PS21, Line 22: converting and materializing the Parser's parsing results into 
RowBatch.
Could you mention that numeric values are parsed from strings using the same 
functionality as the text scanners? This is an advantage over using RapidJSON's 
numeric values directly, so we don't need to worry about inconsistency in 
converting decimals (e.g. losing precision).
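
A small illustration of the precision concern, as a sketch under stated 
assumptions: RoundTripViaDouble and ParseUnscaled are hypothetical helpers 
written for this example and are not Impala's TextConverter or its decimal 
parsing. The first mimics receiving an already-converted double; the second 
parses the digits straight from the text into an unscaled integer.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <string>

// Round-trips a numeric literal through a double, roughly what happens when a
// parser hands back an already-converted floating-point number.
std::string RoundTripViaDouble(const std::string& literal) {
  double d = std::strtod(literal.c_str(), nullptr);
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%.17g", d);
  return std::string(buf);
}

// Parses "digits[.digits]" directly from the text into an unscaled integer
// with 'scale' fractional digits. Toy code: no sign, rounding, or overflow
// handling; literals with more than 'scale' fractional digits are rejected.
bool ParseUnscaled(const std::string& literal, int scale, int64_t* out) {
  int64_t value = 0;
  int frac_digits = 0;
  bool seen_dot = false;
  bool seen_digit = false;
  for (char c : literal) {
    if (c == '.') {
      if (seen_dot) return false;
      seen_dot = true;
    } else if (c >= '0' && c <= '9') {
      if (seen_dot && frac_digits == scale) return false;
      value = value * 10 + (c - '0');
      if (seen_dot) ++frac_digits;
      seen_digit = true;
    } else {
      return false;
    }
  }
  if (!seen_digit) return false;
  for (; frac_digits < scale; ++frac_digits) value *= 10;
  *out = value;
  return true;
}
```

A literal such as 1234567890.123456789 has 19 significant digits, more than a 
double can represent exactly, so the round trip does not reproduce it. Parsing 
from the string keeps every digit.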


http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG@25
PS21, Line 25: ,
nit: period "."


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/hdfs-scanner.cc@831
PS21, Line 831: if (scan_node_->skip_header_line_count() > 1) {
I think we don't need this for JSON tables. We can make sure 
'skip_header_line_count' is always 0 for JSON scan nodes. It comes from FE:
https://github.com/apache/impala/blob/97e44c11923f3d28e08aba1b5dd66b8a35465deb/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java#L290


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h@188
PS21, Line 188: kParseStopWhenDoneFlag
Does this mean we only parse the first JSON object in each line and skip the 
others (if there are any)?


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h@189
PS21, Line 189:   reader_.Parse(stream_, *this);
Could you add a comment above this for the parsing mechanism? E.g.

 Reads characters from the stream, and publishes events to this handler 
(JsonParser).


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@169
PS18, Line 169: while (!stream_.Eos()) {
> > What about kParseInsituFlag? Can we use it in our scenario?
I see. Thanks for the explanation!


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@303
PS18, Line 303:   current_field_idx_ = -1;
What's the behavior if we return false here? Will RapidJSON skip the whole json 
object or just skip this value?


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.h
File be/src/exec/json/hdfs-json-scanner.h:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.h@39
PS21, Line 39: This
nit: "this"


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.cc@288
PS21, Line 288:   if (LIKELY(text_converter_->WriteSlot(slot_desc, _type, 
tuple_, data, len, true,
  :   false, current_pool_))) return true;
nit: please write this in multi-lines (with brackets)


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/malformed.json
File testdata/data/json_test/malformed.json:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/malformed.json@1
PS21, Line 1: 
{"bool_col":true,"int_col":0,"float_col":"abc","string_col":"abc123"}
Could you add a line that misses the right bracket, and one more line with 
duplicated keys?


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json
File testdata/data/json_test/multiline.json:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@1
PS21, Line 1: {"Id": 1, "Key": "normal object", "Value": "abcdefg"}
Could you add a line of two json objects?


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@4
PS21, Line 4: 1234
: 567
Just curious, what is the behavior if this is parsed as a numeric column, e.g. 
bigint?


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/datasets/functional/functional_schema_template.sql@1388
PS21, Line 1388: LOAD DATA LOCAL INPATH 
'{impala_home}/testdata/data/overflow.txt' OVERWRITE INTO TABLE 
{db_name}{db_suffix}.{table_name};
Can we add a similar table for json and test the overflow behaviors?



--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 21
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala 

[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-19 Thread Zihao Ye (Code Review)
Zihao Ye has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 21:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/19699/18//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/18//COMMIT_MSG@23
PS18, Line 23:
> Could you add a section for the current limitations? E.g.
Done


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@45
PS18, Line 45: 'begin' and '
> nit: add quotes on var names, i.e. 'begin' and 'end'
Done


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@49
PS18, Line 49: func
> nit: remove "some"
Done


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@68
PS18, Line 68: ///   bool AddNumber(int index, const char* str, uint32_t len);
> nit: it'd be helpful to give a doc link here. E.g.
Done


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@143
PS18, Line 143: current_field_idx_ = -1;
> This is not that readable. Could you add some comments and break them into
Done


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@169
PS18, Line 169: while (!stream_.Eos()) {
> What about kParseInsituFlag? Can we use it in our scenario?

Unfortunately, we cannot use kParseInsituFlag here because it requires our char 
stream to support both input and output. Specifically, it needs to write data 
back to an earlier position after reading some data. However, our char stream 
gets its buffer in chunks (see GetNextBuffer), so it is difficult to go back to 
an old buffer to write data.


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@230
PS18, Line 230:   inline void GetNextBuffer(const char** begin, const char** 
end) {
> nit: This is a bit confusing. It might be better to use DCHECK_EQ(current_f
Done


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@242
PS18, Line 242:   /// 2. Call Key() upon encountering a key to find its index 
of the row in the schema and
> nit: could you add a comment before each methods mentioning they are interf
Done


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json/hdfs-json-scanner.cc@114
PS18, Line 114: the previous scan ran
> nit: "the previous scan range in the same file" might be more clear
Done


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json/hdfs-json-scanner.cc@160
PS18, Line 160: if
> nit: add a space after "if"
Done



--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 21
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Wed, 19 Jul 2023 08:32:49 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 21:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13578/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 21
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Wed, 19 Jul 2023 08:30:48 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-19 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#21).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner implemented based on rapidjson, which
supports scanning data from splittable JSON files.

The scanning of JSON data is mainly completed by two parts working
together. The first part is the JsonParser responsible for parsing the
JSON object, which is implemented based on the SAX-style API of
rapidjson. It reads data from the char stream, parses it, and calls the
corresponding callback function when encountering the corresponding JSON
element. See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides callback functions for the JsonParser. The callback
functions are responsible for providing data buffers to the Parser and
converting and materializing the Parser's parsing results into RowBatch.

Limitations
 - Multiline json objects are not fully supported yet, It is ok when
   each file has only one scan range. However, when a file has multiple
   scan ranges, there is a small probability of incomplete scanning of
   multiline JSON objects that span ScanRange boundaries (in such cases,
   parsing errors may be reported). For more details, please refer to
   the comments in the 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py for testing multiline
   and malformed JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M be/src/exec/text-converter.inline.h
M bin/rat_exclude_files.txt
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
37 files changed, 1,273 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/21
--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 21
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 20:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/13577/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 20
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Wed, 19 Jul 2023 07:55:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-19 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#20).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner implemented based on rapidjson, which
supports scanning data from splittable JSON files.

The scanning of JSON data is mainly completed by two parts working
together. The first part is the JsonParser responsible for parsing the
JSON object, which is implemented based on the SAX-style API of
rapidjson. It reads data from the char stream, parses it, and calls the
corresponding callback function when encountering the corresponding JSON
element. See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides callback functions for the JsonParser. The callback
functions are responsible for providing data buffers to the Parser and
converting and materializing the Parser's parsing results into RowBatch.
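
To illustrate the callback flow described above, here is a minimal,
self-contained sketch. It does not use rapidjson: ParseFlatObject and
RecordingHandler are hypothetical stand-ins invented for this
illustration, and the toy parser only handles flat objects with
unescaped string values. The real JsonParser drives rapidjson's
SAX-style reader, and the real callbacks materialize values into a
RowBatch rather than recording strings.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handler: records every event it receives. In the patch the
// scanner plays this role and writes values into a RowBatch instead.
struct RecordingHandler {
  std::vector<std::string> events;
  bool StartObject() { events.push_back("StartObject"); return true; }
  bool Key(const std::string& k) { events.push_back("Key:" + k); return true; }
  bool String(const std::string& v) { events.push_back("String:" + v); return true; }
  bool EndObject() { events.push_back("EndObject"); return true; }
};

// Toy SAX-style driver (an assumption for illustration, not rapidjson).
// Accepts flat objects with string values and no escapes, for example
// {"a":"x","b":"y"}. Returns false on malformed input or when a callback
// returns false.
template <typename Handler>
bool ParseFlatObject(const std::string& in, Handler& handler) {
  std::size_t pos = 0;
  auto skip_ws = [&] {
    while (pos < in.size() &&
           std::isspace(static_cast<unsigned char>(in[pos]))) {
      ++pos;
    }
  };
  // Consumes 'c' (after optional whitespace) if it is the next character.
  auto expect = [&](char c) {
    skip_ws();
    if (pos < in.size() && in[pos] == c) {
      ++pos;
      return true;
    }
    return false;
  };
  // Reads a double-quoted string without escape handling.
  auto read_string = [&](std::string* out) {
    if (!expect('"')) return false;
    std::size_t end = in.find('"', pos);
    if (end == std::string::npos) return false;
    *out = in.substr(pos, end - pos);
    pos = end + 1;
    return true;
  };

  if (!expect('{') || !handler.StartObject()) return false;
  if (expect('}')) return handler.EndObject();
  do {
    std::string key, value;
    if (!read_string(&key) || !handler.Key(key)) return false;
    if (!expect(':')) return false;
    if (!read_string(&value) || !handler.String(value)) return false;
  } while (expect(','));
  return expect('}') && handler.EndObject();
}
```

The parser never builds a document tree. It only reports elements to
the handler as it encounters them, which is the property that lets a
scanner convert each value as soon as it is seen.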

Limitations
 - Multiline JSON objects are not fully supported yet. They work when
   each file has only one scan range. However, when a file has multiple
   scan ranges, there is a small probability that multiline JSON
   objects spanning ScanRange boundaries are scanned incompletely (in
   such cases, parsing errors may be reported). For more details, refer
   to the comments in 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.
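
The multiline limitation can be pictured with a toy model. This is an
assumption made for illustration only, not the logic of
HdfsJsonScanner: suppose a scanner assigned a range that does not
start at the beginning of the file skips to just past the first
newline and treats that offset as the start of a record. The helper
names GuessRecordStart and GuessLandsOnObjectStart are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Toy model, not Impala's code: skip to just past the first '\n' at or after
// 'range_start' and assume a new record begins there. Returns npos when no
// newline exists in the remainder of the file.
inline std::size_t GuessRecordStart(const std::string& file,
                                    std::size_t range_start) {
  std::size_t nl = file.find('\n', range_start);
  return nl == std::string::npos ? std::string::npos : nl + 1;
}

// True when the guessed offset really is the first character of an object.
inline bool GuessLandsOnObjectStart(const std::string& file,
                                    std::size_t range_start) {
  std::size_t pos = GuessRecordStart(file, range_start);
  return pos != std::string::npos && pos < file.size() && file[pos] == '{';
}
```

With one object per line the guess always lands on '{'. When an object
spans several lines and the range boundary falls inside it, the guess
can land in the middle of that object, which is the kind of case where
an incomplete scan or a parse error would be expected under this
model.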

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py for testing multiline
   and malformed JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M be/src/exec/text-converter.inline.h
M bin/rat_exclude_files.txt
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
37 files changed, 1,273 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/20

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 20
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-19 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#19).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner implemented based on rapidjson, which
supports scanning data from splittable JSON files.

The scanning of JSON data is mainly completed by two parts working
together. The first part is the JsonParser responsible for parsing the
JSON object, which is implemented based on the SAX-style API of
rapidjson. It reads data from the char stream, parses it, and calls the
corresponding callback function when encountering the corresponding JSON
element. See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides callback functions for the JsonParser. The callback
functions are responsible for providing data buffers to the Parser and
converting and materializing the Parser's parsing results into RowBatch.

Limitations
 - Multiline JSON objects are not fully supported yet. They work when
   each file has only one scan range. However, when a file has multiple
   scan ranges, there is a small probability that multiline JSON
   objects spanning ScanRange boundaries are scanned incompletely (in
   such cases, parsing errors may be reported). For more details, refer
   to the comments in 'multiline_json.test'.
 - Compressed JSON files are not supported yet.
 - Complex types are not supported yet.

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py for testing multiline
   and malformed JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M be/src/exec/text-converter.inline.h
M bin/rat_exclude_files.txt
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
37 files changed, 1,240 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/19

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 19
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-18 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 18:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/19699/18//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/18//COMMIT_MSG@23
PS18, Line 23:
Could you add a section for the current limitations? E.g.

Limitations
 - multiline json objects are not supported
 - compressed json files are not supported

Does this patch support parsing complex types from json files?


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@45
PS18, Line 45: begin and end
nit: add quotes on var names, i.e. 'begin' and 'end'


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@49
PS18, Line 49: some
nit: remove "some"


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@68
PS18, Line 68: ///   bool AddNumber(int index, const char* str, uint32_t len);
nit: it'd be helpful to give a doc link here. E.g.

 /// See more in https://rapidjson.org/classrapidjson_1_1_handler.html


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@143
PS18, Line 143: !memcmp(field_found_.data(), field_found_.data() + 1, 
field_found_.size() - 1)));
This is not very readable. Could you add some comments and break it into 
smaller conditions?
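
For illustration, a sketch of what this expression appears to check,
assuming field_found_ is a contiguous vector of byte-sized flags such
as std::vector<char> (its actual type is not shown in this thread).
Comparing a buffer with itself shifted by one byte succeeds only when
all bytes are equal. AllSameMemcmp and AllSameReadable are
hypothetical helper names, not part of the patch.

```cpp
#include <algorithm>
#include <cstring>
#include <functional>
#include <vector>

// Original style: compare bytes [0, n-1) with bytes [1, n). The result is
// zero only when every byte equals its successor, i.e. all bytes are equal.
// An empty vector is treated as "all same".
inline bool AllSameMemcmp(const std::vector<char>& flags) {
  if (flags.empty()) return true;
  return std::memcmp(flags.data(), flags.data() + 1, flags.size() - 1) == 0;
}

// Same check expressed with a standard algorithm: look for the first pair of
// neighbours that differ; if none exists, all elements are equal.
inline bool AllSameReadable(const std::vector<char>& flags) {
  return std::adjacent_find(flags.begin(), flags.end(),
                            std::not_equal_to<char>()) == flags.end();
}
```

A named helper along these lines could be combined with the other
checks on that line as separate, commented conditions.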


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@169
PS18, Line 169:   rapidjson::kParseStopWhenDoneFlag;
What about kParseInsituFlag? Can we use it in our scenario?


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@230
PS18, Line 230: DCHECK(!IsRequiredField());
nit: This is a bit confusing. It might be better to use 
DCHECK_EQ(current_field_idx_, -1) directly.


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@242
PS18, Line 242:   bool StartObject() {
nit: could you add a comment before each of these methods mentioning that they 
are the handler interfaces used by RapidJSON? e.g.

  /// Handler methods used in Rapidjson


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json/hdfs-json-scanner.cc@114
PS18, Line 114: the scan range before
nit: "the previous scan range in the same file" might be more clear


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json/hdfs-json-scanner.cc@160
PS18, Line 160: if(
nit: add a space after "if"




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 18
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Tue, 18 Jul 2023 13:41:06 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 18:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13536/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 18
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Thu, 13 Jul 2023 09:55:15 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-07-13 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#18).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner implemented based on rapidjson, which
supports scanning data from splittable JSON files.

The scanning of JSON data is mainly completed by two parts working
together. The first part is the JsonParser responsible for parsing the
JSON object, which is implemented based on the SAX-style API of
rapidjson. It reads data from the char stream, parses it, and calls the
corresponding callback function when encountering the corresponding JSON
element. See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides callback functions for the JsonParser. The callback
functions are responsible for providing data buffers to the Parser and
converting and materializing the Parser's parsing results into RowBatch.

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py for testing multiline
   and malformed JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M be/src/exec/text-converter.inline.h
M bin/rat_exclude_files.txt
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
37 files changed, 1,240 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/18

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 18
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-06-26 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 17:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13392/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 17
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Mon, 26 Jun 2023 11:57:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-06-26 Thread Zihao Ye (Code Review)
Zihao Ye has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 15:

(5 comments)

Thanks for the review!

http://gerrit.cloudera.org:8080/#/c/19699/15//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/15//COMMIT_MSG@11
PS15, Line 11:
> Could you summarize the high level design?
Done


http://gerrit.cloudera.org:8080/#/c/19699/15/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/19699/15/testdata/bin/generate-schema-statements.py@209
PS15, Line 209:   'json': "JSONFILE"
> nit: add a comma at the end
Done


http://gerrit.cloudera.org:8080/#/c/19699/15/testdata/datasets/functional/schema_constraints.csv
File testdata/datasets/functional/schema_constraints.csv:

http://gerrit.cloudera.org:8080/#/c/19699/15/testdata/datasets/functional/schema_constraints.csv@198
PS15, Line 198: table_name:decimal_tiny, constraint:restrict_to, 
table_format:orc/def/block
> I think we need to add json here and for several other tables.
> Otherwise, some tables are missing in JSON format.

Done, but I'm not sure if there are any other tables that need to be added 
because I'm not familiar with their purpose.


http://gerrit.cloudera.org:8080/#/c/19699/15/tests/query_test/test_queries.py
File tests/query_test/test_queries.py:

http://gerrit.cloudera.org:8080/#/c/19699/15/tests/query_test/test_queries.py@217
PS15, Line 217: class TestQueriesTextTables(ImpalaTestSuite):
> Can we add some tests like these for JSON? E.g. tests for
> multi-line strings, multi-line json objects, malformed json
> objects.

Done. If there are any other necessary tests to add, please let me know.


http://gerrit.cloudera.org:8080/#/c/19699/15/tests/query_test/test_scanners_fuzz.py
File tests/query_test/test_scanners_fuzz.py:

http://gerrit.cloudera.org:8080/#/c/19699/15/tests/query_test/test_scanners_fuzz.py@98
PS15, Line 98: elif table_format.file_format in ('rc', 'seq', 'json'):
> Why do we skip JSON here?

I didn't generate any JSON tables whose names start with 'decimal_' because 
the comment says "Decimal can only be tested on formats Impala can write to 
(text and parquet)", so I skipped it. Now that we have these tables, the test 
can pass, so it won't be skipped anymore.




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 15
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zihao Ye 
Gerrit-Comment-Date: Mon, 26 Jun 2023 11:53:38 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-06-26 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#17).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner implemented based on rapidjson, which
supports scanning data from splittable JSON files.

The scanning of JSON data is mainly completed by two parts working
together. The first part is the JsonParser responsible for parsing the
JSON object, which is implemented based on the SAX-style API of
rapidjson. It reads data from the char stream, parses it, and calls the
corresponding callback function when encountering the corresponding JSON
element. See the comments of the JsonParser class for more details.

The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides callback functions for the JsonParser. The callback
functions are responsible for providing data buffers to the Parser and
converting and materializing the Parser's parsing results into RowBatch.

Tests
 - Most of the existing end-to-end tests can run on JSON format.
 - Add TestQueriesJsonTables in test_queries.py for testing multiline
   and malformed JSON.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M be/src/exec/text-converter.inline.h
M bin/rat_exclude_files.txt
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
A testdata/data/json_test/malformed.json
A testdata/data/json_test/multiline.json
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
A testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
37 files changed, 1,236 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/17

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 17
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-06-26 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 17:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/19699/17/tests/common/test_dimensions.py
File tests/common/test_dimensions.py:

http://gerrit.cloudera.org:8080/#/c/19699/17/tests/common/test_dimensions.py@124
PS17, Line 124: def create_uncompressed_json_dimension(workload):
flake8: E302 expected 2 blank lines, found 1


http://gerrit.cloudera.org:8080/#/c/19699/17/tests/query_test/test_queries.py
File tests/query_test/test_queries.py:

http://gerrit.cloudera.org:8080/#/c/19699/17/tests/query_test/test_queries.py@263
PS17, Line 263: class TestQueriesJsonTables(ImpalaTestSuite):
flake8: E302 expected 2 blank lines, found 1




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 17
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Mon, 26 Jun 2023 11:35:24 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-06-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 16:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13355/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 16
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Wed, 21 Jun 2023 09:38:02 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-06-21 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 16:

(5 comments)

Thanks for working on this! Add some test requirements first.

http://gerrit.cloudera.org:8080/#/c/19699/15//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/15//COMMIT_MSG@11
PS15, Line 11:
Could you summarize the high level design?


http://gerrit.cloudera.org:8080/#/c/19699/15/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/19699/15/testdata/bin/generate-schema-statements.py@209
PS15, Line 209:   'json': "JSONFILE"
nit: add a comma at the end


http://gerrit.cloudera.org:8080/#/c/19699/15/testdata/datasets/functional/schema_constraints.csv
File testdata/datasets/functional/schema_constraints.csv:

http://gerrit.cloudera.org:8080/#/c/19699/15/testdata/datasets/functional/schema_constraints.csv@198
PS15, Line 198: table_name:decimal_tiny, constraint:restrict_to, 
table_format:orc/def/block
I think we need to add json here and for several other tables. Otherwise, some 
tables are missing in JSON format.


http://gerrit.cloudera.org:8080/#/c/19699/15/tests/query_test/test_queries.py
File tests/query_test/test_queries.py:

http://gerrit.cloudera.org:8080/#/c/19699/15/tests/query_test/test_queries.py@217
PS15, Line 217: class TestQueriesTextTables(ImpalaTestSuite):
Can we add some tests like these for JSON? E.g. tests for multi-line strings, 
multi-line json objects, malformed json objects.


http://gerrit.cloudera.org:8080/#/c/19699/15/tests/query_test/test_scanners_fuzz.py
File tests/query_test/test_scanners_fuzz.py:

http://gerrit.cloudera.org:8080/#/c/19699/15/tests/query_test/test_scanners_fuzz.py@98
PS15, Line 98: elif table_format.file_format in ('rc', 'seq', 'json'):
Why do we skip JSON here?




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 16
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Wed, 21 Jun 2023 09:17:45 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-06-21 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#16).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner implemented based on rapidjson, which
supports scanning data from splittable JSON files.

Tests
 - Most of the end-to-end tests can run on JSON format.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M be/src/exec/text-converter.inline.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
28 files changed, 1,100 insertions(+), 41 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/16

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 16
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-06-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 15: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/9418/



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 15
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Wed, 21 Jun 2023 07:39:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-06-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 15:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9418/ 
DRY_RUN=true



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 15
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Wed, 21 Jun 2023 02:12:05 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-06-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 15:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13274/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 15
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Tue, 13 Jun 2023 10:15:17 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-06-13 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#15).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner, built on rapidjson, that supports
scanning data from splittable JSON files.

Tests
 - Most of the end-to-end tests can run on JSON format.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M be/src/exec/text-converter.inline.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
29 files changed, 1,099 insertions(+), 41 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/15
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 15
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-06-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 14:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13254/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 14
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Sat, 10 Jun 2023 02:42:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-06-09 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#14).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner, built on rapidjson, that supports
scanning data from splittable JSON files.

Tests
 - Most of the end-to-end tests can run on JSON format.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M be/src/exec/text-converter.inline.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
29 files changed, 1,090 insertions(+), 41 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/14
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 14
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-06-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 12:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/13244/ : Initial code 
review checks failed. See linked job for details on the failure.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 12
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Fri, 09 Jun 2023 10:56:48 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-06-09 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#12).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner, built on rapidjson, that supports
scanning data from splittable JSON files.

Tests
 - Most of the end-to-end tests can run on JSON format.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
28 files changed, 1,090 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/12
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 12
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-05-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 11:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13146/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 11
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Wed, 31 May 2023 03:15:26 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-05-30 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#11).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner, built on rapidjson, that supports
scanning data from splittable JSON files.

Tests
 - Most of the end-to-end tests can run on JSON format.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
28 files changed, 1,173 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/11
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 11
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-05-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 10:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/12963/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 10
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Mon, 08 May 2023 07:35:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-05-09 Thread Zihao Ye (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#10).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner, built on rapidjson, that supports
scanning data from splittable JSON files.

Tests
 - Most of the end-to-end tests can run on JSON format.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
28 files changed, 1,166 insertions(+), 43 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/10
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 10
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-05-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/19699/10/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/10/be/src/exec/json/hdfs-json-scanner.cc@103
PS10, Line 103: // the entire scan range without finding a single tuple. The bytes will be picked up
line has trailing whitespace



--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 10
Gerrit-Owner: Zihao Ye 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Mon, 08 May 2023 07:15:28 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-05-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 9:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/12930/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 9
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Thu, 04 May 2023 10:14:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-05-04 Thread Anonymous Coward (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#9).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner, built on rapidjson, that supports
scanning data from splittable JSON files.

Tests
 - Most of the end-to-end tests can run on JSON format.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
28 files changed, 1,165 insertions(+), 43 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/9
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 9
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-05-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 8:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/12928/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 8
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Thu, 04 May 2023 03:06:29 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-05-03 Thread Anonymous Coward (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#8).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner, built on rapidjson, that supports
scanning data from splittable JSON files.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
28 files changed, 1,167 insertions(+), 45 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/8
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 8
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-04-28 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/12888/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 7
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Fri, 28 Apr 2023 09:00:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-04-28 Thread Anonymous Coward (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#7).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner, built on rapidjson, that supports
scanning data from splittable JSON files.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
27 files changed, 1,164 insertions(+), 45 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/7
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 7
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-04-28 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 6:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/12887/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 6
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Fri, 28 Apr 2023 06:39:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-04-28 Thread Anonymous Coward (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#6).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner, built on rapidjson, that supports
scanning data from splittable JSON files.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
27 files changed, 1,167 insertions(+), 45 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/6
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 6
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-04-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 5:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/12879/ : Initial code review checks passed.
Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 5
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Thu, 27 Apr 2023 08:48:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-04-27 Thread Anonymous Coward (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#5).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner, implemented on top of rapidjson, which
supports scanning data from split JSON files.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
27 files changed, 1,160 insertions(+), 45 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/5
--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 5
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-04-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 4:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/12878/ : Initial code review checks failed.
See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 4
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Thu, 27 Apr 2023 08:04:32 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-04-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 4:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json-parser.h@334
PS4, Line 334:
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.h
File be/src/exec/json/hdfs-json-scanner.h:

http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.h@123
PS4, Line 123:   /// This is used to indicate whether an error has occurred in the currently parsed row.
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.h@134
PS4, Line 134:   /// JsonParse comment.
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.h@138
PS4, Line 138:   /// specific uses described in the JsonParse comment.
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.cc@42
PS4, Line 42:   scanner_state_(CREATED),
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.cc@202
PS4, Line 202:
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.cc@221
PS4, Line 221: // due to BreakParse().
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/19699/4/be/src/exec/json/hdfs-json-scanner.cc@233
PS4, Line 233: // the parser that eos has been reached.
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/19699/4/tests/query_test/test_scanners_fuzz.py
File tests/query_test/test_scanners_fuzz.py:

http://gerrit.cloudera.org:8080/#/c/19699/4/tests/query_test/test_scanners_fuzz.py@80
PS4, Line 80: a
flake8: W504 line break after binary operator


http://gerrit.cloudera.org:8080/#/c/19699/4/tests/query_test/test_tpch_queries.py
File tests/query_test/test_tpch_queries.py:

http://gerrit.cloudera.org:8080/#/c/19699/4/tests/query_test/test_tpch_queries.py@41
PS4, Line 41: s
flake8: E501 line too long (96 > 90 characters)
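
The two flake8 findings above can be illustrated with a short sketch. The flagged lines themselves are not quoted in this thread, so the functions and strings below are hypothetical and only show the general shape of a fix. It assumes the project enforces W504 (and therefore does not also enforce the opposite rule, W503) and uses the 90-character limit reported in the E501 message.

```python
# Hypothetical code; none of these names come from the Impala test suite.

def is_supported(file_format, compression):
    # W504 fires when a line break comes AFTER a binary operator, e.g. when
    # the first line ends in "or". Moving the operator to the start of the
    # continuation line avoids the warning.
    return (file_format in ("text", "json")
            or compression == "none")


def build_message(table, fmt):
    # E501 fires when a physical line exceeds the configured limit. Adjacent
    # string literals are concatenated by Python, so a long message can be
    # split across lines without changing its value.
    return ("Table {0} uses file format {1}, which this hypothetical test "
            "does not cover").format(table, fmt)


print(is_supported("json", "gzip"))
print(build_message("lineitem", "json"))
```

Wrapping the expression in parentheses keeps the continuation implicit, which avoids backslash line continuations.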



--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 4
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Thu, 27 Apr 2023 07:53:06 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-04-27 Thread Anonymous Coward (Code Review)
Hello Quanlong Huang, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19699

to look at the new patch set (#4).

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner, implemented on top of rapidjson, which
supports scanning data from split JSON files.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
27 files changed, 1,158 insertions(+), 45 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/4
--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 4
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-04-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/12845/ : Initial code review checks passed.
Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 3
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 24 Apr 2023 12:05:18 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-04-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..


Patch Set 3:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json-parser-test.cc
File be/src/exec/json-parser-test.cc:

http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json-parser-test.cc@109
PS3, Line 109:
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.h
File be/src/exec/json/hdfs-json-scanner.h:

http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.h@125
PS3, Line 125:   /// This is used to indicate whether an error has occurred in the currently parsed row.
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.h@136
PS3, Line 136:   /// JsonParse comment.
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.h@140
PS3, Line 140:   /// specific uses described in the JsonParse comment.
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.cc@42
PS3, Line 42:   scanner_state_(CREATED),
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.cc@200
PS3, Line 200:
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.cc@219
PS3, Line 219: // due to BreakParse().
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/19699/3/be/src/exec/json/hdfs-json-scanner.cc@231
PS3, Line 231: // the parser that eos has been reached.
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/19699/3/tests/query_test/test_tpch_queries.py
File tests/query_test/test_tpch_queries.py:

http://gerrit.cloudera.org:8080/#/c/19699/3/tests/query_test/test_tpch_queries.py@41
PS3, Line 41: s
flake8: E501 line too long (96 > 90 characters)
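
Most of the findings above are "line has trailing whitespace". A minimal sketch of how such lines could be located and cleaned before uploading a patch set follows. This is not the check the Jenkins job runs; both helper names are hypothetical, and the cleanup normalises line endings to "\n" as a simplifying assumption.

```python
def find_trailing_whitespace(text):
    """Return 1-based numbers of lines that end in a space or tab."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line != line.rstrip(" \t"):
            hits.append(lineno)
    return hits


def strip_trailing_whitespace(text):
    """Drop trailing spaces and tabs from every line.

    Line endings are normalised to "\n"; a final line without a newline
    stays without one.
    """
    return "".join(
        line.rstrip(" \t\r\n") + ("\n" if line.endswith("\n") else "")
        for line in text.splitlines(keepends=True))


sample = "scanner_state_(CREATED), \n// due to BreakParse().\n"
print(find_trailing_whitespace(sample))
print(repr(strip_trailing_whitespace(sample)))
```

Many editors can also strip trailing whitespace on save, which avoids the findings without a separate step.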



--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 3
Gerrit-Owner: Anonymous Coward <18770832...@163.com>
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 24 Apr 2023 11:45:40 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10798: Prototype a simple JSON File reader

2023-04-24 Thread Anonymous Coward (Code Review)
18770832...@163.com has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
..

IMPALA-10798: Prototype a simple JSON File reader

Prototype of HdfsJsonScanner, implemented on top of rapidjson, which
supports scanning data from split JSON files.

Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-scan-node-base.cc
A be/src/exec/json-parser-test.cc
A be/src/exec/json-parser.h
A be/src/exec/json/CMakeLists.txt
A be/src/exec/json/hdfs-json-scanner.cc
A be/src/exec/json/hdfs-json-scanner.h
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/generate-schema-statements.py
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/metadata/test_hms_integration.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_tpch_queries.py
26 files changed, 1,180 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/19699/3
--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 3
Gerrit-Owner: Anonymous Coward <18770832...@163.com>