[Impala-ASF-CR] IMPALA-12786: Optimize count(*) for JSON scans

Zihao Ye (Code Review) Fri, 29 Mar 2024 01:39:04 -0700

Hello Quanlong Huang, Riza Suminto, Daniel Becker, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit


    http://gerrit.cloudera.org:8080/21039

to look at the new patch set (#9).

Change subject: IMPALA-12786: Optimize count(*) for JSON scans
......................................................................

IMPALA-12786: Optimize count(*) for JSON scans

When performing zero slots scans on a JSON table for operations like
count(*), we don't require specific data from the JSON, we only need the
number of top-level JSON objects. However, the current JSON parser based
on rapidjson still decodes and copies specific data from the JSON, even
in zero slots scans. Skipping these steps can significantly improve scan
performance.

This patch introduces a JSON skipper to conduct zero slots scans on JSON
data. Essentially, it is a simplified version of a rapidjson parser,
removing specific data decoding and copying operations, resulting in
faster parsing of the number of JSON objects. The skipper retains the
ability to recognize malformed JSON and provide specific error codes
same as the rapidjson parser. Nevertheless, as it bypasses specific
data parsing, it cannot identify string encoding errors or numeric
overflow errors. Despite this, these data errors do not impact the
counting of JSON objects, so it is acceptable to ignore them. The TEXT
scanner exhibits similar behavior.

Additionally, a new query option, disable_optimized_json_count_star, has
been added to disable this optimization and revert to the old behavior.

In the performance test of TPC-DS with a format of json/none and a scale
of 10GB, the performance optimization is shown in the following tables:
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| Workload  | Query                     | File Format        | Avg(s) | Base 
Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW 
Zval | Tval   |
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | json / none / none | 6.78   | 6.88    
    |   -1.46%   |   4.93%   |   3.63%        | 9     |   -1.51%       | -0.74  
 | -0.72  |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | json / none / none | 2.42   | 6.75    
    | I -64.20%  |   6.44%   |   4.58%        | 9     | I -177.75%     | -3.36  
 | -37.55 |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | json / none / none | 2.42   | 7.03    
    | I -65.63%  |   3.93%   |   4.39%        | 9     | I -194.13%     | -3.36  
 | -42.82 |
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+

(I) Improvement: TPCDS(10) TPCDS-Q_COUNT_ZERO_SLOT [json / none / none] (6.75s 
-> 2.42s [-64.20%])
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg     | Base Avg | Delta(Avg) | StdDev(%)  | 
Max      | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| 01:AGGREGATE | 2.58%      | 54.85ms | 58.88ms  | -6.85%     | * 14.43% * | 
115.82ms | 133.11ms | -12.99%    | 3      | 3     | 3      | 1         |
| 00:SCAN HDFS | 97.41%     | 2.07s   | 6.07s    | -65.84%    |   5.87%    | 
2.43s    | 6.95s    | -65.01%    | 3      | 3     | 28.80M | 143.83M   |
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+

(I) Improvement: TPCDS(10) TPCDS-Q_COUNT_OPTIMIZED [json / none / none] (7.03s 
-> 2.42s [-65.63%])
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg   | Base Avg | Delta(Avg) | StdDev(%) | Max   
| Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+
| 00:SCAN HDFS | 99.35%     | 2.07s | 6.49s    | -68.15%    |   4.83%   | 2.37s 
| 7.49s    | -68.32%    | 3      | 3     | 28.80M | 143.83M   |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+

Testing:
- Added new test cases in TestQueriesJsonTables to verify that query
  results are consistent before and after optimization.
- Passed existing JSON scanning-related tests.

Change-Id: I97ff097661c3c577aeafeeb1518408ce7a8a255e
---
M be/src/exec/json/hdfs-json-scanner.cc
M be/src/exec/json/json-parser-test.cc
M be/src/exec/json/json-parser.cc
M be/src/exec/json/json-parser.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M testdata/workloads/functional-query/queries/QueryTest/complex_json.test
M testdata/workloads/functional-query/queries/QueryTest/malformed_json.test
M testdata/workloads/functional-query/queries/QueryTest/multiline_json.test
M testdata/workloads/functional-query/queries/QueryTest/overflow_json.test
M tests/query_test/test_queries.py
13 files changed, 477 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/21039/9
--
To view, visit http://gerrit.cloudera.org:8080/21039
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I97ff097661c3c577aeafeeb1518408ce7a8a255e
Gerrit-Change-Number: 21039
Gerrit-PatchSet: 9
Gerrit-Owner: Zihao Ye <eyiz...@163.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Zihao Ye <eyiz...@163.com>

[Impala-ASF-CR] IMPALA-12786: Optimize count(*) for JSON scans

Reply via email to