Zihao Ye has posted comments on this change. ( http://gerrit.cloudera.org:8080/21039 )
Change subject: IMPALA-12786: Optimize count(*) for JSON scans ...................................................................... Patch Set 8: (6 comments) http://gerrit.cloudera.org:8080/#/c/21039/8/be/src/exec/json/json-parser-test.cc File be/src/exec/json/json-parser-test.cc: http://gerrit.cloudera.org:8080/#/c/21039/8/be/src/exec/json/json-parser-test.cc@207 PS8, Line 207: TestSkip(R"("abc\")", TYPE_STRING, kParseErrorStringMissQuotationMark); > Add negative/positive testcase against unicode and/or encoded unicode? Done http://gerrit.cloudera.org:8080/#/c/21039/8/be/src/exec/json/json-parser.h File be/src/exec/json/json-parser.h: http://gerrit.cloudera.org:8080/#/c/21039/8/be/src/exec/json/json-parser.h@253 PS8, Line 253: https://github.com/Tencent/rapidjson/blob/master/include/rapidjson/reader.h#L539 > nit: Use permalink Done http://gerrit.cloudera.org:8080/#/c/21039/8/be/src/exec/json/json-parser.h@269 PS8, Line 269: inline bool Consume(char expect) { > nit: Please add comment about what this function do. Done http://gerrit.cloudera.org:8080/#/c/21039/8/be/src/exec/json/json-parser.h@282 PS8, Line 282: /// The following function is used to skip a specific JSON value. It maintains logic : /// consistent with rapidjson, consuming the stream and returning true upon successfully : /// skipping the specified value, or returning false and setting the respective error : /// code if an error is encountered. > nit: Consider mentioning this RFC about valid JSON values. Done http://gerrit.cloudera.org:8080/#/c/21039/8/be/src/exec/json/json-parser.h@293 PS8, Line 293: bool SkipValue(); > rapidjson::GenericReader has ParseHex4 to handle unicode. Will that be a co I believe this won't be a concern. As mentioned in the commit message, erroneous encoding does not affect our counting of JSON objects. We do not require the actual data of the string values, as long as it is a correctly formatted string value (with unescaped starting and ending double quotes) then we are able to correctly skip over it. Therefore, we do not need ParseHex4 to specifically handle Unicode, because it does not affect our recognition of the ending double quotes for strings. http://gerrit.cloudera.org:8080/#/c/21039/8/be/src/exec/json/json-parser.cc File be/src/exec/json/json-parser.cc: http://gerrit.cloudera.org:8080/#/c/21039/8/be/src/exec/json/json-parser.cc@332 PS8, Line 332: template<class Stream> : bool JsonSkipper<Stream>::SkipNumber() { > rapidjson::GenericReader seems to handle NaN and Infinity. This one does no Done, and I have created another ticket IMPALA-12957 to track the feature support for Inf & NaN. -- To view, visit http://gerrit.cloudera.org:8080/21039 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I97ff097661c3c577aeafeeb1518408ce7a8a255e Gerrit-Change-Number: 21039 Gerrit-PatchSet: 8 Gerrit-Owner: Zihao Ye <eyiz...@163.com> Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com> Gerrit-Reviewer: Zihao Ye <eyiz...@163.com> Gerrit-Comment-Date: Fri, 29 Mar 2024 09:07:19 +0000 Gerrit-HasComments: Yes