Zihao Ye has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
......................................................................


Patch Set 23:

(8 comments)

Thanks for the review!

I manually tested reading a JSON file with some large strings under a low mem_limit, and got the following error:

ERROR: Memory limit exceeded: Failed to allocate row batch
EXCHANGE_NODE (id=1) could not allocate 8.00 KB without exceeding limit.
Error occurred on backend 4620ee0acf3b:27000
Memory left in process limit: 11.41 GB
Memory left in query limit: -125.32 MB
Query(264ddc1e2c291b18:f1bec1c500000000): memory limit exceeded. Limit=100.00 MB Reservation=12.00 MB ReservationLimit=68.00 MB OtherMemory=213.32 MB Total=225.32 MB Peak=225.33 MB
  Unclaimed reservations: Reservation=4.00 MB OtherMemory=0 Total=4.00 MB Peak=12.00 MB
  Fragment 264ddc1e2c291b18:f1bec1c500000001: Reservation=8.00 MB OtherMemory=213.31 MB Total=221.31 MB Peak=221.32 MB
    HDFS_SCAN_NODE (id=0): Reservation=8.00 MB OtherMemory=8.00 KB Total=8.01 MB Peak=95.03 MB
      Queued Batches: Total=8.00 KB Peak=71.03 MB
    KrpcDataStreamSender (dst_id=1): Total=142.28 MB Peak=142.29 MB
      RowBatchSerialization: Total=142.28 MB Peak=142.28 MB
  Fragment 264ddc1e2c291b18:f1bec1c500000000: Reservation=0 OtherMemory=8.00 KB Total=8.00 KB Peak=8.00 KB
    EXCHANGE_NODE (id=1): Reservation=0 OtherMemory=0 Total=0 Peak=0
      KrpcDeferredRpcs: Total=0 Peak=0
    PLAN_ROOT_SINK: Total=0 Peak=0
    CodeGen: Total=0 Peak=0
  CodeGen: Total=0 Peak=0

It seems that it is the EXCHANGE_NODE that could not allocate memory, rather than the HDFS_SCAN_NODE. Is this what you expected?

Patch set 23 also fixes a bug where scanning complex types would hit a DCHECK (thanks to test_scanners_fuzz.py for finding it), and adds a related test, complex_json.test.


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@37
PS22, Line 37:
> Can we move this file into be/src/exec/json if it's only used by
> hdfs-json-scanner?
> Also do you plan to move the implementation codes into a
> json-parser.cc file?
Of course, it can be moved to /json. I initially placed it outside mainly to match the location of delimited-text-parser.h/.cc. If you think it's necessary, I can put it in /json later.
I have not thought about moving the implementation code into a json-parser.cc file. Please let me know if you think that's necessary.


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@59
PS22, Line 59: /// must succeed. Functions with bool return type return true on succeed, and return false
> nit: "The following functions materialize output tuples. Functions with voi
Done


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@213
PS22, Line 213: f
> nit: "been"
Done


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.h
File be/src/exec/json/hdfs-json-scanner.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.h@47
PS22, Line 47: /// exactly one scanner.
> Let's also mention the error handling, i.e. how different kinds of errors a
Done


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@188
PS22, Line 188: string data_view = string(data, std::min(len, max_view_len));
> It'd be more helpful to print the column name and table name, e.g.
Done. Now we can see the specific column where the error occurred, along with the corresponding data truncated to a length limit.
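For illustration, a minimal sketch of this kind of message construction; the helper name, the 50-byte cap, and the exact wording are assumptions for the example, not the code in the patch:

  // Hypothetical helper: echo a bounded view of the bad data and name the
  // column/table, so one oversized value cannot bloat the error log.
  #include <algorithm>
  #include <string>

  std::string BuildConversionErrorMsg(const char* data, int len,
      const std::string& col_name, const std::string& table_name) {
    constexpr int kMaxViewLen = 50;  // assumed cap on echoed data
    std::string data_view(data, std::min(len, kMaxViewLen));
    if (len > kMaxViewLen) data_view.append("...");
    return "Error converting column '" + col_name + "' of table '" +
        table_name + "': '" + data_view + "'";
  }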
http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@211
PS22, Line 211: stream_->filename(), stream_->scan_range()->offset() + offset,
> It seems we should return a non-ok status for unrecoverable
> situations, e.g. running out of memory.
Yes, that would be better, but how can we find out that an unrecoverable error has occurred here? Maybe state_->CheckQueryState()? (See the sketch below.)
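To make that concrete, a rough sketch of what checking the query state inside the error handler could look like. The simplified signature and the use of LogOrReturnError() are assumptions for the sketch, not the patch's actual code; RuntimeState::CheckQueryState() returns a non-OK Status when the query has been cancelled or a memory limit was exceeded:

  // Sketch only: the real handler takes more context (filename, offset).
  // The point is the ordering: check query-level state first, then treat
  // everything else as a recoverable data error.
  Status HdfsJsonScanner::HandleError(const std::string& msg) {
    // Unrecoverable: cancellation or a mem limit exceeded anywhere in
    // the query. This aborts the scan with a non-OK status.
    RETURN_IF_ERROR(state_->CheckQueryState());
    // Recoverable data error: LogOrReturnError() returns the error as a
    // Status if abort_on_error is set, otherwise logs it so parsing can
    // continue.
    RETURN_IF_ERROR(state_->LogOrReturnError(
        ErrorMsg(TErrorCode::GENERAL, msg)));
    return Status::OK();
  }

The same check could presumably help in HandleConvertError() (discussed below), since a false return from WriteSlot() cannot by itself distinguish a malformed value from an out-of-memory failure.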
http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@300
PS22, Line 300: // TODO: Support Invoke CodeGend WriteSlot
> A follow-up question for
> https://gerrit.cloudera.org/c/19699/18/be/src/exec/json-parser.h#303
>
> This can return false in copying strings when we run out of memory:
> https://github.com/apache/impala/blob/af3f56e6d1605a56f7bd02b0af35be980a7e4c63/be/src/exec/text-converter.inline.h#L96
>
> It seems we will return true in HandleConvertError() and let
> RapidJSON continue parsing. Can we stop it and report the mem
> issue? Or did I miss something?
Yes, it will continue parsing. Similar to the issue in HandleError() above: how can we determine here whether memory has run out? Relying solely on the return value of WriteSlot() is not sufficient. I also tried checking current_pool_->mem_tracker()->LimitExceeded(MemLimit::HARD), but it always returns false here. If you have any better ideas, please let me know.


http://gerrit.cloudera.org:8080/#/c/19699/22/testdata/data/json_test/malformed.json
File testdata/data/json_test/malformed.json:

http://gerrit.cloudera.org:8080/#/c/19699/22/testdata/data/json_test/malformed.json@2
PS22, Line 2: {"bool_col":False,"int_col":1,"float_col":0.1,"string_col":abc123}
> Can we also add these cases?
Done


--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 23
Gerrit-Owner: Zihao Ye <eyiz...@163.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Zihao Ye <eyiz...@163.com>
Gerrit-Comment-Date: Tue, 25 Jul 2023 12:50:30 +0000
Gerrit-HasComments: Yes