Zihao Ye has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
......................................................................


Patch Set 23:

(8 comments)

Thanks for the review!
I manually tested reading a JSON file with some large strings under a low
mem_limit, and got the following error:

ERROR: Memory limit exceeded: Failed to allocate row batch
EXCHANGE_NODE (id=1) could not allocate 8.00 KB without exceeding limit.
Error occurred on backend 4620ee0acf3b:27000
Memory left in process limit: 11.41 GB
Memory left in query limit: -125.32 MB
Query(264ddc1e2c291b18:f1bec1c500000000): memory limit exceeded. Limit=100.00 MB Reservation=12.00 MB ReservationLimit=68.00 MB OtherMemory=213.32 MB Total=225.32 MB Peak=225.33 MB
  Unclaimed reservations: Reservation=4.00 MB OtherMemory=0 Total=4.00 MB Peak=12.00 MB
  Fragment 264ddc1e2c291b18:f1bec1c500000001: Reservation=8.00 MB OtherMemory=213.31 MB Total=221.31 MB Peak=221.32 MB
    HDFS_SCAN_NODE (id=0): Reservation=8.00 MB OtherMemory=8.00 KB Total=8.01 MB Peak=95.03 MB
      Queued Batches: Total=8.00 KB Peak=71.03 MB
    KrpcDataStreamSender (dst_id=1): Total=142.28 MB Peak=142.29 MB
      RowBatchSerialization: Total=142.28 MB Peak=142.28 MB
  Fragment 264ddc1e2c291b18:f1bec1c500000000: Reservation=0 OtherMemory=8.00 KB Total=8.00 KB Peak=8.00 KB
    EXCHANGE_NODE (id=1): Reservation=0 OtherMemory=0 Total=0 Peak=0
      KrpcDeferredRpcs: Total=0 Peak=0
    PLAN_ROOT_SINK: Total=0 Peak=0
    CodeGen: Total=0 Peak=0
    CodeGen: Total=0 Peak=0

It seems that it's the EXCHANGE_NODE that fails to allocate memory, rather than
the HDFS_SCAN_NODE. Is this the expected behavior?
Also, patch set 23 fixes a bug where scanning complex types would hit a DCHECK
(thanks to test_scanners_fuzz.py for finding it), so an additional related test,
complex_json.test, was added.

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@37
PS22, Line 37:
> Can we move this file into be/src/exec/json if it's only used by
 > hdfs-json-scanner?
 > Also do you plan to move the implementation codes into a
 > json-parser.cc file?

Of course, it can be moved to /json. I initially placed it outside mainly to
mirror the location of delimited-text-parser.h/.cc. If you think it's necessary,
I can move it into /json later.
I had not considered moving the implementation code into a json-parser.cc file.
Please let me know if you think that's necessary.


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@59
PS22, Line 59: /// must succeed. Functions with bool return type return true on succeed, and return false
> nit: "The following functions materialize output tuples. Functions with voi
Done


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@213
PS22, Line 213: f
> nit: "been"
Done


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.h
File be/src/exec/json/hdfs-json-scanner.h:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.h@47
PS22, Line 47: /// exactly one scanner.
> Let's also mention the error handling, i.e. how different kinds of errors a
Done


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@188
PS22, Line 188:     string data_view = string(data, std::min(len, max_view_len));
> It'd be more helpful to print the column name and table name, e.g.
Done. Now the error message shows the specific column where the error occurred,
along with the offending data truncated to a length limit.
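For reference, here is a minimal standalone sketch of the idea (BuildErrorDataView
and the column/table literals are illustrative stand-ins, not the actual patch code):

#include <algorithm>
#include <cstdio>
#include <cstring>
#include <string>

// Cap the raw data echoed in the error message at max_view_len bytes so a
// single huge malformed value cannot flood the logs.
static std::string BuildErrorDataView(const char* data, int len, int max_view_len) {
  return std::string(data, std::min(len, max_view_len));
}

int main() {
  const char* bad = "abc123_pretend_this_is_a_very_long_value";
  // In the scanner, the column and table names come from the slot/tuple
  // descriptors; they are hard-coded placeholders here.
  std::printf("Error converting column 'string_col' of table 'tbl': '%s...'\n",
              BuildErrorDataView(bad, static_cast<int>(strlen(bad)), 10).c_str());
  return 0;
}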


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@211
PS22, Line 211:         stream_->filename(), stream_->scan_range()->offset() + offset,
> It seems we should return a non-ok status for unrecoverable
 > situations, e.g. running out of memory.

Yes, that would be better, but how can we detect that an unrecoverable error has
occurred here? Maybe state_->CheckQueryState()?
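To make the question concrete, the control flow I have in mind looks roughly like
this (a sketch only: Status and RuntimeState below are minimal stand-ins for the
real classes in be/src/common/status.h and be/src/runtime/runtime-state.h, and
HandleError's real signature differs):

#include <string>

// Minimal stand-in for impala::Status.
struct Status {
  std::string msg;  // empty means OK
  bool ok() const { return msg.empty(); }
  static Status OK() { return Status{}; }
};

// Minimal stand-in for impala::RuntimeState. In Impala, CheckQueryState()
// returns a non-OK status once the query was cancelled or hit a fatal error
// such as exceeding its memory limit.
struct RuntimeState {
  Status query_status;
  Status CheckQueryState() const { return query_status; }
};

// Sketch: before logging a parse error and skipping the record, bail out if
// the query is already in a failed or cancelled state.
Status HandleError(RuntimeState* state) {
  Status status = state->CheckQueryState();
  if (!status.ok()) return status;  // unrecoverable: stop parsing
  // ...otherwise report the malformed record and continue with the next one.
  return Status::OK();
}

int main() {
  RuntimeState state;
  return HandleError(&state).ok() ? 0 : 1;
}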


http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@300
PS22, Line 300:   // TODO: Support Invoke CodeGend WriteSlot
> A follow-up question for 
> https://gerrit.cloudera.org/c/19699/18/be/src/exec/json-parser.h#303
 >
 > This can return false in copying strings when we run out of memory:
 > https://github.com/apache/impala/blob/af3f56e6d1605a56f7bd02b0af35be980a7e4c63/be/src/exec/text-converter.inline.h#L96
 >
 > It seems we will return true in HandleConvertError() and let
 > RapidJSON continue parsing. Can we stop it and report the mem
 > issue? Or did I miss something?

Yes, it will continue parsing. Similar to the issue in HandleError(), how can we
determine that memory has run out here? Relying solely on the return value of
WriteSlot is not sufficient. I also tried using
current_pool_->mem_tracker()->LimitExceeded(MemLimit::HARD) to check, but it
always returns false here. If you have any better ideas, please let me know.
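For illustration, the check I experimented with looks roughly like this (MemLimit
and mem_tracker()->LimitExceeded() mirror Impala's MemTracker API from
be/src/runtime/mem-tracker.h, but the surrounding types are simplified stand-ins):

// Simplified stand-ins for MemTracker/MemPool, just to show the shape of the
// check; consumption and limit are plain counters here.
enum class MemLimit { HARD, SOFT };

struct MemTracker {
  long long consumption = 0;
  long long limit = -1;  // -1 means unlimited
  bool LimitExceeded(MemLimit) const {
    return limit >= 0 && consumption > limit;
  }
};

struct MemPool {
  MemTracker tracker;
  MemTracker* mem_tracker() { return &tracker; }
};

// Returning true tells the parse loop to keep going; the open question is how
// to reliably detect OOM here so we can return false and surface the error.
bool HandleConvertError(MemPool* current_pool) {
  if (current_pool->mem_tracker()->LimitExceeded(MemLimit::HARD)) {
    return false;  // out of memory: stop parsing and report it
  }
  return true;  // plain conversion error: skip the value and continue
}

int main() {
  MemPool pool;
  return HandleConvertError(&pool) ? 0 : 1;
}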


http://gerrit.cloudera.org:8080/#/c/19699/22/testdata/data/json_test/malformed.json
File testdata/data/json_test/malformed.json:

http://gerrit.cloudera.org:8080/#/c/19699/22/testdata/data/json_test/malformed.json@2
PS22, Line 2: {"bool_col":False,"int_col":1,"float_col":0.1,"string_col":abc123}
> Can we also add these cases?
Done



--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 23
Gerrit-Owner: Zihao Ye <eyiz...@163.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Zihao Ye <eyiz...@163.com>
Gerrit-Comment-Date: Tue, 25 Jul 2023 12:50:30 +0000
Gerrit-HasComments: Yes
