Mihaly Szjatinya has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22049 )

Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence 
files
......................................................................


Patch Set 13:

(3 comments)

Improved memory management. Added 'DecodingTime' counter.

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/text/hdfs-text-scanner.cc
File be/src/exec/text/hdfs-text-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/text/hdfs-text-scanner.cc@532
PS8, Line 532:     *eosr = stream_->eosr();
> What will happen with decoder_ if it is end of stream, but the last charact
Good point, I had left this for subsequent patches as less important. Added the 
check.


http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc
File be/src/util/char-codec.cc:

http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@77
PS8, Line 77:       return 
memory_pool_->mem_tracker()->MemLimitExceeded(nullptr, details, *bytes_read);
            :     }
> This can do a quite large untracked memory allocation. MemTracker's Consume
Agree, added TryConsume() within a helper class to secure Release() for every 
return path. Perhaps we may think of more sophisticated estimates though, if 
that's important.


http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@113
PS8, Line 113:     partial_symbol_.clear();
> We could release the decompressed buffer's memory at this point if we would
Agree, but instead of releasing here, I employed the mechanism from 
DecompressStreamToBuffer(), which moves previous buffer onto row_batch's pool 
on every subsequent call, which later gets released either still on 
ScannerThread, or on ScanNode's side in case it was sent to queue. As testing 
shows this secures stable memory usage throughout reading large files.

Potential problem is a significant exceeding of row_batch's limit (set to 
either 1024 rows or 8MB of attached memory), particularity in cases when both 
decompression and decoding are in conjunction.



--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 13
Gerrit-Owner: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Sun, 02 Mar 2025 21:08:07 +0000
Gerrit-HasComments: Yes

Reply via email to