Mihaly Szjatinya has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: WIP IMPALA-10319: Support arbitrary encodings on Text/Sequence files ...................................................................... Patch Set 13: (3 comments) Improved memory management. Added 'DecodingTime' counter. http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/text/hdfs-text-scanner.cc File be/src/exec/text/hdfs-text-scanner.cc: http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/exec/text/hdfs-text-scanner.cc@532 PS8, Line 532: *eosr = stream_->eosr(); > What will happen with decoder_ if it is end of stream, but the last charact Good point, I had left this for subsequent patches as less important. Added the check. http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc File be/src/util/char-codec.cc: http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@77 PS8, Line 77: return memory_pool_->mem_tracker()->MemLimitExceeded(nullptr, details, *bytes_read); : } > This can do a quite large untracked memory allocation. MemTracker's Consume Agree, added TryConsume() within a helper class to secure Release() for every return path. Perhaps we may think of more sophisticated estimates though, if that's important. http://gerrit.cloudera.org:8080/#/c/22049/8/be/src/util/char-codec.cc@113 PS8, Line 113: partial_symbol_.clear(); > We could release the decompressed buffer's memory at this point if we would Agree, but instead of releasing here, I employed the mechanism from DecompressStreamToBuffer(), which moves previous buffer onto row_batch's pool on every subsequent call, which later gets released either still on ScannerThread, or on ScanNode's side in case it was sent to queue. As testing shows this secures stable memory usage throughout reading large files. Potential problem is a significant exceeding of row_batch's limit (set to either 1024 rows or 8MB of attached memory), particularity in cases when both decompression and decoding are in conjunction. -- To view, visit http://gerrit.cloudera.org:8080/22049 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65 Gerrit-Change-Number: 22049 Gerrit-PatchSet: 13 Gerrit-Owner: Mihaly Szjatinya <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Mihaly Szjatinya <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Sun, 02 Mar 2025 21:08:07 +0000 Gerrit-HasComments: Yes
