csringhofer commented on a change in pull request #1008:
URL: https://github.com/apache/orc/pull/1008#discussion_r782839420
##########
File path: c++/src/Compression.cc
##########
@@ -533,24 +545,37 @@ DIAGNOSTIC_PUSH
}
/** There are three possible scenarios when seeking a position:
Review comment:
There are now four possible scenarios, so this sentence should be updated.
##########
File path: c++/src/Compression.cc
##########
@@ -533,24 +545,37 @@ DIAGNOSTIC_PUSH
}
/** There are three possible scenarios when seeking a position:
- * 1. The seeked position is already read and decompressed into
- * the output stream.
- * 2. It is already read from the input stream, but has not been
- * decompressed yet, ie. it's not in the output stream.
- * 3. It is not read yet from the inputstream.
+ * 1. The chunk of the seeked position is already read and decompressed into the output
+ * stream, ie. chunk header is read and chunk contents are in the output stream.
+ * 2. The chunk of the seeked position is partially read. This only happens for
+ * uncompressed chunks. The chunk header is read but the seeked position hasn't been
+ * read yet.
+ * 3. It is already read from the input stream, but has not been decompressed yet, ie.
+ * it's not in the output stream.
+ * 4. It is not read yet from the input stream.
*/
void DecompressionStream::seek(PositionProvider& position) {
size_t seekedPosition = position.current();
Review comment:
Not really related to this change, but I think it would be clearer if this
variable were renamed, e.g. to startOfSeekedChunk.
##########
File path: c++/src/Compression.cc
##########
@@ -533,24 +545,37 @@ DIAGNOSTIC_PUSH
}
/** There are three possible scenarios when seeking a position:
- * 1. The seeked position is already read and decompressed into
- * the output stream.
- * 2. It is already read from the input stream, but has not been
- * decompressed yet, ie. it's not in the output stream.
- * 3. It is not read yet from the inputstream.
+ * 1. The chunk of the seeked position is already read and decompressed into the output
+ * stream, ie. chunk header is read and chunk contents are in the output stream.
+ * 2. The chunk of the seeked position is partially read. This only happens for
+ * uncompressed chunks. The chunk header is read but the seeked position hasn't been
+ * read yet.
+ * 3. It is already read from the input stream, but has not been decompressed yet, ie.
+ * it's not in the output stream.
+ * 4. It is not read yet from the input stream.
*/
void DecompressionStream::seek(PositionProvider& position) {
size_t seekedPosition = position.current();
- // Case 1: the seeked position is the one that is currently buffered and
- // decompressed. Here we only need to set the output buffer's pointer to the
- // seeked position. Note that after the headerPosition comes the 3 bytes of
- // the header.
+ // Case 1&2: the seeked position is in the current chunk and it's buffered and
+ // decompressed. Note that after the headerPosition comes the 3 bytes of the header.
if (headerPosition == seekedPosition
&& inputBufferStartPosition <= headerPosition + 3 && inputBufferStart) {
position.next(); // Skip the input level position.
size_t posInChunk = position.next(); // Chunk level position.
- outputBufferLength = uncompressedBufferLength - posInChunk;
- outputBuffer = outputBufferStart + posInChunk;
+ // Case 1: The position is in the decompressed buffer. Here we only need to
+ // set the output buffer's pointer to the seeked position.
+ if (uncompressedBufferLength >= posInChunk) {
+ outputBufferLength = uncompressedBufferLength - posInChunk;
+ outputBuffer = outputBufferStart + posInChunk;
+ return;
+ }
+ // Case 2: The position is outside the decompressed buffer. Skip bytes to seek.
Review comment:
This is a bit confusing, as it can only happen in the uncompressed case; it would help to say so in this comment.
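To make the distinction concrete, here is a minimal sketch of how the four scenarios from the doc comment could be told apart. It is not the ORC implementation: `StreamState`, `SeekCase`, `classifySeek`, `kHeaderSize` and the `inputBufferLength` field are hypothetical names introduced only for illustration, and the check for scenario 3 is an assumption about how "already read from the input stream" might be tested. Only the scenario 1 and 2 conditions mirror the diff above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified model of the stream state (not the ORC class).
enum class SeekCase {
  InOutputBuffer,              // scenario 1
  PartiallyReadOriginalChunk,  // scenario 2 (uncompressed chunks only)
  InInputBuffer,               // scenario 3
  NotReadYet                   // scenario 4
};

struct StreamState {
  std::size_t headerPosition;            // input offset of the current chunk header
  std::size_t inputBufferStartPosition;  // input offset of the first buffered input byte
  std::size_t inputBufferLength;         // number of buffered input bytes (assumed field)
  std::size_t uncompressedBufferLength;  // bytes of the current chunk in the output buffer
  bool hasInputBuffer;                   // stands in for the inputBufferStart pointer check
};

constexpr std::size_t kHeaderSize = 3;  // the chunk header is 3 bytes

SeekCase classifySeek(const StreamState& s, std::size_t chunkStart,
                      std::size_t posInChunk) {
  // Scenarios 1 and 2: the seek targets the chunk whose header was already read.
  const bool inCurrentChunk =
      s.hasInputBuffer && s.headerPosition == chunkStart &&
      s.inputBufferStartPosition <= s.headerPosition + kHeaderSize;
  if (inCurrentChunk) {
    // Scenario 1 if the offset is within the bytes already in the output
    // buffer; otherwise the rest of the chunk still has to be read, which
    // only happens for uncompressed chunks.
    return posInChunk <= s.uncompressedBufferLength
               ? SeekCase::InOutputBuffer
               : SeekCase::PartiallyReadOriginalChunk;
  }
  // Scenario 3 (assumed check): the chunk start lies inside the buffered input.
  const bool inInputBuffer =
      s.hasInputBuffer && chunkStart >= s.inputBufferStartPosition &&
      chunkStart < s.inputBufferStartPosition + s.inputBufferLength;
  return inInputBuffer ? SeekCase::InInputBuffer : SeekCase::NotReadYet;
}
```

Naming scenario 2 after the uncompressed ("original") chunk type, as the enum value does here, is one way the code comment could make the restriction obvious.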
##########
File path: c++/src/Compression.cc
##########
@@ -321,6 +321,17 @@ DIAGNOSTIC_PUSH
DECOMPRESS_ORIGINAL,
DECOMPRESS_EOF};
+ std::string decompressStateToString(DecompressState state) {
Review comment:
I couldn't find where this function is used.
--
This is an automated message from the Apache Git Service.