Caideyipi commented on PR #824:
URL: https://github.com/apache/tsfile/pull/824#issuecomment-4550686336
I found a functional issue.
`Tablet.serializedSize()` claims to return the exact serialized byte size, but it uses `ReadWriteIOUtils.sizeToWrite(insertTargetName)` to calculate string sizes.
That helper uses `s.getBytes()`, which depends on the platform default charset.
The actual serialization path uses `ReadWriteIOUtils.write(String, ...)`, which encodes strings with `TSFileConfig.STRING_CHARSET` (UTF-8).
So when the device/table name, measurement name, or schema properties contain non-ASCII characters, `serializedSize()` can differ from the real serialized size if the process default charset is not UTF-8.
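
To illustrate the mismatch, here is a minimal standalone sketch. It does not call any TsFile code; `CharsetSizeDemo` and `encodedLength` are hypothetical names, and ISO-8859-1 stands in for a non-UTF-8 platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetSizeDemo {
    // Byte length of s when encoded with an explicit charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Two CJK characters, written as escapes so the source file stays ASCII.
        String name = "\u6e29\u5ea6";

        // UTF-8 needs 3 bytes per character here.
        System.out.println(encodedLength(name, StandardCharsets.UTF_8)); // 6

        // ISO-8859-1 cannot represent them; each becomes a 1-byte '?'.
        System.out.println(encodedLength(name, StandardCharsets.ISO_8859_1)); // 2

        // s.getBytes() with no argument behaves like the line below,
        // so its result depends on how the JVM was started.
        System.out.println(encodedLength(name, Charset.defaultCharset()));
    }
}
```

With a name like this, a size computed through the default charset could report 2 payload bytes while the UTF-8 write path emits 6.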
This is probably not an issue when TsFile is used through IoTDB, because IoTDB startup sets the default charset.
But TsFile can also be used independently, and in standalone usage this can make the size estimate incorrect and break the “exact size” guarantee.
Suggested fix: make `ReadWriteIOUtils.sizeToWrite(String)` use `TSFileConfig.STRING_CHARSET`, consistent with the write path, and add a non-ASCII name test.
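
A rough sketch of what the fix and the test could look like, not the actual `ReadWriteIOUtils` code. It assumes a string is serialized as a 4-byte length prefix followed by the encoded bytes and ignores null handling; `StringSizeSketch` is a hypothetical class, and its `STRING_CHARSET` constant stands in for `TSFileConfig.STRING_CHARSET`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StringSizeSketch {
    // Stand-in for TSFileConfig.STRING_CHARSET (UTF-8 per the comment above).
    static final Charset STRING_CHARSET = StandardCharsets.UTF_8;

    // Size estimate using the same charset as the write path.
    // Assumed layout: int length prefix + encoded bytes.
    static int sizeToWrite(String s) {
        return Integer.BYTES + s.getBytes(STRING_CHARSET).length;
    }

    // Simplified write path: returns the number of bytes written.
    static int write(String s, ByteBuffer buffer) {
        byte[] bytes = s.getBytes(STRING_CHARSET);
        buffer.putInt(bytes.length);
        buffer.put(bytes);
        return Integer.BYTES + bytes.length;
    }

    public static void main(String[] args) {
        // Non-ASCII name: the estimate must match what was written.
        String name = "\u6e29\u5ea6_sensor";
        ByteBuffer buffer = ByteBuffer.allocate(64);
        int written = write(name, buffer);
        System.out.println(written == sizeToWrite(name)); // true
    }
}
```

Because both methods share one charset constant, the check holds regardless of the JVM default charset, which is the property the non-ASCII test should pin down.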
There is also a CodeQL alert for integer narrowing/overflow in `serializedSizeOfTimes()`.
Since this method is intended to return an exact byte size, that should probably be handled as well.
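
One possible way to handle the overflow is sketched below. The body of `serializedSizeOfTimes()` is not shown in this comment, so the formula (row count times 8 bytes per timestamp) and the `rowSize` parameter are assumptions; `TimesSizeSketch` is a hypothetical class.

```java
class TimesSizeSketch {
    // Assumed formula: one 8-byte long per row.
    // The multiplication is done in long, then narrowed with a check,
    // so an oversized result fails loudly instead of wrapping around.
    static int serializedSizeOfTimes(int rowSize) {
        long size = (long) rowSize * Long.BYTES;
        return Math.toIntExact(size); // throws ArithmeticException on overflow
    }

    public static void main(String[] args) {
        System.out.println(serializedSizeOfTimes(3)); // 24

        try {
            serializedSizeOfTimes(Integer.MAX_VALUE);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```

An alternative would be to change the return type to `long`, but that would ripple into every caller, so the checked narrowing may be the smaller change.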