jmestwa-coder commented on code in PR #50025:
URL: https://github.com/apache/arrow/pull/50025#discussion_r3415850270
##########
cpp/src/parquet/types.cc:
##########
@@ -184,6 +184,43 @@ std::string FormatFloat16Value(::std::string_view val) {
std::string FormatStatValue(Type::type parquet_type, ::std::string_view val,
const std::shared_ptr<const LogicalType>&
logical_type) {
+ // Statistics blobs come from the file's Thrift-encoded metadata and may have
+ // arbitrary length under a malicious writer. Reject any value that is
shorter
+ // than what the physical (and, for FLOAT16, logical) type requires so the
+ // memcpy/byte loads below cannot run past the buffer.
+ size_t required = 0;
+ switch (parquet_type) {
+ case Type::BOOLEAN:
+ required = sizeof(bool);
+ break;
+ case Type::INT32:
+ case Type::FLOAT:
+ required = 4;
+ break;
+ case Type::INT64:
+ case Type::DOUBLE:
+ required = 8;
+ break;
+ case Type::INT96:
+ required = 3 * sizeof(int32_t);
+ break;
+ case Type::BYTE_ARRAY:
+ case Type::FIXED_LEN_BYTE_ARRAY:
+ if (logical_type != nullptr && logical_type->is_float16()) {
Review Comment:
Fair. Full per-length validation needs the column type_length, and since
FormatStatValue is public I left the signature alone, so the guard only covers
the types that do an unconditional fixed-width load
(BOOLEAN/INT32/INT64/INT96/FLOAT/DOUBLE and the float16 FLBA), which are the
ones that actually over-read. The variable-length FLBA (decimal/string) and
BYTE_ARRAY paths are already size-driven and stay in bounds. If you'd rather
thread the type_length through a new overload to cover everything, I'm happy to
go that way.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]