wgtmac commented on code in PR #50025:
URL: https://github.com/apache/arrow/pull/50025#discussion_r3308017566


##########
cpp/src/parquet/types.cc:
##########
@@ -184,6 +184,43 @@ std::string FormatFloat16Value(::std::string_view val) {
 
 std::string FormatStatValue(Type::type parquet_type, ::std::string_view val,
                             const std::shared_ptr<const LogicalType>& 
logical_type) {
+  // Statistics blobs come from the file's Thrift-encoded metadata and may have
+  // arbitrary length under a malicious writer. Reject any value that is 
shorter
+  // than what the physical (and, for FLOAT16, logical) type requires so the
+  // memcpy/byte loads below cannot run past the buffer.
+  size_t required = 0;
+  switch (parquet_type) {
+    case Type::BOOLEAN:
+      required = sizeof(bool);
+      break;
+    case Type::INT32:
+    case Type::FLOAT:
+      required = 4;
+      break;
+    case Type::INT64:
+    case Type::DOUBLE:
+      required = 8;
+      break;
+    case Type::INT96:
+      required = 3 * sizeof(int32_t);
+      break;
+    case Type::BYTE_ARRAY:
+    case Type::FIXED_LEN_BYTE_ARRAY:
+      if (logical_type != nullptr && logical_type->is_float16()) {

Review Comment:
   I admit that it is better than nothing to check types with known sizes. It 
still looks confusing with this partial support. If we do want to add the 
check, I think we need to support all types. We can either pass the type 
instance instead of only the enum or pass the type length to `FormatStatValue`. 
The challenge is that `FormatStatValue` is a public API so we cannot changing 
its signature without breaking backward compatibility.



##########
cpp/src/parquet/types.cc:
##########
@@ -184,6 +184,43 @@ std::string FormatFloat16Value(::std::string_view val) {
 
 std::string FormatStatValue(Type::type parquet_type, ::std::string_view val,
                             const std::shared_ptr<const LogicalType>& 
logical_type) {
+  // Statistics blobs come from the file's Thrift-encoded metadata and may have
+  // arbitrary length under a malicious writer. Reject any value that is 
shorter
+  // than what the physical (and, for FLOAT16, logical) type requires so the
+  // memcpy/byte loads below cannot run past the buffer.
+  size_t required = 0;
+  switch (parquet_type) {
+    case Type::BOOLEAN:
+      required = sizeof(bool);
+      break;
+    case Type::INT32:
+    case Type::FLOAT:
+      required = 4;
+      break;
+    case Type::INT64:
+    case Type::DOUBLE:
+      required = 8;
+      break;
+    case Type::INT96:
+      required = 3 * sizeof(int32_t);
+      break;
+    case Type::BYTE_ARRAY:
+    case Type::FIXED_LEN_BYTE_ARRAY:
+      if (logical_type != nullptr && logical_type->is_float16()) {

Review Comment:
   BTW, we need to create an issue for changes like this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to