steveloughran commented on code in PR #3725:
URL: https://github.com/apache/avro/pull/3725#discussion_r3156697095


##########
lang/java/avro/src/main/java/org/apache/avro/util/ByteBufferInputStream.java:
##########
@@ -65,6 +65,18 @@ public int read(byte[] b, int off, int len) throws 
IOException {
     }
   }
 
+  @Override
+  public int available() throws IOException {

Review Comment:
   there's this real oddness with available(), in that some interpret it as 
"all that is left in the stream" but it can also be interpreted as "bytes you 
can read() before blocking for new data". that's probably the correct one.
   
   it does hold here, just important not to use available() as measures of how 
much is left in a stream, which may be larger. looks like you are doing the 
right thing and testing it later on.



##########
lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumReader.java:
##########
@@ -361,11 +381,40 @@ protected Object readMap(Object old, Schema expected, 
ResolvingDecoder in) throw
             addToMap(map, readMapKey(null, expected, in), 
readWithoutConversion(null, eValue, in));
           }
         }
-      } while ((l = in.mapNext()) > 0);
+      } while ((l = mapNext(in, eValue)) > 0);
     }
     return map;
   }
 
+  /**
+   * Reads the next map block count and validates remaining bytes before the
+   * caller allocates storage.
+   */
+  private long mapNext(ResolvingDecoder in, Schema valueType) throws 
IOException {
+    long l = in.mapNext();
+    if (l > 0) {
+      ensureAvailableMapBytes(in, l, valueType);
+    }
+    return l;
+  }
+
+  /**
+   * Validates remaining bytes for a map block. Each map entry has a string key
+   * (at least 1 byte for the length varint) plus a value, so the minimum bytes
+   * per entry is {@code 1 + minBytesPerElement(valueSchema)}.
+   */
+  private static void ensureAvailableMapBytes(Decoder decoder, long count, 
Schema valueSchema) throws EOFException {
+    // Map keys are always strings: at least 1 byte for the length varint
+    int minBytesPerEntry = 1 + minBytesPerElement(valueSchema);

Review Comment:
   go with copilot and use longs here to avoid problems on very large files



##########
lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumReader.java:
##########
@@ -384,6 +433,73 @@ protected void addToMap(Object map, Object key, Object 
value) {
     ((Map) map).put(key, value);
   }
 
+  /**
+   * Returns the minimum number of bytes required to encode a single value of 
the
+   * given schema in Avro binary format. Used to validate that the decoder has
+   * enough data remaining before allocating collection backing arrays.
+   * <p>
+   * Returns 0 for types whose binary encoding is empty ({@code null}, 
zero-length
+   * {@code fixed}, records with only zero-byte fields). Returns a positive 
value
+   * for all other types.
+   */
+  static int minBytesPerElement(Schema schema) {
+    return minBytesPerElement(schema, Collections.newSetFromMap(new 
IdentityHashMap<>()));
+  }
+
+  private static int minBytesPerElement(Schema schema, Set<Schema> visited) {
+    switch (schema.getType()) {
+    case NULL:
+      return 0;
+    case FIXED:
+      return schema.getFixedSize();
+    case FLOAT:
+      return 4;
+    case DOUBLE:
+      return 8;
+    case RECORD:
+      if (!visited.add(schema)) {
+        return 0; // break recursion for self-referencing schemas
+      }
+      long sum = 0;
+      for (Schema.Field f : schema.getFields()) {

Review Comment:
   I worry about the cost of this operation on a complex wide and recursive 
structure, as it'll be invoked once per record.



##########
lang/java/avro/src/main/java/org/apache/avro/generic/GenericDatumReader.java:
##########
@@ -306,13 +311,25 @@ protected Object readArray(Object old, Schema expected, 
ResolvingDecoder in) thr
           }
         }
         base += l;
-      } while ((l = in.arrayNext()) > 0);
+      } while ((l = arrayNext(in, expectedType)) > 0);
       return pruneArray(array);
     } else {
       return pruneArray(newArray(old, 0, expected));
     }
   }
 
+  /**
+   * Reads the next array block count and validates remaining bytes before the
+   * caller allocates storage.
+   */
+  private long arrayNext(ResolvingDecoder in, Schema elementType) throws 
IOException {
+    long l = in.arrayNext();
+    if (l > 0) {
+      ensureAvailableCollectionBytes(in, l, elementType);

Review Comment:
   if this a no-op on l <= 0 then you wouldn't need to guard all the uses



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to