aokolnychyi commented on code in PR #8755:
URL: https://github.com/apache/iceberg/pull/8755#discussion_r1445928357


##########
api/src/main/java/org/apache/iceberg/types/TypeUtil.java:
##########
@@ -452,6 +454,68 @@ private static void checkSchemaCompatibility(
     }
   }
 
+  /**
+   * Estimates the number of bytes a value for a given field may occupy in memory.
+   *
+   * <p>This method approximates the memory size based on the internal Java representation defined
+   * by {@link Type.TypeID}. It is important to note that the actual size might differ from this
+   * estimation. The method is designed to handle a variety of data types, including primitive
+   * types, strings, and nested types such as structs, maps, and lists.
+   *
+   * @param field a field for which to estimate the size
+   * @return the estimated size in bytes of the field's value in memory
+   */
+  public static long defaultSize(Types.NestedField field) {
+    return defaultSize(field.type());
+  }
+
+  private static long defaultSize(Type type) {
+    switch (type.typeId()) {
+      case BOOLEAN:
+        // the size of a boolean variable is virtual machine dependent
+        // it is common to believe booleans occupy 1 byte in most JVMs
+        return 1;
+      case INTEGER:
+      case FLOAT:
+      case DATE:
+        // ints and floats occupy 4 bytes
+        // dates are internally represented as ints
+        return 4;
+      case LONG:
+      case DOUBLE:
+      case TIME:
+      case TIMESTAMP:
+        // longs and doubles occupy 8 bytes
+        // times and timestamps are internally represented as longs
+        return 8;
+      case STRING:
+        // 12 (header) + 12 (fields) + 16 (array overhead) + 20 (10 chars, 2 bytes each) = 60 bytes
+        return 60;
+      case UUID:
+        // 12 (header) + 16 (two long variables) = 28 bytes
+        return 28;
+      case FIXED:
+        return ((Types.FixedType) type).length();
+      case BINARY:
+        return 100;
+      case DECIMAL:
+        // 12 (header) + (12 + 12 + 4) (BigInteger) + 4 (scale) = 44 bytes
+        return 44;
+      case STRUCT:
+        Types.StructType struct = (Types.StructType) type;
+        return OBJECT_HEADER + struct.fields().stream().mapToLong(TypeUtil::defaultSize).sum();
+      case LIST:
+        Types.ListType list = (Types.ListType) type;
+        return OBJECT_HEADER + 5 * defaultSize(list.elementType());
+      case MAP:
+        Types.MapType map = (Types.MapType) type;
+        long entrySize = OBJECT_HEADER + defaultSize(map.keyType()) + defaultSize(map.valueType());
+        return OBJECT_HEADER + 5 * entrySize;
+      default:
+        return 16;

Review Comment:
   I don't think we should fail queries in such cases. This is only an estimate,
   so it is better to return an imprecise value than to fail the query.
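
A minimal sketch of the trade-off, for illustration only. The `TypeId` enum, `FALLBACK_SIZE`, and `estimateRowSize` below are hypothetical stand-ins and are not Iceberg's actual classes or API; only the fallback value of 16 and the per-type sizes are taken from the diff above. The point is that a type the switch does not recognize contributes a rough constant to the total instead of throwing.

```java
import java.util.List;

public class SizeEstimateSketch {
  // Hypothetical stand-in for Type.TypeID; not the real Iceberg enum.
  enum TypeId { BOOLEAN, INTEGER, LONG, STRING, UNKNOWN_FUTURE_TYPE }

  // Rough constant for unrecognized types (16 mirrors the default branch in the diff).
  static final long FALLBACK_SIZE = 16;

  static long defaultSize(TypeId id) {
    switch (id) {
      case BOOLEAN:
        return 1;
      case INTEGER:
        return 4;
      case LONG:
        return 8;
      case STRING:
        return 60;
      default:
        // An imprecise estimate is acceptable here; throwing would fail the caller's query.
        return FALLBACK_SIZE;
    }
  }

  // Sums the per-column estimates; never throws for an unrecognized type.
  static long estimateRowSize(List<TypeId> columns) {
    return columns.stream().mapToLong(SizeEstimateSketch::defaultSize).sum();
  }

  public static void main(String[] args) {
    List<TypeId> columns =
        List.of(TypeId.INTEGER, TypeId.STRING, TypeId.UNKNOWN_FUTURE_TYPE);
    System.out.println(estimateRowSize(columns)); // 4 + 60 + 16 = 80
  }
}
```

Had the default branch thrown an exception instead, the same call would abort for any schema containing the unrecognized type, even though the caller only needs an approximate figure.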



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
