kevinjqliu commented on code in PR #986:
URL: https://github.com/apache/iceberg-python/pull/986#discussion_r1700560371
##########
pyiceberg/io/__init__.py:
##########
@@ -80,6 +80,7 @@
GCS_ENDPOINT = "gcs.endpoint"
GCS_DEFAULT_LOCATION = "gcs.default-bucket-location"
GCS_VERSION_AWARE = "gcs.version-aware"
+PYARROW_USE_LARGE_TYPES_ON_READ = "pyarrow.use-large-types-on-read"
Review Comment:
nit: if this is a PyArrow-specific setting, let's move it to the pyarrow file
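A minimal sketch of the move (assuming nothing outside `pyiceberg/io/pyarrow.py` needs to import the constant):

```python
# pyiceberg/io/pyarrow.py -- sketch only; keeps the PyArrow-specific knob
# next to the reader code that consumes it
PYARROW_USE_LARGE_TYPES_ON_READ = "pyarrow.use-large-types-on-read"
```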
##########
pyiceberg/io/pyarrow.py:
##########
@@ -877,8 +883,8 @@ def _(obj: Union[pa.ListType, pa.LargeListType, pa.FixedSizeListType], visitor:
    visitor.before_list_element(obj.value_field)
    result = visit_pyarrow(obj.value_type, visitor)
    visitor.after_list_element(obj.value_field)
-
-    return visitor.list(obj, result)
+    ret = visitor.list(obj, result)
Review Comment:
nit: was this for debugging?
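If it was, the straight return could be restored, e.g.:

```python
# sketch: drop the temporary variable and return directly, as before
return visitor.list(obj, result)
```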
##########
tests/io/test_pyarrow_visitor.py:
##########
@@ -596,6 +597,11 @@ def test_pyarrow_schema_ensure_large_types(pyarrow_schema_nested_without_ids: pa
    assert _pyarrow_schema_ensure_large_types(pyarrow_schema_nested_without_ids) == expected_schema
+def test_pyarrow_schema_ensure_small_types(pyarrow_schema_nested_without_ids: pa.Schema) -> None:
+    schema_with_large_types = _pyarrow_schema_ensure_small_types(pyarrow_schema_nested_without_ids)
Review Comment:
nit: the variable name says large_types, but the function is small_types.
What is this function testing for?
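One way to make the intent explicit (a hypothetical round-trip sketch; it assumes the fixture uses only small types, so shrinking after widening should recover the original):

```python
def test_pyarrow_schema_ensure_small_types_round_trip(pyarrow_schema_nested_without_ids: pa.Schema) -> None:
    # widen everything to large types, then shrink back
    round_tripped = _pyarrow_schema_ensure_small_types(
        _pyarrow_schema_ensure_large_types(pyarrow_schema_nested_without_ids)
    )
    # if the fixture starts out with small types, we should get it back unchanged
    assert round_tripped == pyarrow_schema_nested_without_ids
```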
##########
tests/integration/test_reads.py:
##########
@@ -663,3 +668,55 @@ def another_task() -> None:
table.transaction().set_properties(lock="xxx").commit_transaction()
assert table.properties.get("lock") == "xxx"
+
+
[email protected]
[email protected]("catalog",
[pytest.lazy_fixture("session_catalog_hive"),
pytest.lazy_fixture("session_catalog")])
+def test_table_scan_with_large_types(catalog: Catalog) -> None:
Review Comment:
nit: test_table_scan_default_to_large_types
##########
pyiceberg/io/pyarrow.py:
##########
@@ -1146,6 +1152,31 @@ def primitive(self, primitive: pa.DataType) -> pa.DataType:
        return primitive
+class _ConvertToSmallTypes(PyArrowSchemaVisitor[Union[pa.DataType, pa.Schema]]):
+    def schema(self, schema: pa.Schema, struct_result: pa.StructType) -> pa.Schema:
Review Comment:
nit: this looks like the same function definition as the one in `_ConvertToLargeTypes`, as do the other functions here.
Perhaps abstract these into a common base class and override only the functions that differ.
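A sketch of what that could look like (class names and the exact `primitive` mappings are illustrative, not lifted from the PR; assumes it lives in `pyiceberg/io/pyarrow.py`, where `PyArrowSchemaVisitor` is defined):

```python
from typing import List, Union

import pyarrow as pa


class _ConvertArrowTypes(PyArrowSchemaVisitor[Union[pa.DataType, pa.Schema]]):
    """Shared structural recursion; subclasses override only the hooks that differ."""

    def schema(self, schema: pa.Schema, struct_result: pa.StructType) -> pa.Schema:
        return pa.schema(struct_result)

    def struct(self, struct: pa.StructType, field_results: List[pa.Field]) -> pa.StructType:
        return pa.struct(field_results)

    def field(self, field: pa.Field, field_result: pa.DataType) -> pa.Field:
        return field.with_type(field_result)


class _ConvertToLargeTypes(_ConvertArrowTypes):
    def list(self, list_type: pa.ListType, element_result: pa.DataType) -> pa.DataType:
        return pa.large_list(element_result)

    def primitive(self, primitive: pa.DataType) -> pa.DataType:
        if primitive == pa.string():
            return pa.large_string()
        elif primitive == pa.binary():
            return pa.large_binary()
        return primitive


class _ConvertToSmallTypes(_ConvertArrowTypes):
    def list(self, list_type: pa.ListType, element_result: pa.DataType) -> pa.DataType:
        return pa.list_(element_result)

    def primitive(self, primitive: pa.DataType) -> pa.DataType:
        if primitive == pa.large_string():
            return pa.string()
        elif primitive == pa.large_binary():
            return pa.binary()
        return primitive
```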
##########
tests/integration/test_reads.py:
##########
@@ -663,3 +668,55 @@ def another_task() -> None:
table.transaction().set_properties(lock="xxx").commit_transaction()
assert table.properties.get("lock") == "xxx"
+
+
[email protected]
[email protected]("catalog",
[pytest.lazy_fixture("session_catalog_hive"),
pytest.lazy_fixture("session_catalog")])
+def test_table_scan_with_large_types(catalog: Catalog) -> None:
+ identifier = "default.test_table_scan_with_large_types"
+ arrow_table = pa.Table.from_arrays(
+ [pa.array(["a", "b", "c"]), pa.array([b"a", b"b", b"c"]),
pa.array([["a", "b"], ["c", "d"], ["e", "f"]])],
+ names=["string", "binary", "list"],
+ )
+
+ try:
+ catalog.drop_table(identifier)
+ except NoSuchTableError:
+ pass
+
+ tbl = catalog.create_table(
+ identifier,
+ schema=arrow_table.schema,
+ )
+
+ tbl.append(arrow_table)
+
+ result_table = tbl.scan().to_arrow()
+
+ assert
result_table.schema.equals(_pyarrow_schema_ensure_large_types(arrow_table.schema))
Review Comment:
nit: check specifically for `large_string` / `large_binary` / `large_list`,
instead of relying solely on `_pyarrow_schema_ensure_large_types`
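e.g., a sketch using the column names from this test:

```python
# assert the concrete large types directly, rather than re-deriving the schema
assert pa.types.is_large_string(result_table.schema.field("string").type)
assert pa.types.is_large_binary(result_table.schema.field("binary").type)
assert pa.types.is_large_list(result_table.schema.field("list").type)
```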
##########
tests/integration/test_reads.py:
##########
@@ -663,3 +668,55 @@ def another_task() -> None:
table.transaction().set_properties(lock="xxx").commit_transaction()
assert table.properties.get("lock") == "xxx"
+
+
[email protected]
[email protected]("catalog",
[pytest.lazy_fixture("session_catalog_hive"),
pytest.lazy_fixture("session_catalog")])
+def test_table_scan_with_large_types(catalog: Catalog) -> None:
+ identifier = "default.test_table_scan_with_large_types"
+ arrow_table = pa.Table.from_arrays(
+ [pa.array(["a", "b", "c"]), pa.array([b"a", b"b", b"c"]),
pa.array([["a", "b"], ["c", "d"], ["e", "f"]])],
+ names=["string", "binary", "list"],
+ )
+
+ try:
+ catalog.drop_table(identifier)
+ except NoSuchTableError:
+ pass
+
+ tbl = catalog.create_table(
+ identifier,
+ schema=arrow_table.schema,
+ )
+
+ tbl.append(arrow_table)
+
+ result_table = tbl.scan().to_arrow()
+
+ assert
result_table.schema.equals(_pyarrow_schema_ensure_large_types(arrow_table.schema))
+
+
[email protected]
[email protected]("catalog",
[pytest.lazy_fixture("session_catalog_hive"),
pytest.lazy_fixture("session_catalog")])
+def test_table_scan_with_small_types(catalog: Catalog) -> None:
Review Comment:
nit: test_table_scan_override_with_small_types
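And the small-types test could then mirror the explicit checks suggested above (sketch; how the `pyarrow.use-large-types-on-read` property is supplied is whatever this PR wires up, which isn't visible in this hunk):

```python
# with PYARROW_USE_LARGE_TYPES_ON_READ set to false, the scan should
# come back with the small Arrow types
assert pa.types.is_string(result_table.schema.field("string").type)
assert pa.types.is_binary(result_table.schema.field("binary").type)
assert pa.types.is_list(result_table.schema.field("list").type)
```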
##########
pyiceberg/io/pyarrow.py:
##########
@@ -1146,6 +1152,31 @@ def primitive(self, primitive: pa.DataType) -> pa.DataType:
        return primitive
+class _ConvertToSmallTypes(PyArrowSchemaVisitor[Union[pa.DataType, pa.Schema]]):
+    def schema(self, schema: pa.Schema, struct_result: pa.StructType) -> pa.Schema:
+        return pa.schema(struct_result)
+
+    def struct(self, struct: pa.StructType, field_results: List[pa.Field]) -> pa.StructType:
+        return pa.struct(field_results)
+
+    def field(self, field: pa.Field, field_result: pa.DataType) -> pa.Field:
+        return field.with_type(field_result)
+
+    def list(self, list_type: pa.ListType, element_result: pa.DataType) -> pa.DataType:
+        print("DEBUG")
Review Comment:
nit: remove print
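Presumably the method then just builds the small list type; a sketch of what remains after dropping the print (the actual return is cut off in this hunk):

```python
def list(self, list_type: pa.ListType, element_result: pa.DataType) -> pa.DataType:
    # pa.list_ is the plausible small-type counterpart of pa.large_list
    return pa.list_(element_result)
```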