Re: [PR] Add Refs metadata table [iceberg-python]

2024-04-17 Thread via GitHub


Fokko commented on code in PR #602:
URL: https://github.com/apache/iceberg-python/pull/602#discussion_r1568666707


##
tests/integration/test_inspect_table.py:
##
@@ -274,6 +274,64 @@ def test_inspect_entries_partitioned(spark: SparkSession, 
session_catalog: Catal
 
 @pytest.mark.integration
 @pytest.mark.parametrize("format_version", [1, 2])
+def test_inspect_refs(
+spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: 
pa.Table, format_version: int
+) -> None:
+identifier = "default.table_metadata_refs"
+tbl = _create_table(session_catalog, identifier, 
properties={"format-version": format_version})
+
+# write data to create snapshot
+tbl.overwrite(arrow_table_with_null)
+
+# create a test branch
+spark.sql(
+f"""
+ALTER TABLE {identifier} CREATE BRANCH IF NOT EXISTS testBranch RETAIN 7 
DAYS WITH SNAPSHOT RETENTION 2 SNAPSHOTS
+"""
+)
+
+# create a test tag against current snapshot
+current_snapshot = tbl.current_snapshot()
+assert current_snapshot is not None
+current_snapshot_id = current_snapshot.snapshot_id
+
+spark.sql(
+f"""
+ALTER TABLE {identifier} CREATE TAG testTag AS OF VERSION 
{current_snapshot_id} RETAIN 180 DAYS
+"""
+)
+
+df = tbl.refresh().inspect.refs()
+
+assert df.column_names == [
+'name',
+'type',
+'snapshot_id',
+'max_reference_age_in_ms',
+'min_snapshots_to_keep',
+'max_snapshot_age_in_ms',
+]
+
+assert [name.as_py() for name in df['name']] == ['testBranch', 'main', 
'testTag']
+assert [ref_type.as_py() for ref_type in df['type']] == ['BRANCH', 
'BRANCH', 'TAG']
+
+for snapshot_id in df['snapshot_id']:
+assert isinstance(snapshot_id.as_py(), int)
+
+for int_column in ['max_reference_age_in_ms', 'min_snapshots_to_keep', 
'max_snapshot_age_in_ms']:
+for value in df[int_column]:
+assert isinstance(value.as_py(), int) or not value.as_py()
+
+lhs = spark.table(f"{identifier}.refs").toPandas()
+rhs = df.to_pandas()
+for column in df.column_names:
+for left, right in zip(lhs[column].to_list(), rhs[column].to_list()):
+if isinstance(left, float) and math.isnan(left) and 
isinstance(right, float) and math.isnan(right):
+# NaN != NaN in Python
+continue
+assert left == right, f"Difference in column {column}: {left} != 
{right}"
+
+

Review Comment:
   ```suggestion
   
   @pytest.mark.integration
   @pytest.mark.parametrize("format_version", [1, 2])
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Refs metadata table [iceberg-python]

2024-04-17 Thread via GitHub


Fokko commented on PR #602:
URL: https://github.com/apache/iceberg-python/pull/602#issuecomment-2061026458

   Thanks @geruh for working on this 🙌 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Refs metadata table [iceberg-python]

2024-04-17 Thread via GitHub


Fokko merged PR #602:
URL: https://github.com/apache/iceberg-python/pull/602


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Refs metadata table [iceberg-python]

2024-04-16 Thread via GitHub


Fokko commented on PR #602:
URL: https://github.com/apache/iceberg-python/pull/602#issuecomment-2059988471

   @geruh Thanks for creating this PR! Can you resolve the conflicts so we can 
get this in? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Refs metadata table [iceberg-python]

2024-04-16 Thread via GitHub


Fokko commented on code in PR #602:
URL: https://github.com/apache/iceberg-python/pull/602#discussion_r1567985291


##
pyiceberg/table/__init__.py:
##
@@ -3410,6 +3410,32 @@ def _readable_metrics_struct(bound_type: PrimitiveType) 
-> pa.StructType:
 schema=entries_schema,
 )
 
+def refs(self) -> "pa.Table":
+import pyarrow as pa
+
+ref_schema = pa.schema([
+pa.field('name', pa.string(), nullable=False),
+pa.field('type', pa.string(), nullable=False),

Review Comment:
   Thanks, and yes, I agree that it is okay to diverge from the Java 
implementation here since it is just an implementation detail.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Refs metadata table [iceberg-python]

2024-04-14 Thread via GitHub


geruh commented on code in PR #602:
URL: https://github.com/apache/iceberg-python/pull/602#discussion_r1565252853


##
tests/integration/test_inspect_table.py:
##
@@ -266,3 +266,56 @@ def test_inspect_entries_partitioned(spark: SparkSession, 
session_catalog: Catal
 
 assert df.to_pydict()['data_file'][0]['partition'] == {'dt_day': 
date(2021, 2, 1), 'dt_month': None}
 assert df.to_pydict()['data_file'][1]['partition'] == {'dt_day': None, 
'dt_month': 612}
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize("format_version", [1, 2])
+def test_inspect_refs(
+spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: 
pa.Table, format_version: int
+) -> None:
+identifier = "default.table_metadata_refs"
+tbl = _create_table(session_catalog, identifier, 
properties={"format-version": format_version})
+
+# write data to create snapshot
+tbl.overwrite(arrow_table_with_null)
+
+# create a test branch
+spark.sql(
+f"""
+ALTER TABLE {identifier} CREATE BRANCH IF NOT EXISTS testBranch RETAIN 7 
DAYS WITH SNAPSHOT RETENTION 2 SNAPSHOTS
+"""
+)
+
+# create a test tag against current snapshot
+current_snapshot = tbl.current_snapshot()
+assert current_snapshot is not None
+current_snapshot_id = current_snapshot.snapshot_id
+
+spark.sql(
+f"""
+ALTER TABLE {identifier} CREATE TAG testTag AS OF VERSION 
{current_snapshot_id} RETAIN 180 DAYS
+"""
+)
+
+df = tbl.refresh().inspect.refs()
+
+assert df.column_names == [
+'name',
+'type',
+'snapshot_id',
+'max_reference_age_in_ms',
+'min_snapshots_to_keep',
+'max_snapshot_age_in_ms',
+]
+
+for snapshot_id in df['snapshot_id']:
+assert isinstance(snapshot_id.as_py(), int)

Review Comment:
   Yeah I'll update to test all the output



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Refs metadata table [iceberg-python]

2024-04-14 Thread via GitHub


geruh commented on code in PR #602:
URL: https://github.com/apache/iceberg-python/pull/602#discussion_r1565238973


##
pyiceberg/table/__init__.py:
##
@@ -3410,6 +3410,32 @@ def _readable_metrics_struct(bound_type: PrimitiveType) 
-> pa.StructType:
 schema=entries_schema,
 )
 
+def refs(self) -> "pa.Table":
+import pyarrow as pa
+
+ref_schema = pa.schema([
+pa.field('name', pa.string(), nullable=False),
+pa.field('type', pa.string(), nullable=False),

Review Comment:
   Hey Fokko, I looked into this for both pandas and pyarrow and it offers a 
few benefits like reduced memory usage and improved performance on sorting and 
filtering. Since we're only dealing with `branch` or `tag`, this is ideal. 
However, this does stray from the Java implementation which uses a [string 
type](https://github.com/apache/iceberg/blob/fb657b413e2bb7f6c5e2c78465173df0426d3527/core/src/main/java/org/apache/iceberg/RefsTable.java#L36).
 But I think we should be fine with this difference, because in both pyarrow 
and pandas, this type seems to be well supported and can be easily converted to 
string if needed. Also, this would be beneficial for huge tables with many 
references. 
   
   In PyArrow, this would be implemented as a dictionary type mapping integers 
to strings. Also, this could be added to the other metadata tables such as the 
operation field in the snapshots table.



##
pyiceberg/table/__init__.py:
##
@@ -3410,6 +3410,32 @@ def _readable_metrics_struct(bound_type: PrimitiveType) 
-> pa.StructType:
 schema=entries_schema,
 )
 
+def refs(self) -> "pa.Table":
+import pyarrow as pa
+
+ref_schema = pa.schema([
+pa.field('name', pa.string(), nullable=False),
+pa.field('type', pa.string(), nullable=False),

Review Comment:
   Hey Fokko, I looked into this for both pandas and pyarrow and it offers a 
few benefits like reduced memory usage and improved performance on sorting and 
filtering. Since we're only dealing with `branch` or `tag`, this is ideal. 
However, this does stray from the Java implementation which uses a [string 
type](https://github.com/apache/iceberg/blob/fb657b413e2bb7f6c5e2c78465173df0426d3527/core/src/main/java/org/apache/iceberg/RefsTable.java#L36).
 But I think we should be fine with this difference, because in both pyarrow 
and pandas, this type seems to be well supported and can be easily converted to 
string if needed. Also, this would be beneficial for huge tables with many 
references. 
   
   In pyarrow, this would be implemented as a dictionary type mapping integers 
to strings. Also, this could be added to the other metadata tables such as the 
operation field in the snapshots table.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Refs metadata table [iceberg-python]

2024-04-13 Thread via GitHub


amogh-jahagirdar commented on code in PR #602:
URL: https://github.com/apache/iceberg-python/pull/602#discussion_r1564265956


##
tests/integration/test_inspect_table.py:
##
@@ -266,3 +266,56 @@ def test_inspect_entries_partitioned(spark: SparkSession, 
session_catalog: Catal
 
 assert df.to_pydict()['data_file'][0]['partition'] == {'dt_day': 
date(2021, 2, 1), 'dt_month': None}
 assert df.to_pydict()['data_file'][1]['partition'] == {'dt_day': None, 
'dt_month': 612}
+
+
+@pytest.mark.integration
+@pytest.mark.parametrize("format_version", [1, 2])
+def test_inspect_refs(
+spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: 
pa.Table, format_version: int
+) -> None:
+identifier = "default.table_metadata_refs"
+tbl = _create_table(session_catalog, identifier, 
properties={"format-version": format_version})
+
+# write data to create snapshot
+tbl.overwrite(arrow_table_with_null)
+
+# create a test branch
+spark.sql(
+f"""
+ALTER TABLE {identifier} CREATE BRANCH IF NOT EXISTS testBranch RETAIN 7 
DAYS WITH SNAPSHOT RETENTION 2 SNAPSHOTS
+"""
+)
+
+# create a test tag against current snapshot
+current_snapshot = tbl.current_snapshot()
+assert current_snapshot is not None
+current_snapshot_id = current_snapshot.snapshot_id
+
+spark.sql(
+f"""
+ALTER TABLE {identifier} CREATE TAG testTag AS OF VERSION 
{current_snapshot_id} RETAIN 180 DAYS
+"""
+)
+
+df = tbl.refresh().inspect.refs()
+
+assert df.column_names == [
+'name',
+'type',
+'snapshot_id',
+'max_reference_age_in_ms',
+'min_snapshots_to_keep',
+'max_snapshot_age_in_ms',
+]
+
+for snapshot_id in df['snapshot_id']:
+assert isinstance(snapshot_id.as_py(), int)

Review Comment:
   Nit: Why specifically only check the type for snapshot_id? If we want to 
validate the types in the dataframe should we just check all of them? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Refs metadata table [iceberg-python]

2024-04-13 Thread via GitHub


Fokko commented on code in PR #602:
URL: https://github.com/apache/iceberg-python/pull/602#discussion_r1564223181


##
pyiceberg/table/__init__.py:
##
@@ -3410,6 +3410,32 @@ def _readable_metrics_struct(bound_type: PrimitiveType) 
-> pa.StructType:
 schema=entries_schema,
 )
 
+def refs(self) -> "pa.Table":
+import pyarrow as pa
+
+ref_schema = pa.schema([
+pa.field('name', pa.string(), nullable=False),
+pa.field('type', pa.string(), nullable=False),

Review Comment:
   What do you think of changing this field into a categorical? 
https://pandas.pydata.org/docs/user_guide/categorical.html#series-creation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Add Refs metadata table [iceberg-python]

2024-04-12 Thread via GitHub


geruh opened a new pull request, #602:
URL: https://github.com/apache/iceberg-python/pull/602

   This PR adds the Refs metadata table the existing inspect logic for Iceberg 
tables as listed in #511. The refs metadata table in Iceberg stores the table's 
known snapshot references including branches, and tags.  
   
   ### Usage
   
   ```
   > table.inspect.refs()
   ```
   
   ```
   pyarrow.Table
   name: string not null
   type: string not null
   snapshot_id: int64 not null
   max_reference_age_in_ms: int64
   min_snapshots_to_keep: int32
   max_snapshot_age_in_ms: int64
   
   name: [["testBranch","main","testTag"]]
   type: [["BRANCH","BRANCH","TAG"]]
   snapshot_id: [[4549716733136159472,4549716733136159472,4549716733136159472]]
   max_reference_age_in_ms: [[60480,null,1555200]]
   min_snapshots_to_keep: [[2,null,null]]
   max_snapshot_age_in_ms: [[null,null,null]]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org