Re: [PR] Add Refs metadata table [iceberg-python]
Fokko commented on code in PR #602: URL: https://github.com/apache/iceberg-python/pull/602#discussion_r1568666707 ## tests/integration/test_inspect_table.py: ## @@ -274,6 +274,64 @@ def test_inspect_entries_partitioned(spark: SparkSession, session_catalog: Catal @pytest.mark.integration @pytest.mark.parametrize("format_version", [1, 2]) +def test_inspect_refs( +spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table, format_version: int +) -> None: +identifier = "default.table_metadata_refs" +tbl = _create_table(session_catalog, identifier, properties={"format-version": format_version}) + +# write data to create snapshot +tbl.overwrite(arrow_table_with_null) + +# create a test branch +spark.sql( +f""" +ALTER TABLE {identifier} CREATE BRANCH IF NOT EXISTS testBranch RETAIN 7 DAYS WITH SNAPSHOT RETENTION 2 SNAPSHOTS +""" +) + +# create a test tag against current snapshot +current_snapshot = tbl.current_snapshot() +assert current_snapshot is not None +current_snapshot_id = current_snapshot.snapshot_id + +spark.sql( +f""" +ALTER TABLE {identifier} CREATE TAG testTag AS OF VERSION {current_snapshot_id} RETAIN 180 DAYS +""" +) + +df = tbl.refresh().inspect.refs() + +assert df.column_names == [ +'name', +'type', +'snapshot_id', +'max_reference_age_in_ms', +'min_snapshots_to_keep', +'max_snapshot_age_in_ms', +] + +assert [name.as_py() for name in df['name']] == ['testBranch', 'main', 'testTag'] +assert [ref_type.as_py() for ref_type in df['type']] == ['BRANCH', 'BRANCH', 'TAG'] + +for snapshot_id in df['snapshot_id']: +assert isinstance(snapshot_id.as_py(), int) + +for int_column in ['max_reference_age_in_ms', 'min_snapshots_to_keep', 'max_snapshot_age_in_ms']: +for value in df[int_column]: +assert isinstance(value.as_py(), int) or not value.as_py() + +lhs = spark.table(f"{identifier}.refs").toPandas() +rhs = df.to_pandas() +for column in df.column_names: +for left, right in zip(lhs[column].to_list(), rhs[column].to_list()): +if isinstance(left, float) and math.isnan(left) and isinstance(right, float) and math.isnan(right): +# NaN != NaN in Python +continue +assert left == right, f"Difference in column {column}: {left} != {right}" + + Review Comment: ```suggestion @pytest.mark.integration @pytest.mark.parametrize("format_version", [1, 2]) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Refs metadata table [iceberg-python]
Fokko commented on PR #602: URL: https://github.com/apache/iceberg-python/pull/602#issuecomment-2061026458 Thanks @geruh for working on this 🙌 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Refs metadata table [iceberg-python]
Fokko merged PR #602: URL: https://github.com/apache/iceberg-python/pull/602 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Refs metadata table [iceberg-python]
Fokko commented on PR #602: URL: https://github.com/apache/iceberg-python/pull/602#issuecomment-2059988471 @geruh Thanks for creating this PR! Can you resolve the conflicts so we can get this in? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Refs metadata table [iceberg-python]
Fokko commented on code in PR #602: URL: https://github.com/apache/iceberg-python/pull/602#discussion_r1567985291 ## pyiceberg/table/__init__.py: ## @@ -3410,6 +3410,32 @@ def _readable_metrics_struct(bound_type: PrimitiveType) -> pa.StructType: schema=entries_schema, ) +def refs(self) -> "pa.Table": +import pyarrow as pa + +ref_schema = pa.schema([ +pa.field('name', pa.string(), nullable=False), +pa.field('type', pa.string(), nullable=False), Review Comment: Thanks, and yes, I agree that it is okay to diverge from the Java implementation here since it is just an implementation detail. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Refs metadata table [iceberg-python]
geruh commented on code in PR #602: URL: https://github.com/apache/iceberg-python/pull/602#discussion_r1565252853 ## tests/integration/test_inspect_table.py: ## @@ -266,3 +266,56 @@ def test_inspect_entries_partitioned(spark: SparkSession, session_catalog: Catal assert df.to_pydict()['data_file'][0]['partition'] == {'dt_day': date(2021, 2, 1), 'dt_month': None} assert df.to_pydict()['data_file'][1]['partition'] == {'dt_day': None, 'dt_month': 612} + + +@pytest.mark.integration +@pytest.mark.parametrize("format_version", [1, 2]) +def test_inspect_refs( +spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table, format_version: int +) -> None: +identifier = "default.table_metadata_refs" +tbl = _create_table(session_catalog, identifier, properties={"format-version": format_version}) + +# write data to create snapshot +tbl.overwrite(arrow_table_with_null) + +# create a test branch +spark.sql( +f""" +ALTER TABLE {identifier} CREATE BRANCH IF NOT EXISTS testBranch RETAIN 7 DAYS WITH SNAPSHOT RETENTION 2 SNAPSHOTS +""" +) + +# create a test tag against current snapshot +current_snapshot = tbl.current_snapshot() +assert current_snapshot is not None +current_snapshot_id = current_snapshot.snapshot_id + +spark.sql( +f""" +ALTER TABLE {identifier} CREATE TAG testTag AS OF VERSION {current_snapshot_id} RETAIN 180 DAYS +""" +) + +df = tbl.refresh().inspect.refs() + +assert df.column_names == [ +'name', +'type', +'snapshot_id', +'max_reference_age_in_ms', +'min_snapshots_to_keep', +'max_snapshot_age_in_ms', +] + +for snapshot_id in df['snapshot_id']: +assert isinstance(snapshot_id.as_py(), int) Review Comment: Yeah I'll update to test all the output -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Refs metadata table [iceberg-python]
geruh commented on code in PR #602: URL: https://github.com/apache/iceberg-python/pull/602#discussion_r1565238973 ## pyiceberg/table/__init__.py: ## @@ -3410,6 +3410,32 @@ def _readable_metrics_struct(bound_type: PrimitiveType) -> pa.StructType: schema=entries_schema, ) +def refs(self) -> "pa.Table": +import pyarrow as pa + +ref_schema = pa.schema([ +pa.field('name', pa.string(), nullable=False), +pa.field('type', pa.string(), nullable=False), Review Comment: Hey Fokko, I looked into this for both pandas and pyarrow and it offers a few benefits like reduced memory usage and improved performance on sorting and filtering. Since we're only dealing with `branch` or `tag`, this is ideal. However, this does stray from the Java implementation which uses a [string type](https://github.com/apache/iceberg/blob/fb657b413e2bb7f6c5e2c78465173df0426d3527/core/src/main/java/org/apache/iceberg/RefsTable.java#L36). But I think we should be fine with this difference, because in both pyarrow and pandas, this type seems to be well supported and can be easily converted to string if needed. Also, this would be beneficial for huge tables with many references. In PyArrow, this would be implemented as a dictionary type mapping integers to strings. Also, this could be added to the other metadata tables such as the operation field in the snapshots table. ## pyiceberg/table/__init__.py: ## @@ -3410,6 +3410,32 @@ def _readable_metrics_struct(bound_type: PrimitiveType) -> pa.StructType: schema=entries_schema, ) +def refs(self) -> "pa.Table": +import pyarrow as pa + +ref_schema = pa.schema([ +pa.field('name', pa.string(), nullable=False), +pa.field('type', pa.string(), nullable=False), Review Comment: Hey Fokko, I looked into this for both pandas and pyarrow and it offers a few benefits like reduced memory usage and improved performance on sorting and filtering. Since we're only dealing with `branch` or `tag`, this is ideal. However, this does stray from the Java implementation which uses a [string type](https://github.com/apache/iceberg/blob/fb657b413e2bb7f6c5e2c78465173df0426d3527/core/src/main/java/org/apache/iceberg/RefsTable.java#L36). But I think we should be fine with this difference, because in both pyarrow and pandas, this type seems to be well supported and can be easily converted to string if needed. Also, this would be beneficial for huge tables with many references. In pyarrow, this would be implemented as a dictionary type mapping integers to strings. Also, this could be added to the other metadata tables such as the operation field in the snapshots table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Refs metadata table [iceberg-python]
amogh-jahagirdar commented on code in PR #602: URL: https://github.com/apache/iceberg-python/pull/602#discussion_r1564265956 ## tests/integration/test_inspect_table.py: ## @@ -266,3 +266,56 @@ def test_inspect_entries_partitioned(spark: SparkSession, session_catalog: Catal assert df.to_pydict()['data_file'][0]['partition'] == {'dt_day': date(2021, 2, 1), 'dt_month': None} assert df.to_pydict()['data_file'][1]['partition'] == {'dt_day': None, 'dt_month': 612} + + +@pytest.mark.integration +@pytest.mark.parametrize("format_version", [1, 2]) +def test_inspect_refs( +spark: SparkSession, session_catalog: Catalog, arrow_table_with_null: pa.Table, format_version: int +) -> None: +identifier = "default.table_metadata_refs" +tbl = _create_table(session_catalog, identifier, properties={"format-version": format_version}) + +# write data to create snapshot +tbl.overwrite(arrow_table_with_null) + +# create a test branch +spark.sql( +f""" +ALTER TABLE {identifier} CREATE BRANCH IF NOT EXISTS testBranch RETAIN 7 DAYS WITH SNAPSHOT RETENTION 2 SNAPSHOTS +""" +) + +# create a test tag against current snapshot +current_snapshot = tbl.current_snapshot() +assert current_snapshot is not None +current_snapshot_id = current_snapshot.snapshot_id + +spark.sql( +f""" +ALTER TABLE {identifier} CREATE TAG testTag AS OF VERSION {current_snapshot_id} RETAIN 180 DAYS +""" +) + +df = tbl.refresh().inspect.refs() + +assert df.column_names == [ +'name', +'type', +'snapshot_id', +'max_reference_age_in_ms', +'min_snapshots_to_keep', +'max_snapshot_age_in_ms', +] + +for snapshot_id in df['snapshot_id']: +assert isinstance(snapshot_id.as_py(), int) Review Comment: Nit: Why specifically only check the type for snapshot_id? If we want to validate the types in the dataframe should we just check all of them? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Refs metadata table [iceberg-python]
Fokko commented on code in PR #602: URL: https://github.com/apache/iceberg-python/pull/602#discussion_r1564223181 ## pyiceberg/table/__init__.py: ## @@ -3410,6 +3410,32 @@ def _readable_metrics_struct(bound_type: PrimitiveType) -> pa.StructType: schema=entries_schema, ) +def refs(self) -> "pa.Table": +import pyarrow as pa + +ref_schema = pa.schema([ +pa.field('name', pa.string(), nullable=False), +pa.field('type', pa.string(), nullable=False), Review Comment: What do you think of changing this field into a categorical? https://pandas.pydata.org/docs/user_guide/categorical.html#series-creation -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Add Refs metadata table [iceberg-python]
geruh opened a new pull request, #602: URL: https://github.com/apache/iceberg-python/pull/602 This PR adds the Refs metadata table the existing inspect logic for Iceberg tables as listed in #511. The refs metadata table in Iceberg stores the table's known snapshot references including branches, and tags. ### Usage ``` > table.inspect.refs() ``` ``` pyarrow.Table name: string not null type: string not null snapshot_id: int64 not null max_reference_age_in_ms: int64 min_snapshots_to_keep: int32 max_snapshot_age_in_ms: int64 name: [["testBranch","main","testTag"]] type: [["BRANCH","BRANCH","TAG"]] snapshot_id: [[4549716733136159472,4549716733136159472,4549716733136159472]] max_reference_age_in_ms: [[60480,null,1555200]] min_snapshots_to_keep: [[2,null,null]] max_snapshot_age_in_ms: [[null,null,null]] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org