Re: [PR] Implement __getstate__ and __setstate__ on PyArrowFileIO and FsSpecFileIO so that they can be pickled [iceberg-python]
HonahX merged PR #543: URL: https://github.com/apache/iceberg-python/pull/543 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Implement __getstate__ and __setstate__ on PyArrowFileIO and FsSpecFileIO so that they can be pickled [iceberg-python]
HonahX commented on PR #543:
URL: https://github.com/apache/iceberg-python/pull/543#issuecomment-2041611486

Merged, Thanks!
Re: [PR] Implement __getstate__ and __setstate__ on PyArrowFileIO and FsSpecFileIO so that they can be pickled [iceberg-python]
amogh-jahagirdar commented on code in PR #543:
URL: https://github.com/apache/iceberg-python/pull/543#discussion_r1554993449

## tests/io/test_fsspec.py: ##

@@ -586,6 +597,25 @@ def test_writing_avro_file_gcs(generated_manifest_entry_file: str, fsspec_fileio
     fsspec_fileio_gcs.delete(f"gs://warehouse/{filename}")


+@pytest.mark.gcs
+def test_fsspec_pickle_roundtrip_gcs(fsspec_fileio_gcs: FsspecFileIO) -> None:
+    _test_fsspec_pickle_round_trip(fsspec_fileio_gcs, "gs://warehouse/foo.txt")
+
+
+def _test_fsspec_pickle_round_trip(fsspec_fileio: FsspecFileIO, location: str) -> None:
+    serialized_file_io = pickle.dumps(fsspec_fileio)
+    deserialized_file_io = pickle.loads(serialized_file_io)
+    output_file = deserialized_file_io.new_output(location)
+    with output_file.create() as f:
+        f.write(b"foo")
+
+    input_file = deserialized_file_io.new_input(location)
+    with input_file.open() as f:
+        data = f.read()
+        assert data == b"foo"
+        assert len(input_file) == 3
+

Review Comment:
Good idea. Yes, tests in general should be re-runnable, and to make that possible we should clean up the resource at the end!
Re: [PR] Implement __getstate__ and __setstate__ on PyArrowFileIO and FsSpecFileIO so that they can be pickled [iceberg-python]
amogh-jahagirdar commented on code in PR #543:
URL: https://github.com/apache/iceberg-python/pull/543#discussion_r1554993337

## tests/io/test_fsspec.py: ##

@@ -61,7 +62,7 @@ def test_fsspec_new_input_file(fsspec_fileio: FsspecFileIO) -> None:
     assert input_file.location == f"s3://warehouse/{filename}"


-@pytest.mark.s3
+@pytest.mark.s3fsspec_file_io

Review Comment:
Ah good catch, I think this was a copy/paste bug (somehow pasted fsspec_file_io on this line by mistake)
Re: [PR] Implement __getstate__ and __setstate__ on PyArrowFileIO and FsSpecFileIO so that they can be pickled [iceberg-python]
amogh-jahagirdar commented on code in PR #543:
URL: https://github.com/apache/iceberg-python/pull/543#discussion_r1554993080

## tests/io/test_fsspec.py: ##

@@ -586,6 +597,25 @@ def test_writing_avro_file_gcs(generated_manifest_entry_file: str, fsspec_fileio
     fsspec_fileio_gcs.delete(f"gs://warehouse/{filename}")


+@pytest.mark.gcs
+def test_fsspec_pickle_roundtrip_gcs(fsspec_fileio_gcs: FsspecFileIO) -> None:
+    _test_fsspec_pickle_round_trip(fsspec_fileio_gcs, "gs://warehouse/foo.txt")
+
+
+def _test_fsspec_pickle_round_trip(fsspec_fileio: FsspecFileIO, location: str) -> None:
+    serialized_file_io = pickle.dumps(fsspec_fileio)

Review Comment:
Agreed! I can take up renaming in a separate PR so it's easier to review.
Re: [PR] Implement __getstate__ and __setstate__ on PyArrowFileIO and FsSpecFileIO so that they can be pickled [iceberg-python]
HonahX commented on code in PR #543:
URL: https://github.com/apache/iceberg-python/pull/543#discussion_r1554859960

## tests/io/test_fsspec.py: ##

@@ -586,6 +597,25 @@ def test_writing_avro_file_gcs(generated_manifest_entry_file: str, fsspec_fileio
     fsspec_fileio_gcs.delete(f"gs://warehouse/{filename}")


+@pytest.mark.gcs
+def test_fsspec_pickle_roundtrip_gcs(fsspec_fileio_gcs: FsspecFileIO) -> None:
+    _test_fsspec_pickle_round_trip(fsspec_fileio_gcs, "gs://warehouse/foo.txt")
+
+
+def _test_fsspec_pickle_round_trip(fsspec_fileio: FsspecFileIO, location: str) -> None:
+    serialized_file_io = pickle.dumps(fsspec_fileio)
+    deserialized_file_io = pickle.loads(serialized_file_io)
+    output_file = deserialized_file_io.new_output(location)
+    with output_file.create() as f:
+        f.write(b"foo")
+
+    input_file = deserialized_file_io.new_input(location)
+    with input_file.open() as f:
+        data = f.read()
+        assert data == b"foo"
+        assert len(input_file) == 3
+

Review Comment:
```suggestion
    fsspec_fileio.delete(location)
```
How about deleting the file at the end to make these tests re-runnable?

## tests/io/test_fsspec.py: ##

@@ -61,7 +62,7 @@ def test_fsspec_new_input_file(fsspec_fileio: FsspecFileIO) -> None:
     assert input_file.location == f"s3://warehouse/{filename}"


-@pytest.mark.s3
+@pytest.mark.s3fsspec_file_io

Review Comment:
This seems to be an unrelated change.

## tests/io/test_fsspec.py: ##

@@ -586,6 +597,25 @@ def test_writing_avro_file_gcs(generated_manifest_entry_file: str, fsspec_fileio
     fsspec_fileio_gcs.delete(f"gs://warehouse/{filename}")


+@pytest.mark.gcs
+def test_fsspec_pickle_roundtrip_gcs(fsspec_fileio_gcs: FsspecFileIO) -> None:
+    _test_fsspec_pickle_round_trip(fsspec_fileio_gcs, "gs://warehouse/foo.txt")
+
+
+def _test_fsspec_pickle_round_trip(fsspec_fileio: FsspecFileIO, location: str) -> None:
+    serialized_file_io = pickle.dumps(fsspec_fileio)

Review Comment:
I just realized that we use both `fileio` and `file_io` in the codebase (e.g. `fsspec_fileio`, `load_file_io`).
It would be good if we could consistently use one of them. This may be done in a separate PR.
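
A minimal sketch of the re-runnability point raised in the first review comment above. It is not the code from the PR: `write_read_and_cleanup` is a hypothetical helper that works on a local temporary file rather than on `gs://warehouse`, and it uses `try`/`finally` (a swapped-in variation on the suggested trailing `delete` call) so the cleanup also runs when an assertion or I/O call fails.

```python
import os
import tempfile


def write_read_and_cleanup(location: str, payload: bytes) -> bytes:
    """Hypothetical helper: write a file, read it back, always delete it."""
    try:
        with open(location, "wb") as f:
            f.write(payload)
        with open(location, "rb") as f:
            return f.read()
    finally:
        # Runs on success and on failure, so the next run starts clean.
        if os.path.exists(location):
            os.remove(location)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "foo.txt")
    first = write_read_and_cleanup(target, b"foo")
    # Running it again finds no leftover file from the first run.
    second = write_read_and_cleanup(target, b"foo")
    leftover = os.path.exists(target)
```

In a pytest suite the same effect could be had with a fixture that yields the location and deletes it afterwards; which form fits better depends on how many tests share the resource.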
Re: [PR] Implement __getstate__ and __setstate__ on PyArrowFileIO and FsSpecFileIO so that they can be pickled [iceberg-python]
amogh-jahagirdar commented on code in PR #543:
URL: https://github.com/apache/iceberg-python/pull/543#discussion_r1554762228

## tests/io/test_pyarrow.py: ##

@@ -256,6 +257,14 @@ def test_raise_on_opening_a_local_file_not_found() -> None:
     assert "[Errno 2] Failed to open local file" in str(exc_info.value)


+def test_pickle_pyarrow_file_io() -> None:

Review Comment:
Sorry for the delay on this, I got busy with other work. Updated!
Re: [PR] Implement __getstate__ and __setstate__ on PyArrowFileIO and FsSpecFileIO so that they can be pickled [iceberg-python]
Fokko commented on code in PR #543:
URL: https://github.com/apache/iceberg-python/pull/543#discussion_r1537486747

## tests/io/test_pyarrow.py: ##

@@ -256,6 +257,14 @@ def test_raise_on_opening_a_local_file_not_found() -> None:
     assert "[Errno 2] Failed to open local file" in str(exc_info.value)


+def test_pickle_pyarrow_file_io() -> None:

Review Comment:
Yes, that would be great. You can just reuse an integration test.
Re: [PR] Implement __getstate__ and __setstate__ on PyArrowFileIO and FsSpecFileIO so that they can be pickled [iceberg-python]
amogh-jahagirdar commented on code in PR #543:
URL: https://github.com/apache/iceberg-python/pull/543#discussion_r1536736087

## tests/io/test_pyarrow.py: ##

@@ -256,6 +257,14 @@ def test_raise_on_opening_a_local_file_not_found() -> None:
     assert "[Errno 2] Failed to open local file" in str(exc_info.value)


+def test_pickle_pyarrow_file_io() -> None:

Review Comment:
Let me add a test for fsspec as well. Also, we probably want a stronger round-trip test of pickling: it is worth asserting on a few fields or, even better, actually attempting to use the deserialized FileIO for reading/writing.
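
A minimal sketch of the stronger round-trip check described in the comment above, under stated assumptions. `LocalFileIO` is a hypothetical stand-in, not the PyIceberg `PyArrowFileIO` or `FsspecFileIO`; its `threading.Lock` member plays the role of any unpicklable cached handle. `__getstate__` drops that member, `__setstate__` rebuilds it, and the check both asserts on a field and uses the deserialized object for a real write and read.

```python
import os
import pickle
import tempfile
import threading
from typing import Any, Dict


class LocalFileIO:
    """Hypothetical stand-in for a FileIO; not the PyIceberg implementation."""

    def __init__(self, properties: Dict[str, str]) -> None:
        self.properties = properties
        # A lock cannot be pickled, much like a cached filesystem client.
        self._lock = threading.Lock()

    def __getstate__(self) -> Dict[str, Any]:
        # Copy the instance dict and drop the unpicklable member.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state: Dict[str, Any]) -> None:
        # Restore the picklable fields and rebuild the dropped member.
        self.__dict__.update(state)
        self._lock = threading.Lock()

    def write(self, location: str, data: bytes) -> None:
        with self._lock, open(location, "wb") as f:
            f.write(data)

    def read(self, location: str) -> bytes:
        with self._lock, open(location, "rb") as f:
            return f.read()


def pickle_round_trip(file_io: LocalFileIO, location: str) -> bytes:
    """Pickle, unpickle, then use the copy for a real write and read."""
    restored = pickle.loads(pickle.dumps(file_io))
    assert restored.properties == file_io.properties
    restored.write(location, b"foo")
    return restored.read(location)


with tempfile.TemporaryDirectory() as tmp:
    location = os.path.join(tmp, "foo.txt")
    result = pickle_round_trip(LocalFileIO({"region": "us-east-1"}), location)
```

Exercising the restored object matters because a field-only comparison would still pass if `__setstate__` forgot to rebuild the dropped member; the write and read calls would then fail on the missing attribute.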
Re: [PR] Implement __getstate__ and __setstate__ on PyArrowFileIO and FsSpecFileIO so that they can be pickled [iceberg-python]
amogh-jahagirdar commented on code in PR #543:
URL: https://github.com/apache/iceberg-python/pull/543#discussion_r1536736087

## tests/io/test_pyarrow.py: ##

@@ -256,6 +257,14 @@ def test_raise_on_opening_a_local_file_not_found() -> None:
     assert "[Errno 2] Failed to open local file" in str(exc_info.value)


+def test_pickle_pyarrow_file_io() -> None:

Review Comment:
Let me add a test for fsspec as well