Re: [PR] GH-47172: [Python][Test] Add function to create Arrow table instead of pandas df [arrow]
egolearner commented on PR #47199:
URL: https://github.com/apache/arrow/pull/47199#issuecomment-3149258969
It seems the `wheel-windows-cp313-cp313t-amd64` failure is unrelated to this PR.
> RuntimeError: CFFI does not support the free-threaded build of CPython 3.13. Upgrade to free-threaded 3.14 or newer to use CFFI with the free-threaded build.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-47172: [Python][Test] Add function to create Arrow table instead of pandas df [arrow]
github-actions[bot] commented on PR #47199:
URL: https://github.com/apache/arrow/pull/47199#issuecomment-3143999063
Revision: f0e1edfffe07a6ec3d2d51fe2c10b805f05fd57d
Submitted crossbow builds: [ursacomputing/crossbow @ actions-9e7c764ef8](https://github.com/ursacomputing/crossbow/branches/all?query=actions-9e7c764ef8)
|Task|Status|
|--|--|
|wheel-windows-cp310-cp310-amd64|https://github.com/ursacomputing/crossbow/actions/runs/16672265116/job/47191028992|
|wheel-windows-cp311-cp311-amd64|https://github.com/ursacomputing/crossbow/actions/runs/16672264904/job/47191028428|
|wheel-windows-cp312-cp312-amd64|https://github.com/ursacomputing/crossbow/actions/runs/16672264927/job/47191028524|
|wheel-windows-cp313-cp313-amd64|https://github.com/ursacomputing/crossbow/actions/runs/16672265003/job/47191028628|
|wheel-windows-cp313-cp313t-amd64|https://github.com/ursacomputing/crossbow/actions/runs/16672265005/job/47191028567|
|wheel-windows-cp39-cp39-amd64|https://github.com/ursacomputing/crossbow/actions/runs/16672264995/job/47191028545|
Re: [PR] GH-47172: [Python][Test] Add function to create Arrow table instead of pandas df [arrow]
rok commented on PR #47199:
URL: https://github.com/apache/arrow/pull/47199#issuecomment-3143989301
@github-actions crossbow submit wheel-windows-*
Re: [PR] GH-47172: [Python][Test] Add function to create Arrow table instead of pandas df [arrow]
rok commented on PR #47199:
URL: https://github.com/apache/arrow/pull/47199#issuecomment-3143977018
This looks pretty good @egolearner. I'll start some more Python tests and merge if they pass.
Re: [PR] GH-47172: [Python][Test] Add function to create Arrow table instead of pandas df [arrow]
egolearner commented on code in PR #47199:
URL: https://github.com/apache/arrow/pull/47199#discussion_r2246622702
##
python/pyarrow/tests/parquet/test_basic.py:
##
@@ -76,20 +76,16 @@ def test_set_data_page_size():
     _check_roundtrip(t, data_page_size=target_page_size)
[email protected]
 def test_set_write_batch_size():
Review Comment:
Done
Re: [PR] GH-47172: [Python][Test] Add function to create Arrow table instead of pandas df [arrow]
rok commented on code in PR #47199:
URL: https://github.com/apache/arrow/pull/47199#discussion_r2242818334
##
python/pyarrow/tests/parquet/test_basic.py:
##
@@ -76,20 +76,16 @@ def test_set_data_page_size():
     _check_roundtrip(t, data_page_size=target_page_size)
[email protected]
 def test_set_write_batch_size():
Review Comment:
This now runs when Pandas is not present, which is great, but fails when numpy is not present. Can you try adding `@pytest.mark.numpy`?
Re: [PR] GH-47172: [Python][Test] Add function to create Arrow table instead of pandas df [arrow]
rok commented on code in PR #47199:
URL: https://github.com/apache/arrow/pull/47199#discussion_r2242783733
##
python/pyarrow/tests/parquet/common.py:
##
@@ -121,6 +121,11 @@ def _test_dataframe(size=1, seed=0):
     return df
+def _test_table(size=1, seed=0):
+    df = _test_dataframe(size, seed)
+    return pa.Table.from_pandas(df, preserve_index=False)
Review Comment:
That looks good, thanks!
> Maybe we can deal with this in another issue? It seems numpy is still a must for a lot of test cases.
Yeah, let's capture that in another issue and defer it until it's needed.
Re: [PR] GH-47172: [Python][Test] Add function to create Arrow table instead of pandas df [arrow]
egolearner commented on code in PR #47199:
URL: https://github.com/apache/arrow/pull/47199#discussion_r2242686563
##
python/pyarrow/tests/parquet/common.py:
##
@@ -121,6 +121,11 @@ def _test_dataframe(size=1, seed=0):
     return df
+def _test_table(size=1, seed=0):
+    df = _test_dataframe(size, seed)
+    return pa.Table.from_pandas(df, preserve_index=False)
Review Comment:
Thanks for your review @rok
I have added a `_test_dict` function that holds the data generation logic for both `_test_dataframe` and `_test_table`. PTAL
> It might even be good to have fallback logic in _test_table for cases numpy is not available. This logic could use stdlib's random or some testing utility we have available in arrow c++.
Maybe we can deal with this in another issue? It seems `numpy` is still a must for a lot of test cases.
Re: [PR] GH-47172: [Python][Test] Add function to create Arrow table instead of pandas df [arrow]
rok commented on code in PR #47199:
URL: https://github.com/apache/arrow/pull/47199#discussion_r2240170410
##
python/pyarrow/tests/parquet/common.py:
##
@@ -121,6 +121,11 @@ def _test_dataframe(size=1, seed=0):
     return df
+def _test_table(size=1, seed=0):
+    df = _test_dataframe(size, seed)
+    return pa.Table.from_pandas(df, preserve_index=False)
Review Comment:
Doesn't `_test_dataframe` use Pandas? Depending on Pandas would run counter
to the intent [stated here](https://github.com/apache/arrow/issues/47172):
> This issue would move some of tests using _test_dataframe to use a new
> utility function and remove the @pytest.mark.pandas in this cases.
You could move the numpy logic from `_test_dataframe` into `_test_table` and
have `_test_dataframe` call it, like:
```python
# I've not tested this
def _test_table(size=1, seed=0):
    np.random.seed(seed)
    # pa.table() is the factory function; pa.Table cannot be constructed directly
    return pa.table({
        'uint8': _random_integers(size, np.uint8),
        'uint16': _random_integers(size, np.uint16),
        'uint32': _random_integers(size, np.uint32),
        'uint64': _random_integers(size, np.uint64),
        'int8': _random_integers(size, np.int8),
        'int16': _random_integers(size, np.int16),
        'int32': _random_integers(size, np.int32),
        'int64': _random_integers(size, np.int64),
        'float32': np.random.randn(size).astype(np.float32),
        'float64': np.arange(size, dtype=np.float64),
        'bool': np.random.randn(size) > 0,
        'strings': [util.rands(10) for i in range(size)],
        'all_none': [None] * size,
        'all_none_category': [None] * size
    })


def _test_dataframe(size=1, seed=0):
    # Seeding happens inside _test_table, so nothing else is needed here
    return _test_table(size, seed).to_pandas()
```
Possibly out of scope:
It might even be good to have fallback logic in `_test_table` for cases where
numpy is not available. This logic could use stdlib's `random` or some testing
utility we have available in Arrow C++.
