timsaucer opened a new issue, #800:
URL: https://github.com/apache/datafusion-python/issues/800
**Describe the bug**
Calling `count()` on a DataFrame whose source is a `pyarrow.dataset` raises
an error.
**To Reproduce**
Point the snippet below at any Parquet file.
```
from datafusion import SessionContext
import pyarrow.dataset as ds

ctx = SessionContext()

# Point this at any Parquet file.
file_path = "/some-path/datafusion-python/examples/tpch/data/lineitem.parquet"
pyarrow_dataset = ds.dataset([file_path])
ctx.register_dataset("pyarrow_dataset", pyarrow_dataset)

df = ctx.table("pyarrow_dataset").select("l_orderkey", "l_partkey", "l_linenumber")
df.limit(3).show()  # succeeds, so the dataset itself is readable
df.count()  # raises the exception
```
This generates the following output. The `show` call demonstrates that the
file is read correctly.
```
DataFrame()
+------------+-----------+--------------+
| l_orderkey | l_partkey | l_linenumber |
+------------+-----------+--------------+
| 1          | 155190    | 1            |
| 1          | 67310     | 2            |
| 1          | 63700     | 3            |
+------------+-----------+--------------+
Traceback (most recent call last):
  File "/Users/tsaucer/src/personal/arrow_rs_dataset_read/count_dataset_read.py", line 16, in <module>
    df.count()
  File "/Users/tsaucer/src/personal/datafusion-python/python/datafusion/dataframe.py", line 507, in count
    return self.df.count()
           ^^^^^^^^^^^^^^^
Exception: External error: Arrow error: External error: ArrowException: Invalid argument error: must either specify a row count or at least one column
```
**Expected behavior**
`count()` should return the number of rows in this dataset.
A workaround is to aggregate and count a column explicitly:
```
from datafusion import col, functions as f
df.aggregate([], [f.count(col("l_orderkey"))]).show()
```
**Additional context**
In my investigation, I found that we register Arrow datasets by creating a
`TableProvider` in `src/dataset.rs`, and that the execution calls then happen
in `src/dataset_exec.rs`.