datapythonista commented on PR #750:
URL:
https://github.com/apache/datafusion-python/pull/750#issuecomment-2220954392
Thanks for the comments, and sorry if my feedback is not helpful. Just one
last comment if you don't mind.
Since the final goal seems to be adoption and making things easier for
Python users, the question that comes to my mind is whether this API is meant
to be a building block for other projects or a reasonable project for end
users.
My previous feedback assumed DataFusion was aimed more at developers than at
end users, with, for example, nice DataFrame APIs for DataFusion being built
as separate projects. If the idea is to make this API reasonable for end
users, the approach here makes more sense to me (not sure if I'd wrap
everything, but some classes surely would need it).
For me, the main things that would make DataFusion as usable as other
DataFrame libraries are summarized in this example:
```python
import datafusion
from datafusion import col, lit, functions as f
import pyarrow

# something like this would be implemented internally, so users can call
# `datafusion.read_*`
def _read_parquet(*args, **kwargs):
    ctx = datafusion.SessionContext()
    return ctx.read_parquet(*args, **kwargs)

# creating an alias of `read_*` functions so users don't need to know about
# `SessionContext` when the defaults are fine
datafusion.read_parquet = _read_parquet

df = (
    datafusion.read_parquet("buildings.parquet")
    # `.filter()` accepting multiple conditions (which will be an AND) instead
    # of having to use `&` with its operator precedence problems
    .filter(
        col("is_offplan") == False,
        # `.lit(2)` not being required, and Python literals working with
        # operators
        col("rooms") >= 2,
    )
    .aggregate(
        [col("area_name_en")],
        # `.cast()` accepting Python types, which would be internally
        # converted to the PyArrow equivalent
        [f.mean(col("has_parking").cast(float))],
    )
    .select(
        col("area_name_en").alias("Area"),
        # removing the default `?table?` in column names, the column name
        # was "AVG(?table?.has_parking)"
        col("AVG(has_parking)").alias("Percentage of buildings with parking"),
    )
)
```
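As a rough illustration of the multi-condition `.filter()` idea, here is a
minimal sketch of how several predicates could be folded into a single AND
expression. `Expr` below is a hypothetical stand-in class, not DataFusion's
expression type, and `combine_predicates` is an invented helper name; the only
assumption carried over is that expressions support `&`.

```python
from functools import reduce
import operator


class Expr:
    """Hypothetical stand-in for an expression type (not the DataFusion class)."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        # Parenthesize explicitly so the caller never has to think about
        # the precedence of `&` relative to comparison operators.
        return Expr(f"({self.text} AND {other.text})")

    def __repr__(self):
        return self.text


def combine_predicates(*predicates):
    """Fold several predicates into one by AND-ing them left to right."""
    if not predicates:
        raise ValueError("at least one predicate is required")
    return reduce(operator.and_, predicates)


combined = combine_predicates(
    Expr("is_offplan = false"),
    Expr("rooms >= 2"),
)
print(combined)
```

A `.filter(*predicates)` wrapper could call a helper like this once and pass
the combined expression on, so users would not need to write `&` themselves.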
Implementing those things and similar ones would be worth the wrappers. I
personally don't think it's worthwhile for users to learn the project
structure by browsing the source code, when docs can be provided and do a
better job at that.