milenkovicm opened a new pull request, #1338:
URL: https://github.com/apache/datafusion-ballista/pull/1338
# Which issue does this PR close?
Closes #1142
# Rationale for this change
For quite some time, we have wanted to provide a Ballista Python interface and
make it an extension of DataFusion Python. For the reasons mentioned in #1142,
we have not been able to do so. The main issue was that we could not use
`DataFrame` because of a class mismatch, so something like
```python
from pyballista import BallistaBuilder
from datafusion import SessionContext
from datafusion import functions as f
# %%
ctx: SessionContext = BallistaBuilder()\
    .config("ballista.job.name", "example ballista")\
    .config("ballista.shuffle.partitions", "16")\
    .standalone()
df = ctx.sql("SELECT 1 as r").aggregate(
    [f.col("r")], [f.count_star()]
)
df.show()
```
was not possible because of the FFI boundary between DataFusion Python and Ballista Python.
# What changes are included in this PR?
This PR relies on Python duck typing to "fake" the `DataFrame` interface,
replacing it with a `DistributedDataFrame` extension that executes the query on
a Ballista cluster.
```python
from ballista import BallistaSessionContext
from datafusion import col, lit, DataFrame
from datafusion import functions as f
# we replace
# ctx = SessionContext()
# with
ctx = BallistaSessionContext(address="df://127.0.0.1:50050")
df: DataFrame = ctx.table("t")
df.filter(col("id") > lit(4)).show()
df0 = ctx.sql("SELECT 1 as r")
# aggregate() returns a new DataFrame, so keep the result before showing it
df0 = df0.aggregate(
    [f.col("r")], [f.count_star()]
)
df0.show()
```
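The duck-typing idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: `LocalFrame`, `DistributedFrame`, and `run_on_cluster` are hypothetical stand-ins for DataFusion's `DataFrame`, the `DistributedDataFrame` extension, and a Ballista client call. The wrapper overrides the method that executes the plan and forwards everything else to the wrapped object.

```python
class LocalFrame:
    """Hypothetical stand-in for a local DataFrame (not the datafusion class)."""

    def __init__(self, plan):
        # A tuple of steps stands in for a logical plan.
        self.plan = plan

    def filter(self, predicate):
        # Transformations only extend the plan; nothing executes here.
        return LocalFrame(self.plan + (("filter", predicate),))

    def collect(self):
        return f"local:{self.plan}"


def run_on_cluster(plan):
    """Hypothetical remote execution; a real client would submit the plan."""
    return f"cluster:{plan}"


class DistributedFrame:
    """Looks like LocalFrame to callers, but executes on the 'cluster'."""

    def __init__(self, inner):
        self._inner = inner

    def collect(self):
        # Overridden: execution is redirected instead of running locally.
        return run_on_cluster(self._inner.plan)

    def __getattr__(self, name):
        # Called only for attributes not defined on DistributedFrame itself.
        attr = getattr(self._inner, name)
        if not callable(attr):
            return attr

        def wrapper(*args, **kwargs):
            result = attr(*args, **kwargs)
            # Re-wrap frames so the whole method chain stays distributed.
            if isinstance(result, LocalFrame):
                return DistributedFrame(result)
            return result

        return wrapper


df = DistributedFrame(LocalFrame((("scan", "t"),)))
result = df.filter("id > 4").collect()
```

Because callers only rely on the methods being present, code written against the local frame keeps working; only the execution methods need an explicit override.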
There is a slight inefficiency: to cross the FFI boundary, the original logical
plan is serialised in DataFusion Python and deserialised in Ballista Python,
and the BallistaContext has to be re-created.
Also, we would need to override a few methods to make it work with Ballista.
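The round trip across the boundary can be pictured with a small sketch. It is an assumption-laden illustration only: JSON stands in for whatever serialisation format is actually used, the plan is a plain dictionary, and `export_plan` and `import_plan` are hypothetical names rather than functions from either library.

```python
import json


def export_plan(plan: dict) -> bytes:
    """Hypothetical producer side: encode the plan as bytes."""
    return json.dumps(plan, sort_keys=True).encode("utf-8")


def import_plan(payload: bytes) -> dict:
    """Hypothetical consumer side: rebuild the plan from bytes."""
    return json.loads(payload.decode("utf-8"))


# A toy plan: filter on top of a table scan.
plan = {
    "op": "filter",
    "predicate": "id > 4",
    "input": {"op": "scan", "table": "t"},
}

payload = export_plan(plan)      # one library hands over plain bytes
restored = import_plan(payload)  # the other library rebuilds the plan
```

Only bytes cross the boundary, so neither side needs to share class definitions with the other; the cost is an extra encode and decode for every plan that is executed.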
# Are there any user-facing changes?
There will be a change in the interface, but it is too early to tell exactly what.