+1, the proposal sounds good to me. Having a familiar API built in will
really help new users who may only have Pandas experience get started
with Spark. It sounds like the maintenance costs should be manageable
once the hurdle of setting up tests is cleared. Just out of curiosity,
does Koalas pretty much implement all of the Pandas APIs now? If some
are yet to be implemented, or others behave differently, are these
documented so users won't be caught off-guard?

On Tue, Mar 16, 2021 at 6:54 PM Andrew Melo <andrew.m...@gmail.com> wrote:

> Hi,
>
> Integrating Koalas with pyspark might help enable a richer integration
> between the two. Something that would be useful with a tighter
> integration is support for custom column array types. Currently, Spark
> takes dataframes, converts them to Arrow buffers, then transmits them
> over the socket to Python. On the other side, pyspark takes the Arrow
> buffer and converts it to a Pandas dataframe. Unfortunately, the
> default Pandas representation of a list-type column turns what were
> contiguous value/offset arrays in Arrow into deserialized Python
> objects for each row. Obviously, this kills performance.
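>
> A small illustration of the cost, sketched with pyarrow directly
> rather than Spark's internals:
>
>     import pyarrow as pa
>
>     # In Arrow, a list column is contiguous values plus offsets.
>     arr = pa.array([[1, 2, 3], [4], [5, 6]])
>     print(arr.type)      # list<item: int64>
>
>     # Converting to pandas boxes each row into its own Python object.
>     series = arr.to_pandas()
>     print(series.dtype)  # object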
>
> A PR to extend the pyspark API to elide the pandas conversion
> (https://github.com/apache/spark/pull/26783) was submitted and
> rejected, which is unfortunate, but perhaps this proposed integration
> would provide the hooks, via Pandas' ExtensionArray interface, for
> Spark to efficiently interchange jagged/ragged lists to/from Python
> UDFs. A rough sketch of the idea is below.
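>
> To be concrete, here is a very rough sketch (not the rejected PR's
> design; all names here are made up) of an ExtensionArray that keeps
> the Arrow buffers intact and only materializes Python objects when a
> row is actually accessed:
>
>     import numpy as np
>     import pyarrow as pa
>     from pandas.api.extensions import ExtensionArray, ExtensionDtype
>
>     class ArrowListDtype(ExtensionDtype):
>         # Hypothetical dtype, just enough for the sketch.
>         name = "arrow_list"
>         type = list
>         na_value = None
>
>         @classmethod
>         def construct_array_type(cls):
>             return ArrowListArray
>
>     class ArrowListArray(ExtensionArray):
>         """Wraps a pyarrow.ListArray; values/offsets stay contiguous."""
>
>         def __init__(self, data):
>             self._data = data  # a pyarrow.ListArray
>
>         @classmethod
>         def _from_sequence(cls, scalars, dtype=None, copy=False):
>             return cls(pa.array(scalars, type=pa.list_(pa.int64())))
>
>         @property
>         def dtype(self):
>             return ArrowListDtype()
>
>         @property
>         def nbytes(self):
>             return self._data.nbytes
>
>         def __len__(self):
>             return len(self._data)
>
>         def __getitem__(self, item):
>             if isinstance(item, int):
>                 # A Python object is built only for the accessed row.
>                 return self._data[item].as_py()
>             return ArrowListArray(self._data[item])  # slices
>
>         def isna(self):
>             return np.array(self._data.is_null().to_pylist(), dtype=bool)
>
>         def copy(self):
>             return ArrowListArray(self._data)
>
>         def take(self, indices, allow_fill=False, fill_value=None):
>             # Simplified: allow_fill/fill_value handling is elided.
>             return ArrowListArray(self._data.take(pa.array(indices)))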
>
> Cheers
> Andrew
>
> On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
> >
> > Thank you guys for all your feedback. I will start working on the SPIP
> > with the Koalas team. I expect the SPIP can be sent out late this week
> > or early next week.
> >
> > I have inlined and answered the remaining questions below:
> >
> > Is the community developing the pandas API layer for Spark interested in
> being part of Spark or do they prefer having their own release cycle?
> >
> > Yeah, the Koalas team used to have its own release cycle so that it
> > could develop and move quickly. Now it has become pretty mature,
> > reaching 1.7.0, and the team thinks it's now fine to have less frequent
> > releases; they are happy to work together with Spark by contributing to
> > it. The active contributors in the Koalas community will continue to
> > make their contributions in Spark.
> >
> > How about test code? Does it fit into the PySpark test framework?
> >
> > Yes, this will be one of the places that needs some effort. Koalas
> > currently uses pytest with various dependency version combinations
> > (e.g., Python versions, conda vs. pip), whereas PySpark uses plain
> > unittest with fewer dependency version combinations.
> >
> > For pytest in Koalas <> unittest in PySpark:
> >
> >   I am currently thinking we will have to convert the Koalas tests to
> >   unittest to match PySpark for now (a sketch of what that conversion
> >   looks like is below).
> >   Migrating PySpark to pytest is also a feasible option, but it would
> >   take extra effort to make it work seamlessly with our own PySpark
> >   testing framework.
> >   The Koalas team (presumably, and likely I) will take a look in any
> >   event.
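> >
> >   As a hedged sketch (the test itself is made up, not actual Koalas
> >   code), a pytest-style test such as
> >
> >       import databricks.koalas as ks
> >
> >       def test_add_one():
> >           kdf = ks.DataFrame({"a": [1, 2]})
> >           assert (kdf["a"] + 1).to_pandas().tolist() == [2, 3]
> >
> >   would be rewritten in unittest style to match PySpark:
> >
> >       import unittest
> >       import databricks.koalas as ks
> >
> >       class SeriesOpsTest(unittest.TestCase):
> >           def test_add_one(self):
> >               kdf = ks.DataFrame({"a": [1, 2]})
> >               self.assertEqual(
> >                   (kdf["a"] + 1).to_pandas().tolist(), [2, 3])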
> >
> > For the combinations of dependency versions:
> >
> >   Due to the lack of resources in GitHub Actions, I currently plan to
> >   just add the Koalas tests into the matrix PySpark is currently using.
> >
> > One question I have: what's the initial goal of the proposal?
> > Is it to port all the pandas interfaces that Koalas has already
> > implemented, or just a basic set of them?
> >
> > The goal of the proposal is to port the whole Koalas project into
> > PySpark. For example,
> >
> > import koalas
> >
> > will be equivalent to
> >
> > # Names, etc. might change in the final proposal or during the review
> > from pyspark.sql import pandas
> >
> > Koalas supports the pandas APIs through a separate layer that covers
> > the differences between the DataFrame structures of pandas and PySpark,
> > e.g., column names (labels) of types other than strings, an index
> > (something like a row number in DBMSs), and so on. So I think it would
> > make more sense to port the whole layer instead of a subset of the
> > APIs.
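> >
> > As a small illustration of what stays the same for users (using
> > today's package name; the future import path is hypothetical, as
> > noted above):
> >
> >     import databricks.koalas as ks   # today's separate package
> >
> >     kdf = ks.DataFrame({"x": [1, 2, 3]})  # pandas-like, Spark-backed
> >     kdf["y"] = kdf["x"] * 2               # pandas-style column ops
> >     print(kdf.to_pandas())
> >
> >     # After the port, only the import would change, e.g.:
> >     # from pyspark.sql import pandas as ks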
> >
> > On Wed, Mar 17, 2021 at 12:32 AM Wenchen Fan <cloud0...@gmail.com> wrote:
> >>
> >> +1, it's great to have Pandas support in Spark out of the box.
> >>
> >> On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro <
> linguin....@gmail.com> wrote:
> >>>
> >>> +1; the pandas interfaces are pretty popular, and supporting them in
> >>> pyspark looks promising, I think.
> >>> One question I have: what's the initial goal of the proposal?
> >>> Is it to port all the pandas interfaces that Koalas has already
> >>> implemented, or just a basic set of them?
> >>>
> >>> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía <ieme...@gmail.com>
> wrote:
> >>>>
> >>>> +1
> >>>>
> >>>> Bringing a Pandas API for pyspark into upstream Spark will only
> >>>> bring benefits for everyone (more eyes to use/see/fix/improve the
> >>>> API), as well as better alignment with core Spark improvements; the
> >>>> extra weight looks manageable.
> >>>>
> >>>> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas
> >>>> <nicholas.cham...@gmail.com> wrote:
> >>>> >
> >>>> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <r...@databricks.com>
> wrote:
> >>>> >>
> >>>> >> I don't think we should deprecate existing APIs.
> >>>> >
> >>>> >
> >>>> > +1
> >>>> >
> >>>> > I strongly prefer Spark's immutable DataFrame API to the Pandas
> API. I could be wrong, but I wager most people who have worked with both
> Spark and Pandas feel the same way.
> >>>> >
> >>>> > For the large community of current PySpark users, or users
> switching to PySpark from another Spark language API, it doesn't make sense
> to deprecate the current API, even by convention.
> >>>>
> >>>
> >>>
> >>> --
> >>> ---
> >>> Takeshi Yamamuro
>
