Hi Wes,

I agree it is difficult to do this design case by case, but what I was
pointing out was that "it is difficult to generalize without seeing a
lot more cases". I do think we need to see a lot of these cases and
then make a call. My intuition is that we can just have config options
that control behavior, similar to what a lot of relational databases
do. The good thing is that Spark DataFrames are abstracted far enough
away from the underlying execution that a lot of these behaviors can
be controlled just by analysis rules.
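To make that concrete, here is a hypothetical sketch (the option name
"spark.sql.pythonCompatible" is invented purely for illustration;
assume a PySpark 1.x shell where sqlContext is already in scope):

    import pandas as pd
    from pyspark.sql import functions as F

    pd.Series([True, False, True]).sum()  # 2 -- pandas coerces bool to int

    df = sqlContext.createDataFrame([(True,), (False,), (True,)], ["flag"])

    # Today the equivalent Spark aggregation needs an explicit cast,
    # since sum() only accepts numeric input:
    df.agg(F.sum(F.col("flag").cast("int")))  # sum = 2

    # Hypothetically, a config option could switch on an analysis rule
    # that inserts the cast automatically:
    sqlContext.setConf("spark.sql.pythonCompatible", "true")
    df.agg(F.sum("flag"))  # boolean column coerced to integral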
On Mon, Mar 21, 2016 at 8:48 PM, Wes McKinney <w...@cloudera.com> wrote:
> hi Reynold,
>
> It's of course possible to find solutions to specific issues, but what
> I'm curious about is a general decision-making framework for building
> strong user experiences for programmers using each of the Spark APIs.
> Right now, the semantics of using Spark are closely tied to the
> semantics of the Scala API. In turn, the semantics of Spark DataFrames
> may be constrained by the semantics of Spark SQL, depending on your
> attitude toward API divergence and toward using a sort of "code
> generation" to produce a certain behavior.
>
> To make this concrete: in this particular example, PySpark could
> implicitly cast boolean to numeric (entirely on the Python side)
> before the SUM operation is sent to Spark. But then SUM(boolean) in
> Scala and SUM(boolean) in Python would have distinct behavior. Now
> imagine that you have made many such divergent API design decisions
> (some of which may interact with each other). Another classic example
> is preventing floor division with integers by automatically casting to
> double (__truediv__ in Python).
>
> Adding a different mode to Spark SQL might be dangerous, and it would
> likely add a lot of complexity.
>
> Thanks,
> Wes
>
> On Mon, Mar 21, 2016 at 9:59 AM, Reynold Xin <r...@databricks.com> wrote:
> > Hi Wes,
> >
> > Thanks for the email. It is difficult to generalize without seeing a
> > lot more cases, but the boolean issue is simply a query analysis
> > rule.
> >
> > I can see us having a config option that makes analysis more
> > Python/R-like, changing the behavior of implicit type coercion and
> > allowing boolean-to-integral coercion automatically.
> >
> > On Thursday, March 17, 2016, Wes McKinney <w...@cloudera.com> wrote:
> >> hi everyone,
> >>
> >> I've recently gotten moving on solving some of the low-level data
> >> interoperability problems between Python's NumPy-focused scientific
> >> computing and data libraries like pandas and the rest of the big
> >> data ecosystem, Spark being a very important part of that.
> >>
> >> One of the major efforts here is creating a unified data access
> >> layer for pandas users, using Apache Arrow as the structured data
> >> exchange medium (read more here:
> >> http://wesmckinney.com/blog/pandas-and-apache-arrow/). I created
> >> https://issues.apache.org/jira/browse/SPARK-13534 to add an Arrow
> >> "thunderbolt port" (to make an analogy) to Spark for moving data
> >> from Spark SQL to pandas much more efficiently than the current
> >> serialization scheme. If anyone wants to be a partner in crime on
> >> this, feel free to reach out! I'll be dropping the Arrow
> >> memory<->pandas conversion code in the next couple of weeks.
> >>
> >> As I'm looking more at the implementation details and API design of
> >> PySpark, I note that it has been intended to have near 1-1 parity
> >> with the Scala API, enabling developers to jump between APIs
> >> without a lot of cognitive dissonance (you lose type information in
> >> Python, but c'est la vie). Much of PySpark appears to be wrapping
> >> Scala / Java API calls with py4j (much as many Python libraries
> >> wrap C/C++ libraries in an analogous fashion).
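> >>
> >> For instance (an illustrative sketch; assume a PySpark shell where
> >> sqlContext is already in scope):
> >>
> >>     df = sqlContext.range(10)   # a thin Python wrapper object
> >>     jdf = df._jdf               # the underlying py4j JavaObject,
> >>                                 # a proxy for the JVM DataFrame
> >>     jdf.count()                 # dispatched over py4j's socket to
> >>                                 # the JVM; the result ships back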
> >>
> >> In the long run, I'm concerned this may become problematic, as
> >> users' expectations about the semantics of interacting with the
> >> data may not be compatible with the behavior of the Spark Scala API
> >> (particularly the API design and semantics of Spark SQL and
> >> Datasets). As the Spark user base grows, so, too, will the user
> >> needs, particularly in the more accessible APIs (Python / R). I
> >> expect that Scala users tend to be a more sophisticated audience
> >> with more of a software engineering / computer science tilt.
> >>
> >> With a "big picture" goal of bringing about a semantic convergence
> >> between big data and small data in a certain subset of scalable
> >> computations, I am curious what the Spark development community's
> >> attitude is toward efforts to achieve 1-1 PySpark API parity (with
> >> a slight API lag, as new features show up in Scala strictly before
> >> Python), particularly in the strictly semantic realm of data
> >> interactions (at the end of the day, code has to move bits around
> >> someplace). Here is an illustrative, albeit somewhat trivial,
> >> example of what I'm talking about:
> >>
> >> https://issues.apache.org/jira/browse/SPARK-13943
> >>
> >> If closer semantic compatibility with existing software in R and
> >> Python is not a high priority, that is a completely reasonable
> >> answer.
> >>
> >> Another thought is treating PySpark as the place where the "rubber
> >> meets the road" -- the point of contact for any Python developers
> >> building applications with Spark. This would leave library
> >> developers aiming to create higher-level user experiences (e.g.
> >> emulating pandas more closely) to use PySpark as an implementation
> >> tool that users otherwise do not interact with directly. But this
> >> is seemingly at odds with the efforts to make Spark DataFrames
> >> behave in a pandas/R-like fashion.
> >>
> >> The nearest analogue I would give is the relationship between
> >> pandas and NumPy in the earlier days of pandas (version 0.7 and
> >> earlier). pandas relies on NumPy data structures and many of its
> >> array algorithms. Early on, I was lightly criticized in the
> >> community for creating pandas as a separate project rather than
> >> contributing patches to NumPy, but over time it has proven to have
> >> been the right decision, as domain-specific needs can evolve in a
> >> decoupled way without onerous API design compromises.
> >>
> >> very best,
> >> Wes