My 2 cents on this is that the biggest room for improvement in the Python API is similarity to Pandas. We already made the Python DataFrame API different from the Scala/Java one in some respects, but if there's anything more we can do to make it feel natural to Pandas users, that will help the most. The other issue, though, is that a bunch of Pandas functions are simply missing in Spark -- it would be awesome to set up an umbrella JIRA to track those and let people fill them in.
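For concreteness, a sketch of the kind of gap being described (illustrative only; the umbrella JIRA would enumerate the real list). Counting distinct values is a built-in one-liner in Pandas, but has to be assembled by hand in PySpark, which as of Spark 2.x has no value_counts equivalent:

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    pdf = pd.DataFrame({"color": ["red", "blue", "red", "green", "red"]})

    # Pandas: a built-in one-liner.
    pdf["color"].value_counts()

    # PySpark: the same result, spelled out step by step.
    spark = SparkSession.builder.getOrCreate()
    sdf = spark.createDataFrame(pdf)
    sdf.groupBy("color").count().orderBy(F.desc("count")).show()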
Matei

> On Sep 16, 2018, at 1:02 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>
> It's not splitting hairs, Erik. It's actually very close to something that I think deserves some discussion (perhaps on a separate thread). What I've been thinking about also concerns API "friendliness" or style. The original RDD API was very intentionally modeled on the Scala parallel collections API. That made it quite friendly for some Scala programmers, but not as much so for users of the other language APIs when they eventually came about. Similarly, the DataFrame API drew a lot from pandas and R, so it is relatively friendly for those used to those abstractions. Of course, the Spark SQL API is modeled closely on HiveQL and standard SQL. The new barrier scheduling draws inspiration from MPI. With all of these models and sources of inspiration, as well as multiple language targets, there isn't really a strong sense of coherence across Spark -- I mean, even though one of the key advantages of Spark is the ability to do within a single framework things that would otherwise require multiple frameworks, actually doing that requires more programming styles and design abstractions than are strictly necessary, even when writing Spark code in just a single language.
>
> For me, that raises the question of whether we want to start designing, implementing and supporting APIs that are more consistent, friendly and idiomatic to particular languages and abstractions -- e.g. an API covering all of Spark that is designed to look and feel as much as possible like "normal" code to a Python programmer, another that looks and feels more like "normal" Java code, another for Scala, etc. That's a lot more work and support burden than the current approach, where sometimes it feels like you are writing "normal" code for your preferred programming environment and sometimes it feels like you are interfacing with something foreign; but underneath, it hopefully isn't too hard for those writing the implementation code below the APIs, and it is not too hard to maintain multiple language bindings that are each fairly lightweight.
>
> It's a cost-benefit judgement, of course, whether APIs that are heavier (in terms of implementation and maintenance) and friendlier (for end users) are worth doing, and maybe some of these "friendlier" APIs can be done outside of Spark itself (imo, Frameless is doing a very nice job for the parts of Spark that it currently covers -- https://github.com/typelevel/frameless); but what we have currently is a bit too ad hoc and fragmentary for my taste.
>
> On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson <eerla...@redhat.com> wrote:
>> I am probably splitting hairs too finely, but I was considering the difference between improvements to the JVM side (Py4J and the Scala/Java code) that would make it easier to write the Python layer (a "Python-friendly API"), and actual improvements to the Python layer itself (a "friendly Python API").
>>
>> They're not mutually exclusive, of course, and both are worth working on. But it's *possible* to improve either without the other.
>>
>> Stub files look like a great solution for type annotations, maybe even if only Python 3 is supported. (A sketch of what a stub could look like follows below this message.)
>>
>> I definitely agree that any decision to drop Python 2 should not be taken lightly. Anecdotally, I'm seeing an increase in Python developers announcing that they are dropping support for Python 2 (and loving it). As people have already pointed out, if we don't drop Python 2 for Spark 3.0, we're stuck with it until 4.0, which would place Spark in the possibly awkward position of supporting Python 2 for some time after it goes EOL.
>>
>> Under the current release cadence, Spark 3.0 will land some time in early 2019, at which point there will be mere months left until Python 2's EOL.
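To make the stub-file idea concrete: a .pyi stub ships type annotations alongside the package without touching the runtime sources, so the Python-2-compatible .py files stay as they are. Below is a hypothetical, hand-written fragment covering a small slice of pyspark.sql.DataFrame -- a sketch of the approach, not the project's actual stubs:

    # pyspark/sql/dataframe.pyi (hypothetical fragment)
    # Type checkers read this file; the interpreter never imports it,
    # so the runtime sources can remain Python-2 compatible.
    from typing import List, Union

    from pyspark.sql.column import Column

    class DataFrame:
        columns: List[str]
        def select(self, *cols: Union[str, Column]) -> "DataFrame": ...
        def filter(self, condition: Union[Column, str]) -> "DataFrame": ...
        def count(self) -> int: ...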
>> On Fri, Sep 14, 2018 at 5:01 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson <eerla...@redhat.com> wrote:
>>>> To be clear, is this about a "Python-friendly API" or a "friendly Python API"?
>>>
>>> Well, what would you consider to be the difference between those two? I think it would be good to be a bit more explicit, but I don't think we should necessarily limit ourselves.
>>>
>>>> On the Python side, it might be nice to take advantage of static typing. That requires Python 3.6, but with Python 2 going EOL, Spark 3.0 might be a good opportunity to jump on the Python-3-only train.
>>>
>>> I think we can make types sort of work without ditching 2 (the annotations would only be checked in 3, but the code would still function in 2; a sketch of that approach follows at the end of this thread). Ditching 2 entirely would be a big thing to consider -- I honestly hadn't been considering it, but that could be from just spending so much time maintaining a 2/3 code base. I'd suggest reaching out to user@ before making that kind of change.
>>>
>>> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>>> Since we're talking about Spark 3.0 in the near future (and since some recent conversation on a proposed change reminded me), I wanted to open up the floor and see if folks have any ideas on how we could make a more Python-friendly API for 3.0. I'm planning on taking some time to look at other systems in the solution space and see what we might want to learn from them, but I'd love to hear what other folks are thinking too.
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
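As a postscript, a minimal sketch of the "checked in 3, still runs in 2" typing approach mentioned above: PEP 484 type comments are read by Python 3 tooling such as mypy, but are ordinary comments to the interpreter, so the same file keeps working under Python 2. The helper below is hypothetical, purely for illustration:

    from pyspark.sql import DataFrame

    def sum_by_group(df, group_col, value_col):
        # type: (DataFrame, str, str) -> DataFrame
        # The "# type:" comment above is checked by mypy under Python 3,
        # but at runtime it is just a comment, so this module remains
        # valid Python 2.
        return df.groupBy(group_col).sum(value_col)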