My 2 cents on this is that the biggest room for improvement in the Python API is similarity to Pandas. We already made the Python DataFrame API different from the Scala/Java one in some respects, but if there's anything more we can do to make it feel natural to Pandas users, that will help the most. The other issue, though, is that a bunch of Pandas functions are simply missing in Spark -- it would be awesome to set up an umbrella JIRA to track those and let people fill them in.
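For concreteness, a sketch of the kind of gap being described (illustrative only; the umbrella JIRA would enumerate the real list). Counting distinct values is a built-in one-liner in Pandas, but has to be assembled by hand in PySpark, which as of Spark 2.x has no value_counts equivalent:

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    pdf = pd.DataFrame({"color": ["red", "blue", "red", "green", "red"]})

    # Pandas: a built-in one-liner.
    pdf["color"].value_counts()

    # PySpark: the same result, spelled out step by step.
    spark = SparkSession.builder.getOrCreate()
    sdf = spark.createDataFrame(pdf)
    sdf.groupBy("color").count().orderBy(F.desc("count")).show()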
Matei

> On Sep 16, 2018, at 1:02 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>
> It's not splitting hairs, Erik. It's actually very close to something that I think deserves some discussion (perhaps on a separate thread). What I've been thinking about also concerns API "friendliness" or style. The original RDD API was very intentionally modeled on the Scala parallel collections API. That made it quite friendly for some Scala programmers, but not as much so for users of the other language APIs when they eventually came about. Similarly, the DataFrame API drew a lot from pandas and R, so it is relatively friendly for those used to those abstractions. Of course, the Spark SQL API is modeled closely on HiveQL and standard SQL. The new barrier scheduling draws inspiration from MPI. With all of these models and sources of inspiration, as well as multiple language targets, there isn't really a strong sense of coherence across Spark -- I mean, even though one of the key advantages of Spark is the ability to do within a single framework things that would otherwise require multiple frameworks, actually doing that requires more programming styles and design abstractions than are strictly necessary, even when writing Spark code in just a single language.
>
> For me, that raises the question of whether we want to start designing, implementing and supporting APIs that are more consistent, friendly and idiomatic to particular languages and abstractions -- e.g. an API covering all of Spark that is designed to look and feel as much as possible like "normal" code to a Python programmer, another that looks and feels more like "normal" Java code, another for Scala, etc. That's a lot more work and support burden than the current approach, where sometimes it feels like you are writing "normal" code for your preferred programming environment and sometimes it feels like you are interfacing with something foreign; but underneath, it hopefully isn't too hard for those writing the implementation code below the APIs, and it is not too hard to maintain multiple language bindings that are each fairly lightweight.
>
> It's a cost-benefit judgement, of course, whether APIs that are heavier (in terms of implementation and maintenance) and friendlier (for end users) are worth doing, and maybe some of these "friendlier" APIs can be done outside of Spark itself (imo, Frameless is doing a very nice job for the parts of Spark that it currently covers -- https://github.com/typelevel/frameless); but what we have currently is a bit too ad hoc and fragmentary for my taste.
>
> On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson <eerla...@redhat.com> wrote:
>> I am probably splitting hairs too finely, but I was considering the difference between improvements to the JVM side (Py4J and the Scala/Java code) that would make it easier to write the Python layer (a "Python-friendly API"), and actual improvements to the Python layer itself (a "friendly Python API").
>>
>> They're not mutually exclusive, of course, and both are worth working on. But it's *possible* to improve either without the other.
>>
>> Stub files look like a great solution for type annotations, maybe even if only Python 3 is supported. (A sketch of what a stub could look like follows below this message.)
>>
>> I definitely agree that any decision to drop Python 2 should not be taken lightly. Anecdotally, I'm seeing an increase in Python developers announcing that they are dropping support for Python 2 (and loving it). As people have already pointed out, if we don't drop Python 2 for Spark 3.0, we're stuck with it until 4.0, which would place Spark in the possibly awkward position of supporting Python 2 for some time after it goes EOL.
>>
>> Under the current release cadence, Spark 3.0 will land some time in early 2019, at which point there will be mere months left until Python 2's EOL.
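To make the stub-file idea concrete: a .pyi stub ships type annotations alongside the package without touching the runtime sources, so the Python-2-compatible .py files stay as they are. Below is a hypothetical, hand-written fragment covering a small slice of pyspark.sql.DataFrame -- a sketch of the approach, not the project's actual stubs:

    # pyspark/sql/dataframe.pyi (hypothetical fragment)
    # Type checkers read this file; the interpreter never imports it,
    # so the runtime sources can remain Python-2 compatible.
    from typing import List, Union

    from pyspark.sql.column import Column

    class DataFrame:
        columns: List[str]
        def select(self, *cols: Union[str, Column]) -> "DataFrame": ...
        def filter(self, condition: Union[Column, str]) -> "DataFrame": ...
        def count(self) -> int: ...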
>> On Fri, Sep 14, 2018 at 5:01 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>> On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson <eerla...@redhat.com> wrote:
>>>> To be clear, is this about a "Python-friendly API" or a "friendly Python API"?
>>>
>>> Well, what would you consider to be the difference between those two? I think it would be good to be a bit more explicit, but I don't think we should necessarily limit ourselves.
>>>
>>>> On the Python side, it might be nice to take advantage of static typing. That requires Python 3.6, but with Python 2 going EOL, Spark 3.0 might be a good opportunity to jump on the Python-3-only train.
>>>
>>> I think we can make types sort of work without ditching 2 (the annotations would only be checked in 3, but the code would still function in 2; a sketch of that approach follows at the end of this thread). Ditching 2 entirely would be a big thing to consider -- I honestly hadn't been considering it, but that could be from just spending so much time maintaining a 2/3 code base. I'd suggest reaching out to user@ before making that kind of change.
>>>
>>> On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>>> Since we're talking about Spark 3.0 in the near future (and since some recent conversation on a proposed change reminded me), I wanted to open up the floor and see if folks have any ideas on how we could make a more Python-friendly API for 3.0. I'm planning on taking some time to look at other systems in the solution space and see what we might want to learn from them, but I'd love to hear what other folks are thinking too.
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
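As a postscript, a minimal sketch of the "checked in 3, still runs in 2" typing approach mentioned above: PEP 484 type comments are read by Python 3 tooling such as mypy, but are ordinary comments to the interpreter, so the same file keeps working under Python 2. The helper below is hypothetical, purely for illustration:

    from pyspark.sql import DataFrame

    def sum_by_group(df, group_col, value_col):
        # type: (DataFrame, str, str) -> DataFrame
        # The "# type:" comment above is checked by mypy under Python 3,
        # but at runtime it is just a comment, so this module remains
        # valid Python 2.
        return df.groupBy(group_col).sum(value_col)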