Hi Wes,

I agree it is difficult to do this design case by case, but what I was
pointing out was that "it is difficult to generalize without seeing a
lot more cases". I do think we need to see a lot of these cases and
then make a call. My intuition is that we can just have config options
that control behavior, similar to what a lot of relational databases
do. The good thing is that Spark DataFrames are abstracted far enough
away from the underlying execution that a lot of these behaviors can
be controlled just by analysis rules.
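To make that concrete, here is a hypothetical sketch (the option name
"spark.sql.pythonCompatible" is invented purely for illustration;
assume a PySpark 1.x shell where sqlContext is already in scope):

    import pandas as pd
    from pyspark.sql import functions as F

    pd.Series([True, False, True]).sum()  # 2 -- pandas coerces bool to int

    df = sqlContext.createDataFrame([(True,), (False,), (True,)], ["flag"])

    # Today the equivalent Spark aggregation needs an explicit cast,
    # since sum() only accepts numeric input:
    df.agg(F.sum(F.col("flag").cast("int")))  # sum = 2

    # Hypothetically, a config option could switch on an analysis rule
    # that inserts the cast automatically:
    sqlContext.setConf("spark.sql.pythonCompatible", "true")
    df.agg(F.sum("flag"))  # boolean column coerced to integral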
On Mon, Mar 21, 2016 at 8:48 PM, Wes McKinney <w...@cloudera.com> wrote:
> hi Reynold,
>
> It's of course possible to find solutions to specific issues, but what
> I'm curious about is a general decision-making framework for building
> strong user experiences for programmers using each of the Spark APIs.
> Right now, the semantics of using Spark are closely tied to the
> semantics of the Scala API. In turn, the semantics of Spark DataFrames
> may be constrained by the semantics of Spark SQL, depending on your
> attitude toward API divergence and toward using a sort of "code
> generation" to produce a certain behavior.
>
> To make this concrete: in this particular example, PySpark could
> implicitly cast boolean to numeric (entirely on the Python side)
> before the SUM operation is sent to Spark. But then SUM(boolean) in
> Scala and SUM(boolean) in Python would have distinct behavior. Now
> imagine that you have made many such divergent API design decisions
> (some of which may interact with each other). Another classic example
> is preventing floor division with integers by automatically casting to
> double (__truediv__ in Python).
>
> Adding a different mode to Spark SQL might be dangerous, and it would
> likely add a lot of complexity.
>
> Thanks,
> Wes
>
> On Mon, Mar 21, 2016 at 9:59 AM, Reynold Xin <r...@databricks.com> wrote:
> > Hi Wes,
> >
> > Thanks for the email. It is difficult to generalize without seeing a
> > lot more cases, but the boolean issue is simply a query analysis
> > rule.
> >
> > I can see us having a config option that makes analysis more
> > Python/R-like, changing the behavior of implicit type coercion and
> > allowing boolean-to-integral coercion automatically.
> >
> > On Thursday, March 17, 2016, Wes McKinney <w...@cloudera.com> wrote:
> >> hi everyone,
> >>
> >> I've recently gotten moving on solving some of the low-level data
> >> interoperability problems between Python's NumPy-focused scientific
> >> computing and data libraries like pandas and the rest of the big
> >> data ecosystem, Spark being a very important part of that.
> >>
> >> One of the major efforts here is creating a unified data access
> >> layer for pandas users, using Apache Arrow as the structured data
> >> exchange medium (read more here:
> >> http://wesmckinney.com/blog/pandas-and-apache-arrow/). I created
> >> https://issues.apache.org/jira/browse/SPARK-13534 to add an Arrow
> >> "thunderbolt port" (to make an analogy) to Spark for moving data
> >> from Spark SQL to pandas much more efficiently than the current
> >> serialization scheme. If anyone wants to be a partner in crime on
> >> this, feel free to reach out! I'll be dropping the Arrow
> >> memory<->pandas conversion code in the next couple of weeks.
> >>
> >> As I'm looking more at the implementation details and API design of
> >> PySpark, I note that it has been intended to have near 1-1 parity
> >> with the Scala API, enabling developers to jump between APIs
> >> without a lot of cognitive dissonance (you lose type information in
> >> Python, but c'est la vie). Much of PySpark appears to be wrapping
> >> Scala / Java API calls with py4j (much as many Python libraries
> >> wrap C/C++ libraries in an analogous fashion).
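> >>
> >> For instance (an illustrative sketch; assume a PySpark shell where
> >> sqlContext is already in scope):
> >>
> >>     df = sqlContext.range(10)   # a thin Python wrapper object
> >>     jdf = df._jdf               # the underlying py4j JavaObject,
> >>                                 # a proxy for the JVM DataFrame
> >>     jdf.count()                 # dispatched over py4j's socket to
> >>                                 # the JVM; the result ships back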
> >>
> >> In the long run, I'm concerned this may become problematic, as
> >> users' expectations about the semantics of interacting with the
> >> data may not be compatible with the behavior of the Spark Scala API
> >> (particularly the API design and semantics of Spark SQL and
> >> Datasets). As the Spark user base grows, so, too, will the user
> >> needs, particularly in the more accessible APIs (Python / R). I
> >> expect that Scala users tend to be a more sophisticated audience
> >> with more of a software engineering / computer science tilt.
> >>
> >> With a "big picture" goal of bringing about a semantic convergence
> >> between big data and small data in a certain subset of scalable
> >> computations, I am curious what the Spark development community's
> >> attitude is toward efforts to achieve 1-1 PySpark API parity (with
> >> a slight API lag, as new features show up in Scala strictly before
> >> Python), particularly in the strictly semantic realm of data
> >> interactions (at the end of the day, code has to move bits around
> >> someplace). Here is an illustrative, albeit somewhat trivial,
> >> example of what I'm talking about:
> >>
> >> https://issues.apache.org/jira/browse/SPARK-13943
> >>
> >> If closer semantic compatibility with existing software in R and
> >> Python is not a high priority, that is a completely reasonable
> >> answer.
> >>
> >> Another thought is treating PySpark as the place where the "rubber
> >> meets the road" -- the point of contact for any Python developers
> >> building applications with Spark. This would leave library
> >> developers aiming to create higher-level user experiences (e.g.
> >> emulating pandas more closely) to use PySpark as an implementation
> >> tool that users otherwise do not interact with directly. But this
> >> is seemingly at odds with the efforts to make Spark DataFrames
> >> behave in a pandas/R-like fashion.
> >>
> >> The nearest analogue I would give is the relationship between
> >> pandas and NumPy in the earlier days of pandas (version 0.7 and
> >> earlier). pandas relies on NumPy data structures and many of its
> >> array algorithms. Early on, I was lightly criticized in the
> >> community for creating pandas as a separate project rather than
> >> contributing patches to NumPy, but over time it has proven to have
> >> been the right decision, as domain-specific needs can evolve in a
> >> decoupled way without onerous API design compromises.
> >>
> >> very best,
> >> Wes