hi Reynold,

It's of course possible to find solutions to specific issues, but what I'm curious about is a general decision-making framework for building strong user experiences for programmers using each of the Spark APIs. Right now, the semantics of using Spark are closely tied to the semantics of the Scala API. In turn, the semantics of Spark DataFrames may be constrained by the semantics of Spark SQL, depending on your attitude toward API divergence and toward using a sort of "code generation" to produce a particular behavior.
To make this concrete with the example at hand: PySpark could implicitly cast boolean to numeric (completely from the Python side) before the SUM operation is sent to Spark. But then SUM(boolean) in Scala and SUM(boolean) in Python would have distinct behavior. Now imagine that you have made many such divergent API design decisions, some of which may interact with each other. Another classic example is preventing floor division with integers by automatically casting to double (__truediv__ in Python).

Adding a different mode to Spark SQL might be dangerous, and it would likely add a lot of complexity.
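To show what I mean by handling this completely from the Python side, here is a rough sketch for the boolean case (illustrative only -- sum_with_bool_cast is a hypothetical helper, not an existing PySpark API):

    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    def sum_with_bool_cast(df, colname):
        # Hypothetical Python-side coercion: cast a boolean column to int
        # before the aggregation expression is handed to Spark SQL, so the
        # result matches Python's builtin sum([True, False, True]) == 2.
        dtypes = {f.name: f.dataType for f in df.schema.fields}
        col = df[colname]
        if isinstance(dtypes[colname], BooleanType):
            col = col.cast('int')
        return F.sum(col)

With something like this, df.agg(sum_with_bool_cast(df, 'flag')) would do what a Python or R user expects, while the Scala API is left untouched -- which is exactly the kind of quiet divergence I'm describing.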
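The division case could follow the same pattern: intercept __truediv__ in a Python-side wrapper and cast integral operands to double so that int / int never silently floors. Again just a sketch -- PandasLikeColumn is a made-up wrapper, not part of PySpark:

    from pyspark.sql import functions as F

    class PandasLikeColumn(object):
        # Hypothetical thin wrapper around pyspark.sql.Column, shown only
        # to illustrate where a Python-side division override would live.
        def __init__(self, col):
            self._col = col

        def __truediv__(self, other):
            # Cast both operands to double so integer / integer follows
            # Python's true-division semantics rather than floor division.
            left = self._col.cast('double')
            if isinstance(other, PandasLikeColumn):
                right = other._col.cast('double')
            else:
                right = F.lit(other).cast('double')
            return PandasLikeColumn(left / right)

        __div__ = __truediv__  # Python 2 without "from __future__ import division"

Each of these is easy enough in isolation; the accumulation is what worries me.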
Thanks,
Wes

On Mon, Mar 21, 2016 at 9:59 AM, Reynold Xin <r...@databricks.com> wrote:
> Hi Wes,
>
> Thanks for the email. It is difficult to generalize without seeing a lot
> more cases, but the boolean issue is simply a query analysis rule.
>
> I can see us having a config option that changes analysis to match more
> Python/R like, which changes the behavior of implicit type coercion and
> allows boolean to integral automatically.
>
>
> On Thursday, March 17, 2016, Wes McKinney <w...@cloudera.com> wrote:
>>
>> hi everyone,
>>
>> I've recently gotten moving on solving some of the low-level data
>> interoperability problems between Python's NumPy-focused scientific
>> computing and data libraries like pandas and the rest of the big data
>> ecosystem, Spark being a very important part of that.
>>
>> One of the major efforts here is creating a unified data access layer
>> for pandas users using Apache Arrow as the structured data exchange
>> medium (read more here:
>> http://wesmckinney.com/blog/pandas-and-apache-arrow/). I created
>> https://issues.apache.org/jira/browse/SPARK-13534 to add an Arrow
>> "thunderbolt port" (to make an analogy) to Spark for moving data from
>> Spark SQL to pandas much more efficiently than the current
>> serialization scheme. If anyone wants to be a partner in crime on
>> this, feel free to reach out! I'll be dropping the Arrow
>> memory<->pandas conversion code in the next couple weeks.
>>
>> As I'm looking more at the implementation details and API design of
>> PySpark, I note that it has been intended to have near 1-1 parity with
>> the Scala API, enabling developers to jump between APIs without a lot
>> of cognitive dissonance (you lose type information in Python, but
>> c'est la vie). Much of PySpark appears to be wrapping Scala / Java API
>> calls with py4j (much as many Python libraries wrap C/C++ libraries in
>> an analogous fashion).
>>
>> In the long run, I'm concerned this may become problematic as users'
>> expectations about the semantics of interacting with the data may not
>> be compatible with the behavior of the Spark Scala API (particularly
>> the API design and semantics of Spark SQL and Datasets). As the Spark
>> user base grows, so, too, will the user needs, particularly in the
>> more accessible APIs (Python / R). I expect the Scala users tend to be
>> a more sophisticated audience with a more software engineering /
>> computer science tilt.
>>
>> With a "big picture" goal of bringing about a semantic convergence
>> between big data and small data in a certain subset of scalable
>> computations, I am curious what the Spark development community's
>> attitude is towards efforts to achieve 1-1 PySpark API parity (with a
>> slight API lag as new features show up strictly in Scala before in
>> Python), particularly in the strictly semantic realm of data
>> interactions (at the end of the day, code has to move around bits
>> someplace). Here is an illustrative, albeit somewhat trivial example
>> of what I'm talking about:
>>
>> https://issues.apache.org/jira/browse/SPARK-13943
>>
>> If closer semantic compatibility with existing software in R and
>> Python is not a high priority, that is a completely reasonable answer.
>>
>> Another thought is treating PySpark as the place where the "rubber
>> meets the road" -- the point of contact for any Python developers
>> building applications with Spark. This would leave library developers
>> aiming to create higher level user experiences (e.g. emulating pandas
>> more closely) to use PySpark as an implementation tool that users
>> otherwise do not directly interact with. But this is seemingly at odds
>> with the efforts to make Spark DataFrames behave in a pandas/R-like
>> fashion.
>>
>> The nearest analogue to this I would give is the relationship between
>> pandas and NumPy in the earlier days of pandas (version 0.7 and
>> earlier). pandas relies on NumPy data structures and many of its array
>> algorithms. Early on I was lightly criticized in the community for
>> creating pandas as a separate project rather than contributing patches
>> to NumPy, but over time it has proven to have been the right decision,
>> as domain specific needs can evolve in a decoupled way without onerous
>> API design compromises.
>>
>> very best,
>> Wes
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org