Assaf Mendelson created SPARK-17333:
---------------------------------------
Summary: Make pyspark interface friendly with static analysis
Key: SPARK-17333
URL: https://issues.apache.org/jira/browse/SPARK-17333
Project: Spark
Issue Type: Improvement
Components: PySpark
Reporter: Assaf Mendelson
Priority: Trivial

Static analysis tools, such as those IDEs commonly use for auto-completion and error marking, tend to give poor results with pyspark. This is caused by two separate issues:

The first is that many elements are created programmatically, such as the max function in pyspark.sql.functions.

The second is that we tend to use pyspark in a functional manner, meaning that we chain many actions (e.g. df.filter().groupby().agg()....), and since python has no type information the type of each intermediate result can become difficult to follow, for the reader and for the tool.

I would suggest changing the interface to improve this. The way I see it, we can either change the interface or provide interface enhancements.

Changing the interface means defining (when possible) all functions directly, i.e. instead of having a __functions__ dictionary in pyspark.sql.functions.py and then generating the functions programmatically by using _create_function, create the function directly:

    def max(col):
        """
        docstring
        """
        return _create_function("max", "docstring")(col)

Second, we can add type indications to all functions, as defined in PEP 484 or in PyCharm's legacy type hinting (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy). So, for example, max might look like this:

    def max(col):
        """
        does a max.
        :type col: Column
        :rtype: Column
        """

This would provide a wide range of support, as these types of hints, while old, are pretty common.

A second option is to define the interfaces in stub files (pyi files, described in PEP 484) using PEP 3107 function annotations. In this case we might have a functions.pyi file which would contain something like:

    def max(col: Column) -> Column:
        """
        Aggregate function: returns the maximum value of the expression in a group.
        """
        ...

This has the advantage of types that are easier to understand and of not touching the code (only supporting code is added), but it has the disadvantage of being managed separately (i.e. a greater chance of making a mistake), and some configuration would be needed in the IDE/static analysis tool instead of it working out of the box.
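
To make the first issue concrete, here is a minimal sketch, not the actual pyspark source, contrasting a function generated at runtime with one defined directly. The Column class, the _functions dictionary, and this simplified _create_function are hypothetical stand-ins written for illustration; real pyspark delegates to the JVM instead of building strings.

```python
from typing import get_type_hints


class Column:
    """Hypothetical stand-in for pyspark.sql.Column (not the real class)."""

    def __init__(self, expr: str) -> None:
        self.expr = expr


def _create_function(name: str, doc: str = ""):
    """Simplified, assumed version of the generator the issue describes."""
    def _(col):
        return Column(f"{name}({col.expr})")
    _.__name__ = name
    _.__doc__ = doc
    return _


# Programmatic style: the public names are created only when this code runs,
# so a tool that reads the source without executing it cannot see them.
_functions = {"min": "Aggregate function: returns the minimum value."}
generated = {name: _create_function(name, doc) for name, doc in _functions.items()}
globals().update(generated)


# Direct style: name, signature, and types are all visible in the source.
def max(col: Column) -> Column:
    """Aggregate function: returns the maximum value of the expression in a group."""
    return Column(f"max({col.expr})")


# The generated function carries no type information at all,
# while the directly defined one exposes its parameter and return types.
print(get_type_hints(generated["min"]))
print(get_type_hints(max))
```

Both functions behave the same when called; the difference is only in what a static tool can learn without running the module. A functions.pyi stub, as in the second option, would supply the same information for the generated names without changing this file.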