Assaf Mendelson created SPARK-17333:
---------------------------------------

             Summary: Make pyspark interface friendly with static analysis
                 Key: SPARK-17333
                 URL: https://issues.apache.org/jira/browse/SPARK-17333
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
            Reporter: Assaf Mendelson
            Priority: Trivial


Static analysis tools, such as those IDEs use for auto-completion and error 
marking, tend to give poor results with pyspark.

This is caused by two separate issues:
The first is that many elements are created programmatically, such as the max 
function in pyspark.sql.functions, so they never appear as explicit 
definitions in the source.
The second is that we tend to use pyspark in a functional manner, meaning that 
we chain many actions (e.g. df.filter().groupby().agg()....), and since Python 
code carries no type information by default, a tool has difficulty inferring 
the type of each intermediate result.
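
To illustrate the first issue, here is a minimal sketch. It does not import 
pyspark: the names functions, _function_docs and _make_function are 
hypothetical stand-ins for the pattern described above, and the generated 
functions only return a string. The point is that no "def max" ever appears in 
the source, so a tool that only reads the code cannot know functions.max exists.

```python
import types

# Hypothetical stand-in for a module whose functions are generated at run time.
functions = types.ModuleType("functions")

# Name -> docstring table, playing the role of the dictionary described above.
_function_docs = {
    "max": "Aggregate function: returns the maximum value.",
    "min": "Aggregate function: returns the minimum value.",
}


def _make_function(name, doc=""):
    """Build a one-argument function at run time."""
    def _(col):
        # A real implementation would delegate to the backend; this sketch
        # just returns a string describing the call.
        return "%s(%s)" % (name, col)
    _.__name__ = name
    _.__doc__ = doc
    return _


# The functions only come into existence when this loop runs.
for _name, _doc in _function_docs.items():
    setattr(functions, _name, _make_function(_name, _doc))

print(functions.max("age"))
```

The call works at run time, but an IDE reading this file sees no attribute 
named max on functions, so it can offer neither completion nor a docstring.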

I would suggest improving this. 

The way I see it, we can either change the interface itself or provide 
enhancements alongside it.

Changing the interface means, first, defining (when possible) all functions 
directly, i.e. instead of having a __functions__ dictionary in 
pyspark/sql/functions.py and then generating the functions programmatically 
by using _create_function, create the function directly: 
def max(col):
    """
    docstring
    """
    return _create_function("max", "docstring")(col)

Second, we can add type hints to all functions, as defined in PEP 484 or in 
PyCharm's legacy type hinting 
(https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy).
So for example max might look like this:
def max(col):
    """
    Does a max.

    :type col: Column
    :rtype: Column
    """
This would provide a wide range of support, as these types of hints, while 
old, are pretty common.
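
For comparison, here is a rough sketch of the same hint written as PEP 484 
annotations instead of docstring fields. The Column class below is a 
placeholder I made up for the example, not pyspark's class, and the function 
body is a dummy. Annotation syntax needs Python 3; for Python 2 code, PEP 484 
also allows the same information in a "# type:" comment.

```python
from typing import get_type_hints


class Column(object):
    """Placeholder standing in for pyspark's Column (assumption for this sketch)."""

    def __init__(self, expr):
        self.expr = expr


def max(col: Column) -> Column:
    """Aggregate function: returns the maximum value of the expression in a group."""
    # Dummy body: wrap the expression text instead of calling a backend.
    return Column("max(%s)" % col.expr)


# A checker can now follow the type through a call chain.
result = max(Column("age"))

# The same information is also available to tools at run time.
hints = get_type_hints(max)
print(hints)
```

With the annotations in place, a tool knows that result is a Column and can 
complete its attributes, which is exactly what is missing in chained calls.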


A second option is to use PEP 3107 function annotations in separate stub 
files (.pyi files, as described in PEP 484). In this case we might have a 
functions.pyi file, which would contain something like:
def max(col: Column) -> Column:
    """
    Aggregate function: returns the maximum value of the expression in a group.
    """
    ...

This has the advantage of easier-to-understand types and of not touching the 
code (only supporting files are added), but it has the disadvantage of being 
managed separately (i.e. a greater chance of making a mistake), and some 
configuration would be needed in the IDE/static analysis tool instead of it 
working out of the box.
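
One way to reduce the risk of the stub drifting from the code would be a small 
consistency check. The following is only a sketch under assumptions: the stub 
text and the list of runtime names are invented for the example, and 
stub_function_names and missing_from_stub are hypothetical helpers. It parses 
the stub with the standard ast module and reports functions that exist at run 
time but have no stub entry.

```python
import ast

# Hypothetical contents of a functions.pyi stub file.
STUB_SOURCE = '''
def max(col: Column) -> Column: ...
def min(col: Column) -> Column: ...
def avg(col: Column) -> Column: ...
'''


def stub_function_names(stub_source):
    """Return the names of the top-level functions declared in a stub."""
    tree = ast.parse(stub_source)
    return {node.name for node in tree.body if isinstance(node, ast.FunctionDef)}


def missing_from_stub(runtime_names, stub_source):
    """Return the runtime names that the stub does not declare, sorted."""
    return sorted(set(runtime_names) - stub_function_names(stub_source))


# Names the (hypothetical) runtime module exposes; "sum" has no stub entry.
runtime_names = ["max", "min", "avg", "sum"]
print(missing_from_stub(runtime_names, STUB_SOURCE))
```

A check like this could run as part of the test suite, so a function added to 
the code without a matching stub entry is caught early.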




