[ https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751040#comment-16751040 ]

Maciej Szymkiewicz commented on SPARK-17333:
--------------------------------------------

[~Alexander_Gorokhov] Personally, I maintain a relatively complete set of 
annotations (https://github.com/zero323/pyspark-stubs) and have in the past 
stated that I am happy to donate it and help with the merge 
(http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html).
 This topic has also been raised on another occasion 
(http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html).

> Make pyspark interface friendly with static analysis
> ----------------------------------------------------
>
>                 Key: SPARK-17333
>                 URL: https://issues.apache.org/jira/browse/SPARK-17333
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: Assaf Mendelson
>            Priority: Trivial
>
> Static analysis tools, such as those IDEs use for auto completion and 
> error marking, tend to produce poor results with pyspark.
> This is caused by two separate issues:
> The first is that many elements are created programmatically, such as the max 
> function in pyspark.sql.functions (see the sketch below).
> The second is that we tend to use pyspark in a functional manner, meaning 
> that we chain many calls (e.g. df.filter().groupby().agg()...), and since 
> Python carries no type information here, this can become difficult to follow.
> I would suggest changing the interface to improve this.
> The way I see it, we can either change the interface itself or provide 
> interface enhancements.
> Changing the interface means defining (when possible) all functions directly, 
> i.e. instead of having a _functions dictionary in pyspark/sql/functions.py 
> and then generating the functions programmatically with _create_function, 
> create each function directly:
> def max(col):
>     """
>     docstring
>     """
>     sc = SparkContext._active_spark_context
>     return Column(sc._jvm.functions.max(_to_java_column(col)))
> Second, we can add type hints to all functions, as defined in PEP 484, or 
> use PyCharm's legacy type hinting 
> (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy).
> So, for example, max might look like this:
> def max(col):
>     """
>     does a max.
>     :type col: Column
>     :rtype: Column
>     """
> This would provide a wide range of support, as these types of hints, while 
> old, are pretty common.
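> For comparison, the PEP 484 flavour of the same hint is written inline in the
> signature (a sketch; unlike the docstring style this requires editing the
> function definitions themselves):
> def max(col: Column) -> Column:
>     """
>     does a max.
>     """
>     ...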
> A second option is to use stub files (pyi files), as defined in PEP 484.
> In this case we might have a functions.pyi file which would contain something 
> like:
> def max(col: Column) -> Column:
>     """
>     Aggregate function: returns the maximum value of the expression in a 
> group.
>     """
>     ...
> This has the advantage of easier-to-understand types and of not touching the 
> code itself (only adding supporting files), but has the disadvantage of being 
> separately managed (i.e. a greater chance of making a mistake) and of 
> requiring some configuration in the IDE/static analysis tool instead of 
> working out of the box.
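> With either inline annotations or a functions.pyi like the above in place, a
> checker such as mypy (or an IDE) can follow chained calls end to end; a small
> sketch, assuming stubs covering DataFrame, GroupedData and functions:
> from pyspark.sql import SparkSession, functions as F
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(10)
>
> # Every step in the chain now has a known return type
> # (DataFrame -> DataFrame -> GroupedData -> DataFrame), so completion
> # works and a call like F.max(42) is flagged by the checker instead of
> # failing at runtime.
> out = df.filter(df.id > 5).groupBy("id").agg(F.max(df.id))
> out.show()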


