Hi everyone,

For the last few months I've been working on static type annotations for
PySpark. For those of you who are not familiar with the idea, type
hints were introduced by PEP 484
(https://www.python.org/dev/peps/pep-0484/) and further extended with
PEP 526 (https://www.python.org/dev/peps/pep-0526/) with the main goal
of providing the information required for static analysis. Right now there
are a few tools which support type hints, including Mypy
(https://github.com/python/mypy) and PyCharm
(https://www.jetbrains.com/help/pycharm/2017.1/type-hinting-in-pycharm.html). 
Type hints can be added using function annotations
(https://www.python.org/dev/peps/pep-3107/, Python 3 only), docstrings,
or source-independent stub files
(https://www.python.org/dev/peps/pep-0484/#stub-files). Typing is
optional, gradual and has no runtime impact.
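
For anyone who hasn't seen the syntax, here is a minimal sketch of PEP 484
function annotations and a PEP 526 variable annotation. The function
word_lengths is made up for illustration and has nothing to do with PySpark
or the stubs; the stub-file line in the comment shows roughly what the same
signature would look like in a separate .pyi file.

```python
from typing import Dict, List


def word_lengths(words: List[str]) -> Dict[str, int]:
    """Map each word to its length (hypothetical example function)."""
    return {word: len(word) for word in words}


# PEP 526 variable annotation.
lengths: Dict[str, int] = word_lengths(["spark", "python"])

# In a stub file (for example a hypothetical example.pyi) the same signature
# would be declared without a body:
#
#     def word_lengths(words: List[str]) -> Dict[str, int]: ...

# Annotations are only metadata: nothing is checked at runtime, so passing
# a tuple instead of a list still runs. A static checker such as Mypy is
# expected to flag this call as an argument type mismatch.
unchecked = word_lengths(("a", "bb"))  # type: ignore
```

The hints are stored in word_lengths.__annotations__ and are otherwise
ignored by the interpreter, which is what "no runtime impact" means above.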

So far I've annotated the majority of the API, including most of
pyspark.sql and pyspark.ml. The project is still rough around the
edges and may produce both false positives and false negatives,
but I think it has become mature enough to be useful in practice.

The current version is compatible only with Python 3, but it is
possible, with some limitations, to backport it to Python 2 (though it
is not on my todo list).

There are a number of possible benefits for PySpark users and developers:

  * Static analysis can detect a number of common mistakes to prevent
    runtime failures. Generic self is still fairly limited, so it is
    more useful with DataFrames, Structured Streaming and ML than with
    RDDs or DStreams.
  * Annotations can be used for documenting complex signatures
    (https://git.io/v95JN), including dependencies between arguments and
    return values (https://git.io/v95JA).
  * Detecting possible bugs in Spark (SPARK-20631).
  * Showing API inconsistencies.
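
To make the first two points a bit more concrete, here is a small sketch
using typing.overload. The head function below is hypothetical and is not
taken from PySpark or from the stubs; it only illustrates how a return type
that depends on the arguments can be written down, and the kind of mistake a
checker can then catch before the code runs.

```python
from typing import List, Optional, Union, overload


@overload
def head(rows: List[int]) -> int: ...
@overload
def head(rows: List[int], n: int) -> List[int]: ...


def head(rows: List[int], n: Optional[int] = None) -> Union[int, List[int]]:
    """Return the first element, or the first n elements if n is given.

    Hypothetical function: the overloads above document that the return
    type depends on whether n is passed.
    """
    if n is None:
        return rows[0]
    return rows[:n]


single = head([1, 2, 3])     # a checker infers int
several = head([1, 2, 3], 2)  # a checker infers List[int]

# A static checker such as Mypy is expected to reject the line below,
# because the first overload says head(rows) returns an int, which has no
# append method. At runtime it would fail with AttributeError.
#
#     head([1, 2, 3]).append(4)
```

Without the overloads the declared return type would be the Union, and
every caller would have to narrow it by hand before using the result.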

Roadmap

  * Update the project to reflect Spark 2.2.
  * Refine existing annotations.

If there is enough interest, I am happy to contribute this back to
Spark or submit it to Typeshed (https://github.com/python/typeshed - this
would require formal ASF approval and, since Typeshed doesn't provide
versioning, is probably not the best option in our case).

Further information:

  * https://github.com/zero323/pyspark-stubs - GitHub repository

  * https://speakerdeck.com/marcobonzanini/static-type-analysis-for-robust-data-products-at-pydata-london-2017
    - interesting presentation by Marco Bonzanini

-- 
Best,
Maciej
