[jira] [Commented] (SPARK-16569) Use Cython to speed up Pyspark internals

Robert Kruszewski (JIRA) Fri, 15 Jul 2016 05:22:41 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-16569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379310#comment-15379310
 ]


Robert Kruszewski commented on SPARK-16569:
-------------------------------------------

Cython only improves performance substantially once you add type annotations, 
and with Cython's type declarations the code is no longer valid Python. You 
would also need to introduce a C compiler (gcc/clang) into the build chain, 
which might be difficult in some environments. You still instantiate the 
Python interpreter, so I'm not sure there is a real difference. At that point 
it seems easier to just write Java/Scala instead of Python.

If you had something like mypy and used it to feed types to Cython, that would 
probably work, but it would require changes to Cython.
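
To illustrate the mypy idea, here is a minimal sketch in plain Python. It only 
shows that standard annotations leave the code valid Python and can be read 
back by tooling. The names row_sum and annotation_table are hypothetical, and 
nothing here demonstrates that Cython can consume these annotations.

```python
from typing import Dict, get_type_hints


def row_sum(values: list, scale: float = 1.0) -> float:
    """Hypothetical per-record helper; the annotations do not change behavior."""
    total = 0.0
    for v in values:
        total += v * scale
    return total


def annotation_table(func) -> Dict[str, str]:
    """Collect the declared types by name, as a type-aware tool might do."""
    return {
        name: getattr(tp, "__name__", str(tp))
        for name, tp in get_type_hints(func).items()
    }


# The function still runs under the ordinary interpreter ...
result = row_sum([1, 2, 3], 2.0)

# ... and its declared types are available to any tool that asks for them.
hints = annotation_table(row_sum)
print(result, hints)
```

A tool in the role I describe would read a table like this and generate the 
typed declarations itself, so the source file could stay plain Python.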

> Use Cython to speed up Pyspark internals
> ----------------------------------------
>
>                 Key: SPARK-16569
>                 URL: https://issues.apache.org/jira/browse/SPARK-16569
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Maciej Bryński
>            Priority: Minor
>
> CC: [~davies]
> Many operations I do are like:
> {code}
> dataframe.rdd.map(some_function)
> {code}
> In PySpark this means creating a Row object for every record, and this is slow.
> IDEA:
> Use Cython to speed up Pyspark internals
> What do you think?
> Sample profile:
> {code}
> ============================================================
> Profile of RDD<id=9>
> ============================================================
>          2000373036 function calls (2000312850 primitive calls) in 2045.307 seconds
>    Ordered by: internal time, cumulative time
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>     14948  427.117    0.029 1811.622    0.121 {built-in method loads}
> 199920000  402.086    0.000  937.045    0.000 types.py:1162(_create_row)
> 199920000  262.708    0.000  262.708    0.000 {built-in method __new__ of type object at 0x9d1c40}
> 199920000  190.908    0.000 1219.794    0.000 types.py:558(fromInternal)
> 199920000  153.611    0.000  153.611    0.000 types.py:1280(__setattr__)
> 199920197  145.022    0.000 2024.126    0.000 rdd.py:1004(<genexpr>)
> 199920000  118.640    0.000  381.348    0.000 types.py:1194(__new__)
> 199920000  101.272    0.000 1321.067    0.000 types.py:1159(<lambda>)
> 200189064   91.928    0.000   91.928    0.000 {built-in method isinstance}
> 199920000   61.608    0.000   61.608    0.000 types.py:1158(_create_row_inbound_converter)
> {code}
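
The profile above shows several Python-level calls per record (row creation, 
__new__, attribute setting). The following is a small sketch of how that 
per-record call pattern shows up under cProfile, without Spark. SimpleRow, 
create_row and convert_all are hypothetical stand-ins, not PySpark's actual 
implementation, and the record count is arbitrary.

```python
import cProfile
import pstats


class SimpleRow(tuple):
    """Hypothetical row type: a tuple that also carries its field names."""

    def __new__(cls, fields, values):
        row = tuple.__new__(cls, values)
        row.fields = fields  # one attribute assignment per record
        return row


def create_row(fields, values):
    # One Python function call per record, similar in spirit to _create_row.
    return SimpleRow(fields, values)


def convert_all(records, fields):
    return [create_row(fields, rec) for rec in records]


records = [(i, str(i)) for i in range(10000)]

profiler = cProfile.Profile()
rows = profiler.runcall(convert_all, records, ("id", "name"))

# pstats keys are (filename, lineno, function name); each entry starts with
# (primitive calls, total calls, ...). Map function name to total calls.
stats = pstats.Stats(profiler)
calls = {func[2]: entry[1] for func, entry in stats.stats.items()}
print(calls["create_row"], calls["__new__"])
```

Each record costs at least two interpreted calls here, so the call counts grow 
linearly with the number of records, which is the pattern the ncalls column 
above shows.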



