[ https://issues.apache.org/jira/browse/SPARK-16569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379310#comment-15379310 ]
Robert Kruszewski commented on SPARK-16569:
-------------------------------------------

Cython only improves performance if you add type annotations, and once you add them the code is no longer valid Python. You'd also need to introduce a C compiler (gcc/clang) into the execution chain, which might be difficult in some environments. You still instantiate the interpreter, so I'm not sure there's a real difference here; at this point it seems easier to just write Java/Scala instead of Python. If you had something like mypy and used it to feed types to Cython, that would probably work, but it would require changes to Cython.

> Use Cython to speed up Pyspark internals
> ----------------------------------------
>
>                 Key: SPARK-16569
>                 URL: https://issues.apache.org/jira/browse/SPARK-16569
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Maciej Bryński
>            Priority: Minor
>
> CC: [~davies]
> Many operations I do are like:
> {code}
> dataframe.rdd.map(some_function)
> {code}
> In Pyspark this means creating a Row object for every record, and this is slow.
> IDEA:
> Use Cython to speed up Pyspark internals
> What do you think?
> Sample profile:
> {code}
> ============================================================
> Profile of RDD<id=9>
> ============================================================
>          2000373036 function calls (2000312850 primitive calls) in 2045.307 seconds
>    Ordered by: internal time, cumulative time
>    ncalls    tottime  percall   cumtime  percall filename:lineno(function)
>     14948    427.117    0.029  1811.622    0.121 {built-in method loads}
> 199920000    402.086    0.000   937.045    0.000 types.py:1162(_create_row)
> 199920000    262.708    0.000   262.708    0.000 {built-in method __new__ of type object at 0x9d1c40}
> 199920000    190.908    0.000  1219.794    0.000 types.py:558(fromInternal)
> 199920000    153.611    0.000   153.611    0.000 types.py:1280(__setattr__)
> 199920197    145.022    0.000  2024.126    0.000 rdd.py:1004(<genexpr>)
> 199920000    118.640    0.000   381.348    0.000 types.py:1194(__new__)
> 199920000    101.272    0.000  1321.067    0.000 types.py:1159(<lambda>)
> 200189064     91.928    0.000    91.928    0.000 {built-in method isinstance}
> 199920000     61.608    0.000    61.608    0.000 types.py:1158(_create_row_inbound_converter)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
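
The per-record overhead visible in the profile above (one factory call, one `__new__`, and one attribute assignment per record) can be reproduced in miniature without Spark. The following is only a rough sketch in plain Python: `SimpleRow`, `create_row`, `build_rows`, `build_tuples`, and `count_calls` are hypothetical names invented for illustration and are not PySpark's actual implementation. It uses the standard `cProfile` and `pstats` modules to compare how many function calls are recorded when building Row-like objects versus plain tuples.

```python
import cProfile
import pstats


class SimpleRow(tuple):
    """Hypothetical stand-in for a Row-like record: a tuple that carries field names."""

    def __new__(cls, fields, values):
        row = tuple.__new__(cls, values)  # shows up as a built-in __new__ call
        row.__fields__ = fields           # per-record attribute assignment
        return row


def create_row(fields, values):
    # Has the shape of a per-record factory call; this is not PySpark's code.
    return SimpleRow(fields, values)


def build_rows(records, fields):
    return [create_row(fields, rec) for rec in records]


def build_tuples(records):
    return [tuple(rec) for rec in records]


def count_calls(func, *args):
    """Run func under cProfile and return (result, number of recorded calls)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stats = pstats.Stats(profiler)
    return result, stats.total_calls


records = [(i, str(i)) for i in range(10000)]
fields = ("id", "name")

rows, row_calls = count_calls(build_rows, records, fields)
tuples, tuple_calls = count_calls(build_tuples, records)

print("calls recorded building Row-like objects:", row_calls)
print("calls recorded building plain tuples:    ", tuple_calls)
```

The exact counts depend on the Python version, but the Row-like path records several calls per record while the plain-tuple path records far fewer, which is the same pattern as the `_create_row` / `__new__` / `__setattr__` entries in the profile. Whether Cython would remove enough of that overhead is the open question in this thread; the sketch only shows where the calls come from.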