[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem

Matthew Farrellee (JIRA) Fri, 18 Jul 2014 08:47:42 -0700

    [ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066458#comment-14066458 ]


Matthew Farrellee commented on SPARK-677:
-----------------------------------------

This can also be used to address the fragile nature of Py4J connection 
construction: the parent process can create the FIFO.
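
A minimal sketch of that idea, assuming a POSIX system: the parent creates the FIFO before launching the child, so the child only has to open a path it is handed. The child here is a stand-in Python one-liner rather than a real JVM gateway, and the names (launch_child_with_fifo, CHILD_CODE, handshake.fifo) and the port value are hypothetical, not taken from PySpark or Py4J.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical child: stands in for the launched gateway process and
# reports a made-up port number through the FIFO it is given.
CHILD_CODE = (
    "import sys\n"
    "with open(sys.argv[1], 'w') as fifo:\n"
    "    fifo.write('port=12345\\n')\n"
)


def launch_child_with_fifo():
    """Create a FIFO in the parent, start the child, and read its handshake line."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "handshake.fifo")
    os.mkfifo(fifo_path, 0o600)  # the parent owns the rendezvous point
    try:
        child = subprocess.Popen([sys.executable, "-c", CHILD_CODE, fifo_path])
        # Opening for read blocks until the child opens the FIFO for writing.
        with open(fifo_path, "r") as fifo:
            message = fifo.readline().strip()
        child.wait()
        return message
    finally:
        os.remove(fifo_path)
        os.rmdir(workdir)
```

Because the parent creates the FIFO, there is no window in which the parent has to guess where the child will publish its connection details. A real launcher would also need a timeout in case the child never opens the FIFO.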

> PySpark should not collect results through local filesystem
> -----------------------------------------------------------
>
>                 Key: SPARK-677
>                 URL: https://issues.apache.org/jira/browse/SPARK-677
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 0.7.0
>            Reporter: Josh Rosen
>
> Py4J is slow when transferring large arrays, so PySpark currently dumps data 
> to disk and reads it back in order to collect() RDDs.  On large enough 
> datasets, this data spills out of the buffer cache and is written to the 
> physical disk, resulting in terrible performance.
> Instead, we should stream the data from Java to Python over a local socket or 
> a FIFO.
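
The local-socket variant proposed above could look roughly like the sketch below. It is written entirely in Python for illustration, with a thread standing in for the Java side. The length-prefixed framing, the negative end-of-stream marker, and all function names are assumptions of this sketch, not the protocol PySpark uses.

```python
import socket
import struct
import threading


def send_records(conn, records):
    """Write each record as a 4-byte big-endian length followed by its bytes."""
    with conn:
        for record in records:
            conn.sendall(struct.pack(">i", len(record)) + record)
        conn.sendall(struct.pack(">i", -1))  # assumed end-of-stream marker


def read_exact(stream, n):
    """Read exactly n bytes or raise if the stream ends early."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("stream closed mid-record")
    return data


def receive_records(sock):
    """Yield records from the socket until the end-of-stream marker arrives."""
    with sock.makefile("rb") as stream:
        while True:
            (length,) = struct.unpack(">i", read_exact(stream, 4))
            if length < 0:
                return
            yield read_exact(stream, length)


def collect_over_socket(records):
    """Stream records from a producer thread to the caller over loopback."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def serve():
        conn, _ = server.accept()
        send_records(conn, records)

    producer = threading.Thread(target=serve)
    producer.start()
    with socket.create_connection(("127.0.0.1", port)) as client:
        result = list(receive_records(client))
    producer.join()
    server.close()
    return result
```

For example, collect_over_socket([b"a", b"bc"]) returns the same two byte strings without anything touching the filesystem. The consumer processes records as they arrive, so the full dataset never needs to be staged in a temporary file.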



--
This message was sent by Atlassian JIRA
(v6.2#6252)
