[jira] [Resolved] (SPARK-677) PySpark should not collect results through local filesystem

Josh Rosen (JIRA) Fri, 22 May 2015 13:40:34 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Josh Rosen resolved SPARK-677.
------------------------------
          Resolution: Fixed
       Fix Version/s: 1.2.2
                      1.3.1
                      1.4.0
    Target Version/s: 1.3.1, 1.2.2, 1.4.0  (was: 1.2.2, 1.3.1, 1.4.0)

This was fixed for 1.3.1, 1.2.2, and 1.4.0.  I don't think that we'll do a 1.1.x 
backport, so I'm going to mark this as resolved.

> PySpark should not collect results through local filesystem
> -----------------------------------------------------------
>
>                 Key: SPARK-677
>                 URL: https://issues.apache.org/jira/browse/SPARK-677
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0, 1.4.0
>            Reporter: Josh Rosen
>            Assignee: Davies Liu
>             Fix For: 1.4.0, 1.3.1, 1.2.2
>
>
> Py4J is slow when transferring large arrays, so PySpark currently dumps data 
> to the disk and reads it back in order to collect() RDDs.  On large enough 
> datasets, this data will spill from the buffer cache and write to the 
> physical disk, resulting in terrible performance.
> Instead, we should stream the data from Java to Python over a local socket or 
> a FIFO.
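
To illustrate the socket-based approach proposed in the description, here is a
minimal, self-contained sketch in plain Python. It is not the code that was
merged for this issue and it does not involve Py4J or the JVM: both ends are
Python, and the names serve_records and read_records, the pickle encoding, and
the 4-byte length-prefixed framing are all assumptions made for the example.
It only shows the general idea of streaming records over a loopback socket
instead of writing them to a temporary file and reading them back.

```python
import pickle
import socket
import struct
import threading


def serve_records(records):
    """Hypothetical sender: stream records over a one-shot loopback socket.

    Returns the port number that the reader should connect to.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def run():
        conn, _ = server.accept()
        with conn, server:
            for record in records:
                payload = pickle.dumps(record)
                # Assumed framing: 4-byte big-endian length, then the payload.
                conn.sendall(struct.pack("!i", len(payload)) + payload)

    threading.Thread(target=run, daemon=True).start()
    return port


def read_records(port):
    """Hypothetical reader: yield records until the sender closes the stream."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        with sock.makefile("rb") as stream:
            while True:
                header = stream.read(4)
                if len(header) < 4:
                    break  # connection closed, no more records
                (length,) = struct.unpack("!i", header)
                yield pickle.loads(stream.read(length))
```

For example, list(read_records(serve_records(range(5)))) collects the five
integers without touching the filesystem. Because the reader consumes records
as they arrive, nothing needs to be staged on disk, which is the property the
issue asks for.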



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
