[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-677:
----------------------------------

    Assignee: Apache Spark  (was: Davies Liu)

> PySpark should not collect results through local filesystem
> -----------------------------------------------------------
>
>                 Key: SPARK-677
>                 URL: https://issues.apache.org/jira/browse/SPARK-677
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0, 1.4.0
>            Reporter: Josh Rosen
>            Assignee: Apache Spark
>
> Py4J is slow when transferring large arrays, so PySpark currently dumps data
> to the local filesystem and reads it back in order to collect() RDDs. On
> large enough datasets, this data spills out of the buffer cache and is
> written to the physical disk, resulting in terrible performance.
> Instead, we should stream the data from Java to Python over a local socket or
> a FIFO.
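
The local-socket approach proposed above can be sketched roughly as follows. This is an illustration only, not PySpark's actual implementation: the function names `serve_records` and `collect_from_socket`, the 4-byte length-prefixed framing, and the `-1` end-of-stream marker are all assumptions made for the example, and a Python thread stands in for the JVM side.

```python
import pickle
import socket
import struct
import threading


def serve_records(records):
    """Hypothetical stand-in for the JVM side: stream records over loopback."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def run():
        conn, _ = server.accept()
        with conn, conn.makefile("wb") as out:
            for record in records:
                payload = pickle.dumps(record)
                # Assumed framing: 4-byte big-endian length, then the payload.
                out.write(struct.pack(">i", len(payload)))
                out.write(payload)
            # Assumed end-of-stream marker: a negative length.
            out.write(struct.pack(">i", -1))
        server.close()

    threading.Thread(target=run, daemon=True).start()
    return port


def collect_from_socket(port):
    """Python side: yield records as they arrive, never touching the filesystem."""
    with socket.create_connection(("127.0.0.1", port)) as sock, \
            sock.makefile("rb") as inp:
        while True:
            (length,) = struct.unpack(">i", inp.read(4))
            if length < 0:
                break
            yield pickle.loads(inp.read(length))


if __name__ == "__main__":
    port = serve_records([1, "two", [3.0]])
    print(list(collect_from_socket(port)))
```

Because the reader is a generator, records can be consumed as they arrive instead of after a full write-then-read round trip through a file. A FIFO would follow the same framing with a named pipe in place of the socket.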


