[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066458#comment-14066458 ]
Matthew Farrellee commented on SPARK-677:
-----------------------------------------

This can also be used to address the fragile nature of Py4J connection construction. The parent can create the FIFO.

> PySpark should not collect results through local filesystem
> -----------------------------------------------------------
>
>                 Key: SPARK-677
>                 URL: https://issues.apache.org/jira/browse/SPARK-677
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 0.7.0
>            Reporter: Josh Rosen
>
> Py4J is slow when transferring large arrays, so PySpark currently dumps data
> to the disk and reads it back in order to collect() RDDs. On large enough
> datasets, this data will spill from the buffer cache and write to the
> physical disk, resulting in terrible performance.
> Instead, we should stream the data from Java to Python over a local socket or
> a FIFO.
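
To make the FIFO idea concrete, here is a rough sketch of what "the parent creates the FIFO, the data is streamed through it" could look like. It is not PySpark code. The names (write_records, read_records, collect_via_fifo) are made up for illustration, a Python thread stands in for the JVM side, and pickle plus a 4-byte length prefix is an assumed framing rather than anything Spark defines. It relies on os.mkfifo, so it is POSIX only.

```python
import os
import pickle
import struct
import tempfile
import threading


def write_records(path, records):
    """Producer side (a stand-in for the JVM): stream length-prefixed frames."""
    with open(path, "wb") as out:
        for record in records:
            payload = pickle.dumps(record)
            out.write(struct.pack(">I", len(payload)))  # 4-byte big-endian length
            out.write(payload)


def read_records(path):
    """Consumer side (the Python driver): yield records until the writer closes."""
    with open(path, "rb") as src:
        while True:
            header = src.read(4)
            if len(header) < 4:  # EOF: the writer closed its end of the FIFO
                return
            (length,) = struct.unpack(">I", header)
            yield pickle.loads(src.read(length))


def collect_via_fifo(records):
    """Hypothetical helper: the parent creates the FIFO, both ends open it by path."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "collect.fifo")
    os.mkfifo(path)  # POSIX only; the parent owns creation and cleanup
    try:
        writer = threading.Thread(target=write_records, args=(path, records))
        writer.start()
        result = list(read_records(path))
        writer.join()
        return result
    finally:
        os.remove(path)
        os.rmdir(workdir)
```

Because the parent creates the FIFO before either end opens it, neither side has to discover a port or retry a connection, and the data goes through a kernel pipe buffer instead of a regular file on disk.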