[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353641#comment-14353641 ]

Apache Spark commented on SPARK-677:
------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4923

> PySpark should not collect results through local filesystem
> -----------------------------------------------------------
>
>                 Key: SPARK-677
>                 URL: https://issues.apache.org/jira/browse/SPARK-677
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.0.2, 1.1.1, 1.3.0, 1.2.1, 1.4.0
>            Reporter: Josh Rosen
>            Assignee: Davies Liu
>
> Py4J is slow when transferring large arrays, so PySpark currently dumps data to the disk and reads it back in order to collect() RDDs. On large enough datasets, this data will spill from the buffer cache and write to the physical disk, resulting in terrible performance. Instead, we should stream the data from Java to Python over a local socket or a FIFO.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200514#comment-14200514 ]

Matei Zaharia commented on SPARK-677:
-------------------------------------

[~joshrosen] is this fixed now?
[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200686#comment-14200686 ]

Josh Rosen commented on SPARK-677:
----------------------------------

No, it's still an issue in 1.2.0:

{code}
def collect(self):
    """
    Return a list that contains all of the elements in this RDD.
    """
    with SCCallSiteSync(self.context) as css:
        bytesInJava = self._jrdd.collect().iterator()
    return list(self._collect_iterator_through_file(bytesInJava))

def _collect_iterator_through_file(self, iterator):
    # Transferring lots of data through Py4J can be slow because
    # socket.readline() is inefficient. Instead, we'll dump the data to a
    # file and read it back.
    tempFile = NamedTemporaryFile(delete=False, dir=self.ctx._temp_dir)
    tempFile.close()
    self.ctx._writeToFile(iterator, tempFile.name)
    # Read the data into Python and deserialize it:
    with open(tempFile.name, 'rb') as tempFile:
        for item in self._jrdd_deserializer.load_stream(tempFile):
            yield item
    os.unlink(tempFile.name)
{code}
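The socket-based alternative proposed in the issue description can be illustrated in plain Python. This is a hedged, self-contained sketch, not the actual fix from the pull request: the length-prefix framing and the function names (`_serve_items`, `collect_over_socket`) are invented for the example, and the server thread stands in for the JVM side, which would write the same framing.

```python
import pickle
import socket
import struct
import threading

def _serve_items(server_sock, items):
    # Accept one connection and stream length-prefixed pickled items.
    # A zero-length frame marks the end of the stream.
    conn, _ = server_sock.accept()
    with conn:
        for item in items:
            payload = pickle.dumps(item)
            conn.sendall(struct.pack(">I", len(payload)) + payload)
        conn.sendall(struct.pack(">I", 0))

def collect_over_socket(items):
    # Stream items through a loopback socket instead of a temp file,
    # so the bytes pass through kernel buffers rather than the filesystem.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # ephemeral port, loopback only
    server.listen(1)
    threading.Thread(target=_serve_items, args=(server, items), daemon=True).start()

    out = []
    with socket.create_connection(server.getsockname()) as sock:
        f = sock.makefile("rb")
        while True:
            (length,) = struct.unpack(">I", f.read(4))
            if length == 0:
                break
            out.append(pickle.loads(f.read(length)))
    server.close()
    return out

print(collect_over_socket([1, "two", [3.0]]))  # → [1, 'two', [3.0]]
```

Unlike the temp-file approach, nothing here ever spills to the physical disk; the trade-off is that both ends must agree on a framing protocol, which Py4J does not provide for bulk data.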
[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem
[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066458#comment-14066458 ]

Matthew Farrellee commented on SPARK-677:
-----------------------------------------

This can also be used to address the fragile nature of py4j connection construction. The parent can create the FIFO.
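The point above — the parent process creating the FIFO before the child ever opens it, avoiding the connect/retry races that socket setup has to handle — can be sketched as follows. This is a POSIX-only illustration with hypothetical names (`demo_fifo_transfer`); real PySpark would stream serialized partitions, not text lines, and the writer would be the JVM rather than a thread.

```python
import os
import tempfile
import threading

def demo_fifo_transfer(lines):
    # The parent creates the FIFO up front, so by the time either side
    # opens it the rendezvous point already exists on disk.
    fifo_dir = tempfile.mkdtemp()
    fifo_path = os.path.join(fifo_dir, "transfer.fifo")
    os.mkfifo(fifo_path)  # POSIX-only; Windows would need a named pipe

    def writer():
        # open() for writing blocks until a reader opens the other end
        with open(fifo_path, "w") as w:
            for line in lines:
                w.write(line + "\n")

    threading.Thread(target=writer, daemon=True).start()
    # FIFO contents pass through a kernel pipe buffer, not the filesystem,
    # so large transfers never hit the physical disk.
    with open(fifo_path) as r:
        received = [ln.rstrip("\n") for ln in r]
    os.unlink(fifo_path)
    os.rmdir(fifo_dir)
    return received
```

Because both `open()` calls block until the peer arrives, neither side needs the retry loop that makes socket connection construction fragile.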