[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem

2015-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353641#comment-14353641
 ] 

Apache Spark commented on SPARK-677:


User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4923

 PySpark should not collect results through local filesystem
 ---

 Key: SPARK-677
 URL: https://issues.apache.org/jira/browse/SPARK-677
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.0.2, 1.1.1, 1.3.0, 1.2.1, 1.4.0
Reporter: Josh Rosen
Assignee: Davies Liu

 Py4J is slow when transferring large arrays, so PySpark currently dumps data 
 to the disk and reads it back in order to collect() RDDs.  On large enough 
 datasets, this data will spill from the buffer cache and write to the 
 physical disk, resulting in terrible performance.
 Instead, we should stream the data from Java to Python over a local socket or 
 a FIFO.
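The proposed alternative can be sketched in plain Python (illustrative only; names like `serve_items` are made up here, and in Spark the producer side would live in the JVM): the producer writes length-prefixed serialized records to a loopback socket, and the consumer deserializes them as they arrive, so nothing touches the filesystem.

```python
import pickle
import socket
import struct
import threading

def serve_items(items):
    """Serve pickled items over an ephemeral loopback socket (stand-in for the JVM side)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))   # ephemeral port, loopback only
    server.listen(1)
    port = server.getsockname()[1]

    def run():
        conn, _ = server.accept()
        with conn, conn.makefile("wb") as out:
            for item in items:
                payload = pickle.dumps(item)
                out.write(struct.pack(">I", len(payload)))  # length-prefixed frames
                out.write(payload)
        server.close()

    threading.Thread(target=run, daemon=True).start()
    return port

def collect_from_socket(port):
    """Consumer side: stream items back without going through the filesystem."""
    sock = socket.create_connection(("127.0.0.1", port))
    with sock, sock.makefile("rb") as inp:
        while True:
            header = inp.read(4)
            if len(header) < 4:   # EOF: producer closed the connection
                break
            (length,) = struct.unpack(">I", header)
            yield pickle.loads(inp.read(length))

print(list(collect_from_socket(serve_items(range(5)))))  # -> [0, 1, 2, 3, 4]
```

Binding to port 0 on 127.0.0.1 keeps the channel local and avoids port collisions; the length prefix lets the reader frame records without the slow `socket.readline()` scanning that motivated the temp-file workaround in the first place.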



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem

2014-11-06 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200514#comment-14200514
 ] 

Matei Zaharia commented on SPARK-677:
-

[~joshrosen] is this fixed now?




[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem

2014-11-06 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200686#comment-14200686
 ] 

Josh Rosen commented on SPARK-677:
--

No, it's still an issue in 1.2.0:

{code}
    def collect(self):
        """
        Return a list that contains all of the elements in this RDD.
        """
        with SCCallSiteSync(self.context) as css:
            bytesInJava = self._jrdd.collect().iterator()
        return list(self._collect_iterator_through_file(bytesInJava))

    def _collect_iterator_through_file(self, iterator):
        # Transferring lots of data through Py4J can be slow because
        # socket.readline() is inefficient.  Instead, we'll dump the data to a
        # file and read it back.
        tempFile = NamedTemporaryFile(delete=False, dir=self.ctx._temp_dir)
        tempFile.close()
        self.ctx._writeToFile(iterator, tempFile.name)
        # Read the data into Python and deserialize it:
        with open(tempFile.name, 'rb') as tempFile:
            for item in self._jrdd_deserializer.load_stream(tempFile):
                yield item
        os.unlink(tempFile.name)
{code}
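Stripped of the Spark internals, the pattern above amounts to the following round trip through a temp file (a stand-alone sketch only; `pickle` stands in for `_writeToFile` and `load_stream`, which are not reproduced here):

```python
import os
import pickle
from tempfile import NamedTemporaryFile

def collect_through_file(items):
    """Round-trip items through a temp file, mirroring _collect_iterator_through_file."""
    tmp = NamedTemporaryFile(delete=False)
    try:
        # Writer side (in PySpark this is the JVM dumping serialized records).
        with tmp:
            for item in items:
                pickle.dump(item, tmp)
        # Reader side: deserialize everything back into Python.
        with open(tmp.name, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    break
    finally:
        os.unlink(tmp.name)

print(list(collect_through_file(range(3))))  # -> [0, 1, 2]
```

The correctness is fine; the problem is purely mechanical: once the collected data exceeds what the OS buffer cache will hold, the write and the immediate re-read both hit the physical disk.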




[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem

2014-07-18 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066458#comment-14066458
 ] 

Matthew Farrellee commented on SPARK-677:
-

This could also be used to address the fragile way the Py4J connection is 
constructed: the parent can create the FIFO.
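A minimal, POSIX-only sketch of that idea (illustrative, not Spark code): the parent creates the FIFO up front so the rendezvous point exists before either side connects, a writer thread stands in for the JVM side, and the data flows through an in-memory pipe buffer rather than the disk.

```python
import os
import pickle
import tempfile
import threading

# Parent creates the FIFO before either side connects
# (POSIX-only: os.mkfifo is unavailable on Windows).
fifo_dir = tempfile.mkdtemp()
fifo_path = os.path.join(fifo_dir, "collect.fifo")
os.mkfifo(fifo_path)

def writer():
    # Stand-in for the JVM side streaming serialized records.
    # open() blocks until a reader opens the other end.
    with open(fifo_path, "wb") as out:
        for item in range(5):
            pickle.dump(item, out)

threading.Thread(target=writer, daemon=True).start()

# The reader's open() likewise blocks until the writer connects; the kernel
# then moves the bytes through a pipe buffer without touching the disk.
results = []
with open(fifo_path, "rb") as inp:
    while True:
        try:
            results.append(pickle.load(inp))
        except EOFError:
            break

os.unlink(fifo_path)
os.rmdir(fifo_dir)
print(results)  # -> [0, 1, 2, 3, 4]
```

Because both `open()` calls block until the peer arrives, the FIFO doubles as the connection handshake, which is the robustness benefit alluded to above.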
