[ https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-17602.
----------------------------------
    Resolution: Incomplete

> PySpark - Performance Optimization for Large Size of Broadcast Variable
> -----------------------------------------------------------------------
>
>                 Key: SPARK-17602
>                 URL: https://issues.apache.org/jira/browse/SPARK-17602
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.6.2, 2.0.0
>         Environment: Linux
>            Reporter: Xiao Ming Bao
>            Priority: Major
>              Labels: bulk-closed
>         Attachments: PySpark – Performance Optimization for Large Size of Broadcast variable.pdf
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Problem: currently, at the executor side, the broadcast variable is written to disk as a file, and each Python worker process reads the broadcast variable from local disk and deserializes it into a Python object before executing a task. When the broadcast variable is large, this read/deserialization takes a long time. And when the Python worker is NOT reused and the number of tasks is large, performance becomes very poor, since a Python worker must read and deserialize the variable for every task.
>
> Brief of the solution:
> Transfer the broadcast variable to the daemon Python process via a file (or socket/mmap) and deserialize it into an object in the daemon Python process. After a worker Python process is forked by the daemon, the worker automatically has the deserialized object and can use it directly, because of the copy-on-write memory semantics of Linux.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
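For illustration, the proposed fork/copy-on-write sharing can be sketched as follows. This is a minimal sketch of the idea only, not Spark's actual daemon code; the payload and helper names are hypothetical, and it assumes a Unix-like OS where `os.fork` is available (the issue's environment is Linux):

```python
import os
import pickle
import struct

# Hypothetical broadcast payload, standing in for the serialized
# broadcast variable the executor would hand to the Python daemon.
payload = pickle.dumps({"weights": list(range(100_000))})

# Daemon process: deserialize ONCE, before forking any worker.
broadcast_value = pickle.loads(payload)

def run_task():
    # A forked worker uses the inherited in-memory object directly;
    # no per-task disk read or deserialization. The pages holding
    # broadcast_value are shared copy-on-write with the daemon.
    return len(broadcast_value["weights"])

# Fork a worker and collect its result over a pipe.
r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child (worker) process.
    os.close(r)
    os.write(w, struct.pack("i", run_task()))
    os._exit(0)
# Parent (daemon) process.
os.close(w)
os.waitpid(pid, 0)
result = struct.unpack("i", os.read(r, 4))[0]
os.close(r)
print(result)  # 100000
```

The key point is the ordering: deserialization happens in the daemon before `os.fork`, so every forked worker inherits the already-built object for free, instead of each task paying the read/deserialize cost itself.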