[ https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-17602.
----------------------------------
    Resolution: Incomplete

> PySpark - Performance Optimization for Large Size of Broadcast Variable
> -----------------------------------------------------------------------
>
>                 Key: SPARK-17602
>                 URL: https://issues.apache.org/jira/browse/SPARK-17602
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.6.2, 2.0.0
>         Environment: Linux
>            Reporter: Xiao Ming Bao
>            Priority: Major
>              Labels: bulk-closed
>         Attachments: PySpark – Performance Optimization for Large Size of Broadcast variable.pdf
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Problem: currently, at the executor side, the broadcast variable is written to disk as a file, and each Python worker process reads the broadcast variable from local disk and deserializes it into a Python object before executing a task. When the broadcast variable is large, this read/deserialization takes a long time. And when the Python worker is NOT reused and the number of tasks is large, performance becomes very poor, since a Python worker must read and deserialize the variable for every task.
>
> Brief of the solution:
> Transfer the broadcast variable to the daemon Python process via a file (or socket/mmap) and deserialize it into an object in the daemon Python process. After a worker Python process is forked by the daemon, the worker automatically has the deserialized object and can use it directly, because of the copy-on-write memory semantics of Linux.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
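For illustration, the proposed fork/copy-on-write sharing can be sketched as follows. This is a minimal sketch of the idea only, not Spark's actual daemon code; the payload and helper names are hypothetical, and it assumes a Unix-like OS where `os.fork` is available (the issue's environment is Linux):

```python
import os
import pickle
import struct

# Hypothetical broadcast payload, standing in for the serialized
# broadcast variable the executor would hand to the Python daemon.
payload = pickle.dumps({"weights": list(range(100_000))})

# Daemon process: deserialize ONCE, before forking any worker.
broadcast_value = pickle.loads(payload)

def run_task():
    # A forked worker uses the inherited in-memory object directly;
    # no per-task disk read or deserialization. The pages holding
    # broadcast_value are shared copy-on-write with the daemon.
    return len(broadcast_value["weights"])

# Fork a worker and collect its result over a pipe.
r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child (worker) process.
    os.close(r)
    os.write(w, struct.pack("i", run_task()))
    os._exit(0)
# Parent (daemon) process.
os.close(w)
os.waitpid(pid, 0)
result = struct.unpack("i", os.read(r, 4))[0]
os.close(r)
print(result)  # 100000
```

The key point is the ordering: deserialization happens in the daemon before `os.fork`, so every forked worker inherits the already-built object for free, instead of each task paying the read/deserialize cost itself.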