[ https://issues.apache.org/jira/browse/SPARK-33120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212670#comment-17212670 ]

Dongjoon Hyun edited comment on SPARK-33120 at 10/12/20, 9:00 PM:
------------------------------------------------------------------

Hi, [~tsmock]. What is the benefit you need here?
bq. I would like to avoid copying all of the files to every executor until 
they are actually needed.


was (Author: dongjoon):
Hi, [~tsmock]. What is the benefit you need here?
> I would like to avoid copying all of the files to every executor until they 
> are actually needed.

> Lazy Load of SparkContext.addFiles
> ----------------------------------
>
>                 Key: SPARK-33120
>                 URL: https://issues.apache.org/jira/browse/SPARK-33120
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.1
>         Environment: Mac OS X (2 systems), workload to eventually be run on 
> Amazon EMR.
> Java 11 application.
>            Reporter: Taylor Smock
>            Priority: Minor
>
> In my Spark job, I may have various files that may or may not be used by 
> each task.
> I would like to avoid copying all of the files to every executor until they 
> are actually needed.
>  
> What I've tried:
>  * SparkContext.addFiles w/ SparkFiles.get. In testing, all files were 
> distributed to all executors (see the sketch after this list).
>  * Broadcast variables. Since I _don't_ know which files I'm going to need 
> until I have started the task, I have to broadcast all the data at once, 
> which leads to every node receiving the data and then caching it to disk. In 
> short, the same issues as SparkContext.addFiles, but with the added benefit 
> of being able to create a mapping of paths to files.
> What I would like to see:
>  * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file, 
> Enum.WaitForAvailability) or Future<?> future = SparkFiles.get(file) (a 
> rough approximation with today's APIs is sketched below)
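The requested addFiles(..., LazyLoad) / WaitForAvailability API does not exist today. Purely as a point of comparison, here is one rough approximation with current APIs: broadcast only a small name-to-URI map and let each task open the single file it actually needs from shared storage. The bucket and object names are hypothetical:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

import java.util.Arrays;
import java.util.HashMap;

public class LazyFileSketch {
    public static void main(String[] args) throws Exception {
        try (JavaSparkContext jsc = new JavaSparkContext("local[2]", "lazy-file-sketch")) {
            HashMap<String, String> index = new HashMap<>();
            index.put("a", "s3a://some-bucket/lookup-a.bin");  // hypothetical URIs
            index.put("b", "s3a://some-bucket/lookup-b.bin");

            // Only the small name -> URI map is shipped to executors, not the file bytes.
            Broadcast<HashMap<String, String>> locations = jsc.broadcast(index);

            jsc.parallelize(Arrays.asList("a", "b"), 2)
               .map(key -> {
                   // The file is read only when (and where) a task actually needs it.
                   Path p = new Path(locations.value().get(key));
                   try (FSDataInputStream in = p.getFileSystem(new Configuration()).open(p)) {
                       return key + " -> " + in.readAllBytes().length + " bytes";
                   }
               })
               .collect()
               .forEach(System.out::println);
        }
    }
}
{code}

In a real job the Hadoop Configuration would come from the cluster rather than new Configuration(), and this pays a read per task instead of the one-time per-executor copy that addFile provides, so it is only a sketch of the intent, not a substitute for the proposed API.
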
>  
>  
> Notes: 
> https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346
>  indicated that `SparkFiles.get` would be required to get the data on the 
> local driver, but in my testing that did not appear to be the case.
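A small sketch of how the driver-side observation in that note could be checked, assuming local mode and the stock SparkFiles API; the input path is hypothetical:

{code:java}
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

import java.nio.file.Files;
import java.nio.file.Paths;

public class DriverSideSparkFilesCheck {
    public static void main(String[] args) throws Exception {
        try (JavaSparkContext jsc = new JavaSparkContext("local[1]", "driver-side-check")) {
            jsc.addFile("/tmp/example.txt");                    // hypothetical input file
            // SparkFiles.get also resolves on the driver, under the driver's temp root.
            String driverCopy = SparkFiles.get("example.txt");
            System.out.println(driverCopy + " exists on driver: "
                    + Files.exists(Paths.get(driverCopy)));
        }
    }
}
{code}
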


