[ https://issues.apache.org/jira/browse/SPARK-33120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212670#comment-17212670 ]
Dongjoon Hyun edited comment on SPARK-33120 at 10/12/20, 9:00 PM:
------------------------------------------------------------------

Hi, [~tsmock]. What is the benefit you need here?

bq. I would like to avoid copying all of the files to every executor until it is actually needed.

was (Author: dongjoon):
Hi, [~tsmock]. What is the benefit you need here?

> I would like to avoid copying all of the files to every executor until it is actually needed.


> Lazy Load of SparkContext.addFiles
> ----------------------------------
>
>                 Key: SPARK-33120
>                 URL: https://issues.apache.org/jira/browse/SPARK-33120
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.1
>        Environment: Mac OS X (2 systems), workload to eventually be run on Amazon EMR.
>                      Java 11 application.
>            Reporter: Taylor Smock
>            Priority: Minor
>
> In my Spark job, I may have various random files that may or may not be used by each task.
> I would like to avoid copying all of the files to every executor until it is actually needed.
>
> What I've tried:
> * SparkContext.addFiles w/ SparkFiles.get. In testing, all files were distributed to all clients.
> * Broadcast variables. Since I _don't_ know what files I'm going to need until I have started the task, I have to broadcast all the data at once, which leads to nodes getting the data and then caching it to disk. In short, the same issues as SparkContext.addFiles, but with the added benefit of being able to create a mapping of paths to files.
>
> What I would like to see:
> * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file, Enum.WaitForAvailability) or Future<?> future = SparkFiles.get(file)
>
> Notes:
> https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346 indicated that `SparkFiles.get` would be required to get the data on the local driver, but in my testing that did not appear to be the case.
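For context, here is a minimal, self-contained sketch of the eager pattern the issue describes, using the existing SparkContext.addFile / SparkFiles.get API. The file path, app name, and master are placeholders for illustration only, not taken from the report. With the current behaviour, an added file is shipped to every executor that runs a task for the application, whether or not the task ever reads it:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

// Sketch of the current, eager addFile/SparkFiles.get behaviour.
// Path, app name, and master are hypothetical placeholders.
object EagerAddFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("addFile-sketch").setMaster("local[2]"))

    // The file is registered on the driver; with today's semantics every
    // executor fetches it before running its first task for this
    // application, regardless of whether the task reads it.
    sc.addFile("/tmp/lookup-data.csv")

    val localPaths = sc.parallelize(1 to 4, 2).map { i =>
      // SparkFiles.get only resolves the local path of the already-fetched
      // copy; it does not trigger the download itself.
      (i, SparkFiles.get("lookup-data.csv"))
    }.collect()

    localPaths.foreach(println)
    sc.stop()
  }
}
{code}

The lazy variant requested above (e.g. addFiles(file, Enum.LazyLoad) paired with a blocking or Future-returning SparkFiles.get) does not exist in Spark today; the sketch only illustrates the eager distribution the reporter would like to defer until a task actually asks for the file.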