[ https://issues.apache.org/jira/browse/SPARK-33120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212679#comment-17212679 ]
Taylor Smock commented on SPARK-33120:
--------------------------------------

I'd like to avoid using excess network and disk resources. For example, if I only have 5 GiB of space left on a node and I've got 10 GiB of data, I don't want to send that node anything it doesn't need.

Concretely, I'm doing something geographic: I've got a set of binary data files for the whole world (from the NASA SRTM elevation data, if you are interested). The current binary files have a naming scheme like `(N/S)<lat>(E/W)<lon>.ext`. I can work around that, but I've been trying to keep the methodology generic enough for future binary data files.

I think the best solution would be a lazy load for the `addFiles` function (each file is used by relatively few jobs). I could be approaching the problem sub-optimally, though. This isn't currently high priority for me (hence `minor`), since the total size of the data files is under 10 GiB (there are 9576 elevation files).

Hopefully this answers your question.

> Lazy Load of SparkContext.addFiles
> ----------------------------------
>
>                 Key: SPARK-33120
>                 URL: https://issues.apache.org/jira/browse/SPARK-33120
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.1
>         Environment: Mac OS X (2 systems), workload to eventually be run on Amazon EMR. Java 11 application.
>            Reporter: Taylor Smock
>            Priority: Minor
>
> In my Spark job, I may have various random files that may or may not be used by each task. I would like to avoid copying all of the files to every executor until they are actually needed.
>
> What I've tried:
> * SparkContext.addFiles w/ SparkFiles.get. In testing, all files were distributed to all clients.
> * Broadcast variables. Since I _don't_ know what files I'm going to need until I have started the task, I have to broadcast all the data at once, which leads to nodes getting data and then caching it to disk.
> In short, the same issues as SparkContext.addFiles, but with the added benefit of being able to create a mapping of paths to files.
>
> What I would like to see:
> * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file, Enum.WaitForAvailability) or Future<?> future = SparkFiles.get(file)
>
> Notes:
> https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346 indicated that `SparkFiles.get` would be required to get the data on the local driver, but in my testing that did not appear to be the case.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
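For illustration, the `(N/S)<lat>(E/W)<lon>` naming scheme mentioned in the comment (SRTM tiles such as `N37W105.hgt`, named for the southwest corner of each one-degree cell) could be computed with a small helper. This `SrtmTiles` class is hypothetical and not part of Spark or the SRTM tooling:

```java
// Hypothetical helper illustrating the (N/S)<lat>(E/W)<lon> tile-naming
// scheme described above. SRTM tiles are named for the southwest corner
// of the one-degree cell containing the point, e.g. N37W105.
public class SrtmTiles {
    /** Returns the tile name for the 1-degree cell containing (lat, lon). */
    public static String srtmTileName(double lat, double lon) {
        int latCell = (int) Math.floor(lat);   // southwest corner latitude
        int lonCell = (int) Math.floor(lon);   // southwest corner longitude
        char ns = latCell >= 0 ? 'N' : 'S';
        char ew = lonCell >= 0 ? 'E' : 'W';
        // Conventional zero-padding: 2 digits for latitude, 3 for longitude.
        return String.format("%c%02d%c%03d", ns, Math.abs(latCell), ew, Math.abs(lonCell));
    }
}
```

With a helper like this, a task that knows which coordinates it is processing can derive exactly which tile files it needs, which is what makes per-file lazy loading attractive here.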
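The `Future`-returning API requested under "What I would like to see" could behave roughly like the plain-Java mock below. This is a sketch only: `LazyFiles` and its `fetch` method are stand-ins for the proposed `SparkFiles.get(file)` returning a `Future`, not real Spark code. The point is that a file is only fetched the first time some task asks for it, and files never requested are never transferred:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java mock of the proposed lazy SparkFiles.get: files are only
// fetched (here, simulated) when first requested, and the result is cached
// so concurrent callers share one transfer.
public class LazyFiles {
    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final Map<String, CompletableFuture<String>> cache = new ConcurrentHashMap<>();

    /** Lazily fetch a file; returns immediately with a Future. */
    public Future<String> get(String name) {
        return cache.computeIfAbsent(name, n ->
            CompletableFuture.supplyAsync(() -> fetch(n), pool));
    }

    // Stand-in for the actual network transfer to the executor's local disk.
    private String fetch(String name) {
        return "/local/path/" + name;  // hypothetical local path
    }

    public void shutdown() { pool.shutdown(); }
}
```

A caller wanting the proposed `Enum.WaitForAvailability` behavior would simply block on the future, e.g. `lazyFiles.get("N37W105.hgt").get()`, while a node whose tasks never call `get` for a given file never pays the network or disk cost for it.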