[ https://issues.apache.org/jira/browse/SPARK-33120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212679#comment-17212679 ]

Taylor Smock commented on SPARK-33120:
--------------------------------------

I'd like to avoid using excess network and disk resources. For example, if I 
only have 5 GiB of space left on a node and I've got 10 GiB of data, I don't 
want to send that node anything it doesn't need.

 

For example, I'm working with geographic data: I've got a set of binary data 
files covering the whole world (from the NASA SRTM elevation data, if you are 
interested). The current binary files have a naming scheme like 
`(N/S)<lat>(E/W)<lon>.ext`. I can work around that, but I've been trying to 
keep the methodology generic enough for future binary data files. I think the 
best solution would be a lazy-load option for the `addFiles` function, since 
each file is used by relatively few tasks; a rough sketch of what I'm doing 
today is below. I could be approaching the problem in a sub-optimal fashion, 
though.
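
Roughly like this (the tile-name helper, the `.hgt` extension, and the paths 
here are illustrative rather than my actual code):

```java
import java.util.Arrays;
import java.util.Locale;

import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public class SrtmExample {

    /** Build an SRTM-style tile name such as "N40W105.hgt" from a point's coordinates. */
    static String tileName(double lat, double lon) {
        int latTile = (int) Math.floor(lat);
        int lonTile = (int) Math.floor(lon);
        return String.format(Locale.ROOT, "%s%02d%s%03d.hgt",
                latTile >= 0 ? "N" : "S", Math.abs(latTile),
                lonTile >= 0 ? "E" : "W", Math.abs(lonTile));
    }

    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("local[*]", "srtm-demo");

        // Today, every file added here is shipped to every executor up front,
        // even though a given task usually needs only one or two tiles.
        jsc.addFile("/data/srtm/" + tileName(40.0, -105.0));

        jsc.parallelize(Arrays.asList(40.5), 1).foreach(lat -> {
            // Inside the task, resolve the executor-local copy of the one tile we need.
            String localPath = SparkFiles.get(tileName(lat, -105.0));
            System.out.println("Tile for " + lat + " available at " + localPath);
        });

        jsc.stop();
    }
}
```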

 

This isn't (currently) high priority for me (hence `minor`), since the total 
size of the data files is currently < 10 GiB (there are 9576 elevation files).

Hopefully this answered your question.

> Lazy Load of SparkContext.addFiles
> ----------------------------------
>
>                 Key: SPARK-33120
>                 URL: https://issues.apache.org/jira/browse/SPARK-33120
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.1
>         Environment: Mac OS X (2 systems), workload to eventually be run on 
> Amazon EMR.
> Java 11 application.
>            Reporter: Taylor Smock
>            Priority: Minor
>
> In my Spark job, I may have various random files that may or may not be used 
> by each task.
> I would like to avoid copying all of the files to every executor until they 
> are actually needed.
>  
> What I've tried:
>  * SparkContext.addFiles w/ SparkFiles.get: in testing, all files were 
> distributed to all executors.
>  * Broadcast variables: since I _don't_ know which files I'm going to need 
> until I have started the task, I have to broadcast all the data at once, 
> which leads to every node fetching the data and then caching it to disk. In 
> short, the same issues as SparkContext.addFiles, but with the added benefit 
> of being able to create a mapping of paths to files.
> What I would like to see:
>  * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file, 
> Enum.WaitForAvailability) or Future<?> future = SparkFiles.get(file)
>  
>  
> Notes: 
> https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346
>  indicated that `SparkFiles.get` would be required to get the data on the 
> local driver, but in my testing that did not appear to be the case.
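
For reference, the broadcast-variable workaround mentioned in the description 
above looks roughly like the following (tile names and paths are illustrative, 
and it assumes the files fit comfortably in driver memory):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastWorkaround {

    public static void main(String[] args) throws IOException {
        JavaSparkContext jsc = new JavaSparkContext("local[*]", "broadcast-demo");

        // Read every tile up front and broadcast a name -> bytes mapping,
        // because we do not know which tiles a task will need until it runs.
        Map<String, byte[]> tiles = new HashMap<>();
        for (String name : Arrays.asList("N40W105.hgt", "N40W106.hgt")) {
            Path path = Paths.get("/data/srtm", name);
            tiles.put(name, Files.readAllBytes(path));
        }
        Broadcast<Map<String, byte[]>> broadcastTiles = jsc.broadcast(tiles);

        jsc.parallelize(Arrays.asList("N40W105.hgt"), 1).foreach(name -> {
            // Every executor that touches the broadcast pulls the *whole* map,
            // even if the task only needs a single tile from it.
            byte[] data = broadcastTiles.value().get(name);
            System.out.println(name + " has " + data.length + " bytes");
        });

        jsc.stop();
    }
}
```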


