[ 
https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985529#comment-15985529
 ] 

Ismaël Mejía commented on BEAM-673:
-----------------------------------

Oh, you have a point [~lcwik]; I hadn't thought about this as a general 
problem. Note that the case you mention is a common task for resource 
managers such as YARN, Mesos or Kubernetes: they have a concept of 
resources/offers, and the underlying processing system (e.g. Hadoop, Spark, 
Flink) just tells them its preferences for allocating workers.

[~jkff] I agree with the data-dependency aspect that you mention. In the case 
of this JIRA, the sources are the ones that know the information that the 
runner needs to pass to the given resource manager, but it could also be a 
specific transform, e.g. an ML-specific transform could hint that it needs a 
GPU, as Luke mentioned.
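
To make the source-side case concrete, here is a minimal, self-contained sketch 
of the idea: a split reports the hosts that hold its data, and the runner 
prefers one of those hosts when placing the work. Everything in it is 
hypothetical and for illustration only (LocalityHintSketch, LocatedSplit, 
split and assignWorker are made-up names); it is not Beam or Spark API and it 
does not describe how either project implements this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: none of these types exist in Beam or Spark.
final class LocalityHintSketch {

    /** Hypothetical view of a bounded split that knows where its data lives. */
    interface LocatedSplit {
        String id();
        List<String> preferredHosts();
    }

    /** Builds a split with a fixed, ordered list of preferred hosts. */
    static LocatedSplit split(final String id, String... hosts) {
        final List<String> copy =
            Collections.unmodifiableList(new ArrayList<>(Arrays.asList(hosts)));
        return new LocatedSplit() {
            @Override public String id() { return id; }
            @Override public List<String> preferredHosts() { return copy; }
        };
    }

    /**
     * Hypothetical runner-side step: pick the first preferred host that is
     * currently available. Locality is treated as a hint, not a requirement,
     * so fall back to any available worker when no preferred host is free.
     */
    static String assignWorker(LocatedSplit split, List<String> availableWorkers) {
        if (availableWorkers.isEmpty()) {
            throw new IllegalArgumentException("no workers available");
        }
        for (String host : split.preferredHosts()) {
            if (availableWorkers.contains(host)) {
                return host;
            }
        }
        return availableWorkers.get(0);
    }
}
```

In a real runner the selection would be delegated to the resource manager 
rather than done by hand; the sketch only shows which side owns the 
information (the source) and which side consumes it (the runner).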

This definitely deserves extra research and a more formal design to cover the 
more general scenario, so I am moving it out of the FSR list. I will also 
create a new JIRA for the general case and leave this one for the particular 
case of data locality for the Spark runner.


> Data locality for Read.Bounded
> ------------------------------
>
>                 Key: BEAM-673
>                 URL: https://issues.apache.org/jira/browse/BEAM-673
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Amit Sela
>            Assignee: Ismaël Mejía
>
> In some distributed filesystems, such as HDFS, we should be able to hint to 
> Spark the preferred locations of splits.
> Here is an example of how Spark does that for Hadoop RDDs:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L249



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
