[ https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985529#comment-15985529 ]
Ismaël Mejía commented on BEAM-673:
-----------------------------------

Oh, you have a point [~lcwik]; I hadn't thought about this as a general problem. Note that the case you mention is a common task of resource managers, e.g. YARN, Mesos or Kubernetes: they have the concept of resources/offers, and the underlying processing system (e.g. Hadoop, Spark, Flink) just tells them its preferences for allocating the workers.

[~jkff] I agree with the data dependency aspect that you mention. In the case of this JIRA, the sources are the ones that know the information that the runner needs to pass to the given resource manager, but it could also be a specific transform, e.g. an ML-specific transform could hint at the need for a GPU, as Luke mentioned.

This definitely deserves extra research and a more formal design to cover the more general scenario, so I am moving it out of the FSR list. I will also create a new JIRA for the more general case and leave this one for the particular case of data locality for the Spark runner.

> Data locality for Read.Bounded
> ------------------------------
>
>                 Key: BEAM-673
>                 URL: https://issues.apache.org/jira/browse/BEAM-673
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Amit Sela
>            Assignee: Ismaël Mejía
>
> In some distributed filesystems, such as HDFS, we should be able to hint to
> Spark the preferred locations of splits.
> Here is an example of how Spark does that for Hadoop RDDs:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L249

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)