[jira] [Commented] (BEAM-673) Data locality for Read.Bounded
[ https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985529#comment-15985529 ]

Ismaël Mejía commented on BEAM-673:
---

Oh, you have a point [~lcwik], I hadn't thought about this as a general problem. Note that the case you mention is a common task for resource managers, e.g. YARN, Mesos or Kubernetes: they have the concept of resources/offers, and the underlying processing system, e.g. Hadoop, Spark or Flink, just tells them its preferences when allocating workers. [~jkff] I agree with the data-dependency aspect you mention. In the case of this JIRA, the sources are the ones that know the information the runner needs to pass to the given resource manager, but it could also be a specific transform that carries the requirement, e.g. an ML transform could hint that it needs a GPU, as Luke mentioned. This definitely deserves extra research and a more formal design to cover the more general scenario, so I am moving it out of the FSR list. I will also create a new JIRA for the general case and leave this one for the particular case of data locality for the Spark runner.

> Data locality for Read.Bounded
> ------------------------------
>
>          Key: BEAM-673
>          URL: https://issues.apache.org/jira/browse/BEAM-673
>      Project: Beam
>   Issue Type: Bug
>   Components: runner-spark
>     Reporter: Amit Sela
>     Assignee: Ismaël Mejía
>
> In some distributed filesystems, such as HDFS, we should be able to hint to
> Spark the preferred locations of splits.
> Here is an example of how Spark does that for Hadoop RDDs:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L249

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
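The offer-matching idea described above can be loosely illustrated as follows. This is a hypothetical sketch, not any real Beam, YARN or Mesos API: a runner holding per-split host preferences could match them against the hosts a resource manager currently offers, falling back to any offered host when no preference can be satisfied.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of locality-aware matching between a split's
// preferred hosts (as a source might report them) and the hosts a
// resource manager currently offers. All names are illustrative.
class LocalityMatcher {
  // Returns the first offered host that the split prefers, or any
  // offered host when no preference matches (locality is a hint,
  // never a hard constraint).
  static Optional<String> pickHost(List<String> preferredHosts, List<String> offeredHosts) {
    return preferredHosts.stream()
        .filter(offeredHosts::contains)
        .findFirst()
        .or(() -> offeredHosts.stream().findFirst());
  }
}
```

Falling back instead of failing mirrors what Spark does with `getPreferredLocations`: the preference improves placement when it can be honored, and is silently ignored otherwise.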
[jira] [Commented] (BEAM-673) Data locality for Read.Bounded
[ https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983950#comment-15983950 ]

Eugene Kirpichov commented on BEAM-673:
---

On second thought, this might be related to SDF: processing different restrictions of the same element may have different requirements. Or rather: a design for DoFns giving hints to runners about their resource requirements would need to include some data dependence. I don't have a good idea of how to express it in a way that is modular and combines well with the rest of the Beam model and the various tricks runners are allowed to do (such as fusion or materialization).
[jira] [Commented] (BEAM-673) Data locality for Read.Bounded
[ https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983364#comment-15983364 ]

Eugene Kirpichov commented on BEAM-673:
---

[~iemejia] No - the fact that a DoFn's efficiency depends on where it runs is not related to the fact that the DoFn is splittable; if one wanted to introduce locality into the Beam model, it would need to be introduced at the level of DoFn in general.
[jira] [Commented] (BEAM-673) Data locality for Read.Bounded
[ https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981810#comment-15981810 ]

Ismaël Mejía commented on BEAM-673:
---

[~jkff] Is the data locality case considered in the design of Splittable DoFn?
[jira] [Commented] (BEAM-673) Data locality for Read.Bounded
[ https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981809#comment-15981809 ]

Ismaël Mejía commented on BEAM-673:
---

However, rushing to add this could also produce unwanted side effects, so maybe you are right, Ahmet. In any case, if we add it before FSR we will mark it as @Experimental until it matures.
[jira] [Commented] (BEAM-673) Data locality for Read.Bounded
[ https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981796#comment-15981796 ]

Ismaël Mejía commented on BEAM-673:
---

Not really. The goal of adding this to FSR was to make the API changes before the Source API freezes for stability reasons, even if at first it could not be implemented at all, or only for one runner (Spark) and probably one source (HDFS). I think the only thing we need is to add a method that hints the locations for a source's splits, and this method can even have a default implementation returning an empty list, so runners would support it in an opt-in fashion.
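The default-empty, opt-in method proposed above could look roughly like the following. This is a minimal sketch under assumed names: `LocalityHintedSource` and `getSplitLocalityHints` are hypothetical and not part of the Beam Source API.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch (not the actual Beam API): a source may optionally
// report the hosts where a split's data lives, defaulting to "no preference".
interface LocalityHintedSource {
  // Hostnames holding the split's data, e.g. HDFS block locations.
  // The empty default keeps the change backwards-compatible: runners
  // that understand hints may use them, all others simply ignore them.
  default List<String> getSplitLocalityHints() {
    return Collections.emptyList();
  }
}

// An HDFS-like source would override the default with its block hosts.
class HdfsLikeSource implements LocalityHintedSource {
  private final List<String> blockHosts;

  HdfsLikeSource(List<String> blockHosts) {
    this.blockHosts = blockHosts;
  }

  @Override
  public List<String> getSplitLocalityHints() {
    return blockHosts;
  }
}
```

A runner such as the Spark runner could then feed these hosts into `RDD.getPreferredLocations`, while runners without a locality concept would never call the method.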