[jira] [Commented] (BEAM-673) Data locality for Read.Bounded

2017-04-26 Thread Ismaël Mejía (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985529#comment-15985529
 ] 

Ismaël Mejía commented on BEAM-673:
---

Oh, you have a point [~lcwik]; I hadn't thought about this as a general 
problem. Note that the case you mention is a common task for resource 
managers, e.g. YARN, Mesos or Kubernetes. They have the concept of 
resources/offers, and the underlying processing system (e.g. Hadoop, Spark, 
Flink) just tells them its preferences for allocating workers.

[~jkff] I agree with the data-dependency aspect you mention. In the case of 
this JIRA, the sources are the ones that know the information the runner 
needs to pass to the given resource manager, but it could also be a specific 
transform, e.g. an ML-specific transform could hint that it needs a GPU, as 
Luke mentioned.

This definitely deserves extra research and a more formal design to cover the 
more general scenario, so I am moving it out of the FSR (first stable release) 
list. I will also create a new JIRA for the more general case and leave this 
one for the particular case of data locality for the Spark runner.


> Data locality for Read.Bounded
> --
>
> Key: BEAM-673
> URL: https://issues.apache.org/jira/browse/BEAM-673
> Project: Beam
> Issue Type: Bug
> Components: runner-spark
> Reporter: Amit Sela
> Assignee: Ismaël Mejía
>
> In some distributed filesystems, such as HDFS, we should be able to hint to 
> Spark the preferred locations of splits.
> Here is an example of how Spark does that for Hadoop RDDs:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L249



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-673) Data locality for Read.Bounded

2017-04-25 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983950#comment-15983950
 ] 

Eugene Kirpichov commented on BEAM-673:
---

On second thought, this might be related to SDF: processing different 
restrictions of the same element may have different requirements.

Or rather: a design for DoFns giving hints to runners about their resource 
requirements would need to include some data dependence. I don't have a good 
idea of how to express it in a way that is modular and combines well with the 
rest of the Beam model and the various tricks runners are allowed to do (such 
as fusion or materialization).

> Data locality for Read.Bounded
> --
>
> Key: BEAM-673
> URL: https://issues.apache.org/jira/browse/BEAM-673
> Project: Beam
> Issue Type: Bug
> Components: runner-spark
> Reporter: Amit Sela
> Assignee: Ismaël Mejía
> Fix For: First stable release
>
>
> In some distributed filesystems, such as HDFS, we should be able to hint to 
> Spark the preferred locations of splits.
> Here is an example of how Spark does that for Hadoop RDDs:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L249


[jira] [Commented] (BEAM-673) Data locality for Read.Bounded

2017-04-25 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983364#comment-15983364
 ] 

Eugene Kirpichov commented on BEAM-673:
---

[~iemejia] No - the fact that a DoFn's efficiency depends on where it runs is 
not related to the fact that the DoFn is splittable; if one wanted to introduce 
locality into the Beam model, it'd need to be introduced at the level of DoFn 
in general.



[jira] [Commented] (BEAM-673) Data locality for Read.Bounded

2017-04-24 Thread Ismaël Mejía (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981810#comment-15981810
 ] 

Ismaël Mejía commented on BEAM-673:
---

[~jkff] Is the data locality case considered in the design of Splittable DoFn?



[jira] [Commented] (BEAM-673) Data locality for Read.Bounded

2017-04-24 Thread Ismaël Mejía (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981809#comment-15981809
 ] 

Ismaël Mejía commented on BEAM-673:
---

However, rushing to add this could also have undesirable side effects, so 
maybe you are right, Ahmet. In any case, if we add it before FSR we will mark 
it as @Experimental until it matures.



[jira] [Commented] (BEAM-673) Data locality for Read.Bounded

2017-04-24 Thread Ismaël Mejía (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981796#comment-15981796
 ] 

Ismaël Mejía commented on BEAM-673:
---

Not really. The goal of adding this to FSR was to make the API changes before 
the Source API freezes for stability, even if at first it is not implemented 
at all, or only for one runner (Spark) and probably one source (HDFS). I think 
the only thing we need is to add a method that hints the location for sources, 
and this method can even have a default implementation returning an empty 
list, so runners would support it in an opt-in fashion.
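
A minimal sketch of how such an opt-in hint could look, assuming a 
hypothetical interface rather than Beam's actual Source API. The names 
LocalityHintingSource, getPreferredLocations, InMemorySource, FileSplitSource 
and describePlacement are invented for illustration, and the host names are 
made up:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a bounded source; NOT Beam's real Source API. */
interface LocalityHintingSource {
    /** Hosts where this split's data lives; an empty list means "no preference". */
    default List<String> getPreferredLocations() {
        return Collections.emptyList();
    }
}

/** A source with no locality information: it simply inherits the default. */
final class InMemorySource implements LocalityHintingSource {}

/** A source over a file split that knows which hosts hold its blocks. */
final class FileSplitSource implements LocalityHintingSource {
    private final List<String> hosts;

    FileSplitSource(List<String> hosts) {
        // Defensive copy so later changes by the caller do not leak in.
        this.hosts = Collections.unmodifiableList(new ArrayList<>(hosts));
    }

    @Override
    public List<String> getPreferredLocations() {
        return hosts;
    }
}

public class LocalityHintSketch {
    /** What an opt-in runner might do: use the hint if present, else place anywhere. */
    static String describePlacement(LocalityHintingSource source) {
        List<String> hosts = source.getPreferredLocations();
        return hosts.isEmpty() ? "any worker" : "prefer " + String.join(", ", hosts);
    }

    public static void main(String[] args) {
        System.out.println(describePlacement(new InMemorySource()));
        System.out.println(
            describePlacement(new FileSplitSource(Arrays.asList("node-1", "node-2"))));
    }
}
```

Because the default returns an empty list, existing sources would keep 
compiling unchanged, and a runner that ignores the method behaves exactly as 
before.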
