[jira] [Work logged] (BEAM-2873) Detect number of shards for file sink in Flink Streaming Runner

2018-06-19 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-2873?focusedWorklogId=113456&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-113456
 ]

ASF GitHub Bot logged work on BEAM-2873:


Author: ASF GitHub Bot
Created on: 20/Jun/18 00:06
Start Date: 20/Jun/18 00:06
Worklog Time Spent: 10m 
  Work Description: stale[bot] commented on issue #4760: [BEAM-2873] 
Setting number of shards for writes with runner determined sharding
URL: https://github.com/apache/beam/pull/4760#issuecomment-398583237
 
 
   This pull request has been closed due to lack of activity. If you think that 
is incorrect, or the pull request requires review, you can revive the PR at any 
time.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 113456)
Time Spent: 3.5h  (was: 3h 20m)

> Detect number of shards for file sink in Flink Streaming Runner
> ---
>
> Key: BEAM-2873
> URL: https://issues.apache.org/jira/browse/BEAM-2873
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Dawid Wysakowicz
>Priority: Major
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> [~reuvenlax] mentioned that this is already done for the Dataflow Runner and 
> that the default behaviour on Flink can be somewhat surprising for users.
> ML entry: https://www.mail-archive.com/dev@beam.apache.org/msg02665.html:
> This is how the file sink has always worked in Beam. If no sharding is 
> specified, then this means runner-determined sharding, and by default that is 
> one file per bundle. If Flink has small bundles, then I suggest using the 
> withNumShards method to explicitly pick the number of output shards.
> The Flink runner can detect that runner-determined sharding has been chosen 
> and override it with a specific number of shards. For example, the Dataflow 
> streaming runner (which, as you mentioned, also has small bundles) detects this 
> case and sets the number of output file shards based on the number of workers 
> in the worker pool. 
> [Here|https://github.com/apache/beam/blob/9e6530adb00669b7cf0f01cb8b128be0a21fd721/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L354]
>  is the code that does this; it should be quite simple to do something 
> similar for Flink, and then there will be no need for users to explicitly 
> call withNumShards themselves.
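
A minimal sketch of the explicit-sharding workaround suggested above, assuming Beam's Java SDK and a windowed TextIO write; the paths, window size, and shard count are illustrative:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class ExplicitShardingExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(TextIO.read().from("/tmp/input/*"))
        // Windowed writes are needed for streaming pipelines.
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
        .apply(TextIO.write()
            .to("/tmp/output/part")
            .withWindowedWrites()
            // Explicit sharding instead of the runner-determined default
            // of one file per bundle.
            .withNumShards(4));
    p.run();
  }
}

Once the Flink runner overrides runner-determined sharding itself, the withNumShards call above would no longer be necessary.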



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-2873) Detect number of shards for file sink in Flink Streaming Runner

2018-06-19 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-2873?focusedWorklogId=113457&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-113457
 ]

ASF GitHub Bot logged work on BEAM-2873:


Author: ASF GitHub Bot
Created on: 20/Jun/18 00:06
Start Date: 20/Jun/18 00:06
Worklog Time Spent: 10m 
  Work Description: stale[bot] closed pull request #4760: [BEAM-2873] 
Setting number of shards for writes with runner determined sharding
URL: https://github.com/apache/beam/pull/4760
 
 
   

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineExecutionEnvironment.java b/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineExecutionEnvironment.java
index 7f7281e14bd..d9466fa2b76 100644
--- a/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineExecutionEnvironment.java
+++ b/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineExecutionEnvironment.java
@@ -99,9 +99,6 @@ public void translate(FlinkRunner flinkRunner, Pipeline pipeline) {
     optimizer.translate(pipeline);
     TranslationMode translationMode = optimizer.getTranslationMode();

-    pipeline.replaceAll(FlinkTransformOverrides.getDefaultOverrides(
-        translationMode == TranslationMode.STREAMING));
-
     FlinkPipelineTranslator translator;
     if (translationMode == TranslationMode.STREAMING) {
       this.flinkStreamEnv = createStreamExecutionEnvironment();
@@ -111,6 +108,9 @@ public void translate(FlinkRunner flinkRunner, Pipeline pipeline) {
       translator = new FlinkBatchPipelineTranslator(flinkBatchEnv, options);
     }

+    pipeline.replaceAll(FlinkTransformOverrides.getDefaultOverrides(
+        translationMode == TranslationMode.STREAMING, options));
+
     translator.translate(pipeline);
   }

@@ -164,8 +164,14 @@ private ExecutionEnvironment createBatchExecutionEnvironment() {
       flinkBatchEnv.setParallelism(options.getParallelism());
     }

+    // set the correct max parallelism.
+    if (options.getMaxParallelism() != -1) {
+      flinkBatchEnv.getConfig().setMaxParallelism(options.getMaxParallelism());
+    }
+
     // set parallelism in the options (required by some execution code)
     options.setParallelism(flinkBatchEnv.getParallelism());
+    options.setMaxParallelism(flinkBatchEnv.getConfig().getMaxParallelism());

     if (options.getObjectReuse()) {
       flinkBatchEnv.getConfig().enableObjectReuse();
@@ -208,8 +214,14 @@ private StreamExecutionEnvironment createStreamExecutionEnvironment() {
       flinkStreamEnv.setParallelism(options.getParallelism());
     }

+    // set the correct max parallelism.
+    if (options.getMaxParallelism() != -1) {
+      flinkStreamEnv.setMaxParallelism(options.getMaxParallelism());
+    }
+
     // set parallelism in the options (required by some execution code)
     options.setParallelism(flinkStreamEnv.getParallelism());
+    options.setMaxParallelism(flinkStreamEnv.getMaxParallelism());

     if (options.getObjectReuse()) {
       flinkStreamEnv.getConfig().enableObjectReuse();
diff --git a/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineOptions.java b/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineOptions.java
index b2cbefbc5b0..908a08fdece 100644
--- a/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineOptions.java
+++ b/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineOptions.java
@@ -64,6 +64,12 @@
   Integer getParallelism();
   void setParallelism(Integer value);

+  @Description("The maximal degree of parallelism to be used when distributing operations "
+      + "onto workers.")
+  @Default.Integer(-1)
+  Integer getMaxParallelism();
+  void setMaxParallelism(Integer value);
+
   @Description("The interval between consecutive checkpoints (i.e. snapshots of the current"
       + "pipeline state used for fault tolerance).")
   @Default.Long(-1L)
diff --git a/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingPipelineTranslator.java b/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingPipelineTranslator.java
index 2e16ed9966c..d5359be834c 100644
--- a/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingPipelineTranslator.java
+++ b/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingPipelineTranslator.java
@@ -17,13 +17,20 @@
  */
 package org.apache.beam.runners.flink;
 
+import com.google.common.annotations.VisibleForTesting;
+import java.util.Collections;
+import java.util.List;
 import java.util.Map
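
For context, the maxParallelism option introduced in this diff would be configured like any other pipeline option. A minimal sketch, assuming the FlinkPipelineOptions additions from this (unmerged) PR; the concrete values are illustrative:

import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class FlinkShardingOptionsExample {
  public static void main(String[] args) {
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
    options.setRunner(FlinkRunner.class);
    options.setParallelism(8);
    // Added by this PR; the -1 default means "keep Flink's own default".
    options.setMaxParallelism(32);
    // ... construct the pipeline with these options and run it ...
  }
}

PipelineOptionsFactory would also accept the same setting as a --maxParallelism=32 command-line flag, since flag names are derived from the getter names.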

[jira] [Work logged] (BEAM-2873) Detect number of shards for file sink in Flink Streaming Runner

2018-06-12 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-2873?focusedWorklogId=111342&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-111342
 ]

ASF GitHub Bot logged work on BEAM-2873:


Author: ASF GitHub Bot
Created on: 12/Jun/18 23:39
Start Date: 12/Jun/18 23:39
Worklog Time Spent: 10m 
  Work Description: stale[bot] commented on issue #4760: [BEAM-2873] 
Setting number of shards for writes with runner determined sharding
URL: https://github.com/apache/beam/pull/4760#issuecomment-396767963
 
 
   This pull request has been marked as stale due to 60 days of inactivity. It 
will be closed in 1 week if no further activity occurs. If you think that’s 
incorrect or this pull request requires a review, please simply write any 
comment. If closed, you can revive the PR at any time and @mention a reviewer 
or discuss it on the d...@beam.apache.org list. Thank you for your 
contributions.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 111342)
Time Spent: 3h 20m  (was: 3h 10m)

> Detect number of shards for file sink in Flink Streaming Runner
> ---
>
> Key: BEAM-2873
> URL: https://issues.apache.org/jira/browse/BEAM-2873
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Dawid Wysakowicz
>Priority: Major
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-2873) Detect number of shards for file sink in Flink Streaming Runner

2018-04-13 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2873?focusedWorklogId=91007&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-91007
 ]

ASF GitHub Bot logged work on BEAM-2873:


Author: ASF GitHub Bot
Created on: 13/Apr/18 22:57
Start Date: 13/Apr/18 22:57
Worklog Time Spent: 10m 
  Work Description: robertwb commented on issue #4760: [BEAM-2873] Setting 
number of shards for writes with runner determined sharding
URL: https://github.com/apache/beam/pull/4760#issuecomment-381279762
 
 
   Jenkins: retest this please.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 91007)
Time Spent: 3h 10m  (was: 3h)

> Detect number of shards for file sink in Flink Streaming Runner
> ---
>
> Key: BEAM-2873
> URL: https://issues.apache.org/jira/browse/BEAM-2873
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Dawid Wysakowicz
>Priority: Major
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-2873) Detect number of shards for file sink in Flink Streaming Runner

2018-03-27 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2873?focusedWorklogId=84952&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-84952
 ]

ASF GitHub Bot logged work on BEAM-2873:


Author: ASF GitHub Bot
Created on: 27/Mar/18 17:41
Start Date: 27/Mar/18 17:41
Worklog Time Spent: 10m 
  Work Description: dawidwys commented on issue #4760: [BEAM-2873] Setting 
number of shards for writes with runner determined sharding
URL: https://github.com/apache/beam/pull/4760#issuecomment-376612564
 
 
   @aljoscha any comments regarding this PR? 
   
   Unfortunately I was not able to get this change to pass the tests, but each 
time the failures looked unrelated to these changes. I would appreciate any 
help or a double check.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 84952)
Time Spent: 3h  (was: 2h 50m)

> Detect number of shards for file sink in Flink Streaming Runner
> ---
>
> Key: BEAM-2873
> URL: https://issues.apache.org/jira/browse/BEAM-2873
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Dawid Wysakowicz
>Priority: Major
>  Time Spent: 3h
>  Remaining Estimate: 0h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-2873) Detect number of shards for file sink in Flink Streaming Runner

2018-03-27 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2873?focusedWorklogId=84777&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-84777
 ]

ASF GitHub Bot logged work on BEAM-2873:


Author: ASF GitHub Bot
Created on: 27/Mar/18 08:13
Start Date: 27/Mar/18 08:13
Worklog Time Spent: 10m 
  Work Description: dawidwys commented on issue #4760: [BEAM-2873] Setting 
number of shards for writes with runner determined sharding
URL: https://github.com/apache/beam/pull/4760#issuecomment-376436206
 
 
   retest this please


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 84777)
Time Spent: 2h 50m  (was: 2h 40m)

> Detect number of shards for file sink in Flink Streaming Runner
> ---
>
> Key: BEAM-2873
> URL: https://issues.apache.org/jira/browse/BEAM-2873
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Dawid Wysakowicz
>Priority: Major
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-2873) Detect number of shards for file sink in Flink Streaming Runner

2018-03-22 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2873?focusedWorklogId=83112&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-83112
 ]

ASF GitHub Bot logged work on BEAM-2873:


Author: ASF GitHub Bot
Created on: 22/Mar/18 09:17
Start Date: 22/Mar/18 09:17
Worklog Time Spent: 10m 
  Work Description: dawidwys commented on issue #4760: [BEAM-2873] Setting 
number of shards for writes with runner determined sharding
URL: https://github.com/apache/beam/pull/4760#issuecomment-375228820
 
 
   retest this please


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 83112)
Time Spent: 2h 40m  (was: 2.5h)

> Detect number of shards for file sink in Flink Streaming Runner
> ---
>
> Key: BEAM-2873
> URL: https://issues.apache.org/jira/browse/BEAM-2873
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Dawid Wysakowicz
>Priority: Major
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-2873) Detect number of shards for file sink in Flink Streaming Runner

2018-03-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2873?focusedWorklogId=81934&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-81934
 ]

ASF GitHub Bot logged work on BEAM-2873:


Author: ASF GitHub Bot
Created on: 19/Mar/18 17:29
Start Date: 19/Mar/18 17:29
Worklog Time Spent: 10m 
  Work Description: aljoscha commented on issue #4760: [BEAM-2873] Setting 
number of shards for writes with runner determined sharding
URL: https://github.com/apache/beam/pull/4760#issuecomment-374297803
 
 
   Sorry for coming up with yet more stuff, but: I think we have to add 
something similar to `maxNumWorkers()` to Flink. Flink has the `maxParallelism` 
concept so we should just expose that in the options and use it also for 
determining the shards, as the Dataflow Runner does.
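
A minimal sketch of that heuristic, assuming the maxParallelism option this PR adds to FlinkPipelineOptions; the helper name and the fallback rule are illustrative, not the PR's actual code:

import org.apache.beam.runners.flink.FlinkPipelineOptions;

class ShardingDefaults {
  // Hypothetical helper: pick a shard count when the user left sharding
  // runner-determined, mirroring how the Dataflow streaming runner derives
  // it from the size of its worker pool.
  static int runnerDeterminedNumShards(FlinkPipelineOptions options) {
    int maxParallelism = options.getMaxParallelism();  // -1 when unset (per this PR)
    if (maxParallelism > 0) {
      return maxParallelism;
    }
    // Fall back to the configured parallelism; -1 here means "cluster default".
    return Math.max(1, options.getParallelism());
  }
}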


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 81934)
Time Spent: 2.5h  (was: 2h 20m)

> Detect number of shards for file sink in Flink Streaming Runner
> ---
>
> Key: BEAM-2873
> URL: https://issues.apache.org/jira/browse/BEAM-2873
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Dawid Wysakowicz
>Priority: Major
>  Time Spent: 2.5h
>  Remaining Estimate: 0h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-2873) Detect number of shards for file sink in Flink Streaming Runner

2018-03-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2873?focusedWorklogId=81850&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-81850
 ]

ASF GitHub Bot logged work on BEAM-2873:


Author: ASF GitHub Bot
Created on: 19/Mar/18 14:25
Start Date: 19/Mar/18 14:25
Worklog Time Spent: 10m 
  Work Description: dawidwys commented on issue #4760: [BEAM-2873] Setting 
number of shards for writes with runner determined sharding
URL: https://github.com/apache/beam/pull/4760#issuecomment-374230231
 
 
   @aljoscha I've updated the shard handling for the default cluster parallelism. 
I think the test failures are unrelated to these changes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 81850)
Time Spent: 2h 20m  (was: 2h 10m)

> Detect number of shards for file sink in Flink Streaming Runner
> ---
>
> Key: BEAM-2873
> URL: https://issues.apache.org/jira/browse/BEAM-2873
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Dawid Wysakowicz
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-2873) Detect number of shards for file sink in Flink Streaming Runner

2018-03-16 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2873?focusedWorklogId=81119&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-81119
 ]

ASF GitHub Bot logged work on BEAM-2873:


Author: ASF GitHub Bot
Created on: 16/Mar/18 09:17
Start Date: 16/Mar/18 09:17
Worklog Time Spent: 10m 
  Work Description: dawidwys commented on issue #4760: [BEAM-2873] Setting 
number of shards for writes with runner determined sharding
URL: https://github.com/apache/beam/pull/4760#issuecomment-373651077
 
 
   Run Flink ValidatesRunner


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 81119)
Time Spent: 2h 10m  (was: 2h)

> Detect number of shards for file sink in Flink Streaming Runner
> ---
>
> Key: BEAM-2873
> URL: https://issues.apache.org/jira/browse/BEAM-2873
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Dawid Wysakowicz
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-2873) Detect number of shards for file sink in Flink Streaming Runner

2018-03-09 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2873?focusedWorklogId=78873&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-78873
 ]

ASF GitHub Bot logged work on BEAM-2873:


Author: ASF GitHub Bot
Created on: 09/Mar/18 10:51
Start Date: 09/Mar/18 10:51
Worklog Time Spent: 10m 
  Work Description: aljoscha commented on issue #4760: [BEAM-2873] Setting 
number of shards for writes with runner determined sharding
URL: https://github.com/apache/beam/pull/4760#issuecomment-371779916
 
 
   @dawidwys Yes, that is the problem. I would add a special case for -1 and 
set it to some default but log a warning.
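
A minimal sketch of that special case, assuming an SLF4J logger; the fallback constant, method name, and message are illustrative:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class NumShardsFallback {
  private static final Logger LOG = LoggerFactory.getLogger(NumShardsFallback.class);
  // Illustrative fallback; the PR would choose its own default.
  private static final int DEFAULT_NUM_SHARDS = 1;

  static int resolveNumShards(int parallelism) {
    if (parallelism == -1) {
      // Parallelism was left at the "cluster default" sentinel, so it is not
      // known at translation time: use a fixed default and warn, as suggested.
      LOG.warn(
          "Parallelism is not set explicitly; defaulting to {} output shard(s). "
              + "Set --parallelism or call withNumShards() to control sharding.",
          DEFAULT_NUM_SHARDS);
      return DEFAULT_NUM_SHARDS;
    }
    return parallelism;
  }
}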


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 78873)
Time Spent: 2h  (was: 1h 50m)

> Detect number of shards for file sink in Flink Streaming Runner
> ---
>
> Key: BEAM-2873
> URL: https://issues.apache.org/jira/browse/BEAM-2873
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Dawid Wysakowicz
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)