[jira] [Commented] (BEAM-1723) FlinkRunner should deduplicate when an UnboundedSource requires Deduping

2017-05-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002004#comment-16002004
 ] 

ASF GitHub Bot commented on BEAM-1723:
--

Github user aljoscha closed the pull request at:

https://github.com/apache/beam/pull/2959


> FlinkRunner should deduplicate when an UnboundedSource requires Deduping
> 
>
> Key: BEAM-1723
> URL: https://issues.apache.org/jira/browse/BEAM-1723
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Reporter: Thomas Groh
>Assignee: Aljoscha Krettek
> Fix For: 2.0.0
>
>
> UnboundedSource implementations can require deduping, and the FlinkRunner 
> currently logs a warning that this is not supported.
> https://github.com/apache/beam/blob/master/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/UnboundedSourceWrapper.java#L139



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1723) FlinkRunner should deduplicate when an UnboundedSource requires Deduping

2017-05-08 Thread Aljoscha Krettek (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001431#comment-16001431
 ] 

Aljoscha Krettek commented on BEAM-1723:


That sounds good! I opened a PR: https://github.com/apache/beam/pull/2959






[jira] [Commented] (BEAM-1723) FlinkRunner should deduplicate when an UnboundedSource requires Deduping

2017-05-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001428#comment-16001428
 ] 

ASF GitHub Bot commented on BEAM-1723:
--

GitHub user aljoscha opened a pull request:

https://github.com/apache/beam/pull/2959

[BEAM-1723] deduplication of UnboundedSource in Flink runner



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aljoscha/beam cherry-pick-1723

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/2959.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2959











[jira] [Commented] (BEAM-1723) FlinkRunner should deduplicate when an UnboundedSource requires Deduping

2017-05-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001239#comment-16001239
 ] 

ASF GitHub Bot commented on BEAM-1723:
--

Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/2476


> FlinkRunner should deduplicate when an UnboundedSource requires Deduping
> 
>
> Key: BEAM-1723
> URL: https://issues.apache.org/jira/browse/BEAM-1723
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Reporter: Thomas Groh
>Assignee: Jingsong Lee
> Fix For: 2.0.0
>
>
> UnboundedSource implementations can require deduping, and the FlinkRunner 
> currently logs a warning that this is not supported.
> https://github.com/apache/beam/blob/master/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/UnboundedSourceWrapper.java#L139





[jira] [Commented] (BEAM-1723) FlinkRunner should deduplicate when an UnboundedSource requires Deduping

2017-04-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15962136#comment-15962136
 ] 

ASF GitHub Bot commented on BEAM-1723:
--

GitHub user JingsongLi opened a pull request:

https://github.com/apache/beam/pull/2476

[BEAM-1723] deduplication of UnboundedSource in Flink runner

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [ ] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [ ] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.pdf).

---


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/JingsongLi/beam BEAM-1723

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/2476.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2476


commit 2dc6ea51bb60f8f4beb397bc7179e1829be30d77
Author: JingsongLi 
Date:   2017-04-09T13:16:50Z

[BEAM-1723] deduplication of UnboundedSource in Flink runner




> FlinkRunner should deduplicate when an UnboundedSource requires Deduping
> 
>
> Key: BEAM-1723
> URL: https://issues.apache.org/jira/browse/BEAM-1723
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Reporter: Thomas Groh
>Assignee: Jingsong Lee
>
> UnboundedSource implementations can require deduping, and the FlinkRunner 
> currently logs a warning that this is not supported.
> https://github.com/apache/beam/blob/master/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/UnboundedSourceWrapper.java#L139





[jira] [Commented] (BEAM-1723) FlinkRunner should deduplicate when an UnboundedSource requires Deduping

2017-04-09 Thread Jingsong Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15962125#comment-15962125
 ] 

Jingsong Lee commented on BEAM-1723:


I think this needs to be configurable, because the deduplication window is 
related to the checkpoint interval.






[jira] [Commented] (BEAM-1723) FlinkRunner should deduplicate when an UnboundedSource requires Deduping

2017-04-09 Thread Jingsong Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15962077#comment-15962077
 ] 

Jingsong Lee commented on BEAM-1723:


I understand. The reason for the duplication is that {{PubSubIO}} uses a 
pull-ack model, so {{acknowledge()}} in {{finalizeCheckpoint()}} may fail, 
while Kafka uses offsets to restore.
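
To make the contrast concrete, here is a toy sketch of the two restore models. It does not use the Beam, Pubsub, or Kafka APIs; the class {{RestoreModels}} and both method names are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical toy model of the two restore styles; not Beam, Pubsub, or Kafka code. */
class RestoreModels {

  /** Offset model: the checkpoint stores the next offset, so a restore resumes exactly there. */
  static List<String> readFromOffset(List<String> log, int checkpointedOffset) {
    return new ArrayList<>(log.subList(checkpointedOffset, log.size()));
  }

  /** Pull-ack model: the service redelivers every message it never received an ACK for. */
  static List<String> redeliverUnacked(List<String> delivered, Set<String> acked) {
    List<String> redelivered = new ArrayList<>();
    for (String id : delivered) {
      if (!acked.contains(id)) {
        redelivered.add(id);
      }
    }
    return redelivered;
  }

  public static void main(String[] args) {
    // Offset model: records before the checkpointed offset are never read again.
    System.out.println(readFromOffset(Arrays.asList("a", "b", "c"), 2));

    // Pull-ack model: "b" was processed, but its ACK was lost, so it comes back.
    Set<String> acked = new HashSet<>(Arrays.asList("a"));
    System.out.println(redeliverUnacked(Arrays.asList("a", "b"), acked));
  }
}
```

In the offset model the position is part of the checkpoint itself, so nothing depends on a later acknowledgement. In the pull-ack model a lost ACK is enough to bring an already processed message back.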






[jira] [Commented] (BEAM-1723) FlinkRunner should deduplicate when an UnboundedSource requires Deduping

2017-04-06 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959390#comment-15959390
 ] 

Kenneth Knowles commented on BEAM-1723:
---

The caches do need to be fault-tolerant or you'll get dupes.

Having no configuration is simplest, but it is hard to say whether that is 
enough. I think there could be some discussion here. The deduplication window 
is really about the potential for re-delivery of a message; it is not like 
allowed lateness at all.

For example, in {{PubsubIO}} duplicates occur when output is committed but 
{{finalizeCheckpoint}} does not succeed at ACKing all messages. Then Pubsub 
will redeliver the message.
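
A minimal sketch of why the set of seen IDs has to survive a failure. This is not the Flink runner's implementation. The class {{SnapshottingDeduplicator}} and its methods are hypothetical, and the job restart between the failed ACK and the redelivery is an assumption added to show the fault-tolerance point.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical dedup operator whose set of seen record IDs can be checkpointed and restored. */
class SnapshottingDeduplicator {
  private Set<String> seenIds = new HashSet<>();

  /** Returns true if the record should be emitted, false if its ID was already seen. */
  boolean shouldEmit(String recordId) {
    return seenIds.add(recordId);
  }

  /** Copies the seen IDs so they can be stored as part of a checkpoint. */
  Set<String> snapshot() {
    return new HashSet<>(seenIds);
  }

  /** Replaces the in-memory state with the IDs stored in a checkpoint. */
  void restore(Set<String> checkpointedIds) {
    seenIds = new HashSet<>(checkpointedIds);
  }

  /** Simulates emit, checkpoint, failed ACK, restart, and redelivery; returns the emitted IDs. */
  static List<String> run(boolean restoreSeenIds) {
    List<String> emitted = new ArrayList<>();
    SnapshottingDeduplicator dedup = new SnapshottingDeduplicator();

    if (dedup.shouldEmit("msg-1")) {
      emitted.add("msg-1");
    }
    Set<String> checkpoint = dedup.snapshot(); // output is committed with this checkpoint

    // Assumed failure: the ACK for msg-1 is lost and the job restarts with fresh operators.
    dedup = new SnapshottingDeduplicator();
    if (restoreSeenIds) {
      dedup.restore(checkpoint);
    }

    // The source redelivers msg-1 because it was never acknowledged.
    if (dedup.shouldEmit("msg-1")) {
      emitted.add("msg-1");
    }
    return emitted;
  }

  public static void main(String[] args) {
    System.out.println("with restored IDs:    " + run(true));
    System.out.println("without restored IDs: " + run(false));
  }
}
```

With the seen IDs restored from the checkpoint, the redelivered message is dropped. With a purely in-memory cache that starts empty after the restart, the same message is emitted twice.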






[jira] [Commented] (BEAM-1723) FlinkRunner should deduplicate when an UnboundedSource requires Deduping

2017-04-06 Thread Jingsong Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959226#comment-15959226
 ] 

Jingsong Lee commented on BEAM-1723:


I see {{CachedIdDeduplicator}} in the direct runner. It uses a {{LoadingCache}} 
to dedup. The expireAfterAccess is 10 minutes and the maximumSize is 100_000. 
Do these two values need to be parameterized?

Do these caches need to be snapshotted in the Flink runner? (Fault tolerance)
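
For reference, a minimal sketch of a bounded, expiring ID cache with the two knobs mentioned above made configurable. It does not use Guava's {{LoadingCache}} and does not reproduce {{CachedIdDeduplicator}}; it is a stdlib-only stand-in built on an access-ordered {{LinkedHashMap}}. The class name {{ExpiringIdSet}}, its constructor parameters, and the injected clock are assumptions for illustration.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical dedup cache: remembers record IDs with a size bound and an idle timeout. */
class ExpiringIdSet {
  private final long expireAfterAccessMillis;
  private final int maximumSize;
  private final LongSupplier clock; // injected so expiry can be exercised without sleeping

  // accessOrder=true keeps the least recently accessed ID at the head of the map.
  private final LinkedHashMap<String, Long> lastAccess = new LinkedHashMap<>(16, 0.75f, true);

  ExpiringIdSet(long expireAfterAccessMillis, int maximumSize, LongSupplier clock) {
    this.expireAfterAccessMillis = expireAfterAccessMillis;
    this.maximumSize = maximumSize;
    this.clock = clock;
  }

  /** Returns true if the ID is new (emit the record), false if it is a duplicate (drop it). */
  boolean firstTimeSeen(String id) {
    long now = clock.getAsLong();
    evictExpired(now);
    boolean isNew = !lastAccess.containsKey(id);
    lastAccess.put(id, now); // refreshes both the timestamp and the access order
    if (lastAccess.size() > maximumSize) {
      // Over capacity: drop the least recently accessed ID.
      Iterator<String> oldest = lastAccess.keySet().iterator();
      oldest.next();
      oldest.remove();
    }
    return isNew;
  }

  /** Removes IDs that have been idle for at least the timeout; assumes a non-decreasing clock. */
  private void evictExpired(long now) {
    Iterator<Map.Entry<String, Long>> it = lastAccess.entrySet().iterator();
    while (it.hasNext()) {
      if (now - it.next().getValue() >= expireAfterAccessMillis) {
        it.remove();
      } else {
        break; // entries are in access order, so all remaining ones are newer
      }
    }
  }

  public static void main(String[] args) {
    // The values quoted above: 10 minutes idle timeout, at most 100_000 IDs.
    ExpiringIdSet ids = new ExpiringIdSet(10 * 60 * 1000L, 100_000, System::currentTimeMillis);
    System.out.println(ids.firstTimeSeen("record-1"));
    System.out.println(ids.firstTimeSeen("record-1"));
  }
}
```

Note that this sketch keeps its state only in memory, so it says nothing about the snapshotting question.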

> FlinkRunner should deduplicate when an UnboundedSource requires Deduping
> 
>
> Key: BEAM-1723
> URL: https://issues.apache.org/jira/browse/BEAM-1723
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Reporter: Thomas Groh
>
> UnboundedSource implementations can require deduping, and the FlinkRunner 
> currently logs a warning that this is not supported.
> https://github.com/apache/beam/blob/master/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/UnboundedSourceWrapper.java#L139


