By setting
properties.setProperty("batch.size", "10240000");
properties.setProperty("linger.ms", "10000");
in the properties passed to FlinkKafkaProducer010 (to postpone automatic
flushing), and then killing (kill -9 PID) the YarnTaskManager process in the
middle of a running Flink job. Records that had not yet been flushed
automatically were lost (this at-least-once bug was about the missing
manual/explicit flush on a checkpoint).
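
Roughly, the setup looked like the sketch below (a minimal sketch only, assuming
the Flink 1.3 Kafka 0.10 connector API; the broker address, topic name,
checkpoint interval, class name and source are placeholders, not the exact test
job):

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class AtLeastOnceRepro {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000); // checkpoints are where the missing flush should happen

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "broker:9092"); // placeholder address
        // Oversized batch and long linger so the Kafka client never flushes on its own:
        properties.setProperty("batch.size", "10240000");
        properties.setProperty("linger.ms", "10000");

        // Stand-in for the real, continuously producing test source:
        DataStream<String> stream = env.fromElements("record-1", "record-2", "record-3");

        stream.addSink(new FlinkKafkaProducer010<String>(
                "test-topic",               // placeholder topic
                new SimpleStringSchema(),
                properties));

        env.execute("at-least-once reproduction sketch");
    }
}

The oversized batch.size and linger.ms only serve to keep records sitting in the
client-side buffers, so that killing the TaskManager before they are flushed
makes the loss visible.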
Piotrek
> On Jul 26, 2017, at 1:19 PM, Aljoscha Krettek <[email protected]> wrote:
>
> Sweet (maybe?)! How did you reproduce data-loss?
>
> Best,
> Aljoscha
>
>> On 26. Jul 2017, at 11:13, Piotr Nowojski <[email protected]> wrote:
>>
>> It took me longer than I expected, but I was able to reproduce data loss with
>> older Flink versions while running Flink on a 3-node cluster. I have also
>> validated that the at-least-once semantic is fixed for Kafka 0.10 in Flink
>> 1.3-SNAPSHOT.
>>
>> Piotrek
>>
>>> On Jul 20, 2017, at 4:52 PM, Stephan Ewen <[email protected]> wrote:
>>>
>>> Thank you very much for driving this!
>>>
>>> On Thu, Jul 20, 2017 at 9:09 AM, Piotr Nowojski <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Regarding the Kafka at-least-once bug: I could try to play with Flink 1.3.1
>>>> on a real cluster to provoke this bug, basically by repeating
>>>> KafkaProducerTestBase#testOneToOneAtLeastOnce on a larger scale.
>>>>
>>>> Piotrek
>>>>
>>>>> On Jul 19, 2017, at 5:26 PM, Aljoscha Krettek <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Yes! In my opinion, the most critical issues are these:
>>>>>
>>>>> - https://issues.apache.org/jira/browse/FLINK-6964: Fix recovery for
>>>>> incremental checkpoints in StandaloneCompletedCheckpointStore
>>>>> - https://issues.apache.org/jira/browse/FLINK-7041: Deserialize
>>>>> StateBackend from JobCheckpointingSettings with user classloader
>>>>>
>>>>> The first one makes incremental checkpoints on RocksDB unusable with
>>>>> externalised checkpoints. The latter means that you cannot have custom
>>>>> configuration of the RocksDB backend.
>>>>>
>>>>> - https://issues.apache.org/jira/browse/FLINK-7216: ExecutionGraph can
>>>>> perform concurrent global restarts to scheduling
>>>>> - https://issues.apache.org/jira/browse/FLINK-7153: Eager Scheduling can't
>>>>> allocate source for ExecutionGraph correctly
>>>>>
>>>>> These are critical scheduler bugs; Stephan can probably say more about
>>>>> them than I can.
>>>>>
>>>>> - https://issues.apache.org/jira/browse/FLINK-7143: Partition assignment
>>>>> for Kafka consumer is not stable
>>>>> - https://issues.apache.org/jira/browse/FLINK-7195: FlinkKafkaConsumer
>>>>> should not respect fetched partitions to filter restored partition states
>>>>> - https://issues.apache.org/jira/browse/FLINK-6996: FlinkKafkaProducer010
>>>>> doesn't guarantee at-least-once semantic
>>>>>
>>>>> The first one means that you can have duplicate data because several
>>>>> consumers would be consuming from one partition, without noticing it. The
>>>>> second one causes partitions to be dropped from state if a broker is
>>>>> temporarily not reachable.
>>>>>
>>>>> The first two issues would have been caught by my proposed testing
>>>>> procedures; the third and fourth might be caught but are very tricky to
>>>>> provoke. I’m currently experimenting with this testing procedure on Flink
>>>>> 1.3.1 to see if I can provoke it.
>>>>>
>>>>> The Kafka bugs are super hard to provoke because they only occur if Kafka
>>>>> has some temporary problems or there are communication problems.
>>>>>
>>>>> I forgot to mention that I actually have two goals with this: 1) thoroughly
>>>>> test Flink and 2) build expertise in the community, i.e. we’re forced to try
>>>>> cluster environments/distributions that we are not familiar with, and we
>>>>> actually deploy a full job and play around with it.
>>>>>
>>>>> Best,
>>>>> Aljoscha
>>>>>
>>>>>
>>>>>> On 19. Jul 2017, at 15:49, Shaoxuan Wang <[email protected]> wrote:
>>>>>>
>>>>>> Hi Aljoscha,
>>>>>> Glad to see that we have a more thorough testing procedure. Could you
>>>>>> please share with us what was broken in 1.3.0 & 1.3.1 (the critical issues
>>>>>> you mentioned), and how the newly proposed "functional testing section and
>>>>>> a combination of systems/configurations" can cover this? This will help us
>>>>>> improve our production verification as well.
>>>>>>
>>>>>> Regards,
>>>>>> Shaoxuan
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 19, 2017 at 9:11 PM, Aljoscha Krettek <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> We are on the verge of starting the release process for Flink 1.3.2 and
>>>>>>> there have been some critical issues in both Flink 1.3.0 and 1.3.1. For
>>>>>>> Flink 1.3.2 I want to make very sure that we test as much as possible. For
>>>>>>> this I’m proposing a slightly changed testing procedure [1]. This is
>>>>>>> similar to the testing document we used for previous releases but has a
>>>>>>> new functional testing section that tries to outline a testing procedure
>>>>>>> and a combination of systems/configurations that we have to test. Please
>>>>>>> have a look and comment on whether you think this is sufficient (or a bit
>>>>>>> too much).
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Aljoscha
>>>>>>>
>>>>>>> [1] https://docs.google.com/document/d/16fU1cpxoYf3o9cCDyakj7ZDnUoJTj4_CEmMTpCkY81s/edit?usp=sharing
>>>>>
>>>>
>>>>
>>
>