Re: Looking for Maintainers for Flink on YARN

2022-01-28 Thread
Hello, I would like to maintain the YARN component. In our company, we mainly
use YARN to schedule Flink, so its stability is important to us. I will spend
some time checking the issues.

On Wed, Jan 26, 2022 at 17:17, Konstantin Knauf wrote:

> Hi everyone,
>
> We are seeing an increasing number of test instabilities related to YARN
> [1]. Does someone in this group have the time to pick these up? The Flink
> Confluence contains a guide on how to triage test instability tickets [2].
>
> Thanks,
>
> Konstantin
>
> [1]
>
> https://issues.apache.org/jira/browse/FLINK-25514?jql=project%20%3D%20FLINK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20%3D%20%22Deployment%20%2F%20YARN%22%20AND%20labels%20%3D%20test-stability
> [2]
>
> https://cwiki.apache.org/confluence/display/FLINK/Triage+Test+Instability+Tickets
>
> On Mon, Sep 13, 2021 at 2:22 PM 柳尘  wrote:
>
> > Thanks to Konstantin for raising this question, and to Marton and Gabor
> > for stepping up!
> >
> > If I can help, please let me know so that I can better participate in
> > the work.
> >
> > Best,
> > Cheng Xingyuan
> >
> >
> > > On Jul 29, 2021, at 4:15 PM, Konstantin Knauf wrote:
> > >
> > > Dear community,
> > >
> > > We are looking for community members, who would like to maintain
> Flink's
> > > YARN support going forward. So far, this has been handled by teams at
> > > Ververica & Alibaba. The focus of these teams has shifted over the past
> > months so that we only have little time left for this topic. Still, we
> > think it is important to maintain high-quality support for Flink on
> > YARN.
> > >
> > > What does "Maintaining Flink on YARN" mean? There are no known bigger
> > > efforts outstanding. We are mainly talking about addressing
> > > "test-stability" issues, bugs, version upgrades, community
> contributions
> > &
> > > smaller feature requests. The prioritization of these would be up to
> the
> > > future maintainers, except "test-stability" issues which are important
> to
> > > address for overall productivity.
> > >
> > > If a group of community members forms itself, we are happy to give an
> > > introduction to relevant pieces of the code base, principles,
> > assumptions,
> > > ... and hand over open threads.
> > >
> > > If you would like to take on this responsibility or can join this
> effort
> > in
> > > a supporting role, please reach out!
> > >
> > > Cheers,
> > >
> > > Konstantin
> > > for the Deployment & Coordination Team at Ververica
> > >
> > > --
> > >
> > > Konstantin Knauf
> > >
> > > https://twitter.com/snntrable
> > >
> > > https://github.com/knaufk
> >
> >
>
> --
>
> Konstantin Knauf | Head of Product
>
> +49 160 91394525
>
>
> Follow us @VervericaData Ververica 
>
>
> --
>
> Join Flink Forward  - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Karl Anton Wehner, Holger Temme, Yip Park Tung Jason,
> Jinwei (Kevin) Zhang
>


Re: [VOTE] FLIP-199: Change some default config values of blocking shuffle for better usability

2022-01-11 Thread
+1 for the proposal. In fact, we already use these parameters in our internal
Flink version, with good performance.

On Wed, Jan 12, 2022 at 10:42, Yun Gao wrote:

> +1, since it would greatly improve the out-of-the-box experience for batch jobs.
>
> Thanks Yingjie for drafting the PR and initiating the discussion.
>
> Best,
> Yun
>
>
>
>  --Original Mail --
> Sender:Yingjie Cao 
> Send Date:Tue Jan 11 15:15:01 2022
> Recipients:dev 
> Subject:[VOTE] FLIP-199: Change some default config values of blocking
> shuffle for better usability
> Hi all,
>
> I'd like to start a vote on FLIP-199: Change some default config values of
> blocking shuffle for better usability [1] which has been discussed in this
> thread [2].
>
> The vote will be open for at least 72 hours unless there is an objection or
> not enough votes.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-199%3A+Change+some+default+config+values+of+blocking+shuffle+for+better+usability
> [2] https://lists.apache.org/thread/pt2b1f17x2l5rlvggwxs6m265lo4ly7p
>
> Best,
> Yingjie
>


Re: [DISCUSS] Change some default config values of blocking shuffle

2022-01-04 Thread
>>> documentation for the sort shuffle [1] to include a tuning guide? I am
>>> thinking of a more in-depth description of what things you might observe
>>> and how to influence them with the configuration options.
>>>
>>> [1]
>>> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/batch/blocking_shuffle/#sort-shuffle
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Dec 14, 2021 at 8:43 AM Jingsong Li 
>>> wrote:
>>>
>>>> Hi Yingjie,
>>>>
>>>> Thanks for your explanation. I have no more questions. +1
>>>>
>>>> On Tue, Dec 14, 2021 at 3:31 PM Yingjie Cao 
>>>> wrote:
>>>> >
>>>> > Hi Jingsong,
>>>> >
>>>> > Thanks for your feedback.
>>>> >
>>>> > >>> My question is, what is the maximum parallelism a job can have
>>>> with the default configuration? (Does this break the out-of-the-box experience?)
>>>> >
>>>> > Yes, you are right, these two options are related to network memory
>>>> and framework off-heap memory. Generally, these changes will not break the
>>>> out-of-the-box experience, but in some extreme cases, for example, when there
>>>> are too many ResultPartitions per task, users may need to increase network
>>>> memory to avoid "insufficient network buffer" errors. For framework
>>>> off-heap memory, I believe users do not need to change the default value.
>>>> >
>>>> > In fact, I have a basic goal when changing these config values: when
>>>> running TPC-DS at medium parallelism with the default values, all queries
>>>> must pass without any error and, at the same time, performance should be
>>>> improved. I think if we achieve this goal, most common use cases can be
>>>> covered.
>>>> >
>>>> > Currently, for the default configuration, the exclusive buffers
>>>> required at the input gate side are still parallelism-relevant (though since
>>>> 1.14, we can decouple the network buffer consumption from parallelism by
>>>> setting a config value, with a slight performance influence on streaming
>>>> jobs), which means that very large parallelism cannot be supported by the
>>>> default configuration. Roughly, I would say the default values can support
>>>> jobs with a parallelism of several hundred.
>>>> >
>>>> > >>> I do feel that this correspondence is a bit difficult to control
>>>> at the moment, and it would be best if a rough table could be provided.
>>>> >
>>>> > I think this is a good suggestion, we can provide those suggestions
>>>> in the document.
>>>> >
>>>> > Best,
>>>> > Yingjie
>>>> >
>>>> > On Tue, Dec 14, 2021 at 14:39, Jingsong Li wrote:
>>>> >>
>>>> >> Hi  Yingjie,
>>>> >>
>>>> >> +1 for this FLIP. I'm pretty sure this will greatly improve the ease
>>>> >> of use of batch jobs.
>>>> >>
>>>> >> Looks like "taskmanager.memory.framework.off-heap.batch-shuffle.size"
>>>> >> and "taskmanager.network.sort-shuffle.min-buffers" are related to
>>>> >> network memory and framework.off-heap.size.
>>>> >>
>>>> >> My question is, what is the maximum parallelism a job can have with
>>>> >> the default configuration? (Does this break the out-of-the-box experience?)
>>>> >>
>>>> >> How much network memory and framework.off-heap.size are required for
>>>> >> how much parallelism in the default configuration?
>>>> >>
>>>> >> I do feel that this correspondence is a bit difficult to control at
>>>> >> the moment, and it would be best if a rough table could be provided.
>>>> >>
>>>> >> Best,
>>>> >> Jingsong
>>>> >>
>>>> >> On Tue, Dec 14, 2021 at 2:16 PM Yingjie Cao 
>>>> wrote:
>>>> >> >
>>>> >> > Hi Jiangang,
>>>> >> >
>>>> >> > Thanks for your suggestion.
>>>> >> >
>>>> >> > >>> The config can affect the memory usage. Will the related
>>>> memory configs be changed?
>>>> >> >
>>>> >> > I think we will not change 
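
A rough sketch of the buffer arithmetic behind this exchange: how many network
buffers the input side of a task needs as parallelism grows, assuming Flink's
defaults of 2 exclusive buffers per channel, 8 floating buffers per gate, and
32 KiB buffers. The helper below is illustrative only, not Flink's actual
accounting.

```java
// Back-of-the-envelope estimate of input-side network buffers for one task,
// using the default settings noted above. Real demand also depends on the
// output side and on how many gates/partitions a task has.
public class NetworkBufferEstimate {

    static final int EXCLUSIVE_PER_CHANNEL = 2; // taskmanager.network.memory.buffers-per-channel
    static final int FLOATING_PER_GATE = 8;     // taskmanager.network.memory.floating-buffers-per-gate
    static final int BUFFER_BYTES = 32 * 1024;  // taskmanager.memory.segment-size

    static long inputBuffersPerTask(int upstreamParallelism, int numInputGates) {
        long perGate = (long) upstreamParallelism * EXCLUSIVE_PER_CHANNEL + FLOATING_PER_GATE;
        return perGate * numInputGates;
    }

    public static void main(String[] args) {
        long buffers = inputBuffersPerTask(500, 1); // parallelism 500, one input gate
        System.out.printf("buffers=%d (~%.1f MiB)%n",
                buffers, buffers * (double) BUFFER_BYTES / (1 << 20));
    }
}
```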

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-29 Thread
Thank you for the proposal, Jing. I like the idea of partitioning data by some
key to improve the cache hit ratio. I have some questions:

   1. When it comes to Hive, how do you load partial data instead of the
   whole data? Is any change related to Hive needed?
   2. How do we define the cache configuration, for example, the size and the
   TTL? (A sketch follows this list.)
   3. Will this feature add another shuffle phase compared with the default
   behavior? In what situations will users choose this feature?
   4. For the keys, the default implementation will be OK, but I wonder
   whether we can support more flexible strategies.
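
On question 2, a minimal sketch of what a size- plus TTL-bounded lookup cache
could look like, here using Guava's cache purely as a stand-in; the String
key/value types and the limits are hypothetical, not part of the proposal:

```java
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

// Illustrative only: a lookup cache bounded both by entry count (size) and
// by TTL, the two knobs asked about above. The key stands in for the join
// key and the value for the looked-up dimension row.
public class LookupCacheSketch {
    public static void main(String[] args) {
        Cache<String, String> cache = CacheBuilder.newBuilder()
                .maximumSize(10_000)                    // size limit
                .expireAfterWrite(10, TimeUnit.MINUTES) // TTL per entry
                .build();

        cache.put("customer-42", "cached dimension row");
        System.out.println(cache.getIfPresent("customer-42")); // hit within TTL
    }
}
```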


On Wed, Dec 29, 2021 at 17:18, wenlong.lwl wrote:

> Hi, Jing, thanks for driving the discussion.
>
> Have you made some investigation on the syntax of join hint?
> Why do you choose USE_HASH from oracle instead of the style of spark
> SHUFFLE_HASH, they are quite different.
> People in the big data world may be more familiar with spark/hive, if we
> need to choose one, personally, I prefer the style of spark.
>
>
> Best,
> Wenlong
>
> On Wed, 29 Dec 2021 at 16:48, zst...@163.com  wrote:
>
> >
> >
> >
> > Hi Jing,
> > Thanks for your detailed reply.
> > 1) In the last suggestion, hashing by the primary key is not used to raise
> > the cache hit ratio, but to handle skew in the left source. Now that you
> > have a 'skew' hint and other discussion about it, I'm looking forward to it.
> > 2) I mean supporting a user-defined partitioner function. We have a case
> > of joining a data lake source with a special way of partitioning, which we
> > implemented, not elegantly, in our internal version. As you said, it needs
> > more design.
> > 3) I think the so-called 'HashPartitionedCache' is useful; otherwise,
> > loading all the data, as a Hive lookup table source does, is hardly
> > feasible for big data.
> >
> > Best regards,
> > Yuan
> >
> > On 2021-12-29 14:52:11, "Jing Zhang" wrote:
> > >Hi, Lincoln
> > >Thanks a lot for the feedback.
> > >
> > >>  Regarding the hint name ‘USE_HASH’, could we consider more
> candidates?
> > >Things are a little different from RDBMS in the distributed world, and
> we
> > >also aim to solve the data skew problem, so all these incoming hint
> names
> > >should be considered together.
> > >
> > >About the skew problem, I would discuss this individually in a next FLIP.
> > I
> > >would like to share the hint proposal for skew here.
> > >I want to introduce a 'skew' hint, which is a query hint similar to the
> > >skew hint in Spark [1] and MaxCompute [2].
> > >The 'skew' hint could contain only the name of the skewed table.
> > >Besides that, the skew hint could accept a table name and column names.
> > >In addition, the skew hint could accept a table name, column names and
> > >skew values.
> > >For example:
> > >
> > >SELECT /*+ USE_HASH('Orders', 'Customers'), SKEW('Orders') */
> o.order_id,
> > >o.total, c.country, c.zip
> > >FROM Orders AS o
> > >JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
> > >ON o.customer_id = c.id;
> > >
> > >The 'skew' hint is not only used for look up join here, but also could
> be
> > >used for other types of join later, for example, batch hash join or
> > >streaming regular join.
> > >Going back to the naming problem for hash lookup join: since the 'skew'
> > hint
> > >is a separate hint, 'use_hash' is still an alternative.
> > >WDYT?
> > >I don't have a good idea for a better hint name yet. I would like to
> > >hear more suggestions about hint names.
> > >
> > >>  As you mentioned in the FLIP, this solution depends on future changes
> > to
> > >Calcite (and also upgrading Calcite would be another possible big
> change:
> > >at least Calcite 1.30 vs 1.26, are we prepared to accept this big
> > >change?).
> > >
> > >Indeed, solution 1 depends on a Calcite upgrade.
> > >I admit an upgrade from Calcite 1.26 to 1.30 would be a big change. I still
> > >remember what we suffered in the last upgrade, to Calcite 1.26.
> > >However, we cannot always avoid upgrades, for the following reasons:
> > >1. Other features also depend on the Calcite upgrade. For example,
> > Session
> > >Window and Count Window.
> > >2. If we always avoid Calcite upgrades, the gap with the latest version
> > >grows. One day, if upgrading becomes a thing which has to be
> > done,
> > >the pain will be greater.
> > >
> > >WDYT?
> > >
> > >>  Is there another possible way to minimize the change in calcite?
> > >
> > >Did you check the 'Other Alternatives' part of FLIP-204? It gives
> > >another solution which does not depend on a Calcite upgrade and does not
> need
> > >to worry about the hint being missed in the propagation.
> > >This is also what we have done in the internal version.
> > >The core idea is propagating the 'use_hash' hint to TableScans with matched
> > >table names.  However, it is a little hacky.
> > >
> > >> As I know there're more limitations than `Correlate`.
> > >
> > >As mentioned before, in our internal version, I chose the 'Other
> > >Alternatives' part of FLIP-204.
> > 

Re: [DISCUSS] FLIP-198: Working directory for Flink processes

2021-12-12 Thread
I like the idea. It can reuse the disk for many things. Is it only for
internal failover? If not, the cleanup may be a problem. Also, many resource
components have their own disk-scheduling strategies.

On Sun, Dec 12, 2021 at 19:59, Chesnay Schepler wrote:

> How do you intend to handle corrupted files, in particular due to
> process crashes during a write?
> Will all writes to a cached directory append some suffix (e.g.,
> ".pending") and do a rename?
>
> On 10/12/2021 17:54, Till Rohrmann wrote:
> > Hi everyone,
> >
> > I would like to start a discussion about introducing an explicit working
> > directory for Flink processes that can be used to store information [1].
> > Per default this working directory will reside in the temporary directory
> > of the node Flink runs on. However, if configured to reside on a
> persistent
> > volume, then this information can be used to recover from process/node
> > failures. Moreover, such a working directory can be used to consolidate
> > some of our other directories Flink creates under /tmp (e.g. blobStorage,
> > RocksDB working directory).
> >
> > Here is a draft PR that outlines the required changes [2].
> >
> > Looking forward to your feedback.
> >
> > [1] https://cwiki.apache.org/confluence/x/ZZiqCw
> > [2] https://github.com/apache/flink/pull/18083
> >
> > Cheers,
> > Till
> >
>
>
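
For reference, a minimal sketch of the write-then-rename pattern Chesnay
describes, assuming a POSIX-like filesystem where the rename is atomic; the
".pending" suffix follows his suggestion and the paths are hypothetical:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Write to a ".pending" file first, then atomically rename it into place,
// so a crash mid-write never leaves a half-written file under the final name.
public class AtomicWriteSketch {

    static void writeAtomically(Path target, byte[] contents) throws IOException {
        Path pending = target.resolveSibling(target.getFileName() + ".pending");
        Files.write(pending, contents); // may be torn by a crash; final name untouched
        Files.move(pending, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        writeAtomically(Paths.get("/tmp/flink-workdir/state.bin"),
                "example".getBytes(StandardCharsets.UTF_8));
    }
}
```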


Re: [ANNOUNCE] New Apache Flink Committer - Ingo Bürk

2021-12-12 Thread
Congratulations!

Best
Liu Jiangang

On Mon, Dec 13, 2021 at 11:28, Nicholas Jiang wrote:

> Congratulations, Ingo!
>
> Best,
> Nicholas Jiang
>


Re: [ANNOUNCE] New Apache Flink Committer - Matthias Pohl

2021-12-12 Thread
Congratulations!

Best
Liu Jiangang

On Mon, Dec 13, 2021 at 11:23, Nicholas Jiang wrote:

> Congratulations, Matthias!
>
> Best,
> Nicholas Jiang
>


Re: [DISCUSS] FLIP-168: Speculative execution for Batch Job

2021-12-12 Thread
Any progress on the feature? We have the same requirement in our company.
Since the software and hardware environment can be complex, it is normal to
see a slow task that determines the execution time of the whole Flink job.

 wrote on Sun, Jun 20, 2021 at 22:35:

> Hi everyone,
>
> I would like to kick off a discussion on speculative execution for batch
> job.
> I have created FLIP-168 [1] that clarifies our motivation to do this and
> some improvement proposals for the new design.
> It would be great to resolve the problem of long-tail tasks in batch jobs.
> Please let me know your thoughts. Thanks.
> Regards,
> wangwj
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
>


Re: [DISCUSS] Change some default config values of blocking shuffle

2021-12-10 Thread
Glad to see the suggestion. In our tests, just as in yours, we found that the
changed configs cannot improve performance much for small jobs. I have some
suggestions:

   - The configs can affect memory usage. Will the related memory
   configs be changed?
   - Can you share the TPC-DS results for the different configs? Although we
   change the default values, it is helpful for different users to tune them,
   and the shared experience can help a lot here.

Best,
Liu Jiangang

On Fri, Dec 10, 2021 at 17:20, Yun Gao wrote:

> Hi Yingjie,
>
> Very thanks for drafting the FLIP and initiating the discussion!
>
> May I double-check one point about
> taskmanager.network.sort-shuffle.min-parallelism:
> since other frameworks like Spark use sort-based shuffle for all
> cases, does our
> current situation still differ from theirs?
>
> Best,
> Yun
>
>
>
>
> --
> From:Yingjie Cao 
> Send Time:2021 Dec. 10 (Fri.) 16:17
> To:dev ; user ; user-zh <
> user...@flink.apache.org>
> Subject:Re: [DISCUSS] Change some default config values of blocking shuffle
>
> Hi dev & users:
>
> I have created a FLIP [1] for it; feedback is highly appreciated.
>
> Best,
> Yingjie
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-199%3A+Change+some+default+config+values+of+blocking+shuffle+for+better+usability
> On Fri, Dec 3, 2021 at 17:02, Yingjie Cao wrote:
>
> Hi dev & users,
>
> We propose to change some default values of blocking shuffle to improve
> the user's out-of-the-box experience (streaming is not influenced). The default
> values we want to change are as follows:
>
> 1. Data compression
> (taskmanager.network.blocking-shuffle.compression.enabled): Currently, the
> default value is 'false'. Usually, data compression can reduce both disk
> and network IO, which is good for performance. At the same time, it can save
> storage space. We propose to change the default value to true.
>
> 2. Default shuffle implementation
> (taskmanager.network.sort-shuffle.min-parallelism): Currently, the default
> value is 'Integer.MAX', which means by default, Flink jobs will always use
> hash-shuffle. In fact, for high parallelism, sort-shuffle is better for
> both stability and performance. So we propose to reduce the default value
> to a proper smaller one, for example, 128. (We tested 128, 256, 512 and
> 1024 with a tpc-ds and 128 is the best one.)
>
> 3. Read buffer of sort-shuffle
> (taskmanager.memory.framework.off-heap.batch-shuffle.size): Currently, the
> default value is '32M'. Previously, when choosing the default value, both
> ‘32M' and '64M' are OK for tests and we chose the smaller one in a cautious
> way. However, recently, it is reported in the mailing list that the default
> value is not enough which caused a buffer request timeout issue. We already
> created a ticket to improve the behavior. At the same time, we propose to
> increase this default value to '64M' which can also help.
>
> 4. Sort buffer size of sort-shuffle
> (taskmanager.network.sort-shuffle.min-buffers): Currently, the default
> value is '64' which means '64' network buffers (32k per buffer by default).
> This default value is quite modest and the performance can be influenced.
> We propose to increase this value to a larger one, for example, 512 (the
> default TM and network buffer configuration can serve more than 10 result
> partitions concurrently).
>
> We already tested these default values together with tpc-ds benchmark in a
> cluster and both the performance and stability improved a lot. These
> changes can help to improve the out-of-box experience of blocking shuffle.
> What do you think about these changes? Is there any concern? If there are
> no objections, I will make these changes soon.
>
> Best,
> Yingjie
>
>
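
Taken together, a sketch of the four proposed values, set explicitly through
Flink's Configuration API. This only summarizes the proposal above; once the
defaults change, none of these lines would be necessary.

```java
import org.apache.flink.configuration.Configuration;

// The four blocking-shuffle values proposed above, set explicitly.
public class BlockingShuffleDefaults {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setString("taskmanager.network.blocking-shuffle.compression.enabled", "true");
        conf.setString("taskmanager.network.sort-shuffle.min-parallelism", "128");
        conf.setString("taskmanager.memory.framework.off-heap.batch-shuffle.size", "64m");
        conf.setString("taskmanager.network.sort-shuffle.min-buffers", "512");
        System.out.println(conf);
    }
}
```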


Re: [ANNOUNCE] New Apache Flink Committer - Ingo Bürk

2021-12-02 Thread
Congratulations!

Best,
Liu Jiangang

On Thu, Dec 2, 2021 at 11:24 PM, Till Rohrmann wrote:

> Hi everyone,
>
> On behalf of the PMC, I'm very happy to announce Ingo Bürk as a new Flink
> committer.
>
> Ingo has been contributing to Flink since the beginning of this year. He
> worked mostly on SQL components. He has authored many PRs and helped review
> a lot of other PRs in this area. He actively reported issues and helped our
> users on the MLs. His most notable contributions were Support SQL 2016 JSON
> functions in Flink SQL (FLIP-90), Register sources/sinks in Table API
> (FLIP-129) and various other contributions in the SQL area. Moreover, he is
> one of the few people in our community who actually understands Flink's
> frontend.
>
> Please join me in congratulating Ingo for becoming a Flink committer!
>
> Cheers,
> Till
>


Re: [ANNOUNCE] New Apache Flink Committer - Matthias Pohl

2021-12-02 Thread
Congratulations!

Best,
Liu Jiangang

On Thu, Dec 2, 2021 at 11:28 PM, Till Rohrmann wrote:

> Hi everyone,
>
> On behalf of the PMC, I'm very happy to announce Matthias Pohl as a new
> Flink committer.
>
> Matthias has worked on Flink since August last year. He helped review a ton
> of PRs. He worked on a variety of things but most notably the tracking and
> reporting of concurrent exceptions, fixing HA bugs and deprecating and
> removing our Mesos support. He actively reports issues helping Flink to
> improve and he is actively engaged in Flink's MLs.
>
> Please join me in congratulating Matthias for becoming a Flink committer!
>
> Cheers,
> Till
>


Re: [ANNOUNCE] Open source of remote shuffle project for Flink batch data processing

2021-11-30 Thread
Good work for Flink's batch processing!
A remote shuffle service can resolve the container-lost problem and reduce
the running time of batch jobs after a failover. We have investigated such
components a lot and welcome Flink's native solution. We will try it and
help improve it.

Thanks,
Liu Jiangang

On Tue, Nov 30, 2021 at 9:33 PM, Yingjie Cao wrote:

> Hi dev & users,
>
> We are happy to announce the open source of remote shuffle project [1] for
> Flink. The project originated at Alibaba, and the main motivation is to
> improve batch data processing for both performance & stability and further
> embrace cloud native. For more features about the project, please refer to
> [1].
>
> Before going open source, the project was used widely in production,
> and it behaves well in terms of both stability and performance. We hope you
> enjoy it. Collaboration and feedback are highly appreciated.
>
> Best,
> Yingjie on behalf of all contributors
>
> [1] https://github.com/flink-extended/flink-remote-shuffle
>


Re: [ANNOUNCE] New Apache Flink Committer - Jing Zhang

2021-11-22 Thread
Congratulations!

On Mon, Nov 22, 2021 at 4:10 PM, Matthias Pohl wrote:

> Congratulations :-)
>
> On Thu, Nov 18, 2021 at 3:23 AM Jingsong Li 
> wrote:
>
> > Congratulations, Jing! Well deserved!
> >
> > On Wed, Nov 17, 2021 at 3:00 PM Lincoln Lee 
> > wrote:
> > >
> > > Congratulations, Jing!
> > >
> > > Best,
> > > Lincoln Lee
> > >
> > >
> > > > On Wed, Nov 17, 2021 at 10:24 AM, Jing Zhang wrote:
> > >
> > > > Thanks to everyone. It's my honor to work in community with you all.
> > > >
> > > > Best,
> > > > Jing Zhang
> > > >
> > > > > On Wed, Nov 17, 2021 at 12:06 AM, Zakelly Lan wrote:
> > > >
> > > > > Congratulations,  Jing!
> > > > >
> > > > > Best,
> > > > > Zakelly
> > > > >
> > > > > On Tue, Nov 16, 2021 at 11:03 PM Yang Wang 
> > > > wrote:
> > > > >
> > > > >> Congratulations,  Jing!
> > > > >>
> > > > >> Best,
> > > > >> Yang
> > > > >>
> > > > >> On Tue, Nov 16, 2021 at 9:31 PM, Benchao Li wrote:
> > > > >>
> > > > >> > Congratulations Jing~
> > > > >> >
> > > > >> > On Tue, Nov 16, 2021 at 1:58 PM, OpenInx wrote:
> > > > >> >
> > > > >> > > Congrats Jing!
> > > > >> > >
> > > > >> > > On Tue, Nov 16, 2021 at 11:59 AM Terry Wang <
> zjuwa...@gmail.com
> > >
> > > > >> wrote:
> > > > >> > >
> > > > >> > > > Congratulations,  Jing!
> > > > >> > > > Well deserved!
> > > > >> > > >
> > > > >> > > > Best,
> > > > >> > > > Terry Wang
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > > On Nov 16, 2021 at 11:27 AM, Zhilong Hong
> wrote:
> > > > >> > > > >
> > > > >> > > > > Congratulations, Jing!
> > > > >> > > > >
> > > > >> > > > > Best regards,
> > > > >> > > > > Zhilong Hong
> > > > >> > > > >
> > > > >> > > > > On Mon, Nov 15, 2021 at 9:41 PM Martijn Visser <
> > > > >> > mart...@ververica.com>
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > >> Congratulations Jing!
> > > > >> > > > >>
> > > > >> > > > >> On Mon, 15 Nov 2021 at 14:39, Timo Walther <
> > twal...@apache.org
> > > > >
> > > > >> > > wrote:
> > > > >> > > > >>
> > > > >> > > > >>> Hi everyone,
> > > > >> > > > >>>
> > > > >> > > > >>> On behalf of the PMC, I'm very happy to announce Jing
> > Zhang
> > > > as a
> > > > >> > new
> > > > >> > > > >>> Flink committer.
> > > > >> > > > >>>
> > > > >> > > > >>> Jing has been very active in the Flink community esp. in
> > the
> > > > >> > > Table/SQL
> > > > >> > > > >>> area for quite some time: 81 PRs [1] in total and is
> also
> > > > >> active in
> > > > >> > > > >>> answering questions on the user mailing list. She is
> > currently
> > > > >> > > > >>> contributing a lot around the new windowing table-valued
> > > > >> functions
> > > > >> > > [2].
> > > > >> > > > >>>
> > > > >> > > > >>> Please join me in congratulating Jing Zhang for
> becoming a
> > > > Flink
> > > > >> > > > >> committer!
> > > > >> > > > >>>
> > > > >> > > > >>> Thanks,
> > > > >> > > > >>> Timo
> > > > >> > > > >>>
> > > > >> > > > >>> [1] https://github.com/apache/flink/pulls/beyond1920
> > > > >> > > > >>> [2] https://issues.apache.org/jira/browse/FLINK-23997
> > > > >> > > > >>>
> > > > >> > > > >>
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> >
> > > > >> > Best,
> > > > >> > Benchao Li
> > > > >> >
> > > > >>
> > > > >
> > > >
> >
> >
> >
> > --
> > Best, Jingsong Lee
> >
>


Re: [DISCUSS] Improve the name and structure of job vertex and operator name for job

2021-11-20 Thread
+1 for the FLIP. We have hit the problem that a long name stalls metric
collection for SQL jobs.

On Fri, Nov 19, 2021 at 10:29 PM, wenlong.lwl wrote:

> hi, yun,
> Thanks for the suggestion, but I am not sure whether we need such a prefix
> or not, because the log has included vertex id, when the name is concise
> enough, we can get the vertex id easily.
> Does anyone have some comments on this?
>
>
> Best,
> Wenlong Lyu
>
> On Thu, 18 Nov 2021 at 19:03, Yun Tang  wrote:
>
> > Hi Wenlong,
> >
> > Thanks for bringing up this discussion; I believe many people have
> > suffered from long and unreadable operator names for a long time.
> >
> > I have another suggestion, inspired by Aitozi: we could add
> some
> > hint to tell the vertex index, such as turning the pipeline "source -->
> > flatMap --> sink" into "[vertex-0] source --> [vertex-1] flatMap -->
> > [vertex-2] sink".
> > This would make it much easier for users and developers to know which
> > vertex is wrong when meeting exceptions.
> >
> > Best
> > Yun Tang
> >
> > On 2021/11/17 07:42:28 godfrey he wrote:
> > > Hi Wenlong, I'm fine with the config options.
> > >
> > > Best,
> > > Godfrey
> > >
> > > On Wed, Nov 17, 2021 at 3:13 PM, wenlong.lwl wrote:
> > >
> > > >
> > > > Hi Chesney and Konstantin,
> > > > thanks for your feedback, I have added a section about How we support
> > set
> > > > description at DataStream API in the doc.
> > > >
> > > >
> > > > Bests,
> > > > Wenlong
> > > >
> > > > On Tue, 16 Nov 2021 at 21:05, Konstantin Knauf 
> > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > Thanks for starting this discussion. I am in favor of solving this
> > for
> > > > > DataStream and Table API at the same time, using the same
> > configuration
> > > > > keys. IMO we shouldn't introduce any additional fragmentation if we
> > can
> > > > > avoid it.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Konstantin
> > > > >
> > > > > On Tue, Nov 16, 2021 at 1:50 PM wenlong.lwl <
> wenlong88@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > hi, Chesnay, we focus on SQL first because the operator and
> > > > > > topology of SQL jobs are generated by the engine, which raises
> > > > > > most of the naming problems, not only because the names are long
> > > > > > but also because the topology can be more complex than in
> > > > > > DataStream.
> > > > > >
> > > > > > The case in DataStream is much better: most of the names in the
> > > > > > DataStream API are quite concise except for the windowing you
> > > > > > mentioned, and the topology is usually simpler. What's more, we can
> > > > > > easily expose this to the DataStream API as a second step once the
> > > > > > foundational implementation is done. If necessary, we can also
> > > > > > cover the changes on the DataStream API now, maybe taking
> > > > > > windowing first as an example?
> > > > > >
> > > > > > Best,
> > > > > > Wenlong
> > > > > >
> > > > > > On Tue, 16 Nov 2021 at 19:14, Chesnay Schepler <
> ches...@apache.org
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Why should this be specific to the table API? The datastream
> API
> > has
> > > > > > > similar issues with long operator names (like windowing).
> > > > > > >
> > > > > > > On 16/11/2021 11:22, wenlong.lwl wrote:
> > > > > > > > Thanks Godfrey for the suggestion.
> > > > > > > > Regarding 1, how about
> > > > > table.optimizer.simplify-operator-name-enabled,
> > > > > > > > which means that we would simplify the name of operator and
> > keep the
> > > > > > > > details in description only.
> > > > > > > > "table.optimizer.operator-name.description-enabled" can not
> > describe
> > > > > > what
> > > > > > > > it means I think.
> > > > > > > > Regarding 2, I agree that it is better to use enum instead of
> > > > > boolean.
> > > > > > > For
> > > > > > > > key I think you are meaning
> > "pipeline.vertex-description-pattern"
> > > > > > instead
> > > > > > > > of "pipeline.vertex-name-pattern", and I would like to choose
> > > > > > > DEFAULT/TREE
> > > > > > > > for values.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Wenlong
> > > > > > > >
> > > > > > > > On Tue, 16 Nov 2021 at 17:28, godfrey he <
> godfre...@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > >> Thanks for creating this FLIP Wenlong.
> > > > > > > >>
> > > > > > > >> The FLIP already looks pretty solid, I think the config
> > options can
> > > > > be
> > > > > > > >> improved a little:
> > > > > > > >> 1) about table.optimizer.separate-name-and-description, I
> > think
> > > > > > > >> "operator-name" should be considered in the option,
> > > > > > > >> how about table.optimizer.operator-name.description-enabled
> ?
> > > > > > > >> 2) about pipeline.tree-mode-vertex-description, I think we
> > can make
> > > > > > > >> the mode accept string value,
> > > > > > > >> which is more flexible. How about
> > pipeline.vertex-name-pattern, the
> > > > > > > >> default value is "TREE",
> > > > > > > >> 

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread
Thanks, Till. There are many reasons to reduce the heartbeat interval and
timeout, but I am not sure which values are suitable. In our cases, the GC
time and big jobs can be relevant factors. Since most Flink jobs are pipelined
and a total failover can cost some time, we should tolerate some
stop-the-world situations. Also, I think that FLINK-23216 should be solved to
detect lost containers fast and react to them. From my side, I suggest
reducing the values gradually.
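
For reference, a sketch of the settings under discussion, assuming the
then-current defaults (heartbeat.interval 10000 ms, heartbeat.timeout
50000 ms) and the candidate values floated in this thread; these are not
finalized numbers.

```java
import org.apache.flink.configuration.Configuration;

// Heartbeat values from the discussion, set explicitly (milliseconds).
// 3s/15s are the candidates mentioned in the thread, not decided defaults.
public class HeartbeatSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setString("heartbeat.interval", "3000");  // default at the time: 10000
        conf.setString("heartbeat.timeout", "15000");  // default at the time: 50000
        System.out.println(conf);
    }
}
```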

On Thu, Jul 22, 2021 at 5:33 PM, Till Rohrmann wrote:

> Thanks for your inputs Gen and Arnaud.
>
> I do agree with you, Gen, that we need better guidance for our users on
> when to change the heartbeat configuration. I think this should happen in
> any case. I am, however, not so sure whether we can give hard threshold
> like 5000 tasks, for example, because as Arnaud said it strongly depends on
> the workload. Maybe we can explain it based on symptoms a user might
> experience and what to do then.
>
> Concerning your workloads, Arnaud, I'd be interested to learn a bit more.
> The user code runs in its own thread. This means that its operation won't
> block the main thread/heartbeat. The only thing that can happen is that the
> user code starves the heartbeat in terms of CPU cycles or causes a lot of
> GC pauses. If you are observing the former problem, then we might think
> about changing the priorities of the respective threads. This should then
> improve Flink's stability for these workloads and a shorter heartbeat
> timeout should be possible.
>
> Also for the RAM-cached repositories, what exactly is causing the heartbeat
> to time out? Is it because you have a lot of GC or that the heartbeat
> thread does not get enough CPU cycles?
>
> Cheers,
> Till
>
> On Thu, Jul 22, 2021 at 9:16 AM LINZ, Arnaud 
> wrote:
>
> > Hello,
> >
> >
> >
> > From a user perspective: we have some (rare) use cases where we use
> > “coarse-grained” datasets, with big beans and tasks that do lengthy
> operations
> > (such as ML training). In these cases we had to increase the timeout to
> > huge values (heartbeat.timeout: 50) so that our app is not killed.
> >
> > I’m aware this is not the way Flink was meant to be used, but it’s a
> > convenient way to distribute our workload on datanodes without having to
> > use another concurrency framework (such as M/R) that would require the
> > recoding of sources and sinks.
> >
> >
> >
> > In some other (most common) cases, our tasks do some R/W accesses to
> > RAM-cached repositories backed by a key-value storage such as Kudu (or
> > Hbase). If most of those calls are very fast, sometimes when the system
> is
> > under heavy load they may block more than a few seconds, and having our
> app
> > killed because of a short timeout is not an option.
> >
> >
> >
> > That’s why I’m not in favor of very short timeouts… Because in my
> > experience it really depends on what user code does in the tasks. (I
> > understand that normally, as user code is not a JVM-blocking activity
> such
> > as a GC, it should have no impact on heartbeats, but from experience, it
> > really does)
> >
> >
> >
> > Cheers,
> >
> > Arnaud
> >
> >
> >
> >
> >
> > *From:* Gen Luo 
> > *Sent:* Thursday, July 22, 2021 05:46
> > *To:* Till Rohrmann 
> > *Cc:* Yang Wang ; dev ;
> > user 
> > *Subject:* Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval
> > default values
> >
> >
> >
> > Hi,
> >
> > Thanks for driving this @Till Rohrmann  . I would
> > give +1 on reducing the heartbeat timeout and interval, though I'm not
> sure
> > if 15s and 3s would be enough either.
> >
> >
> >
> > IMO, except for the standalone cluster, where the heartbeat mechanism in
> > Flink is fully relied upon, reducing the heartbeat can also help the JM
> > find out faster about TaskExecutors in abnormal conditions that cannot
> > respond to the
> > heartbeat requests, e.g., a continuous full GC, where the process of the
> > TaskExecutor is alive but this may not be known by the deployment system.
> Since
> > there are cases that can benefit from this change, I think it could be
> done
> > if it won't break the experience in other scenarios.
> >
> >
> >
> > If we can address what blocks the main threads from processing
> > heartbeats, or what enlarges the GC costs, we can try to get rid of them to
> have
> > a more predictable heartbeat response time, or give some advice to
> > users whose jobs may encounter these issues. For example, as far as I
> > know the JM of a large-scale job will be busier and may not be able to
> process
> > heartbeats in time; then we can advise that users working with jobs
> > larger than 5000 tasks should enlarge their heartbeat interval to 10s and
> > timeout to 50s. (These numbers are just rough examples.)
> >
> >
> >
> > As for the issue in FLINK-23216, I think it should be fixed and may not
> be
> > a main concern for this case.
> >
> >
> >
> > On Wed, Jul 21, 2021 at 6:26 PM Till Rohrmann 
> > wrote:
> >
> > Thanks for sharing these insights.
> >
> >
> >
> > I think it is 

Re: Job Recovery Time on TM Lost

2021-07-12 Thread
Yes, time is the main factor when detecting the TM's liveness. The count-based
method checks at certain intervals.

On Fri, Jul 9, 2021 at 10:37 AM, Gen Luo wrote:

> @刘建刚
> Welcome to join the discuss and thanks for sharing your experience.
>
> I have a minor question. In my experience, network failures in a given
> cluster usually take some time to recover, which can be measured, e.g., as
> a p99, to guide configuration. So I suppose it would be better to use time
> rather than attempt count as the configuration for confirming TM liveness.
> What do you think about this? And is the premise right according to your
> experience?
>
> @Lu Niu 
> > Does that mean the akka timeout situation we talked above doesn't apply
> to flink 1.11?
>
> I suppose it's true. According to the reply from Till in FLINK-23216
> <https://issues.apache.org/jira/browse/FLINK-23216>, it should be
> confirmed that the problem is introduced by declarative resource
> management, which is introduced to Flink in 1.12.
>
> In previous versions, although JM still uses heartbeat to check TMs
> status, RM will tell JM about TM lost once it is noticed by Yarn. This is
> much faster than JM's heartbeat mechanism, if one uses default heartbeat
> configurations. However, after 1.12 with declarative resource management,
> RM will no longer tell this to JM, since it doesn't have a related
> AllocationID.  So the heartbeat mechanism becomes the only way JM can know
> about TM lost.
>
> On Fri, Jul 9, 2021 at 6:34 AM Lu Niu  wrote:
>
>> Thanks everyone! This is a great discussion!
>>
>> 1. Restarting takes 30s when throwing exceptions from application code
>> because the restart delay is 30s in config. Before lots of related config
>> are 30s which lead to the confusion. I redo the test with config:
>>
>> FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=2147483647,
>> backoffTimeMS=1000)
>> heartbeat.timeout: 50
>> akka.ask.timeout 30 s
>> akka.lookup.timeout 30 s
>> akka.tcp.timeout 30 s
>> akka.watch.heartbeat.interval 30 s
>> akka.watch.heartbeat.pause 120 s
>>
>>    Now phase 1 drops down to 2s and phase 2 takes 13s. The whole
>> restart takes 14s. Does that mean the akka timeout situation we talked
>> about above doesn't apply to Flink 1.11?
>>
>> 2. About flaky connection between TMs, we did notice sometimes exception
>> as follows:
>> ```
>> TaskFoo switched from RUNNING to FAILED on
>> container_e02_1599158147594_156068_01_38 @
>> xenon-prod-001-20200316-data-slave-prod-0a0201a4.ec2.pin220.com
>> (dataPort=40957).
>> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
>> Connection unexpectedly closed by remote task manager '
>> xenon-prod-001-20200316-data-slave-prod-0a020f7a.ec2.pin220.com/10.2.15.122:33539'.
>> This might indicate that the remote task manager was lost.
>> at
>> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:144)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
>> at
>> org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:97)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1416)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
>&g

Re: [DISCUSS] FLIP-182: Watermark alignment

2021-07-12 Thread
+1 for the source watermark alignment.
In previous Flink versions, the source connectors differed in their
implementations, which made it hard to build this feature. When the consumed
data is not aligned, or when consuming historical data, misalignment arises
very easily. Source alignment can resolve many instability problems.

On Fri, Jul 9, 2021 at 11:25 PM, Seth Wiesman wrote:

> +1
>
> In my opinion, this limitation is perfectly fine for the MVP. Watermark
> alignment is a long-standing issue and this already moves the ball so far
> forward.
>
> I don't expect this will cause many issues in practice, as I understand it
> the FileSource always processes one split at a time, and in my experience,
> 90% of Kafka users have a small number of partitions scale their pipelines
> to have one reader per partition. Obviously, there are larger-scale Kafka
> topics and more sources that will be ported over in the future but I think
> there is an implicit understanding that aligning sources adds latency to
> pipelines, and we can frame the follow-up "per-split" alignment as an
> optimization.
>
> On Fri, Jul 9, 2021 at 6:40 AM Piotr Nowojski 
> wrote:
>
> > Hey!
> >
> > A couple of weeks ago me and Arvid Heise played around with an idea to
> > address a long standing issue of Flink: lack of watermark/event time
> > alignment between different parallel instances of sources, that can lead
> to
> > ever growing state size for downstream operators like WindowOperator.
> >
> > We had an impression that this is relatively low hanging fruit that can
> be
> > quite easily implemented - at least partially (the first part mentioned
> in
> > the FLIP document). I have written down our proposal [1] and you can also
> > check out our PoC that we have implemented [2].
> >
> > We think that this is a quite easy proposal, that has been in large part
> > already implemented. There is one obvious limitation of our PoC. Namely
> we
> > can only easily block individual SourceOperators. This works perfectly
> fine
> > as long as there is at most one split per SourceOperator. However it
> > doesn't work with multiple splits. In that case, if a single
> > `SourceOperator` is responsible for processing both the least and the
> most
> > advanced splits, we won't be able to block this most advanced split for
> > generating new records. I'm proposing to solve this problem in the future
> > in another follow up FLIP, as a solution that works with a single split
> per
> > operator is easier and already valuable for some of the users.
> >
> > What do you think about this proposal?
> > Best, Piotrek
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-182%3A+Support+watermark+alignment+of+FLIP-27+Sources
> > [2] https://github.com/pnowojski/flink/commits/aligned-sources
> >
>
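
A minimal sketch of the core alignment check the FLIP describes: a source
pauses emitting once its local watermark runs too far ahead of the global
minimum across all sources. Names and the drift threshold are illustrative,
not the FLIP's actual API.

```java
// Illustrative check for FLIP-182's idea: block a source while its local
// watermark is more than maxDriftMs ahead of the global minimum watermark.
public class WatermarkAlignmentSketch {

    static boolean shouldPause(long localWatermark, long globalMinWatermark, long maxDriftMs) {
        return localWatermark - globalMinWatermark > maxDriftMs;
    }

    public static void main(String[] args) {
        long globalMin = 1_000_000L; // slowest source's watermark
        long local = 1_090_000L;     // this source is 90s ahead
        long maxDrift = 60_000L;     // tolerate 60s of drift
        System.out.println(shouldPause(local, globalMin, maxDrift)); // true -> pause this source
    }
}
```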


Re: Job Recovery Time on TM Lost

2021-07-07 Thread
It is really helpful to find the lost container quickly. In our internal Flink
version, we optimize this with task reports and JobMaster probes. When a task
fails because of a connection problem, it reports to the JobMaster. The
JobMaster then tries to confirm the liveness of the unreachable TaskManager a
configurable number of times. If the JobMaster finds the TaskManager
unreachable or dead, it releases the TaskManager. This works for most cases;
for an unstable environment, the config needs adjustment.
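
A minimal sketch of the probe-and-confirm idea described above; every name
here (Prober, confirmAlive, maxProbes) is hypothetical and only illustrates
the mechanism, not Flink's actual API.

```java
// After a task reports a connection failure, the JobMaster probes the suspect
// TaskManager a configurable number of times before declaring it dead.
public class LivenessProbeSketch {

    interface Prober {
        boolean probeTaskManager(String taskManagerId); // true if reachable
    }

    static boolean confirmAlive(Prober prober, String tmId, int maxProbes) {
        for (int attempt = 1; attempt <= maxProbes; attempt++) {
            if (prober.probeTaskManager(tmId)) {
                return true; // reachable again, keep the TaskManager
            }
        }
        return false; // consistently unreachable, release the TaskManager
    }

    public static void main(String[] args) {
        Prober alwaysDown = tmId -> false;
        System.out.println(confirmAlive(alwaysDown, "tm-1", 3)); // false
    }
}
```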

On Tue, Jul 6, 2021 at 8:41 PM, Gen Luo wrote:

> Yes, I have noticed the PR and commented there with some consideration
> about the new option. We can discuss further there.
>
> On Tue, Jul 6, 2021 at 6:04 PM Till Rohrmann  wrote:
>
> > This is actually a very good point Gen. There might not be a lot to gain
> > for us by implementing a fancy algorithm for figuring out whether a TM is
> > dead or not based on failed heartbeat RPCs from the JM if the TM <> TM
> > communication does not tolerate failures and directly fails the affected
> > tasks. This assumes that the JM and TM run in the same environment.
> >
> > One simple approach could be to make the number of failed heartbeat RPCs
> > until a target is marked as unreachable configurable because what
> > represents a good enough criterion in one user's environment might
> produce
> > too many false-positives in somebody else's environment. Or even simpler,
> > one could say that one can disable reacting to a failed heartbeat RPC as
> it
> > is currently the case.
> >
> > We currently have a discussion about this on this PR [1]. Maybe you wanna
> > join the discussion there and share your insights.
> >
> > [1] https://github.com/apache/flink/pull/16357
> >
> > Cheers,
> > Till
> >
> > On Tue, Jul 6, 2021 at 4:37 AM Gen Luo  wrote:
> >
> >> I know that there are retry strategies for akka rpc frameworks. I was
> >> just considering that, since the environment is shared by JM and TMs,
> and
> >> the connections among TMs (using netty) are flaky in unstable
> environments,
> >> which will also cause the job failure, is it necessary to build a
> >> strongly guaranteed connection between JM and TMs, or it could be as
> flaky
> >> as the connections among TMs?
> >>
> >> As far as I know, connections among TMs will just fail on their first
> >> connection loss, so behaving like this in JM just means "as flaky as
> >> connections among TMs". In a stable environment it's good enough, but
> in an
> >> unstable environment, it indeed increases the instability. IMO, though a
> >> single connection loss is not reliable, a double check should be good
> >> enough. But since I'm not experienced with an unstable environment, I
> can't
> >> tell whether that's also enough for it.
> >>
> >> On Mon, Jul 5, 2021 at 5:59 PM Till Rohrmann 
> >> wrote:
> >>
> >>> I think for RPC communication there are retry strategies used by the
> >>> underlying Akka ActorSystem. So a RpcEndpoint can reconnect to a remote
> >>> ActorSystem and resume communication. Moreover, there are also
> >>> reconciliation protocols in place which reconcile the states between
> the
> >>> components because of potentially lost RPC messages. So the main
> question
> >>> would be whether a single connection loss is good enough for
> triggering the
> >>> timeout or whether we want a more elaborate mechanism to reason about
> the
> >>> availability of the remote system (e.g. a couple of lost heartbeat
> >>> messages).
> >>>
> >>> Cheers,
> >>> Till
> >>>
> >>> On Mon, Jul 5, 2021 at 10:00 AM Gen Luo  wrote:
> >>>
>  As far as I know, a TM will report connection failure once its
>  connected TM is lost. I suppose JM can believe the report and fail the
>  tasks in the lost TM if it also encounters a connection failure.
> 
>  Of course, it won't work if the lost TM is standalone. But I suppose
> we
>  can use the same strategy as the connected scenario. That is,
> consider it
>  possibly lost on the first connection loss, and fail it if double
> check
>  also fails. The major difference is the senders of the probes are the
> same
>  one rather than two different roles, so the results may tend to be
> the same.
> 
>  On the other hand, the fact also means that the jobs can be fragile in
>  an unstable environment, no matter whether the failover is triggered
> by TM
>  or JM. So maybe it's not that worthy to introduce extra
> configurations for
>  fault tolerance of heartbeat, unless we also introduce some retry
>  strategies for netty connections.
> 
> 
>  On Fri, Jul 2, 2021 at 9:34 PM Till Rohrmann 
>  wrote:
> 
> > Could you share the full logs with us for the second experiment, Lu?
> I
> > cannot tell from the top of my head why it should take 30s unless
> you have
> > configured a restart delay of 30s.
> >
> > Let's discuss FLINK-23216 on the JIRA ticket, Gen.
> >
> > I've now implemented FLINK-23209 [1] but it somehow has the 

Re: [ANNOUNCE] New Apache Flink Committer - Yang Wang

2021-07-06 Thread
Congratulations, Yang Wang.

Best
Jiangang Liu

On Wed, Jul 7, 2021 at 10:27 AM, Leonard Xu wrote:

> Congratulations!  Yang Wang
>
>
> Best,
> Leonard
> > On Jul 7, 2021, at 10:23, tison wrote:
> >
> > Congratulations and well deserved!
> >
> > It is my pleasure to work with you excellent developers.
> >
> > Best,
> > tison.
> >
> >
> > On Wed, Jul 7, 2021 at 10:21 AM, Jark Wu wrote:
> >
> >> Congratulations Yang Wang!
> >>
> >> Best,
> >> Jark
> >>
> >> On Wed, 7 Jul 2021 at 10:09, Xintong Song 
> wrote:
> >>
> >>> Hi everyone,
> >>>
> >>> On behalf of the PMC, I'm very happy to announce Yang Wang as a new
> Flink
> >>> committer.
> >>>
> >>> Yang has been a very active contributor for more than two years, mainly
> >>> focusing on Flink's deployment components. He's a main contributor and
> >>> maintainer of Flink's native Kubernetes deployment and native
> Kubernetes
> >>> HA. He's also very active on the mailing lists, participating in
> >>> discussions and helping with user questions.
> >>>
> >>> Please join me in congratulating Yang Wang for becoming a Flink
> >> committer!
> >>>
> >>> Thank you~
> >>>
> >>> Xintong Song
> >>>
> >>
>
>


Re: [ANNOUNCE] New PMC member: Guowei Ma

2021-07-06 Thread
Congratulations, Guowei Ma.

Best
Jiangang Liu

On Wed, Jul 7, 2021 at 10:24 AM, tison wrote:

> Congrats! NB.
>
> Best,
> tison.
>
>
> On Wed, Jul 7, 2021 at 10:20 AM, Jark Wu wrote:
>
> > Congratulations Guowei!
> >
> > Best,
> > Jark
> >
> > On Wed, 7 Jul 2021 at 09:54, XING JIN  wrote:
> >
> > > Congratulations, Guowei~ !
> > >
> > > Best,
> > > Jin
> > >
> > > > On Wed, Jul 7, 2021 at 9:37 AM, Xintong Song wrote:
> > >
> > > > Congratulations, Guowei~!
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Wed, Jul 7, 2021 at 9:31 AM Qingsheng Ren 
> > wrote:
> > > >
> > > > > Congratulations Guowei!
> > > > >
> > > > > --
> > > > > Best Regards,
> > > > >
> > > > > Qingsheng Ren
> > > > > Email: renqs...@gmail.com
> > > > > On Jul 7, 2021, 09:30 +0800, Leonard Xu wrote:
> > > > > > Congratulations! Guowei Ma
> > > > > >
> > > > > > Best,
> > > > > > Leonard
> > > > > >
> > > > > > > On Jul 6, 2021, at 21:56, Kurt Young wrote:
> > > > > > >
> > > > > > > Hi all!
> > > > > > >
> > > > > > > I'm very happy to announce that Guowei Ma has joined the Flink
> > PMC!
> > > > > > >
> > > > > > > Congratulations and welcome Guowei!
> > > > > > >
> > > > > > > Best,
> > > > > > > Kurt
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] Better user experience in the WindowAggregate upon Changelog (contains update message)

2021-07-01 Thread
Thanks for the discussion, JING ZHANG. I like the first proposal since it
is simple and consistent with the DataStream API. It would be helpful to add
more docs about the special late-message case in WindowAggregate. Also, I look
forward to the more flexible emit strategies later.

On Fri, Jul 2, 2021 at 10:33 AM, Jark Wu wrote:

> Sorry, I made a typo above. I mean I prefer proposal (1) that
> only needs to set `table.exec.emit.allow-lateness` to handle late events.
> `table.exec.emit.late-fire.delay` can be optional which is 0s by default.
> `table.exec.state.ttl` will not affect window state anymore, so window
> state
> is still cleaned accurately by watermark.
>
> We don't need to expose `table.exec.emit.late-fire.enabled` on docs and
> can remove it in the next version.
>
> Best,
> Jark
>
> On Thu, 1 Jul 2021 at 21:20, Jark Wu  wrote:
>
> > Thanks Jing for bringing up this topic,
> >
> > The emit strategy configs are annotated as Experimental and not public in
> > the documentation.
> > However, I see this is a very useful feature which many users are looking
> > for.
> > I have posted these configs for many questions like "how to handle late
> > events in SQL".
> > Thus, I think it's time to make the configuration public and explicitly
> > document it. In the long
> > term, we would like to propose an EMIT syntax for SQL, but until then we
> > can get more
> > valuable feedback from users when they are using the configs.
> >
> > Regarding the exposed configuration, I prefer proposal (2).
> > But it would be better not to expose `table.exec.emit.late-fire.enabled`
> > on docs and we can
> > remove it in the next version.
> >
> > Best,
> > Jark
> >
> >
> > On Tue, 29 Jun 2021 at 11:09, JING ZHANG  wrote:
> >
> >> When WindowAggregate works upon a Changelog which contains update
> messages,
> >> an UPDATE_BEFORE message may be dropped as a late message. [1]
> >>
> >> In order to handle late UB messages, users need to set *all* of the
> >> following 3 parameters:
> >>
> >> (1) enable late fire by setting
> >>
> >> table.exec.emit.late-fire.enabled : true
> >>
> >> (2) set per record emit behavior for late records by setting
> >>
> >> table.exec.emit.late-fire.delay : 0 s
> >>
> >> (3) keep window state for extra time after window is fired by setting
> >>
> >> table.exec.emit.allow-lateness : 1 h  // or table.exec.state.ttl : 1 h
> >>
> >>
> >> The solution has two disadvantages:
> >>
> >> (1) Users may not realize that UB messages may be dropped as a late
> >> event, so they will not set related parameters.
> >>
> >> (2) When users look for a solution to solve the dropped UB messages
> >> problem, the current solution is a bit inconvenient for them because
> they
> >> need to set all 3 parameters. Besides, some configurations have
> overlapping
> >> functionality.
> >>
> >>
> >> Now there are two proposals to simplify the 3 parameters a little.
> >>
> >> (1) Users only need to set table.exec.emit.allow-lateness (just like the
> >> behavior on DataStream, where users only need to set allow-lateness); the
> framework
> >> could automatically set `table.exec.emit.late-fire.enabled` to true and set
> >> `table.exec.emit.late-fire.delay` to 0s.
> >>
> >> And in the later version, we deprecate `table.exec.emit.late-fire.delay`
> >> and `table.exec.emit.late-fire.enabled`.
> >>
> >>
> >> (2) Users need to set `table.exec.emit.late-fire.enabled` to true and set
> >> `table.exec.state.ttl`; the framework could automatically set
> >> `table.exec.emit.late-fire.delay` to 0s.
> >>
> >> And in the later version, we deprecate `table.exec.emit.late-fire.delay`
> >> and `table.exec.emit.allow-lateness `.
> >>
> >>
> >> Please let me know what you think about the issue.
> >>
> >> Thank you.
> >>
> >> [1] https://issues.apache.org/jira/browse/FLINK-22781
> >>
> >>
> >> Best regards,
> >> JING ZHANG
> >>
>
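
A sketch of the three emit-related options from this thread, set through the
Table API's configuration. Under proposal (1) only allow-lateness would need
to be set by the user; the other two lines show what the framework would then
derive automatically. This illustrates the discussion, not a finalized
interface.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// The three emit options discussed above, set explicitly for illustration.
public class LateEmitSettings {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        tEnv.getConfig().getConfiguration()
                .setString("table.exec.emit.allow-lateness", "1 h");
        // Under proposal (1), the framework would set these two automatically:
        tEnv.getConfig().getConfiguration()
                .setString("table.exec.emit.late-fire.enabled", "true");
        tEnv.getConfig().getConfiguration()
                .setString("table.exec.emit.late-fire.delay", "0 s");
    }
}
```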


Re: [VOTE] FLIP-147: Support Checkpoint After Tasks Finished

2021-06-28 Thread
+1 (binding)

Best
liujiangang

On Tue, Jun 29, 2021 at 2:05 AM, Piotr Nowojski wrote:

> +1 (binding)
>
> Piotrek
>
> On Mon, Jun 28, 2021 at 12:48, Dawid Wysakowicz
> wrote:
>
> > +1 (binding)
> >
> > Best,
> >
> > Dawid
> >
> > On 28/06/2021 10:45, Yun Gao wrote:
> > > Hi all,
> > >
> > > For FLIP-147 [1], which targets supporting checkpoints after tasks
> > have finished and modifying the operator
> > > API and implementation to ensure the commit of the last piece of data:
> > since after the last vote
> > > we have had more discussions [2][3] and a few updates, including changes
> > to a PublicEvolving API,
> > > I'd like to have another VOTE on the current state of the FLIP.
> > >
> > > The vote will last at least 72 hours (Jul 1st), following the consensus
> > > voting process.
> > >
> > > thanks,
> > >  Yun
> > >
> > >
> > > [1] https://cwiki.apache.org/confluence/x/mw-ZCQ
> > > [2]
> >
> https://lists.apache.org/thread.html/r400da9898ff66fd613c25efea15de440a86f14758ceeae4950ea25cf%40%3Cdev.flink.apache.org
> > > [3]
> >
> https://lists.apache.org/thread.html/r3953df796ef5ac67d5be9f2251a95ad72efbca31f1d1555d13e71197%40%3Cdev.flink.apache.org%3E
> >
> >
>


Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

2021-06-28 Thread
> > > > > > > > some
> > > > > > > > > > > > > metrics?
> > > > > > > > > > > > > 2) Should we log the disable commit in the
> > > corresponding
> > > > > > issue
> > > > > > > > and
> > > > > > > > > > > > increase
> > > > > > > > > > > > > the priority?
> > > > > > > > > > > > > 3) What if nobody looks into this issue and this
> > > becomes
> > > > > some
> > > > > > > > > > potential
> > > > > > > > > > > > > bugs released with the new version?
> > > > > > > > > > > > > 4) If no person is actively working on the issue,
> who
> > > > > should
> > > > > > > > > > re-enable
> > > > > > > > > > > > it?
> > > > > > > > > > > > > Would it block PRs again?
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Jark
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, 24 Jun 2021 at 10:04, Xintong Song <
> > > > > > > > tonysong...@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks all for the feedback.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > @Till @Yangze
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'm also not convinced by the idea of having an
> > > exception
> > > > > > for
> > > > > > > > > local
> > > > > > > > > > > > > builds.
> > > > > > > > > > > > > > We need to execute the entire build (or at least
> > the
> > > > > > failing
> > > > > > > > > stage)
> > > > > > > > > > > > > > locally, to make sure subsequent test cases
> > > prevented by
> > > > > > the
> > > > > > > > > > failure
> > > > > > > > > > > > one
> > > > > > > > > > > > > > are all executed. In that case, it's probably
> > easier
> > > to
> > > > > > rerun
> > > > > > > > the
> > > > > > > > > > build
> > > > > > > > > > > > > on
> > > > > > > > > > > > > > azure than locally.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Concerning disabling unstable test cases that
> > > regularly
> > > > > > block
> > > > > > > > PRs
> > > > > > > > > > from
> > > > > > > > > > > > > > merging, maybe we can say that such cases can
> only
> > be
> > > > > > > disabled
> > > > > > > > > when
> > > > > > > > > > > > > someone
> > > > > > > > > > > > > > is actively looking into it, likely the person
> who
> > > > > disabled
> > > > > > > the
> > > > > > > > > > case.
> > > > > > > > > > > > If
> > > > > > > > > > > > > > this person is no longer actively working on it,
> > > he/she
> > > > > > > should
> > > > > > > > > > enable
> > > > > > > > > > > > the
> > > > > > > > > > > > > > case again no matter if it is fixed or not.
> > > > > > > > > > > > > &

Re: [VOTE] FLIP-171: Async Sink

2021-06-24 Thread
A common base for implementing sinks is really helpful.
+1 (binding)

Thanks
liujiangang

Danny Cranmer  于2021年6月24日周四 下午7:10写道:

> Looking forward to simplifying adding new destination sinks.
>
> +1 (binding).
>
> Thanks
>
> On Thu, 24 Jun 2021, 10:46 Arvid Heise,  wrote:
>
> > Thanks for preparing the FLIP. +1 (binding) from my side.
> >
> > On Tue, Jun 22, 2021 at 2:28 PM Hausmann, Steffen
> > 
> > wrote:
> >
> > > Hi there,
> > >
> > > After the discussion in [1], I’d like to open a voting thread for
> > FLIP-171
> > > [2], which proposes a base implementation for sinks that support async
> > > requests.
> > >
> > > The vote will be open until June 25 (72h), unless there is an objection
> > or
> > > not enough votes.
> > >
> > > Cheers, Steffen
> > >
> > > [1]
> > >
> >
> https://mail-archives.apache.org/mod_mbox/flink-dev/202106.mbox/%3CC83F4222-4D07-412D-9BD5-DB92D59DDF03%40amazon.de%3E
> > > [2]
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-171%3A+Async+Sink
> > >
> > >
> > >
> > >
> > > Amazon Web Services EMEA SARL
> > > 38 avenue John F. Kennedy, L-1855 Luxembourg
> > > Sitz der Gesellschaft: L-1855 Luxemburg
> > > eingetragen im Luxemburgischen Handelsregister unter R.C.S. B186284
> > >
> > > Amazon Web Services EMEA SARL, Niederlassung Deutschland
> > > Marcel-Breuer-Str. 12, D-80807 Muenchen
> > > Sitz der Zweigniederlassung: Muenchen
> > > eingetragen im Handelsregister des Amtsgerichts Muenchen unter HRB
> > 242240,
> > > USt-ID DE317013094
> > >
> > >
> > >
> > >
> >
>


Re: Re: [ANNOUNCE] New PMC member: Arvid Heise

2021-06-24 Thread
Congratulations

Best
liujiangang

Matthias Pohl  于2021年6月23日周三 下午2:11写道:

> Congratulations, Arvid! :-)
>
> On Thu, Jun 17, 2021 at 9:02 AM Arvid Heise  wrote:
>
> > Thank you for your trust and support.
> >
> > Arvid
> >
> > On Thu, Jun 17, 2021 at 8:39 AM Roman Khachatryan 
> > wrote:
> >
> > > Congratulations!
> > >
> > > Regards,
> > > Roman
> > >
> > > On Thu, Jun 17, 2021 at 5:56 AM Xingbo Huang 
> wrote:
> > > >
> > > > Congratulations, Arvid!
> > > >
> > > > Best,
> > > > Xingbo
> > > >
> > > > Yun Tang  于2021年6月17日周四 上午10:49写道:
> > > >
> > > > > Congratulations, Arvid
> > > > >
> > > > > Best
> > > > > Yun Tang
> > > > > 
> > > > > From: Yun Gao 
> > > > > Sent: Thursday, June 17, 2021 10:46
> > > > > To: Jingsong Li ; dev <
> dev@flink.apache.org>
> > > > > Subject: Re: Re: [ANNOUNCE] New PMC member: Arvid Heise
> > > > >
> > > > > Congratulations, Arvid!
> > > > >
> > > > > Best,
> > > > > Yun
> > > > >
> > > > >
> > > > > --
> > > > > Sender:Jingsong Li
> > > > > Date:2021/06/17 10:41:29
> > > > > Recipient:dev
> > > > > Theme:Re: [ANNOUNCE] New PMC member: Arvid Heise
> > > > >
> > > > > Congratulations, Arvid!
> > > > >
> > > > > Best,
> > > > > Jingsong
> > > > >
> > > > > On Thu, Jun 17, 2021 at 6:52 AM Matthias J. Sax 
> > > wrote:
> > > > >
> > > > > > Congrats!
> > > > > >
> > > > > > On 6/16/21 6:06 AM, Leonard Xu wrote:
> > > > > > > Congratulations, Arvid!
> > > > > > >
> > > > > > >
> > > > > > >> 在 2021年6月16日,20:08,Till Rohrmann  写道:
> > > > > > >>
> > > > > > >> Congratulations, Arvid!
> > > > > > >>
> > > > > > >> Cheers,
> > > > > > >> Till
> > > > > > >>
> > > > > > >> On Wed, Jun 16, 2021 at 1:47 PM JING ZHANG <
> > beyond1...@gmail.com>
> > > > > > wrote:
> > > > > > >>
> > > > > > >>> Congratulations, Arvid!
> > > > > > >>>
> > > > > > >>> Nicholas Jiang  于2021年6月16日周三 下午7:25写道:
> > > > > > >>>
> > > > > >  Congratulations, Arvid!
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > >  --
> > > > > >  Sent from:
> > > > > > >>>
> > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> > > > > > 
> > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best, Jingsong Lee
> > > > >
> > > > >
> > >
> >
>


Re: [VOTE] FLIP-169: DataStream API for Fine-Grained Resource Requirements

2021-06-23 Thread
+1  (binding)

Thanks
liujiangang

Zhu Zhu  于2021年6月24日周四 上午11:38写道:

> +1  (binding)
>
> Thanks,
> Zhu
>
> Yangze Guo  于2021年6月21日周一 下午3:42写道:
>
> > According to the latest comment of Zhu Zhu[1], I append the potential
> > resource deadlock in batch jobs as a known limitation to this FLIP.
> > Thus, I'd extend the voting period for another 72h.
> >
> > [1]
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-169-DataStream-API-for-Fine-Grained-Resource-Requirements-td51071.html
> >
> > Best,
> > Yangze Guo
> >
> > On Tue, Jun 15, 2021 at 7:53 PM Xintong Song 
> > wrote:
> > >
> > > +1 (binding)
> > >
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Tue, Jun 15, 2021 at 6:21 PM Arvid Heise  wrote:
> > >
> > > > LGTM +1 (binding) from my side.
> > > >
> > > > On Tue, Jun 15, 2021 at 11:00 AM Yangze Guo 
> > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I'd like to start the vote of FLIP-169 [1]. This FLIP is discussed
> in
> > > > > the thread[2].
> > > > >
> > > > > The vote will be open for at least 72 hours. Unless there is an
> > > > > objection, I will try to close it by Jun. 18, 2021 if we have
> > received
> > > > > sufficient votes.
> > > > >
> > > > > [1]
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-169+DataStream+API+for+Fine-Grained+Resource+Requirements
> > > > > [2]
> > > > >
> > > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-169-DataStream-API-for-Fine-Grained-Resource-Requirements-td51071.html
> > > > >
> > > > > Best,
> > > > > Yangze Guo
> > > > >
> > > >
> >
>


Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

2021-06-22 Thread
It is a good principle that all tests must pass for any change. This
means a lot for the project's stability and development. I am a big +1 for
this proposal.

Best
liujiangang

Till Rohrmann  于2021年6月22日周二 下午6:36写道:

> One way to address the problem of regularly failing tests that block
> merging of PRs is to disable the respective tests for the time being. Of
> course, the failing test then needs to be fixed. But at least that way we
> would not block everyone from making progress.
>
> Cheers,
> Till
>
> On Tue, Jun 22, 2021 at 12:00 PM Arvid Heise  wrote:
>
> > I think this is overall a good idea. So +1 from my side.
> > However, I'd like to put a higher priority on infrastructure then, in
> > particular docker image/artifact caches.
> >
> > On Tue, Jun 22, 2021 at 11:50 AM Till Rohrmann 
> > wrote:
> >
> > > Thanks for bringing this topic to our attention Xintong. I think your
> > > proposal makes a lot of sense and we should follow it. It will give us
> > > confidence that our changes are working and it might be a good
> incentive
> > to
> > > quickly fix build instabilities. Hence, +1.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Jun 22, 2021 at 11:12 AM Xintong Song 
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > In the past a couple of weeks, I've observed several times that PRs
> are
> > > > merged without a green light from the CI tests, where failure cases
> are
> > > > considered *unrelated*. This may not always cause problems, but would
> > > > increase the chance of breaking our code base. In fact, it has
> occurred
> > > to
> > > > me twice in the past few weeks that I had to revert a commit which
> > breaks
> > > > the master branch due to this.
> > > >
> > > > I think it would be nicer to enforce a stricter rule, that no PRs
> > should
> > > be
> > > > merged without passing CI.
> > > >
> > > > The problems of merging PRs with "unrelated" test failures are:
> > > > - It's not always straightforward to tell whether a test failures are
> > > > related or not.
> > > > - It prevents subsequent test cases from being executed, which may
> fail
> > > > relating to the PR changes.
> > > >
> > > > To make things easier for the committers, the following exceptions
> > might
> > > be
> > > > considered acceptable.
> > > > - The PR has passed CI in the contributor's personal workspace.
> Please
> > > post
> > > > the link in such cases.
> > > > - The CI tests have been triggered multiple times, on the same
> commit,
> > > and
> > > > each stage has at least passed for once. Please also comment in such
> > > cases.
> > > >
> > > > If we all agree on this, I'd update the community guidelines for
> > merging
> > > > PRs wrt. this proposal. [1]
> > > >
> > > > Please let me know what do you think.
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > > [1]
> > > >
> > https://cwiki.apache.org/confluence/display/FLINK/Merging+Pull+Requests
> > > >
> > >
> >
>


Re: [ANNOUNCE] New PMC member: Xintong Song

2021-06-16 Thread
Congrats, Xintong!

Best.

Xintong Song  于2021年6月16日周三 下午5:51写道:

> Thanks all for the support.
> It's my honor to be part of the community and work with all of you great
> people.
>
> Thank you~
>
> Xintong Song
>
>
> On Wed, Jun 16, 2021 at 5:31 PM Konstantin Knauf 
> wrote:
>
> > Congratulations, Xintong!
> >
> > On Wed, Jun 16, 2021 at 11:30 AM Qingsheng Ren 
> wrote:
> >
> > > Congratulations Xintong!
> > >
> > > --
> > > Best Regards,
> > >
> > > Qingsheng Ren
> > > Email: renqs...@gmail.com
> > > On Jun 16, 2021, 5:23 PM +0800, Dawid Wysakowicz <
> dwysakow...@apache.org
> > >,
> > > wrote:
> > > > Hi all!
> > > >
> > > > I'm very happy to announce that Xintong Song has joined the Flink
> PMC!
> > > >
> > > > Congratulations and welcome Xintong!
> > > >
> > > > Best,
> > > > Dawid
> > >
> >
> >
> > --
> >
> > Konstantin Knauf
> >
> > https://twitter.com/snntrable
> >
> > https://github.com/knaufk
> >
>


Re: [ANNOUNCE] New PMC member: Arvid Heise

2021-06-16 Thread
Congratulations, Arvid!

Best

Xintong Song  于2021年6月16日周三 下午5:51写道:

> Congratulations, Arvid~!
>
> Thank you~
>
> Xintong Song
>
>
>
> On Wed, Jun 16, 2021 at 5:31 PM Konstantin Knauf 
> wrote:
>
> > Congratulations, Arvid!
> >
> > On Wed, Jun 16, 2021 at 11:30 AM Yangze Guo  wrote:
> >
> > > Congrats, Arvid!
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Wed, Jun 16, 2021 at 5:27 PM Qingsheng Ren 
> > wrote:
> > > >
> > > > Congratulations Arvid!
> > > >
> > > > --
> > > > Best Regards,
> > > >
> > > > Qingsheng Ren
> > > > Email: renqs...@gmail.com
> > > > On Jun 16, 2021, 5:21 PM +0800, Dawid Wysakowicz <
> > dwysakow...@apache.org>,
> > > wrote:
> > > > > Hi all!
> > > > >
> > > > > I'm very happy to announce that Arvid Heise has joined the Flink
> PMC!
> > > > >
> > > > > Congratulations and welcome Arvid!
> > > > >
> > > > > Best,
> > > > > Dawid
> > >
> >
> >
> > --
> >
> > Konstantin Knauf
> >
> > https://twitter.com/snntrable
> >
> > https://github.com/knaufk
> >
>


Re: Add control mode for flink

2021-06-11 Thread
Thanks Till for the reply. The suggestions are really helpful for the
topic. Maybe some of what I mentioned was not clear or detailed enough. Here
is what I want to say:

   1. Changing the log level is not suitable for this topic, as you said.
   Because our internal log4j is old, this feature is implemented in a hacky
   way through the restful interface. Since it can be done through the
   logging backend and is not related to control events, please forgive my
   mistake and just ignore it.
   2. I totally agree with you that data consistency is important for
   flink. So in our initial plan, control events should flow from sources to
   downstream tasks. In this way, they can combine with the current
   checkpoint mechanism. Every detail should be carefully considered to
   ensure exactly-once.
   3. For quick recovery, I have mentioned that some work is ongoing to
   solve it, like generalized incremental checkpoints. It is really helpful
   to us and can resolve many problems. But some corner cases may not be
   covered; for example, uploading a big jar may consume some time.

Welcome more people to discuss the question and give suggestions and ideas to
make a better flink.

Till Rohrmann [via Apache Flink User Mailing List archive.] <
ml+s2336050n44392...@n4.nabble.com> 于2021年6月11日周五 下午4:46写道:

> Thanks for starting this discussion. I do see the benefit of dynamically
> configuring your Flink job and the cluster running it. Some of the use
> cases which were mentioned here are already possible. E.g. adjusting the
> log level dynamically can be done by configuring an appropriate logging
> backend and then changing the logging properties (log4j 2 supports this for
> example). Then the remaining use cases can be categorized into two
> categories:
>
> 1) changing the job
> 2) changing the cluster configuration
>
> 1) would benefit from general control flow events which will be processed
> by all operators. 2) would require some component sending some control
> events to the other Flink processes.
>
> Implementing the control flow events can already be done to a good extent
> on the user level by using a connected stream and a user level-record type
> which can distinguish between control events and normal records.
> Admittedly, it is a bit of work, though.
>
> I think persisting all of these changes would be very important because
> otherwise, you might end up easily in an inconsistent state. For example,
> assume you have changed the log level and now a subset of the TaskManagers
> needs to be restarted. Now, all of a sudden some TaskManagers log on level
> X and the others on level Y. The same applies to job changes. A regional
> failover would have to restore the latest dynamically configured state. All
> in all, this looks like a very complex and complicated task.
>
> On the other hand, most of the described use cases should be realizable
> with a restart of a job. So if Flink were able to quickly resume a job,
> then we would probably not need this feature. Applying the changes to the
> Flink and the job configuration and resubmitting the job would do the
> trick. Hence, improving Flink's recovery speed could be an alternative
> approach to this problem.
>
> Cheers,
> Till
>
> On Fri, Jun 11, 2021 at 9:51 AM Jary Zhen <[hidden email]
> <http:///user/SendEmail.jtp?type=node=44392=0>> wrote:
>
>>  big +1 for this feature,
>>
>>1. Reset kafka offset in certain cases.
>>2. Stop checkpoint in certain cases.
>>3. Change log level for debug.
>>
>>
>> 刘建刚 <[hidden email] <http:///user/SendEmail.jtp?type=node=44392=1>>
>> 于2021年6月11日周五 下午12:17写道:
>>
>>> Thanks for all the discussions and suggestions. Since the topic has
>>> been discussed for about a week, it is time to have a conclusion and new
>>> ideas are welcomed at the same time.
>>> First, the topic starts with use cases in restful interface. The
>>> restful interface supported many useful interactions with users, for
>>> example as follows. It is an easy way to control the job compared with
>>> broadcast api.
>>>
>>>1. Change data processing’ logic by dynamic configs, such as filter
>>>condition.
>>>2. Define some tools to control the job, such as QPS limit,
>>>sampling, change log level and so on.
>>>
>>> Second, we broaden the topic to control flow in order to support all
>>> kinds of control events besides the above user cases. There is a strong
>>> demand to support custom (broadcast) events for iteration, SQL control
>>> events and so on. As Xintong Song said, the key to the control flow lies as
>>> follows:
>>>
>>>1. Who (which component) is 

Re: Add control mode for flink

2021-06-10 Thread
ld very helpful if users can leverage the built-in
> >> control flow of Flink.
> >>
> >> My 2 cents:
> >> 1. @Steven IMHO, producing control events from JobMaster is similar to
> >> triggering a savepoint. The REST api is non-blocking, and users should
> poll
> >> the results to confirm the operation is succeeded. If something goes
> wrong,
> >> it’s user’s responsibility to retry.
> >> 2. There are two kinds of existing special elements, special stream
> >> records (e.g. watermarks) and events (e.g. checkpoint barrier). They all
> >> flow through the whole DAG, but events needs to be acknowledged by
> >> downstream and can overtake records, while stream records are not). So
> I’m
> >> wondering if we plan to unify the two approaches in the new control flow
> >> (as Xintong mentioned both in the previous mails)?
> >>
> >> Best,
> >> Paul Lam
> >>
> >> 2021年6月8日 14:08,Steven Wu  写道:
> >>
> >>
> >> I can see the benefits of control flow. E.g., it might help the old (and
> >> inactive) FLIP-17 side input. I would suggest that we add more details
> of
> >> some of the potential use cases.
> >>
> >> Here is one mismatch with using control flow for dynamic config. Dynamic
> >> config is typically targeted/loaded by one specific operator. Control
> flow
> >> will propagate the dynamic config to all operators. not a problem per se
> >>
> >> Regarding using the REST api (to jobmanager) for accepting control
> >> signals from external system, where are we going to persist/checkpoint
> the
> >> signal? jobmanager can die before the control signal is propagated and
> >> checkpointed. Did we lose the control signal in this case?
> >>
> >>
> >> On Mon, Jun 7, 2021 at 11:05 PM Xintong Song 
> >> wrote:
> >>
> >>> +1 on separating the effort into two steps:
> >>>
> >>>1. Introduce a common control flow framework, with flexible
> >>>interfaces for generating / reacting to control messages for various
> >>>purposes.
> >>>2. Features that leverating the control flow can be worked on
> >>>concurrently
> >>>
> >>> Meantime, keeping collecting potential features that may leverage the
> >>> control flow should be helpful. It provides good inputs for the control
> >>> flow framework design, to make the framework common enough to cover the
> >>> potential use cases.
> >>>
> >>> My suggestions on the next steps:
> >>>
> >>>1. Allow more time for opinions to be heard and potential use cases
> >>>to be collected
> >>>2. Draft a FLIP with the scope of common control flow framework
> >>>3. We probably need a poc implementation to make sure the framework
> >>>covers at least the following scenarios
> >>>   1. Produce control events from arbitrary operators
> >>>   2. Produce control events from JobMaster
> >>>   3. Consume control events from arbitrary operators downstream
> >>>   where the events are produced
> >>>
> >>>
> >>> Thank you~
> >>> Xintong Song
> >>>
> >>>
> >>>
> >>> On Tue, Jun 8, 2021 at 1:37 PM Yun Gao  wrote:
> >>>
> >>>> Very thanks Jiangang for bringing this up and very thanks for the
> >>>> discussion!
> >>>>
> >>>> I also agree with the summarization by Xintong and Jing that control
> >>>> flow seems to be
> >>>> a common buidling block for many functionalities and dynamic
> >>>> configuration framework
> >>>> is a representative application that frequently required by users.
> >>>> Regarding the control flow,
> >>>> currently we are also considering the design of iteration for the
> >>>> flink-ml, and as Xintong has pointed
> >>>> out, it also required the control flow in cases like detection global
> >>>> termination inside the iteration
> >>>>  (in this case we need to broadcast an event through the iteration
> >>>> body to detect if there are still
> >>>> records reside in the iteration body). And regarding  whether to
> >>>> implement the dynamic configuration
> >>>> framework, I also agree with Xintong that the consistency guarantee
> >>>> would 

Re: Re: Add control mode for flink

2021-06-08 Thread
Thanks for the reply. It is a good question. There are multiple choices, as
follows:

   1. We can persist control signals in HighAvailabilityServices and replay
   them after failover.
   2. Or we can simply tell users that control signals take effect only
   after they are checkpointed.


Steven Wu [via Apache Flink User Mailing List archive.] <
ml+s2336050n44278...@n4.nabble.com> 于2021年6月8日周二 下午2:15写道:

>
> I can see the benefits of control flow. E.g., it might help the old (and
> inactive) FLIP-17 side input. I would suggest that we add more details of
> some of the potential use cases.
>
> Here is one mismatch with using control flow for dynamic config. Dynamic
> config is typically targeted/loaded by one specific operator. Control flow
> will propagate the dynamic config to all operators. not a problem per se
>
> Regarding using the REST api (to jobmanager) for accepting control
> signals from external system, where are we going to persist/checkpoint the
> signal? jobmanager can die before the control signal is propagated and
> checkpointed. Did we lose the control signal in this case?
>
>
> On Mon, Jun 7, 2021 at 11:05 PM Xintong Song <[hidden email]
> <http:///user/SendEmail.jtp?type=node=44278=0>> wrote:
>
>> +1 on separating the effort into two steps:
>>
>>1. Introduce a common control flow framework, with flexible
>>interfaces for generating / reacting to control messages for various
>>purposes.
>>2. Features that leverating the control flow can be worked on
>>concurrently
>>
>> Meantime, keeping collecting potential features that may leverage the
>> control flow should be helpful. It provides good inputs for the control
>> flow framework design, to make the framework common enough to cover the
>> potential use cases.
>>
>> My suggestions on the next steps:
>>
>>1. Allow more time for opinions to be heard and potential use cases
>>to be collected
>>2. Draft a FLIP with the scope of common control flow framework
>>3. We probably need a poc implementation to make sure the framework
>>covers at least the following scenarios
>>   1. Produce control events from arbitrary operators
>>   2. Produce control events from JobMaster
>>   3. Consume control events from arbitrary operators downstream
>>   where the events are produced
>>
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>>
>> On Tue, Jun 8, 2021 at 1:37 PM Yun Gao <[hidden email]
>> <http:///user/SendEmail.jtp?type=node=44278=1>> wrote:
>>
>>> Very thanks Jiangang for bringing this up and very thanks for the
>>> discussion!
>>>
>>> I also agree with the summarization by Xintong and Jing that control
>>> flow seems to be
>>> a common buidling block for many functionalities and dynamic
>>> configuration framework
>>> is a representative application that frequently required by users.
>>> Regarding the control flow,
>>> currently we are also considering the design of iteration for the
>>> flink-ml, and as Xintong has pointed
>>> out, it also required the control flow in cases like detection global
>>> termination inside the iteration
>>>  (in this case we need to broadcast an event through the iteration body
>>> to detect if there are still
>>> records reside in the iteration body). And regarding  whether to
>>> implement the dynamic configuration
>>> framework, I also agree with Xintong that the consistency guarantee
>>> would be a point to consider, we
>>> might consider if we need to ensure every operator could receive the
>>> dynamic configuration.
>>>
>>> Best,
>>> Yun
>>>
>>>
>>>
>>> --
>>> Sender:kai wang<[hidden email]
>>> <http:///user/SendEmail.jtp?type=node=44278=2>>
>>> Date:2021/06/08 11:52:12
>>> Recipient:JING ZHANG<[hidden email]
>>> <http:///user/SendEmail.jtp?type=node=44278=3>>
>>> Cc:刘建刚<[hidden email]
>>> <http:///user/SendEmail.jtp?type=node=44278=4>>; Xintong Song
>>> [via Apache Flink User Mailing List archive.]<[hidden email]
>>> <http:///user/SendEmail.jtp?type=node=44278=5>>; user<[hidden
>>> email] <http:///user/SendEmail.jtp?type=node=44278=6>>; dev<[hidden
>>> email] <http:///user/SendEmail.jtp?type=node=44278=7>>
>>> Theme:Re: Add control mode for flink
>>>
>>>
>>>
>>> I

Re: Add control mode for flink

2021-06-07 Thread
Thanks Xintong Song for the detailed supplement. Since flink jobs are
long-running, they are similar to many services, so interacting with them or
controlling them is a common desire. This was our initial thought when
implementing the feature. In our internal flink, many configs used in yaml
can be adjusted dynamically to avoid restarting the job, for example:

   1. Limit the input qps.
   2. Degrade the job by sampling and so on.
   3. Reset kafka offset in certain cases.
   4. Stop checkpoint in certain cases.
   5. Control the history consuming.
   6. Change log level for debug.


After deep discussion, we realized that a common control flow
will benefit both users and developers. Dynamic config is just one of the
use cases. As for the concrete design and implementation, it relates to many
components, like the jobmaster, network channels, operators and so on, which
needs deeper consideration and design.

Xintong Song [via Apache Flink User Mailing List archive.] <
ml+s2336050n44245...@n4.nabble.com> 于2021年6月7日周一 下午2:52写道:

> Thanks Jiangang for bringing this up, and Steven & Peter for the feedback.
>
> I was part of the preliminary offline discussions before this proposal
> went public. So maybe I can help clarify things a bit.
>
> In short, despite the phrase "control mode" might be a bit misleading,
> what we truly want to do from my side is to make the concept of "control
> flow" explicit and expose it to users.
>
> ## Background
> Jiangang & his colleagues at Kuaishou maintain an internal version of
> Flink. One of their custom features is allowing dynamically changing
> operator behaviors via the REST APIs. He's willing to contribute this
> feature to the community, and came to Yun Gao and me for suggestions. After
> discussion, we feel that the underlying question to be answered is how do
> we model the control flow in Flink. Dynamically controlling jobs via REST
> API can be one of the features built on top of the control flow, and there
> could be others.
>
> ## Control flow
> Control flow refers to the communication channels for sending
> events/signals to/between tasks/operators, that changes Flink's behavior in
> a way that may or may not affect the computation logic. Typical control
> events/signals Flink currently has are watermarks and checkpoint barriers.
>
> In general, for modeling control flow, the following questions should be
> considered.
> 1. Who (which component) is responsible for generating the control
> messages?
> 2. Who (which component) is responsible for reacting to the messages.
> 3. How do the messages propagate?
> 4. When it comes to affecting the computation logics, how should the
> control flow work together with the exact-once consistency.
>
> 1) & 2) may vary depending on the use cases, while 3) & 4) probably share
> many things in common. A unified control flow model would help deduplicate
> the common logics, allowing us to focus on the use case specific parts.
>
> E.g.,
> - Watermarks: generated by source operators, handled by window operators.
> - Checkpoint barrier: generated by the checkpoint coordinator, handled by
> all tasks
> - Dynamic controlling: generated by JobMaster (in reaction to the REST
> command), handled by specific operators/UDFs
> - Operator defined events: The following features are still in planning,
> but may potentially benefit from the control flow model. (Please correct me
> if I'm wrong, @Yun, @Jark)
>   * Iteration: When a certain condition is met, we might want to signal
> downstream operators with an event
>   * Mini-batch assembling: Flink currently uses special watermarks for
> indicating the end of each mini-batch, which makes it tricky to deal with
> event time related computations.
>   * Hive dimension table join: For periodically reloaded hive tables, it
> would be helpful to have specific events signaling that a reloading is
> finished.
>   * Bootstrap dimension table join: This is similar to the previous one.
> In cases where we want to fully load the dimension table before starting
> joining the mainstream, it would be helpful to have an event signaling the
> finishing of the bootstrap.
>
> ## Dynamic REST controlling
> Back to the specific feature that Jiangang proposed, I personally think
> it's quite convenient. Currently, to dynamically change the behavior of an
> operator, we need to set up a separate source for the control events and
> leverage broadcast state. Being able to send the events via REST APIs
> definitely improves the usability.
>
> Leveraging dynamic configuration frameworks is for sure one possible
> approach. The reason we are in favor of introducing the control flow is
> that:
> - It benefits not only this specific dynamic controlling feature, but
> potentia

Re: Add control mode for flink

2021-06-06 Thread
Thank you for the reply. I have checked the post you mentioned. The dynamic
config may be useful sometimes. But it is hard to keep data consistent in
flink this way; for example, how should a dynamic config take effect on
failover? Since dynamic config is something users desire, maybe flink can
support it in some way.

For the control mode, dynamic config is just one of the control modes. In
the google doc, I have listed some other cases. For example, control events
can be generated in operators or external services. Besides users' dynamic
config, the flink system can support some common dynamic configuration, like
qps limits, checkpoint control and so on.

Handling the control mode structure needs good design. Based on that, other
control features can be added easily later, like changing the log level
while the job is running. In the end, flink will not just process data, but
will also interact with users and receive control events, like a service.

Steven Wu  于2021年6月4日周五 下午11:11写道:

> I am not sure if we should solve this problem in Flink. This is more like
> a dynamic config problem that probably should be solved by some
> configuration framework. Here is one post from google search:
> https://medium.com/twodigits/dynamic-app-configuration-inject-configuration-at-run-time-using-spring-boot-and-docker-ffb42631852a
>
> On Fri, Jun 4, 2021 at 7:09 AM 刘建刚  wrote:
>
>> Hi everyone,
>>
>>   Flink jobs are always long-running. When the job is running, users
>> may want to control the job but not stop it. The control reasons can be
>> different as following:
>>
>>1.
>>
>>Change data processing’ logic, such as filter condition.
>>2.
>>
>>Send trigger events to make the progress forward.
>>3.
>>
>>Define some tools to degrade the job, such as limit input qps,
>>sampling data.
>>4.
>>
>>Change log level to debug current problem.
>>
>>   The common way to do this is to stop the job, do modifications and
>> start the job. It may take a long time to recover. In some situations,
>> stopping jobs is intolerable, for example, the job is related to money or
>> important activities.So we need some technologies to control the running
>> job without stopping the job.
>>
>>
>> We propose to add control mode for flink. A control mode based on the
>> restful interface is first introduced. It works by these steps:
>>
>>
>>1. The user can predefine some logic which supports config control,
>>such as filter condition.
>>2. Run the job.
>>3. If the user wants to change the job's running logic, just send a
>>restful request with the responding config.
>>
>> Other control modes will also be considered in the future. More
>> introduction can refer to the doc
>> https://docs.google.com/document/d/1WSU3Tw-pSOcblm3vhKFYApzVkb-UQ3kxso8c8jEzIuA/edit?usp=sharing
>> . If the community likes the proposal, more discussion is needed and a more
>> detailed design will be given later. Any suggestions and ideas are welcome.
>>
>>


Add control mode for flink

2021-06-04 Thread
Hi everyone,

  Flink jobs are always long-running. When a job is running, users
may want to control the job without stopping it. The reasons for controlling
a job can differ, as follows:

   1. Change data processing logic, such as a filter condition.
   2. Send trigger events to move the progress forward.
   3. Define some tools to degrade the job, such as limiting input qps or
   sampling data.
   4. Change the log level to debug a current problem.

  The common way to do this is to stop the job, make modifications and
restart the job. It may take a long time to recover. In some situations,
stopping the job is intolerable, for example when the job is related to money
or important activities. So we need some techniques to control the running
job without stopping it.


We propose to add a control mode for flink. A control mode based on the
restful interface is introduced first. It works by these steps:


   1. The user predefines some logic which supports config control, such
   as a filter condition.
   2. Run the job.
   3. If the user wants to change the job's running logic, just send a
   restful request with the corresponding config.

Other control modes will also be considered in the future. More details can
be found in the doc
https://docs.google.com/document/d/1WSU3Tw-pSOcblm3vhKFYApzVkb-UQ3kxso8c8jEzIuA/edit?usp=sharing
. If the community likes the proposal, more discussion is needed and a more
detailed design will be given later. Any suggestions and ideas are welcome.
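
As a point of reference for the user-level workaround mentioned earlier in
the thread (a separate control stream plus broadcast state), here is a
minimal sketch of a filter whose condition can be updated at runtime. All
names (DynamicFilterExample, the prefix condition, the assumed input streams)
are illustrative, not part of the proposal itself:

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class DynamicFilterExample {

    // Wires a control stream into a data stream; both streams are assumed
    // to exist already (e.g. a Kafka topic carrying config updates).
    public static DataStream<String> applyDynamicFilter(
            DataStream<String> events, DataStream<String> controlEvents) {

        final MapStateDescriptor<String, String> configDescriptor =
                new MapStateDescriptor<>("filterConfig", String.class, String.class);

        BroadcastStream<String> control = controlEvents.broadcast(configDescriptor);

        return events
                .connect(control)
                .process(new BroadcastProcessFunction<String, String, String>() {

                    @Override
                    public void processElement(String value, ReadOnlyContext ctx,
                                               Collector<String> out) throws Exception {
                        // Read the latest dynamic filter condition, if any.
                        String prefix = ctx.getBroadcastState(configDescriptor).get("prefix");
                        if (prefix == null || value.startsWith(prefix)) {
                            out.collect(value);
                        }
                    }

                    @Override
                    public void processBroadcastElement(String newPrefix, Context ctx,
                                                        Collector<String> out) throws Exception {
                        // A control event updates the condition on every subtask.
                        ctx.getBroadcastState(configDescriptor).put("prefix", newPrefix);
                    }
                });
    }
}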


Re: [VOTE] FLIP-143: Unified Sink API

2020-09-29 Thread
+1 (binding)

Best,
Liu Jiangang

Jingsong Li  于2020年9月29日周二 下午1:36写道:

> +1 (binding)
>
> Best,
> Jingsong
>
> On Mon, Sep 28, 2020 at 3:21 AM Kostas Kloudas  wrote:
>
> > +1 (binding)
> >
> > @Steven Wu I think there will be opportunities to fine tune the API
> > during the implementation.
> >
> > Cheers,
> > Kostas
> >
> > On Sun, Sep 27, 2020 at 7:56 PM Steven Wu  wrote:
> > >
> > > +1 (non-binding)
> > >
> > > Although I would love to continue the discussion for tweaking the
> > > CommitResult/GlobaCommitter interface maybe during the implementation
> > phase.
> > >
> > > On Fri, Sep 25, 2020 at 5:35 AM Aljoscha Krettek 
> > > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > Aljoscha
> > > >
> > > > On 25.09.20 14:26, Guowei Ma wrote:
> > > > >  From the discussion[1] we could find that FLIP focuses on
> providing
> > an
> > > > > unified transactional sink API. So I updated the FLIP's title to
> > "Unified
> > > > > Transactional Sink API". But I found that the old link could not be
> > > > opened
> > > > > again.
> > > > >
> > > > > I would update the link[2] here. Sorry for the inconvenience.
> > > > >
> > > > > [1]
> > > > >
> > > >
> >
> https://lists.apache.org/thread.html/rf09dfeeaf35da5ee98afe559b5a6e955c9f03ade0262727f6b5c4c1e%40%3Cdev.flink.apache.org%3E
> > > > > [2] https://cwiki.apache.org/confluence/x/KEJ4CQ
> > > > >
> > > > > Best,
> > > > > Guowei
> > > > >
> > > > >
> > > > > On Thu, Sep 24, 2020 at 8:13 PM Guowei Ma 
> > wrote:
> > > > >
> > > > >> Hi, all
> > > > >>
> > > > >> After the discussion in [1], I would like to open a voting thread
> > for
> > > > >> FLIP-143 [2], which proposes a unified sink api.
> > > > >>
> > > > >> The vote will be open until September 29th (72h + weekend), unless
> > there
> > > > >> is an objection or not enough votes.
> > > > >>
> > > > >> [1]
> > > > >>
> > > >
> >
> https://lists.apache.org/thread.html/rf09dfeeaf35da5ee98afe559b5a6e955c9f03ade0262727f6b5c4c1e%40%3Cdev.flink.apache.org%3E
> > > > >> [2]
> > > > >>
> > > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-143%3A+Unified+Sink+API
> > > > >>
> > > > >> Best,
> > > > >> Guowei
> > > > >>
> > > > >
> > > >
> > > >
> >
>
>
> --
> Best, Jingsong Lee
>


Re: [ANNOUNCE] New Apache Flink Committer - Arvid Heise

2020-09-15 Thread
Congratulations!

Best

Matthias Pohl  于2020年9月15日周二 下午6:07写道:

> Congratulations! ;-)
>
> On Tue, Sep 15, 2020 at 11:47 AM Xingbo Huang  wrote:
>
> > Congratulations!
> >
> > Best,
> > Xingbo
> >
> > Igal Shilman  于2020年9月15日周二 下午5:44写道:
> >
> > > Congrats Arvid!
> > >
> > > On Tue, Sep 15, 2020 at 11:12 AM David Anderson  >
> > > wrote:
> > >
> > > > Congratulations, Arvid! Well deserved.
> > > >
> > > > Best,
> > > > David
> > > >
> > > > On Tue, Sep 15, 2020 at 10:23 AM Paul Lam 
> > wrote:
> > > >
> > > > > Congrats, Arvid!
> > > > >
> > > > > Best,
> > > > > Paul Lam
> > > > >
> > > > > > 2020年9月15日 15:29,Jingsong Li  写道:
> > > > > >
> > > > > > Congratulations Arvid !
> > > > > >
> > > > > > Best,
> > > > > > Jingsong
> > > > > >
> > > > > > On Tue, Sep 15, 2020 at 3:27 PM Dawid Wysakowicz <
> > > > dwysakow...@apache.org
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > >> Congratulations Arvid! Very well deserved!
> > > > > >>
> > > > > >> Best,
> > > > > >>
> > > > > >> Dawid
> > > > > >>
> > > > > >> On 15/09/2020 04:38, Zhijiang wrote:
> > > > > >>> Hi all,
> > > > > >>>
> > > > > >>> On behalf of the PMC, I’m very happy to announce Arvid Heise
> as a
> > > new
> > > > > >> Flink committer.
> > > > > >>>
> > > > > >>> Arvid has been an active community member for more than a year,
> > > with
> > > > > 138
> > > > > >> contributions including 116 commits, reviewed many PRs with good
> > > > quality
> > > > > >> comments.
> > > > > >>> He is mainly working on the runtime scope, involved in critical
> > > > > features
> > > > > >> like task mailbox model and unaligned checkpoint, etc.
> > > > > >>> Besides that, he was super active to reply questions in the
> user
> > > mail
> > > > > >> list (34 emails in March, 51 emails in June, etc), also active
> in
> > > dev
> > > > > mail
> > > > > >> list and Jira issue discussions.
> > > > > >>>
> > > > > >>> Please join me in congratulating Arvid for becoming a Flink
> > > > committer!
> > > > > >>>
> > > > > >>> Best,
> > > > > >>> Zhijiang
> > > > > >>
> > > > > >>
> > > > > >
> > > > > > --
> > > > > > Best, Jingsong Lee
> > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
>
> Matthias Pohl | Engineer
>
> Follow us @VervericaData Ververica 
>
> --
>
> Join Flink Forward  - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton
> Wehner
>


Re: [ANNOUNCE] New Apache Flink Committer - Niels Basjes

2020-09-14 Thread
Congratulations!

Best,
liujiangang

Danny Chan  于2020年9月15日周二 上午9:44写道:

> Congratulations! 
>
> Best,
> Danny Chan
> 在 2020年9月15日 +0800 AM9:31,dev@flink.apache.org,写道:
> >
> > Congratulations! 
>


Re: [ANNOUNCE] Yu Li became a Flink committer

2020-01-24 Thread
Congratulations!

> 2020年1月23日 下午4:59,Stephan Ewen  写道:
> 
> Hi all!
> 
> We are announcing that Yu Li has joined the rank of Flink committers.
> 
> Yu joined already in late December, but the announcement got lost because
> of the Christmas and New Years season, so here is a belated proper
> announcement.
> 
> Yu is one of the main contributors to the state backend components in the
> recent year, working on various improvements, for example the RocksDB
> memory management for 1.10.
> He has also been one of the release managers for the big 1.10 release.
> 
> Congrats for joining us, Yu!
> 
> Best,
> Stephan



Re: java.lang.StackOverflowError

2020-01-21 Thread
I am using flink 1.6.2 on yarn. State backend is rocksdb. 

> 2020年1月22日 上午10:15,刘建刚  写道:
> 
>   I have a flink job which fails occasionally. I am eager to avoid this 
> problem. Can anyone help me? The error stacktrace is as follows:
> java.io.IOException: java.lang.StackOverflowError
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.InputChannel.checkError(InputChannel.java:191)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.getNextBuffer(RemoteInputChannel.java:194)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:589)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:546)
>   at 
> org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:175)
>   at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:236)
>   at 
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:335)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:754)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.StackOverflowError
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.notifyChannelNonEmpty(SingleInputGate.java:656)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.InputChannel.notifyChannelNonEmpty(InputChannel.java:125)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.InputChannel.setError(InputChannel.java:203)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.notifyBufferAvailable(RemoteInputChannel.java:403)
>   at 
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:282)
>   at 
> org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:172)
>   at 
> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release0(AbstractReferenceCountedByteBuf.java:95)
>   at 
> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:84)
>   at 
> org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:147)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.notifyBufferAvailable(RemoteInputChannel.java:380)
>   at 
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:282)
>   at 
> org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:172)
>   at 
> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release0(AbstractReferenceCountedByteBuf.java:95)
>   at 
> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:84)
>   at 
> org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:147)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.notifyBufferAvailable(RemoteInputChannel.java:380)
>   at 
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:282)
>   at 
> org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:172)
>   at 
> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release0(AbstractReferenceCountedByteBuf.java:95)
>   at 
> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:84)
>   at 
> org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:147)
>   at 
> org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.notifyBufferAvailable(RemoteInputChannel.java:380)
>   at 
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:282)
>   at 
> org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:172)
>   at 
> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release0(AbstractReferenceCountedByteBuf.java:95)
>   at 
> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:84)
>   at 
> org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:147)
>  

java.lang.StackOverflowError

2020-01-21 Thread
  I have a flink job which fails occasionally. I am eager to avoid this
problem. Can anyone help me? The error stacktrace is as follows:

java.io.IOException: java.lang.StackOverflowError
at 
org.apache.flink.runtime.io.network.partition.consumer.InputChannel.checkError(InputChannel.java:191)
at 
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.getNextBuffer(RemoteInputChannel.java:194)
at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:589)
at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:546)
at 
org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:175)
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:236)
at 
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:335)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:754)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.StackOverflowError
at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.notifyChannelNonEmpty(SingleInputGate.java:656)
at 
org.apache.flink.runtime.io.network.partition.consumer.InputChannel.notifyChannelNonEmpty(InputChannel.java:125)
at 
org.apache.flink.runtime.io.network.partition.consumer.InputChannel.setError(InputChannel.java:203)
at 
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.notifyBufferAvailable(RemoteInputChannel.java:403)
at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:282)
at 
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:172)
at 
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release0(AbstractReferenceCountedByteBuf.java:95)
at 
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:84)
at 
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:147)
at 
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.notifyBufferAvailable(RemoteInputChannel.java:380)
at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:282)
at 
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:172)
at 
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release0(AbstractReferenceCountedByteBuf.java:95)
at 
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:84)
at 
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:147)
at 
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.notifyBufferAvailable(RemoteInputChannel.java:380)
at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:282)
at 
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:172)
at 
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release0(AbstractReferenceCountedByteBuf.java:95)
at 
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:84)
at 
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:147)
at 
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.notifyBufferAvailable(RemoteInputChannel.java:380)
at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:282)
at 
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:172)
at 
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release0(AbstractReferenceCountedByteBuf.java:95)
at 
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:84)
at 
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:147)
at 
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.notifyBufferAvailable(RemoteInputChannel.java:380)
at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:282)
at 
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:172)
at 

How to get kafka record's timestamp in job

2019-12-31 Thread
  In kafka010, ConsumerRecord has a field named timestamp. It is
encapsulated in Kafka010Fetcher. How can I get the timestamp when I write a
flink job? Thank you very much.
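
For readers with the same question: a minimal sketch, assuming the job reads
from the Kafka 0.10 connector (which attaches each record's timestamp to the
emitted element), so the timestamp can be read from a ProcessFunction's
context. The class name is illustrative:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class ExtractKafkaTimestamp extends ProcessFunction<String, Tuple2<String, Long>> {

    @Override
    public void processElement(String value, Context ctx,
                               Collector<Tuple2<String, Long>> out) {
        // ctx.timestamp() returns the record timestamp set by the source;
        // it may be null if no timestamp was attached.
        Long kafkaTimestamp = ctx.timestamp();
        out.collect(Tuple2.of(value, kafkaTimestamp));
    }
}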


Re: Error:java: 无效的标记: --add-exports=java.base/sun.net.util=ALL-UNNAMED

2019-11-21 Thread
Thank you very much. It works for me.

> 在 2019年11月14日,下午1:06,Biao Liu  写道:
> 
> Hi,
> 
> I have encountered the same issue when setting up a dev environment.
> 
> It seems that the my Intellij (2019.2.1) unexpectedly activates java11
> profile of maven. It doesn't match the Java compiler (JDK8). I'm not sure
> why it happened silently.
> 
> So for me, the solution is "Intellij" -> "View" -> "Tool Windows" ->
> "Maven" -> "Profiles" -> uncheck the "java11" -> reimport maven project.
> 
> Thanks,
> Biao /'bɪ.aʊ/
> 
> 
> 
> On Mon, 4 Nov 2019 at 18:01, OpenInx  wrote:
> 
>> Hi
>> I met the same problem before. After some digging,  I find that the idea
>> will detect the JDK version
>> and choose whether to use the jdk11 option to run the flink maven building.
>> if you are in jdk11 env,  then
>> it will add the option --add-exports when maven building in IDEA.
>> 
>> For my case,  I was in IntelliJIdea2019.2 which depends on the jdk11, and
>> once I re-import the flink
>> modules then the IDEA will add the --add-exports flag even if  I removed
>> all the flags in .idea/compile.xml
>> explicitly.  I noticed that the Intellij's JDK affected the flink maven
>> building, so I turned to use the Intellij with JDK8
>> bundled,  then the problem was gone.
>> 
>> You can verify it, and if  it's really the same. can just replace your IDEA
>> with the pkg suffix with "with bundled JBR 8" in
>> here [1].
>> Say if you are using MacOS, then should download the package "2019.2.4 for
>> macOS with bundled JBR 8 (dmg)"
>> 
>> Hope it works for you
>> Thanks.
>> 
>> [1]. https://www.jetbrains.com/idea/download/other.html
>> 
>> 
>> On Mon, Nov 4, 2019 at 5:44 PM Till Rohrmann  wrote:
>> 
>>> Try to reimport that maven project. This should resolve this issue.
>>> 
>>> Cheers,
>>> Till
>>> 
>>> On Mon, Nov 4, 2019 at 10:34 AM 刘建刚  wrote:
>>> 
> >>>>  Hi, I am using flink 1.9 in idea. But when I run a unit test in
> >>>> idea, it reports the following error: "Error:java: 无效的标记 (invalid
> >>>> flag): --add-exports=java.base/sun.net.util=ALL-UNNAMED".
> >>>>  Everything is ok when I use flink 1.6. I am using jdk 1.8. Is it
> >>>> related to the java version?
>>>> 
>>> 
>> 



Re: How to estimate the memory size of flink state

2019-11-20 Thread
  Thank you. Your suggestion is good and I benefit a lot from it. For my
case, I want to know the state memory size for other reasons.
  When the gc pressure gets high, I need to limit the source or discard
some data from the source to keep the job running. If the state size is big,
I need to discard data; if the state size is not big, I need to limit the
source. The state size shows the resident memory. For event time, discarding
data can reduce memory usage.
  Could you please give me some suggestions?

> 在 2019年11月20日,下午3:16,sysukelee  写道:
> 
> Hi Liu,
> We monitor the jvm used/max heap memory to determine whether to rescale the 
> job.
> To avoid problems caused by oom, you don't need to know exactly how much
> memory is used by state.
> Focusing on jvm memory use is more reasonable.
>  
> On 11/20/2019 15:08, 刘建刚 wrote:
> We are using flink 1.6.2. For filesystem backend, we want to monitor
> the state size in memory. Once the state size becomes bigger, we can get
> noticed and take measures such as rescaling the job, or the job may fail
> because of the memory.
> We have tried to get the memory usage for the jvm, like gc throughput.
> For our case, state can vary greatly at the peak. So maybe I can refer to
> the state memory size.
> I checked the metrics and code, but didn't find any information about
> the state memory size. I can get the checkpoint size, but they are
> serialized result that can not reflect the running state in memory.  Can
> anyone give me some suggestions? Thank you very much.
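
For reference, a minimal sketch of the jvm-side check suggested above,
reading used/max heap via the standard MemoryMXBean; the class name and the
0.8 threshold are illustrative assumptions:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public final class HeapMonitor {

    // Ratio of used heap to max heap, in [0, 1].
    public static double heapUsageRatio() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return (double) heap.getUsed() / heap.getMax();
    }

    // A simple pressure signal that could drive rescaling decisions.
    public static boolean underMemoryPressure() {
        return heapUsageRatio() > 0.8;
    }
}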



How to estimate the memory size of flink state

2019-11-19 Thread
  We are using flink 1.6.2. For the filesystem backend, we want to monitor
the state size in memory. Once the state size becomes big, we can get
notified and take measures such as rescaling the job; otherwise the job may
fail because of the memory.
  We have tried to get the memory usage for the jvm, like gc throughput.
For our case, state can vary greatly at the peak. So maybe I can refer to
the state memory size.
  I checked the metrics and code, but didn't find any information about
the state memory size. I can get the checkpoint size, but that is a
serialized result that cannot reflect the running state in memory. Can
anyone give me some suggestions? Thank you very much.


How to use two continuously window with EventTime in sql

2019-10-29 Thread
  For one sql window, I can register a table with event time and use the
time field in the tumble window. But if I want to take the result of the
first window and process it with another window, how can I do that? Thank
you.
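
For readers with the same question: one way to chain windows in SQL is the
TUMBLE_ROWTIME auxiliary function, whose result is itself a time attribute.
A minimal sketch, assuming a registered table `clicks` with an event-time
attribute `rowtime` and an existing StreamTableEnvironment `tEnv`; all table
and column names are illustrative:

String firstWindow =
        "SELECT user_id, COUNT(*) AS cnt, " +
        "  TUMBLE_ROWTIME(rowtime, INTERVAL '1' MINUTE) AS w_rowtime " +
        "FROM clicks " +
        "GROUP BY user_id, TUMBLE(rowtime, INTERVAL '1' MINUTE)";
tEnv.registerTable("first_window", tEnv.sqlQuery(firstWindow));

// TUMBLE_ROWTIME returns a rowtime attribute, so the second window can
// group on it directly.
Table result = tEnv.sqlQuery(
        "SELECT user_id, SUM(cnt) AS total " +
        "FROM first_window " +
        "GROUP BY user_id, TUMBLE(w_rowtime, INTERVAL '1' HOUR)");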


Uncertain result when using group by in stream sql

2019-09-13 Thread
  I use flink stream sql to write a demo about "group by". The records
are [(bj, 1), (bj, 3), (bj, 5)]. I group by the first element and sum the
second element.
  Every time I run the program, the result is different. It seems that
the records are out of order, and sometimes a record is even lost. I am
confused about that.
  The code is as below:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class Test {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv =
                StreamTableEnvironment.getTableEnvironment(env);

        DataStream<Tuple2<String, Long>> dataStream = env.fromElements(
                Tuple2.of("bj", 1L),
                Tuple2.of("bj", 3L),
                Tuple2.of("bj", 5L));
        tEnv.registerDataStream("person", dataStream);

        String sql = "select f0, sum(f1) from person group by f0";
        Table table = tEnv.sqlQuery(sql);
        tEnv.toRetractStream(table, Row.class).print();

        env.execute();
    }
}

  The results may be as below:
1> (true,bj,1)
1> (false,bj,1)
1> (true,bj,4)
1> (false,bj,4)
1> (true,bj,9)

1> (true,bj,5)
1> (false,bj,5)
1> (true,bj,8)
1> (false,bj,8)
1> (true,bj,9)


How to implement grouping set in stream

2019-09-10 Thread
  I want to implement grouping sets in streaming. I am new to flink sql. I
want to find an example that teaches me how to define my own rule and
implement the corresponding operator. Can anyone give me any suggestions?


How to calculate one day's uv every minute by SQL

2019-09-04 Thread
  We want to calculate one day's uv and show the result every minute.
We have implemented this in java code:

  dataStream.keyBy(dimension)
.incrementWindow(Time.days(1), Time.minutes(1))
.uv(userId)

   The input data is big, so we use ValueState to store all the
distinct userIds from 00:00:00 to the last minute. For the current minute, we
union the minute's data with the ValueState to obtain a new
ValueState and output the current uv.
   The problem is how to translate the java code to sql. We expect the
sql to be like this:

   select incrementWindow_end, dimension, distinct(userId) from table
group by incrementWindow(Time.days(1), Time.minutes(1)), dimension

  Anyone can give me some suggestions? Thank you very much.
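
One workaround expressible in plain Flink SQL, sketched under assumptions (a
hypothetical table "visits" with fields ts, dimension, userId; support for
DATE_FORMAT and streaming COUNT(DISTINCT) depends on the Flink version and
planner): an unbounded GROUP BY on the day re-emits an updated UV on every
input record as a retract stream, which is more frequent than once per minute
rather than exactly per-minute.

// Continuous (unbounded) aggregation: the per-day UV is re-emitted on a
// retract stream as records arrive. 'tEnv' is a StreamTableEnvironment
// as in the Test class above.
String sql =
        "SELECT DATE_FORMAT(ts, 'yyyy-MM-dd') AS dt, dimension, " +
        "       COUNT(DISTINCT userId) AS uv " +
        "FROM visits " +
        "GROUP BY DATE_FORMAT(ts, 'yyyy-MM-dd'), dimension";
Table uv = tEnv.sqlQuery(sql);
tEnv.toRetractStream(uv, Row.class).print();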


How to load udf jars in flink program

2019-08-15 Thread
  We use per-job mode and load a UDF jar when starting the job. Our jar file
lives on a separate path, not in Flink's lib directory. In the main function
we use a classloader to load the jar file from that path, but the job reports
the following error once it starts running.
  If the jar file is in lib, everything is OK, but our UDF jars are managed
under a dedicated path. How can I load UDF jars in a Flink program given only
the jar path?

org.apache.flink.api.common.InvalidProgramException: Table program cannot be compiled. This is a bug. Please file an issue.
    at org.apache.flink.table.codegen.Compiler$class.compile(Compiler.scala:36)
    at org.apache.flink.table.runtime.CRowProcessRunner.compile(CRowProcessRunner.scala:35)
    at org.apache.flink.table.runtime.CRowProcessRunner.open(CRowProcessRunner.scala:49)
    at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
    at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102)
    at org.apache.flink.streaming.api.operators.ProcessOperator.open(ProcessOperator.java:56)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:424)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:290)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:723)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.codehaus.commons.compiler.CompileException: Line 5, Column 1: Cannot determine simple type name "com"
    at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11877)
    at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6758)
    at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6519)
    at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532)
    at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532)
    at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532)
    at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532)
    at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6498)
    at org.codehaus.janino.UnitCompiler.access$14000(UnitCompiler.java:218)
    at org.codehaus.janino.UnitCompiler$22$1.visitReferenceType(UnitCompiler.java:6405)
    at org.codehaus.janino.UnitCompiler$22$1.visitReferenceType(UnitCompiler.java:6400)
    at org.codehaus.janino.Java$ReferenceType.accept(Java.java:3983)
    at org.codehaus.janino.UnitCompiler$22.visitType(UnitCompiler.java:6400)
    at org.codehaus.janino.UnitCompiler$22.visitType(UnitCompiler.java:6393)
    at org.codehaus.janino.Java$ReferenceType.accept(Java.java:3982)
    at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:6393)
    at org.codehaus.janino.UnitCompiler.access$1300(UnitCompiler.java:218)
    at org.codehaus.janino.UnitCompiler$25.getType(UnitCompiler.java:8206)
    at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6798)
    at org.codehaus.janino.UnitCompiler.access$14500(UnitCompiler.java:218)
    at org.codehaus.janino.UnitCompiler$22$2$1.visitFieldAccess(UnitCompiler.java:6423)
    at org.codehaus.janino.UnitCompiler$22$2$1.visitFieldAccess(UnitCompiler.java:6418)
    at org.codehaus.janino.Java$FieldAccess.accept(Java.java:4365)
    at org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6418)
    at org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6414)
    at org.codehaus.janino.Java$Lvalue.accept(Java.java:4203)
    at org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6414)
    at org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6393)
    at org.codehaus.janino.Java$Rvalue.accept(Java.java:4171)
    at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:6393)
    at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6780)
    at org.codehaus.janino.UnitCompiler.access$14300(UnitCompiler.java:218)
    at org.codehaus.janino.UnitCompiler$22$2$1.visitAmbiguousName(UnitCompiler.java:6421)
    at org.codehaus.janino.UnitCompiler$22$2$1.visitAmbiguousName(UnitCompiler.java:6418)
    at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4279)
    at org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6418)
    at org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6414)
    at org.codehaus.janino.Java$Lvalue.accept(Java.java:4203)
    at org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6414)
    at org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6393)
    at org.codehaus.janino.Java$Rvalue.accept(Java.java:4171)
    at
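
On the question itself, a hedged sketch of the usual approach: load and
register the UDF class in main(), and additionally ship the jar to the
cluster (for example with flink run -C file:///...) so that task managers can
resolve the class when the generated code is compiled. The path and class
name below are hypothetical placeholders. The "Cannot determine simple type
name" error from Janino typically indicates that the generated code could not
see the UDF class at runtime, which is why shipping the jar matters, not just
loading it in the client.

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import org.apache.flink.table.functions.ScalarFunction;

// Inside main(), with a StreamTableEnvironment 'tEnv' already created.
URL udfJar = new File("/data/udf/my-udf.jar").toURI().toURL(); // hypothetical path
URLClassLoader loader = new URLClassLoader(
        new URL[]{udfJar}, Thread.currentThread().getContextClassLoader());

Class<?> udfClass = Class.forName("com.example.MyUdf", true, loader); // hypothetical class
ScalarFunction udf = (ScalarFunction) udfClass.newInstance();
tEnv.registerFunction("myUdf", udf);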

How to convert protobuf to Row

2019-05-06 Thread
  I read byte data from Kafka and use a class ProtoSchema that implements
DeserializationSchema to get the actual Java class. My question is: how can I
transform the byte data into a Row using only ProtoSchema? And what if the
data structure is nested? Thank you.
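
A minimal sketch of the usual descriptor-based conversion, assuming the
generated protobuf classes are on the classpath; the recursion covers
singular nested messages, while repeated and map fields would need extra
cases.

import java.util.List;
import com.google.protobuf.Descriptors.FieldDescriptor;
import com.google.protobuf.Message;
import org.apache.flink.types.Row;

public class ProtoToRow {
    // Convert a protobuf message into a Flink Row, field by field.
    public static Row toRow(Message message) {
        List<FieldDescriptor> fields = message.getDescriptorForType().getFields();
        Row row = new Row(fields.size());
        int i = 0;
        for (FieldDescriptor fd : fields) {
            Object value = message.getField(fd);
            if (fd.getJavaType() == FieldDescriptor.JavaType.MESSAGE && !fd.isRepeated()) {
                value = toRow((Message) value); // nested message -> nested Row
            }
            row.setField(i++, value);
        }
        return row;
    }
}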


Containers are not released after job failed

2019-04-26 Thread
  I run Flink 1.6.2 on YARN. At some point the job failed because of:
org.apache.flink.util.FlinkException: The assigned slot
container_e708_1555051789618_2644286_01_61_0 was removed

  Then the job restarted, but even after some time the container
container_e708_1555051789618_2644286_01_61 was still not released.

  The log of container_e708_1555051789618_2644286_01_61 is as follows:
(screenshot of the container log omitted in the archive)

  The log shows that two tasks were canceled before successful registration
at the resource manager, and one was canceled after registration. After five
minutes, the container registered again. In the end, the container stays
alive but unused.
  Does anyone have any idea about this problem? Thank you.


One source is much slower than the other when joining history data

2019-02-26 Thread
  When consuming historical data in a join operator with event time, reading
from one source is much slower than from the other. As a result, the join
operator buffers a lot of data from the faster source while waiting for the
slower one.
  The question is: how can I keep the difference between the consumers'
speeds small?


request for access

2019-02-13 Thread
Hi guys,

I want to contribute to Apache Flink.
Would you please grant me contributor permissions?
My JIRA username is Jiangang. My JIRA full name is Liu.