Re: [DISCUSS] FLIP-208: Update KafkaSource to detect EOF based on de-serialized record
Hi Qingsheng, Thanks for the comment! After double checking the solution of passing recordEvaluator via env.fromSource(...), I realized that this approach does not automatically bring the feature for all sources. We would still need to update the implementation of every source in order to pass recordEvaluator from DataStreamSource to SourceReaderBase. Could you check whether this is the case? Now we have two options under discussion. One option is to pass recordEvaluator via XXXSourceBuilder for each connector. And the other option is to pass recordEvaluator via env.fromSource(...). Suppose the amount of the implementation overhead is similar (e.g. we need to change implementation of every source), it seems that the first option is slightly better. This is because most existing parameters of a source, such as boundedness and deserializers, are passed to XXXSourceBuilder directly. It might be better to follow the existing paradigm. What do you think? Thanks, Dong On Wed, Jan 5, 2022 at 5:33 PM Qingsheng Ren wrote: > Hi Dong, > > Thanks for making this FLIP. I share the same concern with Martijn. This > looks like a feature that could be shared across all sources so I think > it’ll be great to make it a general one. > > Instead of passing the RecordEvaluator to SourceReaderBase, what about > embedding the evaluator into SourceOperator? We can create a wrapper > SourceOutput in SourceOperator and intercept records emitted by > SourceReader. This could make this feature individual from implementation > of SourceReader so it's applicable for all sources. The API to users looks > like: > > env.fromSource(source, watermarkStrategy, name, recordEvaluator) > > or > > env.fromSource(…).withRecordEvaluator(evaluator) > > What do you think? > > Best regards, > > Qingsheng > > > On Jan 4, 2022, at 3:31 PM, Martijn Visser > wrote: > > > > Hi Dong, > > > > Thanks for writing the FLIP. 
It focusses only on the KafkaSource, but I > > would expect that if such a functionality is desired, it should be made > > available for all unbounded sources (for example, Pulsar and Kinesis). If > > it's only available for Kafka, I see it as if we're increasing feature > > sparsity while we actually want to decrease that. What do you think? > > > > Best regards, > > > > Martijn > > > > On Tue, 4 Jan 2022 at 08:04, Dong Lin wrote: > > > >> Hi all, > >> > >> We created FLIP-208: Update KafkaSource to detect EOF based on > >> de-serialized records. Please find the KIP wiki in the link > >> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-208%3A+Update+KafkaSource+to+detect+EOF+based+on+de-serialized+records > >> . > >> > >> This FLIP aims to address the use-case where users need to stop a Flink > job > >> gracefully based on the content of de-serialized records observed in the > >> KafkaSource. This feature is needed by users who currently depend on > >> KafkaDeserializationSchema::isEndOfStream() to migrate their Flink job > from > >> FlinkKafkaConsumer to KafkaSource. > >> > >> Could you help review this FLIP when you get time? Your comments are > >> appreciated! > >> > >> Cheers, > >> Dong > >> > >
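The interception idea proposed above can be sketched with a small, self-contained toy. Note this is purely illustrative: the names `RecordEvaluator` and `EvaluatingOutput` are assumptions standing in for whatever the FLIP finally defines, and the class below simulates a `SourceOutput` wrapper rather than using any actual Flink API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the SourceOperator-level interception proposal:
// wrap the output a SourceReader emits into, and stop forwarding once the
// evaluator flags an end-of-stream record. Not the actual Flink API.
public class EvaluatingOutputSketch {

    /** Decides, per de-serialized record, whether the stream should end. */
    interface RecordEvaluator<T> extends Predicate<T> {}

    /** A minimal stand-in for a wrapping SourceOutput. */
    static class EvaluatingOutput<T> {
        private final List<T> downstream = new ArrayList<>();
        private final RecordEvaluator<T> evaluator;
        private boolean endOfStream;

        EvaluatingOutput(RecordEvaluator<T> evaluator) {
            this.evaluator = evaluator;
        }

        void collect(T record) {
            if (endOfStream) {
                return; // drop everything after the EOF record
            }
            if (evaluator.test(record)) {
                endOfStream = true; // the EOF record itself is not emitted
                return;
            }
            downstream.add(record);
        }

        boolean isEndOfStream() { return endOfStream; }
        List<T> emitted() { return downstream; }
    }

    public static void main(String[] args) {
        EvaluatingOutput<String> out = new EvaluatingOutput<>(r -> r.equals("EOF"));
        for (String r : new String[] {"a", "b", "EOF", "c"}) {
            out.collect(r);
        }
        System.out.println(out.emitted());       // [a, b]
        System.out.println(out.isEndOfStream()); // true
    }
}
```

Because the wrapper sits between the reader and the downstream output, this variant would indeed work for any source without touching each `SourceReader` implementation — which is the trade-off being weighed against the per-connector builder approach.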
Re: [VOTE] Create a separate sub project for FLIP-188: flink-store
+1 for the separate repository under the Flink umbrella as we've already started creating more repositories with connectors, would it be possible to re-use the same build infrastructure for this one? (eg. shared set of Gradle plugins that unify the build experience)? Best, D. On Fri, Jan 7, 2022 at 11:31 AM Jingsong Li wrote: > For more references on `store` and `storage`: > > For example, > > Rocksdb is a library that provides an embeddable, persistent key-value > store for fast storage. [1] > > Apache HBase [1] is an open-source, distributed, versioned, > column-oriented store modeled after Google' Bigtable. [2] > > [1] https://github.com/facebook/rocksdb > [2] https://github.com/apache/hbase > > Best, > Jingsong > > On Fri, Jan 7, 2022 at 6:17 PM Jingsong Li wrote: > > > > Thanks all, > > > > Combining everyone's comments, I recommend using `flink-table-store`: > > > > ## table > > something to do with table storage (From Till). Not only flink-table, > > but also for user-oriented tables. > > > > ## store vs storage > > - The first point I think, store is better pronounced, storage is > > three syllables while store is two syllables > > - Yes, store also stands for shopping. But I think the English > > polysemy is also quite interesting, a store to store various items, it > > also feels interesting to represent the feeling that we want to do > > data storage. > > - The first feeling is, storage is a physical object or abstract > > concept, store is a software application or entity > > > > So I prefer `flink-table-store`, what do you think? > > > > (@_@ Naming is too difficult) > > > > Best, > > Jingsong > > > > On Fri, Jan 7, 2022 at 5:37 PM Konstantin Knauf > wrote: > > > > > > +1 to a separate repository assuming this repository will still be > part of > > > Apache Flink (same PMC, Committers). I am not aware we have something > like > > > "sub-projects" officially. > > > > > > I share Till and Timo's concerns regarding "store". 
> > > > > > On Fri, Jan 7, 2022 at 9:59 AM Till Rohrmann > wrote: > > > > > > > +1 for the separate project. > > > > > > > > I would agree that flink-store is not the best name. flink-storage > > > > > flink-store but I would even more prefer a name that conveys that it > has > > > > something to do with table storage. > > > > > > > > Cheers, > > > > Till > > > > > > > > On Fri, Jan 7, 2022 at 9:14 AM Timo Walther > wrote: > > > > > > > > > +1 for the separate project > > > > > > > > > > But maybe use `flink-storage` instead of `flink-store`? > > > > > > > > > > I'm not a native speaker but store is defined as "A place where > items > > > > > may be purchased.". It almost sounds like the `flink-packages` > project. > > > > > > > > > > Regards, > > > > > Timo > > > > > > > > > > > > > > > On 07.01.22 08:37, Jingsong Li wrote: > > > > > > Hi everyone, > > > > > > > > > > > > I'd like to start a vote for create a separate sub project for > > > > > > FLIP-188 [1]: `flink-store`. > > > > > > > > > > > > - If you agree with the name `flink-store`, please just +1 > > > > > > - If you have a better suggestion, please write your suggestion, > > > > > > followed by a reply that can +1 to the name that has appeared > > > > > > - If you do not want it to be a subproject of flink, just -1 > > > > > > > > > > > > The vote will be open for at least 72 hours unless there is an > > > > > > objection or not enough votes. > > > > > > > > > > > > [1] > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-188%3A+Introduce+Built-in+Dynamic+Table+Storage > > > > > > > > > > > > Best, > > > > > > Jingsong > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Konstantin Knauf > > > > > > https://twitter.com/snntrable > > > > > > https://github.com/knaufk > > > > > > > > -- > > Best, Jingsong Lee > > > > -- > Best, Jingsong Lee >
Re: [DISCUSS] FLIP-208: Update KafkaSource to detect EOF based on de-serialized record
Hi Martijn, Thanks for the comments! In general I agree we should avoid feature sparsity. In this particular case, connectors are a bit different than most other features in Flink. AFAIK, we plan to move connectors (including Kafka and Pulsar) out of the Flink project in the future, which means that the development of these connectors will be mostly de-centralized (outside of Flink) and be up to their respective maintainers. While I agree that we should provide API/infrastructure in Flink (as this FLIP does) to support feature consistency across connectors, I am not sure we should own the responsibility to actually update all connectors to achieve feature consistency, given that we don't plan to do it in Flink anyway due to its heavy burden. With that being said, I am happy to follow the community guideline if we decide that connector-related FLIP should update every connector's API to ensure feature consistency (to a reasonable extent). For example, in this particular case, it looks like the EOF-detection feature can be applied to every connector (including bounded sources). Is it still sufficient to just update Kafka, Pulsar and Kinesis? Thanks, Dong On Tue, Jan 4, 2022 at 3:31 PM Martijn Visser wrote: > Hi Dong, > > Thanks for writing the FLIP. It focusses only on the KafkaSource, but I > would expect that if such a functionality is desired, it should be made > available for all unbounded sources (for example, Pulsar and Kinesis). If > it's only available for Kafka, I see it as if we're increasing feature > sparsity while we actually want to decrease that. What do you think? > > Best regards, > > Martijn > > On Tue, 4 Jan 2022 at 08:04, Dong Lin wrote: > > > Hi all, > > > > We created FLIP-208: Update KafkaSource to detect EOF based on > > de-serialized records. Please find the KIP wiki in the link > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-208%3A+Update+KafkaSource+to+detect+EOF+based+on+de-serialized+records > > . 
> > > > This FLIP aims to address the use-case where users need to stop a Flink > job > > gracefully based on the content of de-serialized records observed in the > > KafkaSource. This feature is needed by users who currently depend on > > KafkaDeserializationSchema::isEndOfStream() to migrate their Flink job > from > > FlinkKafkaConsumer to KafkaSource. > > > > Could you help review this FLIP when you get time? Your comments are > > appreciated! > > > > Cheers, > > Dong > > >
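The migration scenario the FLIP targets can be illustrated with a self-contained toy: the stop condition that used to live in `KafkaDeserializationSchema::isEndOfStream` becomes a standalone predicate over de-serialized records. The `Event` type, `consumeUntilEof` loop, and all names below are hypothetical scaffolding, not Flink API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative only: a content-based stop condition expressed as a plain
// predicate, decoupled from deserialization. The consume loop simulates
// what the source runtime would do with such an evaluator.
public class EofMigrationSketch {
    record Event(long id, boolean endMarker) {}

    // Previously this logic would have lived inside isEndOfStream();
    // as a standalone evaluator it no longer ties EOF detection to the
    // deserialization schema.
    static final Predicate<Event> STOP_AT_MARKER = Event::endMarker;

    static List<Event> consumeUntilEof(List<Event> input, Predicate<Event> eof) {
        List<Event> out = new ArrayList<>();
        for (Event e : input) {
            if (eof.test(e)) break; // graceful stop; the marker is not emitted
            out.add(e);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> in = List.of(new Event(1, false), new Event(2, false),
                new Event(3, true), new Event(4, false));
        System.out.println(consumeUntilEof(in, STOP_AT_MARKER).size()); // 2
    }
}
```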
[jira] [Created] (FLINK-25578) Graduate Sink V1 interfaces to PublicEvolving and deprecate them
Fabian Paul created FLINK-25578: --- Summary: Graduate Sink V1 interfaces to PublicEvolving and deprecate them Key: FLINK-25578 URL: https://issues.apache.org/jira/browse/FLINK-25578 Project: Flink Issue Type: Sub-task Components: API / Core Affects Versions: 1.15.0 Reporter: Fabian Paul In the discussion of FLIP-191 we decided not to break existing sinks and to give them time to migrate to Sink V2. We do not plan to delete Sink V1 in the near future and will continue to support it, but we won't develop new features for it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25577) Update GCS documentation not to use no longer supported flink-shaded-hadoop artifacts
David Morávek created FLINK-25577: - Summary: Update GCS documentation not to use no longer supported flink-shaded-hadoop artifacts Key: FLINK-25577 URL: https://issues.apache.org/jira/browse/FLINK-25577 Project: Flink Issue Type: Technical Debt Components: Connectors / FileSystem Reporter: David Morávek Fix For: 1.15.0 The GCS documentation refers to artifacts that are no longer supported. Also, Hadoop version 2.8.3 will no longer be supported as of 1.15. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25576) Update com.h2database:h2 to 2.0.206
Martijn Visser created FLINK-25576: -- Summary: Update com.h2database:h2 to 2.0.206 Key: FLINK-25576 URL: https://issues.apache.org/jira/browse/FLINK-25576 Project: Flink Issue Type: Technical Debt Components: Connectors / JDBC Reporter: Martijn Visser Assignee: Martijn Visser Flink uses com.h2database:h2 version 1.4.200; we should update this to 2.0.206. -- This message was sent by Atlassian Jira (v8.20.1#820001)
Re: [ANNOUNCE] Apache Flink ML 2.0.0 released
Great job! <3 Thanks Dong and Yun for managing the release and big thanks to everyone who has contributed! Best, D. On Fri, Jan 7, 2022 at 2:27 PM Yun Gao wrote: > The Apache Flink community is very happy to announce the release of Apache > Flink ML 2.0.0. > > > > Apache Flink ML provides API and infrastructure that simplifies > implementing distributed ML algorithms, > > and it also provides a library of off-the-shelf ML algorithms. > > > > Please check out the release blog post for an overview of the release: > > https://flink.apache.org/news/2022/01/07/release-ml-2.0.0.html > > > > The release is available for download at: > > https://flink.apache.org/downloads.html > > > > Maven artifacts for Flink ML can be found at: > > https://search.maven.org/search?q=g:org.apache.flink%20ml > > > > Python SDK for Flink ML published to the PyPI index can be found at: > > https://pypi.org/project/apache-flink-ml/ > > > > The full release notes are available in Jira: > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351079 > > > > We would like to thank all contributors of the Apache Flink community who > made this release possible! > > > > Regards, > > Dong and Yun >
[ANNOUNCE] Apache Flink ML 2.0.0 released
The Apache Flink community is very happy to announce the release of Apache Flink ML 2.0.0. Apache Flink ML provides API and infrastructure that simplifies implementing distributed ML algorithms, and it also provides a library of off-the-shelf ML algorithms. Please check out the release blog post for an overview of the release: https://flink.apache.org/news/2022/01/07/release-ml-2.0.0.html The release is available for download at: https://flink.apache.org/downloads.html Maven artifacts for Flink ML can be found at: https://search.maven.org/search?q=g:org.apache.flink%20ml Python SDK for Flink ML published to the PyPI index can be found at: https://pypi.org/project/apache-flink-ml/ The full release notes are available in Jira: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351079 We would like to thank all contributors of the Apache Flink community who made this release possible! Regards, Dong and Yun
[jira] [Created] (FLINK-25575) Implement StreamGraph translation for Sink V2 interfaces
Fabian Paul created FLINK-25575: --- Summary: Implement StreamGraph translation for Sink V2 interfaces Key: FLINK-25575 URL: https://issues.apache.org/jira/browse/FLINK-25575 Project: Flink Issue Type: Sub-task Components: API / Core, API / DataStream Affects Versions: 1.15.0 Reporter: Fabian Paul This task covers the translation from the user-defined interfaces and the extension topologies to the actual operators. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25574) Update Async Sink to use decomposed interfaces
Fabian Paul created FLINK-25574: --- Summary: Update Async Sink to use decomposed interfaces Key: FLINK-25574 URL: https://issues.apache.org/jira/browse/FLINK-25574 Project: Flink Issue Type: Sub-task Components: Connectors / Common Affects Versions: 1.15.0 Reporter: Fabian Paul -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25573) Update Kafka Sink to use decomposed interfaces
Fabian Paul created FLINK-25573: --- Summary: Update Kafka Sink to use decomposed interfaces Key: FLINK-25573 URL: https://issues.apache.org/jira/browse/FLINK-25573 Project: Flink Issue Type: Sub-task Components: Connectors / Kafka Affects Versions: 1.15.0 Reporter: Fabian Paul -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25572) Update File Sink to use decomposed interfaces
Fabian Paul created FLINK-25572: --- Summary: Update File Sink to use decomposed interfaces Key: FLINK-25572 URL: https://issues.apache.org/jira/browse/FLINK-25572 Project: Flink Issue Type: Sub-task Components: Connectors / FileSystem Affects Versions: 1.15.0 Reporter: Fabian Paul -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25571) Update Elasticsearch Sink to use decomposed interfaces
Fabian Paul created FLINK-25571: --- Summary: Update Elasticsearch Sink to use decomposed interfaces Key: FLINK-25571 URL: https://issues.apache.org/jira/browse/FLINK-25571 Project: Flink Issue Type: Sub-task Components: Connectors / ElasticSearch Affects Versions: 1.15.0 Reporter: Fabian Paul -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25570) Introduce Sink V2 extension APIs
Fabian Paul created FLINK-25570: --- Summary: Introduce Sink V2 extension APIs Key: FLINK-25570 URL: https://issues.apache.org/jira/browse/FLINK-25570 Project: Flink Issue Type: Sub-task Components: API / Core, Connectors / Common Affects Versions: 1.15.0 Reporter: Fabian Paul Fix For: 1.15.0 This task introduces the interfaces needed to implement the custom operations before/after the writer and committer. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25569) Introduce decomposed Sink V2 interfaces
Fabian Paul created FLINK-25569: --- Summary: Introduce decomposed Sink V2 interfaces Key: FLINK-25569 URL: https://issues.apache.org/jira/browse/FLINK-25569 Project: Flink Issue Type: Sub-task Components: Connectors / Common Affects Versions: 1.15.0 Reporter: Fabian Paul Fix For: 1.15.0 This task introduces the interfaces described in [1] without the datastream extension hooks. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-191%3A+Extend+unified+Sink+interface+to+support+small+file+compaction -- This message was sent by Atlassian Jira (v8.20.1#820001)
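The "decomposed" idea behind these Sink V2 sub-tasks — splitting the monolithic sink into a write path and a commit path — can be illustrated with a self-contained toy. The interface and class names below are illustrative only; they are not the actual `org.apache.flink.api.connector.sink2` API defined in FLIP-191.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of decomposition: writer and committer are separate
// interfaces, so a stateless sink can implement only the writer, while a
// transactional sink adds a committer. Names are hypothetical.
public class DecomposedSinkSketch {
    interface Writer<T, C> {
        void write(T element);
        List<C> prepareCommit(); // committables handed to the committer
    }

    interface Committer<C> {
        void commit(List<C> committables);
    }

    /** In-memory example: buffers writes, then hands them over as one batch. */
    static class BufferingWriter implements Writer<String, List<String>> {
        private final List<String> buffer = new ArrayList<>();
        public void write(String element) { buffer.add(element); }
        public List<List<String>> prepareCommit() {
            List<List<String>> committables = List.of(new ArrayList<>(buffer));
            buffer.clear();
            return committables;
        }
    }

    /** In-memory committer: "commits" by collecting the batches. */
    static class CollectingCommitter implements Committer<List<String>> {
        final List<String> committed = new ArrayList<>();
        public void commit(List<List<String>> committables) {
            committables.forEach(committed::addAll);
        }
    }

    public static void main(String[] args) {
        BufferingWriter w = new BufferingWriter();
        CollectingCommitter c = new CollectingCommitter();
        w.write("a");
        w.write("b");
        c.commit(w.prepareCommit());
        System.out.println(c.committed); // [a, b]
    }
}
```

The extension hooks mentioned in FLINK-25570 would then wrap custom topologies before the writer and after the committer, without changing this core split.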
[jira] [Created] (FLINK-25568) Add Elasticsearch 7 Source Connector
Alexander Preuss created FLINK-25568: Summary: Add Elasticsearch 7 Source Connector Key: FLINK-25568 URL: https://issues.apache.org/jira/browse/FLINK-25568 Project: Flink Issue Type: New Feature Components: Connectors / ElasticSearch Reporter: Alexander Preuss Assignee: Alexander Preuss We want to support not only a Sink but also a Source for Elasticsearch. As a first step, we want to add a ScanTableSource for Elasticsearch 7. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[RESULT][VOTE] FLIP-191: Extend unified Sink interface to support small file compaction
I am happy to announce that FLIP-191 [1] has been accepted by this vote [2]. There are 5 approving votes, 3 of which are binding: * Martijn Visser (non-binding) * Yun Gao (binding) * Arvid Heise (binding) * Guowei Ma (binding) * Jing Ge (non-binding) There are no disapproving votes. Thanks everyone! Fabian [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-191%3A+Extend+unified+Sink+interface+to+support+small+file+compaction [2] https://lists.apache.org/thread/97kwy315t9r4j02l8v0wotkll4tngb3m
[jira] [Created] (FLINK-25567) Casting of Multisets to Multisets
Sergey Nuyanzin created FLINK-25567: --- Summary: Casting of Multisets to Multisets Key: FLINK-25567 URL: https://issues.apache.org/jira/browse/FLINK-25567 Project: Flink Issue Type: Sub-task Reporter: Sergey Nuyanzin There is a lack of multiset support, which should probably also be fixed. There is a draft commit that could potentially help here: https://github.com/apache/flink/pull/18287/commits/2d1626610d3ad7597336f09ac661e62e86da336b -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25566) Fail to cancel task if disk is bad for java.lang.NoClassDefFoundError
Liu created FLINK-25566: --- Summary: Fail to cancel task if disk is bad for java.lang.NoClassDefFoundError Key: FLINK-25566 URL: https://issues.apache.org/jira/browse/FLINK-25566 Project: Flink Issue Type: Improvement Components: Runtime / Task Reporter: Liu Attachments: image-2022-01-07-19-07-10-968.png, image-2022-01-07-19-08-49-038.png, image-2022-01-07-19-11-39-448.png When we detect a disk error, we restart the job to rescale. However, the related task gets stuck in cancelling due to java.lang.NoClassDefFoundError. !image-2022-01-07-19-08-49-038.png|width=743,height=157! In TaskManagerRunner's onFatalError method, the JVM is not terminated at once, and the process gets stuck on the bad disk. !image-2022-01-07-19-11-39-448.png|width=1085,height=400! In this case, maybe we should terminate the container at once. -- This message was sent by Atlassian Jira (v8.20.1#820001)
Re: [VOTE] Create a separate sub project for FLIP-188: flink-store
For more references on `store` and `storage`: For example, Rocksdb is a library that provides an embeddable, persistent key-value store for fast storage. [1] Apache HBase [1] is an open-source, distributed, versioned, column-oriented store modeled after Google' Bigtable. [2] [1] https://github.com/facebook/rocksdb [2] https://github.com/apache/hbase Best, Jingsong On Fri, Jan 7, 2022 at 6:17 PM Jingsong Li wrote: > > Thanks all, > > Combining everyone's comments, I recommend using `flink-table-store`: > > ## table > something to do with table storage (From Till). Not only flink-table, > but also for user-oriented tables. > > ## store vs storage > - The first point I think, store is better pronounced, storage is > three syllables while store is two syllables > - Yes, store also stands for shopping. But I think the English > polysemy is also quite interesting, a store to store various items, it > also feels interesting to represent the feeling that we want to do > data storage. > - The first feeling is, storage is a physical object or abstract > concept, store is a software application or entity > > So I prefer `flink-table-store`, what do you think? > > (@_@ Naming is too difficult) > > Best, > Jingsong > > On Fri, Jan 7, 2022 at 5:37 PM Konstantin Knauf wrote: > > > > +1 to a separate repository assuming this repository will still be part of > > Apache Flink (same PMC, Committers). I am not aware we have something like > > "sub-projects" officially. > > > > I share Till and Timo's concerns regarding "store". > > > > On Fri, Jan 7, 2022 at 9:59 AM Till Rohrmann wrote: > > > > > +1 for the separate project. > > > > > > I would agree that flink-store is not the best name. flink-storage > > > > flink-store but I would even more prefer a name that conveys that it has > > > something to do with table storage. 
> > > > > > Cheers, > > > Till > > > > > > On Fri, Jan 7, 2022 at 9:14 AM Timo Walther wrote: > > > > > > > +1 for the separate project > > > > > > > > But maybe use `flink-storage` instead of `flink-store`? > > > > > > > > I'm not a native speaker but store is defined as "A place where items > > > > may be purchased.". It almost sounds like the `flink-packages` project. > > > > > > > > Regards, > > > > Timo > > > > > > > > > > > > On 07.01.22 08:37, Jingsong Li wrote: > > > > > Hi everyone, > > > > > > > > > > I'd like to start a vote for create a separate sub project for > > > > > FLIP-188 [1]: `flink-store`. > > > > > > > > > > - If you agree with the name `flink-store`, please just +1 > > > > > - If you have a better suggestion, please write your suggestion, > > > > > followed by a reply that can +1 to the name that has appeared > > > > > - If you do not want it to be a subproject of flink, just -1 > > > > > > > > > > The vote will be open for at least 72 hours unless there is an > > > > > objection or not enough votes. > > > > > > > > > > [1] > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-188%3A+Introduce+Built-in+Dynamic+Table+Storage > > > > > > > > > > Best, > > > > > Jingsong > > > > > > > > > > > > > > > > > > > > > > -- > > > > Konstantin Knauf > > > > https://twitter.com/snntrable > > > > https://github.com/knaufk > > > > -- > Best, Jingsong Lee -- Best, Jingsong Lee
Re: [VOTE] Create a separate sub project for FLIP-188: flink-store
Thanks all, Combining everyone's comments, I recommend using `flink-table-store`: ## table something to do with table storage (From Till). Not only flink-table, but also for user-oriented tables. ## store vs storage - The first point I think, store is better pronounced, storage is three syllables while store is two syllables - Yes, store also stands for shopping. But I think the English polysemy is also quite interesting, a store to store various items, it also feels interesting to represent the feeling that we want to do data storage. - The first feeling is, storage is a physical object or abstract concept, store is a software application or entity So I prefer `flink-table-store`, what do you think? (@_@ Naming is too difficult) Best, Jingsong On Fri, Jan 7, 2022 at 5:37 PM Konstantin Knauf wrote: > > +1 to a separate repository assuming this repository will still be part of > Apache Flink (same PMC, Committers). I am not aware we have something like > "sub-projects" officially. > > I share Till and Timo's concerns regarding "store". > > On Fri, Jan 7, 2022 at 9:59 AM Till Rohrmann wrote: > > > +1 for the separate project. > > > > I would agree that flink-store is not the best name. flink-storage > > > flink-store but I would even more prefer a name that conveys that it has > > something to do with table storage. > > > > Cheers, > > Till > > > > On Fri, Jan 7, 2022 at 9:14 AM Timo Walther wrote: > > > > > +1 for the separate project > > > > > > But maybe use `flink-storage` instead of `flink-store`? > > > > > > I'm not a native speaker but store is defined as "A place where items > > > may be purchased.". It almost sounds like the `flink-packages` project. > > > > > > Regards, > > > Timo > > > > > > > > > On 07.01.22 08:37, Jingsong Li wrote: > > > > Hi everyone, > > > > > > > > I'd like to start a vote for create a separate sub project for > > > > FLIP-188 [1]: `flink-store`. 
> > > > > > > > - If you agree with the name `flink-store`, please just +1 > > > > - If you have a better suggestion, please write your suggestion, > > > > followed by a reply that can +1 to the name that has appeared > > > > - If you do not want it to be a subproject of flink, just -1 > > > > > > > > The vote will be open for at least 72 hours unless there is an > > > > objection or not enough votes. > > > > > > > > [1] > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-188%3A+Introduce+Built-in+Dynamic+Table+Storage > > > > > > > > Best, > > > > Jingsong > > > > > > > > > > > > > > > -- > > Konstantin Knauf > > https://twitter.com/snntrable > > https://github.com/knaufk -- Best, Jingsong Lee
[jira] [Created] (FLINK-25565) Write and Read Parquet INT64 Timestamp
Bo Cui created FLINK-25565: -- Summary: Write and Read Parquet INT64 Timestamp Key: FLINK-25565 URL: https://issues.apache.org/jira/browse/FLINK-25565 Project: Flink Issue Type: New Feature Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) Affects Versions: 1.12.0, 1.15.0 Reporter: Bo Cui Flink cannot read Parquet files that contain INT64 timestamps generated by Spark. -- This message was sent by Atlassian Jira (v8.20.1#820001)
Re: [VOTE] Create a separate sub project for FLIP-188: flink-store
+1 to a separate repository assuming this repository will still be part of Apache Flink (same PMC, Committers). I am not aware we have something like "sub-projects" officially. I share Till and Timo's concerns regarding "store". On Fri, Jan 7, 2022 at 9:59 AM Till Rohrmann wrote: > +1 for the separate project. > > I would agree that flink-store is not the best name. flink-storage > > flink-store but I would even more prefer a name that conveys that it has > something to do with table storage. > > Cheers, > Till > > On Fri, Jan 7, 2022 at 9:14 AM Timo Walther wrote: > > > +1 for the separate project > > > > But maybe use `flink-storage` instead of `flink-store`? > > > > I'm not a native speaker but store is defined as "A place where items > > may be purchased.". It almost sounds like the `flink-packages` project. > > > > Regards, > > Timo > > > > > > On 07.01.22 08:37, Jingsong Li wrote: > > > Hi everyone, > > > > > > I'd like to start a vote for create a separate sub project for > > > FLIP-188 [1]: `flink-store`. > > > > > > - If you agree with the name `flink-store`, please just +1 > > > - If you have a better suggestion, please write your suggestion, > > > followed by a reply that can +1 to the name that has appeared > > > - If you do not want it to be a subproject of flink, just -1 > > > > > > The vote will be open for at least 72 hours unless there is an > > > objection or not enough votes. > > > > > > [1] > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-188%3A+Introduce+Built-in+Dynamic+Table+Storage > > > > > > Best, > > > Jingsong > > > > > > > > -- Konstantin Knauf https://twitter.com/snntrable https://github.com/knaufk
Re: [VOTE] Create a separate sub project for FLIP-188: flink-store
+1 for the separate project. I would agree that flink-store is not the best name. flink-storage > flink-store but I would even more prefer a name that conveys that it has something to do with table storage. Cheers, Till On Fri, Jan 7, 2022 at 9:14 AM Timo Walther wrote: > +1 for the separate project > > But maybe use `flink-storage` instead of `flink-store`? > > I'm not a native speaker but store is defined as "A place where items > may be purchased.". It almost sounds like the `flink-packages` project. > > Regards, > Timo > > > On 07.01.22 08:37, Jingsong Li wrote: > > Hi everyone, > > > > I'd like to start a vote for create a separate sub project for > > FLIP-188 [1]: `flink-store`. > > > > - If you agree with the name `flink-store`, please just +1 > > - If you have a better suggestion, please write your suggestion, > > followed by a reply that can +1 to the name that has appeared > > - If you do not want it to be a subproject of flink, just -1 > > > > The vote will be open for at least 72 hours unless there is an > > objection or not enough votes. > > > > [1] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-188%3A+Introduce+Built-in+Dynamic+Table+Storage > > > > Best, > > Jingsong > > > >
[DISCUSS] FLIP-209: Support to run multiple shuffle plugins in one session cluster
Hi dev, I'd like to start a discussion for FLIP-209 [1], which aims to support running multiple shuffle plugins in one session cluster. Currently, a Flink cluster can only use the single shuffle service plugin configured by 'shuffle-service-factory.class'. This is not flexible enough and cannot support use cases like selecting different shuffle services for different workloads (e.g. batch vs. streaming). This feature has been mentioned several times [2], and FLIP-209 aims to implement it. Please refer to FLIP-209 [1] for more details; any feedback is highly appreciated. Best, Yingjie [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-209%3A+Support+to+run+multiple+shuffle+plugins+in+one+session+cluster?moved=true [2] https://lists.apache.org/thread/k4owttq9q3cq4knoobrzc31bghf7vc0o
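One way to picture the goal of FLIP-209 is a registry that resolves a shuffle factory per job from job-level configuration, falling back to the cluster-wide default. The sketch below is a hypothetical toy for intuition only — the registry, the `name()` method, and the lookup logic are all assumptions; the FLIP itself defines the actual mechanism and configuration keys.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: multiple shuffle factories coexist in one cluster,
// and each job resolves one by name, defaulting to the cluster-wide choice.
// Names here are illustrative, not FLIP-209's actual design.
public class ShuffleRegistrySketch {
    interface ShuffleServiceFactory { String name(); }

    static class Registry {
        private final Map<String, ShuffleServiceFactory> factories = new HashMap<>();
        private final String defaultName;

        Registry(String defaultName) { this.defaultName = defaultName; }

        void register(ShuffleServiceFactory f) { factories.put(f.name(), f); }

        /** Resolve a factory for a job; null means "use the cluster default". */
        ShuffleServiceFactory forJob(String requested) {
            String key = requested != null ? requested : defaultName;
            ShuffleServiceFactory f = factories.get(key);
            if (f == null) {
                throw new IllegalArgumentException("unknown shuffle plugin: " + key);
            }
            return f;
        }
    }

    public static void main(String[] args) {
        Registry r = new Registry("netty");
        r.register(() -> "netty");
        r.register(() -> "remote");
        System.out.println(r.forJob(null).name());     // netty (cluster default)
        System.out.println(r.forJob("remote").name()); // remote (per-job override)
    }
}
```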
Re: [DISCUSS] FLIP-188: Introduce Built-in Dynamic Table Storage
+1 with a separate repo and +1 with the flink-storage name On Fri, Jan 7, 2022 at 8:40 AM Jingsong Li wrote: > Hi everyone, > > Vote for create a separate sub project for FLIP-188 thread is here: > https://lists.apache.org/thread/wzzhr27cvrh6w107bn464m1m1ycfll1z > > Best, > Jingsong > > > On Fri, Jan 7, 2022 at 3:30 PM Jingsong Li wrote: > > > > Hi Timo, > > > > I think we can consider exposing to DataStream users in the future, if > > the API definition is clear after. > > I am fine with `flink-table-store` too. > > But I tend to prefer shorter and clearer name: > > `flink-store`. > > > > I think I can create a separate thread to vote. > > > > Looking forward to your thoughts! > > > > Best, > > Jingsong > > > > > > On Thu, Dec 30, 2021 at 9:48 PM Timo Walther wrote: > > > > > > +1 for a separate repository. And also +1 for finding a good name. > > > > > > `flink-warehouse` would be definitely a good marketing name but I agree > > > that we should not start marketing for code bases. Are we planning to > > > make this storage also available to DataStream API users? If not, I > > > would also vote for `flink-managed-table` or better: > `flink-table-store` > > > > > > Thanks, > > > Timo > > > > > > > > > > > > On 29.12.21 07:58, Jingsong Li wrote: > > > > Thanks Till for your suggestions. > > > > > > > > Personally, I like flink-warehouse, this is what we want to convey to > > > > the user, but it indicates a bit too much scope. > > > > > > > > How about just calling it flink-store? > > > > Simply to convey an impression: this is flink's store project, > > > > providing a built-in store for the flink compute engine, which can be > > > > used by flink-table as well as flink-datastream. 
> > > > > > > > Best, > > > > Jingsong > > > > > > > > On Tue, Dec 28, 2021 at 5:15 PM Till Rohrmann > wrote: > > > >> > > > >> Hi Jingsong, > > > >> > > > >> I think that developing flink-dynamic-storage as a separate sub > project is > > > >> a very good idea since it allows us to move a lot faster and > decouple > > > >> releases from Flink. Hence big +1. > > > >> > > > >> Do we want to name it flink-dynamic-storage or shall we use a more > > > >> descriptive name? dynamic-storage sounds a bit generic to me and I > wouldn't > > > >> know that this has something to do with letting Flink manage your > tables > > > >> and their storage. I don't have a very good idea but maybe we can > call it > > > >> flink-managed-tables, flink-warehouse, flink-olap or so. > > > >> > > > >> Cheers, > > > >> Till > > > >> > > > >> On Tue, Dec 28, 2021 at 9:49 AM Martijn Visser < > mart...@ververica.com> > > > >> wrote: > > > >> > > > >>> Hi Jingsong, > > > >>> > > > >>> That sounds promising! +1 from my side to continue development > under > > > >>> flink-dynamic-storage as a Flink subproject. I think having a more > in-depth > > > >>> interface will benefit everyone. > > > >>> > > > >>> Best regards, > > > >>> > > > >>> Martijn > > > >>> > > > >>> On Tue, 28 Dec 2021 at 04:23, Jingsong Li > wrote: > > > >>> > > > Hi all, > > > > > > After some experimentation, we felt no problem putting the dynamic > > > storage outside of flink, and it also allowed us to design the > > > interface in more depth. > > > > > > What do you think? If there is no problem, I am asking for PMC's > help > > > here: we want to propose flink-dynamic-storage as a flink > subproject, > > > and we want to build the project under apache. > > > > > > Best, > > > Jingsong > > > > > > > > > On Wed, Nov 24, 2021 at 8:10 PM Jingsong Li < > jingsongl...@gmail.com> > > > wrote: > > > > > > > > Hi Stephan, > > > > > > > > Thanks for your reply. > > > > > > > > Data never expires automatically. 
> > > > > > > > If there is a need for data retention, the user can choose one > of the > > > > following options: > > > > - In the SQL for querying the managed table, users filter the > data by > > > themselves > > > > - Define the time partition, and users can delete the expired > > > > partition by themselves. (DROP PARTITION ...) > > > > - In the future version, we will support the "DELETE FROM" > statement, > > > > users can delete the expired data according to the conditions. > > > > > > > > So to answer your question: > > > > > > > >> Will the VMQ send retractions so that the data will be removed > from > > > the table (via compactions)? > > > > > > > > The current implementation is not sending retraction, which I > think > > > > theoretically should be sent, currently the user can filter by > > > > subsequent conditions. > > > > And yes, the subscriber would not see strictly a correct result. > I > > > > think this is something we can improve for Flink SQL. > > > > > > > >> Do we want time retention semantics handled by the compaction? > > > > > > > > Currently, no, Data never exp
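The three retention options listed above can be sketched as Flink SQL against a hypothetical time-partitioned managed table (table name, partition column, and dates are illustrative only; `DELETE FROM` is explicitly a future feature per this thread):

```sql
-- Option 1: filter expired data at query time, in the query itself
SELECT * FROM managed_events WHERE event_day >= '2022-01-01';

-- Option 2: define a time partition and drop expired partitions manually
ALTER TABLE managed_events DROP PARTITION (event_day = '2021-12-01');

-- Option 3 (future version, per the thread): delete by condition
DELETE FROM managed_events WHERE event_day < '2021-12-01';
```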
[jira] [Created] (FLINK-25564) TaskManagerProcessFailureStreamingRecoveryITCase>AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure fails on AZP
Till Rohrmann created FLINK-25564:
----------------------------------

Summary: TaskManagerProcessFailureStreamingRecoveryITCase>AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure fails on AZP
Key: FLINK-25564
URL: https://issues.apache.org/jira/browse/FLINK-25564
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Till Rohrmann
Fix For: 1.15.0

The test {{TaskManagerProcessFailureStreamingRecoveryITCase>AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure}} fails on AZP with

{code}
Jan 07 05:07:22 [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 31.057 s <<< FAILURE! - in org.apache.flink.test.recovery.TaskManagerProcessFailureStreamingRecoveryITCase
Jan 07 05:07:22 [ERROR] org.apache.flink.test.recovery.TaskManagerProcessFailureStreamingRecoveryITCase.testTaskManagerProcessFailure  Time elapsed: 31.012 s  <<< FAILURE!
Jan 07 05:07:22 java.lang.AssertionError: The program encountered a IOExceptionList : /tmp/junit2133275241637829858/junit7793757951823298127
Jan 07 05:07:22     at org.junit.Assert.fail(Assert.java:89)
Jan 07 05:07:22     at org.apache.flink.test.recovery.AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure(AbstractTaskManagerProcessFailureRecoveryTest.java:205)
Jan 07 05:07:22     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Jan 07 05:07:22     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
Jan 07 05:07:22     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Jan 07 05:07:22     at java.lang.reflect.Method.invoke(Method.java:498)
Jan 07 05:07:22     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
Jan 07 05:07:22     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
Jan 07 05:07:22     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
Jan 07 05:07:22     at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
Jan 07 05:07:22     at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
Jan 07 05:07:22     at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
Jan 07 05:07:22     at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
Jan 07 05:07:22     at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
Jan 07 05:07:22     at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
Jan 07 05:07:22     at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
Jan 07 05:07:22     at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
Jan 07 05:07:22     at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
Jan 07 05:07:22     at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
Jan 07 05:07:22     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
Jan 07 05:07:22     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
Jan 07 05:07:22     at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
Jan 07 05:07:22     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
Jan 07 05:07:22     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
Jan 07 05:07:22     at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
Jan 07 05:07:22     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
Jan 07 05:07:22     at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
Jan 07 05:07:22     at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
Jan 07 05:07:22     at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
Jan 07 05:07:22     at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
Jan 07 05:07:22     at org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:42)
Jan 07 05:07:22     at org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:80)
Jan 07 05:07:22     at org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:72)
Jan 07 05:07:22     at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:107)
Jan 07 05:07:22     at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:88)
Jan 07 05:07:22     at org.junit.platform.launcher.core.EngineExecutionOrchestrator.lambda$execute$0(EngineExecutionOrchestrator.java:54)
Jan 07 05:07:22     at org.junit.platform.launcher.c
{code}
[NOTICE] Introduction of flink-table-test-utils
Hi everyone, for the Scala-free planner effort we introduced a new `flink-table-test-utils` module. It is recommended that connectors, formats, and catalogs use this module as a test dependency if `flink-table-common` is not enough. In the near future, we will fill this module with more test utilities that users can also use. Currently, there are already a couple of Asserts available for testing, but those are marked as @Experimental for now. One main difference from directly depending on the flink-table-planner module is that connectors should be tested against the loader and should not rely on planner-internal classes. This will also make it easier to move connectors into a separate repo in the near future. More information here: https://issues.apache.org/jira/browse/FLINK-25228 Regards, Timo
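A sketch of pulling the module into a connector build as a test dependency; the artifactId matches the module name from the announcement, while the version property is a placeholder for whichever Flink release you build against:

```xml
<!-- pom.xml of a connector/format/catalog module -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-test-utils</artifactId>
    <version>${flink.version}</version> <!-- placeholder: your Flink version -->
    <scope>test</scope>
</dependency>
```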
Re: [DISCUSS] FLIP-206: Support PyFlink Runtime Execution in Thread Mode
Hi Till,

I have written a more complicated PyFlink job. Compared with the previous single-Python-UDF job, there is an extra stage of converting between Table and DataStream. Besides, I added a Python map function to the job. Because the Python DataStream API has not yet implemented Thread Mode, the Python map function operator still runs in Process Mode.

```
source = t_env.from_path("source_table")  # schema [id: String, d: int]

@udf(result_type=DataTypes.STRING(), func_type="general")
def upper(x):
    return x.upper()

t_env.create_temporary_system_function("upper", upper)

# python map function
ds = t_env.to_data_stream(source) \
    .map(lambda x: x,
         output_type=Types.ROW_NAMED(["id", "d"], [Types.STRING(), Types.INT()]))
t = t_env.from_data_stream(ds)
t.select('upper(id)').execute_insert('sink_table')
```

The input data size is 1k.

Mode                       | QPS
Process Mode               | 3w
Thread Mode + Process Mode | 4w

From the table, we can see that the nodes running in Process Mode are the performance bottleneck of the job.

Best,
Xingbo

Till Rohrmann wrote on Wed, Jan 5, 2022 at 23:16:
> Thanks for the detailed answer Xingbo. Quick question on the last figure in
> the FLIP. You said that this is a real world Flink stream SQL job. The
> title of the graph says UDF(String Upper). So do I understand correctly
> that string upper is the real world use case you have measured? What I
> wanted to ask is how a slightly more complex Flink Python job (involving
> shuffles, with back pressure, etc.) performs using the thread and process
> mode respectively.
>
> If the mode solely needs changes in the Python part of Flink, then I don't
> have any concerns from the runtime perspective.
>
> Cheers,
> Till
>
> On Wed, Jan 5, 2022 at 1:55 PM Xingbo Huang wrote:
> >
> > Hi Till and Thomas,
> >
> > Thanks a lot for joining the discussion.
> >
> > For Till:
> >
> > >>> Is the slower performance currently the biggest pain point for our
> > Python users? What else are our Python users mainly complaining about?
> > > > PyFlink users are most concerned about two parts, one is better > usability, > > the other is performance. Users often make some benchmarks when they > > investigate pyflink[1][2] at the beginning to decide whether to use > > PyFlink. The performance of a PyFlink job depends on two parts, one is > the > > overhead of the PyFlink framework, and the other is the Python function > > complexity implemented by the user. In the Python ecosystem, there are > many > > libraries and tools that can help Python users improve the performance of > > their custom functions, such as pandas[3], numba[4] and cython[5]. So we > > hope that the framework overhead of PyFlink itself can also be reduced. > > > > >>> Concerning the proposed changes, are there any changes required on > the > > runtime side (changes to Flink)? How will the deployment and memory > > management be affected when using the thread execution mode? > > > > The changes on PyFlink Runtime mentioned here are actually only > > modifications of PyFlink custom Operators, such as > > PythonScalarFunctionOperator[6], which won't affect deployment and memory > > management. > > > > >>> One more question that came to my mind: How much performance > > improvement do we gain on a real-world Python use case? Were the > > measurements more like micro benchmarks where the Python UDF was called w/o > > the overhead of Flink? I would just be curious how much the Python > > component contributes to the overall runtime of a real world job. Do we > > have some data on this? > > > > The last figure I put in the FLIP is the performance comparison of three real > > Flink Stream SQL jobs. They are a Java UDF job, a Python UDF job in > Process > > Mode, and a Python UDF job in Thread Mode. The calculated value of QPS is > > the end-to-end Flink job execution result.
As shown in the performance > > comparison chart, the performance of Python udf with the same function > can > > often only reach 20% of Java udf, so the performance of python udf will > > often become the performance bottleneck in a PyFlink job. > > > > For Thomas: > > > > The first time that I realized the framework overhead of various IPC > > (socket, grpc, shared memory) cannot be ignored in some scenarios is due > to > > an image algorithm prediction job of PyFlink. Its input parameters are a > > series of huge image binary arrays, and its data size is bigger than 1G. > > The performance overhead of serialization/deserialization has become an > > important part of its poor performance. Although this job is a bit > extreme, > > through measurement, we did find the impact of the > > serialization/deserialization overhead caused by larger size parameters > on > > the performance of the IPC framework. > > > > >>> As I understand it, you measured the difference in throughput for > UPPER > > betwee
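The serialization cost Xingbo describes can be illustrated with a self-contained sketch that involves no Flink at all: it compares calling a toy "UDF" in-process (roughly what Thread Mode achieves) against round-tripping its argument through pickle, standing in for the serialize/ship/deserialize step that any Process Mode IPC channel must pay. The payload size and function are made up for illustration only:

```python
import pickle


def upper_udf(payload: bytes) -> bytes:
    """A toy 'UDF' whose own work is trivial next to the data it receives."""
    return payload.upper()


payload = b"x" * (1 << 20)  # 1 MiB stand-in for a large record (e.g. an image)

# Thread Mode analogue: the argument is passed by reference, no copy is made.
direct = upper_udf(payload)

# Process Mode analogue: the argument must be serialized, sent across an IPC
# boundary, and deserialized before the UDF can run -- two full copies of the
# payload before any useful work happens.
wire = pickle.dumps(payload)   # serialize on the sender side
received = pickle.loads(wire)  # deserialize on the Python worker side
roundtrip = upper_udf(received)

assert direct == roundtrip  # same answer, very different per-record cost
```

The larger the per-record payload, the more the fixed ser/de copies dominate, which matches the image-prediction job anecdote above.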
Re: [VOTE] Create a separate sub project for FLIP-188: flink-store
+1 for the separate project. But maybe use `flink-storage` instead of `flink-store`? I'm not a native speaker, but "store" is defined as "a place where items may be purchased". It almost sounds like the `flink-packages` project. Regards, Timo On 07.01.22 08:37, Jingsong Li wrote: Hi everyone, I'd like to start a vote to create a separate sub project for FLIP-188 [1]: `flink-store`. - If you agree with the name `flink-store`, please just +1 - If you have a better suggestion, please write your suggestion, followed by a reply that can +1 the name that has appeared - If you do not want it to be a subproject of Flink, just -1 The vote will be open for at least 72 hours unless there is an objection or not enough votes. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-188%3A+Introduce+Built-in+Dynamic+Table+Storage Best, Jingsong