Re: Enquiry about contributing to Flink
Hi Qing,

Thanks a lot for wanting to contribute your improvements back to Flink. The code contribution process is documented, and I highly recommend reading through it [1].

For the 1st item, I can assign it to you so you can open a PR. The 2nd item also has a Jira [2]. For the 3rd item, I would recommend opening a Jira ticket with a description of the bug/problem and how you want to fix it. Then it can also be assigned to you and a PR can be created.

Let me know if you have any more questions!

Best regards,

Martijn

[1] https://flink.apache.org/contributing/contribute-code.html
[2] https://issues.apache.org/jira/browse/FLINK-14101

On Fri, Jul 1, 2022 at 03:50, Qing Lim wrote:
> Hi Flink Dev,
>
> My company uses flink, and we have some code changes that we'd like to
> contribute to the main fork.
>
> Can I get your advice on what to do here?
>
> The features we wish to contribute are as follow:
>
> 1. JDBC connector filter pushdown, already tracked on JIRA:
> https://issues.apache.org/jira/browse/FLINK-16024
> 2. Support MSSQL dialect in JDBC connector
> 3. A tiny fix on Postgres JDBC Dialect
>
> We are already using these features in Production.
>
> Look forward for your advice, thank you!
[jira] [Created] (FLINK-28340) Support using system conda env for PyFlink tests
Juntao Hu created FLINK-28340:

Summary: Support using system conda env for PyFlink tests
Key: FLINK-28340
URL: https://issues.apache.org/jira/browse/FLINK-28340
Project: Flink
Issue Type: Improvement
Components: API / Python
Affects Versions: 1.15.0
Reporter: Juntao Hu
Fix For: 1.16.0

--
This message was sent by Atlassian Jira (v8.20.10#820010)
RE: Re: Re: [DISCUSS] FLIP-239: Port JDBC Connector Source to FLIP-27
Hi, Lijie.

Thanks for catching that, and sorry for the typo that slipped in while drafting. I have updated it.

Best regards,
Roc Marshal

On 2022/07/01 02:24:56 Lijie Wang wrote:
> Hi Roc, > > Thanks for driving the discussion. > > Could you describe in detail what the JdbcSourceSplit represents? It looks > like something wrong with the comments of JdbcSourceSplit in FLIP(it > describe as "A {@link SourceSplit} that represents a file, or a region of a > file"). > > Best, > Lijie > > > Roc Marshal wrote on Thu, Jun 30, 2022 at 21:41: > > > Hi, Boto. > > Thanks for your reply. > > > >+1 to me on watermark strategy definition in ‘streaming’ & table > > source. I'm not sure if FLIP-202[1] is suitable for a separate discussion, > > but I think your proposal is very helpful to the new source. It would be > > great if the new source could be compatible with this abstraction. > > > >In addition, whether we need to support such a special bounded scenario > > abstraction? > >The number of JdbcSourceSplit is certain, but the time to generate all > > JdbcSourceSplit completely is not certain in the user defined > > implementation. When the condition that the JdbcSourceSplit > > generate-process end is met, the JdbcSourceSplit will not be generated. > > After all JdbcSourceSplit processing is completed, the reader will be > > notified that there are no more JdbcSourceSplit from > > JdbcSourceSplitEnumerator. > > > > - [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-202%3A+Introduce+ClickHouse+Connector > > > > Best regards, > > Roc Marshal > > > > On 2022/06/30 09:02:23 João Boto wrote: > > > Hi, > > > > > > On source we could improve the JdbcParameterValuesProvider.. to be > > defined as a query(s) or something more dynamic. 
> > > The most time if your job is dynamic or have some condition to be met > > (based on data on table) you have to create a connection an get that info > > from database > > > > > > If we are going to create/allow a "streaming" jdbc source, we should be > > able to define watermark and get new data from table using that watermark.. > > > > > > > > > For the sink (but it could apply on source) will be great to be able to > > set your implementation of the connection type.. For example if you are > > connecting to clickhouse, be able to set a implementation based on > > "BalancedClickhouseDataSource" for example (in this[1] implementation we > > have a example) or set a extension version of a implementation for debug > > purpose > > > > > > Regards > > > > > > > > > [1] > > https://github.com/apache/flink/pull/20097/files#diff-8b36e3403381dc14c748aeb5de0b4ceb7d7daec39594b1eacff1694b5266419d > > > > > > On 2022/06/27 13:09:51 Roc Marshal wrote: > > > > Hi, all, > > > > > > > > > > > > > > > > > > > > I would like to open a discussion on porting JDBC Source to new Source > > API (FLIP-27[1]). > > > > > > > > Martijn Visser, Jing Ge and I had a preliminary discussion on the JIRA > > FLINK-25420[2] and planed to start the discussion about the source part > > first. > > > > > > > > > > > > > > > > Please let me know: > > > > > > > > - The issues about old Jdbc source you encountered; > > > > - The new feature or design you want; > > > > - More suggestions from other dimensions... > > > > > > > > > > > > > > > > You could find more details in FLIP-239[3]. > > > > > > > > Looking forward to your feedback. 
> > > > > > > > > > > > > > > > > > > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface > > > > > > > > [2] https://issues.apache.org/jira/browse/FLINK-25420 > > > > > > > > [3] > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386271 > > > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > Roc Marshal > > > >
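Following up on Lijie's point about the copy-pasted comment: unlike a file split, a JDBC split naturally identifies a slice of a table rather than a region of a file. A rough sketch of what such a split could carry (class and field names are illustrative assumptions, not FLIP-239's actual API):

```java
import java.util.Objects;

/**
 * Illustrative sketch only: a split for a JDBC source identifies a slice of
 * a table, e.g. a parameterized query, rather than "a region of a file".
 * Names are hypothetical and not the FLIP-239 design.
 */
final class JdbcSourceSplitSketch {
    private final String splitId;      // mirrors SourceSplit#splitId() in the FLIP-27 API
    private final String sqlTemplate;  // e.g. "SELECT id, name FROM t WHERE id BETWEEN ? AND ?"
    private final Object[] parameters; // e.g. {0L, 9999L} delimiting this split's slice

    JdbcSourceSplitSketch(String splitId, String sqlTemplate, Object[] parameters) {
        this.splitId = Objects.requireNonNull(splitId);
        this.sqlTemplate = Objects.requireNonNull(sqlTemplate);
        this.parameters = parameters.clone();
    }

    String splitId() {
        return splitId;
    }

    String sqlTemplate() {
        return sqlTemplate;
    }

    Object[] parameters() {
        return parameters.clone();
    }
}
```

An enumerator would hand such splits to the readers and eventually signal that no more splits are coming, which matches the bounded scenario Roc describes above.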
[jira] [Created] (FLINK-28339) Introduce SparkCatalog in table store
Jingsong Lee created FLINK-28339:

Summary: Introduce SparkCatalog in table store
Key: FLINK-28339
URL: https://issues.apache.org/jira/browse/FLINK-28339
Project: Flink
Issue Type: New Feature
Components: Table Store
Reporter: Jingsong Lee
Assignee: Jingsong Lee
Fix For: table-store-0.2.0
Re: [VOTE] Apache Flink ML Release 2.1.0, release candidate #2
Thanks for the update!

+1 (non-binding)

Here is what I checked; all required checks are included.

- Verified that the checksums and GPG files match the corresponding release files.
- Verified that the source distributions do not contain any binaries.
- Built the source distribution and ensured that all source files have Apache headers.
- Verified that all POM files point to the same version.
- Browsed through JIRA release notes files and did not find anything unexpected.
- Browsed through README.md files and did not find anything unexpected.
- Checked the source code tag "release-2.1.0-rc2" and did not find anything unexpected.

On Fri, Jul 1, 2022 at 11:11 AM Zhipeng Zhang wrote:
> Hi everyone, > > > Please review and vote on the release candidate #2 for the version 2.1.0 of > Apache Flink ML as follows: > > [ ] +1, Approve the release > > [ ] -1, Do not approve the release (please provide specific comments) > > > **Testing Guideline** > > > You can find here [1] a page in the project wiki on instructions for > testing. > > To cast a vote, it is not necessary to perform all listed checks, but > please > > mention which checks you have performed when voting. 
> > > **Release Overview** > > > As an overview, the release consists of the following: > > a) Flink ML source release to be deployed to dist.apache.org > > b) Flink ML Python source distributions to be deployed to PyPI > > c) Maven artifacts to be deployed to the Maven Central Repository > > > **Staging Areas to Review** > > > The staging areas containing the above mentioned artifacts are as follows, > for your review: > > * All artifacts for a) and b) can be found in the corresponding dev > repository at dist.apache.org [2], > > which are signed with the key with fingerprint > 0789F389E67ADDFA034E603FABF0C46E59C8941C [3] > > * All artifacts for c) can be found at the Apache Nexus Repository [4] > > > Other links for your review: > > * JIRA release notes [5] > > * Source code tag "release-2.0.0-rc2" [6] > > * PR to update the website Downloads page to include Flink ML links [7] > > > **Vote Duration** > > The voting time will run for at least 72 hours. > > It is adopted by majority approval, with at least 3 PMC affirmative votes. > > > Thanks, > > Yun and Zhipeng > > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+ML+Release > > [2] https://dist.apache.org/repos/dist/dev/flink/flink-ml-2.1.0-rc2/ > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS > > [4] > https://repository.apache.org/content/repositories/orgapacheflink-1517/ > > [5] > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351141 > > [6] https://github.com/apache/flink-ml/releases/tag/release-2.1.0-rc2 > > [7] https://github.com/apache/flink-web/pull/556 >
[jira] [Created] (FLINK-28338) org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParseException: Invalid numeric value: Leading zeroes not allowed
wangkang created FLINK-28338:

Summary: org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParseException: Invalid numeric value: Leading zeroes not allowed
Key: FLINK-28338
URL: https://issues.apache.org/jira/browse/FLINK-28338
Project: Flink
Issue Type: Improvement
Components: BuildSystem / Shaded
Affects Versions: 1.13.6
Environment: Flink SQL 1.13.6; the JSON data consumed from Kafka contains field values starting with 0
Reporter: wangkang
Attachments: image-2022-07-01-11-31-32-561.png

Caused by: org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParseException: Invalid numeric value: Leading zeroes not allowed
 at [Source: (byte[])"\{"uuid":"1285","name":"杨YP","age":01,"ts":"2022-07-01 10:33:09.553","partition":"part8"}"; line: 1, column: 38]
 at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2337)

!image-2022-07-01-11-31-32-561.png!
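For context on the exception above: the JSON specification (RFC 8259) forbids leading zeros in number literals, so `"age":01` is invalid JSON and Jackson rejects it by design. A minimal, Flink-independent sketch of a check that flags such payloads before they reach the deserializer (the helper class and method are hypothetical, not a Flink or Jackson API):

```java
/**
 * Illustrative helper: detects number tokens with leading zeros (e.g. 01),
 * which RFC 8259 forbids, while correctly skipping string contents such as
 * timestamps like "10:33:09". Not part of any Flink API.
 */
final class JsonChecks {
    static boolean hasLeadingZeroNumber(String json) {
        boolean inString = false;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '-' || Character.isDigit(c)) {
                int j = (c == '-') ? i + 1 : i; // index of the first digit
                if (j + 1 < json.length()
                        && json.charAt(j) == '0'
                        && Character.isDigit(json.charAt(j + 1))) {
                    return true; // "0" followed by another digit: leading zero
                }
                // consume the rest of the number token
                while (i < json.length() && "0123456789+-.eE".indexOf(json.charAt(i)) >= 0) {
                    i++;
                }
                i--; // compensate for the loop increment
            }
        }
        return false;
    }
}
```

Jackson itself can be relaxed via `JsonParser.Feature.ALLOW_NUMERIC_LEADING_ZEROS`; whether Flink's JSON format exposes such an option depends on the version, so fixing the producer to emit valid JSON is usually the safer route.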
[jira] [Created] (FLINK-28337) java.lang.IllegalArgumentException: Table identifier not set
wei created FLINK-28337:

Summary: java.lang.IllegalArgumentException: Table identifier not set
Key: FLINK-28337
URL: https://issues.apache.org/jira/browse/FLINK-28337
Project: Flink
Issue Type: Bug
Components: Connectors / Hive
Affects Versions: 1.14.2
Environment: Flink 1.14.2, Hive 3.1.2, Iceberg 0.12.1, Hadoop 3.2.1
Reporter: wei

I use the Flink Table SDK to select from an Iceberg table. I set the Hive catalog as the user catalog, but it looks like the default_catalog is still used. The error message is as follows:

{code:java}
0:42:41,886 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl [] - s3a-file-system metrics system started
10:42:44,392 INFO org.apache.iceberg.BaseMetastoreCatalog [] - Table loaded by catalog: default_iceberg.s3a_flink.icebergtbcloudtrackingtest
10:42:44,397 INFO org.apache.iceberg.mr.hive.HiveIcebergSerDe [] - Using schema from existing table {"type":"struct","schema-id":0,"fields":[{"id":1,"name":"vin","required":true,"type":"string"},{"id":2,"name":"name","required":true,"type":"string"},{"id":3,"name":"uuid","required":false,"type":"string"},{"id":4,"name":"channel","required":true,"type":"string"},{"id":5,"name":"run_scene","required":true,"type":"string"},{"id":6,"name":"timestamp","required":true,"type":"timestamp"},{"id":7,"name":"rcv_timestamp","required":true,"type":"timestamp"},{"id":8,"name":"raw","required":true,"type":"string"}]}
10:42:44,832 INFO org.apache.iceberg.BaseMetastoreTableOperations [] - Refreshing table metadata from new version: s3a://warehouse/s3a_flink.db/icebergTBCloudTrackingTest/metadata/00011-8d1ef9f1-8172-49fd-b0de-58196642b662.metadata.json
10:42:44,866 INFO org.apache.iceberg.BaseMetastoreCatalog [] - Table loaded by catalog: default_iceberg.s3a_flink.icebergtbcloudtrackingtest
10:42:44,867 INFO org.apache.iceberg.mr.hive.HiveIcebergSerDe [] - Using schema from existing table 
{"type":"struct","schema-id":0,"fields":[{"id":1,"name":"vin","required":true,"type":"string"},{"id":2,"name":"name","required":true,"type":"string"},{"id":3,"name":"uuid","required":false,"type":"string"},{"id":4,"name":"channel","required":true,"type":"string"},{"id":5,"name":"run_scene","required":true,"type":"string"},{"id":6,"name":"timestamp","required":true,"type":"timestamp"},{"id":7,"name":"rcv_timestamp","required":true,"type":"timestamp"},{"id":8,"name":"raw","required":true,"type":"string"}]}
10:42:48,079 INFO org.apache.hadoop.hive.metastore.HiveMetaStoreClient [] - Trying to connect to metastore with URI thrift://hiveserver:9083
10:42:48,079 INFO org.apache.hadoop.hive.metastore.HiveMetaStoreClient [] - Opened a connection to metastore, current connections: 3
10:42:48,081 INFO org.apache.hadoop.hive.metastore.HiveMetaStoreClient [] - Connected to metastore.
10:42:48,081 INFO org.apache.hadoop.hive.metastore.RetryingMetaStoreClient [] - RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.metastore.HiveMetaStoreClient ugi=root (auth:SIMPLE) retries=1 delay=1 lifetime=0
10:42:48,132 INFO org.apache.hadoop.hive.metastore.HiveMetaStoreClient [] - Closed a connection to metastore, current connections: 2
10:42:48,308 INFO org.apache.flink.connectors.hive.HiveParallelismInference [] - Hive source(s3a_flink.icebergTBCloudTrackingTest}) getNumFiles use time: 171 ms, result: 2
Exception in thread "main" java.lang.IllegalArgumentException: Table identifier not set
    at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:142)
    at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:114)
    at org.apache.iceberg.mr.Catalogs.loadTable(Catalogs.java:89)
    at org.apache.iceberg.mr.mapreduce.IcebergInputFormat.lambda$getSplits$0(IcebergInputFormat.java:102)
    at java.util.Optional.orElseGet(Optional.java:267)
    at org.apache.iceberg.mr.mapreduce.IcebergInputFormat.getSplits(IcebergInputFormat.java:102)
    at org.apache.iceberg.mr.mapred.MapredIcebergInputFormat.getSplits(MapredIcebergInputFormat.java:69)
    at org.apache.iceberg.mr.hive.HiveIcebergInputFormat.getSplits(HiveIcebergInputFormat.java:98)
    at org.apache.flink.connectors.hive.HiveSourceFileEnumerator.createMRSplits(HiveSourceFileEnumerator.java:107)
    at org.apache.flink.connectors.hive.HiveSourceFileEnumerator.createInputSplits(HiveSourceFileEnumerator.java:71)
    at org.apache.flink.connectors.hive.HiveTableSource.lambda$getDataStream$1(HiveTableSource.java:149)
    at org.apache.flink.connectors.hive.HiveParallelismInference.logRunningTime(HiveParallelismInference.java:107)
    at org.apache.flink.connectors.hive.HiveParallelismInference.infer(HiveParallelismInference.java:95)
    at org.apache.flink.connectors.hive.HiveTableSource.getDataStr
[VOTE] Apache Flink ML Release 2.1.0, release candidate #2
Hi everyone,

Please review and vote on the release candidate #2 for the version 2.1.0 of Apache Flink ML as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

**Testing Guideline**

You can find here [1] a page in the project wiki with instructions for testing. To cast a vote, it is not necessary to perform all listed checks, but please mention which checks you have performed when voting.

**Release Overview**

As an overview, the release consists of the following:

a) Flink ML source release to be deployed to dist.apache.org
b) Flink ML Python source distributions to be deployed to PyPI
c) Maven artifacts to be deployed to the Maven Central Repository

**Staging Areas to Review**

The staging areas containing the above-mentioned artifacts are as follows, for your review:

* All artifacts for a) and b) can be found in the corresponding dev repository at dist.apache.org [2], which are signed with the key with fingerprint 0789F389E67ADDFA034E603FABF0C46E59C8941C [3]
* All artifacts for c) can be found at the Apache Nexus Repository [4]

Other links for your review:

* JIRA release notes [5]
* Source code tag "release-2.1.0-rc2" [6]
* PR to update the website Downloads page to include Flink ML links [7]

**Vote Duration**

The voting time will run for at least 72 hours. It is adopted by majority approval, with at least 3 PMC affirmative votes.

Thanks,
Yun and Zhipeng

[1] https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+ML+Release
[2] https://dist.apache.org/repos/dist/dev/flink/flink-ml-2.1.0-rc2/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] https://repository.apache.org/content/repositories/orgapacheflink-1517/
[5] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351141
[6] https://github.com/apache/flink-ml/releases/tag/release-2.1.0-rc2
[7] https://github.com/apache/flink-web/pull/556
[jira] [Created] (FLINK-28336) Support parquet-avro format in PyFlink DataStream
Juntao Hu created FLINK-28336:

Summary: Support parquet-avro format in PyFlink DataStream
Key: FLINK-28336
URL: https://issues.apache.org/jira/browse/FLINK-28336
Project: Flink
Issue Type: New Feature
Components: API / Python
Affects Versions: 1.15.0
Reporter: Juntao Hu
Fix For: 1.16.0

Parquet-avro has three interfaces: ReflectData, SpecificData and GenericData. Since the first two require corresponding Java classes, we only support GenericData in PyFlink, where users define the Avro schema with strings.
Re: Re: [DISCUSS] FLIP-239: Port JDBC Connector Source to FLIP-27
Hi Roc,

Thanks for driving the discussion.

Could you describe in detail what the JdbcSourceSplit represents? Something looks wrong with the comment on JdbcSourceSplit in the FLIP (it is described as "A {@link SourceSplit} that represents a file, or a region of a file").

Best,
Lijie

Roc Marshal wrote on Thu, Jun 30, 2022 at 21:41:
> Hi, Boto. > Thanks for your reply. > >+1 to me on watermark strategy definition in ‘streaming’ & table > source. I'm not sure if FLIP-202[1] is suitable for a separate discussion, > but I think your proposal is very helpful to the new source. It would be > great if the new source could be compatible with this abstraction. > >In addition, whether we need to support such a special bounded scenario > abstraction? >The number of JdbcSourceSplit is certain, but the time to generate all > JdbcSourceSplit completely is not certain in the user defined > implementation. When the condition that the JdbcSourceSplit > generate-process end is met, the JdbcSourceSplit will not be generated. > After all JdbcSourceSplit processing is completed, the reader will be > notified that there are no more JdbcSourceSplit from > JdbcSourceSplitEnumerator. > > - [1] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-202%3A+Introduce+ClickHouse+Connector > > Best regards, > Roc Marshal > > On 2022/06/30 09:02:23 João Boto wrote: > > Hi, > > > > On source we could improve the JdbcParameterValuesProvider.. to be > defined as a query(s) or something more dynamic. > > The most time if your job is dynamic or have some condition to be met > (based on data on table) you have to create a connection an get that info > from database > > > > If we are going to create/allow a "streaming" jdbc source, we should be > able to define watermark and get new data from table using that watermark.. > > > > > > For the sink (but it could apply on source) will be great to be able to > set your implementation of the connection type.. 
For example if you are > connecting to clickhouse, be able to set a implementation based on > "BalancedClickhouseDataSource" for example (in this[1] implementation we > have a example) or set a extension version of a implementation for debug > purpose > > > > Regards > > > > > > [1] > https://github.com/apache/flink/pull/20097/files#diff-8b36e3403381dc14c748aeb5de0b4ceb7d7daec39594b1eacff1694b5266419d > > > > On 2022/06/27 13:09:51 Roc Marshal wrote: > > > Hi, all, > > > > > > > > > > > > > > > I would like to open a discussion on porting JDBC Source to new Source > API (FLIP-27[1]). > > > > > > Martijn Visser, Jing Ge and I had a preliminary discussion on the JIRA > FLINK-25420[2] and planed to start the discussion about the source part > first. > > > > > > > > > > > > Please let me know: > > > > > > - The issues about old Jdbc source you encountered; > > > - The new feature or design you want; > > > - More suggestions from other dimensions... > > > > > > > > > > > > You could find more details in FLIP-239[3]. > > > > > > Looking forward to your feedback. > > > > > > > > > > > > > > > [1] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface > > > > > > [2] https://issues.apache.org/jira/browse/FLINK-25420 > > > > > > [3] > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386271 > > > > > > > > > > > > > > > Best regards, > > > > > > Roc Marshal > >
Enquiry about contributing to Flink
Hi Flink Dev,

My company uses Flink, and we have some code changes that we'd like to contribute back upstream.

Can I get your advice on what to do here?

The features we wish to contribute are as follows:

1. JDBC connector filter pushdown, already tracked on JIRA: https://issues.apache.org/jira/browse/FLINK-16024
2. Support MSSQL dialect in JDBC connector
3. A tiny fix on Postgres JDBC Dialect

We are already using these features in production.

Looking forward to your advice, thank you!

This e-mail and any attachments are confidential to the addressee(s) and may contain information that is legally privileged and/or confidential. If you are not the intended recipient of this e-mail you are hereby notified that any dissemination, distribution, or copying of its content is strictly prohibited. If you have received this message in error, please notify the sender by return e-mail and destroy the message and all copies in your possession.

To find out more details about how we may collect, use and share your personal information, please see https://www.mwam.com/privacy-policy. This includes details of how calls you make to us may be recorded in order for us to comply with our legal and regulatory obligations.

To the extent that the contents of this email constitutes a financial promotion, please note that it is issued only to and/or directed only at persons who are professional clients or eligible counterparties as defined in the FCA Rules. Any investment products or services described in this email are available only to professional clients and eligible counterparties. Persons who are not professional clients or eligible counterparties should not rely or act on the contents of this email.

Marshall Wace LLP is authorised and regulated by the Financial Conduct Authority. Marshall Wace LLP is a limited liability partnership registered in England and Wales with registered number OC302228 and registered office at George House, 131 Sloane Street, London, SW1X 9AT. 
If you are receiving this e-mail as a client, or an investor in an investment vehicle, managed or advised by Marshall Wace North America L.P., the sender of this e-mail is communicating with you in the sender's capacity as an associated or related person of Marshall Wace North America L.P. (“MWNA”), which is registered with the US Securities and Exchange Commission (“SEC”) as an investment adviser. Registration with the SEC does not imply that MWNA or its employees possess a certain level of skill or training.
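For item 2 in the enquiry above (MSSQL dialect support in the JDBC connector), most of a dialect's work is identifier quoting and paging/upsert syntax. A minimal sketch of the MSSQL-specific bits (class and method names are illustrative assumptions, not Flink's actual JdbcDialect interface):

```java
/**
 * Illustrative sketch of the MSSQL-specific pieces a JDBC dialect needs;
 * class and method names are hypothetical, not Flink's JdbcDialect API.
 */
final class SqlServerDialectSketch {
    /** SQL Server quotes identifiers with square brackets; a literal ']' is doubled. */
    static String quoteIdentifier(String identifier) {
        return "[" + identifier.replace("]", "]]") + "]";
    }

    /** SQL Server has no LIMIT clause; SQL Server 2012+ uses OFFSET/FETCH instead. */
    static String limitClause(long limit) {
        return "ORDER BY (SELECT NULL) OFFSET 0 ROWS FETCH NEXT " + limit + " ROWS ONLY";
    }
}
```

The upsert statement (typically a MERGE in SQL Server) would be the other dialect-specific piece, omitted here for brevity.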
[jira] [Created] (FLINK-28335) Delete topic after tests
Jingsong Lee created FLINK-28335:

Summary: Delete topic after tests
Key: FLINK-28335
URL: https://issues.apache.org/jira/browse/FLINK-28335
Project: Flink
Issue Type: Bug
Components: Table Store
Reporter: Jingsong Lee
Fix For: table-store-0.2.0

Currently our tests do not remove the topic, so the local Kafka cluster may reuse the data inside it, resulting in test errors.
Re: [DISCUSS] FLIP-238: Introduce FLIP-27-based Data Generator Source
Hi all,

Getting back to the idea of reusing FlinkConnectorRateLimiter: it is designed for the SourceFunction API and has an open() method that takes a RuntimeContext, so we would need to add a different interface for the new Source API. This is where I see a certain limitation for the rate-limiting use case: in the old API, the individual readers were able to retrieve the current parallelism from the RuntimeContext. In the new API this is not supported; the parallelism is only available in the SplitEnumeratorContext, to which the readers do not have access.

I see two possibilities:

1. Add an optional RateLimiter parameter to the DataGeneratorSource constructor. The RateLimiter is then "fixed" and has to be fully configured by the user in the main method.
2. Piggy-back on splits: add parallelism as a field of a split. This field would be initialized dynamically upon split creation in the createEnumerator() method, where currentParallelism is available.

The second approach makes the implementation significantly more complex, since we cannot simply wrap NumberSequenceSource.SplitSerializer in that case. Its advantage is that with any kind of autoscaling, the source rate will keep matching the original configuration. But I'm not sure how useful this is; I can even imagine scenarios where scaling the input rate together with the parallelism would be better for demo purposes.

Would be glad to hear your thoughts on this.

Best,
Alexander Fedulov

On Mon, Jun 20, 2022 at 4:31 PM David Anderson wrote:
> I'm very happy with this. +1 > > A lot of SourceFunction implementations used in demos/POC implementations > include a call to sleep(), so adding rate limiting is a good idea, in my > opinion. > > Best, > David > > On Mon, Jun 20, 2022 at 10:10 AM Qingsheng Ren wrote: > > > Hi Alexander, > > > > Thanks for creating this FLIP! I’d like to share some thoughts. > > > > 1. 
About the “generatorFunction” I’m expecting an initializer on it > > because it’s hard to require all fields in the generator function are > > serializable in user’s implementation. Providing a function like “open” > in > > the interface could let the function to make some initializations in the > > task initializing stage. > > > > 2. As of the throttling functinality you mentioned, there’s a > > FlinkConnectorRateLimiter under flink-core and maybe we could reuse this > > interface. Actually I prefer to make rate limiting as a common feature > > provided in the Source API, but this requires another FLIP and a lot of > > discussions so I’m OK to have it in the DataGen source first. > > > > Best regards, > > Qingsheng > > > > > > > On Jun 17, 2022, at 01:47, Alexander Fedulov > > wrote: > > > > > > Hi Jing, > > > > > > thanks for your thorough analysis. I agree with the points you make and > > > also with the idea to approach the larger task of providing a universal > > > (DataStream + SQL) data generator base iteratively. > > > Regarding the name, the SourceFunction-based *DataGeneratorSource* > > resides > > > in the *org.apache.flink.streaming.api.functions.source.datagen*. I > think > > > it is OK to simply place the new one (with the same name) next to the > > > *NumberSequenceSource* into > *org.apache.flink.api.connector.source.lib*. > > > > > > One more thing I wanted to discuss: I noticed that *DataGenTableSource > > *has > > > built-in throttling functionality (*rowsPerSecond*). I believe it is > > > something that could be also useful for the DataStream users of the > > > stateless data generator and since we want to eventually converge on > the > > > same implementation for DataStream and Table/SQL it sounds like a good > > idea > > > to add it to the FLIP. What do you think? 
> > > > > > Best, > > > Alexander Fedulov > > > > > > > > > On Tue, Jun 14, 2022 at 7:17 PM Jing Ge wrote: > > > > > >> Hi, > > >> > > >> After reading all discussions posted in this thread and the source > code > > of > > >> DataGeneratorSource which unfortunately used "Source" instead of > > >> "SourceFunction" in its name, issues could summarized as following: > > >> > > >> 1. The current DataGeneratorSource based on SourceFunction is a > stateful > > >> source connector and built for Table/SQL. > > >> 2. The right name for the new data generator source i.e. > > >> DataGeneratorSource has been used for the current implementation based > > on > > >> SourceFunction. > > >> 3. A new data generator source should be developed based on the new > > Source > > >> API. > > >> 4. The new data generator source should be used both for DataStream > and > > >> Table/SQL, which means the current DataGeneratorSource should be > > replaced > > >> with the new one. > > >> 5. The core event generation logic should be pluggable to support > > various > > >> (test) scenarios, e.g. rondom stream, changlog stream, controllable > > events > > >> per checkpoint, etc. > > >> > > >> which turns out that > > >> > > >> To solve 1+3+4 -> we w
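A minimal sketch of option 1 from Alexander's message above: a user-configured rate limiter whose per-subtask budget is derived from the overall target rate and the parallelism. Plain Java, purely illustrative, and not Flink's FlinkConnectorRateLimiter:

```java
/**
 * Illustrative per-subtask rate limiter, not Flink's FlinkConnectorRateLimiter.
 * Each subtask receives an equal share of the overall target rate, which is
 * the "fixed, user-configured" variant (option 1) from the discussion.
 */
final class SimpleRateLimiter {
    private final double permitsPerSecond;
    private long nextFreeNanos;

    SimpleRateLimiter(double recordsPerSecondTotal, int parallelism) {
        this.permitsPerSecond = recordsPerSecondTotal / parallelism;
        this.nextFreeNanos = System.nanoTime();
    }

    double permitsPerSecond() {
        return permitsPerSecond;
    }

    /** Blocks until the next record may be emitted. */
    void acquire() {
        long intervalNanos = (long) (1_000_000_000L / permitsPerSecond);
        long now = System.nanoTime();
        long waitNanos = nextFreeNanos - now;
        if (waitNanos > 0) {
            try {
                Thread.sleep(waitNanos / 1_000_000, (int) (waitNanos % 1_000_000));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        // Schedule the next free slot one interval after the later of "now" and
        // the previously scheduled slot, keeping the average rate stable.
        nextFreeNanos = Math.max(now, nextFreeNanos) + intervalNanos;
    }
}
```

Note that the parallelism passed to the constructor is exactly the value that is awkward to obtain inside a reader in the new Source API, which is the limitation the thread discusses.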
Re: [VOTE] Release 1.15.1, release candidate #1
Hello all, -1 The Kinesis Data Streams consumer does not work with Stop With Savepoint [1]. We are planning to have a fix ready to merge tomorrow and would appreciate getting this in 1.15.1. [1] https://issues.apache.org/jira/browse/FLINK-23528 Thanks, Danny On Thu, Jun 30, 2022 at 9:31 AM Jingsong Li wrote: > Hi David, Thanks for creating this RC. > > -1 > > We found an incompatible modification in 1.15.0 [1] > I think we should fix it. > > [1] https://issues.apache.org/jira/browse/FLINK-28322 > > Best, > Jingsong > > On Tue, Jun 28, 2022 at 8:45 PM Robert Metzger > wrote: > > > > +1 (binding) > > > > - staging repo contents look fine > > - KEYS file ok > > - binaries start locally properly. WebUI accessible on Mac. > > > > On Mon, Jun 27, 2022 at 11:21 AM Qingsheng Ren wrote: > > > > > +1 (non-binding) > > > > > > - checked/verified signatures and hashes > > > - checked that all POM files point to the same version > > > - built from source, without Hadoop and using Scala 2.12 > > > - started standalone cluster locally, WebUI is accessiable and ran > > > WordCount example successfully > > > - executed a job with SQL client consuming from Kafka source to collect > > > sink > > > > > > Best, > > > Qingsheng > > > > > > > > > > On Jun 27, 2022, at 14:46, Xingbo Huang wrote: > > > > > > > > +1 (non-binding) > > > > > > > > - verify signatures and checksums > > > > - no binaries found in source archive > > > > - build from source > > > > - Reviewed the release note blog > > > > - verify python wheel package contents > > > > - pip install apache-flink-libraries and apache-flink wheel packages > > > > - run the examples from Python Table API tutorial > > > > > > > > Best, > > > > Xingbo > > > > > > > > Chesnay Schepler 于2022年6月24日周五 21:42写道: > > > > > > > >> +1 (binding) > > > >> > > > >> - signatures OK > > > >> - all required artifacts appear to be present > > > >> - tag exists with the correct version adjustments > > > >> - binary shows correct commit and version > > 
> >> - examples run fine > > > >> - website PR looks good > > > >> > > > >> On 22/06/2022 14:20, David Anderson wrote: > > > >>> Hi everyone, > > > >>> > > > >>> Please review and vote on release candidate #1 for version 1.15.1, > as > > > >>> follows: > > > >>> [ ] +1, Approve the release > > > >>> [ ] -1, Do not approve the release (please provide specific > comments) > > > >>> > > > >>> The complete staging area is available for your review, which > includes: > > > >>> > > > >>> * JIRA release notes [1], > > > >>> * the official Apache source release and binary convenience > releases to > > > >> be > > > >>> deployed to dist.apache.org [2], which are signed with the key > with > > > >>> fingerprint E982F098 [3], > > > >>> * all artifacts to be deployed to the Maven Central Repository [4], > > > >>> * source code tag "release-1.15.1-rc1" [5], > > > >>> * website pull request listing the new release and adding > announcement > > > >> blog > > > >>> post [6]. > > > >>> > > > >>> The vote will be open for at least 72 hours. It is adopted by > majority > > > >>> approval, with at least 3 PMC affirmative votes. > > > >>> > > > >>> Thanks, > > > >>> David > > > >>> > > > >>> [1] > > > >>> > > > >> > > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version= > > > >>> 12351546 > > > >>> [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.15.1-rc1/ > > > >>> [3] https://dist.apache.org/repos/dist/release/flink/KEYS > > > >>> [4] > > > >> > https://repository.apache.org/content/repositories/orgapacheflink-1511/ > > > >>> [5] https://github.com/apache/flink/tree/release-1.15.1-rc1 > > > >>> [6] https://github.com/apache/flink-web/pull/554 > > > >>> > > > >> > > > >> > > > > > > >
Re: [DISCUSS] Introduce multi delete API to Flink's FileSystem class
Hi, Thanks for the proposal Yun, I think that's a good idea and it could solve the issue you mentioned (FLINK-26590) in many cases (though not all, depending on deletion speed; but in practice it may be enough). Having a separate interface (BulkDeletingFileSystem) would probably help in incremental implementation of the feature (i.e. FS by FS, rather than all at once). Although the same can be achieved by adding supportsBulkDelete(). Regarding BulkFileDeleter, I think it's required in some form, because grouping must be done before calling FS.delete(), even if it accepts a collection. Have you considered limiting the batch sizes for deletions? For example, S3 has a limit of 1000 [1], but the SDK handles it automatically, IIUC. If we don't rely on this handling, and implement our own, the batches could be also deleted in parallel. This can be an initial step, from which all the file systems would benefit, even those without bulk-delete support. [1] https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html Regards, Roman On Thu, Jun 30, 2022 at 5:10 PM Piotr Nowojski wrote: > Hi, > > Yes, I know that you can not use recursive deletes for > incremental checkpoints and I didn't suggest it anywhere. I just pointed > out that I would expect multi/bulk deletes to supersede the recursive > deletes feature assuming good underlying implementation. > Also I'm not surprised that multi deletes can be faster. I would > expect/hope for that. I've just raised a point that they don't have to be. > It depends on the underlying file system. However in contrast to the > recursive deletes, with multi deletes I wouldn't expect multi delete to be > potentially slower. > > Re the Dawid's PoC. I'm not sure/I don't remember why he proposed > `BulkDeletingFileSystem` over adding a default method to the FileSystem > interface. But it seems to me like a minor point. 
The majority of Dawid's > PR is about `BulkFileDeleter` interface, not `BulkDeletingFileSystem`, so > about how to use the bulk deletes inside Flink, not how to implement it on > the FileSystem side. Do you maybe have a concrete design proposal for this > feature? > > Best, > Piotrek > > czw., 30 cze 2022 o 15:12 Yun Tang napisał(a): > > > Hi Piotr, > > > > As I said in the original email, you cannot delete folders recursively > for > > incremental checkpoints. And If you take a close look at the original > > email, I have shared the experimental results, which proved 29x > improvement: > > "A simple experiment shows that deleting 1000 objects with each 5MB size, > > will cost 39494ms with for-loop single delete operations, and the result > > will drop to 1347ms if using multi-delete API in Tencent Cloud." > > > > I think I can leverage some ideas from Dawid's work. And as I said, I > > would introduce the multi-delete API to the original FileSystem class > > instead of introducing another BulkDeletingFileSystem, which makes the > file > > system abstraction closer to the modern cloud-based environment. > > > > Best > > Yun Tang > > > > From: Piotr Nowojski > > Sent: Thursday, June 30, 2022 18:25 > > To: dev ; Dawid Wysakowicz > > > Subject: Re: [DISCUSS] Introduce multi delete API to Flink's FileSystem > > class > > > > Hi, > > > > I presume this would mostly supersede the recursive deletes [1]? I > remember > > an argument that the recursive deletes were not obviously better, even if > > the underlying FS was supporting it. I'm not saying that this would have > > been a counter argument against this effort, since every FileSystem could > > decide on its own whether to use the multi delete call or not. But I > think > > at the very least it should be benchmarked/compared whether implementing > it > > for a particular FS makes sense or not. > > > > Also there seems to be some similar (abandoned?) 
effort from Dawid, with > > named bulk deletes, with "BulkDeletingFileSystem"? [2] Isn't this > basically > > the same thing that you are proposing Yun Tang? > > > > Best, > > Piotrek > > > > [1] https://issues.apache.org/jira/browse/FLINK-13856 > > [2] > > > > > https://issues.apache.org/jira/browse/FLINK-13856?focusedCommentId=17481712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17481712 > > > > czw., 30 cze 2022 o 11:45 Zakelly Lan > napisał(a): > > > > > Hi Yun, > > > > > > Thanks for bringing this into discussion. > > > I'm +1 to this idea. > > > And IIUC, Flink implements the OSS and S3 filesystem based on the > hadoop > > > filesystem interface, which does not provide the multi-delete API, it > may > > > take some effort to implement this. > > > > > > Best, > > > Zakelly > > > > > > On Thu, Jun 30, 2022 at 5:36 PM Martijn Visser < > martijnvis...@apache.org > > > > > > wrote: > > > > > > > Hi Yun Tang, > > > > > > > > +1 for addressing this problem and your approach. > > > > > > > > Best regards, > > > > > > > > Martijn > > > > > > > > Op do 30 jun. 2022 om 11:12 schreef Feifan W
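Roman's suggestion above (cap delete batches at the S3 DeleteObjects limit of 1000 and issue the batches in parallel) can be sketched as follows. This is illustrative only: `BatchedDeleter` and its methods are hypothetical names, not Flink or AWS SDK API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

/** Hypothetical helper sketching batched, parallel bulk deletion (not a Flink API). */
class BatchedDeleter {
    static final int MAX_BATCH_SIZE = 1000; // e.g. the S3 DeleteObjects per-request limit

    /** Split the paths into batches no larger than maxBatch. */
    static List<List<String>> partition(List<String> paths, int maxBatch) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < paths.size(); i += maxBatch) {
            batches.add(paths.subList(i, Math.min(i + maxBatch, paths.size())));
        }
        return batches;
    }

    /** Issue one bulk-delete call per batch, with all batches running in parallel. */
    static void deleteAll(List<String> paths, Consumer<List<String>> bulkDelete) {
        CompletableFuture<?>[] futures = partition(paths, MAX_BATCH_SIZE).stream()
                .map(batch -> CompletableFuture.runAsync(() -> bulkDelete.accept(batch)))
                .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(futures).join(); // wait for every batch to finish
    }
}
```

A file system without native bulk delete could plug a simple for-loop into `bulkDelete`, which is why, as Roman notes, even those implementations would benefit from the batching and parallelism.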
[jira] [Created] (FLINK-28334) PushProjectIntoTableSourceScanRule should cover the case when table source SupportsReadingMetadata and not SupportsProjectionPushDown
lincoln lee created FLINK-28334:
---
Summary: PushProjectIntoTableSourceScanRule should cover the case when table source SupportsReadingMetadata and not SupportsProjectionPushDown
Key: FLINK-28334
URL: https://issues.apache.org/jira/browse/FLINK-28334
Project: Flink
Issue Type: Bug
Components: Table SQL / Planner
Affects Versions: 1.15.0
Reporter: lincoln lee
Fix For: 1.16.0

"SELECT id, ts FROM src" query on such a table:
{code}
CREATE TABLE src (
  id int,
  name varchar,
  tags varchar METADATA VIRTUAL,
  ts timestamp(3) METADATA VIRTUAL
) WITH (
  'connector' = 'values',
  'readable-metadata' = 'tags:varchar,ts:timestamp(3)',
  'enable-projection-push-down' = 'false'
)
{code}
error occurs
{code}
java.lang.AssertionError: Sql optimization: Assertion error: RexInputRef index 3 out of range 0..2
	at org.apache.flink.table.planner.plan.optimize.program.FlinkVolcanoProgram.optimize(FlinkVolcanoProgram.scala:84)
	at org.apache.flink.table.planner.plan.optimize.program.FlinkChainedProgram.$anonfun$optimize$1(FlinkChainedProgram.scala:62)
	at scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:156)
	at scala.collection.TraversableOnce.$anonfun$foldLeft$1$adapted(TraversableOnce.scala:156)
	at scala.collection.Iterator.foreach(Iterator.scala:937)
	at scala.collection.Iterator.foreach$(Iterator.scala:937)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
	at scala.collection.IterableLike.foreach(IterableLike.scala:70)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:69)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:156)
	at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:154)
	at scala.collection.AbstractTraversable.foldLeft(Traversable.scala:104)
	at org.apache.flink.table.planner.plan.optimize.program.FlinkChainedProgram.optimize(FlinkChainedProgram.scala:58)
	at org.apache.flink.table.planner.plan.optimize.StreamCommonSubGraphBasedOptimizerV2.optimizeTree(StreamCommonSubGraphBasedOptimizerV2.scala:209)
	at org.apache.flink.table.planner.plan.optimize.StreamCommonSubGraphBasedOptimizerV2.optimizeBlock(StreamCommonSubGraphBasedOptimizerV2.scala:156)
	at org.apache.flink.table.planner.plan.optimize.StreamCommonSubGraphBasedOptimizerV2.$anonfun$doOptimize$1(StreamCommonSubGraphBasedOptimizerV2.scala:79)
	at org.apache.flink.table.planner.plan.optimize.StreamCommonSubGraphBasedOptimizerV2.$anonfun$doOptimize$1$adapted(StreamCommonSubGraphBasedOptimizerV2.scala:78)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:58)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:51)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.flink.table.planner.plan.optimize.StreamCommonSubGraphBasedOptimizerV2.doOptimize(StreamCommonSubGraphBasedOptimizerV2.scala:78)
	at org.apache.flink.table.planner.plan.optimize.CommonSubGraphBasedOptimizer.optimize(CommonSubGraphBasedOptimizer.scala:94)
	at org.apache.flink.table.planner.delegation.PlannerBase.optimize(PlannerBase.scala:389)
	at org.apache.flink.table.planner.utils.TableTestUtilBase.assertPlanEquals(TableTestBase.scala:1199)
	at org.apache.flink.table.planner.utils.TableTestUtilBase.doVerifyPlan2(TableTestBase.scala:1109)
	at org.apache.flink.table.planner.utils.TableTestUtilBase.doVerifyPlan(TableTestBase.scala:1066)
	at org.apache.flink.table.planner.utils.TableTestUtilBase.verifyExecPlan(TableTestBase.scala:687)
	at org.apache.flink.table.planner.plan.nodes.exec.stream.TableSourceJsonPlanTest.testReuseSourceWithoutProjectionPushDown(TableSourceJsonPlanTest.java:308)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:258)
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
	at org.junit.rules.ExpectedExcept
[jira] [Created] (FLINK-28333) GlueSchemaRegistryAvroKinesisITCase is being Ignored due to `Access key not configured`
Ahmed Hamdy created FLINK-28333: --- Summary: GlueSchemaRegistryAvroKinesisITCase is being Ignored due to `Access key not configured` Key: FLINK-28333 URL: https://issues.apache.org/jira/browse/FLINK-28333 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.15.1 Reporter: Ahmed Hamdy h1. Description The {{GlueSchemaRegistryAvroKinesisITCase}} test is not being run on CI and is skipped due to {{Access key not configured}}. The Access Key and Secret Key should be added to the test environment variables to enable the test. Currently, after adding these keys to the environment variables, the test fails with {quote}AWSSchemaRegistryException: Exception occurred while fetching or registering schema definition = {"$id":"https://example.com/address.schema.json","$schema":"http://json-schema.org/draft-07/schema#","type":"object","properties":{"f1":{"type":"string"},"f2":{"type":"integer","maximum":1}}}, schema name = gsr_json_input_stream at com.amazonaws.services.schemaregistry.common.AWSSchemaRegistryClient.getORRegisterSchemaVersionId(AWSSchemaRegistryClient.java:202) {quote} These tests should be enabled and fixed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-28332) GlueSchemaRegistryJsonKinesisITCase is being Ignored due to `Access key not configured`
Ahmed Hamdy created FLINK-28332: --- Summary: GlueSchemaRegistryJsonKinesisITCase is being Ignored due to `Access key not configured` Key: FLINK-28332 URL: https://issues.apache.org/jira/browse/FLINK-28332 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.15.1 Reporter: Ahmed Hamdy h1. Description The {{GlueSchemaRegistryJsonKinesisITCase}} test is not being run on CI and is skipped due to {{Access key not configured}}. The Access Key and Secret Key should be added to the test environment variables to enable the test. Currently, after adding these keys to the environment variables, the test fails with {quote}AWSSchemaRegistryException: Exception occurred while fetching or registering schema definition = {"$id":"https://example.com/address.schema.json","$schema":"http://json-schema.org/draft-07/schema#","type":"object","properties":{"f1":{"type":"string"},"f2":{"type":"integer","maximum":1}}}, schema name = gsr_json_input_stream at com.amazonaws.services.schemaregistry.common.AWSSchemaRegistryClient.getORRegisterSchemaVersionId(AWSSchemaRegistryClient.java:202) {quote} These tests should be enabled and fixed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [DISCUSS] Introduce multi delete API to Flink's FileSystem class
Hi, Yes, I know that you cannot use recursive deletes for incremental checkpoints, and I didn't suggest it anywhere. I just pointed out that I would expect multi/bulk deletes to supersede the recursive-deletes feature, assuming a good underlying implementation. Also, I'm not surprised that multi deletes can be faster; I would expect/hope for that. I've just raised the point that they don't have to be, as it depends on the underlying file system. However, in contrast to recursive deletes, I wouldn't expect multi deletes to be potentially slower. Re Dawid's PoC: I'm not sure/I don't remember why he proposed `BulkDeletingFileSystem` over adding a default method to the FileSystem interface, but it seems to me like a minor point. The majority of Dawid's PR is about the `BulkFileDeleter` interface, not `BulkDeletingFileSystem`, so it is about how to use bulk deletes inside Flink, not how to implement them on the FileSystem side. Do you maybe have a concrete design proposal for this feature? Best, Piotrek On Thu, 30 Jun 2022 at 15:12, Yun Tang wrote: > Hi Piotr, > > As I said in the original email, you cannot delete folders recursively for > incremental checkpoints. And If you take a close look at the original > email, I have shared the experimental results, which proved 29x improvement: > "A simple experiment shows that deleting 1000 objects with each 5MB size, > will cost 39494ms with for-loop single delete operations, and the result > will drop to 1347ms if using multi-delete API in Tencent Cloud." > > I think I can leverage some ideas from Dawid's work. And as I said, I > would introduce the multi-delete API to the original FileSystem class > instead of introducing another BulkDeletingFileSystem, which makes the file > system abstraction closer to the modern cloud-based environment. 
> > Best > Yun Tang > > From: Piotr Nowojski > Sent: Thursday, June 30, 2022 18:25 > To: dev ; Dawid Wysakowicz > Subject: Re: [DISCUSS] Introduce multi delete API to Flink's FileSystem > class > > Hi, > > I presume this would mostly supersede the recursive deletes [1]? I remember > an argument that the recursive deletes were not obviously better, even if > the underlying FS was supporting it. I'm not saying that this would have > been a counter argument against this effort, since every FileSystem could > decide on its own whether to use the multi delete call or not. But I think > at the very least it should be benchmarked/compared whether implementing it > for a particular FS makes sense or not. > > Also there seems to be some similar (abandoned?) effort from Dawid, with > named bulk deletes, with "BulkDeletingFileSystem"? [2] Isn't this basically > the same thing that you are proposing Yun Tang? > > Best, > Piotrek > > [1] https://issues.apache.org/jira/browse/FLINK-13856 > [2] > > https://issues.apache.org/jira/browse/FLINK-13856?focusedCommentId=17481712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17481712 > > czw., 30 cze 2022 o 11:45 Zakelly Lan napisał(a): > > > Hi Yun, > > > > Thanks for bringing this into discussion. > > I'm +1 to this idea. > > And IIUC, Flink implements the OSS and S3 filesystem based on the hadoop > > filesystem interface, which does not provide the multi-delete API, it may > > take some effort to implement this. > > > > Best, > > Zakelly > > > > On Thu, Jun 30, 2022 at 5:36 PM Martijn Visser > > > wrote: > > > > > Hi Yun Tang, > > > > > > +1 for addressing this problem and your approach. > > > > > > Best regards, > > > > > > Martijn > > > > > > Op do 30 jun. 2022 om 11:12 schreef Feifan Wang : > > > > > > > Thanks a lot for the proposal @Yun Tang ! It sounds great and I > can't > > > > find any reason not to make this improvement. 
> > > > > > > > > > > > —— > > > > Name: Feifan Wang > > > > Email: zoltar9...@163.com > > > > > > > > > > > > Replied Message > > > > | From | Yun Tang | > > > > | Date | 06/30/2022 16:56 | > > > > | To | dev@flink.apache.org | > > > > | Subject | [DISCUSS] Introduce multi delete API to Flink's > FileSystem > > > > class | > > > > Hi guys, > > > > > > > > As more and more teams move to cloud-based environments. Cloud object > > > > storage has become the factual technical standard for big data > > > ecosystems. > > > > From our experience, the performance of writing/deleting objects in > > > object > > > > storage could vary in each call, the FLIP of changelog state-backend > > had > > > > ever taken experiments to verify the performance of writing the same > > data > > > > with multi times [1], and it proves that p999 latency could be 8x > than > > > p50 > > > > latency. This is also true for delete operations. > > > > > > > > Currently, after introducing the checkpoint backpressure > mechanism[2], > > > the > > > > newly triggered checkpoint could be delayed due to not cleaning > > > checkpoints > > > > as fast as possible [3]. > > > > Moreover, Flink's checkpoin
[jira] [Created] (FLINK-28331) Persist status after every observe loop
Matyas Orhidi created FLINK-28331: - Summary: Persist status after every observe loop Key: FLINK-28331 URL: https://issues.apache.org/jira/browse/FLINK-28331 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.1.0 Reporter: Matyas Orhidi Fix For: kubernetes-operator-1.1.0 Make sure we don't lose any status information because of the reconcile logic. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re:Re: Re:Re: [DISCUSS] FLIP-218: Support SELECT clause in CREATE TABLE(CTAS)
Hi Martijn, Thank you for your reply, these are two good questions. >1. The FLIP mentions that if the user doesn't specify the WITH option part >in the query of the sink table, it will be assumed that the user wants to >create a managed table. What will happen if the user doesn't have Table >Store configured/installed? Will we throw an error? If it is a Catalog that does not support managed tables and no `connector` is specified, then the corresponding TableSink cannot be generated and the statement will fail. If it is a Catalog that does support managed tables and no `connector` is specified, it will still fail, because the Table Store related configuration is not set and the Table Store related jars are not present. >2. Will there be support included for FLIP-190 (version upgrades)? FLIP-190 mainly solves the problem of upgrades in streaming mode, while FLIP-218's use scenarios are more in batch mode. The CTAS atomicity implementation requires serialization support for the Catalog and the hook, which currently cannot be serialized into JSON, so the atomic implementation cannot support FLIP-190. Non-atomic implementations are able to support FLIP-190. -- Best regards, Mang Zhang At 2022-06-30 16:47:38, "Martijn Visser" wrote: >Hi Mang, > >I have two questions/remarks: > >1. The FLIP mentions that if the user doesn't specify the WITH option part >in the query of the sink table, it will be assumed that the user wants to >create a managed table. What will happen if the user doesn't have Table >Store configured/installed? Will we throw an error? > >2. Will there be support included for FLIP-190 (version upgrades)? > >Best regards, > >Martijn > >On Wed 29 Jun 2022 at 05:18, Mang Zhang wrote: > >> Hi everyone, >> Thank you to all those who participated in the discussion, we have >> discussed many rounds, the program has been gradually revised and improved, >> looking forward to further feedback, we will launch a vote in the next day >> or two. 
>> >> >> >> >> >> >> >> -- >> >> Best regards, >> Mang Zhang >> >> >> >> >> >> At 2022-06-28 22:23:16, "Mang Zhang" wrote: >> >Hi Yuxia, >> >Thank you very much for your reply. >> > >> > >> >>1: Also, the mixture of ctas and rtas confuses me as the FLIP talks >> nothing about rtas but refer it in the configuration suddenly. And if >> we're not to implement rtas in this FLIP, it may be better not to refer it >> and the `rtas` shouldn't exposed to user as a configuration. >> >Currently does not support RTAS because in the stream mode and batch mode >> semantic unification issues and specific business scenarios are not very >> clear, the future we will support, if in support of rtas and then modify >> the option name, then it will bring the cost of modifying the configuration >> to the user. >> >>2: How will the CTASJobStatusHook be passed to StreamGraph as a hook? >> Could you please explain about it. Some pseudocode will be much better if >> it's possible. I'm lost in this part. >> > >> > >> > >> > >> >This part is too much of an implementation detail, and of course we had >> to make some changes to achieve this. FLIP focuses on semantic consistency >> in stream and batch mode, and can provide optional atomicity support. >> > >> > >> >>3: The name `AtomicCatalog` confuses me. Seems the backgroud for the >> naming is to implement atomic for ctas, we propose a interface for catalog >> to support serializing, then we name it to `AtomicCatalog`. At least, the >> interface is for the atomic of ctas. But if we want to implement other >> features like isolate which may also require serializable catalog in the >> future, should we introduce a new interface naming `IsolateCatalog`? Have >> you ever considered other names like `SerializableCatalog`. As it's a >> public interface, maybe we should be careful about the name. 
>> >Regarding the definition of the Catalog name, we have also discussed the >> name `SerializableCatalog`, which is too specific and does not relate to >> the atomic functionality we want to express. CTAS/RTAS want to support >> atomicity, need Catalog to implement `AtomicCatalog`, so it's more >> straightforward to understand. >> > >> > >> >Hope this answers your question. >> > >> > >> > >> > >> >-- >> > >> >Best regards, >> >Mang Zhang >> > >> > >> > >> > >> > >> >At 2022-06-28 11:36:51, "yuxia" wrote: >> >>Thanks for updating. The FLIP looks generall good to me. I have only >> minor questions: >> >> >> >>1: Also, the mixture of ctas and rtas confuses me as the FLIP talks >> nothing about rtas but refer it in the configuration suddenly. And if >> we're not to implement rtas in this FLIP, it may be better not to refer it >> and the `rtas` shouldn't exposed to user as a configuration. >> >> >> >>2: How will the CTASJobStatusHook be passed to StreamGraph as a hook? >> Could you please explain about it. Some pseudocode will be much better if >> it's possible. I'm lost in this part. >> >> >> >>3: The name `AtomicCatalog` confuses me. Seems the backgroud for the >> naming is to implement atomic for ctas, we propo
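The optional atomicity Mang describes (create the target table first, then drop it again if the job fails) can be sketched with a toy hook. `JobStatusHook` and `ctasHook` here are illustrative stand-ins, not the actual FLIP-218 API:

```java
import java.util.Set;

/** Toy illustration of atomic CTAS via a job-status hook; all names are hypothetical. */
class CtasAtomicitySketch {

    /** Minimal stand-in for a hook invoked on job lifecycle transitions. */
    interface JobStatusHook {
        void onCreated();
        void onFinished();
        void onFailed();
    }

    /** Returns a hook that registers the target table up front and drops it on failure. */
    static JobStatusHook ctasHook(Set<String> catalogTables, String table) {
        return new JobStatusHook() {
            public void onCreated() { catalogTables.add(table); }    // create target table first
            public void onFinished() { }                             // keep the table on success
            public void onFailed() { catalogTables.remove(table); }  // roll back on failure
        };
    }
}
```

Because such a hook (and the catalog it manipulates) has to travel with the submitted job, it must be serializable, which is the constraint behind the FLIP-190 limitation discussed above.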
Re: [DISCUSS] Contribution of Multi Cluster Kafka Source
Hi Mason, I added mason6345 to the Flink confluence space, you should be able to add a FLIP now. Looking forward to the contribution! Thomas On Thu, Jun 30, 2022 at 9:25 AM Martijn Visser wrote: > > Hi Mason, > > I'm sure there's a PMC (*hint*) out there who can grant you access to > create a FLIP. Looking forward to it, this sounds like an improvement that > users are looking forward to. > > Best regards, > > Martijn > > Op di 28 jun. 2022 om 09:21 schreef Mason Chen : > > > Hi all, > > > > Thanks for the feedback! I'm adding the users, who responded in the user > > mailing list, to this thread. > > > > @Qingsheng - Yes, I would prefer to reuse the existing Kafka connector > > module. It makes a lot of sense since the dependencies are the same and the > > implementation can also extend and improve some of the test utilities you > > have been working on for the FLIP 27 Kafka Source. I will enumerate the > > migration steps in the FLIP template. > > > > @Ryan - I don't have a public branch available yet, but I would appreciate > > your review on the FLIP design! When the FLIP design is approved by devs > > and the community, I can start to commit our implementation to a fork. > > > > @Andrew - Yup, one of the requirements of the connector is to read > > multiple clusters within a single source, so it should be able to work well > > with your use case. > > > > @Devs - what do I need to get started on the FLIP design? I see the FLIP > > template and I have an account (mason6345), but I don't have access to > > create a page. > > > > Best, > > Mason > > > > > > > > > > On Sun, Jun 26, 2022 at 8:08 PM Qingsheng Ren wrote: > > > >> Hi Mason, > >> > >> It sounds like an exciting enhancement to the Kafka source and will > >> benefit a lot of users I believe. > >> > >> Would you prefer to reuse the existing flink-connector-kafka module or > >> create a new one for the new multi-cluster feature? 
Personally I prefer the > >> former one because users won’t need to introduce another dependency module > >> to their projects in order to use the feature. > >> > >> Thanks for the effort on this and looking forward to your FLIP! > >> > >> Best, > >> Qingsheng > >> > >> > On Jun 24, 2022, at 09:43, Mason Chen wrote: > >> > > >> > Hi community, > >> > > >> > We have been working on a Multi Cluster Kafka Source and are looking to > >> > contribute it upstream. I've given a talk about the features and design > >> at > >> > a Flink meetup: https://youtu.be/H1SYOuLcUTI. > >> > > >> > The main features that it provides is: > >> > 1. Reading multiple Kafka clusters within a single source. > >> > 2. Adjusting the clusters and topics the source consumes from > >> dynamically, > >> > without Flink job restart. > >> > > >> > Some of the challenging use cases that these features solve are: > >> > 1. Transparent Kafka cluster migration without Flink job restart. > >> > 2. Transparent Kafka topic migration without Flink job restart. > >> > 3. Direct integration with Hybrid Source. > >> > > >> > In addition, this is designed with wrapping and managing the existing > >> > KafkaSource components to enable these features, so it can continue to > >> > benefit from KafkaSource improvements and bug fixes. It can be > >> considered > >> > as a form of a composite source. > >> > > >> > I think the contribution of this source could benefit a lot of users who > >> > have asked in the mailing list about Flink handling Kafka migrations and > >> > removing topics in the past. I would love to hear and address your > >> thoughts > >> > and feedback, and if possible drive a FLIP! > >> > > >> > Best, > >> > Mason > >> > >>
RE: Re: [DISCUSS] FLIP-239: Port JDBC Connector Source to FLIP-27
Hi Boto, thanks for your reply. +1 from me on defining the watermark strategy in the 'streaming' & table source. I'm not sure whether FLIP-202[1] warrants a separate discussion, but I think your proposal is very helpful to the new source, and it would be great if the new source could be compatible with this abstraction. In addition, should we support such a special bounded-scenario abstraction? The number of JdbcSourceSplits is fixed, but when the user-defined implementation finishes generating them all is not. Once the condition that ends the split-generation process is met, no further JdbcSourceSplits are generated; after all JdbcSourceSplits have been processed, the JdbcSourceSplitEnumerator notifies the reader that there are no more splits. - [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-202%3A+Introduce+ClickHouse+Connector Best regards, Roc Marshal On 2022/06/30 09:02:23 João Boto wrote: > Hi, > > On source we could improve the JdbcParameterValuesProvider.. to be defined as > a query(s) or something more dynamic. > The most time if your job is dynamic or have some condition to be met (based > on data on table) you have to create a connection an get that info from > database > > If we are going to create/allow a "streaming" jdbc source, we should be able > to define watermark and get new data from table using that watermark.. > > > For the sink (but it could apply on source) will be great to be able to set > your implementation of the connection type.. 
For example if you are > connecting to clickhouse, be able to set a implementation based on > "BalancedClickhouseDataSource" for example (in this[1] implementation we have > a example) or set a extension version of a implementation for debug purpose > > Regards > > > [1] > https://github.com/apache/flink/pull/20097/files#diff-8b36e3403381dc14c748aeb5de0b4ceb7d7daec39594b1eacff1694b5266419d > > On 2022/06/27 13:09:51 Roc Marshal wrote: > > Hi, all, > > > > > > > > > > I would like to open a discussion on porting JDBC Source to new Source API > > (FLIP-27[1]). > > > > Martijn Visser, Jing Ge and I had a preliminary discussion on the JIRA > > FLINK-25420[2] and planed to start the discussion about the source part > > first. > > > > > > > > Please let me know: > > > > - The issues about old Jdbc source you encountered; > > - The new feature or design you want; > > - More suggestions from other dimensions... > > > > > > > > You could find more details in FLIP-239[3]. > > > > Looking forward to your feedback. > > > > > > > > > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface > > > > [2] https://issues.apache.org/jira/browse/FLINK-25420 > > > > [3] > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386271 > > > > > > > > > > Best regards, > > > > Roc Marshal >
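The bounded hand-out pattern Roc describes (a fixed set of splits whose generation may finish at an arbitrary time, after which the reader is told there are no more splits) can be sketched as follows. `ToySplitEnumerator` is a simplified illustration, not the FLIP-27 `SplitEnumerator` interface or any FLIP-239 API:

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

/** Simplified sketch of bounded split hand-out; names and behavior are illustrative. */
class ToySplitEnumerator {
    private final Queue<String> remainingSplits = new ArrayDeque<>();
    private boolean generationFinished;

    /** Register a newly generated split (e.g. one range of a JDBC query). */
    void addSplit(String split) {
        remainingSplits.add(split);
    }

    /** Called once the user-defined end-of-generation condition is met. */
    void finishGeneration() {
        generationFinished = true;
    }

    /** Hand the next split to a reader, if any is available. */
    Optional<String> nextSplit() {
        return Optional.ofNullable(remainingSplits.poll());
    }

    /** True once readers should be sent the "no more splits" notification. */
    boolean noMoreSplits() {
        return generationFinished && remainingSplits.isEmpty();
    }
}
```

The key point of the sketch is that an empty queue alone is not enough to signal completion; the enumerator must also know that generation has ended.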
Re: [DISCUSS] Contribution of Multi Cluster Kafka Source
Hi Mason, I'm sure there's a PMC (*hint*) out there who can grant you access to create a FLIP. Looking forward to it, this sounds like an improvement that users are looking forward to. Best regards, Martijn Op di 28 jun. 2022 om 09:21 schreef Mason Chen : > Hi all, > > Thanks for the feedback! I'm adding the users, who responded in the user > mailing list, to this thread. > > @Qingsheng - Yes, I would prefer to reuse the existing Kafka connector > module. It makes a lot of sense since the dependencies are the same and the > implementation can also extend and improve some of the test utilities you > have been working on for the FLIP 27 Kafka Source. I will enumerate the > migration steps in the FLIP template. > > @Ryan - I don't have a public branch available yet, but I would appreciate > your review on the FLIP design! When the FLIP design is approved by devs > and the community, I can start to commit our implementation to a fork. > > @Andrew - Yup, one of the requirements of the connector is to read > multiple clusters within a single source, so it should be able to work well > with your use case. > > @Devs - what do I need to get started on the FLIP design? I see the FLIP > template and I have an account (mason6345), but I don't have access to > create a page. > > Best, > Mason > > > > > On Sun, Jun 26, 2022 at 8:08 PM Qingsheng Ren wrote: > >> Hi Mason, >> >> It sounds like an exciting enhancement to the Kafka source and will >> benefit a lot of users I believe. >> >> Would you prefer to reuse the existing flink-connector-kafka module or >> create a new one for the new multi-cluster feature? Personally I prefer the >> former one because users won’t need to introduce another dependency module >> to their projects in order to use the feature. >> >> Thanks for the effort on this and looking forward to your FLIP! 
>> >> Best, >> Qingsheng >> >> > On Jun 24, 2022, at 09:43, Mason Chen wrote: >> > >> > Hi community, >> > >> > We have been working on a Multi Cluster Kafka Source and are looking to >> > contribute it upstream. I've given a talk about the features and design >> at >> > a Flink meetup: https://youtu.be/H1SYOuLcUTI. >> > >> > The main features that it provides is: >> > 1. Reading multiple Kafka clusters within a single source. >> > 2. Adjusting the clusters and topics the source consumes from >> dynamically, >> > without Flink job restart. >> > >> > Some of the challenging use cases that these features solve are: >> > 1. Transparent Kafka cluster migration without Flink job restart. >> > 2. Transparent Kafka topic migration without Flink job restart. >> > 3. Direct integration with Hybrid Source. >> > >> > In addition, this is designed with wrapping and managing the existing >> > KafkaSource components to enable these features, so it can continue to >> > benefit from KafkaSource improvements and bug fixes. It can be >> considered >> > as a form of a composite source. >> > >> > I think the contribution of this source could benefit a lot of users who >> > have asked in the mailing list about Flink handling Kafka migrations and >> > removing topics in the past. I would love to hear and address your >> thoughts >> > and feedback, and if possible drive a FLIP! >> > >> > Best, >> > Mason >> >>
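The "composite source" idea described above can be sketched in miniature: each split is tagged with the cluster it belongs to, so clusters and topics can be registered or dropped without restarting the job. This is only an illustrative sketch with hypothetical names, not the actual Multi Cluster Kafka Source or Flink's Source API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the composite-source idea from this thread:
// one logical source tracks per-cluster topic metadata, and every split
// remembers which cluster it came from, so clusters can be added or
// removed dynamically (e.g. during a Kafka cluster migration).
public class MultiClusterSketch {

    // A split tagged with its originating cluster.
    record ClusterSplit(String clusterId, String topicPartition) {}

    private final Map<String, List<String>> clusterToTopics = new LinkedHashMap<>();

    void addCluster(String clusterId, List<String> topicPartitions) {
        clusterToTopics.put(clusterId, topicPartitions);
    }

    void removeCluster(String clusterId) {
        clusterToTopics.remove(clusterId); // e.g. after a migration completes
    }

    // Enumerate splits across all currently registered clusters.
    List<ClusterSplit> enumerateSplits() {
        List<ClusterSplit> splits = new ArrayList<>();
        clusterToTopics.forEach((cluster, tps) ->
                tps.forEach(tp -> splits.add(new ClusterSplit(cluster, tp))));
        return splits;
    }

    public static void main(String[] args) {
        MultiClusterSketch source = new MultiClusterSketch();
        source.addCluster("kafka-old", List.of("events-0", "events-1"));
        source.addCluster("kafka-new", List.of("events-0"));
        System.out.println(source.enumerateSplits().size()); // 3
        source.removeCluster("kafka-old"); // transparent migration step
        System.out.println(source.enumerateSplits().size()); // 1
    }
}
```

The real implementation wraps existing KafkaSource components for each cluster; the sketch only shows why tagging splits with a cluster id makes dynamic add/remove possible without a job restart.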
Re: [DISCUSS] Introduce multi delete API to Flink's FileSystem class
Hi Piotr, As I said in the original email, you cannot delete folders recursively for incremental checkpoints. And if you take a close look at the original email, I have shared the experimental results, which showed a roughly 29x improvement: "A simple experiment shows that deleting 1000 objects with each 5MB size, will cost 39494ms with for-loop single delete operations, and the result will drop to 1347ms if using multi-delete API in Tencent Cloud." I think I can leverage some ideas from Dawid's work. And as I said, I would introduce the multi-delete API to the original FileSystem class instead of introducing another BulkDeletingFileSystem, which keeps the file system abstraction closer to the modern cloud-based environment. Best Yun Tang From: Piotr Nowojski Sent: Thursday, June 30, 2022 18:25 To: dev ; Dawid Wysakowicz Subject: Re: [DISCUSS] Introduce multi delete API to Flink's FileSystem class Hi, I presume this would mostly supersede the recursive deletes [1]? I remember an argument that the recursive deletes were not obviously better, even if the underlying FS was supporting it. I'm not saying that this would have been a counter argument against this effort, since every FileSystem could decide on its own whether to use the multi delete call or not. But I think at the very least it should be benchmarked/compared whether implementing it for a particular FS makes sense or not. Also there seems to be some similar (abandoned?) effort from Dawid, with named bulk deletes, with "BulkDeletingFileSystem"? [2] Isn't this basically the same thing that you are proposing Yun Tang? Best, Piotrek [1] https://issues.apache.org/jira/browse/FLINK-13856 [2] https://issues.apache.org/jira/browse/FLINK-13856?focusedCommentId=17481712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17481712 On Thu, Jun 30, 2022 at 11:45, Zakelly Lan wrote: > Hi Yun, > > Thanks for bringing this into discussion. > I'm +1 to this idea. 
> And IIUC, Flink implements the OSS and S3 filesystem based on the hadoop > filesystem interface, which does not provide the multi-delete API, it may > take some effort to implement this. > > Best, > Zakelly > > On Thu, Jun 30, 2022 at 5:36 PM Martijn Visser > wrote: > > > Hi Yun Tang, > > > > +1 for addressing this problem and your approach. > > > > Best regards, > > > > Martijn > > > > Op do 30 jun. 2022 om 11:12 schreef Feifan Wang : > > > > > Thanks a lot for the proposal @Yun Tang ! It sounds great and I can't > > > find any reason not to make this improvement. > > > > > > > > > —— > > > Name: Feifan Wang > > > Email: zoltar9...@163.com > > > > > > > > > Replied Message > > > | From | Yun Tang | > > > | Date | 06/30/2022 16:56 | > > > | To | dev@flink.apache.org | > > > | Subject | [DISCUSS] Introduce multi delete API to Flink's FileSystem > > > class | > > > Hi guys, > > > > > > As more and more teams move to cloud-based environments. Cloud object > > > storage has become the factual technical standard for big data > > ecosystems. > > > From our experience, the performance of writing/deleting objects in > > object > > > storage could vary in each call, the FLIP of changelog state-backend > had > > > ever taken experiments to verify the performance of writing the same > data > > > with multi times [1], and it proves that p999 latency could be 8x than > > p50 > > > latency. This is also true for delete operations. > > > > > > Currently, after introducing the checkpoint backpressure mechanism[2], > > the > > > newly triggered checkpoint could be delayed due to not cleaning > > checkpoints > > > as fast as possible [3]. > > > Moreover, Flink's checkpoint cleanup mechanism cannot leverage deleting > > > folder API to speed up the procedure with incremental checkpoints[4]. > > > This is extremely obvious in cloud object storage, and all most all > > object > > > storage SDKs have multi-delete API to accelerate the performance, e.g. 
> > AWS > > > S3 [5], Aliyun OSS [6], and Tencentyun COS [7]. > > > A simple experiment shows that deleting 1000 objects with each 5MB > size, > > > will cost 39494ms with for-loop single delete operations, and the > result > > > will drop to 1347ms if using multi-delete API in Tencent Cloud. > > > > > > However, Flink's FileSystem API refers to the HDFS's FileSystem API and > > > lacks such a multi-delete API, which is somehow outdated currently in > > > cloud-based environments. > > > Thus I suggest adding such a multi-delete API to Flink's FileSystem[8] > > > class and file systems that do not support such a multi-delete feature > > will > > > roll back to a for-loop single delete. > > > By doing so, we can at least accelerate the speed of discarding > > > checkpoints in cloud environments. > > > > > > WDYT? > > > > > > > > > [1] > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints#FLIP158:Generalizedincrementalcheckpoints-DF
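The fallback behavior proposed in this thread (multi-delete where supported, a for-loop of single deletes otherwise) can be sketched with a default method on the abstraction. Method names here are hypothetical, not the actual Flink FileSystem API.

```java
import java.io.IOException;
import java.util.List;

// Sketch of the proposal: add a bulk delete to the FileSystem
// abstraction with a default implementation that falls back to a
// for-loop of single deletes. Object stores (S3, OSS, COS) would
// override deleteAll with their native multi-delete call.
interface SketchFileSystem {
    boolean delete(String path) throws IOException;

    // Default: per-path deletes; returns how many paths were deleted.
    default int deleteAll(List<String> paths) throws IOException {
        int deleted = 0;
        for (String p : paths) {
            if (delete(p)) {
                deleted++;
            }
        }
        return deleted;
    }
}

public class MultiDeleteSketch {
    public static void main(String[] args) throws IOException {
        // A fake "object store": delete == remove from an in-memory set.
        java.util.Set<String> store = new java.util.HashSet<>(
                List.of("chk-1/a", "chk-1/b", "chk-1/c"));
        SketchFileSystem fs = store::remove;
        int n = fs.deleteAll(List.of("chk-1/a", "chk-1/b", "missing"));
        System.out.println(n);     // 2
        System.out.println(store); // [chk-1/c]
    }
}
```

Because the default method preserves today's per-file behavior, only file systems backed by a native multi-delete API need to override it.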
[jira] [Created] (FLINK-28330) Remove old delegation token framework code when new is working fine
Gabor Somogyi created FLINK-28330: - Summary: Remove old delegation token framework code when new is working fine Key: FLINK-28330 URL: https://issues.apache.org/jira/browse/FLINK-28330 Project: Flink Issue Type: Sub-task Affects Versions: 1.16.0 Reporter: Gabor Somogyi -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-28329) List top 15 biggest directories in terms of used disk space
Martijn Visser created FLINK-28329: -- Summary: List top 15 biggest directories in terms of used disk space Key: FLINK-28329 URL: https://issues.apache.org/jira/browse/FLINK-28329 Project: Flink Issue Type: Improvement Components: Test Infrastructure, Tests Reporter: Martijn Visser Assignee: Martijn Visser We are in a situation where a lot of disk space gets used by both the Bash and Java E2E tests. In order to identify which tests aren't properly cleaning up, it would be good to output the 15 directories that use the most disk space -- This message was sent by Atlassian Jira (v8.20.10#820010)
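The diagnostic this ticket asks for can be sketched as follows. In a CI shell script this would more likely be a `du`/`sort`/`head` pipeline; the Java version below is only a self-contained illustration with hypothetical names.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Sketch: sum file sizes per top-level directory under a root and
// report the biggest ones, largest first.
public class TopDirectories {

    static LinkedHashMap<Path, Long> biggest(Path root, int limit) throws IOException {
        Map<Path, Long> sizes = new HashMap<>();
        try (Stream<Path> files = Files.walk(root)) {
            for (Path f : (Iterable<Path>) files.filter(Files::isRegularFile)::iterator) {
                // attribute the file's size to its top-level directory under root
                Path top = root.relativize(f).getName(0);
                sizes.merge(top, Files.size(f), Long::sum);
            }
        }
        return sizes.entrySet().stream()
                .sorted(Map.Entry.<Path, Long>comparingByValue().reversed())
                .limit(limit)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("e2e");
        Files.createDirectories(root.resolve("big"));
        Files.createDirectories(root.resolve("small"));
        Files.write(root.resolve("big").resolve("data"), new byte[4096]);
        Files.write(root.resolve("small").resolve("data"), new byte[16]);
        biggest(root, 15).forEach((dir, size) ->
                System.out.println(dir + " " + size));
    }
}
```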
[jira] [Created] (FLINK-28328) RescaleCheckpointManuallyITCase.testCheckpointRescalingInKeyedState failed with IllegalStateException
Huang Xingbo created FLINK-28328: Summary: RescaleCheckpointManuallyITCase.testCheckpointRescalingInKeyedState failed with IllegalStateException Key: FLINK-28328 URL: https://issues.apache.org/jira/browse/FLINK-28328 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.16.0 Reporter: Huang Xingbo {code:java} 2022-06-30T10:24:44.5149015Z Jun 30 10:24:44 java.util.concurrent.ExecutionException: org.apache.flink.runtime.checkpoint.CheckpointException: Trigger checkpoint failure. 2022-06-30T10:24:44.5165889Z Jun 30 10:24:44at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 2022-06-30T10:24:44.5174822Z Jun 30 10:24:44at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) 2022-06-30T10:24:44.5176702Z Jun 30 10:24:44at org.apache.flink.test.checkpointing.RescaleCheckpointManuallyITCase.runJobAndGetCheckpoint(RescaleCheckpointManuallyITCase.java:196) 2022-06-30T10:24:44.5178545Z Jun 30 10:24:44at org.apache.flink.test.checkpointing.RescaleCheckpointManuallyITCase.testCheckpointRescalingKeyedState(RescaleCheckpointManuallyITCase.java:137) 2022-06-30T10:24:44.5180318Z Jun 30 10:24:44at org.apache.flink.test.checkpointing.RescaleCheckpointManuallyITCase.testCheckpointRescalingInKeyedState(RescaleCheckpointManuallyITCase.java:115) 2022-06-30T10:24:44.5181746Z Jun 30 10:24:44at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2022-06-30T10:24:44.5183196Z Jun 30 10:24:44at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 2022-06-30T10:24:44.5184703Z Jun 30 10:24:44at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2022-06-30T10:24:44.5185708Z Jun 30 10:24:44at java.lang.reflect.Method.invoke(Method.java:498) 2022-06-30T10:24:44.5186854Z Jun 30 10:24:44at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) 2022-06-30T10:24:44.5188130Z Jun 30 10:24:44at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) 2022-06-30T10:24:44.5189317Z Jun 30 10:24:44at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) 2022-06-30T10:24:44.5190508Z Jun 30 10:24:44at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) 2022-06-30T10:24:44.5191745Z Jun 30 10:24:44at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 2022-06-30T10:24:44.5193308Z Jun 30 10:24:44at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 2022-06-30T10:24:44.5194728Z Jun 30 10:24:44at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) 2022-06-30T10:24:44.5195872Z Jun 30 10:24:44at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61) 2022-06-30T10:24:44.5196823Z Jun 30 10:24:44at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 2022-06-30T10:24:44.5197864Z Jun 30 10:24:44at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) 2022-06-30T10:24:44.5198838Z Jun 30 10:24:44at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) 2022-06-30T10:24:44.5199856Z Jun 30 10:24:44at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) 2022-06-30T10:24:44.5201014Z Jun 30 10:24:44at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) 2022-06-30T10:24:44.5202053Z Jun 30 10:24:44at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) 2022-06-30T10:24:44.5203015Z Jun 30 10:24:44at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) 2022-06-30T10:24:44.5204282Z Jun 30 10:24:44at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) 2022-06-30T10:24:44.5205225Z Jun 30 10:24:44at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) 2022-06-30T10:24:44.5206196Z Jun 30 10:24:44at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) 
2022-06-30T10:24:44.5207234Z Jun 30 10:24:44at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54) 2022-06-30T10:24:44.5208253Z Jun 30 10:24:44at org.junit.rules.RunRules.evaluate(RunRules.java:20) 2022-06-30T10:24:44.5209332Z Jun 30 10:24:44at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 2022-06-30T10:24:44.5210340Z Jun 30 10:24:44at org.junit.runners.ParentRunner.run(ParentRunner.java:413) 2022-06-30T10:24:44.5211276Z Jun 30 10:24:44at org.junit.runner.JUnitCore.run(JUnitCore.java:137) 2022-06-30T10:24:44.5212212Z Jun 30 10:24:44at org.junit.runner.JUnitCore.run(JUnitCore.java:115) 2022-06-30T10:24:44.5213266Z Jun 30 10:24:44at org.junit.vintage.engine.execution.RunnerExecutor.execute(Runn
[jira] [Created] (FLINK-28327) Make table store src codes compiled with Flink 1.14
Jingsong Lee created FLINK-28327: Summary: Make table store src codes compiled with Flink 1.14 Key: FLINK-28327 URL: https://issues.apache.org/jira/browse/FLINK-28327 Project: Flink Issue Type: Sub-task Components: Table Store Reporter: Jingsong Lee Assignee: Jingsong Lee Fix For: table-store-0.2.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-28326) ResultPartitionTest.testIdleAndBackPressuredTime failed with AssertError
Huang Xingbo created FLINK-28326: Summary: ResultPartitionTest.testIdleAndBackPressuredTime failed with AssertError Key: FLINK-28326 URL: https://issues.apache.org/jira/browse/FLINK-28326 Project: Flink Issue Type: Bug Components: Runtime / Network Affects Versions: 1.16.0 Reporter: Huang Xingbo {code:java} 2022-06-30T09:23:24.0469768Z Jun 30 09:23:24 [INFO] 2022-06-30T09:23:24.0470382Z Jun 30 09:23:24 [ERROR] Failures: 2022-06-30T09:23:24.0471581Z Jun 30 09:23:24 [ERROR] ResultPartitionTest.testIdleAndBackPressuredTime:414 2022-06-30T09:23:24.0472898Z Jun 30 09:23:24 Expected: a value greater than <0L> 2022-06-30T09:23:24.0474090Z Jun 30 09:23:24 but: <0L> was equal to <0L> {code} https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=37406&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8 -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [DISCUSS] Introduce multi delete API to Flink's FileSystem class
Hi, I presume this would mostly supersede the recursive deletes [1]? I remember an argument that the recursive deletes were not obviously better, even if the underlying FS was supporting it. I'm not saying that this would have been a counter argument against this effort, since every FileSystem could decide on its own whether to use the multi delete call or not. But I think at the very least it should be benchmarked/compared whether implementing it for a particular FS makes sense or not. Also there seems to be some similar (abandoned?) effort from Dawid, with named bulk deletes, with "BulkDeletingFileSystem"? [2] Isn't this basically the same thing that you are proposing Yun Tang? Best, Piotrek [1] https://issues.apache.org/jira/browse/FLINK-13856 [2] https://issues.apache.org/jira/browse/FLINK-13856?focusedCommentId=17481712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17481712 czw., 30 cze 2022 o 11:45 Zakelly Lan napisał(a): > Hi Yun, > > Thanks for bringing this into discussion. > I'm +1 to this idea. > And IIUC, Flink implements the OSS and S3 filesystem based on the hadoop > filesystem interface, which does not provide the multi-delete API, it may > take some effort to implement this. > > Best, > Zakelly > > On Thu, Jun 30, 2022 at 5:36 PM Martijn Visser > wrote: > > > Hi Yun Tang, > > > > +1 for addressing this problem and your approach. > > > > Best regards, > > > > Martijn > > > > Op do 30 jun. 2022 om 11:12 schreef Feifan Wang : > > > > > Thanks a lot for the proposal @Yun Tang ! It sounds great and I can't > > > find any reason not to make this improvement. 
> > > > > > > > > —— > > > Name: Feifan Wang > > > Email: zoltar9...@163.com > > > > > > > > > Replied Message > > > | From | Yun Tang | > > > | Date | 06/30/2022 16:56 | > > > | To | dev@flink.apache.org | > > > | Subject | [DISCUSS] Introduce multi delete API to Flink's FileSystem > > > class | > > > Hi guys, > > > > > > As more and more teams move to cloud-based environments. Cloud object > > > storage has become the factual technical standard for big data > > ecosystems. > > > From our experience, the performance of writing/deleting objects in > > object > > > storage could vary in each call, the FLIP of changelog state-backend > had > > > ever taken experiments to verify the performance of writing the same > data > > > with multi times [1], and it proves that p999 latency could be 8x than > > p50 > > > latency. This is also true for delete operations. > > > > > > Currently, after introducing the checkpoint backpressure mechanism[2], > > the > > > newly triggered checkpoint could be delayed due to not cleaning > > checkpoints > > > as fast as possible [3]. > > > Moreover, Flink's checkpoint cleanup mechanism cannot leverage deleting > > > folder API to speed up the procedure with incremental checkpoints[4]. > > > This is extremely obvious in cloud object storage, and all most all > > object > > > storage SDKs have multi-delete API to accelerate the performance, e.g. > > AWS > > > S3 [5], Aliyun OSS [6], and Tencentyun COS [7]. > > > A simple experiment shows that deleting 1000 objects with each 5MB > size, > > > will cost 39494ms with for-loop single delete operations, and the > result > > > will drop to 1347ms if using multi-delete API in Tencent Cloud. > > > > > > However, Flink's FileSystem API refers to the HDFS's FileSystem API and > > > lacks such a multi-delete API, which is somehow outdated currently in > > > cloud-based environments. 
> > > Thus I suggest adding such a multi-delete API to Flink's FileSystem[8] > > > class and file systems that do not support such a multi-delete feature > > will > > > roll back to a for-loop single delete. > > > By doing so, we can at least accelerate the speed of discarding > > > checkpoints in cloud environments. > > > > > > WDYT? > > > > > > > > > [1] > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints#FLIP158:Generalizedincrementalcheckpoints-DFSwritelatency > > > [2] https://issues.apache.org/jira/browse/FLINK-17073 > > > [3] https://issues.apache.org/jira/browse/FLINK-26590 > > > [4] > > > > > > https://github.com/apache/flink/blob/1486fee1acd9cd1e340f6d2007f723abd20294e5/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CompletedCheckpoint.java#L315 > > > [5] > > > > > > https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-multiple-objects.html > > > [6] > > > > > > https://www.alibabacloud.com/help/en/object-storage-service/latest/delete-objects-8#section-v6n-zym-tax > > > [7] > > > > > > https://intl.cloud.tencent.com/document/product/436/44018#delete-objects-in-batch > > > [8] > > > > > > https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java > > > > > > > > > Best > > > Yun Tang > > > > > > > > >
[jira] [Created] (FLINK-28325) DataOutputSerializer#writeBytes increase position twice
huweihua created FLINK-28325: Summary: DataOutputSerializer#writeBytes increase position twice Key: FLINK-28325 URL: https://issues.apache.org/jira/browse/FLINK-28325 Project: Flink Issue Type: Bug Components: Runtime / Task Reporter: huweihua Attachments: image-2022-06-30-18-14-50-827.png, image-2022-06-30-18-15-18-590.png Hi, I was looking at the code and found that DataOutputSerializer.writeBytes increases the position twice. I believe this is a bug; please let me know if it is intentional. See org.apache.flink.core.memory.DataOutputSerializer#writeBytes (attached screenshots). -- This message was sent by Atlassian Jira (v8.20.10#820010)
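The pattern the ticket describes can be modeled in a few lines. This is a deliberately simplified illustration of the suspected double-increment, not Flink's actual DataOutputSerializer code.

```java
// Minimal model of the bug pattern in FLINK-28325: if writeBytes both
// advances the position once per written byte AND adds the length
// afterwards, the position ends up counted twice.
public class DoubleIncrementSketch {
    private final byte[] buffer = new byte[64];
    private int position;

    void write(int b) {
        buffer[position++] = (byte) b; // advances position once per byte
    }

    void writeBytesBuggy(String s) {
        int len = s.length();
        for (int i = 0; i < len; i++) {
            write(s.charAt(i)); // position already advanced here...
        }
        position += len;        // ...and advanced again: double counting
    }

    public static void main(String[] args) {
        DoubleIncrementSketch out = new DoubleIncrementSketch();
        out.writeBytesBuggy("flink");
        // 5 bytes written, but position reports 10
        System.out.println(out.position);
    }
}
```

The fix in such a pattern is to advance the position in exactly one place, either in the per-byte write or in the bulk method, but not both.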
Re: [DISCUSS] Introduce multi delete API to Flink's FileSystem class
Hi Yun, Thanks for bringing this into discussion. I'm +1 to this idea. And IIUC, Flink implements the OSS and S3 filesystem based on the hadoop filesystem interface, which does not provide the multi-delete API, it may take some effort to implement this. Best, Zakelly On Thu, Jun 30, 2022 at 5:36 PM Martijn Visser wrote: > Hi Yun Tang, > > +1 for addressing this problem and your approach. > > Best regards, > > Martijn > > Op do 30 jun. 2022 om 11:12 schreef Feifan Wang : > > > Thanks a lot for the proposal @Yun Tang ! It sounds great and I can't > > find any reason not to make this improvement. > > > > > > —— > > Name: Feifan Wang > > Email: zoltar9...@163.com > > > > > > Replied Message > > | From | Yun Tang | > > | Date | 06/30/2022 16:56 | > > | To | dev@flink.apache.org | > > | Subject | [DISCUSS] Introduce multi delete API to Flink's FileSystem > > class | > > Hi guys, > > > > As more and more teams move to cloud-based environments. Cloud object > > storage has become the factual technical standard for big data > ecosystems. > > From our experience, the performance of writing/deleting objects in > object > > storage could vary in each call, the FLIP of changelog state-backend had > > ever taken experiments to verify the performance of writing the same data > > with multi times [1], and it proves that p999 latency could be 8x than > p50 > > latency. This is also true for delete operations. > > > > Currently, after introducing the checkpoint backpressure mechanism[2], > the > > newly triggered checkpoint could be delayed due to not cleaning > checkpoints > > as fast as possible [3]. > > Moreover, Flink's checkpoint cleanup mechanism cannot leverage deleting > > folder API to speed up the procedure with incremental checkpoints[4]. > > This is extremely obvious in cloud object storage, and all most all > object > > storage SDKs have multi-delete API to accelerate the performance, e.g. > AWS > > S3 [5], Aliyun OSS [6], and Tencentyun COS [7]. 
> > A simple experiment shows that deleting 1000 objects with each 5MB size, > > will cost 39494ms with for-loop single delete operations, and the result > > will drop to 1347ms if using multi-delete API in Tencent Cloud. > > > > However, Flink's FileSystem API refers to the HDFS's FileSystem API and > > lacks such a multi-delete API, which is somehow outdated currently in > > cloud-based environments. > > Thus I suggest adding such a multi-delete API to Flink's FileSystem[8] > > class and file systems that do not support such a multi-delete feature > will > > roll back to a for-loop single delete. > > By doing so, we can at least accelerate the speed of discarding > > checkpoints in cloud environments. > > > > WDYT? > > > > > > [1] > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints#FLIP158:Generalizedincrementalcheckpoints-DFSwritelatency > > [2] https://issues.apache.org/jira/browse/FLINK-17073 > > [3] https://issues.apache.org/jira/browse/FLINK-26590 > > [4] > > > https://github.com/apache/flink/blob/1486fee1acd9cd1e340f6d2007f723abd20294e5/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CompletedCheckpoint.java#L315 > > [5] > > > https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-multiple-objects.html > > [6] > > > https://www.alibabacloud.com/help/en/object-storage-service/latest/delete-objects-8#section-v6n-zym-tax > > [7] > > > https://intl.cloud.tencent.com/document/product/436/44018#delete-objects-in-batch > > [8] > > > https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java > > > > > > Best > > Yun Tang > > > > >
[jira] [Created] (FLINK-28324) [JUnit5 Migration] Module: flink-sql-client
zl created FLINK-28324: -- Summary: [JUnit5 Migration] Module: flink-sql-client Key: FLINK-28324 URL: https://issues.apache.org/jira/browse/FLINK-28324 Project: Flink Issue Type: Improvement Components: Table SQL / Client Reporter: zl -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [DISCUSS] Introduce multi delete API to Flink's FileSystem class
Hi Yun Tang, +1 for addressing this problem and your approach. Best regards, Martijn Op do 30 jun. 2022 om 11:12 schreef Feifan Wang : > Thanks a lot for the proposal @Yun Tang ! It sounds great and I can't > find any reason not to make this improvement. > > > —— > Name: Feifan Wang > Email: zoltar9...@163.com > > > Replied Message > | From | Yun Tang | > | Date | 06/30/2022 16:56 | > | To | dev@flink.apache.org | > | Subject | [DISCUSS] Introduce multi delete API to Flink's FileSystem > class | > Hi guys, > > As more and more teams move to cloud-based environments. Cloud object > storage has become the factual technical standard for big data ecosystems. > From our experience, the performance of writing/deleting objects in object > storage could vary in each call, the FLIP of changelog state-backend had > ever taken experiments to verify the performance of writing the same data > with multi times [1], and it proves that p999 latency could be 8x than p50 > latency. This is also true for delete operations. > > Currently, after introducing the checkpoint backpressure mechanism[2], the > newly triggered checkpoint could be delayed due to not cleaning checkpoints > as fast as possible [3]. > Moreover, Flink's checkpoint cleanup mechanism cannot leverage deleting > folder API to speed up the procedure with incremental checkpoints[4]. > This is extremely obvious in cloud object storage, and all most all object > storage SDKs have multi-delete API to accelerate the performance, e.g. AWS > S3 [5], Aliyun OSS [6], and Tencentyun COS [7]. > A simple experiment shows that deleting 1000 objects with each 5MB size, > will cost 39494ms with for-loop single delete operations, and the result > will drop to 1347ms if using multi-delete API in Tencent Cloud. > > However, Flink's FileSystem API refers to the HDFS's FileSystem API and > lacks such a multi-delete API, which is somehow outdated currently in > cloud-based environments. 
> Thus I suggest adding such a multi-delete API to Flink's FileSystem[8] > class and file systems that do not support such a multi-delete feature will > roll back to a for-loop single delete. > By doing so, we can at least accelerate the speed of discarding > checkpoints in cloud environments. > > WDYT? > > > [1] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints#FLIP158:Generalizedincrementalcheckpoints-DFSwritelatency > [2] https://issues.apache.org/jira/browse/FLINK-17073 > [3] https://issues.apache.org/jira/browse/FLINK-26590 > [4] > https://github.com/apache/flink/blob/1486fee1acd9cd1e340f6d2007f723abd20294e5/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CompletedCheckpoint.java#L315 > [5] > https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-multiple-objects.html > [6] > https://www.alibabacloud.com/help/en/object-storage-service/latest/delete-objects-8#section-v6n-zym-tax > [7] > https://intl.cloud.tencent.com/document/product/436/44018#delete-objects-in-batch > [8] > https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java > > > Best > Yun Tang > >
Re:[DISCUSS] Introduce multi delete API to Flink's FileSystem class
Thanks a lot for the proposal @Yun Tang ! It sounds great and I can't find any reason not to make this improvement. —— Name: Feifan Wang Email: zoltar9...@163.com Replied Message | From | Yun Tang | | Date | 06/30/2022 16:56 | | To | dev@flink.apache.org | | Subject | [DISCUSS] Introduce multi delete API to Flink's FileSystem class | Hi guys, As more and more teams move to cloud-based environments. Cloud object storage has become the factual technical standard for big data ecosystems. From our experience, the performance of writing/deleting objects in object storage could vary in each call, the FLIP of changelog state-backend had ever taken experiments to verify the performance of writing the same data with multi times [1], and it proves that p999 latency could be 8x than p50 latency. This is also true for delete operations. Currently, after introducing the checkpoint backpressure mechanism[2], the newly triggered checkpoint could be delayed due to not cleaning checkpoints as fast as possible [3]. Moreover, Flink's checkpoint cleanup mechanism cannot leverage deleting folder API to speed up the procedure with incremental checkpoints[4]. This is extremely obvious in cloud object storage, and all most all object storage SDKs have multi-delete API to accelerate the performance, e.g. AWS S3 [5], Aliyun OSS [6], and Tencentyun COS [7]. A simple experiment shows that deleting 1000 objects with each 5MB size, will cost 39494ms with for-loop single delete operations, and the result will drop to 1347ms if using multi-delete API in Tencent Cloud. However, Flink's FileSystem API refers to the HDFS's FileSystem API and lacks such a multi-delete API, which is somehow outdated currently in cloud-based environments. Thus I suggest adding such a multi-delete API to Flink's FileSystem[8] class and file systems that do not support such a multi-delete feature will roll back to a for-loop single delete. 
By doing so, we can at least accelerate the speed of discarding checkpoints in cloud environments. WDYT? [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints#FLIP158:Generalizedincrementalcheckpoints-DFSwritelatency [2] https://issues.apache.org/jira/browse/FLINK-17073 [3] https://issues.apache.org/jira/browse/FLINK-26590 [4] https://github.com/apache/flink/blob/1486fee1acd9cd1e340f6d2007f723abd20294e5/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CompletedCheckpoint.java#L315 [5] https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-multiple-objects.html [6] https://www.alibabacloud.com/help/en/object-storage-service/latest/delete-objects-8#section-v6n-zym-tax [7] https://intl.cloud.tencent.com/document/product/436/44018#delete-objects-in-batch [8] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java Best Yun Tang
Re: [DISCUSS] FLIP-239: Port JDBC Connector Source to FLIP-27
Hi, On the source side we could improve the JdbcParameterValuesProvider so it can be defined as a query (or queries) or something more dynamic. Most of the time, if your job is dynamic or has some condition to be met (based on data in a table), you have to create a connection and fetch that info from the database. If we are going to create/allow a "streaming" JDBC source, we should be able to define a watermark and fetch new data from the table using that watermark. For the sink (but it could also apply to the source), it would be great to be able to set your own implementation of the connection type. For example, when connecting to ClickHouse, you could set an implementation based on "BalancedClickhouseDataSource" (this [1] implementation contains an example) or set an extended version of an implementation for debugging purposes. Regards [1] https://github.com/apache/flink/pull/20097/files#diff-8b36e3403381dc14c748aeb5de0b4ceb7d7daec39594b1eacff1694b5266419d On 2022/06/27 13:09:51 Roc Marshal wrote: > Hi, all, > > > > > I would like to open a discussion on porting JDBC Source to new Source API > (FLIP-27[1]). > > Martijn Visser, Jing Ge and I had a preliminary discussion on the JIRA > FLINK-25420[2] and planned to start the discussion about the source part first. > > > > Please let me know: > > - The issues about the old Jdbc source you encountered; > - The new feature or design you want; > - More suggestions from other dimensions... > > > > You could find more details in FLIP-239[3]. > > Looking forward to your feedback. > > > > > [1] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface > > [2] https://issues.apache.org/jira/browse/FLINK-25420 > > [3] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386271 > > > > > Best regards, > > Roc Marshal
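The "streaming JDBC source with a watermark" idea raised above can be sketched as follows: the source keeps the maximum value of a watermark column seen so far and derives the next polling query from it. All class and method names here are hypothetical, not part of the proposed FLIP-239 API.

```java
// Sketch: derive an incremental polling query from a stored watermark.
public class IncrementalQuerySketch {
    private final String table;
    private final String watermarkColumn;
    private long watermark = Long.MIN_VALUE;

    IncrementalQuerySketch(String table, String watermarkColumn) {
        this.table = table;
        this.watermarkColumn = watermarkColumn;
    }

    // Parameterized query: only rows newer than the last watermark,
    // with the watermark bound as the single query parameter.
    String nextQuery() {
        return "SELECT * FROM " + table + " WHERE " + watermarkColumn + " > ?";
    }

    // Advance the watermark after a fetch round; never move it backwards.
    void advanceTo(long newMax) {
        watermark = Math.max(watermark, newMax);
    }

    long currentWatermark() {
        return watermark;
    }

    public static void main(String[] args) {
        IncrementalQuerySketch src = new IncrementalQuerySketch("orders", "updated_at");
        System.out.println(src.nextQuery());
        src.advanceTo(1656576000L);
        src.advanceTo(1656575000L); // an older value must not move the watermark back
        System.out.println(src.currentWatermark());
    }
}
```

In a real source the watermark would have to be part of the checkpointed state so that a restore continues from the last committed position.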
[DISCUSS] Introduce multi delete API to Flink's FileSystem class
Hi guys, As more and more teams move to cloud-based environments, cloud object storage has become the de facto standard for big data ecosystems. From our experience, the performance of writing/deleting objects in object storage can vary per call; the FLIP for the changelog state backend ran experiments writing the same data multiple times [1], showing that p999 latency can be 8x the p50 latency. The same holds for delete operations. Currently, after the introduction of the checkpoint backpressure mechanism [2], a newly triggered checkpoint can be delayed because old checkpoints are not cleaned up fast enough [3]. Moreover, Flink's checkpoint cleanup mechanism cannot leverage a delete-folder API to speed up the procedure with incremental checkpoints [4]. This is especially noticeable in cloud object storage, and almost all object storage SDKs provide a multi-delete API to accelerate performance, e.g. AWS S3 [5], Aliyun OSS [6], and Tencent Cloud COS [7]. A simple experiment shows that deleting 1000 objects of 5MB each takes 39494ms with for-loop single delete operations, dropping to 1347ms when using the multi-delete API in Tencent Cloud. However, Flink's FileSystem API follows HDFS's FileSystem API and lacks such a multi-delete API, which is somewhat outdated in today's cloud-based environments. Thus I suggest adding a multi-delete API to Flink's FileSystem [8] class; file systems that do not support multi-delete would fall back to a for-loop of single deletes. By doing so, we can at least accelerate the discarding of checkpoints in cloud environments. WDYT? 
[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints#FLIP158:Generalizedincrementalcheckpoints-DFSwritelatency
[2] https://issues.apache.org/jira/browse/FLINK-17073
[3] https://issues.apache.org/jira/browse/FLINK-26590
[4] https://github.com/apache/flink/blob/1486fee1acd9cd1e340f6d2007f723abd20294e5/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CompletedCheckpoint.java#L315
[5] https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-multiple-objects.html
[6] https://www.alibabacloud.com/help/en/object-storage-service/latest/delete-objects-8#section-v6n-zym-tax
[7] https://intl.cloud.tencent.com/document/product/436/44018#delete-objects-in-batch
[8] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/fs/FileSystem.java

Best
Yun Tang
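The proposed fallback behaviour (one batched call where the backing store supports it, a for-loop of single deletes everywhere else) can be sketched as a default method. This is an illustrative stand-in, not the actual Flink API: the interface name, the method name deleteAll, and the String-based paths are simplified stand-ins for org.apache.flink.core.fs.FileSystem and Path.

```java
import java.io.IOException;
import java.util.List;

// Local stand-in mirroring the relevant shape of org.apache.flink.core.fs.FileSystem.
// The deleteAll default method is the hypothetical multi-delete entry point: object
// store implementations (S3, OSS, COS) would override it with one batched SDK call,
// while all other file systems inherit this for-loop fallback.
interface BatchDeletingFileSystem {
    // Single delete, as in the existing FileSystem API.
    boolean delete(String path, boolean recursive) throws IOException;

    // Hypothetical multi-delete. Returns true only if every delete succeeded.
    default boolean deleteAll(List<String> paths, boolean recursive) throws IOException {
        boolean allDeleted = true;
        for (String path : paths) {
            allDeleted &= delete(path, recursive);
        }
        return allDeleted;
    }
}
```

An S3-backed implementation would override deleteAll with a single DeleteObjects request per batch of up to 1000 keys, turning N round trips into roughly N/1000.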
Re: Re:Re: [DISCUSS] FLIP-218: Support SELECT clause in CREATE TABLE(CTAS)
Hi Mang,

I have two questions/remarks:

1. The FLIP mentions that if the user doesn't specify the WITH option part in the query of the sink table, it will be assumed that the user wants to create a managed table. What will happen if the user doesn't have Table Store configured/installed? Will we throw an error?
2. Will there be support included for FLIP-190 (version upgrades)?

Best regards,

Martijn

Op wo 29 jun. 2022 om 05:18 schreef Mang Zhang :

> Hi everyone,
> Thank you to all those who participated in the discussion. We have gone through many rounds, and the proposal has been gradually revised and improved. We look forward to further feedback and will launch a vote in the next day or two.
>
>
> --
>
> Best regards,
> Mang Zhang
>
>
> At 2022-06-28 22:23:16, "Mang Zhang" wrote:
> >Hi Yuxia,
> >Thank you very much for your reply.
> >
> >>1: Also, the mixture of ctas and rtas confuses me, as the FLIP says nothing about rtas but then suddenly refers to it in the configuration. And if we're not going to implement rtas in this FLIP, it may be better not to refer to it, and `rtas` shouldn't be exposed to users as a configuration option.
> >RTAS is currently not supported because the unification of its semantics between stream and batch mode, as well as the concrete business scenarios, are not yet clear. We will support it in the future. If we named the option without rtas now and renamed it once RTAS is supported, that would impose a configuration-migration cost on users.
> >>2: How will the CTASJobStatusHook be passed to StreamGraph as a hook? Could you please explain it? Some pseudocode would be much better if possible. I'm lost in this part.
> >This part is too much of an implementation detail, and of course we had to make some changes to achieve it. The FLIP focuses on semantic consistency between stream and batch mode, and can provide optional atomicity support.
> >
> >>3: The name `AtomicCatalog` confuses me. It seems the background for the naming is that, to implement atomicity for ctas, you propose an interface for the catalog to support serialization, and you name it `AtomicCatalog`. At least, the interface is for the atomicity of ctas. But if we want to implement other features like isolation, which may also require a serializable catalog in the future, should we introduce a new interface named `IsolateCatalog`? Have you ever considered other names like `SerializableCatalog`? As it's a public interface, maybe we should be careful about the name.
> >Regarding the name of the Catalog interface: we also discussed the name `SerializableCatalog`, but it is too specific and does not relate to the atomic functionality we want to express. For CTAS/RTAS to support atomicity, the Catalog needs to implement `AtomicCatalog`, so that name is more straightforward to understand.
> >
> >Hope this answers your question.
> >
> >
> >--
> >
> >Best regards,
> >Mang Zhang
> >
> >
> >At 2022-06-28 11:36:51, "yuxia" wrote:
> >>Thanks for updating. The FLIP looks generally good to me. I have only minor questions:
> >>
> >>1: Also, the mixture of ctas and rtas confuses me, as the FLIP says nothing about rtas but then suddenly refers to it in the configuration. And if we're not going to implement rtas in this FLIP, it may be better not to refer to it, and `rtas` shouldn't be exposed to users as a configuration option.
> >>
> >>2: How will the CTASJobStatusHook be passed to StreamGraph as a hook? Could you please explain it? Some pseudocode would be much better if possible. I'm lost in this part.
> >>
> >>3: The name `AtomicCatalog` confuses me. It seems the background for the naming is that, to implement atomicity for ctas, you propose an interface for the catalog to support serialization, and you name it `AtomicCatalog`. At least, the interface is for the atomicity of ctas. But if we want to implement other features like isolation, which may also require a serializable catalog in the future, should we introduce a new interface named `IsolateCatalog`? Have you ever considered other names like `SerializableCatalog`? As it's a public interface, maybe we should be careful about the name.
> >>
> >>
> >>Best regards,
> >>Yuxia
> >>
> >>----- Original Message -----
> >>From: "Mang Zhang"
> >>To: "dev"
> >>Cc: imj...@gmail.com
> >>Sent: Monday, June 27, 2022, 5:43:50 PM
> >>Subject: Re:Re: Re:Re: Re: Re: Re: [DISCUSS] FLIP-218: Support SELECT clause in CREATE TABLE(CTAS)
> >>
> >>Hi Jark,
> >>First of all, thank you for your very good advice!
> >>The RTAS point you mentioned is a good one, and we should support it as well.
> >>However, by investigating the semantics of RTAS and how RTAS is used within the company, I found that:
> >>1. The semantics of RTAS say that if the table exists, the old data needs to be deleted and replaced with the new data.
> >>This is easier to implement in batch mode; for example, if the target table is a Hive table, the old data files can be deleted directly.
> >>But in Streaming mode, the target table i
[jira] [Created] (FLINK-28323) Support using new KafkaSource in PyFlink
Juntao Hu created FLINK-28323:
---------------------------------

             Summary: Support using new KafkaSource in PyFlink
                 Key: FLINK-28323
                 URL: https://issues.apache.org/jira/browse/FLINK-28323
             Project: Flink
          Issue Type: New Feature
          Components: API / Python
    Affects Versions: 1.15.0
            Reporter: Juntao Hu
             Fix For: 1.16.0


KafkaSource implements the new Source API, so it should also be introduced to the Python API, allowing other APIs, e.g. HybridSource, to use it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
Re: [VOTE] Release 1.15.1, release candidate #1
Hi David,

Thanks for creating this RC.

-1. We found an incompatible modification in 1.15.0 [1]. I think we should fix it.

[1] https://issues.apache.org/jira/browse/FLINK-28322

Best,
Jingsong

On Tue, Jun 28, 2022 at 8:45 PM Robert Metzger wrote:
>
> +1 (binding)
>
> - staging repo contents look fine
> - KEYS file ok
> - binaries start locally properly. WebUI accessible on Mac.
>
> On Mon, Jun 27, 2022 at 11:21 AM Qingsheng Ren wrote:
>
> > +1 (non-binding)
> >
> > - checked/verified signatures and hashes
> > - checked that all POM files point to the same version
> > - built from source, without Hadoop and using Scala 2.12
> > - started standalone cluster locally, WebUI is accessible and ran WordCount example successfully
> > - executed a job with SQL client consuming from Kafka source to collect sink
> >
> > Best,
> > Qingsheng
> >
> >
> > > On Jun 27, 2022, at 14:46, Xingbo Huang wrote:
> > >
> > > +1 (non-binding)
> > >
> > > - verify signatures and checksums
> > > - no binaries found in source archive
> > > - build from source
> > > - Reviewed the release note blog
> > > - verify python wheel package contents
> > > - pip install apache-flink-libraries and apache-flink wheel packages
> > > - run the examples from Python Table API tutorial
> > >
> > > Best,
> > > Xingbo
> > >
> > > Chesnay Schepler wrote on Fri, Jun 24, 2022 at 21:42:
> > >
> > >> +1 (binding)
> > >>
> > >> - signatures OK
> > >> - all required artifacts appear to be present
> > >> - tag exists with the correct version adjustments
> > >> - binary shows correct commit and version
> > >> - examples run fine
> > >> - website PR looks good
> > >>
> > >> On 22/06/2022 14:20, David Anderson wrote:
> > >>> Hi everyone,
> > >>>
> > >>> Please review and vote on release candidate #1 for version 1.15.1, as follows:
> > >>> [ ] +1, Approve the release
> > >>> [ ] -1, Do not approve the release (please provide specific comments)
> > >>>
> > >>> The complete staging area is available for your review, which includes:
> > >>>
> > >>> * JIRA release notes [1],
> > >>> * the official Apache source release and binary convenience releases to be deployed to dist.apache.org [2], which are signed with the key with fingerprint E982F098 [3],
> > >>> * all artifacts to be deployed to the Maven Central Repository [4],
> > >>> * source code tag "release-1.15.1-rc1" [5],
> > >>> * website pull request listing the new release and adding announcement blog post [6].
> > >>>
> > >>> The vote will be open for at least 72 hours. It is adopted by majority approval, with at least 3 PMC affirmative votes.
> > >>>
> > >>> Thanks,
> > >>> David
> > >>>
> > >>> [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351546
> > >>> [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.15.1-rc1/
> > >>> [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > >>> [4] https://repository.apache.org/content/repositories/orgapacheflink-1511/
> > >>> [5] https://github.com/apache/flink/tree/release-1.15.1-rc1
> > >>> [6] https://github.com/apache/flink-web/pull/554
> > >>>
> > >>
> > >>
> >
> >
[jira] [Created] (FLINK-28322) DataStreamScanProvider's new method is not compatible
Jingsong Lee created FLINK-28322:
------------------------------------

             Summary: DataStreamScanProvider's new method is not compatible
                 Key: FLINK-28322
                 URL: https://issues.apache.org/jira/browse/FLINK-28322
             Project: Flink
          Issue Type: Bug
          Components: Table SQL / API
            Reporter: Jingsong Lee
             Fix For: 1.15.0


In FLINK-25990, a method "DataStream<RowData> produceDataStream(ProviderContext providerContext, StreamExecutionEnvironment execEnv)" was added to DataStreamScanProvider. But this method has no default implementation, which is not compatible for users upgrading from 1.14 to 1.15.

This method should be:
{code:java}
default DataStream<RowData> produceDataStream(
        ProviderContext providerContext, StreamExecutionEnvironment execEnv) {
    return produceDataStream(execEnv);
}
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
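The incompatibility reported here is the classic case of adding an abstract method to a released interface. The snippet below illustrates, with a self-contained toy, why a default implementation restores compatibility; ScanProvider and its String-based signatures are simplified hypothetical stand-ins for DataStreamScanProvider, ProviderContext, and StreamExecutionEnvironment.

```java
// Simplified stand-in for DataStreamScanProvider. Pre-1.15 user classes
// implement only the old single-argument method; if the new two-argument
// overload were abstract, those classes would no longer satisfy the
// interface. Making it a default that delegates to the old method keeps
// them working unchanged.
interface ScanProvider {
    // Old method, present since 1.14; existing user code implements this.
    String produceDataStream(String execEnv);

    // New method added in 1.15, with the backward-compatible default.
    default String produceDataStream(String providerContext, String execEnv) {
        return produceDataStream(execEnv);
    }
}
```

A "legacy" implementation that only knows the old method can still be called through the new entry point, which is exactly the guarantee the JIRA fix provides.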
Re: [VOTE] FLIP-242: Introduce configurable RateLimitingStrategy for Async Sink
Thanks, Hong. +1 (binding) from me. On Wed, Jun 29, 2022 at 8:32 PM Martijn Visser wrote: > This is looking good! Thanks for the efforts. +1 (binding) > > Op wo 29 jun. 2022 om 20:55 schreef Piotr Nowojski > > > Thanks for starting the voting thread. +1 (binding) from my side. > > > > Best, > > Piotrek > > > > > > > > śr., 29 cze 2022 o 17:32 Teoh, Hong > > napisał(a): > > > > > Hi everyone, > > > > > > Thanks for all the feedback so far. Based on the discussion [1], we > seem > > > to have consensus. So, I would like to start a vote on FLIP-242 [2]. > > > > > > The vote will last for at least 72 hours unless there is an objection > or > > > insufficient votes. > > > > > > [1] https://lists.apache.org/thread/k5s970xlqoj7opx1pzbpylxcov0tp842 > > > [2] > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-242%3A+Introduce+configurable+RateLimitingStrategy+for+Async+Sink > > > > > > Regards, > > > Hong > > > > > >
[jira] [Created] (FLINK-28321) HiveDialectQueryITCase fails with error code 137
Martijn Visser created FLINK-28321:
--------------------------------------

             Summary: HiveDialectQueryITCase fails with error code 137
                 Key: FLINK-28321
                 URL: https://issues.apache.org/jira/browse/FLINK-28321
             Project: Flink
          Issue Type: Bug
          Components: Connectors / Hive
    Affects Versions: 1.15.0
            Reporter: Martijn Visser


{code:java}
Moving data to directory file:/tmp/junit6349996144152770842/warehouse/db1.db/src1/.hive-staging_hive_2022-06-30_03-47-28_878_1781340705558822791-1/-ext-1
Loading data to table db1.src1
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
OK
##[error]Exit code 137 returned from process: file name '/bin/docker', arguments 'exec -i -u 1001 -w /home/agent02_azpcontainer 8f23cd917ec9d96c13789dabcaafe59398053d00ecf042a5426f9d1588ade349 /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
{code}

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=37387&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=24786



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (FLINK-28320) ShuffleMasterTest hangs and doesn't produce any output for 900 seconds
Martijn Visser created FLINK-28320:
--------------------------------------

             Summary: ShuffleMasterTest hangs and doesn't produce any output for 900 seconds
                 Key: FLINK-28320
                 URL: https://issues.apache.org/jira/browse/FLINK-28320
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
            Reporter: Martijn Visser


{code:java}
Jun 30 03:36:13 [INFO] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.057 s - in org.apache.flink.runtime.rpc.RpcEndpointTest
Jun 30 04:06:22 ==
Jun 30 04:06:22 Process produced no output for 900 seconds.
Jun 30 04:06:22 ==
Jun 30 04:06:22 ==
Jun 30 04:06:22 The following Java processes are running (JPS)
Jun 30 04:06:22 ==
Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
Jun 30 04:06:23 630 Launcher
Jun 30 04:06:23 23159 surefirebooter6827274093814314206.jar
Jun 30 04:06:23 11959 Jps
Jun 30 04:06:23 ==
{code}

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=37387&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=8099

The ShuffleMasterTest comes after RpcEndpointTest, which is why this test must be the one that's hanging.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (FLINK-28319) test_ci tests times out after/during running org.apache.flink.test.streaming.experimental
Martijn Visser created FLINK-28319:
--------------------------------------

             Summary: test_ci tests times out after/during running org.apache.flink.test.streaming.experimental
                 Key: FLINK-28319
                 URL: https://issues.apache.org/jira/browse/FLINK-28319
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.15.2
            Reporter: Martijn Visser


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=37384&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=6280



--
This message was sent by Atlassian Jira
(v8.20.10#820010)