[jira] [Created] (FLINK-32565) Support cast BYTES to DOUBLE

2023-07-07 Thread Hanyu Zheng (Jira)
Hanyu Zheng created FLINK-32565:
---

 Summary: Support cast BYTES to DOUBLE
 Key: FLINK-32565
 URL: https://issues.apache.org/jira/browse/FLINK-32565
 Project: Flink
  Issue Type: Sub-task
Reporter: Hanyu Zheng


We are undertaking a task that requires casting from the BYTES type to DOUBLE. 
In particular, we have a string '00T1p29eYmpEAE'. Our current approach is 
to convert this string to BYTES and then cast the result to DOUBLE using the 
following SQL query:

 
{code:java}
SELECT CAST((CAST('00T1p29eYmpEAE' as BYTES)) as DOUBLE);{code}
However, we encounter an issue when executing this query, potentially due to an 
error in the conversion between BYTES and DOUBLE. Our goal is to identify and 
correct this issue so that our query can execute successfully. The tasks 
involved are:
 # Investigate and pinpoint the specific reason for the conversion failure from 
BYTES to DOUBLE.
 # Design and implement a solution that enables our query to function correctly.
 # Test this solution across all required scenarios to ensure its robustness.
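For illustration only, a minimal Java sketch of one interpretation that task 1 
might settle on, reading the leading 8 bytes as a big-endian IEEE-754 double; 
this is an assumption, not Flink's established behavior:
{code:java}
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BytesToDoubleSketch {
    public static void main(String[] args) {
        // UTF-8 bytes of the literal; note this yields 14 bytes, not 8.
        byte[] bytes = "00T1p29eYmpEAE".getBytes(StandardCharsets.UTF_8);
        // Hypothetical semantics: interpret the first 8 bytes as a
        // big-endian IEEE-754 double. How inputs that are not exactly
        // 8 bytes long should be handled is part of the investigation.
        double asDouble = ByteBuffer.wrap(bytes, 0, 8).getDouble();
        System.out.println(asDouble);
    }
}
{code}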



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32564) Support cast BYTES to BIGINT

2023-07-07 Thread Hanyu Zheng (Jira)
Hanyu Zheng created FLINK-32564:
---

 Summary: Support cast BYTES to BIGINT
 Key: FLINK-32564
 URL: https://issues.apache.org/jira/browse/FLINK-32564
 Project: Flink
  Issue Type: Sub-task
Reporter: Hanyu Zheng


We are dealing with a task that requires casting from the BYTES type to BIGINT. 
Specifically, we have a string '00T1p29eYmpEAE'. Our approach is to convert 
this string to BYTES and then cast the result to BIGINT with the following SQL 
query:
{code:java}
SELECT CAST((CAST('00T1p29eYmpEAE' as BYTES)) as BIGINT);{code}
However, an issue arises when executing this query, likely due to an error in 
the conversion between BYTES and BIGINT. We aim to identify and rectify this 
issue so our query can run correctly. The tasks involved are:
 # Investigate and identify the specific reason for the failure of conversion 
from BYTES to BIGINT.
 # Design and implement a solution to ensure our query can function correctly.
 # Test this solution across all required scenarios to guarantee its 
functionality.
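As with the DOUBLE sub-task above, a minimal sketch of one plausible 
interpretation, reading the leading 8 bytes as a big-endian two's-complement 
long; again an assumption for illustration, not Flink's actual behavior:
{code:java}
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BytesToBigintSketch {
    public static void main(String[] args) {
        byte[] bytes = "00T1p29eYmpEAE".getBytes(StandardCharsets.UTF_8);
        // Hypothetical semantics: first 8 bytes as a big-endian long.
        long asBigint = ByteBuffer.wrap(bytes, 0, 8).getLong();
        System.out.println(asBigint);
    }
}
{code}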



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Deprecate multiple APIs in 1.18

2023-07-07 Thread Jing Ge
Hi Alex,

I would follow FLIP-197 and try to release them asap depending on dev
resources and how difficult those issues are. The fastest timeline is the
period defined in FLIP-197 in ideal conditions.

Best regards,
Jing

On Fri, Jul 7, 2023 at 12:20 PM Alexander Fedulov <
alexander.fedu...@gmail.com> wrote:

> @Xintong
> > - IIUC, the testing scenario you described is like blocking the source
> for
> > proceeding (emit data, finish, etc.) until a checkpoint is finished.
>
> It is more tricky than that - we need to prevent the Sink from receiving a
> checkpoint barrier until the Source is done emitting a given set of
> records. In
> the current tests, which are also used for V2 Sinks, SourceFunction
> controls
> when the Sink is "allowed" to commit by holding the checkpoint lock while
> producing the records. The lock is not available in the new Source by
> design
> and we need a solution that provides the same functionality (without
> modifying
> the Sinks). I am currently checking if a workaround is at all possible
> without
> adjusting anything in the Source interface.
>
> > I may not have understood all the details, but based on what you
> described
> > I'd hesitate to block the deprecation / removal of SourceFunction on
> this.
>
> I don't think we should, just wanted to highlight that there are some
> unknowns
> with respect to estimating the amount of work required.
>
> @Jing
> I want to understand in which release would you target graduation of the
> mentioned connectors to @Public/@PublicEvolving - basically the anticipated
> timeline of the steps in both options with respect to releases.
>
> Best,
> Alex
>
> On Fri, 7 Jul 2023 at 10:53, Xintong Song  wrote:
>
> > Thanks all for the discussion. I've created FLINK-32557 for this.
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Thu, Jul 6, 2023 at 1:00 AM Jing Ge 
> wrote:
> >
> > > Hi Alex,
> > >
> > >
> > > > > 3. remove SinkFunction.
> > > > Which steps do you imply for the 1.18 release and for the 2.0
> release?
> > > >
> > >
> > > for 2.0 release. 1.18 will be released soon.
> > >
> > > Best regards,
> > > Jing
> > >
> > >
> > > On Wed, Jul 5, 2023 at 1:08 PM Alexander Fedulov <
> > > alexander.fedu...@gmail.com> wrote:
> > >
> > > > @Jing
> > > > Just to clarify, when you say:
> > > >
> > > > 3. remove SinkFunction.
> > > > Which steps do you imply for the 1.18 release and for the 2.0
> release?
> > >
> > > @Xintong
> > > > A side note - with the new Source API we lose the ability to control
> > > > checkpointing from the source since there is no lock anymore. This
> > > > functionality
> > > > is currently used in a variety of tests for the Sinks - the tests
> that
> > > rely
> > > > on tight
> > > > synchronization between specific elements passed from the source  to
> > the
> > > > sink before
> > > > allowing a checkpoint to complete (see FiniteTestSource [1]). Since
> > > FLIP-27
> > > > Sources rely
> > > > on decoupling via the mailbox, without exposing the lock, it is not
> > > > immediately clear
> > > > if it is possible to achieve the same functionality without major
> > > > extensions in the
> > > > runtime for such testing purposes. My hope initially was that only
> the
> > > > legacy Sinks
> > > > relied on this - this would have made it possible to drop
> > > > SourceFunction+SinkFunction
> > > > together, but, in fact, it also already became part of the new SinkV2
> > > > testing IT suites
> > > > [2]. Moreover, I know of at least one major connector that also
> relies
> > on
> > > > it for
> > > > verifying committed sink metadata for a specific set of records
> > (Iceberg)
> > > > [3]. In my
> > > > estimation this currently presents a major blocker for the
> > SourceFunction
> > > > removal.
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://github.com/apache/flink/blob/master/flink-test-utils-parent/flink-test-utils/src/main/java/org/apache/flink/streaming/util/FiniteTestSource.java
> > > > [2]
> > > >
> > > >
> > >
> >
> https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-files/src/test/java/org/apache/flink/connector/file/sink/StreamingExecutionFileSinkITCase.java#L132
> > > > [3]
> > > >
> > > >
> > >
> >
> https://github.com/apache/iceberg/blob/master/flink/v1.17/flink/src/test/java/org/apache/iceberg/flink/source/BoundedTestSource.java#L75C1-L85C2
> > > >
> > > > Best,
> > > > Alex
> > > >
> > > > On Wed, 5 Jul 2023 at 10:47, Chesnay Schepler 
> > > wrote:
> > > >
> > > > > There's a whole bunch of metric APIs that would need to be
> > deprecated.
> > > > > That is of course if the metric FLIPs are being accepted.
> > > > >
> > > > > Which makes me wonder if we aren't doing things the wrong way
> around;
> > > > > shouldn't the decision to deprecate an API be part of the FLIP
> > > > discussion?
> > > > >
> > > > > On 05/07/2023 07:39, Xintong Song wrote:
> > > > > > Thanks all for the discussion.
> > > > > >
> > > > > > It seems to me there's a consensus on marking the following as
> > 

Re: [DISCUSS] Deprecate multiple APIs in 1.18

2023-07-07 Thread Alexander Fedulov
@Xintong
> - IIUC, the testing scenario you described is like blocking the source for
> proceeding (emit data, finish, etc.) until a checkpoint is finished.

It is more tricky than that - we need to prevent the Sink from receiving a
checkpoint barrier until the Source is done emitting a given set of
records. In
the current tests, which are also used for V2 Sinks, SourceFunction controls
when the Sink is "allowed" to commit by holding the checkpoint lock while
producing the records. The lock is not available in the new Source by design
and we need a solution that provides the same functionality (without
modifying
the Sinks). I am currently checking if a workaround is at all possible
without
adjusting anything in the Source interface.
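For context, a condensed sketch of the lock-holding pattern described above; 
this is illustrative only, not the actual FiniteTestSource code:
{code:java}
import org.apache.flink.streaming.api.functions.source.SourceFunction;

// While the checkpoint lock is held, no checkpoint can complete, so the
// sink cannot observe a barrier in the middle of the emitted batch.
public class LockGatedSource implements SourceFunction<Integer> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Integer> ctx) throws Exception {
        synchronized (ctx.getCheckpointLock()) {
            for (int i = 0; i < 10 && running; i++) {
                ctx.collect(i); // emitted atomically w.r.t. checkpoints
            }
        }
        // Lock released: the next checkpoint may complete, and the sink
        // may commit everything emitted above.
    }

    @Override
    public void cancel() {
        running = false;
    }
}
{code}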

> I may not have understood all the details, but based on what you described
> I'd hesitate to block the deprecation / removal of SourceFunction on this.

I don't think we should, just wanted to highlight that there are some
unknowns
with respect to estimating the amount of work required.

@Jing
I want to understand in which release would you target graduation of the
mentioned connectors to @Public/@PublicEvolving - basically the anticipated
timeline of the steps in both options with respect to releases.

Best,
Alex

On Fri, 7 Jul 2023 at 10:53, Xintong Song  wrote:

> Thanks all for the discussion. I've created FLINK-32557 for this.
>
> Best,
>
> Xintong
>
>
>
> On Thu, Jul 6, 2023 at 1:00 AM Jing Ge  wrote:
>
> > Hi Alex,
> >
> >
> > > > 3. remove SinkFunction.
> > > Which steps do you imply for the 1.18 release and for the 2.0 release?
> > >
> >
> > for 2.0 release. 1.18 will be released soon.
> >
> > Best regards,
> > Jing
> >
> >
> > On Wed, Jul 5, 2023 at 1:08 PM Alexander Fedulov <
> > alexander.fedu...@gmail.com> wrote:
> >
> > > @Jing
> > > Just to clarify, when you say:
> > >
> > > 3. remove SinkFunction.
> > > Which steps do you imply for the 1.18 release and for the 2.0 release?
> >
> > @Xintong
> > > A side note - with the new Source API we lose the ability to control
> > > checkpointing from the source since there is no lock anymore. This
> > > functionality
> > > is currently used in a variety of tests for the Sinks - the tests that
> > rely
> > > on tight
> > > synchronization between specific elements passed from the source  to
> the
> > > sink before
> > > allowing a checkpoint to complete (see FiniteTestSource [1]). Since
> > FLIP-27
> > > Sources rely
> > > on decoupling via the mailbox, without exposing the lock, it is not
> > > immediately clear
> > > if it is possible to achieve the same functionality without major
> > > extensions in the
> > > runtime for such testing purposes. My hope initially was that only the
> > > legacy Sinks
> > > relied on this - this would have made it possible to drop
> > > SourceFunction+SinkFunction
> > > together, but, in fact, it also already became part of the new SinkV2
> > > testing IT suites
> > > [2]. Moreover, I know of at least one major connector that also relies
> on
> > > it for
> > > verifying committed sink metadata for a specific set of records
> (Iceberg)
> > > [3]. In my
> > > estimation this currently presents a major blocker for the
> SourceFunction
> > > removal.
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/apache/flink/blob/master/flink-test-utils-parent/flink-test-utils/src/main/java/org/apache/flink/streaming/util/FiniteTestSource.java
> > > [2]
> > >
> > >
> >
> https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-files/src/test/java/org/apache/flink/connector/file/sink/StreamingExecutionFileSinkITCase.java#L132
> > > [3]
> > >
> > >
> >
> https://github.com/apache/iceberg/blob/master/flink/v1.17/flink/src/test/java/org/apache/iceberg/flink/source/BoundedTestSource.java#L75C1-L85C2
> > >
> > > Best,
> > > Alex
> > >
> > > On Wed, 5 Jul 2023 at 10:47, Chesnay Schepler 
> > wrote:
> > >
> > > > There's a whole bunch of metric APIs that would need to be
> deprecated.
> > > > That is of course if the metric FLIPs are being accepted.
> > > >
> > > > Which makes me wonder if we aren't doing things the wrong way around;
> > > > shouldn't the decision to deprecate an API be part of the FLIP
> > > discussion?
> > > >
> > > > On 05/07/2023 07:39, Xintong Song wrote:
> > > > > Thanks all for the discussion.
> > > > >
> > > > > It seems to me there's a consensus on marking the following as
> > > deprecated
> > > > > in 1.18:
> > > > > - DataSet API
> > > > > - SourceFunction
> > > > > - Queryable State
> > > > > - All Scala APIs
> > > > >
> > > > > More time is needed for deprecating SinkFunction.
> > > > >
> > > > > I'll leave this discussion open for a few more days. And if there's
> > no
> > > > > objections, I'll create JIRA tickets accordingly.
> > > > >
> > > > > Best,
> > > > >
> > > > > Xintong
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Jul 5, 2023 at 1:34 PM Xintong Song  >
> > > > wrote:
> > > > >
> > > > >> Thanks for the input, Jing. I'd also be +1 for option 1.
> 

[VOTE] Release 2.0 must-have work items

2023-07-07 Thread Xintong Song
Hi all,

I'd like to start the VOTE for the must-have work items for release 2.0
[1]. The corresponding discussion thread is [2].

Please note that once the vote is approved, any changes to the must-have
items (adding / removing must-have items, changing the priority) requires
another vote. Assigning contributors / reviewers, updating descriptions /
progress, changes to nice-to-have items do not require another vote.

The vote will be open until at least July 12, following the consensus
voting process. Votes of PMC members are binding.

Best,

Xintong


[1] https://cwiki.apache.org/confluence/display/FLINK/2.0+Release

[2] https://lists.apache.org/thread/l3dkdypyrovd3txzodn07lgdwtwvhgk4


[jira] [Created] (FLINK-32563) Allow connectors CI to specify the main supported Flink version

2023-07-07 Thread Etienne Chauchot (Jira)
Etienne Chauchot created FLINK-32563:


 Summary: Allow connectors CI to specify the main supported Flink 
version
 Key: FLINK-32563
 URL: https://issues.apache.org/jira/browse/FLINK-32563
 Project: Flink
  Issue Type: Technical Debt
  Components: Build System / CI
Reporter: Etienne Chauchot
Assignee: Etienne Chauchot


As part of [this 
discussion|https://lists.apache.org/thread/pr0g812olzpgz21d9oodhc46db9jpxo3], 
the need has arisen for a connector to specify the main Flink version it 
supports.

This CI variable will allow configuring the build and tests differently 
depending on that version. The parameter would be optional.

The first use case is to run the ArchUnit tests only against the main supported 
version, as discussed in the above thread (a sketch follows below).
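A minimal sketch of how such a version gate could look on the test side; the 
variable name and version value are hypothetical, and the actual mechanism would 
be defined by the CI setup:
{code:java}
import org.junit.jupiter.api.Assumptions;
import org.junit.jupiter.api.BeforeAll;

class ArchitectureTestGate {

    @BeforeAll
    static void onlyOnMainSupportedFlinkVersion() {
        // FLINK_MAIN_SUPPORTED_VERSION and "1.17" are illustrative names only.
        Assumptions.assumeTrue(
                "1.17".equals(System.getenv("FLINK_MAIN_SUPPORTED_VERSION")),
                "ArchUnit tests run only against the main supported Flink version");
    }
}
{code}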



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Release 2.0 Work Items

2023-07-07 Thread Xintong Song
Thanks all for the discussion.

The wiki has been updated as discussed. I'm starting a vote now.

Best,

Xintong



On Wed, Jul 5, 2023 at 9:52 AM Xintong Song  wrote:

> Hi ConradJam,
>
> I think Chesnay has already put his name as the Contributor for the two
> tasks you listed. Maybe you can reach out to him to see if you can
> collaborate on this.
>
> In general, I don't think contributing to a release 2.0 issue is much
> different from contributing to a regular issue. We haven't yet created JIRA
> tickets for all the listed tasks because many of them needs further
> discussions and / or FLIPs to decide whether and how they should be
> performed.
>
> Best,
>
> Xintong
>
>
>
> On Mon, Jul 3, 2023 at 10:37 PM ConradJam  wrote:
>
>> Hi Community:
>>   I see some tasks in the 2.0 list that haven't been assigned yet. I want
>> to take the initiative to take on some tasks that I can complete. How do I
>> apply to the community for this part of the task? I am interested in the
>> following parts of FLINK-32377
>> , do I need to create
>> issuse myself and point it to myself?
>>
>> - the current timestamp, which is problematic w.r.t. caching and testing,
>> while providing no value.
>> - Remove JarRequestBody#programArgs in favor of #programArgsList.
>>
>> [1] FLINK-32377 
>> https://issues.apache.org/jira/browse/FLINK-32377
>>
>> Teoh, Hong wrote on Fri, Jun 30, 2023 at 00:53:
>>
>> > Thanks Xintong for driving the effort.
>> >
>> > I’d add a +1 to reworking configs, as suggested by @Jark and @Chesnay,
>> > especially the types. We have various configs that encode Time /
>> MemorySize
>> > that are Long instead!
>> >
>> > Regards,
>> > Hong
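A hedged sketch of the config typing rework discussed above, using Flink's typed 
ConfigOption builders; the option names are made up for illustration:
{code:java}
import java.time.Duration;
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;
import org.apache.flink.configuration.MemorySize;

public class TypedOptionsSketch {
    // Legacy style: a duration encoded as a raw long of milliseconds.
    public static final ConfigOption<Long> TIMEOUT_MS_LEGACY =
            ConfigOptions.key("example.timeout-ms").longType().defaultValue(60_000L);

    // Reworked style: the type carries the semantics, and users may write
    // values such as "60 s" or "1 min" in the configuration.
    public static final ConfigOption<Duration> TIMEOUT =
            ConfigOptions.key("example.timeout")
                    .durationType()
                    .defaultValue(Duration.ofMinutes(1));

    public static final ConfigOption<MemorySize> BUFFER_SIZE =
            ConfigOptions.key("example.buffer-size")
                    .memoryType()
                    .defaultValue(MemorySize.ofMebiBytes(32));
}
{code}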
>> >
>> >
>> >
>> > > On 29 Jun 2023, at 16:19, Yuan Mei  wrote:
>> > >
>> > > CAUTION: This email originated from outside of the organization. Do
>> not
>> > click links or open attachments unless you can confirm the sender and
>> know
>> > the content is safe.
>> > >
>> > >
>> > >
>> > > Thanks for driving this effort, Xintong!
>> > >
>> > > To Chesnay
>> > >> I'm curious as to why the "Disaggregated State Management" item is
>> > >> marked as a must-have; will it require changes that break something?
>> > >> What prevents it from being added in 2.1?
>> > >
>> > > As to "Disaggregated State Management".
>> > >
>> > > We plan to provide a new type of state backend to support DFS as
>> primary
>> > > storage.
>> > > To achieve this, we at least need to include two parts of amends (not
>> > > entirely sure yet, since we are still in the designing and prototype
>> > phase)
>> > >
>> > > 1. Statebackend Change
>> > > 2. State Access Change
>> > >
>> > > Not all of the interfaces related are `@Internal`. Some of the
>> interfaces
>> > > like `StateBackend` is `@PublicEvolving`
>> > > So, you are right in the sense that "Disaggregated State Management"
>> > itself
>> > > probably does not need to be a "Must Have"
>> > >
>> > > But I was hoping changes that related to public APIs can be finalized
>> and
>> > > merged in Flink 2.0 (I will fix the wiki accordingly).
>> > >
>> > > I also agree with Jark that 2.0 is a good chance to rework the default
>> > > value of configurations.
>> > >
>> > > Best
>> > > Yuan
>> > >
>> > >
>> > > On Thu, Jun 29, 2023 at 8:43 PM Chesnay Schepler 
>> > wrote:
>> > >
>> > >> Something else configuration-related is that there are a bunch of
>> > >> options where the type isn't quite correct (e.g., a String where it
>> > >> could be an enum, a string where it should be an int or something).
>> > >> Could do a pass over those as well.
>> > >>
>> > >> On 29/06/2023 13:50, Jark Wu wrote:
>> > >>> Hi,
>> > >>>
>> > >>> I think one more thing we need to consider to do in 2.0 is changing
>> the
>> > >>> default value of configuration to improve out-of-box user
>> experience.
>> > >>>
>> > >>> Currently, in order to run a Flink job, users may need to set
>> > >>> a bunch of configurations, such as minibatch, checkpoint interval,
>> > >>> exactly-once,
>> > >>> incremental-checkpoint, etc. It's very verbose and hard to use for
>> > >>> beginners.
>> > >>> Most of them can have a universally applicable value.  Because
>> changing
>> > >> the
>> > >>> default value is a breaking change. I think It's worth considering
>> > >> changing
>> > >>> them in 2.0.
>> > >>>
>> > >>> What do you think?
>> > >>>
>> > >>> Best,
>> > >>> Jark
>> > >>>
>> > >>>
>> > >>> On Wed, 28 Jun 2023 at 14:10, Sergey Nuyanzin 
>> > >> wrote:
>> > >>>
>> >  Hi Chesnay
>> > 
>> > > "Move Calcite rules from Scala to Java": I would hope that this
>> would
>> > >> be
>> > > an entirely internal change, and could thus be an incremental
>> process
>> > > independent of major releases.
>> > > What is the actual scale of this item; how much are we actually
>> >  re-writing?
>> > 
>> >  Thanks for asking
>> >  yes, you're 

[jira] [Created] (FLINK-32562) FileSink Compactor Service should not use FileWriter from Sink for OutputStreamBasedFileCompactor

2023-07-07 Thread Shengnan YU (Jira)
Shengnan YU created FLINK-32562:
---

 Summary: FileSink Compactor Service should not use FileWriter from 
Sink for OutputStreamBasedFileCompactor
 Key: FLINK-32562
 URL: https://issues.apache.org/jira/browse/FLINK-32562
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / FileSystem
Affects Versions: 1.18.0
Reporter: Shengnan YU


The gzip format is designed to be concatenatable, but this property is broken by 
the Compactor in FileSink.

When the Compactor Service creates a new compacted file, it wraps the output in 
a GzipOutputStream, which writes an extra gzip header, so the final compacted 
file ends up with superfluous header bytes. (A gzip header is already present in 
every finished part file, so the compacted file does not need an extra one.) 
This happens because the Compactor Service uses the FileWriter specified in the 
FileSink to create the compacted output stream. I think we should use a simple 
byte-copying output stream to concatenate the streams instead, or at least 
provide an option to do so (see the sketch below).

Currently the ConcatFileCompactor only supports plain text files. Many 
compression codecs, such as gzip and bzip2, support concatenation. I think we 
should support concatenating those formats as well; otherwise people must use 
the RecordWiseFileCompactor, which is very inefficient.
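A minimal, self-contained demonstration of the property described above: two 
independently gzipped members concatenated byte-for-byte still form one valid 
gzip stream, which is why a plain byte-copying stream, rather than a fresh 
GzipOutputStream, would suffice for compaction:
{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipConcatDemo {
    private static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Two independently compressed "part files", compacted by plain
        // byte concatenation -- no new gzip header is written.
        ByteArrayOutputStream compacted = new ByteArrayOutputStream();
        compacted.write(gzip("hello "));
        compacted.write(gzip("world"));

        // GZIPInputStream reads concatenated members transparently.
        try (GZIPInputStream in =
                new GZIPInputStream(new ByteArrayInputStream(compacted.toByteArray()))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) > 0) {
                out.write(buf, 0, n);
            }
            System.out.println(out.toString("UTF-8")); // prints "hello world"
        }
    }
}
{code}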



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32561) Change the status field reconciliationTimestamp from long to Date

2023-07-07 Thread Xin Hao (Jira)
Xin Hao created FLINK-32561:
---

 Summary: Change the status field reconciliationTimestamp from long 
to Date
 Key: FLINK-32561
 URL: https://issues.apache.org/jira/browse/FLINK-32561
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Xin Hao


Can we change the field `status.reconciliationStatus.reconciliationTimestamp` 
from long to date?

Note that this is a breaking change for the CRD.

The benefits are:
 # The date format is more human-readable, which is useful when debugging issues 
(see the sketch below).
 # It would be easy to add this field to additionalPrinterColumns with a date 
duration format.
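A small sketch of the readability difference; the value is an arbitrary example:
{code:java}
import java.time.Instant;

public class TimestampReadability {
    public static void main(String[] args) {
        long reconciliationTimestamp = 1688716800000L;
        // As a raw long: hard to read at a glance.
        System.out.println(reconciliationTimestamp); // 1688716800000
        // As a date: immediately meaningful when debugging.
        System.out.println(Instant.ofEpochMilli(reconciliationTimestamp)); // 2023-07-07T08:00:00Z
    }
}
{code}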



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Deprecate multiple APIs in 1.18

2023-07-07 Thread Xintong Song
Thanks all for the discussion. I've created FLINK-32557 for this.

Best,

Xintong



On Thu, Jul 6, 2023 at 1:00 AM Jing Ge  wrote:

> Hi Alex,
>
>
> > > 3. remove SinkFunction.
> > Which steps do you imply for the 1.18 release and for the 2.0 release?
> >
>
> for 2.0 release. 1.18 will be released soon.
>
> Best regards,
> Jing
>
>
> On Wed, Jul 5, 2023 at 1:08 PM Alexander Fedulov <
> alexander.fedu...@gmail.com> wrote:
>
> > @Jing
> > Just to clarify, when you say:
> >
> > 3. remove SinkFunction.
> > Which steps do you imply for the 1.18 release and for the 2.0 release?
>
> @Xintong
> > A side note - with the new Source API we lose the ability to control
> > checkpointing from the source since there is no lock anymore. This
> > functionality
> > is currently used in a variety of tests for the Sinks - the tests that
> rely
> > on tight
> > synchronization between specific elements passed from the source  to the
> > sink before
> > allowing a checkpoint to complete (see FiniteTestSource [1]). Since
> FLIP-27
> > Sources rely
> > on decoupling via the mailbox, without exposing the lock, it is not
> > immediately clear
> > if it is possible to achieve the same functionality without major
> > extensions in the
> > runtime for such testing purposes. My hope initially was that only the
> > legacy Sinks
> > relied on this - this would have made it possible to drop
> > SourceFunction+SinkFunction
> > together, but, in fact, it also already became part of the new SinkV2
> > testing IT suites
> > [2]. Moreover, I know of at least one major connector that also relies on
> > it for
> > verifying committed sink metadata for a specific set of records (Iceberg)
> > [3]. In my
> > estimation this currently presents a major blocker for the SourceFunction
> > removal.
> >
> > [1]
> >
> >
> https://github.com/apache/flink/blob/master/flink-test-utils-parent/flink-test-utils/src/main/java/org/apache/flink/streaming/util/FiniteTestSource.java
> > [2]
> >
> >
> https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-files/src/test/java/org/apache/flink/connector/file/sink/StreamingExecutionFileSinkITCase.java#L132
> > [3]
> >
> >
> https://github.com/apache/iceberg/blob/master/flink/v1.17/flink/src/test/java/org/apache/iceberg/flink/source/BoundedTestSource.java#L75C1-L85C2
> >
> > Best,
> > Alex
> >
> > On Wed, 5 Jul 2023 at 10:47, Chesnay Schepler 
> wrote:
> >
> > > There's a whole bunch of metric APIs that would need to be deprecated.
> > > That is of course if the metric FLIPs are being accepted.
> > >
> > > Which makes me wonder if we aren't doing things the wrong way around;
> > > shouldn't the decision to deprecate an API be part of the FLIP
> > discussion?
> > >
> > > On 05/07/2023 07:39, Xintong Song wrote:
> > > > Thanks all for the discussion.
> > > >
> > > > It seems to me there's a consensus on marking the following as
> > deprecated
> > > > in 1.18:
> > > > - DataSet API
> > > > - SourceFunction
> > > > - Queryable State
> > > > - All Scala APIs
> > > >
> > > > More time is needed for deprecating SinkFunction.
> > > >
> > > > I'll leave this discussion open for a few more days. And if there's
> no
> > > > objections, I'll create JIRA tickets accordingly.
> > > >
> > > > Best,
> > > >
> > > > Xintong
> > > >
> > > >
> > > >
> > > > On Wed, Jul 5, 2023 at 1:34 PM Xintong Song 
> > > wrote:
> > > >
> > > >> Thanks for the input, Jing. I'd also be +1 for option 1.
> > > >>
> > > >> Best,
> > > >>
> > > >> Xintong
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Jul 3, 2023 at 7:20 PM Jing Ge 
> > > wrote:
> > > >>
> > > >>> Hi Xingtong,
> > > >>>
> > > >>> Option 1, secure plan would be:
> > > >>>
> > > >>> 1. graduate kafka, File, JDBC connectors to @Public
> > > >>> 2. graduate SinkV2 to @Public
> > > >>> 3. remove SinkFunction.
> > > >>>
> > > >>> Option 2, risky plan but at a fast pace:
> > > >>>
> > > >>> 1. graduate SinkV2 to @Public and expecting more maintenance effort
> > > since
> > > >>> there are many known and unsolved issues.
> > > >>> 2. remove SinkFunction.
> > > >>> 3. It depends on the connectors' contributors whether connectors
> can
> > > >>> upgrade to Flink 2.0, since we moved forward with SinkV2 API
> without
> > > >>> taking
> > > >>> care of implementations in external connectors.
> > > >>>
> > > >>> I am ok with both of them and personally prefer option 1.
> > > >>>
> > > >>> Best regards,
> > > >>> Jing
> > > >>>
> > > >>>
> > > >>> On Fri, Jun 30, 2023 at 3:41 AM Xintong Song <
> tonysong...@gmail.com>
> > > >>> wrote:
> > > >>>
> > >  I see. Thanks for the explanation. I may have not looked into this
> > > >>> deeply
> > >  enough, and would trust the decision from you and the community
> > > members
> > > >>> who
> > >  participated in the discussion & vote.
> > > 
> > >  Best,
> > > 
> > >  Xintong
> > > 
> > > 
> > > 
> > >  On Thu, Jun 29, 2023 at 10:28 PM Alexander Fedulov <
> > >  alexander.fedu...@gmail.com> wrote:
> 

[jira] [Created] (FLINK-32560) Properly deprecate all Scala APIs

2023-07-07 Thread Xintong Song (Jira)
Xintong Song created FLINK-32560:


 Summary: Properly deprecate all Scala APIs
 Key: FLINK-32560
 URL: https://issues.apache.org/jira/browse/FLINK-32560
 Project: Flink
  Issue Type: Sub-task
  Components: API / Scala
Reporter: Xintong Song
 Fix For: 1.18.0


We agreed to drop Scala API support in FLIP-265 [1] and attempted to deprecate 
the APIs in FLINK-29740. Both the user documentation and the roadmap [2] state 
that the Scala APIs are deprecated. However, none of the APIs in 
`flink-streaming-scala` are annotated with `@Deprecated` at the moment, and only 
`ExecutionEnvironment` and `package` are marked `@Deprecated` in `flink-scala`.

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-265+Deprecate+and+remove+Scala+API+support
[2] https://flink.apache.org/roadmap/




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32559) Deprecate Queryable State

2023-07-07 Thread Xintong Song (Jira)
Xintong Song created FLINK-32559:


 Summary: Deprecate Queryable State
 Key: FLINK-32559
 URL: https://issues.apache.org/jira/browse/FLINK-32559
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Queryable State
Reporter: Xintong Song
 Fix For: 1.18.0


Queryable State is described as approaching end-of-life in the roadmap [1], but 
it is deprecated neither in the code nor in the user documentation [2]. There 
are also more negative opinions than positive ones in the discussion about 
rescuing it [3].

[1] https://flink.apache.org/roadmap/
[2] 
https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/datastream/fault-tolerance/queryable_state/
[3] https://lists.apache.org/thread/9hmwcjb3q5c24pk3qshjvybfqk62v17m



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32558) Properly deprecate DataSet API

2023-07-07 Thread Xintong Song (Jira)
Xintong Song created FLINK-32558:


 Summary: Properly deprecate DataSet API
 Key: FLINK-32558
 URL: https://issues.apache.org/jira/browse/FLINK-32558
 Project: Flink
  Issue Type: Sub-task
  Components: API / DataSet
Reporter: Xintong Song
 Fix For: 1.18.0


The DataSet API is described as "legacy" and "soft deprecated" in the user 
documentation [1]. The required tasks for formally deprecating / removing it, 
according to FLIP-131 [2], are all completed.

This task includes marking all related API classes as `@Deprecated` and updating 
the user documentation.

[1] 
https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/dataset/overview/
[2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866741



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32557) API deprecations in Flink 1.18

2023-07-07 Thread Xintong Song (Jira)
Xintong Song created FLINK-32557:


 Summary: API deprecations in Flink 1.18
 Key: FLINK-32557
 URL: https://issues.apache.org/jira/browse/FLINK-32557
 Project: Flink
  Issue Type: Technical Debt
Reporter: Xintong Song
 Fix For: 1.18.0


As discussed in [1], we are deprecating multiple APIs in release 1.18 in order 
to completely remove them in release 2.0.

The listed APIs arguably should have been deprecated already, i.e., they already 
have replacements (or will never get them), but for some reason have not been 
marked as deprecated yet.

[1] https://lists.apache.org/thread/3dw4f8frlg8hzlv324ql7n2755bzs9hy



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-325: Support configuring end-to-end allowed latency

2023-07-07 Thread Jing Ge
Hi Dong,

Thanks for your clarification.


> Actually, I think it could make sense to toggle isBacklog between true and
> false while the job is running.
>

If isBacklog is toggled back and forth too often (e.g. due to a mistake, an
unstable system, etc.), a large number of RecordAttributes might be triggered,
which will lead to performance issues. That would not be the right way to use
RecordAttributes, right? Devs and users should be aware of this and know how to
monitor, maintain, and fix such issues.

Your reply contains valuable information. It might make sense to add it to the
FLIP:

1. It is up to the operator to decide when to emit RecordAttributes. But
devs and users should be aware that the number of RecordAttributes must stay
low enough not to cause performance issues.
2. Although users are free to configure them independently, the end-to-end
latency should (commonly?) be configured lower than the checkpoint interval.
3. The three ways you mentioned for deriving isBacklog (a sketch of the
watermark-lag-based one follows below).
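A minimal sketch of the watermark-lag-based derivation with hysteresis, as
described in the quoted reply below; the thresholds and names are illustrative
only, not part of FLIP-325:
{code:java}
import java.time.Duration;

// Toggle isBacklog based on watermark lag, with two thresholds
// (hysteresis) so the flag does not flap back and forth.
public class BacklogDetector {
    private static final Duration ENTER_BACKLOG = Duration.ofMinutes(5);
    private static final Duration EXIT_BACKLOG = Duration.ofSeconds(30);

    private boolean isBacklog = false;

    /** Returns true iff the flag flipped, i.e. a RecordAttributes should be emitted. */
    public boolean update(long systemTimeMs, long watermarkMs) {
        Duration lag = Duration.ofMillis(systemTimeMs - watermarkMs);
        boolean next = isBacklog
                ? lag.compareTo(EXIT_BACKLOG) > 0   // leave backlog once lag drops to 30 s or below
                : lag.compareTo(ENTER_BACKLOG) > 0; // enter backlog once lag exceeds 5 min
        boolean changed = next != isBacklog;
        isBacklog = next;
        return changed;
    }
}
{code}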

WDYT?

Best regards,
Jing


On Fri, Jul 7, 2023 at 3:13 AM Dong Lin  wrote:

> Hi Jing,
>
> Thanks for the comments. Please see my reply inline.
>
> On Fri, Jul 7, 2023 at 5:40 AM Jing Ge  wrote:
>
> > Hi,
> >
> > Thank you all for the inspired discussion. Really appreciate it!
> >
> > @Dong I'd like to ask some (stupid) questions to make sure I understand
> > your thoughts correctly.
> >
> > 1. It will make no sense to send the same type of RecordAttributes right?
> > e.g.  if one RecordAttributes(isBacklog=true) has been sent, a new
> > RecordAttributes will be only sent when isBacklog is changed to be false,
> > and vice versa. In this way, the number of RecordAttributes will be very
> > limited.
> >
>
> Yes, you are right. Actually, this is what we plan to do when we update
> operators to emit RecordAttributes via `Output#emitRecordAttributes()`.
>
> Note that the FLIP does not specify the frequency of how operators should
> invoke `Output#emitRecordAttributes()`. It is up to the operator
> to decide when to emit RecordAttributes.
>
>
> > 2. Since source readers can invoke Output#emitRecordAttributes to emit
> > RecordAttributes(isBacklog=true/false), it might be weird to send
> > RecordAttributes with different isBacklog back and forth too often. Devs
> > and users should pay attention to it. Something is wrong when such a
> thing
> > happens(metrics for monitoring?). Is this correct?
> >
>
>

> Actually, I think it could make sense to toggle isBacklog between true and
> false while the job is running.
>
>

> Suppose the job is reading from user-action data from Kafka and there is a
> traffic spike for 2 hours. If the job keeps running in pure stream mode,
> the watermark lag might keep increasing during this period because the
> job's processing capability can not catch up with the Kafka input
> throughput. In this case, it can be beneficial to dynamically switch
> isBacklog to true when watermarkLag exceeds a given threshold (e.g. 5
> minutes), and switch isBacklog to false again when the watermarkLag is low
> enough (30 seconds).
>
>
> > 3. Is there any relationship between end-to-end-latency and checkpoint
> > interval that users should pay attention to? In the example described in
> > the FLIP, both have the same value, 2 min. What about end-to-end-latency
> is
> > configured bigger than checkpoint interval? Could checkpoint between
> > end-to-end-latency be skipped?
> >
>
> This FLIP would not enforce any relationship between end-to-end latency and
> checkpoint interval. Users are free to configure end-to-end latency to be
> bigger than checkpoint interval.
>
> I don't think there exists any use-case which requires end-to-end latency
> to be higher than the checkpoint interval. Note that introducing a
> relationship between these two configs would increase code complexity and
> also make the documentation of these configs a bit more complex for users
> to understand.
>
> Since there is no correctness issue when a user sets end-to-end latency to be
> bigger than the checkpointing interval, I think it is simpler to just let
> the user decide how to configure them.
>
>
> > 4. Afaiu, one major discussion point is that isBacklog can be derived
> from
> > back pressure and there will be no need of RecordAttributes. In case a
> > Flink job has rich resources that there is no back pressure (it will be
> > difficult to perfectly have just enough resources that everything is fine
> > but will have back pressure only for backlog) but we want to improve the
> > throughput. We then need some other ways to derive isBacklog. That is the
> > reason why RecordAttributes has been introduced. Did I understand it
> > correctly?
> >
>
> I think there can be multiple ways to derive isBackog, including:
> 1) Based on the source operator's state. For example, when MySQL CDC source
> is reading snapshot, it can claim isBacklog=true.
> 2) Based on the watermarkLag in the source. For example, when system_time -
> watermark > user_specified_threshold, 

Re: [DISCUSS] FLIP-314: Support Customized Job Lineage Listener

2023-07-07 Thread Jing Ge
Hi Shammon,

Thanks for the update!

Best regards,
Jing
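For orientation, a rough sketch of what the renamed classes mentioned in the
quoted update below might look like; only the three names come from the thread,
and the member signatures are assumptions (the authoritative definitions are in
FLIP-314):
{code:java}
import java.util.List;

// Hypothetical shapes -- only the names come from the thread.
interface LineageVertex {
    String name(); // e.g. an identifier of a source or sink
}

interface LineageEdge {
    LineageVertex source();
    LineageVertex sink();
}

interface LineageGraph {
    List<LineageVertex> sources();
    List<LineageVertex> sinks();
    List<LineageEdge> relations();
}
{code}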

On Fri, Jul 7, 2023 at 4:46 AM Shammon FY  wrote:

> Thanks Jing, sounds good to me.
>
> I have updated the FLIP and renamed the lineage related classes to
> `LineageGraph`, `LineageVertex` and `LineageEdge` and keep it consistent
> with the job definition in Flink.
>
> Best,
> Shammon FY
>
> On Thu, Jul 6, 2023 at 8:25 PM Jing Ge  wrote:
>
> > Hi Shammon,
> >
> > Thanks for the clarification. Atlas might have his historical reason back
> > to the hadoop era or maybe even back to the hibernate where Entity and
> > Relation were commonly used. Flink already used Vertex and Edge to
> describe
> > DAG. Some popular tools like dbt are also using this convention[1] and,
> > afaik, most graph frameworks use vertex and edge too. It will be easier
> for
> > Flink devs and users to have a consistent naming convention for the same
> > concept, i.e. in this case, DAG.
> >
> > Best regards,
> > Jing
> >
> > [1]
> >
> >
> https://docs.getdbt.com/docs/dbt-cloud-apis/discovery-use-cases-and-examples#discovery
> >
> > On Wed, Jul 5, 2023 at 11:28 AM Shammon FY  wrote:
> >
> > > Hi Jing,
> > >
> > > Thanks for your feedback.
> > >
> > > > 1. TableColumnLineageRelation#sinkColumn() should return
> > > TableColumnLineageEntity instead of String, right?
> > >
> > > The `sinkColumn()` will return `String` which is the column name in the
> > > sink connector. I found the name of `TableColumnLineageEntity` may
> > > cause ambiguity and I have renamed it to
> > `TableColumnSourceLineageEntity`.
> > > In my mind the `TableColumnLineageRelation` represents the lineage for
> > each
> > > sink column, each column may be computed from multiple sources and
> > columns.
> > > I use `TableColumnSourceLineageEntity` to manage each source and its
> > > columns for the sink column, so `TableColumnLineageRelation` has a sink
> > > column name and `TableColumnSourceLineageEntity` list.
> > >
> > > > 2. Since LineageRelation already contains all information to build
> the
> > > lineage between sources and sink, do we still need to set the
> > LineageEntity
> > > in the source?
> > >
> > > The lineage interface of `DataStream` is very flexible. We have added
> > > `setLineageEntity` to the source to limit and verify user behavior,
> > > ensuring that users have not added non-existent sources as lineage.
> > >
> > > > 3. About the "Entity" and "Relation" naming, I was confused too, like
> > > Qingsheng mentioned. How about LineageVertex, LineageEdge, and
> > LineageEdges
> > > which contains multiple LineageEdge?
> > >
> > > We referred to `Atlas` for the name of lineage, it uses `Entity` and
> > > `Relation` to represent the lineage relationship and another metadata
> > > service `Datahub` uses `DataSet` to represent the entity. I think
> > `Entity`
> > > and `Relation` are nicer for lineage, what do you think of it?
> > >
> > > Best,
> > > Shammon FY
> > >
> > >
> > > On Thu, Jun 29, 2023 at 4:21 AM Jing Ge 
> > > wrote:
> > >
> > > > Hi Shammon,
> > > >
> > > > Thanks for your proposal. After reading the FLIP, I'd like to ask
> > > > some questions to make sure we are on the same page. Thanks!
> > > >
> > > > 1. TableColumnLineageRelation#sinkColumn() should return
> > > > TableColumnLineageEntity instead of String, right?
> > > >
> > > > 2. Since LineageRelation already contains all information to build
> the
> > > > lineage between sources and sink, do we still need to set the
> > > LineageEntity
> > > > in the source?
> > > >
> > > > 3. About the "Entity" and "Relation" naming, I was confused too, like
> > > > Qingsheng mentioned. How about LineageVertex, LineageEdge, and
> > > LineageEdges
> > > > which contains multiple LineageEdge? E.g. multiple sources join into
> > one
> > > > sink, or, edges of columns from one or different tables, etc.
> > > >
> > > > Best regards,
> > > > Jing
> > > >
> > > > On Sun, Jun 25, 2023 at 2:06 PM Shammon FY 
> wrote:
> > > >
> > > > > Hi yuxia and Yun,
> > > > >
> > > > > Thanks for your input.
> > > > >
> > > > > For yuxia:
> > > > > > 1: What kinds of JobStatus will the `JobExecutionStatusEven`
> > > including?
> > > > >
> > > > > At present, we only need to notify the listener when a job goes to
> > > > > termination, but I think it makes sense to add generic `oldStatus`
> > and
> > > > > `newStatus` in the listener and users can update the job state in
> > their
> > > > > service as needed.
> > > > >
> > > > > > 2: I'm really confused about the `config()` included in
> > > > `LineageEntity`,
> > > > > where is it from and what is it for ?
> > > > >
> > > > > The `config` in `LineageEntity` is used for users to get options
> for
> > > > source
> > > > > and sink connectors. As the examples in the FLIP, users can add
> > > > > server/group/topic information in the config for kafka and create
> > > lineage
> > > > > entities for `DataStream` jobs, then the listeners can get this
> > > > information
> > > > > to identify the same connector in 

Re: [DISCUSS] Flink and Externalized connectors leads to block and circular dependency problems

2023-07-07 Thread Mason Chen
Hi all,

I also agree with what's been said above.

+1, I think the Table API delegation is a good suggestion--it essentially
allows a connector to get Python support for free. We've seen that
Table/SQL and Python APIs complement each other well and are ideal for data
scientists. With respect to unaligned functionalities, I think that also
holds true for other APIs, e.g. Datastream and Table/SQL since there is
functionality that is not natural to represent as a configuration/SQL.

Best,
Mason

On Wed, Jul 5, 2023 at 10:14 PM Dian Fu  wrote:

> Hi Chesnay,
>
> >> The wrapping of connectors is a bit of a maintenance nightmare and
> doesn't really work with external/custom connectors.
>
> Cannot agree with you more.
>
> >> Has there ever been thoughts about changing flink-pythons connector
> setup to use the table api connectors underneath?
>
> I'm still not sure if this is feasible for all connectors; however, it
> may be a good idea. The concern is that the DataStream API connector
> functionality may be unaligned between the Java and Python connectors.
> There are also still a few connectors which only have DataStream API
> implementations, e.g. Google PubSub, RabbitMQ, Cassandra, Pulsar, Hybrid
> Source, etc. Moreover, PyFlink currently already supports Table API
> connectors, and if we take this route, maybe we could just tell users to
> use the Table API connectors directly.
>
> Another option in my head before is to provide an API which allows
> configuring the behavior via key/value pairs in both the Java & Python
> DataStream API connectors.
>
> Regards,
> Dian
>
> On Wed, Jul 5, 2023 at 6:34 PM Chesnay Schepler 
> wrote:
> >
> > Has there ever been thoughts about changing flink-pythons connector
> > setup to use the table api connectors underneath?
> >
> > The wrapping of connectors is a bit of a maintenance nightmare and
> > doesn't really work with external/custom connectors.
> >
> > On 04/07/2023 13:35, Dian Fu wrote:
> > > Thanks Ran Tao for proposing this discussion and Martijn for sharing
> > > the thought.
> > >
> > >>   While flink-python now fails the CI, it shouldn't actually depend
> on the
> > > externalized connectors. I'm not sure what PyFlink does with it, but if
> > > it belongs to the connector code,
> > >
> > > For each DataStream connector, there is a corresponding Python wrapper
> > > and also some test cases in PyFlink. In theory, we should move that
> > > wrapper into each connector repository. We did not do that when
> > > externalizing the connectors since it would add release burden: it
> > > would mean publishing each connector to PyPI separately.
> > >
> > > To resolve this problem, I guess we can move the connector support in
> > > PyFlink into the external connector repository.
> > >
> > > Regards,
> > > Dian
> > >
> > >
> > > On Mon, Jul 3, 2023 at 11:08 PM Ran Tao  wrote:
> > >> @Martijn
> > >> thanks for clear explanations.
> > >>
> > >> If we follow the line you specified (connectors shouldn't rely on
> > >> dependencies that may or may not be available in Flink itself), it
> > >> seems that we should explicitly add each dependency we need (such as
> > >> commons-io or commons-collections) to the connector pom, and bundle
> > >> it in the sql-connector uber jar.
> > >>
> > >> Then the only thing left is to make the flink-python tests not depend
> > >> on the released flink-connector. Maybe we should check it out and
> > >> decouple it as you suggested.
> > >>
> > >> Best Regards,
> > >> Ran Tao
> > >> https://github.com/chucheng92
> > >>
> > >>
> > >> Martijn Visser wrote on Mon, Jul 3, 2023 at 22:06:
> > >>
> > >>> Hi Ran Tao,
> > >>>
> > >>> Thanks for opening this topic. I think there's a couple of things at
> hand:
> > >>> 1. Connectors shouldn't rely on dependencies that may or may not be
> > >>> available in Flink itself, like we've seen with flink-shaded. That
> avoids a
> > >>> tight coupling between Flink and connectors, which is exactly what
> we try
> > >>> to avoid.
> > >>> 2. When following that line, that would also be applicable for
> things like
> > >>> commons-collections and commons-io. If a connector wants to use
> them, it
> > >>> should make sure that it bundles those artifacts itself.
> > >>> 3. While flink-python now fails the CI, it shouldn't actually depend
> on the
> > >>> externalized connectors. I'm not sure what PyFlink does with it, but
> if it
> > >>> belongs to the connector code, that code should also be moved to the
> > >>> individual connector repo. If it's just a generic test, we could
> consider
> > >>> creating a generic test against released connector versions to
> determine
> > >>> compatibility.
> > >>>
> > >>> I'm curious about the opinions of others as well.
> > >>>
> > >>> Best regards,
> > >>>
> > >>> Martijn
> > >>>
> > >>> On Mon, Jul 3, 2023 at 3:37 PM Ran Tao 
> wrote:
> > >>>
> >  I have an issue here that needs to upgrade commons-collections[1]
> (this
> > >>> is

[jira] [Created] (FLINK-32556) Renames contenderID into componentId

2023-07-07 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-32556:
-

 Summary: Renames contenderID into componentId
 Key: FLINK-32556
 URL: https://issues.apache.org/jira/browse/FLINK-32556
 Project: Flink
  Issue Type: Sub-task
Reporter: Matthias Pohl


We introduced {{contenderID}} in a lot of places with FLINK-26522. The original 
multi-component leader election classes of FLINK-24038 used {{componentId}}.

Revisiting that naming made me realize that it's actually wrong. A contender is 
a specific instance of a component that participates in the leader election; a 
component, in this sense, is the more abstract concept. {{contenderID}} suggests 
an ID for a specific contender instance, but the IDs we're sharing actually 
refer to a Flink component and are therefore the same across different 
contenders that compete for leadership of the same component. This contradicts 
the definition of an identifier.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)