[jira] [Created] (FLINK-28574) Bump the fabric8 kubernetes-client to 6.0.0

2022-07-15 Thread ConradJam (Jira)
ConradJam created FLINK-28574:
-

 Summary: Bump the fabric8 kubernetes-client to 6.0.0
 Key: FLINK-28574
 URL: https://issues.apache.org/jira/browse/FLINK-28574
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.1.0
Reporter: ConradJam


fabric8 kubernetes-client has now released
[6.0.0|https://github.com/fabric8io/kubernetes-client/releases/tag/v6.0.0].
We can later upgrade to this version and remove the deprecated API usage.

 





[RESULT][VOTE] Creating benchmark channel in Apache Flink slack

2022-07-15 Thread Anton Kalashnikov

Hi everyone,

I’m happy to announce that 'Creating new benchmark Slack channel' has 
been accepted, with 7 approving votes, 5 of which are binding:

- Yun Gao (binding)
- Yuan Mei (binding)
- Godfrey He
- Marton Balassi (binding)
- Alexander Fedulov
- Yun Tang (binding)
- Jing Zhang (binding)

There is no disapproving vote.

Thanks, everyone, for your votes.

Márton, I will reach out to you later to handle the admin steps.

--

Best regards,
Anton Kalashnikov



[DISCUSS] FLIP-252: Amazon DynamoDB Sink Connector

2022-07-15 Thread Danny Cranmer
Hello all,

We would like to start a discussion thread on FLIP-252: Amazon DynamoDB
Sink Connector [1] where we propose to provide a sink connector for Amazon
DynamoDB [2] based on the Async Sink [3]. Looking forward to comments and
feedback. Thank you.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-252%3A+Amazon+DynamoDB+Sink+Connector
[2] https://aws.amazon.com/dynamodb
[3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-171%3A+Async+Sink


[jira] [Created] (FLINK-28573) Nested type will lose nullability when converting from TableSchema

2022-07-15 Thread Jane Chan (Jira)
Jane Chan created FLINK-28573:
-

 Summary: Nested type will lose nullability when converting from 
TableSchema
 Key: FLINK-28573
 URL: https://issues.apache.org/jira/browse/FLINK-28573
 Project: Flink
  Issue Type: Bug
  Components: Table Store
Affects Versions: table-store-0.2.0
Reporter: Jane Chan
 Fix For: table-store-0.2.0


E.g. ArrayDataType, MultisetDataType, etc.
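For reference, a minimal sketch of the nested nullability attribute at stake, using the Flink Table API DataTypes factory (the TableSchema conversion path that loses it is not reproduced here):

{code}
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.types.DataType;

public class NestedNullability {
    public static void main(String[] args) {
        // A non-nullable array of non-nullable ints. A conversion that
        // drops nested nullability would report ARRAY<INT> instead.
        DataType arrayType = DataTypes.ARRAY(DataTypes.INT().notNull()).notNull();
        System.out.println(arrayType); // ARRAY<INT NOT NULL> NOT NULL
    }
}
{code}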





Re: [DISCUSS] Add new JobStatus fields to Flink Kubernetes Operator CRD

2022-07-15 Thread Gyula Fóra
Hi Matyas!

So to clarify your suggestion, we would have the following JobStatus fields:

jobId : String
state : String
savepointInfo : SavepointInfo
jobDetailsInfo : String (optional) - output of Flink Rest API job details

And the user could configure, via a flag, whether or not to include
jobDetailsInfo in the status.
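A minimal sketch of that shape (plain Java; the layout is illustrative rather
than the operator's actual class, and SavepointInfo stands in for the
operator's existing savepoint tracking type):

{code}
// Illustrative only: field names follow this thread, not the CRD source.
public class JobStatus {
    private String jobId;
    private String state;
    private SavepointInfo savepointInfo;
    // Optional raw output of the Flink REST API job details; populated
    // only when enabled via a config flag (default false).
    private String jobDetailsInfo;
}
{code}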

Cheers,
Gyula

On Fri, Jul 15, 2022 at 3:02 PM Őrhidi Mátyás 
wrote:

> Hi Gyula,
>
> since the jobDetailsInfo could evolve, another option would be to dump it
> as yaml/json into the metadata.
>
> Best,
> Matyas
>
> On Fri, Jul 15, 2022 at 2:58 PM Gyula Fóra  wrote:
>
> > Based on some further thought, a reasonable middle ground would be to add
> > an optional metadata/jobDetailsInfo field to the JobStatus.
> > We would also add an accompanying config option (default false) for
> > whether to populate this field for jobs.
> >
> > This way operator users could decide if they want to expose the job
> > information provided by Flink Rest API or only the information that the
> > operator itself needs.
> >
> > What do you all think?
> >
> > Gyula
> >
> > On Fri, Jul 15, 2022 at 2:09 PM Gyula Fóra  wrote:
> >
> > > Hi All!
> > >
> > > I fully acknowledge the general need to access more info about the
> > running
> > > deployments. This need however is very specific to the use-cases /
> > > platforms built on the operator.
> > > I think we need a good way to tackle this without growing the status
> > > arbitrarily.
> > >
> > > Currently the JobStatus in the operator contains the following fields:
> > >
> > >- jobId
> > >- state : Flink JobStatus
> > >- savepointInfo : Operator savepoint tracking info
> > >- startTime : Flink job startTime
> > >- updateTime : Last time state was updated in the operator
> > >- jobName: Name of the job
> > >
> > > Technically speaking only jobId, state and savepointInfo are used
> inside
> > > the operator logic, the rest is unnecessary and "could be removed"
> > without
> > > affecting any operator functionality.
> > >
> > > I think instead of adding more of these "unnecessary/arbitrary" fields
> we
> > > should add a more generic way that allows a configurable / pluggable
> way
> > to
> > > extend the status with user/platform specific fields based on the Flink
> > job
> > > information. At the same time we should already @Deprecate / phase out
> > the
> > > currently unnecessary fields.
> > >
> > > One way of doing this would be adding a new Map metadata
> > > (or similar) field. And at the same time add a configurable / pluggable
> > way
> > > to create the content of this metadata based on the Flink rest api
> > response
> > > (the extended job details).
> > >
> > > What do you think?
> > > Gyula
> > >
> > > On Fri, Jul 15, 2022 at 1:05 PM WONG, DAREN
> > 
> > > wrote:
> > >
> > >> Hi Martin,
> > >>
> > >> Yes, that's understandable. I think adding job endTime, duration,
> > jobPlan
> > >> is useful to other Flink users too as they now have info to track:
> > >>
> > >> 1. endTime: If the job has ended, the user can know when it has ended.
> > If
> > >> the job is still streaming, then the user can know as it defaults to
> > "-1".
> > >> 2. duration: Info on how long the job has been running for, useful for
> > >> monitoring purposes.
> > >> 3. jobPlan: Contains more detailed job info such as the operators in
> the
> > >> job graph and the parallelism of each operator. This could benefit
> Flink
> > >> users as follows:
> > >> 3.1. Help users to get a quick view on jobs simply by querying
> > >> via k8s API, without need to integrate with Flink Client/API. Useful
> for
> > >> users who mainly use kubectl.
> > >> 3.2. Allows users to easily notice a change in job. For eg, if
> > >> user changed a job code by adding a new operator but built it with
> same
> > jar
> > >> name, then they can notice the change in jobPlan.
> > >> 3.3. User may want to operate on jobPlan difference. For eg,
> > >> create difference notification, allocate resources, or other
> automation
> > >> purposed.
> > >>
> > >> In general, I think adding these info is useful for Flink users from
> > >> simple monitoring to audit trail purposes. In addition, these info are
> > >> available via Flink REST API, hence I believe Flink users who tracks
> > these
> > >> info via API would benefit from them when they start using Flink
> > Kubernetes
> > >> Operator.
> > >>
> > >> Regards,
> > >> Daren
> > >>
> > >>
> > >> On 13/07/2022, 08:25, "Martijn Visser" 
> > wrote:
> > >>
> > >>
> > >> Hi Daren,
> > >>
> > >> Could you list the benefits for the users of Flink? I do think
> that
> > an
> > >> internal AWS requirement is not a good argument for getting
> > something
> > >> done
> > >> in Flink.
> > >>
> > >> Best regards,
> > >

Re: [DISCUSS] Add new JobStatus fields to Flink Kubernetes Operator CRD

2022-07-15 Thread Őrhidi Mátyás
Hi Gyula,

Since the jobDetailsInfo could evolve, another option would be to dump it
as YAML/JSON into the metadata.

Best,
Matyas

On Fri, Jul 15, 2022 at 2:58 PM Gyula Fóra  wrote:

> Based on some further thought, a reasonable middle ground would be to add an
> optional metadata/jobDetailsInfo field to the JobStatus.
> We would also add an accompanying config option (default false) for whether to
> populate this field for jobs.
>
> This way operator users could decide if they want to expose the job
> information provided by Flink Rest API or only the information that the
> operator itself needs.
>
> What do you all think?
>
> Gyula
>
> On Fri, Jul 15, 2022 at 2:09 PM Gyula Fóra  wrote:
>
> > Hi All!
> >
> > I fully acknowledge the general need to access more info about the
> running
> > deployments. This need however is very specific to the use-cases /
> > platforms built on the operator.
> > I think we need a good way to tackle this without growing the status
> > arbitrarily.
> >
> > Currently the JobStatus in the operator contains the following fields:
> >
> >- jobId
> >- state : Flink JobStatus
> >- savepointInfo : Operator savepoint tracking info
> >- startTime : Flink job startTime
> >- updateTime : Last time state was updated in the operator
> >- jobName: Name of the job
> >
> > Technically speaking only jobId, state and savepointInfo are used inside
> > the operator logic, the rest is unnecessary and "could be removed"
> without
> > affecting any operator functionality.
> >
> > I think instead of adding more of these "unnecessary/arbitrary" fields we
> > should add a more generic way that allows a configurable / pluggable way
> to
> > extend the status with user/platform specific fields based on the Flink
> job
> > information. At the same time we should already @Deprecate / phase out
> the
> > currently unnecessary fields.
> >
> > One way of doing this would be adding a new Map metadata
> > (or similar) field. And at the same time add a configurable / pluggable
> way
> > to create the content of this metadata based on the Flink rest api
> response
> > (the extended job details).
> >
> > What do you think?
> > Gyula
> >
> > On Fri, Jul 15, 2022 at 1:05 PM WONG, DAREN
> 
> > wrote:
> >
> >> Hi Martin,
> >>
> >> Yes, that's understandable. I think adding job endTime, duration,
> jobPlan
> >> is useful to other Flink users too as they now have info to track:
> >>
> >> 1. endTime: If the job has ended, the user can know when it has ended.
> If
> >> the job is still streaming, then the user can know as it defaults to
> "-1".
> >> 2. duration: Info on how long the job has been running for, useful for
> >> monitoring purposes.
> >> 3. jobPlan: Contains more detailed job info such as the operators in the
> >> job graph and the parallelism of each operator. This could benefit Flink
> >> users as follows:
> >> 3.1. Help users to get a quick view on jobs simply by querying
> >> via k8s API, without need to integrate with Flink Client/API. Useful for
> >> users who mainly use kubectl.
> >> 3.2. Allows users to easily notice a change in job. For eg, if
> >> user changed a job code by adding a new operator but built it with same
> jar
> >> name, then they can notice the change in jobPlan.
> >> 3.3. User may want to operate on jobPlan difference. For eg,
> >> create difference notification, allocate resources, or other automation
> >> purposed.
> >>
> >> In general, I think adding these info is useful for Flink users from
> >> simple monitoring to audit trail purposes. In addition, these info are
> >> available via Flink REST API, hence I believe Flink users who tracks
> these
> >> info via API would benefit from them when they start using Flink
> Kubernetes
> >> Operator.
> >>
> >> Regards,
> >> Daren
> >>
> >>
> >> On 13/07/2022, 08:25, "Martijn Visser" 
> wrote:
> >>
> >>
> >> Hi Daren,
> >>
> >> Could you list the benefits for the users of Flink? I do think that
> an
> >> internal AWS requirement is not a good argument for getting
> something
> >> done
> >> in Flink.
> >>
> >> Best regards,
> >>
> >> Martijn
> >>
> >> Op di 12 jul. 2022 om 21:17 schreef WONG, DAREN
> >> :
> >>
> >> > Hi Yang,
> >> >
> >> > The requirement to add *plan* currently originates from an
> internal
> >> AWS
> >> > requirement as our service needs visibility of *plan*, but we
> think
> >> it
> >> > could be beneficial as well to customers who uses *plan* too.
> >> >
> >> > Regards,
> >> > Daren
> >> >
> >> >
> >> >
> >> >
> >> > On 12/07/2022, 13:23, "Yang Wang"  wrote:
> >> >

[jira] [Created] (FLINK-28572) FlinkSQL executes Table.execute multiple times on the same Table, and changing the position of the Table.execute calls produces different behavior

2022-07-15 Thread Andrew Chan (Jira)
Andrew Chan created FLINK-28572:
---

 Summary: FlinkSQL executes Table.execute multiple times on the same Table, and changing the position of the Table.execute calls produces different behavior
 Key: FLINK-28572
 URL: https://issues.apache.org/jira/browse/FLINK-28572
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / API, Table SQL / Planner
Affects Versions: 1.13.6
 Environment: flink-table-planner-blink_2.11, version 1.13.6
Reporter: Andrew Chan


*The following code both prints and inserts as expected:*

{code}
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInteger("rest.port", 2000);
    StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment(conf);
    env.setParallelism(1);
    StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
    // Source table backed by Kafka topic s1.
    tEnv.executeSql("create table sensor(" +
            " id string, " +
            " ts bigint, " +
            " vc int" +
            ") with (" +
            " 'connector' = 'kafka', " +
            " 'topic' = 's1', " +
            " 'properties.bootstrap.servers' = 'hadoop162:9092', " +
            " 'properties.group.id' = 'atguigu', " +
            " 'scan.startup.mode' = 'latest-offset', " +
            " 'format' = 'csv'" +
            ")");
    Table result = tEnv.sqlQuery("select * from sensor");
    // Sink table backed by Kafka topic s2.
    tEnv.executeSql("create table s_out(" +
            " id string, " +
            " ts bigint, " +
            " vc int" +
            ") with (" +
            " 'connector' = 'kafka', " +
            " 'topic' = 's2', " +
            " 'properties.bootstrap.servers' = 'hadoop162:9092', " +
            " 'format' = 'json', " +
            " 'sink.partitioner' = 'round-robin' " +
            ")");
    result.executeInsert("s_out");
    result.execute().print();
}
{code}

 

*When the print call is moved above the insert, the result is still printed
normally, but the insert statement takes no effect, as follows:*

{code}
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInteger("rest.port", 2000);
    StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment(conf);
    env.setParallelism(1);

    StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
    // 1. Create the source table (dynamic table) via DDL.
    tEnv.executeSql("create table sensor(" +
            " id string, " +
            " ts bigint, " +
            " vc int" +
            ") with (" +
            " 'connector' = 'kafka', " +
            " 'topic' = 's1', " +
            " 'properties.bootstrap.servers' = 'hadoop162:9092', " +
            " 'properties.group.id' = 'atguigu', " +
            " 'scan.startup.mode' = 'latest-offset', " +
            " 'format' = 'csv'" +
            ")");

    Table result = tEnv.sqlQuery("select * from sensor");

    tEnv.executeSql("create table s_out(" +
            " id string, " +
            " ts bigint, " +
            " vc int" +
            ") with (" +
            " 'connector' = 'kafka', " +
            " 'topic' = 's2', " +
            " 'properties.bootstrap.servers' = 'hadoop162:9092', " +
            " 'format' = 'json', " +
            " 'sink.partitioner' = 'round-robin' " +
            ")");
    // Print first, then insert: only the print takes effect.
    result.execute().print();
    result.executeInsert("s_out");
}
{code}
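A likely explanation: each execute()/executeInsert() call submits a separate
job, and print() blocks the caller while it consumes the unbounded Kafka
result, so any statement placed after it never runs. For comparison, a sketch
of submitting the insert through a StatementSet before printing (standard
Table API; untested against this report):

{code}
import org.apache.flink.table.api.StatementSet;

// Sketch: submit the insert as its own job first (execute() returns once
// the streaming job is submitted), then let print() block on the query.
StatementSet stmtSet = tEnv.createStatementSet();
stmtSet.addInsert("s_out", result);
stmtSet.execute();
result.execute().print();
{code}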





Re: [DISCUSS] Add new JobStatus fields to Flink Kubernetes Operator CRD

2022-07-15 Thread Gyula Fóra
Based on some further thought, a reasonable middle ground would be to add an
optional metadata/jobDetailsInfo field to the JobStatus.
We would also add an accompanying config option (default false) for whether to
populate this field for jobs.

This way operator users could decide if they want to expose the job
information provided by Flink Rest API or only the information that the
operator itself needs.

What do you all think?

Gyula

On Fri, Jul 15, 2022 at 2:09 PM Gyula Fóra  wrote:

> Hi All!
>
> I fully acknowledge the general need to access more info about the running
> deployments. This need however is very specific to the use-cases /
> platforms built on the operator.
> I think we need a good way to tackle this without growing the status
> arbitrarily.
>
> Currently the JobStatus in the operator contains the following fields:
>
>- jobId
>- state : Flink JobStatus
>- savepointInfo : Operator savepoint tracking info
>- startTime : Flink job startTime
>- updateTime : Last time state was updated in the operator
>- jobName: Name of the job
>
> Technically speaking only jobId, state and savepointInfo are used inside
> the operator logic, the rest is unnecessary and "could be removed" without
> affecting any operator functionality.
>
> I think instead of adding more of these "unnecessary/arbitrary" fields we
> should add a more generic way that allows a configurable / pluggable way to
> extend the status with user/platform specific fields based on the Flink job
> information. At the same time we should already @Deprecate / phase out the
> currently unnecessary fields.
>
> One way of doing this would be adding a new Map metadata
> (or similar) field. And at the same time add a configurable / pluggable way
> to create the content of this metadata based on the Flink rest api response
> (the extended job details).
>
> What do you think?
> Gyula
>
> On Fri, Jul 15, 2022 at 1:05 PM WONG, DAREN 
> wrote:
>
>> Hi Martin,
>>
>> Yes, that's understandable. I think adding job endTime, duration, jobPlan
>> is useful to other Flink users too as they now have info to track:
>>
>> 1. endTime: If the job has ended, the user can know when it has ended. If
>> the job is still streaming, then the user can know as it defaults to "-1".
>> 2. duration: Info on how long the job has been running for, useful for
>> monitoring purposes.
>> 3. jobPlan: Contains more detailed job info such as the operators in the
>> job graph and the parallelism of each operator. This could benefit Flink
>> users as follows:
>> 3.1. Help users to get a quick view on jobs simply by querying
>> via k8s API, without need to integrate with Flink Client/API. Useful for
>> users who mainly use kubectl.
>> 3.2. Allows users to easily notice a change in job. For eg, if
>> user changed a job code by adding a new operator but built it with same jar
>> name, then they can notice the change in jobPlan.
>> 3.3. User may want to operate on jobPlan difference. For eg,
>> create difference notification, allocate resources, or other automation
>> purposed.
>>
>> In general, I think adding these info is useful for Flink users from
>> simple monitoring to audit trail purposes. In addition, these info are
>> available via Flink REST API, hence I believe Flink users who tracks these
>> info via API would benefit from them when they start using Flink Kubernetes
>> Operator.
>>
>> Regards,
>> Daren
>>
>>
>> On 13/07/2022, 08:25, "Martijn Visser"  wrote:
>>
>>
>> Hi Daren,
>>
>> Could you list the benefits for the users of Flink? I do think that an
>> internal AWS requirement is not a good argument for getting something
>> done
>> in Flink.
>>
>> Best regards,
>>
>> Martijn
>>
>> Op di 12 jul. 2022 om 21:17 schreef WONG, DAREN
>> :
>>
>> > Hi Yang,
>> >
>> > The requirement to add *plan* currently originates from an internal
>> AWS
>> > requirement as our service needs visibility of *plan*, but we think
>> it
>> > could be beneficial as well to customers who uses *plan* too.
>> >
>> > Regards,
>> > Daren
>> >
>> >
>> >
>> >
>> > On 12/07/2022, 13:23, "Yang Wang"  wrote:
>> >
>> >
>> > Thanks for the explanation. Only having 1 API call in most
>> cases makes
>> > sense to me.
>> >
>> > Could you please elaborate more about why do we need the *plan*
>> in CR
>> > status?
>> >
>> >
>> > Best,
>> > Yang
>> >
>> > Gyula Fóra  于2022年7月12日周二 17:36写道:
>> >
>> > > H

Re: [DISCUSS] Add new JobStatus fields to Flink Kubernetes Operator CRD

2022-07-15 Thread Gyula Fóra
Hi All!

I fully acknowledge the general need to access more info about the running
deployments. This need, however, is very specific to the use-cases /
platforms built on the operator.
I think we need a good way to tackle this without growing the status
arbitrarily.

Currently the JobStatus in the operator contains the following fields:

   - jobId
   - state : Flink JobStatus
   - savepointInfo : Operator savepoint tracking info
   - startTime : Flink job startTime
   - updateTime : Last time state was updated in the operator
   - jobName: Name of the job

Technically speaking, only jobId, state, and savepointInfo are used inside
the operator logic; the rest are unnecessary and "could be removed" without
affecting any operator functionality.

Instead of adding more of these "unnecessary/arbitrary" fields, I think we
should add a more generic, configurable/pluggable way to extend the status
with user/platform-specific fields based on the Flink job information. At the
same time, we should already @Deprecate / phase out the currently unnecessary
fields.

One way of doing this would be to add a new Map metadata (or
similar) field, and at the same time a configurable/pluggable way to create
the content of this metadata based on the Flink REST API response (the
extended job details).
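A sketch of what such a pluggable hook could look like (the interface name and
signature are purely illustrative, not existing operator code):

{code}
import java.util.Map;

// Hypothetical plugin contract: derive user/platform-specific status
// metadata entries from the raw Flink REST API job-details response.
public interface StatusMetadataProvider {
    Map<String, String> deriveMetadata(String jobDetailsJson);
}
{code}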

What do you think?
Gyula

On Fri, Jul 15, 2022 at 1:05 PM WONG, DAREN 
wrote:

> Hi Martin,
>
> Yes, that's understandable. I think adding job endTime, duration, jobPlan
> is useful to other Flink users too as they now have info to track:
>
> 1. endTime: If the job has ended, the user can know when it has ended. If
> the job is still streaming, then the user can know as it defaults to "-1".
> 2. duration: Info on how long the job has been running for, useful for
> monitoring purposes.
> 3. jobPlan: Contains more detailed job info such as the operators in the
> job graph and the parallelism of each operator. This could benefit Flink
> users as follows:
> 3.1. Help users to get a quick view on jobs simply by querying via
> k8s API, without need to integrate with Flink Client/API. Useful for users
> who mainly use kubectl.
> 3.2. Allows users to easily notice a change in job. For eg, if
> user changed a job code by adding a new operator but built it with same jar
> name, then they can notice the change in jobPlan.
> 3.3. User may want to operate on jobPlan difference. For eg,
> create difference notification, allocate resources, or other automation
> purposed.
>
> In general, I think adding these info is useful for Flink users from
> simple monitoring to audit trail purposes. In addition, these info are
> available via Flink REST API, hence I believe Flink users who tracks these
> info via API would benefit from them when they start using Flink Kubernetes
> Operator.
>
> Regards,
> Daren
>
>
> On 13/07/2022, 08:25, "Martijn Visser"  wrote:
>
>
> Hi Daren,
>
> Could you list the benefits for the users of Flink? I do think that an
> internal AWS requirement is not a good argument for getting something
> done
> in Flink.
>
> Best regards,
>
> Martijn
>
> Op di 12 jul. 2022 om 21:17 schreef WONG, DAREN
> :
>
> > Hi Yang,
> >
> > The requirement to add *plan* currently originates from an internal
> AWS
> > requirement as our service needs visibility of *plan*, but we think
> it
> > could be beneficial as well to customers who uses *plan* too.
> >
> > Regards,
> > Daren
> >
> >
> >
> >
> > On 12/07/2022, 13:23, "Yang Wang"  wrote:
> >
> >
> > Thanks for the explanation. Only having 1 API call in most cases
> makes
> > sense to me.
> >
> > Could you please elaborate more about why do we need the *plan*
> in CR
> > status?
> >
> >
> > Best,
> > Yang
> >
> > Gyula Fóra  于2022年7月12日周二 17:36写道:
> >
> > > Hi Devs!
> > >
> > > I discussed with Daren offline, and I agree with him that
> > technically we
> > > almost never need 2 API calls.
> > >
> > > I think it's fine to have a second API call once directly after
> > application
> > > submission (technically even this can be eliminated by setting
> a fix
> > job id
> > > always).
> > >
> > > +1 from me.
> > >
> > > Cheers,
> > > Gyula
> > >
> > >
> > > On Tue, Jul 12, 2022 at 11:32 AM WONG, DAREN
> >  > > >
> > > wrote:
> > >
> >

[jira] [Created] (FLINK-28571) Add Chi-squared test as Transformer to ml.feature

2022-07-15 Thread Simon Tao (Jira)
Simon Tao created FLINK-28571:
-

 Summary: Add Chi-squared test as Transformer to ml.feature
 Key: FLINK-28571
 URL: https://issues.apache.org/jira/browse/FLINK-28571
 Project: Flink
  Issue Type: New Feature
  Components: Library / Machine Learning
Reporter: Simon Tao


Pearson's chi-squared test: https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test

For more information on the chi-squared test: http://en.wikipedia.org/wiki/Chi-squared_test
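For reference, a minimal self-contained sketch of the statistic such a
transformer would compute (plain Java, not the proposed Flink ML API):
chi2 = sum((observed - expected)^2 / expected), with expected counts derived
from the row/column marginals of a contingency table.

{code}
public class ChiSquaredSketch {

    // Pearson's chi-squared statistic for a contingency table of
    // observed counts (rows: feature values, columns: labels).
    static double chiSquared(long[][] observed) {
        int rows = observed.length;
        int cols = observed[0].length;
        long[] rowSum = new long[rows];
        long[] colSum = new long[cols];
        long total = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSum[i] += observed[i][j];
                colSum[j] += observed[i][j];
                total += observed[i][j];
            }
        }
        double chi2 = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                double expected = (double) rowSum[i] * colSum[j] / total;
                chi2 += Math.pow(observed[i][j] - expected, 2) / expected;
            }
        }
        return chi2;
    }

    public static void main(String[] args) {
        long[][] table = {{10, 20}, {30, 40}};
        System.out.println(chiSquared(table)); // ~0.794
    }
}
{code}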





Re: [DISCUSS] Add new JobStatus fields to Flink Kubernetes Operator CRD

2022-07-15 Thread WONG, DAREN
Hi Martijn,

Yes, that's understandable. I think adding job endTime, duration, and jobPlan
is useful to other Flink users too, as they would then have this info to track:

1. endTime: If the job has ended, the user knows when it ended. If the job is
still streaming, the user can tell because the field defaults to "-1".
2. duration: Info on how long the job has been running, useful for monitoring
purposes.
3. jobPlan: Contains more detailed job info, such as the operators in the job
graph and the parallelism of each operator. This could benefit Flink users as
follows:
    3.1. Helps users get a quick view of jobs simply by querying via the k8s
API, without needing to integrate with the Flink Client/API. Useful for users
who mainly use kubectl.
    3.2. Allows users to easily notice a change in a job. For example, if a
user changes the job code by adding a new operator but builds it with the same
jar name, they can notice the change in the jobPlan.
    3.3. Users may want to act on jobPlan differences, e.g., to create diff
notifications, allocate resources, or for other automation purposes.

In general, I think adding this info is useful for Flink users, from simple
monitoring to audit-trail purposes. In addition, this info is available via
the Flink REST API, hence I believe Flink users who track it via the API would
benefit when they start using the Flink Kubernetes Operator.

Regards,
Daren


On 13/07/2022, 08:25, "Martijn Visser"  wrote:


Hi Daren,

Could you list the benefits for the users of Flink? I do think that an
internal AWS requirement is not a good argument for getting something done
in Flink.

Best regards,

Martijn

Op di 12 jul. 2022 om 21:17 schreef WONG, DAREN
:

> Hi Yang,
>
> The requirement to add *plan* currently originates from an internal AWS
> requirement as our service needs visibility of *plan*, but we think it
> could be beneficial as well to customers who uses *plan* too.
>
> Regards,
> Daren
>
>
>
>
> On 12/07/2022, 13:23, "Yang Wang"  wrote:
>
>
> Thanks for the explanation. Only having 1 API call in most cases makes
> sense to me.
>
> Could you please elaborate more about why do we need the *plan* in CR
> status?
>
>
> Best,
> Yang
>
> Gyula Fóra  于2022年7月12日周二 17:36写道:
>
> > Hi Devs!
> >
> > I discussed with Daren offline, and I agree with him that
> technically we
> > almost never need 2 API calls.
> >
> > I think it's fine to have a second API call once directly after
> application
> > submission (technically even this can be eliminated by setting a fix
> job id
> > always).
> >
> > +1 from me.
> >
> > Cheers,
> > Gyula
> >
> >
> > On Tue, Jul 12, 2022 at 11:32 AM WONG, DAREN
>  > >
> > wrote:
> >
> > > Hi Matyas,
> > >
> > > Thanks for the feedback, and yes I agree. An alternative approach
> would
> > > instead be:
> > >
> > > - 2 API calls only when jobID is not available (i.e when
> submitting a new
> > > application cluster, which is a one-off event).
> > > - 1 API call when jobID is already available by directly calling
> > > "/jobs/:jobid".
> > >
> > > With this approach, we can keep the API call to 1 in most cases.
> > >
> > > Regards,
> > > Daren
> > >
> > >
> > > On 11/07/2022, 14:44, "Őrhidi Mátyás" 
> wrote:
> > >
> > >
> > > Hi Daren,
> > >
> > > At the moment the Operator fetches the job state via
> > >
> > >
> >
> 
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-overview
> > > which contains the 'end-time' and 'duration' fields already. I
> feel
> > > calling
> > > the
> > >
> > >
> >
> 
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-jobid
> > > after the previous call for every job in every reconcile loop
> would
> > be
> > > too
> >

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-15 Thread godfrey he
Hi Jing,

Thanks for driving this, LGTM.

Best,
Godfrey

Jingsong Li  于2022年7月15日周五 11:38写道:
>
> Thanks for starting this discussion.
>
> Have we considered introducing a listPartitionWithStats() in Catalog?
>
> Best,
> Jingsong
>
> On Fri, Jul 15, 2022 at 10:08 AM Jark Wu  wrote:
> >
> > Hi Jing,
> >
> > Thanks for starting this discussion. The bulk fetch is a great improvement
> > for the optimizer.
> > The FLIP looks good to me.
> >
> > Best,
> > Jark
> >
> > On Fri, 8 Jul 2022 at 17:36, Jing Ge  wrote:
> >
> > > Hi devs,
> > >
> > > After having multiple discussions with Jark and Godfrey, I'd like to
> > > start
> > > a discussion on the mailing list w.r.t. FLIP-247[1], which will
> > > significantly improve the performance by providing the bulk fetch
> > > capability for table and column statistics.
> > >
> > > Currently the statistics information about tables can only be fetched from
> > > the catalog by each given partition iteratively. Since getting statistics
> > > information from catalogs is a very heavy operation, in order to improve
> > > the query performance, we’d better provide functionality to fetch the
> > > statistics information of a table for all given partitions in one shot.
> > >
> > > Based on a manual performance test with 2000 partitions, the cost is
> > > improved from 10s to 2s, i.e., a 5x speedup.
> > >
> > > [1]
> > >
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-247%3A+Bulk+fetch+of+table+and+column+statistics+for+given+partitions
> > >
> > > Best regards,
> > > Jing
> > >


[jira] [Created] (FLINK-28570) Introduces a StreamPhysicalPlanChecker to validate whether there are any non-deterministic updates which may cause wrong results

2022-07-15 Thread lincoln lee (Jira)
lincoln lee created FLINK-28570:
---

 Summary: Introduces a StreamPhysicalPlanChecker to validate whether there are any non-deterministic updates which may cause wrong results
 Key: FLINK-28570
 URL: https://issues.apache.org/jira/browse/FLINK-28570
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: lincoln lee
 Fix For: 1.16.0








[jira] [Created] (FLINK-28569) SinkUpsertMaterializer should be aware of the input upsertKey if it is not empty

2022-07-15 Thread lincoln lee (Jira)
lincoln lee created FLINK-28569:
---

 Summary: SinkUpsertMaterializer should be aware of the input 
upsertKey if it is not empty
 Key: FLINK-28569
 URL: https://issues.apache.org/jira/browse/FLINK-28569
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Runtime
Reporter: lincoln lee
 Fix For: 1.16.0








[jira] [Created] (FLINK-28568) Implements a new lookup join operator (sync mode only) with state to eliminate the non-determinism

2022-07-15 Thread lincoln lee (Jira)
lincoln lee created FLINK-28568:
---

 Summary: Implements a new lookup join operator (sync mode only) with state to eliminate the non-determinism
 Key: FLINK-28568
 URL: https://issues.apache.org/jira/browse/FLINK-28568
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Runtime
Reporter: lincoln lee
 Fix For: 1.16.0








[jira] [Created] (FLINK-28567) Introduce predicate inference from one side of join to the other

2022-07-15 Thread Alexander Trushev (Jira)
Alexander Trushev created FLINK-28567:
-

 Summary: Introduce predicate inference from one side of join to 
the other
 Key: FLINK-28567
 URL: https://issues.apache.org/jira/browse/FLINK-28567
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / Planner
Reporter: Alexander Trushev


h2. Context

There is a JoinPushTransitivePredicatesRule in Calcite that infers predicates
from a Join and creates Filters if those predicates can be pushed to its
inputs.
*Example.* (a0 = b0 AND a0 > 0) => (b0 > 0)

h2. Proposal

Add org.apache.calcite.rel.rules.JoinPushTransitivePredicatesRule to 
FlinkStreamRuleSets and to FlinkBatchRuleSets

h2. Benefit

Before the changes:

{code}
Flink SQL> explain select * from A join B on a0 = b0 and a0 > 0;

Join(joinType=[InnerJoin], where=[=(a0, b0)], select=[a0, b0], leftInputSpec=[NoUniqueKey], rightInputSpec=[NoUniqueKey])
:- Exchange(distribution=[hash[a0]])
:  +- Calc(select=[a0], where=[>(a0, 0)])
:     +- TableSourceScan(table=[[default_catalog, default_database, A, filter=[]]], fields=[a0])
+- Exchange(distribution=[hash[b0]])
   +- TableSourceScan(table=[[default_catalog, default_database, B]], fields=[b0])
{code}

After the changes:

{code}
Join(joinType=[InnerJoin], where=[=(a0, b0)], select=[a0, b0], leftInputSpec=[NoUniqueKey], rightInputSpec=[NoUniqueKey])
:- Exchange(distribution=[hash[a0]])
:  +- Calc(select=[a0], where=[>(a0, 0)])
:     +- TableSourceScan(table=[[default_catalog, default_database, A, filter=[]]], fields=[a0])
+- Exchange(distribution=[hash[b0]])
   +- Calc(select=[b0], where=[>(b0, 0)])
      +- TableSourceScan(table=[[default_catalog, default_database, B, filter=[]]], fields=[b0])
{code}

i.e., b0 > 0 is inferred and pushed down
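For reference, a sketch of how the rule instance is obtained in recent Calcite
versions (assuming Calcite's CoreRules holder class; wiring it into
FlinkStreamRuleSets/FlinkBatchRuleSets is left out):

{code}
import org.apache.calcite.rel.rules.CoreRules;
import org.apache.calcite.tools.RuleSet;
import org.apache.calcite.tools.RuleSets;

public class TransitivePredicatesRules {
    // JOIN_PUSH_TRANSITIVE_PREDICATES is Calcite's canonical instance
    // of JoinPushTransitivePredicatesRule.
    public static RuleSet rules() {
        return RuleSets.ofList(CoreRules.JOIN_PUSH_TRANSITIVE_PREDICATES);
    }
}
{code}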






[jira] [Created] (FLINK-28566) Adds materialization support to eliminate the non-determinism generated by the lookup join node

2022-07-15 Thread lincoln lee (Jira)
lincoln lee created FLINK-28566:
---

 Summary: Adds materialization support to eliminate the non-determinism generated by the lookup join node
 Key: FLINK-28566
 URL: https://issues.apache.org/jira/browse/FLINK-28566
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: lincoln lee
 Fix For: 1.16.0


Many users hit potential exceptions or data errors when using an update stream 
to lookup join an external table, essentially because looking up the external 
table based on processing time yields non-deterministic results.

To minimize this, when updates exist in the input stream and the lookup key 
does not contain the primary key of the external table, Flink automatically 
materializes the updates by default: it only looks up the external table when 
an insert or update_after message arrives, and when a delete or update_before 
message arrives, it directly queries the latest version of the locally 
materialized data and sends it to the downstream operator.

To do so, we introduce a new option 'table.exec.lookup-join.upsert-materialize' 
and reuse the `UpsertMaterialize` mechanism. By default, the materialize 
operator is added when an update stream looks up an external table whose 
primary key is not contained in the lookup key (including tables with no 
primary key defined). You can also choose no materialization (NONE) or forced 
materialization (FORCE), which always enables materialization unless the input 
is insert-only.
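A minimal sketch of setting the proposed option from the Table API (the option
key comes from this ticket; the exact accepted values are an assumption based
on the NONE/FORCE choices described above):

{code}
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class LookupMaterializeConfig {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        // Force the materialize operator for lookup joins on update
        // streams; 'NONE' disables it, and the default decides
        // automatically based on the lookup key and primary key.
        tEnv.getConfig().getConfiguration()
                .setString("table.exec.lookup-join.upsert-materialize", "FORCE");
    }
}
{code}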





[RESULT][VOTE] FLIP-249: Flink Web UI Enhancement for Speculative Execution

2022-07-15 Thread Gen Luo
Hi everyone,

I’m happy to announce that FLIP-249[1] has been accepted, with 4 approving
votes, 3 of which are binding[2]:
- Zhu Zhu (binding)
- Lijie Wang
- Jing Zhang (binding)
- Yun Gao (binding)

There is no disapproving vote.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-249%3A+Flink+Web+UI+Enhancement+for+Speculative+Execution
[2] https://lists.apache.org/thread/0xonn4y8lkj49comors0z86tpyzkvvqg

Best,
Gen


[jira] [Created] (FLINK-28565) Create NOTICE file for flink-table-store-hive-catalog

2022-07-15 Thread Jingsong Lee (Jira)
Jingsong Lee created FLINK-28565:


 Summary: Create NOTICE file for flink-table-store-hive-catalog
 Key: FLINK-28565
 URL: https://issues.apache.org/jira/browse/FLINK-28565
 Project: Flink
  Issue Type: Improvement
  Components: Table Store
Reporter: Jingsong Lee
 Fix For: table-store-0.2.0








[jira] [Created] (FLINK-28564) Update NOTICE/LICENCE files for 1.1.0 release

2022-07-15 Thread Matyas Orhidi (Jira)
Matyas Orhidi created FLINK-28564:
-

 Summary: Update NOTICE/LICENCE files for 1.1.0 release
 Key: FLINK-28564
 URL: https://issues.apache.org/jira/browse/FLINK-28564
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Matyas Orhidi
 Fix For: kubernetes-operator-1.1.0








[jira] [Created] (FLINK-28563) Add Transformer for VectorSlicer

2022-07-15 Thread weibo zhao (Jira)
weibo zhao created FLINK-28563:
--

 Summary: Add Transformer for VectorSlicer
 Key: FLINK-28563
 URL: https://issues.apache.org/jira/browse/FLINK-28563
 Project: Flink
  Issue Type: New Feature
  Components: Library / Machine Learning
Reporter: weibo zhao


h1. Add Transformer for VectorSlicer.
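For context, a vector slicer selects a subset of entries from a feature vector
by index; a minimal sketch of the underlying operation (plain Java, independent
of the Flink ML Transformer API):

{code}
public class VectorSlicerSketch {

    // Select the given indices from a dense vector, e.g. indices {0, 2}
    // applied to [1.0, 2.0, 3.0] yields [1.0, 3.0].
    static double[] slice(double[] vector, int[] indices) {
        double[] out = new double[indices.length];
        for (int i = 0; i < indices.length; i++) {
            out[i] = vector[indices[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] sliced = slice(new double[] {1.0, 2.0, 3.0}, new int[] {0, 2});
        System.out.println(java.util.Arrays.toString(sliced)); // [1.0, 3.0]
    }
}
{code}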


