Re: Re: [DISCUSS] FLIP-134: DataStream Semantics for Bounded Input

2020-08-17 Thread Yun Gao
Hi, 

Many thanks for bringing up this discussion!

One more question: do the BATCH and STREAMING modes also decide the 
shuffle types and operators? I am asking because even in blocking 
mode, a job should benefit from keeping some edges pipelined if the 
resources are known to be sufficient. Do we also plan to expose more 
fine-grained control over the shuffle types? 

Best,
 Yun 



 --Original Mail --
Sender: Kostas Kloudas 
Send Date: Tue Aug 18 02:24:21 2020
Recipients: David Anderson 
CC: dev, user 
Subject:Re: [DISCUSS] FLIP-134: DataStream Semantics for Bounded Input
Hi Kurt and David,

Thanks a lot for the insightful feedback!

@Kurt: For the topic of checkpointing with Batch Scheduling, I totally
agree with you that it requires a lot more work and careful thinking
on the semantics. This FLIP was written under the assumption that if
the user wants to have checkpoints on bounded input, he/she will have
to go with STREAMING as the scheduling mode. Checkpointing for BATCH
can be handled as a separate topic in the future.

In the case of MIXED workloads and for this FLIP, the scheduling mode
should be set to STREAMING. That is why the AUTOMATIC option sets
scheduling to BATCH only if all the sources are bounded. I am not sure
what the plans are at the scheduling level, as one could imagine
in the future that in mixed workloads, we schedule first all the
bounded subgraphs in BATCH mode and we allow only one UNBOUNDED
subgraph per application, which is going to be scheduled after all
Bounded ones have finished. Essentially the bounded subgraphs will be
used to bootstrap the unbounded one. But, I am not aware of any plans
towards that direction.
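
To make the modes concrete, here is a minimal sketch of how a pipeline
author might select the scheduling mode under this proposal. The setter
and enum names are assumptions extrapolated from the FLIP's BATCH /
STREAMING / AUTOMATIC options, not a finalized API:

```
// Hypothetical API shape -- the names below are assumptions based on
// the FLIP's three options, not a finalized interface.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Keep STREAMING scheduling even with bounded sources, e.g. to retain
// checkpointing on bounded input:
env.setRuntimeMode(RuntimeExecutionMode.STREAMING);

// Or let the framework decide: BATCH if and only if all sources are bounded.
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
```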


@David: The processing time timer handling is a topic that has also
been discussed in the community in the past, and I do not remember any
final conclusion unfortunately.

In the current context and for bounded input, we chose to favor
reproducibility of the result, as this is expected in batch processing
where the whole input is available in advance. This is why this
proposal suggests to not allow processing time timers. But I
understand your argument that the user may want to be able to run the
same pipeline on batch and streaming; this is why we added the two
options under future work, namely (from the FLIP):

```
Future Work: In the future we may consider adding as options the capability of:
* firing all the registered processing time timers at the end of a job
(at close()) or,
* ignoring all the registered processing time timers at the end of a job.
```

Conceptually, we are essentially saying that batch execution is assumed
to be instantaneous, to refer to a single "point" in time, and that any
processing-time timers for the future may fire at the end of execution
or be ignored (but not throw an exception). I could also see ignoring
the timers in batch as the default, if this makes more sense.
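
To illustrate what is at stake, here is a minimal sketch of the affected
pattern: a KeyedProcessFunction registering a processing-time timer that
may still be pending when a bounded input is exhausted. Under the first
option above the timer would fire at the end of the job; under the second
it would be silently dropped. The `Record` type and its key field are
illustrative:

```
stream.keyBy(r -> r.key)
      .process(new KeyedProcessFunction<String, Record, Record>() {
          @Override
          public void processElement(Record value, Context ctx, Collector<Record> out) {
              // A timer one minute into the future; on bounded input the job
              // may finish before this wall-clock time is ever reached.
              ctx.timerService().registerProcessingTimeTimer(
                      ctx.timerService().currentProcessingTime() + 60_000);
              out.collect(value);
          }

          @Override
          public void onTimer(long timestamp, OnTimerContext ctx, Collector<Record> out) {
              // Option 1: invoked once at the end of the job (at close()).
              // Option 2: never invoked for timers still in the future.
          }
      });
```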

By the way, do you have any use cases in mind that will help us better
shape our processing time timer handling?

Kostas

On Mon, Aug 17, 2020 at 2:52 PM David Anderson  wrote:
>
> Kostas,
>
> I'm pleased to see some concrete details in this FLIP.
>
> I wonder if the current proposal goes far enough in the direction of 
> recognizing the need some users may have for "batch" and "bounded streaming" 
> to be treated differently. If I've understood it correctly, the section on 
> scheduling allows me to choose STREAMING scheduling even if I have bounded 
> sources. I like that approach, because it recognizes that even though I have 
> bounded inputs, I don't necessarily want batch processing semantics. I think 
> it makes sense to extend this idea to processing time support as well.
>
> My thinking is that sometimes in development and testing it's reasonable to 
> run exactly the same job as in production, except with different sources and 
> sinks. While it might be a reasonable default, I'm not convinced that 
> switching a processing time streaming job to read from a bounded source 
> should always cause it to fail.
>
> David
>
> On Wed, Aug 12, 2020 at 5:22 PM Kostas Kloudas  wrote:
>>
>> Hi all,
>>
>> As described in FLIP-131 [1], we are aiming at deprecating the DataSet
>> API in favour of the DataStream API and the Table API. After this work
>> is done, the user will be able to write a program using the DataStream
>> API and this will execute efficiently on both bounded and unbounded
>> data. But before we reach this point, it is worth discussing and
>> agreeing on the semantics of some operations as we transition from the
>> streaming world to the batch one.
>>
>> This thread and the associated FLIP [2] aim at discussing these issues
>> as these topics are pretty important to users and can lead to
>> unpleasant surprises if we do not pay attention.
>>
>> Let's have a healthy discussion here and I will be updating the FLIP
>> accordingly.
>>
>> Cheers,
>> Kostas
>>
>> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866741
>> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158871522

[jira] [Created] (FLINK-18982) InputFormat should remove unnecessary override methods

2020-08-17 Thread Ryan Tao (Jira)
Ryan Tao created FLINK-18982:


 Summary: InputFormat should remove unnecessary override methods
 Key: FLINK-18982
 URL: https://issues.apache.org/jira/browse/FLINK-18982
 Project: Flink
  Issue Type: Improvement
  Components: API / Core
Affects Versions: 1.11.1
Reporter: Ryan Tao


_InputFormat_ inherits from _InputSplitSource_. The two methods shown below 
are already declared in the parent interface, so re-declaring them here is 
unnecessary and may easily lead to ambiguity. 

InputFormat & InputSplitSource:
{code:java}
public interface InputFormat<OT, T extends InputSplit> extends 
InputSplitSource<T>, Serializable {
...
@Override
T[] createInputSplits(int minNumSplits) throws IOException;
@Override
InputSplitAssigner getInputSplitAssigner(T[] inputSplits);
...
}
{code}
 
{code:java}
public interface InputSplitSource<T extends InputSplit> extends Serializable {
   T[] createInputSplits(int minNumSplits) throws Exception; 
   InputSplitAssigner getInputSplitAssigner(T[] inputSplits);
}{code}

As for the reason: looking at the commit history, we can find that these two 
methods appeared in _InputFormat_ before _InputSplitSource_ existed. Later 
they were extracted into _InputSplitSource_, but they were never removed 
from _InputFormat_.

Another point is that _InputSplitSource_ declares `throws Exception`, as it 
should. (Subclasses of _InputFormat_ throw more specific exceptions.)
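
For illustration, a rough sketch of the cleaned-up interface this issue suggests (abbreviated):
{code:java}
// Sketch only: createInputSplits and getInputSplitAssigner are inherited
// from InputSplitSource<T> and are no longer re-declared here.
public interface InputFormat<OT, T extends InputSplit> extends InputSplitSource<T>, Serializable {
    void configure(Configuration parameters);
    // ... the remaining I/O methods (getStatistics, open, reachedEnd,
    // nextRecord, close) stay unchanged ...
}
{code}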



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Re: [DISCUSS] Removing deprecated methods from DataStream API

2020-08-17 Thread Yun Gao
+1 for removing the methods that have been deprecated for a while and have 
alternatives.

One specific question: if we remove DataStream#split, do we consider 
enabling side outputs in more operators in the future? Currently they are 
only available in ProcessFunctions, but not in other commonly used 
UDFs such as sources or AsyncFunction [1].

One temporary workaround that occurs to me is to add a ProcessFunction after 
the operator that wants to use side outputs (a sketch follows below). But this 
solution is not very obvious to come up with, and if it really works we might 
add it to the side-output documentation. 

[1] https://issues.apache.org/jira/browse/FLINK-7954
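
A minimal sketch of that workaround: a pass-through ProcessFunction placed
right after the operator, routing records the upstream UDF has tagged. The
tagging convention and all names here are illustrative assumptions, not an
existing API:

```
final OutputTag<String> diverted = new OutputTag<String>("diverted") {};

SingleOutputStreamOperator<String> main = upstream
    .process(new ProcessFunction<String, String>() {
        @Override
        public void processElement(String value, Context ctx, Collector<String> out) {
            if (value.startsWith("DIVERT:")) {   // assumed tagging convention
                ctx.output(diverted, value);     // route to the side output
            } else {
                out.collect(value);              // pass everything else through
            }
        }
    });

DataStream<String> sideStream = main.getSideOutput(diverted);
```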

Best,
 Yun


 --Original Mail --
Sender: Kostas Kloudas 
Send Date: Tue Aug 18 03:52:44 2020
Recipients: Dawid Wysakowicz 
CC: dev, user 
Subject:Re: [DISCUSS] Removing deprecated methods from DataStream API
+1 for removing them.

From a quick look, most of them (not all) have been deprecated a long time ago.

Cheers,
Kostas

On Mon, Aug 17, 2020 at 9:37 PM Dawid Wysakowicz  wrote:
>
> @David Yes, my idea was to remove any use of the fold method and all related
> classes, including WindowedStream#fold
>
> @Klou Good idea to also remove the deprecated enableCheckpointing() &
> StreamExecutionEnvironment#readFile and alike. I did another pass over some
> of the classes and thought we could also drop:
>
> ExecutionConfig#set/getCodeAnalysisMode
> ExecutionConfig#disable/enableSysoutLogging
> ExecutionConfig#set/isFailTaskOnCheckpointError
> ExecutionConfig#isLatencyTrackingEnabled
>
> As for the `forceCheckpointing` I am not fully convinced about doing it. As far
> as I know iterations still do not participate in checkpointing correctly.
> Therefore it still might make sense to force it. In other words there is no
> real alternative to that method. Unless we only remove the methods from
> StreamExecutionEnvironment and redirect to the setter in CheckpointConfig.
> WDYT?
>
> An updated list of methods I suggest to remove:
>
> ExecutionConfig#set/getCodeAnalysisMode (deprecated in 1.9)
> ExecutionConfig#disable/enableSysoutLogging (deprecated in 1.10)
> ExecutionConfig#set/isFailTaskOnCheckpointError (deprecated in 1.9)
> ExecutionConfig#isLatencyTrackingEnabled (deprecated in 1.7)
> ExecutionConfig#(get)/setNumberOfExecutionRetries() (deprecated in 1.1?)
> StreamExecutionEnvironment#readFile,readFileStream(...),socketTextStream(...),socketTextStream(...)
>  (deprecated in 1.2)
> RuntimeContext#getAllAccumulators (deprecated in 0.10)
> DataStream#fold and all related classes and methods such as FoldFunction,
> FoldingState, FoldingStateDescriptor ... (deprecated in 1.3/1.4)
> StreamExecutionEnvironment#setStateBackend(AbstractStateBackend) (deprecated
> in 1.5)
> DataStream#split (deprecated in 1.8)
> Methods in (Connected)DataStream that specify keys as either indices or field
> names such as DataStream#keyBy, DataStream#partitionCustom,
> ConnectedStream#keyBy,  (deprecated in 1.11)
>
> Bear in mind that the majority of the options listed above in ExecutionConfig
> take no effect. They were left there purely to satisfy binary
> compatibility. Personally I don't see any benefit in keeping a method while
> silently dropping the underlying feature. The only configuration that is
> respected is setting the number of execution retries.
>
> I also wanted to make it explicit that most of the changes above would result
> in a binary incompatibility and require additional exclusions in the japicmp
> configuration. Those are:
>
> ExecutionConfig#set/getCodeAnalysisMode (deprecated in 1.9)
> ExecutionConfig#disable/enableSysoutLogging (deprecated in 1.10)
> ExecutionConfig#set/isFailTaskOnCheckpointError (deprecated in 1.9)
> ExecutionConfig#isLatencyTrackingEnabled (deprecated in 1.7)
> ExecutionConfig#(get)/setNumberOfExecutionRetries() (deprecated in 1.1?)
> DataStream#fold and all related classes and methods such as FoldFunction,
> FoldingState, FoldingStateDescriptor ... (deprecated in 1.3/1.4)
> DataStream#split (deprecated in 1.8)
> Methods in (Connected)DataStream that specify keys as either indices or field
> names such as DataStream#keyBy, DataStream#partitionCustom,
> ConnectedStream#keyBy,  (deprecated in 1.11)
> StreamExecutionEnvironment#readFile,readFileStream(...),socketTextStream(...),socketTextStream(...)
>  (deprecated in 1.2)
>
> Looking forward to more opinions on the issue.
>
> Best,
>
> Dawid
>
>
> On 17/08/2020 12:49, Kostas Kloudas wrote:
>
> Thanks a lot for starting this Dawid,
>
> Big +1 for the proposed clean-up, and I would also add the deprecated
> methods of the StreamExecutionEnvironment like:
>
> enableCheckpointing(long interval, CheckpointingMode mode, boolean force)
> enableCheckpointing()
> isForceCheckpointing()
>
> readFile(FileInputFormat inputFormat,String
> filePath,FileProcessingMode watchType,long interval, FilePathFilter
> 

Re: [DISCUSS] FLIP-132: Temporal Table DDL

2020-08-17 Thread Rui Li
Thanks Leonard for the clarifications!

On Mon, Aug 17, 2020 at 9:17 PM Leonard Xu  wrote:

>
> > But are we still able to track different views of such a
> > table through time, as rows are added/deleted to/from the table?
>
> Yes, in fact we support temporal tables from a changelog, which contains all
> possible message types (INSERT/UPDATE/DELETE).
>
> > For
> > example, suppose I have an append-only table source with event-time and
> PK,
> > will I be allowed to do an event-time temporal join with this table?
> Yes, I listed some examples in the doc; the example versioned_rates3 is
> exactly this case.
>
> Best
> Leonard
>
>
> >
> > On Wed, Aug 12, 2020 at 3:31 PM Leonard Xu <xbjt...@gmail.com> wrote:
> >
> >> Hi, all
> >>
> >> After a detailed offline discussion about the temporal table related
> >> concept and behavior, we had a reliable solution and rejected several
> >> alternatives.
> >> Compared to rejected alternatives, the proposed approach is a more unified
> >> story and is also friendly to users and the current Flink framework.
> >> I improved the FLIP[1] with the proposed approach and refactored the
> >> document organization to make it clear enough.
> >>
> >> Please let me know if you have any concerns; I'm looking forward to your
> >> comments.
> >>
> >>
> >> Best
> >> Leonard
> >>
> >> [1]
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-132+Temporal+Table+DDL
> >>
> >>> On Aug 4, 2020, at 21:25, Leonard Xu <xbjt...@gmail.com> wrote:
> >>>
> >>> Hi, all
> >>>
> >>> I’ve updated the FLIP[1] with the terminology `ChangelogTime`.
> >>>
> >>> Best
> >>> Leonard
> >>> [1]
> >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-132+Temporal+Table+DDL
> >>>
>  On Aug 4, 2020, at 20:58, Leonard Xu <xbjt...@gmail.com> wrote:
> 
>  Hi, Timo
> 
>  Thanks for your response.
> 
> > 1) Naming: Is operation time a good term for this concept? If I read
> >> "The operation time is the time when the changes happened in system." or
> >> "The system time of DML execution in database", why don't we call it
> >> `ChangelogTime` or `SystemTime`? Introducing another terminology of time in
> >> Flink should be thought through.
> 
>  I agree that we should think it through. I have considered the names
> >> `ChangelogTime` and `SystemTime` too; I don't have a strong opinion on the
> >> name.
> 
>  I proposed `operationTime` because most changelogs come from databases,
> >> and in a database an action is called an `operation` rather than a `change`.
> >> The operation time is easier to understand for database users,
> >> but it is more of a database terminology.
> 
>  For `SystemTime`, users may be confused about which system
> >> `SystemTime` refers to: Flink, the database, or the CDC tool. Maybe it's not a
> >> good name.
> 
>  `ChangelogTime` is a good choice that is more consistent with the existing
> >> terminology `Changelog` and `ChangelogMode`, so let me use `ChangelogTime`
> >> and I'll update the FLIP.
> 
> 
> > 2) Exposing it through `org.apache.flink.types.Row`: Shall we also
> >> expose the concept of time through the user-level `Row` type? The FLIP does
> >> not mention this explicitly. I think we can keep it as an internal concept
> >> but I just wanted to ask for clarification.
> 
>  Yes, I want to keep it as an internal concept. We have discussed that the
> >> changelog time should be a third time concept (the other two being
> >> event-time and processing-time). It is not easy for normal users to
> >> understand the three concepts accurately, and I have not found a big enough
> >> scenario where users need to touch the changelog time for now,
> >> so I tend not to expose the concept to users.
> 
> 
>  Best,
>  Leonard
> 
> 
> >
> > On 04.08.20 04:58, Leonard Xu wrote:
> >> Thanks Konstantin,
> >> Regarding your questions, I hope my comments have addressed them,
> >> and I also added a few explanations in the FLIP.
> >> Thank you all for the feedback,
> >> It seems everyone involved  in this thread has reached a consensus.
> >> I will start a vote thread  later.
> >> Best,
> >> Leonard
> >>> On Aug 3, 2020, at 19:35, godfrey he <godfre...@gmail.com> wrote:
> >>>
> >>> Thanks Leonard for driving this FLIP.
> >>> Looks good to me.
> >>>
> >>> Best,
> >>> Godfrey
> 

[jira] [Created] (FLINK-18981) Support column comment for Hive tables

2020-08-17 Thread Rui Li (Jira)
Rui Li created FLINK-18981:
--

 Summary: Support column comment for Hive tables
 Key: FLINK-18981
 URL: https://issues.apache.org/jira/browse/FLINK-18981
 Project: Flink
  Issue Type: Task
  Components: Connectors / Hive
Reporter: Rui Li
 Fix For: 1.12.0


Start working on this once FLINK-18958 is done



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Decompose failure recovery time

2020-08-17 Thread Zhinan Cheng
Hi all,

I am working on measuring the failure recovery time of Flink, and I
want to decompose the recovery time into different parts, say the time
to detect the failure, the time to restart the job, and the time to
restore from the checkpoint.

Unfortunately, I cannot find any information in the Flink docs about
this. Is there any mechanism Flink provides for this? If not, how can
I solve it?

Thanks a lot for your help.

Regards,
Juno


[jira] [Created] (FLINK-18980) "Avro Confluent Schema Registry nightly end-to-end test" hangs

2020-08-17 Thread Dian Fu (Jira)
Dian Fu created FLINK-18980:
---

 Summary: "Avro Confluent Schema Registry nightly end-to-end test" 
hangs
 Key: FLINK-18980
 URL: https://issues.apache.org/jira/browse/FLINK-18980
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka, Tests
Affects Versions: 1.11.0
Reporter: Dian Fu


[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5629&view=logs&j=08866332-78f7-59e4-4f7e-49a56faa3179&t=3e8647c1-5a28-5917-dd93-bf78594ea994]

{code}
2020-08-17T22:10:48.3400852Z Job has been submitted with JobID 
d9a2a95d9204d149165af0e0a1a7c488
2020-08-17T22:10:49.1383170Z [2020-08-17 22:10:49,137] INFO 127.0.0.1 - - 
[17/Aug/2020:22:10:49 +] "GET /schemas/ids/1 HTTP/1.1" 200 389  6 
(io.confluent.rest-utils.requests:77)
2020-08-17T22:10:51.8854897Z [2020-08-17 22:10:49,256] INFO Wait to catch up 
until the offset of the last message at 3 
(io.confluent.kafka.schemaregistry.storage.KafkaStore:356)
2020-08-17T22:10:51.8856950Z [2020-08-17 22:10:49,775] INFO 127.0.0.1 - - 
[17/Aug/2020:22:10:49 +] "POST /subjects/test-output-subject/versions 
HTTP/1.1" 500 61  524 (io.confluent.rest-utils.requests:77)
2020-08-18T00:20:27.5068516Z ##[error]The operation was canceled.
2020-08-18T00:20:27.5085465Z ##[section]Finishing: Run e2e tests
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Removing deprecated methods from DataStream API

2020-08-17 Thread Kostas Kloudas
+1 for removing them.

From a quick look, most of them (not all) have been deprecated a long time ago.

Cheers,
Kostas

On Mon, Aug 17, 2020 at 9:37 PM Dawid Wysakowicz  wrote:
>
> @David Yes, my idea was to remove any use of fold method and all related 
> classes including WindowedStream#fold
>
> @Klou Good idea to also remove the deprecated enableCheckpointing() & 
> StreamExecutionEnvironment#readFile and alike. I did another pass over some 
> of the classes and thought we could also drop:
>
> ExecutionConfig#set/getCodeAnalysisMode
> ExecutionConfig#disable/enableSysoutLogging
> ExecutionConfig#set/isFailTaskOnCheckpointError
> ExecutionConfig#isLatencyTrackingEnabled
>
> As for the `forceCheckpointing` I am not fully convinced about doing it. As far 
> as I know iterations still do not participate in checkpointing correctly. 
> Therefore it still might make sense to force it. In other words there is no 
> real alternative to that method. Unless we only remove the methods from 
> StreamExecutionEnvironment and redirect to the setter in CheckpointConfig. 
> WDYT?
>
> An updated list of methods I suggest to remove:
>
> ExecutionConfig#set/getCodeAnalysisMode (deprecated in 1.9)
> ExecutionConfig#disable/enableSysoutLogging (deprecated in 1.10)
> ExecutionConfig#set/isFailTaskOnCheckpointError (deprecated in 1.9)
> ExecutionConfig#isLatencyTrackingEnabled (deprecated in 1.7)
> ExecutionConfig#(get)/setNumberOfExecutionRetries() (deprecated in 1.1?)
> StreamExecutionEnvironment#readFile,readFileStream(...),socketTextStream(...),socketTextStream(...)
>  (deprecated in 1.2)
> RuntimeContext#getAllAccumulators (deprecated in 0.10)
> DataStream#fold and all related classes and methods such as FoldFunction, 
> FoldingState, FoldingStateDescriptor ... (deprecated in 1.3/1.4)
> StreamExecutionEnvironment#setStateBackend(AbstractStateBackend) (deprecated 
> in 1.5)
> DataStream#split (deprecated in 1.8)
> Methods in (Connected)DataStream that specify keys as either indices or field 
> names such as DataStream#keyBy, DataStream#partitionCustom, 
> ConnectedStream#keyBy,  (deprecated in 1.11)
>
> Bear in mind that the majority of the options listed above in ExecutionConfig 
> take no effect. They were left there purely to satisfy binary 
> compatibility. Personally I don't see any benefit in keeping a method while 
> silently dropping the underlying feature. The only configuration that is 
> respected is setting the number of execution retries.
>
> I also wanted to make it explicit that most of the changes above would result 
> in a binary incompatibility and require additional exclusions in the japicmp. 
> Those are:
>
> ExecutionConfig#set/getCodeAnalysisMode (deprecated in 1.9)
> ExecutionConfig#disable/enableSysoutLogging (deprecated in 1.10)
> ExecutionConfig#set/isFailTaskOnCheckpointError (deprecated in 1.9)
> ExecutionConfig#isLatencyTrackingEnabled (deprecated in 1.7)
> ExecutionConfig#(get)/setNumberOfExecutionRetries() (deprecated in 1.1?)
> DataStream#fold and all related classes and methods such as FoldFunction, 
> FoldingState, FoldingStateDescriptor ... (deprecated in 1.3/1.4)
> DataStream#split (deprecated in 1.8)
> Methods in (Connected)DataStream that specify keys as either indices or field 
> names such as DataStream#keyBy, DataStream#partitionCustom, 
> ConnectedStream#keyBy,  (deprecated in 1.11)
> StreamExecutionEnvironment#readFile,readFileStream(...),socketTextStream(...),socketTextStream(...)
>  (deprecated in 1.2)
>
> Looking forward to more opinions on the issue.
>
> Best,
>
> Dawid
>
>
> On 17/08/2020 12:49, Kostas Kloudas wrote:
>
> Thanks a lot for starting this Dawid,
>
> Big +1 for the proposed clean-up, and I would also add the deprecated
> methods of the StreamExecutionEnvironment like:
>
> enableCheckpointing(long interval, CheckpointingMode mode, boolean force)
> enableCheckpointing()
> isForceCheckpointing()
>
> readFile(FileInputFormat inputFormat,String
> filePath,FileProcessingMode watchType,long interval, FilePathFilter
> filter)
> readFileStream(...)
>
> socketTextStream(String hostname, int port, char delimiter, long maxRetry)
> socketTextStream(String hostname, int port, char delimiter)
>
> There are more, like the (get)/setNumberOfExecutionRetries() that were
> deprecated long ago, but I have not investigated to see if they are
> actually easy to remove.
>
> Cheers,
> Kostas
>
> On Mon, Aug 17, 2020 at 10:53 AM Dawid Wysakowicz
>  wrote:
>
> Hi devs and users,
>
> I wanted to ask you what you think about removing some of the deprecated 
> APIs around the DataStream API.
>
> The APIs I have in mind are:
>
> RuntimeContext#getAllAccumulators (deprecated in 0.10)
> DataStream#fold and all related classes and methods such as FoldFunction, 
> FoldingState, FoldingStateDescriptor ... (deprecated in 1.3/1.4)
> StreamExecutionEnvironment#setStateBackend(AbstractStateBackend) (deprecated 
> in 1.5)
> DataStream#split (deprecated in 1.8)
> Methods in (Connected)DataStream 

Re: [DISCUSS] Removing deprecated methods from DataStream API

2020-08-17 Thread Dawid Wysakowicz
@David Yes, my idea was to remove any use of the fold method and all related
classes, including WindowedStream#fold

@Klou Good idea to also remove the deprecated enableCheckpointing() &
StreamExecutionEnvironment#readFile and alike. I did another pass over
some of the classes and thought we could also drop:

  * ExecutionConfig#set/getCodeAnalysisMode
  * ExecutionConfig#disable/enableSysoutLogging
  * ExecutionConfig#set/isFailTaskOnCheckpointError
  * ExecutionConfig#isLatencyTrackingEnabled

As for the `forceCheckpointing`, I am not fully convinced about doing it. As
far as I know, iterations still do not participate in checkpointing
correctly, so it still might make sense to force it. In other
words, there is no real alternative to that method. Unless we only remove
the methods from StreamExecutionEnvironment and redirect to the setter
in CheckpointConfig. WDYT?

An updated list of methods I suggest to remove:

  * ExecutionConfig#set/getCodeAnalysisMode (deprecated in 1.9)
  * ExecutionConfig#disable/enableSysoutLogging (deprecated in 1.10)
  * ExecutionConfig#set/isFailTaskOnCheckpointError (deprecated in 1.9)
  * ExecutionConfig#isLatencyTrackingEnabled (deprecated in 1.7)
  * ExecutionConfig#(get)/setNumberOfExecutionRetries() (deprecated in 1.1?)
  * 
StreamExecutionEnvironment#readFile,readFileStream(...),socketTextStream(...),socketTextStream(...)
(deprecated in 1.2)
  * RuntimeContext#getAllAccumulators (deprecated in 0.10)
  * DataStream#fold and all related classes and methods such as
FoldFunction, FoldingState, FoldingStateDescriptor ... (deprecated
in 1.3/1.4)
  * StreamExecutionEnvironment#setStateBackend(AbstractStateBackend)
(deprecated in 1.5)
  * DataStream#split (deprecated in 1.8)
  * Methods in (Connected)DataStream that specify keys as either indices
or field names such as DataStream#keyBy, DataStream#partitionCustom,
ConnectedStream#keyBy,  (deprecated in 1.11)
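
For the last item above, a minimal migration sketch — field-name and
index keys become an explicit KeySelector (the `Order` POJO is
illustrative):

```
// Before (deprecated):
//   orders.keyBy("userId");
//   tuples.keyBy(0);

// After: explicit, type-safe key selectors.
KeyedStream<Order, String> byUser = orders.keyBy(o -> o.getUserId());
KeyedStream<Tuple2<String, Integer>, String> byFirstField = tuples.keyBy(t -> t.f0);
```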

Bear in mind that the majority of the options listed above in
ExecutionConfig take no effect. They were left there purely to satisfy
binary compatibility. Personally I don't see any benefit in keeping
a method while silently dropping the underlying feature. The only
configuration that is respected is setting the number of execution retries.

I also wanted to make it explicit that most of the changes above would
result in a binary incompatibility and require additional exclusions in
the japicmp. Those are:

  * ExecutionConfig#set/getCodeAnalysisMode (deprecated in 1.9)
  * ExecutionConfig#disable/enableSysoutLogging (deprecated in 1.10)
  * ExecutionConfig#set/isFailTaskOnCheckpointError (deprecated in 1.9)
  * ExecutionConfig#isLatencyTrackingEnabled (deprecated in 1.7)
  * ExecutionConfig#(get)/setNumberOfExecutionRetries() (deprecated in 1.1?)
  * DataStream#fold and all related classes and methods such as
FoldFunction, FoldingState, FoldingStateDescriptor ... (deprecated
in 1.3/1.4)
  * DataStream#split (deprecated in 1.8)
  * Methods in (Connected)DataStream that specify keys as either indices
or field names such as DataStream#keyBy, DataStream#partitionCustom,
ConnectedStream#keyBy,  (deprecated in 1.11)
  * 
StreamExecutionEnvironment#readFile,readFileStream(...),socketTextStream(...),socketTextStream(...)
(deprecated in 1.2)

Looking forward to more opinions on the issue.

Best,

Dawid


On 17/08/2020 12:49, Kostas Kloudas wrote:
> Thanks a lot for starting this Dawid,
>
> Big +1 for the proposed clean-up, and I would also add the deprecated
> methods of the StreamExecutionEnvironment like:
>
> enableCheckpointing(long interval, CheckpointingMode mode, boolean force)
> enableCheckpointing()
> isForceCheckpointing()
>
> readFile(FileInputFormat inputFormat,String
> filePath,FileProcessingMode watchType,long interval, FilePathFilter
> filter)
> readFileStream(...)
>
> socketTextStream(String hostname, int port, char delimiter, long maxRetry)
> socketTextStream(String hostname, int port, char delimiter)
>
> There are more, like the (get)/setNumberOfExecutionRetries() that were
> deprecated long ago, but I have not investigated to see if they are
> actually easy to remove.
>
> Cheers,
> Kostas
>
> On Mon, Aug 17, 2020 at 10:53 AM Dawid Wysakowicz
>  wrote:
>> Hi devs and users,
>>
>> I wanted to ask you what you think about removing some of the deprecated 
>> APIs around the DataStream API.
>>
>> The APIs I have in mind are:
>>
>> RuntimeContext#getAllAccumulators (deprecated in 0.10)
>> DataStream#fold and all related classes and methods such as FoldFunction, 
>> FoldingState, FoldingStateDescriptor ... (deprecated in 1.3/1.4)
>> StreamExecutionEnvironment#setStateBackend(AbstractStateBackend) (deprecated 
>> in 1.5)
>> DataStream#split (deprecated in 1.8)
>> Methods in (Connected)DataStream that specify keys as either indices or 
>> field names such as DataStream#keyBy, DataStream#partitionCustom, 
>> ConnectedStream#keyBy,  (deprecated in 1.11)
>>
>> I think the 

[jira] [Created] (FLINK-18979) Update statefun docs to better emphasize remote modules

2020-08-17 Thread Seth Wiesman (Jira)
Seth Wiesman created FLINK-18979:


 Summary: Update statefun docs to better emphasize remote modules
 Key: FLINK-18979
 URL: https://issues.apache.org/jira/browse/FLINK-18979
 Project: Flink
  Issue Type: Improvement
  Components: Stateful Functions
Reporter: Seth Wiesman
Assignee: Seth Wiesman






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] FLIP-134: DataStream Semantics for Bounded Input

2020-08-17 Thread Kostas Kloudas
Hi Kurt and David,

Thanks a lot for the insightful feedback!

@Kurt: For the topic of checkpointing with Batch Scheduling, I totally
agree with you that it requires a lot more work and careful thinking
on the semantics. This FLIP was written under the assumption that if
the user wants to have checkpoints on bounded input, he/she will have
to go with STREAMING as the scheduling mode. Checkpointing for BATCH
can be handled as a separate topic in the future.

In the case of MIXED workloads and for this FLIP, the scheduling mode
should be set to STREAMING. That is why the AUTOMATIC option sets
scheduling to BATCH only if all the sources are bounded. I am not sure
what the plans are at the scheduling level, as one could imagine
in the future that in mixed workloads, we schedule first all the
bounded subgraphs in BATCH mode and we allow only one UNBOUNDED
subgraph per application, which is going to be scheduled after all
Bounded ones have finished. Essentially the bounded subgraphs will be
used to bootstrap the unbounded one. But, I am not aware of any plans
towards that direction.


@David: The processing time timer handling is a topic that has also
been discussed in the community in the past, and I do not remember any
final conclusion unfortunately.

In the current context and for bounded input, we chose to favor
reproducibility of the result, as this is expected in batch processing
where the whole input is available in advance. This is why this
proposal suggests to not allow processing time timers. But I
understand your argument that the user may want to be able to run the
same pipeline on batch and streaming; this is why we added the two
options under future work, namely (from the FLIP):

```
Future Work: In the future we may consider adding as options the capability of:
* firing all the registered processing time timers at the end of a job
(at close()) or,
* ignoring all the registered processing time timers at the end of a job.
```

Conceptually, we are essentially saying that batch execution is assumed
to be instantaneous, to refer to a single "point" in time, and that any
processing-time timers for the future may fire at the end of execution
or be ignored (but not throw an exception). I could also see ignoring
the timers in batch as the default, if this makes more sense.

By the way, do you have any use cases in mind that will help us better
shape our processing time timer handling?

Kostas

On Mon, Aug 17, 2020 at 2:52 PM David Anderson  wrote:
>
> Kostas,
>
> I'm pleased to see some concrete details in this FLIP.
>
> I wonder if the current proposal goes far enough in the direction of 
> recognizing the need some users may have for "batch" and "bounded streaming" 
> to be treated differently. If I've understood it correctly, the section on 
> scheduling allows me to choose STREAMING scheduling even if I have bounded 
> sources. I like that approach, because it recognizes that even though I have 
> bounded inputs, I don't necessarily want batch processing semantics. I think 
> it makes sense to extend this idea to processing time support as well.
>
> My thinking is that sometimes in development and testing it's reasonable to 
> run exactly the same job as in production, except with different sources and 
> sinks. While it might be a reasonable default, I'm not convinced that 
> switching a processing time streaming job to read from a bounded source 
> should always cause it to fail.
>
> David
>
> On Wed, Aug 12, 2020 at 5:22 PM Kostas Kloudas  wrote:
>>
>> Hi all,
>>
>> As described in FLIP-131 [1], we are aiming at deprecating the DataSet
>> API in favour of the DataStream API and the Table API. After this work
>> is done, the user will be able to write a program using the DataStream
>> API and this will execute efficiently on both bounded and unbounded
>> data. But before we reach this point, it is worth discussing and
>> agreeing on the semantics of some operations as we transition from the
>> streaming world to the batch one.
>>
>> This thread and the associated FLIP [2] aim at discussing these issues
>> as these topics are pretty important to users and can lead to
>> unpleasant surprises if we do not pay attention.
>>
>> Let's have a healthy discussion here and I will be updating the FLIP
>> accordingly.
>>
>> Cheers,
>> Kostas
>>
>> [1] 
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866741
>> [2] 
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158871522


[jira] [Created] (FLINK-18978) Support full table scan of key and namespace from statebackend

2020-08-17 Thread Seth Wiesman (Jira)
Seth Wiesman created FLINK-18978:


 Summary: Support full table scan of key and namespace from 
statebackend
 Key: FLINK-18978
 URL: https://issues.apache.org/jira/browse/FLINK-18978
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / State Backends
Reporter: Seth Wiesman
Assignee: Seth Wiesman
 Fix For: 1.12.0


Support full table scan of keys and namespaces from the state backend. All 
operations assume the calling code already knows what namespace they are 
interested in interacting with.

This is a prerequisite to support reading window operators with the state 
processor api because window panes are stored as additional namespace 
components. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-18977) Extract WindowOperator construction into a builder class

2020-08-17 Thread Seth Wiesman (Jira)
Seth Wiesman created FLINK-18977:


 Summary:  Extract WindowOperator construction into a builder class
 Key: FLINK-18977
 URL: https://issues.apache.org/jira/browse/FLINK-18977
 Project: Flink
  Issue Type: Improvement
  Components: API / DataStream
Reporter: Seth Wiesman
 Fix For: 1.12.0


Extracts the logic from WindowedStream into a builder class so that there is 
one definitive way to create and configure the window operator. This is a 
pre-requisite to supporting the window operator in the state processor api. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Planning Flink 1.12

2020-08-17 Thread Seth Wiesman
@David Anderson 
I think we can get that into 1.12. I have a branch that just needs to be
cleaned up.

+1 for https://issues.apache.org/jira/browse/FLINK-13095

On Mon, Aug 17, 2020 at 2:03 AM Till Rohrmann  wrote:

> Hi Rodrigo,
>
> FLINK-10407 has not been up to date with the actual progress on the
> feature. We will update it soon.
>
> I think that the community won't actively work on FLINK-12002 in this
> release. However, we will work on FLINK-16430 [1] and once this is done
> continue with the clean up of the legacy scheduler code paths.
>
> As for the scope of the 1.12 release, we hope to finish FLINK-16430 and
> make considerable progress with FLINK-10407. However, I don't think that
> FLINK-10407 will be fully functional yet.
>
> [1] https://issues.apache.org/jira/browse/FLINK-16430
>
> Cheers,
> Till
>
> On Sat, Aug 15, 2020 at 6:43 PM rodrigobrochado <
> rodrigo.broch...@predito.com.br> wrote:
>
> > Thanks Dian!
> >
> > About the wiki page, I think that the "Reactive-scaling mode" by Till has
> > an
> > open issue on FLINK-10407 [1].
> >
> > Still about scaling, what about the adaptive parallelism of jobs [2]? This
> > is somehow related to Prasanna and Harold's comments above. It seems to
> > depend on finishing the old "Redesign Flink Scheduling" [3], FLIP-119
> > (already on the wiki list), and FLINK-15626 (removal of the legacy scheduler) [4].
> > Would they be achievable by 1.12?
> >
> > If I could add one more, the python UDF in docker mode would be awesome
> > [5].
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-10407
> > [2] https://issues.apache.org/jira/browse/FLINK-12002
> > [3] https://issues.apache.org/jira/browse/FLINK-10429
> > [4] https://issues.apache.org/jira/browse/FLINK-15626
> > [5] https://issues.apache.org/jira/browse/FLINK-14025
> >
> > Thanks,
> > Rodrigo
> >
> >
> >
> > --
> > Sent from:
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> >
>


Re: [VOTE] Release 1.10.2, release candidate #2

2020-08-17 Thread Jeff Zhang
+1 (non-binding)

Verified it on Zeppelin, it works well.



Robert Metzger  于2020年8月17日周一 下午9:41写道:

> Thanks a lot for creating another release candidate!
>
> +1 (binding)
>
> - checked diff:
> https://github.com/apache/flink/compare/release-1.10.1...release-1.10.2-rc2
>   - kubernetes client was upgraded from 4.5.2 to 4.9.2 + some shading changes
>   - verified NOTICE file with shade output
> - source compiles
> - source sha is correct
> - checked staging repository: versions set correctly, contents seem in line
> with the changes
>
>
>
>
> On Mon, Aug 17, 2020 at 1:56 PM Zhu Zhu  wrote:
>
> > Hi everyone,
> >
> > Please review and vote on the release candidate #2 for the version
> 1.10.2,
> > as follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release and binary convenience releases to
> be
> > deployed to dist.apache.org [2], which are signed with the key with
> > fingerprint C63E230EFFF519A5BBF2C9AE6767487CD505859C [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "release-1.10.2-rc2" [5],
> > * website pull request listing the new release and adding announcement
> blog
> > post [6].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > Zhu Zhu
> >
> > [1]
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12347791
> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.2-rc2/
> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [4]
> > https://repository.apache.org/content/repositories/orgapacheflink-1395/
> > [5]
> >
> >
> https://github.com/apache/flink/commit/68bb8b612932e479ca03c7ae7c0080818d89c8a1
> > [6] https://github.com/apache/flink-web/pull/366
> >
>


-- 
Best Regards

Jeff Zhang


Re: [VOTE] Release 1.10.2, release candidate #2

2020-08-17 Thread Robert Metzger
Thanks a lot for creating another release candidate!

+1 (binding)

- checked diff:
https://github.com/apache/flink/compare/release-1.10.1...release-1.10.2-rc2
  - kubernetes client was upgraded from 4.5.2 to 4.9.2 + some shading changes
  - verified NOTICE file with shade output
- source compiles
- source sha is correct
- checked staging repository: versions set correctly, contents seem in line
with the changes




On Mon, Aug 17, 2020 at 1:56 PM Zhu Zhu  wrote:

> Hi everyone,
>
> Please review and vote on the release candidate #2 for the version 1.10.2,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint C63E230EFFF519A5BBF2C9AE6767487CD505859C [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "release-1.10.2-rc2" [5],
> * website pull request listing the new release and adding announcement blog
> post [6].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Zhu Zhu
>
> [1]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12347791
> [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.2-rc2/
> [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> [4]
> https://repository.apache.org/content/repositories/orgapacheflink-1395/
> [5]
>
> https://github.com/apache/flink/commit/68bb8b612932e479ca03c7ae7c0080818d89c8a1
> [6] https://github.com/apache/flink-web/pull/366
>


Re: [DISCUSS] FLIP-132: Temporal Table DDL

2020-08-17 Thread Leonard Xu

> But are we still able to track different views of such a
> table through time, as rows are added/deleted to/from the table?

Yes, in fact we support temporal tables from a changelog, which contains all 
possible message types (INSERT/UPDATE/DELETE).

> For
> example, suppose I have an append-only table source with event-time and PK,
> will I be allowed to do an event-time temporal join with this table?
Yes, I listed some examples in the doc; the example versioned_rates3 is exactly 
this case.

Best
Leonard


> 
> On Wed, Aug 12, 2020 at 3:31 PM Leonard Xu <xbjt...@gmail.com> wrote:
> 
>> Hi, all
>> 
>> After a detailed offline discussion about the temporal table related
>> concept and behavior, we had a reliable solution and rejected several
>> alternatives.
>> Compared to rejected alternatives, the proposed approach is a more unified
>> story and is also friendly to users and the current Flink framework.
>> I improved the FLIP[1] with the proposed approach and refactored the
>> document organization to make it clear enough.
>> 
>> Please let me know if you have any concerns; I'm looking forward to your
>> comments.
>> 
>> 
>> Best
>> Leonard
>> 
>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-132+Temporal+Table+DDL
>> 
>>> On Aug 4, 2020, at 21:25, Leonard Xu <xbjt...@gmail.com> wrote:
>>> 
>>> Hi, all
>>> 
>>> I’ve updated the FLIP[1] with the terminology `ChangelogTime`.
>>> 
>>> Best
>>> Leonard
>>> [1]
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-132+Temporal+Table+DDL
>>> 
 On Aug 4, 2020, at 20:58, Leonard Xu <xbjt...@gmail.com> wrote:
 
 Hi, Timo
 
 Thanks for your response.
 
> 1) Naming: Is operation time a good term for this concept? If I read
>> "The operation time is the time when the changes happened in system." or
>> "The system time of DML execution in database", why don't we call it
>> `ChangelogTime` or `SystemTime`? Introducing another terminology of time in
>> Flink should be thought through.
 
 I agree that we should think it through. I have considered the names
>> `ChangelogTime` and `SystemTime` too; I don't have a strong opinion on the
>> name.
 
 I proposed `operationTime` because most changelogs come from databases,
>> and in a database an action is called an `operation` rather than a `change`.
>> The operation time is easier to understand for database users, but it is
>> more of a database terminology.
 
 For `SystemTime`, users may be confused about which system it refers to:
>> Flink, the database, or the CDC tool. Maybe it's not a good name.
 
 `ChangelogTime` is a good choice that is more consistent with the existing
>> terminology `Changelog` and `ChangelogMode`, so let me use `ChangelogTime`
>> and I'll update the FLIP.
 
 
> 2) Exposing it through `org.apache.flink.types.Row`: Shall we also
>> expose the concept of time through the user-level `Row` type? The FLIP does
>> not mention this explicitly. I think we can keep it as an internal concept
>> but I just wanted to ask for clarification.
 
 Yes, I want to keep it as an internal concept. We have discussed that the
>> changelog time should be a third time concept (the other two being
>> event-time and processing-time). It is not easy for normal users to
>> understand the three concepts accurately, and I have not found a big enough
>> scenario where users need to touch the changelog time for now, so I tend
>> not to expose the concept to users.
 
 
 Best,
 Leonard
 
 
> 
> On 04.08.20 04:58, Leonard Xu wrote:
>> Thanks Konstantin,
>> Regarding your questions, I hope my comments have addressed them,
>> and I also added a few explanations in the FLIP.
>> Thank you all for the feedback,
>> It seems everyone involved  in this thread has reached a consensus.
>> I will start a vote thread  later.
>> Best,
>> Leonard
>>> On Aug 3, 2020, at 19:35, godfrey he <godfre...@gmail.com> wrote:
>>> 
>>> Thanks Leonard for driving this FLIP.
>>> Looks good to me.
>>> 
>>> Best,
>>> Godfrey
>>> 
>>> On Mon, Aug 3, 2020 at 12:04 PM, Jark Wu <imj...@gmail.com> wrote:
>>> 
 Thanks Leonard for the great FLIP. I think it is in very good shape.
 +1 to start a vote.
 
 Best,
 

[jira] [Created] (FLINK-18976) Support user defined query in JDBC Source

2020-08-17 Thread Leonard Xu (Jira)
Leonard Xu created FLINK-18976:
--

 Summary: Support user defined query in JDBC Source
 Key: FLINK-18976
 URL: https://issues.apache.org/jira/browse/FLINK-18976
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / JDBC
Reporter: Leonard Xu


Users may want to filter out useless data before loading data from a DB table; 
we could expose a `connector.read.query` option to support this feature in DDL.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] FLIP-132: Temporal Table DDL

2020-08-17 Thread Rui Li
Hey Leonard,

Thanks for summarizing the document. I have one quick question. I
understand a temporal table w/o version means each row in the table only
has one version. But are we still able to track different views of such a
table through time, as rows are added/deleted to/from the table? For
example, suppose I have an append-only table source with event-time and PK,
will I be allowed to do an event-time temporal join with this table?

On Wed, Aug 12, 2020 at 3:31 PM Leonard Xu  wrote:

> Hi, all
>
> After a detailed offline discussion about the temporal table related
> concept and behavior, we had a reliable solution and rejected several
> alternatives.
> Compared to rejected alternatives, the proposed approach is a more unified
> story and is also friendly to users and the current Flink framework.
> I improved the FLIP[1] with the proposed approach and refactored the
> document organization to make it clear enough.
>
> Please let me know if you have any concerns; I'm looking forward to your
> comments.
>
>
> Best
> Leonard
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-132+Temporal+Table+DDL
>
>
> > On Aug 4, 2020, at 21:25, Leonard Xu <xbjt...@gmail.com> wrote:
> >
> > Hi, all
> >
> > I’ve updated the FLIP[1] with the terminology `ChangelogTime`.
> >
> > Best
> > Leonard
> > [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-132+Temporal+Table+DDL
> >
> >> On Aug 4, 2020, at 20:58, Leonard Xu <xbjt...@gmail.com> wrote:
> >>
> >> Hi, Timo
> >>
> >> Thanks for your response.
> >>
> >>> 1) Naming: Is operation time a good term for this concept? If I read
> "The operation time is the time when the changes happened in system." or
> "The system time of DML execution in database", why don't we call it
> `ChangelogTime` or `SystemTime`? Introducing another terminology of time in
> Flink should be thought through.
> >>
> >> I agree that we should think it through. I have considered the names
> `ChangelogTime` and `SystemTime` too; I don't have a strong opinion on the
> name.
> >>
> >> I proposed `operationTime` because most changelogs come from databases,
> and in a database an action is called an `operation` rather than a `change`.
> The operation time is easier to understand for database users,
> but it is more of a database terminology.
> >>
> >> For `SystemTime`, users may be confused about which system
> `SystemTime` refers to: Flink, the database, or the CDC tool. Maybe it's not a
> good name.
> >>
> >> `ChangelogTime` is a good choice that is more consistent with the existing
> terminology `Changelog` and `ChangelogMode`, so let me use `ChangelogTime`
> and I'll update the FLIP.
> >>
> >>
> >>> 2) Exposing it through `org.apache.flink.types.Row`: Shall we also
> expose the concept of time through the user-level `Row` type? The FLIP does
> not mention this explicitly. I think we can keep it as an internal concept
> but I just wanted to ask for clarification.
> >>
> >> Yes, I want to keep it as an internal concept. We have discussed that the
> changelog time should be a third time concept (the other two being
> event-time and processing-time). It is not easy for normal users to
> understand the three concepts accurately, and I have not found a big enough
> scenario where users need to touch the changelog time for now,
> so I tend not to expose the concept to users.
> >>
> >>
> >> Best,
> >> Leonard
> >>
> >>
> >>>
> >>> On 04.08.20 04:58, Leonard Xu wrote:
>  Thanks Konstantin,
>  Regarding your questions, I hope my comments have addressed them,
> and I also added a few explanations in the FLIP.
>  Thank you all for the feedback,
>  It seems everyone involved  in this thread has reached a consensus.
>  I will start a vote thread  later.
>  Best,
>  Leonard
> > On Aug 3, 2020, at 19:35, godfrey he <godfre...@gmail.com> wrote:
> >
> > Thanks Leonard for driving this FLIP.
> > Looks good to me.
> >
> > Best,
> > Godfrey
> >
> > On Mon, Aug 3, 2020 at 12:04 PM, Jark Wu <imj...@gmail.com> wrote:
> >
> >> Thanks Leonard for the great FLIP. I think it is in very good shape.
> >> +1 to start a vote.
> >>
> >> Best,
> >> Jark
> >>
> >> On Fri, 31 Jul 2020 at 17:56, Fabian Hueske wrote:
> >>
> >>> Hi Leonard,
> >>>
> >>> Thanks for this FLIP!
> >>> Looks good from my side.
> >>>
> >>> Cheers, Fabian
> >>>
> >>> On Thu, 30 July 2020 at 22:15, Seth Wiesman <sjwies...@gmail.com> wrote:
> >>>
>  Hi Leonard,
> 
>  Thank you for pushing this, I think the updated syntax looks really good
>  and the semantics make sense to me.
> 
>  +1
> 
>  Seth
> 
>  On Wed, Jul 

Re: [DISCUSS] FLIP-134: DataStream Semantics for Bounded Input

2020-08-17 Thread David Anderson
Kostas,

I'm pleased to see some concrete details in this FLIP.

I wonder if the current proposal goes far enough in the direction of
recognizing the need some users may have for "batch" and "bounded
streaming" to be treated differently. If I've understood it correctly, the
section on scheduling allows me to choose STREAMING scheduling even if I
have bounded sources. I like that approach, because it recognizes that even
though I have bounded inputs, I don't necessarily want batch processing
semantics. I think it makes sense to extend this idea to processing time
support as well.

My thinking is that sometimes in development and testing it's reasonable to
run exactly the same job as in production, except with different sources
and sinks. While it might be a reasonable default, I'm not convinced that
switching a processing time streaming job to read from a bounded source
should always cause it to fail.

David

On Wed, Aug 12, 2020 at 5:22 PM Kostas Kloudas  wrote:

> Hi all,
>
> As described in FLIP-131 [1], we are aiming at deprecating the DataSet
> API in favour of the DataStream API and the Table API. After this work
> is done, the user will be able to write a program using the DataStream
> API and this will execute efficiently on both bounded and unbounded
> data. But before we reach this point, it is worth discussing and
> agreeing on the semantics of some operations as we transition from the
> streaming world to the batch one.
>
> This thread and the associated FLIP [2] aim at discussing these issues
> as these topics are pretty important to users and can lead to
> unpleasant surprises if we do not pay attention.
>
> Let's have a healthy discussion here and I will be updating the FLIP
> accordingly.
>
> Cheers,
> Kostas
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866741
> [2]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158871522
>


[VOTE] Release 1.10.2, release candidate #2

2020-08-17 Thread Zhu Zhu
Hi everyone,

Please review and vote on the release candidate #2 for the version 1.10.2,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint C63E230EFFF519A5BBF2C9AE6767487CD505859C [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "release-1.10.2-rc2" [5],
* website pull request listing the new release and adding announcement blog
post [6].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Zhu Zhu

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12347791
[2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.2-rc2/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] https://repository.apache.org/content/repositories/orgapacheflink-1395/
[5]
https://github.com/apache/flink/commit/68bb8b612932e479ca03c7ae7c0080818d89c8a1
[6] https://github.com/apache/flink-web/pull/366


Re: [DISCUSS] Removing deprecated methods from DataStream API

2020-08-17 Thread David Anderson
I assume that along with DataStream#fold you would also
remove WindowedStream#fold.

I'm in favor of going ahead with all of these.

David
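
For readers following along, a minimal sketch of the usual replacement for
a windowed fold — an AggregateFunction, here keeping a running count (the
`Click` input type is illustrative):

```
stream.keyBy(c -> c.userId)
      .timeWindow(Time.minutes(5))
      .aggregate(new AggregateFunction<Click, Long, Long>() {
          @Override public Long createAccumulator() { return 0L; }
          @Override public Long add(Click value, Long acc) { return acc + 1; }
          @Override public Long getResult(Long acc) { return acc; }
          @Override public Long merge(Long a, Long b) { return a + b; }
      });
```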

On Mon, Aug 17, 2020 at 10:53 AM Dawid Wysakowicz 
wrote:

> Hi devs and users,
>
> I wanted to ask you what you think about removing some of the
> deprecated APIs around the DataStream API.
>
> The APIs I have in mind are:
>
>- RuntimeContext#getAllAccumulators (deprecated in 0.10)
>- DataStream#fold and all related classes and methods such as
>FoldFunction, FoldingState, FoldingStateDescriptor ... (deprecated in
>1.3/1.4)
>- StreamExecutionEnvironment#setStateBackend(AbstractStateBackend)
>(deprecated in 1.5)
>- DataStream#split (deprecated in 1.8)
>- Methods in (Connected)DataStream that specify keys as either indices
>or field names such as DataStream#keyBy, DataStream#partitionCustom,
>ConnectedStream#keyBy,  (deprecated in 1.11)
>
> I think the first three should be straightforward. They are long
> deprecated. The getAccumulators method is not used very often, in my
> opinion. The same applies to DataStream#fold, which additionally is not
> very performant. Lastly, setStateBackend has an alternative taking a class
> from the AbstractStateBackend hierarchy, so it will still be code
> compatible. Moreover, if we remove
> #setStateBackend(AbstractStateBackend) we will get rid of the warnings users
> have right now when setting a state backend, as the correct method cannot be
> used without an explicit cast.
>
> As for the DataStream#split I know there were some objections against
> removing the #split method in the past. I still believe the output tags can
> replace the split method already.
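
To make that concrete, a small sketch of a split-style fan-out expressed
with output tags (the `Event` type and its error predicate are
illustrative):

```
final OutputTag<Event> errors = new OutputTag<Event>("errors") {};

SingleOutputStreamOperator<Event> mainStream = events
    .process(new ProcessFunction<Event, Event>() {
        @Override
        public void processElement(Event e, Context ctx, Collector<Event> out) {
            if (e.isError()) {
                ctx.output(errors, e);   // replaces split()/select("errors")
            } else {
                out.collect(e);          // main output replaces select("normal")
            }
        }
    });

DataStream<Event> errorStream = mainStream.getSideOutput(errors);
```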
>
> The only problem with the last set of methods I propose to remove is that
> they were deprecated only in the last release, and those methods were only
> partially deprecated. Moreover, some of the methods were not deprecated in
> ConnectedStreams. Nevertheless, I'd still be inclined to remove the methods
> in this release.
>
> Let me know what do you think about it.
>
> Best,
>
> Dawid
>


Re: [DISCUSS] Removing deprecated methods from DataStream API

2020-08-17 Thread Kostas Kloudas
Thanks a lot for starting this Dawid,

Big +1 for the proposed clean-up, and I would also add the deprecated
methods of StreamExecutionEnvironment, such as:

enableCheckpointing(long interval, CheckpointingMode mode, boolean force)
enableCheckpointing()
isForceCheckpointing()

readFile(FileInputFormat<OUT> inputFormat, String filePath,
FileProcessingMode watchType, long interval, FilePathFilter filter)
readFileStream(...)

socketTextStream(String hostname, int port, char delimiter, long maxRetry)
socketTextStream(String hostname, int port, char delimiter)

There are more, like (get)/setNumberOfExecutionRetries(), which were
deprecated long ago, but I have not investigated whether they are
actually easy to remove.
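
For reference, a rough sketch of the non-deprecated counterparts as of
1.11 (my illustration; host, port, interval, and retry values are
placeholders):

```
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Replaces enableCheckpointing(long, CheckpointingMode, boolean force):
env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

// Replaces the delimiter/maxRetry variants of socketTextStream:
DataStream<String> lines = env.socketTextStream("localhost", 9999);

// Replaces (get)/setNumberOfExecutionRetries():
env.setRestartStrategy(
    RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));
```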

Cheers,
Kostas



[jira] [Created] (FLINK-18975) Translate the 'Execution Plans' page of "Application Development's Managing Execution" into Chinese

2020-08-17 Thread Roc Marshal (Jira)
Roc Marshal created FLINK-18975:
---

 Summary: Translate the 'Execution Plans' page of "Application 
Development's Managing Execution" into Chinese
 Key: FLINK-18975
 URL: https://issues.apache.org/jira/browse/FLINK-18975
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0
Reporter: Roc Marshal


The page URL is
https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/execution_plans.html

The markdown file is located in flink/docs/dev/execution_plans.zh.md





[jira] [Created] (FLINK-18974) Translate the 'User-Defined Functions' page of "Application Development's DataStream API" into Chinese

2020-08-17 Thread Roc Marshal (Jira)
Roc Marshal created FLINK-18974:
---

 Summary: Translate the 'User-Defined Functions' page of 
"Application Development's DataStream API" into Chinese
 Key: FLINK-18974
 URL: https://issues.apache.org/jira/browse/FLINK-18974
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0
Reporter: Roc Marshal


The page URL is
https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/user_defined_functions.html

The markdown file is located in flink/docs/dev/user_defined_functions.zh.md





[jira] [Created] (FLINK-18973) Translate the 'History Server' page of 'Debugging & Monitoring' into Chinese

2020-08-17 Thread Roc Marshal (Jira)
Roc Marshal created FLINK-18973:
---

 Summary: Translate the 'History Server' page of 'Debugging & 
Monitoring' into Chinese
 Key: FLINK-18973
 URL: https://issues.apache.org/jira/browse/FLINK-18973
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0
Reporter: Roc Marshal


The page URL is
https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/monitoring/historyserver.html


The markdown file is located in flink/docs/monitoring/historyserver.zh.md





[jira] [Created] (FLINK-18972) Unfulfillable slot requests of Blink planner batch jobs never timeout

2020-08-17 Thread Zhu Zhu (Jira)
Zhu Zhu created FLINK-18972:
---

 Summary: Unfulfillable slot requests of Blink planner batch jobs 
never timeout
 Key: FLINK-18972
 URL: https://issues.apache.org/jira/browse/FLINK-18972
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.11.1
Reporter: Zhu Zhu
Assignee: Zhu Zhu
 Fix For: 1.12.0, 1.11.2


The unfulfillability check for batch slot requests is unexpectedly disabled in 
{{SchedulerImpl#start() -> BulkSlotProviderImpl#start()}}.
This means the slot allocation timeout will not be triggered if a Blink planner 
batch job cannot obtain any slots.





[DISCUSS] Removing deprecated methods from DataStream API

2020-08-17 Thread Dawid Wysakowicz
Hi devs and users,

I wanted to ask what you think about removing some of the
deprecated APIs around the DataStream API.

The APIs I have in mind are:

  * RuntimeContext#getAllAccumulators (deprecated in 0.10)
  * DataStream#fold and all related classes and methods such as
FoldFunction, FoldingState, FoldingStateDescriptor ... (deprecated
in 1.3/1.4)
  * StreamExecutionEnvironment#setStateBackend(AbstractStateBackend)
(deprecated in 1.5)
  * DataStream#split (deprecated in 1.8)
  * Methods in (Connected)DataStream that specify keys as either indices
    or field names, such as DataStream#keyBy, DataStream#partitionCustom,
    ConnectedStreams#keyBy, ... (deprecated in 1.11)

I think the first three should be straightforward, as they have been
deprecated for a long time. The getAllAccumulators method is, in my
opinion, not used very often. The same applies to DataStream#fold, which
additionally is not very performant. Lastly, setStateBackend has a
non-deprecated alternative that accepts any class from the
AbstractStateBackend hierarchy, so existing code will still compile.
Moreover, if we remove #setStateBackend(AbstractStateBackend), we will
get rid of the warnings users currently get when setting a state
backend, since the correct method cannot be used without an explicit
cast.
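
To make the overload situation concrete, a minimal sketch (my
illustration; the checkpoint URI is a placeholder):

```
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Today this call resolves to the deprecated
// setStateBackend(AbstractStateBackend) overload and triggers a warning;
// choosing the setStateBackend(StateBackend) overload needs an explicit cast:
env.setStateBackend((StateBackend) new FsStateBackend("file:///tmp/checkpoints"));

// After removing the AbstractStateBackend overload, the same call site
// compiles against setStateBackend(StateBackend) without a cast or warning:
env.setStateBackend(new FsStateBackend("file:///tmp/checkpoints"));
```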

As for DataStream#split, I know there were some objections against
removing it in the past. I still believe that output tags (side
outputs) can already replace the split method.
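
A minimal sketch of that replacement, assuming a DataStream<Integer>
named input (my illustration, not code from any FLIP):

```
// Sketch only: routing even numbers to a side output instead of split().
final OutputTag<Integer> evenTag = new OutputTag<Integer>("even") {};

SingleOutputStreamOperator<Integer> odds = input
    .process(new ProcessFunction<Integer, Integer>() {
        @Override
        public void processElement(Integer value, Context ctx, Collector<Integer> out) {
            if (value % 2 == 0) {
                ctx.output(evenTag, value);  // even numbers: side output
            } else {
                out.collect(value);          // odd numbers: main output
            }
        }
    });

DataStream<Integer> evens = odds.getSideOutput(evenTag);
```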

The only problem with the last set of methods I propose to remove is
that they were deprecated only in the last release, and only partially.
Moreover, some of the methods were not deprecated in ConnectedStreams
at all. Nevertheless, I'd still be inclined to remove these methods in
this release.
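
For the key-specification methods, the replacement is an explicit
KeySelector. A quick sketch, assuming a hypothetical Event POJO with a
getUserId() accessor:

```
// Sketch only: instead of events.keyBy("userId") or tuples.keyBy(0),
// pass a typed KeySelector.
KeyedStream<Event, String> byUser = events.keyBy(e -> e.getUserId());
```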

Let me know what you think.

Best,

Dawid





[jira] [Created] (FLINK-18971) Support to mount Kerberos conf as ConfigMap and keytab as Secret

2020-08-17 Thread Yangze Guo (Jira)
Yangze Guo created FLINK-18971:
--

 Summary: Support to mount Kerberos conf as ConfigMap and keytab as 
Secret
 Key: FLINK-18971
 URL: https://issues.apache.org/jira/browse/FLINK-18971
 Project: Flink
  Issue Type: Sub-task
Reporter: Yangze Guo


Currently, if users want to enable Kerberos authentication, they need to build 
a custom image containing the keytab and the krb5 conf file. To improve 
usability, we should create a ConfigMap for the krb5 conf and a Secret for the 
keytab when needed.





Re: [DISCUSS] Planning Flink 1.12

2020-08-17 Thread Till Rohrmann
Hi Rodrigo,

FLINK-10407 has not been kept up to date with the actual progress on the
feature. We will update it soon.

I think that the community won't actively work on FLINK-12002 in this
release. However, we will work on FLINK-16430 [1] and, once this is done,
continue with the cleanup of the legacy scheduler code paths.

As for the scope of the 1.12 release, we hope to finish FLINK-16430 and
make considerable progress with FLINK-10407. However, I don't think
FLINK-10407 will be fully functional by then.

[1] https://issues.apache.org/jira/browse/FLINK-16430

Cheers,
Till

On Sat, Aug 15, 2020 at 6:43 PM rodrigobrochado <
rodrigo.broch...@predito.com.br> wrote:

> Thanks Dian!
>
> About the wiki page, I think that the "Reactive-scaling mode" by Till
> has an open issue, FLINK-10407 [1].
>
> Still about scaling, what about the adaptive parallelism of jobs [2]? This
> is somewhat related to Prasanna's and Harold's comments above. It seems to
> depend on finishing the old "Redesign Flink Scheduling" effort [3], FLIP-119
> (already on the wiki list), and FLINK-15626 (removal of the legacy
> scheduler) [4]. Would they be achievable by 1.12?
>
> If I could add one more, Python UDF support in Docker mode [5] would be
> awesome.
>
> [1] https://issues.apache.org/jira/browse/FLINK-10407
> [2] https://issues.apache.org/jira/browse/FLINK-12002
> [3] https://issues.apache.org/jira/browse/FLINK-10429
> [4] https://issues.apache.org/jira/browse/FLINK-15626
> [5] https://issues.apache.org/jira/browse/FLINK-14025
>
> Thanks,
> Rodrigo
>
>
>
>


[jira] [Created] (FLINK-18970) Adding JUnit test markers

2020-08-17 Thread goutham (Jira)
goutham created FLINK-18970:
---

 Summary: Adding JUnit test markers
 Key: FLINK-18970
 URL: https://issues.apache.org/jira/browse/FLINK-18970
 Project: Flink
  Issue Type: Improvement
  Components: Test Infrastructure, Tests
Affects Versions: 1.11.1
Reporter: goutham
 Fix For: 1.11.1


I am planning to add test markers so that unit tests and integration tests 
can be run selectively.

Currently, running the complete build locally takes close to 2 hours. With 
markers, developers can run only the unit tests, only the integration tests, 
or both, as needed.

By default, all tests will run.

I am planning to introduce the markers below:

@Tag("IntegrationTest")

@Tag("UnitTest")


