Re: [DISCUSS] Correct the terminology of "Time-windowed Join" to "Interval Join" in Table API & SQL

2019-12-22 Thread Jark Wu
I agree with Jingsong: we are discussing aligning the "concepts", not the
"implementations".

For the "concepts", the "Time-windowed Join" in SQL and "Interval Join" in
DataStream are the same thing.

Best,
Jark

On Mon, 23 Dec 2019 at 15:16, Jingsong Li  wrote:

> Hi Danny,
>
> > DataStream interval join and Table/SQL Time-windowed Join are
> not equivalent
>
> In my opinion, there is no difference between Table and DataStream except
> that outer joins are not implemented in DataStream.
> KeyedStream defines the equi-join conditions.
> Other conditions can be applied in the subsequent IntervalJoined.process.
> And the interval join of DataStream was implemented according to the SQL
> feature [1]; you can see the references in the description.
>
> > why not choose Time-windowed Join
>
> As Jark said, there is a "Window Join" in DataStream, and we may support it
> in the Table API too in the future. It is very easy to confuse it with
> "Time-windowed Join".
> So, in my opinion, "Interval Join" or "Range Join" is the right term to
> describe this kind of join; better not "Time-windowed Join".
>
> [1] https://issues.apache.org/jira/browse/FLINK-8478
>
> Best,
> Jingsong Lee
>


[jira] [Created] (FLINK-15361) ParquetTableSource should pass predicate in projectFields

2019-12-22 Thread Jingsong Lee (Jira)
Jingsong Lee created FLINK-15361:


 Summary: ParquetTableSource should pass predicate in projectFields
 Key: FLINK-15361
 URL: https://issues.apache.org/jira/browse/FLINK-15361
 Project: Flink
  Issue Type: Bug
  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Reporter: Jingsong Lee
 Fix For: 1.9.2, 1.10.0


After projectFields, ParquetTableSource will lose its pushed-down predicates.

Since this only affects performance, the tests did not fail.
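
For illustration, here is a minimal sketch of the kind of fix this implies (this is not the
actual ParquetTableSource code; the class and field names are made up): when projectFields
creates the new source instance, the previously pushed-down predicate has to be carried over.

// Hypothetical, simplified sketch; not the real ParquetTableSource.
// It only illustrates how a pushed-down predicate can get lost during projection push-down.
public class SketchParquetTableSource {

    private final String path;
    private final int[] selectedFields;  // projection pushed down by the planner
    private final Object predicate;      // filter pushed down earlier (e.g. a Parquet FilterPredicate)

    private SketchParquetTableSource(String path, int[] selectedFields, Object predicate) {
        this.path = path;
        this.selectedFields = selectedFields;
        this.predicate = predicate;
    }

    // Called for projection push-down.
    public SketchParquetTableSource projectFields(int[] fields) {
        // Buggy variant: new SketchParquetTableSource(path, fields, null)
        // silently drops the predicate; passing it along keeps the filter push-down effective.
        return new SketchParquetTableSource(path, fields, this.predicate);
    }
}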





Re: [DISCUSS] Correct the terminology of "Time-windowed Join" to "Interval Join" in Table API & SQL

2019-12-22 Thread Jingsong Li
Hi Danny,

> DataStream interval join and Table/SQL Time-windowed Join are
not equivalent

In my opinion, there is no difference between Table and DataStream except
that outer joins are not implemented in DataStream.
KeyedStream defines the equi-join conditions.
Other conditions can be applied in the subsequent IntervalJoined.process.
And the interval join of DataStream was implemented according to the SQL
feature [1]; you can see the references in the description.

> why not choose Time-windowed Join

As Jark said, there is a "Window Join" in DataStream, and we may support it
in the Table API too in the future. It is very easy to confuse it with
"Time-windowed Join".
So, in my opinion, "Interval Join" or "Range Join" is the right term to
describe this kind of join; better not "Time-windowed Join".
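
For concreteness, a minimal DataStream sketch of the above (the Order / Shipment types, their
fields, and the input streams are made up for this example): keyBy expresses the equi-condition,
between() bounds the time on both sides, and any remaining predicates can be applied inside the
ProcessJoinFunction.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

// orders and shipments are assumed to be event-time DataStream<Order> / DataStream<Shipment>.
DataStream<String> joined = orders
    .keyBy(o -> o.orderId)                              // equi-join condition
    .intervalJoin(shipments.keyBy(s -> s.orderId))
    .between(Time.hours(-1), Time.hours(4))             // time bound on both sides
    .process(new ProcessJoinFunction<Order, Shipment, String>() {
        @Override
        public void processElement(Order order, Shipment shipment, Context ctx, Collector<String> out) {
            if (order.amount > 100) {                   // additional non-equi predicate
                out.collect(order.orderId + " -> " + shipment.shipmentId);
            }
        }
    });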

[1] https://issues.apache.org/jira/browse/FLINK-8478

Best,
Jingsong Lee


Re: Flink On K8s, build docker image very slowly, is there some way to make it faster?

2019-12-22 Thread vino yang
Hi Lake,

Can you identify which steps are taking the most time?

Best,
Vino

LakeShen wrote on Mon, Dec 23, 2019 at 2:46 PM:

> Hi community, when I run a Flink task on K8s, the first thing I do is
> build the Flink task jar into a Docker image. I find that building the
> Docker image takes a lot of time.
> Is there some way to make it faster?
> Thank you for your reply.
>


Re: Flink On K8s, build docker image very slowly, is there some way to make it faster?

2019-12-22 Thread Xintong Song
Hi Lake,

Usually building a docker image should not take much time (typically less
than 2 minutes).

It is probably a network issue that makes the image build take so long. Of
course we would need more information (e.g., logs) to confirm that, but in
our experience, pulling the base image (on top of which the Flink-on-K8s
image is built) from DockerHub can take quite some time from mainland China
(where I assume you are, since you're also writing to user-zh).

If this is indeed the case, you can try modifying
"flink-container/docker/Dockerfile" and changing the line "FROM
openjdk:8-jre-alpine" to point to a domestic or local image mirror.

Thank you~

Xintong Song



On Mon, Dec 23, 2019 at 2:46 PM LakeShen  wrote:

> Hi community, when I run a Flink task on K8s, the first thing I do is
> build the Flink task jar into a Docker image. I find that building the
> Docker image takes a lot of time.
> Is there some way to make it faster?
> Thank you for your reply.
>


Flink On K8s, build docker image very slowly, is there some way to make it faster?

2019-12-22 Thread LakeShen
Hi community, when I run a Flink task on K8s, the first thing I do is
build the Flink task jar into a Docker image. I find that building the
Docker image takes a lot of time.
Is there some way to make it faster?
Thank you for your reply.


Re: [DISCUSS] Correct the terminology of "Time-windowed Join" to "Interval Join" in Table API & SQL

2019-12-22 Thread Danny Chan
Thanks Jark for bringing up this discussion. Just looking at the API definitions,
it seems that the Flink DataStream Interval Join and the Table/SQL Time-windowed
Join are not equivalent in their join conditions:

The Interval Join only supports event-time column comparison between the
joined streams [1], while the Time-windowed Join allows columns of any type in
the equi-join part, with a required additional join condition that bounds the
time on both sides.

So, judging from the limitations of the implementations, it seems that the
DataStream Interval Join purely joins streams on time, and its functionality is
a subset of the Time-windowed Join, which bounds the two streams on time first
and then applies more complex join condition predicates.

I'm also inclined to unify the terminology, but just curious: why not choose
Time-windowed Join, since its functionality is more "complete"?

[1] 
https://github.com/apache/flink/blob/cce1cef50d993aba5060ea5ac597174525ae895e/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/datastream/KeyedStream.java#L449
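
For reference, a minimal sketch of such a Time-windowed (Interval) Join in SQL, embedded in the
Java Table API (table and column names are made up; tEnv is assumed to be a
StreamTableEnvironment with Orders and Shipments registered and event-time attributes defined):

import org.apache.flink.table.api.Table;

// Equi-join on an arbitrary column plus a condition that bounds the time on both sides.
Table result = tEnv.sqlQuery(
    "SELECT o.orderId, s.shipTime " +
    "FROM Orders o, Shipments s " +
    "WHERE o.orderId = s.orderId " +
    "  AND o.orderTime BETWEEN s.shipTime - INTERVAL '4' HOUR AND s.shipTime");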

Best,
Danny Chan
On Mon, Dec 23, 2019 at 11:42 AM (+0800), Jark Wu wrote:
> Hi everyone,
>
> Currently, in the Table API & SQL documentation[1], we call joins with
> time conditions "Time-windowed Join". However, the same feature is
> called "Interval Join" in DataStream[2]. We should align the terminology in
> the Flink project.
>
> From my point of view, "Interval Join" is more suitable, because it joins a
> time-interval range of the right stream. A "Windowed Join", in contrast,
> joins data in the same window, as described in the DataStream API[3].
>
> For Table API & SQL, the "Time-windowed Join" is the "Interval Join" in
> DataStream, and the "Windowed Join" feature is missing in Table API & SQL.
>
> I would propose to correct the terminology in the docs before 1.10 is released.
>
> What do you think?
>
> Best,
> Jark
>
> [1]:
> https://ci.apache.org/projects/flink/flink-docs-master/dev/table/tableApi.html#joins
> [2]:
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/operators/joining.html#interval-join
> [3]:
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/operators/joining.html#window-join


Re: [DISCUSS] Correct the terminology of "Time-windowed Join" to "Interval Join" in Table API & SQL

2019-12-22 Thread Jingsong Li
Thanks Jark for bringing this.

+1 to using a unified name, "Interval Join", before 1.10 is released.

I think "Interval Join" may have come from the SQL world too, in [1].
Another candidate is "Range Join", but considering DataStream, I am
OK with "Interval".

[1] https://issues.apache.org/jira/browse/FLINK-8478

Best,
Jingsong Lee

On Mon, Dec 23, 2019 at 11:42 AM Jark Wu  wrote:

> Hi everyone,
>
> Currently, in the Table API & SQL documentation[1], we call joins with
> time conditions "Time-windowed Join". However, the same feature is
> called "Interval Join" in DataStream[2]. We should align the terminology in
> the Flink project.
>
> From my point of view, "Interval Join" is more suitable, because it joins a
> time-interval range of the right stream. A "Windowed Join", in contrast,
> joins data in the same window, as described in the DataStream API[3].
>
> For Table API & SQL, the "Time-windowed Join" is the "Interval Join" in
> DataStream, and the "Windowed Join" feature is missing in Table API & SQL.
>
> I would propose to correct the terminology in the docs before 1.10 is released.
>
> What do you think?
>
> Best,
> Jark
>
> [1]:
>
> https://ci.apache.org/projects/flink/flink-docs-master/dev/table/tableApi.html#joins
> [2]:
>
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/operators/joining.html#interval-join
> [3]:
>
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/operators/joining.html#window-join
>


-- 
Best, Jingsong Lee


[DISCUSS] Correct the terminology of "Time-windowed Join" to "Interval Join" in Table API & SQL

2019-12-22 Thread Jark Wu
Hi everyone,

Currently, in the Table API & SQL documentation[1], we call joins with
time conditions "Time-windowed Join". However, the same feature is
called "Interval Join" in DataStream[2]. We should align the terminology in
the Flink project.

From my point of view, "Interval Join" is more suitable, because it joins a
time-interval range of the right stream. A "Windowed Join", in contrast,
joins data in the same window, as described in the DataStream API[3].

For Table API & SQL, the "Time-windowed Join" is the "Interval Join" in
DataStream, and the "Windowed Join" feature is missing in Table API & SQL.

I would propose to correct the terminology in the docs before 1.10 is released.

What do you think?

Best,
Jark

[1]:
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/tableApi.html#joins
[2]:
https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/operators/joining.html#interval-join
[3]:
https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/operators/joining.html#window-join


Re: [DISCUSS] Some feedback after trying out the new FLIP-49 memory configurations

2019-12-22 Thread Xintong Song
>
> +1 to not have more options for off-heap memory, that would get confusing
> fast. We can keep the name "off-heap" for both task and framework, as long
> as they mean the same thing: native plus direct, and fully counted into
> MaxDirectMemory. I would suggest to update the config descriptions then to
> reflect that.
>
True, this should be explained in the config descriptions.

Which configuration option will be set in Flink's default flink-conf.yaml?
> If we want to maintain the existing behaviour it would have to be the then
> deprecated taskmanager.heap.size config option. If we are ok with Yarn
> requesting slightly larger containers, then it could also
> be taskmanager.memory.total-flink.size.
>
Good point. Currently, we have "taskmanager.memory.total-process.size". In
order to preserve the previous behavior, we need to have
"taskmanager.heap.size" so it can be mapped to different new options in
standalone / active setups.
I think we can have the deprecated "taskmanager.heap.size" in the default
flink-conf.yaml, and also have the new
"taskmanager.memory.total-process.size" in a commented-out line. We can
explain in the comments how the deprecated config option behaves differently,
so that users can switch to the new config options if they want to.

Thank you~

Xintong Song



On Sat, Dec 21, 2019 at 1:00 AM Till Rohrmann  wrote:

> Thanks for the feedback and good discussion everyone. I left some comments
> inline.
>
> On Fri, Dec 20, 2019 at 1:59 PM Stephan Ewen  wrote:
>
>> +1 to not have more options for off-heap memory, that would get confusing
>> fast. We can keep the name "off-heap" for both task and framework, as long
>> as they mean the same thing: native plus direct, and fully counted into
>> MaxDirectMemory. I would suggest to update the config descriptions then to
>> reflect that.
>>
>>
>>
>> On Fri, Dec 20, 2019 at 1:03 PM Xintong Song 
>> wrote:
>>
>>> Regarding the framework/task direct/native memory options, I tend to
>>> think about it differently. I'm in favor of keeping "*.off-heap.size" for the
>>> config option keys.
>>>
>>>- It's not necessary IMO to expose the different concepts of direct
>>>/ native memory to the users.
>>>- I would avoid introducing more options for native memory if
>>>possible. Taking fine-grained resource management and dynamic slot
>>>allocation into consideration, that would also mean introducing more fields
>>>into ResourceSpec / ResourceProfile.
>>>- My gut feeling is that having a relatively loose MaxDirectMemory
>>>should not be a big problem.
>>>   - In most cases, the task / framework off-heap memory should be
>>>   mainly (if not all) direct memory, so the difference between the derived
>>>   MaxDirectMemory and the ideal direct memory limit should not be too much.
>>>   - We do not have a good way to know the exact size needed for JVM
>>>   overhead / metaspace and framework / task off-heap memory, and thus have
>>>   to conservatively reserve slightly more memory than what is actually
>>>   needed. Such reserved-but-unused memory can cover the small
>>>   MaxDirectMemory error.
>>>   - MaxDirectMemory is not the only thing that can trigger a full GC; there
>>>   is still heap activity that can also trigger GC.
>>>
>>> Regarding the memory-type config options, I've looked into the latest
>>> ConfigOptions changes. I think it shouldn't be too complicated to change
>>> the config options to use the memory type, and I can maybe handle it during
>>> your vacations.
>>>
>>>
>>> Also agreed on improving MemorySize logging and parsing. This should
>>> not be a blocker that has to be done in 1.10. I would say we finish the other
>>> work (testability, documentation, and the items discussed in this thread)
>>> first, and get to this only if we have time.
>>>
>>>
>>> Thank you~
>>>
>>> Xintong Song
>>>
>>>
>>>
>>> On Fri, Dec 20, 2019 at 6:07 PM Andrey Zagrebin <
>>> azagrebin.apa...@gmail.com> wrote:
>>>
 Hi Stephan and Xintong,

 Thanks for the further FLIP-49 feedbacks.

   - "taskmanager.memory.size" (old main config option) is replaced by
> "taskmanager.memory.total-process.size" which has a different meaning in
> standalone setups. The old option did not subtract metaspace and other
> overhead, while the new option does. That means that with the default
> config, standalone clusters get quite a bit less memory. (independent of
> managed memory going off heap).
> I am wondering if we could interpret "taskmanager.memory.size" as
> the deprecated key for "taskmanager.memory.total-flink.size". That would be
> in line with the old mechanism (assuming managed memory is set to off heap).
> The effect would be that the container size on Yarn/Mesos
> increases, because from "taskmanager.memory.total-flink.size", we need to
> add overhead and metaspace to reach the total process size, rather than
> cutting off memory. 

Re: [DISCUSS] FLIP-90: Support SQL 2016-2017 JSON functions in Flink SQL

2019-12-22 Thread Jark Wu
Hi Forward,

Thanks for creating the FLIP. +1 to start a vote.

@Hequn Cheng, @Kurt Young, could you help review the design doc too?

Best,
Jark


On Mon, 23 Dec 2019 at 10:10, tison  wrote:

> modified:
>
> https://lists.apache.org/x/thread.html/b3c0265cc2b660fe11ce550b84a831a7606de12908ff7ff0959a4794@%3Cdev.flink.apache.org%3E
>


Re: [DISCUSS] FLIP-90: Support SQL 2016-2017 JSON functions in Flink SQL

2019-12-22 Thread tison
modified:
https://lists.apache.org/x/thread.html/b3c0265cc2b660fe11ce550b84a831a7606de12908ff7ff0959a4794@%3Cdev.flink.apache.org%3E


Re: [DISCUSS] FLIP-90: Support SQL 2016-2017 JSON functions in Flink SQL

2019-12-22 Thread tison
FYI the previous discussion is here[1].


Best,
tison.

[1]
https://lists.apache.org/x/thread.html/e259aa70432e4003e1598e8f8db844813a869a0dd96accfc1b73deb6@%3Cdev.flink.apache.org%3E


Forward Xu wrote on Sat, Dec 21, 2019 at 8:20 AM:

> Hi everybody,
>
>
> I'd like to kick off a discussion on FLIP-90: Support SQL 2016-2017 JSON
> functions in Flink SQL.
>
>
> The proposal is to implement support for the SQL 2016-2017 JSON functions in Flink SQL[1].
>
>
>
> Would love to hear your thoughts.
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-90%3A+Support+SQL+2016-2017+JSON+functions+in+Flink+SQL
>
>
> Best,
>
> ForwardXu
>
> >
>


Re: [DISCUSS] What parts of the Python API should we focus on next ?

2019-12-22 Thread jincheng sun
Hi Bowen,

Your suggestions are very helpful for expanding the PyFlink ecosystem. I
also mentioned notebook integration above; Jupyter and Zeppelin are both
excellent notebooks. Integrating with Jupyter and Zeppelin also requires
support from people in those communities. Currently, Jeff has made great
efforts in the Zeppelin community for PyFlink. I would greatly appreciate it
if anyone active in the Jupyter community were also willing to help
integrate PyFlink.

Best,
Jincheng


Bowen Li wrote on Fri, Dec 20, 2019 at 12:55 AM:

> - integrate PyFlink with Jupyter notebook
>- Description: users should be able to run PyFlink seamlessly in Jupyter
>- Benefits: Jupyter is the industry-standard notebook for data
> scientists. I've talked to a few companies in North America; they think
> Jupyter is the #1 way to empower internal data scientists with Flink
>
>
> On Wed, Dec 18, 2019 at 19:05 jincheng sun 
> wrote:
>
>> Also CC user-zh.
>>
>> Best,
>> Jincheng
>>
>>
jincheng sun wrote on Thu, Dec 19, 2019 at 10:20 AM:
>>
>>> Hi folks,
>>>
>>> As release-1.10 is under feature freeze (stateless Python UDFs are
>>> already supported), it is time for us to plan the PyFlink features for
>>> the next release.
>>>
>>> To make sure the features supported in PyFlink are the ones most demanded
>>> by the community, we'd like to get more people involved, i.e., it would be
>>> better if all of the devs and users joined in the discussion of which
>>> features are more important and urgent.
>>>
>>> We have already listed some features from different aspects, which you
>>> can find below; however, it is not the final plan. We appreciate any
>>> suggestions from the community, on either functionality or
>>> performance improvements, etc. It would be great to have the following
>>> information if you want to suggest adding some features:
>>>
>>> -
>>> - Feature description: 
>>> - Benefits of the feature: 
>>> - Use cases (optional): 
>>> --
>>>
>>> Features in my mind
>>>
>>> 1. Integration with most popular Python libraries
>>> - fromPandas/toPandas API
>>>Description:
>>>   Support to convert between Table and pandas.DataFrame.
>>>Benefits:
>>>   Users could switch between Flink and Pandas API, for example,
>>> do some analysis using Flink and then perform analysis using the Pandas API
>>> if the result data is small and could fit into the memory, and vice versa.
>>>
>>> - Support Scalar Pandas UDF
>>>Description:
>>>   Support scalar Pandas UDF in Python Table API & SQL. Both the
>>> input and output of the UDF is pandas.Series.
>>>Benefits:
>>>   1) Scalar Pandas UDF performs better than row-at-a-time UDF,
>>> ranging from 3x to over 100x (from pyspark)
>>>   2) Users could use Pandas/Numpy API in the Python UDF
>>> implementation if the input/output data type is pandas.Series
>>>
>>> - Support Pandas UDAF in batch GroupBy aggregation
>>>Description:
>>>Support Pandas UDAF in batch GroupBy aggregation of Python
>>> Table API & SQL. Both the input and output of the UDF is pandas.DataFrame.
>>>Benefits:
>>>   1) Pandas UDAF performs more than 10x better than row-at-a-time
>>> UDAF in certain scenarios
>>>   2) Users could use Pandas/Numpy API in the Python UDAF
>>> implementation if the input/output data type is pandas.DataFrame
>>>
>>> 2. Fully support  all kinds of Python UDF
>>> - Support Python UDAF(stateful) in GroupBy aggregation (NOTE: Please
>>> give us some use case if you want this feature to be contained in the next
>>> release)
>>>   Description:
>>> Support UDAF in GroupBy aggregation.
>>>   Benefits:
>>> Users could define and use Python UDAF and use it in GroupBy
>>> aggregation. Without it, users have to use Java/Scala UDAF.
>>>
>>> - Support Python UDTF
>>>   Description:
>>>Support  Python UDTF in Python Table API & SQL
>>>   Benefits:
>>> Users could define and use Python UDTF in Python Table API &
>>> SQL. Without it, users have to use Java/Scala UDTF.
>>>
>>> 3. Debugging and Monitoring of Python UDF
>>>- Support User-Defined Metrics
>>>  Description:
>>>Allow users to define user-defined metrics and global job
>>> parameters with Python UDFs.
>>>  Benefits:
>>>UDF needs metrics to monitor some business or technical
>>> indicators, which is also a requirement for UDFs.
>>>
>>>- Make the log level configurable
>>>  Description:
>>>Allow users to config the log level of Python UDF.
>>>  Benefits:
>>>Users could configure different log levels when debugging and
>>> deploying.
>>>
>>> 4. Enrich the Python execution environment
>>>- Docker Mode Support
>>>  Description:
>>>  Support running python UDF in docker workers.
>>>  Benefits:
>>>  Support various of deployments to meet more users' requirements.
>>>

Re: [DISCUSS] FLIP-27: Refactor Source Interface

2019-12-22 Thread Becket Qin
Hi Steven,

I think the current proposal is what you mentioned - a Kafka source that
can be constructed in either BOUNDED or UNBOUNDED mode. And Flink can get
the boundedness by invoking getBoundedness().

So one can create a Kafka source by doing something like the following:

new KafkaSource().startOffset().endOffset(); // A bounded instance.
new KafkaSource().startOffset(); // An unbounded instance.

If users want to have an UNBOUNDED Kafka source that stops at some point.
They can wrap the BOUNDED Kafka source like below:

SourceUtils.asUnbounded(new KafkaSource().startOffset().endOffset());

The wrapped source would be an unbounded Kafka source that stops at the end
offset.
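
To make the idea concrete, here is a rough sketch; everything below is hypothetical except
Boundedness and getBoundedness(), which come from the FLIP-27 proposal (the constructor shape
and the wrapper class are made up for illustration):

// Hypothetical sketch; only Boundedness / getBoundedness() are part of the FLIP-27 proposal.
class KafkaSourceSketch {
    private final long startOffset;
    private final Long endOffset;  // null means no end offset was configured

    KafkaSourceSketch(long startOffset, Long endOffset) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    // The runtime derives the execution mode from this, not from the source's Java type.
    Boundedness getBoundedness() {
        return endOffset != null ? Boundedness.BOUNDED : Boundedness.CONTINUOUS_UNBOUNDED;
    }
}

// The "asUnbounded" wrapper idea: reuse a source configured with an end offset,
// but report CONTINUOUS_UNBOUNDED so the job runs in streaming mode and simply
// stops once the wrapped source reaches its end offset.
class AsUnboundedSketch {
    private final KafkaSourceSketch delegate;

    AsUnboundedSketch(KafkaSourceSketch delegate) {
        this.delegate = delegate;
    }

    Boundedness getBoundedness() {
        return Boundedness.CONTINUOUS_UNBOUNDED;
    }
}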

Does that make sense?

Thanks,

Jiangjie (Becket) Qin

On Fri, Dec 20, 2019 at 1:31 PM Jark Wu  wrote:

> Hi,
>
> First of all, I think it is not called "UNBOUNDED"; according to
> FLIP-27, it is called "CONTINUOUS_UNBOUNDED".
> And the description of Boundedness in FLIP-27[1] clearly states
> what Becket and I think.
>
> public enum Boundedness {
>
> /**
>  * A bounded source processes the data that is currently available and
> will end after that.
>  *
>  * When a source produces a bounded stream, the runtime may activate
> additional optimizations
>  * that are suitable only for bounded input. Incorrectly producing
> unbounded data when the source
>  * is set to produce a bounded stream will often result in programs
> that do not output any results
>  * and may eventually fail due to runtime errors (out of memory or
> storage).
>  */
> BOUNDED,
>
> /**
>  * A continuous unbounded source continuously processes all data as it
> comes.
>  *
>  * The source may run forever (until the program is terminated) or
> might actually end at some point,
>  * based on some source-specific conditions. Because that is not
> transparent to the runtime,
>  * the runtime will use an execution mode for continuous unbounded
> streams whenever this mode
>  * is chosen.
>  */
> CONTINUOUS_UNBOUNDED
> }
>
> Best,
> Jark
>
> [1]:
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP-27:RefactorSourceInterface-Source
>
>
>
> On Fri, 20 Dec 2019 at 12:55, Steven Wu  wrote:
>
> > Becket,
> >
> > Regarding "UNBOUNDED source that stops at some point", I found it
> difficult
> > to grasp what UNBOUNDED really mean.
> >
> > If we want to use Kafka source with an end/stop time, I guess you call it
> > UNBOUNDED kafka source that stops (aka BOUNDED-streaming). The
> > terminology is a little confusing to me. Maybe BOUNDED/UNBOUNDED
> shouldn't
> > be used to categorize source. Just call it Kafka source and it can run in
> > either BOUNDED or UNBOUNDED mode.
> >
> > Thanks,
> > Steven
> >
> > On Thu, Dec 19, 2019 at 7:02 PM Becket Qin  wrote:
> >
> > > I had an offline chat with Jark, and here are some more thoughts:
> > >
> > > 1. From SQL perspective, BOUNDED source leads to the batch execution
> > mode,
> > > UNBOUNDED source leads to the streaming execution mode.
> > > 2. The semantic of UNBOUNDED source is may or may not stop. The
> semantic
> > of
> > > BOUNDED source is will stop.
> > > 3. The semantic of DataStream is may or may not terminate. The semantic
> > of
> > > BoundedDataStream is will terminate.
> > >
> > > Given that, option 3 seems a better option because:
> > > 1. SQL already has strict binding between Boundedness and execution
> mode.
> > > Letting DataStream be consistent would be good.
> > > 2. The semantic of UNBOUNDED source is exactly the same as DataStream.
> So
> > > we should avoid breaking such semantic, i.e. turning some DataStream
> from
> > > "may or may not terminate" to "will terminate".
> > >
> > > For case where users want BOUNDED-streaming combination, they can
> simply
> > > use an UNBOUNDED source that stops at some point. We can even provide a
> > > simple wrapper to wrap a BOUNDED source as an UNBOUNDED source if that
> > > helps. But API wise, option 3 seems telling a pretty good whole story.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > >
> > >
> > > On Thu, Dec 19, 2019 at 10:30 PM Becket Qin 
> > wrote:
> > >
> > > > Hi Timo,
> > > >
> > > > Bounded is just a special case of unbounded and every bounded source
> > can
> > > >> also be treated as an unbounded source. This would unify the API if
> > > >> people don't need a bounded operation.
> > > >
> > > >
> > > > With option 3 users can still get a unified API with something like
> > > below:
> > > >
> > > > DataStream boundedStream = env.boundedSource(boundedSource);
> > > > DataStream unboundedStream = env.source(unboundedSource);
> > > >
> > > > So in both cases, users can still use a unified DataStream without
> > > > touching the bounded stream only methods.
> > > > By "unify the API if people don't need the bounded operation". Do you
> > > > expect a DataStream with a Bounded source to have the batch 

[ANNOUNCE] Weekly Community Update 2019/51

2019-12-22 Thread Hequn Cheng
Dear community,

Happy to share this week's brief community digest with updates on Flink
1.10 and Flink 1.9.2, a proposal to integrate Flink Docker image
publication into the Flink release process, a discussion on new features of
PyFlink, and a couple of blog posts. Enjoy.

Flink Development
==

* [releases] Kostas Kloudas suggests focusing on documenting
the new features that the community added to release-1.10 during the
feature-freeze phase. He has created an umbrella issue (FLINK-15273) to
monitor the pending documentation tasks. [1]

* [releases] Hequn has started a conversation about the release of Flink
1.9.2. One blocker has been addressed this week, but a new one has been
reported. Considering the ongoing release-1.10 and the community's limited
resources, the 1.9.2 release process is expected to slow down. [2]

* [releases] Patrick proposes integrating Flink Docker image publication
into the Flink release process. There is some discussion on whether to
have a dedicated git repo for the Dockerfiles. [3]

* [sql] The discussion on supporting JSON functions in Flink SQL seems to
have reached an agreement. Jark Wu suggested that Forward Xu start a vote. [4]

* [runtime] Stephan raised a discussion and gave some feedback after
trying out the new FLIP-49 memory configurations. He suggested some
alternatives for config key names and descriptions. The feedback received
many +1s from others. [5]

* [connectors] There are some new updates in the discussion on FLIP-27, the
new source interface. This has been a long-running topic. This week the
discussions focused on the concepts of BOUNDED and UNBOUNDED for the
source. [6]

* [pyflink] Jincheng has started a discussion on which parts of the Python
API we should focus on next. A default feature list is given, but he is
looking forward to hearing more feedback from the community. [7]

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Documentation-tasks-for-release-1-10-td36031.html
[2]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Flink-1-9-2-td36087.html
[3]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36139.html
[4]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Support-JSON-functions-in-Flink-SQL-td32674.html
[5]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Some-feedback-after-trying-out-the-new-FLIP-49-memory-configurations-td36129.html
[6]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-27-Refactor-Source-Interface-td24952.html
[7]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-What-parts-of-the-Python-API-should-we-focus-on-next-td36119.html

Notable Bugs
==

[FLINK-15262] [1.10.0] kafka connector doesn't read from beginning
immediately when 'connector.startup-mode' = 'earliest-offset'. [8]
[FLINK-15300] [1.10.0] Shuffle memory fraction sanity check does not
account for its min/max limit. [9]
[FLINK-15304] [1.11.0] Remove unexpected Hadoop dependency from Flink's
Mesos integration. [10]
[FLINK-15313] [1.10.0] Can not insert decimal with precision into sink
using TypeInformation. [11]
[FLINK-15320] [1.10.0] JobManager crashes in the standalone model when
cancelling job which subtask' status is scheduled. [12]

[8] https://issues.apache.org/jira/browse/FLINK-15262
[9] https://issues.apache.org/jira/browse/FLINK-15300
[10] https://issues.apache.org/jira/browse/FLINK-15304
[11] https://issues.apache.org/jira/browse/FLINK-15313
[12] https://issues.apache.org/jira/browse/FLINK-15320

Events, Blog Posts, Misc
===

* Philip Wilcox has published a blog post about how Bird uses Flink to detect
offline scooters. The post mainly shares experience in solving a set of
tricky problems involving Kafka, event time, watermarks, and
ordering. [13]

* In this blog post, Preetdeep Kumar introduces use cases and best
practices for using Apache Flink to process streaming data. [14]

[13]
https://www.ververica.com/blog/replayable-process-functions-time-ordering-and-timers
[14] https://dzone.com/articles/streaming-etl-with-apache-flink

Cheers,
Hequn


[jira] [Created] (FLINK-15360) Yarn e2e test is broken with building docker image

2019-12-22 Thread Yang Wang (Jira)
Yang Wang created FLINK-15360:
-

 Summary: Yarn e2e test is broken with building docker image
 Key: FLINK-15360
 URL: https://issues.apache.org/jira/browse/FLINK-15360
 Project: Flink
  Issue Type: Bug
Reporter: Yang Wang
 Fix For: 1.10.0


The Yarn e2e test is broken when building the Docker image. This is caused by this change: 
[https://github.com/apache/flink/commit/cce1cef50d993aba5060ea5ac597174525ae895e].

 

The shell function {{retry_times}} does not support passing a command as multiple
parts. For example, {{retry_times 5 0 docker build image}} does not work.

 

cc [~karmagyz]





[jira] [Created] (FLINK-15359) Remove unused YarnConfigOptions

2019-12-22 Thread Zili Chen (Jira)
Zili Chen created FLINK-15359:
-

 Summary: Remove unused YarnConfigOptions
 Key: FLINK-15359
 URL: https://issues.apache.org/jira/browse/FLINK-15359
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration
Reporter: Zili Chen
 Fix For: 1.11.0


There are several unused {{YarnConfigOptions}}. Remove them to prevent 
misunderstanding.


- {{yarn.appmaster.rpc.address}}
- {{yarn.appmaster.rpc.port}}
- {{yarn.maximum-failed-containers}}







[jira] [Created] (FLINK-15358) [configuration] the flink conf yaml file will skip the rest of value after a `#` comment sign

2019-12-22 Thread BlaBlabla (Jira)
BlaBlabla created FLINK-15358:
-

 Summary: [configuration] the flink conf yaml file will skip the rest of the 
value after a `#` comment sign
 Key: FLINK-15358
 URL: https://issues.apache.org/jira/browse/FLINK-15358
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Configuration
Affects Versions: 1.9.1
Reporter: BlaBlabla


Hello,

I have to use the InfluxDB metrics reporter; however, the password contains a # sign,
and Flink skips the rest of the password after the #, e.g.:

      metrics.reporter.influxdb.password: xxpasssxx#blabla

Here, #blabla is parsed as an end-of-line comment.

Can you fix it?


