Hi,

I'm also not against adding that if it enables actual use cases. I don't think we need to spell out the whole API in the FLIP, though. We can add things as they come up.

Best,
Aljoscha

On 24.07.20 14:43, Shuiqiang Chen wrote:
Hi David,

Thank you for your reply! I have started the vote for this FLIP, but we can
keep the discussion on this thread.
In my perspective, I would not be against adding DataStream.partitionCustom to the Python DataStream API. However, more input is welcome.

Best,
Shuiqiang



On Fri, Jul 24, 2020 at 7:52 PM David Anderson <da...@alpinegizmo.com> wrote:

Sorry I'm coming to this rather late, but I would like to argue that DataStream.partitionCustom enables an important use case. What I have in mind is performing partitioned enrichment, where each instance can preload a slice of a static dataset that is being used for enrichment.

For an example, consider:
https://github.com/knaufk/enrichments-with-flink/blob/master/src/main/java/com/github/knaufk/enrichments/CustomPartitionEnrichmenttJob.java
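To make the routing idea concrete, here is a plain-Python sketch of partitioned enrichment (no Flink involved; `partition`, `EnrichmentInstance`, and the reference data are all made-up names for illustration, not PyFlink API):

```python
# Sketch of partitioned enrichment: route each event to the parallel
# instance that preloaded the matching slice of a static dataset.
# All names here are illustrative, not part of any Flink API.

PARALLELISM = 3

def partition(key: str, num_partitions: int) -> int:
    """A custom partitioner: a deterministic key -> channel mapping."""
    return hash(key) % num_partitions

class EnrichmentInstance:
    """One parallel instance; preloads only its slice of reference data."""
    def __init__(self, index: int, reference_data: dict):
        # Keep only the entries this instance is responsible for.
        self.slice = {
            k: v for k, v in reference_data.items()
            if partition(k, PARALLELISM) == index
        }

    def enrich(self, event: dict) -> dict:
        return {**event, "info": self.slice.get(event["key"], "unknown")}

reference = {"a": "alpha", "b": "beta", "c": "gamma"}
instances = [EnrichmentInstance(i, reference) for i in range(PARALLELISM)]

events = [{"key": "a"}, {"key": "b"}, {"key": "x"}]
enriched = [
    instances[partition(e["key"], PARALLELISM)].enrich(e) for e in events
]
```

The point is that the same partitioner decides both which instance loads a reference entry and which instance receives an event with that key, so each event always finds its slice.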

Regards,
David

On Fri, Jul 24, 2020 at 12:18 PM Shuiqiang Chen <acqua....@gmail.com> wrote:

Hi Aljoscha, thank you for your response. I'll keep these two helper methods in the Python DataStream implementation.

And thank you all for joining in the discussion. It seems that we have
reached a consensus. I will start a vote for this FLIP later today.

Best,
Shuiqiang

On Fri, Jul 24, 2020 at 5:29 PM Hequn Cheng <he...@apache.org> wrote:

Thanks a lot for your valuable feedback and suggestions, @Aljoscha Krettek <aljos...@apache.org>!
+1 to the vote.

Best,
Hequn

On Fri, Jul 24, 2020 at 5:16 PM Aljoscha Krettek <aljos...@apache.org>
wrote:

Thanks for updating! And yes, I think it's ok to include the few helper methods such as "readFromFile" and "print".

I think we can now proceed to a vote! Nice work, overall!

Best,
Aljoscha

On 16.07.20 17:16, Hequn Cheng wrote:
Hi,

Thanks a lot for your discussions.
I think Aljoscha makes good suggestions here! Those problematic APIs should not be added to the new Python DataStream API.

Only one item I want to add based on the reply from Shuiqiang: I would also tend to keep the readTextFile() method. Apart from print(), readTextFile() may also be very helpful and is frequently used for playing with Flink. For example, it is used in our WordCount example[1], which is almost the first Flink program that every beginner runs. It is more efficient for reading multi-line data than fromCollection(), while being far easier to use than Kafka, Kinesis, RabbitMQ, etc., when playing with Flink.
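To make the use case concrete, the word-count shape that readTextFile() feeds can be sketched in plain Python (no Flink here; a temp file stands in for the text source):

```python
import tempfile
from collections import Counter

# Plain-Python sketch of the word-count pattern: read a multi-line text
# source, split into words, count. In a real Flink job, readTextFile()
# would supply the lines; a temp file stands in for it here.
text = "to be or not to be\nthat is the question\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(text)
    path = f.name

counts = Counter()
with open(path) as lines:
    for line in lines:
        for word in line.split():
            counts[word] += 1
```

This is exactly the kind of small multi-line input that would be awkward to express with fromCollection() and overkill to feed through Kafka.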

What do you think?

Best,
Hequn

[1]
https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java

On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <acqua....@gmail.com> wrote:

Hi Aljoscha,

Thank you for your valuable comments! I agree with you that there is some room for optimization in the existing API, and that it can be applied to the Python DataStream API implementation.

According to your comments, I have summarized them into the following parts:

1. SingleOutputStreamOperator and DataStreamSource.
Yes, SingleOutputStreamOperator and DataStreamSource are a bit redundant, so we can unify their APIs into DataStream to make it clearer.

2. The internal or low-level methods.
   - DataStream.get_id(): Has been removed in the FLIP wiki page.
   - DataStream.partition_custom(): Has been removed in the FLIP wiki page.
   - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: Has been removed in the FLIP wiki page.
Sorry for mistakenly making those internal methods public; we will not expose them to users in the Python API.

3. "Declarative" APIs.
- KeyedStream.sum/min/max/min_by/max_by: Have been removed in the FLIP wiki page. They can be well covered by the Table API.

4. Spelling problems.
- StreamExecutionEnvironment.from_collections. Should be from_collection().
- StreamExecutionEnvironment.generate_sequenece. Should be generate_sequence().
Sorry for the spelling errors.

5. Predefined sources and sinks.
As you said, most of the predefined sources are not suitable for production, so we can leave them out of the new Python DataStream API. There is one exception: I think we should add print(), since it is commonly used and very useful for debugging jobs. We can add a comment on the API stating that it should never be used in production. Meanwhile, as you mentioned, a good alternative that always prints on the client should also be supported. For this case, maybe we can add a collect() method that returns an Iterator. With the iterator, users can print the content on the client. This is also consistent with the behavior in the Table API.
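The client-side behavior I have in mind could look roughly like this in plain Python (a sketch only; `collect` is a stand-in name, and the actual mechanism for streaming results back to the client is not specified here):

```python
from typing import Iterable, Iterator

def collect(stream: Iterable) -> Iterator:
    """Sketch of a client-side collect(): lazily drain a (bounded) result
    stream, yielding each element to the caller instead of printing in
    the TaskManager log. Stand-in name, not the final API."""
    for element in stream:
        yield element

# The client iterates the returned iterator and prints locally:
results = collect(range(3))          # range(3) stands in for a job's results
printed = [f"row={r}" for r in results]
```

Because the function is a generator, the client pulls results one at a time rather than materializing the whole stream up front.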

6. For Row.
Do you mean that we should not expose the Row type in the Python API? Maybe I haven't fully understood your concerns. We can use the tuple type in the Python DataStream API to support Row. (I have updated the example section of the FLIP to reflect the design.)
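Representing a Row as a plain Python tuple could be as simple as the following sketch (the schema and field names are invented for illustration; PyFlink's eventual Row handling may differ):

```python
from collections import namedtuple

# Sketch: a Row with a fixed schema represented as a named tuple.
# Field names here are made up for illustration only.
Row = namedtuple("Row", ["name", "age"])

row = Row("Alice", 30)
as_plain_tuple = tuple(row)   # what would cross the Python/Java boundary
by_position = row[0]          # positional access, like Java's Row.getField(0)
by_name = row.age             # readable field access on the Python side
```

A namedtuple stays a real tuple, so positional interop is preserved while Python users still get named field access.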

Thanks again for your suggestions. Looking forward to your feedback.

Best,
Shuiqiang

On Wed, Jul 15, 2020 at 5:58 PM Aljoscha Krettek <aljos...@apache.org> wrote:

Hi,

thanks for the proposal! I have some comments about the API. We should not blindly copy the existing Java DataStream, because we made some mistakes with that, and we now have a chance to fix them rather than forward them to a new API.

I don't think we need SingleOutputStreamOperator; in the Scala API we just have DataStream, and the relevant methods from SingleOutputStreamOperator are added to DataStream. Having this extra type is more confusing than helpful to users, I think. In the same vein, I think we also don't need DataStreamSource. The source methods can also just return a DataStream.

There are some methods that I would consider internal, and we shouldn't expose them:
   - DataStream.get_id(): this is an internal method
   - DataStream.partition_custom(): I think adding this method was a mistake because it's too low-level, but I could be convinced otherwise
   - DataStream.print()/DataStream.print_to_error(): These are questionable because they print to the TaskManager log. Maybe we could add a good alternative that always prints on the client, similar to the Table API
   - DataStream.write_to_socket(): It was a mistake to add this sink on DataStream; it is not fault-tolerant and shouldn't be used in production

   - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table API should be used for "declarative" use cases, and I think these methods should not be in the DataStream API
   - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these are internal methods

   - StreamExecutionEnvironment.from_parallel_collection(): I think the usability is questionable
   - StreamExecutionEnvironment.from_collections -> should be called from_collection
   - StreamExecutionEnvironment.generate_sequenece -> should be called generate_sequence

I think most of the predefined sources are questionable:
   - fromParallelCollection: I don't know if this is useful
   - readTextFile: most of the variants are not useful/fault-tolerant
   - readFile: same
   - socketTextStream: also not useful except for toy examples
   - createInput: also not useful, and it's legacy DataSet InputFormats

I think we need to think hard about whether we want to further expose Row in our APIs. I think adding it to flink-core was more an accident than anything else, but I can see that it would be useful for Python/Java interop.

Best,
Aljoscha


On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
Thanks for bringing up this DISCUSS, Shuiqiang!

+1 for the proposal!

Best,
Jincheng


On Thu, Jul 9, 2020 at 10:41 AM Xingbo Huang <hxbks...@gmail.com> wrote:

Hi Shuiqiang,

Thanks a lot for driving this discussion.
Big +1 for supporting the Python DataStream API.
In many ML scenarios, operating on objects will be more natural than operating on Table.

Best,
Xingbo

On Thu, Jul 9, 2020 at 10:35 AM Wei Zhong <weizhong0...@gmail.com> wrote:

Hi Shuiqiang,

Thanks for driving this. Big +1 for supporting the DataStream API in PyFlink!

Best,
Wei


On Jul 9, 2020, at 10:29, Hequn Cheng <he...@apache.org> wrote:

+1 for adding the Python DataStream API and starting with the stateless part.
There are already some users that expressed their wish to have the Python DataStream APIs. Once we have the APIs in PyFlink, we can cover more use cases for our users.

Best, Hequn

On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <acqua....@gmail.com> wrote:

Sorry, the 3rd link is broken, please refer to this one: Support Python DataStream API
<https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit>
On Wed, Jul 8, 2020 at 11:13 AM Shuiqiang Chen <acqua....@gmail.com> wrote:

Hi everyone,

As we all know, Flink provides three layered APIs: the ProcessFunctions, the DataStream API and the SQL & Table API. Each API offers a different trade-off between conciseness and expressiveness and targets different use cases[1].

Currently, the SQL & Table API has already been supported in PyFlink. The API provides relational operations as well as user-defined functions to provide convenience for users who are familiar with Python and relational programming.

Meanwhile, the DataStream API and ProcessFunctions provide more generic APIs to implement stream processing applications. The ProcessFunctions expose time and state, which are the fundamental building blocks for any kind of streaming application. To cover more use cases, we are planning to cover all these APIs in PyFlink.

In this discussion (FLIP-130), we propose to support the Python DataStream API for the stateless part. For more detail, please refer to the FLIP wiki page here[2]. If you are interested in the stateful part, you can also take a look at the design doc here[3], which we are going to discuss in a separate FLIP.

Any comments will be highly appreciated!

[1]
https://flink.apache.org/flink-applications.html#layered-apis
[2]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
[3]
https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing

Best,
Shuiqiang