Thanks for updating! And yes, I think it's ok to include the few helper methods such as "readFromFile" and "print".

I think we can now proceed to a vote! Nice work, overall!

Best,
Aljoscha

On 16.07.20 17:16, Hequn Cheng wrote:
Hi,

Thanks a lot for your discussions.
I think Aljoscha makes good suggestions here! Those problematic APIs should
not be added to the new Python DataStream API.

I have just one item to add, based on the reply from Shuiqiang:
I would also tend to keep the readTextFile() method. Apart from print(),
readTextFile() can also be very helpful and is frequently used for playing
with Flink.
For example, it is used in our WordCount example[1], which is almost the
first Flink program that every beginner runs.
It is more efficient than fromCollection() for reading multi-line data,
and far easier to use than Kafka, Kinesis, RabbitMQ, etc. when playing
with Flink.
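
To make the comparison concrete, here is a minimal plain-Python sketch (not
Flink code) of the same word count fed once from an in-memory collection and
once from a text file. The helper word_count() and the sample text are
assumptions made up for illustration only.

```python
import os
import tempfile
from collections import Counter
from typing import Iterable


def word_count(lines: Iterable[str]) -> Counter:
    """Count lower-cased, whitespace-separated words across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


text = "to be or not to be\nthat is the question\n"

# Collection-style input: every line has to be spelled out in the program.
from_collection = word_count(["to be or not to be", "that is the question"])

# File-style input: lines are read from disk one at a time.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(text)
    path = f.name
try:
    with open(path) as lines:
        from_file = word_count(lines)
finally:
    os.remove(path)
```

Both inputs give the same counts; the file variant simply keeps the data out
of the program text, which is the convenience argued for above.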

What do you think?

Best,
Hequn

[1]
https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java


On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <acqua....@gmail.com> wrote:

Hi Aljoscha,

Thank you for your valuable comments! I agree with you that there is room
to improve the existing API, and that those improvements can be applied to
the Python DataStream API implementation.

I have summarized your comments into the following parts:

1. SingleOutputStreamOperator and DataStreamSource.
Yes, SingleOutputStreamOperator and DataStreamSource are a bit
redundant, so we can unify their APIs into DataStream to make it
clearer.

2. The internal or low-level methods.
  - DataStream.get_id(): Has been removed from the FLIP wiki page.
  - DataStream.partition_custom(): Has been removed from the FLIP wiki page.
  - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: Have been
removed from the FLIP wiki page.
Sorry for mistakenly making those internal methods public; we will not
expose them to users in the Python API.

3. "Declarative" APIs.
- KeyedStream.sum/min/max/min_by/max_by: Have been removed from the FLIP
wiki page. They are well covered by the Table API.

4. Spelling problems.
- StreamExecutionEnvironment.from_collections. Should be from_collection().
- StreamExecutionEnvironment.generate_sequenece. Should be
generate_sequence().
Sorry for the spelling errors.

5. Predefined source and sink.
As you said, most of the predefined sources are not suitable for
production, so we can leave them out of the new Python DataStream API.
One exception: I think we should still add print(), since it is commonly
used and very useful for debugging jobs. We can document in the API
comments that it should never be used in production.
Meanwhile, as you mentioned, a good alternative that always prints on the
client should also be supported. For this case, maybe we can add a
collect method that returns an Iterator. With the iterator, users can print
the content on the client. This is also consistent with the behavior of
the Table API.

6. For Row.
Do you mean that we should not expose the Row type in the Python API?
Maybe I haven't fully understood your concern.
We can use the tuple type in the Python DataStream API to support Row. (I
have updated the example section of the FLIP to reflect the design.)
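
The collect idea in point 5 can be sketched in plain Python as follows. This
is only a toy model, not PyFlink code: ToyDataStream and its collect() method
are hypothetical names used for illustration, and a real implementation would
fetch results from the cluster rather than from a local list.

```python
from typing import Callable, Iterable, Iterator, List


class ToyDataStream:
    """Hypothetical stand-in for a DataStream; holds its elements locally."""

    def __init__(self, elements: Iterable):
        self._elements: List = list(elements)

    def map(self, func: Callable) -> "ToyDataStream":
        # Apply func to every element and wrap the result in a new stream.
        return ToyDataStream(func(e) for e in self._elements)

    def collect(self) -> Iterator:
        # Hand the results back to the caller (the "client") as an iterator
        # instead of printing them wherever the operator happens to run.
        return iter(self._elements)


stream = ToyDataStream([1, 2, 3]).map(lambda x: x * 2)
results = list(stream.collect())
for value in results:
    print(value)  # the caller decides where and how to print
```

The design point is that printing becomes the caller's job, so the output
always appears where the program was submitted.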

Thanks again for your suggestions. Looking forward to your
feedback.

Best,
Shuiqiang

On Wed, Jul 15, 2020 at 5:58 PM Aljoscha Krettek <aljos...@apache.org> wrote:

Hi,

thanks for the proposal! I have some comments about the API. We should
not blindly copy the existing Java DataStream API, because we made some
mistakes with it and we now have a chance to fix them rather than carry
them over to a new API.

I don't think we need SingleOutputStreamOperator. In the Scala API we
just have DataStream, and the relevant methods from
SingleOutputStreamOperator are added to DataStream. Having this extra type
is more confusing than helpful to users, I think. In the same vein, I
think we also don't need DataStreamSource. The source methods can also
just return a DataStream.
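
As a rough illustration of the single-type design described above, here is a
toy Python sketch. ToyStream and ToyEnvironment are hypothetical names, not
PyFlink classes; the only point is that source methods and transformations
all return the same type, and per-operator settings live on that type.

```python
from typing import Callable, Iterable, List


class ToyStream:
    """Hypothetical single stream type; no source/operator subclasses."""

    def __init__(self, elements: Iterable, name: str = "unnamed",
                 parallelism: int = 1):
        self.elements: List = list(elements)
        self.op_name = name
        self.parallelism = parallelism

    def map(self, func: Callable) -> "ToyStream":
        # A transformation returns the same type it was called on.
        return ToyStream([func(e) for e in self.elements], "map",
                         self.parallelism)

    # Per-operator settings sit directly on the one stream type.
    def name(self, value: str) -> "ToyStream":
        self.op_name = value
        return self

    def set_parallelism(self, value: int) -> "ToyStream":
        self.parallelism = value
        return self


class ToyEnvironment:
    """Hypothetical environment whose sources return a plain ToyStream."""

    def from_collection(self, elements: Iterable) -> ToyStream:
        return ToyStream(elements, name="source")


env = ToyEnvironment()
source = env.from_collection(["a", "b"])
upper = source.map(str.upper).name("to-upper").set_parallelism(2)
```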

There are some methods that I would consider internal and we shouldn't
expose them:
  - DataStream.get_id(): this is an internal method
  - DataStream.partition_custom(): I think adding this method was a
mistake because it's too low-level, but I could be convinced otherwise
  - DataStream.print()/DataStream.print_to_error(): These are questionable
because they print to the TaskManager log. Maybe we could add a good
alternative that always prints on the client, similar to the Table API
  - DataStream.write_to_socket(): It was a mistake to add this sink on
DataStream; it is not fault-tolerant and shouldn't be used in production
  - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table API should
be used for "declarative" use cases and I think these methods should not
be in the DataStream API
  - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these are
internal methods

  - StreamExecutionEnvironment.from_parallel_collection(): I think the
usability is questionable
  - StreamExecutionEnvironment.from_collections -> should be called
from_collection
  - StreamExecutionEnvironment.generate_sequenece -> should be called
generate_sequence

I think most of the predefined sources are questionable:
  - fromParallelCollection: I don't know if this is useful
  - readTextFile: most of the variants are not useful/fault-tolerant
  - readFile: same
  - socketTextStream: also not useful except for toy examples
  - createInput: also not useful, and it's legacy DataSet InputFormats

I think we need to think hard about whether we want to further expose Row
in our APIs. I think adding it to flink-core was more an accident than
anything else, but I can see that it would be useful for Python/Java
interop.

Best,
Aljoscha


On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
Thanks for bringing up this discussion, Shuiqiang!

+1 for the proposal!

Best,
Jincheng


On Thu, Jul 9, 2020 at 10:41 AM Xingbo Huang <hxbks...@gmail.com> wrote:

Hi Shuiqiang,

Thanks a lot for driving this discussion.
Big +1 for supporting Python DataStream.
In many ML scenarios, operating on objects is more natural than
operating on tables.

Best,
Xingbo

On Thu, Jul 9, 2020 at 10:35 AM Wei Zhong <weizhong0...@gmail.com> wrote:

Hi Shuiqiang,

Thanks for driving this. Big +1 for supporting the DataStream API in
PyFlink!

Best,
Wei


On Jul 9, 2020, at 10:29, Hequn Cheng <he...@apache.org> wrote:

+1 for adding the Python DataStream API and starting with the stateless
part.
Some users have already expressed their wish to have the Python
DataStream APIs. Once we have the APIs in PyFlink, we can cover more use
cases for our users.

Best, Hequn

On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <acqua....@gmail.com> wrote:

Sorry, the 3rd link is broken; please refer to this one: Support Python
DataStream API
<https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit>


On Wed, Jul 8, 2020 at 11:13 AM Shuiqiang Chen <acqua....@gmail.com> wrote:

Hi everyone,

As we all know, Flink provides three layered APIs: the ProcessFunctions,
the DataStream API and the SQL & Table API. Each API offers a different
trade-off between conciseness and expressiveness and targets different
use cases[1].

Currently, the SQL & Table API is already supported in PyFlink. The API
provides relational operations as well as user-defined functions, which
is convenient for users who are familiar with Python and relational
programming.

Meanwhile, the DataStream API and ProcessFunctions provide more generic
APIs for implementing stream processing applications. The
ProcessFunctions expose time and state, which are the fundamental
building blocks of any kind of streaming application.
To cover more use cases, we are planning to support all of these APIs in
PyFlink.

In this discussion (FLIP-130), we propose to support the Python
DataStream API for the stateless part. For more detail, please refer to
the FLIP wiki page here[2]. If you are interested in the stateful part,
you can also take a look at the design doc here[3], which we are going to
discuss in a separate FLIP.

Any comments will be highly appreciated!

[1]
https://flink.apache.org/flink-applications.html#layered-apis
[2]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
[3]
https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing

Best,
Shuiqiang