Hi,

It's a good idea to start with a minimal API surface and add methods once we find they are truly useful. From my side, I'm also ok with the partitionCustom() method. Thanks David for your feedback!
Best,
Hequn

On Mon, Jul 27, 2020 at 8:57 PM Aljoscha Krettek <aljos...@apache.org> wrote:
> Hi,
>
> I'm also not against adding that if it enables actual use cases. I don't
> think we need to spell out the whole API in the FLIP, though. We can add
> things as they come up.
>
> Best,
> Aljoscha
>
> On 24.07.20 14:43, Shuiqiang Chen wrote:
> > Hi David,
> >
> > Thank you for your reply! I have started the vote for this FLIP, but we
> > can keep the discussion on this thread.
> > From my perspective, I would not be against adding
> > DataStream.partitionCustom to the Python DataStream API. However, more
> > input is welcome.
> >
> > Best,
> > Shuiqiang
> >
> > David Anderson <da...@alpinegizmo.com> wrote on Fri, Jul 24, 2020 at 7:52 PM:
> >
> >> Sorry I'm coming to this rather late, but I would like to argue that
> >> DataStream.partitionCustom enables an important use case.
> >> What I have in mind is performing partitioned enrichment, where each
> >> instance can preload a slice of a static dataset that is being used for
> >> enrichment.
> >>
> >> For an example, consider
> >> https://github.com/knaufk/enrichments-with-flink/blob/master/src/main/java/com/github/knaufk/enrichments/CustomPartitionEnrichmenttJob.java
> >>
> >> Regards,
> >> David
> >>
> >> On Fri, Jul 24, 2020 at 12:18 PM Shuiqiang Chen <acqua....@gmail.com> wrote:
> >>
> >>> Hi Aljoscha,
> >>>
> >>> Thank you for your response. I'll keep these two helper
> >>> methods in the Python DataStream implementation.
> >>>
> >>> And thank you all for joining the discussion. It seems that we have
> >>> reached a consensus. I will start a vote for this FLIP later today.
> >>>
> >>> Best,
> >>> Shuiqiang
> >>>
> >>> Hequn Cheng <he...@apache.org> wrote on Fri, Jul 24, 2020 at 5:29 PM:
> >>>
> >>>> Thanks a lot for your valuable feedback and suggestions! @Aljoscha
> >>>> Krettek <aljos...@apache.org>
> >>>> +1 to the vote.
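The partitioned-enrichment use case David describes can be sketched in plain Python. This is only an illustration of the semantics a custom partitioner would provide (records and the matching slice of a static dataset land on the same parallel instance); the names `partitioner`, `preload_slice`, and `enrich` are hypothetical and not part of Flink or PyFlink.

```python
# Plain-Python sketch of partitioned enrichment (semantics only, not PyFlink).

def partitioner(key, num_partitions):
    """Route every record with the same key to the same parallel instance."""
    return hash(key) % num_partitions

def preload_slice(static_dataset, partition, num_partitions):
    """Each instance preloads only the slice of the static dataset it will see."""
    return {k: v for k, v in static_dataset.items()
            if partitioner(k, num_partitions) == partition}

def enrich(records, static_dataset, num_partitions):
    """Simulate num_partitions parallel instances, each enriching its own partition."""
    slices = [preload_slice(static_dataset, p, num_partitions)
              for p in range(num_partitions)]
    out = []
    for key, value in records:
        p = partitioner(key, num_partitions)
        # The lookup stays local to instance p; no instance holds the full dataset.
        out.append((key, value, slices[p].get(key)))
    return out
```

Because the same partitioner routes both the static slice and the live records, each lookup is guaranteed to find its reference data locally, which is the point of exposing partitionCustom for this pattern.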
> >>>>
> >>>> Best,
> >>>> Hequn
> >>>>
> >>>> On Fri, Jul 24, 2020 at 5:16 PM Aljoscha Krettek <aljos...@apache.org> wrote:
> >>>>
> >>>>> Thanks for updating! And yes, I think it's ok to include the few
> >>>>> helper methods such as "readFromFile" and "print".
> >>>>>
> >>>>> I think we can now proceed to a vote! Nice work, overall!
> >>>>>
> >>>>> Best,
> >>>>> Aljoscha
> >>>>>
> >>>>> On 16.07.20 17:16, Hequn Cheng wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Thanks a lot for your discussions.
> >>>>>> I think Aljoscha makes good suggestions here! Those problematic APIs
> >>>>>> should not be added to the new Python DataStream API.
> >>>>>>
> >>>>>> Only one item I want to add based on the reply from Shuiqiang:
> >>>>>> I would also tend to keep the readTextFile() method. Apart from
> >>>>>> print(), readTextFile() may also be very helpful and frequently used
> >>>>>> for playing with Flink.
> >>>>>> For example, it is used in our WordCount example[1], which is almost
> >>>>>> the first Flink program that every beginner runs.
> >>>>>> It is more efficient for reading multi-line data than
> >>>>>> fromCollection(), while being far easier to use than Kafka, Kinesis,
> >>>>>> RabbitMQ, etc., when playing with Flink.
> >>>>>>
> >>>>>> What do you think?
> >>>>>>
> >>>>>> Best,
> >>>>>> Hequn
> >>>>>>
> >>>>>> [1]
> >>>>>> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java
> >>>>>>
> >>>>>> On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <acqua....@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi Aljoscha,
> >>>>>>>
> >>>>>>> Thank you for your valuable comments! I agree with you that there is
> >>>>>>> some room for optimization in the existing API, and that this can be
> >>>>>>> applied to the Python DataStream API implementation.
> >>>>>>>
> >>>>>>> Based on your comments, I have grouped them into the following parts:
> >>>>>>>
> >>>>>>> 1. SingleOutputStreamOperator and DataStreamSource.
> >>>>>>> Yes, SingleOutputStreamOperator and DataStreamSource are a bit
> >>>>>>> redundant, so we can unify their APIs into DataStream to make things
> >>>>>>> clearer.
> >>>>>>>
> >>>>>>> 2. The internal or low-level methods.
> >>>>>>> - DataStream.get_id(): Has been removed from the FLIP wiki page.
> >>>>>>> - DataStream.partition_custom(): Has been removed from the FLIP wiki page.
> >>>>>>> - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: Have been
> >>>>>>> removed from the FLIP wiki page.
> >>>>>>> Sorry for mistakenly making those internal methods public; we will not
> >>>>>>> expose them to users in the Python API.
> >>>>>>>
> >>>>>>> 3. "Declarative" APIs.
> >>>>>>> - KeyedStream.sum/min/max/min_by/max_by: Have been removed from the
> >>>>>>> FLIP wiki page. They are well covered by the Table API.
> >>>>>>>
> >>>>>>> 4. Spelling problems.
> >>>>>>> - StreamExecutionEnvironment.from_collections: Should be from_collection().
> >>>>>>> - StreamExecutionEnvironment.generate_sequenece: Should be generate_sequence().
> >>>>>>> Sorry for the spelling errors.
> >>>>>>>
> >>>>>>> 5. Predefined sources and sinks.
> >>>>>>> As you said, most of the predefined sources are not suitable for
> >>>>>>> production, so we can leave them out of the new Python DataStream API.
> >>>>>>> One exception: I think we should add print(), since it is commonly
> >>>>>>> used and very useful for debugging jobs. We can note in the API docs
> >>>>>>> that it should never be used in production.
> >>>>>>> Meanwhile, as you mentioned, a good alternative that always prints on
> >>>>>>> the client should also be supported.
> >>>>>>> For this case, maybe we can add a
> >>>>>>> collect() method that returns an Iterator. With the iterator, users
> >>>>>>> can print the content on the client. This is also consistent with the
> >>>>>>> behavior of the Table API.
> >>>>>>>
> >>>>>>> 6. For Row.
> >>>>>>> Do you mean that we should not expose the Row type in the Python API?
> >>>>>>> Maybe I haven't fully understood your concern.
> >>>>>>> We can use the tuple type in the Python DataStream API to support Row.
> >>>>>>> (I have updated the example section of the FLIP to reflect the design.)
> >>>>>>>
> >>>>>>> Highly appreciate your suggestions again. Looking forward to your
> >>>>>>> feedback.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Shuiqiang
> >>>>>>>
> >>>>>>> Aljoscha Krettek <aljos...@apache.org> wrote on Wed, Jul 15, 2020 at 5:58 PM:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> thanks for the proposal! I have some comments about the API. We
> >>>>>>>> should not blindly copy the existing Java DataStream, because we made
> >>>>>>>> some mistakes with it and we now have a chance to fix them rather
> >>>>>>>> than carry them into a new API.
> >>>>>>>>
> >>>>>>>> I don't think we need SingleOutputStreamOperator; in the Scala API we
> >>>>>>>> just have DataStream, and the relevant methods from
> >>>>>>>> SingleOutputStreamOperator are added to DataStream. Having this extra
> >>>>>>>> type is more confusing than helpful to users, I think. In the same
> >>>>>>>> vein, I think we also don't need DataStreamSource. The source methods
> >>>>>>>> can also just return a DataStream.
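The client-side collect() idea above can be sketched in plain Python. This models only the intended behavior (results stream back to the client as an iterator, and the user prints there rather than in the TaskManager log); `execute_and_collect` is a hypothetical name used for illustration, not a confirmed PyFlink API.

```python
# Plain-Python sketch of a collect() that returns an iterator (semantics only).

def execute_and_collect(pipeline, source):
    """Lazily stream job results back to the client as an iterator."""
    for record in source:
        yield pipeline(record)

# The user iterates and prints on the client, not inside the cluster.
for result in execute_and_collect(lambda x: x * 2, [1, 2, 3]):
    print(result)
```

Because the results come back as a lazy iterator, the client never has to materialize the whole output at once, which matches the Table API's collect behavior that the thread points to.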
> >>>>>>>>
> >>>>>>>> There are some methods that I would consider internal and we
> >>>>>>>> shouldn't expose them:
> >>>>>>>> - DataStream.get_id(): this is an internal method
> >>>>>>>> - DataStream.partition_custom(): I think adding this method was a
> >>>>>>>> mistake because it's too low-level; I could be convinced otherwise
> >>>>>>>> - DataStream.print()/DataStream.print_to_error(): These are
> >>>>>>>> questionable because they print to the TaskManager log. Maybe we
> >>>>>>>> could add a good alternative that always prints on the client,
> >>>>>>>> similar to the Table API
> >>>>>>>> - DataStream.write_to_socket(): It was a mistake to add this sink on
> >>>>>>>> DataStream; it is not fault-tolerant and shouldn't be used in production
> >>>>>>>>
> >>>>>>>> - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table API
> >>>>>>>> should be used for "declarative" use cases, and I think these methods
> >>>>>>>> should not be in the DataStream API
> >>>>>>>> - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these
> >>>>>>>> are internal methods
> >>>>>>>>
> >>>>>>>> - StreamExecutionEnvironment.from_parallel_collection(): I think the
> >>>>>>>> usability is questionable
> >>>>>>>> - StreamExecutionEnvironment.from_collections -> should be called
> >>>>>>>> from_collection
> >>>>>>>> - StreamExecutionEnvironment.generate_sequenece -> should be called
> >>>>>>>> generate_sequence
> >>>>>>>>
> >>>>>>>> I think most of the predefined sources are questionable:
> >>>>>>>> - fromParallelCollection: I don't know if this is useful
> >>>>>>>> - readTextFile: most of the variants are not useful/fault-tolerant
> >>>>>>>> - readFile: same
> >>>>>>>> - socketTextStream: also not useful except for toy examples
> >>>>>>>> - createInput: also not useful, and it's legacy DataSet InputFormats
> >>>>>>>>
> >>>>>>>> I think we need to
> >>>>>>>> think hard about whether we want to further expose Row in our APIs.
> >>>>>>>> I think adding it to flink-core was more an accident than anything
> >>>>>>>> else, but I can see that it would be useful for Python/Java interop.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Aljoscha
> >>>>>>>>
> >>>>>>>> On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
> >>>>>>>>> Thanks for bringing up this DISCUSS, Shuiqiang!
> >>>>>>>>>
> >>>>>>>>> +1 for the proposal!
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Jincheng
> >>>>>>>>>
> >>>>>>>>> Xingbo Huang <hxbks...@gmail.com> wrote on Thu, Jul 9, 2020 at 10:41 AM:
> >>>>>>>>>
> >>>>>>>>>> Hi Shuiqiang,
> >>>>>>>>>>
> >>>>>>>>>> Thanks a lot for driving this discussion.
> >>>>>>>>>> Big +1 for supporting the Python DataStream API.
> >>>>>>>>>> In many ML scenarios, operating on objects is more natural than
> >>>>>>>>>> operating on Tables.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Xingbo
> >>>>>>>>>>
> >>>>>>>>>> Wei Zhong <weizhong0...@gmail.com> wrote on Thu, Jul 9, 2020 at 10:35 AM:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Shuiqiang,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for driving this. Big +1 for supporting the DataStream API
> >>>>>>>>>>> in PyFlink!
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Wei
> >>>>>>>>>>>
> >>>>>>>>>>>> On Jul 9, 2020, at 10:29, Hequn Cheng <he...@apache.org> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> +1 for adding the Python DataStream API and starting with the
> >>>>>>>>>>>> stateless part.
> >>>>>>>>>>>> Some users have already expressed their wish to have the Python
> >>>>>>>>>>>> DataStream APIs. Once we have the APIs in PyFlink, we can cover
> >>>>>>>>>>>> more use cases for our users.
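The Row/tuple interop question raised in the thread (using plain tuples in Python to stand in for Row) can be illustrated in plain Python: a row is just a tuple with a known field order. The `WordCount` type and `to_row` helper below are hypothetical names for illustration only; this is not Flink's Row class or its actual Python binding.

```python
# Plain-Python sketch of representing a Row as a tuple with a fixed schema.
from collections import namedtuple

# A "row type" is a tuple with a known field order, e.g. (word, count).
WordCount = namedtuple("WordCount", ["word", "count"])

def to_row(t):
    """Interpret a plain tuple coming from Python code as a typed row."""
    return WordCount(*t)

row = to_row(("flink", 7))
# Fields stay addressable by position (tuple semantics) and by name.
assert row[0] == "flink" and row.count == 7
```

Since a namedtuple is still a tuple, Python user code can keep passing plain tuples around while the framework side attaches field names and types at the boundary, which is the kind of Python/Java interop the thread discusses.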
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Hequn
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <acqua....@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Sorry, the 3rd link is broken, please refer to this one:
> >>>>>>>>>>>>> Support Python DataStream API
> >>>>>>>>>>>>> <https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Shuiqiang Chen <acqua....@gmail.com> wrote on Wed, Jul 8, 2020 at 11:13 AM:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> As we all know, Flink provides three layered APIs: the
> >>>>>>>>>>>>>> ProcessFunctions, the DataStream API and the SQL & Table API.
> >>>>>>>>>>>>>> Each API offers a different trade-off between conciseness and
> >>>>>>>>>>>>>> expressiveness and targets different use cases[1].
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Currently, the SQL & Table API is already supported in PyFlink.
> >>>>>>>>>>>>>> The API provides relational operations as well as user-defined
> >>>>>>>>>>>>>> functions for users who are familiar with Python and relational
> >>>>>>>>>>>>>> programming.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Meanwhile, the DataStream API and ProcessFunctions provide more
> >>>>>>>>>>>>>> generic APIs to implement stream processing applications. The
> >>>>>>>>>>>>>> ProcessFunctions expose time and state, which are the
> >>>>>>>>>>>>>> fundamental building blocks for any kind of streaming
> >>>>>>>>>>>>>> application.
> >>>>>>>>>>>>>> To cover more use cases, we are planning to cover all these
> >>>>>>>>>>>>>> APIs in PyFlink.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In this discussion (FLIP-130), we propose to support the Python
> >>>>>>>>>>>>>> DataStream API for the stateless part. For more detail, please
> >>>>>>>>>>>>>> refer to the FLIP wiki page here[2]. If you are interested in
> >>>>>>>>>>>>>> the stateful part, you can also take a look at the design doc
> >>>>>>>>>>>>>> here[3], which we are going to discuss in a separate FLIP.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Any comments will be highly appreciated!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [1] https://flink.apache.org/flink-applications.html#layered-apis
> >>>>>>>>>>>>>> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> >>>>>>>>>>>>>> [3] https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Shuiqiang