Hi,

thanks for the proposal! I have some comments about the API. We should not
blindly copy the existing Java DataStream API because we made some mistakes
there, and we now have a chance to fix them rather than carry them forward
into a new API.

I don't think we need SingleOutputStreamOperator; in the Scala API we just
have DataStream, and the relevant methods from SingleOutputStreamOperator are
added to DataStream. Having this extra type is more confusing than helpful to
users, I think. In the same vein, I think we also don't need DataStreamSource.
The source methods can also just return a DataStream.
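
To make the point concrete, here is a minimal sketch in plain Python (not
actual PyFlink code; all class and method names are illustrative assumptions)
of the flattened design: a single DataStream type whose transformation
methods return DataStream again, and source methods that return DataStream
directly, so neither SingleOutputStreamOperator nor DataStreamSource is
exposed to users:

```python
class DataStream:
    """Illustrative stand-in for a unified Python DataStream type."""

    def __init__(self, elements):
        self._elements = list(elements)

    def map(self, func):
        # Returns a plain DataStream, not a separate
        # SingleOutputStreamOperator type.
        return DataStream(func(e) for e in self._elements)

    def name(self, operator_name):
        # Methods that lived on SingleOutputStreamOperator (e.g. name())
        # can simply be defined on DataStream itself.
        self._name = operator_name
        return self


class StreamExecutionEnvironment:
    def from_collection(self, elements):
        # Source methods return DataStream directly, not DataStreamSource.
        return DataStream(elements)


env = StreamExecutionEnvironment()
result = env.from_collection([1, 2, 3]).map(lambda x: x * 2).name("double")
```

With this shape, every call in a chain yields the same type, so users only
ever see DataStream.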

There are some methods that I would consider internal and we shouldn't expose 
them:
 - DataStream.get_id(): this is an internal method
 - DataStream.partition_custom(): I think adding this method was a mistake
because it's too low-level, though I could be convinced otherwise
 - DataStream.print()/DataStream.print_to_error(): These are questionable 
because they print to the TaskManager log. Maybe we could add a good 
alternative that always prints on the client, similar to the Table API
 - DataStream.write_to_socket(): it was a mistake to add this sink on
DataStream; it is not fault-tolerant and shouldn't be used in production

 - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table API should be 
used for "declarative" use cases and I think these methods should not be in the 
DataStream API
 - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these are 
internal methods

 - StreamExecutionEnvironment.from_parallel_collection(): I think the usability 
is questionable
 - StreamExecutionEnvironment.from_collections -> should be called 
from_collection
 - StreamExecutionEnvironment.generate_sequenece -> should be called 
generate_sequence
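
To illustrate the client-side print alternative mentioned above, here is a
rough sketch in plain Python (collect() and print_on_client() are
hypothetical names, not an existing API): results are fetched back to the
client and printed there, rather than ending up in the TaskManager log:

```python
class DataStream:
    """Illustrative stand-in; a real implementation would talk to a cluster."""

    def __init__(self, elements):
        self._elements = list(elements)

    def collect(self):
        # In a real runtime this would fetch results from the cluster
        # back to the client; here we just return the local elements.
        return list(self._elements)

    def print_on_client(self):
        # Print in the client process, similar to how the Table API can
        # print results on the client, instead of in the TaskManager log.
        for record in self.collect():
            print(record)


ds = DataStream(["a", "b", "c"])
ds.print_on_client()
```

The key design point is that printing happens wherever the user's script
runs, which is what people intuitively expect from print().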

I think most of the predefined sources are questionable:
 - fromParallelCollection: I don't know if this is useful
 - readTextFile: most of the variants are not useful/fault-tolerant
 - readFile: same
 - socketTextStream: also not useful except for toy examples
 - createInput: also not useful, and it's based on the legacy DataSet
InputFormats

I think we need to think hard about whether we want to further expose Row in
our APIs. I think adding it to flink-core was more an accident than anything
else, but I can see that it would be useful for Python/Java interop.
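
As a hedged sketch of why a Row type helps Python/Java interop: a Row carries
positionally ordered fields (matching a Java Row on the other side of the
language boundary) while optionally exposing field names on the Python side.
This is an illustrative stand-in, not the actual org.apache.flink.types.Row
API:

```python
class Row:
    """Illustrative Row: positional fields plus optional field names."""

    def __init__(self, *values, field_names=None):
        # Positional order is what would be serialized across the
        # Python/Java boundary; names are a Python-side convenience.
        self._values = list(values)
        self._field_names = field_names

    def __getitem__(self, index):
        return self._values[index]

    def as_dict(self):
        if self._field_names is None:
            raise ValueError("no field names attached to this Row")
        return dict(zip(self._field_names, self._values))


row = Row("alice", 42, field_names=["name", "age"])
```

Because the wire format only needs the positional values, the Java side can
reconstruct its own Row without knowing the Python field names.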

Best,
Aljoscha


On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
> Thanks for bringing up this DISCUSS, Shuiqiang!
> 
> +1 for the proposal!
> 
> Best,
> Jincheng
> 
> 
> Xingbo Huang <hxbks...@gmail.com> wrote on Thu, Jul 9, 2020 at 10:41 AM:
> 
> > Hi Shuiqiang,
> >
> > Thanks a lot for driving this discussion.
> > Big +1 for supporting Python DataStream.
> > In many ML scenarios, operating Object will be more natural than operating
> > Table.
> >
> > Best,
> > Xingbo
> >
> > Wei Zhong <weizhong0...@gmail.com> wrote on Thu, Jul 9, 2020 at 10:35 AM:
> >
> > > Hi Shuiqiang,
> > >
> > > Thanks for driving this. Big +1 for supporting DataStream API in PyFlink!
> > >
> > > Best,
> > > Wei
> > >
> > >
> > > > On Jul 9, 2020, at 10:29, Hequn Cheng <he...@apache.org> wrote:
> > > >
> > > > +1 for adding the Python DataStream API and starting with the stateless
> > > > part.
> > > > There are already some users that expressed their wish to have the
> > > > Python DataStream APIs. Once we have the APIs in PyFlink, we can
> > > > cover more use cases for our users.
> > > >
> > > > Best, Hequn
> > > >
> > > > On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <acqua....@gmail.com>
> > > > wrote:
> > > >
> > > >> Sorry, the 3rd link is broken, please refer to this one: Support
> > > >> Python DataStream API
> > > >> <https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit>
> > > >>
> > > >> Shuiqiang Chen <acqua....@gmail.com> wrote on Wed, Jul 8, 2020 at 11:13 AM:
> > > >>
> > > >>> Hi everyone,
> > > >>>
> > > >>> As we all know, Flink provides three layered APIs: the
> > > >>> ProcessFunctions, the DataStream API and the SQL & Table API. Each
> > > >>> API offers a different trade-off between conciseness and
> > > >>> expressiveness and targets different use cases[1].
> > > >>>
> > > >>> Currently, the SQL & Table API has already been supported in
> > > >>> PyFlink. The API provides relational operations as well as
> > > >>> user-defined functions to provide convenience for users who are
> > > >>> familiar with Python and relational programming.
> > > >>>
> > > >>> Meanwhile, the DataStream API and ProcessFunctions provide more
> > > >>> generic APIs to implement stream processing applications. The
> > > >>> ProcessFunctions expose time and state, which are the fundamental
> > > >>> building blocks for any kind of streaming application.
> > > >>> To cover more use cases, we are planning to cover all these APIs
> > > >>> in PyFlink.
> > > >>>
> > > >>> In this discussion (FLIP-130), we propose to support the Python
> > > >>> DataStream API for the stateless part. For more detail, please
> > > >>> refer to the FLIP wiki page here[2]. If interested in the stateful
> > > >>> part, you can also take a look at the design doc here[3], which we
> > > >>> are going to discuss in a separate FLIP.
> > > >>>
> > > >>> Any comments will be highly appreciated!
> > > >>>
> > > >>> [1] https://flink.apache.org/flink-applications.html#layered-apis
> > > >>> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> > > >>> [3] https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
> > > >>>
> > > >>> Best,
> > > >>> Shuiqiang
> > > >>>
