Hi,

It's a good idea to start with a minimal API surface and add methods once we find they are truly useful. From my side, I'm also ok with the partitionCustom() method. Thanks David for your feedback!
Best,
Hequn

On Mon, Jul 27, 2020 at 8:57 PM Aljoscha Krettek <aljos...@apache.org> wrote:
> Hi,
>
> I'm also not against adding that if it enables actual use cases. I don't
> think we need to spell out the whole API in the FLIP, though. We can add
> things as they come up.
>
> Best,
> Aljoscha
>
> On 24.07.20 14:43, Shuiqiang Chen wrote:
> > Hi David,
> >
> > Thank you for your reply! I have started the vote for this FLIP, but we
> > can keep the discussion on this thread.
> > From my perspective, I would not be against adding
> > DataStream.partitionCustom to the Python DataStream API. However, more
> > input is welcome.
> >
> > Best,
> > Shuiqiang
> >
> > David Anderson <da...@alpinegizmo.com> wrote on Fri, Jul 24, 2020 at 7:52 PM:
> >
> >> Sorry I'm coming to this rather late, but I would like to argue that
> >> DataStream.partitionCustom enables an important use case.
> >> What I have in mind is performing partitioned enrichment, where each
> >> instance can preload a slice of a static dataset that is being used for
> >> enrichment.
> >>
> >> For an example, consider
> >> https://github.com/knaufk/enrichments-with-flink/blob/master/src/main/java/com/github/knaufk/enrichments/CustomPartitionEnrichmenttJob.java
> >>
> >> Regards,
> >> David
> >>
> >> On Fri, Jul 24, 2020 at 12:18 PM Shuiqiang Chen <acqua....@gmail.com> wrote:
> >>
> >>> Hi Aljoscha,
> >>>
> >>> Thank you for your response. I'll keep these two helper
> >>> methods in the Python DataStream implementation.
> >>>
> >>> And thank you all for joining the discussion. It seems that we have
> >>> reached a consensus. I will start a vote for this FLIP later today.
> >>>
> >>> Best,
> >>> Shuiqiang
> >>>
> >>> Hequn Cheng <he...@apache.org> wrote on Fri, Jul 24, 2020 at 5:29 PM:
> >>>
> >>>> Thanks a lot for your valuable feedback and suggestions! @Aljoscha
> >>>> Krettek <aljos...@apache.org>
> >>>> +1 to the vote.
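The partitioned-enrichment use case David describes can be sketched in plain Python. This is only an illustration of the semantics a custom partitioner would provide (records and the matching slice of a static dataset land on the same parallel instance); the names `partitioner`, `preload_slice`, and `enrich` are hypothetical and not part of Flink or PyFlink.

```python
# Plain-Python sketch of partitioned enrichment (semantics only, not PyFlink).

def partitioner(key, num_partitions):
    """Route every record with the same key to the same parallel instance."""
    return hash(key) % num_partitions

def preload_slice(static_dataset, partition, num_partitions):
    """Each instance preloads only the slice of the static dataset it will see."""
    return {k: v for k, v in static_dataset.items()
            if partitioner(k, num_partitions) == partition}

def enrich(records, static_dataset, num_partitions):
    """Simulate num_partitions parallel instances, each enriching its own partition."""
    slices = [preload_slice(static_dataset, p, num_partitions)
              for p in range(num_partitions)]
    out = []
    for key, value in records:
        p = partitioner(key, num_partitions)
        # The lookup stays local to instance p; no instance holds the full dataset.
        out.append((key, value, slices[p].get(key)))
    return out
```

Because the same partitioner routes both the static slice and the live records, each lookup is guaranteed to find its reference data locally, which is the point of exposing partitionCustom for this pattern.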
> >>>>
> >>>> Best,
> >>>> Hequn
> >>>>
> >>>> On Fri, Jul 24, 2020 at 5:16 PM Aljoscha Krettek <aljos...@apache.org> wrote:
> >>>>
> >>>>> Thanks for updating! And yes, I think it's ok to include the few
> >>>>> helper methods such as "readFromFile" and "print".
> >>>>>
> >>>>> I think we can now proceed to a vote! Nice work, overall!
> >>>>>
> >>>>> Best,
> >>>>> Aljoscha
> >>>>>
> >>>>> On 16.07.20 17:16, Hequn Cheng wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Thanks a lot for your discussions.
> >>>>>> I think Aljoscha makes good suggestions here! Those problematic APIs
> >>>>>> should not be added to the new Python DataStream API.
> >>>>>>
> >>>>>> Only one item I want to add based on the reply from Shuiqiang:
> >>>>>> I would also tend to keep the readTextFile() method. Apart from
> >>>>>> print(), readTextFile() may also be very helpful and frequently used
> >>>>>> for playing with Flink.
> >>>>>> For example, it is used in our WordCount example[1], which is almost
> >>>>>> the first Flink program that every beginner runs.
> >>>>>> It is more efficient for reading multi-line data than
> >>>>>> fromCollection(), while being far easier to use than Kafka, Kinesis,
> >>>>>> RabbitMQ, etc., when playing with Flink.
> >>>>>>
> >>>>>> What do you think?
> >>>>>>
> >>>>>> Best,
> >>>>>> Hequn
> >>>>>>
> >>>>>> [1]
> >>>>>> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java
> >>>>>>
> >>>>>> On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <acqua....@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi Aljoscha,
> >>>>>>>
> >>>>>>> Thank you for your valuable comments! I agree with you that there is
> >>>>>>> some room for optimization in the existing API, and that this can be
> >>>>>>> applied to the Python DataStream API implementation.
> >>>>>>>
> >>>>>>> Based on your comments, I have grouped them into the following parts:
> >>>>>>>
> >>>>>>> 1. SingleOutputStreamOperator and DataStreamSource.
> >>>>>>> Yes, SingleOutputStreamOperator and DataStreamSource are a bit
> >>>>>>> redundant, so we can unify their APIs into DataStream to make things
> >>>>>>> clearer.
> >>>>>>>
> >>>>>>> 2. The internal or low-level methods.
> >>>>>>> - DataStream.get_id(): Has been removed from the FLIP wiki page.
> >>>>>>> - DataStream.partition_custom(): Has been removed from the FLIP wiki page.
> >>>>>>> - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: Have been
> >>>>>>> removed from the FLIP wiki page.
> >>>>>>> Sorry for mistakenly making those internal methods public; we will not
> >>>>>>> expose them to users in the Python API.
> >>>>>>>
> >>>>>>> 3. "Declarative" APIs.
> >>>>>>> - KeyedStream.sum/min/max/min_by/max_by: Have been removed from the
> >>>>>>> FLIP wiki page. They are well covered by the Table API.
> >>>>>>>
> >>>>>>> 4. Spelling problems.
> >>>>>>> - StreamExecutionEnvironment.from_collections: Should be from_collection().
> >>>>>>> - StreamExecutionEnvironment.generate_sequenece: Should be generate_sequence().
> >>>>>>> Sorry for the spelling errors.
> >>>>>>>
> >>>>>>> 5. Predefined sources and sinks.
> >>>>>>> As you said, most of the predefined sources are not suitable for
> >>>>>>> production, so we can leave them out of the new Python DataStream API.
> >>>>>>> One exception: I think we should add print(), since it is commonly
> >>>>>>> used and very useful for debugging jobs. We can note in the API docs
> >>>>>>> that it should never be used in production.
> >>>>>>> Meanwhile, as you mentioned, a good alternative that always prints on
> >>>>>>> the client should also be supported.
> >>>>>>> For this case, maybe we can add a
> >>>>>>> collect() method that returns an Iterator. With the iterator, users
> >>>>>>> can print the content on the client. This is also consistent with the
> >>>>>>> behavior of the Table API.
> >>>>>>>
> >>>>>>> 6. For Row.
> >>>>>>> Do you mean that we should not expose the Row type in the Python API?
> >>>>>>> Maybe I haven't fully understood your concern.
> >>>>>>> We can use the tuple type in the Python DataStream API to support Row.
> >>>>>>> (I have updated the example section of the FLIP to reflect the design.)
> >>>>>>>
> >>>>>>> Highly appreciate your suggestions again. Looking forward to your
> >>>>>>> feedback.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Shuiqiang
> >>>>>>>
> >>>>>>> Aljoscha Krettek <aljos...@apache.org> wrote on Wed, Jul 15, 2020 at 5:58 PM:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> thanks for the proposal! I have some comments about the API. We
> >>>>>>>> should not blindly copy the existing Java DataStream, because we made
> >>>>>>>> some mistakes with it and we now have a chance to fix them rather
> >>>>>>>> than carry them into a new API.
> >>>>>>>>
> >>>>>>>> I don't think we need SingleOutputStreamOperator; in the Scala API we
> >>>>>>>> just have DataStream, and the relevant methods from
> >>>>>>>> SingleOutputStreamOperator are added to DataStream. Having this extra
> >>>>>>>> type is more confusing than helpful to users, I think. In the same
> >>>>>>>> vein, I think we also don't need DataStreamSource. The source methods
> >>>>>>>> can also just return a DataStream.
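The client-side collect() idea above can be sketched in plain Python. This models only the intended behavior (results stream back to the client as an iterator, and the user prints there rather than in the TaskManager log); `execute_and_collect` is a hypothetical name used for illustration, not a confirmed PyFlink API.

```python
# Plain-Python sketch of a collect() that returns an iterator (semantics only).

def execute_and_collect(pipeline, source):
    """Lazily stream job results back to the client as an iterator."""
    for record in source:
        yield pipeline(record)

# The user iterates and prints on the client, not inside the cluster.
for result in execute_and_collect(lambda x: x * 2, [1, 2, 3]):
    print(result)
```

Because the results come back as a lazy iterator, the client never has to materialize the whole output at once, which matches the Table API's collect behavior that the thread points to.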
> >>>>>>>>
> >>>>>>>> There are some methods that I would consider internal and we
> >>>>>>>> shouldn't expose them:
> >>>>>>>> - DataStream.get_id(): this is an internal method
> >>>>>>>> - DataStream.partition_custom(): I think adding this method was a
> >>>>>>>> mistake because it's too low-level; I could be convinced otherwise
> >>>>>>>> - DataStream.print()/DataStream.print_to_error(): These are
> >>>>>>>> questionable because they print to the TaskManager log. Maybe we
> >>>>>>>> could add a good alternative that always prints on the client,
> >>>>>>>> similar to the Table API
> >>>>>>>> - DataStream.write_to_socket(): It was a mistake to add this sink on
> >>>>>>>> DataStream; it is not fault-tolerant and shouldn't be used in production
> >>>>>>>>
> >>>>>>>> - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table API
> >>>>>>>> should be used for "declarative" use cases, and I think these methods
> >>>>>>>> should not be in the DataStream API
> >>>>>>>> - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these
> >>>>>>>> are internal methods
> >>>>>>>>
> >>>>>>>> - StreamExecutionEnvironment.from_parallel_collection(): I think the
> >>>>>>>> usability is questionable
> >>>>>>>> - StreamExecutionEnvironment.from_collections -> should be called
> >>>>>>>> from_collection
> >>>>>>>> - StreamExecutionEnvironment.generate_sequenece -> should be called
> >>>>>>>> generate_sequence
> >>>>>>>>
> >>>>>>>> I think most of the predefined sources are questionable:
> >>>>>>>> - fromParallelCollection: I don't know if this is useful
> >>>>>>>> - readTextFile: most of the variants are not useful/fault-tolerant
> >>>>>>>> - readFile: same
> >>>>>>>> - socketTextStream: also not useful except for toy examples
> >>>>>>>> - createInput: also not useful, and it's legacy DataSet InputFormats
> >>>>>>>>
> >>>>>>>> I think we need to
> >>>>>>>> think hard about whether we want to further expose Row in our APIs.
> >>>>>>>> I think adding it to flink-core was more an accident than anything
> >>>>>>>> else, but I can see that it would be useful for Python/Java interop.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Aljoscha
> >>>>>>>>
> >>>>>>>> On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
> >>>>>>>>> Thanks for bringing up this DISCUSS, Shuiqiang!
> >>>>>>>>>
> >>>>>>>>> +1 for the proposal!
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Jincheng
> >>>>>>>>>
> >>>>>>>>> Xingbo Huang <hxbks...@gmail.com> wrote on Thu, Jul 9, 2020 at 10:41 AM:
> >>>>>>>>>
> >>>>>>>>>> Hi Shuiqiang,
> >>>>>>>>>>
> >>>>>>>>>> Thanks a lot for driving this discussion.
> >>>>>>>>>> Big +1 for supporting the Python DataStream API.
> >>>>>>>>>> In many ML scenarios, operating on objects is more natural than
> >>>>>>>>>> operating on Tables.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Xingbo
> >>>>>>>>>>
> >>>>>>>>>> Wei Zhong <weizhong0...@gmail.com> wrote on Thu, Jul 9, 2020 at 10:35 AM:
> >>>>>>>>>>
> >>>>>>>>>>> Hi Shuiqiang,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for driving this. Big +1 for supporting the DataStream API
> >>>>>>>>>>> in PyFlink!
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Wei
> >>>>>>>>>>>
> >>>>>>>>>>>> On Jul 9, 2020, at 10:29, Hequn Cheng <he...@apache.org> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> +1 for adding the Python DataStream API and starting with the
> >>>>>>>>>>>> stateless part.
> >>>>>>>>>>>> Some users have already expressed their wish to have the Python
> >>>>>>>>>>>> DataStream APIs. Once we have the APIs in PyFlink, we can cover
> >>>>>>>>>>>> more use cases for our users.
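The Row/tuple interop question raised in the thread (using plain tuples in Python to stand in for Row) can be illustrated in plain Python: a row is just a tuple with a known field order. The `WordCount` type and `to_row` helper below are hypothetical names for illustration only; this is not Flink's Row class or its actual Python binding.

```python
# Plain-Python sketch of representing a Row as a tuple with a fixed schema.
from collections import namedtuple

# A "row type" is a tuple with a known field order, e.g. (word, count).
WordCount = namedtuple("WordCount", ["word", "count"])

def to_row(t):
    """Interpret a plain tuple coming from Python code as a typed row."""
    return WordCount(*t)

row = to_row(("flink", 7))
# Fields stay addressable by position (tuple semantics) and by name.
assert row[0] == "flink" and row.count == 7
```

Since a namedtuple is still a tuple, Python user code can keep passing plain tuples around while the framework side attaches field names and types at the boundary, which is the kind of Python/Java interop the thread discusses.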
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Hequn
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <acqua....@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Sorry, the 3rd link is broken, please refer to this one:
> >>>>>>>>>>>>> Support Python DataStream API
> >>>>>>>>>>>>> <https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Shuiqiang Chen <acqua....@gmail.com> wrote on Wed, Jul 8, 2020 at 11:13 AM:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> As we all know, Flink provides three layered APIs: the
> >>>>>>>>>>>>>> ProcessFunctions, the DataStream API and the SQL & Table API.
> >>>>>>>>>>>>>> Each API offers a different trade-off between conciseness and
> >>>>>>>>>>>>>> expressiveness and targets different use cases[1].
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Currently, the SQL & Table API is already supported in PyFlink.
> >>>>>>>>>>>>>> The API provides relational operations as well as user-defined
> >>>>>>>>>>>>>> functions for users who are familiar with Python and relational
> >>>>>>>>>>>>>> programming.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Meanwhile, the DataStream API and ProcessFunctions provide more
> >>>>>>>>>>>>>> generic APIs to implement stream processing applications. The
> >>>>>>>>>>>>>> ProcessFunctions expose time and state, which are the
> >>>>>>>>>>>>>> fundamental building blocks for any kind of streaming
> >>>>>>>>>>>>>> application.
> >>>>>>>>>>>>>> To cover more use cases, we are planning to cover all these
> >>>>>>>>>>>>>> APIs in PyFlink.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In this discussion (FLIP-130), we propose to support the Python
> >>>>>>>>>>>>>> DataStream API for the stateless part. For more detail, please
> >>>>>>>>>>>>>> refer to the FLIP wiki page here[2]. If you are interested in
> >>>>>>>>>>>>>> the stateful part, you can also take a look at the design doc
> >>>>>>>>>>>>>> here[3], which we are going to discuss in a separate FLIP.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Any comments will be highly appreciated!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [1] https://flink.apache.org/flink-applications.html#layered-apis
> >>>>>>>>>>>>>> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> >>>>>>>>>>>>>> [3] https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Shuiqiang