[DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-07-30 Thread jincheng sun
Hi folks,

Since the release of Flink 1.11, the number of PyFlink users has continued
to grow. As far as I know, many companies have put PyFlink into production
for data analysis and for operations and maintenance monitoring (such as
Jumei (聚美优品) [1] and Mozhi (浙江墨芷) [2], etc.). According to the
feedback we have received, the current documentation is not very friendly
to PyFlink users. There are two shortcomings:

- Python-related content is mixed into the Java/Scala documentation, which
makes it difficult to read for users who only care about PyFlink.
- There is already a "Python Table API" section under the Table API
documentation that holds the PyFlink documents, but the articles are few
and the content is fragmented. It is difficult for beginners to learn from.

In addition, FLIP-130 introduced the Python DataStream API, and many
documents will be added for those new APIs. In order to improve the
readability and maintainability of the PyFlink documentation, Wei Zhong and
I have discussed this offline and would like to rework it via this FLIP.

We will rework the documentation around the following three objectives:

- Add a separate section for the Python API under the "Application
Development" section.
- Restructure the current Python documentation into a brand new structure
that ensures complete content and is friendly to beginners.
- Improve the documents shared by Python/Java/Scala to make them more
friendly to Python users without affecting Java/Scala users.

More details can be found in FLIP-133:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-133%3A+Rework+PyFlink+Documentation

Best,
Jincheng

[1] https://mp.weixin.qq.com/s/zVsBIs1ZEFe4atYUYtZpRg
[2] https://mp.weixin.qq.com/s/R4p_a2TWGpESBWr3pLtM2g


Re: [DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-08-02 Thread jincheng sun
It would be great if you could join the contribution to the PyFlink
documentation, @Marta!
Thanks for all of the positive feedback. I will start a formal vote
later...

Best,
Jincheng


Shuiqiang Chen wrote on Mon, Aug 3, 2020 at 9:56 AM:

> Hi jincheng,
>
> Thanks for the discussion. +1 for the FLIP.
>
> Well-organized documentation will greatly improve the efficiency and
> experience of developers.
>
> Best,
> Shuiqiang
>
> Hequn Cheng wrote on Sat, Aug 1, 2020 at 8:42 AM:
>
>> Hi Jincheng,
>>
>> Thanks a lot for raising the discussion. +1 for the FLIP.
>>
>> I think this will bring big benefits for PyFlink users. Currently,
>> the Python Table API document is hidden deep under the Table API & SQL
>> tab, which makes it quite hard to find. Also, the PyFlink documentation
>> is mixed with the Java/Scala documentation. It is hard for users to get
>> an overview of all the PyFlink documents. As more and more functionality
>> is added to PyFlink, I think it's time for us to refactor the document.
>>
>> Best,
>> Hequn
>>
>>
>> On Fri, Jul 31, 2020 at 3:43 PM Marta Paes Moreira 
>> wrote:
>>
>>> Hi, Jincheng!
>>>
>>> Thanks for creating this detailed FLIP, it will make a big difference in
>>> the experience of Python developers using Flink. I'm interested in
>>> contributing to this work, so I'll reach out to you offline!
>>>
>>> Also, thanks for sharing some information on the adoption of PyFlink,
>>> it's
>>> great to see that there are already production users.
>>>
>>> Marta
>>>
>>> On Fri, Jul 31, 2020 at 5:35 AM Xingbo Huang  wrote:
>>>
>>> > Hi Jincheng,
>>> >
>>> > Thanks a lot for bringing up this discussion and the proposal.
>>> >
>>> > Big +1 for improving the structure of PyFlink doc.
>>> >
>>> > It will be very helpful to give PyFlink users a unified entry point to
>>> > learn from the PyFlink documents.
>>> >
>>> > Best,
>>> > Xingbo
>>> >
>>> > Dian Fu wrote on Fri, Jul 31, 2020 at 11:00 AM:
>>> >
>>> >> Hi Jincheng,
>>> >>
>>> >> Thanks a lot for bringing up this discussion and the proposal. +1 to
>>> >> improve the Python API doc.
>>> >>
>>> >> I have received a lot of feedback from PyFlink beginners about
>>> >> the PyFlink docs, e.g. the materials are too few, the Python doc is
>>> >> mixed with the Java doc, and it's not easy to find the docs they want.
>>> >>
>>> >> I think it would greatly improve the user experience if we could have
>>> >> one place which includes most of the knowledge PyFlink users need.
>>> >>
>>> >> Regards,
>>> >> Dian


Re: [DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-08-04 Thread jincheng sun
> are more independent, it will be challenging to respond to questions about
> which features from the other APIs are available from Python.
>
> David

Re: [DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-08-05 Thread jincheng sun
Hi David, Thank you for sharing the problems with the current
documentation; I agree with you, as I have received the same feedback from
Chinese users. Users often contact me with questions such as whether
PyFlink supports "Java UDF" or whether PyFlink supports "xxxConnector". The
root cause of these questions is that our existing documents are written
for Java users (with text and API references mixed together). Since Python
support was only added in 1.9, much of the documentation is not friendly to
Python users: they don't want to hunt for Python content in unfamiliar Java
documents. Just yesterday there were complaints from Chinese users about
where all the documentation entries for the Python API are. So a
centralized entry point and a clear document structure are the urgent
demands of Python users. The intention of this FLIP is to do our best to
solve these pain points.
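
As it happens, the answer to the "Java UDF" question is yes. A minimal
sketch (assuming an existing TableEnvironment t_env, and a hypothetical
Java ScalarFunction com.example.MyUpper already on the classpath):

# Register a Java UDF and call it from the Python Table API (since 1.9).
t_env.register_java_function("my_upper", "com.example.MyUpper")
result = t_env.sql_query("SELECT my_upper(name) FROM users")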

Hi Xingbo and Wei, thank you for sharing the status of PySpark's
documentation work. You're right: PySpark already has a large Python user
base, and they have also found that the Python user community is an
important front for multi-language support. Centralizing and unifying the
Python documentation will reduce the learning cost for Python users, and a
good document structure and content will also reduce the Q&A burden on the
community. It's a once-and-for-all job.

Hi Seth, I wonder if your concerns have been resolved through the previous
discussion?

In any case, the principle of this FLIP is that the Python documentation
should only include Python-specific content instead of copying the Java
content. And it would be great to have you join the improvement of PyFlink
(both writing PRs and reviewing PRs).

Best,
Jincheng


Wei Zhong wrote on Wed, Aug 5, 2020 at 5:46 PM:

> Hi Xingbo,
>
> Thanks for your information.
>
> I think PySpark's documentation redesign deserves our attention. It
> seems that the Spark community has also begun to take the user experience
> of its Python documentation more seriously. We can keep following the
> discussion and progress of the redesign in the Spark community. It is so
> similar to our work that there should be some ideas worth borrowing.
>
> Best,
> Wei
>
>
> On Aug 5, 2020, at 15:02, Xingbo Huang wrote:
>
> Hi,
>
> I found that the Spark community has also been working on redesigning the
> PySpark documentation [1] recently. Maybe we can compare our document
> structure with theirs.
>
> [1] https://issues.apache.org/jira/browse/SPARK-31851
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/Need-some-help-and-contributions-in-PySpark-API-documentation-td29972.html
>
> Best,
> Xingbo
>
> David Anderson wrote on Wed, Aug 5, 2020 at 3:17 AM:
>
>> I'm delighted to see energy going into improving the documentation.
>>
>> With the current documentation, I get a lot of questions that I believe
>> reflect two fundamental problems with what we currently provide:
>>
>> (1) We have a lot of contextual information in our heads about how Flink
>> works, and we are able to use that knowledge to make reasonable inferences
>> about how things (probably) work in cases we aren't so familiar with. For
>> example, I get a lot of questions of the form "If I use <some feature> will
>> I still have exactly once guarantees?" The answer is always yes, but they
>> continue to have doubts because we have failed to clearly communicate this
>> fundamental, underlying principle.
>>
>> This specific example about fault tolerance applies across all of the
>> Flink docs, but the general idea can also be applied to the Table/SQL and
>> PyFlink docs. The guiding principles underlying these APIs should be
>> written down in one easy-to-find place.
>>
>> (2) The other kind of question I get a lot is "Can I do <X> with <Y>?"
>> E.g., "Can I use the JDBC table sink from PyFlink?" These questions can be
>> very difficult to answer because it is frequently the case that one has to
>> reason about why a given feature doesn't seem to appear in the
>> documentation. It could be that I'm looking in the wrong place, or it could
>> be that someone forgot to document something, or it could be that it can in
>> fact be done by applying a general mechanism in a specific way that I
>> haven't thought of -- as in this case, where one can use a JDBC sink from
>> Python if one thinks to use DDL.
>>
>> So I think it would be helpful to be explicit about both what is, and
>> what is not, supported in PyFlink. And to have some very clear organizing
>> principles in the documentation so that users can quickly learn where to
>> look for specific facts.
>>
>> Regards,
>> David
>>
>>
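A minimal sketch of the DDL route David mentions above (assuming an
existing TableEnvironment t_env; the options follow the Flink 1.11 JDBC
connector, and the connection settings are illustrative):

# Create a JDBC sink table via DDL, then write to it from the Table API.
t_env.execute_sql("""
    CREATE TABLE jdbc_sink (
        id BIGINT,
        name STRING
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:mysql://localhost:3306/mydb',
        'table-name' = 'my_table',
        'username' = 'user',
        'password' = 'secret'
    )
""")
t_env.sql_query("SELECT id, name FROM source_table").execute_insert("jdbc_sink")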

Re: [DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-08-10 Thread jincheng sun
Thank you for your positive feedback, Seth!
Would you please vote in the voting mail thread? Thank you!

Best,
Jincheng


Seth Wiesman wrote on Mon, Aug 10, 2020 at 10:34 PM:

> I think this sounds good. +1

[ANNOUNCE] New PMC member: Dian Fu

2020-08-27 Thread jincheng sun
Hi all,

On behalf of the Flink PMC, I'm happy to announce that Dian Fu is now part
of the Apache Flink Project Management Committee (PMC).

Dian Fu has been very active on the PyFlink component, working on various
important features such as Python UDFs and the Pandas integration. He keeps
checking and voting on our releases, has successfully produced two releases
(1.9.3 & 1.11.1) as release manager, and is currently working as release
manager to push forward the release of Flink 1.12.

Please join me in congratulating Dian Fu for becoming a Flink PMC Member!

Best,
Jincheng(on behalf of the Flink PMC)


Re: [ANNOUNCE] Apache Flink 1.8.2 released

2019-09-13 Thread jincheng sun
Thanks for being the release manager and for the great work, Jark :)
Also thanks to the community for making this release possible!

Best,
Jincheng

Jark Wu wrote on Fri, Sep 13, 2019 at 10:07 PM:

> Hi,
>
> The Apache Flink community is very happy to announce the release of Apache
> Flink 1.8.2, which is the second bugfix release for the Apache Flink 1.8
> series.
>
> Apache Flink® is an open-source stream processing framework for
> distributed, high-performing, always-available, and accurate data streaming
> applications.
>
> The release is available for download at:
> https://flink.apache.org/downloads.html
>
> Please check out the release blog post for an overview of the improvements
> for this bugfix release:
> https://flink.apache.org/news/2019/09/11/release-1.8.2.html
>
> The full release notes are available in Jira:
> https://issues.apache.org/jira/projects/FLINK/versions/12345670
>
> We would like to thank all contributors of the Apache Flink community who
> made this release possible!
> Great thanks to @Jincheng for the kind help during this release.
>
> Regards,
> Jark
>


Re: [DISCUSS] Drop Kafka 0.8/0.9

2019-12-04 Thread jincheng sun
+1 for dropping it, and thanks for bringing up this discussion, Chesnay!

Best,
Jincheng

Jark Wu wrote on Thu, Dec 5, 2019 at 10:19 AM:

> +1 for dropping, also cc'ed user mailing list.
>
>
> Best,
> Jark
>
> On Thu, 5 Dec 2019 at 03:39, Konstantin Knauf 
> wrote:
>
> > Hi Chesnay,
> >
> > +1 for dropping. I have not heard from any user using 0.8 or 0.9 for a
> long
> > while.
> >
> > Cheers,
> >
> > Konstantin
> >
> > On Wed, Dec 4, 2019 at 1:57 PM Chesnay Schepler 
> > wrote:
> >
> > > Hello,
> > >
> > > What's everyone's take on dropping the Kafka 0.8/0.9 connectors from
> the
> > > Flink codebase?
> > >
> > > We haven't touched either of them for the 1.10 release, and it seems
> > > quite unlikely that we will do so in the future.
> > >
> > > We could finally close a number of test stability tickets that have
> been
> > > lingering for quite a while.
> > >
> > >
> > > Regards,
> > >
> > > Chesnay
> > >
> > >
> >
> > --
> >
> > Konstantin Knauf | Solutions Architect
> >
> > +49 160 91394525
> >
> >
> > Follow us @VervericaData Ververica 
> >
> >
> > --
> >
> > Join Flink Forward  - The Apache Flink
> > Conference
> >
> > Stream Processing | Event Driven | Real Time
> >
> > --
> >
> > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
> >
> > --
> > Ververica GmbH
> > Registered at Amtsgericht Charlottenburg: HRB 158244 B
> > Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
> > (Tony) Cheng
> >
>


Re: [DISCUSS] What parts of the Python API should we focus on next ?

2019-12-18 Thread jincheng sun
Also CC user-zh.

Best,
Jincheng


jincheng sun wrote on Thu, Dec 19, 2019 at 10:20 AM:

> Hi folks,
>
> As release-1.10 is under feature freeze (the stateless Python UDF is
> already supported), it is time for us to plan the features of PyFlink for
> the next release.
>
> To make sure the features supported in PyFlink are those most demanded by
> the community, we'd like to get more people involved, i.e., it would be
> better if all of the devs and users joined the discussion of which kinds
> of features are more important and urgent.
>
> We have already listed some features from different aspects, which you can
> find below; however, it is not the final plan. We appreciate any
> suggestions from the community, on functionality, performance
> improvements, etc. It would be great to have the following information if
> you want to suggest adding a feature:
>
> -
> - Feature description: 
> - Benefits of the feature: 
> - Use cases (optional): 
> --
>
> Features in my mind (a short code sketch of a few of these follows at the
> end of this message)
>
> 1. Integration with most popular Python libraries
> - fromPandas/toPandas API
>Description:
>   Support to convert between Table and pandas.DataFrame.
>Benefits:
>   Users could switch between Flink and Pandas API, for example, do
> some analysis using Flink and then perform analysis using the Pandas API if
> the result data is small and could fit into the memory, and vice versa.
>
> - Support Scalar Pandas UDF
>Description:
>   Support scalar Pandas UDF in Python Table API & SQL. Both the
> input and output of the UDF is pandas.Series.
>Benefits:
>   1) Scalar Pandas UDF performs better than row-at-a-time UDF,
> ranging from 3x to over 100x (from pyspark)
>   2) Users could use Pandas/Numpy API in the Python UDF
> implementation if the input/output data type is pandas.Series
>
> - Support Pandas UDAF in batch GroupBy aggregation
>Description:
>Support Pandas UDAF in batch GroupBy aggregation of Python
> Table API & SQL. Both the input and output of the UDF is pandas.DataFrame.
>Benefits:
>   1) Pandas UDAF performs more than 10x better than row-at-a-time
> UDAF in certain scenarios
>   2) Users could use Pandas/Numpy API in the Python UDAF
> implementation if the input/output data type is pandas.DataFrame
>
> 2. Fully support all kinds of Python UDFs
> - Support Python UDAF (stateful) in GroupBy aggregation (NOTE: please
> give us some use cases if you want this feature to be included in the next
> release)
>   Description:
> Support UDAF in GroupBy aggregation.
>   Benefits:
> Users could define and use Python UDAF and use it in GroupBy
> aggregation. Without it, users have to use Java/Scala UDAF.
>
> - Support Python UDTF
>   Description:
>   Support Python UDTF in Python Table API & SQL
>   Benefits:
> Users could define and use Python UDTF in Python Table API & SQL.
> Without it, users have to use Java/Scala UDTF.
>
> 3. Debugging and Monitoring of Python UDF
>- Support User-Defined Metrics
>  Description:
>Allow users to define user-defined metrics and global job
> parameters with Python UDFs.
>  Benefits:
>   UDFs need metrics to monitor business or technical indicators,
> which is a common requirement for UDF users.
>
>- Make the log level configurable
>  Description:
>Allow users to config the log level of Python UDF.
>  Benefits:
>Users could configure different log levels when debugging and
> deploying.
>
> 4. Enrich the Python execution environment
>- Docker Mode Support
>  Description:
>  Support running Python UDFs in Docker workers.
>  Benefits:
>  Support various deployments to meet more users' requirements.
>
> 5. Expand the usage scope of Python UDF
>- Support to use Python UDF via SQL client
>  Description:
>  Support to register and use Python UDF via SQL client
>  Benefits:
>  SQL client is a very important interface for SQL users. This
> feature allows SQL users to use Python UDFs via SQL client.
>
>- Integrate Python UDF with Notebooks
>  Description:
>  Such as Zeppelin, etc (Especially Python dependencies)
>
>- Support to register Python UDF into catalog
>   Description:
>   Support to register Python UDF into catalog
>   Benefits:
>   1) The catalog is the centralized place to manage metadata such as
> tables, UDFs, etc. With it, users could register UDFs once and use them
> anywhere.
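
For reference, a minimal sketch of how a few of the items above (the Pandas
conversion, the vectorized Pandas UDF, and user-defined metrics) eventually
surfaced in PyFlink. The exact decorator parameter names vary by release
(e.g. udf_type vs. func_type), so treat this as illustrative rather than
part of the proposal:

import pandas as pd
from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.udf import udf, ScalarFunction

t_env = TableEnvironment.create(
    EnvironmentSettings.new_instance().in_batch_mode().build())

# 1. Table <-> pandas.DataFrame conversion.
pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
table = t_env.from_pandas(pdf)
result_pdf = table.to_pandas()

# 1. Vectorized (Pandas) scalar UDF: inputs and output are pandas.Series.
@udf(result_type=DataTypes.BIGINT(), func_type="pandas")
def pandas_add(a, b):
    return a + b  # element-wise over a whole batch at once

# 3. A user-defined metric registered inside a UDF.
class CountingUpper(ScalarFunction):
    def open(self, function_context):
        # Register a counter once, when the UDF is opened.
        self.calls = function_context.get_metric_group().counter("calls")

    def eval(self, s):
        self.calls.inc()
        return s.upper()

counting_upper = udf(CountingUpper(), result_type=DataTypes.STRING())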
> 

Re: [DISCUSS] What parts of the Python API should we focus on next ?

2019-12-22 Thread jincheng sun
Hi Bowen,

Your suggestions are very helpful for expanding the PyFlink ecosystem. I
also mentioned notebook integration above; Jupyter and Zeppelin are both
excellent notebooks. Integrating with Jupyter and Zeppelin also requires
support from people in those communities. Jeff has already made great
efforts for PyFlink in the Zeppelin community. I would greatly appreciate
it if anyone active in the Jupyter community were also willing to help
integrate PyFlink.

Best,
Jincheng


Bowen Li wrote on Fri, Dec 20, 2019 at 12:55 AM:

> - integrate PyFlink with Jupyter notebook
>- Description: users should be able to run PyFlink seamlessly in Jupyter
>- Benefits: Jupyter is the industry-standard notebook for data
> scientists. I’ve talked to a few companies in North America; they think
> Jupyter is the #1 way to empower internal data scientists with Flink

[ANNOUNCE] Dian Fu becomes a Flink committer

2020-01-16 Thread jincheng sun
Hi everyone,

I'm very happy to announce that Dian accepted the offer of the Flink PMC to
become a committer of the Flink project.

Dian Fu has been contributing to Flink for many years and has played an
essential role in the PyFlink/CEP/SQL/Table API modules. He has contributed
several major features, reported and fixed many bugs, spent a lot of time
reviewing pull requests, frequently helps out on the user mailing lists,
and checks and votes on releases.

Please join me in congratulating Dian for becoming a Flink committer!

Best,
Jincheng(on behalf of the Flink PMC)


Re: [ANNOUNCE] Dian Fu becomes a Flink committer

2020-01-16 Thread jincheng sun
Congrats Dian Fu and welcome on board!

Best,
Jincheng

Shuo Cheng wrote on Thu, Jan 16, 2020 at 6:22 PM:

> Congratulations, Dian Fu!
>


[DISCUSS] Upload the Flink Python API 1.9.x to PyPI for user convenience.

2020-02-03 Thread jincheng sun
Hi folks,

I have been very happy to receive several user inquiries about the use of
the Flink Python API (PyFlink) recently. One of the more common questions
is whether it is possible to install PyFlink without building from source.
The most convenient and natural way for users is `pip install apache-flink`.
We originally planned to support `pip install apache-flink` starting with
Flink 1.10; the reason was that when Flink 1.9 was released on August 22,
2019 [1], Flink's PyPI account was not yet ready. Our PyPI account has been
available since October 9, 2019 [2] (only PMC members can access it), so
for the convenience of users I propose:

- Publish the latest release version (1.9.2) of Flink 1.9 to PyPI.
- Update the Flink 1.9 documentation to add support for `pip install`.

As we all know, Flink 1.9.2 was just released on January 31, 2020 [3].
There are still at least one or two months before the release of 1.9.3, so
this proposal is made entirely from the perspective of user convenience.
Although the proposed work is not large, we have no precedent for an
independent release of the Flink Python API (PyFlink) in our previous
release processes. I hereby initiate this discussion and look forward to
your feedback!

Best,
Jincheng

[1]
https://lists.apache.org/thread.html/4a4d23c449f26b66bc58c71cc1a5c6079c79b5049c6c6744224c5f46%40%3Cdev.flink.apache.org%3E
[2]
https://lists.apache.org/thread.html/8273a5e8834b788d8ae552a5e177b69e04e96c0446bb90979444deee%40%3Cprivate.flink.apache.org%3E
[3]
https://lists.apache.org/thread.html/ra27644a4e111476b6041e8969def4322f47d5e0aae8da3ef30cd2926%40%3Cdev.flink.apache.org%3E


[VOTE] Release Flink Python API(PyFlink) 1.9.2 to PyPI, release candidate #1

2020-02-10 Thread jincheng sun
Hi everyone,

Please review and vote on release candidate #1 for PyFlink version
1.9.2, as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:

* the official Apache binary convenience releases to be deployed to
dist.apache.org [1], which are signed with the key with fingerprint
8FEA1EE9D0048C0CCC70B7573211B0703B79EA0E [2] and built from source code [3].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Jincheng

[1] https://dist.apache.org/repos/dist/dev/flink/flink-1.9.2-rc1/
[2] https://dist.apache.org/repos/dist/release/flink/KEYS
[3] https://github.com/apache/flink/tree/release-1.9.2


Re: [VOTE] Release Flink Python API(PyFlink) 1.9.2 to PyPI, release candidate #1

2020-02-10 Thread jincheng sun
+1 (binding)

- Install the PyFlink by `pip install` [SUCCESS]
- Run word_count in both command line and IDE [SUCCESS]

Best,
Jincheng



Wei Zhong wrote on Tue, Feb 11, 2020 at 11:17 AM:

> Hi,
>
> Thanks for driving this, Jincheng.
>
> +1 (non-binding)
>
> - Verified signatures and checksums.
> - Verified README.md and setup.py.
> - Run `pip install apache-flink-1.9.2.tar.gz` in Python 2.7.15 and Python
> 3.7.5 successfully.
> - Start local pyflink shell in Python 2.7.15 and Python 3.7.5 via
> `pyflink-shell.sh local` and try the examples in the help message, run well
> and no exception.
> - Try a word count example in IDE with Python 2.7.15 and Python 3.7.5, run
> well and no exception.
>
> Best,
> Wei


Re: [VOTE] Release Flink Python API(PyFlink) 1.9.2 to PyPI, release candidate #1

2020-02-12 Thread jincheng sun
Hi folks,

Thanks everyone for voting. I'm closing the vote now and will post the
result as a separate email.

Best,
Jincheng


Xingbo Huang wrote on Thu, Feb 13, 2020 at 9:28 AM:

> +1 (non-binding)
>
> - Install the PyFlink by `pip install` [SUCCESS]
> - Run word_count.py [SUCCESS]
>
> Thanks,
> Xingbo
>
> Becket Qin wrote on Wed, Feb 12, 2020 at 2:28 PM:
>
>> +1 (binding)
>>
>> - verified signature
>> - Ran word count example successfully.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>> On Wed, Feb 12, 2020 at 1:29 PM Jark Wu  wrote:
>>
>>> +1
>>>
>>> - checked/verified signatures and hashes
>>> - Pip installed the package successfully: pip install
>>> apache-flink-1.9.2.tar.gz
>>> - Run word count example successfully through the documentation [1].
>>>
>>> Best,
>>> Jark
>>>
>>> [1]:
>>>
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/tutorials/python_table_api.html
>>>
>>> On Tue, 11 Feb 2020 at 22:00, Hequn Cheng  wrote:
>>>
>>> > +1 (non-binding)
>>> >
>>> > - Check signature and checksum.
>>> > - Install package successfully with Pip under Python 3.7.4.
>>> > - Run wordcount example successfully under Python 3.7.4.
>>> >
>>> > Best, Hequn
>>> >
>>> > On Tue, Feb 11, 2020 at 12:17 PM Dian Fu 
>>> wrote:
>>> >
>>> > > +1 (non-binding)
>>> > >
>>> > > - Verified the signature and checksum
>>> > > - Pip installed the package successfully: pip install
>>> > > apache-flink-1.9.2.tar.gz
>>> > > - Run word count example successfully.
>>> > >
>>> > > Regards,
>>> > > Dian


[ANNOUNCE] Apache Flink Python API(PyFlink) 1.9.2 released

2020-02-12 Thread jincheng sun
Hi everyone,

The Apache Flink community is very happy to announce the release of Apache
Flink Python API(PyFlink) 1.9.2, which is the first release to PyPI for the
Apache Flink Python API 1.9 series.

Apache Flink® is an open-source stream processing framework for
distributed, high-performing, always-available, and accurate data streaming
applications.

The release is available for download at:

https://pypi.org/project/apache-flink/1.9.2/#files

Or installed using pip command:

pip install apache-flink==1.9.2

We would like to thank all contributors of the Apache Flink community who
helped to verify this release and made this release possible!

Best,
Jincheng


Re: [PyFlink] Error writing data to Kafka in Csv() format, and failure to start Python UDFs

2020-03-20 Thread jincheng sun
Hi Zhefu,

How did you install Beam, and which version? Flink 1.10 requires
apache-beam 2.15, while master requires apache-beam 2.19.

BTW, to share your question, I have forwarded it to the Chinese user
mailing list so we can all discuss it together.

Best,
Jincheng


Zhefu PENG wrote on Fri, Mar 20, 2020 at 5:12 PM:

> Hi Jincheng,
>
> Regarding the second point you mentioned yesterday, I tried two
> approaches: 1. running my pyflink.py in local mode; 2. configuring the
> environment on the cluster nodes.
>
> 1. In local mode, we found that the apache-beam package had not been fully
> installed for some reason; the _bz2 module was missing. After adding it,
> the pyflink.py script ran normally and the UDF functionality worked too.
>
> 2. Building on that success, we followed your advice and updated the
> environment on all worker nodes to Python 3.6, with apache-beam and
> apache-flink installed. However, running a script with UDFs in cluster
> mode still fails, and searching Google turned up no relevant answer. The
> error log is attached in the hope of getting some help (since local mode
> already works, I'm not attaching the code). Thanks a lot!
>
> Looking forward to your reply,
> Zhefu
>
>
> Zhefu PENG wrote on Thu, Mar 19, 2020 at 11:14 PM:
>
>> Hi Jincheng:
>>
>> Thank you very much for such a quick and detailed reply!
>>
>> On the first point: following your reply, I added
>> flink-csv-1.10.0-sql-jar.jar to Flink's lib directory and the job ran
>> successfully. I had already seen and downloaded the first jar from the
>> Kafka demo (based on Flink 1.9) in your blog, so here is a suggestion:
>> perhaps you could note for future users that flink-csv-1.10.0-sql-jar.jar
>> is also required :) Though it is also possible that I simply missed it
>> while reading. In any case, thanks for helping to solve it;
>>
>> On the second point: due to how things are organized in my department, I
>> don't have permission to inspect the worker nodes right now, but I am
>> curious about one thing: I have only installed Python 3.5+ on the host
>> that submits the script, and everything except UDFs works fine (e.g. SQL
>> built-ins such as concat, or simple features like add_columns()). So is
>> my understanding correct that to use all of PyFlink's features, the whole
>> cluster environment must be Python 3.5+, while for simple features it is
>> enough that the submitting host meets the requirement?
>> Still on the second point: I just re-ran the script, hoping to get the
>> same error log as before to send to my mentor, but this time a new error
>> appeared:
>> java.lang.NoClassDefFoundError: Could not initialize class
>> org.apache.beam.sdk.options.PipelineOptionsFactory
>> at
>> org.apache.flink.python.AbstractPythonFunctionRunner.open(AbstractPythonFunctionRunner.java:173)
>> at
>> org.apache.flink.table.runtime.operators.python.AbstractPythonScalarFunctionOperator$ProjectUdfInputPythonScalarFunctionRunner.open(AbstractPythonScalarFunctionOperator.java:193)
>> at
>> org.apache.flink.streaming.api.operators.python.AbstractPythonFunctionOperator.open(AbstractPythonFunctionOperator.java:139)
>> at
>> org.apache.flink.table.runtime.operators.python.AbstractPythonScalarFunctionOperator.open(AbstractPythonScalarFunctionOperator.java:143)
>> at
>> org.apache.flink.table.runtime.operators.python.PythonScalarFunctionOperator.open(PythonScalarFunctionOperator.java:73)
>> at
>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeStateAndOpen(StreamTask.java:1007)
>> at
>> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:454)
>> at
>> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
>>
>>
>> I suspect the cause is, as you mentioned, that the worker environment
>> does not meet the requirements. I will ask colleagues to help check
>> tomorrow and try the changes you suggested. I would also appreciate an
>> answer to my question above; I just graduated and started working, so my
>> questions may seem rather basic, sorry about that!
>>
>> Thanks again for your reply; I will fix the errors as soon as possible
>> based on your suggestions.
>> Zhefu
>>
>> jincheng sun wrote on Thu, Mar 19, 2020 at 9:08 PM:
>>
>>> Hi Zhefu,
>>>
>>> The possible causes of your problems above are:
>>>
>>> 1. PyFlink does not include the Kafka connector jar or the CSV format
>>> jar by default; you need to add these jars to PyFlink's lib directory:
>>>
>>> $ PYFLINK_LIB=`python -c "import pyflink;import
>>> os;print(os.path.dirname(os.path.abspath(pyflink.__file__))+'/lib')"`
>>> $ cd $PYFLINK_LIB
>>> $ curl -O
>>> https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka_2.11/1.10.0/flink-sql-connector-kafka_2.11-1.10.0.jar
>>> $ curl -O
>>> https://repo1.maven.org/maven2/org/apache/flink/flink-csv/1.10.0/flink-csv-1.10.0-sql-jar.jar
>>>
>>> 2. Another possible cause is that the workers do not have Python 3+
>>> installed, or apache-beam is not installed in their environment. On the
>>> worker machines, try running `python --version` to check the Python
>>> version and `pip list` to check whether apache-beam is present; if not,
>>> run: python -m pip install apache-flink
>>>
>>> Hope this helps; if you have further questions, let's keep in touch.
>>>
>>> Best,
>>> Jincheng
>>>
>>>
>>>
>>> Zhefu PENG wrote on Thu, Mar 19, 2020 at 8:13 PM:
>>>
>>>> Hello,
>>>>
>>>> I came across your blog online and deeply admire your work on
>>>> developing and promoting PyFlink. Our department has recently been
>>>> investigating Flink for business needs, and I wrote a simple demo for
>>>> testing, but ran into two problems (the second is currently the bigger
>>>> difficulty; for the first I adopted a workaround :)):
>>>>
>>>> 1. When I try to write data to Kafka in CSV format, it fails. (From the
>>>> community docs I already learned that Csv() should replace OldCsv(),
>>>> and changed that.) From the error message I suspect a missing jar (as
>>>> with the JSON format before), but another document says the CSV format
>>>> should be built-in. For now I work around it by writing in JSON format.
>>>>
>>>> The error message is as follows:
>>>>
>>>> py4j.protocol.Py4JJavaError: An error occurred while calling
>>>> o62.insertInto.
>>>> : org.apache.flink.table.api.NoMatchingTableFactoryException: Could not
>>>> find a suitable table factory for
>>>> 'org.apache.flink.table.factories.SerializationSchemaFactory' in
>>>> the classpath.
>>>>
>>>> Reason: No factory supports all properties.
>>>>
>>>> The matching candidates:
>>>> org.apache.flink.formats

Re: [PyFlink] Error writing data to Kafka in Csv() format, and failure to start Python UDFs

2020-03-24 Thread jincheng sun
Hi Zhefu,

Thank you for sharing the details of how you solved the problems; this is a
great contribution to the community!

1. On the subscription issue

I'd like to confirm whether you followed [1]. Taking the Chinese user list
(user-zh@flink.apache.org) as an example, you need to send an email to
user-zh-subscr...@flink.apache.org, i.e. add "subscribe" to the original
address. You will then receive a confirmation email "confirm subscribe to
*user-zh*@flink.apache.org", which you must reply to in order to confirm.

2. On the JAR conflict issue

The flink-python JAR bundles the core JARs of flink-python's Beam
dependency. I'd like to understand why the Beam-related JARs exist on your
cluster in the first place. Also, the case you provided is very good; it
made me realize we could optimize PyFlink's dependency on Beam, e.g. by
relocating Beam. I have created a community improvement JIRA [2].

3. On the packaging issue

The Python environment archive uploaded to PyFlink needs to be portable
across machines, so it indeed must not contain symlinks. If the environment
is created with virtualenv, you need to add the --always-copy option. Also,
if the cluster machines already have a ready Python 3.5+ environment, you
don't need to upload an environment archive at all; just use
set_python_executable("python3") to specify the Python interpreter for the
cluster. (A packaging sketch follows below.)
Besides virtualenv, conda/miniconda can also be used to create virtual
environments; they are much larger, but worth considering when virtualenv
doesn't work for some reason (e.g. an .so file that the source Python
interpreter depends on doesn't exist on the cluster).

Thanks again for sharing the details of your solution and for the feedback!
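
A minimal sketch of the packaging route in point 3 (PyFlink 1.10-style
Table API; paths and names are illustrative, and t_env is an existing
TableEnvironment):

# Shell side, run once before submitting:
#   pip install virtualenv
#   virtualenv --always-copy venv   # --always-copy keeps symlinks out of the archive
#   venv/bin/pip install apache-flink
#   zip -r venv.zip venv
#
# Python side: ship the archive and point the cluster at its interpreter.
t_env.add_python_archive("venv.zip")
t_env.get_config().set_python_executable("venv.zip/venv/bin/python")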

Best,
Jincheng

[1]
https://flink.apache.org/community.html#how-to-subscribe-to-a-mailing-list
[2] https://issues.apache.org/jira/browse/FLINK-16762


Zhefu PENG wrote on Tue, Mar 24, 2020 at 9:33 PM:

> Hi Jincheng,
>
> I can't reply to my own question on the Chinese user mailing list (I'm
> quite sure I subscribed to the mailing list, but I don't know why or how),
> so I'm replying here. With the help of colleagues and our joint effort, we
> have tentatively resolved the earlier questions. Feedback as follows:
>
> 1. First, the JAR conflict: we found that the flink-python JAR also
> contains the code of these four jars: beam-runners-core-java-2.15.0.jar,
> beam-runners-direct-java-2.15.0.jar, beam-runners-flink_2.11-2.15.0.jar,
> beam-sdk-java-core-2.15.0.jar. If the cluster environment itself also has
> these four jars, a conflict arises and the dependencies cannot be found
> when running UDFs.
>
> 2. For various reasons, the cluster's default Python environment could not
> be changed to Python 3, so we added env.add_python_archive in the code and
> used set_python_executable to specify the interpreter. The archive must be
> built without any symlinks (symlink lookups may not work after
> extraction); my earlier packaging was wrong, which caused the problem.
>
> These two points are the causes we identified, and we can finally run UDFs
> successfully in cluster mode. Hope our feedback contributes a little to
> your development work and user guidance. Thanks again for the help :)
>
> Best,
> Zhefu
>
> Zhefu PENG wrote on Fri, Mar 20, 2020 at 11:19 PM:
>
>>
>> Thanks for the reply! Beam was installed directly via pip install; when
>> running pip install apache-flink I noticed it has many dependencies and
>> requires beam 2.15.0, so that is what I installed. The Flink version is
>> 1.10, not compiled from source. So we are quite puzzled now and hope to
>> get some help.
>>
>> One more small question: it seems you didn't forward the log to the
>> mailing list. I'd like to attach it; after subscribing, can I just reply
>> and paste the log there?
>>
>> Have a nice weekend :) Thanks a lot for the reply!
>>
>> Zhefu
>>
>> On Fri, Mar 20, 2020 at 22:34 jincheng sun 
>> wrote:
>>
>>> 彭哲夫,你好:
>>>
>>> 你是如何安装的beam,安装的版本是多少?如果是1.10 需要apache-beam 2.15,如果是 master需要apache-beam
>>> 2.19.
>>>
>>> BTW,  为了共享你的问题,我将你的问题发到了中文用户列表里面,我们大家一起讨论。
>>>
>>> Best,
>>> Jincheng
>>>
>>>
>>> Zhefu PENG  于2020年3月20日周五 下午5:12写道:
>>>
>>>> Hi Jincheng,
>>>>
>>>> 针对昨天提到的第二点,我做了两个思路:1. 将我的pyflink.py以本地模式运行,2.对集群节点进行环境配置
>>>>
>>>> 1.
>>>> 在本地模式中,我们发现在装apache-beam包的时候出于某些原因没有装全,少了_bz2模块,再补充上之后,pyflink.py的脚本可以正常运行,其中的udf功能也能正常使用。
>>>>
>>>> 2.
>>>> 在本地模式运行成功的基础上,我们根据你的建议,对所有的worker节点进行了环境的更新,都更新到了python3.6以及安装了apache_beam和apache-flink.
>>>> 但是以集群模式运行带有udf功能的脚本仍然报错,尝试谷歌搜索以后也没有搜到相关解答,在附件附上错误日志,希望能得到帮助(因为本地模式已经成功所以就不附带代码了),非常感谢!
>>>>
>>>> 期待您的回复
>>>> 彭哲夫
>>>>
>>>>
>>>> Zhefu PENG  于2020年3月19日周四 下午11:14写道:
>>>>
>>>>> Hi Jincheng:
>>>>>
>>>>> 非常感谢你如此迅速而细致的回复!~
>>>>>
>>>>> 关于第一点:根据你的回复,我在flink的lib目录下增加flink-csv-1.10.0-sql-jar.jar包之后,运行成功。而第一个包我在之前浏览你博客中关于kafka的使用的demo(based
>>>>> on flink 1.9)中有看到并下载,因此这里有个提议,或许你未来可以对于后续使用者补充
>>>>> flink-csv-1.10.0-sql-jar.jar包的使用的必要性 :),但也有可能是我在查询学习时看漏了,但不管怎么说感谢你的帮助解决;
>>>>>
>>>>> 关于第二点:因为部门相关组织安排问题,我现在没有权限去worker节点上查询,但是针对这一点我有个好奇的地方:我目前只在启动脚本的主机上安装了python3.5+,
>>>>> 并且除了udf功能外,我都能正常使用(比如sql本身就有的concat之类,或者add_columns()这种简单功能)。所以是不是我理解为,如果要使用pyflink的全部功能,应该是集群的环境都要是python3.5+?
>>>>> 但是简单的功能,只要启动脚本的主机环境符合就够了?
>>>>> 还是关于第二点,我刚刚又重新跑了一下脚本,本来是希望能获得和之前一样的错误日志发给我的mentor,但是发现这次报了新的问题:
>>>>> java.lang.NoClassDefFoundError: Could not initialize class
>>>>> org.apache.beam.sdk.options.PipelineOptionsFactory
>>>>> at
>>>>> org.apache.flink.python.AbstractPythonFunctionRunner.open(AbstractPythonFunctionRunner.java:173)
>>>>> at
>>>>> org.apache.flink.table.runtime.operators.python.AbstractPythonScalarFunctionOperator$ProjectUdfInputPythonScalarFunctionRunner.open(AbstractPythonScalarFunctionOperator.java:193)
>>>>> at
>>>>> org.apache.flink.streaming.api.operators.python.AbstractPythonFunctionOperator.open(AbstractPythonFunctionOperator.java:139)
>>>>> at
>>>>> org.apache.flink.table.runtime.operators.python.AbstractPythonScalarFunctionOperator.open(AbstractPythonScalarFunctionOperator.java:143)
>>>>> at
>>>>> org.apache.flink.table.runtime.operators.python.PythonScalarFunctionOperator.open(PythonScalarFunctionOperator.java:7

Re: Asking for advice on running PyFlink on Windows (my first contact with Flink)

2020-03-24 Thread jincheng sun
Glad to receive your email. I'm not sure which video you watched, so it is
hard to determine the cause of your problem. You can refer to the official
documentation [1]; my personal blog also has some 3-minute video and
written beginner tutorials [2] that you can check first. If you still have
problems, we can keep communicating by email, or you can leave a comment on
my blog!

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/tutorials/python_table_api.html
[2] https://enjoyment.cool/

Best,
Jincheng



xu1990xaut wrote on Tue, Mar 24, 2020 at 11:36 PM:

> Hello, I previously watched your videos on Bilibili and followed along
> hands-on. I'm using Flink 1.10; running pip install apache-flink installs
> 1.10 by default. When I run the word-count script in PyCharm, it never
> produces a result and reports no error. What could be the reason? I have
> also installed the JDK, and the Flink web UI on port 8081 comes up fine.
> This is my first contact with Flink; I searched for this problem online
> but never found an answer. Could you please give me some pointers? Thank
> you.
>
>
>
>


Re: Re: Asking for advice on running PyFlink on Windows (my first contact with Flink)

2020-03-24 Thread jincheng sun
The source code of the word_count example in that video should be this:
https://github.com/sunjincheng121/enjoyment.code/blob/master/myPyFlink/enjoyment/word_count.py
After it finishes, the result should have been written to the file
sink_file = 'sink.csv'. You can print this file's path and check its
contents.

Also, if you just want to get started, I suggest reading [1][2]; for an
overview of the latest state of PyFlink, see [3].

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/python/installation.html
[2]
https://enjoyment.cool/2020/01/22/Three-Min-Series-How-PyFlink-does-ETL/#more
[3]
https://www.bilibili.com/video/BV1W7411o7Tj?from=search&seid=14518199503613218690

Best,
Jincheng
-
Twitter: https://twitter.com/sunjincheng121
-


xu1990xaut wrote on Wed, Mar 25, 2020 at 2:23 PM:

> Hello Mr. Sun, the video I watched online is "[Apache Flink Advanced
> Tutorial] Lesson 14: The Current State and Future Plans of the Apache
> Flink Python API". Today I also tried in a virtual machine and it still
> doesn't run. I'm using Flink 1.10 and Python 3.6. Please give me some
> pointers.


Re: Re: Re: Re: Asking for advice on running PyFlink on Windows (my first contact with Flink)

2020-03-26 Thread jincheng sun
The first line of the error message says bash is not installed?

xu1990xaut wrote on Thu, Mar 26, 2020 at 12:12 PM:

> Mr. Sun, I installed the Flink package following the method in your video,
> but running the demo you provided produces the error below. I searched
> online for a long time without solving it. Hope you can give me some more
> pointers.
>
>
>
>
>
> On 2020-03-25 15:47:49, "jincheng sun" wrote:
>
> Oh, PyFlink does not support Windows at the moment.
>
> Best,
> Jincheng
> -
> Twitter: https://twitter.com/sunjincheng121
> -
>
>
> xu1990xaut wrote on Wed, Mar 25, 2020 at 2:55 PM:
>
>> Thank you, Mr. Sun. That is exactly the example I am using. I also
>> noticed there are two Flink packages under Python: one is import flink,
>> the other is import pyflink. Is it that pyflink cannot run on Windows?
>> I am sure the Python Flink package is installed correctly.
>> When running Flink I also started start-cluster.bat (start-cluster.sh),
>> but the PyCharm console produces no result for a long time, and CPU usage
>> looks normal. I really don't know where the problem is.


Re: Questions about flink1.10 & pyflink

2020-03-26 Thread jincheng sun
From your error log, the example you ran uses PyUDFDemoConnector, i.e. it
follows the blog post [1]. When that post was written, 1.10 had not been
released yet, and the interfaces changed after the release, so
PyUDFDemoConnector had a problem, which I fixed a couple of days ago. You
can update the JAR.

Also, the problem you reported is an old one that was already fixed before
the 1.10 release [2], so please update the connector and test again. Keep
in touch if there are more problems.

Best,
Jincheng
[1]
https://enjoyment.cool/2019/12/05/Apache-Flink-%E8%AF%B4%E9%81%93%E7%B3%BB%E5%88%97-%E5%A6%82%E4%BD%95%E5%9C%A8PyFlink-1-10%E4%B8%AD%E8%87%AA%E5%AE%9A%E4%B9%89Python-UDF/
[2] https://issues.apache.org/jira/browse/FLINK-14581


zilong xiao wrote on Wed, Mar 25, 2020 at 12:19 PM:

> Yes, there is a key step, `source py36/bin/activate`, that is not
> reflected in the documentation; after executing it, submitting to the YARN
> cluster works normally. Recently I have been digging further into 1.10's
> UDF support, and when trying to submit a UDF job, the following exception
> appears:
>
> Caused by: java.io.IOException: Cannot run program
> "xxx/pyflink-udf-runner.sh": error=2, No such file or directory
>
> The steps before submitting the job are as follows:
> 1.pip install virtualenv
> 2.virtualenv --always-copy venv
> 3.venv/bin/pip install apache-beam==2.15.0
> 4.venv/bin/pip install apache-flink
> 5.venv/bin/pip install pydemo.tar.gz
> 6.zip -r venv.zip venv
> 7.bin/flink run -pyarch venv.zip -pyexec venv.zip/venv/bin/python -py
> ./word_count_socket.py -j pydemo.jar
>
> Have you encountered a similar situation before?
>
> The full exception stack & the job are attached
>
> jincheng sun wrote on Thu, Mar 19, 2020 at 12:08 PM:
>
>> Happy to see you using PyFlink 1.10. The core problems you hit and their
>> solutions are as follows:
>>
>> 1. Changing what the python command points to via a shell alias does not
>> work, because Flink does not start the Python process through the shell.
>> So as far as Flink is concerned, the local Python environment is still
>> Python 2.
>> 2. You can create a Python 3.5+ environment with tools such as virtualenv
>> or conda, activate it, and submit the Python job from the activated
>> environment. For example:
>>   pip install virtualenv
>>   virtualenv --python /usr/local/bin/python3 py36
>>   source py36/bin/activate
>>   flink run -py pyflink.py
>> 3. Alternatively, you can change the python command's symlink to point to
>> Python 3.5+.
>>
>> Give it a try, and feel free to email anytime with questions!
>>
>> Best,
>> Sun Jincheng (Jinzhu)
>>
>>
>>
>> zilong xiao wrote on Wed, Mar 18, 2020 at 12:14 PM:
>>
>>> Hi Jinzhu, I'm an IT engineer working on real-time computing. I recently
>>> hit a problem while using flink1.10 & pyflink, and I hope to add you on
>>> DingTalk or get other contact info to discuss it further. The problem is
>>> roughly as follows:
>>>
>>> Job submission environment:
>>> Apache-beam: 2.15.0
>>> Local python: 2.7 (python3.7 configured by modifying ~/.zshrc with alias
>>> python='/usr/local/bin/python3.7')
>>> pip: 20.0.2
>>> flink: 1.10
>>>
>>> Submission command: bin/flink run -pyarch tmp/venv.zip -pyexec
>>> tmp/venv.zip/venv/bin/python3 -py word_count.py
>>>
>>> When deploying the job locally in per-job mode, the following error
>>> appears and the submission fails:
>>>
>>> RuntimeError: Python versions prior to 3.5 are not supported for PyFlink
>>> [sys.version_info(major=2, minor=7, micro=16, releaselevel='final',
>>> serial=0)].
>>>
>>>
>>> Obviously, as the official Flink docs say, Flink 1.10 jobs require
>>> Python 3.5+. I used -pyarch and -pyexec to specify the execution
>>> environment and the interpreter, but these two options seem to have no
>>> effect: the exception above still occurs. I followed your post exactly:
>>> https://enjoyment.cool/2020/01/02/Apache-Flink-%E8%AF%B4%E9%81%93%E7%B3%BB%E5%88%97-PyFlink-%E4%BD%9C%E4%B8%9A%E7%9A%84%E5%A4%9A%E7%A7%8D%E9%83%A8%E7%BD%B2%E6%A8%A1%E5%BC%8F/#more
>>> I wonder whether I am using it wrong, or whether there is a hidden
>>> problem with these options. There is not much material online, so I hope
>>> to discuss this with you and resolve the confusion. Looking forward to
>>> your reply.
>>>
>>


Re: [udf questions]

2020-03-26 Thread jincheng sun
One obvious problem is with the UDF definition: your sink table defines a
STRING column, but your SQL SELECTs the result of the UDF, so the UDF
should also return a STRING, i.e. result_type=DataTypes.STRING(). Also make
sure your UDF really returns a string, not a JSON object.
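
A minimal sketch of such a UDF (PyFlink 1.10 decorator style; the field
name "event" and the function name parse_message are illustrative, not
from the original job):

import json
from pyflink.table import DataTypes
from pyflink.table.udf import udf

@udf(input_types=[DataTypes.STRING()], result_type=DataTypes.STRING())
def parse_message(message):
    # Parse the nested JSON string and return a plain str, so the result
    # matches the STRING column in the sink table.
    try:
        return json.loads(message).get("event", "")
    except Exception:
        return ""

# Usage: t_env.register_function("parse_message", parse_message), then
# SELECT parse_message(message) ... in the query.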

Best,
Jincheng


WuPangang wrote on Thu, Mar 26, 2020 at 5:24 PM:

> Data as below:
>  
> {"host":"172.25.69.145","@timestamp":"2020-03-23T09:21:16.315Z","@version":"1","offset":532178261,"logmark":"advertise-api","type":"squirrel","path":"/home/logs/app/action_log/advertise_api/AdStatistic.2020-03-23.log","message":"{\"date_time\":\"2020-03-23
> 17:21:15\",\"serverTime\":1584955275,\"currentTime\":\"1584955274\",\"distinct_id\":\"734232168\",\"device_id\":\"ad2c15b7cf910cc6\",\"event\":\"view_material\",\"click_pos\":\"ad_view_avatar\",\"material_name\":\"\\u5934\\u50cf\\u66dd\\u5149\",\"phase\":\"prod\",\"source\":\"phone\",\"client_v\":\"2.100\",\"platform\":\"android\",\"ip\":\"39.84.23.81\",\"network\":\"WIFI\",\"idfa\":\"867179032526091\",\"unique_device_id\":\"ad2c15b7cf910cc6\",\"manufacturer\":\"HUAWEI\",\"model\":\"PRA-AL00X\",\"carrier\":\"\",\"local_dns\":\"0.0,0.0\",\"user_id\":\"734232168\",\"is_login\":1,\"url_path\":\"http:\\/\\/
> down-ddz.734399.com\\/download\\/shuabao\\/shuabao10-awddz-release-hl.apk\",\"country\":\"\",\"province\":\"\",\"city\":\"\",\"setup_source\":\"androiddefault\",\"seller_id\":\"139091\",\"plan_id\":\"380141\",\"plan_name\":\"HL-10-12.27-\\u5927\\u80f8-\\u6e38\\u620f\",\"put_platform_id\":\"25\",\"put_ad_type_id\":\"27\",\"put_position_id\":\"0\",\"put_sell_type_id\":\"0\",\"material_id\":\"378120\",\"show_id\":\"SMALL_VIDEO_6948314\",\"put_start_time\":\"1577808000\",\"plan_params\":\"advertise-380141-378120\",\"download_app_name\":\"\\u7231\\u73a9\\u6597\\u5730\\u4e3b\",\"ad_type\":\"0\",\"put_source\":\"1\",\"is_ad_recommend\":\"\",\"third_video_url\":\"\",\"third_img_url\":\"\",\"ua\":\"JuMei\\/
> (PRA-AL00X; Android; Android OS ; 8.0.0; zh)
> ApacheHttpClient\\/4.0\",\"platform_v\":\"2.100\",\"played_time\":\"\",\"video_time\":\"\",\"status\":1,\"target_link\":\"http:\\/\\/
> down-ddz.734399.com
> \\/download\\/shuabao\\/shuabao10-awddz-release-hl.apk\",\"ad_material_title\":\"\\u5de5\\u8d44\\u6ca13W\\uff0c\\u5c31\\u73a9\\u8fd9\\u6597\\u5730\\u4e3b\\uff0c\\u6bcf\\u5929\\u90fd\\u80fd\\u9886\\u7ea2\\u5305~\",\"ad_material_desc\":\"\\u73a9\\u6597\\u5730\\u4e3b\\uff0c\\u7ea2\\u5305\\u79d2\\u4f53\\u73b0\",\"icon_url\":\"http:\\/\\/
> p12.jmstatic.com
> \\/adv\\/\\/material\\/20190920\\/jmadv5d849b203750f.png\",\"button_text\":\"\\u7acb\\u5373\\u4e0b\\u8f7d\",\"material_type\":\"video\",\"third_app_id\":\"\",\"third_pos_id\":\"\",\"download_apk\":\"\",\"package_name\":\"\",\"h5_url\":\"\",\"message\":\"\",\"unit_price\":\"4.15\",\"balance_type\":\"3\",\"ecpm\":\"15.189\",\"accumulate_ctr\":\"0.366\",\"pre_ctr\":\"0.366\",\"real_unit_price\":\"0\",\"pre_charge\":\"0\",\"pre_change\":\"0\",\"subsidy_type\":\"\",\"subsidy_ratio\":\"\",\"suppress_type\":\"0\",\"suppress_ratio\":\"0\",\"two_ecpm\":\"15.165\",\"real_ecpm\":\"0\",\"real_unit_price_threshold\":\"0\",\"plan_accumulate_pv\":\"0\",\"plan_accumulate_cpv\":\"0\",\"plan_accumulate_change\":\"0\",\"request_id\":\"ad2c15b7cf910cc6-1584955121.768552967\",\"ad_activity_type\":\"0\",\"one_cost\":\"4.15\",\"bid\":\"0.366\",\"real_spr\":\"0.998419\",\"one_ecpm\":\"15.189\",\"es_one\":\"0,16,\",\"es_two\":\"0\",\"es_three\":\"0\",\"es_four\":\"0\",\"age\":\"1020\",\"sex\":\"0\",\"one_cost_expend\":\"4.15\",\"two_cost_expend\":\"4.153438\",\"reward_source\":\"\",\"provide\":\"\"}","topicid":"advertise_module","project":"advertise-api”}
> Problem:
> The data is nested JSON, and the format of the core field `message`
> cannot be handled directly with the Table API's JSON-related methods.
> My idea for a solution: handle it in a UDF using json.loads.
> The issue in practice: after the job is submitted, the dashboard shows
> both bytes received and records received as 0; the downstream sink is
> Kafka, and consuming the downstream Kafka topic yields no data either.
>
> Code as below:
> from pyflink.datastream import
> StreamExecutionEnvironment,CheckpointingMode,TimeCharacteristic
> from pyflink.table import StreamTableEnvironment,
> EnvironmentSettings,TableSink,TableConfig,DataTypes
> from pyflink.table.descriptors import
> Schema,Kafka,Rowtime,ConnectTableDescriptor,CustomConnectorDescriptor,Json
> from pyflink.common import RestartStrategies
> from pyflink.table.udf import udf
> import json
>
> env = StreamExecutionEnvironment.get_execution_environment()
> #env.set_stream_time_characteristic(TimeCharacteristic.EventTime)
> ## Checkpoint settings
> #env.enable_checkpointing(30)
>
> #env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode().EXACTLY_ONCE)
> #env.get_checkpoint_config().set_min_pause_between_checkpoints(3)
> #env.get_checkpoint_config().set_checkpoint_timeout(6)
> #env.get_checkpoint_config().set_max_concurrent_checkpoints(1)
> #env.get_checkpoint_config().set_prefer_checkpoint_for_recovery(False)
> ## Parallelism and restart settings
> env.set_parallelism(12)
> env.set_restart_strategy(RestartStrategies.fixed_delay_restart(50,6))
> ## Use the blink planner API
> environment_settings =
> EnvironmentSettings.new_instance().use_blink_planner().in_streaming_mode().build()
> table_env =
> StreamTableEnvironment

Re: [ANNOUNCE] Apache Flink 1.9.3 released

2020-04-25 Thread jincheng sun
Thanks for your great job, Dian!

Best,
Jincheng


On Sat, Apr 25, 2020 at 8:30 PM Hequn Cheng  wrote:

> @Dian, thanks a lot for the release and for being the release manager.
> Also thanks to everyone who made this release possible!
>
> Best,
> Hequn
>
> On Sat, Apr 25, 2020 at 7:57 PM Dian Fu  wrote:
>
>> Hi everyone,
>>
>> The Apache Flink community is very happy to announce the release of
>> Apache Flink 1.9.3, which is the third bugfix release for the Apache Flink
>> 1.9 series.
>>
>> Apache Flink® is an open-source stream processing framework for
>> distributed, high-performing, always-available, and accurate data streaming
>> applications.
>>
>> The release is available for download at:
>> https://flink.apache.org/downloads.html
>>
>> Please check out the release blog post for an overview of the
>> improvements for this bugfix release:
>> https://flink.apache.org/news/2020/04/24/release-1.9.3.html
>>
>> The full release notes are available in Jira:
>> https://issues.apache.org/jira/projects/FLINK/versions/12346867
>>
>> We would like to thank all contributors of the Apache Flink community who
>> made this release possible!
>> Also great thanks to @Jincheng for helping finalize this release.
>>
>> Regards,
>> Dian
>>
>


Re: Python UDF from Java

2020-04-30 Thread jincheng sun
Thanks Flavio and thanks Marta,

That's a good question, as many users want to know about it!
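
For reference, FLIP-106 proposes registering Python UDFs through SQL DDL,
which would let a Java Table API/SQL job call them. A hedged sketch of that
DDL, issued here through the Python entry point for illustration (the
module path 'my_udfs.py_upper' is made up, and the exact API for submitting
DDL may vary by version):

# The same statement could equally be issued from a Java TableEnvironment.
t_env.execute_sql(
    "CREATE TEMPORARY SYSTEM FUNCTION py_upper "
    "AS 'my_udfs.py_upper' LANGUAGE PYTHON"
)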

CC to user-zh mailing list :)

Best,
Jincheng
-
Twitter: https://twitter.com/sunjincheng121
-


On Fri, May 1, 2020 at 7:04 AM Flavio Pompermaier  wrote:

> Yes, that's awesome! I think this would be a really attractive feature to
> promote the usage of Flink.
>
> Thanks Marta,
> Flavio
>
> On Fri, May 1, 2020 at 12:26 AM Marta Paes Moreira 
> wrote:
>
>> Hi, Flavio.
>>
>> Extending the scope of Python UDFs is described in FLIP-106 [1, 2] and is
>> planned for the upcoming 1.11 release, according to Piotr's last update.
>>
>> Hope this addresses your question!
>>
>> Marta
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-106%3A+Support+Python+UDF+in+SQL+Function+DDL
>> [2]
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-106-Support-Python-UDF-in-SQL-Function-DDL-td38107.html
>>
>> On Thu, Apr 30, 2020 at 11:30 PM Flavio Pompermaier 
>> wrote:
>>
>>> Hi to all,
>>> is it possible to run a Python UDF from a Java job (using Table API or
>>> SQL)?
>>> Is there any reference?
>>>
>>> Best,
>>> Flavio
>>>
>>
>


Re: PyFlink questions

2020-05-25 Thread jincheng sun
Hi, thanks for the questions in your email!

> 1. Does PyFlink still lack counterparts to the low-level Java DataStream API and DataSet API? Are there plans to add them?

Yes. PyDataStream support is tentatively planned for the 1.12 release, and I will share the preliminary PyFlink plan for 1.12 soon!

> 2. Do the PyFlink API names basically correspond to the Java API names?

The Python API and the Java API are required to be highly consistent in syntax and semantics, so the names should correspond.
At the same time, PyFlink makes some local usability optimizations based on the characteristics of the Python language; for example, Java's
Class.builder().setA(x).setB(y).build()
is optimized into the Python keyword-argument form: Class(a=x, b=y)
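
A small self-contained illustration of that calling convention (the class
here is hypothetical, purely to show the pattern):

# Hypothetical class, only to illustrate the convention described above.
class KafkaSource:
    def __init__(self, topic, group_id):
        self.topic = topic
        self.group_id = group_id

# Java:   KafkaSource.builder().setTopic("logs").setGroupId("g1").build()
# Python: keyword arguments replace the builder chain.
source = KafkaSource(topic="logs", group_id="g1")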

> 3. Is pyAlink being merged into PyFlink? Roughly how far along is it?

pyAlink is the Python entry point of Alink, a library in the Flink ecosystem, so it can also be seen as an ecosystem library of PyFlink. pyAlink is built on top of PyFlink, and Alink's algorithms can ultimately be used in Flink as well. In other words, PyFlink can be used to develop ML jobs, with the algorithm implementations drawing on the Alink algorithm library. In 1.11, PyFlink will add ML Pipeline API support, and Alink itself will align its interfaces accordingly.

> 4. A concrete question about mapping Java to PyFlink

Good question; this is a key issue to consider for PyFlink's DataStream API support.
The API definitions will stay consistent with DataStream in syntax and semantics (possibly with Python-specific usability optimizations). The API mapping is supported via Py4J, and the various functions (processFunction/mapFunction/...) will share the Python UDF implementation (built on top of Beam).


I would also really like to understand the business scenarios behind your need for a Python DataStream API. A detailed description of your intended use cases for the PyDataStream API would help us a lot! :)

Thanks for your email!

Best,
Jincheng
-
Twitter: https://twitter.com/sunjincheng121
-


On Fri, May 22, 2020 at 11:06 AM zhangyong <274196...@qq.com> wrote:

> Hello Teacher Sun,
> I've been using PyFlink recently and ran into a few problems that I
> couldn't find answers to online, so I'd like to ask you about them.
>
> 1. Does PyFlink still lack counterparts to the low-level Java DataStream
> API and DataSet API? Are there plans to add them?
> 2. Do the PyFlink API names basically correspond to the Java API names?
> 3. Is pyAlink being merged into PyFlink? Roughly how far along is it?
>
> 4. A concrete question about mapping Java to PyFlink:
>
> [image omitted]
>
> Which functions in PyFlink correspond to the addSource(), process() and
> map() calls here, and how should they be implemented?
>
> Many thanks, Teacher Sun, for taking the time to clear this up.
>
>
>
>


Re: PyFlink data query

2020-06-15 Thread jincheng sun
Hi Jack,

> In PyFlink, query and aggregate the data from the source via SQL, but
instead of writing to a sink, use the output directly as the result; I can
then build a web API to query that result directly, without querying a sink.

As I understand it, "using the output directly as the result" plus
"querying via a web API" already involves a "sink" action; it is just that
the "sink" is implemented that way. For your scenario:
1. If you want to push results to a web page directly without persisting
them, you can implement a custom WebSocket sink.
2. If instead of pushing to a web page you want to pull results by
querying, then "directly as the result" needs more detail: how exactly
should the result be exposed? My understanding is that it has to be
persisted, and that persistence is in essence a sink. Flink supports many
kinds of sinks, e.g. databases, file systems, message queues, and so on;
see the official documentation:
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/connect.html
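
For the pull-based option, a minimal sketch of declaring such a sink via
DDL (1.10-era legacy connector properties; the table name and path are
illustrative, and the property keys may differ in other versions):

# Hedged sketch: a filesystem sink that a web layer could later read from.
t_env.sql_update("""
    CREATE TABLE result_sink (
        word STRING,
        cnt BIGINT
    ) WITH (
        'connector.type' = 'filesystem',
        'format.type' = 'csv',
        'connector.path' = '/tmp/result'
    )
""")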

If the reply above doesn't solve your problem, feel free to follow up any time!

Best,
Jincheng



On Tue, Jun 9, 2020 at 5:39 PM Jeff Zhang  wrote:

> You can use Zeppelin's z.show to query job results. Here is a beginner
> tutorial for PyFlink on Zeppelin:
> https://www.bilibili.com/video/BV1Te411W73b?p=20
> You can also join the DingTalk group to discuss: 30022475
>
>
>
> On Tue, Jun 9, 2020 at 5:28 PM jack  wrote:
>
>> A question:
>> Description: in PyFlink, I query and aggregate the data from the source
>> via SQL. Instead of writing the output to a sink, can it be used
>> directly as the result, so that I can build a web API to query it
>> directly without querying a sink?
>>
>> Can Flink support this approach?
>> Thanks
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


[ANNOUNCE] Yu Li is now part of the Flink PMC

2020-06-16 Thread jincheng sun
Hi all,

On behalf of the Flink PMC, I'm happy to announce that Yu Li is now
part of the Apache Flink Project Management Committee (PMC).

Yu Li has been very active on Flink's state backend components, working on
various improvements, for example the RocksDB memory management for 1.10.
He keeps checking and voting on our releases, and has also successfully
produced two releases (1.10.0 & 1.10.1) as release manager.

Congratulations & Welcome Yu Li!

Best,
Jincheng (on behalf of the Flink PMC)


Re: Can the PyFlink Table API insert into different sink tables based on a field's value

2020-06-21 Thread jincheng sun
Hi Jack,

With the Table API you don't need if/else; logic along these lines works directly:

t1 = table.filter("x > 2").group_by(...)
t2 = table.filter("x <= 2").group_by(...)
t1.insert_into("sink1")
t2.insert_into("sink2")
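
A fuller sketch matching your logType example (hedged; it assumes tables
named source, sink1 and sink2 are already registered on t_env):

# Each filter() returns an independent Table, so both branches are
# emitted; no if/else over row values is needed in the Table API.
table = t_env.from_path("source")
table.filter("logType = 'syslog'").insert_into("sink1")
table.filter("logType = 'alarm'").insert_into("sink2")
t_env.execute("route-by-logType")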


Best,
Jincheng



On Fri, Jun 19, 2020 at 10:35 AM jack  wrote:

>
> I tested with the following structure:
> table = t_env.from_path("source")
>
> if table.filter("logType=syslog"):
> table.filter("logType=syslog").insert_into("sink1")
> elif table.filter("logType=alarm"):
> table.filter("logType=alarm").insert_into("sink2")
>
>
> In my test, table.filter("logType=syslog").insert_into("sink1") seems to
> take effect while the elif below does not. Is the reason that
> table.filter("logType=syslog") (or table.where) has already filtered the
> data when the condition is evaluated, so execution never reaches the
> branch below??
>
>
>
>
> On 2020-06-19 10:08:25, "jack"  wrote:
> >Can the PyFlink Table API route rows into different sink tables based
> >on a field's value?
> >
> >
> >Scenario: with PyFlink I filter by a condition and then insert into a
> >sink. For example, the two messages below differ in logType; the filter
> >API can filter them before inserting into a sink table:
> >{
> >"logType":"syslog",
> >"message":"sla;flkdsjf"
> >}
> >{
> >"logType":"alarm",
> >"message":"sla;flkdsjf"
> >}
> >  t_env.from_path("source")\
> >  .filter("logType=syslog")\
> >  .insert_into("sink1")
> >Is there a way, in the code above, to implement if/else-like logic by
> >checking the logType field, e.g.:
> >if logType=="syslog":
> >   insert_into(sink1)
> >elif logType=="alarm":
> >   insert_into(sink2)
> >
> >
> >It would be easy if insert_into returned something with .filter,
> >.select and similar methods, so filtering could continue down the
> >chain, like:
> >
> >
> >  t_env.from_path("source")\
> >  .filter("logType=syslog")\
> >  .insert_into("sink1")\
> >  .filter("logType=alarm")\
> >  .insert_into("sink2")
> >Any guidance from the experts would be appreciated, thanks
> >
> >
> >
> >
> >
>
>