Hi,

I noticed that the Spark community has also recently been working on
redesigning the PySpark documentation [1]. Maybe we can compare our document
structure with theirs.

[1] https://issues.apache.org/jira/browse/SPARK-31851
http://apache-spark-developers-list.1001551.n3.nabble.com/Need-some-help-and-contributions-in-PySpark-API-documentation-td29972.html

Best,
Xingbo

On Wed, Aug 5, 2020 at 3:17 AM David Anderson <da...@alpinegizmo.com> wrote:

> I'm delighted to see energy going into improving the documentation.
>
> With the current documentation, I get a lot of questions that I believe
> reflect two fundamental problems with what we currently provide:
>
> (1) We have a lot of contextual information in our heads about how Flink
> works, and we are able to use that knowledge to make reasonable inferences
> about how things (probably) work in cases we aren't so familiar with. For
> example, I get a lot of questions of the form "If I use <this feature> will
> I still have exactly once guarantees?" The answer is always yes, but they
> continue to have doubts because we have failed to clearly communicate this
> fundamental, underlying principle.
>
> This specific example about fault tolerance applies across all of the
> Flink docs, but the general idea can also be applied to the Table/SQL and
> PyFlink docs. The guiding principles underlying these APIs should be
> written down in one easy-to-find place.
>
> (2) The other kind of question I get a lot is "Can I do <X> with <Y>?"
> E.g., "Can I use the JDBC table sink from PyFlink?" These questions can be
> very difficult to answer because it is frequently the case that one has to
> reason about why a given feature doesn't seem to appear in the
> documentation. It could be that I'm looking in the wrong place, or it could
> be that someone forgot to document something, or it could be that it can in
> fact be done by applying a general mechanism in a specific way that I
> haven't thought of -- as in this case, where one can use a JDBC sink from
> Python if one thinks to use DDL.
>
> So I think it would be helpful to be explicit about both what is, and what
> is not, supported in PyFlink. And to have some very clear organizing
> principles in the documentation so that users can quickly learn where to
> look for specific facts.
>
> Regards,
> David
>
>
> On Tue, Aug 4, 2020 at 1:01 PM jincheng sun <sunjincheng...@gmail.com>
> wrote:
>
>> Hi Seth and David,
>>
>> I'm very happy to have your reply and suggestions. I would like to share
>> my thoughts here:
>>
>> The main reason we want to refactor the PyFlink docs is to make sure that
>> Python users can find everything they need starting from the PyFlink
>> documentation main page. That is, the PyFlink documentation should have a
>> catalogue that includes all the functionality available in PyFlink.
>> However, this doesn't mean that we will copy content from other parts of
>> the documentation. Where needed, it may be just a reference/link to the
>> other documentation. For documentation added under the PyFlink main page,
>> the principle is that it should only include Python-specific content,
>> rather than a copy of the Java content.
>>
>> >>  I'm concerned that this proposal duplicates a lot of content that
>> will quickly get out of sync. It feels like it is documenting PyFlink
>> separately from the rest of the project.
>>
>> Regarding the concerns about maintainability: as mentioned above, the
>> goal of this FLIP is to provide a clear entry point to the Python API, and
>> its content should only contain information that is useful for Python
>> users. There are indeed many agenda items in this FLIP that duplicate the
>> Java documents, but that doesn't mean the content would be copied from the
>> Java documentation. That is, if the content of a document is the same as
>> the corresponding Java document, we will add a link to the Java document,
>> e.g. "Built-in functions" and "SQL". Where something is shared with Java,
>> we will only create a page for the Python-only content and redirect to
>> the Java document for the rest, e.g. "Connectors" and "Catalogs". If the
>> document is Python-only and already exists, we will move it from the old
>> Python documentation to the new Python documentation, e.g.
>> "Configurations". If the document is Python-only and does not exist yet,
>> we will create a new page for it, e.g. "DataTypes".
>>
>> The main reason we create a new page for Python Data Types is that it
>> corresponds one-to-one with Java Data Types only conceptually; the actual
>> document content would be very different from the Java Data Types
>> document. Some of the differences are as follows:
>>
>>
>>
>>   - The text in the Java Data Types document is written for users of
>> JVM-based languages and is hard to follow for users who only know
>> Python.
>>
>>   - Currently, Python Data Types do not support the "bridgedTo"
>> method, DataTypes.RAW, DataTypes.NULL, or User Defined Types.
>>
>>   - The sections "Planner Compatibility" and "Data Type Extraction" are
>> only useful for Java/Scala users.
>>
>>   - We want to add sections that apply only to Python, such as which
>> Data Types are currently supported in Python, the mapping between DataType
>> and Python object type, etc.
>>
>> I think the root cause of these differences from the existing documents
>> is that Python is the first non-JVM language we support in Flink. This
>> means our previous method of sharing documents between Java and Scala may
>> not be suitable for Python, so we will adopt some very different methods
>> to provide documentation for Python users. Of course, we should reduce
>> maintenance costs as much as possible while ensuring a good user
>> experience. Furthermore, Python is the first step in Flink's
>> multi-language support, and there may be R, Go, etc. in the future. It is
>> necessary for us to create a main page for each language, so that users
>> of each language can focus on the content they care about.
>>
>> >> Things like the cookbook and tutorial should be under the Try Flink
>> section of the documentation.
>>
>> Regarding the position of the "Cookbook" section: in my view, "Try
>> Flink" is for new users and the "Cookbook" is for more advanced users.
>> That is, "Try Flink" can contain the simplest end-to-end examples, such as
>> "Hello World", and in the "Cookbook" we can add use cases closer to
>> production business, such as CDN log analysis or PV/UV for e-commerce. So
>> I prefer to keep the current structure.
>>
>> >>  it's relatively straightforward to compare the Python API with the
>> Java and Scala versions.
>>
>> Regarding the comparison between the Python API and the Java/Scala API, I
>> think the majority of users, especially beginners, would not have this
>> need. From my side, improving the experience for beginners seems a higher
>> priority than that. Could you please share more about why users want to
>> compare? And how much would the comparison be affected if it were spread
>> across multiple pages? :)
>>
>> Thanks for all of your feedback and suggestions. Any follow-up feedback
>> is welcome.
>>
>> Best,
>>
>> Jincheng
>>
>>
>> On Mon, Aug 3, 2020 at 10:49 PM David Anderson <da...@alpinegizmo.com> wrote:
>>
>>> Jincheng,
>>>
>>> One thing that I like about the way that the documentation is currently
>>> organized is that it's relatively straightforward to compare the Python API
>>> with the Java and Scala versions. I'm concerned that if the PyFlink docs
>>> are more independent, it will be challenging to respond to questions about
>>> which features from the other APIs are available from Python.
>>>
>>> David
>>>
>>> On Mon, Aug 3, 2020 at 8:07 AM jincheng sun <sunjincheng...@gmail.com>
>>> wrote:
>>>
>>>> It would be great if you could join in contributing to the PyFlink
>>>> documentation, @Marta!
>>>> Thanks for all of the positive feedback. I will start a formal vote
>>>> later...
>>>>
>>>> Best,
>>>> Jincheng
>>>>
>>>>
>>>> On Mon, Aug 3, 2020 at 9:56 AM Shuiqiang Chen <acqua....@gmail.com> wrote:
>>>>
>>>> > Hi jincheng,
>>>> >
>>>> > Thanks for the discussion. +1 for the FLIP.
>>>> >
>>>> > Well-organized documentation will greatly improve the efficiency and
>>>> > experience of developers.
>>>> >
>>>> > Best,
>>>> > Shuiqiang
>>>> >
>>>> > On Sat, Aug 1, 2020 at 8:42 AM Hequn Cheng <he...@apache.org> wrote:
>>>> >
>>>> >> Hi Jincheng,
>>>> >>
>>>> >> Thanks a lot for raising the discussion. +1 for the FLIP.
>>>> >>
>>>> >> I think this will bring big benefits for PyFlink users. Currently,
>>>> >> the Python Table API document is hidden deep under the Table API & SQL
>>>> >> tab, which makes it quite hard to read. Also, the PyFlink
>>>> >> documentation is mixed with the Java/Scala documentation. It is hard
>>>> >> for users to get an overview of all the PyFlink documents. As more and
>>>> >> more functionality is added to PyFlink, I think it's time for us to
>>>> >> refactor the documentation.
>>>> >>
>>>> >> Best,
>>>> >> Hequn
>>>> >>
>>>> >>
>>>> >> On Fri, Jul 31, 2020 at 3:43 PM Marta Paes Moreira <ma...@ververica.com>
>>>> >> wrote:
>>>> >>
>>>> >>> Hi, Jincheng!
>>>> >>>
>>>> >>> Thanks for creating this detailed FLIP. It will make a big
>>>> >>> difference in the experience of Python developers using Flink. I'm
>>>> >>> interested in contributing to this work, so I'll reach out to you
>>>> >>> offline!
>>>> >>>
>>>> >>> Also, thanks for sharing some information on the adoption of
>>>> >>> PyFlink. It's great to see that there are already production users.
>>>> >>>
>>>> >>> Marta
>>>> >>>
>>>> >>> On Fri, Jul 31, 2020 at 5:35 AM Xingbo Huang <hxbks...@gmail.com>
>>>> >>> wrote:
>>>> >>>
>>>> >>> > Hi Jincheng,
>>>> >>> >
>>>> >>> > Thanks a lot for bringing up this discussion and the proposal.
>>>> >>> >
>>>> >>> > Big +1 for improving the structure of PyFlink doc.
>>>> >>> >
>>>> >>> > It will be very helpful to give PyFlink users a unified entry
>>>> >>> > point for learning from the PyFlink documentation.
>>>> >>> >
>>>> >>> > Best,
>>>> >>> > Xingbo
>>>> >>> >
>>>> >>> > On Fri, Jul 31, 2020 at 11:00 AM Dian Fu <dian0511...@gmail.com> wrote:
>>>> >>> >
>>>> >>> >> Hi Jincheng,
>>>> >>> >>
>>>> >>> >> Thanks a lot for bringing up this discussion and the proposal. +1
>>>> >>> >> to improving the Python API doc.
>>>> >>> >>
>>>> >>> >> I have received a lot of feedback from PyFlink beginners about the
>>>> >>> >> PyFlink doc, e.g. there is too little material, the Python doc is
>>>> >>> >> mixed with the Java doc, and it's not easy to find the docs they
>>>> >>> >> are looking for.
>>>> >>> >>
>>>> >>> >> I think it would greatly improve the user experience if we had one
>>>> >>> >> place that includes most of what PyFlink users should know.
>>>> >>> >>
>>>> >>> >> Regards,
>>>> >>> >> Dian
>>>> >>> >>
>>>> >>> >> On Jul 31, 2020, at 10:14 AM, jincheng sun <sunjincheng...@gmail.com> wrote:
>>>> >>> >>
>>>> >>> >> Hi folks,
>>>> >>> >>
>>>> >>> >> Since the release of Flink 1.11, the number of PyFlink users has
>>>> >>> >> continued to grow. As far as I know, many companies have used
>>>> >>> >> PyFlink for data analysis and for operations and maintenance
>>>> >>> >> monitoring, and have put it into production (such as 聚美优品 [1]
>>>> >>> >> (Jumei), 浙江墨芷 [2] (Mozhi), etc.). According to the feedback we
>>>> >>> >> have received, the current documentation is not very friendly to
>>>> >>> >> PyFlink users. There are two shortcomings:
>>>> >>> >>
>>>> >>> >> - Python-related content is mixed into the Java/Scala
>>>> >>> >> documentation, which makes it difficult to read for users who only
>>>> >>> >> focus on PyFlink.
>>>> >>> >> - There is already a "Python Table API" section in the Table API
>>>> >>> >> document to hold PyFlink documents, but the number of articles is
>>>> >>> >> small and the content is fragmented. It is difficult for beginners
>>>> >>> >> to learn from it.
>>>> >>> >>
>>>> >>> >> In addition, FLIP-130 introduced the Python DataStream API. Many
>>>> >>> >> documents will be added for those new APIs. In order to improve the
>>>> >>> >> readability and maintainability of the PyFlink documentation, Wei
>>>> >>> >> Zhong and I have discussed offline and would like to rework it via
>>>> >>> >> this FLIP.
>>>> >>> >>
>>>> >>> >> We will rework the documentation around the following three
>>>> >>> >> objectives:
>>>> >>> >>
>>>> >>> >> - Add a separate section for the Python API under the "Application
>>>> >>> >> Development" section.
>>>> >>> >> - Restructure the current Python documentation into a brand new
>>>> >>> >> structure so that the content is complete and friendly to
>>>> >>> >> beginners.
>>>> >>> >> - Improve the documents shared by Python/Java/Scala to make them
>>>> >>> >> more friendly to Python users without affecting Java/Scala users.
>>>> >>> >>
>>>> >>> >> More details can be found in FLIP-133:
>>>> >>> >>
>>>> >>> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-133%3A+Rework+PyFlink+Documentation
>>>> >>> >>
>>>> >>> >> Best,
>>>> >>> >> Jincheng
>>>> >>> >>
>>>> >>> >> [1] https://mp.weixin.qq.com/s/zVsBIs1ZEFe4atYUYtZpRg
>>>> >>> >> [2] https://mp.weixin.qq.com/s/R4p_a2TWGpESBWr3pLtM2g
>>>> >>> >>
>>>> >>> >>
>>>> >>> >>
>>>> >>>
>>>> >>
>>>>
>>>
