Hi, I found that the Spark community has also recently been working on redesigning the PySpark documentation [1]. Maybe we can compare our document structure with theirs.
[1] https://issues.apache.org/jira/browse/SPARK-31851
http://apache-spark-developers-list.1001551.n3.nabble.com/Need-some-help-and-contributions-in-PySpark-API-documentation-td29972.html

Best,
Xingbo

David Anderson <da...@alpinegizmo.com> wrote on Wed, Aug 5, 2020 at 3:17 AM:

> I'm delighted to see energy going into improving the documentation.
>
> With the current documentation, I get a lot of questions that I believe reflect two fundamental problems with what we currently provide:
>
> (1) We have a lot of contextual information in our heads about how Flink works, and we are able to use that knowledge to make reasonable inferences about how things (probably) work in cases we aren't so familiar with. For example, I get a lot of questions of the form "If I use <this feature> will I still have exactly once guarantees?" The answer is always yes, but they continue to have doubts because we have failed to clearly communicate this fundamental, underlying principle.
>
> This specific example about fault tolerance applies across all of the Flink docs, but the general idea can also be applied to the Table/SQL and PyFlink docs. The guiding principles underlying these APIs should be written down in one easy-to-find place.
>
> (2) The other kind of question I get a lot is "Can I do <X> with <Y>?" E.g., "Can I use the JDBC table sink from PyFlink?" These questions can be very difficult to answer because it is frequently the case that one has to reason about why a given feature doesn't seem to appear in the documentation. It could be that I'm looking in the wrong place, or it could be that someone forgot to document something, or it could be that it can in fact be done by applying a general mechanism in a specific way that I haven't thought of -- as in this case, where one can use a JDBC sink from Python if one thinks to use DDL.
>
> So I think it would be helpful to be explicit about both what is, and what is not, supported in PyFlink.
> It would also help to have some very clear organizing principles in the documentation so that users can quickly learn where to look for specific facts.
>
> Regards,
> David
>
> On Tue, Aug 4, 2020 at 1:01 PM jincheng sun <sunjincheng...@gmail.com> wrote:
>
>> Hi Seth and David,
>>
>> I'm very happy to have your reply and suggestions. I would like to share my thoughts here:
>>
>> The main motivation for refactoring the PyFlink doc is that we want to make sure Python users can find everything they need starting from the PyFlink documentation main page. That is, the PyFlink documentation should have a catalogue that includes all the functionality available in PyFlink. However, this doesn't mean that we will copy content from other parts of the documentation. It may be just a reference/link to the other documentation if needed. For the documentation added under the PyFlink main page, the principle is that it should only include Python-specific content, instead of making a copy of the Java content.
>>
>> >> I'm concerned that this proposal duplicates a lot of content that will quickly get out of sync. It feels like it is documenting PyFlink separately from the rest of the project.
>>
>> Regarding the concerns about maintainability, as mentioned above, the goal of this FLIP is to provide an intelligible entry point to the Python API, and its content should only contain information that is useful for Python users. There are indeed many agenda items in this FLIP that duplicate the Java documents, but that doesn't mean the content would be copied from the Java documentation. That is, if the content of a document is the same as the corresponding Java document, we will add a link to the Java document, e.g. "Built-in functions" and "SQL". We only create a page for the Python-only content, and then redirect to the Java document if there is something shared with Java, e.g. "Connectors" and "Catalogs".
>> If the document is Python-only and already exists, we will move it from the old Python document to the new Python document, e.g. "Configurations". If the document is Python-only and did not exist before, we will create a new page for it, e.g. "DataTypes".
>>
>> The main reason we create a new page for Python Data Types is that it corresponds to Java Data Types one-to-one only conceptually, while the actual document content would be very different from Java DataTypes. Some of the detailed differences are as follows:
>>
>> - The text in the Java Data Types document is written for users of JVM-based languages, and is incomprehensible to users who only understand Python.
>>
>> - Currently the Python Data Types do not support the "bridgedTo" method, DataTypes.RAW, DataTypes.NULL and User Defined Types.
>>
>> - The sections "Planner Compatibility" and "Data Type Extraction" are only useful for Java/Scala users.
>>
>> - We want to add sections that may only apply to Python, such as which Data Types are currently supported in Python, the mapping between DataType and Python object type, etc.
>>
>> I think the root cause of such a difference from the existing documents is that Python is the first non-JVM language we support in Flink. This means our previous method of sharing documents between Java and Scala may not be suitable for Python. So we will adopt some very different methods to provide documentation for Python users. Of course, we should reduce maintenance costs as much as possible while ensuring a good user experience. Furthermore, Python is the first step of Flink's multi-language support, and there may be R, Go, etc. in the future. It is necessary for us to create a main page for each language, so that users of each language can focus on the content they care about.
>>
>> >> Things like the cookbook and tutorial should be under the Try Flink section of the documentation.
>>
>> Regarding the position of the "Cookbook" section, as I see it "Try Flink" is for new users and the "Cookbook" is for more advanced users. That is, "Try Flink" can hold the simplest end-to-end examples, such as "Hello World", and in the "Cookbook" we can add use cases closer to production business, such as CDN log analysis or PV/UV of e-commerce. So I prefer to keep the current structure.
>>
>> >> it's relatively straightforward to compare the Python API with the Java and Scala versions.
>>
>> Regarding the comparison between the Python API and the Java/Scala API, I think the majority of users, especially beginners, would not have this need. From my side, improving the experience for beginner users seems a higher priority than supporting that comparison. Could you please elaborate on why users would want to compare? And how much would the comparison be affected if it were spread across multiple pages? :)
>>
>> Thanks for all of your feedback and suggestions, any follow-up feedback is welcome.
>>
>> Best,
>> Jincheng
>>
>> David Anderson <da...@alpinegizmo.com> wrote on Mon, Aug 3, 2020 at 10:49 PM:
>>
>>> Jincheng,
>>>
>>> One thing that I like about the way that the documentation is currently organized is that it's relatively straightforward to compare the Python API with the Java and Scala versions. I'm concerned that if the PyFlink docs are more independent, it will be challenging to respond to questions about which features from the other APIs are available from Python.
>>>
>>> David
>>>
>>> On Mon, Aug 3, 2020 at 8:07 AM jincheng sun <sunjincheng...@gmail.com> wrote:
>>>
>>>> It would be great if you could join in contributing to the PyFlink documentation, @Marta!
>>>> Thanks for all of the positive feedback. I will start a formal vote later...
>>>>
>>>> Best,
>>>> Jincheng
>>>>
>>>> Shuiqiang Chen <acqua....@gmail.com> wrote on Mon, Aug 3, 2020 at 9:56 AM:
>>>>
>>>> > Hi jincheng,
>>>> >
>>>> > Thanks for the discussion.
>>>> > +1 for the FLIP.
>>>> >
>>>> > Well-organized documentation will greatly improve efficiency and the experience for developers.
>>>> >
>>>> > Best,
>>>> > Shuiqiang
>>>> >
>>>> > Hequn Cheng <he...@apache.org> wrote on Sat, Aug 1, 2020 at 8:42 AM:
>>>> >
>>>> >> Hi Jincheng,
>>>> >>
>>>> >> Thanks a lot for raising the discussion. +1 for the FLIP.
>>>> >>
>>>> >> I think this will bring big benefits for PyFlink users. Currently, the Python Table API document is hidden deep under the Table API & SQL tab, which makes it hard to find and read. Also, the PyFlink documentation is mixed with the Java/Scala documentation. It is hard for users to get an overview of all the PyFlink documents. As more and more functionality is added to PyFlink, I think it's time for us to refactor the document.
>>>> >>
>>>> >> Best,
>>>> >> Hequn
>>>> >>
>>>> >> On Fri, Jul 31, 2020 at 3:43 PM Marta Paes Moreira <ma...@ververica.com> wrote:
>>>> >>
>>>> >>> Hi, Jincheng!
>>>> >>>
>>>> >>> Thanks for creating this detailed FLIP, it will make a big difference in the experience of Python developers using Flink. I'm interested in contributing to this work, so I'll reach out to you offline!
>>>> >>>
>>>> >>> Also, thanks for sharing some information on the adoption of PyFlink, it's great to see that there are already production users.
>>>> >>>
>>>> >>> Marta
>>>> >>>
>>>> >>> On Fri, Jul 31, 2020 at 5:35 AM Xingbo Huang <hxbks...@gmail.com> wrote:
>>>> >>>
>>>> >>> > Hi Jincheng,
>>>> >>> >
>>>> >>> > Thanks a lot for bringing up this discussion and the proposal.
>>>> >>> >
>>>> >>> > Big +1 for improving the structure of the PyFlink doc.
>>>> >>> >
>>>> >>> > It will be very friendly to give PyFlink users a unified entry point for learning from the PyFlink documents.
>>>> >>> >
>>>> >>> > Best,
>>>> >>> > Xingbo
>>>> >>> >
>>>> >>> > Dian Fu <dian0511...@gmail.com> wrote on Fri, Jul 31, 2020 at 11:00 AM:
>>>> >>> >
>>>> >>> >> Hi Jincheng,
>>>> >>> >>
>>>> >>> >> Thanks a lot for bringing up this discussion and the proposal. +1 to improve the Python API doc.
>>>> >>> >>
>>>> >>> >> I have received a lot of feedback from PyFlink beginners about the PyFlink doc, e.g. the materials are too few, the Python doc is mixed with the Java doc, and it's not easy to find the docs they are looking for.
>>>> >>> >>
>>>> >>> >> I think it would greatly improve the user experience if we could have one place that includes most of the knowledge PyFlink users should have.
>>>> >>> >>
>>>> >>> >> Regards,
>>>> >>> >> Dian
>>>> >>> >>
>>>> >>> >> On Jul 31, 2020, at 10:14 AM, jincheng sun <sunjincheng...@gmail.com> wrote:
>>>> >>> >>
>>>> >>> >> Hi folks,
>>>> >>> >>
>>>> >>> >> Since the release of Flink 1.11, the number of PyFlink users has continued to grow. As far as I know, many companies have used PyFlink for data analysis and for operations and maintenance monitoring, with these workloads already in production (such as 聚美优品 [1] (Jumei), 浙江墨芷 [2] (Mozhi), etc.). According to the feedback we received, the current documentation is not very friendly to PyFlink users. There are two shortcomings:
>>>> >>> >>
>>>> >>> >> - Python-related content is mixed into the Java/Scala documentation, which makes it difficult to read for users who only focus on PyFlink.
>>>> >>> >> - There is already a "Python Table API" section in the Table API document to hold PyFlink documents, but the number of articles is small and the content is fragmented. It is difficult for beginners to learn from it.
>>>> >>> >>
>>>> >>> >> In addition, FLIP-130 introduced the Python DataStream API. Many documents will be added for those new APIs. In order to increase the readability and maintainability of the PyFlink document, Wei Zhong and I have discussed offline and would like to rework it via this FLIP.
>>>> >>> >>
>>>> >>> >> We will rework the document around the following three objectives:
>>>> >>> >>
>>>> >>> >> - Add a separate section for the Python API under the "Application Development" section.
>>>> >>> >> - Restructure the current Python documentation into a brand new structure, so that the content is complete and friendly to beginners.
>>>> >>> >> - Improve the documents shared by Python/Java/Scala to make them more friendly to Python users without affecting Java/Scala users.
>>>> >>> >>
>>>> >>> >> More detail can be found in FLIP-133:
>>>> >>> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-133%3A+Rework+PyFlink+Documentation
>>>> >>> >>
>>>> >>> >> Best,
>>>> >>> >> Jincheng
>>>> >>> >>
>>>> >>> >> [1] https://mp.weixin.qq.com/s/zVsBIs1ZEFe4atYUYtZpRg
>>>> >>> >> [2] https://mp.weixin.qq.com/s/R4p_a2TWGpESBWr3pLtM2g