Thank you all. Yes, attracting more Python users and being more friendly to Python users is always good.
Basically, SPARK-42493 is proposing to introduce an intentional inconsistency into the Apache Spark documentation. The inconsistency from SPARK-42493 might prompt the following questions from Python users first.
- Why not the RDD pages, which are the heart of Apache Spark? Is Python not a good fit for RDDs?
- Why not the ML and Structured Streaming pages, when the DATA+AI Summit focuses heavily on ML?
Also, more questions from the Scala users.
- Is Scala stepping down to being a second-class language?
- What about Scala 3?
Of course, I understand SPARK-42493 has a specific scope (SQL/Dataset/DataFrame) and didn't mean anything like the above at all. However, if SPARK-42493 is emphasized as "the first step" toward introducing that inconsistency, I'm wondering:
- What direction are we heading in?
- What is the next target scope?
- When will it be achieved (or completed)?
- Or is the goal to be permanently inconsistent in terms of the documentation?
It's unclear even in the documentation-only scope. If we are expecting more and more subtasks during the Apache Spark 3.5 timeframe, shall we have an umbrella JIRA?

Bests,
Dongjoon.

On Thu, Feb 23, 2023 at 6:15 PM Allan Folting <afolting...@gmail.com> wrote:

> Thanks a lot for the questions and comments/feedback!
>
> To address your questions, Dongjoon: I do not intend for these documentation updates to be tied to the potential changes/suggestions you ask about.
>
> In other words, this proposal is only about adjusting the documentation to target the majority of people reading it, namely the large and growing number of Python users, and new users in particular, as they are often already familiar with and have a preference for Python when evaluating or starting to use Spark.
>
> While we may want to strengthen support for Python in other ways, I think such efforts should be tracked separately from this.
>
> Allan
>
> On Thu, Feb 23, 2023 at 1:44 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> If this is not just flip-flopping the document pages and involves other changes, then a proper impact analysis needs to be done to assess the effort involved. Personally, I don't think it really matters.
>>
>> HTH
>>
>> view my LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>
>> On Thu, 23 Feb 2023 at 01:40, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>>> > 1. Does this suggestion imply the Python API implementation will be the new blocker in the future in terms of feature parity among languages? Until now, Python API feature parity was one of the audit items because it's not enforced. In other words, Scala and Java have been full-featured because they are the underlying main developer languages, while the Python/R/SQL environments were nice-to-haves.
>>>
>>> I think it wouldn't be treated as a blocker, but I do believe we have added all new features on the Python side for the last couple of releases. So I wouldn't worry about this at the moment; we have been doing fine in terms of feature parity.
>>>
>>> > 2. Does this suggestion assume that the Python environment is always easier for users than Scala/Java? Given that we support Python 3.8 to 3.11, the support matrix for Python library dependencies is a problem for the Apache Spark community to solve in order to claim that.
>>> As noted in SPARK-41454, the Python language has also historically introduced breaking changes for us, and we have many pinned Python library issues.
>>>
>>> Yes. In fact, regardless of this change, I do believe we should test more versions, etc., at least via scheduled jobs, like the ones we run for JDK and Scala versions.
>>>
>>> FWIW, my take on this change is: people use Python and PySpark more (according to the chart and stats provided), so let's put those examples first :-).
>>>
>>> On Thu, 23 Feb 2023 at 10:27, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>
>>>> I have two questions to clarify the scope and boundaries.
>>>>
>>>> 1. Does this suggestion imply the Python API implementation will be the new blocker in the future in terms of feature parity among languages? Until now, Python API feature parity was one of the audit items because it's not enforced. In other words, Scala and Java have been full-featured because they are the underlying main developer languages, while the Python/R/SQL environments were nice-to-haves.
>>>>
>>>> 2. Does this suggestion assume that the Python environment is always easier for users than Scala/Java? Given that we support Python 3.8 to 3.11, the support matrix for Python library dependencies is a problem for the Apache Spark community to solve in order to claim that. As noted in SPARK-41454, the Python language has also historically introduced breaking changes for us, and we have many pinned Python library issues.
>>>>
>>>> Changing documentation is easy, but I hope we can give clear communication and direction in this effort because this is one of the most user-facing changes.
>>>>
>>>> Dongjoon.
>>>>
>>>> On Wed, Feb 22, 2023 at 5:26 PM 416161...@qq.com <ruife...@foxmail.com> wrote:
>>>>
>>>>> +1 LGTM
>>>>>
>>>>> ------------------------------
>>>>> Ruifeng Zheng
>>>>> ruife...@foxmail.com
>>>>>
>>>>> ------------------ Original ------------------
>>>>> *From:* "Xinrong Meng" <xinrong.apa...@gmail.com>
>>>>> *Date:* Thu, Feb 23, 2023 09:17 AM
>>>>> *To:* "Allan Folting" <afolting...@gmail.com>
>>>>> *Cc:* "dev" <dev@spark.apache.org>
>>>>> *Subject:* Re: [DISCUSS] Show Python code examples first in Spark documentation
>>>>>
>>>>> +1 Good idea!
>>>>>
>>>>> On Thu, Feb 23, 2023 at 7:41 AM Jack Goodson <jackagood...@gmail.com> wrote:
>>>>>
>>>>>> Good idea. At the company I work at, we discussed using Scala as our primary language because technically it is slightly stronger than Python, but we ultimately chose Python because it is easier for other devs to be onboarded to our platform, and it makes future hiring for the team easier.
>>>>>>
>>>>>> On Thu, 23 Feb 2023 at 12:20 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 I like this idea too.
>>>>>>>
>>>>>>> On Thu, Feb 23, 2023 at 6:00 AM Allan Folting <afolting...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I would like to propose that we show Python code examples first in the Spark documentation where we have multiple programming language examples. An example is on the Quick Start page:
>>>>>>>> https://spark.apache.org/docs/latest/quick-start.html
>>>>>>>>
>>>>>>>> I propose this change because Python has become more popular than the other languages supported in Apache Spark.
>>>>>>>> There are a lot more users of Spark in Python than in Scala today, and Python attracts a broader set of new users.
>>>>>>>> For Python usage data, see https://www.tiobe.com/tiobe-index/ and https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava.
>>>>>>>>
>>>>>>>> Also, this change aligns with Python already being the first tab on our home page:
>>>>>>>> https://spark.apache.org/
>>>>>>>>
>>>>>>>> Anyone who wants to use another language can still just click on the other tabs.
>>>>>>>>
>>>>>>>> I created a draft PR for the Spark SQL, DataFrames and Datasets Guide page as a first step:
>>>>>>>> https://github.com/apache/spark/pull/40087
>>>>>>>>
>>>>>>>> I would appreciate it if you could share your thoughts on this proposal.
>>>>>>>>
>>>>>>>> Thanks a lot,
>>>>>>>> Allan Folting