Thank you all. Yes, attracting more Python users and being more friendly to Python users is always good.
Basically, SPARK-42493 is proposing to introduce an intentional inconsistency into the Apache Spark documentation. The inconsistency from SPARK-42493 might prompt the following questions from Python users first.
- Why not the RDD pages, which are the heart of Apache Spark? Is Python not a good fit for RDDs?
- Why not the ML and Structured Streaming pages, when the DATA+AI Summit focuses heavily on ML?
Also, more questions from the Scala users.
- Is Scala stepping down to being a second-class language?
- What about Scala 3?
Of course, I understand SPARK-42493 has a specific scope (SQL/Dataset/DataFrame) and didn't mean anything like the above at all. However, if SPARK-42493 is emphasized as "the first step" toward introducing that inconsistency, I'm wondering:
- What direction are we heading in?
- What is the next target scope?
- When will it be achieved (or completed)?
- Or is the goal to be permanently inconsistent in terms of the documentation?
It's unclear even in the documentation-only scope. If we are expecting more and more subtasks during the Apache Spark 3.5 timeframe, shall we have an umbrella JIRA?

Bests,
Dongjoon.

On Thu, Feb 23, 2023 at 6:15 PM Allan Folting <afolting...@gmail.com> wrote:

> Thanks a lot for the questions and comments/feedback!
>
> To address your questions, Dongjoon: I do not intend for these documentation updates to be tied to the potential changes/suggestions you ask about.
>
> In other words, this proposal is only about adjusting the documentation to target the majority of people reading it, namely the large and growing number of Python users, and new users in particular, as they are often already familiar with and have a preference for Python when evaluating or starting to use Spark.
>
> While we may want to strengthen support for Python in other ways, I think such efforts should be tracked separately from this.
>
> Allan
>
> On Thu, Feb 23, 2023 at 1:44 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> If this is not just flip-flopping the document pages and involves other changes, then a proper impact analysis needs to be done to assess the effort involved. Personally, I don't think it really matters.
>>
>> HTH
>>
>> view my LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>
>> On Thu, 23 Feb 2023 at 01:40, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>>> > 1. Does this suggestion imply the Python API implementation will be the new blocker in the future in terms of feature parity among languages? Until now, Python API feature parity was one of the audit items because it's not enforced. In other words, Scala and Java have been full-featured because they are the underlying main developer languages, while the Python/R/SQL environments were nice-to-haves.
>>>
>>> I think it wouldn't be treated as a blocker, but I do believe we have added all new features on the Python side for the last couple of releases. So I wouldn't worry about this at the moment; we have been doing fine in terms of feature parity.
>>>
>>> > 2. Does this suggestion assume that the Python environment is always easier for users than Scala/Java? Given that we support Python 3.8 to 3.11, the support matrix for Python library dependencies is a problem for the Apache Spark community to solve in order to claim that.
>>> As noted in SPARK-41454, the Python language has also historically introduced breaking changes for us, and we have many pinned Python library issues.
>>>
>>> Yes. In fact, regardless of this change, I do believe we should test more versions, etc., at least via scheduled jobs, like the ones we run for JDK and Scala versions.
>>>
>>> FWIW, my take on this change is: people use Python and PySpark more (according to the chart and stats provided), so let's put those examples first :-).
>>>
>>> On Thu, 23 Feb 2023 at 10:27, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>
>>>> I have two questions to clarify the scope and boundaries.
>>>>
>>>> 1. Does this suggestion imply the Python API implementation will be the new blocker in the future in terms of feature parity among languages? Until now, Python API feature parity was one of the audit items because it's not enforced. In other words, Scala and Java have been full-featured because they are the underlying main developer languages, while the Python/R/SQL environments were nice-to-haves.
>>>>
>>>> 2. Does this suggestion assume that the Python environment is always easier for users than Scala/Java? Given that we support Python 3.8 to 3.11, the support matrix for Python library dependencies is a problem for the Apache Spark community to solve in order to claim that. As noted in SPARK-41454, the Python language has also historically introduced breaking changes for us, and we have many pinned Python library issues.
>>>>
>>>> Changing documentation is easy, but I hope we can give clear communication and direction in this effort because this is one of the most user-facing changes.
>>>>
>>>> Dongjoon.
>>>>
>>>> On Wed, Feb 22, 2023 at 5:26 PM 416161...@qq.com <ruife...@foxmail.com> wrote:
>>>>
>>>>> +1 LGTM
>>>>>
>>>>> ------------------------------
>>>>> Ruifeng Zheng
>>>>> ruife...@foxmail.com
>>>>>
>>>>> ------------------ Original ------------------
>>>>> *From:* "Xinrong Meng" <xinrong.apa...@gmail.com>
>>>>> *Date:* Thu, Feb 23, 2023 09:17 AM
>>>>> *To:* "Allan Folting" <afolting...@gmail.com>
>>>>> *Cc:* "dev" <dev@spark.apache.org>
>>>>> *Subject:* Re: [DISCUSS] Show Python code examples first in Spark documentation
>>>>>
>>>>> +1 Good idea!
>>>>>
>>>>> On Thu, Feb 23, 2023 at 7:41 AM Jack Goodson <jackagood...@gmail.com> wrote:
>>>>>
>>>>>> Good idea. At the company I work at, we discussed using Scala as our primary language because technically it is slightly stronger than Python, but we ultimately chose Python because it is easier for other devs to be onboarded to our platform, and it makes future hiring for the team easier.
>>>>>>
>>>>>> On Thu, 23 Feb 2023 at 12:20 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 I like this idea too.
>>>>>>>
>>>>>>> On Thu, Feb 23, 2023 at 6:00 AM Allan Folting <afolting...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I would like to propose that we show Python code examples first in the Spark documentation where we have multiple programming language examples. An example is on the Quick Start page:
>>>>>>>> https://spark.apache.org/docs/latest/quick-start.html
>>>>>>>>
>>>>>>>> I propose this change because Python has become more popular than the other languages supported in Apache Spark.
>>>>>>>> There are a lot more users of Spark in Python than in Scala today, and Python attracts a broader set of new users.
>>>>>>>> For Python usage data, see https://www.tiobe.com/tiobe-index/ and https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava.
>>>>>>>>
>>>>>>>> Also, this change aligns with Python already being the first tab on our home page:
>>>>>>>> https://spark.apache.org/
>>>>>>>>
>>>>>>>> Anyone who wants to use another language can still just click on the other tabs.
>>>>>>>>
>>>>>>>> I created a draft PR for the Spark SQL, DataFrames and Datasets Guide page as a first step:
>>>>>>>> https://github.com/apache/spark/pull/40087
>>>>>>>>
>>>>>>>> I would appreciate it if you could share your thoughts on this proposal.
>>>>>>>>
>>>>>>>> Thanks a lot,
>>>>>>>> Allan Folting