Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-23 Thread Santosh Pingale
Yes, I definitely agree and +1 to the proposal (FWIW).

I was looking at Dongjoon's comments, which made a lot of sense to me, and
trying to come up with an approach that provides a smooth segue to Python as
the first tab later on. But this is mostly guesswork, as I do not personally
know the actual user behaviour on the docs site.


Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-23 Thread Hyukjin Kwon
That sounds good to have, especially given that it will allow more
flexibility for users.
But I think that's slightly orthogonal to this proposal, since this proposal
is more about the default (before users take an action).



Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-23 Thread Santosh Pingale
Very interesting and user-focused discussion; thanks for the proposal.

Would it be better if we instead let users set a preference for the language
they want to see first in the code examples? This preference could easily be
stored on the browser side and used to decide the ordering. This is in line
with the freedom users have with Spark today.



Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-23 Thread Allan Folting
I think this needs to be done consistently on all relevant pages, and my
intent is to do that work in time for the release it first appears in.
I started with the "Spark SQL, DataFrames and Datasets Guide" page so the
work can be broken up into multiple, scoped PRs.
I should have made that clear before.

I think it's a great idea to have an umbrella JIRA for this to outline the
full scope and track overall progress, and I'm happy to create it.

I can't speak on behalf of all Scala users, of course, but I don't think
this change makes Scala appear to be a second-class citizen, just as I don't
think of Python as a second-class citizen because it is not listed first
currently. The change simply recognizes that Python is more broadly popular
today.

Thanks,
Allan


Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-23 Thread Dongjoon Hyun
Thank you all.

Yes, attracting more Python users and being more Python user-friendly is
always good.

Basically, SPARK-42493 proposes to introduce intentional inconsistency
into the Apache Spark documentation.

The inconsistency from SPARK-42493 might raise the following questions for
Python users first.

- Why not the RDD pages, which are the heart of Apache Spark? Is Python not
a good fit for RDDs?
- Why not the ML and Structured Streaming pages, when the DATA+AI Summit
focuses heavily on ML?

It also raises more questions for Scala users.
- Is Scala stepping down to second-class-citizen status?
- What about Scala 3?

Of course, I understand SPARK-42493 has a specific scope
(SQL/Dataset/DataFrame) and didn't mean anything like the above at all.
However, if SPARK-42493 is emphasized as "the first step" toward introducing
that inconsistency, I'm wondering:
- What direction are we heading in?
- What is the next target scope?
- When will it be achieved (or completed)?
- Or is the goal to be permanently inconsistent in terms of the
documentation?

It's unclear even in the documentation-only scope. If we are expecting more
and more subtasks during the Apache Spark 3.5 timeframe, shall we have an
umbrella JIRA?

Bests,
Dongjoon.



Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-23 Thread Hyukjin Kwon
Yes, we should fix that. I will take a look.

On Thu, 23 Feb 2023 at 07:32, Jonathan Kelly  wrote:

> Thanks! I was wondering about that ClientE2ETestSuite failure today, so
> I'm glad to know that it's also being experienced by others.
>
> On a similar note, I am experiencing the following error when running the
> Python tests with Python 3.7:
>
> + ./python/run-tests --python-executables=python3
> Running PySpark tests. Output is in
> /home/ec2-user/spark/python/unit-tests.log
> Will test against the following Python executables: ['python3']
> Will test the following Python modules: ['pyspark-connect',
> 'pyspark-core', 'pyspark-errors', 'pyspark-ml', 'pyspark-mllib',
> 'pyspark-pandas', 'pyspark-pandas-slow', 'pyspark-resource', 'pyspark-sql',
> 'pyspark-streaming']
> python3 python_implementation is CPython
> python3 version is: Python 3.7.16
> Starting test(python3): pyspark.ml.tests.test_feature (temp output:
> /home/ec2-user/spark/python/target/8ca9ab1a-05cc-4845-bf89-30d9001510bc/python3__pyspark.ml.tests.test_feature__kg6sseie.log)
> Starting test(python3): pyspark.ml.tests.test_base (temp output:
> /home/ec2-user/spark/python/target/f2264f3b-6b26-4e61-9452-8d6ddd7eb002/python3__pyspark.ml.tests.test_base__0902zf9_.log)
> Starting test(python3): pyspark.ml.tests.test_algorithms (temp output:
> /home/ec2-user/spark/python/target/d1dc4e07-e58c-4c03-abe5-09d8fab22e6a/python3__pyspark.ml.tests.test_algorithms__lh3wb2u8.log)
> Starting test(python3): pyspark.ml.tests.test_evaluation (temp output:
> /home/ec2-user/spark/python/target/3f42dc79-c945-4cf2-a1eb-83e72b40a9ee/python3__pyspark.ml.tests.test_evaluation__89idc7fa.log)
> Finished test(python3): pyspark.ml.tests.test_base (16s)
> Starting test(python3): pyspark.ml.tests.test_functions (temp output:
> /home/ec2-user/spark/python/target/5a3b90f0-216b-4edd-9d15-6619d3e03300/python3__pyspark.ml.tests.test_functions__g5u1290s.log)
> Traceback (most recent call last):
>   File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
> "__main__", mod_spec)
>   File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
> exec(code, run_globals)
>   File "/home/ec2-user/spark/python/pyspark/ml/tests/test_functions.py",
> line 21, in <module>
> from pyspark.ml.functions import predict_batch_udf
>   File "/home/ec2-user/spark/python/pyspark/ml/functions.py", line 38, in
> <module>
> from typing import Any, Callable, Iterator, List, Mapping, Protocol,
> TYPE_CHECKING, Tuple, Union
> ImportError: cannot import name 'Protocol' from 'typing'
> (/usr/lib64/python3.7/typing.py)
> Had test failures in pyspark.ml.tests.test_functions with python3; see
> logs.
>
> I know we should move on to a newer version of Python, but isn't Python
> 3.7 still officially supported?
>
> Thank you,
> Jonathan Kelly
>
> On Wed, Feb 22, 2023 at 1:47 PM Herman van Hovell
>  wrote:
>
>> Hi All,
>>
>> Thanks for testing the 3.4.0 RC! I apologize for the Maven testing
>> failures for the Spark Connect Scala Client. We will try to get those
>> sorted as soon as possible.
>>
>> This is an artifact of having multiple build systems, and only running CI
>> for one (SBT). That, however, is a debate for another day :)...
>>
>> Cheers,
>> Herman
>>
>> On Wed, Feb 22, 2023 at 5:32 PM Bjørn Jørgensen 
>> wrote:
>>
>>> ./build/mvn clean package
>>>
>>> I'm using Ubuntu rolling, Python 3.11, and OpenJDK 17.
>>>
>>> CompatibilitySuite:
>>> - compatibility MiMa tests *** FAILED ***
>>>   java.lang.AssertionError: assertion failed: Failed to find the jar
>>> inside folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target
>>>   at scala.Predef$.assert(Predef.scala:223)
>>>   at
>>> org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67)
>>>   at
>>> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
>>>   at
>>> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53)
>>>   at
>>> org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$1(CompatibilitySuite.scala:69)
>>>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>>>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>>>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>>>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>>>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>>>   ...
>>> - compatibility API tests: Dataset *** FAILED ***
>>>   java.lang.AssertionError: assertion failed: Failed to find the jar
>>> inside folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target
>>>   at scala.Predef$.assert(Predef.scala:223)
>>>   at
>>> org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67)
>>>   at
>>> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
>>>   at
>>> 

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-23 Thread Allan Folting
Thanks a lot for the questions and comments/feedback!

To address your questions, Dongjoon: I do not intend for these documentation
updates to be tied to the potential changes/suggestions you ask about.

In other words, this proposal is only about adjusting the documentation to
target the majority of people reading it - namely, the large and growing
number of Python users, and new users in particular, as they are often
already familiar with and have a preference for Python when evaluating or
starting to use Spark.

While we may want to strengthen support for Python in other ways, I think
such efforts should be tracked separately from this.

Allan


Logging in SparkExtensions

2023-02-23 Thread Maytas Monsereenusorn
Hi,

I have created a fat/shaded library jar to use in Spark via
SparkExtensions. It is used by setting the spark.sql.extensions conf to a
class within my jar that extends `SparkSessionExtensionsProvider`. The
purpose of this extension jar is to inject my custom UDF functions (see:
https://github.com/apache/spark/pull/22576).
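
For reference, a minimal sketch of such a provider (class and function
names here are illustrative, with Spark's built-in Upper expression
standing in for an actual custom UDF):

import org.apache.spark.sql.{SparkSessionExtensions, SparkSessionExtensionsProvider}
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionInfo, Upper}

// Illustrative provider; activated with
//   --conf spark.sql.extensions=com.example.MyExtensions
class MyExtensions extends SparkSessionExtensionsProvider {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    // Register a SQL function "my_upper"; a real extension would plug in
    // its own Expression subclass instead of the built-in Upper.
    extensions.injectFunction((
      FunctionIdentifier("my_upper"),
      new ExpressionInfo(classOf[Upper].getName, "my_upper"),
      (children: Seq[Expression]) => Upper(children.head)))
  }
}

After that, spark.sql("SELECT my_upper('abc')") should resolve the injected
function.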

The problem I am having is that my UDF Expression class extends the Spark
Logging trait (org.apache.spark.internal.Logging). I was able to compile
the jar and load my extension fine, but when I try to use my UDF in Spark,
I got the following error:

java.lang.AbstractMethodError: Method
com/company/bdp/expressions/NfHMAC.org$apache$spark$internal$Logging$$log__$eq(Lorg/slf4j/Logger;)V
is abstract
  at com.company.bdp.expressions.NfHMAC.org$apache$spark$internal$Logging$$log__$eq(EncryptionExpressions.scala)
  at org.apache.spark.internal.Logging.$init$(Logging.scala:43)
  at com.company.bdp.expressions.NfHMAC.<init>(EncryptionExpressions.scala:47)



The problem seems to be that no logging framework is bound to slf4j when
classes in my extension try to log, although log4j is included in the Spark
classpath itself. I have also tried excluding all logging dependencies from
my fat jar to make sure there is no conflict when Spark loads my extension.
When I instead include a logging framework in my fat jar (e.g.
logback-classic), logging works fine when I run my UDF function. I am not
exactly sure how SparkExtensions are loaded, so I might be missing
something. Is it expected that a library loaded via SparkExtensions needs
to include its own logging framework?
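
A possible workaround, sketched here only under the assumption that the
expression class (named illustratively below) does not strictly need
Spark's internal Logging trait, is to hold a plain slf4j logger instead,
since slf4j itself is already on Spark's classpath:

import org.slf4j.{Logger, LoggerFactory}

// Avoiding org.apache.spark.internal.Logging sidesteps the trait
// initialization seen in the AbstractMethodError above; the trait is
// internal API, so it is also sensitive to which Spark version the jar
// was compiled against.
class MyHmacExpression /* extends Expression ... */ {
  @transient private lazy val log: Logger =
    LoggerFactory.getLogger(getClass)

  def example(): Unit = log.info("evaluating expression")
}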


Thanks,

Maytas


Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-23 Thread Gengliang Wang
Thanks for creating the RC1, Xinrong!

Besides the blockers mentioned by Tom, let's include the following bug fix
in Spark 3.4.0 as well:
[SPARK-42406][SQL] Fix check for missing required fields of to_protobuf


Gengliang

On Wed, Feb 22, 2023 at 3:09 PM Tom Graves 
wrote:

> It looks like there are still blockers open; we need to make sure they are
> addressed before doing a release:
>
> https://issues.apache.org/jira/browse/SPARK-41793
> https://issues.apache.org/jira/browse/SPARK-42444
>
> Tom
> On Tuesday, February 21, 2023 at 10:35:45 PM CST, Xinrong Meng <
> xinrong.apa...@gmail.com> wrote:
>
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.4.0.
>
> The vote is open until 11:59pm Pacific time *February 27th* and passes if
> a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v3.4.0-rc1* (commit
> e2484f626bb338274665a49078b528365ea18c3b):
> https://github.com/apache/spark/tree/v3.4.0-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1435
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc1-docs/
>
> The list of bug fixes going into 3.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>
> This release is using the release script of the tag v3.4.0-rc1.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env, install
> the current RC, and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.0?
> ===
> The current list of open tickets targeted at 3.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Thanks,
> Xinrong Meng
>


Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-23 Thread Mich Talebzadeh
If this is not just flip-flopping the document pages and involves other
changes, then a proper impact analysis needs to be done to assess the
effort involved. Personally, I don't think it really matters.

HTH

On Thu, 23 Feb 2023 at 01:40, Hyukjin Kwon  wrote:

> > 1. Does this suggestion imply that the Python API implementation will be
> the new blocker in the future in terms of feature parity among languages?
> Until now, Python API feature parity was one of the audit items because
> it's not enforced. In other words, Scala and Java have been full-featured
> because they are the main underlying developer languages, while the
> Python/R/SQL environments were nice-to-have.
>
> I think it wouldn't be treated as a blocker... but I do believe we have
> added all new features on the Python side for the last couple of
> releases. So I wouldn't worry about this at the moment - we have been
> doing fine in terms of feature parity.
>
> > 2. Does this suggestion assume that the Python environment is always
> easier for users than Scala/Java? Given that we support Python 3.8 to 3.11,
> the support matrix for Python library dependencies is a problem the Apache
> Spark community has to solve in order to claim that. As we saw at
> SPARK-41454, the Python language also introduces breaking changes
> historically, and we have many `Pinned` Python library issues.
>
> Yes. In fact, regardless of this change, I do believe we should test more
> versions, etc. - at least with scheduled jobs, like we're doing for JDK
> and Scala versions.
>
>
> FWIW, my take on this change is: people use Python and PySpark more
> (according to the chart and stats provided), so let's put those examples
> first :-).
>
>
> On Thu, 23 Feb 2023 at 10:27, Dongjoon Hyun 
> wrote:
>
>> I have two questions to clarify the scope and boundaries.
>>
>> 1. Does this suggestion imply that the Python API implementation will be
>> the new blocker in the future in terms of feature parity among languages?
>> Until now, Python API feature parity was one of the audit items because
>> it's not enforced. In other words, Scala and Java have been full-featured
>> because they are the main underlying developer languages, while the
>> Python/R/SQL environments were nice-to-have.
>>
>> 2. Does this suggestion assume that the Python environment is always
>> easier for users than Scala/Java? Given that we support Python 3.8 to
>> 3.11, the support matrix for Python library dependencies is a problem the
>> Apache Spark community has to solve in order to claim that. As we saw at
>> SPARK-41454, the Python language also introduces breaking changes
>> historically, and we have many `Pinned` Python library issues.
>>
>> Changing documentation is easy, but I hope we can provide clear
>> communication and direction in this effort, because this is one of the
>> most user-facing changes.
>>
>> Dongjoon.
>>
>> On Wed, Feb 22, 2023 at 5:26 PM 416161...@qq.com 
>> wrote:
>>
>>> +1 LGTM
>>>
>>> --
>>> Ruifeng Zheng
>>> ruife...@foxmail.com
>>>
>>> -- Original --
>>> *From:* "Xinrong Meng" ;
>>> *Date:* Thu, Feb 23, 2023 09:17 AM
>>> *To:* "Allan Folting";
>>> *Cc:* "dev";
>>> *Subject:* Re: [DISCUSS] Show Python code examples first in Spark
>>> documentation
>>>
>>> +1 Good idea!
>>>
>>> On Thu, Feb 23, 2023 at 7:41 AM Jack Goodson 
>>> wrote:
>>>
 Good idea. At the company I work at, we discussed using Scala as our
 primary language because technically it is slightly stronger than Python,
 but we ultimately chose Python in the end, as it's easier for other devs
 to be onboarded to our platform, and future hiring for the team would be
 easier.

 On Thu, 23 Feb 2023 at 12:20 PM, Hyukjin Kwon 
 wrote:

> +1 I like this idea too.
>
> On Thu, Feb 23, 2023 at 6:00 AM Allan Folting 
> wrote:
>
>> Hi all,
>>
>> I would like to propose that we show Python code examples first in
>> the Spark documentation where we have multiple programming language
>> examples.
>> An example is on the Quick Start page:
>> https://spark.apache.org/docs/latest/quick-start.html
>>
>> I propose this change because Python has become more popular than the
>> other languages supported in Apache Spark.