Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Hyukjin Kwon
> 1. Does this suggestion imply Python API implementation will be the new blocker in the future in terms of feature parity among languages? Until now, Python API feature parity was one of the audit items because it's not enforced. In other words, Scala and Java have been the full feature because

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Dongjoon Hyun
I have two questions to clarify the scope and boundaries. 1. Does this suggestion imply Python API implementation will be the new blocker in the future in terms of feature parity among languages? Until now, Python API feature parity was one of the audit items because it's not enforced. In other

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread 416161...@qq.com
+1 LGTM RuifengZheng ruife...@foxmail.com

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Xinrong Meng
+1 Good idea! On Thu, Feb 23, 2023 at 7:41 AM Jack Goodson wrote: > Good idea, at the company I work at we discussed using Scala as our > primary language because technically it is slightly stronger than Python > but ultimately chose Python in the end as it’s easier for other devs to be > on

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Jack Goodson
Good idea. At the company I work at, we discussed using Scala as our primary language because technically it is slightly stronger than Python, but we ultimately chose Python because it’s easier to onboard other devs to our platform, and future hiring for the team etc. would be easier. On

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Hyukjin Kwon
+1 I like this idea too. On Thu, Feb 23, 2023 at 6:00 AM Allan Folting wrote: > Hi all, > > I would like to propose that we show Python code examples first in the > Spark documentation where we have multiple programming language examples. > An example is on the Quick Start page: >

Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-22 Thread Tom Graves
It looks like there are still blockers open; we need to make sure they are addressed before doing a release: https://issues.apache.org/jira/browse/SPARK-41793 https://issues.apache.org/jira/browse/SPARK-42444 Tom On Tuesday, February 21, 2023 at 10:35:45 PM CST, Xinrong Meng wrote:

Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-22 Thread Jonathan Kelly
Thanks! I was wondering about that ClientE2ETestSuite failure today, so I'm glad to know that it's also being experienced by others. On a similar note, I am experiencing the following error when running the Python tests with Python 3.7: + ./python/run-tests --python-executables=python3 Running

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Mich Talebzadeh
I have seen many data engineering teams start out with Scala because technically it is the best choice for many reasons and, basically, it is what Spark is written in. I also concur that Python is more popular than Scala because of the advent of data science. A majority of use cases we see these days are

Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-22 Thread Herman van Hovell
Hi All, Thanks for testing the 3.4.0 RC! I apologize for the maven testing failures for the Spark Connect Scala Client. We will try to get those sorted as soon as possible. This is an artifact of having multiple build systems, and only running CI for one (SBT). That, however, is a debate for

Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-22 Thread Bjørn Jørgensen
./build/mvn clean package. I'm using Ubuntu rolling, Python 3.11, OpenJDK 17. CompatibilitySuite: - compatibility MiMa tests *** FAILED *** java.lang.AssertionError: assertion failed: Failed to find the jar inside folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target at

Re: Spark Union performance issue

2023-02-22 Thread Prem Sahoo
Please see inline comments.
> So you union two tables, union the result with another one, and finally with a last one?
1st: union of 2 tables = Result1. 2nd: union of another 2 tables = Result2. 3rd: Result1 union Result2 = finalResult.
> How many columns do all these tables have?
each is having around

Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Sean Owen
FWIW I agree with this. On Wed, Feb 22, 2023 at 2:59 PM Allan Folting wrote: > Hi all, > > I would like to propose that we show Python code examples first in the > Spark documentation where we have multiple programming language examples. > An example is on the Quick Start page: >

Re: Spark Union performance issue

2023-02-22 Thread Zhiyuan Lin
Hi Spark devs, I'm experiencing a Union performance degradation as well. Since this email thread is very related, posting it here to see if anyone has any insights. *Background*: After upgrading a Spark job from Spark 2.4 to Spark 3.1 without any code change, we saw *big performance degradation*

[DISCUSS] Show Python code examples first in Spark documentation

2023-02-22 Thread Allan Folting
Hi all, I would like to propose that we show Python code examples first in the Spark documentation where we have multiple programming language examples. An example is on the Quick Start page: https://spark.apache.org/docs/latest/quick-start.html I propose this change because Python has become

Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-22 Thread Mridul Muralidharan
Signatures, digests, etc. check out fine - thanks for updating them! Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes. The test ClientE2ETestSuite.simple udf failed [1] in the "Connect Client" module ... yet to test the "Spark Protobuf" module due to the failure. Regards,

Re: Spark Union performance issue

2023-02-22 Thread Enrico Minack
So you union two tables, union the result with another one, and finally with a last one? How many columns do all these tables have? Are you sure creating the plan depends on the number of rows? Enrico On 22.02.23 at 19:08, Prem Sahoo wrote: here is the information missed 1. Spark 3.2.0 2.

Re: Spark Union performance issue

2023-02-22 Thread Prem Sahoo
Here is the missing information: 1. Spark 3.2.0 2. It is Scala-based 3. Size of tables will be ~60G 4. The explain plan from Catalyst shows lots of time is being spent in creating the plan 5. Number of unioned tables is 2, then another 2, then finally 2. Slowness is providing result as the data size &

Re: Spark Union performance issue

2023-02-22 Thread Enrico Minack
Plus, the number of unioned tables would be helpful, as well as which downstream operations are performed on the unioned tables. And what "performance issues" exactly do you measure? Enrico On 22.02.23 at 16:50, Mich Talebzadeh wrote: Hi, Few details will help 1. Spark version 2. Spark

Re: Spark Union performance issue

2023-02-22 Thread Mich Talebzadeh
Hi, a few details will help: 1. Spark version 2. Spark SQL, Scala or PySpark 3. Size of tables in join 4. What does explain() or the joining operation show? HTH

Spark Union performance issue

2023-02-22 Thread Prem Sahoo
Hello Team, We are observing Spark Union performance issues when unioning big tables with lots of rows. Do we have any option apart from Union?

Re: Pandas UDF cogroup.applyInPandas with multiple dataframes

2023-02-22 Thread Santosh Pingale
I have opened two PRs: One that tries to maintain backwards compatibility: https://github.com/apache/spark/pull/39902 One that breaks the API to make it cleaner: https://github.com/apache/spark/pull/40122

Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-22 Thread Mridul Muralidharan
Thanks Xinrong! The signature verifications are fine now ... will continue with testing the release. Regards, Mridul On Wed, Feb 22, 2023 at 1:27 AM Xinrong Meng wrote: > Hi Mridul, > > Would you please try that again? It should work now. > > On Wed, Feb 22, 2023 at 2:04 PM Mridul