Re: Code fails when AQE enabled in Spark 3.1

2022-01-31 Thread Gaspar Muñoz
It looks like this commit (https://github.com/apache/spark/commit/a85490659f45410be3588c669248dc4f534d2a71) does the trick. Don't you think this bug is important enough to include in the 3.1 branch? Regards On Thu, Jan 20, 2022 at 8:55 AM, Gaspar Muñoz () wrote: > Hi guys, >
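A hedged workaround sketch while a backport is pending (it assumes the failure is tied to adaptive query execution, per the thread subject; the config key is the standard Spark SQL one):

```
from pyspark.sql import SparkSession

# Sketch: disable adaptive query execution as a temporary workaround,
# assuming the failure only appears with AQE enabled on Spark 3.1.
spark = (
    SparkSession.builder
    .appName("aqe-workaround")
    .config("spark.sql.adaptive.enabled", "false")  # AQE off
    .getOrCreate()
)
```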

Re: [EXTERNAL] Fwd: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

2022-01-31 Thread Martin Grigorov
Hi, On Mon, Jan 31, 2022 at 7:57 PM KS, Rajabhupati wrote: > Thanks a lot Sean. One final question before I close the conversation: how do > we know which features will be added as part of the Spark 3.3 > release? > There will be release notes for 3.3 linked at

bucketBy in pyspark not retaining partition information

2022-01-31 Thread Nitin Siwach
I am reading two datasets that I saved to disk with the ```bucketBy``` option, on the same key and with the same number of buckets. When I read them back and join them, the join should not result in a shuffle. But that isn't what I am seeing. The following code demonstrates the alleged
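Since the code is cut off in the archive, here is a minimal sketch of the pattern being described (table names, key, and bucket count are illustrative; note that ```bucketBy``` requires ```saveAsTable```, and matching bucket specs on both sides should let the join skip the exchange):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df1 = spark.range(0, 100000).selectExpr("id", "id * 2 AS v1")
df2 = spark.range(0, 100000).selectExpr("id", "id * 3 AS v2")

# Write both datasets bucketed by the join key with the same bucket count.
df1.write.bucketBy(16, "id").sortBy("id").mode("overwrite").saveAsTable("t1")
df2.write.bucketBy(16, "id").sortBy("id").mode("overwrite").saveAsTable("t2")

# Read back and join on the bucketed key, then inspect the plan:
# no Exchange before the SortMergeJoin means the shuffle was avoided.
spark.table("t1").join(spark.table("t2"), "id").explain()
```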

Re: [EXTERNAL] Fwd: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

2022-01-31 Thread KS, Rajabhupati
Thanks a lot Sean. One final question before I close the conversation: how do we know which features will be added as part of the Spark 3.3 release? Regards Rajabhupati From: Sean Owen Sent: Monday, January 31, 2022 10:50:16 PM To: KS, Rajabhupati Cc:

RE: [EXTERNAL] Fwd: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

2022-01-31 Thread KS, Rajabhupati
Thanks Sean, when is Spark 3.3.0 expected to be released? Regards Raja From: Sean Owen Sent: Monday, January 31, 2022 10:28 PM To: KS, Rajabhupati Subject: [EXTERNAL] Fwd: Log4j upgrade in spark binary from 1.2.17 to 2.17.1 Further,

Re: [EXTERNAL] Fwd: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

2022-01-31 Thread Sean Owen
https://spark.apache.org/versioning-policy.html On Mon, Jan 31, 2022 at 11:15 AM KS, Rajabhupati wrote: > Thanks Sean, when is Spark 3.3.0 expected to be released? > > > > Regards > > Raja > > *From:* Sean Owen > *Sent:* Monday, January 31, 2022 10:28 PM > *To:* KS, Rajabhupati > *Subject:*

Re: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

2022-01-31 Thread Sean Owen
(BTW you are sending to the Spark incubator list, and Spark has not been in incubation for about 7 years. Use user@spark.apache.org) What update are you looking for? This has been discussed extensively on the Spark mailing list. Spark is not evidently vulnerable to this. 3.3.0 will include log4j

RE: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

2022-01-31 Thread KS, Rajabhupati
Hi Team, is there any update on this request? We did see JIRA https://issues.apache.org/jira/browse/SPARK-37630 for this request, but we see it is closed. Regards Raja From: KS, Rajabhupati Sent: Sunday, January 30, 2022 9:03 AM To: u...@spark.incubator.apache.org Subject: Log4j upgrade in

Re:

2022-01-31 Thread Bitfox
Please send an e-mail to user-unsubscr...@spark.apache.org to unsubscribe from the mailing list. On Mon, Jan 31, 2022 at 10:11 PM wrote: > unsubscribe > > >

Re:

2022-01-31 Thread Bitfox
Please send an e-mail to user-unsubscr...@spark.apache.org to unsubscribe from the mailing list. On Mon, Jan 31, 2022 at 10:23 PM Gaetano Fabiano wrote: > Unsubscribe > > Sent from iPhone > > - > To unsubscribe e-mail:

[no subject]

2022-01-31 Thread Gaetano Fabiano
Unsubscribe Sent from iPhone - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

[no subject]

2022-01-31 Thread pduflot
unsubscribe

Re: A Persisted Spark DataFrame is computed twice

2022-01-31 Thread Sean Owen
One guess - you are doing two things here, count() and write(). There is a persist(), but it's async. It won't necessarily wait for the persist to finish before proceeding and may have to recompute at least some partitions for the second op. You could debug further by looking at the stages and
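A minimal sketch of the pattern Sean describes (output path and column names are illustrative): run one action to populate the cache, so the second action can reuse the cached partitions instead of recomputing them.

```
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000000).selectExpr("id", "id % 7 AS bucket")

# persist() only marks the plan as cacheable; nothing is stored yet.
df = df.persist(StorageLevel.DISK_ONLY)

# The first action populates the cache...
df.count()

# ...so this second action should read cached partitions instead of
# recomputing them. Verify in the Spark UI (Storage tab, stage details).
df.write.mode("overwrite").parquet("/tmp/df_out")
```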

Regarding Spark Cassandra Metrics

2022-01-31 Thread Yogesh Kumar Garg
Hi all, I am developing a Spark application where I am loading data into Cassandra, using the Spark Cassandra connector for the same. I have created a fat jar with all the dependencies and submitted it using spark-submit. I am able to load the data successfully into Cassandra, but I
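For reference, a hedged sketch of the write path being described (keyspace, table, and host are illustrative; it assumes the spark-cassandra-connector is on the classpath via the fat jar or --packages):

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.cassandra.connection.host", "127.0.0.1")  # illustrative
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Write through the connector's DataSource API.
(df.write
    .format("org.apache.spark.sql.cassandra")
    .options(table="my_table", keyspace="my_keyspace")
    .mode("append")
    .save())
```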

Re: unsubscribe

2022-01-31 Thread Bitfox
The signature in your messages already shows how to unsubscribe. To unsubscribe e-mail: user-unsubscr...@spark.apache.org On Mon, Jan 31, 2022 at 7:53 PM Lucas Schroeder Rossi wrote: > unsubscribe > > - > To unsubscribe e-mail:

unsubscribe

2022-01-31 Thread Lucas Schroeder Rossi
unsubscribe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Migration to Spark 3.2

2022-01-31 Thread Aurélien Mazoyer
Hi Stephen, I managed to solve my issue: I had a conflicting version of jackson-databind that came from the parent pom. Thank you, Aurelien On Sun, Jan 30, 2022 at 11:28 PM, Aurélien Mazoyer wrote: > Hi Stephen, > > Thank you for your answer. Yes, I changed the scope to "provided" but got > the

Re: why the pyspark RDD API is so slow?

2022-01-31 Thread Sebastian Piu
When you operate on a DataFrame from the Python side you are just invoking methods in the JVM via a proxy (py4j), so it is almost the same as coding in Java itself. This holds as long as you don't define any UDFs or any other code that needs to invoke Python for processing. Check the High Performance Spark
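A small sketch contrasting the two paths described above (column names are illustrative): the built-in expression stays entirely in the JVM, while the Python UDF forces every row through Python worker processes.

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000000)

# JVM-only: the expression is planned and executed inside Spark's JVM.
jvm_only = df.select((col("id") * 2).alias("doubled"))

# Python UDF: rows are serialized out to Python workers and back.
double_py = udf(lambda x: x * 2, LongType())
via_python = df.select(double_py("id").alias("doubled"))
```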

Re: why the pyspark RDD API is so slow?

2022-01-31 Thread Bitfox
Hi, in PySpark, RDDs need to be serialised/deserialised, but DataFrames don't? Why? Thanks On Mon, Jan 31, 2022 at 4:46 PM Khalid Mammadov wrote: > Your Scala program does not use any Spark API, hence it is faster than the others. If > you write the same code in pure Python I think it will be even faster than

Re: why the pyspark RDD API is so slow?

2022-01-31 Thread Khalid Mammadov
Your Scala program does not use any Spark API, hence it is faster than the others. If you write the same code in pure Python, I think it will be even faster than the Scala program, especially taking into account that these two programs run on a single VM. Regarding DataFrames and RDDs, I would suggest using DataFrames
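A small sketch of that suggestion (sizes are illustrative): the RDD version ships each element across the JVM/Python boundary to run the lambda, while the DataFrame version keeps the whole pipeline in the JVM.

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum

spark = SparkSession.builder.getOrCreate()

# RDD API: the map closure runs in Python worker processes,
# so every element crosses the JVM <-> Python boundary.
rdd_total = spark.sparkContext.range(0, 1000000).map(lambda x: x * 2).sum()

# DataFrame API: the same computation compiled to JVM expressions.
df_total = (
    spark.range(0, 1000000)
    .selectExpr("id * 2 AS doubled")
    .agg(spark_sum("doubled"))
    .first()[0]
)
```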

Re:[ANNOUNCE] Apache Spark 3.2.1 released

2022-01-31 Thread beliefer
Thank you huaxin gao! Glad to see the release. At 2022-01-29 09:07:13, "huaxin gao" wrote: We are happy to announce the availability of Spark 3.2.1! Spark 3.2.1 is a maintenance release containing stability fixes. This release is based on the branch-3.2 maintenance branch of Spark. We

Re: A Persisted Spark DataFrame is computed twice

2022-01-31 Thread Sebastian Piu
Can you share the stages as seen in the Spark UI for the count and coalesce jobs? My suggestion of moving things around was just for troubleshooting rather than a solution, if that wasn't clear before. On Mon, 31 Jan 2022, 08:07 Benjamin Du wrote: > Removing coalesce didn't help either. > > > >

Re: A Persisted Spark DataFrame is computed twice

2022-01-31 Thread Benjamin Du
I did check the execution plan; there were two stages, and both show that the pandas UDF (which takes almost all of the DataFrame's computation time) is executed. It didn't seem to be an issue with repartition/coalesce, as the DataFrame was still computed twice after removing coalesce.

unsubscribe

2022-01-31 Thread Rajeev

Re: A Persisted Spark DataFrame is computed twice

2022-01-31 Thread Benjamin Du
Removing coalesce didn't help either. Best, Ben Du Personal Blog | GitHub | Bitbucket | Docker Hub From: Deepak Sharma Sent: