Re: Code fails when AQE enabled in Spark 3.1

2022-01-31 Thread Gaspar Muñoz
It looks like this commit (
https://github.com/apache/spark/commit/a85490659f45410be3588c669248dc4f534d2a71)
does the trick.


Don't you think this bug is important enough to include in the 3.1 branch?
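
For reference, the conf discussed in this thread can be set per job without touching
cluster defaults; a minimal PySpark sketch, assuming Spark 3.1.x where AQE is off by
default:

from pyspark.sql import SparkSession

# Enable Adaptive Query Execution explicitly for this job; the same key can be
# passed at submit time as --conf spark.sql.adaptive.enabled=true.
spark = (SparkSession.builder
         .config("spark.sql.adaptive.enabled", "true")
         .getOrCreate())
print(spark.conf.get("spark.sql.adaptive.enabled"))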

Regards

On Thu, 20 Jan 2022 at 8:55, Gaspar Muñoz ()
wrote:

> Hi guys,
>
> hundreds of Spark jobs run at my company every day. We are running Spark
> 3.1.2 and we want to enable Adaptive Query Execution (AQE) for all of them.
> We can't upgrade to 3.2 right now, so we want to enable it explicitly using
> the appropriate conf at spark-submit time.
>
> Some of the jobs fail when AQE is enabled, but I can't work out what is
> happening. To give you more information, I prepared a small snippet for
> spark-shell that fails in Spark 3.1 with AQE enabled and works with it
> disabled. It also works in 3.2, but I think this may be a bug that can be
> fixed for 3.1.3.
>
> The code and explanation can be found here:
> https://issues.apache.org/jira/browse/SPARK-37898
>
> Regards
> --
> Gaspar Muñoz Soria
>


-- 
Gaspar Muñoz Soria

Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 828 6473


Re: [EXTERNAL] Fwd: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

2022-01-31 Thread Martin Grigorov
Hi,

On Mon, Jan 31, 2022 at 7:57 PM KS, Rajabhupati
 wrote:

> Thanks a lot Sean. One final question before I close the conversation: how do
> we know what features will be added as part of the Spark 3.3 release?
>

There will be release notes for 3.3 linked at
https://spark.apache.org/downloads.html#release-notes-for-stable-releases
once it is released.


>
> Regards
> Rajabhupati
> --
> *From:* Sean Owen 
> *Sent:* Monday, January 31, 2022 10:50:16 PM
> *To:* KS, Rajabhupati 
> *Cc:* user@spark.apache.org 
> *Subject:* Re: [EXTERNAL] Fwd: Log4j upgrade in spark binary from 1.2.17
> to 2.17.1
>
> https://spark.apache.org/versioning-policy.html
> 
>
> On Mon, Jan 31, 2022 at 11:15 AM KS, Rajabhupati <
> rajabhupati...@comcast.com> wrote:
>
> Thanks Sean, when is Spark 3.3.0 expected to be released?
>
>
>
> Regards
>
> Raja
>
> *From:* Sean Owen 
> *Sent:* Monday, January 31, 2022 10:28 PM
> *To:* KS, Rajabhupati 
> *Subject:* [EXTERNAL] Fwd: Log4j upgrade in spark binary from 1.2.17 to
> 2.17.1
>
>
>
> Further, you're using an email that can't receive email ...
>
> -- Forwarded message -
> From: *Sean Owen* 
> Date: Mon, Jan 31, 2022 at 10:56 AM
> Subject: Re: Log4j upgrade in spark binary from 1.2.17 to 2.17.1
> To: KS, Rajabhupati 
> Cc: u...@spark.incubator.apache.org ,
> d...@spark.incubator.apache.org 
>
>
>
> (BTW you are sending to the Spark incubator list, and Spark has not been
> in incubation for about 7 years. Use user@spark.apache.org)
>
>
>
> What update are you looking for? this has been discussed extensively on
> the Spark mailing list.
>
> Spark is not evidently vulnerable to this. 3.3.0 will include log4j 2.17
> anyway.
>
>
>
> The ticket you cite points you to the correct ticket:
> https://issues.apache.org/jira/browse/SPARK-6305
> 
>
>
>
> On Mon, Jan 31, 2022 at 10:53 AM KS, Rajabhupati <
> rajabhupati...@comcast.com.invalid> wrote:
>
> Hi Team ,
>
>
>
> Is there any update on this request ?
>
>
>
> We did see Jira https://issues.apache.org/jira/browse/SPARK-37630
> for this request, but we see it is closed.
>
>
>
> Regards
>
> Raja
>
>
>
> *From:* KS, Rajabhupati 
> *Sent:* Sunday, January 30, 2022 9:03 AM
> *To:* u...@spark.incubator.apache.org
> *Subject:* Log4j upgrade in spark binary from 1.2.17 to 2.17.1
>
>
>
> Hi Team,
>
>
>
> We were checking for a log4j upgrade in the open source Spark version to
> avoid the recent vulnerability in the Spark binary. Is there any new release
> planned that upgrades log4j from 1.2.17 to 2.17.1? A prompt response is
> appreciated.
>
>
>
>
>
> Regards
>
> Rajabhupati
>
>


bucketBy in pyspark not retaining partition information

2022-01-31 Thread Nitin Siwach
I am reading two datasets that I saved to disk with the ```bucketBy``` option
on the same key and with the same number of buckets. When I read them back and
join them, the join should not require a shuffle.

But that isn't what I am seeing.

*The following code demonstrates the alleged behavior:*

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.config("spark.sql.autoBroadcastJoinThreshold",
"-1").getOrCreate()
import random

data1 = [(i,random.randint(1,5),random.randint(1,5)) for t in range(2) for
i in range(5)]
data2 = [(i,random.randint(1,5),random.randint(1,5)) for t in range(2) for
i in range(5)]
df1 = spark.createDataFrame(data1, schema='a int, b int, c int')
df2 = spark.createDataFrame(data2, schema='a int, b int, c int')

parquet_path1 = './bucket_test_parquet1'
parquet_path2 = './bucket_test_parquet2'

df1.write.bucketBy(5,"a").format("parquet").saveAsTable('df',path=parquet_path1,mode='overwrite')
df2.write.bucketBy(5,"a").format("parquet").saveAsTable('df',path=parquet_path2,mode='overwrite')

read_parquet1 = spark.read.format("parquet").load(parquet_path1)
read_parquet1.createOrReplaceTempView("read_parquet1")
read_parquet1 = spark.sql("SELECT * FROM read_parquet1")

read_parquet2 = spark.read.format("parquet").load(parquet_path2)
read_parquet2.createOrReplaceTempView("read_parquet2")
read_parquet2 = spark.sql("SELECT * FROM read_parquet2")
read_parquet1.join(read_parquet2,on='a').explain()

*The output that I am getting is*

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [a#24, b#25, c#26, b#34, c#35]
   +- SortMergeJoin [a#24], [a#33], Inner
  :- Sort [a#24 ASC NULLS FIRST], false, 0
  :  +- Exchange hashpartitioning(a#24, 200), ENSURE_REQUIREMENTS, [id=#61]
  : +- Filter isnotnull(a#24)
  :+- FileScan parquet [a#24,b#25,c#26] Batched: true,
DataFilters: [isnotnull(a#24)], Format: Parquet, Location:
InMemoryFileIndex(1
paths)[file:/home/nitin/pymonsoon/bucket_test_parquet1],
PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema:
struct
  +- Sort [a#33 ASC NULLS FIRST], false, 0
 +- Exchange hashpartitioning(a#33, 200), ENSURE_REQUIREMENTS, [id=#62]
+- Filter isnotnull(a#33)
   +- FileScan parquet [a#33,b#34,c#35] Batched: true,
DataFilters: [isnotnull(a#33)], Format: Parquet, Location:
InMemoryFileIndex(1
paths)[file:/home/nitin/pymonsoon/bucket_test_parquet2],
PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema:
struct


*This clearly shows hash partitioning (an Exchange) going on. Kindly help me
understand the utility of ```bucketBy```.*
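
One possible explanation, offered as an assumption rather than a confirmed
diagnosis: spark.read.parquet() on the raw path does not see the bucketing
metadata, which saveAsTable records in the metastore. A short sketch,
continuing from the snippet above, of reading the same data back through the
catalog (the df1/df2 table names registered above) so the bucket spec is
honoured:

# Read the bucketed tables back via the metastore instead of the raw paths,
# so the bucketBy(5, "a") metadata recorded by saveAsTable is applied.
t1 = spark.table("df1")
t2 = spark.table("df2")
# Expected: a SortMergeJoin on "a" with no Exchange on either side.
t1.join(t2, on="a").explain()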


Re: [EXTERNAL] Fwd: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

2022-01-31 Thread KS, Rajabhupati
Thanks a lot Sean. One final question before I close the conversation: how do we
know what features will be added as part of the Spark 3.3 release?

Regards
Rajabhupati

From: Sean Owen 
Sent: Monday, January 31, 2022 10:50:16 PM
To: KS, Rajabhupati 
Cc: user@spark.apache.org 
Subject: Re: [EXTERNAL] Fwd: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

https://spark.apache.org/versioning-policy.html

On Mon, Jan 31, 2022 at 11:15 AM KS, Rajabhupati
<rajabhupati...@comcast.com> wrote:

Thanks Sean, when is Spark 3.3.0 expected to be released?



Regards

Raja

From: Sean Owen <sro...@gmail.com>
Sent: Monday, January 31, 2022 10:28 PM
To: KS, Rajabhupati <rajabhupati...@comcast.com>
Subject: [EXTERNAL] Fwd: Log4j upgrade in spark binary from 1.2.17 to 2.17.1



Further, you're using an email that can't receive email ...

-- Forwarded message -
From: Sean Owen <sro...@gmail.com>
Date: Mon, Jan 31, 2022 at 10:56 AM
Subject: Re: Log4j upgrade in spark binary from 1.2.17 to 2.17.1
To: KS, Rajabhupati <rajabhupati...@comcast.com.invalid>
Cc: u...@spark.incubator.apache.org, d...@spark.incubator.apache.org



(BTW you are sending to the Spark incubator list, and Spark has not been in 
incubation for about 7 years. Use 
user@spark.apache.org)



What update are you looking for? this has been discussed extensively on the 
Spark mailing list.

Spark is not evidently vulnerable to this. 3.3.0 will include log4j 2.17 anyway.



The ticket you cite points you to the correct ticket: 
https://issues.apache.org/jira/browse/SPARK-6305



On Mon, Jan 31, 2022 at 10:53 AM KS, Rajabhupati
<rajabhupati...@comcast.com.invalid> wrote:

Hi Team ,



Is there any update on this request ?



We did see Jira https://issues.apache.org/jira/browse/SPARK-37630
for this request, but we see it is closed.



Regards

Raja



From: KS, Rajabhupati <rajabhupati...@comcast.com>
Sent: Sunday, January 30, 2022 9:03 AM
To: u...@spark.incubator.apache.org
Subject: Log4j upgrade in spark binary from 1.2.17 to 2.17.1



Hi Team,



We were checking for a log4j upgrade in the open source Spark version to avoid
the recent vulnerability in the Spark binary. Is there any new release planned
that upgrades log4j from 1.2.17 to 2.17.1? A prompt response is appreciated.





Regards

Rajabhupati


RE: [EXTERNAL] Fwd: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

2022-01-31 Thread KS, Rajabhupati
Thanks Sean, when is Spark 3.3.0 expected to be released?

Regards
Raja
From: Sean Owen <sro...@gmail.com>
Sent: Monday, January 31, 2022 10:28 PM
To: KS, Rajabhupati <rajabhupati...@comcast.com>
Subject: [EXTERNAL] Fwd: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

Further, you're using an email that can't receive email ...
-- Forwarded message -
From: Sean Owen <sro...@gmail.com>
Date: Mon, Jan 31, 2022 at 10:56 AM
Subject: Re: Log4j upgrade in spark binary from 1.2.17 to 2.17.1
To: KS, Rajabhupati <rajabhupati...@comcast.com.invalid>
Cc: u...@spark.incubator.apache.org, d...@spark.incubator.apache.org

(BTW you are sending to the Spark incubator list, and Spark has not been in 
incubation for about 7 years. Use 
user@spark.apache.org)

What update are you looking for? this has been discussed extensively on the 
Spark mailing list.
Spark is not evidently vulnerable to this. 3.3.0 will include log4j 2.17 anyway.

The ticket you cite points you to the correct ticket: 
https://issues.apache.org/jira/browse/SPARK-6305

On Mon, Jan 31, 2022 at 10:53 AM KS, Rajabhupati
<rajabhupati...@comcast.com.invalid> wrote:
Hi Team ,

Is there any update on this request ?

We did see Jira https://issues.apache.org/jira/browse/SPARK-37630
for this request, but we see it is closed.

Regards
Raja

From: KS, Rajabhupati <rajabhupati...@comcast.com>
Sent: Sunday, January 30, 2022 9:03 AM
To: u...@spark.incubator.apache.org
Subject: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

Hi Team,

We were checking for a log4j upgrade in the open source Spark version to avoid
the recent vulnerability in the Spark binary. Is there any new release planned
that upgrades log4j from 1.2.17 to 2.17.1? A prompt response is appreciated.


Regards
Rajabhupati


Re: [EXTERNAL] Fwd: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

2022-01-31 Thread Sean Owen
https://spark.apache.org/versioning-policy.html

On Mon, Jan 31, 2022 at 11:15 AM KS, Rajabhupati 
wrote:

> Thanks Sean, when is Spark 3.3.0 expected to be released?
>
>
>
> Regards
>
> Raja
>
> *From:* Sean Owen 
> *Sent:* Monday, January 31, 2022 10:28 PM
> *To:* KS, Rajabhupati 
> *Subject:* [EXTERNAL] Fwd: Log4j upgrade in spark binary from 1.2.17 to
> 2.17.1
>
>
>
> Further, you're using an email that can't receive email ...
>
> -- Forwarded message -
> From: *Sean Owen* 
> Date: Mon, Jan 31, 2022 at 10:56 AM
> Subject: Re: Log4j upgrade in spark binary from 1.2.17 to 2.17.1
> To: KS, Rajabhupati 
> Cc: u...@spark.incubator.apache.org ,
> d...@spark.incubator.apache.org 
>
>
>
> (BTW you are sending to the Spark incubator list, and Spark has not been
> in incubation for about 7 years. Use user@spark.apache.org)
>
>
>
> What update are you looking for? this has been discussed extensively on
> the Spark mailing list.
>
> Spark is not evidently vulnerable to this. 3.3.0 will include log4j 2.17
> anyway.
>
>
>
> The ticket you cite points you to the correct ticket:
> https://issues.apache.org/jira/browse/SPARK-6305
> 
>
>
>
> On Mon, Jan 31, 2022 at 10:53 AM KS, Rajabhupati <
> rajabhupati...@comcast.com.invalid> wrote:
>
> Hi Team ,
>
>
>
> Is there any update on this request ?
>
>
>
> We did see Jira https://issues.apache.org/jira/browse/SPARK-37630
> for this request, but we see it is closed.
>
>
>
> Regards
>
> Raja
>
>
>
> *From:* KS, Rajabhupati 
> *Sent:* Sunday, January 30, 2022 9:03 AM
> *To:* u...@spark.incubator.apache.org
> *Subject:* Log4j upgrade in spark binary from 1.2.17 to 2.17.1
>
>
>
> Hi Team,
>
>
>
> We were checking for a log4j upgrade in the open source Spark version to
> avoid the recent vulnerability in the Spark binary. Is there any new release
> planned that upgrades log4j from 1.2.17 to 2.17.1? A prompt response is
> appreciated.
>
>
>
>
>
> Regards
>
> Rajabhupati
>
>


Re: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

2022-01-31 Thread Sean Owen
(BTW you are sending to the Spark incubator list, and Spark has not been in
incubation for about 7 years. Use user@spark.apache.org)

What update are you looking for? This has been discussed extensively on the
Spark mailing list.
Spark is not evidently vulnerable to this. 3.3.0 will include log4j 2.17
anyway.

The ticket you cite points you to the correct ticket:
https://issues.apache.org/jira/browse/SPARK-6305

On Mon, Jan 31, 2022 at 10:53 AM KS, Rajabhupati
 wrote:

> Hi Team ,
>
>
>
> Is there any update on this request ?
>
>
>
> We did see Jira https://issues.apache.org/jira/browse/SPARK-37630 for
> this request, but we see it is closed.
>
>
>
> Regards
>
> Raja
>
>
>
> *From:* KS, Rajabhupati 
> *Sent:* Sunday, January 30, 2022 9:03 AM
> *To:* u...@spark.incubator.apache.org
> *Subject:* Log4j upgrade in spark binary from 1.2.17 to 2.17.1
>
>
>
> Hi Team,
>
>
>
> We were checking for a log4j upgrade in the open source Spark version to
> avoid the recent vulnerability in the Spark binary. Is there any new release
> planned that upgrades log4j from 1.2.17 to 2.17.1? A prompt response is
> appreciated.
>
>
>
>
>
> Regards
>
> Rajabhupati
>


RE: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

2022-01-31 Thread KS, Rajabhupati
Hi Team ,

Is there any update on this request ?

We did see Jira https://issues.apache.org/jira/browse/SPARK-37630 for this
request, but we see it is closed.

Regards
Raja

From: KS, Rajabhupati 
Sent: Sunday, January 30, 2022 9:03 AM
To: u...@spark.incubator.apache.org
Subject: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

Hi Team,

We were checking for a log4j upgrade in the open source Spark version to avoid
the recent vulnerability in the Spark binary. Is there any new release planned
that upgrades log4j from 1.2.17 to 2.17.1? A prompt response is appreciated.


Regards
Rajabhupati


Re:

2022-01-31 Thread Bitfox
Please send an e-mail: user-unsubscr...@spark.apache.org
to unsubscribe yourself from the mailing list.


On Mon, Jan 31, 2022 at 10:11 PM  wrote:

> unsubscribe
>
>
>


Re:

2022-01-31 Thread Bitfox
Please send an e-mail: user-unsubscr...@spark.apache.org
to unsubscribe yourself from the mailing list.


On Mon, Jan 31, 2022 at 10:23 PM Gaetano Fabiano 
wrote:

> Unsubscribe
>
> Sent from iPhone
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


[no subject]

2022-01-31 Thread Gaetano Fabiano
Unsubscribe 

Sent from iPhone

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[no subject]

2022-01-31 Thread pduflot
unsubscribe

 



Re: A Persisted Spark DataFrame is computed twice

2022-01-31 Thread Sean Owen
One guess - you are doing two things here, count() and write(). There is a
persist(), but it's async. It won't necessarily wait for the persist to
finish before proceeding and may have to recompute at least some partitions
for the second op. You could debug further by looking at the stages and
seeing what exactly is executing and where it uses cached partitions or not.
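
A minimal sketch of that kind of check, mirroring the pipeline described in
this thread (the input path, filter, and output path here are placeholders,
not the thread's real code):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.parquet("/input/hdfs/path")   # placeholder input path
        .filter("col0 IS NOT NULL")            # placeholder for the real filter / pandas UDF step
        .persist(StorageLevel.DISK_ONLY))

df.count()              # action 1: runs the full lineage and fills the persisted store
print(df.storageLevel)  # confirms the storage level requested on this plan
df.count()              # a second count should be fast if partitions were actually cached
df.coalesce(300).write.mode("overwrite").parquet("/tmp/output_mod")  # compare this job's stages in the UI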

On Mon, Jan 31, 2022 at 2:12 AM Benjamin Du  wrote:

> I did check the execution plan, there were 2 stages and both stages show
> that the pandas UDF (which takes almost all the computation time of the
> DataFrame) is executed.
>
> It didn't seem to be an issue of repartition/coalesce as the DataFrame was
> still computed twice after removing coalesce.
>
>
>


Regarding Spark Cassandra Metrics

2022-01-31 Thread Yogesh Kumar Garg
Hi all,

I am developing a spark application where I am loading the data into
Cassandra and I am using the Spark Cassandra connector for the same. I have
created a FAT jar with all the dependencies and submitted that using
spark-submit.

I am able to load the data successfully to cassandra, but I am not able to
get the metrics from the spark cassandra connector. I checked the executor
logs and saw that the following properties failed to initialize because of
the mentioned error.

Properties:

"spark.metrics.conf.driver.source.cassandra-connector.class":
"org.apache.spark.metrics.CassandraConnectorSource"
"spark.metrics.conf.executor.source.cassandra-connector.class":
"org.apache.spark.metrics.CassandraConnectorSource"

Error:


22/01/28 15:30:55 ERROR MetricsSystem: Source class
org.apache.spark.metrics.CassandraConnectorSource cannot be
instantiated
java.lang.ClassNotFoundException:
org.apache.spark.metrics.CassandraConnectorSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:235)
at 
org.apache.spark.metrics.MetricsSystem$$anonfun$registerSources$1.apply(MetricsSystem.scala:182)
at 
org.apache.spark.metrics.MetricsSystem$$anonfun$registerSources$1.apply(MetricsSystem.scala:179)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at 
org.apache.spark.metrics.MetricsSystem.registerSources(MetricsSystem.scala:179)
at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:101)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:364)
at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:200)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:228)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:293)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)


I am loading the data to cassandra using the below code:

cassandraTableDataset.toDF(cassandraTable.getRenamedColumns())
    .write().format(sparkCassandraFormat)
    .options(ImmutableMap.of(cassandraKeyspaceString, getKeyspace(config),
                             cassandraTableString, getTable(config)))
    .mode(SaveMode.Append)
    .save();


I cannot copy the spark cassandra connector jar to all the nodes in the
cluster because of some restrictions.

*Solutions tried:*

*Solution 1: *

I used the spark.jars and spark.executor.extraClassPath options, but they did
not work, as the executor's Spark session is created before these jars (or the
FAT application jar) are fetched/copied to the executor node.

*Solution 2:*

I tried to manually initialize the
org.apache.spark.metrics.CassandraConnectorSource class and register it with
the SparkEnv metrics system just before the Cassandra load, and again it did
not work. I assume these changes take effect on the driver only.

*Solution 3:*

I also tried to set the same properties using
sparkEnvironment.getSparkSession().conf().set(), and it did not work either. I
do not know whether the above-mentioned properties can be added at runtime, but
I was hoping this would let me change the executor config at runtime.


Spark Version: 2.3.0
Scala Version: 2.11
Spark Cassandra Connector:
com.datastax.spark:spark-cassandra-connector_2.11:2.3.0

Please help with this issue, as these metrics are important. Thanks in
advance.

Thank You,
Yogesh


Re: unsubscribe

2022-01-31 Thread Bitfox
The signature in your messages already shows how to unsubscribe.

To unsubscribe e-mail: user-unsubscr...@spark.apache.org

On Mon, Jan 31, 2022 at 7:53 PM Lucas Schroeder Rossi 
wrote:

> unsubscribe
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


unsubscribe

2022-01-31 Thread Lucas Schroeder Rossi
unsubscribe

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



unsubscribe

2022-01-31 Thread Lucas Schroeder Rossi
unsubscribe

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Migration to Spark 3.2

2022-01-31 Thread Aurélien Mazoyer
Hi Stephen,

I managed to solve my issue: I had a conflicting version of jackson-databind
that came from the parent pom.

Thank you,

Aurelien

On Sun, 30 Jan 2022 at 23:28, Aurélien Mazoyer  wrote:

> Hi Stephen,
>
> Thank you for your answer. Yes, I changed the scope to "provided" but got
> the same error :-( FYI, I am getting this error while running tests.
>
> Regards,
>
> Aurelien
>
> On Thu, 27 Jan 2022 at 23:57, Stephen Coy  wrote:
>
>> Hi Aurélien,
>>
>> Your Jackson versions look fine.
>>
>> What happens if you change the scope of your Jackson dependencies to
>> “provided”?
>>
>> This should result in your application using the versions provided by
>> Spark and avoid this potential collision.
>>
>> Cheers,
>>
>> Steve C
>>
>> On 27 Jan 2022, at 9:48 pm, Aurélien Mazoyer 
>> wrote:
>>
>> Hi Stephen,
>>
>> Thank you for your answer!
>> Here it is; it seems that the Jackson dependencies are correct, no?
>>
>> Thanks,
>>
>> [INFO] com.krrier:spark-lib-full:jar:0.0.1-SNAPSHOT
>> [INFO] +- com.krrier:backend:jar:0.0.1-SNAPSHOT:compile
>> [INFO] |  \- com.krrier:data:jar:0.0.1-SNAPSHOT:compile
>> [INFO] +- com.krrier:plugin-api:jar:0.0.1-SNAPSHOT:compile
>> [INFO] +- com.opencsv:opencsv:jar:4.2:compile
>> [INFO] |  +- org.apache.commons:commons-lang3:jar:3.9:compile
>> [INFO] |  +- org.apache.commons:commons-text:jar:1.3:compile
>> [INFO] |  +- commons-beanutils:commons-beanutils:jar:1.9.3:compile
>> [INFO] |  |  \- commons-logging:commons-logging:jar:1.2:compile
>> [INFO] |  \- org.apache.commons:commons-collections4:jar:4.1:compile
>> [INFO] +- org.apache.solr:solr-solrj:jar:7.4.0:compile
>> [INFO] |  +- org.apache.commons:commons-math3:jar:3.6.1:compile
>> [INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.5.3:compile
>> [INFO] |  +- org.apache.httpcomponents:httpcore:jar:4.4.6:compile
>> [INFO] |  +- org.apache.httpcomponents:httpmime:jar:4.5.3:compile
>> [INFO] |  +- org.apache.zookeeper:zookeeper:jar:3.4.11:compile
>> [INFO] |  +- org.codehaus.woodstox:stax2-api:jar:3.1.4:compile
>> [INFO] |  +- org.codehaus.woodstox:woodstox-core-asl:jar:4.4.1:compile
>> [INFO] |  \- org.noggit:noggit:jar:0.8:compile
>> [INFO] +- com.databricks:spark-xml_2.12:jar:0.5.0:compile
>> [INFO] +- org.apache.tika:tika-parsers:jar:1.24:compile
>> [INFO] |  +- org.apache.tika:tika-core:jar:1.24:compile
>> [INFO] |  +- org.glassfish.jaxb:jaxb-runtime:jar:2.3.2:compile
>> [INFO] |  |  +- jakarta.xml.bind:jakarta.xml.bind-api:jar:2.3.2:compile
>> [INFO] |  |  +- org.glassfish.jaxb:txw2:jar:2.3.2:compile
>> [INFO] |  |  +- com.sun.istack:istack-commons-runtime:jar:3.0.8:compile
>> [INFO] |  |  +- org.jvnet.staxex:stax-ex:jar:1.8.1:compile
>> [INFO] |  |  \- com.sun.xml.fastinfoset:FastInfoset:jar:1.2.16:compile
>> [INFO] |  +- com.sun.activation:jakarta.activation:jar:1.2.1:compile
>> [INFO] |  +- xerces:xercesImpl:jar:2.12.0:compile
>> [INFO] |  |  \- xml-apis:xml-apis:jar:1.4.01:compile
>> [INFO] |  +- javax.annotation:javax.annotation-api:jar:1.3.2:compile
>> [INFO] |  +- org.gagravarr:vorbis-java-tika:jar:0.8:compile
>> [INFO] |  +- org.tallison:jmatio:jar:1.5:compile
>> [INFO] |  +- org.apache.james:apache-mime4j-core:jar:0.8.3:compile
>> [INFO] |  +- org.apache.james:apache-mime4j-dom:jar:0.8.3:compile
>> [INFO] |  +- org.tukaani:xz:jar:1.8:compile
>> [INFO] |  +- com.epam:parso:jar:2.0.11:compile
>> [INFO] |  +- org.brotli:dec:jar:0.1.2:compile
>> [INFO] |  +- commons-codec:commons-codec:jar:1.13:compile
>> [INFO] |  +- org.apache.pdfbox:pdfbox:jar:2.0.19:compile
>> [INFO] |  |  \- org.apache.pdfbox:fontbox:jar:2.0.19:compile
>> [INFO] |  +- org.apache.pdfbox:pdfbox-tools:jar:2.0.19:compile
>> [INFO] |  +- org.apache.pdfbox:preflight:jar:2.0.19:compile
>> [INFO] |  |  \- org.apache.pdfbox:xmpbox:jar:2.0.19:compile
>> [INFO] |  +- org.apache.pdfbox:jempbox:jar:1.8.16:compile
>> [INFO] |  +- org.bouncycastle:bcmail-jdk15on:jar:1.64:compile
>> [INFO] |  |  \- org.bouncycastle:bcpkix-jdk15on:jar:1.64:compile
>> [INFO] |  +- org.bouncycastle:bcprov-jdk15on:jar:1.64:compile
>> [INFO] |  +- org.apache.poi:poi:jar:4.1.2:compile
>> [INFO] |  |  \- com.zaxxer:SparseBitSet:jar:1.2:compile
>> [INFO] |  +- org.apache.poi:poi-scratchpad:jar:4.1.2:compile
>> [INFO] |  +- com.healthmarketscience.jackcess:jackcess:jar:3.0.1:compile
>> [INFO] |  +-
>> com.healthmarketscience.jackcess:jackcess-encrypt:jar:3.0.0:compile
>> [INFO] |  +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
>> [INFO] |  +- org.ow2.asm:asm:jar:7.3.1:compile
>> [INFO] |  +- com.googlecode.mp4parser:isoparser:jar:1.1.22:compile
>> [INFO] |  +- org.tallison:metadata-extractor:jar:2.13.0:compile
>> [INFO] |  |  \- org.tallison.xmp:xmpcore-shaded:jar:6.1.10:compile
>> [INFO] |  | \- com.adobe.xmp:xmpcore:jar:6.1.10:compile
>> [INFO] |  +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
>> [INFO] |  +- com.rometools:rome:jar:1.12.2:compile
>> [INFO] |  |  \- com.rometools:rome-utils:jar:1.12.2:compile
>> [INFO] |  +- 

Re: why the pyspark RDD API is so slow?

2022-01-31 Thread Sebastian Piu
When you operate on a DataFrame from the Python side you are just invoking
methods in the JVM via a proxy (py4j), so it is almost the same as coding in
Java itself. This holds as long as you don't define any UDFs or any other code
that needs to invoke Python for processing.

Check the High Performance Spark book, the PySpark chapter, for a good
explanation of what's going on.
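
A small sketch of the distinction described above (purely illustrative; the
column name and data are made up):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "n")

# Stays in the JVM: the Python side only assembles the plan through py4j.
jvm_only = df.select((F.col("n") * 2).alias("doubled"))

# Crosses the JVM <-> Python boundary: rows are serialised to Python workers
# so the lambda can run, then serialised back.
double_udf = F.udf(lambda n: n * 2, LongType())
python_side = df.select(double_udf("n").alias("doubled"))

jvm_only.explain()     # plain Project over Range
python_side.explain()  # plan shows a BatchEvalPython step for the Python UDF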

On Mon, 31 Jan 2022 at 09:10, Bitfox  wrote:

> Hi
>
> In PySpark, RDDs need to be serialised/deserialised, but DataFrames don't? Why?
>
> Thanks
>
> On Mon, Jan 31, 2022 at 4:46 PM Khalid Mammadov 
> wrote:
>
>> Your scala program does not use any Spark API hence faster that others.
>> If you write the same code in pure Python I think it will be even faster
>> than Scala program, especially taking into account these 2 programs runs on
>> a single VM.
>>
>> Regarding Dataframe and RDD I would suggest to use Dataframes anyway
>> since it's recommended approach since Spark 2.0.
>> RDD for Pyspark is slow as others said it needs to be
>> serialised/deserialised.
>>
>> One general note is that Spark is written Scala and core is running on
>> JVM and Python is wrapper around Scala API and most of PySpark APIs are
>> delegated to Scala/JVM to be executed. Hence most of big data
>> transformation tasks will complete almost at the same time as they (Scala
>> and Python) use the same API under the hood. Therefore you can also observe
>> that APIs are very similar and code is written in the same fashion.
>>
>>
>> On Sun, 30 Jan 2022, 10:10 Bitfox,  wrote:
>>
>>> Hello list,
>>>
>>> I did a comparison for pyspark RDD, scala RDD, pyspark dataframe and a
>>> pure scala program. The result shows the pyspark RDD is too slow.
>>>
>>> For the operations and dataset please see:
>>>
>>> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>>>
>>> The result table is below.
>>> Can you give suggestions on how to optimize the RDD operation?
>>>
>>> Thanks a lot.
>>>
>>>
>>> *program* *time*
>>> scala program 49s
>>> pyspark dataframe 56s
>>> scala RDD 1m31s
>>> pyspark RDD 7m15s
>>>
>>


Re: why the pyspark RDD API is so slow?

2022-01-31 Thread Bitfox
Hi

In PySpark, RDDs need to be serialised/deserialised, but DataFrames don't? Why?

Thanks

On Mon, Jan 31, 2022 at 4:46 PM Khalid Mammadov 
wrote:

> Your Scala program does not use any Spark API, hence it is faster than the
> others. If you write the same code in pure Python I think it will be even
> faster than the Scala program, especially taking into account that these 2
> programs run on a single VM.
>
> Regarding DataFrame vs RDD, I would suggest using DataFrames anyway, since
> that has been the recommended approach since Spark 2.0.
> The RDD API in PySpark is slow because, as others said, the data needs to be
> serialised/deserialised.
>
> One general note is that Spark is written in Scala and the core runs on the
> JVM; Python is a wrapper around the Scala API, and most PySpark calls are
> delegated to Scala/JVM for execution. Hence most big data transformation
> tasks will complete in almost the same time, as they (Scala and Python) use
> the same API under the hood. That is also why the APIs are very similar and
> code is written in the same fashion.
>
>
> On Sun, 30 Jan 2022, 10:10 Bitfox,  wrote:
>
>> Hello list,
>>
>> I did a comparison for pyspark RDD, scala RDD, pyspark dataframe and a
>> pure scala program. The result shows the pyspark RDD is too slow.
>>
>> For the operations and dataset please see:
>>
>> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>>
>> The result table is below.
>> Can you give suggestions on how to optimize the RDD operation?
>>
>> Thanks a lot.
>>
>>
>> *program* *time*
>> scala program 49s
>> pyspark dataframe 56s
>> scala RDD 1m31s
>> pyspark RDD 7m15s
>>
>


Re: why the pyspark RDD API is so slow?

2022-01-31 Thread Khalid Mammadov
Your Scala program does not use any Spark API, hence it is faster than the
others. If you write the same code in pure Python I think it will be even
faster than the Scala program, especially taking into account that these 2
programs run on a single VM.

Regarding DataFrame vs RDD, I would suggest using DataFrames anyway, since
that has been the recommended approach since Spark 2.0.
The RDD API in PySpark is slow because, as others said, the data needs to be
serialised/deserialised.

One general note is that Spark is written in Scala and the core runs on the
JVM; Python is a wrapper around the Scala API, and most PySpark calls are
delegated to Scala/JVM for execution. Hence most big data transformation
tasks will complete in almost the same time, as they (Scala and Python) use
the same API under the hood. That is also why the APIs are very similar and
code is written in the same fashion.
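
As a concrete illustration of the point above, a minimal sketch (the input
file name and word-count logic are assumptions, since the real workload lives
in the linked blog post) of moving an RDD-style job to the DataFrame API so
the work stays in the JVM:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# RDD version: every flatMap/map/reduceByKey lambda runs in Python workers,
# so records are serialised between the JVM and Python at each step.
rdd_counts = (spark.sparkContext.textFile("words.txt")   # assumed input file
              .flatMap(lambda line: line.split())
              .map(lambda w: (w, 1))
              .reduceByKey(lambda a, b: a + b))

# DataFrame version: split/explode/groupBy are JVM built-ins, so no
# per-record Python serialisation is involved.
df_counts = (spark.read.text("words.txt")
             .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
             .groupBy("word")
             .count())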


On Sun, 30 Jan 2022, 10:10 Bitfox,  wrote:

> Hello list,
>
> I did a comparison for pyspark RDD, scala RDD, pyspark dataframe and a
> pure scala program. The result shows the pyspark RDD is too slow.
>
> For the operations and dataset please see:
>
> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>
> The result table is below.
> Can you give suggestions on how to optimize the RDD operation?
>
> Thanks a lot.
>
>
> *program* *time*
> scala program 49s
> pyspark dataframe 56s
> scala RDD 1m31s
> pyspark RDD 7m15s
>


Re:[ANNOUNCE] Apache Spark 3.2.1 released

2022-01-31 Thread beliefer
Thank you huaxin gao!
Glad to see the release.







At 2022-01-29 09:07:13, "huaxin gao"  wrote:

We are happy to announce the availability of Spark 3.2.1!

Spark 3.2.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.2 maintenance branch of Spark. We strongly
recommend all 3.2 users to upgrade to this stable release.

To download Spark 3.2.1, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-2-1.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.



Huaxin Gao

Re: A Persisted Spark DataFrame is computed twice

2022-01-31 Thread Sebastian Piu
Can you share the stages as seen in the Spark UI for the count and coalesce
jobs?

My suggestion of moving things around was just for troubleshooting rather
than a solution, if that wasn't clear before.

On Mon, 31 Jan 2022, 08:07 Benjamin Du,  wrote:

> Removing coalesce didn't help either.
>
>
>
> Best,
>
> 
>
> Ben Du
>
> Personal Blog  | GitHub
>  | Bitbucket 
> | Docker Hub 
> --
> *From:* Deepak Sharma 
> *Sent:* Sunday, January 30, 2022 12:45 AM
> *To:* Benjamin Du 
> *Cc:* u...@spark.incubator.apache.org 
> *Subject:* Re: A Persisted Spark DataFrame is computed twice
>
> coalesce returns a new dataset.
> That will cause the recomputation.
>
> Thanks
> Deepak
>
> On Sun, 30 Jan 2022 at 14:06, Benjamin Du  wrote:
>
> I have some PySpark code like below. Basically, I persist a DataFrame
> (which is time-consuming to compute) to disk, call the method
> DataFrame.count to trigger the caching/persist immediately, and then I
> coalesce the DataFrame to reduce the number of partitions (the original
> DataFrame has 30,000 partitions) and output it to HDFS. Based on the
> execution time of job stages and the execution plan, it seems to me that
> the DataFrame is recomputed at df.coalesce(300). Does anyone know why
> this happens?
>
> df = spark.read.parquet("/input/hdfs/path") \
> .filter(...) \
> .withColumn("new_col", my_pandas_udf("col0", "col1")) \
> .persist(StorageLevel.DISK_ONLY)
> df.count()
> df.coalesce(300).write.mode("overwrite").parquet(output_mod)
>
>
> BTW, it works well if I manually write the DataFrame to HDFS, read it
> back, coalesce it and write it back to HDFS.
> Originally posted at
> https://stackoverflow.com/questions/70781494/a-persisted-spark-dataframe-is-computed-twice.
> 
>
> Best,
>
> 
>
> Ben Du
>
> Personal Blog  | GitHub
>  | Bitbucket 
> | Docker Hub 
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Re: A Persisted Spark DataFrame is computed twice

2022-01-31 Thread Benjamin Du
I did check the execution plan; there were 2 stages, and both stages show that
the pandas UDF (which takes almost all of the computation time of the DataFrame)
is executed.

It didn't seem to be an issue of repartition/coalesce as the DataFrame was 
still computed twice after removing coalesce.




Best,



Ben Du

Personal Blog | GitHub | 
Bitbucket | Docker 
Hub


From: Gourav Sengupta 
Sent: Sunday, January 30, 2022 1:08 AM
To: sebastian@gmail.com 
Cc: Benjamin Du ; u...@spark.incubator.apache.org 

Subject: Re: A Persisted Spark DataFrame is computed twice

Hi,

without getting into suppositions, the best option is to look into the SPARK UI 
SQL section.

It is the most wonderful tool to explain what is happening, and why. In SPARK 
3.x they have made the UI even better, with different set of granularity and 
details.

On another note, you might want to read the difference between repartition and 
coalesce before making any kind of assumptions.


Regards,
Gourav Sengupta

On Sun, Jan 30, 2022 at 8:52 AM Sebastian Piu
<sebastian@gmail.com> wrote:
It's probably the repartitioning and deserialising the df that you are seeing 
take time. Try doing this

1. Add another count after your current one and compare times
2. Move coalesce before persist



You should see
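
A minimal sketch of suggestion 2 above (move coalesce before persist), using
placeholder paths and a placeholder filter in place of the real pipeline quoted
below:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Coalesce before persisting, so the persisted data already has 300 partitions
# and the later write does not have to reshape (and possibly recompute) it.
df = (spark.read.parquet("/input/hdfs/path")   # placeholder input path
        .filter("col0 IS NOT NULL")            # placeholder for the real filter / pandas UDF
        .coalesce(300)
        .persist(StorageLevel.DISK_ONLY))

df.count()                                             # materialise the persisted DataFrame once
df.write.mode("overwrite").parquet("/tmp/output_mod")  # placeholder output path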

On Sun, 30 Jan 2022, 08:37 Benjamin Du,
<legendu@outlook.com> wrote:
I have some PySpark code like below. Basically, I persist a DataFrame (which is 
time-consuming to compute) to disk, call the method DataFrame.count to trigger 
the caching/persist immediately, and then I coalesce the DataFrame to reduce 
the number of partitions (the original DataFrame has 30,000 partitions) and 
output it to HDFS. Based on the execution time of job stages and the execution 
plan, it seems to me that the DataFrame is recomputed at df.coalesce(300). Does 
anyone know why this happens?


df = spark.read.parquet("/input/hdfs/path") \
.filter(...) \
.withColumn("new_col", my_pandas_udf("col0", "col1")) \
.persist(StorageLevel.DISK_ONLY)
df.count()
df.coalesce(300).write.mode("overwrite").parquet(output_mod)


BTW, it works well if I manually write the DataFrame to HDFS, read it back, 
coalesce it and write it back to HDFS.

Originally posted at
https://stackoverflow.com/questions/70781494/a-persisted-spark-dataframe-is-computed-twice.

Best,



Ben Du

Personal Blog | GitHub | 
Bitbucket | Docker 
Hub


unsubscribe

2022-01-31 Thread Rajeev



Re: A Persisted Spark DataFrame is computed twice

2022-01-31 Thread Benjamin Du
Removing coalesce didn't help either.




Best,



Ben Du

Personal Blog | GitHub | 
Bitbucket | Docker 
Hub


From: Deepak Sharma 
Sent: Sunday, January 30, 2022 12:45 AM
To: Benjamin Du 
Cc: u...@spark.incubator.apache.org 
Subject: Re: A Persisted Spark DataFrame is computed twice

coalesce returns a new dataset.
That will cause the recomputation.

Thanks
Deepak

On Sun, 30 Jan 2022 at 14:06, Benjamin Du
<legendu@outlook.com> wrote:
I have some PySpark code like below. Basically, I persist a DataFrame (which is 
time-consuming to compute) to disk, call the method DataFrame.count to trigger 
the caching/persist immediately, and then I coalesce the DataFrame to reduce 
the number of partitions (the original DataFrame has 30,000 partitions) and 
output it to HDFS. Based on the execution time of job stages and the execution 
plan, it seems to me that the DataFrame is recomputed at df.coalesce(300). Does 
anyone know why this happens?


df = spark.read.parquet("/input/hdfs/path") \
.filter(...) \
.withColumn("new_col", my_pandas_udf("col0", "col1")) \
.persist(StorageLevel.DISK_ONLY)
df.count()
df.coalesce(300).write.mode("overwrite").parquet(output_mod)


BTW, it works well if I manually write the DataFrame to HDFS, read it back, 
coalesce it and write it back to HDFS.

Originally posted at
https://stackoverflow.com/questions/70781494/a-persisted-spark-dataframe-is-computed-twice.

Best,



Ben Du

Personal Blog | GitHub | 
Bitbucket | Docker 
Hub


--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net