[ https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik Krogen updated SPARK-35066:
--------------------------------
    Description: 
Hi,

The following code snippet runs 4-5 times slower in Apache Spark / PySpark 3.1.1 
compared to Apache Spark / PySpark 3.0.2:

 
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, explode
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

# read the cleaned reviews and spread them over 12 partitions
Toys = spark.read \
    .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
                                outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

# one row per word
all_words = toys_with_tokens.select(explode("words").alias("word"))

# group by, sort and limit to 50k
top50k = all_words.groupBy("word") \
    .agg(count("*").alias("total")) \
    .sort(col("total").desc()) \
    .limit(50000)

top50k.show()
{code}
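As a quick sanity check (a minimal diagnostic sketch, not part of the original run; it only uses the DataFrames defined above), you can confirm that the repartition(12) is actually in effect and see where the shuffles happen in the final plan:

{code:python}
# partitions right after the repartition (expected: 12)
print(Toys.rdd.getNumPartitions())

# partitions feeding the word-level aggregation
print(all_words.rdd.getNumPartitions())

# physical plan of the final query, to see the exchanges (shuffles)
# and how the sort + limit are executed
top50k.explain()
{code}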
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 partitions 
are respected: all 12 tasks are processed in parallel. In Spark/PySpark 3.1.1, 
however, even though there are still 12 tasks, 10 of them finish immediately and 
only 2 do any real work. (I've tried disabling a couple of configs that looked 
related, but none of them made a difference; the sketch below shows the kind of 
settings I mean.)
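(Illustration only: these are the sort of partition/shuffle related settings one can toggle to rule them out; I am not claiming they are the cause of the regression.)

{code:python}
# see what the session is actually running with
print(spark.conf.get("spark.sql.adaptive.enabled", "not set"))
print(spark.conf.get("spark.sql.shuffle.partitions", "not set"))

# rule out adaptive partition coalescing as the reason only 2 tasks do work
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")
# or switch adaptive query execution off entirely
spark.conf.set("spark.sql.adaptive.enabled", "false")
{code}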

Screenshot of Spark 3.1.1 tasks:

!image-2022-03-17-17-18-36-793.png|width=1073,height=652!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of Spark 3.0.2 tasks:

 

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User List|https://lists.apache.org/thread/1bslwjdwnr5tw7wjkv0672vj41x4g2f1]

 

You can reproduce this large performance gap between Spark 3.1.1 and Spark 3.0.2 
by running the shared code on any dataset that is big enough to take longer than 
a minute. I'm not sure whether this is related to SQL, to some Spark config that 
exists in 3.x but only takes effect as of 3.1.1, or to .transform in Spark ML; a 
rough way to separate the last possibility is sketched below.
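(A diagnostic sketch only, not something I have benchmarked: it rebuilds the word count with plain SQL functions instead of RegexTokenizer/StopWordsRemover. It skips lowercasing and stop-word removal, so the output differs slightly, but the shuffle and aggregation shape is the same, which should show whether the slowdown follows the ML .transform calls or the SQL side.)

{code:python}
from pyspark.sql.functions import col, count, explode, split

# tokenize with a SQL split instead of the ML RegexTokenizer
words_sql = (Toys
    .select(explode(split(col("reviewText"), "\\W+")).alias("word"))
    .where(col("word") != ""))

top50k_sql = (words_sql.groupBy("word")
    .agg(count("*").alias("total"))
    .sort(col("total").desc())
    .limit(50000))

top50k_sql.show()
{code}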

  was:
Hi,

The following code snippet runs 4-5 times slower in Apache Spark / PySpark 3.1.1 
compared to Apache Spark / PySpark 3.0.2:

 
{code:java}
spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))
# group by, sort and limit to 50k
top50k = all_words.groupBy("word").agg(count("*").alias("total")).sort(col("total").desc()).limit(50000)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 partitions 
are respected: all 12 tasks are processed in parallel. In Spark/PySpark 3.1.1, 
however, even though there are still 12 tasks, 10 of them finish immediately and 
only 2 do any real work. (I've tried disabling a couple of configs that looked 
related, but none of them made a difference.)

Screenshot of Spark 3.1.1 tasks:

!image-2022-03-17-17-18-36-793.png|width=1073,height=652!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of Spark 3.0.2 tasks:

 

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html] (no longer exists!)

 

You can reproduce this large performance gap between Spark 3.1.1 and Spark 3.0.2 
by running the shared code on any dataset that is big enough to take longer than 
a minute. I'm not sure whether this is related to SQL, to some Spark config that 
exists in 3.x but only takes effect as of 3.1.1, or to .transform in Spark ML.


> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> ---------------------------------------------
>
>                 Key: SPARK-35066
>                 URL: https://issues.apache.org/jira/browse/SPARK-35066
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, SQL
>    Affects Versions: 3.1.1
>         Environment: Spark/PySpark: 3.1.1
> Language: Python 3.7.x / Scala 2.12
> OS: macOS, Linux, and Windows
> Cloud: Databricks 7.3 for 3.0.1 and 8 for 3.1.1
>            Reporter: Maziyar PANAHI
>            Priority: Major
>         Attachments: Screenshot 2021-04-07 at 11.15.48.png, Screenshot 
> 2021-04-08 at 15.08.09.png, Screenshot 2021-04-08 at 15.13.19-1.png, 
> Screenshot 2021-04-08 at 15.13.19.png, image-2022-03-17-17-18-36-793.png, 
> image-2022-03-17-17-19-11-655.png, image-2022-03-17-17-19-34-906.png
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
