[ https://issues.apache.org/jira/browse/SPARK-35066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik Krogen updated SPARK-35066:
--------------------------------
    Description: 
Hi,

The following code snippet runs 4-5 times slower on Apache Spark/PySpark 3.1.1 
compared to Apache Spark/PySpark 3.0.2:

 
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, count, col
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

# read the cleaned dataset and repartition it into 12 partitions
Toys = spark.read \
  .parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
                                outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

# one row per word
all_words = toys_with_tokens.select(explode("words").alias("word"))

# group by word, sort by count, and keep the top 50k
top50k = all_words.groupBy("word").agg(count("*").alias("total")) \
    .sort(col("total").desc()).limit(50000)

top50k.show()
{code}
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 
partitions are respected: all 12 tasks are processed in parallel. In 
Spark/PySpark 3.1.1, however, even though there are still 12 tasks, 10 of 
them finish immediately and only 2 do any real processing. (I've tried 
disabling a couple of configs that seemed related, but none of them made a 
difference.)
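
To make that last point concrete, below is a sketch of the kind of config check and toggling I mean; these are only the settings most obviously tied to how many shuffle tasks run, and whether any of them is actually responsible is exactly what I could not confirm:

{code:python}
# Sketch only: list the configs explicitly set on the session so the 3.0.2
# and 3.1.1 runs can be diffed, then toggle the settings most directly tied
# to the number of shuffle tasks before re-running the snippet above.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, "=", value)

# candidate knobs (adaptive execution, partition coalescing, shuffle parallelism)
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "12")
{code}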

Screenshot of Spark 3.1.1 tasks:

!image-2022-03-17-17-18-36-793.png|width=1073,height=652!

!image-2022-03-17-17-19-11-655.png!

 

Screenshot of Spark 3.0.2 tasks:

!image-2022-03-17-17-19-34-906.png!

For a longer discussion: [Spark User List|https://lists.apache.org/thread/1hlg9fpxnw8dzx8bd2fvffmk7yozoszf]

 

You can reproduce this large performance difference between Spark 3.1.1 and 
Spark 3.0.2 by running the shared code against any dataset that is large 
enough to take longer than a minute. I am not sure whether this is related to 
SQL, to some Spark config that exists in 3.x but only takes effect in 3.1.1, 
or to .transform in Spark ML.


> Spark 3.1.1 is slower than 3.0.2 by 4-5 times
> ---------------------------------------------
>
>                 Key: SPARK-35066
>                 URL: https://issues.apache.org/jira/browse/SPARK-35066
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, SQL
>    Affects Versions: 3.1.1
>         Environment: Spark/PySpark: 3.1.1
> Language: Python 3.7.x / Scala 2.12
> OS: macOS, Linux, and Windows
> Cloud: Databricks 7.3 for 3.0.1 and 8 for 3.1.1
>            Reporter: Maziyar PANAHI
>            Priority: Major
>         Attachments: Screenshot 2021-04-07 at 11.15.48.png, Screenshot 
> 2021-04-08 at 15.08.09.png, Screenshot 2021-04-08 at 15.13.19-1.png, 
> Screenshot 2021-04-08 at 15.13.19.png, image-2022-03-17-17-18-36-793.png, 
> image-2022-03-17-17-19-11-655.png, image-2022-03-17-17-19-34-906.png
>
>


