Re: Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-08 Thread Attila Zsolt Piros
You should not set "spark.dynamicAllocation.schedulerBacklogTimeout" so high; the purpose of this config is very different from the one you would like to use it for. The confusion, I guess, comes from the fact that you are still thinking in terms of multiple Spark jobs. But Dynamic Allocation is
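
For context, this timeout is the delay before extra executors are requested for backlogged (pending) tasks, not a knob for spacing out separate Spark jobs. A minimal sketch of setting it explicitly; the values shown are only illustrative (1s happens to be the documented default):

    from pyspark.sql import SparkSession

    # schedulerBacklogTimeout: how long tasks may sit pending before
    # dynamic allocation asks for more executors.
    spark = (SparkSession.builder
             .appName("dyn-alloc-backlog-sketch")
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
             .config("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "1s")
             .getOrCreate())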

Re: possible bug

2021-04-08 Thread Weiand, Markus, NMA-CFD
I've changed the code to set driver memory to 100g, changed the Python code:
import pyspark
conf = pyspark.SparkConf().setMaster("local[64]").setAppName("Test1").set(key="spark.driver.memory", value="100g")
sc = pyspark.SparkContext.getOrCreate(conf)
rows = 7
data = list(range(rows))

RE: Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-08 Thread Ranju Jain
Hi Attila, Thanks for your reply. Let me talk about a single job which starts to run with minExecutors as 3. And suppose this job [which reads the full data from the backend, processes it, and writes it to a location] takes around 2 hours to complete. What I understood is, as the default value of

Evaluating Apache Spark with Data Orchestration using TPC-DS

2021-04-08 Thread Bin Fan
Dear Spark Users, I am sharing a whitepaper on "Evaluating Apache Spark and Alluxio for Data Analytics", which talks in detail about how to benchmark Spark on Alluxio to accelerate TPC-DS benchmark results. Hope this helps. If you have any questions, feel free to reach

Re: Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-08 Thread Attila Zsolt Piros
Hi! For dynamic allocation you do not need to run the Spark jobs in parallel. Dynamic allocation simply means Spark scales up by requesting more executors when there are pending tasks (which is roughly related to the available partitions) and scales down when an executor is idle (as within one job
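
A minimal sketch of a dynamic-allocation setup along the lines described here (scale up on pending tasks, scale down on idle executors); the min/max/idle values are illustrative only, and shuffle tracking is the usual substitute for an external shuffle service on Kubernetes:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("dyn-alloc-k8s-sketch")
             .config("spark.dynamicAllocation.enabled", "true")
             # No external shuffle service on K8s, so track shuffle files instead.
             .config("spark.dynamicAllocation.shuffleTrackingEnabled", "true")
             .config("spark.dynamicAllocation.minExecutors", "3")     # as in the example above
             .config("spark.dynamicAllocation.maxExecutors", "10")    # illustrative
             .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
             .getOrCreate())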

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread Mich Talebzadeh
Well, the normal course of action (considering the law of diminishing returns) is that your mileage varies: Spark 3.0.1 is pretty stable and good enough. Unless there is an overriding reason why you have to use 3.1.1, you can set it aside and try it when you have other use cases. For now I guess you

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread Maziyar Panahi
I personally added the following to my SparkSession in 3.1.1 and the result was exactly the same as before (local master). 3.1.1 is still 4-5 times slower than 3.0.2, at least for that piece of code. I will do more investigation to see how it does with other stuff, especially anything

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread Mich Talebzadeh
In Spark 3.1.1 I enabled the parameter spark_session.conf.set("spark.sql.adaptive.enabled", "true") to see its effects in yarn cluster mode, i.e. spark-submit --master yarn --deploy-mode client. With 4 executors it crashed the cluster. I then reduced the number of executors to 2 and this time it

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread Maziyar Panahi
Thanks Sean, I have already tried adding that and the result is absolutely the same. The reason that config cannot be the cause (at least not alone) is that my comparison is between Spark 3.0.2 and Spark 3.1.1. This config has been set to true since the beginning of 3.0.0 and hasn't changed:

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread Sean Owen
Right, you already established a few times that the difference is the number of partitions. Russell answered with what is almost surely the correct answer, that it's AQE. In toy cases it isn't always a win. Disable it if you need to. It's not a problem per se in 3.1; AQE speeds up more realistic
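
For reference, a sketch of what "disable it if you need to" could look like when comparing 3.0.2 and 3.1.1; spark.sql.adaptive.enabled is the AQE switch, and spark.sql.adaptive.coalescePartitions.enabled controls the post-shuffle partition coalescing that changes partition counts:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("aqe-comparison-sketch")
             .master("local[*]")
             # Take AQE out of the 3.0.2 vs 3.1.1 comparison.
             .config("spark.sql.adaptive.enabled", "false")
             .config("spark.sql.adaptive.coalescePartitions.enabled", "false")
             .getOrCreate())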

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread maziyar
So this is what I have in my Spark UI for 3.0.2 and 3.1.1: for pyspark==3.0.2 (stage "showString at NativeMethodAccessorImpl.java:0") it finished in 10 seconds; for pyspark==3.1.1 (same stage "showString at
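
One way to confirm that the partition count is what differs between the two runs; a hedged sketch in which the range/groupBy query is only a stand-in for the real query behind the "showString" stage:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("partition-check-sketch").getOrCreate()

    # Stand-in DataFrame with a shuffle, in place of the real query.
    df = spark.range(1_000_000).groupBy((F.col("id") % 100).alias("k")).count()

    # Compare these two numbers under pyspark==3.0.2 and pyspark==3.1.1:
    print(spark.conf.get("spark.sql.shuffle.partitions"))  # static setting (default 200)
    print(df.rdd.getNumPartitions())                        # partitions actually produced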

Re: possible bug

2021-04-08 Thread Russell Spitzer
Could be that the driver JVM cannot handle the metadata required to store the partition information of a 70k-partition RDD. I see you say you have a 100GB driver but I'm not sure where you configured that. Did you set --driver-memory 100G? On Thu, Apr 8, 2021 at 8:08 AM Weiand, Markus, NMA-CFD
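
A sketch of the two usual ways to give the driver 100g, relevant to Russell's question; the script name is a placeholder, and per the spark.driver.memory docs the value has to reach the JVM before it starts, which is why the spark-submit flag is the unambiguous route in client/local mode:

    # Unambiguous: launch with the flag so the JVM starts with 100g applied, e.g.
    #   spark-submit --master "local[64]" --driver-memory 100g test1.py
    # When launching straight from Python, the setting has to be in the conf
    # before the driver JVM is created; it cannot be changed afterwards.
    import pyspark

    conf = (pyspark.SparkConf()
            .setMaster("local[64]")
            .setAppName("Test1")
            .set("spark.driver.memory", "100g"))
    sc = pyspark.SparkContext.getOrCreate(conf)

    # Sanity check: what did the driver actually end up with?
    print(sc.getConf().get("spark.driver.memory"))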

Re: possible bug

2021-04-08 Thread Weiand, Markus, NMA-CFD
This is the reduction of an error in a complex program where I allocated 100 GB of driver memory (= worker = executor, as it is local mode). In the example I used the default size, as the puny example shouldn't need more anyway. And without the coalesce, or with coalesce(1, True), everything works fine. I'm

Re: possible bug

2021-04-08 Thread Sean Owen
That's a very low-level error from the JVM. Any chance you are misconfiguring the executor size? Like 10MB instead of 10GB, that kind of thing. Trying to think of why the JVM would have very little memory to operate. An app running out of mem would not look like this. On Thu, Apr 8, 2021 at

possible bug

2021-04-08 Thread Weiand, Markus, NMA-CFD
Hi all, I'm using Spark on a c5a.16xlarge machine in the Amazon cloud (so 64 cores and 128 GB RAM). I'm using Spark 3.0.1. The following Python code leads to an exception; is this a bug or is my understanding of the API incorrect? import pyspark
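
Pieced together from the snippets in this thread, the repro appears to be roughly the following; the row count and the final action are assumptions (Russell's reply mentions an RDD of around 70k partitions, and the follow-up says coalesce(1, True) works while the plain coalesce does not):

    import pyspark

    conf = (pyspark.SparkConf()
            .setMaster("local[64]")
            .setAppName("Test1"))
    sc = pyspark.SparkContext.getOrCreate(conf)

    rows = 70000                       # assumed from the "70k partition RDD" remark
    data = list(range(rows))
    rdd = sc.parallelize(data, rows)   # one element per partition
    rdd = rdd.coalesce(1)              # no shuffle; coalesce(1, True) reportedly works
    print(rdd.count())                 # assumed action that triggers the failure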

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread maziyar
Hi Mich, Thanks for the reply. I have tried to minimize as much as possible the effect of other factors between pyspark==3.0.2 and pyspark==3.1.1, including not reading csv or gz and just reading the Parquet. Here is the code purely in pyspark (nothing else included) and it finishes within 47

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread Mich Talebzadeh
Hi, I just looked through your code.
1. I assume you are testing this against Spark 3.1.1?
2. You are testing this set-up in local mode in a single JVM, so it is not really distributed. I doubt whether a meaningful performance conclusion can be drawn here.
3. It is accepted that

Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread maziyar
Hi, I have a simple piece of code; when I run it in Spark/PySpark (tried both Scala and Python) on 3.1.1 it finishes within 5 minutes. However, the same code, same data, same SparkSession configs, just running on Spark 3.0.2 or PySpark 3.0.2, will finish within a minute. That's over 5x faster

Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread maziyar
Hi, I have a simple piece of code that does a groupby, agg count, sort, etc. This code finishes within 5 minutes on Spark 3.1.x. However, the same code, same dataset, same SparkSession (configs) on Spark 3.0.2 will finish within a minute. That is over a 5x difference. My SparkSession (same when
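
The code itself is not included in the snippet, but based on the description (groupby, agg count, sort on a Parquet input, per the later messages in the thread), a stand-in of the same shape would look something like the following; the path and column names are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("groupby-count-sort-sketch").getOrCreate()

    df = spark.read.parquet("/path/to/input.parquet")    # placeholder path
    result = (df.groupBy("some_column")                  # placeholder column
                .agg(F.count("*").alias("cnt"))
                .orderBy(F.col("cnt").desc()))
    result.show()                                        # shows up in the UI as the "showString" stage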