Tuning spark job to make count faster.

2021-04-05 Thread Krishna Chakka
Hi, I am working on a Spark job. It takes 10 minutes just to run the count() function. Question: how can I make it faster? From the above image, what I understood is that 4001 tasks are running in parallel. Total tasks are 76,553. Here are the parameters that I am using for
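A minimal sketch of one common way to speed up a count() over a source that produces tens of thousands of tasks: coalesce the input and cache it so later actions reuse the scan. The input path, format, and partition count below are assumptions, not taken from the thread.

    import org.apache.spark.sql.SparkSession

    // Hypothetical setup; the real job's parameters are not shown in the message.
    val spark = SparkSession.builder().appName("count-tuning").getOrCreate()

    // Many small input files can yield tens of thousands of scan tasks; coalescing
    // reduces the task count, and caching lets repeated actions reuse the data.
    val df = spark.read.parquet("s3://mybucket/input").coalesce(4000)
    val cached = df.cache()

    val n = cached.count()   // the first action materializes the cache
    println(s"row count: $n")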

Re: Ordering pushdown for Spark Datasources

2021-04-05 Thread Kohki Nishio
The log data is stored in Lucene and I have a custom data source to access it. For example, if the condition is log-level = INFO, this brings in a couple of million records per partition. Then there are hundreds of partitions involved in a query. Spark has to go through all the entries to show the first
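For reference, a sketch of the query shape being described; the data source format string and the column names are placeholders. Without ordering/limit pushdown, only the filter reaches the Lucene-backed source, so Spark still has to pull every matching row and sort it before returning the first 100.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().getOrCreate()

    // "my.lucene.datasource" stands in for the custom data source from the thread.
    val logs = spark.read.format("my.lucene.datasource").load()

    // The filter can be pushed down, but the sort and limit run in Spark,
    // so all INFO rows are fetched from the source before the first 100 are shown.
    logs.filter(col("log_level") === "INFO")
      .orderBy(col("timestamp"))
      .limit(100)
      .show()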

Re: [SPARK SQL] Sometimes spark does not scale down on k8s

2021-04-05 Thread Alexei
I've increased spark.scheduler.listenerbus.eventqueue.executorManagement.capacity to 10M, which led to several things. First, the scaler didn't break when it was expected to. I mean, maxNeededExecutors remained low (except peak values). Second, the scaler started to behave a bit weirdly. Having maxExecutors=50
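For context, a sketch of the configuration being discussed; the queue capacity and maxExecutors values come from the thread, while the remaining dynamic-allocation settings are assumptions about a typical Kubernetes setup (no external shuffle service, so shuffle tracking enabled).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("k8s-dynamic-allocation")
      // dynamic allocation settings assumed for a k8s deployment; maxExecutors=50 is from the thread
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .config("spark.dynamicAllocation.maxExecutors", "50")
      // the executor-management event queue capacity raised in the message;
      // the usual default is far smaller, so events can be dropped under load
      .config("spark.scheduler.listenerbus.eventqueue.executorManagement.capacity", "10000000")
      .getOrCreate()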

Spark doesn't add _SUCCESS file when 'partitionBy' is used

2021-04-05 Thread Eric Beabes
When I do the following, Spark (2.4) doesn't put a _SUCCESS file in the partition directory: val outputPath = s"s3://mybucket/$table"; df.orderBy(time).coalesce(numFiles).write.partitionBy("partitionDate").mode("overwrite").format("parquet").save(outputPath) But when I remove 'partitionBy'
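The write from the message, reconstructed as a runnable sketch; the table name, file count, input path, and the assumption that 'time' is a column in the DataFrame are placeholders for values not shown in the thread.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().getOrCreate()

    // Placeholders standing in for values elided in the message.
    val table = "my_table"
    val numFiles = 10
    val df = spark.read.parquet("s3://mybucket/staging")

    val outputPath = s"s3://mybucket/$table"
    df.orderBy(col("time"))
      .coalesce(numFiles)
      .write
      .partitionBy("partitionDate")   // with partitionBy, the reporter sees no _SUCCESS marker
      .mode("overwrite")
      .format("parquet")
      .save(outputPath)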

Re: Ordering pushdown for Spark Datasources

2021-04-05 Thread Mich Talebzadeh
Hi, a couple of clarifications: 1. How is the log data stored on, say, HDFS? 2. You stated "show the first 100 entries for a given condition". Is that condition itself a predicate? There are articles on predicate pushdown in Spark. For example, check Using Spark predicate push down in Spa
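As a small illustration of what predicate pushdown looks like against a source that supports it (the path and column name below are assumptions, with Parquet used only as an example): when the filter is pushed to the reader, it appears in the physical plan as PushedFilters.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().getOrCreate()

    val logs = spark.read.parquet("s3://mybucket/logs")      // hypothetical path
    val infoLogs = logs.filter(col("log_level") === "INFO")  // candidate for pushdown

    // The physical plan lists the pushed predicate,
    // e.g. PushedFilters: [IsNotNull(log_level), EqualTo(log_level,INFO)]
    infoLogs.explain()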