Hi,
I am working on a Spark job. The count() alone takes 10 minutes. My question is: how can I make it faster?
From the above image, what I understand is that 4,001 tasks are running in parallel, out of 76,553 total tasks.
Here are the parameters that I am using for
The log data is stored in Lucene and I have a custom data source to access it. For example, when the condition is log-level = INFO, it brings in a couple of million records per partition, and there are hundreds of partitions involved in a query. Spark has to go through all the entries to show the first 100 entries for a given condition.
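If the custom data source can push the predicate (and ideally the limit) down into Lucene, the scan can stop early instead of touching every entry. A minimal sketch, assuming an active spark session and assuming the source registers as "lucene" and accepts the query through an option (both the format name and the option are hypothetical, not a real Spark API):

    // Hypothetical custom source: format name "lucene" and the "query"
    // option are assumptions for illustration.
    val logs = spark.read
      .format("lucene")
      .option("query", "log_level:INFO") // predicate evaluated inside Lucene
      .load("/path/to/index")

    // If the source honors the pushed-down filter, showing the first 100
    // rows should not require scanning all partitions.
    logs.show(100)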
I've increased spark.scheduler.listenerbus.eventqueue.executorManagement.capacity to 10M, and this led to several things. First, the scaler didn't break when it was expected to; maxNeededExecutors remained low (except for peak values). Second, the scaler started to behave a bit strangely. Having maxExecutors=50
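For context, a minimal sketch of how these settings might be wired up at session creation; the 10M capacity and maxExecutors=50 come from the description above, while the app name and the dynamic-allocation flag are just plausible surrounding setup:

    import org.apache.spark.sql.SparkSession

    // Raise the executor-management listener queue capacity and cap
    // dynamic allocation, matching the values described above.
    val spark = SparkSession.builder()
      .appName("scaler-test")
      .config("spark.scheduler.listenerbus.eventqueue.executorManagement.capacity", "10000000")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.maxExecutors", "50")
      .getOrCreate()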
When I do the following, Spark (2.4) doesn't put a _SUCCESS file in the partition directory:
val outputPath = s"s3://mybucket/$table"
df
  .orderBy("time")
  .coalesce(numFiles)
  .write
  .partitionBy("partitionDate")
  .mode("overwrite")
  .format("parquet")
  .save(outputPath)
But when I remove 'partitionBy', the _SUCCESS file does get written.
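This matches how Hadoop's FileOutputCommitter behaves: when it writes a _SUCCESS marker at all (controlled by mapreduce.fileoutputcommitter.marksuccessfuljobs), it writes a single one at the root output path, never inside each partitionDate=... subdirectory. A small sketch to check where the marker actually landed, reusing outputPath from the snippet above and assuming an active spark session:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Check for the job-level _SUCCESS marker at the root output path.
    // With partitionBy, this is where the committer puts it (if enabled),
    // not inside the individual partition directories.
    val fs = FileSystem.get(
      new java.net.URI(outputPath),
      spark.sparkContext.hadoopConfiguration)
    println(s"_SUCCESS at root: ${fs.exists(new Path(s"$outputPath/_SUCCESS"))}")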
Hi,
A couple of clarifications:
1. How is the log data stored on, say, HDFS?
2. You stated you want to show the first 100 entries for a given condition. Is that condition a predicate itself?
There are articles on predicate pushdown in Spark. For example, check "Using Spark predicate push down in Spark SQL queries".
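A quick way to see whether a predicate is actually being pushed down is to look for PushedFilters in the physical plan. A minimal sketch against a Parquet source, assuming an active spark session; the path and column name are illustrative:

    import spark.implicits._

    // If pushdown works, the scan node's PushedFilters entry will list
    // something like EqualTo(log_level,INFO) instead of an empty list.
    val df = spark.read.parquet("/path/to/logs")
    df.filter($"log_level" === "INFO").explain()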