Re: the life cycle shuffle Dependency

2023-12-29 Thread murat migdisoglu
Hello, why would you like to delete the shuffle data yourself in the first place?

On Thu, Dec 28, 2023, 10:08 yang chen wrote:
> hi, I'm learning spark, and wonder when to delete shuffle data, I find the
> ContextCleaner class which clean the shuffle data when shuffle dependency
> is GC-ed.
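The GC-driven cleanup the question describes can be illustrated outside Spark: ContextCleaner tracks shuffle dependencies through weak references and removes their shuffle files only once the dependency object itself is garbage-collected. A plain-Python analogy of that pattern (the `ShuffleDependency` class and `register` helper below are illustrative stand-ins, not Spark APIs):

```python
import gc
import weakref

cleaned = []  # stands in for the driver-side record of removed shuffle files

class ShuffleDependency:
    """Illustrative stand-in for Spark's ShuffleDependency."""
    def __init__(self, shuffle_id):
        self.shuffle_id = shuffle_id

def register(dep):
    # Like ContextCleaner: attach a cleanup callback to a weak reference,
    # so cleanup runs only when the dependency is garbage-collected.
    weakref.finalize(dep, cleaned.append, dep.shuffle_id)

dep = ShuffleDependency(0)
register(dep)
del dep          # drop the last strong reference
gc.collect()     # the finalizer has fired by now
print(cleaned)   # [0]
```

The point of the design is that user code never deletes shuffle data directly; reachability of the dependency object decides when it is safe.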

Re: Cluster-mode job compute-time/cost metrics

2023-12-12 Thread murat migdisoglu
Hey Jack, EMR Serverless is a great fit for this. You can get these metrics for each job once it completes. Besides that, if you create separate "EMR applications" per group and tag them appropriately, you can use Cost Explorer to see the amount of resources being used. If emr
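The per-group tagging suggested above might look roughly like this (a sketch using the `aws emr-serverless` CLI; the tag keys and values are placeholders, and the exact flag syntax should be checked against the CLI reference):

```shell
# Hypothetical: one EMR Serverless application per team, tagged for Cost Explorer.
aws emr-serverless create-application \
  --name analytics-team-a \
  --type SPARK \
  --release-label emr-6.9.0 \
  --tags team=analytics-a,cost-center=1234
```

With cost-allocation tags activated in the billing console, Cost Explorer can then break spend down per application.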

Re: Spike on number of tasks - dynamic allocation

2023-02-27 Thread murat migdisoglu
On Mon, 27 Feb 2023 at 09:06, murat migdisoglu wrote:
>> On an a

Spike on number of tasks - dynamic allocation

2023-02-27 Thread murat migdisoglu
On an auto-scaling cluster using YARN as the resource manager, we observed that when we decrease the number of worker nodes after upscaling instance types, the number of tasks for the same Spark job spikes (the total CPU/memory capacity of the cluster remains identical). The same Spark job, with the
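One plausible explanation (an assumption here, not confirmed in the thread) is that Spark's default parallelism is derived from the executor-core topology, which changes with instance type even when total cluster capacity does not. Pinning the relevant settings makes the task count independent of the node mix; the values below are placeholders to size for your workload:

```shell
# Sketch: fix parallelism explicitly instead of inheriting it from core counts.
spark-submit \
  --conf spark.default.parallelism=400 \
  --conf spark.sql.shuffle.partitions=400 \
  job.py   # hypothetical application name
```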

High number of tasks when ran on a hybrid cluster

2022-08-09 Thread murat migdisoglu
Hi, I recently created a Spark cluster on AWS EMR using a fleet configuration with hybrid instance types. The instance types on this cluster vary depending on the availability of the type. While running the same Spark applications that were running on a homogeneous cluster (some pyspark apps doing
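For input-side task counts specifically, Spark sizes file splits from settings such as the ones below; on a hybrid fleet the derived defaults can differ from a homogeneous cluster, so setting them explicitly is one way to compare like with like (a hedged sketch, the byte values shown are just Spark's usual defaults):

```shell
--conf spark.sql.files.maxPartitionBytes=134217728   # 128 MB per input split
--conf spark.sql.files.openCostInBytes=4194304       # 4 MB per-file open cost
```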

Re: Need your help!! (URGENT Code works fine when submitted as java main but part of data missing when running as Spark-Submit)

2020-07-23 Thread murat migdisoglu
A potential reason might be that you are getting a ClassNotFoundException when you run on the cluster (due to a missing jar in your uber jar) and you are possibly silently eating up exceptions in your code. 1- You can check if there are any failed tasks 2- You can check if there are any failed
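The "silently eating up exceptions" failure mode above is worth spelling out: a catch-all handler that drops bad records makes a classpath problem look like missing data. A plain-Python illustration (not Spark code) of surfacing the failures instead:

```python
failed = 0  # stands in for a Spark accumulator counting bad records

def safe_parse(line):
    """Parse a record, counting failures instead of silently dropping them."""
    global failed
    try:
        return int(line)
    except ValueError:
        failed += 1   # make the data loss visible rather than swallowing it
        return None

records = ["1", "oops", "3"]
out = [v for v in (safe_parse(x) for x in records) if v is not None]
print(out, failed)  # [1, 3] 1
```

If `failed` is nonzero (or, in Spark, if the UI shows failed tasks), you know records were lost rather than never produced.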

elasticsearch-hadoop is not compatible with spark 3.0( scala 2.12)

2020-06-23 Thread murat migdisoglu
Hi, I'm testing our codebase against the Spark 3.0.0 stack and I realized that the elasticsearch-hadoop libraries are built against Scala 2.11 and thus do not work with Spark 3.0.0 (and probably 2.4.2). Is there anybody else facing this issue? How did you solve it? The PR on the ES library is open
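Until a Scala 2.12 build of ES-Hadoop ships, there is no drop-in fix; once one is published, the dependency bump would look roughly like this (the artifact name and version here are assumptions to be checked against Maven Central):

```scala
// sbt: "%%" appends the Scala binary version (_2.12) to the artifact name.
libraryDependencies += "org.elasticsearch" %% "elasticsearch-spark-30" % "<2.12-capable-version>"
```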

Re: java.lang.ClassNotFoundException for s3a comitter

2020-06-18 Thread murat migdisoglu
: org.apache.spark.internal.io.cloud.PathOutputCommitProtocol

On Thu, Jun 18, 2020 at 1:35 AM murat migdisoglu wrote:
> Hello all, we have a hadoop cluster (using yarn) using s3 as the filesystem
> with s3guard enabled. We are using hadoop 3.2.1 with spark 2.4.5.
> When I try to save a dataframe in par
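The class named above comes from Spark's hadoop-cloud module; wiring it in typically looks like the settings below (a sketch only; the key names should be verified against the spark-hadoop-cloud documentation for your Spark/Hadoop versions, and the module's jar must actually be on the classpath or the ClassNotFoundException persists):

```properties
spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.hadoop.fs.s3a.committer.name=directory
```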

java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtoco

2020-06-17 Thread murat migdisoglu
Hello all, we have a hadoop cluster (using YARN) using S3 as the filesystem with s3guard enabled. We are using Hadoop 3.2.1 with Spark 2.4.5. When I try to save a dataframe in parquet format, I get the following exception: java.lang.ClassNotFoundException: