Cluster-mode job compute-time/cost metrics

2023-12-11 Thread Jack Wells
Hello Spark experts - I’m running Spark jobs in cluster mode using a
dedicated cluster for each job. Is there a way to see how much compute time
each job takes via Spark APIs, metrics, etc.? In case it makes a
difference, I’m using AWS EMR - I’d ultimately like to be able to say this
job costs $X since it took Y minutes on Z instance types (assuming all of
the nodes are the same instance type), but I figure I'd probably need
to get the Z instance type through the EMR APIs.
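
For context, here is a rough sketch (in Scala) of the kind of calculation I have
in mind, run from inside the driver towards the end of the job. The node count and
hourly price below are placeholders that I'd presumably have to pull from the
EMR/pricing APIs:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Wall-clock time of this application so far, based on the SparkContext start time.
val elapsedMs = System.currentTimeMillis() - spark.sparkContext.startTime
val elapsedHours = elapsedMs / 1000.0 / 3600.0

// Placeholder values (would come from the EMR APIs / pricing data).
val nodeCount = 10          // nodes in the dedicated cluster
val pricePerNodeHour = 0.40 // on-demand $/hour for the instance type

val approxCost = elapsedHours * nodeCount * pricePerNodeHour
println(f"Approximate cost: $$$approxCost%.2f")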

Thanks!
Jack


Spark 3.1.3 with Hive dynamic partitions fails while driver moves the staged files

2023-12-11 Thread Shay Elbaz
Hi all,

Running on Dataproc 2.0/1.3/1.4, we use the INSERT OVERWRITE command to insert
new (time) partitions into existing Hive tables, but we see too many failures
coming from org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles. This is where
the driver moves the successfully written files from the staging directory to
the final directory.
For some reason, the underlying FS implementation - GoogleCloudStorageImpl in 
this case - fails to move at least one file and the exception is propagated all 
the way through. We see many different failures - from 
hadoop.FileSystem.mkdirs, rename, etc., all coming from Hive.replaceFiles().
I guess FS failures are to be expected, but nowhere in
org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions is there a
try/catch/retry mechanism. Is the FS implementation expected to do it?
Because the GCS connector does not seem to do that :)
We ended up patching and rebuilding hive-exec jar as an immediate mitigation 
(try/catch/retry), while our platform teams are reaching out to GCP support.
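
For reference, the patch is essentially a generic retry wrapper around the file
move calls, something along these lines (a simplified sketch of the idea, not the
actual change to hive-exec; the attempt count and delay are arbitrary):

import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object RetryUtil {
  // Run a block, retrying up to maxAttempts times with a fixed delay in between.
  @tailrec
  def withRetry[T](maxAttempts: Int, delayMs: Long)(block: => T): T =
    Try(block) match {
      case Success(result) => result
      case Failure(e) if maxAttempts > 1 =>
        // Transient FS errors (e.g. GCS rename/mkdirs failures) get retried.
        System.err.println(s"Operation failed (${e.getMessage}), retrying...")
        Thread.sleep(delayMs)
        withRetry(maxAttempts - 1, delayMs)(block)
      case Failure(e) => throw e
    }
}

// Usage around an FS call of the kind that fails in Hive.replaceFiles:
// RetryUtil.withRetry(maxAttempts = 3, delayMs = 1000) {
//   fs.rename(srcPath, destPath)
// }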

I know this is more of a Hive issue than a Spark one, but I still wonder if
anybody has encountered this issue, or a similar one.

Thanks,
Shay


Re: [EXTERNAL] Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-11 Thread Eugene Miretsky
Hey Mich,

Thanks for the detailed response. I get most of these options.

However, what we are trying to do is avoid having to upload the source
configs and pyspark.zip files to the cluster every time we execute the job
using spark-submit. Here is the code that does it:
https://github.com/apache/spark/blob/bacdb3b5fec9783f4604276480eb2a0f5702/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L813

Wondering if there is a way to skip uploading the configs. Uploading the
pyspark.zip file can be skipped by setting
PYSPARK_ARCHIVES_PATH=local://

On Mon, Dec 11, 2023 at 5:15 AM Mich Talebzadeh 
wrote:

> Hi Eugene,
>
> With regard to your points
>
> What are the PYTHONPATH and SPARK_HOME env variables in your script?
>
> OK, let us look at a typical Spark project structure of mine
>
> - project_root
>   |-- README.md
>   |-- __init__.py
>   |-- conf
>   |   |-- (configuration files for Spark)
>   |-- deployment
>   |   |-- deployment.yaml
>   |-- design
>   |   |-- (design-related files or documentation)
>   |-- othermisc
>   |   |-- (other miscellaneous files)
>   |-- sparkutils
>   |   |-- (utility modules or scripts specific to Spark)
>   |-- src
>   |-- (main source code for your Spark application)
>
> If you want Spark to recognize modules from the sparkutils directory or
> any other directories within your project, you can include those
> directories in the PYTHONPATH.
>
> For example, if you want to include the sparkutils directory:
>
> export PYTHONPATH=/path/to/project_root/sparkutils:$PYTHONPATH
>
> To recap, the ${PYTHONPATH} variable is primarily used to specify
> additional directories where Python should look for modules and packages.
> In the context of Spark, it is typically used to include directories
> containing custom Python code or modules that your Spark application
> depends on.
>
> With regard to
>
> The --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" configuration
> option in Spark is used when submitting a Spark application to run on YARN:
>
>    - --conf: This is used to specify Spark configuration properties when
>      submitting a Spark application.
>    - spark.yarn.appMasterEnv.SPARK_HOME: This is a Spark configuration
>      property that defines the value of the SPARK_HOME environment variable
>      for the Spark application's Application Master (the process responsible
>      for managing the execution of tasks on a YARN cluster).
>    - $SPARK_HOME: This holds the path to the Spark installation directory.
>
> This configuration is setting the SPARK_HOME environment variable for the
> Spark Application Master when the application is running on YARN. This is
> important because the Spark Application Master needs to know the location
> of the Spark installation directory (SPARK_HOME) to configure and manage
> the Spark application's execution on the YARN cluster. HTH
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 11 Dec 2023 at 01:43, Eugene Miretsky  wrote:
>
>> Setting PYSPARK_ARCHIVES_PATH to hdfs:// did the trick. But I don't
>> understand a few things:
>>
>> 1) The default behaviour is that if PYSPARK_ARCHIVES_PATH is empty,
>> pyspark.zip is uploaded from the local SPARK_HOME. If it is set to
>> "local://" the upload is skipped. I would expect the latter to be the
>> default. What's the use case for uploading the local pyspark.zip every
>> time?
>> 2) It seems like the localConfigs are meant to be copied every time (code).
>> What's the use case for that? Why not just use the cluster config?
>>
>>
>>
>> On Sun, Dec 10, 2023 at 1:15 PM Eugene Miretsky  wrote:
>>
>>> Thanks Mich,
>>>
>>> Tried this and am still getting
>>> INFO Client: "Uploading resource
>>> file:/opt/spark/spark-3.5.0-bin-hadoop3/python/lib/pyspark.zip ->
>>> hdfs:/". It is also doing it for py4j-0.10.9.7-src.zip and
>>> __spark_conf__.zip. It is working now because I enabled direct
>>> access to HDFS to allow copying the files. But ideally I would like to not
>>> have to copy any files directly to HDFS.
>>>
>>> 1) We would expect pyspark as well as the relevant configs to already be
>>> available on the cluster - why are they being copied over? (we can always
>>> provide the extra libraries needed using py-files the way you did)
>>> 2) If we wanted users to be able to use custom pyspark, we would rather
>>> just copy the file to HDFS/GCS in other ways, and let users reference it in
>>> their job.

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-11 Thread Михаил Кулаков
Hey Enrico, it does help me understand it, thanks for explaining.

Regarding this comment

> PySpark and Scala should behave identically here

Is it OK that the Scala and PySpark optimizations work differently in this case?
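
As a side note, the workaround I mentioned in my original message is simply to
force an action on the filtered dataframe before the join, so that the first
observation is actually consumed. In terms of Enrico's Scala example below (the
same idea applies to the PySpark version in my gist), that would look roughly
like this:

// Reuses df, df_join, o1, o2 from Enrico's example below.
val df1 = df.observe(o1, count("*")).filter("col1 = 'c'")
// Force df1 to be evaluated here so the metrics for o1 are collected,
// even if the later join gets optimized away.
df1.count()
val df2 = df1.join(df_join, "col1", "left").observe(o2, count("*"))
df2.show()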


On Tue, 5 Dec 2023 at 20:08, Enrico Minack  wrote:

> Hi Michail,
>
> with spark.conf.set("spark.sql.planChangeLog.level", "WARN") you can see
> how Spark optimizes the query plan.
>
> In PySpark, the plan is optimized into
>
> Project ...
>   +- CollectMetrics 2, [count(1) AS count(1)#200L]
>   +- LocalTableScan , [col1#125, col2#126L, col3#127, col4#132L]
>
> The entire join gets optimized away into an empty table. Looks like it
> figures out that df has no rows with col1 = 'c'. So df is never consumed
> / iterated, so the observation does not retrieve any metrics.
>
> In Scala, the optimization is different:
>
> *(2) Project ...
>   +- CollectMetrics 2, [count(1) AS count(1)#63L]
>  +- *(1) Project [col1#37, col2#38, col3#39, cast(null as int) AS
> col4#51]
> +- *(1) Filter (isnotnull(col1#37) AND (col1#37 = c))
>+- CollectMetrics 1, [count(1) AS count(1)#56L]
>   +- LocalTableScan [col1#37, col2#38, col3#39]
>
> where the join also gets optimized away, but table df is still filtered
> for col1 = 'c', which iterates over the rows and collects the metrics for
> observation 1.
>
> Hope this helps to understand why there are no observed metrics for
> Observation("1") in your case.
>
> Enrico
>
>
>
> On 04.12.23 at 10:45, Enrico Minack wrote:
>
> Hi Michail,
>
> observations as well as ordinary accumulators only observe / process
> rows that are iterated / consumed by downstream stages. If the query
> plan decides to skip one side of the join, that one will be removed from
> the final plan completely. Then, the Observation will not retrieve any
> metrics and .get waits forever. Definitely not helpful.
>
> When creating the Observation class, we thought about a timeout for the
> get method but could not find a use case where the user would call get
> without first executing the query. Here is a scenario where, even though
> the query is executed, there is no observation result. We will rethink this.
>
> Interestingly, your example works in Scala:
>
> import org.apache.spark.sql.Observation
>
> val df = Seq(("a", 1, "1 2 3 4"), ("b", 2, "1 2 3 4")).toDF("col1",
> "col2", "col3")
> val df_join = Seq(("a", 6), ("b", 5)).toDF("col1", "col4")
>
> val o1 = Observation()
> val o2 = Observation()
>
> val df1 = df.observe(o1, count("*")).filter("col1 = 'c'")
> val df2 = df1.join(df_join, "col1", "left").observe(o2, count("*"))
>
> df2.show()
> +----+----+----+----+
> |col1|col2|col3|col4|
> +----+----+----+----+
> +----+----+----+----+
>
> o1.get
> Map[String,Any] = Map(count(1) -> 2)
>
> o2.get
> Map[String,Any] = Map(count(1) -> 0)
>
>
> PySpark and Scala should behave identically here. I will investigate.
>
> Cheers,
> Enrico
>
>
>
> On 02.12.23 at 17:11, Михаил Кулаков wrote:
>
> Hey folks, I am actively using the observe method in my Spark jobs and
> noticed some interesting behavior:
> Here is an example of working and non working code:
> https://gist.github.com/Coola4kov/8aeeb05abd39794f8362a3cf1c66519c
> 
> 
>
> In a few words, if I'm joining a dataframe that has become empty after some
> filter rules, observations configured on the first dataframe never
> return any results, unless some action is called on the empty dataframe
> specifically before the join.
>
> Looks like a bug to me; I would appreciate any advice on how to fix
> this behavior.


Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-11 Thread Mich Talebzadeh
Hi Eugene,

With regard to your points

What are the PYTHONPATH and SPARK_HOME env variables in your script?

OK, let us look at a typical Spark project structure of mine

- project_root
  |-- README.md
  |-- __init__.py
  |-- conf
  |   |-- (configuration files for Spark)
  |-- deployment
  |   |-- deployment.yaml
  |-- design
  |   |-- (design-related files or documentation)
  |-- othermisc
  |   |-- (other miscellaneous files)
  |-- sparkutils
  |   |-- (utility modules or scripts specific to Spark)
  |-- src
  |-- (main source code for your Spark application)

If you want Spark to recognize modules from the sparkutils directory or any
other directories within your project, you can include those directories in
the PYTHONPATH.

For example, if you want to include the sparkutils directory:

export PYTHONPATH=/path/to/project_root/sparkutils:$PYTHONPATH

To recap, the ${PYTHONPATH} variable is primarily used to specify
additional directories where Python should look for modules and packages.
In the context of Spark, it is typically used to include directories
containing custom Python code or modules that your Spark application
depends on.

With regard to

The --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" configuration
option in Spark is used when submitting a Spark application to run on YARN:

   - --conf: This is used to specify Spark configuration properties when
     submitting a Spark application.
   - spark.yarn.appMasterEnv.SPARK_HOME: This is a Spark configuration
     property that defines the value of the SPARK_HOME environment variable
     for the Spark application's Application Master (the process responsible
     for managing the execution of tasks on a YARN cluster).
   - $SPARK_HOME: This holds the path to the Spark installation directory.

This configuration is setting the SPARK_HOME environment variable for the
Spark Application Master when the application is running on YARN. This is
important because the Spark Application Master needs to know the location
of the Spark installation directory (SPARK_HOME) to configure and manage
the Spark application's execution on the YARN cluster. HTH
Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 11 Dec 2023 at 01:43, Eugene Miretsky  wrote:

> Setting PYSPARK_ARCHIVES_PATH to hdfs:// did the trick. But I don't
> understand a few things:
>
> 1) The default behaviour is that if PYSPARK_ARCHIVES_PATH is empty,
> pyspark.zip is uploaded from the local SPARK_HOME. If it is set to
> "local://" the upload is skipped. I would expect the latter to be the
> default. What's the use case for uploading the local pyspark.zip every
> time?
> 2) It seems like the localConfigs are meant to be copied every time (code).
> What's the use case for that? Why not just use the cluster config?
>
>
>
> On Sun, Dec 10, 2023 at 1:15 PM Eugene Miretsky  wrote:
>
>> Thanks Mich,
>>
>> Tried this and am still getting
>> INFO Client: "Uploading resource
>> file:/opt/spark/spark-3.5.0-bin-hadoop3/python/lib/pyspark.zip ->
>> hdfs:/". It is also doing it for py4j-0.10.9.7-src.zip and
>> __spark_conf__.zip. It is working now because I enabled direct
>> access to HDFS to allow copying the files. But ideally I would like to not
>> have to copy any files directly to HDFS.
>>
>> 1) We would expect pyspark as well as the relevant configs to already be
>> available on the cluster - why are they being copied over? (we can always
>> provide the extra libraries needed using py-files the way you did)
>> 2) If we wanted users to be able to use custom pyspark, we would rather
>> just copy the file to HDFS/GCS in other ways, and let users reference it in
>> their job.
>> 3) What are the PYTHONPATH and SPARK_HOME env variables in your script?
>> Are they local paths, or paths on the spark cluster?
>>
>> On Fri, Nov 17, 2023 at 8:57 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> How are you submitting your spark job from your client?
>>>
>>> Your files can either be on HDFS or HCFS such as gs, s3 etc.
>>>
>>> With reference to --py-files hdfs://yarn-master-url hdfs://foo.py', I
>>> assume you want your
>>>
>>> spark-submit --verbose \
>>>--deploy-mode cluster \
>>>--conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \
>>>--conf "spark.yarn.appMasterEnv.PYTHONPATH=${PYTHONPATH}" \
>>>--conf "spark.executorEnv.PYTHONPATH=${PYTHONPATH}" \
>>>--py-files