RE: Spark on Java 17

2023-12-09 Thread Luca Canali
Hi Faiz, We find that G1GC works well for some of our workloads that are Parquet-read intensive, and we have been using G1GC with Spark on Java 8 already (setting spark.driver.extraJavaOptions and spark.executor.extraJavaOptions to “-XX:+UseG1GC”), while currently we are mostly running Spark (3.3 and higher) …
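
A minimal sketch of the G1GC configuration mentioned above, for a PySpark session; the application name is an illustrative assumption, and note that the driver-side flag has to be set at JVM launch (spark-submit or spark-defaults.conf) rather than from inside a running application:

from pyspark.sql import SparkSession

# Executors are launched after the session is created, so this flag takes effect;
# for the driver JVM, pass -XX:+UseG1GC at launch time instead, e.g.
# spark-submit --conf spark.driver.extraJavaOptions=-XX:+UseG1GC ...
spark = (
    SparkSession.builder
    .appName("g1gc-example")  # illustrative name
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    .getOrCreate()
)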

RE: Profiling PySpark Pandas UDF

2022-08-29 Thread Luca Canali
From: Abdeali Kothari: Hi Luca, I see you pushed some code to the PR 3 hrs ago. That's awesome. If I ca…

RE: Profiling PySpark Pandas UDF

2022-08-26 Thread Luca Canali
@Abdeali, as for “lightweight profiling”, there is some work in progress on instrumenting Python UDFs with Spark metrics; see https://issues.apache.org/jira/browse/SPARK-34265. However, it is a bit stuck at the moment and needs to be revived, I believe. Best, Luca
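
While the SPARK-34265 instrumentation is pending, a minimal manual-timing sketch of "lightweight profiling" a pandas UDF; the UDF and column names are illustrative assumptions, and the timing output lands in the executor stderr logs:

import time
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    # Time one batch of the UDF; the print goes to executor stderr, not the driver.
    start = time.monotonic()
    result = v + 1.0
    print(f"plus_one: {len(v)} rows in {time.monotonic() - start:.4f}s")
    return result

# Usage (assuming a DataFrame df with a double column "x"):
#   df.withColumn("y", plus_one("x"))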

RE: measure running time

2021-12-23 Thread Luca Canali
Hi Mich, With Spark 3.1.1 you need to use spark-measure built with Scala 2.12: bin/pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17 Best, Luca
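
Once the shell starts with that package, usage looks like this (a sketch following the sparkmeasure documentation; the SQL statement is just an example workload):

from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)
stagemetrics.begin()
spark.sql("select count(*) from range(1000) cross join range(1000)").show()
stagemetrics.end()
stagemetrics.print_report()  # elapsed time, executor CPU time, shuffle and I/O metrics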

RE: measure running time

2021-12-23 Thread Luca Canali
Hi, I agree with Gourav that just measuring execution time is a simplistic approach that may lead you to miss important details, in particular when running distributed computations. The WebUI, REST API, and metrics instrumentation in Spark can be quite useful for further drill-down. See …
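
As one example of such drill-down, a sketch that reads per-stage timings from the Spark REST API; it assumes a local application with the driver UI on the default port 4040:

import json
from urllib.request import urlopen

base = "http://localhost:4040/api/v1"
app_id = json.load(urlopen(f"{base}/applications"))[0]["id"]

# executorRunTime aggregates task run time across executors, in milliseconds.
for stage in json.load(urlopen(f"{base}/applications/{app_id}/stages")):
    print(stage["stageId"], stage["name"], stage["executorRunTime"])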

RE: How to estimate the executor memory size according by the data

2021-12-23 Thread Luca Canali
… the REST API and the Spark metrics system; see https://spark.apache.org/docs/latest/monitoring.html. Further information on the topic is also at https://db-blog.web.cern.ch/blog/luca-canali/2020-08-spark3-memory-monitoring. Best, Luca
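
In the same spirit, a sketch that reads executor memory figures from the REST API (driver UI assumed on localhost:4040), which can help ground memory sizing in measurements rather than guesses:

import json
from urllib.request import urlopen

base = "http://localhost:4040/api/v1"
app_id = json.load(urlopen(f"{base}/applications"))[0]["id"]

for ex in json.load(urlopen(f"{base}/applications/{app_id}/executors")):
    # maxMemory: storage memory available to the executor; memoryUsed:
    # memory currently used for cached data (both in bytes).
    print(ex["id"], ex["maxMemory"], ex["memoryUsed"])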

RE: Spark 3.0 plugins

2021-12-20 Thread Luca Canali
Hi Anil, To recap: Apache Spark plugins are an interface and configuration that allow code to be injected at executor start-up and, among other things, provide a hook into the Spark metrics system. This provides a way to extend metrics collection beyond what is available in Apache Spark.
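
A sketch of how a plugin is wired in from the configuration side; the plugin class itself is JVM code implementing org.apache.spark.api.plugin.SparkPlugin, and the class name below is a hypothetical placeholder (see e.g. the cerndb/SparkPlugins repository for real implementations):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Hypothetical plugin class, shipped in a jar on the application classpath.
    .config("spark.plugins", "com.example.MyMetricsPlugin")
    .getOrCreate()
)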

RE: Spark Prometheus Metrics for Executors Not Working

2021-05-24 Thread Luca Canali
The PrometheusServlet adds a servlet within the existing Spark UI to serve metrics data in Prometheus format. Similarly to what happens with the MetricsServlet, the Prometheus servlet does not work on executors, as executors do not have a Spark UI endpoint to which the servlet could attach …
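
A sketch of the workaround commonly used instead (marked experimental in the Spark 3.x monitoring docs): have the driver expose executor metrics in Prometheus format on its own UI endpoint:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.ui.prometheus.enabled", "true")
    .getOrCreate()
)
# Executor metrics can then be scraped from the driver UI, e.g.:
#   http://<driver-host>:4040/metrics/executors/prometheus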

RE: Understanding Executors UI

2021-01-08 Thread Luca Canali
… memory instrumentation and improved instrumentation for streaming, so you can profit from testing there too. From: Eric Beabes: So when I see this for 'Storage Memory': 3.3 TB / 598.5 GB - it's tell…

RE: Understanding Executors UI

2021-01-06 Thread Luca Canali
… https://spark.apache.org/docs/latest/tuning.html#memory-management-overview. Additional resources: see also this diagram https://canali.web.cern.ch/docs/SparkExecutorMemory.png and https://db-blog.web.cern.ch/blog/luca-canali/2020-08-spark3-memory-monitoring. Best, Luca

RE: Adding isolation level when reading from DB2 with spark.read

2020-09-02 Thread Luca Canali
Hi Filipa, the Spark JDBC data source has the option to add a "sessionInitStatement", documented in https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html and https://issues.apache.org/jira/browse/SPARK-21519. I guess you could use that option to "inject" a SET ISOLATION statement, …
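
A sketch of what that could look like; the DB2 URL, table, credentials, and the exact SET ISOLATION syntax are illustrative assumptions to be checked against the DB2 documentation (and it assumes an existing SparkSession, spark):

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:db2://db2host:50000/MYDB")  # assumed URL
    .option("dbtable", "MYSCHEMA.MYTABLE")           # assumed table
    .option("user", "user")
    .option("password", "password")
    # Runs on each JDBC connection right after it is opened, before reading data.
    .option("sessionInitStatement", "SET CURRENT ISOLATION = UR")
    .load()
)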

RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

2020-08-24 Thread Luca Canali
Hi Abhishek, just a few ideas/comments on the topic: when benchmarking/testing I find it useful to collect a more complete view of resource usage and Spark metrics, beyond just measuring query elapsed time. Something like this: https://github.com/cerndb/spark-dashboard. I'd rather not use …
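
A sketch of pointing the Spark metrics system at such a dashboard, which ingests Graphite-protocol metrics; host and port are assumptions to be replaced with the values of the actual deployment:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.metrics.conf.*.sink.graphite.class",
            "org.apache.spark.metrics.sink.GraphiteSink")
    .config("spark.metrics.conf.*.sink.graphite.host", "dashboard-host")  # assumed
    .config("spark.metrics.conf.*.sink.graphite.port", "2003")            # assumed
    .config("spark.metrics.conf.*.sink.graphite.period", "10")
    .config("spark.metrics.conf.*.sink.graphite.unit", "seconds")
    .getOrCreate()
)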

RE: tcps oracle connection from spark

2019-06-19 Thread Luca Canali
Connecting to Oracle from Spark using the TCPS protocol works OK for me. Maybe try to turn debugging on with -Djavax.net.debug=all? See also: https://blogs.oracle.com/dev2dev/ssl-connection-to-oracle-db-using-jdbc%2c-tlsv12%2c-jks-or-oracle-wallets Regards, L.
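
For reference, a sketch of a Spark JDBC read over Oracle TCPS; the connect descriptor, host, port, and service name are illustrative assumptions, and it assumes an existing SparkSession, spark. The debug flag mentioned above can be passed to the Spark JVMs via spark.driver.extraJavaOptions / spark.executor.extraJavaOptions:

# Oracle TCPS connect descriptor (all values are placeholders).
url = ("jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=tcps)"
       "(HOST=dbhost)(PORT=2484))(CONNECT_DATA=(SERVICE_NAME=myservice)))")

df = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("dbtable", "MYSCHEMA.MYTABLE")
    .option("user", "user")
    .option("password", "password")
    .load()
)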

RE: Spark Profiler

2019-03-27 Thread Luca Canali
I find that the Spark metrics system is quite useful for gathering resource utilization metrics of Spark applications, including CPU, memory, and I/O. If you are interested, there is an example of how this works for us at: https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark

RE: kerberos auth for MS SQL server jdbc driver

2018-10-15 Thread Luca Canali
We have a case where we interact with a Kerberized service and found a simple workaround to distribute and make use of the driver’s Kerberos credential cache file in the executors. Maybe some of the ideas there can be of help for this case too? Our case is on Linux, though. Details: …
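
A sketch of the general idea under stated assumptions (the cache file path and name are placeholders, and the exact KRB5CCNAME form may need adjusting for a given system): ship the driver's credential cache to the executors and point the Kerberos libraries at it:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Distribute the driver's credential cache file to each executor's work dir.
    .config("spark.files", "/tmp/krb5cc_1000")              # assumed path
    # Make the Kerberos libraries on the executors pick it up.
    .config("spark.executorEnv.KRB5CCNAME", "krb5cc_1000")  # relative to work dir
    .getOrCreate()
)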