RE: Spark on Yarn with Java 17

2023-12-09 Thread Luca Canali
Jason, in case you need a pointer on how to run Spark with a Java version different from the one used by the Hadoop processes, as indicated by Dongjoon, this is an example of what we do on our Hadoop clusters:
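The snippet above is cut off before the author's example, so the following is not the configuration from the original message. It is only a minimal sketch of one common approach, assuming that a Java 17 JDK is installed at the same path on every cluster node. It uses the documented Spark properties `spark.yarn.appMasterEnv.JAVA_HOME` and `spark.executorEnv.JAVA_HOME`; the path `/usr/lib/jvm/java-17` and the helper names `java_home_conf` and `to_submit_args` are hypothetical and chosen for illustration.

```python
def java_home_conf(java_home):
    """Return Spark properties that point the YARN application master
    and the executors at the JDK installed under `java_home`."""
    return {
        "spark.yarn.appMasterEnv.JAVA_HOME": java_home,
        "spark.executorEnv.JAVA_HOME": java_home,
    }


def to_submit_args(conf):
    """Render a dict of Spark properties as spark-submit `--conf` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical JDK location; it must exist on all nodes for this to work.
JAVA17_HOME = "/usr/lib/jvm/java-17"

command = ["spark-submit", "--master", "yarn"] + to_submit_args(java_home_conf(JAVA17_HOME))
print(" ".join(command))
```

In client mode the driver runs on the submitting machine, so under this sketch the local `JAVA_HOME` would also need to point at a compatible JDK.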

RE: On adding applyInArrow to groupBy and cogroup

2023-11-03 Thread Luca Canali
Hi Enrico, +1 on supporting Arrow on par with Pandas. Besides the frameworks and libraries that you mentioned, I would add Awkward Array, a library used in High Energy Physics (for those interested, more details on how we tested Awkward Array with Spark, from back when mapInArrow was introduced, can be

RE: Executor metrics are missing on Prometheus sink

2023-02-10 Thread Luca Canali
Hi Qian, indeed the metrics available with the Prometheus servlet sink (which is still marked as experimental) are limited compared to the full instrumentation. This is due to the way it is implemented with a servlet and, from what I can see, it cannot be easily extended. You can use another

RE: Spark 3 is Slower than Spark 2 for TPCDS Q04 query.

2021-12-20 Thread Luca Canali
Hi Senthil, I have just run a couple of quick tests for TPCDS Q4, using the TPCDS schema created at scale 1500 that I have on a Hadoop/YARN cluster, and was not able to reproduce the difference in execution time between Spark 2 and Spark 3 that you report in your mail. This is the Spark

RE: [DISCUSS][CORE] Exposing application status metrics via a source

2018-09-14 Thread Luca Canali
Hi Stavros, All, Interesting topic. I add here some thoughts and personal opinions on it: I too find the metrics system quite useful for building Grafana dashboards, as opposed to scraping logs and/or using the Event Listener infrastructure, as you mentioned in your mail. A few

Run an OS command or script supplied by the user at the start of each executor

2017-05-12 Thread Luca Canali
Hi, I have recently experimented with a few ways to run OS commands from the executors (in a YARN deployment) for a specific use case where we want to interact with an external system of interest for our environment. From that experience I thought that having the possibility to spawn a script
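The message is truncated before it describes the approaches that were tried, so the following is not the author's method. It is only a minimal sketch of one possible way to run an OS command from tasks executing on the executors, assuming PySpark and a job with at least one partition per executor. The helper names `run_command` and `run_on_partition` are hypothetical. Note that this runs the command once per partition, which is not the same as once at the start of each executor.

```python
import subprocess
import sys


def run_command(command):
    """Run `command` (a list of arguments) and return (exit code, stdout text)."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.returncode, result.stdout.strip()


def run_on_partition(command):
    """Build a function usable with RDD.mapPartitions that runs `command`
    once for each partition and emits a single result record."""
    def _run(rows):
        # The partition's rows are ignored; only the command result is emitted.
        yield run_command(command)
    return _run


# Local check that needs no cluster: call the partition function directly.
demo_command = [sys.executable, "-c", "print('hello from executor')"]
print(list(run_on_partition(demo_command)(iter([]))))
```

With a running SparkContext `sc`, the same function could be applied as `sc.parallelize(range(n), n).mapPartitions(run_on_partition(cmd)).collect()`, where `n` and `cmd` are placeholders for the number of partitions and the command to run.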