Jason, in case you need a pointer on how to run Spark with a version of Java
different from the one used by the Hadoop processes, as indicated by
Dongjoon, this is an example of what we do on our Hadoop clusters:
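
A minimal sketch of one common way to do this on YARN, not necessarily the configuration the author refers to: set JAVA_HOME for the application master and the executors through the documented spark.yarn.appMasterEnv.* and spark.executorEnv.* properties. The helper name, the JDK path and the application file name are hypothetical, and the JDK is assumed to be installed at the same path on every node.

```python
import shlex


def spark_submit_with_java_home(java_home, app, master="yarn"):
    """Build a spark-submit command that points Spark on YARN at a given JDK.

    Hypothetical helper for illustration only.
    """
    conf = {
        # JAVA_HOME seen by the YARN application master
        "spark.yarn.appMasterEnv.JAVA_HOME": java_home,
        # JAVA_HOME seen by the executors
        "spark.executorEnv.JAVA_HOME": java_home,
    }
    cmd = ["spark-submit", "--master", master]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


# Assumed JDK location and application name, adjust to your environment
cmd = spark_submit_with_java_home("/usr/lib/jvm/java-17-openjdk", "my_app.py")
print(shlex.join(cmd))
```

In client mode the driver runs in the submitting shell, so it would normally pick up the JAVA_HOME exported there rather than these properties.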
Hi Enrico,
+1 on supporting Arrow on par with Pandas. Besides the frameworks and libraries
that you mentioned, I would add Awkward Array, a library used in High Energy
Physics (for those interested, more details on how we tested Awkward Array with
Spark, from back when mapInArrow was introduced, can be
Hi Qian,
Indeed, the metrics available with the Prometheus servlet sink (which is still
marked as experimental) are limited compared to the full instrumentation. This
is due to the way it is implemented, as a servlet, and from what I can see it
cannot be easily extended.
You can use another
Hi Senthil,
I have just run a couple of quick tests for TPCDS Q4, using the TPCDS schema
created at scale 1500 that I have on a Hadoop/YARN cluster, and was not able to
reproduce the difference in execution time between Spark 2 and Spark 3 that you
report in your mail.
This is the Spark
Hi Stavros, All,
Interesting topic. Here are some thoughts and personal opinions on it: I too
find the metrics system quite useful for building Grafana dashboards, as
opposed to scraping logs and/or using the Event Listener infrastructure, as you
mentioned in your mail.
A few
Hi,
I have recently experimented with a few ways to run OS commands from the
executors (in a YARN deployment) for a specific use case where we want to
interact with an external system of interest for our environment. From that
experience I thought that having the possibility to spawn a script
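
As an illustration of running an OS command on the executors, here is a minimal PySpark-style sketch, not the author's implementation. It assumes the command exists on every executor host. The function name run_command and the demo command are hypothetical, and the function is written as a plain generator so it can be checked locally without a cluster.

```python
import subprocess
import sys


def run_command(partition, command=("hostname",)):
    """Run an OS command once per partition and yield its outcome.

    Hypothetical helper, shaped so it can be passed to RDD.mapPartitions.
    """
    # Consume the partition so the function behaves as a mapPartitions callback
    n_rows = sum(1 for _ in partition)
    result = subprocess.run(
        list(command), capture_output=True, text=True, timeout=60
    )
    yield (n_rows, result.returncode, result.stdout.strip())


# Local check without Spark: run a trivial command with the current interpreter
demo = list(
    run_command(iter(range(3)), command=(sys.executable, "-c", "print('hello')"))
)
print(demo)

# On a cluster, assuming an existing SparkContext named sc:
#   sc.parallelize(range(8), 4).mapPartitions(run_command).collect()
```

Running the command once per partition rather than once per row keeps the process-spawning overhead low. The timeout guards against a command that hangs on an executor.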