Exporting all Executor Metrics in Prometheus format in K8s cluster

2021-02-04 Thread Dávid Szakállas
I’ve been trying to set up monitoring for our Spark 3.0.1 cluster running in K8s. We are using Prometheus as our monitoring system. We require both executor and driver metrics. My initial approach was to use the following configuration, to expose both metrics on the Spark UI: {

Custom Metrics Source -> Sink routing

2020-09-30 Thread Dávid Szakállas
Is there a way to customize which metrics sources are routed to which sinks? If I understood the docs correctly, there are some global switches for enabling sources, e.g. spark.metrics.staticSources.enabled,
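
For context, the routing granularity in metrics.properties is per *instance* (master, worker, driver, executor, applications, or the `*` wildcard), not per source; each instance's sources all flow to that instance's configured sinks. A hedged sketch of that syntax, with the directory path as an illustrative value:

```properties
# [instance].sink.[sink_name].[option]=[value]
# driver metrics -> CSV files; executor metrics -> the logs via Slf4j
driver.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
driver.sink.csv.directory=/tmp/spark-metrics
executor.sink.slf4j.class=org.apache.spark.metrics.sink.Slf4jSink
```

Finer source-to-sink routing than this instance level isn't expressible in metrics.properties, which is presumably what prompted the question.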

Secrets in Spark apps

2020-07-27 Thread Dávid Szakállas
Hi folks, do you know the best method for passing secrets to Spark operations, e.g. for doing encryption, salting with a secret before hashing, etc.? I have multiple ideas off the top of my head. The secret's source: - environment variable - config property - remote service accessed through an
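
On K8s specifically, Spark can inject a Kubernetes Secret into driver and executor pods as an environment variable, which covers the first option. A sketch using the documented `secretKeyRef` conf keys (the secret name `app-secrets`, key `salt`, and env var `HASH_SALT` are hypothetical):

```properties
# spark-defaults.conf / --conf flags:
# [EnvName]=[k8s-secret-name]:[key-within-secret]
spark.kubernetes.driver.secretKeyRef.HASH_SALT=app-secrets:salt
spark.kubernetes.executor.secretKeyRef.HASH_SALT=app-secrets:salt
```

Tasks can then read `sys.env("HASH_SALT")` inside a UDF; unlike config properties, the value does not appear in the Spark UI's environment tab.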

Dataset schema incompatibility bug when reading column partitioned data

2019-03-29 Thread Dávid Szakállas
We observed the following bug on Spark 2.4.0:
scala> spark.createDataset(Seq((1,2))).write.partitionBy("_1").parquet("foo.parquet")
scala> val schema = StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType)))
scala>
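
The underlying mismatch is that `partitionBy` moves the partition column to the end of the on-disk schema, so a user-supplied schema in the original column order no longer lines up. A minimal sketch of the behavior and a workaround, assuming a running SparkSession `spark`:

```scala
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Write: "_1" becomes a partition column and ends up *last* in the schema.
spark.createDataset(Seq((1, 2))).write.partitionBy("_1").parquet("foo.parquet")

// Reading back with the original order ("_1", "_2") mismatches positionally.
val schema = StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType)))
val broken = spark.read.schema(schema).parquet("foo.parquet")

// Workaround: let Spark infer the schema, then restore the intended order.
val fixed = spark.read.parquet("foo.parquet").select("_1", "_2")
```

This is only a sketch of the reported symptom; the thread presumably continues with the exact output showing the swapped values.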

Support nested keys in DataFrameWriter.bucketBy

2018-10-15 Thread Dávid Szakállas
Currently (in Spark 2.3.1) we cannot bucket DataFrames by nested columns, e.g. df.write.bucketBy(10, "key.a").saveAsTable("junk") will result in the following exception: org.apache.spark.sql.AnalysisException: bucket column key.a is not defined in table junk, defined table columns are: key,
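
Until nested keys are supported, the usual workaround is to project the nested field into a top-level column and bucket on that. A sketch, assuming a DataFrame `df` with a struct column `key` that has a field `a` (column and table names as in the snippet above):

```scala
import org.apache.spark.sql.functions.col

// Materialize key.a as a top-level column, then bucket on it.
// Note: bucketBy requires saveAsTable (a path-only save is rejected).
df.withColumn("key_a", col("key.a"))
  .write
  .bucketBy(10, "key_a")
  .saveAsTable("junk")
```

The extra column is stored in the table, which is the cost of this workaround.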

updateStateByKey for window batching

2016-08-22 Thread Dávid Szakállas
Hi! I’m curious about the fault-tolerance properties of stateful streaming operations. I am specifically interested in updateStateByKey. What happens if a node fails during processing? Is the state recoverable? Our use case is the following: we have messages arriving from a message queue
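
For reference, updateStateByKey requires checkpointing to be enabled, and it is the checkpoint that makes the state recoverable after a failure. A minimal sketch of the standard recovery pattern (app name, batch interval, and checkpoint path are hypothetical):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app" // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("stateful-windowing")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir) // required for updateStateByKey; state snapshots land here
  // ... build the input DStream from the queue and call updateStateByKey(...)
  ssc
}

// On a clean start this creates a fresh context; after a driver failure it
// rebuilds the context and restores the per-key state from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
```

Worker-node failures are handled transparently (lineage plus checkpointed state); driver recovery is what the getOrCreate pattern above addresses.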