Re: How can I write Spark addListener metric to Kafka

2020-06-09 Thread Tathagata Das
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#reporting-metrics-programmatically-using-asynchronous-apis On Tue, Jun 9, 2020 at 4:42 PM a s wrote: > Hi guys, > I am building a structured streaming app for Google Analytics data. > I want to capture the number of rows read and processed.
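
For reference, a minimal sketch of the asynchronous listener approach the guide linked above describes: a StreamingQueryListener that forwards each batch's progress JSON (which includes numInputRows and processedRowsPerSecond) to a Kafka topic. The broker address and the "query-metrics" topic name are placeholders, not details from the thread.

  import java.util.Properties
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
  import org.apache.spark.sql.streaming.StreamingQueryListener
  import org.apache.spark.sql.streaming.StreamingQueryListener._

  class KafkaMetricsListener extends StreamingQueryListener {
    private val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    private val producer = new KafkaProducer[String, String](props)

    override def onQueryStarted(event: QueryStartedEvent): Unit = ()

    override def onQueryProgress(event: QueryProgressEvent): Unit = {
      // progress.json carries numInputRows, processedRowsPerSecond, etc.
      producer.send(new ProducerRecord("query-metrics", event.progress.id.toString, event.progress.json))
    }

    override def onQueryTerminated(event: QueryTerminatedEvent): Unit = producer.close()
  }

Register it before starting the query with spark.streams.addListener(new KafkaMetricsListener).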

How can I write Spark addListener metric to Kafka

2020-06-09 Thread a s
Hi guys, I am building a structured streaming app for Google Analytics data. I want to capture the number of rows read and processed. I can see it in the logs; how can I send it to Kafka? Thanks, Alis

Re: [spark-structured-streaming] [kafka] consume topics from multiple Kafka clusters

2020-06-09 Thread Srinivas V
OK, thanks for confirming; I will do it this way. Regards, Srini. On Tue, Jun 9, 2020 at 11:31 PM Gerard Maas wrote: > Hi Srinivas, > Reading from different brokers is possible, but you need to connect to each > Kafka cluster separately. > Trying to mix connections to two different Kafka

Re: [spark-structured-streaming] [kafka] consume topics from multiple Kafka clusters

2020-06-09 Thread Gerard Maas
Hi Srinivas, Reading from different brokers is possible, but you need to connect to each Kafka cluster separately. Trying to mix connections to two different Kafka clusters in one subscriber is not supported. (I'm sure it would give all kinds of weird errors.) The "kafka.bootstrap.servers" option is set per source, so each source can point at only one cluster.
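
To illustrate, a minimal sketch of the one-connection-per-cluster approach for streaming reads (host:port values and topic names are placeholders):

  val cluster1 = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port")
    .option("subscribe", "topic1")
    .load()

  val cluster2 = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "cluster2_host:cluster2_port")
    .option("subscribe", "topic2")
    .load()

  // Both sources share the same Kafka schema (key, value, topic, partition,
  // offset, timestamp, ...), so they can be unioned into one input stream
  // and fed to a single query.
  val allClusters = cluster1.union(cluster2)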

Re: [spark-structured-streaming] [kafka] consume topics from multiple Kafka clusters

2020-06-09 Thread Srinivas V
Thanks for the quick reply. This may work, but I have about 5 topics to listen to right now. I am trying to keep all the topics in an array in a properties file and read them all at once. This way it is dynamic: you have one code block like the one below, and you may add or delete topics from the config
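
A hypothetical sketch of that config-driven setup (the property names, file name, and cluster/topic values are all illustrative, not from the thread): one source per cluster, each subscribing to a comma-separated topic list read from the properties file.

  import java.io.FileInputStream
  import java.util.Properties

  // kafka-sources.properties (illustrative):
  //   clusters=cluster1,cluster2
  //   cluster1.servers=host1:9092
  //   cluster1.topics=topicA,topicB,topicC
  //   cluster2.servers=host2:9092
  //   cluster2.topics=topicD,topicE
  val props = new Properties()
  props.load(new FileInputStream("kafka-sources.properties"))

  val streams = props.getProperty("clusters").split(",").map(_.trim).map { c =>
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", props.getProperty(s"$c.servers"))
      .option("subscribe", props.getProperty(s"$c.topics")) // comma-separated topics
      .load()
  }

  // Still one connection per cluster; adding a topic is a config-only change.
  val input = streams.reduce(_ union _)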

Re: [spark-structured-streaming] [kafka] consume topics from multiple Kafka clusters

2020-06-09 Thread German SM
Hello, I've never tried that; doesn't this work?

  val df_cluster1 = spark
    .read
    .format("kafka")
    .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port")
    .option("subscribe", "topic1")
    .load()

  val df_cluster2 = spark
    .read
    .format("kafka")
    .option("kafka.bootstrap.servers", "cluster2_host:cluster2_port")
    .option("subscribe", "topic2")
    .load()

[spark-structured-streaming] [kafka] consume topics from multiple Kafka clusters

2020-06-09 Thread Srinivas V
Hello, In Structured Streaming, is it possible to have one Spark application with one query that consumes topics from multiple Kafka clusters? I am trying to consume two topics, each from a different Kafka cluster, but it reports one of the topics as an unknown topic, and the job keeps running without

Out of memory caused by a high number of Spark submissions in FIFO mode

2020-06-09 Thread Sunil Pasumarthi
Hi all, I have written a small ETL Spark application that takes data from GCS, transforms it, and saves it into another GCS bucket. I am trying to run this application for different IDs using a Spark cluster on Google's Dataproc, just tweaking the default configuration to use a

Re: [SPARK-30957][SQL] Null-safe variant of Dataset.join(Dataset[_], Seq[String])

2020-06-09 Thread Alexandros Biratsis
Hi Enrico and Spark devs, Since the current plan is not to provide built-in functionality for dropping repeated/redundant columns, I wrote two helper methods as a workaround. The first method supports multiple Column instances, extending the current drop
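
For context, a hypothetical sketch of the kind of workaround being described: a null-safe join on shared column names that then drops the redundant right-side copies. The method name and signature are illustrative, not the author's actual helpers.

  import org.apache.spark.sql.DataFrame

  def joinNullSafe(left: DataFrame, right: DataFrame,
                   cols: Seq[String], joinType: String = "inner"): DataFrame = {
    // Null-safe equality (<=>) treats null == null as true, unlike plain ===.
    val condition = cols.map(c => left(c) <=> right(c)).reduce(_ && _)
    val joined = left.join(right, condition, joinType)
    // Drop the duplicated right-side join columns so names stay unambiguous.
    cols.foldLeft(joined)((df, c) => df.drop(right(c)))
  }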

[PySpark CrossValidator] Dropping column randCol before fitting model

2020-06-09 Thread Ablaye FAYE
Hello, I have noticed that the _fit method of the CrossValidator class adds a new column (randCol) to the input dataset in PySpark. This column is used to split the dataset into k folds. Is this column removed from the training and test data of each fold before the model is fit? I ask this question