Re: Single executor processing all tasks in spark structured streaming kafka

2021-03-11 Thread Sachit Murarka
Hi Kapil, Thanks for the suggestion. Yes, it worked. Regards Sachit On Tue, 9 Mar 2021, 00:19 Kapil Garg, wrote: > Hi Sachit, > What do you mean by "spark is running only 1 executor with 1 task"? > Did you submit the spark application with multiple executors but only 1 is > being used and the rest

Re: Single executor processing all tasks in spark structured streaming kafka

2021-03-08 Thread Kapil Garg
Hi Sachit, What do you mean by "spark is running only 1 executor with 1 task"? Did you submit the spark application with multiple executors but only 1 is being used and the rest are idle? If that's the case, it might be due to the spark.locality.wait setting, which defaults to 3s. This
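A minimal sketch of turning that wait off when building the session, assuming the locality delay is indeed the culprit; the same key can also be passed to spark-submit via --conf and applies equally to PySpark jobs (Scala shown for illustration):

    import org.apache.spark.sql.SparkSession

    // "0s" removes the 3s wait for a data-local executor, so tasks can be
    // scheduled on any idle executor immediately.
    val spark = SparkSession.builder()
      .appName("structured-streaming-kafka")  // placeholder app name
      .config("spark.locality.wait", "0s")
      .getOrCreate()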

Single executor processing all tasks in spark structured streaming kafka

2021-03-08 Thread Sachit Murarka
Hi All, I am using Spark 3.0.1 Structured Streaming with Pyspark. The problem is that Spark is running only 1 executor with 1 task. Following is a summary of what I am doing. Can anyone help me understand why there is only 1 executor? def process_events(event): fetch_actual_data() #many more steps def

Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-06 Thread Jungtaek Lim
In SS, checkpointing is now part of running a micro-batch and is supported natively. (To be clear, my library doesn't deal with the native checkpointing behavior.) In other words, it can't be customized the way you have been doing with your database. You probably don't need to do that with SS,
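A minimal sketch of the native checkpointing described above, with placeholder broker, topic, and path; restarting the query with the same checkpointLocation resumes from the offsets recorded per micro-batch:

    // Assumes an existing SparkSession named spark.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder broker
      .option("subscribe", "events")                      // placeholder topic
      .load()

    // Offsets are recorded in the checkpoint directory for every micro-batch;
    // no manual offset bookkeeping in an external store is needed.
    val query = df.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/ss-demo")  // placeholder path
      .start()
    query.awaitTermination()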

Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-06 Thread KhajaAsmath Mohammed
Thanks Lim, this is really helpful. I have a few questions. Our earlier approach used a low-level consumer to read offsets from a database and used that information to read via Spark Streaming with DStreams, saving the offsets back once processing finished. This way we never lost data. With your
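For the migration question, a hedged sketch of the closest Structured Streaming analogue to storing offsets in a database: seed the query with explicit offsets via startingOffsets. Topic, partitions, and offsets below are placeholders, and the option is only honored when the checkpoint directory is empty:

    // Assumes an existing SparkSession named spark.
    // JSON map: topic -> partition -> offset; -2 = earliest, -1 = latest.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .option("startingOffsets", """{"events":{"0":1234,"1":-2}}""")
      .load()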

Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-05 Thread Jungtaek Lim
There are sections in the SS programming guide that answer exactly these questions:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#managing-streaming-queries
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries
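As a concrete illustration of the monitoring hooks those sections describe, a minimal sketch of a StreamingQueryListener that prints per-batch progress (assumes an existing SparkSession named spark):

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    // Attach once per SparkSession; progress events fire after each micro-batch.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit =
        println(s"Query started: ${event.id}")
      override def onQueryProgress(event: QueryProgressEvent): Unit =
        println(s"Batch ${event.progress.batchId}: " +
          s"${event.progress.numInputRows} rows, " +
          s"${event.progress.inputRowsPerSecond} rows/s")
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
        println(s"Query terminated: ${event.id}")
    })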

Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-05 Thread Gabor Somogyi
In 3.0 the community just added it. On Sun, 5 Jul 2020, 14:28 KhajaAsmath Mohammed, wrote: > Hi, > > We are trying to move our existing code from Spark DStreams to Structured > Streaming for one of the old applications we built a few years ago. > > The Structured Streaming job doesn’t have

Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-05 Thread KhajaAsmath Mohammed
Hi, We are trying to move our existing code from Spark DStreams to Structured Streaming for one of the old applications we built a few years ago. The Structured Streaming job doesn’t have a streaming tab in the Spark UI. Is there a way to monitor the jobs we submit with Structured Streaming? Since

Re: [spark-structured-streaming] [kafka] consume topics from multiple Kafka clusters

2020-06-09 Thread Srinivas V
ok, thanks for confirming, I will do it this way. Regards Srini On Tue, Jun 9, 2020 at 11:31 PM Gerard Maas wrote: > Hi Srinivas, > > Reading from different brokers is possible but you need to connect to each > Kafka cluster separately. > Trying to mix connections to two different Kafka

Re: [spark-structured-streaming] [kafka] consume topics from multiple Kafka clusters

2020-06-09 Thread Gerard Maas
Hi Srinivas, Reading from different brokers is possible, but you need to connect to each Kafka cluster separately. Trying to mix connections to two different Kafka clusters in one subscriber is not supported. (I'm sure it would give all kinds of weird errors.) The "kafka.bootstrap.servers"
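A hedged sketch of the separate-connection pattern Gerard describes, using readStream for a streaming job (hostnames and topics are placeholders); since both sources share the Kafka schema, a single query can still span both clusters via union:

    // Assumes an existing SparkSession named spark.
    val df_cluster1 = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port")
      .option("subscribe", "topic1")
      .load()
    val df_cluster2 = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "cluster2_host:cluster2_port")
      .option("subscribe", "topic2")
      .load()

    // Same schema on both sides, so one query covers both clusters.
    val query = df_cluster1.union(df_cluster2)
      .selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/multi-cluster")  // placeholder
      .start()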

Re: [spark-structured-streaming] [kafka] consume topics from multiple Kafka clusters

2020-06-09 Thread Srinivas V
Thanks for the quick reply. This may work, but I have about 5 topics to listen to right now. I am trying to keep all topics in an array in a properties file and read them all at once. That way it is dynamic: you have one code block like the one below, and you can add or delete topics from the config
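A minimal sketch of that properties-driven setup, assuming a hypothetical topics.properties file containing a line like topics=topic1,topic2,topic3; the "subscribe" option takes a comma-separated list, so growing the list needs no code change (all topics must live on the same cluster):

    import java.util.Properties
    import java.io.FileInputStream

    // Hypothetical config file with: topics=topic1,topic2,topic3
    val props = new Properties()
    val in = new FileInputStream("topics.properties")
    try props.load(in) finally in.close()

    // Assumes an existing SparkSession named spark.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port")
      .option("subscribe", props.getProperty("topics"))
      .load()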

Re: [spark-structured-streaming] [kafka] consume topics from multiple Kafka clusters

2020-06-09 Thread German SM
Hello, I've never tried that, but doesn't this work?

    val df_cluster1 = spark
      .read
      .format("kafka")
      .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port")
      .option("subscribe", "topic1")
      .load()

    val df_cluster2 = spark
      .read
      .format("kafka")
      .option("kafka.bootstrap.servers", "cluster2_host:cluster2_port")
      .option("subscribe", "topic2")
      .load()

[spark-structured-streaming] [kafka] consume topics from multiple Kafka clusters

2020-06-09 Thread Srinivas V
Hello, In Structured Streaming, is it possible to have one Spark application with one query consume topics from multiple Kafka clusters? I am trying to consume two topics, each from a different Kafka cluster, but one of the topics is reported as an unknown topic and the job keeps running without

how to specify which partition each record send on spark structured streaming kafka sink?

2019-08-13 Thread zenglong chen
The key option does not work!
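For background on the question, a hedged sketch of the two usual levers for record placement on the Kafka sink, assuming an input DataFrame df with columns id and payload: populate a key column (Kafka's default partitioner hashes it to a partition), or hand the producer a custom partitioner through a kafka.-prefixed option (the class name below is hypothetical):

    import org.apache.spark.sql.functions.col

    df.select(
        col("id").cast("string").as("key"),        // routes the record
        col("payload").cast("string").as("value"))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder
      .option("topic", "out-topic")                       // placeholder
      // Optional: a custom producer partitioner (hypothetical class):
      .option("kafka.partitioner.class", "com.example.MyPartitioner")
      .option("checkpointLocation", "/tmp/checkpoints/kafka-sink")
      .start()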

How to add spark structured streaming kafka source receiver

2019-08-09 Thread zenglong chen
I have set the " maxOffsetsPerTrigger",but it still receive one partition per trigger on micro-batch mode.So where to set receiving on 10 partitions parallel like what is Spark Streaming doing?

Re: Handle Null Columns in Spark Structured Streaming Kafka

2019-04-30 Thread SNEHASISH DUTTA
Hi, An NA function will replace null with some default value, and not all my columns are of type string, so for the other data types (long/int etc.) I would have to provide some default value, but ideally those values should remain null. The null column drop is actually happening in this step: df.selectExpr(
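If the column drop happens while serializing rows with to_json inside that selectExpr, one hedged avenue: Spark 3.0 added an ignoreNullFields option to to_json that keeps null fields in the output (it is not available on 2.x); assuming an input DataFrame df:

    import org.apache.spark.sql.functions.{col, struct, to_json}

    // Emit null fields instead of silently omitting them from the JSON.
    val out = df.select(
      to_json(struct(df.columns.map(col): _*),
              Map("ignoreNullFields" -> "false")).as("value"))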

Re: Handle Null Columns in Spark Structured Streaming Kafka

2019-04-29 Thread Jason Nerothin
See also here: https://stackoverflow.com/questions/44671597/how-to-replace-null-values-with-a-specific-value-in-dataframe-using-spark-in-jav On Mon, Apr 29, 2019 at 5:27 PM Jason Nerothin wrote: > Spark SQL has had an na.fill function since at least 2.1. Would that > work for you?

Re: Handle Null Columns in Spark Structured Streaming Kafka

2019-04-29 Thread Jason Nerothin
Spark SQL has had an na.fill function since at least 2.1. Would that work for you? https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/DataFrameNaFunctions.html On Mon, Apr 29, 2019 at 4:57 PM Shixiong(Ryan) Zhu wrote: > Hey Snehasish, > > Do you have a reproducer for this
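For reference, a small sketch of the na.fill API mentioned above, with per-column defaults so each data type gets a sensible value (column names are illustrative, df is an assumed DataFrame):

    // Fill nulls per column; columns not listed are left untouched.
    val filled = df.na.fill(Map(
      "name"  -> "unknown",  // string column
      "count" -> 0L,         // long column
      "score" -> 0.0         // double column
    ))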

Handle Null Columns in Spark Structured Streaming Kafka

2019-04-24 Thread SNEHASISH DUTTA
Hi, While writing to Kafka using Spark Structured Streaming, if all the values in a certain column are null, the column gets dropped. Is there any way to override this, other than using the na.fill functions? Regards, Snehasish

Re: Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-06-06 Thread amihay gonen
If you are using the Kafka direct connect API, it might be committing offsets back to Kafka itself. On Thu, 7 Jun 2018, 4:10, licl wrote: > I met the same issue and I tried deleting the checkpoint dir before the > job, > > But Spark still seems to read the correct offset even after the

Re: Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-06-06 Thread licl
I met the same issue and I tried deleting the checkpoint dir before the job, but Spark still seems to read the correct offset even after the checkpoint dir is deleted. I don't know how Spark does this without the checkpoint's metadata.

Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-03-22 Thread M Singh
Hi: I am working on a realtime application using Spark Structured Streaming (v 2.2.1). The application reads data from Kafka and, if there is a failure, I would like to ignore the checkpoint. Is there any configuration to just read from the last Kafka offset after a failure and ignore any offset
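There is no documented switch for ignoring an existing checkpoint in place; a hedged sketch of the common workaround, assuming a SparkSession named spark: restart the query against a fresh checkpoint directory so startingOffsets is honored again, and set it to latest:

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder
      .option("subscribe", "events")                      // placeholder
      .option("startingOffsets", "latest")  // only read when no checkpoint exists
      .load()

    df.writeStream
      .format("console")
      // A new directory per run, so old checkpoint state is ignored.
      .option("checkpointLocation", s"/tmp/checkpoints/run-${System.currentTimeMillis}")
      .start()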

Re: Apache Spark Structured Streaming - Kafka Consumer cannot fetch records for offset exception

2018-03-22 Thread Tathagata Das
Structured Streaming AUTOMATICALLY saves the offsets in a checkpoint directory that you provide. And when you start the query again with the same directory it will just pick up where it left off.

Apache Spark Structured Streaming - Kafka Consumer cannot fetch records for offset exception

2018-03-22 Thread M Singh
Hi: I am working with Spark (2.2.1) and Kafka (0.10) on AWS EMR, and for the last few days, after running the application for 30-60 minutes, I get the exception from the Kafka consumer included below. The structured streaming application is processing 1 minute's worth of data from a Kafka topic, so I've tried
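This kind of "cannot fetch records for offset" error often appears when the requested offsets have already been deleted by Kafka retention; a hedged mitigation sketch, assuming that is the cause here: extend the topic's retention, or let the source skip missing offset ranges with failOnDataLoss (at the cost of silently dropped data):

    // Assumes an existing SparkSession named spark; broker/topic are placeholders.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .option("failOnDataLoss", "false")  // warn instead of failing the query
      .load()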

Spark Structured Streaming + Kafka

2017-11-14 Thread Agostino Calamita
Hi, I have a problem with Structured Streaming and Kafka. I have 2 brokers and a topic with 8 partitions and replication factor 2. This is my driver program:

    public static void main(String[] args) {
        SparkSession spark = SparkSession
            .builder()