Re: Is spark not good for ingesting into updatable databases?

2018-10-30 Thread ravidspark
Hi Jorn, Just wanted to check whether you got a chance to look at this problem. I couldn't figure out any reason why this is happening. Any help would be appreciated.

Re: Is spark not good for ingesting into updatable databases?

2018-10-27 Thread ravidspark
Hi Jorn, Thanks for your kind reply. I accept that there might be something in the code. Any help would be appreciated. To give you some insight, I checked at the source whether the message in Kafka had been repeated twice, but I could only find it once. Also, it would have been convincing

Is spark not good for ingesting into updatable databases?

2018-10-26 Thread ravidspark
Hi All, My problem is as explained below. *Environment:* Spark 2.2.0 installed on CDH. *Use-Case:* Reading from Kafka, cleansing the data, and ingesting into a non-updatable database. *Problem:* My streaming batch duration is 1 minute and I am receiving 3000 messages/min. I am observing a weird case where,

Spark maxTaskFailures is not recognized with Cassandra

2018-06-05 Thread ravidspark
Hi All, I configured the number of task failures by setting spark.task.maxFailures to 10 in my Spark application, which reads from Kafka and ingests data into Cassandra. I observed that when the Cassandra service is down, it does not retry the 10 times I set. Instead it is retrying with the
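A minimal sketch (not the poster's code) of where spark.task.maxFailures is set. The connector-level retry setting shown alongside it is an assumption about what may actually govern the observed retries, since the Cassandra connector retries failed statements on its own.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MaxFailuresConf {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("kafka-to-cassandra")
                // task-level retry count checked by the Spark scheduler
                .set("spark.task.maxFailures", "10")
                // assumption: the spark-cassandra-connector also retries on its own,
                // so its setting may need to be aligned with the task-level one
                .set("spark.cassandra.query.retry.count", "10");
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.stop();
    }
}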

assertion failed: Beginning offset 34242088 is after the ending offset 34242084 for topic partition 2. You either provided an invalid fromOffset, or the Kafka topic has been damaged

2018-05-14 Thread ravidspark
Hi Community, I am seeing the below message in the logs and the Spark application is getting terminated. There is an issue with our Kafka service: it auto-restarts, during which leader re-election happens. *Exception:* assertion failed: Beginning offset 34242088 is after the ending offset 34242084 for
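One hedged mitigation sketch, assuming the stored fromOffsets can end up ahead of a log truncated by the unclean restart: clamp them against the broker's current end offsets before building the stream. This assumes a Kafka client that provides Consumer#endOffsets (0.10.1+); the property contents are placeholders.

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetClamp {
    // kafkaProps must contain bootstrap.servers and key/value deserializers
    public static Map<TopicPartition, Long> clamp(Map<TopicPartition, Long> stored,
                                                  Properties kafkaProps) {
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(kafkaProps)) {
            Map<TopicPartition, Long> end = consumer.endOffsets(stored.keySet());
            Map<TopicPartition, Long> safe = new HashMap<>();
            for (Map.Entry<TopicPartition, Long> e : stored.entrySet()) {
                // never ask Spark to start past the current log-end offset
                safe.put(e.getKey(), Math.min(e.getValue(), end.get(e.getKey())));
            }
            return safe;
        }
    }
}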

Re: ordered ingestion not guaranteed

2018-05-12 Thread ravidspark
Jorn, Thanks for the response. My downstream database is Kudu. 1. Yes. As you suggested, I have been using a central caching mechanism that caches the RDD results and compares them with the next batch so as to keep the latest timestamps and ignore the old ones. But, I see
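A much-simplified, single-JVM stand-in for the central cache described above (not the poster's implementation): remember the newest event timestamp written per key and skip anything older. On a real cluster this map would have to live in an external store, since each executor JVM would otherwise keep its own copy.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class LastWriteWins {
    private static final ConcurrentMap<String, Long> lastWritten = new ConcurrentHashMap<>();

    // true when this record is at least as new as anything already written for the key
    public static boolean shouldWrite(String key, long eventTime) {
        long newest = lastWritten.merge(key, eventTime, Math::max);
        return newest == eventTime;
    }
}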

ordered ingestion not guaranteed

2018-05-11 Thread ravidspark
Hi All, I am using Spark 2.2.0 and have the below use case: *Reading from Kafka using Spark Streaming and updating (not just inserting) the records in a downstream database.* I understand that the order in which Spark reads messages from Kafka will not follow the order of timestamps as stored in the Kafka partitions
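One common mitigation for out-of-order reads, sketched under the assumption that each record carries a key and an event timestamp: reduce every micro-batch to the newest record per key before upserting, so within a batch only the latest version reaches the database. The Record type is a placeholder.

import org.apache.spark.api.java.JavaPairRDD;

public class LatestPerKey {
    // placeholder record type carrying an event-time field
    public static class Record implements java.io.Serializable {
        public final long eventTime;
        public final String payload;
        public Record(long eventTime, String payload) {
            this.eventTime = eventTime;
            this.payload = payload;
        }
    }

    // keep only the newest record per key within one micro-batch
    public static JavaPairRDD<String, Record> latestPerKey(JavaPairRDD<String, Record> batch) {
        return batch.reduceByKey((a, b) -> a.eventTime >= b.eventTime ? a : b);
    }
}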

Making spark streaming application single threaded

2018-05-09 Thread ravidspark
Hi All, Is there any property that makes my Spark Streaming application single-threaded? I looked into the property *spark.dynamicAllocation.maxExecutors=1*, but as far as I understand this launches at most one container, not a single thread. In local mode, we can configure the
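A minimal sketch of the settings that constrain parallelism to one task at a time; the values are examples, not the poster's configuration. In local mode the master URL fixes the number of worker threads, while on a cluster one executor with one core gives a single task slot.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SingleSlotConf {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("single-slot-streaming")
                .setMaster("local[1]")                          // local mode: one worker thread
                .set("spark.executor.instances", "1")           // cluster mode: one executor...
                .set("spark.executor.cores", "1")               // ...with one core, i.e. one task slot
                .set("spark.dynamicAllocation.enabled", "false");
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.stop();
    }
}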

Best place to persist offsets into Zookeeper

2018-05-07 Thread ravidspark
Hi All, I have the below problem in Spark Kafka streaming. *Environment:* Spark 2.2.0. *Problem:* We have written our own logic for offset management in ZooKeeper when streaming data with Spark + Kafka. Everything is working fine and we are able to control the offset commits to ZooKeeper during
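A sketch of the usual placement, assuming the spark-streaming-kafka-0-10 direct stream: capture the offset ranges at the top of foreachRDD, write the batch, and persist the offsets only after the write succeeds, so a failed batch is reprocessed rather than skipped. saveToZookeeper() is a hypothetical helper standing in for the poster's own offset-store logic.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class OffsetCommitPlacement {
    public static void wire(JavaInputDStream<ConsumerRecord<String, String>> stream) {
        stream.foreachRDD(rdd -> {
            // the cast works on the RDD produced directly by the direct stream (before any shuffle)
            OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            // 1) write the batch to the downstream system
            rdd.foreachPartition(records -> { /* write records downstream */ });
            // 2) persist offsets only after the write has succeeded
            saveToZookeeper(ranges);
        });
    }

    // hypothetical helper standing in for the custom ZooKeeper persistence code
    private static void saveToZookeeper(OffsetRange[] ranges) {
    }
}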

How can I launch a thread in the background on all worker nodes before the data processing actually starts?

2018-03-15 Thread ravidspark
*Environment:* Spark 2.2.0 *Kafka:* 0.10.0 *Language:* Java *Use-Case:* Streaming data from Kafka using JavaDStreams and storing it in a downstream database. *Issue:* I have a use case wherein I have to launch a thread in the background that would connect to a DB and cache the retrieved
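Spark 2.2 has no obvious "run this on every executor at startup" hook for DStreams, so a commonly used workaround, sketched here with placeholder types, is a lazily initialised singleton that each executor JVM builds (and that starts its background refresh thread) the first time a task on that executor touches it, e.g. from within mapPartitions or foreachPartition.

public class DbCacheHolder {
    private static volatile DbCache cache;

    public static DbCache get() {
        if (cache == null) {
            synchronized (DbCacheHolder.class) {
                if (cache == null) {
                    // runs once per executor JVM, on first use
                    cache = new DbCache();
                    Thread refresher = new Thread(cache::refreshLoop, "db-cache-refresher");
                    refresher.setDaemon(true);   // background thread keeping the cache warm
                    refresher.start();
                }
            }
        }
        return cache;
    }

    // placeholder cache; the real one would connect to the DB and load the data
    public static class DbCache {
        void refreshLoop() { /* periodically re-query the DB */ }
    }
}

Calling DbCacheHolder.get() at the top of each partition function is enough to ensure the cache (and its background thread) exists on every executor that processes data.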

Spark with Kudu behaving unexpectedly when bringing down the Kudu Service

2018-02-23 Thread ravidspark
Hi All, I am trying to read data from Kafka and ingest it into Kudu using Spark Streaming. I am not using KuduContext to perform the upsert operation into Kudu. Instead, I am using Kudu's native client API to build the PartialRow and apply the operation for every record from Kafka. I am able to run the
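A hedged sketch of a single upsert through the native Kudu Java client (not the poster's code); the master address, table name, and column names are placeholders.

import org.apache.kudu.client.*;

public class KuduUpsertSketch {
    public static void main(String[] args) throws KuduException {
        KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
        try {
            KuduTable table = client.openTable("my_table");
            KuduSession session = client.newSession();
            session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);

            Upsert upsert = table.newUpsert();
            PartialRow row = upsert.getRow();          // build the row column by column
            row.addString("id", "key-1");
            row.addLong("updated_at", System.currentTimeMillis());
            session.apply(upsert);

            session.flush();                           // per-row errors surface here
            session.close();
        } finally {
            client.close();
        }
    }
}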