Spark Streaming fails with unable to get records after polling for 512 ms

2017-11-14 Thread jkagitala
Hi, I'm trying to add Spark Streaming to our Kafka topic, but I keep getting this error: java.lang.AssertionError: assertion failed: Failed to get record after polling for 512 ms. I tried setting different params like max.poll.interval.ms and spark.streaming.kafka.consumer.poll.ms to 1ms in…
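A minimal sketch (not from the thread; the broker, group id, topic, and the 10-second timeout are placeholder values) of raising spark.streaming.kafka.consumer.poll.ms, whose 512 ms default is the figure in the assertion, for the Kafka 0-10 direct stream:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public class PollTimeoutExample {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf()
                    .setAppName("poll-timeout-example")
                    // Give the cached consumer more time before the
                    // "Failed to get record after polling" assertion fires.
                    .set("spark.streaming.kafka.consumer.poll.ms", "10000");
            JavaStreamingContext jssc =
                    new JavaStreamingContext(conf, Durations.seconds(5));

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "broker1:9092"); // placeholder
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "example-group");         // placeholder

            JavaInputDStream<ConsumerRecord<String, String>> stream =
                    KafkaUtils.createDirectStream(
                            jssc,
                            LocationStrategies.PreferConsistent(),
                            ConsumerStrategies.<String, String>Subscribe(
                                    Collections.singletonList("my-topic"),
                                    kafkaParams));

            stream.count().print(); // any output operation, so the job runs
            jssc.start();
            jssc.awaitTermination();
        }
    }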

Executor not getting added SparkUI & Spark Eventlog in deploymode:cluster

2017-11-14 Thread Mamillapalli, Purna Pradeep
Hi all, I'm performing a spark-submit using the Spark REST API POST operation on port 6066 with the config below. Launch Command: "/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.141-1.b16.el7_3.x86_64/jre/bin/java" "-cp" "/usr/local/spark/conf/:/usr/local/spark/jars/*" "-Xmx4096M"…
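For context, a hedged sketch of the kind of REST submission being described; the /v1/submissions/create endpoint and CreateSubmissionRequest payload belong to the standalone master's undocumented REST protocol, and the host, jar path, main class, and Spark version below are placeholders:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class RestSubmit {
        public static void main(String[] args) throws Exception {
            // CreateSubmissionRequest body for cluster-mode submission.
            String body =
                "{\n" +
                "  \"action\": \"CreateSubmissionRequest\",\n" +
                "  \"clientSparkVersion\": \"2.2.0\",\n" +
                "  \"appResource\": \"hdfs:///apps/my-app.jar\",\n" +
                "  \"mainClass\": \"com.example.Main\",\n" +
                "  \"appArgs\": [],\n" +
                "  \"environmentVariables\": {\"SPARK_ENV_LOADED\": \"1\"},\n" +
                "  \"sparkProperties\": {\n" +
                "    \"spark.app.name\": \"my-app\",\n" +
                "    \"spark.submit.deployMode\": \"cluster\",\n" +
                "    \"spark.master\": \"spark://spark-master:7077\"\n" +
                "  }\n" +
                "}";
            URL url = new URL("http://spark-master:6066/v1/submissions/create");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream os = conn.getOutputStream()) {
                os.write(body.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }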

Re: Process large JSON file without causing OOM

2017-11-14 Thread Alec Swan
Thanks all. I am not submitting a Spark job explicitly. Instead, I am using the Spark library functionality embedded in my web service, as shown in the code I included in the previous email. So, effectively, Spark SQL runs in the web service's JVM. Therefore, the --driver-memory option would not (and…
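A minimal sketch of the embedded setup being described (the service class and path are invented for illustration): because the driver is the web service's JVM, heap comes from the -Xmx the service was launched with, and setting spark.driver.memory after startup is too late.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class EmbeddedSparkService {
        // The SparkSession runs inside this service's JVM, so driver heap is
        // whatever -Xmx the service process was started with; --driver-memory
        // and spark.driver.memory have no effect on an already-running JVM.
        private final SparkSession spark = SparkSession.builder()
                .appName("embedded-json-service")
                .master("local[*]")
                .getOrCreate();

        public long countRecords(String jsonPath) {
            Dataset<Row> df = spark.read().json(jsonPath);
            return df.count();
        }
    }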

Re: Use of Accumulators

2017-11-14 Thread Kedarnath Dixit
Yes! Thanks! ~Kedar Dixit, Bigdata Analytics at Persistent Systems Ltd.

Re: Use of Accumulators

2017-11-14 Thread Holden Karau
And where do you want to read the toggle back from? On the driver? On Tue, Nov 14, 2017 at 3:52 AM Kedarnath Dixit <kedarnath_di...@persistent.com> wrote: > Hi, inside the transformation, if there is any change to the variable's associated data, I want to just toggle it, saying there…
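A hedged sketch of the pattern under discussion, using a built-in LongAccumulator as the toggle (names and sample data are invented); updates made inside transformations may be re-applied on task retries, so read the value as "non-zero means something changed" rather than an exact count:

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.util.LongAccumulator;

    public class AccumulatorToggle {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("accumulator-toggle")
                    .master("local[*]")
                    .getOrCreate();
            JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

            LongAccumulator changed = jsc.sc().longAccumulator("dataChanged");

            long n = jsc.parallelize(Arrays.asList(1, 2, 3, 4))
                    .map(x -> {
                        if (x % 2 == 0) {   // stand-in for "data changed"
                            changed.add(1); // toggle the flag
                        }
                        return x;
                    })
                    .count();               // an action forces the updates

            // Read the toggle back on the driver, as Holden asks about.
            System.out.println("changed? " + (changed.value() > 0));
            spark.stop();
        }
    }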

Spark Structured Streaming + Kafka

2017-11-14 Thread Agostino Calamita
Hi, I have a problem with Structured Streaming and Kafka. I have 2 brokers and a topic with 8 partitions and replication factor 2. This is my driver program: public static void main(String[] args) { SparkSession spark = SparkSession.builder()…
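A minimal completion of the truncated driver (broker addresses and topic name are placeholders, not the poster's), reading the topic with Structured Streaming and echoing rows to the console:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    public class StructuredKafka {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("structured-kafka")
                    .getOrCreate();

            // Source: the 8-partition topic, replicated across both brokers.
            Dataset<Row> df = spark.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
                    .option("subscribe", "my-topic")
                    .load();

            // Sink: print decoded key/value pairs as each micro-batch arrives.
            StreamingQuery query = df
                    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
                    .writeStream()
                    .format("console")
                    .start();

            query.awaitTermination();
        }
    }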

Re: Measuring cluster utilization of a streaming job

2017-11-14 Thread Teemu Heikkilä
Without knowing anything about your pipeline, the best estimate of the resources needed is to run the job with the same ingestion rate as the normal production load. With Kafka you can enable backpressure, so under high load your latency will just increase, but you don't have to have capacity for…
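A hedged sketch of the backpressure settings referred to, for the DStream-based Kafka integration (the rate values are illustrative, not recommendations):

    import org.apache.spark.SparkConf;

    public class BackpressureConf {
        public static SparkConf build() {
            return new SparkConf()
                    .setAppName("backpressure-example")
                    // Let Spark adapt the ingestion rate to batch processing time.
                    .set("spark.streaming.backpressure.enabled", "true")
                    // Cap the very first batch, before the rate estimator kicks in.
                    .set("spark.streaming.backpressure.initialRate", "1000")
                    // Hard ceiling per Kafka partition per second (direct stream).
                    .set("spark.streaming.kafka.maxRatePerPartition", "5000");
        }
    }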

Measuring cluster utilization of a streaming job

2017-11-14 Thread Nadeem Lalani
Hi, I was wondering if anyone has done some work around measuring the cluster resource utilization of a "typical" Spark streaming job. We are trying to build a message ingestion system which will read from Kafka and do some processing. We have had some concerns raised in the team that a 24*7…

RE: Use of Accumulators

2017-11-14 Thread Kedarnath Dixit
Hi, Inside the transformation, if there is any change to the variable's associated data, I want to just toggle it to say there is some change while processing the data. Please let me know if we can do this at runtime. Thanks! ~Kedar Dixit, Bigdata Analytics at Persistent Systems Ltd.