Spark RDD DAG behaviour understanding in case of checkpointing

2016-01-23 Thread gaurav sharma
Hi Tathagata/Cody, I am facing a challenge in production with DAG behaviour during checkpointing in Spark Streaming - Step 1: Read data from Kafka every 15 min - call this KafkaStreamRDD, ~100 GB of data. Step 2: Repartition KafkaStreamRDD from 5 to 100 partitions to parallelise processing -
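
A minimal sketch of the two steps described above, assuming the spark-streaming-kafka 0.8 direct API; the broker address, topic name, and checkpoint directory are illustrative placeholders, not the poster's actual configuration:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object RepartitionSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("KafkaRepartitionSketch")
        // Step 1: one batch every 15 minutes, as in the description above.
        val ssc = new StreamingContext(conf, Minutes(15))
        ssc.checkpoint("/tmp/checkpoint") // hypothetical checkpoint directory

        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("someTopic"))

        // Step 2: widen from the source partition count to 100 partitions
        // so downstream stages get more parallelism.
        val repartitioned = stream.repartition(100)
        repartitioned.count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }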

Debug what is replication Level of which RDD

2016-01-23 Thread gaurav sharma
Hi All, I have enabled replication for my RDDs. I see the Storage tab of the Spark UI, which mentions the replication level, 2x or 1x. But the names given are MappedRDD, ShuffledRDD, so I am not able to debug which of my RDDs is 2x replicated and which one is 1x. Please help. Regards, Gaurav
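
The Storage tab defaults to the RDD class name, so naming each RDD before persisting makes the replication level attributable. A small sketch, assuming a SparkContext sc as in spark-shell; the path and names are illustrative:

    import org.apache.spark.storage.StorageLevel

    val parsed = sc.textFile("hdfs:///data/input") // hypothetical path
      .map(_.split(","))
      .setName("parsedInput")
      .persist(StorageLevel.MEMORY_ONLY_2)   // shows up as 2x replicated

    val keyed = parsed.map(a => (a(0), 1L))
      .reduceByKey(_ + _)
      .setName("countsByKey")
      .persist(StorageLevel.MEMORY_AND_DISK) // shows up as 1x replicated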

Re: Multiple Spark Streaming Jobs on Single Master

2015-10-24 Thread gaurav sharma
On Fri, Oct 23, 2015 at 8:36 AM, gaurav sharma <sharmagaura...@gmail.com> wrote: Hi, I created 2 workers on the same machine, each with 4 cores and 6GB RAM. I submitted the first job, and it allocated 2 cores on each of the worker proce

Re: Multiple Spark Streaming Jobs on Single Master

2015-10-23 Thread gaurav sharma
Hi, I created 2 workers on the same machine, each with 4 cores and 6GB RAM. I submitted the first job, and it allocated 2 cores on each of the worker processes and utilised the full 4 GB RAM for each executor process. When I submit my second job, it always stays in WAITING state. Cheers!! On Tue, Oct 20,
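
In standalone mode the first application grabs every available core by default, which is why the second one sits in WAITING. A hedged sketch of the usual fix, capping what each application may take; the values are illustrative for the 2x4-core, 6GB-per-worker setup above:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("firstJob")
      .set("spark.cores.max", "4")        // leave 4 of the 8 cores for the next app
      .set("spark.executor.memory", "3g") // leave headroom in each 6GB worker
    val sc = new org.apache.spark.SparkContext(conf)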

Re: Worker Machine running out of disk for Long running Streaming process

2015-09-15 Thread gaurav sharma
...@databricks.com> wrote: Could you periodically (say every 10 mins) run System.gc() on the driver. Cleaning up shuffles is tied to the garbage collection. On Fri, Aug 21, 2015 at 2:59 AM, gaurav sharma <sharmagaura...@gmail.com> w
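
A minimal sketch of that suggestion: a daemon thread on the driver that forces a GC every 10 minutes, so that shuffle-file cleanup (which is tied to garbage collection of the corresponding shuffle references) gets a chance to run:

    val gcThread = new Thread {
      override def run(): Unit = {
        while (true) {
          Thread.sleep(10 * 60 * 1000L) // every 10 minutes
          System.gc()
        }
      }
    }
    gcThread.setDaemon(true)
    gcThread.start()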

Worker Machine running out of disk for Long running Streaming process

2015-08-21 Thread gaurav sharma
Hi All, I have a 24x7 running streaming process, which runs on 2-hour windowed data. The issue I am facing is that my worker machines are running OUT OF DISK space. I checked that the SHUFFLE FILES are not getting cleaned up.

Re: Writing streaming data to cassandra creates duplicates

2015-08-04 Thread gaurav sharma
Ideally the 2 messages read from Kafka must differ on some parameter at least, or else they are logically the same. As a solution to your problem, if the message content is the same, you could create a new field UUID, which might play the role of partition key while inserting the 2 messages in Cassandra. Msg1
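
A sketch of that suggestion, assuming the spark-cassandra-connector; the keyspace, table, and column names are illustrative:

    import java.util.UUID
    import com.datastax.spark.connector._ // spark-cassandra-connector
    import org.apache.spark.rdd.RDD

    case class Event(id: UUID, payload: String)

    // Give every message a fresh UUID that participates in the primary key,
    // so two identical payloads land as two distinct rows instead of the
    // second upserting over the first.
    def writeWithIds(messages: RDD[String]): Unit = {
      val withIds = messages.map(m => Event(UUID.randomUUID(), m))
      withIds.saveToCassandra("ks", "events", SomeColumns("id", "payload"))
    }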

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread gaurav sharma
I have run into similar exceptions: ERROR DirectKafkaInputDStream: ArrayBuffer(java.net.SocketTimeoutException, org.apache.spark.SparkException: Couldn't find leader offsets for Set([AdServe,1])) and the issue has happened on the Kafka side, where my broker offsets go out of sync, or do not return
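
One hedged mitigation (an assumption, not stated in this truncated reply): when the broker no longer retains the requested offsets, resetting to the smallest retained offset lets the direct stream start cleanly. The broker address is a placeholder:

    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092",
      "auto.offset.reset"    -> "smallest" // start from the earliest retained offset
    )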

Re: createDirectStream and Stats

2015-07-12 Thread gaurav sharma
Hi guys, I too am facing a similar challenge with directstream. I have 3 Kafka partitions and am running Spark on 18 cores, with the parallelism level set to 48. I am running a simple map-reduce job on the incoming stream. Though the reduce stage takes milliseconds to seconds for around 15 million packets,
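
With the direct approach, each batch's RDD has exactly as many partitions as the Kafka topic (3 here), so only 3 tasks do the initial consuming regardless of the 18 cores; the parallelism setting of 48 only affects shuffled stages. A sketch of the common workaround, written against a hypothetical direct stream like the one in the earlier sketch:

    import org.apache.spark.streaming.dstream.DStream

    // kafkaStream stands for a direct stream whose RDDs have 3 partitions,
    // matching the topic. Widening to 48 lets all cores work on the map stage.
    def widen(kafkaStream: DStream[(String, String)]): DStream[(String, String)] =
      kafkaStream.repartition(48)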

Re: Worker dies with java.io.IOException: Stream closed

2015-07-12 Thread gaurav sharma
permission to write to /opt/ on that machine, as it's the one machine always throwing up. Thanks Best Regards On Sat, Jul 11, 2015 at 11:18 PM, gaurav sharma sharmagaura...@gmail.com wrote: Hi All, I am facing this issue in my production environment. My worker dies by throwing this exception

Worker dies with java.io.IOException: Stream closed

2015-07-11 Thread gaurav sharma
Hi All, I am facing this issue in my production environment. My worker dies by throwing this exception, but I see that space is available on all the partitions of my disk. I did NOT see any abrupt increase in Disk IO which might have choked the executor writing to the stderr file. But still

Re: How does one decide no of executors/cores/memory allocation?

2015-06-15 Thread gaurav sharma
When you submit a job, Spark breaks it down into stages, as per the DAG. The stages run transformations or actions on the RDDs. Each RDD consists of N partitions. The tasks created by Spark to execute a stage are equal in number to its partitions. Every task is executed on the cores utilized
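
A tiny illustration of that point, assuming a SparkContext sc (e.g. spark-shell): the stage computing this RDD runs exactly one task per partition.

    val rdd = sc.parallelize(1 to 1000000, 8)
    println(rdd.partitions.length)              // 8 partitions -> 8 tasks per stage
    val counts = rdd.map(_ % 10).countByValue() // the map plus action define the stage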

Re: How to pass arguments dynamically, that needs to be used in executors

2015-06-12 Thread gaurav sharma
. It will get shipped once to all nodes in the cluster and can be referenced by them. HTH. -Todd On Thu, Jun 11, 2015 at 8:23 AM, gaurav sharma sharmagaura...@gmail.com wrote: Hi, I am using a Kafka-Spark cluster for a real-time aggregation analytics use case in production. Cluster details
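
A sketch of the broadcast-variable pattern the reply describes, assuming a SparkContext sc; the lookup map is an illustrative stand-in for the dynamic arguments the question asks about:

    val events = sc.parallelize(Seq("short", "a much longer event string"))
    val threshold = sc.broadcast(Map("maxLen" -> 10)) // shipped once per node, not per task
    val flagged = events.map(e => (e, e.length > threshold.value("maxLen")))
    flagged.collect().foreach(println)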

Spark Streaming - Can i BIND Spark Executor to Kafka Partition Leader

2015-06-12 Thread gaurav sharma
Hi, I am using a Kafka-Spark cluster for a real-time aggregation analytics use case in production. *Cluster details* *6 nodes*, each node running 1 Spark and 1 Kafka process. Node1 - 1 Master, 1 Worker, 1 Driver, 1 Kafka process Node 2,3,4,5,6 - 1 Worker process each

How to pass arguments dynamically, that needs to be used in executors

2015-06-11 Thread gaurav sharma
Hi, I am using a Kafka-Spark cluster for a real-time aggregation analytics use case in production. Cluster details 6 nodes, each node running 1 Spark and 1 Kafka process. Node1 - 1 Master, 1 Worker, 1 Driver, 1 Kafka process Node 2,3,4,5,6 - 1 Worker process each