[Spark Streaming] - ERROR Error cleaning broadcast Exception

2017-07-11 Thread Nipun Arora
Hi All, I get the following error while running my Spark Streaming application. We have a large application running multiple stateful (with mapWithState) and stateless operations. It's getting difficult to isolate the error since Spark itself hangs, and the only error we see is in the Spark log

Kafka + Spark Streaming consumer API offsets

2017-06-05 Thread Nipun Arora
I need some clarification for Kafka consumers in Spark or otherwise. I have the following Kafka Consumer. The consumer is reading from a topic, and I have a mechanism which blocks the consumer from time to time. The producer is a separate thread which is continuously sending data. I want to
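The thread is truncated here, but for reference, a minimal sketch of how offsets are usually handled with Spark's direct Kafka integration (spark-streaming-kafka-0-10). The stream, topic, and processing below are placeholders; the commit-after-processing pattern follows the Kafka integration guide, so a blocked consumer resumes from the last committed position.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.CanCommitOffsets;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class OffsetCommitSketch {
    // `stream` is assumed to be a direct stream created with KafkaUtils.createDirectStream(...)
    static void commitAfterProcessing(JavaInputDStream<ConsumerRecord<String, String>> stream) {
        stream.foreachRDD(rdd -> {
            // Offset ranges exist only on RDDs produced directly by the Kafka stream.
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

            rdd.foreach(record -> {
                // process each ConsumerRecord here
            });

            // Commit back to Kafka only after the batch has been processed.
            ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
        });
    }
}
```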

Re: [Spark Streaming] DAG Output Processing mechanism

2017-05-29 Thread Nipun Arora
Sending out the message again... Hopefully someone can clarify :) I would like some clarification on the execution model for spark streaming. Broadly, I am trying to understand if output operations in a DAG are only processed after all intermediate operations are finished for all parts of the

Re: [Spark Streaming] DAG Output Processing mechanism

2017-05-28 Thread Nipun Arora
Will kafka output 1 and kafka output 2 wait for all operations to finish on B and C before sending an output, or will they simply send an output as soon as the ops in B and C are done? What kind of synchronization guarantees are there? On Sun, May 28, 2017 at 9:59 AM, Nipun Arora <nipunarora

[Spark Streaming] DAG Output Processing mechanism

2017-05-28 Thread Nipun Arora
I would like some clarification on the execution model for spark streaming. Broadly, I am trying to understand if output operations in a DAG are only processed after all intermediate operations are finished for all parts of the DAG. Let me give an example: I have a
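For reference, a minimal sketch of the branching DAG being asked about (source, operations, and sinks are placeholders). With the DStream API, each registered output operation becomes its own job for a given batch, and those jobs are run one after another in the order they were defined unless spark.streaming.concurrentJobs is raised above its default of 1.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DagOrderingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("dag-ordering").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        JavaDStream<String> a = jssc.socketTextStream("localhost", 9999);   // stream A (placeholder source)

        JavaDStream<String> b = a.filter(line -> line.contains("ERROR"));   // branch B
        JavaDStream<String> c = a.map(String::toUpperCase);                 // branch C

        // Each foreachRDD registers a separate output operation. For a given batch,
        // these become separate jobs that run one after another, in this order,
        // unless spark.streaming.concurrentJobs is raised above its default of 1.
        b.foreachRDD(rdd -> rdd.foreach(line -> { /* e.g. send to kafka output 1 */ }));
        c.foreachRDD(rdd -> rdd.foreach(line -> { /* e.g. send to kafka output 2 */ }));

        jssc.start();
        jssc.awaitTermination();
    }
}
```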

[Spark Streaming] DAG Execution Model Clarification

2017-05-26 Thread Nipun Arora
Hi, I would like some clarification on the execution model for spark streaming. Broadly, I am trying to understand if output operations in a DAG are only processed after all intermediate operations are finished for all parts of the DAG. Let me give an example: I have a dstream A, I do map

SparkAppHandle - get Input and output streams

2017-05-18 Thread Nipun Arora
Hi, I wanted to know how to get the input and output streams from SparkAppHandle? I start the application like the following: SparkAppHandle sparkAppHandle = sparkLauncher.startApplication(); I have used the following previously to capture inputstream from error and output streams, but I would
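SparkAppHandle itself does not expose the child process streams. A minimal sketch of two common alternatives, with placeholder paths and class names: use launch(), which returns a plain Process whose streams can be read, or redirect the output before calling startApplication().

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.spark.launcher.SparkLauncher;

public class LauncherStreamsSketch {
    public static void main(String[] args) throws Exception {
        SparkLauncher launcher = new SparkLauncher()
                .setAppResource("/path/to/your-app.jar")   // placeholder paths
                .setMainClass("com.example.YourMain")
                .setMaster("local[*]");

        // Option 1: launch() returns a plain java.lang.Process, whose stdout can be
        // read directly (stderr is available via spark.getErrorStream() in the same way).
        Process spark = launcher.launch();
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(spark.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println("[spark stdout] " + line);
            }
        }
        spark.waitFor();

        // Option 2: keep startApplication() and redirect the child process output
        // before starting, e.g. launcher.redirectToLog("my.spark.app") or
        // launcher.redirectOutput(new java.io.File("spark-out.log")).
    }
}
```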

Spark Launch programmatically - Basics!

2017-05-17 Thread Nipun Arora
Hi, I am trying to get a simple spark application to run programmatically. I looked at http://spark.apache.org/docs/2.1.0/api/java/index.html?org/apache/spark/launcher/package-summary.html, at the following code. public class MyLauncher { public static void main(String[] args) throws
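For reference, a minimal self-contained version of the launcher example from that page, with placeholder paths and class names:

```java
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class MyLauncherSketch {
    public static void main(String[] args) throws Exception {
        SparkAppHandle handle = new SparkLauncher()
                .setSparkHome("/opt/spark")                 // placeholder locations
                .setAppResource("/path/to/your-app.jar")
                .setMainClass("com.example.YourMain")
                .setMaster("local[*]")
                .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
                .startApplication();

        // startApplication() returns immediately; poll or register a listener
        // to follow the application's state.
        while (!handle.getState().isFinal()) {
            Thread.sleep(1000);
            System.out.println("State: " + handle.getState());
        }
    }
}
```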

Re: Restful API Spark Application

2017-05-15 Thread Nipun Arora
> Hi Nipun > Have you checked out the job server > https://github.com/spark-jobserver/spark-jobserver > Regards > Sam > On Fri, 12 May 2017 at 21:00, Nipun Arora <nipunarora2...@gmail.com> wrote:

Restful API Spark Application

2017-05-12 Thread Nipun Arora
Hi, We have written a java spark application (primarily uses spark sql). We want to expand this to provide our application "as a service". For this, we are trying to write a REST API. A simple REST API can be made easily, and I can get Spark to run through the launcher. I wonder, though, how the
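Besides existing projects such as spark-jobserver (mentioned in the reply above), a very rough sketch of wrapping SparkLauncher behind an HTTP endpoint could look like the following. The JDK HttpServer, paths, and master are placeholders, and a real service would track job ids and handle state and errors.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class SubmitServiceSketch {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        // POST /submit launches the (already packaged) Spark application and
        // returns immediately; a real service would track handles per job id.
        server.createContext("/submit", exchange -> {
            SparkAppHandle handle = new SparkLauncher()
                    .setAppResource("/path/to/your-spark-sql-app.jar")  // placeholder
                    .setMainClass("com.example.YourMain")
                    .setMaster("yarn")                                   // placeholder master
                    .startApplication();

            byte[] body = ("submitted, state=" + handle.getState()).getBytes();
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });

        server.start();
    }
}
```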

[Spark Streaming] Dynamic Broadcast Variable Update

2017-05-02 Thread Nipun Arora
Hi All, To support our Spark Streaming based anomaly detection tool, we have made a patch in Spark 1.6.2 to dynamically update broadcast variables. I'll first explain our use-case, which I believe should be common to several people using Spark Streaming applications. Broadcast variables are
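The patch itself is not shown in this preview; for context, a commonly used workaround without patching Spark is to re-broadcast from the driver and unpersist the old value. A minimal sketch (model type and names are placeholders), not the poster's patch:

```java
import java.util.Map;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class RefreshableBroadcast {
    private static volatile Broadcast<Map<String, String>> current;  // hypothetical model type

    // Called on the driver (e.g. from a timer, or at the top of a transform/foreachRDD,
    // which also run on the driver) to swap in a new model.
    public static synchronized void update(JavaSparkContext sc, Map<String, String> newModel) {
        if (current != null) {
            current.unpersist();            // drop the old copies on the executors
        }
        current = sc.broadcast(newModel);   // re-broadcast the fresh value
    }

    public static Broadcast<Map<String, String>> get() {
        return current;
    }
}
```

update() and get() are meant to be called on the driver; the returned Broadcast is then captured by the closure, and executors read it through value().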

Re: Resource Leak in Spark Streaming

2017-02-07 Thread Nipun Arora
ient objects are not persistent. How can we go around this problem? Thanks, Nipun On Tue, Feb 7, 2017 at 6:35 PM, Nipun Arora <nipunarora2...@gmail.com> wrote: > Ryan, > > Apologies for coming back so late, I created a github repo to resolve > this problem. On trying you

Re: Resource Leak in Spark Streaming

2017-02-07 Thread Nipun Arora
Ryan, Apologies for coming back so late, I created a github repo to resolve this problem. On trying your solution for making the pool a Singleton, I get a null pointer exception in the worker. Do you have any other suggestions, or a simpler mechanism for handling this? I have put all the current
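The repository referenced above is not shown here. As a general sketch, a null pointer in the worker typically happens when the singleton is initialized on the driver and never on the executor JVM. One hedged alternative (broker address and serializers are placeholders) is to create the producer lazily on first use inside each executor:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class LazyKafkaProducerHolder {
    // One producer per executor JVM, created lazily on first use so nothing
    // has to be serialized from the driver.
    private static KafkaProducer<String, String> producer;

    public static synchronized KafkaProducer<String, String> get() {
        if (producer == null) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker:9092");  // placeholder
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            producer = new KafkaProducer<>(props);
            // Close the producer when the executor JVM shuts down.
            Runtime.getRuntime().addShutdownHook(new Thread(() -> producer.close()));
        }
        return producer;
    }
}
```

The producer would then be obtained inside foreachPartition (e.g. LazyKafkaProducerHolder.get()) rather than being constructed in the driver and serialized with the closure.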

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Nipun Arora
Just to be clear, the pool object creation happens in the driver code, and not in any anonymous function that would be executed in the executor. On Tue, Jan 31, 2017 at 10:21 PM Nipun Arora <nipunarora2...@gmail.com> wrote: > Thanks for the suggestion Ryan, I will convert it to

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Nipun Arora
, 2017 at 6:09 PM Shixiong(Ryan) Zhu <shixi...@databricks.com> wrote: > Looks like you create KafkaProducerPool in the driver. So when the task is > running in the executor, it will always see a new empty KafkaProducerPool > and create KafkaProducers. But nobody closes the

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Nipun Arora
des. It's also possible that it leaks > resources. > > On Tue, Jan 31, 2017 at 2:12 PM, Nipun Arora <nipunarora2...@gmail.com> > wrote: > > It is spark 1.6 > > Thanks > Nipun > > On Tue, Jan 31, 2017 at 1:45 PM Shixiong(Ryan) Zhu < > shixi...@databricks

Re: Resource Leak in Spark Streaming

2017-01-31 Thread Nipun Arora
It is spark 1.6 Thanks Nipun On Tue, Jan 31, 2017 at 1:45 PM Shixiong(Ryan) Zhu <shixi...@databricks.com> wrote: > Could you provide your Spark version please? > > On Tue, Jan 31, 2017 at 10:37 AM, Nipun Arora <nipunarora2...@gmail.com> > wrote: > > Hi,

Resource Leak in Spark Streaming

2017-01-31 Thread Nipun Arora
Hi, I get a resource leak, where the number of file descriptors in spark streaming keeps increasing. We eventually end up with a "too many open files" error through an exception caused in JAVARDDKafkaWriter, which is writing a spark JavaDStream. The exception is attached inline. Any help will be

[SparkStreaming] SparkStreaming not allowing to do parallelize within a transform operation to generate a new RDD

2017-01-19 Thread Nipun Arora
help. Thanks Nipun -- Forwarded message ----- From: Nipun Arora <nipunarora2...@gmail.com> Date: Wed, Jan 18, 2017 at 5:25 PM Subject: [SparkStreaming] SparkStreaming not allowing to do parallelize within a transform operation to generate a new RDD To: user <user@spark.a

[SparkStreaming] SparkStreaming not allowing to do parallelize within a transform operation to generate a new RDD

2017-01-18 Thread Nipun Arora
Hi, I am trying to transform an RDD in a dstream by finding the log line with the maximum timestamp and adding a duplicate copy of it with some modifications. The following is the example code: JavaDStream logMessageWithHB = logMessageMatched.transform(new Function() {

SparkStreaming add Max Line of each RDD throwing an exception

2017-01-18 Thread Nipun Arora
Please note: I have asked the following question in stackoverflow as well http://stackoverflow.com/questions/41729451/adding-to-spark-streaming-dstream-rdd-the-max-line-of-each-rdd I am trying to add to each RDD in a JavaDStream the line with the maximum timestamp, with some modification.
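The full code is in the linked stackoverflow question; as a hedged sketch of the general approach (timestamp parsing, the modification, and names are placeholders), the extra line can be built inside transform, which runs on the driver, and unioned back into the batch's RDD:

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.Comparator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.api.java.JavaDStream;

public class MaxLineSketch {
    // Comparator must be Serializable because it is shipped to the executors.
    static class TimestampComparator implements Comparator<String>, Serializable {
        @Override
        public int compare(String a, String b) {
            return Long.compare(extractTimestamp(a), extractTimestamp(b));
        }
    }

    static long extractTimestamp(String line) {
        return Long.parseLong(line.split(",")[0]);  // placeholder parsing
    }

    static JavaDStream<String> appendMaxLine(JavaDStream<String> logs) {
        return logs.transform(rdd -> {
            if (rdd.isEmpty()) {
                return rdd;  // nothing to append for an empty micro-batch
            }
            // transform's function runs on the driver, so the RDD's own context
            // can be used here to parallelize the extra record.
            String maxLine = rdd.max(new TimestampComparator());
            String duplicate = maxLine + ",HB";  // placeholder modification
            JavaSparkContext sc = JavaSparkContext.fromSparkContext(rdd.context());
            JavaRDD<String> extra = sc.parallelize(Collections.singletonList(duplicate));
            return rdd.union(extra);
        });
    }
}
```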

Re: Kafka + Spark 1.3 Integration

2016-02-11 Thread Nipun Arora
/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java > > > On Wed, Feb 10, 2016 at 1:28 PM, Nipun Arora <nipunarora2...@gmail.com> > wrote: > >> Hi, >> >> I am trying some basic integration and wa

Re: [Spark Streaming] Spark Streaming dropping last lines

2016-02-10 Thread Nipun Arora
not getting any data, i.e. the file end has probably been reached by printing the first two lines of every micro-batch. Thanks Nipun On Mon, Feb 8, 2016 at 10:05 PM Nipun Arora <nipunarora2...@gmail.com> wrote: > I have a spark-streaming service, where I am processing and detecting &g

Kafka + Spark 1.3 Integration

2016-02-10 Thread Nipun Arora
Hi, I am trying some basic integration and was going through the manual. I would like to read from a topic, and get a JavaReceiverInputDStream for messages in that topic. However the example is of JavaPairReceiverInputDStream<>. How do I get a stream for only a single topic in Java? Reference
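For reference, a minimal sketch using the receiver-based API from that manual (ZooKeeper quorum, group id, and topic name are placeholders): createStream returns key/value pairs, so keeping only the value gives a stream for the single topic.

```java
import java.util.Collections;
import java.util.Map;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class SingleTopicSketch {
    static JavaDStream<String> messages(JavaStreamingContext jssc) {
        // One topic, one receiver thread (placeholder connection settings).
        Map<String, Integer> topics = Collections.singletonMap("my-topic", 1);
        JavaPairReceiverInputDStream<String, String> pairs =
                KafkaUtils.createStream(jssc, "zkhost:2181", "my-consumer-group", topics);

        // The key is the Kafka message key (often null); keeping only the value
        // gives a plain JavaDStream<String> of messages for that topic.
        return pairs.map(kv -> kv._2());
    }
}
```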

[Spark Streaming] Spark Streaming dropping last lines

2016-02-08 Thread Nipun Arora
I have a spark-streaming service, where I am processing and detecting anomalies on the basis of some offline generated model. I feed data into this service from a log file, which is streamed using the following command: tail -f | nc -lk. Here the spark streaming service is taking data from

Re: [SPARK STREAMING] Questions regarding foreachPartition

2015-11-17 Thread Nipun Arora
ion basis, not global ordering. > > You typically want to acquire resources inside the foreachpartition > closure, just before handling the iterator. > > > http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd > > On Mon, Nov

[SPARK STREAMING] Questions regarding foreachPartition

2015-11-16 Thread Nipun Arora
Hi, I wanted to understand the foreachPartition logic. In the code below, I am assuming the iterator is executing in a distributed fashion. 1. Assuming I have a stream which has timestamp data which is sorted. Will the string iterator in foreachPartition process each line in order? 2. Assuming I have

[SPARK STREAMING ] Sending data to ElasticSearch

2015-10-29 Thread Nipun Arora
Hi, I am sending data to an elasticsearch deployment. Printing to a file seems to work fine, but I keep getting a "no node found" error for ES when I send data to it. I suspect there is some special way to handle the connection object? Can anyone explain what should be changed here? Thanks Nipun The

[SPARK STREAMING] Concurrent operations in spark streaming

2015-10-24 Thread Nipun Arora
I wanted to understand something about the internals of spark streaming executions. If I have a stream X, and in my program I send stream X to function A and function B: 1. In function A, I do a few transform/filter operations etc. on X->Y->Z to create stream Z. Now I do a forEach Operation on Z

Re: [SPARK STREAMING] polling based operation instead of event based operation

2015-10-23 Thread Nipun Arora
the > list! > > Regards, > > Lars Albertsson > > > > On Thu, Oct 22, 2015 at 10:48 PM, Nipun Arora <nipunarora2...@gmail.com> > wrote: > > Hi, > > In general in spark stream one can do transformations ( filter, map > etc.) or > > output operati

Re: [Spark Streaming] Design Patterns forEachRDD

2015-10-22 Thread Nipun Arora
; customerWithPromotion.foreachPartition(); > } > }); > > > On 21-Oct-2015, at 10:55 AM, Nipun Arora <nipunarora2...@gmail.com> wrote: > > Hi All, > > Can anyone provide a design pattern for the following code shown in the > Spark User Manual, in JAVA ? I have t

[SPARK STREAMING] polling based operation instead of event based operation

2015-10-22 Thread Nipun Arora
Hi, In general in spark streaming one can do transformations (filter, map etc.) or output operations (collect, forEach) etc. in an event-driven paradigm... i.e. the action happens only if a message is received. Is it possible to do actions every few seconds in a polling based fashion, regardless if

[Spark Streaming] Design Patterns forEachRDD

2015-10-21 Thread Nipun Arora
Hi All, Can anyone provide a design pattern for the following code shown in the Spark User Manual, in JAVA ? I have the same exact use-case, and for some reason the design pattern for Java is missing. Scala version taken from :
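For reference, a hedged Java translation of that Scala design pattern from the streaming guide; ConnectionPool here is a placeholder stub standing in for whatever pooled client (JDBC, Kafka, HTTP...) is actually used.

```java
import java.util.Iterator;
import org.apache.spark.streaming.api.java.JavaDStream;

public class ForeachRddPatternSketch {

    // Placeholder pool; stands in for the real pooled client.
    static class ConnectionPool {
        static Connection getConnection() { return new Connection(); }
        static void returnConnection(Connection c) { /* return to pool for reuse */ }
        static class Connection {
            void send(String record) { /* write the record to the external system */ }
        }
    }

    static void sendToExternalSystem(JavaDStream<String> dstream) {
        dstream.foreachRDD(rdd ->
            rdd.foreachPartition((Iterator<String> partitionOfRecords) -> {
                // Acquire the connection inside foreachPartition so it is created
                // on the executor, once per partition, not serialized from the driver.
                ConnectionPool.Connection connection = ConnectionPool.getConnection();
                while (partitionOfRecords.hasNext()) {
                    connection.send(partitionOfRecords.next());
                }
                ConnectionPool.returnConnection(connection);
            })
        );
    }
}
```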

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-23 Thread Nipun Arora
This prevents these artifacts from being included in the assembly JARs. See scope https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope On Mon, Jun 22, 2015 at 10:28 AM, Nipun Arora nipunarora2...@gmail.com wrote: Hi Tathagata, I am attaching

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
btw. just for reference I have added the code in a gist: https://gist.github.com/nipunarora/ed987e45028250248edc and a stackoverflow reference here: http://stackoverflow.com/questions/31006490/broadcast-variable-null-pointer-exception-in-spark-streaming On Tue, Jun 23, 2015 at 11:01 AM, Nipun

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
checkpointing? I had a similar issue when recreating a streaming context from checkpoint as broadcast variables are not checkpointed. On 23 Jun 2015 5:01 pm, Nipun Arora nipunarora2...@gmail.com wrote: Hi, I have a spark streaming application where I need to access a model saved in a HashMap. I

[Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
Hi, I have a spark streaming application where I need to access a model saved in a HashMap. I have *no problems in running the same code with broadcast variables in the local installation.* However I get a *null pointer* *exception* when I deploy it on my spark test cluster. I have stored a

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Nipun Arora
I found the error, so just posting on the list. It seems broadcast variables cannot be declared static. If you do, you get a null pointer exception. Thanks Nipun On Tue, Jun 23, 2015 at 11:08 AM, Nipun Arora nipunarora2...@gmail.com wrote: btw. just for reference I have added the code in a gist
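For reference, a minimal sketch of the working (non-static) usage described above, with a placeholder model: the Broadcast is created on the driver, held in a local variable, and read with value() inside the closure.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BroadcastUsageSketch {
    static JavaDStream<String> classify(JavaStreamingContext jssc, JavaDStream<String> logs) {
        Map<String, String> model = new HashMap<>();  // placeholder model
        model.put("ERROR", "anomaly");

        // Keep the Broadcast in a local (non-static) variable; the reference is
        // captured by the closure and resolved with value() on the executors.
        final Broadcast<Map<String, String>> modelBc = jssc.sparkContext().broadcast(model);

        return logs.map(line -> {
            String label = modelBc.value().getOrDefault(line.split(" ")[0], "normal");
            return label + "\t" + line;
        });
    }
}
```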

Re: [Spark Streaming] Iterative programming on an ordered spark stream using Java?

2015-06-18 Thread Nipun Arora
dataset in memory, you can tweak the parameter of remember, such that it does checkpoint at appropriate time. Thanks Twinkle On Thursday, June 18, 2015, Nipun Arora nipunarora2...@gmail.com wrote: Hi All, I am updating my question so that I give more detail. I have also created

Re: [Spark Streaming] Iterative programming on an ordered spark stream using Java?

2015-06-18 Thread Nipun Arora
@Twinkle - what did you mean by Regarding not keeping whole dataset in memory, you can tweak the parameter of remember, such that it does checkpoint at appropriate time? On Thu, Jun 18, 2015 at 11:40 AM, Nipun Arora nipunarora2...@gmail.com wrote: Hi All, I appreciate the help :) Here

[Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-18 Thread Nipun Arora
Hi, I have the following piece of code, where I am trying to transform a spark stream and add the min and max of each RDD to it. However, I get an error at run time saying the max call does not exist (it compiles properly). I am using spark-1.4 I have added the question to stackoverflow as well:

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-18 Thread Nipun Arora
in the classpath so you do not have to bundle it. On Thu, Jun 18, 2015 at 8:44 AM, Nipun Arora nipunarora2...@gmail.com wrote: Hi, I have the following piece of code, where I am trying to transform a spark stream and add min and max to it of eachRDD. However, I get an error saying max call does

[Spark Streaming] Iterative programming on an ordered spark stream using Java?

2015-06-18 Thread Nipun Arora
in future iterations? I would like to keep some accumulated history to make calculations.. not the entire dataset, but persist certain events which can be used in future DStream RDDs? Thanks Nipun On Wed, Jun 17, 2015 at 11:15 PM, Nipun Arora nipunarora2...@gmail.com wrote: Hi Silvio, Thanks for your

[no subject]

2015-06-17 Thread Nipun Arora
Hi, Is there any way in spark streaming to keep data across multiple micro-batches? Like in a HashMap or something? Can anyone make suggestions on how to keep data across iterations where each iteration is an RDD being processed in JavaDStream? This is especially the case when I am trying to

Iterative Programming by keeping data across micro-batches in spark-streaming?

2015-06-17 Thread Nipun Arora
Hi, Is there any way in spark streaming to keep data across multiple micro-batches? Like in a HashMap or something? Can anyone make suggestions on how to keep data across iterations where each iteration is an RDD being processed in JavaDStream? This is especially the case when I am trying to

Re: Iterative Programming by keeping data across micro-batches in spark-streaming?

2015-06-17 Thread Nipun Arora
. Any suggestions? Thanks Nipun On Wed, Jun 17, 2015 at 10:52 PM, Silvio Fiorito silvio.fior...@granturing.com wrote: Hi, just answered in your other thread as well... Depending on your requirements, you can look at the updateStateByKey API From: Nipun Arora Date: Wednesday, June 17
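For reference, a minimal sketch of the updateStateByKey approach mentioned in that reply, keeping a running count per key across micro-batches (checkpointing must be enabled on the streaming context; in the Spark 1.x Java API the Optional type comes from Guava rather than org.apache.spark.api.java):

```java
import java.util.List;
import org.apache.spark.api.java.Optional;
import org.apache.spark.streaming.api.java.JavaPairDStream;

public class RunningStateSketch {
    // Keeps a running count per key across micro-batches; Spark persists the
    // state between batches.
    static JavaPairDStream<String, Integer> runningCounts(JavaPairDStream<String, Integer> pairs) {
        return pairs.updateStateByKey((List<Integer> newValues, Optional<Integer> state) -> {
            int sum = state.isPresent() ? state.get() : 0;  // previous state, if any
            for (int v : newValues) {
                sum += v;                                   // add this batch's values
            }
            return Optional.of(sum);                        // new state for the key
        });
    }
}
```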