mapWithState termination

2017-05-31 Thread Dominik Safaric
Dear all, I would appreciate it if anyone could explain when mapWithState terminates, i.e. when it applies subsequent transformations such as writing the state to an external sink. Given a KafkaConsumer instance pulling messages from a Kafka topic, and a mapWithState transformation updating the state
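
A minimal sketch of the behaviour in question, assuming a simple count-per-key state and a stand-in socket source (the original setup uses a Kafka direct stream); mapWithState itself never terminates — it runs once per batch interval, and output operations such as writing state snapshots to a sink are applied at the end of each batch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

val conf = new SparkConf().setAppName("MapWithStateSketch")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("hdfs:///tmp/checkpoints") // mapWithState requires checkpointing

// Running count per key.
def update(key: String, value: Option[Long], state: State[Long]): (String, Long) = {
  val sum = value.getOrElse(0L) + state.getOption.getOrElse(0L)
  state.update(sum)
  (key, sum)
}

val pairs = ssc.socketTextStream("localhost", 9999).map((_, 1L)) // stand-in for Kafka
val stateStream = pairs.mapWithState(StateSpec.function(update _))

// Applied once per batch, after that batch's state updates have run:
stateStream.stateSnapshots().foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    // open a connection to the external sink here and write the snapshot
    iter.foreach { case (k, v) => println(s"$k -> $v") }
  }
}

ssc.start()
ssc.awaitTermination()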

Spark Streaming 2.1 recovery

2017-05-16 Thread Dominik Safaric
Hi, currently I am exploring Spark's fault tolerance capabilities in terms of fault recovery. Namely, I run a Spark 2.1 standalone cluster with a master and four worker nodes. The application pulls data using the Kafka direct stream API from a Kafka topic over a (sliding) window of time, and

Spark Streaming 2.1 - slave parallel recovery

2017-05-04 Thread Dominik Safaric
Hi all, I'm running a cluster consisting of a master and four slaves. The cluster runs a Spark application that reads data from a Kafka topic over a window of time, and writes the data back to Kafka. Checkpointing is enabled using HDFS. However, although Spark periodically commits checkpoints
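
For reference, a minimal sketch of the checkpoint-driven recovery setup being described, assuming a hypothetical HDFS checkpoint path; on a restart after driver failure, StreamingContext.getOrCreate rebuilds the context (DStream lineage and pending batches) from the checkpoint instead of calling the setup function:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///spark/checkpoints" // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoverySketch")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  // define the Kafka input stream, the window, and the output operation here
  ssc
}

// Cold start: createContext() runs; after a failure: recover from HDFS.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()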

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Dominik Safaric
mance. > > On Thu, Apr 27, 2017 at 11:33 AM, Dominik Safaric > <dominiksafa...@gmail.com> wrote: >> Indeed I have. But, even when storing the offsets in Spark and committing >> offsets upon completion of an output operation within the foreachRDD call >> (as pointed

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Dominik Safaric
docs/latest/streaming-kafka-0-10-integration.html#kafka-itself > > On Wed, Apr 26, 2017 at 1:17 PM, Dominik Safaric > <dominiksafa...@gmail.com> wrote: >> The reason why I want to obtain this information, i.e. <partition, offset, >> timestamp> tuples is to rela

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Dominik Safaric
ntegration.html#obtaining-offsets > > Timestamp isn't really a meaningful idea for a range of offsets. > > > On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric > <dominiksafa...@gmail.com> wrote: >> Hi all, >> >> Because the Spark Streaming direct Kafka consumer maps

Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-25 Thread Dominik Safaric
Hi all, Because the Spark Streaming direct Kafka consumer tracks offsets for a given Kafka topic and partition internally while enable.auto.commit is set to false, how can I retrieve the offset of each of the consumer's poll calls using the offset ranges of an RDD? More precisely, the
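
A sketch of what the 0-10 direct stream does expose, assuming an existing StreamingContext ssc and a hypothetical broker and topic name; offsets are available per batch (not per poll) through HasOffsetRanges, and can be committed manually once the output operation has completed:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092", // hypothetical broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "offset-inspection",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Set("mytopic"), kafkaParams))

stream.foreachRDD { rdd =>
  // Each batch RDD carries the offset ranges it covers.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic}-${o.partition}: ${o.fromOffset}..${o.untilOffset}")
  }
  // ... output operation ...
  // Commit only after the output has completed.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}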

Re: Streaming 2.1.0 - window vs. batch duration

2017-03-18 Thread Dominik Safaric
489765610 > 1489765611 > 1489765612 > 1489765613 > Window: > 1489765610 > 1489765611 > 1489765612 > 1489765613 > 1489765614 > Window: > 1489765611 > 1489765612 > 1489765613 > 1489765614 > 1489765615 > > On Thu, Mar 16, 2017 at 2:34 PM, Dominik Safaric

Re: Streaming 2.1.0 - window vs. batch duration

2017-03-18 Thread Dominik Safaric
Window: > 1489765608 > 1489765609 > 1489765610 > 1489765611 > 1489765612 > Window: > 1489765609 > 1489765610 > 1489765611 > 1489765612 > 1489765613 > Window: > 1489765610 > 1489765611 > 1489765612 > 1489765613 > 1489765614 > Window:

Streaming 2.1.0 - window vs. batch duration

2017-03-16 Thread Dominik Safaric
Hi all, I've implemented a streaming application pulling data from Kafka every 1 second (batch interval), and I am observing some quite strange behaviour (I haven't used Spark extensively in the past, but rather continuous operator based engines). Namely, the dstream.window(Seconds(60))
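
For illustration, a short sketch of the semantics at issue, assuming dstream is the Kafka-backed stream from the question: with a 1-second batch interval, window(Seconds(60)) emits one RDD per batch covering the previous 60 seconds, while an explicit slide duration controls how often the window fires:

import org.apache.spark.streaming.Seconds

// One windowed RDD per batch (every 1 s), each spanning the last 60 s:
val sliding = dstream.window(Seconds(60))

// One windowed RDD per minute, i.e. non-overlapping (tumbling) windows:
val tumbling = dstream.window(Seconds(60), Seconds(60))

sliding.count().print()  // fires every batch interval
tumbling.count().print() // fires once per window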

Re: Spark Streaming - java.lang.ClassNotFoundException Scala anonymous function

2017-03-01 Thread Dominik Safaric
> Spark classes that interfere. > > > On Wed, Mar 1, 2017, 14:20 Dominik Safaric <dominiksafa...@gmail.com > <mailto:dominiksafa...@gmail.com>> wrote: > I've been trying to submit a Spark Streaming application using spark-submit > to a cluster of mine consisting of a

Spark Streaming - java.lang.ClassNotFoundException Scala anonymous function

2017-03-01 Thread Dominik Safaric
I've been trying to submit a Spark Streaming application using spark-submit to a cluster of mine consisting of a master and two worker nodes. The application has been written in Scala and built using Maven. Importantly, the Maven build is configured to produce a fat JAR containing all

Spark Streaming - parallel recovery

2017-02-22 Thread Dominik Safaric
Hi, As I am investigating, among other things, the fault recovery capabilities of Spark, I've been curious: what source code artifact initiates the parallel recovery process? In addition, how is a faulty node detected (from the driver's point of view)? Thanks in advance, Dominik

Spark Streaming fault tolerance benchmark

2016-08-13 Thread Dominik Safaric
A few months ago, as part of an empirical research study, I started investigating several stream processing engines, including but not limited to Spark Streaming. As the benchmark should extend beyond performance metrics such as throughput and latency, I've focused on fault tolerance

RESOLVED - Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
roker the host.name it should advertise to the consumers and producers. By setting this property, I instantly started receiving Kafka log messages. Nevertheless, thank you all for your help, I appreciate it! > On 07 Jun 2016, at 17:44, Dominik Safaric <dominiksafa...@gmail.com> wrote:

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
you dealing with offsets ? > > Can you verify the offsets on the broker: > > kafka-run-class.sh kafka.tools.GetOffsetShell --topic --broker-list > --time -1 > > -Todd > > On Tue, Jun 7, 2016 at 8:17 AM, Dominik Safaric <dominiksafa...@gmail.com > <mailto:d

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
h 0.9.0.1 due to changes in the kafka clients between 0.8.2.2 > and 0.9.0.x. See this for more information: > > https://issues.apache.org/jira/browse/SPARK-12177 > <https://issues.apache.org/jira/browse/SPARK-12177> > > -Todd > > On Tue, Jun 7, 2016 at 7:35 AM,

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
ila.pl> wrote: > > Hi, > > What's the version of Spark? You're using Kafka 0.9.0.1, ain't you? What's > the topic name? > > Jacek > > On 7 Jun 2016 11:06 a.m., "Dominik Safaric" <dominiksafa...@gmail.com > <mailto:dominiksafa...@gmail.com

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
ww.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> > > http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> > > > On 7 June 2016 at 11:32, Dominik Safa

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
countByValueAndWindow = price.filter(_ > 95.0).countByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval)) > countByValueAndWindow.print() > // > ssc.start() > ssc.awaitTermination() > > HTH > > Dr Mich Talebzadeh > > LinkedIn > https://www.l

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
AcPCCdOABUrV8Pw > > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> > > http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> > > > On 7 June 2016 at 10:06, Dominik Safaric <dominiksafa...@gmail

Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
As I am trying to integrate Kafka into Spark, the following exception occurs: org.apache.spark.SparkException: java.nio.channels.ClosedChannelException org.apache.spark.SparkException: Couldn't find leader offsets for Set([**,0]) at
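
For context, a minimal sketch of the 0.8 direct-stream setup that produces this exception when the driver cannot reach the address the broker advertises (the eventual fix in this thread was setting the broker's host.name); the broker address and topic name below are hypothetical, and ssc is an existing StreamingContext:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// The leader-offset lookup inside createDirectStream contacts the brokers in
// metadata.broker.list; if the advertised address is unreachable, the call
// fails with ClosedChannelException / "Couldn't find leader offsets".
val kafkaParams = Map("metadata.broker.list" -> "broker-host:9092")
val topics = Set("mytopic")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)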

Spark Streaming - Travis CI and GitHub custom receiver - continuous data but empty RDD?

2016-03-05 Thread Dominik Safaric
Dear all, Lately, as part of a scientific research project, I've been developing an application that streams (or at least should stream) data from Travis CI and GitHub, using their REST APIs. The purpose of this is to gain insight into the commit-build relationship, in order to further perform numerous

Spark Streaming - processing/transforming DStreams using a custom Receiver

2016-02-25 Thread Dominik Safaric
Recently, I've implemented the following Receiver and custom Spark Streaming InputDStream using Scala: /** * The GitHubUtils object declares an interface consisting of overloaded createStream * functions. The createStream function takes as arguments the ctx : StreamingContext * passed by the
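
The original GitHubUtils code is only partially quoted above; as a stand-in, here is a minimal sketch of the custom-receiver pattern being described, with a hypothetical RestApiReceiver polling a REST endpoint and pushing records into Spark via store():

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver polling an external REST API; fetch() is a placeholder
// for an actual HTTP call against, e.g., the GitHub REST API.
class RestApiReceiver(endpoint: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    new Thread("rest-api-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          fetch(endpoint).foreach(store) // store() hands records to Spark
          Thread.sleep(1000)             // poll interval
        }
      }
    }.start()
  }

  override def onStop(): Unit = () // the polling loop exits via isStopped()

  private def fetch(url: String): Seq[String] = Seq.empty // placeholder
}

// Usage: the resulting DStream is transformed like any other, e.g.
// val events = ssc.receiverStream(new RestApiReceiver("https://api.github.com/events"))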