Re: Maelstrom: Kafka integration with Spark

2016-08-24 Thread Jeoffrey Lim
retry strategy in numerous places around Kafka-related routines). Not that I'm complaining or competing; at the end of the day, having a Spark app that continues to work overnight gives a developer a good night's sleep :) On Thu, Aug 25, 2016 at 3:23 AM, Jeoffrey Lim <jeoffr...@gmail.com> wrote:
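
A minimal sketch of such a retry wrapper, assuming a simple exponential backoff; the object name, parameters, and the commented usage call are illustrative, not Maelstrom's actual API:

    import scala.annotation.tailrec
    import scala.util.{Failure, Success, Try}

    object KafkaRetry {
      // Retry a flaky Kafka operation up to maxAttempts times,
      // doubling the backoff after every failure.
      @tailrec
      def withRetry[T](maxAttempts: Int, backoffMs: Long = 500L)(op: => T): T =
        Try(op) match {
          case Success(result) => result
          case Failure(_) if maxAttempts > 1 =>
            Thread.sleep(backoffMs)
            withRetry(maxAttempts - 1, backoffMs * 2)(op)
          case Failure(e) => throw e
        }
    }

    // Usage (hypothetical call): KafkaRetry.withRetry(5) { consumer.poll(100) }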

Re: Maelstrom: Kafka integration with Spark

2016-08-24 Thread Jeoffrey Lim
l can't achieve sub-millisecond end-to-end stream processing, so my guess is you need to be more specific about your terms there. I promise I'm not trying to start a pissing contest :) just wanted to check if you were aware of the current state of the other consumers. C

Re: Maelstrom: Kafka integration with Spark

2016-08-23 Thread Jeoffrey Lim
or Spark 1.3 and Kafka 0.8.2.1 (and of course with the latest Kafka 0.10 as well) On Wed, Aug 24, 2016 at 9:49 AM, Cody Koeninger <c...@koeninger.org> wrote: > Were you aware that the Spark 2.0 / Kafka 0.10 integration also reuses Kafka consumer instances on the executors? > > On Tu
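
For reference, the consumer reuse Cody describes comes from LocationStrategies.PreferConsistent in the spark-streaming-kafka-0-10 module, which schedules partitions evenly and reuses cached consumers on the executors. A minimal sketch, assuming an existing StreamingContext ssc; the bootstrap servers, group id, and topic name are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010._

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",        // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",                  // placeholder
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // PreferConsistent keeps a cache of Kafka consumers on the executors,
    // so instances are reused across batches instead of recreated.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("topic"), kafkaParams)
    )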

Maelstrom: Kafka integration with Spark

2016-08-23 Thread Jeoffrey Lim
has been running stably in a production environment and has proven resilient to numerous production issues. Please check out the project's page on GitHub: https://github.com/jeoffreylim/maelstrom Contributors welcome! Cheers! Jeoffrey Lim P.S. I am also looking for a job opportunity

Re: How to initiate a shutdown of Spark Streaming context?

2014-09-15 Thread Jeoffrey Lim
What we did to gracefully shut down the Spark Streaming context was extend a Spark Web UI Tab and perform a SparkContext.SparkUI.attachTab(custom web ui). However, the custom Scala Web UI extensions need to be under the package org.apache.spark.ui to get around the package access
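
A rough sketch of that trick; SparkUITab and attachTab are package-private Spark internals that vary across versions, so the class below is an illustration rather than a stable API, and the tab name and page are hypothetical:

    package org.apache.spark.ui  // required to reach the package-private UI classes

    // Hypothetical tab that could host a page whose handler calls
    // ssc.stop(stopSparkContext = true, stopGracefully = true).
    class ShutdownTab(parent: SparkUI) extends SparkUITab(parent, "shutdown") {
      // attachPage(new ShutdownPage(this))  // ShutdownPage left to the reader
    }

    object ShutdownTab {
      // Given a SparkUI handle (how you obtain it differs by Spark version),
      // attach the custom tab alongside Jobs/Stages/Storage/etc.
      def attach(ui: SparkUI): Unit = ui.attachTab(new ShutdownTab(ui))
    }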

Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-12 Thread Jeoffrey Lim
Our issue could be related to this problem as described in: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-in-1-hour-batch-duration-RDD-files-gets-lost-td14027.html in which the DStream is processed in a 1-hour batch duration. I have implemented IO throttling in the
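
For receiver-based streams of that era, one built-in alternative to hand-rolled IO throttling is the spark.streaming.receiver.maxRate setting, which caps the records ingested per second per receiver; a minimal sketch (app name and rate are placeholders):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("throttled-streaming-app")  // placeholder
      // Cap each receiver at 10,000 records/second so a 1-hour batch
      // cannot accumulate more data than the cluster can process in time.
      .set("spark.streaming.receiver.maxRate", "10000")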

Spark Streaming in 1 hour batch duration RDD files gets lost

2014-09-11 Thread Jeoffrey Lim
Hi, Our Spark Streaming app is configured to pull data from Kafka in a 1-hour batch duration; it performs aggregation of data by specific keys and stores the related RDDs to HDFS in the transform phase. We have tried a 7-day checkpoint on the Kafka DStream to ensure that the generated stream
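
A minimal sketch of the setup described above, assuming the Spark 1.x receiver-based Kafka API; the ZooKeeper quorum, consumer group, topic map, paths, and the aggregation itself are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("hourly-kafka-aggregation")  // placeholder
    val ssc  = new StreamingContext(conf, Minutes(60))                 // 1-hour batches
    ssc.checkpoint("hdfs:///checkpoints/app")                          // placeholder path

    // Receiver-based Kafka stream of (key, message) pairs.
    val stream = KafkaUtils.createStream(ssc, "zk1:2181", "group", Map("topic" -> 1))

    // Checkpoint the DStream periodically so its lineage does not grow without
    // bound; 7 days is expressed in minutes since there is no Days() helper.
    stream.checkpoint(Minutes(7 * 24 * 60))

    stream
      .reduceByKey(_ + "\n" + _)                     // placeholder aggregation by key
      .saveAsTextFiles("hdfs:///output/aggregates")  // placeholder output prefix

    ssc.start()
    ssc.awaitTermination()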