Essentially, I went from:

k = createStream(.....)
val dataout = k.map(x => myFunc(x._2, someParams))
dataout.foreachRDD(rdd =>
  rdd.foreachPartition(rec => {
    myOutputFunc.write(rec)
  }))
To:

kIn = createDirectStream(.....)
k = kIn.repartition(numberOfExecutors) // since #kafka partitions < #spark-executors
val dataout = k.map(x => myFunc(x._2, someParams))
dataout.foreachRDD(rdd =>
  rdd.foreachPartition(rec => {
    myOutputFunc.write(rec)
  }))

With the new API, the app starts up and works fine for a while, but then performance starts to deteriorate. With the existing "createStream" API, the app also deteriorates, but over a much longer period: hours with the new API versus days with the old.

On Fri, Jun 19, 2015 at 1:40 PM, Tathagata Das <t...@databricks.com> wrote:

> Yes, please tell us what operation you are using.
>
> TD
>
> On Fri, Jun 19, 2015 at 11:42 AM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> Is there any more info you can provide / relevant code?
>>
>> On Fri, Jun 19, 2015 at 1:23 PM, Tim Smith <secs...@gmail.com> wrote:
>>
>>> Update on performance of the new API: the new code using the
>>> createDirectStream API ran overnight, and when I checked the app state in
>>> the morning, there were massive scheduling delays :(
>>>
>>> Not sure why; I haven't investigated a whole lot. For now, I have
>>> switched back to the createStream API build of my app. Yes, for the
>>> record, this is with CDH 5.4.1 and Spark 1.3.
>>>
>>> On Thu, Jun 18, 2015 at 7:05 PM, Tim Smith <secs...@gmail.com> wrote:
>>>
>>>> Thanks for the super-fast response, TD :)
>>>>
>>>> I will now go bug my Hadoop vendor to upgrade from 1.3 to 1.4.
>>>> Cloudera, are you listening? :D
>>>>
>>>> On Thu, Jun 18, 2015 at 7:02 PM, Tathagata Das <
>>>> tathagata.das1...@gmail.com> wrote:
>>>>
>>>>> Are you using Spark 1.3.x? That explains it. This issue has been fixed
>>>>> in Spark 1.4.0. Bonus: you get a fancy new streaming UI with more
>>>>> awesome stats. :)
>>>>>
>>>>> On Thu, Jun 18, 2015 at 7:01 PM, Tim Smith <secs...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I just switched from the "createStream" to the "createDirectStream"
>>>>>> API for Kafka, and while things otherwise seem happy, the first thing I
>>>>>> noticed is that stream/receiver stats are gone from the Spark UI :(
>>>>>> Those stats were very handy for keeping an eye on the health of the app.
>>>>>>
>>>>>> What's the best way to re-create those in the Spark UI? Maintain
>>>>>> accumulators? It would be really nice to get back receiver-like stats,
>>>>>> even though I understand that "createDirectStream" is a receiver-less
>>>>>> design.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Tim
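To make the accumulator idea from the original question concrete: below is a minimal sketch, not the original app, of one way to approximate receiver-like throughput stats under the receiver-less createDirectStream API using a named accumulator (named accumulators show up on the stage pages of the Spark UI). It assumes Spark 1.3-era interfaces; the broker list, topic name, numberOfExecutors value, and the myFunc/someParams/writeRecord stand-ins are placeholders, not values from the thread.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamWithStats {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-kafka-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder stand-ins for the pieces elided in the thread.
    val numberOfExecutors = 8
    val someParams = "params"
    val myFunc = (value: String, params: String) => value
    val writeRecord = (rec: String) => println(rec) // stand-in for myOutputFunc.write

    // Receiver-less direct stream (Spark 1.3 API); placeholder broker and topic.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val kIn = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))

    // Named accumulator: visible per stage in the Spark UI, a rough
    // substitute for the per-receiver records/sec stats.
    val recordCount = ssc.sparkContext.accumulator(0L, "recordsProcessed")

    // Spread work across executors since #kafka partitions < #spark-executors.
    val k = kIn.repartition(numberOfExecutors)

    val dataout = k.map(x => myFunc(x._2, someParams))
    dataout.foreachRDD { rdd =>
      rdd.foreachPartition { recs =>
        recs.foreach { rec =>
          recordCount += 1L
          writeRecord(rec)
        }
      }
      // Runs on the driver after this batch's job completes, so the
      // cumulative count can be logged (or charted) per batch.
      println(s"records so far: ${recordCount.value}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

As TD notes above, Spark 1.4's new streaming UI brings these stats back natively, so an accumulator like this is only a stopgap for 1.3.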