Re: [streaming] DStream with window performance issue

2015-09-09 Thread Cody Koeninger
It looked like from your graphs that you had a 10 second batch time, but that your processing time was consistently 11 seconds. If that's correct, then yes your delay is going to keep growing. You'd need to either increase your batch time, or get your processing time down (either by adding more

Re: [streaming] DStream with window performance issue

2015-09-09 Thread Понькин Алексей
That`s correct, I have 10 seconds batch. The problem is actually in processing time, it is increasing constantly no matter how small or large my window duration is. I am trying to prepare some example code to clarify my use case. -- Яндекс.Почта — надёжная почта

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Понькин Алексей
Thank you very much for great answer! -- Яндекс.Почта — надёжная почта http://mail.yandex.ru/neo2/collect/?exp=1=1 08.09.2015, 23:53, "Cody Koeninger" : > Yeah, that's the general idea. > > When you say hard code topic name, do you mean  Set(topicA, topicB, topicB) ?  >

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Понькин Алексей
Oh my, I implemented one directStream instead of union of three but it is still growing exponential with window method. -- Яндекс.Почта — надёжная почта http://mail.yandex.ru/neo2/collect/?exp=1=1 08.09.2015, 23:53, "Cody Koeninger" : > Yeah, that's the general idea. > >

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Cody Koeninger
Yeah, that's the general idea. When you say hard code topic name, do you mean Set(topicA, topicB, topicB) ? You should be able to use a variable for that - read it from a config file, whatever. If you're talking about the match statement, yeah you'd need to hardcode your cases. On Tue, Sep

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Alexey Ponkin
Ok. Spark 1.4.1 on yarn Here is my application I have 4 different Kafka topics(different object streams) type Edge = (String,String) val a = KafkaUtils.createDirectStream[...](sc,"A",params).filter( nonEmpty ).map( toEdge ) val b = KafkaUtils.createDirectStream[...](sc,"B",params).filter(

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Понькин Алексей
The thing is, that these topics contain absolutely different AVRO objects(Array[Byte]) that I need to deserialize to different Java(Scala) objects, filter and then map to tuple (String, String). So i have 3 streams with different avro object in there. I need to cast them(using some business

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Cody Koeninger
I'm not 100% sure what's going on there, but why are you doing a union in the first place? If you want multiple topics in a stream, just pass them all in the set of topics to one call to createDirectStream On Tue, Sep 8, 2015 at 10:52 AM, Alexey Ponkin wrote: > Ok. > Spark

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Cody Koeninger
That doesn't really matter. With the direct stream you'll get all objects for a given topicpartition in the same spark partition. You know what topic it's from via hasOffsetRanges. Then you can deserialize appropriately based on topic. On Tue, Sep 8, 2015 at 11:16 AM, Понькин Алексей

[streaming] DStream with window performance issue

2015-09-08 Thread Alexey Ponkin
Hi, I have an application with 2 streams, which are joined together. Stream1 - is simple DStream(relativly small size batch chunks) Stream2 - is a windowed DStream(with duration for example 60 seconds) Stream1 and Stream2 are Kafka direct stream. The problem is that according to logs window

Re: [streaming] DStream with window performance issue

2015-09-08 Thread Cody Koeninger
Can you provide more info (what version of spark, code example)? On Tue, Sep 8, 2015 at 8:18 AM, Alexey Ponkin wrote: > Hi, > > I have an application with 2 streams, which are joined together. > Stream1 - is simple DStream(relativly small size batch chunks) > Stream2 - is a