[streaming] reading Kafka direct stream throws kafka.common.OffsetOutOfRangeException

2015-09-30 Thread Alexey Ponkin
Hi

I have a simple Spark Streaming job (8 executors with 1 core each, on an 8-node cluster):
it reads from a Kafka topic (3 brokers, 8 partitions) and saves to Cassandra.
The problem is that when I increase the number of incoming messages in the topic,
the job starts to fail with kafka.common.OffsetOutOfRangeException.
The job fails starting from 100 events per second.
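For context, a minimal sketch of this kind of direct-stream setup (the app name, topic, broker list, batch interval and rate-limit value are placeholders, not taken from the actual job). A common cause of OffsetOutOfRangeException is the job falling behind while Kafka's retention deletes the segments a batch still wants to read; spark.streaming.kafka.maxRatePerPartition and auto.offset.reset are the usual knobs to look at:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Cap how many messages each batch may read per partition, so a burst of
// input cannot produce huge batches that lag behind Kafka's retention.
val conf = new SparkConf()
  .setAppName("kafka-to-cassandra")
  .set("spark.streaming.kafka.maxRatePerPartition", "500")

val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092,broker2:9092,broker3:9092",
  // where to start when no valid offset is available
  "auto.offset.reset" -> "largest")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

stream.foreachRDD { rdd =>
  // save to Cassandra here, e.g. via the spark-cassandra-connector
}

ssc.start()
ssc.awaitTermination()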

Thanks in advance




Re: [streaming] DStream with window performance issue

2015-09-08 Thread Alexey Ponkin
Ok.
Spark 1.4.1 on YARN.

Here is my application.
I have 4 different Kafka topics (different object streams):

type Edge = (String, String)

val a = KafkaUtils.createDirectStream[...](ssc, "A", params).filter(nonEmpty).map(toEdge)
val b = KafkaUtils.createDirectStream[...](ssc, "B", params).filter(nonEmpty).map(toEdge)
val c = KafkaUtils.createDirectStream[...](ssc, "C", params).filter(nonEmpty).map(toEdge)

val u = a union b union c

val source = u.window(Seconds(600), Seconds(10))

val z = KafkaUtils.createDirectStream[...](ssc, "Z", params).filter(nonEmpty).map(toEdge)

val joinResult = source.rightOuterJoin(z)
joinResult.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // save to the result topic in Kafka
  }
}

The 'window' operation in the code above is constantly growing,
no matter how many events appear in the corresponding Kafka topics,

but if I change one line from   

val source = u.window(Seconds(600), Seconds(10))

to 

val partitioner = ssc.sparkContext.broadcast(new HashPartitioner(8))

val source = u.transform(_.partitionBy(partitioner.value)).window(Seconds(600), Seconds(10))

everything works perfectly.

Perhaps the problem is in WindowedDStream.

By applying partitionBy with the same partitioner I forced it to use
PartitionerAwareUnionRDD instead of UnionRDD.

Nonetheless, I did not see any hints about such behaviour in the docs.
Is it a bug or absolutely normal behaviour?
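
In case it helps, here is a sketch of the same idea carried through to the join as well (it reuses a, b, c, z and ssc from the snippet above; the partition count of 8 and passing the partitioner explicitly to rightOuterJoin are my own additions, so treat this as an assumption rather than a tested fix):

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.Seconds

// One shared partitioner, broadcast so every batch uses the same instance.
val partitioner = ssc.sparkContext.broadcast(new HashPartitioner(8))

// Partitioning each batch RDD before windowing means all RDDs in the window
// share the same partitioner, so the window's internal union can use
// PartitionerAwareUnionRDD instead of a plain UnionRDD (as described above).
val source = (a union b union c)
  .transform(_.partitionBy(partitioner.value))
  .window(Seconds(600), Seconds(10))

// Partitioning the right side the same way and passing the partitioner to the
// join lets rightOuterJoin reuse the existing partitioning instead of
// re-shuffling both sides on every batch.
val joinResult = source.rightOuterJoin(
  z.transform(_.partitionBy(partitioner.value)),
  partitioner.value)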





08.09.2015, 17:03, "Cody Koeninger" <c...@koeninger.org>:
>  Can you provide more info (what version of spark, code example)?
>
>  On Tue, Sep 8, 2015 at 8:18 AM, Alexey Ponkin <alexey.pon...@ya.ru> wrote:
>>  Hi,
>>
>>  I have an application with 2 streams, which are joined together.
>>  Stream1 is a simple DStream (relatively small batch chunks).
>>  Stream2 is a windowed DStream (with a duration of, for example, 60 seconds).
>>
>>  Stream1 and Stream2 are Kafka direct streams.
>>  The problem is that according to the logs the window operation is constantly
>>  increasing (screenshot: http://en.zimagez.com/zimage/screenshot2015-09-0816-06-55.php).
>>  And I also see a gap in the processing window (screenshot:
>>  http://en.zimagez.com/zimage/screenshot2015-09-0816-10-22.php);
>>  in the logs there are no events in that period.
>>  So what happens in that gap, and why is the window constantly increasing?
>>
>>  Thank you in advance
>>



[streaming] DStream with window performance issue

2015-09-08 Thread Alexey Ponkin
Hi,

I have an application with 2 streams, which are joined together.
Stream1 is a simple DStream (relatively small batch chunks).
Stream2 is a windowed DStream (with a duration of, for example, 60 seconds).

Stream1 and Stream2 are Kafka direct streams.
The problem is that according to the logs the window operation is constantly
increasing (screenshot: http://en.zimagez.com/zimage/screenshot2015-09-0816-06-55.php).
And I also see a gap in the processing window (screenshot:
http://en.zimagez.com/zimage/screenshot2015-09-0816-10-22.php);
in the logs there are no events in that period.
So what happens in that gap, and why is the window constantly increasing?

Thank you in advance




[streaming] Using org.apache.spark.Logging will silently break task execution

2015-09-06 Thread Alexey Ponkin
Hi,

I have the following code:

object MyJob extends org.apache.spark.Logging {
  ...
  val source: DStream[SomeType] = ...

  source.foreachRDD { rdd =>
    logInfo(s"""+++ForEachRDD+++""")
    rdd.foreachPartition { partitionOfRecords =>
      logInfo(s"""+++ForEachPartition+++""")
    }
  }
}

I was expecting to see both log messages in the job log.
But unfortunately you will never see the string '+++ForEachPartition+++' in the logs,
because the foreachPartition block never executes.
There is also no error message or anything else in the logs.
I wonder: is this a bug or known behavior?
I know that org.apache.spark.Logging is a DeveloperApi, but why does it fail silently
with no messages?
What should be used instead of org.apache.spark.Logging in spark-streaming jobs?
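
For what it is worth, a sketch of a workaround that avoids the Logging trait by creating an org.slf4j logger inside each closure ('source' is the DStream from the snippet above; the logger name is just a placeholder). One thing worth checking: the body of foreachPartition runs on the executors, so anything it logs normally ends up in the executor (YARN container) logs rather than in the driver log:

import org.slf4j.LoggerFactory

source.foreachRDD { rdd =>
  // Runs on the driver: this message should appear in the driver log.
  LoggerFactory.getLogger("MyJob").info("+++ForEachRDD+++")

  rdd.foreachPartition { partitionOfRecords =>
    // Runs on an executor: look for this message in the executor
    // (YARN container) logs, not in the driver log.
    val log = LoggerFactory.getLogger("MyJob")
    log.info("+++ForEachPartition+++")
  }
}

Creating the logger inside the closure also sidesteps serialization issues around carrying a logger from the driver to the executors.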

P.S. Running Spark 1.4.1 (on YARN).

Thanks in advance
