Hi,

I’ve been trying for the last couple of weeks to create a Spark Streaming job 
that joins two streams on a common id, and then have another job run queries on 
the output of the joined streams.

I’m using Spark 0.9.0, and the ooyala job server to share the context between 
jobs.

The flow can be summarised as follows (it is the same from both streams' point of view):

  1.  Use custom Network Receiver to consume from RabbitMQ and push data into 
spark
  2.  Convert to appropriate message format (flatMap/map)
  3.  Do some aggregation/stateful transformation based on key using 
updateStateByKey
  4.  Extract a common identifier (flatMap)
  5.  Join using cogroup with the other stream
  6.  Transform joined streams into a different format (using flatMap)
  7.  In a foreachRDD call, use the namedRDD.update function provided by the ooyala 
job server
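
For concreteness, steps 2 to 7 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the actual job: the `Event` and `Joined` types, the comma-separated input format, the summing aggregation, and the `publish` callback (standing in for the job server's named RDD update) are all hypothetical.

```scala
import scala.util.Try

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext._ // pair DStream operations on older Spark versions
import org.apache.spark.streaming.dstream.DStream

object JoinSketch {
  // Hypothetical message types, for illustration only.
  case class Event(key: String, commonId: String, amount: Long)
  case class Joined(commonId: String, leftTotal: Long, rightTotal: Long)

  // Step 2: parse an assumed "key,commonId,amount" line; malformed lines yield None.
  def parseLine(line: String): Option[(String, Event)] =
    line.split(",") match {
      case Array(key, id, amount) =>
        Try(amount.trim.toLong).toOption.map(a => key -> Event(key, id, a))
      case _ => None
    }

  // Steps 2 to 4 for one stream: parse, keep a running total per key, re-key by common id.
  // updateStateByKey requires a checkpoint directory to be set on the StreamingContext.
  def totalsById(raw: DStream[String]): DStream[(String, Long)] = {
    val state = raw
      .flatMap(line => parseLine(line).toSeq)
      .updateStateByKey[Event] { (batch: Seq[Event], previous: Option[Event]) =>
        val all = previous.toSeq ++ batch
        // Keep the latest event for the key, carrying the accumulated amount.
        all.lastOption.map(latest => latest.copy(amount = all.map(_.amount).sum))
      }
    state.map { case (_, event) => event.commonId -> event.amount }
  }

  // Steps 5 to 7: cogroup both streams, reshape the result, hand each RDD to `publish`.
  def joinAndPublish(left: DStream[String],
                     right: DStream[String],
                     publish: RDD[Joined] => Unit): Unit = {
    val joined = totalsById(left).cogroup(totalsById(right)).flatMap {
      case (id, (ls, rs)) =>
        // Emit a record only when both sides have data for this id.
        if (ls.nonEmpty && rs.nonEmpty) Seq(Joined(id, ls.sum, rs.sum))
        else Seq.empty[Joined]
    }
    joined.foreachRDD((rdd: RDD[Joined]) => publish(rdd))
  }
}
```

In the real job, `publish` would be whatever wraps the named RDD update, and the two input streams would come from the custom RabbitMQ receivers.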

The idea is that I can then reference the named RDD directly from other jobs 
submitted from the observer.

This works well for a couple of hours; after that I notice that the batch 
execution time starts to increase. At first it just looks like jitter, but 
eventually the execution time starts exceeding the batch interval. What is 
really odd is that if I leave the job running with no messages being consumed 
from RabbitMQ, I still see the batch processing time increase.

I was under the impression that if I checkpoint the stream before my stateful 
operations (updateStateByKey) the lineage would not increase over time and that 
therefore there will be a stable point where the system can run “indefinitely”.
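
As a point of comparison, one common way to set up checkpointing around updateStateByKey is sketched below. It is an assumption about a possible configuration, not a description of what the job above does: the checkpoint path, the factor of 10, and the `runningTotals` helper are hypothetical. Note that the sketch checkpoints the state stream itself (the output of updateStateByKey), since each batch's state RDD is derived from the previous batch's state RDD.

```scala
import org.apache.spark.streaming.{Duration, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair DStream operations on older Spark versions
import org.apache.spark.streaming.dstream.DStream

object CheckpointSketch {
  // Hypothetical helper: running total per key with periodic checkpointing of the state.
  def runningTotals(ssc: StreamingContext,
                    batchInterval: Duration,
                    events: DStream[(String, Long)]): DStream[(String, Long)] = {
    // Stateful operations need a checkpoint directory (hypothetical path).
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")

    val totals = events.updateStateByKey[Long] { (batch: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + batch.sum)
    }

    // Checkpoint the state stream every 10 batches (an assumed value); the
    // interval must be a multiple of the stream's slide interval.
    totals.checkpoint(batchInterval * 10)
    totals
  }
}
```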

Has anyone else solved this problem already? Am I missing something fundamental?

Regards,
Fred

