Hi, I’ve been trying for the last couple of weeks to create a Spark Streaming job which joins two streams on a common id, and then have another job run queries on the output of the joined streams.
I’m using Spark 0.9.0, and the ooyala job server to share the context between jobs.

The flow can be summarised as follows (it is the same from both streams' point of view):

1. Use a custom Network Receiver to consume from RabbitMQ and push data into Spark
2. Convert to the appropriate message format (flatMap/map)
3. Do some aggregation/stateful transformation based on key using updateStateByKey
4. Extract a common identifier (flatMap)
5. Join with the other stream using cogroup
6. Transform the joined streams into a different format (using flatMap)
7. In a foreachRDD call, use the namedRDD.update function provided by the ooyala job server

The idea is that I can then reference the named RDD directly from other jobs submitted from the observer.

This works well for a couple of hours; after that I notice that the batch execution time starts to increase. At first it just looks like jitter, but eventually the execution time starts exceeding the batch interval. What is really odd is that if I leave the job running with no messages being consumed from RabbitMQ, I still see the batch processing time increase.

I was under the impression that if I checkpoint the stream before my stateful operations (updateStateByKey), the lineage would not grow over time, and that there would therefore be a stable point where the system can run “indefinitely”.

Has anyone else solved this problem already? Am I missing something fundamental?

Regards,
Fred
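
For reference, the flow described above can be sketched roughly as below. This is a simplified illustration, not the actual job: the `Msg` and `Joined` types, the comma-separated message format, the socket sources standing in for the custom RabbitMQ receiver, the checkpoint path and the intervals are all assumptions made up for the example. The job server's named RDD update is only indicated by a comment, since its exact call is not shown here.

```scala
import scala.util.Try

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair DStream operations
import org.apache.spark.streaming.dstream.DStream

object JoinSketch {
  // Hypothetical message and output formats (not from the real job).
  case class Msg(key: String, commonId: String, value: Long)
  case class Joined(commonId: String, leftTotal: Long, rightTotal: Long)

  // Step 2: parse an assumed "key,commonId,value" line; bad input is dropped.
  def parse(line: String): Option[Msg] = line.split(",") match {
    case Array(k, id, v) => Try(v.toLong).toOption.map(Msg(k, id, _))
    case _               => None
  }

  // Step 3: running total per key. Always returning Some(...) keeps every
  // key in the state forever; returning None would drop the key.
  def updateTotal(newValues: Seq[Long], state: Option[Long]): Option[Long] =
    Some(state.getOrElse(0L) + newValues.sum)

  // Step 6: emit a record only when both sides have data for the id.
  def toOutput(id: String, ls: Iterable[Long], rs: Iterable[Long]): List[Joined] =
    if (ls.nonEmpty && rs.nonEmpty) List(Joined(id, ls.sum, rs.sum)) else Nil

  // Steps 2 to 4 for one side of the join.
  def side(lines: DStream[String]): DStream[(String, Long)] = {
    val msgs = lines.flatMap(line => parse(line).toList)
    val totals = msgs
      .map(m => ((m.commonId, m.key), m.value))
      .updateStateByKey[Long](updateTotal _)
    totals.checkpoint(Seconds(10)) // assumed interval, a multiple of the batch interval
    totals.map { case ((commonId, _), total) => (commonId, total) }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("join-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/join-sketch-checkpoint") // needed by updateStateByKey

    // Step 1 stand-ins: socket sources instead of the custom RabbitMQ receiver.
    val left = side(ssc.socketTextStream("localhost", 9998))
    val right = side(ssc.socketTextStream("localhost", 9999))

    // Steps 5 and 6: cogroup on the common id, then reshape.
    val joined = left.cogroup(right).flatMap {
      case (id, (ls, rs)) => toOutput(id, ls, rs)
    }

    // Step 7: the job server's named RDD update would go here.
    joined.foreachRDD { rdd => println("joined records: " + rdd.count()) }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The parsing, state update and output functions are kept as plain functions so they can be checked without a running cluster. Note that in this sketch the state kept by updateStateByKey never shrinks, because `updateTotal` never returns None.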