Hi, I have the following scenarios and need some help as soon as possible.
1. Ad hoc queries on Spark Streaming. How can I run Spark queries against a running streaming context? Scenario: a streaming job computes the min and max values over the last 5 minutes (which I am able to do). Now I want to run an interactive query to find the min and max over the last 30 minutes on the same stream. I was thinking of storing the streaming RDDs as a SchemaRDD and querying that. Is there a better approach? Where should I store the SchemaRDD for near-real-time performance?

2. Saving and loading intermediate RDDs in cache or on disk. What is the best approach to do this? If a worker fails, will a new worker resume the task and load the saved RDDs?

3. Write-ahead log and checkpointing. What is the significance of the WAL and of checkpointing? With checkpointing, if a worker fails, will another worker load the checkpoint details and resume its job? In which scenarios should I use the WAL and checkpointing?

4. Spawning multiple processes within Spark Streaming. Doing multiple operations on the same stream.

5. Accessing cached data across Spark components. Is data cached in Spark Streaming accessible to Spark SQL? Can it be shared between these components, or between two SparkContexts? If yes, how? If not, is there an alternative approach?

6. Dynamic lookup data in Spark Streaming. I have a scenario where I want to filter a stream using dynamic lookup data. How can I achieve this? If I receive the lookup data as another stream and cache it, will it be possible to update/merge the cached data continuously (24/7)? What is the best approach to do this? I referred to the Twitter streaming example in Spark, where it reads a spam file, but that file is not dynamic in nature.
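
To make the scenario in point 1 more concrete, here is one possible way to picture it, sketched in plain Python rather than with Spark API calls: keep a small (min, max) summary per 5-minute batch and answer the 30-minute query by combining the last six summaries. The names `WindowedMinMax`, `add_batch` and `query` are made up for illustration, and the whole thing is only an assumption about what such a store could look like, not something Spark provides.

```python
from collections import deque


class WindowedMinMax:
    """Hypothetical store: one (min, max) summary per batch, combined on demand."""

    def __init__(self, max_batches):
        # Assumed retention limit: older summaries are dropped automatically.
        self.summaries = deque(maxlen=max_batches)

    def add_batch(self, values):
        # Summarise one batch (e.g. one 5-minute interval).
        # An empty batch is recorded as None so the time slots stay aligned.
        values = list(values)
        self.summaries.append((min(values), max(values)) if values else None)

    def query(self, last_n):
        # Combine the summaries of the last_n batches into one (min, max).
        if last_n <= 0:
            return None
        recent = [s for s in list(self.summaries)[-last_n:] if s is not None]
        if not recent:
            return None
        return (min(lo for lo, _ in recent), max(hi for _, hi in recent))


store = WindowedMinMax(max_batches=6)  # 6 batches x 5 min = 30 min
for batch in ([3, 9, 4], [7, 1], [5, 8]):
    store.add_batch(batch)

print(store.query(1))  # min/max of the latest 5-minute batch only
print(store.query(6))  # min/max over everything retained (up to 30 minutes)
```

The point of the sketch is that min and max can be merged from per-batch summaries, so the longer window does not need the raw records; whether that fits depends on which other ad hoc queries are required.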