Hi,

I have the following scenarios and need some help, ASAP.

1. Ad hoc queries on Spark Streaming.
   How can I run Spark queries on an ongoing streaming context?
   Scenario: a streaming job finds the min and max values over the last
5 minutes (which I am able to do).
   Now I want to run an interactive query on this stream to find the min
and max over the last 30 minutes.
   I was thinking of storing the streaming RDDs as a SchemaRDD and querying
that. Is there a better approach?
   Where should I store the SchemaRDD for near-real-time performance?
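For reference, the windowed min/max itself can be sketched outside of Spark. In Spark Streaming this would be a window operation over the DStream; the sketch below is plain Python, and all names in it are my own, not Spark API:

```python
from collections import deque

class WindowedMinMax:
    """Conceptual stand-in for a sliding-window min/max over a stream
    (what Spark Streaming does with a windowed reduce)."""
    def __init__(self, window):
        self.window = window
        self.items = deque()

    def add(self, value):
        # keep only the most recent `window` items
        self.items.append(value)
        if len(self.items) > self.window:
            self.items.popleft()

    def min_max(self):
        return (min(self.items), max(self.items))

stream = WindowedMinMax(window=5)
for v in [7, 3, 9, 1, 4, 8, 2]:
    stream.add(v)
print(stream.min_max())  # min/max over the last 5 values: (1, 9)
```

Widening the window (5 minutes to 30 minutes) only changes how much history is retained, which is why keeping the windowed results queryable (e.g. as a SchemaRDD) is attractive.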
2. Saving and loading intermediate RDDs to cache/disk.
   What is the best approach for this? If a worker fails, will a new
worker resume the task and load these saved RDDs?
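My understanding (please correct me) is that Spark can recompute a lost partition from its lineage even when the cached copy is gone. A conceptual sketch of that recovery pattern, in plain Python with made-up names:

```python
class ResilientPartition:
    """Conceptual sketch: a partition served from cache when possible,
    and recomputed from its lineage (parent data + transformation)
    when the cached copy is lost, e.g. on a worker failure."""
    def __init__(self, parent_data, transform):
        self.parent_data = parent_data   # lineage: source data
        self.transform = transform       # lineage: transformation
        self.cache = None

    def get(self):
        if self.cache is None:           # cache miss (worker was lost)
            self.cache = [self.transform(x) for x in self.parent_data]
        return self.cache

part = ResilientPartition([1, 2, 3], lambda x: x * 10)
print(part.get())      # computed: [10, 20, 30]
part.cache = None      # simulate losing the cached copy
print(part.get())      # recomputed from lineage: [10, 20, 30]
```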
3. Write-ahead log and checkpointing.
   What is the significance of the WAL and of checkpointing? With
checkpointing, if a worker fails, will another worker load the checkpoint
data and resume its job?
   In which scenarios should I use the WAL versus checkpointing?
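As I understand it, the idea behind a WAL is that records are written durably before processing so a restarted process can replay them. A minimal conceptual sketch (plain Python, not Spark's implementation; names are mine):

```python
import os
import tempfile

class WriteAheadLog:
    """Conceptual WAL: append each received record to durable storage
    *before* processing it, so a restart can replay unprocessed data."""
    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a") as f:
            f.write(record + "\n")

    def replay(self):
        # on restart, read back everything that was durably logged
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [line.rstrip("\n") for line in f]

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = WriteAheadLog(log_path)
for rec in ["event-1", "event-2"]:
    wal.append(rec)          # logged before any processing happens
# ...process crashes here; after restart, replay what was logged:
print(wal.replay())          # ['event-1', 'event-2']
```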
4. Spawning multiple processes within Spark Streaming.
   How do I perform multiple operations on the same stream?
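What I mean by "multiple operations" is fanning one micro-batch out to several independent computations, the way one would attach several outputs to a single stream. A plain-Python sketch (names are my own):

```python
def fan_out(batch, operations):
    """Conceptual sketch: apply several independent operations to the
    same micro-batch and collect each result by name."""
    return {name: op(batch) for name, op in operations.items()}

batch = [4, 1, 7, 3]
results = fan_out(batch, {
    "min": min,     # each operation sees the same batch
    "max": max,
    "count": len,
})
print(results)  # {'min': 1, 'max': 7, 'count': 4}
```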
5. Accessing cached data between Spark components.
   Is data cached in Spark Streaming accessible to Spark SQL? Can it be
shared between these components, or between two SparkContexts?
   If yes, how? If not, is there an alternative approach?
6. Dynamic lookup data in Spark Streaming.
   I have a scenario where I want to filter a stream using dynamic lookup
data. How can I achieve this?
   If I receive this lookup data as another stream and cache it, will it
be possible to update/merge the cached data continuously, 24/7?
   What is the best approach for this? I referred to the Twitter streaming
example in Spark, where it reads a spam file, but that file is not dynamic
in nature.
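To make the scenario concrete: filtering a stream against a lookup set that is itself updated while the job runs. The sketch below is plain Python (all names are mine, not Spark API); in Spark one might instead periodically rebuild a broadcast variable, or merge updates arriving on a second stream as done here:

```python
class DynamicLookupFilter:
    """Conceptual sketch: filter batches against a lookup set that can be
    updated at runtime (e.g. from a second stream of lookup data)."""
    def __init__(self, initial_lookup):
        self.lookup = set(initial_lookup)

    def update(self, new_entries):
        # merge fresh lookup data arriving while the job is running
        self.lookup.update(new_entries)

    def filter_batch(self, batch):
        # drop records that appear in the current lookup set
        return [x for x in batch if x not in self.lookup]

f = DynamicLookupFilter({"spammer1"})
print(f.filter_batch(["alice", "spammer1", "bob"]))  # ['alice', 'bob']
f.update({"bob"})          # lookup data changes while the stream runs
print(f.filter_batch(["alice", "spammer1", "bob"]))  # ['alice']
```

Unlike the Twitter example's static spam file, the lookup set here keeps changing between batches, which is the behavior I am after.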
