Hi Adrian, yes, your assumption is correct.
I'm using HBase for storing the partial calculations. Thank you for the feedback - it is exactly what I had in mind.

Thx
D

On Thu, Nov 5, 2015 at 10:43 AM, Adrian Tanase <atan...@adobe.com> wrote:

> You should also specify how you’re planning to query or “publish” the data. I would consider a combination of:
> - spark streaming job that ingests the raw events in real time, validates, pre-processes and saves to stable storage
> - stable storage could be HDFS/parquet or a database optimized for time series (hbase, cassandra, etc)
> - regular spark job that you trigger via cron every day/week/month OR
> - query the DB directly, depending on how much data it has or whether it supports secondary indexes that build up partial aggregations (hourly/daily) that are easy to compute at query time
>
> Your example of average is easy to do live on a DB if it has secondary indexes, as the operation is associative and can be gradually rolled up at the hourly/daily/monthly level.
> For “count distinct” or unique metrics it’s tougher, as you’ll need access to the raw data (unless you’re willing to accept ~99% accuracy, in which case you can use HLL aggregators).
>
> Hope this helps,
> -adrian
>
> On 11/5/15, 10:48 AM, "danilo" <dani.ri...@gmail.com> wrote:
>
> > Hi All,
> >
> > I'm quite new to this topic and to Spark in general.
> >
> > I have a sensor that is pushing data in real time and I need to calculate some KPIs based on the data I have received. Given that some of the KPIs are related to very old data (e.g. the average number of events in the last 3 months), I was wondering what the best approach is to do this with Spark.
> >
> > The approach I'm currently following is creating partial KPIs in real time and then creating the other KPIs with a second Spark chain scheduled on a daily / weekly / monthly basis.
> >
> > Does that make sense? If so, how can I schedule Spark to run only once a day / week / month?
> >
> > Thx
> > D

-- 
Danilo Rizzo
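
A minimal sketch of the streaming step Adrian describes (ingest the raw events in real time, validate, save to stable storage), assuming events arrive as CSV lines "sensorId,timestamp,value" on a socket and that HDFS/Parquet is the stable store; the source, schema and paths are placeholders for whatever is actually used (Kafka, HBase, etc.):

import org.apache.spark.SparkConf
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical event schema -- adjust to the real sensor payload.
case class Event(sensorId: String, ts: Long, value: Double)

object IngestJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sensor-ingest")
    val ssc = new StreamingContext(conf, Seconds(10))
    val sqlContext = new SQLContext(ssc.sparkContext)
    import sqlContext.implicits._

    // Placeholder source; a Kafka receiver would slot in here instead.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Parse and validate; malformed records are simply dropped.
    val events = lines.flatMap { line =>
      line.split(",") match {
        case Array(id, ts, v) => scala.util.Try(Event(id, ts.toLong, v.toDouble)).toOption
        case _                => None
      }
    }

    // Append each micro-batch to Parquet, partitioned by day so the
    // periodic roll-up job only reads the slice it needs.
    events.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        rdd.toDF()
          .withColumn("day", ($"ts" / 86400000L).cast("long")) // ts assumed to be in ms
          .write
          .mode(SaveMode.Append)
          .partitionBy("day")
          .parquet("hdfs:///data/sensor/events")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}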
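
The "regular spark job that you trigger via cron" can then be a small batch job over that Parquet data. Below is a sketch of a daily roll-up that keeps the sum and count (both associative) alongside the average, so weekly/monthly figures can be built from the daily rows without re-reading the raw events; the class name, paths and day-number convention are again placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

object DailyRollup {
  def main(args: Array[String]): Unit = {
    val day = args(0).toLong // day partition to aggregate, passed in by the scheduler

    val sc = new SparkContext(new SparkConf().setAppName("daily-rollup"))
    val sqlContext = new SQLContext(sc)

    val events = sqlContext.read.parquet("hdfs:///data/sensor/events")
      .where(col("day") === day)

    val daily = events
      .groupBy(col("sensorId"), col("day"))
      .agg(count("value").as("cnt"), sum("value").as("total"))
      .withColumn("avg_value", col("total") / col("cnt"))

    // cnt and total are what make the later weekly/monthly roll-up a cheap
    // re-aggregation of these rows instead of a scan over the raw events.
    daily.write.mode("append").parquet("hdfs:///data/sensor/kpi_daily")

    sc.stop()
  }
}

Running it "only once a day" is then just a crontab entry around spark-submit, e.g. something like 0 2 * * * spark-submit --class DailyRollup /path/to/rollup.jar <day>, with weekly/monthly variants scheduled the same way.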
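
For the "count distinct" point, Spark SQL ships a HyperLogLog-based aggregate (approxCountDistinct in the 1.x DataFrame API) that covers the approximate case Adrian mentions; a rough sketch, reusing the hypothetical paths above:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

val sc = new SparkContext(new SparkConf().setAppName("unique-sensors"))
val sqlContext = new SQLContext(sc)

// The second argument is the target relative standard deviation, i.e. the
// accuracy/memory trade-off behind the "~99% accuracy" figure.
val uniques = sqlContext.read.parquet("hdfs:///data/sensor/events")
  .groupBy(col("day"))
  .agg(approxCountDistinct(col("sensorId"), 0.01).as("approx_unique_sensors"))

uniques.show()

If exact uniques are required, the raw events have to be kept around and deduplicated in the batch job instead.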