You should also specify how you're planning to query or "publish" the data. I would consider a combination of:

- a Spark Streaming job that ingests the raw events in real time, validates and pre-processes them, and saves them to stable storage
- stable storage could be HDFS/Parquet or a database optimized for time series (HBase, Cassandra, etc.)
- a regular Spark job that you trigger via cron every day/week/month, OR direct queries against the DB (depending on how much data it holds and whether it supports secondary indexes), to build up partial aggregations (hourly/daily) that are easy to combine at query time
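The cron-triggered batch job can be scheduled with an ordinary crontab entry wrapping spark-submit; the paths, script name, and log location below are placeholders, not anything from your setup:

```cron
# Run the daily rollup job at 02:00 every day (all paths are hypothetical)
0 2 * * * /opt/spark/bin/spark-submit --master yarn /jobs/daily_rollup.py >> /var/log/daily_rollup.log 2>&1
```

For weekly or monthly runs you'd just change the schedule fields (e.g. `0 2 * * 0` for Sundays, `0 2 1 * *` for the first of the month).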
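The hourly/daily partial aggregations work for averages because the operation is associative: each bucket only needs a (sum, count) pair, and merging buckets then dividing at query time reproduces the exact average without touching the raw events. A minimal sketch in plain Python (the bucket layout is an illustration, not tied to any particular store):

```python
from collections import defaultdict

def rollup(events):
    """Collapse raw (hour, value) events into per-hour [sum, count] pairs."""
    buckets = defaultdict(lambda: [0.0, 0])
    for hour, value in events:
        buckets[hour][0] += value
        buckets[hour][1] += 1
    return dict(buckets)

def merged_average(*bucket_maps):
    """Combine partial aggregations (hourly -> daily -> monthly) into one average."""
    total_sum, total_count = 0.0, 0
    for buckets in bucket_maps:
        for s, c in buckets.values():
            total_sum += s
            total_count += c
    return total_sum / total_count

events = [(0, 10.0), (0, 20.0), (1, 30.0), (2, 40.0)]
hourly = rollup(events)
print(merged_average(hourly))  # exact average: 25.0
```

The same merge step works at every level, which is why a daily cron job can keep extending the rollups without ever re-reading the raw history.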
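Distinct-count metrics don't roll up this way exactly, but HLL (HyperLogLog) sketches do approximately: a sketch estimates the distinct count in fixed memory, and two sketches merge with a register-wise max, so partial sketches roll up just like the sums above. A toy implementation for illustration only (in practice you'd reach for a library, e.g. Spark's `approx_count_distinct`):

```python
import hashlib
import math

class HLL:
    """Toy HyperLogLog: estimates distinct counts using m = 2**p registers."""
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        idx = h & (self.m - 1)   # low p bits pick a register
        w = h >> self.p          # remaining bits determine the rank
        rank = 1                 # 1 + number of trailing zero bits
        while w & 1 == 0 and rank < 64:
            w >>= 1
            rank += 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Sketches combine with a register-wise max, so they roll up like sums."""
        for i, r in enumerate(other.registers):
            self.registers[i] = max(self.registers[i], r)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:  # small-range (linear counting) correction
            est = self.m * math.log(self.m / zeros)
        return int(est)
```

With p=12 (4096 registers, a few KB) the standard error is roughly 1.04/sqrt(4096), i.e. under 2%, which is the kind of ~99%-accuracy trade-off HLL aggregators offer.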
Your example of average is easy to do live on a DB if it has secondary indexes, as the operation is associative and can be gradually rolled up at the hourly/daily/monthly level. For "count distinct" or unique metrics it's tougher, as you'll need access to the raw data (unless you're willing to accept ~99% accuracy, in which case you can use HLL aggregators).

Hope this helps,
-adrian

On 11/5/15, 10:48 AM, "danilo" <dani.ri...@gmail.com> wrote:

>Hi All,
>
>I'm quite new to this topic and to Spark in general.
>
>I have a sensor that is pushing data in real time and I need to calculate
>some KPIs based on the data I have received. Given that some of the KPIs are
>related to very old data (e.g. average number of events in the last 3
>months), I was wondering what is the best approach to do this with Spark.
>
>The approach I'm currently following is creating partial KPIs in real time
>and then creating the other KPIs with a second Spark chain scheduled on a
>daily / weekly / monthly basis.
>
>Does that make sense? If so, how can I schedule Spark to run only once a
>day / week / month?
>
>Thx
>D
>
>--
>View this message in context:
>http://apache-spark-user-list.1001560.n3.nabble.com/Scheduling-Spark-process-tp25287.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>