You should also specify how you’re planning to query or “publish” the data. I 
would consider a combination of:
- a Spark Streaming job that ingests the raw events in real time, validates, 
pre-processes and saves them to stable storage (there's a rough sketch after this list)
  - stable storage could be HDFS/Parquet or a database optimized for time 
series (HBase, Cassandra, etc.)
- a regular Spark batch job that you trigger via cron every day/week/month 
(also sketched below), OR
- querying the DB directly, depending on how much data it holds and whether it 
supports secondary indexes or pre-built partial aggregations (hourly/daily) 
that are cheap to compute at query time

Your example of an average is easy to do live on a DB if it has secondary 
indexes: store partial (sum, count) pairs, which are associative, so they can 
be rolled up gradually at the hourly/daily/monthly level.
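A tiny illustration of why this works (plain Scala, made-up numbers): keep 
(sum, count) per hour and merge; the division only happens when you read the 
final value:

    // hourly partials roll up associatively; the average is computed only at the end
    case class Partial(sum: Double, cnt: Long) {
      def merge(other: Partial): Partial = Partial(sum + other.sum, cnt + other.cnt)
      def avg: Double = if (cnt == 0) 0.0 else sum / cnt
    }

    val hourly   = Seq(Partial(120.0, 10), Partial(80.0, 5), Partial(200.0, 25))
    val dailyAvg = hourly.reduce(_ merge _).avg   // 400.0 / 40 = 10.0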
For “count distinct” or other uniqueness metrics it’s tougher, as you’ll need 
access to the raw data (unless you’re willing to accept ~99% accuracy, in 
which case you can use HLL aggregators).
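If that trade-off is acceptable, the DataFrame API already exposes a 
HyperLogLog-based aggregate -- approxCountDistinct in the 1.x functions -- so 
you don’t have to wire up an HLL library yourself. Reusing the raw DataFrame 
from the batch sketch above (column names still hypothetical):

    import org.apache.spark.sql.functions.{approxCountDistinct, col, from_unixtime}

    // distinct sensors per day, within roughly 1% relative error
    val uniquePerDay = raw
      .withColumn("day", from_unixtime(col("ts") / 1000, "yyyy-MM-dd"))
      .groupBy("day")
      .agg(approxCountDistinct("sensorId", 0.01).as("unique_sensors"))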

Hope this helps,
-adrian



On 11/5/15, 10:48 AM, "danilo" <dani.ri...@gmail.com> wrote:

>Hi All,
>
>I'm quite new to this topic and to Spark in general. 
>
>I have a sensor that is pushing data in real time and I need to calculate
>some KPIs based on the data I have received. Given that some of the KPIs are
>related to very old data (e.g. the average number of events in the last 3
>months), I was wondering what is the best approach to do this with Spark. 
>
>The approach I'm currently following is creating partial KPIs in real time
>and then creating the other KPIs with a second Spark job scheduled on a daily
>/ weekly / monthly basis.
>
>Does this make sense? If so, how can I schedule Spark to run only once a day /
>week / month?
>
>Thx
>D
