Online evaluation of an MLlib model

2016-07-13 Thread Danilo Rizzo
Hi All, I'm trying to create an ML pipeline that is in charge of model
training.
In my use case I need to evaluate the model in real time from an
external application; from some googling I saw that I can submit a Spark
job using the submit API.

Not sure if this is the best way to achieve that - any thoughts? I'm
wondering whether it can handle a large number of model-evaluation requests
while keeping response times low enough for use in a web application.
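
To make the question concrete, what I have in mind is roughly a long-running
process that loads the trained model once and then scores incoming requests,
rather than submitting a new job per request. A minimal sketch, assuming the
spark.ml Pipeline API (the class name, path and schema below are made up):

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

object ModelServer {
  def main(args: Array[String]): Unit = {
    // One long-lived session instead of one spark-submit per request.
    val spark = SparkSession.builder()
      .appName("online-model-eval")
      .master("local[*]") // scoring single rows doesn't need a cluster
      .getOrCreate()
    import spark.implicits._

    // Load the fitted pipeline that the training job saved.
    val model = PipelineModel.load("hdfs:///models/my-pipeline")

    // Score one incoming request, faked here as a single-row DataFrame.
    val request = Seq((0.5, 1.2, 3.4)).toDF("f1", "f2", "f3")
    println(model.transform(request).select("prediction").first())
  }
}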

-- 
Danilo Rizzo


Re: Scheduling Spark process

2015-11-05 Thread Danilo Rizzo
Hi Adrian,

yes, your assumption is correct.

I'm using HBase for storing the partial calculations.

Thank you for the feedback - it is exactly what I had in mind.

Thx
D

On Thu, Nov 5, 2015 at 10:43 AM, Adrian Tanase <atan...@adobe.com> wrote:

> You should also specify how you’re planning to query or “publish” the
> data. I would consider a combination of:
> - a Spark Streaming job that ingests the raw events in real time,
> validates, pre-processes and saves them to stable storage (a skeleton is
> sketched below)
>   - stable storage could be HDFS/parquet or a database optimized for time
> series (HBase, Cassandra, etc.)
> - a regular Spark job that you trigger via cron every day/week/month, OR
> - querying the DB directly, depending on how much data it has and whether
> it supports secondary indexes that build up partial aggregations
> (hourly/daily) that are easy to compute at query time
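>
> A rough skeleton of the ingestion part, assuming a DStream source (the
> source host, validation and storage path below are made up):
>
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
>
> object SensorIngest {
>   def main(args: Array[String]): Unit = {
>     val conf = new SparkConf().setAppName("sensor-ingest")
>     val ssc = new StreamingContext(conf, Seconds(10))
>
>     // Hypothetical source; in practice this would be Kafka, Kinesis, etc.
>     val events = ssc.socketTextStream("sensor-gateway", 9999)
>
>     events
>       .filter(_.nonEmpty) // validate
>       .map(_.trim)        // pre-process
>       .foreachRDD { rdd =>
>         // Append each micro-batch to stable storage (HDFS here).
>         rdd.saveAsTextFile(s"hdfs:///events/raw/batch-${System.currentTimeMillis}")
>       }
>
>     ssc.start()
>     ssc.awaitTermination()
>   }
> }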
>
> Your example of average is easy to do live on a DB if it has secondary
> indexes, as the operation is associative and can be gradually rolled up at
> the hourly/daily/monthly level.
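>
> A tiny illustration of the associativity point - what you roll up is the
> (sum, count) pair, not the average itself:
>
> case class Partial(sum: Double, count: Long) {
>   def merge(that: Partial) = Partial(sum + that.sum, count + that.count)
>   def avg: Double = if (count == 0) 0.0 else sum / count
> }
>
> // Hourly partials combine into a daily average without touching raw data.
> val hourly = Seq(Partial(10.0, 4), Partial(6.0, 2), Partial(9.0, 3))
> println(hourly.reduce(_ merge _).avg)
>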
> For “count distinct” or unique metrics it’s tougher, as you’ll need access
> to the raw data (unless you’re willing to accept ~99% accuracy, in which
> case you can use HLL aggregators).
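>
> In Spark SQL that could look like the snippet below, assuming an events
> DataFrame with day and user_id columns (names are made up):
>
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.approxCountDistinct
>
> // Approximate distinct counts via HyperLogLog, ~1% relative error.
> def uniqueUsersPerDay(events: DataFrame): DataFrame =
>   events.groupBy("day")
>     .agg(approxCountDistinct("user_id", 0.01).as("unique_users"))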
>
> Hope this helps,
> -adrian
>
>
>
> On 11/5/15, 10:48 AM, "danilo" <dani.ri...@gmail.com> wrote:
>
> >Hi All,
> >
> >I'm quite new to this topic and to Spark in general.
> >
> >I have a sensor that is pushing data in real time, and I need to calculate
> >some KPIs based on the data I have received. Given that some of the KPIs
> >relate to quite old data (e.g. the average number of events over the last
> >3 months), I was wondering what the best approach is to do this with Spark.
> >
> >The approach I'm currently following is computing partial KPIs in real
> >time and then deriving the remaining KPIs with a second Spark job
> >scheduled on a daily / weekly / monthly basis.
> >
> >Does that make sense? If so, how can I schedule Spark to run only once a
> >day / week / month?
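> >
> >(For example, would a plain crontab entry that invokes spark-submit be
> >enough? Something like the line below - the class and jar names are
> >made up:)
> >
> ># run the daily roll-up job at 02:00 every day
> >0 2 * * * /opt/spark/bin/spark-submit --class com.example.KpiRollup /jobs/kpi-rollup.jar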
> >
> >Thx
> >D
> >
> >
> >
>



-- 
Danilo Rizzo