For me, the latency of model evaluation is more important than training
latency. This holds true for retraining / model updates as well. I would say
that the "evaluation / prediction" latency is the most critical one.

Your point regarding 3) is very interesting for me. I have 2 types of data:

- low volume information about a customer
- high volume usage data

The high volume data will require aggregation (e.g. Spark SQL) before the
model can be evaluated. Here, a higher latency would be OK. Regarding the
low volume data: some features will require some sort of SQL for extraction.
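Roughly, this is the kind of preparation I have in mind (a sketch only; the
usage_events and customers tables and their columns are invented for
illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("feature-prep").getOrCreate()

// High volume: pre-aggregate the usage data offline, where higher
// latency is acceptable.
val usageFeatures = spark.table("usage_events")
  .groupBy("customer_id")
  .agg(
    count(lit(1)).as("event_count"),
    sum("amount").as("total_amount"),
    max("event_time").as("last_seen"))

// Low volume: extract per-customer features with plain SQL.
val customerFeatures =
  spark.sql("SELECT customer_id, age, segment FROM customers")

// The joined result is what the model would be evaluated on.
val features =
  customerFeatures.join(usageFeatures, Seq("customer_id"), "left")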
On Tue, Sep 27, 2016 at 7:43 AM, Kenneth Chan <kenn...@apache.org> wrote:

> re: kappa vs lambda.
> As far as I understand, at a high level, kappa is more like a subset of
> lambda (i.e. it keeps only the real-time part).
>
> https://www.ericsson.com/research-blog/data-knowledge/data-processing-architectures-lambda-and-kappa/
>
> Georg, could you be more specific about what you mean by "latency
> requirement":
>
> 1. latency of training a model with new data?
> 2. latency of deploying a new model? or
> 3. latency of getting a predicted result from the previously trained
> model given a query?
>
> If you are talking about 3, it depends on how your model calculates the
> prediction. It doesn't need Spark if the model fits into memory.
>
>
> On Mon, Sep 26, 2016 at 9:41 PM, Georg Heiler <georg.kf.hei...@gmail.com>
> wrote:
>
>> Hi Donald,
>> For me it is more about stacking and meta learning. The selection of
>> models could be performed offline.
>>
>> But:
>> 1. I am concerned about keeping the model up to date - retraining.
>> 2. I'd like some sort of reinforcement learning to improve / punish
>> based on the correctness of new ground truth (1/month).
>> 3. I need very quick responses, especially an evaluation of a random
>> forest / GBT / NNet without starting a YARN job.
>>
>> Thank you all for the feedback so far.
>> Best regards,
>> Georg
>>
>> On Tue, Sep 27, 2016 at 6:34 AM, Donald Szeto <don...@apache.org> wrote:
>>
>>> Sorry for side-tracking. I think the Kappa architecture is a promising
>>> paradigm, but batch processing from the canonical store to the serving
>>> layer store should still be included. I believe this somewhat hybrid
>>> Kappa-Lambda architecture would be generic enough to handle many use
>>> cases. If this sounds good to everyone, we should drive PredictionIO
>>> in that direction.
>>>
>>> Georg, are you talking about updating an existing model in different
>>> ways, evaluating the variants, and selecting one within a time
>>> constraint, say every second?
>>>
>>> On Mon, Sep 26, 2016 at 4:11 PM, Pat Ferrel <p...@occamsmachete.com>
>>> wrote:
>>>
>>>> If you need the model updated in realtime you are talking about a
>>>> kappa architecture, and PredictionIO does not support that. It does
>>>> Lambda only.
>>>>
>>>> The MLlib-based recommenders use live contexts to serve from
>>>> in-memory copies of the ALS models, but the models themselves were
>>>> calculated in the background. There are several scaling issues with
>>>> doing this, but it can be done.
>>>>
>>>> On Sep 25, 2016, at 10:23 AM, Georg Heiler <georg.kf.hei...@gmail.com>
>>>> wrote:
>>>>
>>>> Wow, thanks. This is a great explanation.
>>>>
>>>> So when I think about writing a Spark template for fraud detection (a
>>>> combination of Spark SQL and XGBoost) that would require <1 second
>>>> latency, how should I store the model?
>>>>
>>>> As far as I know, the startup of YARN jobs, e.g. a Spark job, is too
>>>> slow for that. So it would be great if the model could be evaluated
>>>> without using the cluster, or at least with a hot Spark context
>>>> similar to spark-jobserver or SnappyData.io <http://snappydata.io>.
>>>> Is this possible with prediction.io?
>>>>
>>>> Regards,
>>>> Georg
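To make Kenneth's point 3 concrete: if the trained model fits in memory, the
serving process can score queries on the plain JVM, with no Spark context and
no YARN job startup. A minimal sketch, where FraudModel, its linear scoring,
and the model path are all made up for illustration:

import java.io._

// A tiny stand-in for a trained model; case classes are Serializable.
case class FraudModel(weights: Array[Double]) {
  // Plain JVM scoring; no SparkContext and no YARN job involved.
  def predict(features: Array[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum
}

object InMemoryServing {
  // Deserialize the model once at deploy time and keep it on the heap.
  val model: FraudModel = {
    val in = new ObjectInputStream(
      new FileInputStream("/tmp/fraud-model.bin"))  // hypothetical path
    try in.readObject().asInstanceOf[FraudModel] finally in.close()
  }

  // Each query is then answered from memory, typically in microseconds
  // to milliseconds.
  def serve(query: Array[Double]): Double = model.predict(query)
}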
>>>> On Sun, Sep 25, 2016 at 6:19 PM, Pat Ferrel <p...@occamsmachete.com>
>>>> wrote:
>>>>
>>>>> Gustavo is correct. To put it another way, both Oryx and PredictionIO
>>>>> are based on what is called a Lambda Architecture. Loosely speaking,
>>>>> this means a potentially slow background task computes the predictive
>>>>> "model", but this does not interfere with serving queries. Then, when
>>>>> the model is ready (stored in HDFS or Elasticsearch depending on the
>>>>> template), it is deployed and the switch happens in microseconds.
>>>>>
>>>>> In the case of the Universal Recommender the model is stored in
>>>>> Elasticsearch. During `pio train` the new model is inserted into
>>>>> Elasticsearch and indexed. Once the indexing is done, the index alias
>>>>> used to serve queries is switched to the new index in one atomic
>>>>> action, so there is no downtime and any slow operation happens in the
>>>>> background without impeding queries.
>>>>>
>>>>> The answer will vary somewhat with the template. Templates that use
>>>>> HDFS for storage may need to be re-deployed, but even then the switch
>>>>> from the old model to the new one takes microseconds.
>>>>>
>>>>> PMML is not relevant to the discussion above and is in any case
>>>>> useless for many model types, including recommenders. If you look
>>>>> carefully at how it is implemented in Oryx you will see that the PMML
>>>>> "models" for recommenders are not actually stored as PMML; only a
>>>>> minimal description of where the real data is stored is in PMML.
>>>>> Remember that PMML has all the problems of XML, including no good way
>>>>> to read it in parallel.
>>>>>
>>>>> On Sep 25, 2016, at 7:47 AM, Gustavo Frederico <
>>>>> gustavo.freder...@thinkwrap.com> wrote:
>>>>>
>>>>> I understand that querying PredictionIO is very fast, as if it were
>>>>> an Elasticsearch query. Also recall that training happens at a
>>>>> different moment and often takes a long time in most learning
>>>>> systems, but as long as it's not ridiculously long, it doesn't matter
>>>>> that much.
>>>>>
>>>>> Gustavo
>>>>>
>>>>> On Sun, Sep 25, 2016 at 2:30 AM, Georg Heiler <
>>>>> georg.kf.hei...@gmail.com> wrote:
>>>>> > Hi PredictionIO users,
>>>>> > I wonder what the delay of an engine evaluating a model in
>>>>> > prediction.io is. Are the models cached?
>>>>> >
>>>>> > Another project, http://oryx.io/, generates PMML which can be
>>>>> > evaluated quickly from a production application.
>>>>> >
>>>>> > I believe that the latency until the prediction happens is very
>>>>> > often overlooked. How does PredictionIO handle this topic?
>>>>> >
>>>>> > Best regards,
>>>>> > Georg
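The atomic switch Pat describes corresponds to Elasticsearch's _aliases
endpoint, which applies a remove and an add as one atomic action. A minimal
sketch of that flip (the alias and index names are invented, and the
Universal Recommender's actual code differs):

import java.net.{HttpURLConnection, URL}

object AliasSwitch {
  def swapAlias(esUrl: String, alias: String,
                oldIndex: String, newIndex: String): Int = {
    // Both actions happen in one atomic step, so queries against the
    // alias never see a missing or half-built index.
    val body =
      s"""{"actions": [
         |  {"remove": {"index": "$oldIndex", "alias": "$alias"}},
         |  {"add":    {"index": "$newIndex", "alias": "$alias"}}
         |]}""".stripMargin

    val conn = new URL(s"$esUrl/_aliases")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(body.getBytes("UTF-8"))
    conn.getResponseCode  // 200 means the alias now points at the new index
  }
}

// e.g. AliasSwitch.swapAlias("http://localhost:9200", "ur_serving",
//                            "ur_model_v1", "ur_model_v2")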