2013/9/30 Eustache DIEMERT <[email protected]>:
> Hi List,
>
> The topic of prediction latency seems to pop up from time to time, either
> on the list or on SO [1].
>
> I'm pondering the idea of creating a performance benchmark in the spirit
> of the out-of-core example.
>
> This benchmark could be used to compare and track the prediction time of
> a few popular algorithms in sklearn.
>
> It could also be used to show new users what they can expect from the
> different algorithms, and help them choose a solution that fits their
> requirements.
>
> I don't know whether other machine learning toolkits have this kind of
> benchmark (and do we already have various pieces out there?), but I think
> it would be a great feature for "enterprise" users, who often need to
> choose a toolkit in a limited amount of time and then stick with it for
> the life of the project.
>
> It could also lend weight to the idea that Python can/should be fast ;)
>
> Any thoughts on that?
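As a starting point for such a benchmark, here is a minimal, hypothetical sketch (the estimator, dataset sizes and iteration counts are arbitrary choices, not anything from the thread) that measures the two quantities Olivier distinguishes below: per-call latency of atomic predictions versus throughput of a single batched call.

```python
# Minimal sketch of a prediction-time micro-benchmark:
# atomic latency (one sample per predict() call, including the Python
# call and input-validation overhead) vs. batch throughput.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=100, random_state=0)
clf = SGDClassifier(random_state=0).fit(X, y)

# Atomic latency: one sample per call.
n_calls = 100
start = time.perf_counter()
for i in range(n_calls):
    clf.predict(X[i:i + 1])
latency = (time.perf_counter() - start) / n_calls

# Batch throughput: all samples in one vectorized call.
start = time.perf_counter()
clf.predict(X)
throughput = X.shape[0] / (time.perf_counter() - start)

print("atomic latency: %.1f us/prediction" % (latency * 1e6))
print("batch throughput: %.0f predictions/s" % throughput)
```

On most machines the per-prediction cost in the batched call is orders of magnitude lower than the atomic latency, which is exactly the tradeoff such a benchmark would document.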
I think it would be great to have a topical / howto section on the main
concepts to be aware of when analyzing the performance tradeoffs at
prediction time. For instance, explain the difference between throughput
(number of predictions achievable per second, possibly in batch) and
latency (time to perform a single, atomic prediction).

Then give orders of magnitude for analyzing the latency and throughput of
predictions by considering:

- feature extraction time [e.g. a couple of MB/s on a single CPU for
  typical text feature extraction],
- Python function call and input validation overhead [in the nanosecond to
  microsecond range],
- the impact of model complexity:
  - the number of (non-zero) dimensions for a linear model (and the
    `sparsify()` trick on linear models),
  - the number of support vectors in a non-linear SVM,
  - the number and depth of decision trees in RF or GBRT models...

Also highlight the fact that models that use BLAS internally for
prediction (typically linear models) can benefit from packing hundreds or
thousands of predictions into a single batch call, to better exploit CPU
cache locality, SSE instructions and multithreading.

Also: memory usage optimization: loading a model with
joblib.load(filename, mmap_mode='r') makes it possible to share its memory
across processes. For instance, several gunicorn workers (distinct Python
processes) of a flask API exposing a scikit-learn model as an HTTP service
can load a large model (e.g. a big extra trees ensemble) only once in
memory, rather than each loading its own copy.

We could even have a subsection for each class of models and give
additional pointers, as Peter did for the SO question on GBRTs.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
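To illustrate the `sparsify()` trick mentioned above, here is a small sketch (the dataset and regularization settings are arbitrary illustrative choices): after fitting an L1-regularized linear model, `sparsify()` converts `coef_` to a scipy sparse matrix in place, so prediction only touches the non-zero weights.

```python
# Sketch of the sparsify() trick for linear models: with L1 regularization
# many coefficients are exactly zero, and storing coef_ as a sparse matrix
# speeds up and shrinks prediction on high-dimensional inputs.
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, n_features=300, n_informative=10,
                           random_state=0)
clf = SGDClassifier(penalty="l1", alpha=0.001, random_state=0).fit(X, y)

clf.sparsify()  # coef_ becomes a scipy.sparse matrix, in place

print("coef_ is sparse:", sparse.issparse(clf.coef_))
print("non-zero coefficients: %d of %d" % (clf.coef_.nnz, X.shape[1]))
pred = clf.predict(X[:5])  # predict() works unchanged on the sparsified model
```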
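And a sketch of the memory-mapped loading pattern for sharing a large model between worker processes. This assumes a model already persisted with joblib.dump() to an uncompressed file (the estimator, file location and sizes here are illustrative, not from the thread); with mmap_mode='r' the large numpy arrays inside the model are backed by the OS page cache, so N workers share one read-only copy instead of N private copies.

```python
# Sketch: persist a model once, then load it memory-mapped (read-only).
# Each gunicorn/flask worker process would run the joblib.load() line at
# startup; the arrays are mmap'ed, so the OS shares the pages across
# processes instead of duplicating the model in every worker.
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=200, random_state=0)
model = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)

# In each worker process:
shared = joblib.load(path, mmap_mode="r")
pred = shared.predict(X[:1])
```

Note that mmap sharing only helps for models whose bulk is plain numpy arrays stored uncompressed; prediction must also only read the arrays, which is the case for the ensemble predictors.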
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general