>2013/9/23 Fred Baba <[email protected]>:
> System performance is currently on the order of ~1us, so Python overhead
> would be unacceptable. For SVM, I'll extract the support vectors and
> investigate using libSVM directly, as per federico vaggi's advice. +1 for
> PMML support at some point down the road. Thanks for the quick responses.

If you really are dealing with microsecond per-prediction latencies,
then even in C you will probably have to abandon models with a high
prediction-time complexity. I was curious, so I ran the following
experiment:

>>> from sklearn.datasets import load_digits
>>> from sklearn.svm import SVC, LinearSVC
>>> digits = load_digits()

Complex non linear model:

>>> model = SVC(gamma=0.001, C=10).fit(digits.data, digits.target)
>>> %timeit _ = model.predict(digits.data[42:43])
10000 loops, best of 3: 131 µs per loop

Simple linear model:

>>> model2 = LinearSVC(C=10).fit(digits.data, digits.target)
>>> %timeit _ = model2.predict(digits.data[42:43])
10000 loops, best of 3: 49.7 µs per loop

model2's decision function is a very simple dot product between the
feature vector and the learned weights. I have not checked, but it is
possible that most of the 50 µs spent in predict is actually spent in
scikit-learn's Python input-checking boilerplate, with very little
time spent inside the BLAS call that computes the dot product.
However, one can reasonably assume that an instance of the SVC class
carries the same level of Python overhead, and it is more than twice
as slow as the linear model. That means at least 70 µs are likely
spent inside libsvm itself (for this specific model, scikit-learn is
a wrapper around libsvm).
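The overhead hypothesis is easy to probe: since LinearSVC's decision is
just a dot product, one can compute it directly with NumPy and skip the
input-validation path of predict entirely. A minimal sketch, retraining
the same model2 as above:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import LinearSVC

digits = load_digits()
model2 = LinearSVC(C=10).fit(digits.data, digits.target)

x = digits.data[42:43]

# The linear decision function is just x @ W.T + b; computing it
# directly with NumPy bypasses scikit-learn's Python boilerplate.
scores = x @ model2.coef_.T + model2.intercept_
pred = model2.classes_[np.argmax(scores, axis=1)]

# Same label as the full predict call, without the validation overhead.
assert pred[0] == model2.predict(x)[0]
```

Timing this raw version against model2.predict would tell you how much
of the 50 µs is pure Python overhead.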

This model only has 803 support vectors in 64 dim space:

>>> model.support_vectors_.shape
(803, 64)

So for any non-trivial, non-linear model you are likely to spend more
than a couple of tens of µs predicting the outcome of a single
sample.
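The reason the non-linear model cannot go much faster is that its
prediction cost scales with the number of support vectors: each sample
requires one kernel evaluation per support vector. A rough sketch of
what libsvm has to compute for this model (the kernel math only, not
libsvm's actual implementation):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC

digits = load_digits()
model = SVC(gamma=0.001, C=10).fit(digits.data, digits.target)

x = digits.data[42:43]
sv = model.support_vectors_  # shape ~ (803, 64)

# One RBF kernel evaluation per support vector:
# k(x, sv_i) = exp(-gamma * ||x - sv_i||^2)
sq_dists = ((sv - x) ** 2).sum(axis=1)
kernel_values = np.exp(-model.gamma * sq_dists)

# ~800 kernel evaluations in 64-dim space for a single prediction.
assert kernel_values.shape[0] == sv.shape[0]
```

Pruning support vectors or switching to a linear model are the main
ways to shrink this per-sample cost.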

Also, I don't know your architecture, but it is very likely that your
input data is not directly a vector or array of numerical feature
values ready for consumption by libsvm or whatever machine learning
implementation you end up choosing. You probably have to do some
feature extraction on your raw data, be it rows from a DB, JSON
events sent by a frontend application, or so on. In many cases this
feature extraction layer is even slower than computing the prediction
itself.

Also remember that some models can really benefit from packing
predictions together:

In [18]: %timeit _ = model.predict(digits.data[0:100])
100 loops, best of 3: 7.22 ms per loop

That's 72 µs per prediction. The speedup probably comes from the fact
that many predictions can be packed together into a single DGEMM BLAS
call for the matrix multiplication performed before computing the
non-linear kernel activations (our patched libsvm can leverage dense
feature representations).

This is even more the case for a dense linear model:

In [19]: %timeit _ = model2.predict(digits.data[0:100])
10000 loops, best of 3: 198 µs per loop

That's 2 µs per prediction, which is getting closer to your
requirements. Packing 1000 predictions at once brings the individual
prediction time below 0.6 µs for a linear model:

>>> %timeit _ = model2.predict(digits.data[0:1000])
1000 loops, best of 3: 569 µs per loop
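To see where the amortization comes from, compare one batched predict
call against 1000 separate single-sample calls on the same data (a
small sketch, retraining model2 as above); both return identical
labels, but the batched call pays the Python and input-validation
overhead once and does a single GEMM internally:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import LinearSVC

digits = load_digits()
model2 = LinearSVC(C=10).fit(digits.data, digits.target)
X = digits.data[0:1000]

# One call on the whole batch: overhead paid once, single GEMM.
batched = model2.predict(X)

# 1000 calls: overhead paid 1000 times, 1000 tiny dot products.
looped = np.array([model2.predict(X[i:i + 1])[0] for i in range(len(X))])

# Identical labels either way; only the cost differs.
assert (batched == looped).all()
```

So if your system can tolerate buffering samples into small batches,
packing is the cheapest way to approach sub-microsecond per-sample
cost.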

I am using the OS X implementation of BLAS, which is not the best. A
hand-built ATLAS or MKL might be a bit faster.

That said, if you are willing to contribute PMML exporters for some
sklearn models, probably as a side project to scikit-learn, I am sure
many users would like it. However, if the motivation is only the
individual prediction latency caused by the Python interpreter
overhead, I am not sure it is worth it.

-- 
Olivier

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general