Hi there,

I'm using *GBTClassifier* to do some classification jobs and found that the
performance of the scoring stage is not satisfactory. The trained model has
about 160 trees, and the input feature vectors are sparse with a size of
about 20+.

After some digging, I found that the model repeatedly and randomly accesses
features in the SparseVector when predicting an input vector, which
eventually calls *breeze.linalg.SparseVector#apply*. That function
uses a binary search to locate the corresponding index, so the
complexity is O(log numNonZero).
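To make the cost concrete, here is a minimal sketch of that lookup pattern (this mirrors, but is not, Breeze's actual code): the non-zero indices are kept sorted, and each element read is a binary search over them.

```java
import java.util.Arrays;

// Sketch (not Breeze's actual implementation) of a sparse-vector
// element read: binary search over the sorted index array.
public class SparseLookup {
    // indices must be sorted ascending; values[k] pairs with indices[k]
    static double get(int[] indices, double[] values, int i) {
        int pos = Arrays.binarySearch(indices, i); // O(log numNonZero)
        return pos >= 0 ? values[pos] : 0.0;       // absent index => implicit zero
    }
}
```

Every tree in the ensemble repeats this search at each internal node, which is where the log factor adds up.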

Then I tried converting my feature vectors to dense vectors before
inference, and the results show that the inference stage speeds up by
about 2-3x. (Random access in a DenseVector is O(1).)
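The conversion itself is a single pass over the non-zeros. A standalone sketch of the idea (with Spark's ml.linalg API you would call SparseVector#toDense instead; this version only illustrates it):

```java
// Sketch: materialize the sparse (size, indices, values) triple into a
// plain dense array once, so every later element read is an O(1)
// array access instead of a binary search.
public class Densify {
    static double[] toDense(int size, int[] indices, double[] values) {
        double[] dense = new double[size];   // zero-initialized by default
        for (int k = 0; k < indices.length; k++) {
            dense[indices[k]] = values[k];   // one pass over the non-zeros
        }
        return dense;
    }
}
```

With only ~20 features the extra memory is negligible, so the trade is almost free here.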

So my question is: why not use *breeze.linalg.HashVector* when randomly
accessing values in a SparseVector? Its random access is O(1) according to
Breeze's documentation, much better than SparseVector in this case.
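For illustration, the HashVector idea amounts to backing the non-zeros with a hash table (Breeze's breeze.linalg.HashVector uses an open-addressing table internally; this HashMap-based version is only a sketch of the access cost, not Breeze's code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: keep the non-zeros in a hash table so a random element read
// is expected O(1) rather than O(log numNonZero).
public class HashLookup {
    final Map<Integer, Double> nonZeros = new HashMap<>();

    HashLookup(int[] indices, double[] values) {
        for (int k = 0; k < indices.length; k++) {
            nonZeros.put(indices[k], values[k]);
        }
    }

    double get(int i) {
        return nonZeros.getOrDefault(i, 0.0); // expected O(1), missing index => zero
    }
}
```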

Thanks,
Vincent
