2013/7/17 Harold Nguyen <[email protected]>:
> Hello Scikit-learn community!
>
> I was just wondering if anyone was using Cassandra
> as a datastore for scikit-learn, and what your data
> pipeline architecture looks like ? Do you just use Pycassa
> to get the data, and run scikit-learn off of it ?

I don't, but any database will do. You just need a way to turn any
record in your database into a 1D numpy array of numerical features
(this is called feature extraction). Several feature vectors can then
be packed into a 2D numpy array with shape (batch_size, n_features).
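
To make that concrete, here is a minimal sketch of that feature
extraction step. The record fields (`age`, `height`, `weight`) and the
`records` list are made up for illustration; in practice they would
come from your datastore (e.g. via Pycassa):

```python
import numpy as np

# Hypothetical records fetched from the datastore
records = [
    {"age": 42, "height": 1.75, "weight": 70.0},
    {"age": 30, "height": 1.60, "weight": 55.5},
]

def extract_features(record):
    """Turn one record into a 1D numpy array of numerical features."""
    return np.array([record["age"], record["height"], record["weight"]])

# Pack several feature vectors into a 2D array (batch_size, n_features)
X = np.vstack([extract_features(r) for r in records])
```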

If you have high-cardinality categorical variables in your feature
vectors, you might want to use one-hot encoding in scipy sparse
matrices instead.

Have a look at this part of the documentation, for instance, on
transforming a list of Python dicts into a scipy sparse matrix:

http://scikit-learn.org/dev/modules/feature_extraction.html

If the data is not too sparse (not too many zeros) it's sometimes
easier to work with numpy arrays. You can convert a scipy.sparse
matrix to a numpy array by calling the `.toarray()` method of the
matrix.

> How do you iterate through the data when modeling so that
> all the data doesn't fit into memory ? (I'd like to use all
> the data in our Cassandra cluster for modeling/training/etc...)

What you want is out-of-core learning. This can be implemented by
using models that have a `partial_fit` method. For classification
there are:

- sklearn.linear_model.SGDClassifier
- sklearn.linear_model.PassiveAggressiveClassifier
- sklearn.linear_model.Perceptron
- sklearn.naive_bayes.MultinomialNB / sklearn.naive_bayes.BernoulliNB

For regression:

- sklearn.linear_model.SGDRegressor
- sklearn.linear_model.PassiveAggressiveRegressor

For clustering:

- sklearn.cluster.MiniBatchKMeans
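
The `partial_fit` loop looks like this in practice. The batch
generator below stands in for whatever iterates over your Cassandra
rows; note that for classifiers all possible class labels must be
passed on the first call, since any single batch may not contain all
of them:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])  # must be known up front

# Hypothetical generator yielding mini-batches from the datastore
def iter_batches(n_batches=10, batch_size=100, n_features=5):
    rng = np.random.RandomState(0)
    for _ in range(n_batches):
        X = rng.rand(batch_size, n_features)
        y = (X[:, 0] > 0.5).astype(int)
        yield X, y

# Incrementally update the model, one batch at a time
for X_batch, y_batch in iter_batches():
    clf.partial_fit(X_batch, y_batch, classes=all_classes)
```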

There is an example of out-of-core text classification with stateless
feature extraction here:

http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html

However, before embarking on out-of-core learning, you should make
sure you can get accurate models on a subsample of the data that fits
in memory. Batch learning using the regular `fit` API is much easier
to start with and provides more tools, such as cross validation and
grid search, to find working model parameters from the data.
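
For example, with a recent scikit-learn, a grid search with cross
validation over a subsample might look like this (the synthetic
dataset and parameter grid are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

# Stand-in for an in-memory subsample of the full dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Cross-validated search over the regularization strength
params = {"alpha": [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(SGDClassifier(random_state=0), params, cv=3)
search.fit(X, y)
```

Once this confirms the model family and parameters work on the
subsample, the same settings can be reused in the `partial_fit` loop
over the full dataset.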

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
