Hi Olivier,
Thank you very much. Could this potentially take a long time? Is there a
way to do batch processing, or parallel computing (a la Mahout-ish)?
Harold
On Wed, Jul 17, 2013 at 9:42 AM, Olivier Grisel <[email protected]> wrote:
> 2013/7/17 Harold Nguyen <[email protected]>:
> > Hello Scikit-learn community!
> >
> > I was just wondering if anyone was using Cassandra
> > as a datastore for scikit-learn, and what your data
> > pipeline architecture looks like ? Do you just use Pycassa
> > to get the data, and run scikit-learn off of it ?
>
> I don't, but any database will do. You just need a way to turn each
> record in your database into a 1D numpy array of numerical features
> (this is called feature extraction). Several feature vectors can then
> be packed into a 2D numpy array with shape (batch_size, n_features).
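A minimal sketch of that packing step, assuming the records have already been fetched as Python dicts. The field names and the `extract_features` helper are hypothetical and only serve as illustration:

```python
import numpy as np

# Hypothetical records, standing in for rows fetched from the database.
records = [
    {"age": 34, "income": 52000.0, "n_visits": 3},
    {"age": 27, "income": 48000.0, "n_visits": 7},
    {"age": 45, "income": 91000.0, "n_visits": 1},
]

# Fixed column order so that every feature vector is laid out the same way.
FEATURES = ("age", "income", "n_visits")


def extract_features(record):
    """Turn one record into a 1D array of numerical features."""
    return np.array([record[name] for name in FEATURES], dtype=np.float64)


# Pack the feature vectors into one 2D array of shape (batch_size, n_features).
X = np.vstack([extract_features(r) for r in records])
```

Here `X.shape` is `(3, 3)`: three records, three features each.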
>
> If you have high-cardinality categorical variables in your feature
> vector you might want to use one-hot encoding in scipy sparse matrices
> instead.
>
> Have a look at this part of the documentation to see how to transform
> a list of Python dicts into a scipy sparse matrix, for instance:
>
> http://scikit-learn.org/dev/modules/feature_extraction.html
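As a rough illustration of what that documentation page covers, here is a sketch using `DictVectorizer` from `sklearn.feature_extraction`, which one-hot encodes string-valued fields and passes numerical ones through. The records and their field names are made up for the example:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records mixing categorical (string) and numerical fields.
records = [
    {"country": "FR", "browser": "firefox", "n_visits": 3},
    {"country": "US", "browser": "chrome", "n_visits": 7},
    {"country": "FR", "browser": "chrome", "n_visits": 1},
]

# String values become one column per (field, value) pair; numbers stay as is.
vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform(records)  # scipy sparse matrix

# Maps each generated feature name (e.g. "country=FR") to its column index.
columns = vectorizer.vocabulary_
```

With these records the matrix has 5 columns: two for `browser`, two for `country`, and one for `n_visits`.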
>
> If the data is not too sparse (not too many zeros) it's sometimes
> easier to work with numpy arrays. You can convert a scipy.sparse
> matrix to a numpy array by calling the `.toarray()` method of the
> matrix.
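A small sketch of that conversion, with a toy matrix. The density check is only one possible way to decide, not a rule from the library:

```python
import numpy as np
from scipy import sparse

# Toy sparse matrix; in practice this would come from feature extraction.
X_sparse = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                       [0.0, 3.0, 0.0]]))

# Fraction of stored non-zero entries, as a rough guide to sparsity.
density = X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1])

# Dense copy as a regular numpy array; only do this if it fits in memory.
X_dense = X_sparse.toarray()
```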
>
> > How do you iterate through the data when modeling so that
> > all the data doesn't fit into memory ? (I'd like to use all
> > the data in our Cassandra cluster for modeling/training/etc...)
>
> What you want is out-of-core learning. This can be implemented by
> using models that have a `partial_fit` method. For classification
> there is:
>
> - sklearn.linear_model.SGDClassifier
> - sklearn.linear_model.PassiveAggressiveClassifier
> - sklearn.linear_model.Perceptron
> (and maybe one naive bayes as well)
>
> For regression:
>
> - sklearn.linear_model.SGDRegressor
> - sklearn.linear_model.PassiveAggressiveRegressor
>
> For clustering:
>
> - sklearn.cluster.MiniBatchKMeans
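The general shape of such an out-of-core loop might look like the sketch below. The `iter_batches` generator is a hypothetical stand-in for whatever reads mini-batches from the datastore, and here it just produces synthetic data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_batches(n_batches=20, batch_size=100, n_features=5, seed=0):
    """Hypothetical stand-in for a database cursor: yields (X, y) batches."""
    rng = np.random.RandomState(seed)
    w = np.arange(1, n_features + 1, dtype=np.float64)
    for _ in range(n_batches):
        X = rng.randn(batch_size, n_features)
        y = (X.dot(w) > 0).astype(int)  # synthetic, linearly separable labels
        yield X, y


clf = SGDClassifier(random_state=0)

# partial_fit needs the full set of labels on the first call, because a
# single batch is not guaranteed to contain every class.
all_classes = np.array([0, 1])

# Only one batch is held in memory at a time.
for X_batch, y_batch in iter_batches():
    clf.partial_fit(X_batch, y_batch, classes=all_classes)

# Evaluate on a batch drawn with a different seed.
X_test, y_test = next(iter_batches(n_batches=1, seed=42))
accuracy = clf.score(X_test, y_test)
```

The other estimators listed above expose `partial_fit` in the same way, although the regressors and `MiniBatchKMeans` do not take a `classes` argument.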
>
> There is an example for out of core text classification with stateless
> feature extraction here:
>
>
> http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html
>
> However, before embarking on out-of-core learning, you should make
> sure you get accurate models on a subsample of the data that fits in
> memory. Batch learning using the regular `fit` API is much easier to
> start with and provides more tools such as cross validation and grid
> search to find working model parameters from the data.
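A sketch of that in-memory workflow, assuming a recent scikit-learn where grid search lives in `sklearn.model_selection`. The synthetic dataset and the parameter grid are placeholders for a real subsample and a real search space:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder for a subsample of the real data that fits in memory.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hypothetical search space: regularization strength of the linear model.
param_grid = {"alpha": [1e-4, 1e-3, 1e-2]}

# 3-fold cross-validated grid search using the regular batch `fit` API.
search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_score = search.best_score_  # mean cross-validated accuracy
```

Parameters found this way can then be reused as a starting point for the same estimator trained incrementally with `partial_fit`.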
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>