2013/7/17 Harold Nguyen <[email protected]>:
> Hello Scikit-learn community!
>
> I was just wondering if anyone was using Cassandra
> as a datastore for scikit-learn, and what your data
> pipeline architecture looks like? Do you just use Pycassa
> to get the data, and run scikit-learn off of it?
I don't, but any database will do. You just need a way to turn any record
in your database into a 1D numpy array of numerical features (this is
called feature extraction). Several feature vectors can then be packed
into a 2D numpy array with shape (batch_size, n_features).

If you have high-cardinality categorical variables in your feature vector,
you might want to use one-hot encoding in scipy sparse matrices instead.
Have a look at this part of the documentation on transforming a list of
Python dicts into a scipy sparse matrix, for instance:

http://scikit-learn.org/dev/modules/feature_extraction.html

If the data is not too sparse (not too many zeros), it's sometimes easier
to work with numpy arrays. You can convert a scipy.sparse matrix to a
numpy array by calling the `.toarray()` method of the matrix.

> How do you iterate through the data when modeling so that
> all the data doesn't fit into memory? (I'd like to use all
> the data in our Cassandra cluster for modeling/training/etc...)

What you want is out-of-core learning. This can be implemented by using
models that have a `partial_fit` method.

For classification there are:

- sklearn.linear_model.SGDClassifier
- sklearn.linear_model.PassiveAggressiveClassifier
- sklearn.linear_model.Perceptron

(and maybe one of the naive Bayes models as well)

For regression:

- sklearn.linear_model.SGDRegressor
- sklearn.linear_model.PassiveAggressiveRegressor

For clustering:

- sklearn.cluster.MiniBatchKMeans

There is an example of out-of-core text classification with stateless
feature extraction here:

http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html

However, before embarking on out-of-core learning, you should make sure
you can get accurate models on a subsample of the data that fits in
memory. Batch learning using the regular `fit` API is much easier to start
with and provides more tools, such as cross-validation and grid search, to
find working model parameters from the data.
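As a concrete sketch of the feature extraction step: DictVectorizer is the
tool from the documentation link above for turning a list of Python dicts
into a scipy sparse matrix. The record field names and values below are
made up for illustration; in your case each dict would be built from a row
fetched from Cassandra:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records, as if fetched one row at a time from the database.
records = [
    {"duration": 12.5, "country": "FR", "clicks": 3},
    {"duration": 7.0, "country": "US", "clicks": 1},
]

vec = DictVectorizer()          # yields a scipy.sparse matrix by default
X = vec.fit_transform(records)  # one-hot encodes the string-valued fields

# Densify only if the data is not too sparse (not too many zeros).
X_dense = X.toarray()
print(X_dense.shape)
```

Here the categorical `country` field is expanded into one column per
observed value, while the numeric fields each keep a single column.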
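The out-of-core loop itself can be sketched like this, using SGDClassifier
from the list above. The mini-batches here are synthetic random data with a
made-up labeling rule; in a real pipeline each batch would come from a
Cassandra query followed by the feature extraction step:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(42)
clf = SGDClassifier(random_state=42)

# partial_fit needs the full set of class labels on the first call,
# since later mini-batches may not contain every class.
all_classes = np.array([0, 1])

# Simulate streaming mini-batches of 100 records with 5 features each.
for _ in range(10):
    X_batch = rng.rand(100, 5)
    # Toy labeling rule for illustration: class depends on the first feature.
    y_batch = (X_batch[:, 0] > 0.5).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=all_classes)

# Evaluate on a held-out batch generated the same way.
X_test = rng.rand(200, 5)
y_test = (X_test[:, 0] > 0.5).astype(int)
acc = clf.score(X_test, y_test)
print(acc)
```

Only one mini-batch needs to fit in memory at any time, which is the whole
point: the model's memory footprint stays constant no matter how much data
sits in the cluster.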
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
