Check out SGDClassifier and its partial_fit() method. I've used these to good effect.
Also, PROTIP: if you want decent help, don't piggy-back on threads that
have nothing to do with your question.
Just sayin'.
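
For what it's worth, here is a minimal sketch of the partial_fit() approach,
assuming the data can be read in chunks that each fit in memory. iter_chunks(),
TRUE_W and the synthetic data are hypothetical stand-ins for a real chunked
reader; only SGDClassifier and partial_fit() come from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

N_FEATURES = 10
# Hypothetical "true" weights, used only to generate learnable synthetic labels.
TRUE_W = np.random.RandomState(42).randn(N_FEATURES)


def iter_chunks(n_chunks, chunk_size, seed):
    """Hypothetical stand-in for reading a large file one chunk at a time."""
    rng = np.random.RandomState(seed)
    for _ in range(n_chunks):
        X = rng.randn(chunk_size, N_FEATURES)
        y = (X.dot(TRUE_W) > 0).astype(int)
        yield X, y


clf = SGDClassifier(random_state=0)
# partial_fit() needs the full set of labels on the first call, because no
# single chunk is guaranteed to contain every class.
classes = np.array([0, 1])

for X_chunk, y_chunk in iter_chunks(n_chunks=20, chunk_size=500, seed=0):
    # Each call updates the model with one chunk; earlier chunks can be discarded.
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

# Score on a held-out chunk generated with a different seed.
X_test, y_test = next(iter_chunks(n_chunks=1, chunk_size=1000, seed=1))
accuracy = clf.score(X_test, y_test)
print("held-out accuracy: %.3f" % accuracy)
```

With real data, the generator would read successive slices of the file instead,
and feature scaling would have to be handled per chunk or with fixed statistics.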
On 16 November 2012 12:23, Ronnie Ghose <[email protected]> wrote:
> Any ideas for online learning with scikit-learn? I have a data set that is
> larger than 20 GB that I want to train on. I don't think I can do that
> easily, so what should I do?
>
> Thanks,
> Shomiron Ghose
>
>
> On 15 November 2012 15:45, Fred Mailhot <[email protected]> wrote:
>
>> Dear list,
>>
>> I'm using GridSearchCV to do some simple model selection for a text
>> classification task. I've got it working (see below for caveat), but I'm
>> not convinced that I'm making the best use of this tool. If someone has the
>> time/inclination, I'd love a set of eyes to check the following gist to see
>> if I'm doing this correctly:
>>
>> https://gist.github.com/e2ca1910450819a8a28
>>
>> Also, for some reason this is throwing errors when I set n_jobs to
>> anything other than 1. I'm on OS X 10.7.4, using sklearn 0.13. The
>> traceback looks like:
>>
>> Process PoolWorker-1:
>> Traceback (most recent call last):
>>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 232, in _bootstrap
>>     self.run()
>>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 88, in run
>>     self._target(*self._args, **self._kwargs)
>>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 59, in worker
>>     task = get()
>>   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 352, in get
>>     return recv()
>> TypeError: ('data type not understood', <type 'numpy.dtype'>, ('S0', 0, 1))
>> Process PoolWorker-2:
>> [...etc etc ad infinitum]
>>
>> Has anyone else come across this, or perhaps have any insight into what's
>> going on? Needless to say, this grid search is taking FOREVER (ca. 10hrs
>> thus far, and only about halfway through), and I'd love to be able to
>> parallelize it.
>>
>> Many thanks,
>> Fred.
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Monitor your physical, virtual and cloud infrastructure from a single
>> web console. Get in-depth insight into apps, servers, databases, vmware,
>> SAP, cloud infrastructure, etc. Download 30-day Free Trial.
>> Pricing starts from $795 for 25 servers or applications!
>> http://p.sf.net/sfu/zoho_dev2dev_nov
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
>