Hi Ralf,

The GaussianProcess class computes and stores the full matrix of
componentwise (Manhattan-style) cross-distances between all pairs of
training samples, so the fitted object can quickly take up a huge amount of
memory. One option is to discard this big matrix after fit by passing the
storage_mode='light' kwarg at instantiation (the default, 'full', keeps
everything).
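
For a rough sense of scale, here is a back-of-envelope sketch, assuming (as
I read the code) the distances are kept as a float64 array of shape
(n*(n-1)/2, n_features):

n_samples, n_features = 7000, 16
n_pairs = n_samples * (n_samples - 1) // 2      # ~24.5 million sample pairs
print(n_pairs * n_features * 8 / 1024. ** 3)    # ~2.9 GB for the distances alone
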
I gave it a try on my desktop:

import numpy as np
from sklearn.externals import joblib
from sklearn.gaussian_process import GaussianProcess

X, y = np.random.rand(7000, 16), np.random.rand(7000)

# light storage: the big distance matrix is discarded after fit
gp_light = GaussianProcess(storage_mode='light').fit(X, y)
joblib.dump(gp_light, 'gp_light.pkl', compress=6)  # passes, and the file is <1MB

# full storage (the default): the distance matrix stays on the object
gp_full = GaussianProcess().fit(X, y)
joblib.dump(gp_full, 'gp_full.pkl', compress=6)  # raises your error
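
As a quick sanity check, the light model reloads and predicts as usual.
Since predict recomputes the distances it needs on the fly, as far as I can
tell nothing is lost for prediction:

gp_loaded = joblib.load('gp_light.pkl')
print(gp_loaded.predict(np.random.rand(5, 16)))  # works without the stored matrix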

I hope this workaround will help you.
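
If you really do need to keep the full object, one fallback (untested on my
side) is to dump without compression, since joblib then writes the large
arrays out as separate .npy files next to the pickle, and to compress those
files externally afterwards; that sidesteps zlib's in-memory size limit at
the cost of temporary disk space:

joblib.dump(gp_full, 'gp_full_raw.pkl')  # big arrays land in companion .npy files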

Cheers,
Vincent



2014-03-05 23:12 GMT+01:00 Cory Dolphin <[email protected]>:

> I believe I experienced a similar issue, reported on GitHub as #122
> <https://github.com/joblib/joblib/issues/122>.
> It seems to be a zlib issue in Python, and not something which can or will
> be fixed in joblib.
>
> Would love to know if a work-around was found; it is almost ironic that
> compression fails for large files :-)
>
>
> Cory
>
>
> On Wed, Mar 5, 2014 at 5:05 PM, Ralf Gunter <[email protected]> wrote:
>
>> Hi folks,
>>
>> Attempting to dump a trained GP estimator (7000 samples, 16 features)
>> using joblib (compress = 6) is causing the following error:
>>
>>
>> Traceback (most recent call last):
>>   File
>> "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py",
>> line 241, in save
>>     obj, filename = self._write_array(obj, filename)
>>   File
>> "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py",
>> line 214, in _write_array
>>     compress=self.compress)
>>   File
>> "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py",
>> line 89, in write_zfile
>>     file_handle.write(zlib.compress(asbytes(data), compress))
>> OverflowError: size does not fit in an int
>>
>> Traceback (most recent call last):
>>   File "learn.py", line 28, in <module>
>>     joblib.dump(reg, opt.save_model, compress = 6)
>>   File
>> "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py",
>> line 367, in dump
>>     pickler.dump(value)
>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 224, in dump
>>     self.save(obj)
>>   File
>> "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py",
>> line 249, in save
>>     return Pickler.save(self, obj)
>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 331, in save
>>     self.save_reduce(obj=obj, *rv)
>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 419, in
>> save_reduce
>>     save(state)
>>   File
>> "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py",
>> line 249, in save
>>     return Pickler.save(self, obj)
>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
>>     f(self, obj) # Call unbound method with explicit self
>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 649, in
>> save_dict
>>     self._batch_setitems(obj.iteritems())
>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 681, in
>> _batch_setitems
>>     save(v)
>>   File
>> "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py",
>> line 249, in save
>>     return Pickler.save(self, obj)
>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 331, in save
>>     self.save_reduce(obj=obj, *rv)
>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 419, in
>> save_reduce
>>     save(state)
>>   File
>> "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py",
>> line 249, in save
>>     return Pickler.save(self, obj)
>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
>>     f(self, obj) # Call unbound method with explicit self
>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 562, in
>> save_tuple
>>     save(element)
>>   File
>> "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py",
>> line 249, in save
>>     return Pickler.save(self, obj)
>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
>>     f(self, obj) # Call unbound method with explicit self
>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 486, in
>> save_string
>>     self.write(BINSTRING + pack("<i", n) + obj)
>> struct.error: 'i' format requires -2147483648 <= number <= 2147483647
>>
>>
>> The same error occurs when dumping a random (40k x 40k) float64 array to
>> disk (~12 GB, i.e. well over 4GB):
>>
>>   import numpy as np
>>   from sklearn.externals import joblib
>>
>>   w = np.random.random((40000, 40000))
>>   joblib.dump(w, "test.pkl", compress = 6)
>>
>> This machine has 64GB of RAM so saving/loading this shouldn't be a
>> problem. In fact, we'd like to go all the way to 15000 samples if
>> possible. Unsurprisingly, disabling compression does make the error go
>> away but also generates huge files.
>>
>> Python is version 2.7.5, numpy is 1.8.0 and sklearn is mblondel's
>> kernel ridge branch[1] (i.e. joblib is 0.7.1), but the same happens in
>> a fresh pull from the mainline (with joblib at 0.8.0a3).
>>
>> Perhaps the joblib list might be a more appropriate place for this,
>> but since the object being pickled is from sklearn I thought it would
>> be best to run this through you first. Does anyone have any experience
>> with model persistence for such big estimators? How should they be
>> stored on disk? It seems this may be an explicit limitation in Python
>> itself[2], but since the same compression level for 5000 samples
>> already takes ~1.4GB, I'm a bit concerned with how big an uncompressed
>> version would grow for the sizes we're interested in.
>>
>> Thanks!
>>
>> [1] -- https://github.com/mblondel/scikit-learn/tree/kernel_ridge
>> [2] -- http://bugs.python.org/issue8651
>>
>>