Thanks Vincent, that indeed does the trick! It would be very useful to
have the confidence intervals as well, but this will do in the
meantime.
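
For anyone else following along: if memory allows keeping the default
storage_mode='full', the predictive variance is available through
predict(X, eval_MSE=True), and a confidence band follows from its square
root. A minimal sketch against that API (as far as I can tell, 'light'
mode discards what this needs and has to recompute it at predict time):

  import numpy as np
  from sklearn.gaussian_process import GaussianProcess

  X = np.random.rand(100, 16)  # toy data, just for illustration
  y = np.random.rand(100)

  gp = GaussianProcess(storage_mode='full').fit(X, y)
  y_pred, mse = gp.predict(X, eval_MSE=True)  # predictive mean and variance

  # Pointwise 95% confidence interval around the predicted mean.
  sigma = np.sqrt(mse)
  lower, upper = y_pred - 1.96 * sigma, y_pred + 1.96 * sigma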

It's odd that only the compressed code path hits this bug, since I'd
imagine both versions serialize the object to an "s#" string at some
point (and hence would hit #8651)...
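
For what it's worth, the 2GB limit is on the single zlib.compress()
call; feeding the same buffer through zlib.compressobj in chunks
sidesteps it while still producing one valid zlib stream. A rough
sketch of the idea (chunked_zwrite is a hypothetical helper, not
joblib's actual code):

  import zlib

  def chunked_zwrite(file_handle, data, level=6, chunk_size=64 * 1024 * 1024):
      # Compress a bytes-like buffer of arbitrary size in chunks, so that
      # no single zlib call sees more than chunk_size bytes and we never
      # hit the OverflowError from http://bugs.python.org/issue8651.
      compressor = zlib.compressobj(level)
      pos = 0
      while pos < len(data):
          file_handle.write(compressor.compress(data[pos:pos + chunk_size]))
          pos += chunk_size
      file_handle.write(compressor.flush())

The output can then be read back with a symmetric zlib.decompressobj loop.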

2014-03-06 8:15 GMT-06:00 Vincent Dubourg <[email protected]>:
> Hi Ralf,
>
> The GaussianProcess class computes and stores the full matrix of Manhattan
> distances between features, so the object can quickly take up a huge amount
> of memory.
> One option is to discard this big matrix after fitting by passing the
> storage_mode='light' kwarg (the default is 'full' and keeps everything) at
> instantiation.
> I gave it a try on my desktop:
>
>
> import numpy as np
> from sklearn.externals import joblib
> from sklearn.gaussian_process import GaussianProcess
>
> gp_light = GaussianProcess(storage_mode='light').fit(
>     np.random.rand(7000, 16), np.random.rand(7000))
> joblib.dump(gp_light, 'gp_light.pkl', compress=6)  # passes; file is <1Mb
>
> gp_full = GaussianProcess().fit(
>     np.random.rand(7000, 16), np.random.rand(7000))
> joblib.dump(gp_full, 'gp_full.pkl', compress=6)  # raises your error
>
> I hope this workaround will help you.
>
> Cheers,
> Vincent
>
>
>
> 2014-03-05 23:12 GMT+01:00 Cory Dolphin <[email protected]>:
>
>> I believe I experienced a similar issue, reported on GitHub as #122. It
>> seems to be a zlib issue in Python, and not something which can or will be
>> fixed in joblib.
>>
>> Would love to know if a work-around was found; it is almost ironic that
>> compression fails for large files :-)
>>
>>
>> Cory
>>
>>
>> On Wed, Mar 5, 2014 at 5:05 PM, Ralf Gunter <[email protected]> wrote:
>>>
>>> Hi folks,
>>>
>>> Attempting to dump a trained GP estimator (7000 samples, 16 features)
>>> using joblib (compress = 6) is causing the following error:
>>>
>>>
>>> Traceback (most recent call last):
>>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 241, in save
>>>     obj, filename = self._write_array(obj, filename)
>>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 214, in _write_array
>>>     compress=self.compress)
>>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 89, in write_zfile
>>>     file_handle.write(zlib.compress(asbytes(data), compress))
>>> OverflowError: size does not fit in an int
>>>
>>> Traceback (most recent call last):
>>>   File "learn.py", line 28, in <module>
>>>     joblib.dump(reg, opt.save_model, compress = 6)
>>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 367, in dump
>>>     pickler.dump(value)
>>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 224, in dump
>>>     self.save(obj)
>>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
>>>     return Pickler.save(self, obj)
>>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 331, in save
>>>     self.save_reduce(obj=obj, *rv)
>>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 419, in save_reduce
>>>     save(state)
>>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
>>>     return Pickler.save(self, obj)
>>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
>>>     f(self, obj) # Call unbound method with explicit self
>>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 649, in save_dict
>>>     self._batch_setitems(obj.iteritems())
>>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 681, in _batch_setitems
>>>     save(v)
>>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
>>>     return Pickler.save(self, obj)
>>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 331, in save
>>>     self.save_reduce(obj=obj, *rv)
>>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 419, in save_reduce
>>>     save(state)
>>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
>>>     return Pickler.save(self, obj)
>>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
>>>     f(self, obj) # Call unbound method with explicit self
>>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 562, in save_tuple
>>>     save(element)
>>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
>>>     return Pickler.save(self, obj)
>>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
>>>     f(self, obj) # Call unbound method with explicit self
>>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 486, in save_string
>>>     self.write(BINSTRING + pack("<i", n) + obj)
>>> struct.error: 'i' format requires -2147483648 <= number <= 2147483647
>>>
>>>
>>> The same error occurs when dumping a random (40k x 40k) array to disk
>>> (i.e. ~12.8GB, well past the 2GB signed-int limit):
>>>
>>>   import numpy as np
>>>   from sklearn.externals import joblib
>>>
>>>   w = np.random.random((40000, 40000))
>>>   joblib.dump(w, "test.pkl", compress = 6)
>>>
>>> This machine has 64GB of RAM, so saving/loading this shouldn't be a
>>> problem. In fact, we'd like to go all the way up to 15000 samples if
>>> possible. Unsurprisingly, disabling compression does make the error go
>>> away, but it also generates huge files.
>>>
>>> Python is version 2.7.5, numpy is 1.8.0 and sklearn is mblondel's
>>> kernel ridge branch[1] (i.e. joblib is 0.7.1), but the same happens in
>>> a fresh pull from the mainline (with joblib at 0.8.0a3).
>>>
>>> Perhaps the joblib list might be a more appropriate place for this,
>>> but since the object being pickled is from sklearn I thought it would
>>> be best to run this through you first. Does anyone have experience
>>> with model persistence for such big estimators? How should they be
>>> stored on disk? It seems this may be an explicit limitation in
>>> Python[2], but since the same compression level for 5000 samples
>>> already produces ~1.4GB, I'm a bit concerned about how big an
>>> uncompressed version would grow for the sizes we're interested in.
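>>>
>>> For instance, would an uncompressed dump plus a memory-mapped load --
>>> assuming I'm reading joblib's mmap_mode docs right -- be a reasonable
>>> pattern?
>>>
>>>   from sklearn.externals import joblib
>>>
>>>   # reg is the fitted estimator from learn.py. Without compress, joblib
>>>   # stores the large arrays as separate .npy files next to the pickle,
>>>   # avoiding the single huge zlib.compress() call.
>>>   joblib.dump(reg, "model.pkl")
>>>
>>>   # Memory-mapping on load keeps those arrays on disk rather than in RAM.
>>>   reg2 = joblib.load("model.pkl", mmap_mode="r")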
>>>
>>> Thanks!
>>>
>>> [1] -- https://github.com/mblondel/scikit-learn/tree/kernel_ridge
>>> [2] -- http://bugs.python.org/issue8651