The confidence intervals will still be available, though at a greater
computational cost: if you ask for the prediction variance using eval_MSE,
the GP class will recompute the cross-distances and the covariance from X.
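
To make that extra cost concrete: with storage_mode='light' the fitted
cross-distance matrix is discarded, so asking for the variance means
rebuilding the Manhattan distances from X. A rough, self-contained sketch of
that recomputation (plain numpy standing in for the internal helper; the
array sizes are toy values, not the real workload):

```python
import numpy as np

rng = np.random.RandomState(0)
X_train = rng.rand(100, 16)  # toy stand-in for the fitted training inputs
X_query = rng.rand(5, 16)    # points where the variance is requested

# Manhattan (cityblock) cross-distances: the quantity the GP has to
# recompute when the stored matrix was dropped by storage_mode='light'
D = np.abs(X_query[:, None, :] - X_train[None, :, :]).sum(axis=-1)
print(D.shape)  # (5, 100)
```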
2014-03-06 19:20 GMT+01:00 Ralf Gunter <[email protected]>:
> Thanks Vincent, that indeed does the trick! It would be very useful to
> have the confidence intervals along as well, but this should do
> meanwhile.
>
> It's odd that only the compressed code path is hitting this bug, since
> I'd imagine both versions are serializing the object to a "s#" string
> at some point (and hence hitting #8651)...
>
> 2014-03-06 8:15 GMT-06:00 Vincent Dubourg <[email protected]>:
> > Hi Ralf,
> >
> > The GaussianProcess class computes and stores the full matrix of Manhattan
> > distances between features, hence the object can quickly take a huge
> > amount of memory...
> > One option, though, consists in dumping this big matrix after fit by using
> > the storage_mode='light' kwarg (default is 'full' and keeps everything) at
> > instantiation.
> > I gave it a try on my desktop:
> >
> >
> > import numpy as np
> > from sklearn.externals import joblib
> > from sklearn.gaussian_process import GaussianProcess
> >
> > gp_light = GaussianProcess(storage_mode='light').fit(
> >     np.random.rand(7000 * 16).reshape((7000, 16)), np.random.rand(7000))
> > joblib.dump(gp_light, 'gp_light.pkl', compress=6)  # passes; the file is <1Mb
> >
> > gp_full = GaussianProcess().fit(
> >     np.random.rand(7000 * 16).reshape((7000, 16)), np.random.rand(7000))
> > joblib.dump(gp_full, 'gp_full.pkl', compress=6)  # raises your error
> >
> > I hope this workaround will help you.
> >
> > Cheers,
> > Vincent
> >
> >
> >
> > 2014-03-05 23:12 GMT+01:00 Cory Dolphin <[email protected]>:
> >
> >> I believe I experienced a similar issue, reported on GitHub #122. It
> >> seems to be a zlib issue in Python, and not something which can or will
> >> be fixed in joblib.
> >>
> >> I would love to know if a workaround was found; it is almost ironic that
> >> compression fails for large files :-)
> >>
> >>
> >> Cory
> >>
> >>
> >> On Wed, Mar 5, 2014 at 5:05 PM, Ralf Gunter <[email protected]> wrote:
> >>>
> >>> Hi folks,
> >>>
> >>> Attempting to dump a trained GP estimator (7000 samples, 16 features)
> >>> using joblib (compress = 6) is causing the following error:
> >>>
> >>>
> >>> Traceback (most recent call last):
> >>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 241, in save
> >>>     obj, filename = self._write_array(obj, filename)
> >>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 214, in _write_array
> >>>     compress=self.compress)
> >>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 89, in write_zfile
> >>>     file_handle.write(zlib.compress(asbytes(data), compress))
> >>> OverflowError: size does not fit in an int
> >>>
> >>> Traceback (most recent call last):
> >>>   File "learn.py", line 28, in <module>
> >>>     joblib.dump(reg, opt.save_model, compress = 6)
> >>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 367, in dump
> >>>     pickler.dump(value)
> >>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 224, in dump
> >>>     self.save(obj)
> >>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
> >>>     return Pickler.save(self, obj)
> >>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 331, in save
> >>>     self.save_reduce(obj=obj, *rv)
> >>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 419, in save_reduce
> >>>     save(state)
> >>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
> >>>     return Pickler.save(self, obj)
> >>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
> >>>     f(self, obj) # Call unbound method with explicit self
> >>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 649, in save_dict
> >>>     self._batch_setitems(obj.iteritems())
> >>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 681, in _batch_setitems
> >>>     save(v)
> >>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
> >>>     return Pickler.save(self, obj)
> >>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 331, in save
> >>>     self.save_reduce(obj=obj, *rv)
> >>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 419, in save_reduce
> >>>     save(state)
> >>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
> >>>     return Pickler.save(self, obj)
> >>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
> >>>     f(self, obj) # Call unbound method with explicit self
> >>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 562, in save_tuple
> >>>     save(element)
> >>>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
> >>>     return Pickler.save(self, obj)
> >>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
> >>>     f(self, obj) # Call unbound method with explicit self
> >>>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 486, in save_string
> >>>     self.write(BINSTRING + pack("<i", n) + obj)
> >>> struct.error: 'i' format requires -2147483648 <= number <= 2147483647
> >>>
> >>>
> >>> The same error occurs when dumping a random (40k x 40k) array to disk
> >>> (i.e. > 4GB):
> >>>
> >>> import numpy as np
> >>> from sklearn.externals import joblib
> >>>
> >>> w = np.random.random((40000, 40000))
> >>> joblib.dump(w, "test.pkl", compress = 6)
> >>>
> >>> This machine has 64GB of RAM so saving/loading this shouldn't be a
> >>> problem. In fact, we'd like to go all the way to 15000 samples if
> >>> possible. Unsurprisingly, disabling compression does make the error go
> >>> away but also generates huge files.
> >>>
> >>> Python is version 2.7.5, numpy is 1.8.0 and sklearn is mblondel's
> >>> kernel ridge branch[1] (i.e. joblib is 0.7.1), but the same happens in
> >>> a fresh pull from the mainline (with joblib at 0.8.0a3).
> >>>
> >>> Perhaps the joblib list might be a more appropriate place for this,
> >>> but since the object being pickled is from sklearn I thought it would
> >>> be best to run this through you first. Does anyone have any experience
> >>> with model persistence for such big estimators? How should they be
> >>> stored on disk? It seems this may be an explicit limitation from
> >>> python[2], but since the same compression level for 5000 samples
> >>> already takes ~1.4GB, I'm a bit concerned with how big an uncompressed
> >>> version would grow for the sizes we're interested in.
> >>>
> >>> Thanks!
> >>>
> >>> [1] -- https://github.com/mblondel/scikit-learn/tree/kernel_ridge
> >>> [2] -- http://bugs.python.org/issue8651
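
For what it's worth, the limit Ralf hits is on how much data a single zlib
call (and a single pickle BINSTRING record) may carry under Python 2's
32-bit int. One hedged workaround sketch, which is not what joblib itself
does internally, is to stream the raw bytes through zlib.compressobj in
bounded chunks so no single call ever sees the whole buffer (the helper and
file names below are made up for the example, and shape/dtype would have to
be tracked separately):

```python
import zlib
import numpy as np

def dump_compressed(arr, path, level=6, chunk=64 * 1024 * 1024):
    """Write arr's raw bytes zlib-compressed, feeding the compressor
    in fixed-size chunks so no single call gets a huge buffer."""
    comp = zlib.compressobj(level)
    data = arr.tobytes()
    with open(path, 'wb') as fh:
        for start in range(0, len(data), chunk):
            fh.write(comp.compress(data[start:start + chunk]))
        fh.write(comp.flush())

# toy round-trip on a small array; the stream decompresses in one go here
w = np.random.random((100, 100))
dump_compressed(w, 'test_chunked.zlib')
with open('test_chunked.zlib', 'rb') as fh:
    restored = np.frombuffer(zlib.decompress(fh.read()),
                             dtype=w.dtype).reshape(w.shape)
```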
> >>>
> >>>
> >>>
> ------------------------------------------------------------------------------
> >>> Subversion Kills Productivity. Get off Subversion & Make the Move to
> >>> Perforce.
> >>> With Perforce, you get hassle-free workflows. Merge that actually
> works.
> >>> Faster operations. Version large binaries. Built-in WAN optimization
> and
> >>> the
> >>> freedom to use Git, Perforce or both. Make the move to Perforce.
> >>>
> >>>
> http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
> >>> _______________________________________________
> >>> Scikit-learn-general mailing list
> >>> [email protected]
> >>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general