I believe I experienced a similar issue, reported on GitHub as #122
(https://github.com/joblib/joblib/issues/122).
It seems to be a zlib limitation in Python itself, not something that can or
will be fixed in joblib.
Would love to know if a work-around was found; it is almost ironic that
compression fails precisely for the large files that need it most :-)
Cory
On Wed, Mar 5, 2014 at 5:05 PM, Ralf Gunter <[email protected]> wrote:
> Hi folks,
>
> Attempting to dump a trained GP estimator (7000 samples, 16 features)
> using joblib (compress = 6) is causing the following error:
>
>
> Traceback (most recent call last):
>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 241, in save
>     obj, filename = self._write_array(obj, filename)
>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 214, in _write_array
>     compress=self.compress)
>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 89, in write_zfile
>     file_handle.write(zlib.compress(asbytes(data), compress))
> OverflowError: size does not fit in an int
>
> Traceback (most recent call last):
>   File "learn.py", line 28, in <module>
>     joblib.dump(reg, opt.save_model, compress = 6)
>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 367, in dump
>     pickler.dump(value)
>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 224, in dump
>     self.save(obj)
>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
>     return Pickler.save(self, obj)
>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 331, in save
>     self.save_reduce(obj=obj, *rv)
>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 419, in save_reduce
>     save(state)
>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
>     return Pickler.save(self, obj)
>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
>     f(self, obj) # Call unbound method with explicit self
>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 649, in save_dict
>     self._batch_setitems(obj.iteritems())
>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 681, in _batch_setitems
>     save(v)
>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
>     return Pickler.save(self, obj)
>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 331, in save
>     self.save_reduce(obj=obj, *rv)
>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 419, in save_reduce
>     save(state)
>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
>     return Pickler.save(self, obj)
>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
>     f(self, obj) # Call unbound method with explicit self
>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 562, in save_tuple
>     save(element)
>   File "/home/rgunter/.local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 249, in save
>     return Pickler.save(self, obj)
>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 286, in save
>     f(self, obj) # Call unbound method with explicit self
>   File "/home/rgunter/.local/lib/python2.7/pickle.py", line 486, in save_string
>     self.write(BINSTRING + pack("<i", n) + obj)
> struct.error: 'i' format requires -2147483648 <= number <= 2147483647
>
>
> The same error occurs when dumping a random 40000 x 40000 float64 array
> (~12.8 GB of raw data) to disk:
>
> import numpy as np
> from sklearn.externals import joblib
>
> w = np.random.random((40000, 40000))
> joblib.dump(w, "test.pkl", compress=6)
>
> This machine has 64GB of RAM so saving/loading this shouldn't be a
> problem. In fact, we'd like to go all the way to 15000 samples if
> possible. Unsurprisingly, disabling compression does make the error go
> away but also generates huge files.
>
> Python is version 2.7.5, numpy is 1.8.0 and sklearn is mblondel's
> kernel ridge branch[1] (i.e. joblib is 0.7.1), but the same happens in
> a fresh pull from the mainline (with joblib at 0.8.0a3).
>
> Perhaps the joblib list might be a more appropriate place for this,
> but since the object being pickled comes from sklearn I thought it
> would be best to run it by you first. Does anyone have experience
> with model persistence for such big estimators? How should they be
> stored on disk? This seems to be an explicit limitation of Python
> itself[2], but since this compression level already yields a ~1.4GB
> file at 5000 samples, I'm a bit concerned about how big an
> uncompressed version would grow for the sizes we're interested in.
>
> Thanks!
>
> [1] -- https://github.com/mblondel/scikit-learn/tree/kernel_ridge
> [2] -- http://bugs.python.org/issue8651
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>