Hi, would it be possible then (in relatively short order) to create a py2 -> py3 numpy pickle converter? This would run in py2, np.load or unpickle a pickle in the usual way, and then repickle and/or save using a pickler that uses an explicit pickle type for encoding the bytes associated with numpy dtypes. The numpy unpickler in py3 would then know what to do. I.e., is there a way to make the numpy py2 pickler be explicit about byte strings? Presumably this would cover most use cases, even for complicated pickled objects, and could be used transparently within py2 or py3.
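To make the idea concrete, here is a rough sketch of the simplest version I can think of. Note that it swaps in a different technique from the one I asked about: instead of a pickler with an explicit bytes type, it re-saves the arrays in the .npy/.npz format, which stores non-object arrays as raw binary rather than as a pickled str. The helper name convert_pickle_to_npz is hypothetical, and the sketch assumes the pickle holds a flat dict of arrays with string keys, so it does not cover complicated pickled objects.

```python
import pickle

import numpy as np


def convert_pickle_to_npz(pickle_path, npz_path):
    """Hypothetical helper: re-save a pickled dict of arrays as a .npz file.

    Intended to run under the same Python version that wrote the pickle
    (e.g. py2), so that pickle.load works in the usual way.
    """
    with open(pickle_path, "rb") as f:
        data = pickle.load(f)

    # Assumption of this sketch: a flat dict mapping names to arrays.
    if not isinstance(data, dict):
        raise TypeError("this sketch only handles a flat dict of arrays")

    arrays = {}
    for key, value in data.items():
        arr = np.asarray(value)
        # Object arrays would be pickled again inside the .npz, which
        # would bring back the original portability problem.
        if arr.dtype == object:
            raise TypeError("object array under key %r is not supported" % (key,))
        arrays[key] = arr

    with open(npz_path, "wb") as f:
        np.savez(f, **arrays)
```

On the py3 side the result would then be read with np.load(npz_path), with no encoding= option needed, since no str is involved in storing the array data.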
Best,
C

> On Aug 24, 2015, at 2:30 PM, Nathaniel Smith <n...@pobox.com> wrote:
>
> On Aug 24, 2015 9:29 AM, "Pauli Virtanen" <p...@iki.fi> wrote:
> >
> > 24.08.2015, 01:02, Chris Laumann wrote:
> > [clip]
> > > Is there documentation about the limits and workarounds for py2/py3
> > > pickle/np.save/load compatibility? I haven't found anything except
> > > developer bug tracking discussions (eg. #4879 in github numpy).
> >
> > Not sure if it's written down somewhere but:
> >
> > - You should consider pickles not portable between Py2/3.
> >
> > - Setting encoding='bytes' or encoding='latin1' should produce correct
> >   results for numerical data. However, neither is "safe" because the
> >   option also affects other data than numpy arrays that you may have
> >   possibly saved.
>
> For those wondering what's going on here: if you pickled a str in python 2,
> then python 3 wants to unpickle it as a str. But in python 2 str was a vector
> of arbitrary bytes in some assumed encoding, and in python 3 str is a vector
> of Unicode characters. So it needs to know what encoding to use, which is
> fine and what you'd expect for the py2->py3 transition.
>
> But: when pickling arrays, numpy on py2 used a str to store the raw memory of
> your array. Trying to run this data through a character decoder then
> obviously makes a mess of everything. So the fundamental problem is that on
> py2, there's no way to distinguish between a string of text and a string of
> bytes -- they're encoded in exactly the same way in the pickle file -- and
> the python 3 unpickler just has to guess. You can tell it to guess in a way
> that works for raw bytes -- that's what the encoding= options Pauli mentions
> above do -- but obviously this will then be incorrect if you have any actual
> non-latin1 textual strings in your pickle, and you can't get it to handle
> both correctly at the same time.
>
> If you're desperate, it should be possible to get your data out of py2
> pickles by loading them with one of the encoding options above, and then
> going through the resulting object and converting all the actual textual
> strings back to the correct encoding by hand. No data is actually lost. And
> of course even this is unnecessary if your file contains only ASCII/latin1.
>
> -n
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
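
For anyone following along, the by-hand recovery described in the quoted message can be sketched with the standard library alone. This is only an illustration under stated assumptions: the pickle bytes are hand-built to mimic what py2 writes for its 8-bit str (the SHORT_BINSTRING opcode), the packed double stands in for raw array memory, and the textual string is assumed to have been UTF-8 in py2. The variable names are made up for the example.

```python
import pickle
import struct

# Two py2-style 8-bit strings: one is real text (assumed UTF-8 here),
# the other is raw binary data standing in for an array's memory.
text_bytes = u"caf\u00e9".encode("utf-8")
raw_bytes = struct.pack("<d", 1.5)


def short_binstring(data):
    # Opcode 'U' (SHORT_BINSTRING): a one-byte length, then the bytes.
    return b"U" + bytes([len(data)]) + data


# Protocol-2 pickle of the tuple (text, raw): PROTO 2, two strings,
# TUPLE2, STOP. Both strings are encoded identically in the stream.
py2_pickle = (
    b"\x80\x02"
    + short_binstring(text_bytes)
    + short_binstring(raw_bytes)
    + b"\x86."
)

# Python 3 has to guess; latin1 maps every byte to exactly one character.
text, raw = pickle.loads(py2_pickle, encoding="latin1")

# Raw data: undo the latin1 decoding to get the original bytes back.
recovered_raw = raw.encode("latin1")
value = struct.unpack("<d", recovered_raw)[0]

# Text: go back to bytes, then decode with the encoding py2 really used.
recovered_text = text.encode("latin1").decode("utf-8")
```

Because latin1 round-trips every byte value, both the raw data and the text come back intact. The text just needs the extra re-decoding step, which is the "by hand" part.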