Hi-

Would it be possible, then (in relatively short order), to create a py2 -> py3 
numpy pickle converter? It would run in py2, np.load or unpickle a pickle in 
the usual way, and then repickle and/or save it using a pickler that uses an 
explicit pickle type for the bytes associated with numpy dtypes. The 
numpy unpickler in py3 would then know what to do. I.e., is there a way to make 
the numpy py2 pickler explicit about byte strings? Presumably this would 
cover most use cases, even for complicated pickled objects, and could be used 
transparently within py2 or py3.
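
To make the idea concrete, here is a minimal sketch of what "explicit about 
byte strings" could look like, using only the standard library. RawBytes is a 
hypothetical wrapper, not something numpy provides: its __reduce__ stores the 
payload as ASCII-only base64 text and names base64.b64decode as the 
reconstructor, so the reading side needs neither a custom class nor an 
encoding= option. I have only reasoned about the py2 side and not run it; the 
round trip below is written for py3.

```python
import base64
import pickle


class RawBytes(object):
    """Hypothetical wrapper that marks a payload as raw bytes, not text."""

    def __init__(self, data):
        self.data = bytes(data)

    def __reduce__(self):
        # Store the payload as ASCII-only base64 text. Text made of ASCII
        # characters decodes identically under any encoding= option.
        ascii_text = base64.b64encode(self.data).decode("ascii")
        # base64.b64decode exists under the same name on py2 and py3, so
        # the unpickler needs no custom code to rebuild the bytes.
        return (base64.b64decode, (ascii_text,))


# Every possible byte value, standing in for the raw memory of an array.
payload = bytes(bytearray(range(256)))

# A mixed object: real text next to explicitly tagged raw bytes.
obj = {"label": u"caf\u00e9", "buf": RawBytes(payload)}

blob = pickle.dumps(obj, protocol=2)

# No encoding= argument is needed on load.
restored = pickle.loads(blob)
```

After loading, restored["buf"] is a plain bytes object and restored["label"] 
is still text. The cost is the base64 size overhead (roughly a third), which 
a real converter would probably want to avoid.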

Best, C
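
For reference, the behaviour described in the quoted message below can be 
reproduced on py3 without a py2 interpreter. This sketch builds by hand the 
protocol-2 pickle bytes that py2 writes for a short str; py2_str_pickle is a 
helper made up for illustration, and the assumption that the py2 text was 
stored as UTF-8 is mine.

```python
import pickle


def py2_str_pickle(payload):
    """Return a protocol-2 pickle of a py2 str shorter than 256 bytes.

    Layout: PROTO 2, SHORT_BINSTRING <length> <payload>, STOP. The memo
    opcode that py2 also writes is omitted; loading does not need it.
    """
    if len(payload) >= 256:
        raise ValueError("SHORT_BINSTRING holds at most 255 bytes")
    return b"\x80\x02U" + bytes([len(payload)]) + payload + b"."


raw_pickle = py2_str_pickle(b"\xff\x00\x80")               # array-like raw bytes
text_pickle = py2_str_pickle(u"\u00e9".encode("utf-8"))    # text, assumed UTF-8

# The default decoding (ASCII, strict) rejects non-ASCII bytes outright.
try:
    pickle.loads(raw_pickle)
    default_ok = True
except UnicodeDecodeError:
    default_ok = False

# encoding='bytes' returns the payload untouched.
as_bytes = pickle.loads(raw_pickle, encoding="bytes")

# encoding='latin1' maps each byte to one code point, so it is reversible.
as_latin1 = pickle.loads(raw_pickle, encoding="latin1")
recovered_raw = as_latin1.encode("latin1")

# Real text comes out mangled under latin1, but can be repaired by hand.
mangled = pickle.loads(text_pickle, encoding="latin1")
repaired = mangled.encode("latin1").decode("utf-8")
```

Nothing is lost under latin1: the raw bytes round-trip exactly, and the text 
is recoverable once you know which encoding it was really in.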

> On Aug 24, 2015, at 2:30 PM, Nathaniel Smith <n...@pobox.com> wrote:
> 
> On Aug 24, 2015 9:29 AM, "Pauli Virtanen" <p...@iki.fi> wrote:
> >
> > 24.08.2015, 01:02, Chris Laumann wrote:
> > [clip]
> > > Is there documentation about the limits and workarounds for py2/py3
> > > pickle/np.save/load compatibility? I haven't found anything except
> > > developer bug tracking discussions (eg. #4879 in github numpy).
> >
> > Not sure if it's written down somewhere but:
> >
> > - You should consider pickles not portable between Py2/3.
> >
> > - Setting encoding='bytes' or encoding='latin1' should produce correct
> > results for numerical data. However, neither is "safe" because the
> > option also affects other data than numpy arrays that you may have
> > possibly saved.
> 
> For those wondering what's going on here: if you pickled a str in python 2, 
> then python 3 wants to unpickle it as a str. But in python 2 str was a vector 
> of arbitrary bytes in some assumed encoding, and in python 3 str is a vector 
> of Unicode characters. So it needs to know what encoding to use, which is 
> fine and what you'd expect for the py2->py3 transition.
> 
> But: when pickling arrays, numpy on py2 used a str to store the raw memory of 
> your array. Trying to run this data through a character decoder then 
> obviously makes a mess of everything. So the fundamental problem is that on 
> py2, there's no way to distinguish between a string of text and a string of 
> bytes -- they're encoded in exactly the same way in the pickle file -- and 
> the python 3 unpickler just has to guess. You can tell it to guess in a way 
> that works for raw bytes -- that's what the encoding= options Pauli mentions 
> above do -- but obviously this will then be incorrect if you have any actual 
> non-latin1 textual strings in your pickle, and you can't get it to handle 
> both correctly at the same time.
> 
> If you're desperate, it should be possible to get your data out of py2 
> pickles by loading them with one of the encoding options above, and then 
> going through the resulting object and converting all the actual textual 
> strings back to the correct encoding by hand. No data is actually lost. And 
> of course even this is unnecessary if your file contains only ASCII/latin1.
> 
> -n
> 
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
