On Wed, Jan 15, 2014 at 10:27 AM, Chris Barker <chris.bar...@noaa.gov> wrote:

> On Wed, Jan 15, 2014 at 4:38 AM, Julian Taylor
> <jtaylor.deb...@googlemail.com> wrote:
>
>> > I try to print my fileContent array after I read it and it looks
>> > like this :
>> >
>> > ["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'"
>> > "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'"
>> > "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"]
>>
>> you have the bytes representation and a duplicate slash in it.
>
> the duplicate slash confuses me, but I'm not running py3 to test, so...
>
>> np.loadtxt(file, dtype=bytes).astype(str)
>>
>> for non ascii I guess you should use python directly as numpy would also
>> require a python loop with explicit decoding.
>>
>> Currently handling strings in python3 with numpy is even worse than
>> before, you always have to go over bytes and do explicit decodes to get
>> python strings out of ascii data.
>
> There is a MASSIVE set of threads on Python-dev about better support for
> ASCII and ASCII+binary data in py3 -- but in the meantime, I think we have
> two issues here that could be addressed:
>
> 1) loadtxt behavior -- it's a really, really common case for data files
> suitable for loadtxt to be ascii, but they also could be another encoding
> -- so loadtxt should have the option to specify the encoding (default to
> ascii? or ascii-compatible?)
>
> The trick here is handling both these cases correctly -- clearly loadtxt
> is broken on py3 now. This example works fine under py2.
>
> It seems to be reading the file as bytes, then passing those bytes off to
> a unicode string (str in py3) without specifying an encoding (which I
> think is how that b' ...' junk gets in there).
>
> note that: np.loadtxt('pathlist.txt', dtype=unicode) works fine on py2 as
> well:
>
> In [7]: np.loadtxt('pathlist.txt', dtype=unicode)
> Out[7]:
> array([u'C:\\Users\\Documents\\Project\\mytextfile1.txt',
>        u'C:\\Users\\Documents\\Project\\mytextfile2.txt',
>        u'C:\\Users\\Documents\\Project\\mytextfile3.txt'],
>       dtype='<U42')
>
> which is what should happen in py3. So the internal loadtxt code must be
> confusing bytes and unicode objects...
>
> Anyway, this should work, and there should be an obvious way to spell it.
>
> 2) numpy string types -- it seems numpy already has both a string type
> and a unicode type -- perhaps some re-naming or better documentation is in
> order:
> the string type 'S10', for example, should be clearly defined as 1-byte
> per character ascii-compatible.
>
> I'm not sure how many bytes the unicode type has, but it may make sense to
> be able to choose UCS-2 or UCS-4 -- though memory is cheap, I'd probably go
> with UCS-4 and be done with it.

There was a discussion of this long ago and UCS-4 was chosen as the numpy
standard. There are just too many complications that arise in supporting
both.

Chuck
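
For reference, here is a minimal sketch of the storage sizes under discussion, plus one way to turn a bytes array into a unicode array on py3. The sample path is copied from the example earlier in the thread; the variable names (s_dtype, u_dtype, raw, decoded) are only illustrative, and the raw array is a stand-in for what loadtxt would hand back with dtype=bytes, not actual loadtxt output.

```python
import numpy as np

# 'S10' is a fixed-width byte string: 1 byte per character.
s_dtype = np.dtype('S10')
# 'U10' is a fixed-width unicode string stored as UCS-4: 4 bytes per character.
u_dtype = np.dtype('U10')

print(s_dtype.itemsize)  # total bytes per element for 'S10'
print(u_dtype.itemsize)  # total bytes per element for 'U10'

# Illustrative stand-in for an ASCII bytes array read from a file.
raw = np.array([b'C:\\Users\\Documents\\Project\\mytextfile1.txt'])

# Explicitly decode every element; 'ascii' is an assumption about the file.
decoded = np.char.decode(raw, 'ascii')

print(decoded.dtype.kind)  # 'U' for a unicode array
print(decoded[0])
```

The explicit decode step is the part that a loadtxt encoding option would make unnecessary; for non-ASCII data the encoding name would have to match the file.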
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion