Andrew McNamara wrote:
Marc-Andre Lemburg mentioned that he has encountered UTF-16 encoded csv
files, so a reasonable starting point would be the ability to read and
parse, as well as the ability to generate, one of these.

I see. That would be reasonable, indeed. Notice that this is not so much a "Unicode issue", but more an "encoding" issue. If you solve the "arbitrary encodings" problem, you solve UTF-16 as a side effect.

The reader interface currently returns a row at a time, consuming as many
lines from the supplied iterable (with the most common iterable being
a file). This suggests to me that we will need an optional "encoding"
argument to the reader constructor, and that the reader will need to
decode the source lines.

Ok. In this context, I see two possible implementation strategies:

1. Implement the csv module twice: once for bytes, and once for Unicode
   characters. It is likely that the source code would be the same for
   each case; you just need to make sure the "Dialect and Formatting
   Parameters" change their width accordingly. If you use the SRE
   approach, you would do

   #define CSV_ITEM_T char
   #define CSV_NAME_PREFIX byte_
   #include "csvimpl.c"
   /* redefine the macros for the second, Unicode instantiation */
   #undef CSV_ITEM_T
   #undef CSV_NAME_PREFIX
   #define CSV_ITEM_T Py_UNICODE
   #define CSV_NAME_PREFIX unicode_
   #include "csvimpl.c"

2. Use just the existing _csv module, and represent non-byte encodings
   as UTF-8. This will work as long as the delimiters and other markup
   characters always occupy a single byte in UTF-8, which is the case
   for "':\, as well as for \r and \n. Then, when processing with
   an explicit encoding, first convert the input into Unicode objects.
   Then encode the Unicode objects into UTF-8, and pass them to _csv.
   For the results you get back, convert each element back from UTF-8
   to Unicode objects.

This could be implemented as

import codecs, itertools
import _csv

def reader(f, encoding=None):
    if encoding is None:
        return _csv.reader(f)
    enc, dec, Reader, Writer = codecs.lookup(encoding)
    utf8_enc, utf8_dec, utf8_r, utf8_w = codecs.lookup("UTF-8")
    # Make a recoder which only needs to read: bytes from f are decoded
    # with the input encoding's StreamReader, then re-encoded as UTF-8
    utf8_stream = codecs.StreamRecoder(f, utf8_enc, None, Reader, None)
    csv_reader = _csv.reader(utf8_stream)
    # For performance reasons, map_result could be implemented in C
    def map_result(t):
        result = [None]*len(t)
        for i, val in enumerate(t):
            # the decode function returns a (unicode, length) pair
            result[i] = utf8_dec(val)[0]
        return tuple(result)
    return itertools.imap(map_result, csv_reader)
# This code is untested
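
A hypothetical usage example (the file name, and the assumption that
the sketch above works unchanged, are mine): reading a UTF-16 encoded
file could then look like

f = open("data.csv", "rb")   # hypothetical UTF-16 encoded csv file
for row in reader(f, encoding="utf-16"):
    # each field comes back as a Unicode object
    print row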

This approach has the disadvantage of performing three recodings:
from the input charset to Unicode, from Unicode to UTF-8, and from
UTF-8 back to Unicode. One could:
- skip the initial recoding if the encoding is already known
  to be _csv-safe (i.e. if it is a pure ASCII superset).
  This would be valid for ASCII, iso-8859-n, UTF-8, ...
  (a sketch of this shortcut follows below)
- offer the user the option of keeping the results in the input
  encoding, instead of always returning Unicode objects.
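
A minimal sketch of the first shortcut, reusing the reader() sketch
above; the names is_ascii_superset and fast_reader, and the short list
of encodings, are made up for illustration:

# Encodings whose bytes are a pure ASCII superset can be fed to _csv
# directly; this list is illustrative, not exhaustive.
_ASCII_SUPERSETS = ("ascii", "utf-8", "iso-8859-1", "iso-8859-15")

def is_ascii_superset(encoding):
    # hypothetical helper: normalize the name and check the list
    return encoding.lower().replace("_", "-") in _ASCII_SUPERSETS

def fast_reader(f, encoding=None):
    if encoding is None:
        return _csv.reader(f)
    if is_ascii_superset(encoding):
        # the raw bytes are already safe for _csv; only the result
        # fields still need to be decoded into Unicode objects
        enc, dec, Reader, Writer = codecs.lookup(encoding)
        def map_result(t):
            # the decode function returns a (unicode, length) pair
            return tuple([dec(val)[0] for val in t])
        return itertools.imap(map_result, _csv.reader(f))
    # otherwise fall back to the recoding reader sketched above
    return reader(f, encoding)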

Apart from this disadvantage, I think this gives people what they want:
they can specify the encoding of the input, and they get the results not
only csv-separated, but also Unicode-decoded. This is the same approach
that is used for Python source code encodings: the source is first
recoded into UTF-8, then parsed, then recoded back.

That said, I'm hardly a unicode expert, so I
may be overlooking something (could a utf-16 encoded character span a
line break, for example?).

This cannot happen: \r, in UTF-16, is also 2 bytes (0D 00, if UTF-16LE). There is the separate issue that Unicode defines additional line break characters, but that is probably irrelevant here.
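
For example, a quick check in the interactive interpreter shows the
two-byte encoding of \r:

>>> u"\r".encode("utf-16-le")
'\r\x00'
>>> u"\r".encode("utf-16-be")
'\x00\r'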

Regards,
Martin