On Mon, Oct 21, 2013 at 11:57 AM, Manish Tripathi <tr.man...@gmail.com> wrote:
> It's pipeline data so must have been generated through Siebel and sent
> as excel csv.

I am assuming that you are talking about "Siebel Analytics", some kind of
analysis software from Oracle:

    http://en.wikipedia.org/wiki/Siebel_Systems

That would be fine, except that knowing the file comes out of Siebel is no
guarantee that the output you're consuming is well-formed Excel CSV. For
example, I see things like this:

    http://spendolini.blogspot.com/2006/04/custom-export-to-csv.html

where the generated output is "ad-hoc".

-----------

Hmmm... but let's assume for the moment that your data is ok. Could the
problem be in pandas? Let's follow this line of logic and see where it
takes us.

Given the structure of the error you're seeing, I have to assume that
pandas is trying to decode the bytes and runs into an issue, though the
exact position where it runs into the error is in question. In fact,
looking at:

    https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1357

for example, the library appears to be trying to decode line-by-line under
certain situations. If it runs into an error there, it reports an offset
into a particular line, not an offset from the perspective of the whole
file.

Wow. That can be very bad, if I'm reading it right. And it's worse than a
misleading error message: it's unsound. The code _should_ be doing the
decoding from the perspective of the whole file, not at the level of
single lines. It needs to be using codecs.open(), and let codecs.open()
handle the details of byte -> unicode-string decoding. Otherwise, by the
time we're splitting raw bytes into lines, it's already too late: we've
taken an interpretation of the bytes that's potentially invalid. For
example, if we're working with UTF-16 and we get into this code path, it'd
be really bad. It's hard to tell whether or not we're actually taking that
code path.
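To make the UTF-16 worry concrete, here's a small illustration (written in
Python 3 syntax for brevity; the sample text is made up, and I'm using
utf-16-le explicitly so the byte layout is deterministic, but the same
mechanism applies in Python 2):

```python
# Sketch: why splitting raw bytes into "lines" *before* decoding is unsound
# for multi-byte encodings like UTF-16.
data = "ab\ncd".encode("utf-16-le")
print(data)  # b'a\x00b\x00\n\x00c\x00d\x00'

# In little-endian UTF-16, the newline character is TWO bytes: b'\n\x00'.
# Naive line splitting on the single byte b'\n' cuts that code unit in half:
pieces = data.split(b"\n")
print(pieces)  # [b'a\x00b\x00', b'\x00c\x00d\x00']

# The first piece happens to decode correctly...
print(pieces[0].decode("utf-16-le"))  # ab

# ...but the second piece starts with a stray null byte and has odd length,
# so it is no longer valid UTF-16 at all:
try:
    pieces[1].decode("utf-16-le")
except UnicodeDecodeError as e:
    print("decode failed:", e)
```

So line-at-a-time decoding can "work" on some lines and blow up on others,
with an error offset that's relative to a bogus chunk of bytes, which is
exactly the kind of confusing report I'd worry about here.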
I'm following the definition of read_csv from:

    https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L409

to:

    https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L282

to:

    https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L184

to:

    https://github.com/pydata/pandas/blob/master/pandas/io/common.py#L100

Ok, at that point, they appear to try to decode the entire file. Somewhat
good so far. Though, technically, pandas should be using codecs.open():

    http://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data

and because it isn't, it appears to suck the entire file into memory with
StringIO. Yikes.

Now the pandas library must make sure _not_ to decode() again, because
decoding is not an idempotent operation: decoding an already-decoded
string does not give back the same string, and in Python 2 it can blow up
in a confusing way. As a concrete example:

##############################################################
>>> 'foobar'.decode('utf-16')
u'\u6f66\u626f\u7261'
>>> 'foobar'.decode('utf-16').decode('utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
##############################################################

(Note that the second decode() fails with a Unicode*Encode*Error: in
Python 2, calling decode() on a unicode string first implicitly encodes it
with the ascii codec, which is exactly the kind of baffling failure mode
I'm talking about.)

This is reminiscent of the kind of error you're encountering, though I'm
not sure if it's the same situation. Unfortunately, I'm running out of
time to analyze this further. If you could upload your data file
somewhere, someone else here may have time to investigate the error you're
seeing in more detail.

From reading the pandas code, I'm discouraged by the code quality: I do
think there's the potential for a bug in the library. The code is a heck
of a lot more complicated than I think it needs to be.
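For contrast, here's a minimal sketch of the approach I'm advocating,
written in Python 3, where the built-in open() with an encoding argument
plays the role codecs.open() plays in Python 2. The filename and sample
data are made up for illustration:

```python
# Sketch: let the file object own the byte -> text decoding, instead of
# splitting raw bytes into lines first. The codec machinery then decodes
# the stream as a whole, so multi-byte encodings like UTF-16 survive
# line splitting intact.
import os
import tempfile

text = "name,amount\nM\u00fcller,1200\n"  # CSV with a non-ASCII character

# Write the file as UTF-16 (two bytes per character, plus a BOM).
with tempfile.NamedTemporaryFile(mode="wb", suffix=".csv",
                                 delete=False) as f:
    f.write(text.encode("utf-16"))
    path = f.name

# Open in *text* mode with the right encoding; iteration yields already
# decoded unicode lines, and no second decode() is ever needed.
with open(path, encoding="utf-16") as f:
    rows = [line.rstrip("\n").split(",") for line in f]

print(rows)  # [['name', 'amount'], ['Müller', '1200']]
os.unlink(path)
```

The design point is that decoding happens exactly once, at the boundary
where bytes enter the program; everything downstream deals only in
already-decoded text, so the "decode twice" trap can't arise.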
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor