Hi Steven,

Thanks very much for the detailed reply; it was very useful.

This is just a utility script that reads some sentiment analysis data and combines the positive and negative sentiments of multiple people into a single sentiment per line. The data comes from a public source that I have no control over, so I can't change how the file is saved.

What worked was your suggestion to ignore the errors. (I checked that my results are not affected when I do so.)

Thanks again.
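For anyone who finds this thread later, here is a minimal sketch of that approach, combined with the csv module recommended in the quoted reply. It is an illustration only, not the actual script: the helper name `read_rows` and the choice of the 'replace' handler (rather than 'ignore') are assumptions. The encoding is passed explicitly because the default used by open() depends on the platform's locale settings. Rows that needed a replacement character are recorded, so the results can be checked afterwards.

```python
import csv

# Character that errors="replace" substitutes for each undecodable byte.
REPLACEMENT = "\ufffd"


def read_rows(path, encoding="utf-8"):
    """Read a CSV file, skipping the header row.

    Returns (rows, damaged), where damaged holds the file line numbers
    of rows in which at least one byte could not be decoded.
    """
    rows, damaged = [], []
    # newline="" is what the csv documentation recommends for csv.reader.
    with open(path, "r", encoding=encoding, errors="replace", newline="") as ifd:
        reader = csv.reader(ifd)
        next(reader, None)  # skip the header row, if there is one
        for row in reader:
            if any(REPLACEMENT in field for field in row):
                damaged.append(reader.line_num)
            rows.append(row)
    return rows, damaged
```

If `damaged` comes back empty, nothing was replaced; otherwise it says which lines to look at by hand before trusting the output. The `with` block also closes the file, so there is no separate close() call to forget.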
On Mon, Oct 28, 2013 at 7:49 PM, Steven D'Aprano <st...@pearwood.info> wrote:

> On Mon, Oct 28, 2013 at 06:13:59PM -0400, SM wrote:
> > Hello,
> > I have an extremely simple piece of code which reads a .csv file, which has
> > 1000 lines of fixed fields, one line at a time, and tries to print some
> > values.
> >
> >   1 #!/usr/bin/python3
> >   2 #
> >   3 import sys, time, re, os
> >   4
> >   5 if __name__=="__main__":
> >   6
> >   7     ifd = open("infile.csv", 'r')
>
> By default Python 3 uses UTF-8 when reading files. As the error below
> shows, your file actually isn't UTF-8.
>
> What are you using to generate the CSV file? Consult the documentation
> for that program and see what it is using. If it has an option to save
> using UTF-8, use that.
>
> See below for more discussion.
>
> >   8
> >   9     linenum = 0
> >  10     for line in ifd:
> >  11         line1 = re.split(",", line)
> >  12         total = 0
> >  13         if linenum == 0:
> >  14             linenum = linenum + 1
> >  15             continue
> [snip many more lines of code]
>
> All of this manual effort is unnecessary, as Python comes standard with
> a library to read CSV files. It is much better to use that:
>
> http://docs.python.org/3/library/csv.html
>
> >  31     ifd.close
>
> This line is buggy. To close the file, you need to *call* the close
> method by using parentheses, that is, you must write:
>
>     ifd.close()
>
> Without the parentheses, you just get a reference to the close method
> but don't do anything with it.
>
> > It works fine till it parses the 1st 126 lines in the input file. For the
> > 127th line (irrespective of the contents of the actual line), it prints the
> > following error:
> > Traceback (most recent call last):
> >   File "p1.py", line 10, in <module>
> >     for line in ifd:
> >   File "/usr/lib/python3.2/codecs.py", line 300, in decode
> >     (result, consumed) = self._buffer_decode(data, self.errors, final)
> > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1173:
> > invalid continuation byte
> > $
> >
> > I am not able to figure out the cause of this error. Any clues as to why I
> > am seeing this error, are appreciated.
>
> As mentioned earlier, the error is that the CSV file is not encoded
> using UTF-8. Best solution is to go back to the source where the file
> comes from and pick the option to always save using UTF-8.
>
> Second best solution is to identify what codec is actually being used.
> If you tell us what program generates the CSV file in the first place,
> and the operating system you are using (Windows? Mac? Linux?), we might
> be able to identify the codec being used.
>
> If you can't identify the codec, you can guess. Guessing is bad, for two
> reasons:
>
> - you can waste a lot of time with bad guesses;
>
> - worse, some bad guesses won't give you an error, but will just
>   give you bad data.
>
> Nevertheless, you can try using a different encoding when you open the
> file. Try this:
>
>     ifd = open("infile.csv", 'r', encoding='latin-1')
>
> "Latin 1" is an encoding which should not fail, but it might give back
> rubbish data. Such rubbish data is often called "mojibake":
>
> en.wikipedia.org/wiki/Mojibake
>
> Another option is to cover up the errors by passing an error handler:
>
>     ifd = open("infile.csv", 'r', errors='replace')
>
> which will replace any undecodable bytes in the file with a "missing
> character".
>
>
> --
> Steven
> _______________________________________________
> Tutor maillist - Tutor@python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
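One way to make the guessing described above a little less blind is to look at the raw bytes first. The sketch below is only an illustration under stated assumptions: the helper names `undecodable_lines` and `show_candidates` are made up, and the candidate list (latin-1, cp1252) is a guess at common single-byte encodings, not something established in this thread. It reads the file in binary mode, reports which lines fail to decode as UTF-8, and shows how those lines would look under each candidate. Reading line by line in binary also helps pin down the offending line: the text layer decodes the file in buffered chunks, so the line on which the original error surfaced may not be the line that contains the bad byte.

```python
def undecodable_lines(path, encoding="utf-8"):
    """Yield (line_number, raw_bytes, error) for each line that fails to decode."""
    with open(path, "rb") as f:
        for number, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                yield number, raw, exc


def show_candidates(raw, candidates=("latin-1", "cp1252")):
    """Return {encoding_name: decoded_text} for candidates that decode cleanly."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            pass  # this candidate cannot decode the line at all
    return results
```

A candidate that decodes without error is not necessarily the right one; the decoded text still has to be read by a person to see whether it makes sense. For example, byte 0xe9 decodes to "é" under latin-1, which is plausible in a name but proves nothing on its own.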