Hello! I recently encountered a problem while using the Reuters dataset from python-nltk.
I'm currently running python-nltk 3.2.1-2 on Debian Stretch, and I downloaded the Reuters corpus using nltk.download().

In the dataset, the character "<" (U+003C) is incorrectly represented as "&lt;". This happens only for this character; all the other special characters seem to be encoded correctly.

I generated a list of the affected files, but I'm not sure where to file the bug report. I only have access to Debian computers, so I haven't tried to reproduce the problem on other systems. Since the character is represented in HTML notation, I suspect the problem was introduced when the files were generated from the original Reuters dataset, which is stored in SGML format [1]. If so, the problem would not be specific to Debian and would affect NLTK in general.

[1] http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt

Thank you for your attention,
Matteo Gamba
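
P.S. In case it helps with reproducing this, here is a minimal sketch of how the affected files could be listed and how the entity could be decoded as a workaround. It uses only the Python standard library. The function names and the sample data are made up for illustration, and the input is assumed to be a dictionary mapping each file id to its raw text.

```python
import html
import re

# Matches the HTML entity forms of "<": named (&lt;) or numeric (&#60; / &#x3c;).
ESCAPED_LT = re.compile(r"&(?:lt|#0*60|#x0*3c);", re.IGNORECASE)


def files_with_escaped_lt(documents):
    """Return the sorted ids of documents whose text contains an escaped "<".

    `documents` maps a file id to its raw text (assumed input shape).
    """
    return sorted(fid for fid, text in documents.items() if ESCAPED_LT.search(text))


def unescape_entities(text):
    """Decode HTML entities such as &lt; back to literal characters."""
    return html.unescape(text)


# Made-up sample data, not taken from the corpus.
sample = {
    "test/0001": "Shares of &lt;ACME Corp> rose.",
    "test/0002": "No entity in this one.",
}
affected = files_with_escaped_lt(sample)
fixed = unescape_entities(sample["test/0001"])
```

With NLTK, and assuming the corpus is installed, the dictionary could be built from reuters.fileids() and reuters.raw(fid) in nltk.corpus. Note that html.unescape decodes every entity it recognises (for example "&amp;" as well), which may or may not be desirable here.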