Inconsistency in the Reuters dataset from python-nltk

Matteo Gamba Fri, 20 Jan 2017 11:53:57 -0800

Hello!

I recently encountered a problem while using the Reuters dataset from
python-nltk.


I'm currently running python-nltk 3.2.1-2 on Debian Stretch, and I
downloaded the Reuters corpus using nltk.download().

In the dataset, the character "<" (U+003C) is incorrectly represented as
"&lt;".

This inconsistency affects only this character; all the other
special characters seem to be correctly encoded.
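
As a possible stopgap, and assuming that "&lt;" really is the only
affected sequence, the raw text could be patched with a narrow string
replacement. This is only a sketch; restore_less_than is a hypothetical
helper name, not part of NLTK.

```python
def restore_less_than(text):
    """Replace the literal sequence "&lt;" with "<".

    Assumption: "&lt;" is the only wrongly encoded sequence, so a plain
    string replacement is used and every other character is left alone.
    """
    return text.replace("&lt;", "<")


if __name__ == "__main__":
    # Made-up sample line, not taken from the corpus.
    print(restore_less_than("&lt;ACME> said it expects higher earnings"))
```

A plain replacement is used here instead of a general HTML unescape so
that nothing other than the reported sequence is modified.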

I generated a list of files presenting the problem, but I'm not sure
where to file the bug report.
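
For anyone who wants to reproduce such a list, something along the
following lines should work. It is a minimal sketch and not necessarily
the script used for the list mentioned above: files_with_entity and
reuters_documents are hypothetical helper names, and the code assumes
that the corpus was installed with nltk.download("reuters").

```python
def files_with_entity(documents, entity="&lt;"):
    """Return the sorted ids of documents whose raw text contains `entity`.

    `documents` is any iterable of (file id, raw text) pairs.
    """
    return sorted(fid for fid, text in documents if entity in text)


def reuters_documents():
    """Yield (file id, raw text) pairs from the NLTK Reuters corpus."""
    # Imported here so that the helper above can be used without NLTK.
    from nltk.corpus import reuters
    for fid in reuters.fileids():
        yield fid, reuters.raw(fid)


if __name__ == "__main__":
    for fid in files_with_entity(reuters_documents()):
        print(fid)
```

Running the file as a script prints one affected file id per line.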

I only have access to Debian machines, so I haven't tried to reproduce
the problem on other systems. Since the character is represented in HTML
notation, I suspect the problem was introduced when the files were
generated from the original Reuters dataset, which is stored in SGML
format[1]. If so, the problem may be general to NLTK rather than
specific to Debian.

[1]
http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt

Thank you for your attention,

Matteo Gamba

-- 

XMPP: ar...@hipatia.net

----------------------------------------------------------------------------
Fingerprint: 1CD6 BCD3 582C 9107 3173 AE0C 1457 F9D5 E4DE AEB8
Public key: 0xE4DEAEB8 - http://pgp.mit.edu
----------------------------------------------------------------------------

http://guri.hipatia.net
http://www.hipatia.net

-- 
debian-science-maintainers mailing list
debian-science-maintainers@lists.alioth.debian.org
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/debian-science-maintainers
