Andrew Palumbo created MAHOUT-1608:
--------------------------------------
Summary: Add Option to WikipediaToSequenceFile to remove Category
Labels from Documents
Key: MAHOUT-1608
URL: https://issues.apache.org/jira/browse/MAHOUT-1608
Project: Mahout
Issue Type: Improvement
Affects Versions: 0.9
Reporter: Andrew Palumbo
Priority: Minor
Fix For: 1.0
Currently, the WikipediaMapper job extracts category labels from the text of
the Wikipedia documents but leaves each label as [[Category:label]] in the
document. Add an option to WikipediaToSequenceFile.java to remove
[[Category:label]] from the text after the label has been extracted.
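
The requested behavior could be sketched roughly as follows. This is only an illustration, not the actual Mahout code: the class name CategoryMarkupStripper, the method removeCategoryMarkup, and the boolean flag removeLabels are all hypothetical, and the regular expression assumes category links have the form [[Category:label]], optionally with a sort key after a pipe.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of Mahout's WikipediaMapper or WikipediaToSequenceFile.
final class CategoryMarkupStripper {

    // Assumed form of a category link: [[Category:label]] or [[Category:label|sort key]].
    // Matching on the word "Category" is case-insensitive.
    private static final Pattern CATEGORY_LINK =
        Pattern.compile("\\[\\[\\s*Category:[^\\]]*\\]\\]", Pattern.CASE_INSENSITIVE);

    private CategoryMarkupStripper() {
        // Static utility class; no instances.
    }

    // Returns the text with every category link removed when removeLabels is true,
    // and the text unchanged otherwise. removeLabels stands in for the proposed option.
    static String removeCategoryMarkup(String text, boolean removeLabels) {
        if (!removeLabels) {
            return text;
        }
        return CATEGORY_LINK.matcher(text).replaceAll("");
    }
}
```

In a mapper, a call such as CategoryMarkupStripper.removeCategoryMarkup(document, true) would presumably run only after the labels have been extracted, so that label detection still sees the original markup. Leaving the flag off by default would preserve the current output.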