[ https://issues.apache.org/jira/browse/MAHOUT-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Palumbo reassigned MAHOUT-1608:
--------------------------------------

    Assignee: Andrew Palumbo

> Add Option WikipediaToSequenceFile to remove Category Labels from Documents
> ---------------------------------------------------------------------------
>
>                 Key: MAHOUT-1608
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1608
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.9
>            Reporter: Andrew Palumbo
>            Assignee: Andrew Palumbo
>            Priority: Minor
>             Fix For: 1.0
>
>
> Currently WikipediaMapper job extracts Category labels from the text of the
> Wikipedia documents and leaves the label as [[Category:label]] in the
> document. Add in an option to WikipediaToSequenceFile.java to remove
> [[Category:label]] from the text after extracting the label.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
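
For illustration only, the behavior requested in the issue (extract the label, then drop the [[Category:label]] markup from the text) could be sketched roughly as below. This is not Mahout's code: the class name CategoryLabelStripper, its method names, and the regular expression are all hypothetical, and the issue does not say how the option is named or wired into WikipediaToSequenceFile.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout: shows one way to pull category
// labels out of wiki text and then remove the markup that carried them.
public class CategoryLabelStripper {

    // Assumed pattern: matches [[Category:label]] and [[Category:label|sort key]].
    // Group 1 captures the label up to the first ']' or '|'.
    private static final Pattern CATEGORY_LINK =
        Pattern.compile("\\[\\[Category:([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");

    // Collect every category label found in the text, in order of appearance.
    public static List<String> extractLabels(String text) {
        List<String> labels = new ArrayList<>();
        Matcher m = CATEGORY_LINK.matcher(text);
        while (m.find()) {
            labels.add(m.group(1).trim());
        }
        return labels;
    }

    // Return the text with every category link deleted; surrounding
    // whitespace is left untouched.
    public static String removeLabels(String text) {
        return CATEGORY_LINK.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String doc = "Quarks are elementary particles. [[Category:Physics]]";
        System.out.println(extractLabels(doc));
        System.out.println(removeLabels(doc).trim());
    }
}
```

Extraction runs before removal here so that the labels remain available once the markup is gone, matching the order described in the issue ("after extracting the label").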