[ https://issues.apache.org/jira/browse/MAHOUT-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109483#comment-14109483 ]
ASF GitHub Bot commented on MAHOUT-1608:
----------------------------------------

GitHub user andrewpalumbo reopened a pull request:

    https://github.com/apache/mahout/pull/45

    MAHOUT-1608: Add Option WikipediaToSequenceFile to remove Category Labels from Documents

    Added a CLI option --removeLabels to allow for more robust models. When set will remove [[Category:label]] from the text of a wikipedia document after labeling the document.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1608

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/mahout/pull/45.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #45

----
commit d72c535f00b9adb77b586c71308ec23ddade1b33
Author: Andrew Palumbo <ap....@outlook.com>
Date:   2014-08-23T04:15:56Z

    Added option to remove Labels From Text

commit cf3a095ceba048dbf38c19b66750049dd8cf7505
Author: Andrew Palumbo <ap....@outlook.com>
Date:   2014-08-25T18:25:35Z

    Check for malformed Category tags

----

> Add Option WikipediaToSequenceFile to remove Category Labels from Documents
> ---------------------------------------------------------------------------
>
>                 Key: MAHOUT-1608
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1608
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.9
>            Reporter: Andrew Palumbo
>            Priority: Minor
>             Fix For: 1.0
>
>
> Currently WikipediaMapper job extracts Category labels from the text of the
> Wikipedia documents and leaves the label as [[Category:label]] in the
> document. Add in an option to WikipediaToSequenceFile.java to remove
> [[Category:label]] from the text after extracting the label.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
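
For readers who want a concrete picture of what removing [[Category:label]] from the document text could look like, here is a minimal sketch. It is not the code from the pull request: the class name CategoryLabelStripper and the method removeCategoryLabels are hypothetical, and the regular expression is an assumption about what counts as a well-formed tag. Tags with no closing "]]" are left untouched, which is only one possible way to treat the malformed tags mentioned in the second commit.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not taken from the Mahout source. */
final class CategoryLabelStripper {

  // Assumption: a well-formed tag starts with "[[Category:", contains no
  // further square brackets, and ends with "]]". A tag that is never
  // closed does not match and therefore stays in the text.
  private static final Pattern CATEGORY_TAG =
      Pattern.compile("\\[\\[Category:[^\\[\\]]*\\]\\]");

  private CategoryLabelStripper() {}

  /** Returns the text with every well-formed category tag removed. */
  static String removeCategoryLabels(String text) {
    return CATEGORY_TAG.matcher(text).replaceAll("");
  }

  public static void main(String[] args) {
    String doc = "Quantum mechanics is a physical theory. [[Category:Physics]]";
    System.out.println(removeCategoryLabels(doc));
  }
}
```

In this sketch the label would be extracted first and the tag stripped afterwards, matching the order described in the issue. The whitespace around a removed tag is left as it was.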