[ https://issues.apache.org/jira/browse/MAHOUT-250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olivier Grisel updated MAHOUT-250:
----------------------------------

    Summary: Make WikipediaXmlSplitter able to directly use the bzip2 compressed dump as input  (was: Make WikipediaXmlSplitter able able to directly use the bzip2 compressed dump as input)

> Make WikipediaXmlSplitter able to directly use the bzip2 compressed dump as
> input
> ---------------------------------------------------------------------------
>
>                 Key: MAHOUT-250
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-250
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Olivier Grisel
>            Priority: Minor
>             Fix For: 0.3
>         Attachments: MAHOUT-250-WikipediaXmlSplitter-BZip2.patch
>
>
> Wikipedia.org ships large bzip2-compressed archives, so it would make sense
> to be able to load the chunked XML into HDFS directly from the original file
> without having to uncompress a 25GB temporary file onto the local hard drive.
> Reusing the Hadoop BZip2 codec lets us avoid introducing a new dependency.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
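For reference, reading the compressed dump through Hadoop's stock codec
machinery might look roughly like the sketch below. This illustrates the
approach the description names, not the attached patch itself: the
BZip2DumpReader class name and the chunking hook are hypothetical, while
CompressionCodecFactory, getCodec, and createInputStream are standard Hadoop
compression APIs.

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    // Hypothetical driver: stream a .bz2 Wikipedia dump without unpacking it first.
    public class BZip2DumpReader {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path dump = new Path(args[0]); // e.g. pages-articles.xml.bz2
        FileSystem fs = dump.getFileSystem(conf);

        // The factory resolves BZip2Codec from the .bz2 suffix;
        // for a plain .xml dump it returns null and the raw stream is used.
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(dump);
        InputStream in = fs.open(dump);
        if (codec != null) {
          in = codec.createInputStream(in); // decompress on the fly
        }

        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        try {
          String line;
          while ((line = reader.readLine()) != null) {
            // ... hand each line to the existing XML chunking logic ...
          }
        } finally {
          reader.close();
        }
      }
    }

Because the codec is resolved from the file suffix, the same code path would
handle both compressed and uncompressed dumps, so the splitter would not need
a separate branch for .bz2 input.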