Make WikipediaXmlSplitter able to directly use the bzip2 compressed dump as input
----------------------------------------------------------------------------------
                 Key: MAHOUT-250
                 URL: https://issues.apache.org/jira/browse/MAHOUT-250
             Project: Mahout
          Issue Type: Improvement
          Components: Classification
    Affects Versions: 0.2
            Reporter: Olivier Grisel
            Priority: Minor
             Fix For: 0.3

Wikipedia.org ships its dumps as large bzip2-compressed archives, so it would make sense to be able to load the chunked XML into HDFS directly from the original file without having to uncompress a 25GB temporary file on the local hard drive. Reusing the Hadoop BZip2 codec allows us to avoid introducing a new dependency.
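In other words, the idea is to open the .bz2 dump through Hadoop's compression codec layer instead of decompressing it to disk first. A minimal sketch of what that could look like with the stock Hadoop API follows; the class name and helper method are illustrative only, not the actual WikipediaXmlSplitter change:

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

/** Illustrative helper: open a (possibly bzip2-compressed) Wikipedia dump as a plain InputStream. */
public final class DumpInputFactory {

  private DumpInputFactory() {
  }

  public static InputStream open(Configuration conf, String dump) throws IOException {
    Path path = new Path(dump);
    FileSystem fs = FileSystem.get(path.toUri(), conf);
    InputStream raw = fs.open(path);
    // Let Hadoop pick a codec from the file extension (.bz2 -> BZip2Codec);
    // a null codec means the dump is uncompressed, so return the raw stream.
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
    return codec == null ? raw : codec.createInputStream(raw);
  }
}

The splitter would then read from this stream the same way it reads an uncompressed dump today, so the 25GB temporary file mentioned above would no longer be needed, and no new third-party bzip2 library has to be added.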