Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Converting Content
(https://cwiki.apache.org/confluence/display/MAHOUT/Converting+Content)
Edited by Grant Ingersoll:
---------------------------------------------------------------------
{toc}
h1. Intro
Mahout includes a few tools for converting content into formats that Mahout
can consume more easily. They are not a full ETL layer, but they are useful
for tasks such as converting text files and log files. All of them are
available through the $MAHOUT_HOME/bin/mahout command-line driver.
h1. SequenceFilesFrom*
* SequenceFilesFromDirectory -- Converts a directory of text files to a SequenceFile in which the key is the name of the file and the value is the full text of that file.
* SequenceFilesFromMailArchives -- Similar to SequenceFilesFromDirectory, but converts mbox files.
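
The key/value layout described above can be illustrated with a small Python sketch. It is not Mahout code and does not write the Hadoop SequenceFile format; it only shows the mapping from files to (key, value) pairs. The function name {{directory_to_pairs}} and the choice of a path relative to the input directory as the key are assumptions made for this illustration.

```python
import tempfile
from pathlib import Path


def directory_to_pairs(root):
    """Yield (key, value) pairs: key names the file, value is its full text."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Assumed key format for this sketch: "/" plus the path
            # relative to the input directory.
            key = "/" + path.relative_to(root).as_posix()
            yield key, path.read_text(encoding="utf-8")


# Small demonstration on a throwaway directory with two text files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first document", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second document", encoding="utf-8")
    pairs = list(directory_to_pairs(tmp))

for key, value in pairs:
    print(key, "->", value)
```

Each pair corresponds to one record of the resulting SequenceFile: one input file becomes one key and one value.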
h1. RegexConverterDriver
RegexConverterDriver is useful for converting files such as logs from one
format to another. For example, the following command extracts queries from
the HTTP request logs of [Solr|http://lucene.apache.org] and prepares them
for use by [FrequentItemsetMining].
{noformat}
bin/mahout regexconverter \
  --input /Users/grantingersoll/projects/content/lucid/lucidfind/logs \
  --output /tmp/solr/output \
  --regex "(?<=(\?|&)q=).*?(?=&|$)" \
  --overwrite \
  --transformerClass url \
  --formatterClass fpg
{noformat}
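
To see what the regular expression in the command above selects, the following Python sketch applies the same pattern to a few sample request lines. The log lines are invented for illustration, and the URL-decoding step is an assumption about what the {{url}} transformer does; the output format produced by the {{fpg}} formatter is not reproduced here.

```python
import re
from urllib.parse import unquote_plus

# Same pattern as the --regex argument: the text after "?q=" or "&q=",
# up to the next "&" or the end of the line.
QUERY_PATTERN = re.compile(r"(?<=(\?|&)q=).*?(?=&|$)")

# Hypothetical request lines, not taken from real Solr logs.
sample_lines = [
    "/solr/select?q=mahout+clustering&rows=10",
    "/solr/select?rows=10&q=title%3Ahadoop",
    "/solr/select?fq=type%3Adoc&q=lucene&start=0",
    "/solr/admin/ping",
]


def extract_query(line):
    """Return the raw q parameter from a line, or None if there is none."""
    match = QUERY_PATTERN.search(line)
    return match.group(0) if match else None


raw_queries = [extract_query(line) for line in sample_lines]

# Assumed behaviour of the url transformer: plain URL decoding.
decoded_queries = [unquote_plus(q) for q in raw_queries if q is not None]

for query in decoded_queries:
    print(query)
```

The lookbehind requires "?" or "&" directly before "q=", so the "fq=" parameter in the third line is skipped and only the "q" parameter is extracted.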