Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Converting Content 
(https://cwiki.apache.org/confluence/display/MAHOUT/Converting+Content)

Added by Grant Ingersoll:
---------------------------------------------------------------------
{toc}

h1. Intro

Mahout includes some tools for converting content into formats that Mahout 
can consume more easily.  They should not be mistaken for a full ETL layer, 
but they are useful for tasks such as converting text files and log files.  
All of them can be run through the $MAHOUT_HOME/bin/mahout command-line driver.

h1. SequenceFilesFrom*

* SequenceFilesFromDirectory -- Converts
* SequenceFilesFromMailArchives -- works


h1. RegexConverterDriver

Useful for converting things like log files from one format to another.  For 
instance, you could convert Solr log files containing query requests into a 
format consumable by [FrequentItemsetMining].

For example, the following command extracts the queries from logs of HTTP 
requests to [Solr|http://lucene.apache.org] and prepares them for use by 
Frequent Itemset Mining.
{noformat}
bin/mahout regexconverter \
  --input /Users/grantingersoll/projects/content/lucid/lucidfind/logs \
  --output /tmp/solr/output \
  --regex "(?<=(\?|&)q=).*?(?=&|$)" \
  --overwrite \
  --transformerClass url --formatterClass fpg
{noformat}
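
To see what the regular expression in the command above selects, here is a minimal sketch that applies the same pattern to sample log lines. It is not Mahout code: it uses Python's {{re}} module as a stand-in for the Java regex engine, the function name {{extract_queries}} and the sample log lines are hypothetical, and the URL-decoding step is only an assumption about what the {{url}} transformer does.

```python
import re
from urllib.parse import unquote_plus

# Same pattern as the --regex argument: match the text that follows
# "?q=" or "&q=" and stops before the next "&" or the end of the line.
QUERY_PATTERN = re.compile(r"(?<=(\?|&)q=).*?(?=&|$)")


def extract_queries(log_line):
    """Return the q= values found in one log line, URL-decoded.

    The decoding step is an assumption about the "url" transformer;
    it is not taken from the Mahout source.
    """
    return [unquote_plus(m.group(0)) for m in QUERY_PATTERN.finditer(log_line)]


if __name__ == "__main__":
    # Hypothetical request lines, for illustration only.
    print(extract_queries("GET /solr/select?q=apache+mahout&rows=10 HTTP/1.1"))
    print(extract_queries("/solr/select?rows=10&q=hadoop%20cluster"))
```

Because the match is lazy and ends at the first {{&}} or at the end of the line, a {{q}} parameter that comes last in the URL will also pick up whatever follows it on that log line.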


