Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: File Format Integrations 
(https://cwiki.apache.org/confluence/display/MAHOUT/File+Format+Integrations)


Edited by Lance Norskog:
---------------------------------------------------------------------
There are several importers and exporters for common file formats.
h2. General-purpose converters
h3. Importer 'bin/mahout' jobs
Run these with --help to see their options.
* bin/mahout arff.vector
* bin/mahout lucene.vector
* bin/mahout regexconverter reads text lines and emits the "captured" regex output (the text matched by the pattern's capture groups) into LongWritable/Text SequenceFiles. See a [tutorial|http://download.oracle.com/javase/tutorial/essential/regex/] and a [cheat sheet|http://www.omicentral.com/cheatsheets/JavaRegularExpressionsCheatSheet.pdf] for Java regular expressions, a marvelously opaque toolkit.
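
To make "captured" output concrete, here is a minimal sketch using only java.util.regex. It is not the regexconverter implementation; the class name RegexCaptureSketch, the helper capture(), and the sample pattern are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Mahout.
public class RegexCaptureSketch {

    /** Returns the text of each capture group from the first match in the line, or an empty list if nothing matches. */
    public static List<String> capture(Pattern pattern, String line) {
        List<String> groups = new ArrayList<>();
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            // Group 0 is the whole match; groups 1..n are the parenthesized captures.
            for (int i = 1; i <= matcher.groupCount(); i++) {
                groups.add(matcher.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sample pattern (an assumption): a number followed by a word.
        Pattern pattern = Pattern.compile("(\\d+)\\s+(\\w+)");
        System.out.println(capture(pattern, "42 apples and more"));
    }
}
```

Testing a pattern against a few sample lines this way, before running a job over a whole data set, is a cheap way to check that the capture groups pick out the intended text.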

h3. Exporter 'bin/mahout' jobs
These jobs dump text versions of SequenceFiles for eyeballing. Run them with 
--help to see their options.
* bin/mahout clusterdump
* bin/mahout cmdump
* bin/mahout matrixdump
* bin/mahout seqdumper
* bin/mahout vectordump

*Note:* any class with a main() method can be used as a bin/mahout job name.

h3. Importer classes
These classes have no main() method, so they cannot be run as bin/mahout jobs; call them from your own code.
* CSVVectorIterator imports CSV files into vectors. 
* MailProcessor parses text-only mailboxes into a SequenceFile with a numbered 
key and the text body in the value.
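
As a rough picture of what importing CSV into vectors involves, the sketch below parses CSV text into rows of doubles using only the JDK. It does not use CSVVectorIterator, whose actual API may differ. The class CsvRowsSketch and its parse() method are hypothetical, and the sketch assumes purely numeric cells with no quoting or header row.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a CSV-to-vector importer; not Mahout code.
public class CsvRowsSketch {

    /** Parses comma-separated numeric text into one double[] per non-blank line. */
    public static List<double[]> parse(String csvText) {
        List<double[]> rows = new ArrayList<>();
        for (String line : csvText.split("\\r?\\n")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = line.split(",");
            double[] row = new double[cells.length];
            for (int i = 0; i < cells.length; i++) {
                // Assumption: every cell is a plain number (no quotes, no missing values).
                row[i] = Double.parseDouble(cells[i].trim());
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (double[] row : parse("1.0, 2.5\n3, 4\n")) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

In a real job, each row would be wrapped in a Mahout vector type rather than kept as a bare array.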
h3. Exporter classes
* *GraphMLClusterWriter* saves cluster data in the [GraphML|http://graphml.graphdrawing.org/] format.
* *CSVClusterWriter* saves clusters in a CSV-based format.

Both formats can be read by [Gephi|http://gephi.org/], an interactive graph 
explorer. 
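
For readers who have not seen GraphML, the sketch below builds a minimal GraphML document in which each cluster is a node linked by an edge to each of its member points. This layout is an assumption chosen for illustration and is not necessarily what GraphMLClusterWriter emits. The class GraphMlSketch and its toGraphMl() method are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the GraphML format; not Mahout code.
public class GraphMlSketch {

    /** Renders a cluster-id to member-ids map as GraphML. Ids are assumed to be XML-safe (no escaping is done). */
    public static String toGraphMl(Map<String, List<String>> clusters) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<graphml xmlns=\"http://graphml.graphdrawing.org/xmlns\">\n");
        sb.append("  <graph id=\"clusters\" edgedefault=\"undirected\">\n");
        for (Map.Entry<String, List<String>> entry : clusters.entrySet()) {
            String clusterId = entry.getKey();
            sb.append("    <node id=\"").append(clusterId).append("\"/>\n");
            for (String pointId : entry.getValue()) {
                // One node per member point, plus an edge tying it to its cluster.
                sb.append("    <node id=\"").append(pointId).append("\"/>\n");
                sb.append("    <edge source=\"").append(clusterId)
                  .append("\" target=\"").append(pointId).append("\"/>\n");
            }
        }
        sb.append("  </graph>\n");
        sb.append("</graphml>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        clusters.put("c0", Arrays.asList("p0", "p1"));
        System.out.print(toGraphMl(clusters));
    }
}
```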

Many other file importers are custom-made for particular algorithms:
* The various text -> Lucene index converters



