[
https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Timothy Potter updated MAHOUT-588:
----------------------------------
Attachment: SequenceFilesFromMailArchives.java
SequenceFilesFromMailArchives is based on SequenceFilesFromDirectory, but it
adds block compression to the SequenceFile.Writer and splits each archive into
individual mail messages using simple regex patterns. I used
^From \S+@\S.*\d{4}$ as the message boundary pattern.
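As a rough illustration of how a boundary pattern like this can split an
mbox-style archive, here is a minimal sketch. It is not the code from the
attachment: the class name MessageBoundarySketch, the split() helper, and the
sample addresses are all made up for this example, and it assumes the pattern
is applied to one line at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; not the attached SequenceFilesFromMailArchives.java.
public class MessageBoundarySketch {

    // The boundary pattern quoted above, matched against a single line.
    static final Pattern BOUNDARY = Pattern.compile("^From \\S+@\\S.*\\d{4}$");

    /** Splits mbox-style text into messages; each message starts at a boundary line. */
    static List<String> split(String mbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null; // null until the first boundary line is seen
        for (String line : mbox.split("\r?\n")) {
            if (BOUNDARY.matcher(line).matches()) {
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    public static void main(String[] args) {
        String mbox =
            "From alice@example.org Mon Jan 10 12:00:00 2011\n"
          + "Subject: first\n"
          + "\n"
          + "From the body; this line is not a boundary.\n"
          + "From bob@example.org Tue Jan 11 08:30:00 2011\n"
          + "Subject: second\n";
        System.out.println(split(mbox).size() + " messages");
    }
}
```

Requiring an @ in the sender and a four-digit year at the end of the line keeps
ordinary body lines that merely start with "From " from being treated as
message boundaries.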
Running it on the asf-mail-archives, I get 6,107,076 messages (slightly more
than Szymon's count, I think).
To run this, save the attached file to
utils/src/main/java/org/apache/mahout/text, rebuild, and then do something like:
$MAHOUT_HOME/bin/mahout org.apache.mahout.text.SequenceFilesFromMailArchives \
--input /mnt/asf-mail-archives/extracted \
--output /mnt/asf-mail-archives/sequence-files \
-c UTF-8 -chunk 1024 -prefix TamingText
The chunk size is rather large because it is the raw size before compression.
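To make the point about raw size concrete, here is a minimal sketch of chunk
rollover driven by uncompressed bytes. It is an assumption about how such
chunking can work, not the logic in the attachment: the class
ChunkRolloverSketch and its write() method are hypothetical, the chunk size is
assumed to be in megabytes, and no Hadoop classes are involved.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the attached SequenceFilesFromMailArchives.java.
public class ChunkRolloverSketch {

    private final long maxChunkBytes;   // limit on raw (uncompressed) bytes per chunk
    private long currentChunkBytes = 0; // raw bytes counted for the current chunk
    private int chunkId = 0;            // index of the chunk currently being filled

    ChunkRolloverSketch(int chunkSizeInMB) {
        this.maxChunkBytes = chunkSizeInMB * 1024L * 1024L;
    }

    /** Counts one key/value pair and returns the index of the chunk it lands in. */
    int write(String key, String value) {
        long rawBytes = key.getBytes(StandardCharsets.UTF_8).length
                      + value.getBytes(StandardCharsets.UTF_8).length;
        // Start a new chunk when this record would push the raw size past the limit.
        if (currentChunkBytes > 0 && currentChunkBytes + rawBytes > maxChunkBytes) {
            chunkId++;
            currentChunkBytes = 0;
        }
        currentChunkBytes += rawBytes;
        return chunkId;
    }

    public static void main(String[] args) {
        ChunkRolloverSketch writer = new ChunkRolloverSketch(1024); // as in -chunk 1024
        int chunk = writer.write("TamingText/some-list/1", "Subject: hello\n\nbody text");
        System.out.println("record written to chunk " + chunk);
    }
}
```

Because the counter only sees raw bytes, a chunk written with block
compression ends up smaller on disk than the configured limit.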
> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>
> Key: MAHOUT-588
> URL: https://issues.apache.org/jira/browse/MAHOUT-588
> Project: Mahout
> Issue Type: Task
> Reporter: Grant Ingersoll
> Attachments: SequenceFilesFromMailArchives.java
>
>
> For Taming Text, I've commissioned some benchmarking work on Mahout's
> clustering algorithms. I've asked the two doing the project to do all the
> work in the open here. The goal is to use a publicly reusable dataset (for
> now, the ASF mail archives, assuming it is big enough) and run on EC2 and
> make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done
> as a Vectorizer) and the publication of the results will be put up on the
> Wiki as well as in the book. This issue is to track the patches, etc.