Ah of course! Good one. Do you know if there is an existing tool to index those emails?
On Sat, Feb 25, 2012 at 4:10 AM, Lance Norskog <[email protected]> wrote:
> Apache mail files? You need an AWS account to pull them.
>
> http://www.lucidimagination.com/search/document/1ab0374bd10d8d89
>
> On Fri, Feb 24, 2012 at 8:56 AM, Frank Scholten (Commented) (JIRA)
> <[email protected]> wrote:
>>
>> [
>> https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215734#comment-13215734
>> ]
>>
>> Frank Scholten commented on MAHOUT-944:
>> ---------------------------------------
>>
>> Renamed config to LuceneStorageConfig and simplified serialization. Added
>> AbstractLuceneStorageTest with helper methods for indexing documents.
>>
>> https://github.com/frankscholten/mahout/commit/41ee459d075c3aa2e10eab4eb5580cbf505fcbf6
>>
>> Does anyone know of a large index I can use for testing? Wikipedia is not
>> that big, the sequential lucene2seq version takes only 3,5 minutes on my
>> machine to convert it into a sequence file.
>>
>>> LuceneIndexToSequenceFiles (lucene2seq) utility
>>> -----------------------------------------------
>>>
>>> Key: MAHOUT-944
>>> URL: https://issues.apache.org/jira/browse/MAHOUT-944
>>> Project: Mahout
>>> Issue Type: New Feature
>>> Components: Integration
>>> Affects Versions: 0.5
>>> Reporter: Frank Scholten
>>> Assignee: Grant Ingersoll
>>> Priority: Minor
>>> Fix For: 0.7
>>>
>>> Attachments: MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch,
>>> MAHOUT-944.patch
>>>
>>>
>>> Here is a lucene2seq tool I used in a project. It creates sequence files
>>> based on the stored fields of a lucene index.
>>> The output from this tool can be then fed into seq2sparse and from there
>>> you can do text clustering.
>>> Comes with Java bean configuration.
>>> Let me know what you think. Some CLI code can be added later on. I used
>>> this for a small-scale project +- 100.000 docs. Is a MR version useful or
>>> is that overkill?
>>> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and
>>> review comments from Simon Willnauer (Thanks Simon!)
>>> or the attached patch.
>>
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA
>> administrators:
>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>>
>
>
>
> --
> Lance Norskog
> [email protected]
>
