Apache mail files? You need an AWS account to pull them.
http://www.lucidimagination.com/search/document/1ab0374bd10d8d89

On Fri, Feb 24, 2012 at 8:56 AM, Frank Scholten (Commented) (JIRA) <[email protected]> wrote:
>
> [ https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215734#comment-13215734 ]
>
> Frank Scholten commented on MAHOUT-944:
> ---------------------------------------
>
> Renamed config to LuceneStorageConfig and simplified serialization. Added
> AbstractLuceneStorageTest with helper methods for indexing documents.
>
> https://github.com/frankscholten/mahout/commit/41ee459d075c3aa2e10eab4eb5580cbf505fcbf6
>
> Does anyone know of a large index I can use for testing? Wikipedia is not
> that big, the sequential lucene2seq version takes only 3,5 minutes on my
> machine to convert it into a sequence file.
>
>> LuceneIndexToSequenceFiles (lucene2seq) utility
>> -----------------------------------------------
>>
>> Key: MAHOUT-944
>> URL: https://issues.apache.org/jira/browse/MAHOUT-944
>> Project: Mahout
>> Issue Type: New Feature
>> Components: Integration
>> Affects Versions: 0.5
>> Reporter: Frank Scholten
>> Assignee: Grant Ingersoll
>> Priority: Minor
>> Fix For: 0.7
>>
>> Attachments: MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch,
>> MAHOUT-944.patch
>>
>>
>> Here is a lucene2seq tool I used in a project. It creates sequence files
>> based on the stored fields of a lucene index.
>> The output from this tool can be then fed into seq2sparse and from there you
>> can do text clustering.
>> Comes with Java bean configuration.
>> Let me know what you think. Some CLI code can be added later on. I used this
>> for a small-scale project +- 100.000 docs. Is a MR version useful or is that
>> overkill?
>> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and
>> review comments from Simon Willnauer (Thanks Simon!)
>> or the attached patch.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators:
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>

--
Lance Norskog
[email protected]
