integrations/org.apache.mahout.text.SequenceFilesFromMailArchives
Used in examples/bin/asf-email-examples.sh

On Wed, Feb 29, 2012 at 6:27 AM, Frank Scholten <[email protected]> wrote:
> Ah of course! Good one.
>
> Do you know if there is an existing tool to index those emails?
>
> On Sat, Feb 25, 2012 at 4:10 AM, Lance Norskog <[email protected]> wrote:
>> Apache mail files? You need an AWS account to pull them.
>>
>> http://www.lucidimagination.com/search/document/1ab0374bd10d8d89
>>
>> On Fri, Feb 24, 2012 at 8:56 AM, Frank Scholten (Commented) (JIRA)
>> <[email protected]> wrote:
>>>
>>> [
>>> https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215734#comment-13215734
>>> ]
>>>
>>> Frank Scholten commented on MAHOUT-944:
>>> ---------------------------------------
>>>
>>> Renamed config to LuceneStorageConfig and simplified serialization. Added
>>> AbstractLuceneStorageTest with helper methods for indexing documents.
>>>
>>> https://github.com/frankscholten/mahout/commit/41ee459d075c3aa2e10eab4eb5580cbf505fcbf6
>>>
>>> Does anyone know of a large index I can use for testing? Wikipedia is not
>>> that big, the sequential lucene2seq version takes only 3,5 minutes on my
>>> machine to convert it into a sequence file.
>>>
>>>> LuceneIndexToSequenceFiles (lucene2seq) utility
>>>> -----------------------------------------------
>>>>
>>>> Key: MAHOUT-944
>>>> URL: https://issues.apache.org/jira/browse/MAHOUT-944
>>>> Project: Mahout
>>>> Issue Type: New Feature
>>>> Components: Integration
>>>> Affects Versions: 0.5
>>>> Reporter: Frank Scholten
>>>> Assignee: Grant Ingersoll
>>>> Priority: Minor
>>>> Fix For: 0.7
>>>>
>>>> Attachments: MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch,
>>>> MAHOUT-944.patch
>>>>
>>>>
>>>> Here is a lucene2seq tool I used in a project. It creates sequence files
>>>> based on the stored fields of a lucene index.
>>>> The output from this tool can be then fed into seq2sparse and from there
>>>> you can do text clustering.
>>>> Comes with Java bean configuration.
>>>> Let me know what you think. Some CLI code can be added later on. I used
>>>> this for a small-scale project +- 100.000 docs. Is a MR version useful or
>>>> is that overkill?
>>>> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits
>>>> and review comments from Simon Willnauer (Thanks Simon!)
>>>> or the attached patch.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> If you think it was sent incorrectly, please contact your JIRA
>>> administrators:
>>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> [email protected]
>>

--
Lance Norskog
[email protected]
