How about the Apache mail archives? You need an AWS account to pull them.

http://www.lucidimagination.com/search/document/1ab0374bd10d8d89

On Fri, Feb 24, 2012 at 8:56 AM, Frank Scholten (Commented) (JIRA)
<[email protected]> wrote:
>
>    [ 
> https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215734#comment-13215734
>  ]
>
> Frank Scholten commented on MAHOUT-944:
> ---------------------------------------
>
> Renamed config to LuceneStorageConfig and simplified serialization. Added 
> AbstractLuceneStorageTest with helper methods for indexing documents.
>
> https://github.com/frankscholten/mahout/commit/41ee459d075c3aa2e10eab4eb5580cbf505fcbf6
>
> Does anyone know of a large index I can use for testing? Wikipedia is not 
> that big, the sequential lucene2seq version takes only 3,5 minutes on my 
> machine to convert it into a sequence file.
>
>> LuceneIndexToSequenceFiles (lucene2seq) utility
>> -----------------------------------------------
>>
>>                 Key: MAHOUT-944
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-944
>>             Project: Mahout
>>          Issue Type: New Feature
>>          Components: Integration
>>    Affects Versions: 0.5
>>            Reporter: Frank Scholten
>>            Assignee: Grant Ingersoll
>>            Priority: Minor
>>             Fix For: 0.7
>>
>>         Attachments: MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
>> MAHOUT-944.patch
>>
>>
>> Here is a lucene2seq tool I used in a project. It creates sequence files 
>> based on the stored fields of a lucene index.
>> The output from this tool can be then fed into seq2sparse and from there you 
>> can do text clustering.
>> Comes with Java bean configuration.
>> Let me know what you think. Some CLI code can be added later on. I used this 
>> for a small-scale project +- 100.000 docs. Is a MR version useful or is that 
>> overkill?
>> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and 
>> review comments from Simon Willnauer (Thanks Simon!)
>> or the attached patch.
>



-- 
Lance Norskog
[email protected]
