Re: [jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

Frank Scholten Wed, 29 Feb 2012 06:27:54 -0800

Ah, of course! Good one.

Do you know if there is an existing tool to index those emails?


On Sat, Feb 25, 2012 at 4:10 AM, Lance Norskog <[email protected]> wrote:
> Apache mail files? You need an AWS account to pull them.
>
> http://www.lucidimagination.com/search/document/1ab0374bd10d8d89
>
> On Fri, Feb 24, 2012 at 8:56 AM, Frank Scholten (Commented) (JIRA)
> <[email protected]> wrote:
>>
>>    [ https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215734#comment-13215734 ]
>>
>> Frank Scholten commented on MAHOUT-944:
>> ---------------------------------------
>>
>> Renamed config to LuceneStorageConfig and simplified serialization. Added 
>> AbstractLuceneStorageTest with helper methods for indexing documents.
>>
>> https://github.com/frankscholten/mahout/commit/41ee459d075c3aa2e10eab4eb5580cbf505fcbf6
>>
>> Does anyone know of a large index I can use for testing? Wikipedia is not 
>> that big: the sequential lucene2seq version takes only 3.5 minutes on my 
>> machine to convert it into a sequence file.
>>
>>> LuceneIndexToSequenceFiles (lucene2seq) utility
>>> -----------------------------------------------
>>>
>>>                 Key: MAHOUT-944
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-944
>>>             Project: Mahout
>>>          Issue Type: New Feature
>>>          Components: Integration
>>>    Affects Versions: 0.5
>>>            Reporter: Frank Scholten
>>>            Assignee: Grant Ingersoll
>>>            Priority: Minor
>>>             Fix For: 0.7
>>>
>>>         Attachments: MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
>>> MAHOUT-944.patch
>>>
>>>
>>> Here is a lucene2seq tool I used in a project. It creates sequence files 
>>> based on the stored fields of a Lucene index.
>>> The output from this tool can then be fed into seq2sparse, and from there 
>>> you can do text clustering.
>>> It comes with Java bean configuration.
>>> Let me know what you think. Some CLI code can be added later on. I used 
>>> this for a small-scale project of roughly 100,000 docs. Is a MapReduce (MR) 
>>> version useful, or is that overkill?
>>> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and 
>>> review comments from Simon Willnauer (thanks, Simon!), or the attached patch.
>>
>
>
>
> --
> Lance Norskog
> [email protected]
>
