Re: [jira] [Commented] (MAHOUT-944) LuceneIndexToSequenceFiles (lucene2seq) utility

Lance Norskog Wed, 29 Feb 2012 18:53:11 -0800

integrations/org.apache.mahout.text.SequenceFilesFromMailArchives

Used in examples/bin/asf-email-examples.sh


On Wed, Feb 29, 2012 at 6:27 AM, Frank Scholten <[email protected]> wrote:
> Ah of course! Good one.
>
> Do you know if there is an existing tool to index those emails?
>
> On Sat, Feb 25, 2012 at 4:10 AM, Lance Norskog <[email protected]> wrote:
>> Apache mail files? You need an AWS account to pull them.
>>
>> http://www.lucidimagination.com/search/document/1ab0374bd10d8d89
>>
>> On Fri, Feb 24, 2012 at 8:56 AM, Frank Scholten (Commented) (JIRA)
>> <[email protected]> wrote:
>>>
>>>    [ https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215734#comment-13215734 ]
>>>
>>> Frank Scholten commented on MAHOUT-944:
>>> ---------------------------------------
>>>
>>> Renamed config to LuceneStorageConfig and simplified serialization. Added 
>>> AbstractLuceneStorageTest with helper methods for indexing documents.
>>>
>>> https://github.com/frankscholten/mahout/commit/41ee459d075c3aa2e10eab4eb5580cbf505fcbf6
>>>
>>> Does anyone know of a large index I can use for testing? Wikipedia is not 
>>> that big: the sequential lucene2seq version takes only 3.5 minutes on my 
>>> machine to convert it into a sequence file.
>>>
>>>> LuceneIndexToSequenceFiles (lucene2seq) utility
>>>> -----------------------------------------------
>>>>
>>>>                 Key: MAHOUT-944
>>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-944
>>>>             Project: Mahout
>>>>          Issue Type: New Feature
>>>>          Components: Integration
>>>>    Affects Versions: 0.5
>>>>            Reporter: Frank Scholten
>>>>            Assignee: Grant Ingersoll
>>>>            Priority: Minor
>>>>             Fix For: 0.7
>>>>
>>>>         Attachments: MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch, 
>>>> MAHOUT-944.patch
>>>>
>>>>
>>>> Here is a lucene2seq tool I used in a project. It creates sequence files 
>>>> based on the stored fields of a Lucene index.
>>>> The output from this tool can then be fed into seq2sparse, and from there 
>>>> you can do text clustering.
>>>> It comes with Java bean configuration.
>>>> Let me know what you think. Some CLI code can be added later on. I used 
>>>> this for a small-scale project of roughly 100,000 docs. Is a MapReduce 
>>>> version useful, or is that overkill?
>>>> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits 
>>>> and review comments from Simon Willnauer (Thanks Simon!)
>>>> or the attached patch.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> If you think it was sent incorrectly, please contact your JIRA 
>>> administrators: 
>>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> [email protected]
>>



-- 
Lance Norskog
[email protected]
