[ https://issues.apache.org/jira/browse/MAHOUT-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209184#comment-13209184 ]
Jake Mannix commented on MAHOUT-944:
------------------------------------
bq. A better name for all of this is probably LuceneStorageTo... as it implies
that the fields must have storage. I could see us having another implementation
that works on the posting list itself
Let's keep the name the same, and at some point I'll get around to scratching
that particular itch - I've long wanted a nice map-reduce job which
"uninverted" the index into bag-of-words vectors. Everyone writes "let's build
an inverted index with map-reduce". Nobody writes the uninversion step!
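
To make the "uninversion" idea concrete, here is a minimal in-memory sketch. It
is not Mahout or Lucene code: the class name UninvertSketch, the method
uninvert, and the nested-map representation of the postings (term -> docId ->
term frequency) are all assumptions chosen for illustration. A real job would
read the postings from the index and would run as map-reduce rather than in a
single loop.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: plain Java collections stand in for
// the Lucene posting lists and for Mahout's vector types.
final class UninvertSketch {

  // Input:  term  -> (docId -> term frequency), i.e. an inverted index.
  // Output: docId -> (term  -> term frequency), i.e. one bag of words per document.
  static Map<Integer, Map<String, Integer>> uninvert(
      Map<String, Map<Integer, Integer>> postings) {
    Map<Integer, Map<String, Integer>> docs = new TreeMap<>();
    for (Map.Entry<String, Map<Integer, Integer>> termEntry : postings.entrySet()) {
      String term = termEntry.getKey();
      for (Map.Entry<Integer, Integer> posting : termEntry.getValue().entrySet()) {
        // Regroup each posting under its document id; sum if the term repeats.
        docs.computeIfAbsent(posting.getKey(), id -> new TreeMap<>())
            .merge(term, posting.getValue(), Integer::sum);
      }
    }
    return docs;
  }
}
```

In map-reduce terms, the inner loop roughly corresponds to a mapper emitting
(docId, (term, frequency)) for every posting, and the regrouping corresponds to
a reducer collecting all pairs for one docId into a single vector. This split
is an assumption about how such a job might be laid out, not a description of
an existing one.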
> LuceneIndexToSequenceFiles (lucene2seq) utility
> -----------------------------------------------
>
> Key: MAHOUT-944
> URL: https://issues.apache.org/jira/browse/MAHOUT-944
> Project: Mahout
> Issue Type: New Feature
> Components: Integration
> Affects Versions: 0.5
> Reporter: Frank Scholten
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.7
>
> Attachments: MAHOUT-944.patch, MAHOUT-944.patch, MAHOUT-944.patch,
> MAHOUT-944.patch
>
>
> Here is a lucene2seq tool I used in a project. It creates sequence files
> based on the stored fields of a lucene index.
> The output from this tool can then be fed into seq2sparse, and from there you
> can do text clustering.
> Comes with Java bean configuration.
> Let me know what you think. Some CLI code can be added later on. I used this
> for a small-scale project of roughly 100,000 docs. Is an MR version useful or
> is that overkill?
> See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and
> review comments from Simon Willnauer (Thanks Simon!)
> or the attached patch.