[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968148#comment-13968148 ]
Gokhan Capan commented on MAHOUT-1178:
--------------------------------------

Well, I can add this, but considering the current status of the project, I think this is no longer in people's interest. What do you say, [~ssc]: should we 'won't fix' it or commit it?

> GSOC 2013: Improve Lucene support in Mahout
> -------------------------------------------
>
>                 Key: MAHOUT-1178
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1178
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Dan Filimon
>            Assignee: Gokhan Capan
>              Labels: gsoc2013, mentor
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix. This would
> require that we standardize on a way to convert documents to rows. There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document, each to a line, and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields, that ought to work sensibly as well.
> Two options are dumping multiple matrices or converting the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document. This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message was sent by Atlassian JIRA (v6.2#6252)