[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918159#comment-13918159 ]
Gokhan Capan commented on MAHOUT-1178:
--------------------------------------

Let me get the pieces together and submit a patch in a few days.

> GSOC 2013: Improve Lucene support in Mahout
> -------------------------------------------
>
>                 Key: MAHOUT-1178
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1178
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Dan Filimon
>            Assignee: Gokhan Capan
>              Labels: gsoc2013, mentor
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix. This would
> require that we standardize on a way to convert documents to rows. There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document, one per line, and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields, that ought to work sensibly as well.
> Two options are dumping multiple matrices or converting the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document. This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message was sent by Atlassian JIRA
(v6.2#6252)