[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

Suneel Marthi (JIRA) Sun, 07 Apr 2013 04:09:20 -0700

    [ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624873#comment-13624873 ]


Suneel Marthi commented on MAHOUT-1178:
---------------------------------------

Given that Mahout trunk is presently at Lucene 4.2.0, I don't think we should 
worry about backward compatibility with Lucene 3.x, primarily because Lucene 
4.x and Lucene 3.x are not compatible with each other.

Regarding Dan's original description above: for (d) and (e), something similar 
to the present RowIdJob could be what we are looking for. RowIdJob creates a 
row-by-row matrix and a docIndex that maps a Row# back to the original DocId. 
We may be able to leverage the existing RowIdJob once we get to the 
implementation details.
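
As a rough illustration of the row/docIndex idea (plain Java collections only; 
this is not the RowIdJob code, and the names RowIndexSketch, Result and index 
are made up for this sketch), each document vector gets a sequential row 
number, and a second structure remembers which DocId each row came from:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: assigns sequential row numbers to document vectors
// and keeps a docIndex so a row can be traced back to its original doc id.
public class RowIndexSketch {

    /** The two outputs: matrix rows, and the row# -> doc id index. */
    public static final class Result {
        public final List<double[]> matrix = new ArrayList<>();
        public final List<String> docIndex = new ArrayList<>();
    }

    /** Row i of the matrix belongs to the document named docIndex.get(i). */
    public static Result index(Map<String, double[]> vectorsByDocId) {
        Result result = new Result();
        for (Map.Entry<String, double[]> entry : vectorsByDocId.entrySet()) {
            result.matrix.add(entry.getValue());   // row number = list position
            result.docIndex.add(entry.getKey());   // same position holds the doc id
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up input; LinkedHashMap keeps insertion order stable.
        Map<String, double[]> docs = new LinkedHashMap<>();
        docs.put("doc-a", new double[] {1.0, 0.0, 2.0});
        docs.put("doc-b", new double[] {0.0, 3.0, 0.0});

        Result r = index(docs);
        for (int row = 0; row < r.matrix.size(); row++) {
            System.out.println(row + " -> " + r.docIndex.get(row));
        }
    }
}
```

With the two made-up documents above, row 0 maps back to doc-a and row 1 to 
doc-b. For a Lucene index the doc id could be either the Lucene doc number or 
the value of a unique-id field, as (d) suggests.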


                
> GSOC 2013: Improve Lucene support in Mahout
> -------------------------------------------
>
>                 Key: MAHOUT-1178
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1178
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Dan Filimon
>              Labels: gsoc2013, mentor
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document, one per line, and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields, that ought to work sensibly as well.
>  Two options are dumping multiple matrices or converting the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
