[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968259#comment-13968259 ]

Suneel Marthi commented on MAHOUT-1178:
---

I wasn't considering this to be a fix for incremental doc mgmt; I'm fine with leaving this as is for now.

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
> Issue Type: New Feature
> Reporter: Dan Filimon
> Assignee: Gokhan Capan
> Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix. This would
> require that we standardize on a way to convert documents to rows. There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document, one per line, and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields, that ought to work sensibly as well.
> Two options are to dump multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document. This might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message was sent by Atlassian JIRA (v6.2#6252)
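Constraints a) and d) of the issue description can be illustrated with a minimal sketch in plain Java. It has no Lucene or Mahout dependency, and it is not the code in the attached patches. The class IndexAsMatrixSketch and all of its methods are hypothetical names; documents are assumed to arrive as term-to-frequency maps, and raw term frequency is used as the cell weight purely for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these names come from Mahout or Lucene.
final class IndexAsMatrixSketch {
    // term -> column number, assigned in order of first appearance
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();
    // row number -> external document id, so a row can be traced back (constraint d)
    private final List<String> rowToDocId = new ArrayList<>();
    // row number -> sparse row (column number -> weight)
    private final List<Map<Integer, Double>> rows = new ArrayList<>();

    /** Adds one document, given as term -> frequency, and returns its row number. */
    int addDocument(String docId, Map<String, Integer> termFrequencies) {
        Map<Integer, Double> row = new TreeMap<>();
        for (Map.Entry<String, Integer> e : termFrequencies.entrySet()) {
            Integer column = dictionary.get(e.getKey());
            if (column == null) {
                column = dictionary.size();          // next free column
                dictionary.put(e.getKey(), column);
            }
            row.put(column, e.getValue().doubleValue());
        }
        rows.add(row);
        rowToDocId.add(docId);
        return rows.size() - 1;
    }

    /** Returns the stored weight, or 0.0 for a cell that was never set. */
    double get(int row, int column) {
        Double v = rows.get(row).get(column);
        return v == null ? 0.0 : v;
    }

    String docIdOf(int row) { return rowToDocId.get(row); }

    /** Returns the column for a term, or -1 if the term is unknown. */
    int columnOf(String term) {
        Integer c = dictionary.get(term);
        return c == null ? -1 : c;
    }

    int numRows() { return rows.size(); }
    int numCols() { return dictionary.size(); }
}
```

All rows share one dictionary, so the same term always lands in the same column, and the row-to-id list is what makes the back-reference in d) possible. Handling several fields (constraint c) would need either one such object per field or a field prefix on each term; the sketch leaves that choice open.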
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968254#comment-13968254 ]

Gokhan Capan commented on MAHOUT-1178:
--

The thing is, it just 'loads' a Lucene index into memory as a matrix. You construct a matrix with the Lucene index directory location and that's it. So it is not a fix for the incremental document management issue.

The alternative approach is to query the index whenever a row/column vector or a cell is required. I am not sure, however, whether the SolrMatrix implementation is fast enough for that.

I haven't been available lately, and now I'm reading through the changes in and proposals for Mahout's future, and trying to set up my perspective for Mahout2. We can probably come up with a better way of storing documents (still Lucene/Solr based).

Let me leave this as is for now, and then we can discuss the input formats further. Is that OK for you?
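The two access strategies contrasted in the comment above (load everything at construction, or query on every access) can be sketched as follows. This is only an illustration under stated assumptions: EagerRows and LazyRows are hypothetical names, and the IntFunction stands in for whatever lookup the real index would perform; it is not the LuceneMatrix or SolrMatrix code from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical names throughout; the IntFunction stands in for an index lookup.
final class EagerRows {
    private final List<Map<Integer, Double>> rows = new ArrayList<>();

    // Reads every row once, at construction time; later changes to the
    // underlying index are not seen.
    EagerRows(int numRows, IntFunction<Map<Integer, Double>> index) {
        for (int r = 0; r < numRows; r++) {
            rows.add(index.apply(r));
        }
    }

    double get(int row, int col) {
        return rows.get(row).getOrDefault(col, 0.0);
    }
}

final class LazyRows {
    private final IntFunction<Map<Integer, Double>> index;

    LazyRows(IntFunction<Map<Integer, Double>> index) {
        this.index = index;
    }

    // Goes back to the index on every access: nothing is held in memory,
    // but each cell read costs one lookup.
    double get(int row, int col) {
        return index.apply(row).getOrDefault(col, 0.0);
    }
}
```

The eager variant pays its whole cost up front and holds the entire matrix in memory, which is why it cannot help with incremental updates; the lazy variant always sees the current index, and its speed depends entirely on how fast a single lookup is.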
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968240#comment-13968240 ]

Suneel Marthi commented on MAHOUT-1178:
---

Sorry for responding late (just waking up in my part of the world).

I still see value in having this, both for lucene2seq and if we consider moving entirely to Lucene as the document repository format (see the discussion in M-1252).

Gokhan, please commit a patch if you think it's ready; otherwise we can close this as 'Won't Fix'.
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968221#comment-13968221 ]

Gokhan Capan commented on MAHOUT-1178:
--

I personally like the idea of integrating additional storage layers as matrix inputs, but I don't like the implementation I did here. Once we agree on the new algorithm layers, we can move on to the additional input formats.

So my vote is also for "Won't Fix".
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968185#comment-13968185 ]

Sebastian Schelter commented on MAHOUT-1178:

I'd personally resolve this as "Won't Fix", as we should concentrate on the Scala DSL in the future. Any objections?
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968148#comment-13968148 ]

Gokhan Capan commented on MAHOUT-1178:
--

Well, I can add this, but considering the current status of the project, I think it is no longer of interest to people. What do you say, [~ssc]: should we 'won't fix' it or commit?
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967810#comment-13967810 ]

Sebastian Schelter commented on MAHOUT-1178:

[~gokhancapan] what's the status here?
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918159#comment-13918159 ]

Gokhan Capan commented on MAHOUT-1178:
--

Let me get the pieces together and submit a patch in a few days.
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13917584#comment-13917584 ]

Suneel Marthi commented on MAHOUT-1178:
---

[~gcapan]

> Fix For: Backlog
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799916#comment-13799916 ]

Gokhan Capan commented on MAHOUT-1178:
--

Hi [~smarthi],

Although I'm not sure whether there is still interest, I have implementations of a Lucene matrix (in-memory) and a Solr matrix (which does not load the index into memory). I believe both can be committed after a couple of review rounds.

> Fix For: Backlog
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798830#comment-13798830 ]

Suneel Marthi commented on MAHOUT-1178:
---

[~gokhancapan] Can this be rolled into the 0.9 release?

> Fix For: Backlog
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13634477#comment-13634477 ]

Gokhan Capan commented on MAHOUT-1178:
--

Thanks for the valuable reviews. I updated the review request, but not the patch here. I will do it after another review round.
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629056#comment-13629056 ]

Gokhan Capan commented on MAHOUT-1178:
--

Hi Sebastian,

I did, though I'm not sure I did it correctly :)

Anyway, if it is correct, the diffs here and there are not the same: the base directories I created the diffs from are different, and the one on reviewboard is a single diff file. The code is the same though, so I hope this is not a problem.
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629000#comment-13629000 ]

Sebastian Schelter commented on MAHOUT-1178:

Gokhan,

could you upload your patch to reviewboard? http://reviewboard.apache.org

This makes commenting much easier.

Best,
Sebastian
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628397#comment-13628397 ]

Ted Dunning commented on MAHOUT-1178:
-

{quote}
Ted, do you think this should load the entire index into memory as a matrix? Or should it query the index when a get request is made? (And if that is the option, should the set methods also update the Lucene index itself?)
{quote}

My own interests would be

a) a flexible schema (you satisfied that with your proposed implementation)

b) a fast iterator that gives me sparse vectors for each document in the index in index order. If I get multiple iterators, one for each matrix view of the index, that is just fine.

You have added potential additional operations

c) getRow(int rowNumber /* not doc id */) and get(int rowNumber, int colNumber)

d) putRow(int rowNumber, Vector doc)

I don't know the value of (c) since the rowNumber has little external meaning.

I think that (d) is pretty much impossible to do given the difficulty of reverse engineering the vector. I could be wrong and that would be intriguing. We would need the ability to independently update different matrix views of the index to update different fields. If possible, it is kind of cool.

One addition that I would think *very* interesting/helpful would be to adjust (b) to provide the same thing, but for a query result rather than the entire index.
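Item (b) above, together with the suggested variant that covers only a query result, can be sketched as a filtering iterator over sparse rows. This is a plain-Java illustration under stated assumptions, not the proposed implementation: MatchingRows and its nested Row class are hypothetical names, rows are assumed to be already available as column-to-weight maps, and the IntPredicate stands in for "this row's document matched the query".

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.function.IntPredicate;

// Hypothetical sketch: iterates sparse rows in index order, optionally
// restricted to the rows accepted by a predicate (standing in for a query).
final class MatchingRows implements Iterator<MatchingRows.Row> {

    /** One sparse row together with its row number. */
    static final class Row {
        final int rowNumber;
        final Map<Integer, Double> vector;

        Row(int rowNumber, Map<Integer, Double> vector) {
            this.rowNumber = rowNumber;
            this.vector = vector;
        }
    }

    private final List<Map<Integer, Double>> rows;
    private final IntPredicate matches;
    private int cursor;   // row number of the next row to return

    MatchingRows(List<Map<Integer, Double>> rows, IntPredicate matches) {
        this.rows = rows;
        this.matches = matches;
        skipToMatch();
    }

    // Moves the cursor forward to the next row accepted by the predicate.
    private void skipToMatch() {
        while (cursor < rows.size() && !matches.test(cursor)) {
            cursor++;
        }
    }

    @Override
    public boolean hasNext() {
        return cursor < rows.size();
    }

    @Override
    public Row next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        Row row = new Row(cursor, rows.get(cursor));
        cursor++;
        skipToMatch();
        return row;
    }
}
```

Passing a predicate that accepts every row gives the whole-index iterator of (b); passing a narrower one gives the query-result variant. Each returned Row carries its row number, so the caller can still map it back to a document.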
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627883#comment-13627883 ]

Saikat Kanjilal commented on MAHOUT-1178:
-

Ted/Gokhan,

From the recent thread it seems like Gokhan has a great start on this. Let me know if I can help with integration tests or docs.

Regards
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626626#comment-13626626 ]

Saikat Kanjilal commented on MAHOUT-1178:
-

I had some other questions/thoughts:

1) Ideally I'd think the implementation would belong in the commons or core area since, as you mentioned, this would potentially affect multiple algorithms. Does that jibe with everyone's thinking on this?

2) Can we get away with storing only the needed parts of the matrix in memory, as opposed to the whole matrix?

3) Is it an either/or scenario with named vectors versus matrices? Pardon my ignorance on this part, since I'm not familiar with named vectors in Mahout.

Looking forward to beginning design/implementation.
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626619#comment-13626619 ] Gokhan Capan commented on MAHOUT-1178:
Ted, do you think this should load the entire index into memory as a matrix? Or should it query the index whenever a get request is made? (And if the latter, should the set methods also update the Lucene index itself?)
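The two options in this question can be sketched side by side. This is a rough illustration, not a proposed design: `RowSource`, `EagerRows`, and `LazyRows` are hypothetical names, and the `IntFunction<double[]>` is an assumed stand-in for whatever code would actually read one document's row out of the Lucene index.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical names throughout; the IntFunction stands in for whatever
// code would read one document's row from the Lucene index.
interface RowSource {
    double[] row(int docNumber);
}

// Option 1: read every row up front and serve all gets from memory.
class EagerRows implements RowSource {
    private final double[][] rows;

    EagerRows(int numDocs, IntFunction<double[]> indexReader) {
        rows = new double[numDocs][];
        for (int i = 0; i < numDocs; i++) {
            rows[i] = indexReader.apply(i);
        }
    }

    public double[] row(int docNumber) {
        return rows[docNumber];
    }
}

// Option 2: query the index on the first get for a row, then cache it.
class LazyRows implements RowSource {
    private final IntFunction<double[]> indexReader;
    private final Map<Integer, double[]> cache = new HashMap<>();

    LazyRows(IntFunction<double[]> indexReader) {
        this.indexReader = indexReader;
    }

    public double[] row(int docNumber) {
        return cache.computeIfAbsent(docNumber, k -> indexReader.apply(k));
    }

    // Number of rows read from the index so far.
    int rowsLoaded() {
        return cache.size();
    }
}
```

The eager variant pays the full read cost and memory footprint at construction; the lazy variant pays per row on first access. Neither sketch addresses the set-method question: writing back to the index would need a separate decision.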
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625892#comment-13625892 ] Saikat Kanjilal commented on MAHOUT-1178:
Got it, thanks. Let me know how or where to help and I'll dive in.
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625845#comment-13625845 ] Ted Dunning commented on MAHOUT-1178:
The need for this stems from the fact that text or text-like data is commonly used as an input for Mahout projects. This includes text classification, recommendation engines and clustering. There are definitely better ways to represent a Lucene index if you want to preserve the data, but if you want to use other Mahout stuff, then a matrix is really required.
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624873#comment-13624873 ] Suneel Marthi commented on MAHOUT-1178:
Given that Mahout trunk is presently at Lucene 4.2.0, I don't think we should worry about backward compatibility with Lucene 3.x versions (primarily because Lucene 4.x and Lucene 3.x are not compatible). Regarding Dan's original description above, for (d) and (e), something similar to the present RowIdJob, which creates a row-by-row matrix plus a docIndex that maps a row number back to the original doc ID, could be what we are looking for. We may be able to leverage the existing RowIdJob once we get to the implementation details.
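The row-number-to-document mapping mentioned for (d) can be illustrated with a small in-memory analogue. This is not Mahout's RowIdJob and makes no claim about how that job is implemented; `DocIndex` and its methods are hypothetical names, and document ids are assumed to be strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Mahout's RowIdJob: assigns consecutive row
// numbers to document ids and keeps the mapping in both directions.
class DocIndex {
    private final List<String> rowToDocId = new ArrayList<>();
    private final Map<String, Integer> docIdToRow = new HashMap<>();

    // Returns the existing row for a known id, or assigns the next row.
    int rowFor(String docId) {
        Integer row = docIdToRow.get(docId);
        if (row == null) {
            row = rowToDocId.size();
            rowToDocId.add(docId);
            docIdToRow.put(docId, row);
        }
        return row;
    }

    // Refers back from a matrix row to the original document id.
    String docIdFor(int row) {
        return rowToDocId.get(row);
    }

    int size() {
        return rowToDocId.size();
    }
}
```

Keeping the mapping separate from the matrix means the rows themselves can stay purely numeric while any row can still be traced back to its source document.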
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624763#comment-13624763 ] Saikat Kanjilal commented on MAHOUT-1178:
Ted, I'd like to help out on this issue. Do you have some more context as to why this is needed? I'd imagine there are better data structures than a matrix; is this strictly because Mahout operates on matrices as first-class citizens? Let me know how or where to begin taking a stab at this. Also, should this be backwards compatible across different Lucene versions, assuming that the indexing scheme in Lucene is somewhat consistent across versions?