[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968259#comment-13968259
 ] 

Suneel Marthi commented on MAHOUT-1178:
---

I wasn't considering this to be a fix for incremental doc mgmt, fine with 
leaving this as is now.

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968254#comment-13968254
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

The thing is it just 'loads' a Lucene index in memory as a matrix. You 
construct a matrix with the lucene index directory location and that's it. So 
it is not a fix for incremental document management issue.

The alternative approach is querying the index when a row/column vector, or 
cell is required. I, however, am not sure if the SolrMatrix thing is fast 
enough for that.

I haven't been available lately, and now I'm reading through the changes in and 
proposals for Mahout's future, and trying to set up my perspective for Mahout2. 
We probably can come up with a better way of document storage (still 
Lucene/Solr based). Let me leave this as is now, and then we can discuss the 
input formats further.

Is that OK for you?

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968240#comment-13968240
 ] 

Suneel Marthi commented on MAHOUT-1178:
---

Sorry for responding late (just waking up in my part of the world).  I still 
see value in having this. Both from lucene2seq and if we consider moving 
entirely to Lucene as document repository format (see the discussion in 
M-1252). 

Gokhan, please commit a patch if u think its ready else we can close this as 
'Won't Fix'.

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968221#comment-13968221
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

I personally like the idea of integrating additional storage layers as matrix 
inputs, but not like the implementation I did here.
After agreeing on the new algorithm layers, we can later move to the the 
additional input formats. 

So my vote also is for "Won't Fix"

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968185#comment-13968185
 ] 

Sebastian Schelter commented on MAHOUT-1178:


I'd personally resolve this as won't fix, as we should concentrate on the scala 
DSL in the future, any objections?

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968148#comment-13968148
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

Well I can add this, but considering the current status of the project, I think 
this is no longer in people's interest.
What do you say [~ssc], should we 'won't fix' it or commit?

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-13 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967810#comment-13967810
 ] 

Sebastian Schelter commented on MAHOUT-1178:


[~gokhancapan] what's the status here?

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-03-03 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918159#comment-13918159
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

Let me get the pieces together and submit a patch in a few days.

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-03-02 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13917584#comment-13917584
 ] 

Suneel Marthi commented on MAHOUT-1178:
---

[~gcapan]

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
> Fix For: Backlog
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-10-19 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799916#comment-13799916
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

Hi [~smarthi], 

Although I'm not sure if there is no more an interest, I have a Lucene matrix 
(in-memory) and a Solr matrix (that does not load the index into memory) 
implementations. I believe both can be committed after a couple review rounds.



> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
> Fix For: Backlog
>
> Attachments: MAHOUT-1178.patch, MAHOUT-1178-TEST.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-10-17 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798830#comment-13798830
 ] 

Suneel Marthi commented on MAHOUT-1178:
---

[~gokhancapan] Can this be rolled int 0.9 release?

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
> Fix For: Backlog
>
> Attachments: MAHOUT-1178.patch, MAHOUT-1178-TEST.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-04-17 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13634477#comment-13634477
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

Thanks for the valuable reviews. I updated the review request, but not the 
patch here. I will do it after another review round.

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
> Attachments: MAHOUT-1178.patch, MAHOUT-1178-TEST.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-04-11 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629056#comment-13629056
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

Hi Sebastian,

I did, though I'm not sure if I did it correctly:) Anyway, if it is correct, 
the diff here and there are not the same (the base directories I created the 
diffs are different, and the one in reviewboard is in a single diff file. Code 
is same though, I hope this is not a problem)


> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
> Attachments: MAHOUT-1178.patch, MAHOUT-1178-TEST.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-04-11 Thread Sebastian Schelter (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629000#comment-13629000
 ] 

Sebastian Schelter commented on MAHOUT-1178:


Gokhan,

could you upload your patch to reviewboard? http://reviewboard.apache.org This 
makes commenting much easier.

Best,
Sebastian

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
> Attachments: MAHOUT-1178.patch, MAHOUT-1178-TEST.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-04-10 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628397#comment-13628397
 ] 

Ted Dunning commented on MAHOUT-1178:
-

{quote}
Ted, do you think this should load the entire index to memory as a matrix? Or 
should it ask to the index when a get request is done? (And if this is the 
option, should set methods also update the lucene index itself?)
{quote}

My own interests would be 

a) a flexible schema (you satisfied that with your proposed implementation)

b) a fast iterator that gives me sparse vectors for each document in the index 
in index order.  If I get multiple iterators, one for each matrix view of the 
index, that is just fine.

You have added potential additional operations

c) getRow(int rowNumber /* not doc id */) and get(int rowNumber, int colNumber)

d) putRow(int rowNumber, Vector doc)

I don't know the value of (c) since the rowNumber has little external meaning.

I think that (d) is pretty much impossible to do given the difficulty of 
reverse engineering the vector.  I could be wrong and that would be intriguing. 
 We would need the ability to independently update different matrix views of 
the index to update different fields.  If possible, it is kind of cool.

One addition that I would think *very* interesting/helpful would be to adjust 
(b) to provide the same thing, but for a query result rather than the entire 
index.

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-04-10 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627883#comment-13627883
 ] 

Saikat Kanjilal commented on MAHOUT-1178:
-

Ted/Gokhan,
>From the recent thread it seems like Gokhan has a great start on this, let me 
>know if I can help with integration tests or docs.
Regards

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-04-09 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626626#comment-13626626
 ] 

Saikat Kanjilal commented on MAHOUT-1178:
-

I had some other questions/thoughts:
1) Ideally I'd think the implementation would belong in the commons or core 
area since as you mentioned this would potentially affect multiple algorithms, 
does that jive with everyone's thinking on this
2) Can we get away with just storing only the needed parts of the matrix in 
memory as opposed to the whole matrix
3) Is it an either or scenario with named vectors versus matrices, pardon my 
ignorance on this part since I'm not familiar with named vectors in mahout

Looking forward to begin design/implementation.

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-04-09 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626619#comment-13626619
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

Ted, do you think this should load the entire index to memory as a matrix? Or 
should it ask to the index when a get request is done? (And if this is the 
option, should set methods also update the lucene index itself?)

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-04-08 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625892#comment-13625892
 ] 

Saikat Kanjilal commented on MAHOUT-1178:
-

Got it thanks, let me know how or where to help and dive in.

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-04-08 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625845#comment-13625845
 ] 

Ted Dunning commented on MAHOUT-1178:
-

The need for this stems from the fact that text or text-like data is commonly 
used as an input for Mahout projects.  This includes text classification, 
recommendation engines and clustering.

There are definitely better ways to represent a Lucene index if you want to 
preserve the data, but if you want to use other Mahout stuff, then a matrix is 
really required.  

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-04-07 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624873#comment-13624873
 ] 

Suneel Marthi commented on MAHOUT-1178:
---

Given that Mahout trunk is presently at Lucene 4.2.0, I don't think we should 
worry about backward compatibility to previous Lucene 3.x versions (primarily 
due to the differences between Lucene 4.x and Lucene 3.x which are not 
compatible). 

To Dan's original description above, for (d) and (e) something similar to the 
present RowIdJob - which creates a row-by-row matrix and a docIndex - which 
maps a Row# back to the original DocId could be what we are looking for (we 
maybe able to leverage the existing RowIdJob once we get to the implementation 
details).



> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-04-07 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624763#comment-13624763
 ] 

Saikat Kanjilal commented on MAHOUT-1178:
-

Ted, I'd like to help out on this issue, do you have some more context as to 
why this is needed, I'd imagine some better data structures than a matrix, is 
this strictly due to the fact that mahout operates on matrices as first class 
citizens?  Let me know how or where to begin taking a stab at doing this.   
Also should this be backwards compatible against different lucene versions 
assuming that the indexing scheme in lucene is somewhat consistent across 
versions.

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira