Dan,

I think what you've written is fine (I wanted to edit to remove the '?' around random forests but couldn't). OK?

On 29 March 2013 11:14, Dan Filimon <dangeorge.fili...@gmail.com> wrote:
> I added Andy's first suggestion and Ted's suggestion as ideas.
>
> Andy, could you flesh out your second suggestion into a project and make an
> issue please?
>
>
> On Fri, Mar 29, 2013 at 3:53 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> It should be possible to view a Lucene index as a matrix. This would
>> require that we standardize on a way to convert documents to rows. There
>> are many choices, the discussion of which should be deferred to the actual
>> work on the project, but there are a few obvious constraints:
>>
>> a) it should be possible to get the same result as dumping the term vectors
>> for each document, one per line, and converting that result using standard
>> Mahout methods.
>>
>> b) numeric fields ought to work somehow.
>>
>> c) if there are multiple text fields, that ought to work sensibly as well.
>> Two options are dumping multiple matrices or converting the fields
>> into a single row of a single matrix.
>>
>> d) it should be possible to refer back from a row of the matrix to find the
>> correct document. This might be because we remember the Lucene doc number
>> or because a field is named as holding a unique id.
>>
>> e) named vectors and matrices should be used if plausible.
>>
>> On Thu, Mar 28, 2013 at 4:58 PM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:
>>
>> > ...
>> > Ted, could you explain a bit more what you mean by "simplify the
>> > connection to Lucene for clustering and classification"? It's too vague
>> > for an idea proposal.
>> >
>>

--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.tw...@cs.ox.ac.uk | +447799647538