Dan,

I think what you've written is fine. (I wanted to edit it to remove the
'?' around random forests, but couldn't.)

ok?



On 29 March 2013 11:14, Dan Filimon <dangeorge.fili...@gmail.com> wrote:
> I added Andy's first suggestion and Ted's suggestion as ideas.
>
> Andy, could you flesh out your second suggestion into a project and make an
> issue please?
>
>
> On Fri, Mar 29, 2013 at 3:53 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> It should be possible to view a Lucene index as a matrix.  This would
>> require that we standardize on a way to convert documents to rows.  There
>> are many choices, the discussion of which should be deferred to the actual
>> work on the project, but there are a few obvious constraints:
>>
>> a) it should be possible to get the same result as dumping the term vectors
>> for each document, one per line, and converting that result using standard
>> Mahout methods.
>>
>> b) numeric fields ought to work somehow.
>>
>> c) if there are multiple text fields, that ought to work sensibly as well.
>> Two options are to dump multiple matrices or to convert the fields
>> into a single row of a single matrix.
>>
>> d) it should be possible to refer back from a row of the matrix to find the
>> correct document.  This might be because we remember the Lucene doc number
>> or because a field is named as holding a unique id.
>>
>> e) named vectors and matrices should be used if plausible.
>>
>> On Thu, Mar 28, 2013 at 4:58 PM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:
>>
>> > ...
>> > Ted, could you explain a bit more what you mean by "simplify the
>> > connection to Lucene for clustering and classification"? It's too vague
>> > for an idea proposal.
>> >
>>
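
For concreteness, here is a rough sketch of what Ted's constraints (c), (d)
and (e) might look like. It is plain Python rather than Lucene or Mahout code,
and everything in it is an assumption made for illustration: the function name
index_to_matrix, the "id" field, the whitespace tokenizer, and the use of
dicts as stand-ins for Lucene documents and sparse vectors. It takes the
"single row of a single matrix" option by prefixing each term with its field
name, and it keeps a row-to-document-id list so a row can be traced back.

```python
from collections import Counter


def index_to_matrix(docs, id_field="id", text_fields=("title", "body")):
    """Turn dict-based stand-ins for Lucene documents into sparse rows.

    Hypothetical helper: not part of Lucene or Mahout.
    Returns (dictionary, row_ids, rows).
    """
    dictionary = {}  # "field:term" -> column index, shared by all rows
    row_ids = []     # row number -> unique document id (constraint d)
    rows = []        # one sparse row per document: {column: term count}

    for doc in docs:
        counts = Counter()
        for field in text_fields:
            # Naive whitespace tokenizer; a real index would use its analyzer.
            for term in doc.get(field, "").lower().split():
                key = f"{field}:{term}"
                # Assign the next free column the first time a key is seen.
                col = dictionary.setdefault(key, len(dictionary))
                counts[col] += 1
        row_ids.append(doc[id_field])
        rows.append(dict(counts))

    return dictionary, row_ids, rows


if __name__ == "__main__":
    docs = [
        {"id": "a", "title": "random forests", "body": "forests of trees"},
        {"id": "b", "body": "random trees"},
    ]
    dictionary, row_ids, rows = index_to_matrix(docs)
    for doc_id, row in zip(row_ids, rows):
        print(doc_id, row)
```

Numeric fields (constraint b) are left out here; one option would be to give
each numeric field its own column holding the raw value.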



-- 
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.tw...@cs.ox.ac.uk | +447799647538
