I think I understand your question. To make sure, here it is in my terms:

- you have documents with tag tokens in the fid field
- you have a bunch of rules defining which documents appear where in your hierarchy. These rules are defined as Lucene queries.
- when you get a new document, it is slow to run every one of these queries against it.
- you would like to run these queries very quickly in order to update your hierarchy quickly and to provide author feedback. Using ML would be a spiffy way to do this and might provide hints for updating your hierarchy rules.

My first suggestion would be to consider building a one-document index for the author-feedback situation. Running all of your rules against that index should be pretty darned fast. That doesn't help with some of the other issues and might be hard to do with Solr, but it would be easy with raw Lucene. You should be able to run several thousand rules per second this way.

That doesn't answer the question you asked, though. The answer there is yes. Definitely. There are a number of machine learning approaches that could reverse engineer your rules to give you new rules that can be evaluated very quickly. Some learning techniques and some configurations would likely not give you precise accuracy, but some would likely give you perfect replication. Random forests would probably give you accurate results, as would logistic regression (referred to as SGD in Mahout), especially if you use interaction variables (features that depend on the presence of tag combinations).

You will probably need to do a topological sort, because it is common for hierarchical structures to have rules that exclude a document from a child node if it appears in the parent (or vice versa). Thus, you would want to evaluate rules in dependency order and augment the document with any category assignments as you go down the rule list.

Operationally, you would need to do some coding, and not all of the pieces you need are fully baked yet. The first step is vectorization of your tag list for many documents.
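As a rough illustration of what that vectorization could look like, here is a minimal pure-Python sketch that one-hot encodes a document's tag set and adds the pairwise interaction variables mentioned above. This is not Mahout's actual vectorizer (which works on Lucene indexes and Mahout vector types); the tag IDs are taken from the example below and the function name is made up:

```python
from itertools import combinations

def vectorize(tags, vocab):
    """One-hot encode a document's tag set, plus pairwise
    interaction features for tags that co-occur."""
    present = sorted(set(tags) & set(vocab))
    features = {f"fid:{t}": 1 for t in present}
    # Interaction variables: one feature per pair of tags that
    # appear together in the same document.
    for a, b in combinations(present, 2):
        features[f"fid:{a}&fid:{b}"] = 1
    return features

# The two example documents from the question below.
vocab = {123, 234, 324, 675, 678}
doc1 = vectorize([123, 234, 675], vocab)  # "Interesting article on fruit"
doc2 = vectorize([123, 324, 678], vocab)  # "The mighty orange!"
```

With features like `fid:123&fid:675`, a linear model such as logistic regression can learn rules that fire only on tag combinations, which a model over single tags cannot express.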
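The dependency-ordered rule evaluation described above can also be sketched in a few lines. This is an illustrative sketch, not Mahout or Lucene code: rules are modeled as plain predicates over (tags, categories-so-far), the dependency graph and rule IDs are hypothetical, and the ordering is Kahn's topological sort:

```python
from collections import defaultdict, deque

def evaluation_order(rules, depends_on):
    """Kahn's topological sort: order rule ids so that every rule
    runs after the rules whose category assignments it reads."""
    indegree = {r: 0 for r in rules}
    children = defaultdict(list)
    for rule, deps in depends_on.items():
        for dep in deps:
            children[dep].append(rule)
            indegree[rule] += 1
    ready = deque(r for r in rules if indegree[r] == 0)
    order = []
    while ready:
        r = ready.popleft()
        order.append(r)
        for child in children[r]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(rules):
        raise ValueError("cycle in rule dependencies")
    return order

def classify(doc_tags, rules, depends_on):
    """Evaluate rules in dependency order, augmenting the document
    with each category assignment as we go down the list."""
    tags, assigned = set(doc_tags), set()
    for rule_id in evaluation_order(rules, depends_on):
        if rules[rule_id](tags, assigned):
            assigned.add(rule_id)
    return assigned

# Hypothetical rules: 001 "Fruit" fires on tag 123; 003 "Apple"
# applies only to documents already categorized as Fruit.
rules = {
    "001": lambda tags, cats: 123 in tags,
    "003": lambda tags, cats: "001" in cats and 675 in tags,
}
depends_on = {"003": ["001"]}
```

The key point is the `assigned` set: because "003" sees the categories produced by earlier rules, parent/child exclusion rules evaluate correctly regardless of the order the rules were written in.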
Robin has recently checked in some good code for that vectorization, and Drew has a more elaborate document model right behind it. You can also vectorize directly from a Lucene index, which is probably very convenient for you. That gives you training data.

Training the classifiers will take a bit, since you need to train pretty much one classifier per category (unless you know that a document can have only one category). That shouldn't be hard, however, and with lots of examples the training should converge to perfect performance pretty quickly. The command-line form for running training is evolving a bit right now, and your feedback would be invaluable.

Deploying the classifiers should not be too difficult, but you would be in slightly new territory there, since I don't think that many (any) people have deployed Mahout-trained classifiers in anger just yet.

Does this help?

On Wed, Feb 17, 2010 at 1:23 AM, David Stuart <[email protected]> wrote:

> Hi All,
>
> I think this question is appropriate for the Mahout mailing list, but if not,
> any pointers in the right direction or advice would be welcomed.
>
> We have a taxonomy-based navigation system where items in the navigation
> tree are made up of tag-based queries (instead of natural-language words)
> which are matched against content items tagged in a similar way.
>
> So we have a taxonomy tree with queries:
>
> Id   Label
> 001  Fruit       (fid:123 OR fid:675) AND -fid:(324 OR 678) ...
> 002  Round
> 003  Apple
> 004  Orange
> 006  Star
> 007  Star fruit
> ...
>
> Content pool:
>
> "Interesting article on fruit" -> tagged with (123, 234, 675)
> "The mighty orange!" -> tagged with (123, 324, 678)
>
> Hopefully you get the picture.
>
> Now we bake these queries into our Solr index, so instead of doing the Fruit
> query we have pre-done it and just search for items in the index that have id
> 001. The reasons for doing this are not really important, but we have written
> an indexer for the purpose.
> Also, content items are multi-surfacing, so an item
> could appear at 001, 004, and 007.
>
> Although the indexer is OK at doing this pre-bake job, it's not very fast, and
> as the content and tree grow it gets slower.
>
> NOW for the actual question!!!
>
> Is there an ML model that can quickly classify/identify where a new (or
> retagged) piece of content fits in the tree? Oh, and the queries on the leaf
> nodes can change (less often), so a quick process to reclassify what is in
> scope for that leaf would be useful.
>
> The reason I want this is that it would be great to have realtime feedback for
> an author applying tags to a document, showing where it fits in the site.
>
> Once I get this working I would love to add suggested tags or weightings
> based on content items with contextual similarity.
>
> I think it was Grant who was talking about a Solr external field that
> could be used to hook this together, or maybe I am mistaken.
>
> Hope this makes sense.
>
> Thanks for your help/advice in advance.
>
> Regards,
>
> Dave

--
Ted Dunning, CTO
DeepDyve
