I think I understand your question. To make sure, here it is in my terms:

- you have documents with tag tokens in the fid field
- you have a bunch of rules defining which documents appear where in your hierarchy. These rules are defined as Lucene queries.
- when you get a new document, it is slow to run every one of these queries against it.
- you would like to run these queries very quickly in order to update your hierarchy quickly and to provide author feedback. Using ML would be a spiffy way to do this and might provide hints for updating your hierarchy rules.

My first suggestion would be to consider building a one-document index for the author-feedback situation. Running all of your rules against that index should be pretty darned fast. That doesn't help with some of the other issues and might be hard to do with Solr, but it would be easy with raw Lucene. You should be able to run several thousand rules per second this way.

That doesn't answer the question you asked, though. The answer there is yes. Definitely. There are a number of machine learning approaches that could reverse engineer your rules to give you new rules that can be evaluated very quickly. Some learning techniques and some configurations would likely not give you precise accuracy, but some would likely give you perfect replication. Random forests would probably give you accurate results, as would logistic regression (referred to as SGD in Mahout), especially if you use interaction variables (features that depend on the presence of tag combinations).

You will probably need to do a topological sort, because it is common for hierarchical structures to have rules that exclude a document from a child node if it appears in the parent (or vice versa). Thus, you would want to evaluate rules in dependency order and augment the document with any category assignments as you go down the rule list.

Operationally, you would need to do some coding, and not all of the pieces you need are fully baked yet. The first step is vectorization of your tag list for many documents.
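As a rough illustration of what that vectorization could look like, here is a minimal pure-Python sketch that one-hot encodes a document's tag set and adds the pairwise interaction variables mentioned above. This is not Mahout's actual vectorizer (which works on Lucene indexes and Mahout vector types); the tag IDs are taken from the example below and the function name is made up:

```python
from itertools import combinations

def vectorize(tags, vocab):
    """One-hot encode a document's tag set, plus pairwise
    interaction features for tags that co-occur."""
    present = sorted(set(tags) & set(vocab))
    features = {f"fid:{t}": 1 for t in present}
    # Interaction variables: one feature per pair of tags that
    # appear together in the same document.
    for a, b in combinations(present, 2):
        features[f"fid:{a}&fid:{b}"] = 1
    return features

# The two example documents from the question below.
vocab = {123, 234, 324, 675, 678}
doc1 = vectorize([123, 234, 675], vocab)  # "Interesting article on fruit"
doc2 = vectorize([123, 324, 678], vocab)  # "The mighty orange!"
```

With features like `fid:123&fid:675`, a linear model such as logistic regression can learn rules that fire only on tag combinations, which a model over single tags cannot express.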
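The dependency-ordered rule evaluation described above can also be sketched in a few lines. This is an illustrative sketch, not Mahout or Lucene code: rules are modeled as plain predicates over (tags, categories-so-far), the dependency graph and rule IDs are hypothetical, and the ordering is Kahn's topological sort:

```python
from collections import defaultdict, deque

def evaluation_order(rules, depends_on):
    """Kahn's topological sort: order rule ids so that every rule
    runs after the rules whose category assignments it reads."""
    indegree = {r: 0 for r in rules}
    children = defaultdict(list)
    for rule, deps in depends_on.items():
        for dep in deps:
            children[dep].append(rule)
            indegree[rule] += 1
    ready = deque(r for r in rules if indegree[r] == 0)
    order = []
    while ready:
        r = ready.popleft()
        order.append(r)
        for child in children[r]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(rules):
        raise ValueError("cycle in rule dependencies")
    return order

def classify(doc_tags, rules, depends_on):
    """Evaluate rules in dependency order, augmenting the document
    with each category assignment as we go down the list."""
    tags, assigned = set(doc_tags), set()
    for rule_id in evaluation_order(rules, depends_on):
        if rules[rule_id](tags, assigned):
            assigned.add(rule_id)
    return assigned

# Hypothetical rules: 001 "Fruit" fires on tag 123; 003 "Apple"
# applies only to documents already categorized as Fruit.
rules = {
    "001": lambda tags, cats: 123 in tags,
    "003": lambda tags, cats: "001" in cats and 675 in tags,
}
depends_on = {"003": ["001"]}
```

The key point is the `assigned` set: because "003" sees the categories produced by earlier rules, parent/child exclusion rules evaluate correctly regardless of the order the rules were written in.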
Robin has recently checked in some good code for that vectorization, and Drew has a more elaborate document model right behind it. You can also vectorize directly from a Lucene index, which is probably very convenient for you. That gives you training data.

Training the classifiers will take a bit, since you need to train pretty much one classifier per category (unless you know that a document can have only one category). That shouldn't be hard, however, and with lots of examples the training should converge to perfect performance pretty quickly. The command-line form for running training is evolving a bit right now, and your feedback would be invaluable.

Deploying the classifiers should not be too difficult, but you would be in slightly new territory there, since I don't think that many (any) people have deployed Mahout-trained classifiers in anger just yet.

Does this help?

On Wed, Feb 17, 2010 at 1:23 AM, David Stuart <[email protected]> wrote:

> Hi All,
>
> I think this question is appropriate for the Mahout mailing list, but if not,
> any pointers in the right direction or advice would be welcomed.
>
> We have a taxonomy-based navigation system where items in the navigation
> tree are made up of tag-based queries (instead of natural-language words)
> which are matched against content items tagged in a similar way.
>
> So we have a taxonomy tree with queries:
>
> Id   Label
> 001  Fruit       (fid:123 OR fid:675) AND -fid:(324 OR 678) ...
> 002  Round
> 003  Apple
> 004  Orange
> 006  Star
> 007  Star fruit
> ...
>
> Content pool:
>
> "Interesting article on fruit" -> tagged with (123, 234, 675)
> "The mighty orange!" -> tagged with (123, 324, 678)
>
> Hopefully you get the picture.
>
> Now we bake these queries into our Solr index, so instead of doing the Fruit
> query we have pre-done it and just search for items in the index that have id
> 001. The reasons for doing this are not really important, but we have written
> an indexer for the purpose.
> Also, content items are multi-surfacing, so an item
> could appear at 001, 004, and 007.
>
> Although the indexer is OK at doing this pre-bake job, it's not very fast, and
> as the content and tree grow it gets slower.
>
> NOW for the actual question!!!
>
> Is there an ML model that can quickly classify/identify where a new (or
> retagged) piece of content fits in the tree? Oh, and the queries on the leaf
> nodes can change (less often), so a quick process to reclassify what is in
> scope for that leaf would be useful.
>
> The reason I want this is that it would be great to have realtime feedback for
> an author applying tags to a document, showing where it fits in the site.
>
> Once I get this working I would love to add suggested tags or weightings
> based on content items with contextual similarity.
>
> I think it was Grant who was talking about a Solr external field that
> could be used to hook this together, or maybe I am mistaken.
>
> Hope this makes sense.
>
> Thanks for your help/advice in advance.
>
> Regards,
>
> Dave

--
Ted Dunning, CTO
DeepDyve
