2011/11/18 Alex Lopez <[email protected]>:
> I wanted to share my 2 cents about the classification using Stanbol as I had
> relatively good results applying Olivier's method (using MoreLikeThis to
> compare the input text with wikipedia abstracts) within my Stanbol instance
> running a dbpedia index:
>
> Using RemoteStreaming to classify remote plain text (in this example some
> RFC about mail) on a default Stanbol using full launcher:
>
> http://stanbolserver/solr/default/dbpedia_43k/mlt?stream.url=http://www.rfc-editor.org/rfc/rfc6409.txt&mlt.fl=@en/rdfs:comment/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/rdf:type/+ref/dc:subject/
>
> or if a better index has been loaded (dbpedia) with indexed abstracts:
>
> http://stanbolserver/solr/default/dbpedia/mlt?stream.url=http://www.rfc-editor.org/rfc/rfc6409.txt&mlt.fl=@en/dbp-ont:abstract/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/rdf:type/+ref/dc:subject/
>
> Then process results: infer common broader categories, etc.

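For anyone who wants to script such requests: the MoreLikeThis URLs quoted above can be assembled programmatically so that the field names get escaped properly. This is only a minimal sketch. The helper name `build_mlt_url` is made up for illustration, the host and core names are taken from the quoted examples, and it assumes the `+` in the quoted `fl` parameter is a URL-encoded space between two field names.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_mlt_url(solr_base, core, stream_url, similarity_field, return_fields):
    """Build a Solr MoreLikeThis URL (hypothetical helper, not part of Stanbol).

    solr_base        -- e.g. "http://stanbolserver/solr/default"
    core             -- e.g. "dbpedia" or "dbpedia_43k"
    stream_url       -- remote plain text to classify (needs remote streaming)
    similarity_field -- field compared against the text, e.g. "@en/dbp-ont:abstract/"
    return_fields    -- fields to return for each similar entity
    """
    params = {
        "stream.url": stream_url,
        "mlt.fl": similarity_field,
        "mlt.interestingTerms": "list",
        "mlt.mintf": 0,
        # A space-separated field list; urlencode turns the space into "+".
        "fl": " ".join(return_fields),
    }
    return f"{solr_base}/{core}/mlt?{urlencode(params)}"


url = build_mlt_url(
    "http://stanbolserver/solr/default",
    "dbpedia",
    "http://www.rfc-editor.org/rfc/rfc6409.txt",
    "@en/dbp-ont:abstract/",
    ["ref/rdf:type/", "ref/dc:subject/"],
)
# Decode the query string again to check that nothing was lost in escaping.
decoded = parse_qs(urlsplit(url).query)
print(decoded["mlt.fl"], decoded["fl"])
```

Escaping `:`, `/` and `@` is more conservative than the hand-written URLs above, but it decodes to the same parameter values.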
Nice to see that you experimented further with this idea. For the broader
category structure we have the information in the dbpedia skos graph.

> Just to make some tests I extracted the most-repeated broader categories
> using all dc:subject with the above text and yielded:
>
> Internet
> Email
> Internet_protocols
> World_Wide_Web
> Application_layer_protocols
>
> Another example using a Portuguese text (bible fragment):
>
> http://stanbolserver/solr/default/dbpedia/mlt?stream.url=http://scrapmaker.com/data/wordlists/genesis/portuguese.txt&mlt.fl=@pt/dbp-ont:abstract/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/dc:subject/
>
> Categories_named_after_religious_texts

Categories of this kind are noisy technical boilerplate and should not be
indexed. My pignlproc scripts should take care of that.

> Christian_liturgy,_rites,_and_worship_services
> Christian_theology
>
> It works for me :)
>
> However in the on-line instances I tested, the SOLR server didn't seem to be
> exposed (as it is in last Stanbol revisions) so I can't give any
> ready-to-see working example.

Yes, we need to work on that :) I think Rupert has already started.

> Thanks Olivier for the great idea!

I have plenty of ideas to improve the quality further by using Mahout and
sparse prior logistic regression to trim down the index so that it keeps only
the most discriminative words (and bi-grams) for each category. This should
reduce the size of the index and improve both the processing speed and the
quality of the predictions.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
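
PS: the "most-repeated broader categories" step discussed in this thread can be sketched roughly as follows. This is an illustration under assumptions, not the code Alex actually ran. It assumes the similar documents are available as dicts with a `ref/dc:subject/` list, and that the broader relations from the dbpedia skos graph have been loaded into a plain dict. The function name and the toy data are made up.

```python
from collections import Counter


def top_broader_categories(docs, broader, subject_field="ref/dc:subject/", n=5):
    """Return the n most frequent broader categories over all similar docs.

    docs    -- list of dicts, each with a list of category names under
               subject_field (assumed shape of the MoreLikeThis results)
    broader -- dict mapping a category to its broader categories
               (assumed to be extracted from the skos graph beforehand)
    """
    counts = Counter()
    for doc in docs:
        for subject in doc.get(subject_field, []):
            # Count the category itself when no broader category is known.
            counts.update(broader.get(subject, [subject]))
    return [category for category, _ in counts.most_common(n)]


# Toy data, invented for the example.
docs = [
    {"ref/dc:subject/": ["Email_protocols", "Internet_mail_protocols"]},
    {"ref/dc:subject/": ["Email_protocols"]},
    {"ref/dc:subject/": ["Web_standards"]},
]
broader = {
    "Email_protocols": ["Email", "Internet_protocols"],
    "Internet_mail_protocols": ["Email"],
}
print(top_broader_categories(docs, broader, n=2))
```

Noisy categories such as Categories_named_after_religious_texts could be dropped at this stage as well, but filtering them at indexing time keeps the index smaller.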
