2011/11/21 Alex Lopez <[email protected]>:
> On 18-11-2011 17:33, Olivier Grisel wrote:
>>
>> 2011/11/18 Alex Lopez <[email protected]>:
>>>
>>> I wanted to share my 2 cents about the classification using Stanbol as I
>>> had relatively good results applying Olivier's method (using MoreLikeThis
>>> to compare the input text with wikipedia abstracts) within my Stanbol
>>> instance running a dbpedia index:
>>>
>>> Using RemoteStreaming to classify remote plain text (in this example some
>>> RFC about mail) on a default Stanbol using full launcher:
>>>
>>> http://stanbolserver/solr/default/dbpedia_43k/mlt?stream.url=http://www.rfc-editor.org/rfc/rfc6409.txt&mlt.fl=@en/rdfs:comment/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/rdf:type/+ref/dc:subject/
>>>
>>> or if a better index has been loaded (dbpedia) with indexed abstracts:
>>>
>>> http://stanbolserver/solr/default/dbpedia/mlt?stream.url=http://www.rfc-editor.org/rfc/rfc6409.txt&mlt.fl=@en/dbp-ont:abstract/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/rdf:type/+ref/dc:subject/
>>>
>>> Then process results: infer common broader categories, etc.
>>
>> Nice to see that you experimented further with this idea. For the
>> broader category structure we have the information in the dbpedia skos
>> graph.
>>
>>> Just to make some tests I extracted the most-repeated broader categories
>>> using all dc:subject with the above text and yielded:
>>>
>>> Internet
>>> Email
>>> Internet_protocols
>>> World_Wide_Web
>>> Application_layer_protocols
>>>
>>> Another example using a Portuguese text (bible fragment):
>>>
>>> http://stanbolserver/solr/default/dbpedia/mlt?stream.url=http://scrapmaker.com/data/wordlists/genesis/portuguese.txt&mlt.fl=@pt/dbp-ont:abstract/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/dc:subject/
>>>
>>> Categories_named_after_religious_texts
>>
>> Categories of this kind are noisy technical boilerplate and should not
>> be indexed. My pignlproc scripts should take care of that.
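For reference, the MoreLikeThis queries quoted above can be assembled and post-processed roughly as follows. This is only a sketch: the function names `build_mlt_url` and `most_common_subjects` are made up for illustration, the query parameters are the ones from the quoted URLs, and the result documents are assumed to be already parsed into dicts keyed by the field names requested in `fl`. The optional `broader` mapping (subject to list of broader categories) is also an assumption; without it the sketch simply counts the dc:subject values themselves.

```python
from collections import Counter
from urllib.parse import urlencode


def build_mlt_url(handler_url, text_url, abstract_field):
    """Build a MoreLikeThis query URL using the parameters from the thread."""
    params = {
        "stream.url": text_url,          # remote text to classify
        "mlt.fl": abstract_field,        # field compared against the text
        "mlt.interestingTerms": "list",
        "mlt.mintf": 0,
        # A space separates the returned fields; urlencode turns it into "+".
        "fl": "ref/rdf:type/ ref/dc:subject/",
    }
    return handler_url + "?" + urlencode(params)


def most_common_subjects(docs, n=5, field="ref/dc:subject/", broader=None):
    """Return the n most repeated categories over all result documents.

    docs:    list of dicts (assumed shape of the parsed results)
    broader: optional dict mapping a subject to its broader categories;
             when given, the broader categories are counted instead.
    """
    counts = Counter()
    for doc in docs:
        subjects = doc.get(field, [])
        if isinstance(subjects, str):    # tolerate single-valued fields
            subjects = [subjects]
        for subject in subjects:
            if broader is None:
                counts[subject] += 1
            else:
                counts.update(broader.get(subject, []))
    return [category for category, _ in counts.most_common(n)]
```

For example, `build_mlt_url("http://stanbolserver/solr/default/dbpedia/mlt", "http://www.rfc-editor.org/rfc/rfc6409.txt", "@en/dbp-ont:abstract/")` yields a percent-encoded equivalent of the second URL above.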
>>
>
> I see... I'm just beginning to explore the structure of the dbpedia SKOS
> graph... do you know of any resource that serves as a good introduction to
> the structure of this graph (especially near the root)?

I found that the subcategories of this node can serve as a good set of
semantic roots:

  http://dbpedia.org/page/Category:Main_topic_classifications

> Is it known to have
> cycles?

Yes, plenty of them :)

> Are you removing these useless categories based on label/uri text
> matching or because they all have as skos:broader some special category?

I try to find a matching dbpedia entry by removing the "Category:" prefix
and following the redirect, if any. If there is a matching concept whose
abstract is not too short, I mark the category as semantically
interesting:

  https://github.com/ogrisel/pignlproc/blob/master/examples/topic-corpus/02_find_grounded_topics.pig

> For
> example I found that category:Education has a skos:broader
> category:Main_topic_classifications but also has skos:broader
> category:Society, which has also skos:broader
> category:Main_topic_classifications (so Education is not really "main topic"
> because it is under Society which is also "main topic"), does this make
> sense? (Sorry if this goes a little bit off topic but I really couldn't find
> any evident source explaining the dbpedia skos graph).

In my case I start from Category:Main_topic_classifications, consider
its children as root categories, and accept the fact that a given node can
have several paths to the top-level roots.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
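
PS: the root-category approach described above can be sketched as follows. This is a toy illustration, not what the pignlproc scripts do: it assumes the skos:broader links are available in memory as (narrower, broader) pairs, and the function names are made up. The `seen` set is what keeps the walk from looping on the cycles in the graph.

```python
from collections import defaultdict, deque

TOP = "Category:Main_topic_classifications"


def build_broader_index(pairs):
    """Index (narrower, broader) pairs as narrower -> set of broader."""
    broader = defaultdict(set)
    for narrower, wider in pairs:
        broader[narrower].add(wider)
    return broader


def root_categories(broader, top=TOP):
    """Root categories are the direct children of the top node."""
    return {cat for cat, parents in broader.items() if top in parents}


def roots_of(category, broader, roots):
    """Return every root reachable from category via skos:broader links.

    A node may reach several roots through different paths; all of them
    are reported. Each node is visited at most once, so cycles are safe.
    """
    found = set()
    seen = {category}
    queue = deque([category])
    while queue:
        node = queue.popleft()
        if node in roots:
            found.add(node)
        for parent in broader.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return found
```

With the Education/Society example from the question, both categories are children of the top node, so both count as roots, and anything filed under Education is reported as belonging to both.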
