On 18-11-2011 17:33, Olivier Grisel wrote:
2011/11/18 Alex Lopez<[email protected]>:
I wanted to share my 2 cents about the classification using Stanbol as I had
relatively good results applying Olivier's method (using MoreLikeThis to
compare the input text with wikipedia abstracts) within my Stanbol instance
running a dbpedia index:

Using Solr remote streaming to classify remote plain text (in this example,
an RFC about mail) on a default Stanbol started with the full launcher:

http://stanbolserver/solr/default/dbpedia_43k/mlt?stream.url=http://www.rfc-editor.org/rfc/rfc6409.txt&mlt.fl=@en/rdfs:comment/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/rdf:type/+ref/dc:subject/

or if a better index has been loaded (dbpedia) with indexed abstracts:

http://stanbolserver/solr/default/dbpedia/mlt?stream.url=http://www.rfc-editor.org/rfc/rfc6409.txt&mlt.fl=@en/dbp-ont:abstract/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/rdf:type/+ref/dc:subject/

Then process results: infer common broader categories, etc.
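
For anyone who wants to script the requests above, here is a minimal sketch
that rebuilds the same MoreLikeThis URL with the Python standard library. The
host "stanbolserver" and the core path are the placeholders used above, and
the helper name mlt_url is made up for illustration. Note that the "+" in the
fl parameter is just a URL-encoded space.

```python
from urllib.parse import urlencode


def mlt_url(core_url, text_url, field, fl="ref/rdf:type/ ref/dc:subject/"):
    """Build a MoreLikeThis request URL like the ones shown above.

    core_url is the Solr core (placeholder host in this thread), text_url is
    the remote document fetched through stream.url, and field is the indexed
    field to compare against.
    """
    params = {
        "stream.url": text_url,
        "mlt.fl": field,
        "mlt.interestingTerms": "list",
        "mlt.mintf": 0,
        "fl": fl,
    }
    # Keep "/", ":" and "@" readable; spaces become "+" and anything else
    # that is unsafe in a query string (e.g. "&" in text_url) gets escaped.
    return core_url.rstrip("/") + "/mlt?" + urlencode(params, safe="/:@")


URL = mlt_url(
    "http://stanbolserver/solr/default/dbpedia",
    "http://www.rfc-editor.org/rfc/rfc6409.txt",
    "@en/dbp-ont:abstract/",
)
print(URL)
```

The same helper covers the dbpedia_43k index by passing that core path and
"@en/rdfs:comment/" as the field.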

Nice to see that you experimented further with this idea. For the
broader category structure we have the information in the dbpedia skos
graph.

As a quick test, I extracted the most frequent broader categories over all
the dc:subject values returned for the above text, which yielded:

Internet
Email
Internet_protocols
World_Wide_Web
Application_layer_protocols
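
The counting step behind a list like this can be sketched as follows. This is
only an illustration: the BROADER mapping and the HITS list are invented toy
data standing in for the skos:broader links and for the dc:subject values of
each MoreLikeThis hit, and the function name is hypothetical.

```python
from collections import Counter


def top_broader_categories(subjects_per_hit, broader, n=5):
    """Return the n broader categories reached most often from the hits."""
    counts = Counter()
    for subjects in subjects_per_hit:
        for subject in subjects:
            # Each dc:subject votes once for each of its skos:broader parents.
            counts.update(broader.get(subject, ()))
    return [category for category, _ in counts.most_common(n)]


# Invented toy data, not taken from the real DBpedia SKOS graph.
BROADER = {
    "Email": ["Internet", "Electronic_documents"],
    "Internet_mail_protocols": ["Email", "Internet_protocols"],
    "Internet_standards": ["Internet"],
}
HITS = [
    ["Email", "Internet_mail_protocols"],
    ["Internet_standards", "Email"],
]

print(top_broader_categories(HITS, BROADER, n=2))
```

Weighting each vote by the MoreLikeThis score of the hit would be a natural
variation, but that is not something I tried here.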

Another example using a Portuguese text (bible fragment):

http://stanbolserver/solr/default/dbpedia/mlt?stream.url=http://scrapmaker.com/data/wordlists/genesis/portuguese.txt&mlt.fl=@pt/dbp-ont:abstract/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/dc:subject/

Categories_named_after_religious_texts

These kinds of categories are noisy technical boilerplate and should not
be indexed. My pignlproc scripts should take care of that.


I see... I'm just beginning to explore the structure of the dbpedia SKOS graph... do you know of any resource that serves as a good introduction to the structure of this graph (especially near the root)? Is it known to have cycles? Are you removing these useless categories based on label/URI text matching, or because they all have some special category as skos:broader? For example, I found that category:Education has skos:broader category:Main_topic_classification but also skos:broader category:Society, which in turn has skos:broader category:Main_topic_classification (so Education is not really a "main topic", because it sits under Society, which is also a "main topic"). Does this make sense? (Sorry if this goes a little off topic, but I really couldn't find any obvious source explaining the dbpedia SKOS graph.)
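
To make the two questions concrete, here is a small sketch of how one could
check a skos:broader mapping for cycles and for "shortcut" links like the
Education one. The BROADER dict only contains the three categories mentioned
above; loading the real graph is left out, and all function names are made up
for illustration.

```python
def find_cycle(broader):
    """Return one cycle as a list of categories, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2
    state, path = {}, []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for parent in broader.get(node, ()):
            colour = state.get(parent, WHITE)
            if colour == GREY:
                # parent is already on the current path: we closed a loop.
                return path[path.index(parent):] + [parent]
            if colour == WHITE:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in list(broader):
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


def ancestors(broader, node):
    """All categories reachable from node through skos:broader links."""
    seen, stack = set(), list(broader.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, ()))
    return seen


def redundant_broader(broader, node):
    """Direct parents of node that are also reachable via another parent."""
    parents = broader.get(node, ())
    return {p for p in parents
            if any(p in ancestors(broader, q) for q in parents if q != p)}


# Only the links mentioned in this message.
BROADER = {
    "Education": ["Main_topic_classification", "Society"],
    "Society": ["Main_topic_classification"],
}

print(find_cycle(BROADER))
print(redundant_broader(BROADER, "Education"))
```

The recursive visit is fine for a toy graph; on the full category graph an
iterative traversal would be safer because of Python's recursion limit.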

Christian_liturgy,_rites,_and_worship_services
Christian_theology

It works for me :)

However, in the online instances I tested, the Solr server did not seem to be
exposed (as it is in the latest Stanbol revisions), so I can't give a
ready-to-see working example.

Yes we need to work on that :) I think Rupert has already started.

Thanks Olivier for the great idea!

I have plenty of ideas to improve the quality further by using mahout
and the sparse prior logistic regression to trim down the index and
keep only the most discriminative words (and bi-grams) for each
category. This should reduce the size of the index and improve both
the processing speed and the quality of the predictions.
