Re: Olivier's presentation on Stanbol at ApacheCon

Alex Lopez Mon, 21 Nov 2011 02:46:36 -0800

On 18-11-2011 17:33, Olivier Grisel wrote:

2011/11/18 Alex Lopez<[email protected]>:

I wanted to share my 2 cents about the classification using Stanbol as I had
relatively good results applying Olivier's method (using MoreLikeThis to
compare the input text with wikipedia abstracts) within my Stanbol instance
running a dbpedia index:


Using RemoteStreaming to classify remote plain text (in this example an
RFC about mail) on a default Stanbol instance using the full launcher:

http://stanbolserver/solr/default/dbpedia_43k/mlt?stream.url=http://www.rfc-editor.org/rfc/rfc6409.txt&mlt.fl=@en/rdfs:comment/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/rdf:type/+ref/dc:subject/

or if a better index has been loaded (dbpedia) with indexed abstracts:

http://stanbolserver/solr/default/dbpedia/mlt?stream.url=http://www.rfc-editor.org/rfc/rfc6409.txt&mlt.fl=@en/dbp-ont:abstract/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/rdf:type/+ref/dc:subject/

Then process results: infer common broader categories, etc.
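
For readers who want to script the requests above, here is a minimal sketch
that assembles the same MoreLikeThis URL. The helper name build_mlt_url and
its defaults are assumptions made for illustration; only the query
parameters themselves come from the URLs quoted in this message.

```python
from urllib.parse import urlencode


def build_mlt_url(core_url, text_url, lang="en",
                  field="dbp-ont:abstract",
                  returned=("ref/rdf:type/", "ref/dc:subject/")):
    """Assemble a Solr MoreLikeThis URL like the ones quoted above.

    build_mlt_url is a hypothetical helper, not part of Stanbol or Solr.
    """
    params = {
        # Remote document that Solr should fetch (needs RemoteStreaming).
        "stream.url": text_url,
        # Indexed field the input text is compared against.
        "mlt.fl": "@%s/%s/" % (lang, field),
        "mlt.interestingTerms": "list",
        "mlt.mintf": 0,
        # Fields returned for each similar entity; a space-separated list
        # is written with "+" once URL-encoded.
        "fl": " ".join(returned),
    }
    # Keep "/", ":" and "@" unescaped so the output matches the hand-written
    # URLs in this thread.
    return core_url.rstrip("/") + "/mlt?" + urlencode(params, safe="/:@")


url = build_mlt_url("http://stanbolserver/solr/default/dbpedia",
                    "http://www.rfc-editor.org/rfc/rfc6409.txt")
print(url)
```

Passing safe="/:@" is only a readability choice; fully percent-encoding the
parameter values should work just as well.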


Nice to see that you experimented further with this idea. For the
broader category structure we have the information in the dbpedia skos
graph.

As a quick test, I extracted the most frequent broader categories across
all dc:subject values returned for the above text, which yielded:

Internet
Email
Internet_protocols
World_Wide_Web
Application_layer_protocols

Another example using a Portuguese text (bible fragment):

http://stanbolserver/solr/default/dbpedia/mlt?stream.url=http://scrapmaker.com/data/wordlists/genesis/portuguese.txt&mlt.fl=@pt/dbp-ont:abstract/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/dc:subject/

Categories_named_after_religious_texts


Categories of this kind are noisy technical boilerplate and should not
be indexed. My pignlproc scripts should take care of that.

I see... I'm just beginning to explore the structure of the dbpedia SKOS
graph... do you know of any resource that serves as a good introduction
to the structure of this graph (especially near the root)? Is it known to
have cycles? Are you removing these useless categories based on
label/URI text matching, or because they all have some special category
as skos:broader? For example, I found that category:Education has
skos:broader category:Main_topic_classification but also has
skos:broader category:Society, which in turn has skos:broader
category:Main_topic_classification (so Education is not really a "main
topic", because it is under Society, which is also a "main topic"). Does
this make sense? (Sorry if this goes a little bit off topic, but I really
couldn't find any obvious source explaining the dbpedia SKOS graph.)
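
One way to check the cycle question empirically, once the skos:broader
links are loaded, is a depth-first search over the graph. The sketch below
assumes the links are held in a plain dict (category -> list of broader
categories); the function name find_cycle is made up for this example, and
the only real data used is the Education/Society case described above.

```python
def find_cycle(broader):
    """Return one cycle as a list of categories, or None if there is none.

    broader: hypothetical mapping category -> list of skos:broader
             categories.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    state = {}
    for start in broader:
        if state.get(start, WHITE) != WHITE:
            continue
        state[start] = GREY
        path = [start]
        stack = [iter(broader.get(start, ()))]
        while stack:
            for parent in stack[-1]:
                seen = state.get(parent, WHITE)
                if seen == GREY:
                    # Parent is already on the current path: cycle found.
                    return path[path.index(parent):] + [parent]
                if seen == WHITE:
                    state[parent] = GREY
                    path.append(parent)
                    stack.append(iter(broader.get(parent, ())))
                    break
            else:
                # All parents explored: this category is finished.
                state[path.pop()] = BLACK
                stack.pop()
    return None


# The Education example from this message: redundant, but not a cycle.
links = {
    "Education": ["Main_topic_classification", "Society"],
    "Society": ["Main_topic_classification"],
    "Main_topic_classification": [],
}
print(find_cycle(links))
```

The search is iterative rather than recursive because the category graph
may be deep enough to hit Python's recursion limit.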

Christian_liturgy,_rites,_and_worship_services
Christian_theology

It works for me :)

However, in the online instances I tested, the Solr server didn't seem to be
exposed (as it is in recent Stanbol revisions), so I can't give any
ready-to-see working example.


Yes we need to work on that :) I think Rupert has already started.

Thanks Olivier for the great idea!


I have plenty of ideas to improve the quality further by using Mahout
and sparse prior logistic regression to trim down the index so that it
keeps only the most discriminative words (and bi-grams) for each
category. This should reduce the size of the index and improve both
the processing speed and the quality of the predictions.
