I wanted to share my 2 cents on classification with Stanbol, as I got relatively good results applying Olivier's method (using MoreLikeThis to compare the input text with Wikipedia abstracts) on my Stanbol instance running a DBpedia index.

Using RemoteStreaming to classify remote plain text (in this example, an RFC about mail) on a default Stanbol running the full launcher:

http://stanbolserver/solr/default/dbpedia_43k/mlt?stream.url=http://www.rfc-editor.org/rfc/rfc6409.txt&mlt.fl=@en/rdfs:comment/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/rdf:type/+ref/dc:subject/

or, if a better index (DBpedia) with indexed abstracts has been loaded:

http://stanbolserver/solr/default/dbpedia/mlt?stream.url=http://www.rfc-editor.org/rfc/rfc6409.txt&mlt.fl=@en/dbp-ont:abstract/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/rdf:type/+ref/dc:subject/
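For anyone who prefers to script these requests, here is a minimal Python sketch that assembles the same kind of MoreLikeThis URL. The function name build_mlt_url and its arguments are my own invention, not part of Stanbol or Solr; the query parameters and values are the ones from the URLs above.

```python
from urllib.parse import urlencode


def build_mlt_url(solr_core_url, text_url, similarity_field, returned_fields):
    """Build a MoreLikeThis URL like the ones above (hypothetical helper).

    solr_core_url    -- e.g. "http://stanbolserver/solr/default/dbpedia"
    text_url         -- remote plain text fetched by Solr via stream.url
    similarity_field -- indexed field the text is compared against (mlt.fl)
    returned_fields  -- fields to return for each similar entity (fl)
    """
    params = {
        "stream.url": text_url,
        "mlt.fl": similarity_field,
        "mlt.interestingTerms": "list",
        "mlt.mintf": 0,
        # Several fields are space-separated; urlencode turns the space into "+".
        "fl": " ".join(returned_fields),
    }
    # Keep "/", ":" and "@" unescaped so the result reads like the URLs above.
    return solr_core_url.rstrip("/") + "/mlt?" + urlencode(params, safe="/:@")
```

Calling it with the DBpedia core, the RFC URL, "@en/dbp-ont:abstract/" and the two ref/ fields should give back the second URL above.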

Then process the results: infer common broader categories, etc.

As a quick test, I extracted the most-repeated broader categories from all the dc:subject values returned for the above text, which yielded:

Internet
Email
Internet_protocols
World_Wide_Web
Application_layer_protocols
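The counting step can be sketched roughly as follows. This assumes the MoreLikeThis response has already been parsed into a list of dicts (one per similar entity) and that the subjects sit under the key ref/dc:subject/. Both are assumptions on my side, as is the optional broader mapping, which stands in for whatever lookup you use to go from a category to its broader categories.

```python
from collections import Counter


def top_categories(docs, field="ref/dc:subject/", broader=None, n=5):
    """Return the n most frequent categories over all documents.

    docs    -- list of dicts, one per similar entity (assumed response shape)
    field   -- key holding the dc:subject values (assumed field name)
    broader -- optional dict: category -> list of broader categories
               (hypothetical; if omitted, the subjects themselves are counted)
    """
    counts = Counter()
    for doc in docs:
        values = doc.get(field, [])
        # A single-valued field may come back as a plain string.
        if isinstance(values, str):
            values = [values]
        for subject in values:
            if broader is None:
                counts[subject] += 1
            else:
                # Count each broader category once per occurrence of the subject.
                counts.update(broader.get(subject, []))
    return [name for name, _ in counts.most_common(n)]
```

Counting plain occurrences is the simplest possible ranking; weighting each document by its similarity score would be a natural refinement, but I have not tried it.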

Another example using a Portuguese text (a Bible fragment):

http://stanbolserver/solr/default/dbpedia/mlt?stream.url=http://scrapmaker.com/data/wordlists/genesis/portuguese.txt&mlt.fl=@pt/dbp-ont:abstract/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/dc:subject/

Categories_named_after_religious_texts
Christian_liturgy,_rites,_and_worship_services
Christian_theology

It works for me :)

However, in the online instances I tested, the Solr server didn't seem to be exposed (as it is in recent Stanbol revisions), so I can't give a ready-to-use working example.

Thanks Olivier for the great idea!

On 18-11-2011 15:52, Olivier Grisel wrote:
2011/11/18 Reto Bachmann-Gmür<[email protected]>:
On Tue, Nov 15, 2011 at 2:09 PM, Bertrand Delacretaz<[email protected]> wrote:

On Tue, Nov 15, 2011 at 12:45 PM, Stefane Fermigier<[email protected]>  wrote:
Is online here:


http://www.slideshare.net/nuxeo/apache-stanbol-and-the-web-of-data-apachecon-2011

I attended Olivier's presentation and was impressed by the results of
his Universal Topic Classification experiment (starting at slide 38).

The results look very impressive. Is there some documentation on how to set
up this effective topic classification?

Right now it's still a prototype using solr directly. I need to
refactor a bunch of stuff but that will likely be impacted by the new
RDF Path mapper / indexer we are gonna work on during the hackathon.
