2011/11/21 Alex Lopez <[email protected]>:
> On 18-11-2011 17:33, Olivier Grisel wrote:
>>
>> 2011/11/18 Alex Lopez<[email protected]>:
>>>
>>> I wanted to share my 2 cents about the classification using Stanbol as I
>>> had relatively good results applying Olivier's method (using MoreLikeThis
>>> to compare the input text with wikipedia abstracts) within my Stanbol
>>> instance running a dbpedia index:
>>>
>>> Using RemoteStreaming to classify remote plain text (in this example some
>>> RFC about mail) on a default Stanbol using full launcher:
>>>
>>>
>>> http://stanbolserver/solr/default/dbpedia_43k/mlt?stream.url=http://www.rfc-editor.org/rfc/rfc6409.txt&mlt.fl=@en/rdfs:comment/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/rdf:type/+ref/dc:subject/
>>>
>>> or if a better index has been loaded (dbpedia) with indexed abstracts:
>>>
>>>
>>> http://stanbolserver/solr/default/dbpedia/mlt?stream.url=http://www.rfc-editor.org/rfc/rfc6409.txt&mlt.fl=@en/dbp-ont:abstract/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/rdf:type/+ref/dc:subject/
>>>
>>> Then process results: infer common broader categories, etc.
>>
>> Nice to see that you experimented further with this idea. For the
>> broader category structure we have the information in the dbpedia skos
>> graph.
>>
>>> Just to make some tests I extracted the most-repeated broader categories
>>> using all dc:subject with the above text and yielded:
>>>
>>> Internet
>>> Email
>>> Internet_protocols
>>> World_Wide_Web
>>> Application_layer_protocols
>>>
>>> Another example using a Portuguese text (bible fragment):
>>>
>>>
>>> http://stanbolserver/solr/default/dbpedia/mlt?stream.url=http://scrapmaker.com/data/wordlists/genesis/portuguese.txt&mlt.fl=@pt/dbp-ont:abstract/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/dc:subject/
>>>
>>> Categories_named_after_religious_texts
>>
>> These kinds of categories are noisy technical boilerplate and should not
>> be indexed. My pignlproc scripts should take care of that.
>>
>
> I see... I'm just beginning to explore the structure of the dbpedia SKOS
> graph... do you know of any resource that serves as a good introduction to
> the structure of this graph (especially near the root)?

I found that the subcategories of this node can serve as a good set
of semantic roots:

  http://dbpedia.org/page/Category:Main_topic_classifications

> Is it known to have
> cycles?

Yes, plenty of them :)

> Are you removing these useless categories based on label/uri text
> matching or because they all have as skos:broader some special category?

I try to find a matching dbpedia entry by removing the "Category:"
prefix and following the redirect, if any. If there is a matching
concept whose abstract is not too short, I mark the category as
semantically interesting:

  
  https://github.com/ogrisel/pignlproc/blob/master/examples/topic-corpus/02_find_grounded_topics.pig
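
As a rough illustration only (this is not the Pig script itself), the
heuristic could be sketched in Python as follows. The function name, the
in-memory dicts, the 100-character threshold and the toy data are all
assumptions made up for the example:

```python
CATEGORY_PREFIX = "Category:"


def find_grounded_topic(category, redirects, abstracts, min_abstract_length=100):
    """Return the matching article name if the category looks grounded, else None.

    `redirects` maps an article name to its redirect target and `abstracts`
    maps an article name to its abstract text. Both are hypothetical
    in-memory stand-ins for the dbpedia dumps.
    """
    if not category.startswith(CATEGORY_PREFIX):
        return None
    # Remove the "Category:" prefix to get a candidate article name.
    candidate = category[len(CATEGORY_PREFIX):]
    # Follow the redirect, if any (a single hop in this sketch).
    candidate = redirects.get(candidate, candidate)
    # Keep the category only if the article has a long enough abstract.
    abstract = abstracts.get(candidate, "")
    if len(abstract) >= min_abstract_length:
        return candidate
    return None


# Made-up toy data, not taken from dbpedia.
redirects = {"Topic_A_alias": "Topic_A"}
abstracts = {
    "Topic_A": "A long enough abstract about topic A. " * 5,
    "Topic_B": "Too short.",
}

print(find_grounded_topic("Category:Topic_A_alias", redirects, abstracts))
print(find_grounded_topic("Category:Topic_B", redirects, abstracts))
```

With this toy data the first category resolves to Topic_A through the
redirect, while the second is rejected because its abstract is too short.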

> For
> example I found that category:Education has a skos:broader
> category:Main_topic_classification but also has skos:broader
> category:Society, which has also skos:broader
> category:Main_topic_classification (so Education is not really "main topic"
> because it is under Society which is also "main topic"), does this make
> sense? (Sorry if this goes a little bit off topic but I really couldn't find
> any evident source explaining the dbpedia skos graph).

In my case I start from Category:Main_topic_classifications,
consider its children as root categories, and accept the fact that a
given node can have several paths to the top-level roots.
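
A minimal sketch of that traversal in Python, assuming the skos:broader
statements are available as (child, parent) pairs. The function name and the
toy edges are hypothetical (Category:X and Category:Y are made up to show a
cycle); a per-root "seen" set keeps the walk from looping forever:

```python
from collections import defaultdict, deque

TOP = "Category:Main_topic_classifications"


def roots_for_categories(broader_edges, top=TOP):
    """Map each category to the set of root categories it descends from.

    `broader_edges` is an iterable of (child, parent) pairs, one per
    skos:broader statement. The roots are the direct children of `top`.
    """
    # Invert skos:broader so we can walk downwards from the roots.
    narrower = defaultdict(set)
    for child, parent in broader_edges:
        narrower[parent].add(child)

    reachable = defaultdict(set)
    for root in set(narrower[top]):
        # Breadth-first walk; `seen` makes the traversal safe on cycles.
        seen = {root}
        queue = deque([root])
        while queue:
            node = queue.popleft()
            reachable[node].add(root)
            for child in narrower[node]:
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
    return dict(reachable)


edges = [
    ("Category:Education", TOP),
    ("Category:Society", TOP),
    ("Category:Education", "Category:Society"),
    ("Category:X", "Category:Education"),
    ("Category:Y", "Category:X"),
    ("Category:X", "Category:Y"),  # made-up cycle between X and Y
]

for category, roots in sorted(roots_for_categories(edges).items()):
    print(category, sorted(roots))
```

Here Category:Education ends up attached to both the Education and Society
roots, which is the "several paths to the top" situation described above.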

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
