Problem trying to create a new dbpedia index and site in Italian.

Stefano Norcia Thu, 01 Mar 2012 05:49:56 -0800

Hi all,

My name is Stefano Norcia and I'm working on the early adoption project for
Etcware.


For our early adoption project (Etcware Early Adoption project) we need to
use an Italian-language DBpedia index in the enhancement and enrichment
process enabled by the Stanbol engines.

The main problem is that the NLP module does not support the Italian
language directly, so if you submit an Italian text to the enhancement
engines, the DBpedia engine does not detect any concepts, places or people.

We have tried a few approaches to get this working:

Our first attempt was to rebuild the DBpedia index following the
instructions found in the stanbol/entityhub/indexing/dbpedia folder. That
folder contains a shell script (fetch_prepare.sh) that describes how to
prepare the DBpedia datasets before creating the index. We followed those
instructions and tried to create a new index and "site" from the Italian
DBpedia datasets, to replace the standard English DBpedia index. We are
aware that the Italian datasets are not complete and that some packages
are missing (like persondata_en.nt.bz2 and so on).
These are the packages we used to create the index
(http://downloads.dbpedia.org/3.7/it/):

o dbpedia_3.7.owl.bz2
o geo_coordinates_it.nt.bz2
o instance_types_it.nt.bz2
o labels_it.nt.bz2
o long_abstracts_it.nt.bz2
o short_abstracts_it.nt.bz2

We were also able to create the incoming_links text file from the
page_links_it.nt.bz2 package.
After rebuilding the index we replaced the English DBpedia index in Stanbol
with our custom one (simply swapping the old index for the new one and
restarting Stanbol).
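
As an illustration of the incoming_links step, here is a minimal Python
sketch that counts how many triples point at each resource in an N-Triples
dump such as page_links_it.nt.bz2. The function names and the
"count<TAB>uri" output layout are assumptions made for this example; the
format the Stanbol indexer actually expects is the one produced by
fetch_prepare.sh, which remains the reference.

```python
import bz2
import re
from collections import Counter

# Captures the object URI of an N-Triples line of the form "<s> <p> <o> ."
OBJECT_URI = re.compile(r"<([^>]*)>\s*\.\s*$")


def count_incoming_links(lines):
    """Count how many triples have each URI as their object."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip N-Triples comment lines
        match = OBJECT_URI.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_links_in_file(path):
    """Stream a bz2-compressed N-Triples file without unpacking it on disk."""
    with bz2.open(path, "rt", encoding="utf-8") as src:
        return count_incoming_links(src)


def write_counts(counts, out_path):
    """Write one "count<TAB>uri" line per resource (assumed layout)."""
    with open(out_path, "w", encoding="utf-8") as out:
        for uri, n in counts.most_common():
            out.write(f"{n}\t{uri}\n")
```

Calling count_links_in_file("page_links_it.nt.bz2") and passing the result
to write_counts would produce the text file; the file is read as a stream,
so memory use depends on the number of distinct resources, not on the size
of the dump.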

Sadly, after that, the results produced by the enhancement engines were
exactly the same as before: no Italian concepts were detected, and none of
the other enhancement engines returned any enhancements.

As a second attempt, we decided to use the generic RDF indexer (combined
with the standard Keyword Linking Engine) to process the Italian DBpedia
datasets. In this case the indexing process succeeded and we got a lot of
results when testing the enhancement engines with Italian content. This
time the problem is that there are simply too many results, and they also
include stopwords.

As an example, a sample text submitted for enhancement and the results
returned by the Keyword Linking Engine are included at the end of this
message.

The terms shown in bold are clearly stopwords. I don’t know if the problem
is in dataset indexing,
or if there is a way to eliminate them after the creation of the index.
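
One way to drop such terms after the index has been created is to filter
the returned labels against a stopword list on the client side. The sketch
below is only an illustration of that idea, not a Stanbol feature: the
function names are hypothetical, and the stopword set is a tiny sample
built from the bold terms in the results at the end of this message.

```python
import unicodedata

# Tiny illustrative set (assumption); a real Italian stopword list is longer.
ITALIAN_STOPWORDS = {
    "alla", "anche", "che", "con", "dal", "dell",
    "nello", "non", "per", "sui", "una",
}


def normalize(label):
    """Lower-case a label and strip accents and surrounding punctuation."""
    decomposed = unicodedata.normalize("NFD", label)
    no_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return no_accents.strip(" .,;:!?'\"").lower()


def drop_stopword_labels(labels, stopwords=ITALIAN_STOPWORDS):
    """Keep only the labels whose normalized form is not a stopword."""
    return [label for label in labels if normalize(label) not in stopwords]
```

Applied to labels such as "Alla", "Che?" and "Privacy", only "Privacy" would
be kept. This hides the stopword entities from the output but leaves them in
the index, so it is a workaround and not a fix for the indexing itself.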

We have also tried changing the stopwords filter in the SolrYard base
index archives
(/stanbol/entityhub/yard/solr/src/main/resources/solr/core/default/default.solrindex.zip
and simple.solrindex.zip) and rebuilding the content hub (and the DBpedia
indexer too, with mvn assembly:single in contenthub/indexer/dbpedia) with
the right stopwords.

We've checked the generated JAR and the Italian stopwords are there, as a
file inside the Solr config folder, but the results were the same as before
(there are still stopwords in the enhancement results).
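
A check like this one can be scripted. The following Python sketch lists
the stopword files found in a ZIP or JAR archive and reads their entries,
so the packaged list can be compared with the intended one. The function
names are hypothetical and the archive layout is an assumption; the sketch
looks at one archive level only, so a .solrindex.zip nested inside a JAR
would have to be extracted first.

```python
import zipfile


def find_stopword_files(archive_path):
    """Return archive entries whose file name contains 'stopwords'."""
    with zipfile.ZipFile(archive_path) as archive:
        return [
            name
            for name in archive.namelist()
            if "stopwords" in name.rsplit("/", 1)[-1].lower()
        ]


def read_stopwords(archive_path, entry_name):
    """Read one stopword file, skipping blank lines and '#' comments."""
    with zipfile.ZipFile(archive_path) as archive:
        text = archive.read(entry_name).decode("utf-8")
    words = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            words.append(line)
    return words
```

If the expected Italian words are present in the archive but stopwords
still show up in the results, the file is probably not the one referenced
by the field type used for the indexed labels, or the index was built
before the change.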

Do you have any suggestions on how to perform these tasks?

Thanks in advance.

-Stefano

PS: here is an enrichment example from the RDF index we built from DBpedia
with simplerdfindexer and dblp:

text:

*Infermiera con tbc, troppi dettagli sui media. Il Garante apre
un'istruttoria

Il Garante Privacy ha aperto un'istruttoria in seguito alla pubblicazione
di notizie da parte di agenzie di stampa e quotidiani - anche on line -
che, nel riferire di un caso di una infermiera in servizio presso il
reparto di neonatologia del Policlinico Gemelli, risultata positiva ai test
sulla tubercolosi, hanno riportato il nome della donna, l'iniziale del
cognome e l'età.

Il diritto-dovere dei giornalisti di informare sugli sviluppi della
vicenda, di sicura rilevanza per l'opinione pubblica, considerato l'elevato
numero di neonati e di famiglie coinvolte, deve essere comunque bilanciato,
secondo i principi stabiliti dal Codice deontologico con il rispetto delle
persone.

Il Garante ricorda che, anche quando questi dettagli fossero stati forniti
in una sede pubblica, i mezzi di informazione sono tenuti a valutare con
scrupolo l'interesse pubblico delle singole informazioni diffuse.

I media evitino dunque di riportare informazioni non essenziali che possano
ledere la riservatezza delle persone e nello stesso tempo possano indurre
ulteriori stati di allarme e di preoccupazione in coloro che si sono
avvalsi dei servizi sanitari dell'ospedale o sono altrimenti entrati in
contatto con la persona.

Roma, 24 agosto 2011*

Enrichments :

2011 2011

Agosto Agosto

*Alla Alla*

*Anché Anché*

*Che? Che?*

Cognome Cognome

*CON CON*

*Dal' Dal'*

Problema dei servizi Problema dei servizi

*Dell Dell*

Diritto Diritto

Donna Donna

Essere Essere

Il nome della rosa Il nome della rosa

Informazione Informazione

Interesse pubblico Interesse pubblico

Media Media

Mezzi di produzione Mezzi di produzione

*Nello Nello*

Neonatologia Neonatologia

*NON NON*

Numero di coordinazione (chimica) Numero di coordinazione (chimica)

Opinione pubblica Opinione pubblica

Ospedale Ospedale

*PER PER*

Persona Persona

Privacy Privacy

Pubblicazione di matrimonio Pubblicazione di matrimonio

Secondo Secondo

Servizio Servizio

Stampa Stampa

Stati di immaginazione Stati di immaginazione

*SUI SUI*

TBC TBC

Tempo Tempo

.test .test

Tubercolosi Tubercolosi

*UNA UNA*


The ones in bold are stopwords. The other results are good, but the
stopwords were not eliminated during dataset indexing. Maybe there is a way
to remove them from the datasets, but I don't know how.
