Dear Stefano,
I am also new to the list, and we are also working within the early
adoption program. If I understand correctly, the problem is the lack of
an appropriate Named Entity extraction engine for Italian; without one,
I am afraid the results will always be disappointing. In our project we
will integrate NER enhancement services for Italian and French (and
possibly keyword extraction), so hopefully you will be able to benefit
from the power of Stanbol. Timing might be a problem, though: it is not
clear whether, within the short project window, it will be possible to
feed our integration into yours. Is the unavailability of Italian NER a
blocking factor for you, or can you continue development while waiting
for the integration?
Cheers,
Luca
On 01/03/2012 14:49, Stefano Norcia wrote:
Hi all,
My name is Stefano Norcia and I'm working on the early adoption project
for Etcware.
For this project (the Etcware Early Adoption project) we need to use a
DBPedia index in Italian in the enhancement and enrichment process
enabled by the Stanbol engines.
The main problem is that the NLP module does not support Italian
directly, so if you submit an Italian text to the enhancement engine,
the dbpedia engine does not detect any concepts, places, or people.
We have run some experiments to work around this.
The first attempt was to rebuild the dbpedia index following the
instructions found in the stanbol/entityhub/indexing/dbpedia folder.
That folder contains a shell script (fetch_prepare.sh) that describes
how to prepare the dbpedia datasets before creating the index. We
followed those instructions and tried to create a new index and "site"
from the Italian dbpedia datasets, to replace the standard English
dbpedia index. We are aware that the Italian datasets are not complete
and that some packages are missing (like persondata_en.nt.bz2 and so
on).
These are the packages we used to create the index
(http://downloads.dbpedia.org/3.7/it/):
o dbpedia_3.7.owl.bz2
o geo_coordinates_it.nt.bz2
o instance_types_it.nt.bz2
o labels_it.nt.bz2
o long_abstracts_it.nt.bz2
o short_abstracts_it.nt.bz2
We were also able to create the incoming_links text file from the
page_links_it.nt.bz2 package.
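For reference, the incoming-link counting step can be sketched in Python
as follows. This is an illustration, not the logic of fetch_prepare.sh:
the function names are hypothetical, the regular expression assumes
resource URIs under dbpedia.org or a language subdomain such as
it.dbpedia.org, and the "count name" output format is an assumption that
should be checked against what the indexer actually expects.

```python
import bz2
import re
from collections import Counter

# Object position of an N-Triples line, when it is a DBpedia resource.
# Assumption: URIs live under dbpedia.org or a language subdomain.
OBJECT_RE = re.compile(
    r"<http://(?:[a-z]+\.)?dbpedia\.org/resource/([^>]+)>\s*\.\s*$"
)


def count_incoming_links(lines):
    """Count how many triples point at each resource (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):  # skip comment lines
            continue
        match = OBJECT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def format_counts(counts):
    """Render one "count name" line per resource, sorted by name (assumed format)."""
    return [f"{n} {name}" for name, n in sorted(counts.items())]


def count_from_bz2(path):
    """Stream a compressed dump such as page_links_it.nt.bz2 without unpacking it."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return count_incoming_links(handle)
```

Calling count_from_bz2("page_links_it.nt.bz2") and writing the lines
returned by format_counts to a text file would give one entry per linked
resource.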
After rebuilding the index we replaced the English DBPedia index in
Stanbol with our custom one (simply swapping the old one for the new
one and restarting Stanbol).
Sadly, the results produced by the enhancement engines were exactly the
same as before: no Italian concepts were detected, and none of the
other enhancement engines returned any enhancements.
As a second attempt, we used the generic RDF indexer (combined with the
standard Keyword Linking Engine) to process the Italian DBPedia
datasets. In this case the indexing process succeeded and we got a lot
of results when testing the enhancement engines with Italian content.
This time the problem is that there are simply too many results, and
they also include stopwords.
As an example, the attachment contains a sample text submitted for
enhancement and the results returned by the Keyword Linking Engine.
The terms shown in bold are clearly stopwords. I don’t know if the problem
is in dataset indexing,
or if there is a way to eliminate them after the creation of the index.
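One workaround after the index has been created, independent of Stanbol
itself, would be to filter the returned labels on the client side. The
Python sketch below only illustrates the idea: the stopword set is a
small sample taken from the bold entries in the example further down,
not a complete Italian list, and the helper names are hypothetical.

```python
import unicodedata

# Illustrative subset only; a real Italian stopword list is much longer.
ITALIAN_STOPWORDS = {
    "alla", "anche", "che", "con", "dal", "dell",
    "nello", "non", "per", "sui", "una",
}


def normalize(label):
    """Lowercase a label and strip accents and punctuation before comparing."""
    decomposed = unicodedata.normalize("NFKD", label)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    kept = (ch for ch in without_accents.lower() if ch.isalnum() or ch.isspace())
    return "".join(kept).strip()


def drop_stopword_labels(labels, stopwords=ITALIAN_STOPWORDS):
    """Keep only the labels whose normalized form is not a stopword."""
    return [label for label in labels if normalize(label) not in stopwords]
```

Because whole labels are compared, multi-word labels such as
"Il nome della rosa" are kept even though they contain stopwords.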
We also tried changing the stopwords filter in the SolrYard base index
zip
(/stanbol/entityhub/yard/solr/src/main/resources/solr/core/default/default.solrindex.zip
and simple.solrindex.zip) and then rebuilding the content hub (and the
dbpedia indexer, with mvn assembly:single in contenthub/indexer/dbpedia)
with the right stopwords.
We've checked the generated JAR and the Italian stopwords are there, as
a file inside the solr config folder, but the results were the same as
before (stopwords still appear in the enhancement results).
Do you have any suggestions on how to perform these tasks?
Thanks in advance.
-Stefano
PS: here is an enrichment example from the RDF index we built from
dbpedia with simplerdfindexer and dblp:
text:
*Infermiera con tbc, troppi dettagli sui media. Il Garante apre
un'istruttoria
Il Garante Privacy ha aperto un'istruttoria in seguito alla pubblicazione
di notizie da parte di agenzie di stampa e quotidiani - anche on line -
che, nel riferire di un caso di una infermiera in servizio presso il
reparto di neonatologia del Policlinico Gemelli, risultata positiva ai test
sulla tubercolosi, hanno riportato il nome della donna, l'iniziale del
cognome e l'età.
Il diritto-dovere dei giornalisti di informare sugli sviluppi della
vicenda, di sicura rilevanza per l'opinione pubblica, considerato l'elevato
numero di neonati e di famiglie coinvolte, deve essere comunque bilanciato,
secondo i principi stabiliti dal Codice deontologico con il rispetto delle
persone.
Il Garante ricorda che, anche quando questi dettagli fossero stati forniti
in una sede pubblica, i mezzi di informazione sono tenuti a valutare con
scrupolo l'interesse pubblico delle singole informazioni diffuse.
I media evitino dunque di riportare informazioni non essenziali che possano
ledere la riservatezza delle persone e nello stesso tempo possano indurre
ulteriori stati di allarme e di preoccupazione in coloro che si sono
avvalsi dei servizi sanitari dell'ospedale o sono altrimenti entrati in
contatto con la persona.
Roma, 24 agosto 2011*
Enrichments :
2011 2011
Agosto Agosto
*Alla Alla*
*Anché Anché*
*Che? Che?*
Cognome Cognome
*CON CON*
*Dal' Dal'*
Problema dei servizi Problema dei servizi
*Dell Dell*
Diritto Diritto
Donna Donna
Essere Essere
Il nome della rosa Il nome della rosa
Informazione Informazione
Interesse pubblico Interesse pubblico
Media Media
Mezzi di produzione Mezzi di produzione
*Nello Nello*
Neonatologia Neonatologia
*NON NON*
Numero di coordinazione (chimica) Numero di coordinazione (chimica)
Opinione pubblica Opinione pubblica
Ospedale Ospedale
*PER PER*
Persona Persona
Privacy Privacy
Pubblicazione di matrimonio Pubblicazione di matrimonio
Secondo Secondo
Servizio Servizio
Stampa Stampa
Stati di immaginazione Stati di immaginazione
*SUI SUI*
TBC TBC
Tempo Tempo
.test .test
Tubercolosi Tubercolosi
*UNA UNA*
The ones in bold are stopwords; the other results are good. The
stopwords were not eliminated during dataset indexing. Maybe there is a
way to remove them from the datasets, but I don't know how.
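One possible way to remove them from the datasets would be to drop the
offending label triples from labels_it.nt before indexing. The Python
sketch below is only an assumption about how that could look: the
stopword set is an illustrative subset, the helper names are
hypothetical, and matching is exact after lowercasing, so labels with
extra punctuation would slip through.

```python
import re

# Illustrative subset only; a real Italian stopword list is much longer.
STOPWORDS = {"alla", "che", "con", "non", "per", "perché", "sui", "una"}

# Literal at the end of a triple such as: <s> <p> "Alla"@it .
LITERAL_RE = re.compile(r'"((?:[^"\\]|\\.)*)"(?:@[A-Za-z0-9-]+)?\s*\.\s*$')
# N-Triples dumps may escape non-ASCII characters as \uXXXX.
UNICODE_ESCAPE_RE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def label_of(line):
    """Return the decoded literal of a label triple, or None if there is none."""
    match = LITERAL_RE.search(line)
    if match is None:
        return None
    return UNICODE_ESCAPE_RE.sub(
        lambda m: chr(int(m.group(1), 16)), match.group(1)
    )


def is_stopword_label(line):
    """True when the triple's label is, as a whole, a stopword."""
    label = label_of(line)
    return label is not None and label.strip().lower() in STOPWORDS


def filter_label_triples(lines):
    """Keep every line except label triples whose label is a stopword."""
    return [line for line in lines if not is_stopword_label(line)]
```

The filtered lines would then be written back out and used in place of
the original labels file when building the index.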