Hi Stefano, Luca,

See my comments inline.
On 01.03.2012, at 15:59, Luca Dini wrote:

> Dear Stefano,
> I am new on the list as well, and we are also working in the context of
> the early adoption program. If I understand correctly, the problem is
> that without an appropriate Named Entity extraction engine for Italian,
> I am afraid the results will always be disappointing. In the context of
> our project we will integrate NER enhancement services for Italian and
> French (and possibly keyword extraction), so hopefully you will be able
> to profit from the power of Stanbol. There might be some problems in
> terms of timing, as it is not clear whether, in the short project window,
> there will be a possibility of feeding our integration into yours. Is
> the unavailability of Italian NER a blocking factor for you, or can you
> go on with development while waiting for the integration?

That's true. For datasets such as DBpedia the combination of "NER +
NamedEntityTaggingEngine" is the way to go. That is simply because DBpedia
defines entities for nearly all natural-language words, so "keyword
extraction" (as used by the KeywordLinkingEngine) does not really work.

Note, however, that the KeywordLinkingEngine has support for POS (Part of
Speech) taggers. If a POS tagger is available for a given language, the
engine uses this information to look up only nouns (see [1] for more
detailed information on the algorithm used). The bad news is that there is
no POS tagger available for Italian :(

[1] http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html

The remaining possibility to improve the results of the
KeywordLinkingEngine with DBpedia is to filter out all entities with types
other than Persons, Organizations and Places. However, this also has a big
disadvantage: it excludes all redirects, and such entities are very
important because they allow linking entities that are mentioned by
alternate names.
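To illustrate, such a type filter could be configured along these lines (a
rough sketch only - the property keys and namespace prefixes are
assumptions; check the entityTypes.properties file shipped with the DBpedia
indexer for the exact syntax):

```properties
# Sketch of a FieldValueFilter configuration that keeps only Persons,
# Organisations and Places. Property keys and prefixes are assumptions;
# see the file shipped with the DBpedia indexer for the exact syntax.
field=rdf:type
values=dbp-ont:Person;dbp-ont:Organisation;dbp-ont:Place
```

Keep in mind the caveat above: restricting to these types also drops
redirects, which are what allow linking entities by their alternate names.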
If you would like to try this, have a look at
org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter.
This filter is included in the default configuration of the DBpedia
indexer and can be activated by changing the configuration in
{indexing-dir}/indexing/config/entityTypes.properties.

@ Luca

> we will integrate enhancement services of NER for Italian and French

That would be really great. Is the framework you integrate open source?
Can you provide a link?

> Cheers,
> Luca
>
> On 01/03/2012 14:49, Stefano Norcia wrote:
>> Hi all,
>>
>> My name is Stefano Norcia and I'm working on the early adoption project
>> for Etcware.
>>
>> For our early adoption project (Etcware Early Adoption project) we need
>> to use an Italian-language DBpedia index in the enhancement and
>> enrichment process enabled by the Stanbol engines.
>>
>> The main problem is that the NLP module does not support the Italian
>> language directly, so if you put an Italian text into the enhancement
>> engine, the dbpedia engine does not detect any concepts/places/people.

The NER engine uses the language detected by the LangID engine and
deactivates itself if no NER model is available for that language. In that
case the NamedEntityTaggingEngine will also link no entities, because no
named entities were detected within the text. However, this does not mean
that no Italian labels are present in the DBpedia index. In fact, Italian
labels ARE present in all DBpedia indexes. There is no need to build your
own index unless you have some special requirement.

You can try this even on the test server. First, send some Italian text to

    http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-ner

This chain uses "NER + NamedEntityTaggingEngine", so you will not get any
results - as expected. Then try the same text with

    http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-keyword

which will return linked entities.
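For convenience, the two requests can be sketched with curl (endpoint URLs
from above; the `Accept` header and the sample sentence are my additions,
and the test server may of course change over time):

```shell
TEXT="Il Garante Privacy ha aperto un'istruttoria."

# NER-based chain: expected to return NO entity annotations for Italian,
# because the NER engine deactivates itself for unsupported languages.
curl -s -X POST -H "Content-Type: text/plain" \
     -H "Accept: application/rdf+xml" \
     --data "$TEXT" \
     "http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-ner"

# Keyword-based chain: returns linked entities (including, as discussed,
# quite a few false positives without stop-word handling).
curl -s -X POST -H "Content-Type: text/plain" \
     -H "Accept: application/rdf+xml" \
     --data "$TEXT" \
     "http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-keyword"
```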
But as I mentioned above, and as you have already experienced yourself, it
also produces a lot of false positives.

>> We have done some experiments to reach this goal:
>>
>> Our first attempt was to rebuild the dbpedia index following the
>> instructions found in the stanbol/entityhub/indexing/dbpedia folder. In
>> this folder there is a shell script (fetch_prepare.sh) that describes
>> how to prepare the dbpedia datasets before creating the index. We
>> followed those instructions and tried to create a new index and "site"
>> from the Italian dbpedia datasets, to replace the standard English
>> dbpedia index. We are aware that the Italian datasets are not complete
>> and that some packages are missing (like persondata_en.nt.bz2 and so
>> on). These are the packages we used to create the index
>> (http://downloads.dbpedia.org/3.7/it/):
>>
>> o dbpedia_3.7.owl.bz2
>> o geo_coordinates_it.nt.bz2
>> o instance_types_it.nt.bz2
>> o labels_it.nt.bz2
>> o long_abstracts_it.nt.bz2
>> o short_abstracts_it.nt.bz2

You should always also include the English versions, as they contain a lot
of information that is very useful for other languages too.

>> We were also able to create the incoming_links text file from the
>> package page_links_it.nt.bz2. After rebuilding the index we replaced
>> the English DBpedia index in Stanbol with our custom one (simply
>> replacing the old one with the new one and restarting Stanbol).
>>
>> Sadly, after that, the results produced by the enhancement engines are
>> exactly the same as before: no Italian concepts are detected, and no
>> enhancements are returned by any of the other enhancement engines.

I assume that this index was completely fine. The reason you were not
getting any results is that the NER engine deactivates itself for Italian
texts. Note also that the
* NamedEntityTaggingEngine and the
* KeywordLinkingEngine
use the exact same DBpedia index, so you can/should use the same index for
both.
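fetch_prepare.sh is the authoritative script for preparing the dumps; as a
sketch of the URL scheme, fetching the Italian datasets together with their
English counterparts could look like this (the actual download is commented
out; dump names are taken from the list above):

```shell
# Sketch: build the download URLs for the DBpedia 3.7 dumps listed above,
# for Italian AND their English counterparts (the English files also
# carry information that helps other languages).
BASE=http://downloads.dbpedia.org/3.7
FILES="labels instance_types long_abstracts short_abstracts geo_coordinates"
for lang in it en; do
  for f in $FILES; do
    url="$BASE/$lang/${f}_${lang}.nt.bz2"
    echo "$url"
    # wget "$url"    # uncomment to actually download the dump
  done
done
```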
This is also the case on "http://dev.iks-project.eu:8081".

Note also that the DBpedia indexer and the generic RDF indexer create the
same type of index. The DBpedia indexer just ships with a configuration
that is optimized for DBpedia.

>> As a second attempt, we decided to use the generic RDF indexer
>> (combined with the standard Keyword Linking Engine) to process the
>> Italian DBpedia datasets; in this case the indexing process succeeded
>> and we were able to get a lot of results when testing the enhancement
>> engines with Italian content. This time the problem is that there are
>> simply too many results, and they also contain stopwords.
>>
>> As an example, you can find a sample text submitted for enhancement and
>> the results returned by the Keyword Linking Engine in the attachment.
>>
>> The terms shown in bold are clearly stopwords. I don't know if the
>> problem is in the dataset indexing, or if there is a way to eliminate
>> them after the creation of the index.

Using stop words would in fact improve the results of the
KeywordLinkingEngine. The current default Solr configuration includes
optimized Solr field configurations for English and German. If you can
provide such a configuration for Italian, it would be great if you
contributed it to Stanbol! I would be happy to work on that!

>> We have also made an attempt to change the stopwords filter in the
>> SolrYard base index zip
>> (/stanbol/entityhub/yard/solr/src/main/resources/solr/core/default/default.solrindex.zip
>> and simple.solrindex.zip) and to rebuild the content hub (and the
>> dbpedia indexer too, with mvn assembly:single in
>> contenthub/indexer/dbpedia) with the right stopwords.

This would be the place where a Stanbol committer would change the
configuration.
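For reference, an Italian text field in the Solr schema might look roughly
like this (a sketch only: the field/type names and the stopword file name
are assumptions; the English and German configurations shipped with Stanbol
are the authoritative template):

```xml
<!-- Sketch: Italian analyzer analogous to the shipped en/de ones.
     Adapt names to the schema.xml you are actually editing. -->
<fieldType name="text_it" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stopwords_it.txt must be placed next to the schema (assumption) -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_it.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
  </analyzer>
</fieldType>
```

The SnowballPorterFilterFactory line is optional; stemming may or may not
be desirable for label matching, so test with and without it.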
If you use the DBpedia indexer you can simply change the Solr configuration
in {indexing-root}/indexing/config/dbpedia/conf/schema.xml.

If you use the generic RDF indexer you should extract the
"default.solrindex.zip" to {indexing-root}/indexing/config/ and then rename
the directory to the name of your site (this is the value of the "name"
property in the "/indexing/config/indexing.properties" file).

>> We've checked the generated JAR and the Italian stopwords are there, as
>> a file inside the solr config folder, but the results were always the
>> same as before (still stopwords in the enhancement results).

If you use the RDF indexer, the Solr configuration is taken
* from the directory "{indexing-root}/indexing/config/{name}" or, if that
  is not present,
* from the classpath used by the indexer.

So the reason it did not work for you is that you did not create a new RDF
indexer version after changing the "default.solrindex.zip" and rebuilding
the Entityhub. For that you would also have needed to re-create the indexer
by running "mvn assembly:single". But as I mentioned above, there is a
simpler solution for adding Italian stop words: just edit the Solr
configuration contained in {indexing-root}/indexing/config/dbpedia/conf/ of
the DBpedia indexer.

Hopefully that answers all your questions. If you have additional
questions, feel free to ask.

best
Rupert Westenthaler

>> Do you have any suggestions on how to perform these tasks?
>>
>> Thanks in advance.
>>
>> -Stefano
>>
>> PS: the following is an enrichment example from the RDF index we built
>> from dbpedia with the simplerdfindexer and dblp:
>>
>> text:
>>
>> *Infermiera con tbc, troppi dettagli sui media.
>> Il Garante apre un'istruttoria
>>
>> Il Garante Privacy ha aperto un'istruttoria in seguito alla
>> pubblicazione di notizie da parte di agenzie di stampa e quotidiani -
>> anche on line - che, nel riferire di un caso di una infermiera in
>> servizio presso il reparto di neonatologia del Policlinico Gemelli,
>> risultata positiva ai test sulla tubercolosi, hanno riportato il nome
>> della donna, l'iniziale del cognome e l'età.
>>
>> Il diritto-dovere dei giornalisti di informare sugli sviluppi della
>> vicenda, di sicura rilevanza per l'opinione pubblica, considerato
>> l'elevato numero di neonati e di famiglie coinvolte, deve essere
>> comunque bilanciato, secondo i principi stabiliti dal Codice
>> deontologico con il rispetto delle persone.
>>
>> Il Garante ricorda che, anche quando questi dettagli fossero stati
>> forniti in una sede pubblica, i mezzi di informazione sono tenuti a
>> valutare con scrupolo l'interesse pubblico delle singole informazioni
>> diffuse.
>>
>> I media evitino dunque di riportare informazioni non essenziali che
>> possano ledere la riservatezza delle persone e nello stesso tempo
>> possano indurre ulteriori stati di allarme e di preoccupazione in
>> coloro che si sono avvalsi dei servizi sanitari dell'ospedale o sono
>> altrimenti entrati in contatto con la persona.
>>
>> Roma, 24 agosto 2011*
>>
>> Enrichments:
>>
>> 2011 2011
>> Agosto Agosto
>> *Alla Alla*
>> *Anché Anché*
>> *Che? Che?*
>> Cognome Cognome
>> *CON CON*
>> *Dal' Dal'*
>> Problema dei servizi Problema dei servizi
>> *Dell Dell*
>> Diritto Diritto
>> Donna Donna
>> Essere Essere
>> Il nome della rosa Il nome della rosa
>> Informazione Informazione
>> Interesse pubblico Interesse pubblico
>> Media Media
>> Mezzi di produzione Mezzi di produzione
>> *Nello Nello*
>> Neonatologia Neonatologia
>> *NON NON*
>> Numero di coordinazione (chimica) Numero di coordinazione (chimica)
>> Opinione pubblica Opinione pubblica
>> Ospedale Ospedale
>> *PER PER*
>> Persona Persona
>> Privacy Privacy
>> Pubblicazione di matrimonio Pubblicazione di matrimonio
>> Secondo Secondo
>> Servizio Servizio
>> Stampa Stampa
>> Stati di immaginazione Stati di immaginazione
>> *SUI SUI*
>> TBC TBC
>> Tempo Tempo
>> .test .test
>> Tubercolosi Tubercolosi
>> *UNA UNA*
>>
>> The ones in bold are stopwords; the other results are good ones, but
>> the stopwords were not eliminated during dataset indexing - or maybe
>> there is a way to eliminate them from the datasets, but I don't know
>> how.
>
