Hi Stefano, Luca,

See my comments inline.
On 01.03.2012, at 15:59, Luca Dini wrote:

> Dear Stefano,
> I am new on the list as well, and we are also working in the context of
> the early adoption program. If I understand correctly, the problem is
> that without an appropriate Named Entity extraction engine for Italian,
> I am afraid the results will always be disappointing. In the context of
> our project we will integrate NER enhancement services for Italian and
> French (and possibly keyword extraction), so hopefully you will be able
> to profit from the power of Stanbol. There might be some problems in
> terms of timing, as it is not clear whether, in the short project window,
> there will be a possibility of feeding our integration into yours. Is
> the unavailability of Italian NER a blocking factor for you, or can you
> go on with development while waiting for the integration?

That's true. For datasets such as DBpedia the combination of "NER +
NamedEntityTaggingEngine" is the way to go. That is simply because DBpedia
defines entities for nearly all natural-language words, so "keyword
extraction" (as used by the KeywordLinkingEngine) does not really work.

Note, however, that the KeywordLinkingEngine has support for POS (Part of
Speech) taggers. If a POS tagger is available for a given language, the
engine uses this information to look up only nouns (see [1] for more
detailed information on the algorithm used). The bad news is that there is
no POS tagger available for Italian :(

[1] http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html

The remaining possibility to improve the results of the
KeywordLinkingEngine with DBpedia is to filter out all entities with types
other than Persons, Organizations and Places. However, this also has a big
disadvantage: it excludes all redirects, and such entities are very
important because they allow linking entities that are mentioned by
alternate names.
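To illustrate, such a type filter could be configured along these lines (a
rough sketch only - the property keys and namespace prefixes are
assumptions; check the entityTypes.properties file shipped with the DBpedia
indexer for the exact syntax):

```properties
# Sketch of a FieldValueFilter configuration that keeps only Persons,
# Organisations and Places. Property keys and prefixes are assumptions;
# see the file shipped with the DBpedia indexer for the exact syntax.
field=rdf:type
values=dbp-ont:Person;dbp-ont:Organisation;dbp-ont:Place
```

Keep in mind the caveat above: restricting to these types also drops
redirects, which are what allow linking entities by their alternate names.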
If you would like to try this, have a look at
org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter.
This filter is included in the default configuration of the DBpedia
indexer and can be activated by changing the configuration in
{indexing-dir}/indexing/config/entityTypes.properties.

@ Luca

> we will integrate enhancement services of NER for Italian and French

That would be really great. Is the framework you integrate open source?
Can you provide a link?

> Cheers,
> Luca
>
> On 01/03/2012 14:49, Stefano Norcia wrote:
>> Hi all,
>>
>> My name is Stefano Norcia and I'm working on the early adoption project
>> for Etcware.
>>
>> For our early adoption project (Etcware Early Adoption project) we need
>> to use an Italian-language DBpedia index in the enhancement and
>> enrichment process enabled by the Stanbol engines.
>>
>> The main problem is that the NLP module does not support the Italian
>> language directly, so if you put an Italian text into the enhancement
>> engine, the dbpedia engine does not detect any concepts/places/people.

The NER engine uses the language detected by the LangID engine and
deactivates itself if no NER model is available for that language. In that
case the NamedEntityTaggingEngine will also link no entities, because no
named entities were detected within the text. However, this does not mean
that no Italian labels are present in the DBpedia index. In fact, Italian
labels ARE present in all DBpedia indexes. There is no need to build your
own index unless you have some special requirement.

You can try this even on the test server. First, send some Italian text to

    http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-ner

This chain uses "NER + NamedEntityTaggingEngine", so you will not get any
results - as expected. Then try the same text with

    http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-keyword

which will return linked entities.
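For convenience, the two requests can be sketched with curl (endpoint URLs
from above; the `Accept` header and the sample sentence are my additions,
and the test server may of course change over time):

```shell
TEXT="Il Garante Privacy ha aperto un'istruttoria."

# NER-based chain: expected to return NO entity annotations for Italian,
# because the NER engine deactivates itself for unsupported languages.
curl -s -X POST -H "Content-Type: text/plain" \
     -H "Accept: application/rdf+xml" \
     --data "$TEXT" \
     "http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-ner"

# Keyword-based chain: returns linked entities (including, as discussed,
# quite a few false positives without stop-word handling).
curl -s -X POST -H "Content-Type: text/plain" \
     -H "Accept: application/rdf+xml" \
     --data "$TEXT" \
     "http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-keyword"
```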
But as I mentioned above, and as you have already experienced yourself, it
also produces a lot of false positives.

>> We have done some experiments to reach this goal:
>>
>> Our first attempt was to rebuild the dbpedia index following the
>> instructions found in the stanbol/entityhub/indexing/dbpedia folder. In
>> this folder there is a shell script (fetch_prepare.sh) that describes
>> how to prepare the dbpedia datasets before creating the index. We
>> followed those instructions and tried to create a new index and "site"
>> from the Italian dbpedia datasets, to replace the standard English
>> dbpedia index. We are aware that the Italian datasets are not complete
>> and that some packages are missing (like persondata_en.nt.bz2 and so
>> on). These are the packages we used to create the index
>> (http://downloads.dbpedia.org/3.7/it/):
>>
>> o dbpedia_3.7.owl.bz2
>> o geo_coordinates_it.nt.bz2
>> o instance_types_it.nt.bz2
>> o labels_it.nt.bz2
>> o long_abstracts_it.nt.bz2
>> o short_abstracts_it.nt.bz2

You should always also include the English versions, as they contain a lot
of information that is very useful for other languages too.

>> We were also able to create the incoming_links text file from the
>> package page_links_it.nt.bz2. After rebuilding the index we replaced
>> the English DBpedia index in Stanbol with our custom one (simply
>> replacing the old one with the new one and restarting Stanbol).
>>
>> Sadly, after that, the results produced by the enhancement engines are
>> exactly the same as before: no Italian concepts are detected, and no
>> enhancements are returned by any of the other enhancement engines.

I assume that this index was completely fine. The reason you were not
getting any results is that the NER engine deactivates itself for Italian
texts. Note also that the
* NamedEntityTaggingEngine and the
* KeywordLinkingEngine
use the exact same DBpedia index, so you can/should use the same index for
both.
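fetch_prepare.sh is the authoritative script for preparing the dumps; as a
sketch of the URL scheme, fetching the Italian datasets together with their
English counterparts could look like this (the actual download is commented
out; dump names are taken from the list above):

```shell
# Sketch: build the download URLs for the DBpedia 3.7 dumps listed above,
# for Italian AND their English counterparts (the English files also
# carry information that helps other languages).
BASE=http://downloads.dbpedia.org/3.7
FILES="labels instance_types long_abstracts short_abstracts geo_coordinates"
for lang in it en; do
  for f in $FILES; do
    url="$BASE/$lang/${f}_${lang}.nt.bz2"
    echo "$url"
    # wget "$url"    # uncomment to actually download the dump
  done
done
```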
This is also the case on "http://dev.iks-project.eu:8081".

Note also that the DBpedia indexer and the generic RDF indexer create the
same type of index. The DBpedia indexer just ships with a configuration
that is optimized for DBpedia.

>> As a second attempt, we decided to use the generic RDF indexer
>> (combined with the standard Keyword Linking Engine) to process the
>> Italian DBpedia datasets; in this case the indexing process succeeded
>> and we were able to get a lot of results when testing the enhancement
>> engines with Italian content. This time the problem is that there are
>> simply too many results, and they also contain stopwords.
>>
>> As an example, you can find a sample text submitted for enhancement and
>> the results returned by the Keyword Linking Engine in the attachment.
>>
>> The terms shown in bold are clearly stopwords. I don't know if the
>> problem is in the dataset indexing, or if there is a way to eliminate
>> them after the creation of the index.

Using stop words would in fact improve the results of the
KeywordLinkingEngine. The current default Solr configuration includes
optimized Solr field configurations for English and German. If you can
provide such a configuration for Italian, it would be great if you
contributed it to Stanbol! I would be happy to work on that!

>> We have also made an attempt to change the stopwords filter in the
>> SolrYard base index zip
>> (/stanbol/entityhub/yard/solr/src/main/resources/solr/core/default/default.solrindex.zip
>> and simple.solrindex.zip) and to rebuild the content hub (and the
>> dbpedia indexer too, with mvn assembly:single in
>> contenthub/indexer/dbpedia) with the right stopwords.

This would be the place where a Stanbol committer would change the
configuration.
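For reference, an Italian text field in the Solr schema might look roughly
like this (a sketch only: the field/type names and the stopword file name
are assumptions; the English and German configurations shipped with Stanbol
are the authoritative template):

```xml
<!-- Sketch: Italian analyzer analogous to the shipped en/de ones.
     Adapt names to the schema.xml you are actually editing. -->
<fieldType name="text_it" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stopwords_it.txt must be placed next to the schema (assumption) -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_it.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
  </analyzer>
</fieldType>
```

The SnowballPorterFilterFactory line is optional; stemming may or may not
be desirable for label matching, so test with and without it.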
If you use the DBpedia indexer you can simply change the Solr configuration
in {indexing-root}/indexing/config/dbpedia/conf/schema.xml.

If you use the generic RDF indexer you should extract the
"default.solrindex.zip" to {indexing-root}/indexing/config/ and then rename
the directory to the name of your site (this is the value of the "name"
property in the "/indexing/config/indexing.properties" file).

>> We've checked the generated JAR and the Italian stopwords are there, as
>> a file inside the solr config folder, but the results were always the
>> same as before (still stopwords in the enhancement results).

If you use the RDF indexer, the Solr configuration is taken
* from the directory "{indexing-root}/indexing/config/{name}" or, if that
  is not present,
* from the classpath used by the indexer.

So the reason it did not work for you is that you did not create a new RDF
indexer version after changing the "default.solrindex.zip" and rebuilding
the Entityhub. For that you would also have needed to re-create the indexer
by running "mvn assembly:single". But as I mentioned above, there is a
simpler solution for adding Italian stop words: just edit the Solr
configuration contained in {indexing-root}/indexing/config/dbpedia/conf/ of
the DBpedia indexer.

Hopefully that answers all your questions. If you have additional
questions, feel free to ask.

best
Rupert Westenthaler

>> Do you have any suggestions on how to perform these tasks?
>>
>> Thanks in advance.
>>
>> -Stefano
>>
>> PS: the following is an enrichment example from the RDF index we built
>> from dbpedia with the simplerdfindexer and dblp:
>>
>> text:
>>
>> *Infermiera con tbc, troppi dettagli sui media.
>> Il Garante apre un'istruttoria
>>
>> Il Garante Privacy ha aperto un'istruttoria in seguito alla
>> pubblicazione di notizie da parte di agenzie di stampa e quotidiani -
>> anche on line - che, nel riferire di un caso di una infermiera in
>> servizio presso il reparto di neonatologia del Policlinico Gemelli,
>> risultata positiva ai test sulla tubercolosi, hanno riportato il nome
>> della donna, l'iniziale del cognome e l'età.
>>
>> Il diritto-dovere dei giornalisti di informare sugli sviluppi della
>> vicenda, di sicura rilevanza per l'opinione pubblica, considerato
>> l'elevato numero di neonati e di famiglie coinvolte, deve essere
>> comunque bilanciato, secondo i principi stabiliti dal Codice
>> deontologico con il rispetto delle persone.
>>
>> Il Garante ricorda che, anche quando questi dettagli fossero stati
>> forniti in una sede pubblica, i mezzi di informazione sono tenuti a
>> valutare con scrupolo l'interesse pubblico delle singole informazioni
>> diffuse.
>>
>> I media evitino dunque di riportare informazioni non essenziali che
>> possano ledere la riservatezza delle persone e nello stesso tempo
>> possano indurre ulteriori stati di allarme e di preoccupazione in
>> coloro che si sono avvalsi dei servizi sanitari dell'ospedale o sono
>> altrimenti entrati in contatto con la persona.
>>
>> Roma, 24 agosto 2011*
>>
>> Enrichments:
>>
>> 2011 2011
>> Agosto Agosto
>> *Alla Alla*
>> *Anché Anché*
>> *Che? Che?*
>> Cognome Cognome
>> *CON CON*
>> *Dal' Dal'*
>> Problema dei servizi Problema dei servizi
>> *Dell Dell*
>> Diritto Diritto
>> Donna Donna
>> Essere Essere
>> Il nome della rosa Il nome della rosa
>> Informazione Informazione
>> Interesse pubblico Interesse pubblico
>> Media Media
>> Mezzi di produzione Mezzi di produzione
>> *Nello Nello*
>> Neonatologia Neonatologia
>> *NON NON*
>> Numero di coordinazione (chimica) Numero di coordinazione (chimica)
>> Opinione pubblica Opinione pubblica
>> Ospedale Ospedale
>> *PER PER*
>> Persona Persona
>> Privacy Privacy
>> Pubblicazione di matrimonio Pubblicazione di matrimonio
>> Secondo Secondo
>> Servizio Servizio
>> Stampa Stampa
>> Stati di immaginazione Stati di immaginazione
>> *SUI SUI*
>> TBC TBC
>> Tempo Tempo
>> .test .test
>> Tubercolosi Tubercolosi
>> *UNA UNA*
>>
>> The ones in bold are stopwords; the other results are good ones, but
>> the stopwords were not eliminated during dataset indexing - or maybe
>> there is a way to eliminate them from the datasets, but I don't know
>> how.
>
