Hi

On Tue, Nov 13, 2012 at 10:30 AM, Andrea Taurchini <ataurch...@gmail.com> wrote:
> Hi Rupert,
> I have the following under felix configurations :
>
> EntityHub Referenced Site Configuration
> ID ................................................. ITdbpedia
>
> EntityHub Cache configuration
> ID ................................................. ITdbpediaIndex
> Cache mappings ........................... empty
>
> Solr Yard Configuration
> ID .................................................. ITdbpediaIndex
> Solr Index/Core .............................. ITdbpedia
> Use default SolrCore configuration .... unflagged
>
> but honestly I can't find any solr/core. Where should I look ?
>
The SolrCores are just OSGI services, not Components. Because of that
you can only see them in the Services tab of the Felix Webconsole. You
will need to search for "org.apache.solr.core.SolrCore" and then inspect
the metadata. The value of the "org.apache.solr.core.SolrCore.name"
property needs to match the value configured for "Solr Index/Core" in
your SolrYard.

> ----
>
> I already produced my index, as you pointed out, modifying
> indexing/config/indexing.properties
> but unfortunately I didn't know I had to change the indexingDestination
> maybe this is the problem ?
>
> IndexingDestination=org.apache.stanbol.entityhub.indexing.destination.solryard.SolrYardIndexingDestination,solrConf,boosts:fieldboosts
>

If you have not changed this, then the default SolrCore schema was used
for indexing. I do not think that this will have a major impact on the
resulting index, as the dbpedia configuration differs from the default
only in some minor things.

> Moreover once I produced the correct zip file Itdbpedia.solr.zip (I changed

The name should be "Itdbpedia.solrindex.zip". In addition, note that
names are case sensitive. So if you use ITdbpedia, then you should also
name the file with upper case IT.

> the indexing properties, so I don't have to change the folder manually as
> you said) I have to save it to stanbol/datafiles and then restart stanbol,
> right ?
>

No, restarting the server will not work. Replace the file and then
stop/start the bundle via the Bundles tab of the Felix Webconsole. As
soon as you stop the bundle, the current index should be deleted (you can
check this by looking at the folder
"{stanbol-working-dir}/stanbol/indexes/default"). When you start the
bundle again, the index should be re-initialised based on the current
"Itdbpedia.solrindex.zip".

>
> Too complex steps ???
Nahhh ^_^ ok, then I have to replace it with something more complex [1]

[1] http://en.wikiquote.org/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy#Preface

>
> -------
>
> Is it possible to create a single index with both languages (it, en)
> version of dbpedia ?
> I think this is very difficult to manage isn't it?
>

The simple answer is YES: just add the RDF files with the Italian
labels and short/long abstracts, e.g.

    curl http://downloads.dbpedia.org/3.8/it/labels_en_uris_it.nt.bz2 | bzcat | head
    <http://dbpedia.org/resource/Harmonium> <http://www.w3.org/2000/01/rdf-schema#label> "Armonium"@it .
    <http://dbpedia.org/resource/Anthropology> <http://www.w3.org/2000/01/rdf-schema#label> "Antropologia"@it .
    <http://dbpedia.org/resource/Agriculture> <http://www.w3.org/2000/01/rdf-schema#label> "Agricoltura"@it .

The complex answer is also YES: while there are Italian labels,
comments ... available for the http://dbpedia.org/resource/ entities,
these only cover entities that have an English counterpart. Entities of
the Italian Wikipedia that do not have an English version are not
included. If you want to have all Italian entities you will need to use
the Italian dbpedia (http://it.dbpedia.org/resource/), e.g.

    curl http://downloads.dbpedia.org/3.8/it/labels_it.nt.bz2 | bzcat | head
    <http://it.dbpedia.org/resource/Armonium> <http://www.w3.org/2000/01/rdf-schema#label> "Armonium"@it .
    <http://it.dbpedia.org/resource/Antropologia> <http://www.w3.org/2000/01/rdf-schema#label> "Antropologia"@it .
    <http://it.dbpedia.org/resource/Agricoltura> <http://www.w3.org/2000/01/rdf-schema#label> "Agricoltura"@it .

By comparing the file sizes of labels_it.nt.bz2 (18M) and
labels_en_uris_it.nt.bz2 (7.9M) you can easily see that with the English
dbpedia you will not have all the Italian entities available.

To integrate two languages you need the "interlanguage_links_it.nt.bz2"
file. It defines links from the Italian entities to all other languages.
    <http://it.dbpedia.org/resource/Armonium> <http://dbpedia.org/ontology/wikiPageInterLanguageLink> <http://dbpedia.org/resource/Harmonium> .

For indexing you need to do the following:

1. Calculate the incoming_links.txt file for the Italian page links
   (http://downloads.dbpedia.org/3.8/it/page_links_it.nt.bz2)

2. Download all the RDF files you need
   * basically the same you currently use from
     http://downloads.dbpedia.org/3.8/en/ but now from
     http://downloads.dbpedia.org/3.8/it/
   * language specific labels from other languages you are interested
     in. IMPORTANT: use the
     http://downloads.dbpedia.org/3.8/{lang}/{type}_{lang}.nt.bz2
     files and NOT the
     http://downloads.dbpedia.org/3.8/{lang}/{type}_en_uris_{lang}.nt.bz2
   * include http://downloads.dbpedia.org/3.8/en/instance_types_en.nq.bz2

3. You will need to add the LdpathSourceProcessor to the list of
   entityProcessor in the indexing.properties file. The configuration
   should look like

   entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:entityTypes;org.apache.stanbol.entityhub.indexing.core.processor.LdpathSourceProcessor,ldpath:dbpedia.ldpath;org.apache.stanbol.entityhub.indexing.core.processor.FiledMapperProcessor

4. Create an LDPath [2] program that merges all the data you need with
   the Italian dbpedia resource.

   [2] http://code.google.com/p/ldpath/

   The configuration in (3) refers to the ldpath file "dbpedia.ldpath".
   This is a text file that is expected to be located within the
   "indexing/config" directory.
   I will not give an LDPath introduction, but what you need is
   something like

   1: rdfs:label = (rdfs:label | dbp-ont:wikiPageInterLanguageLink/rdfs:label);
   2: skos:altLabel = (^dbp-ont:wikiPageRedirects/rdfs:label | dbp-ont:wikiPageInterLanguageLink/^dbp-ont:wikiPageRedirects/rdfs:label);
   3: rdfs:comment = (rdfs:comment | dbp-ont:wikiPageInterLanguageLink/rdfs:comment);
   4: dbp-ont:abstract = (dbp-ont:abstract | dbp-ont:wikiPageInterLanguageLink/dbp-ont:abstract);
   5: rdf:type = (rdf:type | dbp-ont:wikiPageInterLanguageLink/rdf:type);

   NOTE: you will need to remove the '{line-number}: ' before using
   this ldpath

   (1) merges the rdfs:labels of the current Entity (the Italian label)
       with the labels of entities referenced by inter language links.
       This ensures that the Italian entity has labels for all
       languages.
   (2) merges labels of redirected pages into the skos:altLabel field.
       For this to work you will need to include the
       "redirects_{language}.nt.bz2" file for the languages you are
       interested in.
   (3) same as for rdfs:labels, but for short abstracts
   (4) the same, but for long abstracts
   (5) rdf:type statements might be missing for Italian, so I merge
       those as well with types from other languages. I would recommend
       only including types from the English dbpedia.

5. Add surfaceForms mappings to the mappings.txt file

   # add rdfs:labels and rdfs:labels of redirected sites to dbp-ont:surfaceForm
   rdfs:label > dbp-ont:surfaceForm
   skos:altLabel > dbp-ont:surfaceForm

   Those two mappings ensure that both the rdfs:label and skos:altLabel
   values are also stored in the dbp-ont:surfaceForm field. This lets
   the Stanbol Enhancer (or more precisely the NamedEntityLinkingEngine
   or KeywordLinkingEngine) match against labels of redirected pages
   once you change its name field from the default rdfs:label to
   dbp-ont:surfaceForm.

Let me conclude that I have never tried this exact use case myself, but
I have already created several dbpedia indexes with very similar
configurations.
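To make the "merge" in rule (1) more concrete, here is a small toy sketch in plain Python. It is neither LDPath nor Stanbol code; the in-memory triples, the "it:"/"en:" shorthand and the helper names build_index and merged_labels are made up for illustration. It only shows the intended union: the entity's own labels plus the labels of every entity reachable over one inter-language link.

```python
from collections import defaultdict

RDFS_LABEL = "rdfs:label"
INTERLANG = "dbp-ont:wikiPageInterLanguageLink"

# Toy data, assumed for illustration: (subject, predicate, object).
# Labels are (text, language) pairs, links are plain entity ids.
TRIPLES = [
    ("it:Armonium", RDFS_LABEL, ("Armonium", "it")),
    ("it:Armonium", INTERLANG, "en:Harmonium"),
    ("en:Harmonium", RDFS_LABEL, ("Harmonium", "en")),
]


def build_index(triples):
    """Group objects by (subject, predicate) for quick lookup."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[(subject, predicate)].append(obj)
    return index


def merged_labels(entity, index):
    """Own labels first, then labels found via one inter-language link."""
    labels = list(index.get((entity, RDFS_LABEL), []))
    for linked in index.get((entity, INTERLANG), []):
        for label in index.get((linked, RDFS_LABEL), []):
            if label not in labels:  # a union, so no duplicates
                labels.append(label)
    return labels


index = build_index(TRIPLES)
print(merged_labels("it:Armonium", index))
# prints [('Armonium', 'it'), ('Harmonium', 'en')]
```

Rules (3) to (5) follow the same pattern with a different property in place of rdfs:label; rule (2) additionally walks the redirect links backwards.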
When using LDPath during indexing you need to expect higher indexing
times, and you might also need to assign more memory to the indexing
tool.

Please also note http://markmail.org/message/67ivlyoxfqad6xoe, as you
will most likely need to process the dbpedia files for some languages
using

    bzcat ${filename}.bz2 \
      | sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
      | gzip -c > ${filename}.gz
    rm -f ${filename}.bz2

best
Rupert

>
> Thanks,
> Andrea
>
>
> 2012/11/12 Rupert Westenthaler <rupert.westentha...@gmail.com>
>
>> Hi Andrea,
>>
>>
>> On Mon, Nov 12, 2012 at 12:59 PM, Andrea Taurchini <ataurch...@gmail.com>
>> wrote:
>> > folder /indexing/dist the two files :
>> >
>> > 1)dbpedia.solrindex.zip
>> > 2)org.apache.stanbol.data.site.dbpedia-{version}.jar
>> >
>> > I prefer to install it as a new referenced site and not overwriting it to
>> > previous dbpedia english index so I made the following :
>> >
>> > 1) saved the zip in the stanbol/datafiles directory
>> > 2) installed the bundle using the Apache Felix web console
>> >
>> > So I have a new referenced site under http://localhost:8080/entityhub.
>> > The problem is that if I try to search for an entity such as
>> >
>> > curl "http://localhost:8080/entityhub/site/ITdbpedia/entity?id=http://dbpedia.org/resource/Paris"
>> >
>>
>> How have you managed to deploy the Site under "ITdbpedia"? Have you
>> manually changed the configuration after installing the Bundle?
>>
>> While this might work (if you correctly adapt the configuration for
>> the ReferencedSite, Cache and SolrYard), those will still override the
>> configurations of the default DBpedia index, simply because the OSGI
>> config files provided by the bundle (2) have the same names as the
>> default dbpedia index config files.
>>
>> > <p>Problem accessing /entityhub/site/ITdbpedia/find. Reason:
>> > <pre>    Unable to initialize the Cache with Yard ITdbpediaIndex!
>> > This is usually caused by Errors while reading the Cache Configuration
>> > from the Yard.</pre></p><h3>Caused
>> > by:</h3><pre>java.lang.IllegalStateException: Unable to initialize the
>> > Cache with Yard ITdbpediaIndex! This is usually caused by Errors while
>> > reading the Cache Configuration from the Yard.
>>
>> This usually happens if the SolrYard "ITdbpediaIndex" is configured
>> for a SolrCore that is not available. Are you sure that a SolrCore
>> with the name configured for the "Solr Index/Core" property of the
>> ITdbpediaIndex SolrYard is available?
>> Assuming you have configured {solr-core} you will need to (a) extract
>> the "dbpedia.solrindex.zip" file (b) rename the root folder from
>> "dbpedia" to "{solr-core}" (c) re-create the ZIP file (d) rename it to
>> "{solr-core}.solrindex.zip".
>>
>> - - -
>>
>> The intended way to change the name of a ReferencedSite created by the
>> Entityhub Indexing Tool is to change the value of the "name" property
>> within the
>> "./indexing/config/indexing.properties" file.
>>
>> In case of the dbpedia Indexing tool you need to change the
>> "indexingDestination" from
>>
>> indexingDestination=org.apache.stanbol.entityhub.indexing.destination.solryard.SolrYardIndexingDestination,solrConf,boosts:fieldboosts
>>
>> to
>>
>> indexingDestination=org.apache.stanbol.entityhub.indexing.destination.solryard.SolrYardIndexingDestination,solrConf:dbpedia,boosts:fieldboosts
>>
>> NOTE the change from "solrConf" to "solrConf:dbpedia". This is
>> necessary to tell the SolrYardIndexingDestination component that the
>> SolrCore configuration is called "dbpedia". By default it assumes that
>> the name is equal to the value of the "name" property.
>>
>> Before re-indexing you should also delete the "./indexing/destination"
>> folder, as otherwise you will have both the data of the old index
>> (dbpedia) and the new one {name} in the destination folder.
>>
>> - - -
>>
>> If you want to create an "installable bundle" without reindexing the
>> data you can use the following steps:
>>
>> 0. if there are still files in the indexing/resources/rdfdata folder
>> remove them as they are already imported into the Jena TDB store
>> (indexing/resources/tdb)
>> 1. make the changes as described above
>> 2. delete the indexing/destination folder (make sure to NOT delete the
>> indexing/dist folder!)
>> 3. replace the indexing/resources/incoming_links.txt file with an empty
>> one (make sure to keep the current version rather than deleting it)
>> 4. start the indexing (this should now complete in some seconds as no
>> entities are indexed).
>>
>> After that you should see 4 files in the indexing/dist folder
>>
>> a. "dbpedia.solrindex.zip"
>> b. "{name}.solrindex.zip" (this is empty - delete it)
>> c. "org.apache.stanbol.data.site.dbpedia-{version}.jar" (the old
>> bundle - delete it)
>> d. "org.apache.stanbol.data.site.{name}-{version}.jar" (the new bundle)
>>
>> (d) is the patched Bundle that you can use to install your custom
>> dbpedia index without overriding the default one. However, to use this
>> bundle you still need to modify the "dbpedia.solrindex.zip" as described
>> above: (a) extract the "dbpedia.solrindex.zip" file (b) rename the
>> root folder from "dbpedia" to "{name}" (c) re-create the ZIP file (d)
>> rename it to "{name}.solrindex.zip".
>>
>> I admit that those steps are complex, but they might save you the time
>> needed to re-create your index.
>>
>> best
>> Rupert
>>
>>
>> --
>> | Rupert Westenthaler             rupert.westentha...@gmail.com
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>

--
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
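The repackaging steps (a) to (d) quoted above (extract, rename the root folder, re-create the ZIP, rename the file) could probably be scripted instead of done by hand. The following is only a sketch using Python's standard zipfile module: the helper name rename_index_root is made up, and it assumes that nothing inside the archive other than the root folder name needs to change, which is all the steps above ask for. It has not been tried against a real Stanbol index.

```python
import shutil
import zipfile


def rename_index_root(src_zip, dst_zip, old_root, new_root):
    """Copy src_zip to dst_zip, renaming the top-level folder.

    Entries are streamed one by one, so a large index does not have
    to fit into memory or be unpacked to disk first.
    """
    prefix = old_root + "/"
    with zipfile.ZipFile(src_zip) as src, \
            zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            name = info.filename
            if name.startswith(prefix):
                name = new_root + "/" + name[len(prefix):]
            if info.is_dir():
                # Keep explicit directory entries as empty entries.
                dst.writestr(name, b"")
                continue
            # force_zip64 allows single files larger than 2 GiB.
            with src.open(info) as fin, \
                    dst.open(name, "w", force_zip64=True) as fout:
                shutil.copyfileobj(fin, fout)
```

For the case discussed in this thread the call would be rename_index_root("dbpedia.solrindex.zip", "ITdbpedia.solrindex.zip", "dbpedia", "ITdbpedia"). Keep in mind that the names are case sensitive.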