Hi

On Tue, Nov 13, 2012 at 10:30 AM, Andrea Taurchini <ataurch...@gmail.com> wrote:
> Hi Rupert,
> I have the following under felix configurations :
>
> EntityHub Referenced Site Configuration
> ID ................................................. ITdbpedia
>
> EntityHub Cache configuration
> ID ................................................. ITdbpediaIndex
> Cache mappings ........................... empty
>
> Solr Yard Configuration
> ID .................................................. ITdbpediaIndex
> Solr Index/Core .............................. ITdbpedia
> Use default SolrCore configuration .... unflagged
>
> but honestly I can't find any Solr core. Where should I look?
>

The SolrCores are registered only as OSGi services, not as Components.
Because of that you can only see them in the Services tab of the Felix
Webconsole. Search for "org.apache.solr.core.SolrCore" and then inspect
the metadata. The value of the "org.apache.solr.core.SolrCore.name"
property needs to match the value configured for "Solr Index/Core" in
your SolrYard.

> ----
>
> I already produced my index, as you pointed out, modifying
> indexing/config/indexing.properties
> but unfortunately I didn't know I had to change the indexingDestination
> maybe this is the problem ?
>
> IndexingDestination=org.apache.stanbol.entityhub.indexing.destination.solryard.SolrYardIndexingDestination,solrConf,boosts:fieldboosts
>

If you have not changed this, then the default SolrCore schema was
used for indexing. I do not think that this will have a major impact
on the resulting index, as the dbpedia configuration differs from the
default only in some minor things.

> Moreover once I produced the correct zip file Itdbpedia.solr.zip (I changed

The name should be "Itdbpedia.solrindex.zip". In addition, note that
names are case sensitive. So if you use ITdbpedia, then you should
also name the file with upper case IT.
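
A small check like the following can catch a case mismatch before the
bundle is restarted. This is only a sketch: the function name
check_index_archive is made up for illustration, and it assumes the
archive has to be called "{core-name}.solrindex.zip" as described above.

```shell
# Hypothetical helper: verify that <dir> contains "<core>.solrindex.zip"
# with exactly the same upper/lower case as the configured core name.
check_index_archive() {
    dir=$1
    core=$2
    expected="${core}.solrindex.zip"
    # Compare against the names returned by the glob (not "test -f"), so
    # the check stays case sensitive on case-insensitive file systems.
    for path in "$dir"/*; do
        if [ "${path##*/}" = "$expected" ]; then
            echo "found $expected"
            return 0
        fi
    done
    echo "no file named $expected in $dir" >&2
    return 1
}
```

For example: check_index_archive stanbol/datafiles ITdbpedia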

> the indexing properties, so I don't have to change the folder manually as
> you said) I have to save it to stanbol/datafiles and then restart stanbol,
> right ?
>

No, restarting the server will not work. Replace the file and then
stop/start the bundle via the Bundles tab of the Felix Webconsole. As
soon as you stop the bundle, the current index should be deleted (you
can check this by looking at the folder
"{stanbol-working-dir}/stanbol/indexes/default"). When you start the
bundle again, the index should be re-initialised based on the current
"Itdbpedia.solrindex.zip".

>
> Too complex steps ??? Nahhh  ^_^

OK, then I have to replace it with something more complex [1]

[1] http://en.wikiquote.org/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy#Preface

>
> -------
>
> Is it possible to create a single index with both language (it, en)
> versions of dbpedia?
> I think this is very difficult to manage, isn't it?
>

The simple answer is YES: just add the RDF files with the Italian
labels and short/long abstracts,

e.g. curl http://downloads.dbpedia.org/3.8/it/labels_en_uris_it.nt.bz2 | bzcat | head

<http://dbpedia.org/resource/Harmonium>
<http://www.w3.org/2000/01/rdf-schema#label> "Armonium"@it .
<http://dbpedia.org/resource/Anthropology>
<http://www.w3.org/2000/01/rdf-schema#label> "Antropologia"@it .
<http://dbpedia.org/resource/Agriculture>
<http://www.w3.org/2000/01/rdf-schema#label> "Agricoltura"@it .


The complex answer is also YES: While there are Italian labels,
comments ... available for http://dbpedia.org/resource/, these only
include entities for which an English counterpart is available.
Entities of the Italian Wikipedia that do not have an English version
are not included. If you want to have all Italian Entities you will
need to use the Italian dbpedia (http://it.dbpedia.org/resource/).


e.g. curl http://downloads.dbpedia.org/3.8/it/labels_it.nt.bz2 | bzcat | head

<http://it.dbpedia.org/resource/Armonium>
<http://www.w3.org/2000/01/rdf-schema#label> "Armonium"@it .
<http://it.dbpedia.org/resource/Antropologia>
<http://www.w3.org/2000/01/rdf-schema#label> "Antropologia"@it .
<http://it.dbpedia.org/resource/Agricoltura>
<http://www.w3.org/2000/01/rdf-schema#label> "Agricoltura"@it .

By comparing the file sizes of labels_it.nt.bz2 (18M) and
labels_en_uris_it.nt.bz2 (7.9M) you can easily see that with the
English dbpedia you will not have all the Italian entities available.
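
File size is only a rough indicator. Counting the distinct subjects of
both files gives a more direct comparison. A minimal sketch, assuming
one triple per line in the dump (the examples above are only wrapped by
the mail client); count_subjects is a name chosen here for illustration:

```shell
# Count the distinct subjects (first field) of N-Triples read from stdin.
count_subjects() {
    awk 'NF { print $1 }' | sort -u | wc -l | tr -d ' '
}
```

For example: bzcat labels_it.nt.bz2 | count_subjects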

To integrate two languages you need the
"interlanguage_links_it.nt.bz2". This defines links from the Italian
entities to all other languages.

<http://it.dbpedia.org/resource/Armonium>
<http://dbpedia.org/ontology/wikiPageInterLanguageLink>
<http://dbpedia.org/resource/Harmonium> .
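
To see which Italian entities actually have an English counterpart,
the links can be filtered by the namespace of their target. Again only
a sketch that assumes one triple per line; the function name
english_links is hypothetical:

```shell
# Print "<italian-uri> <english-uri>" for every inter-language link whose
# target is in the English dbpedia namespace. Reads N-Triples from stdin.
english_links() {
    awk '$2 == "<http://dbpedia.org/ontology/wikiPageInterLanguageLink>" &&
         index($3, "<http://dbpedia.org/resource/") == 1 { print $1, $3 }'
}
```

For example: bzcat interlanguage_links_it.nt.bz2 | english_links | head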

For indexing you need to do the following:

1. Calculate the incoming_links.txt file for the Italian page links
(http://downloads.dbpedia.org/3.8/it/page_links_it.nt.bz2)
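
One possible way to do this is to count how often each Italian resource
appears as the target of a page link. The sketch below assumes that the
file should contain "{count} {local-name}" lines sorted by descending
count; please check this against the incoming_links.txt you already
have for the English index. The function name count_incoming_links is
made up for illustration.

```shell
# Read page link N-Triples from stdin and print "<count> <local-name>"
# for every Italian resource that is the target (object) of a link.
count_incoming_links() {
    # The greedy ".*" makes the pattern match the last URI on the line,
    # which is the object of the triple.
    sed -n 's|.*<http://it\.dbpedia\.org/resource/\([^>]*\)> \.$|\1|p' \
        | sort | uniq -c | sort -nr
}
```

For example: bzcat page_links_it.nt.bz2 | count_incoming_links > incoming_links.txt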


2. Download all the RDF files you need

    * basically the same you currently use from
      http://downloads.dbpedia.org/3.8/en/ but now from
      http://downloads.dbpedia.org/3.8/it/
    * language specific labels from other languages you are interested in.
         IMPORTANT: use the
             http://downloads.dbpedia.org/3.8/{lang}/{type}_{lang}.nt.bz2
         files and NOT the
             http://downloads.dbpedia.org/3.8/{lang}/{type}_en_uris_{lang}.nt.bz2
    * include http://downloads.dbpedia.org/3.8/en/instance_types_en.nq.bz2
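
To avoid picking the wrong file variant by accident, the URLs can be
generated from the {lang}/{type} pattern. This is a sketch only: the
function names are hypothetical, and the list of types is an assumption
(only labels, redirects, interlanguage_links and page_links are named
in this mail), so adjust it to the files that really exist on the
download server.

```shell
# Build the download URL for one dump file. Uses the
# {type}_{lang}.nt.bz2 naming, NOT the {type}_en_uris_{lang} variant.
dump_url() {
    lang=$1
    kind=$2
    echo "http://downloads.dbpedia.org/3.8/${lang}/${kind}_${lang}.nt.bz2"
}

# Print one curl command per file instead of downloading right away, so
# that the list can be reviewed first. The type names are assumptions.
print_downloads() {
    lang=$1
    for kind in labels short_abstracts long_abstracts redirects interlanguage_links; do
        echo "curl -O $(dump_url "$lang" "$kind")"
    done
}
```

For example: print_downloads it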


3. You will need to add the LdpathSourceProcessor to the list of
entityProcessor in the indexing.properties file. The configuration
should look like

entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:entityTypes;org.apache.stanbol.entityhub.indexing.core.processor.LdpathSourceProcessor,ldpath:dbpedia.ldpath;org.apache.stanbol.entityhub.indexing.core.processor.FiledMapperProcessor

4. Create an LDPath [2] program that merges all the data you need with
the Italian dbpedia resource.

[2] http://code.google.com/p/ldpath/

The configuration in (3) refers to the LDPath file "dbpedia.ldpath".
This is a text file that is expected to be located within the
"indexing/config" directory. I will not give an LDPath introduction,
but what you need is something like

1: rdfs:label = (rdfs:label | dbp-ont:wikiPageInterLanguageLink/rdfs:label);
2: skos:altLabel = (^dbp-ont:wikiPageRedirects/rdfs:label |
dbp-ont:wikiPageInterLanguageLink/^dbp-ont:wikiPageRedirects/rdfs:label);
3: rdfs:comment = (rdfs:comment | dbp-ont:wikiPageInterLanguageLink/rdfs:comment);
4: dbp-ont:abstract = (dbp-ont:abstract |
dbp-ont:wikiPageInterLanguageLink/dbp-ont:abstract);
5: rdf:type = (rdf:type | dbp-ont:wikiPageInterLanguageLink/rdf:type);

NOTE: you will need to remove the '{line-number}: ' before using this ldpath

(1) merges the rdfs:labels of the current Entity (the Italian label)
with the labels of entities referenced by inter-language links. This
ensures that the Italian entity has labels for all languages.
(2) merges the labels of redirected pages into the skos:altLabel field.
For this to work you will need to include the
"redirects_{language}.nt.bz2" file for each language you are interested
in.
(3) the same as for rdfs:label, but for short abstracts (rdfs:comment)
(4) the same, but for long abstracts
(5) rdf:type statements might be missing for Italian, so I merge those
as well with types from other languages. I would recommend including
only types from the English dbpedia.
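
The '{line-number}: ' prefixes mentioned in the NOTE above can be
removed with a one-line filter. A minimal sketch; the function name
strip_line_numbers and the input file name in the example are made up
for illustration:

```shell
# Strip a leading "<number>: " from every line read on stdin.
# Continuation lines without a number are passed through unchanged.
strip_line_numbers() {
    sed 's/^[0-9][0-9]*: //'
}
```

For example: strip_line_numbers < numbered.txt > indexing/config/dbpedia.ldpath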


5. Add surfaceForms mapping to the mappings.txt file

# add rdfs:labels and rdfs:labels of redirected sites to dbp-ont:surfaceForm
rdfs:label > dbp-ont:surfaceForm
skos:altLabel > dbp-ont:surfaceForm

Those two mappings ensure that both the rdfs:label and skos:altLabel
values are also stored in the dbp-ont:surfaceForm field. This lets
you configure the Stanbol Enhancer (or more precisely the
NamedEntityLinkingEngine or KeywordLinkingEngine) to match against
labels of redirected pages by changing the name field from the default
rdfs:label to dbp-ont:surfaceForm.


Let me conclude by saying that I have never tried this exact use case
myself, but I have already created several dbpedia indexes with very
similar configurations. When using LDPath during indexing you need to
expect longer indexing times and you might also need to assign more
memory to the indexing tool.

Please also note http://markmail.org/message/67ivlyoxfqad6xoe as you
will most likely need to process the dbpedia files for some languages
using the following commands:

    bzcat ${filename}.bz2 \
        | sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
        | gzip -c > ${filename}.gz
    rm -f ${filename}.bz2

best
Rupert

>
> Thanks,
> Andrea
>
>
>
>
>
>
>
> 2012/11/12 Rupert Westenthaler <rupert.westentha...@gmail.com>
>
>> Hi Andrea,
>>
>>
>> On Mon, Nov 12, 2012 at 12:59 PM, Andrea Taurchini <ataurch...@gmail.com>
>> wrote:
>> > folder /indexing/dist the two files :
>> >
>> > 1)dbpedia.solrindex.zip
>> > 2)org.apache.stanbol.data.site.dbpedia-{version}.jar
>> >
>> > I prefer to install it as a new referenced site and not overwriting it to
>> > previous dbpedia english index so I made the following :
>> >
>> > 1) saved the zip in the stanbol/datafiles directory
>> > 2) installed the bundle using the Apache Felix web console
>> >
>> > So I have a new referenced site under http://localhost:8080/entityhub.
>> > The problem is that if I try to search for an entity such as
>> >
>> > curl "http://localhost:8080/entityhub/site/ITdbpedia/entity?id=http://dbpedia.org/resource/Paris"
>> >
>>
>> How have you managed to deploy the Site under "ITdbpedia"? Have you
>> manually changed the configuration after installing the Bundle?
>>
>> While this might work (if you correctly adapt the configuration for
>> the ReferencedSite, Cache and SolrYard), those will still override the
>> configurations of the default DBpedia index, simply because the OSGi
>> config files provided by the bundle (2) have the same names as the
>> default dbpedia index config files.
>>
>> > <p>Problem accessing /entityhub/site/ITdbpedia/find. Reason:
>> > <pre>    Unable to initialize the Cache with Yard ITdbpediaIndex! This
>> > is usually caused by Errors while reading the Cache Configuration from
>> > the Yard.</pre></p><h3>Caused
>> > by:</h3><pre>java.lang.IllegalStateException: Unable to initialize the
>> > Cache with Yard ITdbpediaIndex! This is usually caused by Errors while
>> > reading the Cache Configuration from the Yard.
>>
>> This usually happens if the SolrYard "ITdbpediaIndex" is configured
>> for a SolrCore that is not available. Are you sure that a SolrCore
>> with the name configured for the "Solr Index/Core" property of the
>> ITdbpediaIndex SolrYard is available?
>> Assuming you have configured {solr-core}, you will need to (a) extract
>> the "dbpedia.solrindex.zip" file (b) rename the root folder from
>> "dbpedia" to "{solr-core}" (c) re-create the ZIP file (d) rename it to
>> "{solr-core}.solrindex.zip".
>>
>> - - -
>>
>> The intended way to change the name of a ReferencedSite created by the
>> Entityhub Indexing Tool is to change the value of the "name" property
>> within the
>> "./indexing/config/indexing.properties" file.
>>
>> In case of the dbpedia Indexing tool you need to change the
>> "indexingDestination" from
>>
>>
>> indexingDestination=org.apache.stanbol.entityhub.indexing.destination.solryard.SolrYardIndexingDestination,solrConf,boosts:fieldboosts
>>
>> to
>>
>>
>> indexingDestination=org.apache.stanbol.entityhub.indexing.destination.solryard.SolrYardIndexingDestination,solrConf:dbpedia,boosts:fieldboosts
>>
>> NOTE the change from "solrConf" to "solrConf:dbpedia". This is
>> necessary to tell the SolrYardIndexingDestination component that the
>> SolrCore configuration is called "dbpedia". By default it assumes that
>> the name is equal to the value of the "name" property.
>>
>> Before re-indexing you should also delete the "./indexing/destination"
>> folder as otherwise you will have both the data of the old index
>> (dbpedia) and the new one {name} in the destination folder.
>>
>> - - -
>>
>> If you want to create an "installable bundle" without reindexing the
>> data you can follow the following steps:
>>
>> 0. if there are still files in the indexing/resources/rdfdata folder
>> remove them as they are already imported into the Jena TDB store
>> (indexing/resources/tdb)
>> 1. make the changes as described above
>> 2. delete the indexing/destination folder (make sure to NOT delete the
>> indexing/dist folder!)
>> 3. replace the indexing/resources/incoming_links.txt file with an empty
>> one (make sure to keep a copy of the current version)
>> 4. start the indexing (this should now complete in a few seconds as no
>> entities are indexed).
>>
>> After that you should see in the indexing/dist folder 4 files
>>
>> a. "dbpedia.solrindex.zip"
>> b. "{name}.solrindex.zip" (this is empty - delete it)
>> c. "org.apache.stanbol.data.site.dbpedia-{version}.jar" (the old
>> bundle - delete it)
>> d. "org.apache.stanbol.data.site.{name}-{version}.jar" (the new bundle)
>>
>> (d) is the patched Bundle that you can use to install your custom
>> dbpedia index without overriding the default one. However, to use this
>> bundle you still need to modify the "dbpedia.solrindex.zip" as described
>> above: (a) extract the "dbpedia.solrindex.zip" file (b) rename the
>> root folder from "dbpedia" to "{name}" (c) re-create the ZIP file (d)
>> rename it to "{name}.solrindex.zip".
>>
>> I admit that those steps are complex, but they might save you the time
>> needed to re-create your index.
>>
>> best
>> Rupert
>>
>>
>> --
>> | Rupert Westenthaler             rupert.westentha...@gmail.com
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>



-- 
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
