Re: Working with dbpedia indexed data

Rupert Westenthaler Tue, 27 May 2014 03:52:25 -0700

On Mon, May 26, 2014 at 9:19 PM, Cristian Petroaca
<[email protected]> wrote:
> Thanks Rupert! The genericrdf reindexing worked.
>
> Just one thing: it seems odd that my solrindex.zip grew from 796MB
> (after dbpedia indexing) to 1.5GB (after genericrdf indexing based on
> the dbpedia index), even though my yago_class_labels.nt file contains
> only around 100,000 entries.
> The only thing I changed in the config was the name of the site, as you
> suggested, and in the mappings.txt file I removed everything except
> "rdfs:label".
>


No idea ... as long as all the data you need is available ^^

best
Rupert

>
> 2014-05-26 16:26 GMT+03:00 Rupert Westenthaler <
> [email protected]>:
>
>> Hi Cristian,
>>
>> On Mon, May 26, 2014 at 2:33 PM, Cristian Petroaca
>> <[email protected]> wrote:
>> > I just found out that according to
>> > http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/README.md
>> > the min-score can actually be set to 0 and all entities will be indexed
>> > :).
>> > So, I'll give that a go (hopefully my dbpedia index won't become
>> > gigantic in size).
>> >
>>
>> Even if you set the value to zero it will still only index entities
>> listed in the incoming_links.txt file. So you will need to append the
>> Yago types to that file.
>>
>> Another possibility would be to first create the dbpedia index and
>> after that append the Yago classes by using the generic rdf indexing
>> tool. For that you can
>>
>> 1) take the destination folder of the dbpedia indexing tool and link
>> (or move) it to the destination of the generic indexing tool.
>> 2) make sure to configure the generic indexing tool with the same site
>> name as the dbpedia indexing tool
>> 3) add the RDF data of the Yago classes to the rdf data folder of the
>> generic indexing tool
>> 4) adapt all the other configurations as needed
>> 5) start the indexing process.
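
Steps 1 and 3 above can be sketched as a small Python helper. This is only an illustration under assumptions: the function name stage_generic_rdf_run is made up, and it assumes the generic RDF tool uses the same indexing/destination and indexing/resources/rdfdata layout that this thread mentions for the dbpedia tool. The site name (step 2) and the remaining configuration (step 4) still have to be edited by hand.

```python
import shutil
from pathlib import Path


def stage_generic_rdf_run(dbpedia_root, generic_root, rdf_files):
    """Move the dbpedia destination folder into the generic tool's working
    directory and copy the extra RDF files into its rdfdata folder.

    The folder layout is an assumption taken from the paths in this thread.
    """
    dbpedia_dest = Path(dbpedia_root) / "indexing" / "destination"
    generic_dest = Path(generic_root) / "indexing" / "destination"
    rdf_dir = Path(generic_root) / "indexing" / "resources" / "rdfdata"

    if not dbpedia_dest.is_dir():
        raise FileNotFoundError("no dbpedia destination at %s" % dbpedia_dest)

    # An empty destination folder may already exist in the generic tool's
    # directory; anything non-empty is left alone.
    if generic_dest.exists():
        if any(generic_dest.iterdir()):
            raise FileExistsError("%s is not empty" % generic_dest)
        generic_dest.rmdir()

    # Step 1: move the existing dbpedia index to the generic tool.
    generic_dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(dbpedia_dest), str(generic_dest))

    # Step 3: add the Yago RDF data to the rdf data folder.
    rdf_dir.mkdir(parents=True, exist_ok=True)
    for rdf_file in rdf_files:
        shutil.copy2(str(rdf_file), str(rdf_dir))

    return generic_dest, rdf_dir
```

Moving rather than copying avoids duplicating a large index on disk; linking the folder, as mentioned in step 1, would work as well.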
>>
>> The generic indexing tool will check whether the target solr index
>> already exists. As it is present, it will just add the additional
>> entities to the solr core.
>>
>> When the process completes you can use the "solrindex.zip" file
>> generated by the generic RDF indexing tool together with the OSGI
>> bundle (the jar file) generated by the dbpedia indexing tool.
>>
>> Especially if you have already created a dbpedia index I would
>> recommend you to try this out as it would avoid re-indexing the whole
>> dbpedia data again.
>>
>> best
>> Rupert
>>
>>
>>
>> >
>> > 2014-05-25 16:58 GMT+03:00 Cristian Petroaca <
>> [email protected]>:
>> >
>> >> Hi Rupert.
>> >>
>> >> I'm replying to your suggestions on integrating the yago class labels
>> >> in the dbpedia index in this thread since it's a lot shorter than the
>> >> other one.
>> >>
>> >> For clarity, your suggestions were :
>> >>
>> >> "1. The indexing tool does support LDPath. That means you can import
>> >> all the required RDF files and use LDPath to append the labels of the
>> >> Yago Types directly to the dbpedia entities. This would prevent
>> >> additional lookups to retrieve the types, but also increase the size of
>> >> the index a lot.
>> >> 2. You could also index the Yago Types and use an additional Entityhub
>> >> lookup to retrieve them. In this case you should first collect all types
>> >> referenced by Entities in the processed text and in a second step
>> >> retrieve the labels. While this means additional lookups it will only
>> >> load the labels for a type once. In addition you could use a cache for
>> >> types.
>> >> 3. Your engine could use LDPath to retrieve the types. This would require
>> >> indexing the data like with option (2) and using a LDPath statement
>> >> similar to (1). It would be the slowest solution (as it requires an
>> >> additional lookup for every extracted entity) but require the least code."
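
Option 2 (collect the types first, then load each label only once, with a cache) could look roughly like the Python sketch below. It is not Stanbol code: entities are plain dicts with a hypothetical "types" key, and lookup is a stand-in for whatever Entityhub call returns the labels of a type URI.

```python
def collect_type_uris(entities):
    """Step 1: gather every distinct type URI referenced by the entities."""
    type_uris = set()
    for entity in entities:
        type_uris.update(entity.get("types", []))
    return type_uris


def load_type_labels(type_uris, lookup, cache):
    """Step 2: fetch the labels, calling `lookup` at most once per type.

    `lookup` is a hypothetical callable (type URI -> labels); `cache` is a
    dict that can be kept around between processed texts.
    """
    for uri in type_uris:
        if uri not in cache:
            cache[uri] = lookup(uri)
    return {uri: cache[uri] for uri in type_uris}
```

Keeping the cache outside the function means a type seen in an earlier text costs no further lookup.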
>> >>
>> >> It seems that the best solution would be no. 2, so I took that path. But
>> >> I'm having some issues with building the dbpedia index with the yago
>> >> class labels.
>> >>
>> >> I managed to create an .nt file from the data files on the yago site
>> >> which contains the yago class labels. The file has this format:
>> >> <http://dbpedia.org/class/yago/Floret111669786> <http://www.w3.org/2000/01/rdf-schema#label> "floret"@en .
>> >> <http://dbpedia.org/class/yago/Servant110582154> <http://www.w3.org/2000/01/rdf-schema#label> "retainer"@en .
>> >> <http://dbpedia.org/class/yago/Varietal107900225> <http://www.w3.org/2000/01/rdf-schema#label> "varietal"@en .
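
A file in this format can be produced with a few lines of Python. The sketch below is illustrative only: the function names are invented, and it assumes the (class URI, label) pairs have already been extracted from the yago downloads. It writes directly to a .bz2 archive, matching the compression step described in this mail.

```python
import bz2

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(text):
    """Escape characters that may not appear raw in an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def label_triple(class_uri, label, lang="en"):
    """Format one rdfs:label triple as an N-Triples line (no newline)."""
    return '<{}> <{}> "{}"@{} .'.format(
        class_uri, RDFS_LABEL, escape_literal(label), lang)


def write_labels_bz2(pairs, path):
    """Write (class_uri, label) pairs as N-Triples into a bz2 archive."""
    with bz2.open(path, "wt", encoding="utf-8") as out:
        for class_uri, label in pairs:
            out.write(label_triple(class_uri, label) + "\n")
```

For example, write_labels_bz2([("http://dbpedia.org/class/yago/Floret111669786", "floret")], "yago_class_labels.nt.bz2") yields the first line of the sample above.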
>> >>
>> >> I compressed this to a .bz2 archive and put it in the
>> >> indexing/resources/rdfdata folder with the rest of them.
>> >>
>> >> After running the indexer I got my dbpedia index but it seems the yago
>> >> class labels are not present in the index. The first clue was that they
>> >> were missing from the indexing/destination/indexed-entities-ids archive.
>> >> Second confirmation came when I tried to retrieve a yago class label by
>> >> calling site.getEntity(yago_class_uri) and the return was null. I should
>> >> mention that the same call works if I want to get a
>> >> http://dbpedia.org/resource/[id] entity.
>> >>
>> >> From what I saw, the indexing process indexes entities only if they are
>> >> in the incoming_links.txt file and only if their score is higher than 2,
>> >> so I guess that's the point where the yago classes were not inserted.
>> >> From looking at the code, the min-score parameter from the
>> >> minincoming.config file cannot be set to 0, or something that would
>> >> ignore the incoming_links.txt ranking and just index everything. So, in
>> >> this situation, is there a solution for getting these yago classes as
>> >> entities in the index?
>> >>
>> >> I'd like to mention that the indexing process did correctly read the
>> >> yago_class_labels.nt file and started to index the entities into Jena.
>> >>
>> >> Thanks,
>> >> Cristian
>> >>
>> >>
>> >>
>> >> 2014-05-07 14:54 GMT+03:00 Cristian Petroaca <
>> [email protected]>
>> >> :
>> >>
>> >> Hi Rupert,
>> >>>
>> >>> Ok, I'll resend this mail in this thread. Again, out of habit I sent it
>> >>> in the gigantic "Named entities coreference" thread instead.
>> >>>
>> >>> So, I managed to create a dbpedia index with the yago class information
>> >>> but looking into the yago_types.nt file which assigns yago classes to
>> >>> dbpedia entities I realized that there are no yago class labels
>> >>> present, I just have the class uri like:
>> >>> <http://dbpedia/..something../President1829302/. I also need the class
>> >>> labels so that I can compare them to the noun token's string from the
>> >>> text.
>> >>>
>> >>> I can get the labels from one of the yago downloads here:
>> >>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoMultilingualClassLabels.txt
>> >>> I'll need another yago download file to map the yago wordnet classes to
>> >>> dbpedia uris. That could be done via a script maybe.
>> >>>
>> >>> Once I have the dbpedia_yago_class_uri -> label file is it possible to
>> >>> integrate this data in the dbpedia index and later be able to query the
>> >>> labels from the 'dbpedia' Site? How would that work in the dbpedia
>> >>> indexing process? What should I change in the mappings.txt file? At
>> >>> first glance it seems that the indexing is done based on the
>> >>> incoming_links.txt entity scoring and in my case I don't want to include
>> >>> triples involving the actual entity but triples involving a property of
>> >>> the entity (its yago class).
>> >>>
>> >>> Other than that, I saw that someone will be working on integrating YAGO
>> >>> as part of Gsoc 2014. So maybe waiting for that is an option too but I
>> >>> don't know what the extent of the integration will be.
>> >>>
>> >>> Thanks,
>> >>> Cristi
>> >>>
>> >>>
>> >>> 2014-04-30 12:04 GMT+03:00 Rupert Westenthaler <
>> >>> [email protected]>:
>> >>>
>> >>> On Wed, Apr 30, 2014 at 10:37 AM, Cristian Petroaca
>> >>>> <[email protected]> wrote:
>> >>>> > Hi All,
>> >>>> >
>> >>>> > I'm currently working on
>> >>>> https://issues.apache.org/jira/browse/STANBOL-1279.
>> >>>> >
>> >>>> > I am using the SiteManager to get a Site with referenceId = "dbpedia"
>> >>>> > and am querying data related to some NERs (querying by NER label and
>> >>>> > type). This works and I do get results from the dbpedia index.
>> >>>> >
>> >>>> > What I want to do is this :
>> >>>> >
>> >>>> > 1. I want to be able to store and get yago class types in the dbpedia
>> >>>> > data. This data is stored in the yago-types.nt file from the dbpedia
>> >>>> > 3.9 downloads. Is it possible to create a new dbpedia index with the
>> >>>> > 3.9 files using this script
>> >>>> > https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/fetch_data_en_int.sh
>> >>>> > ?
>> >>>>
>> >>>> yep. Just make sure you change
>> >>>>
>> >>>>     DBPEDIA=http://downloads.dbpedia.org/3.8
>> >>>>
>> >>>> to dbpedia 3.9
>> >>>>
>> >>>> BTW: you can also remove
>> >>>>
>> >>>>         #corrects encoding and recompress using gz
>> >>>>         bzcat ${filename}.bz2 \
>> >>>>             | sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
>> >>>>             | gzip -c > ${filename}.gz
>> >>>>         rm -f ${filename}.bz2
>> >>>>
>> >>>> as this is no longer necessary.
>> >>>>
>> >>>> >
>> >>>> > 2. I want to access some specific dbpedia properties such as
>> >>>> > dbpedia-owl:locationCity and others. These are already present in the
>> >>>> > mappingbased_properties_en.nt file which is in the
>> >>>> > fetch_data_en_int.sh script but are not in the
>> >>>> > https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/src/main/resources/indexing/config/mappings.txt
>> >>>> > file.
>> >>>> > Should I include them there and do a dbpedia index rebuild?
>> >>>>
>> >>>> Exactly. If the size of the created SolrIndex is an issue I recommend
>> >>>> also that you remove properties you do not need.
>> >>>>
>> >>>> >
>> >>>> > I've already described this in the "Named entity coref resolution
>> >>>> > based on dbpedia" mail thread but I thought of creating a new mail for
>> >>>> > visibility and for not clogging the other thread.
>> >>>>
>> >>>> The old thread is already much too long anyway. Please make sure that
>> >>>> important points and decisions of that thread are also reflected in
>> >>>> the description of STANBOL-1279
>> >>>>
>> >>>> best
>> >>>> Rupert
>> >>>>
>> >>>> >
>> >>>> > Thanks,
>> >>>> > Cristian
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> | Rupert Westenthaler             [email protected]
>> >>>> | Bodenlehenstraße 11                              ++43-699-11108907
>> >>>> | A-5500 Bischofshofen
>> >>>> |
>> REDLINK.CO..........................................................................
>> >>>> | http://redlink.co/
>> >>>>
>> >>>
>> >>>
>> >>
>>



-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO 
..........................................................................
| http://redlink.co/
