On Mon, May 26, 2014 at 9:19 PM, Cristian Petroaca
<cristian.petro...@gmail.com> wrote:
> Thanks Rupert! The genericrdf reindexing worked.
>
> Just one thing: it seems kind of odd that my solrindex.zip went from 796MB
> (after dbpedia indexing) to 1.5GB (after genericrdf indexing based on the
> dbpedia index), while my yago_class_labels.nt file contains only around
> 100,000 entries.
> The only thing I changed in the config was the name of the site, as you
> suggested, and in the mappings.txt file I removed everything except
> "rdfs:label".
>
No idea ... as long as all the data you need are available ^^

best
Rupert

>
> 2014-05-26 16:26 GMT+03:00 Rupert Westenthaler <rupert.westentha...@gmail.com>:
>
>> Hi Cristian,
>>
>> On Mon, May 26, 2014 at 2:33 PM, Cristian Petroaca
>> <cristian.petro...@gmail.com> wrote:
>> > I just found out that according to
>> > http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/README.md
>> > the min-score can actually be set to 0 and all entities will be indexed :).
>> > So, I'll give that a go (hopefully my dbpedia index won't become gigantic
>> > in size).
>> >
>>
>> Even if you set the value to zero it will still only index entities
>> listed in the incoming_links.txt file. So you will need to append the
>> Yago types to that file.
>>
>> Another possibility would be to first create the dbpedia index and
>> after that append the Yago classes by using the generic RDF indexing
>> tool. For that you can
>>
>> 1) take the destination folder of the dbpedia indexing tool and link
>>    (or move) it to the destination of the generic indexing tool.
>> 2) make sure to configure the same site name for the generic indexing
>>    tool as for the dbpedia indexing tool
>> 3) add the RDF data of the Yago classes to the rdf data folder of the
>>    generic indexing tool
>> 4) adapt all the other configurations as needed
>> 5) start the indexing process.
>>
>> The generic indexing tool will check if the target solr index does
>> already exist. As it is present it will just add the additional
>> entities to the solr core.
>>
>> When the process completes you can use the "solrindex.zip" file
>> generated by the generic RDF indexing tool together with the OSGI
>> bundle (the jar file) generated by the dbpedia indexing tool.
>>
>> Especially if you have already created a dbpedia index I would
>> recommend you to try this out as it would avoid re-indexing the whole
>> dbpedia data again.
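
Steps 1) to 3) above could be scripted roughly as follows. This is a minimal
Python sketch, not part of Stanbol: the function names are hypothetical, and it
assumes the generic RDF indexing tool works in an indexing/ directory with
config/, resources/rdfdata/ and destination/ subfolders, and that the site name
is the "name" key in indexing/config/indexing.properties. Check both
assumptions against your own setup before relying on it.

```python
import shutil
from pathlib import Path


def set_site_name(props: Path, site_name: str) -> None:
    """Set the 'name' key (assumed to hold the site name) in a properties file."""
    lines = props.read_text(encoding="utf-8").splitlines() if props.exists() else []
    replaced = False
    for i, line in enumerate(lines):
        if line.split("=", 1)[0].strip() == "name":
            lines[i] = f"name={site_name}"
            replaced = True
    if not replaced:
        lines.append(f"name={site_name}")
    props.write_text("\n".join(lines) + "\n", encoding="utf-8")


def prepare_generic_indexing(dbpedia_destination: Path, generic_root: Path,
                             rdf_file: Path, site_name: str) -> None:
    """Prepare the generic RDF indexing tool to extend an existing dbpedia index."""
    # Assumed folder layout of the generic indexing tool (hypothetical here).
    indexing = generic_root / "indexing"
    config = indexing / "config"
    rdfdata = indexing / "resources" / "rdfdata"
    config.mkdir(parents=True, exist_ok=True)
    rdfdata.mkdir(parents=True, exist_ok=True)

    # 1) link the dbpedia destination folder to the generic tool's destination
    destination = indexing / "destination"
    if destination.is_symlink():
        destination.unlink()
    elif destination.is_dir():
        destination.rmdir()  # only succeeds if the folder is empty
    destination.symlink_to(dbpedia_destination.resolve(), target_is_directory=True)

    # 2) configure the same site name as used for the dbpedia index
    set_site_name(config / "indexing.properties", site_name)

    # 3) add the RDF data of the Yago classes to the rdf data folder
    shutil.copy2(rdf_file, rdfdata / rdf_file.name)
```

Steps 4) and 5) (adapting the remaining configuration and starting the indexing
process) are left out on purpose, since they depend on the local installation.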
>>
>> best
>> Rupert
>>
>> >
>> > 2014-05-25 16:58 GMT+03:00 Cristian Petroaca <cristian.petro...@gmail.com>:
>> >
>> >> Hi Rupert.
>> >>
>> >> I'm replying to your suggestions on integrating the yago class labels in
>> >> the dbpedia index in this thread since it's a lot shorter than the other
>> >> one.
>> >>
>> >> For clarity, your suggestions were:
>> >>
>> >> "1. The indexing tool does support LDPath. That means you can import
>> >> all the required RDF files and use LDPath to append the labels of the Yago
>> >> Types directly to the dbpedia entities. This would prevent additional
>> >> lookups to retrieve the types, but also increase the size of the index a
>> >> lot.
>> >> 2. You could also index the Yago Types and use an additional Entityhub
>> >> lookup to retrieve them. In this case you should first collect all types
>> >> referenced by Entities in the processed text and in a second step retrieve
>> >> the labels. While this means additional lookups it will only load the
>> >> labels for a type once. In addition you could use a cache for types.
>> >> 3. Your engine could use LDPath to retrieve the types. This would require to
>> >> index the data like with option (2) and use a LDPath statement similar to
>> >> (1). It would be the slowest solution (as it requires an additional lookup
>> >> for every extracted entity) but require the least code."
>> >>
>> >> It seems that the best solution would be no. 2, so I took that path. But
>> >> I'm having some issues with building the dbpedia index with the yago class
>> >> labels.
>> >>
>> >> I managed to create an .nt file from the data files on the yago site which
>> >> contains the yago class labels. The file has this format:
>> >> <http://dbpedia.org/class/yago/Floret111669786> <http://www.w3.org/2000/01/rdf-schema#label> "floret"@en .
>> >> <http://dbpedia.org/class/yago/Servant110582154> <http://www.w3.org/2000/01/rdf-schema#label> "retainer"@en .
>> >> <http://dbpedia.org/class/yago/Varietal107900225> <http://www.w3.org/2000/01/rdf-schema#label> "varietal"@en .
>> >>
>> >> I compressed this to a .bz2 archive and put it in the
>> >> indexing/resources/rdfdata folder with the rest of them.
>> >>
>> >> After running the indexer I got my dbpedia index but it seems the yago
>> >> class labels are not present in the index. The first clue was that they
>> >> were missing from the indexing/destination/indexed-entities-ids archive.
>> >> Second confirmation came when I tried to retrieve a yago class label by
>> >> calling site.getEntity(yago_class_uri) and the return was null. I should
>> >> mention that the same call works if I want to get a
>> >> http://dbpedia.org/resource/[id] entity.
>> >>
>> >> From what I saw, the indexing process indexes entities only if they are in
>> >> the incoming_links.txt file and only if their score is higher than 2 so I
>> >> guess that's the point where the yago classes were not inserted. From
>> >> looking at the code, the min-score parameter from the minincoming.config
>> >> file cannot be set to 0, or something that would ignore the
>> >> incoming_links.txt ranking and just index everything. So, in this
>> >> situation, is there a solution for getting these yago classes as entities
>> >> in the index?
>> >>
>> >> I'd like to mention that the indexing process did correctly read the
>> >> yago_class_labels.nt file and started to index the entities into Jena.
>> >>
>> >> Thanks,
>> >> Cristian
>> >>
>> >>
>> >>
>> >> 2014-05-07 14:54 GMT+03:00 Cristian Petroaca <cristian.petro...@gmail.com>:
>> >>
>> >> Hi Rupert,
>> >>>
>> >>> Ok, I'll resend this mail in this thread. Again, out of habit I sent it
>> >>> in the gigantic "Named entities coreference" thread instead.
>> >>>
>> >>> So, I managed to create a dbpedia index with the yago class information
>> >>> but looking into the yago_types.nt file which assigns yago classes to
>> >>> dbpedia entities I realized that there are no yago class labels present, I
>> >>> just have the class uri like:
>> >>> <http://dbpedia/..something../President1829302/>. I also need the class
>> >>> labels so that I can compare them to the noun token's string from the text.
>> >>>
>> >>> I can get the labels from one of the yago downloads here:
>> >>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoMultilingualClassLabels.txt .
>> >>> I'll need another yago download file to map the yago wordnet classes to
>> >>> dbpedia uris. That could be done via a script maybe.
>> >>>
>> >>> Once I have the dbpedia_yago_class_uri -> label file is it possible to
>> >>> integrate this data in the dbpedia index and later be able to query the
>> >>> labels from the 'dbpedia' Site? How would that work in the dbpedia indexing
>> >>> process? What should I change in the mappings.txt file? At first glance it
>> >>> seems that the indexing is done based on the incoming_links.txt entity
>> >>> scoring and in my case I don't want to include triples involving the actual
>> >>> entity but triples involving a property of the entity (its yago class).
>> >>>
>> >>> Other than that, I saw that someone will be working on integrating YAGO
>> >>> as part of GSoC 2014. So maybe waiting for that is an option too but I
>> >>> don't know what the extent of the integration will be.
>> >>>
>> >>> Thanks,
>> >>> Cristi
>> >>>
>> >>>
>> >>> 2014-04-30 12:04 GMT+03:00 Rupert Westenthaler <rupert.westentha...@gmail.com>:
>> >>>
>> >>> On Wed, Apr 30, 2014 at 10:37 AM, Cristian Petroaca
>> >>>> <cristian.petro...@gmail.com> wrote:
>> >>>> > Hi All,
>> >>>> >
>> >>>> > I'm currently working on https://issues.apache.org/jira/browse/STANBOL-1279.
>> >>>> >
>> >>>> > I am using the SiteManager to get a Site with referenceId = "dbpedia" and
>> >>>> > am querying data related to some NERs (querying by NER label and type).
>> >>>> > This works and I do get results from the dbpedia index.
>> >>>> >
>> >>>> > What I want to do is this:
>> >>>> >
>> >>>> > 1. I want to be able to store and get yago class types in the dbpedia data.
>> >>>> > This data is stored in the yago-types.nt file from the dbpedia 3.9
>> >>>> > downloads. Is it possible to create a new dbpedia index with the 3.9 files
>> >>>> > using this script
>> >>>> > https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/fetch_data_en_int.sh
>> >>>> > ?
>> >>>>
>> >>>> yep. Just make sure you change
>> >>>>
>> >>>>     DBPEDIA=http://downloads.dbpedia.org/3.8
>> >>>>
>> >>>> to dbpedia 3.9
>> >>>>
>> >>>> BTW: you can also remove
>> >>>>
>> >>>>     #corrects encoding and recompress using gz
>> >>>>     bzcat ${filename}.bz2 \
>> >>>>         | sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
>> >>>>         | gzip -c > ${filename}.gz
>> >>>>     rm -f ${filename}.bz2
>> >>>>
>> >>>> as this is no longer necessary.
>> >>>>
>> >>>> >
>> >>>> > 2. I want to access some specific dbpedia properties such as
>> >>>> > dbpedia-owl:locationCity and others. These are already present in the
>> >>>> > mappingbased_properties_en.nt
>> >>>> > file which is in the fetch_data_en_int.sh script but are not in the
>> >>>> > https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/src/main/resources/indexing/config/mappings.txt
>> >>>> > file.
>> >>>> > Should I include them there and do a dbpedia index rebuild?
>> >>>>
>> >>>> Exactly. If the size of the created SolrIndex is an issue I recommend
>> >>>> also that you remove properties you do not need.
>> >>>>
>> >>>> >
>> >>>> > I've already described this in the "Named entity coref resolution based on
>> >>>> > dbpedia" mail thread but I thought of creating a new mail for visibility
>> >>>> > and for not clogging the other thread.
>> >>>>
>> >>>> The old thread is anyway already much too long. Please make sure that
>> >>>> important points and decisions of that thread are also reflected in
>> >>>> the description of STANBOL-1279
>> >>>>
>> >>>> best
>> >>>> Rupert
>> >>>>
>> >>>> >
>> >>>> > Thanks,
>> >>>> > Cristian
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> | Rupert Westenthaler             rupert.westentha...@gmail.com
>> >>>> | Bodenlehenstraße 11                             ++43-699-11108907
>> >>>> | A-5500 Bischofshofen
>> >>>> | REDLINK.CO ..........................................................................
>> >>>> | http://redlink.co/
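
The label file described earlier in the thread (one rdfs:label triple per Yago
class, compressed to .bz2 and placed in indexing/resources/rdfdata) could be
generated with a short script. The following is a minimal Python sketch,
assuming the (class URI, label) pairs have already been extracted from the Yago
downloads; the function names are hypothetical and not part of Stanbol or Yago,
and only English labels are handled.

```python
import bz2
from pathlib import Path

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(text: str) -> str:
    """Escape the characters that must be escaped inside an N-Triples string."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def label_triple(class_uri: str, label: str, lang: str = "en") -> str:
    """Build one rdfs:label triple in the format shown in the thread."""
    return f'<{class_uri}> <{RDFS_LABEL}> "{escape_literal(label)}"@{lang} .'


def write_label_file(labels, target: Path) -> int:
    """Write one triple per (class_uri, label) pair to a bz2-compressed file."""
    count = 0
    with bz2.open(target, "wt", encoding="utf-8") as out:
        for class_uri, label in labels:
            out.write(label_triple(class_uri, label) + "\n")
            count += 1
    return count
```

For example, write_label_file([("http://dbpedia.org/class/yago/Floret111669786",
"floret")], Path("yago_class_labels.nt.bz2")) writes a single line matching the
first sample triple quoted in the thread.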