Hi,


On Fri, Aug 17, 2012 at 6:40 PM, harish suvarna <[email protected]> wrote:
> I read the readme.md in the entityhub/indexing/dbpedia and started indexing
> the chinese dbpedia 3.8. Chinese dbpedia3.8 dump does not seem to have 2
> files needed. instance_types and person_data. Still I went ahead and tried
> to run the index generation.

I have not yet had time to look at DBpedia 3.8. They might have changed
the names of some dump files. Generally "instance_types" is very
important (it provides the information about the type of an Entity).
"person_data" includes additional information for persons; AFAIK that
information is not included in the default configuration of the
DBpedia indexing tool.

> I get a java exception.

The included exception suggests that the RDF file containing the
Chinese labels is not well formed. Experience shows that this is most
likely related to character encoding issues. This was also the case
with some DBpedia 3.7 files (see the special treatment of some files
in the shell script of the DBpedia indexing tool).

You will need to have a look at the line that caused the error
(labels_zh.nt.bz2; [line: 6972, col: 46] Broken token:
http://www.w3.org/2000/01/rdf-sche). If it is indeed an encoding
related issue, there are some Linux command line utilities to check
and correct such problems. If you are unsure, feel free to post that
line within this thread.
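To inspect the offending line and check the file's encoding from the command line, something like the following should work (a sketch; it assumes the dump file is in the current directory and that `bzcat`, `sed` and `iconv` are available):

```shell
# File name as reported in the log; adjust the path if needed.
FILE=${FILE:-labels_zh.nt.bz2}

# 1) Print the line that RIOT complained about (line 6972), then stop
#    reading the (large) file:
bzcat "$FILE" | sed -n '6972p;6972q'

# 2) Check whether the whole file is valid UTF-8; iconv exits non-zero
#    on the first invalid byte sequence it encounters:
if bzcat "$FILE" | iconv -f UTF-8 -t UTF-8 > /dev/null; then
    echo "valid UTF-8"
else
    echo "encoding problem"
fi
```

If iconv reports a problem, `iconv -f <actual-encoding> -t UTF-8` can often repair the file, provided you can work out the actual source encoding.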

> The other observations are
> 1. the curl thing to generate incoming_links.txt took more than 3 hours and
> generated 2.5GB of this file.

This is expected. Note that this step is only executed once.

> 2. dbpedia 3.8 seem to have Category: and not Cat:. So the step to
> substitute Cat: with Category: is not required now.

That's great.

> 3. After Java exception the program to generate index seems doing nothing
> nor is terminated. I waited overnight and killed it.

The importer may hang after exceptions like the one reported, as
termination is not guaranteed when threads are closed because of an
exception. It is OK to kill the indexing tool if that happens.
Log messages are typically printed every few seconds, so if you do not
see any log output for several minutes (especially after an exception)
you might want to kill the indexing tool.
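Locating and terminating the hung tool can be done from the command line; a sketch, assuming the tool's Java command line contains `org.apache.stanbol.entityhub.indexing` (the pattern is an assumption, adjust it to however you started the tool):

```shell
# Pattern to match against the full command line; adjust as needed.
PATTERN=${PATTERN:-org.apache.stanbol.entityhub.indexing}

# -f matches the full command line; filter out this shell's own PID so
# the script never targets itself.
PID=$(pgrep -f "$PATTERN" | grep -vx "$$" | head -n 1)

if [ -n "$PID" ]; then
    kill "$PID"                                   # try a clean shutdown first
    sleep 5
    kill -0 "$PID" 2>/dev/null && kill -9 "$PID"  # force-kill if still alive
else
    echo "no matching process"
fi
```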

> Previously I did not understand your question
>
> Is this the data for the Entities with the URIs
> "http://zh.dbpedia.org/resource/{name}"?
>
> Now I understand. There is no Chinese dbpedia server running.
>
> http://wiki.dbpedia.org/Internationalization has list of language chapters
> supported for dbpedia now. Chinese is yet to come.
> My intention is to make a stanbol solr chinese dbpedia dump so that I can
> 'spot' keywords in dbpedia better than English dump.
>

Chinese labels for the English DBpedia
("http://dbpedia.org/resource/{name}") should work for that reason.
The Chinese version ("http://zh.dbpedia.org/resource/{name}") would
just provide more Entities (not more information for Entities included
in the English version).

best
Rupert

> -harish
>
> =======================================================
> ttp://www.w3.org/2000/01/rdf-sche
> 08:23:37,914 [Thread-5] ERROR source.ResourceLoader - Unable to load
> resource
> /Users/harishs/Linguistics2/dbpedia/indexing/resources/rdfdata/labels_zh.nt.bz2
> org.openjena.riot.RiotException: [line: 6972, col: 46] Broken token:
> http://www.w3.org/2000/01/rdf-sche
>     at
> org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
>     at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
>     at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:38)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
>     at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
>     at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
>     at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
>     at
> com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
>     at
> com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
>     at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
>     at
> org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:72)
>     at
> org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
>     at
> org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
>     at
> org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:245)
>     at
> org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
>     at java.lang.Thread.run(Thread.java:680)
> 08:23:37,917 [Thread-5] ERROR source.ResourceLoader - Exception while
> loading file
> /Users/harishs/Linguistics2/dbpedia/indexing/resources/rdfdata/labels_zh.nt.bz2
> org.openjena.riot.RiotException: [line: 6972, col: 46] Broken token:
> http://www.w3.org/2000/01/rdf-sche
>     at
> org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
>     at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
>     at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:38)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
>     at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
>     at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
>     at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
>     at
> com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
>     at
> com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
>     at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
>     at
> org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:72)
>     at
> org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
>     at
> org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
>     at
> org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:245)
>     at
> org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
>     at java.lang.Thread.run(Thread.java:680)
> Exception in thread "Thread-5" java.lang.IllegalStateException: Error while
> loading Resource
> /Users/harishs/Linguistics2/dbpedia/indexing/resources/rdfdata/labels_zh.nt.bz2
>     at
> org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.setResourceState(ResourceLoader.java:273)
>     at
> org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:215)
>     at
> org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
>     at
> org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:245)
>     at
> org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
>     at java.lang.Thread.run(Thread.java:680)
> Caused by: org.openjena.riot.RiotException: [line: 6972, col: 46] Broken
> token: http://www.w3.org/2000/01/rdf-sche
>     at
> org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
>     at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
>     at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:38)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
>     at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
>     at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
>     at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
>     at
> com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
>     at
> com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
>     at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
>     at
> org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:72)
>     at
> org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
>     ... 4 more
> =========================================================
>
> On Wed, Aug 15, 2012 at 6:01 PM, harish suvarna <[email protected]> wrote:
>
>> Thanks Rupert. I am making some progress here. I am finding that paoding
>> breaks words into small segments, espcially foreign names. For ex, motorola
>> is broken into two parts (mot, rola), similarly
>> michael is borken into (mik, kael). Now the ngram based dbpedia lookup
>> looks for these in the dbpedia index and cannot find.
>> My segmentation process and dbpedia solr index must both use the same
>> segmenter. There is a paoding analyzer for solr too. I just need to create
>> the solr index for dbpedia using that.
>> Actually now, I have more dbpedia hits in character ngram based dbpedia
>> lookup for chinese than the number of hits I get if I use paoding.
>> We dont know what language analyzers have been used by ogrisel is creating
>> the solr dbpedia dump of 1.19gb.
>>
>> I also experimented with contenthub search for chinese. Right now it does
>> not work. I need to debug that part also. Even the UI in the contenthub
>> does not display the chinese characters. The enhancer UI does display the
>> characters well.
>>
>> Also for English Stanbol, I did play with contenthub. I took a small text
>> as follows.
>> ==============
>>  United States produced an Olympic-record time to win gold in the women's
>> 200m freestyle relay final. A brilliant final leg from Allison Schmitt led
>> the Americans home, ahead of Australia, in a time of seven minutes 42.92
>> seconds. Missy Franklin gave them a great start, while Dana Vollmer and
>> Shannon Vreeland also produced fast times.
>> =====================================================================
>>
>> The above text is properly processed and I get the dbpedia links for all
>> persons, countries in the above. Hoewver, the above piece is related to
>> 'swimming' and this word does not appear at all in the text. In the dbpedia
>> link of Allison Scmitt, the dbpedia categories do tell us that it is in
>> swimming category. Did anyone try to process the categories inside the link
>> and add them as metadata for this content. If we add this, then we add more
>> value than a simple solr based search in content store. Some one in IKS
>> conference demoed this as a semantic search. Any hints/clues on this work ?
>>
>>
>>
>>
>>
>> On Wed, Aug 15, 2012 at 1:25 PM, Rupert Westenthaler <
>> [email protected]> wrote:
>>
>>> On Wed, Aug 15, 2012 at 3:06 AM, harish suvarna <[email protected]>
>>> wrote:
>>> > Is {stanbol-trunk}/entityhub/indeing/dbpedia it different from the
>>> custom
>>> > ontology file tool that is mentioned in
>>> > http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html ?
>>> >
>>>
>>> The custom DBpedia indexing tool comes with a different default
>>> configuration and also with a custolmised Solr schema (schema.xml
>>> file) for dbpedia. Otherwise it is the same software as the generic
>>> RDF indexing tool. Most of the things mentioned in
>>> "customvocabulary.html" are also valid for the dbpedia indexing tool.
>>> Please also notice the readme and the comments in the configuration of
>>> the dbpedia indexing tool.
>>>
>>> > Is it same as the entityhub page in Stanbol localhost:8080?
>>>
>>> This tool was used to create all available dbpedia indexes for Apache
>>> Stanbol. This includes the dbpedia default data (shipped with the
>>> launcher).
>>>
>>> best
>>> Rupert
>>>
>>> >
>>> > -harish
>>> >
>>> >
>>> > On Thu, Aug 9, 2012 at 10:58 PM, Rupert Westenthaler <
>>> > [email protected]> wrote:
>>> >
>>> >> Hi
>>> >>
>>> >>
>>> >> On Fri, Aug 10, 2012 at 1:28 AM, harish suvarna <[email protected]>
>>> >> wrote:
>>> >> > Thanks Rupert for the update.
>>> >> > Meanwhile I am looking at generating custom vocab index page
>>> >> > http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.htmland
>>> >> > trying to know which files I have to use under dbpedia chinese
>>> download
>>> >> > available at http://downloads.dbpedia.org/3.8/zh/
>>> >>
>>> >> Is this the data for the Entities with the URIs
>>> >> "http://zh.dbpedia.org/resource/{name}"?
>>> >>
>>> >> Anyway cool that dbpedia 3.8 got finally released!
>>> >>
>>> >> >
>>> >> > The dbpedia download for chinese has article categories, lables,
>>> >> short/long
>>> >> > abstracts, inter language links. Donot know which ones to use for the
>>> >> > stanbol entityhub custom vocabulary index tool.
>>> >>
>>> >> For linking concepts you need only the labels. If you also include the
>>> >> short abstracts you will also have the mouse over text in the Stanbol
>>> >> Enhancer UI. Geo coordinates are needed for the map in the enhancer
>>> >> UI.
>>> >>
>>> >> You should also include the data providing the rdf:types of the
>>> >> Entities. However I do not know which of the files does include those.
>>> >>
>>> >> Categories are currently not used by Stanbol. If you want to include
>>> >> them you should add (1) the categories (2) categories labels and (3)
>>> >> article categories
>>> >>
>>> >> Note that there is an own Entityhub Indexing Tool for dbpedia
>>> >> {stanbol-trunk}/entityhub/indeing/dbpedia.
>>> >>
>>> >>
>>> >> best
>>> >> Rupert
>>> >>
>>> >> >
>>> >> > -harish
>>> >> >
>>> >> >
>>> >> > On Thu, Aug 9, 2012 at 11:08 AM, Rupert Westenthaler <
>>> >> > [email protected]> wrote:
>>> >> >
>>> >> >> Hi
>>> >> >>
>>> >> >> the dbpedia 3.7 index was build by ogrisel so I do not know the
>>> details.
>>> >> >>
>>> >> >> I think Chinese (zh) labels are included, but the index only
>>> contains
>>> >> >> Entities for Wikipedia pages with 5 or more incoming links.
>>> >> >>
>>> >> >> In addition while  the English DBpedia contains zh labels it will
>>> not
>>> >> >> contain Entities that do not have a counterpart in the English
>>> >> >> Wikipedia.
>>> >> >>
>>> >> >> best
>>> >> >> Rupert
>>> >> >>
>>> >> >> On Thu, Aug 9, 2012 at 1:00 AM, harish suvarna <[email protected]>
>>> >> wrote:
>>> >> >> > I received a USB in IKS conf which contained the 1.19GB of dbpedia
>>> >> full
>>> >> >> > solr index. Does it contain the data from the chinese dump
>>> (available
>>> >> in
>>> >> >> > the dbpedia.org download server under zh folder)?
>>> >> >> >
>>> >> >> > I do get some dbpedia entries for chinese text in stanbol
>>> >> enhancements. I
>>> >> >> > am using the 1.19GB dump. I am expecting some more enhancements
>>> which
>>> >> are
>>> >> >> > present  in wikipedia chinese. Just wondering if chinese dump is
>>> not
>>> >> >> > utilized.
>>> >> >> >
>>> >> >> > -harish
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> | Rupert Westenthaler             [email protected]
>>> >> >> | Bodenlehenstraße 11                             ++43-699-11108907
>>> >> >> | A-5500 Bischofshofen
>>> >> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> | Rupert Westenthaler             [email protected]
>>> >> | Bodenlehenstraße 11                             ++43-699-11108907
>>> >> | A-5500 Bischofshofen
>>> >>
>>>
>>>
>>>
>>> --
>>> | Rupert Westenthaler             [email protected]
>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>> | A-5500 Bischofshofen
>>>
>>
>>



-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
