Thanks Rupert. I am making some progress here. I am finding that Paoding breaks words into small segments, especially foreign names. For example, "motorola" is broken into two parts (mot, rola); similarly, "michael" is broken into (mik, kael). The n-gram based DBpedia lookup then searches for these fragments in the DBpedia index and cannot find them. My segmentation process and the DBpedia Solr index must both use the same segmenter. There is a Paoding analyzer for Solr too; I just need to create the Solr index for DBpedia using it. Actually, right now I get more DBpedia hits for Chinese with the character n-gram based lookup than I get if I use Paoding. We don't know which language analyzers ogrisel used when creating the 1.19 GB Solr DBpedia dump.
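To illustrate the mismatch: the sketch below is not Paoding itself, just toy stand-in tokenizers, but it shows why a lookup fails when the query-side segmenter differs from the one used to build the index (no token overlap means no hit):

```python
# Toy illustration (not Paoding itself) of why the query-side segmenter
# and the index-side segmenter must match. Both tokenizers below are
# hypothetical stand-ins for real analyzers.

def char_bigrams(text):
    """Character 2-gram tokenizer (CJK-style n-gram analysis)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def toy_word_segmenter(text):
    """Stand-in for a dictionary-based segmenter that splits a foreign
    name into odd fragments, as described above for Paoding."""
    return ["mot", "orola"] if text == "motorola" else [text]

# Index built with character bigrams:
index = set(char_bigrams("motorola"))

# Query analyzed with the word segmenter: no token overlap, so no hit.
query_tokens = toy_word_segmenter("motorola")
hits = [t for t in query_tokens if t in index]
print(hits)  # -> []
```

The fix is the one described above: use the same analyzer on both sides, i.e. build the DBpedia Solr index with the same Paoding analyzer used during enhancement.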
I also experimented with Contenthub search for Chinese. Right now it does not work; I need to debug that part as well. Even the Contenthub UI does not display the Chinese characters, although the Enhancer UI displays them fine.

For English, I also played with the Contenthub in Stanbol. I took a small text as follows:

==============
United States produced an Olympic-record time to win gold in the women's 200m freestyle relay final. A brilliant final leg from Allison Schmitt led the Americans home, ahead of Australia, in a time of seven minutes 42.92 seconds. Missy Franklin gave them a great start, while Dana Vollmer and Shannon Vreeland also produced fast times.
=====================================================================

The above text is properly processed, and I get the DBpedia links for all persons and countries in it. However, the piece is about 'swimming', and that word does not appear in the text at all. The DBpedia categories in the link for Allison Schmitt do tell us that she is in a swimming category. Did anyone try to process the categories inside the link and add them as metadata for this content? If we add this, we add more value than a simple Solr-based search in the content store. Someone at the IKS conference demoed this as a semantic search. Any hints/clues on this work?

On Wed, Aug 15, 2012 at 1:25 PM, Rupert Westenthaler <[email protected]> wrote:

> On Wed, Aug 15, 2012 at 3:06 AM, harish suvarna <[email protected]>
> wrote:
> > Is {stanbol-trunk}/entityhub/indexing/dbpedia different from the custom
> > ontology file tool that is mentioned in
> > http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html ?
>
> The custom DBpedia indexing tool comes with a different default
> configuration and also with a customised Solr schema (schema.xml
> file) for dbpedia. Otherwise it is the same software as the generic
> RDF indexing tool.
> Most of the things mentioned in
> "customvocabulary.html" are also valid for the dbpedia indexing tool.
> Please also notice the readme and the comments in the configuration of
> the dbpedia indexing tool.
>
> > Is it the same as the entityhub page in Stanbol localhost:8080?
>
> This tool was used to create all available dbpedia indexes for Apache
> Stanbol. This includes the dbpedia default data (shipped with the
> launcher).
>
> best
> Rupert
>
> > -harish
> >
> > On Thu, Aug 9, 2012 at 10:58 PM, Rupert Westenthaler <[email protected]> wrote:
> >
> >> Hi
> >>
> >> On Fri, Aug 10, 2012 at 1:28 AM, harish suvarna <[email protected]> wrote:
> >> > Thanks Rupert for the update.
> >> > Meanwhile I am looking at the page on generating a custom vocab index,
> >> > http://incubator.apache.org/stanbol/docs/trunk/customvocabulary.html, and
> >> > trying to work out which files I have to use from the dbpedia chinese
> >> > download available at http://downloads.dbpedia.org/3.8/zh/
> >>
> >> Are these the data for the Entities with the URIs
> >> "http://zh.dbpedia.org/resource/{name}"?
> >>
> >> Anyway cool that dbpedia 3.8 finally got released!
> >>
> >> > The dbpedia download for chinese has article categories, labels,
> >> > short/long abstracts, and inter-language links. I do not know which
> >> > ones to use for the stanbol entityhub custom vocabulary index tool.
> >>
> >> For linking concepts you need only the labels. If you also include the
> >> short abstracts you will also have the mouse-over text in the Stanbol
> >> Enhancer UI. Geo coordinates are needed for the map in the enhancer
> >> UI.
> >>
> >> You should also include the data providing the rdf:types of the
> >> Entities. However I do not know which of the files include those.
> >>
> >> Categories are currently not used by Stanbol.
> >> If you want to include
> >> them you should add (1) the categories, (2) category labels, and (3)
> >> article categories.
> >>
> >> Note that there is its own Entityhub Indexing Tool for dbpedia at
> >> {stanbol-trunk}/entityhub/indexing/dbpedia.
> >>
> >> best
> >> Rupert
> >>
> >> > -harish
> >> >
> >> > On Thu, Aug 9, 2012 at 11:08 AM, Rupert Westenthaler <[email protected]> wrote:
> >> >
> >> >> Hi
> >> >>
> >> >> the dbpedia 3.7 index was built by ogrisel, so I do not know the details.
> >> >>
> >> >> I think Chinese (zh) labels are included, but the index only contains
> >> >> Entities for Wikipedia pages with 5 or more incoming links.
> >> >>
> >> >> In addition, while the English DBpedia contains zh labels, it will not
> >> >> contain Entities that do not have a counterpart in the English
> >> >> Wikipedia.
> >> >>
> >> >> best
> >> >> Rupert
> >> >>
> >> >> On Thu, Aug 9, 2012 at 1:00 AM, harish suvarna <[email protected]> wrote:
> >> >> > I received a USB at the IKS conf which contained the 1.19 GB dbpedia
> >> >> > full solr index. Does it contain the data from the chinese dump
> >> >> > (available on the dbpedia.org download server under the zh folder)?
> >> >> >
> >> >> > I do get some dbpedia entries for chinese text in stanbol
> >> >> > enhancements. I am using the 1.19 GB dump. I am expecting some more
> >> >> > enhancements which are present in the chinese Wikipedia. Just
> >> >> > wondering if the chinese dump is not utilized.
> >> >> >
> >> >> > -harish
> >> >>
> >> >> --
> >> >> | Rupert Westenthaler [email protected]
> >> >> | Bodenlehenstraße 11 ++43-699-11108907
> >> >> | A-5500 Bischofshofen
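Coming back to the category-as-metadata idea: a rough sketch of what I have in mind is below. This is not existing Stanbol code; the `run_query` callback, the metadata layout, and the example category URI are assumptions for illustration. DBpedia exposes article categories as `dcterms:subject` triples, so one could query them per linked entity and index them alongside the document:

```python
# Hypothetical sketch (not part of Stanbol): enrich a document's metadata
# with the DBpedia categories of its linked entities, so a search for
# "swimming" can match a text that never mentions the word.

def category_query(entity_uri):
    """Build a SPARQL query for the dcterms:subject categories of an entity."""
    return (
        "SELECT ?cat WHERE { "
        f"<{entity_uri}> <http://purl.org/dc/terms/subject> ?cat }}"
    )

def enrich_metadata(doc_metadata, entity_uri, run_query):
    """Attach category URIs to the document metadata.

    `run_query` abstracts the SPARQL endpoint call (stubbed below)."""
    cats = run_query(category_query(entity_uri))
    doc_metadata.setdefault("categories", []).extend(cats)
    return doc_metadata

# Example with a stubbed endpoint response (illustrative category URI):
def fake_endpoint(query):
    return ["http://dbpedia.org/resource/Category:American_female_swimmers"]

meta = enrich_metadata({}, "http://dbpedia.org/resource/Allison_Schmitt", fake_endpoint)
print(meta["categories"][0])
```

With the categories stored as document metadata, the Contenthub's Solr index could then match the swimming example above on a "swimming" query even though the word never occurs in the text.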
