Hi Rupert,

Thanks for the pointer :) I will configure the indexing tool to get the FOAF sub-set from the Freebase dump, since I believe it has a more than sufficient set of FOAF data. I will use the IRC channel for frequent questions going forward.
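To make the idea of a "FOAF sub-set" concrete, here is a minimal sketch of the two filtering steps suggested in the quoted reply below: keep only statements whose property starts with the FOAF namespace, then drop entities that lack a required property. This is plain Python for illustration only. It is not the Indexing Tool's property-filter configuration and not the Java EntityProcessor interface. The `REQUIRED` set, the function name `foaf_subset`, the simplified regular expression, and the `example.org` sample data are all assumptions made up for this sketch.

```python
import re
from collections import defaultdict

FOAF_NS = "http://xmlns.com/foaf/0.1/"
# Hypothetical choice of "required" properties, for illustration only.
REQUIRED = {FOAF_NS + "name"}

# Simplified pattern for one N-Triples/N-Quads statement: subject IRI,
# predicate IRI, then everything up to the final dot (the object and,
# for quads, the graph label). Blank-node subjects are not handled.
STATEMENT = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$')


def foaf_subset(lines, required=REQUIRED):
    """Group FOAF statements by subject; keep subjects with all required properties."""
    entities = defaultdict(lambda: defaultdict(list))
    for line in lines:
        match = STATEMENT.match(line)
        if not match:
            continue  # blank lines, comments, unsupported syntax
        subject, predicate, rest = match.groups()
        if predicate.startswith(FOAF_NS):  # step 1: property filter
            entities[subject][predicate].append(rest)
    # step 2: required-field check; dropping an entity here plays the
    # role of an entity processor returning null for it
    return {
        subject: dict(props)
        for subject, props in entities.items()
        if required.issubset(props)
    }


# Invented sample data, not taken from any real dataset.
sample = [
    '<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/g> .',
    '<http://example.org/a> <http://purl.org/dc/terms/title> "ignored" <http://example.org/g> .',
    '<http://example.org/b> <http://xmlns.com/foaf/0.1/homepage> <http://example.org/b/home> <http://example.org/g> .',
]
print(sorted(foaf_subset(sample)))  # ['http://example.org/a']
```

In this sample, entity `b` is dropped because it has a FOAF homepage but no `foaf:name`, and the non-FOAF title statement on `a` is ignored.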

Regards,
Dileepa

On Wed, Jun 19, 2013 at 5:07 PM, Rupert Westenthaler <[email protected]> wrote:

> Hi Dileepa,
>
> IMO it would be better if you joined the #stanbol IRC channel on
> freenode.net. This would reduce the RTTs (round-trip times) for
> this kind of question a lot.
>
> For the Freebase indexing I implemented property filters (see
> STANBOL-1016). With these you can specify which triples need to be
> imported and which should be dropped. See the Freebase Indexing
> Tool for details and an example of how to configure it. A good
> starting point would be to only import triples where the property
> starts with the FOAF namespace.
>
> In addition you could write your own EntityProcessor that checks if a
> Resource has all required fields before importing.
> EntityProcessor implementations get the Representation for an
> Entity passed in. It is very easy to write an EntityProcessor that checks
> if values for some properties are present. If not all required values are
> present, you can filter those entities out by returning NULL.
>
> With those two things in place you should easily get a rather good
> quality sub-set of FOAF data from the referenced dataset.
>
> best
> Rupert
>
> On Wed, Jun 19, 2013 at 12:52 PM, Dileepa Jayakody
> <[email protected]> wrote:
> > The link to the data-set project I'm looking at:
> > http://km.aifb.kit.edu/projects/btc-2012/
> >
> > On Wed, Jun 19, 2013 at 4:21 PM, Dileepa Jayakody <[email protected]> wrote:
> >
> >> Hi Rupert et al,
> >>
> >> On Wed, Jun 19, 2013 at 2:27 PM, Rupert Westenthaler <[email protected]> wrote:
> >>
> >>> Hi
> >>>
> >>> On Wed, Jun 19, 2013 at 9:20 AM, Dileepa Jayakody
> >>> <[email protected]> wrote:
> >>> > Hi All,
> >>> >
> >>> > I'm trying out the entityhub indexing tool to configure a site for a sample
> >>> > foaf dataset. My data set (sampleNquads.nx) is in n-quad format. Actually
> >>> > it is a set of links to foaf files from various sources in nquad format.
> >>> >
> >>> > eg:
> >>> > <http://www.agfa.com/> <http://www.agfa.com/global/en/main/index.jsp> .
> >>> > *<http://sebastian.tramp.name/> <http://sebastian.tramp.name/index.rdf> .*
> >>> > <http://gitorious.com/~tobyink> <http://gitorious.org/~tobyink> .
> >>> >
> >>>
> >>> I am not completely sure what you mean by that.
> >>>
> >>
> >> I have misunderstood the N-Quad format, and thought it's just a set of
> >> links to external rdf files.
> >> The sample data-set I used was just a small part of the actual datahub
> >> dataset (>1 GB) and incomplete. That might be the reason the indexer was
> >> not able to index the dataset. :)
> >>
> >>> Generally: Links to RDF files are not supported by the Indexing Tool.
> >>> You will need to download the RDF files to the
> >>> "indexing/resources/rdfdata" directory.
> >>>
> >>> Quad formats are in principle supported by the Indexing Tool. However,
> >>> note that only SPO are used and the Context is dropped during the
> >>> import.
> >>>
> >>> For debugging the indexing process:
> >>>
> >>> * the Indexing Tool logs the number of indexed Entities. You should
> >>>   check this value
> >>> * the IDs of all indexed entities are also stored in
> >>>   "indexing/destination/indexed-entities-ids.zip". After installing the
> >>>   index to Stanbol you can use those IDs to retrieve the available data
> >>>   by using requests like
> >>>   curl -H "Accept: text/turtle" "http://localhost:8080/entityhub/site/{site-name}/entity?id={entity-id}"
> >>>
> >>> > I followed the instructions here [1] and in the ReadMe.md and
> >>> > indexing.properties files of the tool and created a site {datahub} for my
> >>> > data, accessible at: http://localhost:8080/entityhub/site/datahub/
> >>> >
> >>> > However, when I try out sample requests to find entities in the site I get
> >>> > no results.
> >>> > I'm trying to find the entity with *name=Sebastian** which is actually in
> >>> > the sample dataset used above, but I get an empty result set. Can anyone
> >>> > please help me understand what I've done wrong here? Basically I have
> >>> > followed the steps in the init and index executions of the tool.
> >>> >
> >>> > Is it because my dataset is only a set of external links to foaf files?
> >>> > Do I need to manually download the foaf files to the
> >>> > indexing/resources/rdfdata directory?
> >>> >
> >>> > eg:
> >>> >
> >>> > request: curl -X POST -d "name=Sebastian*"
> >>> > http://localhost:8080/entityhub/site/datahub/find
> >>> >
> >>> > result:
> >>> > {
> >>> >     "query": {
> >>> >         "selected": [
> >>> >             "http:\/\/stanbol.apache.org\/ontology\/entityhub\/query#score",
> >>> >             "http:\/\/www.w3.org\/2000\/01\/rdf-schema#label"
> >>> >         ],
> >>> >         "constraints": [{
> >>> >             "type": "text",
> >>> >             "patternType": "wildcard",
> >>> >             "text": "SSebastian Tramp",
> >>> >             "field": "http:\/\/www.w3.org\/2000\/01\/rdf-schema#label"
> >>> >         }],
> >>> >         "limit": 5,
> >>> >         "offset": 0
> >>> >     },
> >>> >     "results": []
> >>> > }
> >>> >
> >>>
> >>> For queries like that you need to make sure that your entities have
> >>> values for "rdfs:label". AFAIK the default
> >>> "indexing/config/mapping.txt" configuration does copy the foaf:name
> >>> value to rdfs:label, but if you work specifically with FOAF data
> >>> you should preferably query for "foaf:name".
> >>>
> >>
> >> Thanks for these useful pointers. I will follow them.
> >>
> >> In general for my GSOC project on FOAF co-reference based disambiguation,
> >> do you think this datahub dataset is useful?
> >> This is the best dataset I found so far other than the already indexed
> >> DBpedia dataset in Stanbol.
> >>
> >> Thanks,
> >> Dileepa
> >>
> >>> best
> >>> Rupert
> >>>
> >>> >
> >>> > Your help is much appreciated here.
> >>> > Thanks,
> >>> > Dileepa
> >>> >
> >>> > [1] http://stanbol.apache.org/docs/trunk/customvocabulary.html
> >>>
> >>> --
> >>> | Rupert Westenthaler [email protected]
> >>> | Bodenlehenstraße 11 ++43-699-11108907
> >>> | A-5500 Bischofshofen
> >>
> >
>
> --
> | Rupert Westenthaler [email protected]
> | Bodenlehenstraße 11 ++43-699-11108907
> | A-5500 Bischofshofen

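The advice in the thread to query "foaf:name" instead of the default rdfs:label could look roughly like the sketch below. It builds the same POST as the curl request quoted above, with one extra form parameter. Treat the `field` parameter as an assumption based on my recollection of the Entityhub REST documentation, and verify it against your Stanbol version. The helper name `build_find_request` is hypothetical. The request is only built and printed here, not sent.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def build_find_request(base_url, site, name, field=FOAF_NAME):
    """Build (but do not send) a POST to the /find endpoint of a referenced site."""
    # ASSUMPTION: /find accepts a "field" form parameter holding the full
    # property URI to search; "name" is the parameter used in the thread.
    body = urlencode({"name": name, "field": field}).encode("ascii")
    url = f"{base_url.rstrip('/')}/entityhub/site/{site}/find"
    return Request(url, data=body, headers={"Accept": "application/json"}, method="POST")


request = build_find_request("http://localhost:8080", "datahub", "Sebastian*")
print(request.full_url)
print(parse_qs(request.data.decode("ascii")))

# To send it against a running Stanbol instance:
# import urllib.request
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

If the site's mapping configuration already copies foaf:name to rdfs:label, the original request without `field` should return the same entities, so comparing both is a quick way to check the mapping.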