Hi Dileepa,

IMO it would be better if you joined the #stanbol IRC channel on
freenode.net. That would reduce the round-trip times (RTTs) for
questions like these a lot.


For the Freebase indexing I implemented property filters (see
STANBOL-1016). With these you can specify which triples should be
imported and which should be dropped. See the Freebase Indexing
Tool for details and an example of how to configure it. A good
starting point would be to import only triples whose property
starts with the FOAF namespace.
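
To illustrate the effect of such a filter, here is a minimal Python
sketch. It is not the configuration syntax of the Indexing Tool; the
function names and the sample statements are made up for this example.
It only shows the idea of keeping N-Triples/N-Quads statements whose
predicate starts with the FOAF namespace.

```python
# Illustration only: NOT the Stanbol property filter configuration.
# Function names and sample data are hypothetical.
FOAF_NS = "http://xmlns.com/foaf/0.1/"


def keep_statement(line, namespace=FOAF_NS):
    """Return True if the predicate IRI of the statement starts with namespace."""
    line = line.strip()
    if not line or line.startswith("#"):
        return False  # skip blank lines and comments
    # Subject and predicate never contain whitespace, so the first two
    # whitespace-separated tokens are the subject and the predicate.
    parts = line.split(None, 2)
    if len(parts) < 3:
        return False  # not a complete statement
    predicate = parts[1]
    return predicate.startswith("<" + namespace)


def filter_statements(lines, namespace=FOAF_NS):
    """Keep only the statements accepted by keep_statement."""
    return [line for line in lines if keep_statement(line, namespace)]


sample = [
    '<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/g> .',
    '<http://example.org/alice> <http://www.w3.org/2000/01/rdf-schema#seeAlso> '
    '<http://example.org/alice.rdf> <http://example.org/g> .',
]
kept = filter_statements(sample)
print(len(kept), "of", len(sample), "statements kept")
```

Only the foaf:name statement passes the check in this sample; the
rdfs:seeAlso statement is dropped.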

In addition you could write your own EntityProcessor that checks whether
a Resource has all required fields before importing.
EntityProcessor implementations get the Representation for an
Entity passed in. It is very easy to write an EntityProcessor that checks
if values for some properties are present. If not all required values
are present, you can filter the Entity out by returning null.
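
The check itself can be sketched as follows. Note that the real
EntityProcessor is a Java interface of the Indexing Tool; this Python
sketch only mirrors the logic. The dict used in place of a
Representation, the process() function and the list of required
properties are assumptions made for the example.

```python
# Illustration only: a plain dict (property IRI -> list of values) stands in
# for a Representation, and None plays the role of Java's null.
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox_sha1sum"

# Hypothetical choice of required properties; adapt to your use case.
REQUIRED_PROPERTIES = (FOAF_NAME, FOAF_MBOX)


def process(representation, required=REQUIRED_PROPERTIES):
    """Return the representation if every required property has at least
    one value, otherwise None so that the entity is not indexed."""
    for prop in required:
        if not representation.get(prop):
            return None  # missing or empty: filter this entity
    return representation


complete = {
    FOAF_NAME: ["Alice Example"],
    FOAF_MBOX: ["0123456789abcdef0123456789abcdef01234567"],
}
incomplete = {FOAF_NAME: ["Bob Example"]}

print(process(complete) is not None)
print(process(incomplete) is None)
```

An entity that lacks any of the required properties, or has an empty
value list for one of them, is filtered; all others pass unchanged.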

With those two things in place you should easily get a reasonably
high-quality subset of FOAF data from the referenced dataset.

best
Rupert

On Wed, Jun 19, 2013 at 12:52 PM, Dileepa Jayakody
<dileepajayak...@gmail.com> wrote:
> The link to the dataset project I'm looking at:
> http://km.aifb.kit.edu/projects/btc-2012/
>
>
> On Wed, Jun 19, 2013 at 4:21 PM, Dileepa Jayakody <dileepajayak...@gmail.com> wrote:
>
>> Hi Rupert et al,
>> On Wed, Jun 19, 2013 at 2:27 PM, Rupert Westenthaler <
>> rupert.westentha...@gmail.com> wrote:
>>
>>> Hi
>>>
>>>
>>> On Wed, Jun 19, 2013 at 9:20 AM, Dileepa Jayakody
>>> <dileepajayak...@gmail.com> wrote:
>>> > Hi All,
>>> >
>>> > I'm trying out the Entityhub indexing tool to configure a site for a sample
>>> > FOAF dataset. My dataset (sampleNquads.nx) is in N-Quads format. Actually
>>> > it is a set of links to FOAF files from various sources in N-Quads format.
>>> >
>>> > eg:
>>> > <http://www.agfa.com/> <http://www.agfa.com/global/en/main/index.jsp> .
>>> > *<http://sebastian.tramp.name/> <http://sebastian.tramp.name/index.rdf> .*
>>> > <http://gitorious.com/~tobyink> <http://gitorious.org/~tobyink> .
>>> >
>>>
>>> I am not completely sure what you mean by that.
>>>
>>
>> I had misunderstood the N-Quads format and thought it was just a set of
>> links to external RDF files.
>> The sample dataset I used was just a small part of the actual datahub
>> dataset (>1 GB) and incomplete. That might be the reason the indexer was
>> not able to index the dataset. :)
>>
>>>
>>> Generally: Links to RDF files are not supported by the Indexing Tool.
>>> You will need to download the RDF files to the
>>> "indexing/resources/rdfdata" directory.
>>>
>>> Quad formats are in principle supported by the Indexing Tool. However,
>>> note that only subject, predicate and object (SPO) are used; the Context
>>> is dropped during the import.
>>>
>>>
>>> For debugging the indexing process:
>>>
>>>   * the Indexing Tool logs the number of indexed Entities. You should
>>> check this value
>>>   * the IDs of all indexed entities are also stored in
>>> "indexing/destination/indexed-entities-ids.zip". After installing the
>>> index to Stanbol you can use those IDs to retrieve the available data
>>> by using requests like:
>>> curl -H "Accept: text/turtle" "http://localhost:8080/entityhub/site/{site-name}/entity?id={entity-id}"
>>>
>>> > I followed the instructions here [1] and in the ReadMe.md and
>>> > indexing.properties files of the tool and created a site {datahub} for my
>>> > data, accessible at: http://localhost:8080/entityhub/site/datahub/
>>> >
>>> > However, when I try out sample requests to find entities in the site I get
>>> > no results.
>>> > I'm trying to find the entity with *name=Sebastian**, which is actually in
>>> > the sample dataset used above, but I get an empty result set. Can anyone
>>> > please help me understand what I've done wrong here? Basically I have
>>> > followed the steps in the init and index executions of the tool.
>>> >
>>> > Is it because my dataset is only a set of external links to FOAF files?
>>> > Do I need to manually download the FOAF files to the
>>> > indexing/resources/rdfdata directory?
>>> >
>>> > eg :
>>> >
>>> > request: curl -X POST -d "name=Sebastian*"
>>> > http://localhost:8080/entityhub/site/datahub/find
>>> >
>>> > result :
>>> > {
>>> >     "query": {
>>> >         "selected": [
>>> >             "http:\/\/stanbol.apache.org\/ontology\/entityhub\/query#score",
>>> >             "http:\/\/www.w3.org\/2000\/01\/rdf-schema#label"
>>> >         ],
>>> >         "constraints": [{
>>> >             "type": "text",
>>> >             "patternType": "wildcard",
>>> >             "text": "SSebastian Tramp",
>>> >             "field": "http:\/\/www.w3.org\/2000\/01\/rdf-schema#label"
>>> >         }],
>>> >         "limit": 5,
>>> >         "offset": 0
>>> >     },
>>> >     "results": []
>>> > }
>>> >
>>>
>>> For queries like that you need to make sure that your entities have
>>> values for "rdfs:label". AFAIK the default
>>> "indexing/config/mapping.txt" configuration copies the foaf:name
>>> value to rdfs:label, but if you work specifically with FOAF data
>>> you should preferably query for "foaf:name".
>>>
>> Thanks for these useful pointers. I will follow them.
>>
>> In general, for my GSoC project on FOAF co-reference based disambiguation,
>> do you think this datahub dataset is useful?
>> This is the best dataset I have found so far, other than the already indexed
>> DBpedia dataset in Stanbol.
>>
>> Thanks,
>> Dileepa
>>
>>
>>> best
>>> Rupert
>>>
>>> >
>>> > Your help is much appreciated here.
>>> > Thanks,
>>> > Dileepa
>>> >
>>> >
>>> > [1] http://stanbol.apache.org/docs/trunk/customvocabulary.html
>>>
>>>
>>>
>>> --
>>> | Rupert Westenthaler             rupert.westentha...@gmail.com
>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>> | A-5500 Bischofshofen
>>>
>>
>>



--
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
