Hi All,

Following your suggestions, I went ahead with the btc2012 dataset to create
a foaf-site for Stanbol.
Thanks to all of you for your opinions. @Andreas I have updated the
foaf-wiki page as you suggested by removing the obsolete links to
foaf data-source projects :)

btc2012 contains data from 5 main sources: datahub, dbpedia, freebase, rest
and timbl.
Since Stanbol already has the dbpedia and freebase datasets integrated, I
used only the datahub and timbl datasets to create a foaf-site.
I used the
datahub/data-3.nq.gz <http://km.aifb.kit.edu/projects/btc-2012/datahub/data-3.nq.gz>
and
timbl/data-6.nq.gz <http://km.aifb.kit.edu/projects/btc-2012/timbl/data-6.nq.gz>
datasets, each about 1GB in size.
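
As a rough sanity check before indexing, a sketch like the following could
count how many distinct subjects in an N-Quads dump are typed as foaf:Person
or foaf:Organization. It is not part of the indexing tool; the function names
are made up for illustration, and it assumes each statement sits on one line
with whitespace-separated terms, as in the .nq.gz files above.

```python
import gzip

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
FOAF_TYPES = {
    "<http://xmlns.com/foaf/0.1/Person>",
    "<http://xmlns.com/foaf/0.1/Organization>",
}


def count_foaf_typed_subjects(lines):
    """Count distinct subjects with rdf:type foaf:Person or foaf:Organization.

    `lines` is any iterable of N-Quads statements (one per line).
    """
    subjects = set()
    for line in lines:
        # Subject, predicate and an IRI object never contain whitespace,
        # so the first three tokens are enough for a type check.
        parts = line.split(None, 3)
        if len(parts) < 3:
            continue  # blank or malformed line
        subject, predicate, obj = parts[0], parts[1], parts[2]
        if predicate == RDF_TYPE and obj in FOAF_TYPES:
            subjects.add(subject)
    return len(subjects)


def count_in_dump(path):
    """Run the count over a gzip-compressed N-Quads file (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_foaf_typed_subjects(handle)
```

For example, count_in_dump("data-3.nq.gz") would scan the downloaded datahub
file. The subject set is held in memory, so this is only a sketch for files of
moderate size.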

For the foaf-site creation and indexing process, I used the generic-rdf
indexing tool [1].
The following is the process I used to create a foaf-site for Stanbol from
the btc2012 dataset.

*Steps*

1. Build the generic-rdf indexing tool using *mvn clean install*.

2. Initialize the tool with the command below:
*java -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.12.0-SNAPSHOT.jar init*
This initialization command creates the directories that the indexing tool
uses during the indexing process.

3. Configure the tool to filter foaf entities.
${indexingToolDir}/indexing/config is the main configuration directory of
the tool.
3.1. To filter for entities that define foaf properties, configure the
entry below in indexing.properties:

*entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,config:indexingsource,bnode:true*

(Please note that the additional bnode:true parameter above is set so that
blank nodes in the dataset are processed.)

The entityDataIterable configuration above requires 2 additional
configuration files: indexingsource.properties and propertyfilter.config.
These files are not included in the generic-rdf indexing tool by default.
You can use the 2 files from the freebase indexing tool at [2] for
filtering. Copy the 2 files into ${indexingToolDir}/indexing/config and add
the entry below (foaf: followed by the wildcard *) to propertyfilter.config:

*foaf:**

This entry instructs the tool to filter for entities that define at least
one property in the foaf namespace.

3.2. Configure the FieldValueFilter to index only entities of type
foaf:Person and foaf:Organization by enabling the 'values' entry as below.
*values=foaf:Person;foaf:Organization*

3.3. Check that the entity filtering above (FieldValueFilter) is enabled in
indexing.properties by searching for the entry below.

*entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:entityTypes;*

4. Change the 'name' value in indexing.properties to a suitable new Site
name (e.g. foaf-site) and run the indexing tool using the command below:
*java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.12.0-SNAPSHOT.jar index*

5. The command above runs the entity extraction and indexing process and
creates 2 files in the ${indexingToolDir}/indexing/dist directory.
Copy the generated org.apache.stanbol.data.site.foaf-site-1.0.0.jar to the
${stanbol}/fileinstall directory.
Copy the generated foaf-site.solrindex.zip to the ${stanbol}/datafiles
directory.

6. Launch the Stanbol server using the full launcher and access the
foaf-site at: localhost:8080/entityhub/site/foaf-site
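
Steps 5 and 6 could be scripted roughly as below. This is only a sketch: the
helper names (deploy_site, site_url, site_is_up) are hypothetical, the file
names follow the foaf-site naming from step 5, and it assumes Stanbol serves
plain HTTP on localhost:8080 as in step 6.

```python
import shutil
import urllib.error
import urllib.request
from pathlib import Path


def deploy_site(dist_dir, stanbol_dir, site_name="foaf-site"):
    """Copy the two generated files into the Stanbol directories from step 5."""
    dist = Path(dist_dir)
    stanbol = Path(stanbol_dir)
    bundle = dist / f"org.apache.stanbol.data.site.{site_name}-1.0.0.jar"
    index = dist / f"{site_name}.solrindex.zip"

    # Fail before copying anything if the indexing run did not produce both files.
    for source in (bundle, index):
        if not source.is_file():
            raise FileNotFoundError(source)

    copied = []
    for source, target_dir in ((bundle, stanbol / "fileinstall"),
                               (index, stanbol / "datafiles")):
        # The directories normally exist already; creating them is harmless.
        target_dir.mkdir(parents=True, exist_ok=True)
        copied.append(Path(shutil.copy2(source, target_dir)))
    return copied


def site_url(site_name, host="localhost", port=8080):
    """Build the site URL from step 6 (the http scheme is an assumption)."""
    return f"http://{host}:{port}/entityhub/site/{site_name}"


def site_is_up(url, timeout=5):
    """Return True if the URL answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

After the server has started, site_is_up(site_url("foaf-site")) gives a quick
check that the new site is reachable.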

With this I have completed the first milestone I had in mind for my
project.
The next task is to identify and define the set of foaf properties that
will be used as keys in the disambiguation algorithm. This task also
includes developing an EntityProcessor to filter foaf entities further,
keeping only the entities that have the disambiguation properties
identified above.
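
To make the planned filtering concrete, here is a small sketch of the idea in
Python. It is not the EntityProcessor itself, which would be written against
Stanbol's Java API. The key set below is purely a placeholder, since the
actual disambiguation properties are still to be decided, and entities are
modelled simply as dicts mapping property URIs to lists of values.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Placeholder keys for illustration only; the real set is not yet defined.
DISAMBIGUATION_KEYS = frozenset({
    FOAF + "name",
    FOAF + "mbox_sha1sum",
    FOAF + "homepage",
})


def has_disambiguation_keys(entity, keys=DISAMBIGUATION_KEYS, minimum=1):
    """True if the entity has values for at least `minimum` of the key properties."""
    present = sum(1 for key in keys if entity.get(key))
    return present >= minimum


def filter_entities(entities, keys=DISAMBIGUATION_KEYS, minimum=1):
    """Keep only the entities that carry enough disambiguation properties."""
    return [e for e in entities if has_disambiguation_keys(e, keys, minimum)]
```

Raising `minimum` would make the filter stricter, at the cost of dropping
entities that are only sparsely described.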

Your thoughts and opinions on how to move forward are highly appreciated.

Thanks,
Dileepa

[1] https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/genericrdf
[2] https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase

On Thu, Jun 27, 2013 at 11:00 AM, Andreas Kuckartz <a.kucka...@ping.de> wrote:

> Dileepa Jayakody:
> > In the foaf-wiki site [1] there are many datasource projects but many
> > of them are out of date.
>
> If possible please take a few minutes to update that Wiki page.
>
> > Can I please have your opinions on finalizing a dataset for my
> > project?
>
> The main criteria in my opinion should be:
> - how much effort is necessary ?
> - how much data can be expected regarding "co-reference" ?
>
> That being said I think that the btc dataset would be a good choice. It
> was created to be used in projects such as yours.
>
> Cheers,
> Andreas
>
