Hi All,

Following your suggestions, I continued with the btc2012 dataset to create a foaf-site for Stanbol. Thanks to all for sharing your opinions.

@Andreas: I have updated the foaf-wiki page as you suggested by removing obsolete links to foaf data-source projects :)

btc2012 contains data from 5 main sources: datahub, dbpedia, freebase, rest and timbl. Since Stanbol already has the dbpedia and freebase datasets integrated, I used only the datahub and timbl datasets to create a foaf-site. I used the following two dumps, each ~1GB in size:

   datahub/data-3.nq.gz <http://km.aifb.kit.edu/projects/btc-2012/datahub/data-3.nq.gz>
   timbl/data-6.nq.gz <http://km.aifb.kit.edu/projects/btc-2012/timbl/data-6.nq.gz>

For the foaf-site creation and indexing process, I used the generic-rdf indexing tool [1]. Following is the process I used to create a foaf-site for Stanbol using the btc2012 dataset.

*Steps*

1. Build the generic-rdf indexing tool using:

   mvn clean install

2. Initialize the tool with the command below:

   java -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.12.0-SNAPSHOT.jar init

   This initialization command creates the directories the indexing tool uses during the indexing process.

3. Configure the tool to filter foaf entities. ${indexingToolDir}/indexing/config is the main configuration directory of the tool.

3.1. To filter entities which define foaf properties, configure the entry below in indexing.properties:

   entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,config:indexingsource,bnode:true

   (Please note that the additional bnode:true parameter above is set so that blank nodes in the dataset are processed.)

   The entityDataIterable configuration above requires 2 additional configuration files: indexingsource.properties and propertyfilter.config. These files are not included in the generic-rdf indexing tool by default. You can use the 2 files from the freebase indexing tool at [2] for filtering purposes. Copy the 2 files into ${indexingToolDir}/indexing/config and add the entry below to propertyfilter.config:

   foaf:*

   This entry instructs the tool to filter for entities which define some property in the foaf namespace.

3.2. Configure the FieldValueFilter to index only foaf:Person and foaf:Organization type entities by activating 'values' as below:

   values=foaf:Person;foaf:Organization

3.3. Check that the entity filtering above (FieldValueFilter) is enabled in indexing.properties by searching for the entry below:

   entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:entityTypes;

4. Change the 'name' value in indexing.properties to a suitable new Site name (eg: foaf-site) and run the indexing tool using the command below:

   java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.12.0-SNAPSHOT.jar index

5. The command above executes the entity extraction and indexing process and creates 2 files in the ${indexingToolDir}/indexing/dist directory.
   Copy the generated org.apache.stanbol.data.site.foaf-site-1.0.0.jar to the ${stanbol}/fileinstall directory.
   Copy the generated foaf-site.solrindex.zip to the ${stanbol}/datafiles directory.

6. Launch the Stanbol server using the full-launcher and access the foaf-site at:

   localhost:8080/entityhub/site/foaf-site

So with this I have completed the first milestone I had in mind for my project. The next task is to identify and define the set of foaf properties which are going to be used as keys in the disambiguation algorithm. This task also includes developing an EntityProcessor to filter foaf entities further, allowing only the entities which have the disambiguation properties identified above.

Your thoughts and opinions on moving forward are highly appreciated.

Thanks,
Dileepa

[1] https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/genericrdf
[2] https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase

On Thu, Jun 27, 2013 at 11:00 AM, Andreas Kuckartz <a.kucka...@ping.de> wrote:

> Dileepa Jayakody:
> > In the foaf-wiki site [1] there are many datasource projects but many
> > of them are out of date.
>
> If possible please take a few minutes to update that Wiki page.
>
> > Can I please have your opinions on finalizing a dataset for my
> > project?
>
> The main criteria in my opinion should be:
> - how much effort is necessary?
> - how much data can be expected regarding "co-reference"?
>
> That being said I think that the btc dataset would be a good choice. It
> was created to be used in projects such as yours.
>
> Cheers,
> Andreas
>
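
A rough sanity check for the type filter in step 3.2: before running the full index, it can help to estimate how many subjects in a dump are typed foaf:Person or foaf:Organization. The Python sketch below is only an illustration and is not part of the Stanbol indexing tool. The function names (foaf_typed_subjects, count_in_dump) and the regex-based handling of N-Quads lines are assumptions made for this example; the regex only recognises statements whose object is an IRI and does not validate the full N-Quads grammar.

```python
import gzip
import re
from typing import Iterable, Set

FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# The two types configured in step 3.2 (values=foaf:Person;foaf:Organization).
WANTED_TYPES = {FOAF + "Person", FOAF + "Organization"}

# Simplified pattern (assumption): subject is an IRI or blank node,
# predicate and object are IRIs. The optional graph term and the
# closing dot are ignored; statements with literal objects do not match.
STATEMENT_RE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s+<([^>]*)>\s+<([^>]*)>')


def foaf_typed_subjects(lines: Iterable[str]) -> Set[str]:
    """Return the distinct subjects typed foaf:Person or foaf:Organization."""
    subjects = set()
    for line in lines:
        match = STATEMENT_RE.match(line)
        if match and match.group(2) == RDF_TYPE and match.group(3) in WANTED_TYPES:
            subjects.add(match.group(1))
    return subjects


def count_in_dump(path: str) -> int:
    """Count matching subjects in a gzip-compressed N-Quads dump."""
    # errors="replace" keeps the scan going if a line is not valid UTF-8.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return len(foaf_typed_subjects(handle))
```

For example, count_in_dump("data-3.nq.gz") would scan the datahub dump named above, assuming it has been downloaded to the current directory. The result is only an estimate: blank node labels are not guaranteed to be unique across source documents in the dump, and the set of subjects is held in memory.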