On Mon, Jul 8, 2013 at 2:19 AM, Dileepa Jayakody <dileepajayak...@gmail.com> wrote:
> Hi All,
>
> I continued with the btc2012 dataset to create a foaf-site for Stanbol,
> following your suggestions. Thanks to all for your opinions.
> @Andreas I have updated the foaf-wiki page as you suggested by removing
> obsolete links to foaf data-source projects :)
>
> btc2012 contains data from 5 main sources: datahub, dbpedia, freebase,
> rest and timbl.
> Since Stanbol already has the dbpedia and freebase datasets integrated, I
> used only the datahub and timbl datasets to create a foaf-site.
> I used the
> datahub/data-3.nq.gz <http://km.aifb.kit.edu/projects/btc-2012/datahub/data-3.nq.gz>
> and
> timbl/data-6.nq.gz <http://km.aifb.kit.edu/projects/btc-2012/timbl/data-6.nq.gz>
> datasets, both of size ~1GB.
>
> For the foaf-site creation and indexing process, I used the generic-rdf
> indexing tool [1].
> The following is the process I used to create a foaf-site for Stanbol from
> the btc2012 dataset.
>
> *Steps*
>
> 1. Build the generic-rdf indexing tool using *mvn clean install*.
>
> 2. Initialize the tool with the command below:
>
>      java -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.12.0-SNAPSHOT.jar init
>
> This initialization command creates the directories the indexing tool
> uses during the indexing process.
>
> 3. Configure the tool to filter foaf entities.
> ${indexingToolDir}/indexing/config is the main configuration directory of
> the tool.
> 3.1. To filter entities which define foaf properties, configure the entry
> below in indexing.properties:
>
>      entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,config:indexingsource,bnode:true
>
> (Please note that the additional bnode:true parameter above is activated to
> process blank nodes in the dataset.)
>
> The above entityDataIterable configuration requires 2 additional
> configuration files: indexingsource.properties and propertyfilter.config.
> These files are not included in the generic-rdf indexing tool by default.
> You can use the 2 files from the freebase indexing tool at [2] for
> filtering. Copy the 2 files into ${indexingToolDir}/indexing/config
> and add the entry below to propertyfilter.config:
>
>      foaf:*
>
> This entry instructs the tool to filter for entities which define some
> property in the foaf namespace.
>
> 3.2. Configure the FieldValueFilter to index only foaf:Person and
> foaf:Organization type entities by activating 'values' as below:
>
>      values=foaf:Person;foaf:Organization
>
> 3.3. Check that the above entity filtering (FieldValueFilter) is enabled in
> indexing.properties by searching for the entry below:
>
>      entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:entityTypes;
>
> 4. Change the 'name' value in indexing.properties to a suitable new Site
> name (eg: foaf-site) and run the indexing tool using the command below:
>
>      java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.genericrdf-0.12.0-SNAPSHOT.jar index
>
> Don't forget to copy the n-quad data files downloaded from btc2012 to the
> ${indexingToolDir}/indexing/resources/rdfdata directory prior to executing
> the indexing command :)
>
> 5. The above will execute the entity importing and indexing process and
> create 2 files in the ${indexingToolDir}/indexing/dist directory.
> Copy the generated org.apache.stanbol.data.site.foaf-site-1.0.0.jar to the
> ${stanbol}/fileinstall directory.
> Copy the generated foaf-site.solrindex.zip to the ${stanbol}/datafiles
> directory.
>
> 6. Launch the Stanbol server using the full-launcher and access the
> foaf-site at: localhost:8080/entityhub/site/foaf-site
>
> So with this I have completed the first milestone I had in mind for my
> project.
> The next task is to identify and define the set of foaf properties which
> are going to be used as keys in the disambiguation algorithm.
> This task also includes developing an EntityProcessor to filter foaf
> entities further by allowing only the entities which have the
> disambiguation properties identified above.
>
> Your thoughts and opinions on moving forward are highly appreciated.
>
> Thanks,
> Dileepa
>
> [1] https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/genericrdf
> [2] https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase
>
> On Thu, Jun 27, 2013 at 11:00 AM, Andreas Kuckartz <a.kucka...@ping.de> wrote:
>
>> Dileepa Jayakody:
>> > In the foaf-wiki site [1] there are many datasource projects but many
>> > of them are out of date.
>>
>> If possible please take a few minutes to update that Wiki page.
>>
>> > Can I please have your opinions on finalizing a dataset for my
>> > project?
>>
>> The main criteria in my opinion should be:
>> - how much effort is necessary?
>> - how much data can be expected regarding "co-reference"?
>>
>> That being said I think that the btc dataset would be a good choice. It
>> was created to be used in projects such as yours.
>>
>> Cheers,
>> Andreas
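
For reference, the configuration entries from steps 3 and 4 of the quoted message can be collected into a small shell sketch. This is only an illustration under a few assumptions that are not in the original mail: the INDEXING_TOOL_DIR variable and the ".foaf-fragment" file suffix are hypothetical names, and the file name entityTypes.properties is inferred from the "config:entityTypes" reference. The sketch writes into a scratch directory by default, so the entries can be reviewed and merged by hand rather than overwriting a real configuration.

```shell
#!/bin/sh
set -eu

# Assumption: a scratch directory stands in for ${indexingToolDir}. Point
# INDEXING_TOOL_DIR (hypothetical variable) at a real tool directory only
# after reviewing the generated files.
INDEXING_TOOL_DIR="${INDEXING_TOOL_DIR:-$(mktemp -d)}"
CONFIG_DIR="$INDEXING_TOOL_DIR/indexing/config"
mkdir -p "$CONFIG_DIR"

# Steps 3.1, 3.3 and 4: entries for indexing.properties, copied from the mail.
# Written to a separate fragment file (hypothetical suffix) to be merged by hand.
cat > "$CONFIG_DIR/indexing.properties.foaf-fragment" <<'EOF'
name=foaf-site
entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,config:indexingsource,bnode:true
entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:entityTypes;
EOF

# Step 3.1: add the foaf namespace entry to propertyfilter.config.
printf '%s\n' 'foaf:*' >> "$CONFIG_DIR/propertyfilter.config"

# Step 3.2: restrict indexing to foaf:Person and foaf:Organization.
# Assumption: the FieldValueFilter reads this from entityTypes.properties.
printf '%s\n' 'values=foaf:Person;foaf:Organization' \
  > "$CONFIG_DIR/entityTypes.properties.foaf-fragment"

echo "Wrote foaf-site configuration fragments to $CONFIG_DIR"
```

After merging the fragments, the init and index commands from steps 2 and 4 are run exactly as given in the quoted message.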