On Tue, Feb 15, 2011 at 8:05 PM, Olivier Grisel <[email protected]> wrote:
> Great. Do you think it would be possible to have a default
> configuration for a small index of the top 10000 entities as measured
> by popularity?

Yes the indexer can be configured to build specialized indexes. However to really make it easy to use I would need to implement some improvements.

To use the current version, have a look at the /entityhub/indexer/dbPedia bundle:

(1) Use "mvn assembly:assembly" to build the jar with all dependencies.
(2) Copy the jar to a different directory (otherwise "mvn clean" might delete files you want to keep).
(3) Use "java -jar org.apache.stanbol.entityhub.indexing.dbPedia-0.1-SNAPSHOT-jar-with-dependencies.jar -h" to see the options.

Parameters:

The first parameter is the URL of the Solr core used for indexing. You will want to configure a separate core for the dbPedia index.

The second parameter is the path to a directory with the RDF dump of DBPedia. Files can be found at "http://wiki.dbpedia.org/Downloads36". Download the files you need and put them into a directory. The indexer will automatically pick up all the files in it.

Options:

-i : The file with the incoming links. You should know better than I do how to create such files, because you provided me with the one I used to create my index ("incoming-counts.tsv"). Note that this file is based on an older version of the dbPedia dump; because of that, newer entities will not be ignored during indexing.
-ri : The minimum number of incoming links an entity needs in order to be included in the index. This can be used to control the size of the index.
-s : This is very handy to resume the indexing if you have already completed the import of the RDF data.
-r : Resume mode. Can also be used to activate the entity ranking based indexing mode (see NOTE below).

IMPORTANT NOTE: For building small indices (number of indexed entities << number of entities in the dataset) it will be faster to activate the "-r" switch. The generic RDF Indexer has two modes for iterating over the entities in the dataset: first, by iterating over all triples, and second, by using the entity ranking (the file passed with the -i option).
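
As a rough illustration of picking a value for -ri: a minimal Python sketch that derives the threshold for a target index size (such as the top 10000 entities asked about above) from the incoming-links file. It assumes, without confirmation from the indexer itself, that each row of the file is an entity URI and a count separated by a tab. The function name min_incoming_links and the sample rows are hypothetical.

```python
import csv
import io


def min_incoming_links(tsv_lines, top_n):
    """Return the smallest incoming-link count among the top_n most linked entities.

    ASSUMPTION: each row is "<entity URI>\t<count>". Adapt the column index
    if the real incoming-counts file is laid out differently.
    """
    counts = []
    reader = csv.reader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        try:
            counts.append(int(row[1]))
        except ValueError:
            continue  # skip header rows or non-numeric counts
    if not counts:
        raise ValueError("no usable rows found")
    counts.sort(reverse=True)
    # If there are fewer than top_n entities, every entity qualifies.
    return counts[min(top_n, len(counts)) - 1]


# Hypothetical sample data standing in for the real file.
sample = io.StringIO(
    "http://dbpedia.org/resource/A\t50\n"
    "http://dbpedia.org/resource/B\t7\n"
    "http://dbpedia.org/resource/C\t120\n"
    "http://dbpedia.org/resource/D\t3\n"
)
threshold = min_incoming_links(sample, top_n=2)
print(threshold)
```

For the real file one would pass an open file handle instead of the sample, e.g. open("incoming-counts.tsv", newline=""), with top_n=10000. Entities tied at the threshold are all included, so the resulting index can end up slightly larger than top_n.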

The first method (iterating over all triples) is ~5 times faster than the second, but if one only indexes a small subset of the entities, the entity ranking based indexing mode will still be more efficient. On my laptop it needed around 3 days to build the index, but this was mainly limited by the ~100 IO operations/sec of the hard disk.

> I am also thinking of building maven artifacts to embed the opennlp
> models in version 1.5 without checking them in the Stanbol svn repo. I
> could help you bundle a set of small entity indexes.

That would be a cool thing to do. I am especially interested in finding a good way to provide configurations for the Entityhub (in particular a default config, so that the Entityhub can be used without any required configuration). Adding new Referenced Sites by copying special bundles to a config directory (e.g. by using http://felix.apache.org/site/apache-felix-file-install.html) would be another great thing to do.

> Also could you write a howto for building indexes? I think such howto
> should better be written as text file in the stanbol source tree or
> better as a new documentation page for the stanbol website (using the
> markdown syntax) rather than a new wikipage on the IKS wiki).

I do not plan to update the documentation on the IKS wiki. Looking at the Stanbol website and starting to move/adapt the existing documentation has been on my TODO list for some weeks. However, I fear that I will only have time to start with this after the Semantic Interaction Framework Hackathon, February 24th-26th, in Vienna.

> As soon as you have such an howto ready I would be glad to write a
> bunch of pig scripts to build indexes for topics (rather than
> entities) so as to be able to perform document level topic assignment
> rather than occurrence-based entity lookups.

OK, I do not really understand what you mean by that.

best
Rupert

> --
> Olivier

--
| Rupert Westenthaler [email protected]
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen
