Author: rwesten
Date: Mon May 16 10:48:53 2011
New Revision: 1103688
URL: http://svn.apache.org/viewvc?rev=1103688&view=rev
Log:
STANBOL-187: Updated the README files for the indexing utilities
- The README files now use markdown syntax
- Added a new section that explains the distribution files
- Added a new section that links to the /data/sites/ modules used to use the
created indexes as Referenced Sites for the Stanbol Entityhub
other:
- removed the genericrdf module from the indexing reactor
- added data to the stanbol reactor
Added:
incubator/stanbol/trunk/entityhub/indexing/dblp/README.md (contents,
props changed)
- copied, changed from r1103627,
incubator/stanbol/trunk/entityhub/indexing/dblp/README.txt
incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.md (contents,
props changed)
- copied, changed from r1103627,
incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.txt
incubator/stanbol/trunk/entityhub/indexing/geonames/README.md
Removed:
incubator/stanbol/trunk/entityhub/indexing/dblp/README.txt
incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.txt
Modified:
incubator/stanbol/trunk/entityhub/indexing/dbpedia/pom.xml
incubator/stanbol/trunk/entityhub/indexing/pom.xml
incubator/stanbol/trunk/pom.xml
Copied: incubator/stanbol/trunk/entityhub/indexing/dblp/README.md (from
r1103627, incubator/stanbol/trunk/entityhub/indexing/dblp/README.txt)
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/entityhub/indexing/dblp/README.md?p2=incubator/stanbol/trunk/entityhub/indexing/dblp/README.md&p1=incubator/stanbol/trunk/entityhub/indexing/dblp/README.txt&r1=1103627&r2=1103688&rev=1103688&view=diff
==============================================================================
--- incubator/stanbol/trunk/entityhub/indexing/dblp/README.txt (original)
+++ incubator/stanbol/trunk/entityhub/indexing/dblp/README.md Mon May 16
10:48:53 2011
@@ -1,46 +1,85 @@
-Indexer for the DBLP dataset (see http://dblp.uni-trier.de/)
+# Indexer for the [DBLP](http://dblp.uni-trier.de/) dataset.
This Tool creates a full cache for DBLP based on the RDF Dump available at
http://dblp.l3s.de/dblp.rdf.gz
-Building:
-========
+## Building:
+
If not yet built by the build process of the Entityhub, call
- mvn install
+
+ mvn install
+
in this directory.
If the build succeeds go to the /target directory and copy the
- org.apache.stanbol.entityhub.indexing.dblp-*-jar-with-dependencies.jar
+
+ org.apache.stanbol.entityhub.indexing.dblp-*-jar-with-dependencies.jar
+
to the directory you would like to start the indexing.
-Index:
-==================
+## Index:
+
+### (1) Initialise the configuration
-(1) Initialise the configuration by calling
-java -jar org.apache.stanbol.entityhub.indexing.dblp-*-jar-with-dependencies.jar init
+The default configuration is initialised by calling
+
+ java -jar org.apache.stanbol.entityhub.indexing.dblp-*-jar-with-dependencies.jar init
This will create a sub-folder with the name indexing in the current directory.
Within this folder all the
- - configurations (indexing/config)
- - source files (indexing/resources)
- - created files (indexing/destination)
- - distribution files (indexing/distribution)
+
+* configurations (indexing/config)
+* source files (indexing/resources)
+* created files (indexing/destination)
+* distribution files (indexing/distribution)
+
will be located.
-(2) Download the Source File:
+### (2) Download the Source File:
Download the DBLP RDF dump from http://dblp.l3s.de/dblp.rdf.gz to
"indexing/resources/rdfData" and rename it to "dblp.nt.gz" (because this file
does not use rdf/xml but N-Triples).
You can use the following two commands to accomplish this step
-curl -C - -O http://dblp.l3s.de/dblp.rdf.gz
-mv dblp.rdf.gz indexing/resources/rdfData/dblp.rdf.gz
+ curl -C - -O http://dblp.l3s.de/dblp.rdf.gz
+ mv dblp.rdf.gz indexing/resources/rdfData/dblp.nt.gz
-(3) Start the indexing by calling
-java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.dblp-*-jar-with-dependencies.jar index
+### (3) Start the indexing by calling
+
+ java -Xmx1024m -jar org.apache.stanbol.entityhub.indexing.dblp-*-jar-with-dependencies.jar index
Note that calling the utility with the option -h will print the help.
Indexing took about 3h on a normal hard disk and about 40min on a SSD (on a
-2010 MacBook Pro).
\ No newline at end of file
+2010 MacBook Pro).
+
+### (4) Using the precomputed Index:
+
+After the indexing completes the distribution folder will contain two files
+
+1. dblp.solrindex.ref: This contains the configuration for the SolrIndex. It does
+not contain the data and is intended to be used to provide configurations without
+the need to also include the precomputed index. When loading this file to
+Apache Stanbol (typically via the Apache Sling Installer Framework) the
+Stanbol DataFileProvider service will ask for the binary data.
+
+2. dblp.solrindex.zip: This is the ZIP archive with the precomputed data.
+Typically you will need to copy this file to the data directory of the
+Apache Stanbol DataFileProvider (defaults to "sling/datafiles").
+
+## Using DBLP as Referenced Site of the Entityhub
+
+The necessary configurations needed to use DBLP as referenced site for the
+Apache Stanbol Entityhub are provided by the "Apache Stanbol Data: DBLP"
+bundle.
+
+See [{stanbol}/data/sites/dblp](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/data/sites/dblp)
+
+The README of this Bundle provides details about the installation process.
+During the installation the "dblp.solrindex.zip" created by this utility is
+needed.
+
+
+
+
Propchange: incubator/stanbol/trunk/entityhub/indexing/dblp/README.md
------------------------------------------------------------------------------
svn:mime-type = text/plain
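
The four DBLP steps described in the README diff above (initialise, download, rename, index) can be sketched as a small shell helper. This is only an illustration under stated assumptions: the `DRY_RUN` switch, the `run` helper and the `dblp_steps` function name are hypothetical and not part of the indexing utility. The commands themselves are the ones quoted in the README, with the dump stored as "dblp.nt.gz" as the README text asks.

```shell
#!/bin/sh
# Sketch of the DBLP indexing steps from the README.
# Hypothetical names (not part of Stanbol): DRY_RUN, run, dblp_steps.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

dblp_steps() {
    jar="$1"   # path to the *-jar-with-dependencies.jar copied from target/
    # (1) create the "indexing" folder with the default configuration
    run java -jar "$jar" init
    # (2) download the dump; it is N-Triples, so store it as .nt.gz
    run curl -C - -O http://dblp.l3s.de/dblp.rdf.gz
    run mv dblp.rdf.gz indexing/resources/rdfData/dblp.nt.gz
    # (3) start the indexing
    run java -Xmx1024m -jar "$jar" index
}
```

With the default `DRY_RUN=1`, calling `dblp_steps` with the path to the jar only prints the four commands, which makes it easy to review them before running with `DRY_RUN=0`.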
Copied: incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.md (from
r1103627, incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.txt)
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.md?p2=incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.md&p1=incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.txt&r1=1103627&r2=1103688&rev=1103688&view=diff
==============================================================================
--- incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.txt (original)
+++ incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.md Mon May 16
10:48:53 2011
@@ -1,39 +1,50 @@
-Indexer for the DBpedia dataset (see http://dbpedia.org/)
+# Indexer for the DBpedia dataset (see http://dbpedia.org/)
This Tool creates local indexes of DBpedia to be used with the Stanbol
Entityhub.
-Building:
-========
+## Building:
+
If not yet built by the build process of the Entityhub, call
- mvn install
+
+ mvn install
+
in this directory.
If the build succeeds go to the /target directory and copy the
- org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar
+
+ org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar
+
to the directory you would like to start the indexing.
-Index:
-==================
+## Index:
+
+### (1) Initialise the configuration
+
+The configuration can be initialised with the defaults by calling
-(1) Initialise the configuration by calling
-java -jar
org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar init
+ java -jar
org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar init
This will create a sub-folder with the name indexing in the current directory.
Within this folder all the
- - configurations (indexing/config)
- - source files (indexing/resources)
- - created files (indexing/destination)
- - distribution files (indexing/distribution)
+
+* configurations (indexing/config)
+* source files (indexing/resources)
+* created files (indexing/destination)
+* distribution files (indexing/distribution)
+
will be located.
The indexing itself can be started by
-java -jar
org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar index
+
+ java -jar
org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar index
+
but before doing this please note the points (2), (3) and (4)
-(2) Download the dbPedia Dump Files:
+### (2) Download the dbPedia Dump Files:
All RDF dumps need to be copied to the directory
- indexing/resources/rdfData
+
+ indexing/resources/rdfData
The RDF dump of DBpedia.org is split up into a number of different files.
The actual files needed depend on the configuration of the mappings
@@ -45,7 +56,9 @@ required or not.
During the initialisation of the Indexing all the RDF files within the
"indexing/resources/rdfData" directory will be imported to a Jena TDB RDF
triple store. The imported data are stored under
- indexing/resources/tdb
+
+ indexing/resources/tdb
+
and can be reused for subsequent indexing processes.
To avoid (re)importing of already imported resources one needs to remove such
@@ -53,13 +66,14 @@ RDF files from the "indexing/resources/r
option - rename the "rdfData" folder after the initial run.
It is also safe to
- - cancel the indexing process after the initialisation has competed
- (as soon as the log says that the indexing has started).
- - load additional RDF dumps by putting additional RDF files to the "rdfData"
- directory. This files will be added to the others on the next start of the
- indexing tool.
-(3) Entity Scores
+* cancel the indexing process after the initialisation has completed
+(as soon as the log says that the indexing has started).
+* load additional RDF dumps by putting additional RDF files to the "rdfData"
+directory. These files will be added to the others on the next start of the
+indexing tool.
+
+### (3) Entity Scores
The DBpedia.org indexer uses the incoming links from other wikipages to
calculate the rank of entities. Entities with more incoming links get an
@@ -68,69 +82,100 @@ on DBpedia (page_links_en.nt.bz2). This
following command to get a file containing an ordered list of incoming
link counts and the local names of the entities.
-curl http://downloads.dbpedia.org/{version}/en/page_links_en.nt.bz2 \
- | bzcat \
- | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
- | sort \
- | uniq -c \
- | sort -nr > incoming_links.txt
+ curl http://downloads.dbpedia.org/{version}/en/page_links_en.nt.bz2 \
+ | bzcat \
+ | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
+ | sort \
+ | uniq -c \
+ | sort -nr > incoming_links.txt
Depending on the machine and the download speed for the source file the
execution of this command will take several hours.
Important NOTES:
- - Links to Categories use wrong URLs in the current version (3.6) of the
- page_links_en.nt.bz2 dump.
- All categories start with "CAT:{categoryName}" but the correct local name
- would be "Category:{categoryName}". because of this categories would not be
- indexed.
- It is strongly suggested to
- - first check if still Category: is used as prefix (e.g. by checking if
- http://dbpedia.org/page/Category:Political_culture is still valid) and
- - second if that is the case replace all occurrences of "CAT:" to "Category:"
+
+* Links to Categories use wrong URLs in the current version (3.6) of the
+page_links_en.nt.bz2 dump.
+All categories start with "CAT:{categoryName}" but the correct local name
+would be "Category:{categoryName}". Because of this, categories would not be
+indexed.
+It is strongly suggested to
+** first check if "Category:" is still used as prefix (e.g. by checking if
+http://dbpedia.org/page/Category:Political_culture is still valid) and
+** second, if that is the case, replace all occurrences of "CAT:" with "Category:"
The resulting file MUST BE copied to
- indexing/resources/incoming_links.txt
+
+ indexing/resources/incoming_links.txt
There is also the possibility to download a precomputed file from:
- TODO: add download loaction
+
+TODO: add download location
-(4) Configuration of the Index
+### (4) Configuration of the Index
- The configurations are contained within the "indexing/config" folder:
- - indexing.properties: Main configuration for the indexing process. It
- defines the used components and there configurations. Usually no need to
- make any changes.
- - mapping.txt: Define the fields, data type requirements and languages to be
- indexed. Note: It is also important that the dumps containing the RDF
- data are available.
- - dbpedia/conf/schema.xml: Defines the schema used by Solr to store the data.
- This can be used to configure e.g. if values are stored (available for
- retrieval) or only indexed. See the comments within the file for details
- - fieldBoosts.properties: Can be used to set boost factors for fields.
- - minIncomming.properties: Can be used to define the minimum number of
- incommings links (to an Wiki page from other Wili pages) so that an entity
- is indexed. Higher values will cause less entities to be indexed. A
- value of 0 will result in all entities to be indexed.
- - scoreRange.properties: Can be use to set the upper bound for entities score.
- The entities with the most incomming links will get this score. Entities
- with no incomming links would get a score of zero.
+The configurations are contained within the "indexing/config" folder:
+* indexing.properties: Main configuration for the indexing process. It
+defines the used components and their configurations. Usually there is no need to
+make any changes.
+* mapping.txt: Define the fields, data type requirements and languages to be
+indexed. Note: It is also important that the dumps containing the RDF
+data are available.
+* dbpedia/conf/schema.xml: Defines the schema used by Solr to store the data.
+This can be used to configure e.g. if values are stored (available for
+retrieval) or only indexed. See the comments within the file for details
+* fieldBoosts.properties: Can be used to set boost factors for fields.
+* minIncomming.properties: Can be used to define the minimum number of
+incoming links (to a Wiki page from other Wiki pages) so that an entity
+is indexed. Higher values will cause fewer entities to be indexed. A
+value of 0 will result in all entities being indexed.
+* scoreRange.properties: Can be used to set the upper bound for entity scores.
+The entities with the most incoming links will get this score. Entities
+with no incoming links would get a score of zero.
+
+### (5) Using the precomputed Index:
+
+After the indexing completes the distribution folder will contain two files
+
+1. dbpedia.solrindex.ref: This contains the configuration for the SolrIndex. It does
+not contain the data and is intended to be used to provide configurations without
+the need to also include the precomputed index. When loading this file to
+Apache Stanbol (typically via the Apache Sling Installer Framework) the
+Stanbol DataFileProvider service will ask for the binary data.
+
+2. dbpedia.solrindex.zip: This is the ZIP archive with the precomputed data.
+Typically you will need to copy this file to the data directory of the
+Apache Stanbol DataFileProvider (defaults to "sling/datafiles").
+
+## Using DBPedia.org as Referenced Site of the Entityhub
+
+The necessary configurations needed to use DBPedia as referenced site for the
+Apache Stanbol Entityhub are provided by the "Apache Stanbol Data: DBpedia.org"
+bundle.
+
+See [{stanbol}/data/sites/dbpedia](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/data/sites/dbpedia)
+
+The README of this Bundle provides details about the installation process.
+During the installation the "dbpedia.solrindex.zip" created by this utility is
+needed.
-Default configuration:
-======================
+
+## The used Default configuration:
This describes the default configuration as initialised during the first start
of the indexing tool.
The default configuration creates an index with the following features:
-Languages:
+### Languages:
+
By default English, German, French and Italian literals, as well as all literals
without any language information, are indexed. Please note that one also needs
to provide the RDF dumps for those languages.
-Labels and Descriptions:
+### Labels and Descriptions:
+
DBpedia.org uses "rdfs:label" for labels. Short descriptions are stored within
"rdfs:comment" and a longer version in "dbp-ont:abstract".
For both labels and descriptions generic language analyzers are used for
indexing.
@@ -139,35 +184,41 @@ such fields.
Abstracts are only indexed and not stored in the index. This means that values
can be searched but not retrieved.
-Entity types:
+### Entity types:
+
The types of the entities (Person, Organisation, Places, ...) are stored in
"rdf:type". Values are URLs as defined mainly by the DBpedia.org ontology.
-Spatial Information:
+### Spatial Information:
+
The geo locations are indexed within "geo:lat", "geo:long" and "geo:alt". The
mappings ensure that lat/long values are doubles and the altitude values are integers.
-Categories:
+### Categories:
+
DBpedia also contains categories. Entities are linked to categories by the
"skos:subject" and/or the "dcterms:subject" property. During the import all
values defined by "dcterms:subject" are copied to "skos:subject".
Categories themselves are hierarchical. Parent categories can be reached by following
"skos:broader" relations.
e.g.
- Berlin -> skos:subject
- -> Category:City-states -> skos:broader
- -> Category:Cities -> skos:broader
- -> Category:Populated_places -> skos:broader
- -> Category:Human_habitats ...
+
+ Berlin -> skos:subject
+ -> Category:City-states -> skos:broader
+ -> Category:Cities -> skos:broader
+ -> Category:Populated_places -> skos:broader
+ -> Category:Human_habitats ...
All properties defined by SKOS (http://www.w3.org/TR/skos-reference/) are
indexed and stored.
-DBpedia Ontology:
+### DBpedia Ontology:
+
All properties of the DBpedia.org Ontology are indexed and stored in the index.
see http://wiki.dbpedia.org/Ontology
-DBpedia Properties:
+### DBpedia Properties:
+
Properties are field/values directly taken from the information boxes on the
right side of Wikipedia pages. Fieldnames may depend on the language and also
the data type of the values may be different from entity to entity.
@@ -176,11 +227,13 @@ It is possible to include some/all such
Note that in such cases it is also required to include the RDF dump containing
this data.
-Person related Properties:
+### Person related Properties:
+
DBpedia uses FOAF (http://www.foaf-project.org/) to provide additional information
for persons. Some properties such as foaf:homepage are also used for entities of
other types. All properties defined by FOAF are indexed and stored.
-Dublin Core (DC) Metadata:
+### Dublin Core (DC) Metadata:
+
DC Elements and DC Terms metadata are indexed and stored.
All DC Element properties are mapped to their DC Terms counterparts.
Propchange: incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.md
------------------------------------------------------------------------------
svn:mime-type = text/plain
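
The entity score step of the DBpedia README, including the suggested "CAT:" to "Category:" correction, can be sketched as a single shell function. This is a minimal sketch under stated assumptions: the function name `rank_incoming_links` is hypothetical, and the second `sed` expression assumes that the wrong prefix appears at the very start of the local name, as the README note describes for version 3.6 of the dump. The first expression is the one from the README (with the trailing dot escaped).

```shell
#!/bin/sh
# rank_incoming_links (hypothetical helper name): reads page_links
# N-Triples on stdin and writes "<count> <local name>" lines, sorted
# by descending number of incoming links.
rank_incoming_links() {
    # 1st expression: keep only the local name of the link target
    # 2nd expression: assumed fix for the "CAT:" prefix noted in the README
    sed -e 's/.*<http:\/\/dbpedia\.org\/resource\/\([^>]*\)> \./\1/' \
        -e 's/^CAT:/Category:/' \
    | sort \
    | uniq -c \
    | sort -nr
}
```

The function would replace the `sed | sort | uniq -c | sort -nr` part of the pipeline quoted in the README, for example `curl ... | bzcat | rank_incoming_links > indexing/resources/incoming_links.txt`. Whether the prefix rewrite is still needed should be checked against the current dump first, as the README advises.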
Modified: incubator/stanbol/trunk/entityhub/indexing/dbpedia/pom.xml
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/entityhub/indexing/dbpedia/pom.xml?rev=1103688&r1=1103687&r2=1103688&view=diff
==============================================================================
--- incubator/stanbol/trunk/entityhub/indexing/dbpedia/pom.xml (original)
+++ incubator/stanbol/trunk/entityhub/indexing/dbpedia/pom.xml Mon May 16
10:48:53 2011
@@ -29,7 +29,7 @@
<artifactId>org.apache.stanbol.entityhub.indexing.dbpedia</artifactId>
<version>0.9-SNAPSHOT</version>
<packaging>jar</packaging>
- <name>Apache Stanbol Entityhub Indexing for geonames.org</name>
+ <name>Apache Stanbol Entityhub Indexing for DBpedia.org</name>
<description>This uses the RDF dump of dbpedia.org to create a Full Yard for
dbpedia.org</description>
<scm>
<connection>
Added: incubator/stanbol/trunk/entityhub/indexing/geonames/README.md
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/entityhub/indexing/geonames/README.md?rev=1103688&view=auto
==============================================================================
--- incubator/stanbol/trunk/entityhub/indexing/geonames/README.md (added)
+++ incubator/stanbol/trunk/entityhub/indexing/geonames/README.md Mon May 16
10:48:53 2011
@@ -0,0 +1,38 @@
+# Indexing utility for the [geonames.org](http://www.geonames.org) dataset.
+
+This tool has not yet been ported to the new Indexing infrastructure defined
+by the org.apache.stanbol.entityhub.indexing.core module.
+
+Please follow [STANBOL-187](https://issues.apache.org/jira/browse/STANBOL-187)
+for updates
+
+
+## Building and Indexing
+
+Build the utility:
+
+ mvn install
+ mvn assembly:assembly
+
+To print the help of the utility call
+
+ java -jar target/org.apache.stanbol.entityhub.indexing.geonames-*-jar-with-dependencies.jar -h
+
+You will need an external SolrServer configured with the Solr Core
+configuration as provided by the SolrYard module "org.apache.stanbol.entityhub.yard.solr".
+
+## Creating an Entityhub Solr Archive
+
+This step is required to create the Archive with the Solr Index as required
+after the Installation of geonames.org Referenced Site (see the
+"org.apache.stanbol.data.sites.geonames" module for details)
+
+The Entityhub uses a special Solr archive for the initialisation of local Solr
+indexes. As soon as this indexer is moved to the new Indexing Infrastructure
+(see [STANBOL-187](https://issues.apache.org/jira/browse/STANBOL-187) ) the
+required files will be automatically created.
+
+Until then, this needs to be done manually by creating a ZIP archive of the
+data and the configuration of the SolrIndex used for the indexing.
+The archive needs to be renamed to "geonames.solrindex.zip".
+
Modified: incubator/stanbol/trunk/entityhub/indexing/pom.xml
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/entityhub/indexing/pom.xml?rev=1103688&r1=1103687&r2=1103688&view=diff
==============================================================================
--- incubator/stanbol/trunk/entityhub/indexing/pom.xml (original)
+++ incubator/stanbol/trunk/entityhub/indexing/pom.xml Mon May 16 10:48:53 2011
@@ -56,7 +56,6 @@
<module>destination/solryard</module>
<!-- Utils for createing local caches (indexing utils) -->
<module>geonames</module>
- <module>genericrdf</module>
<module>dbpedia</module>
<module>dblp</module>
</modules>
Modified: incubator/stanbol/trunk/pom.xml
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/pom.xml?rev=1103688&r1=1103687&r2=1103688&view=diff
==============================================================================
--- incubator/stanbol/trunk/pom.xml (original)
+++ incubator/stanbol/trunk/pom.xml Mon May 16 10:48:53 2011
@@ -48,6 +48,7 @@
<module>owl</module>
<module>entityhub</module>
<module>enhancer</module>
+ <module>data</module>
<module>contenthub/web</module>
<module>launchers/stateless</module>
<module>launchers/full</module>