Author: agruber
Date: Wed Sep 7 15:32:20 2011
New Revision: 1166228
URL: http://svn.apache.org/viewvc?rev=1166228&view=rev
Log:
typos and minor explanatory additions w.r.t. indexing utility
Modified:
incubator/stanbol/trunk/entityhub/indexing/genericrdf/README.md
incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/indexing.properties
incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/mappings.txt
Modified: incubator/stanbol/trunk/entityhub/indexing/genericrdf/README.md
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/entityhub/indexing/genericrdf/README.md?rev=1166228&r1=1166227&r2=1166228&view=diff
==============================================================================
--- incubator/stanbol/trunk/entityhub/indexing/genericrdf/README.md (original)
+++ incubator/stanbol/trunk/entityhub/indexing/genericrdf/README.md Wed Sep 7
15:32:20 2011
@@ -13,20 +13,19 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY
See the License for the specific language governing permissions and
limitations under the License.
-# Default Indexin Tool for RDF
+# Default Indexing Tool for RDF
-This tool provides a default configuration for indexing RDF
-files (e.g. a SKOS export of a thesaurus or a set of foaf files …)
+This tool provides a default configuration for creating a Solr index of RDF
files (e.g. a SKOS export of a thesaurus or a set of foaf files)
-## Building:
+## Building
-If not yet build by the built process of the entityhub call
+If not yet built during the build process of the entityhub call
mvn install
in this directory and than
- mvn -o assembly:single
+ mvn assembly:single
to build the jar with all the dependencies used later for indexing.
@@ -36,15 +35,15 @@ If the build succeeds go to the /target
to the directory you would like to start the indexing.
-## Index:
+## Indexing
-### (1) Initialise the configuration
+### (1) Initialize the configuration
-The default configuration is initialised by calling
+The default configuration is initialized by calling
java -jar
org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar
init
-This will create a sub-folder with the name indexing in the current directory.
+This will create a sub-folder "indexing" in the current directory.
Within this folder all the
* configurations (indexing/config)
@@ -54,7 +53,7 @@ Within this folder all the
will be located.
-### (2) Adapt the configuration:
+### (2) Adapt the configuration
The configuration is located within the
@@ -62,41 +61,41 @@ The configuration is located within the
directory.
-The Indexer supports two Indexing Modes
+The indexer supports two indexing modes
-1. Iterate over the Data and lookup the Scores for Entities (default). For
this mode the "entityDataIterable" and a "entityScoreProvider" MUST BE
configured. If no entity scores are available there exists an default
entityScoreProvider that provides no entity scores. This mode is typically used
to index all entities of a dataset.
-2. Iterate over the entity IDs and Scores and lookup the data. For this Mode a
"entityIdIterator" and a "entityDataProvider" Provider MUST BE configured. This
mode is typically used to index a predefined list of entities (that might only
be a very small subset of the whole dataset).
+1. Iterate over the data and look up the scores for entities (default). For
this mode the "entityDataIterable" and an "entityScoreProvider" MUST BE
configured. If no entity scores are available, a default entityScoreProvider
provides no entity scores. This mode is typically used to index all entities of
a dataset.
+2. Iterate over the entity IDs and scores and look up the data. For this mode an
"entityIdIterator" and an "entityDataProvider" MUST BE configured. This
mode is typically used to index a predefined list of entities (that might only
be a very small subset of a large dataset).
The configuration of the mentioned components is contained in the main
indexing configuration file explained below.
-#### Main indexing Configuration (indexing.properties)
+#### Main indexing configuration (indexing.properties)
This file contains the main configuration for the indexing process.
* the "name" property MUST BE set to the name of the referenced site to be
created by the indexing process
* the "entityDataIterable" is used to configure the component iterating over
the RDF data to be indexed. The "source" parameter refers to the directory the
RDF files to be indexed are searched. The RDF files can be compressed with
'gz', 'bz2' or 'zip'. It is even supported to load multiple RDF files contained
in a single ZIP archive.
-* the "entityScoreProvider" is used to provide the ranking for entities. A
typical example are the number of incoming links. Such rankings are typically
used to weight recommendations and sort result lists. (e.g. by a query for
"Paris" it is much more likely that a User refers to Paris in France as to one
of the two Paris in Texas). If no rankings are available you should use the
"org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider".
+* the "entityScoreProvider" is used to provide the ranking for entities. A
typical example is the number of incoming links. Such rankings are typically
used to weight recommendations and sort result lists (e.g. for a query for
"Paris" it is much more likely that a user refers to Paris in France than to one
of the two Paris in Texas). If no rankings are available you should use the
"org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider".
* the "scoreNormalizer" is only useful in case entity scores are available.
This component is used to normalize rankings or also to filter entities with
low rankings.
-* the "entityProcessor" is used to process (map, convert, filter) information
of entities before indexing. The mapping configuration is in an own file
(default "mapping.txt").
-* Indexes need to provide the configurations used to store entities. The
"fieldConfiguration" allows to specify this. Typically is is the same mapping
file as used for the "entityProcessor" however this is not a requirement.
-* the "indexingDestination" property is used to configure the target for the
Indexing. Currently there is only a single implementation that stores the
indexed data within a SolrYard. The "boosts" parameter can be used to boost
(see Solr Documentation for details) specific fields (typically labels) for
full text searches.
-* all properties starting with "org.apache.stanbol.entityhub.site." are used
for the configuration for the referenced site.
+* the "entityProcessor" is used to process (map, convert, filter) information
of entities before indexing. The mapping configuration is provided in a
separate file (default "mappings.txt").
+* Indexes need to provide the configurations used to store entities. The
"fieldConfiguration" allows this to be specified. Typically it is the same mapping
file as used for the "entityProcessor"; however, this is not a requirement.
+* the "indexingDestination" property is used to configure the target for the
indexing. Currently there is only a single implementation that stores the
indexed data within a SolrYard. The "boosts" parameter can be used to boost
(see Solr Documentation for details) specific fields (typically labels) for
full text searches.
+* all properties starting with "org.apache.stanbol.entityhub.site." are used
for the configuration of the referenced site.
-Pleas note also the documentation within the "indexing.properties" file for
details.
+Please note also the documentation within the "indexing.properties" file for
details.
-#### Mapping Configuration (mappings.txt)
+#### Mapping configuration (mappings.txt)
-Mappings are used for three different things:
+Mappings are used for three different purposes:
1. During the indexing process by the "entityProcessor" to process the
information of each entity
2. At runtime by the local Cache to process single Entities that are updated
in the cache.
-3. At runtime by the Entityhub when importing an Entity form a referenced Site.
+3. At runtime by the Entityhub when importing an Entity from a referenced Site.
-The configurations for (1) and (2) are typically identical. For (3) on might
want to use a different configuration. The default configuration assumes to use
the same configuration (mapping.txt) for (1) and (2) and no specific
configuration for (3).
+The configurations for (1) and (2) are typically identical. For (3) one might
want to use a different configuration. The default configuration assumes to use
the same configuration (mappings.txt) for (1) and (2) and no specific
configuration for (3).
-For details how to configure mappings see the documentation on the [IKS
wiki](TODO add link)
+The default mappings.txt already includes mappings for popular
ontologies such as Dublin Core, SKOS and FOAF. Domain-specific mappings can be
added to this configuration.
-#### Score Normalizer Configuration
+#### Score Normalizer configuration
The default configuration also provides examples for configurations of the
different score normalisers. However by default they are not used.
@@ -105,7 +104,7 @@ The default configuration also provides
NOTE:
-* To use score normalisation scores need to be provided for Entities. This
means a "entityScoreProvider" or a "entityIdIterator" needs to be configured
(indexing.properties).
+* To use score normalisation, scores need to be provided for Entities. This
means an "entityScoreProvider" or an "entityIdIterator" needs to be configured
(indexing.properties).
* Multiple score normalisers can be used. The call order is determined by the
configuration of the "scoreNormalizer" property (indexing.properties).
### (3) Provide the RDF files to be indexed
@@ -120,21 +119,21 @@ By default the RDF files need to be loca
indexing/resources/rdfdata
-however this can be changed by the "source" parameter of the
"entityDataIterable" or "entityDataProvider" property in the main indexing
configuration (indexing.properties).
+however this can be changed via the "source" parameter of the
"entityDataIterable" or "entityDataProvider" property in the main indexing
configuration (indexing.properties).
-Supported RDF files
+Supported RDF files are:
-* RDF XML (by using one of "rdf", "owl", "xml" as extension): Note that this
encoding is not well suited for importing large RDF datasets.
+* RDF/XML (by using one of "rdf", "owl", "xml" as extension): Note that this
encoding is not well suited for importing large RDF datasets.
* N-Triples (by using "nt" as extension): This is the preferred format for
importing (especially large) RDF datasets.
* NTurtle (by using "ttl" as extension)
* N3 (by using "n3" as extension)
* NQuards (by using "nq" as extension): Note that all named graphs will be
imported into the same index.
* Trig (by using "trig" as extension)
-Supported compression formats:
+Supported compression formats are:
* "gz" and "bz2" files: One need to use double file extensions to indicate
both the used compression and RDF file format (e.g. myDump.nt.bz2)
-* "zip": For ZIP archives all files within the archive are treated separately.
That means that even if a ZIP archive contains multiple RDF files all will be
imported.
+* "zip": For ZIP archives all files within the archive are treated separately.
That means that even if a ZIP archive contains multiple RDF files, all of them
will be imported.
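
Reviewer sketch (not part of the commit): the extension rules listed above can be summarised in a few lines of Python. The function `classify` and the two lookup tables are hypothetical names introduced for illustration only; they are not part of the indexing tool, and the format labels "Turtle" and "N-Quads" are the reviewer's reading of the README entries for "ttl" and "nq". The sketch only mirrors the naming convention described in the README (double extensions for "gz"/"bz2", ZIP archives treated as containers) and makes no claim about how the tool itself detects formats.

```python
# Hypothetical helper, NOT part of Stanbol: maps a file name to the RDF format
# and compression implied by its extension(s), following the README lists.
RDF_FORMATS = {
    "rdf": "RDF/XML", "owl": "RDF/XML", "xml": "RDF/XML",
    "nt": "N-Triples", "ttl": "Turtle", "n3": "N3",
    "nq": "N-Quads", "trig": "TriG",
}
COMPRESSIONS = {"gz", "bz2"}


def classify(filename):
    """Return (rdf_format, compression); either may be None if unrecognised."""
    parts = filename.lower().split(".")
    extensions = parts[1:]  # everything after the base name
    if not extensions:
        return (None, None)
    if extensions[-1] == "zip":
        # ZIP archives are containers: every file inside is treated separately.
        return (None, "zip")
    compression = None
    if extensions[-1] in COMPRESSIONS:
        # Double extension: the last part names the compression, the part
        # before it names the RDF format (e.g. myDump.nt.bz2).
        compression = extensions.pop()
    rdf_format = RDF_FORMATS.get(extensions[-1]) if extensions else None
    return (rdf_format, compression)
```

For the README's own example, `classify("myDump.nt.bz2")` yields the N-Triples format with "bz2" compression, while a name with an unlisted extension yields no format at all.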
### (4) Create the Index
@@ -143,7 +142,7 @@ Supported compression formats:
Note that calling the utility with the option -h will print the help.
-## Use the created Index with the Entityhub:
+## Use the created index with the Entityhub
After the indexing completes the distribution folder
@@ -157,11 +156,11 @@ will contain two files
* a "Cache" used to connect the ReferencedSite with your Data and
* a "SolrYard" that managed the data indexed by this utility.
- When installing this bundle the Site will not be functional, because this
Bundle does not contain the indexed data but only the configuration for the
Solr Index.
+ When installing this bundle the Site will not yet work, because this
Bundle does not contain the indexed data but only the configuration for the
Solr Index.
2. {name}.solrindex.zip: This is the ZIP archive with the indexed data. This
file will be requested by the Apache Stanbol Data File Provider after
installing the Bundle described above. To install the data you need copy this
file to the "/sling/datafiles" folder within the working directory of your
Stanbol Server.
- If you do that before you install the bundle the data will be picked up
during the installation of the bundle automatically. If you provide the File
afterwards you will need to restart the SolrYard installed by the Bundle.
+ If you copy the ZIP archive before installing the bundle, the data will be
picked up during the installation of the bundle automatically. If you provide
the file afterwards you will also need to restart the SolrYard installed by the
Bundle.
{name} denotes to the value you configured for the "name" property within the
"indexing.properties" file.
Modified:
incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/indexing.properties
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/indexing.properties?rev=1166228&r1=1166227&r2=1166228&view=diff
==============================================================================
---
incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/indexing.properties
(original)
+++
incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/indexing.properties
Wed Sep 7 15:32:20 2011
@@ -19,13 +19,13 @@
# It MUST BE a single word with no spaces.
name=changeme
-# an optional short description used. If missing default descriptions are
+# an optional short description may be used. If missing default descriptions
are
# created.
-description=The DBLP Computer Science Bibliography (http://dblp.uni-trier.de)
+description=short description (http://www.example.org)
# Indexing Mode dependent Configurations: (see readme.md for details)
-# (1) Iterate over Data and lookup scores: (defalut)
+# (1) Iterate over Data and lookup scores: (default)
# use the Jena TDB as source for indexing the RDF data located within
# "indexing/resource/rdfdata"
@@ -36,7 +36,7 @@ entityScoreProvider=org.apache.stanbol.e
# The EntityFieldScoreProvider can be used to use the value of an property as
score
# the property can be configured by the "field" parameter
-# Scores are parsed from numbers and Strings that can be converted to numbers.
+# Scores are parsed from numbers and strings that can be converted to numbers.
#entityScoreProvider=org.apache.stanbol.entityhub.indexing.core.source.EntityFieldScoreProvider,field:http://www.example.org/myOntology#score
# The EntityIneratorToScoreProviderAdapter can be used to adapt any configured
@@ -81,7 +81,7 @@ entityScoreProvider=org.apache.stanbol.e
# configurations.
-# Entity Processor:
+# Entity Processor
# Currently the only available implementation is the FiledMapperProcessor.
entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.FiledMapperProcessor
@@ -113,20 +113,20 @@ indexingDestination=org.apache.stanbol.e
# Additional configurations for ReferencedSite
-# all the following properties are optional, but can be used to configure
+# All the following properties are optional, but can be used to configure
# the referenced site used to access the indexed data within the Entityhub
-# the entity prefixes are used to determine if an entity needs to be searched
+# The entity prefixes are used to determine if an entity needs to be searched
# on a referenced site. If not specified requests for any entity will be
# forwarded to this referenced site.
# use ';' to seperate multiple values
#org.apache.stanbol.entityhub.site.entityPrefix=http://example.org/resource;urn:mycompany:
# Configuration the remote Service
-# It the indexed data are also remotly availabel (e.g. by a Linked data
endpoint)
+# If the indexed data are also available remotely (e.g. by a Linked data
endpoint)
# than it is possible to allow also direct access to such entities
# (a) retrieving entities (access URI and EntityDereferencer implementation)
-#org.apache.stanbol.entityhub.site.accessUri=http://example.org/resource"
+#org.apache.stanbol.entityhub.site.accessUri="http://example.org/resource"
#org.apache.stanbol.entityhub.site.dereferencerType=
# available EntityDereferencer implementation
# - org.apache.stanbol.entityhub.dereferencer.CoolUriDereferencer
@@ -142,7 +142,7 @@ indexingDestination=org.apache.stanbol.e
# The referenced site can also specify additional mappings to be used in the
# case an entity of this site is imported to the Entityhub.
-# typically the same mappings as used for the indexing are a good start.
+# Typically the same mappings as used for the indexing are a good start.
# However one might want to copy some values (e.g. labels) to commonly used
# fields used by the Entityhub
org.apache.stanbol.entityhub.site.fieldMappings=mappings.txt
@@ -151,7 +151,7 @@ org.apache.stanbol.entityhub.site.fieldM
# License(s)
# Add here the name and URLs of the license to be used for all entities
# provided by this referenced site
-# NOTE: licenseName and licenseUrl MUST use the same ordering!
+# NOTE: licenseName and licenseUrl MUST use the same ordering, as shown below!
# This example shows dual licensing with "cc by-sa" and GNU
#org.apache.stanbol.entityhub.site.licenseName=Creative Commons
Attribution-ShareAlike 3.0;GNU Free Documentation License
#org.apache.stanbol.entityhub.site.licenseUrl=http://creativecommons.org/licenses/by-sa/3.0/;http://www.gnu.org/licenses/fdl.html
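
Reviewer sketch (not part of the commit): the ordering note on licenseName and licenseUrl can be read as "the i-th name belongs to the i-th URL". The Python below illustrates that reading using the two example values from the comment block above. The function `pair_licenses` is a hypothetical name, and splitting on ';' with a length check is an assumption about the intended semantics, not a description of how the Entityhub parses these properties.

```python
# Hypothetical illustration, NOT Stanbol code: pair ';'-separated license
# names with ';'-separated license URLs by position.
def pair_licenses(names_value, urls_value):
    names = [n.strip() for n in names_value.split(";")]
    urls = [u.strip() for u in urls_value.split(";")]
    if len(names) != len(urls):
        # A mismatch means the two properties cannot be aligned by position.
        raise ValueError("licenseName and licenseUrl need the same number of values")
    return list(zip(names, urls))


# Example values copied from the commented-out properties above.
pairs = pair_licenses(
    "Creative Commons Attribution-ShareAlike 3.0;GNU Free Documentation License",
    "http://creativecommons.org/licenses/by-sa/3.0/;http://www.gnu.org/licenses/fdl.html",
)
```

Under this reading, swapping the order in only one of the two properties would silently attach each license name to the wrong URL, which is presumably what the NOTE warns against.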
Modified:
incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/mappings.txt
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/mappings.txt?rev=1166228&r1=1166227&r2=1166228&view=diff
==============================================================================
---
incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/mappings.txt
(original)
+++
incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/mappings.txt
Wed Sep 7 15:32:20 2011
@@ -14,7 +14,7 @@
# limitations under the License.
#
#NOTE: THIS IS A DEFAULT MAPPING SPECIFICATION THAT INCLUDES MAPPINGS FOR
-# COMMON ONTOLOGIES. USERS MIGHT WANT TO ADAPT THIS CONFIGURATION AB
+# COMMON ONTOLOGIES. USERS MIGHT WANT TO ADAPT THIS CONFIGURATION BY
# COMMENTING/UNCOMMENTING AND/OR ADDING NEW MAPPINGS
# --- Define the Languages for all fields ---
@@ -27,7 +27,7 @@
# --- RDF RDFS and OWL Mappings ---
# This configuration only index properties that are typically used to store
-# instance data defined by such namespaces. This excludes Ontology definitions
+# instance data defined by such namespaces. This excludes ontology definitions
# NOTE that nearly all other ontologies are are using properties of these three
# schemas, therefore it is strongly recommended to include such
information!
@@ -41,14 +41,14 @@ rdfs:seeAlso | d=entityhub:ref
owl:sameAs | d=entityhub:ref
-#If one likes to also index Ontologies one should add the following statements
+#If one likes to also index ontologies one should add the following statements
#owl:*
#rdfs:*
# --- Dublin Core (DC) ---
-# The default configuration imports all dc-terms data and copies vlaues for the
-# old dc-elements standard over to the according properties ofthe dc-terms
-#standard.
+# The default configuration imports all dc-terms data and copies values for the
+# old dc-elements standard over to the according properties of the dc-terms
+# standard.
# NOTE that a lot of other ontologies are also using DC for some of there data
# therefore it is strongly recommended to include such information!
@@ -78,11 +78,11 @@ dc-elements:source > dc:source
dc-elements:subject > dc:subject
dc-elements:title > dc:title
dc-elements:type > dc:type
-#also use ec-elements:title as label
+#also use dc-elements:title as label
dc-elements:title > rdfs:label
# --- Social Networks (via foaf) ---
-#The Friend of a Friend schema often used to describe social relations between
people
+#The Friend of a Friend schema is often used to describe social relations
between people
foaf:*
# copy the name of a person over to rdfs:label
@@ -104,7 +104,7 @@ foaf:page | d=xsd:anyURI
# --- Simple Knowledge Organization System (SKOS) ---
# A common data model for sharing and linking knowledge organization systems
-# via the Semantic Web. Typically used to encode controlled vocabularies auch
as
+# via the Semantic Web. Typically used to encode controlled vocabularies such as
# a thesaurus
skos:*
@@ -123,14 +123,14 @@ skos:narrowMatch > skos:skos:narrower
# however such properties are only intended to be used by reasoners to
# calculate transitive closures over broader/narrower hierarchies.
# see http://www.w3.org/TR/skos-reference/#L2413 for details
-# to correct such cases we will copy transitive relations to there counterpart
+# to correct such cases we will copy transitive relations to their counterpart
skos:narrowerTransitive > skos:narrower
skos:broaderTransitive > skos:broader
# --- Semantically-Interlinked Online Communities (SIOC) ---
-# an ontology for describing the information in online communities.
+# An ontology for describing the information in online communities.
# This information can be used to export information from online communities
# and to link them together. The scope of the application areas that SIOC can
# be used for includes (and is not limited to) weblogs, message boards,