
agruber Wed, 07 Sep 2011 08:32:48 -0700

Author: agruber
Date: Wed Sep  7 15:32:20 2011
New Revision: 1166228

URL: http://svn.apache.org/viewvc?rev=1166228&view=rev
Log:
typos and minor explanatory addings w.r.t. indexing utility


Modified:
    incubator/stanbol/trunk/entityhub/indexing/genericrdf/README.md
    incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/indexing.properties
    incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/mappings.txt

Modified: incubator/stanbol/trunk/entityhub/indexing/genericrdf/README.md
URL: http://svn.apache.org/viewvc/incubator/stanbol/trunk/entityhub/indexing/genericrdf/README.md?rev=1166228&r1=1166227&r2=1166228&view=diff
==============================================================================
--- incubator/stanbol/trunk/entityhub/indexing/genericrdf/README.md (original)
+++ incubator/stanbol/trunk/entityhub/indexing/genericrdf/README.md Wed Sep  7 15:32:20 2011
@@ -13,20 +13,19 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY 
 See the License for the specific language governing permissions and
 limitations under the License.
 
-# Default Indexin Tool for RDF
+# Default Indexing Tool for RDF
 
-This tool provides a default configuration for indexing RDF
-files (e.g. a SKOS export of a thesaurus or a set of foaf files …)
+This tool provides a default configuration for creating a SOLr index of RDF files (e.g. a SKOS export of a thesaurus or a set of foaf files)
 
-## Building:
+## Building
 
-If not yet build by the built process of the entityhub call
+If not yet built during the build process of the entityhub call
 
     mvn install
 
 in this directory and than
 
-    mvn -o assembly:single
+    mvn assembly:single
     
 to build the jar with all the dependencies used later for indexing.
 
@@ -36,15 +35,15 @@ If the build succeeds go to the /target 
 
 to the directory you would like to start the indexing.
 
-## Index:
+## Indexing
 
-### (1) Initialise the configuration
+### (1) Initialize the configuration
 
-The default configuration is initialised by calling
+The default configuration is initialized by calling
 
     java -jar org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar init
 
-This will create a sub-folder with the name indexing in the current directory.
+This will create a sub-folder "indexing" in the current directory.
 Within this folder all the
 
 * configurations (indexing/config)
@@ -54,7 +53,7 @@ Within this folder all the
 
 will be located.
 
-### (2) Adapt the configuration:
+### (2) Adapt the configuration
 
 The configuration is located within the
 
@@ -62,41 +61,41 @@ The configuration is located within the
 
 directory.
 
-The Indexer supports two Indexing Modes
+The indexer supports two indexing modes
 
-1. Iterate over the Data and lookup the Scores for Entities (default). For this mode the "entityDataIterable" and a "entityScoreProvider" MUST BE configured. If no entity scores are available there exists an default entityScoreProvider that provides no entity scores. This mode is typically used to index all entities of a dataset.
-2. Iterate over the entity IDs and Scores and lookup the data. For this Mode a "entityIdIterator" and a "entityDataProvider" Provider MUST BE configured. This mode is typically used to index a predefined list of entities (that might only be a very small subset of the whole dataset).
+1. Iterate over the data and lookup the scores for entities (default). For this mode the "entityDataIterable" and a "entityScoreProvider" MUST BE configured. If no entity scores are available, a default entityScoreProvider provides no entity scores. This mode is typically used to index all entities of a dataset.
+2. Iterate over the entity IDs and Scores and lookup the data. For this Mode a "entityIdIterator" and a "entityDataProvider" Provider MUST BE configured. This mode is typically used to index a predefined list of entities (that might only be a very small subset of the an large dataset).
 
 The configuration of the mentioned components is contained in the main indexing configuration file explained below.
 
-#### Main indexing Configuration (indexing.properties)
+#### Main indexing configuration (indexing.properties)
 
 This file contains the main configuration for the indexing process.
 
 * the "name" property MUST BE set to the name of the referenced site to be created by the indexing process
 * the "entityDataIterable" is used to configure the component iterating over the RDF data to be indexed. The "source" parameter refers to the directory the RDF files to be indexed are searched. The RDF files can be compressed with 'gz', 'bz2' or 'zip'. It is even supported to load multiple RDF files contained in a single ZIP archive.
-* the "entityScoreProvider" is used to provide the ranking for entities. A typical example are the number of incoming links. Such rankings are typically used to weight recommendations and sort result lists. (e.g. by a query for "Paris" it is much more likely that a User refers to Paris in France as to one of the two Paris in Texas). If no rankings are available you should use the "org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider".
+* the "entityScoreProvider" is used to provide the ranking for entities. A typical example is the number of incoming links. Such rankings are typically used to weight recommendations and sort result lists. (e.g. by a query for "Paris" it is much more likely that a user refers to Paris in France as to one of the two Paris in Texas). If no rankings are available you should use the "org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider".
 * the "scoreNormalizer" is only useful in case entity scores are available. This component is used to normalize rankings or also to filter entities with low rankings.
-* the "entityProcessor" is used to process (map, convert, filter) information of entities before indexing. The mapping configuration is in an own file (default "mapping.txt").
-* Indexes need to provide the configurations used to store entities. The "fieldConfiguration" allows to specify this. Typically is is the same mapping file as used for the "entityProcessor" however this is not a requirement.
-* the "indexingDestination" property is used to configure the target for the Indexing. Currently there is only a single implementation that stores the indexed data within a SolrYard. The "boosts" parameter can be used to boost (see Solr Documentation for details) specific fields (typically labels) for full text searches.
-* all properties starting with "org.apache.stanbol.entityhub.site." are used for the configuration for the referenced site.
+* the "entityProcessor" is used to process (map, convert, filter) information of entities before indexing. The mapping configuration is provided in an separate file (default "mapping.txt").
+* Indexes need to provide the configurations used to store entities. The "fieldConfiguration" allows to specify this. Typically it is the same mapping file as used for the "entityProcessor" however this is not a requirement.
+* the "indexingDestination" property is used to configure the target for the indexing. Currently there is only a single implementation that stores the indexed data within a SolrYard. The "boosts" parameter can be used to boost (see Solr Documentation for details) specific fields (typically labels) for full text searches.
+* all properties starting with "org.apache.stanbol.entityhub.site." are used for the configuration of the referenced site.
 
-Pleas note also the documentation within the "indexing.properties" file for details.
+Please note also the documentation within the "indexing.properties" file for details.
 
-#### Mapping Configuration (mappings.txt)
+#### Mapping configuration (mappings.txt)
 
-Mappings are used for three different things:
+Mappings are used for three different purposes:
 
 1. During the indexing process by the "entityProcessor" to process the information of each entity
 2. At runtime by the local Cache to process single Entities that are updated in the cache.
-3. At runtime by the Entityhub when importing an Entity form a referenced Site.
+3. At runtime by the Entityhub when importing an Entity from a referenced Site.
 
-The configurations for (1) and (2) are typically identical. For (3) on might want to use a different configuration. The default configuration assumes to use the same configuration (mapping.txt) for (1) and (2) and no specific configuration for (3).
+The configurations for (1) and (2) are typically identical. For (3) one might want to use a different configuration. The default configuration assumes to use the same configuration (mappings.txt) for (1) and (2) and no specific configuration for (3).
 
-For details how to configure mappings see the documentation on the [IKS wiki](TODO add link)
+The mappings.txt in its default already include mappings for popular ontologies such as Dublin Core, SKOS and FOAF. Domain specific mappings can be added to this configuration.
 
-#### Score Normalizer Configuration
+#### Score Normalizer configuration
 
 The default configuration also provides examples for configurations of the different score normalisers. However by default they are not used.
 
@@ -105,7 +104,7 @@ The default configuration also provides 
 
 NOTE: 
 
-* To use score normalisation scores need to be provided for Entities. This means a "entityScoreProvider" or a "entityIdIterator" needs to be configured (indexing.properties).
+* To use score normalisation, scores need to be provided for Entities. This means an "entityScoreProvider" or an "entityIdIterator" needs to be configured (indexing.properties).
 * Multiple score normalisers can be used. The call order is determined by the configuration of the "scoreNormalizer" property (indexing.properties).
 
 ### (3) Provide the RDF files to be indexed
@@ -120,21 +119,21 @@ By default the RDF files need to be loca
 
     indexing/resources/rdfdata
 
-however this can be changed by the "source" parameter of the "entityDataIterable" or "entityDataProvider" property in the main indexing configuration (indexing.properties).
+however this can be changed via the "source" parameter of the "entityDataIterable" or "entityDataProvider" property in the main indexing configuration (indexing.properties).
 
-Supported RDF files
+Supported RDF files are:
 
-* RDF XML (by using one of "rdf", "owl", "xml" as extension): Note that this encoding is not well suited for importing large RDF datasets.
+* RDF/XML (by using one of "rdf", "owl", "xml" as extension): Note that this encoding is not well suited for importing large RDF datasets.
 * N-Triples (by using "nt" as extension): This is the preferred format for importing (especially large) RDF datasets.
 * NTurtle (by using "ttl" as extension)
 * N3 (by using "n3" as extension)
 * NQuards (by using "nq" as extension): Note that all named graphs will be imported into the same index.
 * Trig (by using "trig" as extension)
 
-Supported compression formats:
+Supported compression formats are:
 
 * "gz" and "bz2" files: One need to use double file extensions to indicate both the used compression and RDF file format (e.g. myDump.nt.bz2)
-* "zip": For ZIP archives all files within the archive are treated separately. That means that even if a ZIP archive contains multiple RDF files all will be imported.
+* "zip": For ZIP archives all files within the archive are treated separately. That means that even if a ZIP archive contains multiple RDF files, all of them will be imported.
 
 ### (4) Create the Index
 
@@ -143,7 +142,7 @@ Supported compression formats:
 Note that calling the utility with the option -h will print the help.
 
 
-## Use the created Index with the Entityhub:
+## Use the created index with the Entityhub
 
 After the indexing completes the distribution folder 
 
@@ -157,11 +156,11 @@ will contain two files
  * a "Cache" used to connect the ReferencedSite with your Data and
  * a "SolrYard" that managed the data indexed by this utility.
 
- When installing this bundle the Site will not be functional, because this Bundle does not contain the indexed data but only the configuration for the Solr Index.
+ When installing this bundle the Site will not be yet work, because this Bundle does not contain the indexed data but only the configuration for the Solr Index.
 
 2. {name}.solrindex.zip: This is the ZIP archive with the indexed data. This file will be requested by the Apache Stanbol Data File Provider after installing the Bundle described above. To install the data you need copy this file to the "/sling/datafiles" folder within the working directory of your Stanbol Server.
 
- If you do that before you install the bundle the data will be picked up during the installation of the bundle automatically. If you provide the File afterwards you will need to restart the SolrYard installed by the Bundle.
+ If you copy the ZIP archive before installing the bundle, the data will be picked up during the installation of the bundle automatically. If you provide the file afterwards you will also need to restart the SolrYard installed by the Bundle.
 
 {name} denotes to the value you configured for the "name" property within the
 "indexing.properties" file.
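
The file-extension rules listed in the README hunk above (format by extension, double extensions for "gz"/"bz2", ZIP members handled separately) can be sketched as follows. This is a minimal Python illustration written for this note, not code from Stanbol; the names `RDF_FORMATS`, `STREAM_COMPRESSIONS` and `classify` are hypothetical, and the format labels use the standard spellings (Turtle, N-Quads) for the "ttl" and "nq" extensions.

```python
from pathlib import PurePath

# Extension tables transcribed from the README text in the diff above.
# All names here are illustrative; none of them are Stanbol identifiers.
RDF_FORMATS = {
    "rdf": "RDF/XML", "owl": "RDF/XML", "xml": "RDF/XML",
    "nt": "N-Triples",
    "ttl": "Turtle",
    "n3": "N3",
    "nq": "N-Quads",
    "trig": "TriG",
}
STREAM_COMPRESSIONS = {"gz", "bz2"}


def classify(filename):
    """Return (rdf_format, compression) guessed from a file name.

    rdf_format is None when the name alone does not determine it,
    for example a ZIP archive whose members are treated separately.
    """
    suffixes = [s.lstrip(".").lower() for s in PurePath(filename).suffixes]
    if not suffixes:
        return None, None
    last = suffixes[-1]
    if last == "zip":
        return None, "zip"
    if last in STREAM_COMPRESSIONS:
        # Double extension: the RDF format precedes the compression suffix.
        inner = suffixes[-2] if len(suffixes) > 1 else None
        return RDF_FORMATS.get(inner), last
    return RDF_FORMATS.get(last), None
```

For the README's own example, `classify("myDump.nt.bz2")` yields the N-Triples format with "bz2" compression.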

Modified: incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/indexing.properties
URL: http://svn.apache.org/viewvc/incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/indexing.properties?rev=1166228&r1=1166227&r2=1166228&view=diff
==============================================================================
--- incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/indexing.properties (original)
+++ incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/indexing.properties Wed Sep  7 15:32:20 2011
@@ -19,13 +19,13 @@
 # It MUST BE a single word with no spaces.
 name=changeme
 
-# an optional short description used. If missing default descriptions are
+# an optional short description may be used. If missing default descriptions are
 # created.
-description=The DBLP Computer Science Bibliography (http://dblp.uni-trier.de)
+description=short description (http://www.example.org)
 
 # Indexing Mode dependent Configurations: (see readme.md for details)
 
-# (1) Iterate over Data and lookup scores: (defalut)
+# (1) Iterate over Data and lookup scores: (default)
 
 # use the Jena TDB as source for indexing the RDF data located within
 # "indexing/resource/rdfdata"
@@ -36,7 +36,7 @@ entityScoreProvider=org.apache.stanbol.e
 
 # The EntityFieldScoreProvider can be used to use the value of an property as score
 # the property can be configured by the "field" parameter
-# Scores are parsed from numbers and Strings that can be converted to numbers.
+# Scores are parsed from numbers and strings that can be converted to numbers.
 #entityScoreProvider=org.apache.stanbol.entityhub.indexing.core.source.EntityFieldScoreProvider,field:http://www.example.org/myOntology#score
 
 # The EntityIneratorToScoreProviderAdapter can be used to adapt any configured
@@ -81,7 +81,7 @@ entityScoreProvider=org.apache.stanbol.e
 #    configurations.
 
 
-# Entity Processor:
+# Entity Processor
 
 # Currently the only available implementation is the FiledMapperProcessor.
 entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.FiledMapperProcessor
@@ -113,20 +113,20 @@ indexingDestination=org.apache.stanbol.e
 
 # Additional configurations for ReferencedSite
 
-# all the following properties are optional, but can be used to configure
+# All the following properties are optional, but can be used to configure
 # the referenced site used to access the indexed data within the Entityhub
 
-# the entity prefixes are used to determine if an entity needs to be searched
+# The entity prefixes are used to determine if an entity needs to be searched
 # on a referenced site. If not specified requests for any entity will be
 # forwarded to this referenced site.
 # use ';' to seperate multiple values
 #org.apache.stanbol.entityhub.site.entityPrefix=http://example.org/resource;urn:mycompany:
 
 # Configuration the remote Service
-# It the indexed data are also remotly availabel (e.g. by a Linked data endpoint)
+# If the indexed data are also available remotly (e.g. by a Linked data endpoint)
 # than it is possible to allow also direct access to such entities
 # (a) retrieving entities (access URI and EntityDereferencer implementation)
-#org.apache.stanbol.entityhub.site.accessUri=http://example.org/resource";
+#org.apache.stanbol.entityhub.site.accessUri="http://example.org/resource";
 #org.apache.stanbol.entityhub.site.dereferencerType=
 # available EntityDereferencer implementation
 # - org.apache.stanbol.entityhub.dereferencer.CoolUriDereferencer
@@ -142,7 +142,7 @@ indexingDestination=org.apache.stanbol.e
 
 # The referenced site can also specify additional mappings to be used in the
 # case an entity of this site is imported to the Entityhub.
-# typically the same mappings as used for the indexing are a good start. 
+# Typically the same mappings as used for the indexing are a good start. 
 # However one might want to copy some values (e.g. labels) to commonly used
 # fields used by the Entityhub
 org.apache.stanbol.entityhub.site.fieldMappings=mappings.txt
@@ -151,7 +151,7 @@ org.apache.stanbol.entityhub.site.fieldM
 # License(s)
 # Add here the name and URLs of the license to be used for all entities
 # provided by this referenced site
-# NOTE: licenseName and licenseUrl MUST use the same ordering!
+# NOTE: licenseName and licenseUrl MUST use the ordering as below!
 # This example shows dual licensing with "cc by-sa" and GNU
 #org.apache.stanbol.entityhub.site.licenseName=Creative Commons Attribution-ShareAlike 3.0;GNU Free Documentation License
 #org.apache.stanbol.entityhub.site.licenseUrl=http://creativecommons.org/licenses/by-sa/3.0/;http://www.gnu.org/licenses/fdl.html
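
The ordering note on licenseName and licenseUrl in the hunk above suggests that the two ';'-separated lists are matched by position. The following is a minimal Python sketch of that reading, assuming positional pairing; `pair_licenses` is a hypothetical helper written for this note and does not describe how the Entityhub actually parses these properties.

```python
def pair_licenses(names_value, urls_value):
    """Pair ';'-separated license names and URLs by position.

    Assumption: the n-th name belongs to the n-th URL, which is how the
    ordering note in indexing.properties reads. Not taken from Stanbol code.
    """
    names = [n.strip() for n in names_value.split(";") if n.strip()]
    urls = [u.strip() for u in urls_value.split(";") if u.strip()]
    if len(names) != len(urls):
        raise ValueError(
            "licenseName has %d entries but licenseUrl has %d"
            % (len(names), len(urls))
        )
    return list(zip(names, urls))


# Values copied from the commented-out dual-licensing example above.
pairs = pair_licenses(
    "Creative Commons Attribution-ShareAlike 3.0;GNU Free Documentation License",
    "http://creativecommons.org/licenses/by-sa/3.0/;http://www.gnu.org/licenses/fdl.html",
)
```

Raising on a length mismatch makes a misordered or incomplete pair visible early instead of silently attaching a URL to the wrong license.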

Modified: incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/mappings.txt
URL: http://svn.apache.org/viewvc/incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/mappings.txt?rev=1166228&r1=1166227&r2=1166228&view=diff
==============================================================================
--- incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/mappings.txt (original)
+++ incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/mappings.txt Wed Sep  7 15:32:20 2011
@@ -14,7 +14,7 @@
 # limitations under the License.
 #
 #NOTE: THIS IS A DEFAULT MAPPING SPECIFICATION THAT INCLUDES MAPPINGS FOR
-#      COMMON ONTOLOGIES. USERS MIGHT WANT TO ADAPT THIS CONFIGURATION AB
+#      COMMON ONTOLOGIES. USERS MIGHT WANT TO ADAPT THIS CONFIGURATION BY
 #      COMMENTING/UNCOMMENTING AND/OR ADDING NEW MAPPINGS
 
 # --- Define the Languages for all fields ---
@@ -27,7 +27,7 @@
 
 # --- RDF RDFS and OWL Mappings ---
 # This configuration only index properties that are typically used to store
-# instance data defined by such namespaces. This excludes Ontology definitions
+# instance data defined by such namespaces. This excludes ontology definitions
 
 # NOTE that nearly all other ontologies are are using properties of these three
 #      schemas, therefore it is strongly recommended to include such information!
@@ -41,14 +41,14 @@ rdfs:seeAlso | d=entityhub:ref
 
 owl:sameAs | d=entityhub:ref
 
-#If one likes to also index Ontologies one should add the following statements
+#If one likes to also index ontologies one should add the following statements
 #owl:*
 #rdfs:*
 
 # --- Dublin Core (DC) ---
-# The default configuration imports all dc-terms data and copies vlaues for the
-# old dc-elements standard over to the according properties ofthe dc-terms
-#standard.
+# The default configuration imports all dc-terms data and copies values for the
+# old dc-elements standard over to the according properties of the dc-terms
+# standard.
 
 # NOTE that a lot of other ontologies are also using DC for some of there data
 #      therefore it is strongly recommended to include such information!
@@ -78,11 +78,11 @@ dc-elements:source > dc:source
 dc-elements:subject > dc:subject
 dc-elements:title > dc:title
 dc-elements:type > dc:type
-#also use ec-elements:title as label
+#also use dc-elements:title as label
 dc-elements:title > rdfs:label
 
 # --- Social Networks (via foaf) ---
-#The Friend of a Friend schema often used to describe social relations between people
+#The Friend of a Friend schema is often used to describe social relations between people
 foaf:*
 
 # copy the name of a person over to rdfs:label
@@ -104,7 +104,7 @@ foaf:page | d=xsd:anyURI
 # --- Simple Knowledge Organization System (SKOS) ---
 
 # A common data model for sharing and linking knowledge organization systems 
-# via the Semantic Web. Typically used to encode controlled vocabularies auch as
+# via the Semantic Web. Typically used to encode controlled vocabularies as
 # a thesaurus  
 skos:*
 
@@ -123,14 +123,14 @@ skos:narrowMatch > skos:skos:narrower
 # however such properties are only intended to be used by reasoners to
 # calculate transitive closures over broader/narrower hierarchies.
 # see http://www.w3.org/TR/skos-reference/#L2413 for details
-# to correct such cases we will copy transitive relations to there counterpart
+# to correct such cases we will copy transitive relations to their counterpart
 skos:narrowerTransitive > skos:narrower
 skos:broaderTransitive > skos:broader
 
 
 # --- Semantically-Interlinked Online Communities (SIOC) ---
 
-# an ontology for describing the information in online communities. 
+# An ontology for describing the information in online communities. 
 # This information can be used to export information from online communities 
 # and to link them together. The scope of the application areas that SIOC can 
 # be used for includes (and is not limited to) weblogs, message boards,
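
The mapping lines visible in the mappings.txt hunks above take three shapes: a bare field pattern ("foaf:*"), a field with a "d=" option ("owl:sameAs | d=entityhub:ref"), and a copy rule ("dc-elements:title > rdfs:label"). The Python sketch below splits only these three shapes. It is an illustration under stated assumptions, not the Entityhub parser: `parse_mapping` is a hypothetical name, "d=" is assumed to denote a data type, and any syntax not shown in this diff is not handled.

```python
def parse_mapping(line):
    """Split one mappings.txt line into field, datatype and target.

    Handles only the forms that appear in this diff. Returns None for
    blank lines and for '#' comments (including commented-out mappings).
    """
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    # "source > target" copies values of the source field to the target.
    source, _, target = text.partition(">")
    # "field | d=type" attaches an option to the field pattern.
    field, _, option = source.partition("|")
    option = option.strip()
    # Assumption: "d=" names the data type the values are converted to.
    datatype = option[2:].strip() if option.startswith("d=") else None
    return {
        "field": field.strip(),
        "datatype": datatype or None,
        "target": target.strip() or None,
    }
```

For example, `parse_mapping("skos:broaderTransitive > skos:broader")` separates the source field from the field its values are copied to, while `parse_mapping("#owl:*")` is skipped as a comment.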

svn commit: r1166228 - in /incubator/stanbol/trunk/entityhub/indexing/genericrdf: README.md src/main/resources/indexing/config/indexing.properties src/main/resources/indexing/config/mappings.txt
