Author: wkasper
Date: Tue Jul 31 08:22:51 2012
New Revision: 1367455
URL: http://svn.apache.org/viewvc?rev=1367455&view=rev
Log:
Stanbol-707: New Language Identification Engine
Added:
incubator/stanbol/trunk/enhancer/engines/langdetect/
incubator/stanbol/trunk/enhancer/engines/langdetect/README.md
incubator/stanbol/trunk/enhancer/engines/langdetect/pom.xml
incubator/stanbol/trunk/enhancer/engines/langdetect/src/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/license/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/license/THIRD-PARTY.properties
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/org/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/org/apache/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/org/apache/stanbol/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/org/apache/stanbol/enhancer/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/org/apache/stanbol/enhancer/engines/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/org/apache/stanbol/enhancer/engines/langdetect/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/org/apache/stanbol/enhancer/engines/langdetect/LanguageDetectionEnhancementEngine.java
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/org/apache/stanbol/enhancer/engines/langdetect/LanguageIdentifier.java
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/resources/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/resources/OSGI-INF/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/resources/OSGI-INF/metatype/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/resources/OSGI-INF/metatype/metatype.properties
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/resources/profiles.cfg
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/org/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/org/apache/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/org/apache/stanbol/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/org/apache/stanbol/enhancer/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/org/apache/stanbol/enhancer/engines/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/org/apache/stanbol/enhancer/engines/langdetect/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/org/apache/stanbol/enhancer/engines/langdetect/LanguageDetectionEngineTest.java
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/org/apache/stanbol/enhancer/engines/langdetect/MockComponentContext.java
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/README
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/en.txt
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/ja.txt
(with props)
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/ko.txt
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/zh.txt
(with props)
Added: incubator/stanbol/trunk/enhancer/engines/langdetect/README.md
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/langdetect/README.md?rev=1367455&view=auto
==============================================================================
--- incubator/stanbol/trunk/enhancer/engines/langdetect/README.md (added)
+++ incubator/stanbol/trunk/enhancer/engines/langdetect/README.md Tue Jul 31
08:22:51 2012
@@ -0,0 +1,125 @@
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+
+# LangDetect: Language Identification Enhancement Engine
+
+The **LanguageDetection** engine determines the language of text.
+
+## Technical Description
+
+The provided engine is based on the [language detection library](http://code.google.com/p/language-detection/).
+The text to be checked must be provided in plain text format by the content item.
+
+The result of language identification is added as a TextAnnotation to the content item's metadata, as the string value of the property
+
+ http://purl.org/dc/terms/language
+
+This RDF snippet illustrates the output:
+
+    <fise:TextAnnotation rdf:about="urn:enhancement-a147957b-41f9-58f7-bbf1-b880b3aa4b49">
+      <dc:language>en</dc:language>
+      <dc:creator>org.apache.stanbol.enhancer.engines.langdetect.LanguageDetectionEnhancementEngine</dc:creator>
+    </fise:TextAnnotation>
+
+
+By default the language identifier distinguishes the [53 languages](http://code.google.com/p/language-detection/wiki/LanguageList) listed here:
+
+* af: Afrikaans
+* ar: Arabic
+* bg: Bulgarian
+* bn: Bengali
+* cs: Czech
+* da: Danish
+* de: German
+* el: Greek
+* en: English
+* es: Spanish
+* et: Estonian
+* fa: Persian
+* fi: Finnish
+* fr: French
+* gu: Gujarati
+* he: Hebrew
+* hi: Hindi
+* hr: Croatian
+* hu: Hungarian
+* id: Indonesian
+* it: Italian
+* ja: Japanese
+* kn: Kannada
+* ko: Korean
+* lt: Lithuanian
+* lv: Latvian
+* mk: Macedonian
+* ml: Malayalam
+* mr: Marathi
+* ne: Nepali
+* nl: Dutch
+* no: Norwegian
+* pa: Punjabi
+* pl: Polish
+* pt: Portuguese
+* ro: Romanian
+* ru: Russian
+* sk: Slovak
+* sl: Slovene
+* so: Somali
+* sq: Albanian
+* sv: Swedish
+* sw: Swahili
+* ta: Tamil
+* te: Telugu
+* th: Thai
+* tl: Tagalog
+* tr: Turkish
+* uk: Ukrainian
+* ur: Urdu
+* vi: Vietnamese
+* zh-cn: Simplified Chinese
+* zh-tw: Traditional Chinese
+
+Additional language models can be created with the language-detection [tools](http://code.google.com/p/language-detection/wiki/Tools).
+
+## Configuration options
+
+* <pre><code>org.apache.stanbol.enhancer.engines.langdetect.probe-length</code></pre>
+
+ an integer specifying how many characters will be used for
+ identification. A value of 0 or below means to use the complete
+ text. Otherwise only a substring of the specified length taken from the
+ middle of the text will be used. The default value is 1000 characters.
+
+## Usage
+
+Assuming that the Stanbol endpoint with the full launcher is running at
+
+ http://localhost:8080
+
+and the engine is activated, commands like these can be used from the
+command line to submit a text file as a content item:
+
+* stateless interface
+
+    curl -i -X POST -H "Content-Type:text/plain" -T testfile.txt http://localhost:8080/engines
+
+* stateful interface
+
+    curl -i -X PUT -H "Content-Type:text/plain" -T testfile.txt http://localhost:8080/contenthub/content/someFileId
+
+Alternatively, the Stanbol web interface can be used for submitting documents
+and viewing the metadata at
+
+ http://localhost:8080/contenthub
+
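Note on the probe-length option described in the README above: the following is a minimal standalone sketch of how a probe could be cut from the middle of a text, mirroring the substring arithmetic in computeEnhancements() from this commit. The class name ProbeSketch and the method name middleProbe are hypothetical and not part of the commit.

```java
// Hypothetical illustration; ProbeSketch and middleProbe are not part of the commit.
final class ProbeSketch {

    /**
     * Returns the slice of text that would be handed to the detector.
     * A probeLength of 0 or below, or a text no longer than probeLength,
     * yields the complete text.
     */
    static String middleProbe(String text, int probeLength) {
        if (probeLength <= 0 || text.length() <= probeLength) {
            return text;
        }
        int mid = text.length() / 2;
        int half = probeLength / 2;
        // Integer division: an odd probeLength yields probeLength - 1 characters.
        return text.substring(mid - half, mid + half);
    }

    public static void main(String[] args) {
        System.out.println(middleProbe("abcdefghij", 4)); // prints "defg"
        System.out.println(middleProbe("abcdefghij", 0)); // prints "abcdefghij"
    }
}
```

Because both offsets use integer division, an odd probe length returns one character fewer than configured; this appears harmless for language identification but is worth knowing when testing.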
Added: incubator/stanbol/trunk/enhancer/engines/langdetect/pom.xml
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/langdetect/pom.xml?rev=1367455&view=auto
==============================================================================
--- incubator/stanbol/trunk/enhancer/engines/langdetect/pom.xml (added)
+++ incubator/stanbol/trunk/enhancer/engines/langdetect/pom.xml Tue Jul 31
08:22:51 2012
@@ -0,0 +1,140 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+ xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/maven-v4_0_0.xsd">
+
+ <modelVersion>4.0.0</modelVersion>
+
+ <parent>
+ <artifactId>org.apache.stanbol.enhancer.parent</artifactId>
+ <groupId>org.apache.stanbol</groupId>
+ <version>0.10.0-incubating-SNAPSHOT</version>
+ <relativePath>../../parent</relativePath>
+ </parent>
+
+ <groupId>org.apache.stanbol</groupId>
+ <artifactId>org.apache.stanbol.enhancer.engines.langdetect</artifactId>
+ <version>0.10.0-incubating-SNAPSHOT</version>
+ <packaging>bundle</packaging>
+
+ <name>Apache Stanbol Enhancer Enhancement Engine : Language Identifier</name>
+ <description>language detection for 53 languages based on
http://code.google.com/p/language-detection
+ </description>
+
+ <inceptionYear>2012</inceptionYear>
+
+ <scm>
+ <connection>
+
scm:svn:http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/langdetect/
+ </connection>
+ <developerConnection>
+
scm:svn:https://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/langdetect/
+ </developerConnection>
+ <url>http://incubator.apache.org/stanbol/</url>
+ </scm>
+
+ <build>
+ <plugins>
+ <plugin>
+ <groupId>org.apache.felix</groupId>
+ <artifactId>maven-bundle-plugin</artifactId>
+ <extensions>true</extensions>
+ <configuration>
+ <instructions>
+ <Export-Package>
+
org.apache.stanbol.enhancer.engines.langdetect;version=${project.version}
+ </Export-Package>
+ <Embed-Dependency>
+ langdetect;scope=compile
+ </Embed-Dependency>
+
<Embed-Transitive>true</Embed-Transitive>
+ </instructions>
+ </configuration>
+ </plugin>
+ <plugin>
+ <groupId>org.apache.felix</groupId>
+ <artifactId>maven-scr-plugin</artifactId>
+ </plugin>
+ <plugin>
+ <groupId>org.apache.rat</groupId>
+ <artifactId>apache-rat-plugin</artifactId>
+ <configuration>
+ <excludes>
+ <!-- AL20 licensed files: See
src/test/resources/README -->
+
<exclude>src/test/resources/*.txt</exclude>
+ </excludes>
+ </configuration>
+ </plugin>
+ </plugins>
+ </build>
+
+ <dependencies>
+ <dependency>
+ <groupId>org.apache.stanbol</groupId>
+ <artifactId>org.apache.stanbol.enhancer.servicesapi</artifactId>
+ <version>0.10.0-incubating-SNAPSHOT</version>
+ </dependency>
+
+ <dependency>
+ <groupId>com.cybozu.labs</groupId>
+ <artifactId>langdetect</artifactId>
+ <version>1.1-20120112</version>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.felix</groupId>
+ <artifactId>org.apache.felix.scr.annotations</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.clerezza</groupId>
+ <artifactId>rdf.core</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>commons-io</groupId>
+ <artifactId>commons-io</artifactId>
+ </dependency>
+ <dependency>
+ <groupId>org.slf4j</groupId>
+ <artifactId>slf4j-api</artifactId>
+ </dependency>
+
+ <dependency>
+ <groupId>org.apache.stanbol</groupId>
+ <artifactId>org.apache.stanbol.enhancer.test</artifactId>
+ <version>0.10.0-incubating-SNAPSHOT</version>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.stanbol</groupId>
+ <artifactId>org.apache.stanbol.enhancer.core</artifactId>
+ <version>0.10.0-incubating-SNAPSHOT</version>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.slf4j</groupId>
+ <artifactId>slf4j-simple</artifactId>
+ <scope>test</scope>
+ </dependency>
+ <dependency>
+ <groupId>junit</groupId>
+ <artifactId>junit</artifactId>
+ <scope>test</scope>
+ </dependency>
+ </dependencies>
+
+</project>
Added:
incubator/stanbol/trunk/enhancer/engines/langdetect/src/license/THIRD-PARTY.properties
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/langdetect/src/license/THIRD-PARTY.properties?rev=1367455&view=auto
==============================================================================
---
incubator/stanbol/trunk/enhancer/engines/langdetect/src/license/THIRD-PARTY.properties
(added)
+++
incubator/stanbol/trunk/enhancer/engines/langdetect/src/license/THIRD-PARTY.properties
Tue Jul 31 08:22:51 2012
@@ -0,0 +1,24 @@
+# Generated by org.codehaus.mojo.license.AddThirdPartyMojo
+#-------------------------------------------------------------------------------
+# Already used licenses in project :
+# - Apache Software License
+# - Apache Software License, Version 2.0
+# - BSD License
+# - Common Development And Distribution License (CDDL), Version 1.0
+# - Common Development And Distribution License (CDDL), Version 1.1
+# - Common Public License, Version 1.0
+# - Eclipse Public License, Version 1.0
+# - GNU General Public License (GPL), Version 2 with classpath exception
+# - GNU Lesser General Public License (LGPL)
+# - GNU Lesser General Public License (LGPL), Version 2.1
+# - ICU License
+# - MIT License
+# - Public Domain License
+#-------------------------------------------------------------------------------
+# Please fill the missing licenses for dependencies :
+#
+#
+#Mon Jul 30 15:41:25 CEST 2012
+javax.servlet--servlet-api--2.5=Common Development And Distribution License
(CDDL), Version 1.0
+org.osgi--org.osgi.compendium--4.1.0=The Apache Software License, Version 2.0
+org.osgi--org.osgi.core--4.1.0=The Apache Software License, Version 2.0
Added:
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/org/apache/stanbol/enhancer/engines/langdetect/LanguageDetectionEnhancementEngine.java
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/org/apache/stanbol/enhancer/engines/langdetect/LanguageDetectionEnhancementEngine.java?rev=1367455&view=auto
==============================================================================
---
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/org/apache/stanbol/enhancer/engines/langdetect/LanguageDetectionEnhancementEngine.java
(added)
+++
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/org/apache/stanbol/enhancer/engines/langdetect/LanguageDetectionEnhancementEngine.java
Tue Jul 31 08:22:51 2012
@@ -0,0 +1,232 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.stanbol.enhancer.engines.langdetect;
+
+import static
org.apache.stanbol.enhancer.servicesapi.rdf.Properties.DC_LANGUAGE;
+import static org.apache.stanbol.enhancer.servicesapi.rdf.Properties.DC_TYPE;
+import static
org.apache.stanbol.enhancer.servicesapi.rdf.TechnicalClasses.DCTERMS_LINGUISTIC_SYSTEM;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Dictionary;
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Set;
+
+import org.apache.clerezza.rdf.core.MGraph;
+import org.apache.clerezza.rdf.core.UriRef;
+import org.apache.clerezza.rdf.core.impl.PlainLiteralImpl;
+import org.apache.clerezza.rdf.core.impl.TripleImpl;
+import org.apache.commons.io.IOUtils;
+import org.apache.felix.scr.annotations.Component;
+import org.apache.felix.scr.annotations.Properties;
+import org.apache.felix.scr.annotations.Property;
+import org.apache.felix.scr.annotations.Service;
+import org.apache.stanbol.enhancer.servicesapi.Blob;
+import org.apache.stanbol.enhancer.servicesapi.Chain;
+import org.apache.stanbol.enhancer.servicesapi.ContentItem;
+import org.apache.stanbol.enhancer.servicesapi.EngineException;
+import org.apache.stanbol.enhancer.servicesapi.EnhancementEngine;
+import org.apache.stanbol.enhancer.servicesapi.InvalidContentException;
+import org.apache.stanbol.enhancer.servicesapi.ServiceProperties;
+import org.apache.stanbol.enhancer.servicesapi.helper.ContentItemHelper;
+import org.apache.stanbol.enhancer.servicesapi.helper.EnhancementEngineHelper;
+import org.apache.stanbol.enhancer.servicesapi.impl.AbstractEnhancementEngine;
+import org.osgi.service.cm.ConfigurationException;
+import org.osgi.service.component.ComponentContext;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.cybozu.labs.langdetect.LangDetectException;
+
+/**
+ * {@link LanguageDetectionEnhancementEngine} provides functionality to enhance documents
+ * with their language.
+ *
+ * @author Walter Kasper, DFKI
+ */
+@Component(immediate = true, metatype = true, inherit=true)
+@Service
+@Properties(value={
+ @Property(name=EnhancementEngine.PROPERTY_NAME,value="langdetect")
+})
+public class LanguageDetectionEnhancementEngine
+ extends AbstractEnhancementEngine<LangDetectException,RuntimeException>
+ implements EnhancementEngine, ServiceProperties {
+
+ /**
+ * a configurable value of the text segment length to check
+ */
+ @Property
+ public static final String PROBE_LENGTH_PROP =
"org.apache.stanbol.enhancer.engines.langdetect.probe-length";
+
+
+ /**
+ * The default value for the Execution of this Engine. Currently set to
+ * {@link ServiceProperties#ORDERING_PRE_PROCESSING} - 2<p>
+ * NOTE: this information is used by the default and weighted {@link Chain}
+ * implementations to determine the processing order of
+ * {@link EnhancementEngine}s. Other {@link Chain} implementations do not
+ * use this information.
+ */
+ public static final Integer defaultOrder = ORDERING_PRE_PROCESSING - 2;
+
+ /**
+ * This contains the only MIME type directly supported by this enhancement
engine.
+ */
+ private static final String TEXT_PLAIN_MIMETYPE = "text/plain";
+ /**
+ * Set containing the only supported mime type {@link #TEXT_PLAIN_MIMETYPE}
+ */
+ private static final Set<String> SUPPORTED_MIMTYPES =
Collections.singleton(TEXT_PLAIN_MIMETYPE);
+
+ /**
+ * This contains the logger.
+ */
+ private static final Logger log =
LoggerFactory.getLogger(LanguageDetectionEnhancementEngine.class);
+
+ private static final int PROBE_LENGTH_DEFAULT = 1000;
+
+ /**
+ * How much text should be used for testing: If the value is 0 or smaller,
+ * the complete text will be used. Otherwise a text probe of the given
length
+ * is taken from the middle of the text. The default length is 1000.
+ */
+ private int probeLength = PROBE_LENGTH_DEFAULT;
+
+ private LanguageIdentifier languageIdentifier;
+
+ /**
+ * Initialize the language identifier model and load the probe length bound if
+ * provided as a property.
+ *
+ * @param ce
+ * the {@link ComponentContext}
+ */
+ protected void activate(ComponentContext ce) throws
ConfigurationException, LangDetectException {
+ super.activate(ce);
+ if (ce != null) {
+ @SuppressWarnings("unchecked")
+ Dictionary<String, String> properties = ce.getProperties();
+ String lengthVal = properties.get(PROBE_LENGTH_PROP);
+ probeLength = lengthVal == null ? PROBE_LENGTH_DEFAULT :
Integer.parseInt(lengthVal);
+ }
+ languageIdentifier = new LanguageIdentifier();
+ }
+
+ protected void deactivate(ComponentContext ce) {
+ super.deactivate(ce);
+ this.languageIdentifier = null;
+ }
+
+ public int canEnhance(ContentItem ci) throws EngineException {
+ if(ContentItemHelper.getBlob(ci, SUPPORTED_MIMTYPES) != null){
+ return ENHANCE_ASYNC; //Langid now supports async processing
+ } else {
+ return CANNOT_ENHANCE;
+ }
+ }
+
+ public void computeEnhancements(ContentItem ci) throws EngineException {
+ Entry<UriRef,Blob> contentPart = ContentItemHelper.getBlob(ci,
SUPPORTED_MIMTYPES);
+ if(contentPart == null){
+ throw new IllegalStateException("No ContentPart with Mimetype '"
+ + TEXT_PLAIN_MIMETYPE+"' found for ContentItem
"+ci.getUri()
+ + ": This is also checked in the canEnhance method! ->
This "
+ + "indicates a bug in the implementation of the "
+ + "EnhancementJobManager!");
+ }
+ String text = "";
+ try {
+ text = ContentItemHelper.getText(contentPart.getValue());
+ } catch (IOException e) {
+ throw new InvalidContentException(this, ci, e);
+ }
+ if (text.trim().length() == 0) {
+ log.info("No text contained in ContentPart {} of ContentItem {}",
+ contentPart.getKey(),ci.getUri());
+ return;
+ }
+
+ // truncate text to some piece from the middle if probeLength > 0
+ int checkLength = probeLength;
+ if (checkLength > 0 && text.length() > checkLength) {
+ text = text.substring(text.length() / 2 - checkLength / 2,
text.length() / 2 + checkLength / 2);
+ }
+ String language = null;
+ try {
+ language = languageIdentifier.getLanguage(text);
+ log.info("language identified as " + language);
+ }
+ catch (LangDetectException e) {
+ log.warn("Could not identify language", e);
+ return;
+ }
+
+ // add language to metadata
+ MGraph g = ci.getMetadata();
+ ci.getLock().writeLock().lock();
+ try {
+ UriRef textEnhancement =
EnhancementEngineHelper.createTextEnhancement(ci, this);
+ g.add(new TripleImpl(textEnhancement, DC_LANGUAGE, new
PlainLiteralImpl(language)));
+ g.add(new TripleImpl(textEnhancement, DC_TYPE,
DCTERMS_LINGUISTIC_SYSTEM));
+ } finally {
+ ci.getLock().writeLock().unlock();
+ }
+ }
+
+ public List<String> loadProfiles(String folder, String configFile) throws
Exception {
+ List<String> profiles = new ArrayList<String>();
+ java.util.Properties props = new java.util.Properties();
+
props.load(getClass().getClassLoader().getResourceAsStream(configFile));
+ String languages = props.getProperty("languages");
+ if (languages == null) {
+ throw new IOException("No languages defined");
+ }
+ for (String lang: languages.split(",")) {
+ String profileFile = folder+"/"+lang;
+ InputStream is =
getClass().getClassLoader().getResourceAsStream(profileFile);
+ String profile;
+ try {
+ profile = IOUtils.toString(is, "UTF-8");
+ if (profile != null && profile.length() > 0) {
+ profiles.add(profile);
+ }
+ is.close();
+ } catch (IOException e) {
+ e.printStackTrace();
+ }
+ }
+ return profiles;
+ }
+
+ public int getProbeLength() {
+ return probeLength;
+ }
+
+ public void setProbeLength(int probeLength) {
+ this.probeLength = probeLength;
+ }
+
+ public Map<String, Object> getServiceProperties() {
+ return
Collections.unmodifiableMap(Collections.singletonMap(ENHANCEMENT_ENGINE_ORDERING,
(Object) defaultOrder));
+ }
+
+}
Added:
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/org/apache/stanbol/enhancer/engines/langdetect/LanguageIdentifier.java
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/org/apache/stanbol/enhancer/engines/langdetect/LanguageIdentifier.java?rev=1367455&view=auto
==============================================================================
---
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/org/apache/stanbol/enhancer/engines/langdetect/LanguageIdentifier.java
(added)
+++
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/java/org/apache/stanbol/enhancer/engines/langdetect/LanguageIdentifier.java
Tue Jul 31 08:22:51 2012
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.stanbol.enhancer.engines.langdetect;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.commons.io.IOUtils;
+
+import com.cybozu.labs.langdetect.Detector;
+import com.cybozu.labs.langdetect.DetectorFactory;
+import com.cybozu.labs.langdetect.LangDetectException;
+
+/**
+ * Standalone version of the Language Identifier
+ * @author <a href="mailto:[email protected]">Walter Kasper</a>
+ *
+ */
+
+public class LanguageIdentifier {
+
+ public LanguageIdentifier() throws LangDetectException {
+ DetectorFactory.clear();
+ try {
+
DetectorFactory.loadProfile(loadProfiles("profiles","profiles.cfg"));
+ } catch (Exception e) {
+ throw new LangDetectException(null, "Error in Initialization:
"+e.getMessage());
+ }
+ }
+ /**
+ * Load the profiles from the classpath
+ * @param folder where the profiles are
+ * @param configFile specifies which language profiles should be used
+ * @return a list of profiles
+ * @throws Exception
+ */
+ public List<String> loadProfiles(String folder, String configFile) throws
Exception {
+ List<String> profiles = new ArrayList<String>();
+ java.util.Properties props = new java.util.Properties();
+
props.load(getClass().getClassLoader().getResourceAsStream(configFile));
+ String languages = props.getProperty("languages");
+ if (languages == null) {
+ throw new IOException("No languages defined");
+ }
+ for (String lang: languages.split(",")) {
+ String profileFile = folder+"/"+lang;
+ InputStream is =
getClass().getClassLoader().getResourceAsStream(profileFile);
+ try {
+ String profile = IOUtils.toString(is, "UTF-8");
+ if (profile != null && profile.length() > 0) {
+ profiles.add(profile);
+ }
+ is.close();
+ } catch (IOException e) {
+ e.printStackTrace();
+ }
+ }
+ return profiles;
+ }
+
+ public String getLanguage(String text) throws LangDetectException {
+ Detector detector = DetectorFactory.create();
+ detector.append(text);
+ return detector.detect();
+ }
+
+}
Added:
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/resources/OSGI-INF/metatype/metatype.properties
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/resources/OSGI-INF/metatype/metatype.properties?rev=1367455&view=auto
==============================================================================
---
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/resources/OSGI-INF/metatype/metatype.properties
(added)
+++
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/resources/OSGI-INF/metatype/metatype.properties
Tue Jul 31 08:22:51 2012
@@ -0,0 +1,32 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+stanbol.enhancer.engine.name.name=Name
+stanbol.enhancer.engine.name.description=The name of the enhancement engine as
\
+used in the RESTful interface '/engine/<name>'
+
+service.ranking.name=Ranking
+service.ranking.description=If two enhancement engines with the same name are
active the \
+one with the higher ranking will be used to process parsed content items.
+
+#===============================================================================
+#Properties and Options used to configure LanguageDetectionEnhancementEngine
+#===============================================================================
+
+org.apache.stanbol.enhancer.engines.langdetect.LanguageDetectionEnhancementEngine.name=Apache
Stanbol \
+Enhancer Engine: Language Identification
+org.apache.stanbol.enhancer.engines.langdetect.LanguageDetectionEnhancementEngine.description=Detects
\
+the Language for parsed Text.
Added:
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/resources/profiles.cfg
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/resources/profiles.cfg?rev=1367455&view=auto
==============================================================================
---
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/resources/profiles.cfg
(added)
+++
incubator/stanbol/trunk/enhancer/engines/langdetect/src/main/resources/profiles.cfg
Tue Jul 31 08:22:51 2012
@@ -0,0 +1,25 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# This is the properties file read by the langdetect LanguageIdentifier.
+# It is loaded from the classpath under the name profiles.cfg and lists
+# the language profiles to load from the profiles/ folder on the
+# classpath.
+
+# List of languages for which there is a profile file profiles/<language>
+# If there exists an ISO 639-1 2-letter code it should be used
+# If not, you can choose an ISO 639-2 3-letter code
+languages=af,ar,bg,bn,cs,da,de,el,en,es,et,fa,fi,fr,gu,he,hi,hr,hu,id,it,ja,kn,ko,lt,lv,mk,ml,mr,ne,nl,no,pa,pl,pt,ro,ru,sk,sl,so,sq,sv,sw,ta,te,th,tl,tr,uk,ur,vi,zh-cn,zh-tw
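Note on the languages property above: LanguageIdentifier.loadProfiles() in this commit reads it with java.util.Properties and splits it on commas to find the profile files. The sketch below isolates only that parsing step, so it can be tried without the langdetect library. The class ProfileListSketch and the method parseLanguages are hypothetical. Trimming the entries and using unchecked exceptions are assumptions made for the illustration; the committed code does neither.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical illustration; not part of the commit.
final class ProfileListSketch {

    /** Parses the comma-separated "languages" property from properties-file text. */
    static List<String> parseLanguages(String cfgText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(cfgText));
        } catch (IOException e) {
            // Not expected for an in-memory reader; wrapped to keep the sketch simple.
            throw new IllegalStateException(e);
        }
        String languages = props.getProperty("languages");
        if (languages == null) {
            // The committed code throws IOException("No languages defined") here.
            throw new IllegalArgumentException("No languages defined");
        }
        List<String> result = new ArrayList<String>();
        for (String lang : languages.split(",")) {
            String trimmed = lang.trim(); // assumption: tolerate spaces after commas
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parseLanguages("# comment\nlanguages=af,ar,zh-cn\n"));
    }
}
```

Each resulting code would then be resolved to a classpath resource named profiles/<code>, as loadProfiles() does with folder + "/" + lang.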
Added:
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/org/apache/stanbol/enhancer/engines/langdetect/LanguageDetectionEngineTest.java
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/org/apache/stanbol/enhancer/engines/langdetect/LanguageDetectionEngineTest.java?rev=1367455&view=auto
==============================================================================
---
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/org/apache/stanbol/enhancer/engines/langdetect/LanguageDetectionEngineTest.java
(added)
+++
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/org/apache/stanbol/enhancer/engines/langdetect/LanguageDetectionEngineTest.java
Tue Jul 31 08:22:51 2012
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.stanbol.enhancer.engines.langdetect;
+
+import static junit.framework.Assert.assertEquals;
+import static org.apache.stanbol.enhancer.test.helper.EnhancementStructureHelper.validateAllEntityAnnotations;
+import static org.apache.stanbol.enhancer.test.helper.EnhancementStructureHelper.validateAllTextAnnotations;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNotNull;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Arrays;
+import java.util.HashMap;
+
+
+import org.apache.clerezza.rdf.core.LiteralFactory;
+import org.apache.clerezza.rdf.core.Resource;
+import org.apache.clerezza.rdf.core.UriRef;
+import org.apache.commons.io.IOUtils;
+import org.apache.stanbol.enhancer.contentitem.inmemory.InMemoryContentItemFactory;
+import org.apache.stanbol.enhancer.servicesapi.ContentItem;
+import org.apache.stanbol.enhancer.servicesapi.ContentItemFactory;
+import org.apache.stanbol.enhancer.servicesapi.EngineException;
+import org.apache.stanbol.enhancer.servicesapi.EnhancementEngine;
+import org.apache.stanbol.enhancer.servicesapi.helper.EnhancementEngineHelper;
+import org.apache.stanbol.enhancer.servicesapi.impl.StringSource;
+import org.apache.stanbol.enhancer.servicesapi.rdf.Properties;
+import org.junit.BeforeClass;
+import org.junit.Test;
+import org.osgi.service.cm.ConfigurationException;
+import org.osgi.service.component.ComponentContext;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.cybozu.labs.langdetect.Detector;
+import com.cybozu.labs.langdetect.DetectorFactory;
+import com.cybozu.labs.langdetect.LangDetectException;
+
+/**
+ * {@link LanguageDetectionEngineTest} is a test class for {@link LanguageIdentifier} and {@link LanguageDetectionEnhancementEngine}.
+ *
+ * @author Walter Kasper, DFKI
+ */
+public class LanguageDetectionEngineTest {
+
+ private static final Logger LOG = LoggerFactory.getLogger(LanguageDetectionEngineTest.class);
+
+ private static final ContentItemFactory ciFactory = InMemoryContentItemFactory.getInstance();
+
+ private static final String[] TEST_FILE_NAMES = {"en.txt","ja.txt","ko.txt","zh.txt"};
+
+ private static LanguageIdentifier langId;
+
+ /**
+ * Initializes the language identifier.
+ * @throws LangDetectException
+ */
+ @BeforeClass
+ public static void oneTimeSetUp() throws IOException, LangDetectException {
+ langId = new LanguageIdentifier();
+ }
+
+ /**
+ * Tests the language identification.
+ *
+ * @throws IOException if there is an error when reading the text
+ */
+ @Test
+ public void testLangId() throws LangDetectException, IOException {
+ LOG.info("Testing: {}", Arrays.asList(TEST_FILE_NAMES));
+ for (String file: TEST_FILE_NAMES) {
+ String expectedLang = file.substring(0,2);
+ InputStream in = LanguageDetectionEngineTest.class.getClassLoader().getResourceAsStream(file);
+ assertNotNull("failed to load resource " + file, in);
+ String text = IOUtils.toString(in, "UTF-8");
+ in.close();
+ String language = langId.getLanguage(text);
+ if (!expectedLang.equals(language.substring(0,2))) {
+ LOG.info("Expected: {}; Found {}",expectedLang,language);
+ }
+ assertEquals(expectedLang, language.substring(0,2));
+ }
+ }
+
+ /**
+ * Tests the engine and validates the created enhancements.
+ * @throws EngineException
+ * @throws IOException
+ * @throws ConfigurationException
+ * @throws LangDetectException
+ */
+ @Test
+ public void testEngine() throws EngineException, ConfigurationException, LangDetectException, IOException {
+ LOG.info("Testing engine: {}", TEST_FILE_NAMES[0]);
+ InputStream in = LanguageDetectionEngineTest.class.getClassLoader().getResourceAsStream(TEST_FILE_NAMES[0]);
+ assertNotNull("failed to load resource " + TEST_FILE_NAMES[0], in);
+ String text = IOUtils.toString(in, "UTF-8");
+ in.close();
+ LanguageDetectionEnhancementEngine langIdEngine = new LanguageDetectionEnhancementEngine();
+ ComponentContext context = new MockComponentContext();
+ context.getProperties().put(EnhancementEngine.PROPERTY_NAME, "langdetect");
+ langIdEngine.activate(context);
+ ContentItem ci = ciFactory.createContentItem(new StringSource(text));
+ langIdEngine.computeEnhancements(ci);
+ HashMap<UriRef,Resource> expectedValues = new HashMap<UriRef,Resource>();
+ expectedValues.put(Properties.ENHANCER_EXTRACTED_FROM, ci.getUri());
+ expectedValues.put(Properties.DC_CREATOR, LiteralFactory.getInstance().createTypedLiteral(
+ langIdEngine.getClass().getName()));
+ int textAnnotationCount = validateAllTextAnnotations(ci.getMetadata(), text, expectedValues);
+ assertEquals("A single TextAnnotation is expected", 1,textAnnotationCount);
+ //even though this test does not validate detection quality but rather
+ //the correct integration of the langdetect library as an EnhancementEngine,
+ //we expect that "en" is detected for the parsed text
+ assertEquals("The detected language for text '"+text+"' MUST BE 'en'",
+ "en",EnhancementEngineHelper.getLanguage(ci));
+
+ int entityAnnoNum = validateAllEntityAnnotations(ci.getMetadata(), expectedValues);
+ assertEquals("No EntityAnnotations are expected",0, entityAnnoNum);
+
+ }
+}
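
The comparison in testLangId above uses only the first two characters of both the file name and the detected code. A plausible reason is that the configured profile list contains region-qualified codes such as zh-cn and zh-tw, while the test file is named zh.txt. The following is a minimal, self-contained sketch of that prefix check. The class name LanguagePrefixCheck, its methods, and the detected codes in the map are hypothetical and hard-coded for illustration; the sketch does not call the langdetect library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LanguagePrefixCheck {

    /** Expected language: the first two characters of the file name, e.g. "zh.txt" gives "zh". */
    static String expectedLanguage(String fileName) {
        return fileName.substring(0, 2);
    }

    /** True when the detected code starts with the expected two-letter prefix. */
    static boolean matches(String fileName, String detectedCode) {
        return detectedCode != null
                && detectedCode.length() >= 2
                && expectedLanguage(fileName).equals(detectedCode.substring(0, 2));
    }

    public static void main(String[] args) {
        // Hypothetical detector results, hard-coded for illustration only.
        Map<String, String> detected = new LinkedHashMap<String, String>();
        detected.put("en.txt", "en");
        detected.put("zh.txt", "zh-cn");
        detected.put("ja.txt", "ja");
        detected.put("ko.txt", "ko");
        for (Map.Entry<String, String> e : detected.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue()
                    + " : " + matches(e.getKey(), e.getValue()));
        }
    }
}
```

Unlike the test, the sketch guards against a null or one-character result; the test itself would fail with an exception in that case instead of an assertion message.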
Added:
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/org/apache/stanbol/enhancer/engines/langdetect/MockComponentContext.java
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/org/apache/stanbol/enhancer/engines/langdetect/MockComponentContext.java?rev=1367455&view=auto
==============================================================================
---
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/org/apache/stanbol/enhancer/engines/langdetect/MockComponentContext.java
(added)
+++
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/java/org/apache/stanbol/enhancer/engines/langdetect/MockComponentContext.java
Tue Jul 31 08:22:51 2012
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.stanbol.enhancer.engines.langdetect;
+
+import java.util.Dictionary;
+import java.util.Hashtable;
+
+import org.osgi.framework.Bundle;
+import org.osgi.framework.BundleContext;
+import org.osgi.framework.ServiceReference;
+import org.osgi.service.component.ComponentContext;
+import org.osgi.service.component.ComponentInstance;
+
+public class MockComponentContext implements ComponentContext {
+
+ private final Dictionary properties = new Hashtable();
+
+ @Override
+ public Dictionary getProperties() {
+ return properties;
+ }
+
+ @Override
+ public Object locateService(String name) {
+ return null;
+ }
+
+ @Override
+ public Object locateService(String name, ServiceReference reference) {
+ return null;
+ }
+
+ @Override
+ public Object[] locateServices(String name) {
+ return null;
+ }
+
+ @Override
+ public BundleContext getBundleContext() {
+ return null;
+ }
+
+ @Override
+ public Bundle getUsingBundle() {
+ return null;
+ }
+
+ @Override
+ public ComponentInstance getComponentInstance() {
+ return null;
+ }
+
+ @Override
+ public void enableComponent(String name) {
+ }
+
+ @Override
+ public void disableComponent(String name) {
+ }
+
+ @Override
+ public ServiceReference getServiceReference() {
+ return null;
+ }
+
+}
Added:
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/README
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/README?rev=1367455&view=auto
==============================================================================
---
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/README
(added)
+++
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/README
Tue Jul 31 08:22:51 2012
@@ -0,0 +1,23 @@
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+
+The following files are provided under the Apache License, Version 2.0:
+
+en.txt
+zh.txt
+ja.txt
+ko.txt
+
+
Added:
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/en.txt
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/en.txt?rev=1367455&view=auto
==============================================================================
---
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/en.txt
(added)
+++
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/en.txt
Tue Jul 31 08:22:51 2012
@@ -0,0 +1,9 @@
+The Java platform and language began as an internal project at Sun
Microsystems in December 1990, providing an alternative to the C++/C
programming languages. Engineer Patrick Naughton had become increasingly
frustrated with the state of Sun's C++ and C APIs (application programming
interfaces) and tools. While considering moving to NeXT, Naughton was offered a
chance to work on new technology and thus the Stealth Project was started.
+
+The Stealth Project was soon renamed to the Green Project with James Gosling
and Mike Sheridan joining Naughton. Together with other engineers, they began
work in a small office on Sand Hill Road in Menlo Park, California. They were
attempting to develop a new technology for programming next generation smart
appliances, which Sun expected to be a major new opportunity[4].
+
+The team originally considered using C++, but it was rejected for several
reasons. Because they were developing an embedded system with limited
resources, they decided that C++ demanded too large a footprint and that its
complexity led to developer errors. The language's lack of garbage collection
meant that programmers had to manually manage system memory, a challenging and
error-prone task. The team was also troubled by the language's lack of portable
facilities for security, distributed programming, and threading. Finally, they
wanted a platform that could be easily ported to all types of devices.
+
+Bill Joy had envisioned a new language combining Mesa and C. In a paper called
Further, he proposed to Sun that its engineers should produce an
object-oriented environment based on C++. Initially, Gosling attempted to
modify and extend C++ (which he referred to as "C++ ++ --") but soon abandoned
that in favor of creating an entirely new language, which he called Oak, after
the tree that stood just outside his office.
+
+By the summer of 1992, they were able to demonstrate portions of the new
platform including the Green OS, the Oak language, the libraries, and the
hardware. Their first attempt, demonstrated on September 3, 1992, focused on
building a PDA device named Star7[2] which had a graphical interface and a
smart agent called "Duke" to assist the user. In November of that year, the
Green Project was spun off to become firstperson, a wholly owned subsidiary of
Sun Microsystems, and the team relocated to Palo Alto, California[5]. The
firstperson team was interested in building highly interactive devices, and
when Time Warner issued an RFP for a set-top box, firstperson changed their
target and responded with a proposal for a set-top box platform. However, the
cable industry felt that their platform gave too much control to the user and
firstperson lost their bid to SGI. An additional deal with The 3DO Company for
a set-top box also failed to materialize. Unable to generate interest
within the TV industry, the company was rolled back into Sun.
\ No newline at end of file
Added:
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/ja.txt
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/ja.txt?rev=1367455&view=auto
==============================================================================
Binary file - no diff available.
Propchange:
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/ja.txt
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added:
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/ko.txt
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/ko.txt?rev=1367455&view=auto
==============================================================================
---
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/ko.txt
(added)
+++
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/ko.txt
Tue Jul 31 08:22:51 2012
@@ -0,0 +1,5 @@
+[Korean test text, UTF-8: three paragraphs on five lines with bracketed citation markers [1] to [8]. The characters were corrupted by a charset mismatch in this message and cannot be reproduced reliably.]
Added:
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/zh.txt
URL:
http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/zh.txt?rev=1367455&view=auto
==============================================================================
Binary file - no diff available.
Propchange:
incubator/stanbol/trunk/enhancer/engines/langdetect/src/test/resources/zh.txt
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream