Hi Walter

On Wed, Aug 1, 2012 at 7:13 PM, Walter Kasper <[email protected]> wrote:
> <rdf:Description
> rdf:about="urn:enhancement-0fe47b47-13c6-fc7d-335f-59e48e7a2bf1">
>     <j.2:type rdf:resource="http://purl.org/dc/terms/LinguisticSystem"/>
>     <j.8:extracted-from
> rdf:resource="urn:content-item-sha1-811041df069ba48e9c4682927267e565d5ec7bd4"/>
>     <rdf:type
> rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
>     <rdf:type
> rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
>     <j.2:language>en</j.2:language>
>     <j.2:created
> rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-08-01T16:53:40.970Z</j.2:created>
>     <j.2:creator
> rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.langdetect.LanguageDetectionEnhancementEngine</j.2:creator>
>   </rdf:Description>
>

AFAIK the framework used supports confidence values and can also
return multiple suggestions. Can you please use these features to
create multiple language annotations that include the confidence
values?

Usage of those is easy, as there are two helper methods:

* EnhancementEngineHelper.getLanguage(..) returns the
language with the highest confidence - suited for simple use cases
* EnhancementEngineHelper.getLanguageAnnotations(..) returns a list
of all language annotations (sorted by confidence). It returns the
subjects of the language annotations; users need to retrieve the
language, fise:confidence, creator ... themselves.

See STANBOL-613 [1] for details.

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-613

>
> Did you make 'mvn clean' before 'mvn install'?
>
> Walter
>
>
> harish suvarna wrote:
>>
>> I did a fresh build, and inside Stanbol at localhost:8080 the engine is
>> installed but not activated. I still see the com.google.inject errors.
>> I do see the pom.xml update from you.
>>
>> -harish
>>
>> On Wed, Aug 1, 2012 at 12:55 AM, Walter Kasper <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> The OSGi bundle declared some package imports that are indeed neither
>>> available nor required. I fixed that. Just check out the corrected
>>> pom.xml.
>>> On a fresh clean Stanbol installation langdetect worked fine for me.
>>>
>>>
>>> Best regards,
>>>
>>> Walter
>>>
>>> harish suvarna wrote:
>>>
>>>> Thanks Dr Walter. langdetect is very useful. I could successfully
>>>> compile it, but I am unable to load it into Stanbol as I get the error
>>>> ======
>>>> ERROR: Bundle org.apache.stanbol.enhancer.engines.langdetect [177]:
>>>> Error starting/stopping bundle. (org.osgi.framework.BundleException:
>>>> Unresolved constraint in bundle
>>>> org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to
>>>> resolve 177.0: missing requirement [177.0] package;
>>>> (package=com.google.inject))
>>>> org.osgi.framework.BundleException: Unresolved constraint in bundle
>>>> org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to
>>>> resolve 177.0: missing requirement [177.0] package;
>>>> (package=com.google.inject)
>>>>       at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>>>>       at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>>>>       at org.apache.felix.framework.Felix.setBundleStartLevel(Felix.java:1333)
>>>>       at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:270)
>>>>       at java.lang.Thread.run(Thread.java:680)
>>>> ==============
>>>>
>>>> I added the dependency
>>>>
>>>>   <dependency>
>>>>     <groupId>com.google.inject</groupId>
>>>>     <artifactId>guice</artifactId>
>>>>     <version>3.0</version>
>>>>   </dependency>
>>>>
>>>> but it looks like it is looking for version 1.3.0, which I can't find
>>>> in repo1.maven.org. I am not sure what needs the inject library. The
>>>> entire source of the langdetect plugin does not contain the word
>>>> "inject"; only the manifest file in target/classes has it listed.
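For reference, when a computed manifest drags in a package the code never uses, a common way to keep it out is an explicit exclusion in the maven-bundle-plugin instructions. The fragment below is a sketch of that general technique only, not necessarily the fix Walter applied to the langdetect pom.xml:

```xml
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <extensions>true</extensions>
  <configuration>
    <instructions>
      <!-- drop com.google.inject from the computed Import-Package list;
           the trailing * keeps all other computed imports -->
      <Import-Package>!com.google.inject*,*</Import-Package>
    </instructions>
  </configuration>
</plugin>
```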
>>>>
>>>>
>>>> -harish
>>>>
>>>> On Tue, Jul 31, 2012 at 1:32 AM, Walter Kasper <[email protected]> wrote:
>>>>
>>>>   Hi Harish,
>>>>>
>>>>> I checked in a new language identifier for Stanbol based on
>>>>>
>>>>> http://code.google.com/p/language-detection/ .
>>>>>
>>>>> Just check out from Stanbol trunk, install and try out.
>>>>>
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Walter
>>>>>
>>>>> harish suvarna wrote:
>>>>>
>>>>>   Rupert,
>>>>>>
>>>>>> My initial debugging for Chinese text told me that the language
>>>>>> identification done by the langid enhancer using Apache Tika does not
>>>>>> recognize Chinese. Tika's language detection does not seem to support
>>>>>> the CJK languages; as a result, Chinese text is identified as
>>>>>> Lithuanian ('lt'). The Apache Tika group has had an enhancement item
>>>>>> registered for detecting CJK languages since Feb 2012:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/TIKA-856
>>>>>>
>>>>>> I am not sure about the use of language identification in Stanbol
>>>>>> yet. Is the language id used to select the dbpedia index (the
>>>>>> appropriate dbpedia language dump) for entity lookups?
>>>>>>
>>>>>>
>>>>>> I am just thinking that, for my purpose, I could pick option 3, make
>>>>>> sure the text is in the language of my interest, and then call the
>>>>>> paoding segmenter. Then I would iterate over the ngrams and do an
>>>>>> entityhub lookup. I still need to understand the code around how the
>>>>>> whole entity lookup for dbpedia works.
>>>>>>
>>>>>> I find that the language detection library
>>>>>>
>>>>>> http://code.google.com/p/language-detection/
>>>>>>
>>>>>> is very good at language detection. It supports 53 languages out of
>>>>>> the box and the quality seems good. It is under the Apache 2.0
>>>>>> license. I could volunteer to create a new langid engine based on it,
>>>>>> with the Stanbol community's approval. So if anyone sheds some light
>>>>>> on how to add a new Java library into Stanbol, that would be great. I
>>>>>> am a maven beginner now.
>>>>>>
>>>>>> Thanks,
>>>>>> harish
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>    Hi harish,
>>>>>>
>>>>>>> Note: Sorry I forgot to include the stanbol-dev mailing list in my
>>>>>>> last
>>>>>>> answer.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks a lot Rupert.
>>>>>>>>
>>>>>>>> I am weighing between options 2 and 3. What is the difference?
>>>>>>>> Option 2 sounds like enhancing the KeywordLinkingEngine to deal
>>>>>>>> with Chinese text. It may be like paoding is hardcoded into the
>>>>>>>> KeywordLinkingEngine. Option 3 is like a separate engine.
>>>>>>>
>>>>>>> Option (2) will require some improvements on the Stanbol side.
>>>>>>> However, there have already been discussions on how to create a
>>>>>>> "text processing chain" that allows splitting up things like
>>>>>>> tokenizing, POS tagging, lemmatizing ... into different enhancement
>>>>>>> engines without suffering from the disadvantages of creating high
>>>>>>> amounts of RDF triples. One idea was to base this on the Apache
>>>>>>> Lucene TokenStream [1] API and share the data as a ContentPart [2]
>>>>>>> of the ContentItem.
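The pull-based style of the Lucene TokenStream API can be pictured with a stdlib-only toy (not the real Lucene classes, just the shape of the contract): a consumer pulls tokens one at a time, so nothing needs to be materialized as per-token RDF triples.

```java
import java.util.*;

// Toy version of the pull-style token-stream idea: consumers call
// incrementToken() and read the current token, instead of receiving all
// tokens up front. NOT the real Lucene API, only an illustration.
public class SimpleTokenStream {

    private final String[] tokens;
    private int pos = -1;

    public SimpleTokenStream(String text) {
        this.tokens = text.split("\\s+");
    }

    // analogous to Lucene's TokenStream.incrementToken()
    public boolean incrementToken() {
        return ++pos < tokens.length;
    }

    // analogous to reading a CharTermAttribute for the current token
    public String term() {
        return tokens[pos];
    }

    public static void main(String[] args) {
        SimpleTokenStream ts = new SimpleTokenStream("Stanbol shares analysis results");
        List<String> seen = new ArrayList<>();
        while (ts.incrementToken()) {
            seen.add(ts.term());
        }
        System.out.println(seen); // [Stanbol, shares, analysis, results]
    }
}
```

Sharing such a stream as a ContentPart would let a tokenizer engine produce it once and downstream engines (POS tagging, linking) consume it without round-tripping through RDF.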
>>>>>>>
>>>>>>> Option (3) indeed means that you will create your own
>>>>>>> EnhancementEngine - a similar one to the KeywordLinkingEngine.
>>>>>>>
>>>>>>>> But will I be able to use the stanbol dbpedia lookup using option 3?
>>>>>>>
>>>>>>> Yes. You only need to obtain an Entityhub "ReferencedSite" and use
>>>>>>> the "FieldQuery" interface to search for entities (see [3] for an
>>>>>>> example).
>>>>>>>
>>>>>>> best
>>>>>>> Rupert
>>>>>>>
>>>>>>> [1] http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>>>>>> [2] http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
>>>>>>> [3] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java
>>>>>>>
>>>>>>>
>>>>>>>> Btw, I created my own enhancement engine chains and I could see
>>>>>>>> them yesterday in localhost:8080. But today all of them have
>>>>>>>> vanished and only the default chain shows up. Can I dig them up
>>>>>>>> somewhere in the stanbol directory?
>>>>>>>>
>>>>>>>> -harish
>>>>>>>>
>>>>>>>> I just created the eclipse project
>>>>>>>> On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>   Hi,
>>>>>>>>>
>>>>>>>>> There are no NER (Named Entity Recognition) models for Chinese
>>>>>>>>> text available via OpenNLP, so the default configuration of
>>>>>>>>> Stanbol will not process Chinese text. What you can do is
>>>>>>>>> configure a KeywordLinking Engine for Chinese, as this engine can
>>>>>>>>> also process text in unknown languages (see [1] for details).
>>>>>>>>>
>>>>>>>>> However, the KeywordLinking Engine also requires at least a
>>>>>>>>> tokenizer for looking up words. As there is no Chinese-specific
>>>>>>>>> tokenizer in OpenNLP, it will use the default one, which uses a
>>>>>>>>> fixed set of chars to split words (white spaces, hyphens ...). You
>>>>>>>>> may know better how well this would work with Chinese texts. My
>>>>>>>>> assumption would be that it is not sufficient - so results will be
>>>>>>>>> sub-optimal.
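Why fixed-character splitting breaks down for Chinese can be seen in a tiny stdlib-only sketch (an illustration of the problem, not Stanbol's actual fallback tokenizer):

```java
// Sketch of why a fixed-character tokenizer fails for Chinese. This is an
// illustration only, not Stanbol's actual fallback tokenizer.
public class WhitespaceSplitDemo {

    // split on whitespace and hyphens, roughly what a default tokenizer does
    static String[] tokenize(String text) {
        return text.split("[\\s\\-]+");
    }

    public static void main(String[] args) {
        // English text splits into useful word tokens ...
        System.out.println(tokenize("named entity recognition").length); // 3
        // ... but Chinese writes no spaces between words, so the whole
        // sentence comes back as a single "token" that no label will match.
        System.out.println(tokenize("今天天气很好").length); // 1
    }
}
```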
>>>>>>>>>
>>>>>>>>> To apply Chinese optimization I see three possibilities:
>>>>>>>>>
>>>>>>>>> 1. add support for Chinese to OpenNLP (Tokenizer, Sentence
>>>>>>>>> detection,
>>>>>>>>> POS tagging, Named Entity Detection)
>>>>>>>>> 2. allow the KeywordLinkingEngine to use other already available
>>>>>>>>> tools for text processing (e.g. stuff that is already available
>>>>>>>>> for Solr/Lucene [2], or the paoding chinese segmenter referenced
>>>>>>>>> in your mail). Currently the KeywordLinkingEngine is hardwired to
>>>>>>>>> OpenNLP, because representing tokens, POS ... as RDF would be too
>>>>>>>>> much of an overhead.
>>>>>>>>> 3. implement a new EnhancementEngine for processing Chinese text.
>>>>>>>>>
>>>>>>>>> Hope this helps to get you started.
>>>>>>>>>
>>>>>>>>> best
>>>>>>>>> Rupert
>>>>>>>>>
>>>>>>>>> [1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
>>>>>>>>> [2] http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
>>>>>>>>> On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>   Hi Rupert,
>>>>>>>>>>
>>>>>>>>>> Finally I am getting some time to work on Stanbol. My job is to
>>>>>>>>>> demonstrate Stanbol annotations for Chinese text.
>>>>>>>>>> I am just starting on it. I am following the instructions to
>>>>>>>>>> build an enhancement engine from Anuj's blog. dbpedia has some
>>>>>>>>>> chinese data dump too.
>>>>>>>>>> We may have to depend on the ngrams as keys and look them up in
>>>>>>>>>> the dbpedia labels.
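The ngrams-as-keys idea can be sketched in a few lines of plain Java. The helper below is hypothetical, only to illustrate the candidate strings one would send to a dbpedia label lookup when no segmenter is available:

```java
import java.util.*;

// Hypothetical helper: enumerate character n-grams of a string as
// candidate "words" to look up against dbpedia labels. Illustrative only.
public class CharNgrams {

    // all character n-grams of the input, left to right
    static List<String> ngrams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        // bigrams of a short Chinese phrase
        System.out.println(ngrams("天气很好", 2)); // [天气, 气很, 很好]
    }
}
```

A real engine would likely run several n sizes and prefer longer matches, since most Chinese words are one to three characters.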
>>>>>>>>>>
>>>>>>>>>> I am planning to use the paoding chinese segmentor
>>>>>>>>>> (http://code.google.com/p/paoding/) for word breaking.
>>>>>>>>>>
>>>>>>>>>> Just curious: I pasted some chinese text into the default engine
>>>>>>>>>> of stanbol. It kind of finished the processing in no time at all.
>>>>>>>>>> This gave me the suspicion that maybe if the language is chinese,
>>>>>>>>>> no further processing is done. Is that right? Any more tips for
>>>>>>>>>> making all this work in Stanbol?
>>>>>>>>>>
>>>>>>>>>> -harish
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> | Rupert Westenthaler             [email protected]
>>>>>>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>>>>>>> | A-5500 Bischofshofen
>>>>>>>>>
>>>>>>>>>
>>>>>>>>   --
>>>>>>>
>>>>>>> | Rupert Westenthaler             [email protected]
>>>>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>>>>> | A-5500 Bischofshofen
>>>>>>>
>>>>>>>
>>>>>>>   --
>>>>>
>>>>> Dr. Walter Kasper
>>>>> DFKI GmbH
>>>>> Stuhlsatzenhausweg 3
>>>>> D-66123 Saarbrücken
>>>>> Tel.:  +49-681-85775-5300
>>>>> Fax:   +49-681-85775-5338
>>>>> Email: [email protected]
>>>>> -------------------------------------------------------------
>>>>>
>>>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>>>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>>>>
>>>>> Geschaeftsfuehrung:
>>>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>>>> Dr. Walter Olthoff
>>>>>
>>>>> Vorsitzender des Aufsichtsrats:
>>>>> Prof. Dr. h.c. Hans A. Aukes
>>>>>
>>>>> Amtsgericht Kaiserslautern, HRB 2313
>>>>> -------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>> --
>>> Dr. Walter Kasper
>>> DFKI GmbH
>>> Stuhlsatzenhausweg 3
>>> D-66123 Saarbrücken
>>> Tel.:  +49-681-85775-5300
>>> Fax:   +49-681-85775-5338
>>> Email: [email protected]
>>> -------------------------------------------------------------
>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>>
>>> Geschaeftsfuehrung:
>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>> Dr. Walter Olthoff
>>>
>>> Vorsitzender des Aufsichtsrats:
>>> Prof. Dr. h.c. Hans A. Aukes
>>>
>>> Amtsgericht Kaiserslautern, HRB 2313
>>> -------------------------------------------------------------
>>>
>>>
>
>
>
>
> --
> Dr. Walter Kasper
> DFKI GmbH
> Stuhlsatzenhausweg 3
> D-66123 Saarbrücken
> Tel.:  +49-681-85775-5300
> Fax:   +49-681-85775-5338
> Email: [email protected]
> -------------------------------------------------------------
>
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
>
> Vorsitzender des Aufsichtsrats:
> Prof. Dr. h.c. Hans A. Aukes
>
> Amtsgericht Kaiserslautern, HRB 2313
> -------------------------------------------------------------
>



-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
