Did a fresh build, and inside Stanbol at localhost:8080 the engine is installed but not activated. I still see the com.google.inject errors. I do see your pom.xml update.
-harish

On Wed, Aug 1, 2012 at 12:55 AM, Walter Kasper <[email protected]> wrote:
> Hi,
>
> The OSGi bundle declared some package imports that indeed are usually neither available nor required. I fixed that. Just check out the corrected pom.xml. On a fresh, clean Stanbol installation langdetect worked fine for me.
>
> Best regards,
>
> Walter
>
> harish suvarna wrote:
>
>> Thanks Dr Walter. langdetect is very useful. I could successfully compile it, but I am unable to load it into Stanbol as I get the error:
>> ======
>> ERROR: Bundle org.apache.stanbol.enhancer.engines.langdetect [177]: Error starting/stopping bundle. (org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to resolve 177.0: missing requirement [177.0] package; (package=com.google.inject))
>> org.osgi.framework.BundleException: Unresolved constraint in bundle org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to resolve 177.0: missing requirement [177.0] package; (package=com.google.inject)
>>         at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>>         at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>>         at org.apache.felix.framework.Felix.setBundleStartLevel(Felix.java:1333)
>>         at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:270)
>>         at java.lang.Thread.run(Thread.java:680)
>> ==============
>>
>> I added the dependency
>>
>> <dependency>
>>   <groupId>com.google.inject</groupId>
>>   <artifactId>guice</artifactId>
>>   <version>3.0</version>
>> </dependency>
>>
>> but it looks like it is looking for version 1.3.0, which I can't find in repo1.maven.org. I am not sure what needs the inject library. The entire source of the langdetect plugin does not contain the word "inject"; only the manifest file in target/classes has it listed.
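[Editor's note: the kind of fix Walter describes (dropping a spurious package import) is usually made in the maven-bundle-plugin section of the engine's pom.xml. The snippet below is only an illustrative sketch; the actual change committed to the Stanbol trunk may look different.]

```xml
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <extensions>true</extensions>
  <configuration>
    <instructions>
      <!-- Exclude the com.google.inject import that the bnd tool
           picked up but that the engine never actually uses. -->
      <Import-Package>
        !com.google.inject*,
        *
      </Import-Package>
    </instructions>
  </configuration>
</plugin>
```

With `!com.google.inject*` the generated MANIFEST.MF no longer declares that import, so the OSGi resolver stops looking for a bundle exporting the package.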
>>
>> -harish
>>
>> On Tue, Jul 31, 2012 at 1:32 AM, Walter Kasper <[email protected]> wrote:
>>
>>> Hi Harish,
>>>
>>> I checked in a new language identifier for Stanbol based on http://code.google.com/p/language-detection/.
>>>
>>> Just check it out from the Stanbol trunk, install it, and try it out.
>>>
>>> Best regards,
>>>
>>> Walter
>>>
>>> harish suvarna wrote:
>>>
>>>> Rupert,
>>>> My initial debugging for Chinese text told me that the language identification done by the langid enhancer using Apache Tika does not recognize Chinese. Tika's language detection does not seem to support the CJK languages. As a result, Chinese text is identified as Lithuanian ('lt'). The Apache Tika group has had an enhancement issue for detecting CJK languages open since Feb 2012:
>>>>
>>>> https://issues.apache.org/jira/browse/TIKA-856
>>>>
>>>> I am not sure about the use of language identification in Stanbol yet. Is the language id used to select the dbpedia index (the appropriate dbpedia language dump) for entity lookups?
>>>>
>>>> For my purpose I am thinking of picking option 3: make sure the text is in the language of interest, call the paoding segmenter, then iterate over the ngrams and do an entityhub lookup. I still need to understand the code around how the whole entity lookup for dbpedia works.
>>>>
>>>> I find that the language detection library http://code.google.com/p/language-detection/ is very good at language detection. It supports 53 languages out of the box and the quality seems good. It is Apache 2.0 licensed. I could volunteer to create a new langid engine based on it, with the Stanbol community's approval. So if anyone could shed some light on how to add a new Java library into Stanbol, that would be great. I am a Maven beginner for now.
>>>>
>>>> Thanks,
>>>> harish
>>>>
>>>> On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler <[email protected]> wrote:
>>>>
>>>>> Hi harish,
>>>>>
>>>>> Note: Sorry, I forgot to include the stanbol-dev mailing list in my last answer.
>>>>>
>>>>> On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <[email protected]> wrote:
>>>>>
>>>>>> Thanks a lot Rupert.
>>>>>>
>>>>>> I am weighing between options 2 and 3. What is the difference? Option 2 sounds like enhancing the KeywordLinkingEngine to deal with Chinese text; it may be like paoding is hardcoded into the KeywordLinkingEngine. Option 3 is like a separate engine.
>>>>>
>>>>> Option (2) will require some improvements on the Stanbol side. However, there have already been discussions on how to create a "text processing chain" that allows splitting things like tokenizing, POS tagging, lemmatizing ... into different Enhancement Engines without suffering from the disadvantage of creating high amounts of RDF triples. One idea was to base this on the Apache Lucene TokenStream [1] API and share the data as a ContentPart [2] of the ContentItem.
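[Editor's note: the language-detection library praised above has a very small API. The sketch below shows how an engine might call it; the profile directory path and sample text are illustrative assumptions, not taken from the Stanbol code.]

```java
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;

public class LangIdSketch {
    public static void main(String[] args) throws Exception {
        // Load the language profiles shipped with the library once at startup.
        // "profiles" is an assumed path to the unpacked profile directory.
        DetectorFactory.loadProfile("profiles");

        // A Detector instance is created per text to analyze.
        Detector detector = DetectorFactory.create();
        detector.append("这是一段中文文本。");

        // detect() returns the most probable language code as a String
        System.out.println(detector.detect());
    }
}
```

The library requires the external langdetect jar and its profile data on the classpath, so this sketch is not runnable standalone.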
>>>>>
>>>>> Option (3) indeed means that you will create your own EnhancementEngine - one similar to the KeywordLinkingEngine.
>>>>>
>>>>>> But will I be able to use the Stanbol dbpedia lookup using option 3?
>>>>>
>>>>> Yes. You only need to obtain an Entityhub "ReferencedSite" and use the "FieldQuery" interface to search for entities (see [3] for an example).
>>>>>
>>>>> best
>>>>> Rupert
>>>>>
>>>>> [1] http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>>>> [2] http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
>>>>> [3] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java
>>>>>
>>>>>> Btw, I created my own enhancement engine chains and I could see them
>>>>>> yesterday at localhost:8080. But today all of them have vanished and only the default chain shows up. Can I dig them up somewhere in the Stanbol directory?
>>>>>>
>>>>>> -harish
>>>>>>
>>>>>> I just created the eclipse project.
>>>>>>
>>>>>> On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> There are no NER (Named Entity Recognition) models for Chinese text available via OpenNLP, so the default configuration of Stanbol will not process Chinese text. What you can do is configure a KeywordLinking Engine for Chinese text, as this engine can also process texts in unknown languages (see [1] for details).
>>>>>>>
>>>>>>> However, the KeywordLinking Engine also requires at least a tokenizer for looking up words. As there is no Chinese-specific tokenizer in OpenNLP, it will use the default one, which uses a fixed set of characters to split words (white spaces, hyphens ...). You may know better how well this works with Chinese texts; my assumption would be that it is not sufficient, so results will be sub-optimal.
>>>>>>>
>>>>>>> To apply Chinese optimizations I see three possibilities:
>>>>>>>
>>>>>>> 1. add support for Chinese to OpenNLP (tokenizer, sentence detection, POS tagging, named entity detection)
>>>>>>> 2. allow the KeywordLinkingEngine to use other already available tools for text processing (e.g. what is already available for Solr/Lucene [2], or the paoding Chinese segmenter referenced in your mail). Currently the KeywordLinkingEngine is hardwired to OpenNLP, because representing tokens, POS tags ... as RDF would be too much of an overhead.
>>>>>>> 3. implement a new EnhancementEngine for processing Chinese text.
>>>>>>>
>>>>>>> Hope this helps to get you started.
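[Editor's note: Rupert's pointer to the Entityhub "ReferencedSite"/"FieldQuery" interfaces can be sketched roughly as below. The package, method, and field names follow the incubator-era trunk as best recalled and may differ; the EntitySearcherUtils.java file he links is the authoritative example.]

```java
import org.apache.stanbol.entityhub.servicesapi.model.Representation;
import org.apache.stanbol.entityhub.servicesapi.query.FieldQuery;
import org.apache.stanbol.entityhub.servicesapi.query.QueryResultList;
import org.apache.stanbol.entityhub.servicesapi.query.TextConstraint;
import org.apache.stanbol.entityhub.servicesapi.site.ReferencedSite;

public class EntityLookupSketch {

    private static final String RDFS_LABEL =
            "http://www.w3.org/2000/01/rdf-schema#label";

    /** Look up entities whose rdfs:label matches the given term. */
    public static void lookup(ReferencedSite site, String term, String lang)
            throws Exception {
        FieldQuery query = site.getQueryFactory().createFieldQuery();
        // constrain the label field to the search term in the given language
        query.setConstraint(RDFS_LABEL, new TextConstraint(term, lang));
        query.addSelectedField(RDFS_LABEL);
        query.setLimit(10);
        QueryResultList<Representation> results = site.find(query);
        for (Representation r : results) {
            System.out.println(r.getId());
        }
    }
}
```

This depends on a running Stanbol instance providing the ReferencedSite, so it is a sketch rather than a standalone program.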
>>>>>>>
>>>>>>> best
>>>>>>> Rupert
>>>>>>>
>>>>>>> [1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
>>>>>>> [2] http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
>>>>>>>
>>>>>>> On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Rupert,
>>>>>>>> Finally I am getting some time to work on Stanbol. My job is to demonstrate Stanbol annotations for Chinese text. I am just starting on it. I am following the instructions to build an enhancement engine from Anuj's blog. dbpedia has some Chinese data dump too. We may have to depend on the ngrams as keys and look them up in the dbpedia labels.
>>>>>>>>
>>>>>>>> I am planning to use the paoding Chinese segmenter (http://code.google.com/p/paoding/) for word breaking.
>>>>>>>>
>>>>>>>> Just curious: I pasted some Chinese text into the default engine of Stanbol, and it finished the processing in no time at all. This gave me the suspicion that maybe if the language is Chinese, no further processing is done. Is that right? Any more tips for making all this work in Stanbol?
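[Editor's note: the Solr/Lucene language-analysis page linked above ([2]) covers the smartcn module, one ready-made alternative to paoding for Chinese word breaking. Below is a small sketch against the Lucene 3.x API of that era; the field name and sample text are illustrative.]

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ChineseTokenizeSketch {
    public static void main(String[] args) throws Exception {
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_36);
        TokenStream stream = analyzer.tokenStream("text",
                new StringReader("我是中国人。"));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        // print one segmented word per line
        while (stream.incrementToken()) {
            System.out.println(term.toString());
        }
        stream.end();
        stream.close();
    }
}
```

This needs the lucene-core and lucene-smartcn jars of a 3.x release on the classpath, so it is not runnable standalone.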
>>>>>>>>
>>>>>>>> -harish
>>>>>>>
>>>>>>> --
>>>>>>> | Rupert Westenthaler [email protected]
>>>>>>> | Bodenlehenstraße 11 ++43-699-11108907
>>>>>>> | A-5500 Bischofshofen
>>>
>>> --
>>> Dr. Walter Kasper
>>> DFKI GmbH
>>> Stuhlsatzenhausweg 3
>>> D-66123 Saarbrücken
>>> Tel.: +49-681-85775-5300
>>> Fax: +49-681-85775-5338
>>> Email: [email protected]
>>> ---------------------------------------------------------------
>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>>
>>> Geschaeftsfuehrung:
>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>> Dr. Walter Olthoff
>>>
>>> Vorsitzender des Aufsichtsrats:
>>> Prof. Dr. h.c. Hans A. Aukes
>>>
>>> Amtsgericht Kaiserslautern, HRB 2313
>>> ---------------------------------------------------------------
