Problems Indexing/Parsing Tibetan Text
Hello, I'm currently working out some problems when searching for Tibetan characters, specifically \u0F10-\u0F19. We are using the StandardAnalyzer (3.4), and I've narrowed the problem down to StandardTokenizerImpl throwing these characters away: in getNextToken(), they fall through to the case commented /* Not numeric, word, ideographic, hiragana, or SE Asian -- ignore it */. So the question is: is this the expected behaviour, and if it is, what would be the best way to go about supporting code points that are not recognized by the StandardAnalyzer in a general way?
Re: Problems Indexing/Parsing Tibetan Text
On Fri, Mar 30, 2012 at 12:46 PM, Denis Brodeur denisbrod...@gmail.com wrote:
> Hello, I'm currently working out some problems when searching for Tibetan
> characters, specifically \u0F10-\u0F19. We are using the [...]

Unicode doesn't consider most of these characters part of a word: most are punctuation and symbols (except 0F18 and 0F19, which are combining characters that combine with digits). For example, 0F14 is a text delimiter. In general, StandardTokenizer discards punctuation and is geared at word boundaries, just as you would have trouble searching on characters like '(' in English. So I think it's totally expected.

--
lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
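The categories Robert describes can be checked from the JDK alone, no Lucene required. A minimal sketch (the class name is just illustrative) that prints the Unicode general category of each code point in the range, showing that none of them are letters:

```java
// Inspect the Unicode general category of U+0F10..U+0F19 using the JDK.
// None of these are letters, which is why a word-oriented tokenizer drops them.
public class TibetanCategories {
    public static void main(String[] args) {
        for (char c = '\u0F10'; c <= '\u0F19'; c++) {
            int type = Character.getType(c);
            String name;
            switch (type) {
                case Character.OTHER_PUNCTUATION: name = "Po (punctuation)"; break;
                case Character.OTHER_SYMBOL:      name = "So (symbol)"; break;
                case Character.NON_SPACING_MARK:  name = "Mn (combining mark)"; break;
                default:                          name = "category " + type; break;
            }
            System.out.printf("U+%04X  letter=%b  %s%n",
                    (int) c, Character.isLetter(c), name);
        }
    }
}
```

For example, U+0F14 (the text delimiter mentioned above) reports Po, and U+0F18/U+0F19 report Mn.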
Re: Problems Indexing/Parsing Tibetan Text
Thanks Robert. That makes sense. Do you have a link handy where I can find this information, i.e. word-boundary/punctuation properties for any Unicode character set?

On Fri, Mar 30, 2012 at 12:57 PM, Robert Muir rcm...@gmail.com wrote:
> In general, StandardTokenizer discards punctuation and is geared at word
> boundaries [...] So I think it's totally expected.
Re: Problems Indexing/Parsing Tibetan Text
fileformat.info

On Mar 30, 2012, at 1:04 PM, Denis Brodeur denisbrod...@gmail.com wrote:
> Do you have a link handy where I can find this information, i.e.
> word-boundary/punctuation properties for any Unicode character set?
Re: Problems Indexing/Parsing Tibetan Text
On Fri, Mar 30, 2012 at 1:03 PM, Denis Brodeur denisbrod...@gmail.com wrote:
> Do you have a link handy where I can find this information, i.e.
> word-boundary/punctuation properties for any Unicode character set?

Yeah, usually I use http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\u0f10-\u0f19]g= and you can then click on a character and see all of its properties easily. (The site seems to have some issues today.)

--
lucidimagination.com
Re: Problems Indexing/Parsing Tibetan Text
Another good reference is this one: http://unicode.org/reports/tr29/

Since the latest Lucene uses this as the basis of its text segmentation, it's worth getting familiar with it.

On Fri, Mar 30, 2012 at 10:09 AM, Robert Muir rcm...@gmail.com wrote:
> Yeah, usually I use
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\u0f10-\u0f19]g=
> and you can then click on a character and see all of its properties easily.
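The word-boundary rules in UAX #29 are also roughly what java.text.BreakIterator implements, so you can observe the same "punctuation splits words" behavior outside Lucene. A small sketch under that assumption (helper and class names are illustrative), which collects the letter-bearing segments of a string containing U+0F14:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Demonstrate UAX #29-style word segmentation: U+0F14 (a Tibetan text
// delimiter, general category Po) acts as a word boundary, much like '(' in English.
public class WordBreakDemo {
    // Collect the segments of `text` that contain at least one letter or digit.
    static List<String> wordSegments(String text) {
        BreakIterator it = BreakIterator.getWordInstance();
        it.setText(text);
        List<String> words = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String seg = text.substring(start, end);
            if (seg.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(seg);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // The delimiter splits the surrounding letters into two separate words.
        System.out.println(wordSegments("foo\u0F14bar"));
    }
}
```

This mirrors why a query containing only U+0F14 finds nothing with StandardAnalyzer: segmentation never produces a token for the delimiter itself.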