Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Denis Brodeur
Hello, I'm currently working out some problems when searching for Tibetan
characters, more specifically \u0F10-\u0F19.  We are using the
StandardAnalyzer (3.4), and I've narrowed the problem down to
StandardTokenizerImpl throwing away these characters, i.e. in
getNextToken() they fall through to case 1: /* Not numeric, word, ideographic,
hiragana, or SE Asian -- ignore it */.  So the question is: is this the
expected behaviour, and if it is, what would be the best way to go about
supporting code points that are not recognized by the StandardAnalyzer in a
general way?
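[Editor's note: a minimal sketch for reproducing the behaviour described above, assuming Lucene 3.4 on the classpath. The class name, field name, and sample string are made up for illustration; it just prints the tokens StandardAnalyzer emits for text containing a mark from the U+0F10-U+0F19 range.]

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TibetanTokenDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical sample: Tibetan letters with a gter tsheg (U+0F14) in the middle.
        String text = "\u0F40\u0F41\u0F42\u0F14\u0F45\u0F46";

        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
        TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);

        ts.reset();
        while (ts.incrementToken()) {
            // U+0F14 never appears in any emitted token: the tokenizer treats it
            // as punctuation and drops it at the word boundary.
            System.out.println("token: [" + term.toString() + "]");
        }
        ts.end();
        ts.close();
    }
}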


Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Robert Muir
Unicode doesn't consider most of these characters part of a word: most
are punctuation and symbols (except U+0F18 and U+0F19, which are combining
characters that combine with digits).

For example, U+0F14 is a text delimiter.

In general, StandardTokenizer discards punctuation and is geared at word
boundaries, just as you would have trouble searching on characters like
'(' in English.  So I think it's totally expected.

-- 
lucidimagination.com
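
[Editor's note: the general categories mentioned above can be checked straight from the JDK. A small sketch; it assumes Java 7+ for Character.getName, the class name is illustrative, and the exact results depend on the Unicode version your JDK ships.]

public class TibetanCategories {
    public static void main(String[] args) {
        for (int cp = 0x0F10; cp <= 0x0F19; cp++) {
            int type = Character.getType(cp);
            String category;
            switch (type) {
                case Character.OTHER_PUNCTUATION: category = "Po (punctuation)"; break;
                case Character.OTHER_SYMBOL:      category = "So (symbol)"; break;
                case Character.NON_SPACING_MARK:  category = "Mn (combining mark)"; break;
                default:                          category = "type " + type; break;
            }
            // Character.getName(int) is available since Java 7.
            System.out.printf("U+%04X  %-20s %s%n", cp, category, Character.getName(cp));
        }
    }
}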




Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Denis Brodeur
Thanks Robert.  That makes sense.  Do you have a link handy where I can
find this information, i.e. the word-boundary/punctuation properties for
any Unicode character?




Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Benson Margulies
fileformat.info




Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Robert Muir
Yeah, usually I use
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\u0f10-\u0f19]g=

You can then click on a character and see all of its properties easily.

(The site seems to have some issues today.)

-- 
lucidimagination.com
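
[Editor's note: the [\u0f10-\u0f19] part of that URL is ICU UnicodeSet notation, so if you happen to have ICU4J on the classpath (e.g. via Lucene's ICU analysis module) you can list the same set offline. A rough sketch; the class name is made up.]

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.text.UnicodeSetIterator;

public class ListTibetanSet {
    public static void main(String[] args) {
        // Same set notation the CLDR list-unicodeset page takes in its "a" parameter.
        UnicodeSet set = new UnicodeSet("[\\u0F10-\\u0F19]");
        UnicodeSetIterator it = new UnicodeSetIterator(set);
        while (it.next()) {
            int cp = it.codepoint;
            System.out.printf("U+%04X  %s%n", cp, UCharacter.getName(cp));
        }
    }
}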




Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Brandon Mintern
Another good reference is this one: http://unicode.org/reports/tr29/

Since the latest Lucene uses this as the basis of its text segmentation,
it's worth getting familiar with it.
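
[Editor's note: to get a feel for what UAX #29 default word boundaries do with these marks, the JDK's BreakIterator is a convenient stand-in. Lucene's StandardTokenizer implements the spec with generated JFlex code rather than BreakIterator, so treat this only as an approximation; the sample string is made up.]

import java.text.BreakIterator;
import java.util.Locale;

public class WordBoundaryDemo {
    public static void main(String[] args) {
        // Hypothetical sample: Tibetan letters with a gter tsheg (U+0F14) between them.
        String text = "\u0F40\u0F41\u0F14\u0F42\u0F40";

        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        bi.setText(text);

        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String segment = text.substring(start, end);
            // Print each boundary-delimited segment with the code point of its first
            // character, so you can see which segment the mark ends up in.
            System.out.printf("segment [%s] starts at U+%04X%n", segment, (int) segment.charAt(0));
        }
    }
}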
