WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]

Steven Rowe Wed, 23 May 2007 10:07:39 -0700

Hi Mohammad,

WhitespaceAnalyzer uses Java's Character.isWhitespace(char) method to
determine whether or not a character should be part of a token.  As far
as I know, this method is problematic only for characters outside of the
Basic Multilingual Plane (BMP).  I think Lucene should switch to using
the int-based character methods, to support characters outside of the BMP.


Arabic characters are all in the BMP[1][2][3][4], so this method should
work properly.  The only Arabic or Persian characters I can find outside
of the BMP are in the Old Persian block[5], [ U+103A0 - U+103DF ], but
this is ancient cuneiform - are you really dealing with digital
cuneiform documents?

I suspect that you are using WhitespaceAnalyzer as the basis for a more
sophisticated tokenizer - if this is true, you may want to check out the
tokenizer in the AraMorph project[6][7].  (The AraMorph stemmer probably
will not serve your needs, though, since Persian and Arabic have
different lexicons and grammars.)

Hope it helps,
Steve

[1] Arabic: http://www.unicode.org/charts/PDF/U0600.pdf
[2] Arabic Supplement: http://www.unicode.org/charts/PDF/U0750.pdf
[3] Arabic Presentation Forms A: http://www.unicode.org/charts/PDF/UFB50.pdf
[4] Arabic Presentation Forms B: http://www.unicode.org/charts/PDF/UFE70.pdf
[5] Old Persian: http://www.unicode.org/charts/PDF/U103A0.pdf
[6] AraMorph: http://www.nongnu.org/aramorph/
[7] ArabicTokenizer for Lucene:
http://www.nongnu.org/aramorph/javadoc/gpl/pierrick/brihaye/aramorph/lucene/ArabicTokenizer.html


Mohammad Norouzi wrote:
> Hi Steve,
> No I didn't make any change on WhiteSpaceAnalyzer I just extends my classes
> from the original classes and then override my new changes. so I dont think
> I should to contribute my classes.
> 
> and my language is Persian, and only change I've made is not to ignoring
> unicode characters in Persian and arabic language, because with original
> WhitespaceAnalyzer it didnt work fine whether it ignore or something
> else, I
> dont know but I extends my classes and now I am using my analyzer to index.
> 
> On 5/22/07, Steven Rowe <[EMAIL PROTECTED]> wrote:
>>
>> Hi Mohammad,
>>
>> May I ask what your language is?  And what kind of changes to
>> WhitespaceAnalyzer were required to make it work with your language?
>>
>> If you have made modifications to WhitespaceAnalyzer that are generally
>> useful, please consider contributing your changes back to the Lucene
>> project.  There is some info here on how to get started:
>>
>>    http://wiki.apache.org/jakarta-lucene/HowToContribute
>>
>> Thanks,
>> Steve
>>
>> Mohammad Norouzi wrote:
>> > Walter,
>> > Yes I am using a customized WhiteSpaceAnalyzer while indexing.
>> > I said customized because I realized that standard WhiteSpaceAnalyzer
>> dont
>> > accept unicode terms in my language so I make some change to support
>> that.
>> >
>> > but for reading no Analyzer is used
>> >
>> > if I want to get that result, which analyzer should I use?
>> >
>> > in my case, I dont need any boost factor or any other feature of
>> lucene,
>> I
>> > need just searching through the index.
>> >
>> >
>> > On 5/22/07, Walter Ferrara <[EMAIL PROTECTED]> wrote:
>> >>
>> >> If Reader.terms() gives you:
>> >> text3
>> >> text4
>> >> while you expect
>> >> text3 text4
>> >>
>> >> you should change, I presume, the Analyzer, maybe writing your own
>> one.
>> >>
>> >> Mohammad Norouzi wrote:
>> >> > Hi all
>> >> >
>> >> > consider following index
>> >> >
>> >> > field1           field2                              field3
>> >> > text1           text1 text2                      text3 text4
>> >> > text4           text2                              text2 text3 text5
>> >> >
>> >> > I want to get all terms in filed3
>> >> > if I use Reader.terms() it will returns: (however i have to put
>> an if
>> >> > statement to filter result of the field3 only)
>> >> > text3
>> >> > text4
>> >> > text2
>> >> > text5
>> >> >
>> >> > but I need following result:
>> >> > "text3 text4"
>> >> > "text2 text3 text5"
>> >> >
>> >> >
>> >> > is this possible? if yes, how? and if no, is there any tricky way to
>> >> get
>> >> > this result?
>> >> >
>> >> > thank you so much.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]

Reply via email to