RE: lucene farsi problem

2008-05-11 Thread Steven A Rowe

Hi Esra,

Did you try the new version of the patch?

In the latest version, I have taken the code that was in CollatingRangeQuery and
put it into RangeQuery.

I also put the same functionality into RangeFilter, and provided code to call 
it from ConstantScoreRangeQuery and QueryParser.  Note that 
ConstantScoreRangeQuery doesn't have the clause limit restriction that 
RangeQuery has (1024 max clauses, IIRC).
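
The idea discussed in this thread, testing range membership with a caller-supplied ordering instead of raw term order, can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the code in the LUCENE-1279 patch: the class `RangeBySortOrder` and the method `termsInRange` are hypothetical names, and the sample terms are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene or of the patch.
final class RangeBySortOrder {

    // Returns the terms t with lower <= t <= upper under the supplied ordering.
    static List<String> termsInRange(List<String> terms, String lower, String upper,
                                     Comparator<? super String> order) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (order.compare(term, lower) >= 0 && order.compare(term, upper) <= 0) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String dal = "\u062F"; // lower bound of the range
        String zhe = "\u0698"; // upper bound of the range
        // Two sample terms: one starting with reh (U+0631), one with sin (U+0633).
        List<String> terms = Arrays.asList("\u0631\u0648\u0632", "\u0633\u0627\u0628");
        // Plain String order compares UTF-16 code units, so the term starting
        // with sin falls between dal (U+062F) and zhe (U+0698) and is kept.
        System.out.println(termsInRange(terms, dal, zhe, Comparator.<String>naturalOrder()));
    }
}
```

Because java.text.Collator implements Comparator<Object>, a Collator instance can be passed as the `order` argument directly, which is the same shape as handing a Collator to the range query.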

Steve

On 05/10/2008 at 1:22 PM, esra wrote:
 
 Hi Steve,
 
 I used the locale "ar" and it works fine.
 
 Again, thanks a lot for your help.
 
 Esra
 
 
 Steven A Rowe wrote:
  
  Hi Esra,
  
  On 05/06/2008 at 7:38 AM, esra wrote:
   i tried the class and it works fine with the locale parameter ar.
  
  Cool, I'm glad this addressed your problem!
  
   Actually we are using fa for farsi and ar for arabic.
   I have added a little control for the locale parameter in my
   code and now i can see the correct results.
  
  From what I could tell, the Collator available for Locale("fa") in the
  Sun 1.4.2 and 1.5.0 JDKs does not contain Farsi character collation,
  but the Collator available for Locale("ar") *does* contain Farsi
  collation.  I switched TestCollatingRangeQuery from Locale("fa") to
  Locale("ar") when I couldn't get the Collator returned for Farsi [ via
  Collator.getInstance(new Locale("fa")) ] to produce correct results.
  
  Did you find that Locale(fa) produces the correct results?  If so,
  which VM are you using?
  
  At Chris Hostetter's suggestion, I am rewriting the patch attached to
  LUCENE-1279, including the following changes:
  
  - Merged the contents of the CollatingRangeQuery class into RangeQuery
  and RangeFilter
  - Switched the Locale parameter to instead take an instance of Collator
  - Modified QueryParser.jj to construct a QueryParser class that can
  accept a range collator and pass it either to RangeQuery or through
  ConstantScoreRangeQuery to RangeFilter.
  
  I plan on posting the revised patch in the next day or two.
  
  Steve
  
  On 05/06/2008 at 7:38 AM, esra wrote:
   
   Hi Steven,
   
   i tried the class and it works fine with the locale parameter ar.
   
   Actually we are using fa for farsi and ar for arabic.
   I have added a little control for the locale parameter in my
   code and now i can see the correct results.
   
   Thank you very much for your help.
   
   Esra.
   
   Steven A Rowe wrote:

Hi Esra,

I have attached a patch to LUCENE-1279 containing a new class:
CollatingRangeQuery.

The patch also contains a test class: TestCollatingRangeQuery.  One
of the test methods checks for the Farsi range you were having
trouble with.

It should be mentioned that according to
Collator.getAvailableLocales(), neither Java 1.4.2 nor Java 1.5.0
contains Farsi collation information. However, in the test class I
use the Arabic Locale, which seems to properly collate the non-Arabic
Farsi letter U+0698, and hopefully behaves well with other Farsi
letters.  If you find that this is not the case, you can look into
writing collation rules using RuleBasedCollator - you should be able
to directly specify the proper letter orderings for Farsi;
CollatingRangeQuery would also have to supply a constructor that
takes in a Collator instead of a Locale.

Please give the class a try and post back about how it works.

Thanks,
Steve

On 05/03/2008 at 8:33 AM, esra wrote:
 
 Hi Steven,
 
 thanks for your help
 
 Esra
 
 
 Steven A Rowe wrote:
  
  Hi Esra,
  
  I have created an issue for this - see
  https://issues.apache.org/jira/browse/LUCENE-1279.
  
  I'll try to take a crack at a patch this weekend.
  
  Steve
  
  On 05/02/2008 at 12:55 PM, esra wrote:
   
   Hi Steven ,
   
   yes you are right, sorry i am a bit confused.
   
   i checked again and the correct one is  zhe/U+698.
   
   It seems the word is in the range but my customer says it
   shouldn't be.
   
   I think the problem occurs because zhe is a Persian letter outside
   the Arabic alphabet. In the Farsi alphabet this letter does not come
   after the letter س, but its Unicode value is bigger than that of س,
   and the searcher works with Unicode values.
   
   Esra
   
   
   Steven A Rowe wrote:

Hi Esra,

You are *still* incorrectly referring to the glyph with three dots
over it:

On 05/02/2008 at 12:18 PM, esra wrote:
 yes the correct one is ژ /ze/U+632.

ژ is *not* ze/U+632 - it is zhe/U+698.

Have you increased the font size?  Can you see the difference
between these two?:

ژ/zhe/U+698
ز/ze/U+632

 my problem is when i do search for  د-ژ range.
   The result is
 ساب
 ووفر and this word's first

RE: lucene farsi problem

2008-05-10 Thread esra


Hi Steve,

I used the locale "ar" and it works fine.

Again, thanks a lot for your help.

Esra


Steven A Rowe wrote:
 
 Hi Esra,
 
 On 05/06/2008 at 7:38 AM, esra wrote:
 i tried the class and it works fine with the locale parameter ar.
 
 Cool, I'm glad this addressed your problem!
 
 Actually we are using fa for farsi and ar for arabic.
 I have added a little control for the locale parameter in my
 code and now i can see the correct results.
 
 From what I could tell, the Collator available for Locale("fa") in the Sun
 1.4.2 and 1.5.0 JDKs does not contain Farsi character collation, but the
 Collator available for Locale("ar") *does* contain Farsi collation.  I
 switched TestCollatingRangeQuery from Locale("fa") to Locale("ar") when I
 couldn't get the Collator returned for Farsi [ via
 Collator.getInstance(new Locale("fa")) ] to produce correct results.
 
 Did you find that Locale(fa) produces the correct results?  If so, which
 VM are you using?
 
 At Chris Hostetter's suggestion, I am rewriting the patch attached to
 LUCENE-1279, including the following changes:
 
 - Merged the contents of the CollatingRangeQuery class into RangeQuery and
 RangeFilter
 - Switched the Locale parameter to instead take an instance of Collator
 - Modified QueryParser.jj to construct a QueryParser class that can accept
 a range collator and pass it either to RangeQuery or through
 ConstantScoreRangeQuery to RangeFilter.
 
 I plan on posting the revised patch in the next day or two.
 
 Steve
 
 On 05/06/2008 at 7:38 AM, esra wrote:
 
 Hi Steven,
 
 i tried the class and it works fine with the locale parameter ar.
 
 Actually we are using fa for farsi and ar for arabic.
 I have added a little control for the locale parameter in my
 code and now i can see the correct results.
 
 Thank you very much for your help.
 
 Esra.
 
 Steven A Rowe wrote:
  
  Hi Esra,
  
  I have attached a patch to LUCENE-1279 containing a new class:
  CollatingRangeQuery.
  
  The patch also contains a test class: TestCollatingRangeQuery.  One of
  the test methods checks for the Farsi range you were having trouble
  with.
  
  It should be mentioned that according to
  Collator.getAvailableLocales(), neither Java 1.4.2 nor Java 1.5.0
  contains Farsi collation information. However, in the test class I use
  the Arabic Locale, which seems to properly collate the non-Arabic Farsi
  letter U+0698, and hopefully behaves well with other Farsi letters.  If
  you find that this is not the case, you can look into writing collation
  rules using RuleBasedCollator - you should be able to directly specify
  the proper letter orderings for Farsi; CollatingRangeQuery would also
  have to supply a constructor that takes in a Collator instead of a
  Locale.
  
  Please give the class a try and post back about how it works.
  
  Thanks,
  Steve
  
  On 05/03/2008 at 8:33 AM, esra wrote:
   
   Hi Steven,
   
   thanks for your help
   
   Esra
   
   
   Steven A Rowe wrote:

Hi Esra,

I have created an issue for this - see
https://issues.apache.org/jira/browse/LUCENE-1279.

I'll try to take a crack at a patch this weekend.

Steve

On 05/02/2008 at 12:55 PM, esra wrote:
 
 Hi Steven ,
 
 yes you are right, sorry i am a bit confused.
 
 i checked again and the correct one is  zhe/U+698.
 
 It seems the word is in the range but my customer says it
 shouldn't be.
 
 I think the problem occurs because zhe is a Persian letter outside
 the Arabic alphabet. In the Farsi alphabet this letter does not come
 after the letter س, but its Unicode value is bigger than that of س,
 and the searcher works with Unicode values.
 
 Esra
 
 
 Steven A Rowe wrote:
  
  Hi Esra,
  
  You are *still* incorrectly referring to the glyph with three dots
  over it:
  
  On 05/02/2008 at 12:18 PM, esra wrote:
   yes the correct one is ژ /ze/U+632.
  
  ژ is *not* ze/U+632 - it is zhe/U+698.
  
  Have you increased the font size?  Can you see the difference
  between these two?:
  
  ژ/zhe/U+698
  ز/ze/U+632
  
   my problem is when i do search for د-ژ range. The result is ساب ووفر
   and this word's first letter is س and its Unicode value is U+633, and
   it is not in the [ U+062F - U+0632 ] range.
  
  Like I keep saying, in the above description, you're using the glyph
  ژ/zhe/U+698, while at the same time incorrectly referring
  to it as ze/U+632.
  
  I don't mean to continually bang on about this - if you're *sure*
  that when you search, you're using the character represented by the
  glyph with one dot (and not three), i.e. ز/ze/U+632, then the
  problem lies elsewhere.
  
  Steve
  
  On 05/02/2008 at 12:18 PM, esra wrote:
   yes the correct one is ژ /ze/U+632.
   
   my problem is when i do search forد-ژ

RE: lucene farsi problem

2008-05-09 Thread Steven A Rowe

Hi Esra,

On 05/07/2008 at 11:49 AM, Steven A Rowe wrote:
 At Chris Hostetter's suggestion, I am rewriting the patch
 attached to LUCENE-1279, including the following changes:
 
 - Merged the contents of the CollatingRangeQuery class into
 RangeQuery and RangeFilter
 - Switched the Locale parameter to instead take an instance
 of Collator
 - Modified QueryParser.jj to construct a QueryParser class
 that can accept a range collator and pass it either to
 RangeQuery or through ConstantScoreRangeQuery to RangeFilter.

I have attached the above-described revised patch to LUCENE-1279 - Esra, if you 
get a chance, could you try it out?  The implementation hasn't changed (except 
for the cosmetic changes noted above) -- you'll just be using RangeQuery 
instead of CollatingRangeQuery.

Thanks,
Steve

RE: lucene farsi problem

2008-05-08 Thread Vizzini

Dear Steven

Thanks for reply. I've just checked the link and it was working. Anyway, you
are right, but my point is to use the correct term for main 3 reasons:

1. Respect the host language, i.e. English
3. Apparently the Islamic regime in Tehran is against the word ‘Persian’,
and we as the free world have responsibility to fight them in any possible
way – and in the meantime assist Iranians in their fight against the tyrant
Moslem clerics.

Kind Regards
PV

Steven A Rowe wrote:

Hi PV,

On 05/07/2008 at 2:54 AM, PV wrote:
Sorry for cross posting, but why the word 'Farsi' instead of
'Persian'? No one says Lucene français or Español, or Deutsch - so why
Farsi?

Please read the following article, I found it quite enlightening.
http://www.cais-soas.com/CAIS/Languages/persian_not_farsi.htm

I was unable to follow your link -- I get the message "The page cannot be
found" -- but Google's cache came to my rescue:

http://209.85.165.104/search?q=cache:vJo1ye09MzYJ:www.cais-soas.com/CAIS/Languages/persian_not_farsi.htm+persian+farsi+site:cais-soas.comhl=enct=clnkcd=1gl=usclient=firefox-a

The article makes fine logical arguments against the use of Farsi, but
clearly does not take a balanced point of view on the controversy.

I count myself among the ignorant - I had not realized that Farsi was a
politically charged term. That said, I don't plan on refusing to use the
term. I'll just think about the impact of the choice (now that I'm aware
there is one) before I do use it.

Thanks for the pointer. Knowledge is good.

Steve

On 05/07/2008 at 2:54 AM, Vizzini wrote:

Sorry for cross posting, but why the word 'Farsi' instead of
'Persian'? No one says Lucene français or Español, or Deutsch - so why
Farsi?

Please read the following article, I found it quite enlightening.
http://www.cais-soas.com/CAIS/Languages/persian_not_farsi.htm

-- View this message in context:
http://www.nabble.com/lucene-farsi-problem-tp16977096p17098552.html Sent
from the Lucene - Java Users mailing list archive at Nabble.com.

--
View this message in context:
http://www.nabble.com/lucene-farsi-problem-tp16977096p17122444.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: lucene farsi problem

2008-05-08 Thread Grant Ingersoll

Point #2 does not belong on this forum. This is a forum for Lucene
Java, not for political views. There are plenty of other places for
that, so let's close this discussion off on this particular point and
simply address the issue at hand with Lucene and LUCENE-1279.

Cheers,
Grant

On May 8, 2008, at 4:52 AM, Vizzini wrote:

Dear Steven

Thanks for reply. I've just checked the link and it was working.
Anyway, you are right, but my point is to use the correct term for
main 3 reasons:

1. Respect the host language, i.e. English
3. Apparently the Islamic regime in Tehran is against the word
‘Persian’, and we as the free world have responsibility to fight them
in any possible way – and in the meantime assist Iranians in their
fight against the tyrant Moslem clerics.

Kind Regards
PV

Steven A Rowe wrote:

Hi PV,

On 05/07/2008 at 2:54 AM, PV wrote:

Sorry for cross posting, but why the word 'Farsi' instead of
'Persian'? No one says Lucene français or Español, or Deutsch -
so why Farsi?

Please read the following article, I found it quite enlightening.
http://www.cais-soas.com/CAIS/Languages/persian_not_farsi.htm

I was unable to follow your link -- I get the message "The page
cannot be found" -- but Google's cache came to my rescue:

http://209.85.165.104/search?q=cache:vJo1ye09MzYJ:www.cais-soas.com/CAIS/Languages/persian_not_farsi.htm+persian+farsi+site:cais-soas.comhl=enct=clnkcd=1gl=usclient=firefox-a

The article makes fine logical arguments against the use of
Farsi, but clearly does not take a balanced point of view on the
controversy.

I count myself among the ignorant - I had not realized that Farsi
was a politically charged term. That said, I don't plan on refusing
to use the term. I'll just think about the impact of the choice (now
that I'm aware there is one) before I do use it.

Thanks for the pointer. Knowledge is good.

Steve

On 05/07/2008 at 2:54 AM, Vizzini wrote:

Sorry for cross posting, but why the word 'Farsi' instead of
'Persian'? No one says Lucene français or Español, or Deutsch -
so why Farsi?

Please read the following article, I found it quite enlightening.
http://www.cais-soas.com/CAIS/Languages/persian_not_farsi.htm

-- View this message in context:
http://www.nabble.com/lucene-farsi-problem-tp16977096p17098552.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

--
View this message in context:
http://www.nabble.com/lucene-farsi-problem-tp16977096p17122444.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: lucene farsi problem

2008-05-07 Thread Vizzini


Sorry for cross posting, but why the word 'Farsi' instead of 'Persian'?  No
one says Lucene français or Español, or Deutsch - so why Farsi?

Please read the following article, I found it quite enlightening. 
http://www.cais-soas.com/CAIS/Languages/persian_not_farsi.htm

PV

-- 
View this message in context: 
http://www.nabble.com/lucene-farsi-problem-tp16977096p17098552.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


RE: lucene farsi problem

2008-05-07 Thread Steven A Rowe

Hi Esra,

On 05/06/2008 at 7:38 AM, esra wrote:
 i tried the class and it works fine with the locale parameter ar.

Cool, I'm glad this addressed your problem!

 Actually we are using fa for farsi and ar for arabic.
 I have added a little control for the locale parameter in my
 code and now i can see the correct results.

From what I could tell, the Collator available for Locale("fa") in the Sun
1.4.2 and 1.5.0 JDKs does not contain Farsi character collation, but the
Collator available for Locale("ar") *does* contain Farsi collation.  I switched
TestCollatingRangeQuery from Locale("fa") to Locale("ar") when I couldn't get
the Collator returned for Farsi [ via Collator.getInstance(new Locale("fa")) ]
to produce correct results.

Did you find that Locale(fa) produces the correct results?  If so, which VM 
are you using?
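
One way to answer the question above on a given VM is to ask the JDK directly. The following is a minimal sketch using only java.text.Collator and java.util.Locale; the class name `CollatorProbe` and its methods are hypothetical, and what it prints depends on the JDK vendor and version, so no output is shown here.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical probe for illustration; results vary by JDK vendor and version.
final class CollatorProbe {

    // True if Collator.getAvailableLocales() lists any locale for the language code.
    static boolean listsLanguage(String language) {
        for (Locale locale : Collator.getAvailableLocales()) {
            if (locale.getLanguage().equals(language)) {
                return true;
            }
        }
        return false;
    }

    // Sign (-1, 0 or 1) of comparing a with b under the locale's default Collator.
    static int compareUnder(Locale locale, String a, String b) {
        return Integer.signum(Collator.getInstance(locale).compare(a, b));
    }

    public static void main(String[] args) {
        String zhe = "\u0698";
        String sin = "\u0633";
        for (String language : new String[] {"fa", "ar"}) {
            // Locale.forLanguageTag("fa") names the same locale as new Locale("fa").
            Locale locale = Locale.forLanguageTag(language);
            System.out.println(language
                    + ": listed=" + listsLanguage(language)
                    + ", zhe vs sin=" + compareUnder(locale, zhe, sin));
        }
    }
}
```

A negative sign for "zhe vs sin" means the Collator places zhe before sin, which is what Farsi alphabetical order expects. Collator.getInstance does not fail for a locale without its own collation data; it falls back to a more general Collator, which is why the comparison is worth checking in addition to the listing.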

At Chris Hostetter's suggestion, I am rewriting the patch attached to 
LUCENE-1279, including the following changes:

- Merged the contents of the CollatingRangeQuery class into RangeQuery and 
RangeFilter
- Switched the Locale parameter to instead take an instance of Collator
- Modified QueryParser.jj to construct a QueryParser class that can accept a 
range collator and pass it either to RangeQuery or through 
ConstantScoreRangeQuery to RangeFilter.

I plan on posting the revised patch in the next day or two.

Steve

On 05/06/2008 at 7:38 AM, esra wrote:
 
 Hi Steven,
 
 i tried the class and it works fine with the locale parameter ar.
 
 Actually we are using fa for farsi and ar for arabic.
 I have added a little control for the locale parameter in my
 code and now i can see the correct results.
 
 Thank you very much for your help.
 
 Esra.
 
 Steven A Rowe wrote:
  
  Hi Esra,
  
  I have attached a patch to LUCENE-1279 containing a new class:
  CollatingRangeQuery.
  
  The patch also contains a test class: TestCollatingRangeQuery.  One of
  the test methods checks for the Farsi range you were having trouble
  with.
  
  It should be mentioned that according to
  Collator.getAvailableLocales(), neither Java 1.4.2 nor Java 1.5.0
  contains Farsi collation information. However, in the test class I use
  the Arabic Locale, which seems to properly collate the non-Arabic Farsi
  letter U+0698, and hopefully behaves well with other Farsi letters.  If
  you find that this is not the case, you can look into writing collation
  rules using RuleBasedCollator - you should be able to directly specify
  the proper letter orderings for Farsi; CollatingRangeQuery would also
  have to supply a constructor that takes in a Collator instead of a
  Locale.
  
  Please give the class a try and post back about how it works.
  
  Thanks,
  Steve
  
  On 05/03/2008 at 8:33 AM, esra wrote:
   
   Hi Steven,
   
   thanks for your help
   
   Esra
   
   
   Steven A Rowe wrote:

Hi Esra,

I have created an issue for this - see
https://issues.apache.org/jira/browse/LUCENE-1279.

I'll try to take a crack at a patch this weekend.

Steve

On 05/02/2008 at 12:55 PM, esra wrote:
 
 Hi Steven ,
 
 yes you are right, sorry i am a bit confused.
 
 i checked again and the correct one is  zhe/U+698.
 
 It seems the word is in the range but my customer says it
 shouldn't be.
 
 I think the problem occurs because zhe is a Persian letter outside
 the Arabic alphabet. In the Farsi alphabet this letter does not come
 after the letter س, but its Unicode value is bigger than that of س,
 and the searcher works with Unicode values.
 
 Esra
 
 
 Steven A Rowe wrote:
  
  Hi Esra,
  
  You are *still* incorrectly referring to the glyph with three dots
  over it:
  
  On 05/02/2008 at 12:18 PM, esra wrote:
   yes the correct one is ژ /ze/U+632.
  
  ژ is *not* ze/U+632 - it is zhe/U+698.
  
  Have you increased the font size?  Can you see the difference
  between these two?:
  
  ژ/zhe/U+698
  ز/ze/U+632
  
   my problem is when i do search for د-ژ range. The result is ساب ووفر
   and this word's first letter is س and its Unicode value is U+633, and
   it is not in the [ U+062F - U+0632 ] range.
  
  Like I keep saying, in the above description, you're using the glyph
  ژ/zhe/U+698, while at the same time incorrectly referring
  to it as ze/U+632.
  
  I don't mean to continually bang on about this - if you're *sure*
  that when you search, you're using the character represented by the
  glyph with one dot (and not three), i.e. ز/ze/U+632, then the
  problem lies elsewhere.
  
  Steve
  
  On 05/02/2008 at 12:18 PM, esra wrote:
   yes the correct one is ژ /ze/U+632.
   
   my problem is when i do search forد-ژ range. The result is 
   ساب ووفر  and this word's first letter is س  and it's unicode
   is U+633  and  it is not in the in the [ U+062F -

RE: lucene farsi problem

2008-05-07 Thread Steven A Rowe

Hi PV,

On 05/07/2008 at 2:54 AM, PV wrote:
Sorry for cross posting, but why the word 'Farsi' instead of
'Persian'? No one says Lucene français or Español, or Deutsch - so why Farsi?

Please read the following article, I found it quite enlightening.
http://www.cais-soas.com/CAIS/Languages/persian_not_farsi.htm

I was unable to follow your link -- I get the message "The page cannot be
found" -- but Google's cache came to my rescue:

http://209.85.165.104/search?q=cache:vJo1ye09MzYJ:www.cais-soas.com/CAIS/Languages/persian_not_farsi.htm+persian+farsi+site:cais-soas.com&hl=en&ct=clnk&cd=1&gl=us&client=firefox-a

The article makes fine logical arguments against the use of Farsi, but
clearly does not take a balanced point of view on the controversy.

I count myself among the ignorant - I had not realized that Farsi was a
politically charged term. That said, I don't plan on refusing to use the term.
I'll just think about the impact of the choice (now that I'm aware there is
one) before I do use it.

Thanks for the pointer. Knowledge is good.

Steve

On 05/07/2008 at 2:54 AM, Vizzini wrote:

Sorry for cross posting, but why the word 'Farsi' instead of
'Persian'? No one says Lucene français or Español, or Deutsch - so why Farsi?

Please read the following article, I found it quite enlightening.
http://www.cais-soas.com/CAIS/Languages/persian_not_farsi.htm

-- View this message in context:
http://www.nabble.com/lucene-farsi-problem-tp16977096p17098552.html Sent
from the Lucene - Java Users mailing list archive at Nabble.com.

RE: lucene farsi problem

2008-05-06 Thread esra


Hi Steven,

I tried the class and it works fine with the locale parameter "ar".

Actually, we are using "fa" for Farsi and "ar" for Arabic.
I have added a little control for the locale parameter in my code, and now I
can see the correct results.

Thank you very much for your help.

Esra.



Steven A Rowe wrote:
 
 Hi Esra,
 
 I have attached a patch to LUCENE-1279 containing a new class:
 CollatingRangeQuery.
 
 The patch also contains a test class: TestCollatingRangeQuery.  One of the
 test methods checks for the Farsi range you were having trouble with.
 
 It should be mentioned that according to Collator.getAvailableLocales(),
 neither Java 1.4.2 nor Java 1.5.0 contains Farsi collation information. 
 However, in the test class I use the Arabic Locale, which seems to
 properly collate the non-Arabic Farsi letter U+0698, and hopefully behaves
 well with other Farsi letters.  If you find that this is not the case, you
 can look into writing collation rules using RuleBasedCollator - you should
 be able to directly specify the proper letter orderings for Farsi;
 CollatingRangeQuery would also have to supply a constructor that takes in
 a Collator instead of a Locale.
 
 Please give the class a try and post back about how it works.
 
 Thanks,
 Steve
 
 On 05/03/2008 at 8:33 AM, esra wrote:
 
 Hi Steven,
 
 thanks for your help
 
 Esra
 
 
 Steven A Rowe wrote:
  
  Hi Esra,
  
  I have created an issue for this - see
  https://issues.apache.org/jira/browse/LUCENE-1279.
  
  I'll try to take a crack at a patch this weekend.
  
  Steve
  
  On 05/02/2008 at 12:55 PM, esra wrote:
   
   Hi Steven ,
   
   yes you are right, sorry i am a bit confused.
   
   i checked again and the correct one is  zhe/U+698.
   
   It seems the word is in the range but my customer says it
   shouldn't be.
   
   I think the problem occurs because zhe is a Persian letter outside
   the Arabic alphabet. In the Farsi alphabet this letter does not come
   after the letter س, but its Unicode value is bigger than that of س,
   and the searcher works with Unicode values.
   
   Esra
   
   
   Steven A Rowe wrote:

Hi Esra,

You are *still* incorrectly referring to the glyph with three dots
over it:

On 05/02/2008 at 12:18 PM, esra wrote:
 yes the correct one is ژ /ze/U+632.

ژ is *not* ze/U+632 - it is zhe/U+698.

Have you increased the font size?  Can you see the difference between
these two?:

ژ/zhe/U+698
ز/ze/U+632

 my problem is when i do search for د-ژ range. The result is ساب ووفر
 and this word's first letter is س and its Unicode value is U+633, and
 it is not in the [ U+062F - U+0632 ] range.

Like I keep saying, in the above description, you're using the glyph
ژ/zhe/U+698, while at the same time incorrectly referring
to it as ze/U+632.

I don't mean to continually bang on about this - if you're *sure*
that when you search, you're using the character represented by the
glyph with one dot (and not three), i.e. ز/ze/U+632, then the
problem lies elsewhere.

Steve

On 05/02/2008 at 12:18 PM, esra wrote:
 yes the correct one is ژ /ze/U+632.
 
 my problem is when i do search for د-ژ range. The result is ساب ووفر
 and this word's first letter is س and its Unicode value is U+633, and
 it is not in the [ U+062F - U+0632 ] range.
 
 am i wrong?
 
 Esra
 
 Steven A Rowe wrote:
  
  Hi Esra,
  
  I still think you're wrong :).
  
  On 05/02/2008 at 9:31 AM, esra wrote:
ژ = U+632
  
  According to the website you linked to, the above character, which
  has three dots over it, is named zhe, and its Unicode code point is
  U+698. (I had to increase the font size to see the three dots.)
  
  I think you are confusing ژ/zhe/U+698 with the letter
  ز/ze/U+632, which has just one dot over it.
  
  Unless you were mistaken in all of your emails when you included the
  character ژ/zhe instead of ز/ze, then what I said in my
  previous email still stands: there is no problem here.
  
  Steve
  
  On 05/02/2008 at 9:31 AM, esra wrote:
   
   Hi Steven,
   
   sorry i made a mistake. unicodes are like this:
   
د=U+62F
ژ = U+632
and the first letter of ساب ووفر  is  س = U+633
   
   you can also check them here

   http://www.unics.uni-hannover.de/nhtcapri/persian-alphabet.html
   
   Esra
   
   
   Steven A Rowe wrote:

Hi Esra,

Going back to the original problem statement, I see something that
looks illogical to me - please correct me if I'm wrong:

On Apr 30, 2008, at 3:21 AM, esra wrote:
 i am using lucene's IndexSearcher to search the given xml by
 keyword which contains farsi information.
RE: lucene farsi problem

2008-05-04 Thread Steven A Rowe

Hi Esra,

I have attached a patch to LUCENE-1279 containing a new class: 
CollatingRangeQuery.

The patch also contains a test class: TestCollatingRangeQuery.  One of the test 
methods checks for the Farsi range you were having trouble with.

It should be mentioned that according to Collator.getAvailableLocales(), 
neither Java 1.4.2 nor Java 1.5.0 contains Farsi collation information.  
However, in the test class I use the Arabic Locale, which seems to properly 
collate the non-Arabic Farsi letter U+0698, and hopefully behaves well with 
other Farsi letters.  If you find that this is not the case, you can look into 
writing collation rules using RuleBasedCollator - you should be able to 
directly specify the proper letter orderings for Farsi; CollatingRangeQuery 
would also have to supply a constructor that takes in a Collator instead of a 
Locale.
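
As a rough illustration of the RuleBasedCollator route, the sketch below orders a handful of letters explicitly so that zhe (U+0698) sorts before sin (U+0633). The rule string is a deliberately tiny fragment covering six letters, not a complete Farsi tailoring, and the class name `FarsiRuleSketch` is hypothetical.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical example; the rules cover only six letters, not the full alphabet.
final class FarsiRuleSketch {

    // dal < thal < reh < ze < zhe < sin
    static final String RULES =
            "< \u062F < \u0630 < \u0631 < \u0632 < \u0698 < \u0633";

    static RuleBasedCollator newCollator() {
        try {
            return new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = newCollator();
        String zhe = "\u0698";
        String sin = "\u0633";
        // Code-unit order puts zhe after sin; the explicit rules put it before.
        System.out.println("String.compareTo sign: " + Integer.signum(zhe.compareTo(sin)));
        System.out.println("collator sign: " + Integer.signum(collator.compare(zhe, sin)));
    }
}
```

With rules like these, a term starting with sin compares greater than zhe, so it would fall outside a dal-to-zhe range. A real tailoring would need to list every letter of the alphabet in its intended order.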

Please give the class a try and post back about how it works.

Thanks,
Steve

On 05/03/2008 at 8:33 AM, esra wrote:
 
 Hi Steven,
 
 thanks for your help
 
 Esra
 
 
 Steven A Rowe wrote:
  
  Hi Esra,
  
  I have created an issue for this - see
  https://issues.apache.org/jira/browse/LUCENE-1279.
  
  I'll try to take a crack at a patch this weekend.
  
  Steve
  
  On 05/02/2008 at 12:55 PM, esra wrote:
   
   Hi Steven ,
   
   yes you are right, sorry i am a bit confused.
   
   i checked again and the correct one is  zhe/U+698.
   
   It seems the word is in the range but my customer says it
   shouldn't be.
   
   I think the problem occurs because zhe is a Persian letter outside
   the Arabic alphabet. In the Farsi alphabet this letter does not come
   after the letter س, but its Unicode value is bigger than that of س,
   and the searcher works with Unicode values.
   
   Esra
   
   
   Steven A Rowe wrote:

Hi Esra,

You are *still* incorrectly referring to the glyph with three dots
over it:

On 05/02/2008 at 12:18 PM, esra wrote:
 yes the correct one is ژ /ze/U+632.

ژ is *not* ze/U+632 - it is zhe/U+698.

Have you increased the font size?  Can you see the difference between
these two?:

ژ/zhe/U+698
ز/ze/U+632

 my problem is when i do search for د-ژ range. The result is ساب ووفر
 and this word's first letter is س and its Unicode value is U+633, and
 it is not in the [ U+062F - U+0632 ] range.

Like I keep saying, in the above description, you're using the glyph
ژ/zhe/U+698, while at the same time incorrectly referring
to it as ze/U+632.

I don't mean to continually bang on about this - if you're *sure*
that when you search, you're using the character represented by the
glyph with one dot (and not three), i.e. ز/ze/U+632, then the
problem lies elsewhere.

Steve

On 05/02/2008 at 12:18 PM, esra wrote:
 yes the correct one is ژ /ze/U+632.
 
 my problem is when i do search for د-ژ range. The result is ساب ووفر
 and this word's first letter is س and its Unicode value is U+633, and
 it is not in the [ U+062F - U+0632 ] range.
 
 am i wrong?
 
 Esra
 
 Steven A Rowe wrote:
  
  Hi Esra,
  
  I still think you're wrong :).
  
  On 05/02/2008 at 9:31 AM, esra wrote:
ژ = U+632
  
  According to the website you linked to, the above character, which
  has three dots over it, is named zhe, and its Unicode code point is
  U+698. (I had to increase the font size to see the three dots.)
  
  I think you are confusing ژ/zhe/U+698 with the letter
  ز/ze/U+632, which has just one dot over it.
  
  Unless you were mistaken in all of your emails when you included the
  character ژ/zhe instead of ز/ze, then what I said in my
  previous email still stands: there is no problem here.
  
  Steve
  
  On 05/02/2008 at 9:31 AM, esra wrote:
   
   Hi Steven,
   
   sorry i made a mistake. unicodes are like this:
   
د=U+62F
ژ = U+632
and the first letter of ساب ووفر  is  س = U+633
   
   you can also check them here

   http://www.unics.uni-hannover.de/nhtcapri/persian-alphabet.html
   
   Esra
   
   
   Steven A Rowe wrote:

Hi Esra,

Going back to the original problem statement, I see something that
looks illogical to me - please correct me if I'm wrong:

On Apr 30, 2008, at 3:21 AM, esra wrote:
 i am using lucene's IndexSearcher to search the given xml by
 keyword which contains farsi information. while searching i use
 ranges like
 
 آ-ث  |  ج-خ  |  د-ژ  |  س-ظ  |  ع-ق  |  ک-ل  |  م-ی
 
 when i do search for  د-ژ  range the results are wrong , they
 are the results of   س-ظ range.
 
 for example when i do search for د-ژ  one of the the results
 is ساب

RE: lucene farsi problem

2008-05-03 Thread esra


Hi Steven,

thanks for your help

Esra


Steven A Rowe wrote:
 
 Hi Esra,
 
 I have created an issue for this - see
 https://issues.apache.org/jira/browse/LUCENE-1279.
 
 I'll try to take a crack at a patch this weekend.
 
 Steve
 

RE: lucene farsi problem

2008-05-02 Thread esra


Hi Steven,

Sorry, I made a mistake. The Unicode code points are:

 د = U+62F
 ژ = U+632
 and the first letter of ساب ووفر is س = U+633

You can also check them here:
http://www.unics.uni-hannover.de/nhtcapri/persian-alphabet.html

Esra


Steven A Rowe wrote:
 
 Hi Esra,
 
 Going back to the original problem statement, I see something that looks
 illogical to me - please correct me if I'm wrong:
 
 On Apr 30, 2008, at 3:21 AM, esra wrote:
 I am using Lucene's IndexSearcher to search the given XML by
 keyword, which contains Farsi information.
 While searching I use ranges like
 
 آ-ث  |  ج-خ  |  د-ژ  |  س-ظ  |  ع-ق  |  ک-ل  |  م-ی
 
 When I search for the د-ژ range the results are wrong; they
 are the results of the س-ظ range.
 
 For example, when I search for د-ژ, one of the results is
 ساب ووفر. This result is also shown in the س-ظ range's result
 list, which is the correct range.
 
 As IndexSearcher uses the compareTo method, and this method compares
 Unicode code points, I looked up the code points of the characters.
 
 د = U+62F
 ژ = U+698
 and the first letter of ساب ووفر is س = U+633
 
 It appears to me that *both* the د-ژ range [ U+062F - U+0698 ] and the
 س-ظ range [ U+0633 - U+0638 ] contain the first letter of ساب ووفر,
 which is س = U+0633.  
 
 You stated that U+0633 should be contained in the [ U+0633 - U+0638 ]
 range - I agree - but why do you think U+0633 should not be contained in
 the [ U+062F - U+0698 ] range?
 
 In other words, it looks to me like your problem is not a problem at all.
 
 Steve
 
 

-- 
View this message in context: 
http://www.nabble.com/lucene-farsi-problem-tp16977096p17019498.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: lucene farsi problem

2008-05-02 Thread Steven A Rowe

Hi Esra,

I still think you're wrong :).

On 05/02/2008 at 9:31 AM, esra wrote:
  ژ = U+632

According to the website you linked to, the above character, which has three 
dots over it, is named zhe, and its Unicode code point is U+698.  (I had to 
increase the font size to see the three dots.)

I think you are confusing ژ/zhe/U+698 with the letter ز/ze/U+632, which 
has just one dot over it.

Unless you were mistaken in all of your emails when you included the character 
ژ/zhe instead of ز/ze, then what I said in my previous email still 
stands: there is no problem here.

Steve

On 05/02/2008 at 9:31 AM, esra wrote:
 
 Hi Steven,
 
 Sorry, I made a mistake. The Unicode code points are:
 
  د = U+62F
  ژ = U+632
  and the first letter of ساب ووفر is س = U+633
 
 You can also check them here:
  http://www.unics.uni-hannover.de/nhtcapri/persian-alphabet.html
 
 Esra
 
 

RE: lucene farsi problem

2008-05-02 Thread esra


Hi Steven ,

Yes, the correct one is ژ /ze/U+632.

My problem is when I search for the د-ژ range. The result is ساب ووفر,
and this word's first letter is س, whose Unicode code point is U+633, and it
is not in the [ U+062F - U+0632 ] range.

Am I wrong?

Esra


Steven A Rowe wrote:
 
 Hi Esra,
 
 I still think you're wrong :).
 
 On 05/02/2008 at 9:31 AM, esra wrote:
  ژ = U+632
 
 According to the website you linked to, the above character, which has
 three dots over it, is named zhe, and its Unicode code point is U+698. 
 (I had to increase the font size to see the three dots.)
 
 I think you are confusing ژ/zhe/U+698 with the letter ز/ze/U+632,
 which has just one dot over it.
 
 Unless you were mistaken in all of your emails when you included the
 character ژ/zhe instead of ز/ze, then what I said in my previous
 email still stands: there is no problem here.
 
 Steve
 

-- 
View this message in context: 
http://www.nabble.com/lucene-farsi-problem-tp16977096p17022861.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



RE: lucene farsi problem

2008-05-02 Thread Steven A Rowe

Hi Esra,

You are *still* incorrectly referring to the glyph with three dots over it:

On 05/02/2008 at 12:18 PM, esra wrote:
 yes the correct one is ژ /ze/U+632.

ژ is *not* ze/U+632 - it is zhe/U+698.

Have you increased the font size?  Can you see the difference between these 
two?:

ژ/zhe/U+698
ز/ze/U+632

 my problem is when i do search for  د-ژ range. The result
 is  ساب ووفر and this word's first letter is س and it's unicode is
 U+633  and it is not in the [ U+062F - U+0632 ] range.

Like I keep saying, in the above description, you're using the glyph
ژ/zhe/U+698, while at the same time incorrectly referring to it as
ze/U+632.

I don't mean to continually bang on about this - if you're *sure* that when you 
search, you're using the character represented by the glyph with one dot (and 
not three), i.e. ز/ze/U+632, then the problem lies elsewhere.
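
One way to take the font size out of the question is to print the code point of the character that is actually in the query string. The following is only a minimal sketch, assuming Java 7 or later (for `Character.getName`); the class name and the `hex` helper are made up for illustration and are not part of Lucene.

```java
public class CodePointInspectSketch {
    // Hypothetical helper: formats the first code point of s as U+XXXX.
    static String hex(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        // \u0698 is the three-dot letter (zhe); \u0632 is the one-dot letter (ze).
        String[] letters = { "\u0698", "\u0632" };
        for (String letter : letters) {
            int codePoint = letter.codePointAt(0);
            // Print the code point next to its Unicode character name.
            System.out.println(hex(letter) + " " + Character.getName(codePoint));
        }
    }
}
```

Pasting the character from the real query in place of the escapes shows which of the two letters is being searched for, regardless of how it is rendered.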

Steve

On 05/02/2008 at 12:18 PM, esra wrote:
 Yes, the correct one is ژ /ze/U+632.
 
 My problem is when I search for the د-ژ range. The result is ساب ووفر,
 and this word's first letter is س, whose Unicode code point is U+633, and it
 is not in the [ U+062F - U+0632 ] range.
 
 Am I wrong?
 
 Esra
 

RE: lucene farsi problem

2008-05-02 Thread esra


Hi Steven ,

Yes, you are right; sorry, I am a bit confused.

I checked again and the correct one is zhe/U+698.

It seems the word is in the range, but my customer says it shouldn't be.

I think the problem occurs because zhe is a Persian letter outside the Arabic
alphabet. In the Farsi alphabet this letter does not come after the letter س,
but its Unicode code point is larger than that of س, and the searcher compares
code points.
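
The gap between code point order and alphabetical order can be seen by comparing the two letters both ways. This is a sketch only: the class and method names are hypothetical, and the choice of the "ar" locale follows what is reported elsewhere in this thread for the Sun 1.4.2 and 1.5.0 JDKs. What the collator returns depends on the locale data shipped with the JDK in use, so no result is claimed for it here.

```java
import java.text.Collator;
import java.util.Locale;

public class FarsiCollationSketch {
    // Plain String.compareTo: for these letters this is code point order.
    static int byCodePoint(String a, String b) {
        return a.compareTo(b);
    }

    // Locale-sensitive comparison using whatever rules the JDK ships for "ar".
    static int byCollator(String a, String b) {
        Collator arabic = Collator.getInstance(Locale.forLanguageTag("ar"));
        return arabic.compare(a, b);
    }

    public static void main(String[] args) {
        String zhe = "\u0698"; // the three-dot letter
        String sin = "\u0633"; // first letter of the disputed result

        // Positive: zhe has the larger code point, so it sorts after sin here.
        System.out.println("compareTo: " + byCodePoint(zhe, sin));

        // Negative would mean the collator places zhe before sin.
        System.out.println("collator:  " + byCollator(zhe, sin));
    }
}
```

If the collator result is negative on a given JDK, a range check built on that collator would exclude ساب ووفر from the د-ژ range, which is the behaviour the customer expects.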

Esra


Steven A Rowe wrote:
 
 Hi Esra,
 
 You are *still* incorrectly referring to the glyph with three dots over
 it:
 
 On 05/02/2008 at 12:18 PM, esra wrote:
 yes the correct one is ژ /ze/U+632.
 
 ژ is *not* ze/U+632 - it is zhe/U+698.
 
 Have you increased the font size?  Can you see the difference between
 these two?:
 
 ژ/zhe/U+698
 ز/ze/U+632
 
 my problem is when i do search for  د-ژ range. The result
 is  ساب ووفر and this word's first letter is س and it's unicode is
 U+633  and it is not in the [ U+062F - U+0632 ] range.
 
 Like I keep saying, in the above description, you're using the glyph
 ژ/zhe/U+698, while at the same time incorrectly referring to
 it as ze/U+632.
 
 I don't mean to continually bang on about this - if you're *sure* that
 when you search, you're using the character represented by the glyph with
 one dot (and not three), i.e. ز/ze/U+632, then the problem lies
 elsewhere.
 
 Steve
 

-- 
View this message in context: 
http://www.nabble.com/lucene-farsi-problem-tp16977096p17023557.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

RE: lucene farsi problem

2008-05-02 Thread Steven A Rowe

Hi Esra,

I have created an issue for this - see 
https://issues.apache.org/jira/browse/LUCENE-1279.

I'll try to take a crack at a patch this weekend.

Steve

On 05/02/2008 at 12:55 PM, esra wrote:
 
 Hi Steven ,
 
 Yes, you are right; sorry, I am a bit confused.
 
 I checked again and the correct one is zhe/U+698.
 
 It seems the word is in the range, but my customer says it
 shouldn't be.
 
 I think the problem occurs because zhe is a Persian letter
 outside the Arabic alphabet. In the Farsi alphabet this letter
 does not come after the letter س, but its Unicode code point is
 larger than that of س, and the searcher compares code points.
 
 Esra
 
 

RE: lucene farsi problem

2008-05-01 Thread esra


Hi Steve,

Thanks for your reply. I know Farsi is written and read right-to-left.
I am using the RangeQuery class, and its rewrite(IndexReader reader) method
decides whether a word is in the range using compareTo, so the decision
is based on Unicode code points.

While searching for the د-ژ range, the lowerTerm is د and the upperTerm is
ژ.
When comparing the result ساب ووفر, it takes the first letter,
س, and does the comparison on that letter.

 د=U+62F
 ژ = U+698
 and the first letter of ساب ووفر  is  س = U+633

Esra,


Steven A Rowe wrote:
 
 Hi Esra,
 
 Caveat: I don't speak, read, write, or dream in Farsi - I just know that
 it mostly shares its orthography with Arabic, and that they are both
 written and read right-to-left.
 
 How are you constructing the queries?  Using QueryParser?  If so, then I
 suspect the problem is that you intend the range you supply to be read
 entirely right-to-left, but Lucene instead reads it left-to-right.  Have
 you tried using e.g. د-ژ instead of د-ژ?  (That is, placing the lower
 valued term on the left instead of the right.)
 
 AFAICT, RangeFilter (called from ConstantScoreRangeQuery, which is called
 from QueryParser) does not test whether lowerTerm is in fact lower than
 upperTerm.  If it turns out that the problem is simply one of order, it
 might make sense to modify RangeFilter so that it flips them when
 lowerTerm > upperTerm.
 
 Steve
 
 On 04/30/2008 at 3:21 AM, esra wrote:
 
 hi,
 
 I am using Lucene's IndexSearcher to search the given XML by keyword,
 which contains Farsi information.
 While searching I use ranges like
 
 آ-ث  |  ج-خ  |  د-ژ  |  س-ظ  |  ع-ق  |  ک-ل  |  م-ی
 
 When I search for the د-ژ range the results are wrong; they are the
 results of the س-ظ range.
 
 For example, when I search for د-ژ, one of the results is ساب ووفر.
 This result is also shown in the س-ظ range's result list, which is the
 correct range.
 
 As IndexSearcher uses the compareTo method, and this method compares
 Unicode code points, I looked up the code points of the characters.
 
 د = U+62F
 ژ = U+698
 and the first letter of ساب ووفر is س = U+633
 
 Do you have any idea how to solve this problem? There are analyzers for
 different languages;
 would one be useful here, and if so, do you know where to find a Farsi analyzer?
 
 I would be glad if you could help.
 
 thanks ,
 
 Esra
 

-- 
View this message in context: 
http://www.nabble.com/lucene-farsi-problem-tp16977096p16993041.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: lucene farsi problem

2008-05-01 Thread esra


Hi,

The document's encoding is UTF-8.

I tried the explain() method, and the result for the د-ژ range search is:

  fieldWeight(keywordIndex:Ø³Ø§Ø¨ ÙˆÙˆÙ�Ø± in 0), product of:
  1.0 = tf(termFreq(keywordIndex:Ø³Ø§Ø¨ ÙˆÙˆÙ�Ø±)=1)
  0.30685282 = idf(docFreq=1)
  1.0 = fieldNorm(field=keywordIndex, doc=0)

Here the keywordIndex value is ساب ووفر.

I also installed luke.jnlp, but I don't know what to check with Luke.
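
The unreadable term in the explain() output above has the shape that UTF-8 bytes take when they are displayed through a single-byte charset. A minimal sketch of that effect follows, assuming ISO-8859-1 as a stand-in for whatever charset the console or mail client actually used; the class and method names are hypothetical. It says nothing about whether the index itself holds the right characters.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeSketch {
    // Encode as UTF-8, then (wrongly) decode the bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: recover the bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "\u0633\u0627\u0628"; // first word of the keyword
        String garbled = misdecode(word);
        System.out.println(garbled);         // two Latin-1 characters per letter
        System.out.println(repair(garbled)); // the original word again
    }
}
```

If the round trip reproduces the keyword, the garbling happened only when the output was displayed, not when the document was indexed.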

Thanks,

Esra



Grant Ingersoll-6 wrote:
 
 I am not sure how StandardAnalyzer will perform on Farsi.  The thing
 to do now would be to get Luke and have a look at the actual document
 that matches and see what its tokens look like.  You might also try
 using the explain() method to see why that document matches.
 
 Also, are you sure you are loading the file w/ the proper encodings,  
 etc?
 
 -Grant
 
 On Apr 30, 2008, at 8:06 AM, esra wrote:
 

 Hi,
 Thanks for your reply.
 I am using StandardAnalyzer now, and my XML document looks like this:

 <keyword><![CDATA[ساب ووفر]]></keyword>
 <description><![CDATA[یک ووفر که در محفظه ای جدا از سایر درایور ها
 قرار دارد تا صدایی با باس فوق العاده پایین تولید کند. ]]></description>

 I googled for a Farsi analyzer and found nothing; also, I am not sure
 whether it would solve my problem.

 Thanks,

 Esra



-- 
View this message in context: 
http://www.nabble.com/lucene-farsi-problem-tp16977096p16993174.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: lucene farsi problem

2008-05-01 Thread Steven A Rowe

Hi Esra,

Going back to the original problem statement, I see something that looks 
illogical to me - please correct me if I'm wrong:

On Apr 30, 2008, at 3:21 AM, esra wrote:
 I am using Lucene's IndexSearcher to search the given XML by
 keyword, which contains Farsi information.
 While searching I use ranges like
 
 آ-ث  |  ج-خ  |  د-ژ  |  س-ظ  |  ع-ق  |  ک-ل  |  م-ی
 
 When I search for the  د-ژ  range the results are wrong; they
 are the results of the  س-ظ  range.
 
 For example, when I search for the  د-ژ  range, one of the results is
 ساب ووفر. This result is also shown in the  س-ظ  range's result
 list, which is the correct range.
 
 As IndexSearcher uses the compareTo method, and this method compares
 Unicode code points, I looked up the code points of the characters:
 
 د = U+062F
 ژ = U+0698
 and the first letter of ساب ووفر is  س = U+0633

It appears to me that *both* the د-ژ range [ U+062F - U+0698 ] and the س-ظ 
range [ U+0633 - U+0638 ] contain the first letter of ساب ووفر, which is س 
= U+0633.  

You stated that U+0633 should be contained in the [ U+0633 - U+0638 ] range - I 
agree - but why do you think U+0633 should not be contained in the [ U+062F - 
U+0698 ] range?

In other words, it looks to me like your problem is not a problem at all.
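
A minimal sketch of the comparison described above, assuming the range check reduces to plain String.compareTo on the terms. The names CodePointRange and inRange are hypothetical and are not part of Lucene; the code only shows that, in code point order, a term starting with U+0633 falls inside both ranges:

```java
public class CodePointRange {
    // True when term lies in [lower, upper] under String.compareTo,
    // which compares UTF-16 code units and knows nothing about locales.
    public static boolean inRange(String term, String lower, String upper) {
        return lower.compareTo(term) <= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        String dal = "\u062F";  // ARABIC LETTER DAL
        String jeh = "\u0698";  // ARABIC LETTER JEH (Farsi "zhe")
        String seen = "\u0633"; // ARABIC LETTER SEEN
        String zah = "\u0638";  // ARABIC LETTER ZAH
        // The keyword from the thread, written with escapes.
        String keyword = "\u0633\u0627\u0628 \u0648\u0648\u0641\u0631";

        System.out.println(inRange(keyword, dal, jeh));  // prints true
        System.out.println(inRange(keyword, seen, zah)); // prints true
    }
}
```

In the Persian alphabet the letter U+0698 sorts directly before U+0633, so an alphabetical reading of the first range would exclude this term, while the code point reading above includes it.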

Steve

Re: lucene farsi problem

2008-05-01 Thread Grant Ingersoll

On May 1, 2008, at 4:36 AM, esra wrote:

Hi,

The document's encoding is UTF-8.

I tried the explain() method, and the result for the  د-ژ  range
search is:

fieldWeight(keywordIndex:Ø³Ø§Ø¨ ÙˆÙˆÙ�Ø± in 0),
product of:

1.0 = tf(termFreq(keywordIndex:Ø³Ø§Ø¨ ÙˆÙˆÙ�Ø±)=1)
0.30685282 = idf(docFreq=1)
1.0 = fieldNorm(field=keywordIndex, doc=0)

Here keywordIndex is ساب ووفر.

I also installed luke.jnlp, but I don't know what to check with
Luke.

http://wiki.apache.org/lucene-java/LuceneFAQ#head-3558e5121806fb4fce80fc022d889484a9248b71

Luke can be used to view your index. Not saying it is your problem
here, but often times when I get back results that seem incorrect,
the first thing I do is go look at my index using Luke, and compare
the incorrect document with what is in the query to see where the
(mis)match is occurring. Usually, this analysis shows that my
document/query is not what I thought it was.

Luke can browse documents and parse queries, amongst other useful
things.

Thanks,

Esra

Grant Ingersoll-6 wrote:

I am not sure how StandardAnalyzer will perform on Farsi. The thing
to do now would be to get Luke and have a look at the actual document
that matches and see what its tokens look like. You might also try
using the explain() method to see why that document matches.

Also, are you sure you are loading the file w/ the proper encodings,
etc?

-Grant



--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: lucene farsi problem

2008-04-30 Thread Grant Ingersoll

What Analyzer are you using?  You might try looking in Luke to see  
what is in your index, etc.  It also isn't clear to me what your  
documents look like.


As for a Farsi analyzer, I would Google "Farsi analyzer Lucene" and
see if you can find anything.  Otherwise, you will have to write your
own (and donate it).


-Grant

On Apr 30, 2008, at 3:21 AM, esra wrote:



Hi,

I am using Lucene's IndexSearcher to search the given XML by keyword,
which contains Farsi information.
While searching I use ranges like

آ-ث  |  ج-خ  |  د-ژ  |  س-ظ  |  ع-ق  |  ک-ل  |  م-ی

When I search for the  د-ژ  range the results are wrong; they are the
results of the  س-ظ  range.

For example, when I search for the  د-ژ  range, one of the results is
ساب ووفر. This result is also shown in the  س-ظ  range's result list,
which is the correct range.

As IndexSearcher uses the compareTo method, and this method compares
Unicode code points, I looked up the code points of the characters:

د = U+062F
ژ = U+0698
and the first letter of ساب ووفر is  س = U+0633

Do you have any idea how to solve this problem? There are analyzers for
different languages. Would one of them be useful here, and if so, do you
know where to find a Farsi analyzer?

I would be glad if you could help.

thanks ,

Esra




--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ








Re: lucene farsi problem

2008-04-30 Thread esra

Hi,
thanks for your reply.
I am using StandardAnalyzer now, and my XML document looks like this:

<keyword><![CDATA[ساب ووفر]]></keyword>
<description><![CDATA[یک ووفر که در محفظه ای جدا از سایر درایور ها
قرار دارد تا صدایی با باس فوق العاده پایین تولید کند. ]]></description>

I googled for a Farsi analyzer and found nothing. I am also not sure
whether it would solve my problem.

Thanks,

Esra

Grant Ingersoll-6 wrote:

What Analyzer are you using? You might try looking in Luke to see
what is in your index, etc. It also isn't clear to me what your
documents look like.

As for a Farsi analyzer, I would Google "Farsi analyzer Lucene" and
see if you can find anything. Otherwise, you will have to write your
own (and donate it).

-Grant



Re: lucene farsi problem

2008-04-30 Thread Grant Ingersoll

I am not sure how StandardAnalyzer will perform on Farsi.  The thing
to do now would be to get Luke and have a look at the actual document
that matches and see what its tokens look like.  You might also try
using the explain() method to see why that document matches.


Also, are you sure you are loading the file w/ the proper encodings,  
etc?
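
One way to rule out an encoding problem is to name the charset explicitly when reading the file, rather than relying on the platform default. The sketch below is an illustration only: EncodingCheck and firstLine are hypothetical names, and an in-memory byte array stands in for the XML file. It decodes the same UTF-8 bytes twice, once correctly and once as ISO-8859-1:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // Reads the first line of a stream using an explicitly named charset.
    public static String firstLine(InputStream in, Charset charset) {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            String line = reader.readLine();
            return line == null ? "" : line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String word = "\u0633\u0627\u0628"; // SEEN, ALEF, BEH
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in round-trips.
        String good = firstLine(new ByteArrayInputStream(utf8),
                                StandardCharsets.UTF_8);
        // Decoding the same bytes with the wrong charset yields mojibake.
        String bad = firstLine(new ByteArrayInputStream(utf8),
                               StandardCharsets.ISO_8859_1);

        System.out.println(good.equals(word)); // prints true
        System.out.println(bad.length());      // prints 6
    }
}
```

With the wrong charset, each two-byte UTF-8 letter becomes two unrelated Latin characters, so the three-letter word comes out six characters long.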


-Grant

On Apr 30, 2008, at 8:06 AM, esra wrote:



Hi,
thanks for your reply.
I am using StandardAnalyzer now, and my XML document looks like this:

<keyword><![CDATA[ساب ووفر]]></keyword>
<description><![CDATA[یک ووفر که در محفظه ای جدا از سایر درایور ها
قرار دارد تا صدایی با باس فوق العاده پایین تولید کند. ]]></description>

I googled for a Farsi analyzer and found nothing. I am also not sure
whether it would solve my problem.

Thanks,

Esra






--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ








RE: lucene farsi problem

2008-04-30 Thread Steven A Rowe

Hi Esra,

Caveat: I don't speak, read, write, or dream in Farsi - I just know that it 
mostly shares its orthography with Arabic, and that they are both written and 
read right-to-left.

How are you constructing the queries?  Using QueryParser?  If so, then I 
suspect the problem is that you intend the range you supply to be read entirely 
right-to-left, but Lucene instead reads it left-to-right.  Have you tried using 
e.g. د-ژ instead of د-ژ?  (That is, placing the lower valued term on the 
left instead of the right.)

AFAICT, RangeFilter (called from ConstantScoreRangeQuery, which is called from 
QueryParser) does not test whether lowerTerm is in fact lower than upperTerm.  
If it turns out that the problem is simply one of order, it might make sense to 
modify RangeFilter so that it flips them when lowerTerm > upperTerm.
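
The flip suggested above could look roughly like the following. This is only a sketch of the idea, not the actual RangeFilter code; RangeBounds and ordered are hypothetical names, and the ordering used is plain String.compareTo:

```java
public class RangeBounds {
    public final String lower;
    public final String upper;

    private RangeBounds(String lower, String upper) {
        this.lower = lower;
        this.upper = upper;
    }

    // Returns the two terms as (lower, upper), swapping them when the
    // caller supplied them in descending order.
    public static RangeBounds ordered(String a, String b) {
        return a.compareTo(b) <= 0 ? new RangeBounds(a, b)
                                   : new RangeBounds(b, a);
    }

    public static void main(String[] args) {
        // Terms supplied upper-first: U+0698 (JEH), then U+062F (DAL).
        RangeBounds r = ordered("\u0698", "\u062F");
        System.out.println(r.lower.equals("\u062F")); // prints true
        System.out.println(r.upper.equals("\u0698")); // prints true
    }
}
```

Normalizing the bounds this way makes the range independent of the order in which the two terms were typed.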

Steve

On 04/30/2008 at 3:21 AM, esra wrote:
 
 Hi,
 
 I am using Lucene's IndexSearcher to search the given XML
 by keyword, which contains Farsi information.
 While searching I use ranges like
 
 آ-ث  |  ج-خ  |  د-ژ  |  س-ظ  |  ع-ق  |  ک-ل  |  م-ی
 
 When I search for the  د-ژ  range the results are wrong;
 they are the results of the  س-ظ  range.
 
 For example, when I search for the  د-ژ  range, one of the
 results is ساب ووفر. This result is also shown in the  س-ظ  range's
 result list, which is the correct range.
 
 As IndexSearcher uses the compareTo method, and this method compares
 Unicode code points, I looked up the code points of the characters:
 
 د = U+062F
 ژ = U+0698
 and the first letter of ساب ووفر is  س = U+0633
 
 Do you have any idea how to solve this problem? There are
 analyzers for different languages. Would one of them be useful here,
 and if so, do you know where to find a Farsi analyzer?
 
 I would be glad if you could help.
 
 thanks ,
 
 Esra
 

RE: lucene farsi problem

2008-04-30 Thread Steven A Rowe

On 04/30/2008 at 12:50 PM, Steven A Rowe wrote:
 Caveat: I don't speak, read, write, or dream in Farsi - I
 just know that it mostly shares its orthography with Arabic,
 and that they are both written and read right-to-left.
 
 How are you constructing the queries?  Using QueryParser?  If
 so, then I suspect the problem is that you intend the range
 you supply to be read entirely right-to-left, but Lucene
 instead reads it left-to-right.  Have you tried using e.g.
 د-ژ instead of د-ژ?  (That is, placing the lower valued
 term on the left instead of the right.)

Sigh - can't edit RTL text - the example should be (hoping it doesn't get 
reversed again):

ژ-د instead of د-ژ (reversing the order of the lower and upper terms)
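
Because editors and mail clients display right-to-left text in visual order, it can be hard to tell which term actually comes first in the stored string. A small sketch, assuming Java 8 or later and using a hypothetical helper named describe, prints the code points in the order they are stored:

```java
public class LogicalOrder {
    // Lists the code points of a string in logical (storage) order,
    // independent of how a bidirectional display would render it.
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // DAL, HYPHEN-MINUS, JEH: the lower term is stored first.
        System.out.println(describe("\u062F-\u0698"));
        // prints U+062F U+002D U+0698
    }
}
```

If the first code point printed is the larger one, the lower and upper terms are reversed in the stored string, whatever the display suggests.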

 AFAICT, RangeFilter (called from ConstantScoreRangeQuery,
 which is called from QueryParser) does not test whether
 lowerTerm is in fact lower than upperTerm.  If it turns out
 that the problem is simply one of order, it might make sense
 to modify RangeFilter so that it flips them when lowerTerm > upperTerm.
 
 Steve
 
