[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-14 Thread M Alexander (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931836#action_12931836
 ] 

M Alexander commented on LUCENE-2745:
-

I tried to follow Steve's first response and ended up with the expected 
compilation issue. I then decided to simplify my approach and did pretty much 
what Uwe suggested - all is working now. Thanks all for your prompt help - 
much appreciated.

 ArabicAnalyzer - the ability to recognise email addresses host names and so on
 --

 Key: LUCENE-2745
 URL: https://issues.apache.org/jira/browse/LUCENE-2745
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
 Environment: All
Reporter: M Alexander

 The ArabicAnalyzer does not recognise email addresses, hostnames and so on. 
 For example,
 a...@hotmail.com
 will be tokenised to [adam] [hotmail] [com]
 It would be great if the ArabicAnalyzer could tokenise this to 
 [a...@hotmail.com]. The same applies to hostnames and so on.
 Can this be resolved? I hope so
 Thanks
 MAA

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-10 Thread M Alexander (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930814#action_12930814
 ] 

M Alexander commented on LUCENE-2745:
-

Quick question - how difficult is it to make the new StandardTokenizer 
(branch_3X), with its new capabilities (including properly tokenizing Arabic as 
well as identifying email addresses, hostnames, etc.), work with version 2.9.2?

Is it very difficult, or would it only require copying across a few classes and 
making minor tweaks?




[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-10 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930857#action_12930857
 ] 

Steven Rowe commented on LUCENE-2745:
-

bq. how difficult is it to make the new StandardTokenizer (branch_3X) with its 
new capabilities (including properly tokenizing Arabic as well as identifying 
email addresses, hostnames, etc) to work with version 2.9.2? 

You wouldn't be able to just drop the files in and compile, but backporting to 
2.9.X would definitely be possible.

Here are the things I found looking through CHANGES.txt on branch_3x that would 
require attention if you were to backport to 2.9.2:

* LUCENE-2302: TermAttribute -> CharTermAttribute
* LUCENE-2074: Java4 -> Java5 regeneration of StandardTokenizerImpl* from 
.jflex source; support for different behavior based on Lucene Version

There are probably some other things, not sure what.

Likely LUCENE-2302 is the biggest issue (it will block compilation), but if I 
remember correctly, the change is fairly simple.




[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-10 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930943#action_12930943
 ] 

Uwe Schindler commented on LUCENE-2745:
---

bq. Likely LUCENE-2302 is the biggest issue (it will block compilation), but if 
I remember correctly, the change is fairly simple.

Simply use the now-deprecated TermAttribute instead of backporting this 
issue. For StandardTokenizer this should be simple: replace methods such as 
copyBuffer() with setTermBuffer(), adjust the addAttribute() calls, and remove 
the generics, adding casts instead.




[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-08 Thread M Alexander (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929556#action_12929556
 ] 

M Alexander commented on LUCENE-2745:
-

{quote}
I think that ArabicLetterTokenizer, which is the tokenizer used by 
ArabicAnalyzer, is obsolete (as of version 3.1), since StandardTokenizer, which 
implements the Unicode word segmentation rules from UAX#29, should be able to 
properly tokenize Arabic. StandardTokenizer recognizes email addresses, 
hostnames, and URLs, so your concern would be addressed. (See LUCENE-2167, 
though, which was just reopened to turn off full URL output.) 
You can test this by composing your own analyzer, if you're willing to try 
using the as-yet-unreleased branch_3X, from which 3.1 will be cut (hopefully 
fairly soon): just copy the ArabicAnalyzer class and swap in StandardTokenizer 
for ArabicLetterTokenizer.
{quote} 

I tried to test this and failed (miserably). I think I struggled to apply the 
LUCENE-2167 patch correctly in Eclipse. I might just wait for the branch_3X 
release to make my life easier. I will then create my own Analyzer to perform 
Arabic text analysis and another one for Farsi text analysis. Both Analyzers 
will be able to handle diacritics as well as email addresses, 
hostnames and so on. I will close this issue for now (and will re-open it in 
the future if needed).

Quick question - any thoughts on handling Arabic email addresses and hostnames 
in the future?

Thanks to both of you for the time taken and I shall wait for the branch 
release to solve my issue.




[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-08 Thread M Alexander (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929558#action_12929558
 ] 

M Alexander commented on LUCENE-2745:
-

Oh, do you have a rough timing of the branch_3X release date?




[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-08 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929566#action_12929566
 ] 

Steven Rowe commented on LUCENE-2745:
-

bq. Oh, do you have a rough timing of the branch_3X release date? 

Wild guess: January 2011




[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread M Alexander (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929364#action_12929364
 ] 

M Alexander commented on LUCENE-2745:
-

Thanks Steven. I will give it a go and will share the outcome.




[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929365#action_12929365
 ] 

Robert Muir commented on LUCENE-2745:
-

I agree with what Steven said here... since previously StandardTokenizer would 
break on diacritics (shadda, etc.), it wasn't appropriate for Arabic writing 
systems, so we added ArabicLetterTokenizer as a workaround.

but you can use a different tokenizer in your own Analyzer to meet your 
needs... and we should try to avoid (deprecate+remove) language-specific 
tokenizers if we can.

the only trick to deprecating this ArabicLetterTokenizer is the Persian case, 
since I don't think UAX#29 will split on zero-width-non-joiner, so we need to 
do something to handle that case; otherwise we can default to a better 
tokenizer here.





[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread M Alexander (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929367#action_12929367
 ] 

M Alexander commented on LUCENE-2745:
-

Yes Robert, I have faced the diacritics problem. I am trying to build an 
Analyzer that does not break on diacritics and also recognises email 
addresses, hostnames and so on (which Arabic text may contain). This is why I 
asked whether there is a way to have full Arabic analysis 
(including diacritics) along with recognition of email addresses, hostnames, 
etc. in the same Analyzer. I will try your suggestions and will share the 
output. Thanks, Robert, for your help.




[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929386#action_12929386
 ] 

Steven Rowe commented on LUCENE-2745:
-

bq. the only trick to deprecating this ArabicLetterTokenizer is the Persian 
case, since I don't think UAX#29 will split on zero-width-non-joiner, so we 
need to do something to handle that case; otherwise we can default to a better 
tokenizer here.

Robert, can you provide more detail?  AFAICT from [this Wikipedia 
article|http://en.wikipedia.org/wiki/Zero-width_non-joiner], ZWNJs are used in 
Persian as display hints, not as word separators.




[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929389#action_12929389
 ] 

Robert Muir commented on LUCENE-2745:
-

steven, check out the link at the bottom of that article.
especially the top... it explains the use in the language,
particularly to block cursive joining for prefixes, suffixes,
compounds. we split on this and the affixes are in the stoplist

this is how the whole analyzer works, more examples in
the tests... I can give you more refs later, when I have
better bandwidth... but it's specific to this language.
we shouldn't split on it in general... also often a real
space is used instead, so this approach is the simplest
for the language




[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929392#action_12929392
 ] 

Steven Rowe commented on LUCENE-2745:
-

bq. steven, check out the link at the bottom of that article.

Yup, did that.

bq. especially the top... it explains the use in the language, particularly to 
block cursive joining for prefixes, suffixes, compounds. we split on this and 
the affixes are in the stoplist 

Um, like I said, Persian uses ZWNJs as display hints, not as word separators.

According to the [ICU web 
demo|http://demo.icu-project.org/icu-bin/ubrowse?go=200C], ZWNJs have the 
\p{Word_Break:Extend} property, so the Lucene UAX#29-based tokenizers will 
*not* split on this char.

What am I not getting?




[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929396#action_12929396
 ] 

Steven Rowe commented on LUCENE-2745:
-

{quote}
this is how the whole analyzer works, more examples in
the tests... I can give you more refs later, when I have
better bandwidth... but it's specific to this language.
we shouldn't split on it in general... also often a real
space is used instead, so this approach is the simplest
for the language
{quote}

AFAICT, ArabicLetterTokenizer just adds non-spacing marks to the list of 
acceptable token characters, so they won't be used to split words.  However, 
ZWNJ (U+200C) has the Cf -- Format -- general category, *not* the Mn 
general category (non-spacing marks), so as far as I can tell, the current 
Lucene ArabicLetterTokenizer (and hence ArabicAnalyzer) splits on ZWNJ.
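
The general-category point can be checked with the JDK alone. What follows is a minimal sketch, not the Lucene source: `isTokenCharSketch` is a hypothetical stand-in that assumes the tokenizer accepts letters plus non-spacing marks, as described above. Only `Character.isLetter` and `Character.getType` are real JDK calls; the class and method names are made up for illustration.

```java
// Sketch only: approximates the token-character rule described above
// (letters plus non-spacing marks). This is NOT the actual Lucene code.
final class ZwnjCategoryCheck {
    static final char ZWNJ = '\u200C';   // ZERO WIDTH NON-JOINER
    static final char SHADDA = '\u0651'; // ARABIC SHADDA, a diacritic

    // Hypothetical stand-in for the tokenizer's per-character test.
    static boolean isTokenCharSketch(char c) {
        return Character.isLetter(c)
            || Character.getType(c) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        // ZWNJ is in the Cf (Format) general category.
        System.out.println(Character.getType(ZWNJ) == Character.FORMAT); // true
        // A diacritic such as shadda is Mn, so it stays inside a token.
        System.out.println(isTokenCharSketch(SHADDA));                   // true
        // ZWNJ is neither a letter nor Mn, so under this rule it splits tokens.
        System.out.println(isTokenCharSketch(ZWNJ));                     // false
    }
}
```

Under that assumed rule, a Persian word containing ZWNJ would be cut in two at the ZWNJ, which is the behaviour being asked about here.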

Neither the tests in TestArabicLetterTokenizer nor those in TestArabicAnalyzer 
contain ZWNJ (U+200C).

Maybe what I'm not understanding is "this approach" in your quote above.  Can 
you describe this approach?

When you wrote "we split on this and the affixes are in the stoplist", did you 
mean that ArabicLetterTokenizer *intentionally* breaks Persian words at ZWNJ?  
And then throws away the affixes that result?  Hunh





[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929403#action_12929403
 ] 

Robert Muir commented on LUCENE-2745:
-

yes, sorry for the confusion!

one solution, add a charfilter that maps zwnj to space for persiananalyzer?

this way, it could use uax29 and support numerics etc
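
The character-mapping idea can be sketched without Lucene. The example below is an assumption-laden illustration using only `java.io`: `ZwnjToSpaceReader` and its `filter` helper are hypothetical names, and a real PersianAnalyzer would use Lucene's own CharFilter machinery rather than a plain `FilterReader`. It only shows the one-for-one replacement of ZWNJ (U+200C) with a space before tokenization.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Sketch only: a plain java.io reader that maps ZWNJ to a space.
// Hypothetical class; not part of Lucene.
final class ZwnjToSpaceReader extends FilterReader {
    private static final char ZWNJ = '\u200C';

    ZwnjToSpaceReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == ZWNJ ? ' ' : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // n is -1 at end of stream, in which case the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (buf[i] == ZWNJ) {
                buf[i] = ' ';
            }
        }
        return n;
    }

    // Convenience helper for trying the reader on a string.
    static String filter(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new ZwnjToSpaceReader(new StringReader(text))) {
            char[] buf = new char[64];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Because one character is replaced by exactly one character, character offsets in the filtered text still line up with the original, so a downstream tokenizer would see an ordinary space where the ZWNJ was.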




[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on

2010-11-07 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929405#action_12929405
 ] 

Steven Rowe commented on LUCENE-2745:
-

{quote}
one solution, add a charfilter that maps zwnj to space for persiananalyzer?

this way, it could use uax29 and support numerics etc
{quote}

I like it - it sounds better than my other idea: a configurable token splitting 
filter.
