[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
[ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931836#action_12931836 ]

M Alexander commented on LUCENE-2745:
-------------------------------------

I tried to follow Steve's first response and ended up with the expected compilation issue. I then decided to simplify my approach and did pretty much what Uwe suggested. All is working now.

Thanks all for your prompt help, much appreciated.

> ArabicAnalyzer - the ability to recognise email addresses host names and so on
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-2745
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2745
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2
>         Environment: All
>            Reporter: M Alexander
>
> The ArabicAnalyzer does not recognise email addresses, hostnames and so on.
> For example,
> a...@hotmail.com
> will be tokenised to [adam] [hotmail] [com]
> It would be great if the ArabicAnalyzer could tokenise this to
> [a...@hotmail.com]. The same applies to hostnames and so on.
> Can this be resolved? I hope so
> Thanks
> MAA

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
[ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930814#action_12930814 ]

M Alexander commented on LUCENE-2745:
-------------------------------------

Quick question: how difficult would it be to make the new StandardTokenizer (branch_3X), with its new capabilities (properly tokenizing Arabic as well as identifying email addresses, hostnames, etc.), work with version 2.9.2? Is it very difficult, or would it only require copying across a few classes and making minor tweaks?
[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
[ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929558#action_12929558 ]

M Alexander commented on LUCENE-2745:
-------------------------------------

Oh, do you have a rough timing of the branch_3X release date?
[jira] Closed: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
[ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

M Alexander closed LUCENE-2745.
-------------------------------

    Resolution: Later

Will wait for the release, which should include the solution.
[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
[ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929556#action_12929556 ]

M Alexander commented on LUCENE-2745:
-------------------------------------

{quote}
I think that ArabicLetterTokenizer, which is the tokenizer used by ArabicAnalyzer, is obsolete (as of version 3.1), since StandardTokenizer, which implements the Unicode word segmentation rules from UAX#29, should be able to properly tokenize Arabic. StandardTokenizer recognizes email addresses, hostnames, and URLs, so your concern would be addressed. (See LUCENE-2167, though, which was just reopened to turn off full URL output.)

You can test this by composing your own analyzer, if you're willing to try using the as-yet-unreleased branch_3X, from which 3.1 will be cut (hopefully fairly soon): just copy the ArabicAnalyzer class and swap in StandardTokenizer for ArabicLetterTokenizer.
{quote}

I tried to test this and failed (miserably). I think I struggled to apply the LUCENE-2167 patch correctly in Eclipse. I might just wait for the branch_3X release to make my life easier. I will then create my own Analyzer for Arabic text analysis and another one for Farsi text analysis. Both Analyzers will be able to handle diacritics as well as email addresses, hostnames and so on. I will close this issue for now (and will re-open it in the future if needed).

Quick question: any thoughts on handling Arabic email addresses and hostnames in the future?

Thanks to both of you for the time taken. I shall wait for the branch release to solve my issue.
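The tokenizer swap suggested in the quoted comment above can be pictured with a small, Lucene-free sketch. Everything in it is an assumption made for illustration: the class `AnalyzerCompositionSketch`, its methods and the two stand-in tokenizers are hypothetical and are not Lucene's API. It only shows the shape of the idea, namely that an analyzer is a tokenizer followed by a filter chain, so the tokenizer can be replaced while the filters stay unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** Illustration only: hypothetical names, plain Java, no Lucene classes involved. */
final class AnalyzerCompositionSketch {

    // Per-token filters that stay the same whichever tokenizer is plugged in.
    static final UnaryOperator<String> LOWER_CASE = s -> s.toLowerCase(Locale.ROOT);
    // Removes the Arabic short-vowel marks in the range U+064B to U+0652.
    static final UnaryOperator<String> STRIP_HARAKAT = s -> s.replaceAll("[\u064B-\u0652]", "");
    static final List<UnaryOperator<String>> FILTERS = List.of(LOWER_CASE, STRIP_HARAKAT);

    /** Runs the tokenizer, then applies every filter to every token in order. */
    static List<String> analyze(String text,
                                Function<String, List<String>> tokenizer,
                                List<UnaryOperator<String>> filters) {
        List<String> out = new ArrayList<>();
        for (String token : tokenizer.apply(text)) {
            for (UnaryOperator<String> filter : filters) {
                token = filter.apply(token);
            }
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    /** Stand-in for a letter-based tokenizer: anything that is not a letter or mark ends a token. */
    static List<String> splitOnNonLetters(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{M}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Stand-in for a tokenizer that keeps addresses whole: here simply a whitespace split. */
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Mail USER@Example.COM";  // hypothetical sample address
        // Same filters, different tokenizer: only the token boundaries change.
        System.out.println(analyze(text, AnalyzerCompositionSketch::splitOnNonLetters, FILTERS));
        System.out.println(analyze(text, AnalyzerCompositionSketch::splitOnWhitespace, FILTERS));
    }
}
```

In a real Lucene analyzer the same substitution happens where the token stream is built; the whitespace split here is only a placeholder for a tokenizer with proper email and hostname rules.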
[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
[ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929367#action_12929367 ]

M Alexander commented on LUCENE-2745:
-------------------------------------

Yes Robert, I have faced the diacritics problem. I am trying to build an Analyzer that does not break on diacritics and also recognises email addresses, hostnames and so on (which Arabic text may contain). This is why I asked whether there is a way to have full Arabic analysis (including diacritics) along with recognition of email addresses, hostnames, etc. in the same Analyzer. I will try your suggestions and will share the output.

Thanks Robert for your help.
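The diacritics problem mentioned above can be reproduced in plain Java. The sketch below is an assumption-laden illustration, not Lucene code: `DiacriticTokenizingSketch` and its members are hypothetical names. It relies on one fact about the JDK: `Character.isLetter` returns false for Arabic short-vowel marks such as U+064E, because they belong to the Unicode category `NON_SPACING_MARK`. A tokenizer that accepts letters only therefore cuts a vocalised word at every mark, while one that also accepts non-spacing marks keeps the word whole.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Illustration only: hypothetical names, plain Java, no Lucene classes involved. */
final class DiacriticTokenizingSketch {

    // Accepts letters only; combining marks end the current token.
    static final IntPredicate LETTERS_ONLY = Character::isLetter;

    // Accepts letters and non-spacing marks, so diacritics stay inside the word.
    static final IntPredicate LETTERS_AND_MARKS =
            cp -> Character.isLetter(cp) || Character.getType(cp) == Character.NON_SPACING_MARK;

    /** Collects maximal runs of code points accepted by the given predicate. */
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isTokenChar.test(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Removes the marks in the range U+064B (fathatan) to U+0652 (sukun). */
    static String stripHarakat(String token) {
        return token.replaceAll("[\u064B-\u0652]", "");
    }

    public static void main(String[] args) {
        // The word "kitab" written with a kasra and a fatha (escaped to keep the source ASCII).
        String vocalised = "\u0643\u0650\u062A\u064E\u0627\u0628";
        System.out.println(tokenize(vocalised, LETTERS_ONLY).size());       // prints 3
        System.out.println(tokenize(vocalised, LETTERS_AND_MARKS).size());  // prints 1
        System.out.println(stripHarakat(vocalised).length());               // prints 4
    }
}
```

Keeping the marks during tokenization and removing them in a later normalisation step is one way to get both behaviours in a single analysis chain; whether that matches a given Lucene version should be checked against its source.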
[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
[ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929364#action_12929364 ]

M Alexander commented on LUCENE-2745:
-------------------------------------

Thanks Steven. I will give it a go and will share the outcome.
[jira] Created: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
ArabicAnalyzer - the ability to recognise email addresses host names and so on
------------------------------------------------------------------------------

                 Key: LUCENE-2745
                 URL: https://issues.apache.org/jira/browse/LUCENE-2745
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/analyzers
    Affects Versions: 3.0.2, 3.0.1, 3.0, 2.9.3, 2.9.2
         Environment: All
            Reporter: M Alexander

The ArabicAnalyzer does not recognise email addresses, hostnames and so on. For example,

a...@hotmail.com

will be tokenised to [adam] [hotmail] [com].

It would be great if the ArabicAnalyzer could tokenise this to [a...@hotmail.com]. The same applies to hostnames and so on.

Can this be resolved? I hope so.

Thanks
MAA
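The behaviour described in this report can be mimicked without Lucene. The following is a minimal sketch under stated assumptions: `EmailTokenizingSketch`, its methods and the address `user@example.com` are hypothetical, and the regular expression is a deliberately simplified address pattern rather than the grammar any Lucene tokenizer uses. It contrasts a tokenizer that ends a token at every non-letter character with one that tries an address pattern first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: hypothetical names, plain Java, no Lucene classes involved. */
final class EmailTokenizingSketch {

    // Simplified pattern assumed for this sketch: an address, or else a run of letters and marks.
    private static final Pattern EMAIL_OR_WORD = Pattern.compile(
            "[\\p{L}\\p{N}._%+-]+@[\\p{L}\\p{N}-]+(?:\\.[\\p{L}\\p{N}-]+)+|[\\p{L}\\p{M}]+");

    /** Ends the current token at every non-letter, so '@' and '.' split an address. */
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Tries the address alternative first, then falls back to plain words. */
    static List<String> emailAwareTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = EMAIL_OR_WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "mail user@example.com now";  // hypothetical sample address
        System.out.println(letterTokens(text));      // prints [mail, user, example, com, now]
        System.out.println(emailAwareTokens(text));  // prints [mail, user@example.com, now]
    }
}
```

Putting the address alternative before the word alternative matters: regex alternation takes the first branch that matches at a position, so the longer address form wins over the bare local part.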