Re: Re: Some questions about StandardTokenizer and UNICODE Regular Expressions

2016-06-16 Thread dr
be referred to with the abbreviation “WB:” in \p{WB:property-name}, are described in the table at <http://www.unicode.org/reports/tr29/#Default_Word_Boundaries>. -- Steve www.lucidworks.com > On Jun 16, 2016, at 7:01 AM, dr wrote: > Hi guys

Re: Some questions about StandardTokenizer and UNICODE Regular Expressions

2016-06-16 Thread Steve Rowe
Hi guys > Currently, I'm looking into the rules of StandardTokenizer, but ran into some > problems. > As the docs say, StandardTokenizer implements the Word Break rules from > the Unicode Text Segmentation algorithm, as specified in Unicode Standard > Annex

Some questions about StandardTokenizer and UNICODE Regular Expressions

2016-06-16 Thread dr
Hi guys Currently, I'm looking into the rules of StandardTokenizer, but ran into some problems. As the docs say, StandardTokenizer implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. It is also generated by JFlex, a

Re: StandardTokenizer#setMaxTokenLength

2015-07-20 Thread Steve Rowe
StandardTokenizer and UAX29URLEmailTokenizer. Steve [1] https://issues.apache.org/jira/browse/LUCENE-5897 [2] https://issues.apache.org/jira/browse/LUCENE-5400 > On Jul 20, 2015, at 4:21 AM, Piotr Idzikowski wrote: > Hello. > Btw, I think ClassicAnalyzer has the same problem

Re: StandardTokenizer#setMaxTokenLength

2015-07-20 Thread Piotr Idzikowski
2015, at 4:47 AM, Piotr Idzikowski wrote: > Hello. > I am developing my own analyzer based on StandardAnalyzer. > I realized that tokenizer.setMaxTokenLength is called many times. > protected TokenStreamComponents createComponents(final String

Re: StandardTokenizer#setMaxTokenLength

2015-07-20 Thread Piotr Idzikowski
Talking about StandardTokenizer and setMaxTokenLength, I think I have > found another problem. > It looks like when the word is longer than the max length, the analyzer adds two > tokens -> word.substring(0, maxLength) and word.substring(maxLength) > Look at this code (sorry, it is quite ugly): public cl

Re: StandardTokenizer#setMaxTokenLength

2015-07-20 Thread Piotr Idzikowski
Hello Steve, It is always a pleasure to help you develop such a great lib. Talking about StandardTokenizer and setMaxTokenLength, I think I have found another problem. It looks like when the word is longer than the max length, the analyzer adds two tokens -> word.substring(0, maxLength) and word.substr
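
A minimal sketch of the split behaviour described above, against the Lucene 5.x API current at the time of the thread (class name and sample word invented): a token longer than maxTokenLength is emitted in maxTokenLength-sized chunks rather than truncated or dropped.

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class MaxTokenLengthDemo {
        public static void main(String[] args) throws Exception {
            StandardTokenizer tok = new StandardTokenizer();   // 5.x no-arg constructor
            tok.setMaxTokenLength(5);
            tok.setReader(new StringReader("internationalization"));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
                System.out.println(term);                      // inter, natio, naliz, ation
            }
            tok.end();
            tok.close();
        }
    }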

Re: StandardTokenizer#setMaxTokenLength

2015-07-17 Thread Steve Rowe
setMaxTokenLength is called many times. > protected TokenStreamComponents createComponents(final String fieldName, > final Reader reader) { > final StandardTokenizer src = new StandardTokenizer(getVersion(), reader); > src.setMaxTokenLength(maxTokenLength); >

StandardTokenizer#setMaxTokenLength

2015-07-16 Thread Piotr Idzikowski
Hello. I am developing my own analyzer based on StandardAnalyzer. I realized that tokenizer.setMaxTokenLength is called many times. protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) { final StandardTokenizer src = new StandardTokenizer(getVersion
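
A hedged completion of the snippet above, mirroring the thread's own 4.x two-argument createComponents (MyAnalyzer and the maxTokenLength field are stand-ins). Analyzer caches TokenStreamComponents per thread, so this method - and with it the setMaxTokenLength call - runs once per thread rather than once per document:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public final class MyAnalyzer extends Analyzer {
        private final int maxTokenLength;

        public MyAnalyzer(int maxTokenLength) {
            this.maxTokenLength = maxTokenLength;
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
            src.setMaxTokenLength(maxTokenLength);   // applied once per cached components
            return new TokenStreamComponents(src);
        }
    }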

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-02 Thread Steve Rowe
>> Boilerplate upgrade recommendation: consider using the most recent Lucene >> release (4.10.1) - it’s the most stable, performant, and featureful release >> available, and many bugs have been fixed since the 4.1 release. > Yeah sure, I did try this and hit a load of errors but I certainly will

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Paul Taylor
a load of errors but I certainly will do so. FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese, Korean, Thai, and other languages that don’t use whitespace to denote word boundaries, except those around punctuation. Note that Lucene 4.1 does have specialized tokenizers

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Steve Rowe
Paul, Boilerplate upgrade recommendation: consider using the most recent Lucene release (4.10.1) - it’s the most stable, performant, and featureful release available, and many bugs have been fixed since the 4.1 release. FYI, StandardTokenizer doesn’t find word boundaries for Chinese, Japanese

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Michael McCandless
type? > Dawid > On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe wrote: >> Hi Paul, >> StandardTokenizer implements the Word Boundaries rules in the Unicode Text >> Segmentation Standard Annex UAX#29 - here’s the relevant section for Unicode

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Paul Taylor
(read: relatively easy) to create an analyzer (or a modification of the standard one's lexer) so that punctuation is returned as a separate token type? Dawid On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe wrote: Hi Paul, StandardTokenizer implements the Word Boundaries rules in the Unicode Text S

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Dawid Weiss
analyzer (or a modification of the standard one's lexer) so that punctuation is returned as a separate token type? Dawid On Wed, Oct 1, 2014 at 7:01 AM, Steve Rowe wrote: > Hi Paul, > > StandardTokenizer implements the Word Boundaries rules in the Unicode Text > Segmentation Standard An

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-09-30 Thread Steve Rowe
Hi Paul, StandardTokenizer implements the Word Boundaries rules in the Unicode Text Segmentation Standard Annex UAX#29 - here’s the relevant section for Unicode 6.1.0, which is the version supported by Lucene 4.1.0: <http://www.unicode.org/reports/tr29/tr29-19.html#Word_Boundaries>.

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-09-30 Thread Jack Krupansky
-21203548.html That doesn't give you Java details for Lucene, but the tokenizer rules are the same. -- Jack Krupansky -Original Message- From: Paul Taylor Sent: Tuesday, September 30, 2014 3:54 PM To: java-user@lucene.apache.org Subject: Does StandardTokenizer remove punctuation (in L

Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-09-30 Thread Paul Taylor
Does StandardTokenizer remove punctuation (in Lucene 4.1) I'm just trying to move back to StandardTokenizer from my own old custom implementation because the newer version seems to have much better support for Asian languages. However this code excerpt fails on incrementToken(), implying that the
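
For reference, a small sketch of the behaviour in question (4.1-era API; class name and input invented): standalone punctuation never becomes a token under the UAX#29 rules, so it is "removed" simply by never being emitted.

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class PunctDemo {
        public static void main(String[] args) throws Exception {
            StandardTokenizer tok = new StandardTokenizer(Version.LUCENE_41,
                    new StringReader("Hello, world!"));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
                System.out.println(term);   // Hello, then world - ',' and '!' never appear
            }
            tok.end();
            tok.close();
        }
    }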

Re: usage of CollationAttributeFactory StandardTokenizer Analyzer

2014-07-31 Thread craiglang44
Sent from my BlackBerry® smartphone -Original Message- From: Cemo Date: Thu, 31 Jul 2014 11:04:18 To: Reply-To: java-user@lucene.apache.org Subject: usage of CollationAttributeFactory StandardTokenizer Analyzer Hi, I am trying to use CollationAttributeFactory with a custom analyzer

usage of CollationAttributeFactory StandardTokenizer Analyzer

2014-07-31 Thread Cemo
Hi, I am trying to use CollationAttributeFactory with a custom analyzer. I am using StandardTokenizer with CollationAttributeFactory as in org.apache.lucene.collation.CollationKeyAnalyzer. protected TokenStreamComponents createComponents(String fieldName
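
A hedged sketch of wiring CollationAttributeFactory into a custom analyzer, along the lines of what CollationKeyAnalyzer does internally (4.x-era signatures; the class name and locale are arbitrary examples):

    import java.io.Reader;
    import java.text.Collator;
    import java.util.Locale;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.collation.CollationAttributeFactory;
    import org.apache.lucene.util.Version;

    public final class CollatingAnalyzer extends Analyzer {
        private final CollationAttributeFactory factory =
            new CollationAttributeFactory(Collator.getInstance(new Locale("tr", "TR")));

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            // the factory makes the tokenizer emit collation keys as term bytes
            return new TokenStreamComponents(
                new StandardTokenizer(Version.LUCENE_48, factory, reader));
        }
    }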

Re: Extending StandardTokenizer Jflex to not split on '/'

2014-02-20 Thread Diego Fernandez
me - are you regenerating the scanner (‘ant jflex’)? > FYI, I found a bug when I was testing the above: “http://example.com” is left > intact when “/“ is added to MidLetter, but it shouldn’t be; although ‘:’ and > ‘/‘ are in [/\p

Re: Extending StandardTokenizer Jflex to not split on '/'

2014-02-17 Thread Steve Rowe
added to MidLetter, but it shouldn’t be; although ‘:’ and ‘/‘ are in [/\p{WB:MidLetter}], the letter-on-both-sides requirement should instead result in “http://example.com” being split into “http” and “example.com”. Further testing indicates that this is a p

Re: Extending StandardTokenizer Jflex to not split on '/'

2014-02-17 Thread Diego Fernandez
Further testing indicates that this is a problem for MidLetter, MidNumLet and MidNum. I’ve filed an issue: <https://issues.apache.org/jira/browse/LUCENE-5447>. Steve On Feb 14, 2014, at 1:42 PM, Diego Fernandez wrote: > Hi guys, this is my first ti

Re: Extending StandardTokenizer Jflex to not split on '/'

2014-02-14 Thread Steve Rowe
problem for MidLetter, MidNumLet and MidNum. I’ve filed an issue: <https://issues.apache.org/jira/browse/LUCENE-5447>. Steve On Feb 14, 2014, at 1:42 PM, Diego Fernandez wrote: > Hi guys, this is my first time posting on the Lucene list, so hello everyone. > > I really like

Extending StandardTokenizer Jflex to not split on '/'

2014-02-14 Thread Diego Fernandez
Hi guys, this is my first time posting on the Lucene list, so hello everyone. I really like the way that the StandardTokenizer works, however I'd like for it to not split tokens on / (forward slash). I've been looking at http://unicode.org/reports/tr29/#Default_Word_Boundaries
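
The change itself goes into StandardTokenizerImpl.jflex - add '/' to the MidLetter character class and regenerate with 'ant jflex', as the replies above discuss. A small hedged harness for checking the effect of such a change (stock 4.x Lucene shown; the rebuilt scanner is assumed to be on the classpath; class name invented):

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
    import org.apache.lucene.util.Version;

    public class TokenDump {
        public static void main(String[] args) throws Exception {
            for (String s : new String[] {"foo/bar", "http://example.com"}) {
                StandardTokenizer tok = new StandardTokenizer(Version.LUCENE_46,
                        new StringReader(s));
                CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
                TypeAttribute type = tok.addAttribute(TypeAttribute.class);
                tok.reset();
                while (tok.incrementToken()) {
                    System.out.println(s + " -> " + term + " [" + type.type() + "]");
                }
                tok.end();
                tok.close();
            }
        }
    }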

RE: StandardTokenizer generation from JFlex grammar

2012-10-04 Thread vempap
Thanks Steve for the pointers. I'll look into it. -- View this message in context: http://lucene.472066.n3.nabble.com/StandardTokenizer-generation-from-JFlex-grammar-tp4011940p4011944.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.

RE: StandardTokenizer generation from JFlex grammar

2012-10-04 Thread Steven A Rowe
M To: d...@lucene.apache.org Subject: StandardTokenizer generation from JFlex grammar Hello, I'm trying to generate the standard tokenizer again using the jflex specification (StandardTokenizerImpl.jflex) but I'm not able to do so due to some errors (I would like to create my own jflex file using the

StandardTokenizer generation from JFlex grammar

2012-10-04 Thread vempap
://lucene.472066.n3.nabble.com/StandardTokenizer-generation-from-JFlex-grammar-tp4011940.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.

RE: StandardTokenizer and split tokens

2012-06-24 Thread Uwe Schindler
> -Original Message- > From: Mansour Al Akeel [mailto:mansour.alak...@gmail.com] > Sent: Saturday, June 23, 2012 11:21 PM > To: java-user@lucene.apache.org > Subject: Re: StandardTokenizer and split tokens > Uwe, > thank you for the advice. I updated my code.

Re: StandardTokenizer and split tokens

2012-06-23 Thread Mansour Al Akeel
Uwe, thank you for the advice. I updated my code. On Sat, Jun 23, 2012 at 3:15 AM, Uwe Schindler wrote: >> I found the main issue. >> I was using BytesRef without the length. This fixed the problem. >> String word = new String(ref.bytes, ref.offset, ref.length); > Pleas

RE: StandardTokenizer and split tokens

2012-06-23 Thread Uwe Schindler
> I found the main issue. > I was using BytesRef without the length. This fixed the problem. > String word = new String(ref.bytes, ref.offset, ref.length); Please see my other mail: using no character set here is the second problem of your code. This is the correct way to do it:
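
The snippet cuts off before the code; a hedged completion of Uwe's point (helper name invented). Lucene stores terms as UTF-8 bytes, so the decode must say so - StandardCharsets needs Java 7; on Java 6 pass "UTF-8" and handle the checked exception:

    import java.nio.charset.StandardCharsets;
    import org.apache.lucene.util.BytesRef;

    static String termToString(BytesRef ref) {
        // decode explicitly as UTF-8; a platform-default
        // new String(bytes, offset, length) can mangle the term
        return new String(ref.bytes, ref.offset, ref.length, StandardCharsets.UTF_8);
        // equivalently: return ref.utf8ToString();
    }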

RE: StandardTokenizer and split tokens

2012-06-23 Thread Uwe Schindler
Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Mansour Al Akeel [mailto:mansour.alak...@gmail.com] > Sent: Saturday, June 23, 2012 12:26 AM > To: java-user@lucene.apache.org > Subject: StandardTokenizer and split t

Re: StandardTokenizer and split tokens

2012-06-22 Thread Mansour Al Akeel
generated. Here's the Analyzer: > public class AutoCompleteAnalyzer extends Analyzer { > public TokenStream tokenStream(String fieldName, Reader reader) { > TokenStream result = null; > result = new StandardTokenizer(Version.LUCENE_36, reader); >

StandardTokenizer and split tokens

2012-06-22 Thread Mansour Al Akeel
are generated. Here's the Analyzer: public class AutoCompleteAnalyzer extends Analyzer { public TokenStream tokenStream(String fieldName, Reader reader) { TokenStream result = null; result = new StandardTokenizer(Version.LUCENE_36, r

Re: StandardTokenizer

2011-09-30 Thread Peyman Faratin
> "i'll email you at x...@abc.com" >> >> and I am looking at the tokens a StandardAnalyzer (which uses the >> StandardTokenizer) produces >> >> 1: [i'll:0->4:] >> 2: [email:5->10:] >> 3: [you:11->14:] >> 5: [x:18->

Re: StandardTokenizer

2011-09-30 Thread Ian Lea
, although probably not the apostrophe. -- Ian. On Thu, Sep 29, 2011 at 7:51 PM, Peyman Faratin wrote: > Hi > I have a sentence > "i'll email you at x...@abc.com" > and I am looking at the tokens a StandardAnalyzer (which uses the > StandardTokenizer) pro

StandardTokenizer

2011-09-29 Thread Peyman Faratin
Hi I have a sentence "i'll email you at x...@abc.com" and I am looking at the tokens a StandardAnalyzer (which uses the StandardTokenizer) produces 1: [i'll:0->4:] 2: [email:5->10:] 3: [you:11->14:] 5: [x:18->19:] 6: [abc.com:20->27:] I am using
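
A hedged aside: since Lucene 3.1 the tokenizer follows UAX#29 word breaks, which split at '@' (hence the separate x and abc.com tokens above); UAX29URLEmailTokenizer, also 3.1+, keeps e-mail addresses and URLs whole. A sketch against the 3.6-era signature (class name and address invented):

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class EmailDemo {
        public static void main(String[] args) throws Exception {
            UAX29URLEmailTokenizer tok = new UAX29URLEmailTokenizer(Version.LUCENE_36,
                    new StringReader("i'll email you at user@example.com"));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
                System.out.println(term);   // the address survives as one token
            }
            tok.end();
            tok.close();
        }
    }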

StandardTokenizer question

2011-04-18 Thread Mindaugas Žakšauskas
extends Analyzer { public TokenStream tokenStream(String fieldName, Reader reader) { return new StopFilter(true, new StandardTokenizer(Version.LUCENE_30, reader), StopAnalyzer.ENGLISH_STOP_WORDS_SET

Re: iterate through tokens in standardtokenizer

2010-09-12 Thread Karthik K
Thanks a lot for the help.

RE: iterate through tokens in standardtokenizer

2010-09-12 Thread Uwe Schindler
> From: Karthik K [mailto:karthikkato...@gmail.com] > Sent: Sunday, September 12, 2010 7:12 AM > To: java-user@lucene.apache.org > Subject: iterate through tokens in standardtokenizer > Hi, > I am trying to use StandardTokenizer in a non-lucene project to generate

iterate through tokens in standardtokenizer

2010-09-11 Thread Karthik K
Hi, I am trying to use StandardTokenizer in a non-lucene project to generate tokens. The previous versions I used supported token.next/getToken to iterate over and retrieve the tokens continuously. 3.0.2 doesn't have that and I can't figure out how to iterate. Can get number of tokens with
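
The attribute-based replacement for the removed next()/getToken() style, sketched against the 3.0 API (TermAttribute; from 3.1 on, CharTermAttribute plays this role; class name and input invented):

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public class IterateDemo {
        public static void main(String[] args) throws Exception {
            StandardTokenizer ts = new StandardTokenizer(Version.LUCENE_30,
                    new StringReader("some text to split"));
            TermAttribute term = ts.addAttribute(TermAttribute.class);
            while (ts.incrementToken()) {     // false once the input is exhausted
                System.out.println(term.term());
            }
            ts.end();
            ts.close();
        }
    }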

Re: Strange behaviour of StandardTokenizer

2010-06-21 Thread Anna Hunecke
18.6.2010: > From: Simon Willnauer > Subject: Re: Strange behaviour of StandardTokenizer > To: java-user@lucene.apache.org > Date: Friday, 18 June 2010, 09:52 > Hi Anna, > what are you using your tokenizer for? There are a lot of different > options in Lucene

Re: Strange behaviour of StandardTokenizer

2010-06-18 Thread Simon Willnauer
Hi Anna, what are you using your tokenizer for? There are a lot of different options in Lucene and StandardTokenizer is not necessarily the best one. The behaviour you see is that the tokenizer detects your token as a number. When you look at the grammar, that is kind of obvious. // floating

Re: Strange behaviour of StandardTokenizer

2010-06-18 Thread Ahmet Arslan
> okay, so it is recognized as a number? Yes. You can see the token type definitions in the *.jflex file. > Maybe I'll have to use another tokenizer. The option of MappingCharFilter with StandardTokenizer exists: NormalizeCharMap map = new NormalizeCharMap(); map.add("-", " ");
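
A hedged completion of Ahmet's suggestion against the 3.0-era API (helper name invented): rewriting '-' to a space before the tokenizer sees it makes 'nl-lt' and 'nl-lt0' split identically.

    import java.io.Reader;
    import org.apache.lucene.analysis.CharReader;
    import org.apache.lucene.analysis.MappingCharFilter;
    import org.apache.lucene.analysis.NormalizeCharMap;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    static TokenStream buildStream(Reader reader) {
        NormalizeCharMap map = new NormalizeCharMap();
        map.add("-", " ");                 // '-' becomes a plain space pre-tokenization
        Reader filtered = new MappingCharFilter(map, CharReader.get(reader));
        return new StandardTokenizer(Version.LUCENE_30, filtered);
    }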

Re: Strange behaviour of StandardTokenizer

2010-06-18 Thread Anna Hunecke
Ahmet Arslan > Subject: Re: Strange behaviour of StandardTokenizer > To: java-user@lucene.apache.org > Date: Thursday, 17 June 2010, 15:50 > I ran into a strange behaviour of the StandardTokenizer. > Terms containing a '-' are tokenized differently

Re: Strange behaviour of StandardTokenizer

2010-06-17 Thread Ahmet Arslan
> I ran into a strange behaviour of the StandardTokenizer. > Terms containing a '-' are tokenized differently depending > on the context. > For example, the term 'nl-lt' is split into 'nl' and 'lt'. > The term 'nl-lt0' is tokeni

Strange behaviour of StandardTokenizer

2010-06-17 Thread Anna Hunecke
Hi! I ran into a strange behaviour of the StandardTokenizer. Terms containing a '-' are tokenized differently depending on the context. For example, the term 'nl-lt' is split into 'nl' and 'lt'. The term 'nl-lt0' is tokenized into 'nl-l
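
A small harness to observe this directly (3.0-era API; class name invented). Expected output, per the replies above: 'nl-lt' yields two ALPHANUM tokens, while the digit in 'nl-lt0' makes the whole term match the grammar's number rule:

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
    import org.apache.lucene.util.Version;

    public class HyphenDemo {
        public static void main(String[] args) throws Exception {
            for (String s : new String[] {"nl-lt", "nl-lt0"}) {
                StandardTokenizer ts = new StandardTokenizer(Version.LUCENE_30,
                        new StringReader(s));
                TermAttribute term = ts.addAttribute(TermAttribute.class);
                TypeAttribute type = ts.addAttribute(TypeAttribute.class);
                while (ts.incrementToken()) {
                    System.out.println(s + " -> " + term.term() + " [" + type.type() + "]");
                }
                ts.close();
            }
        }
    }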

Re: Recover special terms from StandardTokenizer

2009-12-13 Thread Weiwei Wang
- > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > -Original Message- > From: Weiwei Wang [mailto:ww.wang...@gmail.com] > Sent: Sunday, Dec

Re: Recover special terms from StandardTokenizer

2009-12-13 Thread Weiwei Wang
ng [mailto:ww.wang...@gmail.com] > Sent: Sunday, December 13, 2009 12:51 PM > To: java-user@lucene.apache.org > Subject: Re: Recover special terms from StandardTokenizer > LowercaseCharFilter is necessary, as in the MappingCharFilter we need to > provid

RE: Recover special terms from StandardTokenizer

2009-12-13 Thread Uwe Schindler
eMail: u...@thetaphi.de > -Original Message- > From: Weiwei Wang [mailto:ww.wang...@gmail.com] > Sent: Sunday, December 13, 2009 12:51 PM > To: java-user@lucene.apache.org > Subject: Re: Recover special terms from StandardTokenizer > LowercaseCharFilter is necessary, as in the

Re: Recover special terms from StandardTokenizer

2009-12-13 Thread Weiwei Wang
> -Original Message- > From: Weiwei Wang [mailto:ww.wang...@gmail.com] > Sent: Sunday, December 13, 2009 12:23 PM > To: java-user@lucene.apache.org > Subject: Re: Recover special terms from StandardTokenizer > thanks, Uwe.

RE: Recover special terms from StandardTokenizer

2009-12-13 Thread Uwe Schindler
Message- > From: Weiwei Wang [mailto:ww.wang...@gmail.com] > Sent: Sunday, December 13, 2009 12:23 PM > To: java-user@lucene.apache.org > Subject: Re: Recover special terms from StandardTokenizer > thanks, Uwe. > Maybe I was not very clear. My situation is like this: > Analyzer:

Re: Recover special terms from StandardTokenizer

2009-12-13 Thread Weiwei Wang
(RECOVERY_MAP,filter); StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter); tokenStream.setMaxTokenLength(maxTokenLength); TokenStream result = new StandardFilter(tokenStream); result = getStopFilter(result); result = new SnowballFilter(result, ST

RE: Recover special terms from StandardTokenizer

2009-12-13 Thread Uwe Schindler
.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Weiwei Wang [mailto:ww.wang...@gmail.com] > Sent: Sunday, December 13, 2009 11:43 AM > To: java-user@lucene.apache.org > Subject: Re: Recover special terms from St

Re: Recover special terms from StandardTokenizer

2009-12-13 Thread Weiwei Wang
NormalizeCharMap(); >> RECOVERY_MAP.add("c++","cplusplus$"); >> CharFilter filter = new LowercaseCharFilter(reader); >> filter = new MappingCharFilter(RECOVERY_MAP, filter); >> StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);

Re: Recover special terms from StandardTokenizer

2009-12-12 Thread Weiwei Wang
NormalizeCharMap(); > RECOVERY_MAP.add("c++","cplusplus$"); > CharFilter filter = new LowercaseCharFilter(reader); > filter = new MappingCharFilter(RECOVERY_MAP, filter); > StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter); > tokenStream.setMaxToken

Re: Recover special terms from StandardTokenizer

2009-12-12 Thread Weiwei Wang
Thanks, Koji, I followed your advice and changed my analyzer as shown below: NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap(); RECOVERY_MAP.add("c++","cplusplus$"); CharFilter filter = new LowercaseCharFilter(reader); filter = new MappingCharFilter(RECOVERY_MAP, filter
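
The chain above, completed as a hedged sketch (3.0-era API; helper name invented). LowercaseCharFilter is the poster's own class, not core Lucene, so it is omitted here - its job is lowercasing the input so that "C++" also matches the map. The placeholder survives StandardTokenizer, which would otherwise strip the '+' characters, and a downstream filter can map it back:

    import java.io.Reader;
    import org.apache.lucene.analysis.CharReader;
    import org.apache.lucene.analysis.CharStream;
    import org.apache.lucene.analysis.MappingCharFilter;
    import org.apache.lucene.analysis.NormalizeCharMap;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    static TokenStream recoverCpp(Reader reader) {
        NormalizeCharMap recoveryMap = new NormalizeCharMap();
        recoveryMap.add("c++", "cplusplus$");          // placeholder survives tokenization
        CharStream filtered = new MappingCharFilter(recoveryMap, CharReader.get(reader));
        return new StandardTokenizer(Version.LUCENE_30, filtered);
    }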

Re: Recover special terms from StandardTokenizer

2009-12-11 Thread Weiwei Wang
facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw. On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang wrote: > Hi, all, > I designed an FTP search engine based o

Re: Recover special terms from StandardTokenizer

2009-12-11 Thread Koji Sekiguchi
Lucene. I did a few modifications to the StandardTokenizer. My problem is: C++ is tokenized as c by StandardTokenizer and I want to recover it from the TokenStream. What should I do? -- Weiwei Wang Alex Wang 王巍巍 Room 403, Mengmin Wei Building Computer Science Department

Re: Recover special terms from StandardTokenizer

2009-12-11 Thread Anshum
everybody, the opinions to me. The distinction is yours to draw. On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang wrote: > Hi, all, > I designed an FTP search engine based on Lucene. I did a few > modifications to the StandardTokenizer. > My problem is: > C++ is toke

Recover special terms from StandardTokenizer

2009-12-10 Thread Weiwei Wang
Hi, all, I designed an FTP search engine based on Lucene. I did a few modifications to the StandardTokenizer. My problem is: C++ is tokenized as c by StandardTokenizer, and I want to recover it from the TokenStream that StandardTokenizer produces. What should I do? -- Weiwei Wang Alex Wang 王巍巍

RE: Keep URLs intact and not tokenized by the StandardTokenizer

2009-11-19 Thread Delbru, Renaud
Hi, Some time ago, I had to modify and extend the Lucene StandardTokenizer grammar (flex file) so that it preserves URIs (based on RFC3986). I have extracted the files from my project and published the source code on github [1] under the Apache License 2.0, if it can help. [1] http

Re: Keep URLs intact and not tokenized by the StandardTokenizer

2009-11-19 Thread Sudha Verma
Thanks. I was hoping Lucene would already have a solution for this since it seems like it would be a common problem. I am new to the lucene API. If I were to implement something from scratch, are my options to extend the Tokenizer to support http regex and then pass the text to StandardTokenizer

RE: Keep URLs intact and not tokenized by the StandardTokenizer

2009-11-19 Thread Steven A Rowe
Hi Sudha, In the past, I've built regexes to recognize URLs using the information here: http://www.foad.org/~abigail/Perl/url2.html The above, however, is currently a dead link. Here's the Internet Archive's WayBack Machine's cache of this page from August 2007:

Keep URLs intact and not tokenized by the StandardTokenizer

2009-11-18 Thread Sudha Verma
Hi, I am using Lucene 2.9.1. I am reading in free-text documents which I index using Lucene and the StandardAnalyzer at the moment. The StandardAnalyzer keeps email addresses intact and does not tokenize them. Is there something similar for URLs? This seems like a common need. So, I thought I'd
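
A hedged forward-pointer: this thread predates it, but Lucene 3.1 added UAX29URLEmailTokenizer, which keeps both URLs and e-mail addresses as single tokens; on 2.9.x the realistic options are a modified JFlex grammar (see Renaud's reply above) or a regex pre-pass. Sketch of the later API (3.6-era signature; helper name invented):

    import java.io.Reader;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
    import org.apache.lucene.util.Version;

    // not available in 2.9.1 - shown for readers on 3.1+
    static Tokenizer urlAware(Reader reader) {
        return new UAX29URLEmailTokenizer(Version.LUCENE_36, reader);
    }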

Re: Best way to create own version of StandardTokenizer ?

2009-09-07 Thread Robert Muir
On Mon, Sep 7, 2009 at 10:47 AM, Paul Taylor wrote: > Robert Muir wrote: >>> I think we would like to implement the complete unicode rules, so if you >>> could provide us with some code that would be great. >> ok, I will follow up... what version of lucene are you using, 2.9?

Re: Best way to create own version of StandardTokenizer ?

2009-09-07 Thread Paul Taylor
couple issues. First, I do not know what StandardTokenizer does with geresh/gershayim, forget about single quote/double quote. But to fix the tokenization (gershayim example), you want to ensure you do not split on these. Since this is used in Hebrew acronyms, I would modify the acronym rule

Re: Best way to create own version of StandardTokenizer ?

2009-09-07 Thread Robert Muir
what you are suggesting, I couldn't follow how to change > jflex. You are right, for you there are a couple issues. First, I do not know what StandardTokenizer does with geresh/gershayim, forget about single quote/double quote. But to fix the tokenization (gershayim example), you want to

Re: Best way to create own version of StandardTokenizer ?

2009-09-07 Thread Paul Taylor
Robert Muir wrote: Paul, thanks for the examples. In my opinion, only one of these is a tokenizer problem :) None of these will be affected by a Unicode upgrade. Things like: http://bugs.musicbrainz.org/ticket/1006 Another approach is using IBM's ICU library for this case, as the buil

Re: Best way to create own version of StandardTokenizer ?

2009-09-04 Thread Robert Muir
Paul, no problem. It is not fully functional right now (incomplete, bugs, etc); the patch is kinda for reading only :) But if you have other similar issues on your project, feel free to post links to them on that JIRA ticket. This way we can look at what problems you have and, if appropriate, maybe they

Re: Best way to create own version of StandardTokenizer ?

2009-09-04 Thread Paul Taylor
Robert Muir wrote: Paul, thanks for the examples. In my opinion, only one of these is a tokenizer problem :) none of these will be affected by a unicode upgrade. Thanks for taking the time to write that response, it will take me a bit of time to understand all this because I've ever used Lucene

Re: Best way to create own version of StandardTokenizer ?

2009-09-04 Thread Robert Muir
Paul, thanks for the examples. In my opinion, only one of these is a tokenizer problem :) None of these will be affected by a Unicode upgrade. > Things like: > http://bugs.musicbrainz.org/ticket/1006 In this case, it appears you want to do script conversion, and it appears from the ticket you a

Re: Best way to create own version of StandardTokenizer ?

2009-09-04 Thread Paul Taylor
Robert Muir wrote: On Fri, Sep 4, 2009 at 11:18 AM, Paul Taylor wrote: I submitted this https://issues.apache.org/jira/browse/LUCENE-1787 patch to StandardTokenizerImpl; understandably it hasn't been incorporated into Lucene (yet), but I need it for the project I'm working on. So would you reco

Re: Best way to create own version of StandardTokenizer ?

2009-09-04 Thread Robert Muir
On Fri, Sep 4, 2009 at 11:18 AM, Paul Taylor wrote: > I submitted this https://issues.apache.org/jira/browse/LUCENE-1787 patch to > StandardTokenizerImpl; understandably it hasn't been incorporated into > Lucene (yet), but I need it for the project I'm working on. So would you > recommend keeping the

Best way to create own version of StandardTokenizer ?

2009-09-04 Thread Paul Taylor
I submitted this https://issues.apache.org/jira/browse/LUCENE-1787 patch to StandardTokenizerImpl; understandably it hasn't been incorporated into Lucene (yet), but I need it for the project I'm working on. So would you recommend keeping the same class name, and just putting it in the classpath befo

Re: StandardTokenizer issue ?

2009-03-15 Thread Paul Cowan
iMe wrote: This analyzer uses the StandardTokenizer, whose javadoc states: Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split. But looking at my index with Luke, I saw that my product reference

Re: StandardTokenizer issue ?

2009-03-13 Thread iMe
Grant Ingersoll-6 wrote: > That does sound like an issue. Can you open a JIRA issue for it? I don't know how to do that... Could somebody do it for me? Thank you -- View this message in context: http://www.nabble.com/StandardTokenizer-issue---tp22471475p22495653.html

Re: StandardTokenizer issue ?

2009-03-13 Thread Grant Ingersoll
That does sound like an issue. Can you open a JIRA issue for it? Thanks, Grant On Mar 12, 2009, at 5:55 AM, iMe wrote: I spotted an unexpected behavior when using the StandardAnalyzer. This analyzer uses the StandardTokenizer, whose javadoc states: Splits words at hyphens, unless there's

StandardTokenizer issue ?

2009-03-12 Thread iMe
I spotted an unexpected behavior when using the StandardAnalyzer. This analyzer uses the StandardTokenizer, whose javadoc states: Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split. But lo

RE: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-21 Thread Philip Puffinburger
2.3.2 -> 2.4.0 StandardTokenizer issue That was just a suggestion as a quick hack... it still won't really fix the problem because some character + accent combinations don't have composed forms. Even if you added the entire Combining Diacritical Marks block to the jflex grammar, it's still wron

Re: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-21 Thread Robert Muir
e tokens to do its > operations. So instead of 0..1 conversions we'd be doing 1..2 conversions > during indexing and searching. > -Original Message- > From: Robert Muir [mailto:rcm...@gmail.com] > Sent: Saturday, February 21, 2009 8:35 AM > To: java-user@lu

RE: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-21 Thread Philip Puffinburger
Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Saturday, February 21, 2009 8:35 AM To: java-user@lucene.apache.org Subject: Re: 2.3.2 -> 2.4.0 StandardTokenizer issue Normalize your text to NFC. Then it will be \u0043 \u00F3 \u00
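
Robert's advice in runnable form, as a hedged sketch using java.text.Normalizer (Java 6+; ICU4J offers the same operation for older JVMs; class name invented):

    import java.text.Normalizer;

    public class NfcDemo {
        public static void main(String[] args) {
            String decomposed = "Co\u0301mo";   // C, o, combining acute accent, m, o
            String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
            System.out.println(nfc);            // "C\u00F3mo": the combining mark merges
                                                // into the precomposed ó, so the 2.4
                                                // tokenizer no longer splits at \u0301
        }
    }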

Re: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-21 Thread Robert Muir
\u006F - C o m o > It's splitting at the \u0301. >> worst case scenario, you could probably use the StandardTokenizer from >> 2.3.2 with the rest of the 2.4 code. > We've thought of that, but it would be the last thing we did to get it back to > working.

RE: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-20 Thread Philip Puffinburger
Here are the characters that are going through: \u0043 \u006F \u0301 \u006D \u006F - C o m o It's splitting at the \u0301. > worst case scenario, you could probably use the StandardTokenizer from > 2.3.2 with the rest of the 2.4 code. We've thought of that, but it would be the last thing we

Re: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-20 Thread Chris Hostetter
you could probably use the StandardTokenizer from 2.3.2 with the rest of the 2.4 code. This will show you exactly what changed... svn diff http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex http://svn.apache.org/re

Re: Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!

2009-02-20 Thread Robert Muir
, Grant Ingersoll wrote: > It's been a few years since I've worked on Arabic, but it sounds > reasonable. Care to submit a patch with unit tests showing the > StandardTokenizer properly handling all Arabic characters? > http://wiki.apache.org/lucene-java/HowToContribute

Re: Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!

2009-02-20 Thread Grant Ingersoll
It's been a few years since I've worked on Arabic, but it sounds reasonable. Care to submit a patch with unit tests showing the StandardTokenizer properly handling all Arabic characters? http://wiki.apache.org/lucene-java/HowToContribute On Feb 20, 2009, at 6:22 AM, Yusuf Aaji w

Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!

2009-02-20 Thread Yusuf Aaji
the same way the StandardTokenizer does. Also, the problem with the StandardTokenizer is that it fails to handle Arabic diacritics right, so it splits words which shouldn't be split. The Arabic diacritics are (as listed in the class org.apache.lucene.analysis.ar.ArabicNormalizer): FAT

RE: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-19 Thread Philip Puffinburger
Actually, WhitespaceTokenizer won't work. Too many person names and it won't do anything with punctuation. Something had to have changed in StandardTokenizer, and we need some of the 2.4 fixes/features, so we are kind of stuck. -Original Message- From: Philip Pu

2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-16 Thread Philip Puffinburger
We have our own Analyzer which has the following: public final TokenStream tokenStream(String fieldname, Reader reader) { TokenStream result = new StandardTokenizer(reader); result = new StandardFilter(result); result = new MyAccentFilter(result); result = new LowerCaseFilter(result

Re: StandardTokenizer and Korean grouping with alphanum

2008-09-22 Thread Daniel Noll
Steven A Rowe wrote: Korean has been treated differently from Chinese and Japanese since LUCENE-461. The grouping of Hangul with digits was introduced in this issue. Certainly I found LUCENE-461 during my search, and certainly grouping togeth

RE: StandardTokenizer and Korean grouping with alphanum

2008-09-22 Thread Steven A Rowe
Hi Daniel, On 09/22/2008 at 12:49 AM, Daniel Noll wrote: > I have a question about Korean tokenisation. Currently there > is a rule in StandardTokenizerImpl.jflex which looks like this: > ALPHANUM = ({LETTER}|{DIGIT}|{KOREAN})+ LUCENE-1126

StandardTokenizer and Korean grouping with alphanum

2008-09-21 Thread Daniel Noll
Hi all. I have a question about Korean tokenisation. Currently there is a rule in StandardTokenizerImpl.jflex which looks like this: ALPHANUM = ({LETTER}|{DIGIT}|{KOREAN})+ I'm wondering if there was some good reason why it isn't: ALPHANUM = (({LETTER}|{DIGIT})+|{KOREAN}+) Basically I'

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-26 Thread Stanislaw Osinski
> If anyone is interested, I could prepare a JFlex-based Analyzer equivalent > (to the extent possible) to the current StandardAnalyzer, which might offer nice > indexing and highlighting speed-ups. +1. I think a lot of people would be interested in a faster StandardAnalyzer. I've attached a

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Stanislaw Osinski
On 25/07/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 7/25/07, Stanislaw Osinski <[EMAIL PROTECTED]> wrote: > JavaCC is slow indeed. JavaCC is a very fast parser for a large document... the issue is small fields and JavaCC's use of an exception for flow control at the end of a value. As JVMs

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Yonik Seeley
On 7/25/07, Stanislaw Osinski <[EMAIL PROTECTED]> wrote: JavaCC is slow indeed. JavaCC is a very fast parser for a large document... the issue is small fields and JavaCC's use of an exception for flow control at the end of a value. As JVMs have advanced, exception-as-control-flow has gotten com

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Grant Ingersoll
On Jul 25, 2007, at 7:19 AM, Stanislaw Osinski wrote: Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really limited by JavaCC speed. You cannot shave much more performance out of the grammar as it is already about as simple as it gets. JavaCC is slow indeed. We used it for

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Stanislaw Osinski
I am sure a faster StandardAnalyzer would be greatly appreciated. I'm increasing the priority of that task then :) StandardAnalyzer appears widely used and horrendously slow. Even better would be a StandardAnalyzer that could have different recognizers enabled/disabled. For example, dropping

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Mark Miller
I would be very interested. I have been playing around with Antlr to see if it is any faster than JavaCC, but haven't seen great gains in my simple tests. I had not considered trying JFlex. I am sure a faster StandardAnalyzer would be greatly appreciated. StandardAnalyzer appears widely used a

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Stanislaw Osinski
Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really limited by JavaCC speed. You cannot shave much more performance out of the grammar as it is already about as simple as it gets. JavaCC is slow indeed. We used it for a while for Carrot2, but then (3 years ago :) switched to JF
