Re: StandardTokenizer#setMaxTokenLength

2015-07-20 Thread Steve Rowe
Hi Piotr, The behavior you mention is an intentional change from the behavior in Lucene 4.9.0 and earlier, when tokens longer than maxTokenLength were silently ignored: see LUCENE-5897[1] and LUCENE-5400[2]. The new behavior is as follows: Token matching rules are no longer allowed to match aga
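A minimal sketch of how the limit is typically configured on the tokenizer directly (written against the 5.x API, where the Reader is supplied via setReader; in 4.x it is passed to the constructor; the limit and input text here are arbitrary):

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class MaxTokenLengthDemo {
        public static void main(String[] args) throws IOException {
            StandardTokenizer tokenizer = new StandardTokenizer();
            tokenizer.setMaxTokenLength(5);              // arbitrary limit for illustration
            tokenizer.setReader(new StringReader("short verylongtoken"));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.println(term);                // prints each emitted token
            }
            tokenizer.end();
            tokenizer.close();
        }
    }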

Re: StandardTokenizer#setMaxTokenLength

2015-07-20 Thread Piotr Idzikowski
Hello. Btw, I think ClassicAnalyzer has the same problem. Regards On Fri, Jul 17, 2015 at 4:40 PM, Steve Rowe wrote: > Hi Piotr, > > Thanks for reporting! > > See https://issues.apache.org/jira/browse/LUCENE-6682 > > Steve > www.lucidworks.com > > > On Jul 16, 2015, at 4:47 AM, Piotr Idzikowski

Re: StandardTokenizer#setMaxTokenLength

2015-07-20 Thread Piotr Idzikowski
I should add that this is Lucene 4.10.4. But I have checked it on the 5.2.1 version and got the same result. Regards, Piotr On Mon, Jul 20, 2015 at 9:44 AM, Piotr Idzikowski wrote: > Hello Steve, > It is always a pleasure to help you develop such a great lib. > Talking about StandardTokenize

Re: StandardTokenizer#setMaxTokenLength

2015-07-20 Thread Piotr Idzikowski
Hello Steve, It is always a pleasure to help you develop such a great lib. Talking about StandardTokenizer and setMaxTokenLength, I think I have found another problem. It looks like when the word is longer than the max length, the analyzer adds two tokens -> word.substring(0,maxLength) and word.substring(maxL
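A quick way to reproduce what Piotr describes, sketched against the 5.x analyzer API (the word and the 10-character limit are arbitrary choices):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class OverlongTokenCheck {
        public static void main(String[] args) throws IOException {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            analyzer.setMaxTokenLength(10);
            // Per the report, a 20-character word comes back as two tokens:
            // the first 10 characters and the remainder.
            TokenStream ts = analyzer.tokenStream("f", "internationalization");
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term);
            }
            ts.end();
            ts.close();
            analyzer.close();
        }
    }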

Re: StandardTokenizer#setMaxTokenLength

2015-07-17 Thread Steve Rowe
Hi Piotr, Thanks for reporting! See https://issues.apache.org/jira/browse/LUCENE-6682 Steve www.lucidworks.com > On Jul 16, 2015, at 4:47 AM, Piotr Idzikowski > wrote: > > Hello. > I am developing own analyzer based on StandardAnalyzer. > I realized that tokenizer.setMaxTokenLength is called

RE: StandardTokenizer generation from JFlex grammar

2012-10-04 Thread vempap
Thanks Steve for the pointers. I'll look into it.

RE: StandardTokenizer generation from JFlex grammar

2012-10-04 Thread Steven A Rowe
Hi Phani, Assuming you're using Lucene 3.6.X, see: and

RE: StandardTokenizer and split tokens

2012-06-24 Thread Uwe Schindler
> -Original Message- > From: Mansour Al Akeel [mailto:mansour.alak...@gmail.com] > Sent: Saturday, June 23, 2012 11:21 PM > To: java-user@lucene.apache.org > Subject: Re: StandardTokenizer and split tokens > > Uwe, > thank you for the advice. I updated my code. > > >

Re: StandardTokenizer and split tokens

2012-06-23 Thread Mansour Al Akeel
Uwe, thank you for the advice. I updated my code. On Sat, Jun 23, 2012 at 3:15 AM, Uwe Schindler wrote: >> I found the main issue. >> I was using BytesRef without the length. This fixed the problem. >> >> String word = new String(ref.bytes, ref.offset, ref.length); > > Pleas

RE: StandardTokenizer and split tokens

2012-06-23 Thread Uwe Schindler
> I found the main issue. > I was using BytesRef without the length. This fixed the problem. > > String word = new String(ref.bytes, ref.offset, ref.length); Please see my other mail: using no character set here is the second problem with your code; this is the correct way to do it:

RE: StandardTokenizer and split tokens

2012-06-23 Thread Uwe Schindler
Don't ever do this: String word = new String(ref.bytes); This has the following problems: - it ignores the character set!!! (in general: never ever use new String(byte[]) without specifying the 2nd charset parameter!). byte[] != String. Depending on the default charset on your computer this would return bul
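A minimal sketch of the charset-safe conversion Uwe is describing, assuming ref is an org.apache.lucene.util.BytesRef holding UTF-8 bytes (StandardCharsets needs Java 7+; on older JVMs use Charset.forName("UTF-8")):

    import java.nio.charset.StandardCharsets;
    import org.apache.lucene.util.BytesRef;

    public class BytesRefToString {
        // Honors offset, length and an explicit charset instead of the platform default.
        static String toString(BytesRef ref) {
            return new String(ref.bytes, ref.offset, ref.length, StandardCharsets.UTF_8);
            // BytesRef also offers ref.utf8ToString() for exactly this case.
        }
    }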

Re: StandardTokenizer and split tokens

2012-06-22 Thread Mansour Al Akeel
I found the main issue. I was using BytesRef without the length. This fixed the problem. String word = new String(ref.bytes, ref.offset, ref.length); Thank you. On Fri, Jun 22, 2012 at 6:26 PM, Mansour Al Akeel wrote: > Hello all, > > I am trying to write a simple autosugg

Re: StandardTokenizer

2011-09-30 Thread Peyman Faratin
thank you Ian On Sep 30, 2011, at 4:19 AM, Ian Lea wrote: > This all changed with the 3.1 release. See > http://lucene.apache.org/java/3_1_0/changes/Changes.html#3.1.0.api_changes, > number 18. > > You can get the old behaviour with StandardAnalyzer by passing > VERSION_30, or you could look at

Re: StandardTokenizer

2011-09-30 Thread Ian Lea
This all changed with the 3.1 release. See http://lucene.apache.org/java/3_1_0/changes/Changes.html#3.1.0.api_changes, number 18. You can get the old behaviour with StandardAnalyzer by passing VERSION_30, or you could look at UAX29URLEmailTokenizer which should pick up the email component, althou
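A sketch of the version-keyed construction Ian mentions, against the 3.1-era API (the field name and sample text are placeholders):

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class LegacyStandardAnalyzerDemo {
        public static void main(String[] args) throws IOException {
            // Passing LUCENE_30 (or earlier) keeps the pre-3.1 tokenization behaviour.
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
            TokenStream ts = analyzer.tokenStream("f", new StringReader("mail me at foo@example.com"));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term);
            }
            ts.end();
            ts.close();
        }
    }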

Re: StandardTokenizer issue ?

2009-03-15 Thread Paul Cowan
iMe wrote: This analyzer uses the StandardTokenizer, whose javadoc states: Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split. But looking at my index with Luke, I saw that my product reference AB-CD-

Re: StandardTokenizer issue ?

2009-03-13 Thread iMe
Grant Ingersoll-6 wrote: > > That does sound like an issue. Can you open a JIRA issue for it? > I don't know how to do that... Could somebody do it for me ? Thank you

Re: StandardTokenizer issue ?

2009-03-13 Thread Grant Ingersoll
That does sound like an issue. Can you open a JIRA issue for it? Thanks, Grant On Mar 12, 2009, at 5:55 AM, iMe wrote: I spotted an unexpected behavior when using the StandardAnalyzer. This analyzer uses the StandardTokenizer, whose javadoc states: Splits words at hyphens, unless there's
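A quick way to see exactly which tokens StandardAnalyzer produces for a hyphenated reference, sketched against the pre-2.9 TokenStream API of the time ("AB-12-CD" is a made-up code; substitute the real product reference):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class HyphenCheck {
        public static void main(String[] args) throws Exception {
            TokenStream ts = new StandardAnalyzer().tokenStream("f", new StringReader("AB-12-CD"));
            Token t;
            while ((t = ts.next()) != null) {
                // Print each token with its type (e.g. ALPHANUM, NUM, HOST, ...)
                System.out.println(t.termText() + "  (" + t.type() + ")");
            }
            ts.close();
        }
    }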

Re: StandardTokenizer and Korean grouping with alphanum

2008-09-22 Thread Daniel Noll
Steven A Rowe wrote: Korean has been treated differently from Chinese and Japanese since LUCENE-461. The grouping of Hangul with digits was introduced in this issue. Certainly I found LUCENE-461 during my search, and certainly grouping togeth

RE: StandardTokenizer and Korean grouping with alphanum

2008-09-22 Thread Steven A Rowe
Hi Daniel, On 09/22/2008 at 12:49 AM, Daniel Noll wrote: > I have a question about Korean tokenisation. Currently there > is a rule in StandardTokenizerImpl.jflex which looks like this: > > ALPHANUM = ({LETTER}|{DIGIT}|{KOREAN})+ LUCENE-1126

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-26 Thread Stanislaw Osinski
> If anyone is interested, I could prepare a JFlex-based Analyzer equivalent (to the extent possible) to the current StandardAnalyzer, which might offer nice indexing and highlighting speed-ups. +1. I think a lot of people would be interested in a faster StandardAnalyzer. I've attached a

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Stanislaw Osinski
On 25/07/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 7/25/07, Stanislaw Osinski <[EMAIL PROTECTED]> wrote: > JavaCC is slow indeed. JavaCC is a very fast parser for a large document... the issue is small fields and JavaCC's use of an exception for flow control at the end of a value. As JVMs

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Yonik Seeley
On 7/25/07, Stanislaw Osinski <[EMAIL PROTECTED]> wrote: JavaCC is slow indeed. JavaCC is a very fast parser for a large document... the issue is small fields and JavaCC's use of an exception for flow control at the end of a value. As JVMs have advanced, exception-as-control-flow has gotten com

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Grant Ingersoll
On Jul 25, 2007, at 7:19 AM, Stanislaw Osinski wrote: Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really limited by JavaCC speed. You cannot shave much more performance out of the grammar as it is already about as simple as it gets. JavaCC is slow indeed. We used it for

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Stanislaw Osinski
I am sure a faster StandardAnalyzer would be greatly appreciated. I'm increasing the priority of that task then :) StandardAnalyzer appears widely used and horrendously slow. Even better would be a StandardAnalyzer that could have different recognizers enabled/disabled. For example, dropping

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Mark Miller
I would be very interested. I have been playing around with Antlr to see if it is any faster than JavaCC, but haven't seen great gains in my simple tests. I had not considered trying JFlex. I am sure a faster StandardAnalyzer would be greatly appreciated. StandardAnalyzer appears widely used a

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-25 Thread Stanislaw Osinski
Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really limited by JavaCC speed. You cannot shave much more performance out of the grammar as it is already about as simple as it gets. JavaCC is slow indeed. We used it for a while for Carrot2, but then (3 years ago :) switched to JF

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-19 Thread Michael Stoppelman
On 7/19/07, Mark Miller <[EMAIL PROTECTED]> wrote: I think it goes without saying that a semi-complex NFA or DFA is going to be quite a bit slower than, say, breaking on whitespace. Not that I am against such a warning. This is true for those very familiar with the code base and the Tokenizer s

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-19 Thread Mark Miller
I think it goes without saying that a semi-complex NFA or DFA is going to be quite a bit slower than, say, breaking on whitespace. Not that I am against such a warning. To support my point about writing a custom solution that is more exactly tailored to your needs: If you just remove the recognizer i

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-18 Thread Michael Stoppelman
Might be nice to add a line of documentation to the highlighter on the possible performance hit if one uses StandardAnalyzer, which is probably a common case. Thanks for the speedy response. -M On 7/18/07, Mark Miller <[EMAIL PROTECTED]> wrote: Unfortunately, StandardAnalyzer is slow. StandardA

Re: StandardTokenizer is slowing down highlighting a lot

2007-07-18 Thread Mark Miller
Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really limited by JavaCC speed. You cannot shave much more performance out of the grammar as it is already about as simple as it gets. You should first see if you can get away without it and use a different Analyzer, or if you can re-
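If the text being highlighted doesn't need StandardTokenizer's recognizers, a deliberately simple analyzer along these lines (sketched against the 2.x Analyzer API; the class name is made up) avoids the JavaCC grammar entirely:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    public class SimpleFastAnalyzer extends Analyzer {
        // Whitespace split plus lowercasing: no product-number, host or acronym recognition.
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new LowerCaseFilter(new WhitespaceTokenizer(reader));
        }
    }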

Re: StandardTokenizer throws extra exceptions

2005-10-31 Thread Rob Young
Roxana Angheluta wrote: I had the same problem. I solved it by manually editing the file ParseException.java every time I modify the .jj file: import java.io.*; public class ParseException extends IOException { It's not the most elegant way to do it; I'm also interested in a more scalable
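The manual edit Roxana describes amounts to changing the header of the JavaCC-generated file so it satisfies the IOException contract of TokenStream.next(); roughly (class body left exactly as generated):

    // ParseException.java, regenerated by JavaCC -- re-apply this edit after each regeneration
    import java.io.IOException;

    public class ParseException extends IOException {
        // ... generated constructors and fields unchanged ...
    }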

Re: StandardTokenizer throws extra exceptions

2005-10-31 Thread Roxana Angheluta
Rob Young wrote: Hi, I'm trying to create another, slightly changed, version of StandardAnalyzer. I've copied out the source, edited the .jj file and re-built the StandardTokenizer class. The problem I am facing is, when I have all this in Eclipse it's telling me that the ParseException is

Re: StandardTokenizer

2005-09-27 Thread Yonik Seeley
I'd write a TokenFilter for that... much easier. -Yonik Now hiring -- http://tinyurl.com/7m67g On 9/27/05, Lorenzo Viscanti <[EMAIL PROTECTED]> wrote: > > Hi, I'm trying to modify the StandardTokenizer, in order to get a > good tokenization for my needs. > Basically I would like to separat
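A skeleton of the kind of TokenFilter Yonik means, written against the pre-2.9 API in use at the time (the class name and the transformation are placeholders):

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class MyTokenFilter extends TokenFilter {
        public MyTokenFilter(TokenStream input) {
            super(input);
        }

        // Post-process StandardTokenizer's output here instead of editing its grammar.
        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) return null;
            // To change the text, build a new Token(newText, t.startOffset(), t.endOffset(), t.type()).
            return t;
        }
    }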

Re: standardTokenizer - how to terminate at End of Stream

2005-09-21 Thread Beady Geraghty
Thank you for your response. That was my original goal. On 9/21/05, Chris Hostetter <[EMAIL PROTECTED]> wrote: > > > : Since I used the StandardAnalyzer when I originally created the index, > : I therefore use the StandardTokenizer to tokenize the input stream. > : Is there a better way to do what I

Re: standardTokenizer - how to terminate at End of Stream

2005-09-21 Thread Chris Hostetter
: Since I used the StandardAnalyzer when I originally created the index, : I therefore use the StandardTokenizer to tokenize the input stream. : Is there a better way to do what I am trying to do? : From your comment below, it appears that I should just use next() instead if your goal is to recreate
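Recreating the index-time tokenization with next(), as suggested, might look like this against the API of that era (the printed offsets are what you would use to pull context out of the file):

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class ContextScan {
        // Run the same tokenizer StandardAnalyzer uses; note the analyzer additionally
        // lowercases and removes stop words via filters downstream.
        static void printTokens(Reader reader) throws IOException {
            StandardTokenizer tokenizer = new StandardTokenizer(reader);
            for (Token t = tokenizer.next(); t != null; t = tokenizer.next()) {
                System.out.println(t.termText() + " @ " + t.startOffset() + "-" + t.endOffset());
            }
            tokenizer.close();
        }
    }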

Re: standardTokenizer - how to terminate at End of Stream

2005-09-21 Thread Beady Geraghty
Thank you for the response. I was trying to do something really simple - I want to extract the context for terms and phrases from files that satisfy some (many) queries. I *know* that file test.txt is a hit (because I queried the index, and it tells me that test.txt satisfies the query). Then, I o

Re: standardTokenizer - how to terminate at End of Stream

2005-09-21 Thread Erik Hatcher
Could you elaborate on what you're trying to do, please? Using StandardTokenizer in this low-level fashion is practically unheard of, so I think knowing what you're attempting to do will help us help you :) Erik On Sep 21, 2005, at 12:17 PM, Beady Geraghty wrote: I see some definiti

Re: standardTokenizer - how to terminate at End of Stream

2005-09-21 Thread Beady Geraghty
I see some definitions in StandardTokenizerConstants.java. Perhaps these are the values for t.kind. Perhaps I was confused between the usage of getNextToken() and next() in the standard analyzer. When should one use getNextToken() instead of next()? I am just starting to use Lucene, so ple