Mark, these are similar to my concerns about us adding Unicode 4.0 (supplementary characters, etc.) support in 3.1. This is why I left a comment on LUCENE-1689. I'm pretty confused about what approach we should take because, technically, fixing this will break things.
And again, I do believe we should have moved everything to Unicode 4.0 for Lucene 3.0, since it's the Unicode version of Java 5. It's too late for that now, but I definitely don't want to cause problems for 3.1. Right now, though, it looks unavoidable.

On Mon, Nov 16, 2009 at 3:16 PM, Mark Miller <markrmil...@gmail.com> wrote:

> This is a big deal, whether it's JDK or Lucene related. We are forcing
> those on 1.4 to move to 1.5 - any problems you face with that with the
> JDK are Lucene problems if they affect Lucene. We need big clear
> warnings about this - we should have had them before we pushed users
> to 1.5 as well, if I am reading right.
>
> If it matters what JVM runs jflex, that is also a big deal. Even if it
> hasn't been regenerated yet, it likely will be before long. We will
> break then? Perhaps it's better to break now?
>
> I've only read through this thread quickly, but to me, this is all a big
> deal. Think of it from a user perspective. It's not okay to just say,
> well, this stuff screws up Lucene, but it's just because the user is
> switching from 1.4 to 1.5 - that's not our concern - they should know the
> consequences - I think that is our concern - very much so.
>
> Robert Muir wrote:
> > i suppose we are ok then, except for the fact that now
> > StandardTokenizer is working with a Unicode 3.0 definition, instead of
> > the Unicode version (4.0) that corresponds to our required minimum JRE
> > (1.5)...
> >
> > sorry if i raised a stink about nothing, but you see my concerns maybe?
> >
> > On Mon, Nov 16, 2009 at 3:01 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> >
> > JFlex was not regenerated as far as I know, but if somebody did, it's already broken…
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> > ------------------------------------------------------------------------
> >
> > From: Robert Muir [mailto:rcm...@gmail.com]
> > Sent: Monday, November 16, 2009 8:53 PM
> > To: java-dev@lucene.apache.org
> > Subject: Re: Why release 3.0?
> >
> > btw, so here's a great example. you are backwards broken regardless of JVM for StandardTokenizer, because we used a 1.4 JRE to run jflex in 2.9, but 1.5 in 3.0, right?
> >
> > On Mon, Nov 16, 2009 at 2:51 PM, Robert Muir <rcm...@gmail.com> wrote:
> >
> > Uwe, that's probably a good solution I think, just as long as we document it somewhere.
> > I think there is some warning verbiage in StandardTokenizer already about this:
> >
> > NOTE: if you change StandardTokenizerImpl.jflex and need to regenerate
> > the tokenizer, remember to use JRE 1.4 to run jflex (before
> > Lucene 3.0). This grammar now uses constructs (eg :digit:,
> > :letter:) whose meaning can vary according to the JRE used to
> > run jflex. See
> > https://issues.apache.org/jira/browse/LUCENE-1126 for details.
> >
> > On Mon, Nov 16, 2009 at 2:50 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> >
> > But it is a general warning that should be placed in the Wiki: If you upgrade from Java 1.4 to Java 5, think about reindexing.
> >
> > It has definitely nothing to do with 3.0, because users could have changed (and most of them have) before.
> >
> > -----
> > Uwe Schindler
> >
> > ------------------------------------------------------------------------
> >
> > From: Robert Muir [mailto:rcm...@gmail.com]
> > Sent: Monday, November 16, 2009 8:45 PM
> > To: java-dev@lucene.apache.org
> > Subject: Re: Why release 3.0?
> >
> > right, my point is it's true it's nothing to do with Lucene at all, really.
> >
> > but the reality is we should clarify this to users I think.
> >
> > It's especially complex in the current StandardTokenizer, which uses a mix of hardcoded ranges and properties. can you tell me if you should reindex for a given language X?
> > I wouldn't want to answer that question right now.
> >
> > On Mon, Nov 16, 2009 at 2:42 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> >
> > We tried out Character.getType() for these two chars:
> >
> > Java 5:
> > '\u00AD' = 16
> > '\u06DD' = 16
> >
> > Java 1.4:
> > '\u00AD' = 20
> > '\u06DD' = 7
> >
> > The first is the soft hyphen.
> >
> > -----
> > Uwe Schindler
> >
> > ------------------------------------------------------------------------
> >
> > From: Robert Muir [mailto:rcm...@gmail.com]
> > Sent: Monday, November 16, 2009 8:37 PM
> > To: java-dev@lucene.apache.org
> > Subject: Re: Why release 3.0?
> >
> > right, it's nothing to do with lucene, instead due to property changes, etc.
> >
> > i just think we should inform users on java 1.4/2.9 that if they upgrade to java 1.5/3.0, they should reindex.
> >
> > the reason i say this about properties is that some of the changes will affect tokenizers. i give two examples: a hyphen that changes from punctuation to format (might affect SolrWordDelimiterFilter),
> > and arabic ayah which changes from NSM to format, which surely affects ArabicLetterTokenizer.
> >
> > On Mon, Nov 16, 2009 at 2:33 PM, Steven A Rowe <sar...@syr.edu> wrote:
> >
> > Hi Robert,
> >
> > I agree that the Unicode version supported by the JVM, as you say, really has nothing to do with Lucene.
> >
> > The disruption here is users' upgrading from Java 1.4 to 1.5+, not when they upgrade Lucene. I'd guess with few exceptions that most people have been using Lucene with 1.5+ for a couple of years now, though.
> >
> > But even the upgrade from Java 1.4 to 1.5+ will have (had) zero impact on most Lucene users, assuming that most use Latin-1 exclusively; although I haven't looked, I'd be surprised if Latin-1 characters changed much, if at all, from Unicode 3.0 to 4.0.
> >
> > It would be useful, I think, to include (a pointer to?) a description of the details of the Unicode 3.0->4.0 differences in the Lucene 3.0 release notes, since the minimum required Java version, and so also the supported Unicode version, changes then.
> >
> > Steve
> >
> > On 11/16/2009 at 2:15 PM, Robert Muir wrote:
> > > the problem is that the properties have changed for various characters,
> > > and new characters were added.
> > >
> > > it really has nothing to do with lucene, but the idea you can go from
> > > jdk 1.4/lucene 2.9 to jdk 1.5/lucene3.0 without reindexing is not true.
> > >
> > > On Mon, Nov 16, 2009 at 2:12 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> > >
> > > But a UTF-8 stream from Java 4 can still be read with Java 5, what is the problem?
> > > Java 5 extended Unicode support, but an index created with older versions can still be read. UTF-8 is standardized…
> > >
> > > -----
> > > Uwe Schindler
> > >
> > > ________________________________
> > >
> > > From: Robert Muir [mailto:rcm...@gmail.com]
> > > Sent: Monday, November 16, 2009 8:09 PM
> > > To: java-dev@lucene.apache.org
> > > Subject: Re: Why release 3.0?
> > >
> > > uwe, on topic: please read my comment on LUCENE-1689. because the unicode version was bumped in jdk 1.5, i believe this index backwards compatibility is only theoretical.
> > >
> > > On Mon, Nov 16, 2009 at 2:05 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> > >
> > > 2.9 does *not* have the same format as 3.0; an index created with 3.0 cannot be read with 2.9. This is because compressed field support was removed and therefore the version number of the stored fields file was upgraded. But indexes from 2.9 can be read with 3.0, and support may get removed in 4.0. 3.0 indexes can be read until version 4.9.
> > >
> > > Uwe
> > >
> > > -----
> > > Uwe Schindler
> > >
> > > ________________________________
> > >
> > > From: Jake Mannix [mailto:jake.man...@gmail.com]
> > > Sent: Monday, November 16, 2009 7:15 PM
> > > To: java-dev@lucene.apache.org
> > > Subject: Re: Why release 3.0?
> > >
> > > Don't users need to upgrade to 3.0 because 3.1 won't necessarily be able to read your 2.4 index file formats?
> > > I suppose if you've already upgraded to 2.9, then all is well because 2.9 is the same format as 3.0, but we can't assume all users upgraded from 2.4 to 2.9.
> > >
> > > If you've done that already, then 3.0 might not be necessary, but if you're on 2.4 right now, you will be in for a bad surprise if you try to upgrade to 3.1.
> > >
> > > -jake
> > >
> > > On Mon, Nov 16, 2009 at 10:10 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> > >
> > > One of my "specialties" is asking obvious questions just to see
> > > if everyone's assumptions are aligned. So with the discussion about
> > > branching 3.0 I have to ask "Is there going to be any 3.0 release
> > > intended for *production*?". And if not, would we save a lot of
> > > work by just not worrying about retrofitting fixes to a 3.0 branch
> > > and carrying on with 3.1 as the first *supported* 3.x release?
> > >
> > > Since 3.0 is "upgrade-to-java5 and remove deprecations", I'm not
> > > sure *as a user* I see a good reason to upgrade to 3.0. Getting a
> > > "beta/snapshot" release to get a head start on cleaning up my code
> > > does seem worthwhile, if I have the spare time. And having a base
> > > 3.0 version that's not changing all over the place would be useful
> > > for that.
> > >
> > > That said, I'm also not terribly comfortable with a "release"
> > > that's out there and unsupported.
> > >
> > > Apologies if this has already been discussed, but I don't
> > > remember it. Although my memory isn't what it used to be (but
> > > some would claim it never was<G>)...
> > >
> > > Erick
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
>
> --
> - Mark
>
> http://www.lucidimagination.com

--
Robert Muir
rcm...@gmail.com
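
For anyone who wants to repeat the Character.getType() comparison quoted in this thread on their own JVM, here is a minimal sketch. The class name GeneralCategoryCheck and the helpers categoryName() and describe() are made up for illustration; only Character.getType() and the Character category constants come from the JDK. Going by the numbers quoted in the thread, a Java 5 or later runtime should report 16 (FORMAT) for both characters, while a 1.4 runtime reported 20 and 7.

```java
public class GeneralCategoryCheck {

    // Hypothetical helper: names only the general categories that come up
    // in this thread; anything else is reported with its numeric value.
    static String categoryName(int type) {
        switch (type) {
            case Character.FORMAT:           return "FORMAT";
            case Character.DASH_PUNCTUATION: return "DASH_PUNCTUATION";
            case Character.ENCLOSING_MARK:   return "ENCLOSING_MARK";
            case Character.NON_SPACING_MARK: return "NON_SPACING_MARK";
            default:                         return "OTHER(" + type + ")";
        }
    }

    // Hypothetical helper: formats one character as "U+XXXX = <type> (<name>)".
    static String describe(char c) {
        int type = Character.getType(c);
        return String.format("U+%04X", (int) c) + " = " + type
                + " (" + categoryName(type) + ")";
    }

    public static void main(String[] args) {
        // The result depends on the Unicode tables shipped with the running JRE.
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println(describe('\u00AD')); // soft hyphen
        System.out.println(describe('\u06DD')); // Arabic end of ayah
    }
}
```

Running the same class under two different JREs and comparing the output is a cheap way to check whether a particular character's category moved between Unicode versions. It does not say whether a given tokenizer is affected, since that depends on how the tokenizer uses the property.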