Re: Why release 3.0?

Robert Muir Mon, 16 Nov 2009 12:26:03 -0800

Mark, these are similar to my concerns about us adding Unicode 4.0 support
(supplementary characters, etc.) in 3.1.
This is why I left a comment on LUCENE-1689. I'm pretty confused about what
approach we should take because, technically, fixing this will break things.
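
For anyone unfamiliar with the supplementary-character issue, here is a minimal Java sketch. It is not Lucene code, and the class and method names are made up for illustration. It assumes Java 5 or later and shows how a char-by-char loop misjudges a character outside the BMP, while a code-point loop does not:

```java
public class SupplementaryDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Checks letters one char at a time, so surrogate halves are tested alone.
    static boolean allLettersByChar(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLetter(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Checks letters one code point at a time (APIs added in Java 5).
    static boolean allLettersByCodePoint(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isLetter(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // U+10400 is a Deseret capital letter, encoded as a surrogate pair.
        String s = new String(Character.toChars(0x10400));
        System.out.println("code units:    " + codeUnits(s));             // 2
        System.out.println("code points:   " + codePoints(s));            // 1
        System.out.println("by char:       " + allLettersByChar(s));      // false
        System.out.println("by code point: " + allLettersByCodePoint(s)); // true
    }
}
```

Any tokenizer that walks text with charAt sees the two surrogate halves instead of the letter, which is the kind of behavior a fix would change.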


And again, I do believe we should have moved everything to Unicode 4.0 in
Lucene 3.0, since it's the Unicode version of Java 5.
It's too late for that now, but I definitely don't want to cause problems for
3.1. Right now, though, it looks unavoidable.
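
The category changes reported in the quoted thread below can be checked on whatever JRE you are running. This is a small sketch, assuming Java 5 or later; the class name is hypothetical. The constants are from java.lang.Character, where 16 is FORMAT, 20 is DASH_PUNCTUATION and 7 is ENCLOSING_MARK:

```java
import java.util.Locale;

public class CategoryCheck {
    // Maps the general-category values seen in this thread to their names.
    static String describe(int type) {
        switch (type) {
            case Character.FORMAT:
                return "FORMAT";
            case Character.DASH_PUNCTUATION:
                return "DASH_PUNCTUATION";
            case Character.ENCLOSING_MARK:
                return "ENCLOSING_MARK";
            case Character.NON_SPACING_MARK:
                return "NON_SPACING_MARK";
            default:
                return "other";
        }
    }

    // Formats one line in the same "code point = type" style as the thread.
    static String report(int codePoint) {
        int type = Character.getType(codePoint);
        return String.format(Locale.ROOT, "U+%04X = %d (%s)",
                codePoint, type, describe(type));
    }

    public static void main(String[] args) {
        System.out.println(report(0x00AD)); // soft hyphen
        System.out.println(report(0x06DD)); // Arabic end of ayah
    }
}
```

On Java 5 and later both lines should report FORMAT. The 20 and 7 values come from the Java 1.4 run quoted below and can only be reproduced on that JRE.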

On Mon, Nov 16, 2009 at 3:16 PM, Mark Miller <markrmil...@gmail.com> wrote:

> This is a big deal, whether it's JDK or Lucene related. We are forcing
> those on 1.4 to move to 1.5 - any problems you face with that with the
> JDK are Lucene problems if they affect Lucene. We need big clear
> warnings about this - we should have had them before we pushed users
> to 1.5 as well, if I am reading right.
>
> If it matters what JVM runs jflex, that is also a big deal. Even if it
> hasn't been regenerated yet, it likely will be before long. Will we
> break then? Perhaps it's better to break now?
>
> I've only read through this thread quickly, but to me, this is all a big
> deal. Think of it from a user perspective. It's not okay to just say,
> "well, this stuff screws up Lucene, but it's just because the user is
> switching from 1.4 to 1.5 - that's not our concern - they should know the
> consequences". I think that is our concern - very much so.
>
> Robert Muir wrote:
> > i suppose we are ok then, except for the fact that now
> > StandardTokenizer is working with a unicode 3.0 definition, instead of
> > the unicode version (4.0) that corresponds to our required minimum jre
> > (1.5)...
> >
> > sorry if i raised a stink about nothing, but you see my concerns maybe?
> >
> > On Mon, Nov 16, 2009 at 3:01 PM, Uwe Schindler <u...@thetaphi.de
> > <mailto:u...@thetaphi.de>> wrote:
> >
> >     JFlex was not regenerated as far as I know, but if somebody did,
> >     it's already broken…
> >
> >
> >
> >     -----
> >     Uwe Schindler
> >     H.-H.-Meier-Allee 63, D-28213 Bremen
> >     http://www.thetaphi.de
> >     eMail: u...@thetaphi.de <mailto:u...@thetaphi.de>
> >
> >
> ------------------------------------------------------------------------
> >
> >     *From:* Robert Muir [mailto:rcm...@gmail.com
> >     <mailto:rcm...@gmail.com>]
> >     *Sent:* Monday, November 16, 2009 8:53 PM
> >
> >     *To:* java-dev@lucene.apache.org <mailto:java-dev@lucene.apache.org>
> >     *Subject:* Re: Why release 3.0?
> >
> >
> >
> >     btw, so here's a great example. you are backwards broken regardless
> >     of JVM for StandardTokenizer, because we used 1.4 JRE to run jflex
> >     in 2.9, but 1.5 in 3.0, right?
> >
> >     On Mon, Nov 16, 2009 at 2:51 PM, Robert Muir <rcm...@gmail.com
> >     <mailto:rcm...@gmail.com>> wrote:
> >
> >     Uwe, that's probably a good solution, I think, just as long as we
> >     document it somewhere.
> >     I think there is some warning verbiage in StandardTokenizer already
> >     about this.
> >
> >     NOTE: if you change StandardTokenizerImpl.jflex and need to regenerate
> >           the tokenizer, remember to use JRE 1.4 to run jflex (before
> >           Lucene 3.0).  This grammar now uses constructs (eg :digit:,
> >           :letter:) whose meaning can vary according to the JRE used to
> >           run jflex.  See
> >           https://issues.apache.org/jira/browse/LUCENE-1126 for details.
> >
> >
> >
> >     On Mon, Nov 16, 2009 at 2:50 PM, Uwe Schindler <u...@thetaphi.de
> >     <mailto:u...@thetaphi.de>> wrote:
> >
> >     But it is a general warning that should be placed in the Wiki: If
> >     you upgrade from Java 1.4 to Java 5, think about reindexing.
> >
> >
> >
> >     It definitely has nothing to do with 3.0, because users could have
> >     changed Java versions (and most of them have) before.
> >
> >
> >
> ------------------------------------------------------------------------
> >
> >     *From:* Robert Muir [mailto:rcm...@gmail.com
> >     <mailto:rcm...@gmail.com>]
> >     *Sent:* Monday, November 16, 2009 8:45 PM
> >
> >
> >     *To:* java-dev@lucene.apache.org <mailto:java-dev@lucene.apache.org>
> >     *Subject:* Re: Why release 3.0?
> >
> >
> >
> >     right, my point is that it's true it has nothing to do with Lucene at
> >     all, really.
> >
> >     but the reality is we should clarify this to users, I think.
> >
> >     It's especially complex in the current StandardTokenizer, which
> >     uses a mix of hardcoded ranges and properties. Can you tell me if
> >     you should reindex for a given language X?
> >     I wouldn't want to answer that question right now.
> >
> >     On Mon, Nov 16, 2009 at 2:42 PM, Uwe Schindler <u...@thetaphi.de
> >     <mailto:u...@thetaphi.de>> wrote:
> >
> >     We tried out: Character.getType() for these two chars:
> >
> >
> >
> >     Java 5:
> >     '\u00AD' = 16
> >     '\u06DD' = 16
> >
> >     Java 1.4:
> >     '\u00AD' = 20
> >     '\u06DD' = 7
> >
> >
> >
> >     The first is the soft hyphen.
> >
> >
> >
> ------------------------------------------------------------------------
> >
> >     *From:* Robert Muir [mailto:rcm...@gmail.com
> >     <mailto:rcm...@gmail.com>]
> >     *Sent:* Monday, November 16, 2009 8:37 PM
> >
> >
> >     *To:* java-dev@lucene.apache.org <mailto:java-dev@lucene.apache.org>
> >     *Subject:* Re: Why release 3.0?
> >
> >
> >
> >     right, it has nothing to do with lucene; it is instead due to property
> >     changes, etc.
> >
> >     i just think we should inform users on java 1.4/2.9 that if they
> >     upgrade to java 1.5/3.0, they should reindex.
> >
> >     the reason i say this about properties is that some of the changes
> >     will affect tokenizers. i give two examples: a hyphen
> >     that changes from punctuation to format (might affect
> >     SolrWordDelimiterFilter),
> >     and arabic ayah, which changes from NSM to format, which surely
> >     affects ArabicLetterTokenizer.
> >
> >     On Mon, Nov 16, 2009 at 2:33 PM, Steven A Rowe <sar...@syr.edu
> >     <mailto:sar...@syr.edu>> wrote:
> >
> >     Hi Robert,
> >
> >     I agree that the Unicode version supported by the JVM, as you say,
> >     really has nothing to do with Lucene.
> >
> >     The disruption here is users' upgrading from Java 1.4 to 1.5+, not
> >     when they upgrade Lucene.  I'd guess with few exceptions that most
> >     people have been using Lucene with 1.5+ for a couple of years now,
> >     though.
> >
> >     But even the upgrade from Java 1.4 to 1.5+ will have (had) zero
> >     impact on most Lucene users, assuming that most use Latin-1
> >     exclusively; although I haven't looked, I'd be surprised if
> >     Latin-1 characters changed much, if at all, from Unicode 3.0 to 4.0.
> >
> >     It would be useful, I think, to include (a pointer to?) a
> >     description of the details of the Unicode 3.0->4.0 differences in
> >     the Lucene 3.0 release notes, since the minimum required Java
> >     version, and so also the supported Unicode version, changes then.
> >
> >     Steve
> >
> >
> >     On 11/16/2009 at 2:15 PM, Robert Muir wrote:
> >     > the problem is that the properties have changed for various
> >     > characters, and new characters were added.
> >     >
> >     > it really has nothing to do with lucene, but the idea that you can go
> >     > from jdk 1.4/lucene 2.9 to jdk 1.5/lucene 3.0 without reindexing is
> >     > not true.
> >     >
> >     >
> >     > On Mon, Nov 16, 2009 at 2:12 PM, Uwe Schindler <u...@thetaphi.de
> >     <mailto:u...@thetaphi.de>> wrote:
> >     >
> >     >
> >     >       But a UTF-8 stream from Java 1.4 can still be read with Java 5,
> >     > what is the problem? Java 5 extended Unicode support, but an index
> >     > created with older versions can still be read. UTF-8 is standardized…
> >     >
> >     >
> >     >
> >     >
> >     >
> >     > ________________________________
> >     >
> >     >
> >     >       From: Robert Muir [mailto:rcm...@gmail.com
> >     <mailto:rcm...@gmail.com>]
> >     >       Sent: Monday, November 16, 2009 8:09 PM
> >     >
> >     >       To: java-dev@lucene.apache.org
> >     <mailto:java-dev@lucene.apache.org>
> >     >       Subject: Re: Why release 3.0?
> >     >
> >     >
> >     >
> >     >       uwe, on topic, please read my comment on LUCENE-1689. because the
> >     > unicode version was bumped in jdk 1.5, i believe this index backwards
> >     > compatibility is only theoretical
> >     >
> >     >       On Mon, Nov 16, 2009 at 2:05 PM, Uwe Schindler
> >     <u...@thetaphi.de <mailto:u...@thetaphi.de>> wrote:
> >     >
> >     >       2.9 does *not* have the same format as 3.0: an index created with
> >     > 3.0 cannot be read with 2.9. This is because compressed field support
> >     > was removed and therefore the version number of the stored fields file
> >     > was upgraded. But indexes from 2.9 can be read with 3.0, and that
> >     > support may get removed in 4.0. 3.0 indexes can be read until version
> >     > 4.9.
> >     >
> >     >
> >     >
> >     >       Uwe
> >     >
> >     >
> >     >
> >     > ________________________________
> >     >
> >     >
> >     >       From: Jake Mannix [mailto:jake.man...@gmail.com
> >     <mailto:jake.man...@gmail.com>]
> >     >       Sent: Monday, November 16, 2009 7:15 PM
> >     >
> >     >
> >     >       To: java-dev@lucene.apache.org
> >     <mailto:java-dev@lucene.apache.org>
> >     >
> >     >       Subject: Re: Why release 3.0?
> >     >
> >     >
> >     >
> >     >       Don't users need to upgrade to 3.0 because 3.1 won't necessarily
> >     > be able to read your 2.4 index file formats?  I suppose if you've
> >     > already upgraded to 2.9, then all is well because 2.9 is the same
> >     > format as 3.0, but we can't assume all users upgraded from 2.4 to 2.9.
> >     >
> >     >       If you've done that already, then 3.0 might not be necessary,
> >     > but if you're on 2.4 right now, you will be in for a bad surprise if
> >     > you try to upgrade to 3.1.
> >     >
> >     >         -jake
> >     >
> >     >       On Mon, Nov 16, 2009 at 10:10 AM, Erick Erickson
> >     > <erickerick...@gmail.com <mailto:erickerick...@gmail.com>> wrote:
> >     >
> >     >       One of my "specialties" is asking obvious questions just to see
> >     > if everyone's assumptions are aligned. So with the discussion about
> >     > branching 3.0 I have to ask "Is there going to be any 3.0 release
> >     > intended for *production*?". And if not, would we save a lot of
> >     > work by just not worrying about retrofitting fixes to a 3.0 branch
> >     > and carrying on with 3.1 as the first *supported* 3.x release?
> >     >
> >     >       Since 3.0 is "upgrade-to-java5 and remove deprecations", I'm not
> >     > sure *as a user* I see a good reason to upgrade to 3.0. Getting a
> >     > "beta/snapshot" release to get a head start on cleaning up my code
> >     > does seem worthwhile, if I have the spare time. And having a base
> >     > 3.0 version that's not changing all over the place would be useful
> >     > for that.
> >     >
> >     >       That said, I'm also not terribly comfortable with a "release"
> >     > that's out there and unsupported.
> >     >
> >     >       Apologies if this has already been discussed, but I don't
> >     > remember it. Although my memory isn't what it used to be (but
> >     > some would claim it never was<G>)...
> >     >
> >     >       Erick
> >
> >
> >
> >
> >     --
> >     Robert Muir
> >     rcm...@gmail.com <mailto:rcm...@gmail.com>
>
>
> --
> - Mark
>
> http://www.lucidimagination.com


-- 
Robert Muir
rcm...@gmail.com