Re: Modularization

2009-04-10 Thread Grant Ingersoll
I'm really ambivalent about Maven.  Having just converted Mahout to
it, while using it for some other projects and having used it quite a bit
in the past, I am still on the fence (although I am mostly happy w/ it for
Mahout).  I keep being lured in by the promise of it (dep. management,
convention over configuration, POM, and most importantly, being able
to point IntelliJ at it and have it set up my project structures), but
then left hanging by the execution/bugginess of components.  For a
simpler project structure like Mahout, it has worked pretty well
except for the release stuff (which was a major pain and still isn't
perfect), but for Lucene, I'm not so sure.  Multimodule support in
Maven is OK at best and we have a lot of modules in Lucene.


Having been on the Maven list a number of times in the past, my sense  
was that it was overwhelmed by the sheer number of requests for help  
and the community itself was not able to keep up, so getting help may  
be more difficult.  Maybe that has changed since I was last on (about
1.5 years ago).


Customization work in Maven is also a pain, and I have yet to see a  
project of any significance that didn't require some customization, no  
matter how much you follow the conventions.   For instance, Lucene's  
automated regression tests come to mind.  And, I am willing to bet  
Lucene's release process would need to be customized.


Finally, we have a pretty large installed base.  This is not something  
we should do lightly (not that anyone was suggesting otherwise).  We  
have a working build system and we have pretty broad Ant knowledge in  
the project (including the guy who wrote the book on Ant).


To sum up, I'm -0.9.  You might be able to convince me of using Maven,  
but the execution would really have to overcome a whole lot in order  
to do so.


-Grant


On Apr 9, 2009, at 6:48 PM, Earwin Burrfoot wrote:

On Fri, Apr 10, 2009 at 02:25, Chris Hostetter hossman_luc...@fucit.org wrote:

Or just make it trivial to get all jars that fit a given profile w/o
actually merging those jars into an uber-jar ... does maven's
dependency management have anything like bundles or virtual packages so
we could publish a lucene-all-analyzers POM that didn't have an actual
lucene-all-analyzers.jar but listed dependencies on all of the individual
jars?


Maven can do this. Not sure transitive dependencies were meant to be
used that way, but they definitely work like you want.
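
A minimal sketch of such a POM, for illustration only. A project with pom
packaging produces no jar of its own, but whoever depends on it (using
type pom in their dependency) gets its dependencies transitively. The
artifactIds and the version below are hypothetical; no such artifacts
are published today.

```xml
<!-- Hypothetical aggregator POM: publishes no jar, only a dependency list. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-all-analyzers</artifactId>
  <version>0.0.0-EXAMPLE</version>
  <packaging>pom</packaging>

  <dependencies>
    <!-- One entry per individual analyzer jar; these names are made up. -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-th</artifactId>
      <version>0.0.0-EXAMPLE</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-fr</artifactId>
      <version>0.0.0-EXAMPLE</version>
    </dependency>
  </dependencies>
</project>
```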


I think ideally the existing contrib/analysis would be broken up by
language -- even if that means only 2 or 3 classes per jar -- but i don't
deal with multilingual stuff much so i don't have much of an opinion ...
perhaps the majority of our users that deal with non-English tend to deal
with *lots* of languages so having a single multilingual-analysis module
would be suitable.


I bet lots of users dealing with a non-English language deal only with
it, because they're providing local services. Like we're working with
a mix of Russian/English/Ukrainian.

But my point really is that I don't see any adequate reason to have
dozens of well-defined micromodules.
People that care big time about dead weight in their distributions
should use tools like jar jar links anyway. (If I remember right, one
of its abilities is to build an uberjar from a bunch of jars, dropping
unused classes in the process)


--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org








Re: Modularization

2009-04-09 Thread Chris Hostetter

: Then during build we can package up certain combinations.  I think
: there should be sub-kitchen-sink jars by area, eg a jar that contains
: all analyzers/tokenstreams/filters, all queries/filters, etc.

Or just make it trivial to get all jars that fit a given profile w/o
actually merging those jars into an uber-jar ... does maven's
dependency management have anything like bundles or virtual packages so
we could publish a lucene-all-analyzers POM that didn't have an actual
lucene-all-analyzers.jar but listed dependencies on all of the individual
jars?

(FYI: Perl's CPAN has the concept of a Bundle that's just an empty
distribution that depends on other distributions so you have a single
reference point for installing them)

: So, how would you refactor the various sources of
: analyzers/tokenstream/tokenfilters we have today
: (src/java/org/apache/lucene/analysis/*, contrib/snowball/*,
: contrib/collation/* and contrib/analyzers/*)?  (Even contrib/memory
: has a neat PatternAnalyzer, that operates on a string using a regexp
: to get tokens out, that I am only now discovering).

I think ideally the existing contrib/analysis would be broken up by
language -- even if that means only 2 or 3 classes per jar -- but i don't
deal with multilingual stuff much so i don't have much of an opinion ...
perhaps the majority of our users that deal with non-English tend to deal
with *lots* of languages so having a single multilingual-analysis module
would be suitable.

: We also need to think about how this impacts our back-compat policy.
: EG when are we allowed to split up modules into sub-modules, or merge
: them.

splitting a module should always be fair game as long as the new module(s)
maintain the same back compat policy ... it's not a burden to ask people 
to start using 2 jars instead of 1 jar (especially if we're already going 
to have an easy way to bundle jars up into uber-jars)

in theory merging modules should require that the new module adopt the 
most restrictive back-compat policy of the previous modules.

: Assuming there's general consensus on this "break core into modules"
: approach, I think the next step is to take an inventory of all of
: Lucene's classes and roughly divide them into proposed modules, and
: iterate on that?  Hoss do you want to take a first stab at that?

Heh.  i'm not sure i could even answer the "want" question in the
affirmative.  This is essentially a question of refactoring, and I think
approaching this incrementally would be the best strategy ... either by
first finding some low hanging fruit in core that could be extracted into
a contrib easily (spans, query parser) or by restructuring the build
system to put contribs and the demo on equal footing with core as
modules, and reassess as progress is made.

on a personal note: even if i wanted to lead this charge, i really can't 
right now ... folks may have noticed my involvement with lucene has been 
markedly lower in the last few months, i expect it to get even lower over 
the next 2 months before it will (hopefully) get higher. 



-Hoss





Re: Modularization

2009-04-09 Thread Chris Hostetter

: We've been doing this using just one source tree (like in Lucene), and
: instead ensuring the separation using the build system. We did not, like you

I think you are misunderstanding my previous comment ... Lucene-Java does
not currently have "one source tree" in the sense that someone else
suggested (i forget who) and i was commenting on ... at the moment Lucene
has several source trees (src/java, src/demo, and each dir matching
contrib/*/src).

Based on your examples, i believe we are suggesting the same thing:
building separate modules from separate base directories (in your case
foo/A and foo/B) with well defined dependencies.






-Hoss





Re: Modularization

2009-04-09 Thread Chris Hostetter


: If there are any serious moves to reorganize things, we should at least
: consider the benefits of maven.

+1

we can certainly do a lot to improve things just by refactoring stuff from
core into contrib, and improving the visibility of contribs and 
documentation about contribs -- but if we're going to make massive changes 
to how things are built or how the source code is organized, then 
utilizing maven as the build system seems like an obvious choice to me.

(and i don't even like maven that much)



-Hoss





Re: Modularization

2009-04-01 Thread Nadav Har'El
On Mon, Mar 30, 2009, Chris Hostetter wrote about Re: Modularization:
 code isolation (by directory hierarchy) is the best way i've seen to
 ensure modularization, and protect against inadvertent dependency
 bleeding.
...
 it's certainly possible to have all source code in a single directory
 hierarchy, and then rely on the build system to ensure you don't have
 unwarranted dependencies, but that requires you to express rules in the
 build system about what exactly the acceptable dependencies are, and it
 relies on everyone using the build system correctly (misguided users of
 hand-holding IDEs could get very frustrated when the patches they submit
 violate rules of an overly complicated set of ant build files)

In a project I've been involved in, we are building a library with concerns
similar to those Lucene now faces - on one hand you want to be a kitchen sink
providing features for everyone, but on the other hand you want to create
small jars and allow people who only need a small number of features to pick
only some of the jars, instead of one huge jar.

We've been doing this using just one source tree (like in Lucene), and
instead ensuring the separation using the build system. We did not, as you
suggest, find this too complicated to set up or maintain. The only snag, of
course, is that people who don't know how to write build.xml properly do
not touch it, but it's exactly like people who don't know how to properly
code in Java do not touch our source code :-) Having a hand-holding IDE
is no replacement for knowing how to code, whether the code is Java source
code or Ant configuration.

The idea of the Ant-based approach is to have the Ant build script compile
each module source separately, allowing it only to refer to pre-defined
dependencies. This instead of the more usual approach of compiling all the
source code together (and thus allowing unwanted dependencies) and only
collecting the jars from the compiled classes at the very end.

For example, let's say that we want to build three JARs of three packages,
foo.A, foo.B, and foo.C. Let's say that foo.A is stand-alone (doesn't need
the other source code to compile), and foo.B depends on stuff from foo.A
(and must not depend on stuff from foo.C).

In that case, I would first create an Ant rule to build a jar from the sources
of foo.A, and them alone (which ensures that foo.A doesn't accidentally
depend on foo.B or foo.C). Note the includes argument to javac, and the
separate destdir:

<target name="A.compile">
  <sequential>
    <mkdir dir="${build.classes}/A/"/>
    <javac srcdir="${src}" destdir="${build.classes}/A"
           includes="foo/A/**/*.java"
           sourcepath="" listfiles="no">
    </javac>
  </sequential>
</target>

<target name="A.jar" depends="A.compile">
  <sequential>
    <mkdir dir="${build.jars}/"/>
    <jar destfile="${build.jars}/A.jar"
         basedir="${build.classes}/A">
    </jar>
  </sequential>
</target>

Now, we do a similar thing for B.jar - when compiling it, we allow the
compiler to look at only the source code of foo.B, and at the previously
built A.jar. It cannot, for example, accidentally use stuff from foo.C:

<target name="B.compile" depends="A.jar">
  <sequential>
    <mkdir dir="${build.classes}/B/"/>
    <javac srcdir="${src}" destdir="${build.classes}/B"
           includes="foo/B/**/*.java" sourcepath="" listfiles="no">
      <classpath>
        <pathelement location="${build.jars}/A.jar"/>
      </classpath>
    </javac>
  </sequential>
</target>

<target name="B.jar" depends="B.compile">
  <sequential>
    <mkdir dir="${build.jars}/"/>
    <jar destfile="${build.jars}/B.jar"
         basedir="${build.classes}/B">
    </jar>
  </sequential>
</target>

Putting my money (or rather, time) where my mouth is, is there any interest
in my trying to write a build script for Lucene to demonstrate these ideas
in action?

 FWIW: having lots/more of very small, isolated, hierarchies also wouldn't
 hinder any attempts at having kitchen-sink or essential jars --
 combining the classes from lots of little isolated code trees is a lot
 easier than extracting a few classes from one big code tree.

But I think you've swept one issue under the rug: what happens when the
hierarchies aren't completely isolated? For example, an analyzer package
obviously depends on some Lucene core package. Or the query parser package
depends on the wildcard query package (for example). You need to specify these
dependencies somehow, and allow only them. How do you do that? Via an
Eclipse .project file in each of the small hierarchies? How is this any
better than having an Ant build file? How would anyone not using Eclipse
use this sort of setup?

Another problem

Re: Modularization

2009-04-01 Thread Ryan McKinley


we can have fine grained modularity w/o having second class citizens, and
we can achieve it without needing to make radical changes -- but putting
more stuff into core isn't going to help us get there.



I totally agree.

However, just to stir the pot (and assuming you are well rested), I'll
drop your "radical changes" constraint and suggest that maven (while
it can be a PIA) makes this kind of modularity trivial.


With maven we could easily have:
 /core
 /modules/xxx

Each module could easily declare:
 * its dependencies on other modules
 * the required JRE
 * document its level of maturity

And there are good off the shelf tools to report the dependency  
graphs, etc, etc.
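
As a rough sketch of what that could look like (an illustration, not a
worked-out proposal): a top-level POM lists the modules, and each module
POM declares its own dependencies and Java level. All artifactIds, the
version, and the module.maturity property are hypothetical; Maven has no
built-in notion of maturity, so a custom property stands in for it here.

```xml
<!-- Hypothetical top-level pom.xml: lists the modules to build. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-parent</artifactId>
  <version>0.0.0-EXAMPLE</version>
  <packaging>pom</packaging>

  <modules>
    <module>core</module>
    <module>modules/xxx</module>
  </modules>
</project>
```

```xml
<!-- Hypothetical modules/xxx/pom.xml -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-parent</artifactId>
    <version>0.0.0-EXAMPLE</version>
    <relativePath>../../pom.xml</relativePath>
  </parent>
  <artifactId>lucene-xxx</artifactId>

  <properties>
    <!-- Made-up property recording the module's maturity level. -->
    <module.maturity>experimental</module.maturity>
  </properties>

  <dependencies>
    <!-- Dependency on another module of the same build (assumes core
         is published as lucene-core). -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>${project.version}</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <!-- Required Java level, expressed as compiler source/target. -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.5</source>
          <target>1.5</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
```

Running mvn dependency:tree in a module then prints that module's
resolved dependency graph.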


If there are any serious moves to reorganize things, we should at  
least consider the benefits of maven.


ryan




Re: Modularization

2009-04-01 Thread Douglas Campos
+1 on maven, and I volunteer to aid in the creation of the maven project
files (pom's)

On Wed, Apr 1, 2009 at 11:02 AM, Ryan McKinley ryan...@gmail.com wrote:


 we can have fine grained modularity w/o having second class citizens, and
 we can achieve it without needing to make radical changes -- but putting
 more stuff into core isn't going to help us get there.


 I totally agree.

 However, just to stir the pot (and assuming you are well rested), I'll drop
 your radical changes constraint and suggest that maven (while it can be a
 PIA) makes this kind of modularity trivial.

 With maven we could easily have:
  /core
  /modules/xxx

 Each module could easily declare:
  * its dependencies on other modules
  * the required JRE
  * document its level of maturity

 And there are good off the shelf tools to report the dependency graphs,
 etc, etc.

 If there are any serious moves to reorganize things, we should at least
 consider the benefits of maven.

 ryan






-- 
Douglas Campos
Theros Consulting
+55 11 9267 4540
+55 11 3020 8168


Re: Modularization

2009-04-01 Thread Earwin Burrfoot
Lucene is in fact already available through maven. poms do exist, all
that is left is to find out who manages and releases them.

On Thu, Apr 2, 2009 at 01:40, Douglas Campos doug...@theros.info wrote:
 +1 on maven, and I volunteer to aid in the creation of the maven project
 files (pom's)

 On Wed, Apr 1, 2009 at 11:02 AM, Ryan McKinley ryan...@gmail.com wrote:

 we can have fine grained modularity w/o having second class citizens, and
 we can achieve it without needing to make radical changes -- but putting
 more stuff into core isn't going to help us get there.


 I totally agree.

 However, just to stir the pot (and assuming you are well rested), I'll
 drop your radical changes constraint and suggest that maven (while it can
 be a PIA) makes this kind of modularity trivial.

 With maven we could easily have:
  /core
  /modules/xxx

 Each module could easily declare:
  * its dependencies on other modules
  * the required JRE
  * document its level of maturity

 And there are good off the shelf tools to report the dependency graphs,
 etc, etc.

 If there are any serious moves to reorganize things, we should at least
 consider the benefits of maven.

 ryan





 --
 Douglas Campos
 Theros Consulting
 +55 11 9267 4540
 +55 11 3020 8168




-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




Re: Modularization

2009-04-01 Thread Douglas Campos
I haven't paid attention, as I looked first for the build.xml on trunk

as we already are using maven, Ryan's approach is the way to go, IMHO

On Wed, Apr 1, 2009 at 7:00 PM, Earwin Burrfoot ear...@gmail.com wrote:

 Lucene is in fact already available through maven. poms do exist, all
 what is left is to find who manages them and releases.

 On Thu, Apr 2, 2009 at 01:40, Douglas Campos doug...@theros.info wrote:
  +1 on maven, and I volunteer to aid in the creation of the maven project
  files (pom's)
 
  On Wed, Apr 1, 2009 at 11:02 AM, Ryan McKinley ryan...@gmail.com wrote:
 
  we can have fine grained modularity w/o having second class citizens, and
  we can achieve it without needing to make radical changes -- but putting
  more stuff into core isn't going to help us get there.
 
 
  I totally agree.
 
  However, just to stir the pot (and assuming you are well rested), I'll
  drop your radical changes constraint and suggest that maven (while it can
  be a PIA) makes this kind of modularity trivial.
 
  With maven we could easily have:
   /core
   /modules/xxx
 
  Each module could easily declare:
   * its dependencies on other modules
   * the required JRE
   * document its level of maturity
 
  And there are good off the shelf tools to report the dependency graphs,
  etc, etc.
 
  If there are any serious moves to reorganize things, we should at least
  consider the benefits of maven.
 
  ryan
 
 
 
 
 
  --
  Douglas Campos
  Theros Consulting
  +55 11 9267 4540
  +55 11 3020 8168
 



 --
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785





-- 
Douglas Campos
Theros Consulting
+55 11 9267 4540
+55 11 3020 8168


Re: Modularization

2009-03-31 Thread Babak Farhang
 maturity, and their back compat commitments.  The demo and getting
 started guides could also be expanded to reference the contrib jars that
 contain code many people may want to reuse...

Here's an idea. Each contrib is really a project unto its own. And any
project, I suggest, ought to have its own demo program, together maybe
with a small write-up describing the idea behind the contrib and what
the demo does. So to get the ball rolling, how about adopting some
such documentation policy for *future* contribs as a
pseudo-requirement for making it into the official release?

Cheers,
-Babak

PS this is not a swipe at any upcoming contrib (TrieUtils: the
documentation there is really good :)


On Mon, Mar 30, 2009 at 5:31 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 After stirring things up, and then being off-list for ~10 days, I'm in an
 interesting position coming back to this thread and seeing the discussion
 *after* it essentially ended, with a lot of semi-consensus but no clear
 sense of hard and fast resolution or plan of action.

 FWIW, here are the notes i made based on reading the thread about the
 various sentiments i noticed expressed (whether i agree with them or
 not) in order to try and get a handle on what had been discussed.
 some of these were the opinion of a single person and i've paraphrased,
 others are my generalization of similar comments made by various
 people...

 - contrib has a bad rap
 - widely varying degrees of quality/stability in contrib code, hard to get
 people to rely on the good ones because of the less good ones
 - many people want a good, out of the box, kitchen sink experience (ie:
 one monolithic jar containing all the essentials)
 - need easy discoverability of all things of a given type (ie: all
 queries, all filters, all analyzers, etc...) .. ie: combined javadocs.
 - need easy installation of all things of a given type (ie: a jar
 containing all types of queries, a jar containing all types of analyzers,
 etc...)
 - still need to deal with contribs that have external dependencies
 - still need to deal with contribs that require future versions of the
 language (Java 1.7 when core is still 1.5 compat)
 - users need better guidance about why something is a contrib
 (additional functionality, alternate functionality, example of use, tool,
 etc...)
 - while we should maintain/increase modularization, documentation should
 make features of contribs more prominent without stressing the isolation
 resulting from code modularization.
 - we should merge all contrib and core code into a unified src/ tree, and
 make the packaging independent of the physical location in svn (ie: jars
 based on java package, not directory)

 While I'm mostly in favor of all of these sentiments, and think it's
 really just a question of how to go about it, the last one is actually
 something i'm pretty strongly opposed to -- I think the best way forward
 is to have lots of small, well isolated source trees.

 code isolation (by directory hierarchy) is the best way i've seen to
 ensure modularization, and protect against inadvertent dependency
 bleeding.  If we want to be able to produce small jars targeted at
 specific goals, and we want o.l.a.foo.FooClass to be in foo.jar and
 o.l.a.bar.BarClass to be in bar.jar then we shouldn't have
 src/java/o/l/a/foo/FooClass.java and src/java/o/l/a/bar/BarClass.java --
 doing so makes it way too easy for inadvertent dependencies to crop up
 that make FooClass depend on BarClass, and thus make it impossible to use
 foo.jar without also using bar.jar at runtime.

 it's certainly possible to have all source code in a single directory
 hierarchy, and then rely on the build system to ensure you don't have
 unwarranted dependencies, but that requires you to express rules in the
 build system about what exactly the acceptable dependencies are, and it
 relies on everyone using the build system correctly (misguided users of
 hand-holding IDEs could get very frustrated when the patches they submit
 violate rules of an overly complicated set of ant build files)

 FWIW: having lots/more of very small, isolated, hierarchies also wouldn't
 hinder any attempts at having kitchen-sink or essential jars --
 combining the classes from lots of little isolated code trees is a lot
 easier than extracting a few classes from one big code tree.

 One underlying assumption that seems to have permeated the existing
 discussion (without ever being explicitly stated) is the idea that most
 of what currently lives in src/java is the "core" and would be a single
 module ... personally i'd like to challenge that assumption.  I'd like to
 suggest that besides obvious things that could be refactored out into other
 modules (span queries, queryparser) there are lots of additional ways
 that src/java could be sliced...

  - interfaces and abstract classes and concrete classes for reading an
 index in one index-api.jar (ie: Directory but no FSDirectory; IndexReader
 but not MultiReader)
  - ditto

Re: Modularization

2009-03-31 Thread Michael McCandless
On Mon, Mar 30, 2009 at 7:31 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 code isolation (by directory hierarchy) is the best way i've seen to
 ensure modularization, and protect against inadvertent dependency
 bleeding.

OK I agree this (divorced top-level directories) is a great way to
enforce modularity and we should use that.

It seems the toplevel directory structure could still have subdirs,
eg:

  analyzers
    languages
      th
      es
      fr
      snowball?
      ...
    standard
    collation

and:

  search
    searcher
    queries
      span
      function

And in those leaf subdirs above would be the package subdir
structure (src/{java,test}/org/apache/lucene/...).

Though svn checkout and svn update and svn diff are going to
take quite a bit longer with this switch...

 One underlying assumption that seems to have permeated the existing
 discussion (without ever being explicitly stated) is the idea that
 most of what currently lives in src/java is the "core" and would be a
 single module ... personally i'd like to challenge that assumption.  I'd
 like to suggest that besides obvious things that could be refactored
 out into other modules (span queries, queryparser) there are lots
 of additional ways that src/java could be sliced...

+1: I very much agree what is now called "core" should be refactored
as a number of modules.

So the general new proposal here seems to be: let's break up src/java/*
into separate modules (each under its own toplevel directory), just
like contrib/* is today.

And move Lucene to an a la carte model for what we now call core.
(what we now call contrib is already a la carte today).

We would then do away with the top level core vs contrib, and
everything would simply be modules, where each module has
metadata/javadocs stating:

  * JRE version required

  * What external dependencies (including dependencies to other Lucene
modules) are needed

  * Some measure of maturity

  * Back-compat policy

  * CHANGES

Then during build we can package up certain combinations.  I think
there should be sub-kitchen-sink jars by area, eg a jar that contains
all analyzers/tokenstreams/filters, all queries/filters, etc.

This does make the future decision process far easier.  Rather than
have a capricious and ill-defined "does it go into core vs contrib"
question, we now simply decide if it goes into an existing module or
makes a new one.

 Even without making radical changes to the way our source code is
 organized, a lot of improvements could be made by having better
 documentation .

Agreed. I think this is actually somewhat orthogonal, though it should
follow more naturally once Lucene is simply a collection of modules.
I would think we present an "all" set and per-module sets of javadocs,
plus javadocs aggregated based on how the JARs aggregate?  (Ie I could
browse the "kitchen-sink" javadocs, the "all analyzers" javadocs, or
the "thai analyzers only" javadocs).

 (ie: a new ThaiStemmerFilter could be added to an existing
 thai-analysis module)

So, how would you refactor the various sources of
analyzers/tokenstream/tokenfilters we have today
(src/java/org/apache/lucene/analysis/*, contrib/snowball/*,
contrib/collation/* and contrib/analyzers/*)?  (Even contrib/memory
has a neat PatternAnalyzer, that operates on a string using a regexp
to get tokens out, that I am only now discovering).

We also need to think about how this impacts our back-compat policy.
EG when are we allowed to split up modules into sub-modules, or merge
them.

Assuming there's general consensus on this "break core into modules"
approach, I think the next step is to take an inventory of all of
Lucene's classes and roughly divide them into proposed modules, and
iterate on that?  Hoss do you want to take a first stab at that?

Mike




Re: Modularization

2009-03-30 Thread Chris Hostetter

After stirring things up, and then being off-list for ~10 days, I'm in an
interesting position coming back to this thread and seeing the discussion
*after* it essentially ended, with a lot of semi-consensus but no clear
sense of hard and fast resolution or plan of action.

FWIW, here are the notes i made based on reading the thread about the
various sentiments i noticed expressed (whether i agree with them or
not) in order to try and get a handle on what had been discussed.
some of these were the opinion of a single person and i've paraphrased,
others are my generalization of similar comments made by various
people...

- contrib has a bad rap
- widely varying degrees of quality/stability in contrib code, hard to get 
people to rely on the good ones because of the less good ones
- many people want a good, out of the box, kitchen sink experience (ie:
one monolithic jar containing all the essentials)
- need easy discoverability of all things of a given type (ie: all
queries, all filters, all analyzers, etc...) .. ie: combined javadocs.
- need easy installation of all things of a given type (ie: a jar
containing all types of queries, a jar containing all types of analyzers,
etc...)
- still need to deal with contribs that have external dependencies
- still need to deal with contribs that require future versions of the
language (Java 1.7 when core is still 1.5 compat)
- users need better guidance about why something is a contrib
(additional functionality, alternate functionality, example of use, tool,
etc...)
- while we should maintain/increase modularization, documentation should
make features of contribs more prominent without stressing the isolation
resulting from code modularization.
- we should merge all contrib and core code into a unified src/ tree, and
make the packaging independent of the physical location in svn (ie: jars
based on java package, not directory)

While I'm mostly in favor of all of these sentiments, and think it's 
really just a question of how to go about it, the last one is actually 
something i'm pretty strongly opposed to -- I think the best way forward
is to have lots of small, well isolated source trees.

code isolation (by directory hierarchy) is the best way i've seen to
ensure modularization, and protect against inadvertent dependency
bleeding.  If we want to be able to produce small jars targeted at
specific goals, and we want o.l.a.foo.FooClass to be in foo.jar and
o.l.a.bar.BarClass to be in bar.jar then we shouldn't have
src/java/o/l/a/foo/FooClass.java and src/java/o/l/a/bar/BarClass.java --
doing so makes it way too easy for inadvertent dependencies to crop up
that make FooClass depend on BarClass, and thus make it impossible to use
foo.jar without also using bar.jar at runtime.

it's certainly possible to have all source code in a single directory
hierarchy, and then rely on the build system to ensure you don't have
unwarranted dependencies, but that requires you to express rules in the
build system about what exactly the acceptable dependencies are, and it
relies on everyone using the build system correctly (misguided users of
hand-holding IDEs could get very frustrated when the patches they submit
violate rules of an overly complicated set of ant build files)

FWIW: having lots/more of very small, isolated, hierarchies also wouldn't
hinder any attempts at having kitchen-sink or essential jars --
combining the classes from lots of little isolated code trees is a lot
easier than extracting a few classes from one big code tree.
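
As a rough illustration of the "combining" direction, and only a sketch:
Ant's jar task can merge the contents of already-built jars through
nested zipfileset elements. The target names, the jar names and the
${build.jars} property are hypothetical and not part of the current
build.

```xml
<!-- Hypothetical target: merge separately built module jars into one. -->
<target name="kitchen-sink.jar" depends="foo.jar,bar.jar">
  <mkdir dir="${build.jars}"/>
  <jar destfile="${build.jars}/kitchen-sink.jar">
    <!-- Copy the classes of each module jar, skipping their manifests. -->
    <zipfileset src="${build.jars}/foo.jar" excludes="META-INF/**"/>
    <zipfileset src="${build.jars}/bar.jar" excludes="META-INF/**"/>
  </jar>
</target>
```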

One underlying assumption that seems to have permeated the existing
discussion (without ever being explicitly stated) is the idea that most
of what currently lives in src/java is the "core" and would be a single
module ... personally i'd like to challenge that assumption.  I'd like to
suggest that besides obvious things that could be refactored out into other
modules (span queries, queryparser) there are lots of additional ways
that src/java could be sliced...

 - interfaces and abstract classes and concrete classes for reading an 
index in one index-api.jar (ie: Directory but no FSDirectory; IndexReader 
but not MultiReader)
 - ditto for creating/updating an index in one index-update.jar (ie: 
IndexWriter, TokenStream, Tokenizer, TokenFilter, Analyzer ... but 
not any impls of the last 3)
 - ditto for searching in index-search.jar (ie: Searcher, Searchable, 
HitCollector, Query ... but not any concrete subclasses)
 - simple-analysis.jar (SimpleAnalyzer, WhitespaceAnalyzer, 
LetterTokenizer, LowerCaseFilter, etc...)
 - english-analysis.jar (StandardAnalyzer, etc...)
 - primitive-queries.jar (TermQuery, BooleanQuery, MatchAllDocsQuery, 
MultiTermQuery, etc...)
 - range-queries.jar (RangeQuery, RangeFilter, ConstantScoreRangeQuery)

   ...etc...


The crux of my point being that what we think of today as the lucene 
core is actually kind of big and bloated, and already has *a* kitchen 
sink thrown in -- it's just not necessarily

Re: Modularization

2009-03-30 Thread Michael Busch

On 3/31/09 1:31 AM, Chris Hostetter wrote:

code isolation (by directory hierarchy) is the best way i've seen to
ensure modularization, and protect against inadvertent dependency
bleeding.
+1. That's actually what I meant with one-to-one mapping between the 
packaging and the source code (I didn't say that as elaborately as you :) )
I strongly believe that making jars based on packages rather than 
directories would be the wrong decision, for the reasons you laid out 
nicely here.


-Michael

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Modularization

2009-03-23 Thread Michael McCandless
Michael Busch busch...@gmail.com wrote:

 And I don't think the sudden separation of core vs contrib
 should be so prominent (or even visible); it's really a detail of
 how we manage source control.

 When looking at the website I'd like to read that Lucene can do hit
 highlighting, powerful query parsing, spell checking, analyze
 different languages, etc.  I couldn't care less that some of these
 happen to live under a contrib subdirectory somewhere in the
 source control system.

 OK, so I think we all agree about the packaging. But I believe it is
 also important how the source code is organized. Maybe Lucene
 consumers don't care too much, however, Lucene is an open source
 project. So we also want to attract possible contributors with a
 nicely organized code base. If there is a clear separation between
 the different components on a source code level, becoming familiar
 with Lucene as a contributor might not be so overwhelming.

+1

We want the source code to be well organized: consumability by Lucene
developers (not just Lucene users) is also important for Lucene's
future growth.

 Besides that, I think a one-to-one mapping between the packaging and
 the source code has no disadvantages. (and it would certainly make
 the build scripts easier!)

Right.

So, towards that... why even break out contrib vs core, in source
control?  Can't we simply migrate contrib/* into core, in the right
places?

 Could we, instead, adopt some standard way (in the package
 javadocs) of stating the maturity/activity/back compat policies/etc
 of a given package?

 This makes sense; e.g. we could release new modules as beta versions
 (= use at own risk, no backwards-compatibility).

In fact we already have a 2.9 Jira issue opened to better document the
back-compat/JDK version requirements of all packages.

I think, like we've done with core lately when a new feature is added,
we could have the default assumption be full back compatibility, but
then those classes/methods/packages that are very new and may change
simply say so clearly in their javadocs.

 And if we start a new module (e.g. a GSoC project) we could exclude
 it from a release easily if it's truly experimental and not in a
 release-able state.

Right.

 So I think the beginnings of a rough proposal are taking shape, for
3.0:

   1. Fix web site to give a better intro to Lucene's features,
   without exposing the false (to the Lucene consumer) core vs.
   contrib distinction

   2. When releasing, we make a single JAR holding core and contrib
   classes for a given area.  The final JAR files don't contain a
   core vs contrib distinction.

   3. We create a bundled JAR that has the common packages
   typically needed (index/search core, analyzers, queries,
   highlighter, spellchecker)

 +1 to all three points.

OK.

So I guess I'm proposing adding:

   4. Move contrib/* under src/java/*, updating the javadocs to state
   back compatibility promises per class/package.

I think net/net this'd be a great simplification?

Mike




Re: Modularization

2009-03-23 Thread Yonik Seeley
On Mon, Mar 23, 2009 at 11:10 AM, Michael McCandless
luc...@mikemccandless.com wrote:
   4. Move contrib/* under src/java/*, updating the javadocs to state
       back compatibility promises per class/package.

- contrib has always had a lower bar and stuff was committed under
that lower bar - there should be no blanket promotion.
- contrib items may have different dependencies... putting it all
under the same source root can make a developer's job harder
- many contrib items are less related to lucene-java core indexing and
searching... if there is no contrib, then they don't belong in the
lucene-java project at all.
- right now it's clear - core can't have dependencies on non-core
classes.  If everything is stuck in the same source tree, that goes
away.

I think there are a lot of benefits to continue considering very
carefully if something is core or not.

-Yonik




Re: Modularization

2009-03-23 Thread Mark Miller
Are you arguing for no change, Yonik? I agree with all of your points in 
any case.


What appeals to me most so far is:

Take the best of contrib and up its status to something like modules. 
Equal to core, different requirements, dependencies, etc. Perhaps take 
queryparser out of core, but frankly I wouldn't mind just leaving core 
as it is.


Reintroduce the sandbox (I believe core was sandbox, part of the lower 
bar history) and put lesser contrib there and new stuff that's unproven. 
Contrib doesn't appeal to me as a name anyway.


That would give core, modules, and the sandbox (perhaps sandbox is a 
module?). Things could move from sandbox to core or the modules. Modules 
get new requirements similar to core - back compat guarantees and 
changes.txt per module.



--
- Mark

http://www.lucidimagination.com







Re: Modularization

2009-03-23 Thread Earwin Burrfoot
 - contrib has always had a lower bar and stuff was committed under
 that lower bar - there should be no blanket promotion.
 - contrib items may have different dependencies... putting it all
 under the same source root can make a developer's job harder
 - many contrib items are less related to lucene-java core indexing and
 searching... if there is no contrib, then they don't belong in the
 lucene-java project at all.
 - right now it's clear - core can't have dependencies on non-core
 classes.  If everything is stuck in the same source tree, that goes
 away.
Adding to this, afaik contribs have no java 1.4 restriction. If you
merge them into the core, you must either enforce it for contribs, or
lift it from the core. I think both variants may be a reason for
several heart attacks :)
One could argue that five years after 1.5 was released Lucene is going
to use it, so the point is no longer relevant. Sorry, 1.7 is just
behind the door.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




Re: Modularization

2009-03-23 Thread Mark Miller

Earwin Burrfoot wrote:

- contrib has always had a lower bar and stuff was committed under
that lower bar - there should be no blanket promotion.
- contrib items may have different dependencies... putting it all
under the same source root can make a developer's job harder
- many contrib items are less related to lucene-java core indexing and
searching... if there is no contrib, then they don't belong in the
lucene-java project at all.
- right now it's clear - core can't have dependencies on non-core
classes.  If everything is stuck in the same source tree, that goes
away.


Adding to this, afaik contribs have no java 1.4 restriction. If you
merge them into the core, you must either enforce it for contribs, or
lift it from the core. I think both variants may be a reason for
several heart attacks :)
One could argue that five years after 1.5 was released Lucene is going
to use it, so the point is no longer relevant. Sorry, 1.7 is just
behind the door.

  
I think we are considering this for Lucene 3.0 (should be the release 
after next) which will allow Java 1.5.


- Mark




Re: Modularization

2009-03-23 Thread Earwin Burrfoot
On Mon, Mar 23, 2009 at 22:13, Mark Miller markrmil...@gmail.com wrote:
 Earwin Burrfoot wrote:

 - contrib has always had a lower bar and stuff was committed under
 that lower bar - there should be no blanket promotion.
 - contrib items may have different dependencies... putting it all
 under the same source root can make a developer's job harder
 - many contrib items are less related to lucene-java core indexing and
 searching... if there is no contrib, then they don't belong in the
 lucene-java project at all.
 - right now it's clear - core can't have dependencies on non-core
 classes.  If everything is stuck in the same source tree, that goes
 away.


 Adding to this, afaik contribs have no java 1.4 restriction. If you
 merge them into the core, you must either enforce it for contribs, or
 lift it from the core. I think both variants may be a reason for
 several heart attacks :)
 One could argue that five years after 1.5 was released Lucene is going
 to use it, so the point is no longer relevant. Sorry, 1.7 is just
 behind the door.



 I think we are considering this for Lucene 3.0 (should be the release after
 next) which will allow Java 1.5.

So where are you going to put 1.6 and 1.7 contribs?

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




Re: Modularization

2009-03-23 Thread Michael McCandless
 I think we are considering this for Lucene 3.0 (should be the
 release after next) which will allow Java 1.5.

 So where are you going to put 1.6 and 1.7 contribs?

This is a good point: core Lucene must remain on old JREs, but we
should not force all contrib packages to do so.

 - contrib has always had a lower bar and stuff was committed under
 that lower bar - there should be no blanket promotion.

OK so that was the past, and I agree.

I assume by this you're also advocating that going forward this is an
ongoing reason to put something into contrib?  I agree with that. Ie,
if a contribution is made, but it's not clear the quality is up to
core's standards, I would much rather have some place to commit it
(contrib) than to reject it, because once it has a home here, it has a
chance to gain interest, grow, improve, etc.

But: do you think, for this reason, the web site should continue to
present the dichotomy?

 - contrib items may have different dependencies... putting it all
 under the same source root can make a developer's job harder

That's a good point, and a good criterion for leaving something in contrib.

 - many contrib items are less related to lucene-java core indexing
 and searching... if there is no contrib, then they don't belong in
 the lucene-java project at all.

But most contrib packages are very related to Lucene.

Though I agree some contrib packages likely have very narrow
appeal/usage (eg, contrib/db, for using BDB as the raw store for an
index).

And I agree (as above): I would like to have somewhere for
contributions to go, rather than reject them.

 - right now it's clear - core can't have dependencies on non-core
 classes.  If everything is stuck in the same source tree, that goes
 away.

Well... this gets to Hoss's motivation, which I appreciate, to keep
the core tiny.

But that's just good software design and you don't need a divorced
directory structure to achieve that.

 I think there are a lot of benefits to continue considering very
 carefully if something is core or not.

I agree, but at least we need some clear criteria so the future
decision process is more straightforward.  Towards that... it seems
like there are good reasons why something should be put into contrib:

  * It uses a version of JDK higher than what core can allow

  * It has external dependencies

  * Its quality is debatable (or at least not proven)

  * It's of somewhat narrow usage/interest (eg: contrib/bdb)

But I don't think "it doesn't have to be in core" (the software
modularity goal) is the right reason to put something in contrib.

Getting back to the original topic: Trie(Numeric)RangeFilter runs on
JDK 1.4, has no external dependencies, looks to be high quality, and
likely will have wide appeal.  Doesn't it belong in core?
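
The four criteria above can be written down as a single predicate. This 
is only a sketch: ContribCriteria and its parameters are hypothetical 
names, and treating the criteria as a plain OR is an assumption about 
how they would be combined.

```java
/** Hypothetical encoding of the four reasons listed above for using contrib. */
final class ContribCriteria {

    /**
     * @param requiredJdk    minor JDK version the code needs (4 for 1.4, 5 for 1.5)
     * @param coreJdk        minor JDK version core is allowed to require
     * @param externalDeps   true if it needs libraries beyond Lucene itself
     * @param qualityProven  true if its quality is considered established
     * @param narrowInterest true if only a small audience is expected to use it
     * @return true if at least one criterion argues for contrib
     */
    static boolean belongsInContrib(int requiredJdk, int coreJdk, boolean externalDeps,
                                    boolean qualityProven, boolean narrowInterest) {
        return requiredJdk > coreJdk   // needs a newer JDK than core permits
            || externalDeps            // pulls in external dependencies
            || !qualityProven          // quality debatable or not yet proven
            || narrowInterest;         // narrow usage/interest
    }
}
```

For Trie(Numeric)RangeFilter as characterised here (JDK 1.4, no external 
dependencies, high quality, wide appeal), 
belongsInContrib(4, 4, false, true, false) is false, so none of the 
listed reasons applies.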

Mike




Re: Modularization

2009-03-23 Thread Mike Klaas


On 23-Mar-09, at 2:41 PM, Michael McCandless wrote:


I agree, but at least we need some clear criteria so the future
decision process is more straightforward.  Towards that... it seems
like there are good reasons why something should be put into contrib:

 * It uses a version of JDK higher than what core can allow

 * It has external dependencies

 * Its quality is debatable (or at least not proven)

 * It's of somewhat narrow usage/interest (eg: contrib/bdb)

But I don't think "it doesn't have to be in core" (the software
modularity goal) is the right reason to put something in contrib.


Agreed.  I don't think that building on the existing 'contrib' is the  
way to go.  Frequently-used, high-quality components should be more  
properly part of Lucene, whether that means that they move to core,  
or in a new blessed modules section.



Getting back to the original topic: Trie(Numeric)RangeFilter runs on
JDK 1.4, has no external dependencies, looks to be high quality, and
likely will have wide appeal.  Doesn't it belong in core?


+1.  It is important that Lucene come blessed with very good quality  
defaults.  Fast range queries are a common requirement.  Similarly, I  
wouldn't be happy to have a new, wicked QueryParser be relegated to  
contrib where it is unlikely to be found by non-savvy users.  At the  
very least, I agree with Michael that it should be findable in the  
same place.


It does make sense to separate the machinery/building blocks (base  
Query, Weight, Scorer, Filter classes, Similarity interface, etc.)  
from the Query/Filter implementations that use them.  But whether this  
is done by putting them in separate directories or via global core/ 
modules distinction seems unimportant.


-Mike




Modularization (was: Re: New flexible query parser)

2009-03-21 Thread Michael Busch

On 3/21/09 12:27 AM, Michael Busch wrote:

+1. I'd love to see Lucene going into such a direction.

However, I'm a little worried about contrib's reputation. I think it 
contains components with differing levels of activity, maturity and 
support.
So maybe instead of moving things from core into contrib to achieve 
the goal you mentioned, we could create a new folder named e.g. 
'components', which will contain stuff that we claim is as stable, 
mature and supported as the core, just packaged into separate jars. 
Those jars should then only have dependencies on the core, but not on 
each other. They would also follow the same backwards-compatibility 
and other requirements as the core. Thoughts?


I guess something very similar has been proposed and discussed here: 
http://www.nabble.com/Moving-SweetSpotSimilarity-out-of-contrib-to19267437.html#a19320894

(same link that Hoss sent while having his deja vu)...

-Michael




Re: Modularization (was: Re: New flexible query parser)

2009-03-21 Thread Michael McCandless

I think we are mixing up source code modularity with
bundling/packaging.

Honestly, I would not mind much where the source code lives in svn, so
long as a developer, upon downloading Lucene 2.9, can go to *one*
place (javadocs) for Lucene's queries and filters and see
{Int,Long}NumberRangeFilter in there.

We are not there today: a developer must first realize there's a whole
separate place to look for other queries (contrib/queries).  Then
the developer browses that and likely becomes confused/misled by what
TrieRangeQuery means (is it a letter trie?).

My goal here is Lucene's consumability -- when someone new says "hey I
heard about this great search library called Lucene; let me go try it
out" I want that first impression to be as solid as possible.  I think
this is very important for growing Lucene's community.  This is why
out of the box defaults are so crucial (eg changing IW from flushing
every 10 docs to every 16 MB gained sizable throughput).

How many times have we seen a review, article, blog post, etc.,
comparing Lucene to other search libraries only to incorrectly
complain because "Lucene can't do XYZ" or "Lucene's indexing
performance is poor", etc, because they didn't dig in to learn all the
tunings/options/tricks we all know you are supposed to do?  (It
frustrates me to no end when this happens).  This then hurts Lucene's
adoption because others read such articles and conclude Lucene is a
non-starter.

We all ought to be concerned with Lucene's adoption and growth with time
(I am), and first-impression consumability / out of the box defaults
are big drivers of that.

What if (maybe for 3.0, since we can mix in 1.5 sources at that
point?) we change how Lucene is bundled, such that core queries and
contrib/query/* are in one JAR (lucene-query-3.0.jar)?  And
lucene-analyzers-3.0.jar would include contrib/analyzers/* and
org/apache/lucene/analysis/*.  And lucene-queryparser.jar, etc.

Mike




RE: Modularization (was: Re: New flexible query parser)

2009-03-21 Thread Uwe Schindler

 Honestly, I would not mind much where the source code lives in svn, so
 long as a developer, upon downloading Lucene 2.9, can go to *one*
 place (javadocs) for Lucene's queries and filters and see
 {Int,Long}NumberRangeFilter in there.
 
 We are not there today: a developer must first realize there's a whole
 separate place to look for other queries (contrib/queries).  Then
 the developer browses that and likely becomes confused/misled by what
 TrieRangeQuery means (is it a letter trie?).

That is a problem. The contrib/queries is a typical example of a
contribution that is almost always used in third-party projects (Solr):
It is stable, does not depend on anything other than the core, and is 1.4
compatible (at the moment). Other contributions have external dependencies
or need a different Java version than the core.
I would split both types of contributions and would give the stable,
core-only ones a higher ranking (like putting them into the
top-level changes list). E.g. when we release 2.9, nobody will realize that
there is a new TrieRangeFilter in contrib/queries, because it is not in the
top-level changes list. The new contrib/spatial should get the same
visibility.
 
 My goal here is Lucene's consumability -- when someone new says "hey I
 heard about this great search library called Lucene; let me go try it
 out" I want that first impression to be as solid as possible.  I think
 this is very important for growing Lucene's community.  This is why
 out of the box defaults are so crucial (eg changing IW from flushing
 every 10 docs to every 16 MB gained sizable throughput).
 
 How many times have we seen a review, article, blog post, etc.,
 comparing Lucene to other search libraries only to incorrectly
 complain because "Lucene can't do XYZ" or "Lucene's indexing
 performance is poor", etc, because they didn't dig in to learn all the
 tunings/options/tricks we all know you are supposed to do?  (It
 frustrates me to no end when this happens).  This then hurts Lucene's
 adoption because others read such articles and conclude Lucene is a
 non-starter.

I know this problem. And about the contrib queries: most projects that
use Lucene (e.g. Solr) always use some of the contrib jars, and almost
every time contrib/queries. But newcomers, like the journalists writing
those articles, only take the core and test something with it.

So splitting up the whole of Lucene into different parts is better (so
these people must always think about all available packages and which
they need for their project):

 We all ought to be concerned with Lucene's adoption and growth with time
 (I am), and first-impression consumability / out of the box defaults
 are big drivers of that.
 
 What if (maybe for 3.0, since we can mix in 1.5 sources at that
 point?) we change how Lucene is bundled, such that core queries and
 contrib/query/* are in one JAR (lucene-query-3.0.jar)?  And
 lucene-analyzers-3.0.jar would include contrib/analyzers/* and
 org/apache/lucene/analysis/*.  And lucene-queryparser.jar, etc.

This is even better! +1

I would propose:
- core: Indexer, Documents, IndexReader, Searcher and the default
directory-stores (fs, mmap, nio).
- queries: current core queries and contrib/queries
- queryparser (the new one? Or two different packages for old and new): this
should really be removed from core. A lot of people think that they can
only query Lucene using the queryparser and do not even try to build their
Boolean queries manually, and they often fail when it gets complicated and
the query parser cannot help or fails, e.g. querying non-tokenized fields
(but this would depend on queries, we need that here)...
- analysis (and completely remove analyzers from core; let only the
abstract analyzer stay there, plus the keyword analyzer for when you want
to index without an analyzer or do not need one because you only have
non-tokenized fields)
- highlighting
- custom sorting separate
- spatial
- ...

We then could change our contrib SVN accounts and have new roles like
(core-committer, queries-committer,...)

Uwe





Re: Modularization (was: Re: New flexible query parser)

2009-03-21 Thread Grant Ingersoll


On Mar 21, 2009, at 11:26 AM, Michael McCandless wrote:

What if (maybe for 3.0, since we can mix in 1.5 sources at that
point?) we change how Lucene is bundled, such that core queries and
contrib/query/* are in one JAR (lucene-query-3.0.jar)?  And
lucene-analyzers-3.0.jar would include contrib/analyzers/* and
org/apache/lucene/analysis/*.  And lucene-queryparser.jar, etc.



Since we are just talking about packaging, why can't we have both/all  
of the above?  Individual jars, as well as one big jar, that  
contains everything (or, everything that has only dependencies we can  
ship, or everything that we deem important for an OOTB experience).   
I, for one, find it annoying to have to go get snowball, analyzers,  
spellchecking and highlighting separately in most cases b/c I almost  
always use all of them and don't particularly care if there are extra  
classes in a JAR, but can appreciate the need to do that in specific  
instances where leaner versions are needed.  After all, the Ant magic  
to do all of this is pretty trivial given we just need to combine the  
various jars into a single jar (while keeping the indiv. ones)
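
The combining step itself is small. The thread has Ant in mind for this; 
the sketch below shows the same idea with plain java.util.jar, purely as 
an illustration. JarBundler is a hypothetical name, and two choices in it 
are assumptions: per-jar META-INF entries are dropped, and when the same 
entry name occurs in more than one jar the first copy wins.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;

/** Hypothetical helper: merges several module jars into one bundled jar. */
final class JarBundler {

    /** Copies every class/resource entry into bundledJar; returns the names written. */
    static Set<String> bundle(List<Path> moduleJars, Path bundledJar) throws IOException {
        Set<String> written = new LinkedHashSet<>();
        byte[] buffer = new byte[8192];
        try (OutputStream fileOut = Files.newOutputStream(bundledJar);
             JarOutputStream out = new JarOutputStream(fileOut)) {
            for (Path moduleJar : moduleJars) {
                try (JarFile in = new JarFile(moduleJar.toFile())) {
                    Enumeration<JarEntry> entries = in.entries();
                    while (entries.hasMoreElements()) {
                        JarEntry entry = entries.nextElement();
                        String name = entry.getName();
                        // Skip directories and per-jar metadata; keep only the
                        // first copy of an entry that appears in several jars.
                        if (entry.isDirectory() || name.startsWith("META-INF/")
                                || !written.add(name)) {
                            continue;
                        }
                        out.putNextEntry(new JarEntry(name));
                        try (InputStream data = in.getInputStream(entry)) {
                            int read;
                            while ((read = data.read(buffer)) != -1) {
                                out.write(buffer, 0, read);
                            }
                        }
                        out.closeEntry();
                    }
                }
            }
        }
        return written;
    }
}
```

The individual jars are only read, so they stay available next to the 
bundled one, which matches the "both/all of the above" packaging.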


If there is a sense that some contribs aren't maintained or aren't as  
good, then we need to ask ourselves whether they are:
1. stable and solid, don't need much care, and are doing just fine  
thank you very much,
2. in need of archiving, since they only serve as a distraction, or
3. in need of a new champion to maintain/promote them

-Grant




Re: Modularization

2009-03-21 Thread Michael Busch

On 3/21/09 11:26 AM, Michael McCandless wrote:

I think we are mixing up source code modularity with
bundling/packaging.

Honestly, I would not mind much where the source code lives in svn, so
long as a developer, upon downloading Lucene 2.9, can go to *one*
place (javadocs) for Lucene's queries and filters and see
{Int,Long}NumberRangeFilter in there.

We are not there today: a developer must first realize there's a whole
separate place to look for other queries (contrib/queries).  Then
the developer browses that and likely becomes confused/misled by what
TrieRangeQuery means (is it a letter trie?).

My goal here is Lucene's consumability -- when someone new says "hey I
heard about this great search library called Lucene; let me go try it
out" I want that first impression to be as solid as possible.  I think
this is very important for growing Lucene's community.  This is why
out of the box defaults are so crucial (eg changing IW from flushing
every 10 docs to every 16 MB gained sizable throughput).

So this guy landing on http://lucene.apache.org/java/docs/index.html 
sees the Overview section first. That one only gives a very short 
introduction to what Lucene is. He might then look at Features, which 
is also not very specific. I think the next thing would then be to look 
for the documentation of the newest release, so he would click on 
Lucene 2.4.1 Documentation. The landing page doesn't say much, except 
that it tells you to go look for the javadocs and other docs in the menu. 
So maybe the Getting Started link might be the first one to go to, but 
it's also pretty far down the list. So probably he would click on the 
javadocs first. Now he encounters All, Core, Demo, Contrib. Until now, 
he hasn't read the word Contrib anywhere. We basically have no 
documentation anywhere that introduces the concept of contribs, or where 
to find them, I think? Even the Contributions section talks about 
something else. So that guy probably then looks through the demo and 
examples and ends up using only core features until becoming more 
familiar with Lucene as a whole. Maybe he actually ends up buying 
LIA(2) :)



How many times have we seen a review, article, blog post, etc.,
comparing Lucene to other search libraries only to incorrectly
complain because "Lucene can't do XYZ" or "Lucene's indexing
performance is poor", etc, because they didn't dig in to learn all the
tunings/options/tricks we all know you are supposed to do?  (It
frustrates me to no end when this happens).  This then hurts Lucene's
adoption because others read such articles and conclude Lucene is a
non-starter.

We all ought to be concerned with Lucene's adoption and growth with time
(I am), and first-impression consumability / out of the box defaults
are big drivers of that.

What if (maybe for 3.0, since we can mix in 1.5 sources at that
point?) we change how Lucene is bundled, such that core queries and
contrib/query/* are in one JAR (lucene-query-3.0.jar)?  And
lucene-analyzers-3.0.jar would include contrib/analyzers/* and
org/apache/lucene/analysis/*.  And lucene-queryparser.jar, etc.



So yeah I like this and 3.0 is a good opportunity to do this. I think a 
big part of this work should be good documentation. As you mentioned, 
Mike, it should be very simple to get an overview of what the different 
modules are. So there should be the list of the different modules, 
together with a short description for each of them and infos about where 
to find them (which jar). Then by clicking on e.g. queries, the user 
would see the list of all queries we support.


But I think we should still have main modules, such as core, queries, 
analyzers, ... and separately e.g. sandbox modules?, for the things 
currently in contrib that are experimental or, as Mark called them, 
graveyard contribs :) ... even though we might then as well ask the 
question of whether we can't just bury the latter ones...



Mike

Michael Busch wrote:


On 3/21/09 12:27 AM, Michael Busch wrote:

+1. I'd love to see Lucene going into such a direction.

However, I'm a little worried about contrib's reputation. I think it 
contains components with differing levels of activity, maturity and 
support.
So maybe instead of moving things from core into contrib to achieve 
the goal you mentioned, we could create a new folder named e.g. 
'components', which will contain stuff that we claim is as stable, 
mature and supported as the core, just packaged into separate jars. 
Those jars should then only have dependencies on the core, but not 
on each other. They would also follow the same 
backwards-compatibility and other requirements as the core. Thoughts?
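
The "depend on core only, not on each other" rule could be checked mechanically. A minimal sketch in Java, assuming a hand-maintained map from module name to the modules it depends on; the class name, the module names and the map itself are hypothetical and not part of any existing Lucene build:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical check: every module may depend on "core" and on nothing else. */
class ComponentDependencyCheck {

    /**
     * Returns one "module -> dependency" string for each dependency that is
     * not "core". Results are sorted by module name, then by dependency name.
     */
    static List<String> violations(Map<String, Set<String>> deps) {
        List<String> problems = new ArrayList<>();
        // TreeMap/TreeSet copies give a stable, sorted report.
        for (Map.Entry<String, Set<String>> e : new TreeMap<>(deps).entrySet()) {
            for (String dep : new TreeSet<>(e.getValue())) {
                if (!"core".equals(dep)) {
                    problems.add(e.getKey() + " -> " + dep);
                }
            }
        }
        return problems;
    }
}
```

An empty result would mean the component jars are independent of each other; anything listed is a cross-component dependency that would have to be moved or removed.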


I guess something very similar has been proposed and discussed here: 
http://www.nabble.com/Moving-SweetSpotSimilarity-out-of-contrib-to19267437.html#a19320894 


(same link that Hoss sent while having his deja vu)...

-Michael

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org





Re: Modularization

2009-03-21 Thread Michael McCandless
 Maybe he actually ends up buying LIA(2) :)

LIA/2 suffers the same false dichotomy, and it drives me crazy there
too: we put all contrib packages in a different chapter, even though
it'd make much more sense to cover all analyzers in one chapter, all
queries in one chapter, etc.

I find myself cross-referencing over to TrieRangeQuery in Chapter 8,
from LIA's search chapter (Chapter 3), and it's awkward.

 So yeah I like this and 3.0 is a good opportunity to do this. I
 think a big part of this work should be good documentation. As you
 mentioned, Mike, it should be very simple to get an overview of what
 the different modules are.  So there should be the list of the
 different modules, together with a short description for each of
 them and infos about where to find them (which jar).  Then by
 clicking on e.g. queries, the user would see the list of all queries
 we support.

I agree: revamping the web-site for a better top-down introduction of
Lucene's features should be part of 3.0.

And I don't think the sudden separation of core vs contrib should
be so prominent (or even visible); it's really a detail of how we
manage source control.

When looking at the website I'd like to read that Lucene can do hit
highlighting, powerful query parsing, spell checking, analyze
different languages, etc.  I couldn't care less that some of these
happen to live under a contrib subdirectory somewhere in the source
control system.

 But I think we should still have main modules, such as core,
 queries, analyzers, ... and separately e.g. sandbox modules?, for
 the things currently in contrib that are experimental or, as Mark
 called them, graveyard contribs :) ... even though we might then
 as well ask the questions if we can not really bury the latter
 ones...

Could we, instead, adopt some standard way (in the package javadocs)
of stating the maturity/activity/back compat policies/etc of a given
package?
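
One possible shape for such a standard statement, sketched in Java 5 syntax. This swaps in an annotation for plain javadoc text, which is an assumption and not something proposed in this thread; the names Maturity, Level and ExampleHighlighter are hypothetical. With @Documented the statement shows up in the generated javadocs, and the PACKAGE target would allow it on a package-info.java as well:

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Hypothetical sketch of a machine-readable maturity statement. */
class MaturityDemo {

    enum Level { EXPERIMENTAL, STABLE, UNMAINTAINED }

    /** Appears in javadoc (via @Documented) and can be read at runtime. */
    @Documented
    @Retention(RetentionPolicy.RUNTIME)
    @Target({ElementType.PACKAGE, ElementType.TYPE})
    @interface Maturity {
        Level level();
        boolean backCompat() default false;
    }

    /** Stand-in for a real class; annotated here so the demo fits in one file. */
    @Maturity(level = Level.STABLE, backCompat = true)
    static class ExampleHighlighter {}

    /** Renders the maturity statement of a type as a short line of text. */
    static String describe(Class<?> type) {
        Maturity m = type.getAnnotation(Maturity.class);
        if (m == null) {
            return type.getSimpleName() + ": no maturity statement";
        }
        String compat = m.backCompat() ? "back-compat guaranteed" : "no back-compat guarantee";
        return type.getSimpleName() + ": " + m.level() + ", " + compat;
    }
}
```

The same information could of course be written as a fixed paragraph in each package.html; the annotation only makes it harder to forget and easier to list on the web site.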

 Since we are just talking about packaging, why can't we have
 both/all of the above?  Individual jars, as well as one big jar,
 that contains everything (or, everything that has only dependencies
 we can ship, or everything that we deem important for an OOTB
 experience).  I, for one, find it annoying to have to go get
 snowball, analyzers, spellchecking and highlighting separate in most
 cases b/c I almost always use all of them and don't particularly
 care if there are extra classes in a JAR, but can appreciate the
 need to do that in specific instances where leaner versions are
 needed.  After all, the Ant magic to do all of this is pretty
 trivial given we just need to combine the various jars into a single
 jar (while keeping the indiv. ones)

+1

So I think the beginnings of a rough proposal are taking shape, for 3.0:

  1. Fix web site to give a better intro to Lucene's features, without
     exposing the false (to the Lucene consumer) core vs. contrib
     distinction

  2. When releasing, we make a single JAR holding core and contrib
     classes for a given area.  The final JAR files don't contain a
     core vs contrib distinction.

  3. We create a bundled JAR that has the common packages
     typically needed (index/search core, analyzers, queries,
     highlighter, spellchecker)
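
Points 2 and 3 both come down to combining jars while keeping the individual ones. In the real build this would presumably be an Ant target; purely as an illustration of the operation, here is a minimal Java sketch using java.util.jar. The class name JarBundler and the first-entry-wins handling of duplicate names are assumptions, not an existing Lucene tool:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;

/** Hypothetical helper: merges several module jars into one bundle jar. */
class JarBundler {

    /**
     * Copies the entries of each input jar into the output jar and returns
     * the number of entries written. If two inputs contain the same entry
     * name, the first one wins and later ones are skipped. Input manifests
     * are not carried over (JarInputStream consumes a leading manifest).
     */
    static int bundle(List<Path> inputs, Path output) throws IOException {
        Set<String> seen = new HashSet<>();
        byte[] buf = new byte[8192];
        int written = 0;
        try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(output))) {
            for (Path input : inputs) {
                try (JarInputStream in = new JarInputStream(Files.newInputStream(input))) {
                    JarEntry entry;
                    while ((entry = in.getNextJarEntry()) != null) {
                        if (!seen.add(entry.getName())) {
                            continue; // duplicate name, keep the earlier copy
                        }
                        out.putNextEntry(new JarEntry(entry.getName()));
                        int n;
                        // read() returns -1 at the end of the current entry
                        while ((n = in.read(buf)) != -1) {
                            out.write(buf, 0, n);
                        }
                        out.closeEntry();
                        written++;
                    }
                }
            }
        }
        return written;
    }
}
```

The input jars are left untouched, so the per-area jars and the bundled jar can ship side by side.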

Mike




Re: Modularization (was: Re: New flexible query parser)

2009-03-21 Thread DM Smith


On Mar 21, 2009, at 7:23 AM, Grant Ingersoll wrote:



On Mar 21, 2009, at 11:26 AM, Michael McCandless wrote:

What if (maybe for 3.0, since we can mix in 1.5 sources at that
point?) we change how Lucene is bundled, such that core queries and
contrib/query/* are in one JAR (lucene-query-3.0.jar)?  And
lucene-analyzers-3.0.jar would include contrib/analyzers/* and
org/apache/lucene/analysis/*.  And lucene-queryparser.jar, etc.



Since we are just talking about packaging, why can't we have both/all
of the above?  Individual jars, as well as one big jar, that  
contains everything (or, everything that has only dependencies we  
can ship, or everything that we deem important for an OOTB  
experience).  I, for one, find it annoying to have to go get  
snowball, analyzers, spellchecking and highlighting separate in most  
cases b/c I almost always use all of them and don't particularly  
care if there are extra classes in a JAR, but can appreciate the  
need to do that in specific instances where leaner versions are  
needed.  After all, the Ant magic to do all of this is pretty  
trivial given we just need to combine the various jars into a single  
jar (while keeping the indiv. ones)


If there is a sense that some contribs aren't maintained or aren't  
as good, then we need to ask ourselves whether they are:
1. stable and solid and don't need much care and are doing just fine  
thank you very much, or
2. in need of archiving, since they only serve as a distraction, or
3. in need of a new champion to maintain/promote them


From a user's perspective (i.e. mine):
I like the idea of having more jars. Specifically, I'd like a  
jar that was devoted solely to reading an index. Ultimately, I'd like  
it to work in a J2ME environment, but that is entirely a different  
thread.


There are parts that are needed for both reading and writing  
(directory, analyzers, tokens, and such). And there are parts dealing  
with writing.


There is a distinction between core and contrib regarding backward  
compatibility and quality (perhaps perceived quality).


To me the hardest part in wrapping my head around contrib is that I am  
not clear on why something is in contrib, what it can do, whether it  
is just an example, an alternate way of doing something or it is  
useful exactly as provided.


There are parts of contrib that I see as essential to my application  
(pretty much Grant's list), that I can use as is. While there are many  
different applications of Lucene, my guess is that a non-trivial  
application of Lucene needs to use various contribs. Some contribs are  
high quality and I think deserve the kind of attention that core gets.


What I'd like to see is not more stuff moving into core from contrib,  
but rather two levels of contrib: one recommended for use and  
maintained at the same level as core; the other to be used if you  
find it useful, and at your own risk. That is, as it is today.


I understand the desire to have one jar do it all. Nothing wrong with  
having that too, perhaps lucene-essentials.jar that holds all useful,  
recommended, highly maintained, well-explained stuff.


As to the whole question of the out-of-box experience (OOBE) for  
reviewers: today, it is "what does Lucene-core.jar do?" With more jars  
it would be "what does this core collection of jars do?" or "what does  
lucene-essentials.jar do?"


-- DM Smith








Re: Modularization

2009-03-21 Thread Michael Busch

On 3/21/09 1:36 PM, Michael McCandless wrote:

And I don't think the sudden separation of core vs contrib should
be so prominent (or even visible); it's really a detail of how we
manage source control.

When looking at the website I'd like to read that Lucene can do hit
highlighting, powerful query parsing, spell checking, analyze
different languages, etc.  I couldn't care less that some of these
happen to live under a contrib subdirectory somewhere in the source
control system.

   
OK, so I think we all agree about the packaging. But I believe it is
also important how the source code is organized. Maybe Lucene
consumers don't care too much, however, Lucene is an open source
project. So we also want to attract possible contributors with a
nicely organized code base. If there is a clear separation between
the different components on a source code level, becoming familiar
with Lucene as a contributor might not be so overwhelming.

Besides that, I think a one-to-one mapping between the packaging and
the source code has no disadvantages. (and it would certainly make the
build scripts easier!)

But I think we should still have main modules, such as core,
queries, analyzers, ... and separately e.g. sandbox modules?, for
the things currently in contrib that are experimental or, as Mark
called them, graveyard contribs :) ... even though we might then
as well ask the questions if we can not really bury the latter
ones...
 


Could we, instead, adopt some standard way (in the package javadocs)
of stating the maturity/activity/back compat policies/etc of a given
package?
   


This makes sense; e.g. we could release new modules as beta versions
(= use at own risk, no backwards-compatibility).

And if we start a new module (e.g. a GSoC project) we could exclude it
from a release easily if it's truly experimental and not in a
release-able state.

So I think the beginnings of a rough proposal are taking shape, for 3.0:

   1. Fix web site to give a better intro to Lucene's features, without
      exposing the false (to the Lucene consumer) core vs. contrib
      distinction

   2. When releasing, we make a single JAR holding core and contrib
      classes for a given area.  The final JAR files don't contain a
      core vs contrib distinction.

   3. We create a bundled JAR that has the common packages
      typically needed (index/search core, analyzers, queries,
      highlighter, spellchecker)

   

+1 to all three points.


Mike
