Re: Modularization
I'm really ambivalent about Maven. Having just converted Mahout to it, while also using it for some other projects and having used it quite a bit in the past, I am still on the fence (although I am mostly happy w/ it for Mahout). I keep being lured in by the promise of it (dep. management, convention over configuration, the POM, and most importantly, being able to point IntelliJ at it and have it set up my project structures), but then left hanging by the execution/bugginess of components.

For a simpler project structure like Mahout, it has worked pretty well except for the release stuff (which was a major pain and still isn't perfect), but for Lucene, I'm not so sure. Multimodule support in Maven is OK at best and we have a lot of modules in Lucene.

Having been on the Maven list a number of times in the past, my sense was that it was overwhelmed by the sheer number of requests for help and the community itself was not able to keep up, so getting help may be more difficult. Maybe that has changed since I was last on (about 1.5 years ago).

Customization work in Maven is also a pain, and I have yet to see a project of any significance that didn't require some customization, no matter how much you follow the conventions. For instance, Lucene's automated regression tests come to mind. And, I am willing to bet Lucene's release process would need to be customized.

Finally, we have a pretty large installed base. This is not something we should do lightly (not that anyone was suggesting otherwise). We have a working build system and we have pretty broad Ant knowledge in the project (including the guy who wrote the book on Ant).

To sum up, I'm -0.9. You might be able to convince me of using Maven, but the execution would really have to overcome a whole lot in order to do so.
-Grant

On Apr 9, 2009, at 6:48 PM, Earwin Burrfoot wrote:

> On Fri, Apr 10, 2009 at 02:25, Chris Hostetter hossman_luc...@fucit.org wrote:
>> Or just make it trivial to get all jars that fit a given profile w/o actually merging those jars into an uber-jar ... does maven's dependency management have anything like bundles or virtual packages so we could publish a lucene-all-analyzers POM that didn't have an actual lucene-all-analyzers.jar but listed dependencies on all of the individual jars?
>
> Maven can do this. Not sure transitive dependencies were meant to be used that way, but they definitely work like you want.
>
>> I think ideally the existing contrib/analysis would be broken up by language -- even if that means only 2 or 3 classes per jar -- but i don't deal with multilingual stuff much so i don't have much of an opinion ... perhaps the majority of our users that deal with non-english tend to deal with *lots* of languages so having a single multilingual-analysis module would be suitable.
>
> I bet lots of users dealing with a non-English language deal only with it, because they're providing local services. For example, we're working with a mix of Russian/English/Ukrainian.
>
> But my point really is that I don't see any adequate reason to have dozens of well-defined micromodules. People that care big time about dead weight in their distributions should use tools like Jar Jar Links anyway. (If I remember right, one of its abilities is to build an uberjar from a bunch of jars, dropping unused classes in the process)
>
> --
> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> ICQ: 104465785

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
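A minimal sketch of the jar-less "bundle" POM asked about in the quoted exchange, assuming standard Maven POM conventions. The artifact IDs (lucene-all-analyzers, lucene-analyzers-thai, lucene-analyzers-snowball) and the version are hypothetical, picked only for illustration; this is not an existing Lucene POM.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical dependency-only POM: "pom" packaging means no jar is built. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-all-analyzers</artifactId> <!-- hypothetical name -->
  <version>2.9-SNAPSHOT</version>               <!-- hypothetical version -->
  <packaging>pom</packaging>

  <!-- The only content: one dependency per individual analyzer jar. -->
  <dependencies>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-thai</artifactId> <!-- hypothetical -->
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-snowball</artifactId> <!-- hypothetical -->
      <version>${project.version}</version>
    </dependency>
  </dependencies>
</project>
```

A consumer would then declare a single dependency on this artifact with <type>pom</type>, and Maven would pull in the listed jars as transitive dependencies.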
Re: Modularization
: Then during build we can package up certain combinations.  I think
: there should be sub-kitchen-sink jars by area, eg a jar that contains
: all analyzers/tokenstreams/filters, all queries/filters, etc.

Or just make it trivial to get all jars that fit a given profile w/o actually merging those jars into an uber-jar ... does maven's dependency management have anything like bundles or virtual packages so we could publish a lucene-all-analyzers POM that didn't have an actual lucene-all-analyzers.jar but listed dependencies on all of the individual jars?

(FYI: Perl's CPAN has the concept of a Bundle that's just an empty distribution that depends on other distributions so you have a single reference point for installing them)

: So, how would you refactor the various sources of
: analyzers/tokenstream/tokenfilters we have today
: (src/java/org/apache/lucene/analysis/*, contrib/snowball/*,
: contrib/collation/* and contrib/analyzers/*)?  (Even contrib/memory
: has a neat PatternAnalyzer, that operates on a string using a regexp
: to get tokens out, that only now am I just discovering).

I think ideally the existing contrib/analysis would be broken up by language -- even if that means only 2 or 3 classes per jar -- but i don't deal with multilingual stuff much so i don't have much of an opinion ... perhaps the majority of our users that deal with non-english tend to deal with *lots* of languages so having a single multilingual-analysis module would be suitable.

: We also need to think about how this impacts our back-compat policy.
: EG when are we allowed to split up modules into sub-modules, or merge
: them.

splitting a module should always be fair game as long as the new module(s) maintain the same back compat policy ... it's not a burden to ask people to start using 2 jars instead of 1 jar (especially if we're already going to have an easy way to bundle jars up into uber-jars)

in theory merging modules should require that the new module adopt the most restrictive back-compat policy of the previous modules.

: Assuming there's general consensus on this "break core into modules"
: approach, I think the next step is to take an inventory of all of
: Lucene's classes and roughly divide them into proposed modules, and
: iterate on that?  Hoss do you want to take a first stab at that?

Heh. i'm not sure i could even answer the "want" question in the affirmative.

This is essentially a question of refactoring, and I think approaching this incrementally would be the best strategy ... either by first finding some low hanging fruit in core that could be extracted into a contrib easily (spans, query parser) or by restructuring the build system to put contribs and the demo on equal footing with core as modules and reassess as progress is made.

on a personal note: even if i wanted to lead this charge, i really can't right now ... folks may have noticed my involvement with lucene has been markedly lower in the last few months, i expect it to get even lower over the next 2 months before it will (hopefully) get higher.

-Hoss
Re: Modularization
: We've been doing this using just one source tree (like in Lucene), and
: instead ensuring the separation using the build system. We did not, like you

I think you are misunderstanding my previous comment ... Lucene-Java does not currently have one source tree in the sense that someone else suggested (i forget who) and i was commenting on ... at the moment Lucene has several source trees (src/java, src/demo, and each dir matching contrib/*/src).

Based on your examples, i believe we are suggesting the same thing: building separate modules from separate base directories (in your case foo/A and foo/B) with well defined dependencies.

-Hoss
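A rough sketch of what building separate modules from separate base directories could look like in Ant. The directory layout (foo/A/src/java, foo/B/src/java), the build.dir property, and the target names are assumptions made for illustration; none of this is Lucene's actual build file.

```xml
<project name="modules-sketch" default="B.jar" basedir=".">
  <!-- Hypothetical output location for classes and jars. -->
  <property name="build.dir" location="build"/>

  <!-- Module A compiles alone: javac only sees A's own source tree. -->
  <target name="A.jar">
    <mkdir dir="${build.dir}/A/classes"/>
    <javac srcdir="foo/A/src/java" destdir="${build.dir}/A/classes"/>
    <jar destfile="${build.dir}/A.jar" basedir="${build.dir}/A/classes"/>
  </target>

  <!-- Module B sees its own sources plus A.jar, its one declared dependency. -->
  <target name="B.jar" depends="A.jar">
    <mkdir dir="${build.dir}/B/classes"/>
    <javac srcdir="foo/B/src/java" destdir="${build.dir}/B/classes">
      <classpath>
        <pathelement location="${build.dir}/A.jar"/>
      </classpath>
    </javac>
    <jar destfile="${build.dir}/B.jar" basedir="${build.dir}/B/classes"/>
  </target>
</project>
```

Because each javac call has its own srcdir, a class in foo/B that references anything outside foo/B and A.jar simply fails to compile, which is the isolation being argued for.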
Re: Modularization
: If there are any serious moves to reorganize things, we should at least
: consider the benefits of maven.

+1

we can certainly do a lot to improve things just by refactoring stuff from core into contrib, and improving the visibility of contribs and documentation about contribs -- but if we're going to make massive changes to how things are built or how the source code is organized, then utilizing maven as the build system seems like an obvious choice to me.

(and i don't even like maven that much)

-Hoss
Re: Modularization
On Mon, Mar 30, 2009, Chris Hostetter wrote about Re: Modularization:
> code isolation (by directory hierarchy) is the best way i've seen to ensure modularization, and protect against inadvertent dependency bleeding.
> ...
> it's certainly possible to have all source code in a single directory hierarchy, and then rely on the build system to ensure you don't have unwarranted dependencies, but that requires you to express rules in the build system about what exactly the acceptable dependencies are, and it relies on everyone using the buildsystem correctly (misguided users of hand-holding IDEs could get very frustrated when the patches they submit violate rules of an overly complicated set of ant build files)

In a project I've been involved in, we are building a library with concerns similar to those Lucene now faces - on one hand you want to be a kitchen sink providing features for everyone, but on the other hand you want to create small jars and allow people who only need a small number of features to pick only some of the jars, instead of one huge jar.

We've been doing this using just one source tree (like in Lucene), and instead ensuring the separation using the build system. We did not, as you suggest, find this too complicated to set up or maintain. The only snag, of course, is that people who don't know how to write build.xml properly do not touch it, but it's exactly like people who don't know how to properly code in Java do not touch our source code :-) Having a hand-holding IDE is no replacement for knowing how to code, whether the code is Java source code or Ant configuration.

The idea of the Ant-based approach is to have the Ant build script compile each module's source separately, allowing it to refer only to pre-defined dependencies. This is instead of the more usual approach of compiling all the source code together (and thus allowing unwanted dependencies) and only collecting the jars from the compiled classes at the very end.
For example, let's say that we want to build three JARs of three packages, foo.A, foo.B, and foo.C. Let's say that foo.A is stand-alone (doesn't need the other source code to compile), and foo.B depends on stuff from foo.A (and must not depend on stuff from foo.C).

In that case, I would first create an Ant rule to build a jar from the sources of foo.A, and them alone (which ensures that foo.A doesn't accidentally depend on foo.B or foo.C). Note the "includes" argument to javac, and the separate destdir:

    <target name="A.compile">
      <sequential>
        <mkdir dir="${build.classes}/A/"/>
        <!-- includes limits what is compiled; the empty sourcepath stops
             javac from silently picking up other sources under ${src} -->
        <javac srcdir="${src}" destdir="${build.classes}/A"
               includes="foo/A/**/*.java" sourcepath="" listfiles="no">
        </javac>
      </sequential>
    </target>
    <target name="A.jar" depends="A.compile">
      <sequential>
        <mkdir dir="${build.jars}/"/>
        <jar destfile="${build.jars}/A.jar" basedir="${build.classes}/A">
        </jar>
      </sequential>
    </target>

Now, we do a similar thing for B.jar - when compiling it, we allow the compiler to look at only the source code of foo.B, and at the previously built A.jar. It cannot, for example, accidentally use stuff from foo.C:

    <target name="B.compile" depends="A.jar">
      <sequential>
        <mkdir dir="${build.classes}/B/"/>
        <javac srcdir="${src}" destdir="${build.classes}/B"
               includes="foo/B/**/*.java" sourcepath="" listfiles="no">
          <!-- the only dependency B is allowed to see -->
          <classpath>
            <pathelement location="${build.jars}/A.jar"/>
          </classpath>
        </javac>
      </sequential>
    </target>
    <target name="B.jar" depends="B.compile">
      <sequential>
        <mkdir dir="${build.jars}/"/>
        <jar destfile="${build.jars}/B.jar" basedir="${build.classes}/B">
        </jar>
      </sequential>
    </target>

Putting my money (or rather, time) where my mouth is, is there interest in my trying to write a build script for Lucene to demonstrate these ideas in action?

> FWIW: having lots/more of very small, isolated, hierarchies also wouldn't hinder any attempts at having kitchen-sink or essential jars -- combining the classes from lots of little isolated code trees is a lot easier than extracting a few classes from one big code tree.
But I think you've swept one issue under the rug: what happens when the hierarchies aren't completely isolated? For example, an analyzer package obviously depends on some Lucene core package. Or the query parser package depends on the wildcard query package (for example). You need to specify these dependencies somehow, and allow only them. How do you do that? Via an Eclipse .project file in each of the small hierarchies? How is this any better than having an Ant build file? How would anyone not using Eclipse use this sort of setup?

Another problem
Re: Modularization
> we can have fine grained modularity w/o having second class citizens, and we can achieve it without needing to make radical changes -- but putting more stuff into core isn't going to help us get there.

I totally agree.

However, just to stir the pot (and assuming you are well rested), I'll drop your "radical changes" constraint and suggest that maven (while it can be a PIA) makes this kind of modularity trivial. With maven we could easily have:

  /core
  /modules/xxx

Each module could easily declare:

 * its dependencies on other modules
 * the required JRE
 * document its level of maturity

And there are good off the shelf tools to report the dependency graphs, etc, etc.

If there are any serious moves to reorganize things, we should at least consider the benefits of maven.

ryan
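As a sketch of what such a module declaration might look like, assuming a module under /modules/xxx that depends on the core artifact and needs Java 1.5. The parent coordinates, the lucene-xxx artifact ID and the version are illustrative assumptions. Maven has no standard element for "maturity", so it appears here only as free text in the description.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>

  <!-- Assumed parent POM at the repository root, two levels up. -->
  <parent>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-parent</artifactId>
    <version>2.9-SNAPSHOT</version> <!-- hypothetical version -->
    <relativePath>../../pom.xml</relativePath>
  </parent>

  <artifactId>lucene-xxx</artifactId> <!-- hypothetical module name -->
  <packaging>jar</packaging>

  <!-- Maturity is documented as plain text; this is a convention, not a Maven feature. -->
  <description>Maturity: experimental, no back-compat guarantee yet</description>

  <!-- Dependencies on other modules. -->
  <dependencies>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>${project.version}</version>
    </dependency>
  </dependencies>

  <!-- Required JRE level, expressed through the compiler plugin. -->
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.5</source>
          <target>1.5</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
```

With declarations like this in place, running mvn dependency:tree in a module is one off-the-shelf way to print its dependency graph.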
Re: Modularization
+1 on maven, and I volunteer to aid in the creation of the maven project files (pom's)

On Wed, Apr 1, 2009 at 11:02 AM, Ryan McKinley ryan...@gmail.com wrote:
>> we can have fine grained modularity w/o having second class citizens, and we can achieve it without needing to make radical changes -- but putting more stuff into core isn't going to help us get there.
>
> I totally agree.
>
> However, just to stir the pot (and assuming you are well rested), I'll drop your "radical changes" constraint and suggest that maven (while it can be a PIA) makes this kind of modularity trivial. With maven we could easily have:
>
>   /core
>   /modules/xxx
>
> Each module could easily declare:
>
>  * its dependencies on other modules
>  * the required JRE
>  * document its level of maturity
>
> And there are good off the shelf tools to report the dependency graphs, etc, etc.
>
> If there are any serious moves to reorganize things, we should at least consider the benefits of maven.
>
> ryan

--
Douglas Campos
Theros Consulting
+55 11 9267 4540
+55 11 3020 8168
Re: Modularization
Lucene is in fact already available through maven. POMs do exist; all that is left is to find who manages and releases them.

On Thu, Apr 2, 2009 at 01:40, Douglas Campos doug...@theros.info wrote:
> +1 on maven, and I volunteer to aid in the creation of the maven project files (pom's)
>
> On Wed, Apr 1, 2009 at 11:02 AM, Ryan McKinley ryan...@gmail.com wrote:
>>> we can have fine grained modularity w/o having second class citizens, and we can achieve it without needing to make radical changes -- but putting more stuff into core isn't going to help us get there.
>>
>> I totally agree.
>>
>> However, just to stir the pot (and assuming you are well rested), I'll drop your "radical changes" constraint and suggest that maven (while it can be a PIA) makes this kind of modularity trivial. With maven we could easily have:
>>
>>   /core
>>   /modules/xxx
>>
>> Each module could easily declare:
>>
>>  * its dependencies on other modules
>>  * the required JRE
>>  * document its level of maturity
>>
>> And there are good off the shelf tools to report the dependency graphs, etc, etc.
>>
>> If there are any serious moves to reorganize things, we should at least consider the benefits of maven.
>>
>> ryan
>
> --
> Douglas Campos
> Theros Consulting
> +55 11 9267 4540
> +55 11 3020 8168

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
Re: Modularization
I hadn't noticed, as I looked first for the build.xml on trunk. As we are already using maven, Ryan's approach is the way to go, IMHO

On Wed, Apr 1, 2009 at 7:00 PM, Earwin Burrfoot ear...@gmail.com wrote:
> Lucene is in fact already available through maven. poms do exist, all what is left is to find who manages them and releases.
>
> On Thu, Apr 2, 2009 at 01:40, Douglas Campos doug...@theros.info wrote:
>> +1 on maven, and I volunteer to aid in the creation of the maven project files (pom's)
>>
>> On Wed, Apr 1, 2009 at 11:02 AM, Ryan McKinley ryan...@gmail.com wrote:
>>>> we can have fine grained modularity w/o having second class citizens, and we can achieve it without needing to make radical changes -- but putting more stuff into core isn't going to help us get there.
>>>
>>> I totally agree.
>>>
>>> However, just to stir the pot (and assuming you are well rested), I'll drop your "radical changes" constraint and suggest that maven (while it can be a PIA) makes this kind of modularity trivial. With maven we could easily have:
>>>
>>>   /core
>>>   /modules/xxx
>>>
>>> Each module could easily declare:
>>>
>>>  * its dependencies on other modules
>>>  * the required JRE
>>>  * document its level of maturity
>>>
>>> And there are good off the shelf tools to report the dependency graphs, etc, etc.
>>>
>>> If there are any serious moves to reorganize things, we should at least consider the benefits of maven.
>>>
>>> ryan
>>
>> --
>> Douglas Campos
>> Theros Consulting
>> +55 11 9267 4540
>> +55 11 3020 8168
>
> --
> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> ICQ: 104465785

--
Douglas Campos
Theros Consulting
+55 11 9267 4540
+55 11 3020 8168
Re: Modularization
> maturity, and their back compat commitments. The demo and getting started guides could also be expanded to reference the contrib jars that contain code many people may want to reuse...

Here's an idea.

Each contrib is really a project unto its own. And any project, I suggest, ought to have its own demo program, together maybe with a small write-up describing the idea behind the contrib and what the demo does.

So to get the ball rolling, how about adopting some such documentation policy for *future* contribs as a pseudo-requirement for making it into the official release?

Cheers,
-Babak

PS this is not a swipe at any upcoming contrib (TrieUtils: the documentation there is really good :)

On Mon, Mar 30, 2009 at 5:31 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
> After stirring things up, and then being off-list for ~10 days, I'm in an interesting position coming back to this thread and seeing the discussion *after* it essentially ended, with a lot of semi-consensus but no clear sense of hard and fast resolution or plan of action.
>
> FWIW, here are the notes i made based on reading the thread about the various sentiments i noticed expressed (whether i agree with them or not) in order to try and get a handle on what had been discussed. some of these were the opinion of a single person and i've paraphrased, others are my generalization of similar comments made by various people...
>
>  - contrib has a bad rap
>  - widely varying degrees of quality/stability in contrib code, hard to get people to rely on the good ones because of the less good ones
>  - many people want a good, out of the box, kitchen sink experience (ie: one monolithic jar containing all the essentials)
>  - need easy discoverability of all things of a given type (ie: all queries, all filters, all analyzers, etc...) .. ie: combined javadocs.
>  - need easy installation of all things of a given type (ie: a jar containing all types of queries, a jar containing all types of analyzers, etc...)
>  - still need to deal with contribs that have external dependencies
>  - still need to deal with contribs that require future versions of the language (Java1.7 when core is still 1.5 compat)
>  - users need better guidance about why something is a contrib (additional functionality, alternate functionality, example of use, tool, etc...)
>  - while we should maintain/increase modularization, documentation should make features of contribs more prominent without stressing the isolation resulting from code modularization.
>  - we should merge all contrib and core code into a unified src/ tree, and make the packaging independent of the physical location in svn (ie: jars based on java package, not directory)
>
> While I'm mostly in favor of all of these sentiments, and think it's really just a question of how to go about it, the last one is actually something i'm pretty strongly opposed to -- I think the best way forward is to have lots of small, well isolated source trees.
>
> code isolation (by directory hierarchy) is the best way i've seen to ensure modularization, and protect against inadvertent dependency bleeding. If we want to be able to produce small jars targeted at specific goals, and we want o.l.a.foo.FooClass to be in foo.jar and o.l.a.bar.BarClass to be in bar.jar then we shouldn't have src/java/o/l/a/foo/FooClass.java and src/java/o/l/a/bar/BarClass.java -- doing so makes it way too easy for inadvertent dependencies to crop up that make FooClass depend on BarClass, and thus make it impossible to use foo.jar without also using bar.jar at runtime.
> it's certainly possible to have all source code in a single directory hierarchy, and then rely on the build system to ensure you don't have unwarranted dependencies, but that requires you to express rules in the build system about what exactly the acceptable dependencies are, and it relies on everyone using the buildsystem correctly (misguided users of hand-holding IDEs could get very frustrated when the patches they submit violate rules of an overly complicated set of ant build files)
>
> FWIW: having lots/more of very small, isolated, hierarchies also wouldn't hinder any attempts at having kitchen-sink or essential jars -- combining the classes from lots of little isolated code trees is a lot easier than extracting a few classes from one big code tree.
>
> One underlying assumption that seems to have permeated the existing discussion (without ever being explicitly stated) is the idea that most of what currently lives in src/java is the core and would be a single module ... personally i'd like to challenge that assumption. I'd like to suggest that besides obvious things that could be refactored out into other modules (span queries, queryparser) there are lots of additional ways that src/java could be sliced...
>
>  - interfaces and abstract classes and concrete classes for reading an index in one index-api.jar (ie: Directory but no FSDirectory; IndexReader but not MultiReader)
>  - ditto
Re: Modularization
On Mon, Mar 30, 2009 at 7:31 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
> code isolation (by directory hierarchy) is the best way i've seen to ensure modularization, and protect against inadvertent dependency bleeding.

OK I agree this (divorced top-level directories) is a great way to enforce modularity and we should use that.

It seems the toplevel directory structure could still have subdirs, eg:

  analyzers languages th es fr snowball? ... standard collation

and:

  search searcher queries span function

And in those leaf subdirs above would be the package subdir structure (src/{java,test}/org/apache/lucene/...).

Though svn checkout and svn update and svn diff are going to take quite a bit longer with this switch...

> One underlying assumption that seems to have permeated the existing discussion (without ever being explicitly stated) is the idea that most of what currently lives in src/java is the core and would be a single module ... personally i'd like to challenge that assumption. I'd like to suggest that besides obvious things that could be refactored out into other modules (span queries, queryparser) there are lots of additional ways that src/java could be sliced...

+1: I very much agree what is now called core should be refactored as a number of modules.

So the general new proposal here seems to be: let's break up src/java/* into separate modules (each under its own toplevel directory), just like contrib/* is today. And move Lucene to an a la carte model for what we now call core. (what we now call contrib is already a la carte today).

We would then do away with the top level core vs contrib, and everything would simply be modules, where each module has metadata/javadocs stating:

  * JRE version required
  * What external dependencies (including dependencies to other Lucene modules) are needed
  * Some measure of maturity
  * Back-compat policy
  * CHANGES

Then during build we can package up certain combinations.
I think there should be sub-kitchen-sink jars by area, eg a jar that contains all analyzers/tokenstreams/filters, all queries/filters, etc.

This does make the future decision process far easier. Rather than have a capricious and ill-defined "does it go into core vs contrib" question, we now simply decide if it goes into an existing module or makes a new one.

> Even without making radical changes to the way our source code is organized, a lot of improvements could be made by having better documentation.

Agreed. I think this is actually somewhat orthogonal, though should follow more naturally once Lucene is simply a collection of modules.

I would think we present "all" and per-module sets of javadocs, plus javadocs aggregated based on how the JARs aggregate? (Ie I could browse the kitchen-sink javadocs, the all analyzers javadocs, or the thai analyzers only javadocs).

> (ie: a new ThaiStemmerFilter could be added to an existing thai-analysis module)

So, how would you refactor the various sources of analyzers/tokenstream/tokenfilters we have today (src/java/org/apache/lucene/analysis/*, contrib/snowball/*, contrib/collation/* and contrib/analyzers/*)? (Even contrib/memory has a neat PatternAnalyzer, that operates on a string using a regexp to get tokens out, that only now am I just discovering).

We also need to think about how this impacts our back-compat policy. EG when are we allowed to split up modules into sub-modules, or merge them.

Assuming there's general consensus on this "break core into modules" approach, I think the next step is to take an inventory of all of Lucene's classes and roughly divide them into proposed modules, and iterate on that? Hoss do you want to take a first stab at that?

Mike
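One way the "package up certain combinations" step could be sketched in Ant, assuming the per-module jars have already been built into one directory. The build.jars location and the jar naming pattern (lucene-analyzers-*.jar, lucene-all-analyzers.jar) are hypothetical names used only for illustration.

```xml
<project name="bundle-sketch" default="all-analyzers.jar" basedir=".">
  <!-- Hypothetical directory holding the already-built per-module jars. -->
  <property name="build.jars" location="build/jars"/>

  <!-- Merge the contents of every per-module analyzer jar into one
       sub-kitchen-sink jar; the module jars themselves are left untouched. -->
  <target name="all-analyzers.jar">
    <jar destfile="${build.jars}/lucene-all-analyzers.jar">
      <zipgroupfileset dir="${build.jars}" includes="lucene-analyzers-*.jar"/>
    </jar>
  </target>
</project>
```

The same pattern with a different includes value would give an "all queries" jar or a full kitchen-sink jar, so the aggregation stays a packaging detail and does not affect how the modules are compiled.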
Re: Modularization
After stirring things up, and then being off-list for ~10 days, I'm in an interesting position coming back to this thread and seeing the discussion *after* it essentially ended, with a lot of semi-consensus but no clear sense of hard and fast resolution or plan of action.

FWIW, here are the notes i made based on reading the thread about the various sentiments i noticed expressed (whether i agree with them or not) in order to try and get a handle on what had been discussed. some of these were the opinion of a single person and i've paraphrased, others are my generalization of similar comments made by various people...

 - contrib has a bad rap
 - widely varying degrees of quality/stability in contrib code, hard to get people to rely on the good ones because of the less good ones
 - many people want a good, out of the box, kitchen sink experience (ie: one monolithic jar containing all the essentials)
 - need easy discoverability of all things of a given type (ie: all queries, all filters, all analyzers, etc...) .. ie: combined javadocs.
 - need easy installation of all things of a given type (ie: a jar containing all types of queries, a jar containing all types of analyzers, etc...)
 - still need to deal with contribs that have external dependencies
 - still need to deal with contribs that require future versions of the language (Java1.7 when core is still 1.5 compat)
 - users need better guidance about why something is a contrib (additional functionality, alternate functionality, example of use, tool, etc...)
 - while we should maintain/increase modularization, documentation should make features of contribs more prominent without stressing the isolation resulting from code modularization.
 - we should merge all contrib and core code into a unified src/ tree, and make the packaging independent of the physical location in svn (ie: jars based on java package, not directory)

While I'm mostly in favor of all of these sentiments, and think it's really just a question of how to go about it, the last one is actually something i'm pretty strongly opposed to -- I think the best way forward is to have lots of small, well isolated source trees.

code isolation (by directory hierarchy) is the best way i've seen to ensure modularization, and protect against inadvertent dependency bleeding. If we want to be able to produce small jars targeted at specific goals, and we want o.l.a.foo.FooClass to be in foo.jar and o.l.a.bar.BarClass to be in bar.jar then we shouldn't have src/java/o/l/a/foo/FooClass.java and src/java/o/l/a/bar/BarClass.java -- doing so makes it way too easy for inadvertent dependencies to crop up that make FooClass depend on BarClass, and thus make it impossible to use foo.jar without also using bar.jar at runtime.

it's certainly possible to have all source code in a single directory hierarchy, and then rely on the build system to ensure you don't have unwarranted dependencies, but that requires you to express rules in the build system about what exactly the acceptable dependencies are, and it relies on everyone using the buildsystem correctly (misguided users of hand-holding IDEs could get very frustrated when the patches they submit violate rules of an overly complicated set of ant build files)

FWIW: having lots/more of very small, isolated, hierarchies also wouldn't hinder any attempts at having kitchen-sink or essential jars -- combining the classes from lots of little isolated code trees is a lot easier than extracting a few classes from one big code tree.
One underlying assumption that seems to have permeated the existing discussion (without ever being explicitly stated) is the idea that most of what currently lives in src/java is the core and would be a single module ... personally i'd like to challenge that assumption. I'd like to suggest that besides obvious things that could be refactored out into other modules (span queries, queryparser) there are lots of additional ways that src/java could be sliced...

 - interfaces and abstract classes and concrete classes for reading an index in one index-api.jar (ie: Directory but no FSDirectory; IndexReader but not MultiReader)
 - ditto for creating/updating an index in one index-update.jar (ie: IndexWriter, TokenStream, Tokenizer, TokenFilter, Analyzer but not any impls of the last 3)
 - ditto for searching in index-search.jar (ie: Searcher, Searchable, HitCollector, Query ... but not any concrete subclasses)
 - simple-analysis.jar (SimpleAnalyzer, WhitespaceAnalyzer, LetterTokenizer, LowercaseFilter, etc...)
 - english-analysis.jar (StandardAnalyzer, etc...)
 - primitive-queries.jar (TermQuery, BooleanQuery, MatchAllDocsQuery, MultiTermQuery, etc...)
 - range-queries.jar (RangeQuery, RangeFilter, ConstantScoreRangeQuery)

...etc...

The crux of my point being that what we think of today as the lucene core is actually kind of big and bloated, and already has *a* kitchen sink thrown in -- it's just not necessarily
Re: Modularization
On 3/31/09 1:31 AM, Chris Hostetter wrote:
> code isolation (by directory hierarchy) is the best way i've seen to ensure modularization, and protect against inadvertent dependency bleeding.

+1. That's actually what I meant with one-to-one mapping between the packaging and the source code (I didn't say that as elaborately as you :) )

To make jars based on packages rather than directories would be the wrong decision, I strongly believe, for the reasons you laid out nicely here.

-Michael

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Modularization
Michael Busch busch...@gmail.com wrote:

>> And I don't think the sudden separation of core vs contrib should be so prominent (or even visible); it's really a detail of how we manage source control. When looking at the website I'd like to read that Lucene can do hit highlighting, powerful query parsing, spell checking, analyze different languages, etc. I couldn't care less that some of these happen to live under a contrib subdirectory somewhere in the source control system.
>
> OK, so I think we all agree about the packaging. But I believe it is also important how the source code is organized. Maybe Lucene consumers don't care too much; however, Lucene is an open source project, so we also want to attract possible contributors with a nicely organized code base. If there is a clear separation between the different components on a source code level, becoming familiar with Lucene as a contributor might not be so overwhelming.

+1

We want the source code to be well organized: consumability by Lucene developers (not just Lucene users) is also important for Lucene's future growth.

> Besides that, I think a one-to-one mapping between the packaging and the source code has no disadvantages. (and it would certainly make the build scripts easier!)

Right. So, towards that... why even break out contrib vs core, in source control? Can't we simply migrate contrib/* into core, in the right places?

>> Could we, instead, adopt some standard way (in the package javadocs) of stating the maturity/activity/back compat policies/etc of a given package?
>
> This makes sense; e.g. we could release new modules as beta versions (= use at own risk, no backwards-compatibility).

In fact we already have a 2.9 Jira issue opened to better document the back-compat/JDK version requirements of all packages. I think, like we've done with core lately when a new feature is added, we could have the default assumption be full back compatibility, but then those classes/methods/packages that are very new and may change simply say so clearly in their javadocs.

> And if we start a new module (e.g. a GSoC project) we could exclude it from a release easily if it's truly experimental and not in a release-able state.

Right.

>> So I think the beginnings of a rough proposal are taking shape, for 3.0:
>>
>> 1. Fix web site to give a better intro to Lucene's features, without exposing the false (to the Lucene consumer) core vs. contrib distinction
>> 2. When releasing, we make a single JAR holding core and contrib classes for a given area. The final JAR files don't contain a core vs contrib distinction.
>> 3. We create a bundled JAR that has the common packages typically needed (index/search core, analyzers, queries, highlighter, spellchecker)
>
> +1 to all three points.

OK. So I guess I'm proposing adding:

4. Move contrib/* under src/java/*, updating the javadocs to state back compatibility promises per class/package.

I think net/net this'd be a great simplification?

Mike
Re: Modularization
On Mon, Mar 23, 2009 at 11:10 AM, Michael McCandless luc...@mikemccandless.com wrote:
> 4. Move contrib/* under src/java/*, updating the javadocs to state back compatibility promises per class/package.

- contrib has always had a lower bar and stuff was committed under that lower bar - there should be no blanket promotion.
- contrib items may have different dependencies... putting it all under the same source root can make a developer's job harder
- many contrib items are less related to lucene-java core indexing and searching... if there is no contrib, then they don't belong in the lucene-java project at all.
- right now it's clear - core can't have dependencies on non-core classes. If everything is stuck in the same source tree, that goes away.

I think there are a lot of benefits to continue considering very carefully if something is core or not.

-Yonik
Re: Modularization
Are you arguing for no change, Yonik? I agree with all of your points in any case.

What appeals to me most so far is:

Take the best of contrib and up its status to something like modules. Equal to core, different requirements, dependencies, etc. Perhaps take queryparser out of core, but frankly I wouldn't mind just leaving core as it is.

Reintroduce the sandbox (I believe contrib was the sandbox, part of the lower bar history) and put lesser contrib there, along with new stuff that's unproven. Contrib doesn't appeal to me as a name anyway.

That would give core, modules, and the sandbox (perhaps sandbox is a module?). Things could move from sandbox to core or the modules. Modules get new requirements similar to core - back compat guarantees and changes.txt per module.

Yonik Seeley wrote:
> On Mon, Mar 23, 2009 at 11:10 AM, Michael McCandless luc...@mikemccandless.com wrote:
>> 4. Move contrib/* under src/java/*, updating the javadocs to state back compatibility promises per class/package.
>
> - contrib has always had a lower bar and stuff was committed under that lower bar - there should be no blanket promotion.
> - contrib items may have different dependencies... putting it all under the same source root can make a developer's job harder
> - many contrib items are less related to lucene-java core indexing and searching... if there is no contrib, then they don't belong in the lucene-java project at all.
> - right now it's clear - core can't have dependencies on non-core classes. If everything is stuck in the same source tree, that goes away.
>
> I think there are a lot of benefits to continue considering very carefully if something is core or not.
>
> -Yonik

--
- Mark
http://www.lucidimagination.com
Re: Modularization
- contrib has always had a lower bar and stuff was committed under that lower bar - there should be no blanket promotion. - contrib items may have different dependencies... putting it all under the same source root can make a developers job harder - many contrib items are less related to lucene-java core indexing and searching... if there is no contrib, then they don't belong in the lucene-java project at all. - right now it's clear - core can't have dependencies on non-core classes. If everything is stuck in the same source tree, that goes away. Adding to this, afaik contribs have no java 1.4 restriction. If you merge them into the core, you must either enforce it for contribs, or lift it from the core. I think both variants may be a reason for several heart attacks :) One could argue that five years after 1.5 was released Lucene is going to use it, so the point is no longer relevant. Sorry, 1.7 is just behind the door. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Modularization
Earwin Burrfoot wrote: - contrib has always had a lower bar and stuff was committed under that lower bar - there should be no blanket promotion. - contrib items may have different dependencies... putting it all under the same source root can make a developers job harder - many contrib items are less related to lucene-java core indexing and searching... if there is no contrib, then they don't belong in the lucene-java project at all. - right now it's clear - core can't have dependencies on non-core classes. If everything is stuck in the same source tree, that goes away. Adding to this, afaik contribs have no java 1.4 restriction. If you merge them into the core, you must either enforce it for contribs, or lift it from the core. I think both variants may be a reason for several heart attacks :) One could argue that five years after 1.5 was released Lucene is going to use it, so the point is no longer relevant. Sorry, 1.7 is just behind the door. I think we are considering this for Lucene 3.0 (should be the release after next) which will allow Java 1.5. - Mark - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Modularization
On Mon, Mar 23, 2009 at 22:13, Mark Miller markrmil...@gmail.com wrote: Earwin Burrfoot wrote: - contrib has always had a lower bar and stuff was committed under that lower bar - there should be no blanket promotion. - contrib items may have different dependencies... putting it all under the same source root can make a developers job harder - many contrib items are less related to lucene-java core indexing and searching... if there is no contrib, then they don't belong in the lucene-java project at all. - right now it's clear - core can't have dependencies on non-core classes. If everything is stuck in the same source tree, that goes away. Adding to this, afaik contribs have no java 1.4 restriction. If you merge them into the core, you must either enforce it for contribs, or lift it from the core. I think both variants may be a reason for several heart attacks :) One could argue that five years after 1.5 was released Lucene is going to use it, so the point is no longer relevant. Sorry, 1.7 is just behind the door. I think we are considering this for Lucene 3.0 (should be the release after next) which will allow Java 1.5. So where are you going to put 1.6 and 1.7 contribs? -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Modularization
>> I think we are considering this for Lucene 3.0 (should be the release after next) which will allow Java 1.5.
>
> So where are you going to put 1.6 and 1.7 contribs?

This is a good point: core Lucene must remain on old JREs, but we should not force all contrib packages to do so.

> - contrib has always had a lower bar and stuff was committed under that lower bar - there should be no blanket promotion.

OK so that was the past, and I agree. I assume by this you're also advocating that going forward this is an ongoing reason to put something into contrib? I agree with that. Ie, if a contribution is made, but it's not clear the quality is up to core's standards, I would much rather have some place to commit it (contrib) than to reject it, because once it has a home here, it has a chance to gain interest, grow, improve, etc.

But: do you think, for this reason, the web site should continue to present the dichotomy?

> - contrib items may have different dependencies... putting it all under the same source root can make a developer's job harder

That's a good criterion for leaving something in contrib.

> - many contrib items are less related to lucene-java core indexing and searching... if there is no contrib, then they don't belong in the lucene-java project at all.

But most contrib packages are very related to Lucene. Though I agree some contrib packages likely have very narrow appeal/usage (eg, contrib/db, for using BDB as the raw store for an index). And I agree (as above): I would like to have somewhere for contributions to go, rather than reject them.

> - right now it's clear - core can't have dependencies on non-core classes. If everything is stuck in the same source tree, that goes away.

Well... this gets to Hoss's motivation, which I appreciate, to keep the core tiny. But that's just good software design and you don't need a divorced directory structure to achieve that.

> I think there are a lot of benefits to continue considering very carefully if something is core or not.

I agree, but at least we need some clear criteria so the future decision process is more straightforward. Towards that... it seems like there are good reasons why something should be put into contrib:

* It uses a version of JDK higher than what core can allow
* It has external dependencies
* Its quality is debatable (or at least not proven)
* It's of somewhat narrow usage/interest (eg: contrib/bdb)

But I don't think "it doesn't have to be in core" (the software modularity goal) is the right reason to put something in contrib.

Getting back to the original topic: Trie(Numeric)RangeFilter runs on JDK 1.4, has no external dependencies, looks to be high quality, and likely will have wide appeal. Doesn't it belong in core?

Mike
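The four criteria listed above amount to a simple "any of these is true" test. As a rough illustration only, they can be written down as a predicate; the `ContribCandidate` class and its field names are hypothetical, and the values used in `main` just restate the description of Trie(Numeric)RangeFilter given in the message.

```java
/** Hypothetical sketch: the four "belongs in contrib" criteria as one predicate. */
final class ContribCandidate {
    private final int requiredJdkMinor;          // 4 means JDK 1.4, 5 means JDK 1.5, ...
    private final boolean externalDependencies;  // needs jars that core does not ship
    private final boolean qualityProven;         // judged to be up to core's standards
    private final boolean narrowInterest;        // only useful to a small audience

    ContribCandidate(int requiredJdkMinor, boolean externalDependencies,
                     boolean qualityProven, boolean narrowInterest) {
        this.requiredJdkMinor = requiredJdkMinor;
        this.externalDependencies = externalDependencies;
        this.qualityProven = qualityProven;
        this.narrowInterest = narrowInterest;
    }

    /** True if at least one criterion argues for contrib, given the JDK level core allows. */
    boolean belongsInContrib(int coreJdkMinor) {
        return requiredJdkMinor > coreJdkMinor
                || externalDependencies
                || !qualityProven
                || narrowInterest;
    }

    public static void main(String[] args) {
        // JDK 1.4, no external dependencies, high quality, wide appeal.
        ContribCandidate trieRange = new ContribCandidate(4, false, true, false);
        System.out.println("contrib? " + trieRange.belongsInContrib(4));
    }
}
```

Note what is deliberately absent from the predicate: "it doesn't have to be in core" is not one of the inputs.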
Re: Modularization
On 23-Mar-09, at 2:41 PM, Michael McCandless wrote:
> I agree, but at least we need some clear criteria so the future decision process is more straightforward. Towards that... it seems like there are good reasons why something should be put into contrib:
>
> * It uses a version of JDK higher than what core can allow
> * It has external dependencies
> * Its quality is debatable (or at least not proven)
> * It's of somewhat narrow usage/interest (eg: contrib/bdb)
>
> But I don't think "it doesn't have to be in core" (the software modularity goal) is the right reason to put something in contrib.

Agreed. I don't think that building on the existing 'contrib' is the way to go. Frequently-used, high-quality components should be more properly part of Lucene, whether that means that they move to core, or into a new blessed modules section.

> Getting back to the original topic: Trie(Numeric)RangeFilter runs on JDK 1.4, has no external dependencies, looks to be high quality, and likely will have wide appeal. Doesn't it belong in core?

+1. It is important that Lucene come blessed with very good quality defaults. Fast range queries are a common requirement.

Similarly, I wouldn't be happy to have a new, wicked QueryParser be relegated to contrib where it is unlikely to be found by non-savvy users. At the very least, I agree with Michael that it should be findable in the same place.

It does make sense to separate the machinery/building blocks (base Query, Weight, Scorer, Filter classes, Similarity interface, etc.) from the Query/Filter implementations that use them. But whether this is done by putting them in separate directories or via a global core/modules distinction seems unimportant.

-Mike
Modularization (was: Re: New flexible query parser)
On 3/21/09 12:27 AM, Michael Busch wrote:
> +1. I'd love to see Lucene going into such a direction.
>
> However, I'm a little worried about contrib's reputation. I think it contains components with differing levels of activity, maturity and support. So maybe instead of moving things from core into contrib to achieve the goal you mentioned, we could create a new folder named e.g. 'components', which will contain stuff that we claim is as stable, mature and supported as the core, just packaged into separate jars. Those jars should then only have dependencies on the core, but not on each other. They would also follow the same backwards-compatibility and other requirements as the core. Thoughts?

I guess something very similar has been proposed and discussed here: http://www.nabble.com/Moving-SweetSpotSimilarity-out-of-contrib-to19267437.html#a19320894 (same link that Hoss sent while having his deja vu)...

-Michael
Re: Modularization (was: Re: New flexible query parser)
I think we are mixing up source code modularity with bundling/packaging.

Honestly, I would not mind much where the source code lives in svn, so long as a developer, upon downloading Lucene 2.9, can go to *one* place (javadocs) for Lucene's queries and filters and see {Int,Long}NumberRangeFilter in there.

We are not there today: a developer must first realize there's a whole separate place to look for other queries (contrib/queries). Then the developer browses that and likely becomes confused/misled by what TrieRangeQuery means (is it a letter trie?).

My goal here is Lucene's consumability -- when someone new says "hey I heard about this great search library called Lucene; let me go try it out" I want that first impression to be as solid as possible. I think this is very important for growing Lucene's community.

This is why out of the box defaults are so crucial (eg changing IW from flushing every 10 docs to every 16 MB gained sizable throughput). How many times have we seen a review, article, blog post, etc., comparing Lucene to other search libraries only to incorrectly complain that "Lucene can't do XYZ" or "Lucene's indexing performance is poor", etc, because they didn't dig in to learn all the tunings/options/tricks we all know you are supposed to do? (It frustrates me to no end when this happens.) This then hurts Lucene's adoption because others read such articles and conclude Lucene is a non-starter.

We all ought to be concerned with Lucene's adoption growth with time (I am), and first-impression consumability / out of the box defaults are big drivers of that.

What if (maybe for 3.0, since we can mix in 1.5 sources at that point?) we change how Lucene is bundled, such that core queries and contrib/query/* are in one JAR (lucene-query-3.0.jar)? And lucene-analyzers-3.0.jar would include contrib/analyzers/* and org/apache/lucene/analysis/*. And lucene-queryparser.jar, etc.

Mike

Michael Busch wrote:
> On 3/21/09 12:27 AM, Michael Busch wrote:
>> +1. I'd love to see Lucene going into such a direction.
>>
>> However, I'm a little worried about contrib's reputation. I think it contains components with differing levels of activity, maturity and support. So maybe instead of moving things from core into contrib to achieve the goal you mentioned, we could create a new folder named e.g. 'components', which will contain stuff that we claim is as stable, mature and supported as the core, just packaged into separate jars. Those jars should then only have dependencies on the core, but not on each other. They would also follow the same backwards-compatibility and other requirements as the core. Thoughts?
>
> I guess something very similar has been proposed and discussed here: http://www.nabble.com/Moving-SweetSpotSimilarity-out-of-contrib-to19267437.html#a19320894 (same link that Hoss sent while having his deja vu)...
>
> -Michael
RE: Modularization (was: Re: New flexible query parser)
> Honestly, I would not mind much where the source code lives in svn, so long as a developer, upon downloading Lucene 2.9, can go to *one* place (javadocs) for Lucene's queries and filters and see {Int,Long}NumberRangeFilter in there.
>
> We are not there today: a developer must first realize there's a whole separate place to look for other queries (contrib/queries). Then the developer browses that and likely becomes confused/misled by what TrieRangeQuery means (is it a letter trie?).

That is a problem. contrib/queries is a typical example of a contribution that is almost always used in third-party projects (Solr): it is stable, depends on nothing but the core, and is 1.4 compatible (at the moment). Other contributions have external dependencies or need a different Java version than the core.

I would split the two types of contributions and give the stable, core-only ones a higher ranking (e.g. put them into the top-level changes list). For example, when we release 2.9, nobody will realize that there is a new TrieRangeFilter in contrib/queries, because it is not in the top-level changes list. The new contrib/spatial should get the same visibility.

> My goal here is Lucene's consumability -- when someone new says "hey I heard about this great search library called Lucene; let me go try it out" I want that first impression to be as solid as possible. I think this is very important for growing Lucene's community.
>
> This is why out of the box defaults are so crucial (eg changing IW from flushing every 10 docs to every 16 MB gained sizable throughput). How many times have we seen a review, article, blog post, etc., comparing Lucene to other search libraries only to incorrectly complain that "Lucene can't do XYZ" or "Lucene's indexing performance is poor", etc, because they didn't dig in to learn all the tunings/options/tricks we all know you are supposed to do? (It frustrates me to no end when this happens.) This then hurts Lucene's adoption because others read such articles and conclude Lucene is a non-starter.

I know this problem. And about the contrib queries: most projects that use Lucene (e.g. Solr) always use some of the contrib jars, and almost every time contrib/queries. But starters, like the journalists writing those articles, only take the core and test something with it. So splitting up the whole of Lucene into different parts is better (so these people must always think about all available packages and which ones they need for their project):

> We all ought to be concerned with Lucene's adoption growth with time (I am), and first-impression consumability / out of the box defaults are big drivers of that.
>
> What if (maybe for 3.0, since we can mix in 1.5 sources at that point?) we change how Lucene is bundled, such that core queries and contrib/query/* are in one JAR (lucene-query-3.0.jar)? And lucene-analyzers-3.0.jar would include contrib/analyzers/* and org/apache/lucene/analysis/*. And lucene-queryparser.jar, etc.

This is even better! +1

I would propose:

- core: Indexer, Documents, IndexReader, Searcher and the default directory-stores (fs, mmap, nio).
- queries: current core queries and contrib/queries
- queryparser (the new one? Or two different packages for old and new): this should really be removed from core. A lot of people think that they can only query Lucene using the queryparser and do not even try to build their Boolean queries manually, and they often fail when it gets complicated and the query parser cannot help or fails, e.g. when querying non-tokenized fields (but this would depend on queries, we need that here)...
- analysis (and completely remove analyzers from core; let only the abstract analyzer stay there, plus keyword analyzer, if you want to index without an analyzer or do not need one because you only have non-tokenized fields,...)
- highlighting
- custom sorting separate
- spatial
- ...

We then could change our contrib SVN accounts and have new roles like (core-committer, queries-committer,...)

Uwe
Re: Modularization (was: Re: New flexible query parser)
On Mar 21, 2009, at 11:26 AM, Michael McCandless wrote:
> What if (maybe for 3.0, since we can mix in 1.5 sources at that point?) we change how Lucene is bundled, such that core queries and contrib/query/* are in one JAR (lucene-query-3.0.jar)? And lucene-analyzers-3.0.jar would include contrib/analyzers/* and org/apache/lucene/analysis/*. And lucene-queryparser.jar, etc.

Since we are just talking about packaging, why can't we have both/all of the above? Individual jars, as well as one big jar that contains everything (or, everything that has only dependencies we can ship, or everything that we deem important for an OOTB experience). I, for one, find it annoying to have to go get snowball, analyzers, spellchecking and highlighting separately in most cases b/c I almost always use all of them and don't particularly care if there are extra classes in a JAR, but can appreciate the need to do that in specific instances where leaner versions are needed. After all, the Ant magic to do all of this is pretty trivial given we just need to combine the various jars into a single jar (while keeping the indiv. ones).

If there is a sense that some contribs aren't maintained or aren't as good, then we need to ask ourselves whether they are:

1. stable and solid and don't need much care and are doing just fine thank you very much, or,
2. need to be archived, since they only serve as a distraction, or
3. in need of a new champion to maintain/promote them

-Grant
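Grant has Ant in mind for the bundling step, but the operation itself is small enough to sketch in plain Java: copy the entries of each individual jar into one combined jar and leave the originals untouched. The `JarBundler` class below is a hypothetical illustration, not part of any Lucene build, and the "first jar wins on duplicate entry names" policy is an assumption made for the sketch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch: copies the entries of several jars into one bundle jar. */
final class JarBundler {

    /** Returns the number of entries written; the first jar wins on duplicate names. */
    static int bundle(List<Path> inputs, Path output) throws IOException {
        Set<String> seen = new HashSet<>();
        int written = 0;
        try (OutputStream raw = Files.newOutputStream(output);
             ZipOutputStream out = new ZipOutputStream(raw)) {
            for (Path input : inputs) {
                try (InputStream in = Files.newInputStream(input);
                     ZipInputStream zip = new ZipInputStream(in)) {
                    ZipEntry entry;
                    while ((entry = zip.getNextEntry()) != null) {
                        // Skip names already copied, e.g. META-INF/MANIFEST.MF of later jars.
                        if (!seen.add(entry.getName())) {
                            continue;
                        }
                        out.putNextEntry(new ZipEntry(entry.getName()));
                        if (!entry.isDirectory()) {
                            copy(zip, out);
                        }
                        out.closeEntry();
                        written++;
                    }
                }
            }
        }
        return written;
    }

    // Copies the current entry's bytes; read() returns -1 at the end of the entry.
    private static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }
}
```

Because the inputs are only read, the individual jars stay available for anyone who wants the leaner set.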
Re: Modularization
On 3/21/09 11:26 AM, Michael McCandless wrote:
> I think we are mixing up source code modularity with bundling/packaging.
>
> Honestly, I would not mind much where the source code lives in svn, so long as a developer, upon downloading Lucene 2.9, can go to *one* place (javadocs) for Lucene's queries and filters and see {Int,Long}NumberRangeFilter in there.
>
> We are not there today: a developer must first realize there's a whole separate place to look for other queries (contrib/queries). Then the developer browses that and likely becomes confused/misled by what TrieRangeQuery means (is it a letter trie?).
>
> My goal here is Lucene's consumability -- when someone new says "hey I heard about this great search library called Lucene; let me go try it out" I want that first impression to be as solid as possible. I think this is very important for growing Lucene's community.
>
> This is why out of the box defaults are so crucial (eg changing IW from flushing every 10 docs to every 16 MB gained sizable throughput).

So this guy landing on http://lucene.apache.org/java/docs/index.html sees the Overview section first. That one only gives a very short introduction to what Lucene is. He might then look at Features, which is also not very specific.

I think the next thing would then be to look for the documentation of the newest release, so he would click on Lucene 2.4.1 Documentation. The landing page doesn't say much, except that it tells you to go look for the javadocs and other docs in the menu. So maybe the Getting Started link might be the first one to go to, but it's also pretty far down the list. So probably he would click on the javadocs first.

Now he encounters All, Core, Demo, Contrib. Until now, he hasn't read the word Contrib anywhere. We basically have no documentation anywhere that introduces the concept of contribs, or where to find them, I think? Even the Contributions section talks about something else.

So that guy probably then looks through the demo and examples and ends up using only core features until becoming more familiar with Lucene as a whole. Maybe he actually ends up buying LIA(2) :)

> How many times have we seen a review, article, blog post, etc., comparing Lucene to other search libraries only to incorrectly complain that "Lucene can't do XYZ" or "Lucene's indexing performance is poor", etc, because they didn't dig in to learn all the tunings/options/tricks we all know you are supposed to do? (It frustrates me to no end when this happens.) This then hurts Lucene's adoption because others read such articles and conclude Lucene is a non-starter.
>
> We all ought to be concerned with Lucene's adoption growth with time (I am), and first-impression consumability / out of the box defaults are big drivers of that.
>
> What if (maybe for 3.0, since we can mix in 1.5 sources at that point?) we change how Lucene is bundled, such that core queries and contrib/query/* are in one JAR (lucene-query-3.0.jar)? And lucene-analyzers-3.0.jar would include contrib/analyzers/* and org/apache/lucene/analysis/*. And lucene-queryparser.jar, etc.

So yeah I like this and 3.0 is a good opportunity to do this. I think a big part of this work should be good documentation. As you mentioned, Mike, it should be very simple to get an overview of what the different modules are. So there should be the list of the different modules, together with a short description for each of them and info about where to find them (which jar). Then by clicking on e.g. queries, the user would see the list of all queries we support.

But I think we should still have main modules, such as core, queries, analyzers, ... and separately e.g. sandbox modules?, for the things currently in contrib that are experimental or, as Mark called them, graveyard contribs :) ... even though we might then as well ask the question whether we can't really bury the latter ones...

> Mike
>
> Michael Busch wrote:
>> On 3/21/09 12:27 AM, Michael Busch wrote:
>>> +1. I'd love to see Lucene going into such a direction.
>>>
>>> However, I'm a little worried about contrib's reputation. I think it contains components with differing levels of activity, maturity and support. So maybe instead of moving things from core into contrib to achieve the goal you mentioned, we could create a new folder named e.g. 'components', which will contain stuff that we claim is as stable, mature and supported as the core, just packaged into separate jars. Those jars should then only have dependencies on the core, but not on each other. They would also follow the same backwards-compatibility and other requirements as the core. Thoughts?
>>
>> I guess something very similar has been proposed and discussed here: http://www.nabble.com/Moving-SweetSpotSimilarity-out-of-contrib-to19267437.html#a19320894 (same link that Hoss sent while having his deja vu)...
>>
>> -Michael
Re: Modularization
Maybe he actually ends up buying LIA(2) :)

LIA/2 suffers the same false dichotomy, and it drives me crazy there too: we put all contrib packages in a different chapter, even though it'd make much more sense to cover all analyzers in one chapter, all queries in one chapter, etc. I find myself cross-referencing over to TrieRangeQuery in Chapter 8, from LIA's search chapter (Chapter 3), and it's awkward.

So yeah I like this and 3.0 is a good opportunity to do this. I think a big part of this work should be good documentation. As you mentioned, Mike, it should be very simple to get an overview of what the different modules are. So there should be a list of the different modules, together with a short description for each of them and information about where to find them (which jar). Then by clicking on e.g. queries, the user would see the list of all queries we support.

I agree: revamping the web-site for a better top-down introduction of Lucene's features should be part of 3.0. And I don't think the sudden separation of core vs contrib should be so prominent (or even visible); it's really a detail of how we manage source control. When looking at the website I'd like to read that Lucene can do hit highlighting, powerful query parsing, spell checking, analyze different languages, etc. I could care less that some of these happen to live under a contrib subdirectory somewhere in the source control system.

But I think we should still have main modules, such as core, queries, analyzers, ... and separately e.g. sandbox modules?, for the things currently in contrib that are experimental or, as Mark called them, graveyard contribs :) ... even though we might then as well ask the question whether we can't really bury the latter ones...

Could we, instead, adopt some standard way (in the package javadocs) of stating the maturity/activity/back compat policies/etc of a given package?

Since we are just talking about packaging, why can't we have both/all of the above? Individual jars, as well as one big jar that contains everything (or, everything that has only dependencies we can ship, or everything that we deem important for an OOTB experience). I, for one, find it annoying to have to go get snowball, analyzers, spellchecking and highlighting separately in most cases b/c I almost always use all of them and don't particularly care if there are extra classes in a JAR, but can appreciate the need to do that in specific instances where leaner versions are needed. After all, the Ant magic to do all of this is pretty trivial given we just need to combine the various jars into a single jar (while keeping the indiv. ones)

+1

So I think the beginnings of a rough proposal are taking shape, for 3.0:

1. Fix web site to give a better intro to Lucene's features, without exposing the false (to the Lucene consumer) core vs. contrib distinction
2. When releasing, we make a single JAR holding core and contrib classes for a given area. The final JAR files don't contain a core vs contrib distinction.
3. We create a bundled JAR that has the common packages typically needed (index/search core, analyzers, queries, highlighter, spellchecker)

Mike
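For illustration, the "combine the various jars into a single jar (while keeping the indiv. ones)" step mentioned above could look roughly like the following Ant sketch. It is not taken from Lucene's actual build files: the project name, target name, the build.dir/dist.dir/version properties and the lucene-*-3.0.jar / lucene-all naming are all assumptions made for the example. It relies only on the standard jar task and its nested zipgroupfileset element.

```xml
<!-- Hypothetical sketch: all property, target and jar names are assumptions,
     not Lucene's real build configuration. -->
<project name="lucene-bundle-sketch" default="bundle-jar" basedir=".">
  <!-- Assumed location of the individual per-area jars -->
  <property name="build.dir" value="build"/>
  <!-- Assumed separate output directory, so the bundle never picks up itself -->
  <property name="dist.dir" value="dist"/>
  <property name="version" value="3.0"/>

  <target name="bundle-jar">
    <mkdir dir="${dist.dir}"/>
    <!-- Merge the contents of every individual jar into one bundled jar.
         The individual jars in build.dir are left untouched.
         duplicate="preserve" keeps the first copy of any entry that
         appears in more than one source jar. -->
    <jar destfile="${dist.dir}/lucene-all-${version}.jar" duplicate="preserve">
      <zipgroupfileset dir="${build.dir}" includes="lucene-*-${version}.jar"/>
    </jar>
  </target>
</project>
```

A narrower bundle (for example only core, analyzers, queries, highlighter and spellchecker) would presumably just tighten the includes pattern to the jars wanted.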
Re: Modularization (was: Re: New flexible query parser)
On Mar 21, 2009, at 7:23 AM, Grant Ingersoll wrote:
On Mar 21, 2009, at 11:26 AM, Michael McCandless wrote:

What if (maybe for 3.0, since we can mix in 1.5 sources at that point?) we change how Lucene is bundled, such that core queries and contrib/query/* are in one JAR (lucene-query-3.0.jar)? And lucene-analyzers-3.0.jar would include contrib/analyzers/* and org/apache/lucene/analysis/*. And lucene-queryparser.jar, etc.

Since we are just talking about packaging, why can't we have both/all of the above? Individual jars, as well as one big jar that contains everything (or, everything that has only dependencies we can ship, or everything that we deem important for an OOTB experience). I, for one, find it annoying to have to go get snowball, analyzers, spellchecking and highlighting separately in most cases b/c I almost always use all of them and don't particularly care if there are extra classes in a JAR, but can appreciate the need to do that in specific instances where leaner versions are needed. After all, the Ant magic to do all of this is pretty trivial given we just need to combine the various jars into a single jar (while keeping the indiv. ones)

If there is a sense that some contribs aren't maintained or aren't as good, then we need to ask ourselves whether they are:

1. stable and solid and don't need much care and are doing just fine thank you very much, or,
2. need to be archived, since they only serve as a distraction, or
3. in need of a new champion to maintain/promote them

From a user's perspective (i.e. mine): I like the idea of having more jars. Specifically, I'd like a jar that was devoted solely to reading an index. Ultimately, I'd like it to work in a J2ME environment, but that is entirely a different thread. There are parts that are needed for both reading and writing (directory, analyzers, tokens, and such). And there are parts dealing with writing.
There is a distinction between core and contrib regarding backward compatibility and quality (perhaps perceived quality). To me the hardest part in wrapping my head around contrib is that I am not clear on why something is in contrib, what it can do, and whether it is just an example, an alternate way of doing something, or useful exactly as provided.

There are parts of contrib that I see as essential to my application (pretty much Grant's list), that I can use as is. While there are many different applications of Lucene, my guess is that a non-trivial application of Lucene needs to use various contribs. Some contribs are high quality and I think deserve the kind of attention that core gets.

What I'd like to see is not more stuff moving into core from contrib, but rather two levels of contrib: one recommended for use and maintained at the same level as core; the other is stuff to use if you find it useful, and at your own risk. That is, as it is today.

I understand the desire to have one jar do it all. Nothing wrong with having that too, perhaps lucene-essentials.jar that holds all useful, recommended, highly maintained, well-explained stuff.

As to the whole question of the OOBE for reviewers: today, it is "what does Lucene-core.jar do?" With more jars it would be "what does this core collection of jars do?" or "what does lucene-essentials do?"

-- DM Smith
Re: Modularization
On 3/21/09 1:36 PM, Michael McCandless wrote:

And I don't think the sudden separation of core vs contrib should be so prominent (or even visible); it's really a detail of how we manage source control. When looking at the website I'd like to read that Lucene can do hit highlighting, powerful query parsing, spell checking, analyze different languages, etc. I could care less that some of these happen to live under a contrib subdirectory somewhere in the source control system.

OK, so I think we all agree about the packaging. But I believe it is also important how the source code is organized. Maybe Lucene consumers don't care too much; however, Lucene is an open source project, so we also want to attract possible contributors with a nicely organized code base. If there is a clear separation between the different components on a source code level, becoming familiar with Lucene as a contributor might not be so overwhelming. Besides that, I think a one-to-one mapping between the packaging and the source code has no disadvantages. (and it would certainly make the build scripts easier!)

But I think we should still have main modules, such as core, queries, analyzers, ... and separately e.g. sandbox modules?, for the things currently in contrib that are experimental or, as Mark called them, graveyard contribs :) ... even though we might then as well ask the question whether we can't really bury the latter ones...

Could we, instead, adopt some standard way (in the package javadocs) of stating the maturity/activity/back compat policies/etc of a given package?

This makes sense; e.g. we could release new modules as beta versions (= use at own risk, no backwards-compatibility). And if we start a new module (e.g. a GSoC project) we could exclude it from a release easily if it's truly experimental and not in a release-able state.

So I think the beginnings of a rough proposal are taking shape, for 3.0:

1. Fix web site to give a better intro to Lucene's features, without exposing the false (to the Lucene consumer) core vs. contrib distinction
2. When releasing, we make a single JAR holding core and contrib classes for a given area. The final JAR files don't contain a core vs contrib distinction.
3. We create a bundled JAR that has the common packages typically needed (index/search core, analyzers, queries, highlighter, spellchecker)

+1 to all three points.

Mike