Re: [VOTE] Release PyLucene 8.8.1
+1 I ran my usual smoke test: install JCC, PyLucene, then index and optimize the first 100K documents from a Wikipedia English snapshot, and run a couple queries. Sorry for being late to the party too! Mike McCandless http://blog.mikemccandless.com On Mon, Mar 1, 2021 at 9:35 PM Andi Vajda wrote: > > The PyLucene 8.8.1 (rc1) release tracking the recent release of > Apache Lucene 8.8.1 is ready. > > A release candidate is available from: > https://dist.apache.org/repos/dist/dev/lucene/pylucene/8.8.1-rc1/ > > PyLucene 8.8.1 is built with JCC 3.9, included in these release artifacts. > > JCC 3.9 supports Python 3.3 up to Python 3.9 (in addition to Python 2.3+). > PyLucene may be built with Python 2 or Python 3. > > Please vote to release these artifacts as PyLucene 8.8.1. > Anyone interested in this release can and should vote ! > > Thanks ! > > Andi.. > > ps: the KEYS file for PyLucene release signing is at: > https://dist.apache.org/repos/dist/release/lucene/pylucene/KEYS > https://dist.apache.org/repos/dist/dev/lucene/pylucene/KEYS > > pps: here is my +1 >
Re: [VOTE] Release PyLucene 8.6.1
+1 to release. I ran my usual smoke test to index, forceMerge and search the first 100K documents from English Wikipedia export, on Arch Linux, Java 1.11.06, Python 3.8.1 -- test ran fine! Thanks Andi. Mike McCandless http://blog.mikemccandless.com On Mon, Aug 24, 2020 at 7:56 PM Andi Vajda wrote: > > The PyLucene 8.6.1 (rc1) release tracking the recent release of > Apache Lucene 8.6.1 is ready. > > A release candidate is available from: > https://dist.apache.org/repos/dist/dev/lucene/pylucene/8.6.1-rc1/ > > PyLucene 8.6.1 is built with JCC 3.8, included in these release artifacts. > > JCC 3.8 supports Python 3.3 up to Python 3.8 (in addition to Python 2.3+). > PyLucene may be built with Python 2 or Python 3. > > Please vote to release these artifacts as PyLucene 8.6.1. > Anyone interested in this release can and should vote ! > > Thanks ! > > Andi.. > > ps: the KEYS file for PyLucene release signing is at: > https://dist.apache.org/repos/dist/release/lucene/pylucene/KEYS > https://dist.apache.org/repos/dist/dev/lucene/pylucene/KEYS > > pps: here is my +1 >
Re: Memory usage
Hi Siddharth, Your understanding of MMapDirectory is correct -- only give your JVM enough heap to not spend too much CPU on GC, and then let the OS use all available remaining RAM to cache hot pages from your index. There are some structures Lucene loads into JVM heap, but even those have recently been moving off-heap (accessed via Directory), such as the FSTs used for the terms index, and the BKD index (for dimensional points). I'm not sure exactly which structures are still in heap ... maybe the live documents bitset? During indexing, the recently indexed documents are buffered in JVM heap, up to the IndexWriterConfig.setRAMBufferSizeMB threshold, and then they are written to the Directory as new segments. Mike McCandless http://blog.mikemccandless.com On Wed, Nov 6, 2019 at 11:27 PM siddharth teotia wrote: > Hi All > > I have some questions about the memory usage. I would really appreciate if > someone can help answer these. > > I understand from the docs that during reading/querying, Lucene uses > MMapDirectory (assuming it is supported on the platform). So the Java heap > overhead in this case will purely come from the objects that are > allocated/instantiated on the query path to process the query and build > results etc. But the whole index itself will not be loaded into memory > because we memory mapped the file. Is my understanding correct? In this > case, we are better off not increasing the Java heap and keep as much > as possible available for the file system cache for mmap to do its job > efficiently. > > However, are there any portions of index structures that are completely > loaded in memory regardless of whether it is MMapDirectory or not? If so, > are they loaded in Java heap or do we use off-heap (direct buffers) in > such cases? > > Secondly, on the write path I think even though the writer opens a > MMapDirectory, the writes are gathered/buffered in memory upto a flush > threshold controlled by IndexWriterConfig. 
Is this buffering done in Java > heap or direct memory? > > Thanks a lot for help > Siddharth >
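The configuration described in this thread can be sketched in Java. This is my illustrative example, not code from the thread: it assumes a recent (8.x) Lucene on the classpath, and the temp directory, field name, and 256 MB buffer size are all made up for the sketch.

```java
import java.nio.file.Files;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.MMapDirectory;

public class MMapIndexingSketch {

    // Index one document with an explicit RAM buffer, then report numDocs.
    static int run() throws Exception {
        // MMapDirectory: searches read index files through the OS page cache,
        // so keep the JVM heap small and leave the remaining RAM to the OS.
        try (MMapDirectory dir = new MMapDirectory(Files.createTempDirectory("mmap-demo"))) {
            IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
            // Recently indexed documents are buffered in JVM heap up to this
            // threshold, then flushed to the Directory as a new segment.
            iwc.setRAMBufferSizeMB(256.0);
            try (IndexWriter writer = new IndexWriter(dir, iwc)) {
                Document doc = new Document();
                doc.add(new TextField("body", "hello lucene", Field.Store.YES));
                writer.addDocument(doc);
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                return reader.numDocs();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(MMapIndexingSketch.run());
    }
}
```

The design point is that the only indexing-time heap you control here is the RAM buffer; the index itself stays off-heap, cached by the OS.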
Re: [VOTE] Release PyLucene 7.6.0 (rc1)
+1 to release! I ran my usual simple test indexing the first 100K docs from an old wikipedia export, force merging, and running a few searches. Thank you for continuing to release PyLucene Andi! Mike McCandless http://blog.mikemccandless.com On Fri, Jan 4, 2019 at 4:59 PM Andi Vajda wrote: > > The PyLucene 7.6.0 (rc1) release tracking the recent release of > Apache Lucene 7.6.0 is ready. > > A release candidate is available from: >https://dist.apache.org/repos/dist/dev/lucene/pylucene/7.6.0-rc1/ > > PyLucene 7.6.0 is built with JCC 3.4 included in these release artifacts. > > JCC 3.4 supports Python 3.3+ (in addition to Python 2.3+). > PyLucene may be built with Python 2 or Python 3. > > Please vote to release these artifacts as PyLucene 7.6.0. > Anyone interested in this release can and should vote ! > > Thanks ! > > Andi.. > > ps: the KEYS file for PyLucene release signing is at: > https://dist.apache.org/repos/dist/release/lucene/pylucene/KEYS > https://dist.apache.org/repos/dist/dev/lucene/pylucene/KEYS > > pps: here is my +1 >
Re: [VOTE] Release PyLucene 6.5.0 (rc1) (now with Python 3 support)
+1 to release. I tested on Ubuntu 16.04 with Python 3.5.2 and Java 1.8.0_121. I ran my usual smoke test of indexing the first 100K docs from a Wikipedia English export and running a few searches. But first I had to run 2to3 on this ancient script! I had to apply Ruediger's patch to JCC's setup.py else it was trying to link with -lpython3.5 but I have -lpython3.5m. Mike McCandless http://blog.mikemccandless.com On Mon, Mar 27, 2017 at 6:12 PM, Andi Vajda wrote: > > The PyLucene 6.5.0 (rc1) release tracking today's release of > Apache Lucene 6.5.0 is ready. > > A release candidate is available from: > https://dist.apache.org/repos/dist/dev/lucene/pylucene/6.5.0-rc1/ > > PyLucene 6.5.0 is built with JCC 3.0 included in these release artifacts. > > JCC 3.0 now supports Python 3.3+ (in addition to Python 2.3+). > PyLucene may be built with Python 2 or Python 3. > > Please vote to release these artifacts as PyLucene 6.5.0. > Anyone interested in this release can and should vote ! > > Thanks ! > > Andi.. > > ps: the KEYS file for PyLucene release signing is at: > https://dist.apache.org/repos/dist/release/lucene/pylucene/KEYS > https://dist.apache.org/repos/dist/dev/lucene/pylucene/KEYS > > pps: here is my +1 >
Re: [VOTE] Release PyLucene 6.4.1 (rc1)
+1 to release. I ran my usual smoke test: indexing first 100K docs from English Wikipedia export, optimizing, running a couple searches, on Ubuntu 16.04, Java 1.8.0_101, Python 2.7.12. Mike McCandless http://blog.mikemccandless.com On Sun, Feb 12, 2017 at 5:25 AM, Michael McCandless <luc...@mikemccandless.com> wrote: > Sorry, I will have a look! > > Mike McCandless > > http://blog.mikemccandless.com > > > On Sat, Feb 11, 2017 at 5:23 PM, Andi Vajda <va...@apache.org> wrote: >> >> Ping ? >> Two more PMC votes are needed before this release can happen. >> Thanks ! >> >> Andi.. >> >>> On Feb 6, 2017, at 13:38, Andi Vajda <va...@apache.org> wrote: >>> >>> >>> The PyLucene 6.4.1 (rc1) release tracking today's release of >>> Apache Lucene 6.4.1 is ready. >>> >>> A release candidate is available from: >>> https://dist.apache.org/repos/dist/dev/lucene/pylucene/6.4.1-rc1/ >>> >>> PyLucene 6.4.1 is built with JCC 2.23 included in these release artifacts. >>> >>> Please vote to release these artifacts as PyLucene 6.4.1. >>> Anyone interested in this release can and should vote ! >>> >>> Thanks ! >>> >>> Andi.. >>> >>> ps: the KEYS file for PyLucene release signing is at: >>> https://dist.apache.org/repos/dist/release/lucene/pylucene/KEYS >>> https://dist.apache.org/repos/dist/dev/lucene/pylucene/KEYS >>> >>> pps: here is my +1
Re: [VOTE] Release PyLucene 6.4.1 (rc1)
Sorry, I will have a look! Mike McCandless http://blog.mikemccandless.com On Sat, Feb 11, 2017 at 5:23 PM, Andi Vajda wrote: > > Ping ? > Two more PMC votes are needed before this release can happen. > Thanks ! > > Andi.. > >> On Feb 6, 2017, at 13:38, Andi Vajda wrote: >> >> >> The PyLucene 6.4.1 (rc1) release tracking today's release of >> Apache Lucene 6.4.1 is ready. >> >> A release candidate is available from: >> https://dist.apache.org/repos/dist/dev/lucene/pylucene/6.4.1-rc1/ >> >> PyLucene 6.4.1 is built with JCC 2.23 included in these release artifacts. >> >> Please vote to release these artifacts as PyLucene 6.4.1. >> Anyone interested in this release can and should vote ! >> >> Thanks ! >> >> Andi.. >> >> ps: the KEYS file for PyLucene release signing is at: >> https://dist.apache.org/repos/dist/release/lucene/pylucene/KEYS >> https://dist.apache.org/repos/dist/dev/lucene/pylucene/KEYS >> >> pps: here is my +1
Re: Doing Range/Number queries
No, you must replace the entire document: the old one is removed, and the new one is indexed in its place. The one exception to this is updatable doc values fields (e.g. see IndexWriter.updateNumericDocValue). Mike McCandless http://blog.mikemccandless.com On Tue, Aug 9, 2016 at 2:49 PM, lukes wrote: > Thanks Michael, > > Is there a way to partially update the document ? I know there's a API > updateDocument on IndexWriter, but that seems to create a new document with > just a field i am specifying. What i want is delete some fields from > existing(indexed) document, and then add some new fields(could or not be > same). Alternatively i tried to search for the document, and then calling > removeFields and finally updateDocument, but now any search after the above > process is not able for find that document(I created the new IndexReader). > Am i missing anything ? > > Regards. > > > > -- > View this message in context: http://lucene.472066.n3. > nabble.com/Doing-Range-NUmber-queries-tp4290722p4291023.html > Sent from the Lucene - General mailing list archive at Nabble.com. >
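The delete-then-add semantics can be shown in a small Java sketch. This is my own illustrative example (not from the thread), assuming a recent (8.x) Lucene on the classpath; the `id`, `title`, and `views` field names are invented for the sketch.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class WholeDocumentUpdateSketch {

    // Returns how many live documents match title:new after the update.
    static int run() throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", "42", Field.Store.YES));
            doc.add(new TextField("title", "old title", Field.Store.YES));
            doc.add(new NumericDocValuesField("views", 1L));
            writer.addDocument(doc);

            // updateDocument is delete-then-add: the replacement fully replaces
            // the old document, so it must carry every field you want to keep.
            Document replacement = new Document();
            replacement.add(new StringField("id", "42", Field.Store.YES));
            replacement.add(new TextField("title", "new title", Field.Store.YES));
            replacement.add(new NumericDocValuesField("views", 1L));
            writer.updateDocument(new Term("id", "42"), replacement);

            // The one in-place exception: doc values fields can be updated
            // without reindexing the whole document.
            writer.updateNumericDocValue(new Term("id", "42"), "views", 2L);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            // Only the replacement is live; searches now see "new title".
            return new IndexSearcher(reader).count(new TermQuery(new Term("title", "new")));
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(WholeDocumentUpdateSketch.run());
    }
}
```

Note that `updateDocument` keys the delete on a `Term` (here `id:42`), which is why applications conventionally index a unique id field as a non-analyzed `StringField`.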
Re: Doing Range/Number queries
For 1), you need to copy it yourself, i.e. add another Field to the Lucene Document you are about to index, with the same (string, numeric, etc.) value from the first field. For 2), it's best to use points (IntPoint, etc.) for range filtering. For 3), to search a boolean value, just map your boolean to a token, e.g. "true" and "false". Mike McCandless http://blog.mikemccandless.com On Mon, Aug 8, 2016 at 1:37 AM, lukes wrote: > *Update(Found the answer for point 2).* > > > > -- > View this message in context: http://lucene.472066.n3. > nabble.com/Doing-Range-NUmber-queries-tp4290722p4290725.html > Sent from the Lucene - General mailing list archive at Nabble.com. >
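All three points can be shown together in one short Java sketch. Again this is an illustrative example of mine, assuming a recent (8.x) Lucene; the field names (`title`, `titleExact`, `price`, `inStock`) and values are made up.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class PointAndBooleanFieldsSketch {

    // Returns {hits for the numeric range query, hits for the boolean query}.
    static int[] run() throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // 1) "Copying" a field is just adding a second Field carrying the
            //    same value, e.g. an exact (non-analyzed) variant of the title.
            doc.add(new TextField("title", "lucene in action", Field.Store.YES));
            doc.add(new StringField("titleExact", "lucene in action", Field.Store.NO));
            // 2) Points (IntPoint etc.) for efficient numeric range filtering.
            doc.add(new IntPoint("price", 25));
            // 3) A boolean mapped to the tokens "true"/"false".
            doc.add(new StringField("inStock", "true", Field.Store.NO));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            int inRange = searcher.count(IntPoint.newRangeQuery("price", 10, 50));
            int inStock = searcher.count(new TermQuery(new Term("inStock", "true")));
            return new int[] { inRange, inStock };
        }
    }

    public static void main(String[] args) throws Exception {
        int[] counts = PointAndBooleanFieldsSketch.run();
        System.out.println(counts[0] + " " + counts[1]);
    }
}
```

The boolean trick works because a `StringField` indexes its value as a single un-analyzed token, so a plain `TermQuery` on "true"/"false" matches exactly.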
[ANNOUNCE] Apache Lucene 5.5.0 released
23 February 2016, Apache Lucene™ 5.5.0 available The Lucene PMC is pleased to announce the release of Apache Lucene 5.5.0, expected to be the last 5.x feature release before Lucene 6.0.0. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. This release contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at: http://lucene.apache.org/core/mirrors-core-latest-redir.html Please read CHANGES.txt for a full list of new features and changes: https://lucene.apache.org/core/5_5_0/changes/Changes.html Lucene 5.5.0 Release Highlights: * JoinUtil.createJoinQuery can now join on numeric doc values fields * BlendedInfixSuggester now has an exponential reciprocal scoring model, to more strongly favor suggestions with matches closer to the beginning * CustomAnalyzer has improved (compile time) type safety * DFISimilarity implements the divergence from independence scoring model * Fully wrap any other merge policy using MergePolicyWrapper * Sandbox geo point queries have graduated into the spatial module, and now use a more efficient binary term encoding for smaller index size, faster indexing, and decreased search-time heap usage * BooleanQuery performs some new query optimizations * TermsQuery constructors are more GC efficient Please report any feedback to the mailing lists (http://lucene.apache.org/core/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also applies to Maven access. Mike McCandless http://blog.mikemccandless.com
[ANNOUNCE] Apache Solr 5.5.0 released
23 February 2016, Apache Solr™ 5.5.0 available Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 5.5.0 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html Please read CHANGES.txt for a full list of new features and changes: https://lucene.apache.org/solr/5_5_0/changes/Changes.html This is expected to be the last 5.x feature release before Solr 6.0.0. Solr 5.5 Release Highlights: * The schema version has been increased to 1.6, and Solr now returns non-stored doc values fields along with stored fields * The PERSIST CoreAdmin action has been removed * The <mergePolicy> element is deprecated in favor of a similar <mergePolicyFactory> element, in solrconfig.xml * CheckIndex now works on HdfsDirectory * RuleBasedAuthorizationPlugin now allows wildcards in the role, and accepts an 'all' permission * Users can now choose compression mode in SchemaCodecFactory * Solr now supports Lucene's XMLQueryParser * Collections APIs now have async support * Uninverted field faceting is re-enabled, for higher performance on rarely changing indices Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also applies to Maven access. Mike McCandless http://blog.mikemccandless.com
[ANNOUNCE] Apache Lucene 4.10.4 released
March 2015, Apache Lucene™ 4.10.4 available The Lucene PMC is pleased to announce the release of Apache Lucene 4.10.4 Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. The release is available for immediate download at: http://www.apache.org/dyn/closer.cgi/lucene/java/4.10.4l Lucene 4.10.4 includes 13 bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/core/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Mike McCandless http://blog.mikemccandless.com
[ANNOUNCE] Apache Solr 4.10.4 released
March 2015, Apache Solr™ 4.10.4 available The Lucene PMC is pleased to announce the release of Apache Solr 4.10.4 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 4.10.4 is available for immediate download at: http://www.apache.org/dyn/closer.cgi/lucene/solr/4.10.4 Solr 4.10.4 includes 24 bug fixes, as well as Lucene 4.10.4 and its 13 bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Mike McCandless http://blog.mikemccandless.com
Re: [ANNOUNCE] Apache Lucene 4.10.4 released
Correction: the download link for Lucene 4.10.4 is: http://www.apache.org/dyn/closer.cgi/lucene/java/4.10.4 Mike McCandless http://blog.mikemccandless.com On Thu, Mar 5, 2015 at 10:26 AM, Michael McCandless luc...@mikemccandless.com wrote: March 2015, Apache Lucene™ 4.10.4 available The Lucene PMC is pleased to announce the release of Apache Lucene 4.10.4 Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. The release is available for immediate download at: http://www.apache.org/dyn/closer.cgi/lucene/java/4.10.4l Lucene 4.10.4 includes 13 bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/core/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Mike McCandless http://blog.mikemccandless.com
Re: How can I make better project than Lucene?
On Tue, Nov 18, 2014 at 1:16 PM, Marvin Humphrey mar...@rectangular.com wrote: On Sat, Nov 15, 2014 at 3:22 AM, Michael McCandless luc...@mikemccandless.com wrote: The analysis chain (attributes) is overly complex. If you were to start from scratch, what would the analysis chain look like? Hi Marvin, long time no talk! I like the new Go bindings for Lucy! Here are some things that bug me about Lucene's analysis APIs: Lucene's Attributes have separate interface from impl, with default impls, and this causes complex code in oal.util.Attribute*. It seems like overkill. Seems like we should just have concrete core impls for the atts Lucene knows how to index. There are 5 java source files in that package related to attributes (Attribute.java AttributeFactory.java AttributeImpl.java AttributeReflector.java AttributeSource.java): too much. There should not be a global AttributeFactory that owns all attrs throughout the pipeline: that's too global. Rather, each stage should be free to control what the next stage sees (LUCENE-2450) ... the namespace should be private to that stage, and each stage can delete/add/replace the incoming bindings it saw. This may seem more complex but I think it'd be simpler in the end? And, the first stage should not have to be responsible for clearing things that later stages had inserted: common source of bugs for that first Tokenizer to not call clearAttributes. Reuse of token streams was an afterthought that took a long time to work its way down to simpler APIs, but now we have ReuseStrategy, AnalyzerWrapper, DelegatingAnalyzerWrapper. Custom analyzers can't be (easily?) serialized, so ES and Solr have their own layers to parse a custom chain from JSON/XML. Those layers could do better error checking... Can we do something better with offsets, such that TokenFilters (not just Tokenizers/CharReaders) would also be able to set correct offsets? 
The stuffing of things into analysis that really should have been a gentle schema is annoying: KeywordAnalyzer, Numeric*. Token filters that want to create graphs are nearly impossible. E.g you cannot put a WDF in front of SynonymFilter today because SynonymFilter can't handle an incoming graph (LUCENE-5012). Deleted tokens should still be present, just marked as deleted (so IW doesn't index them). This would make it possible (to Rob's horror) for tokenizers to preserve every single character they saw, but things that are not tokens (punctuation, whitespace) are marked deleted. Maybe this makes it possible for all stages to work with offsets properly? There is probably more, and probably lots of people disagree that these are even problems :) Mike McCandless http://blog.mikemccandless.com
Re: How can I make better project than Lucene?
Actually I think competing projects is very healthy for open source development. There are many things you could explore to contrast with Lucene, e.g. write your new search engine in Go not Java: Java has many problems, maybe Go fixes them. Go also has a low-latency garbage collector in development ... and Java's GC options still can't scale to the heap sizes that are practical now. Lucene has many limitations, so your competing engine could focus on them. E.g. the schemalessness of Lucene has become a big problem, and near impossible to fix at this point, and prevents new important features like LUCENE-5879 from being possible, so you could give your engine a gentle schema from the start. The Lucene Filter/Query situation is a mess: one should extend the other. Lucene has weak support for proximity queries (SpanQuery is slow and does not get much attention). Lucene is showing its age, missing some compelling features like a builtin transaction log, core support for numerics (they are sort of hacked on top), optimistic concurrency support (sequence ids, versions, something), distributed support (near real time replication, etc.), multi-tenancy, an example server implementation, so the search servers on top of Lucene have had to fill these gaps. Maybe you could make your engine distributed from the start (Go is a great match for that, from what little I know). All 3 highlighter options have problems. The analysis chain (attributes) is overly complex. In your competing engine you can borrow/copy/steal from Lucene's good parts to get started... Mike McCandless http://blog.mikemccandless.com On Fri, Nov 14, 2014 at 8:43 PM, swsong_dev swsong_...@websqrd.com wrote: I’m developing search engine, Fastcatsearch. http://github.com/fastcatsearch/fastcatsearch Lucene is widely known and famous project and I cannot beat Lucene for now. But is there any chance to beat Lucene? Anything like features, performance. 
Please, let me know what to do to make better product than Lucene. Thank you.
Re: How can I make better project than Lucene?
Well the Apache Software License is very generous about poaching. Your ideas will go further if you don't insist on going with them. Mike McCandless http://blog.mikemccandless.com On Sat, Nov 15, 2014 at 6:42 AM, Will Martin wmartin...@gmail.com wrote: Btw: SwSong should not steal code; which implies an existing license whose terms he is willing to break. Not a good first step.;-) will -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Saturday, November 15, 2014 6:22 AM To: general@lucene.apache.org Subject: Re: How can I make better project than Lucene? Actually I think competing projects is very healthy for open source development. There are many things you could explore to contrast with Lucene, e.g. write your new search engine in Go not Java: Java has many problems, maybe Go fixes them. Go also has a low-latency garbage collector in development ... and Java's GC options still can't scale to the heap sizes that are practical now. Lucene has many limitations, so your competing engine could focus on them. E.g. the schemalessness of Lucene has become a big problem, and near impossible to fix at this point, and prevents new important features like LUCENE-5879 from being possible, so you could give your engine a gentle schema from the start. The Lucene Filter/Query situation is a mess: one should extend the other. Lucene has weak support for proximity queries (SpanQuery is slow and does not get much attention). Lucene is showing its age, missing some compelling features like a builtin transaction log, core support for numerics (they are sort of hacked on top), optimistic concurrency support (sequence ids, versions, something), distributed support (near real time replication, etc.), multi-tenancy, an example server implementation, so the search servers on top of Lucene have had to fill these gaps. Maybe you could make your engine distributed from the start (Go is a great match for that, from what little I know). 
All 3 highlighter options have problems. The analysis chain (attributes) is overly complex. In your competing engine you can borrow/copy/steal from Lucene's good parts to get started... Mike McCandless http://blog.mikemccandless.com On Fri, Nov 14, 2014 at 8:43 PM, swsong_dev swsong_...@websqrd.com wrote: I’m developing search engine, Fastcatsearch. http://github.com/fastcatsearch/fastcatsearch Lucene is widely known and famous project and I cannot beat Lucene for now. But is there any chance to beat Lucene? Anything like features, performance. Please, let me know what to do to make better product than Lucene. Thank you.
Re: How can I make better project than Lucene?
Yes it does. Mike McCandless http://blog.mikemccandless.com On Sat, Nov 15, 2014 at 8:53 AM, Will Martin wmartin...@gmail.com wrote: Um, doesn't the Apache license require inclusion of the license? Just sayin' -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Saturday, November 15, 2014 8:47 AM To: general@lucene.apache.org Subject: Re: How can I make better project than Lucene? Well the Apache Software License is very generous about poaching. Your ideas will go further if you don't insist on going with them. Mike McCandless http://blog.mikemccandless.com On Sat, Nov 15, 2014 at 6:42 AM, Will Martin wmartin...@gmail.com wrote: Btw: SwSong should not steal code; which implies an existing license whose terms he is willing to break. Not a good first step.;-) will -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Saturday, November 15, 2014 6:22 AM To: general@lucene.apache.org Subject: Re: How can I make better project than Lucene? Actually I think competing projects is very healthy for open source development. There are many things you could explore to contrast with Lucene, e.g. write your new search engine in Go not Java: Java has many problems, maybe Go fixes them. Go also has a low-latency garbage collector in development ... and Java's GC options still can't scale to the heap sizes that are practical now. Lucene has many limitations, so your competing engine could focus on them. E.g. the schemalessness of Lucene has become a big problem, and near impossible to fix at this point, and prevents new important features like LUCENE-5879 from being possible, so you could give your engine a gentle schema from the start. The Lucene Filter/Query situation is a mess: one should extend the other. Lucene has weak support for proximity queries (SpanQuery is slow and does not get much attention). 
Lucene is showing its age, missing some compelling features like a builtin transaction log, core support for numerics (they are sort of hacked on top), optimistic concurrency support (sequence ids, versions, something), distributed support (near real time replication, etc.), multi-tenancy, an example server implementation, so the search servers on top of Lucene have had to fill these gaps. Maybe you could make your engine distributed from the start (Go is a great match for that, from what little I know). All 3 highlighter options have problems. The analysis chain (attributes) is overly complex. In your competing engine you can borrow/copy/steal from Lucene's good parts to get started... Mike McCandless http://blog.mikemccandless.com On Fri, Nov 14, 2014 at 8:43 PM, swsong_dev swsong_...@websqrd.com wrote: I’m developing search engine, Fastcatsearch. http://github.com/fastcatsearch/fastcatsearch Lucene is widely known and famous project and I cannot beat Lucene for now. But is there any chance to beat Lucene? Anything like features, performance. Please, let me know what to do to make better product than Lucene. Thank you.
[ANNOUNCE] Apache Lucene 4.10.2 released
October 2014, Apache Lucene™ 4.10.2 available The Lucene PMC is pleased to announce the release of Apache Lucene 4.10.2 Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. The release is available for immediate download at: http://lucene.apache.org/core/mirrors-core-latest-redir.html Lucene 4.10.2 includes 2 bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/core/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Happy Halloween, Mike McCandless http://blog.mikemccandless.com
[ANNOUNCE] Apache Solr 4.10.2 released
October 2014, Apache Solr™ 4.10.2 available The Lucene PMC is pleased to announce the release of Apache Solr 4.10.2 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 4.10.2 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html Solr 4.10.2 includes 10 bug fixes, as well as Lucene 4.10.2 and its 2 bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Happy Halloween, Mike McCandless http://blog.mikemccandless.com
Re: [VOTE] Release PyLucene 4.10.1-1
+1 to release. I ran my usual smoke test: indexing, optimizing, and searching the first 100K English Wikipedia docs... Mike McCandless http://blog.mikemccandless.com On Wed, Oct 1, 2014 at 7:13 PM, Andi Vajda va...@apache.org wrote: The PyLucene 4.10.1-1 release tracking the recent release of Apache Lucene 4.10.1 is ready. This release candidate fixes the regression found in the previous one, 4.10.1-0, and is available from: http://people.apache.org/~vajda/staging_area/ A list of changes in this release can be seen at: http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_4_10/CHANGES PyLucene 4.10.1 is built with JCC 2.21 included in these release artifacts. A list of Lucene Java changes can be seen at: http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_10_1/lucene/CHANGES.txt Please vote to release these artifacts as PyLucene 4.10.1-1. Anyone interested in this release can and should vote ! Thanks ! Andi.. ps: the KEYS file for PyLucene release signing is at: http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS http://people.apache.org/~vajda/staging_area/KEYS pps: here is my +1
[ANNOUNCE] Apache Lucene 4.10.1 released
September 2014, Apache Lucene™ 4.10.1 available The Lucene PMC is pleased to announce the release of Apache Lucene 4.10.1 Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. The release is available for immediate download at: http://lucene.apache.org/core/mirrors-core-latest-redir.html Lucene 4.10.1 includes 7 bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/core/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Mike McCandless http://blog.mikemccandless.com
[ANNOUNCE] Apache Solr 4.10.1 released
September 2014, Apache Solr™ 4.10.1 available The Lucene PMC is pleased to announce the release of Apache Solr 4.10.1 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 4.10.1 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html Solr 4.10.1 includes 6 bug fixes, as well as Lucene 4.10.1 and its 7 bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Mike McCandless http://blog.mikemccandless.com
[ANNOUNCE] Apache Lucene 4.9.1 released
September 2014, Apache Lucene™ 4.9.1 available The Lucene PMC is pleased to announce the release of Apache Lucene 4.9.1 Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. The release is available for immediate download at: http://lucene.apache.org/core/mirrors-core-latest-redir.html Lucene 4.9.1 includes 7 bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/core/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Mike McCandless http://blog.mikemccandless.com
[ANNOUNCE] Apache Solr 4.9.1 released
September 2014, Apache Solr™ 4.9.1 available The Lucene PMC is pleased to announce the release of Apache Solr 4.9.1 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 4.9.1 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html Solr 4.9.1 includes 2 bug fixes, as well as Lucene 4.9.1 and its 7 bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Mike McCandless http://blog.mikemccandless.com
Re: [VOTE] Release PyLucene 4.9.0-0
+1 I ran my usual smoke test: index first 100K docs from Wikipedia (en), do a few searches, run forceMerge. Mike McCandless http://blog.mikemccandless.com On Mon, Jul 7, 2014 at 11:14 AM, Andi Vajda va...@apache.org wrote: The PyLucene 4.9.0-0 release tracking the recent release of Apache Lucene 4.9.0 is ready. *** ATTENTION *** Starting with release 4.8.0, Lucene now requires Java 1.7 at the minimum. Using Java 1.6 with Lucene 4.8.0 and newer is not supported. On Mac OS X, Java 6 is still a common default, please upgrade if you haven't done so already. A common upgrade is Oracle Java 1.7 for Mac OS X: http://docs.oracle.com/javase/7/docs/webnotes/install/mac/mac-jdk.html On Mac OS X, once installed, a way to make Java 1.7 the default in your bash shell is: $ export JAVA_HOME=`/usr/libexec/java_home` Be sure to verify that this JAVA_HOME value is correct. On any system, if you're upgrading your Java installation, please rebuild JCC as well. You must use the same version of Java for both JCC and PyLucene. *** /ATTENTION *** A release candidate is available from: http://people.apache.org/~vajda/staging_area/ A list of changes in this release can be seen at: http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_4_9/CHANGES PyLucene 4.9.0 is built with JCC 2.20 included in these release artifacts. A list of Lucene Java changes can be seen at: http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_9_0/lucene/CHANGES.txt Please vote to release these artifacts as PyLucene 4.9.0-0. Anyone interested in this release can and should vote ! Thanks ! Andi.. ps: the KEYS file for PyLucene release signing is at: http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS http://people.apache.org/~vajda/staging_area/KEYS pps: here is my +1
Re: Near real time reader using ControlledRealTimeReopenThread
Don't call IndexWriter.commit with each added document. Call it only when you need to ensure durability (all index changes are written to stable storage). You spawn CRTRT, passing it your SearcherManager and IndexWriter, and it periodically reopens for you, with methods to wait for a specific indexing generation if a given search must be real-time. See Lucene's test cases for examples on how to use ControlledRealTimeReopenThread... Mike McCandless http://blog.mikemccandless.com On Wed, Jun 25, 2014 at 9:44 AM, Arun B C bcarunm...@yahoo.com.invalid wrote: Hi General Group, Am currently using Lucene 4.7.2. I use DirectoryReader directoryreader = DirectoryReader.open(indexWriter, true); to get near real time reader. In order to manage directoryReader instance being shared in a multi-threaded environment, People suggested to use NRTManager. But I dont find NRTManager anymore in version 4.7.2. I think it was replaced with ControlledRealTimeReopenThread. I am not able to find any information about Near real time in ControlledRealTimeReopenThread java doc. I also found a sample provided by a person 'futuretelematics' in http://stackoverflow.com/questions/17993960/lucene-4-4-0-new-controlledrealtimereopenthread-sample-usage. It has indexWriter.commit in create and update index methods. Is that right, will it not slow down search? or point me to a sample/information to achieve near real time search using apache 4.7.2 or later. Please do let me know if you require any other information. Thanks in advance. Br, Arun BC
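The setup Mike describes maps onto roughly the following (a hedged sketch against the Lucene 4.x API the thread mentions; the field name, intervals, and the pre-existing `writer` are illustrative assumptions, not from the original messages):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.TrackingIndexWriter;
import org.apache.lucene.search.ControlledRealTimeReopenThread;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;

// Assumes an already-configured IndexWriter named "writer".
TrackingIndexWriter trackingWriter = new TrackingIndexWriter(writer);
SearcherManager manager = new SearcherManager(writer, true, null);

// Reopen at most every 60s when no one is waiting, and within ~100ms
// when a caller is blocked in waitForGeneration():
ControlledRealTimeReopenThread<IndexSearcher> reopenThread =
    new ControlledRealTimeReopenThread<>(trackingWriter, manager, 60.0, 0.1);
reopenThread.setDaemon(true);
reopenThread.start();

// Index through the tracking writer so you get an indexing generation back:
Document doc = new Document();
doc.add(new TextField("body", "hello world", Field.Store.NO));
long gen = trackingWriter.addDocument(doc);

// Only if this particular search must see that exact document:
reopenThread.waitForGeneration(gen);
IndexSearcher searcher = manager.acquire();
try {
    // ... run searches ...
} finally {
    manager.release(searcher);
}
// Call writer.commit() separately, on your durability schedule.
```

The point of the two intervals is exactly the decoupling in the reply: reopens happen in the background, and commit frequency is chosen independently for durability.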
Re: [VOTE] Release PyLucene 4.8.0-1
+1 to release. I ran my usual smoke test: index first 100K Wikipedia docs, forceMerge, run a few searches. Mike McCandless http://blog.mikemccandless.com On Wed, Apr 30, 2014 at 5:07 PM, Andi Vajda va...@apache.org wrote: The PyLucene 4.8.0-1 release tracking the recent release of Apache Lucene 4.8.0 is ready. *** ATTENTION *** Starting with release 4.8.0, Lucene now requires Java 1.7 at the minimum. Using Java 1.6 with Lucene 4.8.0 is not supported. On Mac OS X, Java 6 is still a common default, please upgrade if you haven't done so already. A common upgrade is Oracle Java 1.7 for Mac OS X: http://docs.oracle.com/javase/7/docs/webnotes/install/mac/mac-jdk.html On Mac OS X, once installed, a way to make Java 1.7 the default in your bash shell is: $ export JAVA_HOME=`/usr/libexec/java_home` Be sure to verify that JAVA_HOME value. On any system, if you're upgrading your Java installation, please rebuild JCC as well. You must use the same version of Java for both JCC and PyLucene. *** /ATTENTION *** A release candidate is available from: http://people.apache.org/~vajda/staging_area/ A list of changes in this release can be seen at: http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_4_8/CHANGES PyLucene 4.8.0 is built with JCC 2.19 included in these release artifacts. The version of JCC included with PyLucene did not change since the previous release. A list of Lucene Java changes can be seen at: http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_8_0/lucene/CHANGES.txt Please vote to release these artifacts as PyLucene 4.8.0-1. Anyone interested in this release can and should vote ! Thanks ! Andi.. ps: the KEYS file for PyLucene release signing is at: http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS http://people.apache.org/~vajda/staging_area/KEYS pps: here is my +1
Re: [VOTE] Release PyLucene 4.6.1-0
Hmm I see many ._* files in the .tar.gz, e.g.: mike@vine:~/src/pylucene-4.6.1-0/jcc$ tar tzf pylucene-4.6.1-0-src.tar.gz | head ./._pylucene-4.6.1-0 pylucene-4.6.1-0/ pylucene-4.6.1-0/._CHANGES pylucene-4.6.1-0/CHANGES pylucene-4.6.1-0/._CREDITS pylucene-4.6.1-0/CREDITS pylucene-4.6.1-0/._extensions.xml pylucene-4.6.1-0/extensions.xml pylucene-4.6.1-0/._INSTALL pylucene-4.6.1-0/INSTALL Mike McCandless http://blog.mikemccandless.com On Wed, Feb 5, 2014 at 7:29 PM, Andi Vajda va...@apache.org wrote: The PyLucene 4.6.1-0 release tracking the recent release of Apache Lucene 4.6.1 is ready. A release candidate is available from: http://people.apache.org/~vajda/staging_area/ A list of changes in this release can be seen at: http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_4_6/CHANGES PyLucene 4.6.1 is built with JCC 2.19 included in these release artifacts: http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/CHANGES A list of Lucene Java changes can be seen at: http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_6_1/lucene/CHANGES.txt Please vote to release these artifacts as PyLucene 4.6.1-0. Thanks ! Andi.. ps: the KEYS file for PyLucene release signing is at: http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS http://people.apache.org/~vajda/staging_area/KEYS pps: here is my +1
Re: [nag] [VOTE] Release PyLucene 4.5.0-2
+1 to wait for 4.5.1 instead? Mike McCandless http://blog.mikemccandless.com On Thu, Oct 17, 2013 at 10:43 PM, Andi Vajda va...@apache.org wrote: One more PMC vote is needed to finalize this release. Then, we could wait for Lucene 4.5.1 to happen instead ? Andi.. -- Forwarded message -- Date: Mon, 14 Oct 2013 14:06:45 -0400 From: Steve Rowe sar...@gmail.com To: general@lucene.apache.org, Andi Vajda va...@apache.org Cc: pylucene-...@lucene.apache.org Subject: Re: [VOTE] Release PyLucene 4.5.0-2 +1 Having installed setuptools 1.1.6 for the previous RC, I successfully installed JCC and pylucene and got 0 failures from 'make test'. One small thing about the installation instructions: I had to run 'make test' as root because of some permissions issues (since 'make install' was run as root and apparently took ownership of some files in the unpacked distribution directory) - seems like that shouldn't be necessary. - running build_ext running install running bdist_egg running egg_info writing lucene.egg-info/PKG-INFO error: lucene.egg-info/PKG-INFO: Permission denied make: *** [install-test] Error 1 - Steve On Oct 13, 2013, at 11:04 PM, Andi Vajda va...@apache.org wrote: The PyLucene 4.5.0-2 release tracking the recent release of Apache Lucene 4.5.0 is ready. A release candidate is available from: http://people.apache.org/~vajda/staging_area/ A list of changes in this release can be seen at: http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_4_5/CHANGES PyLucene 4.5.0 is built with JCC 2.18 included in these release artifacts: http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/CHANGES A list of Lucene Java changes can be seen at: http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_5_0/lucene/CHANGES.txt Please vote to release these artifacts as PyLucene 4.5.0-2. Thanks ! Andi.. 
ps: the KEYS file for PyLucene release signing is at: http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS http://people.apache.org/~vajda/staging_area/KEYS pps: here is my +1
Re: [VOTE] Release PyLucene 4.3.1-1
Hmm I see two test failures, on Linux, Python 2.7.3, Java 1.7.0_07 : ERROR: testCachingWorks (__main__.CachingWrapperFilterTestCase) -- Traceback (most recent call last): File test/test_CachingWrapperFilter.py, line 53, in testCachingWorks strongRef = cacher.getDocIdSet(context, context.reader().getLiveDocs()) AttributeError: 'IndexReader' object has no attribute 'getLiveDocs' and: ERROR: testPayloadsPos0 (__main__.PositionIncrementTestCase) -- Traceback (most recent call last): File test/test_PositionIncrement.py, line 257, in testPayloadsPos0 pspans = MultiSpansWrapper.wrap(searcher.getTopReaderContext(), snq) File /home/mike/src/pylucene-4.3.1-1/test/MultiSpansWrapper.py, line 49, in wrap return query.getSpans(ctx, ctx.reader().getLiveDocs(), termContexts) AttributeError: 'IndexReader' object has no attribute 'getLiveDocs' Mike McCandless http://blog.mikemccandless.com On Wed, Jun 26, 2013 at 4:07 PM, Andi Vajda va...@apache.org wrote: The PyLucene 4.3.1-1 release tracking the recent release of Apache Lucene 4.3.1 is ready. A release candidate is available from: http://people.apache.org/~vajda/staging_area/ A list of changes in this release can be seen at: http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_4_3/CHANGES PyLucene 4.3.1 is built with JCC 2.16 included in these release artifacts: http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/CHANGES A list of Lucene Java changes can be seen at: http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_3_1/lucene/CHANGES.txt Please vote to release these artifacts as PyLucene 4.3.1-1. Thanks ! Andi.. ps: the KEYS file for PyLucene release signing is at: http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS http://people.apache.org/~vajda/staging_area/KEYS pps: here is my +1
Re: Re[2]: minFuzzyLength in FuzzySuggester behaves differently for English and Russian
On Wed, Jun 5, 2013 at 2:51 AM, Artem Lukanin ice...@mail.ru wrote: OK, I will try to do it myself. Thank you! As I understand I have to clone lucene_solr_4_3 from https://github.com/apache/lucene-solr.git and upload a patch to the issue for review? I'm not a git user, but that sounds right! See here for more details: http://wiki.apache.org/lucene-java/HowToContribute Mike McCandless http://blog.mikemccandless.com
Re: IndexWriter.commit() performance
On Tue, Jun 4, 2013 at 7:31 PM, Renata Vaccaro ren...@emailtopia.com wrote: Thanks. I need the documents to be searchable as soon as they are added. I also need the documents added to survive a machine crash. Soft commits and NRT gets might work, but from what I've read they are only available for Solr? Likely commits got slower on upgrade because on your very, very old Lucene version fsync was not called, so there was no safety on OS/hardware crash to ensure the index was intact. Solr's soft commit uses Lucene's near-real-time APIs, so you can definitely do this with just Lucene: pass the IndexWriter to DirectoryReader.open, and then use DirectoryReader.openIfChanged to reopen (without committing). This lets you decouple durability to crashes (how often you commit) from index-to-search latency (how often you reopen the reader). Mike McCandless http://blog.mikemccandless.com
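The near-real-time pattern described above looks roughly like this in code (a sketch against the Lucene 4.x API; the reopen cadence and the pre-existing `writer` are assumptions):

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;

// Assumes an already-configured IndexWriter named "writer".
// Open an NRT reader directly from the writer -- no commit needed:
DirectoryReader reader = DirectoryReader.open(writer, true);

// Periodically (e.g. every few hundred ms) refresh the searchable view:
DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, true);
if (newReader != null) {
    reader.close();
    reader = newReader;
}
IndexSearcher searcher = new IndexSearcher(reader);

// Commit on a separate, much slower schedule, purely for crash durability:
writer.commit();
```

How often you call `openIfChanged` controls index-to-search latency; how often you call `commit` controls what survives a crash — the two knobs are independent.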
Re: minFuzzyLength in FuzzySuggester behaves differently for English and Russian
This unfortunately is a limitation of the current FuzzySuggester implementation: it computes edits in UTF-8 space instead of Unicode character (code point) space. This should be fixable: we'd need to fix TokenStreamToAutomaton to work in Unicode character space, then fix FuzzySuggester to do the same steps that FuzzyQuery does: do the LevN expansion in Unicode character space, then convert that automaton to UTF-8, then intersect with the suggest FST. Could you open an issue for this? I won't have any time soon to work on this but we should open an issue to discuss / see if someone else has time / iterate. Thanks! Mike McCandless http://blog.mikemccandless.com On Thu, May 30, 2013 at 8:39 AM, Artem Lukanin ice...@mail.ru wrote: BTW, I have to set maxEdits=2 to allow letter transpositions in Russian, because there will be actually 2 transpositions of 4 bytes representing 2 Russian letters in UTF-8. The worst case is when one field has both Russian and English letters (or e.g. numbers), where I have to use minFuzzyLength=6 and maxEdits=2, which will work only for Russian words of more than 2 letters and for English words of more than 5 letters! -- View this message in context: http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-tp4067018p4067026.html Sent from the Lucene - General mailing list archive at Nabble.com.
Re: minFuzzyLength in FuzzySuggester behaves differently for English and Russian
Thanks Artem. If you have time/energy to work out a patch that would be great :) Mike McCandless http://blog.mikemccandless.com On Mon, Jun 3, 2013 at 7:17 AM, Artem Lukanin ice...@mail.ru wrote: I have opened an issue: https://issues.apache.org/jira/browse/LUCENE-5030 -- View this message in context: http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-tp4067018p4067774.html Sent from the Lucene - General mailing list archive at Nabble.com.
Re: How to convert TermDocs and TermEnum ??
Hi, Have a look at MIGRATE.txt? Mike McCandless http://blog.mikemccandless.com On Mon, May 20, 2013 at 10:54 AM, A. Lotfi majidna...@yahoo.com wrote: Hi, I found some difficulties converting from old API to the newest one : import org.apache.lucene.index.TermDocs; // does not exist import org.apache.lucene.index.TermEnum; // does not exist I tried the migration doc, but could not figure it out, here my code : private List<Short>[] loadFieldValues(IndexReader reader) throws IOException { List<Short>[] retArray = new List[reader.maxDoc()]; TermEnum termEnum = reader.terms (new Term (GEO_LOC)); try { do { Term term = termEnum.term(); if (term==null || term.field() != GEO_LOC) break; TermDocs td = reader.termDocs(term); String value = term.text(); ... ... td.close(); } while (termEnum.next()); } finally { termEnum.close(); } return retArray; } I will appreciate your help. thanks
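For reference, the MIGRATE.txt replacement for this kind of TermEnum/TermDocs loop on the 4.x API looks roughly like the following (a sketch, not from the original thread; `"GEO_LOC"` is the field name from the question, and the surrounding bookkeeping is elided):

```java
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;

// Terms/TermsEnum/DocsEnum replace the old TermEnum/TermDocs.
// Terms is per-field, so no field-boundary check is needed anymore:
Terms terms = MultiFields.getTerms(reader, "GEO_LOC");
if (terms != null) {
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    TermsEnum termsEnum = terms.iterator(null);
    BytesRef text;
    while ((text = termsEnum.next()) != null) {
        String value = text.utf8ToString();              // was term.text()
        DocsEnum docs = termsEnum.docs(liveDocs, null);  // was reader.termDocs(term)
        int docID;
        while ((docID = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            // ... record (docID, value) ...
        }
    }
}
```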
Re: [VOTE] Release PyLucene 4.3.0-1
+1 to release! Exciting to finally have a PyLucene 4.x :) I ran my usual smoke test (index first 100K Wikipedia docs and run a couple searches) and it looks great! Only strangeness was ... I set JDK['linux2'] to my install location (Oracle JDK), and normally this works fine, but this time setup.py couldn't find javac nor javadoc ... so I had to go set those two full paths as well, and then jcc built fine. Mike McCandless http://blog.mikemccandless.com On Mon, May 6, 2013 at 8:27 PM, Andi Vajda va...@apache.org wrote: It looks like the time has finally come for a PyLucene 4.x release ! The PyLucene 4.3.0-1 release tracking the recent release of Apache Lucene 4.3.0 is ready. A release candidate is available from: http://people.apache.org/~vajda/staging_area/ A list of changes in this release can be seen at: http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_4_3/CHANGES PyLucene 4.3.0 is built with JCC 2.16 included in these release artifacts: http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/CHANGES A list of Lucene Java changes can be seen at: http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_3_0/lucene/CHANGES.txt Please vote to release these artifacts as PyLucene 4.3.0-1. Thanks ! Andi.. ps: the KEYS file for PyLucene release signing is at: http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS http://people.apache.org/~vajda/staging_area/KEYS pps: here is my +1
Re: [VOTE] Release PyLucene 4.2.1
I'm having trouble on an Ubuntu 12.10 box, using Java 1.7_07 and Python 2.7.3. I was able to build and install both JCC and PyLucene, apparently successfully. I can import lucene in Python and print lucene.VERSION and confirm it's 4.2.1. lucene.initVM(lucene.CLASSPATH) succeeds. Yet, there are no Lucene classes in the lucene module? When I print dir(lucene) I just get this: ['CLASSPATH', 'ConstVariableDescriptor', 'FinalizerClass', 'FinalizerProxy', 'InvalidArgsError', 'JArray', 'JArray_bool', 'JArray_byte', 'JArray_char', 'JArray_double', 'JArray_float', 'JArray_int', 'JArray_long', 'JArray_object', 'JArray_short', 'JArray_string', 'JCCEnv', 'JCC_VERSION', 'JObject', 'JavaError', 'PrintWriter', 'StringWriter', 'VERSION', '__builtins__', '__dir__', '__doc__', '__file__', '__name__', '__package__', '__path__', '_lucene', 'findClass', 'getVMEnv', 'initVM', 'makeClass', 'makeInterface', 'os', 'sys'] Am I missing something silly...? Shouldn't Lucene's classes (eg FSDirectory) be visible in globals() in the lucene module? Mike McCandless http://blog.mikemccandless.com On Sat, Apr 13, 2013 at 5:51 PM, Andi Vajda va...@apache.org wrote: It looks like the time has finally come for a PyLucene 4.x release ! The PyLucene 4.2.1-0 release tracking the recent release of Apache Lucene 4.2.1 is ready. A release candidate is available from: http://people.apache.org/~vajda/staging_area/ A list of changes in this release can be seen at: http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_4_2/CHANGES PyLucene 4.2.1 is built with JCC 2.16 included in these release artifacts: http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/CHANGES A list of Lucene Java changes can be seen at: http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_2_1/lucene/CHANGES.txt Please vote to release these artifacts as PyLucene 4.2.1-0. Thanks ! Andi.. 
ps: the KEYS file for PyLucene release signing is at: http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS http://people.apache.org/~vajda/staging_area/KEYS pps: here is my +1
Re: Welcome Tommaso Teofili to the PMC
Welcome Tommaso! Mike McCandless http://blog.mikemccandless.com On Sun, Mar 17, 2013 at 11:04 AM, Steve Rowe sar...@gmail.com wrote: I'm pleased to announce that Tommaso Teofili has accepted the PMC's invitation to join. Welcome Tommaso! - Steve - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: different result for 'OR'
That is odd. Can you print the Query.toString of the actual two queries you are running? (I think the OR must be capitalized to be parsed by the classic QueryParser?). Mike McCandless http://blog.mikemccandless.com On Mon, Jan 21, 2013 at 7:34 AM, Jeroen Venderbosch j...@woodwing.com wrote: I would expect that the query *q=description:(electronics) or description:(usb)* would give the same number of results as *q=description:(electronics or usb)*. But the first query returns 9662 documents and the second one 9493. What is the difference? -- View this message in context: http://lucene.472066.n3.nabble.com/different-result-for-OR-tp4035014.html Sent from the Lucene - General mailing list archive at Nabble.com.
Re: Is there a way to clear lucene's cache?
Lucene itself doesn't do any caching. Maybe you are thinking of Solr? The OS also does caching, so if you want a cold test you'll have to tell the OS to flush its IO cache in between tests. EG on Linux do sudo echo 3 > /proc/sys/vm/drop_caches. Mike McCandless http://blog.mikemccandless.com On Wed, Jan 2, 2013 at 10:39 AM, S L sol.leder...@gmail.com wrote: I'm doing some performance testing and caching is not helpful for the tests. Is there a way to clear lucene's query cache between rounds of tests? I've tried restarting tomcat but that doesn't help. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-clear-lucene-s-cache-tp4030059.html Sent from the Lucene - General mailing list archive at Nabble.com.
Re: Welcome Sami Siren to the PMC
Welcome Sami! Mike McCandless http://blog.mikemccandless.com On Wed, Dec 12, 2012 at 3:17 PM, Mark Miller markrmil...@gmail.com wrote: I'm please to announce that Sami Siren has accepted the PMC's invitation to join. Welcome Sami! - Mark - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Welcome Alan Woodward as Lucene/Solr committer
Welcome aboard Alan! Happy Coding, Mike McCandless http://blog.mikemccandless.com On Wed, Oct 17, 2012 at 1:36 AM, Robert Muir rcm...@gmail.com wrote: I'm pleased to announce that the Lucene PMC has voted Alan as a Lucene/Solr committer. Alan has been contributing patches on various tricky stuff: positions iterators, span queries, highlighters, codecs, and so on. Alan: its tradition that you introduce yourself with your background. I think your account is fully working and you should be able to add yourself to the who we are page on the website as well. Congratulations!
Re: Can Lucene be used where each entity to be ranked is a set of documents?
On Wed, Aug 22, 2012 at 10:36 AM, Robert Muir rcm...@gmail.com wrote: On Tue, Aug 21, 2012 at 7:42 AM, shashank shashank91.b...@gmail.com wrote: Hello, I am working on a project wherein each entity to be ranked is not a single document but infact a group of documents. So, the ranking not only involves standard search engine scoring parameters but also the association of documents within an entity/group i.e. association of documents within the group also contributes to the ranking score. You may want to look at Lucene's block join module (http://lucene.apache.org/core/4_0_0-BETA/join/index.html): combined with IndexWriter's add/updateDocuments functionality which lets you add documents as a 'group'. Currently I think the way in which the group is scored is just an enum with a fixed set of choices (ScoreMode), so you might have to modify the source code at the moment if you have a sophisticated way of scoring the group of documents, but this would be nice to fix so that its something extensible... Also look at grouping module. If you have no parent documents/fields (ie only child docs that must be grouped/scored according to some criteria) then grouping should work. But Robert is right: the scoring of a group is fairly simplistic now ... so you may need to tweak the code to do what you need (and please send patches back!). Mike McCandless http://blog.mikemccandless.com
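Concretely, the block-join approach suggested above looks roughly like this (a hedged sketch against the 4.0-era join module; the `docType` marker field and other names are illustrative assumptions, not from the thread):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.search.join.ToParentBlockJoinQuery;

// Index the whole entity atomically as one block:
// child documents first, the parent document last.
List<Document> block = new ArrayList<>();
for (Document child : children) {
    block.add(child);
}
Document parent = new Document();
parent.add(new StringField("docType", "parent", Field.Store.NO));
block.add(parent);
writer.addDocuments(block);

// At search time: a cached filter identifying parent docs, plus a child
// query; ScoreMode picks how child hits roll up into the group's score.
Filter parentsFilter = new CachingWrapperFilter(
    new QueryWrapperFilter(new TermQuery(new Term("docType", "parent"))));
Query childQuery = new TermQuery(new Term("body", "something"));
Query joinQuery =
    new ToParentBlockJoinQuery(childQuery, parentsFilter, ScoreMode.Max);
```

The fixed `ScoreMode` enum (Max, Total, Avg, None) is the limitation both replies mention: anything more sophisticated currently means modifying the source.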
Re: Is query-time Join actually in Lucene 3.6?
Query-time join lives under Lucene's contrib/join in 3.6: http://lucene.apache.org/core/3_6_1/lucene-contrib/index.html#join Mike McCandless http://blog.mikemccandless.com On Tue, Aug 7, 2012 at 11:41 AM, Homer Nabble homernab...@gmail.com wrote: This page states New query-time joining is more flexible (but less performant) than index-time joins. https://wiki.apache.org/lucene-java/Lucene3.6 However, I download Lucene 3.6.0 (and 3.6.1) and there is no mention of query-time join in the CHANGES.TXT. Also, I see no binaries for org.apache.lucene.search.join - though the API doc in the same download contains information about this package: lucene-3.6.0/docs/api/all/org/apache/lucene/search/join/package-frame.html Could someone please let me know what the story is with query time joins? Thanks!
Re: Is it possible to Lucene with a database managed by external application ?
That should be fine. You just have to separately pull the added/updated rows from the DB and index them into your Lucene index. Mike McCandless http://blog.mikemccandless.com On Tue, May 29, 2012 at 3:09 AM, Ievgen Krapyva ykrap...@gmail.com wrote: Hi everybody, I've just started reading about Lucene and thinking whether I can use it in my case. My case is that the database content I want to provide the search capability for is managed (new entries added / removed / edited) by other application (written in PHP). Am I right in thinking that to get the best with Lucene I have to regulary update indexes, i.e. make a full database indexing ? Thanks.
Re: How to construct the term frequency vector of all words in dictionary?
You can get a TermEnum (IndexReader.terms()) and then keep calling .next() to advance to the next term, and then .docFreq() to get the document frequency (how many documents have the term) for that term... Mike McCandless http://blog.mikemccandless.com On Tue, May 15, 2012 at 1:24 PM, Aoi Morida xu.xum...@gmail.com wrote: Hi all, I want to create the term frequency vector for all words in the dictionary. I find that the function getTermFreqVector() can only give term frequency of the words existed in the particular document. BTW, I want to extract words in the dictionary and I find that the function getWordsIterator() can do this. But as I import org.apache.lucene.search.spell.LuceneDictionary, there is always an error message. I wondered what's wrong with it. My lucene version is 2.9.4. Thank you. Regards, Aoi -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-construct-the-term-frequency-vector-of-all-words-in-dictionary-tp3983898.html Sent from the Lucene - General mailing list archive at Nabble.com.
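On the 2.9.x/3.x API the question is about, that iteration looks roughly like this (a sketch, not from the original thread):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

// Walk every term in the index, recording its document frequency:
TermEnum termEnum = reader.terms();
try {
    while (termEnum.next()) {
        Term term = termEnum.term();
        int df = termEnum.docFreq();  // # docs containing this term
        // ... store (term.field(), term.text(), df) ...
    }
} finally {
    termEnum.close();
}
```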
Re: Lucene index directory on disk: (i) do I need to keep it and (ii) how do I handle encryption?
FSDirectory won't load the index into RAM. But RAMDirectory can: eg, you can init a RAMDirectory, passing your FSDir to its ctor, to copy all files into RAM. Then you can delete the FSDir, but realize this means once your app shuts down you've lost the index. I think you can handle your encrypted case by copying the files yourself from FSDir into RAMDir, decrypting as you go. Mike McCandless http://blog.mikemccandless.com On Tue, Apr 24, 2012 at 10:36 AM, Ilya Zavorin izavo...@caci.com wrote: I have two somewhat related questions. I'm working on an Android app that uses Lucene indexing and search. Currently, I precompute an index on desktop and then copy the resulting index folder and files to the Android device. In my app, I open the index up like this: String indexDir = /mnt/sdcard/MyIndexDir; Directory dir = FSDirectory.open(indexDir); I have 2 questions: 1. Does Lucene load the entire index into memory? If so, does it mean that after creating the Directory object, I can delete the index dir from the device? Does this depend on the size of the index? If so, do I have an option of forcing it to load the whole index into memory regardless of its size? 2. Right now the index folder is unencrypted. This is temporary: we have a requirement to encrypt every single file and folder that is used by the app. The problem with this is that I can't create an unencrypted copy of the folder on the device, i.e. I can't do something like this: String indexDirEncr = /mnt/sdcard/MyIndexDirEncr; String indexDirUnencr = /mnt/sdcard/MyIndexDirUnencr; // // Decrypt indexDirEncr and store it in indexDirUnencr // Directory dir = FSDirectory.open(indexDirUnencr); Is there a good way to handle this? That is, is it somehow possible to load the encrypted folder into memory, decrypt it and then load the decrypted version from memory to create a Directory object? Thanks much! Ilya Zavorin
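The copy-into-RAM idea from this reply looks roughly like the following (a sketch against the 3.x-era API current at the time; the path is the one from the question):

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

// Bulk-copy every file of the on-disk index into heap memory:
Directory fsDir = FSDirectory.open(new File("/mnt/sdcard/MyIndexDir"));
Directory ramDir = new RAMDirectory(fsDir);
fsDir.close();

// From here the on-disk folder is no longer needed by the reader --
// but the index is gone when the process exits.
IndexSearcher searcher = new IndexSearcher(IndexReader.open(ramDir));
```

For the encrypted case, the equivalent hand-rolled loop would read each file from the encrypted store, decrypt it in memory, and write the plaintext bytes into the `RAMDirectory` via `createOutput`, so no decrypted copy ever touches disk.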
Re: Welcome Jan Høydahl to the PMC
Welcome Jan! Mike McCandless http://blog.mikemccandless.com On Mon, Feb 13, 2012 at 9:50 AM, Robert Muir rcm...@gmail.com wrote: Hello, I'm pleased to announce that Jan has accepted the PMC's invitation to join. Congratulations Jan! -- lucidimagination.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[ANNOUNCE] Apache Lucene 3.4.0 released
September 14 2011, Apache Lucene™ 3.4.0 available The Lucene PMC is pleased to announce the release of Apache Lucene 3.4.0. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. This release contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at: http://www.apache.org/dyn/closer.cgi/lucene/java (see note below). If you are already using Apache Lucene 3.1, 3.2 or 3.3, we strongly recommend you upgrade to 3.4.0 because of the index corruption bug on OS or computer crash or power loss (LUCENE-3418), now fixed in 3.4.0. See the CHANGES.txt file included with the release for a full list of details. Lucene 3.4.0 Release Highlights: * Fixed a major bug (LUCENE-3418) whereby a Lucene index could easily become corrupted if the OS or computer crashed or lost power. * Added a new faceting module (contrib/facet) for computing facet counts (both hierarchical and non-hierarchical) at search time (LUCENE-3079). * Added a new join module (contrib/join), enabling indexing and searching of nested (parent/child) documents using BlockJoinQuery/Collector (LUCENE-3171). * It is now possible to index documents with term frequencies included but without positions (LUCENE-2048); previously omitTermFreqAndPositions always omitted both. * The modular QueryParser (contrib/queryparser) can now create NumericRangeQuery. * Added SynonymFilter, in contrib/analyzers, to apply multi-word synonyms during indexing or querying, including parsers to read the wordnet and solr synonym formats (LUCENE-3233). * You can now control how documents that don't have a value on the sort field should sort (LUCENE-3390), using SortField.setMissingValue. * Fixed a case where term vectors could be silently deleted from the index after addIndexes (LUCENE-3402). 
Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Happy searching, Apache Lucene/Solr Developers
[ANNOUNCE] Apache Solr 3.4.0 released
September 14 2011, Apache Solr™ 3.4.0 available The Lucene PMC is pleased to announce the release of Apache Solr 3.4.0. Apache Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites. This release contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at: http://www.apache.org/dyn/closer.cgi/lucene/solr (see note below). If you are already using Apache Solr 3.1, 3.2 or 3.3, we strongly recommend you upgrade to 3.4.0 because of the index corruption bug on OS or computer crash or power loss (LUCENE-3418), now fixed in 3.4.0. See the CHANGES.txt file included with the release for a full list of details. Solr 3.4.0 Release Highlights: * Bug fixes and improvements from Apache Lucene 3.4.0, including a major bug (LUCENE-3418) whereby a Lucene index could easily become corrupted if the OS or computer crashed or lost power. * SolrJ client can now parse grouped and range facets results (SOLR-2523). * A new XsltUpdateRequestHandler allows posting XML that's transformed by a provided XSLT into a valid Solr document (SOLR-2630). * Post-group faceting option (group.truncate) can now compute facet counts for only the highest ranking documents per-group. (SOLR-2665). * Add commitWithin update request parameter to all update handlers that were previously missing it. This tells Solr to commit the change within the specified amount of time (SOLR-2540). * You can now specify NIOFSDirectory (SOLR-2670). * New parameter hl.phraseLimit speeds up FastVectorHighlighter (LUCENE-3234). 
* The query cache and filter cache can now be disabled per request. See http://wiki.apache.org/solr/CommonQueryParameters#Caching_of_filters (SOLR-2429). * Improved memory usage, build time, and performance of SynonymFilterFactory (LUCENE-3233). * Added omitPositions to the schema, so you can omit position information while still indexing term frequencies (LUCENE-2048). * Various fixes for multi-threaded DataImportHandler. Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Happy searching, Apache Lucene/Solr Developers
Re: Caused by: java.io.IOException: read past EOF
Can you post the traceback/exception? Are you overriding the default LockFactory for your Directory? Mike McCandless http://blog.mikemccandless.com On Fri, Sep 9, 2011 at 6:07 AM, Java_dev abde...@hotmail.com wrote: Hi Michael, Thx for taking time to help me out. We are using Lucene to index the titles (maintitle, subtitle, isbnnumber, productavailability...) in our database for a faster search on several fields. We are adding new titles every day and a lot of titles are being updated. The index is stored in a directory on the local filesystem (Linux RedHat). In the directory there are +/- 50 files (segments_grr4f, segments.gen, _21z41.cfs (847MB), _6zq20.cfs (643MB) ...) When the update is in progress, lucene creates a 'write.lock' file now and then. Thx in advance. -- View this message in context: http://lucene.472066.n3.nabble.com/Caused-by-java-io-IOException-read-past-EOF-tp3319842p3322467.html
Re: [VOTE] Release PyLucene 3.3 (rc3)
+1 to release! Smoke test passed and I see grouping module classes are visible by default! Thanks Andi :) Mike McCandless http://blog.mikemccandless.com On Thu, Jul 21, 2011 at 12:47 PM, Andi Vajda va...@apache.org wrote: A problem was found with rc2. Please, vote on rc3, thanks :-) The Apache PyLucene 3.3-3 release closely tracking the recent release of Apache Lucene Java 3.3 is ready. A release candidate is available from: http://people.apache.org/~vajda/staging_area/ This new release candidate fixes an issue with wrapping the new grouping contrib module which is now part of the PyLucene build. A list of changes in this release can be seen at: http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_3_3/CHANGES PyLucene 3.3 is built with JCC 2.10 included in these release artifacts. A list of Lucene Java changes can be seen at: http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_3/lucene/CHANGES.txt Please vote to release these artifacts as PyLucene 3.3-3. Thanks ! Andi.. ps: the KEYS file for PyLucene release signing is at: http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS http://people.apache.org/~vajda/staging_area/KEYS pps: here is my +1
Re: [VOTE] Release PyLucene 3.3.0
Everything looks good -- I was able to compile, run all tests successfully, and run my usual smoke test (indexing, optimizing, and searching the first 100K wikipedia docs), but... I then tried to enable the grouping module (lucene/contrib/grouping), by adding a GROUPING_JAR matching all the other contrib jars, and running make. This then hit various compilation errors -- is anyone able to enable the grouping module and compile successfully? Mike McCandless http://blog.mikemccandless.com On Fri, Jul 1, 2011 at 8:24 AM, Andi Vajda va...@apache.org wrote: The PyLucene 3.3.0-1 release closely tracking the recent release of Lucene Java 3.3 is ready. A release candidate is available from: http://people.apache.org/~vajda/staging_area/ A list of changes in this release can be seen at: http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_3_3/CHANGES PyLucene 3.3.0 is built with JCC 2.9 included in these release artifacts. A list of Lucene Java changes can be seen at: http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_3/lucene/CHANGES.txt Please vote to release these artifacts as PyLucene 3.3.0-1. Thanks ! Andi.. ps: the KEYS file for PyLucene release signing is at: http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS http://people.apache.org/~vajda/staging_area/KEYS pps: here is my +1
Re: [VOTE] Release PyLucene 3.2.0
+1 I built on OS X 10.6.6, passed all tests (I think? No overall summary in the end, but I didn't see any obvious problem), and ran my usual smoke test indexing first 100K docs from a line file from Wikipedia, and running a few searches. Mike McCandless http://blog.mikemccandless.com On Mon, Jun 6, 2011 at 4:58 PM, Andi Vajda va...@apache.org wrote: The PyLucene 3.2.0-1 release closely tracking the recent release of Lucene Java 3.2 is ready. A release candidate is available from: http://people.apache.org/~vajda/staging_area/ A list of changes in this release can be seen at: http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_3_2/CHANGES PyLucene 3.2.0 is built with JCC 2.9 included in these release artifacts. A list of Lucene Java changes can be seen at: http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_2/lucene/CHANGES.txt Please vote to release these artifacts as PyLucene 3.2.0-1. Thanks ! Andi.. ps: the KEYS file for PyLucene release signing is at: http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS http://people.apache.org/~vajda/staging_area/KEYS pps: here is my +1
Re: Lucene: is it possible to search with an error in one letter?
If you want to allow for any single character change, you can use FuzzyQuery. EG, pencil~1 allows for 1 character change, pencil~2 allows for 2. Note that FuzzyQuery is very costly in 3.x, but is substantially faster (e.g., by a factor of 100) in trunk / 4.0. Mike http://blog.mikemccandless.com On Mon, May 30, 2011 at 1:33 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, Yes, penc?l should do it. Otis --- Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: boraldo bora...@mail.ru To: general@lucene.apache.org Sent: Mon, May 30, 2011 8:08:54 AM Subject: Lucene: is it possible to search with an error in one letter? I have a document with the text pencil. I want to search for it using the query pencel, or vice versa. Is it possible? -- View this message in context: http://lucene.472066.n3.nabble.com/Lucene-is-it-possible-to-search-with-an-error-in-one-letter-tp3001723p3001723.html
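The distance behind FuzzyQuery's ~N syntax is Levenshtein edit distance. As a plain-Python illustration of the metric (not Lucene's implementation, which uses automata in 4.0): pencil vs. pencel is one substitution, so pencil~1 matches pencel.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance: minimum number of
    # single-character insertions, deletions, or substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("pencil", "pencel"))  # 1 -> within pencil~1
```

This also shows why 3.x FuzzyQuery was costly: without the automaton approach, every indexed term has to be scored against the query term with a computation like this one.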
Re: Welcome Chris Male & Andi Vajda as full Solr / Lucene Committers
Welcome! Mike http://blog.mikemccandless.com On Mon, May 23, 2011 at 12:39 PM, Simon Willnauer simon.willna...@googlemail.com wrote: Hi folks, I am happy to announce that the Lucene PMC has accepted Chris Male and Andi Vajda as Lucene/Solr committers. Congratulations & Welcome on board, Chris & Andi!! Simon
Re: Special Board Report for May 2011
know where to draw the line. I trust the great people of this community to know when it's better to discuss something in email. An example: if a new feature is being discussed, then it's ok if two people want to hash a few things out quickly, before they send a detailed and organized proposal to the list -- the details to hash out are the initial proposal's details. The rest should be followed on list, even if it means slightly slower response time. Today's list and JIRA volume always look to me like the response time is instantaneous. We have very active people from around the globe, so you have a high chance of receiving a response in no time. In the worst case it will take a couple of hours, but I don't remember when that last happened (which is an amazing thing!) Cheers, Shai On Fri, May 6, 2011 at 8:35 PM, Grant Ingersoll gsing...@apache.org wrote: More reading (shall I say required reading?). Benson does a good job of explaining some of the concepts around consensus and why we also should be primarily using mailing lists: https://blogs.apache.org/comdev/entry/how_apache_projects_use_consensus -Grant On May 5, 2011, at 10:10 AM, Grant Ingersoll wrote: I'd like to throw out another idea: I think we should standardize on rotating the PMC Chair every year. I think to date, there have been two Chairs: Doug and me. Back when Doug left, no one wanted to do it (both Hoss and I said we would if no one else wanted to) and so I took it on. For the most part, it's a thankless task of herding cats (albeit low volume, thankfully), despite the important-sounding name that marketing types love. I would like us to share the burden across the PMC by rotating it on an annual basis. Many other ASF projects do exactly this and I think it removes any political pressure. Have I sold it enough? ;-) Besides, I just know others are dying to file board reports on a quarterly basis! More inline below... 
On May 5, 2011, at 8:27 AM, Michael McCandless wrote: On Wed, May 4, 2011 at 6:40 PM, Grant Ingersoll gsing...@apache.org wrote: 2. I think we need to prioritize getting patch contributors more feedback sooner. I think some of this can be automated much like what Hadoop has done. This should help identify new committers sooner and encourage them to keep contributing. Big +1. We should be using automation everywhere we can. But, really, we (as all projects do) need more devs. Growing the community should be job #1 of all committers. Agreed, but this dovetails w/ the use of IRC. I realize live collab is nice, but it discourages those who aren't in the know about the channel being used from ever contributing. Say, for instance, I'm interested in DWPT (DocWriterPerThread), how am I supposed to know that at 8 am EDT on May 5th (made up example), three of the committers are going to be talking about it on IRC? If there is email about it, then I can participate. Nothing we do is so important that it can't wait a few hours or a day, besides the fact that email is damn near instantaneous these days anyway. Also, keep in mind that until about a year ago, most everything was done on the mailing list and I think we progressed just fine. Since then, dev@ has almost completely dried up in terms of discussions (factoring out JIRA mails which have picked up -- which is good) and the large majority of discussion takes place on IRC. I agree, however, we should have the IRC discussion on another thread. So, what other ideas do people have? I'll leave this thread open for a week or so and then add what we think are good things to https://svn.apache.org/repos/asf/lucene/board-reports/2011/special-board-report-may.txt The board meeting is on May 19th. I plan on attending. How about also: PMC members will be more proactive in tackling issues that erode the community? I think this would start with a thread on general@. 
We need to get in the habit of discussing even tiny elephants as soon as they appear, somehow. Yeah, I agree. The hard part for me, is I often feel like people on the outside make big deals about this stuff and don't get that even having the discussion is a very healthy sign. Besides the fact that no one likes confrontation and uncomfortable topics. We also, I think, are all tired of endless debates that go on and on w/ no resolution. It's one of the big downsides (and, of course, upsides) to consensus-based open source as opposed to the dictatorial approach. Here's an example: Is Lucid abusing their too-strong influence over Lucene/Solr? It's a great question, and I personally feel the answer today is no, but nevertheless we should be able to discuss it and similar could-be-controversial topics. I hopefully would agree we are good stewards of the fact that we
Re: Special Board Report for May 2011
On Wed, May 4, 2011 at 6:40 PM, Grant Ingersoll gsing...@apache.org wrote: At our core, this means we are supporting a set of libraries that can be used for search and related capabilities across a lot of different applications ranging in size and shape, as well as a server that makes those capabilities available and easy to consume without requiring Java programming for those who choose to use it. Our goal has always been to make the parts we like to work on as fast, efficient and capable as possible. As with all open source projects, anyone should be able to contribute where they see fit and to scratch their itch. Open source has always been evolutionary in code development, not revolutionary. +1 I will throw out some ideas as possibly helpful in continuing to build a strong community, but maybe they aren't. And, no, I don't think any one of these solves everything. 1. No more IRC for design decisions (answering user questions is OK, IMO) even if they are captured on JIRA. Either that or we should make IRC logged and public and part of the public record on Lucene/Solr. The fact is, most mailing list subscribers are not on IRC and IRC discussions/decisions rob many of us of the opportunity to participate in the design and it sometimes comes across that everything is done by the time it hits JIRA. It's also very hard for people who aren't on IRC to get the full gist of the discussion if only a summary is presented in JIRA. Also, due to time zones, many people are asleep while others are working. IRC also prevents ideas from breathing a bit. Also, since IRC isn't logged, there is less decorum/respect at times (even if I think the banter keeps things lighter most of the time) and even though most of us committers are friends, outsiders or potential contributors may not see sarcasm or jokes in the same way that the rest of us who know each other do. -0 Probably we should fork off a separate thread to discuss IRC? 
But here's my quick take: I feel there are times when it's appropriate and times when it's not, and we should use the right tool for the job at hand. EG, the recent landing of the [very large] concurrent flushing (DWPT) branch was a great example where live collaboration was very helpful, I think. I completely agree that no decisions are made on IRC: if it's not on the list, it didn't happen. Discussions can happen and if that results in an idea, an approach, that suggestion gets moved to an issue / to the dev list for iterating. 2. I think we need to prioritize getting patch contributors more feedback sooner. I think some of this can be automated much like what Hadoop has done. This should help identify new committers sooner and encourage them to keep contributing. Big +1. We should be using automation everywhere we can. But, really, we (as all projects do) need more devs. Growing the community should be job #1 of all committers. 3. As a core principle, design discussions, etc. should not take place in private emails or via IM or phone calls. I don't know how much of this there is, but I've seen hints of it from a variety of people to know it happens. Obviously, there is no way to enforce this other than people should take it to heart and stop it. +1 Also, big issues should not be sent via private email to hand-picked people. Send it to general@ 4. I think it goes w/o saying that we all learned our lessons about committing and reverting things. Reverting someone else's code is for when things break the build, not for political/ideological reasons. +1 Add to this "no way!" list: committing without first resolving the objections raised by other committers. And also: 'don't walk away from discussions, especially important ones'. Radio silence / silent treatment is not a good approach in the real world, and it's even worse in the open-source world. Try always to bring closure, to heal the community after strong disagreements. 5. 
People should commit and do their work where they see fit. If others have better ideas about refactoring them, then step up and help or do the refactoring afterwards. It's software. Not everything need be perfect the first time or in just the right location the first time. At the same time, if others want to refactor it and it doesn't hurt anything but ends up being better for more people b/c it is reusable and componentized, then the refactoring should not be a problem. +1, progress not perfection, as long as we are free to refactor. Freedom to refactor/poach is the bread & butter of open source. So, what other ideas do people have? I'll leave this thread open for a week or so and then add what we think are good things to https://svn.apache.org/repos/asf/lucene/board-reports/2011/special-board-report-may.txt The board meeting is on May 19th. I plan on attending. How about also PMC members will be more proactive in tackling issues that
Re: Special Board Report for May 2011
On Wed, May 4, 2011 at 7:26 PM, Ted Dunning ted.dunn...@gmail.com wrote: The amazing thing to me is that Lucene of all projects is having problems like this. Lucene has always been my primary example of Open Source Done Right. I think with passion comes blowups. I think it's natural, and, as long as the community heals, healthy. We will emerge stronger from this. I very much hope that it comes back to those roots. The people who contribute to Lucene are too good a group to have these problems. We will. This is a resilient community ;) In fact I find it very inspiring that despite this storm in the background, committers were still actively pushing things forward. EG, Simon & others landed the concurrent flushing (DWPT) branch... resulting in astounding gains in Lucene's indexing throughput on concurrent hardware (http://people.apache.org/~mikemccand/lucenebench/indexing.html). Mike http://blog.mikemccandless.com
Re: IndexFiles cmd runs, even when IndexFiles.java is deleted
Likely the .class file is still present? Javac compiles .java files into .class files, and then java executes from .class files. Mike http://blog.mikemccandless.com On Mon, May 2, 2011 at 8:13 AM, daniel daniel_pfis...@msn.com wrote: I'm new to Lucene and Java, I'm trying to modify the source code for the indexing function in Lucene-3.0.3; however, when I modified IndexFiles.java nothing happened, it simply indexed the files the same way as before. So I deleted that file entirely, and entered java org.apache.lucene.demo.IndexFiles (+ the file to be indexed) in the cmd line again, and IT STILL RAN! What is going on here? How can the program run when the file is removed? -- View this message in context: http://lucene.472066.n3.nabble.com/IndexFiles-cmd-runs-even-when-IndexFiles-java-is-deleted-tp2889622p2889622.html
Re: [VOTE] Create Solr TLP - bigger picture
Thanks Shane. I agree we (the PMC) should have stepped in well before things got to this point. Hindsight is 20/20, and, I'm still learning here too ;) Then we could have prevented such extreme non-Apache behavior (invalid vetos, reverting wars). Mike http://blog.mikemccandless.com On Wed, Apr 27, 2011 at 9:21 AM, Shane Curcuru a...@shanecurcuru.org wrote: Michael McCandless luc...@mikemccandless.com wrote: ...snip... While I agree, out of context, Robert's use of a veto/revert wars is inappropriate, and is not how things should be done in a healthy Apache project Lucene/Solr are not healthy right now, and desperate times call for desperate measures. Apache projects are about community and consensus driven development. When the larger community is having serious disagreements about the direction of the project, the first place the community (people here) should go is to the PMC - that'd be private@lucene in this case. PMCs *should* be the place to work these kinds of issues out. If committers start engaging in controversial reverts, the community should *insist* that the PMC assist in the matter and help show the community-based way forwards. If committers on any Apache project aren't getting answers or help from their PMC, then you can always raise the issue up to board@. Remember: we're all volunteers here: it does take time for PMCs or communities to really understand the issue and respond to it (even if there isn't consensus). So I certainly wouldn't urge people to email board@ with every little issue without letting the PMC discuss it. But from a board perspective, we would certainly rather have heard of some of the apparent community issues in Solr and Lucene recently from a PMC member or committer *first*, before one of the directors was reading through some of these threads or JIRA comments this week. The board welcomes reports on community health from our projects - good or bad. - Shane (not on Lucene lists)
Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers
Welcome Dawid and Stanislaw! Mike On Tue, Feb 8, 2011 at 1:13 PM, Robert Muir rcm...@gmail.com wrote: I'm pleased to announce that the PMC has voted in Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers! Welcome!
Re: [VOTE] Release PyLucene 2.9.4-1 and 3.0.3-1
+1 to both. I installed both on Linux (Fedora 13) and ran my test python script that indexes first 100K line docs from wikipedia and runs a few searches. No problems! Mike On Sun, Dec 5, 2010 at 1:50 AM, Andi Vajda va...@apache.org wrote: With the recent releases of Lucene Java 2.9.4 and 3.0.3, the PyLucene 2.9.4-1 and 3.0.3-1 releases closely tracking them are ready. Release candidates are available from: http://people.apache.org/~vajda/staging_area/ A list of changes in this release can be seen at: http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_2_9/CHANGES http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_3_0/CHANGES All versions of PyLucene are built with the same version of JCC, currently version 2.7, included in these release artifacts. A list of Lucene Java changes can be seen at: http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9/CHANGES.txt http://svn.apache.org/repos/asf/lucene/java/branches/lucene_3_0/CHANGES.txt Please vote to release these artifacts as PyLucene 2.9.4-1 and 3.0.3-1. Thanks ! Andi.. ps: the KEYS file for PyLucene release signing is at: http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS http://people.apache.org/~vajda/staging_area/KEYS pps: here is my +1
Re: [VOTE] Release of Apache Lucene 3.0.3 and 2.9.4 artifacts (take 2)
On Wed, Dec 1, 2010 at 3:38 AM, Uwe Schindler u...@thetaphi.de wrote: Hi, Thanks to the PMC for voting on the Lucene 3.0.3 and 2.9.4 artifacts. The vote has passed with 3 positive votes: - Robert Muir - Andi Vajda - Uwe Schindler Excellent! Thanks everyone :) I will start to publish the artifacts to the mirrors today and send the announcement message on Friday morning after the website is updated. Mike: What do you think are the key facts/bug fixes for this version? I will prepare the announcement message today, mostly it is the same as always, but we should list some key points, like serious bugs. How about something like this: This release contains numerous bug fixes since 2.9.3/3.0.2, including a memory leak in IndexWriter exacerbated by frequent commits, a file handle leak in IndexWriter when near-real-time readers are opened with compound file format enabled, a rare index corruption case on disk full, and various thread safety issues. Mike
Re: PMC Additions
Welcome Simon and Koji! Mike On Sun, Nov 28, 2010 at 7:30 AM, Grant Ingersoll gsing...@apache.org wrote: I'm pleased to announce the addition of Simon Willnauer and Koji Sekiguchi to the Lucene PMC. Both Simon and Koji have been long time contributors/committers to both Lucene and Solr. Congrats! -Grant
Re: [VOTE] Rename Lucene Java to be Lucene Core
+1 Mike On Tue, Nov 9, 2010 at 3:57 PM, Grant Ingersoll gsing...@apache.org wrote: Per the discuss thread and the fact that Java is TM Oracle, I would like us to change Lucene Java to now be referred to as Lucene Core. The primary change is on the website where the Java tab will now be the Core tab and other mentions will be adjusted accordingly. I still expect we will just refer to it informally as Lucene, since that is what it is. +1 -Grant
Re: [DISCUSS] Lucene Java - Lucene Core
+1 Seems prudent given the current Java climate. Mike On Mon, Nov 8, 2010 at 10:57 AM, Grant Ingersoll gsing...@apache.org wrote: Hi Luceneers, esp. PMC and Committers, I'm in the process of reviewing our branding per the Trademarks committee sending out requirements. So, expect to see some changes to the website and logos in the coming days as well as, potentially, a request for help. Per the Branding Requirements at http://www.apache.org/foundation/marks/pmcs, I think we should stop calling our core Java implementation Lucene Java, since Java is an Oracle TM, and move to simply calling it Lucene Core or Lucene for Java. I'm inclined to call it Lucene Core or (Core, for short). Most of us just call it Lucene anyway, so the Core part really is only for navigation purposes on the website. I'd like to discuss this for a day or two and then call a vote. Thoughts? -Grant
Re: Welcome Steven Rowe as Lucene/Solr committer!
Welcome Steven!! Mike On Wed, Sep 22, 2010 at 9:19 AM, Robert Muir rcm...@gmail.com wrote: I'm pleased to announce that the PMC has accepted Steven Rowe as Lucene/Solr committer! Welcome Steven! -- Robert Muir rcm...@gmail.com
Re: Welcome Robert Muir to the Lucene PMC
Congrats! Mike On Wed, Jul 7, 2010 at 2:12 PM, Grant Ingersoll gsing...@apache.org wrote: In recognition of Robert's continuing contributions to Lucene and Solr, I'm happy to announce Robert has accepted our invitation to join the Lucene PMC. Cheers, Grant Ingersoll Lucene PMC Chair
Re: [VOTE] [Take 2] Release PyLucene 2.9.3-1 and 3.0.2-1
+1 Mike On Tue, Jun 29, 2010 at 7:47 AM, Andi Vajda va...@apache.org wrote: The first vote started on June 18th received two PMC votes and one user vote. A couple of bugs got fixed in the meantime so I'd like to call for another vote hoping for three PMC votes to make this release possible. --- With the recent - simultaneous - releases of Java Lucene 2.9.3 and 3.0.2, the PyLucene 2.9.3-1 and 3.0.2-1 releases closely tracking them are ready. Release candidates are available from: http://people.apache.org/~vajda/staging_area/ A list of changes in this release can be seen at: http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_2_9/CHANGES http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_3_0/CHANGES All versions of PyLucene are now built with the same version of JCC, currently version 2.6, included in these release artifacts. A list of Lucene Java changes can be seen at: http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9/CHANGES.txt http://svn.apache.org/repos/asf/lucene/java/branches/lucene_3_0/CHANGES.txt Please vote to release these artifacts as PyLucene 2.9.3-1 and 3.0.2-1. Thanks ! Andi.. ps: the KEYS file for PyLucene release signing is at: http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS http://people.apache.org/~vajda/staging_area/KEYS pps: here is my +1
Re: [PMC] [DISCUSS] Lucy
Technically, it's clear that Lucy is taking an innovative and well-thought-out approach, building a search engine that folds in what's been learned from all the painful experiences of those before it. Marvin gets to chuckle whenever we have one of our massive back compat discussions... When it hits its first release it should be a real gem. Further, there's no question that Marvin's close involvement has substantially strengthened Lucene (java). The awesome amount of cross-fertilization, discussing design tradeoffs, etc., has led to sizable improvements in Lucene (like the switch to per-segment searching). That said, yes, Lucy doesn't have a large dev community. And Lucy doesn't have any users yet since it has no release (though KS's users should count here, once Lucy releases). There's unfortunately (for both projects) not enough overlap in the dev communities of Lucene (java) and Lucy. And, Apache does now (for better or worse) strictly insist on non-umbrella TLPs. So net/net I'm +1 for Lucy to move to the Apache Incubator with the eventual goal of a separate TLP. Mike On Sat, Jun 12, 2010 at 7:10 AM, Grant Ingersoll gsing...@apache.org wrote: It's been a while since we've taken a look at Lucy from a PMC standpoint, but I think it is worth us reviewing once again. And, while this isn't easy to do because I very much value Marvin as a member of the Lucene community, I think we need to have a frank discussion about whether Lucy belongs as a Lucene subproject, especially in light of recent Board concerns about Lucene's umbrella status. Since the last email discussing Lucy, Marvin has been working on it, AFAICT, which is a good thing. I still, however, don't think it meets the community standards of the ASF (see http://incubator.apache.org/guides/graduation.html#subproject for instance). For instance, there does not appear to be anyone else who has contributed to it at any level beyond the occasional email here and there.
The last email on the dev list from someone other than Marvin was someone announcing KinoSearch on May 1st. Before that, it was on April 6. The last email on the user list was from Marvin in November of 2009. And, while Marvin participates regularly on d...@l.a.o and we have had many cross pollination talks, it does not, unfortunately, make for a community around Lucy. There also has yet to be a single release in its time here. Even if there were an attempt at a release, how many PMC members even follow Lucy enough to feel comfortable voting for a release? If this were a project coming from the Incubator to us via a graduation vote, I would vote to not let it graduate. Finally, given that Lucy undoubtedly is a separate community (if it ever exists) with separate goals from Lucene and that it is considered ASF best practice for PMCs to not be umbrella projects, I think we should consider either Lucy going into the Incubator with the goal of growing its own community and standing on its own as a TLP in its own right (just as we recommended for CLucene recently) or going to Google Code or some other such hosting service where it will be free to make decisions on its future without the hindrance of a PMC that isn't aligned with its needs and objectives, as I believe is the current case with the Lucene PMC. -Grant
Re: [VOTE] #2 Apache Lucene Java 2.9.3 and 3.0.2 artifacts to be released
On Fri, Jun 11, 2010 at 11:58 AM, Uwe Schindler u...@thetaphi.de wrote: Hi all, It is not yet quite clear if we should release take2 or take1 of the artifacts. Both are on my people account, please vote: [1] Release http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take2-rev953716/ including LUCENE-2494 as Lucene 3.0.2 and 2.9.2. You only need to recheck the 3.0.2 artifacts (if you submitted a vote to the first call), as I only rebuilt the 3.0.2 ones. Lucene 2.9.3 has no Java 5 code and remains unchanged. +1 for [1]. All Lucene in Action 2nd edition tests pass with these 3.0.2 JARs. Mike
Re: [VOTE] Apache Lucene Java 2.9.3 and 3.0.2 artifacts to be released
I would argue my 3 cases were borderline bugs -- they weren't just pure perf improvements. 2135 acts like a mem leak, in that we retain [often very large] memory for longer than we should. 2161 is a nasty choke point in NRT (getting a new NRT reader syncs the old one, thus blocking any searches, since searches use sync'd methods like getNorms, I think). 2360 was a regression: specifically, indexing small docs got slower due to the fix for another issue. That said, I don't think we need to be so strict (only bug fixes get backported). If someone has the itch/time/energy and is willing to do the work for the backport, and the risk is low, back compat is preserved, etc., I think it's great. Mike On Fri, Jun 11, 2010 at 1:04 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : See 3.0.2: http://s.apache.org/6kf : vs. 3.0.1: http://s.apache.org/t5 Ugh... ok, well I guess the precedent has already been set then. Hope it doesn't bite us in the ass down the road. -Hoss
Re: [VOTE] Apache Lucene Java 2.9.3 and 3.0.2 artifacts to be released
This looks like something new to me (doesn't ring a bell). It looks odd -- the assertion that's tripping would seem to indicate that a file that we are copying into a CFS file (after flushing) is still changing while we are copying, which is not good. All files should be closed before we build the CFS. Strange... was this just a local hard drive / NTFS file system, Uwe? I also can't repro, so far -- I have a while(1) stress test running on OpenSolaris and Windows Server 2003, but no failures yet... Can anyone else get this test to fail? Mike On Tue, Jun 8, 2010 at 7:54 AM, Uwe Schindler u...@thetaphi.de wrote: I ran the tests on my computer and with 2.9.3 I got a failure, which I cannot reproduce:

[junit] Testsuite: org.apache.lucene.index.TestThreadedOptimize
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 9,017 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Thread-45: hit exception
[junit] java.lang.AssertionError
[junit]   at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java:195)
[junit]   at org.apache.lucene.index.DocumentsWriter.createCompoundFile(DocumentsWriter.java:672)
[junit]   at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4418)
[junit]   at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4264)
[junit]   at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4255)
[junit]   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2546)
[junit]   at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2500)
[junit]   at org.apache.lucene.index.TestThreadedOptimize$1.run(TestThreadedOptimize.java:92)
[junit] ------------- ---------------
[junit] Testcase: testThreadedOptimize(org.apache.lucene.index.TestThreadedOptimize): FAILED
[junit] null
[junit] junit.framework.AssertionFailedError
[junit]   at org.apache.lucene.index.TestThreadedOptimize.runTest(TestThreadedOptimize.java:113)
[junit]   at org.apache.lucene.index.TestThreadedOptimize.testThreadedOptimize(TestThreadedOptimize.java:154)
[junit]   at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:221)
[junit]
[junit] Test org.apache.lucene.index.TestThreadedOptimize FAILED

Maybe it's just the bug in the test we already know; if so, we can proceed with releasing. It happened in JDK 1.4.2 when doing a test build on my Windows machine of 2.9.3-src.zip. Mike, maybe it's an already-fixed test-only bug (missing volatile on a field in this test)? Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -----Original Message----- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Monday, June 07, 2010 5:21 PM To: general@lucene.apache.org Subject: [VOTE] Apache Lucene Java 2.9.3 and 3.0.2 artifacts to be released Hi all, I have posted a release candidate for both Lucene Java 2.9.3 and 3.0.2 (which both have the same bug fix level, functionality and release announcement), built from revision 951790 of the corresponding branches. Thanks for all your help! Please test them and give your votes; the scheduled release date for both versions is Friday, June 18th, 2010. Only votes from the Lucene PMC are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. We planned the parallel release with one announcement because of their parallel development / bug fix level, to emphasize that they are equal except for deprecation removal and Java 5 since major version 3. I will post the possible release announcement soon for corrections.
Artifacts can be found at:
http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take1-rev951790/
Changes:
http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take1-rev951790/changes-2.9.3/Changes.html
http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take1-rev951790/changes-2.9.3/Contrib-Changes.html
http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take1-rev951790/changes-3.0.2/Changes.html
http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take1-rev951790/changes-3.0.2/Contrib-Changes.html
Maven artifacts:
http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take1-rev951790/maven/
Happy testing! P.S.: I already tested the latest 3.0.2 artifacts with pangaea.de :-) - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de
Re: [VOTE] Apache Lucene Java 2.9.3 and 3.0.2 artifacts to be released
Alas, I have 4 envs (Windows Server 2003, OpenSolaris 2009.06, CentOS 5.4, OS X 10.6.2), running stress tests for 4+ hours now, and I haven't hit a single failure... If nobody else can repro this, I think we should not hold up the release? Mike On Tue, Jun 8, 2010 at 8:44 AM, Uwe Schindler u...@thetaphi.de wrote: No idea, it's NTFS on Windows 7, 64-bit, JDK 1.4.2_19 (32-bit). The test now works, so I cannot reproduce. No idea what we should do! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -----Original Message----- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Tuesday, June 08, 2010 2:36 PM To: general@lucene.apache.org Subject: Re: [VOTE] Apache Lucene Java 2.9.3 and 3.0.2 artifacts to be released This looks like something new to me (doesn't ring a bell). It looks odd -- the assertion that's tripping would seem to indicate that a file that we are copying into a CFS file (after flushing) is still changing while we are copying, which is not good. All files should be closed before we build the CFS. Strange... was this just a local hard drive / NTFS file system, Uwe? I also can't repro, so far -- I have a while(1) stress test running on OpenSolaris and Windows Server 2003, but no failures yet... Can anyone else get this test to fail?
Re: [VOTE] Apache Lucene Java 2.9.3 and 3.0.2 artifacts to be released
+1 to release. ant test passes for both -src.tar.gz downloads, the .asc's check out, and Lucene in Action 2nd Edition's tests all pass w/ 3.0.2 dropped in. Mike On Mon, Jun 7, 2010 at 4:32 PM, Andi Vajda va...@apache.org wrote: On Mon, 7 Jun 2010, Uwe Schindler wrote: I have posted a release candidate for both Lucene Java 2.9.3 and 3.0.2 (which both have the same bug fix level, functionality and release announcement), built from revision 951790 of the corresponding branches. Thanks for all your help! Please test them and give your votes; the scheduled release date for both versions is Friday, June 18th, 2010. Only votes from the Lucene PMC are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. We planned the parallel release with one announcement because of their parallel development / bug fix level, to emphasize that they are equal except for deprecation removal and Java 5 since major version 3. I will post the possible release announcement soon for corrections. Artifacts can be found at: http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take1-rev951790/ PyLucene 2.9.3 and 3.0.2 built from their respective Lucene artifacts pass all tests. +1 Andi..
Re: Welcome Uwe Schindler to the Lucene PMC
Welcome Uwe!! Mike On Thu, Apr 1, 2010 at 7:05 AM, Grant Ingersoll gsing...@apache.org wrote: I'm pleased to announce that the Lucene PMC has voted to add Uwe Schindler to the PMC. Uwe has been doing a lot of work in Lucene and Solr, including several of the last releases in Lucene. Please join me in extending congratulations to Uwe! -Grant Ingersoll PMC Chair - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: java.io.IOException: read past EOF
Your index is in serious trouble -- you have 2 segments_N files, both of which are 0 length. This won't be easy to recover (CheckIndex won't be able to). Any idea how this happened? Was this index created using 2.4.x? Mike On Tue, Mar 23, 2010 at 5:36 PM, Jean-Michel RAMSEYER jm.ramse...@greenivory.com wrote: Hi there, I'm new to Lucene's world and I'm currently facing a problem with an index. I'm running Lucene 2.4.1 on a Linux server with a Sun JVM version 1.6.0.17b04, in which the issue http://issues.apache.org/jira/browse/LUCENE-1282 is solved. I tried to open the indexes on another computer with Luke but it fails too. The segments* files are empty, so is there a way to rebuild the index from the .cfs files? Is there a way to recover this index? Thank you for your answers. Exception trace:

java.io.IOException: read past EOF
  at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:151)
  at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
  at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:36)
  at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:68)
  at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:221)
  at org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:95)
  at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:653)
  at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:115)
  at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
  at org.apache.lucene.index.IndexReader.open(IndexReader.java:206)
  at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:47)

ls -lah result:

total 18G
drwxr-xr-x   2 tomcat tomcat 4.0K 2010-03-22 16:29 .
drwxr-xr-x 121 tomcat tomcat  12K 2010-03-23 14:22 ..
-rw-r--r-- 1 tomcat tomcat 1.9G 2010-03-20 13:57 _1gg2.cfs
-rw-r--r-- 1 tomcat tomcat 2.0G 2010-03-20 21:45 _1yhj.cfs
-rw-r--r-- 1 tomcat tomcat 1.9G 2010-03-21 04:16 _2gdz.cfs
-rw-r--r-- 1 tomcat tomcat 2.0G 2010-03-21 15:00 _2y9u.cfs
-rw-r--r-- 1 tomcat tomcat 2.0G 2010-03-22 03:21 _3ghg.cfs
-rw-r--r-- 1 tomcat tomcat 2.0G 2010-03-22 07:09 _3xty.cfs
-rw-r--r-- 1 tomcat tomcat 2.0G 2010-03-22 12:24 _4ekl.cfs
-rw-r--r-- 1 tomcat tomcat 192M 2010-03-22 13:25 _4gn2.cfs
-rw-r--r-- 1 tomcat tomcat 198M 2010-03-22 14:23 _4ief.cfs
-rw-r--r-- 1 tomcat tomcat 195M 2010-03-22 15:14 _4kbm.cfs
-rw-r--r-- 1 tomcat tomcat  21M 2010-03-22 15:18 _4kil.cfs
-rw-r--r-- 1 tomcat tomcat  23M 2010-03-22 15:22 _4kop.cfs
-rw-r--r-- 1 tomcat tomcat  22M 2010-03-22 15:27 _4ku0.cfs
-rw-r--r-- 1 tomcat tomcat  25M 2010-03-22 15:31 _4kzb.cfs
-rw-r--r-- 1 tomcat tomcat  21M 2010-03-22 15:36 _4l56.cfs
-rw-r--r-- 1 tomcat tomcat 1.9M 2010-03-22 15:36 _4l5r.cfs
-rw-r--r-- 1 tomcat tomcat 2.0M 2010-03-22 15:37 _4l6c.cfs
-rw-r--r-- 1 tomcat tomcat 165K 2010-03-22 15:37 _4l6d.cfs
-rw-r--r-- 1 tomcat tomcat  58K 2010-03-22 15:37 _4l6e.cfs
-rw-r--r-- 1 tomcat tomcat  80K 2010-03-22 15:37 _4l6f.cfs
-rw-r--r-- 1 tomcat tomcat 149K 2010-03-22 15:37 _4l6g.cfs
-rw-r--r-- 1 tomcat tomcat 218K 2010-03-22 15:37 _4l6h.cfs
-rw-r--r-- 1 tomcat tomcat 198K 2010-03-22 15:37 _4l6i.cfs
-rw-r--r-- 1 tomcat tomcat  45K 2010-03-22 15:37 _4l6j.cfs
-rw-r--r-- 1 tomcat tomcat  58K 2010-03-22 15:37 _4l6k.cfs
-rw-r--r-- 1 tomcat tomcat 158K 2010-03-22 15:37 _4l6l.cfs
-rw-r--r-- 1 tomcat tomcat 116K 2010-03-22 15:37 _4l6m.cfs
-rw-r--r-- 1 tomcat tomcat 1.1M 2010-03-22 15:37 _4l6n.cfs
-rw-r--r-- 1 tomcat tomcat 128K 2010-03-22 15:37 _4l6o.cfs
-rw-r--r-- 1 tomcat tomcat 1.9G 2010-03-20 04:12 _hnt.cfs
-rw-r--r-- 1 tomcat tomcat    0 2010-03-22 15:37 segments_44o3
-rw-r--r-- 1 tomcat tomcat    0 2010-03-22 15:37 segments_44o4
-rw-r--r-- 1 tomcat tomcat    0 2010-03-22 15:37 segments.gen
-rw-r--r-- 1 tomcat tomcat 1.9G 2010-03-20 07:52 _ywu.cfs
Re: java.io.IOException: read past EOF
It can be tricky eg if segments share doc stores, I think you can't always recover that. But this index seems not to have separate doc stores (no *.cfx), so, I think in theory one could regenerate the segment metadata (SegmentInfo) from the index files, but I don't know that anyone has created this yet. Also, it could in general result in re-attaching segments that had been merged away (ie, causing duplicates in the index). Mike On Wed, Mar 24, 2010 at 2:39 AM, Ted Dunning ted.dunn...@gmail.com wrote: The documentation ( http://lucene.apache.org/java/2_4_0/fileformats.html#File%20Naming) makes it seem that the cfs files could be used to recover most of the information from the index. Is that not so? On Tue, Mar 23, 2010 at 11:30 PM, Michael McCandless luc...@mikemccandless.com wrote: Your index is in serious trouble -- you have 2 segments_N files, both of which are 0 length. This won't be easy to recover (CheckIndex won't be able to).
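The failure mode described above -- intact .cfs segment data files but zero-length segments_N metadata -- is easy to spot from a directory listing before deciding whether recovery is even worth attempting. As a minimal sketch (a hypothetical helper script, not part of Lucene or its tooling; the function name is my own), one could scan an index directory for empty segments files:

```python
import os

def find_empty_segments_files(index_dir):
    """Return the names of zero-length segments files in a Lucene index
    directory.  A zero-length segments_N (or segments.gen) file is the
    symptom discussed above: the segment metadata is gone even though
    the .cfs segment data files may still be intact."""
    suspects = []
    for name in sorted(os.listdir(index_dir)):
        if name.startswith("segments"):
            path = os.path.join(index_dir, name)
            if os.path.getsize(path) == 0:
                suspects.append(name)
    return suspects
```

This only confirms the symptom; actually regenerating the per-segment metadata (SegmentInfo) from the remaining index files is the hard part Mike describes, and no such tool existed at the time.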
Re: Less drastic ways
On Sun, Mar 14, 2010 at 4:29 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Even if we merge Lucene/Solr and we treat Solr as just another Lucene contrib/module, say, contributors who care only about Solr will still patch against Solr, and Lucene developers or those people who have the itch for that functionality being in Lucene, too, will still have to poach/refactor and pull that functionality into Lucene later on. Yes, people with their respective itches can still create Solr-only and Lucene-only functions after the merge. We should not block any feature from going in solely because it's not factored so that both Lucene and Solr can use it. But, no, poaching is no longer needed with merged dev -- we are free to efficiently refactor at that point. Merged, we don't need to have full copies of the code in two projects, await releases to de-dup, etc. -- code can just freely move back and forth within the project. It's also more likely that someone wearing a Lucene hat will see the Solr work going on and jump in and help make it work in Lucene. Merged dev makes refactoring much more efficient than poaching across project lines. Both achieve the same goals with time; it's just that poaching is a much slower/more wasteful way to achieve it... (but of course is the only option for disparate projects, eg, pulling stuff from Nutch down into Lucene). Whether Solr is a separate project or a Lucene contrib/module that has its own user (and contributor) community that is not tightly integrated with Lucene's dev community, the same thing will happen, no? True, but much less efficiently (if we can only poach across project lines). Maybe it will help if we made things visual for us visual peeps. Is this, roughly, what the plan is:

trunk/
  lucene-core/
  modules/
    analysis/
      wordnet/
      spellchecker/
      whatever/
      ...
    facets/
    ...
    functions/
  solr/
    dih/
    ...

I honestly don't know what module structure we'll come up with!
It's TBD! But this looks like a good start :) I think we'd also have a queryparser module (we have like 7 of them, according to Robert ;) and a queries module (I'd think functions would live inside there). Mike
Re: Less drastic ways
Hm, again I'm confused. If this is how it worked in Solr/Lucene land, then there wouldn't be pieces in Solr that we now want to refactor and move into Lucene core or modules. A list of about 4-5 such pieces of functionality in Solr has already been listed. That's really my main question. Why weren't/can't things be committed to the appropriate place? Why were they committed to Solr? Pre-merge: if someone wants new functionality in Solr, they should be free to create a patch to make it work well, in Solr, alone. To expect them to also factor it so that it works well for Lucene-only users is wrong. They should not need to, nor be expected to, and they shouldn't feel bad for not having factored it that way. They use Solr, they needed it working in Solr, that was their itch, and they scratched it; net/net that was a great step forward for Solr. We should not up and reject contributions because they are not well factored for the two projects. Beggars can't be choosers... Someone who later has the itch for this functionality in Lucene should then be fully free to pick it up, refactor, and make it work in Lucene alone, by poaching it (pulling it into Lucene). Poaching is a natural way for code to be pulled across projects... and while in the short term it'd result in code dup, in the long term this is how refactoring can happen across projects. It's completely normal and fine, in my opinion. But poaching, while effective, is slow... Lucene would poach, have to stabilize and do a release; Solr would have to upgrade and then fix to cut over to Lucene's sources (assuming the sources hadn't diverged too much, else Solr would have to wait for Lucene's next release, etc.). And we have a lot of modules to refactor here, between Solr and Lucene. So for these two reasons I vote for merging Solr/Lucene dev over gobs of poaching. That gives us complete freedom to quickly move the code around.
Poaching should still be perfectly fine for other cases, like pulling analyzers from Nutch, from other projects, etc. Mike
Re: [VOTE] merge lucene/solr development (take 3)
On Tue, Mar 9, 2010 at 5:10 AM, Andrzej Bialecki a...@getopt.org wrote: Re: Nutch components - those that are reusable in Lucene or Solr contexts eventually find their way to respective projects, witness e.g. CommonGrams. In fact I think this is a great example -- as far as I can tell, CommonGrams was poached from Nutch into Solr, and then was nurtured/improved in both projects separately, right? So can/should we freely poach across all our sub-projects? It has obvious downsides (it's essentially a fork that will confuse those users that use both Solr and Lucene, in the short term, until things stabilize into a clean refactoring; it's double the dev; we must re-sync with time; etc.). But it has a massive upside: it means we don't rely only on push (Solr devs pushing into Lucene or vice versa). We can also use pull (Lucene devs can pull pieces from Nutch/Solr into Lucene). It becomes a 2-way street for properly factoring our shared code with time. If we had that freedom (poaching is perfectly fine), then interested devs could freely refactor across sub-projects. Not having this freedom today, and not having merged dev, is stunting both Solr's and Lucene's growth. Mike
Re: [VOTE] merge lucene/solr development (take 3)
I'm still +1 for merging Solr/Lucene dev. I think poaching, when we have so much that needs to be shared, is going to cause far more problems than it'll solve. It's not the right tool for [this] job. I do think poaching is a good, legitimate tool when it's less code (eg the CommonGrams case), so we should do both ;) Mike On Tue, Mar 9, 2010 at 8:49 AM, Grant Ingersoll gsing...@apache.org wrote: On Mar 9, 2010, at 8:21 AM, Michael McCandless wrote: On Tue, Mar 9, 2010 at 7:21 AM, Grant Ingersoll gsing...@apache.org wrote: If we had that freedom (poaching is perfectly fine), then interested devs could freely refactor across sub-projects. As someone who works on both, I don't think it is fine. Just look at the function query mess. Just look at the version mess. It's very frustrating as a developer, and it makes me choose between two projects that I happen to like equally, but for different reasons. If I worked on Nutch, I would feel the same way. But... Lucene should poach from external (eg non-Apache) projects, if the license works? Ie if some great analyzer is out there, and Robert spots it, and the license works, we should poach it? (In fact he just did this w/ Andrzej's Polish stemmer ;) ). I'd prefer donate to poach, but realize that isn't always the case. So we have something of a double standard... And, ironically, I think it's the fact that there's so much committer overlap between Solr and Lucene that is causing this antagonism towards poaching. When in fact I think poaching, at a wider scale (across unrelated projects), is a very useful means for any healthy open source software to evolve. Why should Lucene be prevented from having a useful feature just because Solr happened to create it first? But why should I be forced to maintain two versions due to some arbitrary code separation? And why should you force a good chunk of us to do a whole lot of extra work simply because of some arbitrary code separation?
Here, it is the Lucene PMC that releases code, and it is just silly that with all of this overlap at the committer level we still have this duplication. I can't speak for the external projects (I don't believe any of them have even responded here other than Jackrabbit), but if they don't like it, they should get more involved in the community and work to become committers. At any rate, this is exactly why merging makes sense. You would no longer have this issue of "first". I would no longer have to choose where to add my spatial work based on some arbitrary line that someone drew in the sand that isn't all that pertinent anymore given the desires of most in the community to blur that line. It would be available to everyone. For that matter, why do we even need to have this discussion at all? Most of us Solr committers are Lucene committers. We can simply start committing Solr code to Lucene such that in 6 months the whole discussion is moot, and the three committers on Solr who aren't Lucene committers can earn their Lucene merit very quickly by patching the Solr portion of Lucene. We can move all the code to its appropriate place, add a contrib module for the WAR stuff and the response writers, and voila, Solr is in Lucene, the dev mailing lists have merged by the fact that Solr dev would be defunct, and all of the proposals in this vote are implemented simply by employing our commit privileges in a concerted way. Yet, somehow, methinks that isn't a good solution either, right? Yet it is perfectly legal and is just as valid a solution as the poaching solution, and in a lot of ways seems to be what Chris is proposing. -Grant
Re: Composing posts for both JIRA and email (was a JIRA post)
Great guidelines, Marvin! I agree w/ most of this, except I do use Jira's markup (bq., {quote}) when adding comments. I'm torn between how important the first read (via the email Jira sends) is vs. the "I click through to the issue and read it" read. Typically I just click through to the issue unless it's a smallish comment. I don't get why Jira can't support email markup ('>' means nested levels of quoting) in addition to its own... maybe they are gunning for some kind of lock-in of their users. EG I've seen people respond to normal email threads, but quoting using bq.! Sometimes I compose with an external editor (in emacs, which wraps), sometimes directly in the browser. The It's All Text plugin sounds neat -- what does it gain over simple copy/paste out of your editor? I can't stand that gmail doesn't do the right thing w/ line wrapping outgoing email, though -- when I quote a message (like below), the addition of the '>'s causes already-wrapped text to be further wrapped, thus looking hideous (you should see examples below). And yes, I hate that the first line under {code} has no indentation. Silly. Sounds like we just need a Jira upgrade @ Apache to fix that one... Mike On Thu, Mar 4, 2010 at 12:28 PM, Marvin Humphrey mar...@rectangular.com wrote: (CC to lucy-dev and general, reply-to set to general) On Thu, Mar 04, 2010 at 06:18:28AM +, Shai Erera (JIRA) wrote: (Warning, this post is long, and is easier to read in JIRA) I consume email from many of the Lucene lists, and I hate it when people force me to read stuff via JIRA. It slows me down to have to jump to all those forum web pages. I only go to the web page if there are 5 or more posts in a row on the same issue that I need to read. For what it's worth, I've worked out a few routines that make it possible to compose messages which read well in both mediums. * Never edit your posts unless absolutely necessary.
If JIRA used diffs, things would be different, but instead it sends the whole frikkin' post twice (before and after), which makes it very difficult to see what was edited. If you must edit, append an "edited:" block at the end to describe what you changed instead of just making changes inline. * Use FireFox and the It's All Text plugin, which makes it possible to edit JIRA posts using an external editor such as Vim instead of typing into a textarea. http://trac.gerf.org/itsalltext * After editing, use the preview button (it's a little monitor icon to the upper right of the textarea) to make sure the post looks good in JIRA. * Use '>' for quoting instead of JIRA's bq. and {quote}, since JIRA's mechanisms look so crappy in email. This is easy from Vim, because rewrapping a long line (by typing gq from visual mode to rewrap the current selection) that starts with '>' causes '>' to be prepended to the wrapped lines. * Use asterisk bullet lists liberally, because they look good everywhere. * Use asterisks for *emphasis*, because that looks good everywhere. * If you wrap lines, use a reasonably short line length. (I use 78; Mike McCandless, who also wraps lines for his Jira posts, uses a smaller number.) Otherwise you'll get nasty wrapping in narrow windows, both in email clients and web browsers. There are still a couple compromises that don't work out well. For email, ideally you want to set off code blocks with indenting:

    int foo = 1;
    int bar = 2;

To make code look decent in JIRA, you have to wrap that with {code} tags, which unfortunately look heinous in email. Left-justifying the tags but indenting the code seems like it would be a rotten-but-salvageable compromise, as it at least sets off the tags visually rather than making them appear as though they are part of the code fragment.

{code}
    int foo = 1;
    int bar = 2;
{code}

Unfortunately, that's going to look like this in JIRA, because of a bug that strips all leading whitespace from the first line.
|--------------|
| int foo;     |
|     int bar; |
|--------------|

It seems that this has been fixed by Atlassian in the Confluence wiki (http://jira.atlassian.com/browse/CONF-4548), but the issue remains for the JIRA installation at issues.apache.org. So for now, I manually strip indentation until the whole block is flush left. {code} int foo = 1; int bar = 2; {code} (Gag. I vastly prefer wikis that automatically apply fixed-width styling to any indented text.) One last tip for Lucy developers (and other non-Java devs). JIRA has limited syntax highlighting support -- Java, JavaScript, ActionScript, XML and SQL only -- and defaults to assuming your code is Java. In general, you want to override that and tell JIRA to use none. {code:none} int foo = 1; int bar = 2; {code} Marvin Humphrey
Re: [VOTE] merge lucene/solr development
On Thu, Mar 4, 2010 at 12:41 PM, Chris Hostetter hossman_luc...@fucit.org wrote: Why don't we just start by attempting to have a common dev list and merging committers, in the hopes that it will promote better communication about features up and down the stack, and better bug fixing/refactoring/modularization -- then see if that leads us to a point where it makes sense to more tightly couple the build systems and releases? I'm all for being iterative / taking baby steps, when the problem naturally can be solved that way -- progress not perfection. Many problems decompose like this. But I don't think this problem does. In particular, how would the above baby step address the code duplication (my goal in the original opening) -- eg the 3 places where concrete analyzers/queries are today. How would it lead to making facets work with pure Lucene? To developing spatial in one place? Mike
[VOTE] Merge the development of Solr/Lucene (take 2)
A new vote, that slightly changes the proposal from the last vote (adding only that Lucene can cut a release even if Solr doesn't): * Merging the dev lists into a single list. * Merging committers. * When any change is committed (to a module that belongs to Solr or to Lucene), all tests must pass. * Release details will be decided by the dev community, but, Lucene may release without Solr. * Modularize the sources: pull things out of Lucene's core (break out query parser, move all core queries and analyzers under their contrib counterparts), pull things out of Solr's core (analyzers, queries). These things would not change: * Besides modularizing (above), the source code would remain factored into separate dirs/modules the way it is now. * Issue tracking remains separate (SOLR-XXX and LUCENE-XXX issues). * User's lists remain separate. * Web sites remain separate. * Release artifacts/jars remain separate. Mike
Re: [VOTE] Merge the development of Solr/Lucene (take 2)
I forgot my vote: +1 Mike On Thu, Mar 4, 2010 at 4:33 PM, Michael McCandless luc...@mikemccandless.com wrote: A new vote, that slightly changes the proposal from the last vote (adding only that Lucene can cut a release even if Solr doesn't): * Merging the dev lists into a single list. * Merging committers. * When any change is committed (to a module that belongs to Solr or to Lucene), all tests must pass. * Release details will be decided by the dev community, but, Lucene may release without Solr. * Modularize the sources: pull things out of Lucene's core (break out query parser, move all core queries and analyzers under their contrib counterparts), pull things out of Solr's core (analyzers, queries). These things would not change: * Besides modularizing (above), the source code would remain factored into separate dirs/modules the way it is now. * Issue tracking remains separate (SOLR-XXX and LUCENE-XXX issues). * User's lists remain separate. * Web sites remain separate. * Release artifacts/jars remain separate. Mike
Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?
If we don't somehow first address the code duplication across the 2 projects, making Solr a TLP will make things worse. I started here with analysis because I think that's the biggest pain point: it seemed like an obvious first step to fixing the code duplication and thus the most likely to reach some consensus. And it's also very timely: Robert is right now making all kinds of great fixes to our collective analyzers (in between bouts of fuzzy DFA debugging). But it goes beyond analyzers: I'd like to see other modules, now in Solr, eventually moved to Lucene, because they really are core functionality (eg facets, function (and other?) queries, spatial, maybe improvements to spellchecker/highlighter). How can we do this? And how can we do this so that it lasts over time? If new cool core things are born in Solr-land (which of course happens a lot -- lots of good healthy usage), how will they find their way back to Lucene? Yonik's proposal (merging development of Solr/Lucene, but keeping all else separate) would achieve this. If we do the opposite (Solr -> TLP), how could we possibly achieve this? I guess one possibility is to just suck it up and duplicate the code. Meaning, each project will have to manually merge fixes in from the other project (so long as there's someone around with the itch to do so). Lucene would copy in all of Solr's analysis, and vice-versa (and likewise other dup'd functionality). I really dislike this solution... it will confuse the daylights out of users, it's error-prone, it's a waste of dev effort, there will always be little differences... but maybe it is in fact the lesser evil? I would much prefer merging Solr/Lucene development... 
Mike On Mon, Mar 1, 2010 at 12:01 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Grant, On Mar 1, 2010, at 8:20 AM, Mattmann, Chris A (388J) wrote: Hi Robert, I think my proposal (Solr-TLP) is sort of orthogonal to the whole analyzers issue - I was in favor, at the very least, of having a separate module/project/whatever that both Solr/Lucene (and whatever project) can depend on for the shared analyzer code... Not really. They are intimately linked. Ummm, how so? Making project A called Apache Super Analyzers and then making Lucene(-java) and Solr depend on Apache Super Analyzers is separate of whether or not Lucene(-java) and Solr are TLPs or not... Cheers, Chris Cheers, Chris On 3/1/10 9:12 AM, Robert Muir rcm...@gmail.com wrote: this will make the analyzers duplication problem even worse On Mon, Mar 1, 2010 at 11:06 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Mark, Thanks for your message. I respect your viewpoint, but I respectfully disagree. It just seems (to me at least based on the discussion) like a TLP for Solr is the way to go. Cheers, Chris On 3/1/10 8:54 AM, Mark Miller markrmil...@gmail.com wrote: On 03/01/2010 10:40 AM, Mattmann, Chris A (388J) wrote: Hi Mark, That would really be no real world change from how things work today. The fact is, today, Solr already operates essentially as an independent project. Well if that's the case, then it would lead me to think that it's more of a TLP more than anything else per best practices. That depends. It could be argued it should be a top level project or that it should be closer to the Lucene project. Some people are arguing for both approaches right now. There are two directions we could move in. The only real difference is that it shares the same PMC with Lucene now and wouldn't with this change. This would address none of the issues that triggered the idea for a possible merge. 
I don't agree -- you're looking to bring together two communities that are fairly separate as you put it. The separation likely didn't spring up overnight and has been this way for a while (at least to my knowledge). This is exactly the type of situation that typically leads to TLP creation from what I've seen. It also causes negatives between Solr/Lucene that some are looking to address. Hence the birth of this proposal. Going TLP with Solr will only aggravate those negatives, not help them. While the communities operate fairly separately at the moment, the people in the communities are not so separate. The committer list has huge overlap. Many committers on one project but not the other do a lot of work on both projects. There is already a strong personal link - merging the management of the projects addresses many of the concerns that have prompted this discussion. TLP'ing Solr only makes those concerns multiply. They would diverge further, and incompatible overlap between them would increase. Cheers, Chris On 03/01/2010 10:04 AM, Mattmann, Chris A (388J) wrote: Hey Grant, I'd like to explore this: does this imply that the Lucene sub-projects will go away and Lucene will turn into Lucene-java and maintain its Apache TLP, and then
Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?
On Mon, Mar 1, 2010 at 12:58 PM, Marvin Humphrey mar...@rectangular.com wrote: On Mon, Mar 01, 2010 at 12:44:02PM -0500, Michael McCandless wrote: But it goes beyond analyzers: I'd like to see other modules, now in Solr, eventually moved to Lucene, because they really are core functionality (eg facets, function (and other?) queries, spatial, maybe improvements to spellchecker/highlighter). I disagree. Those don't belong in core, and though they are all great features, adding them to core constitutes bloat, IMO. The Query class belongs in core. All those other modules should be distributed as plugins, which could be used by Solr, Katta, Lucene, whatever. Note that this is orthogonal to whether Solr and Lucene merge or diverge. I agree with this (sorry I wasn't clear). By core functionality I mean it should be a separate module (plugin) that direct Lucene users can use, not whenever you install core Lucene you get these functions. Ie, users shouldn't have to install Solr to use facets with Lucene. Mike
Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?
Because the code dup with analyzers is only one of the problems to solve. In fact, it's the easiest of the problems to solve (that's why I proposed it, only, first). A more differentiating example is a much less mature module. EG take spatial -- if Solr were its own TLP, how could spatial be built out in a way that we don't waste effort, and so that both direct Lucene and Solr users could use it when it's released? Mike On Mon, Mar 1, 2010 at 1:07 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Mike, I'm not sure I follow this line of thinking: how would Solr being a TLP affect the creation of a separate project/module for Analyzers any more so than it not being a TLP? Both Lucene-java and Solr (as a TLP) could depend on the newly created refactored Analysis project. Chris On 3/1/10 10:44 AM, Michael McCandless luc...@mikemccandless.com wrote: If we don't somehow first address the code duplication across the 2 projects, making Solr a TLP will make things worse. I started here with analysis because I think that's the biggest pain point: it seemed like an obvious first step to fixing the code duplication and thus the most likely to reach some consensus. And it's also very timely: Robert is right now making all kinds of great fixes to our collective analyzers (in between bouts of fuzzy DFA debugging). But it goes beyond analyzers: I'd like to see other modules, now in Solr, eventually moved to Lucene, because they really are core functionality (eg facets, function (and other?) queries, spatial, maybe improvements to spellchecker/highlighter). How can we do this? And how can we do this so that it lasts over time? If new cool core things are born in Solr-land (which of course happens a lot -- lots of good healthy usage), how will they find their way back to Lucene? Yonik's proposal (merging development of Solr/Lucene, but keeping all else separate) would achieve this. If we do the opposite (Solr -> TLP), how could we possibly achieve this? 
I guess one possibility is to just suck it up and duplicate the code. Meaning, each project will have to manually merge fixes in from the other project (so long as there's someone around with the itch to do so). Lucene would copy in all of Solr's analysis, and vice-versa (and likewise other dup'd functionality). I really dislike this solution... it will confuse the daylights out of users, it's error-prone, it's a waste of dev effort, there will always be little differences... but maybe it is in fact the lesser evil? I would much prefer merging Solr/Lucene development... Mike On Mon, Mar 1, 2010 at 12:01 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Grant, On Mar 1, 2010, at 8:20 AM, Mattmann, Chris A (388J) wrote: Hi Robert, I think my proposal (Solr-TLP) is sort of orthogonal to the whole analyzers issue - I was in favor, at the very least, of having a separate module/project/whatever that both Solr/Lucene (and whatever project) can depend on for the shared analyzer code... Not really. They are intimately linked. Ummm, how so? Making project A called Apache Super Analyzers and then making Lucene(-java) and Solr depend on Apache Super Analyzers is separate of whether or not Lucene(-java) and Solr are TLPs or not... Cheers, Chris Cheers, Chris On 3/1/10 9:12 AM, Robert Muir rcm...@gmail.com wrote: this will make the analyzers duplication problem even worse On Mon, Mar 1, 2010 at 11:06 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Mark, Thanks for your message. I respect your viewpoint, but I respectfully disagree. It just seems (to me at least based on the discussion) like a TLP for Solr is the way to go. Cheers, Chris On 3/1/10 8:54 AM, Mark Miller markrmil...@gmail.com wrote: On 03/01/2010 10:40 AM, Mattmann, Chris A (388J) wrote: Hi Mark, That would really be no real world change from how things work today. The fact is, today, Solr already operates essentially as an independent project. 
Well if that's the case, then it would lead me to think that it's more of a TLP more than anything else per best practices. That depends. It could be argued it should be a top level project or that it should be closer to the Lucene project. Some people are arguing for both approaches right now. There are two directions we could move in. The only real difference is that it shares the same PMC with Lucene now and wouldn't with this change. This would address none of the issues that triggered the idea for a possible merge. I don't agree -- you're looking to bring together two communities that are fairly separate as you put it. The separation likely didn't spring up overnight and has been this way for a while (at least to my knowledge). This is exactly the type of situation that typically leads to TLP creation from what I've seen. It also causes negatives between Solr/Lucene
Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?
Also, there still seems to be a misconception on what's being proposed here. The proposal is to synchronize the development of Solr and Lucene. Ie, a single dev list, single set of committers, synchronized releases. Everything else remains the same. EG the release artifacts, user's lists, web sites, branding, all remain separate. How the source code is modularized is an orthogonal question. We've discussed breaking things out of Lucene's core, like query parser, queries, analyzers, into their own modules (and shipping their own artifacts), which I still think makes great sense. But it's independent of synchronizing our development. Mike On Mon, Mar 1, 2010 at 1:03 PM, Michael McCandless luc...@mikemccandless.com wrote: On Mon, Mar 1, 2010 at 12:58 PM, Marvin Humphrey mar...@rectangular.com wrote: On Mon, Mar 01, 2010 at 12:44:02PM -0500, Michael McCandless wrote: But it goes beyond analyzers: I'd like to see other modules, now in Solr, eventually moved to Lucene, because they really are core functionality (eg facets, function (and other?) queries, spatial, maybe improvements to spellchecker/highlighter). I disagree. Those don't belong in core, and though they are all great features, adding them to core constitutes bloat, IMO. The Query class belongs in core. All those other modules should be distributed as plugins, which could be used by Solr, Katta, Lucene, whatever. Note that this is orthogonal to whether Solr and Lucene merge or diverge. I agree with this (sorry I wasn't clear). By core functionality I mean it should be a separate module (plugin) that direct Lucene users can use, not whenever you install core Lucene you get these functions. Ie, users shouldn't have to install Solr to use facets with Lucene. Mike
Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?
This looks great! But, the goal is to make a standalone toolkit exposing GIS functions, right? My original question (integrating this into Lucene/Solr) remains. EG there's a lot of good work happening now in Solr to make spatial search available. How will that find its way back to Lucene? Lucene has its own (now duplicate) spatial package that was already developed. Users will now be confused about the two, each will have different bugs/features, etc. If we had shared development then the ongoing effort would result in a spatial package that direct Lucene users and Solr users would be able to use. Mike On Mon, Mar 1, 2010 at 1:28 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: I'm glad that you brought that up! :) Check out: http://incubator.apache.org/projects/sis.html We're just starting to tackle that very issue right now...patches/ideas/contributions welcome. Cheers, Chris On 3/1/10 11:25 AM, Michael McCandless luc...@mikemccandless.com wrote: Because the code dup with analyzers is only one of the problems to solve. In fact, it's the easiest of the problems to solve (that's why I proposed it, only, first). A more differentiating example is a much less mature module. EG take spatial -- if Solr were its own TLP, how could spatial be built out in a way that we don't waste effort, and so that both direct Lucene and Solr users could use it when it's released? Mike On Mon, Mar 1, 2010 at 1:07 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Mike, I'm not sure I follow this line of thinking: how would Solr being a TLP affect the creation of a separate project/module for Analyzers any more so than it not being a TLP? Both Lucene-java and Solr (as a TLP) could depend on the newly created refactored Analysis project. Chris On 3/1/10 10:44 AM, Michael McCandless luc...@mikemccandless.com wrote: If we don't somehow first address the code duplication across the 2 projects, making Solr a TLP will make things worse. 
I started here with analysis because I think that's the biggest pain point: it seemed like an obvious first step to fixing the code duplication and thus the most likely to reach some consensus. And it's also very timely: Robert is right now making all kinds of great fixes to our collective analyzers (in between bouts of fuzzy DFA debugging). But it goes beyond analyzers: I'd like to see other modules, now in Solr, eventually moved to Lucene, because they really are core functionality (eg facets, function (and other?) queries, spatial, maybe improvements to spellchecker/highlighter). How can we do this? And how can we do this so that it lasts over time? If new cool core things are born in Solr-land (which of course happens a lot -- lots of good healthy usage), how will they find their way back to Lucene? Yonik's proposal (merging development of Solr/Lucene, but keeping all else separate) would achieve this. If we do the opposite (Solr -> TLP), how could we possibly achieve this? I guess one possibility is to just suck it up and duplicate the code. Meaning, each project will have to manually merge fixes in from the other project (so long as there's someone around with the itch to do so). Lucene would copy in all of Solr's analysis, and vice-versa (and likewise other dup'd functionality). I really dislike this solution... it will confuse the daylights out of users, it's error-prone, it's a waste of dev effort, there will always be little differences... but maybe it is in fact the lesser evil? I would much prefer merging Solr/Lucene development... 
Mike On Mon, Mar 1, 2010 at 12:01 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Grant, On Mar 1, 2010, at 8:20 AM, Mattmann, Chris A (388J) wrote: Hi Robert, I think my proposal (Solr-TLP) is sort of orthogonal to the whole analyzers issue - I was in favor, at the very least, of having a separate module/project/whatever that both Solr/Lucene (and whatever project) can depend on for the shared analyzer code... Not really. They are intimately linked. Ummm, how so? Making project A called Apache Super Analyzers and then making Lucene(-java) and Solr depend on Apache Super Analyzers is separate of whether or not Lucene(-java) and Solr are TLPs or not... Cheers, Chris Cheers, Chris On 3/1/10 9:12 AM, Robert Muir rcm...@gmail.com wrote: this will make the analyzers duplication problem even worse On Mon, Mar 1, 2010 at 11:06 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Mark, Thanks for your message. I respect your viewpoint, but I respectfully disagree. It just seems (to me at least based on the discussion) like a TLP for Solr is the way to go. Cheers, Chris On 3/1/10 8:54 AM, Mark Miller markrmil...@gmail.com wrote: On 03/01/2010 10:40 AM, Mattmann, Chris A (388J) wrote: Hi Mark, That would really be no real world change from how
Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?
The possibility of slowing down releases is the only real concern I also share. But, digging into it, I think as long as the project keeps a stable trunk (something Lucene has always tried to do -- does Solr?)... then release frequency is really a matter of discipline. I mean in Lucene we keep saying we want faster releases, but why doesn't it happen? Couldn't we have done 2X as many releases in the past few years? Did we really want to release more frequently? If we really want to take it seriously I think we should have someone unofficially be the next release czar. As soon as a release is finished, this czar is responsible for roughly planning the next one. This means making a tentative schedule, tracking big features and making sure they land early enough to bake fully on trunk, etc. New modules (eg spatial) need not gate the release -- that module's docs would call out clearly that it's not fully baked yet... Mike On Mon, Mar 1, 2010 at 1:13 PM, Michael Busch busch...@gmail.com wrote: It seems like most of the people agree with these good goals but are concerned about the release cycles (including me). How can we achieve these goals without making releases more difficult? Michael On 3/1/10 9:44 AM, Michael McCandless wrote: If we don't somehow first address the code duplication across the 2 projects, making Solr a TLP will make things worse. I started here with analysis because I think that's the biggest pain point: it seemed like an obvious first step to fixing the code duplication and thus the most likely to reach some consensus. And it's also very timely: Robert is right now making all kinds of great fixes to our collective analyzers (in between bouts of fuzzy DFA debugging). But it goes beyond analyzers: I'd like to see other modules, now in Solr, eventually moved to Lucene, because they really are core functionality (eg facets, function (and other?) 
queries, spatial, maybe improvements to spellchecker/highlighter). How can we do this? And how can we do this so that it lasts over time? If new cool core things are born in Solr-land (which of course happens a lot -- lots of good healthy usage), how will they find their way back to Lucene? Yonik's proposal (merging development of Solr/Lucene, but keeping all else separate) would achieve this. If we do the opposite (Solr -> TLP), how could we possibly achieve this? I guess one possibility is to just suck it up and duplicate the code. Meaning, each project will have to manually merge fixes in from the other project (so long as there's someone around with the itch to do so). Lucene would copy in all of Solr's analysis, and vice-versa (and likewise other dup'd functionality). I really dislike this solution... it will confuse the daylights out of users, it's error-prone, it's a waste of dev effort, there will always be little differences... but maybe it is in fact the lesser evil? I would much prefer merging Solr/Lucene development... Mike On Mon, Mar 1, 2010 at 12:01 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Grant, On Mar 1, 2010, at 8:20 AM, Mattmann, Chris A (388J) wrote: Hi Robert, I think my proposal (Solr-TLP) is sort of orthogonal to the whole analyzers issue - I was in favor, at the very least, of having a separate module/project/whatever that both Solr/Lucene (and whatever project) can depend on for the shared analyzer code... Not really. They are intimately linked. Ummm, how so? Making project A called Apache Super Analyzers and then making Lucene(-java) and Solr depend on Apache Super Analyzers is separate of whether or not Lucene(-java) and Solr are TLPs or not... Cheers, Chris Cheers, Chris On 3/1/10 9:12 AM, Robert Muir rcm...@gmail.com wrote: this will make the analyzers duplication problem even worse On Mon, Mar 1, 2010 at 11:06 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Mark, Thanks for your message. 
I respect your viewpoint, but I respectfully disagree. It just seems (to me at least based on the discussion) like a TLP for Solr is the way to go. Cheers, Chris On 3/1/10 8:54 AM, Mark Miller markrmil...@gmail.com wrote: On 03/01/2010 10:40 AM, Mattmann, Chris A (388J) wrote: Hi Mark, That would really be no real world change from how things work today. The fact is, today, Solr already operates essentially as an independent project. Well if that's the case, then it would lead me to think that it's more of a TLP more than anything else per best practices. That depends. It could be argued it should be a top level project or that it should be closer to the Lucene project. Some people are arguing for both approaches right now. There are two directions we could move in. The only real difference is that it shares the same PMC with Lucene now
Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?
To make this more concrete, I think this is roughly what's being proposed: * Merging the dev lists into a single list. * Merging committers. * When a change is committed to Lucene, it must pass all Solr tests. * Release both at once. These things would not change: * Most importantly, the source code would remain factored into separate dirs/modules. * User's lists should remain separate. * Web sites would remain separate. * Solr & Lucene are still separate downloads, separate JARs, separate subdirs in the source tree, etc. The outside world still sees Solr & Lucene as separate entities. It's only that they would now be developed/released in synchrony. There are some important gains by doing this: * Single source for all the code dup we now have across the projects (my original reason, specifically on analyzers, for starting this). * Whenever a new feature is added to Lucene, we'd work through what the impact is to Solr. This can still mean we separately develop exposure in Solr, but it'd get us to at least more immediately think about it. * Solr is Lucene's biggest direct user -- most people who use Lucene use it through Solr -- so having it more closely integrated means we know sooner if we broke something. * Right now I can't test whether flex breaks anything in Solr, since Solr isn't upgraded to 3.1. Recent big changes (eg segment based searching, Version, attr based tokenstream api) caused a lot of work in Solr that could've been much smoother had Solr been there as we were working through them. Recent new features, eg near-real-time search, which are still unavailable in Solr, would have at least had some discussion about how to expose this in Solr. Over time (and we don't have to do this right on day 1) we can make core capabilities available to pure Lucene. EG core Lucene users should be able to use faceting, use a schema, etc. I think this idea makes a lot of sense and I think now is a good time to do it. 
Yes, this is a big change, but I think the gains are sizable. As Lucene & Solr diverge more, it'll only become harder and harder to merge. Robert's massive patch on SOLR-1657, upgrading most of Solr's analyzers to 3.0, is aging... while other changes to analyzers are being proposed (SOLR-1799). If we were integrated (or at least single source for analyzers), Robert would already have committed it. Mike On Fri, Feb 26, 2010 at 5:20 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Fri, Feb 26, 2010 at 5:15 PM, Steven A Rowe sar...@syr.edu wrote: On 02/24/2010 at 2:20 PM, Yonik Seeley wrote: I've started to think that a merge of Solr and Lucene would be in the best interest of both projects. The Sorlucene :) merger could be achieved virtually, i.e. via policy, rather than physically merging: Everything is virtual here anyway :-) I agree with Mike that a single dev list is highly desirable. There would still be separate downloads. What to do with some of the other stuff is unspecified. Committers would need to be merged though - that's the only way to make a change across projects w/o breaking stuff. -Yonik
Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?
I think this is a good idea! LuSolr ;) (kidding) I agree with all of your points Yonik. What do other people think...? Mike On Wed, Feb 24, 2010 at 2:20 PM, Yonik Seeley yo...@apache.org wrote: I've started to think that a merge of Solr and Lucene would be in the best interest of both projects. Recently, Solr has pulled back from using Lucene trunk (or even the latest version), as the increased amount of change between releases (and in-between releases) made it impractical to deal with. This is a pretty big negative for Lucene, since Solr is the biggest Lucene user (where people are directly exposed to lucene for the express purpose of developing search features). I know Solr development has always benefited hugely from users using trunk, and Lucene trunk has now lost all the solr users. Some in Lucene development have expressed a desire to make Lucene more of a complete solution, rather than just a core full-text search library... things like a data schema, faceting, etc. The Lucene project already has an enterprise search platform with these features... that's Solr. Trying to pull popular pieces out of Solr makes life harder for Solr developers, brings our projects into conflict, and is often unsuccessful (witness the largely failed migration of FunctionQueries from Solr to Lucene). For Lucene to achieve the ultimate in usability for users, it can't require Java experience... it needs the higher level abstractions provided by Solr. The other benefit to Lucene would be to bring features to developers much sooner... Solr has had features years before they were developed in Lucene, and currently has more developers working with it. Esp with Solr not using Lucene trunk, if a Solr developer wants a feature quickly, they cannot add it to Lucene (even if it might make sense there) since that introduces a big unpredictable lag - when that version of Lucene makes its way into Solr. The current divide is a bit unnatural. 
For maximum benefit of both projects, it seems like Solr and Lucene should essentially merge. Lucene core would essentially remain as it is, but: 1) Solr would go back to using Lucene's trunk 2) For new Solr features, there would be an effort to abstract it such that non-Solr users could use the functionality (faceting, field collapsing, etc) 3) For new Lucene features, there would be an effort to integrate it into Solr. 4) Releases would be synchronized... Lucene and Solr would release at the same time. -Yonik
Re: Stale NFS file handle Exception
This is a known limitation of Lucene over NFS. It's because NFS makes no effort to protect open files from deletion. Other filesystems prevent (or delay) deletion of still-open files: on Unix, delete-on-last-close semantics are used; on Windows, the file cannot be deleted until no process has it open anymore.

One way to work around this is to make a custom IndexDeletionPolicy, so that your app defers deletion of old commits until you know all current readers have reopened. Another workaround is to simply catch that exception (best to screen for "Stale NFS file handle", so you don't mask other IOException cases) and reopen your reader right then -- but this is only viable if it's acceptable that a random query will be forced to wait while reopen/warming takes place. Mike

On Thu, Jan 14, 2010 at 1:25 AM, Claudio Deluca decl...@gmail.com wrote: Hi, We are using Lucene 2.4.1 in a load-balanced environment. The Lucene index is stored on server A, while server B accesses the index through an NFS share. After creating the instance of IndexWriter, the documents are added and the index gets optimized and closed.

IndexWriter theIndexWriter = new IndexWriter(new File(indexerPath), new WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED); ... theIndexWriter.optimize(); theIndexWriter.close();

For search we open the index on application startup like this:

Directory theDirectory = FSDirectory.getDirectory(theConfigPath); IndexReader indexReader = IndexReader.open(theDirectory, true); IndexSearcher searcher = new IndexSearcher(indexReader);

The exception appears when server A finishes recreating the index (closes the IndexWriter) and server B executes a search query over the index. Only if we restart the application does the problem no longer appear, because at that point the index is newly opened. How can we avoid this exception?
java.io.IOException: Stale NFS file handle
    at java.io.RandomAccessFile.readBytes(Native Method)
    at java.io.RandomAccessFile.read(Unknown Source)
    at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:596)
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
    at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:247)
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:157)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:78)
    at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:110)
    at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:98)
    at org.apache.lucene.search.PhrasePositions.next(PhrasePositions.java:41)
    at org.apache.lucene.search.PhraseScorer.init(PhraseScorer.java:131)
    at org.apache.lucene.search.PhraseScorer.next(PhraseScorer.java:76)
    at org.apache.lucene.search.ConjunctionScorer.init(ConjunctionScorer.java:80)
    at org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:48)
    at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:319)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:136)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:123)
    at org.apache.lucene.search.Searcher.search(Searcher.java:86)
Thanks, Claudio
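Mike's first suggestion above -- a custom IndexDeletionPolicy that defers deleting old commits until readers have had a chance to reopen -- can be sketched roughly as follows. This is a self-contained illustration, not Lucene's actual API: the `Commit` interface and the `KeepLastNPolicy` class are hypothetical stand-ins. In a real application you would implement `org.apache.lucene.index.IndexDeletionPolicy` and pass it to the `IndexWriter` constructor.

```java
import java.util.List;

// Sketch of the "defer deletion of old commits" workaround for NFS.
// The Commit interface below is a stand-in for Lucene's IndexCommit; a
// real implementation would implement
// org.apache.lucene.index.IndexDeletionPolicy instead.
public class KeepLastNPolicy {

    interface Commit {
        void delete();  // ask for this commit's files to be deleted
    }

    private final int keep;  // how many recent commits to preserve

    public KeepLastNPolicy(int keep) {
        this.keep = keep;
    }

    // The writer would call this after each commit, with commits ordered
    // oldest to newest; we delete everything except the newest 'keep'
    // commits, giving NFS readers time to reopen before the files they
    // still hold open are removed.
    public void onCommit(List<? extends Commit> commits) {
        for (int i = 0; i < commits.size() - keep; i++) {
            commits.get(i).delete();
        }
    }
}
```

The right value of `keep` depends on your deployment: it must cover at least one full reindex-plus-reopen cycle on the slowest searching client, otherwise a reader can still land on a deleted commit.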
Re: Lucene PMC += Mark Miller
Welcome! Mike On Thu, Jan 14, 2010 at 10:37 AM, Grant Ingersoll gsing...@apache.org wrote: I'm pleased to announce the Lucene PMC has elected to add Mark Miller to its ranks in recognition of his longstanding contributions to the Lucene community as a committer on both Lucene Java and Solr. Congrats, Mark! -Grant Ingersoll Lucene PMC Chair
Re: [spatial] Cartesian Tiers nomenclature
Right, NRQ is able to translate any requested range into the union (OR) of brackets (from the trie) created during indexing. Can spatial do the same thing, just in 2D instead of 1D? Ie, reconstruct any expressible shape (created at query time) as the union of some number of grids/tiers, at finer and finer levels, created during indexing? Spatial, today, seems to do this, except it must also do precise filtering on each matching doc, because some of the grids may contain hits outside of the requested shape.

In fact, NRQ could also borrow from spatial's current approach -- ie, create the union of some smallish number of coarse brackets. Some of the brackets will fall entirely within the requested range, and so require no further filtering, while others will fall part inside / part outside of the requested range, and so will require precise filtering. If NRQ did this, it would have many fewer postings to enumerate, at the cost of having to do precise filtering on some of them (and we'd have to somehow encode the original value in the index). Mike

On Tue, Dec 29, 2009 at 8:42 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Tue, Dec 29, 2009 at 7:13 PM, Marvin Humphrey mar...@rectangular.com wrote: ... but for this algorithm, different rasterization resolutions need not proceed by powers-of-two. Indeed -- one way to further generalize would be to use something like Lucene's trie-based Numeric field, but with a square instead of a line. That would allow tweaking the space/speed tradeoff. -Yonik http://www.lucidimagination.com
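The trie idea Mike describes -- covering an arbitrary range with a small union of aligned brackets, where the coarse middle brackets each stand for a single pre-indexed term -- can be sketched in 1D like this. This is a simplified illustration under assumed names (`TrieRangeSketch`, `split`), not Lucene's actual NumericRangeQuery code, and it uses a binary trie (precisionStep of 1) for clarity:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of trie-style range decomposition: cover [lo, hi]
// with a minimal union of power-of-two-aligned "brackets", each of which
// would correspond to a single pre-indexed term at some precision level.
// (Illustration only -- not Lucene's actual NumericRangeQuery code.)
public class TrieRangeSketch {

    // Returns brackets as {start, end} pairs, left to right.
    public static List<long[]> split(long lo, long hi) {
        List<long[]> out = new ArrayList<>();
        while (lo <= hi) {
            // Grow the widest power-of-two bracket that starts aligned
            // at lo and still fits entirely inside [lo, hi].
            long w = 1;
            while (w * 2 > 0 && lo % (w * 2) == 0 && lo + w * 2 - 1 <= hi) {
                w *= 2;
            }
            out.add(new long[] { lo, lo + w - 1 });
            lo += w;
        }
        return out;
    }
}
```

For example, `split(3, 17)` yields the four brackets [3,3], [4,7], [8,15], [16,17]: the interior brackets fall entirely inside the range and need no per-doc filtering, which is exactly the trade-off discussed above. The 2D spatial analogue would grow aligned squares instead of intervals, with edge squares requiring precise filtering.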
Re: [spatial] Cartesian Tiers nomenclature
It's great that there's such a sudden burst of energy to improve spatial in both Solr and Lucene! Isn't this concept the same as trie (for Lucene's numeric fields), but in 2D instead of 1D? If so, I think "tiles" doesn't convey that they recursively subdivide. Also: why does this notion even need naming so badly? Why does this concept leak out of the abstraction? Shouldn't all this (cartesian tier, cartesian tier plotter) be under the hood? I make a SpatialField, I index it, I can then make a SpatialShapeQuery, a SpatialDistanceSort, etc.? Ie, trie is known within Lucene, but doesn't leak out -- the outside world knows it as Numeric*. Trie is an implementation detail, inside Lucene. (NOTE: I only know just enough about spatial to be dangerous...) Mike

On Tue, Dec 29, 2009 at 2:49 AM, patrick o'leary pj...@pjaol.com wrote: Ah, the language of math is the ultimate lingua franca - nice! When you look at the coordinates entity from KML, ask why the lat/longs are reversed to long/lat. Answer: because the folks working on the display thought in terms of *display, not GIS*; the point is over Y degrees of longitude and down X degrees of latitude. But again, that's not a convention used outside a little part of GeoTools or KML; GML / GeoRSS are again just the regular lat,long (NS,EW), or projected EPSG or other standard projections in OGC 05-011. To my knowledge Google are the only real pushers of (EW,NS) these days. So what does this diatribe mean? We're kind of at the bleeding edge of defining the standard, hence the difficulty of finding data on it. This is one reason why locallucene and localsolr became popular: they solved a problem simply.
Docs about it exist on gissearch.com; DZone is doing articles on it: http://java.dzone.com/articles/spatial-search-hibernate?utm_source=feedburnerutm_medium=feedutm_campaign=Feed%3A+javalobby%2Ffrontpage+%28Javalobby+%2F+Java+Zone%29 Locallucene in Google has over 8,000 results: http://www.google.com/search?q=locallucene Localsolr has over 4,000 results: http://www.google.com/search?q=localsolr I've seen and helped with installations all over the place; heck, even Codehaus use it, as do folks on GitHub with the geonames db. I named it in a way that is mathematically and scientifically correct, and it is gaining enough traction and popularity to start becoming part of the standard, not just duplicating one. I honestly can't see how a refactoring brings anything positive to this, when there isn't a good standard out there yet.

On Mon, Dec 28, 2009 at 10:22 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Patrick, Interesting. It seems like there is a precedent already in the Local Lucene and Local SOLR packages that define CartesianTier as the lingua franca. Like I said in an earlier email, it depends on who you talk to regarding the preference of what to call these Tiles/Grids/Tiers, etc., and that seems to be further evidenced by your research. I for one don't really have a preference, but precedent matters to me, and if Tiers have been used to date then there should be strong consideration to use that nomenclature; +1 from me. Cheers, Chris

On 12/28/09 9:25 PM, patrick o'leary pj...@pjaol.com wrote: So, trying not to drag this out: the most frequent generic term used in GIS software is SRID http://en.wikipedia.org/wiki/SRID Again, this provides just a basic nomenclature for the high-level element -- somewhat the blackbird of objects rather than defining the magpie (sorry for the CS 101 reference). But it should show that every implementation is unique in some format. Perhaps as unique as CartesianTier's (sorry Ted!
) On Mon, Dec 28, 2009 at 5:26 PM, patrick o'leary pj...@pjaol.com wrote: Hmm, depends; tiles indicate to me a direct correlation between the id and a map tile, which will depend upon using the right projection with the cartesian plotter.

On Mon, Dec 28, 2009 at 2:56 PM, Grant Ingersoll gsing...@apache.org wrote: On Dec 28, 2009, at 4:19 PM, patrick o'leary wrote: Hmm, but when you say grid, to me that's just a bunch of regularly spaced lines... Yeah, I hear you. I chose spatial tiles for the Solr patch, but spatial grid would work too. Or map tiles/map grids. That anchors it into the spatial world, since we're calling Lucene's spatial contrib/spatial and Solr's Solr Spatial.

On Mon, Dec 28, 2009 at 1:16 PM, Grant Ingersoll gsing...@apache.org wrote: On Dec 28, 2009, at 3:51 PM, patrick o'leary wrote: So Grant, here's the deal behind the name. Cartesian because it's a simple x,y coordinate system. Tier because there are multiple tiers, i.e. levels of resolution. If you look at it closer:
- To programmers, there's a quadtree implementation.
- To web users who use maps, these are grids / tiles.
- To GIS experts, this is a form of multi-resolution rastering.
- To astrophysicists, these are tiers.
- To the MS folks I've