Re: updating jakarta site
Henri Yandell wrote: Redirect of jakarta.apache.org/lucene to lucene.apache.org/java/docs/index.html I noticed there's a commented-out redirect in the .htaccess, so after adding my own I deleted it again and left the redirect off for the moment. Unsure if there's a reason the commented-out bit is there, and lucene.apache.org/java and jakarta.apache.org/lucene look to be clones currently (barring the extra news item at lucene.apache.org). When the redirect was first put into place there were some broken links at lucene.apache.org/java, so the redirect was removed until the links were fixed. I think the links were fixed but the redirect was never restored. Doug
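For reference, re-enabling it in jakarta.apache.org's .htaccess would amount to something like the following (mod_alias syntax; the exact rule that was commented out may have differed, so treat this as a sketch):

  # Prefix redirect: preserves deep links, e.g. /lucene/foo -> /java/foo
  Redirect permanent /lucene http://lucene.apache.org/java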
Re: updating jakarta site
Erik Hatcher wrote: When Doug is cool with re-enabling the redirect, it's fine with me. I'm cool with it if it works. Why not re-enable it, search for "site:apache.org lucene" on Google, Yahoo! and MSN, and click on the first few links. If these work, then I'm okay with the redirect. As we change stuff like this, we should try to change things only once, rather than making a temporary change that might not be appropriate long-term. This is especially the case with things like URLs and email addresses, which get saved in mail archives, web indexes, etc. The fewer times we change them the less we'll break things. Thankfully jakarta never made it into a package name! Doug
Re: updating jakarta site
Henri Yandell wrote: Your download page is already separate; you're using the global closer.cgi file. So we need to:
- rename Lucene Java's mailing lists, with forwards put into place.
- add a mailing list page to Lucene Java's website, modelled after http://jakarta.apache.org/site/mail2.html#Lucene. This should replace the link in the sidebar to Jakarta's mailing list page.
The mailing lists should probably be renamed: [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED] Does that sound right to folks? Doug
Re: updating jakarta site
Garrett Rooney wrote: Actually, currently we've got both lucene4c and java commits going to [EMAIL PROTECTED], and there was some talk of just leaving it that way, since it isn't that much traffic and it encourages people to keep an eye on what's going on in other languages. I think that's a bad idea. Once there are lots of commits folks will start unsubscribing or ignoring things. Personally I only want to see commits for projects that I'm actively contributing to. I don't anticipate I'll be committing to lucene4c, so I don't feel the need to track it on a commit-by-commit level. I do anticipate I'll make commits to Lucene Java, and try to carefully read every commit message for this project. Yes, I could set up filters so that I only see the commits I like, but I'd much prefer these were simply separate mailing lists. Soon we hope to have Nutch under Lucene's umbrella. Do you and Erik really want to see all of the Nutch commits in your inbox? http://www.mail-archive.com/nutch-cvs%40lists.sourceforge.net/ We keep running into this same confusion: I think that the Lucene TLP should be set up primarily as a container of sub-projects. Jakarta Lucene is the first of these and Lucene4c is the second. We don't intend to merge Jakarta Lucene and Lucene4c into a single project, with a single set of developers, building a single download. So each component of Jakarta Lucene should be moved to a sub-component of the Lucene TLP, not to a top-level component. This is the case for bug databases, mailing lists, web sites, etc., across the board. Do we disagree on this? Doug
Re: patch - DEFAULT_ vars in IndexWriter non-final and DEFAULT for useCompoundFile
Kevin A. Burton wrote: Wolf Siberski wrote: Kevin A. Burton wrote: I see the following issues with your patch: - you changed the DEFAULT_... semantics from constant to modifiable, but didn't adjust the names according to Java conventions (default_...). Java doesn't have any naming conventions which include an underscore. I assume you mean defaultUse... http://java.sun.com/docs/codeconv/html/CodeConventions.doc8.html#15436 - you can achieve the same by writing your own IndexWriterFactory which sets the corresponding values after creating a new IndexWriter. Should be ~30 lines of code. It only makes sense to include a patch if either a solution is impossible with the current code or *a lot* of (potential) users have to work around something. I *could* (and I thought of it) but it seems reasonable to be able to set Lucene to use whatever settings you want at any time... You could do exactly that with an IndexWriterFactory; that's Wolf's point. Doug
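For what it's worth, the factory Wolf has in mind could be as small as the following sketch (the class name and chosen defaults are made up; it relies on IndexWriter's public mergeFactor field and setUseCompoundFile() method as they exist in Lucene 1.4):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class IndexWriterFactory {
  private int mergeFactor = 10;            // application-wide defaults,
  private boolean useCompoundFile = true;  // adjustable at any time

  public synchronized void setMergeFactor(int mergeFactor) {
    this.mergeFactor = mergeFactor;
  }

  public synchronized void setUseCompoundFile(boolean useCompoundFile) {
    this.useCompoundFile = useCompoundFile;
  }

  // route every writer the application opens through here
  public synchronized IndexWriter newWriter(Directory dir, Analyzer analyzer,
                                            boolean create) throws IOException {
    IndexWriter writer = new IndexWriter(dir, analyzer, create);
    writer.mergeFactor = mergeFactor;           // public tuning field in 1.4
    writer.setUseCompoundFile(useCompoundFile);
    return writer;
  }
}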
Re: patch - DEFAULT_ vars in IndexWriter non-final and DEFAULT for useCompoundFile
Kevin A. Burton wrote: Doug Cutting wrote: Wolf Siberski wrote: So, if anything at all, I would rather opt for making these constants private :-). I agree. In general, fields should either be final, or private with accessor methods. So, we could change this to:

private static int defaultMergeFactor =
  Integer.parseInt(System.getProperty("org.apache.lucene.mergeFactor", "10"));
public static int getDefaultMergeFactor() { return defaultMergeFactor; }
public static void setDefaultMergeFactor(int mergeFactor) { defaultMergeFactor = mergeFactor; }

In my original patch I deleted 5 final keywords for a reduction in code of 25 bytes. If I were to submit the patch again I'd have to add 2 methods and 35 additional lines of code. Seems to me that the Java coding conventions in this situation should be ignored. This isn't a coding convention, but rather software engineering. If we wish to be able to back-compatibly modify Lucene's implementation at a later date, it's usually easiest to have access through methods rather than fields, since we can intercept reads and writes to the field. Doug
read index terms lazily
Attached is a patch which delays reading of index terms until they are first accessed. The cost of this is another file descriptor, held open until the terms are accessed, when it is closed. The benefit is that operations that do not require access to index terms are much faster and use much less memory. Thoughts? Doug

Index: src/java/org/apache/lucene/index/TermInfosReader.java
===================================================================
--- src/java/org/apache/lucene/index/TermInfosReader.java (revision 155349)
+++ src/java/org/apache/lucene/index/TermInfosReader.java (working copy)
@@ -33,6 +33,12 @@
   private SegmentTermEnum origEnum;
   private long size;

+  private Term[] indexTerms = null;
+  private TermInfo[] indexInfos;
+  private long[] indexPointers;
+
+  private SegmentTermEnum indexEnum;
+
   TermInfosReader(Directory dir, String seg, FieldInfos fis) throws IOException {
     directory = dir;
@@ -42,7 +48,10 @@
     origEnum = new SegmentTermEnum(directory.openInput(segment + ".tis"), fieldInfos, false);
     size = origEnum.size;
-    readIndex();
+
+    indexEnum = new SegmentTermEnum(directory.openInput(segment + ".tii"), fieldInfos, true);
   }

   protected void finalize() {
@@ -73,28 +82,23 @@
     return termEnum;
   }

-  Term[] indexTerms = null;
-  TermInfo[] indexInfos;
-  long[] indexPointers;
-
-  private final void readIndex() throws IOException {
-    SegmentTermEnum indexEnum = new SegmentTermEnum(directory.openInput(segment + ".tii"), fieldInfos, true);
+  private final void ensureIndexIsRead() throws IOException {
+    if (indexTerms != null)
+      return;
     try {
       int indexSize = (int)indexEnum.size;

       indexTerms = new Term[indexSize];
       indexInfos = new TermInfo[indexSize];
       indexPointers = new long[indexSize];

       for (int i = 0; indexEnum.next(); i++) {
         indexTerms[i] = indexEnum.term();
         indexInfos[i] = indexEnum.termInfo();
         indexPointers[i] = indexEnum.indexPointer;
       }
     } finally {
       indexEnum.close();
     }
   }
@@ -126,6 +130,8 @@
   TermInfo get(Term term) throws IOException {
     if (size == 0) return null;
+    ensureIndexIsRead();
+
     // optimize sequential access: first try scanning cached enum w/o seeking
     SegmentTermEnum enumerator = getEnum();
     if (enumerator.term() != null   // term is at or past current
@@ -179,6 +185,7 @@
   final long getPosition(Term term) throws IOException {
     if (size == 0) return -1;
+    ensureIndexIsRead();
     int indexOffset = getIndexOffset(term);
     seekEnum(indexOffset);
Re: Javadoc not available due to non-public classes?
Kevin A. Burton wrote: You know... the javadoc on the site doesn't include non-public classes like TermInfosWriter. Confused me for a second. That's because it's not public. The javadoc on the site documents the public API. This is not a bug, but a feature. Also... the site doesn't have JXR output for Lucene. Would be nice to have. Maven essentially gives you this for free... If you would like to provide a patch to upgrade Lucene to use Maven, educate Lucene developers about Maven, and help to run it, that would be great! Doug
Re: Patch - IndexReader methods and MultiSearcher methods...
Kevin A. Burton wrote: Also, I assume that the reason you make the reader field protected is because getReader() is not sufficient, i.e., you want to set the reader. This would stylistically be better done with a setReader() method, no? Do you only change it at construction, or at runtime? If you only change it at construction, then super(reader) in the constructor might suffice. We change it at runtime. This is a ReloadableIndexSearcher that I developed that can reload if an index has been optimized() or added to by another external process. I just have my external process do the merge and then call reload() on the main index. The cool thing about this approach is that the entire webapp is operational while this happens. While the swap is happening searches just back up for a second and then complete. It also doesn't require 2x memory because I can dispose of the current reader, block searches, then open the new reader. That can easily be done without subclassing IndexSearcher:

public class SearcherCache {
  private Searcher searcher;
  public synchronized Searcher getSearcher() { return searcher; }
  public synchronized void setSearcher(Searcher searcher) { this.searcher = searcher; }
}

Then use SearcherCache.getSearcher() whenever you need a searcher. You could make it more complicated, e.g., have it automatically update the searcher when the index has changed, etc. But none of that requires or is in particular facilitated by subclassing IndexSearcher, so far as I can see. Why don't we do this: I don't think we should have a setReader then. This way there's no strong contract that developers preserve things that might break caching. I'd like to keep the protected change though. Making the field protected is just an obscure way of making it changeable. If we really need to make it settable, then we should add a setReader() method and add some cautions to its documentation. But I'm not yet convinced this needs to be settable. Doug
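To make the swap concrete, a reload under this scheme could look like the following (a sketch only; SearcherCache is the class above, and draining in-flight searches before closing the old reader is left to the application):

// Hypothetical reload: open the new searcher, swap it into the cache,
// then dispose of the old one. New calls see the new index immediately.
void reload(SearcherCache cache, String indexPath) throws IOException {
  Searcher old = cache.getSearcher();
  cache.setSearcher(new IndexSearcher(indexPath));
  if (old != null)
    old.close();
}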
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Wolf Siberski wrote: The price is an extension (or modification) of the Searchable interface. I've added corresponding search(Weight...) methods to the existing search(Query...) methods and deprecated the latter. I think this is the right solution. If Searchable is meant to be Lucene-internal, then IMHO these 'duplicates' should be removed. Searchable should be public, so that other RPC mechanisms may be used, rather than RMI. Thus the architecture supports distributed search, and RMI is just one potential platform. Searchable is meant to be the abstract network protocol. Queries, filters and sort criteria are designed to be compact so that they may be efficiently passed to a remote index, with only the top-scoring hits returned, rather than every non-zero-scoring hit. HitCollector-based access to remote indexes is discouraged. HitCollectors are primarily meant to be used to implement queries, sorting and filtering. The deprecated methods should be removed in Lucene 2.0. We could probably remove them now without breaking anyone, but it's better to be safe. Regarding your other comments: I've been a bit too eager in refactoring, not giving enough thought to backward-compatibility issues. Now I've reverted to the existing API and behavior as far as (IMHO) possible, and that was pretty far. The only API change necessary is that createWeight() _throws IOException_, because the idfs have to be computed in the Weight constructors. I think that's okay. Thanks for all your work! An improved patch is attached to the Bugzilla issue. This patch now looks great to me. +1 Does anyone object to committing this patch? http://issues.apache.org/bugzilla/show_bug.cgi?id=31841 Doug
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Wolf Siberski wrote: Now I found another solution which requires more changes, but IMHO is much cleaner: - when a query computes its Weight, it caches it in an attribute - a query can be 'frozen'. A frozen query always returns the cached Weight when calling Query.weight(). Originally there was no Weight in Lucene, only Query and Scorer. Weight was added in order to make it so that searching did not modify a Query, so that a Query instance could be reused. Searcher-dependent state of the query is meant to reside in the Weight. IndexReader-dependent state resides in the Scorer. Freezing a query violates this. Can't we create the weight once in Searcher.search? This approach requires that weights can be serialized. Interestingly, Weight already implements Serializable, but the current implementation doesn't work for all weight classes. The reason is that some weights hold a reference to a searcher, which is of course not serializable. We can't make it transient either, because this searcher is the source of the Similarity needed by scorers. On closer look it turned out that the searcher is used only for two things: as a source for a Similarity, and as a docFreq/maxDoc source. docFreq/maxDoc are only necessary to initialize the weights, but not needed by scorers. So instead of providing the Searcher, I now provide a Similarity and a DocFreqSource to the weights. Only the Similarity is stored by weights. We need to make sure, however, that this is the correct Similarity. It should still be the result of Query.getSimilarity(Searcher), which doesn't appear to be the case in your patch. As for DocFreqSource versus Searcher, couldn't the Searcher be passed as a source for docFreqs and simply have Weights not keep a pointer to it? This isn't a big deal, but it would substantially minimize the API changes. As a (IMHO) positive side effect, Similarity got rid of Searcher dependencies, which leads to a better split of responsibilities: - Similarity only provides scoring formulas - Searcher (resp. DocFreqSource) provides the raw data (tf/df/maxDoc). This change affects quite a few classes (because the createWeight() signature is changed), but the modifications are pretty straightforward. But couldn't the signature change be avoided if the Weight constructors immediately called Query.getSimilarity(Searcher) to get their Similarity, and no longer kept a pointer to the Searcher? From my point of view, the patch submitted now is a sound solution for Bug 31841 (at least I like it :-) ). The next thing which IMHO needs to be done is a review by someone else. I've made a quick review, but it would be nice if others looked at this too. Thanks again for all your work here! Doug
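To make the suggestion concrete, here is roughly what a Weight constructor could do under that scheme (illustrative only; the class name is made up, and only the serializable Similarity is retained):

import java.io.IOException;
import java.io.Serializable;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.Similarity;
import org.apache.lucene.search.TermQuery;

class ExampleTermWeight implements Serializable {
  private final Similarity similarity; // serializable; the Searcher is not kept
  private final float idf;             // docFreq-derived state, computed up front

  ExampleTermWeight(TermQuery query, Searcher searcher) throws IOException {
    this.similarity = query.getSimilarity(searcher);      // the correct Similarity
    this.idf = similarity.idf(query.getTerm(), searcher); // Searcher used once, here
  }
}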
Re: Into javadocs? [Bug 31841] - [PATCH] MultiSearcher problems with Similarity.docFreq()
Paul Elschot wrote: Would you mind if some pieces of your reply end up in the javadocs? Not at all. Doug
Re: [VOTE] Incubate lucene4c?
+1 Doug
Re: [VOTE] Re: Incubating Lucene.Net
+1 Doug
Re: Incubating Lucene.Net
George Aroush wrote: Any thoughts on the Lucene.Net/dotLucene package name are welcome. I agree that Lucene.Net is a better name. It's more consistent with Lucene Java and Lucene4c, the names for other ports of Lucene. I think it's okay to reclaim the name of an abandoned project, especially if the abandoned project is better known and is substantially similar. The only problem would be if someone else felt that the name Lucene.Net was their property. But the folks at http://searchblackbox.com/ don't use the name Lucene.Net anymore. Also, I owned and used the domain lucene.net to refer to Apache's Lucene before the Sourceforge Lucene.Net project started in 8/03, which arguably gives me rights to the name: http://web.archive.org/web/*/http://www.lucene.net/ Doug
Re: removing the old FAQ
Daniel Naber wrote: could someone (Doug?) make me an administrator for the old Lucene project at sourceforge? Done. Doug
Re: lucene.apache.org
Henri Yandell wrote: On names, Lucene Java might hit trademark issues I guess. So potential worry there. Good point. Although I note that Apache already has projects called Xerces Java and Xalan Java. Sun says: http://www.sun.com/policies/trademarks/#20c So, technically, the full name of the product should be "Lucene for the Java platform", which we might sometimes abbreviate "Lucene Java". Doug
Re: lucene.apache.org
Erik Hatcher wrote: Doug - do you have your Forest work handy? Or has anyone else stepped up to build the web site? I don't have anything reusable. I converted Nutch from a different (not Anakia) XML-based site to Forrest with little difficulty (mostly using string replace in Emacs). I started by downloading Forrest and using the tutorial to seed a new project: http://forrest.apache.org/docs/your-project.html Then I outlined the site in site.xml and translated pages to the new schema. Forrest's default directory layout was not intuitive to me, and it can be changed, but I left it alone, opting to keep things as vanilla as possible. Doug
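For the archives, the seeding step in that tutorial boils down to a couple of commands (recalled from the Forrest 0.6-era CLI, so treat the exact invocations as approximate):

  cd ~/src/lucene-site
  forrest seed   # lay down a skeleton site, including site.xml, in the current directory
  forrest run    # preview it locally in a browser
  forrest        # build the static site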
Re: lucene.apache.org
Erik Hatcher wrote: I have checked out our current site to the lucene.apache.org area, and I've also set up a redirect from the jakarta.apache.org/lucene area. Keep in mind, there are two projects here: 1. Porting Java Lucene's site to Forrest. This should be structured as a sub-project of lucene.apache.org. It should be maintained in https://svn.apache.org/repos/asf/lucene/java/trunk/docs/. 2. Building a new site for lucene.apache.org. This should initially contain a single sub-project, Lucene Java. This site should be maintained in https://svn.apache.org/repos/asf/lucene/site/. We expect to shortly be adding more sub-projects, and we don't want to have to re-structure the site again soon, so let's structure it for the long-term from the start. Make sense? Do we need a separate logo for Lucene Java? Now I'm beginning to see the need for Murray Altheim's logo work: http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg07799.html If we adopted this, then we could use the same font to easily generate logos for Lucene Java, Lucene 4c, Lucene .Net, etc. Some potential sub-projects (e.g., Nutch) already have distinct logos, but it might make sense to use similar logos for ports. Doug
Re: [ANNOUNCE] lucene4c 0.02
Garrett Rooney wrote: Additionally it would be good to work on updating the disk format documentation; I've found several cases where the docs are quite out of date compared to the current code. It's hard to expect the various different ports to maintain compatibility when the formats are only documented in code. If you have a chance, please submit bugs for these. Thanks. Doug
Re: lucene.apache.org
Garrett Rooney wrote: Agreed. Java Lucene is a subproject of the Lucene TLP; leaving the existing Java Lucene site there for the time being seems OK, just so we have something there, but we should endeavour to put up something more permanent ASAP. I think, for the present, http://lucene.apache.org/ should redirect to http://lucene.apache.org/java/. Doug
Re: What does [] do to a query and what's up with lucene.apache.org?
Erik Hatcher wrote: I'm really at the limit of my bandwidth - I've got the sandbox restructuring effort on my plate right now and would like it if someone could pick up the ball on the web site side of things. Then perhaps you shouldn't have redirected everything to lucene.apache.org... We need to fix this ASAP. I just checked out the Java Lucene docs in http://lucene.apache.org/java/. Now we just need to fix up the redirects, so that http://lucene.apache.org/ and http://jakarta.apache.org/lucene/ both redirect to http://lucene.apache.org/java/. How did you implement the redirect? It's not a meta redirect, so it must be in the web server configuration? How does one change that? It's worth looking at what the various search engines list for Lucene and making sure that those links are not broken: http://www.google.com/search?q=+site%3Aapache.org+lucene http://search.yahoo.com/search?p=lucene+site%3Aapache.org http://search.msn.com/results.aspx?q=site%3Aapache.org+lucene Doug
Re: lucene.apache.org
Erik Hatcher wrote: It also might be a good time to think about mailing list names. There was a request on infrastructure@ to move [EMAIL PROTECTED] to [EMAIL PROTECTED]; would it make more sense to move it to [EMAIL PROTECTED]? NOW you tell me :) I think until we have these elusive other languages in, we should stick with [EMAIL PROTECTED] We certainly want to have a cohesive Lucene community regardless of language, and dev@ makes sense to keep across all languages to me. I (respectfully) disagree. I don't think other Apache projects work that way. Sub-projects have their own development lists. Perhaps we should have new mailing lists for the top-level project, but the mailing list that replaces lucene-dev@jakarta.apache.org should be specific to Lucene Java. In general, nearly everything related to Jakarta Lucene should be moved to the Java sub-project of TLP Lucene. There may be some exceptions, but those should be the result of public deliberations. For example, Garrett suggested that the file format documentation might move to the top level. There's merit to that, but we should figure out how each port will describe what version of the file format it implements, whether it implements any extensions, etc., before we yank the file format documentation from the Java port. And we also want to try not to break URLs when we move things. For this reason it's best to move things as few times as possible, so that we don't end up with a confusing set of redirects. Doug
Re: lucene.apache.org
Doug Cutting wrote: And we also want to try not to break URLs when we move things. For this reason it's best to move things as few times as possible, so that we don't end up with a confusing set of redirects. More to the point, we also want to try not to break email addresses. So the fewer times we change them the fewer forwards we'll have to maintain. The new dev list should be [EMAIL PROTECTED], the new user list should be [EMAIL PROTECTED], etc. If folks don't like the moniker Lucene Java then we could consider different names, like Lucene4j or somesuch. Apache started out with just a single project, the web server. When other (now called top-level) projects were added, the web server was renamed Apache Server and was hosted at httpd.apache.org. Eventually the name evolved to Apache HTTP Server. We're in a similar situation. Lucene is both the top-level name and the flagship sub-project. It is the burden of the sub-project to rename itself. However, if we want to take more time to consider what to name Lucene Java, then we should back out the redirects and stay at http://www.jakarta.apache.org/lucene/ until we've picked a new name. Doug
Re: lucene.apache.org
Bernhard Messer wrote: Doug, you placed a copy of the website in the java directory. In both the original and the java directory, the api directory is missing. I can't copy it in because of the access rights :-( Argh. The group protection is 'lucene', as it should be, but you're not in 'lucene'. We need to fix that. Erik, can we please undo the redirects and roll back to http://jakarta.apache.org/lucene/ until we get lucene.apache.org fully set up? Thanks. I'm in a meeting for the next few hours... Doug
Re: lucene.apache.org
Erik Hatcher wrote: I've amended my request for e-mail lists here with Doug's preference: http://issues.apache.org/jira/browse/INFRA-195 Do others agree this is the best approach? I don't mean to be autocratic. Do we imagine different pools of users and developers for different Lucene sub-projects, or one big pool for all of them? I assume they'll be mostly disjoint. A new name now too? I don't really want to open that can of worms if we can help it. If folks are okay with Lucene Java then we're done. I mostly just meant to point out that we *are* coining a new name, so we should state that, agree on what it means, and start using it. I'm perfectly content with Lucene Java as a name for the project formerly known as Jakarta Lucene. So unless we hear vigorous objections, let's go with that. Doug
Re: Transactional Directories
Oscar Picasso wrote: Hi, I am currently implementing a Directory backed by a Berkeley DB that I am willing to release as an open source project. Besides the internal implementation, it differs from the one in the sandbox in that it is implemented with the Berkeley DB Java Edition. Using the Java Edition allows an easier distribution, as you just need to add a single jar to your classpath and you have a fully functional Berkeley DB embedded in your application without the hassle of installing the C Berkeley DB. While initially implemented with the Java Edition, this Directory can easily be ported to the Berkeley DB C edition or to Berkeley DB XML (for example, to use Berkeley DB XML + Lucene as the base for a document management system). This implementation works fine and I am quite happy with its speed. There is still an important problem I face, and it has to do with how to deal with some transactions. After all, the purpose of a Berkeley implementation, or a JDBC one for that matter, is its ability to use transactions. After looking at the Andy Varga code, it seems that the implementation in the sandbox faces the same problem (correct me if I am wrong). I have also learned that the JDBC directory was not implemented with transactions in mind. Here is the problem. If I do something like this:

-- case A --
begin transaction
  new IndexWriter
  create/update/delete objects in the database
  index.addDocument (related to the objects)
  indexWriter.close()
commit

Everything is fine. The operations are transactionally protected. You can even do many writes/updates. As long as everything is enclosed by the pairs begin-transaction/new-index-writer ... index-writer.close/commit, everything is properly undone in case any operation fails inside the transaction. For batch insertions the whole batch is rolled back, but at least your object database is consistent with the index. If you do mostly batch insertions and relatively few random individual insertions, that's fine. However, with a relatively high number of random insertions, the cost of the new IndexWriter / index.close() performed for each insertion is too high. Unfortunately this is a common case for some kinds of applications, and it is where a transactional directory would be the most useful. In such a case you would like to do something like this:

-- case B --
new IndexWriter
...
begin transaction-1
  create/update/delete objects in the database
  index.addDocument (related to the objects)
commit
...
begin transaction-2
  create/update/delete objects in the database
  index.addDocument (related to the objects)
commit
...
indexWriter.close()

The benefit would be to protect individual insertions while avoiding the cost of creating a new IndexWriter each time. It doesn't work, however. Here is my understanding. Suppose that in case B, transaction-1 fails and transaction-2 succeeds. In that case the underlying database system rolls back all the writes done during transaction-1, whether they were related to the objects stored in the database or to the index (the writes done to the IndexOutput are also undone). From the database point of view, consistency is maintained between the stored objects and the index. The problem is that after transaction-1, Lucene still 'remembers' the segment(s) it wrote during transaction-1. Later, Lucene might 'want' to perform some operation based on these references (on merging the segments, I think) while the underlying segment files do not exist anymore. This is where an Exception is thrown. The solution would be to instruct Lucene to 'forget' or undo any reference to the segments created during transaction-1 in case of rollback. I have noticed that references to the segments are stored in a segmentInfos map. I was thinking about removing the segmentInfos entries created during transaction-1 in case of a rollback, but I can't tell whether that's enough and/or potentially dangerous. I would really appreciate any comment about this idea and also about my understanding of the Lucene indexing process. If I/we could find a solution it would also benefit a JDBC Directory implementation. Thanks. Oscar P.S.: If and when my implementation is fully functional, is there a place in the Lucene project where I could release it? (Maybe the sandbox.)
Re: Transactional Directories
[ Please ignore my previous message. I somehow hit Send before typing anything! ] Oscar Picasso wrote: However, with a relatively high number of random insertions, the cost of the new IndexWriter / index.close() performed for each insertion is too high. Did you measure that? How much slower was it? Did you perform any profiling? Perhaps one could improve this by, e.g., disabling document index buffering, so that indexes are written directly to the final directory in this case, rather than first buffered in a RAMDirectory. Unfortunately this is a common case for some kinds of applications, and it is where a transactional directory would be the most useful. In such a case you would like to do something like this: -- case B -- [pseudo-code as in the previous message] The benefit would be to protect individual insertions while avoiding the cost of creating a new IndexWriter each time. It doesn't work, however. Here is my understanding. Suppose that in case B, transaction-1 fails and transaction-2 succeeds. So you've got multiple threads? Or are you proceeding in the face of exceptions? Otherwise I would expect that if transaction-1 fails then you'd avoid transaction-2, no? Also, you'd want to add a flush() call after each addDocument(), since document additions are buffered. But a flush() is just what IndexWriter.close() does, so then things would not be any faster than creating a new IndexWriter for each document. The bottom line is that there are optimizations to be made when batching additions. Lucene's API is designed to encourage batching, so that these optimizations may be used. If you don't batch, things will be somewhat slower. Doug
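The batch-friendly pattern amounts to one IndexWriter and one close() per batch rather than per document. A minimal sketch (names like 'docs' are assumptions supplied by the caller; this is not Oscar's transactional code):

import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class BatchAdder {
  // add a whole batch under a single writer; close() flushes it to the Directory
  public static void addBatch(Directory dir, Analyzer analyzer, List docs)
      throws IOException {
    IndexWriter writer = new IndexWriter(dir, analyzer, false);
    try {
      for (Iterator it = docs.iterator(); it.hasNext();) {
        writer.addDocument((Document) it.next()); // buffered, merged as needed
      }
    } finally {
      writer.close(); // flushes buffered documents and releases the write lock
    }
  }
}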
Re: Study Group (WAS Re: Normalized Scoring)
Paul Elschot wrote: I learned a lot by adding some javadocs to such classes. I suppose Doug added the "Expert" markings, but I don't know their precise purpose. The "Expert" declaration is meant to indicate that most users should not need to understand the feature. Lucene's API seeks to be both simple and flexible, but this is not always possible. When flexibility is added that is not part of the simple API, it is deemed expert. For example, we don't expect most users to need to write new Query implementations. So Query methods that are only used internally in query processing are marked "Expert", since they only need to be understood by those implementing a Query. Doug
Re: whither sandbox
Erik Hatcher wrote: Also, we should package a lucene-XX-all.zip/.tar.gz that includes all the contrib pieces, also allowing someone to simply download Lucene and all the packaged contrib pieces at once. I'll go further: that should be the only download. We should avoid having a bunch of different downloads. Ant used to require you to separately download the optional tasks, but that was a pain. Now they're included. So we will have at least:
lucene-XX.tar.gz
lucene-src-XX.tar.gz
But should we add the following?
lucene-contrib-XX.tar.gz
lucene-contrib-src-XX.tar.gz
Or should we just bundle these into the first two? I vote for bundling. There will still be separate jar files, so folks only have to deploy what they need. Download size is not an issue these days. Thoughts? Also, we should combine the javadoc into a single tree, with a Core group followed by a Contrib group: http://java.sun.com/j2se/1.4.2/docs/tooldocs/solaris/javadoc.html#group As an example, Nutch does this for Core and Plugin: http://www.nutch.org/docs/api/overview-summary.html Doug
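For reference, javadoc's -group option takes a heading and a colon-separated package list, so the combined tree could be produced with something along these lines (paths and package patterns are illustrative; a package lands in the first group it matches, hence Contrib listed first):

javadoc -d build/docs/api \
    -sourcepath src/java:contrib/src/java \
    -subpackages org.apache.lucene \
    -group "Contrib" "org.apache.lucene.ant*:org.apache.lucene.analysis.snowball*" \
    -group "Core" "org.apache.lucene*"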
Re: [PROPOSAL] Lucene to search.apache.org
Erik Hatcher wrote: Hmmm, good point. I hadn't considered access control. A migration will be performed later today, and I think it will initially be a test migration for me to verify. I'll double-check with Justin, who's doing the conversion, on how access control will be initially configured. Have a look at svn.apache.org:/x1/svn/asf-authorization. The way other projects do this is to have a project-pmc group that has access to /project, then have project-subproject groups that have access to /project/subproject. So I think we should start with the java code tree in /lucene/java and put current Lucene committers in the lucene-java group. We know we want to have subprojects (nutch, .net, etc.), so let's avoid having to re-org when we add the first one. Doug
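For concreteness, the asf-authorization entries would look roughly like this (Subversion authz syntax; the group memberships are placeholders):

[groups]
lucene-pmc = cutting
lucene-java = cutting, ehatcher

[/lucene]
@lucene-pmc = rw

[/lucene/java]
@lucene-java = rw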
Re: Fwd: [PROPOSAL] Lucene to search.apache.org
Erik Hatcher wrote: The decision was a bit slow to get out, but Lucene has been approved for TLP. Thanks for pushing this through! I propose we simply import our two CVS repositories, with all of jakarta-lucene as the root of the repository and jakarta-lucene-sandbox under sandbox in the root. We can shuffle things around once we get it all into svn using svn move nicely. Thoughts? I think we want Java Lucene to be a sub-project of Lucene. So the repository should be something like: https://svn.apache.org/repos/asf/lucene/java Then if we add dotLucene, it will go in something like: https://svn.apache.org/repos/asf/lucene/dot In each of these we'll have subdirectories named trunk, branches, tags, etc. Folks will generally check out and work on 'trunk'. We should also have a repository for the top-level project's website. This could be: https://svn.apache.org/repos/asf/lucene/site This does not need subdirectories. Doug
Re: [PROPOSAL] Lucene to search.apache.org
Erik Hatcher wrote: On Feb 1, 2005, at 3:13 PM, Doug Cutting wrote: I think we want Java Lucene to be a sub-project of Lucene. So the repository should be something like: https://svn.apache.org/repos/asf/lucene/java I already put in the request for this initial svn structure:
/asf/lucene
  /trunk/
  /sandbox/
  /branches/
  /tags/
svn move is an inexpensive and easy operation - so let's run with this structure to get our existing stuff in, and refactor it ourselves once we're in. Okay, if you like. Anyone on the Lucene PMC will be able to reorganize. But once we import the code we'll want to rearrange things before we give committers access. Non-PMC committers will generally only have access to subprojects, and some perhaps to the site. Changes to access must go through infrastructure. So we should re-org before we start adding committers, no? Doug
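The later shuffle would then be a cheap server-side move, e.g. (a sketch; exact paths per the structure above):

svn move -m "Move Java Lucene under /lucene/java" \
    https://svn.apache.org/repos/asf/lucene/trunk \
    https://svn.apache.org/repos/asf/lucene/java/trunk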
Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Doug Cutting wrote: It would translate a query "t1 t2" given fields f1 and f2 into something like:
+(f1:t1^b1 f2:t1^b2) +(f2:t1^b1 f2:t2^b2)
Oops. The first term on that line should be f1:t2, not f2:t1:
+(f1:t2^b1 f2:t2^b2) f1:"t1 t2"~s1^b3 f2:"t1 t2"~s2^b4
Doug
Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Chuck Williams wrote: That expansion is scalable, but it only accounts for proximity of all query terms together. E.g., it does not favor a match where t1 and t2 are close together while t3 is distant over a match where all 3 terms are distant. Worse, it would not favor a match with t1 and t2 in a short title and t2 and t3 proximal in the content (with no occurrence of t1 in the content) vs. a match with t1 and t2 in the title and t2 and t3 distant in the content. Right. I just mentioned this same weakness in a message replying to David. Is that distinct from my goal to develop an improved MultiFieldQueryParser for Lucene 2.0? Not distinct, but I think the first step is to decide on the expansion we want. Unless somebody has a better idea, I think the best solution is a new Query class that simultaneously supports multiple fields, term diversity and term proximity. It would be similar to SpanQuery, but generalized. It would be like BooleanQuery in the sense that individual query clauses could be required or not. Then, default-AND could be achieved by expanding queries to all-required. With this new Query class, revised versions of QueryParser and MultiFieldQueryParser would generate it. Am I way off-base somewhere, and/or is there a simpler approach to the same end? It just sounds like a lot to bite off at once. What did you think of my DensityPhraseQuery proposal? We could use this in place of a PhraseQuery w/ slop=infinity. We'd need just one per field. The straight boolean clauses are required for two reasons: 1. To make sure that every query term appears in some field; and 2. To reward a term that occurs frequently in a field but near no other query terms. Sure, idf is important enough to evaluate independently as a factor. However, I do not think these considerations are orthogonal. For example, I'm putting a lot of weight on field boosting and don't want the preference of title matches over body matches to be overwhelmed by the idfs. If field boosting then needs to trump idf, we should be able to deal with that when we subsequently tune field boosting, no? We can, e.g., square the field boosts if we need to. Doug
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Christoph Goller wrote: The similarity specified for the search has to be modified so that both idf(...) and queryNorm(...) always return 1, and, as you say, everything except for tf(term,doc)*docNorm(doc) could be precompiled into the boosts of the rewritten query. coord/tf/sloppyFreq computation would be done locally by the Searchables, as specified for this search. So the changes for the MultiSearcher bug would remain local to MultiSearcher. I think this would be a very clean solution. What do others think? This sounds good to me! Doug
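For concreteness, such a Similarity could be a one-off subclass like this (a sketch; the class name is invented, and it assumes the 1.4-era DefaultSimilarity as the base):

import org.apache.lucene.search.DefaultSimilarity;

public class PrecompiledIdfSimilarity extends DefaultSimilarity {
  // idf and queryNorm are folded into the boosts of the rewritten query,
  // so both are pinned to 1 here
  public float idf(int docFreq, int numDocs) { return 1.0f; }
  public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
  // tf(), sloppyFreq() and coord() keep their defaults and run locally
}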
Re: [PROPOSAL] Lucene to search.apache.org
Maybe we should just call it lucene.apache.org, and move the current Lucene project to lucene.apache.org/java? The other projects we imagine adding (Nutch, DotLucene, CLucene, etc.) are all Lucene-related, no? Lucene has a pretty good brand name... Doug Otis Gospodnetic wrote: ir.apache.org is what I was thinking, too. +1 for IR from me. It's broad enough to serve as a home for other related projects, not just the initial group of them. Otis --- Andrzej Bialecki [EMAIL PROTECTED] wrote: Scott Ganyo wrote: Not especially creative, but index.apache.org looks to be available. S On Jan 17, 2005, at 3:29 AM, Erik Hatcher wrote: Looks like we should consider alternate names. Suggestions?? ir.apache.org (not Infra-Red, but Information Retrieval) -- Best regards, Andrzej Bialecki (Information Retrieval, Semantic Web; Embedded Unix, System Integration) http://www.sigram.com Contact: info at sigram dot com
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Wolf Siberski wrote: Doug Cutting wrote: So, when a query is executed on a MultiSearcher of RemoteSearchables, the following remote calls are made: 1. RemoteSearchable.rewrite(Query) is called. After that step, are wildcards replaced by term lists? Yes. I haven't taken a look at the rewrite() methods. Could you explain to me what this step is doing, from a high-level perspective? I'm not sufficiently familiar with Lucene yet. Lucene has a few primitive query types: TermQuery, PhraseQuery, SpanQuery, and BooleanQuery. Other, derived query types (RangeQuery, FuzzyQuery, WildcardQuery) are rewritten into primitive queries before evaluation. Rewriting typically involves expanding the derived query into a BooleanQuery of TermQueries. 2. RemoteSearchable.docFreq(Term) is called for each term in the rewritten query while constructing a Weight. We could optimize this step by sending a list of terms and receiving the corresponding list of docFreqs. Yes. And this could be entirely hidden within the RemoteSearchable implementation. For example, the RPC made by its rewrite() implementation could also return the docFreq() of each term in the rewritten query, and these could be squirrelled away in a cache, which would then be accessed by the docFreq() method, so that only a single RPC is required to implement both rewrite() and all of the docFreq() calls. Doug
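To illustrate the rewrite step, a tiny standalone example (the field, term and index path are made up):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class RewriteDemo {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(args[0]); // path to some index
    Query derived = new WildcardQuery(new Term("title", "luc*"));
    Query primitive = derived.rewrite(reader); // typically a BooleanQuery of TermQuerys
    System.out.println(primitive.toString("title"));
    reader.close();
  }
}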
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Chuck Williams wrote: Doug Cutting wrote: It would indeed be nice to be able to short-circuit rewriting for queries where it is a no-op. Do you have a proposal for how this could be done? First, this gets into the other part of Bug 31841. I don't believe MultiSearcher.rewrite() is ever called. Rewriting is done in the Weights, which invoke the rewrite() method of the Searcher, which is always the Searcher invoked by the MultiSearcher, not the MultiSearcher itself. This would be fixed by the proposal under consideration. Weights would be constructed much earlier, using the top-level Searcher, so rewrites would use this too. In fact, MultiSearcher.rewrite() is broken. It requires Query.combine(), which is unsupported except for the derived queries (i.e., those for which rewriting is not a no-op). When I added topmostSearcher to get the Weights to call MultiSearcher.docFreq(), that also caused them to call MultiSearcher.rewrite(), which blows up on, for example, a simple TermQuery, because there is no TermQuery.combine(). That's why my patch contains a new default implementation for Query.combine() (which as noted in the bug report is probably not a good idea in general). So, I don't believe there is any valid rewrite() implementation for MultiSearcher to start from, unless I've completely misunderstood something. It looks like MultiSearcher.rewrite() was never implemented correctly, since it was never called -- a latent bug. It only needs to be called when queries are rewritten to something different:

public Query rewrite(Query original) throws IOException {
  Query[] queries = new Query[searchables.length];
  boolean changed = false;
  for (int i = 0; i < searchables.length; i++) {
    Query rewritten = searchables[i].rewrite(original);
    changed |= !rewritten.equals(original);
    queries[i] = rewritten;
  }
  if (changed) {
    return original.combine(queries);
  } else {
    return original;
  }
}

Then we'll need an implementation of combine() for all query types. The implementation for BooleanQuery is fairly simple: combine() each of the corresponding clauses. For TermQuery, PhraseQuery and SpanQuery, combine() should create a deduplicated OR. Derived queries already have an implementation. To address the question above, RemoteSearchable.rewrite() should be a no-op, i.e. always return this. For good error handling, it should verify that the query does not require rewriting. This requires some mechanism to determine whether or not a query requires rewriting. The challenge here is that some query types have a non-trivial rewrite() method not because they require rewriting, but because they might have subqueries that require rewriting (e.g., BooleanQuery). Other query types (e.g., MultiTermQuery) always require rewriting, while those that implement Weights never require it. I think an upward incompatibility is required in the API to address this. If that is acceptable, then this could work: 1. Add a new interface called Rewritable that specifies a boolean rewriteRequired() method. 2. Have Query implement Rewritable but NOT provide an implementation for rewriteRequired(). This will force all applications to add support for this in order to upgrade. 3. Change all the Weights to call Query.maybeRewrite() instead of Query.rewrite(). 4. Have Query.maybeRewrite() only call Query.rewrite() if Query.rewriteRequired() is true. 5. Have RemoteSearchable.maybeRewrite() throw an Exception if Query.rewriteRequired() is true. 6.
Implement rewriteRequired() for all the built-in Query types (which is either true for derived queries, false for primitive queries, or an OR of rewriteRequired() for all the subqueries). That sounds hairy. Why not just add a single new method: boolean Query.isRewritten() { return true; } Then override this in TermQuery, PhraseQuery and SpanQuery to return false, and in BooleanQuery to walk its clauses and return true iff any of them return true. As an optimization, RemoteSearchable could avoid calling rewrite() when this is true. Doug
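A toy rendering of that suggestion, just to fix the shape (standalone classes, not a patch against Lucene; the polarity follows the message above, where the base answer covers derived queries and the primitives override it):

abstract class ToyQuery {
  boolean isRewritten() { return true; } // default, inherited by derived queries
}

class ToyTermQuery extends ToyQuery {
  boolean isRewritten() { return false; } // primitive: nothing to rewrite
}

class ToyBooleanQuery extends ToyQuery {
  private final java.util.List clauses = new java.util.ArrayList();
  void add(ToyQuery q) { clauses.add(q); }
  boolean isRewritten() { // true iff any clause says so
    for (java.util.Iterator it = clauses.iterator(); it.hasNext();)
      if (((ToyQuery) it.next()).isRewritten())
        return true;
    return false;
  }
}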
Re: JDK code in the codebase
Erik Hatcher wrote: The questions still remain, though, and lawyers do want to know the answers: - How did JDK code get into Lucene's codebase to begin with? I put it there in a moment of ignorance way back as a hack in order to make things run in an older version of the JVM. http://cvs.sourceforge.net/viewcvs.py/lucene/lucene/com/lucene/util/Arrays.java?rev=1.1.1.1&view=auto - Is there any more lingering? Not to my knowledge. Sorry, my bad. Doug
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Chuck Williams wrote: I was thinking of the aggressive version with an index-time solution, although I don't know the Lucene architecture for distributed indexing and searching well enough to formulate the idea precisely. Conceptually, I'd like each server that owns a slice of the index in a distributed environment to have the complete docFreq data, i.e. to have docFreqs that represent the collection as a whole, not just its index slice. If this was achieved at index time, then the current implementation would work at query time. I.e., MultiSearcher could send the queries out to the remote Searchers, and these Searchers could consult their local indexes for the correct docFreqs to use. This is different than what I described. I described keeping a docFreq cache at the central dispatch node, while you describe replicating that cache on every search node. I don't see the advantage in this replication. It is both more efficient to maintain a single cache, and faster to search, since fewer dictionary lookups are involved. Doug
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Chuck Williams wrote: There needs to be a way to create the aggregate docFreq table and keep it current under incremental changes to the indices on the various remote nodes. I think you're getting ahead of yourself. Searchers are based on IndexReaders, and hence docFreqs don't change until a new Searcher is created. So long as this is true, and the central dispatch node uses a searcher, then a simple cache, perhaps one that is pre-fetched, is all that's feasible. It shouldn't take that long to pre-fetch the cache when indexes are re-opened. Let's run before we sprint; and hey, let's even walk first, by fixing the bug in question. Doug
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Wolf Siberski wrote: Chuck Williams wrote: This is a nice solution! By having MultiSearcher create the Weight, it can pass itself in as the searcher, thereby allowing the correct docFreq() method to be called. This is similar to what I tried to do with topmostSearcher, but a much better way to do it. This still wouldn't work for RemoteSearchables, except if you allow call-backs from each RemoteSearchable to the MultiSearcher. I don't see what callbacks are required. When the Weight is constructed it invokes docFreq for each term, which, if RemoteSearchables are involved, will result in IPC calls to those RemoteSearchables. Then, the Weight object is serialized to each RemoteSearchable and a TopDocs is returned. Where are the callbacks? These are only required for HitCollector-based methods, which are not advised with RemoteSearchable. For this, MultiSearcher would have to be remotely callable, too. A MultiSearcher can be made remotely callable by wrapping it in a RemoteSearchable, if that's required. But I'm not sure that's your concern here. As I said above, IMHO we should stay with a simple client/server model here. I think we would still have a simple model, unless I'm missing something. Doug
Re: what if the IndexReader crashes, after delete, before close.
Sigh. This stuff would get a lot simpler if we were able to use Java 1.4's FileLock. Then locks would be automatically cleared by the OS if the JVM crashes. Should we upgrade the JVM requirements to 1.4 for Lucene's 1.9/2.0 releases and update the locking code? Doug Luke Shannon wrote: Here is how I handle it. The Indexer is a Runnable. All the members it uses are static. The run() method calls a syncronized method called go(). This kicks off the indexing. Before you even get to here, the method in the CMS code that created the thread object and instaniated the index is also sychronized. Here is the code that handles the potential lock file that may be left behind from a Reader or Writer. Note: I found I had to check if the index existed before checking if it was locked. If I checked if it was locked and the index had not been created yet I got an error. //if we have gotten to hear that this is the only index running. //the index should not be locked. if it is the lock is stale //and must be released before we can continue try { if (index.exists() IndexReader.isLocked(indexFileLocation)) { Trace.ERROR(INDEX INFO: Had to clear a stale index lock); IndexReader.unlock(FSDirectory.getDirectory(index, false)); } } catch (IOException e3) { Trace.ERROR(INDEX ERROR: IMPORTANT. Was unable to clear a stale index lock: + e3); } HTH Luke - Original Message - From: Peter Veentjer - Anchor Men [EMAIL PROTECTED] To: Lucene Users List lucene-user@jakarta.apache.org Sent: Tuesday, January 11, 2005 3:24 AM Subject: RE: what if the IndexReader crashes, after delete, before close. -Oorspronkelijk bericht- Van: Luke Shannon [mailto:[EMAIL PROTECTED] Verzonden: maandag 10 januari 2005 15:46 Aan: Lucene Users List Onderwerp: Re: what if the IndexReader crashes, after delete, before close. One thing that will happen is the lock file will get left behind. This means when you start back up and try to create another Reader you will get a file lock error. I have figured out that part the hard way ;) Why can`t I access my index anymore?? Ahh.. The lock file Our system is threaded and synchronized. Thus when a Reader is being created I know it is the only one (the Writer comes after the reader has been closed). Before creating it I check if the Index is locked. If it is, I forcefully clear it. This prevents the above problem from happening. You can have more than 1 reader open at anytime. Even while a delete or add is in progress. But you can`t use a reader where documents are deleted (IndexReader) and added(IndexWriter) at the same time. If you don`t have other threads doing delete/add you won`t have to synchronize anything. And how do you synchronize on it? I have applied the ReadWriteLock From Doug Lea`s concurrency library after I have build my own synchronization brick and somebody pointed out that I was implementing the ReadWriteLock. But at the moment I don`t do any synchronization. And I want to have a component that is executed if the system is started and knows that to do if there is rubbish in the index directory. I want that component to restore my index to a usable version (and even small loss of information is acceptable because everything is checked once and a while. And user-added-information is going to be stored in the database. So nothing gets lost. The index can be rebuild.. Luke - Original Message - From: Peter Veentjer - Anchor Men [EMAIL PROTECTED] To: lucene-user@jakarta.apache.org Sent: Saturday, January 08, 2005 4:08 AM Subject: what if the IndexReader crashes, after delete, before close. 
What happens to the Index if the IndexReader crashes after I have deleted documents and before I have called close()? Are the deletes ignored? Is the Index screwed up? Is the filesystem screwed up (when a document is deleted, new delete-files appear), so are the delete-files still there (and can these be ignored the next time)? Can I restore the index to the previous state just by removing those delete-files? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
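A minimal sketch of the Java 1.4 behavior Doug refers to above: a java.nio FileLock is held by the process, so the OS releases it if the JVM crashes, unlike a bare lock file. Only the standard library is used; the file name and messages are illustrative, not Lucene code.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

public class FileLockDemo {
  public static void main(String[] args) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(new File("write.lock"), "rw");
    FileLock lock = raf.getChannel().tryLock(); // null if another process holds it
    if (lock == null) {
      System.out.println("index is locked by another process");
    } else {
      System.out.println("lock acquired; the OS releases it if this JVM dies");
      lock.release();
    }
    raf.close();
  }
}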
Re: what if the IndexReader crashes, after delete, before close.
Terry Steichen wrote: Would it be possible to optimize the operation to use 1.4 runtime features but retain the option, if desired, to run in a legacy (1.3) environment, perhaps in a degraded mode? Lucene 1.4.3 is a degraded mode, no? There are still back-compatibility issues. To be safe, Lucene 2.0 should still respect Lucene 1.x file locks. So FSDirectory's Lock.obtain() should fail if a lock file exists, unless it's a lock file written by Lucene 2.0 and java.nio.FileLock says it's unlocked. To implement this I guess we'd need to store a version number in the lock files. Does that sound right? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
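A rough sketch of what such a version-aware lock might look like, under the assumptions above: a 2.0-style lock file starts with a magic header, and anything without the header is treated as a 1.x lock and always respected. The class name, the MAGIC bytes, and all method names here are assumptions for illustration, not FSDirectory's actual code.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

class VersionedLock {
  private static final byte[] MAGIC = { 'L', 'U', 'C', '2' }; // assumed header
  private final File lockFile;
  private RandomAccessFile file;
  private FileLock osLock;

  VersionedLock(File lockFile) { this.lockFile = lockFile; }

  public synchronized boolean obtain() throws IOException {
    if (lockFile.exists() && !hasMagic(lockFile))
      return false;                        // a 1.x lock file: always respect it
    file = new RandomAccessFile(lockFile, "rw");
    osLock = file.getChannel().tryLock();  // a 2.0 lock: let the OS decide
    if (osLock == null) { file.close(); return false; }
    file.write(MAGIC);                     // mark this as a 2.0-style lock
    return true;
  }

  public synchronized void release() throws IOException {
    if (osLock != null) osLock.release();
    if (file != null) file.close();
    lockFile.delete();
  }

  private static boolean hasMagic(File f) throws IOException {
    RandomAccessFile in = new RandomAccessFile(f, "r");
    try {
      byte[] buf = new byte[MAGIC.length];
      if (in.read(buf) != MAGIC.length) return false;
      for (int i = 0; i < MAGIC.length; i++)
        if (buf[i] != MAGIC[i]) return false;
      return true;
    } finally {
      in.close();
    }
  }
}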
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Chuck Williams wrote: As Wolf does, I hope a committer with deep knowledge of Lucene's design in this area will weigh in on the issue and help to resolve it. The root of the bug is in MultiSearcher.search(). This should construct a Weight, weight the query, then score the now-weighted query. Here's a potential way to fix it: 1. Replace all of the ... search(Query, ...) methods in Searchable.java with ... search(Weight, ...) methods. 2. Add search(Query, ...) convenience methods to Searcher.java which do something like: public ... search(Query query, ...) { return search(query.weight(this), ...); } 3. Update the search() methods in IndexSearcher, MultiSearcher and RemoteSearchable to operate on Weights instead of queries. Does that make sense? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Chuck Williams wrote: This is a nice solution! By having MultiSearcher create the Weight, it can pass itself in as the searcher, thereby allowing the correct docFreq() method to be called. Glad to hear it at least makes sense... Now I hope it works! I'm still left wondering if having MultiSearcher query all the RemoteSearchables on every call to docFreq() within each TermQuery, PhraseQuery, SpanQuery and PhrasePrefixQuery is the way to go long term, although it seems like the best thing to do right now. The calls only happen when the Weights are created, so maybe it's not too bad. Longer term, it might be better to distribute the idf information out to the RemoteSearchables to minimize the required number of remote accesses for each Query. I'm not sure exactly what you mean by distributing the idf information out to the RemoteSearchables. I think one might profitably implement a docFreq() cache in RemoteSearchable. This could be a simple cache, or it could be fairly aggressive, pre-fetching all the docFreqs. (As an optimization, it could only pre-fetch those greater than 1, and, when a term is not in the cache, assume its docFreq is 1. As a lossy optimization, it could only pre-fetch those greater than N, and somehow estimate those not in the cache.) Is that what you meant? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
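A minimal sketch of the simple end of that caching spectrum: a wrapper that remembers docFreq() answers so repeated Weight creation does not make a remote round trip twice for the same term. The wrapper class is hypothetical; only Searchable.docFreq(Term) is real API, and a pre-fetching variant would populate the map up front instead.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Searchable;

class CachingDocFreq {
  private final Map cache = new HashMap(); // Term -> Integer
  private final Searchable remote;

  CachingDocFreq(Searchable remote) { this.remote = remote; }

  public synchronized int docFreq(Term term) throws IOException {
    Integer cached = (Integer) cache.get(term);
    if (cached != null)
      return cached.intValue();            // cache hit: no remote access
    int df = remote.docFreq(term);         // one remote call per distinct term
    cache.put(term, new Integer(df));
    return df;
  }
}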
Re: auto-filters?
markharw00d wrote: If we intend to make more use of filters this may be an appropriate time to raise a general question I have on their use. Is there a danger in tying them to a specific implementation (java.util.BitSet)? I do not object in principle to replacing BitSet with an interface, e.g. DocIdSet. Please feel free to submit a more detailed proposal for this. I think this is not so performance-intensive that an extra method call will be significant. If it is, then we can simply have Lucene's BitVector implement this interface directly. We must be careful to preserve the distinction between Filter and DocIdSet. Filters are query-independent and serializable, passed across the wire with RemoteSearchable requests. DocIdSets are query-dependent, should be computed and cached locally, and not passed over the wire. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
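For concreteness, a sketch of the smallest interface that would preserve that distinction. DocIdSet is an illustrative name here, not a committed API; the BitSet-backed adapter shows how existing Filter code could be bridged over.

interface DocIdSet {
  boolean get(int docId);  // is this document allowed through the filter?
}

// Adapter so a Filter's existing java.util.BitSet result can be wrapped.
class BitSetDocIdSet implements DocIdSet {
  private final java.util.BitSet bits;
  BitSetDocIdSet(java.util.BitSet bits) { this.bits = bits; }
  public boolean get(int docId) { return bits.get(docId); }
}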
Re: CFS file and file formats
Bernhard Messer wrote: Why not implement a small utility class, e.g. CompoundFileUtil.java, within the org.apache.lucene.index package? This class could be public and implement the necessary functionality. This is what I would prefer, because we don't have to change the visibility of CompoundFileReader or other parts of the API. The other option would be to add a public static method to the IndexReader class. But I don't like to overwhelm IndexReader with a method that just a very small audience would use. Currently IndexWriter is the only public place in the API where the compound format appears. So, until we decide to expose index formats more systematically, I think this should stay at the IndexReader level. Thus I would prefer a main() on IndexReader that had various commands, perhaps something like:

java org.apache.lucene.index.IndexReader dir cfs list
java org.apache.lucene.index.IndexReader dir cfs extract
java org.apache.lucene.index.IndexReader dir unlock
java org.apache.lucene.index.IndexReader dir list-segments

etc. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
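A sketch of how such a main() dispatch might start. The unlock branch uses only existing API (IndexReader.unlock() and FSDirectory.getDirectory()); the command grammar is just the proposal above, and the other commands are left unimplemented here.

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

class IndexReaderMain {
  public static void main(String[] args) throws IOException {
    String dir = args[0];
    String cmd = args[1];
    if ("unlock".equals(cmd)) {
      IndexReader.unlock(FSDirectory.getDirectory(dir, false));
    } else {
      // "cfs list", "cfs extract", "list-segments", ... would go here
      System.err.println("unknown command: " + cmd);
    }
  }
}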
auto-filters?
Filters are more efficient than query terms for many things. For example, a RangeFilter is usually more efficient than a RangeQuery and has no risk of triggering BooleanQuery.TooManyClauses. And Filter caching (e.g., with CachingWrapperFilter) can make otherwise expensive clauses almost free, after the first time. But filters are not obvious. Many Lucene applications that would benefit from them do not use them. Wouldn't it be better if we could automatically spot Query clauses which are amenable to filter-conversion? Then applications would just get faster and throw fewer exceptions, without having to know anything about filters. From a user level I think this might work as follows: 1. Query clauses which have a boost of 0.0 are candidates for filter conversion, since they cannot contribute to the score. We should perhaps make boost=0 the default for certain classes of query (e.g., perhaps RangeQuery) or make subclasses with this as the default (KeywordQuery). 2. One should be able to specify a filter cache size per IndexSearcher, with the notion that each filter cached uses one bit per document. I'm not yet clear how this should be implemented. It might be based on something like:

public interface DocIdCollector {
  void collectDocId(int docId);
}

/** Collects all DocIds that match the query. DocIds are collected in
 * no particular order and may be collected more than once. Returns
 * true if this feature is supported, false otherwise. */
public boolean Query.getFilterBits(IndexReader, DocIdCollector);

Implementing this for various query classes is straightforward. TermQuery might return false for all but very common terms (occurring in, e.g., greater than 10% of documents). RangeQuery would use the logic that's currently in RangeFilter. Etc. BooleanScorer could then use this method to create a filter bit-vector for all of the boost=0.0 clauses, then use that to filter the other boost!=0 clauses. The bit vectors could be cached in the scorer (using a LinkedHashMap), although I'm a little fuzzy on exactly how the cache API would work. I'm not convinced the above is the best design, but I am convinced Lucene needs a solution for this. It could automatically eliminate most causes of BooleanQuery.TooManyClauses (e.g., from date ranges), and also make many required keyword clauses (document type, language, etc.) much faster. What do others think? Does anyone have a better design or improvements to what I describe? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
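To make the proposal concrete, here is what getFilterBits() might look like for a single-term clause, written against the existing TermDocs API. The DocIdCollector interface is the hypothetical one from the message above; nothing here is committed Lucene code.

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

class TermFilterBits {
  interface DocIdCollector { void collectDocId(int docId); }

  public static boolean getFilterBits(IndexReader reader, Term term,
                                      DocIdCollector collector) throws IOException {
    TermDocs termDocs = reader.termDocs(term);
    try {
      while (termDocs.next())
        collector.collectDocId(termDocs.doc()); // order does not matter
    } finally {
      termDocs.close();
    }
    return true; // this clause supports filter conversion
  }
}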
Re: DefaultSimilarity 2.0?
Chuck Williams wrote: Finally, I'd suggest picking content that has multiple fields and allow the individual implementations to decide how to search these fields -- just title and body would be enough. I would like to use my MaxDisjunctionQuery and see how it compares to other approaches (e.g., the default MultiFieldQueryParser, assuming somebody uses that in this test). I think that would be a good contest too, but I'd rather first just focus on the ranking of single-field queries. There are a number of issues that come up with multi-field queries that I'd rather postpone in order to reduce the number of variables we test at one time. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Migration to SVN?
Garrett Rooney wrote: The least effort way of doing that would be to include both the core and sandbox under the same trunk, but again, that implies that you ALWAYS tag and branch them together, and sometimes you may not want to do that. I think we should always branch these together. To my thinking, the distinction between core and sandbox is primarily one of packaging: the core should be a separate jar, as should each of the sandbox elements. But all should be released and tested as a unit, to ensure compatibility. I think the term sandbox is misleading and has outlived its usefulness. We should probably rename this to something like utils or optional. These should be treated much like Ant's optional tasks: package them as separate jars, segregate their documentation, but don't branch them separately. Perhaps we should also make it so that a failed sandbox build or unit test does not stop a build: the quality guarantee need not be as high for sandbox items. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
DefaultSimilarity 2.0?
Chuck Williams wrote: Another issue will likely be the tf() and idf() computations. I have a similar desired relevance ranking and was not getting what I wanted due to the idf() term dominating the score. [ ... ] Chuck has made a series of criticisms of the DefaultSimilarity implementation. Unfortunately it is difficult to quickly evaluate these, as it requires relevance judgements. But, still, we should consider modifying DefaultSimilarity for the 2.0 release if there are easy improvements to be had. But how do we decide what's better? Perhaps we should perform a formal or semi-formal evaluation of various Similarity implementations on a reference collection. For example, for a formal evaluation we might use one of the TREC Web collections, which have associated queries and relevance judgements. Or, less formally, we could use a crawl of the ~5M pages in DMOZ (I would be glad to collect these using Nutch). This could work as follows: -- Different folks could download and index a reference collection, offering demonstration search systems. We would provide complete code. These would differ only in their Similarity implementation. All implementations would use the same Analyzer and search only a single field. -- These folks could then announce their candidate implementations and let others run queries against them, via HTTP. Different Similarity implementations could thus be publicly and interactively compared. -- Hopefully a consensus, or at least a healthy majority, would agree on which was the best implementation and we could make that the default for Lucene 2.0. Are there folks (e.g., Chuck) who would be willing to play this game? Should we make it more formal, using, e.g., TREC? Does anyone have other ideas how we should decide how to modify DefaultSimilarity? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Explanations and overridden similarity
Dan Climan wrote: Shouldn't the call to Similarity.decodeNorm be replaced with a call to Similarity.getDefault().decodeNorm? decodeNorm is a static method. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: potential new Lucene logo
Murray Altheim wrote: I thought I'd have a go at the Lucene logo, not to change it markedly but to clean it up so that it is based on an existing font. This potential Lucene logo is based on an ITC font called Magneto Bold Extended, which you can see here: http://www.identifont.com/show?72W I modified the 'c' slightly because at small sizes it starts looking too much like an 'e', especially since in the Lucene logo the baseline is extended across the entire logo. I experimented with several border thicknesses, settling on one that was visible at small sizes but not too thick at larger sizes. Here's a sample of the result: http://www.altheim.com/murray/img/lucene-20b-320w.jpg Thanks! This looks nice to me. I've posted a zip file containing a number of sizes plus the originals, which are in SVG and PNG: http://www.altheim.com/murray/img/lucene_logo.zip (198K) The SVG file is the source image, and after conversion into a raster format I did a bit of hand cleaning to end up with the PNG, which is a 9394x961px image at 72dpi. Being available as a very large PNG, it can then be used for T-shirts, etc. I consider the PNG image the master, since it's had some cleanup in evening out lines and curves, etc. I just put the original, scalable artwork for the existing logo at: http://jakarta.apache.org/lucene/lucene.eps I have used the Gimp in the past to generate high-resolution PNG files from this when needed. The one thing about the logo (either the existing one or the one I've done) is that neither does too well when shrunk small. The source PNG can be reduced to any size, but after reduction to small sizes it often needs some hand cleanup. This could be fixed by no longer using an outline around the font, but I didn't want to take that kind of liberty with the design, especially since then it would be a single-colour font. I kinda like the current design, as it reminds me of a logo from a recreational vehicle (camper, caravan, etc.). The design was originally donated by Jeff Boozer and Joy Busse in April of 1998. I asked for something that looked like 60's refrigerator chrome. I realize that sometimes people feel (understandably) proprietary about a given image, and I don't mean to push this image on anyone. If the group wants to use it, I'm fine with that, and release any rights on its use. I'll leave the zip file online for several weeks, then it will be removed. I don't feel too proprietary. Do folks prefer Murray's reworked Lucene logo or the original logo? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: setLowercaseWildcardTerms and FuzzyQueries
Daniel Naber wrote: I'm aware that the Wildcard name won't fit well anymore, suggestions for a better name are welcome. Expanded? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Boolean Scorer
Christoph Goller wrote: I think we should change BooleanScorer. An easy way would be to sort the bucket list before it is used. Do you think that would affect performance dramatically? I think it would make it slower. Otherwise we should reimplement BooleanScorer. I haven't looked into the DisjunctionScorer patch in Bugzilla yet. Maybe it's a good starting point. I think we should incorporate Paul's code into CVS. This algorithm may be slower in some cases, but it may also be faster in some cases. We should add a static method to switch back to the old implementation, and encourage folks to benchmark their code. If it proves no slower then we could remove the old implementation altogether. What do others think? Paul's code is in: http://issues.apache.org/bugzilla/show_bug.cgi?id=31785 Has anyone tried this? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Release 1.4.3
Christoph Goller wrote: Doug, could you please move api/ to api.old/ and api.new/ to api/ Done. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Release 1.4.3
Christoph Goller wrote: I think I should finally make Release 1.4.3. Great! I presume default.properties no longer exists. I'll just fill in 1.4.3 as the version in build.xml before building it. Is this ok? I build releases with something like: ant -Dversion=1.4.3 clean dist So that it doesn't matter what version is in build.xml. So you shouldn't need to change build.xml for this release. I think there is less confusion if, when folks build Lucene themselves, it does not, by default, have the same name as a released version. Thus if they patch things and do not update build.xml (which is likely), the generated jar files will have rc1-dev, clearly identifying them as a non-released version. Releases (binaries and sources) are no longer on www.apache.org under /www/jakarta.apache.org/builds/jakarta-lucene/release/. Only the web page and the documentation (Javadoc) are there. Instead they are on cvs.apache.org under /www/cvs.apache.org/dist/jakarta/lucene. Is this correct? Yes. Release directories should now be made under www.apache.org:/www/cvs.apache.org/dist/jakarta/lucene/. Two other things that are not in the wiki instructions: 1. copy the lucene jar into the distribution too: cp build/lucene-X.X.jar dist 2. compute MD5 sums: (cd dist; md5sum lucene* > MD5.txt) If you have time, please update the wiki instructions too. Thanks! Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: GIS
Guillermo Payet wrote: The fact that Lucene stores and indexes (or so it seems) all terms as Strings, and that there is no NumericTerm, makes me think that I might be missing something and that this might be a much bigger deal than I think? You could write a HitCollector that uses FieldCache.getFloats("latitude") and FieldCache.getFloats("longitude") to efficiently look up the latitude and longitude of each textual match. Then combine the distance score with the text score. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
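A sketch of that approach. FieldCache.getFloats() is existing API; the field names, the distance formula, and the way distance is folded into the score are illustrative choices, and a real implementation would keep the top-N hits, e.g. in a bounded priority queue.

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;

class DistanceHitCollector extends HitCollector {
  private final float[] lats;
  private final float[] lons;
  private final float targetLat, targetLon;

  DistanceHitCollector(IndexReader reader, float targetLat, float targetLon)
      throws IOException {
    this.lats = FieldCache.DEFAULT.getFloats(reader, "latitude");  // assumed field
    this.lons = FieldCache.DEFAULT.getFloats(reader, "longitude"); // assumed field
    this.targetLat = targetLat;
    this.targetLon = targetLon;
  }

  public void collect(int doc, float textScore) {
    float dLat = lats[doc] - targetLat;
    float dLon = lons[doc] - targetLon;
    float dist = (float) Math.sqrt(dLat * dLat + dLon * dLon);
    float combined = textScore / (1.0f + dist); // one possible combination
    // ... record (doc, combined), e.g. in a bounded priority queue ...
  }
}

It would then be passed to Searcher.search(query, collector) in place of using Hits.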
Re: FuzzyQuery prefix length
Erik Hatcher wrote: On Oct 20, 2004, at 12:14 PM, Doug Cutting wrote: The advantages of a zero-character prefix default are that it's back-compatible and that it will find more matches, when spelling differences are in the first characters. I prefer this default. Anyone using QueryParser needs to be aware of the issues of exposing fuzzy queries, range queries, and any other types the syntax supports. It would not be Lucene's fault if a system with millions of documents is exposed through QueryParser and fuzzy queries take a bit longer or throw a TooManyClauses exception. I am clearly outvoted. I still disagree, but will not veto this. My last words on the topic (I promise!): In designing Lucene I tried hard to only add features that were scalable. For example, one could easily implement a RegexQuery that scans the text of stored fields, returning those which match a regex. This would provide grep-like functionality, which some folks might find useful. But it would not be scalable. If someone contributed such a thing I would lobby against permitting its use from QueryParser in the default configuration. The query parser already requires an initial character before a wildcard, in order to make this operator more scalable. I don't see why fuzzy queries should be treated differently, and why we permit such a huge scalability hole in the default configuration. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Normalized Scoring -- was RE: idf and explain(), was Re: Search and Scoring
Chuck Williams wrote: However, I'm not sure this analysis is completely correct due to MultiSearcher.docFreq(), which appears to be trying to redefine the tf's to be the global value across all indices. It wasn't clear to me how this code is ever reached, e.g. from TermQuery -> SegmentTermDocs. If the tf's and idf's are in fact computed globally, then the interleaving should work as it is, thus I'm guessing they are not. Idf's are already computed globally across all indexes. Tf's are local to the document. In short, scores from a MultiSearcher are the same as when searching an IndexReader with the same documents. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Retrieving Document Boosts
Dan Climan wrote:

TermEnum terms = ir.terms();
int numTerms = 0;
while (terms.next()) {
  Term t = terms.term();
  if (t.field().equals("FullText"))
    numTerms++;
}
double lengthNorm = 1.0 / Math.sqrt(numTerms); // since lengthNorm was defined as 1/sqrt(numTerms) by default

The numTerms is not the number of unique words in the collection, but rather the number of tokens in the document in question. So, if you want to re-create this externally you could re-tokenize the text for the field and count the tokens. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
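A sketch of that re-tokenization, assuming StandardAnalyzer was the analyzer used at index time (use whatever Analyzer actually built the field). Note that the stored norm is byte-encoded, so the recreated value will only approximate what decodeNorm() returns.

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

class LengthNormEstimate {
  public static double lengthNorm(String fieldText) throws IOException {
    Analyzer analyzer = new StandardAnalyzer();  // assumption: the index analyzer
    TokenStream tokens =
        analyzer.tokenStream("FullText", new StringReader(fieldText));
    int numTokens = 0;
    while (tokens.next() != null)                // count tokens, not unique terms
      numTokens++;
    tokens.close();
    return 1.0 / Math.sqrt(numTokens);           // the default lengthNorm
  }
}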
Re: lucene and large (2GB+) indexes using RAMDirectory
Jonathan Hager wrote: Nate Denning encountered the following error when trying to load a large (greater than 2147483647 bytes) index into a RAMDirectory. The server has 12GB of memory, so loading it into memory should not be a problem. Have you instead tried copying the index to a ramfs ('mount -t ramfs'), then opening it with a normal FSDirectory? This forces the entire index into RAM without forcing it into Java's heap. In my experience, huge Java heaps are problematic. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: API cleanup for Field and future cleanup for IndexReader
Bernhard Messer wrote: Christoph Goller wrote: Bernhard Messer wrote: Currently there are 3 different methods available to get the field names from an index. a) getFieldNames(); b) getFieldNames(boolean indexed); c) getIndexedFieldNames(boolean storedTermVector); my proposal is to deprecate a), b) and c) and add one new method which can handle all the possible options. +1 +1 Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: idf and explain(), was Re: Search and Scoring
Chuck Williams wrote: That's a good point on how the standard vector space inner product similarity measure does imply that the idf is squared relative to the document tf. Even having been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anybody has done precision/recall or other empirical relevancy measurements comparing this vs. a model that does not square idf? No, not that I know of. Regarding normalization, the normalization in Hits does not have very nice properties. Due to the 1.0 threshold check, it loses information, and it arbitrarily defines the highest scoring result in any list that generates scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if 0.6 meant the same quality of retrieval independent of what search was done. This would allow, for example, sites to use a simple quality threshold to only show results that were good enough. At my last company (I was President and head of engineering for InQuira), we found this to be important to many customers. If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java. The standard vector space similarity measure includes normalization by the product of the norms of the vectors, i.e.: score(d,q) = sum over t of ( weight(t,q) * weight(t,d) ) / sqrt[ (sum over t of weight(t,q)^2) * (sum over t of weight(t,d)^2) ] This makes the score a cosine, which, since the values are all positive, forces it to be in [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that? The quantity 'sum over t of weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' depends on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization). Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization. In order to normalize scores to 1.0 one must know the maximum score. Explain only computes the score for a single document, and the maximum score is not known. Although I think it would be best to have a normalization that would render scores comparable across searches. Then please submit a patch. Lucene doesn't change on its own. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Propose Bernhard as committer
+1 Christoph Goller wrote: I would like to propose Bernhard as Lucene committer. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: FuzzyQuery prefix length
Daniel Naber wrote: On Tuesday 12 October 2004 17:22, Doug Cutting wrote: Which is worse: a person who searches for Photokopie~ in a 1000-document collection does not find documents containing Fotokopie; or a person who searches for Photokopie~ in a 1M-document collection doesn't find anything because it takes too long. I think some relevant results are better than none. I disagree, as the user who doesn't get the Fotokopie matches will not understand what's going on. He will assume that there are no such documents, which is wrong. I disagree. For someone to assume that, they would need a detailed understanding of how ~ works. Such a person would likely also know whether initial characters are considered in the operation of ~. Most users who use ~ would probably use it when they're uncertain of spelling, without a detailed understanding of how it works, and, most of the time, it will help them. If there's a timeout the user will at least notice something is wrong. Besides that, it's the developer's responsibility to get things fast enough. We're talking about the appropriate default. Defaults are used by unsophisticated developers. A system deployed by an unsophisticated developer should not suffer from erratic timeouts. Users using the standard query syntax should enjoy a reasonable experience on multi-million document collections without having to tweak things. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: FuzzyQuery prefix length
Daniel Naber wrote: Searching for Photokopie~ on a 230,000 document corpus takes 2.3 seconds here (AMD Athlon 2600+; other fuzzy terms get similar performance). As the number of terms doesn't increase so fast with more documents, it will not take 10 seconds for 1 million documents. So fuzzy search isn't *that* slow. How long do non-fuzzy queries take? What is the ratio? How about a query with multiple fuzzy terms? If someone launches a service but fails to test it with fuzzy queries, will they be subject to inadvertent denial-of-service when a user starts using fuzzy queries? Web-based search is particularly vulnerable. If a query takes a few seconds and the user hits his browser's STOP and RELOAD buttons, the first query keeps running on the server. This is not an imaginary problem. I have worked with several clients who have run into this in deployed applications. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: What's the purpose of hashing docid in BooleanScorer
Christoph Goller wrote: With the current scorer API one could get rid of the bucket table and advance all subscorers by only one document each time. I am not sure whether the bucket-table implementation is really much more efficient. I only see the advantage of inlining some of the scorer.next() and scorer.score() code. Indeed, sub-scorers could be, e.g., kept in a priority queue. This is done in ConjunctionScorer, PhraseScorer, etc. However this adds a priority queue update to the inner search loop. With long queries and with common terms this overhead can be significant. With short queries and/or with rare terms the current bucket-table-based implementation may indeed be slower, but I believe with longer queries containing common terms it is substantially faster. This algorithm is described in: http://lucene.sourceforge.net/papers/riao97.ps If we had a priority-queue-based implementation then we could benchmark these. If we found that one were faster than the other for particular classes of queries then we could have a query optimizer which automatically selects the most efficient implementation... Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: What's the purpose of hashing docid in BooleanScorer; DisjunctionScorer
Paul Elschot wrote: I have a DisjunctionScorer based on a PriorityQueue lying around, but I can't benchmark it myself at the moment. In case there is interest, I'll gladly adapt it to org.apache.lucene.search and add it in bugzilla. This should look a lot like SpanOrQuery.getSpans(). On a related note, I implemented ConjunctionScorer using Java's collection classes rather than a Lucene priority queue, just to see if I could. It turns out to have to allocate memory in sortScorers() which makes it slower than it could be, but I have not yet gotten around to fixing it. I'd like to re-write this to look like PhraseScorer and NearSpans, which operate without allocation. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Contribution: better multi-field searching
Paul Elschot wrote: Did you see my IDF question at the bottom of the original note? I'm really curious why the square of IDF is used for Term and Phrase queries, rather than just IDF. It seems like it might be a bug? I missed that. It has been discussed recently, but I don't remember the outcome; perhaps someone else does? This has indeed been discussed before. Lucene computes a dot-product of a query vector and each document vector. Weights in both vectors are normalized tf*idf, i.e., (tf*idf)/length. The dot product of vectors d and q is: score(d,q) = sum over t of ( weight(t,q) * weight(t,d) ) Given this formulation, and the use of tf*idf weights, each component of the sum has an idf^2 factor. That's just the way it works with dot products of tf*idf/length vectors. It's not a bug. If folks don't like it they can simply override Similarity.idf() to return sqrt(super()), as sketched below. If someone can demonstrate that an alternate formulation produces superior results for most applications, then we should of course change the default implementation. But just noting that there's a factor which is equal to idf^2 in each element of the sum does not do this. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
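The override mentioned above, spelled out. Similarity.idf(int,int) on DefaultSimilarity is the real extension point; whether damping idf this way actually improves ranking is exactly the open question.

import org.apache.lucene.search.DefaultSimilarity;

// Each scored term carries idf(t)^2; taking sqrt here makes the net factor idf(t).
class SqrtIdfSimilarity extends DefaultSimilarity {
  public float idf(int docFreq, int numDocs) {
    return (float) Math.sqrt(super.idf(docFreq, numDocs));
  }
}

It can be installed globally with Similarity.setDefault(new SqrtIdfSimilarity()), or per searcher and writer via their setSimilarity() methods.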
Re: Search and Scoring
Chuck Williams wrote: I think there are at least two bugs here: 1. idf should not be squared. I discussed this in a separate message. It's not a bug. 2. explain() should explain the actual reported score(). This is mostly a documentation bug in Hits. The normalization of scores to 1.0 is performed only by Hits. Hits is a high-level wrapper on the lower-level HitCollector-based search implementations, which do not perform this normalization. We should probably document that Hits scores are so normalized. Also, we could add a method to disable this normalization in Hits. The normalization was added long ago because many folks found it disconcerting when scores were greater than 1.0. We should not attempt to normalize scores reported by explain(). The intended use of explain() is to compare its output against other calls to explain(), in order to understand how one document scores higher than another. Scores don't make much sense in isolation, and neither do explanations. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Contribution: better multi-field searching
Chuck Williams wrote: The issue is this. Imagine you have two fields, title and document, both of which you want to search with simple queries like: albino elephant. There are two general approaches, either a) create a combined field that concatenates the two individual fields, or b) expand the simple query into a BooleanQuery that searches for each term in both fields. With approach a), you lose the flexibility to set separate boost factors on the individual fields. I wanted title to be much more important than description for ranking results, and wanted to control this explicitly, as length norm was not always doing the right thing; e.g., descriptions are not always long. With approach b) you run into another problem. Suppose the example query is expanded into (title:albino description:albino title:elephant description:elephant). Then, assuming tf/idf doesn't affect ranking, a document with albino in both title and description will score the same as a document with albino in title and elephant in description. The latter document for most applications is much better since it matches both query terms. If albino is the more important term according to idf, then the less desirable documents (albino in both fields) will rank consistently ahead of the albino elephants (which is what was happening to me, yielding horrible results). Another way to handle this would be to generate a query like: title:(albino elephant) description:(albino elephant) In this case the coord factor would boost titles and descriptions which contained both terms. You may or may not want to disable the coord factor for the outer query, which can be done with:

BooleanQuery title = new BooleanQuery();
title.add(new TermQuery(new Term("title", "albino")), false, false);
title.add(new TermQuery(new Term("title", "elephant")), false, false);

BooleanQuery desc = new BooleanQuery();
desc.add(new TermQuery(new Term("desc", "albino")), false, false);
desc.add(new TermQuery(new Term("desc", "elephant")), false, false);

BooleanQuery outer = new BooleanQuery() {
  public Similarity getSimilarity(Searcher searcher) {
    return new DefaultSimilarity() {
      public float coord(int overlap, int maxOverlap) {
        return 1.0f;
      }
    };
  }
};
outer.add(title, false, false);
outer.add(desc, false, false);

In general, doesn't coord() handle this situation? Also, you can separately boost title and desc here, if you like: title:(albino elephant)^4.0 description:(albino elephant) or title.setBoost(4.0f); Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: IndexInput GCJ
Andi Vajda wrote: This code is generated by JavaCC. I think the best way to fix this would be to fix up the code automatically whenever it is regenerated. So, instead of patching QueryParser.java, patch build.xml. In the javacc-QueryParser task, add a replace task which replaces 'jj_la1_0()' with 'jj_la1_0_method()'. That is a brittle kludge, as the code that needs to be changed may vary every time the parser is re-generated. There used to be two such methods in 1.4.1, for instance. The proper way to work around this issue is to fix javacc to not generate such Java code in the first place. It is indeed a brittle kludge. Are you willing to submit a bug report to JavaCC? https://javacc.dev.java.net/servlets/ProjectIssues Or even a patch? https://javacc.dev.java.net/source/browse/javacc/ Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Contribution: better multi-field searching
Chuck Williams wrote: That approach does not work. I could not find an approach that would work with the built-in classes, although of course there might be one. The problem has two components: coord and the fact that BooleanQuery's sum their clause scores to compute the final score. The latter is not easily overridden. Specifically, title:(albino elephant)^4 description:(albino elephant) still has the problem that a result with albino in the title and albino in the description gets the same score as a result with albino in the title and elephant in the description. Perhaps I misunderstood what you desire. You want a reward for albino and elephant both occurring in the document, regardless of field? If so, then what you'd want is: (title:albino description:albino) (title:elephant description:elephant) with coord disabled on the *inner* queries, no? This way coord would explicitly boost documents which matched on both terms. FYI, MaxDisjunctionQuery has made an enormous improvement in the quality of my query results, and I have strong reason to believe the same would be true in most other domains (more on that coming in the idf^2 discussion). In terms of the albino elephant example, the query above was putting all the albino animals except elephants above the albino elephants, while the query with an outer BooleanQuery and inner MaxDisjunctionQuery's ( (title:albino^4 | description:albino)~0.1 (title:elephant^4 | description:elephant)~0.1 ) properly puts the albino elephants on top. If albino is outscoring elephant then you could either reduce the impact of idf or increase the impact of coordination. Did you try, e.g., defining coord as (overlap/max)^2 or somesuch? Or, perhaps take proximity into account, with "albino elephant"~10? Or simply using AND instead of OR? These days most web search engines use AND as the default operator and reward for proximity. Is that wrong for your application? AND is effectively a coord of (overlap/max)^infinity. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: documentation in fileformats.html
Daniel Naber wrote: The web page is updated now, could you please re-check if it's correct? I added that information so that the Lucene <= 1.4 format is still there. We should note that when compression is enabled, gzip is used. Also, byte[] is not a type defined in the file format. In the formalism used in fileformats.html, this should be:

Value -> String | BinaryValue (depending on Bits)
BinaryValue -> ValueSize, Byte^ValueSize
ValueSize -> VInt

Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: FuzzyQuery prefix length
Daniel Naber wrote: -It is the only change so far that we cannot express in the API, i.e. we cannot just deprecate a method to make Lucene's users aware of this. So we can only list it in CHANGES.txt, where some people will surely miss it. We could define a new query parser class with the new behaviour and deprecate the old query parser. I am not advocating this, merely noting that it is possible to make this change back-compatibly. If we agree that this change does make Lucene better (and I'm not sure we do) then we should make the change, no? Back-compatibility is a good thing, but, with a major release, should quality suffer because of back-compatibility issues? I hope not. Rather we should take the opportunity of a major release to make Lucene as good as we can. -There are words in German like Photokopie/Fotokopie which have the same meaning and a very similar spelling, so people will expect a FuzzyQuery to match such words. But as the difference is in the first two characters it won't be found with the default. -People whose index is just 1000 documents large will probably not notice a difference in speed, but they might see a difference in quality (see above). Why should these people change the default instead of those with a 10-million-document index? Which is worse: a person who searches for Photokopie~ in a 1000-document collection does not find documents containing Fotokopie; or a person who searches for Photokopie~ in a 1M-document collection doesn't find anything because it takes too long. I think some relevant results are better than none. Classes of queries which take orders of magnitude longer than others are a problem. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: QueryParser and backwards-compatibility
Christoph Goller wrote: Since 1.4.2 is already out, we would have to make a version 1.4.3. OK, one more vote needed :-) I'm okay with a 1.4.3 release for this. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: PhrasePrefixQuery - MultiPhraseQuery
Daniel Naber wrote: I copied PhrasePrefixQuery to MultiPhraseQuery, deprecating PhrasePrefixQuery. The wiki also suggests making MultipleTermPositions a private nested class. However, it is currently public, so I wonder whether we can just remove/deprecate it without offering an alternative. Any opinions? I don't feel too strongly about this. I doubt anyone uses this class directly, so we could just remove it and wait until we make a 1.9 RC, and see if anyone complains then. Or we could just leave it public and improve its javadoc. It is well-named and well-implemented, and may be useful for other things, although I can't think what they are... What do others think? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: FuzzyQuery prefix length
Daniel Naber wrote: I agree that the default should stay 0, even for Lucene 2.0. It should certainly stay zero for 1.4.x releases. However, 2.0 is our opportunity to make incompatible changes. What is the best default for this, that will work well for the most applications? Does anyone have fuzzy-query benchmarks for, e.g., ~1M document indexes, where each document contains a few kB of text? Ideally with such indexes, even complex queries should take less than a second, no? How long does a fuzzy query take? And how much does a prefix of zero, one, or two change that? Queries that take much longer than a second are considerably less usable. I think the default should provide good usability for indexes of at least 1M documents. Another thing to examine is how different the generated terms are with different prefixes. One could randomly select some words from an index and compute the average amount that a prefix of one or two changes the end results. My guess is that the changes are small. Since fuzzy search is a heuristic, not an exact computation, good approximations are fair play. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/queryParser QueryParser.java QueryParser.jj
[EMAIL PROTECTED] wrote: goller 2004/10/11 06:36:14 Modified: src/java/org/apache/lucene/queryParser Tag: lucene_1_4_2_dev QueryParser.java QueryParser.jj [ ... ] + * @deprecated use {@link #getFieldQuery(String, String)} Should these be deprecated in 1.4.3? I don't think so. They should be deprecated in 1.9 and removed in 2.0, but 1.4.3 should not require application changes, if possible, when upgrading from earlier 1.x releases. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: sandbox - core ?
Erik Hatcher wrote: It would be nice if the Sandbox components were versioned and released along with the core - perhaps this would be a sufficient enough solution? But, alas, I have no free time currently to devote to this effort. That's precisely the reason to add these to the main CVS tree: if they're somewhere else then they simply won't get versioned and released in parallel with the core, while if they're in the main CVS tree this will happen with no extra effort. In general, I'm a proponent of bundling as much as possible into a single CVS tree and build procedure, since it makes it much easier to keep things synchronized. If folks feel the jar is too big, then we can always build these into a separate jar. I'd also vote to put analyzers in the same CVS tree and under the top-level build.xml, for the same reason. If we like, we could put them each in subdirectories of src/analyzers, and have each built as a separate jar. Thoughts? The sandbox should be for experimental stuff. Stuff that's proven widely useful should go into the main tree and get released along with every Lucene release. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: sandbox - core ?
Otis Gospodnetic wrote: I like this idea. I don't care so much about 1 or more CVS repositories, as much as separate Jars, so if we can make analyzers-1.4.2.jar and highlighter-1.4.2.jar along lucene-1.4.2.jar, that would be ideal, in my opinion. A minor point: we should prefix all the jar file names with 'lucene-'. Also, I think the javadoc should include everything, not just the core. That way folks can easily see what's available. We could group things to make it clear what's core and what's in auxiliary jars: http://java.sun.com/j2se/1.4.2/docs/tooldocs/solaris/javadoc.html#group So we might have groups for Core, Analyzers, etc. However, I still think a separate jar for the highlighter is overkill. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: IndexInput GCJ
Andi Vajda wrote: Do you intend to ultimately support Java Lucene with GCJ? As far as possible... I'm down to 3 patches: Can you please file a Lucene bug report and attach these patches? I'm not guaranteeing that they'll all be committed right away, but rather that that's a better place to keep track of them. And they may get committed! Thanks. I've also suggested a few changes to your patches below. - GCJH cannot generate a header file from QueryParser.class because there are one static field and one static method which have the same name (down from two in Lucene 1.4.1). This code is generated by JavaCC. I think the best way to fix this would be to fix up the code automatically whenever it is regenerated. So, instead of patching QueryParser.java, patch build.xml. In the javacc-QueryParser task, add a replace task which replaces 'jj_la1_0()' with 'jj_la1_0_method()'. Is there a GCJ bug number assigned to this issue? If not, could you please file one and note the bug number in a comment? That way, if/when GCJ more elegantly resolves this we can remove the hack. - The delete(int) and delete(Term) methods on IndexReader clash with the 'delete' C++ keyword. GCJ will generate them as 'delete$', which is a neat workaround; the problem, however, is that the dynamic linker, at least on Mac OS X, doesn't then properly link to these symbols and fails to load the resulting shared library. So I defined two synonym methods, deleteDocument(int) and deleteDocuments(Term), in a patch to IndexReader. In your patch, please add javadocs and deprecate the old delete() methods. Again, GCJ and/or OS X bug numbers in a comment would be good to have. - Because of GCJ bug 15411, http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15411, Searcher.java needs to be patched to define the missing method definitions. Please add this bug reference in a comment to the patched code. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
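The synonym methods Andi describes might look like the following. They are shown as a thin wrapper here since the actual patch adds them to IndexReader itself; the wrapper class name is made up.

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

class GcjFriendlyReader {
  private final IndexReader reader;
  GcjFriendlyReader(IndexReader reader) { this.reader = reader; }

  // 'delete' clashes with the C++ keyword, so GCJ mangles it to 'delete$'.
  public void deleteDocument(int docNum) throws IOException {
    reader.delete(docNum);
  }

  public int deleteDocuments(Term term) throws IOException {
    return reader.delete(term); // returns the number of documents deleted
  }
}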
Re: Lucene JAR for Maven Repo
I just copied the 1.4.2 jar there. Doug Otis Gospodnetic wrote: Here is the email I mentioned earlier on lucene-dev. --- Brian McCallister [EMAIL PROTECTED] wrote: To: [EMAIL PROTECTED] From: Brian McCallister [EMAIL PROTECTED] Subject: Maven Repo Date: Thu, 26 Aug 2004 19:59:50 -0400 Hi all, Thank you for the amazing work on lucene. That said, any chance you could push lucene-1.4.1.jar onto the ibiblio maven repository? I'm happy to do so myself if you prefer (is just copying it to /www/www.apache.org/dist/java-repository/lucene/jars/ ) but figured I'd ask before just copying the jar over =) Thank you again! -Brian - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene 1.4.2?
Daniel Naber wrote: On Friday 01 October 2004 23:57, Doug Cutting wrote: It is not mirrored yet. Erik's the only one who has ever done that. Erik, do you have time to mirror 1.4.2? Thanks. BTW, the release on the official download pages is still 1.4-final: http://jakarta.apache.org/site/sourceindex.cgi http://jakarta.apache.org/site/binindex.cgi Right. The official site is the mirrored site. The procedure for releasing to the mirror is documented at: http://jakarta.apache.org/site/convert-to-mirror.html Would someone else like to do this? Erik's been rather busy. If another comitter has the time, it would be great to get this done ASAP. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene 1.4.2?
Christoph Goller wrote: I would never have guessed that calling the constructor there could make such a difference. The improvement is greatest for OR queries that contain a common term, i.e., which match a large portion of the collection. However for, e.g., most phrase searches and AND searches the improvement is probably not so pronounced. When folks use Lucene as a vector-space search engine, constructing queries that represent large weighted vectors, the improvement is substantial. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene 1.4.2?
Christoph Goller wrote: Items 4 and 5 don't seem that important to me. As far as I am concerned we can leave them out. When did 4 happen? Was it a rare or common problem? I agree that we don't need to put 5 in 1.4.2. So the only thing missing is your optimization. Then 1.4.2 should be ready. I just committed this. I can make a 1.4.2 release later today or Monday. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using MMapDirectory fails TestCompoundFile; MMapDirectory for huge indexes
Paul Elschot wrote: I'm working on a memory mapped directory that uses multiple buffers for large files. Great! There will be a small performance hit, as each call to readByte() will need to first check whether it's overflowed the current buffer, right? While trying some test runs I found that the current version fails a test: [junit] Testsuite: org.apache.lucene.index.TestCompoundFile Thanks for testing this! I'm testing the version with multiple buffers using a smaller maximum buffer size (1024 * 128), and it does this test in the same way. You mean it fails too? I have not yet looked into TestCompoundFile. When it is a good test case for this, I'll submit the multibuffer version as an enhancement. Thanks, that would be great. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
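The per-byte check Doug asks about, as a sketch: when a read exhausts the current mapped buffer, advance to the next one. The names are illustrative, not Paul's actual patch.

import java.nio.ByteBuffer;

class MultiBufferInput {
  private final ByteBuffer[] buffers; // consecutive mmap'd windows of one file
  private int current = 0;

  MultiBufferInput(ByteBuffer[] buffers) { this.buffers = buffers; }

  public byte readByte() {
    if (!buffers[current].hasRemaining()) // the extra bounds check per call
      current++;                          // cross into the next mapped window
    return buffers[current].get();
  }
}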
Re: Lucene 1.4.2?
The new release is up at http://jakarta.apache.org/lucene/. It is not mirrored yet. Erik's the only one who has ever done that. Erik, do you have time to mirror 1.4.2? Thanks. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: DbDirectory and compound files
Andi Vajda wrote: You ask if this makes sense. No, not really. I don't know the details of the purpose of the compound file implementation so this may be my problem. The purpose of the compound file implementation is to minimize the number of open files that an IndexReader must keep open. Instead of 7 + the number of indexed fields files per segment, only a single file must be kept open per segment. This helps applications which keep lots of unoptimized indexes open. (It also, and this is more common, helps folks who open a new IndexReader for each query and don't close it. In this case, opening fewer files gives the garbage collector time to close files before the process runs into its file descriptor limit, inducing a flurry of bug reports about too many open files.) Does that make any more sense? However, from earlier posts of yours, it seems that the Directory implementation classes such as OutputStream et al are being deprecated and replaced by others, so it may very well be that DbDirectory needs to be rewritten when these changes are finalized. These changes are back-compatible: the old classes and methods are still there and interoperate with the new, but are deprecated. You might wait until there is a Lucene release with the new API in it before you update DbDirectory. To move to the new API, all that should be required is changing your subclass of InputStream to instead subclass BufferedIndexInput, and also changing your subclass of OutputStream to instead subclass BufferedIndexOutput. You'll also need to add a length() method to your BufferedIndexInput subclass, instead of setting a protected length field in the constructor. That's it. The revision of the API was primarily to make buffering optional. We could have left the buffered implementation names the same, but then the classes would be named poorly, and it also seemed like an opportunity to remove the name clash with java.io. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
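A skeleton of the migration Doug describes, with the Berkeley DB reads stubbed out; only the shape of the BufferedIndexInput contract (readInternal(), seekInternal(), close(), length()) is the point, and the class name is made up.

import java.io.IOException;
import org.apache.lucene.store.BufferedIndexInput;

class DbIndexInput extends BufferedIndexInput {
  private final long length;
  private long pointer = 0;

  DbIndexInput(long length) { this.length = length; }

  protected void readInternal(byte[] b, int offset, int len) throws IOException {
    // ... fetch len bytes starting at 'pointer' from the database into b ...
    pointer += len;
  }

  protected void seekInternal(long pos) { pointer = pos; }

  public long length() { return length; } // a method now, not a protected field

  public void close() throws IOException {
    // ... release database resources ...
  }
}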
Re: Lucene 1.4.2?
Christoph Goller wrote: I'd like the changes on FuzzyQuery, PhraseQuery, and PhrasePrefixQuery included in the branch. Any objections? I'm okay with these, but the primary purpose of 1.4.2 should be to stabilize things, not to add new features. So let's be very selective about what we add, and scrutinize changes carefully so we don't introduce new bugs. Are you confident that these are safe changes? If we agree to let a *few* features in, then I vote for my optimization to IndexSearcher. Of all the optimizations I made recently, the single biggest performance improvement was to avoid allocating a new ScoreDoc for every non-zero score in IndexSearcher.search(Query,Filter,int). I think this is safe. Are there any concerns about putting this optimization into 1.4.2? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: DbDirectory and compound files
Andi Vajda wrote: So, my question: why is the compound file storage implemented in a way orthogonal to Directory, instead of just being another Directory implementation called FSCompoundFileDirectory? To combine the files of a segment we need to know when the segment is complete. So a method would need to be added to Directory to instruct it when to combine files. And then the Directory would need to be able to locate files within the combined file in order to open them. It would be a shame to re-invent this logic for each Directory implementation, so the indexing code has a generic implementation layered on top of Directory. Does that make sense? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene 1.4.2?
Daniel Naber wrote: On Monday 20 September 2004 18:49, Doug Cutting wrote: To be clear, you are proposing that we branch from the 1.4.1 tag in CVS and re-apply the patches below? Yes, exactly. Now that we have a patch for the memory leak problem, should we start a 1.4.2 branch? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene 1.4.2?
Daniel Naber wrote: I can try to do some of the work, but I'd need detailed instructions for branching and tagging. It's probably easier/better if you do those parts. I've never branched with CVS before either... so here goes! I've added a branch called lucene_1_4_2_dev. To get a copy, use: cvs -d :ext:[EMAIL PROTECTED]:/home/cvs co -r lucene_1_4_2_dev -d lucene_1_4_2_dev jakarta-lucene Where XXX is your username at Apache. Then you can make changes and commit them from this directory. I just made the memory leak patch in this branch, but I've not yet updated CHANGES.txt. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: IndexInput GCJ
Doug Cutting wrote: Still to do: 1. Replace OutputStream with IndexOutput and BufferedIndexOutput. This is not critical and mostly for consistency, as mmap makes more sense for read-only data. 2. Update RAMDirectory and FSDirectory to no longer use deprecated classes. This is done last, to make sure that the earlier steps do not break back-compatibility for existing Directory implementations. These changes are now complete and in CVS. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/store MMapDirectory.java
[EMAIL PROTECTED] wrote: Added: src/java/org/apache/lucene/store MMapDirectory.java Log: Add an nio mmap based Directory implementation. For my simple benchmarks this is somewhat slower than the classic FSDirectory, but I thought it was still worth having. It should use less memory when there are lots of query terms, since it does not need to allocate a new buffer per term and the mmapped data can be shared. This may be good for folks who, e.g., use lots of wildcards. It also should, in theory, someday be faster. One downside is that it cannot handle indexes with files larger than 2^31 bytes. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/store MMapDirectory.java
Bruce Ritchie wrote: [EMAIL PROTECTED] wrote: One downside is that it cannot handle indexes with files larger than 2^31 bytes. Can you expand slightly on what causes this limitation and whether it still exists on 64 bit hardware? This is a limit of the nio ByteBuffer API, which uses int instead of long to address data. Java defines int as a signed 32-bit quantity, regardless of hardware. The size of a ByteBuffer is also an int. http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/FileChannel.html#map(java.nio.channels.FileChannel.MapMode,%20long,%20long) http://java.sun.com/j2se/1.4.2/docs/api/java/nio/ByteBuffer.html Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: cvs commit: jakarta-lucene build.xml
Daniel Naber wrote: I'm using gcc/gcj 3.3.3, do I maybe need a more recent version? I'm currently using 3.4.1, but I think 3.4.0 will work as well. I had troubles with 3.3. I've worked more on this, and now have a version (not yet committed) which appears a bit faster than a JVM. More soon. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]