Re: updating jakarta site

2005-03-01 Thread Doug Cutting
Erik Hatcher wrote: When Doug is cool with re-enabling the redirect, it's fine with me. I'm cool with it if it works. Why not re-enable it, search for "site:apache.org lucene" on Google, Yahoo! and MSN, and click on the first few links. If these work, then I'm okay with the redirect. As we cha

Re: updating jakarta site

2005-03-01 Thread Doug Cutting
Henri Yandell wrote: Redirect of jakarta.apache.org/lucene to lucene.apache.org/java/docs/index.html I noticed there's a commented out redirect in the .htaccess, so after adding my own I deleted it again and left the redirect off for the moment. Unsure if there's a reason the commented out bit is t

Re: patch - DEFAULT_ vars in IndexWriter non-final and DEFAULT for useCompoundFile

2005-02-28 Thread Doug Cutting
Kevin A. Burton wrote: Doug Cutting wrote: Wolf Siberski wrote: So, if anything at all, I would rather opt for making these constants private :-). I agree. In general, fields should either be final, or private with accessor methods. So, we could change this to: private static int

Re: patch - DEFAULT_ vars in IndexWriter non-final and DEFAULT for useCompoundFile

2005-02-28 Thread Doug Cutting
Kevin A. Burton wrote: Wolf Siberski wrote: Kevin A. Burton wrote: I see following issues with your patch: - you changed the DEFAULT_... semantics from constant to modifiable, but didn't adjust the names according to Java conventions (default_...). Java doesn't have any naming conventions which

Re: updating jakarta site

2005-02-28 Thread Doug Cutting
Garrett Rooney wrote: Actually, currently we've got both lucene4c and java commits going to [EMAIL PROTECTED], and there was some talk of just leaving it that way, since it isn't that much traffic and it encourages people to keep an eye on what's going on in other languages. I think that's a bad

Re: updating jakarta site

2005-02-28 Thread Doug Cutting
Henri Yandell wrote: Your download page is already separate, you're using the global closer.cgi file. So we need to: - rename Lucene Java's mailing lists, with forwards put into place. - add a mailing list page to Lucene Java's website, modelled after http://jakarta.apache.org/site/mail2.html#Luce

read index terms lazily

2005-02-25 Thread Doug Cutting
Attached is a patch which delays reading of index terms until they are first accessed. The cost of this is another file descriptor, held open until the terms are accessed, at which point it is closed. The benefit is that operations that do not require access to index terms are much faster and use much less memory.
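
The trade-off described above can be sketched in a few lines of Java. This is only an illustration of lazy loading, not the attached patch: the class name `LazyTermIndex`, its methods, and the one-term-per-line input are all hypothetical, and Lucene's real term index is a binary file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of lazy loading; not Lucene's actual term-index reader.
final class LazyTermIndex {
    private Reader input;        // stays open (one extra descriptor) until first access
    private List<String> terms;  // null until the terms have been read

    LazyTermIndex(Reader input) {
        this.input = input;
    }

    synchronized boolean isLoaded() {
        return terms != null;
    }

    // Reads all terms on the first call, closes the input, and reuses the list afterwards.
    synchronized List<String> terms() {
        if (terms == null) {
            List<String> loaded = new ArrayList<>();
            try (BufferedReader in = new BufferedReader(input)) {
                String line;
                while ((line = in.readLine()) != null) {
                    loaded.add(line);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            input = null;  // the reader was closed by try-with-resources
            terms = loaded;
        }
        return terms;
    }

    public static void main(String[] args) {
        LazyTermIndex index = new LazyTermIndex(new StringReader("apple\nbanana\ncherry\n"));
        System.out.println(index.isLoaded());      // nothing has been read yet
        System.out.println(index.terms().size());  // triggers the one-time read
        System.out.println(index.isLoaded());
    }
}
```

Callers that never ask for the terms never pay for parsing them, at the price of keeping the input open in the meantime.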

svn commit: r155349 - in lucene/java/trunk/src/java/org/apache/lucene/search: IndexSearcher.java MultiSearcher.java

2005-02-25 Thread cutting
Author: cutting Date: Fri Feb 25 09:39:02 2005 New Revision: 155349 URL: http://svn.apache.org/viewcvs?view=rev&rev=155349 Log: Added accessor methods, as suggested by Kevin Burton. Modified: lucene/java/trunk/src/java/org/apache/lucene/search/IndexSearcher.java lucene/java/trunk

Re: patch - DEFAULT_ vars in IndexWriter non-final and DEFAULT for useCompoundFile

2005-02-25 Thread Doug Cutting
Doug Cutting wrote: public static int getDefaultMergeFactor() { return mergeFactor; } Oops. That should be 'return defaultMergeFactor'. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-ma

Re: patch - DEFAULT_ vars in IndexWriter non-final and DEFAULT for useCompoundFile

2005-02-25 Thread Doug Cutting
Wolf Siberski wrote: So, if anything at all, I would rather opt for making these constants private :-). I agree. In general, fields should either be final, or private with accessor methods. So, we could change this to: private static int defaultMergeFactor = Integer.parseInt(System.getPropert
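
A minimal sketch of the "private with accessor methods" pattern proposed above. The message is cut off before the property name, so the name `example.mergeFactor`, the fallback value, the class name `WriterDefaults`, and the range check in the setter are assumptions made for illustration; only the field name and getter follow the thread.

```java
// Hypothetical holder class; in the thread the field lives in IndexWriter.
final class WriterDefaults {
    // Assumed property name and fallback; the original message is truncated here.
    private static int defaultMergeFactor =
        Integer.parseInt(System.getProperty("example.mergeFactor", "10"));

    private WriterDefaults() {
    }

    public static int getDefaultMergeFactor() {
        return defaultMergeFactor;
    }

    // The validation is an addition for illustration, not part of the proposal.
    public static void setDefaultMergeFactor(int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        defaultMergeFactor = mergeFactor;
    }

    public static void main(String[] args) {
        setDefaultMergeFactor(20);
        System.out.println(getDefaultMergeFactor());
    }
}
```

Keeping the field private means a later change (validation, synchronization, a different source for the default) does not break callers, which a public non-final field cannot offer.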

Re: Patch - IndexReader methods and MultiSearcher methods...

2005-02-24 Thread Doug Cutting
Kevin A. Burton wrote: I *realize* that there are other ways to do this but we have some legacy code that can't be rewritten right now. Thus the change to protected and using a reloadable implementation. Changing Lucene's API to be back-compatible with an altered version of Lucene is not a com

Re: Patch - IndexReader methods and MultiSearcher methods...

2005-02-24 Thread Doug Cutting
Kevin A. Burton wrote: Also, I assume that the reason you make the reader field protected is because getReader() is not sufficient, i.e., you want to set the reader. This would stylistically be better done with a setReader() method, no? Do you only change it at construction, or at runtime? If

Re: Javadoc not available due to non-public classes?

2005-02-24 Thread Doug Cutting
Kevin A. Burton wrote: You know ... the javadoc on the site doesn't include non-public classes like TermInfosWriter. Confused me for a second. That's because it's not public. The javadoc on the site is to document the public api. This is not a bug, but a feature. Also.. the site doesn't hav

Re: Patch - IndexReader methods and MultiSearcher methods...

2005-02-24 Thread Doug Cutting
If you add things that will appear in the javadoc then they need javadoc comments. Also, I assume that the reason you make the reader field protected is because getReader() is not sufficient, i.e., you want to set the reader. This would stylistically be better done with a setReader() method, n

Re: svn commit: r155011 - lucene/java/trunk/src/java/overview.html

2005-02-23 Thread Doug Cutting
[EMAIL PROTECTED] wrote: Log: fix broken links to source - FileDocument.java contains + FileDocument.java (http://svn.apache.org/repos/asf/lucene/java/trunk/src/demo/org/apache/lucene/demo/FileDocument.java) contains This makes the docs point to the current version of the code, rather than a ver

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-22 Thread Doug Cutting
Wolf Siberski wrote: The price is an extension (or modification) of the Searchable interface. I've added corresponding search(Weight...) methods to the existing search(Query...) methods and deprecated the latter. I think this is the right solution. If Searchable is meant to be Lucene internal, then

Re: Into javadocs? [Bug 31841] - [PATCH] MultiSearcher problems with Similarity.docFreq()

2005-02-21 Thread Doug Cutting
Paul Elschot wrote: Would you mind if some pieces of your reply end up in the javadocs? Not at all. Doug

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-21 Thread Doug Cutting
Wolf Siberski wrote: Now I found another solution which requires more changes, but IMHO is much cleaner: - when a query computes its Weight, it caches it in an attribute - a query can be 'frozen'. A frozen query always returns the cached Weight when calling Query.weight(). Originally there was no

Re: Incubating Lucene.Net

2005-02-17 Thread Doug Cutting
George Aroush wrote: Any thoughts on Lucene.Net/dotLucene package name are welcome. I agree that Lucene.Net is a better name. It's more consistent with Lucene Java and Lucene4c, the names for other ports of Lucene. I think it's okay to reclaim the name of an abandoned project, especially if t

Re: [VOTE] Re: Incubating Lucene.Net

2005-02-17 Thread Doug Cutting
+1 Doug

Re: [VOTE] Incubate lucene4c?

2005-02-17 Thread Doug Cutting
+1 Doug

Re: removing the old FAQ

2005-02-16 Thread Doug Cutting
Daniel Naber wrote: could someone (Doug?) make me an administrator for the old Lucene project at sourceforge? Done. Doug

Re: lucene.apache.org

2005-02-15 Thread Doug Cutting
Henri Yandell wrote: On names, Lucene Java might hit trademark issues I guess. So potential worry there. Good point. Although I note that Apache already has projects called "Xerces Java" and "Xalan Java". Sun says: http://www.sun.com/policies/trademarks/#20c So, technically, the full name of the

Re: Transactional Directories

2005-02-14 Thread Doug Cutting
[ Please ignore my previous message. I somehow hit "Send" before typing anything! ] Oscar Picasso wrote: However with a relatively high number of random insertions, the cost of the "new IndexWriter / index.close()" performed for each insertion is too high. Did you measure that? How much slower

Re: Transactional Directories

2005-02-14 Thread Doug Cutting
Oscar Picasso wrote: Hi, I am currently implementing a Directory backed by a Berkeley DB that I am willing to release as an open source project. Besides the internal implementation, it differs from the one in the sandbox in that it is implemented with the Berkeley DB Java Edition. Using the Java E

Re: lucene.apache.org

2005-02-14 Thread Doug Cutting
Erik Hatcher wrote: I've amended my request for e-mail lists here with Doug's preference: http://issues.apache.org/jira/browse/INFRA-195 Do others agree this is the best approach? I don't mean to be autocratic. Do we imagine different pools of users and developers for different Lucene sub-p

Re: lucene.apache.org

2005-02-14 Thread Doug Cutting
Bernhard Messer wrote: Doug, you placed a copy of the website in the "java" directory. In both, the original and the java directory the "api" directory is missing. I can't copy it into because of the access rights :-( Argh. The group protection is 'lucene', as it should be, but you're not in 'l

Re: lucene.apache.org

2005-02-14 Thread Doug Cutting
Doug Cutting wrote: And we also want to try not to break URLs when we move things. For this reason it's best to move things as few times as possible, so that we don't end up with a confusing set of redirects. More to the point, we also want to try not to break email addresses. So

Re: lucene.apache.org

2005-02-14 Thread Doug Cutting
Erik Hatcher wrote: It also might be a good time to think about mailing list names. There was a request on infrastructure@ to move [EMAIL PROTECTED] to [EMAIL PROTECTED], would it make more sense to move it to [EMAIL PROTECTED] NOW you tell me :) I think until we have these elusive other l

Re: What does [] do to a query and what's up with lucene.apache.org?

2005-02-14 Thread Doug Cutting
Erik Hatcher wrote: I'm really at the limit of my bandwidth - I've got the sandbox restructuring effort on my plate right now and would like it if someone could pick up the ball on the web site side of things. Then perhaps you shouldn't have redirected everything to lucene.apache.org... We need

Re: lucene.apache.org

2005-02-14 Thread Doug Cutting
Garrett Rooney wrote: Agreed. Java Lucene is a subproject of the Lucene TLP, leaving the existing Java Lucene site there for the time being seems ok, just so we have something there, but we should endeavour to put up something more permanent ASAP. I think, for the present, http://lucene.apache.

Re: [ANNOUNCE] lucene4c 0.02

2005-02-14 Thread Doug Cutting
Garrett Rooney wrote: Additionally it would be good to work on updating the disk format documentation, I've found several cases where the docs are quite out of date compared to the current code. It's hard to expect the various different ports to maintain compatibility when the formats are only

Re: What does [] do to a query and what's up with lucene.apache.org?

2005-02-14 Thread Doug Cutting
Otis Gospodnetic wrote: lucene.apache.org seems to work now. Here is the query syntax: http://lucene.apache.org/queryparsersyntax.html We should be cautious in promoting lucene.apache.org urls until we have this structured correctly. Let's stick with calling this http://jakarta.apache.org/luce

Re: lucene.apache.org

2005-02-14 Thread Doug Cutting
Erik Hatcher wrote: I have checked out our current site to the lucene.apache.org area, and I've also set up a redirect from the jakarta.apache.org/lucene area. Keep in mind, there are two projects here: 1. Porting Java Lucene's site to Forrest. This should be structured as a sub-project of luce

Re: lucene.apache.org

2005-02-14 Thread Doug Cutting
Erik Hatcher wrote: Doug - do you have your Forrest work handy? Or has anyone else stepped up to build the web site? I don't have anything reusable. I converted Nutch from a different (not Anakia) XML-based site to Forrest with little difficulty (mostly using string replace in Emacs). I start

Re: Study Group (WAS Re: Normalized Scoring)

2005-02-07 Thread Doug Cutting
Paul Elschot wrote: I learned a lot by adding some javadocs to such classes. I suppose Doug added the Expert markings, but I don't know their precise purpose. The "Expert" declaration is meant to indicate that most users should not need to understand the feature. Lucene's API seeks to be both sim

Re: whither sandbox

2005-02-04 Thread Doug Cutting
Erik Hatcher wrote: Also, we should package a lucene-XX-all.zip/.tar.gz that includes all the contrib pieces also allowing someone to simply download Lucene and all the packaged contrib pieces at once. I'll go further: that should be the only download. We should avoid having a bunch of differen

whither sandbox

2005-02-04 Thread Doug Cutting
So, now that we've got the sandbox in the same source tree let's decide what we want to do with it. I have previously argued that we should make sure that sandbox code is tagged and released in parallel with core code (http://tinyurl.com/5d6tx). Now that should be easy. But how should

svn commit: r151390 - in lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/index/SegmentMerger.java

2005-02-04 Thread cutting
Author: cutting Date: Fri Feb 4 11:09:53 2005 New Revision: 151390 URL: http://svn.apache.org/viewcvs?view=rev&rev=151390 Log: Fix for bug #32847. Use uncached access to norms when merging to minimize RAM usage. Modified: lucene/java/trunk/CHANGES.txt lucene/java/trunk/src/java

Re: [PROPOSAL] Lucene to search.apache.org

2005-02-02 Thread Doug Cutting
Erik Hatcher wrote: Hmmm good point. I hadn't considered access control. A migration will be performed later today, and I think it will initially be a test migration for me to verify. I'll double-check with Justin, who's doing the conversion, on how access control will be initially config

Re: [PROPOSAL] Lucene to search.apache.org

2005-02-01 Thread Doug Cutting
Erik Hatcher wrote: On Feb 1, 2005, at 3:13 PM, Doug Cutting wrote: I think we want Java Lucene to be a sub-project of Lucene. So the repository should be something like: https://svn.apache.org/repos/asf/lucene/java I already put in the request for this initial svn structure: /asf/lucene

Re: Fwd: [PROPOSAL] Lucene to search.apache.org

2005-02-01 Thread Doug Cutting
Mario Alejandro M. wrote: What is necessary to be part of this? I'm porting Lucene to Delphi ... Existing projects generally enter Apache through the incubator: http://incubator.apache.org/ Doug

Re: [PROPOSAL] Lucene to search.apache.org

2005-02-01 Thread Doug Cutting
Erik Hatcher wrote: We need website work. Doug mentioned (maybe this was in regards to Nutch, though) that he's used Forrest. Yes, I used Forrest for the new Nutch site. The source for the site is in: http://svn.apache.org/repos/asf/incubator/nutch/trunk/src/site/ The generated site is at: http://

Re: Fwd: [PROPOSAL] Lucene to search.apache.org

2005-02-01 Thread Doug Cutting
Erik Hatcher wrote: The decision was a bit slow to get out, but Lucene has been approved for TLP. Thanks for pushing this through! I propose we simply import our two CVS repositories in with all of jakarta-lucene as the root of the repository and jakarta-lucene-sandbox under "sandbox" in the r

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-01 Thread Doug Cutting
David Spencer wrote: Let's start with the issue that's been raised so much: whether idf is better defined with log() or sqrt(log()). I can redo my page and rebuild indexes if necessary, I just need it clarified what we want to do, esp -> does the index need to be rebuilt? The index needs to be r

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-01 Thread Doug Cutting
Chuck Williams wrote: > So I think this can be implemented using the expansion I proposed > yesterday for MultiFieldQueryParser, plus something like my > DensityPhraseQuery and perhaps a few Similarity tweaks. I don't think that works unless the mechanism is limited to default-AND (i.e., all

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-01 Thread Doug Cutting
David Spencer wrote: +(f1:t1^2.0 t1) +(f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5 (f1:t1^2.0 t1) (f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5 (f1:t1^2.0 t1) (f1:t2^2.0 t2) (f1:t3^2.0 t3) (f1:t4^2.0 t4) (f1:t5^2.0 t5) f1:"t1 t2 t3 t4 t5"~5^3.0 "t1 t2 t3 t4 t5"~2^1.5 This looks great to me! I'd

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-01 Thread Doug Cutting
Chuck Williams wrote: Doug Cutting wrote: > What did you think of my DensityPhraseQuery proposal? It is a step in the direction of what I have in mind, but I'd like to go further. How about a query class with these properties: 1. Inputs are: a. F = list of fields b. B =

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Doug Cutting
Chuck Williams wrote: That expansion is scalable, but it only accounts for proximity of all query terms together. E.g., it does not favor a match where t1 and t2 are close together while t3 is distant over a match where all 3 terms are distant. Worse, it would not favor a match with t1 and t2 in

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Doug Cutting
David Spencer wrote: But what is right if there are > 2 terms in terms of the phrases - does it have a phrase for every pair of terms like this (ignore fields and boosts and proximity for a sec): search for "t1 t2 t3" gives you these phrases in addition to the direct field matches: "t1 t2" "t2

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Doug Cutting
Chuck Williams wrote: I think the differences are pretty clear as the system stands. Notice a substantial difference in the idf's in the respective explanations. I continue to think the current mechanism weights these too high, primarily due to its squaring. The other big difference occurs when

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Doug Cutting
Doug Cutting wrote: It would translate a query "t1 t2" given fields f1 and f2 into something like: +(f1:t1^b1 f2:t1^b2) +(f2:t1^b1 f2:t2^b2) Oops. The first term on that line should be "f1:t2", not "f2:t1": +(f1:t2^b1 f2:t2^b2) f1:"
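
The corrected expansion can be sketched as a small string builder. The message is truncated before the phrase clauses, so this sketch emits only the required per-term clauses; the class and method names are hypothetical, and numeric boosts stand in for the symbolic b1 and b2.

```java
// Hypothetical helper: builds one required clause per term, each searching every field.
final class MultiFieldExpansion {
    static String requiredClauses(String[] terms, String[] fields, float[] boosts) {
        StringBuilder sb = new StringBuilder();
        for (int t = 0; t < terms.length; t++) {
            if (t > 0) {
                sb.append(' ');
            }
            sb.append("+(");
            for (int f = 0; f < fields.length; f++) {
                if (f > 0) {
                    sb.append(' ');
                }
                // field:term^boost, where the boost belongs to the field
                sb.append(fields[f]).append(':').append(terms[t]).append('^').append(boosts[f]);
            }
            sb.append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(requiredClauses(
            new String[] {"t1", "t2"},
            new String[] {"f1", "f2"},
            new float[] {2.0f, 1.0f}));
        // prints: +(f1:t1^2.0 f2:t1^1.0) +(f1:t2^2.0 f2:t2^1.0)
    }
}
```

Each term is required once, but may match in either field, which is the shape of the two clauses in the correction above.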

Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Doug Cutting
David Spencer wrote: I worked w/ Chuck to get up a test page that shows search results with 2 versions of Similarity side by side. David, This looks great! Thanks for doing this. Is the default operator AND or OR? It appears to be OR, but it should probably be AND. That's become the industry s

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Doug Cutting
Christoph Goller wrote: The similarity specified for the search has to be modified so that both idf(...) AND queryNorm(...) always return 1 and as you say everything except for tf(term,doc)*docNorm(doc) could be precompiled into the boosts of the rewritten query. coord/tf/sloppyFreq computation wo

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-27 Thread Doug Cutting
Chuck Williams wrote: Christoph Goller writes: > You may be right. But I am not completely convinced. I think > this should be decided based on the proposed benchmark evaluation. Is that still happening? Like anything else in an all-volunteer operation, it will only happen if folks volunteer t

Re: [PROPOSAL] Lucene to search.apache.org

2005-01-17 Thread Doug Cutting
Maybe we should just call it lucene.apache.org, and move the current Lucene project to lucene.apache.org/java? The other projects we imagine adding (Nutch, DotLucene, CLucene, etc.) are all Lucene-related, no? Lucene has a pretty good brand name... Doug Otis Gospodnetic wrote: ir.apache.org is

Re: JDK code in the codebase

2005-01-14 Thread Doug Cutting
Erik Hatcher wrote: The questions still remain, though, and lawyers do want to know the answers: - How did JDK code get into Lucene's codebase to begin with? I put it there in a moment of ignorance way back as a hack in order to make things run in an older version of the JVM. http://cvs.sourc

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-14 Thread Doug Cutting
Chuck Williams wrote: Doug Cutting wrote: > It would indeed be nice to be able to short-circuit rewriting for > queries where it is a no-op. Do you have a proposal for how this could > be done? First, this gets into the other part of Bug 31841. I don't believe MultiSearche

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-14 Thread Doug Cutting
Wolf Siberski wrote: Doug Cutting wrote: So, when a query is executed on a MultiSearcher of RemoteSearchables, the following remote calls are made: 1. RemoteSearchable.rewrite(Query) is called After that step, are wildcards replaced by term lists? Yes. I haven't taken a look at the re

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Doug Cutting
Chuck Williams wrote: If auto-filters can provide an effective implementation for RangeQuery's that avoids rewriting, and we can give up MultiTermQuery and PrefixQuery in the distributed environment, then how about something like this refinement: 1. No rewriting is done. It would indeed be nice

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Doug Cutting
Chuck Williams wrote: It just seems like a lot of IPC activity for each query. As things stand now, I think you are proposing this? 1. MultiSearcher calls the remote node to rewrite the query, requiring serialization of the query. 2. The remote node returns the rewritten query to the dispatc

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Doug Cutting
Chuck Williams wrote: I think there is another problem here. It is currently the Weight implementations that do rewrite(), which requires access to the index, not just to the idf's. E.g., RangeQuery.rewrite() must find the terms in the index within the range. So, the Weight cannot be computed in

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-13 Thread Doug Cutting
Wolf Siberski wrote: Yes, I agree. I just wanted to point out that the current Weight implementations need to be modified heavily to introduce the behaviour you describe above. For example, take a look at TermQuery.TermWeight.scorer(): [...] return new TermScorer(this, termDocs, getSimilarity

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Doug Cutting
Wolf Siberski wrote: Chuck Williams wrote: This is a nice solution! By having MultiSearcher create the Weight, it can pass itself in as the searcher, thereby allowing the correct docFreq() method to be called. This is similar to what I tried to do with topmostSearcher, but a much better way to do

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Doug Cutting
Chuck Williams wrote: There needs to be a way to create the aggregate docFreq table and keep it current under incremental changes to the indices on the various remote nodes. I think you're getting ahead of yourself. Searchers are based on IndexReaders, and hence docFreqs don't change until a new S

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Doug Cutting
Chuck Williams wrote: I was thinking of the aggressive version with an index-time solution, although I don't know the Lucene architecture for distributed indexing and searching well enough to formulate the idea precisely. Conceptually, I'd like each server that owns a slice of the index in a distri

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-11 Thread Doug Cutting
Chuck Williams wrote: This is a nice solution! By having MultiSearcher create the Weight, it can pass itself in as the searcher, thereby allowing the correct docFreq() method to be called. Glad to hear it at least makes sense... Now I hope it works! I'm still left wondering if having MultiSearcher

Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-11 Thread Doug Cutting
Chuck Williams wrote: As Wolf does, I hope a committer with deep knowledge of Lucene's design in this area will weigh in on the issue and help to resolve it. The root of the bug is in MultiSearcher.search(). This should construct a Weight, weight the query, then score the now-weighted query. Her
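
To illustrate why the weight should be built against the combined searcher, here is a sketch that compares per-index and aggregate document frequencies. Everything in it is hypothetical: the `Shard` class stands in for a sub-searcher, and the idf formula ln(N / (df + 1)) + 1 is assumed for illustration rather than taken from this thread.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical model of several sub-indexes behind one multi-searcher.
final class GlobalDocFreq {
    static final class Shard {
        final int maxDoc;
        final Map<String, Integer> docFreqs;

        Shard(int maxDoc, Map<String, Integer> docFreqs) {
            this.maxDoc = maxDoc;
            this.docFreqs = docFreqs;
        }

        int docFreq(String term) {
            Integer df = docFreqs.get(term);
            return df == null ? 0 : df;
        }
    }

    // Aggregate statistics, as a combined searcher would report them.
    static int docFreq(List<Shard> shards, String term) {
        int total = 0;
        for (Shard s : shards) {
            total += s.docFreq(term);
        }
        return total;
    }

    static int maxDoc(List<Shard> shards) {
        int total = 0;
        for (Shard s : shards) {
            total += s.maxDoc;
        }
        return total;
    }

    // Assumed idf formula, used only to make the difference visible.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        Shard a = new Shard(100, Collections.singletonMap("lucene", 50));
        Shard b = new Shard(100, Collections.singletonMap("lucene", 1));
        List<Shard> all = Arrays.asList(a, b);
        System.out.println("idf on shard a: " + idf(a.docFreq("lucene"), a.maxDoc));
        System.out.println("idf on shard b: " + idf(b.docFreq("lucene"), b.maxDoc));
        System.out.println("global idf:     " + idf(docFreq(all, "lucene"), maxDoc(all)));
    }
}
```

If each sub-index weights the query with its own statistics, the same term gets a different idf on each shard and the merged scores are not comparable; weighting once with the aggregate numbers avoids that.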

Re: what if the IndexReader crashes, after delete, before close.

2005-01-11 Thread Doug Cutting
Terry Steichen wrote: Would it be possible to optimize the operation to use 1.4 runtime features but retain the option, if desired to run in a legacy (1.3) environment, perhaps in a degraded mode? Lucene 1.4.3 is a "degraded" mode, no? There are still back-compatibility issues. To be safe, Luce

Re: what if the IndexReader crashes, after delete, before close.

2005-01-11 Thread Doug Cutting
Sigh. This stuff would get a lot simpler if we were able to use Java 1.4's FileLock. Then locks would be automatically cleared by the OS if the JVM crashes. Should we upgrade the JVM requirements to 1.4 for Lucene's 1.9/2.0 releases and update the locking code? Doug Luke Shannon wrote: Here
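
A minimal sketch of what OS-backed locking with `java.nio.channels.FileLock` looks like. This is not Lucene's locking code: the class name `NativeLock` and its methods are hypothetical, and the sketch uses later Java conveniences (`AutoCloseable`, `UncheckedIOException`) that the 1.4 API discussed here did not have.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;

// Hypothetical wrapper around an OS-level file lock.
final class NativeLock implements AutoCloseable {
    private final RandomAccessFile file;
    private final FileLock lock;

    private NativeLock(RandomAccessFile file, FileLock lock) {
        this.file = file;
        this.lock = lock;
    }

    // Returns a held lock, or null if the lock is already held elsewhere.
    static NativeLock tryObtain(File lockFile) {
        RandomAccessFile raf = null;
        try {
            raf = new RandomAccessFile(lockFile, "rw");
            FileLock lock = raf.getChannel().tryLock();
            if (lock != null) {
                return new NativeLock(raf, lock);
            }
            raf.close();  // another process holds the lock
            return null;
        } catch (OverlappingFileLockException e) {
            closeQuietly(raf);  // this JVM already holds the lock
            return null;
        } catch (IOException e) {
            closeQuietly(raf);
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            try {
                lock.release();
            } finally {
                file.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void closeQuietly(RandomAccessFile raf) {
        if (raf == null) {
            return;
        }
        try {
            raf.close();
        } catch (IOException ignored) {
            // best effort only
        }
    }
}
```

Because the operating system owns the lock, it disappears when the process dies, so no stale lock file has to be cleaned up after a crash. One caveat from the `FileLock` documentation: on some systems, closing any channel on the file releases all locks the JVM holds on it, so a real implementation should acquire all locks on a given file through a single channel.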

Re: auto-filters?

2005-01-06 Thread Doug Cutting
Paul Elschot wrote: Filters are more efficient than query terms for many I think there are two reasons for the performance gain: - having things in RAM, e.g. the bits of a filter after it is computed once, - being able to search per field instead of per document. Also, bit-vectors are constant-time

Re: DO NOT REPLY [Bug 32965] - Use filter bits for next() and skipTo() in FilteredQuery

2005-01-06 Thread Doug Cutting
[EMAIL PROTECTED] wrote: --- Additional Comments From [EMAIL PROTECTED] 2005-01-06 20:13 --- Patch to IndexSearcher.java to use FilteredQuery I like where this is going, and want to take it further! Why not patch Searcher.java instead of IndexSearcher.java? Once that's done, Filters coul

Re: CFS file and file formats

2005-01-03 Thread Doug Cutting
Bernhard Messer wrote: Why not implementing a small utility class, f.e CompoundFileUtil.java within the org.apache.lucene.index Package ? This class could be public and implement the necessary functionality. This is what i would prefer, because we don't have to change the visibility of CompoundF

Re: CFS file and file formats

2005-01-03 Thread Doug Cutting
Bernhard Messer wrote: I understand the technical reason for main() there, but logically this belongs to an external utility class, I think. Otis you are right, i already thought about it. It could be simply moved to a newly created class in org.apache.lucene.util package. But then we have to cha

Re: Lucene site generation

2005-01-03 Thread Doug Cutting
Erik Hatcher wrote: I think Forrest is the right way to go - but I've not experience with it myself. I recently developed a Nutch site for the Incubator using Forrest (not yet published). It was easy and pleasant. A nice feature is 'forrest run' which permits one to edit xml source and then vi

Re: auto-filters?

2005-01-03 Thread Doug Cutting
markharw00d wrote: If we intend to make more use of filters this may be an appropriate time to raise a general question I have on their use. Is there a danger in tying them to a specific implementation (java.util.BitSet)? I do not object in principle to replacing BitSet with an interface, e.g.

auto-filters?

2005-01-02 Thread Doug Cutting
Filters are more efficient than query terms for many things. For example, a RangeFilter is usually more efficient than a RangeQuery and has no risk of triggering BooleanQuery.TooManyClauses. And Filter caching (e.g., with CachingWrapperFilter) can make otherwise expensive clauses almost free, after
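
The caching effect mentioned above can be sketched without Lucene. In this illustration the class `CachingBitsFilter`, the `range` helper, and the use of a plain `int[]` as a stand-in for an index reader are all hypothetical; it only shows the idea of computing a filter's bits once per reader and reusing them.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

// Hypothetical sketch: computes a filter's bits once per reader, then serves them from a cache.
final class CachingBitsFilter<R> {
    private final Function<R, BitSet> compute;
    private final Map<R, BitSet> cache = new WeakHashMap<>();  // entries vanish with their reader
    private int computations = 0;

    CachingBitsFilter(Function<R, BitSet> compute) {
        this.compute = compute;
    }

    synchronized BitSet bits(R reader) {
        BitSet bits = cache.get(reader);
        if (bits == null) {
            bits = compute.apply(reader);  // the expensive step, done once per reader
            computations++;
            cache.put(reader, bits);
        }
        return bits;
    }

    synchronized int computations() {
        return computations;
    }

    // A range filter over a per-document int value; the array index is the document number.
    static CachingBitsFilter<int[]> range(int low, int high) {
        return new CachingBitsFilter<int[]>(values -> {
            BitSet bits = new BitSet(values.length);
            for (int doc = 0; doc < values.length; doc++) {
                if (values[doc] >= low && values[doc] <= high) {
                    bits.set(doc);
                }
            }
            return bits;
        });
    }

    public static void main(String[] args) {
        int[] reader = {5, 12, 20, 31, 10};
        CachingBitsFilter<int[]> filter = range(10, 20);
        System.out.println(filter.bits(reader));  // prints: {1, 2, 4}
        filter.bits(reader);                      // served from the cache
        System.out.println(filter.computations());  // prints: 1
    }
}
```

After the first call, applying the filter is a constant-time bit test per document, which is why a cached range filter can be far cheaper than expanding the range into many query clauses.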

Re: CFS file and file formats

2004-12-23 Thread Doug Cutting
It would be useful to have a command-line utility (i.e., a static main(String[]) method somewhere) that lists the files and sizes contained inside a CFS file, and perhaps even an option to unpack it. Anyone care to contribute this method? Doug ---

Re: PageRank and Lucene javadoc

2004-12-21 Thread Doug Cutting
David Spencer wrote: And my feeling is that in the context of machine-generated pages, Page Rank doesn't help that much. It's better than random. It correctly identified overview-summary as the best "home page" for the collection in both cases. It also identified some core classes (IndexReader

Re: Migration to SVN?

2004-12-20 Thread Doug Cutting
Garrett Rooney wrote: The "least effort" way of doing that would be to include both the core and sandbox under the same trunk, but again, that implies that you ALWAYS tag and branch them together, and sometimes you may not want to do that. I think we should always branch these together. To my t

Re: DefaultSimilarity 2.0?

2004-12-20 Thread Doug Cutting
Chuck Williams wrote: Finally, I'd suggest picking content that has multiple fields and allow the individual implementations to decide how to search these fields -- just title and body would be enough. I would like to use my MaxDisjunctionQuery and see how it compares to other approaches (e.g., th

DefaultSimilarity 2.0?

2004-12-17 Thread Doug Cutting
Chuck Williams wrote: Another issue will likely be the tf() and idf() computations. I have a similar desired relevance ranking and was not getting what I wanted due to the idf() term dominating the score. [ ... ] Chuck has made a series of criticisms of the DefaultSimilarity implementation. Unfo

Re: Explanations and overridden similarity

2004-12-16 Thread Doug Cutting
Dan Climan wrote: Shouldn't the call to Similarity.decodeNorm be replaced with a call to Similarity.getDefault().decodeNorm decodeNorm is a static method. Doug

Re: setLowercaseWildcardTerms and FuzzyQueries

2004-12-13 Thread Doug Cutting
Daniel Naber wrote: I'm aware that the "Wildcard" name won't fit well anymore, suggestions for a better name are welcome. "Expanded"? Doug

Re: potential new Lucene logo

2004-12-13 Thread Doug Cutting
Murray Altheim wrote: I thought I'd have a go at the Lucene logo, not to change it markedly but clean it up so that it is based on an existing font. This potential Lucene logo is based on an ITC font called Magneto Bold Extended, which you can see here: http://www.identifont.com/show?72W I modi

Re: Boolean Scorer

2004-12-10 Thread Doug Cutting
Christoph Goller wrote: I think we should change BooleanScorer. An easy way would be to sort the bucket list before it is used. Do you think that would affect performance dramatically? I think it would make it slower. Otherwise we should reimplement BooleanScorer. I haven't looked into the Disjun

Re: Release 1.4.3

2004-12-06 Thread Doug Cutting
Christoph Goller wrote: Doug, could you please move api/ to api.old/ and api.new/ to api/ Done. Doug

Re: Release 1.4.3

2004-11-26 Thread Doug Cutting
Christoph Goller wrote: I think I should finally make Release 1.4.3. Great! I presume the default.properties no longer exists. I just fill in "1.4.3" as version in the build.xml before building it. Is this ok? I build releases with something like: ant -Dversion=1.4.3 clean dist So that it doe

Re: GIS

2004-11-16 Thread Doug Cutting
Guillermo Payet wrote: The fact that Lucene stores and indexes (or so it seems) all terms as Strings and that there is no NumericTerm makes me think that I might be missing something and that this might be a much bigger deal than I think? You could write a HitCollector that uses FieldCache.get

Re: FuzzyQuery prefix length

2004-10-26 Thread Doug Cutting
Erik Hatcher wrote: On Oct 20, 2004, at 12:14 PM, Doug Cutting wrote: The advantages of a zero-character prefix default are that it's backward-compatible and that it will find more matches, when spelling differences are in the first characters. I prefer this default. Anyone using QueryParser
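
The trade-off described in this message can be illustrated with a small, self-contained sketch. This is not Lucene's FuzzyQuery code: the class name FuzzyPrefixSketch, the matches() helper, and the similarity formula (one minus edit distance divided by the shorter length) are assumptions made for illustration only. It shows how a required prefix of one character rules out "Fotokopie" for the query "Photokopie", while a zero-character prefix lets it match.

```java
public class FuzzyPrefixSketch {

    // Classic Levenshtein edit distance, computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical matcher: the candidate term must share the first
    // prefixLength characters with the query, and the similarity
    // (assumed here to be 1 - distance / shorter length) must exceed
    // minSimilarity.
    static boolean matches(String query, String term,
                           int prefixLength, double minSimilarity) {
        if (query.length() < prefixLength || term.length() < prefixLength) {
            return false;
        }
        if (!term.startsWith(query.substring(0, prefixLength))) {
            return false; // cheap rejection, no edit distance needed
        }
        int shorter = Math.min(query.length(), term.length());
        if (shorter == 0) {
            return query.length() == term.length();
        }
        double similarity = 1.0 - (double) editDistance(query, term) / shorter;
        return similarity > minSimilarity;
    }

    public static void main(String[] args) {
        System.out.println(matches("Photokopie", "Fotokopie", 0, 0.5));
        System.out.println(matches("Photokopie", "Fotokopie", 1, 0.5));
    }
}
```

With a non-zero prefix, most candidate terms are rejected by the startsWith check before any edit distance is computed, which is where the speed-up discussed in this thread would come from; the cost is that variants differing in the first characters are never found.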

Re: Normalized Scoring -- was RE: idf and explain(), was Re: Search and Scoring

2004-10-21 Thread Doug Cutting
Chuck Williams wrote: However, I'm not sure this analysis is completely correct due to MultiSearcher.docFreq() which appears to be trying to redefine the tf's to be the global value across all indices. It wasn't clear to me how this code is ever reached, e.g. from TermQuery --> SegmentTermDocs. I

Re: Retrieving Document Boosts

2004-10-20 Thread Doug Cutting
Dan Climan wrote: TermEnum terms = ir.terms(); int numTerms = 0; while (terms.next()) { Term t = terms.term(); if (t.field().equals("FullText")) numTerms++; }

Re: FuzzyQuery prefix length

2004-10-20 Thread Doug Cutting
Daniel Naber wrote: On Tuesday 12 October 2004 17:22, Doug Cutting wrote: Which is worse: a person who searches for Photokopie~ in a 1000 document collection does not find documents containing Fotokopie; or a person who searches for Photokopie~ in a 1M document collection doesn't find any

Re: What's the purpose of hashing docid in BooleanScorer; DisjunctionScorer

2004-10-18 Thread Doug Cutting
Paul Elschot wrote: I have a DisjunctionScorer based on a PriorityQueue lying around, but I can't benchmark it myself at the moment. In case there is interest, I'll gladly adapt it to org.apache.lucene.search and add it in bugzilla. This should look a lot like SpanOrQuery.getSpans(). On a related
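
The idea of a disjunction driven by a priority queue can be sketched as follows. This is a minimal illustration, not the DisjunctionScorer mentioned in the message: it uses java.util.PriorityQueue rather than Lucene's own queue class, ignores scoring entirely, and the names DisjunctionSketch, Cursor, and union are hypothetical. Each sub-clause is modelled as a sorted array of document ids, and the queue always exposes the cursor positioned on the smallest id.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class DisjunctionSketch {

    // Cursor over one sub-clause's sorted document ids (a stand-in for a sub-scorer).
    static final class Cursor {
        final int[] docs;
        int pos = 0;

        Cursor(int[] docs) {
            this.docs = docs;
        }

        int doc() {
            return docs[pos];
        }

        // Advances to the next document; returns false when exhausted.
        boolean next() {
            return ++pos < docs.length;
        }
    }

    // Returns the sorted union of all document ids, visiting documents in
    // increasing order without hashing them into buckets.
    static List<Integer> union(int[][] postings) {
        PriorityQueue<Cursor> queue =
            new PriorityQueue<>((x, y) -> Integer.compare(x.doc(), y.doc()));
        for (int[] p : postings) {
            if (p.length > 0) {
                queue.add(new Cursor(p));
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor top = queue.poll();
            int doc = top.doc();
            // Several cursors may sit on the same document; emit it once.
            if (result.isEmpty() || result.get(result.size() - 1) != doc) {
                result.add(doc);
            }
            if (top.next()) {
                queue.add(top); // re-insert at its new position
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(union(new int[][] {{1, 4, 7}, {2, 4, 9}, {}}));
    }
}
```

A real scorer would also accumulate a score per document while cursors on the same id are popped; that part is left out to keep the sketch short.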

Re: What's the purpose of hashing docid in BooleanScorer

2004-10-18 Thread Doug Cutting
Christoph Goller wrote: With the current scorer API one could get rid of buckettable and advance all subscores only by one document each time. I am not sure whether the bucketable implementation is really much more efficient. I only see the advantage of inlining some of the scorer.next and score.sc

Re: FuzzyQuery prefix length

2004-10-18 Thread Doug Cutting
Daniel Naber wrote: Searching for Photokopie~ on a 230,000 document corpus takes 2.3 seconds here (AMD Athlon 2600+; other fuzzy terms get similar performance). As the number of terms doesn't increase so fast with more documents, it will not take 10 seconds for 1 million documents. So fuzzy sear

Re: FuzzyQuery prefix length

2004-10-18 Thread Doug Cutting
Daniel Naber wrote: On Tuesday 12 October 2004 17:22, Doug Cutting wrote: Which is worse: a person who searches for Photokopie~ in a 1000 document collection does not find documents containing Fotokopie; or a person who searches for Photokopie~ in a 1M document collection doesn't find any

Re: Propose Bernhard as committer

2004-10-18 Thread Doug Cutting
+1 Christoph Goller wrote: I would like to propose Bernhard as Lucene committer.

Re: idf and explain(), was Re: Search and Scoring

2004-10-18 Thread Doug Cutting
Chuck Williams wrote: That's a good point on how the standard vector space inner product similarity measure does imply that the idf is squared relative to the document tf. Even having been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anyb
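
The point about the squared idf can be written out explicitly. The following is a sketch under the textbook assumption that both the query weight and the document weight of a term are tf times idf, with length normalization left out; the symbols are chosen here for illustration and do not come from the message.

```latex
\[
\mathrm{score}(q,d)
  = \sum_{t \in q} w_{t,q}\, w_{t,d}
  = \sum_{t \in q} \bigl(\mathrm{tf}_{t,q}\,\mathrm{idf}_t\bigr)
                   \bigl(\mathrm{tf}_{t,d}\,\mathrm{idf}_t\bigr)
  = \sum_{t \in q} \mathrm{tf}_{t,q}\,\mathrm{tf}_{t,d}\,\mathrm{idf}_t^{2}
\]
```

Because idf appears once in each of the two vectors, it enters the inner product squared, while the document tf enters only linearly.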

Re: API cleanup for Field and future cleanup for IndexReader

2004-10-18 Thread Doug Cutting
Bernhard Messer wrote: Christoph Goller wrote: Bernhard Messer wrote: Currently there are 3 different methods available to get the field names from an index. a) getFieldNames(); b) getFieldNames(boolean indexed); c) getIndexedFieldNames(boolean storedTermVector); my proposal is to deprecate a), b
