Re: updating jakarta site

2005-03-01 Thread Doug Cutting
Henri Yandell wrote:
Redirect of jakarta.apache.org/lucene to lucene.apache.org/java/docs/index.html
I noticed there's a commented-out redirect in the .htaccess, so after
adding my own I deleted it again and left the redirect off for the
moment. I'm unsure if there's a reason the commented-out bit is there;
lucene.apache.org/java and jakarta.apache.org/lucene look to be clones
currently (barring the extra news item at lucene.apache.org).
When the redirect was first put into place there were some broken links 
at lucene.apache.org/java, so the redirect was removed until the links 
were fixed.  I think the links were fixed, but the redirect was never 
restored.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: updating jakarta site

2005-03-01 Thread Doug Cutting
Erik Hatcher wrote:
When Doug is cool with re-enabling the redirect, it's fine with me.
I'm cool with it if it works.  Why not re-enable it, search for 
"site:apache.org lucene" on Google, Yahoo! and MSN, and click on the 
first few links?  If these work, then I'm okay with the redirect.

As we change stuff like this, we should try to change things only once, 
rather than making a temporary change that might not be appropriate 
long-term.  This is especially the case with things like URLs and email 
addresses, that get saved in mail archives, web indexes, etc.  The fewer 
times we change them the less we'll break things.  Thankfully jakarta 
never made it into a package name!

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: updating jakarta site

2005-02-28 Thread Doug Cutting
Henri Yandell wrote:
Your download page is already separate; you're using the global closer.cgi file.
So we need to:
- rename Lucene Java's mailing lists, with forwards put into place.
- add a mailing list page to Lucene Java's website, modelled after 
http://jakarta.apache.org/site/mail2.html#Lucene.  This should replace 
the link in the sidebar to Jakarta's mailing list page.

The mailing lists should probably be renamed:
  [EMAIL PROTECTED]
  [EMAIL PROTECTED]
  [EMAIL PROTECTED]
Does that sound right to folks?
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: updating jakarta site

2005-02-28 Thread Doug Cutting
Garrett Rooney wrote:
Actually, currently we've got both lucene4c and java commits going to 
[EMAIL PROTECTED], and there was some talk of just leaving it 
that way, since it isn't that much traffic and it encourages people to 
keep an eye on what's going on in other languages.
I think that's a bad idea.  Once there are lots of commits folks will 
start unsubscribing or ignoring things.  Personally I only want to see 
commits for projects that I'm actively contributing to.  I don't 
anticipate I'll be committing to lucene4c, so I don't feel the need to 
track it on a commit-by-commit level.  I do anticipate I'll make commits 
to Lucene Java, and try to carefully read every commit message for this 
project.  Yes, I could set up filters so that I only see the commits I 
like, but I'd much prefer these were simply separate mailing lists. 
Soon we hope to have Nutch under Lucene's umbrella.  Do you and Erik 
really want to see all of the Nutch commits in your inbox?

http://www.mail-archive.com/nutch-cvs%40lists.sourceforge.net/
We keep running into this same confusion: I think that the Lucene TLP 
should be set up primarily as a container of sub-projects.  Jakarta 
Lucene is the first of these and Lucene4c is the second.  We don't 
intend to merge Jakarta Lucene and Lucene4c into a single project, with 
a single set of developers, building a single download.  So each 
component of Jakarta Lucene should be moved to a sub-component of Lucene 
TLP, not to a top-level component.  This is the case for bug databases, 
mailing lists, web sites, etc. across the board.  Do we disagree on this?

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: patch - DEFAULT_ vars in IndexWriter non-final and DEFAULT for useCompoundFile

2005-02-28 Thread Doug Cutting
Kevin A. Burton wrote:
Wolf Siberski wrote:
Kevin A. Burton wrote:
I see following issues with your patch:
- you changed the DEFAULT_... semantics from constant to modifiable,
  but didn't adjust the names according to Java conventions 
(default_...).

Java doesn't have any naming conventions that include an underscore.
I assume you mean defaultUse...
http://java.sun.com/docs/codeconv/html/CodeConventions.doc8.html#15436
- you can achieve the same by writing your own IndexWriterFactory which
  sets the corresponding values after creating a new IndexWriter.
  Should be ~30 lines of code. It only makes sense to include a
  patch if either a solution is impossible with the current code
  or *a lot* of (potential) users have to work around something.

I *could* (and I thought of it), but it seems reasonable to be able to 
set Lucene to use whatever settings you want at any time...
You could do exactly that with an IndexWriterFactory, that's Wolf's point.
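For illustration, a minimal sketch of what such a factory might look
like (the class and setter names here are hypothetical, not part of
Lucene's API; it assumes the public mergeFactor field and
setUseCompoundFile() method of this era's IndexWriter):

  import java.io.IOException;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;

  public class IndexWriterFactory {
    private int mergeFactor = 10;            // assumed default
    private boolean useCompoundFile = true;  // assumed default

    public void setMergeFactor(int mergeFactor) {
      this.mergeFactor = mergeFactor;
    }
    public void setUseCompoundFile(boolean useCompoundFile) {
      this.useCompoundFile = useCompoundFile;
    }

    public IndexWriter create(Directory dir, Analyzer analyzer,
                              boolean create) throws IOException {
      IndexWriter writer = new IndexWriter(dir, analyzer, create);
      writer.mergeFactor = mergeFactor;           // public field at the time
      writer.setUseCompoundFile(useCompoundFile); // apply after construction
      return writer;
    }
  }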
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: patch - DEFAULT_ vars in IndexWriter non-final and DEFAULT for useCompoundFile

2005-02-28 Thread Doug Cutting
Kevin A. Burton wrote:
Doug Cutting wrote:
Wolf Siberski wrote:
So, if anything at all, I would rather opt for making these constants
private :-).

I agree.  In general, fields should either be final, or private with 
accessor methods.  So, we could change this to:

private static int defaultMergeFactor =
  Integer.parseInt(System.getProperty("org.apache.lucene.mergeFactor",
                                      "10"));
public static int getDefaultMergeFactor() {
  return defaultMergeFactor;
}
public static void setDefaultMergeFactor(int mergeFactor) {
  defaultMergeFactor = mergeFactor;
}

In my original patch I deleted 5 final keywords for a reduction in 
code of 25 bytes..   If I were to submit the patch again I'd have to add 
2 methods and 35 additional lines of code.

Seems to me that the Java coding conventions in this situation should be 
ignored.
This isn't a coding convention, but rather software engineering.  If we 
wish to be able to back-compatibly modify Lucene's implementation at a 
later date, it's usually easiest to have access through methods rather 
than fields, since we can intercept reads and writes to the field.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


read index terms lazily

2005-02-25 Thread Doug Cutting
Attached is a patch which delays reading of index terms until it is 
first accessed.  The cost of this is another file descriptor, until the 
terms are accessed, when it is closed.  The benefit is that operations 
that do not require access to index terms are much faster and use much 
less memory.

Thoughts?
Doug
Index: src/java/org/apache/lucene/index/TermInfosReader.java
===================================================================
--- src/java/org/apache/lucene/index/TermInfosReader.java	(revision 155349)
+++ src/java/org/apache/lucene/index/TermInfosReader.java	(working copy)
@@ -33,6 +33,12 @@
   private SegmentTermEnum origEnum;
   private long size;
 
+  private Term[] indexTerms = null;
+  private TermInfo[] indexInfos;
+  private long[] indexPointers;
+  
+  private SegmentTermEnum indexEnum;
+
   TermInfosReader(Directory dir, String seg, FieldInfos fis)
throws IOException {
 directory = dir;
@@ -42,7 +48,10 @@
origEnum = new SegmentTermEnum(directory.openInput(segment + ".tis"),
fieldInfos, false);
 size = origEnum.size;
-readIndex();
+
+indexEnum =
+  new SegmentTermEnum(directory.openInput(segment + ".tii"),
+			  fieldInfos, true);
   }
 
   protected void finalize() {
@@ -73,28 +82,23 @@
 return termEnum;
   }
 
-  Term[] indexTerms = null;
-  TermInfo[] indexInfos;
-  long[] indexPointers;
-
-  private final void readIndex() throws IOException {
-SegmentTermEnum indexEnum =
-  new SegmentTermEnum(directory.openInput(segment + ".tii"),
-			  fieldInfos, true);
+  private final void ensureIndexIsRead() throws IOException {
+if (indexTerms != null)
+  return;
 try {
   int indexSize = (int)indexEnum.size;
 
   indexTerms = new Term[indexSize];
   indexInfos = new TermInfo[indexSize];
   indexPointers = new long[indexSize];
-
+
   for (int i = 0; indexEnum.next(); i++) {
-	indexTerms[i] = indexEnum.term();
-	indexInfos[i] = indexEnum.termInfo();
-	indexPointers[i] = indexEnum.indexPointer;
+indexTerms[i] = indexEnum.term();
+indexInfos[i] = indexEnum.termInfo();
+indexPointers[i] = indexEnum.indexPointer;
   }
 } finally {
-  indexEnum.close();
+indexEnum.close();
 }
   }
 
@@ -126,6 +130,8 @@
   TermInfo get(Term term) throws IOException {
 if (size == 0) return null;
 
+ensureIndexIsRead();
+
 // optimize sequential access: first try scanning cached enum w/o seeking
 SegmentTermEnum enumerator = getEnum();
 if (enumerator.term() != null // term is at or past current
@@ -179,6 +185,7 @@
   final long getPosition(Term term) throws IOException {
 if (size == 0) return -1;
 
+ensureIndexIsRead();
 int indexOffset = getIndexOffset(term);
 seekEnum(indexOffset);
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Javadoc not available due to non-public classes?

2005-02-24 Thread Doug Cutting
Kevin A. Burton wrote:
You know ... the javadoc on the site doesn't include non-public classes 
like TermInfosWriter.  Confused me for a second.  
That's because it's not public.  The javadoc on the site is to document 
the public API.  This is not a bug, but a feature.

Also.. the site doesn't have JXR output for Lucene.  Would be nice to 
have.  Maven essentially gives you this for free...
If you would like to provide a patch to upgrade Lucene to use Maven, 
educate Lucene developers about Maven, and help to run it,  that would 
be great!

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Patch - IndexReader methods and MultiSearcher methods...

2005-02-24 Thread Doug Cutting
Kevin A. Burton wrote:
Also, I assume that the reason you make the reader field protected is 
because getReader() is not sufficient, i.e., you want to set the 
reader.  This would stylistically be better done with a setReader() 
method, no?  Do you only change it at construction, or at runtime?  If 
you only change it at construction, then super(reader) in the 
constructor might suffice.
We change it at runtime.  This is a ReloadableIndexSearcher that I 
developed that can reload if an index has been optimized() or added to 
by another external process.  I just have my external process do the 
merge and then call reload() on the main index.  The cool thing about 
this approach is that the entire webapp is operational while this 
happens.  While the swap is happening searches just back up for a second 
and then complete.  It also doesn't require 2x memory because I can 
dispose of the current reader, block searches, then open the new reader.
That can easily be done without subclassing IndexSearcher.
  public class SearcherCache {
private Searcher searcher;
public synchronized Searcher getSearcher() { return searcher; }
public synchronized void setSearcher(Searcher searcher) {
  this.searcher = searcher;
}
  }
Then use SearcherCache.getSearcher() whenever you need a searcher.  You 
could make it more complicated, e.g., have it automatically update the 
searcher when the index has changed, etc.  But none of that requires or 
is in particular facilitated by subclassing IndexSearcher, so far as I 
can see.
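For instance, a hedged sketch of a reload using that cache (the reload
method is a hypothetical name; how searches are paused, if at all, is
application-specific):

  // swap in a searcher over the updated index, then release the old one
  void reload(SearcherCache cache, String indexPath) throws IOException {
    Searcher old = cache.getSearcher();
    cache.setSearcher(new IndexSearcher(indexPath));
    if (old != null)
      old.close();  // frees the old reader's resources
  }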

Why don't we do this:  I don't think we should have a setReader then.  
This way there's no strong contract that developers preserve things that 
might break caching.  I'd like to keep the protected change though.  
Making the field protected is just an obscure way of making it 
changeable.  If we really need to make it settable, then we should add a 
setReader() method and add some cautions to its documentation.  But I'm 
not yet convinced this needs to be settable.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-22 Thread Doug Cutting
Wolf Siberski wrote:
The price is an extension (or modification) of the
Searchable interface. I've added corresponding search(Weight...) methods
to the existing search(Query...) methods and deprecated the latter.
I think this is the right solution.
If Searchable is meant to be Lucene internal, then IMHO these 'duplicates'
should be removed.
Searchable should be public, so that other RPC mechanisms may be used, 
rather than RMI.  Thus the architecture supports distributed search and 
RMI is just one potential platform.  Searchable is meant to be the 
abstract network protocol.  Queries, filters and sort criteria are 
designed to be compact so that they may be efficiently passed to a 
remote index, with only the top-scoring hits returned, rather than 
every non-zero scoring hit.  HitCollector-based access to remote indexes 
is discouraged.  HitCollectors are primarily meant to be used to 
implement queries, sorting and filtering.

The deprecated methods should be removed in Lucene 2.0.  We could 
probably remove them now without breaking anyone, but it's better to be 
safe.

Regarding your other comments: I've been a bit too eager in refactoring,
not giving enough thought to backward compatibility issues. Now I've
reverted to existing API and behavior as far as (IMHO) possible,
and that was pretty far. The only API change necessary is
createWeight() _throws IOException_, because the idfs have to
be computed in the Weight constructors.
I think that's okay.  Thanks for all your work!
An improved patch is attached to the Bugzilla issue.
This patch now looks great to me.  +1
Does anyone object to comitting this patch?
  http://issues.apache.org/bugzilla/show_bug.cgi?id=31841
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-02-21 Thread Doug Cutting
Wolf Siberski wrote:
Now I found another solution which requires more changes, but IMHO is
much cleaner:
- when a query computes its Weight, it caches it in an attribute
- a query can be 'frozen'. A frozen query always returns the cached
  Weight when calling Query.weight().
Originally there was no Weight in Lucene, only Query and Scorer.  Weight 
was added in order to make it so that searching did not modify a Query, 
so that a Query instance could be reused.  Searcher-dependent state of 
the query is meant to reside in the Weight.  IndexReader-dependent state 
resides in the Scorer.  Freezing a query violates this.  Can't we 
create the weight once in Searcher.search?
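As a sketch of that separation, using the public weight()/scorer() entry
points of this era (illustrative only, not a patch):

  Query query = new TermQuery(new Term("body", "lucene"));
  // the Query itself stays immutable and reusable
  Weight weight = query.weight(searcher);   // Searcher-dependent state
  Scorer scorer = weight.scorer(reader);    // IndexReader-dependent state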

This approach requires that weights can be serialized. Interestingly,
Weight already implements Serializable, but the current implementation
doesn't work for all weight classes. The reason is that some weights
hold a reference to a searcher which is of course not serializable.
We can't make it transient either, because this searcher is the source
of the Similarity needed by scorers.
On closer look it turned out that the searcher is used only for two
things: as a source for a Similarity, and as a docFreqs & maxDoc source.
docFreq & maxDoc are only necessary to initialize the weights, but not
needed by scorers. So instead of providing the Searcher, I now provide
a Similarity and a DocFreqSource to the weights. Only the Similarity is
stored by weights.
We need to make sure, however, that this is the correct Similarity.  It 
should still be the result of Query.getSimilarity(Searcher), which 
doesn't appear to be the case in your patch.

As for DocFreqSource versus Searcher, couldn't the Searcher be passed as 
a source for docFreqs and simply have Weights not keep a pointer to 
it?  This isn't a big deal, but it would substantially minimize the API 
changes.

As (IMHO) positive side effect, Similarity got rid of
Searcher dependencies, which leads to a better split of responsibilities:
- Similarity only provides scoring formulas
- Searcher (rsp. DocFreqSource) provides the raw data (tf/df/maxDoc)
This change affects quite a few classes (because the createWeight() 
signature
is changed), but the modifications are pretty straightforward.
But couldn't the signature change be avoided if the Weight constructors 
immediately call Query.getSimilarity(Searcher) to get their Similarity, 
and no longer kept a pointer to the Searcher?

From my point of view, the patch submitted now is a sound solution
for Bug 31841 (at least I like it :-) ).
The next thing which IMHO needs to be done is a review by someone else.
I've made a quick review, but it would be nice if others looked at this too.
Thanks again for all your work here!
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Into javadocs? [Bug 31841] - [PATCH] MultiSearcher problems with Similarity.docFreq()

2005-02-21 Thread Doug Cutting
Paul Elschot wrote:
Would you mind if some pieces of your reply end up in the
javadocs?
Not at all.
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [VOTE] Incubate lucene4c?

2005-02-17 Thread Doug Cutting
+1
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [VOTE] Re: Incubating Lucene.Net

2005-02-17 Thread Doug Cutting
+1
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Incubating Lucene.Net

2005-02-17 Thread Doug Cutting
George Aroush wrote:
Any thoughts on Lucene.Net/dotLucene package name are welcome.
I agree that Lucene.Net is a better name.  It's more consistent with 
Lucene Java and Lucene4c, the names for other ports of Lucene.  I think 
it's okay to reclaim the name of an abandoned project, especially if 
the abandoned project is better known and is substantially similar.

The only problem would be if someone else felt that the name Lucene.Net 
was their property.  But the folks at http://searchblackbox.com/ don't 
use the name Lucene.Net anymore.  Also, I owned and used the domain 
lucene.net to refer to Apache's Lucene before the Sourceforge Lucene.Net 
project started in 8/03, which arguably gives me rights to the name:

http://web.archive.org/web/*/http://www.lucene.net/
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: removing the old FAQ

2005-02-16 Thread Doug Cutting
Daniel Naber wrote:
could someone (Doug?) make me an administrator for the old Lucene project 
at sourceforge?
Done.
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: lucene.apache.org

2005-02-15 Thread Doug Cutting
Henri Yandell wrote:
On names, Lucene Java might hit trademark issues I guess. So potential
worry there.
Good point.  Although I note that Apache already has projects called 
Xerces Java and Xalan Java.  Sun says:

http://www.sun.com/policies/trademarks/#20c
So, technically, the full name of the product should be "Lucene for the 
Java platform", which we might sometimes abbreviate "Lucene Java".

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: lucene.apache.org

2005-02-14 Thread Doug Cutting
Erik Hatcher wrote:
Doug - do you have your Forrest work handy?  Or has anyone else stepped  
up to build the web site?
I don't have anything reusable.  I converted Nutch from a different (not 
Anakia) XML-based site to Forrest with little difficulty (mostly using 
string replace in Emacs).

I started by downloading Forrest and using the tutorial to seed a new 
project:

http://forrest.apache.org/docs/your-project.html
Then I outlined the site in site.xml and translated pages to the new 
schema.  Forrest's default directory layout was not intuitive to me, and 
it can be changed, but I left it alone, opting to keep things as vanilla 
as possible.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: lucene.apache.org

2005-02-14 Thread Doug Cutting
Erik Hatcher wrote:
I have checked out our current site to the lucene.apache.org area, and  
I've also set up a redirect from the jakarta.apache.org/lucene area.
Keep in mind, there are two projects here:
1. Porting Java Lucene's site to Forrest.  This should be structured as 
a sub-project of lucene.apache.org.  It should be maintained in 
https://svn.apache.org/repos/asf/lucene/java/trunk/docs/.

2. Building a new site for lucene.apache.org.  This should initially 
contain a single sub-project, Lucene Java.  This site should be 
maintained in https://svn.apache.org/repos/asf/lucene/site/.

We expect to shortly be adding more sub-projects, and we don't want to 
have to re-structure the site again soon, so let's structure it for the 
long-term from the start.  Make sense?

Do we need a separate logo for Lucene Java? Now I'm beginning to see the 
need for Murray Altheim's logo work:

http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg07799.html
If we adopted this, then we could use the same font to easily generate 
logos for Lucene Java, Lucene 4c, Lucene .Net, etc.  Some 
potential sub-projects (e.g., Nutch) already have distinct logos, but it 
might make sense to use similar logos for ports.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [ANNOUNCE] lucene4c 0.02

2005-02-14 Thread Doug Cutting
Garrett Rooney wrote:
Additionally it would be good to work on updating the disk format 
documentation, I've found several cases where the docs are quite out of 
date compared to the current code.  It's hard to expect the various 
different ports to maintain compatibility when the formats are only 
documented in code.
If you have a chance, please submit bugs for these.  Thanks.
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: lucene.apache.org

2005-02-14 Thread Doug Cutting
Garrett Rooney wrote:
Agreed.  Java Lucene is a subproject of the Lucene TLP, leaving the 
existing Java Lucene site there for the time being seems ok, just so we 
have something there, but we should endeavour to put up something more 
permanent ASAP.
I think, for the present, http://lucene.apache.org/ should redirect to 
http://lucene.apache.org/java/.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: What does [] do to a query and what's up with lucene.apache.org?

2005-02-14 Thread Doug Cutting
Erik Hatcher wrote:
I'm really at the limit of my bandwidth - I've got the sandbox 
restructuring effort on my plate right now and would like it if someone 
could pick up the ball on the web site side of things.
Then perhaps you shouldn't have redirected everything to 
lucene.apache.org...

We need to fix this ASAP.  I just checked out the java lucene docs in 
http://lucene.apache.org/java/.  Now we just need to fix up the 
redirects, so that http://lucene.apache.org/ and 
http://jakarta.apache.org/lucene/ both redirect to 
http://lucene.apache.org/java/.

How did you implement the redirect?  It's not a meta redirect, so they 
must be in the web server configuration?  How does one change that?
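For reference, if it is .htaccess, the redirect could be as simple as
this mod_alias sketch (which directive is actually in place is exactly
the open question here):

  # send the old Jakarta URL space to the new home
  Redirect permanent /lucene http://lucene.apache.org/java/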

It's worth looking at what various search engines list for Lucene and 
making sure that those links are not broken:

http://www.google.com/search?q=+site%3Aapache.org+lucene
http://search.yahoo.com/search?p=lucene+site%3Aapache.org
http://search.msn.com/results.aspx?q=site%3Aapache.org+lucene
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: lucene.apache.org

2005-02-14 Thread Doug Cutting
Erik Hatcher wrote:
It also might be a good time to think about mailing list names.  There 
was a request on infrastructure@ to move [EMAIL PROTECTED] to 
[EMAIL PROTECTED], would it make more sense to move it to [EMAIL PROTECTED]

NOW you tell me  :)
I think until we have these elusive other languages in, we should stick 
with [EMAIL PROTECTED]  We certainly want to have a cohesive Lucene community 
regardless of language, and dev@ makes sense to keep across all 
languages to me.
I (respectfully) disagree.  I don't think other Apache projects work 
that way.  Sub-projects have their own development lists.  Perhaps we 
should have new mailing lists for the top-level project, but the mailing 
list that replaces lucene-dev@jakarta.apache.org should be specific to 
Lucene Java.

In general, nearly everything related to Jakarta Lucene should be moved 
to the Java sub-project of TLP Lucene.  There may be some exceptions, 
but those should be the results of public deliberations.  For example, 
Garrett suggested that the file format documentation might move to the 
top-level.  There's merit to that, but we should figure out how each 
port will describe what version of the file format it implements, 
whether it implements any extensions, etc. before we yank the file 
format documentation from the lucene port.

And we also want to try not to break URLs when we move things.  For this 
reason it's best to move things as few times as possible, so that we 
don't end up with a confusing set of redirects.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: lucene.apache.org

2005-02-14 Thread Doug Cutting
Doug Cutting wrote:
And we also want to try not to break URLs when we move things.  For this 
reason it's best to move things as few times as possible, so that we 
don't end up with a confusing set of redirects.
More to the point, we also want to try not to break email addresses.  So 
the fewer times we change them the fewer forwards we'll have to 
maintain.  The new dev list should be [EMAIL PROTECTED], the 
new user list should be [EMAIL PROTECTED], etc.

If folks don't like the moniker Lucene Java then we could consider 
different names, like Lucene4j or somesuch.

Apache started out with just a single project, the web server.  When 
other (now called top level) projects were added, the webserver was 
renamed Apache Server and was hosted at httpd.apache.org.  Eventually 
the name evolved to Apache HTTP Server.

We're in a similar situation.  Lucene is both the top-level name and the 
flagship sub-project.  It is the burden of the sub-project to rename itself.

However if we want to take more time to consider what to name Lucene 
Java, then we should backout the redirects and stay at 
http://www.jakarta.apache.org/lucene/ until we've picked a new name.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: lucene.apache.org

2005-02-14 Thread Doug Cutting
Bernhard Messer wrote:
Doug, you placed a copy of the website in the java directory. In both, 
the original and the java directory the api directory is missing. I 
can't copy it into because of the access rights :-(
Argh.  The group protection is 'lucene', as it should be, but you're not 
in 'lucene'.  We need to fix that.

Erik, can we please undo the redirects and roll back to 
http://jakarta.apache.org/lucene/ until we get lucene.apache.org fully 
setup?  Thanks.

I'm in a meeting for the next few hours...
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: lucene.apache.org

2005-02-14 Thread Doug Cutting
Erik Hatcher wrote:
I've amended my request for e-mail lists here with Doug's preference:
http://issues.apache.org/jira/browse/INFRA-195
Do others agree this is the best approach?  I don't mean to be 
autocratic.  Do we imagine different pools of users and developers for 
different Lucene sub-projects, or one big pool for all of them?  I 
assume they'll be mostly disjoint.

A new name now too?
I don't really want to open that can of worms if we can help it.  If 
folks are okay with Lucene Java then we're done.  I mostly just meant 
to point out that we *are* coining a new name, so we should state that, 
agree on what it means, and start using it.  I'm perfectly content with 
Lucene Java as a name for the project formerly known as Jakarta 
Lucene.  So unless we hear vigorous objections, let's go with that.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Transactional Directories

2005-02-14 Thread Doug Cutting

Oscar Picasso wrote:
Hi,
I am currently implementing a Directory backed by a Berkeley DB that I am
willing to release as an open source project.
Besides the internal implementation, it differs from the one in the sandbox in
that it is implemented with the Berkeley DB Java Edition.
Using the Java Edition allows an easier distribution as you just need to add a
single jar in your classpath and you have a fully functional Berkeley DB
embedded in your application without the hassle of installing the C Berkeley
DB.
While initially implemented with the Java Edition this Directory can easily be
ported to a Berkeley DB C edition or a Berkeley DB XML (for example to use
Berkeley DB XML + Lucene as the base for a document management system).
This implementation works fine and I am quite happy with its speed.
There is still an important problem I face and it has to do with how to deal
with some transactions. After all, the purpose of a Berkeley implementation, or
a JDBC one for that matter, is its ability to use transactions.
After looking at Andy Varga's code, it seems that the implementation in the
sandbox faces the same problem (correct me if I am wrong). I have also learned
that the JDBC directory was not implemented with transactions in mind.
Here is the problem. 

If I do something like this:
-- case A --
<pseudo-code>
+begin transaction
 new IndexWriter
 create/update/delete objects in the database
 index.addDocument (related to the objects)
 indexWriter.close()
+commit
</pseudo-code>
Everything is fine. The operations are transactionally protected. You can even
do many writes/updates. As long as everything is enclosed by the pairs
begin-transaction/new-index-writer ... index-writer.close/commit, everything is
properly undone in case any operation fails inside the transaction.
For batch insertions the whole batch is rolled back, but at least your object
database is consistent with the index.
If you do mostly batch insertions and relatively few random individual
insertions, that's fine.
However, with a relatively high number of random insertions, the cost of the
new IndexWriter / indexWriter.close() performed for each insertion is too high.
Unfortunately this is a common case for some kinds of applications, and it is
where a transactional directory would be the most useful.
In such a case you would like to do something like this:
-- case B --
<pseudo-code>
new IndexWriter
 ...
+begin transaction-1
 create/update/delete objects in the database
 index.addDocument (related to the objects)
+ commit
...
+begin transaction-2
 create/update/delete objects in the database
 index.addDocument (related to the objects)
+ commit
...
indexWriter.close()
</pseudo-code>
The benefits would be to protect individual insertions while avoiding the cost
of creating a new IndexWriter each time.
It doesn't work, however. Here is my understanding. 

Suppose that in case B, transaction-1 fails and transaction-2 succeeds.
In that case the underlying database system rolls back all the writes done
during transaction-1 whether they were related to the objects stored in the
database or to the index (the writes done to the IndexOutput are also undone).
From the database point of view consistency is maintained between the stored
object and the index.
The problem is that after transaction-1 Lucene still 'remembers' the segment(s)
it wrote during transaction-1. Later, Lucene might 'want' to perform some
operation based on these references (when merging segments, I think) while
the underlying segment files do not exist anymore. This is where an
exception is thrown.
The solution would be to instruct Lucene to 'forget' or undo any reference to
the segments created during transaction-1 in case of rollback.
I have noticed that references to the segments are stored in a segmentInfos
map. I was thinking about removing the segmentInfos map entries created during
transaction-1 in case of a rollback, but I don't know whether that's enough
and/or potentially dangerous.
I would really appreciate any comment about this idea and also about my
understanding of the Lucene indexing process.
If I/we could find a solution it would also benefit a JDBC Directory
implementation
Thanks.
Oscar
P.S.: If and when my implementation is fully functional, is there a place in
the Lucene project where I could release it? (Maybe the sandbox).

		

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Transactional Directories

2005-02-14 Thread Doug Cutting
[ Please ignore my previous message.  I somehow hit Send before typing 
anything! ]

Oscar Picasso wrote:
However with a relatively high number of random insertions, the cost of the
new IndexWriter / index.close() performed for each insertion is too high.
Did you measure that?  How much slower was it?  Did you perform any 
profiling?  Perhaps one could improve this by, e.g., disabling document 
index buffering, so that indexes are written directly to the final 
directory in this case, rather than first buffered in a RAMDirectory.

Unfortunately this is a common case for some kinds of applications and it is
where a transactional directory would be the most useful.
In such a case you would like to do something like this:
-- case B --
<pseudo-code>
new IndexWriter
 ...
+begin transaction-1
 create/update/delete objects in the database
 index.addDocument (related to the objects)
+ commit
...
+begin transaction-2
 create/update/delete objects in the database
 index.addDocument (related to the objects)
+ commit
...
indexWriter.close()
</pseudo-code>
The benefits would be to protect individual insertions while avoiding the cost
of creating a new IndexWriter each time.
It doesn't work, however. Here is my understanding. 

Suppose that in case B, transaction-1 fails and transaction-2 succeeds.
So you've got multiple threads?  Or are you proceeding in the face of 
exceptions?  Otherwise I would expect that if transaction-1 fails then 
you'd avoid transaction-2, no?

Also, you'd want to add a flush() call after each addDocument(), since 
document additions are buffered.  But a flush() is just what 
IndexWriter.close() does, so then things would not be any faster than 
creating a new IndexWriter for each document.

The bottom line is that there are optimizations to be made when batching 
additions.  Lucene's API is designed to encourage batching, so that 
these optimizations may be used.  If you don't batch, things will be 
somewhat slower.
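For example, a minimal sketch of the batched style the API encourages
(standard calls of the period; error handling omitted):

  // one IndexWriter for the whole batch, rather than one per document
  void indexBatch(Document[] docs) throws IOException {
    IndexWriter writer =
      new IndexWriter("index", new StandardAnalyzer(), false);
    for (int i = 0; i < docs.length; i++)
      writer.addDocument(docs[i]);  // additions are buffered in RAM
    writer.close();                 // buffered documents are flushed here
  }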

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Study Group (WAS Re: Normalized Scoring)

2005-02-07 Thread Doug Cutting
Paul Elschot wrote:
I learned a lot by adding some javadocs to such classes. I suppose Doug
added the Expert markings, but I don't know their precise purpose.
The Expert declaration is meant to indicate that most users should not 
need to understand the feature.  Lucene's API seeks to be both simple 
and flexible, but this is not always possible.  When flexibility is 
added that is not part of the simple API, it is deemed expert.  For 
example, we don't expect most users to need to write new Query 
implementations.  So Query methods that are only used internally in 
query processing are marked Expert, since they only need to be 
understood by those implementing a Query.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: whither sandbox

2005-02-04 Thread Doug Cutting
Erik Hatcher wrote:
Also, we should package a lucene-XX-all.zip/.tar.gz that includes all 
the contrib pieces, allowing someone to simply download Lucene and 
all the packaged contrib pieces at once.
I'll go further: that should be the only download.  We should avoid 
having a bunch of different downloads.  Ant used to require you to 
separately download the optional tasks, but that was a pain.  Now 
they're included.

So we will have at least:
  lucene-XX.tar.gz
  lucene-src-XX.tar.gz
But should we add the following?
  lucene-contrib-XX.tar.gz
  lucene-contrib-src-XX.tar.gz
Or should we just bundle these into the first two?  I vote for bundling. 
There will still be separate jar files, so folks only have to deploy 
what they need.  Download size is not an issue these days.  Thoughts?

Also, we should combine the javadoc into a single tree, with a Core 
group followed by a Contrib group:

http://java.sun.com/j2se/1.4.2/docs/tooldocs/solaris/javadoc.html#group
As an example, Nutch does this for Core and Plugin:
http://www.nutch.org/docs/api/overview-summary.html
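Concretely, the grouping could be done with javadoc's -group option,
along these lines (the package patterns here are assumptions):

  javadoc -group "Core" "org.apache.lucene*" \
          -group "Contrib" "org.apache.lucene.sandbox*" \
          -d docs/api ...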
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [PROPOSAL] Lucene to search.apache.org

2005-02-02 Thread Doug Cutting
Erik Hatcher wrote:
Hmmm good point.  I hadn't considered access control.  A migration 
will be performed later today, and I think it will initially be a test 
migration for me to verify.  I'll double-check with Justin, who's doing 
the conversion, on how access control will be initially configured.
Have a look at svn.apache.org:/x1/svn/asf-authorization.  The way other 
projects do this is to have a project-pmc group that has access to 
/project, then have project-subproject groups that have access to 
/project/subproject.  So I think we should start with the java code 
tree in /lucene/java and put current Lucene committers in the 
lucene-java group.  We know we want to have subprojects (nutch, .net, 
etc.) so let's avoid having to re-org when we add the first one.
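In that file's format, the structure might look roughly like this
(group membership elided; this is a sketch, not the actual file):

  [groups]
  lucene-pmc = ...
  lucene-java = ...

  [/lucene]
  @lucene-pmc = rw

  [/lucene/java]
  @lucene-java = rw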

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Fwd: [PROPOSAL] Lucene to search.apache.org

2005-02-01 Thread Doug Cutting
Erik Hatcher wrote:
The decision was a bit slow to get out, but Lucene has been approved for 
TLP.
Thanks for pushing this through!
I 
propose we simply import our two CVS repositories in with all of 
jakarta-lucene as the root of the repository and jakarta-lucene-sandbox 
under sandbox in the root.  We can shuffle things around once we get 
it all into svn using svn move nicely.

Thoughts?
I think we want Java Lucene to be a sub-project of Lucene.  So the 
repository should be something like:

  https://svn.apache.org/repos/asf/lucene/java
Then if we add dotLucene, it will go in something like:
  https://svn.apache.org/repos/asf/lucene/dot
In each of these we'll have subdirectories named trunk, branches, tags, 
etc.  Folks will generally check out and work on 'trunk'.

We should also have a repository for the top-level project's website. 
This could be:

  https://svn.apache.org/repos/asf/lucene/site
This does not need subdirectories.
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [PROPOSAL] Lucene to search.apache.org

2005-02-01 Thread Doug Cutting
Erik Hatcher wrote:
On Feb 1, 2005, at 3:13 PM, Doug Cutting wrote:
I think we want Java Lucene to be a sub-project of Lucene.  So the 
repository should be something like:

  https://svn.apache.org/repos/asf/lucene/java

I already put in the request for this initial svn structure:
/asf/lucene
    /trunk/
    /sandbox/
    /branches/
    /tags/
svn move is an inexpensive and easy operation - so let's run with this 
structure to get our existing stuff in, and refactor it ourselves once 
we're in.
Okay, if you like.  Anyone on the Lucene PMC will be able to reorganize. 
But once we import the code we'll want to rearrange things before we 
give committers access.  Non-PMC committers will generally only have 
access to subprojects, and perhaps the site.  Changes to access 
must go through infrastructure.  So we should re-org before we start 
adding committers, no?

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Doug Cutting
Doug Cutting wrote:
It would translate a query "t1 t2" given fields f1 and f2 into 
something like:

+(f1:t1^b1 f2:t1^b2)
+(f2:t1^b1 f2:t2^b2)
Oops.  The first term on that line should be f1:t2, not f2:t1:
+(f1:t2^b1 f2:t2^b2)
f1:"t1 t2"~s1^b3
f2:"t1 t2"~s2^b4
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-31 Thread Doug Cutting
Chuck Williams wrote:
That expansion is scalable, but it only accounts for proximity of all
query terms together.  E.g., it does not favor a match where t1 and t2
are close together while t3 is distant over a match where all 3 terms
are distant.  Worse, it would not favor a match with t1 and t2 in a
short title, and t2 and t3 proximal in the content (with no occurrence
of t1 in the content) vs. a match with t1 and t2 in the title and t2 and
t3 distant in the content.
Right.  I just mentioned this same weakness in a message replying to David.
   Is that distinct from my goal to develop an improved
   MultiFieldQueryParser for Lucene 2.0?
Not distinct, but I think the first step is to decide on the expansion
we want.  Unless somebody has a better idea, I think the best solution
is a new Query class that simultaneously supports multiple fields, term
diversity and term proximity.  It would be similar to SpansQuery, but
generalized.  It would be like BooleanQuery in the sense that individual
query clauses could be required or not.  Then, default AND could be
achieved by expanding queries to all-required.
With this new Query class, revised versions of QueryParser and
MultiFieldQuery parser would generate it.
Am I way off-base somewhere and/or is there a simpler approach to the
same end?
It just sounds like a lot to bite off at once.
What did you think of my DensityPhraseQuery proposal?  We could use this 
in place of a PhraseQuery w/ slop=infinity.  We'd need just one per field.

The straight boolean clauses are required for two reasons:
  1. To make sure that every query term appears in some field; and
  2. To reward a term that occurs frequently in a field, but near no 
other query terms.

Sure, idf is important enough to evaluate independently as a factor.
However, I do not think these considerations are orthogonal.  For
example, I'm putting a lot of weight in field boosting and don't want
the preference of title matches over body matches to be overwhelmed by
the idf's.
If field boosting needs to then trump idf, we should be able to deal 
with that when we subsequently tune field boosting, no?  We can, e.g., 
square the field boosts if we need to.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-28 Thread Doug Cutting
Christoph Goller wrote:
The similarity specified for the search has to be modified so that both
idf(...) AND  queryNorm(...) always return 1 and as you say everything
except for tf(term,doc)*docNorm(doc) could be precompiled into the boosts
of the rewritten query. coord/tf/sloppyFreq computation would be done
locally by the Searchables as specified for this search.
So the changes for the MultiSearcher bug would remain locally in 
MultiSearcher.
I think this would be a very clean solution. What do others think?
This sounds good to me!
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [PROPOSAL] Lucene to search.apache.org

2005-01-17 Thread Doug Cutting
Maybe we should just call it lucene.apache.org, and move the current 
Lucene project to lucene.apache.org/java?  The other projects we imagine 
adding (Nutch, DotLucene, CLucene, etc.) are all Lucene-related, no? 
Lucene has a pretty good brand name...

Doug
Otis Gospodnetic wrote:
ir.apache.org is what I was thinking, too.  +1 for IR from me.  It's
broad enough to serve as a home for other related projects, not just
the initial group of them.
Otis
--- Andrzej Bialecki [EMAIL PROTECTED] wrote:

Scott Ganyo wrote:
Not especially creative, but index.apache.org looks to be
available.
S
On Jan 17, 2005, at 3:29 AM, Erik Hatcher wrote:

Looks like we should consider alternate names.  Suggestions??
ir.apache.org
(not Infra-Red, but Information Retrieval)
--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-14 Thread Doug Cutting
Wolf Siberski wrote:
Doug Cutting wrote:
 So, when a query is executed on a MultiSearcher of RemoteSearchables, 
the following remote calls are made:

1. RemoteSearchable.rewrite(Query) is called
After that step, are wildcards replaced by term lists?
Yes.
I haven't taken a look at the rewrite() methods. Could
you explain to me what is this step doing from a high-level
perspective. I'm not sufficiently familiar with Lucene yet.
Lucene has a few primitive query types: TermQuery, PhraseQuery, 
SpanQuery, and BooleanQuery.  Other derived query types (RangeQuery, 
FuzzyQuery, WildcardQuery) are rewritten into primitive queries before 
evaluation.  Rewriting typically involves expanding the derived query 
into a BooleanQuery of TermQueries.
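For example (illustrative only, using the public rewrite() API):

  IndexReader reader = IndexReader.open("index");
  Query derived = new WildcardQuery(new Term("title", "luc*"));
  // expands to a BooleanQuery of TermQueries over the matching terms
  Query primitive = derived.rewrite(reader);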

2. RemoteSearchable.docFreq(Term) is called for each term in the 
rewritten query while constructing a Weight.
We could optimize this step by sending a list of terms
and receiving the corresponding list of docFreqs.
Yes.  And this could entirely be hidden within the RemoteSearchable 
implementation.  For example, the RPC made by its rewrite() 
implementation could also return the docFreq() of each term in the 
rewritten query, and these could be squirrelled away in a cache, which 
would then be accessed by the docFreq() method, so that only a single 
RPC is required to implement both rewrite() and all of the docFreq() calls.
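A rough sketch of that caching idea (the class and method names here
are hypothetical, not an existing API; the wire format is not shown):

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;

  // client-side stub: one RPC serves both rewrite() and docFreq()
  public abstract class CachingRemoteSearchable {
    private Map docFreqCache = new HashMap();  // Term -> Integer

    // hypothetical RPC: returns the rewritten query and fills the cache
    // with the docFreq of every term it contains
    protected abstract Query rewriteAndFetchDocFreqs(Query q, Map cache)
        throws IOException;

    public Query rewrite(Query original) throws IOException {
      return rewriteAndFetchDocFreqs(original, docFreqCache);
    }

    public int docFreq(Term term) {
      // answered from the cache; no additional RPC needed
      return ((Integer)docFreqCache.get(term)).intValue();
    }
  }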

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-14 Thread Doug Cutting
Chuck Williams wrote:
Doug Cutting wrote:
It would indeed be nice to be able to short-circuit rewriting for
queries where it is a no-op.  Do you have a proposal for how this
could be done?
First, this gets into the other part of Bug 31841.  I don't believe
MultiSearcher.rewrite() is ever called.  Rewriting is done in the
Weight's, which invoke the rewrite() method of the Searcher, which is
always the Searcher invoked by the MultiSearcher, not the MultiSearcher 
itself.
This would be fixed by the proposal under consideration.  Weights would 
be constructed much earlier, using the top-level Searcher, so rewrites 
would use this too.

In fact, MultiSearcher.rewrite() is broken.  It requires
Query.combine() which is unsupported except for the derived queries
(i.e., those for which rewriting is not a no-op).  When I added
topmostSearcher to get the Weight's to call the MultiSearcher.docFreq(),
that also caused them to call MultiSearcher.rewrite() which blows up on,
for example, a simple TermQuery, because there is no
TermQuery.combine().  That's why my patch contains a new default
implementation for Query.combine() (which as noted in the bug report is
probably not a good idea in general).
So, I don't believe there is any valid rewrite() implementation for
MultiSearcher to start from, unless I've completely misunderstood
something.
It looks like MultiSearcher.rewrite() was never implemented correctly 
since it was never called -- a latent bug.  It only needs to be called 
when queries are rewritten to something different:

  public Query rewrite(Query original) throws IOException {
    Query[] queries = new Query[searchables.length];
    boolean changed = false;
    for (int i = 0; i < searchables.length; i++) {
      Query rewritten = searchables[i].rewrite(original);
      changed |= !rewritten.equals(original);
      queries[i] = rewritten;
    }
    if (changed) {
      return original.combine(queries);
    } else {
      return original;
    }
  }
Then we'll need an implementation of combine() for all query types.  The 
implementation for BooleanQuery is fairly simple: combine() each of the 
corresponding clauses.  For TermQuery, PhraseQuery and SpanQuery combine 
should create a deduplicated OR.  Derived queries already have an 
implementation.
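A sketch of that deduplicated OR for a primitive query type (not the
committed code; it assumes this era's three-argument BooleanQuery.add):

  public Query combine(Query[] queries) {
    HashSet uniques = new HashSet();
    for (int i = 0; i < queries.length; i++)
      uniques.add(queries[i]);
    if (uniques.size() == 1)      // every searchable returned the same query
      return (Query)uniques.iterator().next();
    BooleanQuery result = new BooleanQuery();
    Iterator it = uniques.iterator();
    while (it.hasNext())          // OR the distinct variants together
      result.add((Query)it.next(), false, false);
    return result;
  }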

To address the question above, RemoteSearchable.rewrite() should be a
no-op, i.e. always return this.  For good error handling, it should
verify that the query does not require rewriting.  This requires some
mechanism to determine whether or not a query requires rewriting.  The
challenge here is that some query types have a non-trivial rewrite()
method not because they require rewriting, but because they might have
subqueries that require rewriting (e.g., BooleanQuery).  Other query
types (e.g., MultiTermQuery) always require rewriting, while those that
implement Weight's never require it.  I think an upward incompatibility
is required in the API to address this.
If that is acceptable, then this could work:
  1.  Add a new interface called Rewritable that specifies a boolean
rewriteRequired() method.
  2.  Have Query implement Rewritable but NOT provide an implementation
for rewriteRequired().  This will force all applications to add support
for this in order to upgrade.
  3.  Change all the Weight's to call Query.maybeRewrite() instead of
Query.rewrite().
  4.  Have Query.maybeRewrite() only call Query.rewrite() if
Query.rewriteRequired() is true.
  5.  Have RemoteSearchable.maybeRewrite() throw an Exception if
Query.rewriteRequired() is true.
  6.  Implement rewriteRequired() for all the built-in Query types
(which is either true for derived queries, false for primitive queries,
or an or of rewriteRequired() for all the subqueries).
That sounds hairy.  Why not just add a single new method:
   boolean Query.isRewritten() { return true; }
Then override this in TermQuery, PhraseQuery and SpanQuery to return 
false, and in BooleanQuery to walk its clauses and return true iff any 
of them return true.  As an optimization, RemoteSearchable could avoid 
calling rewrite() when this is true.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: JDK code in the codebase

2005-01-14 Thread Doug Cutting
Erik Hatcher wrote:
The questions still remain, though, and lawyers do want to know the  
answers:

 - How did JDK code get into Lucene's codebase to begin with?
I put it there in a moment of ignorance way back as a hack in order to 
make things run in an older version of the JVM.

http://cvs.sourceforge.net/viewcvs.py/lucene/lucene/com/lucene/util/Arrays.java?rev=1.1.1.1&view=auto
 - Is there any more lingering?
Not to my knowledge.
Sorry, my bad.
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Doug Cutting
Chuck Williams wrote:
I was thinking of the aggressive version with an index-time solution,
although I don't know the Lucene architecture for distributed indexing
and searching well enough to formulate the idea precisely.
Conceptually, I'd like each server that owns a slice of the index in a
distributed environment to have the complete docFreq data, i.e. to have
docFreq's that represent the collection as a whole, not just its index
slice.  If this was achieved at index-time, then the current
implementation would work at query time.  I.e., MultiSearch could send
the queries out to the remote Searcher's and these Searcher's could
consult their local indexes for the correct docFreq's to use.
This is different from what I described.  I described keeping a docFreq 
cache at the central dispatch node, while you describe replicating that 
cache on every search node.  I don't see the advantage in this 
replication.  It is both more efficient to maintain a single cache, and 
faster to search, since fewer dictionary lookups are involved.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Doug Cutting
Chuck Williams wrote:
There needs to be a way to create the aggregate docFreq table and keep
it current under incremental changes to the indices on the various
remote nodes.
I think you're getting ahead of yourself.  Searchers are based on 
IndexReaders, and hence docFreqs don't change until a new Searcher is 
created.  So long as this is true, and the central dispatch node uses a 
searcher, then a simple cache, perhaps one that is pre-fetched, is all 
that's feasible.  It shouldn't take that long to pre-fetch the cache 
when indexes are re-opened.  Let's run before we sprint; and hey, let's 
even walk first, by fixing the bug in question.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-12 Thread Doug Cutting
Wolf Siberski wrote:
Chuck Williams wrote:
This is a nice solution!  By having MultiSearcher create the Weight, it
can pass itself in as the searcher, thereby allowing the correct
docFreq() method to be called.  This is similar to what I tried to do
with topmostSearcher, but a much better way to do it.
This still wouldn't work for RemoteSearchables, except if you allow
call-backs from each RemoteSearchable to the MultiSearcher.
I don't see what callbacks are required.  When the Weight is constructed 
 it invokes docFreq for each term, which, if RemoteSearchables are 
involved, will result in IPC calls to those RemoteSearchables.  Then, 
the Weight object is serialized to each RemoteSearchable and a TopDocs 
is returned.  Where are the callbacks?  These are only required for 
HitCollector-based methods, which are not advised with RemoteSearchable.

For
this, MultiSearcher would have to be remotely callable, too.
A MultiSearcher can be made remotely callable by wrapping it in a 
RemoteSearchable, if that's required.  But I'm not sure that's your 
concern here.

As I said
above, IMHO we should stay with a simple client/server model here.
I think we would still have a simple model, unless I'm missing something.
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: what if the IndexReader crashes, after delete, before close.

2005-01-11 Thread Doug Cutting
Sigh.  This stuff would get a lot simpler if we were able to use Java 
1.4's FileLock.  Then locks would be automatically cleared by the OS if 
the JVM crashes.

Should we upgrade the JVM requirements to 1.4 for Lucene's 1.9/2.0 
releases and update the locking code?

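For reference, a minimal sketch of what 1.4-style locking could look
like with java.nio (not the current Lucene lock code; lockFile is
assumed to be the lock file's path):

  RandomAccessFile raf = new RandomAccessFile(lockFile, "rw");
  FileLock lock = raf.getChannel().tryLock();  // null if another process holds it
  if (lock == null)
    throw new IOException("Index locked for write: " + lockFile);
  try {
    // ... modify the index ...
  } finally {
    lock.release();  // the OS also releases it if the JVM crashes
    raf.close();
  }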
Doug
Luke Shannon wrote:
Here is how I handle it.
The Indexer is a Runnable. All the members it uses are static. The run()
method calls a synchronized method called go(). This kicks off the indexing.
Before you even get to here, the method in the CMS code that created the
thread object and instantiated the index is also synchronized.
Here is the code that handles the potential lock file that may be left
behind from a Reader or Writer.
Note: I found I had to check if the index existed before checking if it was
locked. If I checked if it was locked and the index had not been created yet
I got an error.
// if we have gotten to here, this is the only index running.
// the index should not be locked. if it is, the lock is stale
// and must be released before we can continue
try {
if (index.exists()  IndexReader.isLocked(indexFileLocation)) {
Trace.ERROR(INDEX INFO: Had to clear a stale index lock);
IndexReader.unlock(FSDirectory.getDirectory(index, false));
}
} catch (IOException e3) {
Trace.ERROR(INDEX ERROR: IMPORTANT. Was unable to clear a stale index lock:
 + e3);
}
HTH
Luke
- Original Message - 
From: Peter Veentjer - Anchor Men [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Tuesday, January 11, 2005 3:24 AM
Subject: RE: what if the IndexReader crashes, after delete, before close.


-Oorspronkelijk bericht-
Van: Luke Shannon [mailto:[EMAIL PROTECTED]
Verzonden: maandag 10 januari 2005 15:46
Aan: Lucene Users List
Onderwerp: Re: what if the IndexReader crashes, after delete, before
close.

One thing that will happen is the lock file
will get left behind. This means when you start
back up and try to create another Reader you will
get a file lock error.

I have figured out that part the hard way ;) Why can't I access my index
anymore?? Ahh... the lock file.

Our system is threaded and synchronized.
Thus when a Reader is being created I know
it is the only one (the Writer comes after
the reader has been closed). Before creating
it I check if the Index is locked. If it is,
I forcefully clear it. This prevents the above
problem from happening.

You can have more than one reader open at any time, even while a delete or
add is in progress. But you can't use a reader where documents are
deleted (IndexReader) and added (IndexWriter) at the same time. If you
don't have other threads doing delete/add you won't have to synchronize
anything.
And how do you synchronize on it? I have applied the ReadWriteLock from
Doug Lea's concurrency library, after I had built my own
synchronization brick and somebody pointed out that I was implementing
the ReadWriteLock. But at the moment I don't do any synchronization.
And I want to have a component that is executed when the system is started
and knows what to do if there is rubbish in the index directory. I want
that component to restore my index to a usable version (even a small
loss of information is acceptable, because everything is checked once in
a while, and user-added information is going to be stored in the
database, so nothing gets lost; the index can be rebuilt).

Luke
- Original Message -
From: Peter Veentjer - Anchor Men [EMAIL PROTECTED]
To: lucene-user@jakarta.apache.org
Sent: Saturday, January 08, 2005 4:08 AM
Subject: what if the IndexReader crashes, after delete, before close.
What happens to the Index if the IndexReader crashes after I have deleted
documents and before I have called close? Are the deletes ignored? Is the
Index screwed up? Is the filesystem screwed up (when a document is deleted,
new delete-files appear), so are the delete-files still there (and can these
be ignored the next time)? Can I restore the index to the previous state
just by removing those delete-files?



Re: what if the IndexReader crashes, after delete, before close.

2005-01-11 Thread Doug Cutting
Terry Steichen wrote:
Would it be 
possible to optimize the operation to use 1.4 runtime features but 
retain the option, if desired, to run in a legacy (1.3) environment, 
perhaps in a degraded mode?
Lucene 1.4.3 is a degraded mode, no?
There are still back-compatibility issues.  To be safe, Lucene 2.0 
should still respect Lucene 1.x file locks.  So FSDirectory's 
Lock.obtain() should fail if a lock file exists, unless it's a lock file 
written by Lucene 2.0 and java.nio.FileLock says it's unlocked.  To 
implement this I guess we'd need to store a version number in the lock 
files.  Does that sound right?

Doug


Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-11 Thread Doug Cutting
Chuck Williams wrote:
As Wolf does, I hope a committer with deep knowledge of Lucene's design
in this area will weigh in on the issue and help to resolve it.
The root of the bug is in MultiSearcher.search().  This should construct 
a Weight, weight the query, then score the now-weighted query.

Here's a potential way to fix it:
1. Replace all of the
   ... search(Query, ...)
methods in Searchable.java with
   ... search(Weight, ...)
methods.
2. Add search(Query, ...) convenience methods to Searcher.java which do 
something like:

  public ... search(Query query, ...) {
 return search(query.weight(this), ...);
  }
3. Update the search() methods in IndexSearcher, MultiSearcher and 
RemoteSearchable to operate on Weight's instead of queries.

Does that make sense?
Doug


Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

2005-01-11 Thread Doug Cutting
Chuck Williams wrote:
This is a nice solution!  By having MultiSearcher create the Weight, it
can pass itself in as the searcher, thereby allowing the correct
docFreq() method to be called.
Glad to hear it at least makes sense... Now I hope it works!
I'm still left wondering if having MultiSearcher query all the
RemoteSearchable's on every call to docFreq() within each TermQuery,
PhraseQuery, SpanQuery and PhrasePrefixQuery is the way to go long term,
although it seems like the best thing to do right now.  The calls only
happen when the Weight's are created, so maybe it's not too bad.  Longer
term, it might be better to distribute the idf information out to the
RemoteSearchable's to minimize the required number of remote accesses
for each Query.
I'm not sure exactly what you mean by "distribute the idf information 
out to the RemoteSearchable".  I think one might profitably implement a 
docFreq() cache in RemoteSearchable.  This could be a simple cache, or 
it could be fairly aggressive, pre-fetching all the docFreqs.  (As an 
optimization, it could only pre-fetch those greater than 1, and, when a 
term is not in the cache, assume its docFreq is 1.  As a lossy 
optimization, it could only pre-fetch those greater than N, and somehow 
estimate those not in the cache.)  Is that what you meant?
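
For concreteness, a minimal sketch of the simple-cache variant (the class
name is hypothetical; the pre-fetching variants would layer on top of it):

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Searchable;

  public class CachingDocFreqs {
    private final Searchable remote;
    private final Map cache = new HashMap();   // Term -> Integer

    public CachingDocFreqs(Searchable remote) {
      this.remote = remote;
    }

    /** Answers from the cache when possible, avoiding a remote call. */
    public synchronized int docFreq(Term term) throws IOException {
      Integer cached = (Integer) cache.get(term);
      if (cached == null) {
        cached = new Integer(remote.docFreq(term));  // one IPC round trip
        cache.put(term, cached);
      }
      return cached.intValue();
    }
  }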

Doug


Re: auto-filters?

2005-01-03 Thread Doug Cutting
markharw00d wrote:
If we intend to make more use of filters this may be an appropriate time 
to raise a general question I have on their use.  Is there a danger in 
tying them to a specific implementation (java.util.BitSet)?
I do not object in principle to replacing BitSet with an interface, e.g. 
DocIdSet.  Please feel free to submit a more detailed proposal for this. 
 I think this is not so performance intensive that an extra method call 
will be significant.  If it is, then we can simply have Lucene's 
BitVector implement this interface directly.

We must be careful to preserve the distinction between Filter and 
DocIdSet.  Filters are query-independent and serializable, passed 
across the wire with RemoteSearchable requests.  DocIdSets are 
query-dependent, should be computed and cached locally, and not passed 
over the wire.
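
A minimal sketch of what such an interface might look like (the method
names are hypothetical, not an agreed proposal):

  /** An abstraction over java.util.BitSet for filter results, so that
      Lucene's own BitVector, or sparser structures, could implement
      it too. */
  public interface DocIdSet {
    /** True if the given document may match. */
    boolean get(int docId);

    /** The number of documents this set covers. */
    int size();
  }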

Doug


Re: CFS file and file formats

2005-01-03 Thread Doug Cutting
Bernhard Messer wrote:
Why not implement a small utility class, e.g. CompoundFileUtil.java, 
within the org.apache.lucene.index package?  This class could be public 
and implement the necessary functionality.  This is what I would prefer, 
because we don't have to change the visibility of CompoundFileReader or 
other parts of the API.  The other option would be to add a public static 
method to the IndexReader class.  But I don't like to overwhelm IndexReader 
with a method that just a very small audience would use.
Currently IndexWriter is the only public place in the API where the 
compound format appears.  So, until we decide to expose index formats 
more systematically, I think this should stay at the IndexReader level.

Thus I would prefer a main() on IndexReader that had various commands, 
perhaps something like:

  java org.apache.lucene.index.IndexReader <dir> <cfs> list
  java org.apache.lucene.index.IndexReader <dir> <cfs> extract
  java org.apache.lucene.index.IndexReader <dir> unlock
  java org.apache.lucene.index.IndexReader <dir> list-segments
etc.
Doug


auto-filters?

2005-01-02 Thread Doug Cutting
Filters are more efficient than query terms for many things. For
example, a RangeFilter is usually more efficient than a RangeQuery and
has no risk of triggering BooleanQuery.TooManyClauses. And Filter
caching (e.g., with CachingWrapperFilter) can make otherwise expensive
clauses almost free, after the first time.
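
For example, a minimal sketch of that pattern (the field name and bounds
are hypothetical):

  import java.io.IOException;
  import org.apache.lucene.search.CachingWrapperFilter;
  import org.apache.lucene.search.Filter;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.RangeFilter;

  public class DateFilteredSearch {
    // Build once and reuse: CachingWrapperFilter caches the bits per
    // IndexReader, so repeated searches skip the term enumeration.
    private static final Filter DATE_FILTER = new CachingWrapperFilter(
        new RangeFilter("date", "20040101", "20041231", true, true));

    public static Hits search(IndexSearcher searcher, Query textQuery)
        throws IOException {
      return searcher.search(textQuery, DATE_FILTER);
    }
  }
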
But filters are not obvious. Many Lucene applications that would
benefit from them do not use them. Wouldn't it be better if we could
automatically spot Query clauses which are amenable to
filter-conversion? Then applications would just get faster and throw
fewer exceptions, without having to know anything about filters.
From a user level I think this might work as follows:
1. Query clauses which have a boost of 0.0 are candidates for filter
conversion, since they cannot contribute to the score.  We should
perhaps make boost=0 the default for certain classes of query (e.g.,
perhaps RangeQuery) or make subclasses with this as the default
(KeywordQuery).
2. One should be able to specify a filter cache size per IndexSearcher,
with the notion that each filter cached uses one bit per document.
I'm not yet clear how this should be implemented.  It might be based on
something like:

  public interface DocIdCollector {
    void collectDocId(int docId);
  }

  /** Collects all DocIds that match the query.  DocIds are collected
      in no particular order and may be collected more than once.
      Returns true if this feature is supported, false otherwise. */
  public boolean Query.getFilterBits(IndexReader, DocIdCollector);
Implementing this for various query classes is straightforward.
TermQuery might return false for all but very common terms (occurring in,
e.g., greater than 10% of documents).  RangeQuery would use the logic
that's currently in RangeFilter.  Etc.
BooleanScorer could then use this method to create a filter bit-vector
for all of the boost=0.0 clauses, then use that to filter the other
boost!=0 clauses.  The bit vectors could be cached in the scorer (using
a LinkedHashMap), although I'm a little fuzzy on exactly how the cache
API would work.
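
One way the cached bits might be built, as a minimal sketch against the
proposed (not yet existing) DocIdCollector interface above:

  import java.util.BitSet;

  /** Accumulates collected doc ids into a java.util.BitSet. */
  public class BitSetCollector implements DocIdCollector {
    private final BitSet bits;

    public BitSetCollector(int maxDoc) {
      bits = new BitSet(maxDoc);
    }

    public void collectDocId(int docId) {
      bits.set(docId);   // duplicates and out-of-order ids are harmless
    }

    public BitSet getBits() {
      return bits;
    }
  }
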
I'm not convinced the above is the best design, but I am convinced
Lucene needs a solution for this.  It could automatically eliminate most
causes of BooleanQuery.TooManyClauses (e.g., from date ranges), and also
make many required keyword clauses (document type, language, etc.) much
faster.
What do others think?  Does anyone have a better design or improvements
to what I describe?
Doug


Re: DefaultSimilarity 2.0?

2004-12-20 Thread Doug Cutting
Chuck Williams wrote:
Finally, I'd suggest picking content that has multiple fields and allow
the individual implementations to decide how to search these fields --
just title and body would be enough.  I would like to use my
MaxDisjunctionQuery and see how it compares to other approaches (e.g.,
the default MultiFieldQueryParser, assuming somebody uses that in this
test).
I think that would be a good contest too, but I'd rather first just 
focus on the ranking of single-field queries.  There are a number of 
issues that come up with multi-field queries that I'd rather postpone in 
order to reduce the number of variables we test at one time.

Doug


Re: Migration to SVN?

2004-12-20 Thread Doug Cutting
Garrett Rooney wrote:
The least effort way of doing that would be to include both the core 
and sandbox under the same trunk, but again, that implies that you 
ALWAYS tag and branch them together, and sometimes you may not want to 
do that.
I think we should always branch these together.  To my thinking, the 
distinction between core and sandbox is primarily one of 
packaging: the core should be a separate jar, as should each of the 
sandbox elements.  But all should be released and tested as a unit, to 
ensure compatibility.

I think the term sandbox is misleading and has outlived its 
usefulness.  We should probably rename this to something like utils or 
optional.  These should be treated much like Ant's optional tasks: 
package them as separate jars, segregate their documentation, but don't 
branch them separately.  Perhaps we should also make it so that a failed 
sandbox build or unit test does not stop a build: the quality guarantee 
need not be as high for sandbox items.

Doug


DefaultSimilarity 2.0?

2004-12-17 Thread Doug Cutting
Chuck Williams wrote:
Another issue will likely be the tf() and idf() computations.  I have a
similar desired relevance ranking and was not getting what I wanted due
to the idf() term dominating the score. [ ... ]
Chuck has made a series of criticisms of the DefaultSimilarity 
implementation.  Unfortunately it is difficult to quickly evaluate 
these, as it requires relevance judgements.  But, still, we should 
consider modifying DefaultSimilarity for the 2.0 release if there are 
easy improvements to be had.  But how do we decide what's better?

Perhaps we should perform a formal or semi-formal evaluation of various 
Similarity implementations on a reference collection.  For example, for 
a formal evaluation we might use one of the TREC Web collections, which have 
associated queries and relevance judgements.  Or, less formally, we 
could use a crawl of the ~5M pages in DMOZ (I would be glad to collect 
these using Nutch).

This could work as follows:
  -- Different folks could download and index a reference collection, 
offering demonstration search systems.  We would provide complete code. 
 These would differ only in their Similarity implementation.  All 
implementations would use the same Analyzer and search only a single field.
  -- These folks could then announce their candidate implementations and 
let others run queries against them, via HTTP.  Different Similarity 
implementations could thus be publicly and interactively compared.
  -- Hopefully a consensus, or at least a healthy majority, would agree 
on which was the best implementation and we could make that the default 
for Lucene 2.0.

Are there folks (e.g., Chuck) who would be willing to play this game? 
Should we make it more formal, using, e.g., TREC?  Does anyone have 
other ideas how we should decide how to modify DefaultSimilarity?

Doug


Re: Explanations and overridden similarity

2004-12-16 Thread Doug Cutting
Dan Climan wrote:
Shouldn't the call to Similarity.decodeNorm be replaced with a call to
Similarity.getDefault().decodeNorm
decodeNorm is a static method.
Doug


Re: potential new Lucene logo

2004-12-13 Thread Doug Cutting
Murray Altheim wrote:
I thought I'd have a go at the
Lucene logo, not to change it markedly but clean it up so that it
is based on an existing font. This potential Lucene logo is based
on an ITC font called Magneto Bold Extended, which you can see here:
http://www.identifont.com/show?72W
I modified the 'c' slightly because at small sizes it starts
looking too much like an 'e', especially since in the Lucene
logo the baseline is extended across the entire logo. I experimented
with several border thicknesses, settling on one that was visible
at small sizes but not too thick at larger sizes. Here's a sample
of the result:
http://www.altheim.com/murray/img/lucene-20b-320w.jpg
Thanks!  This looks nice to me.
I've posted a zip file containing a number of sizes plus the
originals, which are in SVG and PNG:
   http://www.altheim.com/murray/img/lucene_logo.zip (198K)
The SVG file is the source image, and after conversion into a
raster format I did a bit of hand cleaning to end up with the
PNG, which is a 9394x961px image at 72dpi. Being available as
a very large PNG, it can then be used for T-shirts, etc. I
consider the PNG image the master, since it's had some cleanup
in evening out lines and curves, etc.
I just put the original, scalable artwork for the existing logo at:
http://jakarta.apache.org/lucene/lucene.eps
I have used the Gimp in the past to generate high-resolution PNG files 
from this when needed.

The one thing about the logo (either the existing one or the
one I've done) is that neither does too well when shrunk small.
The source PNG can be reduced to any size, but after reduction
to small sizes often needs some hand cleanup. This could be
fixed by no longer using an outline around the font, but I
didn't want to take that kind of liberty with the design,
especially since then it would be a single-colour font. I kinda
like the current design, as it reminds me of a logo from a
recreational vehicle (camper, caravan, etc.)
The design was originally donated by Jeff Boozer and Joy Busse in April 
of 1998.  I asked for something that looked like 60's refrigerator chrome.

I realize that sometimes people feel (understandably) proprietary
about a given image, and I don't mean to push this image on anyone.
If the group wants to use it, I'm fine with that, and release any
rights on its use. I'll leave the zip file online for several
weeks, then it will be removed.
I don't feel too proprietary.  Do folks prefer Murray's reworked Lucene 
logo or the original logo?

Doug


Re: setLowercaseWildcardTerms and FuzzyQueries

2004-12-13 Thread Doug Cutting
Daniel Naber wrote:
I'm aware that the Wildcard name won't 
fit well anymore, suggestions for a better name are welcome.
Expanded?
Doug


Re: Boolean Scorer

2004-12-10 Thread Doug Cutting
Christoph Goller wrote:
I think we should change BooleanScorer. An easy way would be to sort the 
bucket
list before it is used. Do you think that would affect performance 
dramatically?
I think it would make it slower.
Otherwise we should reimplement BooleanScorer. I haven't looked into the
DisjunctionScorer patch in Bugzilla yet. Maybe it's a good starting point.
I think we should incorporate Paul's code into CVS.  This algorithm may 
be slower in some cases, but it may also be faster in some cases.  We 
should add a static method to switch back to the old implementation, and 
encourage folks to benchmark their code.  If it proves no slower then we 
could remove the old implementation altogether.

What do others think?
Paul's code is in:
http://issues.apache.org/bugzilla/show_bug.cgi?id=31785
Has anyone tried this?
Doug


Re: Release 1.4.3

2004-12-06 Thread Doug Cutting
Christoph Goller wrote:
Doug, could you please move api/ to api.old/ and api.new/ to api/
Done.
Doug


Re: Release 1.4.3

2004-11-26 Thread Doug Cutting
Christoph Goller wrote:
I think I should finally make Release 1.4.3.
Great!
I presume default.properties no longer exists. I'll just fill in
1.4.3 as the version in build.xml before building it. Is this OK?
I build releases with something like:
  ant -Dversion=1.4.3 clean dist
So that it doesn't matter what version is in build.xml.  So you 
shouldn't need to change build.xml for this release.

I think there is less confusion if, when folks build Lucene themselves, 
it does not, by default, have the same name as a released version.  Thus 
if they patch things and do not update build.xml (which is likely) the 
generated jar files will have "rc1-dev" in their names, clearly 
identifying them as a non-released version.

Releases (binaries and sources) are no longer on www.apache.org
/www/jakarta.apache.org/builds/jakarta-lucene/release/
Only the web-page and the documentation (Javadoc) is there.
Instead they are on cvs.apache.org
/www/cvs.apache.org/dist/jakarta/lucene
Is this correct?
Yes.  Release directories should now be made under 
www.apache.org:/www/cvs.apache.org/dist/jakarta/lucene/.

Two other things that are not in the wiki instructions:
  1. copy the lucene jar into the distribution too
 cp build/lucene-X.X.jar dist
  2. compute MD5 sums
 (cd dist; md5sum lucene* > MD5.txt)
If you have time, please update the wiki instructions too.
Thanks!
Doug



Re: GIS

2004-11-16 Thread Doug Cutting
Guillermo Payet wrote:
The fact that Lucene stores and indexes (or so it seems) all terms 
as Strings, and that there is no NumericTerm, makes me think that I 
might be missing something and that this might be a much bigger deal
than I think?
You could write a HitCollector that uses 
FieldCache.getFloats("latitude") and FieldCache.getFloats("longitude") 
to efficiently look up the latitude and longitude of each textual match. 
 Then combine the distance score with the text score.
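
A minimal sketch of that approach (the field names and the distance
formula are illustrative only):

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.FieldCache;
  import org.apache.lucene.search.HitCollector;

  public class DistanceHitCollector extends HitCollector {
    private final float[] lats, lons;     // indexed by document number
    private final float originLat, originLon;

    public DistanceHitCollector(IndexReader reader, float originLat,
                                float originLon) throws IOException {
      // Loaded once per reader and cached by FieldCache.
      lats = FieldCache.DEFAULT.getFloats(reader, "latitude");
      lons = FieldCache.DEFAULT.getFloats(reader, "longitude");
      this.originLat = originLat;
      this.originLon = originLon;
    }

    public void collect(int doc, float score) {
      float dLat = lats[doc] - originLat;
      float dLon = lons[doc] - originLon;
      // Plain Euclidean distance; a real application might use haversine.
      float dist = (float) Math.sqrt(dLat * dLat + dLon * dLon);
      float combined = score / (1.0f + dist);  // one way to blend the two
      // ... keep the top-N combined scores, e.g. in a priority queue ...
    }
  }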

Doug



Re: FuzzyQuery prefix length

2004-10-26 Thread Doug Cutting
Erik Hatcher wrote:
On Oct 20, 2004, at 12:14 PM, Doug Cutting wrote:
The advantages of a zero-character prefix default are that it's 
back-compatibile and that it will find more matches, when spelling 
differences are in the first characters.

I prefer this default.
Anyone using QueryParser needs to be aware of the issues of exposing 
fuzzy queries, range queries, and any other types the syntax supports.  
It would not be Lucene's fault if a system with millions of documents is 
exposed through QueryParser and fuzzy queries take a bit longer or 
throw a TooManyClauses exception.
I am clearly outvoted.  I still disagree, but will not veto this.
My last words on the topic (I promise!): In designing Lucene I tried 
hard to only add features that were scalable.  For example, one could 
easily implement a RegexQuery that scans text of stored fields, 
returning those which match a regex.  This would provide grep-like 
functionality, which some folks might find useful.  But it would not be 
scalable.  If someone contributed such a thing I would lobby against 
permitting its use from QueryParser in the default configuration.  The 
query parser already requires an initial character before a wildcard, in 
order to make this operator more scalable.  I don't see why fuzzy 
queries should be treated differently, why we permit such a huge 
scalability hole in the default configuration.

Doug


Re: Normalized Scoring -- was RE: idf and explain(), was Re: Search and Scoring

2004-10-21 Thread Doug Cutting
Chuck Williams wrote:
However, I'm not sure this analysis is completely correct due to 
MultiSearcher.docFreq(), which appears to be trying to redefine the tf's 
to be the global value across all indices.  It wasn't clear to me how this 
code is ever reached, e.g. from TermQuery -> SegmentTermDocs.  If the tf's 
and idf's are in fact computed globally, then the interleaving should work 
as it is, thus I'm guessing they are not.
Idf's are already computed globally across all indexes.  Tf's are local 
to the document.  In short, scores from a MultiSearcher are the same as 
when searching an IndexReader with the same documents.

Doug


Re: Retrieving Document Boosts

2004-10-20 Thread Doug Cutting
Dan Climan wrote:
TermEnum terms = ir.terms();
int numTerms = 0;
while (terms.next())
{
Term t = terms.term();

if (t.field().equals(FullText))
numTerms++;
}
double lengthNorm = 1.0 / Math.sqrt(numTerms); //since
lengthNorm was defined as 1/sqrt(numTerms) by default
The numTerms is not the number of unique words in the collection, but 
rather the number of tokens in the document in question.  So, if you 
want to re-create this externally you could re-tokenize the text for the 
field and count the tokens.
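
A minimal sketch of that re-tokenization (the analyzer and field name
must match whatever was used at indexing time):

  import java.io.IOException;
  import java.io.StringReader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;

  public class TokenCounter {
    /** Counts the tokens the analyzer produces for one field's text;
        1/sqrt(count) then reproduces the default lengthNorm. */
    public static int countTokens(Analyzer analyzer, String field,
                                  String text) throws IOException {
      TokenStream stream =
          analyzer.tokenStream(field, new StringReader(text));
      int n = 0;
      while (stream.next() != null) {
        n++;
      }
      stream.close();
      return n;
    }
  }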

Doug


Re: lucene and large (2GB+) indexes using RAMDirectory

2004-10-18 Thread Doug Cutting
Jonathan Hager wrote:
Nate Denning encountered the following error when trying to load a
large (greater than 2147483647 bytes) index into a RAMDirectory.  The
server has 12GB of memory, so loading it into memory should not be a
problem.
Have you instead tried copying the index to a ramfs ('mount -t ramfs'), 
then opening it with a normal FSDirectory?  This forces the entire index 
into RAM without forcing it into Java's heap.  In my experience, huge 
Java heaps are problematic.

Doug


Re: API cleanup for Field and future cleanup for IndexReader

2004-10-18 Thread Doug Cutting
Bernhard Messer wrote:
Christoph Goller wrote:
Bernhard Messer wrote:
Currently there are 3 different methods available to get the field 
names from an index.

a) getFieldNames();
b) getFieldNames(boolean indexed);
c) getIndexedFieldNames(boolean storedTermVector);
my proposal is to deprecate a), b) and c) and add one new method 
which can handle all the possible options.

+1
+1
Doug


Re: idf and explain(), was Re: Search and Scoring

2004-10-18 Thread Doug Cutting
Chuck Williams wrote:
That's a good point on how the standard vector space inner product
similarity measure does imply that the idf is squared relative to the
document tf.  Even having been aware of this formula for a long time,
this particular implication never occurred to me.  Do you know if
anybody has done precision/recall or other relevancy empirical
measurements comparing this vs. a model that does not square idf?
No, not that I know of.
Regarding normalization, the normalization in Hits does not have very
nice properties.  Due to the > 1.0 threshold check, it loses
information, and it arbitrarily defines the highest scoring result in
any list that generates scores above 1.0 as a perfect match.  It would
be nice if score values were meaningful independent of searches, e.g.,
if 0.6 meant the same quality of retrieval independent of what search
was done.  This would allow, for example, sites to use a simple
quality threshold to only show results that were good enough.  At my
last company (I was President and head of engineering for InQuira), we
found this to be important to many customers.
If this is a big issue for you, as it seems it is, please submit a patch 
to optionally disable score normalization in Hits.java.

The standard vector space similarity measure includes normalization by
the product of the norms of the vectors, i.e.:
  score(d,q) = sum over t of ( weight(t,q) * weight(t,d) ) /
               sqrt( (sum over t of weight(t,q)^2) * (sum over t of weight(t,d)^2) )
This makes the score a cosine, which, since the values are all positive,
forces it to be in [0, 1].  The sumOfSquares() normalization in Lucene
does not fully implement this.  Is there a specific reason for that?
The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document 
each time a document is added to the collection, since 'weight(t,d)' is 
dependent on global term statistics.  This is prohibitively expensive. 
Research has also demonstrated that such cosine normalization gives 
somewhat inferior results (e.g., Singhal's pivoted length normalization).

Re. explain(), I don't see a downside to extending it show the final
normalization in Hits.  It could still show the raw score just prior to
that normalization.
In order to normalize scores to 1.0 one must know the maximum score. 
Explain only computes the score for a single document, and the maximum 
score is not known.

 Although I think it would be best to have a
 normalization that would render scores comparable across searches.
Then please submit a patch.  Lucene doesn't change on its own.
Doug



Re: Propose Bernhard as committer

2004-10-18 Thread Doug Cutting
+1
Christoph Goller wrote:
I would like to propose Bernhard as Lucene committer.


Re: FuzzyQuery prefix length

2004-10-18 Thread Doug Cutting
Daniel Naber wrote:
On Tuesday 12 October 2004 17:22, Doug Cutting wrote:
Which is worse: a person who searches for Photokopie~ in a 1000 document
collection does not find documents containing Fotokopie; or a person who
searches for Photokopie~ in a 1M document collection doesn't find
anything because it takes too long.  I think some relevant results are
better than none.
I disagree, as the user who doesn't get the Fotokopie matches will not 
understand what's going on. He will assume that there are no such 
documents, which is wrong.
I disagree.  For someone to assume that, they would need a detailed 
understanding of how ~ works.  Such a person would likely also know 
whether initial characters are considered in the operation of ~.  Most 
users who use ~ probably do so when they're uncertain of 
spelling, without a detailed understanding of how it works, and, most of 
the time, it will help them.

If there's a timeout the user will at least 
notice something is wrong. Besides that, it's the developers 
responsibility to get things fast enough.
We're talking about the appropriate default.  Defaults are used by 
unsophisticated developers.  A system deployed by an unsophisticated 
developer should not suffer from erratic timeouts.  Users using the 
standard query syntax should enjoy a reasonable experience on 
multi-million document collections without having to tweak things.

Doug


Re: FuzzyQuery prefix length

2004-10-18 Thread Doug Cutting
Daniel Naber wrote:
Searching for Photokopie~ on a 230,000 document corpus takes 2.3 seconds here 
(AMD Athlon 2600+; other fuzzy terms get similar performance). As the number 
of terms doesn't increase so fast with more documents, it will not take 10 
seconds for 1 million documents. So fuzzy search isn't *that* slow.
How long do non-fuzzy queries take?  What is the ratio?  How about a 
query with multiple fuzzy terms?

If someone launches a service but fails to test it with fuzzy queries, 
will they be subject to inadvertent denial-of-service when a user starts 
using fuzzy queries?  Web-based search is particularly vulnerable.  If a 
query takes a few seconds and the user hits his browser's STOP and 
RELOAD buttons, the first query keeps running on the server.

This is not an imaginary problem.  I have worked with several clients 
who have run into this in deployed applications.

Doug


Re: What's the purpose of hashing docid in BooleanScorer

2004-10-18 Thread Doug Cutting
Christoph Goller wrote:
With the current scorer API one could get rid of the bucket table and
advance all subscorers only by one document each time. I am not sure
whether the bucket-table implementation is really much more efficient.
I only see the advantage of inlining some of the Scorer.next() and
Scorer.score() code.
Indeed, sub-scorers could be, e.g., kept in a priority queue.  This is 
done in ConjunctionScorer, PhraseScorer, etc.  However this adds a 
priority queue update to the inner search loop.  With long queries and 
with common terms this overhead can be significant.  With short queries 
and/or with rare terms the current ScoreTable-based implementation may 
indeed be slower, but I believe with longer queries containing common 
terms it is substantially faster.

This algorithm is described in:
http://lucene.sourceforge.net/papers/riao97.ps
If we had a priority-queue-based implementation then we could benchmark 
these.  If we found that one were faster than the other for particular 
classes of queries then we could have a query optimizer which 
automatically selects the most efficient implementation...

Doug


Re: What's the purpose of hashing docid in BooleanScorer; DisjunctionScorer

2004-10-18 Thread Doug Cutting
Paul Elschot wrote:
I have a DisjunctionScorer based on a PriorityQueue lying around,
but I can't benchmark it myself at the moment. In case there is
interest, I'll gladly adapt it to org.apache.lucene.search and 
add it in bugzilla.
This should look a lot like SpanOrQuery.getSpans().
On a related note, I implemented ConjunctionScorer using Java's 
collection classes rather than a Lucene priority queue, just to see if I 
could.  It turns out to have to allocate memory in sortScorers() which 
makes it slower than it could be, but I have not yet gotten around to 
fixing it.  I'd like to re-write this to look like PhraseScorer and 
NearSpans, which operate without allocation.

Doug



Re: Contribution: better multi-field searching

2004-10-13 Thread Doug Cutting
Paul Elschot wrote:
Did you see my IDF question at the bottom of the original note?  I'm
really curious why the square of IDF is used for Term and Phrase
queries, rather than just IDF.  It seems like it might be a bug?
I missed that.
It has been discussed recently, but I don't remember the outcome,
perhaps someone else remembers?
This has indeed been discussed before.
Lucene computes a dot-product of a query vector and each document 
vector.  Weights in both vectors are normalized tf*idf, i.e., 
(tf*idf)/length.  The dot product of vectors d and q is:

  score(d,q) =  sum over t of ( weight(t,q) * weight(t,d) )
Given this formulation, and the use of tf*idf weights, each component of 
the sum has an idf^2 factor.  That's just the way it works with dot 
products of tf*idf/length vectors.  It's not a bug.  If folks don't like 
it they can simply override Similarity.idf() to return sqrt(super()).
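
For example, a minimal sketch of that override (assuming the 1.4-era
Similarity API):

  import org.apache.lucene.search.DefaultSimilarity;

  public class SqrtIdfSimilarity extends DefaultSimilarity {
    public float idf(int docFreq, int numDocs) {
      // Each element of the sum then carries idf rather than idf^2.
      return (float) Math.sqrt(super.idf(docFreq, numDocs));
    }
  }

It would be installed on the search side with something like
searcher.setSimilarity(new SqrtIdfSimilarity()).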

If someone can demonstrate that an alternate formulation produces 
superior results for most applications, then we should of course change 
the default implementation.  But just noting that there's a factor which 
is equal to idf^2 in each element of the sum does not do this.

Doug


Re: Search and Scoring

2004-10-13 Thread Doug Cutting
Chuck Williams wrote:
I think there are at least two bugs here:
  1.  idf should not be squared.
I discussed this in a separate message.  It's not a bug.
  2.  explain() should explain the actual reported score().
This is mostly a documentation bug in Hits.  The normalization of scores 
to 1.0 is performed only by Hits.  Hits is a high-level wrapper on the 
lower-level HitCollector-based search implementations, which do not 
perform this normalization.  We should probably document that Hits 
scores are so normalized.  Also, we could add a method to disable this 
normalization in Hits.  The normalization was added long ago because 
many folks found it disconcerting when scores were greater than 1.0.

We should not attempt to normalize scores reported by explain().  The 
intended use of explain() is to compare its output against other calls 
to explain(), in order to understand how one document scores higher than 
another.  Scores don't make much sense in isolation, and neither do 
explanations.

Doug


Re: Contribution: better multi-field searching

2004-10-13 Thread Doug Cutting
Chuck Williams wrote:
The issue is this.  Imagine you have two fields, title and document,
both of which you want to search with simple queries like:  albino
elephant.  There are two general approaches, either a) create a combined
field that concatenates the two individual fields, or b) expand the
simple query into a BooleanQuery that searches for each term in both
fields.
With approach a), you lose the flexibility to set separate boost factors
on the individual fields.  I wanted title to be much more important than
description for ranking results, and wanted to control this explicitly,
as length norm was not always doing the right thing; e.g., descriptions
are not always long.
With approach b) you run into another problem.  Suppose the example
query is expanded into (title:albino description:albino title:elephant
description:elephant).  Then, assuming tf/idf doesn't affect ranking, a
document with albino in both title and description will score the same
as a document with albino in title and elephant in description.  The
latter document for most applications is much better since it matches
both query terms.  If albino is the more important term according to
idf, then the less desirable documents (albino in both fields) will rank
consistently ahead of the albino elephants (which is what was happening
to me, yielding horrible results).
Another way to handle this would be to generate a query like:
  title:(albino elephant) description:(albino elephant)
In this case the coord factor would boost titles and descriptions which 
contained both terms.  You may or may not want to disable the coord 
factor for the outer query, which can be done with:

BooleanQuery title = new BooleanQuery();
title.add(new TermQuery(new Term("title", "albino")), false, false);
title.add(new TermQuery(new Term("title", "elephant")), false, false);
BooleanQuery desc = new BooleanQuery();
desc.add(new TermQuery(new Term("desc", "albino")), false, false);
desc.add(new TermQuery(new Term("desc", "elephant")), false, false);
BooleanQuery outer = new BooleanQuery() {
  public Similarity getSimilarity(Searcher searcher) {
    return new DefaultSimilarity() {
      public float coord(int overlap, int maxOverlap) { return 1.0f; }
    };
  }
};
outer.add(title, false, false);
outer.add(desc, false, false);
In general, doesn't coord() handle this situation?
Also, you can separately boost title and desc here, if you like:
  title:(albino elephant)^4.0 description:(albino elephant)
or
  title.setBoost(4.0f);

Doug


Re: IndexInput GCJ

2004-10-13 Thread Doug Cutting
Andi Vajda wrote:
This code is generated by JavaCC.  I think the best way to fix this 
would be to fixup the code automatically whenever it is regenerated. 
So, instead of patching QueryParser.java, patch build.xml.  In the 
javacc-QueryParser task, add a replace task which replaces 
'jj_la1_0()' with 'jj_la1_0_method()'.
That is a brittle kludge, as the code that needs to be changed may vary
every time the parser is re-generated. There used to be two such methods 
in 1.4.1, for instance. The proper way to work around this issue is to fix 
javacc to not generate such Java code in the first place.
It is indeed a brittle kludge.
Are you willing to submit a bug report to JavaCC?
https://javacc.dev.java.net/servlets/ProjectIssues
Or even a patch?
https://javacc.dev.java.net/source/browse/javacc/
Doug


Re: Contribution: better multi-field searching

2004-10-13 Thread Doug Cutting
Chuck Williams wrote:
That approach does not work.  I could not find an approach that would
work with the built-in classes, although of course there might be one.
The problem has two components:  coord and the fact that BooleanQueries
sum their clause scores to compute the final score.  The latter is not
easily overridden.  Specifically,
  title:(albino elephant)^4 description:(albino elephant)
still has the problem that a result with albino in the title and albino
in the description gets the same score as a result with albino in the
title and elephant in the description 
Perhaps I misunderstood what you desire.  You want a reward for albino 
and elephant both occurring in the document, regardless of field.  If so, 
then what you'd want is:

(title:albino description:albino) (title:elephant description:elephant)
with coord disabled on the *inner* queries, no?  This way coord would 
explicitly boost documents which matched on both terms.

FYI, MaxDisjunctionQuery has made an enormous improvement in the quality
of my query results, and I have strong reason to believe the same would
be true in most other domains (more on that coming in the idf^2
discussion).  In terms of the albino elephant example, the query above
was putting all the albino animals except elephants above the albino
elephants, while the query with an outer BooleanQuery and inner
MaxDisjunctionQuery's
( (title:albino^4 | description:albino)~0.1
  (title:elephant^4 | description:elephant)~0.1
)
properly puts the albino elephants on top.
If albino is outscoring elephant then you could either reduce the 
impact of idf or increase the impact of coordination.  Did you try, 
e.g., defining coord as (overlap/max)^2 or somesuch?

Or, perhaps take proximity into account, with "albino elephant"~10?  Or 
simply using AND instead of OR?  These days most web search engines use 
AND as the default operator and reward for proximity.  Is that wrong for 
your application?  AND is effectively a coord of (overlap/max)^infinity.
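
For instance, a minimal sketch of the squared-coord idea (assuming the
1.4-era Similarity API):

  import org.apache.lucene.search.DefaultSimilarity;

  public class SquaredCoordSimilarity extends DefaultSimilarity {
    public float coord(int overlap, int maxOverlap) {
      float c = overlap / (float) maxOverlap;
      // Squaring pushes documents matching more query terms up the
      // ranking; higher powers approach AND-like behavior.
      return c * c;
    }
  }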

Doug


Re: documentation in fileformats.html

2004-10-13 Thread Doug Cutting
Daniel Naber wrote:
The web page is updated now, could you please re-check if it's correct? I 
added that information so that the Lucene <= 1.4 format is still there.
We should note that when compression is enabled, gzip is used.
Also, byte[] is not a type defined in the file.  In the formalism used 
in fileformats.html, this should be:

Value -> String | BinaryValue (depending on Bits)
BinaryValue -> ValueSize, Byte^ValueSize
ValueSize -> VInt
Doug


Re: FuzzyQuery prefix length

2004-10-12 Thread Doug Cutting
Daniel Naber wrote:
-It is the only change so far that we cannot express in the API, i.e. we 
cannot just deprecate a method to make Lucene's users aware of this. So we 
can only list it in CHANGES.txt, where some people will surely miss it.
We could define a new query parser class with the new behaviour and 
deprecate the old query parser.  I am not advocating this, merely noting 
that it is possible to make this change back-compatibly.

If we agree that this change does make Lucene better (and I'm not sure 
we do) then we should make the change, no?  Back-compatibility is a good 
thing, but, with a major release, should quality suffer becaue of 
back-compatibility issues?  I hope not.  Rather we should take the 
opportunity of a major release to make Lucene as good as we can.

-There are words in German like Photokopie/Fotokopie which have the same 
meaning and a very similar spelling, so people will expect a FuzzyQuery to 
match such words. But as the difference is in the first two characters it 
won't be found with the default.

-People whose index is just 1000 documents large will probably not notice a 
difference in speed, but they might see a difference in quality (see 
above). Why should these people change the default instead of those with a 
10 mio document index?
Which is worse: a person who searches for Photokopie~ in a 1000 document 
collection does not find documents containing Fotokopie; or a person who 
searches for Photokopie~ in a 1M document collection doesn't find 
anything because it takes too long.  I think some relevant results are 
better than none.  Classes of queries which take orders of magnitude 
longer than others are a problem.

Doug



Re: QueryParser and backwards-compatibility

2004-10-11 Thread Doug Cutting
Christoph Goller wrote:
Since 1.4.2 is already out, we would have to make a version 1.4.3.
OK, one more vote needed :-)
I'm okay with a 1.4.3 release for this.
Doug


Re: PhrasePrefixQuery - MultiPhraseQuery

2004-10-11 Thread Doug Cutting
Daniel Naber wrote:
I copied PhrasePrefixQuery to MultiPhraseQuery, deprecating 
PhrasePrefixQuery. The wiki also suggests to make MultipleTermPositions a 
private nested class. However, it is public currently so I wonder whether 
we can just remove/deprecate it without offering an alternative. Any 
opinions?
I don't feel too strongly about this.
I doubt anyone uses this class directly, so we could just remove it and 
wait until we make a 1.9 RC, and see if anyone complains then.

Or we could just leave it public and improve its javadoc.  It is 
well-named and well-implemented, and may be useful for other things, 
although I can't think what they are...

What do others think?
Doug


Re: FuzzyQuery prefix length

2004-10-11 Thread Doug Cutting
Daniel Naber wrote:
I agree that the default should stay 0, even for Lucene 2.0.
It should certainly stay zero for 1.4.x releases.
However 2.0 is our opportunity to make incompatible changes.  What is 
the best default for this, one that will work well for most applications?

Does anyone have fuzzy-query benchmarks for, e.g., ~1M document indexes, 
where each document contains a few k of text?  Ideally with such 
indexes, even complex queries should take less than a second, no?  How 
long does a fuzzy query take?  And how much does a prefix of zero, one, 
or two change that?  Queries that take much longer than a second are 
considerably less usable.  I think the default should provide good 
usability for indexes of at least 1M documents.

Another thing to examine is how different the generated terms are with 
different prefixes.  One could randomly select some words from an index 
and compute the average amount that a prefix of one and two changes the 
end results.  My guess is that the changes are small.  Since fuzzy 
search is a heuristic, not an exact computation, good approximations are 
fair play.
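
A minimal sketch of such a measurement (assumes the 1.4-era
FuzzyQuery(Term, minimumSimilarity, prefixLength) constructor; the field
name and term are placeholders):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.FuzzyQuery;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;

  public class FuzzyPrefixBench {
    public static void main(String[] args) throws Exception {
      IndexSearcher searcher = new IndexSearcher(args[0]);  // index dir
      Term term = new Term("contents", "photokopie");
      for (int prefix = 0; prefix <= 2; prefix++) {
        long start = System.currentTimeMillis();
        Hits hits = searcher.search(new FuzzyQuery(term, 0.5f, prefix));
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("prefix=" + prefix + ": " + hits.length()
            + " hits in " + elapsed + " ms");
      }
      searcher.close();
    }
  }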

Doug


Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/queryParser QueryParser.java QueryParser.jj

2004-10-11 Thread Doug Cutting
[EMAIL PROTECTED] wrote:
goller  2004/10/11 06:36:14
  Modified:src/java/org/apache/lucene/queryParser Tag: lucene_1_4_2_dev
QueryParser.java QueryParser.jj
[ ... ]
  +   * @deprecated use {@link #getFieldQuery(String, String)}
Should these be deprecated in 1.4.3?  I don't think so.  They should be 
deprecated in 1.9 and removed in 2.0, but 1.4.3 should not require 
application changes if possible when upgrading from earlier 1.x releases.

Doug


Re: sandbox - core ?

2004-10-08 Thread Doug Cutting
Erik Hatcher wrote:
It would be nice if the Sandbox components were versioned and released 
along with the core - perhaps this would be a sufficient 
solution?  But, alas, I have no free time currently to devote to this 
effort.
That's precisely the reason to add these to the main CVS tree: if 
they're somewhere else then they simply won't get versioned and released 
in parallel with the core, while if they're in the main CVS tree this 
will happen with no extra effort.

In general, I'm a proponent of bundling as much as possible into a 
single CVS tree and build procedure, since it makes it much easier to 
keep things synchronized.  If folks feel the jar is too big, then we can 
always build these into a separate jar.  I'd also vote to put analyzers 
in the same CVS tree and under the top-level build.xml, for the same 
reason.  If we like, we could put them each in subdirectories of 
src/analyzers, and have each built as a separate jar.  Thoughts?

The sandbox should be for experimental stuff.  Stuff that's proven 
widely useful should go into the main tree and get released along with 
every Lucene release.

Doug


Re: sandbox - core ?

2004-10-08 Thread Doug Cutting
Otis Gospodnetic wrote:
I like this idea.  I don't care so much about 1 or more CVS
repositories, as much as separate Jars, so if we can make
analyzers-1.4.2.jar and highlighter-1.4.2.jar alongside lucene-1.4.2.jar,
that would be ideal, in my opinion.
A minor point: we should prefix all the jar file names with 'lucene-'.
Also, I think the javadoc should include everything, not just the core. 
 That way folks can easily see what's available.  We could group things 
to make it clear what's core and what's in auxiliary jars:

http://java.sun.com/j2se/1.4.2/docs/tooldocs/solaris/javadoc.html#group
So we might have groups for Core, Analyzers, etc.
However, I still think a separate jar for the highlighter is overkill.
Doug



Re: IndexInput GCJ

2004-10-07 Thread Doug Cutting
Andi Vajda wrote:
Do you intend to ultimately support Java Lucene with GCJ ?
As far as possible...
I'm down to 3 patches:
Can you please file a Lucene bug report and attach these patches?  I'm 
not guaranteeing that they'll all be committed right away, but rather 
that that's a better place to keep track of them.  And they may get 
comitted!  Thanks.

I've also suggested a few changes to your patches below.
  - GCJH cannot generate a header file from QueryParser.class because there
is one static field and one static method which have the same name
(down from two in Lucene 1.4.1)
This code is generated by JavaCC.  I think the best way to fix this 
would be to fixup the code automatically whenever it is regenerated. 
So, instead of patching QueryParser.java, patch build.xml.  In the 
javacc-QueryParser task, add a replace task which replaces 
'jj_la1_0()' with 'jj_la1_0_method()'.

Is there a GCJ bug number assigned to this issue?  If not, could you 
please file one and note the bug number in a comment?  That way, if/when 
GCJ more elegantly resolves this we can remove the hack.

  - The delete(int) and delete(Term) methods on IndexReader clash with the
'delete' C++ keyword. GCJ will generate them as 'delete$', which is a neat
workaround; the problem, however, is that the dynamic linker, at least on
Mac OS X, doesn't then properly link to these symbols and fails to load
the resulting shared library.
So I defined two synonym methods, deleteDocument(int) and
deleteDocuments(Term), in a patch to IndexReader.
In your patch, please add javadocs and deprecate the old delete() 
methods.  Again, GCJ and/or OS X bug numbers in a comment would be good 
to have.

  - Because of GCJ bug 15411,
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15411,
Searcher.java needs to be patched to define the missing method
definitions
Please add this bug reference in a comment to the patched code.
Doug


Re: Lucene JAR for Maven Repo

2004-10-07 Thread Doug Cutting
I just copied the 1.4.2 jar there.
Doug
Otis Gospodnetic wrote:
Here is the email I mentioned earlier on lucene-dev.
--- Brian McCallister [EMAIL PROTECTED] wrote:

To: [EMAIL PROTECTED]
From: Brian McCallister [EMAIL PROTECTED]
Subject: Maven Repo
Date: Thu, 26 Aug 2004 19:59:50 -0400
Hi all,
Thank you for the amazing work on lucene. That said, any chance you 
could push lucene-1.4.1.jar onto the ibiblio maven repository? I'm 
happy to do so myself if you prefer (it's just copying it to 
/www/www.apache.org/dist/java-repository/lucene/jars/) but figured I'd 
ask before just copying the jar over =)

Thank you again!
-Brian




Re: Lucene 1.4.2?

2004-10-02 Thread Doug Cutting
Daniel Naber wrote:
On Friday 01 October 2004 23:57, Doug Cutting wrote:
It is not mirrored yet.  Erik's the only one who has ever done that.
Erik, do you have time to mirror 1.4.2?  Thanks.
BTW, the release on the official download pages is still 1.4-final:
http://jakarta.apache.org/site/sourceindex.cgi
http://jakarta.apache.org/site/binindex.cgi
Right.  The official site is the mirrored site.  The procedure for 
releasing to the mirror is documented at:

http://jakarta.apache.org/site/convert-to-mirror.html
Would someone else like to do this?  Erik's been rather busy.  If 
another comitter has the time, it would be great to get this done ASAP.

Doug


Re: Lucene 1.4.2?

2004-10-01 Thread Doug Cutting
Christoph Goller wrote:
I would never
have guessed that calling the constructor there could make such a 
difference.
The improvement is greatest for OR queries that contain a common term, 
i.e., which match a large portion of the collection.  However for, e.g., 
most phrase searches and AND searches the improvement is probably not so 
pronounced.  When folks use Lucene as a vector-space search engine, 
constructing queries that represent large weighted vectors, the 
improvement is substantial.

Doug


Re: Lucene 1.4.2?

2004-10-01 Thread Doug Cutting
Christoph Goller wrote:
Items 4 and 5 don't seem that important to me. As far as I am
concerned we can leave them out.
When did 4 happen?  Was it a rare or common problem?
I agree that we don't need to put 5 in 1.4.2.
So the only thing missing is your
optimization. Then 1.4.2 should be ready.
I just committed this.  I can make a 1.4.2 release later today or Monday.
Doug


Re: Using MMapDirectory fails TestCompoundFile; MMapDirectory for huge indexes

2004-10-01 Thread Doug Cutting
Paul Elschot wrote:
I'm working on a memory mapped directory that uses multiple buffers
for large files.
Great!
There will be a small performance hit, as each call to readByte() will 
need to first check whether it's overflowed the current buffer, right?
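
Something like this, as a minimal sketch (the names are hypothetical,
not Paul's actual implementation):

  import java.nio.ByteBuffer;

  class MultiBufferInput {
    private ByteBuffer[] buffers;  // consecutive mapped regions of the file
    private ByteBuffer curBuf;     // region holding the current position
    private int curBufIndex;

    public byte readByte() {
      if (!curBuf.hasRemaining()) {       // the per-call overflow check
        curBuf = buffers[++curBufIndex];  // move to the next mapped region
        curBuf.position(0);
      }
      return curBuf.get();
    }
  }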

While trying some test runs I found that the current version fails a test:
[junit] Testsuite: org.apache.lucene.index.TestCompoundFile
Thanks for testing this!
I'm testing the version with multiple buffers using a smaller maximum
buffer size (1024 * 128), and it does this test in the same way.
You mean it fails too?
I have not yet looked into TestCompoundFile. When it is a good test
case for this, I'll submit the multibuffer version as an enhancement.
Thanks, that would be great.
Doug


Re: Lucene 1.4.2?

2004-10-01 Thread Doug Cutting
The new release is up at http://jakarta.apache.org/lucene/.
It is not mirrored yet.  Erik's the only one who has ever done that. 
Erik, do you have time to mirror 1.4.2?  Thanks.

Doug


Re: DbDirectory and compound files

2004-09-30 Thread Doug Cutting
Andi Vajda wrote:
You ask if this makes sense. No, not really. I don't know the details of 
the purpose of the compound file implementation so this may be my problem.
The purpose of the compound file implementation is to minimize the 
number of files that an IndexReader must keep open.  Instead of seven 
files plus one per indexed field for each segment, only a single file 
must be kept open per segment.  This helps applications which keep lots 
of unoptimized indexes open.  (It also, and this is more common, helps 
folks who open a new IndexReader for each query and don't close it.  In 
this case, opening fewer files gives the garbage collector time to close 
files before the process runs into its file descriptor limit, inducing a 
flurry of bug reports about too many open files.)

Does that make any more sense?
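For reference, applications opt in on the writer side.  A minimal sketch 
(the path and analyzer here are just placeholders):

  IndexWriter writer = new IndexWriter("/path/to/index",
                                       new StandardAnalyzer(), true);
  writer.setUseCompoundFile(true);  // store each segment as one .cfs file
  // ... add documents ...
  writer.close();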
However, from earlier posts of yours, it seems that the Directory 
implementation classes such as OutputStream et al are being deprecated 
and replaced by others, so it may very well be that DbDirectory needs to 
be rewritten when these changes are finalized.
These changes are back-compatible: the old classes and methods are still 
there and interoperate with the new ones, but are deprecated.  You might 
wait until there is a Lucene release with the new API in it before you 
update DbDirectory.  To move to the new API, all that should be required 
is changing your subclass of InputStream to instead subclass 
BufferedIndexInput, and changing your subclass of OutputStream to 
instead subclass BufferedIndexOutput.  You'll also need to add a 
length() method to your BufferedIndexInput subclass, instead of setting 
a protected length field in the constructor.  That's it.
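In skeletal form, the input side of the port looks something like this 
(class name hypothetical, method bodies elided):

  class DbIndexInput extends BufferedIndexInput {
    private long length;                     // looked up once at open time

    protected void readInternal(byte[] b, int offset, int len)
        throws IOException {
      // copy len bytes from the backing store, starting at the
      // position implied by getFilePointer(), into b
    }

    protected void seekInternal(long pos) {}  // nothing extra to do

    public long length() { return length; }   // replaces the protected field

    public void close() throws IOException {
      // release any handles on the backing store
    }
  }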

The revision of the API was primarily to make buffering optional.  We 
could have left the buffered implementation names the same, but then the 
classes would be named poorly and it also seemed like an opportunity to 
remove the name clash with java.io.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene 1.4.2?

2004-09-30 Thread Doug Cutting
Christoph Goller wrote:
I'd like the changes on FuzzyQuery, PhraseQuery, and PhrasePrefixQuery
included in the branch. Any objections?
I'm okay with these, but the primary purpose of 1.4.2 should be to 
stabilize things, not to add new features.  So let's be very selective 
about what we add, and scrutinize changes carefully so we don't 
introduce new bugs.  Are you confident that these are safe changes?

If we agree to let a *few* features in, then I vote for my optimization 
to IndexSearcher.  Of all the optimizations I made recently, the single 
biggest performance improvement was to avoid allocating a new ScoreDoc 
for every non-zero score in IndexSearcher.search(Query,Filter,int).  I 
think this is safe.  Are there any concerns about putting this 
optimization into 1.4.2?
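To paraphrase the change (this is a sketch, not the committed code, and 
HitQueue is internal to org.apache.lucene.search; assume a Query query 
and a Searcher searcher):

  final int nDocs = 10;                      // how many hits we want
  final HitQueue hq = new HitQueue(nDocs);
  final float[] minScore = new float[1];     // lowest score once full
  searcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
      if (score > 0.0f && (hq.size() < nDocs || score >= minScore[0])) {
        hq.insert(new ScoreDoc(doc, score)); // allocate only when competitive
        if (hq.size() == nDocs)
          minScore[0] = ((ScoreDoc)hq.top()).score;
      }
    }
  });

Before, a ScoreDoc was allocated for every non-zero score; now hits that 
can't make the top nDocs are skipped without any allocation.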

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: DbDirectory and compound files

2004-09-29 Thread Doug Cutting
Andi Vajda wrote:
So, my question: why is the compound file storage implemented in such an 
orthogonal-to-Directory way instead of just being another Directory 
implementation called FSCompoundFileDirectory?
To combine the files of a segment we need to know when the segment was 
complete.  So a method would need to be added to Directory to instruct 
it when to combine files.  And then the Directory would need to be able 
to locate files within the combined file in order to open them.

It would be a shame to re-invent this logic for each Directory 
implementation, so the indexing code has a generic implementation 
layered on top of Directory.  Does that make sense?
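In rough outline (these are internal classes, so take the names and 
signatures as illustrative, not gospel):

  // Writing: gather a finished segment's files into one compound file.
  CompoundFileWriter cfsWriter =
      new CompoundFileWriter(directory, segment + ".cfs");
  for (int i = 0; i < files.length; i++)
    cfsWriter.addFile(files[i]);    // copies name + data from any Directory
  cfsWriter.close();                // writes table of contents + data

  // Reading: CompoundFileReader is itself a Directory, so the rest of
  // the code opens files inside the .cfs without knowing the difference.
  Directory cfsDir = new CompoundFileReader(directory, segment + ".cfs");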

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene 1.4.2?

2004-09-29 Thread Doug Cutting
Daniel Naber wrote:
On Monday 20 September 2004 18:49, Doug Cutting wrote:
To be clear, you are proposing that we branch from the 1.4.1 tag in CVS
and re-apply the patches below?
Yes, exactly.
Now that we have a patch for the memory leak problem, should we start a 
1.4.2 branch?

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene 1.4.2?

2004-09-29 Thread Doug Cutting
Daniel Naber wrote:
I can try to do some of the work, but I'd need detailed instructions for 
branching and tagging. It's probably easier/better if you do those parts.
I've never branched with CVS before either... so here goes!
I've added a branch called lucene_1_4_2_dev.  To get a copy, use:
cvs -d :ext:[EMAIL PROTECTED]:/home/cvs co -r lucene_1_4_2_dev -d 
lucene_1_4_2_dev jakarta-lucene

Where XXX is your username at Apache.  Then you can make changes and 
commit them from this directory.
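If you want to confirm that a working copy is on the branch, run cvs 
status on any file; it should report a sticky tag of lucene_1_4_2_dev:

cvs status CHANGES.txt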

I just applied the memory leak patch in this branch, but I've not yet 
updated CHANGES.txt.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: IndexInput GCJ

2004-09-28 Thread Doug Cutting
Doug Cutting wrote:
Still to do:
  1. Replace OutputStream with IndexOutput and BufferedIndexOutput. This 
is not critical and mostly for consistency, as mmap makes more sense for 
read-only data.
  2. Update RAMDirectory and FSDirectory to no longer use deprecated 
classes.  This is done last, to make sure that the earlier steps do not 
break back-compatibility for existing Directory implementations.
These changes are now complete and in CVS.
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/store MMapDirectory.java

2004-09-28 Thread Doug Cutting
[EMAIL PROTECTED] wrote:
  Added:   src/java/org/apache/lucene/store MMapDirectory.java
  Log:
  Add an nio mmap based Directory implementation.
For my simple benchmarks this is somewhat slower than the classic 
FSDirectory, but I thought it was still worth having.  It should use 
less memory when there are lots of query terms, since it does not need 
to allocate a new buffer per term and the mmapped data can be shared. 
This may be good for folks who, e.g., use lots of wildcards.  It also 
should, in theory, someday be faster.  One downside is that it cannot 
handle indexes with files larger than 2^31 bytes.
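The core of the approach, roughly (an untested sketch, error handling 
elided):

  RandomAccessFile raf = new RandomAccessFile(file, "r");
  try {
    ByteBuffer buffer = raf.getChannel()
        .map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
    byte b = buffer.get();   // readByte() becomes a plain buffer read
  } finally {
    raf.close();
  }

Clones of the input can share the mapped region instead of each 
allocating its own buffer, which is where the memory savings come from. 
The map() call is also where the 2^31 limit bites: its size argument is 
a long, but the resulting ByteBuffer is addressed with ints.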

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/store MMapDirectory.java

2004-09-28 Thread Doug Cutting
Bruce Ritchie wrote:
[EMAIL PROTECTED] wrote:
One downside 
is that it cannot handle indexes with files larger than 2^31 bytes.
Can you expand slightly on what causes this limitation and whether it still exists on 64-bit hardware?
This is a limit of the nio ByteBuffer API, which uses int instead of 
long to address data.  Java defines int as a signed 32-bit quantity. 
The size of a ByteBuffer is also an int.

http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/FileChannel.html#map(java.nio.channels.FileChannel.MapMode,%20long,%20long)
http://java.sun.com/j2se/1.4.2/docs/api/java/nio/ByteBuffer.html
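The usual workaround (and roughly what Paul's multi-buffer patch does) 
is to map the file in chunks and translate a long position into a buffer 
index plus an int offset:

  static final int BUFFER_SIZE = 1 << 28;     // e.g. 256 MB per mapping
  int bufIndex = (int)(pos / BUFFER_SIZE);    // which mapped chunk
  int offset   = (int)(pos % BUFFER_SIZE);    // position within it
  byte b = buffers[bufIndex].get(offset);     // absolute get(int)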
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: cvs commit: jakarta-lucene build.xml

2004-09-21 Thread Doug Cutting
Daniel Naber wrote:
I'm using gcc/gcj 3.3.3; do I maybe need a more recent version?
I'm currently using 3.4.1, but I think 3.4.0 will work as well.  I had 
troubles with 3.3.

I've worked more on this, and now have a version (not yet committed) 
which appears a bit faster than a JVM.  More soon.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


