Lucene Book

2004-09-07 Thread ebrahim . faisal
Hi

I am new to Lucene. Can anyone guide me to where I can download a free
Lucene book?

Thanx & Regards
E.Faisal





Re: Lucene Book

2004-09-07 Thread Erik Hatcher
On Sep 7, 2004, at 3:00 AM, [EMAIL PROTECTED] wrote:
I am new to Lucene. Can anyone guide me from where i can download free
Lucene book.
Free?!
http://www.manning.com/hatcher2 is the book Otis and I have spent the 
last year laboring on.  It has been a long hard effort that is about to 
come to fruition.  Lucene in Action is in copy/tech editing right now 
and will be pushed into production very shortly.  There will be some 
chapters, as always with Manning, available for free download once the 
book has been typeset (probably even before physical copies are 
available).  We have not decided which chapters we'll make available 
for free yet.

I hope a few folks buy it - it would be a shame for my kids to go 
without food ;)

Erik


Re: Lucene Book

2004-09-07 Thread Terry Steichen
Jeez, Erik!  Where's your sense of public spirit ;-)

Terry

PS: Glad to hear you're (finally!) nearing publication.  

  - Original Message - 
  From: Erik Hatcher 
  To: Lucene Users List 
  Sent: Tuesday, September 07, 2004 6:43 AM
  Subject: Re: Lucene Book


  On Sep 7, 2004, at 3:00 AM, [EMAIL PROTECTED] wrote:
   I am new to Lucene. Can anyone guide me from where i can download free
   Lucene book.

  Free?!

  http://www.manning.com/hatcher2 is the book Otis and I have spent the 
  last year laboring on.  It has been a long hard effort that is about to 
  come to fruition.  Lucene in Action is in copy/tech editing right now 
  and will be pushed into production very shortly.  There will be some 
  chapters, as always with Manning, available for free download once the 
  book has been typeset (probably even before physical copies are 
  available).  We have not decided which chapters we'll make available 
  for free yet.

  I hope a few folks buy it - it would be a shame for my kids to go 
  without food ;)

  Erik





Re: Lucene Book

2004-09-07 Thread Otis Gospodnetic
Hello Ebrahim,

Like Erik said, the book about Lucene is coming soon.  Although it
won't be free, Erik, I, and a few other people have already shared some
of our knowledge in several articles about Lucene.  There is a page on
the Lucene Wiki that links to all known Lucene articles.  I suggest you
take a look at those while we finish Lucene in Action.

Otis


--- [EMAIL PROTECTED] wrote:

 Hi
 
 I am new to Lucene. Can anyone guide me from where i can download
 free
 Lucene book.
 
 Thanx  Regards
 E.Faisal





Use of + and - in queries

2004-09-07 Thread Bill Tschumy
I don't understand the difference in using + and - in queries compared 
to using AND and NOT.  Even the Query Syntax document seems a bit 
confused.  In the section on the NOT operator it says:

 To search for documents that contain "jakarta apache" but not
"jakarta lucene" use the query:

        "jakarta apache" NOT "jakarta lucene"

Then in the section on the - operator you read this:

 To search for documents that contain "jakarta apache" but not
"jakarta lucene" use the query:

        "jakarta apache" -"jakarta lucene"

So what's the difference?
--
Bill Tschumy
Otherwise -- Austin, TX
http://www.otherwise.com

RE: Spam:too many open files

2004-09-07 Thread wallen
I sent out an email to this list a few weeks ago about how to fix a corrupt
index.  I basically edited the segments file with a hex editor, removing the
entry for the missing file and decrementing the file count that is stored
near the beginning of the segments file.

-Original Message-
From: Patrick Kates [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 01, 2004 1:30 PM
To: [EMAIL PROTECTED]
Subject: Spam:too many open files


I am having two problems with my client's lucene indexes.

One, we are getting a FileNotFound exception (too many open files).  This
would seem to indicate that I need to increase the number of open files on
our Suse 9.0 Pro box.  I have our sys admin working on this problem for me.

Two, because of this error and subsequent restarting of the box, we seem to
have lost an index segment or two.  My client's tape backups do not contain
the segments we know about.

I am concerned about the missing index segments as they seem to be
preventing any further update of the index.  Does anyone have any
suggestions as to how to fix this besides a full re-index of the problem
indexes?

I was wondering if maybe a merge of the index might solve the problem?  I
could move our nightly merge of the index files to sooner, but I am afraid
that the merge might make matters worse?

Any ideas or helpful speculation would be greatly appreciated.

Patrick







RE: Spam:too many open files

2004-09-07 Thread wallen
A note to developers: the code checked into Lucene CVS ~Aug 15th, post
1.4.1, was causing frequent index corruptions.  When I reverted back to
version 1.4 I stopped getting the corruptions.

I was unable to trace the problem to anything specific, but was using the
newer code to take advantage of the sort fixes.

-Original Message-
From: Patrick Kates [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 01, 2004 1:30 PM
To: [EMAIL PROTECTED]
Subject: Spam:too many open files


I am having two problems with my client's lucene indexes.

One, we are getting a FileNotFound exception (too many open files).  THis
would seem to indicate that I need to increase the number of open files on
our Suse 9.0 Pro box.  I have our sys admin working on this problem for me.

Two, because of this error and subsequent restarting of the box, we seem to
have lost an index segment or two.  My client's tape backups do not contain
the segments we know about.

I am concerned about the missing index segments as they seem to be
preventing any further update of the index.  Does anyone have any
suggestions as to how to fix this besides a full re-index of the problem
indexes?

I was wondering if maybe a merge of the index might solve the problem?  I
could move our nightly merge of the index files to sooner, but I am afraid
that the merge might make matters worse?

Any ideas or helpful speculation would be greatly appreciated.

Patrick







Re: telling one version of the index from another?

2004-09-07 Thread Doug Cutting
Bill Janssen wrote:
Hi.
Hey, Bill.  It's been a long time!
I've got a Lucene application that's been in use for about two years.
Some users are using Lucene 1.2, some 1.3, and some are moving to 1.4.
The indices seem to behave differently under each version.  I'd like
to add code to my application that checks the current user's index
version against the version of Lucene that they are using, and
automatically re-indexes their files if necessary.  However, I can't
figure out how to tell the version, from the index files.
Prior to 1.4, there were no format numbers in the index.  These are 
being added, file-by-file, as we change file formats.  As you've 
discovered, there is currently no public API to obtain the format number 
of an index.  Also, the formats of different files are revved at 
different times, so there may not be a single format number for the 
entire index.  (Perhaps we should remedy this, by, e.g., always revving 
the segments version whenever any file changes format.)

The documentation on the file formats, at
http://jakarta.apache.org/lucene/docs/fileformats.html, directs me to
the segments file.  However, when I look at a version 1.3 segments
file, it seems to bear little relationship to the format described in
fileformats.html. 
Have a look at the version of fileformats.html that shipped with 1.3. 
You can find this by browsing CVS, looking for the 1.3-final tag.  But 
let me do it for you:

http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/docs/fileformats.html?rev=1.15
According to CVS tags, that describes both the 1.3 and 1.2 index file 
formats.

But the part of fileformats.html dealing with the
segments file contains no compatibility notes, so I assume it hasn't
changed since 1.3. 
I wrote the bit about compatibility notes when I first documented file 
formats, and then promptly forgot about it.  So, until someone 
contributes them, there are no compatibility notes.  Sorry.

Even if it had, what's the idea of using -1 as the
format number for 1.4?
The idea is to promptly break 1.3 and 1.2 code which tries to read the 
index.  Those versions of Lucene don't check format numbers (because 
there were none).  Positive values would give unpredictable errors.  A 
negative value causes an immediate failure.

So, anyone know a way to tell the difference between the various
versions of the index files?  Crufty hacks welcome :-).
The first four bytes of the segments file will mostly do the trick. 
If it is zero or positive, then the index is a 1.2 or 1.3 index.  If it 
is -2, then it's a 1.4-final or later index.
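
A minimal sketch of that check, with assumptions of my own that are not in the thread: the index directory is passed as args[0], the file is named "segments", and the leading int is read big-endian the way Lucene's OutputStream.writeInt() writes it.

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical helper, not part of Lucene: peek at the leading int of the
// segments file to guess which generation of Lucene wrote the index.
public class SegmentsFormatSniffer {
    public static void main(String[] args) throws IOException {
        DataInputStream in = new DataInputStream(
                new FileInputStream(args[0] + "/segments"));
        try {
            int first = in.readInt();   // big-endian, like Lucene writes it
            if (first >= 0) {
                System.out.println("pre-1.4 index (no format number)");
            } else {
                System.out.println("1.4-or-later index, format " + first);
            }
        } finally {
            in.close();
        }
    }
}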

There was a change in formats between 1.2 and 1.3, with no format number 
change.  This was in 1.3 RC1 (note #12 in CHANGES.txt).  The semantics 
of each byte in norm files (.f[0-9]) changed.  In 1.3 each byte 
represented 0.0-255.0 on a linear scale.  In 1.3 and later they're 
eight-bit floats (three-bit mantissa, five-bit exponent, no sign bit). 
The net result is that if you use a 1.2 index with 1.3 or later then the 
correct documents will be returned, but scores and rankings will be wacky.
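
As a small illustration of the 1.3-and-later encoding (my own snippet, using the public Similarity.decodeNorm() in the 1.4 API; the sample byte values are arbitrary):

import org.apache.lucene.search.Similarity;

// Print a few sample norm bytes and the floats they decode to under the
// eight-bit float scheme Doug describes.
public class NormDemo {
    public static void main(String[] args) {
        int[] samples = { 0, 1, 64, 124, 128, 255 };
        for (int i = 0; i < samples.length; i++) {
            byte b = (byte) samples[i];
            System.out.println(samples[i] + " -> " + Similarity.decodeNorm(b));
        }
    }
}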

With the exception of this last bit, 1.4 should be able to correctly 
handle indexes from earlier releases.  Please report if this is not the 
case.

Cheers,
Doug


Re: Possible to remove duplicate documents in sort API?

2004-09-07 Thread Doug Cutting
Kevin A. Burton wrote:
My problem is that I have two machines... one for searching, one for 
indexing.

The searcher has an existing index.
The indexer found an UPDATED document and then adds it to a new index 
and pushes that new index over to the searcher.

The searcher then reloads and when someone performs a search BOTH 
documents could show up (including the stale document).

I can't do a delete() on the searcher because the indexer doesn't have 
the entire index as the searcher.
I can think of a couple ways to fix this.
If the indexer box kept copies of the indexes that it has already sent 
to the searcher, then it can mark updated documents as deleted in these 
old indexes.  Then you can, with the new index, also distribute new .del 
files for the old indexes.

Alternately, you could, on the searcher box, before you open the new 
index, open an IndexReader on all of the existing indexes and mark all 
new documents as deleted in the old indexes.  This shouldn't take more 
than a few seconds.

IndexReader.delete() just sets a bit in a bit vector that is written to 
file by IndexReader.close().  So it's quite fast.
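
A rough sketch of the second option, under assumptions of my own that are not in the thread: each document carries a unique "uid" keyword field, and updatedUids lists the ids that appear in the newly pushed index.

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Mark every old copy of an updated document as deleted in an existing index.
public class PurgeStale {
    public static void purge(String oldIndexPath, String[] updatedUids)
            throws IOException {
        IndexReader reader = IndexReader.open(oldIndexPath);
        try {
            for (int i = 0; i < updatedUids.length; i++) {
                // delete(Term) flips the deleted bit for every matching doc
                reader.delete(new Term("uid", updatedUids[i]));
            }
        } finally {
            reader.close();  // writes the updated .del bit vector
        }
    }
}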

Doug


Re: Why doesn't Document use a HashSet instead of a LinkedList (DocumentFieldList)

2004-09-07 Thread Doug Cutting
Kevin A. Burton wrote:
It looks like Document.java uses its own implementation of a LinkedList..
Why not use a HashMap to enable O(1) lookup... right now field lookup is 
O(N) which is certainly no fun.

Was this benchmarked?  Perhaps there's the assumption that since 
documents often have few fields the object overhead and hashcode 
overhead would have been less this way.
I have never benchmarked this but would be surprised if it makes a 
measurable difference in any real application.  A linked list is used 
because it naturally supports multiple entries with the same key.  A 
home-grown linked list was used because, when Lucene was first written, 
java.util.LinkedList did not exist.

Please feel free to benchmark this against a HashMap of LinkedList of 
Field.  This would be slower to construct, which may offset any 
increased access speed.
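
For anyone who wants to take Doug up on that, a crude micro-benchmark of the current list-backed lookup might start like this (field names, counts, and iteration numbers are arbitrary; the timing is only indicative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Time repeated Document.get() calls, which walk the internal field list.
public class FieldLookupBench {
    public static void main(String[] args) {
        Document doc = new Document();
        for (int i = 0; i < 10; i++) {
            doc.add(Field.Keyword("field" + i, "value" + i));
        }
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1000000; i++) {
            doc.get("field" + (i % 10));
        }
        System.out.println("1M lookups took "
                + (System.currentTimeMillis() - start) + " ms");
    }
}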

Doug


Re: Spam:too many open files

2004-09-07 Thread Daniel Naber
On Tuesday 07 September 2004 17:41, [EMAIL PROTECTED] wrote:

 A note to developers, the code checked into lucene CVS ~Aug 15th, post
 1.4.1, was causing frequent index corruptions. When I reverted back to
 version 1.4 I no longer am getting the corruptions.

Here are some changes from around that day:

http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentReader.java
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java

Could you check which of those might have caused the problem? I guess 
there's not much the developers can do without the problem being 
reproducible.

regards
 Daniel

-- 
http://www.danielnaber.de




getting most common terms for a smaller set of documents

2004-09-07 Thread wallen
Dear Lucene Users:

What is the best way to get the most common terms for a subset of the total
documents in your index?

I know how to get the most common terms for a field for the entire index,
but what is the most efficient way to do this for a subset of documents?

Here is the code I am using to get the top numberOfTerms common terms for
the field fieldName:

public TermInfo[] mostCommonTerms(String fieldName, int numberOfTerms)
{
    // make sure min() will get a positive number
    if (numberOfTerms < 1)
    {
        numberOfTerms = Integer.MAX_VALUE;
    }
    numberOfTerms = Math.min(numberOfTerms, 50);
    //String[] commonTerms = new String[numberOfTerms];
    try
    {
        IndexReader reader = IndexReader.open(indexPath);
        TermInfoQueue tiq = new TermInfoQueue(numberOfTerms);
        TermEnum terms = reader.terms();

        int minFreq = 0;
        while (terms.next())
        {
            if (fieldName.equalsIgnoreCase(terms.term().field()))
            {
                if (terms.docFreq() > minFreq)
                {
                    tiq.put(new TermInfo(terms.term(), terms.docFreq()));
                    if (tiq.size() >= numberOfTerms)    // if tiq overfull
                    {
                        tiq.pop();                      // remove lowest in tiq
                        minFreq = ((TermInfo) tiq.top()).docFreq;   // reset minFreq
                    }
                }
            }
        }
        TermInfo[] res = new TermInfo[tiq.size()];
        for (int i = 0; i < res.length; i++)
        {
            res[res.length - i - 1] = (TermInfo) tiq.pop();
        }
        reader.close();
        return res;
    }
    catch (IOException ioe)
    {
        logger.error("IOException: " + ioe.getMessage());
    }
    return null;
}




Re: telling one version of the index from another?

2004-09-07 Thread Bill Janssen
Thanks, Doug, much as I'd figured from looking at the code.

Here's a follow-up question:  Is there any programmatic way to tell
which version of the Lucene code a program is using?  A version number
or string would be great (perhaps an idea for the next release), but a
list of classes in one version but not in the previous one would do
for the moment.

 (Perhaps we should remedy this, by, e.g., always revving 
 the segments version whenever any file changes format.)

I think you mean the segments format, right?  And I highly recommend
doing so.

Bill




lucene locks index, tomcat has to stop and restart

2004-09-07 Thread hui liu
Hi all,

I met with such a problem with lucene demo:

Each time I create the Lucene index, I have to first stop Tomcat, and
restart Tomcat after the index is created. The reason is that the index is
locked when using the IndexReader.open(index) method in the JSP file.

So, I tried to modify the JSP code by adding close(), but it shows an
error saying "close() is not a static method". I checked the source code
of Lucene's IndexReader and found that the close() method is final, not
static. I tried to change it to static, but that resulted in many errors.

So, has anybody met a similar problem? Do you have any solutions?

Thank you very very much.!!

Ivy.




Moving from a single server to a cluster

2004-09-07 Thread Ben Sinclair
My application currently uses Lucene with an index living on the
filesystem, and it works fine. I'm moving to a clustered environment
soon and need to figure out how to keep my indexes together. Since the
index is on the filesystem, each machine in the cluster will end up
with a different index.

I looked into JDBC Directory, but it's not tested under Oracle and
doesn't seem like a very mature project.

What are other people doing to solve this problem?

-- 
Ben Sinclair
[EMAIL PROTECTED]




lucene index parser problem

2004-09-07 Thread hui liu
Hi,

I have a problem when creating a Lucene index for many HTML files:

It shows an "aborted, expected tagname ... tagend" error for those HTML files
which contain JavaScript.  It seems it cannot parse those tags.
Does anyone have any solution?

Thank you very very much...!!!

Ivy.






Re: lucene index parser problem

2004-09-07 Thread Patrick Burleson
Why oh why did you send this to the tomcat lists?

Don't cross post! Especially when the question doesn't even apply to
one of the lists.

Patrick

On Tue, 7 Sep 2004 16:35:35 -0400, hui liu [EMAIL PROTECTED] wrote:
 Hi,
 
 I have a problem when creating a Lucene index for many HTML files:
 
 It shows an "aborted, expected tagname ... tagend" error for those HTML files
 which contain JavaScript.  It seems it cannot parse those tags.
 Does anyone have any solution?
 
 Thank you very very much...!!!
 
 Ivy.
 



Re: lucene locks index, tomcat has to stop and restart

2004-09-07 Thread Patrick Burleson
This isn't a Tomcat-specific problem, but sounds like a problem with
how the reader is being used.

Somewhere in the JSP an IndexReader variable was probably assigned.
A line something like:

IndexReader ir = IndexReader.open(somepath);

To close the reader, and thus solve the problem, somewhere later, you need:

ir.close();

with the needed try/catch in place. 
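
Put together, the pattern looks like this (a sketch only; the index path and the handling in the catch block are placeholders):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

// Open the reader, use it, and always close it in finally so the index is
// released even if something in between throws.
public class ReaderLifecycle {
    public static void searchSafely(String indexPath) {
        IndexReader ir = null;
        try {
            ir = IndexReader.open(indexPath);
            // ... use ir here ...
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (ir != null) {
                try { ir.close(); } catch (IOException ignored) { }
            }
        }
    }
}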

Again, please refrain from cross-posting...just because it happened on
Tomcat doesn't make it a Tomcat problem. This is clearly a lucene
usage problem.

Patrick

On Tue, 7 Sep 2004 16:37:42 -0400, hui liu [EMAIL PROTECTED] wrote:
 Hi,
 
 I met with such a problem with lucene demo:
 
 Each time when I create lucene index, I have to first stop tomcat, and
 restart tomcat after the index is created. The reason is: the index is
 locked when using IndexReader.open(index) method in the jsp file.
 
 So, I tried to modify the jsp codes by adding close(), but it shows
 error which said close() is not a static method. I checked the
 source codes of lucene IndexReader methods, and found that the close()
 method is final not static. I tried to change it to static, but
 resulted in many errors.
 
 So, does anybody meet the similar problem as me? Do you have any solutions?
 
 Thank you very very much.!!
 
 Ivy.
 



Re: lucene locks index, tomcat has to stop and restart

2004-09-07 Thread Patrick Burleson
Ah, I see your problem. From the Lucene Javadocs on IndexSearcher.close():

Note that the underlying IndexReader is not closed, if IndexSearcher
was constructed with IndexSearcher(IndexReader r). If the IndexReader
was supplied implicitly by specifying a directory, then the
IndexReader gets closed.

Since you are explicitly passing in an IndexReader, the IndexSearcher
is not closing it. But since you created the IndexReader without
retaining a reference to it, you can not close it, thus you will
always have an open index.

You have a couple of options:

Either retain a reference to IndexReader by doing the following:

IndexReader ir = IndexReader.open(indexName);
IndexSearcher searcher = new IndexSearcher(ir);

or if indexName is just the path to the index, then just use:

IndexSearcher searcher = new IndexSearcher(indexName);

since that will manage the IndexReader for you.

Hope that helps.

Patrick

CCing list for archives.


On Tue, 7 Sep 2004 18:38:43 -0400, hui liu [EMAIL PROTECTED] wrote:
 First of all, thanks for your reply:-)
 
 But actually, I've already tried this and here is my code:
 
 searcher = new IndexSearcher(IndexReader.open(indexName));
 
 and at some later place I wrote:
 
 IndexReader.close();
 
 Both of them are within try and catch, and then I got such an error in
 IE by tomcat:
 
 non-static method close() cannot be referenced from a static context.
 
 I read the source code of IndexReader and found that the method
 close() is final not static. so I tried to change it to static, but
 got even more errors.
 
 I am wondering how do you use lucene? Has anyone met with the same thing?
 
 Thanks a lot.
 
 Ivy.
 
 
 
 
 On Tue, 7 Sep 2004 17:03:00 -0400, Patrick Burleson [EMAIL PROTECTED] wrote:
  This isn't a Tomcat specific problem, but sounds like a problem with
  how you the reader is being used.
 
  Somewhere in the JSP a IndexReader variable was probably assigned to.
  A line something like:
 
  IndexReader ir = IndexReader.open(somepath);
 
  To close the reader, and thus solve the problem, somewhere later, you need:
 
  ir.close();
 
  with the needed try/catch in place.
 
  Again, please refrain from cross-posting...just because it happened on
  Tomcat doesn't make it a Tomcat problem. This is clearly a lucene
  usage problem.
 
  Patrick
 
 
 
  On Tue, 7 Sep 2004 16:37:42 -0400, hui liu [EMAIL PROTECTED] wrote:
   Hi,
  
   I met with such a problem with lucene demo:
  
   Each time when I create lucene index, I have to first stop tomcat, and
   restart tomcat after the index is created. The reason is: the index is
   locked when using IndexReader.open(index) method in the jsp file.
  
   So, I tried to modify the jsp codes by adding close(), but it shows
   error which said close() is not a static method. I checked the
   source codes of lucene IndexReader methods, and found that the close()
   method is final not static. I tried to change it to static, but
   resulted in many errors.
  
   So, does anybody meet the similar problem as me? Do you have any solutions?
  
   Thank you very very much.!!
  
   Ivy.
  



Re: Use of + and - in queries

2004-09-07 Thread Otis Gospodnetic
Hi Bill,

No difference, it's just that Lucene's query syntax recognizes both
'NOT' and '-' and uses them the same way - to exclude certain documents
from search results.
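
A quick way to see this (my own snippet, using the static QueryParser.parse() from the 1.4 API and an arbitrary "contents" field) is to parse both spellings and print them; the prohibited clause comes out the same:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Parse the NOT form and the - form of the same query and compare them.
public class NotVsMinus {
    public static void main(String[] args) throws Exception {
        Query a = QueryParser.parse(
                "\"jakarta apache\" NOT \"jakarta lucene\"", "contents",
                new StandardAnalyzer());
        Query b = QueryParser.parse(
                "\"jakarta apache\" -\"jakarta lucene\"", "contents",
                new StandardAnalyzer());
        System.out.println(a.toString("contents"));
        System.out.println(b.toString("contents"));
    }
}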

Otis

--- Bill Tschumy [EMAIL PROTECTED] wrote:

 I don't understand the difference in using + and - in queries
 compared 
 to using AND and NOT.  Even the Query Syntax document seems a bit 
 confused.  In the section on the NOT operator it says:
 
  To search for documents that contain "jakarta apache" but not 
 "jakarta lucene" use the query:
 
        "jakarta apache" NOT "jakarta lucene"
 
 Then in the section on the - operator you read this:
 
  To search for documents that contain "jakarta apache" but not 
 "jakarta lucene" use the query:
 
        "jakarta apache" -"jakarta lucene"
 
 So what's the difference?
 
 -- 
 Bill Tschumy
 Otherwise -- Austin, TX
 http://www.otherwise.com
 
 



Re: Moving from a single server to a cluster

2004-09-07 Thread Otis Gospodnetic
I've used scp and rsync successfully in the past.
Lucene now includes a remote searcher (RMI stuff), so you may want to
consider a single index, too.
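
For the single-index route, a bare-bones sketch of exporting one searcher with the RMI classes that ship in Lucene 1.4 (the bind name, the index path, and the assumption that an RMI registry is already running are mine):

import java.rmi.Naming;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RemoteSearchable;
import org.apache.lucene.search.Searchable;

// Export a local IndexSearcher so other cluster nodes can search it remotely.
public class SearchServer {
    public static void main(String[] args) throws Exception {
        Searchable local = new IndexSearcher("/data/lucene/index");
        Naming.rebind("//localhost/LuceneSearchable", new RemoteSearchable(local));
        System.out.println("Remote searchable bound.");
    }
}

Clients would Naming.lookup() the same name, cast the result to Searchable, and search it directly or through a MultiSearcher.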

Otis

--- Ben Sinclair [EMAIL PROTECTED] wrote:

 My application currently uses Lucene with an index living on the
 filesystem, and it works fine. I'm moving to a clustered environment
 soon and need to figure out how to keep my indexes together. Since
 the
 index is on the filesystem, each machine in the cluster will end up
 with a different index.
 
 I looked into JDBC Directory, but it's not tested under Oracle and
 doesn't seem like a very mature project.
 
 What are other people doing to solve this problem?
 
 -- 
 Ben Sinclair
 [EMAIL PROTECTED]
 



RE: too many open files

2004-09-07 Thread Will Allen

I suspect it has to do with this change:

--- jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java  2004/08/08 13:03:59  1.12
+++ jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java  2004/08/11 17:37:52  1.13

I wouldn't know where to start to reproduce the problem, as it was happening just once 
a day or so on an index that was being both queried and added to in real time, to the tune 
of 100,000 docs a day / 50 queries a day.

The corruption was always the same thing: the segments file listed an entry for a file 
that was not there.

-Will

-Original Message-
From: Daniel Naber [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 07, 2004 1:54 PM
To: Lucene Users List
Subject: Re: Spam:too many open files


On Tuesday 07 September 2004 17:41, [EMAIL PROTECTED] wrote:

 A note to developers, the code checked into lucene CVS ~Aug 15th, post
 1.4.1, was causing frequent index corruptions. When I reverted back to
 version 1.4 I no longer am getting the corruptions.

Here are some changes from around that day:

http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentReader.java
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java

Could you check which of those might have caused the problem? I guess 
there's not much the developers can do without the problem being 
reproducible.

regards
 Daniel

-- 
http://www.danielnaber.de




Re: Spam:too many open files

2004-09-07 Thread Dmitry Serebrennikov
Hi Wallen,
Actually, the files Daniel listed were modified on 8/11 and then again 
on 8/15. Between 8/11 and 8/15, I believe there could have 
been any number of problems, including corrupt indexes and poor 
multithreaded performance. However, I think after 8/15, the files should 
be in good working order. If you are not sure if you saw problems with 
pre-8/15 or post-8/15 version of the code, is it possible for you to try 
the latest CVS and see if the problem exists now? If it does, it will of 
course require urgent attention.

Thanks very much!
Dmitry.
Daniel Naber wrote:
On Tuesday 07 September 2004 17:41, [EMAIL PROTECTED] wrote:
 

A note to developers, the code checked into lucene CVS ~Aug 15th, post
1.4.1, was causing frequent index corruptions.  When I reverted back to
version 1.4 I no longer am getting the corruptions.
   

Here are some changes from around that day:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentReader.java
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java
Could you check which of those might have caused the problem? I guess 
there's not much the developers can do without the problem being 
reproducible.

regards
Daniel
 




MultiFieldQueryParser seems broken... Fix attached.

2004-09-07 Thread Bill Janssen
Hi!

I'm using Lucene for an application which has lots of fields/document,
in which the users can specify in their config files what fields they
wish to be included by default in a search.  I'd been happily using
MultiFieldQueryParser to do the searches, but the darn users started
demanding more Google-like searches; that is, they want the search
terms to be implicitly AND-ed instead of implicitly OR-ed.  No
problem, thinks I, I'll just set the operator.

Only to find this has no effect on MultiFieldQueryParser.

Once I looked at the code, I find that MultiFieldQueryParser combines
the clauses at the wrong level -- it combines them at the outermost
level instead of the innermost level.  This means that if you have two
fields, author and title, and the search string "cutting lucene",
you'll get the final query

   (title:cutting title:lucene) (author:cutting author:lucene)

If the search operator is OR, this isn't a problem.  But if it is AND,
you have two problems.  The first is that MultiFieldQueryParser seems
to ignore the operator entirely.  But even if it didn't, the second
problem is that the query formed would be

   +(title:cutting title:lucene) +(author:cutting author:lucene)

That is, a document with the word "lucene" in both the author field and
the title field would match, even if "cutting" appeared nowhere.  This
clearly isn't what the searcher intended.
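
The behaviour is easy to reproduce with the stock 1.4.1 parser (my own snippet; the field names match the example above):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.Query;

// Parse "cutting lucene" against two fields and print the expanded query.
public class MultiFieldDemo {
    public static void main(String[] args) throws Exception {
        String[] fields = { "title", "author" };
        Query q = MultiFieldQueryParser.parse("cutting lucene", fields,
                new StandardAnalyzer());
        // Should print something like:
        // (title:cutting title:lucene) (author:cutting author:lucene)
        System.out.println(q.toString());
    }
}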

You can re-write MultiFieldQueryParser, as I've done in the example
code which I append here.  This little program allows you to run
either my parser (-DSearchTest.QueryParser=new) or the old parser
(-DSearchTest.QueryParser=old).  It allows you to use either OR
(-DSearchTest.QueryDefaultOperator=or) or AND
(-DSearchTest.QueryDefaultOperator=and) as the operator.  And it
allows you to pick your favorite set of default search terms
(-DSearchTest.QueryDefaultFields=author:title:body, for example).  It
takes one argument, a query string, and outputs the re-written query
after running it through the query parser.  So to evaluate the above
query:

% java -classpath /import/lucene/lucene-1.4.1.jar:. \
   -DSearchTest.QueryDefaultFields=title:author \
   -DSearchTest.QueryDefaultOperator=AND \
   -DSearchTest.QueryParser=old \
   SearchTest "cutting lucene"
query is (title:cutting title:lucene) (author:cutting author:lucene)
%

The class NewMultiFieldQueryParser does the combination at the inner
level, using an override of addClause, instead of the outer level.
Note that it can't cover all cases (notably PhrasePrefixQuery, because
that class has no access methods which allow one to introspect over
it, and SpanQueries, because I don't understand them well enough :-).
I post it here in advance of filing a formal bug report for early
feedback.  But it will show up in a bug report in the near future.

Running the above query with the new parser gives:

% java -classpath /import/lucene/lucene-1.4.1.jar:. \
   -DSearchTest.QueryDefaultFields=title:author \
   -DSearchTest.QueryDefaultOperator=AND \
   -DSearchTest.QueryParser=new \
   SearchTest "cutting lucene"
query is +(title:cutting author:cutting) +(title:lucene author:lucene)
%

which I claim is what the user is expecting.

In addition, the new class uses an API more similar to QueryParser, so
that the user has less to learn when using it.  The code in it could
probably just be folded into QueryParser, in fact.

Bill


the code for SearchTest:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.FastCharStream;
import org.apache.lucene.queryParser.TokenMgrError;
import org.apache.lucene.queryParser.ParseException;

import java.io.File;
import java.io.StringReader;
import java.util.Date;
import java.util.HashMap;
import java.util.Iterator;
import java.util.StringTokenizer;

class SearchTest {

static class NewMultiFieldQueryParser extends QueryParser {

static private final String DEFAULT_FIELD = "%%";

private String[] fields = null;

public NewMultiFieldQueryParser (String[] f, Analyzer a) {
super(DEFAULT_FIELD, a);
fields = f;
}


RE: Spam:too many open files

2004-09-07 Thread Will Allen
I will deploy and test through the end of the week and report back Friday if the 
problem persists.  Thank you!

-Original Message-
From: Dmitry Serebrennikov [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 07, 2004 8:40 PM
To: Lucene Users List
Subject: Re: Spam:too many open files


Hi Wallen,

Actually, the files Daniel listed were modified on 8/11 and then again 
on 8/15. Between 8/11 and 8/15, I believe there could have 
been any number of problems, including corrupt indexes and poor 
multithreaded performance. However, I think after 8/15, the files should 
be in good working order. If you are not sure if you saw problems with 
pre-8/15 or post-8/15 version of the code, is it possible for you to try 
the latest CVS and see if the problem exists now? If it does, it will of 
course require urgent attention.

Thanks very much!
Dmitry.


Daniel Naber wrote:

On Tuesday 07 September 2004 17:41, [EMAIL PROTECTED] wrote:

  

A note to developers, the code checked into lucene CVS ~Aug 15th, post
1.4.1, was causing frequent index corruptions.  When I reverted back to
version 1.4 I no longer am getting the corruptions.



Here are some changes from around that day:

http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/SegmentReader.java
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java

Could you check which of those might have caused the problem? I guess 
there's not much the developers can do without the problem being 
reproducible.

regards
 Daniel

  






Use of explain() vs search()

2004-09-07 Thread Minh Kama Yie
Hi all,
I was wondering if anyone could tell me what the expected behaviour is 
for calling an explain() without calling a search() first on a 
particular query. Would it effectively do a search and then I can 
examine the Explanation in order to check whether it matches?

I'm currently looking at some existing code to this effect:
Explanation exp = searcher.explain(myQuery, docId);
// Where docId was _not_ returned by a search on myQuery
if (exp.getValue() > 0.0f)
{
    // Assuming document for docId matched query.
}
Is the assumption wrong?
I ask because the result of this code is inconsistent with
Hits h = searcher.search(myQuery);  // there are no hits returned.
Thanks in advance,
Minh



Re: Use of explain() vs search()

2004-09-07 Thread Minh Kama Yie
Hi all,
Sorry I should clarify my last point.
The search() would return no hits, but the explain() using the 
apparently invalid docId returns a value greater than 0.

For what it's worth it's performing a PhraseQuery.
Thanks in advance,
Minh
Minh Kama Yie wrote:
Hi all,
I was wondering if anyone could tell me what the expected behaviour is 
for calling an explain() without calling a search() first on a 
particular query. Would it effectively do a search and then I can 
examine the Explanation in order to check whether it matches?

I'm currently looking at some existing code to this effect:
Explanation exp = searcher.explain(myQuery, docId);
// Where docId was _not_ returned by a search on myQuery
if (exp.getValue() > 0.0f)
{
    // Assuming document for docId matched query.
}
Is the assumption wrong?
I ask because the result of this code is inconsistent with
Hits h = searcher.search(myQuery);  // there are no hits returned.
Thanks in advance,
Minh



pdf in Chinese

2004-09-07 Thread [EMAIL PROTECTED]
Hi all,
I use PDFBox to parse PDF files into Lucene documents.  When I parse Chinese
PDF files, PDFBox does not always succeed.
Does anyone have any advice?

