Re: can't delete from an index using IndexReader.delete()

2004-02-20 Thread Dhruba Borthakur
Hi folks,

I am using the latest and greatest Lucene jar file and am facing a problem
with deleting documents from the index. Browsing the mail archive, I found
that the following email (June 2003) listed the exact problem that I am
encountering.

In short: I am using Field.Text("id", value) to mark a document. Then I use
reader.delete(new Term("id", value)) to remove the document: this call
returns 0 and fails to delete the document. The attached sample program
shows this behaviour.

I would appreciate it a lot if anybody on this list has encountered this
problem and would like to share his/her solution with me.

thanks,
dhruba
From: Robert Koberg [EMAIL PROTECTED]
Subject: can't delete from an index using IndexReader.delete()
Date: Mon, 23 Jun 2003 14:38:25 -0700
Here is a simple class that can reproduce the problem (happens with the last
stable release too). Let me know if you would prefer this as an attachment.
Call like this:
java TestReaderDelete existing_id new_label
- or -
Try:
java TestReaderDelete B724547 ppp
and then try:
java TestReaderDelete a266122794 ppp
If an index has not been created, it will create one. Keep running one of
the above example commands (with and without deleting the index directory)
and watch what the System.out.println calls report.


import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import java.io.*;
import java.util.*;

class TestReaderDelete {

  public static void main(String[] args) throws IOException {
    File index = new File("./testindex");
    // Build a small test index the first time the program is run.
    if (!index.exists()) {
      HashMap test_map = new HashMap();
      test_map.put("preamble_content", "Preamble content bbb");
      test_map.put("art_01_section_01", "Article 1, Section 1");
      test_map.put("toc_tester", "Test TOC XML bbb");
      test_map.put("B724547", "bio example");
      test_map.put("a266122794", "tester");
      indexFiles(index, test_map);
    }
    String identifier = args[0];
    String new_label = args[1];
    testDeleteAndAdd(index, identifier, new_label);
  }

  public static void indexFiles(File index, HashMap test_map) {
    try {
      // true: create a new index, overwriting any existing one.
      IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(), true);
      for (Iterator i = test_map.entrySet().iterator(); i.hasNext(); ) {
        Map.Entry e = (Map.Entry) i.next();
        System.out.println("Adding: " + e.getKey() + " = " + e.getValue());
        Document doc = new Document();
        doc.add(Field.Text("id", (String) e.getKey()));
        doc.add(Field.Text("label", (String) e.getValue()));
        writer.addDocument(doc);
      }
      writer.optimize();
      writer.close();
    } catch (Exception e) {
      System.out.println(" caught a " + e.getClass() +
          "\n with message: " + e.getMessage());
    }
  }

  public static void testDeleteAndAdd(File index, String identifier, String new_label)
      throws IOException {
    IndexReader reader = IndexReader.open(index);
    System.out.println("!!! reader.numDocs() : " + reader.numDocs());
    System.out.println("reader.indexExists(): " + IndexReader.indexExists(index));
    System.out.println("term field: " + new Term("id", identifier).field());
    System.out.println("term text: " + new Term("id", identifier).text());
    System.out.println("reader.docFreq: " + reader.docFreq(new Term("id", identifier)));
    System.out.println("deleting target now...");
    int deleted_num = reader.delete(new Term("id", identifier));
    System.out.println("*** deleted_num: " + deleted_num);
    reader.close();
    try {
      // false: open the existing index and append to it.
      IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(), false);
      Document doc = new Document();
      doc.add(Field.Text("id", identifier));
      doc.add(Field.Text("label", new_label));
      writer.addDocument(doc);
      writer.optimize();
      writer.close();
    } catch (Exception e) {
      System.out.println(" caught a " + e.getClass() +
          "\n with message: " + e.getMessage());
    }
    // Re-open the reader so numDocs() reflects the delete and the re-add
    // (the previous reader was closed above).
    reader = IndexReader.open(index);
    System.out.println("!!! reader.numDocs() after deleting and adding : " +
        reader.numDocs());
    reader.close();
  }
}



  -Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Sunday, June 22, 2003 9:42 PM
To: Lucene Users List
The code looks fine.  Unfortunately, the provided code is not a full,
self-sufficient class that I can run on my machine to verify the
behaviour that you are describing.
Otis


Re: Re: can't delete from an index using IndexReader.delete()

2004-02-20 Thread Morus Walter
Dhruba Borthakur writes:
 Hi folks,
 
 I am using the latest and greatest Lucene jar file and am facing a problem
 with deleting documents from the index. Browsing the mail archive, I found
 that the following email (June 2003) listed the exact problem that I am
 encountering.
 
 In short: I am using Field.Text("id", value) to mark a document. Then I use
 reader.delete(new Term("id", value)) to remove the document: this call
 returns 0 and fails to delete the document. The attached sample program
 shows this behaviour.
 
You don't tell us what your ids look like, but Field.Text("id", value)
tokenizes value, that is, it splits value into whatever the analyzer considers
to be tokens and creates a term for each token.
By contrast, new Term("id", value) creates one term containing the whole value.

So I guess your ids are split into several tokens by the analyzer you use,
and therefore they won't be matched by the term you construct for the delete.

Using keyword fields instead of text fields for the id should help.
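
For illustration, here is a minimal sketch of that keyword-field approach. The
class name, index location and example id below are made up rather than taken
from the original program; the point is that Field.Keyword stores the id as a
single untokenized term, so a Term built from the same exact value matches it
when deleting.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import java.io.File;

class KeywordDeleteSketch {
  public static void main(String[] args) throws Exception {
    File dir = new File("./keyword-test");

    // Index one document whose id is stored as a single, untokenized term.
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(Field.Keyword("id", "B724547"));   // not analyzed, kept verbatim
    doc.add(Field.Text("label", "bio example"));
    writer.addDocument(doc);
    writer.close();

    // Delete by the same exact term; this should report 1 deleted document.
    IndexReader reader = IndexReader.open(dir);
    int deleted = reader.delete(new Term("id", "B724547"));
    System.out.println("deleted: " + deleted);
    reader.close();
  }
}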

Morus




RE: MultiReader

2004-02-20 Thread Rasik Pandey
Hello,

 I just committed one!  This was really already there, in
 SegmentsReader,
 but it was not public and needed a few minor changes.  Enjoy.
 
 Doug

Great, thanks! Do you feel as though managing an index made up of numerous smaller
indices is an effective use of the MultiReader and MultiSearcher? Ignoring for a
moment the potential of causing a "Too many open files" error, I feel it may be a
decent/reasonable way to ensure overall index integrity by managing smaller parts.
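
For concreteness, here is a minimal sketch of what searching several smaller
indices through a MultiSearcher could look like (the index paths, field name
and query below are made up for illustration):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

class MultiSearchSketch {
  public static void main(String[] args) throws Exception {
    // One IndexSearcher per small index; MultiSearcher merges their results.
    Searchable[] parts = {
      new IndexSearcher("/indexes/part1"),
      new IndexSearcher("/indexes/part2"),
      new IndexSearcher("/indexes/part3")
    };
    MultiSearcher searcher = new MultiSearcher(parts);
    Query query = QueryParser.parse("lucene", "contents", new StandardAnalyzer());
    Hits hits = searcher.search(query);
    System.out.println("total hits: " + hits.length());
    searcher.close();   // closes the underlying searchers as well
  }
}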

As a side note, regarding the "Too many open files" issue, has anyone noticed that
this could be related to the JVM? For instance, I have a coworker who tried to run a
number of optimized indexes in one JVM instance and received the "Too many open
files" error. With the same number of available file descriptors (on Linux, ulimit =
unlimited), he split the indices over two JVM instances and his problem disappeared.
He also tested the problem by increasing the memory available to the JVM instance,
via the -Xmx parameter, with all indices running in one JVM instance, and again the
problem disappeared. I think the issue deserves more testing to pin-point the exact
cause, but I was just wondering if anyone has already experienced anything similar
or if this information could be of use to anyone, in which case we should probably
start a new thread dedicated to this issue.


Regards,
Rasik

 





open files under linux

2004-02-20 Thread Morus Walter
Rasik Pandey writes:
 
 As a side note, regarding the "Too many open files" issue, has anyone noticed
 that this could be related to the JVM? [...] With the same number of available
 file descriptors (on Linux, ulimit = unlimited), he split the indices over two
 JVM instances and his problem disappeared.
 
The limit is per process. Two JVMs make two processes.
(There's a per-system limit too, but it's much higher; I think you find
it in /proc/sys/fs/file-max and its default value depends on the amount
of memory the system has.)

AFAIK there's no way of setting openfiles to unlimited. At least neither
bash nor tcsh accepts that.
But it should not be a problem to set it to very high values.
And you should be able to increase the system wide limit by writing to
/proc/sys/fs/file-max as long as you have enough memory.

I never used this, though.

Morus




RE: open files under linux

2004-02-20 Thread Stephen Eaton
Easiest way is to use sysctl to view and change the max files setting. For
some reason file-max is set to 8000 or something small (it is for Mandrake,
anyway).

Run sysctl fs.file-nr to view what is currently in use and what the max is set
to.  It reports file usage in the format xxx yyy zzz, where xxx = max
that has been used by the system, yyy = currently being used, zzz = max
allocated.  So yyy should never get near zzz; if it does, you will get
the "too many open files" errors.  Try running the command when you are hitting
the issue and see what the system values are.

To change it, use

sysctl -w fs.file-max=32768

to give it something decent.

Should solve your problems.

Stephen...







Concurrency

2004-02-20 Thread Alan Smith
Hi

I've just got a couple of questions which I can't quite work out... wondered if
someone could help me with them:

1. What happens if I make a backup (copy) of an index while documents are
being added? Can it cause problems, and if so, is there a way to safely do
this?

2. When I create a new IndexSearcher, what method does Lucene use to take a
'snapshot' of the index (because if I add documents after the search object
is created, they don't appear in the search results)?

Many thanks

Al




Re: Concurrency

2004-02-20 Thread Otis Gospodnetic
 I've just got a couple of questions which I can't quite work out...
 wondered if someone could help me with them:
 
 1. What happens if I make a backup (copy) of an index while documents
 are being added? Can it cause problems, and if so, is there a way to
 safely do this?

You should be okay.  When new documents are added, they are added to
new segments.  A 'table of contents' of all valid segments is in the
'segments' file.  Even if you copy extra segments, your index will
still work; it's just that your searches may not search newly created
segments whose existence was not yet registered in the segments file when
you copied the index.

 2. When I create a new IndexSearcher, what method does Lucene use to
 take a 'snapshot' of the index (because if I add documents after the
 search object is created, they don't appear in the search results)?

This is related to the answer under 1.  New documents are not seen with
an old IndexSearcher, because the old IndexSearcher is not aware of new
segments.
It would have to re-read the segments file and read any new segments
found, in order to become aware of new segments and documents in them.
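
In code, becoming aware of the new segments usually just means opening a fresh
IndexSearcher once the writer is done and closing the old one. A minimal sketch
(the class, method name and path handling here are illustrative, not a Lucene
API):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;

class ReopenSketch {
  // After an IndexWriter has added documents and closed, the old searcher
  // still sees the index as it was when it was opened.  Opening a new
  // IndexSearcher re-reads the segments file and so sees the new documents.
  static IndexSearcher reopen(IndexSearcher old, String indexPath) throws IOException {
    IndexSearcher fresh = new IndexSearcher(indexPath);
    old.close();
    return fresh;
  }
}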

Otis





Re: Concurrency

2004-02-20 Thread Doug Cutting
Alan Smith wrote:
1. What happens if I make a backup (copy) of an index while documents
are being added? Can it cause problems, and if so, is there a way to
safely do this?
In general, this is not safe.  A copy may not be a usable index.  The
segments file points to the current set of files.  An IndexWriter
periodically rewrites the segments file, and then may delete files which
are no longer used.  If you copy the segments file and then, before you
copy all the files, the segments file is rewritten and some files are
deleted, your copied index will be incoherent.  Or, vice versa, you
might copy the segments file last, and it may refer to newly created
files which you failed to copy.

The safest way to do this is to use an IndexReader to make your backup, 
with something like:

  // index and backup here stand for the source and destination index
  // directories, and analyzer for the Analyzer used when indexing.
  IndexReader reader = IndexReader.open(index);
  IndexWriter writer = new IndexWriter(backup, analyzer, true);
  writer.addIndexes(new IndexReader[] { reader });
  writer.close();
  reader.close();
This will use Lucene's locking code to make sure all is safe.

2. When I create a new IndexSearcher, what method does Lucene use to
take a 'snapshot' of the index (because if I add documents after the
search object is created, they don't appear in the search results)?
It keeps the existing set of files open.  As described above, Lucene 
modifies an index by adding new files and removing old ones.  But an 
open index never changes its set of files.

Doug



Re: Concurrency

2004-02-20 Thread Doug Cutting
David Townsend wrote:
Does this mean that if an IndexSearcher has hold of a segments file, and the index is then optimised, any subsequent search will use a list of files that probably don't exist any more?
The IndexSearcher (through an IndexReader) has the files open, so it is 
still valid, and may be searched.  On Unix, the files may have already been 
deleted, since Unix lets you delete files which are open, reclaiming the 
disk space when they're closed (or the process exits).  On Win32 one 
cannot delete open files and Lucene must queue deletions.

So, yes, on Unix you may be searching index files that have been 
deleted.  Whether such files exist or not is a matter for philosophers.

Doug



Re: Concurrency

2004-02-20 Thread Michael Giles
It would be great if we could come up with a way to integrate the Lucene 
locking information with something more incremental like rsync.  At Furl ( 
http://www.furl.net ) we have this problem in spades because we have 
thousands (and thousands) of indexes that need to be backed up.  Currently, 
we run rsync frequently (i.e. hourly) with safe (i.e. stopped server) 
daily snapshots (that rollover, etc.).  Essentially we are playing the odds 
on the frequent backups since indexes are only open for the moment that a 
new item is saved (i.e. they are reopened/closed each time) and we also 
have all of the data to recreate them if need be.  But this is definitely a 
topic we have discussed and it would be nice to have a solution to it.

Any other ideas out there (i.e. a way to check the locks and then retry a 
bit later)?

-Mike




Concurrency

2004-02-20 Thread Alan Smith
Thanks Doug and Otis. That's very helpful.

So if an IndexSearcher is open, new documents are just added to existing
index files which remain under the same name, yeah? And an IndexSearcher
which continues to search these changed files will still return the same
results (i.e. the modifications won't affect the searches)... is that right?

Cheers

Al




using relative paths

2004-02-20 Thread Raman
Hi
I am using Lucene with the Struts framework in my web application.
I am using Tomcat and have my application in the webapp directory. I want to create
indexes for all the HTML files lying in the html folder.
I am facing a problem giving relative paths for the source and index folders in my
function that creates the index:

indexDocs(root, index, create); // add new docs

How can I give relative paths for that?
Please give me an example; that will be helpful.

Thanks..
Raman Garg



-- Raman