Re: Concurrent searching & re-indexing

2005-02-17 Thread Jim Lynch
It failed for me on Linux.
Paul Mellor wrote:
"on windows you cannot delete open files, so Lucene AFAIK (I don't use
windows) postpones the deletion to a time, when the file is closed"
If Lucene does not in fact postpone the deletion, that would explain the
exception I'm seeing ("java.io.IOException: couldn't delete _a.f1") - the
IndexWriter is attempting to delete the files but the IndexReader has them
open.
Does this then mean that re-indexing whilst searching is inherently unsafe,
but only on Windows?
 

 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Concurrent searching & re-indexing

2005-02-17 Thread Jim Lynch
Hi, Paul,
I brought this point up a while back and didn't get a response.  I've 
found that I frequently get a "file not found" exception when searching 
at the same time an indexing and/or optimize operation is running.  I 
fixed it by trapping the exception and retrying in a loop until the search succeeded.

Jim.
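[Editor's note] Jim's trap-and-retry workaround can be sketched as a small helper. Everything below (class and method names, the backoff interval, the simulated search) is illustrative, not code from this thread:

```java
import java.io.FileNotFoundException;
import java.util.concurrent.Callable;

// Minimal sketch of the "trap and retry" workaround: retry an operation a
// bounded number of times when index files shift underneath a search.
public class RetrySearch {
    public static <T> T withRetry(Callable<T> op, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (FileNotFoundException e) {
                last = e;          // transient: a segment file was replaced mid-search
                Thread.sleep(100); // back off briefly before retrying
            }
        }
        throw last; // still failing after maxAttempts: give up
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated search that fails twice before succeeding.
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new FileNotFoundException("_a.f1");
            return "42 hits";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A bounded attempt count avoids looping forever if the failure turns out not to be transient.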
Paul Mellor wrote:
Otis,
1. If IndexReader takes a snapshot of the index state when opened and then
reads the files when searching, what would happen if the files it takes a
snapshot of are deleted before the search is performed (as would happen with
a reindexing in the period between opening an IndexSearcher and using it to
search)?
 

 



Re: Are wildcard searches supposed to work with fields that are saved, indexed and not tokenized?

2005-02-14 Thread Jim Lynch

Erik Hatcher wrote:
On Feb 14, 2005, at 12:40 PM, Jim Lynch wrote:
I was trying to write some documentation on how to use the tool and 
issued a search for:

contact:DENNIS MORROW

Is that literally the QueryParser string you entered?  If so, that 
parses to:

contact:DENNIS OR defaultField:MORROW
most likely.
Ah! Good point.

And now I get 648 hits, but in some of them the contact doesn't even 
remotely resemble the search pattern.  For instance, here is what
the contact fields contain for some of these hits:
Contact: GENERIC CONTACT
Contact: Andre Gardinalli
Contact: Brett Morrow  (that's especially interesting)
Contact: KEN PATTERSON

And of course there are some with Dennis' name too.
Any idea why this is happening?  I'm using the QueryParser.parse method.

I'm not sure you'll be able to do this with QueryParser with spaces in 
an untokenized field.  First try it with an API created WildcardQuery 
to be sure it works the way you expect.
I didn't really have any expectations other than what I saw didn't make 
sense.  I'll just add to the docs that [this set of fields] can't be 
searched with wildcards. 
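[Editor's note] Erik's point, that QueryParser binds the field only to the first token, can be illustrated with a toy split. This mimics the behavior he describes; it is not Lucene's actual parser:

```java
// Toy illustration of why "contact:DENNIS MORROW" becomes two clauses:
// whitespace ends the fielded term, so the second word falls back to the
// default field (with OR as the default operator).
public class FieldBinding {
    static String explain(String query, String defaultField) {
        StringBuilder out = new StringBuilder();
        for (String clause : query.split("\\s+")) {
            if (out.length() > 0) out.append(" OR ");
            if (clause.contains(":")) out.append(clause);
            else out.append(defaultField).append(":").append(clause);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(explain("contact:DENNIS MORROW", "defaultField"));
    }
}
```

The output matches the parse Erik gives above, which is why the second word can match documents unrelated to the contact field.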

Thanks,
Jim.
Erik


Are wildcard searches supposed to work with fields that are saved, indexed and not tokenized?

2005-02-14 Thread Jim Lynch
I was trying to write some documentation on how to use the tool and 
issued a search for:

contact:DENNIS MORROW
And sure enough I got 647 hits.  Then I changed the search to:
contact:DENNIS MORRO?
And now I get 648 hits, but in some of them the contact doesn't even 
remotely resemble the search pattern.  For instance, here is what
the contact fields contain for some of these hits:
Contact: GENERIC CONTACT
Contact: Andre Gardinalli
Contact: Brett Morrow  (that's especially interesting)
Contact: KEN PATTERSON

And of course there are some with Dennis' name too.
Any idea why this is happening?  I'm using the QueryParser.parse method.
Jim.


Re: What does [] do to a query and what's up with lucene.apache.org?

2005-02-14 Thread Jim Lynch
Otis and Erik,
Thanks for the info.  That's a great reference.
Jim.
Erik Hatcher wrote:
Jim,
The Lucene website is transitioning to the new top-level space.  I 
have  checked out the current site to the new lucene.apache.org area 
and set  up redirects from the old Jakarta URL's.  The source code, 
though, is  not an official part of the website.  Thanks to our 
conversion to  Subversion, though, the source is browsable starting here:

http://svn.apache.org/repos/asf/lucene/java/trunk
The HTML of the website will need link adjustments to get everything  
back in shape.

The brackets are documented here:  
http://lucene.apache.org/queryparsersyntax.html

Erik
On Feb 14, 2005, at 10:31 AM, Jim Lynch wrote:
First I'm getting a
   The requested URL could not be retrieved

While trying to retrieve the URL:  
http://lucene.apache.org/src/test/org/apache/lucene/queryParser/ 
TestQueryParser.java

The following error was encountered:
   Unable to determine IP address from host name for lucene.apache.org
Guess the system is down.
I'm getting this error:
org.apache.lucene.queryParser.ParseException: Encountered "is" at 
line  1, column 15.
Was expecting:
   "]" ...
when I tried to parse the following string "[this is a test]".

I can't find any documentation that tells me what the brackets do to 
a query.  I had a user who was used to another search engine that used
[] to do proximity or near searches and tried it on this one. 
Actually  I'd like to see the documentation for what the parser 
does.  All that  is mentioned in the javadoc is + - and ().  
Obviously there are more  special characters.

Thanks,
Jim.


What does [] do to a query and what's up with lucene.apache.org?

2005-02-14 Thread Jim Lynch
First I'm getting a
   The requested URL could not be retrieved

While trying to retrieve the URL: 
http://lucene.apache.org/src/test/org/apache/lucene/queryParser/TestQueryParser.java 

The following error was encountered:
   Unable to determine IP address from host name for lucene.apache.org
Guess the system is down.
I'm getting this error:
org.apache.lucene.queryParser.ParseException: Encountered "is" at line 
1, column 15.
Was expecting:
   "]" ...
when I tried to parse the following string "[this is a test]".

I can't find any documentation that tells me what the brackets do to a 
query.  I had a user who was used to another search engine that used [] 
to do proximity or near searches and tried it on this one. Actually I'd 
like to see the documentation for what the parser does.  All that is 
mentioned in the javadoc is + - and ().  Obviously there are more 
special characters.

Thanks,
Jim.


Does anyone have a copy of the highlighter code?

2005-02-08 Thread Jim Lynch
Our firewall prevents me from using cvs to check out anything.  Does 
anyone have a jar file or a set of class files publicly available?

Thanks,
Jim.


Re: How do I delete?

2005-02-02 Thread Jim Lynch
OK, the reference field was not parsed.  See:
} else if (key.equals("reference")) {
    reference = value;
    Field fReference = new Field("reference", value, true, true, false);
    doc.add(fReference);
On another examination of my program, the delete does seem to be 
working.  At least the delete returns a value of 1 saying it deleted one 
record.  However the search still keeps finding the old record.  I am 
doing an optimize after each index batch. 

Unfortunately the old record is still there even after I delete it.  So I 
deleted it and replaced it with the date in a different format to see if 
it was really replaced.  The date field indicates I've still got the old 
data in there for some reason.  Is data cached somewhere?

Jim.
Chris Hostetter wrote:
: anywhere.  I checked the count coming back from the delete operation and
: it is zero.  I even tried to delete another unique term with similar
: results.
First off, are you absolutely certain you are closing the reader?  it's
not in the code you listed.
Second, I'd bet $1 that when your documents were indexed, your "reference"
field was analyzed and parsed into multiple terms.  Did you try searching
for the Term you're trying to delete by?
(I hear "luke" is a pretty handy tool for checking exactly which Terms are
in your index)
: >>Here is the delete and associated code:
: >>
: >>  reader = IndexReader.open(database);
: >>
: >>  Term t = new Term("reference",reference);
: >>  try {
: >>reader.delete(t);
: >>  } catch (Exception e) {
: >>System.out.println("Delete exception;"+e);
: >>  }
-Hoss


Re: How do I delete?

2005-02-01 Thread Jim Lynch
Thanks, I'd try that, but I don't think it will make any difference.  If 
I modify the code to not reindex the documents, no files in the index 
directory are touched, hence there is no record of the deletions 
anywhere.  I checked the count coming back from the delete operation and 
it is zero.  I even tried to delete another unique term with similar 
results.

How does one call the commit method anyway? Isn't it automatically called?
Jim.
Joseph Ottinger wrote:
I've had success with deletion by running IndexReader.delete(int), then
getting an IndexWriter and optimizing the directory. I don't know if
that's "the right way" to do it or not.
On Tue, 1 Feb 2005, Jim Lynch wrote:
 

I've been merrily cooking along, thinking I was replacing documents when
I haven't.  My logic is to go through a batch of documents, get a field
called "reference", which is unique, build a term from it, and delete it
via the reader.delete() method.  Then I close the reader and open a
writer and reprocess the batch indexing all.
Here is the delete and associated code:
  reader = IndexReader.open(database);
  Term t = new Term("reference", reference);
  try {
      reader.delete(t);
  } catch (Exception e) {
      System.out.println("Delete exception;" + e);
  }
except it isn't working.  I tried to do a commit and a doCommit, but
those are both protected.  I do a reader.close() after processing the
batch the first time.
What am I missing?  I don't get an exception.  Reference is definitely a
valid field, 'cause I print out the value at search time and compare to
the doc and they are identical.
Thanks,
Jim.
   

---
Joseph B. Ottinger http://enigmastation.com
IT Consultant [EMAIL PROTECTED]


How do I delete?

2005-02-01 Thread Jim Lynch
I've been merrily cooking along, thinking I was replacing documents when 
I haven't.  My logic is to go through a batch of documents, get a field 
called "reference", which is unique, build a term from it, and delete it 
via the reader.delete() method.  Then I close the reader and open a 
writer and reprocess the batch indexing all. 

Here is the delete and associated code:
  reader = IndexReader.open(database);
  Term t = new Term("reference", reference);
  try {
      reader.delete(t);
  } catch (Exception e) {
      System.out.println("Delete exception;" + e);
  }
except it isn't working.  I tried to do a commit and a doCommit, but 
those are both protected.  I do a reader.close() after processing the 
batch the first time. 

What am I missing?  I don't get an exception.  Reference is definitely a 
valid field, 'cause I print out the value at search time and compare to 
the doc and they are identical.

Thanks,
Jim.


Re: How to get document count?

2005-02-01 Thread Jim Lynch
That works, thanks.  I can't use Luke on this system.   It fails for 
some reason.

Jim.
Ravi wrote:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#docCount()
You can try this.
-Original Message-
From: Luke Shannon [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 01, 2005 11:33 AM
To: Lucene Users List
Subject: Re: How to get document count?

Not sure if the API provides a method for this, but you could use Luke:
http://www.getopt.org/luke/
It gives you a count and lets you step through each Doc looking at their
fields.
- Original Message - 
From: "Jim Lynch" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Tuesday, February 01, 2005 11:28 AM
Subject: How to get document count?

 

I've indexed a large set of documents and think that something may have
gone wrong somewhere in the middle.  Is there a way I can display the
count of documents in the index?
Thanks,
Jim.




How to get document count?

2005-02-01 Thread Jim Lynch
I've indexed a large set of documents and think that something may have 
gone wrong somewhere in the middle.  Is there a way I can display the 
count of documents in the index? 

Thanks,
Jim.


Re: Search failed with a "File not found" error

2005-01-14 Thread Jim Lynch
I don't call optimize.  I suspect the indexer was optimizing automatically,
since I was in the middle of indexing some 20 documents, each averaging
30K bytes.

Jim.
Miles Barr wrote:
On Thu, 2005-01-13 at 13:05 -0500, Jim Lynch wrote:
 

I was indexing at the time and I was under the impression that was safe, 
but it looks like the indexer may have removed a file that the search 
was trying to access.  Is there something I should be doing to lock the 
index?

java.io.FileNotFoundException: /db/lucene/oasis/Clarify_Closed/_2meu.fnm 
(No such file or directory)
   

Did you call optimize on the writer? Alternatively you could have
reached the max number of segments and it optimized automatically (i.e.
turn several segment files like _2meu.fnm into one large one).
I don't know how this affects an existing reader, whether the reader
caches the values or not. Maybe someone can shed some more light on
this.
 



Search failed with a "File not found" error

2005-01-13 Thread Jim Lynch
I was indexing at the time and I was under the impression that was safe, 
but it looks like the indexer may have removed a file that the search 
was trying to access.  Is there something I should be doing to lock the 
index?

Thanks,
Jim.
java.io.FileNotFoundException: /db/lucene/oasis/Clarify_Closed/_2meu.fnm 
(No such file or directory)
   at java.io.RandomAccessFile.open(Native Method)
   at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
   at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:376)
   at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:405)
   at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
   at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:53)
   at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:109)
   at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:94)
   at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:122)
   at org.apache.lucene.store.Lock$With.run(Lock.java:109)
   at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
   at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)



How do I unlock?

2005-01-11 Thread Jim Lynch
I'm getting
Lock obtain timed out.
I was developing and forgot to close the writer.  How do I recover?  I 
killed the program, put the close in, but it won't let me open again.

Thanks,
Jim.


Re: Performance question

2005-01-11 Thread Jim Lynch
I would be tempted to index the text fields but not save them.  Since 
Lucene returns everything as Otis pointed out, it's inefficient to keep 
rarely used data as stored content in the index.  Put the text fields in a 
database or a file tree somewhere and keep a pointer to it as a field in 
the index.  When you need the data just retrieve it from wherever using 
the saved pointer.

Jim.
Crump, Michael wrote:
Hello,

If I have large text fields that are rarely retrieved but need to be
searched often - Is it better to create 2 indices, one for searching and
one for retrieval, or just one index and put everything in it?

Or are there other recommendations?

Regards,

Michael
 



How do you handle dynamic html pages?

2005-01-10 Thread Jim Lynch
How is anyone managing reindexing of pages that change?  Just 
periodically reindex everything, or do you try to determine the frequency 
of changes to each page and/or site? 

Thanks,
Jim.


Re: Another highlighter question

2005-01-10 Thread Jim Lynch
Do you keep it in the index or cached in a separate place like a file or db?
Thanks,
Jim.
Miles Barr wrote:
Hi Jim,
On Mon, 2005-01-10 at 09:46 -0500, Jim Lynch wrote:
 

If the source of the documents in the index is from  web pages and the 
source isn't stored in the index, would highlighting be too slow since 
you'd have to download each web page again to gain access to the source?
   

For web pages I keep a cached parsed (HTML removed) copy for
highlighting purposes. I think downloading each page and removing HTML
tags each time would take too long.
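[Editor's note] Miles's cached-parsed-copy idea can be sketched roughly as below. The regex tag stripper is naive (a real HTML parser handles entities, scripts, and malformed markup properly), and all names here are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: keep an HTML-stripped copy of each fetched page so the
// highlighter never has to re-download or re-parse the original.
public class HighlightCache {
    static final Map<String, String> cache = new HashMap<>();

    // Naive tag stripper; fine for a sketch, not for production HTML.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Returns the cached parsed copy, stripping and storing it on first use.
    static String textFor(String url, String html) {
        return cache.computeIfAbsent(url, u -> stripTags(html));
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Lucene</h1><p>fast search</p></body></html>";
        System.out.println(textFor("http://example.com/", html));
    }
}
```

The cached text is what would be handed to the highlighter's fragment scorer, avoiding a network fetch per hit.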
 



Program design question.

2005-01-10 Thread Jim Lynch
My application for Lucene involves updating an existing index with a 
mixture of new and revised documents.  From what I've been able to 
discern from reading, I'm going to have to delete the old versions of the 
revised documents before indexing them again.  Since this indexing will 
probably take quite a while due to the number of new/revised documents 
I'll be adding and the large number of documents already in the index, 
I'm uncomfortable keeping an IndexReader and an IndexWriter open for 
long periods of time.  

What I'm considering doing is reading the file with multiple documents 
twice.  On the first pass I test whether each document is in the index and 
delete it if it is, with something like:

The "Reference" term is unique.
...
   String ref;
   while ((ref = getNextDocument()) != null) {
       Term t = new Term("Reference", ref);
       TermDocs td = indexReader.termDocs(t);
       if (td != null && td.next()) {
           indexReader.delete(td.doc());
       }
   }
Or should I not bother to look for the term at all and do something like 
this?

   String ref;
   while ((ref = getNextDocument()) != null) {
       Term t = new Term("Reference", ref);
       indexReader.delete(t);
   }
Is either of these more efficient?
Then I would close the indexReader and go back and reread the file, 
indexing merrily away.

Should I be concerned about keeping both an indexReader and indexWriter 
open at the same time?  I'll have other processes probably making 
searches during this time.  I'm not concerned about the searches not 
finding the data I'm currently adding, I'm more concerned about locking 
those searches out.  

A couple of valid assumptions.  The reference term is unique in the 
index and there will be only one in the input file.

Comments?
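[Editor's note] The two-pass plan above can be sketched with a plain Map standing in for the index; none of this is Lucene code, and the count returned by delete-by-reference mirrors what IndexReader.delete(Term) reports:

```java
import java.util.*;

// Sketch of the two-pass update: pass 1 deletes any existing document with
// the same unique "Reference"; pass 2 re-adds every document from the batch.
// A Map keyed by the reference stands in for the index.
public class TwoPassUpdate {
    static int deleteByReference(Map<String, String> index, String ref) {
        return index.remove(ref) != null ? 1 : 0; // 0 deletions is fine: doc was new
    }

    public static void main(String[] args) {
        Map<String, String> index = new LinkedHashMap<>();
        index.put("ref-1", "old body 1");
        index.put("ref-2", "old body 2");

        List<String[]> batch = Arrays.asList(
                new String[]{"ref-2", "revised body 2"},
                new String[]{"ref-3", "new body 3"});

        // Pass 1: delete old versions. Deleting by term directly (the second
        // variant in the mail) needs no termDocs() lookup first.
        int deleted = 0;
        for (String[] doc : batch) deleted += deleteByReference(index, doc[0]);

        // Pass 2: close the reader, open the writer, then re-add the batch.
        for (String[] doc : batch) index.put(doc[0], doc[1]);

        System.out.println("deleted=" + deleted + " size=" + index.size());
    }
}
```

The delete count per batch is a cheap sanity check that revised documents really had an old version present.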


Another highlighter question

2005-01-10 Thread Jim Lynch
If the source of the documents in the index is from  web pages and the 
source isn't stored in the index, would highlighting be too slow since 
you'd have to download each web page again to gain access to the source?

Jim.


Re: Question about the best way to replace existing docs in an index.

2005-01-10 Thread Jim Lynch
Miles,
Thanks for the tips.  I didn't see this response nor did I see my 
original email earlier, so I reposted the question, thinking I had 
forgotten to do so on Friday.  My apologies to the group for the double 
post.

Jim.
Miles Barr wrote:
On Fri, 2005-01-07 at 14:47 -0500, Jim Lynch wrote:
 

My application for Lucene involves updating an existing index with a 
mixture of new and revised documents.  From what I've been able to 
discern from reading, I'm going to have to delete the old versions of the 
revised documents before indexing them again.  Since this indexing will 
probably take quite a while due to the number of new/revised documents 
I'll be adding and the large number of documents already in the index, 
I'm uncomfortable keeping an IndexReader and an IndexWriter open for 
long periods of time.  
   

As I understand it you can't have an index reader which you do deletes
on and an index writer open at the same time since they are both doing
write operations. I think locking will prevent you from opening an index
writer once you do a delete on the reader.
So you're either going to have to open and close the reader and writer
for each update, or keep a list of duplicate references and a list of
documents to be updated, then do the deletes like:
for (Iterator it = toBeDeleted.iterator(); it.hasNext(); ) {
 Term term = new Term("Reference", (String) it.next());
 indexReader.delete(term);
}
Close the reader, open the writer, then iterate through your list of new
docs and write them to the index.
 

Should I be concerned about keeping both an indexReader and indexWriter 
open at the same time?  I'll have other processes probably making 
searches during this time.  I'm not concerned about the searches not 
finding the data I'm currently adding, I'm more concerned about locking 
those searches out.  
   

Once you close your reader searches won't be possible. So once you've
done your deletes close the reader and open it again to release the
write lock before opening the writer.
 



Re: Query based stemming

2005-01-07 Thread Jim Lynch
From what I've read, if you want to have a choice, the easiest way is 
to index the documents twice: once with stemming on and once with it off, 
placing the results in two different indexes.  Then at query time, 
select which index you want to use based on whether you want stemming on 
or off.

Jim.
Peter Kim wrote:
Hi,
I'm new to Lucene, so I apologize if this issue has been discussed
before (I'm sure it has), but I had a hard time finding an answer using
google. (Maybe this would be a good candidate for the FAQ!) :)
Is it possible to enable stem queries on a per-query basis? It doesn't
seem to be possible since the stem tokenizing is done during the
indexing process. Are people basically stuck with having all their
queries stemmed or none at all?
Thanks!
Peter


Re: Quick question about highlighting.

2005-01-07 Thread Jim Lynch
OK, thanks.  That clears things up.  I'll play with it once I get 
something indexed.

Jim.
David Spencer wrote:
Jim Lynch wrote:
I've read as much as I could find on the highlighting that is now in 
the sandbox.  I didn't find the javadocs.

I have a copy here:
http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/highlighter/build/docs/api/overview-summary.html 


  I found a link to them, but it
redirected me to a cvs tree.
Do I assume that you have to store the content of the document for 
the highlighting to work?  

Not per se, but you do need access to the contents to pass to 
Highlighter.getBestFragments(). You can store the contents in the 
index, or you can have in a cache, DB, or you can refetch the doc...

You need to know what Analyzer you used too to get the tokenStream via:
TokenStream tokenStream = analyzer.tokenStream( field, new 
StringReader(body));



Quick question about highlighting.

2005-01-07 Thread Jim Lynch
I've read as much as I could find on the highlighting that is now in the 
sandbox.  I didn't find the javadocs.  I found a link to them, but it 
redirected me to a cvs tree.

Do I assume that you have to store the content of the document for the 
highlighting to work?  Otherwise I don't see how it could work.

Thanks,
Jim.


Question about the best way to replace existing docs in an index.

2005-01-07 Thread Jim Lynch
My application for Lucene involves updating an existing index with a 
mixture of new and revised documents.  From what I've been able to 
discern from reading, I'm going to have to delete the old versions of the 
revised documents before indexing them again.  Since this indexing will 
probably take quite a while due to the number of new/revised documents 
I'll be adding and the large number of documents already in the index, 
I'm uncomfortable keeping an IndexReader and an IndexWriter open for 
long periods of time.  

What I'm considering doing is reading the file with multiple documents 
twice.  On the first pass I test whether each document is in the index and 
delete it if it is, with something like:

The "Reference" term is unique.
...
   String ref;
   while ((ref = getNextDocument()) != null) {
       Term t = new Term("Reference", ref);
       TermDocs td = indexReader.termDocs(t);
       if (td != null && td.next()) {
           indexReader.delete(td.doc());
       }
   }
Or should I not bother to look for the term at all and do something like 
this?

   String ref;
   while ((ref = getNextDocument()) != null) {
       Term t = new Term("Reference", ref);
       indexReader.delete(t);
   }
Are either of these more efficient?
Then I would close the indexReader and go back and reread the file, 
indexing merrily away.

Should I be concerned about keeping both an indexReader and indexWriter 
open at the same time?  I'll have other processes probably making 
searches during this time.  I'm not concerned about the searches not 
finding the data I'm currently adding, I'm more concerned about locking 
those searches out.  

A couple of valid assumptions.  The reference term is unique in the 
index and there will be only one in the input file.

Thanks,
Jim.


Re: Need an analyzer that includes numbers.

2005-01-03 Thread Jim Lynch
Hi, Erik,
Thank you very much for taking the time to do this.  I may have 
mentioned, I'm evaluating search engines and am implementing a subset of 
the features that we'll need eventually.  This will help greatly. 

Thanks,
Jim.
Erik Hatcher wrote:
On Dec 25, 2004, at 11:05 AM, Jim wrote:
I've seen some discussion on this and the answer seems to be "write 
your own".  Hasn't someone already done that by now that would 
share?  I really have to be able to include numeric and alphanumeric 
strings in my searches.   I don't understand analyzers well enough to 
roll my own.

This is more involved than just keeping numbers around... or at least 
there are more steps to consider.  Do you want the alpha characters 
lower-cased, which is the typical behavior so that searches are 
case-insensitive.  What about punctuation characters?  Generally these 
get tossed, however there are cases where that is not desired either.
(Snip excellent response)



Re: I thought I understood, but obviously I missed something.

2004-12-24 Thread Jim Lynch
Sorry for the stupidity.  I should have seen that. 

Jim.
Jim Lynch wrote:

Where did I go wrong?

The answer is, I got out of bed this morning.  :-[


I thought I understood, but obviously I missed something.

2004-12-24 Thread Jim Lynch
A snippet from my program:
   
   Document doc = new Document();
   Field fContent = new 
Field("content",content.toString(),false,true,true);
   Field fTitle = new Field("title",title,true,true,true);
   Field fDate = new Field("date",date,true,true,false);
   Document.add(fContent);
   Document.add(fTitle);
   Document.add(fDate);

Generate this (and other like it ) error
method add(org.apache.lucene.document.Field) cannot be referenced from a 
static context
   [javac] Document.add(fContent);

Where did I go wrong?
Thanks,
Jim.


Re: Multiple collections

2004-12-23 Thread Jim Lynch
Hi, Erik,
I've been perusing the mail list today and see your name often.  As well 
as visiting the web site advertising your book.  If we decide to go this 
way, I'll be sure to pick up a copy.

The FAQ number 41 on page 
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq 
implies a problem with searching and indexing at the same time, unless 
I'm misunderstanding what it says.

So is it kosher to download the source code before buying the book?  I 
tend not to do that for a couple of reasons, it doesn't seem right and 
frequently authors go out of their way to make sure it's not very useful 
without the book.   Not that I consider that unfair, mind you.  It's 
just a common practice from my experience.

Any way thanks for the info. 

So what you are saying if I can read between the lines and extrapolate 
from what I've read, is that I can create an index for each of my 
collections as I see fit, putting them in separate directories and when 
I need to search I can select a subset of the directories with the 
MultiSearcher.  Since the user selects which collections he wants to 
search from via checkboxes, I can build a list of searchables to pass to 
MultiSearcher.  However, looking at the javadocs I see Searchable is an 
interface.  Hm, I'll have to look at some code to see how that works.

Thanks, you've given me something to chew on.
Jim.
At the risk of  being politically incorrect, Merry Christmas to you 
all.  Not that I care a whit about political correctness.  8)

Erik Hatcher wrote:
On Dec 23, 2004, at 2:18 PM, Jim Lynch wrote:
I'm investigating search engines and have started to look at Lucene.  
I have a couple of questions, however.  The faq seems to indicate we 
can't do searches and indexing at the same time.

Where in the FAQ does it indicate this?  This is incorrect.  And I 
don't think this has ever been the case for Lucene.  Indexing and 
searching can most definitely occur at the same time.

We have currently about 4 million documents comprised of  about 16 
million terms.  This is currently broken up into about 50 different 
collections which are separate "databases".  Some of these 
collections are produced by a web crawler, some are produced by 
indexing a static file tree and some are produced via a feed from 
another system, which either adds new documents to a collection or 
replaces a document.  There are really 2 questions.  Is this too much 
data for Lucene?

It is not too much data for Lucene.  Your architecture around Lucene 
is the more important aspect.

  And is there a way to keep separate collections (probably indexes) 
and search all (usually just a subset) of them at once?  I see the 
MultiSearcher object that may be the ticket, but IMHO javadocs leave 
a lot to be desired in the way of documentation.  They seem to 
completely leave out the "glue" and examples.

MultiSearcher is pretty trivial to use.  There is an example in Lucene 
in Action's source code ("ant SearchServer") and I'm using a 
MultiSearcher for the upcoming lucenebook.com site like this:

Searchable[] searchables = new Searchable[indexes.length];
for (int i = 0; i < indexes.length; i++) {
  searchables[i] = new IndexSearcher(indexes[i]);
}
searcher = new MultiSearcher(searchables);
Use MultiSearcher in the same manner as you would IndexSearcher.  You 
can also find out which index a particular hit was from using the 
subSearcher method.

As for your comment about the javadocs, allow me to refer you to 
Lucene's test suite.  TestMultiSearcher.java in this case.  This is 
the best "documentation" there is!  (besides Lucene in Action, of 
course :)

Erik


Multiple collections

2004-12-23 Thread Jim Lynch
I'm investigating search engines and have started to look at Lucene.  I 
have a couple of questions, however.  The faq seems to indicate we can't 
do searches and indexing at the same time.  Is that still true, given 
that the faq is a few years old now?  If so is there locking going on or 
do I have to do it myself?

We have currently about 4 million documents comprised of  about 16 
million terms.  This is currently broken up into about 50 different 
collections which are separate "databases".  Some of these collections 
are produced by a web crawler, some are produced by indexing a static 
file tree and some are produced via a feed from another system, which 
either adds new documents to a collection or replaces a document.  There 
are really 2 questions.  Is this too much data for Lucene?  And is there 
a way to keep separate collections (probably indexes) and search all 
(usually just a subset) of them at once?  I see the MultiSearcher object 
that may be the ticket, but IMHO javadocs leave a lot to be desired in 
the way of documentation.  They seem to completely leave out the "glue" 
and examples.

Thanks for any advice.
Jim.