Re: Concurrent searching & re-indexing
It failed for me on Linux.

Paul Mellor wrote: "on windows you cannot delete open files, so Lucene AFAIK (I don't use windows) postpones the deletion to a time, when the file is closed"

If Lucene does not in fact postpone the deletion, that would explain the exception I'm seeing ("java.io.IOException: couldn't delete _a.f1") - the IndexWriter is attempting to delete the files but the IndexReader has them open. Does this then mean that re-indexing whilst searching is inherently unsafe, but only on Windows?

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Concurrent searching & re-indexing
Hi, Paul,

I brought this point up a while back and didn't get a response. I've found that I frequently get a "file not found" exception when searching at the same time an indexing and/or optimize operation is running. I fixed it by trapping the exception and looping until it didn't fail.

Jim.

Paul Mellor wrote:

Otis,

1. If IndexReader takes a snapshot of the index state when opened and then reads the files when searching, what would happen if the files it takes a snapshot of are deleted before the search is performed (as would happen with a reindexing in the period between opening an IndexSearcher and using it to search)?
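The trap-and-retry workaround Jim describes can be sketched generically. This is a minimal sketch under stated assumptions: the class name, the Callable standing in for the real IndexSearcher.search(...) call, and the attempt limit and pause are all illustrative, not Lucene API.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Sketch of "trap the exception and loop until it didn't fail": retry a
// search operation that may hit IOException while the indexer is rewriting
// segment files underneath it.
public class RetrySearch {
    public static <T> T withRetry(Callable<T> search, int maxAttempts) throws Exception {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return search.call();
            } catch (IOException e) {
                last = e;           // index files changed under us; try again
                Thread.sleep(100);  // brief pause before the next attempt
            }
        }
        throw last; // give up after maxAttempts failures
    }
}
```

In practice each retry would also reopen the IndexSearcher, so that it takes a fresh snapshot of the index state.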
Re: Are wildcard searches supposed to work with fields that are saved, indexed and not tokenized?
Erik Hatcher wrote:

On Feb 14, 2005, at 12:40 PM, Jim Lynch wrote:

I was trying to write some documentation on how to use the tool and issued a search for: contact:DENNIS MORROW

Is that literally the QueryParser string you entered? If so, that parses to: contact:DENNIS OR defaultField:MORROW most likely.

Ah! Good point.

And now I get 648 hits, but in some of them the contact doesn't even remotely resemble the search pattern. For instance, here is what the contact fields contain for some of these hits:

    Contact: GENERIC CONTACT
    Contact: Andre Gardinalli
    Contact: Brett Morrow (that's especially interesting)
    Contact: KEN PATTERSON

And of course there are some with Dennis' name too. Any idea why this is happening? I'm using the QueryParser.parse method.

I'm not sure you'll be able to do this with QueryParser with spaces in an untokenized field. First try it with an API-created WildcardQuery to be sure it works the way you expect.

I didn't really have any expectations other than what I saw didn't make sense. I'll just add to the docs that [this set of fields] can't be searched with wildcards.

Thanks, Jim.

Erik
Are wildcard searches supposed to work with fields that are saved, indexed and not tokenized?
I was trying to write some documentation on how to use the tool and issued a search for: contact:DENNIS MORROW

And sure enough I got 647 hits. Then I changed the search to: contact:DENNIS MORRO?

And now I get 648 hits, but in some of them the contact doesn't even remotely resemble the search pattern. For instance, here is what the contact fields contain for some of these hits:

    Contact: GENERIC CONTACT
    Contact: Andre Gardinalli
    Contact: Brett Morrow (that's especially interesting)
    Contact: KEN PATTERSON

And of course there are some with Dennis' name too. Any idea why this is happening? I'm using the QueryParser.parse method.

Jim.
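The unrelated hits are consistent with QueryParser splitting the string at the space, so that MORRO? runs against the default field (where it can match "Morrow", "Morro", etc.). Erik's suggestion of an API-built WildcardQuery might look like the following sketch against the Lucene 1.4-era API this list discusses; the index path is a placeholder, and it assumes the contact field was indexed untokenized so the whole phrase, space included, is a single term.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.WildcardQuery;

// Build the wildcard query directly, bypassing QueryParser's whitespace
// handling: the entire "DENNIS MORRO?" string is one term here.
public class WildcardCheck {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        WildcardQuery q = new WildcardQuery(new Term("contact", "DENNIS MORRO?"));
        Hits hits = searcher.search(q);
        System.out.println(hits.length() + " hits");
        searcher.close();
    }
}
```

If this returns the expected hits while the parsed query does not, that confirms the parser, not the index, is the problem.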
Re: What does [] do to a query and what's up with lucene.apache.org?
Otis and Erik,

Thanks for the info. That's a great reference.

Jim.

Erik Hatcher wrote:

Jim,

The Lucene website is transitioning to the new top-level space. I have checked out the current site to the new lucene.apache.org area and set up redirects from the old Jakarta URLs. The source code, though, is not an official part of the website. Thanks to our conversion to Subversion, though, the source is browsable starting here: http://svn.apache.org/repos/asf/lucene/java/trunk

The HTML of the website will need link adjustments to get everything back in shape. The brackets are documented here: http://lucene.apache.org/queryparsersyntax.html

Erik

On Feb 14, 2005, at 10:31 AM, Jim Lynch wrote:

First I'm getting a "The requested URL could not be retrieved" error while trying to retrieve the URL http://lucene.apache.org/src/test/org/apache/lucene/queryParser/TestQueryParser.java - the error encountered was "Unable to determine IP address from host name for lucene.apache.org". Guess the system is down.

I'm getting this error: org.apache.lucene.queryParser.ParseException: Encountered "is" at line 1, column 15. Was expecting: "]" ... when I tried to parse the following string "[this is a test]". I can't find any documentation that tells me what the brackets do to a query. I had a user that was used to another search engine that used [] to do proximity or near searches and tried it on this one. Actually I'd like to see the documentation for what the parser does. All that is mentioned in the javadoc is + - and (). Obviously there are more special characters.

Thanks, Jim.
What does [] do to a query and what's up with lucene.apache.org?
First I'm getting a "The requested URL could not be retrieved" error while trying to retrieve the URL http://lucene.apache.org/src/test/org/apache/lucene/queryParser/TestQueryParser.java - the error encountered was "Unable to determine IP address from host name for lucene.apache.org". Guess the system is down.

I'm getting this error: org.apache.lucene.queryParser.ParseException: Encountered "is" at line 1, column 15. Was expecting: "]" ... when I tried to parse the following string "[this is a test]". I can't find any documentation that tells me what the brackets do to a query. I had a user that was used to another search engine that used [] to do proximity or near searches and tried it on this one. Actually I'd like to see the documentation for what the parser does. All that is mentioned in the javadoc is + - and (). Obviously there are more special characters.

Thanks, Jim.
Does anyone have a copy of the highlighter code?
Our firewall prevents me from using cvs to check out anything. Does anyone have a jar file or a set of class files publicly available? Thanks, Jim.
Re: How do I delete?
OK, the reference field was not parsed. See:

    } else if (key.equals("reference")) {
        reference = value;
        Field fReference = new Field("reference", value, true, true, false);
        doc.add(fReference);

On another examination of my program, the delete does seem to be working. At least the delete returns a value of 1, saying it deleted one record. However, the search still keeps finding the old record. I am doing an optimize after each index batch. Unfortunately the old record is still there even after I delete it. So I deleted it and replaced it with the date in a different format to see if it was really replaced. The date field indicates I've still got the old data in there for some reason. Is data cached somewhere?

Jim.

Chris Hostetter wrote:

: anywhere. I checked the count coming back from the delete operation and
: it is zero. I even tried to delete another unique term with similar
: results.

First off, are you absolutely certain you are closing the reader? It's not in the code you listed. Second, I'd bet $1 that when your documents were indexed, your "reference" field was analyzed and parsed into multiple terms. Did you try searching for the Term you're trying to delete by? (I hear "luke" is a pretty handy tool for checking exactly which Terms are in your index)

: >> Here is the delete and associated code:
: >>
: >>     reader = IndexReader.open(database);
: >>     Term t = new Term("reference", reference);
: >>     try {
: >>         reader.delete(t);
: >>     } catch (Exception e) {
: >>         System.out.println("Delete exception;" + e);
: >>     }

-Hoss
Re: How do I delete?
Thanks, I'd try that, but I don't think it will make any difference. If I modify the code to not reindex the documents, no files in the index directory are touched, hence there is no record of the deletions anywhere. I checked the count coming back from the delete operation and it is zero. I even tried to delete another unique term with similar results. How does one call the commit method anyway? Isn't it automatically called?

Jim.

Joseph Ottinger wrote:

I've had success with deletion by running IndexReader.delete(int), then getting an IndexWriter and optimizing the directory. I don't know if that's "the right way" to do it or not.

On Tue, 1 Feb 2005, Jim Lynch wrote:

I've been merrily cooking along, thinking I was replacing documents when I haven't. My logic is to go through a batch of documents, get a field called "reference" which is unique, build a term from it, and delete it via the reader.delete() method. Then I close the reader and open a writer and reprocess the batch, indexing all. Here is the delete and associated code:

    reader = IndexReader.open(database);
    Term t = new Term("reference", reference);
    try {
        reader.delete(t);
    } catch (Exception e) {
        System.out.println("Delete exception;" + e);
    }

except it isn't working. I tried to do a commit and a doCommit, but those are both protected. I do a reader.close() after processing the batch the first time. What am I missing? I don't get an exception. Reference is definitely a valid field, 'cause I print out the value at search time and compare to the doc and they are identical.

Thanks, Jim.

---
Joseph B. Ottinger          http://enigmastation.com
IT Consultant               [EMAIL PROTECTED]
How do I delete?
I've been merrily cooking along, thinking I was replacing documents when I haven't. My logic is to go through a batch of documents, get a field called "reference" which is unique, build a term from it, and delete it via the reader.delete() method. Then I close the reader and open a writer and reprocess the batch, indexing all. Here is the delete and associated code:

    reader = IndexReader.open(database);
    Term t = new Term("reference", reference);
    try {
        reader.delete(t);
    } catch (Exception e) {
        System.out.println("Delete exception;" + e);
    }

except it isn't working. I tried to do a commit and a doCommit, but those are both protected. I do a reader.close() after processing the batch the first time. What am I missing? I don't get an exception. Reference is definitely a valid field, 'cause I print out the value at search time and compare to the doc and they are identical.

Thanks, Jim.
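Pulling the thread's answers together, the replace cycle can be sketched as follows against the Lucene 1.4-era API under discussion. This is a sketch, not endorsed list code: the class and method names are illustrative, and it assumes "reference" was indexed as a single untokenized term (otherwise delete(Term) matches nothing, which is consistent with the zero count Jim sees).

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class ReplaceDoc {
    public static void replace(String indexDir, String reference, Document newDoc) throws Exception {
        // 1. Delete the old version via a reader; close() is what commits
        //    the deletions to disk (no public commit method to call).
        IndexReader reader = IndexReader.open(indexDir);
        int deleted = reader.delete(new Term("reference", reference));
        reader.close();

        // 2. Only after the reader is closed, open a writer (false = do not
        //    recreate the index) and add the replacement document.
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        writer.addDocument(newDoc);
        writer.close();
    }
}
```

Any IndexSearcher opened before this cycle still sees its original snapshot, so it must be reopened to observe the replacement - which also answers the "is data cached somewhere?" question in the follow-up.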
Re: How to get document count?
That works, thanks. I can't use Luke on this system. It fails for some reason.

Jim.

Ravi wrote:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#docCount()

You can try this.

-----Original Message-----
From: Luke Shannon [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 11:33 AM
To: Lucene Users List
Subject: Re: How to get document count?

Not sure if the API provides a method for this, but you could use Luke: http://www.getopt.org/luke/ It gives you a count and lets you step through each Doc looking at their fields.

- Original Message - From: "Jim Lynch" <[EMAIL PROTECTED]> To: "Lucene Users List" Sent: Tuesday, February 01, 2005 11:28 AM Subject: How to get document count?

I've indexed a large set of documents and think that something may have gone wrong somewhere in the middle. Is there a way I can display the count of documents in the index?

Thanks, Jim.
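Besides IndexWriter.docCount(), the count can be read without opening a writer at all. A small sketch against the same 1.4-era API (the index path is a placeholder):

```java
import org.apache.lucene.index.IndexReader;

// numDocs() counts live documents (deletions excluded); maxDoc() is the
// upper bound including deleted-but-unmerged documents.
public class CountDocs {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        System.out.println("documents: " + reader.numDocs());
        reader.close();
    }
}
```

The reader-based route avoids taking the write lock, so it is safe to run while an indexer is active.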
How to get document count?
I've indexed a large set of documents and think that something may have gone wrong somewhere in the middle. Is there a way I can display the count of documents in the index? Thanks, Jim.
Re: Search failed with a "File not found" error
I don't call optimize. I suspect the indexer was, since I was in the middle of indexing some 20 documents each averaging 30K bytes.

Jim.

Miles Barr wrote:

On Thu, 2005-01-13 at 13:05 -0500, Jim Lynch wrote:

I was indexing at the time and I was under the impression that was safe, but it looks like the indexer may have removed a file that the search was trying to access. Is there something I should be doing to lock the index?

java.io.FileNotFoundException: /db/lucene/oasis/Clarify_Closed/_2meu.fnm (No such file or directory)

Did you call optimize on the writer? Alternatively you could have reached the max number of segments and it optimized automatically (i.e. turned several segment files like _2meu.fnm into one large one). I don't know how this affects an existing reader, whether the reader caches the values or not. Maybe someone can shed some more light on this.
Search failed with a "File not found" error
I was indexing at the time and I was under the impression that was safe, but it looks like the indexer may have removed a file that the search was trying to access. Is there something I should be doing to lock the index?

Thanks, Jim.

java.io.FileNotFoundException: /db/lucene/oasis/Clarify_Closed/_2meu.fnm (No such file or directory)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
        at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:376)
        at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:405)
        at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
        at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:53)
        at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:109)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:94)
        at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:122)
        at org.apache.lucene.store.Lock$With.run(Lock.java:109)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
How do I unlock?
I'm getting "Lock obtain timed out". I was developing and forgot to close the writer. How do I recover? I killed the program, put the close in, but it won't let me open again. Thanks, Jim.
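A stale write lock left behind by a killed process can be cleared programmatically in the 1.4-era API. A minimal sketch, assuming the index path and assuming no other process is actually writing to the index (unlocking a live index corrupts it):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class Unlock {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.getDirectory("/path/to/index", false);
        if (IndexReader.isLocked(dir)) {
            IndexReader.unlock(dir); // removes the stale lock file
        }
        dir.close();
    }
}
```

Deleting the lock file from the index (or lock) directory by hand has the same effect.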
Re: Performance question
I would be tempted to index the text fields but not store them. Since Lucene returns everything, as Otis pointed out, it's inefficient to keep rarely used data as stored content in the index. Put the text fields in a database or a file tree somewhere and keep a pointer as a field in the index. When you need the data, just retrieve it from wherever using the stored pointer.

Jim.

Crump, Michael wrote:

Hello,

If I have large text fields that are rarely retrieved but need to be searched often - is it better to create 2 indices, one for searching and one for retrieval, or just one index and put everything in it? Or are there other recommendations?

Regards, Michael
How do you handle dynamic html pages?
How is anyone managing reindexing of pages that change? Just periodically reindex everything, or do you try to determine the frequency of changes to each page and/or site? Thanks, Jim.
Re: Another highlighter question
Do you keep it in the index or cached in a separate place like a file or db?

Thanks, Jim.

Miles Barr wrote:

Hi Jim,

On Mon, 2005-01-10 at 09:46 -0500, Jim Lynch wrote:

If the source of the documents in the index is from web pages and the source isn't stored in the index, would highlighting be too slow since you'd have to download each web page again to gain access to the source?

For web pages I keep a cached parsed (HTML removed) copy for highlighting purposes. I think downloading each page and removing HTML tags each time would take too long.
Program design question.
My application for Lucene involves updating an existing index with a mixture of new and revised documents. From what I've been able to discern from reading, I'm going to have to delete the old versions of the revised documents before indexing them again. Since this indexing will probably take quite a while due to the number of new/revised documents I'll be adding and the large number of documents already in the index, I'm uncomfortable keeping an IndexReader and an IndexWriter open for long periods of time.

What I'm considering doing is reading the file with multiple documents twice. One time I test to see if the document is in the index and delete it if it is with something like this (the "Reference" term is unique):

    while ((String ref = getNextDocument()) != null) {
        Term t = new Term("Reference", ref);
        TermDocs td = indexReader.termDocs(t);
        if (td != null) {
            td.next();
            indexReader.delete(td.doc());
        }
    }

Or should I not bother to look for the term at all and do something like this?

    while ((String ref = getNextDocument()) != null) {
        Term t = new Term("Reference", ref);
        indexReader.delete(t);
    }

Is either of these more efficient? Then I would close the indexReader and go back and reread the file, indexing merrily away.

Should I be concerned about keeping both an indexReader and indexWriter open at the same time? I'll have other processes probably making searches during this time. I'm not concerned about the searches not finding the data I'm currently adding; I'm more concerned about locking those searches out.

A couple of valid assumptions: the reference term is unique in the index and there will be only one in the input file.

Comments?
Another highlighter question
If the source of the documents in the index is from web pages and the source isn't stored in the index, would highlighting be too slow since you'd have to download each web page again to gain access to the source? Jim.
Re: Question about the best way to replace existing docs in an index.
Miles,

Thanks for the tips. I didn't see this response nor did I see my original email earlier, so I reposted the question, thinking I had forgotten to do so on Friday. My apologies to the group for the double post.

Jim.

Miles Barr wrote:

On Fri, 2005-01-07 at 14:47 -0500, Jim Lynch wrote:

My application for Lucene involves updating an existing index with a mixture of new and revised documents. From what I've been able to discern from reading, I'm going to have to delete the old versions of the revised documents before indexing them again. Since this indexing will probably take quite a while due to the number of new/revised documents I'll be adding and the large number of documents already in the index, I'm uncomfortable keeping an IndexReader and an IndexWriter open for long periods of time.

As I understand it you can't have an index reader which you do deletes on and an index writer open at the same time, since they are both doing write operations. I think locking will prevent you from opening an index writer once you do a delete on the reader. So you're either going to have to open and close the reader and writer for each update, or keep a list of duplicate references and a list of documents to be updated, then do the deletes like:

    for (Iterator it = toBeDeleted.iterator(); it.hasNext(); ) {
        Term term = new Term("Reference", (String) it.next());
        indexReader.delete(term);
    }

Close the reader, open the writer, then iterate through your list of new docs and write them to the index.

Should I be concerned about keeping both an indexReader and indexWriter open at the same time? I'll have other processes probably making searches during this time. I'm not concerned about the searches not finding the data I'm currently adding; I'm more concerned about locking those searches out.

Once you close your reader, searches won't be possible. So once you've done your deletes, close the reader and open it again to release the write lock before opening the writer.
Re: Query based stemming
From what I've read, if you want to have a choice, the easiest way is to index the documents twice: once with stemming on and once with it off, placing the results in two different indexes. Then at query time, select which index you want to use based on whether you want stemming on or off.

Jim.

Peter Kim wrote:

Hi,

I'm new to Lucene, so I apologize if this issue has been discussed before (I'm sure it has), but I had a hard time finding an answer using google. (Maybe this would be a good candidate for the FAQ!) :)

Is it possible to enable stem queries on a per-query basis? It doesn't seem to be possible since the stem tokenizing is done during the indexing process. Are people basically stuck with having all their queries stemmed or none at all?

Thanks! Peter
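The two-index approach can be sketched from core 1.4-era parts; the analyzer below is a hypothetical minimal example built from Lucene's LowerCaseTokenizer and PorterStemFilter, and the index paths are placeholders.

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

// A minimal stemming analyzer: lowercase tokenization followed by
// Porter stemming. Each document is written to both indexes at index
// time; the query side then picks the index (and matching analyzer)
// based on a per-query "stemming on/off" flag.
class StemAnalyzer extends Analyzer {
    public TokenStream tokenStream(String field, Reader reader) {
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}

// Index-time sketch:
//   IndexWriter exact   = new IndexWriter("/idx/exact",   new StandardAnalyzer(), create);
//   IndexWriter stemmed = new IndexWriter("/idx/stemmed", new StemAnalyzer(),     create);
//   ... addDocument(doc) to both writers ...
```

The cost is roughly double the index size and indexing time, which is the trade for per-query control.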
Re: Quick question about highlighting.
OK, thanks. That clears things up. I'll play with it once I get something indexed.

Jim.

David Spencer wrote:

Jim Lynch wrote:

I've read as much as I could find on the highlighting that is now in the sandbox. I didn't find the javadocs.

I have a copy here: http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/highlighter/build/docs/api/overview-summary.html

I found a link to them, but it redirected me to a cvs tree. Do I assume that you have to store the content of the document for the highlighting to work?

Not per se, but you do need access to the contents to pass to Highlighter.getBestFragments(). You can store the contents in the index, or you can have them in a cache or DB, or you can refetch the doc... You need to know what Analyzer you used too, to get the tokenStream via:

    TokenStream tokenStream = analyzer.tokenStream(field, new StringReader(body));
Quick question about highlighting.
I've read as much as I could find on the highlighting that is now in the sandbox. I didn't find the javadocs. I found a link to them, but it redirected me to a cvs tree. Do I assume that you have to store the content of the document for the highlighting to work? Otherwise I don't see how it could work. Thanks, Jim.
Question about the best way to replace existing docs in an index.
My application for Lucene involves updating an existing index with a mixture of new and revised documents. From what I've been able to discern from reading, I'm going to have to delete the old versions of the revised documents before indexing them again. Since this indexing will probably take quite a while due to the number of new/revised documents I'll be adding and the large number of documents already in the index, I'm uncomfortable keeping an IndexReader and an IndexWriter open for long periods of time.

What I'm considering doing is reading the file with multiple documents twice. One time I test to see if the document is in the index and delete it if it is with something like this (the "Reference" term is unique):

    while ((String ref = getNextDocument()) != null) {
        Term t = new Term("Reference", ref);
        TermDocs td = indexReader.termDocs(t);
        if (td != null) {
            td.next();
            indexReader.delete(td.doc());
        }
    }

Or should I not bother to look for the term at all and do something like this?

    while ((String ref = getNextDocument()) != null) {
        Term t = new Term("Reference", ref);
        indexReader.delete(t);
    }

Is either of these more efficient? Then I would close the indexReader and go back and reread the file, indexing merrily away.

Should I be concerned about keeping both an indexReader and indexWriter open at the same time? I'll have other processes probably making searches during this time. I'm not concerned about the searches not finding the data I'm currently adding; I'm more concerned about locking those searches out.

A couple of valid assumptions: the reference term is unique in the index and there will be only one in the input file.

Thanks, Jim.
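On the efficiency question: in the 1.4-era API, IndexReader.delete(Term) already walks the matching postings internally and returns the number of documents it removed, so the explicit termDocs() loop buys nothing - the second form can double as the existence check. A sketch of the first pass (the helper name and the Iterator of references are illustrative, not from the post):

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class DeletePass {
    // Delete every document whose unique "Reference" term appears in refs.
    // Returns the total number of documents removed; a per-term count of 0
    // simply means that document is new to the index.
    static int deleteAll(IndexReader reader, Iterator refs) throws IOException {
        int removed = 0;
        while (refs.hasNext()) {
            removed += reader.delete(new Term("Reference", (String) refs.next()));
        }
        return removed;
    }
}
```

After this pass the reader must be closed (releasing the write lock) before the writer is opened for the second, indexing pass.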
Re: Need an analyzer that includes numbers.
Hi, Erik,

Thank you very much for taking the time to do this. I may have mentioned, I'm evaluating search engines and am implementing a subset of the features that we'll need eventually. This will help greatly.

Thanks, Jim.

Erik Hatcher wrote:

On Dec 25, 2004, at 11:05 AM, Jim wrote:

I've seen some discussion on this and the answer seems to be "write your own". Hasn't someone already done that by now that would share? I really have to be able to include numeric and alphanumeric strings in my searches. I don't understand analyzers well enough to roll my own.

This is more involved than just keeping numbers around... or at least there are more steps to consider. Do you want the alpha characters lower-cased, which is the typical behavior so that searches are case-insensitive? What about punctuation characters? Generally these get tossed, however there are cases where that is not desired either. (Snip excellent response)
Re: I thought I understood, but obviously I missed something.
Sorry for the stupidity. I should have seen that.

Jim.

Jim Lynch wrote:

Where did I go wrong?

The answer is, I got out of bed this morning. :-[
I thought I understood, but obviously I missed something.
A snippet from my program:

    Document doc = new Document();
    Field fContent = new Field("content", content.toString(), false, true, true);
    Field fTitle = new Field("title", title, true, true, true);
    Field fDate = new Field("date", date, true, true, false);
    Document.add(fContent);
    Document.add(fTitle);
    Document.add(fDate);

generates this (and others like it) error:

    method add(org.apache.lucene.document.Field) cannot be referenced from a static context
    [javac] Document.add(fContent);

Where did I go wrong? Thanks, Jim.
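The compiler message points at the fix: add(Field) is an instance method, so it must be called on the doc variable, not on the Document class. A corrected sketch of the same snippet, wrapped in a hypothetical helper so it stands alone:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BuildDoc {
    // Same 1.4-era Field(name, value, store, index, token) constructor as
    // the original snippet; only the receiver of add(...) has changed.
    static Document build(String content, String title, String date) {
        Document doc = new Document();
        doc.add(new Field("content", content, false, true, true)); // indexed, tokenized, not stored
        doc.add(new Field("title", title, true, true, true));      // stored, indexed, tokenized
        doc.add(new Field("date", date, true, true, false));       // stored, indexed, untokenized
        return doc;
    }
}
```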
Re: Multiple collections
Hi, Erik,

I've been perusing the mail list today and see your name often, as well as visiting the web site advertising your book. If we decide to go this way, I'll be sure to pick up a copy.

FAQ number 41 on page http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq implies a problem with searching and indexing at the same time, unless I'm misunderstanding what it says.

So is it kosher to download the source code before buying the book? I tend not to do that for a couple of reasons: it doesn't seem right, and frequently authors go out of their way to make sure it's not very useful without the book. Not that I consider that unfair, mind you. It's just a common practice in my experience.

Anyway, thanks for the info. So what you are saying, if I can read between the lines and extrapolate from what I've read, is that I can create an index for each of my collections as I see fit, putting them in separate directories, and when I need to search I can select a subset of the directories with the MultiSearcher. Since the user selects which collections he wants to search via checkboxes, I can build a list of searchables to pass to MultiSearcher. However, looking at the javadocs I see Searchable is an interface. Hm, I'll have to look at some code to see how that works.

Thanks, you've given me something to chew on.

Jim.

At the risk of being politically incorrect, Merry Christmas to you all. Not that I care a whit about political correctness. 8)

Erik Hatcher wrote:

On Dec 23, 2004, at 2:18 PM, Jim Lynch wrote:

I'm investigating search engines and have started to look at Lucene. I have a couple of questions, however. The faq seems to indicate we can't do searches and indexing at the same time.

Where in the FAQ does it indicate this? This is incorrect. And I don't think this has ever been the case for Lucene. Indexing and searching can most definitely occur at the same time.
We have currently about 4 million documents comprised of about 16 million terms. This is currently broken up into about 50 different collections which are separate "databases". Some of these collections are produced by a web crawler, some are produced by indexing a static file tree, and some are produced via a feed from another system which either adds new documents to a collection or replaces a document. There are really 2 questions. Is this too much data for Lucene?

It is not too much data for Lucene. Your architecture around Lucene is the more important aspect.

And is there a way to keep separate collections (probably indexes) and search all (usually just a subset) of them at once? I see the MultiSearcher object that may be the ticket, but IMHO the javadocs leave a lot to be desired in the way of documentation. They seem to completely leave out the "glue" and examples.

MultiSearcher is pretty trivial to use. There is an example in Lucene in Action's source code ("ant SearchServer") and I'm using a MultiSearcher for the upcoming lucenebook.com site like this:

    Searchable[] searchables = new Searchable[indexes.length];
    for (int i = 0; i < indexes.length; i++) {
        searchables[i] = new IndexSearcher(indexes[i]);
    }
    searcher = new MultiSearcher(searchables);

Use MultiSearcher in the same manner as you would IndexSearcher. You can also find out which index a particular hit was from using the subSearcher method.

As for your comment about the javadocs, allow me to refer you to Lucene's test suite - TestMultiSearcher.java in this case. This is the best "documentation" there is! (besides Lucene in Action, of course :)

Erik
Multiple collections
I'm investigating search engines and have started to look at Lucene. I have a couple of questions, however. The faq seems to indicate we can't do searches and indexing at the same time. Is that still true, given that the faq is a few years old now? If so, is there locking going on or do I have to do it myself?

We have currently about 4 million documents comprised of about 16 million terms. This is currently broken up into about 50 different collections which are separate "databases". Some of these collections are produced by a web crawler, some are produced by indexing a static file tree, and some are produced via a feed from another system which either adds new documents to a collection or replaces a document.

There are really 2 questions. Is this too much data for Lucene? And is there a way to keep separate collections (probably indexes) and search all (usually just a subset) of them at once? I see the MultiSearcher object that may be the ticket, but IMHO the javadocs leave a lot to be desired in the way of documentation. They seem to completely leave out the "glue" and examples.

Thanks for any advice.

Jim.