Re: Getting an exact field match
Add Location as Keyword and delete something like this:

    Term t = new Term(LOCATION, location);
    indexReader.delete(t);

-- Ian.

(Wilton, Reece) wrote:

Hi, I am indexing XML files. The XML files have a Location element; for example, the Location is /Foo/Bar.html in one of the files. When I update the index, I want to remove the existing document. I search for the Location and delete the existing document like this:

    Query query = QueryParser.parse(location, LOCATION, new StandardAnalyzer());
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        indexReader.delete(hits.id(i));
    }

But I never get anything returned from the searcher. I'm passing in the exact value that is in the field. How do I get an exact match on the field? Should I be adding Location as Text or Keyword? I've tried both but can't get it to return what I want. Is the problem because I have slashes (/) in the field? Does the StandardAnalyzer filter those out or something? Any help is appreciated!

Reece
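The root of Reece's problem is that Field.Text runs the value through the analyzer, so a path like /Foo/Bar.html gets broken into lower-cased tokens and the original string never exists in the index as a single term (Field.Keyword stores it untokenized, which is why Ian suggests it). As a rough, hypothetical sketch (plain Java splitting on non-alphanumerics; not the actual StandardAnalyzer rules), this shows why an exact term match against "/Foo/Bar.html" finds nothing:

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeSketch {
    // Crude stand-in for what a tokenizing analyzer does to a field value:
    // split on non-alphanumeric characters and lower-case each token.
    static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<String>();
        for (String part : value.split("[^A-Za-z0-9]+")) {
            if (part.length() > 0) tokens.add(part.toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("/Foo/Bar.html");
        // Indexed terms are "foo", "bar", "html" -- the literal string
        // "/Foo/Bar.html" is not among them, so an exact term query misses.
        System.out.println(tokens);
        System.out.println(tokens.contains("/Foo/Bar.html")); // false
    }
}
```

An untokenized (Keyword-style) field stores the whole value as one term, so the exact term query matches.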
RE: can't delete from an index using IndexReader.delete()
You should use Field.Keyword rather than Field.Text for the identifier, because you do not want it tokenized:

    doc.add(Field.Keyword("id", whatever));

in two places in your example code. -- Ian.

(Robert Koberg) wrote:

Here is a simple class that can reproduce the problem (it happens with the last stable release too). Let me know if you would prefer this as an attachment. Call it like this:

    java TestReaderDelete existing_id new_label

or try:

    java TestReaderDelete B724547 ppp

and then try:

    java TestReaderDelete a266122794 ppp

If an index has not been created it will create one. Keep running one of the example commands above (with and without deleting the index directory) and see what happens to the System.out.println output.

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import java.io.*;
    import java.util.*;

    class TestReaderDelete {

        public static void main(String[] args) throws IOException {
            File index = new File("./testindex");
            if (!index.exists()) {
                HashMap test_map = new HashMap();
                test_map.put("preamble_content", "Preamble content bbb");
                test_map.put("art_01_section_01", "Article 1, Section 1");
                test_map.put("toc_tester", "Test TOC XML bbb");
                test_map.put("B724547", "bio example");
                test_map.put("a266122794", "tester");
                indexFiles(index, test_map);
            }
            String identifier = args[0];
            String new_label = args[1];
            testDeleteAndAdd(index, identifier, new_label);
        }

        public static void indexFiles(File index, HashMap test_map) {
            try {
                IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(), true);
                for (Iterator i = test_map.entrySet().iterator(); i.hasNext(); ) {
                    Map.Entry e = (Map.Entry) i.next();
                    System.out.println("Adding: " + e.getKey() + " = " + e.getValue());
                    Document doc = new Document();
                    doc.add(Field.Text("id", (String) e.getKey()));
                    doc.add(Field.Text("label", (String) e.getValue()));
                    writer.addDocument(doc);
                }
                writer.optimize();
                writer.close();
            } catch (Exception e) {
                System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
            }
        }

        public static void testDeleteAndAdd(File index, String identifier, String new_label) throws IOException {
            IndexReader reader = IndexReader.open(index);
            System.out.println("!!! reader.numDocs(): " + reader.numDocs());
            System.out.println("reader.indexExists(): " + reader.indexExists(index));
            System.out.println("term field: " + new Term("id", identifier).field());
            System.out.println("term text: " + new Term("id", identifier).text());
            System.out.println("reader.docFreq: " + reader.docFreq(new Term("id", identifier)));
            System.out.println("deleting target now...");
            int deleted_num = reader.delete(new Term("id", identifier));
            System.out.println("*** deleted_num: " + deleted_num);
            reader.close();
            try {
                IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(), false);
                Document doc = new Document();
                doc.add(Field.Text("id", identifier));
                doc.add(Field.Text("label", new_label));
                writer.addDocument(doc);
                writer.optimize();
                writer.close();
            } catch (Exception e) {
                System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
            }
            System.out.println("!!! reader.numDocs() after deleting and adding: " + reader.numDocs());
        }
    }

-----Original Message----- From: Otis Gospodnetic Sent: Sunday, June 22, 2003 9:42 PM To: Lucene Users List

The code looks fine. Unfortunately, the provided code is not a full, self-sufficient class that I can run on my machine to verify the behaviour that you are describing.
Otis
Re: about increment update
Try this:

1. Open reader.
2. removeModifiedFiles(reader)
3. reader.close()
4. Open writer.
5. updateIndexDocs()
6. writer.close()

i.e. don't have both the reader and the writer open at the same time. Btw, I suspect you might be removing index entries only for files that have been modified, but re-adding all files. Another "index keeps growing" problem! Could be wrong. -- Ian.

(kerr) wrote:

Thank you, Otis. Yes, the reader should be closed, but that isn't the reason for this exception; the errors happen before deleting the file. Kerr.

    close()
    Closes files associated with this index. Also saves any new deletions to disk.
    No other methods should be called after this has been called.

----- Original Message ----- From: Otis Gospodnetic To: Lucene Users List Sent: Thursday, April 03, 2003 12:14 PM Subject: Re: about increment update

Maybe this is missing? http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#close()

Otis

--- kerr wrote:

Hello everyone. Here I try to incrementally update the index, following the idea of deleting each modified file first and re-adding it. Here is the source. But when I execute it, the index directory gets a write.lock file when the line reader.delete(i); runs, and I catch a java.io.IOException with message "Index locked for write". After that, when I execute the line IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(), false); I catch a java.io.IOException with message "Index locked for write". If I delete the write.lock file, the error happens again. Can anyone help? Thanks, Kerr.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import java.io.File;
    import java.util.Date;

    public class UpdateIndexFiles {

        public static void main(String[] args) {
            try {
                Date start = new Date();
                Directory directory = FSDirectory.getDirectory("index", false);
                IndexReader reader = IndexReader.open(directory);
                System.out.println(reader.isLocked(directory));
                //reader.unlock(directory);
                IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), false);
                String base;
                if (args.length == 0) {
                    base = "D:\\Tomcat\\webapps\\ROOT\\test";
                } else {
                    base = args[0];
                }
                removeModifiedFiles(reader);
                updateIndexDocs(reader, writer, new File(base));
                writer.optimize();
                writer.close();
                Date end = new Date();
                System.out.print(end.getTime() - start.getTime());
                System.out.println(" total milliseconds");
            } catch (Exception e) {
                System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
                e.printStackTrace();
            }
        }

        public static void removeModifiedFiles(IndexReader reader) throws Exception {
            for (int i = 0; i < reader.numDocs(); i++) {
                Document adoc = reader.document(i);
                String path = adoc.get("path");
                File aFile = new File(path);
                if (reader.lastModified(path) < aFile.lastModified()) {
                    System.out.println(reader.isLocked(path));
                    reader.delete(i);
                }
            }
        }

        public static void updateIndexDocs(IndexReader reader, IndexWriter writer, File file) throws Exception {
            if (file.isDirectory()) {
                String[] files = file.list();
                for (int i = 0; i < files.length; i++)
                    updateIndexDocs(reader, writer, new File(file, files[i]));
            } else {
                if (!reader.indexExists(file)) {
                    System.out.println("adding " + file);
                    writer.addDocument(FileDocument.Document(file));
                }
            }
        }
    }
RE: Indexing Growth
What does the index directory look like before and after running queries? Are files growing or being added? Which files? How many documents are there in the index before and after? Are you absolutely 100% positive there is no way that your application is adding entries to the index? That still has to be the most likely explanation, I think. -- Ian.

(Rob Outar) wrote:

Hi all. This is too odd and I do not even know where to start. We built a Windows Explorer type tool that indexes all files in a sandboxed file system. Each Lucene document contains stuff like path, parent directory, last modified date, file_lock, etc. When we display the files in a given directory through the tool we query the index about 5 times for each file in the repository; this is done so we can display all attributes in the index about that file. So, for example, if there are 5 files in the directory and each file has 6 attributes, about 30 term queries are executed. The initial index, when built, is about 10.4 megs; after accessing about 3 or 4 directories the index size increased to over 100 megs, and we did not add anything!! All we are doing is querying!! Yesterday, after querying became ungodly slow, we looked at the index size: it had grown from 10 megs to 1.5 GB (granted, we tested the tool all morning). But I have no idea why the index is growing like this. ANY help would be greatly appreciated.

Thanks, Rob

-----Original Message----- From: Rob Outar Sent: Tuesday, April 01, 2003 3:32 PM To: Lucene Users List Subject: RE: Indexing Growth

I reuse the same searcher, analyzer and Query object; I don't think that should cause the problem.

Thanks, Rob

-----Original Message----- From: Alex Murzaku Sent: Tuesday, April 01, 2003 3:22 PM To: 'Lucene Users List' Subject: RE: Indexing Growth

I don't know if I remember this correctly: I think a file is created for every query (term), but the file should disappear after the query is completed.

-----Original Message----- From: Rob Outar Sent: Tuesday, April 01, 2003 3:13 PM To: Lucene Users List Subject: RE: Indexing Growth

Dang, I must be doing something crazy, because all my client app does is search, and the index size increases. I do not add anything.

Thanks, Rob

-----Original Message----- From: Otis Gospodnetic Sent: Tuesday, April 01, 2003 3:07 PM To: Lucene Users List Subject: Re: Indexing Growth

Only when you add new documents to it.

Otis

--- Rob Outar wrote:

Hi all. Will the index grow based on queries alone? I build my index, then run several queries against it, and afterwards I check the size of the index; in some cases it has grown quite a bit although I did not add anything. Anyhow, please let me know the cases in which the index will grow.

Thanks, Rob
RE: Indexing Growth
They look like the type of file name that would be created when documents were added to the index. So I still think something is adding stuff to your index. Could it be an external process, as someone suggested? Does the index grow even if you don't search? In the code you posted, what does checkForIndexChange() do? Yes, I can guess what it is supposed to do, but is it perhaps doing something else as well or instead, directly or indirectly? -- Ian.

(Rob Outar) wrote:

After building the index for the first time:

    _l1d.f1 _l1d.f3 _l1d.f5 _l1d.f7 _l1d.f9 _l1d.fdx _l1d.frq _l1d.tii deletable
    _l1d.f2 _l1d.f4 _l1d.f6 _l1d.f8 _l1d.fdt _l1d.fnm _l1d.prx _l1d.tis segments

After running the first query to get all attributes from all files in the given directory (there were 17 files, each with 5 attributes, so 85 queries were run):

    _l1j.f1 _l1p.f9 _l21.f3 _l27.fdx _l2j.f5 _l2p.prx _l31.f7 _l3j.f1 _l3p.f9 _l41.f3 _l44.fdx
    _l1j.f2 _l1p.fdt _l21.f4 _l27.frq _l2j.f6 _l2p.tis _l31.f8 _l3j.f2 _l3p.fdt _l41.f4 _l44.frq
    ...
Re: indexing large documents problem
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#maxFieldLength -- Ian.

(Andrey Grishin) wrote:

Hello, All! When I index a large document (11 symbols) and then try to search using words that are at the very end of that document, I can't find anything... :(( Is this a feature of Lucene, or am I doing something wrong? Any help will be appreciated. Regards, Andrey Grishin
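Ian's link points at the likely cause: IndexWriter's maxFieldLength setting caps how many terms of a single field get indexed, so terms past the cutoff are silently dropped and words near the end of a long document never become searchable. A toy model of that truncation in plain Java (not Lucene code; the exact default cutoff is whatever your Lucene version's javadoc says, 10,000 is used here only for illustration):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaxFieldLengthSketch {
    // Keep only the first maxFieldLength tokens of a field, the way
    // IndexWriter truncates long fields; later tokens are never indexed.
    static Set<String> indexTokens(List<String> tokens, int maxFieldLength) {
        return new HashSet<String>(
            tokens.subList(0, Math.min(maxFieldLength, tokens.size())));
    }

    public static void main(String[] args) {
        // A "document" of 10,001 words where the interesting word comes last.
        String[] words = new String[10001];
        Arrays.fill(words, "filler");
        words[10000] = "needle";

        Set<String> index = indexTokens(Arrays.asList(words), 10000);
        System.out.println(index.contains("filler")); // true
        System.out.println(index.contains("needle")); // false: past the cutoff
    }
}
```

Raising maxFieldLength on the writer before indexing makes the late words searchable again, at the cost of a bigger index.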
Re: searching on for null/blank field val
I think you will have to go the pseudo null/blank placeholder route. -- Ian.

(aaz) wrote:

Hi. We have a document with 2 fields:

a) title = "X"
b) fieldX = "" (blank)

How can I do a search that only gets documents where fieldX is blank? When I construct a TermQuery against fieldX with "" as the value, I get no results. What is the best way to do searches for such values, or should we create some pseudo null/blank placeholder to store in such fields with blank values? Thanks.
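A minimal sketch of the placeholder route Ian suggests: map empty or missing values to a sentinel token at indexing time, then search for that sentinel with an ordinary TermQuery. The sentinel name __blank__ is my own invention here; pick any token that cannot occur in your real data.

```java
public class BlankPlaceholder {
    // Hypothetical sentinel; must never appear as a genuine field value.
    static final String BLANK = "__blank__";

    // Applied when building the Document: store the sentinel in place of
    // a null/empty value so "blank" becomes a searchable term.
    static String toIndexedValue(String raw) {
        return (raw == null || raw.trim().length() == 0) ? BLANK : raw;
    }

    public static void main(String[] args) {
        // At search time, a TermQuery on fieldX = "__blank__" matches the
        // documents whose fieldX was originally empty.
        System.out.println(toIndexedValue(""));      // __blank__
        System.out.println(toIndexedValue("  "));    // __blank__
        System.out.println(toIndexedValue("title")); // title
    }
}
```

The same mapping has to be applied on the query side, so that a user search for "blank fieldX" is rewritten to the sentinel term.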
Re: Can any one help me?
Uma, if you know servlets and JSP you should be able to figure out how to integrate Lucene. Presumably you have already read the Getting Started guide? Suggestions:

1. Create a Lucene index of whatever it is you want to search across, as a standalone program. No servlets, JSP or database required. See the demo programs and instructions.
2. Create a standalone program to search that index.
3. Take whichever bits of that functionality you want to be accessible as a servlet, and call or cut/paste/refactor whatever you need to get what you want.

Good luck. -- Ian.

(Uma Maheswar) wrote:

Otis, yes, I know servlets and JSP. I am the only developer working on http://www.javagalaxy.com; all the contents of the site are developed by me. But I am not sure about working with Lucene. Can you help me? Uma
Re: searching only the text of a XML-File???
Certainly. Just extract the content from the XML and index it along with the file name. Scan the archives of this list, or search Google for something like "Lucene XML", if you want more. -- Ian.

(Richly, Gerhard) wrote:

Hello, is it possible to index and search only the content, not the tags, of an XML file? The result should be the name of the XML file. Is that possible with Lucene? Has anyone experience with indexing and searching XML files? Thank you, Gerhard
Re: Full List of Stop Words for Standard Analyzer.
In org/apache/lucene/analysis/standard/StandardAnalyzer.java. -- Ian.

(Suneetha Rao) wrote:

Hi, I would like to include all the stop words in my documentation. Can somebody tell me where to find the list for the StandardAnalyzer? Thanks in advance, Suneetha
Re: Deleting Problem
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html says, for delete:

    Deletes the document numbered docNum. Once a document is deleted it will not
    appear in TermDocs or TermPositions enumerations. Attempts to read its field
    with the document(int) method will result in an error. The presence of this
    document may still be reflected in the docFreq(org.apache.lucene.index.Term)
    statistic, though this will be corrected eventually as the index is further
    modified.

This is from the delete(int) method rather than delete(Term), but I would expect that it still holds true. If you want the deleted documents to really disappear for good, now, optimize the index. -- Ian.

(Terry Steichen) wrote:

I'm having difficulty deleting documents from my index. Here's code snippet 1:

    IndexReader reader = IndexReader.open(index_dir);
    Term dterm = new Term("pub_date", pub_date);
    int docs = reader.docFreq(dterm);
    reader.close();
    System.out.println("Found " + docs + " docs matching term pub_date = " + pub_date);

It reports back that I have 48 matching documents. Then I run code snippet 2:

    IndexReader reader = IndexReader.open(index_dir);
    Term dterm = new Term("pub_date", pub_date);
    int docs = reader.delete(dterm);
    reader.close();
    System.out.println("Deleted " + docs + " docs matching term pub_date = " + pub_date);

It reports back that I deleted 48 documents. But when I run snippet 1 once again, it reports that 48 matching documents still exist. If I run snippet 2 again, it reports that it (this time) deleted 0 docs. Obviously I'm overlooking something (probably obvious and simple), but I can't seem to delete the selected documents. Ideas/help would be welcome. Regards, Terry
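Terry's numbers make sense under the javadoc Ian quotes: delete() only marks documents as deleted (so a second delete finds nothing), while docFreq is a stored statistic that is not recomputed until the index is rewritten, e.g. by optimize. A toy model in plain Java, not Lucene internals, just to make the sequence of 48 / 48 / 0 concrete:

```java
import java.util.ArrayList;
import java.util.List;

public class DeferredDocFreq {
    private List<String> docs = new ArrayList<String>();
    private List<Boolean> deleted = new ArrayList<Boolean>();

    void add(String term) { docs.add(term); deleted.add(Boolean.FALSE); }

    // Like IndexReader.docFreq: a precomputed count that ignores deletion marks.
    int docFreq(String term) {
        int n = 0;
        for (String d : docs) if (d.equals(term)) n++;
        return n;
    }

    // Like IndexReader.delete(Term): marks matching docs deleted, returns count.
    int delete(String term) {
        int n = 0;
        for (int i = 0; i < docs.size(); i++) {
            if (docs.get(i).equals(term) && !deleted.get(i).booleanValue()) {
                deleted.set(i, Boolean.TRUE);
                n++;
            }
        }
        return n;
    }

    // Like optimize(): rewrite the index, dropping deleted docs for good.
    void optimize() {
        List<String> kept = new ArrayList<String>();
        List<Boolean> flags = new ArrayList<Boolean>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i).booleanValue()) { kept.add(docs.get(i)); flags.add(Boolean.FALSE); }
        }
        docs = kept;
        deleted = flags;
    }

    public static void main(String[] args) {
        DeferredDocFreq index = new DeferredDocFreq();
        for (int i = 0; i < 48; i++) index.add("2003-06-22");
        System.out.println(index.delete("2003-06-22"));  // 48: marked deleted
        System.out.println(index.docFreq("2003-06-22")); // 48: stat not yet updated
        System.out.println(index.delete("2003-06-22"));  // 0: already marked
        index.optimize();
        System.out.println(index.docFreq("2003-06-22")); // 0 after optimize
    }
}
```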
RE: contains
I also think it might work. Just for fun I tried a variation of it on a copy of the unix file /usr/dict/words. The indexing program reads each word, splits it up into substrings, and stores the original and the substrings; e.g. "beautiful" is stored as Field.UnIndexed, and the substrings "beautiful eautiful autiful utiful tiful iful ful ul l" are stored as Field.UnStored. One Document per word. The search program uses StandardAnalyzer and QueryParser. Pass it ful* and it returns 259 hits, including beautiful and beautifully. Pass it eaut and you get 13 hits, including beauteous and beauty. ifu* gives 20 hits, including beautiful, bifurcate and centrifuge. There are 45,407 words in the file, and the index, unoptimized, takes up 3.6Mb of disk space. Indexing the 45,407 words by themselves, one Document per word with the word as Field.Text, takes up 1.7Mb of disk space, unoptimized. I didn't add the inverse substrings, since I don't see why they are needed; my sample program seems to work without them. I expect I've missed something, or perhaps it works because the sample data is so simple. -- Ian.

(Lothar Simon) wrote:

Just to correct a few points:

- The factor would be 2 * (average no of chars per word)/2 = (average no of chars per word).
- One would probably create a set of 2 * (maximum number of chars per word) Fields for a document. Whether this could work was actually my question...
- Most important: my proposal is exactly (and almost only) designed to solve the substring (*uti*) problem! One field in the first group of fields in my example contains "utiful" and would be found by uti*; a field in the other group of fields contains "itueb" and would be found by itu*. Voila! I still think my idea would work (given you spend the space for the index).

Lothar

-----Original Message----- From: Joshua O'Madadhain Sent: Friday, July 12, 2002 6:45 PM To: Lucene Users List Subject: RE: contains

On Fri, 12 Jul 2002, Lothar Simon wrote: [in response to Peter Carlson pointing out that searching for *xyz* is a difficult problem]

"Of course you are right. And I am surely more the last than the first one to try to come up with THE solution for this. But still... Could the following work? If space (ok, a lot) is available you could store beutiful, eutiful, utiful, tiful, iful, ful, ul, l PLUS its inversions (lufitueb, ufitueb, fitueb, itueb, tueb, ueb, eb, b) in the index. Space needed would be something like (average no of chars per word) as much as in a normal index."

Actually it would be twice that, because you're storing backward and forward versions. I'd hazard a guess that this factor alone would mean something like a 10- or 12-fold increase in index size (the average length of a word is less than 5 or 6 letters, but by throwing out stop words you throw out a lot of the words that drag the average down). Another problem with this is that in order to be able to get from "ful" to "beautiful", you have to store, in the index entry for "ful", (pointers to) every single complete word in your document set that contains "ful" as a substring. Just _creating_ such an index would be extremely time-consuming even with clever data structures, and consider how much extra storage for pointers would be necessary for entries like "e" or "n". Finally, you're not including all substrings: your scheme doesn't allow me to search for *uti* and find beautiful. If you did, the number of entries would then be multiplied by a factor of the _square_ of the average number of characters per word. (You might be able to avoid this by doing prefix and suffix searches -- which are difficult but less so -- on the strings you specify, though.)

There might be some clever way to get around these problems, but I suspect that developing one would be a dissertation topic. :)

Regards, Joshua O'Madadhain

Per Obscurius... www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
"It's that moment of dawning comprehension that I live for" -- Bill Watterson
My opinions are too rational and insightful to be those of any organization.
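The term-generation step behind Ian's /usr/dict/words experiment, and Lothar's reversed variant, can be sketched in a few lines: index every suffix of a word, so a trailing-wildcard query like uti* hits the suffix "utiful" of "beautiful"; adding the suffixes of the reversed word is Lothar's second group of fields. This is a hedged sketch of only the substring generation, not the Lucene indexing around it:

```java
import java.util.ArrayList;
import java.util.List;

public class SuffixTerms {
    // All suffixes of a word: "ful" -> ["ful", "ul", "l"].
    static List<String> suffixes(String word) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < word.length(); i++) out.add(word.substring(i));
        return out;
    }

    // Lothar's extra group: suffixes of the reversed word, so a query on a
    // reversed prefix can reach into the middle of the original word.
    static List<String> reversedSuffixes(String word) {
        return suffixes(new StringBuilder(word).reverse().toString());
    }

    public static void main(String[] args) {
        System.out.println(suffixes("beautiful"));
        // A trailing-wildcard search for "uti*" matches because "utiful"
        // is one of the indexed suffix terms:
        boolean hit = false;
        for (String s : suffixes("beautiful")) if (s.startsWith("uti")) hit = true;
        System.out.println(hit); // true
    }
}
```

The cost Joshua describes follows directly: each word contributes roughly (word length) extra terms, twice that with the reversed group.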
Re: Problem in unicode field value retrival
I don't think you can retrieve the contents of Fields that have been loaded from a Reader. From the javadoc for Field:

    Text(String name, Reader value)
    Constructs a Reader-valued Field that is tokenized and indexed,
    but is not stored in the index verbatim.

-- Ian.

(Harpreet S Walia) wrote:

Hi, I am trying to index and search unicode (UTF-8). The code I am using to index the documents is as follows:

    IndexWriter iw = new IndexWriter("d:\\jakarta-tomcat3.2.3\\webapps\\lucene\\index",
                                     new SimpleAnalyzer(), true);
    String dirBase = "d:\\jakarta-tomcat3.2.3\\webapps\\lucene\\docs";
    File docDir = new File(dirBase);
    String[] docFiles = docDir.list();
    InputStreamReader isr;
    InputStream is;
    Document doc;
    for (int i = 0; i < docFiles.length; i++) {
        File tempFile = new File(dirBase + "\\" + docFiles[i]);
        if (tempFile.isFile()) {
            System.out.println("Indexing File: " + docFiles[i]);
            is = new FileInputStream(tempFile);
            isr = new InputStreamReader(is, "utf-8");
            doc = new Document();
            doc.add(Field.UnIndexed("path", tempFile.toString()));
            doc.add(Field.Text("abc", (Reader) isr));
            doc.add(Field.Text("all", "sansui"));
            iw.addDocument(doc);
            is.close();
            isr.close();
        }
    }
    iw.close();
    System.out.println("Indexing Complete");

Now when I try to search the contents and get the field called "abc" by using the method doc.get("abc"), I get null as the output. Can anyone please tell me where I am going wrong? Thanks and regards, Harpreet
Re: segment count
"In order to make a search, the mergeSegments() function must be called, right? Otherwise IndexSearcher won't have the most updated index files to work with to do a search. I guess my point is: do I have to intermittently call optimize() or close() (to call mergeSegments()), or make maybeMergeSegments() find a merge to do, before using IndexSearcher? Btw, I am running IndexFiles and SearchFiles at the same time."

I don't know if you have to call close() to make all modifications visible or not. Sounds likely. You do not have to call optimize. Having one writer and one or more readers concurrently is fine. You can (should) call IndexReader.lastModified() to find out if the index has been modified since the IndexReader was opened.

"Also, when IndexWriter.addDocument is called per file, the function calls newSegmentName() to create its corresponding segment name. That segment name is used to create a SegmentInfo, which gets added to the SegmentInfos vector. Am I missing something?"

No idea. I'm just a Lucene user and have never needed to know about that sort of stuff. -- Ian.
Re: MS Word Search ??
"... openoffice - www.openoffice.org knows how to parse all of the microsoft ... #2 - if open office is programmatically drivable (which I don't know if it is), fire up a copy of open office and use it to convert the files as necessary."

See an earlier post to this list: http://marc.theaimsgroup.com/?l=lucene-user&m=101920039808700&w=2

It is often worth searching or browsing the archive! -- Ian.
Re: segment count
Lucene doesn't store one document per segment. See http://marc.theaimsgroup.com/?l=lucene-user&m=102079295608850&w=2 for detail on the files created. On "is this the right way...?", here is an extract from the javadoc:

1. Create Documents by adding Fields;
2. Create an IndexWriter and add documents to it with addDocument();
3. Call QueryParser.parse() to build a query from a string; and
4. Create an IndexSearcher and pass the query to its search() method.

-- Ian.

(Hyong Ko) wrote:

I added a segment using IndexWriter.addDocument. Then I called IndexWriter.optimize (IndexWriter.close works too) to generate index files to do a search. Then I added another segment using IndexWriter.addDocument. The total segment count should be 2, but instead it's 3. Any ideas? Is this the right way to index and search concurrently? Thanks.
Re: Wilcard Search Issues
How about if you search for resloc:ccsa*, i.e. all lower case? If using QueryParser.parse() with a StandardAnalyzer, the search term does not get converted to lower case if it contains a trailing wildcard. Running code like this:

    Analyzer analyzer = new StandardAnalyzer();
    Query query = QueryParser.parse(s, KEYFIELD, analyzer);
    System.out.println(s + " (" + query.toString(KEYFIELD) + ")");

with s set to various values gives something like this:

    CCsa  (ccsa)   // OK
    CCsa* (CCsa*)  // Suspect
    ccsa* (ccsa*)  // OK

Tested against rc5 and the latest from CVS. -- Ian.

(Nader S. Henein) wrote:

I'm using the new Lucene 1.5 release, and I remember a message on the lucene-user mailing list that talked about a wildcard issue: if you search a field like <resloc>CCsa</resloc> using the query string resloc:CCsa*, it will yield no results. Then there was a reply saying that the issue had been resolved in the nightly builds; this was about two weeks before rc1.5 (which I'm using), and according to the rc1.5 mailer that went out, wildcard issues were hammered out. But I still have this problem: if I search using resloc:CCsa I get 5 results, but when I add the star to the right-hand side of the query string, like so: resloc:CCsa*, I get no results. Anyone care to shed some light on this issue?

Nader S. Henein
Bayt.com, Dubai Internet City
Tel. +9714 3911900 Fax. +9714 3911915 GSM. +9715 05659557
www.bayt.com
Re: QueryParser.parse
Why do you not want to tokenize AUTEUR and TITRE? Your example will work if you do. Using QueryParser.parse with a StandardAnalyzer, your query will be tokenized and will match the tokens created when you added the documents to the index.

--
Ian.
[EMAIL PROTECTED]

[EMAIL PROTECTED] (Arpad KATONA) wrote

Hello,

under http://jakarta.apache.org/lucene/docs/queryparsersyntax.html one can read:

+++
As an example, let's assume a Lucene index contains two fields, title and text, and text is the default field. If you want to find the document entitled "The Right Way" which contains the text "don't go this way", you can enter: title:"The Right Way" AND text:go
+++

So I tried, see the code below:

1) first, I create a document with 3 fields: KEY, AUTEUR and TITRE; the fields are NOT to be tokenized.
2) second, I read the index to see what it contains (see output below).
3) third, I would like to find a record with "jean auteur" as AUTEUR or "opus primum" as TITRE; thus, the query string is: AUTEUR:"jean auteur" OR TITRE:"opus primum". But, hélas!, there is no answer.

Could you tell me, please, where the error is?

Thanks
Arpad KATONA
[EMAIL PROTECTED]

+++
try {
  Analyzer analyzer = new StandardAnalyzer();
  String sIxPath = "D:\\tempo\\lucene\\ix";
  //
  // indexing
  //
  IndexWriter writer = new IndexWriter(sIxPath, analyzer, true);
  org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
  Field fd = new Field("KEY", "1", true, true, false);
  doc.add(fd); fd = null;
  fd = new Field("AUTEUR", "jean auteur", false, true, false);
  doc.add(fd); fd = null;
  fd = new Field("TITRE", "opus primum", false, true, false);
  doc.add(fd); fd = null;
  writer.addDocument(doc); doc = null;
  writer.close(); writer = null;
  //
  // checking
  //
  IndexReader reader = IndexReader.open(sIxPath);
  TermEnum tenum = reader.terms();
  for (; tenum.next(); ) {
    Term term = tenum.term();
    String sField = term.field();
    String sText = term.text();
    System.out.println("sField=\"" + sField + "\" sText=\"" + sText + "\"");
  }
  tenum.close(); tenum = null;
  reader.close(); reader = null;
  //
  // searching
  //
  String sQuery = "TITRE:\"opus primum\" OR AUTEUR:\"jean auteur\"";
  Query query = QueryParser.parse(sQuery, "", analyzer);
  System.out.println("query = \"" + query.toString() + "\"");
  Searcher searcher = new IndexSearcher(sIxPath);
  Hits hits = searcher.search(query);
  if (null == hits) {
    System.out.println("No answer.");
  } else {
    int nbHits = hits.length();
    System.out.println("nbHits = " + Integer.toString(nbHits));
  }
  hits = null;
  searcher.close(); searcher = null;
  query = null;
} catch (Exception ex) {
  ex.printStackTrace();
}
+++

output:

+++
sField="AUTEUR" sText="jean auteur"
sField="KEY" sText="1"
sField="TITRE" sText="opus primum"
query = "TITRE:"opus primum" AUTEUR:"jean auteur""
nbHits = 0
+++
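The nbHits = 0 above comes from a tokenized query being matched against untokenized terms. A rough stand-alone illustration (plain Java; the whitespace-split below is a crude stand-in for StandardAnalyzer, not Lucene code):

```java
import java.util.Arrays;
import java.util.List;

public class TokenMismatch {
    // Crude stand-in for a tokenizing analyzer: split on whitespace and
    // lower-case. StandardAnalyzer does more, but the effect on
    // "jean auteur" is the same: it becomes two separate terms.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    public static void main(String[] args) {
        // One single untokenized term stored in the index:
        String storedTerm = "jean auteur";
        // What QueryParser searches for after analysis:
        List<String> queryTerms = tokenize("jean auteur");
        System.out.println(queryTerms);                    // [jean, auteur]
        System.out.println(queryTerms.contains(storedTerm)); // false
    }
}
```

Neither query term equals the single stored term "jean auteur", so nothing can match; tokenizing the fields at index time (or building a TermQuery by hand) removes the mismatch.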
Re: AW: WildcardQuery
Left wildcards seem to work if you explicitly use a WildcardQuery, e.g.

  Term t = new Term("id", "*ucene");
  Query query = new WildcardQuery(t);

but if you use QueryParser with an analyzer, e.g.

  Analyzer analyzer = new StandardAnalyzer();
  Query query = QueryParser.parse("*ucene", "id", analyzer);

you get an exception:

  org.apache.lucene.queryParser.ParseException: Lexical error at line 1, column 1.  Encountered: "*" (42), after :
        at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)

Tested on RC5. Haven't tried other ways of building a query. In my simple tests, terms with left and right wildcards like *lucene* worked too, even if the whole word was included.

--
Ian.
[EMAIL PROTECTED]

[EMAIL PROTECTED] (Christian Schrader) wrote

It works with the nightly builds and probably with 1.2-RC5 :-)

Christian

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: 07 May 2002 17:31
To: Lucene Users List
Subject: Re: WildcardQuery

Yes, me too. I just tried it on some Lucene index (the search at blink.com) and it doesn't seem to work (try searching for travel and then *vel). I'm assuming the original poster confused something...

Otis

--- Joel Bernstein [EMAIL PROTECTED] wrote:

I thought Lucene didn't support left wildcards like the following: *ucene

----- Original Message -----
From: Christian Schrader [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, May 06, 2002 7:14 PM
Subject: WildcardQuery

I am pretty happy with the results of WildcardQueries like *ucen* that matches lucene, but *lucene* doesn't match lucene. Is there a reason for this? And what would be the patch? It should be in WildcardTermEnum. I am wondering if somebody has already patched it?

Thanks, Chris
RE: AW: WildcardQuery
Hi... I am both a newbie to Lucene and to using this list, so please forgive me if I make some mistakes. I am trailing onto this post because I cannot seem to get the wildcard function to work at all, while all of the other features seem to work just fine. I am using a very standard application (actually, it is just the demo version slightly modified) with the StandardAnalyzer and the QueryParser. But the wildcard feature (using either ? or *) just doesn't work. I must be missing something very basic. I would appreciate any ideas. Thanks!

Basic wildcard support (i.e. ignoring things like left wildcards) comes pretty much out of the box. Attached is a copy of the program I was playing with before sending the earlier message. It uses StandardAnalyzer and the static QueryParser.parse() method, so it doesn't work with left wildcards. I haven't tried ? rather than *. Hope this helps.

--
Ian.
[EMAIL PROTECTED]
RE: AW: WildcardQuery
Sorry - I said I was going to send some code ...

--
Ian.

import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.*;
import org.apache.lucene.index.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;
import org.apache.lucene.store.*;

public class LuceneTest {

  RAMDirectory ramdir;
  Analyzer analyzer;
  IndexWriter writer;
  IndexReader reader;
  Searcher searcher;

  public LuceneTest() {
    analyzer = new StandardAnalyzer();
    ramdir = new RAMDirectory();
  }

  public static void main(String args[]) throws Exception {
    LuceneTest ld = new LuceneTest();
    ld.load();
    ld.search();
  }

  void load() throws Exception {
    writer = new IndexWriter(ramdir, analyzer, true);
    add("january");
    add("february");
    add("june");
    add("july");
    writer.close();
  }

  void add(String s) throws Exception {
    Document d = new Document();
    d.add(Field.Keyword("id", s));
    System.out.println("Adding " + s);
    writer.addDocument(d);
  }

  void search() throws Exception {
    reader = IndexReader.open(ramdir);
    searcher = new IndexSearcher(reader);
    search("jan*");
    search("jan*y");
    search("j*y");
    search("j*");
    search("*y");
  }

  void search(String s) throws Exception {
    Query query = QueryParser.parse(s, "id", analyzer);
    Hits hits = searcher.search(query);
    System.out.println(s + " matched " + hits.length());
    for (int i = 0; i < hits.length(); i++) {
      System.out.println("  " + hits.doc(i).get("id"));
    }
  }
}
Re: Using the UnStored Field() type
If the field is tokenized and indexed, can I still search that field?

Yes.

My code looks like this:

  theDocument = new Document();
  if (0 != textString.length()) {
    textField = Field.UnStored(FIELD_TEXT, textString);
    theDocument.add(textField);
  }

then I search it like this:

  indexReader = IndexReader.open("C:\\temp\\index_store");
  Term searchTermTiny = new Term(DocumentVisitor.FIELD_TEXT, "Syndeo");
  FuzzyQuery query = new FuzzyQuery(searchTermTiny);
  IndexSearcher search = new IndexSearcher(indexReader);
  Hits foundDocs = search.search(query);

I never get any results for the documents. Syndeo occurs often. Any ideas?

You don't show us code to save the document in the index, but I assume you are doing that! Is FIELD_TEXT the same as DocumentVisitor.FIELD_TEXT? A short but complete program makes it easier to spot problems rather than making wild guesses.

--
Ian.
[EMAIL PROTECTED]
Re: rc4 and FileNotFoundException: an update
Have you posted code that demonstrates this problem? If so I missed it. If you send, to this list, the shortest program you can come up with that demonstrates the problem, there is a fair chance that someone may spot something. I, and many others, use that release of Lucene to index far more than 16 objects, so I think that at this stage the assumption has to be that the problem lies with your code.

--
Ian.
[EMAIL PROTECTED]

[EMAIL PROTECTED] (petite_abeille) wrote

Hello again,

I guess it's really not my day... Just to make sure I'm not hallucinating too much, I downloaded the latest and greatest: rc4. Changed all the package names to org.apache. Updated a method here and there to reflect the API changes. And ran my little app. I would like to emphasize that except updating to the latest Lucene release, nothing else has changed. Well, it's pretty ugly. Whatever I'm doing with Lucene in the previous package (com.lucene) is magnified many-fold in rc4. After processing a paltry 16 objects I got:

SZFinder.findObjectsWithSpecificationInStore: java.io.FileNotFoundException: _2.f14 (Too many open files)

At least in the previous version, I would see that only after a couple of thousand objects... So, it seems, there is something really rotten in the kingdom of Denmark... Any help much appreciated.

Thanks.
Re: Incremental vs Full indexing
You can just use optimize().

--
Ian.
[EMAIL PROTECTED]

[EMAIL PROTECTED] (Andrew Smith) wrote

I wish to use incremental indexing in an application based on Lucene. Do I need to periodically perform a full re-build of the index to keep it in an efficient state, or can I simply use the IndexWriter's optimize() function instead?

TIA

Andy
Re: Lucene with Number+Text
I've a problem searching for numbers in Lucene. I'm using StandardAnalyzer for indexing and search. In my document, I have a field containing the text "this is a test for lucene with number 1727a and 1992 and 3562".

- I was able to search for "1992" or "3562".
- However, the search returns nothing when I try to search for "1727" or "1727a". It seems it didn't index number-plus-text when it's one word.

Please help

Are you sure this is using StandardAnalyzer on the latest release (1.2 rc4)? If I index that string and search for 1727a I get a hit.

--
Ian.
Re: Lucene with Number+Text
Good thinking. In my test, using a Text field, searches for 1727a and 1727* both return a hit, but if I switch to Keyword they don't.

--
Ian.

[EMAIL PROTECTED] (Shannon Booher) wrote

I think I have seen a similar problem. Are you guys using Keyword or Text fields?
Re: indexing secure sites
Is it possible to index secure sites with Lucene?

Lucene will index whatever you can feed it.

When I tried I got an error message: javax.net.ssl.SSLException: untrusted server cert chain. Is it a problem with my certificate?

Could well be. It is definitely nothing to do with Lucene or this list.

--
Ian.
Re: Build index using RAMDirectory out of memory errors
Have you tried different values for IndexWriter.mergeFactor? Setting it to 1000 gave me a 10x speed improvement on a large index some time ago. Not with RAMDirectory though. Your mileage may vary.

--
Ian.

Kurt Vaag wrote:

I have been using Lucene for 3 weeks and it rules. The indexing process can be slow, so I searched the mailgroup archives and found example code using RAMDirectory to improve indexing speed. The example code I found indexed 100,000 files at a time to the RAMDirectory before writing to disk. I tried indexing 10,000 files at a time to the RAMDirectory before writing to disk. This drastically improved indexing times, but sometimes I get out of memory errors. I am indexing text files and adding 9 fields from an Oracle database.

Environment: Solaris 2.8 with 1G of ram and 2G of swap, Java 1.3.1, Lucene 1.2-rc4

Any ideas for eliminating the out of memory errors?
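To get a feel for why raising mergeFactor speeds indexing, you can count merge operations under a simplified model of the merge policy (my own approximation, not Lucene code: it assumes each added document starts as its own level-0 segment and every mergeFactor same-level segments merge into one segment at the next level, ignoring remainders):

```java
public class MergeCount {
    // Approximate number of segment merges for n docs with the given
    // mergeFactor: each group of mergeFactor segments at one level is
    // merged into a single segment at the next level, repeatedly.
    static int countMerges(int n, int mergeFactor) {
        int merges = 0;
        int segments = n;             // level-0 segments, one per document
        while (segments >= mergeFactor) {
            segments /= mergeFactor;  // each group of mergeFactor -> 1 merge
            merges += segments;
        }
        return merges;
    }

    public static void main(String[] args) {
        System.out.println(countMerges(1000, 10));   // 111 merges
        System.out.println(countMerges(1000, 1000)); // 1 merge
    }
}
```

The merge count falls roughly by a factor of mergeFactor, which is where the speed-up comes from; the price is that many more segments (and open files, and buffered data) exist at once, hence the memory pressure.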
Re: Phrase Query
Hello All,

Question on phrase queries: I have a medical reports document that has "Anesth, Knee" in it. If I use a phrase query, it works, but so does "Anesth Knee" (notice that the comma is missing). Does Lucene remove special characters before indexing the documents?

Depends on the Analyzer you use. Some certainly do, including StandardAnalyzer. You can build your own analyzer if you have special requirements.

--
Ian.
Lucene Build Instructions
Today I built lucene-1.2-rc3 from the source distribution for the first time. On the whole it was easy enough, but a couple of points:

BUILD.txt says "Install JDK 1.3". Might it be better to say "Install JDK 1.3 or later"? Lots of people are probably running 1.4 already.

BUILD.txt makes no mention of JavaCC. You find out soon enough when you run ant, but it would be better to have it mentioned up front. Also, I couldn't get the .ant.properties method mentioned in build.xml to work (probably finger trouble), but copying javaCC.zip to the lucene-1.2-rc3-src/lib/ directory worked fine.

Also, by default the code as compiled from source and as distributed in the binary download doesn't have line numbers in stack trace dumps. This may be deliberate, in which case fine, but the line numbers do help in tracking down problems.

--
Ian.
[EMAIL PROTECTED]
Re: Web demo example: Errors from Tomcat startup
Looks like it does.

--
Ian.

Andrew C. Oliver wrote:

Does tomcat 4 understand the 3.2.x header? If so, let's just use the 3.2.x header.
Re: Web demo example: Errors from Tomcat startup
The UnknownHostException is probably because the parser trying to read WEB-INF/web.xml wants to look up the DTD and is failing because it can't locate java.sun.com. Perhaps surprising given your email address! Perhaps you need to fix your local DNS setup so it can find java.sun.com, or use a different parser. I can't remember all the ins and outs of this, but I run tomcat offline, without access to java.sun.com, using xerces. Can't comment on the other stuff about the demo.

--
Ian.
[EMAIL PROTECTED]

Don Gilchrest - Sun Microsystems wrote:

Hi,

I'm going through the Lucene 'Getting Started' Guide and am up to creating the index for the Web app. After copying lucene-1.2-rc3/luceneweb.war to $TOMCAT_HOME/webapps, the following errors are reported when I start up Tomcat (v3.2.1):

XmlMapper: Can't find resource for entity: -//Sun Microsystems, Inc.//DTD Web Application 2.3//EN -- http://java.sun.com/dtd/web-app_2_3.dtd null
ERROR reading /opt/jakarta-tomcat-3.2.1/webapps/luceneweb/WEB-INF/web.xml
At "External entity not found: http://java.sun.com/dtd/web-app_2_3.dtd".
ERROR reading /opt/jakarta-tomcat-3.2.1/webapps/luceneweb/WEB-INF/web.xml
java.net.UnknownHostException: java.sun.com
        at java.net.InetAddress.getAllByName0(InetAddress.java:571)
        at java.net.InetAddress.getAllByName0(InetAddress.java:540)
        at java.net.InetAddress.getByName(InetAddress.java:449)
        at java.net.Socket.<init>(Socket.java:100)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:50)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:335)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:521)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:271)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:281)
        at sun.net.www.http.HttpClient.New(HttpClient.java:293)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:404)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:497)
        at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:230)
        at com.sun.xml.parser.Resolver.createInputSource(Resolver.java:248)
        at com.sun.xml.parser.ExternalEntity.getInputSource(ExternalEntity.java:49)
        at com.sun.xml.parser.Parser.pushReader(Parser.java:2768)
        at com.sun.xml.parser.Parser.externalParameterEntity(Parser.java:2504)
        at com.sun.xml.parser.Parser.maybeDoctypeDecl(Parser.java:1137)
        at com.sun.xml.parser.Parser.parseInternal(Parser.java:481)
        at com.sun.xml.parser.Parser.parse(Parser.java:284)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:155)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:126)
        at org.apache.tomcat.util.xml.XmlMapper.readXml(XmlMapper.java:214)
        at org.apache.tomcat.context.WebXmlReader.processWebXmlFile(WebXmlReader.java:202)
        at org.apache.tomcat.context.WebXmlReader.contextInit(WebXmlReader.java:109)
        at org.apache.tomcat.core.ContextManager.initContext(ContextManager.java:491)
        at org.apache.tomcat.core.ContextManager.init(ContextManager.java:453)
        at org.apache.tomcat.startup.Tomcat.execute(Tomcat.java:195)
        at org.apache.tomcat.startup.Tomcat.main(Tomcat.java:235)
2002-02-06 04:51:53 - PoolTcpConnector: Starting HttpConnectionHandler on 8080
2002-02-06 04:51:53 - PoolTcpConnector: Starting Ajp12ConnectionHandler on 8007

Also, the instructions (in demo3.html) related to this step are a bit confusing to me and seem like they may be out of order. For example, in the "Indexing files" section, it states to execute the command to create the index within your {tomcat}/webapps/luceneweb directory; but that directory doesn't exist until I do the next step -- copying luceneweb.war to {tomcat}/webapps and re-starting Tomcat, which is when I get the errors shown above. What am I missing here?

Thanks in advance for any help with this.

regards,
-don

PS Here's my classpath output from 'tomcat.sh start':

/opt/jakarta-tomcat-3.2.1/lib/ant.jar:/opt/jakarta-tomcat-3.2.1/lib/jasper.jar:/opt/jakarta-tomcat-3.2.1/lib/jaxp.jar:/opt/jakarta-tomcat-3.2.1/lib/parser.jar:/opt/jakarta-tomcat-3.2.1/lib/servlet.jar:/opt/jakarta-tomcat-3.2.1/lib/test:/opt/jakarta-tomcat-3.2.1/lib/webserver.jar:/usr/local/j2sdk1_3_1_02/lib/tools.jar:/usr/local/lucene-1.2-rc3/lucene-1.2-rc3.jar:/usr/local/lucene-1.2-rc3/lucene-demos-1.2-rc3.jar:.:/usr/local/j2sdk1_3_1_02/lib/dt.jar:/usr/local/j2sdk1_3_1_02/lib/tools.jar:/usr/local/j2sdk1_3_1_02/lib/htmlconverter.jar:/usr/local/jdom-b7/build/jdom.jar:/opt/java/jsdk2.2/lib/jsdk.jar

Let me know if I need to provide any other info.
Re: Problem in deleteing the documents
That's the version I use.

--
Ian.

Thutika, Swamy wrote:

Thanks, Ian, for the solution. It works now. In the api doc I am looking at, this is not mentioned. I am using lucene-1.2-rc2. I am just wondering if I am using the current version of Lucene. Could you let me know where I can find the latest one? The api doc I have says:

delete

public abstract void delete(int docNum) throws IOException

Deletes the document numbered docNum. Once a document is deleted it will not appear in TermDocs or TermPositions enumerations. Attempts to read its field with the document(int) method will result in an error. The presence of this document may still be reflected in the docFreq(org.apache.lucene.index.Term) statistic, though this will be corrected eventually as the index is further modified.

Thanks
Swamy
Re: AW: Can't get a DateFilter to work
Sorry, my mistake. I glanced at your message and leapt to the wrong conclusion. I don't know what is wrong in your code, but since I had been meaning to get to grips with date searching in Lucene and had never made time to do so, I have knocked up a test program, attached to this message, which creates a small index and searches on a few ranges, including 0 to 2008269714990, and it seems to work fine. Perhaps it may help you track down your problem. If not, and you still want help, I suggest you post the simplest possible program that demonstrates the problem.

--
Ian.
[EMAIL PROTECTED]

Jan Stövesand wrote:

Hi, it is no typo. I thought that a DateFilter will return everything between from (0) and to (2008269714990), specified in the parameter list.

Jan

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On behalf of Ian Lea
Sent: Friday, 14 December 2001 12:40
To: Lucene Users List
Subject: Re: Can't get a DateFilter to work

Could be just a typo in the email message, but 1008269714990 != 2008269714990.

--
Ian.
[EMAIL PROTECTED]

Jan Stövesand wrote:

Hi, I really tried everything to get a DateFilter to work but I failed. :-(

What I did was:

Indexing:

  doc.add(Field.Keyword("last-modifed", DateField.timeToString(timeInMillies)));

e.g. millies: 1008269714990, field value: 0cv6xr772

If I submit a normal query looking for 0cv6xr772 I find the entry, i.e. the entry should be indexed correctly. If I search for a text in the body of the element I find about 30 entries, including the one mentioned above with last-modified=0cv6xr772. If I repeat the same query with

  DateFilter filter = new DateFilter("last-modified", 0, 2008269714990L);
  Hits hits = searcher.search(query, filter);

I do not get any results. What am I doing wrong? Any help appreciated.

JAn

--

import java.util.Date;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.*;
import org.apache.lucene.index.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;
import org.apache.lucene.store.*;

/** Simple program to play with DateFilter stuff.
 *
 * To compile and run on unix, save to e.g. /tmp/LuceneDateTest.java and
 * <pre>
 * $ cd /tmp
 * $ CLASSPATH=./:/some/where/lucene.jar; export CLASSPATH
 * $ javac LuceneDateTest.java
 * $ java LuceneDateTest
 * </pre>
 *
 * Adjust as appropriate for other platforms.
 *
 * Creates a RAMDirectory, adds some documents with dates and
 * searches on date ranges.
 */
public class LuceneDateTest {

  RAMDirectory ramdir;
  Analyzer analyzer;
  IndexWriter writer;
  IndexReader reader;
  Searcher searcher;
  Date start;
  Date middle;
  Date end;

  public LuceneDateTest() {
    analyzer = new StandardAnalyzer();
    ramdir = new RAMDirectory();
  }

  public static void main(String args[]) throws Exception {
    LuceneDateTest ld = new LuceneDateTest();
    ld.load();
    ld.search();
  }

  void load() throws Exception {
    start = new Date();
    writer = new IndexWriter(ramdir, analyzer, true);
    add("doc1");
    add("doc2");
    middle = new Date();
    add("doc3");
    add("doc4");
    writer.close();
    end = new Date();
  }

  void add(String id) throws Exception {
    Document d = new Document();
    d.add(Field.Keyword("id", id));
    String lmod = DateField.timeToString(System.currentTimeMillis());
    String lmod2 = DateField.timeToString(1008269714990L);
    d.add(Field.Keyword("last-modified", lmod));
    d.add(Field.Keyword("last-mod2", lmod2));
    System.out.println("Adding id:" + id + ", last-modified: " + lmod + ", last-mod2: " + lmod2);
    writer.addDocument(d);
  }

  void search() throws Exception {
    reader = IndexReader.open(ramdir);
    searcher = new IndexSearcher(reader);
    DateFilter filter;
    Query query = QueryParser.parse("d*", "id", analyzer);
    filter = new DateFilter("last-modified", start, end);
    search(query, filter, "start to end (4): ");
    filter = new DateFilter("last-modified", middle, end);
    search(query, filter, "middle to end (2): ");
    filter = new DateFilter("last-modified", 0, System.currentTimeMillis());
    search(query, filter, "0 to now (4): ");
    filter = new DateFilter("last-mod2", 0, 2008269714990L);
    search(query, filter, "0 to 2008269714990 (4
Re: FileNotFoundException
I don't have an explanation for this, but if it were me indexing this large an amount of data I'd be running each of the 6 in a completely separate process. More control, less damage when one bit fails, perhaps better performance on a multi-processor machine. And perhaps you wouldn't get this problem!

--
Ian.
[EMAIL PROTECTED]

Chantal Ackermann wrote:

hello all,

I am still trying to find the best way to index a really big amount of data. At the moment I am trying to index each of the 29 textfiles in a single thread, using for each its own IndexWriter and its own directory where to place the index. There are always six threads working at the same time. The problem that occurs now is that every second thread stops due to a FileNotFoundException or an ArrayIndexOutOfBoundsException (the latter only once) while the other half finishes fine. The file's name is different for each thread but always has the extension .fnm. For example:

java.io.FileNotFoundException: /lucenetest/medlineIndex/1976-1977/_2zfj.fnm (Datei oder Verzeichnis nicht gefunden)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java(Compiled Code))
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java(Compiled Code))
        at org.apache.lucene.store.FSInputStream.<init>(Unknown Source)
        at org.apache.lucene.store.FSInputStream.<init>(Unknown Source)
        at org.apache.lucene.store.FSDirectory.openFile(Unknown Source)
        at org.apache.lucene.index.FieldInfos.<init>(Unknown Source)
        at org.apache.lucene.index.SegmentReader.<init>(Unknown Source)
        at org.apache.lucene.index.IndexWriter.mergeSegments(Unknown Source)
        at org.apache.lucene.index.IndexWriter.maybeMergeSegments(Unknown Source)
        at org.apache.lucene.index.IndexWriter.maybeMergeSegments(Unknown Source)
        at org.apache.lucene.index.IndexWriter.addDocument(Unknown Source)
        at de.biomax.lucenetest.MedlineRecordIndexer.indexDocs(MedlineRecordIndexer.java(Compiled Code))

Since half of the files are indexed without throwing that kind of exception, I'm at a loss where to start debugging. Any ideas?

thanks a lot
chantal
Re: OutOfMemoryError
Doug sent the message below to the list on 3-Nov in response to a query about file size limits. There may have been more related stuff on the thread as well.

--
Ian.

***

Anyway, is there any way to control how big the indexes grow?

The easiest thing is to set IndexWriter.maxMergeDocs. Since you hit 2GB at 8M docs, set this to 7M. That will keep Lucene from trying to merge an index that won't fit in your filesystem. (It will actually effectively round this down to the next lower power of IndexWriter.mergeFactor. So with the default mergeFactor=10, maxMergeDocs=7M will generate a series of 1M document indexes, since merging 10 of these would exceed the max.)

Slightly more complex: you could further minimize the number of segments if, when you've added seven million documents, you optimize the index and start a new index. Then use MultiSearcher to search.

Even more complex and optimal: write a version of FSDirectory that, when a file exceeds 2GB, creates a subdirectory and represents the file as a series of files. (I've done this before, and found that, on at least the version of Solaris I was using, the files had to be a few 100k less than 2GB for programs like 'cp' and 'ftp' to operate correctly on them.)

Doug

Chantal Ackermann wrote:

hi Ian, hi Winton, hi all,

sorry, I meant a heap size of 100Mb. I'm starting java with -Xmx100m. I'm not setting -Xms. For what I know now, I had a bug in my own code. Still, I don't understand where these OutOfMemoryErrors came from. I will try to index again in one thread without RAMDirectory, just to check that the program is sane.

The problem that the files get too big while merging remains. I wonder why there is no possibility to tell Lucene not to create files that are bigger than the system limit. How am I supposed to know after how many documents this limit is reached? Lucene creates the documents - I just know the average size of a piece of text that is the input for a document. Or am I missing something?!

chantal
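Doug's remark that maxMergeDocs is "effectively rounded down to the next lower power of mergeFactor" is easy to sketch. This arithmetic helper is my own illustration, not a Lucene API:

```java
public class MergeLimit {
    // Largest power of mergeFactor that does not exceed maxMergeDocs.
    // With mergeFactor=10, a limit of 7,000,000 behaves like 1,000,000:
    // merging ten 1M-doc segments would produce 10M > 7M, so merging
    // stops at the 1M level.
    static long effectiveLimit(long maxMergeDocs, int mergeFactor) {
        long power = 1;
        while (power * mergeFactor <= maxMergeDocs) {
            power *= mergeFactor;
        }
        return power;
    }

    public static void main(String[] args) {
        System.out.println(effectiveLimit(7000000L, 10)); // 1000000
    }
}
```

So setting maxMergeDocs anywhere between 1M and just under 10M (with the default mergeFactor) yields the same series of 1M-document segments.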
Re: OutOfMemoryError
I've loaded a large (but not as large as yours) index with mergeFactor set to 1000. It was substantially faster than with the default setting. Making it higher didn't seem to make things much faster but did cause it to use more memory. In addition, I loaded the data in chunks in separate processes and optimized the index after each chunk, again in a separate process. All done straight to disk, no messing about with RAMDirectories. I didn't play with maxMergeDocs, and am not sure what you mean by maximum heap size, but 1MB doesn't sound very large.

--
Ian.
[EMAIL PROTECTED]

Chantal Ackermann wrote:

hi to all,

please help! I think I've mixed my brain up already with this stuff... I'm trying to index about 29 textfiles where the biggest one is ~700Mb and the smallest ~300Mb. I once managed to run the whole index, with a merge factor = 10 and maxMergeDocs=1. That took more than 35 hours I think (I don't know exactly) and it didn't use much RAM (though it could have). Unfortunately I had a call to optimize at the end, and during optimization an IOException (File too big) occurred (while merging).

As I run the program on a multi-processor machine, I have now changed the code to index each file in a single thread and write to one single IndexWriter. The merge factor is still at 10. maxMergeDocs is at 1.000.000. I set the maximum heap size to 1MB. I tried to use RAMDirectory (as mentioned on the mailing list) and just use IndexWriter.addDocument(). At the moment it seems not to make any difference. After a while _all_ the threads exit one after another (not all at once!) with an OutOfMemoryError. The priority of all of them is at the minimum.

Even if the multithreading doesn't increase performance, I would be glad if I could just get it running again. I would be even happier if someone could give me a hint what would be the best way to index this amount of data. (The average size of an entry that gets parsed for a Document is about 1Kb.)

thanx for any help!
chantal
Re: Efficient document spooling and indexing
Data may not be committed to disk, buffers flushed, files closed, etc. until IndexWriter.close() is called, but file IO does happen before then. So I would expect the answer to your question to be no.

--
Ian.
[EMAIL PROTECTED]

Otis Gospodnetic wrote:

Hello,

This is from a thread from about 2 weeks ago. What is the answer to this question? If data is written to disk only when IndexWriter's close() is called, wouldn't the sample code below be as efficient as the sample code that uses RAMDirectory, further down?

Thanks,
Otis

When using the FSWriter, the actual file IO doesn't occur until I close the writer, right? So wouldn't it be just as efficient to do the following:

IndexWriter fsWriter = new IndexWriter(new File(...), analyzer, false);
while (... more docs to index ...) {
  ... add 100,000 docs to fsWriter ...
}
fsWriter.optimize();
fsWriter.close();

-----Original Message-----
From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 02, 2001 10:47 AM
To: 'Lucene Users List'
Subject: RE: Indexing problem

Well, I don't know if there's an archive of the list, so this is what Doug wrote:

A more efficient and slightly more complex approach would be to build large indexes in RAM, and copy them to disk with IndexWriter.addIndexes:

IndexWriter fsWriter = new IndexWriter(new File(...), analyzer, true);
while (... more docs to index ...) {
  RAMDirectory ramDir = new RAMDirectory();
  IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
  ... add 100,000 docs to ramWriter ...
  ramWriter.optimize();
  ramWriter.close();
  fsWriter.addIndexes(new Directory[] { ramDir });
}
fsWriter.optimize();
fsWriter.close();

Scott
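Ian's point that file IO happens before close() can be seen with any buffered stream, Lucene aside: once a write exceeds the in-memory buffer, bytes reach the disk while the stream is still open. A stand-alone java.io illustration (an analogy only; Lucene's own buffering is separate from this):

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class FlushBeforeClose {
    // Write 'total' bytes through a buffered stream and report how many
    // are already on disk while the stream is still open.
    static long writtenBeforeClose(File f, int total, int bufSize) throws IOException {
        BufferedOutputStream out =
            new BufferedOutputStream(new FileOutputStream(f), bufSize);
        out.write(new byte[total]);  // far exceeds the buffer, so it is flushed
        long onDisk = f.length();    // measured before close()
        out.close();
        return onDisk;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("lucene-demo", ".bin");
        f.deleteOnExit();
        // 100K through an 8K buffer: all bytes hit the file before close().
        System.out.println(writtenBeforeClose(f, 100 * 1024, 8192)); // 102400
    }
}
```

The same applies to an on-disk index: segment files are written as documents are added and merged, which is why batching in a RAMDirectory and adding whole sub-indexes at once can still win.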
Re: Database Example
I'm new to Lucene; does anyone have any good samples or links to tutorial sites?

You can find a tutorial at http://www.darksleep.com/puff/lucene/lucene.html

I'm particularly interested in the ability to index a database.

Someone asked a similar question a week or so back and I posted the reply and sample code below. Lucene works on Documents that you fill with data that you get from anywhere you like. To index data in a database table you could use something like:

Analyzer analyzer = new StandardAnalyzer();
IndexWriter writer = new IndexWriter("dbindex", analyzer, true);
Connection conn = getConnection();
String sql = "select id, firstname, lastname from people";
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(sql);
while (rs.next()) {
  Document d = new Document();
  d.add(Field.Text("id", rs.getString("id")));
  d.add(Field.UnStored("firstname", rs.getString("firstname")));
  d.add(Field.UnStored("lastname", rs.getString("lastname")));
  writer.addDocument(d);
}
writer.close();

The id field is indexed and stored since we will want to extract it from Lucene. The name fields are not stored since they are already stored in the database, although you could store them if you wanted to avoid having to go back to the database for display.

To search and display results, with details coming from the database, something along the lines of:

Searcher searcher = new IndexSearcher(IndexReader.open("dbindex"));
Query query = QueryParser.parse("duncan", "firstname", analyzer);
Hits hits = searcher.search(query);
String sql = "select * from people where id = ?";
PreparedStatement pstmt = conn.prepareStatement(sql);
for (int i = 0; i < hits.length(); i++) {
  String id = hits.doc(i).get("id");
  pstmt.setString(1, id);
  displayResults(pstmt);
}

Hope this helps.

--
Ian.
[EMAIL PROTECTED]
Re: Adding New Fields to Document
Are you adding the document to the index once you've added the Field to it? Do you see the authorname if you search on some other field in the Document? I don't know if there are limits or not, but if there are, I suspect they are high enough that you and I are unlikely to hit them.

--
Ian.
[EMAIL PROTECTED]

Vijay Jagannathan wrote:

Hello All,

I am trying to add a new field to the Document as below:

doc.add(new Field("authorname", authorname, false, true, false));

When I do the search after rebuilding the index, I am not getting any results if I search on this field. Is there a size limit, or a limit on the number of fields a Document can have? Any guidance is much appreciated.

Thanks.
Vijay