Re: OutOfMemoryError
Hi Ian, hi Winton, hi all,

Sorry, I meant a heap size of 100MB. I'm starting Java with -Xmx100m; I'm not setting -Xms. As far as I know now, I had a bug in my own code, but I still don't understand where these OutOfMemoryErrors came from. I will try to index again in one thread, without RAMDirectory, just to check that the program is sane.

The problem that the files get too big while merging remains. I wonder why there is no way to tell Lucene not to create files that are bigger than the system limit. How am I supposed to know after how many documents this limit is reached? Lucene creates the documents; I only know the average size of a piece of text that is the input for a document. Or am I missing something?!

chantal

On Wednesday, 28 November 2001 20:14, you wrote: Were you using -mx and -ms (setting heap size)? Cheers, Winton

As I run the program on a multi-processor machine, I have now changed the code to index each file in a single thread and write to one single IndexWriter. The merge factor is still at 10, maxMergeDocs at 1,000,000. I set the maximum heap size to 1MB.
Re: io exception on recreating an index
* Steven J. Owens, quoting Kiran Kumar K.G [EMAIL PROTECTED], wrote:
| I'm currently having a problem overwriting an old index. Every
| night, the contents of a database I'm using get updated, so the
| Lucene indexes are also recreated every night. The technique I'm
| currently using is just to start a new index on top of the old one
| (IndexWriter writer = new IndexWriter(filePath, new
| StandardAnalyzer(), true)), but sporadically I get an IOException:
| "couldn't delete _2oil.fdt" or something to that effect.
|
| I ran into this as well; I didn't get it on Solaris, just when I
| tried running it on a win2k laptop (didn't feel like being stuck at my
| desk all the time :-).

Just for the record: I've also had this problem, but only on Windows 2000. It works just fine on Linux.

Geir O.
Re: OutOfMemoryError
Doug sent the message below to the list on 3 November in response to a query about file size limits. There may have been more related stuff on that thread as well. -- Ian.

***

Anyway, is there any way to control how big the indexes grow?

The easiest thing is to set IndexWriter.maxMergeDocs. Since you hit 2GB at 8M docs, set this to 7M. That will keep Lucene from trying to merge an index that won't fit in your filesystem. (It will actually effectively round this down to the next lower power of IndexWriter.mergeFactor. So with the default mergeFactor=10, maxMergeDocs=7M will generate a series of 1M-document indexes, since merging 10 of these would exceed the max.)

Slightly more complex: you could further minimize the number of segments if, when you've added seven million documents, you optimize the index and start a new one. Then use MultiSearcher to search.

Even more complex and optimal: write a version of FSDirectory that, when a file exceeds 2GB, creates a subdirectory and represents the file as a series of files. (I've done this before, and found that, on at least the version of Solaris I was using, the files had to be a few hundred KB smaller than 2GB for programs like 'cp' and 'ftp' to operate correctly on them.)

Doug

Chantal Ackermann wrote: I wonder why there is no possibility to tell Lucene not to create files that are bigger than the system limit. How am I supposed to know after how many documents this limit is reached? ...
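For concreteness, here is a minimal sketch of Doug's first two suggestions against the 1.x-era API. The public maxMergeDocs/mergeFactor fields and the MultiSearcher constructor are as I remember them, so treat the exact signatures as assumptions; the paths are invented:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searcher;

    // Cap segment size so no merged segment outgrows the 2GB file limit.
    IndexWriter writer = new IndexWriter("/data/index-part1", new StandardAnalyzer(), true);
    writer.maxMergeDocs = 7000000;  // public field in the early API
    writer.mergeFactor = 10;        // the default, shown for clarity
    // ... addDocument() up to ~7M docs, then:
    // writer.optimize(); writer.close();
    // and open a fresh IndexWriter on "/data/index-part2".

    // Search all the parts as one logical index:
    Searcher[] parts = {
        new IndexSearcher("/data/index-part1"),
        new IndexSearcher("/data/index-part2"),
    };
    Searcher searcher = new MultiSearcher(parts);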
Re: Indexing other documents type than html and txt
You'd have to write parsers for each of those document types to convert them to text and then index that text. Sure, you can feed it something like XML, but then you may want to consider something like xmldb.org instead.

Otis

--- Antonio Vazquez [EMAIL PROTECTED] wrote: I know that Lucene can index HTML and text documents, but can it index other types of documents like PDF, DOC, and XLS? ...
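The pattern Otis describes, as a minimal sketch: run the file through whatever converter you choose and hand Lucene only the extracted text. The extractedText parameter stands in for the output of a hypothetical PDF/Word/Excel converter; nothing here is a Lucene-provided parser.

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ConvertedDocIndexer {
        // Lucene never sees the PDF/DOC/XLS bytes, only the text you extracted.
        public static void index(IndexWriter writer, File f, String extractedText)
                throws IOException {
            Document doc = new Document();
            doc.add(Field.Keyword("path", f.getPath()));    // stored as-is, usable as a key
            doc.add(Field.Text("contents", extractedText)); // tokenized for full-text search
            writer.addDocument(doc);
        }
    }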
Re: Using Lucene to index a DataBase
Not really. A document in Lucene is a logical entity comprising named fields. So a row in the DB would probably be represented as a Lucene document, and each [searchable] column in the table would be represented as a field of the document. However, you may find it necessary to build pseudo-fields in Lucene that span multiple columns in the database if you want to do complex queries across multiple fields, since in Lucene you typically need to preface your search keywords with the field they are located in (e.g. Field1:SearchKey).

- Original Message - From: Weaver, Scott [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, November 29, 2001 11:17 AM Subject: Using Lucene to index a DataBase

I've used Verity in Cold Fusion to index databases. Is this possible with Lucene? From recent posts, it looks like I would have to write a custom parser to convert each row into a text document. Am I correct in thinking this?

Scott
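A rough sketch of that row-to-document mapping over JDBC. The table and column names are invented, and the Field.UnStored call for the cross-column pseudo-field is an assumption about the era's API:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class DbIndexer {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/data/db-index", new StandardAnalyzer(), true);
            Connection conn = DriverManager.getConnection("jdbc:your-driver-url-here");
            ResultSet rs = conn.createStatement()
                    .executeQuery("SELECT id, title, body FROM articles");
            while (rs.next()) {
                Document doc = new Document();                        // one document per row
                doc.add(Field.Keyword("id", rs.getString("id")));     // untokenized key
                doc.add(Field.Text("title", rs.getString("title"))); // one field per column
                doc.add(Field.Text("body", rs.getString("body")));
                // pseudo-field spanning columns, so one clause can search both:
                doc.add(Field.UnStored("all",
                        rs.getString("title") + " " + rs.getString("body")));
                writer.addDocument(doc);
            }
            writer.optimize();
            writer.close();
            conn.close();
        }
    }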
Indexing other documents type than html and txt
Hi all, I have a question. I know that Lucene can index HTML and text documents, but can it index other types of documents, like PDF, DOC, and XLS files? If it can, how can I implement it? Perhaps it can be implemented like the HTML and txt indexing?

regards, Antonio
Re: OutOfMemoryError
Chantal,

| For what I know now, I had a bug in my own code. still I don't understand where these OutOfMemoryErrors came from. I will try to index again in one thread without RAMDirectory just to check if the program is sane.

Java often has misleading error messages. For example, on Solaris machines the default ulimit used to be 24 - that's 24 open file handles! Yeesh. This will cause an OutOfMemoryError. So don't assume it's actually a memory problem, particularly if a memory problem doesn't make sense. Just a thought.

Steven J. Owens [EMAIL PROTECTED]
Re: Indexing other documents type than html and txt (XML)
I have started to create a set of generic Lucene document types that can be easily manipulated depending on the fields. I know others have generated Documents out of PDF. Is there some place we can add contributed classes to the Lucene web page?

Here is my current version of XMLDocument. It's a bit slow. It uses a path (taken from the Document example) and, based on field name / xpath pairs (key / value) from either an array or a property file, generates an appropriate Lucene document with the specified fields. I have not tested all permutations of Document (I have used the File/Properties one) and it works. Note: it uses the Xalan example ApplyXpath class to evaluate the xpaths. I hope this helps. --Peter

package xxx.lucene.xml;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.DateField;
import java.util.Properties;
import java.util.Enumeration;
import java.io.File;
import java.io.FileInputStream;
// plus the Xalan sample ApplyXpath helper; its import did not survive
// this post intact, so point it at wherever ApplyXpath lives in your tree.

/**
 * A utility for making a Lucene document from an XML source and a set
 * of xpaths, based on the Document example from Lucene.
 */
public class XMLDocument {

    private XMLDocument() {
    }

    /**
     * @param file document to be converted to a Lucene document
     * @param propertyList properties where the key is the field name
     *        and the value is the XML xpath
     * @throws FileNotFoundException
     * @throws Exception
     * @return Lucene document
     */
    public static Document Document(File file, Properties propertyList)
            throws java.io.FileNotFoundException, Exception {
        Document doc = new Document();

        // add path
        doc.add(Field.Text("path", file.getPath()));

        // add date modified
        doc.add(Field.Keyword("modified",
                DateField.timeToString(file.lastModified())));

        // add the field list from the property list
        Enumeration e = propertyList.propertyNames();
        while (e.hasMoreElements()) {
            String key = (String) e.nextElement();
            String xpath = propertyList.getProperty(key);
            // helper from the Xalan ApplyXpath example
            String[] valueArray = ApplyXpath(file.getPath(), xpath);
            StringBuffer value = new StringBuffer();
            for (int i = 0; i < valueArray.length; i++) {
                value.append(valueArray[i]);
            }
            //System.out.println("add key " + key + " with value = " + value);
            filter(key, value);
            doc.add(Field.Text(key, value.toString()));
        }
        return doc;
    }

    /**
     * @param file document to be converted to a Lucene document
     * @param fieldNames field names for the Lucene document
     * @param xpaths XML xpaths for the information you want to get
     * @throws Exception
     * @return Lucene document
     */
    public static Document Document(File file, String[] fieldNames, String[] xpaths)
            throws Exception {
        if (fieldNames.length != xpaths.length) {
            throw new IllegalArgumentException("String arrays are not equal size");
        }
        Properties propertyList = new Properties();
        // generate properties from the arrays
        for (int i = 0; i < fieldNames.length; i++) {
            propertyList.setProperty(fieldNames[i], xpaths[i]);
        }
        return Document(file, propertyList);
    }

    /**
     * @param path path of the document to be converted to a Lucene document
     * @param fieldNames field names for the Lucene document
     * @param xpaths XML xpaths for the information you want to get
     * @throws Exception
     * @return Lucene document
     */
    public static Document Document(String path, String[] fieldNames, String[] xpaths)
            throws Exception {
        File file = new File(path);
        return Document(file, fieldNames, xpaths);
    }

    /**
     * @param path path of the document you want to convert to a Lucene document
     * @param propertyList properties where the key is the field name
     *        and the value is the XML xpath
     * @throws Exception
     * @return Lucene document
     */
    public static Document Document(String path, Properties propertyList)
            throws Exception {
        File file = new File(path);
        return Document(file, propertyList);
    }

    /**
     * @param documentPath path of the
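A hedged usage sketch for the class above (the file name, field names, and xpaths are all invented, and an already-open IndexWriter is assumed):

    // Map two fields to xpaths and index the resulting document.
    String[] fields = { "title", "author" };
    String[] xpaths = { "/book/title", "/book/author" };
    Document doc = XMLDocument.Document(new File("book.xml"), fields, xpaths);
    writer.addDocument(doc);  // writer: an IndexWriter opened elsewhere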
Transactional Indexing
I have noticed that when I kill/interrupt an indexing process, it leaves a lock file, preventing further indexing. This raises a couple of questions:

a. When I simply delete the file and restart the indexing, it seems to work. Is there a risk in doing this?

b. Can indexing be done in a concurrent fashion? For example, allowing multiple uploads of files over the web and doing incremental indexing as they arrive. In this situation, you may have several documents that need to be indexed simultaneously.

Also, I have seen some mention of a mailing list archive... how does one find it or search it? I do not see a reference at apache.org.

Thanks
RE: Parallelising a query...
From: Winton Davies [mailto:[EMAIL PROTECTED]]
> I have 4 million documents... I could split these into 4 x 1 million
> document indexes and then send a query to 4 Lucene processes? At the end
> I would have to sort the results by relevance. Question for Doug or any
> other search engine guru -- would this reduce the time to find these
> results by 75%?

It could, if you have four processors and four disk drives and things work out optimally.

If you have a single machine with multiple processors and/or a disk array, and your CPU or I/O are not already maxed out, then multi-threading is a good way to make searches faster. To implement this I would write something like MultiSearcher, but one that runs each sub-search in a separate thread: a ThreadedMultiSearcher.

If you instead have several machines that you would like to spread search load over, then you could use RMI to send queries to those machines. I would first implement the single-machine version, ThreadedMultiSearcher, then implement a RemoteSearcher class that forwards Searcher methods via RMI to a Searcher object on another machine. Then, to spread load across machines, construct a ThreadedMultiSearcher and populate it with RemoteSearcher instances pointing at the different machines. The Searcher API was designed with this sort of thing in mind.

Note, though, that HitCollector-based searching is not a good candidate for RMI, since it does a callback for every document. Stick to the TopDocs-based search method. You'll also need to forward docFreq(Term) and maxDoc(), used to weight the query before searching, and doc(int), used to fetch hit documents. Probably these should be abstracted into a separate interface, Searchable.

Doug
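No ThreadedMultiSearcher ships with Lucene; Doug is describing something to write. A rough sketch of the idea, assuming the TopDocs-based search(Query, Filter, int) method of the era and leaving score-based merging and doc-id remapping to the caller:

    import java.io.IOException;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;
    import org.apache.lucene.search.TopDocs;

    // Run each sub-search in its own thread; the caller merges the TopDocs
    // (re-sorting ScoreDocs by score and offsetting doc ids per sub-index).
    public class ThreadedMultiSearcher {
        private final Searcher[] searchers;

        public ThreadedMultiSearcher(Searcher[] searchers) {
            this.searchers = searchers;
        }

        public TopDocs[] searchAll(final Query query, final int n)
                throws InterruptedException {
            final TopDocs[] results = new TopDocs[searchers.length];
            Thread[] threads = new Thread[searchers.length];
            for (int i = 0; i < searchers.length; i++) {
                final int idx = i;
                threads[i] = new Thread() {
                    public void run() {
                        try {
                            results[idx] = searchers[idx].search(query, null, n);
                        } catch (IOException e) {
                            // a real implementation would record and rethrow this
                        }
                    }
                };
                threads[i].start();
            }
            for (int i = 0; i < threads.length; i++) {
                threads[i].join();
            }
            return results;
        }
    }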
RE: Transactional Indexing
From: New, Cecil (GEAE) [mailto:[EMAIL PROTECTED]]
> I have noticed that when I kill/interrupt an indexing process, it
> leaves a lock file, preventing further indexing. a. When I simply
> delete the file and restart the indexing, it seems to work. Is there a
> risk in doing this?

No, there is no risk. The index is never inconsistent, so long as only a single process is modifying it. Removing lock files is the standard crash-recovery method for Lucene.

> b. Can indexing be done in a concurrent fashion? For example, allowing
> multiple uploading of files over the web and doing incremental indexing
> as they arrive. In this situation, you may have several documents that
> need to be indexed simultaneously.

Lucene only supports index modification by a single process using a single IndexWriter object. However, the index update code is thread safe, so many threads may use the same IndexWriter instance concurrently to add documents.

Doug
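A minimal sketch of that pattern: one shared IndexWriter, one thread per incoming upload. The path, field name, and the wiring around the threads are invented for illustration.

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class UploadIndexer {
        public static void main(String[] args) throws IOException {
            // One writer for the whole process; false = append to an existing index.
            final IndexWriter writer =
                    new IndexWriter("/data/index", new StandardAnalyzer(), false);

            // One of these per uploaded file, e.g. spawned by an upload handler.
            Runnable job = new Runnable() {
                public void run() {
                    try {
                        Document doc = new Document();
                        doc.add(Field.Text("contents", "...uploaded text..."));
                        writer.addDocument(doc); // thread-safe per the answer above
                    } catch (IOException e) {
                        // log and report; don't let one bad upload kill the writer
                    }
                }
            };
            new Thread(job).start();

            // Eventually, from a single coordinating thread once jobs are done:
            // writer.optimize(); writer.close();
        }
    }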
Re: OutOfMemoryError
I wrote:
> Java often has misleading error messages. For example, on solaris
> machines the default ulimit used to be 24 - that's 24 open file
> handles! Yeesh. This will cause an OutOfMemoryError.

Jeff Trent replied:
> Wow. I did not know that! I also don't see an option to increase that
> limit from java -X. Do you know how to increase that limit?

That was "used to be" - I think it's larger on newer machines. I don't think there's a Java command line option to set this; it's a system limit. The Solaris command to check it is ulimit. To set it for a given login process (assuming sufficient privileges), use ulimit <number> (e.g. ulimit 128). ulimit -a prints out all limits.

Steven J. Owens [EMAIL PROTECTED]
AW: lucene in applet
Hi folks,

I found a solution for using an index in an applet. The index files are stored at a URL. Since you cannot access the disk on the client from an applet, I used the RAMDirectory. Here's my code:

// Variable to store the names of the index files.
Stack stackFileNames = new Stack();

// Open the file with the file names. You need this because you cannot
// search a directory for files at a URL.
URL source = new URL(getCodeBase(), "filenames.txt");
BufferedReader in = new BufferedReader(new InputStreamReader(source.openStream()));
while (true) {
    String s = in.readLine();
    if (s == null) break;
    stackFileNames.push(s);
}
in.close();

// Open the index.
while (!stackFileNames.empty()) {
    String sFileName = (String) stackFileNames.pop();
    // Stream the file into the RAMDirectory.
    org.apache.lucene.store.OutputStream os = ramDir.createFile(sFileName);
    java.io.InputStream is =
            new URL(getCodeBase(), "repository/" + sFileName).openStream();
    byte[] baBuffer = new byte[1024];
    while (true) {
        int iBytesRead = is.read(baBuffer);
        if (iBytesRead == -1) break;
        os.writeBytes(baBuffer, iBytesRead);
    }
    is.close();
    os.close();
}

// IndexReader
IndexReader ir = IndexReader.open(ramDir);

With this I can perform searches in the applet. Works fine.

Have fun,
Christoph Breidert
www.sitewaerts.de

-Original Message- From: Christoph Breidert [mailto:[EMAIL PROTECTED]] Sent: Monday, 26 November 2001 13:09 To: 'Lucene Users List' Subject: RE: lucene in applet

Hi Steven,

thanks for your reply. Here is my scenario:

1) I want to grab the website xy.com to the local disc at C:\xy
2) While exporting, I want to index the content to C:\xy\index
3) I put applet.class at C:\xy\index

If I open that applet from the (file) location C:\xy\index, it can access any file in that directory with a method like this one, which reads the content of sFileName:

private String read(String sFileName) {
    StringBuffer sb = new StringBuffer();
    try {
        URL source = new URL(getCodeBase(), sFileName);
        BufferedReader in = new BufferedReader(new InputStreamReader(source.openStream()));
        while (true) {
            String s = in.readLine();
            if (s == null) break;
            sb.append(s);
        }
        in.close();
    } catch (Exception e) {
        // error handling
    }
    return sb.toString();
}

It seems to me that the security constraints of the sandbox do not prevent me from using an applet to open any file in the directory C:\xy\index. With this framework I can nicely create an applet to search an exported site. It would be nice if Lucene had the functionality to open content from a URL. As in the little method above, there should not be any problem reading the index files from the URL and performing a search with the content of these files?

> if you're retrieving the index file from a URL, why not just run the
> search at that URL?

I cannot run a JVM at C:\xy\index. For this I want to use the browser. This gives me the possibility to entirely grab a site and burn it on a CD, maybe, and the index will still work.

Yours,
Christoph Breidert [EMAIL PROTECTED]

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] Sent: Sunday, 25 November 2001 20:27 To: Lucene Users List Subject: Re: lucene in applet

Christopher,

> I'm working on a site-grabber to which I would like to add offline
> search functionality. I experimented with lucene, and it covers all my
> needs. I'm planning to realize the search functionality with an applet.
> Problem: I cannot access the index created with lucene from my applet.
> The only way to access resources on a remote host (which could be the
> file-system as well) is with a stream.

This sounds about right. It's a question of the applet sandbox, not of Lucene itself. I've done a little applet work in the past; my general advice is that a site-grabber with offline search functionality is not compatible with the constraints an applet must run under. Specifically, applets have no access to the local file system, and applets may only open network connections back to the server they were downloaded from. I strongly suggest you ditch the applet aspect entirely, and maybe look into Sun's Java Web Start, which attempts to give you a good combination of the features of an applet and an application.

> //Something like this
> URL source = new URL(getCodeBase(), path_to_index);
> BufferedReader in = new BufferedReader(new InputStreamReader(source.openStream()));
> ...

From what I tested with Lucene, I found that the only possibility to access an existing index is directly with a
RE: Using Lucene to index a DataBase
I'm pretty sure that's what Verity/Cold Fusion was doing behind the scenes. Thanks.

-Original Message- From: Jeff Trent [mailto:[EMAIL PROTECTED]] Sent: Thursday, November 29, 2001 11:26 AM To: Lucene Users List Subject: Re: Using Lucene to index a DataBase

Not really. A document in Lucene is a logical entity comprising named fields. ...
RE: Parallelising a query...
Hi Doug,

Thank you again for your wisdom. Yep, we'll probably be running on a quad E420 or two. I think I'll get better performance with one virtual searcher spread over 4 CPUs (and 1GB RAM each) rather than 4 monolithic searchers, each with its own index (actually, I don't think I could get 4 on the machine; each index is nearly a gig).

Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/
Re: Parallelising a query...
Hi again,

Another dumb question :) (actually I'm too busy to look at the code :) )

In the index, is the data structure of TermDocs (is that the right term?) sorted by anything, or is it just insertion order? I could see how one might want to sort by the doc with the highest term frequency, but I can also see why it might not help. E.g.:

Token1 -> doc1 (2 occurrences) -> doc2 (6) -> doc3 (3)

or is it like this?

Token1 -> doc2 (6) -> doc3 (3) -> doc1 (2)

I have an idea for an optimization I want to make, but I'm not sure whether it warrants investigation.

Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/
RE: Parallelising a query...
TermDocs are ordered by document number. It would not be easy to change this.

Doug

-Original Message- From: Winton Davies [mailto:[EMAIL PROTECTED]] Sent: Thursday, November 29, 2001 11:12 AM To: Lucene Users List Subject: Re: Parallelising a query...

> In the index, is the data structure of TermDocs sorted by anything, or
> is it just insertion order? ...
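To see that ordering for yourself, a small sketch that walks the postings for one term. The index path, field, and term text are made up; the TermDocs calls are as I recall the 1.x interface:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    public class TermDocsWalk {
        public static void main(String[] args) throws java.io.IOException {
            IndexReader reader = IndexReader.open("/data/index");
            TermDocs td = reader.termDocs(new Term("contents", "lucene"));
            // Entries come back ordered by document number, as Doug says.
            while (td.next()) {
                System.out.println("doc=" + td.doc() + " freq=" + td.freq());
            }
            td.close();
            reader.close();
        }
    }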
GCJ and Lucene ?
Hi,

Another maybe quick question: has anyone tried using GCJ with Lucene? http://www.gnu.org/software/gcc/java/

As far as I can tell, this tries to compile Java directly to native code. I think it is restricted to 1.1 classes, which might be a gotcha (does Lucene use any 1.2 classes?).

Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/
RE: Parallelising a query...
Thanks! (One thing to cross off my list of optimizations...)

Cheers,
Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/