Re: Indexing very large sets (10 million docs)

2003-07-28 Thread Peter Becker
Roger Ford wrote: [...index size troubles...] Believe it or not, these 10 million documents were meant to be a single partition of a much larger dataset. I'm not sure I'm at liberty to discuss in detail the data I'm indexing - but it's a massive genealogical database. Roger, maybe your data type is

RE: Indexing very large sets (10 million docs)

2003-07-28 Thread Armbrust, Daniel C.
Execute 'ulimit -f' to see what your current limit is... And then change it appropriately after reading the man pages. My Red Hat machines come up with an unlimited file size limit. I don't know what the real limit of an "unlimited" limit is - but I haven't found it yet. Dan -

Re: Indexing very large sets (10 million docs)

2003-07-28 Thread Doug Cutting
Ryan Clifton wrote: You seem to be implying that it is possible to optimize very large indexes. My index has a couple million records, but more importantly it's about 40 gigs in size. I have tried many times to optimize it and this always results in hitting the Linux file size limit. Is there a

RE: Indexing very large sets (10 million docs)

2003-07-28 Thread Ryan Clifton
Doug, You seem to be implying that it is possible to optimize very large indexes. My index has a couple million records, but more importantly it's about 40 gigs in size. I have tried many times to optimize it and this always results in hitting the Linux file size limit. Is there a way to ge

Re: Indexing very large sets (10 million docs)

2003-07-28 Thread Doug Cutting
Armbrust, Daniel C. wrote: If you set your mergeFactor back down to something closer to the default (10) - you probably wouldn't have any problems with file handles. The higher you make it, the more open files you will have. When I set it at 90 for performance reasons, I would run out of file han

RE: Different Analyzer for each Field

2003-07-28 Thread Gregor Heinrich
Hi Claude, one solution is to make the tokenStream method in the Analyzer subclass listen to the field name. Example: public TokenStream tokenStream(String fieldName, Reader reader) { TokenStream result = new StandardTokenizer(reader); result = new StandardFilter(result); ...
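A fuller sketch of this approach, assuming the Lucene 1.x Analyzer API (the field names and the choice of tokenizers below are illustrative, not from the original mail):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LetterTokenizer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Field-aware analyzer: tokenStream() picks a different token chain
    // depending on which field is being analyzed. Field names are hypothetical.
    public class FieldAwareAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            if ("docNumber".equals(fieldName)) {
                // Keep numbers intact: split on whitespace only, no letter filtering.
                return new WhitespaceTokenizer(reader);
            }
            // Default chain for free-text fields.
            TokenStream result = new LetterTokenizer(reader);
            result = new LowerCaseFilter(result);
            return result;
        }
    }

The same analyzer instance can then be handed to both the IndexWriter and the query parser, so each field is treated consistently at index and search time.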

RE: Indexing very large sets (10 million docs)

2003-07-28 Thread Armbrust, Daniel C.
I would say that something definitely went wrong to make your index that big that early - now that I see you are only storing one field. Even if you partition your indexes at 2.5 instead of 10 million documents (which you probably don't need to do), I would still recommend that you lower your mergeF

RE : Indexing very large sets (10 million docs)

2003-07-28 Thread Martin Sevigny
Roger, > Given that on my previous 16GB partition it managed 1.5 million rows before failing, it looks like disk space requirements grow exponentially with the number of documents indexed. Can anyone comment whether this should be true? Exponentially? Would be surprising. When you add docu

Re: Indexing very large sets (10 million docs)

2003-07-28 Thread Roger Ford
Lichtner, Guglielmo wrote: That's 46 hits/s. That's not bad, actually. It's not the time I'm worried about, so much as the disk consumption. It's just failed optimizing 3 million documents with "No space left on device". That's 100GB it's used! Given that on my previous 16GB partition it managed 1.
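A back-of-envelope check on the figures quoted in this thread: 16 GB for 1.5 million documents is roughly 11 KB per document, while 100 GB for 3 million documents is roughly 33 KB per document. Part of that gap is likely transient rather than exponential growth: as noted below, an optimize needs on the order of double the final index size free on disk while it copies segments, so peak usage during optimization overstates the steady-state index size.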

RE: Indexing very large sets (10 million docs)

2003-07-28 Thread Armbrust, Daniel C.
Oh, and you may be short on diskspace. You must have double the end size of your index available as free diskspace to call optimize - you may get by with less diskspace if you just do a single merge, never calling optimize - but I'm not sure about this. Our index of 15 million docume

RE : Indexing very large sets (10 million docs)

2003-07-28 Thread Martin Sevigny
Roger, Just to double-check... > Each document is typically only around 2K in size. Each field is free-text indexed, but only the "key" field is stored. > After experimenting, I've set > Java memory to 750MB > writer.mergeFactor = 1 > and run an optimize every 50,000 documents

RE: Indexing very large sets (10 million docs)

2003-07-28 Thread Armbrust, Daniel C.
We are currently doing something similar here. We have upwards of 15 million documents in our index. There has been a lot of discussion on this in the past... But I'll give a few details: My current technique for indexing very large amounts of data is to: Set the merge factor to 90. Leave the m
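For reference, a minimal sketch of this kind of tuning with the Lucene 1.x IndexWriter (the index path and the add-documents step are placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    // Sketch of a tuned bulk-index build. mergeFactor trades indexing speed
    // against the number of segment files (and file handles) held open at once.
    public class TunedIndexer {
        public static void main(String[] args) throws Exception {
            IndexWriter writer =
                new IndexWriter("/data/index", new StandardAnalyzer(), true);
            writer.mergeFactor = 90;  // fast bulk indexing, many open files
            // ... add documents here ...
            writer.optimize();        // needs ~2x the index size free on disk
            writer.close();
        }
    }

Dropping mergeFactor back toward the default of 10 is the usual fix when the process runs out of file handles, as discussed above.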

RE: Indexing very large sets (10 million docs)

2003-07-28 Thread Lichtner, Guglielmo
That's 46 hits/s. That's not bad, actually. It's an interesting problem. It certainly seems that when indexing such a large number of documents, the indexing should be parallel. So far I have assumed that Lucene is not able to use multiple threads to speed up the indexing run. If it did, I guess it
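A single IndexWriter won't parallelize the work by itself, but one common pattern is to build several sub-indexes concurrently and merge them at the end. A sketch, assuming the Lucene 1.x addIndexes(Directory[]) API (the paths and partition count are illustrative):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Build one sub-index per thread, each in its own directory,
    // then merge them into the final index.
    public class ParallelIndexer {
        public static void main(String[] args) throws Exception {
            final String[] parts = { "/data/part0", "/data/part1" };
            Thread[] workers = new Thread[parts.length];
            for (int i = 0; i < parts.length; i++) {
                final String path = parts[i];
                workers[i] = new Thread() {
                    public void run() {
                        try {
                            IndexWriter w = new IndexWriter(
                                path, new StandardAnalyzer(), true);
                            // ... add this partition's documents ...
                            w.close();
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                };
                workers[i].start();
            }
            for (int i = 0; i < workers.length; i++) workers[i].join();

            // addIndexes() merges (and optimizes) the sub-indexes.
            IndexWriter merged = new IndexWriter(
                "/data/final", new StandardAnalyzer(), true);
            Directory[] dirs = new Directory[parts.length];
            for (int i = 0; i < parts.length; i++)
                dirs[i] = FSDirectory.getDirectory(parts[i], false);
            merged.addIndexes(dirs);
            merged.close();
        }
    }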

Indexing very large sets (10 million docs)

2003-07-28 Thread Roger Ford
I'm trying to index 10 million small XML-like documents, extracted from an Oracle database. Lucene version is 1.2, and I'm using RedHat 7.0 Advanced Server, on an AMD XP1800+ with 1GB RAM and 46GB+120GB hard disks. The database is on a separate machine, connected by thin JDBC. Each document consist
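For context, a minimal sketch of this kind of database-to-Lucene pipeline (the JDBC URL, table, column, and field names are hypothetical; Lucene 1.x field API assumed):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Stream rows from the database and index each one. Storing only the
    // key (Field.Keyword) and indexing the body without storing it
    // (Field.UnStored) keeps the index as small as possible.
    public class DbIndexer {
        public static void main(String[] args) throws Exception {
            Class.forName("oracle.jdbc.driver.OracleDriver");
            Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "pass");
            IndexWriter writer =
                new IndexWriter("/data/index", new StandardAnalyzer(), true);
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT key, body FROM docs");
            while (rs.next()) {
                Document doc = new Document();
                doc.add(Field.Keyword("key", rs.getString(1)));   // stored + indexed
                doc.add(Field.UnStored("body", rs.getString(2))); // indexed only
                writer.addDocument(doc);
            }
            writer.close();
            conn.close();
        }
    }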

Re: Different Analyzer for each Field

2003-07-28 Thread Erik Hatcher
On Monday, July 28, 2003, at 01:32 AM, Claude Libois wrote: My question is in the title: how can I use a different Analyzer for each field of a Document object? My problem is that if I use LetterTokenizer for a field which contains a String representation of a number, afterwards I can't delete it.

Re: Different Analyzer for each Field

2003-07-28 Thread Erik Hatcher
On Monday, July 28, 2003, at 03:12 AM, Kelvin Tan wrote: AFAIK, there is a one-one mapping between an index and an analyzer. Not true. The Analyzer base class has a method tokenStream that accepts the field name. None of the built-in analyzers use the field name to do anything different based

RE: How can I index JSP files?

2003-07-28 Thread Pitre, Russell
I think this may be exactly what I'm looking for! Thanx a lot Russ. I'll let you know how it works out... thanx again! -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Monday, July 28, 2003 6:56 AM To: Lucene Users List Subject: Re: How can I index JS

Re: How can I index JSP files?

2003-07-28 Thread Otis Gospodnetic
You could try using a spider such as Spindle. Don't have the URL, but I'm sure you can find it via Google. Spindle uses Lucene. Otis --- "Pitre, Russell" <[EMAIL PROTECTED]> wrote: > Referring to this: http://www.jguru.com/faq/view.jsp?EID=1074516 > "To index the content of JS

Re: Different Analyzer for each Field

2003-07-28 Thread Kelvin Tan
Perhaps one way to do it is to have 2 separate indices for the 2 analyzers. Then, depending on which field you wish to search, you can choose from either index. AFAIK, there is a one-one mapping between an index and an analyzer. Kelvin On Mon, 28 Jul 2003 10:32:21 +0200, Claude Libois said: >My

Different Analyzer for each Field

2003-07-28 Thread Claude Libois
My question is in the title: how can I use a different Analyzer for each field of a Document object? My problem is that if I use LetterTokenizer for a field which contains a String representation of a number, afterwards I can't delete it. Probably because this analyzer threw away my number. So I n
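One way around the deletion problem, sketched below under the assumption of the Lucene 1.x API (the field names and values are illustrative): index the number as an untokenized keyword field, so it bypasses the analyzer entirely and can be matched exactly for deletion.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // A Field.Keyword value is indexed as a single unanalyzed term, so a
    // Term with the identical text will match it for deletion.
    public class KeywordDelete {
        public static void main(String[] args) throws Exception {
            IndexWriter writer =
                new IndexWriter("/data/index", new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.Keyword("id", "12345"));        // not analyzed
            doc.add(Field.Text("contents", "some text")); // analyzed normally
            writer.addDocument(doc);
            writer.close();

            IndexReader reader = IndexReader.open("/data/index");
            reader.delete(new Term("id", "12345")); // exact term match
            reader.close();
        }
    }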