Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Paul Elschot
On Wednesday 25 January 2006 20:51, Peter Keegan wrote: > The index is non-compound format and optimized. Yes, I did try > MMapDirectory, but the index is too big - 3.5 GB (1.3GB is term vectors) > > Peter > You could also give this a try: http://issues.apache.org/jira/browse/LUCENE-283 Regards

Getting the document number (with IndexReader)

2006-01-26 Thread Chun Wei Ho
I am attempting to prune an index by getting each document in turn and then checking/deleting it: IndexReader ir = IndexReader.open(path); for(int i=0;i

Re: how to select top categories.

2006-01-26 Thread Paul Elschot
On Wednesday 25 January 2006 22:24, Chris Hostetter wrote: > > : for this site, but would you cache all manufacturers and intersect all with > : the initial query in one page load? Seems like that would be a lot. > > Yep it is a lot, but if you've got the RAM, it's not that time intensive. > At CNE

Re: Getting the document number (with IndexReader)

2006-01-26 Thread Paul Elschot
On Thursday 26 January 2006 09:15, Chun Wei Ho wrote: > I am attempting to prune an index by getting each document in turn and > then checking/deleting it: > > IndexReader ir = IndexReader.open(path); > for(int i=0;i Document doc = ir.document(i); > if(thisDocShouldBeDeleted(doc)) { >

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Ray Tsang
Speaking of NioFSDirectory, I thought there was one posted a while ago, is this something that can be used? http://issues.apache.org/jira/browse/LUCENE-414 ray, On 11/22/05, Doug Cutting <[EMAIL PROTECTED]> wrote: > Jay Booth wrote: > > I had a similar problem with threading, the problem turned o

Re: Getting the document number (with IndexReader)

2006-01-26 Thread Chun Wei Ho
Hi, Thanks for the help, just a few more questions: On 1/26/06, Paul Elschot <[EMAIL PROTECTED]> wrote: > On Thursday 26 January 2006 09:15, Chun Wei Ho wrote: > > I am attempting to prune an index by getting each document in turn and > > then checking/deleting it: > > > > IndexReader ir = IndexR

encoding

2006-01-26 Thread arnaudbuffet
Hello, I have a problem with data I try to index with Lucene. I browse a directory and index text from different types of files through parsers. For text files, data could be in different languages and therefore in different encodings. If data are in Turkish, for example, all special characters and accents are no

Re: encoding

2006-01-26 Thread John Haxby
arnaudbuffet wrote: For text files, data could be in different languages and therefore in different encodings. If data are in Turkish, for example, all special characters and accents are not recognized in my Lucene index. Is there a way to resolve this problem? How do I work with the encoding? I've been looking
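A minimal sketch of the symptom described in this thread, using only the JDK and no Lucene classes: the same bytes decoded with the wrong charset produce garbled text before any analyzer ever sees it. The class name `CharsetDemo` and the choice of UTF-8 versus ISO-8859-1 are assumptions made for illustration; the poster's files may use other encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public final class CharsetDemo {
    private CharsetDemo() {}

    /** Decodes raw file bytes with an explicitly chosen charset. */
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String original = "d\u00fczenledi\u011fimiz kampanyam\u0131za";
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        // Correct charset: the text round-trips unchanged.
        System.out.println(decode(raw, StandardCharsets.UTF_8).equals(original));
        // Wrong charset: every byte becomes one character, so accents turn into junk.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).equals(original));
    }
}
```

When reading files for indexing, the same idea applies to `new InputStreamReader(stream, charset)`: the charset has to be known or detected per file, since the bytes alone do not say which encoding was used.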

Range number queries

2006-01-26 Thread Mike Streeton
For the recent questions about this here are a couple of methods for encoding/decoding long values that will be sorted into order by a range query public static String encodeLong(long num) { String hex = Long.toHexString(num < 0 ? Long.MAX_VALUE - (0xL ^ num) : num);
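A hedged sketch of the general idea, not the method posted above (whose code is cut off in this listing): range queries compare terms as strings, so a long has to be encoded so that string order matches numeric order. One common way, assumed here, is to flip the sign bit and zero-pad to 16 hex digits. The class name `SortableLong` is hypothetical.

```java
public final class SortableLong {
    private SortableLong() {}

    /** Encodes a long as 16 lowercase hex digits whose string order equals numeric order. */
    public static String encode(long num) {
        // Flipping the sign bit moves negative values below positive ones in unsigned order.
        String hex = Long.toHexString(num ^ Long.MIN_VALUE);
        StringBuilder sb = new StringBuilder(16);
        for (int i = hex.length(); i < 16; i++) {
            sb.append('0'); // fixed width, so shorter numbers do not sort out of place
        }
        return sb.append(hex).toString();
    }

    /** Reverses encode(). */
    public static long decode(String encoded) {
        return Long.parseUnsignedLong(encoded, 16) ^ Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        long[] samples = {Long.MIN_VALUE, -42L, -1L, 0L, 1L, 42L, Long.MAX_VALUE};
        for (long value : samples) {
            System.out.println(encode(value) + " <- " + value);
        }
    }
}
```

The encoded strings would be indexed as untokenized terms; the fixed width costs a few bytes per value but keeps the comparison purely lexicographic.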

Re: Highlighter

2006-01-26 Thread msftblows
Yes, that is correct...you need to rewrite the query. I was actually the main developer for the 1.5 .NET port, so if you come across any issues, please email me at my hotmail address which I check more often than this one... -Joe Langley -Original Message- From: Gwyn Carwardine <[EMAI

Re: SoundEx

2006-01-26 Thread msftblows
You can also look at Phonetix which has many implementations of this... -Original Message- From: Erik Hatcher <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wed, 18 Jan 2006 05:41:30 -0500 Subject: Re: SoundEx On Jan 18, 2006, at 4:20 AM, Christian Reuschling wrote: > yes,

Performance tips?

2006-01-26 Thread Daniel Pfeifer
Hi, Got more questions regarding Lucene and this time it's about performance ;-) We currently are using RAMDirectories to read our indexes. This has now become a problem since our index has grown to approx. 5GB of RAM and the machine we are running on only has 12GB of RAM and every time we refr

RE : encoding

2006-01-26 Thread arnaudbuffet
Hello and thanks for your answer. I do not find the ISOLatin1AccentFilter class in my Lucene jar, but I found one on Google, attached to this mail; could you tell me if it is the right one? I do not see anything in this class which can help me. This program will replace some accent characters but my

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Paul, I tried this but it ran out of memory trying to read the 500Mb .fdt file. I tried various values for MAX_BBUF, but it still ran out of memory (I'm using -Xmx1600M, which is the jvm's maximum value (v1.5)) I'll give NioFSDirectory a try. Thanks, Peter On 1/26/06, Paul Elschot <[EMAIL PROT

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Ray, The throughput is worse with NioFSDirectory than with the FSDirectory (patched and unpatched). The bottleneck still seems to be synchronization, this time in NioFile.getChannel (7 of the 8 threads were blocked there during one snapshot). I tried this with 4 and 8 channels. The throughput wi

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Yonik Seeley
Hmmm, can you run the 64 bit version of Windows (and hence a 64 bit JVM?) We're running with heap sizes up to 8GB (RH Linux 64 bit, Opterons, Sun Java 1.5) -Yonik On 1/26/06, Peter Keegan <[EMAIL PROTECTED]> wrote: > Paul, > > I tried this but it ran out of memory trying to read the 500Mb .fdt fi

Re: RE : encoding

2006-01-26 Thread Erik Hatcher
On Jan 26, 2006, at 7:26 PM, arnaudbuffet wrote: I do not find the ISOLatin1AccentFilter class in my lucene jar, but I find one on google attach to this mail, could you tell me if it is the good one? This used to be in contrib/analyzers but has been moved into the core (Subversion only fo

Re: encoding

2006-01-26 Thread John Haxby
arnaudbuffet wrote: if I try to index a text file encoded in Western 1252, for example with the Turkish text "düzenlediğimiz kampanyamıza", the Lucene index will contain re-encoded data with �k�� ISOLatin1AccentFilter.removeAccents() converts that string to "duzenlediğimiz kampanyamıza"
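For comparison, a small sketch of a different technique from ISOLatin1AccentFilter: Unicode decomposition with the JDK's `java.text.Normalizer`, followed by dropping combining marks. The class name `AccentStripper` is hypothetical. This handles "ğ" as well as "ü", but the dotless "ı" has no decomposition and stays as it is, so a Turkish-aware mapping would still be needed for that letter.

```java
import java.text.Normalizer;

public final class AccentStripper {
    private AccentStripper() {}

    /** Decomposes accented letters (NFD) and removes the combining marks. */
    public static String strip(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches the combining marks split off by the decomposition.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Escapes are used so the example does not depend on the source file encoding.
        String turkish = "d\u00fczenledi\u011fimiz kampanyam\u0131za";
        System.out.println(strip(turkish));
    }
}
```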

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
I'd love to try this, but I'm not aware of any 64-bit jvms for Windows on Intel. If you know of any, please let me know. Linux may be an option, too. btw, I'm getting a sustained rate of 135 queries/sec with 4 threads, which is pretty impressive. Another way around the concurrency limit is to run

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Yonik Seeley
BEA Jrockit supports both AMD64 and Intel's EM64T (basically renamed AMD64) http://www.bea.com/framework.jsp?CNT=index.htm&FP=/content/products/jrockit/ and Sun's Java 1.5 for "Windows AMD64 Platform" They advertise AMD64, presumably because that's what their servers use, but it should work on Int

Re: Getting the document number (with IndexReader)

2006-01-26 Thread Chris Hostetter
: > The document number is the variable i in this case. : If the document number is the variable i (enumerated from numDocs()), : what's the difference between numDocs() and maxDoc() in this case? I : was previously under the impression that the internal docNum might be : different to the counter.
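A toy model of the distinction being discussed, written without Lucene so it can run on its own. `ToyReader` is a hypothetical stand-in whose method names mirror IndexReader's; it assumes the usual behaviour that deleting a document only marks it, so maxDoc() stays put while numDocs() shrinks, and that a pruning loop should therefore run to maxDoc() and skip deleted slots.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

public final class ToyReader {
    private final List<String> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    public ToyReader(List<String> contents) {
        docs.addAll(contents);
    }

    /** One greater than the largest document number, deleted or not. */
    public int maxDoc() { return docs.size(); }

    /** Number of documents that are not marked deleted. */
    public int numDocs() { return docs.size() - deleted.cardinality(); }

    public boolean isDeleted(int n) { return deleted.get(n); }

    public String document(int n) {
        if (deleted.get(n)) throw new IllegalArgumentException("document " + n + " is deleted");
        return docs.get(n);
    }

    /** Marks a document deleted; the numbers of the other documents do not move. */
    public void deleteDocument(int n) { deleted.set(n); }

    /** Prunes every live document containing the given text; returns how many were removed. */
    public static int prune(ToyReader reader, String unwanted) {
        int removed = 0;
        for (int i = 0; i < reader.maxDoc(); i++) {   // maxDoc(), not numDocs()
            if (reader.isDeleted(i)) continue;        // skip holes left by earlier deletes
            if (reader.document(i).contains(unwanted)) {
                reader.deleteDocument(i);
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        ToyReader reader = new ToyReader(Arrays.asList("a spam", "b", "c spam", "d"));
        int removed = prune(reader, "spam");
        System.out.println(removed + " removed, numDocs=" + reader.numDocs()
                + ", maxDoc=" + reader.maxDoc());
    }
}
```

Looping to numDocs() instead would stop early once anything has been deleted, which is why the two methods matter for this kind of pruning.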

Re: Getting the document number (with IndexReader)

2006-01-26 Thread Paul Elschot
On Thursday 26 January 2006 09:47, Chun Wei Ho wrote: > Hi, > > Thanks for the help, just a few more questions: > > On 1/26/06, Paul Elschot <[EMAIL PROTECTED]> wrote: > > On Thursday 26 January 2006 09:15, Chun Wei Ho wrote: > > > I am attempting to prune an index by getting each document in tur

Re: Getting the document number (with IndexReader)

2006-01-26 Thread Paul Elschot
On Thursday 26 January 2006 19:44, Chris Hostetter wrote: > > : > The document number is the variable i in this case. > : If the document number is the variable i (enumerated from numDocs()), > : what's the difference between numDocs() and maxDoc() in this case? I > : was previously under the impr

Re: encoding

2006-01-26 Thread petite_abeille
Hello, On Jan 26, 2006, at 12:01, John Haxby wrote: I have a perl script here that I used to generate downgrading table for a C program. I can let you have the perl script as is, but if there's enough interest(*) I'll use it to generate, say, CompoundAsciiFilter since it converts compound cha

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Doug Cutting
Peter Keegan wrote: The throughput is worse with NioFSDirectory than with the FSDirectory (patched and unpatched). The bottleneck still seems to be synchronization, this time in NioFile.getChannel (7 of the 8 threads were blocked there during one snapshot). I tried this with 4 and 8 channels.
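To illustrate the kind of contention described here, a hedged sketch using plain JDK NIO rather than any Lucene or patch code: a positional `FileChannel.read(ByteBuffer, long)` does not touch the channel's shared position, so callers need no lock around a seek-then-read pair. The class and method names are hypothetical, and whether the JVM or operating system serializes such reads internally is platform dependent.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public final class PositionalReadDemo {
    private PositionalReadDemo() {}

    /** Reads exactly length bytes at offset without changing the channel's position. */
    public static byte[] readAt(FileChannel channel, long offset, int length) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        long position = offset;
        while (buffer.hasRemaining()) {
            int n = channel.read(buffer, position); // positional read, no shared seek
            if (n < 0) throw new EOFException("end of file at " + position);
            position += n;
        }
        return buffer.array();
    }

    /** Writes a temp file, reads a slice back, and reports whether it matches. */
    public static boolean demo() {
        try {
            Path file = Files.createTempFile("positional-read", ".bin");
            try {
                byte[] data = new byte[4096];
                for (int i = 0; i < data.length; i++) data[i] = (byte) i;
                Files.write(file, data);
                try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                    byte[] slice = readAt(channel, 1000, 16);
                    return Arrays.equals(slice, Arrays.copyOfRange(data, 1000, 1016));
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("slice matches: " + demo());
    }
}
```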

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Doug Cutting
Doug Cutting wrote: A 64-bit JVM with NioDirectory would really be optimal for this. Oops. I meant MMapDirectory, not NioDirectory. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PRO

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm now getting 250 queries/sec and excellent cpu utilization (equal concurrency on all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm. I wasn't aware of it. Thanks all very much. Peter On 1/26/06, Doug Cutting <[EMAIL PROTEC
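A rough sketch of the mechanism behind this result, using only JDK classes and not MMapDirectory itself: once a file is memory-mapped, each thread can read through its own duplicate of the buffer with no lock and no shared file position. The names below are hypothetical. A single mapping is limited to 2 GB, and a 32-bit process has too little address space for a multi-gigabyte index, which fits the earlier report that the index was too big to map.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public final class MmapReadDemo {
    private MmapReadDemo() {}

    /** Maps a whole file read-only; the mapping stays valid after the channel closes. */
    public static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /** Sums the unsigned bytes in [from, to) through a private view of the buffer. */
    public static long sum(ByteBuffer shared, int from, int to) {
        ByteBuffer view = shared.duplicate(); // own position and limit, same memory
        view.position(from);
        long total = 0;
        while (view.position() < to) {
            total += view.get() & 0xff;
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("mmap-demo", ".bin");
        file.toFile().deleteOnExit();
        byte[] data = new byte[1 << 20];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        Files.write(file, data);

        MappedByteBuffer mapped = map(file);
        int threads = 4;
        int chunk = data.length / threads;
        long[] partial = new long[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            // Each worker scans its own region; no synchronization is needed for reads.
            workers[t] = new Thread(() -> partial[id] = sum(mapped, id * chunk, (id + 1) * chunk));
            workers[t].start();
        }
        long total = 0;
        for (int t = 0; t < threads; t++) {
            workers[t].join();
            total += partial[t];
        }
        System.out.println("sum = " + total);
    }
}
```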

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Correction: make that 285 qps :) On 1/26/06, Peter Keegan <[EMAIL PROTECTED]> wrote: > > I tried the AMD64-bit JVM from Sun and with MMapDirectory and I'm now > getting 250 queries/sec and excellent cpu utilization (equal concurrency on > all cpus)!! Yonik, thanks for the pointer to the 64-bit jvm

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Yonik Seeley
Nice speedup! The extra registers in 64 bit mode may have helped a little too. -Yonik On 1/26/06, Peter Keegan <[EMAIL PROTECTED]> wrote: > Correction: make that 285 qps :)

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Dumb question: does the 64-bit compiler (javac) generate different code than the 32-bit version, or is it just the jvm that matters? My reported speedups were solely from using the 64-bit jvm with jar files from the 32-bit compiler. Peter On 1/26/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: > > Ni

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Yonik Seeley
There is no difference in bytecode... the whole difference is just in the underlying JVM. -Yonik On 1/26/06, Peter Keegan <[EMAIL PROTECTED]> wrote: > Dumb question: does the 64-bit compiler (javac) generate different code than > the 32-bit version, or is it just the jvm that matters? My reported
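Since the same jar runs on either JVM, a quick way to confirm which one is actually in use is to print a few system properties. This is a small sketch: `os.arch` is a standard property, while `sun.arch.data.model` is specific to Sun/HotSpot-style JVMs and is treated as optional here. The class name `JvmInfo` is hypothetical.

```java
public final class JvmInfo {
    private JvmInfo() {}

    /** Summarizes the architecture and heap limit of the running JVM. */
    public static String describe() {
        String arch = System.getProperty("os.arch");
        // Not a standard property; falls back to "unknown" on JVMs that do not set it.
        String model = System.getProperty("sun.arch.data.model", "unknown");
        long maxHeapMb = Runtime.getRuntime().maxMemory() >> 20;
        return "os.arch=" + arch + ", data model=" + model + " bit, max heap=" + maxHeapMb + " MB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```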

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Ray Tsang
Peter, Wow, the speedup is impressive! But may I ask what you did to achieve 135 queries/sec prior to the JVM switch? ray, On 1/27/06, Peter Keegan <[EMAIL PROTECTED]> wrote: > Correction: make that 285 qps :) > > On 1/26/06, Peter Keegan <[EMAIL PROTECTED]> wrote: > > > > I tried the AMD64-b

Re: Getting the document number (with IndexReader)

2006-01-26 Thread Chun Wei Ho
Thanks for the info :) One last related question. If I delete documents using a IndexReader(), can I assume that the internal document numbers of other undeleted documents (obtained using the same IndexReader instance) will not change until I call IndexReader.close()?

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Ray, The short answer is that you can make Lucene blazingly fast by using advice and design principles mentioned in this forum and of course reading 'Lucene in Action'. For example, use a 'content' field for searching all fields (vs multi-field search), put all your stored data in one field, under

problem updating a document: no segments file?

2006-01-26 Thread John Powers
Hello, I have a couple of instances of Lucene. I just altered one implementation and now it's not keeping a segments file. While indexing occurs, there is a segments file, but once it's done, there isn't. All the other indexes have one. The problem comes when I try to update a document, it

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Ray Tsang
Paul, Thanks for the advice! But for the 100+ queries/sec on a 32-bit platform, did you end up applying other patches? Or use different FSDirectory implementations? Thanks! ray, On 1/27/06, Peter Keegan <[EMAIL PROTECTED]> wrote: > Ray, > > The short answer is that you can make Lucene blazingly

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Peter Keegan
Ray, The 135 qps rate was using the standard FSDirectory in 1.9. Peter On 1/26/06, Ray Tsang <[EMAIL PROTECTED]> wrote: > > Paul, > > Thanks for the advice! But for the 100+queries/sec on a 32-bit > platfrom, did you end up applying other patches? or use different > FSDirectory implementations?

Re: Two strange things in Lucene

2006-01-26 Thread Daniel Pfeifer
>> Since I didn't find anything in the log from log4j I did a "kill >> -3" on >> > the process and found two very interesting things: >> >> Almost all multisearcher threads were in this state: >> >> "MultiSearcher thread #1" daemon prio=10 tid=0x01900960 >> nid=0x81442c waiting for moni

How does the lucene normalize the score?

2006-01-26 Thread xing jiang
Hi, I want to know how Lucene normalizes the score. I see the Hits class has a function to get each document's score. But I don't know how Lucene calculates the normalized score, and "Lucene in Action" only says it is the normalized score of the nth top-scoring document. -- Regards Jiang Xing

Re: Performance tips?

2006-01-26 Thread Chris Lamprecht
I seem to say this a lot :), but, assuming your OS has a decent filesystem cache, try reducing your JVM heapsize, using an FSDirectory instead of RAMDirectory, and see if your filesystem cache does ok. If you have 12GB, then you should have enough RAM to hold both the old and new indexes during th