Re: Inconsistent StandardTokenizer behaviour

2005-11-21 Thread Erik Hatcher
On 21 Nov 2005, at 19:39, [EMAIL PROTECTED] wrote: These are the results for the StandardTokenizer (input - output token - output type): 1. 1.2 - 1.2; 2. 1.2. - 1.2; 3. a.b - a.b; 4. a.b. - a.b.; 5. www.apache.org - www.apache.org; 6. www.apac

Re: How does lucene choose a field for sort?

2005-11-21 Thread Yonik Seeley
On 11/21/05, Erik Hatcher <[EMAIL PROTECTED]> wrote: > Neither. It'll throw an exception. Just don't rely on it to throw an exception either though... the checking is not comprehensive. One should treat sorting on a field with more than one value per document as undefined. -Yonik Now hiring --
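For reference, a minimal sketch of the safe pattern implied here, against the 1.4-era API (the index path, field names, and query term are made up for illustration): give every document exactly one untokenized value in the sort field.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TermQuery;

    public class SingleValuedSort {
        // At index time: one untokenized "sequence" value per document,
        // zero-padded so a string sort also orders numerically.
        static void addSequence(Document doc, int seq) {
            String padded = String.valueOf(100000 + seq).substring(1);
            doc.add(Field.Keyword("sequence", padded));
        }

        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/index");
            Sort sort = new Sort(new SortField("sequence", SortField.STRING));
            Hits hits = searcher.search(new TermQuery(new Term("contents", "java")), sort);
            System.out.println(hits.length() + " hits");
            searcher.close();
        }
    }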

Re: Inconsistent StandardTokenizer behaviour

2005-11-21 Thread yahootintin . 11533894
Sorry for the bad looking table. Retrying... input string - output token (output type):
1. 1.2 - 1.2 ()
2. 1.2. - 1.2 ()
3. a.b - a.b ()
4. a.b. - a.b. ()
5. www.apache.org - www.apache.org ()
6. www.apache.org. - www.apache.org. ()
--- java-user@lucene.apache.org wrote: This is the results for
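Those results can be reproduced with a small driver; a sketch against the 1.4-era analysis API (the field name is arbitrary), printing each token's text and type:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class TokenizerCheck {
        public static void main(String[] args) throws Exception {
            String[] inputs = { "1.2", "1.2.", "a.b", "a.b.", "www.apache.org", "www.apache.org." };
            StandardAnalyzer analyzer = new StandardAnalyzer();
            for (int i = 0; i < inputs.length; i++) {
                TokenStream stream = analyzer.tokenStream("f", new StringReader(inputs[i]));
                StringBuffer line = new StringBuffer(inputs[i] + " ->");
                for (Token t = stream.next(); t != null; t = stream.next()) {
                    line.append(" ").append(t.termText()).append(" (").append(t.type()).append(")");
                }
                System.out.println(line);
            }
        }
    }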

Inconsistent StandardTokenizer behaviour

2005-11-21 Thread yahootintin . 11533894
These are the results for the StandardTokenizer (input - output token - output type):
1. 1.2 - 1.2 -
2. 1.2. - 1.2 -
3. a.b - a.b -
4. a.b. - a.b. -
5. www.apache.org - www.apache.org -
6. www.apache.org. - www.apache.org. -
Number 6 should still be

Re: How does lucene choose a field for sort?

2005-11-21 Thread Erik Hatcher
On 21 Nov 2005, at 16:12, John Powers wrote: If I sort on a field called sequence, but at document creation time I add in //create doc A doc.add(Field.Text("sequence", "32")); doc.add(Field.Text("sequence", "3")); doc.add(Field.Text("sequence", "932")); //create doc B doc.add(Field.Text("seq

Custom sort/basic question

2005-11-21 Thread John Powers
If I add keywords to a document at the same time, will they stay in that order? Create New doc A doc.add(Field.Text("category", "toys")); doc.add(Field.Text("sequence", "235")); doc.add(Field.Text("category", "bears")); doc.add(Field.Text("sequence", "63")); doc.add(Field.Text("category", "truc
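A sketch of what the question is getting at, using the 1.4-era API (whether the pairs can be relied on is exactly what the thread is asking): stored values for a repeated field name come back from Document.getValues() in the order they were added, so parallel fields can be lined up by index.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class ParallelFields {
        // Build a document with parallel multi-valued fields. Stored values for a
        // repeated field name are kept in insertion order, so after retrieving the
        // document, getValues("category")[i] pairs with getValues("sequence")[i].
        static Document makeDoc() {
            Document doc = new Document();
            doc.add(Field.Keyword("category", "toys"));
            doc.add(Field.Keyword("sequence", "235"));
            doc.add(Field.Keyword("category", "bears"));
            doc.add(Field.Keyword("sequence", "63"));
            return doc;
        }

        public static void main(String[] args) {
            Document doc = makeDoc();
            String[] categories = doc.getValues("category");
            String[] sequences = doc.getValues("sequence");
            for (int i = 0; i < categories.length; i++) {
                System.out.println(categories[i] + " -> " + sequences[i]);
            }
        }
    }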

RE: Lucene Index Changed event

2005-11-21 Thread Aigner, Thomas
Thx Peter.. worked like a charm! -Original Message- From: Peter Kim [mailto:[EMAIL PROTECTED] Sent: Monday, November 21, 2005 4:32 PM To: java-user@lucene.apache.org Subject: RE: Lucene Index Changed event You can check IndexReader.getCurrentVersion() to see if the index changed from the

RE: Lucene Index Changed event

2005-11-21 Thread Peter Kim
You can check IndexReader.getCurrentVersion() to see if the index changed from the last time you checked. The index's version number changes whenever the index is updated. Peter > -Original Message- > From: Aigner, Thomas [mailto:[EMAIL PROTECTED] > Sent: Monday, November 21, 2005 3:48 P
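A sketch of that pattern (1.4-era API; the index path is hypothetical): remember the version you opened the searcher against, and reopen only when IndexReader.getCurrentVersion() reports a change.

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;

    public class SearcherHolder {
        private final String indexDir;
        private IndexSearcher searcher;
        private long version;

        public SearcherHolder(String indexDir) throws IOException {
            this.indexDir = indexDir;
            this.searcher = new IndexSearcher(indexDir);
            this.version = IndexReader.getCurrentVersion(indexDir);
        }

        // Call before each search; reopens the searcher only if the index changed.
        public synchronized IndexSearcher getSearcher() throws IOException {
            long current = IndexReader.getCurrentVersion(indexDir);
            if (current != version) {
                searcher.close();
                searcher = new IndexSearcher(indexDir);
                version = current;
            }
            return searcher;
        }
    }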

Re: Spans, appended fields, and term positions

2005-11-21 Thread Erik Hatcher
On 21 Nov 2005, at 16:09, Yonik Seeley wrote: The Analyzer extensions seem fine, but much more general purpose than my need. For your need (a global increment), isn't expanding analyzer actually easier? analyser = new OldAnalyzer() { public int getPositionIncrementGap(String field) {
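Spelled out, the anonymous-subclass idea looks roughly like this. Note it assumes a Lucene build that includes the getPositionIncrementGap hook being proposed in this thread (it is not in the released 1.4.3 API); the gap value and base analyzer are arbitrary.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class GapAnalyzerExample {
        // An analyzer that asks DocumentWriter to leave a large position gap
        // between successive values of the same field, so phrase and span
        // queries cannot match across the boundary between appended values.
        static Analyzer makeAnalyzer() {
            return new StandardAnalyzer() {
                public int getPositionIncrementGap(String fieldName) {
                    return 1000;
                }
            };
        }
    }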

How does lucene choose a field for sort?

2005-11-21 Thread John Powers
If I sort on a field called sequence, but at document creation time I add in //create doc A doc.add(Field.Text("sequence", "32")); doc.add(Field.Text("sequence", "3")); doc.add(Field.Text("sequence", "932")); //create doc B doc.add(Field.Text("sequence", "1")); doc.add(Field.Text("sequence", "300

Re: Spans, appended fields, and term positions

2005-11-21 Thread Yonik Seeley
> > For position increments, it doesn't have to be tracked. The patch to > > DocumentWriter could also be: > > > > int position = fieldPositions[fieldNumber]; > > + if (position>0) position+=analyzer.getPositionIncrementGap(fieldName) > > This could be thwarted with tokens using zer

Lucene Index Changed event

2005-11-21 Thread Aigner, Thomas
Hi all, Is there an index changed event that I can jump on that will tell me when my index has been updated so I can close and reopen my searcher to get the new changes? I can't seem to find the event, but see some tools that might accomplish this (DLESE DPC software components?).

Re: TermFrequencies vector limits?

2005-11-21 Thread Chris Hostetter
: " By default, no more than 10,000 terms will be : indexed for a field." : : Given your note, then the docs do not mean that no : more than 10,000 terms will be indexed, but that some : smaller number of terms will be indexed and only the : first 10,000 occurrances will be tallied. It means that

Re: Spans, appended fields, and term positions

2005-11-21 Thread Erik Hatcher
On 21 Nov 2005, at 12:55, Yonik Seeley wrote: On 11/21/05, Erik Hatcher <[EMAIL PROTECTED]> wrote: Modifying Analyzer as you have suggested would require DocumentWriter additionally keep track of the field names and note when one is used again. For position increments, it doesn't have to be t

Re: TermFrequencies vector limits?

2005-11-21 Thread Paul Elschot
On Monday 21 November 2005 14:28, [EMAIL PROTECTED] wrote: > Just to make sure that I understand this correctly, > the docs say: > > " By default, no more than 10,000 terms will be > indexed for a field." > > Given your note, then the docs do not mean that no > more than 10,000 terms will be ind

Re: Spans, appended fields, and term positions

2005-11-21 Thread Yonik Seeley
On 11/21/05, Erik Hatcher <[EMAIL PROTECTED]> wrote: > Modifying Analyzer as you have suggested would > require DocumentWriter additionally keep track of the field names > and note when one is used again. For position increments, it doesn't have to be tracked. The patch to DocumentWriter could als

Re: Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread Doug Cutting
Jay Booth wrote: I had a similar problem with threading. The problem turned out to be that in the back end of the FSDirectory class, I believe it was, there was a synchronized block on the actual RandomAccessFile resource when reading a block of data from it... high-concurrency situations caused t

Re: Urgent - File Lock in Lucene 1.2

2005-11-21 Thread jian chen
Hi, Karl, There have been quite a few discussions regarding the "too many open files" problem. From my understanding, it is due to Lucene trying to open multiple segments at the same time (during search/merging segments), and the operating system wouldn't allow opening that many file handles. If
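For what it's worth, on later releases (1.4.x, not 1.2) the usual mitigations are the compound file format and a modest mergeFactor, since both cut the number of files held open per segment; a sketch, with a made-up index path:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class FewerFilesWriter {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
            writer.setUseCompoundFile(true); // pack each segment into one .cfs file
            writer.mergeFactor = 10;         // keep fewer segments open at once during merges
            writer.optimize();               // collapse the existing segments
            writer.close();
        }
    }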

Re: Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread Oren Shir
Thanks Jay Booth, I thought as much. I just verified that I'm not reaching 100% CPU, and I found that when using RAMDirectory and 100 threads the CPU usage is 60%, the average request time is 40 times higher than with one thread, but the number of requests is the same. I think I'll have to do something like you sug
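The RAMDirectory variant mentioned above is a small change at searcher-construction time (a sketch against the 1.4-era API, with a hypothetical path); the whole index is copied into the heap, so concurrent reads no longer contend on a single on-disk file:

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class InMemorySearch {
        public static void main(String[] args) throws Exception {
            FSDirectory fsDir = FSDirectory.getDirectory("/path/to/index", false);
            RAMDirectory ramDir = new RAMDirectory(fsDir); // copy the whole index into memory
            IndexSearcher searcher = new IndexSearcher(ramDir);
            System.out.println("docs in index: " + searcher.maxDoc());
            searcher.close();
        }
    }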

RE: Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread Jay Booth
I had a similar problem with threading. The problem turned out to be that in the back end of the FSDirectory class, I believe it was, there was a synchronized block on the actual RandomAccessFile resource when reading a block of data from it... high-concurrency situations caused threads to stack up

Re: Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread Yonik Seeley
On 11/21/05, Oren Shir <[EMAIL PROTECTED]> wrote: > It is rather sad if 10 threads reach the CPU limit. I'll check it and get > back to you. It's about performance and throughput though, not about number of threads it takes to reach saturation. In a 2 CPU box, I would say that the ideal situation

Re: Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread Oren Shir
gekkokid, > does 1.4.3 benefit from multi-threading? Sorry for not being clear. My tests show that both versions do not benefit from multi-threading, but it is possible that I'm CPU bound, as Yonik kindly reminded me. > is 1.9 the version in the source repository? 1.9 is the version in source re

Re: Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread Yonik Seeley
This is expected behavior: you are probably quickly becoming CPU bound (which isn't a bad thing). More threads only help when some threads are waiting on IO, or if you actually have a lot of CPUs in the box. -Yonik Now hiring -- http://forms.cnet.com/slink?231706 On 11/21/05, Oren Shir <[EMAIL

Re: Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread gekkokid
Oren Shir wrote: I tested this in versions 1.4.3 and 1.9rc1, and they are both the same in this respect. 1.9rc1 is faster, but does not benefit from multi-threading. Some newbie questions I have: does 1.4.3 benefit from multi-threading? is 1.9 the version in the source repository? _gk ---

Re: TermFrequencies vector limits?

2005-11-21 Thread Michael Curtin
> > To get a higher limit. Of course, you could also change the Lucene source > > file and recompile it. Note that you CANNOT just set the property in your > > code, in general, as the Lucene class puts it into a static final int, > > meaning it examines the value of the property (once) at

Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread Oren Shir
Hi, I tried stressing Lucene in a controlled environment: one static IndexSearcher for an index that doesn't change, and in the same process I create a number of threads that call this Searcher concurrently for a limited time. I expected the number of successful queries to increase when using more thr
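A rough sketch of that kind of harness (hypothetical index path, field, and query term); all threads share the one IndexSearcher, which is safe for concurrent use:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class SearchStress {
        public static void main(String[] args) throws Exception {
            final IndexSearcher searcher = new IndexSearcher("/path/to/index");
            final long stopAt = System.currentTimeMillis() + 60000; // run for one minute
            int numThreads = Integer.parseInt(args[0]);
            Thread[] threads = new Thread[numThreads];
            for (int i = 0; i < numThreads; i++) {
                threads[i] = new Thread() {
                    public void run() {
                        int queries = 0;
                        try {
                            while (System.currentTimeMillis() < stopAt) {
                                searcher.search(new TermQuery(new Term("contents", "java")));
                                queries++;
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                        System.out.println(getName() + ": " + queries + " queries");
                    }
                };
                threads[i].start();
            }
            for (int i = 0; i < numThreads; i++) {
                threads[i].join();
            }
            searcher.close();
        }
    }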

Re: TermFrequencies vector limits?

2005-11-21 Thread Erik Hatcher
On 21 Nov 2005, at 08:37, Michael Curtin wrote: That's probably because there is a limit built into Lucene where it ignores any tokens in a field past the first 10,000. There is a property you can set to increase this limit. I don't have the source in front of me right now, but if you go

Re: TermFrequencies vector limits?

2005-11-21 Thread Michael Curtin
> When I go and retrieve the term frequency vectors, for > any document under about 90k, everything looks as > expected. However for larger documents (I haven't > found the exact point, but I know that those over 128k > qualify) the sum of the term frequencies in the vector > seems to max out at 1

Re: TermFrequencies vector limits?

2005-11-21 Thread marigoldcc
Just to make sure that I understand this correctly, the docs say: " By default, no more than 10,000 terms will be indexed for a field." Given your note, then the docs do not mean that no more than 10,000 terms will be indexed, but that some smaller number of terms will be indexed and only the fi

Grouping results on the basis of a field

2005-11-21 Thread Samarendra Pratap
Hi, I am using lucene 1.4.3. The basic functionality of the search is simple: put in a keyword such as “java” and it will display all the books having that keyword. Now I have to add a feature which also shows the names of the top authors (let's say the top 5) with the number of books,
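One straightforward way to do this (a sketch; it assumes each book document stores an "author" field, and it walks the whole result set, which is fine for moderate hit counts) is to count authors over the hits and then keep the five largest counts:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class AuthorCounts {
        // Returns a map of author -> number of matching books for the given query.
        static Map countAuthors(IndexSearcher searcher, Query query) throws Exception {
            Map counts = new HashMap();
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                Document doc = hits.doc(i);
                String author = doc.get("author");
                Integer n = (Integer) counts.get(author);
                counts.put(author, new Integer(n == null ? 1 : n.intValue() + 1));
            }
            return counts; // sort the entries by value and keep the top 5 for display
        }
    }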

Re: Spans, appended fields, and term positions

2005-11-21 Thread Erik Hatcher
On 21 Nov 2005, at 04:26, Erik Hatcher wrote: What about adding an offset to Field, setPositionOffset(int offset)? Looking at DocumentWriter, it looks like this would be the simplest thing that could work, without precluding the interesting option of modifying Analyzer to allow with flags

Re: Spans, appended fields, and term positions

2005-11-21 Thread Erik Hatcher
Yonik, Thanks for your carefully thought out and detailed reply. On 20 Nov 2005, at 12:00, Yonik Seeley wrote: Does it make sense to add an IndexWriter setting to specify a default position increment gap to use when multiple fields are added in this way? Per-field might be nice... The good

Re: TermFrequencies vector limits?

2005-11-21 Thread Erik Hatcher
By default, documents get truncated at 10,000 terms (maybe there is an off-by-one where it goes to 10,001, though?). To increase this (and I always do), set the max field length on your IndexWriter and re-index. In 1.4.3, you set the maxFieldLength variable of IndexWriter directly. We'
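In 1.4.3 that is a plain public field on the writer, so the change is one line at indexing time (a sketch; the index path, analyzer, and new limit are placeholders):

    import java.io.File;
    import java.io.FileReader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class BigFieldIndexer {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
            writer.maxFieldLength = 1000000; // default is 10,000 terms per field
            File f = new File(args[0]);
            Document doc = new Document();
            doc.add(Field.Text("contents", new FileReader(f), true)); // true = store term vector
            writer.addDocument(doc);
            writer.close();
        }
    }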

Re: TermFrequencies vector limits?

2005-11-21 Thread Paul Elschot
On Monday 21 November 2005 02:16, [EMAIL PROTECTED] wrote: > Hi. I was wondering if anyone else has seen this > before. I'm using lucene 1.4.3 and have indexed > about 3000 text documents using the statement: > > doc.add(Field.Text("contents", new FileReader(f), > true)); > > When I go and ret