Re: Issue upgrading from lucene 2.3.2 to 2.4 (moving from bitset to docidset)

2008-12-10 Thread Tim Sturge
Mike, Mike, I have an implementation of FieldCacheTermsFilter (which uses field cache to filter for a predefined set of terms) around if either of you are interested. It is faster than materializing the filter roughly when the filter matches more than 1% of the documents. So it's not better for

Re: Issue upgrading from lucene 2.3.2 to 2.4 (moving from bitset to docidset)

2008-12-10 Thread Tim Sturge
of arbitrary terms to filter for, like TermsFilter in contrib/queries? And it's space/time efficient once FieldCache is populated? Mike Tim Sturge wrote: Mike, Mike, I have an implementation of FieldCacheTermsFilter (which uses field cache to filter for a predefined set of terms

Re: Issue upgrading from lucene 2.3.2 to 2.4 (moving from bitset to docidset)

2008-12-10 Thread Tim Sturge
It's LUCENE-1487. Tim On 12/10/08 1:13 PM, Tim Sturge [EMAIL PROTECTED] wrote: Yes (mostly). It turns those terms into an OpenBitSet on the term array. Then it does a fastGet() in the next() and skipTo() loops to see if the term for that document is in the set. The issue is that fastGet
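
A rough sketch of the approach described above, against the Lucene 2.4 Filter/DocIdSet API. This is illustrative only, not the actual LUCENE-1487 patch; the class name and the linear ordinal lookup are simplifications:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.OpenBitSet;

    // Sketch: accept documents whose single-valued field value is one of a fixed set of terms.
    public class FieldCacheTermsFilterSketch extends Filter {
      private final String field;
      private final String[] terms;

      public FieldCacheTermsFilterSketch(String field, String[] terms) {
        this.field = field;
        this.terms = terms;
      }

      public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        // One term ordinal per document, plus a lookup table of the distinct terms.
        final FieldCache.StringIndex index =
            FieldCache.DEFAULT.getStringIndex(reader, field);

        // Mark the ordinals of the wanted terms in a bit set over the *term* space.
        final OpenBitSet wanted = new OpenBitSet(index.lookup.length);
        for (int i = 0; i < terms.length; i++) {
          for (int ord = 1; ord < index.lookup.length; ord++) {   // linear scan for clarity
            if (terms[i].equals(index.lookup[ord])) {
              wanted.fastSet(ord);
              break;
            }
          }
        }

        final int maxDoc = reader.maxDoc();
        return new DocIdSet() {
          public DocIdSetIterator iterator() {
            return new DocIdSetIterator() {
              private int doc = -1;
              public int doc() { return doc; }
              public boolean next() {
                // fastGet() on the term ordinal is the whole per-document test
                while (++doc < maxDoc) {
                  if (wanted.fastGet(index.order[doc])) return true;
                }
                return false;
              }
              public boolean skipTo(int target) {
                doc = target - 1;
                return next();
              }
            };
          }
        };
      }
    }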

Re: Slow queries with lots of hits

2008-12-05 Thread Tim Sturge
From: Tim Sturge [EMAIL PROTECTED] To: java-user@lucene.apache.org java-user@lucene.apache.org Sent: Thursday, December 4, 2008 3:27:30 PM Subject: Slow queries with lots of hits Hi all, I have an interesting problem with my query traffic. Most of the queries run in a fairly short amount

Slow queries with lots of hits

2008-12-04 Thread Tim Sturge
Hi all, I have an interesting problem with my query traffic. Most of the queries run in a fairly short amount of time (< 100ms) but a few take over 1000ms. These queries are predominantly those with a huge number of hits (> 1 million hits in a 100 million document index). The time taken (as far as I

Re: Slow queries with lots of hits

2008-12-04 Thread Tim Sturge
that when you say sorting you mean sorting by something other than relevance. Hope this helps Erick On Thu, Dec 4, 2008 at 3:27 PM, Tim Sturge [EMAIL PROTECTED] wrote: Hi all, I have an interesting problem with my query traffic. Most of the queries run in a fairly short amount

Re: Term numbering and range filtering

2008-11-18 Thread Tim Sturge
I've finished a query time implementation of a column stride filter, which implements DocIdSetIterator. This just builds the filter at process start and uses it for each subsequent query. The index itself is unchanged. The results are very impressive. Here are the results on a 45M document index:
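
For reference, the query-time shape of such a filter looks roughly like the sketch below (the method and field names are made up for illustration): the FieldCache string index is built once at process start, and the iterator just range-checks each document's term ordinal.

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.search.FieldCache;

    public class OrdinalRangeIteratorSketch {
      // Matches documents whose term ordinal for 'field' lies in [lowerOrd, upperOrd].
      public static DocIdSetIterator rangeIterator(final IndexReader reader, String field,
          final int lowerOrd, final int upperOrd) throws IOException {
        // FieldCache caches the string index, so after the first call at process
        // start this lookup is essentially free.
        final FieldCache.StringIndex index = FieldCache.DEFAULT.getStringIndex(reader, field);
        return new DocIdSetIterator() {
          private int doc = -1;
          public int doc() { return doc; }
          public boolean next() {
            while (++doc < reader.maxDoc()) {
              int ord = index.order[doc];   // the column-stride value for this document
              if (ord >= lowerOrd && ord <= upperOrd) return true;
            }
            return false;
          }
          public boolean skipTo(int target) { doc = target - 1; return next(); }
        };
      }
    }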

Re: Term numbering and range filtering

2008-11-18 Thread Tim Sturge
With "Allow Filter as clause to BooleanQuery" (https://issues.apache.org/jira/browse/LUCENE-1345) one could even skip the ConstantScoreQuery. Unfortunately 1345 is unfinished for now. That would be interesting; I'd like to see how much performance improves. startup: 2811 Hits:
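
Until something like LUCENE-1345 exists, the usual way to use such a filter as a BooleanQuery clause is to wrap it in a ConstantScoreQuery; a minimal sketch:

    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.ConstantScoreQuery;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Query;

    public class FilterAsClauseSketch {
      // BooleanQuery cannot take a Filter directly (pre LUCENE-1345), so the
      // filter is wrapped in a constant-scoring query and added as a MUST clause.
      public static Query combine(Query userQuery, Filter filter) {
        BooleanQuery bq = new BooleanQuery();
        bq.add(userQuery, BooleanClause.Occur.MUST);
        bq.add(new ConstantScoreQuery(filter), BooleanClause.Occur.MUST);
        return bq;
      }
    }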

Re: Term numbering and range filtering

2008-11-10 Thread Tim Sturge
do this per-segment? Mike Tim Sturge wrote: Hi, I'm wondering if there is any easy technique to number the terms in an index (By number I mean map a sequence of terms to a contiguous range of integers and map terms to these numbers efficiently) Looking at the Term class

Re: Term numbering and range filtering

2008-11-10 Thread Tim Sturge
this for its terms? Or maybe you'd just do this per-segment? Mike Tim Sturge wrote: Hi, I'm wondering if there is any easy technique to number the terms in an index (By number I mean map a sequence of terms to a contiguous range of integers and map terms to these numbers efficiently

Re: Term numbering and range filtering

2008-11-10 Thread Tim Sturge
seem to be something that lots of people use but Lucene implements fairly poorly. Tim On 11/10/08 1:58 PM, Paul Elschot [EMAIL PROTECTED] wrote: On Monday 10 November 2008 22:21:20, Tim Sturge wrote: Hmmm -- I hadn't thought about that so I took a quick look at the term vector support. What

Re: Term numbering and range filtering

2008-11-10 Thread Tim Sturge
assumes that someone must have done this already; so either there's an implementation out there already, or there's a good reason I don't see why this is entirely impractical. So I'm interested to get feedback. Tim On 11/10/08 2:26 PM, Tim Sturge [EMAIL PROTECTED] wrote: I think we've gone

Term numbering and range filtering

2008-11-07 Thread Tim Sturge
Hi, I'm wondering if there is any easy technique to number the terms in an index (By number I mean map a sequence of terms to a contiguous range of integers and map terms to these numbers efficiently) Looking at the Term class and the .tis/.tii index format it appears that the terms are stored
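
One straightforward way to get such a numbering for the terms of a single field, sketched below (not from this thread), is to walk the TermEnum in its sorted order and hand out sequence numbers; FieldCache.getStringIndex() builds essentially the same mapping implicitly for single-valued fields.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class TermNumberingSketch {
      // Assigns a contiguous integer to every term of one field by walking the
      // term dictionary in sorted order; the map answers term -> number.
      public static Map<String, Integer> buildTermNumbers(IndexReader reader, String field)
          throws IOException {
        Map<String, Integer> termToOrd = new HashMap<String, Integer>();
        TermEnum te = reader.terms(new Term(field, ""));
        try {
          int ord = 0;
          while (te.term() != null && te.term().field().equals(field)) {
            termToOrd.put(te.term().text(), ord++);
            if (!te.next()) break;
          }
        } finally {
          te.close();
        }
        return termToOrd;
      }
    }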

Almost parallel indexes

2007-09-27 Thread Tim Sturge
Hi, I have an index which contains two very distinct types of fields: - Some fields are large (many term documents) and change fairly slowly. - Some fields are small (mostly titles, names, anchor text) and change fairly rapidly. Right now I keep around the large fields in raw form and when the

Re: indexing fields with multiplicity

2007-08-29 Thread Tim Sturge
, they will rank equally. I want the United States of America to rank higher. Tim Karl Wettin wrote: On 28 Aug 2007 at 21:41, Tim Sturge wrote: Hi, I have fields which have high multiplicity; for example I have a topic with 1000 names, 500 of which are USA and 200 are United States of America

indexing fields with multiplicity

2007-08-28 Thread Tim Sturge
Hi, I have fields which have high multiplicity; for example I have a topic with 1000 names, 500 of which are USA and 200 are United States of America. Previously I was indexing USA USA .(500x).. USA United States of America .(200x).. United States of America as a single field. The problem
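
One way to encode the multiplicity without literally repeating the text 500 times, sketched below under the assumption of Lucene's default tf scoring (this is only one option, not necessarily what the thread settled on): repeat each distinct name a number of times that grows logarithmically with its count, so the more common name still wins without drowning everything else out.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class MultiplicityFieldSketch {
      // Repeat each distinct name only log2(count)+1 times so that term frequency
      // still reflects "more common name == stronger signal" at a sane scale.
      public static void addName(Document doc, String name, int count) {
        int repeats = 1 + (int) (Math.log(count) / Math.log(2));   // 500 -> 9, 200 -> 8
        for (int i = 0; i < repeats; i++) {
          doc.add(new Field("name", name, Field.Store.NO, Field.Index.TOKENIZED));
        }
      }
    }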

calling commit() on IndexReader

2007-07-31 Thread Tim Sturge
Can anyone explain to me why commit() on IndexReader is a protected method? I want to do periodic deletes from my main index. I don't want to reopen the index (all that is changing is that documents are being deleted), so I don't want to call close(), but I can't call commit() from outside the class
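
For what it's worth, one way around the protected commit() (a sketch of an alternative, not necessarily the answer the thread arrived at) is to route the deletes through an IndexWriter, which has supported deleteDocuments(Term) since 2.1; the searching IndexReader stays open, though it will only see the deletes once it is eventually reopened.

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class PeriodicDeleteSketch {
      // Delete by term through an IndexWriter; closing the writer commits the
      // deletes, and the reader used for searching is untouched until reopened.
      public static void deleteBatch(String indexDir, Term[] toDelete) throws IOException {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        try {
          for (int i = 0; i < toDelete.length; i++) {
            writer.deleteDocuments(toDelete[i]);
          }
        } finally {
          writer.close();
        }
      }
    }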

Re: java gc with a frequently changing index?

2007-07-30 Thread Tim Sturge
seconds is really necessary. I've seen a lot of times when real time could be 5 minutes and nobody would really complain, and other times when it really is critical. But that's between you and your Product Manager. Hope this helps Erick On 7/25/07, Tim Sturge [EMAIL PROTECTED] wrote: Hi, I

Re: java gc with a frequently changing index?

2007-07-30 Thread Tim Sturge
. On 7/30/07, Mark Miller [EMAIL PROTECTED] wrote: I believe there is an issue in JIRA that handles reopening an IndexReader without reopening segments that have not changed. On 7/30/07, Tim Sturge [EMAIL PROTECTED] wrote: Thanks for the reply Erick, I believe it is the gc for four

java gc with a frequently changing index?

2007-07-25 Thread Tim Sturge
Hi, I am indexing a set of constantly changing documents. The change rate is moderate (about 10 docs/sec over a 10M document collection with a 6G total size) but I want the index to be right up to date (ideally within a second, though within 5 seconds is acceptable). Right now I have code
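
For context, the reopen-every-few-seconds pattern in question looks roughly like the sketch below (names are illustrative); every swap re-reads the index and drops the old reader for the garbage collector, which is where the gc pressure discussed later in the thread comes from.

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;

    public class ReopenLoopSketch {
      private final String indexDir;
      private IndexReader reader;
      private IndexSearcher searcher;

      public ReopenLoopSketch(String indexDir) throws IOException {
        this.indexDir = indexDir;
        this.reader = IndexReader.open(indexDir);
        this.searcher = new IndexSearcher(reader);
      }

      // Called every second or so: if the index changed on disk, open a new
      // reader/searcher pair and close the old one (in practice only after any
      // in-flight searches against the old pair have finished).
      public synchronized void maybeReopen() throws IOException {
        if (!reader.isCurrent()) {
          IndexReader newReader = IndexReader.open(indexDir);
          IndexSearcher oldSearcher = searcher;
          IndexReader oldReader = reader;
          reader = newReader;
          searcher = new IndexSearcher(newReader);
          oldSearcher.close();
          oldReader.close();
        }
      }
    }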

Re: product based term combination for BooleanQuery?

2007-07-04 Thread Tim Sturge
:-) The use of wikipedia data here is no secret; it's all over www.freebase.com. I just hoped to avoid being sucked into a "what is the best way to index wikipedia with Lucene?" discussion, which I believe several other groups are already tackling. At index time, I used a per-document boost
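
The per-document boost mentioned here is the standard index-time document boost; a minimal sketch with a made-up popularity scaling:

    import org.apache.lucene.document.Document;

    public class PopularityBoostSketch {
      // Fold a popularity signal into the index-time document boost so that,
      // all else being equal, popular documents score higher.
      public static void applyPopularity(Document doc, long popularity) {
        doc.setBoost((float) (1.0 + Math.log(1 + popularity)));
      }
    }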

product based term combination for BooleanQuery?

2007-07-03 Thread Tim Sturge
implement it and look at the results?) or does it make searching a lot slower? Thanks, Tim Tim Sturge wrote: I have an index with two different sources of information, one small but of high quality (call it title), and one large, but of lower quality (call it body). I give boosts to certain

Re: product based term combination for BooleanQuery?

2007-07-03 Thread Tim Sturge
) AND (title:Bush^4.0 body:Bush) ) Tim Sturge wrote: I'm following myself up here to ask if anyone has experience or code with a BooleanQuery that weights the terms it encounters on a product basis rather than a sum basis. This would effectively compute the geometric mean of the term score (rather
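
To make the sum-versus-product distinction concrete, here is the arithmetic being proposed (plain illustration, not Lucene scoring code): BooleanQuery effectively adds the clause scores, while the product-based variant takes their geometric mean, so a document matching only one of title/body is penalized much more heavily.

    public class ScoreCombinationSketch {
      // Roughly what BooleanQuery does today (ignoring coord and normalization).
      public static float sumCombine(float titleScore, float bodyScore) {
        return titleScore + bodyScore;
      }

      // The product-based alternative: the geometric mean of the clause scores.
      // A document that scores 0 on either clause scores 0 overall.
      public static float productCombine(float titleScore, float bodyScore) {
        return (float) Math.sqrt(titleScore * bodyScore);
      }
    }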

Re: product based term combination for BooleanQuery?

2007-07-03 Thread Tim Sturge
? The point of coord is to give a little bump to those docs that have more terms from the query in a given document. Sounds like you want a bigger bump once you have multiple query terms in a document. Would this work for you? Also, below... On Jul 3, 2007, at 3:20 PM, Tim Sturge wrote: That's true

multi-term query weighting

2007-07-02 Thread Tim Sturge
I have an index with two different sources of information, one small but of high quality (call it title), and one large, but of lower quality (call it body). I give boosts to certain documents related to their popularity (this is very similar to what one would do indexing the web). The