Mike, Mike,
I have an implementation of FieldCacheTermsFilter (which uses field cache to
filter for a predefined set of terms) around if either of you are
interested. It is faster than materializing the filter roughly when the
filter matches more than 1% of the documents.
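The mechanism described here can be sketched in plain Java (an illustrative analogue, not the actual FieldCacheTermsFilter code; all class and field names below are mine): the field cache gives, for every document, the ordinal of its indexed term, the filter marks the wanted ordinals in a bit set, and matching a document is then one array read plus one bit test.

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative analogue of a field-cache terms filter: docToOrd plays
// the role of the field cache (one term ordinal per document), and
// wantedOrds marks the ordinals of the accepted terms.
public class OrdinalTermsFilter {
    private final BitSet wantedOrds = new BitSet();
    private final int[] docToOrd; // per-document term ordinal ("field cache")

    public OrdinalTermsFilter(int[] docToOrd, String[] sortedTerms, String[] wanted) {
        this.docToOrd = docToOrd;
        for (String term : wanted) {
            // Terms are stored sorted, so the ordinal is a binary-search position.
            int ord = Arrays.binarySearch(sortedTerms, term);
            if (ord >= 0) wantedOrds.set(ord);
        }
    }

    // One array lookup plus one bit test per document.
    public boolean matches(int doc) {
        return wantedOrds.get(docToOrd[doc]);
    }
}
```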
So it's not better for
of arbitrary terms to
filter for, like TermsFilter in contrib/queries? And it's space/time
efficient once FieldCache is populated?
Mike
Tim Sturge wrote:
Mike, Mike,
I have an implementation of FieldCacheTermsFilter (which uses field
cache to
filter for a predefined set of terms
It's LUCENE-1487.
Tim
On 12/10/08 1:13 PM, Tim Sturge [EMAIL PROTECTED] wrote:
Yes (mostly). It turns those terms into an OpenBitSet on the term array.
Then it does a fastGet() in the next() and skipTo() loops to see if the term
for that document is in the set.
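The next()/skipTo() loops described here might look roughly like this (a sketch with invented names, not the actual Lucene code): each advance walks the document stream and uses a constant-time fastGet-style bit test on the document's term ordinal to decide membership.

```java
import java.util.BitSet;

// Sketch of the iterator loop: advance through documents, accept only
// those whose field-cache term ordinal is in the term set.
public class FilteredDocIterator {
    private final int[] docToOrd; // per-document term ordinal from the field cache
    private final BitSet termSet; // ordinals of the accepted terms
    private int doc = -1;

    public FilteredDocIterator(int[] docToOrd, BitSet termSet) {
        this.docToOrd = docToOrd;
        this.termSet = termSet;
    }

    // Advance to the next matching document; -1 when exhausted.
    public int next() {
        for (doc++; doc < docToOrd.length; doc++) {
            if (termSet.get(docToOrd[doc])) return doc; // fastGet-style bit test
        }
        return -1;
    }

    // Advance to the first matching document at or after target.
    public int skipTo(int target) {
        doc = target - 1;
        return next();
    }
}
```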
The issue is that fastGet
From: Tim Sturge [EMAIL PROTECTED]
To: java-user@lucene.apache.org java-user@lucene.apache.org
Sent: Thursday, December 4, 2008 3:27:30 PM
Subject: Slow queries with lots of hits
Hi all,
I have an interesting problem with my query traffic. Most of the queries run
in a fairly short amount
Hi all,
I have an interesting problem with my query traffic. Most of the queries run
in a fairly short amount of time (<100ms) but a few take over 1000ms. These
queries are predominantly those with a huge number of hits (1 million hits
in a 100 million document index). The time taken (as far as I
that when you say sorting you mean sorting
by something other than relevance.
Hope this helps
Erick
On Thu, Dec 4, 2008 at 3:27 PM, Tim Sturge [EMAIL PROTECTED] wrote:
Hi all,
I have an interesting problem with my query traffic. Most of the queries
run
in a fairly short amount
I've finished a query time implementation of a column stride filter, which
implements DocIdSetIterator. This just builds the filter at process start
and uses it for each subsequent query. The index itself is unchanged.
The results are very impressive. Here are the results on a 45M document
index:
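The build-at-process-start idea can be approximated in plain Java (a sketch under my own naming, not the actual column-stride filter code): materialize one bit per document once, then let every subsequent query walk the set bits, which is essentially the DocIdSetIterator contract.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

// Sketch: the filter is computed once at startup and reused per query;
// nextDoc approximates DocIdSetIterator-style iteration over the set.
public class StartupFilter {
    private final BitSet acceptedDocs;

    public StartupFilter(int maxDoc, IntPredicate accept) {
        acceptedDocs = new BitSet(maxDoc);
        for (int d = 0; d < maxDoc; d++) {
            if (accept.test(d)) acceptedDocs.set(d);
        }
    }

    // First accepted document strictly after 'after'; -1 when exhausted.
    public int nextDoc(int after) {
        return acceptedDocs.nextSetBit(after + 1);
    }
}
```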
With "Allow Filter as clause to BooleanQuery":
https://issues.apache.org/jira/browse/LUCENE-1345
one could even skip the ConstantScoreQuery with this.
Unfortunately 1345 is unfinished for now.
That would be interesting; I'd like to see how much performance improves.
startup: 2811
Hits:
do
this per-segment?
Mike
Tim Sturge wrote:
Hi,
I'm wondering if there is any easy technique to number the terms in
an index
(By number I mean map a sequence of terms to a contiguous range of
integers
and map terms to these numbers efficiently)
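One simple way to get such a numbering (a sketch of mine, not something from the thread) exploits the fact that Lucene keeps terms in sorted order: the position of a term in the sorted sequence is a natural dense ordinal, so term-to-number is a binary search and number-to-term is an array lookup.

```java
import java.util.Arrays;

// Minimal term-numbering sketch over a sorted, unique term dictionary.
public class TermNumbering {
    private final String[] sortedTerms; // assumed sorted and unique

    public TermNumbering(String[] sortedTerms) {
        this.sortedTerms = sortedTerms;
    }

    // Term -> contiguous ordinal; negative if the term is absent.
    public int numberOf(String term) {
        return Arrays.binarySearch(sortedTerms, term);
    }

    // Ordinal -> term.
    public String termOf(int number) {
        return sortedTerms[number];
    }
}
```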
Looking at the Term class
this for its terms? Or maybe
you'd just do this per-segment?
Mike
Tim Sturge wrote:
Hi,
I'm wondering if there is any easy technique to number the terms
in an index
(By number I mean map a sequence of terms to a contiguous range of
integers
and map terms to these numbers efficiently
seem to be something that lots of people use but Lucene implements fairly
poorly.
Tim
On 11/10/08 1:58 PM, Paul Elschot [EMAIL PROTECTED] wrote:
Op Monday 10 November 2008 22:21:20 schreef Tim Sturge:
Hmmm -- I hadn't thought about that so I took a quick look at the
term vector support.
What
assumes that someone must have done this already; so either
there's an implementation out there already or there's a good reason I don't
see that this is entirely impractical. So I'm interested to get feedback.
Tim
On 11/10/08 2:26 PM, Tim Sturge [EMAIL PROTECTED] wrote:
I think we've gone
Hi,
I'm wondering if there is any easy technique to number the terms in an index
(By number I mean map a sequence of terms to a contiguous range of integers
and map terms to these numbers efficiently)
Looking at the Term class and the .tis/.tii index format it appears that the
terms are stored
Hi,
I have an index which contains two very distinct types of fields:
- Some fields are large (many term documents) and change fairly slowly.
- Some fields are small (mostly titles, names, anchor text) and change fairly
rapidly.
Right now I keep around the large fields in raw form and when the
, they will rank equally. I
want the United States of America to rank higher.
Tim
Karl Wettin wrote:
28 aug 2007 kl. 21.41 skrev Tim Sturge:
Hi,
I have fields which have high multiplicity; for example I have a
topic with 1000 names, 500 of which are USA and 200 are United
States of America
Hi,
I have fields which have high multiplicity; for example I have a topic
with 1000 names, 500 of which are USA and 200 are United States of
America.
Previously I was indexing "USA USA ...(500x)... USA United States of
America ...(200x)... United States of America" as a single field. The
problem
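One commonly suggested alternative to indexing 500 literal copies of a name (an illustration of the general technique, not what this thread settled on) is to index each distinct name once and fold its occurrence count into a sublinear boost, so frequency still matters without overwhelming the score.

```java
import java.util.HashMap;
import java.util.Map;

// Collapse repeated names into one occurrence each, carrying a
// log-scaled weight derived from the raw count: 1 + ln(count).
public class NameWeights {
    public static Map<String, Double> weights(Map<String, Integer> counts) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            w.put(e.getKey(), 1.0 + Math.log(e.getValue()));
        }
        return w;
    }
}
```

With 500 "USA" and 200 "United States of America", the weights come out near 7.2 and 6.3 rather than 500 and 200, so the rarer full name is no longer drowned out.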
Can anyone explain to me why commit() on IndexReader is a protected method?
I want to do periodic deletes from my main index. I don't want to reopen
the index (all that is changing are things are being deleted), so I
don't want to call close(), but I can't call commit() from outside the
class
seconds is really necessary. I've seen a lot
of times when real time could be 5 minutes and nobody would really
complain, and other times when it really is critical. But that's between you
and your Product Manager
Hope this helps
Erick
On 7/25/07, Tim Sturge [EMAIL PROTECTED] wrote:
Hi,
I
On 7/30/07, Mark Miller [EMAIL PROTECTED] wrote:
I believe there is an issue in JIRA that handles reopening an IndexReader
without reopening segments that have not changed.
On 7/30/07, Tim Sturge [EMAIL PROTECTED] wrote:
Thanks for the reply Erick,
I believe it is the gc for four
Hi,
I am indexing a set of constantly changing documents. The change rate is
moderate (about 10 docs/sec over a 10M document collection with a 6G
total size) but I want to be right up to date (ideally within a second
but within 5 seconds is acceptable) with the index.
Right now I have code
:-) The use of wikipedia data here is no secret; it's all over
www.freebase.com. I just hoped to avoid being sucked into a "what is the best
way to index wikipedia with Lucene?" discussion, which I believe several other
groups are already tackling.
At index time, I used a per document boost
implement it and look at the
results?) or does it make searching a lot slower?
Thanks,
Tim
Tim Sturge wrote:
I have an index with two different sources of information, one small
but of high quality (call it title), and one large, but of lower
quality (call it body). I give boosts to certain
) AND (title:Bush^4.0 body:Bush) )
Tim Sturge wrote:
I'm following myself up here to ask if anyone has experience or code
with a BooleanQuery that weights the terms it encounters on a product
basis rather than a sum basis.
This would effectively compute the geometric mean of the term score
(rather
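The product-basis idea can be made concrete with a toy comparison (my sketch, not code from the thread): a sum of term scores lets one strong term dominate, while a geometric mean rewards documents that score well on every query term.

```java
// Sum-of-scores vs. geometric mean of per-term scores.
public class GeomMeanScore {
    public static double sumScore(double[] termScores) {
        double s = 0;
        for (double t : termScores) s += t;
        return s;
    }

    public static double geomMeanScore(double[] termScores) {
        double p = 1;
        for (double t : termScores) p *= t;
        return Math.pow(p, 1.0 / termScores.length);
    }
}
```

For scores {4, 1} and {2.5, 2.5} the sums are equal, but the geometric mean prefers the balanced document (2.5 vs. 2.0).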
? The point of coord is to give a little bump to those
docs that have more terms from the query in a given document. Sounds
like you want a bigger bump once you have multiple query terms in a
document. Would this work for you?
Also, below...
On Jul 3, 2007, at 3:20 PM, Tim Sturge wrote:
That's true
I have an index with two different sources of information, one small but
of high quality (call it title), and one large, but of lower quality
(call it body). I give boosts to certain documents related to their
popularity (this is very similar to what one would do indexing the web).
The