> Looks correct! Wrapping by CharBuffer is very intelligent! In Lucene 3.1 the
> new Term Attribute will implement CharSequence, then it's even simpler. You
> may also look at 3.1's ICU contrib that has support even for Normalizer2.
Ok, I've only been looking at 3.0.1 so far; I'll check out the 3.
y imitating the LowerCaseFilter. If somebody could take a
look at what I've put up at
http://github.com/tsuraan/StandardNormalizingAnalyzer, and tell me if
there's something horrible about what I've done, I'd really appreciate
it. It passes the small unit tests I've
I'd like to have all my queries and terms run through Unicode
Normalization prior to being executed/indexed. I've been using the
StandardAnalyzer with pretty good luck for the past few years, so I
think I'd like to write an analyzer that wraps that, and tacks a
custom TokenFilter onto the chain pr
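The filter itself would boil down to running each term through `java.text.Normalizer`. The Lucene `TokenFilter` wiring is omitted here; this is just a minimal stdlib sketch of why normalization matters for matching (the strings and form choice are illustrative assumptions):

```java
import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        // "é" typed as 'e' + combining acute accent (decomposed form) vs.
        // the single precomposed character: visually identical, but unequal
        // as raw terms, so they would never match each other in an index.
        String decomposed = "Caf\u0065\u0301";
        String composed   = "Caf\u00e9";
        System.out.println(decomposed.equals(composed)); // false
        // Normalizing both sides (here to NFC) makes them identical:
        String normalized = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(normalized.equals(composed)); // true
    }
}
```

Running the same normalization over both indexed terms and query terms, as the post proposes, is what makes the two representations collide on the same posting list.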
> The FieldCache loads per segment, and the NRT reader is reloading only
> new segments from disk, so yes, it's "smarter" about this caching in this
> case.
Ok, so the cache is tied to the index, and not to any particular
reader. The actual FieldCacheImpl keeps a mapping from Reader to its
terms,
Is the cache used for sorting on strings kept per reader, or is it
a global thing? I'm trying to use the near-realtime search, and I
have a few indices with a million docs apiece. If I'm opening a new
reader every minute, am I going to have every term in every sort field
read into RAM for each
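The pattern the reply describes (cache entries keyed by the segment reader, so a reopened reader reuses entries for unchanged segments) can be sketched with a `WeakHashMap`. This is a toy illustration of the idea, not Lucene's actual `FieldCacheImpl`; the class and method names are invented:

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

// Hypothetical sketch: values are keyed by the (segment) reader object, so
// segments shared between an old and a reopened reader keep their cached
// terms, and discarding a reader lets its entry be garbage-collected.
public class PerReaderCache<R, V> {
    private final Map<R, V> cache = new WeakHashMap<>();
    private final Function<R, V> loader;

    public PerReaderCache(Function<R, V> loader) {
        this.loader = loader;
    }

    public synchronized V get(R reader) {
        // Load at most once per live reader; later calls hit the cache.
        return cache.computeIfAbsent(reader, loader);
    }
}
```

In Lucene terms, `R` would be the per-segment `IndexReader`: an NRT reopen only creates readers for new segments, so the expensive field values are loaded only for those, not for every term in every sort field again.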
> It's not really possible.
> Lucene must iterate over all of the hits before it knows for sure that
> it has the top sorted by any criteria (other than docid).
> A Collector is called for every hit as it happens, and thus one can't
> specify a sort order (sorting itself is actually implemented wit
Is there any way to run a search where I provide a Query, a Sort, and
a Collector? I have a case where it is sometimes, but rarely,
necessary to get all the results from a query, but usually I'm
satisfied with a smaller amount. That part I can do with just a query
and a collector, but I'd like th
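The mechanism the truncated reply alludes to, sorting via a bounded priority queue fed by a collector that sees every hit, can be sketched in plain Java. The scores are made up and no Lucene classes are used; it only shows why all hits must be visited before the top results are known:

```java
import java.util.PriorityQueue;

public class TopKDemo {
    // Keep only the k best scores while streaming over every hit exactly
    // once, the way a sorted search must see all hits before the top-k
    // is final.
    static double[] topK(double[] scores, int k) {
        PriorityQueue<Double> pq = new PriorityQueue<>(); // min-heap: head = worst kept score
        for (double s : scores) {
            if (pq.size() < k) pq.add(s);
            else if (s > pq.peek()) { pq.poll(); pq.add(s); }
        }
        double[] out = new double[pq.size()];
        for (int i = out.length - 1; i >= 0; i--) out[i] = pq.poll(); // ascending poll, descending fill
        return out;
    }

    public static void main(String[] args) {
        double[] best = topK(new double[]{0.3, 0.9, 0.1, 0.7, 0.5}, 3);
        System.out.println(java.util.Arrays.toString(best)); // [0.9, 0.7, 0.5]
    }
}
```

A custom `Collector` could do exactly this internally, keeping a large queue when all results are needed and a small one otherwise, which is one way to get query-plus-sort-plus-collector behavior in a single pass.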
> It's an open question whether this is more or less work than
> re-parsing the document (I infer that you have the originals
> available). Before trying to reconstruct the document I'd
> ask how often you need to do this. The gremlins coming out
> of the woodwork from reconstruction would consume
Suppose I have a (useful) document stored in a Lucene index, and I
have a variant that I'd also like to be able to search. This variant
has the exact same data as the original document, but with some extra
fields. I'd like to be able to use an IndexReader to get the document
that I stored, use th
> Have you tried setting the termInfosIndexDivisor when opening the
> IndexReader? EG a setting of 2 would load every 256th term (instead
> of every 128th term) into RAM, halving RAM usage, with the downside
> being that looking up a term will generally take longer since it'll
> require more scann
> This (very large number of unique terms) is a problem for Lucene currently.
>
> There are some simple improvements we could make to the terms dict
> format to not require so much RAM per term in the terms index...
> LUCENE-1458 (flexible indexing) has these improvements, but
> unfortunately tied
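The arithmetic behind the quoted advice is simple: with the default terms index interval of 128, a divisor of 2 keeps every 256th term in RAM, halving the in-memory index. A worked example with an assumed corpus size of 100M unique terms:

```java
public class TermsIndexMath {
    public static void main(String[] args) {
        long uniqueTerms = 100_000_000L; // assumed: 100M unique terms in the corpus
        int indexInterval = 128;         // default terms index interval per the quoted reply
        for (int divisor : new int[]{1, 2, 4}) {
            long inRam = uniqueTerms / ((long) indexInterval * divisor);
            System.out.println("divisor " + divisor + ": ~" + inRam + " index terms in RAM");
        }
    }
}
```

Each doubling of the divisor halves the resident index terms (~781K, ~390K, ~195K here) at the cost of scanning up to that many more on-disk terms per lookup.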
> I've pinged Manning to get this corrected. Thanks for the heads-up.
>
> Erik
No problem. I'm about to order the beta book, and I'm looking forward
to the final copy. Thanks for writing it :)
In the free first chapter of the new Lucene in Action book, it states
that it's targeting Lucene 3.0, but on the Manning page for the book,
it says the code in the book is written for 2.3. I'm guessing that
the book is the authority on what the book covers, but could somebody
maybe change the Man
On 05/08/2009, Michael McCandless wrote:
> Switching to addIndexes instead, or using SerialMergeScheduler, or
> upgrading to 2.4.1, should all work.
Thanks! We'll be switching to 2.9 once it's ready. From past
experience, Lucene upgrades are simple and painless, but I don't think
I can do a 2.4
I'm getting the exception
"org.apache.lucene.index.MergePolicy$MergeException: segment "_0"
exists in external directory yet the MergeScheduler executed the merge
in a separate thread". According to this:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200809.mbox/
that only happens wit
> Out of curiosity, what is the size of your corpus? How much and how
> quickly do you expect it to grow?
In terms of Lucene documents, we tend to have in the 10M-100M range.
Currently we use merging to make larger indices from smaller ones, so
a single index can have a lot of documents in it, bu
> If you did this, wouldn't you be binding the processing of the results
> of all queries to that of the slowest performing one within the collection?
I would imagine it would, but I haven't seen too much variance between
lucene query speeds in our data.
> I'm guessing you are trying for some sor
> It's not accurate to say that Lucene scans the index for each search.
> Rather, every Query reads a set of posting lists, each are typically read
> from disk. If you pass Query[] which have nothing to do in common (for
> example no terms in common), then you won't gain anything, b/c each Query
>
If I understand Lucene correctly, when doing multiple simultaneous
searches on the same IndexSearcher, they will basically all do their
own index scans and collect results independently. If that's correct,
is there a way to batch searches together, so only one index scan is
done? What I'd like is
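The reply's point is that a Query is driven by its posting lists, so batching only helps when queries share terms. The heart of a conjunctive (AND) query is an intersection of sorted doc-id lists, sketched here in plain Java (not Lucene code; the doc ids are invented):

```java
import java.util.ArrayList;
import java.util.List;

public class PostingIntersect {
    // Intersect two sorted posting lists of doc ids, the core operation of
    // an AND query: advance whichever cursor is behind, emit on equality.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { hits.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(intersect(new int[]{1, 4, 7, 9}, new int[]{2, 4, 9, 11})); // [4, 9]
    }
}
```

Two unrelated queries read disjoint posting lists, so there is no shared scan to batch; only a common term's list could be walked once on behalf of both.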
> http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/all/index.html
>
> Koji
Thanks!
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
> Make that "Collector" (new as of 2.9).
>
> HitCollector is the old (deprecated as of 2.9) way, which always
> pre-computed the score of each hit and passed the score to the collect
> method.
Where can I find docs for 2.9? Do I just have to check out the lucene
trunk and run javadoc there?
> Also, have you looked at how it performs?
Just making a directory of 1,000,000 documents and reading from it, it
looks like this implementation is probably unbearably slow, unless
Lucene has some really good caching. ZipFile gives InputStreams for
the zip contents, and InputStreams don't suppor
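The performance problem described, that a zip entry's `InputStream` has no `seek()`, can be shown with a small stdlib experiment: reaching any offset means decompressing and skipping everything before it, every time. The entry name and sizes are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipSeekDemo {
    // Build a one-entry zip in memory, then read the byte at a given offset.
    // With only an InputStream there is no random access: we must skip()
    // (i.e. decompress and discard) every byte before the target.
    static int byteAt(int target) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ZipOutputStream zos = new ZipOutputStream(buf)) {
                zos.putNextEntry(new ZipEntry("seg.dat"));
                for (int i = 0; i < 1000; i++) zos.write(i & 0xff);
                zos.closeEntry();
            }
            try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
                zis.getNextEntry();
                long skipped = 0;
                while (skipped < target) {
                    long s = zis.skip(target - skipped);
                    if (s <= 0) break;
                    skipped += s;
                }
                return zis.read();
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(byteAt(900)); // 132, i.e. 900 & 0xff
    }
}
```

A Lucene Directory does many small seeks all over its files, so paying this sequential-skip cost per seek is the likely source of the slowness observed.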
> Sounds interesting. Can you tell us a bit more about the use case for it?
Is it basically you are in a situation where you can't unzip the index?
Indices compress pretty nicely: 30% to 50% in my experience. So, if your
indices are read-only anyhow (mine aren't live; we do batch jobs to modif
e.lucene.store since that's how I was testing it.
Anyhow, it's really ugly, but seems to work. I was wondering if
anybody wanted to have a glance at it to see if there's anything
obvious that I'm doing wrong, simple off-by-one errors, that sort of
thing.
The code is on github,
htt
That's really nice. Thanks!
I'm guessing the answer is no, but is there an equivalent to that for
lucene-2.2.0? Upgrading shouldn't be much of a problem anyhow (we've
been doing it since 1.9), but out of curiosity...
On 17/06/2008, Alex <[EMAIL PROTECTED]> wrote:
>
> you can invoke IndexReader.
I have a collection of indices with a total of about 7,000,000
documents between them all. When I attempt to run a search over these
indices, the searching process's memory usage increases to ~1.7GB if I
allow java to use that much memory. If I don't (my normal memory cap
is 512MB), I get the fol