Re: Custom TokenFilter

2010-05-27 Thread tsuraan
> Looks correct! Wrapping by CharBuffer is very intelligent! In Lucene 3.1 the
> new Term Attribute will implement CharSequence, then it's even simpler. You
> may also look at 3.1's ICU contrib that has support even for Normalizer2.

Ok, I've only been looking at 3.0.1 so far; I'll check out the 3.
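The CharBuffer trick mentioned above can be sketched with plain JDK classes, no Lucene required: `CharBuffer.wrap` turns a token's `char[]` buffer into a `CharSequence` that `java.text.Normalizer` accepts directly. The class and method names below are illustrative, not code from the thread.

```java
import java.nio.CharBuffer;
import java.text.Normalizer;

public class NormalizeDemo {
    // Normalize a token held in a char[] (as Lucene term buffers are)
    // without building an intermediate String by hand: CharBuffer.wrap
    // gives a CharSequence view that Normalizer accepts directly.
    static String normalize(char[] buffer, int length) {
        CharSequence view = CharBuffer.wrap(buffer, 0, length);
        return Normalizer.normalize(view, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' followed by COMBINING ACUTE ACCENT composes to a single char under NFC
        char[] token = {'c', 'a', 'f', 'e', '\u0301'};
        String normalized = normalize(token, token.length);
        System.out.println(normalized.length()); // 4 after composition
    }
}
```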

Re: Custom TokenFilter

2010-05-27 Thread tsuraan
y imitating the LowerCaseFilter. If somebody could take a look at what I've put up at http://github.com/tsuraan/StandardNormalizingAnalyzer, and tell me if there's something horrible about what I've done, I'd really appreciate it. It passes the small unit tests I've

Custom TokenFilter

2010-05-26 Thread tsuraan
I'd like to have all my queries and terms run through Unicode Normalization prior to being executed/indexed. I've been using the StandardAnalyzer with pretty good luck for the past few years, so I think I'd like to write an analyzer that wraps that, and tacks a custom TokenFilter onto the chain pr

Re: Sort memory usage

2010-02-03 Thread tsuraan
> The FieldCache loads per segment, and the NRT reader is reloading only
> new segments from disk, so yes, it's "smarter" about this caching in this
> case.

Ok, so the cache is tied to the index, and not to any particular reader. The actual FieldCacheImpl keeps a mapping from Reader to its terms,

Sort memory usage

2010-02-03 Thread tsuraan
Is the cache used by sorting on strings separated by reader, or is it a global thing? I'm trying to use the near-realtime search, and I have a few indices with a million docs apiece. If I'm opening a new reader every minute, am I going to have every term in every sort field read into RAM for each

Re: Sort and Collector

2010-02-03 Thread tsuraan
> It's not really possible.
> Lucene must iterate over all of the hits before it knows for sure that
> it has the top sorted by any criteria (other than docid).
> A Collector is called for every hit as it happens, and thus one can't
> specify a sort order (sorting itself is actually implemented wit

Sort and Collector

2010-02-03 Thread tsuraan
Is there any way to run a search where I provide a Query, a Sort, and a Collector? I have a case where it is sometimes, but rarely, necessary to get all the results from a query, but usually I'm satisfied with a smaller amount. That part I can do with just a query and a collector, but I'd like th

Re: Copy and augment an indexed Document

2009-12-30 Thread tsuraan
> It's an open question whether this is more or less work than
> re-parsing the document (I infer that you have the originals
> available). Before trying to reconstruct the document I'd
> ask how often you need to do this. The gremlins coming out
> of the woodwork from reconstruction would consume

Copy and augment an indexed Document

2009-12-30 Thread tsuraan
Suppose I have a (useful) document stored in a Lucene index, and I have a variant that I'd also like to be able to search. This variant has the exact same data as the original document, but with some extra fields. I'd like to be able to use an IndexReader to get the document that I stored, use th

Re: Lucene memory usage

2009-12-25 Thread tsuraan
> Have you tried setting the termInfosIndexDivisor when opening the
> IndexReader? EG a setting of 2 would load every 256th term (instead
> of every 128th term) into RAM, halving RAM usage, with the downside
> being that looking up a term will generally take longer since it'll
> require more scann
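The arithmetic behind that suggestion is worth spelling out. Assuming Lucene's default terms-index interval of 128, the divisor simply multiplies the stride between terms held in RAM; the helper below is a hypothetical back-of-envelope calculation, not Lucene code.

```java
public class TermsIndexEstimate {
    // Rough count of terms held in RAM by the terms index, assuming the
    // default index interval of 128 and a given termInfosIndexDivisor.
    static long indexedTerms(long uniqueTerms, int divisor) {
        int interval = 128; // Lucene's default terms-index interval
        return uniqueTerms / ((long) interval * divisor);
    }

    public static void main(String[] args) {
        long terms = 100_000_000L; // hypothetical index with 100M unique terms
        System.out.println(indexedTerms(terms, 1)); // 781250 terms in RAM by default
        System.out.println(indexedTerms(terms, 2)); // 390625 -- halved with divisor 2
    }
}
```

So doubling the divisor halves the resident term count, at the cost of scanning up to twice as many terms on disk per lookup.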

Re: Lucene memory usage

2009-12-23 Thread tsuraan
> This (very large number of unique terms) is a problem for Lucene currently.
>
> There are some simple improvements we could make to the terms dict
> format to not require so much RAM per term in the terms index...
> LUCENE-1458 (flexible indexing) has these improvements, but
> unfortunately tied

Re: Lucene in Action Rev2

2009-08-27 Thread tsuraan
> I've pinged Manning to get this corrected. Thanks for the heads-up.
>
> Erik

No problem. I'm about to order the beta book, and I'm looking forward to the final copy. Thanks for writing it :)

Lucene in Action Rev2

2009-08-26 Thread tsuraan
In the free first chapter of the new Lucene in Action book, it states that it's targeting Lucene 3.0, but on the Manning page for the book, it says the code in the book is written for 2.3. I'm guessing that the book is the authority on what the book covers, but could somebody maybe change the Man

Re: org.apache.lucene.index.MergePolicy$MergeException

2009-08-05 Thread tsuraan
On 05/08/2009, Michael McCandless wrote:
> Switching to addIndexes instead, or using SerialMergeScheduler, or
> upgrading to 2.4.1, should all work.

Thanks! We'll be switching to 2.9 once it's ready. From past experience, Lucene upgrades are simple and painless, but I don't think I can do a 2.4

org.apache.lucene.index.MergePolicy$MergeException

2009-08-05 Thread tsuraan
I'm getting the exception "org.apache.lucene.index.MergePolicy$MergeException: segment "_0" exists in external directory yet the MergeScheduler executed the merge in a separate thread". According to this: http://mail-archives.apache.org/mod_mbox/lucene-java-user/200809.mbox/ that only happens wit

Re: Batch searching

2009-07-22 Thread tsuraan
> Out of curiosity, what is the size of your corpus? How much and how
> quickly do you expect it to grow?

In terms of Lucene documents, we tend to have in the 10M-100M range. Currently we use merging to make larger indices from smaller ones, so a single index can have a lot of documents in it, bu

Re: Batch searching

2009-07-22 Thread tsuraan
> If you did this, wouldn't you be binding the processing of the results
> of all queries to that of the slowest performing one within the collection?

I would imagine it would, but I haven't seen too much variance between Lucene query speeds in our data.

> I'm guessing you are trying for some sor

Re: Batch searching

2009-07-22 Thread tsuraan
> It's not accurate to say that Lucene scans the index for each search.
> Rather, every Query reads a set of posting lists, each typically read
> from disk. If you pass a Query[] whose members have nothing in common (for
> example, no terms in common), then you won't gain anything, b/c each Query
>

Batch searching

2009-07-22 Thread tsuraan
If I understand Lucene correctly, when doing multiple simultaneous searches on the same IndexSearcher, they will basically all do their own index scans and collect results independently. If that's correct, is there a way to batch searches together, so only one index scan is done? What I'd like is

Re: Boolean retrieval

2009-07-14 Thread tsuraan
> http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/all/index.html
>
> Koji

Thanks!

Re: Boolean retrieval

2009-07-13 Thread tsuraan
> Make that "Collector" (new as of 2.9). > > HitCollector is the old (deprecated as of 2.9) way, which always > pre-computed the score of each hit and passed the score to the collect > method. Where can I find docs for 2.9? Do I just have to check out the lucene trunk and run javadoc there?

Re: ZipFile directory implementation

2009-03-09 Thread tsuraan
> Also, have you looked at how it performs?

Just making a directory of 1,000,000 documents and reading from it, it looks like this implementation is probably unbearably slow, unless Lucene has some really good caching. ZipFile gives InputStreams for the zip contents, and InputStreams don't suppor
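The forward-only nature of zip entry streams can be demonstrated with just `java.util.zip`: to reach byte N of an entry, everything before it must be decompressed and discarded, which is why random access over compressed entries is costly. This is an illustrative sketch, not the GitHub code from the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipSeekDemo {
    // Build a tiny one-entry zip in memory to experiment with.
    static byte[] sampleZip() throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ZipOutputStream zout = new ZipOutputStream(bos);
        zout.putNextEntry(new ZipEntry("segment.bin"));
        zout.write(new byte[] {10, 20, 30, 40});
        zout.closeEntry();
        zout.close();
        return bos.toByteArray();
    }

    // An entry's stream is forward-only: reaching byte 'offset' means
    // decompressing and discarding every byte before it via skip().
    static int readByteAt(byte[] zipBytes, long offset) throws Exception {
        ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes));
        zin.getNextEntry();            // position at the first entry's data
        long skipped = 0;
        while (skipped < offset) {     // skip() may skip fewer bytes than asked
            skipped += zin.skip(offset - skipped);
        }
        return zin.read();             // the byte at 'offset'
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readByteAt(sampleZip(), 2)); // prints 30
    }
}
```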

Re: ZipFile directory implementation

2009-03-09 Thread tsuraan
> Sounds interesting. Can you tell us a bit more about the use case for it?
> Is it basically you are in a situation where you can't unzip the index?

Indices compress pretty nicely: 30% to 50% in my experience. So, if your indices are read-only anyhow (mine aren't live; we do batch jobs to modif
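The 30% to 50% figure is easy to sanity-check with `java.util.zip.Deflater` on term-dictionary-like data, where many strings share prefixes. This is a standalone illustration on synthetic data, not a measurement of a real index.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class CompressRatioDemo {
    // Deflate a byte[] and return the compressed size in bytes.
    static int compressedSize(byte[] data) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DeflaterOutputStream dos =
            new DeflaterOutputStream(bos, new Deflater(Deflater.BEST_COMPRESSION));
        dos.write(data);
        dos.close(); // flushes and finishes the deflate stream
        return bos.size();
    }

    public static void main(String[] args) throws Exception {
        // Synthetic term-dictionary-like data: many strings sharing prefixes.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) sb.append("term").append(i).append('\n');
        byte[] data = sb.toString().getBytes("UTF-8");
        // Highly redundant data deflates to well under half its size.
        System.out.println(compressedSize(data) < data.length / 2); // prints true
    }
}
```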

ZipFile directory implementation

2009-03-06 Thread tsuraan
e.lucene.store since that's how I was testing it. Anyhow, it's really ugly, but seems to work. I was wondering if anybody wanted to have a glance at it to see if there's anything obvious that I'm doing wrong, simple off-by-one errors, that sort of thing. The code is on github, htt

Re: huge tii files

2008-06-17 Thread tsuraan
That's really nice. Thanks! I'm guessing the answer is no, but is there an equivalent to that for lucene-2.2.0? Upgrading shouldn't be much of a problem anyhow (we've been doing it since 1.9), but out of curiosity...

On 17/06/2008, Alex <[EMAIL PROTECTED]> wrote:
> you can invoke IndexReader.

huge tii files

2008-06-17 Thread tsuraan
I have a collection of indices with a total of about 7,000,000 documents between them all. When I attempt to run a search over these indices, the searching process's memory usage increases to ~1.7GB if I allow java to use that much memory. If I don't (my normal memory cap is 512MB), I get the fol