FuzzyQuery using termDocs() for context filtering

2007-11-05 Thread Timo Nentwig
Hi! Imagine an index holding documents in different languages and countries. Language+country is what I call a context, and I build and hold a QueryFilter for each context. When performing a fuzzy search, FilteredTermEnum doesn't care about any contexts at all (well, how should it :). It builds a

Why exactly are fuzzy queries so slow?

2007-11-24 Thread Timo Nentwig
Hi! I search a 1.5 GB index and fuzzy queries are really slow: about 500 ms on average (IndexSearcher.search(Query, HitCollector)). When performing exact queries I achieve response times <25 ms. What is it that makes fuzzy queries so slow? Increased index access due to more terms, i.e. d

Re: Why exactly are fuzzy queries so slow?

2007-11-25 Thread Timo Nentwig
On Saturday 24 November 2007 18:28:48 markharw00d wrote:
> The added IO is one factor. Another is the CPU load from doing many
> edit-distance comparisons between index terms and the provided search
You mean FuzzyQuery.rewrite(). Are you sure this is a CPU and not an IO issue (reading the terms f

Re: Why exactly are fuzzy queries so slow?

2007-11-25 Thread Timo Nentwig
might be in a document or field that I'm not interested in at all.
> you first select used words which share n-grams with the query word, the
> distance is computed with Levenshtein, and you use this word as a
> synonym.
>
> M.
>
> On 24 Nov. 07 at 17:36, Timo Nentwig wrote:
>

Re: Why exactly are fuzzy queries so slow?

2007-11-25 Thread Timo Nentwig
On Saturday 24 November 2007 18:28:48 markharw00d wrote:
> term. You can limit the number of edit distance comparisons conducted by
> setting the minimum prefix length. This is a property of the QueryParser
Well, javadoc: "prefixLength - length of common (non-fuzzy) prefix". So, this is some kind
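
To illustrate the point about the prefix length: the following is a minimal, Lucene-free sketch of why a required common prefix cuts down the number of edit-distance comparisons. The class name, the in-memory term list and the maxEdits threshold are assumptions made for illustration only; FuzzyQuery itself enumerates index terms and scores them by a similarity ratio rather than by an absolute edit count.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a fuzzy term expansion; not Lucene code.
final class PrefixFuzzySketch {

    // Levenshtein distance using two rolling rows of the DP table.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Terms that share the first prefixLength characters with the query.
    // Only these need the (expensive) edit-distance comparison.
    static List<String> candidates(List<String> terms, String query, int prefixLength) {
        String prefix = query.substring(0, Math.min(prefixLength, query.length()));
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                out.add(term);
            }
        }
        return out;
    }

    // Candidates within maxEdits of the query (maxEdits is an assumed cutoff).
    static List<String> fuzzyMatches(List<String> terms, String query, int prefixLength, int maxEdits) {
        List<String> out = new ArrayList<>();
        for (String term : candidates(terms, query, prefixLength)) {
            if (editDistance(term, query) <= maxEdits) {
                out.add(term);
            }
        }
        return out;
    }
}
```

With prefixLength 0 every term is compared; with a longer prefix fewer terms are compared, at the price of missing matches whose difference lies inside the prefix.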

Re: Why exactly are fuzzy queries so slow?

2007-11-26 Thread Timo Nentwig
d then if you're lucky files==philes but there's no
> room for error and they either match or they don't - there is no measure
> of similarity.
>
> There's no free lunch here.
>
> Timo Nentwig wrote:
> > On Saturday 24 November 2007 18:28:48 markharw00d wrote:

Re: FieldSelector

2007-11-30 Thread Timo Nentwig
On Friday 30 November 2007 12:59:13 Grant Ingersoll wrote:
> Hmmm, I think you should be able to rely on the fact that Fields are
> stored in order of indexing and then read back in that same order.
Yeah, thought about that for a moment but this is just way too fragile.
> Otherwise, the reading twi

FieldSelector

2007-11-30 Thread Timo Nentwig
Hi! I have different document types (Books, Magazines, Authors, whatever) in the index, and a FieldSelector is document-type specific (for Books LOAD isbn and title, for Authors name, ...). The document type can be determined by a field surprisingly called documentType. How am I going to do this

Re: FieldSelector

2007-12-04 Thread Timo Nentwig
On Friday 30 November 2007 19:28:12 Grant Ingersoll wrote:
> I guess the question becomes what is the nature of your fields? Do
> you have some really large fields that you want to avoid loading b/c
> they are not shown initially? That is the main use case, I guess.
I wonder why there's not Lazy

Re: FieldSelector

2007-12-05 Thread Timo Nentwig
On Wednesday 05 December 2007 12:20:51 Grant Ingersoll wrote:
> Then, when you go to access those 4 fields, which you most certainly
> will at some point soon, otherwise why did you get the document to
Nope, I won't :) In fact my Documents contain fields I only need for searching and sorting. But

CachingWrapperFilter: why cache per IndexReader?

2008-01-01 Thread Timo Nentwig
Hi! Is there a particular reason why CachingWrapperFilter caches per IndexReader and not per IndexReader.directory()? If there are multiple IndexSearcher/IndexReader instances (and only one Directory), the cache will be built and held in memory redundantly. I don't see any sense in doing so (?).
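
The redundancy described here can be shown with a small, Lucene-free sketch. It assumes a weak-keyed map from some key object to a computed bit set, and the class name, the plain Object keys standing in for readers and a directory, and the build counter are all hypothetical. Keyed per reader, two readers over the same directory trigger two builds; keyed per directory, only one. One caveat: document numbers belong to a particular reader's view of the index, so a per-directory key would only be safe if every reader sees the same, unchanged index.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Supplier;

// Hypothetical cache keyed by an arbitrary object (a reader or a directory).
final class KeyedFilterCacheSketch<K> {
    // Weak keys: an entry can be collected once its key is no longer referenced.
    private final Map<K, BitSet> cache = new WeakHashMap<>();
    private int builds;

    // Returns the cached bits for key, building them on first access.
    synchronized BitSet get(K key, Supplier<BitSet> builder) {
        BitSet bits = cache.get(key);
        if (bits == null) {
            bits = builder.get();
            builds++;
            cache.put(key, bits);
        }
        return bits;
    }

    // Number of times the (expensive) builder actually ran.
    synchronized int builds() {
        return builds;
    }
}
```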

Re: CachingWrapperFilter: why cache per IndexReader?

2008-01-01 Thread Timo Nentwig
PROTECTED]> wrote:
> > My guess would be b/c best practice is usually to only have one Reader/
> > Searcher per Directory, but I don't know if that is the real reason.
> > Most discussions/testing I have seen indicate a single Reader/Searcher
> > performs best.
>

Re: CachingWrapperFilter: why cache per IndexReader?

2008-01-01 Thread Timo Nentwig
since does a single thread plus synchronous IO scale? However, I haven't found any discussion on this topic yet, so I'd be glad if somebody could give me a link.
> -Grant
>
> On Jan 1, 2008, at 11:57 AM, Timo Nentwig wrote:
> > Hi!
> >
> > Is there a particular reaso

Re: CachingWrapperFilter: why cache per IndexReader?

2008-01-01 Thread Timo Nentwig
On Tuesday 01 January 2008 21:06:06 Mark Miller wrote:
> The main reason to use a single IndexReader is because it's very time
> consuming to open an IndexReader. If your index is pretty static, maybe
Yes, it takes quite some time to build it, and it's not changed but rebuilt from scratch.
> Perha

Re: CachingWrapperFilter: why cache per IndexReader?

2008-01-01 Thread Timo Nentwig
fferent HDs but *of course* we're talking about multiple hard drives (at least some RAID; in my case it's some expensive NetApp, I don't know exactly which one but I can find out...).
> even better, separate systems (using RMI or something).
>
> - Mark
>
> Timo

Re: CachingWrapperFilter: why cache per IndexReader?

2008-01-06 Thread Timo Nentwig
On Wednesday 02 January 2008 08:03:48 Chris Hostetter wrote:
> 1) there is a semi-articulated goal of moving away from "under the
> covers" weakref caching to more explicit and controllable caching ...
YES! BTW, why has caching been removed from QueryFilter at all? Isn't caching the only sens

Sorting consumes hundreds of MBytes RAM

2008-04-13 Thread Timo Nentwig
Hi! I found that sorting the search result can, depending on the amount of data in the field to sort by, easily lead FieldCacheImpl to allocate hundreds of MBytes of RAM. How does this work internally? It seems as if all data for this field found in the entire index is read into memo
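
A back-of-envelope sketch of how such numbers can add up. It assumes a string sort cache that holds one int per document plus one String per distinct term, and a numeric sort cache that holds one int per document. The class name and the per-String overhead constant are assumptions for illustration, not measured values, and real figures depend on the JVM and the Lucene version.

```java
// Hypothetical estimator; all constants are rough assumptions.
final class SortCacheEstimateSketch {
    static final long ASSUMED_STRING_OVERHEAD_BYTES = 40; // object + char[] headers, assumed
    static final long BYTES_PER_CHAR = 2;                 // UTF-16 chars
    static final long BYTES_PER_INT = 4;

    // One int (ordinal) per document plus one String per distinct term.
    static long stringSortBytes(long maxDoc, long uniqueTerms, long avgTermChars) {
        long ordinals = maxDoc * BYTES_PER_INT;
        long terms = uniqueTerms * (ASSUMED_STRING_OVERHEAD_BYTES + avgTermChars * BYTES_PER_CHAR);
        return ordinals + terms;
    }

    // One int per document when the field is sorted as an int.
    static long intSortBytes(long maxDoc) {
        return maxDoc * BYTES_PER_INT;
    }
}
```

Under these assumptions, 10 million documents with 5 million distinct 20-character terms come to 440,000,000 bytes for a string sort, versus 40,000,000 bytes if the same field can be sorted as an int.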

Re: Sorting consumes hundreds of MBytes RAM

2008-04-15 Thread Timo Nentwig
ric field, make that explicit.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message ----
From: Timo Nentwig <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Sunday, April 13, 2008 4:45:37 PM
Subject: Sorting consumes hundreds of MBytes RAM
Hi! I

IndexSearcher.close() doesn't close Directory

2008-04-20 Thread Timo Nentwig
Hi! I advised Directory.close() (AspectJ) and noticed that it's not called at all for the following code:
    final FSDirectory d = FSDirectory.getDirectory( path );
    final IndexReader r = IndexReader.open( d );
    final IndexSearcher s = new IndexSearcher( r );
    ...
    s.close();
    r.close();
    d.close(); // D

@todo parallelize this one too

2008-05-05 Thread Timo Nentwig
Hi! Unfortunately the search method in ParallelMultiSearcher that takes a HitCollector doesn't run in parallel, and there's even an issue regarding this (LUCENE-990) with zero watchers or votes :-\ So this isn't something that's likely to be done in the near future, is it? And question

How to combine filter in Lucene 2.4?

2008-11-08 Thread Timo Nentwig
Hi! Since Filter.bits() is deprecated and replaced by getDocIdSet(), I now wonder how I am supposed to combine (AND) filters (for facets). I worked around this issue by extending Filter and letting getDocIdSet() return an OpenBitSet to ensure that this implementation is used everywhere and casting
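
The AND combination itself is simple once every filter is available as a bit set. The sketch below uses java.util.BitSet as a stand-in for whatever bit set implementation the filters return; the class name, the method name and the maxDoc parameter are assumptions for illustration, not Lucene API.

```java
import java.util.BitSet;

// Hypothetical helper: intersect per-facet filters given as bit sets.
final class FilterIntersectionSketch {

    // Returns a new bit set with a bit set for every document (0..maxDoc-1)
    // accepted by all filters. With no filters, every document is accepted.
    // The input bit sets are not modified.
    static BitSet and(int maxDoc, BitSet... filters) {
        BitSet result = new BitSet(maxDoc);
        result.set(0, maxDoc);
        for (BitSet filter : filters) {
            result.and(filter);
        }
        return result;
    }
}
```

Working on a fresh result rather than on the first filter's bits matters when the inputs come from a cache, since an in-place and() would corrupt the cached entry.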

Re: How to combine filter in Lucene 2.4?

2008-11-09 Thread Timo Nentwig
in two classes that do (precisely?) what you need:
> contrib/miscellaneous/**/ChainedFilter
> contrib/queries/**/BooleanFilter
>
> Regards,
> Paul Elschot
>
> On Saturday 08 November 2008 19:06:15, Timo Nentwig wrote:
> > Hi!
> >
> > Since Filter.bits() is dep

Storing fields without term positions

2006-09-12 Thread Timo Nentwig
Hi everybody, is it possible to store fields without term position data (the .prx file)? We store custom data in the field and use it as a kind of filter for queries, so we just don't need any term position data, and it bloats the index size by nearly a factor of 3. Thanks Timo ---