> > you want what Lucene already does, but that's clearly not true
> 
> Hmmm, let's pretend that "contents" field in my example wasn't analyzed at
> index time. The unstemmed form of terms will be indexed. But if I query with
> a stemmed form or use QueryParser with the SnowballAnalyzer, I'm not going to
> get a match. I could fix this situation by analyzing the indexed field at
> search time to match the query. I don't know that Lucene provides this
> opportunity and, as I said, maybe that's crazy.

I don't know of any facility in Lucene to re-analyze indexed content at search
time.  This is an IR anti-pattern: it defeats the purpose of constructing the
inverted index, which is to do the analysis work once, up front, so that
queries are cheap term lookups.

People generally decide prior to indexing what kind of queries they need to 
support, and then perform all required document analysis at index time.  To use 
your stemming example, if you need to be able to match against both stemmed and 
unstemmed forms, you would make both a stemmed field and an unstemmed field, 
and then construct corresponding sub-queries against each as required.
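
To make the two-field idea concrete, here is a minimal sketch in plain Java.
It is not Lucene code: the class, the field names "contents" and
"contents_stemmed", and the toy suffix-stripping stem() method are all made up
for illustration, with stem() standing in for a real stemmer such as Snowball.
It only shows each token being analyzed once, at index time, into both fields.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TwoFieldIndexSketch {
    // field name -> term -> ids of documents containing that term
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    // Toy stemmer (an assumption, NOT Snowball): strips a few common suffixes.
    public static String stem(String term) {
        for (String suffix : new String[] {"ing", "es", "s"}) {
            if (term.endsWith(suffix) && term.length() > suffix.length() + 2) {
                return term.substring(0, term.length() - suffix.length());
            }
        }
        return term;
    }

    // Index time: every token goes into both the unstemmed and the stemmed field.
    public void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            post("contents", token, docId);
            post("contents_stemmed", stem(token), docId);
        }
    }

    private void post(String field, String term, int docId) {
        postings.computeIfAbsent(field, f -> new HashMap<>())
                .computeIfAbsent(term, t -> new TreeSet<>())
                .add(docId);
    }

    // Search time: a plain term lookup in whichever field the sub-query targets.
    public Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field, Collections.emptyMap())
                       .getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        TwoFieldIndexSketch index = new TwoFieldIndexSketch();
        index.add(1, "running dogs");
        index.add(2, "the dog runs");
        // Exact form: only the document that literally contains "dogs".
        System.out.println(index.search("contents", "dogs"));
        // Stemmed form: the query term gets the same analysis as the field.
        System.out.println(index.search("contents_stemmed", stem("dogs")));
    }
}
```

The point is that the query side only has to apply the same analysis as the
field it targets; nothing in the index is touched at search time.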

> > What do you mean when you say "rewriting"
> 
> "Our current path to solving our problem requires additional fields which need
> rewritten". I meant actually altering the document in the index. My desire has
> been to write a new Query class implementation whereas you mentioned query
> rewriting (isn't that accomplished as in my example by passing an Analyzer to
> QueryParser?)

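Changing the index at query time, unless you have a tiny set of documents,
sounds like a big mistake to me.  Again, it would be extremely expensive.

By query rewriting I meant Query.rewrite(IndexReader), which Lucene calls to
rewrite a query (a PrefixQuery, for example) into primitive queries: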
<http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/Query.html#rewrite%28org.apache.lucene.index.IndexReader%29>

> > not discrete...  limited number of prefixes
> 
> So my document may have "myfield:A*foo*", "myfield:B*foo*",
> "myfield:A*dog*", and "myfield:D*cat*".
> 
> Or, to phrase differently, "myfield:[PREFIX][PATTERN]" may appear any
> number of times where PREFIX comes from the set { A, B, C, D, E, ... }.
> 
> This complexity is really a tangent of my question in order to avoid poor
> performance from WildcardQuery.

I still think you could make one field for each PREFIX, and then do whatever 
query you want on each of those fields.

If you need to support "contains" functionality (e.g. "A:*foo*"), you might 
want to look into ngram analysis at indexing time (and maybe also at query 
time, depending on the source of your queries):

<http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/analysis/ngram/NGramTokenFilter.html>

You would have to know the minimum and maximum length of the contained string
you would want to query for, since those become the filter's minGram and
maxGram settings.
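
As a rough illustration of why those two lengths matter, here is a plain-Java
sketch of character n-gram generation.  It is not the NGramTokenFilter
implementation, and the class and method names are invented for the example;
it just shows that a "contains" query turns into an exact term lookup, but
only for strings whose length falls between the minimum and maximum gram size.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NGramSketch {
    // Every substring of text whose length is between minGram and maxGram.
    public static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int size = minGram; size <= maxGram; size++) {
            for (int start = 0; start + size <= text.length(); start++) {
                grams.add(text.substring(start, start + size));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Index time: the field value is expanded into its 2- to 4-character grams.
        Set<String> indexedTerms = new HashSet<>(ngrams("seafood", 2, 4));

        // Query time: "*foo*" becomes an exact lookup of the term "foo".
        System.out.println(indexedTerms.contains("foo"));

        // A 5-character string exceeds maxGram, so it was never indexed as a term.
        System.out.println(indexedTerms.contains("seafo"));
    }
}
```

The trade-off is index size: the wider the range of gram lengths, the more
terms each field value produces.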

Steve
