Ranking docs with all terms higher

2011-05-18 Thread Christopher Condit
Let's say I have the query (nacho OR foo OR bar) and some documents (single field with norms off): doc a: "nacho nacho nacho nacho"; doc b: "foo bar bar"; doc c: "nacho foo bar". I'm interested in all of these documents but I would like c to score the highest since it contains all of the search terms, b to
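One common answer in this era of Lucene is the coord() factor: DefaultSimilarity already multiplies a BooleanQuery's score by overlap/maxOverlap (matched clauses over total clauses), and you can exaggerate it by overriding coord() in a Similarity subclass. A minimal sketch of the boost arithmetic, assuming that squaring the ratio is aggressive enough (tune to taste):

```java
// Sketch, not Lucene's actual code: compute an exaggerated coord factor.
// overlap = number of query clauses the document matched;
// maxOverlap = total number of clauses in the BooleanQuery.
public class CoordSketch {
    public static float coordBoost(int overlap, int maxOverlap) {
        float ratio = (float) overlap / maxOverlap;
        // Squaring widens the gap so a doc matching all terms beats a doc
        // repeating one term many times (assumption: quadratic is enough).
        return ratio * ratio;
    }
}
```

To plug this in, subclass DefaultSimilarity, override coord(int overlap, int maxOverlap) to return coordBoost(overlap, maxOverlap), and install it with searcher.setSimilarity(...).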

Re: Can't perform exact match...?

2011-04-15 Thread Christopher Condit
On 4/11/2011 1:47 AM, Chris Mantle wrote: > Hi, I’m having some trouble with Lucene at the moment. I have a number of > unique identifiers that I need to search through. They’re in many different > forms, e.g. “M”, “MO”, “:MOFB”, “FH..L-O”, etc. All I need to do is an exact > prefix search: at th

best practice for reusing documents with multi-valued fields

2011-04-14 Thread Christopher Condit
I know that it's best practice to reuse the Document object when indexing, but I'm curious how multi-valued fields affect this. I tried this before indexing each document: doc.removeFields(myMultiValuedField); for (String fieldName : fieldNames) { Field field = doc.getField(fieldName); if (null != f
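A minimal sketch of the reuse pattern, assuming the Lucene 3.x Document API: removeFields(name) drops every value of a multi-valued field, after which fresh values can be added to the reused Document. Field names and store/index options here are illustrative, not from the thread.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ReuseDoc {
    // Sketch: clear ALL values of a multi-valued field on a reused Document,
    // then re-add the values for the next document.
    static void resetMultiValued(Document doc, String name, String[] newValues) {
        doc.removeFields(name); // removes every Field instance with this name
        for (String v : newValues) {
            doc.add(new Field(name, v, Field.Store.YES, Field.Index.ANALYZED));
        }
    }
}
```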

Field Aware TokenFilter

2011-04-04 Thread Christopher Condit
I need to add synonyms to an index depending on the field being indexed. I know that TokenFilter is not "field aware", but is there a good way to get at the field or do I need to add something to allow my Analyzer to tell the TokenFilter which field is currently being examined? Thanks, -Chris ---
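One answer suggested in threads like this: the TokenFilter never sees the field, but the Analyzer does, because tokenStream() receives the field name and can wire the synonym filter only into the fields that need it. A sketch under that assumption; the field name "body" and the synonym filter are hypothetical placeholders:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Sketch (Lucene 3.x): choose the filter chain per field inside the Analyzer,
// since tokenStream() is handed the field name.
public class PerFieldSynonymAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader);
        if ("body".equals(fieldName)) {
            // ts = new MySynonymFilter(ts); // hypothetical field-specific filter
        }
        return ts;
    }
}
```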

Using IndexWriterConfig repeatedly in 3.1

2011-04-01 Thread Christopher Condit
I see in the JavaDoc for IndexWriterConfig that: "Note that IndexWriter makes a private clone; if you need to subsequently change settings use IndexWriter.getConfig()." However when I attempt to use the same IndexWriterConfig to create multiple IndexWriters the following exception is thrown: org.
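A sketch of the workaround implied by that JavaDoc note, assuming Lucene 3.1: since the writer privately clones its config and a config instance cannot simply be handed around, build a fresh IndexWriterConfig per writer and use writer.getConfig() for any later changes.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class FreshConfigPerWriter {
    // Sketch: one fresh IndexWriterConfig per IndexWriter, never shared.
    static IndexWriter open(Directory dir) throws IOException {
        IndexWriterConfig conf = new IndexWriterConfig(
                Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
        return new IndexWriter(dir, conf);
    }
}
```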

Re: Best practice for stemming and exact matching

2011-04-01 Thread Christopher Condit
>> Ideally I'd like to have the parser use the >> custom analyzer for everything unless it's going to parse a clause into >> a PhraseQuery or a MultiPhraseQuery, in which case it uses the >> SimpleAnalyzer and looks in the _exact field - but I can't figure out >> the best way to accomplish this. >

Best practice for stemming and exact matching

2011-03-29 Thread Christopher Condit
I have Lucene indexes built using a shingled, stemmed custom analyzer. I have a new requirement that exact searches match correctly, i.e. bar AND "nachos" will only fetch results with the plural "nachos". Right now, with the stemming, singular "nacho" results are returned as well. I realize that I'm going t
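One common sketch for this requirement: index the text twice, stemmed in the main field and unstemmed in a sibling "_exact" field, pass the parser a PerFieldAnalyzerWrapper so each field gets the right analyzer, and reroute quoted phrases to the exact field. The field names are assumptions, and getFieldQuery(field, text, slop) is the hook Lucene 3.x's parser uses for quoted phrases:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

// Sketch: phrase queries go to "content_exact"; everything else stays on the
// stemmed "content" field. Pass a PerFieldAnalyzerWrapper as the analyzer so
// "content_exact" is analyzed without stemming.
public class ExactPhraseQueryParser extends QueryParser {
    public ExactPhraseQueryParser(Analyzer perFieldAnalyzer) {
        super(Version.LUCENE_30, "content", perFieldAnalyzer);
    }

    @Override
    protected Query getFieldQuery(String field, String queryText, int slop)
            throws ParseException {
        String target = "content".equals(field) ? "content_exact" : field;
        return super.getFieldQuery(target, queryText, slop);
    }
}
```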

Phrase query with boolean matches

2011-02-14 Thread Christopher Condit
I'm trying to use the QueryParser in 3.0.2 to make "foo and bar" (with the quotes) return documents with the exact phrase "foo and bar". When I run it through the QueryParser (with a StandardAnalyzer) I end up with "foo ? bar", which doesn't match the documents in the index. I know that "and" is a
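The "?" is the position gap StandardAnalyzer leaves after removing the stopword "and". A sketch of the usual fix, assuming the field name "body": hand the parser a StandardAnalyzer with an empty stop set. Note the index must be built with the same analyzer, or the phrase still won't match.

```java
import java.util.Collections;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class PhraseWithStopwords {
    // Sketch: an empty stop set keeps "and" in the token stream, so
    // "\"foo and bar\"" parses to a three-term PhraseQuery with no gap.
    static Query parse(String phrase) throws ParseException {
        Analyzer noStops =
                new StandardAnalyzer(Version.LUCENE_30, Collections.<String>emptySet());
        QueryParser qp = new QueryParser(Version.LUCENE_30, "body", noStops);
        return qp.parse(phrase);
    }
}
```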

Best practice for embedding extra information in an index

2010-09-21 Thread Christopher Condit
I'm curious about embedding extra information in an index (and being able to search the extra information as well). In this case certain tokens correspond to recognized entities with ids. I'd like to get the ids into the index so that searching for the id of the entity will also return that docu
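The least invasive sketch of this idea, with an assumed field name and id format: add each recognized entity's id as a separate, unanalyzed field on the document, so a search on that field returns the document.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class EntityIds {
    // Sketch: one unanalyzed, multi-valued field per document holding the ids
    // of the entities recognized in its text.
    static void addEntity(Document doc, String entityId) {
        doc.add(new Field("entityId", entityId,
                Field.Store.NO, Field.Index.NOT_ANALYZED));
    }
}
```

If the id must instead sit at the same token position as the surface text (e.g. for phrase or span queries), the usual alternative is a TokenFilter that emits the id as an extra token with position increment 0.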

RE: Question to the writer of MultiPassIndexSplitter

2010-08-05 Thread Christopher Condit
> > > I heard work is being done on re-writing MultiPassIndexSplitter so it > > > will be a single pass and work more quickly. > > Because that was so slow I just wrote a utility class to create a list of N > > IndexWriters and round-robin documents to them as the index is created. > > Then we use a Pa

RE: Question to the writer of MultiPassIndexSplitter

2010-08-03 Thread Christopher Condit
> I heard work is being done on re-writing MultiPassIndexSplitter so it will be a single pass and work more quickly. Because that was so slow I just wrote a utility class to create a list of N IndexWriters and round-robin documents to them as the index is created. Then we use a ParallelMultiSear
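The round-robin split described above can be sketched in a few lines: document i goes to writer i % n, so the N sub-indexes stay balanced and can later be searched together.

```java
// Sketch of the round-robin assignment: which of the N writers gets doc i.
public class RoundRobin {
    static int targetWriter(int docIndex, int numWriters) {
        return docIndex % numWriters;
    }
}
// Lucene side (assumed setup, not shown):
// writers[RoundRobin.targetWriter(i, writers.length)].addDocument(doc);
```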

RE: Best practices for searcher memory usage?

2010-07-15 Thread Christopher Condit
> [Toke: No frequent updates] > > So everything is rebuilt from scratch each time? Or do you mean that you're > only adding new documents, not changing old ones? Everything is reindexed from scratch - indexing speed is not essential to us... > Either way, optimizing to a single 140GB segment is

RE: Best practices for searcher memory usage?

2010-07-14 Thread Christopher Condit
Hi Toke- > > * 20 million documents [...] > > * 140GB total index size > > * Optimized into a single segment > > I take it that you do not have frequent updates? Have you tried to see if you > can get by with more segments without significant slowdown? Correct - in fact there are no updates and n

Best practices for searcher memory usage?

2010-07-13 Thread Christopher Condit
We're getting up there in terms of corpus size for our Lucene indexing application: * 20 million documents * all fields need to be stored * 10 short fields / document * 1 long free text field / document (analyzed with a custom shingle-based analyzer) * 140GB total index size * Optimized into a s

RE: Stemming Problem

2010-05-18 Thread Christopher Condit
Hi Larry- > Right now I'm using Lucene with a basic WhitespaceAnalyzer but I'm having > problems with stemming. Does anyone have a recommendation for other > text analyzers that handle stemming and also keep capitalization, stop words, > and punctuation? Have you tried the SnowballFilter? You co
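A sketch of the SnowballFilter suggestion, assuming the contrib snowball jar: a whitespace tokenizer keeps capitalization and punctuation intact, and SnowballFilter stems whatever tokens come through. One caveat: Snowball stemmers expect lowercase input, so stemming quality on capitalized words may suffer; that trade-off is part of the question here.

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.snowball.SnowballFilter;

// Sketch: stem without lowercasing or removing stop words / punctuation.
public class StemKeepCaseAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new SnowballFilter(new WhitespaceTokenizer(reader), "English");
    }
}
```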

RE: Modify TermQueries or Tokens

2010-05-01 Thread Christopher Condit
> It looks good to me, but I did not test; when testing, we may print out both > > initialQuery.toString() // query produced by QueryParser > finalQuery.toString() // query after your new function > > as a comparison, besides testing the query result. Yes - it's exactly what I wanted: Test Input

RE: Modify TermQueries or Tokens

2010-04-30 Thread Christopher Condit
> 2) if I have to accept the whole input string with all logic (AND, OR, ..) inside, I think it is easier to change TermQuery afterwards than parsing the string, since the final result from the query parser should be a BooleanQuery (in your example); then we iterate through each BooleanClause

RE: Modify TermQueries or Tokens

2010-04-30 Thread Christopher Condit
Hi Lisheng- >> On a small index that I have I'd like to query certain fields by adding wildcards >> on either side of the term: foo -> *foo*. I realize the performance >> implications but there are some cases where these terms are crammed >> together in the indexed content (i.e. foonacho) and I

Modify TermQueries or Tokens

2010-04-30 Thread Christopher Condit
On a small index that I have I'd like to query certain fields by adding wildcards on either side of the term: foo -> *foo*. I realize the performance implications but there are some cases where these terms are crammed together in the indexed content (i.e. foonacho) and I need to be able to return
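A sketch of the two usual routes, assuming the field name "content": QueryParser rejects leading wildcards unless explicitly enabled, or the WildcardQuery can be built directly. Both forms scan every term in the field, hence the performance caveat above.

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.util.Version;

public class DoubleWildcard {
    // Sketch: enable leading wildcards, then wrap the term as *term*.
    static Query viaParser(String term) throws ParseException {
        QueryParser qp =
                new QueryParser(Version.LUCENE_30, "content", new WhitespaceAnalyzer());
        qp.setAllowLeadingWildcard(true);
        return qp.parse("*" + term + "*");
    }

    // Sketch: skip the parser entirely and build the query programmatically.
    static Query direct(String term) {
        return new WildcardQuery(new Term("content", "*" + term + "*"));
    }
}
```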

RE: recovering payload from fields

2010-02-27 Thread Christopher Condit
> It sounds like you need to iterate through all terms sequentially in a given > field in the doc, accessing offset & payload? In which case reanalyzing at > search time may be the best way to go. If it matters it doesn't need to be sequential. I just need access to all the payloads for a given

RE: recovering payload from fields

2010-02-26 Thread Christopher Condit
> Payload data is accessed through PayloadSpans, so using SpanQueries is the > entry point, it seems. There are tools like PayloadSpanUtil that convert other > queries into SpanQueries for this purpose if needed, but the bottom line is that > the payload API goes through Spans. So t

RE: recovering payload from fields

2010-02-26 Thread Christopher Condit
Hi Chris- > To my knowledge, the character position of the tokens is not preserved by > Lucene - only the ordinal position of tokens within a document / field is > preserved. Thus you need to store this character offset information > separately, say, as Payload data. Thanks for the information. S

recovering payload from fields

2010-02-26 Thread Christopher Condit
I'm trying to store semantic information in payloads at index time. I believe this part is successful - but I'm having trouble getting access to the payload locations after the index is created. I'd like to know the offset in the original text for the token with the payload - and get this inform
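The indexing side of this can be sketched as a TokenFilter (Lucene 3.x attributes assumed) that stashes each token's start offset in its payload, so the original character position survives into the index and can be read back later through span queries, as the replies in this thread discuss.

```java
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

// Sketch: copy each token's start offset into a 4-byte payload.
public final class OffsetPayloadFilter extends TokenFilter {
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    protected OffsetPayloadFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        byte[] bytes = ByteBuffer.allocate(4).putInt(offsetAtt.startOffset()).array();
        payloadAtt.setPayload(new Payload(bytes));
        return true;
    }
}
```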

Snowball Stemmer Question

2009-12-03 Thread Christopher Condit
The Snowball Analyzer works well for certain constructs but not others. In particular I'm having a problem with things like "colossal" vs "colossus" and "hippocampus" vs "hippocampal". Is there a way to customize the analyzer to include these rules? Thanks, -Chris ---

RE: Analysis Question

2009-08-06 Thread Christopher Condit
Hi Anshum- > You might want to look at writing a custom analyzer or something and > add a document boost (while indexing) for documents containing those terms. Do you know how to access the document from an analyzer? It seems to only have access to the field... Thanks, -Chris ---

RE: Analysis Question

2009-08-05 Thread Christopher Condit
> From: Christopher Condit [mailto:con...@sdsc.edu] > Sent: Tuesday, July 21, 2009 2:48 PM > To: java-user@lucene.apache.org > Subject: Analysis Question > > I'm trying to implement an analyzer that will compute a score based on > vocabulary terms in the indexed content

Analysis Question

2009-07-21 Thread Christopher Condit
I'm trying to implement an analyzer that will compute a score based on vocabulary terms in the indexed content (ie a document field with more terms in the vocabulary will score higher). Although I can see the tokens I can't seem to access the document from the analyzer to set a new field on it a
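As the replies note, an Analyzer only sees one field's token stream, not the Document. A sketch of the alternative: count vocabulary hits before indexing and apply the result as a document boost (doc.setBoost in Lucene 3.x). The linear 1 + hits formula is an assumption for illustration, not a recommendation.

```java
import java.util.Set;

public class VocabBoost {
    // Sketch: boost grows with the number of tokens found in the vocabulary.
    static float boostFor(String[] tokens, Set<String> vocabulary) {
        int hits = 0;
        for (String token : tokens) {
            if (vocabulary.contains(token)) {
                hits++;
            }
        }
        return 1.0f + hits;
    }
}
// Lucene side (assumed): doc.setBoost(VocabBoost.boostFor(tokens, vocab));
// writer.addDocument(doc);
```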