Let's say I have the query
(nacho OR foo OR bar)
and some documents (single field with norms off)
doc a: nacho nacho nacho nacho
doc b: foo bar bar
doc c: nacho foo bar
I'm interested in all of these documents but I would like c to score the
highest since it contains all of the search terms, b to
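A rough sketch of one way to get that ordering, assuming Lucene 3.x: override Similarity so the coord factor dominates and term frequency is flattened (so doc a's repeated "nacho" can't outrank doc c), then install it with Searcher.setSimilarity(). Untested, sketch only:

import org.apache.lucene.search.DefaultSimilarity;

public class CoordHeavySimilarity extends DefaultSimilarity {
    @Override
    public float coord(int overlap, int maxOverlap) {
        // Square the match ratio so matching all clauses dominates.
        float ratio = (float) overlap / maxOverlap;
        return ratio * ratio;
    }

    @Override
    public float tf(float freq) {
        // Flatten tf so repetition alone (doc a) can't beat breadth (doc c).
        return freq > 0 ? 1.0f : 0.0f;
    }
}

With this, c (3/3 clauses) outscores b (2/3), which outscores a (1/3).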
On 4/11/2011 1:47 AM, Chris Mantle wrote:
> Hi, I’m having some trouble with Lucene at the moment. I have a number of
> unique identifiers that I need to search through. They’re in many different
> forms, e.g. “M”, “MO”, “:MOFB”, “FH..L-O”, etc. All I need to do is an exact
> prefix search: at th
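A minimal sketch of the exact-prefix setup, assuming a 3.x-era API: index the identifier untokenized so the punctuation survives analysis, then search with a PrefixQuery:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;

// Indexing: NOT_ANALYZED keeps the identifier as one untouched term.
Document doc = new Document();
doc.add(new Field("id", ":MOFB", Field.Store.YES, Field.Index.NOT_ANALYZED));

// Searching: exact prefix match, punctuation included.
PrefixQuery query = new PrefixQuery(new Term("id", ":MO"));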
I know that it's best practice to reuse the Document object when
indexing, but I'm curious how multi-valued fields affect this. I tried
this before indexing each document:
doc.removeFields(myMultiValuedField);
for (String fieldName : fieldNames) {
    // getField() takes the field name, not a Field object
    Field field = doc.getField(fieldName);
    if (null != f
I need to add synonyms to an index depending on the field being indexed.
I know that TokenFilter is not "field aware", but is there a good way to
get at the field or do I need to add something to allow my Analyzer to
tell the TokenFilter which field is currently being examined?
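Something like this sketch is what I have in mind, assuming the Analyzer's tokenStream() hook is the right place (MySynonymFilter is a placeholder for my own filter):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class PerFieldSynonymAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // The analyzer does see the field name here, so it can hand it
        // to the filter at construction time.
        return new MySynonymFilter(new WhitespaceTokenizer(reader), fieldName);
    }
}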
Thanks,
-Chris
---
I see in the JavaDoc for IndexWriterConfig that:
"Note that IndexWriter makes a private clone; if you need to
subsequently change settings use IndexWriter.getConfig()."
However, when I attempt to use the same IndexWriterConfig to create
multiple IndexWriters the following exception is thrown:
org.
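For what it's worth, a workaround sketch, assuming 3.x: give each writer its own IndexWriterConfig instead of sharing one instance (dir1/dir2 are your Directory instances):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

// One config instance per writer; never reuse a config across writers.
IndexWriter w1 = new IndexWriter(dir1,
    new IndexWriterConfig(Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31)));
IndexWriter w2 = new IndexWriter(dir2,
    new IndexWriterConfig(Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31)));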
>> Ideally I'd like to have the parser use the
>> custom analyzer for everything unless it's going to parse a clause into
>> a PhraseQuery or a MultiPhraseQuery, in which case it uses the
>> SimpleAnalyzer and looks in the _exact field - but I can't figure out
>> the best way to accomplish this.
>
I have Lucene indexes built using a shingled, stemmed custom analyzer.
I have a new requirement that exact searches match correctly.
i.e.: bar AND "nachos"
will only fetch results with plural nachos. Right now, with the
stemming, singular nacho results are returned as well. I realize that
I'm going t
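One possible shape for this, sketched under two assumptions: the text is also indexed into a parallel unstemmed field (here body_exact, analyzed via a PerFieldAnalyzerWrapper that maps it to SimpleAnalyzer), and quoted phrases are the only thing that should hit it. Subclass QueryParser and redirect just the phrase path:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class ExactPhraseQueryParser extends QueryParser {
    public ExactPhraseQueryParser(Version v, String field, Analyzer a) {
        super(v, field, a);
    }

    @Override
    protected Query getFieldQuery(String field, String queryText, int slop)
            throws ParseException {
        // Quoted phrases arrive here; send them to the unstemmed field.
        return super.getFieldQuery(field + "_exact", queryText, slop);
    }
}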
I'm trying to use the QueryParser in 3.0.2 to make "foo and bar" (with
the quotes) return documents with the exact phrase "foo and bar". When I
run it through the QueryParser (with a StandardAnalyzer) I end up with
"foo ? bar", which doesn't match the documents in the index. I know that
"and" is a
I'm curious about embedding extra information in an index (and being able to
search the extra information as well). In this case certain tokens correspond
to recognized entities with ids. I'd like to get the ids into the index so that
searching for the id of the entity will also return that docu
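A rough sketch of one way to do this with the 3.x attribute API: a TokenFilter that stacks the entity id at the same position as the recognized token, synonym-style. The ids map is assumed to come from the entity recognizer; untested:

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.AttributeSource;

public final class EntityIdFilter extends TokenFilter {
    private final Map<String, String> ids; // surface form -> entity id
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt =
        addAttribute(PositionIncrementAttribute.class);
    private String pendingId;
    private AttributeSource.State saved;

    public EntityIdFilter(TokenStream in, Map<String, String> ids) {
        super(in);
        this.ids = ids;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pendingId != null) {
            // Emit the id as an extra token at the same position.
            restoreState(saved);
            termAtt.setTermBuffer(pendingId);
            posIncrAtt.setPositionIncrement(0);
            pendingId = null;
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String id = ids.get(termAtt.term());
        if (id != null) {
            pendingId = id;
            saved = captureState();
        }
        return true;
    }
}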
> > > I heard work is being done on re-writing MultiPassIndexSplitter so it
> > > will be a single pass and work quicker.
> > Because that was so slow I just wrote a utility class to create a list of N
> > IndexWriters and round robin documents to them as the index is created.
> > Then we use a Pa
> I heard work is being done on re-writing MultiPassIndexSplitter so it will be
> a
> single pass and work quicker.
Because that was so slow I just wrote a utility class to create a list of N
IndexWriters and round robin documents to them as the index is created. Then we
use a ParallelMultiSear
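For reference, the round-robin part is only a few lines; a sketch, with writer setup and closing omitted:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

void indexRoundRobin(Iterable<Document> docs, IndexWriter[] writers)
        throws IOException {
    int next = 0;
    for (Document doc : docs) {
        // Each writer owns one sub-index; documents alternate among them.
        writers[next++ % writers.length].addDocument(doc);
    }
}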
> [Toke: No frequent updates]
>
> So everything is rebuilt from scratch each time? Or do you mean that you're
> only adding new documents, not changing old ones?
Everything is reindexed from scratch - indexing speed is not essential to us...
> Either way, optimizing to a single 140GB segment is
Hi Toke-
> > * 20 million documents [...]
> > * 140GB total index size
> > * Optimized into a single segment
>
> I take it that you do not have frequent updates? Have you tried to see if you
> can get by with more segments without significant slowdown?
Correct - in fact there are no updates and n
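(For reference, partially optimizing is a one-liner in the 3.x API, if it turns out to help:)

// Merge down to at most 10 segments instead of all the way to 1.
writer.optimize(10);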
We're getting up there in terms of corpus size for our Lucene indexing
application:
* 20 million documents
* all fields need to be stored
* 10 short fields / document
* 1 long free text field / document (analyzed with a custom shingle-based
analyzer)
* 140GB total index size
* Optimized into a s
Hi Larry-
> Right now I'm using Lucene with a basic WhitespaceAnalyzer but I'm having
> problems with stemming. Does anyone have a recommendation for other
> text analyzers that handle stemming and also keep capitalization, stop words,
> and punctuation?
Have you tried the SnowballFilter? You co
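A sketch of that combination, assuming the contrib SnowballFilter: stem on top of a whitespace tokenizer so case, stopwords, and punctuation pass through untouched. One caveat: Snowball stemmers generally expect lowercased input, so capitalized tokens may stem imperfectly.

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.snowball.SnowballFilter;

public class StemmingWhitespaceAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Whitespace tokenization keeps case and punctuation; only
        // stemming is layered on top.
        return new SnowballFilter(new WhitespaceTokenizer(reader), "English");
    }
}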
> It looks good to me, but I did not test it. When testing, we may print out both
>
> initialQuery.toString() // query produced by QueryParser
> finalQuery.toString() // query after your new function
>
> as a comparison, besides testing the query result.
Yes - it's exactly what I wanted:
Test Input
> 2) if I have to accept the whole input string with all the logic (AND, OR,
> ..) inside, I think it is easier to change the TermQuery afterwards than
> to parse the string, since the final result from the query parser should
> be a BooleanQuery (in your example); then we iterate through each
> BooleanClause
Hi Lisheng-
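A minimal sketch of that iteration, assuming the parsed result is a BooleanQuery of TermQuery clauses (parser and input as in your example):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;

Query parsed = parser.parse(input);
if (parsed instanceof BooleanQuery) {
    for (BooleanClause clause : ((BooleanQuery) parsed).getClauses()) {
        if (clause.getQuery() instanceof TermQuery) {
            Term t = ((TermQuery) clause.getQuery()).getTerm();
            // Rewrite foo -> *foo* (leading wildcards scan the term
            // dictionary, hence the performance cost noted in the
            // original question below).
            clause.setQuery(new WildcardQuery(
                new Term(t.field(), "*" + t.text() + "*")));
        }
    }
}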
>> On a small index that I have I'd like to query certain fields by adding
>> wildcards
>> on either side of the term: foo -> *foo*. I realize the performance
>> implications but there are some cases where these terms are crammed
> together in the indexed content (i.e. foonacho) and I
On a small index that I have I'd like to query certain fields by adding
wildcards on either side of the term: foo -> *foo*. I realize the performance
implications but there are some cases where these terms are crammed together in
the indexed content (i.e. foonacho) and I need to be able to return
> It sounds like you need to iterate through all terms sequentially in a given
> field in the doc, accessing offset & payload? In which case reanalyzing at
> search time may be the best way to go.
If it matters, it doesn't need to be sequential. I just need access to all the
payloads for a given
> Payload data is accessed through PayloadSpans, so using SpanQueries is the
> entry point, it seems. There are tools like PayloadSpanUtil that convert
> other queries into SpanQueries for this purpose if needed, but the bottom
> line is that the API for Payloads goes through Spans.
So t
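A short sketch of the PayloadSpanUtil route, assuming a 3.x reader over your index (directory is your Directory):

import java.util.Collection;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.payloads.PayloadSpanUtil;

IndexReader reader = IndexReader.open(directory);
PayloadSpanUtil psu = new PayloadSpanUtil(reader);
// Converts the query to spans internally and collects matching payloads.
Collection<byte[]> payloads =
    psu.getPayloadsForQuery(new TermQuery(new Term("body", "nacho")));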
Hi Chris-
> To my knowledge, the character position of the tokens is not preserved by
> Lucene - only the ordinal position of tokens within a document / field is
> preserved. Thus you need to store this character offset information
> separately, say, as Payload data.
Thanks for the information. S
I'm trying to store semantic information in payloads at index time. I believe
this part is successful - but I'm having trouble getting access to the payload
locations after the index is created. I'd like to know the offset in the
original text for the token with the payload - and get this inform
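A rough sketch of the "store the character offset as payload" advice from this thread: a TokenFilter that copies each token's start offset into its payload at index time, so span queries can recover it later. Untested:

import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

public final class OffsetPayloadFilter extends TokenFilter {
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    public OffsetPayloadFilter(TokenStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // Store the start offset as a 4-byte payload on every token.
        byte[] bytes =
            ByteBuffer.allocate(4).putInt(offsetAtt.startOffset()).array();
        payloadAtt.setPayload(new Payload(bytes));
        return true;
    }
}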
The Snowball Analyzer works well for certain constructs but not others. In
particular I'm having a problem with things like "colossal" vs "colossus" and
"hippocampus" vs "hippocampal".
Is there a way to customize the analyzer to include these rules?
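One pattern that might work (a sketch; the override map is my own idea, not a built-in Snowball feature): normalize the known exceptions to a shared form in a small filter placed before the stemmer, so both variants end up with the same stem:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class StemOverrideFilter extends TokenFilter {
    private static final Map<String, String> OVERRIDES =
        new HashMap<String, String>();
    static {
        // Map each variant to one form so the pairs stem identically.
        OVERRIDES.put("colossus", "colossal");
        OVERRIDES.put("hippocampus", "hippocampal");
    }

    private final TermAttribute termAtt = addAttribute(TermAttribute.class);

    public StemOverrideFilter(TokenStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String replacement = OVERRIDES.get(termAtt.term());
        if (replacement != null) {
            termAtt.setTermBuffer(replacement);
        }
        return true;
    }
}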
Thanks,
-Chris
---
Hi Anshum-
> You might want to look at writing a custom analyzer or something and add
> a document boost (while indexing) for documents containing those terms.
Do you know how to access the document from an analyzer? It seems to only have
access to the field...
Thanks,
-Chris
---
> From: Christopher Condit [mailto:con...@sdsc.edu]
> Sent: Tuesday, July 21, 2009 2:48 PM
> To: java-user@lucene.apache.org
> Subject: Analysis Question
>
> I'm trying to implement an analyzer that will compute a score based on
> vocabulary terms in the indexed content
I'm trying to implement an analyzer that will compute a score based on
vocabulary terms in the indexed content (ie a document field with more terms in
the vocabulary will score higher). Although I can see the tokens I can't seem
to access the document from the analyzer to set a new field on it a
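A sketch of the boost-at-index-time variant of that suggestion, computed before the document reaches the writer rather than inside the Analyzer (vocabulary is an assumed Set<String>; the whitespace split is a stand-in for real tokenization):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
doc.add(new Field("content", text, Field.Store.YES, Field.Index.ANALYZED));

// Count vocabulary hits up front; the Analyzer never needs the Document.
int hits = 0;
for (String token : text.toLowerCase().split("\\s+")) {
    if (vocabulary.contains(token)) hits++;
}
doc.setBoost(1.0f + 0.1f * hits);
writer.addDocument(doc);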