> The Analysis page will help you lots here if you're in Solr.
>
> StandardAnalyzer could well be splitting on '-' if you're using that.
>
> Best
> Erick
>
> On Wed, Sep 8, 2010 at 5:27 PM, Max Lynch wrote:
>
Hi,
I am using the StandardAnalyzer, but I am not interested in converting words
like Wi-Fi into "Wi" and "Fi". Rather, "WI" is an important word for my
users (indicating the state of Wisconsin) and I need "WI" to only match the
distinct word.
I know in Solr I can set generateWordParts="0" for my
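The difference between the two analyzers is easy to see by running them over the text directly. This is only a sketch, assuming a Lucene 2.9/3.0 classpath; the field name "contents" is made up:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {
    // Collect the tokens an analyzer produces for the given text.
    static List<String> tokens(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("contents", new StringReader(text));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        List<String> out = new ArrayList<String>();
        while (ts.incrementToken()) {
            out.add(term.term());
        }
        ts.close();
        return out;
    }

    public static void main(String[] args) throws Exception {
        // StandardAnalyzer lower-cases and splits "Wi-Fi" on the hyphen,
        // so a query for "WI" (Wisconsin) would also hit "Wi-Fi".
        System.out.println(tokens(new StandardAnalyzer(Version.LUCENE_30), "Wi-Fi in WI"));
        // WhitespaceAnalyzer keeps "Wi-Fi" as one token and preserves case.
        System.out.println(tokens(new WhitespaceAnalyzer(), "Wi-Fi in WI"));
    }
}
```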
Erick Erickson wrote:
> H, if you somehow know the last date you processed, why wouldn't using
> a
> range query work for you? I.e.
> date:[ TO ]?
>
> Best
> Erick
>
> On Wed, Jul 14, 2010 at 10:37 AM, Max Lynch wrote:
>
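In code, Erick's suggestion maps onto TermRangeQuery (available in 2.9+). A sketch; the "date" field name and the sortable yyyyMMdd string format are my assumptions, not something from the thread:

```java
import org.apache.lucene.search.TermRangeQuery;

public class SinceLastRun {
    // Everything whose "date" term sorts after lastProcessed; assumes dates
    // were indexed as yyyyMMdd strings. Field name is hypothetical.
    public static TermRangeQuery sinceQuery(String lastProcessed) {
        return new TermRangeQuery("date", lastProcessed, null,
                false /* exclude the lower bound itself */,
                true  /* no upper bound, so this flag is moot */);
    }
}
```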
> You could have a field within each doc say "Processed" and store a
> value Yes/No, next run a searcher query which should give you the
> collection of unprocessed ones.
>
That sounds like a reasonable idea, and I just realized that I could have
done that in a way specific to my application. Howe
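For completeness, the "Processed" flag approach is a one-line query. Untested sketch; the "processed" field name and the "no" value are invented, and the field would need to be indexed NOT_ANALYZED so the literal value is searchable:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

public class Unprocessed {
    // Matches docs whose NOT_ANALYZED "processed" field holds the literal "no".
    public static TermQuery query() {
        return new TermQuery(new Term("processed", "no"));
    }
}
```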
Hi,
I would like to continuously iterate over the documents in my lucene index
as the index is updated. Kind of like a "stream" of documents. Is there a
way I can achieve this?
Would something like this be sufficient (untested):
int currentDocId = 0;
while (true) {  // assumes an open IndexReader named "reader"
    for (; currentDocId < reader.maxDoc(); currentDocId++) {
        processDoc(reader.document(currentDocId));
    }
    reader = reader.reopen();  // pick up newly committed documents
}
> Also the query choice:"groupC, night"
> didn't give me a hit. Does the WhitespaceAnalyzer split on whitespaces
> in phrases?
>
The reason I used Whitespace Analyzer was so I could match full names like
"Max Lynch". With StandardAnalyzer this would match: "Max
Personally punctuation matters in my queries so I use WhitespaceAnalyzer. I
also only want exact hits, so that analyzer works well for me.
Also, AFAIK you don't set NOT_ANALYZED if you want to search through it.
On Wed, Feb 24, 2010 at 10:33 AM, Murdoch, Paul wrote:
> I'm using Lucene 2.9. How
>
>
> I *think* you can get what you want using SpanNotQuery - something like the
> following, using your "Microsoft Windows" example:
>
> SpanNot:
>   include:
>     SpanNear(in-order=true, slop=0):
>       SpanTerm: "Microsoft"
>       SpanTerm: "Windows"
>   exclude:
>     Span
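Written out, the suggestion looks roughly like the sketch below. One caveat: SpanNotQuery can only exclude spans you can name, so this rejects only the known "Server" follower; covering "any capitalized word" would need one exclude clause per candidate (or a custom query). It also assumes a case-preserving (e.g. whitespace-analyzed) "contents" field:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class MicrosoftWindowsSpan {
    public static SpanNotQuery build() {
        SpanQuery microsoft = new SpanTermQuery(new Term("contents", "Microsoft"));
        SpanQuery windows = new SpanTermQuery(new Term("contents", "Windows"));
        // include: "Microsoft Windows" as an exact, in-order phrase
        SpanQuery include = new SpanNearQuery(
                new SpanQuery[] { microsoft, windows }, 0, true);
        // exclude: the same phrase followed by "Server" (one known bad follower)
        SpanQuery exclude = new SpanNearQuery(
                new SpanQuery[] { microsoft, windows,
                        new SpanTermQuery(new Term("contents", "Server")) },
                0, true);
        // keep include spans that do not overlap an exclude span
        return new SpanNotQuery(include, exclude);
    }
}
```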
Hi,
I would like to do a search for "Microsoft Windows" as a span, but not match
if words before or after "Microsoft Windows" are upper cased.
For example, I want this to match: another crash for Microsoft Windows today
But not this: another crash for Microsoft Windows Server today
Is this possible?
> Alternatively, if one of the "regular" analyzers works for you *except*
> for lower-casing, just use that one for your mixed-case field and
> lower-case your input and send it to your lower-case field.
>
> Be careful to do the same steps when querying .
>
Thanks Erick, I didn't think about this.
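For what it's worth, the two-field setup wires up neatly with PerFieldAnalyzerWrapper. A sketch against the Lucene 3.0-era API; the field names "contents" and "contents_lc" are invented:

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Whitespace tokenizing plus lower-casing, for the case-insensitive field.
class LowercasingWhitespaceAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}

public class TwoFieldSetup {
    // Default to case-sensitive whitespace analysis; lower-case only "contents_lc".
    public static PerFieldAnalyzerWrapper analyzer() {
        PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer());
        wrapper.addAnalyzer("contents_lc", new LowercasingWhitespaceAnalyzer());
        return wrapper;
    }
}
```

You would index the same text into both fields and use the same wrapper at query time, so each field is analyzed consistently at both ends.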
> I just want to see if it's safe to use two different analyzers for the
> following situation:
>
> I have an index that I want to preserve case with so I can do
> case-sensitive
> searches with my WhitespaceAnalyzer. However, I also want to do case
> insensitive searches.
you should also make su
Hi,
I have a HitCollector that processes all hits from a query. I want all
hits, not the top N hits. I am converting my HitCollector to a Collector
for Lucene 3.0.0, and I'm a little confused by the new interface.
I assume that I can implement by new Collector much like the code on the API
Docs:
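For the archives, a Collector that keeps every hit looks roughly like this against the 3.0 API. There is no top-N cutoff; to keep scores as well you would hold on to the Scorer passed to setScorer and call scorer.score() inside collect:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Gathers every matching doc id, with no limit on the number of hits.
public class AllDocsCollector extends Collector {
    private int docBase;
    private final List<Integer> hits = new ArrayList<Integer>();

    public void setScorer(Scorer scorer) {
        // Not keeping scores here; store the scorer if you need them.
    }

    public void collect(int doc) {
        hits.add(docBase + doc);  // make the id index-wide, not segment-relative
    }

    public void setNextReader(IndexReader reader, int docBase) {
        this.docBase = docBase;
    }

    public boolean acceptsDocsOutOfOrder() {
        return true;  // collection order doesn't matter to us
    }

    public List<Integer> getHits() {
        return hits;
    }
}
```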
> Try running CheckIndex. First run it without -fix to see what problems there are.
> Then take a backup of the index. Then run it with -fix. The index
> will lose all docs in those segments that it removes.
>
> Can you describe what led up to this? Is it repeatable?
>
> Mike
>
> On Fri, Oc
On Wed, Nov 25, 2009 at 11:18 AM, Erick Erickson wrote:
> Why do you want to kill your indexer anyway? Just because it had
> been running "too long"? Or was it behaving poorly?
>
> But yeah, you need to change your process, you're almost guaranteeing
> that you'll corrupt your index.
I've learne
On Wed, Nov 25, 2009 at 9:49 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:
> Before 2.4 it was possible that a crash of the OS, or sudden power
> loss to the machine, could corrupt the index. But that's been fixed
> with 2.4.
>
> The only known sources of corruption are hardware faul
On Wed, Nov 25, 2009 at 9:31 AM, Ian Lea wrote:
> > What are the typical scenarios when the index will go corrupt?
>
> Dodgy disks.
>
I also have had index corruption on two occasions. It is not a big deal for
me since my data is fairly real time so the old documents aren't as
important.
Howev
http://stackoverflow.com/questions/198577/real-differences-between-java-server-and-java-client
On Mon, Nov 16, 2009 at 7:54 PM, Wenbo Zhao wrote:
> Hi, all
> I found a suggestion in 'Lucene in Action' : use 'java -server' to run
> faster.
> As I tested, it's 2 times faster than normal 'java' whi
Well already, without doing any boosting, documents matching more of the
> terms
> in your query will score higher. If you really want to make this effect
> more
> pronounced, yes, you can boost the more important query terms higher.
>
> -jake
>
But there isn't a way to determine exactly what bo
> > Now, I would like to know exactly what term was found. For example, if a
> > result comes back from the query above, how do I know whether John Smith
> > was
> > found, or both John Smith and his company, or just John Smith
> Manufacturing
> > was found?
>
>
> In general, this is actually very
> query: "San Francisco" "California" +("John Smith" "John Smith
> Manufacturing")
>
> Here the San Fran and CA clauses are optional, and the ("John Smith" OR
> "John Smith Manufacturing") is required.
>
Thanks Jake, that works nicely.
Now, I would like to know exactly what term was found. For e
> You want a query like
>
> ("San Francisco" OR "California") AND ("John Smith" OR "John Smith
> Manufacturing")
>
Won't his require San Francisco or California to be present? I do not
require them to be, I only require "John Smith" OR "John Smith
Manufacturing", but I want to get a bigger score if they are present.
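Programmatically, that shape is a BooleanQuery where the name clause is MUST and the location phrases are SHOULD, so they only raise the score. A sketch; the lower-cased terms and the "contents" field assume a StandardAnalyzer-style index:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class NameQuery {
    // Build a phrase query over the (assumed) "contents" field.
    static PhraseQuery phrase(String... words) {
        PhraseQuery p = new PhraseQuery();
        for (String w : words) p.add(new Term("contents", w));
        return p;
    }

    public static BooleanQuery build() {
        // One of the two name phrases is required...
        BooleanQuery names = new BooleanQuery();
        names.add(phrase("john", "smith"), Occur.SHOULD);
        names.add(phrase("john", "smith", "manufacturing"), Occur.SHOULD);

        BooleanQuery q = new BooleanQuery();
        q.add(names, Occur.MUST);
        // ...while the locations are optional and only boost the score.
        q.add(phrase("san", "francisco"), Occur.SHOULD);
        q.add(new TermQuery(new Term("contents", "california")), Occur.SHOULD);
        return q;
    }
}
```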
Hi,
I am trying to move from a system where I counted the frequency of terms by
hand in a highlighter to determine if a result was useful to me. In an
earlier post on this list someone suggested I could boost the terms that are
useful to me and only accept hits above a certain threshold. However,
index file, too.
>
> Bernd
>
> On Fri, Oct 2, 2009 at 17:10, Max Lynch wrote:
> > I'm getting this error when I try to run my searcher and my indexer:
> >
> > Traceback (most recent call last):
> > self.searcher = lucene.IndexSearcher(self.directory)
>
I'm getting this error when I try to run my searcher and my indexer:
Traceback (most recent call last):
self.searcher = lucene.IndexSearcher(self.directory)
JavaError: java.io.FileNotFoundException: /home/spider/misc/index/_275c.cfs
(No such file or directory)
I don't know anything about the form
Thanks Mark that's exactly what I need. How does the performance of
processing each document in the collect method of HitCollector compare to
looping through the Hits in the deprecated Hits class?
On Tue, Sep 29, 2009 at 7:40 PM, Mark Miller wrote:
> Max Lynch wrote:
> >
I would like my searches to match "John Smith" when John Smith is in a
document, but not separated with punctuation. For example, when I was using
StandardAnalyzer, "John. Smith" was matching, which is wrong for me. Right
now I am using WhitespaceAnalyzer but instead searching for "John Smith"
"J
Hi,
I am developing a search system that doesn't do pagination (searches are run
in the background and machine analyzed). However, TopDocCollector makes me
put a limit on how many results I want back. For my system, each result
found is important. How can I make it collect every result found?
T
I just want to see if it's safe to use two different analyzers for the
following situation:
I have an index that I want to preserve case with so I can do case-sensitive
searches with my WhitespaceAnalyzer. However, I also want to do case
insensitive searches.
What I did was create a custom Analy
> Couldn't you maybe get the same effect using some clever term boosting?
>
> I.. think something like
>
> "Term 1" OR "Term 2" OR "Term 3" ^ .25
>
> would return in almost the exact order that you are asking for here, with
> the only real difference being that you would have some matches for only
> do a search on "Term 1" AND "Term 2"
> do a search on "Term 1" AND "Term 2" AND "Term 3"
>
> This would ensure that you have two objects back, one of which is
> guaranteed to be a subset of the other.
I did start doing this after sending the email. My only concern is search
speed. Right now I
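The boosting variant from the earlier quote can be written directly on the query objects via setBoost. Untested sketch; the field name and terms are placeholders:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;

public class BoostedTerms {
    static PhraseQuery phrase(String... words) {
        PhraseQuery p = new PhraseQuery();
        for (String w : words) p.add(new Term("contents", w));
        return p;
    }

    public static BooleanQuery build() {
        BooleanQuery q = new BooleanQuery();
        q.add(phrase("term", "1"), Occur.SHOULD);
        q.add(phrase("term", "2"), Occur.SHOULD);
        PhraseQuery t3 = phrase("term", "3");
        t3.setBoost(0.25f);  // "Term 3" counts for less, as in the quoted ^.25
        q.add(t3, Occur.SHOULD);
        return q;
    }
}
```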
> What do you mean by "first"? Would you want to process a doc thatdid NOT
> have a "Term 3"?
>
> Let's say you have the following:
> doc1: "Term 1"
> doc2: "Term 2"
> doc3: "Term 1" "Term 2"
> doc4: "Term 3"
> doc5: "Term 1" "Term 2" "Term 3"
> doc6: "Term 2" "Term 3"
>
> Which docs do you want to
Hi,
I am doing a search on my index for a query like this:
query = "\"Term 1\" \"Term 2\" \"Term 3\""
Where I want to find Term 1, Term 2 and Term 3 in the index. However, I
only want to search for "Term 3" if I find "Term 1" and "Term 2" first, to
avoid doing processing on hits that only contai
Hello,
I am having an issue with analyzers. Right now, when I do a search, I am
searching for a whole name. For example, if I have a document like this:
"This is the document text. John Smith is mentioned right here, he is in
the john. Smith is his last name. His full name is John Smith."
If
On Wed, Jun 3, 2009 at 7:34 PM, Mark Miller wrote:
> Max Lynch wrote:
>
>> Well what happens is if I use a SpanScorer instead, and allocate it like
>> such:
>>
>>    analyzer =
> Well what happens is if I use a SpanScorer instead, and allocate it like
> > such:
> >
> >analyzer = StandardAnalyzer([])
> >tokenStream = analyzer.tokenStream("contents",
> > lucene.StringReader(text))
> >ctokenStream = lucene.CachingTokenFilter(tokenStream)
On Thu, Apr 30, 2009 at 5:16 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:
> On Thu, Apr 30, 2009 at 12:15 AM, Max Lynch wrote:
> > You should switch to the SpanScorer (in o.a.l.search.highlighter).
> >> That fragment scorer should only match true phrase
You should switch to the SpanScorer (in o.a.l.search.highlighter).
> That fragment scorer should only match true phrase matches.
>
> Mike
>
Thanks Mike. I gave it a try and it wasn't working how I expected. I am
using pylucene right now so I can ask them if the implementation is
different. I'm
Hi,
I am trying to find out exactly when a word I'm looking for in a document is
found. I've talked to a few people on IRC and it seems like the best way is
to use a highlighter. What I have right now is a system where I put each
word the highlighter is called with into a list so I then know whic