Re: "WI" not Wi-Fi

2010-09-08 Thread Max Lynch
The analysis page will help you lots here if you're in SOLR. > > StandardAnalyzer could well be splitting on '-' if you're using that. > > Best > Erick > > On Wed, Sep 8, 2010 at 5:27 PM, Max Lynch wrote: > > > Hi, > > I am using the StandardAnalyzer,

"WI" not Wi-Fi

2010-09-08 Thread Max Lynch
Hi, I am using the StandardAnalyzer, but I am not interested in converting words like Wi-Fi into "Wi" and "Fi". Rather, "WI" is an important word for my users (indicating the state of Wisconsin) and I need "WI" to only match the distinct word. I know in Solr I can set generateWordParts="0" for my
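
A minimal sketch of the tokenization difference in play here, assuming the Lucene 3.0-era analysis API; the field name "contents" and the sample sentence are illustrative. StandardAnalyzer lower-cases and splits on the hyphen, so "Wi-Fi" comes out as "wi" and "fi" and a query for "wi" also hits Wi-Fi mentions, while WhitespaceAnalyzer keeps "Wi-Fi" as one case-preserved token, leaving "WI" to match only the standalone abbreviation.

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public class AnalyzerComparison {
        // Print the tokens an analyzer produces for a sample sentence.
        static void dumpTokens(Analyzer analyzer, String text) throws Exception {
            TokenStream ts = analyzer.tokenStream("contents", new StringReader(text));
            TermAttribute term = ts.addAttribute(TermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print("[" + term.term() + "] ");
            }
            System.out.println();
        }

        public static void main(String[] args) throws Exception {
            String text = "Wi-Fi hotspots in Madison, WI";
            // splits and lower-cases: [wi] [fi] [hotspots] [madison] [wi]
            dumpTokens(new StandardAnalyzer(Version.LUCENE_30), text);
            // keeps case and hyphens: [Wi-Fi] [hotspots] [in] [Madison,] [WI]
            dumpTokens(new WhitespaceAnalyzer(), text);
        }
    }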

Re: Continuously iterate over documents in index

2010-07-15 Thread Max Lynch
Erick Erickson wrote: > H, if you somehow know the last date you processed, why wouldn't using > a > range query work for you? I.e. > date:[<last processed date> TO <now>]? > > Best > Erick > > On Wed, Jul 14, 2010 at 10:37 AM, Max Lynch wrote: > > > You could have a field within e
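
A small sketch of the range-query idea quoted above, assuming the documents carry a numeric "date" field holding epoch milliseconds (that storage choice is an assumption on top of the thread, not something it states):

    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.Query;

    public class NewDocsQuery {
        // Documents whose "date" field (epoch millis, indexed as a NumericField)
        // fall after the newest timestamp handled on the previous pass.
        public static Query newerThan(long lastProcessedMillis) {
            long now = System.currentTimeMillis();
            // exclusive lower bound so the last processed doc is not revisited,
            // inclusive upper bound to pick up everything added since
            return NumericRangeQuery.newLongRange("date", lastProcessedMillis, now, false, true);
        }
    }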

Re: Continuously iterate over documents in index

2010-07-14 Thread Max Lynch
You could have a field within each doc say "Processed" and store a > value Yes/No, next run a searcher query which should give you the > collection of unprocessed ones. > That sounds like a reasonable idea, and I just realized that I could have done that in a way specific to my application. Howe

Continuously iterate over documents in index

2010-07-13 Thread Max Lynch
Hi, I would like to continuously iterate over the documents in my lucene index as the index is updated. Kind of like a "stream" of documents. Is there a way I can achieve this? Would something like this be sufficient (untested): int currentDocId = 0; while(true) { for(; currentDocId < r
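
A rough sketch of the polling loop being asked about, against the 3.0-era IndexReader API; the index path handling and sleep interval are placeholders. One caveat: Lucene doc IDs are only stable as long as segments are not merged away, which is part of why the replies above steer toward a date range or a processed flag instead.

    import java.io.File;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class IndexTail {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File(args[0]));
            int currentDocId = 0;
            while (true) {
                // reopen so documents added since the last pass become visible
                IndexReader reader = IndexReader.open(dir, true); // read-only
                for (; currentDocId < reader.maxDoc(); currentDocId++) {
                    if (reader.isDeleted(currentDocId)) {
                        continue; // skip deleted slots
                    }
                    Document doc = reader.document(currentDocId);
                    // ... process doc ...
                }
                reader.close();
                Thread.sleep(5000); // wait before polling the index again
            }
        }
    }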

Re: StandardAnalyzer and comma

2010-02-24 Thread Max Lynch
Also the query choice:"groupC, night" > didn't give me a hit. Does the WhitespaceAnalyzer split on whitespaces > in phrases? > The reason I used Whitespace Analyzer was so I could match full names like "Max Lynch". With StandardAnalyzer this would match: "Max
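
A minimal sketch of the exact-name matching described here, assuming a case-preserving, whitespace-analyzed field (the "contents" field name is illustrative): a PhraseQuery with the default slop of 0 only matches the two tokens adjacent and in order.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    public class NamePhrase {
        // "Max" immediately followed by "Lynch", case-sensitive because the
        // field was indexed with WhitespaceAnalyzer.
        public static PhraseQuery fullName(String field, String first, String last) {
            PhraseQuery phrase = new PhraseQuery();
            phrase.add(new Term(field, first));
            phrase.add(new Term(field, last));
            return phrase;
        }
    }

Note that with WhitespaceAnalyzer punctuation stays glued to the tokens, so "Lynch." is a different term from "Lynch"; that trade-off comes up again in the punctuation threads below.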

Re: StandardAnalyzer and comma

2010-02-24 Thread Max Lynch
Personally punctuation matters in my queries so I use WhitespaceAnalyzer. I also only want exact hits, so that analyzer works well for me. Also, AFAIK you don't set NOT_ANALYZED if you want to search through it. On Wed, Feb 24, 2010 at 10:33 AM, Murdoch, Paul wrote: > I'm using Lucene 2.9. How

Re: Match span of capitalized words

2010-02-05 Thread Max Lynch
> > > I *think* you can get what you want using SpanNotQuery - something like the > following, using your "Microsoft Windows" example: > > SpanNot: >include: >SpanNear(in-order=true, slop=0): >SpanTerm: "Microsoft" >SpanTerm: "Windows" >exclude: >Span
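
The quoted reply is cut off before the exclude clause ends, so the sketch below uses a literal "Server" as a stand-in for whatever capitalized continuation is to be excluded; a fully general "no capitalized word follows" rule would need more than this. It also assumes a case-preserving analyzer, since SpanTermQuery terms must match indexed tokens exactly.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanNotQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class SpanNotExample {
        public static SpanNotQuery build(String field) {
            // include: "Microsoft" immediately followed by "Windows", in order
            SpanQuery include = new SpanNearQuery(new SpanQuery[] {
                    new SpanTermQuery(new Term(field, "Microsoft")),
                    new SpanTermQuery(new Term(field, "Windows"))
            }, 0, true);
            // exclude: the same pair immediately followed by "Server"
            SpanQuery exclude = new SpanNearQuery(new SpanQuery[] {
                    new SpanTermQuery(new Term(field, "Microsoft")),
                    new SpanTermQuery(new Term(field, "Windows")),
                    new SpanTermQuery(new Term(field, "Server"))
            }, 0, true);
            // keep the include spans that do not overlap an exclude span
            return new SpanNotQuery(include, exclude);
        }
    }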

Match span of capitalized words

2010-02-03 Thread Max Lynch
Hi, I would like to do a search for "Microsoft Windows" as a span, but not match if words before or after "Microsoft Windows" are upper cased. For example, I want this to match: another crash for Microsoft Windows today But not this: another crash for Microsoft Windows Server today Is this possib

Re: Different Analyzers

2009-12-30 Thread Max Lynch
> Alternatively, if one of the "regular" analyzers works for you *except* > for lower-casing, just use that one for your mixed-case field and > lower-case your input and send it to your lower-case field. > > Be careful to do the same steps when querying . > Thanks Erick, I didn't think about this.

Re: Different Analyzers

2009-12-30 Thread Max Lynch
> I just want to see if it's safe to use two different analyzers for the > following situation: > > I have an index that I want to preserve case with so I can do > case-sensitive > searches with my WhitespaceAnalyzer. However, I also want to do case > insensitive searches. you should also make su

Converting HitCollector to Collector

2009-12-09 Thread Max Lynch
Hi, I have a HitCollector that processes all hits from a query. I want all hits, not the top N hits. I am converting my HitCollector to a Collector for Lucene 3.0.0, and I'm a little confused by the new interface. I assume that I can implement my new Collector much like the code on the API Docs:
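
For reference, a sketch of what such a Collector can look like against the 3.0 interface, collecting every hit rather than a top N; rebasing collect()'s per-segment doc ID by docBase is the main thing that differs from the old HitCollector.

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    // Collects every matching document, not just the top N.
    public class AllHitsCollector extends Collector {
        private Scorer scorer;
        private int docBase;

        @Override
        public void setScorer(Scorer scorer) throws IOException {
            this.scorer = scorer;
        }

        @Override
        public void setNextReader(IndexReader reader, int docBase) throws IOException {
            // doc IDs passed to collect() are relative to this segment reader
            this.docBase = docBase;
        }

        @Override
        public void collect(int doc) throws IOException {
            int globalDoc = docBase + doc;  // rebase to a top-level doc ID
            float score = scorer.score();   // score is available on demand
            // ... process (globalDoc, score), e.g. add it to a list ...
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true; // ordering doesn't matter here, so allow faster scorers
        }
    }

It is then passed straight to the searcher with searcher.search(query, new AllHitsCollector()).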

Re: FileNotFoundException on index

2009-12-08 Thread Max Lynch
ing. First run it without -fix to see what problems there are. > Then take a backup of the index. Then run it with -fix. The index > will lose all docs in those segments that it removes. > > Can you describe what led up to this? Is it repeatable? > > Mike > > On Fri, Oc
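
The diagnose-then-fix workflow described here is also reachable from Java; a sketch assuming the 2.9/3.0-era CheckIndex API (and, as the reply says, back the index up first, because fixing drops every document in the segments it cannot read):

    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class RepairIndex {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File(args[0]));
            CheckIndex checker = new CheckIndex(dir);
            checker.setInfoStream(System.out);

            // diagnostic pass, equivalent to running CheckIndex without -fix
            CheckIndex.Status status = checker.checkIndex();

            if (!status.clean && args.length > 1 && "-fix".equals(args[1])) {
                // destructive: removes the documents in unreadable segments
                checker.fixIndex(status);
            }
        }
    }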

Re: best way to ensure IndexWriter won't corrupt the index?

2009-11-25 Thread Max Lynch
On Wed, Nov 25, 2009 at 11:18 AM, Erick Erickson wrote: > Why do you want to kill your indexer anyway? Just because it had > been running "too long"? Or was it behaving poorly? > > But yeah, you need to change your process, you're almost guaranteeing > that you'll corrupt your index. I've learne

Re: best way to ensure IndexWriter won't corrupt the index?

2009-11-25 Thread Max Lynch
On Wed, Nov 25, 2009 at 9:49 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > Before 2.4 it was possible that a crash of the OS, or sudden power > loss to the machine, could corrupt the index. But that's been fixed > with 2.4. > > The only known sources of corruption are hardware faul

Re: best way to ensure IndexWriter won't corrupt the index?

2009-11-25 Thread Max Lynch
On Wed, Nov 25, 2009 at 9:31 AM, Ian Lea wrote: > > What are the typical scenarios when the index will go corrupt? > > Dodgy disks. > I also have had index corruption on two occasions. It is not a big deal for me since my data is fairly real time so the old documents aren't as important. Howev

Re: What's 'java -server' option ?

2009-11-16 Thread Max Lynch
http://stackoverflow.com/questions/198577/real-differences-between-java-server-and-java-client On Mon, Nov 16, 2009 at 7:54 PM, Wenbo Zhao wrote: > Hi, all > I found a suggestion in 'Lucene in Action' : use 'java -server' to run > faster. > As I tested, it's 2 times faster than normal 'java' whi

Re: Term Boost Threshold

2009-11-13 Thread Max Lynch
Well already, without doing any boosting, documents matching more of the > terms > in your query will score higher. If you really want to make this effect > more > pronounced, yes, you can boost the more important query terms higher. > > -jake > But there isn't a way to determine exactly what bo

Re: Term Boost Threshold

2009-11-13 Thread Max Lynch
> > Now, I would like to know exactly what term was found. For example, if a > > result comes back from the query above, how do I know whether John Smith > > was > > found, or both John Smith and his company, or just John Smith > Manufacturing > > was found? > > > In general, this is actually very

Re: Term Boost Threshold

2009-11-13 Thread Max Lynch
> query: "San Francisco" "California" +("John Smith" "John Smith > Manufacturing") > > Here the San Fran and CA clauses are optional, and the ("John Smith" OR > "John Smith Manufacturing") is required. > Thanks Jake, that works nicely. Now, I would like to know exactly what term was found. For e

Re: Term Boost Threshold

2009-11-13 Thread Max Lynch
> You want a query like > > ("San Francisco" OR "California") AND ("John Smith" OR "John Smith > Manufacturing") > Won't this require San Francisco or California to be present? I do not require them to be, I only require "John Smith" OR "John Smith Manufacturing", but I want to get a bigger scor

Term Boost Threshold

2009-11-13 Thread Max Lynch
Hi, I am trying to move from a system where I counted the frequency of terms by hand in a highlighter to determine if a result was useful to me. In an earlier post on this list someone suggested I could boost the terms that are useful to me and only accept hits above a certain threshold. However,

Re: FileNotFoundException on index

2009-10-08 Thread Max Lynch
index file, too. > > Bernd > > On Fri, Oct 2, 2009 at 17:10, Max Lynch wrote: > > I'm getting this error when I try to run my searcher and my indexer: > > > > Traceback (most recent call last): > > self.searcher = lucene.IndexSearcher(self.directory) >

FileNotFoundException on index

2009-10-02 Thread Max Lynch
I'm getting this error when I try to run my searcher and my indexer: Traceback (most recent call last): self.searcher = lucene.IndexSearcher(self.directory) JavaError: java.io.FileNotFoundException: /home/spider/misc/index/_275c.cfs (No such file or directory) I don't know anything about the form

Re: TopDocCollector limits

2009-09-30 Thread Max Lynch
Thanks Mark that's exactly what I need. How does the performance of processing each document in the collect method of HitCollector compare to looping through the Hits in the deprecated Hits class? On Tue, Sep 29, 2009 at 7:40 PM, Mark Miller wrote: > Max Lynch wrote: > >

Whitespace/Standard Analyzer and punctuation

2009-09-29 Thread Max Lynch
I would like my searches to match "John Smith" when John Smith is in a document, but not separated with punctuation. For example, when I was using StandardAnalyzer, "John. Smith" was matching, which is wrong for me. Right now I am using WhitespaceAnalyzer but instead searching for "John Smith" "J

TopDocCollector limits

2009-09-29 Thread Max Lynch
Hi, I am developing a search system that doesn't do pagination (searches are run in the background and machine analyzed). However, TopDocCollector makes me put a limit on how many results I want back. For my system, each result found is important. How can I make it collect every result found? T
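
A bare-bones sketch of the collect-method approach discussed in the replies above, against the pre-3.0 HitCollector class (the class name here is made up): it receives every matching document and score, with no cap on the number of results.

    import org.apache.lucene.search.HitCollector;

    // Receives every (doc, score) pair the query matches.
    public class EveryHitCollector extends HitCollector {
        @Override
        public void collect(int doc, float score) {
            // ... process each matching document here ...
        }
    }

It runs via searcher.search(query, new EveryHitCollector()) in place of a TopDocCollector.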

Different Analyzers

2009-08-11 Thread Max Lynch
I just want to see if it's safe to use two different analyzers for the following situation: I have an index that I want to preserve case with so I can do case-sensitive searches with my WhitespaceAnalyzer. However, I also want to do case insensitive searches. What I did was create a custom Analy
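
One way to wire this up, sketched with made-up field names ("contents" for the case-preserved text, "contents_lc" for a lower-cased copy of the same text), is a PerFieldAnalyzerWrapper; as the replies above caution, the same lower-casing has to be applied to queries against the lower-cased field.

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    public class CaseAwareAnalyzers {
        // whitespace tokenization plus lower-casing, for the case-insensitive field
        static class LowercasingWhitespaceAnalyzer extends Analyzer {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                return new LowerCaseFilter(new WhitespaceTokenizer(reader));
            }
        }

        // "contents" keeps its case for exact searches; "contents_lc" holds the
        // same text lower-cased for case-insensitive searches.
        public static Analyzer build() {
            PerFieldAnalyzerWrapper wrapper =
                    new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer());
            wrapper.addAnalyzer("contents_lc", new LowercasingWhitespaceAnalyzer());
            return wrapper;
        }
    }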

Re: Combining hits

2009-07-23 Thread Max Lynch
> Couldn't you maybe get the same effect using some clever term boosting? > > I.. think something like > > "Term 1" OR "Term 2" OR "Term 3" ^ .25 > > would return in almost the exact order that you are asking for here, with > the only real difference being that you would have some matches for only
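
A sketch of that boosted combination built programmatically (the field name and the placeholder phrases are illustrative): all three phrases are optional, but the down-weighted third one contributes far less to the score than the other two.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;

    public class CombinedQuery {
        static PhraseQuery phrase(String field, String... words) {
            PhraseQuery p = new PhraseQuery();
            for (String w : words) {
                p.add(new Term(field, w));
            }
            return p;
        }

        public static BooleanQuery build(String field) {
            BooleanQuery q = new BooleanQuery();
            q.add(phrase(field, "Term", "1"), Occur.SHOULD);
            q.add(phrase(field, "Term", "2"), Occur.SHOULD);
            PhraseQuery third = phrase(field, "Term", "3");
            third.setBoost(0.25f); // the ^.25 from the quoted query string
            q.add(third, Occur.SHOULD);
            return q;
        }
    }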

Re: Combining hits

2009-07-23 Thread Max Lynch
> do a search on "Term 1" AND "Term 2" > do a search on "Term 2" AND "Term2" AND "Term 3" > > This would ensure that you have two objects back, one of which is > guaranteed to be a subset of the other. I did start doing this after sending the email. My only concern is search speed. Right now I

Re: Combining hits

2009-07-23 Thread Max Lynch
> What do you mean by "first"? Would you want to process a doc thatdid NOT > have a "Term 3"? > > Let's say you have the following: > doc1: "Term 1" > doc2: "Term 2" > doc3: "Term 1" "Term 2" > doc4: "Term 3" > doc5: "Term 1" "Term 2" "Term 3" > doc6: "Term 2" "Term 3" > > Which docs do you want to

Combining hits

2009-07-23 Thread Max Lynch
Hi, I am doing a search on my index for a query like this: query = "\"Term 1\" \"Term 2\" \"Term 3\"" Where I want to find Term 1, Term 2 and Term 3 in the index. However, I only want to search for "Term 3" if I find "Term 1" and "Term 2" first, to avoid doing processing on hits that only contai

Punctuation in Whitespace Analyzer

2009-07-03 Thread Max Lynch
Hello, I am having an issue with analyzers. Right now, when I do a search, I am searching for a whole name. For example, if I have a document like this: "This is the document text. John Smith is mentioned right here, he is in the john. Smith is his last name. His full name is John Smith." If

Re: Phrase Highlighting

2009-06-03 Thread Max Lynch
On Wed, Jun 3, 2009 at 7:34 PM, Mark Miller wrote: > Max Lynch wrote: >> Well what happens is if I use a SpanScorer instead, and allocate it like such: >> analyzer =

Re: Phrase Highlighting

2009-06-02 Thread Max Lynch
> Well what happens is if I use a SpanScorer instead, and allocate it like > > such: > > > >analyzer = StandardAnalyzer([]) > >tokenStream = analyzer.tokenStream("contents", > > lucene.StringReader(text)) > >ctokenStream = lucene.CachingTokenFilter(tokenStream)

Re: Phrase Highlighting

2009-05-21 Thread Max Lynch
On Thu, Apr 30, 2009 at 5:16 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Thu, Apr 30, 2009 at 12:15 AM, Max Lynch wrote: > > You should switch to the SpanScorer (in o.a.l.search.highlighter). > >> That fragment scorer should only match true phrase

Re: Phrase Highlighting

2009-04-29 Thread Max Lynch
You should switch to the SpanScorer (in o.a.l.search.highlighter). > That fragment scorer should only match true phrase matches. > > Mike > Thanks Mike. I gave it a try and it wasn't working how I expected. I am using pylucene right now so I can ask them if the implementation is different. I'm
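
For reference, a Java sketch of the SpanScorer-based highlighting being discussed, assuming the 2.4-era contrib highlighter API that the pylucene snippet quoted above mirrors; the "contents" field name and the fragment settings are illustrative.

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CachingTokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.SpanScorer;

    public class PhraseHighlight {
        // Highlight only true phrase matches for the query inside the given text.
        public static String highlight(Query query, String text, Analyzer analyzer)
                throws Exception {
            TokenStream tokenStream =
                    analyzer.tokenStream("contents", new StringReader(text));
            // the span scorer walks a cached copy of the token stream to find
            // the real phrase spans before fragments are built
            CachingTokenFilter cached = new CachingTokenFilter(tokenStream);
            SpanScorer scorer = new SpanScorer(query, "contents", cached);
            Highlighter highlighter = new Highlighter(scorer);
            cached.reset(); // replay the cached tokens for fragment building
            return highlighter.getBestFragments(cached, text, 3, "...");
        }
    }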

Phrase Highlighting

2009-04-28 Thread Max Lynch
Hi, I am trying to find out exactly when a word I'm looking for in a document is found. I've talked to a few people on IRC and it seems like the best way is to use a highlighter. What I have right now is a system where I put each word the highlighter is called with into a list so I then know whic