Re: Realtime & distributed

2009-10-09 Thread Jake Mannix
Hi Mike, Zoie itself doesn't do anything with the distributed side of things - it just plays nicely with it. Zoie, at its core, exposes a couple of primary interfaces (well, this is a slightly simplified form of them): interface IndexReaderFactory { List getIndexReaders(); },

Re: Realtime & distributed

2009-10-09 Thread Michael Masters
Hi Jake, Zoie looks like a really cool project. I'd like to learn more about the distributed part of the setup. Any way you could describe that here or on the wiki? -Mike On Thu, Oct 8, 2009 at 9:24 PM, Jake Mannix wrote: > On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric wrote: > >> >> Does anyo

Re: Realtime & distributed

2009-10-09 Thread Bradford Stephens
My deepest apologies for the spam, everyone. I slipped on my G-mail button :) On Fri, Oct 9, 2009 at 9:09 PM, Bradford Stephens wrote: > Hey Eric, > > My consulting company specializes in scalable, real-time search with > distributed Lucene. I'm more than happy to chat, if you'd like! :) > > Chee

Re: Realtime & distributed

2009-10-09 Thread Bradford Stephens
Hey Eric, My consulting company specializes in scalable, real-time search with distributed Lucene. I'm more than happy to chat, if you'd like! :) Cheers, Bradford On Thu, Oct 8, 2009 at 7:00 PM, Angel, Eric wrote: > > Does anyone have any recommendations?  I've looked at Katta, but it doesn't >

Re: Question about how to speed up custom scoring

2009-10-09 Thread Jake Mannix
Great Scott (hah!) - please do report back, even if it just works fine and you have no more questions, I'd like to know whether this really is what you were after and actually works for you. Note that the FieldCache is kinda "magic" - it's lazy (so the first query will be slow and you should fire

Re: Question about how to speed up custom scoring

2009-10-09 Thread scott w
Thanks Jake! I will test this out and report back soon in case it's helpful to others. Definitely appreciate the help. Scott On Fri, Oct 9, 2009 at 3:33 PM, Jake Mannix wrote: > On Fri, Oct 9, 2009 at 3:07 PM, scott w wrote: > > > Example Document: > > model_1_score = 0.9 > > model_2_score = 0

Re: Question about how to speed up custom scoring

2009-10-09 Thread Jake Mannix
On Fri, Oct 9, 2009 at 3:07 PM, scott w wrote: > Example Document: > model_1_score = 0.9 > model_2_score = 0.3 > model_3_score = 0.7 > > I want to be able to pass in the following map at query time: > {model_1_score=0.4, model_2_score=0.7} and have that map get used as input > to a custom score f

Re: Question about how to speed up custom scoring

2009-10-09 Thread scott w
Hi Jake -- Sorry for the confusion. I have two similar but slightly different use cases in mind and the example I gave you corresponds to one use case while the code corresponds to the other slightly more complicated one. Ignore the original example, and let me restate the one I have in mind so it

Re: Question about how to speed up custom scoring

2009-10-09 Thread Jake Mannix
Hey Scott, I'm still not sure I understand what your dynamic boosts are for: they are the names of fields, right, not terms in the fields? So in terms of your example { company = microsoft, city = redmond, size = big }, the three possible choices for keys in your map are company, city, or size,

Re: Question about how to speed up custom scoring

2009-10-09 Thread scott w
(Apologies if this message gets sent more than once. I received an error sending it the first two times so sent directly to Jake but reposting to group.) Hi Jake -- Thanks for the feedback. What I am trying to implement is a way to custom score documents using a scoring function that takes as inp

Re: Question about how to speed up custom scoring

2009-10-09 Thread scott w
Right exactly. I looked into payload initially and realized it wouldn't work for my use case. On Fri, Oct 9, 2009 at 2:00 PM, Grant Ingersoll wrote: > Oops, just reread and realized you wanted query time weights. Payloads are > an index time thing. > > > On Oct 9, 2009, at 5:49 PM, Grant Ingers

Re: Using Numeric Field

2009-10-09 Thread Jake Mannix
If you are really using all of that precision (down to the second) the short answer is YES. If you can remove much of that precision (only keep down to the day, for example), then you may be able to get perfectly good performance with strings alone when the range is only over a small set of terms,
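
A minimal sketch of the precision point above, assuming timestamps are indexed as fixed-width `yyyyMMddHHmmss` strings (an assumption; the format in the original question may differ). It only counts how many distinct terms exist at second versus day precision, which is what a string range has to enumerate. `DatePrecisionSketch` and its methods are hypothetical names, not Lucene API.

```java
import java.util.Set;
import java.util.TreeSet;

public class DatePrecisionSketch {

    // Keep only the yyyyMMdd prefix of a yyyyMMddHHmmss string (assumed format).
    public static String truncateToDay(String timestamp) {
        return timestamp.substring(0, 8);
    }

    // Count the distinct terms an index would hold for these values.
    public static int distinctTerms(String[] timestamps, boolean dayPrecision) {
        Set<String> terms = new TreeSet<String>();
        for (String ts : timestamps) {
            terms.add(dayPrecision ? truncateToDay(ts) : ts);
        }
        return terms.size();
    }

    public static void main(String[] args) {
        String[] sample = {"20091009150700", "20091009153300", "20091008190000"};
        System.out.println("second precision: " + distinctTerms(sample, false));
        System.out.println("day precision: " + distinctTerms(sample, true));
    }
}
```

With real data the gap is much larger: a year of values has at most 366 distinct day terms, but potentially millions of distinct second-level terms.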

Re: Question about how to speed up custom scoring

2009-10-09 Thread Grant Ingersoll
Oops, just reread and realized you wanted query time weights. Payloads are an index time thing. On Oct 9, 2009, at 5:49 PM, Grant Ingersoll wrote: If you are trying to add specific term weights to terms in the index and then incorporate them into scoring, you might benefit from payloads a

Re: Question about how to speed up custom scoring

2009-10-09 Thread Grant Ingersoll
If you are trying to add specific term weights to terms in the index and then incorporate them into scoring, you might benefit from payloads and the PayloadTermQuery option. See http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/ -Grant On Oct 8, 2009, at 11:56 AM

Using Numeric Field

2009-10-09 Thread Siraj Haider
Hi, I have a Date field in my Lucene index that is currently stored as a String field with format: MMDDHHMISS. I perform RangeFilter on it when searching and also sort the results specifying it as a String field. My question is, will converting it to a Numeric field and start using Numeri

Re: Realtime & distributed

2009-10-09 Thread John Wang
I can provide some preliminary numbers (we will need to do some detailed analysis and post it somewhere): Dataset: medline starting index: empty. add only, no update, for 30 min. maximum indexing load, 1000 docs/sec Under stress, we take indexing events (add only) and stream into both systems: Z

Re: How do you properly use NumericField

2009-10-09 Thread Paul Taylor
Michael McCandless wrote: On Fri, Oct 9, 2009 at 3:26 PM, Paul Taylor wrote It still relies on super.getRangeQuery() for non-numeric fields. If you don't have non-numeric fields that accept range queries you can simply call NumericRangeQuery.newXXXRange directly. For some indexes I hav

Re: How do you properly use NumericField

2009-10-09 Thread Paul Taylor
Michael McCandless wrote: On Fri, Oct 9, 2009 at 3:26 PM, Paul Taylor wrote: I currently use NumberTools.longToString() to add integer fields to an index and allow range searching, then when searching I then preprocess the query (using regular expressions) and convert integer fields to Numb

Re: Realtime & distributed

2009-10-09 Thread Jason Rutherglen
The dimensions sound good. It's unclear if you're going to post a chart again, numbers, or code? There's a LUCENE-1577 Jira issue for code. On Fri, Oct 9, 2009 at 12:37 PM, Jake Mannix wrote: > Jason, > >  We've been running some perf/load/stress tests lately, but on a suggestion > > from Ted D

RE: How do you properly use NumericField

2009-10-09 Thread Uwe Schindler
Hi Paul, for creating NumericFields just refer to the JavaDoc. As Mike said on the query side you can create NumericRangeQuery directly (recommended) - see javadocs. If you want to use QueryParser, you have to customize it, as QueryParser does not support NumericRangeQuery natively. Uwe - Uw

Re: How do you properly use NumericField

2009-10-09 Thread Michael McCandless
On Fri, Oct 9, 2009 at 3:26 PM, Paul Taylor wrote: > I currently use NumberTools.longToString() to add integer fields to > an index and allow range searching, then when searching I then > preprocess the query (using regular expressions) and convert integer > fields to NumberTools.longToString bef

Re: Realtime & distributed

2009-10-09 Thread Jake Mannix
Jason, We've been running some perf/load/stress tests lately, but on a suggestion from Ted Dunning, I've been trying to come up with a more "realistic" set of stress tests and indexing rates to see where NRT performs well and where it does not, instead of just indexing at maximum rate, looping

Re: Realtime & distributed

2009-10-09 Thread Jason Rutherglen
Jake and John, It would be interesting and enlightening to see NRT performance numbers in a variety of configurations. The best way to go about this is to post benchmarks that others may run in their environment which can then be tweaked for their unique edge cases. I wish I had more time to work

How do you properly use NumericField

2009-10-09 Thread Paul Taylor
Hi I currently use NumberTools.longToString() to add integer fields to an index and allow range searching, then when searching I then preprocess the query (using regular expressions) and convert integer fields to NumberTools.longToString before it is parsed by the QueryParser, then when I re
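
Why integers need a sortable string form before string-based range searching works can be sketched as follows. Plain zero-padding is used here as a stand-in; that is an assumption for illustration and not the encoding that NumberTools.longToString() actually produces. `PaddedNumberSketch`, `pad` and `inRange` are hypothetical names.

```java
import java.util.Locale;

public class PaddedNumberSketch {

    // Zero-pad a non-negative long to 19 digits so string order matches numeric order.
    public static String pad(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("sketch handles non-negative values only");
        }
        return String.format(Locale.ROOT, "%019d", value);
    }

    // True if lower <= term <= upper under plain string comparison,
    // which is how a range over string terms compares values.
    public static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        // Unpadded: "100" sorts between "10" and "20", so it wrongly matches [10, 20].
        System.out.println(inRange("100", "10", "20"));
        // Padded: 100 sorts after 20, so it is correctly excluded.
        System.out.println(inRange(pad(100), pad(10), pad(20)));
    }
}
```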

Re: Question about how to speed up custom scoring

2009-10-09 Thread Jake Mannix
Scott, To reiterate what Erick and Andrzej said: calling IndexReader.document(docId) in your inner scoring loop is the source of your performance problem - iterating over all these stored fields is what is killing you. To do this a better way, can you try to explain exactly what this Scorer
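
The alternative to loading stored fields per document can be sketched roughly as below: read each numeric field into an array indexed by docId once, then score with array lookups only. This is a plain-Java stand-in, not Lucene code. The class name `ColumnScoringSketch` is hypothetical, and the scoring function (a weighted sum over the fields named in the query-time map) is an assumption, since the thread does not spell out the actual function.

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnScoringSketch {

    // One float[] per field, indexed by docId and loaded once up front.
    // In Lucene 2.9 such an array could come from
    // FieldCache.DEFAULT.getFloats(reader, field); here it is supplied by hand.
    private final Map<String, float[]> columns = new HashMap<String, float[]>();

    public void addColumn(String field, float[] valuesByDocId) {
        columns.put(field, valuesByDocId);
    }

    // Assumed scoring function: sum of queryWeight * fieldValue over the
    // fields present in the query-time map. No stored documents are read.
    public float score(int docId, Map<String, Float> queryWeights) {
        float sum = 0f;
        for (Map.Entry<String, Float> entry : queryWeights.entrySet()) {
            float[] column = columns.get(entry.getKey());
            if (column != null) {
                sum += entry.getValue() * column[docId];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        ColumnScoringSketch sketch = new ColumnScoringSketch();
        sketch.addColumn("model_1_score", new float[] {0.9f, 0.1f});
        sketch.addColumn("model_2_score", new float[] {0.3f, 0.5f});

        Map<String, Float> weights = new HashMap<String, Float>();
        weights.put("model_1_score", 0.4f);
        weights.put("model_2_score", 0.7f);

        for (int docId = 0; docId < 2; docId++) {
            System.out.println("doc " + docId + " score " + sketch.score(docId, weights));
        }
    }
}
```

The per-document cost becomes one array read per weighted field, at the price of holding one float per document per field in memory.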

Re: Question about how to speed up custom scoring

2009-10-09 Thread scott w
Thanks for the suggestions Erick. I am using Lucene 2.3. Terms are stored and given Andrzej's comments in the follow up email sounds like it's not the stored field issue. I'll keep investigating... thanks, Scott On Thu, Oct 8, 2009 at 8:06 AM, Erick Erickson wrote: > I suspect your problem here

Re: Getting left and right offsets of term search results

2009-10-09 Thread David Causse
Hi, we also index linguistic data, but (someone correct me if I'm wrong) you have to deal with what the Lucene store is offering. What you can store that is usable on the search side: - a term (TermAttribute) - the position of the term (PositionIncrementAttribute) - an arbitrary payload (PayloadAttrib

Getting left and right offsets of term search results

2009-10-09 Thread Till Kolter
I am quite new to Lucene, but I have searched the FAQs and consulted the mailing list archive. I debugged through the source code as well. I have written an Analyzer that analyzes a stream by sending it to a whole pipeline of linguistic processing and uses the internal representation to construct

Re: Efficiently reopening remotely-distributed indexes in 2.9?

2009-10-09 Thread Nigel
Got it -- thanks, Mark! (Recently I read elsewhere in the archives of this list about the value or lack thereof of segments.gen, so skipping that file was in the back of my mind as well.) Chris On Thu, Oct 8, 2009 at 3:04 PM, Mark Miller wrote: > Nigel wrote: > > Thanks, Mark. That makes sens

Re: Lucene 2.9.0 [PROBLEM] : TokenStream API (incrementToken / captureState / restoreState), cannot implement a "stop phrases filter"

2009-10-09 Thread Enrico Detoma
Thank you. Starting from CachingTokenFilter was indeed the correct way to proceed. Regards Enrico 2009/10/8 Uwe Schindler > restoreState only restores the token contents, not the complete stream. So > you cannot roll back the token stream (and this was also not possible with > the old API). Th

Re: Index.close() infinite TIME_WAITING (repost)

2009-10-09 Thread Michael McCandless
Were there any exceptions inside Lucene, before the hang? The fact that you're hitting AlreadyClosedException is a spooky sign -- that means IW thinks you had in fact closed the writer, but then used it again. For increasing indexing throughput, I'd start here: http://wiki.apache.org/lucene-

Re: Index.close() infinite TIME_WAITING (repost)

2009-10-09 Thread Jamie Band
Hi Mike There are other threads involved but none are simultaneously modifying the index. There is one thread that retrieves the total count every 2 seconds on the index for GUI display: public long getTotalMessageCount(Volume volume) throws MessageSearchException { if (volum

Re: Index.close() infinite TIME_WAITING (repost)

2009-10-09 Thread Jamie Band
Hi Mike There are other threads involved but none are simultaneously modifying the index. There is one thread that retrieves the total count every 2 seconds on the index for GUI display: public long getTotalMessageCount(Volume volume) throws MessageSearchException { if (volume =

Re: FileNotFoundException on index

2009-10-09 Thread Michael McCandless
You can use o.a.l.index.CheckIndex to fix the index. It will remove references to any segments that are missing or have problems during testing. First run it without -fix to see what problems there are. Then take a backup of the index. Then run it with -fix. The index will lose all docs in thos

Re: Index.close() infinite TIME_WAITING (repost)

2009-10-09 Thread Michael McCandless
Are there other threads involved, besides the one hung in close? Can you post their stack traces? This stack trace seems to indicate that IW believes another thread is in the process of closing. Can you call IndexWriter.setInfoStream and post the output leading to the hang? Mike On Fri, Oct 9,

Re: Index.close() infinite TIME_WAITING (repost)

2009-10-09 Thread Jamie Band
Hi Michael / Uwe / others Sorry for the repost... it just does not look like the earlier message I sent went through. FYI: there are no large Lucene merges taking place. Jamie Band wrote: Hi Michael Thanks for your help. Here are the stacks: index processor [TIME_WAITING] CPU time: 33:01 java

Re: Index.close() infinite TIME_WAITING

2009-10-09 Thread Jamie Band
Incidentally, there are no Lucene merge threads doing any work. See attached.

Re: Index.close() infinite TIME_WAITING

2009-10-09 Thread Jamie Band
Hi Michael Thanks for your help. Here are the stacks: index processor [TIME_WAITING] CPU time: 33:01 java.lang.Object.wait(long) org.apache.lucene.index.IndexWriter.doWait() org.apache.lucene.index.IndexWriter.shouldClose() org.apache.lucene.index.IndexWriter.close(boolean) org.apache.lucene.ind

AW: Reverse stemmer?

2009-10-09 Thread Uwe Goetzke
We use a statistical approach. So we have little language dependent context in our search. A simplified description: Our data gets indexed with a "normal" analyzer in a data index. In a second step we index all terms of defined search fields with a different analyzer which uses bigrams on the ch

Unable to use jdbc store

2009-10-09 Thread man...@mailinator.com
I am using Lucene 2.9.0 and Compass 2.2.0, configured for the JDBC store. In my Oracle DB, there are 22000 users in the User_ table. I am unable to index all 22000 users; it stops at 13000 users. Problem is that table LUCENE_10109 (my jdbc index table) is getting populated with no of records. Now I have 2772