Re: Sharding Techniques

2011-05-10 Thread Ganesh
We also use a similar technique: breaking indexes into smaller ones and searching them with ParallelMultiSearcher. We have to do incremental indexing, and records older than 6 months or 1 year (based on the ageout setting) should be deleted. Having multiple small indexes is really fast in terms of in
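A minimal sketch of how age-out can work with many small indexes, assuming one index per calendar month so that expired data is dropped a whole index at a time. The `idx-YYYY-MM` naming scheme, the class name, and the month-based retention window are assumptions for illustration and are not taken from the message.

```java
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

final class AgeOutSketch {
    /** Hypothetical naming scheme: one small index per calendar month, e.g. "idx-2011-05". */
    static String partitionFor(YearMonth month) {
        return "idx-" + month; // YearMonth prints as yyyy-MM
    }

    /** Returns the names of partitions that are older than the retention window. */
    static List<String> expired(List<YearMonth> existing, YearMonth now, int retentionMonths) {
        YearMonth oldestKept = now.minusMonths(retentionMonths);
        List<String> result = new ArrayList<>();
        for (YearMonth month : existing) {
            if (month.isBefore(oldestKept)) {
                result.add(partitionFor(month));
            }
        }
        return result;
    }
}
```

With this layout, deleting old records amounts to removing the expired index directories rather than issuing per-document deletes.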

Can I omit ShingleFilter's filler tokens

2011-05-10 Thread William Koscho
Hi, Can I remove the filler token _ from the n-gram-tokens that are generated by a ShingleFilter? I'm using a chain of filters: ClassicFilter, StopFilter, LowerCaseFilter, and ShingleFilter to create phrase n-grams. The ShingleFilter inserts FILLER_TOKENs in place of the stopwords, but I don't w
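One possible workaround, sketched in plain Java rather than as a Lucene filter: post-process each shingle string and drop the standalone filler tokens. It assumes shingles are space-separated and that the filler is the literal `_` mentioned in the question; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class FillerStripSketch {
    /** Removes standalone filler tokens from one shingle and collapses the gaps they leave. */
    static String stripFiller(String shingle, String filler) {
        List<String> kept = new ArrayList<>();
        for (String word : shingle.trim().split("\\s+")) {
            if (!word.isEmpty() && !word.equals(filler)) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }
}
```

Note that stripping the filler makes words on either side of a removed stopword look adjacent, which may or may not be what the phrase n-grams are meant to express.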

Re: How do I sort lucene search results by relevance and time?

2011-05-10 Thread Johnbin Wang
Thanks for your suggestion! I tried setting a document boost factor when indexing. To bubble up recent documents' scores, I set the boost for documents from the last three months to 2 and the boost for all other documents to 0.5. Then I search the index, sorting by two fields, the Lucene default score and
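The boost rule described above can be sketched as a small helper that is called once per document at indexing time. The class and method names are hypothetical, and treating a document dated exactly three months ago as "recent" is an assumption.

```java
import java.time.LocalDate;

final class RecencyBoostSketch {
    /** Returns 2.0 for documents from the last three months and 0.5 for older ones. */
    static float boostFor(LocalDate docDate, LocalDate today) {
        LocalDate cutoff = today.minusMonths(3);
        return docDate.isBefore(cutoff) ? 0.5f : 2.0f;
    }
}
```

Because the boost is fixed when the document is indexed, a document keeps its "recent" boost until it is reindexed, which is worth keeping in mind with incremental indexing.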

RE: SpanNearQuery - inOrder parameter

2011-05-10 Thread Chris Hostetter
: I attach a junit test which shows strange behaviour of the inOrder : parameter on the SpanNearQuery constructor, using Lucene 2.9.4. : : My understanding of this parameter is that true forces the order and : false doesn't care about the order. : : Using true always works. However using false

Re: SpanNearQuery - inOrder parameter

2011-05-10 Thread Tom Hill
Since no one else is jumping in, I'll say that I suspect that the span query code does not bother to check to see if two of the terms are the same. I think that would account for the behavior you are seeing. Since the second SpanTermQuery would match the same term the first one did. Note that I'm

Query on using Payload with MoreLikeThis class

2011-05-10 Thread Saurabh Gokhale
Hi, In the Lucene 2.9.4 project, there is a requirement to boost some of the keywords in the document using payload. Now while searching, is there a way I can boost the MoreLikeThis result using the index time payload values? Or can I merge MoreLikeThis output and PayloadTermQuery output somehow

RE: Sharding Techniques

2011-05-10 Thread Burton-West, Tom
Hi Samar, >>Normal queries go fine under 500 ms but when people start searching >>"anything" some queries take up to > 100 seconds. Don't you think >>distributing smaller indexes on different machines would reduce the average >>.search time. (Although I have a feeling that search time for smaller

Re: Sharding Techniques

2011-05-10 Thread Samarendra Pratap
Hi Mike, *"I think the usual approach is to create multiple mirrored copies (slaves) rather than sharding"* This is what caught my eye. We do have mirrors and in fact a good number of those. 6 servers are being used for serving regular queries (2 are for specific queries that do take time) and e

Re: Sharding Techniques

2011-05-10 Thread Mike Sokolov
Down to basics, Lucene searches work by locating terms and resolving documents from them. For standard term queries, a term is located by a process akin to binary search. That means that it uses log(n) seeks to get the term. Let's say you have 10M terms in your corpus. If you stored that in a si
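The log(n) lookup cost can be made concrete with an idealized model: a plain binary search over n sorted terms needs at most floor(log2 n) + 1 probes. The snippet is cut off before its comparison is finished, so the 10-way split used below, the even division of terms across shards, and all names are assumptions for illustration only; real term dictionaries behave differently.

```java
final class SeekEstimateSketch {
    /** Worst-case probes for a binary search over termCount sorted terms: floor(log2 n) + 1. */
    static int worstCaseProbes(long termCount) {
        if (termCount < 1) {
            throw new IllegalArgumentException("termCount must be at least 1");
        }
        return 64 - Long.numberOfLeadingZeros(termCount);
    }

    /** Total probes when one term is looked up in every shard (terms assumed evenly split). */
    static int totalProbes(long totalTerms, int shards) {
        long perShard = (totalTerms + shards - 1) / shards; // round up
        return shards * worstCaseProbes(perShard);
    }
}
```

Under these assumptions, `totalProbes(10_000_000L, 1)` is 24 while `totalProbes(10_000_000L, 10)` is 200: each shard's lookup is only slightly cheaper, but it has to be repeated in every shard.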

RE: SpanNearQuery - inOrder parameter

2011-05-10 Thread Gregory Tarr
Anyone able to help me with the problem below? Thanks Greg -Original Message- From: Gregory Tarr [mailto:gregory.t...@detica.com] Sent: 09 May 2011 12:33 To: java-user@lucene.apache.org Subject: RE: SpanNearQuery - inOrder parameter Attachment didn't work - test below: import org.ap

Re: An unexpected network error occurred

2011-05-10 Thread Ian Lea
A full stack trace dump is always helpful. Are the three instances on one server with a local index directory, or on different servers accessing a network drive (how?) or what? If the index is locked it would be surprising that you could update it from 2 of the instances. -- Ian. On Tue, May

An unexpected network error occurred

2011-05-10 Thread Yogesh Dabhi
Three instances of my application share one Lucene index directory. Lucene version: 3.1. Lock factory: NativeFSLockFactory. Instance1: 64-bit JDK, 64-bit OS. Instance2: 64-bit JDK, 64-bit OS. Instance3: 32-bit JDK, 32-bit OS. When I try to search the data from the index directory from Instance1 I got

PDF Highlighting using PDF Highlight File

2011-05-10 Thread Wulf Berschin
Hi all, in our Lucene 3.0.3-based web application when a user clicks on a hit link the targeted PDF should be opened in the browser with highlighted hits. For this purpose using the Acrobat Highlight File (Parameter xml, see http://www.pdfbox.org/userguide/highlighting.html and http://partne

Re: Sharding Techniques

2011-05-10 Thread Samarendra Pratap
Thanks to Johannes - I am looking into katta. Seems promising. to Toke - Great explanation. That's what I was looking for. I'll come back and share my experience. Thank you very much. On Tue, May 10, 2011 at 1:31 PM, Toke Eskildsen wrote: > On Mon, 2011-05-09 at 13:56 +0200, Samarendra Prata

Re: Sharding Techniques

2011-05-10 Thread Toke Eskildsen
On Mon, 2011-05-09 at 13:56 +0200, Samarendra Pratap wrote: > We have an index directory of 30 GB which is divided into 3 subdirectories > (idx1, idx2, idx3) which are again divided into 21 sub-subdirectories > (idx1-1, idx1-2, ..., idx2-1, ..., idx3-1, ..., idx3-21). So each part is about ½ G

Re: Sharding Techniques

2011-05-10 Thread Johannes Zillmann
On May 10, 2011, at 9:42 AM, Samarendra Pratap wrote: > Hi, > Though we have 30 GB total index, size of the indexes that are used > in 75%-80% searches is 5 GB. and we have average search time around 700 ms. > (yes, we have optimized index). > > Could someone please throw some light on my origin

Re: Sharding Techniques

2011-05-10 Thread Samarendra Pratap
Hi, Though we have 30 GB total index, size of the indexes that are used in 75%-80% searches is 5 GB. and we have average search time around 700 ms. (yes, we have optimized index). Could someone please throw some light on my original doubt!!! If I want to keep smaller indexes on different servers