index bigger than it should be?

2011-10-27 Thread v . sevel
Hi, I have an application that has an index with 30 million docs in it. Every day, I add around 1 million docs and remove the oldest 1 million, to keep it stable at 30 million. For the most part, doc fields are indexed and stored. Each doc weighs from a few KB to 1 MB (a few Mb in so

Re: idf calculation in Lucene ?

2011-10-27 Thread Robert Muir
On Thu, Oct 20, 2011 at 3:11 PM, David Ryan wrote:
> However, in some cases, when I search for o'reilly, I see
>
>   44.0865 = idf(title: o''reilli=4 o=1488 reilli=14 oreilli=4)
>
> In this case, how is IDF calculated?
That's a phrase or multiphrase query. In this case it sums up the idf of
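The summing described in the reply can be sketched roughly as follows. This is an illustration only, not Lucene's code: it assumes the per-term formula of Lucene 3.x `DefaultSimilarity`, idf = 1 + ln(numDocs / (docFreq + 1)), and the index size `NUM_DOCS` is a made-up value, since the thread does not state it. The document frequencies are the ones shown in the quoted explain output. Lucene computes in single-precision floats, so real explain values will differ slightly.

```python
import math


def default_idf(doc_freq: int, num_docs: int) -> float:
    """Per-term idf, assuming the DefaultSimilarity-style formula
    1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def phrase_idf(doc_freqs: dict, num_docs: int) -> float:
    """idf reported for a (multi)phrase query: the sum of the per-term idfs."""
    return sum(default_idf(df, num_docs) for df in doc_freqs.values())


# Document frequencies taken from the explain output quoted above.
doc_freqs = {"o''reilli": 4, "o": 1488, "reilli": 14, "oreilli": 4}

# Hypothetical index size; the thread does not say how many docs there are.
NUM_DOCS = 1_000_000

if __name__ == "__main__":
    for term, df in doc_freqs.items():
        print(term, default_idf(df, NUM_DOCS))
    print("phrase idf:", phrase_idf(doc_freqs, NUM_DOCS))
```

Because the rare terms contribute much larger idf values than the frequent `o`, the summed value can look surprisingly high next to a single-term idf.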

Re: Lucene 3.1 search parallelism per segment doubt

2011-10-27 Thread Robert Muir
On Mon, Oct 10, 2011 at 7:02 AM, Marc Sturlese wrote:
> I've read in another thread
> (http://lucene.472066.n3.nabble.com/Indexing-slower-in-trunk-td3059836.html#a3062991)
> /Since Lucene 2.9, Lucene works on a per segment basis when searching. Since
> Lucene 3.1 it can even parallelize on multipl

Re: IndexWriter loops trying to merge using ConcurrentMergeScheduler

2011-10-27 Thread Michael McCandless
It looks like you are using BalancedSegmentMergePolicy, right? And somehow it gets stuck in a state where it keeps merging the same single segment into a new segment, which is odd. Likely this is a bug in BSMP. Do you see this same looping with, e.g., LogByteSizeMergePolicy? Note that newer versions

Re: index bigger than it should be?

2011-10-27 Thread Ian Lea
There's org.apache.lucene.index.CheckIndex, which will report assorted stats about the index, as well as checking it for correctness. It can fix it too, but you don't need that, I hope. It will take quite a while to run on a large index. Which version of Lucene? Does a before/after (or large/small) d

Re: Lucene 3.1 search parallelism per segment doubt

2011-10-27 Thread Simon Willnauer
On Thu, Oct 27, 2011 at 2:50 PM, Robert Muir wrote:
> On Mon, Oct 10, 2011 at 7:02 AM, Marc Sturlese wrote:
>> I've read in another thread
>> (http://lucene.472066.n3.nabble.com/Indexing-slower-in-trunk-td3059836.html#a3062991)
>> /Since Lucene 2.9, Lucene works on a per segment basis when sea

Re: performance question - number of documents

2011-10-27 Thread Felipe Hummel
Hi, there are two types of query processing in document retrieval: document-at-a-time and term-at-a-time. Lucene uses document-at-a-time processing. That means each posting list (the list of documents a word appears in) is sorted by document ID. This type of processing is usually better for l
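As a rough illustration of why document-at-a-time processing relies on posting lists sorted by document ID, here is a minimal sketch of a two-term AND query. It is not Lucene's implementation; the function name and the sample lists are made up for the example. Both lists are walked in step, so each document is fully decided before moving on to the next one.

```python
def intersect_postings(a, b):
    """Return doc IDs present in both posting lists.

    Both inputs must be sorted by doc ID in ascending order; the two
    cursors only ever move forward, so each list is read once.
    """
    i = j = 0
    matches = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document in both lists: it matches the AND query.
            matches.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Advance whichever cursor points at the smaller doc ID.
            i += 1
        else:
            j += 1
    return matches


if __name__ == "__main__":
    # Hypothetical posting lists for two terms.
    print(intersect_postings([1, 3, 5, 8], [3, 4, 5, 9]))
```

A term-at-a-time approach would instead process one whole posting list before the next and keep partial scores for every candidate document in between.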

Re: using lucene to find neighbouring points in an n-dimensional space

2011-10-27 Thread Felipe Hummel
For the indexing part, you can 'insert' the term multiple times (term-weight times) by constructing the document String manually. That is not very typical; you would normally feed Lucene the original documents for it to parse and index. The query processing could be done similarly to what you said. Jus
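The "insert the term weight-many times" idea can be sketched like this. It only shows how such a document string might be built before handing it to an indexer; the function name and the `dim1`/`dim2` term names are hypothetical, and the weights are assumed to be non-negative integers.

```python
def weighted_text(term_weights):
    """Build a whitespace-separated string in which each term is repeated
    as many times as its integer weight, so that the term frequency seen
    by the indexer mirrors the weight."""
    parts = []
    for term, weight in term_weights.items():
        parts.extend([term] * weight)  # a weight of 0 drops the term
    return " ".join(parts)


if __name__ == "__main__":
    # Hypothetical point with weight 3 on one dimension and 1 on another.
    print(weighted_text({"dim1": 3, "dim2": 1}))
```

Real-valued coordinates would first have to be scaled and rounded to integers, which loses some precision.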

Re: IndexWriter loops trying to merge using ConcurrentMergeScheduler

2011-10-27 Thread alfredhong
Hi Mike, thanks for your analysis. You are correct in that BalancedSegmentMergePolicy is used. We previously used LogByteSizeMergePolicy but might have run into some other issues that I was involved in, so we weren't using it. Re: TieredMergePolicy, we'll definitely check that out when we update

Re: using lucene to find neighbouring points in an n-dimensional space

2011-10-27 Thread prasenjit mukherjee
Thanks for responding.
On Fri, Oct 28, 2011 at 1:12 AM, Felipe Hummel wrote:
> For the indexing part, you can 'insert' the term multiple times (term-weight
> times) constructing the document String manually. That is not very typical,
> you would normally feed Lucene with the original documents fo

Finding Term Positions in the original document

2011-10-27 Thread Vidya Kanigiluppai Sivasubramanian
Hi, I am using Lucene 2.4.1 in my project. I need to display the search results for a particular term and, when an item on the result page is selected, display the document where the term was found, with the matched terms highlighted. For this I need to know the match