Hi,

I don't know the answer to your questions, but I'm guessing that the answer to 
#3 probably follows from the answers to #1 and #2.  

Did you try looking at the indexes using Luke?  It shows the top 50 terms 
when it starts, so the differences might be obvious, and that might 
give someone here (more knowledgeable than myself) a hint as to what's going on.
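
In case it helps with #1: as I read Mike's description further down, the
threaded writer hands each add to a thread pool, and close() waits for the
pool to finish. Below is a minimal sketch of that pattern using only
java.util.concurrent. It is not the code from the book; PooledWriterSketch,
addDocument and indexAll are names I made up, and a counter stands in for the
real indexing work. I can't say whether this has anything to do with the 76
missing documents.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in, NOT the ThreadedIndexWriter from the book.
final class PooledWriterSketch {
    private final ExecutorService pool;
    private final AtomicInteger added = new AtomicInteger();

    PooledWriterSketch(int threads) {
        pool = Executors.newFixedThreadPool(threads);
    }

    // Queues the add and returns at once; the work runs later on a pool thread.
    // The document itself is ignored here; only the count is tracked.
    void addDocument(String doc) {
        pool.execute(() -> added.incrementAndGet());
    }

    // Stops accepting new work, then blocks until every queued add has run.
    int close() {
        pool.shutdown();
        try {
            while (!pool.awaitTermination(1, TimeUnit.SECONDS)) {
                // keep waiting until the queue has drained
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while closing", e);
        }
        return added.get();
    }

    // Adds the given number of dummy documents and returns how many were counted.
    static int indexAll(int docs, int threads) {
        PooledWriterSketch writer = new PooledWriterSketch(threads);
        for (int i = 0; i < docs; i++) {
            writer.addDocument("doc" + i);
        }
        return writer.close();
    }

    public static void main(String[] args) {
        System.out.println(indexAll(10000, 4));
    }
}
```

In this sketch the count is read only after close() has waited for the pool,
so every queued add is included. Reading the count before the pool has
drained could give a smaller number.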

Jim



---- Jibo John <jiboj...@mac.com> wrote: 
> Tried with a larger set of documents (2,000,000 ) this time.
> 
> ThreadedIndexWriter
> -------------------------------
> Size  - 1.4 G
> optimized - yes (as suggested by Phil)
> Number of documents - 1,999,924 (No idea where the 76 documents  
> vanished...)
> Number of terms - 3,638,801
> 
> 
> IndexWriter
> ---------------
> Size - 1.8 G    (Noticed the size difference factor reduced to 23%)
> optimized - yes
> Number of documents  - 2,000,000  (All of them got in)
> Number of terms - 10,624,806
> 
> I think it's getting complicated with more unanswered questions..
> 
> 1. Why didn't those 76 docs get in while using ThreadedIndexWriter ?
> 2. Why would the number of terms triple for a difference of 76  
> documents out of 2 million?
> 3. And my original question: why is there still a large size  
> difference between the two indexes?
> 
> 
> Thanks,
> -Jibo
> 
> 
> On Jul 31, 2009, at 1:44 PM, oh...@cox.net wrote:
> 
> > Hi,
> >
> > Sorry to jump in, but I've been following this thread with  
> > interest :)...
> >
> > Am I misunderstanding your original observation, that  
> > ThreadedIndexWriter produced a smaller index?  Did the  
> > ThreadedIndexWriter also finish faster (I'm assuming that it should)?
> >
> > If the index is smaller, and everything else being good and equal,  
> > doesn't that mean that using ThreadedIndexWriter is a good thing?
> >
> > Anyway, aside from checking that the # of documents was the same,  
> > have you looked at the index using something like Luke?  Do the  
> > contents of the index look the same in both cases, or are they  
> > different?  If different, how so (e.g., missing terms, etc.)?
> >
> > Later,
> > Jim
> >
> >
> > On Fri, Jul 31, 2009 at 2:38 PM , Jibo John wrote:
> >
> >> The number of docs is the same in the index in both cases  
> >> (200,000).
> >> I haven't altered the benchmark/ code, but I used a profiler to  
> >> verify that the Benchmark main thread is closed only after all other  
> >> threads are closed.
> >>
> >> Thanks,
> >> -Jibo
> >>
> >>
> >> On Jul 31, 2009, at 2:34 AM, Michael McCandless wrote:
> >>
> >>> Hmm... this doesn't sound right.
> >>>
> >>> That example (ThreadedIndexWriter) is meant to be a drop-in
> >>> replacement, wherever you use an IndexWriter, that keeps an
> >>> under-the-hood thread pool (using java.util.concurrent.*) to
> >>> add/update documents with multiple threads.
> >>>
> >>> It should not result in a smaller index.
> >>>
> >>> Can you sanity check the index?  Eg is numDocs() the same for both?
> >>> You definitely called close() on the writer, right?  That method  
> >>> waits
> >>> for all threads to finish their work before actually closing.
> >>>
> >>> Mike
> >>>
> >>> On Thu, Jul 30, 2009 at 8:01 PM, Jibo John<jiboj...@mac.com> wrote:
> >>>> While trying out a few tuning options using contrib/benchmark as  
> >>>> described in
> >>>> the LIA (2nd edition) book, I made an interesting observation.
> >>>>
> >>>> If I use a ThreadedIndexWriter (picked the example from lia2e,  
> >>>> page 356)
> >>>> instead of IndexWriter, the index size is reduced by 40%.
> >>>> The index-related configuration was the same for both tests in  
> >>>> the alg
> >>>> file.
> >>>>
> >>>> I am curious how come using a threaded index writer will have an  
> >>>> impact on
> >>>> the index size.
> >>>>
> >>>> Appreciate your input.
> >>>>
> >>>> Thanks,
> >>>> -Jibo
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>>
> >>>>


