Re: Vector Space Model: New Similarity Implementation Issues

2008-02-28 Thread h t
Compared with the classical VSM, Lucene simply ignores the denominator (|Q|*|D|) of the similarity formula, but it adds norm(t,d) and coord(q,d) to account for the fraction of terms in the query and the document, so in practice it is a modified implementation of VSM. Do you just want to verify which implementation of VSM in
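
To make the contrast above concrete, here is a minimal sketch in plain Python (not Lucene code). `cosine_similarity` is the classical formula with the |Q|*|D| denominator. `modified_score` is a hypothetical, heavily simplified stand-in for the Lucene-style variant: it drops that denominator and instead multiplies by a coord factor and a 1/sqrt(document length) norm. It leaves out idf, boosts and everything else a real scorer does, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, doc_terms):
    """Classical VSM: dot(Q, D) / (|Q| * |D|) over raw term-frequency vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def coord(query_terms, doc_terms):
    """Fraction of distinct query terms that occur in the document."""
    q = set(query_terms)
    if not q:
        return 0.0
    return len(q & set(doc_terms)) / len(q)


def modified_score(query_terms, doc_terms):
    """Hypothetical simplified score: no |Q|*|D| denominator.

    The unnormalized dot product is scaled by coord and by a
    1/sqrt(number of document terms) length norm (an assumption
    standing in for norm(t,d)).
    """
    if not doc_terms:
        return 0.0
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    length_norm = 1.0 / math.sqrt(len(doc_terms))
    return coord(query_terms, doc_terms) * dot * length_norm
```

For the query `["lucene", "search"]` against the document `["lucene", "index", "lucene", "search"]`, the two functions give different values, which is the point: both rank by term overlap, but they normalize differently.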

Re: How do i get a text summary

2008-02-28 Thread h t
Hi Karl, Where can I find an introduction to the algorithm below? Thanks. Very simple algorithmic solutions usually involve ranking top sentences by looking at distribution of terms in sentences, paragraphs and the whole document. I implemented something like this a couple of years back that worked fairly
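
A rough sketch of the kind of sentence ranking described above, in plain Python. This is not Karl's implementation, which the thread does not show. It assumes the simplest possible scoring: each sentence is scored by the average document-wide frequency of its words, and the top sentences are returned in their original order. The function name `summarize` and the regex-based sentence splitting are assumptions; a usable version would at least filter stop words.

```python
import re
from collections import Counter


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in document order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]

    # Term distribution over the whole document.
    doc_freq = Counter(w for words in tokenized for w in words)

    def score(words):
        # Average document frequency of the words in one sentence.
        if not words:
            return 0.0
        return sum(doc_freq[w] for w in words) / len(words)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(tokenized[i]),
                    reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]
```

Averaging rather than summing keeps long sentences from winning on length alone; whether that is the right trade-off depends on the documents.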

Re: Lucene Search Performance

2008-02-27 Thread h t
Device:  tps   Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sda      7.12  50.67       186.41      38936841  143240688
See attached for hardware info and the CPU call tree (taken from YourKit). I would appreciate your recommendations. Jamie h t wrote: Hi Michael

Re: Lucene Search Performance

2008-02-26 Thread h t
Hi Michael, I guess the hotspot in Lucene is org.apache.lucene.search.IndexSearcher.search(). Hi Jamie, What is the original text size of a million emails? I estimate the size of an email at around 100k; is that right? When you search, what kind of keywords do you enter: single words or short sentences?

Re: Inconsistent Search Speed

2008-02-26 Thread h t
Did you use the same keywords in both calls? 2008/2/27, fangz [EMAIL PROTECTED]: Hi, I am using a simple Java program to test the search speed. The index file is about 1.93G in size. I initiated an IndexSearcher and built a query using the query parser: parser.parse("entity:fail"). The initial

Re: Security filtering from external DB

2008-02-26 Thread h t
I guess you can implement createBitSet() more efficiently by using a Filter rather than a BooleanQuery. 2008/2/25, Gabriel Landais [EMAIL PROTECTED]: Gabriel Landais wrote: How to create a Filter for a field in Collection<String>? First, split Collection in Collection<Collection> with
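
The idea behind that suggestion, sketched in plain Python rather than against the Lucene API: instead of OR-ing every permitted value into one large query, walk the postings of each permitted value once and mark the matching documents in a bit set, then use the bit set to filter hits. Everything here is an assumption made for illustration: `postings` is a hypothetical in-memory map from field value to document ids, and a `bytearray` with one byte per document stands in for a real bit set.

```python
def build_allowed_bitset(postings, allowed_values, num_docs):
    """Mark every document that carries at least one allowed field value.

    postings: dict mapping a field value to an iterable of document ids.
    Returns a bytearray with 1 at each permitted document id.
    """
    bits = bytearray(num_docs)  # all zeros: nothing permitted yet
    for value in allowed_values:
        # Values that do not occur in the index simply contribute nothing.
        for doc_id in postings.get(value, ()):
            bits[doc_id] = 1
    return bits


def filter_hits(hits, bits):
    """Keep only the hits whose document id is marked as permitted."""
    return [doc_id for doc_id in hits if bits[doc_id]]
```

Usage, with made-up group names:

```python
postings = {"groupA": [0, 2], "groupB": [1]}
bits = build_allowed_bitset(postings, ["groupA", "missing"], num_docs=4)
visible = filter_hits([0, 1, 2, 3], bits)  # documents 0 and 2
```

Because the bit set depends only on the user's permitted values and not on the search terms, it can be built once and reused across that user's queries.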

Re: EdgeNGramTokenizer

2007-11-04 Thread h t
http://www.shifttab.cn:8001/wiki 2007/10/31, Marco [EMAIL PROTECTED]: It seems that the problem occurs when I add the token created by EdgeNGramTokenizer to the index. If the token contains a space (for example "apple com") I have to add it to the index with Field.Index.TOKENIZED, otherwise the