Re: Re: Re: potential accuracy degradation due to approximation of document length in BM25 (and other similarities)

2016-07-09 Thread Leo Boytsov
Hi David, Submitting a patch wouldn't be a problem. But let me do a couple more tests with more collections (this time I will try more standard ones). Thanks! --- Leo On Sat, Jul 9, 2016 at 10:20 AM, David Smiley wrote: > --ok; (they already have configuration

Re: Re: Re: potential accuracy degradation due to approximation of document length in BM25 (and other similarities)

2016-07-09 Thread David Smiley
--ok; (they already have configuration parameters). Leo, if you can submit a patch to extend the BM25 similarity, I would welcome it. On Sat, Jul 9, 2016 at 7:11 AM Robert Muir wrote: > Our similarities do not need a boolean flag. Instead we should focus > on making them as

Re: Re: Re: potential accuracy degradation due to approximation of document length in BM25 (and other similarities)

2016-07-09 Thread Robert Muir
Our similarities do not need a boolean flag. Instead we should focus on making them as simple as possible: there can always be alternative implementations. On Sat, Jul 9, 2016 at 1:08 AM, David Smiley wrote: > I agree that using one byte by default is questionable on

Re: Re: Re: potential accuracy degradation due to approximation of document length in BM25 (and other similarities)

2016-07-09 Thread Konstantin
Hello, are norms cut to 1-byte precision as of the Lucene 6.0.0 release? 2016-07-09 8:08 GMT+03:00 David Smiley: > I agree that using one byte by default is questionable on modern machines > and given common text field sizes as well. I think my understanding of how >

Re: Re: Re: potential accuracy degradation due to approximation of document length in BM25 (and other similarities)

2016-07-08 Thread David Smiley
I agree that using one byte by default is questionable on modern machines and given common text field sizes as well. I think my understanding of how norms are encoded/accessed may be wrong from what I had said. Lucene53NormsFormat supports Long, I see, and it's clever about observing the max
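
To make the thread's subject concrete, here is a minimal sketch of how a lossy document length can change BM25's length normalization. It is an illustration under stated assumptions, not Lucene's code: `quantize_length` is a hypothetical encoder that keeps only the top few significant bits of the length, and `k1=1.2`, `b=0.75` are the commonly used BM25 defaults.

```python
def bm25_tf_norm(tf: int, doc_len: float, avg_len: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency part of BM25 with document length normalization."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))


def quantize_length(doc_len: int, mantissa_bits: int = 3) -> int:
    """Hypothetical lossy encoding (NOT Lucene's actual scheme):
    keep the leading bit plus `mantissa_bits` bits, zero out the rest."""
    if doc_len <= 0:
        return 0
    shift = max(doc_len.bit_length() - (mantissa_bits + 1), 0)
    return (doc_len >> shift) << shift


if __name__ == "__main__":
    avg_len = 500.0
    for doc_len in (1000, 1023):
        exact = bm25_tf_norm(2, doc_len, avg_len)
        approx = bm25_tf_norm(2, quantize_length(doc_len), avg_len)
        print(doc_len, quantize_length(doc_len), exact, approx)
```

With exact lengths the two documents get different scores; after quantization both lengths collapse to the same value, so the scores tie. Whether such ties matter for ranking quality is the question the thread is arguing about.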

Re: Re: Re: potential accuracy degradation due to approximation of document length in BM25 (and other similarities)

2016-07-07 Thread Leo Boytsov
Hi David, thank you for picking it up. Now we are having a more meaningful discussion regarding the "waste". Leo, > There may be confusion here as to where the space is wasted. 1 vs 8 bytes > per doc on disk is peanuts, sure, but in RAM it is not and that is the > concern. AFAIK the norms are

Re: Re: potential accuracy degradation due to approximation of document length in BM25 (and other similarities)

2016-07-06 Thread David Smiley
Leo, There may be confusion here as to where the space is wasted. 1 vs 8 bytes per doc on disk is peanuts, sure, but in RAM it is not and that is the concern. AFAIK the norms are memory-mapped in, and we need to ensure it's trivial to know which offset to go to on disk based on a document id,
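
The fixed-width layout David describes (hedged in his own message with "AFAIK") can be sketched as follows. This is only an illustration of the offset arithmetic and the per-document footprint; the function names and the big-endian packing are assumptions made for the example and do not reflect Lucene's file format.

```python
def pack_norms(values: list[int], bytes_per_norm: int) -> bytes:
    """Store one fixed-width unsigned value per document, in doc-id order."""
    return b"".join(v.to_bytes(bytes_per_norm, "big") for v in values)


def norm_offset(doc_id: int, bytes_per_norm: int) -> int:
    """With a fixed width, the offset is a single multiplication."""
    return doc_id * bytes_per_norm


def read_norm(buf: bytes, doc_id: int, bytes_per_norm: int) -> int:
    """Random access to one document's value without scanning the buffer."""
    start = norm_offset(doc_id, bytes_per_norm)
    return int.from_bytes(buf[start:start + bytes_per_norm], "big")


def norms_footprint(num_docs: int, bytes_per_norm: int) -> int:
    """Total bytes needed for one field's norms across all documents."""
    return num_docs * bytes_per_norm


if __name__ == "__main__":
    wide = pack_norms([3, 250, 17, 1024], 8)
    print(read_norm(wide, 3, 8))
    # Footprint for a hypothetical 100M-document index, 1 vs 8 bytes per doc.
    print(norms_footprint(100_000_000, 1), norms_footprint(100_000_000, 8))
```

For the hypothetical 100-million-document index, the difference is 100 MB versus 800 MB per field, which is the "1 vs 8 bytes per doc" concern expressed in terms of resident memory rather than disk.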

Re: Re: potential accuracy degradation due to approximation of document length in BM25 (and other similarities)

2016-07-06 Thread Leo Boytsov
Hi, for some reason I didn't get a reply from the mailing list directly, so I have to send a new message. I would appreciate it if this could be fixed, so that I get replies as well. First of all, I don't buy the claim about the issue being well-known. I would actually argue that nobody except a few