Re: search quality - assessment & improvements

2007-07-19 Thread Chris Hostetter

: (d) Now we might get stupid (or erroneous)
: few-word docs as top results;
: (e) To solve this, pivoted doc-length-norm punishes overly
: long docs (longer than the average) but only slightly
: rewards docs that are shorter than the average.

I get that your calculation is much more gradual than the 1/sqrt(length)
one, so extremely short docs are "only slightly" rewarded over average-length
docs ... I'm just not clear on why you want to reward super-short
docs at all.

Going back to SSS as an example, did you consider using a sweetspot that
went from 0 to your pivot (so all docs with length less than or equal to
the pivot/average length get an equal length boost)?
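
For illustration, that flat-plateau setup is just a matter of how the length
norm factors are configured on the contrib SweetSpotSimilarity -- a sketch from
memory, with pivotLength standing in for the pivot/average length:

  SweetSpotSimilarity sss = new SweetSpotSimilarity();
  // plateau from 1 token up to the pivot: everything at or below the pivot
  // gets the same (maximal) length norm, longer docs tail off
  sss.setLengthNormFactors(1, pivotLength, 0.5f);
  writer.setSimilarity(sss);   // and the same on the Searcher at query time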

...that's actually what I started with when I first wrote SSS, but then I
realized that in the case of really rare words (where the highest tf across
all docs is just 1) tf no longer discriminates, so the length norm becomes the
only discriminating factor in the scores of the various documents -- it didn't
matter that the norm for a 3-word doc was only slightly higher than (or equal
to) that of an average-length doc, the 3-word doc would still get a higher
(or equal) score.


-Hoss





Re: Need help for ordering results by specific order

2007-07-19 Thread savageboy

Yes, I found that what I need is the term vector, which is stored at indexing
time.
I appreciate you pointing me to "Lucene in Action", but I think the
interface it covers is the 1.4 one.
So I need the Lucene 2.0 syntax for adding the term vector to
the document at indexing time.
By the way, the sort function is lower-level and faster than I thought; it
works nicely now (I've implemented ScoreDocComparator and a comparator of my
own!).
Thanks a lot!
And best regards!
:)
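
Roughly what such a comparator can look like in Lucene 2.0 -- an illustrative
sketch only, not the actual code from this thread. Field names and query terms
are taken from the example in the original question, and the "content" field is
assumed to have been indexed with Field.TermVector.YES:

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.TermFreqVector;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.ScoreDocComparator;
  import org.apache.lucene.search.SortComparatorSource;
  import org.apache.lucene.search.SortField;

  // Orders hits by how many of the given query terms occur in the document's
  // term vector (more matches first).
  public class MatchedTermCountComparatorSource implements SortComparatorSource {
    private final String[] queryTerms;     // e.g. {"alden", "bob", "carray"}

    public MatchedTermCountComparatorSource(String[] queryTerms) {
      this.queryTerms = queryTerms;
    }

    public ScoreDocComparator newComparator(final IndexReader reader,
                                            final String fieldname) throws IOException {
      return new ScoreDocComparator() {
        public int compare(ScoreDoc i, ScoreDoc j) {
          // more matched terms sorts first; in real code, cache the counts
          return matchedTerms(j.doc) - matchedTerms(i.doc);
        }
        public Comparable sortValue(ScoreDoc i) {
          return new Integer(matchedTerms(i.doc));
        }
        public int sortType() {
          return SortField.CUSTOM;
        }
        private int matchedTerms(int doc) {
          try {
            TermFreqVector tfv = reader.getTermFreqVector(doc, fieldname);
            if (tfv == null) return 0;
            int count = 0;
            for (int t = 0; t < queryTerms.length; t++) {
              if (tfv.indexOf(queryTerms[t]) >= 0) count++;
            }
            return count;
          } catch (IOException e) {
            return 0;   // treat unreadable vectors as "no match"
          }
        }
      };
    }
  }

Used with a secondary sort on the date field, newest first within each group:

  Sort sort = new Sort(new SortField[] {
      new SortField("content",
          new MatchedTermCountComparatorSource(new String[] {"alden", "bob", "carray"})),
      new SortField("date", SortField.STRING, true)
  });
  Hits hits = searcher.search(query, sort);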




Mathieu Lecarme wrote:
> 
> If I understand your needs correctly:
> you ask Lucene for a set of words, and
> you want to sort the results by the number of different words that match?
> The query is not right; it should be
> 
> +content:(aleden bob carray)
> 
> I don't understand how you can sort at indexing time using information
> that is only known at query time.
> 
> M.
>> savageboy wrote:
>> Yes, Mathieu.
>> I have the book "Lucene in Action" at hand; it is the Chinese-language
>> version and covers Lucene 1.4 -- I hope it is not too old.
>> If I use SortComparatorSource, does that mean the sorting will be done at
>> query time?
>> Can I sort (or maybe score) at indexing time instead?
>>
>>
>>
>> Mathieu Lecarme wrote:
>>   
>>> Have a look at the book "Lucene in Action", ch. 6.1: "Using a custom
>>> sort method".
>>>
>>> SortComparatorSource might be your friend: Lucene selects the documents,
>>> and you sort them just the way you want.
>>>
>>> M.
>>> On 18 Jul 2007, at 10:29, savageboy wrote:
>>>
>>> 
 Hi,
 I am new to Lucene.
 I have a search engine project using Lucene 2.0. But near the end of the
 project, my boss wants me to order the results by the sort shown below:

 the query looks like '+content:"aleden bob carray"'

 content                                        date         order
 "alden bob carray ... "                        2005/12/23   1
 "alden... alden ... bob... bob... carray..."   2005/12/01   2
 "alden... alden ... bob... carray"             2005/11/28   3
 "alden... carray"                              2005/12/24   4
 "alden... bob"                                 2005/12/24   5

 The meaning of the sort above: no matter how many times a term matches in
 the field "content", there are four possible situations: "3 matched",
 "2 matched", "1 matched", "0 matched". Within the "3 matched" group I need
 to sort the results by date descending, and the same within the "2 matched"
 group, and so on...

 But I don't know HOW to get these results in Lucene...
 Should I override the scoring methods (tf(t in d), idf(t))?
 Could you give me some references about it?

 I am really stuck, and need your help!!


 -- 
 View this message in context:
 http://www.nabble.com/Need-help-for-ordering-results-by-specific-order-tf4101844.html#a11664583
 Sent from the Lucene - Java Developer mailing list archive at Nabble.com.




   
>>>
>>>
>>>
>>> 
>>
>>   
> 
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Need-help-for-ordering-results-by-specific-order-tf4101844.html#a11700924
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.





[jira] Updated: (LUCENE-868) Making Term Vectors more accessible

2007-07-19 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-868:
---

Attachment: LUCENE-868-v4.patch

Based on Yonik's and Karl's comments about avoiding loading the offset and
position arrays, this patch adds two new methods to TermVectorMapper that tell
the TermVectorsReader whether the mapper is interested in offsets and positions,
regardless of whether they are stored.

> Making Term Vectors more accessible
> ---
>
> Key: LUCENE-868
> URL: https://issues.apache.org/jira/browse/LUCENE-868
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-868-v2.patch, LUCENE-868-v3.patch, 
> LUCENE-868-v4.patch
>
>
> One of the big issues with term vector usage is that the information is 
> loaded into parallel arrays as it is loaded, which are then often times 
> manipulated again to use in the application (for instance, they are sorted by 
> frequency).
> Adding a callback mechanism that allows the vector loading to be handled by 
> the application would make this a lot more efficient.
> I propose to add to IndexReader:
> abstract public void getTermFreqVector(int docNumber, String field, 
> TermVectorMapper mapper) throws IOException;
> and a similar one for the all fields version
> Where TermVectorMapper is an interface with a single method:
> void map(String term, int frequency, int offset, int position);
> The TermVectorReader will be modified to just call the TermVectorMapper.  The 
> existing getTermFreqVectors will be reimplemented to use an implementation of 
> TermVectorMapper that creates the parallel arrays.  Additionally, some simple 
> implementations that automatically sort vectors will also be created.
> This is my first draft of this API and is subject to change.  I hope to have 
> a patch soon.
> See 
> http://www.gossamer-threads.com/lists/lucene/java-user/48003?search_string=get%20the%20total%20term%20frequency;#48003
>  for related information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: Token termBuffer issues

2007-07-19 Thread Michael McCandless
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> I had previously missed the changes to Token that add support for
> using an array (termBuffer):
> 
> +  // For better indexing speed, use termBuffer (and
> +  // termBufferOffset/termBufferLength) instead of termText
> +  // to save new'ing a String per token
> +  char[] termBuffer;
> +  int termBufferOffset;
> +  int termBufferLength;
> 
> While I think this approach would have been best to start off with
> rather than String,
> I'm concerned that it will do little more than add overhead at this
> point, resulting in slower code, not faster.
> 
> - If any tokenizer or token filter tries setting the termBuffer, any
> downstream components would need to check for both.  It could be made
> backward compatible by constructing a string on demand, but that will
> really slow things down, unless the whole chain is converted to only
> using the char[] somehow.

Good point: if your analyzer/tokenizer produces char[] tokens then
your downstream filters would have to accept char[] tokens.

I think on-demand constructing a String (and saving it as termText)
would be an OK solution?  Why would that be slower than having to make
a String in the first place (if we didn't have the char[] API)?  It's
at least graceful degradation.

> - It doesn't look like the indexing code currently pays any attention
> to the char[], right?

It does, in DocumentsWriter.addPosition().

> - What if both the String and char[] are set?  A filter that doesn't
> know better sets the String... this doesn't clear the char[]
> currently, should it?

Currently the char[] wins, but good point: seems like each setter
should null out the other one?
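
Something along these lines, for example (a sketch only, not the actual patch;
the setTermBuffer method here is hypothetical):

  // keep the two representations consistent, and build the String lazily so
  // char[]-aware consumers stay fast while legacy filters that call
  // termText() still work
  public void setTermBuffer(char[] buffer, int offset, int length) {
    termBuffer = buffer;
    termBufferOffset = offset;
    termBufferLength = length;
    termText = null;                 // the char[] is now authoritative
  }

  public void setTermText(String text) {
    termText = text;
    termBuffer = null;               // the String is now authoritative
  }

  public String termText() {
    if (termText == null && termBuffer != null) {
      // graceful degradation: materialize the String on demand
      termText = new String(termBuffer, termBufferOffset, termBufferLength);
    }
    return termText;
  }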

Mike




Token termBuffer issues

2007-07-19 Thread Yonik Seeley

I had previously missed the changes to Token that add support for
using an array (termBuffer):

+  // For better indexing speed, use termBuffer (and
+  // termBufferOffset/termBufferLength) instead of termText
+  // to save new'ing a String per token
+  char[] termBuffer;
+  int termBufferOffset;
+  int termBufferLength;

While I think this approach would have been best to start off with
rather than String,
I'm concerned that it will do little more than add overhead at this
point, resulting in slower code, not faster.

- If any tokenizer or token filter tries setting the termBuffer, any
downstream components would need to check for both.  It could be made
backward compatible by constructing a string on demand, but that will
really slow things down, unless the whole chain is converted to only
using the char[] somehow.

- It doesn't look like the indexing code currently pays any attention
to the char[], right?

- What if both the String and char[] are set?  A filter that doesn't
know better sets the String... this doesn't clear the char[]
currently, should it?

Thoughts?

-Yonik




[jira] Commented: (LUCENE-868) Making Term Vectors more accessible

2007-07-19 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513991
 ] 

Grant Ingersoll commented on LUCENE-868:


The TermVectorOffsetInfo and position arrays are only created if storeOffsets
and storePositions are turned on.  But we could also add mapper methods like:

boolean isIgnoringOffsets()
and
boolean isIgnoringPositions()

Then, in TermVectorsReader, the check could become:

if (storePositions && mapper.isIgnoringPositions() == false)

and likewise for isIgnoringOffsets.  This way a mapper can say whether it wants
these arrays to be constructed even when offsets/positions are stored.  Then we
just need to skip ahead by the appropriate amount.
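
For example, a mapper that only wants term + frequency could then look roughly
like this -- sketched against the map() signature proposed in this issue and the
two hooks discussed above; the final API may differ:

  // Collects term -> frequency and tells the reader it has no use for the
  // offset/position information, so those arrays need not be built at all.
  public class FrequencyOnlyMapper implements TermVectorMapper {

    private final java.util.Map termFreqs = new java.util.HashMap();

    // the two new hooks discussed above
    public boolean isIgnoringOffsets()   { return true; }
    public boolean isIgnoringPositions() { return true; }

    public void map(String term, int frequency, int offset, int position) {
      termFreqs.put(term, new Integer(frequency));
    }

    public java.util.Map getTermFrequencies() {
      return termFreqs;
    }
  }

invoked through the IndexReader method proposed in the issue, e.g.
reader.getTermFreqVector(docNumber, "content", new FrequencyOnlyMapper()).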


> Making Term Vectors more accessible
> ---
>
> Key: LUCENE-868
> URL: https://issues.apache.org/jira/browse/LUCENE-868
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-868-v2.patch, LUCENE-868-v3.patch
>
>
> One of the big issues with term vector usage is that the information is 
> loaded into parallel arrays as it is loaded, which are then often times 
> manipulated again to use in the application (for instance, they are sorted by 
> frequency).
> Adding a callback mechanism that allows the vector loading to be handled by 
> the application would make this a lot more efficient.
> I propose to add to IndexReader:
> abstract public void getTermFreqVector(int docNumber, String field, 
> TermVectorMapper mapper) throws IOException;
> and a similar one for the all fields version
> Where TermVectorMapper is an interface with a single method:
> void map(String term, int frequency, int offset, int position);
> The TermVectorReader will be modified to just call the TermVectorMapper.  The 
> existing getTermFreqVectors will be reimplemented to use an implementation of 
> TermVectorMapper that creates the parallel arrays.  Additionally, some simple 
> implementations that automatically sort vectors will also be created.
> This is my first draft of this API and is subject to change.  I hope to have 
> a patch soon.
> See 
> http://www.gossamer-threads.com/lists/lucene/java-user/48003?search_string=get%20the%20total%20term%20frequency;#48003
>  for related information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-868) Making Term Vectors more accessible

2007-07-19 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513983
 ] 

Karl Wettin commented on LUCENE-868:


Sorry for the delay, vacation time.

In short, I think this is a really nice improvement to the API. I also agree
with Yonik about the arrays constructed and passed down to the mapper. Perhaps
your current implementation could be moved one layer further up? Another
thought is to reuse the array(s) and pass along the data length, but that might
just complicate things.

I'll try these changes out next week and see how well they work.

I use the term vectors for text classification. For each new classifier
introduced (which happens quite a lot) I iterate over the corpus and classify
the documents. It could potentially save me quite a few ticks and bits not to
create all those arrays, though my gut tells me there might be some JVM
settings that do the same trick. I'll have to look into that.



> Making Term Vectors more accessible
> ---
>
> Key: LUCENE-868
> URL: https://issues.apache.org/jira/browse/LUCENE-868
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-868-v2.patch, LUCENE-868-v3.patch
>
>
> One of the big issues with term vector usage is that the information is 
> loaded into parallel arrays as it is loaded, which are then often times 
> manipulated again to use in the application (for instance, they are sorted by 
> frequency).
> Adding a callback mechanism that allows the vector loading to be handled by 
> the application would make this a lot more efficient.
> I propose to add to IndexReader:
> abstract public void getTermFreqVector(int docNumber, String field, 
> TermVectorMapper mapper) throws IOException;
> and a similar one for the all fields version
> Where TermVectorMapper is an interface with a single method:
> void map(String term, int frequency, int offset, int position);
> The TermVectorReader will be modified to just call the TermVectorMapper.  The 
> existing getTermFreqVectors will be reimplemented to use an implementation of 
> TermVectorMapper that creates the parallel arrays.  Additionally, some simple 
> implementations that automatically sort vectors will also be created.
> This is my first draft of this API and is subject to change.  I hope to have 
> a patch soon.
> See 
> http://www.gossamer-threads.com/lists/lucene/java-user/48003?search_string=get%20the%20total%20term%20frequency;#48003
>  for related information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: search quality - assessment & improvements

2007-07-19 Thread Doron Cohen
> However ... i still think that if you realy want
> a length norm that takes into account the average
> length of the docs, you want one that rewards docs
> for being near the average ...

... like SweetSpotSimilarity (SSS)

> it doesn't seem to make a lot of sense to me to say
> that a doc whose length is N% longer than the
> average length is significantly worse than docs whose
> length is N% shorter than the average length.

I don't understand why a doc should be punished just
for having a length different from the average
(i.e., regardless of whether it is longer or shorter).

The (evolving) way I understand it:
(a) Very long docs are likely to contain everything,
so let's punish them to counteract this;
(b) This is what the original doc-length-norm
actually does;
(c) But then very short docs might be
rewarded too much;
(d) Now we might get stupid (or erroneous)
few-word docs as top results;
(e) To solve this, pivoted doc-length-norm punishes overly
long docs (longer than the average) but only slightly
rewards docs that are shorter than the average.

It makes sense to me (IR'ishly if I may say so).
The SSS way does not make sense to me that way.
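
For concreteness, one common formulation of that pivoted norm, as a sketch
(not committed code; the slope and the average field length would come from
tuning and from corpus statistics):

  // With 0 < slope <= 1: a doc at the average length gets norm 1.0, longer
  // docs get a norm below 1.0 that keeps dropping, and shorter docs are
  // rewarded only up to the bounded value 1/(1 - slope) -- "only slightly".
  public class PivotedLengthNormSimilarity extends org.apache.lucene.search.DefaultSimilarity {
    private final float slope;        // e.g. 0.2f - 0.3f
    private final float avgLength;    // average number of terms per field, from the corpus

    public PivotedLengthNormSimilarity(float slope, float avgLength) {
      this.slope = slope;
      this.avgLength = avgLength;
    }

    public float lengthNorm(String fieldName, int numTerms) {
      return 1.0f / ((1.0f - slope) + slope * (numTerms / avgLength));
    }
  }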





[jira] Resolved: (LUCENE-957) Lucene RAM Directory doesn't work for Index Size > 8 GB

2007-07-19 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen resolved LUCENE-957.


   Resolution: Fixed
Lucene Fields:   (was: [New])

committed.

> Lucene RAM Directory doesn't work for Index Size > 8 GB
> ---
>
> Key: LUCENE-957
> URL: https://issues.apache.org/jira/browse/LUCENE-957
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Store
>Reporter: Doron Cohen
>Assignee: Doron Cohen
> Attachments: lucene-957.patch, lucene-957.patch
>
>
> From the user list - http://www.gossamer-threads.com/lists/lucene/java-user/50982
> The problem seems to be casting issues in RAMInputStream.
> Line 90:
>   bufferStart = BUFFER_SIZE * currentBufferIndex;
> Both operands on the rhs are ints while the lhs is a long,
> so a very large product first overflows Integer.MAX_VALUE, becomes negative, and
> is only then (auto-)widened to long -- too late.
> Line 91:
>  bufferLength = (int) (length - bufferStart);
> Both operands on the rhs are longs while the lhs is an int,
> so the (int) cast result may turn negative and the logic that follows would
> be wrong.
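
In code terms, the overflow and the kind of cast that avoids it (illustration
only, not necessarily the exact committed change):

  // broken: int * int is evaluated in int arithmetic, overflows, and only the
  // already-negative result is widened to long
  bufferStart = BUFFER_SIZE * currentBufferIndex;

  // safe: widen one operand first so the multiplication itself is done in long
  bufferStart = (long) BUFFER_SIZE * currentBufferIndex;

  // with bufferStart correct, the narrowing cast on line 91 stays within
  // BUFFER_SIZE and is safe
  bufferLength = (int) (length - bufferStart);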

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: svn commit: r557445 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/document/Field.java src/test/org/apache/lucene/document/TestDocument.java

2007-07-19 Thread Michael McCandless

I agree.  I will add wording to that effect, and also link over to the Wiki 
page for details (and update the Wiki page with these details!).

Mike

"Doron Cohen" <[EMAIL PROTECTED]> wrote:
> mikemccand wrote:
> > +  /** Expert: change the value of this field.  This can be
> > +   *  used during indexing to re-use a single Field instance
> > +   *  to improve indexing speed. */
> > +  public void setValue(String value) {
> 
> Would it make sense to warn against modifying the field
> value before the doc has been added?
> Something like:
>   "Note that field reuse means adding the same field instance
>   to multiple documents. You cannot reuse a field instance
>   to add multiple fields to the same document."
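
The reuse pattern being documented, roughly (an illustrative sketch; writer and
texts are assumed to exist):

  // one Field instance, reused across documents
  Field body = new Field("body", "", Field.Store.NO, Field.Index.TOKENIZED);
  for (int i = 0; i < texts.length; i++) {
    body.setValue(texts[i]);     // safe: the previous doc has already been added
    Document doc = new Document();
    doc.add(body);               // a fresh Document each time, same Field instance
    writer.addDocument(doc);
  }
  // NOT ok: calling doc.add(body) twice on the same Document to get two
  // fields -- that is the case the note above warns about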




Re: Need help for ordering results by specific order

2007-07-19 Thread Mathieu Lecarme
If I understand your needs correctly:
you ask Lucene for a set of words, and
you want to sort the results by the number of different words that match?
The query is not right; it should be

+content:(aleden bob carray)

I don't understand how you can sort at indexing time using information
that is only known at query time.
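
At query time, though, one way to get that grouping is to flatten tf and idf so
the score reflects only how many distinct query words matched, and then sort by
score and date (a sketch, untested; searcher and query assumed):

  // Score becomes proportional to the number of distinct query terms that
  // matched, because repeats and term rarity no longer contribute.
  public class DistinctMatchSimilarity extends org.apache.lucene.search.DefaultSimilarity {
    public float tf(float freq) { return freq > 0 ? 1.0f : 0.0f; }
    public float idf(int docFreq, int numDocs) { return 1.0f; }
    // lengthNorm is baked into the index at indexing time, so either index
    // with this Similarity as well or omit norms on the field; otherwise
    // document length can still reorder hits within a group
    public float lengthNorm(String fieldName, int numTerms) { return 1.0f; }
  }

  searcher.setSimilarity(new DistinctMatchSimilarity());
  Sort sort = new Sort(new SortField[] {
      SortField.FIELD_SCORE,                          // "3 matched" before "2 matched" ...
      new SortField("date", SortField.STRING, true)   // newest first within a group
  });
  Hits hits = searcher.search(query, sort);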

M.
savageboy wrote:
> Yes, Mathieu.
> I have the book "Lucene in Action" at hand; it is the Chinese-language
> version and covers Lucene 1.4 -- I hope it is not too old.
> If I use SortComparatorSource, does that mean the sorting will be done at
> query time?
> Can I sort (or maybe score) at indexing time instead?
>
>
>
> Mathieu Lecarme wrote:
>   
>> Have a look at the book "Lucene in Action", ch. 6.1: "Using a custom
>> sort method".
>>
>> SortComparatorSource might be your friend: Lucene selects the documents,
>> and you sort them just the way you want.
>>
>> M.
>> On 18 Jul 2007, at 10:29, savageboy wrote:
>>
>> 
>>> Hi,
>>> I am new to Lucene.
>>> I have a search engine project using Lucene 2.0. But near the end of the
>>> project, my boss wants me to order the results by the sort shown below:
>>>
>>> the query looks like '+content:"aleden bob carray"'
>>>
>>> content                                        date         order
>>> "alden bob carray ... "                        2005/12/23   1
>>> "alden... alden ... bob... bob... carray..."   2005/12/01   2
>>> "alden... alden ... bob... carray"             2005/11/28   3
>>> "alden... carray"                              2005/12/24   4
>>> "alden... bob"                                 2005/12/24   5
>>>
>>> The meaning of the sort above: no matter how many times a term matches in
>>> the field "content", there are four possible situations: "3 matched",
>>> "2 matched", "1 matched", "0 matched". Within the "3 matched" group I need
>>> to sort the results by date descending, and the same within the "2 matched"
>>> group, and so on...
>>>
>>> But I don't know HOW to get these results in Lucene...
>>> Should I override the scoring methods (tf(t in d), idf(t))?
>>> Could you give me some references about it?
>>>
>>> I am really stuck, and need your help!!
>>>
>>>
>>> -- 
>>> View this message in context:
>>> http://www.nabble.com/Need-help-for-ordering-results-by-specific-order-tf4101844.html#a11664583
>>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>>>
>>>
>>>
>>>
>>>   
>>
>>
>>
>> 
>
>   

