Re: search quality - assessment & improvements
: (d) Now we might get stupid (or erroneous)
: few words docs as top results;
: (e) To solve this, pivoted doc-length-norm punishes too
: long docs (longer than the average) but only slightly
: rewards docs that are shorter than the average.

i get that your calculation is much more gradual than the 1/sqrt(length), so extremely short docs are "only slightly" rewarded over average length docs ... i'm just not clear on why you want to reward super short docs at all.

Going back to SSS as an example, did you consider using a sweetspot that went from 0 to your pivot (so all docs with length less than or equal to the pivot/average length get an equal length boost)?

...that's actually what i started with when i first wrote SSS, but then i realized that in the case of really rare words (where the highest tf of all docs is just 1) the tf was the only discriminating factor in the scores of the various documents -- so it didn't matter if the norm for a 3 word doc was only slightly higher than (or equal to) that of an average length doc -- the 3 word doc would get a higher (or equal) score.

-Hoss

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
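To make the three length-norm shapes in this thread concrete, here is a minimal sketch in plain Java. Nothing in it is Lucene code: `LengthNormSketch`, its method names, the pivot of 100 terms, the slope of 0.25 and the steepness of 0.5 are all assumptions chosen for illustration. `pivoted` uses the textbook pivoted-normalization form, which may differ from the calculation actually under discussion, and `sweetSpot` is only modelled on the plateau idea behind SSS.

```java
/** Illustration only: three length-norm shapes, none of them taken from Lucene's source. */
final class LengthNormSketch {

    /** Classic norm: the shorter the document, the larger the norm. */
    static double inverseSqrt(double numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /**
     * Textbook pivoted normalization (assumed form): equals 1.0 at the pivot,
     * rises only a little for shorter docs, and falls for longer ones.
     */
    static double pivoted(double numTerms, double pivot, double slope) {
        return 1.0 / ((1.0 - slope) + slope * (numTerms / pivot));
    }

    /**
     * Plateau norm: exactly 1.0 for any length in [min, max], decaying outside
     * that range. The steepness parameter is an assumed tuning knob.
     */
    static double sweetSpot(double numTerms, double min, double max, double steepness) {
        double outside = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return 1.0 / Math.sqrt(steepness * outside + 1.0);
    }

    public static void main(String[] args) {
        double pivot = 100.0; // assumed average document length
        for (double len : new double[] {3, 50, 100, 200, 1000}) {
            // The first column is scaled so that an average-length doc scores 1.0.
            System.out.printf("len=%6.0f  1/sqrt=%.3f  pivoted=%.3f  plateau=%.3f%n",
                    len,
                    inverseSqrt(len) / inverseSqrt(pivot),
                    pivoted(len, pivot, 0.25),
                    sweetSpot(len, 0.0, pivot, 0.5));
        }
    }
}
```

With a plateau that starts at 0, a 3-word doc and an average-length doc get the same norm, which is the "higher (or equal) score" case described above; raising `min` above 0 would push very short docs off the plateau as well.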
Re: Need help for ordering results by specific order
Yes, I found that what I need is the term vector, which is stored at indexing time.
I appreciate you pointing me to "Lucene in Action", but I think the interface it describes is version 1.4, so I need the Lucene 2.0 syntax for adding the term vector to the document at indexing time.
By the way, the sort function is lower-level and faster than I thought, and it works nicely now (I've implemented my own ScoreDocComparator and comparator!).
Thanks a lot!
Best regards! :)

Mathieu Lecarme wrote:
>
> If I understand your needs correctly:
> You ask Lucene for a set of words.
> You want to sort the results by the number of different words which match?
> The query is not good; it should be
>
> +content:(aleden bob carray)
>
> I don't understand how you can sort at indexing time with information
> known only at query time.
>
> M.

--
View this message in context: http://www.nabble.com/Need-help-for-ordering-results-by-specific-order-tf4101844.html#a11700924
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
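For readers following the thread, the ordering that was asked for (group hits by how many distinct query terms they match, then sort by date descending inside each group) can be sketched in plain Java. This is not the poster's ScoreDocComparator and it uses no Lucene API: `MatchCountSortSketch`, `Hit` and the naive whitespace/punctuation tokenizing are all assumptions made for illustration.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustration only: "distinct terms matched, then newest first" on in-memory hits. */
final class MatchCountSortSketch {

    /** Hypothetical hit holder; a real application would read these values from the index. */
    static final class Hit {
        final String content;
        final LocalDate date;

        Hit(String content, LocalDate date) {
            this.content = content;
            this.date = date;
        }
    }

    /** Counts how many distinct query terms occur at least once in the content. */
    static int matchedTerms(String content, List<String> queryTerms) {
        List<String> tokens = Arrays.asList(content.toLowerCase().split("\\W+"));
        int matched = 0;
        for (String term : queryTerms) {
            if (tokens.contains(term)) {
                matched++;
            }
        }
        return matched;
    }

    /** Returns a sorted copy: more matched terms first, then newer dates first. */
    static List<Hit> sort(List<Hit> hits, List<String> queryTerms) {
        Comparator<Hit> byMatchesDesc = (a, b) -> Integer.compare(
                matchedTerms(b.content, queryTerms), matchedTerms(a.content, queryTerms));
        Comparator<Hit> byDateDesc = (a, b) -> b.date.compareTo(a.date);
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(byMatchesDesc.thenComparing(byDateDesc));
        return copy;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("alden", "bob", "carray");
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("alden... bob", LocalDate.of(2005, 12, 24)));
        hits.add(new Hit("alden... alden ... bob... carray", LocalDate.of(2005, 11, 28)));
        hits.add(new Hit("alden bob carray ...", LocalDate.of(2005, 12, 23)));
        for (Hit hit : sort(hits, query)) {
            System.out.println(matchedTerms(hit.content, query) + " matched  " + hit.date + "  " + hit.content);
        }
    }
}
```

Note that term frequency is deliberately ignored: a document that repeats "alden" five times still counts as one matched term, which is what the requested grouping implies.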
[jira] Updated: (LUCENE-868) Making Term Vectors more accessible
[ https://issues.apache.org/jira/browse/LUCENE-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-868:
---

    Attachment: LUCENE-868-v4.patch

Based on Yonik's and Karl's comments on avoiding loading the offset and position arrays, this patch has two new methods on the TermVectorMapper which tell the TermVectorsReader whether the Mapper is interested in offsets and positions or not, regardless of whether they are stored or not.

> Making Term Vectors more accessible
> ---
>
>          Key: LUCENE-868
>          URL: https://issues.apache.org/jira/browse/LUCENE-868
>      Project: Lucene - Java
>   Issue Type: New Feature
>   Components: Store
>     Reporter: Grant Ingersoll
>     Assignee: Grant Ingersoll
>     Priority: Minor
>  Attachments: LUCENE-868-v2.patch, LUCENE-868-v3.patch, LUCENE-868-v4.patch
>
> One of the big issues with term vector usage is that the information is
> loaded into parallel arrays as it is loaded, which are then often times
> manipulated again to use in the application (for instance, they are sorted by
> frequency).
> Adding a callback mechanism that allows the vector loading to be handled by
> the application would make this a lot more efficient.
> I propose to add to IndexReader:
>   abstract public void getTermFreqVector(int docNumber, String field,
>     TermVectorMapper mapper) throws IOException;
> and a similar one for the all fields version
> Where TermVectorMapper is an interface with a single method:
>   void map(String term, int frequency, int offset, int position);
> The TermVectorsReader will be modified to just call the TermVectorMapper. The
> existing getTermFreqVectors will be reimplemented to use an implementation of
> TermVectorMapper that creates the parallel arrays. Additionally, some simple
> implementations that automatically sort vectors will also be created.
> This is my first draft of this API and is subject to change. I hope to have
> a patch soon.
> See
> http://www.gossamer-threads.com/lists/lucene/java-user/48003?search_string=get%20the%20total%20term%20frequency;#48003
> for related information.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: Token termBuffer issues
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> I had previously missed the changes to Token that add support for
> using an array (termBuffer):
>
> + // For better indexing speed, use termBuffer (and
> + // termBufferOffset/termBufferLength) instead of termText
> + // to save new'ing a String per token
> + char[] termBuffer;
> + int termBufferOffset;
> + int termBufferLength;
>
> While I think this approach would have been best to start off with
> rather than String,
> I'm concerned that it will do little more than add overhead at this
> point, resulting in slower code, not faster.
>
> - If any tokenizer or token filter tries setting the termBuffer, any
> downstream components would need to check for both. It could be made
> backward compatible by constructing a string on demand, but that will
> really slow things down, unless the whole chain is converted to only
> using the char[] somehow.

Good point: if your analyzer/tokenizer produces char[] tokens then your downstream filters would have to accept char[] tokens.

I think on-demand constructing a String (and saving it as termText) would be an OK solution? Why would that be slower than having to make a String in the first place (if we didn't have the char[] API)? It's at least graceful degradation.

> - It doesn't look like the indexing code currently pays any attention
> to the char[], right?

It does, in DocumentsWriter.addPosition().

> - What if both the String and char[] are set? A filter that doesn't
> know better sets the String... this doesn't clear the char[]
> currently, should it?

Currently the char[] wins, but good point: seems like each setter should null out the other one?

Mike
Token termBuffer issues
I had previously missed the changes to Token that add support for using an array (termBuffer):

+ // For better indexing speed, use termBuffer (and
+ // termBufferOffset/termBufferLength) instead of termText
+ // to save new'ing a String per token
+ char[] termBuffer;
+ int termBufferOffset;
+ int termBufferLength;

While I think this approach would have been best to start off with rather than String, I'm concerned that it will do little more than add overhead at this point, resulting in slower code, not faster.

- If any tokenizer or token filter tries setting the termBuffer, any downstream components would need to check for both. It could be made backward compatible by constructing a string on demand, but that will really slow things down, unless the whole chain is converted to only using the char[] somehow.
- It doesn't look like the indexing code currently pays any attention to the char[], right?
- What if both the String and char[] are set? A filter that doesn't know better sets the String... this doesn't clear the char[] currently, should it?

Thoughts?

-Yonik
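The two open questions here (what happens when both representations are set, and whether a String can be built on demand) can be pictured with a small stand-in class. This is a sketch under stated assumptions, not Lucene's Token: `TokenSketch` and its method names are hypothetical, and only the three field names are borrowed from the quoted diff. Each setter clears the other representation, and the String is built lazily and cached.

```java
/** Illustration only: a token holding either a String or a char[] slice, never a stale mix. */
final class TokenSketch {
    private String termText;      // String view, possibly built lazily from the buffer
    private char[] termBuffer;    // char[] view, as in the quoted diff
    private int termBufferOffset;
    private int termBufferLength;

    /** Setting the String drops the char[] so the two views cannot disagree. */
    void setTermText(String text) {
        termText = text;
        termBuffer = null;
        termBufferOffset = 0;
        termBufferLength = 0;
    }

    /** Setting the char[] drops any previously set or cached String. */
    void setTermBuffer(char[] buffer, int offset, int length) {
        termBuffer = buffer;
        termBufferOffset = offset;
        termBufferLength = length;
        termText = null;
    }

    /**
     * Builds the String on first request for callers that only understand Strings,
     * then caches it. The cache goes stale if the caller mutates the shared buffer
     * afterwards without calling setTermBuffer again.
     */
    String termText() {
        if (termText == null && termBuffer != null) {
            termText = new String(termBuffer, termBufferOffset, termBufferLength);
        }
        return termText;
    }

    boolean hasTermBuffer() {
        return termBuffer != null;
    }

    public static void main(String[] args) {
        TokenSketch token = new TokenSketch();
        token.setTermBuffer("lucene-x".toCharArray(), 0, 6);
        System.out.println(token.termText() + " buffer=" + token.hasTermBuffer());
        token.setTermText("solr"); // a String-only filter overwrites the term
        System.out.println(token.termText() + " buffer=" + token.hasTermBuffer());
    }
}
```

The cost concern raised above still applies: a chain in which any filter calls `termText()` pays for one String per token, so the char[] path only helps when every component in the chain uses it.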
[jira] Commented: (LUCENE-868) Making Term Vectors more accessible
[ https://issues.apache.org/jira/browse/LUCENE-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513991 ]

Grant Ingersoll commented on LUCENE-868:

The TermVectorOffsetInfo and Position arrays are only created if storeOffsets and storePositions are turned on. But, we could also add mapper methods like:

  boolean isIgnoringOffsets()

and

  boolean isIgnoringPositions()

Then, in TermVectorsReader, it could become:

  if (storePositions && mapper.isIgnoringPositions() == false)

and likewise for isIgnoringOffsets. This way a mapper could express whether it wants these arrays to be constructed even if they are turned on. Then we just need to skip ahead by the appropriate amount.
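A rough sketch of how the proposed check could sit in a reader loop follows. Everything here is a stand-in written for illustration: `TermVectorReadSketch`, its nested `Mapper` interface and `readTerm` are hypothetical, offsets are simplified to `int[]` rather than TermVectorOffsetInfo, and the arrays are left empty where a real reader would decode stored data. Only the method names `isIgnoringOffsets` / `isIgnoringPositions` and the shape of the `if` come from the comment above.

```java
/** Illustration only: a reader that skips building arrays the mapper does not want. */
final class TermVectorReadSketch {

    /** Hypothetical callback, loosely following the API proposed in the issue. */
    interface Mapper {
        boolean isIgnoringOffsets();
        boolean isIgnoringPositions();
        void map(String term, int frequency, int[] offsets, int[] positions);
    }

    /** Arrays are allocated only when they are stored AND the mapper wants them. */
    static void readTerm(String term, int frequency,
                         boolean storeOffsets, boolean storePositions, Mapper mapper) {
        int[] positions = null;
        if (storePositions && !mapper.isIgnoringPositions()) {
            positions = new int[frequency]; // a real reader would decode positions here
        }
        int[] offsets = null;
        if (storeOffsets && !mapper.isIgnoringOffsets()) {
            offsets = new int[frequency]; // simplified stand-in for offset info
        }
        // When an array is skipped, a real reader still has to seek past its bytes.
        mapper.map(term, frequency, offsets, positions);
    }

    public static void main(String[] args) {
        Mapper frequencyOnly = new Mapper() {
            @Override public boolean isIgnoringOffsets() { return true; }
            @Override public boolean isIgnoringPositions() { return true; }
            @Override public void map(String term, int frequency, int[] offsets, int[] positions) {
                System.out.println(term + " freq=" + frequency
                        + " offsets=" + (offsets != null) + " positions=" + (positions != null));
            }
        };
        readTerm("lucene", 3, true, true, frequencyOnly);
    }
}
```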
[jira] Commented: (LUCENE-868) Making Term Vectors more accessible
[ https://issues.apache.org/jira/browse/LUCENE-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513983 ]

Karl Wettin commented on LUCENE-868:

Sorry for the delay, vacation time.

In short, I think this is a really nice improvement to the API. I also agree with Yonik about the array[]s constructed and passed down to the mapper. Perhaps your current implementation could be moved one layer further up? Another thought is to reuse array(s) and pass on the data length, but that might just complicate things. I'll try to introduce these things next week and see how well it works.

I use the term vectors for text classification. For each new classifier introduced (which occurs quite a lot) I iterate the corpus and classify the documents. Potentially it could save me quite a bit of ticks and bits to not create all them array[]s; however, my gut tells me there might be some JVM settings that do the same trick. I'll have to look into that.
Re: search quality - assessment & improvements
> However ... i still think that if you really want
> a length norm that takes into account the average
> length of the docs, you want one that rewards docs
> for being near the average ...

... like SweetSpotSimilarity (SSS)

> it doesn't seem to make a lot of sense to me to say
> that a doc whose length is N% longer than the
> average length is significantly worse than a doc whose
> length is N% shorter than the average length.

I don't understand why a doc should be punished just for having a length different from the average length (i.e. no matter whether longer or shorter).

The (evolving) way I understand it:
(a) Very long docs are likely to contain everything, let's punish them to relax this;
(b) This is what the original doc-length-norm actually does;
(c) But then very short docs might be rewarded too much;
(d) Now we might get stupid (or erroneous) few words docs as top results;
(e) To solve this, pivoted doc-length-norm punishes too long docs (longer than the average) but only slightly rewards docs that are shorter than the average.

It makes sense to me (IR'ishly, if I may say so). The SSS way does not make sense to me that way.
[jira] Resolved: (LUCENE-957) Lucene RAM Directory doesn't work for Index Size > 8 GB
[ https://issues.apache.org/jira/browse/LUCENE-957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen resolved LUCENE-957.

       Resolution: Fixed
    Lucene Fields:   (was: [New])

committed.

> Lucene RAM Directory doesn't work for Index Size > 8 GB
> ---
>
>          Key: LUCENE-957
>          URL: https://issues.apache.org/jira/browse/LUCENE-957
>      Project: Lucene - Java
>   Issue Type: Bug
>   Components: Store
>     Reporter: Doron Cohen
>     Assignee: Doron Cohen
>  Attachments: lucene-957.patch, lucene-957.patch
>
> from user list - http://www.gossamer-threads.com/lists/lucene/java-user/50982
> Problem seems to be casting issues in RAMInputStream.
> Line 90:
>   bufferStart = BUFFER_SIZE * currentBufferIndex;
> both rhs are ints while lhs is long.
> so a very large product would first overflow MAX_INT, become negative, and
> only then (auto) casted to long, but this is too late.
> Line 91:
>   bufferLength = (int) (length - bufferStart);
> both rhs are longs while lhs is int.
> so the (int) cast result may turn negative and the logic that follows would
> be wrong.
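The two casting problems described in the issue can be reproduced in isolation. The sketch below is not the committed patch and not RAMInputStream itself: `RamOffsetSketch` and its method names are made up, and `BUFFER_SIZE = 1024` is an assumed value used only to make the arithmetic visible. The "fixed" variants show one common way to avoid each overflow, namely widening before multiplying and clamping as a long before narrowing.

```java
/** Illustration only: int overflow before widening, and a narrowing cast that wraps. */
final class RamOffsetSketch {
    static final int BUFFER_SIZE = 1024; // assumed value, for illustration

    /** int * int is evaluated in 32 bits and may wrap before the result is widened to long. */
    static long bufferStartBuggy(int currentBufferIndex) {
        return BUFFER_SIZE * currentBufferIndex;
    }

    /** Widening one operand first forces a 64-bit multiplication. */
    static long bufferStartFixed(int currentBufferIndex) {
        return (long) BUFFER_SIZE * currentBufferIndex;
    }

    /** Casting a large long difference to int can wrap to a negative value. */
    static int bufferLengthBuggy(long length, long bufferStart) {
        return (int) (length - bufferStart);
    }

    /** Clamp to BUFFER_SIZE while still a long, then narrow safely. */
    static int bufferLengthFixed(long length, long bufferStart) {
        long remaining = length - bufferStart;
        return remaining > BUFFER_SIZE ? BUFFER_SIZE : (int) remaining;
    }

    public static void main(String[] args) {
        int index = 3_000_000; // roughly 3 GB into the file with 1 KB buffers
        System.out.println("buggy start = " + bufferStartBuggy(index));
        System.out.println("fixed start = " + bufferStartFixed(index));
        System.out.println("buggy length = " + bufferLengthBuggy(3_000_000_000L, 0L));
        System.out.println("fixed length = " + bufferLengthFixed(3_000_000_000L, 0L));
    }
}
```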
Re: svn commit: r557445 - in /lucene/java/trunk: CHANGES.txt src/java/org/apache/lucene/document/Field.java src/test/org/apache/lucene/document/TestDocument.java
I agree. I will add wording to that effect, and also link over to the Wiki page for details (and update the Wiki page with these details!).

Mike

"Doron Cohen" <[EMAIL PROTECTED]> wrote:
> mikemccand wrote:
> > + /** Expert: change the value of this field. This can be
> > + * used during indexing to re-use a single Field instance
> > + * to improve indexing speed. */
> > + public void setValue(String value) {
>
> Would it make sense to warn against modifying the field
> value before the doc was added?
> Something like:
> "Note that field reuse means adding the same field instance
> to multiple documents. You cannot reuse a field instance
> for adding multiple fields to the same document."
Re: Need help for ordering results by specific order
If I understand your needs correctly:
You ask Lucene for a set of words.
You want to sort the results by the number of different words which match?
The query is not good; it should be

+content:(aleden bob carray)

I don't understand how you can sort at indexing time with information known only at query time.

M.

savageboy wrote:
> Yes, Mathieu.
> I only have the book "Lucene in Action" at hand; it is the Chinese-language
> version and covers Lucene 1.4. I hope it is not too old.
> If I use SortComparatorSource, does it mean the sort work will be done at
> user query time?
> Can I sort (maybe score it) at indexing time?
>
> Mathieu Lecarme wrote:
>
>> Have a look at the book "Lucene in Action", ch. 6.1: "using custom
>> sort method"
>>
>> SortComparatorSource might be your friend. Lucene selects stuff,
>> and you sort, just like you want.
>>
>> M.
>> On 18 Jul 07 at 10:29, savageboy wrote:
>>
>>> Hi,
>>> I am new to Lucene.
>>> I have a search engine project using Lucene 2.0. But with the project
>>> nearly finished, my boss wants me to order the results as shown below:
>>>
>>> The query looks like '+content:"aleden bob carray" '
>>>
>>> content                                        date        order
>>> "alden bob carray ... "                        2005/12/23  1
>>> "alden... alden ... bob... bob... carray..."   2005/12/01  2
>>> "alden... alden ... bob... carray"             2005/11/28  3
>>> "alden... carray"                              2005/12/24  4
>>> "alden... bob"                                 2005/12/24  5
>>>
>>> The meaning of the sort above is that, no matter how many times a term
>>> matches in the field "content", there will be four situations: "3 matched",
>>> "2 matched", "1 matched", "0 matched". In the "3 matched" group, I need to
>>> sort the results by date descending, and the same in the "2 matched" group...
>>>
>>> But I don't know HOW to get these results in Lucene...
>>> Should I override the scoring method? (tf(t in d), idf(t))
>>> Could you give me some references about it?
>>>
>>> I am really stuck, and need your help!!
>>>
>>> --
>>> View this message in context: http://www.nabble.com/Need-help-for-ordering-results-by-specific-order-tf4101844.html#a11664583
>>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.