[ https://issues.apache.org/jira/browse/LUCENE-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479354 ]
Hoss Man commented on LUCENE-252:
---------------------------------

Definitely in agreement with yonik here: erroring out if "docField.isTokenized()" would prevent some perfectly valid use cases ... my point was that the current test of "if (t >= mterms.length)" only triggers an error if there are more total terms in the field than there are documents in the index ... but there can be plenty of situations where a doc has more than one indexed term while the total number of indexed terms is still less than the number of documents. A better test would be to check whether we have already recorded a term for this doc.

I have to say: I'm really not understanding how the current behavior is hindering nutch ... my understanding of the nutch model is that the set of fields is very well known -- why do you need to rely on FieldCache being smart enough to stop you from trying to sort on a tokenized field? (And what does that have to do with deleting duplicates?)

If nothing else: if nutch needs to prevent FieldCache-based sorting on tokenized fields, why can't the "if (docField.isTokenized())" logic be done outside of FieldCacheImpl ... possibly as a way to decide whether you want to use the basic sorting or something like LUCENE-769?

...perhaps this is something that should be discussed more on java-dev?
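To make the difference between the two tests concrete, here is a minimal, self-contained Java sketch. It does not use the Lucene API: the `SortedMap` of postings is a hypothetical stand-in for walking the field's terms and their documents, the class and method names are invented for illustration, and the `maxDoc + 1` array sizing is an assumption about how the cache sizes its term array.

```java
import java.util.SortedMap;
import java.util.TreeMap;

final class TokenizedFieldCheck {

    // Mimics the existing guard: it fires only when the field holds more
    // distinct terms than there are slots (assumed here to be maxDoc + 1).
    static boolean termCountCheckFails(SortedMap<String, int[]> postings, int maxDoc) {
        int mtermsLength = maxDoc + 1; // assumption: slot 0 is reserved for "no term"
        int t = 1;
        for (String term : postings.keySet()) {
            if (t >= mtermsLength) {
                return true;
            }
            t++;
        }
        return false;
    }

    // The test suggested in the comment: fail as soon as a document that
    // already has a term recorded is about to be assigned a second one.
    static boolean perDocCheckFails(SortedMap<String, int[]> postings, int maxDoc) {
        int[] order = new int[maxDoc]; // 0 means "no term recorded for this doc yet"
        int t = 1;
        for (int[] docs : postings.values()) {
            for (int doc : docs) {
                if (order[doc] != 0) {
                    return true; // this doc has more than one indexed term
                }
                order[doc] = t;
            }
            t++;
        }
        return false;
    }

    public static void main(String[] args) {
        // Five documents; doc 0 was indexed as "quick fox", docs 1-4 as "fox".
        SortedMap<String, int[]> postings = new TreeMap<>();
        postings.put("fox", new int[] {0, 1, 2, 3, 4});
        postings.put("quick", new int[] {0});

        System.out.println("term-count check fails: " + termCountCheckFails(postings, 5)); // false
        System.out.println("per-doc check fails: " + perDocCheckFails(postings, 5));       // true
    }
}
```

With two terms and five documents the term-count guard stays silent even though doc 0 is clearly tokenized, while the per-document check catches it on the second term.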
> [PATCH] Problem with Sort logic on tokenized fields
> ---------------------------------------------------
>
>                 Key: LUCENE-252
>                 URL: https://issues.apache.org/jira/browse/LUCENE-252
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>   Affects Versions: 1.4
>         Environment: Operating System: other
>                      Platform: All
>            Reporter: Aviran Mordo
>         Assigned To: Lucene Developers
>         Attachments: dif.txt,
>                      FieldCacheImpl_Tokenized_fields_lucene_2.0.patch,
>                      FieldCacheImpl_Tokenized_fields_lucene_2.0_v1.1.patch,
>                      FieldCacheImpl_Tokenized_fields_lucene_2.2-dev.patch
>
>
> When you set a SortField to a Text field which gets tokenized,
> FieldCacheImpl uses the term to do the sort, but then sorting is off,
> especially with more than one word in the field. I think it is much
> more logical to sort by the field's string value if the sort field is
> tokenized and stored. This way you'll get the CORRECT sort order.