[
https://issues.apache.org/jira/browse/LUCENE-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479354
]
Hoss Man commented on LUCENE-252:
---------------------------------
definitely in agreement with yonik here, erroring out if
"docField.isTokenized()" would prevent some perfectly valid use cases ... my
point was that the current test of "if (t >= mterms.length)" only triggers an
error if there are more total terms in the field than there are documents in
the index ... but there can be plenty of situations where a doc has more than
one indexed term, but the total number of indexed terms is less than the number
of documents. a better test would be to check and see if we have already
recorded a term for this doc.
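
As a rough illustration of that per-document check, here is a minimal sketch.
It is not the actual FieldCacheImpl code: the class StringIndexSketch, the
method buildOrdinals, and the SortedMap of term text to doc ids (a stand-in
for walking the index's terms and their postings) are all assumptions made
so the example stands alone.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical, simplified model of filling a per-document term-ordinal array.
class StringIndexSketch {

    // postings: term text -> ids of the documents that contain that term
    // (assumed stand-in for the real index enumeration).
    // Returns one ordinal per document; 0 means "no term recorded".
    static int[] buildOrdinals(SortedMap<String, int[]> postings, int maxDoc) {
        int[] ordByDoc = new int[maxDoc];
        int ord = 1; // ordinal 0 is reserved for documents without a term
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                // The suggested test: fail as soon as a document already has
                // a term, instead of comparing term count to document count.
                if (ordByDoc[doc] != 0) {
                    throw new IllegalStateException(
                        "document " + doc + " has more than one indexed term");
                }
                ordByDoc[doc] = ord;
            }
            ord++;
        }
        return ordByDoc;
    }

    public static void main(String[] args) {
        SortedMap<String, int[]> postings = new TreeMap<>();
        postings.put("apple", new int[] {0, 2});
        postings.put("pear", new int[] {1});
        // One term per document, so this succeeds and prints [1, 2, 1]
        System.out.println(Arrays.toString(buildOrdinals(postings, 3)));
    }
}
```

With three documents and only two terms that both occur in document 0, a
count-based test would pass silently, while this sketch throws on the second
term.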
I have to say: I'm really not understanding how the current behavior is
hindering nutch ... my understanding of the nutch model is that the set of
fields is very well known -- why do you need to rely on FieldCache being smart
enough to stop you from trying to sort on a tokenized field? (and what does
that have to do with deleting duplicates?)
if nothing else: if nutch needs to prevent using FieldCache based sorting on
tokenized fields, why can't the "if (docField.isTokenized())" logic be done
outside of the FieldCacheImpl ... possibly as a way to decide if you want to
use the basic sorting or use something like LUCENE-769?
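
A minimal sketch of what such a caller-side decision could look like,
assuming the application already knows which of its fields are tokenized.
The class SortStrategyChooser, the Strategy enum, and the set of tokenized
field names are hypothetical and not part of Lucene or nutch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical caller-side helper: the decision lives in the application,
// not inside the field cache.
class SortStrategyChooser {

    enum Strategy {
        FIELD_CACHE, // the basic term-based sorting
        ALTERNATIVE  // some other approach chosen by the application
    }

    // tokenizedFields is assumed to come from the application's own,
    // well-known field definitions.
    static Strategy choose(String field, Set<String> tokenizedFields) {
        return tokenizedFields.contains(field)
            ? Strategy.ALTERNATIVE
            : Strategy.FIELD_CACHE;
    }

    public static void main(String[] args) {
        Set<String> tokenized = new HashSet<>(Arrays.asList("content", "title"));
        System.out.println(choose("title", tokenized)); // prints ALTERNATIVE
        System.out.println(choose("url", tokenized));   // prints FIELD_CACHE
    }
}
```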
...perhaps this is something that should be discussed more on java-dev?
> [PATCH] Problem with Sort logic on tokenized fields
> ---------------------------------------------------
>
> Key: LUCENE-252
> URL: https://issues.apache.org/jira/browse/LUCENE-252
> Project: Lucene - Java
> Issue Type: Bug
> Components: Search
> Affects Versions: 1.4
> Environment: Operating System: other
> Platform: All
> Reporter: Aviran Mordo
> Assigned To: Lucene Developers
> Attachments: dif.txt,
> FieldCacheImpl_Tokenized_fields_lucene_2.0.patch,
> FieldCacheImpl_Tokenized_fields_lucene_2.0_v1.1.patch,
> FieldCacheImpl_Tokenized_fields_lucene_2.2-dev.patch
>
>
> When you set a SortField to a Text field which gets tokenized,
> FieldCacheImpl uses the term to do the sort, but then sorting is off,
> especially with more than one word in the field. I think it is much
> more logical to sort by the field's string value if the sort field is
> Tokenized and stored. This way you'll get the CORRECT sort order.