[jira] Commented: (LUCENE-252) [PATCH] Problem with Sort logic on tokenized fields

Enis Soztutar (JIRA) Tue, 13 Mar 2007 08:57:34 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12480476
 ]


Enis Soztutar commented on LUCENE-252:
--------------------------------------

I should admit that, considering the case Yonik has mentioned, throwing an 
exception by checking Field.isTokenized() is not suitable. However, the 
check if (t >= mterms.length) is only in getStringIndex() and not in 
getStrings(). I think that a more robust check than the aforementioned one 
should be included in both getStrings() and getStringIndex(). One possibility 
would be to allocate a boolean array (or BitSet) of the same size as 
retArray, and then use it to detect documents that have more than one term.
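
A minimal sketch of the boolean array / BitSet idea, not taken from any of the 
attached patches. The class SingleTermCheck, the method fillStrings and the 
terms/postings arrays are hypothetical stand-ins for the TermEnum/TermDocs 
loop in FieldCacheImpl, and throwing an exception on the second term is an 
assumption (skipping the extra term would be another choice):

```java
import java.util.BitSet;

/** Hypothetical helper, not part of Lucene. */
final class SingleTermCheck {

    /**
     * Fills one value per document.
     * terms[i] is the i-th term of the field and postings[i] lists the ids of
     * the documents containing it (a stand-in for TermEnum/TermDocs).
     */
    static String[] fillStrings(int maxDoc, String[] terms, int[][] postings) {
        String[] retArray = new String[maxDoc];
        // One bit per document: set once the document has received a term.
        BitSet seen = new BitSet(maxDoc);
        for (int i = 0; i < terms.length; i++) {
            for (int doc : postings[i]) {
                if (seen.get(doc)) {
                    // Assumption: fail fast instead of silently overwriting.
                    throw new IllegalStateException(
                        "document " + doc + " has more than one term in this field");
                }
                seen.set(doc);
                retArray[doc] = terms[i];
            }
        }
        return retArray;
    }
}
```

Unlike a check on Field.isTokenized(), this only fails when a document really 
has several terms, so a tokenized field that yields a single token per 
document still works.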

> 2) the desired behavior you are requesting in a StoredFieldCacheImpl could be 
> done without making any changes whatsoever to FieldCacheImpl -- since 
> Nutch knows exactly which fields it's indexing multiple tokens for, it can 
> make the choice between using a StoredFieldCacheImpl or using a 
> FieldCacheImpl.

From my previous post: in Nutch we have three options. The 1st is to disallow 
deleting duplicates on tokenized fields (due to FieldCache), the 2nd is to 
index the tokenized field twice (once tokenized and once untokenized), and the 
3rd is to use the above patch and warm the cache initially in the index servers.

Yes, indexing a field a second time is an option, but considering my use cases 
with Nutch, why would I want to grow my index by indexing the field twice 
instead of tolerating 30 seconds of cache building in a web server that will 
serve the indexes for days or even weeks?

With a class like StoredFieldCacheImpl we can get the desired behaviour without 
modifying FieldCacheImpl, and the suggestion in my previous post, without 
its 1st part, does just this. I could have sent this to Nutch, but I think it 
is a Lucene issue.
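
To make the idea concrete, here is a rough sketch of such a cache, assuming it 
is filled once from the stored field values and then reused. It is not the 
code from the patch: the StoredFieldSource interface is a hypothetical 
stand-in for the few reader calls needed (maxDoc, isDeleted, reading a stored 
value), and StoredFieldCacheSketch is an illustrative name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical stand-in for the parts of an index reader the sketch needs. */
interface StoredFieldSource {
    int maxDoc();
    boolean isDeleted(int doc);
    /** Returns the stored value of the field, or null if none is stored. */
    String storedValue(int doc, String field);
}

/** Hypothetical sketch: caches stored field values per reader and field. */
final class StoredFieldCacheSketch {

    // Weak keys so that entries go away when the reader is no longer used.
    private final Map<StoredFieldSource, Map<String, String[]>> cache =
        new WeakHashMap<StoredFieldSource, Map<String, String[]>>();

    synchronized String[] getStrings(StoredFieldSource reader, String field) {
        Map<String, String[]> perReader = cache.get(reader);
        if (perReader == null) {
            perReader = new HashMap<String, String[]>();
            cache.put(reader, perReader);
        }
        String[] values = perReader.get(field);
        if (values == null) {
            // The slow part: one pass over all documents, done once per reader.
            values = new String[reader.maxDoc()];
            for (int doc = 0; doc < values.length; doc++) {
                if (!reader.isDeleted(doc)) {
                    values[doc] = reader.storedValue(doc, field);
                }
            }
            perReader.put(field, values);
        }
        return values;
    }
}
```

Calling getStrings() once when an index server starts would play the role of 
the cache warming mentioned above; later calls return the cached array.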

Any more suggestions?





> [PATCH] Problem with Sort logic on tokenized fields
> ---------------------------------------------------
>
>                 Key: LUCENE-252
>                 URL: https://issues.apache.org/jira/browse/LUCENE-252
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.4
>         Environment: Operating System: other
> Platform: All
>            Reporter: Aviran Mordo
>         Assigned To: Lucene Developers
>         Attachments: dif.txt, 
> FieldCacheImpl_Tokenized_fields_lucene_2.0.patch, 
> FieldCacheImpl_Tokenized_fields_lucene_2.0_v1.1.patch, 
> FieldCacheImpl_Tokenized_fields_lucene_2.2-dev.patch
>
>
> When you set a SortField to a Text field which gets tokenized,
> FieldCacheImpl uses the term to do the sort, but then sorting is off, 
> especially with more than one word in the field. I think it is much 
> more logical to sort by the field's string value if the sort field is 
> tokenized and stored. This way you'll get the CORRECT sort order.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

