[jira] Commented: (LUCENE-252) [PATCH] Problem with Sort logic on tokenized fields

Enis Soztutar (JIRA) Thu, 08 Mar 2007 00:25:48 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479260 ]


Enis Soztutar commented on LUCENE-252:
--------------------------------------

Well, I spent half a day tracking down this issue with tokenized field 
caching, so I absolutely agree with throwing an exception in the getStrings() 
and getStringIndex() methods of FieldCacheImpl. A snippet would look like this: 

      Field docField = getField(reader, field);
      if (docField != null && docField.isStored() && docField.isTokenized()) {
          throw new RuntimeException(
              "Caching in Tokenized Fields is not allowed");
      }
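
To illustrate why such a guard is useful, here is a simplified model of a 
term-based string cache. It is only a sketch and not Lucene's code: the class 
TermCacheSketch and its methods invert() and buildCache() are hypothetical 
names, and the inverted index is modeled as a plain TreeMap. It assumes the 
cache is filled by walking the terms in sorted order and writing each term 
into the slot of every document that contains it, so a document with several 
terms ends up with only the last one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermCacheSketch {

    // Hypothetical stand-in for an inverted index: term -> ids of docs containing it.
    // When tokenized is false, the whole field value is indexed as a single term.
    static TreeMap<String, List<Integer>> invert(String[] values, boolean tokenized) {
        TreeMap<String, List<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < values.length; doc++) {
            String[] terms = tokenized
                    ? values[doc].toLowerCase().split("\\s+")
                    : new String[] { values[doc] };
            for (String term : terms) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(doc);
            }
        }
        return postings;
    }

    // Walk the terms in sorted order; each term overwrites the cache slot
    // of every document that contains it.
    static String[] buildCache(TreeMap<String, List<Integer>> postings, int maxDoc) {
        String[] cache = new String[maxDoc];
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                cache[doc] = entry.getKey();
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        String[] titles = { "apple zebra", "banana" };
        // Tokenized: doc 0 is cached as "zebra" and would sort after "banana".
        System.out.println(Arrays.toString(buildCache(invert(titles, true), titles.length)));
        // Untokenized: doc 0 is cached as "apple zebra" and sorts first, as expected.
        System.out.println(Arrays.toString(buildCache(invert(titles, false), titles.length)));
    }
}
```

In this model the tokenized cache silently produces a wrong sort order rather 
than failing, which is the kind of behavior an explicit exception would make 
visible early.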

Looking at cache-building times, tokenized fields are really slow, as Doug 
mentioned: for a 1.5M real index (from web documents), building the cache on 
a tokenized field takes 1600 ms on average, but for an untokenized field 
it takes 30000 ms on average. 

In Nutch we have three options: the first is to disallow deleting duplicates 
on tokenized fields (because of FieldCache); the second is to index the 
tokenized field twice (once tokenized and once untokenized); the third is to 
use the above patch and warm the cache up front on the index servers. 

I am in favor of the third option and believe that this patch is necessary; it 
can be included with an explanatory javadoc. 
Another option would be to extend the default FieldCacheImpl to allow 
tokenized field caching, naming the class along the lines of LUCENE-769's, 
such as StoredFieldCacheImpl. If that is OK, I can prepare a patch and send it 
here. 
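
As a rough illustration of the stored-value idea, the following sketch builds 
the cache from each document's stored string instead of from its terms. It is 
not the attached patch or any Lucene API: StoredValueCacheSketch, 
buildFromStored() and sortedDocIds() are hypothetical names, documents are 
modeled as plain maps from field name to stored value, and placing documents 
without a stored value first is a choice made only for this sketch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class StoredValueCacheSketch {

    // Visit every document once and cache the stored value of the field
    // (null when the document has no stored value for it).
    static String[] buildFromStored(List<Map<String, String>> docs, String field) {
        String[] cache = new String[docs.size()];
        for (int doc = 0; doc < docs.size(); doc++) {
            cache[doc] = docs.get(doc).get(field);
        }
        return cache;
    }

    // Return doc ids ordered by their cached string value, nulls first.
    static Integer[] sortedDocIds(String[] cache) {
        Integer[] ids = new Integer[cache.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        Arrays.sort(ids, Comparator.comparing(
                (Integer id) -> cache[id],
                Comparator.nullsFirst(Comparator.<String>naturalOrder())));
        return ids;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = Arrays.asList(
                Collections.singletonMap("title", "banana"),
                Collections.singletonMap("title", "apple zebra"));
        String[] cache = buildFromStored(docs, "title");
        // Doc 1 ("apple zebra") sorts before doc 0 ("banana").
        System.out.println(Arrays.toString(sortedDocIds(cache)));
    }
}
```

Because this approach reads every document's stored value, warming the cache 
ahead of time on the index servers, as in the third option, would matter more 
here than for a term-based cache.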






> [PATCH] Problem with Sort logic on tokenized fields
> ---------------------------------------------------
>
>                 Key: LUCENE-252
>                 URL: https://issues.apache.org/jira/browse/LUCENE-252
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.4
>         Environment: Operating System: other
> Platform: All
>            Reporter: Aviran Mordo
>         Assigned To: Lucene Developers
>         Attachments: dif.txt, 
> FieldCacheImpl_Tokenized_fields_lucene_2.0.patch, 
> FieldCacheImpl_Tokenized_fields_lucene_2.0_v1.1.patch, 
> FieldCacheImpl_Tokenized_fields_lucene_2.2-dev.patch
>
>
> When you set a SortField to a Text field which gets tokenized,
> FieldCacheImpl uses the term to do the sort, but then sorting is off, 
> especially with more than one word in the field. I think it is much 
> more logical to sort by the field's string value if the sort field is 
> Tokenized and stored. This way you'll get the CORRECT sort order.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

