[ 
https://issues.apache.org/jira/browse/LUCENE-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431807#comment-13431807
 ] 

Robert Muir commented on LUCENE-3312:
-------------------------------------

Something is going wrong with the indexing of the Reuters content.

I ran the test with SimpleText on both branches (adding forceMerge(1) for 
simplicity) and looked at the resulting index:

Trunk:
{noformat}
-rw-rw-r-- 1 rmuir rmuir   13798 Aug  9 09:42 _0_6.len
-rw-rw-r-- 1 rmuir rmuir 1022509 Aug  9 09:42 _0.fld
-rw-rw-r-- 1 rmuir rmuir    1310 Aug  9 09:42 _0.inf
-rw-rw-r-- 1 rmuir rmuir 3345582 Aug  9 09:42 _0.pst
-rw-rw-r-- 1 rmuir rmuir     513 Aug  9 09:42 _0.si
-rw-rw-r-- 1 rmuir rmuir      71 Aug  9 09:42 segments_1
-rw-rw-r-- 1 rmuir rmuir      20 Aug  9 09:42 segments.gen
{noformat}

Branch:
{noformat}
-rw-rw-r-- 1 rmuir rmuir     13262 Aug  9 09:46 _4_6.len
-rw-rw-r-- 1 rmuir rmuir 290247032 Aug  9 09:45 _4.fld
-rw-rw-r-- 1 rmuir rmuir      1310 Aug  9 09:46 _4.inf
-rw-rw-r-- 1 rmuir rmuir 459164224 Aug  9 09:46 _4.pst
-rw-rw-r-- 1 rmuir rmuir       593 Aug  9 09:46 _4.si
-rw-rw-r-- 1 rmuir rmuir        71 Aug  9 09:46 segments_1
-rw-rw-r-- 1 rmuir rmuir        20 Aug  9 09:46 segments.gen
{noformat}

Looking into the .fld file, I think the problem is obvious:
on trunk:
{noformat}
doc 0
  numfields 5
doc 1
  numfields 5
doc 2
  numfields 5
{noformat}

on branch:
{noformat}
doc 0
  numfields 5
doc 1
  numfields 10
doc 2
  numfields 15
{noformat}
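
The per-document counts in the two excerpts above can also be compared mechanically. What follows is only a rough sketch: it assumes the dump contains lines of the form "numfields N" exactly as shown and ignores every other line, and the class and method names (NumFieldsCheck, numFieldsPerDoc, accumulates) are made up for illustration, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

public class NumFieldsCheck {

    /** Collects the value of every "numfields N" line, in file order. */
    static List<Integer> numFieldsPerDoc(String dump) {
        List<Integer> counts = new ArrayList<>();
        for (String line : dump.split("\n")) {
            String t = line.trim();
            if (t.startsWith("numfields ")) {
                counts.add(Integer.parseInt(t.substring("numfields ".length()).trim()));
            }
        }
        return counts;
    }

    /** True if each document has strictly more fields than the one before it. */
    static boolean accumulates(List<Integer> counts) {
        if (counts.size() < 2) {
            return false;
        }
        for (int i = 1; i < counts.size(); i++) {
            if (counts.get(i) <= counts.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The two excerpts from the .fld files, copied from above.
        String trunk = "doc 0\n  numfields 5\ndoc 1\n  numfields 5\ndoc 2\n  numfields 5\n";
        String branch = "doc 0\n  numfields 5\ndoc 1\n  numfields 10\ndoc 2\n  numfields 15\n";

        System.out.println("trunk accumulates:  " + accumulates(numFieldsPerDoc(trunk)));
        System.out.println("branch accumulates: " + accumulates(numFieldsPerDoc(branch)));
    }
}
```

Run against the excerpts, this reports false for trunk and true for branch.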

So there is some bug where fields are 'accumulating' across documents: in 
the excerpt above, numfields grows by 5 with each document. The last 
document has 2890 fields.
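
For illustration only, here is one way this kind of accumulation can arise. This is not a diagnosis of the branch: the class below is a hypothetical stand-in, not Lucene code, and it assumes the cause is per-document state that gets reused without being reset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AccumulatingFieldsSketch {

    // Hypothetical per-document buffer of stored field names; not a Lucene class.
    private final List<String> storedFields = new ArrayList<>();

    /** Buggy variant: the buffer is never cleared, so fields pile up across documents. */
    int addDocumentBuggy(List<String> fields) {
        storedFields.addAll(fields);
        return storedFields.size(); // what would be written as numfields
    }

    /** Corrected variant: per-document state is reset before each document. */
    int addDocumentFixed(List<String> fields) {
        storedFields.clear();
        storedFields.addAll(fields);
        return storedFields.size();
    }

    public static void main(String[] args) {
        // Five made-up field names per document, matching numfields 5 on trunk.
        List<String> doc = Arrays.asList("f1", "f2", "f3", "f4", "f5");

        AccumulatingFieldsSketch buggy = new AccumulatingFieldsSketch();
        AccumulatingFieldsSketch fixed = new AccumulatingFieldsSketch();
        for (int i = 0; i < 3; i++) {
            System.out.println("doc " + i
                + " buggy numfields " + buggy.addDocumentBuggy(doc)
                + ", fixed numfields " + fixed.addDocumentFixed(doc));
        }
    }
}
```

The buggy variant yields 5, 10, 15 for three identical documents, the same pattern as the branch's .fld file, while the corrected variant yields 5 each time.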

I'm really horrified this is the only test that fails!

                
> Break out StorableField from IndexableField
> -------------------------------------------
>
>                 Key: LUCENE-3312
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3312
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Assignee: Nikola Tankovic
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: Field Type branch
>
>         Attachments: lucene-3312-patch-01.patch, lucene-3312-patch-02.patch, 
> lucene-3312-patch-03.patch, lucene-3312-patch-04.patch, 
> lucene-3312-patch-05.patch, lucene-3312-patch-06.patch, 
> lucene-3312-patch-07.patch, lucene-3312-patch-08.patch, 
> lucene-3312-patch-09.patch
>
>
> In the field type branch we have strongly decoupled
> Document/Field/FieldType impl from the indexer, by having only a
> narrow API (IndexableField) passed to IndexWriter.  This frees apps up
> to use their own "documents" instead of the "user-space" impls we provide
> in oal.document.
> Similarly, with LUCENE-3309, we've done the same thing on the
> doc/field retrieval side (from IndexReader), with the
> StoredFieldsVisitor.
> But, maybe we should break out StorableField from IndexableField,
> such that when you index a doc you provide two Iterables -- one for the
> IndexableFields and one for the StorableFields.  Either can be null.
> One downside is possible perf hit for fields that are both indexed &
> stored (ie, we visit them twice, lookup their name in a hash twice,
> etc.).  But the upside is a cleaner separation of concerns in API....
