[ https://issues.apache.org/jira/browse/LUCENE-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431807#comment-13431807 ]
Robert Muir commented on LUCENE-3312:
-------------------------------------

Something is going wrong with the indexing of the reuters content.

I ran the test with SimpleText on both branches (adding forceMerge(1) for simplicity) and looked at the resulting index:

Trunk:
{noformat}
-rw-rw-r-- 1 rmuir rmuir   13798 Aug 9 09:42 _0_6.len
-rw-rw-r-- 1 rmuir rmuir 1022509 Aug 9 09:42 _0.fld
-rw-rw-r-- 1 rmuir rmuir    1310 Aug 9 09:42 _0.inf
-rw-rw-r-- 1 rmuir rmuir 3345582 Aug 9 09:42 _0.pst
-rw-rw-r-- 1 rmuir rmuir     513 Aug 9 09:42 _0.si
-rw-rw-r-- 1 rmuir rmuir      71 Aug 9 09:42 segments_1
-rw-rw-r-- 1 rmuir rmuir      20 Aug 9 09:42 segments.gen
{noformat}

Branch:
{noformat}
-rw-rw-r-- 1 rmuir rmuir     13262 Aug 9 09:46 _4_6.len
-rw-rw-r-- 1 rmuir rmuir 290247032 Aug 9 09:45 _4.fld
-rw-rw-r-- 1 rmuir rmuir      1310 Aug 9 09:46 _4.inf
-rw-rw-r-- 1 rmuir rmuir 459164224 Aug 9 09:46 _4.pst
-rw-rw-r-- 1 rmuir rmuir       593 Aug 9 09:46 _4.si
-rw-rw-r-- 1 rmuir rmuir        71 Aug 9 09:46 segments_1
-rw-rw-r-- 1 rmuir rmuir        20 Aug 9 09:46 segments.gen
{noformat}

Looking into the .fld file, I think the problem is obvious:

on trunk:
{noformat}
doc 0
  numfields 5
doc 1
  numfields 5
doc 2
  numfields 5
{noformat}

on branch:
{noformat}
doc 0
  numfields 5
doc 1
  numfields 10
doc 2
  numfields 15
{noformat}

So there is some bug, where a field is 'accumulating' across documents. The last document has 2890.

I'm really horrified this is the only test that fails!
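For anyone who wants to check a SimpleText stored-fields dump for the same symptom without reading it by eye, here is a rough sketch. It assumes only the line shapes shown in the excerpt above (a "doc N" line followed by a "numfields N" line); the class and method names are made up for illustration and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Lucene class: scans the text of a SimpleText
// .fld dump and reports the per-document "numfields" values.
class FieldCountCheck {

    /** Collects the value of every "numfields N" line, in file order. */
    static List<Integer> fieldCounts(List<String> lines) {
        List<Integer> counts = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("numfields ")) {
                counts.add(Integer.parseInt(line.substring("numfields ".length()).trim()));
            }
        }
        return counts;
    }

    /**
     * Returns true when every document reports strictly more fields than the
     * one before it, which is the pattern seen on the branch (5, 10, 15, ...).
     */
    static boolean looksAccumulating(List<Integer> counts) {
        if (counts.size() < 2) {
            return false;
        }
        for (int i = 1; i < counts.size(); i++) {
            if (counts.get(i) <= counts.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The two excerpts quoted in the comment above.
        List<String> trunk = Arrays.asList(
            "doc 0", "  numfields 5", "doc 1", "  numfields 5", "doc 2", "  numfields 5");
        List<String> branch = Arrays.asList(
            "doc 0", "  numfields 5", "doc 1", "  numfields 10", "doc 2", "  numfields 15");

        System.out.println("trunk  accumulating: " + looksAccumulating(fieldCounts(trunk)));
        System.out.println("branch accumulating: " + looksAccumulating(fieldCounts(branch)));
    }
}
```

A strictly increasing count is only a heuristic: a corpus whose documents really do have growing numbers of fields would trip it too. For the reuters content, where trunk shows a constant 5 fields per document, it is a reasonable quick check.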
> Break out StorableField from IndexableField
> -------------------------------------------
>
> Key: LUCENE-3312
> URL: https://issues.apache.org/jira/browse/LUCENE-3312
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael McCandless
> Assignee: Nikola Tankovic
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: Field Type branch
>
> Attachments: lucene-3312-patch-01.patch, lucene-3312-patch-02.patch,
> lucene-3312-patch-03.patch, lucene-3312-patch-04.patch,
> lucene-3312-patch-05.patch, lucene-3312-patch-06.patch,
> lucene-3312-patch-07.patch, lucene-3312-patch-08.patch,
> lucene-3312-patch-09.patch
>
>
> In the field type branch we have strongly decoupled
> Document/Field/FieldType impl from the indexer, by having only a
> narrow API (IndexableField) passed to IndexWriter. This frees apps up
> to use their own "documents" instead of the "user-space" impls we provide
> in oal.document.
>
> Similarly, with LUCENE-3309, we've done the same thing on the
> doc/field retrieval side (from IndexReader), with the
> StoredFieldsVisitor.
>
> But, maybe we should break out StorableField from IndexableField,
> such that when you index a doc you provide two Iterables -- one for the
> IndexableFields and one for the StorableFields. Either can be null.
>
> One downside is possible perf hit for fields that are both indexed &
> stored (ie, we visit them twice, lookup their name in a hash twice,
> etc.). But the upside is a cleaner separation of concerns in API....