[
https://issues.apache.org/jira/browse/LUCENE-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431807#comment-13431807
]
Robert Muir commented on LUCENE-3312:
-------------------------------------
Something is going wrong with the indexing of the Reuters content.
I ran the test with the SimpleText codec on both trunk and the branch (adding
forceMerge(1) so that each index ends up as a single segment) and looked at the
resulting index:
Trunk:
{noformat}
-rw-rw-r-- 1 rmuir rmuir 13798 Aug 9 09:42 _0_6.len
-rw-rw-r-- 1 rmuir rmuir 1022509 Aug 9 09:42 _0.fld
-rw-rw-r-- 1 rmuir rmuir 1310 Aug 9 09:42 _0.inf
-rw-rw-r-- 1 rmuir rmuir 3345582 Aug 9 09:42 _0.pst
-rw-rw-r-- 1 rmuir rmuir 513 Aug 9 09:42 _0.si
-rw-rw-r-- 1 rmuir rmuir 71 Aug 9 09:42 segments_1
-rw-rw-r-- 1 rmuir rmuir 20 Aug 9 09:42 segments.gen
{noformat}
Branch:
{noformat}
-rw-rw-r-- 1 rmuir rmuir 13262 Aug 9 09:46 _4_6.len
-rw-rw-r-- 1 rmuir rmuir 290247032 Aug 9 09:45 _4.fld
-rw-rw-r-- 1 rmuir rmuir 1310 Aug 9 09:46 _4.inf
-rw-rw-r-- 1 rmuir rmuir 459164224 Aug 9 09:46 _4.pst
-rw-rw-r-- 1 rmuir rmuir 593 Aug 9 09:46 _4.si
-rw-rw-r-- 1 rmuir rmuir 71 Aug 9 09:46 segments_1
-rw-rw-r-- 1 rmuir rmuir 20 Aug 9 09:46 segments.gen
{noformat}
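To put a number on the difference, here is a rough Python sketch that compares the two listings by file extension. The helper name sizes_by_extension is made up for illustration, and it assumes the usual nine-column ls -l layout shown above.

```python
TRUNK = """
-rw-rw-r-- 1 rmuir rmuir 13798 Aug 9 09:42 _0_6.len
-rw-rw-r-- 1 rmuir rmuir 1022509 Aug 9 09:42 _0.fld
-rw-rw-r-- 1 rmuir rmuir 1310 Aug 9 09:42 _0.inf
-rw-rw-r-- 1 rmuir rmuir 3345582 Aug 9 09:42 _0.pst
-rw-rw-r-- 1 rmuir rmuir 513 Aug 9 09:42 _0.si
"""

BRANCH = """
-rw-rw-r-- 1 rmuir rmuir 13262 Aug 9 09:46 _4_6.len
-rw-rw-r-- 1 rmuir rmuir 290247032 Aug 9 09:45 _4.fld
-rw-rw-r-- 1 rmuir rmuir 1310 Aug 9 09:46 _4.inf
-rw-rw-r-- 1 rmuir rmuir 459164224 Aug 9 09:46 _4.pst
-rw-rw-r-- 1 rmuir rmuir 593 Aug 9 09:46 _4.si
"""


def sizes_by_extension(listing):
    """Map each file's extension to its size in bytes (hypothetical helper)."""
    sizes = {}
    for line in listing.strip().splitlines():
        parts = line.split()
        # ls -l columns: mode, links, owner, group, size, month, day, time, name
        size, name = int(parts[4]), parts[-1]
        # Segment names differ (_0 vs _4), so compare by extension only.
        key = name.rsplit(".", 1)[-1] if "." in name else name
        sizes[key] = size
    return sizes


trunk = sizes_by_extension(TRUNK)
branch = sizes_by_extension(BRANCH)

# Branch size divided by trunk size, per extension present in both listings.
ratios = {ext: branch[ext] / trunk[ext] for ext in trunk if ext in branch}

for ext, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{ext}: {ratio:.1f}x")
```

On the numbers above, .fld (stored fields) comes out roughly 284 times larger on the branch and .pst (postings) roughly 137 times larger, while .inf is unchanged and .len and .si are nearly the same size.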
Looking into the .fld file, I think the problem is obvious:
on trunk:
{noformat}
doc 0
numfields 5
doc 1
numfields 5
doc 2
numfields 5
{noformat}
on branch:
{noformat}
doc 0
numfields 5
doc 1
numfields 10
doc 2
numfields 15
{noformat}
So there is a bug where fields are 'accumulating' across documents: in the
excerpt above, numfields grows by 5 with every document instead of staying at 5.
The last document has 2890.
I'm really horrified this is the only test that fails!
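A check for this kind of accumulation could be scripted against the SimpleText stored fields file. The following Python sketch is illustrative only: the function names are invented here, and it assumes the file contains "doc N" and "numfields N" lines laid out as in the excerpts above.

```python
def numfields_per_doc(lines):
    """Return {doc id: numfields} parsed from SimpleText-style .fld lines."""
    counts = {}
    current_doc = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("doc "):
            current_doc = int(line.split()[1])
        elif line.startswith("numfields ") and current_doc is not None:
            counts[current_doc] = int(line.split()[1])
    return counts


def looks_cumulative(counts):
    """True if every document has strictly more fields than the previous one."""
    values = [counts[doc] for doc in sorted(counts)]
    return len(values) > 1 and all(b > a for a, b in zip(values, values[1:]))


# The two excerpts from the comment, used as sample input.
trunk_sample = ["doc 0", "numfields 5", "doc 1", "numfields 5", "doc 2", "numfields 5"]
branch_sample = ["doc 0", "numfields 5", "doc 1", "numfields 10", "doc 2", "numfields 15"]

trunk_counts = numfields_per_doc(trunk_sample)
branch_counts = numfields_per_doc(branch_sample)

print(trunk_counts, looks_cumulative(trunk_counts))
print(branch_counts, looks_cumulative(branch_counts))
```

To run it against a real index, pass an open file object (for example the branch's _4.fld from the listing) to numfields_per_doc; it reads the file line by line, so the large branch file does not need to fit in memory.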
> Break out StorableField from IndexableField
> -------------------------------------------
>
> Key: LUCENE-3312
> URL: https://issues.apache.org/jira/browse/LUCENE-3312
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael McCandless
> Assignee: Nikola Tankovic
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: Field Type branch
>
> Attachments: lucene-3312-patch-01.patch, lucene-3312-patch-02.patch,
> lucene-3312-patch-03.patch, lucene-3312-patch-04.patch,
> lucene-3312-patch-05.patch, lucene-3312-patch-06.patch,
> lucene-3312-patch-07.patch, lucene-3312-patch-08.patch,
> lucene-3312-patch-09.patch
>
>
> In the field type branch we have strongly decoupled
> Document/Field/FieldType impl from the indexer, by having only a
> narrow API (IndexableField) passed to IndexWriter. This frees apps up
> to use their own "documents" instead of the "user-space" impls we provide
> in oal.document.
> Similarly, with LUCENE-3309, we've done the same thing on the
> doc/field retrieval side (from IndexReader), with the
> StoredFieldsVisitor.
> But, maybe we should break out StorableField from IndexableField,
> such that when you index a doc you provide two Iterables -- one for the
> IndexableFields and one for the StorableFields. Either can be null.
> One downside is possible perf hit for fields that are both indexed &
> stored (ie, we visit them twice, lookup their name in a hash twice,
> etc.). But the upside is a cleaner separation of concerns in API....