[jira] [Commented] (LUCENE-8551) Purge unused FieldInfo on segment merge

2018-11-15 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688331#comment-16688331
 ] 

Adrien Grand commented on LUCENE-8551:
--

Field infos are tracked at IndexWriter#globalFieldNumberMap and written with 
every segment. My understanding of this issue is that we would like to 
garbage-collect unused fields, so I was thinking that if a point-in-time view 
is open, then a field is removed, then a field with the same name is added back 
with a different configuration, then the point-in-time view that got open 
before the field got removed would have different field infos from IndexWriter. 
The fact that field infos are currently (almost) append-only makes things 
easier to reason about. It's a different issue but related: for a long time 
index options could be downgraded. Only until we figured out that this 
prevented a relevancy bug (LUCENE-8031) from being addressed, so we removed 
this feature (LUCENE-8134).

> Purge unused FieldInfo on segment merge
> ---
>
> Key: LUCENE-8551
> URL: https://issues.apache.org/jira/browse/LUCENE-8551
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: David Smiley
>Priority: Major
>
> If a field is effectively unused (no norms, terms index, term vectors, 
> docValues, stored value, points index), it will nonetheless hang around in 
> FieldInfos indefinitely.  It would be nice to be able to recognize an unused 
> FieldInfo and allow it to disappear after a merge (or two).
> SegmentMerger merges FieldInfo (from each segment) as nearly the first thing 
> it does.  After that, the different index parts, before it's known what's 
> "used" or not.  After writing, we theoretically know which fields are used or 
> not, though we're not doing any bookkeeping to track it.  Maybe we should 
> track the fields used during writing so we write a filtered merged fieldInfo 
> at the end instead of unfiltered up front?  Or perhaps upon reading a 
> segment, we make it cheap/easy for each index type (e.g. terms index, stored 
> fields, ...) to know which fields have data for the corresponding type.  
> Then, on a subsequent merge, we know up front to filter the FieldInfos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8551) Purge unused FieldInfo on segment merge

2018-11-15 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688303#comment-16688303
 ] 

David Smiley commented on LUCENE-8551:
--

bq. For instance in the NRT case that means that you could have two consecutive 
point-in-time views of the same index that disagree on the FieldInfo of a field?

Can you please elaborate on that?  I'm unclear how NRT in particular relates.

> Purge unused FieldInfo on segment merge
> ---
>
> Key: LUCENE-8551
> URL: https://issues.apache.org/jira/browse/LUCENE-8551
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: David Smiley
>Priority: Major
>
> If a field is effectively unused (no norms, terms index, term vectors, 
> docValues, stored value, points index), it will nonetheless hang around in 
> FieldInfos indefinitely.  It would be nice to be able to recognize an unused 
> FieldInfo and allow it to disappear after a merge (or two).
> SegmentMerger merges FieldInfo (from each segment) as nearly the first thing 
> it does.  After that, the different index parts, before it's known what's 
> "used" or not.  After writing, we theoretically know which fields are used or 
> not, though we're not doing any bookkeeping to track it.  Maybe we should 
> track the fields used during writing so we write a filtered merged fieldInfo 
> at the end instead of unfiltered up front?  Or perhaps upon reading a 
> segment, we make it cheap/easy for each index type (e.g. terms index, stored 
> fields, ...) to know which fields have data for the corresponding type.  
> Then, on a subsequent merge, we know up front to filter the FieldInfos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8551) Purge unused FieldInfo on segment merge

2018-11-15 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688170#comment-16688170
 ] 

Christophe Bismuth commented on LUCENE-8551:


The overhead makes me think a dedicated optimize/purge API would be wiser. But, 
I don't know NRT internals enough to have a valuable opinion on the second 
point.

> Purge unused FieldInfo on segment merge
> ---
>
> Key: LUCENE-8551
> URL: https://issues.apache.org/jira/browse/LUCENE-8551
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: David Smiley
>Priority: Major
>
> If a field is effectively unused (no norms, terms index, term vectors, 
> docValues, stored value, points index), it will nonetheless hang around in 
> FieldInfos indefinitely.  It would be nice to be able to recognize an unused 
> FieldInfo and allow it to disappear after a merge (or two).
> SegmentMerger merges FieldInfo (from each segment) as nearly the first thing 
> it does.  After that, the different index parts, before it's known what's 
> "used" or not.  After writing, we theoretically know which fields are used or 
> not, though we're not doing any bookkeeping to track it.  Maybe we should 
> track the fields used during writing so we write a filtered merged fieldInfo 
> at the end instead of unfiltered up front?  Or perhaps upon reading a 
> segment, we make it cheap/easy for each index type (e.g. terms index, stored 
> fields, ...) to know which fields have data for the corresponding type.  
> Then, on a subsequent merge, we know up front to filter the FieldInfos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8551) Purge unused FieldInfo on segment merge

2018-11-15 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688146#comment-16688146
 ] 

Adrien Grand commented on LUCENE-8551:
--

That would have overhead for sure. For instance it's not cheap to know which 
fields are used in stored fields, the only way to do this is to iterate over 
all documents and compute the set of used field names. In contrast merging can 
often copy raw compressed bytes and skip decompressing+decoding entirely.

I'm also a bit worried of the fact that a field could be added back with a 
different number or with different options. For instance in the NRT case that 
means that you could have two consecutive point-in-time views of the same index 
that disagree on the FieldInfo of a field?

> Purge unused FieldInfo on segment merge
> ---
>
> Key: LUCENE-8551
> URL: https://issues.apache.org/jira/browse/LUCENE-8551
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: David Smiley
>Priority: Major
>
> If a field is effectively unused (no norms, terms index, term vectors, 
> docValues, stored value, points index), it will nonetheless hang around in 
> FieldInfos indefinitely.  It would be nice to be able to recognize an unused 
> FieldInfo and allow it to disappear after a merge (or two).
> SegmentMerger merges FieldInfo (from each segment) as nearly the first thing 
> it does.  After that, the different index parts, before it's known what's 
> "used" or not.  After writing, we theoretically know which fields are used or 
> not, though we're not doing any bookkeeping to track it.  Maybe we should 
> track the fields used during writing so we write a filtered merged fieldInfo 
> at the end instead of unfiltered up front?  Or perhaps upon reading a 
> segment, we make it cheap/easy for each index type (e.g. terms index, stored 
> fields, ...) to know which fields have data for the corresponding type.  
> Then, on a subsequent merge, we know up front to filter the FieldInfos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8551) Purge unused FieldInfo on segment merge

2018-11-15 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687705#comment-16687705
 ] 

Christophe Bismuth commented on LUCENE-8551:


I'll first implement unused {{FieldInfo}} tracking and let you know.

> Purge unused FieldInfo on segment merge
> ---
>
> Key: LUCENE-8551
> URL: https://issues.apache.org/jira/browse/LUCENE-8551
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: David Smiley
>Priority: Major
>
> If a field is effectively unused (no norms, terms index, term vectors, 
> docValues, stored value, points index), it will nonetheless hang around in 
> FieldInfos indefinitely.  It would be nice to be able to recognize an unused 
> FieldInfo and allow it to disappear after a merge (or two).
> SegmentMerger merges FieldInfo (from each segment) as nearly the first thing 
> it does.  After that, the different index parts, before it's known what's 
> "used" or not.  After writing, we theoretically know which fields are used or 
> not, though we're not doing any bookkeeping to track it.  Maybe we should 
> track the fields used during writing so we write a filtered merged fieldInfo 
> at the end instead of unfiltered up front?  Or perhaps upon reading a 
> segment, we make it cheap/easy for each index type (e.g. terms index, stored 
> fields, ...) to know which fields have data for the corresponding type.  
> Then, on a subsequent merge, we know up front to filter the FieldInfos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8551) Purge unused FieldInfo on segment merge

2018-11-15 Thread Christophe Bismuth (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687703#comment-16687703
 ] 

Christophe Bismuth commented on LUCENE-8551:


Sounds challenging, I'd like to work in it!

> Purge unused FieldInfo on segment merge
> ---
>
> Key: LUCENE-8551
> URL: https://issues.apache.org/jira/browse/LUCENE-8551
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: David Smiley
>Priority: Major
>
> If a field is effectively unused (no norms, terms index, term vectors, 
> docValues, stored value, points index), it will nonetheless hang around in 
> FieldInfos indefinitely.  It would be nice to be able to recognize an unused 
> FieldInfo and allow it to disappear after a merge (or two).
> SegmentMerger merges FieldInfo (from each segment) as nearly the first thing 
> it does.  After that, the different index parts, before it's known what's 
> "used" or not.  After writing, we theoretically know which fields are used or 
> not, though we're not doing any bookkeeping to track it.  Maybe we should 
> track the fields used during writing so we write a filtered merged fieldInfo 
> at the end instead of unfiltered up front?  Or perhaps upon reading a 
> segment, we make it cheap/easy for each index type (e.g. terms index, stored 
> fields, ...) to know which fields have data for the corresponding type.  
> Then, on a subsequent merge, we know up front to filter the FieldInfos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8551) Purge unused FieldInfo on segment merge

2018-11-02 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673119#comment-16673119
 ] 

David Smiley commented on LUCENE-8551:
--

Agreed -- if this does not wind up happening automatically, it could be added 
to some other mechanism like SOLR-12259.  I'm not sure yet how much complexity 
it would add to the regular merge.  I'm also not yet sure how much performance 
degradation this is causing my employer... it remains to be measured.  Even 
then, it's a YMMV thing.

> Purge unused FieldInfo on segment merge
> ---
>
> Key: LUCENE-8551
> URL: https://issues.apache.org/jira/browse/LUCENE-8551
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: David Smiley
>Priority: Major
>
> If a field is effectively unused (no norms, terms index, term vectors, 
> docValues, stored value, points index), it will nonetheless hang around in 
> FieldInfos indefinitely.  It would be nice to be able to recognize an unused 
> FieldInfo and allow it to disappear after a merge (or two).
> SegmentMerger merges FieldInfo (from each segment) as nearly the first thing 
> it does.  After that, the different index parts, before it's known what's 
> "used" or not.  After writing, we theoretically know which fields are used or 
> not, though we're not doing any bookkeeping to track it.  Maybe we should 
> track the fields used during writing so we write a filtered merged fieldInfo 
> at the end instead of unfiltered up front?  Or perhaps upon reading a 
> segment, we make it cheap/easy for each index type (e.g. terms index, stored 
> fields, ...) to know which fields have data for the corresponding type.  
> Then, on a subsequent merge, we know up front to filter the FieldInfos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8551) Purge unused FieldInfo on segment merge

2018-10-31 Thread Erick Erickson (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670205#comment-16670205
 ] 

Erick Erickson commented on LUCENE-8551:


David:

This would be way cool to get to happen on merge. We've had situations where 
some wild program adds over a million fields and the only remedy was to 
re-index.

I'm working on SOLR-12259 which will, I hope, allow us to "do things" to the 
index. If this is too expensive to make happen as part of the regular merging 
process, that might be an alternative way to go about it on a one-off basis. 
I'd rather have it happen as part of regular merging of course.

If this is part of regular segment merging, we should still be able to make it 
happen with SOLR-12259  to cover those cases where there are segments that are 
never merged because they're full and aren't having records deleted.

> Purge unused FieldInfo on segment merge
> ---
>
> Key: LUCENE-8551
> URL: https://issues.apache.org/jira/browse/LUCENE-8551
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: David Smiley
>Priority: Major
>
> If a field is effectively unused (no norms, terms index, term vectors, 
> docValues, stored value, points index), it will nonetheless hang around in 
> FieldInfos indefinitely.  It would be nice to be able to recognize an unused 
> FieldInfo and allow it to disappear after a merge (or two).
> SegmentMerger merges FieldInfo (from each segment) as nearly the first thing 
> it does.  After that, the different index parts, before it's known what's 
> "used" or not.  After writing, we theoretically know which fields are used or 
> not, though we're not doing any bookkeeping to track it.  Maybe we should 
> track the fields used during writing so we write a filtered merged fieldInfo 
> at the end instead of unfiltered up front?  Or perhaps upon reading a 
> segment, we make it cheap/easy for each index type (e.g. terms index, stored 
> fields, ...) to know which fields have data for the corresponding type.  
> Then, on a subsequent merge, we know up front to filter the FieldInfos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org