[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format
[ https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056013#comment-17056013 ]

juan camilo rodriguez duran commented on LUCENE-9236:
-----------------------------------------------------

[~jpountz] This is step 2 of the Jira issue. I would like to know what you think about step 1: only splitting the big classes, and then making the reading and writing parts more symmetric.

> Having a modular Doc Values format
> ----------------------------------
>
>                 Key: LUCENE-9236
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9236
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: juan camilo rodriguez duran
>            Priority: Minor
>              Labels: docValues
>
> Today DocValues Consumer/Producer require overriding 5 different methods, even
> if you only want to use one, and even though a given field can only support
> one doc values type at a time.
>
> In the attached PR I’ve implemented a new modular version of those classes
> (consumer/producer), each one having a single responsibility and writing to
> the same unique file.
> This is mainly a refactor of the existing format, opening the possibility to
> override or implement the sub-format you need.
>
> I’ll do it in 3 steps:
> # Create a CompositeDocValuesFormat and move the code of
> Lucene80DocValuesFormat into separate classes, without modifying the inner
> code. At the same time I created a Lucene85CompositeDocValuesFormat based on
> these changes.
> # I’ll introduce some basic components for writing doc values in general,
> such as:
> ## DocumentIdSetIterator Serializer: used in each type of field, based on an
> IndexedDISI.
> ## Document Ordinals Serializer: used in Sorted and SortedSet for
> deduplicating values using a dictionary.
> ## Document Boundaries Serializer (optional, used only for multivalued
> fields: SortedNumeric and SortedSet).
> ## TermsEnum Serializer: useful to write and read the terms dictionary for
> sorted and sorted set doc values.
> # I’ll create the new Sub-DocValues format using the previous components.
>
> PR: [https://github.com/apache/lucene-solr/pull/1282]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
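To make the composite idea in the issue description more concrete, here is a minimal, self-contained sketch of a consumer that delegates each field to a single-responsibility sub-consumer chosen by doc values type. Every name in it (`CompositeSketch`, `DvType`, `SubConsumer`, `CompositeConsumer`, `numericOnly`) is hypothetical and invented for illustration; none of them are Lucene classes or the classes from the PR, and the real consumer API works with `FieldInfo` and producers rather than plain strings.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical, simplified stand-ins; none of these types are Lucene classes. */
final class CompositeSketch {

  /** The five doc values types a field can have (one per field at a time). */
  enum DvType { NUMERIC, BINARY, SORTED, SORTED_NUMERIC, SORTED_SET }

  /** A sub-format writer with a single responsibility: one doc values type. */
  interface SubConsumer {
    String write(String field);
  }

  /** Dispatches each field to the sub-consumer registered for its type. */
  static final class CompositeConsumer {
    private final Map<DvType, SubConsumer> subConsumers = new EnumMap<>(DvType.class);

    CompositeConsumer register(DvType type, SubConsumer consumer) {
      subConsumers.put(type, consumer);
      return this;
    }

    String addField(String field, DvType type) {
      SubConsumer consumer = subConsumers.get(type);
      if (consumer == null) {
        // Only the registered sub-formats exist; nothing else has to be stubbed out.
        throw new UnsupportedOperationException("no sub-format registered for " + type);
      }
      return consumer.write(field);
    }
  }

  /** Builds a composite that supports numeric doc values only. */
  static CompositeConsumer numericOnly() {
    return new CompositeConsumer().register(DvType.NUMERIC, field -> "numeric:" + field);
  }

  public static void main(String[] args) {
    CompositeConsumer consumer = numericOnly();
    System.out.println(consumer.addField("price", DvType.NUMERIC));
  }
}
```

In this sketch a custom format registers only the sub-consumers it actually provides, which is the "override or implement the sub-format you need" point of the description. Whether the extra indirection is worth it is exactly what the comments in this thread debate.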
[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format
[ https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055907#comment-17055907 ]

Adrien Grand commented on LUCENE-9236:
--------------------------------------

[~juan.duran] In my opinion it introduces complexity because it introduces more abstractions: CompositeFieldMetadata, DocValuesConsumerSupplier, and so on.
[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format
[ https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055035#comment-17055035 ]

juan camilo rodriguez duran commented on LUCENE-9236:
-----------------------------------------------------

[~rcmuir] Could you please elaborate a bit more on why introducing sub-formats that each have a single responsibility, with the code in the same file, would increase complexity? Today Lucene80DocValuesProducer is 1565 lines to read; with the approach I'm proposing we would have, at the beginning, 3 classes of 500 lines each, which is a bit easier to digest. I just want to know which factors are important for keeping the code as it is.
[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format
[ https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046339#comment-17046339 ]

juan camilo rodriguez duran commented on LUCENE-9236:
-----------------------------------------------------

[~dsmiley] Just throwing the exception wouldn't work, at least if you want to run all the tests (and be compliant with the API). And here is the main point about this API: if a format is used independently, why does the API force you to support other sub-formats that can't co-exist at the same time for a given field? The same pattern is replicated in EmptyDocValuesProducer; ideally DocValues#checkField would be simpler if we used only the sub-formats.

Still, this is not the point of this PR. The first objective is to at least improve code readability by splitting the big classes DocValuesProducer/Consumer into single-responsibility classes, then to refactor so that the reading and writing classes are more symmetric, and finally, if it is worth it, to refactor some components common to all formats, such as the reading and writing parts of the DISI iterator.
[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format
[ https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045803#comment-17045803 ]

David Smiley commented on LUCENE-9236:
--------------------------------------

I think it's weird that DocValuesFormat is a format of formats whereas the others are for their singular purpose. No?

bq. ... you *must* implement the 5 different functions ...

Wouldn't throwing UnsupportedOperationException work?
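The suggestion above of throwing UnsupportedOperationException can be sketched as follows. This is an illustration only: `UnsupportedSketch`, `FiveMethodConsumer`, and `NumericOnlyConsumer` are hypothetical stand-ins, not Lucene classes. The method names are modeled on those of Lucene's DocValuesConsumer, but the signatures are simplified to take a field name and return a string so that the example is self-contained.

```java
/** Hypothetical, simplified stand-ins; these are not the Lucene classes. */
final class UnsupportedSketch {

  /** A five-method contract: every implementation must provide all five methods. */
  abstract static class FiveMethodConsumer {
    abstract String addNumericField(String field);
    abstract String addBinaryField(String field);
    abstract String addSortedField(String field);
    abstract String addSortedNumericField(String field);
    abstract String addSortedSetField(String field);
  }

  /** Supports numeric doc values only; the other four methods are stubs that throw. */
  static final class NumericOnlyConsumer extends FiveMethodConsumer {
    @Override String addNumericField(String field) { return "numeric:" + field; }
    @Override String addBinaryField(String field) { throw unsupported("binary"); }
    @Override String addSortedField(String field) { throw unsupported("sorted"); }
    @Override String addSortedNumericField(String field) { throw unsupported("sorted numeric"); }
    @Override String addSortedSetField(String field) { throw unsupported("sorted set"); }

    private static UnsupportedOperationException unsupported(String type) {
      return new UnsupportedOperationException(type + " doc values are not supported by this format");
    }
  }

  public static void main(String[] args) {
    FiveMethodConsumer consumer = new NumericOnlyConsumer();
    System.out.println(consumer.addNumericField("price"));
    try {
      consumer.addSortedSetField("tags");
    } catch (UnsupportedOperationException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

The stubs compile and fail fast at write time, but four of the five methods exist only to throw. The reply to this comment in the thread also points out that such a format would not pass the full test suite, which exercises every doc values type.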
[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format
[ https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044605#comment-17044605 ]

juan camilo rodriguez duran commented on LUCENE-9236:
-----------------------------------------------------

[~jpountz] and [~rcmuir] I attached the draft PR to the issue, if you have time to check it out. Thanks in advance for your time and comments.
[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format
[ https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041729#comment-17041729 ]

juan camilo rodriguez duran commented on LUCENE-9236:
-----------------------------------------------------

Yeah, that's why I'm considering doing this in 3 steps: to get feedback at each stage and, depending on the results, either continue or stop at something useful for everybody.

[~rcmuir] Just to say that PerFieldDocValuesFormat is not enough when you have to write a new doc values format, as in its contract you must implement the 5 different functions even if a field only supports one of them (it cannot be numeric and sorted set at the same time, for example).
[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format
[ https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041689#comment-17041689 ]

Adrien Grand commented on LUCENE-9236:
--------------------------------------

Agreed with Robert regarding the abstractions. There is a part of your change that I liked though, where you were creating BinaryEntry/NumericEntry/... on the Consumer side as well, which made the Consumer and Producer look more symmetric.
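One way to read the symmetry mentioned above is that the same entry type owns both the write path and the read path, so the field order cannot drift between consumer and producer. The sketch below assumes a hypothetical `NumericEntry` with three illustrative fields and uses plain `java.io` streams; it is not the metadata layout of Lucene80DocValuesFormat, and the names `SymmetricEntrySketch`, `writeTo`, `readFrom`, and `roundTrip` are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch; the entry fields are illustrative, not Lucene's actual layout. */
final class SymmetricEntrySketch {

  /** Metadata entry shared by the writing (consumer) and reading (producer) sides. */
  static final class NumericEntry {
    final long numValues;
    final int bitsPerValue;
    final long minValue;

    NumericEntry(long numValues, int bitsPerValue, long minValue) {
      this.numValues = numValues;
      this.bitsPerValue = bitsPerValue;
      this.minValue = minValue;
    }

    /** Consumer side: serialize the entry. */
    void writeTo(DataOutputStream out) throws IOException {
      out.writeLong(numValues);
      out.writeInt(bitsPerValue);
      out.writeLong(minValue);
    }

    /** Producer side: read the fields in exactly the order they were written. */
    static NumericEntry readFrom(DataInputStream in) throws IOException {
      long numValues = in.readLong();
      int bitsPerValue = in.readInt();
      long minValue = in.readLong();
      return new NumericEntry(numValues, bitsPerValue, minValue);
    }
  }

  /** Writes an entry to an in-memory buffer and reads it back. */
  static NumericEntry roundTrip(NumericEntry entry) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(bytes)) {
        entry.writeTo(out);
      }
      try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return NumericEntry.readFrom(in);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    NumericEntry copy = roundTrip(new NumericEntry(42L, 7, -3L));
    System.out.println(copy.numValues + " " + copy.bitsPerValue + " " + copy.minValue);
  }
}
```

Keeping `writeTo` and `readFrom` side by side in one class makes a mismatch between the two visible in a single place, without adding any new abstraction layer on top of the consumer and producer.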
[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format
[ https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041404#comment-17041404 ]

Robert Muir commented on LUCENE-9236:
-------------------------------------

The PerFieldDocValuesFormat (implemented by the default codec) already provides enough abstractions that a field's format can be trivially customized. I don't think we need any more abstractions; in fact the opposite: we desperately need fewer of them.