[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format

2020-03-10 Thread juan camilo rodriguez duran (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056013#comment-17056013
 ] 

juan camilo rodriguez duran commented on LUCENE-9236:
-

[~jpountz] This is step 2 of the Jira issue. I would like to know what you 
think about step one: only splitting the big classes and then making the 
reading and writing parts more symmetric.

> Having a modular Doc Values format
> --
>
> Key: LUCENE-9236
> URL: https://issues.apache.org/jira/browse/LUCENE-9236
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: juan camilo rodriguez duran
> Priority: Minor
> Labels: docValues
>
> Today DocValues Consumer/Producer require overriding 5 different methods, 
> even if you only want to use one, and even though a given field can only 
> support one doc values type at a time.
>  
> In the attached PR I’ve implemented a new modular version of those classes 
> (consumer/producer), each one having a single responsibility and writing to 
> the same single file.
> This is mainly a refactoring of the existing format, opening up the 
> possibility of overriding or implementing only the sub-format you need.
>  
> I’ll do this in 3 steps:
>  # Create a CompositeDocValuesFormat and move the code of 
> Lucene80DocValuesFormat into separate classes, without modifying the inner 
> code. At the same time I created a Lucene85CompositeDocValuesFormat based on 
> these changes.
>  # Introduce some basic components for writing doc values in general, 
> such as:
>  ## DocumentIdSetIterator Serializer: used in each type of field, based on an 
> IndexedDISI.
>  ## Document Ordinals Serializer: used in Sorted and SortedSet to 
> deduplicate values using a dictionary.
>  ## Document Boundaries Serializer (optional, used only for multivalued 
> fields: SortedNumeric and SortedSet).
>  ## TermsEnum Serializer: useful to write and read the terms dictionary for 
> sorted and sorted set doc values.
>  # Create the new Sub-DocValues format using the previous components.
>  
> PR: [https://github.com/apache/lucene-solr/pull/1282]
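
To make the motivation in the description concrete, here is a minimal sketch in plain Java. It does not use Lucene's real classes: the real DocValuesConsumer declares addNumericField, addBinaryField, addSortedField, addSortedNumericField and addSortedSetField, while the names SketchType, SubFormatWriter, CompositeWriter and CompositeWriterDemo below are hypothetical and only illustrate the idea of one single-responsibility writer per doc values type, all appending to the same output. The design in the PR may well differ.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** The five doc values types; a given field uses exactly one of them. */
enum SketchType { NUMERIC, BINARY, SORTED, SORTED_NUMERIC, SORTED_SET }

/** Hypothetical single-responsibility writer: handles one doc values type only. */
interface SubFormatWriter {
  SketchType type();

  /** Appends an encoded record for the field to the shared output. */
  void write(String field, List<String> sharedOutput);
}

/** Hypothetical composite: routes each field to the writer registered for its type. */
final class CompositeWriter {
  private final Map<SketchType, SubFormatWriter> writers = new EnumMap<>(SketchType.class);
  // Stands in for the single data file that all sub-formats write to.
  private final List<String> sharedOutput = new ArrayList<>();

  CompositeWriter register(SubFormatWriter writer) {
    writers.put(writer.type(), writer);
    return this;
  }

  void addField(String field, SketchType type) {
    SubFormatWriter writer = writers.get(type);
    if (writer == null) {
      throw new IllegalArgumentException("no sub-format registered for " + type);
    }
    writer.write(field, sharedOutput);
  }

  List<String> output() {
    return sharedOutput;
  }
}

final class CompositeWriterDemo {
  /** Builds a trivial sub-format writer that records "TYPE:field". */
  static SubFormatWriter simpleWriter(SketchType type) {
    return new SubFormatWriter() {
      @Override public SketchType type() { return type; }
      @Override public void write(String field, List<String> out) {
        out.add(type + ":" + field);
      }
    };
  }

  public static void main(String[] args) {
    CompositeWriter composite = new CompositeWriter()
        .register(simpleWriter(SketchType.NUMERIC))
        .register(simpleWriter(SketchType.SORTED_SET));
    composite.addField("price", SketchType.NUMERIC);
    composite.addField("tags", SketchType.SORTED_SET);
    System.out.println(composite.output());
  }
}
```

In this sketch, someone who only needs a custom numeric encoding implements a single SubFormatWriter and registers it, instead of overriding five methods.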



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format

2020-03-10 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055907#comment-17055907
 ] 

Adrien Grand commented on LUCENE-9236:
--

[~juan.duran] In my opinion it introduces complexity because it introduces more 
abstractions: CompositeFieldMetadata, DocValuesConsumerSupplier, and so on.




[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format

2020-03-09 Thread juan camilo rodriguez duran (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055035#comment-17055035
 ] 

juan camilo rodriguez duran commented on LUCENE-9236:
-

[~rcmuir] could you please elaborate a bit more on why introducing sub-formats, 
each having a single responsibility and with the code in the same file, would 
increase complexity? Today Lucene80DocValuesProducer is 1565 lines to read; 
with the approach I'm proposing we would initially have 3 classes of 500 lines 
each, which are a bit easier to digest. I just want to know which factors make 
it important to keep the code as it is.




[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format

2020-02-27 Thread juan camilo rodriguez duran (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046339#comment-17046339
 ] 

juan camilo rodriguez duran commented on LUCENE-9236:
-

[~dsmiley] just throwing the exception wouldn't work, at least if you want to 
run all the tests (and be compliant with the API). And here is the main point 
about this API: why, if a sub-format is used independently, does the API force 
you to support other sub-formats that can't co-exist at the same time for a 
given field? This same pattern is replicated by EmptyDocValuesProducer; ideally 
DocValues#checkField would be simpler if we used only the sub-formats.

But still, this is not the point of this PR. The first objective is at least to 
improve code readability by splitting the big DocValuesProducer/Consumer 
classes into single-responsibility classes, then to refactor so that the 
reading and writing classes are more symmetric, and finally, if it is worth it, 
to factor out some components common to all formats, such as the reading and 
writing parts of the DISI iterator.




[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format

2020-02-26 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045803#comment-17045803
 ] 

David Smiley commented on LUCENE-9236:
--

I think it's weird that DocValuesFormat is a format of formats whereas the 
others are for their singular purpose.  No?

bq. ... you *must* implement the 5 different functions ...

Wouldn't throwing UnsupportedOperationException work?  
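
A minimal sketch of what this suggestion could look like. FiveMethodConsumer is a simplified stand-in for the shape of Lucene's DocValuesConsumer (the real methods take a FieldInfo and a DocValuesProducer rather than raw arrays), and NumericOnlyConsumer is a hypothetical implementation that supports a single type and rejects the rest.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the five-method contract discussed in this issue. */
abstract class FiveMethodConsumer {
  abstract void addNumericField(String field, long[] values);
  abstract void addBinaryField(String field, byte[][] values);
  abstract void addSortedField(String field, String[] values);
  abstract void addSortedNumericField(String field, long[][] values);
  abstract void addSortedSetField(String field, String[][] values);
}

/** Hypothetical consumer that only handles numeric doc values. */
final class NumericOnlyConsumer extends FiveMethodConsumer {
  final List<String> written = new ArrayList<>();

  @Override void addNumericField(String field, long[] values) {
    written.add(field + "=" + values.length + " values");
  }

  // The remaining four methods exist only to satisfy the contract.
  @Override void addBinaryField(String field, byte[][] values) {
    throw unsupported("BINARY", field);
  }
  @Override void addSortedField(String field, String[] values) {
    throw unsupported("SORTED", field);
  }
  @Override void addSortedNumericField(String field, long[][] values) {
    throw unsupported("SORTED_NUMERIC", field);
  }
  @Override void addSortedSetField(String field, String[][] values) {
    throw unsupported("SORTED_SET", field);
  }

  private static UnsupportedOperationException unsupported(String type, String field) {
    return new UnsupportedOperationException(
        type + " doc values are not supported (field: " + field + ")");
  }
}

final class NumericOnlyConsumerDemo {
  public static void main(String[] args) {
    NumericOnlyConsumer consumer = new NumericOnlyConsumer();
    consumer.addNumericField("price", new long[] {10L, 20L});
    System.out.println(consumer.written);
    try {
      consumer.addSortedSetField("tags", new String[][] {{"a", "b"}});
    } catch (UnsupportedOperationException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

The sketch compiles and runs, but the four stubs remain boilerplate that every single-type format has to carry.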




[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format

2020-02-25 Thread juan camilo rodriguez duran (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044605#comment-17044605
 ] 

juan camilo rodriguez duran commented on LUCENE-9236:
-

[~jpountz] and [~rcmuir] I attached the draft PR to the issue, in case you have 
time to check it out. Thanks in advance for your time and comments.




[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format

2020-02-21 Thread juan camilo rodriguez duran (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041729#comment-17041729
 ] 

juan camilo rodriguez duran commented on LUCENE-9236:
-

Yeah, that's why I am considering doing this in 3 steps: to get feedback at 
each stage and, depending on the results, either continue or stop at something 
useful for everybody. [~rcmuir] just to say that PerFieldDocValues is not 
enough when you have to write a new doc values format, as its contract requires 
you to implement the 5 different functions even if a field only supports one of 
them (it cannot be numeric and sorted set at the same time, for example).




[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format

2020-02-21 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041689#comment-17041689
 ] 

Adrien Grand commented on LUCENE-9236:
--

Agreed with Robert regarding the abstractions. There is a part of your change 
that I liked though, where you were creating BinaryEntry/NumericEntry/... on 
the Consumer side as well, which made the Consumer and Producer look more 
symmetric.




[jira] [Commented] (LUCENE-9236) Having a modular Doc Values format

2020-02-20 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041404#comment-17041404
 ] 

Robert Muir commented on LUCENE-9236:
-

The PerFieldDocValuesFormat (implemented by the default codec) already provides 
enough abstractions that a field's format can be trivially customized. I don't 
think we need any more abstractions; in fact the opposite, we desperately need 
fewer of them.
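
For readers unfamiliar with the mechanism referred to here: Lucene's PerFieldDocValuesFormat exposes getDocValuesFormatForField(String field), which lets a codec choose a doc values format per field. The sketch below mimics only that dispatch idea with simplified stand-in types (NamedFormat, PerFieldFormatSketch and MapBackedPerFieldFormat are all hypothetical names), so that it runs without Lucene on the classpath. It is not the library's implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for a named doc values format. */
final class NamedFormat {
  final String name;

  NamedFormat(String name) {
    this.name = name;
  }
}

/** Stand-in for per-field dispatch: subclasses choose a format from the field name. */
abstract class PerFieldFormatSketch {
  abstract NamedFormat formatForField(String field);
}

/** Hypothetical implementation: explicit per-field overrides, otherwise a default. */
final class MapBackedPerFieldFormat extends PerFieldFormatSketch {
  private final Map<String, NamedFormat> overrides = new HashMap<>();
  private final NamedFormat defaultFormat;

  MapBackedPerFieldFormat(NamedFormat defaultFormat) {
    this.defaultFormat = defaultFormat;
  }

  MapBackedPerFieldFormat override(String field, NamedFormat format) {
    overrides.put(field, format);
    return this;
  }

  @Override NamedFormat formatForField(String field) {
    return overrides.getOrDefault(field, defaultFormat);
  }
}

final class PerFieldFormatDemo {
  public static void main(String[] args) {
    MapBackedPerFieldFormat perField =
        new MapBackedPerFieldFormat(new NamedFormat("Lucene80"))
            .override("tags", new NamedFormat("MyCustomFormat"));
    System.out.println("tags  -> " + perField.formatForField("tags").name);
    System.out.println("price -> " + perField.formatForField("price").name);
  }
}
```

Per-field dispatch selects which format handles a field; it does not change how many methods that format has to implement.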
