[ 
https://issues.apache.org/jira/browse/SOLR-17052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779224#comment-17779224
 ] 

Chris M. Hostetter commented on SOLR-17052:
-------------------------------------------



Here is what I would propose to untangle some of this mess by inverting a lot 
of the logic...
 * For every "per-field" method in the {{Codec}} API, add a corresponding 
method to {{FieldType}}
 ** The default behavior of these new methods in {{FieldType}} should largely 
just be to return {{null}}
 *** The exception should be if things like {{postingsFormat}} or 
{{docValuesFormat}} are configured
 **** in which case {{FieldType.init(...)}} should do the SPI lookup and assign 
the value to new {{{}private final PostingFormat postingFormat{}}}, etc..
 **** Which the corresponding method impls should return
 **** Eliminating redundant & repetitive SPI lookups for every field of that 
type
 ** In the case of {{{}DenseVectorField{}}}...
 *** It's {{.init(...)}} method should validate it's options, and set a new 
{{private final KnnVectorsFormat knnVectrsFormat}}
 *** Which it's {{getKnnVectorsFormatForField}} impl should return
 *** Eliminating the instantiation of identical {{KnnVectorsFormat}} instances 
for every field of that type
 ** Change the codec returned by {{SchemaCodecFactory}} to delegate to these 
new {{FieldType}} methods
 *** returning {{super.get...}} if/when the {{FieldType}} methods return 
{{null}}
 ** In the (unlikely?) event that a (future) {{FieldType}} needs to know 
{{SchemaField}} level properties to make codec decisions in one of it's 
per-field methods (ex: is {{{}multiValued=true{}}}) ...
 *** That {{FieldType}} subclass can keep a reference to the {{IndexSchema}} 
(passed to {{{}FieldType.init(...){}}}) and then...
 *** That {{FieldType}} subclass can pay the added cost of calling 
{{schema.getFieldOrNull(field)}} if needed in one of it's per-field codec 
methods
 * Fixing the "validation" that {{SolrCore.initCodec}} does is much harder to 
deal with in a clean way
 ** The two biggest problems:
 *** {{IndexSchema}} sharing between cores (which might have different 
{{solrconfig.xml}} files, therefor different Codecs)
 *** {{mutable==true}} instances of {{IndexSchema}} that add fieldtypes 
dynamically
 ** What _can_ be done in a somewhat straightforward way, is to fix the 
{*}non{*}-mutable usecase – in a way that will work even if {{IndexSchema}} 
sharing is in use...
 *** Add a {{public void checkSchema(IndexSchema schema) throws 
SolrException;}} method to {{CodecFactory}}
 *** Add a {{public void checkCodec(CodecFactory codecFactory) throws 
SolrException;}} method to {{FieldType}}
 *** The default impl of {{CodecFactory.checkSchema(...)}} should loop over all 
{{FieldType}} s and call {{ft.checkCodec(this)}}
 **** It should also log warnings if {{schema.isMutable()}} is true: field 
types added at runtime may violate codec rules
 *** The default impl of {{FieldType.checkCodec(...)}} should...
 **** fail if {{(postingFormat != null || docValuesFormat != null) && ! 
(codecFactory instanceof SchemaCodecFactory))}}
 *** {{DenseVectorField.checkCodec(...)}} should similarly fail unless 
{{codecFactory instanceof SchemaCodecFactory}}
 *** make {{SchemaCodecFactory.validate(IndexSchema)}} effectively a No-Op (it 
let's the field types do whatever they want)
 **** But it should also log warnings if {{schema.isMutable()}} is true: field 
types added at runtime may violate codec rules
 *** Replace the (currently incomplete) validation logic in 
{{SolrCore.initCodec}} with a call to {{CodecFactory.checkSchema(...)}}
 **** Which necessitates a trivial {{LuceneDefaultCodecFactory}} impl (see 
SOLR-17046)
 ** Down the road, maybe we can refactor {{CodecFactory}} configuration & 
initialization to be part of the {{IndexSchema}} ?
 *** Which would make it straightforward for {{mutable==true}} schemas to 
validate that new {{FieldType}} instances work with the configured 
{{CodecFactory}} _before_ adding them (and persisting to the schema resource on 
disk/zk)



> SchemaCodecFactory/IndexSchema/FieldType relationships are kludgy and should 
> be inverted
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-17052
>                 URL: https://issues.apache.org/jira/browse/SOLR-17052
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>
> While getting familiar with the {{SolreCore + CodecFactory + 
> SchemaCodecFactory + FieldType}} related code relevant to SOLR-17045, 
> SOLR-17046, & SOLR-17047 It occurred to me that there is a lot of 
> ineffeciencies and kludginess to how {{FieldType}} based "codec overrides" 
> are used (and validated) by {{SchemaCodecFactory}} (and 
> {{{}SolrCore.initCodec{}}}) :
>  * {{SolrCore.initCodec}} needs to be aware of all the possible ways a 
> {{FieldType}} instance might support codec overrides
>  ** ... so it can fail if any are specified unless the {{CodecFactory 
> instanceOf SolrCoreAware}}
>  *** ... even though that still doesn't ensure the factory supports those 
> field type overrides
>  ** This validation currently just looks at {{getPostingsFormatForField}} & 
> {{getDocValuesFormatForField}}
>  *** ... it's ignorant about {{DenseVectorField}} 's assumptions about being 
> able to override aspects of the {{KnnVectorsFormat}}
>  *** ... and AFAICT, what validation is don't doesn't help if the Schema API 
> is used to add new field types (w/ {{postingsFormat}} or {{docValuesFormat}} 
> overrides)
>  * in all of the the {{SchemaCodecFactory}} "per-field" methods 
> ({{{}getPostingsFormatForField{}}}, {{{}getDocValuesFormatForField{}}}, & 
> {{{}getKnnVectorsFormatForField{}}}) ...
>  ** ... every call to these methods resolves a {{SchemaField}} instance – 
> even though only the (Solr) {{FieldType}} is needed
>  *** Asking the {{IndexSchema}} for the {{SchemaField}} of a fieldName has 
> more overhead then just asking for the {{FieldType}}
>  *** None of the things these methods care about can be configured on a 
> per-fieldName bassis anyway.
>  ** For {{PostingsFormat}} and {{{}DocValuesFormat{}}}, every call to these 
> methods repeats the SPI lookup on the "format name" configured on the 
> {{FieldType}} instance
>  ** For {{KnnVectorsFormat}} every call to this method constructs a new 
> {{SolrDelegatingKnnVectorsFormat}} – even though the same instance could be 
> re-used for every field of the same {{FieldType}} instance.
>  * In {{FieldType}} ...
>  ** ... there is no validation anywhere that the {{postingsFormat}} or 
> {{docValuesFormat}} are valid
>  *** ... bogus values only cause a problem when the {{SchemaCodecFactory}} 
> tries to resolve them (when indexing)
>  * In {{DenseVectorField}} ...
>  ** ... {{checkSchemaField}} validates (and logs warnings) based on the 
> {{vectorEncoding}} and {{{}dimensions{}}}...
>  *** ... Even though these validations aren't "field" specific – they are 
> "type" specific, and could be validated in {{DenseVectorField.init()}}
>  ** BUT! ... there is no validation anywhere that the {{knnAlgorithm}} is 
> supported, or that the HNSW options make sense for it
>  *** These are only validated by the 
> {{Codec.getKnnVectorsFormatForField(...)}} impl provided by 
> {{SchemaCodecFactory}} ...
>  **** ... and they are redundenly validated on every call



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

Reply via email to