[ https://issues.apache.org/jira/browse/SOLR-17052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779224#comment-17779224 ]
Chris M. Hostetter commented on SOLR-17052: ------------------------------------------- Here is what I would propose to untangle some of this mess by inverting a lot of the logic... * For every "per-field" method in the {{Codec}} API, add a corresponding method to {{FieldType}} ** The default behavior of these new methods in {{FieldType}} should largely just be to return {{null}} *** The exception should be if things like {{postingsFormat}} or {{docValuesFormat}} are configured **** in which case {{FieldType.init(...)}} should do the SPI lookup and assign the value to new {{{}private final PostingFormat postingFormat{}}}, etc.. **** Which the corresponding method impls should return **** Eliminating redundant & repetitive SPI lookups for every field of that type ** In the case of {{{}DenseVectorField{}}}... *** It's {{.init(...)}} method should validate it's options, and set a new {{private final KnnVectorsFormat knnVectrsFormat}} *** Which it's {{getKnnVectorsFormatForField}} impl should return *** Eliminating the instantiation of identical {{KnnVectorsFormat}} instances for every field of that type ** Change the codec returned by {{SchemaCodecFactory}} to delegate to these new {{FieldType}} methods *** returning {{super.get...}} if/when the {{FieldType}} methods return {{null}} ** In the (unlikely?) event that a (future) {{FieldType}} needs to know {{SchemaField}} level properties to make codec decisions in one of it's per-field methods (ex: is {{{}multiValued=true{}}}) ... *** That {{FieldType}} subclass can keep a reference to the {{IndexSchema}} (passed to {{{}FieldType.init(...){}}}) and then... *** That {{FieldType}} subclass can pay the added cost of calling {{schema.getFieldOrNull(field)}} if needed in one of it's per-field codec methods * Fixing the "validation" that {{SolrCore.initCodec}} does is much harder to deal with in a clean way ** The two biggest problems: *** {{IndexSchema}} sharing between cores (which might have different {{solrconfig.xml}} files, therefor different Codecs) *** {{mutable==true}} instances of {{IndexSchema}} that add fieldtypes dynamically ** What _can_ be done in a somewhat straightforward way, is to fix the {*}non{*}-mutable usecase – in a way that will work even if {{IndexSchema}} sharing is in use... *** Add a {{public void checkSchema(IndexSchema schema) throws SolrException;}} method to {{CodecFactory}} *** Add a {{public void checkCodec(CodecFactory codecFactory) throws SolrException;}} method to {{FieldType}} *** The default impl of {{CodecFactory.checkSchema(...)}} should loop over all {{FieldType}} s and call {{ft.checkCodec(this)}} **** It should also log warnings if {{schema.isMutable()}} is true: field types added at runtime may violate codec rules *** The default impl of {{FieldType.checkCodec(...)}} should... **** fail if {{(postingFormat != null || docValuesFormat != null) && ! (codecFactory instanceof SchemaCodecFactory))}} *** {{DenseVectorField.checkCodec(...)}} should similarly fail unless {{codecFactory instanceof SchemaCodecFactory}} *** make {{SchemaCodecFactory.validate(IndexSchema)}} effectively a No-Op (it let's the field types do whatever they want) **** But it should also log warnings if {{schema.isMutable()}} is true: field types added at runtime may violate codec rules *** Replace the (currently incomplete) validation logic in {{SolrCore.initCodec}} with a call to {{CodecFactory.checkSchema(...)}} **** Which necessitates a trivial {{LuceneDefaultCodecFactory}} impl (see SOLR-17046) ** Down the road, maybe we can refactor {{CodecFactory}} configuration & initialization to be part of the {{IndexSchema}} ? *** Which would make it straightforward for {{mutable==true}} schemas to validate that new {{FieldType}} instances work with the configured {{CodecFactory}} _before_ adding them (and persisting to the schema resource on disk/zk) > SchemaCodecFactory/IndexSchema/FieldType relationships are kludgy and should > be inverted > ---------------------------------------------------------------------------------------- > > Key: SOLR-17052 > URL: https://issues.apache.org/jira/browse/SOLR-17052 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Reporter: Chris M. Hostetter > Priority: Major > > While getting familiar with the {{SolreCore + CodecFactory + > SchemaCodecFactory + FieldType}} related code relevant to SOLR-17045, > SOLR-17046, & SOLR-17047 It occurred to me that there is a lot of > ineffeciencies and kludginess to how {{FieldType}} based "codec overrides" > are used (and validated) by {{SchemaCodecFactory}} (and > {{{}SolrCore.initCodec{}}}) : > * {{SolrCore.initCodec}} needs to be aware of all the possible ways a > {{FieldType}} instance might support codec overrides > ** ... so it can fail if any are specified unless the {{CodecFactory > instanceOf SolrCoreAware}} > *** ... even though that still doesn't ensure the factory supports those > field type overrides > ** This validation currently just looks at {{getPostingsFormatForField}} & > {{getDocValuesFormatForField}} > *** ... it's ignorant about {{DenseVectorField}} 's assumptions about being > able to override aspects of the {{KnnVectorsFormat}} > *** ... and AFAICT, what validation is don't doesn't help if the Schema API > is used to add new field types (w/ {{postingsFormat}} or {{docValuesFormat}} > overrides) > * in all of the the {{SchemaCodecFactory}} "per-field" methods > ({{{}getPostingsFormatForField{}}}, {{{}getDocValuesFormatForField{}}}, & > {{{}getKnnVectorsFormatForField{}}}) ... > ** ... every call to these methods resolves a {{SchemaField}} instance – > even though only the (Solr) {{FieldType}} is needed > *** Asking the {{IndexSchema}} for the {{SchemaField}} of a fieldName has > more overhead then just asking for the {{FieldType}} > *** None of the things these methods care about can be configured on a > per-fieldName bassis anyway. > ** For {{PostingsFormat}} and {{{}DocValuesFormat{}}}, every call to these > methods repeats the SPI lookup on the "format name" configured on the > {{FieldType}} instance > ** For {{KnnVectorsFormat}} every call to this method constructs a new > {{SolrDelegatingKnnVectorsFormat}} – even though the same instance could be > re-used for every field of the same {{FieldType}} instance. > * In {{FieldType}} ... > ** ... there is no validation anywhere that the {{postingsFormat}} or > {{docValuesFormat}} are valid > *** ... bogus values only cause a problem when the {{SchemaCodecFactory}} > tries to resolve them (when indexing) > * In {{DenseVectorField}} ... > ** ... {{checkSchemaField}} validates (and logs warnings) based on the > {{vectorEncoding}} and {{{}dimensions{}}}... > *** ... Even though these validations aren't "field" specific – they are > "type" specific, and could be validated in {{DenseVectorField.init()}} > ** BUT! ... there is no validation anywhere that the {{knnAlgorithm}} is > supported, or that the HNSW options make sense for it > *** These are only validated by the > {{Codec.getKnnVectorsFormatForField(...)}} impl provided by > {{SchemaCodecFactory}} ... > **** ... and they are redundenly validated on every call -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org