Fields are placed in the index totally separately from each other, so it’s no wonder that removing the copyField results in this kind of savings.
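As a rough illustration, here is a toy Python sketch of an index whose postings are keyed by (field, term) rather than by term alone. It is not Lucene's actual data structure; `build_index` and its `copy_fields` argument are hypothetical names made up for this example. Real index size also depends on stored fields, docValues, compression and so on, so the exact 2x ratio it shows should not be read as a prediction for a real index.

```python
from collections import defaultdict

def build_index(docs, copy_fields=None):
    """Toy inverted index: postings are keyed by (field, term), never by term alone."""
    copy_fields = copy_fields or {}
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        fields = dict(doc)
        # Emulate copyField: the source text is indexed again under the destination field.
        for source, dest in copy_fields.items():
            if source in fields:
                fields[dest] = fields[source]
        for field, text in fields.items():
            # Stand-in for an analysis chain: lowercase, then split on whitespace.
            for term in text.lower().split():
                index[(field, term)].add(doc_id)
    return index

docs = [{"description": "Fleas bite dogs"}, {"description": "Dogs chase cats"}]
with_copy = build_index(docs, {"description": "default_field"})
without_copy = build_index(docs)

# Every term of "description" shows up a second time under "default_field".
print(len(with_copy), len(without_copy))
```

In this toy model the copy destination doubles the number of postings entries, because nothing is shared between the two fields.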
And they have to be separate. Consider what comes out of the end of the analysis chain: the same input can produce totally different output. As a trivial example, imagine two fields with these analysis chains:

field 1: whitespacetokenizer, lowercasefilter
field 2: whitespacetokenizer, lowercasefilter, edgengramfilterfactory

and identical input “fleas”. The output of the first would be “fleas”, and the output of the second would be something like “f”, “fl”, “fle”, “flea”, “fleas”. Trying to share the tokens between fields would be a nightmare. And that’s only one of many ways the output of two different analysis chains could differ.

Best,
Erick

> On Sep 28, 2020, at 10:56 AM, Edward Turner <eddtur...@gmail.com> wrote:
>
> Hi all,
>
> We have recently switched to using edismax + qf fields, and no longer use
> copyfields to allow us to easily search over values in multiple fields (by
> copying multiple fields' values to the copyfield destinations, and then
> performing queries over the destination field).
>
> By removing the copyfields, we've found that our index sizes have reduced
> by ~40% in some cases, which is great! We're just curious now as to exactly
> how this can be ...
>
> My question is, given the following two schemas, if we index some data to
> the "description" field, will the index for schema1 be twice as large as
> the index of schema2? (I guess this relates to how, internally, Solr stores
> field + index data)
>
> Old way -- schema1:
> =======
> <field name="description" type="text_general" indexed="true"
> multiValued="false"/>
> <field name="default_field" type="text_general" indexed="true"
> multiValued="false" />
> <copyField source="description" dest="default_field" />
>
> New way -- schema2:
> =======
> <field name="description" type="text_general" indexed="true"
> multiValued="false"/>
>
> Many thanks and kind regards,
>
> Edd
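For anyone who wants to see the two-chain "fleas" example from the reply in action, here is a minimal Python sketch. The functions only imitate what a whitespace tokenizer, a lowercase filter and an edge n-gram filter do; they are not Solr's implementations, and the `min_gram`/`max_gram` defaults are assumptions chosen for this example.

```python
def whitespace_tokenize(text):
    # Imitates a whitespace tokenizer: split on runs of whitespace.
    return text.split()

def lowercase(tokens):
    # Imitates a lowercase filter.
    return [t.lower() for t in tokens]

def edge_ngrams(tokens, min_gram=1, max_gram=5):
    # Imitates an edge n-gram filter: emit prefixes of each token.
    # min_gram and max_gram are assumed values for this sketch.
    out = []
    for t in tokens:
        for n in range(min_gram, min(max_gram, len(t)) + 1):
            out.append(t[:n])
    return out

def analyze(text, filters):
    # Run the tokenizer, then each filter in order, like an analysis chain.
    tokens = whitespace_tokenize(text)
    for f in filters:
        tokens = f(tokens)
    return tokens

field1_tokens = analyze("fleas", [lowercase])
field2_tokens = analyze("fleas", [lowercase, edge_ngrams])

print(field1_tokens)  # ['fleas']
print(field2_tokens)  # ['f', 'fl', 'fle', 'flea', 'fleas']
```

The same input yields one term for the first chain and five for the second, which is why the terms of one field cannot simply stand in for the terms of another.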