Re: does copyFields increase indexe size ?

Nicolas Paris Thu, 26 Dec 2019 12:22:54 -0800

Hi Eric

Below is part of the managed-schema. There are 1k section* fields. For
the second experiment, I removed the copyField, dropped the collection
and re-indexed everything. To measure the index size, I went to
solr-cloud and looked in the cloud part: 40 GB per shard. I also looked
at the folder size. I ran some tests and the _text_ field is indexed.


    <field name="_text_" type="text_fr" indexed="true" stored="false" multiValued="true"/>
    <dynamicField name="section*" type="text_fr" indexed="true" stored="true" multiValued="true"/>
    <copyField source="section*" dest="_text_"/>

    <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="\p{Punct}" replacement=" " replace="all"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <!-- removes l', etc -->
        <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball"/>
        <filter class="solr.FrenchLightStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms-fr.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="\p{Punct}" replacement=" " replace="all"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <!-- removes l', etc -->
        <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball"/>
        <filter class="solr.FrenchLightStemFilterFactory"/>
      </analyzer>
    </fieldType>
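
For comparison, the "terms only" case mentioned in the quoted reply below (no norms, no term frequencies, no positions, no term vectors) could be sketched roughly as follows. This is only an illustration under the assumption that the standard Solr field properties omitNorms, omitTermFreqAndPositions and termVectors are used; it is not the schema that was measured, and no index size was tested with it.

```xml
<!-- Sketch only, not the schema that was measured above. -->
<!-- Keeps _text_ searchable but drops norms, term frequencies and positions. -->
<!-- Caveat: without positions, phrase queries on _text_ will not work. -->
<field name="_text_" type="text_fr" indexed="true" stored="false"
       multiValued="true" omitNorms="true"
       omitTermFreqAndPositions="true" termVectors="false"/>
<copyField source="section*" dest="_text_"/>
```

With such a definition the destination field should be as small as an indexed text field can get, at the cost of relevance scoring and phrase matching on _text_. A collection would still have to be re-indexed after the change for any size difference to show up.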





On Thu, Dec 26, 2019 at 02:16:32PM -0500, Erick Erickson wrote:
> This simply cannot be true unless the destination copyField is indexed=false, 
> docValues=false stored=false. I.e. “some circumstances” means there’s really 
> no use in using the copyField in the first place. I suppose that if you don’t 
> store any term vectors, no position information nothing except, say, the 
> terms then maybe you’ll have extremely minimal size. But even in that case, 
> I’d use the original field in an “fq” clause which doesn’t use any scoring in 
> place of using the copyField.
> 
> Each field is stored in a separate part of the relevant files (.tim, .pos, 
> etc). Term frequencies are kept on a _per field_ basis for instance.
> 
> So this pretty much has to be small sample size or other measurement error.
> 
> Best,
> Erick
> 
> > On Dec 26, 2019, at 9:27 AM, Nicolas Paris <nicolas.pa...@riseup.net> wrote:
> > 
> > Anyway, that's good news: copyField does not increase index size in
> > some circumstances:
> > - the copied fields and the target field share the same datatype
> > - the target field is not stored
> > 
> > this is tested on text fields
> > 
> > 
> > On Wed, Dec 25, 2019 at 11:42:23AM +0100, Nicolas Paris wrote:
> >> 
> >> On Wed, Dec 25, 2019 at 05:30:03AM -0500, Dave wrote:
> >>> #2 you initially said you were talking about 1k documents. 
> >> 
> >> Hi Dave. Again, sorry for the confusion. This is 1k fields
> >> (general_text), over 50M large documents copied into one _text_ field. 
> >> 4 shards, 40GB per shard in both cases, with/without the _text_ field
> >> 
> >>> 
> >>>> On Dec 25, 2019, at 3:07 AM, Nicolas Paris <nicolas.pa...@riseup.net> wrote:
> >>>> 
> >>>> 
> >>>>> 
> >>>>> If you are redoing the indexing after changing the schema and
> >>>>> reloading/restarting, then you can ignore me.
> >>>> 
> >>>> I am sorry to say that I have to ignore you. Indeed, my tests include
> >>>> recreating the collection from scratch - with and without the copy
> >>>> fields.
> >>>> In both cases the index size is the same! (while the _text_ field is
> >>>> working correctly)
> >>>> 
> >>>>> On Tue, Dec 24, 2019 at 05:32:09PM -0700, Shawn Heisey wrote:
> >>>>>> On 12/24/2019 5:11 PM, Nicolas Paris wrote:
> >>>>>> Do you mean "copy fields" is only an action of changing the schema ?
> >>>>>> I was thinking it was adding a new field and eventually a new index to
> >>>>>> the collection
> >>>>> 
> >>>>> The copy that copyField does happens at index time.  Reindexing is
> >>>>> required after changing the schema, or nothing happens.
> >>>>> 
> >>>>> If you are redoing the indexing after changing the schema and
> >>>>> reloading/restarting, then you can ignore me.
> >>>>> 
> >>>>> Thanks,
> >>>>> Shawn
> >>>>> 
> >>>> 
> >>>> -- 
> >>>> nicolas
> >>> 
> >> 
> >> -- 
> >> nicolas
> >> 
> > 
> > -- 
> > nicolas
> 

-- 
nicolas
