[ 
https://issues.apache.org/jira/browse/SOLR-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elran Dvir updated SOLR-6666:
-----------------------------
    Attachment: SOLR-6666.patch

> Dynamic copy fields are considering all dynamic fields, causing a significant 
> performance impact on indexing documents
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6666
>                 URL: https://issues.apache.org/jira/browse/SOLR-6666
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis, update
>         Environment: Linux, Solr 4.8, Schema with 70 fields and more than 500 
> specific CopyFields for dynamic fields, but without wildcards (the fields are 
> dynamic, the copy directive is not)
>            Reporter: Liram Vardi
>            Assignee: Erick Erickson
>         Attachments: SOLR-6666.patch, SOLR-6666.patch
>
>
> Result:
> After applying a fix for this issue, tests which we conducted show more than 
> 40 percent improvement on our insertion performance.
> Explanation:
> Using JVM profiler, we found a CPU "bottleneck" during Solr indexing process. 
> This bottleneck can be found at org.apache.solr.schema.IndexSchema, in the 
> following method, "getCopyFieldsList()":
> {code:title=getCopyFieldsList() |borderStyle=solid}
> final List<CopyField> result = new ArrayList<>();
>     for (DynamicCopy dynamicCopy : dynamicCopyFields) {
>       if (dynamicCopy.matches(sourceField)) {
>         result.add(new CopyField(getField(sourceField), 
> dynamicCopy.getTargetField(sourceField), dynamicCopy.maxChars));
>       }
>     }
>     List<CopyField> fixedCopyFields = copyFieldsMap.get(sourceField);
>     if (null != fixedCopyFields) {
>       result.addAll(fixedCopyFields);
>     }
> {code}
> This function tries to find for an input source field all its copyFields (All 
> its destinations which Solr need to move this field). 
> As you can probably note, the first part of the procedure is the procedure 
> most “expensive” step (takes O( n ) time while N is the size of the 
> "dynamicCopyFields" group).
> The next part is just a simple "hash" extraction, which takes O(1) time. 
> Our schema contains over then 500 copyFields but only 70 of then are 
> "indexed" fields. 
> We also have one dynamic field with  a wildcard ( * ), which "catches" the 
> rest of the document fields. 
> As you can conclude, we have more than 400 copyFields that are based on this 
> dynamicField but all, except one, are fixed (i.e. does not contain any 
> wildcard).
> From some reason, the copyFields registration procedure defines those 400 
> fields as "DynamicCopyField " and then store them in the “dynamicCopyFields” 
> array, 
> This step makes getCopyFieldsList() very expensive (in CPU terms) without any 
> justification: All of those 400 copyFields are not glob and therefore do not 
> need any complex pattern matching to the input field. They all can be store 
> at the "fixedCopyFields".
> Only copyFields with asterisks need this "special" treatment and they are 
> (especially on our case) pretty rare.  
> Therefore, we created a patch which fix this problem by changing the 
> registerCopyField() procedure.
> Test which we conducted show that there is no change in the Indexing results. 
> Moreover, the fix still successfully passes the class unit tests (i.e. 
> IndexSchemaTest.java).
>        



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to