[jira] [Updated] (SOLR-6666) Dynamic copy fields are considering all dynamic fields, causing a significant performance impact on indexing documents

Liram Vardi (JIRA) Wed, 29 Oct 2014 07:46:50 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Liram Vardi updated SOLR-6666:
------------------------------
    Description: 
Result:
After applying a fix for this issue, our tests show an improvement of more than 
40 percent in insertion performance.

Explanation:

Using a JVM profiler, we found a CPU bottleneck in the Solr indexing process. 
The bottleneck is in org.apache.solr.schema.IndexSchema, in the following 
method, getCopyFieldsList():

{code:title=getCopyFieldsList()|borderStyle=solid}
    final List<CopyField> result = new ArrayList<>();
    // Linear scan: every entry in dynamicCopyFields is tested against the source.
    for (DynamicCopy dynamicCopy : dynamicCopyFields) {
      if (dynamicCopy.matches(sourceField)) {
        result.add(new CopyField(getField(sourceField),
            dynamicCopy.getTargetField(sourceField), dynamicCopy.maxChars));
      }
    }
    // Hash lookup: copy fields registered under the exact source field name.
    List<CopyField> fixedCopyFields = copyFieldsMap.get(sourceField);
    if (null != fixedCopyFields) {
      result.addAll(fixedCopyFields);
    }
{code}

This function finds, for an input source field, all of its copyFields (i.e. all 
the destinations to which Solr needs to copy this field). 
As you can see, the first part of the procedure is its most expensive step: it 
takes O(N) time, where N is the size of the "dynamicCopyFields" array.
The second part is just a simple hash lookup, which takes O(1) time. 
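
To make the cost difference concrete, here is a minimal sketch of the two 
lookup paths. It is not Solr code: the class CopyFieldLookupSketch, its 
methods, the simplified glob matcher and the patternTests counter are all 
hypothetical stand-ins, and field names are plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the two lookup paths; not Solr's actual classes. */
final class CopyFieldLookupSketch {
  /** Destinations registered under an exact source name: one hash lookup. */
  private final Map<String, List<String>> fixed = new HashMap<>();
  /** Rules that are tested one by one: {sourcePattern, destination}. */
  private final List<String[]> dynamic = new ArrayList<>();
  /** Counts pattern tests so the cost of the scan is visible. */
  int patternTests = 0;

  void addFixed(String source, String dest) {
    fixed.computeIfAbsent(source, k -> new ArrayList<>()).add(dest);
  }

  void addDynamic(String sourcePattern, String dest) {
    dynamic.add(new String[] {sourcePattern, dest});
  }

  /** Simplified glob: a single leading or trailing '*', otherwise exact match. */
  static boolean matches(String pattern, String name) {
    if (pattern.startsWith("*")) {
      return name.endsWith(pattern.substring(1));
    }
    if (pattern.endsWith("*")) {
      return name.startsWith(pattern.substring(0, pattern.length() - 1));
    }
    return pattern.equals(name);
  }

  List<String> destinationsFor(String source) {
    List<String> result = new ArrayList<>();
    // O(N) part: every dynamic rule is tested, whether or not it has a wildcard.
    for (String[] rule : dynamic) {
      patternTests++;
      if (matches(rule[0], source)) {
        result.add(rule[1]);
      }
    }
    // O(1) part: exact source names come straight from the map.
    List<String> exact = fixed.get(source);
    if (exact != null) {
      result.addAll(exact);
    }
    return result;
  }
}
```

In this sketch, if 400 wildcard-free rules are registered through addDynamic(), 
each call to destinationsFor() performs 400 pattern tests for a single source 
field, while a rule registered through addFixed() costs one map lookup.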

Our schema contains more than 500 copyFields, but only 70 of them are "indexed" 
fields. 
We also have one dynamic field with a wildcard ( * ), which "catches" the rest 
of the document fields. 
As a result, we have more than 400 copyFields that are based on this 
dynamicField, but all except one are fixed (i.e. they do not contain any 
wildcard).

For some reason, the copyField registration procedure defines those 400 
fields as "DynamicCopyField" and then stores them in the "dynamicCopyFields" 
array. 
This step makes getCopyFieldsList() very expensive (in CPU terms) without any 
justification: none of those 400 copyFields is a glob, so they do not need any 
complex pattern matching against the input field. They can all be stored in 
"fixedCopyFields".
Only copyFields with asterisks need this "special" treatment, and they are 
(especially in our case) pretty rare.  
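
The idea can be sketched as follows. This is only an illustration under 
simplifying assumptions, not the attached patch and not Solr's 
registerCopyField(): the class CopyFieldRegistrationSketch is hypothetical, 
fields are plain strings, and only the source name is checked for an asterisk. 
The names copyFieldsMap and dynamicCopyFields are borrowed from the snippet 
above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical registration logic; not the actual IndexSchema code. */
final class CopyFieldRegistrationSketch {
  /** Exact source name -> destinations, served by a hash lookup. */
  final Map<String, List<String>> copyFieldsMap = new HashMap<>();
  /** Only rules that really contain a wildcard: {sourcePattern, destination}. */
  final List<String[]> dynamicCopyFields = new ArrayList<>();

  void registerCopyField(String source, String dest) {
    if (source.indexOf('*') >= 0) {
      // A glob source needs pattern matching at lookup time.
      dynamicCopyFields.add(new String[] {source, dest});
    } else {
      // A fixed source name can be resolved by key, so it goes into the map.
      copyFieldsMap.computeIfAbsent(source, k -> new ArrayList<>()).add(dest);
    }
  }
}
```

With a registration rule of this kind, the linear scan at lookup time covers 
only the genuine wildcard rules, which in a schema like ours would be a 
handful of entries instead of more than 400.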

Therefore, we created a patch which fixes this problem by changing the 
registerCopyField() procedure.
Tests which we conducted show that there is no change in the indexing results. 
Moreover, the fix still passes the class unit tests (i.e. 
IndexSchemaTest.java).
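
A check of the "no change in results" property could look roughly like the 
sketch below. It is a hypothetical illustration, not one of the tests we ran 
and not part of IndexSchemaTest: the class CopyFieldEquivalenceSketch and its 
methods are invented for this example. Destinations are compared as sorted 
sets, on the assumption that the order of the returned list does not matter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical equivalence check between the two registration strategies. */
final class CopyFieldEquivalenceSketch {
  /** Simplified glob: a single leading or trailing '*', otherwise exact match. */
  static boolean matches(String pattern, String name) {
    if (pattern.startsWith("*")) {
      return name.endsWith(pattern.substring(1));
    }
    if (pattern.endsWith("*")) {
      return name.startsWith(pattern.substring(0, pattern.length() - 1));
    }
    return pattern.equals(name);
  }

  /** "Before": every rule {source, destination} is scanned linearly. */
  static TreeSet<String> scanAll(List<String[]> rules, String source) {
    TreeSet<String> out = new TreeSet<>();
    for (String[] rule : rules) {
      if (matches(rule[0], source)) {
        out.add(rule[1]);
      }
    }
    return out;
  }

  /** "After": wildcard rules are scanned, exact names come from a hash map. */
  static TreeSet<String> scanSplit(List<String[]> rules, String source) {
    List<String[]> wild = new ArrayList<>();
    Map<String, List<String>> exact = new HashMap<>();
    for (String[] rule : rules) {
      if (rule[0].indexOf('*') >= 0) {
        wild.add(rule);
      } else {
        exact.computeIfAbsent(rule[0], k -> new ArrayList<>()).add(rule[1]);
      }
    }
    TreeSet<String> out = scanAll(wild, source);
    List<String> fixed = exact.get(source);
    if (fixed != null) {
      out.addAll(fixed);
    }
    return out;
  }
}
```

For any rule list and any source field, scanAll() and scanSplit() should 
return equal sets; a mismatch would mean the split changed which destinations 
a field is copied to.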

       


> Dynamic copy fields are considering all dynamic fields, causing a significant 
> performance impact on indexing documents
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6666
>                 URL: https://issues.apache.org/jira/browse/SOLR-6666
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis, update
>         Environment: Linux, Solr 4.8, Schema with 70 fields and more than 500 
> specific CopyFields for dynamic fields, but without wildcards (the fields are 
> dynamic, the copy directive is not)
>            Reporter: Liram Vardi



