[ https://issues.apache.org/jira/browse/SOLR-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Elran Dvir updated SOLR-6666:
-----------------------------
    Attachment: SOLR-6666.patch

> Dynamic copy fields are considering all dynamic fields, causing a significant
> performance impact on indexing documents
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6666
>                 URL: https://issues.apache.org/jira/browse/SOLR-6666
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis, update
>         Environment: Linux, Solr 4.8, Schema with 70 fields and more than 500
> specific CopyFields for dynamic fields, but without wildcards (the fields are
> dynamic, the copy directive is not)
>            Reporter: Liram Vardi
>            Assignee: Erick Erickson
>         Attachments: SOLR-6666.patch, SOLR-6666.patch
>
>
> Result:
> After applying a fix for this issue, tests that we conducted show more than
> 40 percent improvement in our insertion performance.
> Explanation:
> Using a JVM profiler, we found a CPU bottleneck during the Solr indexing
> process. The bottleneck is in org.apache.solr.schema.IndexSchema, in the
> following method, "getCopyFieldsList()":
> {code:title=getCopyFieldsList() |borderStyle=solid}
> final List<CopyField> result = new ArrayList<>();
> // Linear scan: every dynamic copy rule is tested against the source field.
> for (DynamicCopy dynamicCopy : dynamicCopyFields) {
>   if (dynamicCopy.matches(sourceField)) {
>     result.add(new CopyField(getField(sourceField),
>         dynamicCopy.getTargetField(sourceField), dynamicCopy.maxChars));
>   }
> }
> // Hash lookup: fixed copy rules are keyed by source field name.
> List<CopyField> fixedCopyFields = copyFieldsMap.get(sourceField);
> if (null != fixedCopyFields) {
>   result.addAll(fixedCopyFields);
> }
> {code}
> For a given source field, this function finds all of its copyFields (that
> is, all the destinations to which Solr needs to copy this field).
> As you can see, the first part of the procedure is its most "expensive"
> step: it takes O(n) time, where n is the size of the "dynamicCopyFields"
> group.
> The next part is just a simple hash lookup, which takes O(1) time.
> Our schema contains more than 500 copyFields, but only 70 of them are
> "indexed" fields.
> We also have one dynamic field with a wildcard ( * ), which "catches" the
> rest of the document fields.
> As a result, we have more than 400 copyFields that are based on this
> dynamicField, but all except one are fixed (i.e. they do not contain any
> wildcard).
> For some reason, the copyFields registration procedure defines those 400
> fields as "DynamicCopyField" and then stores them in the "dynamicCopyFields"
> array.
> This step makes getCopyFieldsList() very expensive (in CPU terms) without any
> justification: none of those 400 copyFields is a glob, so none of them needs
> any complex pattern matching against the input field. They can all be stored
> in "fixedCopyFields".
> Only copyFields with asterisks need this "special" treatment, and they are
> (especially in our case) pretty rare.
> Therefore, we created a patch that fixes this problem by changing the
> registerCopyField() procedure.
> Tests that we conducted show that there is no change in the indexing results.
> Moreover, the fix still passes the class unit tests (i.e.
> IndexSchemaTest.java).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
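
The idea described in the issue can be illustrated with a small, self-contained sketch. This is not the attached patch and does not use Solr's classes: `CopyFieldLookupSketch`, `register`, `destinationsFor`, `globRuleCount` and `matches` are hypothetical names, destinations are plain strings, and glob matching is simplified to a single leading or trailing asterisk. It only shows the routing decision at registration time: a source without an asterisk goes into a hash map, and only real globs end up in the list that is scanned on every lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the copy-field bookkeeping; NOT Solr's IndexSchema. */
final class CopyFieldLookupSketch {
    // Rules whose source contains '*': scanned linearly on every lookup.
    private final List<String[]> globRules = new ArrayList<>();
    // Rules with a literal source name: resolved with one hash lookup.
    private final Map<String, List<String>> fixedRules = new HashMap<>();

    /** Registers a copy rule, routing it by whether the source is a glob. */
    void register(String source, String dest) {
        if (source.contains("*")) {
            globRules.add(new String[] {source, dest});
        } else {
            fixedRules.computeIfAbsent(source, k -> new ArrayList<>()).add(dest);
        }
    }

    /** Returns all destinations for a source field: glob matches first, then fixed ones. */
    List<String> destinationsFor(String sourceField) {
        List<String> result = new ArrayList<>();
        for (String[] rule : globRules) {          // O(number of glob rules)
            if (matches(rule[0], sourceField)) {
                result.add(rule[1]);
            }
        }
        List<String> fixed = fixedRules.get(sourceField);  // O(1) on average
        if (fixed != null) {
            result.addAll(fixed);
        }
        return result;
    }

    /** Number of rules that must be scanned on every lookup. */
    int globRuleCount() {
        return globRules.size();
    }

    /** Simplified matching (assumption): supports only one leading or trailing '*'. */
    static boolean matches(String glob, String name) {
        if (glob.equals("*")) {
            return true;
        }
        if (glob.startsWith("*")) {
            return name.endsWith(glob.substring(1));
        }
        if (glob.endsWith("*")) {
            return name.startsWith(glob.substring(0, glob.length() - 1));
        }
        return glob.equals(name);
    }

    public static void main(String[] args) {
        CopyFieldLookupSketch schema = new CopyFieldLookupSketch();
        // 400 literal rules, as in the scenario described in the issue.
        for (int i = 0; i < 400; i++) {
            schema.register("field_" + i, "dest_" + i);
        }
        // One real wildcard rule.
        schema.register("*_txt", "text_all");

        System.out.println("rules scanned per lookup: " + schema.globRuleCount());
        System.out.println("field_7   -> " + schema.destinationsFor("field_7"));
        System.out.println("title_txt -> " + schema.destinationsFor("title_txt"));
    }
}
```

With this routing, a lookup scans only the wildcard rules, so its cost no longer grows with the number of literal copyFields. Whether the real registerCopyField() can classify rules this simply depends on details of Solr's schema handling that this sketch leaves out.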