Re: [PR] SOLR-11191: SolrIndexSplitter: Routing child docs by _root_ when available [solr]

via GitHub Sun, 27 Oct 2024 12:17:45 -0700


dsmiley commented on code in PR #2799:
URL: https://github.com/apache/solr/pull/2799#discussion_r1818145471



##########
solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java:
##########
@@ -125,17 +126,33 @@ public SolrIndexSplitter(SplitIndexCommand cmd) {
       numPieces = cmd.ranges.size();
       rangesArr = cmd.ranges.toArray(new DocRouter.Range[0]);
     }
+
+    SchemaField maybeField;
     if (cmd.routeFieldName == null) {
-      field = searcher.getSchema().getUniqueKeyField();
+      // To support routing child documents, use the root field if it exists 
(which would be
+      // populated with unique field), otherwise use the unique key field
+      if (searcher.getSchema().isUsableForChildDocs()) {
+        maybeField = 
searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME);

Review Comment:
   why "OrNull"?



##########
solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java:
##########
@@ -125,17 +126,33 @@ public SolrIndexSplitter(SplitIndexCommand cmd) {
       numPieces = cmd.ranges.size();
       rangesArr = cmd.ranges.toArray(new DocRouter.Range[0]);
     }
+
+    SchemaField maybeField;
     if (cmd.routeFieldName == null) {
-      field = searcher.getSchema().getUniqueKeyField();
+      // To support routing child documents, use the root field if it exists 
(which would be
+      // populated with unique field), otherwise use the unique key field
+      if (searcher.getSchema().isUsableForChildDocs()) {
+        maybeField = 
searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME);
+      } else {
+        maybeField = searcher.getSchema().getUniqueKeyField();
+      }
     } else {
-      field = searcher.getSchema().getField(cmd.routeFieldName);
+      // Custom routing
+      // This may not route child documents the same as parents; users are 
expected to manage this
+      // themselves, such as by using the value in each field.
+      // If the field is set to the unique field, users can reset it to null 
so the above logic to

Review Comment:
   And I'd drop this sentence.



##########
solr/solr-ref-guide/modules/indexing-guide/pages/indexing-nested-documents.adoc:
##########
@@ -304,6 +305,11 @@ Instead, use delete-by-query (most efficient) or 
xref:partial-document-updates.a
 If you use Solr's delete-by-query APIs, you *MUST* be careful to ensure that 
any deletion query is structured to ensure no descendent children remain of any 
documents that are being deleted.
 *_Doing otherwise will violate integrity assumptions that Solr expects._*
 
+=== Shard Splits
+When xref:deployment-guide:shard-management.adoc#splitshard[splitting shards], 
child documents get transmitted and routed on their own, since they are 
documents in their own right. If you route based on your unique field, then the 
child documents will be routed correctly based on their `_root_`. However, if 
your collection specifies a router field (even if it matches the unique field), 
then the child documents will be routed based on that field instead of their 
root field. If your collection specifies a router field, you should ensure 
every doc in a hierarchy has the same value for that field, otherwise it maybe 
broken during a shard split.

Review Comment:
   Hmm; I suppose we might add such a tip in a number of places.  In Nested 
Docs, or in Shard Splits, or `collection-management.adoc` (where  router.field 
is documented).  I suppose here in this "maintaining integrity" in nested docs 
is okay but I also think it should be where `router.field` is defined to simply 
say that all docs, _even child docs_, need this field, and that all docs in a 
nested doc tree must have the same value.  Needn't list what bad things happen 
when you don't.
   
   I suppose Solr should do that itself -- I filed: 
https://issues.apache.org/jira/browse/SOLR-17517



##########
solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java:
##########
@@ -125,17 +126,33 @@ public SolrIndexSplitter(SplitIndexCommand cmd) {
       numPieces = cmd.ranges.size();
       rangesArr = cmd.ranges.toArray(new DocRouter.Range[0]);
     }
+
+    SchemaField maybeField;
     if (cmd.routeFieldName == null) {
-      field = searcher.getSchema().getUniqueKeyField();
+      // To support routing child documents, use the root field if it exists 
(which would be
+      // populated with unique field), otherwise use the unique key field
+      if (searcher.getSchema().isUsableForChildDocs()) {
+        maybeField = 
searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME);
+      } else {
+        maybeField = searcher.getSchema().getUniqueKeyField();
+      }
     } else {
-      field = searcher.getSchema().getField(cmd.routeFieldName);
+      // Custom routing
+      // This may not route child documents the same as parents; users are 
expected to manage this

Review Comment:
   I would reword this slightly as, "if child docs are used, users must ensure 
that the whole nested document tree has a consistent routeFieldName."



##########
solr/solr-ref-guide/modules/indexing-guide/pages/indexing-nested-documents.adoc:
##########
@@ -304,6 +305,11 @@ Instead, use delete-by-query (most efficient) or 
xref:partial-document-updates.a
 If you use Solr's delete-by-query APIs, you *MUST* be careful to ensure that 
any deletion query is structured to ensure no descendent children remain of any 
documents that are being deleted.
 *_Doing otherwise will violate integrity assumptions that Solr expects._*
 
+=== Shard Splits
+When xref:deployment-guide:shard-management.adoc#splitshard[splitting shards], 
child documents get transmitted and routed on their own, since they are 
documents in their own right. If you route based on your unique field, then the 
child documents will be routed correctly based on their `_root_`. However, if 
your collection specifies a router field (even if it matches the unique field), 
then the child documents will be routed based on that field instead of their 
root field. If your collection specifies a router field, you should ensure 
every doc in a hierarchy has the same value for that field, otherwise it maybe 
broken during a shard split.

Review Comment:
   This paragraph feels more like it should be a warning/tip than something 
that needs its own heading.  And although your inside/internal information is 
interesting, I think it's better to be short & sweet get to the point of what a 
user must do:  router.field must be consistent for the nested doc tree.



##########
solr/CHANGES.txt:
##########
@@ -29,7 +29,7 @@ Optimizations
 
 Bug Fixes
 ---------------------
-(No changes)
+* SOLR-11191: Splitting shards now routes child-docs with their _root_ field 
when available so they maintain parent relationship. (Zack Kendall)

Review Comment:
   FWIW I view this as an Improvement but it's debatable.
   Also by putting this here, it will not be back-ported to 9.x.  But wouldn't 
we want it in the next 9.x release?



##########
solr/solr-ref-guide/modules/indexing-guide/pages/indexing-nested-documents.adoc:
##########
@@ -304,6 +305,11 @@ Instead, use delete-by-query (most efficient) or 
xref:partial-document-updates.a
 If you use Solr's delete-by-query APIs, you *MUST* be careful to ensure that 
any deletion query is structured to ensure no descendent children remain of any 
documents that are being deleted.
 *_Doing otherwise will violate integrity assumptions that Solr expects._*
 
+=== Shard Splits
+When xref:deployment-guide:shard-management.adoc#splitshard[splitting shards], 
child documents get transmitted and routed on their own, since they are 
documents in their own right. If you route based on your unique field, then the 
child documents will be routed correctly based on their `_root_`. However, if 
your collection specifies a router field (even if it matches the unique field), 
then the child documents will be routed based on that field instead of their 
root field. If your collection specifies a router field, you should ensure 
every doc in a hierarchy has the same value for that field, otherwise it maybe 
broken during a shard split.

Review Comment:
   Please put a newline at the end of each sentence in our adoc files.  It's 
then easier to handle diffs and it's less of a problem when an editor doesn't 
wrap.



##########
solr/core/src/test/org/apache/solr/cloud/SplitShardTest.java:
##########
@@ -174,6 +164,50 @@ public void testSplitFuzz() throws Exception {
     assertEquals("wrong range in s1_1", expected1, delta1);
   }
 
+  @Test
+  public void testWithChildDocuments() throws Exception {
+    CloudSolrClient solrClient = cluster.getSolrClient();
+    CollectionAdminRequest.createCollection(COLLECTION_NAME, "conf", 1, 
1).process(solrClient);
+
+    cluster.waitForActiveCollection(COLLECTION_NAME, 1, 1);
+
+    List<SolrInputDocument> inputDocs = new ArrayList<>();
+    for (int idx = 0; idx < 10; ++idx) {
+      SolrInputDocument parent = new SolrInputDocument();
+      parent.addField("id", idx);
+      parent.addField("type_s", "parent");
+
+      // Child has different ID, so could be routed differently if used for 
routing.
+      SolrInputDocument child = new SolrInputDocument();
+      child.addField("id", idx + "_child");
+      child.addField("expected_parent_s", idx);
+
+      parent.addField("myChild", List.of(child));
+
+      inputDocs.add(parent);
+    }
+    solrClient.add(COLLECTION_NAME, inputDocs);
+
+    CollectionAdminRequest.SplitShard splitShard =
+        CollectionAdminRequest.splitShard(COLLECTION_NAME)
+            .setNumSubShards(2)
+            .setShardName("shard1");
+    splitShard.process(solrClient);
+    waitForState("Timed out waiting for shard split", COLLECTION_NAME, 
activeClusterShape(2, 2));

Review Comment:
   I think this isn't necessary since commands are synchronous; it should be 
split by now.



##########
solr/core/src/test-files/solr/configsets/cloud-minimal/conf/schema.xml:
##########
@@ -19,10 +19,12 @@
   <fieldType name="string" class="solr.StrField"/>
   <fieldType name="int" class="${solr.tests.IntegerFieldType}" 
docValues="${solr.tests.numeric.dv}" precisionStep="0" omitNorms="true" 
positionIncrementGap="0" uninvertible="true"/>
   <fieldType name="long" class="${solr.tests.LongFieldType}" 
docValues="${solr.tests.numeric.dv}" precisionStep="0" omitNorms="true" 
positionIncrementGap="0" uninvertible="true"/>
+  <fieldType name="_nest_path_" class="solr.NestPathField"/>
   <dynamicField name="*" type="string" indexed="true" stored="true"/>
   <!-- for versioning -->
   <field name="_version_" type="long" indexed="true" stored="true"/>
   <field name="_root_" type="string" indexed="true" stored="true" 
multiValued="false" required="false"/>
+  <field name="_nest_path_" type="_nest_path_"/>`

Review Comment:
   Ironically this schema isn't so "minimal" after all once we get into nested 
docs.  Whatever I guess.



##########
solr/core/src/test/org/apache/solr/cloud/SplitShardTest.java:
##########
@@ -174,6 +164,50 @@ public void testSplitFuzz() throws Exception {
     assertEquals("wrong range in s1_1", expected1, delta1);
   }
 
+  @Test
+  public void testWithChildDocuments() throws Exception {
+    CloudSolrClient solrClient = cluster.getSolrClient();
+    CollectionAdminRequest.createCollection(COLLECTION_NAME, "conf", 1, 
1).process(solrClient);
+
+    cluster.waitForActiveCollection(COLLECTION_NAME, 1, 1);
+
+    List<SolrInputDocument> inputDocs = new ArrayList<>();
+    for (int idx = 0; idx < 10; ++idx) {
+      SolrInputDocument parent = new SolrInputDocument();
+      parent.addField("id", idx);
+      parent.addField("type_s", "parent");
+
+      // Child has different ID, so could be routed differently if used for 
routing.
+      SolrInputDocument child = new SolrInputDocument();
+      child.addField("id", idx + "_child");

Review Comment:
   Could you use a global doc counter instead of concatenating?  This would 
emphasize that the ID format has no bearing at all on the splitting.



##########
solr/core/src/test/org/apache/solr/update/SolrIndexSplitterTest.java:
##########
@@ -485,19 +486,147 @@ private void 
doTestSplitByRouteKey(SolrIndexSplitter.SplitMethod splitMethod) th
     }
   }
 
+  @Test
+  public void testSplitWithChildDocs() throws Exception {
+    doTestSplitWithChildDocs(SolrIndexSplitter.SplitMethod.REWRITE);
+  }
+
+  @Test
+  public void testSplitWithChildDocsLink() throws Exception {
+    doTestSplitWithChildDocs(SolrIndexSplitter.SplitMethod.LINK);
+  }
+
+  public void doTestSplitWithChildDocs(SolrIndexSplitter.SplitMethod 
splitMethod) throws Exception {
+    // Overall test/split pattern copied from doTestSplitByCores
+    File indexDir = createTempDir().toFile();
+
+    CompositeIdRouter r1 = new CompositeIdRouter();
+    String routeKeyBase = "sea-line!";
+
+    List<DocRouter.Range> twoRanges = r1.partitionRange(2, 
r1.keyHashRange(routeKeyBase));
+
+    // Insert other doc for contrast
+    String otherDocId = routeKeyBase + "6"; // Will hash into first range
+    assertU(adoc("id", otherDocId));
+
+    // Insert child doc
+    String parentId = routeKeyBase + "1"; // Will hash into second range
+    String childId = routeKeyBase + "5"; // Will hash into first range
+    SolrInputDocument doc =
+        sdoc("id", parentId, "myChild", sdocs(sdoc("id", childId, "child_s", 
"child")));
+    assertU(adoc(doc));
+
+    assertU(commit());
+    assertJQ(req("q", "*:*"), "/response/numFound==3");
+
+    // Able to query child.
+    assertJQ(
+        req("q", "id:" + parentId, "fl", "*, [child]"),
+        "/response/numFound==1",
+        "/response/docs/[0]/myChild/[0]/id=='" + childId + "'",
+        "/response/docs/[0]/myChild/[0]/_root_=='"
+            + parentId
+            + "'" // Child has parent root to route with
+        );
+
+    SolrCore core1 = null, core2 = null;
+    try {
+      core1 =
+          h.getCoreContainer()
+              .create(
+                  "split1",
+                  Map.of("dataDir", indexDir1.getAbsolutePath(), "configSet", 
"cloud-minimal"));
+      core2 =
+          h.getCoreContainer()
+              .create(
+                  "split2",
+                  Map.of("dataDir", indexDir2.getAbsolutePath(), "configSet", 
"cloud-minimal"));
+
+      LocalSolrQueryRequest request = null;
+      try {
+        request = lrf.makeRequest("q", "dummy");
+        SolrQueryResponse rsp = new SolrQueryResponse();
+        SplitIndexCommand command =
+            new SplitIndexCommand(
+                request,
+                rsp,
+                null,
+                List.of(core1, core2),
+                twoRanges,
+                new CompositeIdRouter(),
+                null,
+                null,
+                splitMethod);
+        doSplit(command);
+      } finally {
+        if (request != null) request.close();
+      }
+      @SuppressWarnings("resource")
+      final EmbeddedSolrServer server1 = new 
EmbeddedSolrServer(h.getCoreContainer(), "split1");
+      @SuppressWarnings("resource")
+      final EmbeddedSolrServer server2 = new 
EmbeddedSolrServer(h.getCoreContainer(), "split2");
+      server1.commit(true, true);
+      server2.commit(true, true);
+      assertEquals(
+          "should be  2 docs in index2",
+          2,
+          server2.query(new SolrQuery("*:*")).getResults().getNumFound());
+      assertEquals(
+          "parent doc should be in index2",
+          1,
+          server2.query(new SolrQuery("id:" + 
parentId)).getResults().getNumFound());
+      assertEquals(
+          "child doc should be in index2",
+          1,
+          server2.query(new SolrQuery("id:" + 
childId)).getResults().getNumFound());
+      assertEquals(
+          "other doc should be in index1",
+          1,
+          server1.query(new SolrQuery("id:" + 
otherDocId)).getResults().getNumFound());
+    } finally {
+      h.getCoreContainer().unload("split2");
+      h.getCoreContainer().unload("split1");
+    }
+  }
+
+  /** Utility method to find Ids that hash into which ranges. Uncomment @Test 
to print. */
+  //  @Test
+  public void testCompositeHashSandbox() {

Review Comment:
   just by naming it with a "test" prefix, I think it'll run.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] SOLR-11191: SolrIndexSplitter: Routing child docs by _root_ when available [solr]

Reply via email to