dsmiley commented on code in PR #2799:
URL: https://github.com/apache/solr/pull/2799#discussion_r1818145471
##########
solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java:
##########
@@ -125,17 +126,33 @@ public SolrIndexSplitter(SplitIndexCommand cmd) {
numPieces = cmd.ranges.size();
rangesArr = cmd.ranges.toArray(new DocRouter.Range[0]);
}
+
+ SchemaField maybeField;
if (cmd.routeFieldName == null) {
- field = searcher.getSchema().getUniqueKeyField();
+ // To support routing child documents, use the root field if it exists
(which would be
+ // populated with unique field), otherwise use the unique key field
+ if (searcher.getSchema().isUsableForChildDocs()) {
+ maybeField =
searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME);
Review Comment:
why "OrNull"?
##########
solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java:
##########
@@ -125,17 +126,33 @@ public SolrIndexSplitter(SplitIndexCommand cmd) {
numPieces = cmd.ranges.size();
rangesArr = cmd.ranges.toArray(new DocRouter.Range[0]);
}
+
+ SchemaField maybeField;
if (cmd.routeFieldName == null) {
- field = searcher.getSchema().getUniqueKeyField();
+ // To support routing child documents, use the root field if it exists
(which would be
+ // populated with unique field), otherwise use the unique key field
+ if (searcher.getSchema().isUsableForChildDocs()) {
+ maybeField =
searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME);
+ } else {
+ maybeField = searcher.getSchema().getUniqueKeyField();
+ }
} else {
- field = searcher.getSchema().getField(cmd.routeFieldName);
+ // Custom routing
+ // This may not route child documents the same as parents; users are
expected to manage this
+ // themselves, such as by using the value in each field.
+ // If the field is set to the unique field, users can reset it to null
so the above logic to
Review Comment:
And I'd drop this sentence.
##########
solr/solr-ref-guide/modules/indexing-guide/pages/indexing-nested-documents.adoc:
##########
@@ -304,6 +305,11 @@ Instead, use delete-by-query (most efficient) or
xref:partial-document-updates.a
If you use Solr's delete-by-query APIs, you *MUST* be careful to ensure that
any deletion query is structured to ensure no descendent children remain of any
documents that are being deleted.
*_Doing otherwise will violate integrity assumptions that Solr expects._*
+=== Shard Splits
+When xref:deployment-guide:shard-management.adoc#splitshard[splitting shards],
child documents get transmitted and routed on their own, since they are
documents in their own right. If you route based on your unique field, then the
child documents will be routed correctly based on their `_root_`. However, if
your collection specifies a router field (even if it matches the unique field),
then the child documents will be routed based on that field instead of their
root field. If your collection specifies a router field, you should ensure
every doc in a hierarchy has the same value for that field, otherwise it maybe
broken during a shard split.
Review Comment:
Hmm; I suppose we might add such a tip in a number of places. In Nested
Docs, or in Shard Splits, or `collection-management.adoc` (where router.field
is documented). I suppose here in this "maintaining integrity" in nested docs
is okay but I also think it should be where `router.field` is defined to simply
say that all docs, _even child docs_, need this field, and that all docs in a
nested doc tree must have the same value. Needn't list what bad things happen
when you don't.
I suppose Solr should do that itself -- I filed:
https://issues.apache.org/jira/browse/SOLR-17517
##########
solr/core/src/java/org/apache/solr/update/SolrIndexSplitter.java:
##########
@@ -125,17 +126,33 @@ public SolrIndexSplitter(SplitIndexCommand cmd) {
numPieces = cmd.ranges.size();
rangesArr = cmd.ranges.toArray(new DocRouter.Range[0]);
}
+
+ SchemaField maybeField;
if (cmd.routeFieldName == null) {
- field = searcher.getSchema().getUniqueKeyField();
+ // To support routing child documents, use the root field if it exists
(which would be
+ // populated with unique field), otherwise use the unique key field
+ if (searcher.getSchema().isUsableForChildDocs()) {
+ maybeField =
searcher.getSchema().getFieldOrNull(IndexSchema.ROOT_FIELD_NAME);
+ } else {
+ maybeField = searcher.getSchema().getUniqueKeyField();
+ }
} else {
- field = searcher.getSchema().getField(cmd.routeFieldName);
+ // Custom routing
+ // This may not route child documents the same as parents; users are
expected to manage this
Review Comment:
I would reword this slightly as, "if child docs are used, users must ensure
that the whole nested document tree has a consistent routeFieldName."
##########
solr/solr-ref-guide/modules/indexing-guide/pages/indexing-nested-documents.adoc:
##########
@@ -304,6 +305,11 @@ Instead, use delete-by-query (most efficient) or
xref:partial-document-updates.a
If you use Solr's delete-by-query APIs, you *MUST* be careful to ensure that
any deletion query is structured to ensure no descendent children remain of any
documents that are being deleted.
*_Doing otherwise will violate integrity assumptions that Solr expects._*
+=== Shard Splits
+When xref:deployment-guide:shard-management.adoc#splitshard[splitting shards],
child documents get transmitted and routed on their own, since they are
documents in their own right. If you route based on your unique field, then the
child documents will be routed correctly based on their `_root_`. However, if
your collection specifies a router field (even if it matches the unique field),
then the child documents will be routed based on that field instead of their
root field. If your collection specifies a router field, you should ensure
every doc in a hierarchy has the same value for that field, otherwise it maybe
broken during a shard split.
Review Comment:
This paragraph feels more like it should be a warning/tip than something
that needs its own heading. And although your inside/internal information is
interesting, I think it's better to be short & sweet get to the point of what a
user must do: router.field must be consistent for the nested doc tree.
##########
solr/CHANGES.txt:
##########
@@ -29,7 +29,7 @@ Optimizations
Bug Fixes
---------------------
-(No changes)
+* SOLR-11191: Splitting shards now routes child-docs with their _root_ field
when available so they maintain parent relationship. (Zack Kendall)
Review Comment:
FWIW I view this as an Improvement but it's debatable.
Also by putting this here, it will not be back-ported to 9.x. But wouldn't
we want it in the next 9.x release?
##########
solr/solr-ref-guide/modules/indexing-guide/pages/indexing-nested-documents.adoc:
##########
@@ -304,6 +305,11 @@ Instead, use delete-by-query (most efficient) or
xref:partial-document-updates.a
If you use Solr's delete-by-query APIs, you *MUST* be careful to ensure that
any deletion query is structured to ensure no descendent children remain of any
documents that are being deleted.
*_Doing otherwise will violate integrity assumptions that Solr expects._*
+=== Shard Splits
+When xref:deployment-guide:shard-management.adoc#splitshard[splitting shards],
child documents get transmitted and routed on their own, since they are
documents in their own right. If you route based on your unique field, then the
child documents will be routed correctly based on their `_root_`. However, if
your collection specifies a router field (even if it matches the unique field),
then the child documents will be routed based on that field instead of their
root field. If your collection specifies a router field, you should ensure
every doc in a hierarchy has the same value for that field, otherwise it maybe
broken during a shard split.
Review Comment:
Please put a newline at the end of each sentence in our adoc files. It's
then easier to handle diffs and it's less of a problem when an editor doesn't
wrap.
##########
solr/core/src/test/org/apache/solr/cloud/SplitShardTest.java:
##########
@@ -174,6 +164,50 @@ public void testSplitFuzz() throws Exception {
assertEquals("wrong range in s1_1", expected1, delta1);
}
+ @Test
+ public void testWithChildDocuments() throws Exception {
+ CloudSolrClient solrClient = cluster.getSolrClient();
+ CollectionAdminRequest.createCollection(COLLECTION_NAME, "conf", 1,
1).process(solrClient);
+
+ cluster.waitForActiveCollection(COLLECTION_NAME, 1, 1);
+
+ List<SolrInputDocument> inputDocs = new ArrayList<>();
+ for (int idx = 0; idx < 10; ++idx) {
+ SolrInputDocument parent = new SolrInputDocument();
+ parent.addField("id", idx);
+ parent.addField("type_s", "parent");
+
+ // Child has different ID, so could be routed differently if used for
routing.
+ SolrInputDocument child = new SolrInputDocument();
+ child.addField("id", idx + "_child");
+ child.addField("expected_parent_s", idx);
+
+ parent.addField("myChild", List.of(child));
+
+ inputDocs.add(parent);
+ }
+ solrClient.add(COLLECTION_NAME, inputDocs);
+
+ CollectionAdminRequest.SplitShard splitShard =
+ CollectionAdminRequest.splitShard(COLLECTION_NAME)
+ .setNumSubShards(2)
+ .setShardName("shard1");
+ splitShard.process(solrClient);
+ waitForState("Timed out waiting for shard split", COLLECTION_NAME,
activeClusterShape(2, 2));
Review Comment:
I think this isn't necessary since commands are synchronous; it should be
split by now.
##########
solr/core/src/test-files/solr/configsets/cloud-minimal/conf/schema.xml:
##########
@@ -19,10 +19,12 @@
<fieldType name="string" class="solr.StrField"/>
<fieldType name="int" class="${solr.tests.IntegerFieldType}"
docValues="${solr.tests.numeric.dv}" precisionStep="0" omitNorms="true"
positionIncrementGap="0" uninvertible="true"/>
<fieldType name="long" class="${solr.tests.LongFieldType}"
docValues="${solr.tests.numeric.dv}" precisionStep="0" omitNorms="true"
positionIncrementGap="0" uninvertible="true"/>
+ <fieldType name="_nest_path_" class="solr.NestPathField"/>
<dynamicField name="*" type="string" indexed="true" stored="true"/>
<!-- for versioning -->
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="_root_" type="string" indexed="true" stored="true"
multiValued="false" required="false"/>
+ <field name="_nest_path_" type="_nest_path_"/>`
Review Comment:
Ironically this schema isn't so "minimal" after all once we get into nested
docs. Whatever I guess.
##########
solr/core/src/test/org/apache/solr/cloud/SplitShardTest.java:
##########
@@ -174,6 +164,50 @@ public void testSplitFuzz() throws Exception {
assertEquals("wrong range in s1_1", expected1, delta1);
}
+ @Test
+ public void testWithChildDocuments() throws Exception {
+ CloudSolrClient solrClient = cluster.getSolrClient();
+ CollectionAdminRequest.createCollection(COLLECTION_NAME, "conf", 1,
1).process(solrClient);
+
+ cluster.waitForActiveCollection(COLLECTION_NAME, 1, 1);
+
+ List<SolrInputDocument> inputDocs = new ArrayList<>();
+ for (int idx = 0; idx < 10; ++idx) {
+ SolrInputDocument parent = new SolrInputDocument();
+ parent.addField("id", idx);
+ parent.addField("type_s", "parent");
+
+ // Child has different ID, so could be routed differently if used for
routing.
+ SolrInputDocument child = new SolrInputDocument();
+ child.addField("id", idx + "_child");
Review Comment:
Could you use a global doc counter instead of concatenating? This would
emphasize that the ID format has no bearing at all on the splitting.
##########
solr/core/src/test/org/apache/solr/update/SolrIndexSplitterTest.java:
##########
@@ -485,19 +486,147 @@ private void
doTestSplitByRouteKey(SolrIndexSplitter.SplitMethod splitMethod) th
}
}
+ @Test
+ public void testSplitWithChildDocs() throws Exception {
+ doTestSplitWithChildDocs(SolrIndexSplitter.SplitMethod.REWRITE);
+ }
+
+ @Test
+ public void testSplitWithChildDocsLink() throws Exception {
+ doTestSplitWithChildDocs(SolrIndexSplitter.SplitMethod.LINK);
+ }
+
+ public void doTestSplitWithChildDocs(SolrIndexSplitter.SplitMethod
splitMethod) throws Exception {
+ // Overall test/split pattern copied from doTestSplitByCores
+ File indexDir = createTempDir().toFile();
+
+ CompositeIdRouter r1 = new CompositeIdRouter();
+ String routeKeyBase = "sea-line!";
+
+ List<DocRouter.Range> twoRanges = r1.partitionRange(2,
r1.keyHashRange(routeKeyBase));
+
+ // Insert other doc for contrast
+ String otherDocId = routeKeyBase + "6"; // Will hash into first range
+ assertU(adoc("id", otherDocId));
+
+ // Insert child doc
+ String parentId = routeKeyBase + "1"; // Will hash into second range
+ String childId = routeKeyBase + "5"; // Will hash into first range
+ SolrInputDocument doc =
+ sdoc("id", parentId, "myChild", sdocs(sdoc("id", childId, "child_s",
"child")));
+ assertU(adoc(doc));
+
+ assertU(commit());
+ assertJQ(req("q", "*:*"), "/response/numFound==3");
+
+ // Able to query child.
+ assertJQ(
+ req("q", "id:" + parentId, "fl", "*, [child]"),
+ "/response/numFound==1",
+ "/response/docs/[0]/myChild/[0]/id=='" + childId + "'",
+ "/response/docs/[0]/myChild/[0]/_root_=='"
+ + parentId
+ + "'" // Child has parent root to route with
+ );
+
+ SolrCore core1 = null, core2 = null;
+ try {
+ core1 =
+ h.getCoreContainer()
+ .create(
+ "split1",
+ Map.of("dataDir", indexDir1.getAbsolutePath(), "configSet",
"cloud-minimal"));
+ core2 =
+ h.getCoreContainer()
+ .create(
+ "split2",
+ Map.of("dataDir", indexDir2.getAbsolutePath(), "configSet",
"cloud-minimal"));
+
+ LocalSolrQueryRequest request = null;
+ try {
+ request = lrf.makeRequest("q", "dummy");
+ SolrQueryResponse rsp = new SolrQueryResponse();
+ SplitIndexCommand command =
+ new SplitIndexCommand(
+ request,
+ rsp,
+ null,
+ List.of(core1, core2),
+ twoRanges,
+ new CompositeIdRouter(),
+ null,
+ null,
+ splitMethod);
+ doSplit(command);
+ } finally {
+ if (request != null) request.close();
+ }
+ @SuppressWarnings("resource")
+ final EmbeddedSolrServer server1 = new
EmbeddedSolrServer(h.getCoreContainer(), "split1");
+ @SuppressWarnings("resource")
+ final EmbeddedSolrServer server2 = new
EmbeddedSolrServer(h.getCoreContainer(), "split2");
+ server1.commit(true, true);
+ server2.commit(true, true);
+ assertEquals(
+ "should be 2 docs in index2",
+ 2,
+ server2.query(new SolrQuery("*:*")).getResults().getNumFound());
+ assertEquals(
+ "parent doc should be in index2",
+ 1,
+ server2.query(new SolrQuery("id:" +
parentId)).getResults().getNumFound());
+ assertEquals(
+ "child doc should be in index2",
+ 1,
+ server2.query(new SolrQuery("id:" +
childId)).getResults().getNumFound());
+ assertEquals(
+ "other doc should be in index1",
+ 1,
+ server1.query(new SolrQuery("id:" +
otherDocId)).getResults().getNumFound());
+ } finally {
+ h.getCoreContainer().unload("split2");
+ h.getCoreContainer().unload("split1");
+ }
+ }
+
+ /** Utility method to find Ids that hash into which ranges. Uncomment @Test
to print. */
+ // @Test
+ public void testCompositeHashSandbox() {
Review Comment:
just by naming it with a "test" prefix, I think it'll run.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]