[ https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901349#comment-16901349 ]
ASF subversion and git services commented on SOLR-13399:
--------------------------------------------------------

Commit d8f99a9986835507d19b70edf0ff280416104788 in lucene-solr's branch refs/heads/branch_8x from Yonik Seeley
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d8f99a9 ]

SOLR-13399: ability to use id field for compositeId histogram

> compositeId support for shard splitting
> ---------------------------------------
>
>          Key: SOLR-13399
>          URL: https://issues.apache.org/jira/browse/SOLR-13399
>      Project: Solr
>   Issue Type: New Feature
>     Reporter: Yonik Seeley
>     Assignee: Yonik Seeley
>     Priority: Major
>      Fix For: 8.3
>
>  Attachments: SOLR-13399.patch, SOLR-13399.patch,
>               SOLR-13399_testfix.patch, SOLR-13399_useId.patch
>
>
> Shard splitting does not currently have a way to automatically take into
> account the actual distribution (number of documents) in each hash bucket
> created by using compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD*
> command that would look at the number of docs sharing each compositeId prefix
> and use that to create roughly equal sized buckets by document count rather
> than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash
> buckets unless necessary (since that leads to larger query fanout). Perhaps
> this warrants a parameter that would control how much of a size mismatch is
> tolerable before resorting to splitting within a bucket.
> *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index
> the prefix in a different field. Iterating over the terms for this field
> would quickly give us the number of docs in each (i.e., Lucene keeps track of
> the doc count for each term already). Perhaps the implementation could be a
> flag on the *id* field... something like *indexPrefixes* and poly-fields that
> would cause the indexing to be automatically done and alleviate having to
> pass in an additional field during indexing and during the call to
> *SPLITSHARD*. This whole part is an optimization though and could be split
> off into its own issue if desired.
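
For illustration only, here is a minimal Python sketch of the idea in the issue description: count documents per compositeId prefix, then pick a split boundary between whole prefix buckets so that the two sides are as close as possible in document count. This is not the committed Solr implementation. The function names `count_docs_by_prefix` and `choose_split` are hypothetical, the buckets are assumed to arrive already ordered by hash range, and the sketch never splits inside a bucket, so nothing like *allowedSizeDifference* is modeled.

```python
from collections import Counter
from itertools import accumulate


def count_docs_by_prefix(ids, separator="!"):
    """Count documents per compositeId prefix (the text before the separator).

    Ids without a separator carry no prefix and are ignored in this sketch.
    """
    return Counter(
        doc_id.split(separator, 1)[0] for doc_id in ids if separator in doc_id
    )


def choose_split(buckets):
    """Pick the boundary between prefix buckets that best balances doc counts.

    buckets: list of (prefix, doc_count), assumed to be ordered by hash range.
    Returns (index, left_docs, right_docs); buckets[:index] form the left
    sub-shard and buckets[index:] the right one.
    """
    if len(buckets) < 2:
        raise ValueError("need at least two buckets to split between them")
    counts = [count for _, count in buckets]
    total = sum(counts)
    running = list(accumulate(counts))  # running[i] = docs in buckets[:i + 1]
    # Try every boundary that leaves both sides non-empty and keep the one
    # with the smallest difference between left and right document counts.
    best = min(
        range(1, len(buckets)),
        key=lambda i: abs(total - 2 * running[i - 1]),
    )
    left = running[best - 1]
    return best, left, total - left


if __name__ == "__main__":
    ids = ["a!1", "a!2", "a!3", "b!1", "c!1", "c!2"]
    counts = count_docs_by_prefix(ids)
    # Hash order of the prefixes is assumed here purely for the example.
    ordered = [(prefix, counts[prefix]) for prefix in ("a", "b", "c")]
    print(choose_split(ordered))
```

In a real index the per-prefix counts would come from term statistics rather than from scanning ids, which is the optimization the description refers to.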