[ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502577#comment-13502577 ]
Yonik Seeley commented on SOLR-2592:
------------------------------------

Just want to recap thoughts on all the different types/levels of custom sharding/hashing:

1) Custom sharding with complete user control... the user is responsible for adding/removing shards and for specifying which shard an update is targeted at (SOLR-4059)
 - example: http://.../update?shard=NY_NJ_area
 - Solr still keeps track of the leader and still forwards updates to the leader, which forwards them to replicas
 - replicas can still be added on the fly
 - users could also provide a pre-built shard and we could still replicate it out
 - the search side of this is already implemented: ?shards=NY_NJ_area,SF_area
 - OPTIONAL: we could still provide a shard-splitting service for custom shards
 - OPTIONAL: somehow specify a syntax for including the shard with the document, to support bulk loading of multiple docs per request (perhaps a magic field _shard_?)
 - if we go with _shard_, perhaps we should change the request param name to match (like we do with _version_). Example: http://.../update?_shard_=NY_NJ_area

2) Custom sharding based on a plugin: a superset of #1
 - a plugin looks at the document being indexed and somehow determines which shard the document belongs to (possibly by consulting an external system)
 - one implementation would take the shard name from a given field
 - IDEA: the plugin should also have access to the request parameters, so it can base its decision on those too (i.e. this may be how _shard_ is implemented above?)
 - atomic updates, realtime-get, deletes, etc. would need to specify enough info to determine the shard (and must not change the info that determines the shard)
 - trickier for parameters that already specify a comma-separated list of ids... how do we specify the additional info needed to determine the shard?
 - OPTIONAL: allow some mechanism for the plugin to indicate that the location of a document has changed (i.e. a delete should be issued to the old shard?)
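As a rough illustration of the plugin idea in #2, here is a minimal sketch. The interface name, method signature, and the _shard_ fallback are hypothetical choices for this example, not an actual Solr API:

```java
import java.util.Map;

// Hypothetical plugin contract: given a document's fields and the update
// request's parameters, return the name of the target shard.
interface ShardRouterPlugin {
    String resolveShard(Map<String, Object> docFields, Map<String, String> requestParams);
}

// The "take the shard name from a given field" implementation mentioned
// above, falling back to a request parameter (here the proposed _shard_).
class FieldShardRouter implements ShardRouterPlugin {
    private final String fieldName;

    FieldShardRouter(String fieldName) {
        this.fieldName = fieldName;
    }

    @Override
    public String resolveShard(Map<String, Object> docFields, Map<String, String> requestParams) {
        Object value = docFields.get(fieldName);
        if (value != null) {
            return value.toString();          // shard name taken from the document field
        }
        return requestParams.get("_shard_");  // otherwise fall back to the request param
    }
}
```

Giving the plugin both the document and the request parameters, as sketched here, is what would let a single implementation cover the ?_shard_=NY_NJ_area style of #1 as well.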
 - is there a standard mechanism to provide enough information to determine the shard (for example, on a delete)? It would seem this depends on the plugin specifics, so all clients must know the details.

3) Time-based sharding. A user could do time-based sharding via #1 or #2, or we could provide more specific support (perhaps this is just a specific implementation of #2).
 - automatically route to the correct shard by time field
 - on the search side, allow a time range to be specified; all shards covering part of the range will be selected
 - OPTIONAL: automatically add a filter to restrict results to exactly the given time range (as opposed to just the matching shards)
 - OPTIONAL: allow automatically reducing the replication level of older shards (reducing it to 0 would mean complete removal - they have been aged out)

4) Custom hashing based on a plugin
 - the plugin determines the hash code, and the hash code determines the shard (based on the hash range stored in the shard descriptor today)
 - very much like option #2, but the output is a hash instead of a shard

5) Hashing based on a field (a specific implementation of #4?)
 - the collection defines the field to hash on
 - OPTIONAL: could define multiple fields in a comma-separated list; the hash value would be constructed by concatenating the values
 - how do we specify the restriction/range on the query side?

6) Hashing based on the first part of the ID (a composite id)
 - like #5, but the hash key value is contained in the ID
 - very transparent and unobtrusive
 - can be enabled by default, since there should be no additional requirements or restrictions

For both custom hashing options #5 and #6, instead of deriving the complete hash value from the hash key, we could use it for only the top bits of the hash value, with the remainder coming from the ID. The advantage of this is that it allows for splitting of over-allocated groups.
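That top-bits idea can be sketched as follows. The '!' separator, the 16/16 bit split, and the use of String.hashCode are assumptions made for illustration only; the issue does not prescribe any of them:

```java
// Illustrative sketch of a composite-id hash: the top 16 bits of the hash
// come from the hash key (the part of the id before '!'), and the bottom
// 16 bits come from the rest of the id.
class CompositeIdHash {
    static int hash(String id) {
        int sep = id.indexOf('!');
        if (sep < 0) {
            return id.hashCode();                      // plain id: hash the whole thing
        }
        String hashKey = id.substring(0, sep);         // e.g. a customer name
        String rest = id.substring(sep + 1);           // the per-document part
        int upper = hashKey.hashCode() & 0xFFFF0000;   // top bits route the whole group
        int lower = rest.hashCode() & 0x0000FFFF;      // bottom bits spread docs within it
        return upper | lower;
    }
}
```

Because every id sharing a hash key agrees on the top bits, the group lands in one contiguous hash range; because the bottom bits still vary per document, that range can later be split if the group becomes over-allocated.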
If the hash code were derived only from the hash key, then all of a given customer's records would share the exact same hash value and there would be no way to split them later.

I think eventually we want *all* of these options. It still seems natural to go with #6 first, since it's the only one that can actually be enabled by default without any configuration.

Other things to think about:

Where should the hashing/sharding specification/configuration be kept?
a) as a collection property (like configName)?
b) as part of the standard "config" referenced by configName (either part of the schema, or a separate file more amenable to live updating)?

Handling grouping of documents by more than one dimension: let's say you have multiple customers (and you want to group each customer's documents together), but you also want to do time-based sharding.

> Custom Hashing
> --------------
>
>                 Key: SOLR-2592
>                 URL: https://issues.apache.org/jira/browse/SOLR-2592
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>    Affects Versions: 4.0-ALPHA
>            Reporter: Noble Paul
>         Attachments: dbq_fix.patch, pluggable_sharding.patch, pluggable_sharding_V2.patch, SOLR-2592.patch, SOLR-2592_r1373086.patch, SOLR-2592_r1384367.patch, SOLR-2592_rev_2.patch, SOLR_2592_solr_4_0_0_BETA_ShardPartitioner.patch
>
> If the data in a cloud can be partitioned on some criteria (say range, hash, attribute value etc.), it will be easy to narrow the search down to a smaller subset of shards, in effect achieving more efficient search.