[ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502577#comment-13502577 ]

Yonik Seeley commented on SOLR-2592:
------------------------------------

Just want to recap thoughts on all the different types/levels of custom 
sharding/hashing:

1) custom sharding with complete user control... the user is responsible for 
adding/removing shards and for specifying which shard an update is targeted at 
(SOLR-4059)
  - example: http://.../update?shard=NY_NJ_area
  - Solr still keeps track of the leader and still forwards updates to the 
leader, which then forwards to the replicas
  - replicas can still be added on the fly
  - Users could also provide a pre-built shard and we could still replicate it 
out 
  - search side of this is already implemented:  ?shards=NY_NJ_area,SF_area
  - OPTIONAL: we could still provide a shard splitting service for custom shards
  - OPTIONAL: somehow specify a syntax for including the shard with the 
document, to support bulk loading of multiple docs per request (perhaps a magic 
field _shard_?)
    - if we go with _shard_, perhaps we should change the request param name to 
match (like we do with _version_?). Example: 
http://.../update?_shard_=NY_NJ_area
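
A minimal client-side sketch of option #1 using SolrJ (Solr 4.x API).  Note the 
"shard" request parameter here is the proposal from SOLR-4059, not an existing 
parameter, and the core/field names are made up for illustration:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class ExplicitShardUpdate {
      public static void main(String[] args) throws Exception {
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc1");
        doc.addField("region", "NY_NJ_area");

        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        // target the named shard explicitly; Solr would still forward the
        // update to that shard's leader, which replicates to its replicas
        req.setParam("shard", "NY_NJ_area");
        req.process(server);
        server.commit();
        server.shutdown();
      }
    }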

2) custom sharding based on plugin: a superset of #1
  - a plugin looks at the document being indexed and somehow determines which 
shard the document belongs to (possibly by consulting an external system)
  - one implementation would take the shard name from a given field
  - IDEA: the plugin should also have access to the request parameters so it 
can base its decision on those too (i.e. this may be how _shard_ is implemented 
above?)
  - atomic updates, realtime-get, deletes, etc. would need to specify enough 
info to determine the shard (and must not change the info that determines the 
shard)
    - trickier for parameters that already specify a comma-separated list of 
ids... how do we specify the additional info needed to determine the shard?
  - OPTIONAL: allow some mechanism for the plugin to indicate that the location 
of a document has changed (i.e. a delete should be issued to the old shard?)
  - is there a standard mechanism to provide enough information to determine 
the shard (for example on a delete)?  It would seem this depends on the plugin 
specifics, and thus all clients must know the details.
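
A rough sketch of what a #2-style plugin contract could look like.  No such 
interface exists in Solr today; the names and signature are purely illustrative:

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.common.params.SolrParams;

    public interface ShardResolver {
      /** Return the target shard name for this document, possibly consulting
       *  request parameters or an external system. */
      String resolveShard(SolrInputDocument doc, SolrParams requestParams);
    }

    /** Example implementation: take the shard name from a given field, falling
     *  back to a _shard_ request parameter. */
    class FieldShardResolver implements ShardResolver {
      private final String fieldName;

      FieldShardResolver(String fieldName) {
        this.fieldName = fieldName;
      }

      public String resolveShard(SolrInputDocument doc, SolrParams requestParams) {
        Object value = doc.getFieldValue(fieldName);
        if (value != null) {
          return value.toString();
        }
        // fall back to the request parameter (one way _shard_ could be wired in)
        return requestParams == null ? null : requestParams.get("_shard_");
      }
    }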

3) Time-based sharding.  A user could do time-based sharding via #1 or #2, or 
we could provide more specific support (perhaps this is just a specific 
implementation of #2).
   - automatically route to the correct shard based on a time field
   - on the search side, allow a time range to be specified; all shards 
covering part of the range will be selected
   - OPTIONAL: automatically add a filter to restrict results to exactly the 
given time range (as opposed to just selecting the shards)
   - OPTIONAL: allow automatically reducing the replication level of older 
shards (down to 0, which would mean complete removal - they have been aged out)
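
For illustration, a tiny sketch of the routing piece of time-based sharding 
with monthly shards.  The shard naming scheme and the idea of keying off a 
single date value are assumptions, not an existing Solr API:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    class TimeShardRouter {
      private final SimpleDateFormat fmt = new SimpleDateFormat("yyyy_MM");

      TimeShardRouter() {
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
      }

      /** e.g. a document dated 2012-11-22 routes to "shard_2012_11" */
      String shardFor(Date timestamp) {
        return "shard_" + fmt.format(timestamp);
      }
    }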

4) Custom hashing based on a plugin
  - The plugin determines the hash code, and the hash code determines the shard 
(based on the hash range stored in the shard descriptor today)
  - This is very much like option #2, but the output is a hash instead of a 
shard
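
A sketch of the lookup half of #4: once the plugin produces a hash code, the 
owning shard is found via the hash range in each shard descriptor.  The Range 
class below is just a stand-in for whatever is stored in cluster state:

    import java.util.Map;

    class HashRangeRouter {
      /** inclusive [min, max] hash range owned by a shard */
      static class Range {
        final int min, max;
        Range(int min, int max) { this.min = min; this.max = max; }
        boolean includes(int hash) { return hash >= min && hash <= max; }
      }

      /** pick the shard whose range covers the plugin-supplied hash code */
      static String shardForHash(int hash, Map<String, Range> shardRanges) {
        for (Map.Entry<String, Range> e : shardRanges.entrySet()) {
          if (e.getValue().includes(hash)) {
            return e.getKey();
          }
        }
        return null; // no shard covers this hash (ranges are mis-configured)
      }
    }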

5) Hash based on a field (a specific implementation of #4?)
  - the collection defines the field to hash on
  - OPTIONAL: could define multiple fields in a comma-separated list; the hash 
value would be constructed by concatenating the values.
  - how do we specify the restriction/range on the query side?
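
A sketch of how the multi-field variant might build the hash; the field names 
are made up and String.hashCode() stands in for whatever hash function we 
actually settle on:

    import org.apache.solr.common.SolrInputDocument;

    class FieldHashSource {
      private final String[] hashFields;   // e.g. {"customer_id", "region"}

      FieldHashSource(String... hashFields) {
        this.hashFields = hashFields;
      }

      /** concatenate the configured field values and hash the result */
      int hashFor(SolrInputDocument doc) {
        StringBuilder sb = new StringBuilder();
        for (String f : hashFields) {
          Object v = doc.getFieldValue(f);
          sb.append(v == null ? "" : v.toString()).append('\u0000');
        }
        return sb.toString().hashCode();
      }
    }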

6) Hash based on the first part of the ID (composite id)
  - Like #5, but the hash key value is contained in the ID.
  - very transparent and unobtrusive - can be enabled by default since there 
should be no additional requirements or restrictions


For both custom hashing options #5 and #6, instead of deriving the complete 
hash value from the hash key, we could use it for only the top bits of the hash 
value, with the remainder coming from the ID.  The advantage of this is that it 
allows for splitting of over-allocated groups.  If the hash code were derived 
only from the hash key, then all of a certain customer's records would share 
the exact same hash value and there would be no way to split them later.
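
A sketch of that bit-composition idea: the hash key contributes only the top 16 
bits, and the rest of the ID contributes the low 16 bits.  The '!' separator and 
the 16/16 split are arbitrary choices here, and String.hashCode() again stands 
in for the real hash function:

    class CompositeHash {
      static int hash(String id) {
        int sep = id.indexOf('!');
        if (sep < 0) {
          return id.hashCode();              // no hash key: plain hash of the ID
        }
        String key  = id.substring(0, sep);  // e.g. "customerA" in "customerA!doc7"
        String rest = id.substring(sep + 1); // e.g. "doc7"
        int upper = key.hashCode()  & 0xFFFF0000; // top bits from the hash key
        int lower = rest.hashCode() & 0x0000FFFF; // low bits from the rest of the ID
        // docs sharing a hash key cluster together, but their hashes still
        // differ in the low bits, so the range can be split later
        return upper | lower;
      }
    }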

I think eventually, we want *all* of these options.
It still seems natural to go with #6 first since it's the only one that can 
actually be enabled by default w/o any configuration.

Other things to think about:
Where should the hashing/sharding specification/configuration be kept?
 a) as a collection property (like configName)?
 b) as part of the standard "config" referenced by configName (either part of 
the schema or a separate file more amenable to live updating)
 
Handling grouping of documents by more than one dimension:
Let's say you have multiple customers (and you want to group each customer's 
documents together), but you also want to do time-based sharding.

                
> Custom Hashing
> --------------
>
>                 Key: SOLR-2592
>                 URL: https://issues.apache.org/jira/browse/SOLR-2592
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>    Affects Versions: 4.0-ALPHA
>            Reporter: Noble Paul
>         Attachments: dbq_fix.patch, pluggable_sharding.patch, 
> pluggable_sharding_V2.patch, SOLR-2592.patch, SOLR-2592_r1373086.patch, 
> SOLR-2592_r1384367.patch, SOLR-2592_rev_2.patch, 
> SOLR_2592_solr_4_0_0_BETA_ShardPartitioner.patch
>
>
> If the data in a cloud can be partitioned on some criteria (say range, hash, 
> attribute value etc) It will be easy to narrow down the search to a smaller 
> subset of shards and in effect can achieve more efficient search.  
