[
https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502577#comment-13502577
]
Yonik Seeley commented on SOLR-2592:
------------------------------------
Just want to recap thoughts on all the different types/levels of custom
sharding/hashing:
1) custom sharding with complete user control... user is responsible for
adding/removing shards, and specifying what shard an update is targeted toward
(SOLR-4059)
- example: http://.../update?shard=NY_NJ_area
- Solr still keeps track of the leader and still forwards updates to the
leader, which forwards to replicas
- replicas can still be added on the fly
- Users could also provide a pre-built shard and we could still replicate it
out
- search side of this is already implemented: ?shards=NY_NJ_area,SF_area
- OPTIONAL: we could still provide a shard splitting service for custom shards
- OPTIONAL: somehow specify a syntax for including the shard with the
document to support bulk loading of multiple docs per request (perhaps a magic
field _shard_?)
- if we go with _shard_, perhaps we should change the request param name to
match (like we do with _version_?). Example:
http://.../update?_shard_=NY_NJ_area
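As a client-side sketch of option #1, the following helper builds an update URL carrying the proposed `shard` parameter. Note the parameter name is the one proposed above, not an existing Solr API, and the helper itself is purely illustrative:

```python
from urllib.parse import urlencode

def update_url(base, shard, params=None):
    """Build an update URL targeting an explicitly named shard.

    The 'shard' parameter name follows the proposal in this comment;
    it is not an existing Solr API at the time of writing.
    """
    query = {"shard": shard}
    if params:
        query.update(params)
    return base + "/update?" + urlencode(query)

# e.g. update_url("http://localhost:8983/solr", "NY_NJ_area")
#   -> "http://localhost:8983/solr/update?shard=NY_NJ_area"
```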
2) custom sharding based on plugin: a superset of #1
- a plugin looks at the document being indexed and somehow determines what
shard the document belongs to (including possibly consulting an external system)
- an implementation that takes the shard name from a given field
- IDEA: plugin should have access to request parameters to make decision
based on that also (i.e. this may be how _shard_ is implemented above?)
- atomic updates, realtime-get, deletes, etc. would need to specify enough
info to determine shard (and not change info that determines shard)
- trickier for parameters that already specify a comma separated list of
ids... how do we specify the additional info to determine shard?
- OPTIONAL: allow some mechanism for the plugin to indicate that the location
of a document has changed (i.e. a delete should be issued to the old shard?)
- is there a standard mechanism to provide enough information to determine
shard (for example on a delete)? It would seem this is dependent on the plugin
specifics and thus all clients must know the details.
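A rough sketch of what the option #2 plugin contract could look like, including the field-based implementation and the request-parameter fallback that might back `_shard_`. All class and method names here are hypothetical, not an actual Solr interface:

```python
# Hypothetical shard-routing plugin interface (option #2). Solr would
# call route() for each incoming document; names are illustrative only.

class ShardRouterPlugin:
    def route(self, doc, request_params):
        """Return the target shard name for this document."""
        raise NotImplementedError

class FieldShardRouter(ShardRouterPlugin):
    """Takes the shard name from a given field, falling back to a
    request parameter of the same name (one way _shard_ might work)."""

    def __init__(self, field="_shard_"):
        self.field = field

    def route(self, doc, request_params):
        if self.field in doc:
            return doc[self.field]
        if self.field in request_params:
            return request_params[self.field]
        # e.g. a delete-by-id with no routing info would land here
        raise ValueError("cannot determine shard for document %r" % doc.get("id"))
```

The fallback to request parameters is the part that would let atomic updates, realtime-get, and deletes supply routing info without carrying the full document.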
3) Time-based sharding. A user could do time based sharding based on #1 or #2,
or we could provide more specific support (perhaps this is just a specific
implementation of #2).
- automatically route to the correct shard by time field
- on search side, allow a time range to be specified and all shards covering
part of the range will be selected
- OPTIONAL: automatically add a filter to restrict to exactly the given time
range (as opposed to just the shards)
- OPTIONAL: allow automatically reducing the replication level of older
shards (and down to 0 would mean complete removal - they have been aged out)
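The search side of option #3 can be sketched as an interval-overlap test: given each shard's time range, select every shard covering part of the queried range. The shard-range registry and naming scheme below are assumptions for illustration:

```python
# Sketch of search-side shard selection for time-based sharding (#3).
# shard_ranges maps shard name -> (start, end) as half-open intervals.

def shards_for_range(shard_ranges, start, end):
    """Return (sorted) the shards covering any part of [start, end)."""
    return sorted(
        name for name, (s, e) in shard_ranges.items()
        if s < end and e > start  # half-open interval overlap test
    )

ranges = {
    "2012_10": (20121001, 20121101),
    "2012_11": (20121101, 20121201),
    "2012_12": (20121201, 20130101),
}
# shards_for_range(ranges, 20121115, 20121215) -> ["2012_11", "2012_12"]
```

The optional extra filter mentioned above would then restrict results within the selected shards to exactly [start, end), since the first and last shards may cover more than the requested range.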
4) Custom hashing based on plugin
- The plugin determines the hash code, the hash code determines the shard
(based on the hash range stored in the shard descriptor today)
- This is very much like option #2, but the output is a hash code instead of
a shard name
5) Hash based on field (a specific implementation of #4?)
- collection defines field to hash on
- OPTIONAL: could define multiple fields in a comma-separated list. The hash
value would be constructed by concatenation of the values.
- how to specify the restriction/range on the query side?
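A minimal sketch of option #5, using the concatenation-of-values scheme from the bullet above. The choice of md5 is arbitrary here (any stable hash would do), and the equal-slice shard mapping is an assumption based on the hash-range descriptors mentioned in #4:

```python
import hashlib

# Option #5 sketch: hash on one or more configured fields.

def field_hash(doc, fields):
    """Concatenate the configured field values and hash to 32 bits."""
    key = "".join(str(doc[f]) for f in fields)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) & 0xFFFFFFFF

def shard_for_hash(h, num_shards):
    # each shard owns an equal slice of the 32-bit hash range,
    # mirroring the hash range stored in the shard descriptor
    return "shard%d" % (h * num_shards // 2**32)
```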
6) Hash based on first part of ID (composite id)
- Like #5, but the hash key value is contained in the ID.
- very transparent and unobtrusive - can be enabled by default since there
should be no additional requirements or restrictions
For both custom hashing options #5 and #6, instead of deriving the complete
hash value from the hash key, we could use it for only the top bits of the
hash value, with the remainder coming from the ID. The advantage of this is
that it allows for splitting of over-allocated groups. If the hash code were
derived only from the hash key, then all of a certain customer's records
would share the exact same hash value and there would be no way to split
them later.
I think eventually, we want *all* of these options.
It still seems natural to go with #6 first since it's the only one that can
actually be enabled by default w/o any configuration.
Other things to think about:
Where should the hashing/sharding specification/configuration be kept?
a) as a collection property (like configName)?
b) as part of the standard "config" referenced by configName (either part of
the schema or a separate file more amenable to live updating)
Handling grouping of documents by more than one dimension:
Let's say you have multiple customers (and you want to group each customer's
documents together), but you also want to do time-based sharding.
> Custom Hashing
> --------------
>
> Key: SOLR-2592
> URL: https://issues.apache.org/jira/browse/SOLR-2592
> Project: Solr
> Issue Type: New Feature
> Components: SolrCloud
> Affects Versions: 4.0-ALPHA
> Reporter: Noble Paul
> Attachments: dbq_fix.patch, pluggable_sharding.patch,
> pluggable_sharding_V2.patch, SOLR-2592.patch, SOLR-2592_r1373086.patch,
> SOLR-2592_r1384367.patch, SOLR-2592_rev_2.patch,
> SOLR_2592_solr_4_0_0_BETA_ShardPartitioner.patch
>
>
> If the data in a cloud can be partitioned on some criteria (say range, hash,
> attribute value etc) It will be easy to narrow down the search to a smaller
> subset of shards and in effect can achieve more efficient search.