[ https://issues.apache.org/jira/browse/SOLR-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16194601#comment-16194601 ]
Gus Heck commented on SOLR-11299:
---------------------------------

One thought that comes to mind is that, with deletion of old collections, we could more or less think of this as a Solr collection-based ring buffer...

The implicit assumption seems to be that writes are "mostly ordered" and that severely out-of-order writes might be rejected? I think that's probably a critical assumption, since I imagine we'll have an alias that moves from collection to collection for writes. Even if CloudSolrClient is able to write to the first collection in a multi-collection alias, this still applies, since we would need to reject a write that is not appropriate for that partition. And if that change is made, does it have the potential to surprise folks who create an alias, write to it, and find all the docs in only one collection? Some sort of collection-level routing will be needed if pre-allocation is to be useful in catching "early" or "late" writes near partition boundaries...

Thoughts on the possible URP/DURP: maybe it's always present by default, but is a silent no-op unless it sees that a time partitioned collection is being accessed, and only then does it do anything? This would require some highly efficient way of checking whether something is a time series collection. Maybe a mandatory suffix/prefix on the collection name (".tpc" or "TPC-" or some such), so that there's no need to look anything up in ZooKeeper etc. to know if it's a time series...? The downside is the potential for accidentally triggering it, so maybe a second, more expensive check (attempt to parse out dateness from the name, ask ZooKeeper... whatever) could then revert to a no-op if it failed, so that slowdown rather than failure is the impact of an inadvertent suffix/prefix? The suffix/prefix denoting time series collections could be configurable in solr.xml to make it possible to escape from naming clashes.

Another thought is that while date/time is the objective here, it would seem that any numeric field should work...
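To make the two-stage check concrete, here is a rough Python sketch (not Solr code, and not a proposal for the actual URP API). The naming scheme `<base>_YYYY-MM-DD.tpc`, the fixed one-day partition size, and the function names `partition_start` and `route` are all assumptions made up for illustration; only the ".tpc" suffix idea comes from the comment above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical marker; the comment floats ".tpc" or "TPC-" as candidates.
TPC_SUFFIX = ".tpc"


def partition_start(collection):
    """Return the partition start encoded in the name, or None.

    Assumes a made-up naming scheme '<base>_YYYY-MM-DD.tpc'.
    """
    # Stage 1: cheap string check, no ZooKeeper lookup needed.
    if not collection.endswith(TPC_SUFFIX):
        return None
    # Stage 2: more expensive check, try to parse a date out of the name.
    stem = collection[: -len(TPC_SUFFIX)]
    _, sep, date_part = stem.rpartition("_")
    if not sep:
        return None
    try:
        parsed = datetime.strptime(date_part, "%Y-%m-%d")
    except ValueError:
        # Inadvertent suffix: revert to no-op instead of failing.
        return None
    return parsed.replace(tzinfo=timezone.utc)


def route(collection, doc_time, partition_size=timedelta(days=1)):
    """Decide what to do with a write; doc_time must be timezone-aware."""
    start = partition_start(collection)
    if start is None:
        return "pass-through"  # not a time partitioned collection
    if start <= doc_time < start + partition_size:
        return "accept"
    return "reject"  # "early" or "late" write for this partition
```

With these assumptions, a write to `logs_2017-10-06.tpc` stamped at noon UTC on 2017-10-06 is accepted, one stamped the next day is rejected, and a collection that merely happens to end in ".tpc" costs one failed parse and is otherwise left alone.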
> Time partitioned collections (umbrella issue)
> ---------------------------------------------
>
> Key: SOLR-11299
> URL: https://issues.apache.org/jira/browse/SOLR-11299
> Project: Solr
> Issue Type: New Feature
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrCloud
> Reporter: David Smiley
> Assignee: David Smiley
>
> Solr ought to have the ability to manage large-scale time-series data (think
> logs or sensor data / IOT) itself without a lot of manual/external work. The
> most naive and painless approach today is to create a collection with a high
> numShards with hash routing but this isn't as good as partitioning the
> underlying indexes by time for these reasons:
> * Easy to scale up/down horizontally as data/requirements change. (No need
> to over-provision, use shard splitting, or re-index with different config)
> * Faster queries:
> ** can search fewer shards, reducing overall load
> ** realtime search is more tractable (since most shards are stable --
> good caches)
> ** "recent" shards (that might be queried more) can be allocated to
> faster hardware
> ** aged out data is simply removed, not marked as deleted. Deleted docs
> still have search overhead.
> * Outages of a shard result in a degraded but sometimes a useful system
> nonetheless (compare to random subset missing)
> Ideally you could set this up once and then simply work with a collection
> (potentially actually an alias) in a normal way (search or update), letting
> Solr handle the addition of new partitions, removing of old ones, and
> appropriate routing of requests depending on their nature.
> This issue is an umbrella issue for the particular tasks that will make it
> all happen -- either subtasks or issue linking.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)