[jira] [Commented] (SOLR-11299) Time partitioned collections (umbrella issue)

Gus Heck (JIRA) Fri, 06 Oct 2017 06:42:40 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16194601#comment-16194601
 ]


Gus Heck commented on SOLR-11299:
---------------------------------

One thought that comes to mind: with deletion of old collections, we could 
more or less think of this as a ring buffer built out of Solr collections...
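
To make the ring buffer analogy concrete, here is a minimal sketch in plain 
Java. Nothing in it is Solr API: the class name, the capacity parameter and 
the collection names are all made up for illustration, and actually creating 
or deleting a collection is left to the caller.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

// Hypothetical illustration only; not part of Solr.
final class CollectionRing {
    private final int capacity;                       // max collections retained (assumed)
    private final Deque<String> collections = new ArrayDeque<>();

    CollectionRing(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
    }

    // Registers the newest collection. If the ring is now over capacity,
    // the oldest name is dropped and returned so the caller can delete it.
    Optional<String> advance(String newest) {
        collections.addLast(newest);
        if (collections.size() > capacity) {
            return Optional.of(collections.removeFirst());
        }
        return Optional.empty();
    }

    // The collection a moving write alias would currently point at.
    String writeTarget() {
        return collections.peekLast();
    }
}
```

With a capacity of 2, advancing to a third collection hands back the first 
one for deletion, which is the "overwrite the oldest slot" behavior of a ring 
buffer.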

The implicit assumption seems to be that writes are "mostly ordered" and that 
severely out-of-order writes might be rejected? I think that's probably a 
critical assumption, since I imagine we'll have an alias that moves from 
collection to collection for writes. Even if CloudSolrClient is able to write 
to the first collection in a multi-collection alias, this still applies, since 
we would need to reject a write that is not appropriate for that partition. And 
if that change is made, does it have the potential to surprise folks who make 
an alias, write to it, and then find all the docs in only one collection? Some 
sort of collection-level routing will be needed if pre-allocation is to be 
useful in catching "early" or "late" writes near partition boundaries...
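
A rough sketch of what such collection-level routing might look like, again 
in plain Java rather than anything that exists in Solr today. It assumes 
fixed-width partitions of whole days, an alias-plus-date naming scheme 
("logs_2017-10-06"), and a retained window that includes any pre-allocated 
future partitions; all of those, and the class and method names, are 
assumptions for the sake of the example. A timestamp outside the window 
yields an empty result, which is where a real implementation would reject 
the write.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Optional;

// Hypothetical illustration only; not part of Solr.
final class TimePartitionRouter {
    // Names partitions by UTC day, so this sketch assumes whole-day widths.
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    private final String aliasName;
    private final Instant windowStart;   // start of the oldest retained partition
    private final Duration width;        // width of each partition
    private final int partitionCount;    // retained plus pre-allocated partitions

    TimePartitionRouter(String aliasName, Instant windowStart,
                        Duration width, int partitionCount) {
        this.aliasName = aliasName;
        this.windowStart = windowStart;
        this.width = width;
        this.partitionCount = partitionCount;
    }

    // Returns the target collection name, or empty if the document is too
    // old or too far in the future for any existing partition.
    Optional<String> route(Instant docTime) {
        if (docTime.isBefore(windowStart)) {
            return Optional.empty();     // severely late write
        }
        long index = Duration.between(windowStart, docTime).toMillis()
                / width.toMillis();
        if (index >= partitionCount) {
            return Optional.empty();     // beyond the pre-allocated partitions
        }
        Instant partitionStart = windowStart.plus(width.multipliedBy(index));
        return Optional.of(aliasName + "_" + DAY.format(partitionStart));
    }
}
```

Because each document is routed on its own timestamp, a slightly "early" or 
"late" write near a boundary lands in the neighboring partition instead of 
whichever collection the write alias happens to point at.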

As for the possible URP/DURP (UpdateRequestProcessor / 
DistributedUpdateProcessor): maybe it's always present by default but is a 
silent no-op unless it sees that a time partitioned collection is being 
accessed, and only then does anything? This would require some highly 
efficient way of checking whether something is a time series collection. Maybe 
a mandatory suffix/prefix on the collection name (".tpc" or "TPC-" or some 
such), so that there's no need to look anything up in ZooKeeper etc. to know 
whether it's a time series...? The downside is the potential for accidentally 
triggering it, so maybe a second, more expensive check (attempt to parse out 
dateness from the name, ask ZooKeeper... whatever) could then revert to a 
no-op if it failed, so that the impact of an inadvertent suffix/prefix is a 
slowdown rather than a failure? The suffix/prefix denoting time series 
collections could be configurable in solr.xml to make it possible to escape 
from naming clashes.
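
Roughly what I have in mind for the two-stage check, as a plain Java sketch. 
The "TPC-" prefix is the placeholder from above, and the "_yyyy-MM-dd" tail 
on the collection name is purely an assumption so that there is some 
"dateness" to parse; none of this is existing Solr behavior. The cheap check 
is a string comparison, and the expensive one only runs when the cheap one 
passes. If the date can't be parsed, the result is empty and the processor 
would simply stay a no-op.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical illustration only; not part of Solr.
final class TimePartitionNameCheck {
    // Placeholder prefix; the idea is that this would be configurable.
    static final String PREFIX = "TPC-";

    // Stage 1: cheap test, no ZooKeeper lookup needed.
    static boolean looksTimePartitioned(String collectionName) {
        return collectionName.startsWith(PREFIX);
    }

    // Stage 2: more expensive test, only reached when stage 1 passes.
    // Assumes names such as "TPC-logs_2017-10-06". An empty result means
    // "treat as an ordinary collection" rather than "fail the request".
    static Optional<LocalDate> partitionStart(String collectionName) {
        if (!looksTimePartitioned(collectionName)) {
            return Optional.empty();
        }
        int sep = collectionName.lastIndexOf('_');
        if (sep < 0) {
            return Optional.empty();     // prefix clash, no date part at all
        }
        try {
            return Optional.of(LocalDate.parse(
                    collectionName.substring(sep + 1),
                    DateTimeFormatter.ISO_LOCAL_DATE));
        } catch (DateTimeParseException e) {
            return Optional.empty();     // prefix clash, tail is not a date
        }
    }
}
```

So a collection that was innocently named "TPC-inventory" pays for one failed 
parse per check and otherwise behaves like any other collection.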

Another thought is that while date/time is the objective here, it would seem 
that any numeric field should work...

> Time partitioned collections (umbrella issue)
> ---------------------------------------------
>
>                 Key: SOLR-11299
>                 URL: https://issues.apache.org/jira/browse/SOLR-11299
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: David Smiley
>            Assignee: David Smiley
>
> Solr ought to have the ability to manage large-scale time-series data (think 
> logs or sensor data / IOT) itself without a lot of manual/external work.  The 
> most naive and painless approach today is to create a collection with a high 
> numShards with hash routing but this isn't as good as partitioning the 
> underlying indexes by time for these reasons:
> * Easy to scale up/down horizontally as data/requirements change.  (No need 
> to over-provision, use shard splitting, or re-index with different config)
> * Faster queries: 
>     ** can search fewer shards, reducing overall load
>     ** realtime search is more tractable (since most shards are stable -- 
> good caches)
>     ** "recent" shards (that might be queried more) can be allocated to 
> faster hardware
>     ** aged out data is simply removed, not marked as deleted.  Deleted docs 
> still have search overhead.
> * An outage of a shard results in a degraded but sometimes still useful 
> system (compare to a random subset missing)
> Ideally you could set this up once and then simply work with a collection 
> (potentially actually an alias) in a normal way (search or update), letting 
> Solr handle the addition of new partitions, removing of old ones, and 
> appropriate routing of requests depending on their nature.
> This issue is an umbrella issue for the particular tasks that will make it 
> all happen -- either subtasks or issue linking.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
