Re: Keeping a rolling window of indexes around solr
I suspect you're worrying about something you don't need to. At 1 insert every 30 seconds, and assuming 30,000,000 records will fit on a machine (I've seen this), you're talking about 900,000,000 seconds' worth of data on a single box, or roughly 10,000 days' worth. Test, of course; YMMV. Or I'm misunderstanding what "1 log insert" means; I guess it could be a full log file.

But do the simple thing first: just let Solr do what it does by default, and periodically do a delete-by-query on the documents you want to roll off the end, especially since you say that queries happen only every few days. The tricks for utilizing hot shards are probably not very useful for you with that low a query rate.

Test, of course.

Best
Erick

On Tue, May 28, 2013 at 8:42 PM, Saikat Kanjilal sxk1...@hotmail.com wrote:
: Volume of data: 1 log insert every 30 seconds, queries done sporadically asynchronously every so often at a much lower frequency every few days
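The capacity estimate in this reply is simple arithmetic, and a short sketch makes it easy to redo with other numbers. The figures are the ones quoted in the thread (30,000,000 documents per machine, one insert every 30 seconds); the function name is made up for illustration, and the per-machine document limit is an assumption to verify by testing.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_full(max_docs: int, seconds_per_insert: float) -> float:
    """Days of data one machine can hold at a steady insert rate."""
    return max_docs * seconds_per_insert / SECONDS_PER_DAY


if __name__ == "__main__":
    # 30,000,000 docs * 30 s = 900,000,000 s of data on one box.
    days = days_until_full(30_000_000, 30)
    print(f"{days:,.0f} days (about {days / 365:.0f} years)")
    # prints "10,417 days (about 29 years)"
```

If one "log insert" is really a whole log file rather than a single record, the same function applies once `max_docs` is replaced by the number of such files a machine can hold.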
RE: Keeping a rolling window of indexes around solr
At first glance, unless I missed something, Hourglass will definitely not work for our use case, which involves only real-time inserts of new log data and no appends at all. However, I would like to examine the guts of Hourglass to see if we can customize it for our use case.

From: arafa...@gmail.com
Date: Mon, 27 May 2013 16:17:12 -0400
Subject: Re: Keeping a rolling window of indexes around solr
To: solr-user@lucene.apache.org
: But how is Hourglass going to help Solr? Or is it a portable implementation?
Re: Keeping a rolling window of indexes around solr
: This is kind of the approach used by Elasticsearch. If I'm not using
: SolrCloud, will I be able to use shard aliasing? Also, with this approach,
: how would replication work? Is it even needed?

You haven't said much about the volume of data you expect to deal with, nor have you really explained what types of queries you intend to do -- i.e., you said you were interested in a rolling window of indexes around n days of data, but you never clarified why you think a rolling window of indexes would be useful to you or how exactly you would use it.

The primary advantage of sharding by date comes when you know that a large percentage of your queries are only going to be within a small range of time, and therefore you can optimize those requests to hit only the shards necessary to satisfy that small window of time.

If the majority of requests are going to be across your entire n days of data, then date-based sharding doesn't really help you -- you can just use arbitrary (randomized) sharding, with periodic deleteByQuery commands to purge anything older than N days. Query the whole collection by default, and add a filter query if/when you want to restrict your search to only a narrow date range of documents.

This is the same general approach you would use on a non-distributed / non-SolrCloud setup if you just had a single collection on a single master replicated to some number of slaves for horizontal scaling.

-Hoss
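A minimal sketch of the two pieces described above: a periodic delete-by-query that purges documents older than N days, and a search that adds a filter query only when a narrow time range is wanted. The date field name `timestamp` and both function names are assumptions for illustration and are not taken from the thread. The code only builds the request payloads; it does not contact a server.

```python
import json
from typing import Optional
from urllib.parse import urlencode

DATE_FIELD = "timestamp"  # hypothetical name of the indexed date field


def purge_body(keep_days: int) -> str:
    """JSON update body that deletes everything older than keep_days."""
    query = f"{DATE_FIELD}:[* TO NOW-{keep_days}DAYS]"
    return json.dumps({"delete": {"query": query}})


def search_params(keywords: str, last_hours: Optional[int] = None) -> str:
    """Query string for a search over the whole collection.

    A filter query (fq) restricts the date range only when asked for.
    """
    params = [("q", keywords)]
    if last_hours is not None:
        params.append(("fq", f"{DATE_FIELD}:[NOW-{last_hours}HOURS TO NOW]"))
    return urlencode(params)


if __name__ == "__main__":
    print(purge_body(30))
    print(search_params("error"))
    print(search_params("error", last_hours=6))
```

The purge body would be POSTed to the collection's `/update` handler as JSON and followed by a commit; the search parameters would be appended to a `/select` request. Running the purge from a scheduled job keeps the window rolling without any shard management.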
Re: Keeping a rolling window of indexes around solr
Volume of data: 1 log insert every 30 seconds. Queries are done sporadically and asynchronously at a much lower frequency, every few days. Also, the majority of the requests are indeed going to be within a slice of time (typically hours, or at most a few days).

Type of queries:
Keyword or term search
Search by guid (or id, as it is known in the Solr world)
Reserved or percolation queries to be executed when new data becomes available
Search by dates, as mentioned above

Regards

On May 28, 2013, at 4:25 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: you haven't said much about the volume of data you expect to deal with, nor have you really explained what types of queries you intend to do
Re: Keeping a rolling window of indexes around solr
Hi,

SolrCloud now has the same index aliasing as Elasticsearch. I can't look up the link now, but Zoie from LinkedIn has Hourglass, which it uses for a circular-buffer sort of index setup, if I recall correctly.

Otis
Solr ElasticSearch Support
http://sematext.com/

On May 24, 2013 10:26 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:
: Hello Solr community folks,
: I am doing some investigative work around how to roll and manage indexes inside our Solr configuration. To date I've come up with an architecture that separates a set of masters that are focused on writes and get replicated periodically from a set of slave shards strictly focused on reads. Additionally, for each master index the design contains partial purges, which get performed on each of the slave shards as well as the master to keep the data current. However, the architecture seems a bit more complex than I'd like, with a lot of moving pieces.
: I was wondering if anyone has ever handled/designed an architecture around a conveyor belt or rolling window of indexes around n days of data, and if there are best practices around this. One thing I was thinking about was whether to keep a conveyor-belt list of the slave shards and rotate them as needed, and to drop the master periodically and make its backup temporarily the master.
: Anyway, I would love to hear thoughts and similar use cases from the community.
: Regards
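The "conveyor belt" or circular buffer of indexes mentioned here can be pictured as a fixed-size queue of per-day index names: each new day is appended, and whatever falls off the other end is eligible to be dropped. The sketch below is only an illustration of that idea. It is not how Hourglass is implemented, and the class name, the `logs_YYYYMMDD` naming scheme, and the one-index-per-day granularity are all assumptions.

```python
from collections import deque
from datetime import date
from typing import List


class RollingWindow:
    """Tracks the names of the newest `size` daily indexes (hypothetical helper)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.live = deque()  # oldest index name on the left

    def roll(self, day: date) -> List[str]:
        """Add the index for `day`; return the names that fell out of the window."""
        self.live.append(f"logs_{day:%Y%m%d}")
        expired = []
        while len(self.live) > self.size:
            expired.append(self.live.popleft())
        return expired


if __name__ == "__main__":
    window = RollingWindow(size=2)
    for d in (date(2013, 5, 24), date(2013, 5, 25), date(2013, 5, 26)):
        print(d, "-> drop", window.roll(d))
```

The names returned by `roll` are the ones a maintenance job would delete or archive; everything still in `live` is what queries should cover.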
Re: Keeping a rolling window of indexes around solr
But how is Hourglass going to help Solr? Or is it a portable implementation?

Regards,
Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Mon, May 27, 2013 at 3:48 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:
: SolrCloud now has the same index aliasing as Elasticsearch. I can't look up the link now, but Zoie from LinkedIn has Hourglass, which it uses for a circular-buffer sort of index setup, if I recall correctly.
Re: Keeping a rolling window of indexes around solr
On 5/24/2013 8:25 AM, Saikat Kanjilal wrote:
: Anyways would love to hear thoughts and usecases that are similar from the community.

Your use case sounds a lot like what Loggly was doing back in 2010.

http://loggly.com/videos/lucene-revolution-2010/
Re: Keeping a rolling window of indexes around solr
On 5/24/2013 8:56 AM, Shawn Heisey wrote:
: Your use-case sounds a lot like what loggly was doing back in 2010.
: http://loggly.com/videos/lucene-revolution-2010/

While I was writing that, I accidentally pressed the key combination that told my mail client to send the message before I was done.

Loggly created a new shard every five minutes, and merged older shards into longer time intervals. I personally don't need this capability, but it is a useful pattern. I was wondering recently whether a custom document router could be built for SolrCloud that automatically manages time-divided shards: creating, merging, and, if you're not keeping the data forever, deleting.

Thanks,
Shawn
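To make the time-divided shard pattern concrete, here is a small sketch that maps a document timestamp to a shard name: five-minute shards for recent data and coarser ones for older data. Only the five-minute interval comes from the thread. The hourly and daily tiers, the one-hour and one-day age cut-offs, the naming scheme, and the function itself are assumptions for illustration, and this is not a SolrCloud document router.

```python
from datetime import datetime, timedelta


def shard_for(ts: datetime, now: datetime) -> str:
    """Name of the time-bucketed shard that should hold a document stamped `ts`."""
    age = now - ts
    if age < timedelta(hours=1):
        # Recent data: one shard per five-minute slot.
        slot = ts.minute - ts.minute % 5
        return f"shard_{ts:%Y%m%d_%H}{slot:02d}"
    if age < timedelta(days=1):
        # Older than an hour: merged into hourly shards (assumed tier).
        return f"shard_{ts:%Y%m%d_%H}"
    # Older than a day: merged into daily shards (assumed tier).
    return f"shard_{ts:%Y%m%d}"


if __name__ == "__main__":
    now = datetime(2013, 5, 24, 12, 0)
    print(shard_for(datetime(2013, 5, 24, 11, 47), now))  # shard_20130524_1145
    print(shard_for(datetime(2013, 5, 24, 3, 10), now))   # shard_20130524_03
    print(shard_for(datetime(2013, 5, 20, 3, 10), now))   # shard_20130520
```

A real implementation would also need a background job that performs the merges and deletions when a shard crosses one of the age cut-offs.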
RE: Keeping a rolling window of indexes around solr
I would like to see something similar to this in the Solr world, or I could gladly help create it:

https://github.com/karussell/elasticsearch-rollindex

We are evaluating both Elasticsearch and our current Solr architecture, and need to manage write-heavy use cases within a rolling window.

Date: Fri, 24 May 2013 09:07:38 -0600
From: elyog...@elyograg.org
To: solr-user@lucene.apache.org
Subject: Re: Keeping a rolling window of indexes around solr
: Loggly created a new shard every five minutes, and merged older shards to longer time intervals.
Re: Keeping a rolling window of indexes around solr
Would collection aliasing help here? From the Solr 4.2 release notes:

: Collection Aliasing. Got time based data? Want to re-index in a temporary collection and then swap it into production? Done. Stay tuned for Shard Aliasing.

Regards,
Alex.

On Fri, May 24, 2013 at 10:25 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:
: I was wondering if anyone has ever handled/designed an architecture around a conveyor belt or rolling window of indexes around n days of data and if there are best practices around this.
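As a rough illustration of how collection aliasing could serve a rolling window, the sketch below builds a Collections API `CREATEALIAS` request that points one alias at the last few daily collections. The `logs_YYYYMMDD` collection names, the alias name, the base URL, and the function are assumptions for illustration. The collections themselves would have to exist already, and the code only builds the URL without sending it.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def alias_url(base: str, alias: str, today: date, days: int) -> str:
    """Collections API URL that points `alias` at the newest `days` daily collections."""
    names = []
    for i in range(days):
        day = today - timedelta(days=i)
        names.append(f"logs_{day:%Y%m%d}")
    params = urlencode({
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(names),
    })
    return f"{base}/admin/collections?{params}"


if __name__ == "__main__":
    print(alias_url("http://localhost:8983/solr", "logs_recent", date(2013, 5, 28), 3))
```

Queries would then go to the alias rather than to individual collections. As far as I know, issuing `CREATEALIAS` again with the same alias name repoints it, so a daily job could move the window forward and delete the collection that dropped out.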
Re: Keeping a rolling window of indexes around solr
This is kind of the approach used by Elasticsearch. If I'm not using SolrCloud, will I be able to use shard aliasing? Also, with this approach, how would replication work? Is it even needed?

On May 24, 2013, at 12:00 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:
: Would collection aliasing help here?