Re: Keeping a rolling window of indexes around solr

2013-05-29 Thread Erick Erickson
I suspect you're worrying about something you don't need to. At 1 insert every
30 seconds, and assuming 30,000,000 records will fit on a machine (I've seen
this), you're talking roughly 900,000,000 seconds' worth of data on a single
box! Or roughly 10,000 days' worth of data. Test, of course, YMMV.
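
A quick check of that arithmetic, as a minimal sketch: 30,000,000 documents at
one insert per 30 seconds span about 900,000,000 seconds, which is a little
over 10,000 days. The capacity figure is the assumption stated above, not a
measured limit.

```python
# Back-of-the-envelope retention estimate for a single box.
SECONDS_PER_INSERT = 30        # from the thread: 1 log insert every 30 seconds
DOCS_PER_BOX = 30_000_000      # assumed single-machine capacity (test this!)
SECONDS_PER_DAY = 24 * 60 * 60


def retention_days(docs: int, seconds_per_insert: int) -> float:
    """Days of data that `docs` documents cover at the given insert rate."""
    return docs * seconds_per_insert / SECONDS_PER_DAY


total_seconds = DOCS_PER_BOX * SECONDS_PER_INSERT
days = retention_days(DOCS_PER_BOX, SECONDS_PER_INSERT)
print(total_seconds)   # 900000000
print(round(days))     # 10417
```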

Or I'm misunderstanding what 1 log insert means; I guess it could be a full
log file.

But do the simple thing first: just let Solr do what it does by default and
periodically do a delete by query on the documents you want to roll off the
end, especially since you say that queries happen only every few days. The
tricks for utilizing hot shards are probably not very useful for you at that
low a query rate.
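
A minimal sketch of such a periodic purge, assuming documents carry a date
field (called `timestamp` here, which is hypothetical) and that the core's
update handler accepts JSON at a URL like the one in the comment. The URL,
core name, and retention period are illustrative, not taken from the thread.

```python
import json
import urllib.request


def build_purge_payload(date_field: str, keep_days: int) -> bytes:
    """JSON delete-by-query body matching everything older than keep_days."""
    # Solr date math: NOW-30DAYS means "30 days before the current time".
    query = f"{date_field}:[* TO NOW-{keep_days}DAYS]"
    return json.dumps({"delete": {"query": query}}).encode("utf-8")


def purge_old_docs(update_url: str, date_field: str, keep_days: int) -> int:
    """POST the delete to the update handler and commit; returns HTTP status.

    update_url is assumed to look like http://localhost:8983/solr/logs/update
    (hypothetical host and core name).
    """
    request = urllib.request.Request(
        f"{update_url}?commit=true",
        data=build_purge_payload(date_field, keep_days),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Run from cron or any scheduler, something like
`purge_old_docs("http://localhost:8983/solr/logs/update", "timestamp", 30)`
would keep roughly the last 30 days in the index.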

Test, of course
Best
Erick


RE: Keeping a rolling window of indexes around solr

2013-05-28 Thread Saikat Kanjilal
At first glance, unless I missed something, Hourglass will definitely not work
for our use-case, which just involves real-time inserts of new log data and no
appends at all. However, I would like to examine the guts of Hourglass to see
if we can customize it for our use-case.


Re: Keeping a rolling window of indexes around solr

2013-05-28 Thread Chris Hostetter

: This is kind of the approach used by Elasticsearch. If I'm not using
: SolrCloud, will I be able to use shard aliasing? Also, with this approach,
: how would replication work, and is it even needed?

you haven't said much about the volume of data you expect to deal with,
nor have you really explained what types of queries you intend to do --
ie: you said you were interested in a rolling window of indexes
around n days of data but you never clarified why you think a
rolling window of indexes would be useful to you or how exactly you would
use it.

The primary advantage of sharding by date is if you know that a large
percentage of your queries are only going to be within a small range of
time, and therefore you can optimize those requests to only hit the shards
necessary to satisfy that small window of time.

if the majority of requests are going to be across your entire n days of
data, then date based sharding doesn't really help you -- you can just use
arbitrary (randomized) sharding using periodic deleteByQuery commands to
purge anything older than N days.  Query the whole collection by default,
and add a filter query if/when you want to restrict your search to only a 
narrow date range of documents.
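
A minimal sketch of that query pattern: search everything by default, and add
an `fq` date-range filter only when a narrow window is wanted. The field name
`timestamp` and the example values are assumptions for illustration.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode


def build_select_params(keywords: str,
                        date_field: Optional[str] = None,
                        window: Optional[str] = None) -> str:
    """Query string for a Solr /select request.

    With no window, the whole collection is searched. With a window such as
    "2DAYS", a filter query restricts results to [NOW-2DAYS TO NOW].
    """
    params = [("q", keywords), ("wt", "json")]
    if date_field and window:
        params.append(("fq", f"{date_field}:[NOW-{window} TO NOW]"))
    return urlencode(params)


everything = build_select_params("error")
recent = build_select_params("error", "timestamp", "2DAYS")
print(parse_qs(recent)["fq"])  # ['timestamp:[NOW-2DAYS TO NOW]']
```

Because the restriction lives in `fq` rather than in `q`, Solr can cache the
date filter separately from the keyword query.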

this is the same general approach you would use on a non-distributed / 
non-SolrCloud setup if you just had a single collection on a single master 
replicated to some number of slaves for horizontal scaling.


-Hoss


Re: Keeping a rolling window of indexes around solr

2013-05-28 Thread Saikat Kanjilal
Volume of data:
1 log insert every 30 seconds; queries are run sporadically and asynchronously,
at a much lower frequency (every few days).

Also, the majority of the requests are indeed going to be within a slice of
time (typically hours, or at most a few days).

Type of queries:
Keyword or term search
Search by guid (or id, as it is known in the Solr world)
Reserved or percolation queries to be executed when new data becomes available
Search by dates as mentioned above

Regards


Sent from my iPhone



Re: Keeping a rolling window of indexes around solr

2013-05-27 Thread Otis Gospodnetic
Hi,

SolrCloud now has the same index aliasing as Elasticsearch.  I can't look up
the link now, but Zoie from LinkedIn has Hourglass, which is used for a
circular-buffer sort of index setup, if I recall correctly.

Otis
Solr & ElasticSearch Support
http://sematext.com/


Re: Keeping a rolling window of indexes around solr

2013-05-27 Thread Alexandre Rafalovitch
But how is Hourglass going to help Solr? Or is it a portable implementation?

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)



Re: Keeping a rolling window of indexes around solr

2013-05-24 Thread Shawn Heisey
On 5/24/2013 8:25 AM, Saikat Kanjilal wrote:
 Anyways would love to hear thoughts and usecases that are similar from the 
 community.

Your use-case sounds a lot like what Loggly was doing back in 2010.

http://loggly.com/videos/lucene-revolution-2010/



Re: Keeping a rolling window of indexes around solr

2013-05-24 Thread Shawn Heisey
While I was writing that, I accidentally pressed the key combination
that told my mail client to send the message before I was done.

Loggly created a new shard every five minutes, and merged older shards
to longer time intervals.  I personally don't need this capability, but
it is a useful pattern.  I was wondering recently whether a custom
document router could be built for SolrCloud that automatically manages
time-divided shards - creating, merging, and if you're not keeping the
data forever, deleting.
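
To make the time-divided idea concrete, here is a minimal sketch of the
bookkeeping such a router would need: mapping a timestamp to a five-minute
bucket and deciding which buckets have aged out. This is not Loggly's
implementation and not an existing SolrCloud router; the `logs_` naming
scheme and all function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

BUCKET = timedelta(minutes=5)  # shard granularity, per the five-minute example
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_start(ts: datetime) -> datetime:
    """Start of the five-minute bucket that contains ts."""
    return EPOCH + ((ts - EPOCH) // BUCKET) * BUCKET


def shard_name(ts: datetime) -> str:
    """Hypothetical shard name for the bucket containing ts."""
    return "logs_" + bucket_start(ts).strftime("%Y%m%d_%H%M")


def expired(bucket_starts, now: datetime, keep: timedelta):
    """Buckets whose entire time range is older than the retention window."""
    cutoff = now - keep
    return [start for start in bucket_starts if start + BUCKET <= cutoff]


ts = datetime(2013, 5, 24, 15, 7, 38, tzinfo=timezone.utc)
print(shard_name(ts))  # logs_20130524_1505
```

Merging older buckets into longer intervals would be a further step on top of
this; the sketch covers only creation and deletion.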

Thanks,
Shawn



RE: Keeping a rolling window of indexes around solr

2013-05-24 Thread Saikat Kanjilal
I would like to see something similar to this in the Solr world, or I could
gladly help create it:

https://github.com/karussell/elasticsearch-rollindex


We are evaluating both Elasticsearch and our current Solr architecture, and
need to manage write-heavy use-cases within a rolling window.


Re: Keeping a rolling window of indexes around solr

2013-05-24 Thread Alexandre Rafalovitch
Would collection aliasing help here? From Solr 4.2 release notes:
Collection Aliasing. Got time based data? Want to re-index in a
temporary collection and then swap it into production? Done. Stay
tuned for Shard Aliasing.
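
A minimal sketch of how a rolling window could be expressed with collection
aliasing, assuming one collection per day and the Collections API
`CREATEALIAS` action. The base URL, alias name, and collection names are
hypothetical; only the URL is built here, and nothing is sent.

```python
from urllib.parse import urlencode


def create_alias_url(base_url: str, alias: str, collections: list) -> str:
    """Collections API URL that points `alias` at the given collections."""
    params = urlencode({
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    })
    return f"{base_url}/admin/collections?{params}"


url = create_alias_url(
    "http://localhost:8983/solr",          # hypothetical Solr base URL
    "logs_recent",                         # hypothetical alias queried by clients
    ["logs_20130523", "logs_20130524"],    # hypothetical per-day collections
)
print(url)
```

Clients would query `logs_recent` only; rolling the window then means pointing
the alias at a new list of collections and dropping the oldest one.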

Regards,
  Alex.


On Fri, May 24, 2013 at 10:25 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:
 Hello Solr community folks,
 I am doing some investigative work around how to roll and manage indexes
 inside our Solr configuration. To date I've come up with an architecture that
 separates a set of masters, which are focused on writes and get replicated
 periodically, from a set of slave shards strictly focused on reads.
 Additionally, for each master index the design contains partial purges, which
 get performed on each of the slave shards as well as the master to keep the
 data current. However, the architecture seems a bit more complex than I'd
 like, with a lot of moving pieces. I was wondering if anyone has ever
 handled/designed an architecture around a conveyor belt or rolling window
 of indexes around n days of data, and if there are best practices around this.
 One thing I was thinking about was whether to keep a conveyor belt list of
 the slave shards and rotate them as needed, and drop the master periodically
 and make its backup temporarily the master.


 Anyways, would love to hear thoughts and use-cases that are similar from the
 community.

 Regards


Re: Keeping a rolling window of indexes around solr

2013-05-24 Thread Saikat Kanjilal
This is kind of the approach used by Elasticsearch. If I'm not using
SolrCloud, will I be able to use shard aliasing? Also, with this approach, how
would replication work, and is it even needed?

