Re: Dynamic collections in SolrCloud for log indexing

2012-12-27 Thread Otis Gospodnetic
Added https://issues.apache.org/jira/browse/SOLR-4237

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html
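
SOLR-4237 is the collection-aliasing issue discussed in the thread below.
As a minimal sketch of the CREATEALIAS call that eventually grew out of it
(host, port, and the daily collection names here are hypothetical):

import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: point a fixed "last7days" alias at a set of daily collections
// via the Collections API CREATEALIAS action. All names are hypothetical.
public class CreateAlias {
    public static void main(String[] args) throws Exception {
        String url = "http://localhost:8983/solr/admin/collections"
                + "?action=CREATEALIAS&name=last7days"
                + "&collections=logs_20121221,logs_20121222,logs_20121223";
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
        System.out.println("CREATEALIAS returned HTTP "
                + conn.getResponseCode());
    }
}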




Re: Dynamic collections in SolrCloud for log indexing

2012-12-25 Thread Mark Miller
I've been thinking about aliases for a while as well. They seem very handy
and fairly easy to implement. So far there have just always been
higher-priority things (need to finish collection API responses this
week…), but this is something I'd definitely help work on.

- Mark


Re: Dynamic collections in SolrCloud for log indexing

2012-12-24 Thread Per Steffensen
I believe it is a misunderstanding to use custom routing (or sharding, as
Erick calls it) for this kind of stuff. Custom routing is nice if you want
to control which slice/shard under a collection a specific document goes
to - mainly to be able to ensure that two (or more) documents are indexed
on the same slice/shard, but also just to control on which slice/shard a
specific document is indexed. Knowing/controlling this kind of thing can
be used for a lot of nice purposes. But you don't want to move
slices/shards around among collections, or delete/add slices from/to a
collection - unless it's for elasticity reasons.
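
For a concrete picture of that co-location, here is a minimal sketch using
the composite-id convention that came out of the custom-hashing work Erick
mentions: document ids sharing a "prefix!" route to the same shard. The
collection name, ids, and fields are hypothetical, and this assumes a
collection using that router.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: two documents sharing the "user42!" id prefix hash to the same
// slice/shard under composite-id routing. Names/fields are hypothetical.
public class RoutedAdd {
    public static void main(String[] args) throws Exception {
        String docs = "[{\"id\":\"user42!evt1\",\"msg\":\"login\"},"
                + "{\"id\":\"user42!evt2\",\"msg\":\"logout\"}]";
        URL url = new URL(
                "http://localhost:8983/solr/logs_2012_12/update?commit=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(docs.getBytes("UTF-8"));
        }
        System.out.println("update returned HTTP " + conn.getResponseCode());
    }
}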


I think you should fill a collection every week/month and just keep those
collections as they are. Instead of ending up with one big historic
collection containing many slices/shards/cores (one for each historic
week/month), you will end up with many historic collections (one for each
historic week/month). To search historic data you will have to
cross-search those historic collections, but that is no problem at all. If
SolrCloud is made as it is supposed to be made (and I believe it is), it
shouldn't require more resources, or be harder in any way, to cross-search
X slices across many collections than to cross-search X slices under the
same collection.
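
A minimal sketch of such a cross-search, assuming monthly collections and
the SolrCloud "collection" request parameter to fan one query out across
several of them (all names and the query are hypothetical):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Sketch: one query fanned out across several monthly collections via
// the "collection" parameter. Collection names are hypothetical.
public class CrossCollectionSearch {
    public static void main(String[] args) throws Exception {
        String q = URLEncoder.encode("level:ERROR", "UTF-8");
        String url = "http://localhost:8983/solr/logs_2012_12/select"
                + "?q=" + q
                + "&collection=logs_2012_10,logs_2012_11,logs_2012_12"
                + "&wt=json";
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            for (String line; (line = r.readLine()) != null; ) {
                System.out.println(line);
            }
        }
    }
}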


Besides that, see my answer in the topic "Will SolrCloud always slice by
ID hash?" from a few days back.


Regards, Per Steffensen


Re: Dynamic collections in SolrCloud for log indexing

2012-12-24 Thread Otis Gospodnetic
Hi,

Right, this is not really about routing in the ElasticSearch sense.
What's handy for indexing logs is index aliases, which I thought I had
added to JIRA a while back, but it looks like I have not.
Index aliases would let you keep a "last 7 days" alias fixed while
underneath you push and pop an index every day, without the client app
having to adjust.

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html
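
A sketch of that daily push/pop, assuming the aliasing discussed in this
thread (re-issuing CREATEALIAS with the same name re-points the alias) and
hypothetical daily collection names:

import java.net.HttpURLConnection;
import java.net.URL;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Sketch: re-point a fixed "last7days" alias at the seven newest daily
// collections; re-running this each day gives the push/pop behaviour
// without the client app having to adjust. Names are hypothetical.
public class RollAlias {
    public static void main(String[] args) throws Exception {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyyMMdd");
        List<String> days = new ArrayList<>();
        for (int i = 0; i < 7; i++) {
            days.add("logs_" + LocalDate.now().minusDays(i).format(fmt));
        }
        String url = "http://localhost:8983/solr/admin/collections"
                + "?action=CREATEALIAS&name=last7days"
                + "&collections=" + String.join(",", days);
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
        System.out.println("CREATEALIAS returned HTTP "
                + conn.getResponseCode());
    }
}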




Re: Dynamic collections in SolrCloud for log indexing

2012-12-24 Thread Upayavira
This is precisely it. It is a 'collections alias', allowing you to group
collections together into 'super-collections'.

You add a new collection (made up of a core on n hosts) every
day/week/month/whatever. When you do so, you add this collection to your
super-collection. Maybe you do a quick audit of those cores in your
short-term super-collections, but the net result is some names you can use
to address various subsets of your total content.

Upayavira (whose kids are still asleep, so no excitement yet...)
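
A sketch of one period's rollover under that scheme: CREATE makes the new
period's collection, and CREATEALIAS refreshes the super-collection. All
names, shard/replica counts, and the config name are hypothetical.

import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: monthly rollover. Create the new month's collection, then
// refresh the "alllogs" super-collection alias to include it.
public class MonthlyRollover {
    static int call(String params) throws Exception {
        URL url = new URL(
                "http://localhost:8983/solr/admin/collections?" + params);
        return ((HttpURLConnection) url.openConnection()).getResponseCode();
    }

    public static void main(String[] args) throws Exception {
        call("action=CREATE&name=logs_2013_01&numShards=2"
                + "&replicationFactor=2&collection.configName=logsconf");
        call("action=CREATEALIAS&name=alllogs"
                + "&collections=logs_2012_11,logs_2012_12,logs_2013_01");
    }
}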


Re: Dynamic collections in SolrCloud for log indexing

2012-12-23 Thread Erick Erickson
I think this is one of the primary use-cases for custom sharding. Solr 4.0
doesn't really lend itself to this scenario, but I _believe_ that the patch
for custom sharding has been committed...

That said, I'm not quite sure how you drop off the old shard if you don't
need to keep old data. I'd guess it's possible, but haven't implemented
anything like that myself.

FWIW,
Erick
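
With the collection-per-period layout Per Steffensen suggests above,
dropping old data reduces to deleting the oldest collection outright; a
minimal sketch (the collection name is hypothetical):

import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: age out the oldest month by deleting its collection. With one
// collection per month there is no shard surgery, just a DELETE.
public class DropOldMonth {
    public static void main(String[] args) throws Exception {
        String url = "http://localhost:8983/solr/admin/collections"
                + "?action=DELETE&name=logs_2012_01";
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
        System.out.println("DELETE returned HTTP " + conn.getResponseCode());
    }
}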



Dynamic collections in SolrCloud for log indexing

2012-12-21 Thread Upayavira
I'm working on a system for indexing logs. We're probably looking at
filling one core every month.

We'll maintain a short term index containing the last 7 days - that one
is easy to handle.

For the longer term stuff, we'd like to maintain a collection that will
query across all the historic data, but that means every month we need
to add another core to an existing collection, which, as I understand
it, is not possible in 4.0.

How do people handle this sort of situation, where you have rolling new
content arriving? I'm sure I've heard of people using SolrCloud for this
sort of thing.

Given it is logs, distributed IDF has no real bearing.

Upayavira