Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-18 Thread Jason Rutherglen
I'm curious how on-the-fly updates are handled as a new shard is added
to an alias.  E.g., how does the system know which shard to send an
update to?

On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček lukas.vl...@gmail.com wrote:
 Hi,

 speaking about ES I think it would be fair to mention that one has to
 specify number of shards upfront when the index is created - that is
 correct, however, it is possible to give index one or more aliases which
 basically means that you can add new indices on the fly and give them same
 alias which is then used to search against. Given that you can add/remove
 indices, nodes and aliases on the fly I think there is a way how to handle
 growing data set with ease. If anyone is interested such scenario has been
 discussed in detail in ES mail list.

 Regards,
 Lukas

 On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen 
 jason.rutherg...@gmail.com wrote:

 One of big weaknesses of Solr Cloud (and ES?) is the lack of the
 ability to redistribute shards across servers.  Meaning, as a single
 shard grows too large, splitting the shard, while live updates.

 How do you plan on elastically adding more servers without this feature?

 Cassandra and HBase handle elasticity in their own ways.  Cassandra
 has successfully implemented the Dynamo model and HBase uses the
 traditional BigTable 'split'.  Both systems are complex though are at
 a singular level of maturity.

 Also Cassandra [successfully] implements multiple data center support,
 is that available in SC or ES?

 On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
  Hello Ali,
 
  I'm trying to setup a large scale *Crawl + Index + Search
 *infrastructure
 
  using Nutch and Solr/Lucene. The targeted scale is *5 Billion web
 pages*,
  crawled + indexed every *4 weeks, *with a search latency of less than
 0.5
  seconds.
 
 
  That's fine.  Whether it's doable with any tech will depend on how much
 hardware you give it, among other things.
 
  Needless to mention, the search index needs to scale to 5Billion pages.
 It
  is also possible that I might need to store multiple indexes -- one for
  crawled content, and one for ancillary data that is also very large.
 Each
  of these indices would likely require a logically distributed and
  replicated index.
 
 
  Yup, OK.
 
  However, I would like for such a system to be homogenous with the Hadoop
  infrastructure that is already installed on the cluster (for the
 crawl). In
  other words, I would much prefer if the replication and distribution of
 the
  Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead
 of
  using another scalability framework (such as SolrCloud). In addition, it
  would be ideal if this environment was flexible enough to be dynamically
  scaled based on the size requirements of the index and the search
 traffic
  at the time (i.e. if it is deployed on an Amazon cluster, it should be
 easy
  enough to automatically provision additional processing power into the
  cluster without requiring server re-starts).
 
 
  There is no such thing just yet.
  There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt
 to automatically index HBase content, but that was either not completed or
 not committed into HBase.
 
  However, I'm not sure which Solr-based tool in the Hadoop ecosystem
 would
  be ideal for this scenario. I've heard mention of Solr-on-HBase,
 Solandra,
  Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of
 these is
  mature enough and would be the right architectural choice to go along
 with
  a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling
 aspects
  above.
 
 
  Here is a summary on all of them:
  * Search on HBase - I assume you are referring to the same thing I
 mentioned above.  Not ready.
  * Solandra - uses Cassandra+Solr, plus DataStax now has a different
 (commercial) offering that combines search and Cassandra.  Looks good.
  * Lily - data stored in HBase cluster gets indexed to a separate Solr
 instance(s)  on the side.  Not really integrated the way you want it to be.
  * ElasticSearch - solid at this point, the most dynamic solution today,
  can scale well (we are working on a many-B documents index and hundreds
 of nodes with ElasticSearch right now), etc.  But again, not integrated
 with Hadoop the way you want it.
  * IndexTank - has some technical weaknesses, not integrated with Hadoop,
 not sure about its future considering LinkedIn uses Zoie and Sensei already.
  * And there is SolrCloud, which is coming soon and will be solid, but is
 again not integrated.
 
  If I were you and I had to pick today - I'd pick ElasticSearch if I were
 completely open.  If I had Solr bias I'd give SolrCloud a try first.
 
  Lastly, how much hardware (assuming a medium sized EC2 instance) would
 you
  estimate my needing with this setup, for regular web-data (HTML text) at
  this scale?
 
   I don't know off the top of my head, but I'm guessing several hundred
 

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-18 Thread Lukáš Vlček
AFAIK it cannot. You can only add new shards by creating a new index, and
you then need to index new data into that new index. Index aliases are
useful mainly for the search side, so you need to plan for this when you
implement your indexing logic. On the other hand, the query logic does not
need to change, as you just add new indices and give them all the same
alias.
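To make the alias pattern concrete, here is a minimal sketch against the
Elasticsearch REST API (the host, index names and shard count are made-up
examples, not a recommendation): searches go to the shared alias, while the
indexing code keeps writing to the newest concrete index, because an alias
that spans several indices cannot tell ES where an update belongs.

import json
import requests

ES = "http://localhost:9200"

def create_index(name, shards=2, replicas=1):
    # The shard count is fixed at creation time, hence one new index per period.
    body = {"settings": {"number_of_shards": shards,
                         "number_of_replicas": replicas}}
    requests.put("%s/%s" % (ES, name), data=json.dumps(body))

def add_to_alias(index, alias):
    # The alias is the search entry point; it fans out to all member indices.
    body = {"actions": [{"add": {"index": index, "alias": alias}}]}
    requests.post("%s/_aliases" % ES, data=json.dumps(body))

create_index("pages_2012_04")
add_to_alias("pages_2012_04", "pages")

# Writes target a concrete index (the newest one)...
doc = {"url": "http://example.com/", "body": "..."}
requests.put("%s/pages_2012_04/page/1" % ES, data=json.dumps(doc))

# ...while searches go to the alias and cover every index behind it.
query = {"query": {"match_all": {}}}
requests.post("%s/pages/_search" % ES, data=json.dumps(query))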

I am not an expert on this, but I think index splitting and re-sharding
can be expensive for a [near] real-time search system, and the point is
that you can probably use different techniques to support your large-scale
needs. Index aliasing and routing in elasticsearch can help a lot in
supporting various large-scale data scenarios; see the following thread on
the ES ML for some examples:
https://groups.google.com/forum/#!msg/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ

Just to sum it up: the fact that elasticsearch has a fixed number of
shards per index and does not support resharding or index splitting does
not mean you cannot scale your data easily.

(I was not following this whole thread in every detail, so maybe you have
specific needs that can only be solved by splitting or resharding; in that
case I would recommend asking further questions on the ES ML. I do not
want to start a system X vs. system Y flame war here...)

Regards,
Lukas

On Wed, Apr 18, 2012 at 2:22 PM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

 I'm curious how on the fly updates are handled as a new shard is added
 to an alias.  Eg, how does the system know to which shard to send an
 update?

 On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček lukas.vl...@gmail.com
 wrote:
  Hi,
 
  speaking about ES I think it would be fair to mention that one has to
  specify number of shards upfront when the index is created - that is
  correct, however, it is possible to give index one or more aliases which
  basically means that you can add new indices on the fly and give them
 same
  alias which is then used to search against. Given that you can add/remove
  indices, nodes and aliases on the fly I think there is a way how to
 handle
  growing data set with ease. If anyone is interested such scenario has
 been
  discussed in detail in ES mail list.
 
  Regards,
  Lukas
 
  On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen 
  jason.rutherg...@gmail.com wrote:
 
  One of big weaknesses of Solr Cloud (and ES?) is the lack of the
  ability to redistribute shards across servers.  Meaning, as a single
  shard grows too large, splitting the shard, while live updates.
 
  How do you plan on elastically adding more servers without this feature?
 
  Cassandra and HBase handle elasticity in their own ways.  Cassandra
  has successfully implemented the Dynamo model and HBase uses the
  traditional BigTable 'split'.  Both systems are complex though are at
  a singular level of maturity.
 
  Also Cassandra [successfully] implements multiple data center support,
  is that available in SC or ES?
 
  On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
  otis_gospodne...@yahoo.com wrote:
   Hello Ali,
  
   I'm trying to setup a large scale *Crawl + Index + Search
  *infrastructure
  
   using Nutch and Solr/Lucene. The targeted scale is *5 Billion web
  pages*,
   crawled + indexed every *4 weeks, *with a search latency of less than
  0.5
   seconds.
  
  
   That's fine.  Whether it's doable with any tech will depend on how
 much
  hardware you give it, among other things.
  
   Needless to mention, the search index needs to scale to 5Billion
 pages.
  It
   is also possible that I might need to store multiple indexes -- one
 for
   crawled content, and one for ancillary data that is also very large.
  Each
   of these indices would likely require a logically distributed and
   replicated index.
  
  
   Yup, OK.
  
   However, I would like for such a system to be homogenous with the
 Hadoop
   infrastructure that is already installed on the cluster (for the
  crawl). In
   other words, I would much prefer if the replication and distribution
 of
  the
   Solr/Lucene index be done automagically on top of Hadoop/HDFS,
 instead
  of
   using another scalability framework (such as SolrCloud). In
 addition, it
   would be ideal if this environment was flexible enough to be
 dynamically
   scaled based on the size requirements of the index and the search
  traffic
   at the time (i.e. if it is deployed on an Amazon cluster, it should
 be
  easy
   enough to automatically provision additional processing power into
 the
   cluster without requiring server re-starts).
  
  
   There is no such thing just yet.
   There is no Search+Hadoop/HDFS in a box just yet.  There was an
 attempt
  to automatically index HBase content, but that was either not completed
 or
  not committed into HBase.
  
   However, I'm not sure which Solr-based tool in the Hadoop ecosystem
  would
   be ideal for this scenario. I've heard mention of Solr-on-HBase,
  Solandra,
   Lily, ElasticSearch, IndexTank 

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-18 Thread Jason Rutherglen
The main point being made is that established NoSQL solutions (e.g.,
Cassandra, HBase, et al.) solved the update problem, among many other
scalability issues, several years ago.

If an update is being performed and it is not known where the record
lives, the system's update path is inefficient.  In addition, in a
production system, the mere possibility of losing data or applying
inaccurate updates is usually a red flag.

On Wed, Apr 18, 2012 at 6:40 AM, Lukáš Vlček lukas.vl...@gmail.com wrote:
 AFAIK it can not. You can only add new shards by creating a new index and
 you will then need to index new data into that new index. Index aliases are
 useful mainly for searching part. So it means that you need to plan for
 this when you implement your indexing logic. On the other hand the query
 logic does not need to change as you only add new indices and give them all
 the same alias.

 I am not an expert on this but I think that index splitting and re-sharding
 can be expensive for [near] real-time search system and the point is that
 you can probably use different techniques to support your large scale
 needs. Index aliasing and routing in elasticsearch can help a lot in
 supporting various large scale data scenarios, check the following thread
 in ES ML for some examples:
 https://groups.google.com/forum/#!msg/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ

 Just to sum it up, the fact that elasticsearch does have fixed number of
 shards per index and does not support resharding and index splitting does
 not mean you can not scale your data easily.

 (I was not following this whole thread in every detail. So may be you may
 have specific needs that can be solved only by splitting or resharding, in
 such case I would recommend you to ask on ES ML with further questions, I
 do not want to run into system X vs system Y flame here...)

 Regards,
 Lukas

 On Wed, Apr 18, 2012 at 2:22 PM, Jason Rutherglen 
 jason.rutherg...@gmail.com wrote:

 I'm curious how on the fly updates are handled as a new shard is added
 to an alias.  Eg, how does the system know to which shard to send an
 update?

 On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček lukas.vl...@gmail.com
 wrote:
  Hi,
 
  speaking about ES I think it would be fair to mention that one has to
  specify number of shards upfront when the index is created - that is
  correct, however, it is possible to give index one or more aliases which
  basically means that you can add new indices on the fly and give them
 same
  alias which is then used to search against. Given that you can add/remove
  indices, nodes and aliases on the fly I think there is a way how to
 handle
  growing data set with ease. If anyone is interested such scenario has
 been
  discussed in detail in ES mail list.
 
  Regards,
  Lukas
 
  On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen 
  jason.rutherg...@gmail.com wrote:
 
  One of big weaknesses of Solr Cloud (and ES?) is the lack of the
  ability to redistribute shards across servers.  Meaning, as a single
  shard grows too large, splitting the shard, while live updates.
 
  How do you plan on elastically adding more servers without this feature?
 
  Cassandra and HBase handle elasticity in their own ways.  Cassandra
  has successfully implemented the Dynamo model and HBase uses the
  traditional BigTable 'split'.  Both systems are complex though are at
  a singular level of maturity.
 
  Also Cassandra [successfully] implements multiple data center support,
  is that available in SC or ES?
 
  On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
  otis_gospodne...@yahoo.com wrote:
   Hello Ali,
  
   I'm trying to setup a large scale *Crawl + Index + Search
  *infrastructure
  
   using Nutch and Solr/Lucene. The targeted scale is *5 Billion web
  pages*,
   crawled + indexed every *4 weeks, *with a search latency of less than
  0.5
   seconds.
  
  
   That's fine.  Whether it's doable with any tech will depend on how
 much
  hardware you give it, among other things.
  
   Needless to mention, the search index needs to scale to 5Billion
 pages.
  It
   is also possible that I might need to store multiple indexes -- one
 for
   crawled content, and one for ancillary data that is also very large.
  Each
   of these indices would likely require a logically distributed and
   replicated index.
  
  
   Yup, OK.
  
   However, I would like for such a system to be homogenous with the
 Hadoop
   infrastructure that is already installed on the cluster (for the
  crawl). In
   other words, I would much prefer if the replication and distribution
 of
  the
   Solr/Lucene index be done automagically on top of Hadoop/HDFS,
 instead
  of
   using another scalability framework (such as SolrCloud). In
 addition, it
   would be ideal if this environment was flexible enough to be
 dynamically
   scaled based on the size requirements of the index and the search
  traffic
   at the time (i.e. if it is deployed on an Amazon cluster, it should
 be
  easy
   enough to 

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-17 Thread Jan Høydahl
Hi,

I think Katta integration is nice, but it is not very real-time. What if you 
want both?
Perhaps a Katta/SolrCloud integration could make the two frameworks play 
together, so that some shards in SolrCloud may be marked as static while 
others are realtime. SolrCloud would handle indexing the realtime shards as it 
does today, while indexing the static shards would be handled by Katta. If 
Katta adds a shard, it will tell SolrCloud by updating the ZK tree, and 
SolrCloud will pick up the shard and start serving searches for it.
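Just to illustrate the "tell SolrCloud by updating the ZK tree" part, here is
a tiny sketch of the publish/watch idea in Python (the znode paths, payloads
and the kazoo ZooKeeper client are assumptions for illustration; SolrCloud's
real ZooKeeper layout is different):

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/static_shards")

# Katta-side: announce a newly built shard and where its index lives.
zk.create("/static_shards/shard-0042",
          b"hdfs:///indexes/shard-0042", makepath=True)

# SolrCloud-side: get called whenever the set of static shards changes,
# so the node can pull the new index down and start serving it.
@zk.ChildrenWatch("/static_shards")
def on_shards_changed(children):
    print("static shards now registered: %s" % sorted(children))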

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 17. apr. 2012, at 02:42, Jason Rutherglen wrote:

 One of big weaknesses of Solr Cloud (and ES?) is the lack of the
 ability to redistribute shards across servers.  Meaning, as a single
 shard grows too large, splitting the shard, while live updates.
 
 How do you plan on elastically adding more servers without this feature?
 
 Cassandra and HBase handle elasticity in their own ways.  Cassandra
 has successfully implemented the Dynamo model and HBase uses the
 traditional BigTable 'split'.  Both systems are complex though are at
 a singular level of maturity.
 
 Also Cassandra [successfully] implements multiple data center support,
 is that available in SC or ES?
 
 On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
 Hello Ali,
 
 I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
 
 using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
 crawled + indexed every *4 weeks, *with a search latency of less than 0.5
 seconds.
 
 
 That's fine.  Whether it's doable with any tech will depend on how much 
 hardware you give it, among other things.
 
 Needless to mention, the search index needs to scale to 5Billion pages. It
 is also possible that I might need to store multiple indexes -- one for
 crawled content, and one for ancillary data that is also very large. Each
 of these indices would likely require a logically distributed and
 replicated index.
 
 
 Yup, OK.
 
 However, I would like for such a system to be homogenous with the Hadoop
 infrastructure that is already installed on the cluster (for the crawl). In
 other words, I would much prefer if the replication and distribution of the
 Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
 using another scalability framework (such as SolrCloud). In addition, it
 would be ideal if this environment was flexible enough to be dynamically
 scaled based on the size requirements of the index and the search traffic
 at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
 enough to automatically provision additional processing power into the
 cluster without requiring server re-starts).
 
 
 There is no such thing just yet.
 There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
 automatically index HBase content, but that was either not completed or not 
 committed into HBase.
 
 However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
 be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
 Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
 mature enough and would be the right architectural choice to go along with
 a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
 above.
 
 
 Here is a summary on all of them:
 * Search on HBase - I assume you are referring to the same thing I mentioned 
 above.  Not ready.
 * Solandra - uses Cassandra+Solr, plus DataStax now has a different 
 (commercial) offering that combines search and Cassandra.  Looks good.
 * Lily - data stored in HBase cluster gets indexed to a separate Solr 
 instance(s)  on the side.  Not really integrated the way you want it to be.
 * ElasticSearch - solid at this point, the most dynamic solution today, can 
  scale well (we are working on a many-B documents index and hundreds of 
 nodes with ElasticSearch right now), etc.  But again, not integrated with 
 Hadoop the way you want it.
 * IndexTank - has some technical weaknesses, not integrated with Hadoop, not 
 sure about its future considering LinkedIn uses Zoie and Sensei already.
 * And there is SolrCloud, which is coming soon and will be solid, but is 
 again not integrated.
 
 If I were you and I had to pick today - I'd pick ElasticSearch if I were 
 completely open.  If I had Solr bias I'd give SolrCloud a try first.
 
 Lastly, how much hardware (assuming a medium sized EC2 instance) would you
 estimate my needing with this setup, for regular web-data (HTML text) at
 this scale?
 
 I don't know off the top of my head, but I'm guessing several hundred for 
 serving search requests.
 
 HTH,
 
 Otis
 --
 Search Analytics - http://sematext.com/search-analytics/index.html
 
 Scalable Performance Monitoring - http://sematext.com/spm/index.html
 
 
 Any architectural guidance would be greatly 

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-17 Thread Otis Gospodnetic
I think Jason is right - there is no index splitting in ES and SolrCloud, so 
one has to think ahead, overshard, and then count on redistributing shards 
from oversubscribed nodes to other nodes.  No resharding on demand and no 
index/shard splitting yet.

Otis 

Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html




 From: Jason Rutherglen jason.rutherg...@gmail.com
To: solr-user@lucene.apache.org 
Sent: Monday, April 16, 2012 8:42 PM
Subject: Re: Options for automagically Scaling Solr (without needing 
distributed index/replication) in a Hadoop environment
 
One of big weaknesses of Solr Cloud (and ES?) is the lack of the
ability to redistribute shards across servers.  Meaning, as a single
shard grows too large, splitting the shard, while live updates.

How do you plan on elastically adding more servers without this feature?

Cassandra and HBase handle elasticity in their own ways.  Cassandra
has successfully implemented the Dynamo model and HBase uses the
traditional BigTable 'split'.  Both systems are complex though are at
a singular level of maturity.

Also Cassandra [successfully] implements multiple data center support,
is that available in SC or ES?

On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Hello Ali,

 I'm trying to setup a large scale *Crawl + Index + Search *infrastructure

 using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
 crawled + indexed every *4 weeks, *with a search latency of less than 0.5
 seconds.


 That's fine.  Whether it's doable with any tech will depend on how much 
 hardware you give it, among other things.

 Needless to mention, the search index needs to scale to 5Billion pages. It
 is also possible that I might need to store multiple indexes -- one for
 crawled content, and one for ancillary data that is also very large. Each
 of these indices would likely require a logically distributed and
 replicated index.


 Yup, OK.

 However, I would like for such a system to be homogenous with the Hadoop
 infrastructure that is already installed on the cluster (for the crawl). In
 other words, I would much prefer if the replication and distribution of the
 Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
 using another scalability framework (such as SolrCloud). In addition, it
 would be ideal if this environment was flexible enough to be dynamically
 scaled based on the size requirements of the index and the search traffic
 at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
 enough to automatically provision additional processing power into the
 cluster without requiring server re-starts).


 There is no such thing just yet.
 There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
 automatically index HBase content, but that was either not completed or not 
 committed into HBase.

 However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
 be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
 Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
 mature enough and would be the right architectural choice to go along with
 a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
 above.


 Here is a summary on all of them:
 * Search on HBase - I assume you are referring to the same thing I mentioned 
 above.  Not ready.
 * Solandra - uses Cassandra+Solr, plus DataStax now has a different 
 (commercial) offering that combines search and Cassandra.  Looks good.
 * Lily - data stored in HBase cluster gets indexed to a separate Solr 
 instance(s)  on the side.  Not really integrated the way you want it to be.
 * ElasticSearch - solid at this point, the most dynamic solution today, can 
  scale well (we are working on a many-B documents index and hundreds of 
 nodes with ElasticSearch right now), etc.  But again, not integrated with 
 Hadoop the way you want it.
 * IndexTank - has some technical weaknesses, not integrated with Hadoop, not 
 sure about its future considering LinkedIn uses Zoie and Sensei already.
 * And there is SolrCloud, which is coming soon and will be solid, but is 
 again not integrated.

 If I were you and I had to pick today - I'd pick ElasticSearch if I were 
 completely open.  If I had Solr bias I'd give SolrCloud a try first.

 Lastly, how much hardware (assuming a medium sized EC2 instance) would you
 estimate my needing with this setup, for regular web-data (HTML text) at
 this scale?

  I don't know off the top of my head, but I'm guessing several hundred for 
 serving search requests.

 HTH,

 Otis
 --
 Search Analytics - http://sematext.com/search-analytics/index.html

 Scalable Performance Monitoring - http://sematext.com/spm/index.html


 Any architectural guidance would be greatly appreciated. The more details
 provided, the wider my grin :).

 Many many

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-17 Thread Jason Rutherglen
 redistributing shards from oversubscribed nodes to other nodes

Redistributing shards on a live system is not possible, however, because
in-flight updates will likely be lost.  It is also not simple technology
to build from the ground up.

As it stands today, one would need to schedule downtime; for multi-terabyte
live realtime systems that is not acceptable and will cause the system
to miss its SLAs.

Solr Cloud seems limited to a simple hashing algorithm for sending
updates to the appropriate shard.  This is precisely what Dynamo (and
Cassandra) solves, e.g., by elastically and dynamically rearranging the
hash 'ring' both logically and physically.

In addition, there is the potential for data loss, which Cassandra also
has the technology to address.
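For anyone who has not looked at Dynamo, the difference between a fixed hash
and a rearrangeable hash ring is easy to see with a toy example. The sketch
below only illustrates consistent hashing (node names and the virtual-node
count are made up); it is not Cassandra's or Solr Cloud's actual routing code.
The point is that adding a node moves only the documents handed to that node,
whereas a plain hash-modulo-N scheme would remap most of them:

import bisect
import hashlib

def _h(key):
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class Ring(object):
    def __init__(self, nodes, vnodes=64):
        self._points = []   # sorted positions on the ring
        self._owner = {}    # position -> node name
        for n in nodes:
            self.add_node(n, vnodes)

    def add_node(self, node, vnodes=64):
        # Each node claims several points ("virtual nodes") on the ring.
        for i in range(vnodes):
            p = _h("%s#%d" % (node, i))
            bisect.insort(self._points, p)
            self._owner[p] = node

    def route(self, doc_id):
        # Walk clockwise from the document's hash to the first node point.
        i = bisect.bisect(self._points, _h(doc_id)) % len(self._points)
        return self._owner[self._points[i]]

ring = Ring(["node1", "node2", "node3"])
docs = ["doc-%d" % i for i in range(1000)]
before = dict((d, ring.route(d)) for d in docs)
ring.add_node("node4")
moved = sum(1 for d in docs if ring.route(d) != before[d])
# Roughly a quarter of the documents move to node4; with naive modulo hashing
# (hash % num_nodes), going from 3 to 4 nodes would remap about three quarters.
print("documents that moved: %d / %d" % (moved, len(docs)))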

On Tue, Apr 17, 2012 at 1:33 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 I think Jason is right - there is no index splitting in ES and SolrCloud, so 
 one has to think ahead, overshard, and then count on redistributing shards 
 from oversubscribed nodes to other nodes.  No resharding on demand and no 
 index/shard splitting yet.

 Otis
 
 Performance Monitoring SaaS for Solr - 
 http://sematext.com/spm/solr-performance-monitoring/index.html




 From: Jason Rutherglen jason.rutherg...@gmail.com
To: solr-user@lucene.apache.org
Sent: Monday, April 16, 2012 8:42 PM
Subject: Re: Options for automagically Scaling Solr (without needing 
distributed index/replication) in a Hadoop environment

One of big weaknesses of Solr Cloud (and ES?) is the lack of the
ability to redistribute shards across servers.  Meaning, as a single
shard grows too large, splitting the shard, while live updates.

How do you plan on elastically adding more servers without this feature?

Cassandra and HBase handle elasticity in their own ways.  Cassandra
has successfully implemented the Dynamo model and HBase uses the
traditional BigTable 'split'.  Both systems are complex though are at
a singular level of maturity.

Also Cassandra [successfully] implements multiple data center support,
is that available in SC or ES?

On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Hello Ali,

 I'm trying to setup a large scale *Crawl + Index + Search *infrastructure

 using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
 crawled + indexed every *4 weeks, *with a search latency of less than 0.5
 seconds.


 That's fine.  Whether it's doable with any tech will depend on how much 
 hardware you give it, among other things.

 Needless to mention, the search index needs to scale to 5Billion pages. It
 is also possible that I might need to store multiple indexes -- one for
 crawled content, and one for ancillary data that is also very large. Each
 of these indices would likely require a logically distributed and
 replicated index.


 Yup, OK.

 However, I would like for such a system to be homogenous with the Hadoop
 infrastructure that is already installed on the cluster (for the crawl). In
 other words, I would much prefer if the replication and distribution of the
 Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
 using another scalability framework (such as SolrCloud). In addition, it
 would be ideal if this environment was flexible enough to be dynamically
 scaled based on the size requirements of the index and the search traffic
 at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
 enough to automatically provision additional processing power into the
 cluster without requiring server re-starts).


 There is no such thing just yet.
 There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
 automatically index HBase content, but that was either not completed or not 
 committed into HBase.

 However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
 be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
 Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
 mature enough and would be the right architectural choice to go along with
 a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
 above.


 Here is a summary on all of them:
 * Search on HBase - I assume you are referring to the same thing I 
 mentioned above.  Not ready.
 * Solandra - uses Cassandra+Solr, plus DataStax now has a different 
 (commercial) offering that combines search and Cassandra.  Looks good.
 * Lily - data stored in HBase cluster gets indexed to a separate Solr 
 instance(s)  on the side.  Not really integrated the way you want it to be.
 * ElasticSearch - solid at this point, the most dynamic solution today, can 
  scale well (we are working on a many-B documents index and hundreds of 
 nodes with ElasticSearch right now), etc.  But again, not integrated with 
 Hadoop the way you want it.
 * IndexTank - has some technical weaknesses, not integrated with Hadoop, 
 not sure about its future considering

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-17 Thread Lukáš Vlček
Hi,

Speaking about ES, I think it would be fair to mention that one has to
specify the number of shards upfront when an index is created - that is
correct. However, it is possible to give an index one or more aliases, which
basically means that you can add new indices on the fly and give them the
same alias, which is then used to search against. Given that you can
add/remove indices, nodes and aliases on the fly, I think there is a way to
handle a growing data set with ease. If anyone is interested, such a scenario
has been discussed in detail on the ES mailing list.

Regards,
Lukas

On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

 One of big weaknesses of Solr Cloud (and ES?) is the lack of the
 ability to redistribute shards across servers.  Meaning, as a single
 shard grows too large, splitting the shard, while live updates.

 How do you plan on elastically adding more servers without this feature?

 Cassandra and HBase handle elasticity in their own ways.  Cassandra
 has successfully implemented the Dynamo model and HBase uses the
 traditional BigTable 'split'.  Both systems are complex though are at
 a singular level of maturity.

 Also Cassandra [successfully] implements multiple data center support,
 is that available in SC or ES?

 On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
  Hello Ali,
 
  I'm trying to setup a large scale *Crawl + Index + Search
 *infrastructure
 
  using Nutch and Solr/Lucene. The targeted scale is *5 Billion web
 pages*,
  crawled + indexed every *4 weeks, *with a search latency of less than
 0.5
  seconds.
 
 
  That's fine.  Whether it's doable with any tech will depend on how much
 hardware you give it, among other things.
 
  Needless to mention, the search index needs to scale to 5Billion pages.
 It
  is also possible that I might need to store multiple indexes -- one for
  crawled content, and one for ancillary data that is also very large.
 Each
  of these indices would likely require a logically distributed and
  replicated index.
 
 
  Yup, OK.
 
  However, I would like for such a system to be homogenous with the Hadoop
  infrastructure that is already installed on the cluster (for the
 crawl). In
  other words, I would much prefer if the replication and distribution of
 the
  Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead
 of
  using another scalability framework (such as SolrCloud). In addition, it
  would be ideal if this environment was flexible enough to be dynamically
  scaled based on the size requirements of the index and the search
 traffic
  at the time (i.e. if it is deployed on an Amazon cluster, it should be
 easy
  enough to automatically provision additional processing power into the
  cluster without requiring server re-starts).
 
 
  There is no such thing just yet.
  There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt
 to automatically index HBase content, but that was either not completed or
 not committed into HBase.
 
  However, I'm not sure which Solr-based tool in the Hadoop ecosystem
 would
  be ideal for this scenario. I've heard mention of Solr-on-HBase,
 Solandra,
  Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of
 these is
  mature enough and would be the right architectural choice to go along
 with
  a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling
 aspects
  above.
 
 
  Here is a summary on all of them:
  * Search on HBase - I assume you are referring to the same thing I
 mentioned above.  Not ready.
  * Solandra - uses Cassandra+Solr, plus DataStax now has a different
 (commercial) offering that combines search and Cassandra.  Looks good.
  * Lily - data stored in HBase cluster gets indexed to a separate Solr
 instance(s)  on the side.  Not really integrated the way you want it to be.
  * ElasticSearch - solid at this point, the most dynamic solution today,
  can scale well (we are working on a many-B documents index and hundreds
 of nodes with ElasticSearch right now), etc.  But again, not integrated
 with Hadoop the way you want it.
  * IndexTank - has some technical weaknesses, not integrated with Hadoop,
 not sure about its future considering LinkedIn uses Zoie and Sensei already.
  * And there is SolrCloud, which is coming soon and will be solid, but is
 again not integrated.
 
  If I were you and I had to pick today - I'd pick ElasticSearch if I were
 completely open.  If I had Solr bias I'd give SolrCloud a try first.
 
  Lastly, how much hardware (assuming a medium sized EC2 instance) would
 you
  estimate my needing with this setup, for regular web-data (HTML text) at
  this scale?
 
   I don't know off the top of my head, but I'm guessing several hundred
 for serving search requests.
 
  HTH,
 
  Otis
  --
  Search Analytics - http://sematext.com/search-analytics/index.html
 
  Scalable Performance Monitoring - http://sematext.com/spm/index.html
 
 
  Any architectural guidance would be 

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-16 Thread Jason Rutherglen
One of the big weaknesses of Solr Cloud (and ES?) is the lack of the
ability to redistribute shards across servers.  Meaning, as a single
shard grows too large, splitting that shard while live updates continue.

How do you plan on elastically adding more servers without this feature?

Cassandra and HBase handle elasticity in their own ways.  Cassandra
has successfully implemented the Dynamo model and HBase uses the
traditional BigTable 'split'.  Both systems are complex, though they are
at a singular level of maturity.

Also, Cassandra [successfully] implements multiple data center support;
is that available in SC or ES?

On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Hello Ali,

 I'm trying to setup a large scale *Crawl + Index + Search *infrastructure

 using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
 crawled + indexed every *4 weeks, *with a search latency of less than 0.5
 seconds.


 That's fine.  Whether it's doable with any tech will depend on how much 
 hardware you give it, among other things.

 Needless to mention, the search index needs to scale to 5Billion pages. It
 is also possible that I might need to store multiple indexes -- one for
 crawled content, and one for ancillary data that is also very large. Each
 of these indices would likely require a logically distributed and
 replicated index.


 Yup, OK.

 However, I would like for such a system to be homogenous with the Hadoop
 infrastructure that is already installed on the cluster (for the crawl). In
 other words, I would much prefer if the replication and distribution of the
 Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
 using another scalability framework (such as SolrCloud). In addition, it
 would be ideal if this environment was flexible enough to be dynamically
 scaled based on the size requirements of the index and the search traffic
 at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
 enough to automatically provision additional processing power into the
 cluster without requiring server re-starts).


 There is no such thing just yet.
 There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
 automatically index HBase content, but that was either not completed or not 
 committed into HBase.

 However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
 be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
 Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
 mature enough and would be the right architectural choice to go along with
 a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
 above.


 Here is a summary on all of them:
 * Search on HBase - I assume you are referring to the same thing I mentioned 
 above.  Not ready.
 * Solandra - uses Cassandra+Solr, plus DataStax now has a different 
 (commercial) offering that combines search and Cassandra.  Looks good.
 * Lily - data stored in HBase cluster gets indexed to a separate Solr 
 instance(s)  on the side.  Not really integrated the way you want it to be.
 * ElasticSearch - solid at this point, the most dynamic solution today, can 
  scale well (we are working on a many-B documents index and hundreds of 
 nodes with ElasticSearch right now), etc.  But again, not integrated with 
 Hadoop the way you want it.
 * IndexTank - has some technical weaknesses, not integrated with Hadoop, not 
 sure about its future considering LinkedIn uses Zoie and Sensei already.
 * And there is SolrCloud, which is coming soon and will be solid, but is 
 again not integrated.

 If I were you and I had to pick today - I'd pick ElasticSearch if I were 
 completely open.  If I had Solr bias I'd give SolrCloud a try first.

 Lastly, how much hardware (assuming a medium sized EC2 instance) would you
 estimate my needing with this setup, for regular web-data (HTML text) at
 this scale?

  I don't know off the top of my head, but I'm guessing several hundred for 
 serving search requests.

 HTH,

 Otis
 --
 Search Analytics - http://sematext.com/search-analytics/index.html

 Scalable Performance Monitoring - http://sematext.com/spm/index.html


 Any architectural guidance would be greatly appreciated. The more details
 provided, the wider my grin :).

 Many many thanks in advance.

 Thanks,
 Safdar



Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-15 Thread Jason Rutherglen
This was done in SOLR-1301 going on several years ago now.

On Sat, Apr 14, 2012 at 4:11 PM, Lance Norskog goks...@gmail.com wrote:
 It sounds like you really want the final map/reduce phase to put Solr
 index files into HDFS. Solr has a feature to do this called 'Embedded
 Solr'. This packages Solr as a library instead of an HTTP servlet. The
 Solr committers mostly hate it and want it to go away, but it is
 useful for exactly this problem.

 There is some integration work here, both to bolt ES to the Hadoop
 output libraries and also some trickery to write out the HDFS files.
 HDFS only appends and most of the codecs (Lucene segment formats) like
 to seek a lot. Then at the end it needs a way to tell SolrCloud about
 the files.

 If someone wants a great Summer Of Code project, Hadoop-Lucene
 indexes-SolrCloud would be a lot of fun and make you widely loved by
 people with money. I'm not kidding. Do a good job of this and write
 clean code, and you'll get offers for very cool jobs.

 On Sat, Apr 14, 2012 at 2:27 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
 Hello,

 Unfortunately I don't know when exactly SolrCloud release will be ready, but 
 we've used trunk versions in the past and didn't have major issues.

 Otis
 
 Performance Monitoring SaaS for Solr - 
 http://sematext.com/spm/solr-performance-monitoring/index.html




 From: Ali S Kureishy safdar.kurei...@gmail.com
To: Otis Gospodnetic otis_gospodne...@yahoo.com
Cc: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Friday, April 13, 2012 7:16 PM
Subject: Re: Options for automagically Scaling Solr (without needing 
distributed index/replication) in a Hadoop environment


Thanks Otis.


I really appreciate the details offered here. This was very helpful 
information.


I'm going to go through Solandra and Elastic Search and see if those make 
sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's two 
recommendations for SolrCloud so far), so I will give that a shot when it is 
available. However, do you know when SolrCloud IS expected to be available?


Thanks again!


Warm regards,
Safdar





On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

Hello Ali,


 I'm trying to setup a large scale *Crawl + Index + Search *infrastructure

 using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
 crawled + indexed every *4 weeks, *with a search latency of less than 0.5
 seconds.


That's fine.  Whether it's doable with any tech will depend on how much 
hardware you give it, among other things.


 Needless to mention, the search index needs to scale to 5Billion pages. It
 is also possible that I might need to store multiple indexes -- one for
 crawled content, and one for ancillary data that is also very large. Each
 of these indices would likely require a logically distributed and
 replicated index.


Yup, OK.


 However, I would like for such a system to be homogenous with the Hadoop
 infrastructure that is already installed on the cluster (for the crawl). 
 In
 other words, I would much prefer if the replication and distribution of 
 the
 Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
 using another scalability framework (such as SolrCloud). In addition, it
 would be ideal if this environment was flexible enough to be dynamically
 scaled based on the size requirements of the index and the search traffic
 at the time (i.e. if it is deployed on an Amazon cluster, it should be 
 easy
 enough to automatically provision additional processing power into the
 cluster without requiring server re-starts).


There is no such thing just yet.
There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
automatically index HBase content, but that was either not completed or not 
committed into HBase.


 However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
 be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
 Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these 
 is
 mature enough and would be the right architectural choice to go along with
 a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling 
 aspects
 above.


Here is a summary on all of them:
* Search on HBase - I assume you are referring to the same thing I 
mentioned above.  Not ready.
* Solandra - uses Cassandra+Solr, plus DataStax now has a different 
(commercial) offering that combines search and Cassandra.  Looks good.
* Lily - data stored in HBase cluster gets indexed to a separate Solr 
instance(s)  on the side.  Not really integrated the way you want it to be.
* ElasticSearch - solid at this point, the most dynamic solution today, can 
scale well (we are working on a many-B documents index and hundreds of 
nodes with ElasticSearch right now), etc.  But again, not integrated with 
Hadoop the way you want it.
* IndexTank - has some technical weaknesses

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-14 Thread Jan Høydahl
Hi,

This won't give you the performance you need, unless you have enough RAM on the 
Solr box to cache the whole index in memory.
Have you tested this yourself?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 12. apr. 2012, at 15:27, Darren Govoni wrote:

 You could use SolrCloud (for the automatic scaling) and just mount a
 fuse[1] HDFS directory and configure solr to use that directory for its
 data. 
 
 [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS
 
 On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:
 Hi,
 
 I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
 using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
 crawled + indexed every *4 weeks, *with a search latency of less than 0.5
 seconds.
 
 Needless to mention, the search index needs to scale to 5Billion pages. It
 is also possible that I might need to store multiple indexes -- one for
 crawled content, and one for ancillary data that is also very large. Each
 of these indices would likely require a logically distributed and
 replicated index.
 
 However, I would like for such a system to be homogenous with the Hadoop
 infrastructure that is already installed on the cluster (for the crawl). In
 other words, I would much prefer if the replication and distribution of the
 Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
 using another scalability framework (such as SolrCloud). In addition, it
 would be ideal if this environment was flexible enough to be dynamically
 scaled based on the size requirements of the index and the search traffic
 at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
 enough to automatically provision additional processing power into the
 cluster without requiring server re-starts).
 
 However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
 be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
 Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
 mature enough and would be the right architectural choice to go along with
 a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
 above.
 
 Lastly, how much hardware (assuming a medium sized EC2 instance) would you
 estimate my needing with this setup, for regular web-data (HTML text) at
 this scale?
 
 Any architectural guidance would be greatly appreciated. The more details
 provided, the wider my grin :).
 
 Many many thanks in advance.
 
 Thanks,
 Safdar
 
 



Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-14 Thread Otis Gospodnetic
Hello,

Unfortunately I don't know exactly when the SolrCloud release will be ready, but 
we've used trunk versions in the past and didn't have major issues.

Otis 

Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html




 From: Ali S Kureishy safdar.kurei...@gmail.com
To: Otis Gospodnetic otis_gospodne...@yahoo.com 
Cc: solr-user@lucene.apache.org solr-user@lucene.apache.org 
Sent: Friday, April 13, 2012 7:16 PM
Subject: Re: Options for automagically Scaling Solr (without needing 
distributed index/replication) in a Hadoop environment
 

Thanks Otis.


I really appreciate the details offered here. This was very helpful 
information.


I'm going to go through Solandra and Elastic Search and see if those make 
sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's two 
recommendations for SolrCloud so far), so I will give that a shot when it is 
available. However, do you know when SolrCloud IS expected to be available?


Thanks again!


Warm regards,
Safdar





On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic otis_gospodne...@yahoo.com 
wrote:

Hello Ali,


 I'm trying to setup a large scale *Crawl + Index + Search *infrastructure

 using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
 crawled + indexed every *4 weeks, *with a search latency of less than 0.5
 seconds.


That's fine.  Whether it's doable with any tech will depend on how much 
hardware you give it, among other things.


 Needless to mention, the search index needs to scale to 5Billion pages. It
 is also possible that I might need to store multiple indexes -- one for
 crawled content, and one for ancillary data that is also very large. Each
 of these indices would likely require a logically distributed and
 replicated index.


Yup, OK.


 However, I would like for such a system to be homogenous with the Hadoop
 infrastructure that is already installed on the cluster (for the crawl). In
 other words, I would much prefer if the replication and distribution of the
 Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
 using another scalability framework (such as SolrCloud). In addition, it
 would be ideal if this environment was flexible enough to be dynamically
 scaled based on the size requirements of the index and the search traffic
 at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
 enough to automatically provision additional processing power into the
 cluster without requiring server re-starts).


There is no such thing just yet.
There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
automatically index HBase content, but that was either not completed or not 
committed into HBase.


 However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
 be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
 Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
 mature enough and would be the right architectural choice to go along with
 a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
 above.


Here is a summary on all of them:
* Search on HBase - I assume you are referring to the same thing I mentioned 
above.  Not ready.
* Solandra - uses Cassandra+Solr, plus DataStax now has a different 
(commercial) offering that combines search and Cassandra.  Looks good.
* Lily - data stored in HBase cluster gets indexed to a separate Solr 
instance(s)  on the side.  Not really integrated the way you want it to be.
* ElasticSearch - solid at this point, the most dynamic solution today, can 
scale well (we are working on a many-B documents index and hundreds of 
nodes with ElasticSearch right now), etc.  But again, not integrated with 
Hadoop the way you want it.
* IndexTank - has some technical weaknesses, not integrated with Hadoop, not 
sure about its future considering LinkedIn uses Zoie and Sensei already.
* And there is SolrCloud, which is coming soon and will be solid, but is 
again not integrated.

If I were you and I had to pick today - I'd pick ElasticSearch if I were 
completely open.  If I had Solr bias I'd give SolrCloud a try first.


 Lastly, how much hardware (assuming a medium sized EC2 instance) would you
 estimate my needing with this setup, for regular web-data (HTML text) at
 this scale?

I don't know off the top of my head, but I'm guessing several hundred for 
serving search requests.

HTH,

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html

Scalable Performance Monitoring - http://sematext.com/spm/index.html



 Any architectural guidance would be greatly appreciated. The more details
 provided, the wider my grin :).

 Many many thanks in advance.

 Thanks,
 Safdar






Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-14 Thread Lance Norskog
It sounds like you really want the final map/reduce phase to put Solr
index files into HDFS. Solr has a feature to do this called 'Embedded
Solr'. This packages Solr as a library instead of an HTTP servlet. The
Solr committers mostly hate it and want it to go away, but it is
useful for exactly this problem.

There is some integration work here, both to bolt Embedded Solr onto the
Hadoop output libraries and some trickery to write out the HDFS files:
HDFS only appends, and most of the codecs (Lucene segment formats) like
to seek a lot. Then, at the end, it needs a way to tell SolrCloud about
the files.

If someone wants a great Summer of Code project, a Hadoop -> Lucene
indexes -> SolrCloud pipeline would be a lot of fun and make you widely
loved by people with money. I'm not kidding. Do a good job of this and
write clean code, and you'll get offers for very cool jobs.

On Sat, Apr 14, 2012 at 2:27 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Hello,

 Unfortunately I don't know when exactly SolrCloud release will be ready, but 
 we've used trunk versions in the past and didn't have major issues.

 Otis
 
 Performance Monitoring SaaS for Solr - 
 http://sematext.com/spm/solr-performance-monitoring/index.html




 From: Ali S Kureishy safdar.kurei...@gmail.com
To: Otis Gospodnetic otis_gospodne...@yahoo.com
Cc: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Friday, April 13, 2012 7:16 PM
Subject: Re: Options for automagically Scaling Solr (without needing 
distributed index/replication) in a Hadoop environment


Thanks Otis.


I really appreciate the details offered here. This was very helpful 
information.


I'm going to go through Solandra and Elastic Search and see if those make 
sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's two 
recommendations for SolrCloud so far), so I will give that a shot when it is 
available. However, do you know when SolrCloud IS expected to be available?


Thanks again!


Warm regards,
Safdar





On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

Hello Ali,


 I'm trying to setup a large scale *Crawl + Index + Search *infrastructure

 using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
 crawled + indexed every *4 weeks, *with a search latency of less than 0.5
 seconds.


That's fine.  Whether it's doable with any tech will depend on how much 
hardware you give it, among other things.


 Needless to mention, the search index needs to scale to 5Billion pages. It
 is also possible that I might need to store multiple indexes -- one for
 crawled content, and one for ancillary data that is also very large. Each
 of these indices would likely require a logically distributed and
 replicated index.


Yup, OK.


 However, I would like for such a system to be homogenous with the Hadoop
 infrastructure that is already installed on the cluster (for the crawl). In
 other words, I would much prefer if the replication and distribution of the
 Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
 using another scalability framework (such as SolrCloud). In addition, it
 would be ideal if this environment was flexible enough to be dynamically
 scaled based on the size requirements of the index and the search traffic
 at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
 enough to automatically provision additional processing power into the
 cluster without requiring server re-starts).


There is no such thing just yet.
There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
automatically index HBase content, but that was either not completed or not 
committed into HBase.


 However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
 be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
 Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
 mature enough and would be the right architectural choice to go along with
 a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
 above.


Here is a summary on all of them:
* Search on HBase - I assume you are referring to the same thing I mentioned 
above.  Not ready.
* Solandra - uses Cassandra+Solr, plus DataStax now has a different 
(commercial) offering that combines search and Cassandra.  Looks good.
* Lily - data stored in HBase cluster gets indexed to a separate Solr 
instance(s)  on the side.  Not really integrated the way you want it to be.
* ElasticSearch - solid at this point, the most dynamic solution today, can 
scale well (we are working on a many-B documents index and hundreds of 
nodes with ElasticSearch right now), etc.  But again, not integrated with 
Hadoop the way you want it.
* IndexTank - has some technical weaknesses, not integrated with Hadoop, not 
sure about its future considering LinkedIn uses Zoie and Sensei already.
* And there is SolrCloud, which is coming soon and will be solid, but is 
again not integrated.

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-13 Thread Jan Høydahl
Hi,

For a web crawl+search like this you will probably need a lot of additional Big 
Data crunching, so a Hadoop-based solution is wise.

In addition to the products already mentioned, we also now have Amazon's own 
CloudSearch, http://aws.amazon.com/cloudsearch/. It's new and not as cool as Solr 
(not even Lucene based), but it gives you the elasticity you request, I guess. If 
you run your Hadoop cluster in EC2 already it would be quite efficient to 
batch-load the crawled and processed data into a SearchDomain in the same 
availability zone. However, both cost and features may prohibit this as a 
realistic choice for you.

It would be cool to explore a Hadoop/HDFS + SolrCloud integration. SolrCloud 
would not build the indexes, but would pull pre-built indexes from HDFS down to 
local disk every time it is told to. Or perhaps the SolrCloud nodes could be 
part of the Hadoop cluster, being responsible for the reduce phase that builds 
the indexes?
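
A very rough sketch of that pull model, assuming the shard was already written 
to HDFS by the indexing job and that the node knows which shard directory is 
its own; the HDFS path, local data dir, core name and Solr URL below are all 
made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class IndexPuller {
  public static void main(String[] args) throws Exception {
    // Copy the pre-built segments from HDFS down to local disk.
    // (In practice you would copy into a fresh directory and swap it in,
    // rather than overwrite a live index.)
    FileSystem fs = FileSystem.get(new Configuration());
    fs.copyToLocalFile(new Path("/indexes/shard-0"),
                       new Path("/var/solr/collection1/data/index"));

    // Then ask the local Solr node to reopen its searcher on the new files.
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
    CoreAdminRequest.reloadCore("collection1", solr);
  }
}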

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 13. apr. 2012, at 04:23, Otis Gospodnetic wrote:

 Hello Ali,
 
 I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
 
 using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
 crawled + indexed every *4 weeks, *with a search latency of less than 0.5
 seconds.
 
 
 That's fine.  Whether it's doable with any tech will depend on how much 
 hardware you give it, among other things.
 
 Needless to mention, the search index needs to scale to 5Billion pages. It
 is also possible that I might need to store multiple indexes -- one for
 crawled content, and one for ancillary data that is also very large. Each
 of these indices would likely require a logically distributed and
 replicated index.
 
 
 Yup, OK.
 
 However, I would like for such a system to be homogenous with the Hadoop
 infrastructure that is already installed on the cluster (for the crawl). In
 other words, I would much prefer if the replication and distribution of the
 Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
 using another scalability framework (such as SolrCloud). In addition, it
 would be ideal if this environment was flexible enough to be dynamically
 scaled based on the size requirements of the index and the search traffic
 at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
 enough to automatically provision additional processing power into the
 cluster without requiring server re-starts).
 
 
 There is no such thing just yet.
 There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
 automatically index HBase content, but that was either not completed or not 
 committed into HBase.
 
 However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
 be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
 Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
 mature enough and would be the right architectural choice to go along with
 a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
 above.
 
 
 Here is a summary on all of them:
 * Search on HBase - I assume you are referring to the same thing I mentioned 
 above.  Not ready.
 * Solandra - uses Cassandra+Solr, plus DataStax now has a different 
 (commercial) offering that combines search and Cassandra.  Looks good.
 * Lily - data stored in HBase cluster gets indexed to a separate Solr 
 instance(s)  on the side.  Not really integrated the way you want it to be.
 * ElasticSearch - solid at this point, the most dynamic solution today, can 
 scale well (we are working on a many-B documents index and hundreds of 
 nodes with ElasticSearch right now), etc.  But again, not integrated with 
 Hadoop the way you want it.
 * IndexTank - has some technical weaknesses, not integrated with Hadoop, not 
 sure about its future considering LinkedIn uses Zoie and Sensei already.
 * And there is SolrCloud, which is coming soon and will be solid, but is 
 again not integrated.
 
 If I were you and I had to pick today - I'd pick ElasticSearch if I were 
 completely open.  If I had Solr bias I'd give SolrCloud a try first.
 
 Lastly, how much hardware (assuming a medium sized EC2 instance) would you
 estimate my needing with this setup, for regular web-data (HTML text) at
 this scale?
 
 I don't know off the top of my head, but I'm guessing several hundred for 
 serving search requests.
 
 HTH,
 
 Otis
 --
 Search Analytics - http://sematext.com/search-analytics/index.html
 
 Scalable Performance Monitoring - http://sematext.com/spm/index.html
 
 
 Any architectural guidance would be greatly appreciated. The more details
 provided, the wider my grin :).
 
 Many many thanks in advance.
 
 Thanks,
 Safdar
 



Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-13 Thread Ali S Kureishy
Thanks Otis.

I really appreciate the details offered here. This was very helpful
information.

I'm going to go through Solandra and Elastic Search and see if those make
sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's
two recommendations for SolrCloud so far), so I will give that a shot when
it is available. However, do you know when SolrCloud IS expected to be
available?

Thanks again!

Warm regards,
Safdar



On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

 Hello Ali,

  I'm trying to setup a large scale *Crawl + Index + Search *infrastructure

  using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
  crawled + indexed every *4 weeks, *with a search latency of less than 0.5
  seconds.


 That's fine.  Whether it's doable with any tech will depend on how much
 hardware you give it, among other things.

  Needless to mention, the search index needs to scale to 5Billion pages.
 It
  is also possible that I might need to store multiple indexes -- one for
  crawled content, and one for ancillary data that is also very large. Each
  of these indices would likely require a logically distributed and
  replicated index.


 Yup, OK.

  However, I would like for such a system to be homogenous with the Hadoop
  infrastructure that is already installed on the cluster (for the crawl).
 In
  other words, I would much prefer if the replication and distribution of
 the
  Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
  using another scalability framework (such as SolrCloud). In addition, it
  would be ideal if this environment was flexible enough to be dynamically
  scaled based on the size requirements of the index and the search traffic
  at the time (i.e. if it is deployed on an Amazon cluster, it should be
 easy
  enough to automatically provision additional processing power into the
  cluster without requiring server re-starts).


 There is no such thing just yet.
 There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to
 automatically index HBase content, but that was either not completed or not
 committed into HBase.

  However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
  be ideal for this scenario. I've heard mention of Solr-on-HBase,
 Solandra,
  Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these
 is
  mature enough and would be the right architectural choice to go along
 with
  a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling
 aspects
  above.


 Here is a summary on all of them:
 * Search on HBase - I assume you are referring to the same thing I
 mentioned above.  Not ready.
 * Solandra - uses Cassandra+Solr, plus DataStax now has a different
 (commercial) offering that combines search and Cassandra.  Looks good.
 * Lily - data stored in HBase cluster gets indexed to a separate Solr
 instance(s)  on the side.  Not really integrated the way you want it to be.
 * ElasticSearch - solid at this point, the most dynamic solution today,
 can scale well (we are working on a many-B documents index and hundreds
 of nodes with ElasticSearch right now), etc.  But again, not integrated
 with Hadoop the way you want it.
 * IndexTank - has some technical weaknesses, not integrated with Hadoop,
 not sure about its future considering LinkedIn uses Zoie and Sensei already.
 * And there is SolrCloud, which is coming soon and will be solid, but is
 again not integrated.

 If I were you and I had to pick today - I'd pick ElasticSearch if I were
 completely open.  If I had Solr bias I'd give SolrCloud a try first.

  Lastly, how much hardware (assuming a medium sized EC2 instance) would
 you
  estimate my needing with this setup, for regular web-data (HTML text) at
  this scale?

 I don't know off the top of my head, but I'm guessing several hundred
 for serving search requests.

 HTH,

 Otis
 --
 Search Analytics - http://sematext.com/search-analytics/index.html

 Scalable Performance Monitoring - http://sematext.com/spm/index.html


  Any architectural guidance would be greatly appreciated. The more details
  provided, the wider my grin :).
 
  Many many thanks in advance.
 
  Thanks,
  Safdar
 



Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-12 Thread Darren Govoni
You could use SolrCloud (for the automatic scaling) and just mount a
fuse[1] HDFS directory and configure Solr to use that directory for its
data.

[1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS
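
Purely as a sketch of that setup (it assumes hadoop-fuse-dfs is already mounted 
at /mnt/hdfs, and the core name and paths are invented), the core could be 
created with its dataDir pointing at the mount via the CoreAdmin API:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class FuseBackedCore {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

    CoreAdminRequest.Create create = new CoreAdminRequest.Create();
    create.setCoreName("webcrawl");                      // invented core name
    create.setInstanceDir("webcrawl");                   // conf/ stays on local disk
    create.setDataDir("/mnt/hdfs/solr/webcrawl/data");   // index files land on the FUSE mount
    create.process(solr);
  }
}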

On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:
 Hi,
 
 I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
 using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
 crawled + indexed every *4 weeks, *with a search latency of less than 0.5
 seconds.
 
 Needless to mention, the search index needs to scale to 5Billion pages. It
 is also possible that I might need to store multiple indexes -- one for
 crawled content, and one for ancillary data that is also very large. Each
 of these indices would likely require a logically distributed and
 replicated index.
 
 However, I would like for such a system to be homogenous with the Hadoop
 infrastructure that is already installed on the cluster (for the crawl). In
 other words, I would much prefer if the replication and distribution of the
 Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
 using another scalability framework (such as SolrCloud). In addition, it
 would be ideal if this environment was flexible enough to be dynamically
 scaled based on the size requirements of the index and the search traffic
 at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
 enough to automatically provision additional processing power into the
 cluster without requiring server re-starts).
 
 However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
 be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
 Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
 mature enough and would be the right architectural choice to go along with
 a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
 above.
 
 Lastly, how much hardware (assuming a medium sized EC2 instance) would you
 estimate my needing with this setup, for regular web-data (HTML text) at
 this scale?
 
 Any architectural guidance would be greatly appreciated. The more details
 provided, the wider my grin :).
 
 Many many thanks in advance.
 
 Thanks,
 Safdar




Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-12 Thread Ali S Kureishy
Thanks Darren.

Actually, I would like the system to be homogenous - i.e., use Hadoop-based
tools that already provide all the necessary scaling for the Lucene index
(in terms of throughput, latency of writes/reads, etc.). Since SolrCloud adds
its own layer of sharding/replication that is outside Hadoop, I feel that
using SolrCloud would be redundant, and a step in the opposite
direction, which is what I'm trying to avoid in the first place. Or am I
mistaken?

Thanks,
Safdar


On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni dar...@ontrenet.com wrote:

 You could use SolrCloud (for the automatic scaling) and just mount a
 fuse[1] HDFS directory and configure solr to use that directory for its
 data.

 [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS

 On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:
  Hi,
 
  I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
  using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
  crawled + indexed every *4 weeks, *with a search latency of less than 0.5
  seconds.
 
  Needless to mention, the search index needs to scale to 5Billion pages.
 It
  is also possible that I might need to store multiple indexes -- one for
  crawled content, and one for ancillary data that is also very large. Each
  of these indices would likely require a logically distributed and
  replicated index.
 
  However, I would like for such a system to be homogenous with the Hadoop
  infrastructure that is already installed on the cluster (for the crawl).
 In
  other words, I would much prefer if the replication and distribution of
 the
  Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
  using another scalability framework (such as SolrCloud). In addition, it
  would be ideal if this environment was flexible enough to be dynamically
  scaled based on the size requirements of the index and the search traffic
  at the time (i.e. if it is deployed on an Amazon cluster, it should be
 easy
  enough to automatically provision additional processing power into the
  cluster without requiring server re-starts).
 
  However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
  be ideal for this scenario. I've heard mention of Solr-on-HBase,
 Solandra,
  Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these
 is
  mature enough and would be the right architectural choice to go along
 with
  a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling
 aspects
  above.
 
  Lastly, how much hardware (assuming a medium sized EC2 instance) would
 you
  estimate my needing with this setup, for regular web-data (HTML text) at
  this scale?
 
  Any architectural guidance would be greatly appreciated. The more details
  provided, the wider my grin :).
 
  Many many thanks in advance.
 
  Thanks,
  Safdar





RE: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-12 Thread Darren Govoni

SolrCloud or any other tech-specific replication isn't going to 'just work' with 
Hadoop replication. But with some significant custom coding anything should be 
possible. Interesting idea.



Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-12 Thread Otis Gospodnetic
Hello Ali,

 I'm trying to setup a large scale *Crawl + Index + Search *infrastructure

 using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
 crawled + indexed every *4 weeks, *with a search latency of less than 0.5
 seconds.


That's fine.  Whether it's doable with any tech will depend on how much 
hardware you give it, among other things.

 Needless to mention, the search index needs to scale to 5Billion pages. It
 is also possible that I might need to store multiple indexes -- one for
 crawled content, and one for ancillary data that is also very large. Each
 of these indices would likely require a logically distributed and
 replicated index.


Yup, OK.

 However, I would like for such a system to be homogenous with the Hadoop
 infrastructure that is already installed on the cluster (for the crawl). In
 other words, I would much prefer if the replication and distribution of the
 Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
 using another scalability framework (such as SolrCloud). In addition, it
 would be ideal if this environment was flexible enough to be dynamically
 scaled based on the size requirements of the index and the search traffic
 at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
 enough to automatically provision additional processing power into the
 cluster without requiring server re-starts).


There is no such thing just yet.
There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
automatically index HBase content, but that was either not completed or not 
committed into HBase.

 However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
 be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
 Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
 mature enough and would be the right architectural choice to go along with
 a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
 above.


Here is a summary on all of them:
* Search on HBase - I assume you are referring to the same thing I mentioned 
above.  Not ready.
* Solandra - uses Cassandra+Solr, plus DataStax now has a different 
(commercial) offering that combines search and Cassandra.  Looks good.
* Lily - data stored in HBase cluster gets indexed to a separate Solr 
instance(s)  on the side.  Not really integrated the way you want it to be.
* ElasticSearch - solid at this point, the most dynamic solution today, can 
scale well (we are working on a many-B documents index and hundreds of nodes 
with ElasticSearch right now), etc.  But again, not integrated with Hadoop the 
way you want it.
* IndexTank - has some technical weaknesses, not integrated with Hadoop, not 
sure about its future considering LinkedIn uses Zoie and Sensei already.
* And there is SolrCloud, which is coming soon and will be solid, but is again 
not integrated.

If I were you and I had to pick today - I'd pick ElasticSearch if I were 
completely open.  If I had Solr bias I'd give SolrCloud a try first.

 Lastly, how much hardware (assuming a medium sized EC2 instance) would you
 estimate my needing with this setup, for regular web-data (HTML text) at
 this scale?

I don't know off the top of my head, but I'm guessing several hundred for 
serving search requests.

HTH,

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html

Scalable Performance Monitoring - http://sematext.com/spm/index.html


 Any architectural guidance would be greatly appreciated. The more details
 provided, the wider my grin :).
 
 Many many thanks in advance.
 
 Thanks,
 Safdar