Re: Advise for choice

2010-01-08 Thread Erich Nachbar
I can give you a few more data points. For one of my last projects, I
built the search index of one of the largest IM aggregators. I got
around 2.5k chat msg/s, keeping 400M messages in my index.

I looked at Solr and while it is very convenient/luxurious, there was
no way in hell I could scale it this big. I ended up using Katta to
serve the index with Hadoop to compute my index shards.
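
A minimal sketch of the partitioning step that a batch job computing index shards might perform, assuming documents are assigned to shards by a stable hash of their id. The function names `shard_for` and `partition` are hypothetical and are not part of Katta or Hadoop; the actual job described above may have partitioned differently.

```python
import zlib
from collections import defaultdict


def shard_for(doc_id: str, num_shards: int) -> int:
    """Return a stable shard number for a document id (CRC32 modulo shard count)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def partition(docs, num_shards):
    """Group (doc_id, text) pairs by shard number.

    Hypothetical stand-in for the partition phase of a batch indexing job:
    each resulting group would be built into one index shard.
    """
    shards = defaultdict(list)
    for doc_id, text in docs:
        shards[shard_for(doc_id, num_shards)].append((doc_id, text))
    return dict(shards)
```

Because the shard number depends only on the document id, re-running the batch places each message in the same shard as long as the shard count is unchanged.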

While the whole system is batch oriented, I got my latency down to
2 min (the time for a doc to show up in the index), as long as fewer
than 8k chat messages/s were coming in.

Katta handles replication and node failover (it uses ZooKeeper) and can
be scaled easily by adding nodes and increasing the replication factor.
In contrast to Solr, scale was not one of the things I had to worry about.

Like others have said, unless you provide a lot more specifics it will
be hard to give you detailed recommendations.

Hope this helps!
-Erich



Re: Advise for choice

2010-01-08 Thread scott w
Good point, although there has been very recent work integrating Solr with
Katta, so you can have your cake and eat it too:

http://developer.yahoo.net/blogs/theater/archives/2009/12/hadoop_bay_area_user_group_session_1.html




Re: Advise for choice

2010-01-07 Thread Tatu Saloranta
On Thu, Jan 7, 2010 at 3:16 AM, Richard Grossman richie...@gmail.com wrote:
 Hi,

 This message is a little different from a support request.
 I'm facing a situation where people want to replace Cassandra with a Solr
 server. I really think that our problem is a great case for Cassandra, but I
 need more arguments.

 So please, if you have some time, share some ideas on why to use Cassandra
 instead of Solr.

A solution is generally applicable to a problem... so what is the (main) use case?

That would make it easier to find arguments for or against the proposed solution.

-+ Tatu +-


Re: Advise for choice

2010-01-07 Thread Ian Holsman
Positives for Solr:
- mature and stable
- lots of documentation
- a Swiss Army knife that can be used for a LOT of things, especially if you are
manipulating a lot of text
- the query language is easier to use (IMHO, but I've been using Solr for
years, so I am biased)
- lots of people know it
- fast caching
- faceting

Cons for Solr:
- hard to update a single field (you need to fetch and re-insert the entire row)
- commits/optimizes can slow things down to a crawl
- can't store structured data easily (for example, a blog post has tags which
have both a key and a value)
- scalability isn't as easy as with Cassandra. Sharding works, but it requires a
lot of manual effort
- it's easy to get started and get something running, but if you need to do
something out of the ordinary, it gets hard fast. I think Cassandra is more
flexible for ordinary things that don't involve text matching.
- replication isn't instant (this is changing; also look at Zoie, which may
help)
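
To illustrate the first con in the list above, here is a minimal sketch of what "fetch and re-insert the entire row" means in practice. A plain dict stands in for the index; `update_field` is a hypothetical helper, not a Solr API, and it assumes every field of the document is stored and can be read back.

```python
def update_field(index: dict, doc_id: str, field: str, value) -> None:
    """Change one field by re-inserting the whole document.

    `index` is a dict standing in for a search index that only supports
    whole-document replacement (an assumption made for illustration).
    """
    doc = dict(index[doc_id])  # fetch the full stored document (as a copy)
    doc[field] = value         # change the single field locally
    index[doc_id] = doc        # re-insert the entire document


index = {"post-1": {"title": "Hello", "body": "some text", "views": 1}}
update_field(index, "post-1", "views", 2)
```

The cost of changing one small field therefore grows with the size of the whole document, which is why frequent single-field updates are painful in this model.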

Of course, if you tell us what you're trying to do, I can be more specific.
FWIW, we use Solr for some of our news content (see love.com and
newsrunner.com) and it works fast enough for us.
We have an incoming doc rate of about 8-10 news articles/second.

On Jan 8, 2010, at 5:43 AM, Nathan McCall wrote:

 Agreed that there is not much to go on here in the original question.
 I will say that we very recently found a good fit with Solr and
 Cassandra in how we deal with a very heavy write volume of news
 article data. Cassandra is excellent with write throughput and high
 availability, but our search use cases are with time-dependent news
 content, so we need lots of term proximity, faceting and ordering
 functionality.
 
 We probably could store everything in Solr, but the above approach
 will allow us to make articles immediately available in a
 fault-tolerant manner while being able to efficiently send batches at
 regular intervals to Solr and therefore scale out our ingestion of
 news articles a little smoother. Full disclosure: I am still getting
 my head around the innards of Solr replication and clustering, but so
 far I feel like we made a good choice.
 
 Hopefully the above will be helpful to folks during their evaluations.
 
 Cheers,
 -Nate
 
 
 On Thu, Jan 7, 2010 at 10:02 AM, Joseph Bowman bowman.jos...@gmail.com 
 wrote:
 I have to agree with Tatu. If you're struggling to find reasons to validate
 that Cassandra is the better choice for your task than Solr, then perhaps
 Solr is the correct choice. I kind of went through the same thing recently,
 struggled to make Cassandra fit what I was doing, then realized I was doing
 it wrong and moved to MongoDB.
 Cassandra is great at what it tries to accomplish, which is managing
 gigantic datasets in a distributed way. The question is, is that really what
 you need?
 
 

--
Ian Holsman
i...@holsman.net





Re: Advise for choice

2010-01-07 Thread Tatu Saloranta
On Thu, Jan 7, 2010 at 10:43 AM, Nathan McCall n...@vervewireless.com wrote:
 Agreed that there is not much to go on here in the original question.
 I will say that we very recently found a good fit with Solr and
 Cassandra in how we deal with a very heavy write volume of news
 article data. Cassandra is excellent with write throughput and high
 availability, but our search use cases are with time-dependent news
 content, so we need lots of term proximity, faceting and ordering
 functionality.

 We probably could store everything in Solr, but the above approach
...

I think that in many (most?) cases, optimal solutions for searching
and lookups are different.

Traditionally this has meant that instead of trying to cram everything
into Oracle (or MySQL, Postgres) with its built-in
not-quite-as-good-as-Lucene text indexer, you do the right thing and use
both: a DB for storing data, for lookups and aggregates; and a search index
for full-text searches. For some reason it seems to be a very unintuitive
notion to use two tools instead of one, when they have different sweet
spots.
And going forward, similar trade-offs are needed between 'traditional'
RDBMSs, newer distributed high-availability eventually consistent data
stores (with multiple variations, from simple lookup to sorted access),
search index processing, and batch-oriented processing (Hadoop /
map/reduce).
Trying to do too many things using just one kind of tool tends to lead
to scalability and maintenance problems.
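
A minimal sketch of the two-tool split described above, with in-memory stand-ins: a key-value `DocumentStore` playing the role of the data store (lookups by id) and a tiny inverted index, `SearchIndex`, playing the role of the full-text engine. All class and function names are hypothetical; a real system would use an actual store and an actual search server behind the same two roles.

```python
from collections import defaultdict


class DocumentStore:
    """Stand-in for the data store: authoritative copy, lookup by id."""

    def __init__(self):
        self._rows = {}

    def put(self, doc_id, doc):
        self._rows[doc_id] = doc

    def get(self, doc_id):
        return self._rows.get(doc_id)


class SearchIndex:
    """Stand-in for the search engine: maps terms to document ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every term in the query."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


def save(store, index, doc_id, doc):
    """Write the document to the store and index its body for search."""
    store.put(doc_id, doc)
    index.add(doc_id, doc["body"])
```

A query then goes to the index first to get matching ids, and the full records are fetched from the store by id, so each tool only does the job it is good at.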

I am actually trying to decide, in a similar case, which tools (from a loose
set of Cassandra, Lucene/Solr, Voldemort) to use to handle processing
of large amounts of data, and I'm pretty sure I will end up using more
than just one.

-+ Tatu +-


Re: Advise for choice

2010-01-07 Thread Richard Grossman
First, thanks to all for your answers; they really help to check all the aspects.


   - In fact, the system we want to build has to manage a lot of data, but
   not in a heavily transactional way. Solr can handle the data but doesn't have
   a distributed way to serve it. But in my case it's always possible to just
   duplicate the data; then we can load-balance the queries between multiple
   server instances.


   - We load a large set of data once a week, and all this data is going to be
   used as is, without modification, update or delete. On this point, loading
   the data into Solr is very easy, because we make a CSV file and that's it,
   it's in.


   - The data needs to be structured, but not like a relational
   database. Obviously Solr doesn't fit the data structure required: it forces
   us to de-normalize a lot of data and build something like one very, very big
   table, and it also forces us to build very difficult Lucene queries.


   - The speed of querying the data is critical because the application is
   internet oriented and we expect a lot of queries per minute. On this point
   the problem is that, with the same amount of data, Solr has been faster than
   Cassandra, but of course the data structure is not the same.
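
As a rough illustration of the de-normalization and CSV steps from the list above, the sketch below flattens nested records into one wide row per child item and writes them out as CSV. The record shape (`id`, `name`, `items` with `sku` and `price`) is purely hypothetical, since the actual data model is not described in this thread.

```python
import csv
import io


def flatten(records):
    """Yield one flat row per (record, item) pair, repeating the parent fields."""
    for rec in records:
        for item in rec["items"]:
            yield {
                "id": f'{rec["id"]}-{item["sku"]}',  # unique key per flat row
                "parent_id": rec["id"],
                "name": rec["name"],
                "sku": item["sku"],
                "price": item["price"],
            }


def to_csv(rows, fieldnames):
    """Serialize flat rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

The parent fields are repeated on every row, which is exactly the "one very big table" effect: simple to load in a weekly batch, but queries that need to reason about the original nesting become harder to write.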

It seems that in the end we'll go, as Tatu says, with a hybrid solution mixing
Solr and Cassandra. I'm not sure it's the best in our case.

Thanks