Re: Advise for choice
I can give you a few more data points. For one of my last projects, I built the search index of one of the largest IM aggregators. I got around 2.5k chat msg/s, keeping 400M messages in my index. I looked at Solr, and while it is very convenient/luxurious, there was no way in hell I could scale it this big. I ended up using Katta to serve the index, with Hadoop to compute my index shards. While the whole system is batch oriented, I got my latency down to 2 min (the time for a doc to show up in the index), as long as fewer than 8k chat messages/s were coming in. Katta handles replication and node failover (it uses ZooKeeper) and can be scaled easily by adding nodes or increasing the replication factor. In comparison to Solr, scale was not one of the things I had to worry about.

Like others have said, unless you provide a lot more specifics it will be hard to give you detailed recommendations.

Hope this helps!
-Erich

On Thu, Jan 7, 2010 at 11:31 PM, Richard Grossman richie...@gmail.com wrote:

First, thanks to all for your answers; they really help to check all the aspects. In fact, the system we want to build has to manage a lot of data, but not in a heavily transactional way. Solr can handle the data but has no distributed way to serve it. In my case, though, it's always possible to just duplicate the data and then load-balance the queries between multiple server instances. We load a large set of data once a week, and all this data is then used as is, without modification, update, or delete. On this point, loading the data into Solr is very easy: we make a CSV file and that's it, it's in. The data needs to be structured, but not like a relational database. Obviously Solr doesn't fit the data structure we require: it forces us to denormalize a lot of data and build what is essentially one very, very big table, and it also forces us to build very difficult Lucene queries. Query speed is critical because the application is internet oriented and we expect a lot of queries per minute. On this point, the problem is that with the same amount of data Solr has been faster than Cassandra, but of course the data structure is not the same. It seems that in the end we'll do as Tatu says and have a hybrid solution mixing Solr and Cassandra. I'm not sure it's the best in our case. Thanks
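As a rough illustration of the replication idea Erich describes (index shards served by several nodes, so that one node can fail without losing coverage), here is a minimal Python sketch. It is not Katta's actual placement algorithm or API; the function names, node names, and the round-robin placement rule are all assumptions made for illustration.

```python
from collections import defaultdict


def assign_shards(shards, nodes, replication_factor):
    """Place each shard on `replication_factor` distinct nodes, round-robin.

    Hypothetical placement rule, chosen only because it is easy to follow.
    """
    if replication_factor > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = defaultdict(list)
    for i, shard in enumerate(shards):
        for r in range(replication_factor):
            # Consecutive offsets guarantee the replicas land on distinct nodes.
            node = nodes[(i + r) % len(nodes)]
            placement[node].append(shard)
    return dict(placement)


def surviving_shards(placement, failed_node):
    """Return the set of shards still served after one node fails."""
    return {
        shard
        for node, node_shards in placement.items()
        if node != failed_node
        for shard in node_shards
    }


shards = [f"shard-{n}" for n in range(6)]
nodes = ["node-a", "node-b", "node-c"]
placement = assign_shards(shards, nodes, replication_factor=2)
still_served = surviving_shards(placement, "node-a")
```

With a replication factor of 2, every shard lives on two nodes, so losing any single node still leaves the whole index searchable. Raising the factor buys more failover headroom at the cost of storage on each node.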
Re: Advise for choice
Good point, although there has been very recent work integrating Solr with Katta, so you can have your cake and eat it too: http://developer.yahoo.net/blogs/theater/archives/2009/12/hadoop_bay_area_user_group_session_1.html

On Fri, Jan 8, 2010 at 1:09 AM, Erich Nachbar er...@nachbar.biz wrote:

I can give you a few more data points. For one of my last projects, I built the search index of one of the largest IM aggregators. I got around 2.5k chat msg/s, keeping 400M messages in my index. I looked at Solr, and while it is very convenient/luxurious, there was no way in hell I could scale it this big. I ended up using Katta to serve the index, with Hadoop to compute my index shards. While the whole system is batch oriented, I got my latency down to 2 min (the time for a doc to show up in the index), as long as fewer than 8k chat messages/s were coming in. Katta handles replication and node failover (it uses ZooKeeper) and can be scaled easily by adding nodes or increasing the replication factor. In comparison to Solr, scale was not one of the things I had to worry about.

Like others have said, unless you provide a lot more specifics it will be hard to give you detailed recommendations.

Hope this helps!
-Erich

On Thu, Jan 7, 2010 at 11:31 PM, Richard Grossman richie...@gmail.com wrote:

First, thanks to all for your answers; they really help to check all the aspects. In fact, the system we want to build has to manage a lot of data, but not in a heavily transactional way. Solr can handle the data but has no distributed way to serve it. In my case, though, it's always possible to just duplicate the data and then load-balance the queries between multiple server instances. We load a large set of data once a week, and all this data is then used as is, without modification, update, or delete. On this point, loading the data into Solr is very easy: we make a CSV file and that's it, it's in. The data needs to be structured, but not like a relational database. Obviously Solr doesn't fit the data structure we require: it forces us to denormalize a lot of data and build what is essentially one very, very big table, and it also forces us to build very difficult Lucene queries. Query speed is critical because the application is internet oriented and we expect a lot of queries per minute. On this point, the problem is that with the same amount of data Solr has been faster than Cassandra, but of course the data structure is not the same. It seems that in the end we'll do as Tatu says and have a hybrid solution mixing Solr and Cassandra. I'm not sure it's the best in our case. Thanks
Re: Advise for choice
On Thu, Jan 7, 2010 at 3:16 AM, Richard Grossman richie...@gmail.com wrote:

Hi, This message is a little different from a support request. I'm confronted with a problem where people want to replace Cassandra with a Solr server. I really think that our problem is a great case for Cassandra, but I need more arguments. So please, if you have some time, just share some ideas on why to use Cassandra instead of Solr.

A solution is generally applicable to a problem... so what is the (main) use case? That would make it easier to find arguments for or against the proposed solution.

-+ Tatu +-
Re: Advise for choice
Things positive for Solr:
- mature and stable
- lots of documentation
- a Swiss Army knife that can be used for a LOT of things, especially if you are manipulating a lot of text
- the query language is easier to use (IMHO... but I've been using Solr for years, so I am biased)
- lots of people know it
- fast caching
- faceting

Cons for Solr:
- hard to update a single field (you need to fetch and re-insert the entire row)
- commits/optimizes can slow things down to a crawl
- can't store structured data easily (for example, a blog post has tags which have both a key and a value)
- scalability isn't as easy as with Cassandra; sharding works, but it requires a lot of manual effort
- it's easy to get started and get something running, but if you need to do something out of the ordinary, it gets hard fast. I think Cassandra is more flexible for ordinary things that don't involve text matching.
- replication isn't instant (this is changing... also look at Zoie, which may help)

Of course, if you tell us what you're trying to do, I can be more specific.

FWIW, we use Solr for some of our news content (see love.com and newsrunner.com) and it works fast enough for us. We have an incoming doc rate of about 8-10 news articles/second.

On Jan 8, 2010, at 5:43 AM, Nathan McCall wrote:

Agreed that there is not much to go on here in the original question. I will say that we very recently found a good fit with Solr and Cassandra in how we deal with a very heavy write volume of news article data. Cassandra is excellent with write throughput and high availability, but our search use cases are with time-dependent news content, so we need lots of term proximity, faceting and ordering functionality. We probably could store everything in Solr, but the above approach will allow us to make articles immediately available in a fault-tolerant manner while being able to efficiently send batches at regular intervals to Solr, and therefore scale out our ingestion of news articles a little more smoothly.

Full disclosure: I am still getting my head around the innards of Solr replication and clustering, but so far I feel like we made a good choice. Hopefully the above will be helpful to folks during their evaluations.

Cheers,
-Nate

On Thu, Jan 7, 2010 at 10:02 AM, Joseph Bowman bowman.jos...@gmail.com wrote:

I have to agree with Tatu. If you're struggling to find reasons to validate that Cassandra is a better choice for your task than Solr, then perhaps Solr is the correct choice. I kind of went through the same thing recently: I struggled to make Cassandra fit what I was doing, then realized I was doing it wrong and moved to MongoDB. Cassandra is great at what it tries to accomplish, which is managing gigantic datasets in a distributed way. The question is, is that really what you need?

On Thu, Jan 7, 2010 at 12:58 PM, Tatu Saloranta tsalora...@gmail.com wrote:

On Thu, Jan 7, 2010 at 3:16 AM, Richard Grossman richie...@gmail.com wrote:

Hi, This message is a little different from a support request. I'm confronted with a problem where people want to replace Cassandra with a Solr server. I really think that our problem is a great case for Cassandra, but I need more arguments. So please, if you have some time, just share some ideas on why to use Cassandra instead of Solr.

A solution is generally applicable to a problem... so what is the (main) use case? That would make it easier to find arguments for or against the proposed solution.

-+ Tatu +-

--
Ian Holsman
i...@holsman.net
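The pattern Nate describes (make every article available in the primary store right away, and forward batches to the search index at intervals) might look roughly like the Python sketch below. The in-memory dict stands in for the primary store and `index_batch` for whatever call sends documents to the search index; both are placeholders, not Cassandra or Solr client APIs. The flush here is triggered by batch size to keep the example self-contained; a timer calling `flush()` at regular intervals would be closer to what the message describes.

```python
class BufferedIndexer:
    """Write to a primary store immediately; forward to the index in batches."""

    def __init__(self, index_batch, batch_size=3):
        self.store = {}        # stand-in for the primary store (hypothetical)
        self.pending = []      # ids written but not yet sent to the index
        self.index_batch = index_batch
        self.batch_size = batch_size

    def write(self, doc_id, doc):
        # The document is readable from the store as soon as this returns.
        self.store[doc_id] = doc
        self.pending.append(doc_id)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send all pending documents to the index as a single batch."""
        if not self.pending:
            return
        batch = [(doc_id, self.store[doc_id]) for doc_id in self.pending]
        self.index_batch(batch)
        self.pending = []


indexed_batches = []
indexer = BufferedIndexer(indexed_batches.append, batch_size=3)
for n in range(7):
    indexer.write(f"article-{n}", {"title": f"Story {n}"})
indexer.flush()  # push the last, partially filled batch
```

The trade-off is the one discussed in the thread: lookups by id see new articles at once, while full-text search lags behind by at most one batch.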
Re: Advise for choice
On Thu, Jan 7, 2010 at 10:43 AM, Nathan McCall n...@vervewireless.com wrote:

Agreed that there is not much to go on here in the original question. I will say that we very recently found a good fit with Solr and Cassandra in how we deal with a very heavy write volume of news article data. Cassandra is excellent with write throughput and high availability, but our search use cases are with time-dependent news content, so we need lots of term proximity, faceting and ordering functionality. We probably could store everything in Solr, but the above approach ...

I think that in many (most?) cases, the optimal solutions for searching and for lookups are different. Traditionally this has meant that instead of trying to cram everything into Oracle (or MySQL, Postgres) with its built-in not-quite-as-good-as-Lucene text indexer, you do the right thing and use both: a DB for storing data, for lookups and aggregates; and a search index for full-text searches. For some reason, using two tools instead of one seems a very unintuitive notion, even when they have different sweet spots.

And going forward, similar trade-offs are needed between 'traditional' RDBMSs, newer distributed, high-availability, eventually consistent data stores (with multiple variations, from simple lookup to sorted access), search index processing, and batch-oriented processing (Hadoop / map/reduce). Trying to do too many things using just one kind of tool tends to lead to scalability and maintenance problems.

I am actually trying to decide, in a similar case, which tools (from a loose set of Cassandra, Lucene/Solr, Voldemort) to use to handle processing of large amounts of data, and I'm pretty sure I will end up using more than just one.

-+ Tatu +-
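To make the "two tools with different sweet spots" point concrete, here is a toy Python sketch: a plain dict plays the role of the data store (lookups by id), and a small inverted index plays the role of the search index (full-text queries). Neither is a stand-in for any real product's API; all names and the sample records are invented for illustration, and the index ignores re-writes of an existing id for brevity.

```python
import re
from collections import defaultdict

records = {}                   # data store: id -> full record
inverted = defaultdict(set)    # search index: term -> ids containing it


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def put(record_id, record):
    """Store the record, then index the terms of its body."""
    records[record_id] = record
    for term in tokenize(record["body"]):
        inverted[term].add(record_id)


def search(query):
    """Return ids of records whose body contains every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(inverted.get(terms[0], set()))
    for term in terms[1:]:
        result &= inverted.get(term, set())
    return result


put(1, {"author": "tatu", "body": "Cassandra handles high write throughput"})
put(2, {"author": "ian", "body": "Solr handles faceting and full-text search"})

hits = search("full-text search")   # answered by the index
author = records[1]["author"]       # answered by the store
```

Each structure answers only the kind of question it is good at: the index maps words to ids, and the store maps ids to records.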
Re: Advise for choice
First, thanks to all for your answers; they really help to check all the aspects.

- In fact, the system we want to build has to manage a lot of data, but not in a heavily transactional way. Solr can handle the data but has no distributed way to serve it. In my case, though, it's always possible to just duplicate the data and then load-balance the queries between multiple server instances.
- We load a large set of data once a week, and all this data is then used as is, without modification, update, or delete. On this point, loading the data into Solr is very easy: we make a CSV file and that's it, it's in.
- The data needs to be structured, but not like a relational database. Obviously Solr doesn't fit the data structure we require: it forces us to denormalize a lot of data and build what is essentially one very, very big table, and it also forces us to build very difficult Lucene queries.
- Query speed is critical because the application is internet oriented and we expect a lot of queries per minute. On this point, the problem is that with the same amount of data Solr has been faster than Cassandra, but of course the data structure is not the same.

It seems that in the end we'll do as Tatu says and have a hybrid solution mixing Solr and Cassandra. I'm not sure it's the best in our case.

Thanks
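For readers unfamiliar with the denormalization step mentioned above, the Python sketch below shows one way nested records could be flattened into a single wide table and written out as CSV for a weekly bulk load. The record layout, the dotted column naming, and the function names are assumptions for illustration only; the actual schema in this thread is not known.

```python
import csv
import io


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, using dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_csv(records):
    """Write flattened records as CSV; missing columns become empty cells."""
    rows = [flatten(r) for r in records]
    columns = sorted({col for row in rows for col in row})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical sample data, not taken from the thread.
posts = [
    {"id": 1, "title": "Hello", "tags": {"topic": "search", "lang": "en"}},
    {"id": 2, "title": "World", "tags": {"topic": "storage"}},
]
csv_text = to_csv(posts)
```

Every distinct nested key becomes its own column, which is how a modest amount of structure turns into the "very, very big table" described above, with many cells left empty.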