Re: Cassandra API Library.
@Brian: you can add the Cassandra::Simple Perl client: http://fmgoncalves.github.com/p5-cassandra-simple/

2012/8/27 Paolo Bernardi berna...@gmail.com:
> On 08/23/2012 01:40 PM, Thomas Spengler wrote:
>> 4) pelops (Thrift, Java)
> I've been using Pelops for quite some time with pretty good results; it felt much cleaner than Hector.
> Paolo
> -- @bernarpa http://paolobernardi.wordpress.com

-- Filipe Gonçalves
Re: Concurrency Control
It's the timestamps provided in the columns that do concurrency control/conflict resolution. Basically, the newer timestamp wins. For counters I think there is no such mechanism (i.e. counter updates are not idempotent).

From https://wiki.apache.org/cassandra/DataModel :

> All values are supplied by the client, including the 'timestamp'. This means that clocks on the clients should be synchronized (in the Cassandra server environment it is useful also), as these timestamps are used for conflict resolution. In many cases the 'timestamp' is not used in client applications, and it becomes convenient to think of a column as a name/value pair. For the remainder of this document, 'timestamps' will be elided for readability. It is also worth noting the name and value are binary values, although in many applications they are UTF8 serialized strings.
>
> Timestamps can be anything you like, but microseconds since 1970 is a convention. Whatever you use, it must be consistent across the application, otherwise earlier changes may overwrite newer ones.

2012/5/28 Helen live42...@gmx.ch:
> Hi, what kind of concurrency control method is used in Cassandra? I found out so far that it's not done with the MVCC method and that no vector clocks are being used.
> Thanks, Helen

-- Filipe Gonçalves
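The last-write-wins rule described above can be sketched in a few lines. This is a toy illustration of timestamp-based reconciliation, not Cassandra's actual code: the `reconcile` helper and the `(value, timestamp)` tuples are mine; only the rule itself (newest client-supplied timestamp wins) comes from the thread.

```python
# Toy sketch of timestamp-based last-write-wins conflict resolution.
# A "column" here is a (value, timestamp) pair, with timestamps in
# microseconds since 1970 as per the convention quoted above.

def reconcile(column_a, column_b):
    """Return the winning (value, timestamp) pair: newer timestamp wins."""
    return column_a if column_a[1] >= column_b[1] else column_b

# Two clients write the same column with client-supplied timestamps:
old = ("alice@example.com", 1338200000000000)
new = ("alice@new-host.org", 1338200000500000)

winner = reconcile(old, new)  # the later write wins, regardless of arrival order
```

Note that this is also why client clocks must be synchronized: if a client's clock runs behind, its genuinely newer write can lose the reconciliation.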
Re: improving cassandra-vs-mongodb-vs-couchdb-vs-redis
There really is no generic way of comparing these systems; NoSQL databases are highly heterogeneous. The only credible and accurate way of doing a comparison is for a specific, well-defined use case. Other than that, you are always going to be comparing apples to oranges, and end up with a crappy (and, in this case, even inaccurate) comparison to work with. Engineers at Facebook, Twitter and Netflix (among others, if I'm not mistaken) have written some interesting articles describing where and why their companies use each database; google those for a minimally accurate perspective of the NoSQL (and, in some cases, SQL) database world.

2011/12/28 CharSyam chars...@gmail.com:
> Don't trust NoSQL benchmarks. It's not that they lie, but NoSQL systems perform differently in different environments. Do a benchmark in your real environment, and choose based on that. Thank you.

2011/12/28 Igor Lino icampi...@gmx.de:
> You are totally right. I'm far from being an expert on the subject, but the comparison felt inconsistent and incomplete. (I could not express that in my 1st email, so as not to bias the opinion.) Do you know of any similar comparison which is not biased towards some particular technology or solution (so not coming from http://cassandra.apache.org/)? I want to understand how superior Cassandra is in its latest release against its closest competitors, ideally with the opinion of expert guys.

On Wed, Dec 28, 2011 at 12:14 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
> This is not really a comparison of anything, because each NoSQL store gets its own bullet points, like:
> Boats: great for traveling on water
> Cars: great for traveling on land
> So the conclusion I should gather is?
> Also, the Cassandra bullet points are really thin (and wrong). Such as:
> Cassandra: Best used: When you write more than you read (logging). If every component of the system must be in Java. ("No one gets fired for choosing Apache's stuff.")
> I view that as nonsensical, inaccurate, and anecdotal.
> Also, I notice on the other side (and not trying to pick on HBase, but):
> HBase: No single point of failure. Random access performance is like MySQL.
> HBase has several SPOFs, and its random access performance is definitely NOT "like MySQL". Cassandra actually has no SPOF, but as the author mentions, he/she does not like Cassandra, so that fact was left out. From what I can see of the writeup, it is obviously inaccurate in numerous places (without even reading the entire thing).
> Also, when comparing these technologies, very subtle differences in design have profound effects on operation and performance. Someone trying to paper over six technologies and compare them with a few bullet points is really doing the world an injustice.

On Tue, Dec 27, 2011 at 5:01 PM, Igor Lino icampi...@gmx.de wrote:
> Hi! I was trying to get an understanding of the real strengths of Cassandra against other competitors. It's actually not that simple, and depends a lot on the details of the actual requirements. Reading the following comparison: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
> It felt like the description of Cassandra painted a limiting picture of its capabilities. Is there any Cassandra expert that could improve that summary? Is there any important thing missing? Or is there a more fitting common use case for Cassandra than what Mr. Kovacs has given? (I believe a Cassandra expert could improve that generic description.)
> Thanks, Igor

-- Filipe Gonçalves
Re: Choosing a Partitioner Type for Random java.util.UUID Row Keys
Generally, RandomPartitioner is the recommended one. If you already provide randomized keys it doesn't make much of a difference; the nodes should be balanced with any partitioner. However, unless you have UUIDs in the keys of all column families (highly unlikely), ByteOrderedPartitioner and OrderPreservingPartitioner will lead to hotspots and unbalanced rings.

2011/12/20 Drew Kutcharian d...@venarc.com:
> Hey guys, I just came across http://wiki.apache.org/cassandra/ByteOrderedPartitioner and it got me thinking. If the row keys are java.util.UUID, generated randomly (and securely), then what type of partitioner would be best? Since the key values are already random, would it make a difference to use RandomPartitioner, or could one use ByteOrderedPartitioner or OrderPreservingPartitioner as well and get the same result?
> -- Drew

-- Filipe Gonçalves
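To make the trade-off concrete, here is a rough sketch of what RandomPartitioner does with a row key: it hashes the key with MD5 and treats the digest as a large integer token, so even keys that are already random (like UUIDs) get spread uniformly over the token ring. This is a simplification for illustration (the real logic lives in `org.apache.cassandra.dht.RandomPartitioner`); the `token_for` helper is mine.

```python
# Simplified sketch of RandomPartitioner-style token assignment:
# token = MD5(key) interpreted as a big integer, reduced to the
# 0 .. 2**127 - 1 token range.
import hashlib
import uuid

def token_for(key: bytes) -> int:
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest, "big") % (2 ** 127)

# Randomly generated UUID row keys, as in the question:
keys = [uuid.uuid4().bytes for _ in range(4)]
tokens = [token_for(k) for k in keys]
```

The point of the thread follows from this: hashing random UUID keys neither helps nor hurts balance much, but with an order-preserving partitioner any non-random key (and most real schemas have some) maps to a token that reflects its byte order, producing hotspots.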
Re: Setting Key Validation Class
Cassandra's data model is NOT table based. The key is not a column; it is a separate value. index_type: KEYS means you are creating an index on that column, and that index can only be accessed in an equality query (column = x). You should probably read http://www.datastax.com/docs/1.0/ddl/index first.

2011/12/5 Dinusha Dilrukshi sdddilruk...@gmail.com:
> Hi, I am using apache-cassandra-1.0.0 and I tried to insert/retrieve data in a column family using a cassandra-jdbc program. Here is how I created the 'USER' column family using cassandra-cli:
>
>     create column family USER
>       with comparator = UTF8Type
>       and column_metadata = [
>         {column_name: user_id, validation_class: UTF8Type, index_type: KEYS},
>         {column_name: username, validation_class: UTF8Type, index_type: KEYS},
>         {column_name: password, validation_class: UTF8Type}];
>
> But when I try to insert data into the USER column family, it gives the error "java.sql.SQLException: Mismatched types: java.lang.String cannot be cast to java.nio.ByteBuffer". Since I have set user_id as a key and its validation_class as UTF8Type, I expected the Key Validation Class to be UTF8Type. But when I look at the metadata of the USER column family, it shows Key Validation Class: org.apache.cassandra.db.marshal.BytesType, which caused the above error. When I created the USER column family as follows, it solved the issue:
>
>     create column family USER
>       with comparator = UTF8Type
>       and key_validation_class = UTF8Type
>       and column_metadata = [
>         {column_name: user_id, validation_class: UTF8Type, index_type: KEYS},
>         {column_name: username, validation_class: UTF8Type, index_type: KEYS},
>         {column_name: password, validation_class: UTF8Type}];
>
> Do we always need to define key_validation_class as in the above query? Isn't it enough to add validation classes for each column?
> Regards, ~Dinusha~

-- Filipe Gonçalves
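The type mismatch in that SQLException can be illustrated with a toy serializer. With the default BytesType key validator, the driver expects raw bytes for the row key; declaring key_validation_class=UTF8Type is what lets a plain string be encoded. These two helpers are illustrative stand-ins, not part of any Cassandra client:

```python
# Toy illustration of key validation classes: BytesType demands raw
# bytes, UTF8Type accepts a text string and encodes it. Passing a str
# where bytes are required mirrors the String-vs-ByteBuffer mismatch
# from the question.

def serialize_key_bytestype(key):
    """BytesType-style validator: only raw bytes are acceptable."""
    if not isinstance(key, (bytes, bytearray)):
        raise TypeError("BytesType key validator: expected bytes, got %s"
                        % type(key).__name__)
    return bytes(key)

def serialize_key_utf8type(key):
    """UTF8Type-style validator: text is encoded to UTF-8 bytes."""
    return key.encode("utf-8") if isinstance(key, str) else bytes(key)

serialize_key_utf8type("dinusha")    # fine: str is encoded
# serialize_key_bytestype("dinusha") # raises TypeError, like the
#                                    # String -> ByteBuffer cast failure
```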
Re: Required field 'name' was not present! Struct: Column(name:null)
It's a pretty straightforward error message. Some of your rows have columns with empty names (e.g. an empty string), and column names can't be empty.

2011/11/27 Masoud Moshref Javadi moshr...@usc.edu:
> I get the error "Required field 'name' was not present! Struct: Column(name:null)" on different column families. My code inserts lots of rows in parallel. This debug log from Django may help. The failure happens inside pycassa's retry wrapper (/root/twiss/lib/python2.7/site-packages/pycassa/pool.py, in new_f), which backs off and retries until max_retries is exhausted:
>
>     if self.max_retries != -1 and self._retry_count > self.max_retries:
>         raise MaximumRetryException('Retried %d times. Last failure was %s: %s' %
>                                     (self._retry_count, exc.__class__.__name__, exc))
>     # Exponential backoff
>     time.sleep(_BASE_BACKOFF * (2 ** self._retry_count))
>     kwargs['reset'] = True
>     return new_f(self, *args, **kwargs)
>
> At the point of failure, f is the unbound method Connection.batch_mutate, the last exceptions are EOFError() and (one frame up) TApplicationException(None,), and the arguments are:
>
>     ({'user50': {'User': [
>         Mutation(column_or_supercolumn=ColumnOrSuperColumn(column=Column(
>             timestamp=1322382778794088, name='password', value='password50',
>             ttl=None), counter_super_column=None, super_column=None,
>             counter_column=None), deletion=None),
>         Mutation(column_or_supercolumn=ColumnOrSuperColumn(column=Column(
>             timestamp=1322382778794088, name='name', value='User 50',
>             ttl=None), counter_super_column=None, super_column=None,
>             counter_column=None), deletion=None)]}}, 1)

-- Filipe Gonçalves
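Since the mutations shown in the traceback all have non-empty names, the offending rows are presumably elsewhere in the parallel batch. A cheap defensive check over a pycassa-style row map before calling batch_mutate can locate them; the `find_empty_column_names` helper below is mine, only the error it guards against comes from the thread:

```python
# Defensive pre-flight check for the "Required field 'name' was not
# present!" error: scan a {row_key: {column_name: value}} map for
# empty or missing column names before submitting the batch.

def find_empty_column_names(rows):
    """Return the row keys that contain an empty/None column name."""
    bad = []
    for row_key, columns in rows.items():
        for name in columns:
            if name is None or name == "":
                bad.append(row_key)
                break
    return bad

rows = {
    "user50": {"password": "password50", "name": "User 50"},  # fine
    "user51": {"": "oops"},  # empty column name: would trigger the error
}
offenders = find_empty_column_names(rows)
```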
Re: Yanking a dead node
Just remove its token from the ring using nodetool removetoken <token>.

2011/11/23 Maxim Potekhin potek...@bnl.gov:
> This was discussed a long time ago, but I need to know the state-of-the-art answer: assume one of my few nodes is very dead. I have no resources or time to fix it. Data is replicated, so the data is still available in the cluster. How do I completely remove the dead node without having to rebuild it, repair, drain and decommission?
> TIA, Maxim

-- Filipe Gonçalves
Re: Is there a way to get only keys with get_indexed_slices?
You can; just set the number of columns returned to zero (the count parameter in the slice range). The indexed slices Thrift call is

    get_indexed_slices(ColumnParent column_parent, IndexClause index_clause, SlicePredicate predicate, ConsistencyLevel consistency_level)

and the count parameter is in the SliceRange within the SlicePredicate (if you are using the Thrift interface directly).

2011/11/11 Maxim Potekhin potek...@bnl.gov:
> Is there a way to get only keys with get_indexed_slices? Looking at the code, it's not possible, but -- is there some way anyhow? I don't want to extract any data, just a list of matching keys.
> TIA, Maxim

-- Filipe Gonçalves
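The count=0 trick can be demonstrated with a pure-Python stand-in for the Thrift structures: rows matching the index clause are returned, but with zero columns of payload, which is effectively a keys-only query. The `indexed_slice` function below is a toy model, not real client code:

```python
# Toy model of get_indexed_slices with a SliceRange count parameter:
# `predicate` plays the role of the IndexClause, `count` the role of
# SliceRange.count. With count=0 the result carries keys but no data.

def indexed_slice(rows, predicate, count):
    """Return {key: first `count` columns (by name)} for matching rows."""
    result = {}
    for key, columns in rows.items():
        if predicate(columns):
            result[key] = dict(sorted(columns.items())[:count])
    return result

rows = {
    "r1": {"state": "CA", "name": "Ann"},
    "r2": {"state": "NY", "name": "Bob"},
}

# count=0: matching keys only, no column payload fetched
keys_only = indexed_slice(rows, lambda c: c.get("state") == "CA", 0)
```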
Re: is that possible to add more data structure(key-list) in cassandra?
You could use composite columns. For example:

    key: composite(listname:listindex) : value

A simple get_range would give you access to the list as you would normally have in any programming language, and a get could give you direct access to any index. Obviously, this would not be a good fit for lists which need insertion at a specific index, or reordering...

2011/11/11 Yan Chunlu springri...@gmail.com:
> I thought currently no one is maintaining the supercolumn-related code, and it's also not very efficient.

On Fri, Nov 11, 2011 at 2:46 PM, Radim Kolar h...@sendmail.cz wrote:
> On 11.11.2011 5:58, Yan Chunlu wrote:
>> I think Cassandra is doing a great job as a key-value data store; it has saved me tremendous work on maintaining data consistency and service availability. But I think it would be great if it could support more data structures such as key-list. Currently I am using key-value storage to hold the list, and it seems not very efficient. Redis does this well, but it is not easy to scale. Maybe this is the wrong place and the wrong question; I'm only curious whether there is already a solution for this. Thanks a lot!
> Use supercolumns unless your lists are very large.

-- Filipe Gonçalves
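The composite(listname:listindex) scheme above can be sketched with tuples standing in for Cassandra's CompositeType column names and a dict standing in for a row; the three helper functions are mine, for illustration only:

```python
# Sketch of a key-list on top of composite columns: column names are
# (listname, index) tuples. A range scan over one listname rebuilds
# the list in order; a single get addresses one index directly.

row = {}

def list_append(name, index, value):
    row[(name, index)] = value

def list_get_range(name):
    """Like a get_range over composite(name:*): the whole list, in order."""
    return [v for (n, i), v in sorted(row.items()) if n == name]

def list_get(name, index):
    """Like a direct get on composite(name:index)."""
    return row[(name, index)]

for i, v in enumerate(["a", "b", "c"]):
    list_append("mylist", i, v)
```

As noted above, appends and point reads are cheap in this layout, but inserting at an arbitrary index or reordering would require rewriting the trailing column names.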
Multiget question
Multiget slice queries seem to fetch rows sequentially, at least from what I understood of the sources. This means the node that receives a multiget of N keys does N get operations to the other nodes in the cluster to fetch the remaining keys. Am I right? Is this the way multiget works internally?

Also, shouldn't this be done in parallel, to avoid contacting nodes more than once?

-- Filipe Gonçalves
Re: Multiget question
Thanks for the answer. I hadn't realised requests were made in parallel; I first noticed it when multigets took linear time on machines with high load. Looking at the code led me to the previous conclusion (N gets for a multiget of N keys). I agree it would take a major overhaul of the code to change the current behaviour, possibly more than it's worth for the potential gains.

2011/11/4 Sylvain Lebresne sylv...@datastax.com:
> 2011/11/4 Filipe Gonçalves the.wa.syndr...@gmail.com:
>> Multiget slice queries seem to fetch rows sequentially, at least from what I understood of the sources. This means the node that receives a multiget of N keys does N get operations to the other nodes in the cluster to fetch the remaining keys. Am I right? Is this the way multiget works internally?
> The 'sequentially' is probably not right, depending on what you meant by that (see below), but otherwise yes, a multiget of N keys is internally split into N gets.
>> Also, shouldn't this be done in parallel, to avoid contacting nodes more than once?
> It's done in parallel, in that the coordinating node sends all the get requests in parallel; it doesn't wait for the result of the first get to issue the second one. But it does do every get separately, i.e. it may contact the same node multiple times. In theory we could do with at most one message to each node per multiget. We don't do it because it would require quite a bit of change in the current code, and it's unclear it would really buy us much. Since we already parallelize requests, we would mostly win a bit on network traffic (by merging messages), but there is a good chance this is insignificant (of course I could be wrong, given we haven't tried).
> -- Sylvain

-- Filipe Gonçalves
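The coordinator behaviour Sylvain describes (N independent gets, all in flight at once, possibly hitting the same replica more than once) can be sketched with a thread pool; `fake_get` is a placeholder for the per-key network round trip, not a real client call:

```python
# Sketch of multiget as a fan-out of N parallel single-key gets:
# the coordinator does not merge requests per node, it simply issues
# one get per key and collects the results.
from concurrent.futures import ThreadPoolExecutor

def fake_get(key):
    # Placeholder for a single-key read sent to the owning replica.
    return key.upper()

def multiget(keys):
    # One get per key, all in flight at once (not batched per node).
    with ThreadPoolExecutor(max_workers=len(keys)) as pool:
        return dict(zip(keys, pool.map(fake_get, keys)))

result = multiget(["a", "b", "c"])
```

This also shows why a loaded cluster can make multigets look linear: the wall-clock time is the slowest of the N parallel gets, and under high load stragglers dominate.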