Re: Cassandra API Library.

2012-09-04 Thread Filipe Gonçalves
@Brian: you can add the Cassandra::Simple Perl client
http://fmgoncalves.github.com/p5-cassandra-simple/

2012/8/27 Paolo Bernardi berna...@gmail.com

 On 08/23/2012 01:40 PM, Thomas Spengler wrote:

 4) pelops (Thrift,Java)


  I've been using Pelops for quite some time with pretty good results; it
 felt much cleaner than Hector.

 Paolo

 --
 @bernarpa
 http://paolobernardi.wordpress.com




-- 
Filipe Gonçalves


Re: Concurrency Control

2012-05-30 Thread Filipe Gonçalves
It's the timestamps provided in the columns that do concurrency
control/conflict resolution. Basically, the newer timestamp wins.
For counters I think there is no such mechanism (i.e. counter updates are
not idempotent).

From https://wiki.apache.org/cassandra/DataModel :

All values are supplied by the client, including the 'timestamp'. This
 means that clocks on the clients should be synchronized (in the Cassandra
 server environment is useful also), as these timestamps are used for
 conflict resolution. In many cases the 'timestamp' is not used in client
 applications, and it becomes convenient to think of a column as a
 name/value pair. For the remainder of this document, 'timestamps' will be
 elided for readability. It is also worth noting the name and value are
 binary values, although in many applications they are UTF8 serialized
 strings.
 Timestamps can be anything you like, but microseconds since 1970 is a
 convention. Whatever you use, it must be consistent across the application,
 otherwise earlier changes may overwrite newer ones.
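
For illustration, a minimal sketch (not Cassandra's actual code) of the
last-write-wins rule described above:

    # Two versions of the same column; the one written with the higher
    # client-supplied timestamp wins (last-write-wins). Illustrative only.
    def reconcile(a, b):
        # a column is represented here as a (name, value, timestamp) tuple
        return a if a[2] >= b[2] else b

    older = ('email', 'old@example.com', 1322382778794088)  # microseconds since 1970
    newer = ('email', 'new@example.com', 1322382778794500)
    assert reconcile(older, newer) == newer                 # newer timestamp wins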


2012/5/28 Helen live42...@gmx.ch

 Hi,
 what kind of Concurrency Control Method is used in Cassandra? I found out
 so far
 that it's not done with the MVCC Method and that no vector clocks are
 being used.
 Thanks Helen




-- 
Filipe Gonçalves


Re: improving cassandra-vs-mongodb-vs-couchdb-vs-redis

2011-12-28 Thread Filipe Gonçalves
There really is no generic way of comparing these systems; NoSQL
databases are highly heterogeneous.
The only credible and accurate way of doing a comparison is for a
specific, well defined, use case. Other than that you are always going
to be comparing apples to oranges, and end up with a poor (and, in that
case, even inaccurate) comparison to work with.
Some engineers (from Facebook, Twitter and Netflix among others, if I'm not
mistaken) have written interesting articles describing where and why
their companies use each database; Google those for a minimally
accurate perspective of the NoSQL (and SQL in some cases) database
world.

2011/12/28 CharSyam chars...@gmail.com:
 Don't trust NoSQL benchmarks. It's not that they lie, but NoSQL systems
 perform differently in different environments.

 Do the benchmarking in your real environment, and choose based on that.

 Thank you.


 2011/12/28 Igor Lino icampi...@gmx.de

 You are totally right. I'm far from being an expert on the subject, but
 the comparison felt inconsistent and incomplete. (I could not express that
 in my first email, so as not to bias the opinions.)

 Do you know of any similar comparison which is not biased towards some
 particular technology or solution?  (so not coming from
 http://cassandra.apache.org/)
 I want to understand how superior Cassandra is in its latest release
 compared to its closest competitors, ideally with the opinion of experts.


 On Wed, Dec 28, 2011 at 12:14 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

    This is not really a comparison of anything because each NoSQL has its
 own bullet points like:
    Boats
      great for traveling on water
    Cars
      great for traveling on land
    So the conclusion I should gather is?
    Also as for the Cassandra bullet points, they are really thin (and
 wrong). Such as:
    Cassandra:
    Best used: When you write more than you read (logging). If every
 component of the system must be in Java. (No one gets fired for choosing
 Apache's stuff.)
    I view that as:
    Nonsensical, inaccurate, and anecdotal.
    Also I notice on the other side (and not trying to pick on hbase, but)
    hbase:
    No single point of failure
    Random access performance is like MySQL
    HBase has several SPOFs, and its random access performance is definitely
 NOT 'like MySQL'.
    Cassandra ACTUALLY has no SPOF, but as the author mentions, he/she does
 not like Cassandra, so that fact was left out.
    From what I can see of the writeup, it is obviously inaccurate in
 numerous places (without even reading the entire thing).
    Also, when comparing these technologies, very subtle differences in
 design have profound effects on operation and performance. Thus someone
 trying to paper over 6 technologies and compare them with a few bullet
 points is really doing the world an injustice.
    On Tue, Dec 27, 2011 at 5:01 PM, Igor Lino icampi...@gmx.de wrote:

        Hi!

        I was trying to get an understanding of the real strengths of
 Cassandra against other competitors. It's actually not that simple, and it
 depends a lot on the details of the actual requirements.

        Reading the following comparison:
        http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis

        It felt like the description of Cassandra painted a limiting
 picture of its capabilities. Is there any Cassandra expert that could
 improve that summary? Is there any important thing missing? Or is there a
 more fitting common use case for Cassandra than what Mr. Kovacs has given?
        (I believe/think that a Cassandra expert can improve that generic
 description)

        Thanks,
        Igor







-- 
Filipe Gonçalves


Re: Choosing a Partitioner Type for Random java.util.UUID Row Keys

2011-12-20 Thread Filipe Gonçalves
Generally, RandomPartitioner is the recommended one.
If you already provide randomized keys it doesn't make much of a
difference; the nodes should be balanced with any partitioner.
However, unless you have UUIDs in all keys of all column families
(highly unlikely), ByteOrderedPartitioner and
OrderPreservingPartitioner will lead to hotspots and unbalanced
rings.
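
As a rough sketch (not Cassandra's token code) of why this matters:
RandomPartitioner places rows by an MD5 hash of the key, while the
byte-ordered partitioners place rows by the raw key bytes, so any skew
in the keys becomes skew in the ring:

    import hashlib
    import uuid

    def random_partitioner_position(key):
        # RandomPartitioner hashes the key, so row positions are spread
        # uniformly even when the keys themselves are clustered.
        return int(hashlib.md5(key).hexdigest(), 16)

    def byte_ordered_position(key):
        # ByteOrderedPartitioner/OrderPreservingPartitioner place rows by the
        # raw key bytes, so sequential or common-prefix (non-UUID) keys pile
        # up on a few nodes and create hotspots.
        return key

    uuid_key = uuid.uuid4().bytes          # already uniform: fine either way
    sequential_key = b"user:%010d" % 42    # clustered under byte ordering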

2011/12/20 Drew Kutcharian d...@venarc.com:
 Hey Guys,

 I just came
 across http://wiki.apache.org/cassandra/ByteOrderedPartitioner and it got me
 thinking. If the row keys are java.util.UUID which are generated randomly
 (and securely), then what type of partitioner would be the best? Since the
 key values are already random, would it make a difference to use
 RandomPartitioner or one can use ByteOrderedPartitioner or
 OrderPreservingPartitioning as well and get the same result?

 -- Drew




-- 
Filipe Gonçalves


Re: Setting Key Validation Class

2011-12-05 Thread Filipe Gonçalves
Cassandra's data model is NOT table based. The key is not a column; it
is a separate value.  index_type: KEYS means you are creating an
index on that column, and that index can only be accessed in an
equality query ( column = x ).

You should probably read http://www.datastax.com/docs/1.0/ddl/index first.
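
If it helps, the same schema can also be created programmatically; a sketch
with pycassa's SystemManager (host and keyspace name are placeholders):

    from pycassa.system_manager import SystemManager, UTF8_TYPE

    sys_mgr = SystemManager('localhost:9160')
    sys_mgr.create_column_family('MyKeyspace', 'USER',
                                 comparator_type=UTF8_TYPE,
                                 key_validation_class=UTF8_TYPE,      # row keys validate as UTF8
                                 default_validation_class=UTF8_TYPE)
    # KEYS indexes, usable only for equality predicates (column = x):
    sys_mgr.create_index('MyKeyspace', 'USER', 'user_id', UTF8_TYPE)
    sys_mgr.create_index('MyKeyspace', 'USER', 'username', UTF8_TYPE)
    sys_mgr.close()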

2011/12/5 Dinusha Dilrukshi sdddilruk...@gmail.com:
 Hi,

 I am using apache-cassandra-1.0.0 and I tried to insert/retrieve data in a
 column family using cassandra-jdbc program.
 Here is how I created 'USER' column family using cassandra-cli.

 create column family USER with comparator=UTF8Type
 and column_metadata=[{column_name: user_id, validation_class: UTF8Type,
 index_type: KEYS},
 {column_name: username, validation_class: UTF8Type, index_type: KEYS},
 {column_name: password, validation_class: UTF8Type}];

 But, when I try to insert data into the USER column family it gives the error
 java.sql.SQLException: Mismatched types: java.lang.String cannot be cast to
 java.nio.ByteBuffer.

 Since I have set user_id as a KEY and its validation_class as UTF8Type, I
 expected the Key Validation Class to be UTF8Type.
 But when I look at the meta-data of the USER column family it shows Key
 Validation Class: org.apache.cassandra.db.marshal.BytesType, which has caused
 the above error.

 When I created USER column family as follows, it solves the above issue.

 create column family USER with comparator=UTF8Type and
 key_validation_class=UTF8Type
 and column_metadata=[{column_name: user_id, validation_class: UTF8Type,
 index_type: KEYS},
 {column_name: username, validation_class: UTF8Type, index_type: KEYS},
 {column_name: password, validation_class: UTF8Type}];

 Do we always need to define key_validation_class as in the above query?
 Isn't it enough to add validation classes for each column?

 Regards,
 ~Dinusha~




-- 
Filipe Gonçalves


Re: Required field 'name' was not present! Struct: Column(name:null)

2011-11-27 Thread Filipe Gonçalves
It's a pretty straightforward error message: some of your rows have columns
with empty names (e.g. null or an empty string), and column names can't be empty.
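
As a defensive sketch (made-up names, not part of pycassa), you can filter out
such columns before sending the mutation:

    # Cassandra rejects columns whose name is missing; drop them client-side.
    # Illustrative sketch only.
    def clean_columns(columns):
        return dict((name, value) for name, value in columns.items()
                    if name not in (None, '', b''))

    raw = {'password': 'password50', '': 'oops'}
    safe = clean_columns(raw)        # {'password': 'password50'}
    # cf.insert('user50', safe)      # e.g. with a pycassa ColumnFamily 'cf'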

2011/11/27 Masoud Moshref Javadi moshr...@usc.edu

   I get this error

 Required field 'name' was not present! Struct: Column(name:null)

 on different column families. My code is going to insert lots of rows in 
 parallel.

 I think this debug log from django may help:



 - /root/twiss/lib/python2.7/site-packages/pycassa/pool.py in new_f

       if self.max_retries != -1 and self._retry_count > self.max_retries:
           raise MaximumRetryException('Retried %d times. Last failure was %s: %s' %
                                       (self._retry_count, exc.__class__.__name__, exc))
       # Exponential backoff
       time.sleep(_BASE_BACKOFF * (2 ** self._retry_count))

       kwargs['reset'] = True
       return new_f(self, *args, **kwargs)     <-- error raised here

   new_f.__name__ = f.__name__
   return new_f

   def _fail_once(self, *args, **kwargs):
       if self._should_fail:

 Local vars:
   exc      EOFError()
   f        unbound method Connection.batch_mutate
   self     pycassa.pool.ConnectionWrapper object at 0x2086050
   args     ({'user50': {'User': [
              Mutation(column_or_supercolumn=ColumnOrSuperColumn(column=Column(
                timestamp=1322382778794088, name='password', value='password50', ttl=None),
                counter_super_column=None, super_column=None, counter_column=None),
                deletion=None),
              Mutation(column_or_supercolumn=ColumnOrSuperColumn(column=Column(
                timestamp=1322382778794088, name='name', value='User 50', ttl=None),
                counter_super_column=None, super_column=None, counter_column=None),
                deletion=None)]}},
             1)
   new_f    function batch_mutate at 0x2062cf8
   kwargs   {'reset': True}

 - /root/twiss/lib/python2.7/site-packages/pycassa/pool.py in new_f

       result = f(self, *args, **kwargs)
       self._retry_count = 0 # reset the count after a success
       return result
   except Thrift.TApplicationException, app_exc:
       self.close()
       self._pool._decrement_overflow()
       self._pool._clear_current()
       raise app_exc                           <-- error raised here
   except (TimedOutException, UnavailableException, Thrift.TException,
           socket.error, IOError, EOFError), exc:
       self._pool._notify_on_failure(exc, server=self.server, connection=self)

       self.close()
       self._pool._decrement_overflow()

 Local vars:
   f        unbound method Connection.batch_mutate
   self     pycassa.pool.ConnectionWrapper object at 0x2086050
   args     (same mutation map as above, 1)
   app_exc  TApplicationException(None,)
   new_f    function batch_mutate at 0x2062cf8
   kwargs   {}







-- 
Filipe Gonçalves


Re: Yanking a dead node

2011-11-24 Thread Filipe Gonçalves
Just remove its token from the ring using

nodetool removetoken <token>

2011/11/23 Maxim Potekhin potek...@bnl.gov:
 This was discussed a long time ago, but I need to know what's the state of
 the art answer to that:
 assume one of my few nodes is very dead. I have no resources or time to fix
 it. Data is replicated
 so the data is still available in the cluster. How do I completely remove
 the dead node without having
 to rebuild it, repair, drain and decommission?

 TIA
 Maxim





-- 
Filipe Gonçalves


Re: Is there a way to get only keys with get_indexed_slices?

2011-11-11 Thread Filipe Gonçalves
You can: just set the number of columns returned to zero (the count
parameter in the slice range).

The indexed slices thrift call is

get_indexed_slices(ColumnParent column_parent, IndexClause
index_clause, SlicePredicate predicate, ConsistencyLevel
consistency_level)

the count parameter is in the SliceRange within the SlicePredicate (if
you are using the Thrift interface directly).
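
For illustration, a sketch using the raw Thrift-generated Python bindings
(assuming the generated 'cassandra' package is on the path; keyspace, column
family and indexed column names below are placeholders):

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from cassandra import Cassandra
    from cassandra.ttypes import (ColumnParent, ConsistencyLevel, IndexClause,
                                  IndexExpression, IndexOperator,
                                  SlicePredicate, SliceRange)

    # Open a framed Thrift connection and pick a keyspace.
    socket = TSocket.TSocket('localhost', 9160)
    transport = TTransport.TFramedTransport(socket)
    client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()
    client.set_keyspace('MyKeyspace')

    parent = ColumnParent(column_family='Users')
    clause = IndexClause(
        expressions=[IndexExpression(column_name='state',
                                     op=IndexOperator.EQ,
                                     value='UT')],
        start_key='',
        count=100)                              # max number of matching rows

    # count=0 in the SliceRange: no column data is returned, only the keys.
    predicate = SlicePredicate(slice_range=SliceRange(start='', finish='',
                                                      reversed=False, count=0))

    key_slices = client.get_indexed_slices(parent, clause, predicate,
                                           ConsistencyLevel.ONE)
    matching_keys = [ks.key for ks in key_slices]

    transport.close()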

2011/11/11 Maxim Potekhin potek...@bnl.gov:

 Is there a way to get only keys with get_indexed_slices?
 Looking at the code, it's not possible, but -- is there some way anyhow?
 I don't want to extract any data, just a list of matching keys.

 TIA,

 Maxim





-- 
Filipe Gonçalves


Re: is that possible to add more data structure(key-list) in cassandra?

2011-11-11 Thread Filipe Gonçalves
You could use composite columns. For example,

key:
  composite(listname:listindex) : value

A simple get_range would give you access to the list as you would normally
have in any programming language, and a get could give you direct
access to any index.
Obviously, this would not be a good fit for lists which need insertion
at a specific index or reordering...
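
A sketch of that layout with pycassa (assuming a column family 'Lists' already
created with a CompositeType(UTF8Type, Int32Type) comparator; keyspace and
names are placeholders):

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    lists = pycassa.ColumnFamily(pool, 'Lists')

    # One row per owner; each column name is the composite (listname, index).
    lists.insert('user42', {('groceries', 0): 'milk',
                            ('groceries', 1): 'eggs',
                            ('todo', 0): 'file taxes'})

    # Direct access to a single list position.
    item = lists.get('user42', columns=[('groceries', 1)])   # {('groceries', 1): 'eggs'}

    # The row comes back ordered by (listname, index), so each named list can
    # be read off as a contiguous, ordered run of columns.
    row = lists.get('user42')
    groceries = [v for (name, idx), v in row.items() if name == 'groceries']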

2011/11/11 Yan Chunlu springri...@gmail.com:
 I thought currently no one is maintaining the supercolumn-related code, and
 also that it is not very efficient.


 On Fri, Nov 11, 2011 at 2:46 PM, Radim Kolar h...@sendmail.cz wrote:

 Dne 11.11.2011 5:58, Yan Chunlu napsal(a):

 I think Cassandra is doing a great job as a key-value data store; it saved me
 tremendous work on maintaining data consistency and service availability.
  But I think it would be great if it could support more data structures, such
 as key-list. Currently I am using key-value to save the list, and it seems not
 very efficient. Redis is good at this, but it is not easy to scale.

 Maybe this is the wrong place and the wrong question; I am only curious
 whether there is already a solution for this. Thanks a lot!

 use supercolumns unless your lists are very large.





-- 
Filipe Gonçalves


Multiget question

2011-11-04 Thread Filipe Gonçalves
Multiget slice queries seem to fetch rows sequentially, at least
from what I understood of the sources. This means the node that
receives a multiget of N keys does N get operations to the other nodes
in the cluster to fetch the remaining keys.
Am I right? Is this the way multiget works internally?
Also, shouldn't this be done in parallel, to avoid contacting
nodes more than once?
-- 
Filipe Gonçalves


Re: Multiget question

2011-11-04 Thread Filipe Gonçalves
Thanks for the answer.
I hadn't realised requests were made in parallel; I first noticed it
when multigets took linear time on machines with high loads. Looking
at the code led me to the previous conclusion (N gets for a multiget of
N keys). I agree it would take a major overhaul of the code to change
the current behaviour, possibly more than it's worth for the potential
gains.


2011/11/4 Sylvain Lebresne sylv...@datastax.com:
 2011/11/4 Filipe Gonçalves the.wa.syndr...@gmail.com:
 Multiget slice queries seem to fetch rows sequentially, at least
 from what I understood of the sources. This means the node that
 receives a multiget of N keys does N get operations to the other nodes
 in the cluster to fetch the remaining keys.
 Am I right? Is this the way multiget works internally?

 The 'sequentially' is probably not right depending on what you meant
 by that (see below), but otherwise yes, a multiget of N keys is internally
 split into N gets.

 Also, shouldn't this be done in parallel, to avoid contacting
 nodes more than once?

 It's done in parallel, in that the coordinating node sends all the get
 requests in parallel. It doesn't wait for the result of the first get to
 issue the second one. But it does do every get separately, i.e. it may
 contact the same node multiple times.

 In theory we could do with at most one message to each node for each
 multiget. We don't do it because it would actually require quite a bit of
 change in the current code, and it's unclear it would really buy us much.
 Since we already parallelize requests, we would mostly win a bit on network
 traffic (by merging messages), but there is a good chance this is insignificant
 (of course I could be wrong, given we haven't tried).

 --
 Sylvain

 --
 Filipe Gonçalves





-- 
Filipe Gonçalves