efficient way to store an 8-bit or 16-bit value?

2013-12-11 Thread onlinespending
What do people recommend I do to store a small binary value in a column? I’d 
rather not simply use a 32-bit int for a single-byte value. Can I have a 
one-byte blob? Or should I store it as a single-character ASCII string? I imagine 
each is going to have the overhead of storing the length (or null termination 
in the case of a string). That overhead may be worse than simply using a 32-bit 
int.
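
To make the blob option concrete, here's a minimal sketch of what I'm picturing; the table and column names are made up:

-- hypothetical table: a one-byte value stored as a blob
CREATE TABLE flags (
  id text PRIMARY KEY,
  flag blob            -- would hold exactly one byte, e.g. 0x01
);

INSERT INTO flags (id, flag) VALUES ('abc123', 0x01);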

Also, is it possible to partition on a single character or substring of 
characters from a string (or a portion of a blob)? Something like this 
(pseudo-syntax):

CREATE TABLE test (
  id text,
  value blob,
  PRIMARY KEY (id[0:1])
)
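
If that kind of substring indexing isn't supported, the workaround I'm assuming I'd need is to precompute the prefix into its own column and make that the partition key. A minimal sketch (column names are just illustrative):

CREATE TABLE test (
  id_prefix text,   -- the first character of id, computed by the application
  id text,
  value blob,
  PRIMARY KEY (id_prefix, id)
);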

user / password authentication advice

2013-12-11 Thread onlinespending
Hi,

I’m using Cassandra in an environment where many users can log in to use an 
application I’m developing. I’m curious if anyone has any advice or links to 
documentation / blogs that discuss common implementations or best 
practices for user and password authentication. My cursory search online didn’t 
bring much up on the subject. I suppose the information needn’t even be 
specific to Cassandra.

I imagine a few basic steps will be as follows:

the user types in a username (e.g. email address) and password
this is verified against a table storing usernames and passwords (hashed in 
some way)
a token is returned to the app / web browser to allow further transactions using 
a secure token (e.g. a cookie)
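
To make the question concrete, the kind of schema I have in mind looks roughly like this; the table layout and the idea of hashing with something like bcrypt are just my assumptions, not something I've verified as best practice:

-- hypothetical table: one row per user, keyed by email address
CREATE TABLE users (
  email text PRIMARY KEY,
  password_hash text,   -- e.g. a bcrypt/PBKDF2 hash computed in the application
  created_at timestamp
);

-- hypothetical table: opaque session tokens handed back to the browser
CREATE TABLE auth_tokens (
  token uuid PRIMARY KEY,
  email text,
  expires_at timestamp
);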

Obviously I’m only scratching the surface and it’s the detail and best 
practices of implementing this user / password authentication that I’m curious 
about.

Thank you,
Ben




Re: user / password authentication advice

2013-12-11 Thread onlinespending
OK, thanks for getting me going in the right direction. I imagine most people 
would store the password and authentication-token information in a single 
table, using the username (e.g. email address) as the key?


On Dec 11, 2013, at 10:44 PM, Janne Jalkanen janne.jalka...@ecyrd.com wrote:

 
 Hi!
 
 You’re right, this isn’t really Cassandra-specific. Most languages/web 
 frameworks have their own way of doing user authentication, and then you 
 typically just write a plugin that stores whatever data the system needs in 
 Cassandra.
 
 For example, if you're using Java (or Scala or Groovy or anything else 
 JVM-based), Apache Shiro is a good way of doing user authentication and 
 authorization. http://shiro.apache.org/. Just implement a custom Realm for 
 Cassandra and you should be set.
 
 /Janne
 
 On Dec 12, 2013, at 05:31 , onlinespending onlinespend...@gmail.com wrote:
 
 Hi,
 
 I’m using Cassandra in an environment where many users can log in to use an 
 application I’m developing. I’m curious if anyone has any advice or links to 
 documentation / blogs that discuss common implementations or best 
 practices for user and password authentication. My cursory search online 
 didn’t bring much up on the subject. I suppose the information needn’t even 
 be specific to Cassandra.
 
 I imagine a few basic steps will be as follows:
 
 the user types in a username (e.g. email address) and password
 this is verified against a table storing usernames and passwords (hashed 
 in some way)
 a token is returned to the app / web browser to allow further transactions 
 using a secure token (e.g. a cookie)
 
 Obviously I’m only scratching the surface and it’s the detail and best 
 practices of implementing this user / password authentication that I’m 
 curious about.
 
 Thank you,
 Ben
 
 
 



Re: Exactly one wide row per node for a given CF?

2013-12-10 Thread onlinespending
comments below

On Dec 9, 2013, at 11:33 PM, Aaron Morton aa...@thelastpickle.com wrote:

 But this becomes troublesome if I add or remove nodes. What effectively I 
 want is to partition on the unique id of the record modulus N (id % N; where 
 N is the number of nodes).
 This is exactly the problem consistent hashing (used by Cassandra) is 
 designed to solve. If you hash the key and modulo the number of nodes, adding 
 and removing nodes requires a lot of data to move. 
 
 I want to be able to randomly distribute a large set of records but keep 
 them clustered in one wide row per node.
 Sounds like you should revisit your data modelling; this is a pretty 
 well-known anti-pattern. 
 
 When rows get above a few 10’s of MB things can slow down, when they get 
 above 50 MB they can be a pain, when they get above 100MB it’s a warning 
 sign. And when they get above 1GB, well, you don’t want to know what 
 happens then. 
 
 It’s a bad idea and you should take another look at the data model. If you 
 have to do it, you can try the ByteOrderedPartitioner, which uses the row key 
 as a token, giving you total control of the row placement.


You should re-read my last paragraph for an example that would clearly benefit 
from such an approach. If one understands how paging works, then you’ll see 
how you’d benefit from grouping probabilistically similar data within each 
node while also splitting data across nodes to reduce hot spotting.

Regardless, I no longer think it’s necessary to have a single wide row per 
node. Several wide rows per node are just as good, since for all practical 
purposes paging in the first N key/values per M rows on a node is the same as 
reading in the first N*M key/values from a single row.

So I’m going to do what I alluded to before. Treat the LSB of a record’s id as 
the partition key, and then cluster on something meaningful (such as geohash) 
and the prefix of the id.

create table test (
  id_prefix int,
  id_suffix int,
  geohash text,
  value text,
  primary key (id_suffix, geohash, id_prefix)
);

So if this was for a collection of users, they would be randomly distributed 
across nodes to increase parallelism and reduce hotspots, but within each wide 
row they'd be meaningfully clustered by geographic region, so as to increase 
the probability that paged in data is going to contain more of the working set.
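
To illustrate, assuming the application splits the id itself (e.g. the suffix is the low byte, id & 0xff, and the prefix is the remaining bits), an insert for id 0x12345678 might look like the sketch below; the geohash and payload values are made up:

-- id 0x12345678 split client-side:
--   id_suffix = 0x78 = 120        (the LSB, used as the partition key)
--   id_prefix = 0x123456 = 1193046
INSERT INTO test (id_suffix, geohash, id_prefix, value)
VALUES (120, '9q8yyk8', 1193046, 'record payload');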



 
 Cheers
 
 
 -
 Aaron Morton
 New Zealand
 @aaronmorton
 
 Co-Founder & Principal Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com
 
 On 4/12/2013, at 8:32 pm, Vivek Mishra mishra.v...@gmail.com wrote:
 
 So basically you want to create a cluster of multiple unique keys, but data 
 which belongs to one unique key should be colocated. Correct?
 
 -Vivek
 
 
 On Tue, Dec 3, 2013 at 10:39 AM, onlinespending onlinespend...@gmail.com 
 wrote:
 Subject says it all. I want to be able to randomly distribute a large set of 
 records but keep them clustered in one wide row per node.
 
 As an example, let’s say I’ve got a collection of about 1 million records 
 each with a unique id. If I just go ahead and set the primary key (and 
 therefore the partition key) as the unique id, I’ll get very good random 
 distribution across my server cluster. However, each record will be its own 
 row. I’d like to have each record belong to one large wide row (per server 
 node) so I can have them sorted or clustered on some other column.
 
 If I, say, have 5 nodes in my cluster, I could randomly assign a value of 1 to 
 5 at the time of creation and have the partition key set to this value. But 
 this becomes troublesome if I add or remove nodes. What effectively I want 
 is to partition on the unique id of the record modulus N (id % N; where N is 
 the number of nodes).
 
 I have to imagine there’s a mechanism in Cassandra to simply randomize the 
 partitioning without even using a key (and then clustering on some column).
 
 Thanks for any help.
 
 



Re: Exactly one wide row per node for a given CF?

2013-12-10 Thread onlinespending
Where you’ll run into trouble is with compaction.  It looks as if pord is some 
sequentially increasing value.  Try your experiment again with a clustering key 
which is a bit more random at the time of insertion.


On Dec 10, 2013, at 5:41 AM, Robert Wille rwi...@fold3.com wrote:

 I have a question about this statement:
 
 When rows get above a few 10’s of MB things can slow down, when they get 
 above 50 MB they can be a pain, when they get above 100MB it’s a warning 
 sign. And when they get above 1GB, well, you don’t want to know what 
 happens then. 
 
 I tested a data model that I created. Here’s the schema for the table in 
 question:
 
 CREATE TABLE bdn_index_pub (
   tree INT,
   pord INT,
   hpath VARCHAR,
   PRIMARY KEY (tree, pord)
 );
 
 As a test, I inserted 100 million records. tree had the same value for every 
 record, and I had 100 million values for pord. hpath averaged about 50 
 characters in length. My understanding is that all 100 million strings would 
 have been stored in a single row, since they all had the same value for the 
 first component of the primary key. I didn’t look at the size of the table, 
 but it had to be several gigs (uncompressed). Contrary to what Aaron says, I 
 do want to know what happens, because I didn’t experience any issues with 
 this table during my test. Inserting was fast. The last batch of records 
 inserted in approximately the same amount of time as the first batch. 
 Querying the table was fast. What I didn’t do was test the table under load, 
 nor did I try this in a multi-node cluster.
 
 If this is bad, can somebody suggest a better pattern? This table was 
 designed to support a query like this: select hpath from bdn_index_pub where 
 tree = :tree and pord >= :start and pord <= :end. In my application, most 
 trees will have fewer than a million records. A handful will have 10’s of 
 millions, and one of them will have 100 million.
 
 If I need to break up my rows, my first instinct would be to divide each tree 
 into blocks of say 10,000 and change tree to a string that contains the tree 
 and the block number. Something like this:
 
 17:0, 0, ‘/’
 …
 17:0, , ’/a/b/c’
 17:1,1, ‘/a/b/d’
 …
 
 I’d then need to issue an extra query for ranges that crossed block 
 boundaries.
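 
 A minimal sketch of that blocking idea as I picture it; the table name is 
 made up, the block size of 10,000 is just the figure above, and the block 
 component would be computed by the application as pord / 10000:
 
 CREATE TABLE bdn_index_pub_blocked (
   tree_block VARCHAR,   -- e.g. '17:0', '17:1', ... = tree + ':' + (pord / 10000)
   pord INT,
   hpath VARCHAR,
   PRIMARY KEY (tree_block, pord)
 );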
 
 Any suggestions on a better pattern?
 
 Thanks
 
 Robert
 
 From: Aaron Morton aa...@thelastpickle.com
 Reply-To: user@cassandra.apache.org
 Date: Tuesday, December 10, 2013 at 12:33 AM
 To: Cassandra User user@cassandra.apache.org
 Subject: Re: Exactly one wide row per node for a given CF?
 
 But this becomes troublesome if I add or remove nodes. What effectively I 
 want is to partition on the unique id of the record modulus N (id % N; 
 where N is the number of nodes).
 This is exactly the problem consistent hashing (used by Cassandra) is 
 designed to solve. If you hash the key and modulo the number of nodes, adding 
 and removing nodes requires a lot of data to move. 
 
 I want to be able to randomly distribute a large set of records but keep 
 them clustered in one wide row per node.
 Sounds like you should revisit your data modelling; this is a pretty 
 well-known anti-pattern. 
 
 When rows get above a few 10’s of MB things can slow down, when they get 
 above 50 MB they can be a pain, when they get above 100MB it’s a warning 
 sign. And when they get above 1GB, well, you don’t want to know what 
 happens then. 
 
 It’s a bad idea and you should take another look at the data model. If you 
 have to do it, you can try the ByteOrderedPartitioner, which uses the row key 
 as a token, giving you total control of the row placement. 
 
 Cheers
 
 
 -
 Aaron Morton
 New Zealand
 @aaronmorton
 
 Co-Founder & Principal Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com
 
 On 4/12/2013, at 8:32 pm, Vivek Mishra mishra.v...@gmail.com wrote:
 
 So basically you want to create a cluster of multiple unique keys, but data 
 which belongs to one unique key should be colocated. Correct?
 
 -Vivek
 
 
 On Tue, Dec 3, 2013 at 10:39 AM, onlinespending onlinespend...@gmail.com 
 wrote:
 Subject says it all. I want to be able to randomly distribute a large set 
 of records but keep them clustered in one wide row per node.
 
 As an example, let’s say I’ve got a collection of about 1 million records 
 each with a unique id. If I just go ahead and set the primary key (and 
 therefore the partition key) as the unique id, I’ll get very good random 
 distribution across my server cluster. However, each record will be its own 
 row. I’d like to have each record belong to one large wide row (per server 
 node) so I can have them sorted or clustered on some other column.
 
 If I, say, have 5 nodes in my cluster, I could randomly assign a value of 1 to 
 5 at the time of creation and have the partition key set to this value. But 
 this becomes troublesome if I add or remove nodes. What effectively I want 
 is to partition on the unique id

Re: Exactly one wide row per node for a given CF?

2013-12-04 Thread onlinespending
Pretty much, yes. Although I think it’d be nice if Cassandra handled such a 
case, I’ve resigned myself to the fact that it cannot at the moment. The 
workaround will be to partition on the LSB portion of the id (giving 256 rows 
spread amongst my nodes), which allows room for scaling, and then cluster each 
row on geohash or something else.

Basically this desire all stems from wanting efficient use of memory. 
Frequently accessed keys’ values are kept in RAM through the OS page cache. But 
the page size is 4KB. This is a problem if you are accessing several small 
records of data (say 200 bytes), since each record only occupies a small % of a 
page. This is why it’s important to increase the probability that neighboring 
data on the disk is relevant. The worst case would be to read a full 4KB page 
into RAM of which you’re only accessing one record that’s a couple hundred 
bytes; all of the other unused data in the page wastefully occupies RAM. 
Now project this problem to a collection of millions of small records all 
indiscriminately and randomly scattered on the disk, and you can easily see how 
inefficient your memory usage will become.

That’s why it’s best to cluster data in some meaningful way, all in an effort 
to increase the probability that when one record is accessed in that 4KB 
block, its neighboring records will also be accessed. This brings me back 
to the question of this thread. I want to randomly distribute the data amongst 
the nodes to avoid hot spotting, but within each node I want to cluster the 
data meaningfully such that the probability that neighboring data is relevant 
is increased.

An example of this would be having a huge collection of small records that 
store basic user information. If you partition on the unique user id, then 
you’ll get nice random distribution but with no ability to cluster (each record 
would occupy its own row). You could partition on, say, geographical region, but 
then you’ll end up with hot spotting when one region is more active than 
another. So ideally you’d like to randomly assign a node to each record to 
increase parallelism, but then cluster all records on a node by, say, geohash, 
since it is more likely (however slightly) that when one user from a 
geographical region is accessed, other users from the same region will also 
need to be accessed. It’s certainly better than having some random user record 
next to the one you are accessing at the moment.
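
Concretely, for the user example I'm picturing something like the sketch below; the bucket would be user_id % 256 computed in the application, and the range query shows why the geohash clustering helps (all names are illustrative):

CREATE TABLE users_by_bucket (
  bucket int,          -- user_id % 256, computed client-side
  geohash text,
  user_id bigint,
  profile text,
  PRIMARY KEY (bucket, geohash, user_id)
);

-- within one bucket, users from the same region sit next to each other on disk
SELECT user_id, profile FROM users_by_bucket
 WHERE bucket = 42 AND geohash >= '9q8y' AND geohash < '9q8z';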




On Dec 3, 2013, at 11:32 PM, Vivek Mishra mishra.v...@gmail.com wrote:

 So basically you want to create a cluster of multiple unique keys, but data 
 which belongs to one unique key should be colocated. Correct?
 
 -Vivek
 
 
 On Tue, Dec 3, 2013 at 10:39 AM, onlinespending onlinespend...@gmail.com 
 wrote:
 Subject says it all. I want to be able to randomly distribute a large set of 
 records but keep them clustered in one wide row per node.
 
 As an example, let’s say I’ve got a collection of about 1 million records each 
 with a unique id. If I just go ahead and set the primary key (and therefore 
 the partition key) as the unique id, I’ll get very good random distribution 
 across my server cluster. However, each record will be its own row. I’d like 
 to have each record belong to one large wide row (per server node) so I can 
 have them sorted or clustered on some other column.
 
 If I, say, have 5 nodes in my cluster, I could randomly assign a value of 1 to 5 
 at the time of creation and have the partition key set to this value. But 
 this becomes troublesome if I add or remove nodes. What effectively I want is 
 to partition on the unique id of the record modulus N (id % N; where N is the 
 number of nodes).
 
 I have to imagine there’s a mechanism in Cassandra to simply randomize the 
 partitioning without even using a key (and then clustering on some column).
 
 Thanks for any help.
 



Inefficiency with large set of small documents?

2013-11-25 Thread onlinespending
I’m trying to decide which NoSQL database to use, and I’ve certainly decided 
against MongoDB due to its use of mmap. I’m wondering if Cassandra would also 
suffer from a similar inefficiency with small documents. In MongoDB, if you 
have a large set of small documents (each much smaller than the 4KB page size), 
you will require far more RAM to fit your working set into memory, since a 
large percentage of a 4KB chunk could very easily include infrequently accessed 
data outside of your working set. Cassandra doesn’t use mmap, but when reading 
a small document into RAM it would still have to intelligently discard the 
excess data that sits in the same allocation unit on disk but does not pertain 
to that document.

As an example, let’s say your cluster size is 4KB as well, and you have 1,000 
small 256-byte documents scattered on the disk that you want to fetch in a 
given query (the total number of documents is over 1 billion). I want to make 
sure it only consumes roughly 256,000 bytes for those 1,000 documents and not 
4,096,000 bytes. When it first fetches a cluster from disk it may consume 4KB 
of cache, but it should ideally end up consuming only the relevant bytes in 
RAM. If Cassandra just indiscriminately uses RAM in 4KB blocks then that is 
unacceptable to me, because if my working set at any given time is just 20% of 
my huge collection of small documents, I don’t want to have to use servers with 
5X as much RAM. That’s a huge expense.

Thanks,
Ben

P.S. Here’s a detailed post I made this morning in the MongoDB user group about 
this topic.

People have often complained that because MongoDB memory-maps everything and 
leaves memory management to the OS’s virtual memory system, the swapping 
algorithm isn’t optimized for database usage. I disagree with this. For the 
most part, a database-specific swapping or paging algorithm can’t be much 
better than the sophisticated algorithms (such as LRU-based ones) that OSes 
have refined over many years. Why reinvent the wheel? Yes, you could 
potentially ensure that certain data (such as the indexes) never gets swapped 
out to disk, because even if it hasn’t been accessed recently, the cost of 
reading it back into memory when it is in fact needed would be too high. But 
that’s not the bigger issue.

It breaks down with small documents smaller than the page size

This is where using virtual memory for everything really becomes an issue. 
Suppose you've got a bunch of really tiny documents (e.g. ~256 bytes) that are 
much smaller than the virtual memory page size (e.g. 4KB). Now let’s say that 
you’ve determined your working set (e.g. those documents in your collection 
that constitute, say, 99% of those accessed in a given hour) to be 20GB, but 
your entire collection size is actually 100GB (it’s just that 20% of your 
documents are much more likely to be accessed in a given time period; it’s 
not uncommon for a small minority of documents to be accessed a large 
majority of the time). If your collection is randomly distributed (such as 
would happen if you simply inserted new documents into your collection), then 
in this example only about 20% of the documents that fit onto a 4KB page will 
be part of the working set (i.e. the data that you need frequent access to at 
the moment). The rest of the data will be made up of much less frequently 
accessed documents that should ideally be sitting on disk. So there’s a huge 
inefficiency here. 80% of the data that is in RAM is not even something I need 
to frequently access. In this example, I would need 5X the amount of RAM to 
accommodate my working set.

Now, as a solution to this problem, you could separate your documents into two 
(or even a few) collections with the grouping done by access frequency. The 
problem with this is that your working set can often change as a function of 
time of day and day of week. If your application is global, your working set 
will be far different at 12pm local in NY vs. 12pm local in Tokyo. But even 
more likely is that the working set is constantly changing as new data is 
inserted into the database. Popularity of a document is often viral. As an 
example, a photo posted on a social network may start off infrequently 
accessed but then, after hundreds of likes, quickly become very 
frequently accessed and part of your working set. You’d need to actively 
monitor your documents and manually move a document from one collection to the 
other, which is very inefficient.

Quite frankly, this is not a burden that should be placed on the user anyway. 
By punting the problem of memory management to the OS, MongoDB requires the 
user to essentially do its job and group data in a way that patches the 
inefficiencies in its memory management. As far as I’m concerned, not until 
MongoDB steps up and takes control of memory management can it be taken 
seriously for very large datasets that often require many small documents with 
ever changing