[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map

2015-06-03 Thread Nikhil Patel (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572204#comment-14572204
 ] 

Nikhil Patel commented on CASSANDRA-4175:
-

I am curious what the final decision is on the column name/id mapping. Has 
this feature been implemented, or is there a plan to implement it?

 Reduce memory, disk space, and cpu usage with a column name/id map
 --

 Key: CASSANDRA-4175
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
  Labels: performance
 Fix For: 3.x


 We spend a lot of memory on column names, both transiently (during reads) and 
 more permanently (in the row cache).  Compression mitigates this on disk but 
 not on the heap.
 The overhead is significant for typical small column values, e.g., ints.
 Even though we intern once we get to the memtable, this affects writes too 
 via very high allocation rates in the young generation, hence more GC 
 activity.
 Now that CQL3 provides us some guarantees that column names must be defined 
 before they are inserted, we could create a map of (say) 32-bit int column 
 id, to names, and use that internally right up until we return a resultset to 
 the client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map

2014-11-22 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1449#comment-1449
 ] 

Edward Capriolo commented on CASSANDRA-4175:


There was once a https://twitter.com/roflscaletips suggestion that said 
something to the effect of "make mongo faster by using small column names". The 
same advice applies here. If you name a column wombat_walnut_crackerjacks 
instead of w, it is going to take up more space on disk. This is because 
Cassandra stores the column name along with the value of each column on disk, 
because it is a row store, apparently.

A simple way to solve this would be to have the CQL language store some 
meta-data about alternate column names.

{quote}
CREATE TABLE abc ( wombat_walnut_crackerjacks int (shortname w) );
{quote}

Then the query engine could allow either to be used in a select clause.

{quote}
SELECT w from abc;
{quote}

{quote}
SELECT wombat_walnut_crackerjacks from abc;
{quote}

An even easier way is to name the column w. This way you avoid having systems 
where a column needs two names, or systems that keep an internal 
database mapping column names to shorter column names. But what is the fun of just 
telling people to use short names when a complex solution can be engineered :)

 Reduce memory, disk space, and cpu usage with a column name/id map
 --

 Key: CASSANDRA-4175
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
  Labels: performance
 Fix For: 3.0




[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map

2014-11-20 Thread Jon Haddad (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219899#comment-14219899
 ] 

Jon Haddad commented on CASSANDRA-4175:
---

Probably a stupid question, but will making the schema be set external to the 
SSTable make it harder or impossible to move an sstable to a different cluster, 
since the column data is no longer there?

 Reduce memory, disk space, and cpu usage with a column name/id map
 --

 Key: CASSANDRA-4175
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
  Labels: performance
 Fix For: 3.0




[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map

2014-11-20 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14219973#comment-14219973
 ] 

Aleksey Yeschenko commented on CASSANDRA-4175:
--

bq. Probably a stupid question, but will making the schema be set external to 
the SSTable make it harder or impossible to move an sstable to a different 
cluster, since the column data is no longer there?

Nah, we can just encode the {name -> id} map in sstable metadata.
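
As a rough illustration of what "encode the map in sstable metadata" could mean, here is a minimal sketch that writes a name -> id map into a byte block and reads it back. The class name `ColumnIdMetadata` and the layout (entry count, then length-prefixed UTF-8 name and 4-byte id per entry) are assumptions made for this example, not Cassandra's actual metadata format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not a Cassandra class: a self-contained block that
// carries the column name -> id map alongside the data it describes.
final class ColumnIdMetadata {
    // Layout (assumed): int count, then per entry: int nameLength, name bytes, int id.
    static ByteBuffer serialize(Map<String, Integer> nameToId) {
        int size = Integer.BYTES;
        for (String name : nameToId.keySet()) {
            size += Integer.BYTES + name.getBytes(StandardCharsets.UTF_8).length + Integer.BYTES;
        }
        ByteBuffer out = ByteBuffer.allocate(size);
        out.putInt(nameToId.size());
        for (Map.Entry<String, Integer> entry : nameToId.entrySet()) {
            byte[] utf8 = entry.getKey().getBytes(StandardCharsets.UTF_8);
            out.putInt(utf8.length);
            out.put(utf8);
            out.putInt(entry.getValue());
        }
        out.flip();
        return out;
    }

    // Reads the block back without disturbing the caller's buffer position.
    static Map<String, Integer> deserialize(ByteBuffer block) {
        ByteBuffer in = block.duplicate();
        Map<String, Integer> nameToId = new LinkedHashMap<>();
        int count = in.getInt();
        for (int i = 0; i < count; i++) {
            byte[] utf8 = new byte[in.getInt()];
            in.get(utf8);
            int id = in.getInt();
            nameToId.put(new String(utf8, StandardCharsets.UTF_8), id);
        }
        return nameToId;
    }
}
```

Because the block travels with the sstable, a receiving cluster would not need the original schema to resolve ids back to names.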

 Reduce memory, disk space, and cpu usage with a column name/id map
 --

 Key: CASSANDRA-4175
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
  Labels: performance
 Fix For: 3.0




[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map

2014-07-18 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066498#comment-14066498
 ] 

Robert Stupp commented on CASSANDRA-4175:
-

My five cents ;) Sorry if I repeat some things, I didn't read everything...

Such an enum/map of _column-id_ to _column-name_ should also include UDT 
field names.

The id generator for the _column-id_ could be per-keyspace (maybe something 
like a _next-column-id_ field per keyspace).

I guess a typical column name is 10-15 chars long.
So the savings on heap and off-heap are worth implementing that enum/map: such 
a typical column name {{String}} occupies about 60 bytes on heap, an {{int}} 
just 4. It also removes pressure from GC.

Savings could also occur on the wire (between nodes), in the commit log and in 
data files. If the _column-id_ is globally unique per KS, sstable files remain 
portable between nodes (are they portable?).

It might also save bandwidth when serializing result sets back to the client 
(provided all clients know about the id-name mapping).
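
A minimal sketch of the per-keyspace id generator described above, assuming one registry object per keyspace. `KeyspaceColumnIds` is a hypothetical class, and registering UDT fields under a qualified name such as "address.street" is only an assumption made for illustration. The sketch is single-process; agreeing on ids across nodes is not addressed here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one instance per keyspace hands out dense int ids.
final class KeyspaceColumnIds {
    private final Map<String, Integer> idByName = new HashMap<>();
    private final List<String> nameById = new ArrayList<>();

    // Returns the existing id, or assigns the next free one. The
    // "next-column-id" counter is simply the current size of nameById.
    synchronized int idFor(String columnName) {
        Integer id = idByName.get(columnName);
        if (id == null) {
            id = nameById.size();
            nameById.add(columnName);
            idByName.put(columnName, id);
        }
        return id;
    }

    // Converts an id back to its name via a single list lookup.
    synchronized String nameFor(int id) {
        return nameById.get(id);
    }
}
```

With this shape, rows held in memory would carry a 4-byte {{int}} per column instead of a reference to a name {{String}}, and names are only materialized when a result is built.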

 Reduce memory, disk space, and cpu usage with a column name/id map
 --

 Key: CASSANDRA-4175
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
  Labels: performance
 Fix For: 3.0




[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map

2014-04-22 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13977467#comment-13977467
 ] 

Benedict commented on CASSANDRA-4175:
-

See also CASSANDRA-6917 - IMO the best solution to this problem is an enum data 
type, and then to convert all column names to that type.

 Reduce memory, disk space, and cpu usage with a column name/id map
 --

 Key: CASSANDRA-4175
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
  Labels: performance
 Fix For: 3.0




[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map

2013-10-31 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13810299#comment-13810299
 ] 

Benedict commented on CASSANDRA-4175:
-

I think it could be a big win from a CPU point of view just to have a transient (per 
launch, per node) map. Assuming we convert back via a single 
array lookup, the extra indirection cost is unlikely to be measurable. If 
we were to precompute the comparisons of the ByteBuffer names, we would 
definitely save O(name.length()) operations per task, and could potentially 
switch to counting sort and save O(m * n * lg(n)) [where n is the number of columns 
involved in an operation, and m is the length of the column names] for CFs 
with, say,  100 columns.

It could potentially be implemented by abstracting Column to allow different 
sources of name(), so that CFs with large numbers of column names, or TimeUUID 
comparators, etc. can remain with the current implementation. Obviously with 
care taken not to break the native protocol...
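
To make the counting-sort idea concrete, here is a small sketch. It assumes ids are assigned in comparator order (id 0 is the smallest name, and so on), so that sorting ids is equivalent to sorting names; that assignment rule and the class `ColumnIdSort` are assumptions for this example only.

```java
// Hypothetical sketch: sort the column ids touched by an operation in
// O(n + k) time, where k is the number of known ids, with no name comparisons.
final class ColumnIdSort {
    static int[] sortIds(int[] columnIds, int idCount) {
        // Count how many times each id occurs.
        int[] counts = new int[idCount];
        for (int id : columnIds) {
            counts[id]++;
        }
        // Emit ids in increasing order, repeating each by its count.
        int[] sorted = new int[columnIds.length];
        int pos = 0;
        for (int id = 0; id < idCount; id++) {
            for (int c = 0; c < counts[id]; c++) {
                sorted[pos++] = id;
            }
        }
        return sorted;
    }

    // Converts sorted ids back to names via a single array lookup each.
    static String[] namesFor(int[] ids, String[] namesInComparatorOrder) {
        String[] names = new String[ids.length];
        for (int i = 0; i < ids.length; i++) {
            names[i] = namesInComparatorOrder[ids[i]];
        }
        return names;
    }
}
```

The counts array grows with the number of known ids, which is why this only pays off for CFs with a small, fixed set of column names.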


 Reduce memory, disk space, and cpu usage with a column name/id map
 --

 Key: CASSANDRA-4175
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
  Labels: performance
 Fix For: 2.1




[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map

2013-09-19 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13772023#comment-13772023
 ] 

Sylvain Lebresne commented on CASSANDRA-4175:
-

I'm pretty sure we'd need CASSANDRA-5417 to make that doable (in fact, that's 
one of my original motivations for doing CASSANDRA-5417). Namely, we don't want 
a cell name/id map, we want a cql3 column/id map, otherwise this loses most of 
its interest. And we can't do a cql3 column/id map if we store cell names as opaque 
byte buffers.

To be more precise, I don't deny that a cell name/id map could be a start and 
would in fact serve some use cases, but I'm a bit reluctant to implement 
that knowing that we want to change to a cql3 column/id map sooner rather than later, 
because I suspect it'll be a lot easier to do the right thing to start with 
rather than doing a cell name/id map and then having a painful time switching to a 
cql3 column/id one without breaking backward compatibility.

Besides, I also suspect there are a bunch of refactorings in 
CASSANDRA-5417 that would be needed here as well, so working on both separately 
without coordination is likely to be frustrating and a duplication of effort.

Anyway, I do plan on getting back to CASSANDRA-5417 asap (though it is 
unlikely to be as soon as next week), so maybe we can hold off a bit on this one until then? 
If I've made no progress on CASSANDRA-5417 in, say, a month or two, and people 
really want this, we can re-evaluate? 

 Reduce memory, disk space, and cpu usage with a column name/id map
 --

 Key: CASSANDRA-4175
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
 Fix For: 2.1



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map

2013-07-03 Thread Terje Marthinussen (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698701#comment-13698701
 ] 

Terje Marthinussen commented on CASSANDRA-4175:
---

Hi, 

Sorry for the late update.

Yes, we have a cluster with some 20-30 billion columns (maybe even closer to 40 
billion by now) which implements a column name map and has been in production 
for about 2 years.

I was actually looking at committing this 2 years ago together with a fairly 
large number of other changes which were implemented in the column/supercolumn 
serializer code, but I never got around to implementing a good way to push the 
sstable version numbers into the serializer to make things backwards compatible 
before focus moved resources elsewhere.

As mentioned above by others, while not benchmarked and proven, I had a very 
good feeling that the total change helped quite a bit on GC issues and memtables, 
and a bit on performance in general. In terms of disk space, however, the benefit was 
somewhat limited after sstable compression was implemented, as the repeating 
column names compress pretty well.

This is already 2 years ago (the cluster still runs, by the way), but if memory 
serves me right:
30-40% reduction in disk space without compression
10% reduction on top of compression (I did a test after it was implemented).

In my case, the implementation is actually hardcoded due to time constraints: a 
static map which is global for the entire Cassandra installation.

If committing this into Cassandra, I believe my plan was split in 3, 
possibly as 3 different implementation stages:

1. A simple config option (as a config file or as a columnfamily) where users 
themselves can register repeating column names. Sure, it is not as fancy as many 
other options, but maybe we could open it up to cover some strange corner case 
usages here with things like substrings as well.

Think of options to cover complex versions of patterns like date/times such as 
20130701202020, where a large chunk of the column name repeats, but not all of 
it.

In the current implementation, if there is a mapping entry, it converts the 
string to a variable-length integer which becomes the new column name. If there 
is no mapping entry, it stores the raw data.

In our case, we have 40 repeating column names so I never need more than 1 
byte, but the implementation would handle more if I had more.
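
The mapped-or-raw behaviour described above could be sketched as follows. The comment does not say which variable-length integer encoding was used, so the 7-bits-per-byte scheme here is an assumption, as are the class and method names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical sketch: a mapped name becomes a short varint id, an unmapped
// name is stored as its raw bytes.
final class ColumnNameCodec {
    // Result of encoding one column name.
    static final class Encoded {
        final boolean mapped; // true if bytes hold a varint id, false if raw name bytes
        final byte[] bytes;

        Encoded(boolean mapped, byte[] bytes) {
            this.mapped = mapped;
            this.bytes = bytes;
        }
    }

    // Assumed encoding: 7 payload bits per byte, high bit set on every byte but the last.
    static byte[] writeVarInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    static int readVarInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return value;
    }

    static Encoded encode(String name, Map<String, Integer> mapping) {
        Integer id = mapping.get(name);
        if (id != null) {
            return new Encoded(true, writeVarInt(id));
        }
        return new Encoded(false, name.getBytes(StandardCharsets.UTF_8));
    }
}
```

Under this encoding any id below 128 fits in a single byte, which is consistent with 40 repeating names never needing more than 1 byte.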

I modified the sstable to add a bitmap at the start of each column to be able 
to turn on/off mapping entries, unused timestamps, TTLs and other things. 
There are a bunch of 64-bit numbers in the column format which only have their default 
value in 99.999% of all cases, and very often your column value is just an 8 
byte int, a boolean or a short text entry. 

I think in 99% of the columns in this Cassandra store, the column timestamp 
takes up more space than the column value.
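
A sketch of what such a per-column feature bitmap might look like. The flag names, the single-byte width, and the convention that a value of 0 means "not set" are all assumptions for this example; the comment does not describe the actual bit layout.

```java
// Hypothetical sketch: one flag byte per column says which optional fields follow.
final class ColumnFlags {
    static final int NAME_MAPPED = 1 << 0;   // name is a map id rather than raw bytes
    static final int HAS_TIMESTAMP = 1 << 1; // an 8-byte timestamp follows
    static final int HAS_TTL = 1 << 2;       // a 4-byte TTL follows

    // Assumption: 0 is the default value for timestamp and TTL, so it is omitted.
    static int flagsFor(boolean nameMapped, long timestamp, int ttlSeconds) {
        int flags = 0;
        if (nameMapped) {
            flags |= NAME_MAPPED;
        }
        if (timestamp != 0L) {
            flags |= HAS_TIMESTAMP;
        }
        if (ttlSeconds != 0) {
            flags |= HAS_TTL;
        }
        return flags;
    }

    // Bytes used by the flag byte plus the optional fields it announces.
    static int headerSize(int flags) {
        int size = 1;
        if ((flags & HAS_TIMESTAMP) != 0) {
            size += Long.BYTES;
        }
        if ((flags & HAS_TTL) != 0) {
            size += Integer.BYTES;
        }
        return size;
    }
}
```

A column with default timestamp and TTL then costs 1 header byte instead of 12, and the NAME_MAPPED bit is what lets raw and mapped names coexist in the same file.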

This would have been my first implementation, mostly because I have a working 
implementation of it already and the mapping table would be very easy to move 
to a config file read at the start of a column family, similar to what we have for 
CF config. But here too, it is a bit of work to push such config data down to the 
serializer as the code was organized 2 years ago.

Notice again, you do not need atomic handling of the updates to the map in any 
way in this implementation. You can add map entries at any time. The result 
after deserializing is always the same, as column names can have a mix of raw 
and map id values thanks to the column feature bitmap that was introduced.

2. Auto-learning feature with a mapping table per sstable. 
This would be stage 2 of the implementation.

When starting to create a new SSTable, build a sampling of the most frequently 
occurring column names and gradually start mapping them to IDs.

Add the mapping table to the end of the SSTable or to a separate .map file 
(similar to index files) at the completion of sstable generation.
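
One way to picture the auto-learning step is a counter that observes column names while an sstable is written and then maps the most frequent ones to ids. This sketch counts every name exactly, which is a simplification; a real sampler would need to bound its memory. `SSTableNameSampler` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: learn the top column names seen while writing one sstable.
final class SSTableNameSampler {
    private final Map<String, Integer> counts = new HashMap<>();

    void observe(String columnName) {
        counts.merge(columnName, 1, Integer::sum);
    }

    // Assigns ids 0..limit-1 to the most frequent names. Ties are broken by
    // name so the result is deterministic.
    Map<String, Integer> buildMapping(int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> mapping = new LinkedHashMap<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            mapping.put(entries.get(i).getKey(), i);
        }
        return mapping;
    }
}
```

Names that do not make the cut would simply be stored raw, and the resulting mapping is what would be appended to the sstable or written to the separate .map file.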

The initial id mapping could be further improved by maintaining a global map of 
column names. This global map would not be used for 
serialization/deserialization. It would be used to pre-populate the values for an 
sstable and would only be statistics to optimize things further, by reducing the 
number of mapping variances between sstables and reducing the number of raw 
values getting stored a bit more.

The id map would still be local to each sstable in terms of storage, but having 
such statistics would allow you to dramatically reduce the size of a 
potentially shared id cache across sstables, where a lot of mapping entries 
would be identical.

Some may feel that we would run out of memory quickly or use a lot of extra 
disk with maps per sstable, but I guess that we only really need to deal with 
the top few thousand entries in each sstable, and this would not be a problem to 
keep in an idmap cache in terms of size.

This is really just the top X 

[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map

2013-07-03 Thread Terje Marthinussen (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13698710#comment-13698710
 ] 

Terje Marthinussen commented on CASSANDRA-4175:
---

I should maybe add that 1 and 2 above do not exclude but rather complement each 
other.

#1 is a manual map and could allow things like a prefix map such as '$201212', 
which will map all such prefixes to an id.

#2 is an auto map. It may require 1 if we want to allow users to give 
hints for substring maps such as '$(201\d\d\d)' to map all year+month-like 
strings starting with 201 to a mapping entry. This will just be a hint. The 
sampling of the number of entries should decide what gets mapped, to avoid running 
out of memory.

I am a bit unsure whether advanced features like substrings would ever be 
used, and they should maybe only be implemented as some sort of substring detection 
separately. 

As this can be a bit processing intensive, substring statistics (top 
substrings) could be detected and maintained node-wide in compaction and given 
as hints to the serializer later.
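
The prefix map from #1 could look roughly like the sketch below: a registered prefix is replaced by its id and only the remainder of the name is stored raw. The class `PrefixMap`, its linear scan, and the longest-prefix rule are assumptions for illustration; the '$...' hint syntax itself is not parsed here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: map a repeating name prefix to an id, keep the rest raw.
final class PrefixMap {
    // Result of a successful lookup.
    static final class Match {
        final int id;           // id of the matched prefix
        final String remainder; // part of the column name after the prefix

        Match(int id, String remainder) {
            this.id = id;
            this.remainder = remainder;
        }
    }

    private final Map<String, Integer> idByPrefix = new HashMap<>();

    void add(String prefix, int id) {
        idByPrefix.put(prefix, id);
    }

    // Returns the longest registered prefix that matches, or null if none does.
    Match match(String columnName) {
        Match best = null;
        int bestLength = -1;
        for (Map.Entry<String, Integer> entry : idByPrefix.entrySet()) {
            String prefix = entry.getKey();
            if (columnName.startsWith(prefix) && prefix.length() > bestLength) {
                best = new Match(entry.getValue(), columnName.substring(prefix.length()));
                bestLength = prefix.length();
            }
        }
        return best;
    }
}
```

For a timestamp-like name such as 20121225101500 with prefix 201212 registered, only the id and the 8-character remainder would need to be stored.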


 Reduce memory, disk space, and cpu usage with a column name/id map
 --

 Key: CASSANDRA-4175
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
 Fix For: 2.1




[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map

2013-07-03 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13699208#comment-13699208
 ] 

Jonathan Ellis commented on CASSANDRA-4175:
---

{quote}
Has it become easier to get to know sstable version numbers in the serializer 
class now?

I could maybe check if someone in the team here would like to take a stab at 
moving this to latest cassandra and commit it if the above implementation seems 
interesting. 
{quote}

That would be great.  Yes, you'll see Descriptor.Version being passed around 
now which is what encapsulates what kind of sstable it is, including to the 
lowest level of Column.onDiskIterator.

 Reduce memory, disk space, and cpu usage with a column name/id map
 --

 Key: CASSANDRA-4175
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
 Fix For: 2.1




[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map

2013-06-09 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13679079#comment-13679079
 ] 

Edward Capriolo commented on CASSANDRA-4175:


CASSANDRA-2995 says 

{quote}
It could be advantageous for Cassandra to make the storage engine pluggable. 
This could allow Cassandra to
* deal with potential use cases where maybe the current sstables are not the 
best fit
* allow several types of internal storage formats (at the same time) 
optimized for different data types
{quote}

Since this issue talks about reducing disk space, it will be changing how data 
is written, which seems to benefit people with mostly static columns. It sounds 
right on the money with 2995. However, it goes beyond storage layer changes.

The feature makes a ton of sense and does not only benefit the cql3 case. Many 
people have static columns, and since 0.7 standard column families have had 
schema as well.

Suppose Cassandra had a 'pluggable storage format'. One of the things the 
'ColumnMapIdStorageFormat' could do is write the known schema to a small file, 
loaded in memory with each sstable (like the bloom filter), that would contain 
the mappings. In the end I think you would have to store this anyway, because 
the mappings would change over time and what is in the schema now may not be 
fully accurate for old flushed tables. This would only save storage, as 
mentioned, and the internode traffic could not be optimized with pluggable 
storage alone.


For compare and swap, well, whatever, it's just one feature and no one has to 
use it if they do not want to. However, requiring all schema changes to need ZK 
is crazy scary to me. It is true that schema always needed to propagate before 
it can be used. I personally do not want to have to install ZK side by side 
with all my Cassandra installs, and I do not want to rely on it for schema 
changes. 

Architecturally, building on ZK is a house of cards. This was originally why I 
chose Cassandra over HBase (HBase had metadata on HDFS, and state information 
in ZK). The WORST thing that ever happens to Cassandra is that a node has a 
corrupt schema or a disagreement. I restart/decommission and rejoin the node and it 
is fixed.

If we start storing bits of information (column ids, schema) in ZooKeeper, we 
become totally reliant on it: nodes may or may not be able to start up without 
it, we may or may not be able to make schema changes without it, and MOST 
IMPORTANTLY, IT'S AN SPOF THAT, WHEN IT GOES CORRUPT, will likely cause the 
entire cluster to * die, or likely function in a way worse than death, 
something like writing corrupt column ids to files and hopelessly corrupting 
everything.

No thanks to any ZK integration. ZK and centrally managed metadata = HBase.



 Reduce memory, disk space, and cpu usage with a column name/id map
 --

 Key: CASSANDRA-4175
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
 Fix For: 2.1




[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map

2013-06-07 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13678651#comment-13678651
 ] 

Edward Capriolo commented on CASSANDRA-4175:


It also sounds like we are re-opening the concept of pluggable storage 
(https://issues.apache.org/jira/browse/CASSANDRA-2995), since we are talking about 
custom disk formats only good for specific use cases.

 Reduce memory, disk space, and cpu usage with a column name/id map
 --

 Key: CASSANDRA-4175
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
 Fix For: 2.1




[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map

2013-06-07 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13678655#comment-13678655
 ] 

Jonathan Ellis commented on CASSANDRA-4175:
---

That is not what we are talking about.

 Reduce memory, disk space, and cpu usage with a column name/id map
 --

 Key: CASSANDRA-4175
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
 Fix For: 2.1

