[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
[ https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572204#comment-14572204 ]

Nikhil Patel commented on CASSANDRA-4175:

I am curious to know what the final decision on the column name/id mapping is. Has this feature been implemented, or is there a plan to do so?

Reduce memory, disk space, and cpu usage with a column name/id map

Key: CASSANDRA-4175
URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
Project: Cassandra
Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
Labels: performance
Fix For: 3.x

We spend a lot of memory on column names, both transiently (during reads) and more permanently (in the row cache). Compression mitigates this on disk but not on the heap.

The overhead is significant for typical small column values, e.g., ints.

Even though we intern once we get to the memtable, this affects writes too via very high allocation rates in the young generation, hence more GC activity.

Now that CQL3 provides us some guarantees that column names must be defined before they are inserted, we could create a map from (say) 32-bit int column ids to names, and use that internally right up until we return a resultset to the client.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
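A minimal sketch of the idea in the issue description, assuming column names are registered when the schema is defined and that ids are handed out sequentially. The class `ColumnIdMap` and its methods are hypothetical names chosen for illustration; nothing here is taken from Cassandra's code base.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical name/id registry (illustration only, not a Cassandra class). */
final class ColumnIdMap {
    private final Map<String, Integer> idsByName = new HashMap<>();
    private final List<String> namesById = new ArrayList<>();

    /** Registers a column name, e.g. at schema-definition time, and returns its id. */
    synchronized int define(String name) {
        Integer existing = idsByName.get(name);
        if (existing != null) {
            return existing; // defining the same name twice is a no-op
        }
        int id = namesById.size(); // assumption: ids are dense and sequential
        namesById.add(name);
        idsByName.put(name, id);
        return id;
    }

    /** Internal lookup; fails for names that were never defined. */
    synchronized int idOf(String name) {
        Integer id = idsByName.get(name);
        if (id == null) {
            throw new IllegalArgumentException("undefined column: " + name);
        }
        return id;
    }

    /** Converts an id back to its name, e.g. when building a result set for the client. */
    synchronized String nameOf(int id) {
        return namesById.get(id);
    }
}
```

With something like this, internal structures would carry a 4-byte `int` per cell instead of a reference to a name buffer, and the name would only be materialized at the client boundary.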
[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
[ https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1449#comment-1449 ]

Edward Capriolo commented on CASSANDRA-4175:

There was once a https://twitter.com/roflscaletips suggestion that said something to the effect of "make mongo faster by using small column names". The same advice applies here. If you name a column wombat_walnut_crackerjacks instead of w, it is going to take up more space on disk. This is because Cassandra stores the column name along with the value of each column on disk, because it is a row store, apparently.

A simple way to solve this would be to have the CQL language store some metadata about alternate column names.

{quote}
Create table abc ( wombat_walnut_crackerjacks int (shortname w) );
{quote}

Then the query engine could allow either to be used in a select clause.

{quote}
SELECT w from abc;
{quote}

{quote}
SELECT wombat_walnut_crackerjacks from abc;
{quote}

An even easier way is to name the column w. This way you avoid having systems where a column needs two names, or systems where column names have an internal database mapping each column name to a shorter column name. But what is the fun of just telling people to use short names when a complex solution can be engineered :)

Reduce memory, disk space, and cpu usage with a column name/id map

Key: CASSANDRA-4175
URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
Project: Cassandra
Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
Labels: performance
Fix For: 3.0
[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
[ https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219899#comment-14219899 ]

Jon Haddad commented on CASSANDRA-4175:

Probably a stupid question, but will making the schema be set external to the SSTable make it harder or impossible to move an sstable to a different cluster, since the column data is no longer there?

Reduce memory, disk space, and cpu usage with a column name/id map

Key: CASSANDRA-4175
URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
Project: Cassandra
Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
Labels: performance
Fix For: 3.0
[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
[ https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219973#comment-14219973 ]

Aleksey Yeschenko commented on CASSANDRA-4175:

bq. Probably a stupid question, but will making the schema be set external to the SSTable make it harder or impossible to move an sstable to a different cluster, since the column data is no longer there?

Nah, we can just encode the {name - id} map in sstable metadata.

Reduce memory, disk space, and cpu usage with a column name/id map

Key: CASSANDRA-4175
URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
Project: Cassandra
Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
Labels: performance
Fix For: 3.0
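One way to picture the suggestion of carrying the name/id map inside each sstable's metadata is a small self-describing byte block that travels with the file. The sketch below is an assumption-laden illustration: the class `NameIdMetadata` and its layout (entry count, then UTF string and int per entry) are invented here and do not reflect Cassandra's actual sstable metadata format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical codec for a name/id map stored alongside an sstable (illustration only). */
final class NameIdMetadata {
    /** Layout assumed for this sketch: int count, then (UTF name, int id) per entry. */
    static byte[] serialize(Map<String, Integer> nameToId) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(nameToId.size());
            for (Map.Entry<String, Integer> entry : nameToId.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Rebuilds the map, so a file moved to another cluster can still resolve its ids. */
    static Map<String, Integer> deserialize(byte[] data) {
        Map<String, Integer> nameToId = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                nameToId.put(name, in.readInt());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return nameToId;
    }
}
```

Because the map is read back from the file itself, the receiving cluster would not need to share the sender's schema-level id assignment.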
[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
[ https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066498#comment-14066498 ]

Robert Stupp commented on CASSANDRA-4175:

My five cents ;) Sorry if I repeat some things; I didn't read everything...

Using such an enum/map of _column-id_ to _column-name_ should also include UDT field names.

The id generator for the _column-id_ could be per-keyspace (maybe something like a _next-column-id_ field per keyspace).

I guess a typical column name is 10-15 chars long. So the savings on heap and off-heap are worth implementing that enum/map: such a typical column name {{String}} occupies about 60 bytes on heap, an {{int}} just 4. And it removes pressure from GC.

Savings could also occur on the wire (between nodes), in the commit log and in data files. If the _column-id_ is globally unique per KS, sstable files remain portable between nodes (are they portable?).

It might also save bandwidth when serializing result sets back to the client (provided all clients know about that id-name mapping).

Reduce memory, disk space, and cpu usage with a column name/id map

Key: CASSANDRA-4175
URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
Project: Cassandra
Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
Labels: performance
Fix For: 3.0
[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
[ https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977467#comment-13977467 ]

Benedict commented on CASSANDRA-4175:

See also CASSANDRA-6917 - IMO the best solution to this problem is an enum data type, and then to convert all column names to that type.

Reduce memory, disk space, and cpu usage with a column name/id map

Key: CASSANDRA-4175
URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
Project: Cassandra
Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
Labels: performance
Fix For: 3.0
[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
[ https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810299#comment-13810299 ]

Benedict commented on CASSANDRA-4175:

I think it could be a big win from a CPU pov just to have a transient (per launch, per node) map. On the assumption that we convert back via a single array lookup, the extra indirection cost is unlikely to be measurable. If we were to precompute the comparisons of the ByteBuffer names, we would definitely save O(name.length()) operations per task, and could potentially switch to counting sort and save O(m.n.lg(n)) [where n is the number of columns involved in an operation, and m is the length of the column names] for CFs with, say, 100 columns.

It could potentially be implemented by abstracting Column to allow different sources of name(), so that CFs with large numbers of column names, or TimeUUID comparators, etc. can remain with the current implementation.

Obviously with care taken not to break the native protocol...

Reduce memory, disk space, and cpu usage with a column name/id map

Key: CASSANDRA-4175
URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
Project: Cassandra
Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
Labels: performance
Fix For: 2.1
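To make the counting-sort remark concrete, here is a rough sketch under two assumptions that are not stated in the thread: ids are assigned as the rank of each name in sorted order (so comparing ids gives the same answer as comparing names), and names are distinct. `ColumnIdSort` is a hypothetical helper, and `Arrays.sort` on `String` stands in for whatever byte-wise comparator the column family really uses.

```java
import java.util.Arrays;

/** Hypothetical helper showing id-based ordering (illustration only). */
final class ColumnIdSort {
    /**
     * Returns the names in sorted order; the index of a name in the result is its id.
     * The name comparisons happen once here instead of on every operation.
     */
    static String[] rankedNames(String[] distinctNames) {
        String[] sorted = distinctNames.clone();
        Arrays.sort(sorted);
        return sorted;
    }

    /**
     * Counting sort over ids in the range [0, idCount).
     * Runs in O(n + idCount) and never touches the name bytes.
     */
    static int[] countingSort(int[] ids, int idCount) {
        int[] counts = new int[idCount];
        for (int id : ids) {
            counts[id]++;
        }
        int[] sorted = new int[ids.length];
        int position = 0;
        for (int id = 0; id < idCount; id++) {
            for (int c = 0; c < counts[id]; c++) {
                sorted[position++] = id;
            }
        }
        return sorted;
    }
}
```

Converting a sorted id back to its name is then the single array lookup mentioned in the comment: `rankedNames(...)[id]`.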
[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
[ https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772023#comment-13772023 ]

Sylvain Lebresne commented on CASSANDRA-4175:

I'm pretty sure we'd need CASSANDRA-5417 to make that doable (in fact, that's one of my original motivations for doing CASSANDRA-5417). Namely, we don't want a cell name/id map, we want a cql3 column/id map, otherwise this loses most of its interest. And we can't do a cql3 column/id map if we store cell names as opaque byte buffers.

To be more precise, I don't deny that a cell name/id map could be a start and would in fact serve some use cases, but I'm a bit reluctant to implement that knowing that we want to change to a cql3 column/id map sooner rather than later, because I suspect it'll be a lot easier to do the right thing to start with than to do a cell name/id map and then have a painful time switching to a cql3 column/id one without breaking backward compatibility. Besides, I also suspect there is a bunch of refactorings in CASSANDRA-5417 that would be needed here as well, so working on both separately without coordination is likely to be frustrating and a duplication of effort.

Anyway, I do plan on getting back to CASSANDRA-5417 asap (though it is unlikely to be as soon as next week), so maybe we can hold off a bit on this one until then? If I've made no progress on CASSANDRA-5417 in say a month or two, and people really want this, we can re-evaluate?

Reduce memory, disk space, and cpu usage with a column name/id map

Key: CASSANDRA-4175
URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
Project: Cassandra
Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Jason Brown
Fix For: 2.1
[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
[ https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698701#comment-13698701 ]

Terje Marthinussen commented on CASSANDRA-4175:

Hi,

Sorry for the late update.

Yes, we have a cluster with some 20-30 billion columns (maybe even closer to 40 billion by now) which implements a column name map and has been in production for about 2 years.

I was actually looking at committing this 2 years ago together with a fairly large number of other changes which were implemented in the column/supercolumn serializer code, but I never got around to implementing a good way to push the sstable version numbers into the serializer to make things backwards compatible before resources were moved elsewhere.

As mentioned above by others, while not benchmarked and proven, I had a very good feeling the total change helped quite a bit on GC issues, memtables and a bit on performance in general, but in terms of disk space, the benefit was somewhat limited after sstable compression was implemented, as the repeating column names are compressed pretty well.

This is already 2 years ago (the cluster still runs by the way), but if memory serves me right:
30-40% reduction in disk space without compression
10% reduction on top of compression (I did a test after it was implemented).

In my case, the implementation is actually hardcoded due to time constraints: a static map which is global for the entire cassandra installation.

If committing this into cassandra, I believe my plan was split in 3, possibly as 3 different implementation stages:

1. A simple config option (as a config file or as a columnfamily) where users themselves can assign repeating column names. Sure, it is not as fancy as many other options, but maybe we could open up to cover some strange corner case usages here with things like substrings as well. Think options to cover complex versions of patterns like date/times such as 20130701202020 where a large chunk of the column name repeats, but not all of it.

In the current implementation, if there is a mapping entry, it converts the string to a variable length integer which becomes the new column name. If there is no mapping entry, it stores the raw data.

In our case, we have 40 repeating column names so I never need more than 1 byte, but the implementation would handle more if I had.

I modified the sstable to add a bitmap at the start of each column to be able to turn on/off mapping entries, timestamps not used, TTLs and other things. There is a bunch of 64 bit numbers in the column format which only have the default value in 99.999% of all cases, and very often your column value is just an 8 byte int, a boolean or a short text entry. I think in 99% of the columns in this cassandra store, the column timestamp takes up more space than the column value.

This would have been my first implementation, mostly because I have a working implementation of it already and the mapping table would be very easy to move to a config file read at the start of a column family, similar to what we have for CF config. But also here, it is a bit of work to push such config data down to the serializer as the code was organized 2 years ago.

Notice again, you do not need atomic handling of the updates to the map in any way in this implementation. You can add map entries at any time. The result after deserializing is always the same, as column names can have a mix of raw and map id values thanks to the column feature bitmap that was introduced.

2. Auto learning feature with mapping table per sstable. This would be stage 2 of the implementation. When starting to create a new SSTable, build a sampling of the most frequently occurring column names and gradually start mapping them to IDs. Add the mapping table to the end of the SSTable or in a separate .map file (similar to index files) at the completion of sstable generation.

The initial id mapping could be further improved by maintaining a global map of column names. This global map would not be used for serialization/deserialization. It would be used to pre-populate the values for an sstable and would only be statistics to optimize things further by reducing the number of mapping variances between sstables and reducing the number of raw values getting stored a bit more.

The id map would still be local to each sstable in terms of storage, but having such statistics would allow you to dramatically reduce the size of a potentially shared id cache across sstables where a lot of mapping entries would be identical.

Some may feel that we would run out of memory quickly or use a lot of extra disk with maps per sstable, but I guess that we only really need to deal with the top few thousand entries in each sstable, and this would not be a problem to keep in an idmap cache in terms of size. This is really just the top X
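The stage-1 scheme described in this comment (a per-column flag bitmap, a variable-length integer when the name is mapped, raw bytes otherwise) can be sketched roughly as follows. The production code is not shown in the thread, so everything here is an assumption: the class name `MappedNameCodec`, the single flag bit, and the 7-bits-per-byte varint layout are illustrative choices only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical column-name codec with a feature-flag byte (illustration only). */
final class MappedNameCodec {
    /** Assumed flag bit: set when the name is stored as a map id instead of raw bytes. */
    static final int FLAG_NAME_MAPPED = 0x01;

    /** Writes [flags][varint id] for mapped names, [flags][varint length][raw bytes] otherwise. */
    static byte[] encodeName(String name, Map<String, Integer> ids) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Integer id = ids.get(name);
        if (id != null) {
            out.write(FLAG_NAME_MAPPED);
            writeVarInt(out, id);
        } else {
            byte[] raw = name.getBytes(StandardCharsets.UTF_8);
            out.write(0);
            writeVarInt(out, raw.length);
            out.write(raw, 0, raw.length);
        }
        return out.toByteArray();
    }

    /** Reads a name back; mapped and raw names can be mixed freely thanks to the flag byte. */
    static String decodeName(byte[] data, String[] namesById) {
        int flags = data[0];
        int[] position = {1};
        int value = readVarInt(data, position);
        if ((flags & FLAG_NAME_MAPPED) != 0) {
            return namesById[value];
        }
        return new String(data, position[0], value, StandardCharsets.UTF_8);
    }

    /** Unsigned varint: 7 payload bits per byte, high bit set on all but the last byte. */
    static void writeVarInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVarInt(byte[] data, int[] position) {
        int value = 0;
        int shift = 0;
        while (true) {
            int b = data[position[0]++] & 0xFF;
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
            shift += 7;
        }
    }
}
```

With 40 mapped names, as in the cluster described above, every id fits in the one-byte form of the varint, and an unmapped name still round-trips because the flag byte says which representation follows.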
[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
[ https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698710#comment-13698710 ]

Terje Marthinussen commented on CASSANDRA-4175:

I should maybe add, 1 and 2 above do not exclude but rather complement each other.

#1 is a manual map and could allow things like a prefix map such as '$201212' which will map all such prefixes to an id.

#2 is an auto map. It may require 1 if we want to consider allowing users to give hints for substring maps such as '$(201\d\d\d)' to map all year+month like strings starting with 201 to a mapping entry. This will just be a hint. The sampling of the number of entries should decide what gets mapped, to avoid running out of memory.

I am a bit unsure if these advanced features like substrings would ever be used, and they should maybe only be implemented as some sort of substring detection separately. As this can be a bit processing intensive, substring statistics (top substrings) could be detected and maintained node wide in compaction and given as hints to the serializer later.

Reduce memory, disk space, and cpu usage with a column name/id map

Key: CASSANDRA-4175
URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
Project: Cassandra
Issue Type: Improvement
Reporter: Jonathan Ellis
Fix For: 2.1
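The "auto map" stage relies on sampling which column names occur most often while an sstable is written, and mapping only the top entries. A minimal sketch of that selection step follows, assuming exact counts held in memory; `ColumnNameSampler` is a hypothetical name, and a real implementation would presumably bound the size of the count table (the comments above stress avoiding memory exhaustion).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical frequency sampler for choosing which names get an id (illustration only). */
final class ColumnNameSampler {
    // Assumption: exact, unbounded counts; fine for a sketch, not for production use.
    private final Map<String, Integer> counts = new HashMap<>();

    /** Records one occurrence of a column name seen while writing. */
    void observe(String name) {
        counts.merge(name, 1, Integer::sum);
    }

    /** Returns up to maxEntries names, most frequent first; ties are broken by name. */
    List<String> topNames(int maxEntries) {
        List<String> names = new ArrayList<>(counts.keySet());
        names.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(names.subList(0, Math.min(maxEntries, names.size())));
    }
}
```

The position of a name in the returned list could serve as its per-sstable id, so the most frequent names would get the smallest ids and therefore the shortest variable-length encodings.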
[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
[ https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699208#comment-13699208 ]

Jonathan Ellis commented on CASSANDRA-4175:

{quote}
Has it become easier to get to know sstable version numbers in the serializer class now? I could maybe check if someone in the team here would like to take a stab at moving this to latest cassandra and commit it if the above implementation seems interesting.
{quote}

That would be great. Yes, you'll see Descriptor.Version being passed around now, which encapsulates what kind of sstable it is, all the way down to the lowest level of Column.onDiskIterator.

Reduce memory, disk space, and cpu usage with a column name/id map

Key: CASSANDRA-4175
URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
Project: Cassandra
Issue Type: Improvement
Reporter: Jonathan Ellis
Fix For: 2.1
[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
[ https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679079#comment-13679079 ]

Edward Capriolo commented on CASSANDRA-4175:

CASSANDRA-2995 says

{quote}
It could be advantageous for Cassandra to make the storage engine pluggable. This could allow Cassandra to
* deal with potential use cases where maybe the current sstables are not the best fit
* allow several types of internal storage formats (at the same time) optimized for different data types
{quote}

Since this issue talks about reducing disk space, it will be changing how data is written, and this seems to benefit people with mostly static columns. It sounds right on the money with 2995. However, it goes beyond storage layer changes.

The feature makes a ton of sense and does not only benefit the cql3 case. Many people have static columns, and since 0.7 standard column families have had schema as well.

If cassandra had a 'pluggable storage format', one of the things the 'ColumnMapIdStorageFormat' could do is write the known schema to a small file loaded in memory with each sstable (like the bloom filter) that would contain the mappings. In the end I think you would have to store this anyway, because the mappings would change over time and what is in the schema now may not be fully accurate for old flushed tables.

This would only save storage as mentioned, and the internode traffic could not be optimized with pluggable storage alone.

For compare and swap, well whatever, it's just one feature and no one has to use it if they do not want to. However, requiring all schema changes to need zk is crazy scary to me. It is true that schema always needed to propagate before it can be used. I personally do not want to have to install zk side by side with all my cassandra installs, and I do not want to rely on it for schema changes.

Architecturally, building on zk is a house of cards. This was originally why I chose cassandra over hbase (hbase had metadata on hdfs, and state information with zk). The WORST thing that ever happens to cassandra is a node has a corrupt schema or a disagreement. I restart/decommission rejoin the node and it is fixed. If we start storing bits of information (column ids, schema) in zookeeper we become totally reliant on it: nodes may or may not be able to start up without it, we may or may not be able to make schema changes without it, and MOST IMPORTANTLY, IT'S AN SPOF THAT WHEN IT GOES CORRUPT will likely cause the entire cluster to * die, or likely function in a way worse than death, something like writing corrupt column ids to files and hopelessly corrupting everything.

No thanks to any ZK integration. ZK and centrally managed metadata = hbase.

Reduce memory, disk space, and cpu usage with a column name/id map

Key: CASSANDRA-4175
URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
Project: Cassandra
Issue Type: Improvement
Reporter: Jonathan Ellis
Fix For: 2.1
[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
[ https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678651#comment-13678651 ]

Edward Capriolo commented on CASSANDRA-4175:

It also sounds like we are re-opening the concept of pluggable storage (https://issues.apache.org/jira/browse/CASSANDRA-2995), since we are talking about custom disk formats only good for specific use cases.

Reduce memory, disk space, and cpu usage with a column name/id map

Key: CASSANDRA-4175
URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
Project: Cassandra
Issue Type: Improvement
Reporter: Jonathan Ellis
Fix For: 2.1
[jira] [Commented] (CASSANDRA-4175) Reduce memory, disk space, and cpu usage with a column name/id map
[ https://issues.apache.org/jira/browse/CASSANDRA-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678655#comment-13678655 ]

Jonathan Ellis commented on CASSANDRA-4175:

That is not what we are talking about.

Reduce memory, disk space, and cpu usage with a column name/id map

Key: CASSANDRA-4175
URL: https://issues.apache.org/jira/browse/CASSANDRA-4175
Project: Cassandra
Issue Type: Improvement
Reporter: Jonathan Ellis
Fix For: 2.1