Re: How is Cassandra being used?
I've read through the thread and have a few comments and an idea.

1) I can understand a preference for opt in
2) As a user I would have probably opted in every time I hit a performance issue
3) Opt in may well be skewed to poorer use cases or hardware issues
4) There is a trust gap that needs to be bridged before opt out is acceptable

Now for the idea: perhaps a report tool in nodetool that generates a human-readable profile, with a manual submission process in the short term and perhaps a fully automated one down the line.

So basically there are two good plans in your email:

1) Standard reporting (+1)
2) Automated feedback (opt in +1)

p

From: Jonathan Ellis jbel...@gmail.com
To: dev dev@cassandra.apache.org
Sent: Tuesday, 15 November 2011, 23:23
Subject: How is Cassandra being used?

I started a users survey thread over on the users list (replies are still trickling in), but as useful as that is, I'd like to get feedback that is more quantitative and with a broader base. This will let us prioritize our development efforts to better address what people are actually using it for, with less guesswork.

For instance: we put a lot of effort into compression for 1.0.0; if it turned out that only 1% of 1.0.x users actually enable compression, then it means that we should spend less effort fine-tuning that moving forward, and use the energy elsewhere. (Of course it could also mean that we did a terrible job getting the word out about new features and explaining how to use them, but either way, it would be good to know!)

I propose adding a basic cluster reporting feature to cassandra.yaml, enabled by default. It would send anonymous information about your cluster to an apache.org VM. Information like, number (but not names) of keyspaces and columnfamilies, ks-level options like compression, cf options like compaction strategy, data types (again, not names) of columns, average row size (or better: the histogram data), and average sstables per read.

Thoughts?
-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: How is Cassandra being used?
On Wed, Nov 16, 2011 at 2:01 AM, Jonathan Ellis jbel...@gmail.com wrote:

On Tue, Nov 15, 2011 at 7:02 PM, Eric Evans eev...@acunu.com wrote:

I think this is potentially quite dangerous; there are a lot of people who get very twitchy at the idea of software that Phones Home. I've seen this so many times, and in all cases it was for software a lot less sensitive than a database.

True, but unlike most Home Phoners, ours will be out there in the open and you can see exactly what it's sending (or not, if you disable it). I'm sure there are other examples of this in the wild, but the only one I can think of is popcon [1].

I don't think the transparency of the implementation changes things much. It's still going to be opaque to a lot of folks, and more important is the precedent it sets and the way it changes the project/user trust relationship.

Even if you're satisfied with the implementation, and trust that it won't be extended to transmit additional data later (unintentionally or otherwise), there are still very valid privacy concerns. For example, seeing as how this must be transmitted over an IP network, there are only so many guarantees you can make with respect to anonymity. There will always be *someone* that can tie the data to a unique IP, and an IP can almost always be tied to an individual or organization. Imagine an organization that doesn't want *anyone* to know it uses Cassandra, and isn't willing to accept the risk that one of their admins might accidentally enable this reporting.

It's also interesting that you mention popcon because it has always been contentious. It's taken years for it to transition from the point where it required users to install it themselves, to a prompt at install-time that defaulted to No, to the current state of an install-time prompt that defaults to Yes. And the installer asks *very* few questions; whether or not popcon is enabled is on par with partitioning and the assignment of a root password.

Also, there should be no shame in the admission that we haven't earned anywhere near the level of trust and respect that the Debian project has.

More broadly, my sense is that people are getting used to the idea that it's okay to give away anonymous statistics as part of the price of free, although YMMclearlyV. I am, after all, a Windows user. :)

As privacy becomes more threatened, people are either capitulating or becoming even more defensive; whether that makes it better or worse for us if we do this is debatable.

I'm sure you've already considered this though, you're already talking about anonymity, and transparency, and what I assume is neutrality of the collection endpoint (can apache actually provide a VM; is that a thing?).

Yes, they provide Ubuntu or FreeBSD VMs.

I'm just afraid that we'll scare people off before they can be properly convinced that it's all on the up-and-up.

How would you propose addressing this?

Honestly? The best way to convince people that we take the privacy of their data seriously is to not transmit any of it to a machine outside their control.

I'm curious to see what others think, but at the moment I'm hovering somewhere around a -0 if it were opt-in (off by default).

I'm okay with opt-in if you think that's useful as a first step to ease the twitchiness you mention, but longer term I think it's only really useful if it's on by default. There's a lot of research that shows that people tend to stick with whatever is the path of least resistance [2], and specifically, my experience with Cassandra users is exactly that -- one reason we've spent so much effort getting defaults so good is because almost nobody goes beyond that.

It's even worse than that. It's not just that you'll be receiving less data, it will also be less meaningful (since it's from a self-selecting group).

[1] http://popcon.debian.org/
[2] http://www.richmondfed.org/publications/research/region_focus/2007/winter/pdf/feature2.pdf

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

--
Eric Evans
Acunu | http://www.acunu.com | @acunu
Re: How is Cassandra being used?
Lively thread...

+1 opt-in
+1 in separate module

I'll just substantiate Rick Shaw's comments. If this is on by default, I can see it making its way into production at a large corporation, at which time the traffic would sound an alarm as suspicious activity, which would immediately get the server's plug pulled and trigger an investigation. That would land the architect responsible for deploying that server in the proverbial principal's office. In the extreme case, that might black-list the technology and add fuel to any debate that the corporation should just stick with the 'proven enterprise' solutions.

That is not my perspective; just be aware that in some large corporations it is an uphill battle to deploy Cassandra in the first place given incumbent systems. In every situation I've been in, even outside of large corporations, we would need to disable this feature given the sensitivity of the data.

All that said... I would love to see this data. ;) I'd love to know where our deployment lies on the spectrum of use.

Maybe a good old-fashioned web form that allows companies to submit their usage scenarios might accomplish the same goal? (And you could get additional context information about the industry, etc.) It wouldn't be comprehensive, but it may be sufficiently representative. Maybe you could just output a couple of lines at server start that said something like "Go here http://... to see how your usage compares to others."

I personally wouldn't throw too big a hissy if it was incorporated into the actual server and on by default, but I certainly know others that would.

-brian
Re: How is Cassandra being used?
Having worked at places where you get fired if software *attempts* to contact the outside world, I understand the concerns. However, if it's opt-in via config file and requires a restart then there is no reason why it should be a concern.

On Wed, Nov 16, 2011 at 3:29 AM, Zhu Han schumi@gmail.com wrote:

-1. It may scare some admins who store sensitive data in Cassandra. Even if it can be disabled, we cannot sleep well at night when we know the door can be opened unintentionally...

On Wed, Nov 16, 2011 at 3:03 PM, Norman Maurer nor...@apache.org wrote:

Hi there,

I'm not a Cassandra dev but a user of it. I would really hate to see such code in the Cassandra code-base. I understand that it would be kind of useful to get a better feeling about usage etc, but it's really something that scares the shit out of many managers (and even devs ;) ). So -1 to add this code (*non-binding)

Bye,
Norman

--
http://twitter.com/tjake
Re: How is Cassandra being used?
On Wed, Nov 16, 2011 at 2:59 PM, Jonathan Ellis jbel...@gmail.com wrote:

On Wed, Nov 16, 2011 at 8:46 AM, Gary Dusbabek gdusba...@gmail.com wrote:

Here is what should determine where energy is spent: if enough people are willing to expend the effort to voice their concerns about feature X in JIRA and on the mailing list, and there are people willing to do the technical work, and it doesn't represent a technical Wrong Turn for the project, then it should (it will) get worked on.

Well, sort of. I'm *willing* to work on all or most of the 217 open Cassandra tickets, but since I don't have time to do them all I need to prioritize aggressively. My motivation here is to get more data for that prioritization, which so far has been mostly guided by intuition. It sounds like your implicit assumption is that jira + mailing list are a good enough approximation for who-is-using-what, but I'm not sure that's the case.

There probably is a rather large group of shadow users whose (valuable?) input doesn't make it to the list or bug tracker. It sounds like Gary is questioning whether we should be giving these people a voice. Assuming I have that right, I agree that's a very good question. This is a community-based project after all.

--
Eric Evans
Acunu | http://www.acunu.com | @acunu
Re: How is Cassandra being used?
On Wed, Nov 16, 2011 at 10:56 AM, Eric Evans eev...@acunu.com wrote:

There probably is a rather large group of shadow users whose (valuable?) input doesn't make it to the list or bug tracker. It sounds like Gary is questioning whether we should be giving these people a voice. Assuming I have that right, I agree that's a very good question. This is a community-based project after all.

First, as attractive (and easy!) as it is to live inside our echo chamber, yes, I do think we should give them a voice. Of course, that doesn't mean you're obliged to listen to it. If you don't think that is a valuable source of input for prioritizing your work, you're free to ignore it.

Second, what I'm talking about is a different type of data from what you get on jira + ML. Those are negative sources of information -- you mostly only find out someone is using compression if they have a problem with it. How many people are using it with no problems? That is what this would let us start to find out.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
Re: hintedhandoff in 1.0.3
Keys in HCF are nodes it has hints for. You can try forcing delivery to the node that still has hints. It's also possible that new hints were created (because that node timed out some writes) during the delivery of the first ones.

On Tue, Nov 15, 2011 at 3:42 AM, Radim Kolar h...@sendmail.cz wrote:

Same problem on other node: 2 keys in HintsColumnFamily. One delivered, one left.

INFO [HintedHandoff:1] 2011-11-15 10:31:53,181 HintedHandOffManager.java (line 268) Started hinted handoff for token: 99070591730234615865843651857942052864
INFO [HintedHandoff:1] 2011-11-15 10:32:49,385 ColumnFamilyStore.java (line 688) Enqueuing flush of Memtable-HintsColumnFamily@797897458(1674737/2093421 serialized/live bytes, 6176 ops)
INFO [FlushWriter:5] 2011-11-15 10:32:49,386 Memtable.java (line 239) Writing Memtable-HintsColumnFamily@797897458(1674737/2093421 serialized/live bytes, 6176 ops)
INFO [CompactionExecutor:10] 2011-11-15 10:32:49,387 CompactionTask.java (line 112) Compacting [SSTableReader(path='/usr/local/cassandra/data/system/HintsColumnFamily-hb-754-Data.db'), SSTableReader(path='/usr/local/cassandra/data/system/HintsColumnFamily-hb-752-Data.db')]
INFO [FlushWriter:5] 2011-11-15 10:32:49,523 Memtable.java (line 275) Completed flushing /usr/local/cassandra/data/system/HintsColumnFamily-hb-755-Data.db (1888357 bytes)
INFO [CompactionExecutor:10] 2011-11-15 10:32:49,820 CompactionTask.java (line 213) Compacted to [/usr/local/cassandra/data/system/HintsColumnFamily-hb-756-Data.db,]. 19,913,818 to 19,913,392 (~99% of original) bytes for 2 keys at 43.960395MB/s. Time: 432ms.
INFO [HintedHandoff:1] 2011-11-15 10:32:49,820 HintedHandOffManager.java (line 334) Finished hinted handoff of 5796 rows

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
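
To check whether each handoff that started also finished, and how many rows it delivered, the log excerpt above can be scanned mechanically. The following is only a rough sketch: the two message patterns are copied from the 1.0.3 log lines quoted in this thread and may differ in other versions, the function name summarize_handoffs is made up for illustration, and it assumes a single HintedHandoff thread so that a "Finished" line belongs to the most recent "Started" line.

```python
import re

# Patterns taken from the log lines quoted in this thread (assumed stable only for 1.0.x).
STARTED = re.compile(r"Started hinted handoff for token: (\d+)")
FINISHED = re.compile(r"Finished hinted handoff of (\d+) rows")


def summarize_handoffs(lines):
    """Pair each 'Started' line with the next 'Finished' line.

    Returns a list of dicts; rows is None when no 'Finished' line
    was seen for that token in the scanned lines.
    """
    sessions = []
    current = None
    for line in lines:
        match = STARTED.search(line)
        if match:
            current = {"token": match.group(1), "rows": None}
            sessions.append(current)
            continue
        match = FINISHED.search(line)
        if match and current is not None:
            current["rows"] = int(match.group(1))
            current = None
    return sessions
```

An entry whose rows value is None points at a delivery that was started but not reported as finished in the scanned portion of the log, which is one place to start looking when a key stays behind in HintsColumnFamily.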
Re: How is Cassandra being used?
Sounds like the consensus is that if this is a good idea at all, it needs to be opt-in. Like I said earlier, I can live with that.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
Re: How is Cassandra being used?
On Wed, Nov 16, 2011 at 10:02 AM, Jonathan Ellis jbel...@gmail.com wrote: Sounds like the consensus is that if this is a good idea at all, it needs to be opt-in. Like I said earlier, I can live with that. In addition, if you want to get data from large companies that manage their own datacenters, there needs to be a way to contribute data without the software phoning home automatically. We aren't allowed to make connections to the outside world from our datacenter. And I'm not willing to ask for an exception for this. A mode that dumps the data to a file which can be uploaded would be preferable. People probably won't do it often, but imagine if your periodic how are you using cassandra? email threads included data? -ryan
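
As a rough illustration of the dump-to-file mode suggested here, the sketch below reduces a schema description to counts and option tallies and writes them to a JSON file that an operator could inspect before uploading by hand. Everything in it is an assumption made for illustration: the input dictionary layout, the field names, and the functions build_profile and dump_profile are hypothetical and are not part of Cassandra or nodetool.

```python
import json
from collections import Counter


def build_profile(schema):
    """Reduce a schema description to counts and tallies, dropping all names.

    schema is a hypothetical mapping of
    keyspace name -> {column family name -> options dict}.
    """
    column_families = [cf for ks in schema.values() for cf in ks.values()]
    return {
        "keyspaces": len(schema),
        "column_families": len(column_families),
        "compression_enabled": sum(1 for cf in column_families if cf.get("compression")),
        "compaction_strategies": dict(
            Counter(cf["compaction_strategy"] for cf in column_families)
        ),
        "column_types": dict(
            Counter(t for cf in column_families for t in cf.get("column_types", []))
        ),
    }


def dump_profile(schema, path):
    """Write the anonymized profile to a file for review and manual upload."""
    with open(path, "w") as handle:
        json.dump(build_profile(schema), handle, indent=2, sort_keys=True)
```

Because only counts and type names leave the function, keyspace and column family names never reach the file, and nothing is sent over the network; someone in a locked-down datacenter could read the file and decide whether to submit it.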
Re: [VOTE] Release Apache Cassandra 1.0.3 (take 2)
I'm +1 on either these artifacts as is, or these artifacts with thrift rebuilt to reflect the correct api version On Tue, Nov 15, 2011 at 7:46 AM, Eric Evans eev...@acunu.com wrote: On Tue, Nov 15, 2011 at 1:40 AM, Sylvain Lebresne sylv...@datastax.com wrote: So, CASSANDRA-3491 and CASSANDRA-3492 got in the way of the first take. Now that they are fixed, let's try again. I propose the following artifacts for release as 1.0.3. SVN: https://svn.apache.org/repos/asf/cassandra/branches/cassandra-1.0@1202082 Artifacts: https://repository.apache.org/content/repositories/orgapachecassandra-186/org/apache/cassandra/apache-cassandra/1.0.3/ Staging repository: https://repository.apache.org/content/repositories/orgapachecassandra-186/ The artifacts as well as the debian package are also available here: http://people.apache.org/~slebresne/ The vote will be open for 72 hours (longer if needed). [1]: http://goo.gl/I1dZG (CHANGES.txt) [2]: http://goo.gl/PeD3Z (NEWS.txt) It looks like interface/cassandra.thrift has changed without the Java code being regenerated. The test_describe system test is failing because of this, (the versions don't match). Probably not justification for a re-roll, but not a great thing for the release either... -- Eric Evans Acunu | http://www.acunu.com | @acunu -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Build failed in Jenkins: Cassandra #1209
See https://builds.apache.org/job/Cassandra/1209/changes

Changes:

[jbellis] merge from 1.0

--
[...truncated 2243 lines...]
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.163 sec
[junit]
[junit] Testsuite: org.apache.cassandra.locator.ReplicationStrategyEndpointCacheTest
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.513 sec
[junit]
[junit] Testsuite: org.apache.cassandra.locator.SimpleStrategyTest
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.687 sec
[junit]
[junit] Testsuite: org.apache.cassandra.locator.TokenMetadataTest
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.453 sec
[junit]
[junit] Testsuite: org.apache.cassandra.service.AntiEntropyServiceCounterTest
[junit] Tests run: 6, Failures: 0, Errors: 1, Time elapsed: 2.607 sec
[junit]
[junit] Testcase: testValidatorPrepare(org.apache.cassandra.service.AntiEntropyServiceCounterTest): Caused an ERROR
[junit] /127.0.0.1:7010 is in use by another process. Change listen_address:storage_port in cassandra.yaml to values that do not conflict with other services
[junit] org.apache.cassandra.config.ConfigurationException: /127.0.0.1:7010 is in use by another process. Change listen_address:storage_port in cassandra.yaml to values that do not conflict with other services
[junit] at org.apache.cassandra.net.MessagingService.getServerSocket(MessagingService.java:271)
[junit] at org.apache.cassandra.net.MessagingService.listen(MessagingService.java:241)
[junit] at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:484)
[junit] at org.apache.cassandra.service.StorageService.initServer(StorageService.java:461)
[junit] at org.apache.cassandra.service.AntiEntropyServiceTestAbstract.prepare(AntiEntropyServiceTestAbstract.java:80)
[junit]
[junit]
[junit] Test org.apache.cassandra.service.AntiEntropyServiceCounterTest FAILED
[junit] Testsuite: org.apache.cassandra.service.AntiEntropyServiceStandardTest
[junit] Tests run: 6, Failures: 0, Errors: 1, Time elapsed: 2.437 sec
[junit]
[junit] Testcase: testValidatorPrepare(org.apache.cassandra.service.AntiEntropyServiceStandardTest): Caused an ERROR
[junit] /127.0.0.1:7010 is in use by another process. Change listen_address:storage_port in cassandra.yaml to values that do not conflict with other services
[junit] org.apache.cassandra.config.ConfigurationException: /127.0.0.1:7010 is in use by another process. Change listen_address:storage_port in cassandra.yaml to values that do not conflict with other services
[junit] at org.apache.cassandra.net.MessagingService.getServerSocket(MessagingService.java:271)
[junit] at org.apache.cassandra.net.MessagingService.listen(MessagingService.java:241)
[junit] at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:484)
[junit] at org.apache.cassandra.service.StorageService.initServer(StorageService.java:461)
[junit] at org.apache.cassandra.service.AntiEntropyServiceTestAbstract.prepare(AntiEntropyServiceTestAbstract.java:80)
[junit]
[junit]
[junit] Test org.apache.cassandra.service.AntiEntropyServiceStandardTest FAILED
[junit] Testsuite: org.apache.cassandra.service.CassandraServerTest
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.461 sec
[junit]
[junit] Testsuite: org.apache.cassandra.service.ConsistencyLevelTest
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0.776 sec
[junit]
[junit] Testcase: testReadWriteConsistencyChecks(org.apache.cassandra.service.ConsistencyLevelTest): Caused an ERROR
[junit] invalid consistency level: ANY
[junit] java.lang.UnsupportedOperationException: invalid consistency level: ANY
[junit] at org.apache.cassandra.service.ReadCallback.determineBlockFor(ReadCallback.java:195)
[junit] at org.apache.cassandra.service.ReadCallback.init(ReadCallback.java:68)
[junit] at org.apache.cassandra.service.StorageProxy.getReadCallback(StorageProxy.java:798)
[junit] at org.apache.cassandra.service.ConsistencyLevelTest.testReadWriteConsistencyChecks(ConsistencyLevelTest.java:110)
[junit]
[junit]
[junit] Test org.apache.cassandra.service.ConsistencyLevelTest FAILED
[junit] Testsuite: org.apache.cassandra.service.EmbeddedCassandraServiceTest
[junit] Testsuite: org.apache.cassandra.service.EmbeddedCassandraServiceTest
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
[junit]
[junit] Testcase: org.apache.cassandra.service.EmbeddedCassandraServiceTest:BeforeFirstTest: Caused an ERROR
[junit] Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit.
[junit]
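
The two AntiEntropyService errors above report that 127.0.0.1:7010 was already bound when the test tried to listen on it. The log does not say what held the port; a leftover process from an earlier run or another job on the same build slave are only guesses. A quick way to probe a port before starting a test run is sketched below. The helper name port_is_free is made up for illustration, and the check is inherently racy, since another process can grab the port between the probe and the real bind.

```python
import socket


def port_is_free(host, port):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": something else holds the port.
            return False
    return True
```

For the failure above, calling port_is_free("127.0.0.1", 7010) on the build machine before the suite starts would show whether the storage port is already taken before any test code runs.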
Jenkins build is still unstable: Cassandra-Coverage #168
See https://builds.apache.org/job/Cassandra-Coverage/changes
Re: How is Cassandra being used?
Sounds like it would be best if it were in a separate jar for people?

On Nov 16, 2011, at 4:58 PM, Bill wrote:

Thoughts?

We'll turn this off, and would possibly patch it out of the code. That's not to say it wouldn't be useful to others.

Bill
Re: How is Cassandra being used?
+1 for a separate jar (and a second download link that doesn't include this jar, though I would make the primary link include it with BIG BOLD PRINT saying it is in there)
+1 for a config option to turn off auto-post (defaulted on in the download that has the jar)
+1 for a nodetool command to dump it to a file for manual posting

I think this could be a good debugging tool as well. A command that dumps "here is what my cluster looks like" to a file, which could then be sent through email for others to use in helping resolve issues, would be nice. The current nodetool information commands have too much stuff that needs to be sanitized out before you can send it outside the firewall.

- Jeremiah