Apache Cassandra meetup @ Instagram HQ
Hi all,

Apologies for the cross-post. In case you're in the SF Bay Area, Instagram is hosting a meetup, with interesting talks on Cassandra traffic management and Cassandra on Kubernetes. See details in the attached link - https://www.eventbrite.com/e/cassandra-traffic-management-at-instagram-cassandra-and-k8s-with-instaclustr-tickets-54986803008

Thanks,
Dinesh
Re: Maximum memory usage
Are you running any nodetool commands during that period? IIRC, this is a log entry emitted by the BufferPool. It should be harmless unless it's happening very often or you're seeing an OOM.

Dinesh

On Wednesday, February 6, 2019, 6:19:42 AM PST, Rahul Reddy wrote:

Hello, I see maximum memory usage alerts in my system.log a couple of times a day, logged as INFO. So far I haven't seen any issue with the db. Why are those messages logged in system.log? Is there any impact on reads/writes from those warnings, and what needs to be looked at?

INFO [RMI TCP Connection(170917)-127.0.0.1] 2019-02-05 23:15:47,408 NoSpamLogger.java:91 - Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB

Thanks in advance
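The 512 MiB in that log line matches the default size of the off-heap buffer pool, which in 3.x is controlled from cassandra.yaml. A sketch of the relevant setting (the option name below matches 3.x; verify it against the default yaml shipped with your version):

```yaml
# Upper bound on the off-heap buffer pool Cassandra uses to cache
# SSTable chunks. When the pool is exhausted, the BufferPool logs
# "Maximum memory usage reached" and falls back to allocating
# buffers outside the pool; the message alone is informational.
# Raising it trades more off-heap memory for fewer such fallbacks.
file_cache_size_in_mb: 512
```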
Re: Two datacenters with one cassandra node in each datacenter
You also want to use Cassandra with a minimum of 3 nodes.

Dinesh

On Wednesday, February 6, 2019, 11:26:07 PM PST, dinesh.jo...@yahoo.com wrote:

Hey Kunal, Can you add more details about the size of the data, read/write throughput, what your latency expectations are, etc.? What do you mean by a "performance" issue with replication? Without these details it's a bit tough to answer your questions.

Dinesh

On Wednesday, February 6, 2019, 3:47:05 PM PST, Kunal wrote:

Hi All, I need some recommendation on using two datacenters with one node in each datacenter. In our organization, we are trying to have two Cassandra datacenters with only 1 node on each side. From the preliminary investigation, I see replication is happening, but I want to know if we can use this deployment in production. Will there be any performance issue with replication? We have already set up 2 datacenters with one node in each datacenter, and replication is working fine. Can you please let me know if this kind of setup is recommended for production deployment? Thanks in anticipation.

Regards,
Kunal Vaid
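To make the "minimum of 3 nodes" advice concrete: with one node per datacenter, each DC's replication factor is 1, so every quorum consists of that single node and any node failure makes its DC unavailable. A quick illustrative sketch of the quorum arithmetic (not Cassandra code, just the formula Cassandra uses):

```python
def quorum(replication_factor: int) -> int:
    # Cassandra computes quorum as floor(RF / 2) + 1 replicas.
    return replication_factor // 2 + 1

# With RF=1 (one node per DC), quorum is 1: losing that one node
# makes LOCAL_QUORUM reads and writes fail in that DC.
print(quorum(1))  # 1 -> no fault tolerance
# With RF=3 per DC, quorum is 2: one replica can be down and
# LOCAL_QUORUM still succeeds.
print(quorum(3))  # 2 -> tolerates one replica failure
```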
Re: Two datacenters with one cassandra node in each datacenter
Hey Kunal,

Can you add more details about the size of the data, read/write throughput, what your latency expectations are, etc.? What do you mean by a "performance" issue with replication? Without these details it's a bit tough to answer your questions.

Dinesh

On Wednesday, February 6, 2019, 3:47:05 PM PST, Kunal wrote:

Hi All, I need some recommendation on using two datacenters with one node in each datacenter. In our organization, we are trying to have two Cassandra datacenters with only 1 node on each side. From the preliminary investigation, I see replication is happening, but I want to know if we can use this deployment in production. Will there be any performance issue with replication? We have already set up 2 datacenters with one node in each datacenter, and replication is working fine. Can you please let me know if this kind of setup is recommended for production deployment? Thanks in anticipation.

Regards,
Kunal Vaid
Re: Bootstrap keeps failing
Would it be possible for you to take a thread dump & logs and share them?

Dinesh

On Wednesday, February 6, 2019, 10:09:11 AM PST, Léo FERLIN SUTTON wrote:

Hello! I am having a recurrent problem when trying to bootstrap a few new nodes. Some general info:
- I am running cassandra 3.0.17
- We have about 30 nodes in our cluster
- All healthy nodes have between 60% to 90% used disk space on /var/lib/cassandra

So I create a new node and let auto_bootstrap do its job. After a few days the bootstrapping node stops streaming new data but is still not a member of the cluster; `nodetool status` says the node is still joining. When this happens I run `nodetool bootstrap resume`. This usually ends in one of two ways:
- The node fills up to 100% disk space and crashes.
- The bootstrap resume finishes with errors.

When I look at `nodetool netstats -H` it looks like `bootstrap resume` does not resume but restarts a full transfer of all data from every node. This is the output I get from `nodetool bootstrap resume`:

[2019-02-06 01:39:14,369] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-225-big-Data.db (progress: 2113%)
[2019-02-06 01:39:16,821] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-88-big-Data.db (progress: 2113%)
[2019-02-06 01:39:17,003] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-89-big-Data.db (progress: 2113%)
[2019-02-06 01:39:17,032] session with /10.16.XX.YYY complete (progress: 2113%)
[2019-02-06 01:41:15,160] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-220-big-Data.db (progress: 2113%)
[2019-02-06 01:42:02,864] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-226-big-Data.db (progress: 2113%)
[2019-02-06 01:42:09,284] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-227-big-Data.db (progress: 2113%)
[2019-02-06 01:42:10,522] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-228-big-Data.db (progress: 2113%)
[2019-02-06 01:42:10,622] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-229-big-Data.db (progress: 2113%)
[2019-02-06 01:42:11,925] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-90-big-Data.db (progress: 2114%)
[2019-02-06 01:42:14,887] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-91-big-Data.db (progress: 2114%)
[2019-02-06 01:42:14,980] session with /10.16.XX.ZZZ complete (progress: 2114%)
[2019-02-06 01:42:14,980] Stream failed
[2019-02-06 01:42:14,982] Error during bootstrap: Stream failed
[2019-02-06 01:42:14,982] Resume bootstrap complete

The bootstrap `progress` goes way over 100% and eventually fails. Right now I have a node with this output from `nodetool status`:

`UJ 10.16.XX.YYY 2.93 TB 256 ? 5788f061-a3c0-46af-b712-ebeecd397bf7 c`

It is almost filled with data, yet if I look at `nodetool netstats`:

Receiving 480 files, 325.39 GB total. Already received 5 files, 68.32 MB total
Receiving 499 files, 328.96 GB total. Already received 1 files, 1.32 GB total
Receiving 506 files, 345.33 GB total. Already received 6 files, 24.19 MB total
Receiving 362 files, 206.73 GB total. Already received 7 files, 34 MB total
Receiving 424 files, 281.25 GB total. Already received 1 files, 1.3 GB total
Receiving 581 files, 349.26 GB total. Already received 8 files, 45.96 MB total
Receiving 443 files, 337.26 GB total. Already received 6 files, 96.15 MB total
Receiving 424 files, 275.23 GB total. Already received 5 files, 42.67 MB total

It is trying to pull all the data again. Am I missing something about the way `nodetool bootstrap resume` is supposed to be used?

Regards,
Leo
Re: Modeling Time Series data
Hi Akash,

There are a lot of interesting articles written on this topic:
- http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html
- https://medium.com/netflix-techblog/scaling-time-series-data-storage-part-i-ec2b6d44ba39

You shouldn't need to worry about hotspots if you select the partition key carefully and your cluster is configured properly. Please go through the links, and if you need more clarification, feel free to ask more questions here.

Thanks,
Dinesh

On Friday, January 11, 2019, 2:45:42 PM PST, Akash Gangil wrote:

Hi, I have a data model where the partition key for a lot of tables is based on time (year, month, day, hour). Would this create a hotspot in my cluster, given all the writes/reads would go to the same node for a given hour? Or does the Cassandra storage engine also take into account table info, like the table name, when distributing the data? If the above model would be a problem, what's the suggested way to solve it? Add the table name to the partition key?

-- Akash
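On the table-name question: the token that decides replica placement is computed from the partition key values alone, so two tables with the same (year, month, day, hour) key hash to the same replicas. The usual fix, described in the linked articles, is to add a high-cardinality component (a source id, or a synthetic shard) to the partition key. A hypothetical sketch of computing such a bucketed key (names and bucket count are illustrative, not from the thread):

```python
import hashlib
from datetime import datetime, timezone

def partition_key(source_id: str, ts: datetime, buckets: int = 16):
    # Hypothetical scheme: pair an hour bucket with a hash-derived
    # shard so one hour's writes spread over `buckets` partitions
    # instead of hammering a single hot one.
    hour_bucket = ts.strftime("%Y%m%d%H")
    shard = int(hashlib.md5(source_id.encode()).hexdigest(), 16) % buckets
    return (hour_bucket, shard)

ts = datetime(2019, 1, 11, 14, 0, tzinfo=timezone.utc)
# Different sources in the same hour land in different partitions:
print(partition_key("sensor-a", ts))
print(partition_key("sensor-b", ts))
```

Queries for an hour then fan out over the known shard values, which is the standard trade for avoiding the write hotspot.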
Re: Cassandra lucene secondary indexes
Providing logs or more technical information might be helpful. If it is a cassandra-lucene-related issue, perhaps it would be better to open an issue in their GitHub repo?

Dinesh

On Wednesday, December 12, 2018, 11:17:06 PM GMT+5:30, Brian Spindler wrote:

Hi all, we recently started using the cassandra-lucene secondary index support that Instaclustr recently assumed ownership of (thank you, btw!). We are experiencing a strange issue where adding/removing nodes fails and the joining node is left hung with a compaction "Secondary index build" that just never completes. We're running v3.11.3 of Cassandra and the plugin; has anyone experienced this before? It's a relatively small cluster, ~6 nodes, in our user acceptance environment, so not a lot of load either. Thanks!

-Brian
Re: Migrating from DSE5.1.2 to Opensource cassandra
Thanks, nice summary of the overall process.

Dinesh

On Tuesday, December 4, 2018, 9:38:47 PM EST, Jonathan Koppenhofer wrote:

Unfortunately, we found this to be a little tricky. We did migrations from DSE 4.8 and 5.0 to OSS 3.0.x, so you may run into additional issues. I will also say your best option may be to install a fresh cluster and stream the data. This wasn't feasible for us at the size and scale, in the time frames, and with the infrastructure restrictions we had. I will have to review my notes for more detail, but off the top of my head, for an in-place migration...

Pre-upgrade:
* Be sure you are not using any Enterprise features like Search or Graph. Not only are there no equivalent features in open source, but these features require proprietary classes to be on the classpath, or Cassandra will not even start up.
* By default, I think DSE uses its own custom authenticators, authorizers, and such. Make sure what you are doing has an open source equivalent.
* The DSE system keyspaces use custom replication strategies. Convert these to NTS before the upgrade.
* Otherwise, follow the same processes you would before an upgrade (repair, snapshot, etc.)

Upgrade:
* The easy part is just replacing the binaries as you would in a normal upgrade. Drain and stop the existing node first. You can also do this in a rolling fashion to maintain availability. In our case, we were doing an in-place upgrade and reusing the same IPs.
* DSE unfortunately creates a custom column in a system table that requires you to remove one (or more) system tables (peers?) to be able to start the node. You delete these system tables by removing the sstables on disk while the node is down. This is a bit of a headache if using vnodes. As we are using vnodes, it required us to manually specify num_tokens, and the specific tokens the node was responsible for, in cassandra.yaml. You have to do this before you start the node. If not using vnodes, this is simpler, but we used vnodes. Again, I'll double-check my notes. Once the node is up, you can revert to your normal vnodes/num_tokens settings.

Post-upgrade:
* Drop the DSE system tables. I'll revert with more detail if needed.

On Tue, Dec 4, 2018, 5:46 PM Nandakishore Tokala
Re: request_scheduler functionalities for CQL Native Transport
I think what you're looking for might be solved by CASSANDRA-8303; however, I am not sure if anybody is working on it. Generally you want to create different clusters for different users to physically isolate them. What you propose has been discussed in the past, and it is something that is currently unsupported.

Dinesh

On Tuesday, November 27, 2018, 11:05:32 PM PST, Shaurya Gupta wrote:

Hi, We want to throttle the maximum queries on any keyspace for clients connecting via the CQL native transport. This option is available for clients connecting via Thrift through the request_scheduler property in cassandra.yaml. Is there some option available for clients connecting via the CQL native transport? If not, is there any plan to do so in the future? It is a must-have feature if we want to support multiple teams on a single Cassandra cluster, or to prevent one keyspace from interfering with the performance of the other keyspaces.

Regards,
Shaurya Gupta
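Until something like CASSANDRA-8303 exists server-side, throttling has to happen in the client. A minimal, illustrative token-bucket limiter an application could check before each query it sends (the class and names are hypothetical, not a driver API):

```python
import time

class TokenBucket:
    """Client-side rate limiter: allow `rate` requests/sec with
    bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=100, capacity=10)
# A burst larger than the capacity is partially rejected:
results = [bucket.try_acquire() for _ in range(15)]
print(results.count(True))  # roughly `capacity` calls succeed at once
```

Rejected requests can be queued, retried after a delay, or surfaced as backpressure; per-keyspace isolation would mean one bucket per keyspace.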
Re: nodetool rebuild
It's a long shot, but do you have stream_throughput_outbound_megabits_per_sec or inter_dc_stream_throughput_outbound_megabits_per_sec set to a low value?

You're right in that 3.0 streaming uses 1 thread each for the incoming and outgoing connection, per peer. It not only reads the bytes off of the channel but also deserializes the partitions on that same thread. If you see high CPU use by the STREAM-IN thread then your streaming is CPU bound. In this situation a powerful CPU will definitely help. Dropping internode compression and encryption will also help. Are your SSTables compressed?

Dinesh

On Friday, September 14, 2018, 4:15:28 AM PDT, Vitali Dyachuk wrote:

None of these throttling settings are helpful for streaming if you have even 150-200 Mbit/s of bandwidth, which is affordable in any cloud. Tweaking network TCP memory, window size, etc. does not help; the bottleneck is not the network. These are my findings on how streaming is limited in C* 3.0.x:
1) Streaming of the particular range which needs to be streamed to the new node is limited to one thread, and no tweaking of CPU affinity etc. helps; probably a more powerful compute VM will help.
2) Disabling internode_compression, and disabling compression per table in our case, helps a bit.
3) When streaming has been dropped there is no resume available for the streaming range, so it will start from the beginning.

One of the options could be to create snapshots of sstables on the source node, copy all sstable snapshots to the new node, and then run repair; data is ~5TB, RF3? How is it possible at all to stream data fast to a new node/nodes?

Vitali.

On Wed, Sep 12, 2018 at 5:02 PM Surbhi Gupta wrote:

Increase these 3 throughput settings:
- Compaction throughput
- Stream throughput
- Interdcstream throughput (if rebuilding from another DC)

Set all of the above to 0 (unthrottled) and see if there is any improvement; later set a value if you can't leave them at 0.

On Wed, Sep 12, 2018 at 5:42 AM Vitali Dyachuk wrote:

Hi, I'm currently streaming data with nodetool rebuild on 2 nodes; each node is streaming from a different location. The problem is that it takes ~7 days to stream 4 TB of data to 1 node; the speed on each side is ~150 Mbit/s, so it should take around ~2.5 days, although there are resources on the destination nodes and in the source regions. I've increased stream throughput, but it only affects outbound connections. Tested with iperf, the bandwidth is 600 Mbit/s from both sides. Last week I changed the CS from ST to LC because of huge sstables, and compaction of them is still ongoing. How does the rebuild command work? Does it calculate the range, then request the needed sstables from that node and start streaming? How is it possible to speed up the streaming?

Vitali.
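A quick back-of-the-envelope check of the numbers in the question: 4 TB at a sustained 150 Mbit/s should indeed finish in roughly two and a half days, so the observed ~7 days points at something other than raw bandwidth, consistent with the single-threaded stream deserialization described above.

```python
def transfer_days(terabytes: float, megabits_per_sec: float) -> float:
    # Convert TB to bits (decimal units), divide by the link rate,
    # then convert seconds to days.
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_sec * 1e6)
    return seconds / 86400

# 4 TB at 150 Mbit/s:
print(round(transfer_days(4, 150), 1))  # 2.5 (days)
# The observed ~7 days corresponds to an effective rate of:
print(round(4e12 * 8 / (7 * 86400) / 1e6))  # ~53 Mbit/s
```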
Re: AxonOps
Are you planning to open source it, or is it just a binary distribution?

Dinesh

On Saturday, September 15, 2018, 3:16:47 PM PDT, Hayato Shimizu wrote:

Hi Cassandra folks, We built a Cassandra management tool for ourselves, but decided that we'd like to share it with you, and it will soon be released to the public for free. We called it AxonOps. It provides GUI metrics/logs dashboards, service health checks, Cassandra adaptive repair, backup/restore, and currently integrates with PagerDuty, Slack, Hipchat (R.I.P.) and email for alerts and notifications. You can read about it on our blog here: https://digitalis.io/blog/axonops/ We're currently working hard on documentation etc. before making it available for you to download. We'd be interested to learn if anybody would like to use such a tool.

Hayato

Professional Services & Fully Managed Technologies - on-premise, all major clouds and hybrid

Any views or opinions presented are solely those of the author and do not necessarily represent those of the company. digitalis.io is a trading name of Digitalis.io Ltd. Company Number: 98499457. Registered in England and Wales. Registered Office: Kemp House, 152 City Road, London, EC1V 2NX, United Kingdom
Re: bigger data density with Cassandra 4.0?
With LCS, 6696 you can maximize the percentage of SSTables that use the new streaming path. With LCS and relatively small SSTables you should see good gains. Bootstrap is a use-case that should see the maximum benefit. This feature will get better with time.

Dinesh

On Wednesday, August 29, 2018, 12:34:32 AM PDT, kurt greaves wrote:

My reasoning was: if you have a small cluster with vnodes, you're more likely to have enough overlap between nodes that whole SSTables will be streamed on major ops. As N gets > RF you'll have fewer common ranges and thus be less likely to be streaming complete SSTables. Correct me if I've misunderstood.

On 28 August 2018 at 01:37, Dinesh Joshi wrote:

Although the extent of the benefits depends on the specific use case, the cluster size is definitely not a limiting factor.

Dinesh

On Aug 27, 2018, at 5:05 AM, kurt greaves wrote:

I believe there are caveats: it will only really help if you're not using vnodes, or you have a very small cluster, and internode encryption is not enabled. Alternatively, if you're using JBOD, vnodes will be marginally better, but JBOD is not a great idea (and doesn't guarantee a massive improvement).

On 27 August 2018 at 15:46, dinesh.jo...@yahoo.com.INVALID wrote:

Yes, this feature will help with operating nodes with higher data density.

Dinesh

On Saturday, August 25, 2018, 9:01:27 PM PDT, onmstester onmstester wrote:

I've noticed this new feature of 4.0: streaming optimizations (https://cassandra.apache.org/blog/2018/08/07/faster_streaming_in_cassandra.html). Does this mean that we could have much higher data density with Cassandra 4.0 (fewer problems than 3.x)? I mean > 10 TB of data on each node without worrying about node join/remove? This is something needed for write-heavy applications that do not read a lot. When you have 2 TB of data per day and need to keep it for 6 months, it would be a waste of money to purchase 180 servers (even commodity or cloud). IMHO, even if 4.0 fixes the problems with streaming/joining a new node, compaction is still another evil for a big node, but we could tolerate that somehow.

Sent using Zoho Mail
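A rough sizing sketch of the scenario described in this thread (2 TB/day, 6-month retention). The replication factor, per-node densities, and the absence of compression or TTL expiry below are all assumptions for illustration, not numbers from the thread:

```python
def nodes_needed(tb_per_day: int, retention_days: int,
                 rf: int, tb_per_node: int) -> int:
    # Total stored bytes = raw ingest * retention * replication.
    total = tb_per_day * retention_days * rf
    # Ceiling division: partial nodes aren't a thing.
    return -(-total // tb_per_node)

# Assuming RF=3 and ~2 TB usable per node (a conservative 3.x-era density):
print(nodes_needed(2, 180, 3, 2))   # 540 nodes
# At 10 TB/node, the density the 4.0 streaming work aims to make practical:
print(nodes_needed(2, 180, 3, 10))  # 108 nodes
```

The gap between those two numbers is the cost argument the original poster is making.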
Re: bigger data density with Cassandra 4.0?
Yes, this feature will help with operating nodes with higher data density.

Dinesh

On Saturday, August 25, 2018, 9:01:27 PM PDT, onmstester onmstester wrote:

I've noticed this new feature of 4.0: streaming optimizations (https://cassandra.apache.org/blog/2018/08/07/faster_streaming_in_cassandra.html). Does this mean that we could have much higher data density with Cassandra 4.0 (fewer problems than 3.x)? I mean > 10 TB of data on each node without worrying about node join/remove? This is something needed for write-heavy applications that do not read a lot. When you have 2 TB of data per day and need to keep it for 6 months, it would be a waste of money to purchase 180 servers (even commodity or cloud). IMHO, even if 4.0 fixes the problems with streaming/joining a new node, compaction is still another evil for a big node, but we could tolerate that somehow.

Sent using Zoho Mail
Re: benefits of HBase over Cassandra
I've worked with both databases. They're suitable for different use-cases. If you look at the CAP theorem, HBase is CP while Cassandra is AP. If we talk about a specific use-case, it'll be easier to discuss.

Dinesh

On Friday, August 24, 2018, 1:56:31 PM PDT, Vitaliy Semochkin wrote:

Hi, I read that Facebook once chose HBase over Cassandra for its messenger, but I never found what the benefits of HBase over Cassandra are. Can someone list them, if there are any?

Regards,
Vitaliy

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org
Re: duplicate rows for partition
What is the schema of the table? Could you include the output of DESCRIBE?

Dinesh

On Wednesday, August 22, 2018, 2:22:31 PM PDT, Gosar M wrote:

Hello, Have a table with the following partition and clustering keys:
partition key - ("userid", "secondaryid"), clustering key - "tDate", "tid3", "sid4", "pid5"

Data is inserted based on the above partition and clustering key. For 1 record we are seeing 2 rows returned when queried by both partition and clustering key:

userid | secondaryid | tdate | tid3 | sid4 | pid5 | associate_degree
090sdfdsf898 | ab984564 | 2018-08-04 07:59:59+ | 0a5995672e3 | l34 | l34_listing | 123145979615694
090sdfdsf898 | ab984564 | 2018-08-04 07:59:59+ | 0a5995672e3 | l34 | l34_listing | 123145979615694989

We did not have any node which was down longer than gc_grace_period. Thank you.
Re: Huge daily outbound network traffic
You could also run tcpdump to inspect the streams.

Dinesh

On Thursday, August 16, 2018, 1:11:47 PM PDT, Elliott Sims wrote:

Since this is cross-node traffic, "nodetool netstats" during the high-traffic period should give you a better idea of what's being sent.

On Thu, Aug 16, 2018 at 2:34 AM, Behnam B.Marandi wrote:

In the case of cron jobs, there are no jobs in that time period, and I can see the effect of jobs like backups and repairs, but the traffic they cause is not comparable: like 800MB compared to 2GB. And in this case it is all outbound network traffic on all 3 cluster nodes.

On Thu, Aug 16, 2018 at 5:16 PM dinesh.jo...@yahoo.com.INVALID wrote:

Since it is predictable, can you check the logs during that period? What do they say? Do you have a cron running on those hosts? Do all the nodes experience this issue?

Dinesh

On Thursday, August 16, 2018, 12:02:55 AM PDT, Behnam B.Marandi wrote:

Actually I did. It seems this is cross-node traffic from one node to port 7000 (storage_port) of the other node.

On Sun, Aug 12, 2018 at 2:44 PM Elliott Sims wrote:

Since it's at a consistent time, maybe just look at it with iftop to see where the traffic's going and what port it's coming from?

On Fri, Aug 10, 2018 at 1:48 AM, Behnam B.Marandi wrote:

I don't have any external process or planned repair in that time period. As for the network, I can see outbound traffic on the Cassandra node's network interface but couldn't find any way to check the VPC network to make sure it is not going out of the network. Maybe the only way is analysing the VPC Flow Log.

B.

On Tue, Aug 7, 2018 at 11:23 PM, Rahul Singh wrote:

Are you sure you don't have an outside process that is doing an export, a Spark job, or a non-AWS-managed backup process? Is this network out from Cassandra or from the network?

Rahul

On Aug 7, 2018, 4:09 AM -0400, Behnam B.Marandi wrote:

Hi, I have a 3 node Cassandra cluster (version 3.11.1) on m4.xlarge EC2 instances with separate EBS volumes for root (gp2), data (gp2) and commitlog (io1). I get daily outbound traffic at a certain time every day. As you can see in the attached screenshot, while my normal network load hardly meets 200MB, this outbound (orange) spikes up to 2GB while inbound (purple) is less than 800MB. There is no repair or backup process going on in that time window, so I am wondering where to look. Any idea?
Re: Huge daily outbound network traffic
Since it is predictable, can you check the logs during that period? What do they say? Do you have a cron running on those hosts? Do all the nodes experience this issue?

Dinesh

On Thursday, August 16, 2018, 12:02:55 AM PDT, Behnam B.Marandi wrote:

Actually I did. It seems this is cross-node traffic from one node to port 7000 (storage_port) of the other node.

On Sun, Aug 12, 2018 at 2:44 PM Elliott Sims wrote:

Since it's at a consistent time, maybe just look at it with iftop to see where the traffic's going and what port it's coming from?

On Fri, Aug 10, 2018 at 1:48 AM, Behnam B.Marandi wrote:

I don't have any external process or planned repair in that time period. As for the network, I can see outbound traffic on the Cassandra node's network interface but couldn't find any way to check the VPC network to make sure it is not going out of the network. Maybe the only way is analysing the VPC Flow Log.

B.

On Tue, Aug 7, 2018 at 11:23 PM, Rahul Singh wrote:

Are you sure you don't have an outside process that is doing an export, a Spark job, or a non-AWS-managed backup process? Is this network out from Cassandra or from the network?

Rahul

On Aug 7, 2018, 4:09 AM -0400, Behnam B.Marandi wrote:

Hi, I have a 3 node Cassandra cluster (version 3.11.1) on m4.xlarge EC2 instances with separate EBS volumes for root (gp2), data (gp2) and commitlog (io1). I get daily outbound traffic at a certain time every day. As you can see in the attached screenshot, while my normal network load hardly meets 200MB, this outbound (orange) spikes up to 2GB while inbound (purple) is less than 800MB. There is no repair or backup process going on in that time window, so I am wondering where to look. Any idea?
Re: Reading cardinality from Statistics.db failed
Vitali, It doesn't look like there is an existing Jira. It would be helpful if you could create one with as much information as possible. Can you reduce this issue to a short, repeatable set of steps that we can reproduce? That'll be helpful for debugging this problem.

Dinesh

On Wednesday, August 15, 2018, 1:07:21 AM PDT, Vitali Dyachuk wrote:

I've upgraded to 3.0.17 and the issue is still there. Is there a Jira ticket for that bug or should I create one?

On Wed, Jul 25, 2018 at 2:57 PM Vitali Dyachuk wrote:

I'm using 3.0.15. I see that there is some fix for sstable metadata in 3.0.16, https://issues.apache.org/jira/browse/CASSANDRA-14217 - is that a fix for "reading cardinality from Statistics.db"?

On Wed, Jul 25, 2018 at 1:02 PM Hannu Kröger wrote:

What version of Cassandra are you running? There is a bug in 3.10.0 and certain 3.0.x versions that occurs under certain conditions and corrupts that file.

Hannu

Vitali Dyachuk kirjoitti 25.7.2018 kello 10.48:

Hi, I have noticed in the Cassandra system.log that there is some issue with sstable metadata; the message says:

WARN [Thread-6] 2018-07-25 07:12:47,928 SSTableReader.java:249 - Reading cardinality from Statistics.db failed for /opt/data/disk5/data/keyspace/table/mc-big-Data.db

Although there is no such file. The message appeared after I changed the compaction strategy from SizeTiered to Leveled. Currently I'm running nodetool scrub to rebuild the sstables, and it takes a lot of time to scrub all sstables. Reading the code, it is said that if this metadata is broken, then estimating the keys will be done using the index summary. How expensive is that? https://github.com/apache/cassandra/blob/cassandra-3.0.15/src/java/org/apache/cassandra/io/sstable/format/SSTableReader.java#L245 The main question is: why has this happened?

Thanks, Vitali Djatsuk.
Re: Thrift to CQL migration under new Keyspace or Cluster
If you're working in a different keyspace, I don't anticipate any issues. Have you attempted one in a test cluster? :)

Dinesh

On Friday, June 22, 2018, 1:26:56 AM PDT, Fernando Neves wrote:

Hi guys, We are running one of our Cassandra clusters on the 2.0.17 Thrift version, and we started the 2.0.17 CQL migration plan through the CQLSSTableWriter/sstableloader method. Simple question, maybe someone has worked in a similar scenario: is there any problem with doing the migration on the same Cassandra instances (nodes) but in a different keyspace (ks_thrift to ks_cql), or should we create another 2.0.17 cluster to do this work? I know that the new keyspace will require more host resources, but it will be simpler for us, because once a table is migrated we will drop it in the old ks_thrift keyspace.

Thanks,
Fernando.
Re: RE: Mongo DB vs Cassandra
If you have the time, I would suggest creating a prototype with both databases and trying them out. You should also have some idea of how this system might evolve in the future. That is important because it could very well help you make a decision. Mongo or Cassandra may work, but if your requirements evolve in a way that works better with Cassandra, you might be better off going with Cassandra. As others have pointed out, each database has its own strengths. Given that you may store a 20KB to 600MB row, you may be able to model it with Mongo as well as Cassandra. If you plan on having a separate index like ElasticSearch or Solr outside the database, I would suggest going with Cassandra. Other factors to consider are licensing, operational cost, etc.

Dinesh

On Thursday, May 31, 2018, 9:01:09 AM PDT, Sudhakar Ganesan wrote:

At a high level, in the production line, machines will provide the data in the form of CSV every 1 second to 1 minute to 1 day (depending on the machine type used in the line operations). I need to parse those files, load them into the DB, and build an API layer to expose the data to downstream systems.
* Number of files to be processed: 13,889,660,134 per day
* Each file could range from 20 KB to 600 MB, which will translate into a few hundred rows to millions of rows.
* High availability with high write volume. Reads are low compared to writes.
* While extracting the rows, a few validations are to be performed.
* Build an API layer on top of the data to be persisted in the DB.

Now, tell me what would be the best choice…

From: Russell Bateman [mailto:r...@windofkeltia.com]
Sent: Thursday, May 31, 2018 7:36 PM
To: user@cassandra.apache.org
Subject: Re: Mongo DB vs Cassandra

Sudhakar, MongoDB will accommodate loading CSV without regard to schema while still creating identifiable "columns" in the database, but you'll have to predict or back-impose some schema later if you're going to create indices for fast searching of the data. You can perform searching of data without indexing in MongoDB, but it's slower. Cassandra will require you to understand the schema, i.e. what the columns are, up front, unless you're just going to store the data without schema and, therefore, without the ability to search effectively. As suggested already, you should share more detail if you want good advice. Both DBs are excellent. Both do different things in different ways.

Hope this helps,
Russ

On 05/31/2018 05:49 AM, Sudhakar Ganesan wrote:

Team, I need to make a decision on Mongo DB vs Cassandra for loading the CSV file data and storing the CSV files as well. If any of you did such a study in the last couple of months, please share your analysis or observations.

Regards,
Sudhakar
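One number in those requirements is worth sanity-checking before choosing either database, since it would dominate any schema decision: 13,889,660,134 files per day works out to over 160,000 files ingested per second, sustained. A quick check of the arithmetic:

```python
files_per_day = 13_889_660_134
# Seconds in a day: 24 * 60 * 60 = 86,400.
per_second = files_per_day / 86_400
print(round(per_second))  # ~160,760 files/sec, sustained
```

If that figure is real, the ingest pipeline (parsing, validation, batching) needs at least as much design attention as the choice of database; if it is a typo, the right answer may change entirely.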
Re: Cassandra doesn't insert all rows
Soheil, As Jeff mentioned, you need to provide more information. There are no known issues that I can think of that would cause such behavior. It would be great if you could provide us with a reduced test case so we can try to reproduce this behavior, or at least help you debug the issue better. Could you detail the version of Cassandra, the number of nodes, the keyspace definition, RF/CL, perhaps a bit of the client code that does the writes, and whether you got back any errors on the client or the server side? These details would help us assist you further.

Thanks,
Dinesh

On Saturday, April 21, 2018, 11:06:12 AM PDT, Soheil Pourbafrani wrote:

I consume data from Kafka and insert it into a Cassandra cluster using the Java API. The table has 4 key columns, including a timestamp with millisecond precision. But when executing the code, it just inserts 120 to 190 rows and ignores the other incoming data! What could be the cause of the problem? Bad insert code in key fields that overwrites data? Improper cluster configuration?
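One likely culprit, given the millisecond timestamp in the key (and the question's own "key fields that overwrite data" guess): if rows arrive faster than once per millisecond while the other key columns are equal, Cassandra treats each insert as an upsert and silently overwrites the earlier row rather than raising an error. An illustrative simulation of that overwrite behavior, with a plain dict standing in for the table (no cluster needed):

```python
import time

def simulate_upserts(n_rows: int) -> dict:
    # Model a table keyed by (user_id, ts_millis): like Cassandra,
    # only the last write per primary key survives (upsert semantics,
    # no error reported to the writer).
    table = {}
    for i in range(n_rows):
        ts_millis = int(time.time() * 1000)
        table[("user-1", ts_millis)] = {"seq": i}
    return table

rows = simulate_upserts(10_000)
# Far fewer rows survive than were "inserted": most writes shared a
# millisecond timestamp and overwrote each other.
print(len(rows) < 10_000)  # True
```

If this matches what's happening, the fix is to make the key unique per event, e.g. by using timeuuid instead of a plain timestamp.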