Bulk Load Hadoop to Cassandra
Hi, I intend to bulk load data from HDFS to Cassandra using a map-only program that uses the BulkOutputFormat class. Please advise which versions of Cassandra and Hadoop support this use case. I am using Hadoop 2.2.0 and Cassandra 2.0.6 and I am getting the following error:

Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

Thanks, Vijay
Re: Unsubscribe
http://cassandra.apache.org/#lists

2014-11-04 21:59 GMT+01:00 James Carman ja...@carmanconsulting.com:

> You should have received an email when you signed up which gives you instructions on how to unsubscribe. Otherwise, send an email to user-h...@cassandra.apache.org
>
> On Mon, Nov 3, 2014 at 10:30 PM, Malay Nilabh malay.nil...@lntinfotech.com wrote:
>
>> Hi, it was great to be part of this group. Thanks for helping out. Please unsubscribe me now.
>>
>> Regards, Malay Nilabh
Storing files in Cassandra with Spring Data / Astyanax
Hi, I am currently testing with Cassandra and Spring Data Cassandra. I would now need to store files (images and avi files, normally up to 50 MB). I did find the Chunked Object Store https://github.com/Netflix/astyanax/wiki/Chunked-Object-Store from Astyanax, which looks promising. However, I have no idea how to combine Astyanax with Spring Data Cassandra. Also, this answer on SO http://stackoverflow.com/a/25926062/40064 states that Netflix is no longer working on Astyanax, so maybe this is not a good option to base my application on? Are there any other options (where I can keep using Spring Data Cassandra)? I also read http://www.datastax.com/docs/datastax_enterprise3.0/solutions/hadoop_multiple_cfs but it is unclear to me whether I would need to install Hadoop as well if I want to use this.

regards, Wim
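Independently of which client library ends up doing the work, the chunked-object idea itself is simple. A minimal sketch, with a plain dict standing in for a table keyed by (object_id, chunk_no); the 64 KB chunk size is an arbitrary choice for illustration, not Astyanax's default:

```python
# Sketch of a chunked object store: split a blob into fixed-size pieces,
# store each piece under (object_id, chunk_no), and reassemble in order.
# The dict is a stand-in for a table with PRIMARY KEY ((object_id), chunk_no).
CHUNK_SIZE = 64 * 1024  # arbitrary for this sketch

store = {}

def put_object(object_id: str, data: bytes) -> int:
    """Write the blob as numbered chunks; return the chunk count."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    for n, chunk in enumerate(chunks):
        store[(object_id, n)] = chunk
    return len(chunks)

def get_object(object_id: str, chunk_count: int) -> bytes:
    """Reassemble the blob by reading chunks back in order."""
    return b"".join(store[(object_id, n)] for n in range(chunk_count))

n = put_object("video-1", b"x" * 200_000)
data_back = get_object("video-1", n)
```

The point of chunking is that no single row or value ever holds a whole 50 MB file, which keeps individual reads and writes small; real implementations also store a metadata row (total size, chunk count, checksum) so the reader does not need to be told `chunk_count`.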
different disk foot print of cassandra data folder on copying
I have cassandra nodes with long uptimes. The disk footprint of the cassandra data folder is different when I copy it to a different folder. Why is that? I have used rsync and cp. This can be very confusing when trying to do certain maintenance tasks, like a hardware upgrade on EC2 or backing up a snapshot. I am talking about as much as a 100% difference for 25-40 GB of data: on copying, they grow to double that. The server's folder is on EC2 magnetic instance-store and I copied to various EBS volumes. I do not think it's something weird about EC2; when I copied the EBS data back to a magnetic instance-store, the size remained the same. So I am guessing there is some kind of cassandra magic compression that is fooling operating-system tools like du and df. There is some issue with the commitlog folder too, but the total size of that folder is not as big and the percentage difference in size is low. Thanks for any insight you can share. k.z.
Re: Cassandra heap pre-1.1
On Tue, Nov 4, 2014 at 8:51 PM, Raj N raj.cassan...@gmail.com wrote:

> Is there a good formula to calculate heap utilization in Cassandra pre-1.1, specifically 1.0.10? We are seeing GC pressure on our nodes and I am trying to estimate what could be causing it. Using nodetool info, my steady-state heap is at about 10 GB. Xmx is 12 GB.

Basically, no. If you really want to know, take a heap dump and load it into Eclipse Memory Analyzer.

> I have 4.5 GB of bloom filters, which I can derive by looking at cfstats.

This is a *very* large percentage of your total heap, and is probably the lever you have the most influence on pulling.

> I have negligible row caching.

Row caching is generally not advised in that era, especially with heap pressure.

> I have key caching enabled on my CFs. I couldn't find an easy way to estimate how much this is using, but I tried invalidating the key cache and I got 1.3 GB back.

Key caching is generally advisable, but 1.3 GB is a lot of key cache.

> That still only adds up to 5.8 GB. I know there is index sampling going on as well. I have around 800 million rows. Is there a way to estimate how much space this would add up to?

Plenty. You should reduce your bloom filter size, or upgrade to a version of Cassandra that moves stuff off the heap.

=Rob
http://twitter.com/rcolidba
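To put a rough number on the index-sampling contribution, here is a back-of-envelope sketch. It assumes the default index_interval of 128; the 32-byte average key and 64-byte per-entry overhead are guesses for illustration, not measured values:

```python
# Back-of-envelope heap accounting for a pre-1.1 node. index_interval,
# avg_key_bytes and per_entry_overhead are assumptions, not measurements.
def index_sample_heap_gb(total_rows, index_interval=128,
                         avg_key_bytes=32, per_entry_overhead=64):
    entries = total_rows / index_interval          # one sample per interval
    return entries * (avg_key_bytes + per_entry_overhead) / 2**30

bloom_gb = 4.5        # from cfstats, per the thread
key_cache_gb = 1.3    # recovered by invalidating the key cache
sampling_gb = index_sample_heap_gb(800_000_000)    # ~800M rows
total_gb = bloom_gb + key_cache_gb + sampling_gb
print(f"index sampling ~ {sampling_gb:.2f} GB, accounted ~ {total_gb:.2f} GB")
```

Under these assumptions the samples add only half a gigabyte or so; the gap between the accounted total and the 10 GB steady state is then memtables, in-flight requests, and GC slack, which is why a heap dump is the only definitive answer.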
Re: different disk foot print of cassandra data folder on copying
On Wed, Nov 5, 2014 at 12:08 PM, KZ Win kz...@pelotoncycle.com wrote:

> I have cassandra nodes with long uptimes. The disk footprint of the cassandra data folder is different when I copy it to a different folder. I am talking about as much as a 100% difference for 25-40 GB of data. On copying, they grow to double that.

1) Cassandra automatically snapshots SSTables when one does certain operations.
2) One can also manually create snapshots.
3) Snapshots are hard links to files.
4) Hard links generally become duplicate files when copied to another partition, unless rsync or cp is configured to maintain the hard-link relationship.
5) Snapshots are kept in a subdirectory of the data directory for the column family.
6) This all has the pathological-seeming outcome that snapshots become effectively larger as time passes (because the hard links they contain become the only copy of a file once the original is deleted from the data directory via compaction) and might grow significantly when copied.

tl;dr: modify your rsync to include --exclude=snapshots/

=Rob
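The hard-link effect is easy to demonstrate outside Cassandra. The sketch below builds a hypothetical miniature data directory whose "snapshot" entry is a hard link to the live file, then copies it naively the way a plain cp does:

```python
# Demonstrate how a hard link (like a Cassandra snapshot) counts once on
# disk but becomes a full duplicate when the tree is copied naively.
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
data = os.path.join(root, "data")
os.makedirs(os.path.join(data, "snapshots"))

sstable = os.path.join(data, "big-Data.db")
with open(sstable, "wb") as f:
    f.write(b"\0" * 1_000_000)             # 1 MB stand-in for an SSTable
# the "snapshot" is a hard link: a second name for the same inode
os.link(sstable, os.path.join(data, "snapshots", "big-Data.db"))

def tree_size(path):
    """Sum file sizes, counting each inode once (the way du does)."""
    seen, total = set(), 0
    for dirpath, _, files in os.walk(path):
        for name in files:
            st = os.stat(os.path.join(dirpath, name))
            if st.st_ino not in seen:
                seen.add(st.st_ino)
                total += st.st_size
    return total

original = tree_size(data)                       # two names, one inode
shutil.copytree(data, os.path.join(root, "copy"))  # naive copy, like plain cp
copied = tree_size(os.path.join(root, "copy"))     # link became a duplicate
shutil.rmtree(root)
```

The copy is exactly twice the size of the source tree, which matches the "as much as 100% difference" reported above; rsync avoids this either by excluding snapshots/ or by preserving links with -H.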
Re: different disk foot print of cassandra data folder on copying
Duh. I totally forgot about my snapshotting just before the daily rsync backup.

k.z.

On Wed, Nov 5, 2014 at 3:13 PM, Robert Coli rc...@eventbrite.com wrote:

> [snip]
>
> tl;dr: modify your rsync to include --exclude=snapshots/
>
> =Rob
use select with different attributes present in where clause Cassandra
Hello all, I need to create a Cassandra column family with the following attributes:

id bigint, content varchar, year int, frequency int

I want to get the content with the highest frequency in a given year using this column family. Also, when inserting data into the table, for a given content and year, I need to check whether an id already exists. How can I achieve this with Cassandra? I tried creating the CF using

CREATE TABLE sinmin.word_time_inv_frequency (
    id bigint,
    content varchar,
    year int,
    frequency int,
    PRIMARY KEY ((year), frequency)
);

and then retrieved data using

SELECT id FROM word_time_inv_frequency WHERE year = 2010 ORDER BY frequency;

But when using this, I can't check whether an entry already exists for the (content, year) pair in the CF. Thank you!

Chamila Dilshan Wijayarathna,
SMIEEE, SMIESL, Undergraduate, Department of Computer Science and Engineering, University of Moratuwa.
Re: Storing files in Cassandra with Spring Data / Astyanax
Astyanax isn't deprecated; that user is wrong and was downvoted, with a comment mentioning the same. What you're describing doesn't sound like you need a data store at all; it /sounds/ like you need a file store. Why not use S3 or similar to store your images? What benefits are you expecting to receive from Cassandra? It sounds like you're incurring an awful lot of overhead for what amounts to a file lookup.

On Wed, Nov 5, 2014 at 8:19 AM, Wim Deblauwe wim.debla...@gmail.com wrote:

> Hi, I am currently testing with Cassandra and Spring Data Cassandra. I would now need to store files (images and avi files, normally up to 50 MB). I did find the Chunked Object Store https://github.com/Netflix/astyanax/wiki/Chunked-Object-Store from Astyanax, which looks promising. However, I have no idea how to combine Astyanax with Spring Data Cassandra.
> [snip]
Re: Storing files in Cassandra with Spring Data / Astyanax
On Wed, Nov 5, 2014 at 8:19 AM, Wim Deblauwe wim.debla...@gmail.com wrote:

> I am currently testing with Cassandra and Spring Data Cassandra. I would now need to store files (images and avi files, normally up to 50 MB).

https://github.com/mogilefs/

A+ for distributed/replicated file storage; would use again in a heartbeat. Yes, it uses MySQL as the datastore; fortunately, most people know how to make MySQL available enough to be the meta store for a filesystem.

=Rob
http://twitter.com/rcolidba
Re: Cassandra heap pre-1.1
We are planning to upgrade soon, but in the meantime I wanted to see if we can tweak certain things.

-Rajesh

On Wed, Nov 5, 2014 at 3:10 PM, Robert Coli rc...@eventbrite.com wrote:

> [snip]
>
> Plenty. You should reduce your bloom filter size, or upgrade to a version of Cassandra that moves stuff off the heap.
>
> =Rob
Why is one query 10 times slower than the other?
Hi Guys, I have two cassandra 2.0.5 nodes, RF=2. When I do a:

select * from table1 where clustercolumn='something'

the trace indicates that it only needs to talk to one node, which I would have expected. However, when I do a:

select * from table2

which is a small table with only 20 rows in it, should be fully replicated, and should be a much quicker query, the trace indicates that cassandra is talking to both nodes. This adds 200ms to the query results and is not necessary for my application (this table might have an amendment once per year, if that); there's no real need to check both nodes for consistency. At this point I've not altered anything to do with consistency level. Does this mean that cassandra attempts to guess/infer what consistency level you need depending on whether your query includes a filter on a particular key or clustering key?

Thanks, Jacob

CREATE KEYSPACE mykeyspace WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '2' };
CREATE TABLE organisation (uuid uuid, name text, url text, PRIMARY KEY (uuid));
CREATE TABLE lookup_code (type text, code text, name text, PRIMARY KEY ((type), code));

Trace for select * from lookup_code where type='mylist':

 activity                                                                  | timestamp    | source       | source_elapsed
---------------------------------------------------------------------------+--------------+--------------+----------------
 execute_cql3_query                                                        | 04:20:15,319 | 74.50.54.123 |              0
 Parsing select * from lookup_code where type='research_area' LIMIT 1;     | 04:20:15,319 | 74.50.54.123 |             64
 Preparing statement                                                       | 04:20:15,320 | 74.50.54.123 |            204
 Executing single-partition query on lookup_code                           | 04:20:15,320 | 74.50.54.123 |            849
 Acquiring sstable references                                              | 04:20:15,320 | 74.50.54.123 |            870
 Merging memtable tombstones                                               | 04:20:15,320 | 74.50.54.123 |            894
 Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones | 04:20:15,320 | 74.50.54.123 |            958
 Merging data from memtables and 0 sstables                                | 04:20:15,320 | 74.50.54.123 |            976
 Read 168 live and 0 tombstoned cells                                      | 04:20:15,321 | 74.50.54.123 |           1412
 Request complete                                                          | 04:20:15,321 | 74.50.54.123 |           2043

Trace for select * from organisation:

 activity                                                                                        | timestamp    | source       | source_elapsed
-------------------------------------------------------------------------------------------------+--------------+--------------+----------------
 execute_cql3_query                                                                              | 04:21:03,641 | 74.50.54.123 |              0
 Parsing select * from organisation LIMIT 1;                                                     | 04:21:03,641 | 74.50.54.123 |             68
 Preparing statement                                                                             | 04:21:03,641 | 74.50.54.123 |            174
 Determining replicas to query                                                                   | 04:21:03,642 | 74.50.54.123 |            307
 Enqueuing request to /72.249.82.85                                                              | 04:21:03,642 | 74.50.54.123 |           1034
 Sending message to /72.249.82.85                                                                | 04:21:03,643 | 74.50.54.123 |           1402
 Message received from /74.50.54.123                                                             | 04:21:03,644 | 72.249.82.85 |             47
 Executing seq scan across 0 sstables for [min(-9223372036854775808), min(-9223372036854775808)] | 04:21:03,644 | 72.249.82.85 |            461
 Read 1 live and 0 tombstoned cells                                                              | 04:21:03,644 | 72.249.82.85 |            560
 Read 1 live and 0 tombstoned cells                                                              | 04:21:03,644 | 72.249.82.85 |            611
 ...etc...
Re: Why is one query 10 times slower than the other?
In your "lookup_code" example, "type" is not a clustering column, it is the partition key, and hence the first query only hits one partition. The second query is a range slice across all possible keys, so the sub-ranges are farmed out to the nodes holding the data. You are likely at CL_ONE, so it only needs a response from one node for each sub-range; I guess it has decided (based on the snitch) that it is not unreasonable to share the query across the two nodes.

On Nov 5, 2014, at 10:41 PM, Jacob Rhoden jacob.rho...@me.com wrote:

> Hi Guys, I have two cassandra 2.0.5 nodes, RF=2. When I do a:
>
> select * from table1 where clustercolumn='something'
>
> the trace indicates that it only needs to talk to one node, which I would have expected. However, when I do a:
>
> select * from table2
>
> the trace indicates that cassandra is talking to both nodes. This adds 200ms to the query results and is not necessary for my application; there's no real need to check both nodes for consistency. Does this mean that cassandra attempts to guess/infer what consistency level you need depending on whether your query includes a filter on a particular key or clustering key?
> [snip]
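The difference between the two traces comes down to routing, not consistency level. A toy illustration of that routing, with md5 standing in for Cassandra's real Murmur3 token math (which also involves vnodes and replication strategy) and the two addresses taken from the traces:

```python
# Toy routing model (not Cassandra's actual code): a query restricted to a
# partition key hashes to one token and therefore one replica set, while an
# unrestricted scan must cover every token sub-range, i.e. every node.
import hashlib

nodes = ["74.50.54.123", "72.249.82.85"]

def replica_for(partition_key: str) -> str:
    # stand-in for Murmur3Partitioner token -> replica mapping
    token = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return nodes[token % len(nodes)]

# WHERE type = 'mylist': one partition, so one target node
single = replica_for("mylist")

# SELECT * with no key restriction: every partition's sub-range is needed,
# so with enough partitions every node gets touched
scan_targets = sorted({replica_for(str(i)) for i in range(100)})
```

This is why the second query "talks to both nodes" even at CL ONE: each node answers only for the token ranges it owns, so the coordinator is collecting disjoint pieces, not cross-checking replicas.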
Re: tuning concurrent_reads param
Sorry, I have a late follow-up question. In the cassandra.yaml file, the concurrent_reads section has the following comment:

# For workloads with more data than can fit in memory, Cassandra's
# bottleneck will be reads that need to fetch data from
# disk. concurrent_reads should be set to (16 * number_of_drives) in
# order to allow the operations to enqueue low enough in the stack
# that the OS and drives can reorder them.

What does it mean by allowing "the operations to enqueue low enough in the stack that the OS and drives can reorder them"? How does it help keep the system healthy? What really happens if we increase it to too high a value? (Maybe it affects other read or write operations as it eats up all disk I/O resources?) Thanks.

On Wed, Oct 29, 2014 at 8:47 PM, Chris Lohfink chris.lohf...@datastax.com wrote:

> There's a bit to it; sometimes it can use tweaking, though. It's a good default for most systems, so I wouldn't increase it right off the bat. When using SSDs or something with a lot of horsepower it could be higher, though (i.e. i2.xlarge+ on EC2). If you monitor the number of active threads in the read thread pool (nodetool tpstats) you can see whether they are actually all busy or not. If it's near 32 (or whatever you set it at) all the time, it may be a bottleneck.
>
> ---
> Chris Lohfink

On Wed, Oct 29, 2014 at 10:41 PM, Jimmy Lin y2klyf+w...@gmail.com wrote:

> Hi, looking at the docs, the default value for concurrent_reads is 32, which seems a bit small to me (compared to, say, an HTTP server), because if my node is receiving even slight traffic, any more than 32 concurrent read queries will have to wait(?). The recommended rule is 16 * number of drives. Would that be different if I have SSDs? I am attempting to increase it because I have a few tables with wide rows that the app will fetch; the pure size of the data may already be eating up the thread time, which can cause other read threads to wait and essentially slow down. Thanks.
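To make the queueing behavior above concrete, here is a toy model of the read stage (an illustration, not Cassandra's code): concurrent_reads caps how many reads are in flight against the disks at once, and everything beyond that waits in the stage's queue, which is what shows up as "pending" in nodetool tpstats.

```python
# Toy read stage: a bounded worker pool models concurrent_reads. With 64
# requests and 32 workers, the second 32 requests queue until a worker
# frees up, producing two "waves" of disk activity.
from concurrent.futures import ThreadPoolExecutor
import time

number_of_drives = 2
concurrent_reads = 16 * number_of_drives   # the rule of thumb from cassandra.yaml

def disk_read(i):
    time.sleep(0.01)                       # stand-in for a seek + transfer
    return i

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=concurrent_reads) as pool:
    results = list(pool.map(disk_read, range(64)))
elapsed = time.perf_counter() - start
# 64 reads with at most 32 in flight -> roughly two waves of ~0.01 s each
```

The "enqueue low enough in the stack" phrasing is about where the waiting happens: with enough requests simultaneously in flight, the OS I/O scheduler and the drive's own queue can reorder them for shorter seeks, whereas requests parked in Cassandra's thread-pool queue are invisible to the disk. Setting the value far too high just moves contention to the disk and starves other I/O (writes, compaction).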
Re: Storing files in Cassandra with Spring Data / Astyanax
Hi, we are building an application that we install on-premise; usually there is no internet connection at all there. As I am using Cassandra for storing everything else in the application, it would be very convenient to also use Cassandra for those files, so I don't have to set up two distributed systems for each installation we do. Is there documentation somewhere on how to integrate/get started with Astyanax alongside Spring Data Cassandra?

regards, Wim

2014-11-05 23:40 GMT+01:00 Redmumba redmu...@gmail.com:

> Astyanax isn't deprecated; that user is wrong and was downvoted. What you're describing doesn't sound like you need a data store at all; it /sounds/ like you need a file store. Why not use S3 or similar to store your images? What benefits are you expecting to receive from Cassandra? It sounds like you're incurring an awful lot of overhead for what amounts to a file lookup.
> [snip]
Counter column impossible to delete and re-insert
Hi, I have a table with a counter column. When I insert (update) a row, delete it, and try to re-insert it, the re-insert fails. Here are the commands I use:

CREATE TABLE test (
    testId int,
    year int,
    testCounter counter,
    PRIMARY KEY (testId, year)
) WITH CLUSTERING ORDER BY (year DESC);

UPDATE test SET testcounter = testcounter + 5 WHERE testid = 2 AND year = 2014;
DELETE FROM test WHERE testid = 2 AND year = 2014;
UPDATE test SET testcounter = testcounter + 5 WHERE testid = 2 AND year = 2014;

The last command fails: there is no error message, but the table is empty after it. Is that normal? Am I doing something wrong?

Regards, Clément
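This is known behavior for Cassandra counters of this era: deleting a counter and then incrementing it again is unreliable, because the new increment does not necessarily win against the delete's tombstone. A deliberately oversimplified last-write-wins toy, an assumption-level model rather than Cassandra's actual counter reconciliation code, shows the shape of the problem:

```python
# Toy last-write-wins model (not Cassandra's real counter code). Counter
# increments do not carry client timestamps, so a post-delete increment can
# be assigned a server-side timestamp the tombstone still covers, leaving
# the row invisible on read.
tombstone_ts = 1000                  # the DELETE's server-side timestamp
shard = {"ts": 1000, "sum": 5}       # re-increment assigned the same instant

def visible(shard, tombstone_ts):
    # only state strictly newer than the tombstone survives reconciliation
    return shard["ts"] > tombstone_ts

row_visible = visible(shard, tombstone_ts)
```

In short: yes, this is expected, and the practical guidance is to treat counter deletion as final and not reuse a deleted (testid, year) counter row.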