Bulk Load Hadoop to Cassandra

2014-11-05 Thread Vijay Kadel
Hi,

I intend to bulk load data from HDFS to Cassandra using a map-only program 
which uses the BulkOutputFormat class. Please advise me which versions of 
Cassandra and Hadoop would support such a use-case. I am using Hadoop 2.2.0 and 
Cassandra 2.0.6, and I am getting the following error:

Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but 
class was expected

Thanks,
Vijay
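
This error is the classic Hadoop 1 vs Hadoop 2 binary incompatibility: Cassandra 2.0.x's 
Hadoop integration is compiled against Hadoop 1.x, where 
org.apache.hadoop.mapreduce.TaskAttemptContext was a concrete class; Hadoop 2.x turned it 
into an interface, so code built against the old API fails at runtime with exactly this 
message. Running the job on a Hadoop 1.x cluster (or rebuilding Cassandra's hadoop classes 
against Hadoop 2) should avoid it. For reference, a minimal sketch of a map-only driver 
using BulkOutputFormat, assuming a Hadoop 1.x runtime; the host, keyspace, column family, 
and mapper names are placeholders:

import org.apache.cassandra.hadoop.BulkOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class BulkLoadDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "hdfs-to-cassandra-bulk-load");
        job.setJarByClass(BulkLoadDriver.class);
        // Hypothetical mapper emitting <ByteBuffer key, List<Mutation>> pairs,
        // which is what BulkOutputFormat expects:
        // job.setMapperClass(HdfsToCassandraMapper.class);
        job.setNumReduceTasks(0); // map-only
        FileInputFormat.addInputPath(job, new Path(args[0]));

        job.setOutputFormatClass(BulkOutputFormat.class);
        ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setOutputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "my_keyspace", "my_cf");
        ConfigHelper.setOutputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.Murmur3Partitioner");

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}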



Re: Unsubscribe

2014-11-05 Thread Alain RODRIGUEZ
http://cassandra.apache.org/#lists

2014-11-04 21:59 GMT+01:00 James Carman ja...@carmanconsulting.com:

 You should have received an email when you signed up which gives you
 instructions on how to unsubscribe.  Otherwise, send an email to
 user-h...@cassandra.apache.org

 On Mon, Nov 3, 2014 at 10:30 PM, Malay Nilabh 
 malay.nil...@lntinfotech.com wrote:

  Hi



 It was great to be part of this group. Thanks for helping out. Please
 unsubscribe me now.



 Regards,

 Malay Nilabh

 BIDW BU / Big Data CoE

 LT Infotech Ltd, Hinjewadi, Pune

 Tel: +91-20-66571746

 Mobile: +91-73-879-00727

 Email: malay.nil...@lntinfotech.com

 || Save Paper - Save Trees ||








Storing files in Cassandra with Spring Data / Astyanax

2014-11-05 Thread Wim Deblauwe
Hi,

I am currently testing with Cassandra and Spring Data Cassandra. I would
now need to store files (images and avi files, normally up to 50 MB in size).

I did find the Chunked Object Store
https://github.com/Netflix/astyanax/wiki/Chunked-Object-Store from
Astyanax, which looks promising. However, I have no idea how to combine
Astyanax with Spring Data Cassandra.

Also, this answer on SO http://stackoverflow.com/a/25926062/40064 states
that Netflix is no longer working on Astyanax, so maybe this is not a good
option to base my application on?

Are there any other options (where I can keep using Spring Data Cassandra)?

I also read
http://www.datastax.com/docs/datastax_enterprise3.0/solutions/hadoop_multiple_cfs
but it is unclear to me whether I would need to install Hadoop as well if I want
to use this.

regards,

Wim


different disk footprint of cassandra data folder on copying

2014-11-05 Thread KZ Win
I have cassandra nodes with long uptime. The disk footprint of the
cassandra data folder is different when I copy it to a different folder.
Why is that? I have used rsync and cp. This can be very confusing
when trying to do certain maintenance tasks like a hardware upgrade on
EC2 or backing up a snapshot.

I am talking about as much as a 100% difference for 25-40GB of data. On
copying, it grows to double that. The server's folder is on EC2
magnetic instance-store and I copied to various EBS volumes. I do not think
it's something weird about EC2; when I copied the EBS data back to
magnetic instance-store, the size remained the same. So I am guessing there
is some kind of cassandra magical compression that is fooling operating
system tools like du and df.

There is some issue with the commitlog folder too, but the total size of that
folder is not as big and the percentage difference in size is low.

Thanks for any insight you can share

k.z.


Re: Cassandra heap pre-1.1

2014-11-05 Thread Robert Coli
On Tue, Nov 4, 2014 at 8:51 PM, Raj N raj.cassan...@gmail.com wrote:

 Is there a good formula to calculate heap utilization in Cassandra
 pre-1.1, specifically 1.0.10? We are seeing GC pressure on our nodes, and I
 am trying to estimate what could be causing it. Using nodetool info, my
 steady-state heap is at about 10GB. Xmx is 12G.


Basically, no. If you really want to know, take a heap dump and load it
into Eclipse Memory Analyzer.


 I have 4.5 GB of bloom filters, which I can derive by looking at cfstats.


This is a *very* large percentage of your total heap, and is probably the
lever you have most influence on pulling.


 I have negligible row caching.


Row caching is generally not advised in that era, especially with heap
pressure.


 I have key caching enabled on my cfs. I couldn't find an easy way to
 estimate how much this is using, but I tried to invalidate the key cache
 and I got 1.3 GB back.


Key caching is generally advisable, but 1.3GB is a lot of key cache.


 That still only adds up to 5.8 GB. I know there is index sampling going on
 as well. I have around 800 million rows. Is there a way to estimate how
 much space this would add up to?


Plenty. You should reduce your bloom filter size, or upgrade to a version
of Cassandra that moves stuff off the heap.

=Rob
http://twitter.com/rcolidba
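
For the index sampling question, a rough back-of-envelope is possible, assuming the
pre-1.1 default index_interval of 128 and very roughly 100 bytes of heap per sampled
entry (the key itself, an 8-byte position, and JVM object overhead):

800,000,000 rows / 128 (one sample per 128 keys)  ~= 6,250,000 sampled entries
6,250,000 entries * ~100 bytes                    ~= ~600 MB of heap

Memtables are another consumer: in 1.0.x, memtable_total_space_in_mb defaults to one
third of the heap, about 4GB with Xmx 12G. Those two together plausibly close much of
the gap above the 5.8 GB accounted for, but a heap dump remains the only definitive
answer.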


Re: different disk footprint of cassandra data folder on copying

2014-11-05 Thread Robert Coli
On Wed, Nov 5, 2014 at 12:08 PM, KZ Win kz...@pelotoncycle.com wrote:

 I have cassandra nodes with long uptime. The disk footprint of the
 cassandra data folder is different when I copy it to a different folder.



 I am talking about as much as a 100% difference for 25-40GB of data. On
 copying, it grows to double that.


1) Cassandra automatically snapshots SSTables when one does certain
operations.
2) One can also manually create snapshots.
3) Snapshots are hard links to files.
4) Hard links to files generally become duplicate files when copied to
another partition, unless rsync or cp is configured to maintain the hard
link relationship.
5) Snapshots are kept in a subdirectory of the data directory for the
columnfamily.
6) This all has the pathological-seeming outcome that snapshots become
effectively larger as time passes (because the hard links they contain
become the only copy of a file once the original is deleted from the
data directory via compaction) and might grow significantly when copied.

tl;dr : modify your rsync to include --exclude=snapshots/

=Rob
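
For concreteness, a backup invocation along these lines keeps the copy the same size
as the live data directory (paths are illustrative, and --exclude=backups/ only
matters if incremental backups are enabled):

rsync -a --exclude=snapshots/ --exclude=backups/ /var/lib/cassandra/data/ /mnt/backup/cassandra-data/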


Re: different disk footprint of cassandra data folder on copying

2014-11-05 Thread KZ Win
Duh. I totally forgot about my snapshotting just before the daily rsync backup.

k.z.

On Wed, Nov 5, 2014 at 3:13 PM, Robert Coli rc...@eventbrite.com wrote:
 On Wed, Nov 5, 2014 at 12:08 PM, KZ Win kz...@pelotoncycle.com wrote:

 I have cassandra nodes with long uptime. The disk footprint of the
 cassandra data folder is different when I copy it to a different folder.



 I am talking about as much as a 100% difference for 25-40GB of data. On
 copying, it grows to double that.


 1) Cassandra automatically snapshots SSTables when one does certain
 operations.
 2) One can also manually create snapshots.
 3) Snapshots are hard links to files.
 4) Hard links to files generally become duplicate files when copied to
 another partition, unless rsync or cp is configured to maintain the hard
 link relationship.
 5) Snapshots are kept in a subdirectory of the data directory for the
 columnfamily.
 6) This all has the pathological-seeming outcome that snapshots become
 effectively larger as time passes (because the hard links they contain
 become the only copy of a file once the original is deleted from the
 data directory via compaction) and might grow significantly when copied.

 tl;dr : modify your rsync to include --exclude=snapshots/

 =Rob



Using SELECT with different attributes present in the WHERE clause in Cassandra

2014-11-05 Thread Chamila Wijayarathna
Hello all,

I need to create a Cassandra column family with the following attributes:

id bigint,
content varchar,
year int,
frequency int,

I want to get the content with the highest frequency in a given year using this
column family. Also, when inserting data into the table, for a given content and
year, I need to check whether an id already exists or not. How can I achieve this
with Cassandra?

I tried creating the CF using

CREATE TABLE sinmin.word_time_inv_frequency (
id bigint,
content varchar,
year int,
frequency int,
PRIMARY KEY((year), frequency)
);

and then retrieved data using

SELECT id FROM word_time_inv_frequency WHERE year = 2010 ORDER BY frequency;

But when using this, I can't check whether an entry already exists for the
(content, year) pair in the CF.
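
A sketch of one way to do this with denormalization; the second table and both
table names are illustrative, not the only possible layout. One table is clustered
by frequency (descending) to answer the top-frequency-per-year query, and another
is keyed by (content, year) for the existence check:

CREATE TABLE sinmin.word_frequency_by_year (
    year int,
    frequency int,
    content varchar,
    id bigint,
    PRIMARY KEY ((year), frequency, content)
) WITH CLUSTERING ORDER BY (frequency DESC, content ASC);

-- highest-frequency content in a given year:
SELECT content, frequency FROM word_frequency_by_year WHERE year = 2010 LIMIT 1;

CREATE TABLE sinmin.word_id_by_content_year (
    content varchar,
    year int,
    id bigint,
    PRIMARY KEY ((content, year))
);

-- existence check for a (content, year) pair before inserting:
SELECT id FROM word_id_by_content_year WHERE content = 'example' AND year = 2010;

The application has to write both tables on every insert. Adding content to the
clustering key also keeps two words with the same frequency from overwriting each
other, which the original PRIMARY KEY ((year), frequency) would do.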

Thank You!

-- 
*Chamila Dilshan Wijayarathna,*
SMIEEE, SMIESL,
Undergraduate,
Department of Computer Science and Engineering,
University of Moratuwa.


Re: Storing files in Cassandra with Spring Data / Astyanax

2014-11-05 Thread Redmumba
Astyanax isn't deprecated; that user is wrong and has been downvoted, with a
comment mentioning the same.

What you're describing doesn't sound like you need a data store at all; it
/sounds/ like you need a file store.  Why not use S3 or similar to store
your images?  What benefits are you expecting to receive from Cassandra?
It sounds like you're incurring an awful lot of overhead for what amounts
to a file lookup.

On Wed, Nov 5, 2014 at 8:19 AM, Wim Deblauwe wim.debla...@gmail.com wrote:

 Hi,

 I am currently testing with Cassandra and Spring Data Cassandra. I would
 now need to store files (images and avi files, normally up to 50 MB in size).

 I did find the Chunked Object Store
 https://github.com/Netflix/astyanax/wiki/Chunked-Object-Store from
 Astyanax, which looks promising. However, I have no idea how to combine
 Astyanax with Spring Data Cassandra.

 Also, this answer on SO http://stackoverflow.com/a/25926062/40064 states
 that Netflix is no longer working on Astyanax, so maybe this is not a good
 option to base my application on?

 Are there any other options (where I can keep using Spring Data Cassandra)?

 I also read
 http://www.datastax.com/docs/datastax_enterprise3.0/solutions/hadoop_multiple_cfs
 but it is unclear to me whether I would need to install Hadoop as well if I want
 to use this.

 regards,

 Wim



Re: Storing files in Cassandra with Spring Data / Astyanax

2014-11-05 Thread Robert Coli
On Wed, Nov 5, 2014 at 8:19 AM, Wim Deblauwe wim.debla...@gmail.com wrote:

 I am currently testing with Cassandra and Spring Data Cassandra. I would
 now need to store files (images and avi files, normally up to 50 MB in size).


https://github.com/mogilefs/

An A for distributed/replicated file storage; I would use it again in a
heartbeat.

Yes, it uses MySQL as the datastore; fortunately, most people know how to
make MySQL available enough to be the metadata store for a filesystem.

=Rob
http://twitter.com/rcolidba


Re: Cassandra heap pre-1.1

2014-11-05 Thread Raj N
We are planning to upgrade soon. But in the meantime, I wanted to see if we
can tweak certain things.

-Rajesh

On Wed, Nov 5, 2014 at 3:10 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Nov 4, 2014 at 8:51 PM, Raj N raj.cassan...@gmail.com wrote:

 Is there a good formula to calculate heap utilization in Cassandra
 pre-1.1, specifically 1.0.10? We are seeing GC pressure on our nodes, and I
 am trying to estimate what could be causing it. Using nodetool info, my
 steady-state heap is at about 10GB. Xmx is 12G.


 Basically, no. If you really want to know, take a heap dump and load it
 into Eclipse Memory Analyzer.


 I have 4.5 GB of bloom filters, which I can derive by looking at cfstats.


 This is a *very* large percentage of your total heap, and is probably the
 lever you have most influence on pulling.


 I have negligible row caching.


 Row caching is generally not advised in that era, especially with heap
 pressure.


 I have key caching enabled on my cfs. I couldn't find an easy way to
 estimate how much this is using, but I tried to invalidate the key cache
 and I got 1.3 GB back.


 Key caching is generally advisable, but 1.3GB is a lot of key cache.


 That still only adds up to 5.8 GB. I know there is index sampling going
 on as well. I have around 800 million rows. Is there a way to estimate how
 much space this would add up to?


 Plenty. You should reduce your bloom filter size, or upgrade to a version
 of Cassandra that moves stuff off the heap.

 =Rob
 http://twitter.com/rcolidba





Why is one query 10 times slower than the other?

2014-11-05 Thread Jacob Rhoden
Hi Guys,

I have two cassandra 2.0.5 nodes, RF=2. When I do a:

select * from table1 where clustercolumn='something'

The trace indicates that it only needs to talk to one node, which I would have 
expected. However, when I do a:

select * from table2

which is a small table with only 20 rows in it, should be fully replicated,
and should be a much quicker query, the trace indicates that cassandra is talking
to both nodes. This adds 200ms to the query results and is not necessary for
my application (this table might have an amendment once per year, if that);
there's no real need to check both nodes for consistency.

At this point I’ve not altered anything to do with consistency level. Does this
mean that cassandra attempts to guess/infer what consistency level you need
depending on whether your query includes a filter on a particular key or clustering
key?

Thanks,
Jacob


CREATE KEYSPACE mykeyspace WITH replication = { 'class': 'SimpleStrategy',
'replication_factor': '2' };

CREATE TABLE organisation (uuid uuid, name text, url text, PRIMARY KEY (uuid))

CREATE TABLE lookup_code (type text, code text, name text, PRIMARY KEY ((type), 
code)) 


select * from lookup_code where type='mylist':

 activity | timestamp | source | source_elapsed
----------+-----------+--------+----------------
 execute_cql3_query | 04:20:15,319 | 74.50.54.123 | 0
 Parsing select * from lookup_code where type='research_area' LIMIT 1; | 04:20:15,319 | 74.50.54.123 | 64
 Preparing statement | 04:20:15,320 | 74.50.54.123 | 204
 Executing single-partition query on lookup_code | 04:20:15,320 | 74.50.54.123 | 849
 Acquiring sstable references | 04:20:15,320 | 74.50.54.123 | 870
 Merging memtable tombstones | 04:20:15,320 | 74.50.54.123 | 894
 Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones | 04:20:15,320 | 74.50.54.123 | 958
 Merging data from memtables and 0 sstables | 04:20:15,320 | 74.50.54.123 | 976
 Read 168 live and 0 tombstoned cells | 04:20:15,321 | 74.50.54.123 | 1412
 Request complete | 04:20:15,321 | 74.50.54.123 | 2043


select * from organisation:

 activity | timestamp | source | source_elapsed
----------+-----------+--------+----------------
 execute_cql3_query | 04:21:03,641 | 74.50.54.123 | 0
 Parsing select * from organisation LIMIT 1; | 04:21:03,641 | 74.50.54.123 | 68
 Preparing statement | 04:21:03,641 | 74.50.54.123 | 174
 Determining replicas to query | 04:21:03,642 | 74.50.54.123 | 307
 Enqueuing request to /72.249.82.85 | 04:21:03,642 | 74.50.54.123 | 1034
 Sending message to /72.249.82.85 | 04:21:03,643 | 74.50.54.123 | 1402
 Message received from /74.50.54.123 | 04:21:03,644 | 72.249.82.85 | 47
 Executing seq scan across 0 sstables for [min(-9223372036854775808), min(-9223372036854775808)] | 04:21:03,644 | 72.249.82.85 | 461
 Read 1 live and 0 tombstoned cells | 04:21:03,644 | 72.249.82.85 | 560
 Read 1 live and 0 tombstoned cells | 04:21:03,644 | 72.249.82.85 | 611

 ...etc...



Re: Why is one query 10 times slower than the other?

2014-11-05 Thread graham sanderson
In your “lookup_code” example, “type” is not a clustering column, it is the partition
key, and hence the first query only hits one partition.
The second query is a range slice across all possible keys, so the sub-ranges
are farmed out to the nodes with the data.
You are likely at CL_ONE, so it only needs a response from one node for each
sub-range… I guess it has decided (based on the snitch) that it is not
unreasonable to share the query across the two nodes.
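
To see what the session is actually doing, cqlsh from the 2.0 series can show both
the consistency level and the per-query fan-out; a quick illustration (the table
name matches the earlier schema):

TRACING ON;
CONSISTENCY;   -- with no argument, prints the current level (ONE by default)
SELECT * FROM lookup_code WHERE type = 'mylist';

Note that consistency level controls how many replicas must answer, not which
queries fan out: the fan-out in the organisation trace comes from the full-table
scan having to cover every token range.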

 On Nov 5, 2014, at 10:41 PM, Jacob Rhoden jacob.rho...@me.com wrote:
 
 Hi Guys,
 
 I have two cassandra 2.0.5 nodes, RF=2. When I do a:
 
 select * from table1 where clustercolumn='something'
 
 The trace indicates that it only needs to talk to one node, which I would
 have expected. However, when I do a:
 
 select * from table2
 
 which is a small table with only 20 rows in it, should be fully
 replicated, and should be a much quicker query, the trace indicates that
 cassandra is talking to both nodes. This adds 200ms to the query results
 and is not necessary for my application (this table might have an amendment
 once per year, if that); there's no real need to check both nodes for
 consistency.
 
 At this point I’ve not altered anything to do with consistency level. Does
 this mean that cassandra attempts to guess/infer what consistency level you
 need depending on whether your query includes a filter on a particular key or
 clustering key?
 
 Thanks,
 Jacob
 
 
 CREATE KEYSPACE mykeyspace WITH replication = { 'class': 'SimpleStrategy',
 'replication_factor': '2' };
 
 CREATE TABLE organisation (uuid uuid, name text, url text, PRIMARY KEY (uuid))
 
 CREATE TABLE lookup_code (type text, code text, name text, PRIMARY KEY 
 ((type), code)) 
 
 
 select * from lookup_code where type='mylist':
 
  activity | timestamp | source | source_elapsed
 ----------+-----------+--------+----------------
  execute_cql3_query | 04:20:15,319 | 74.50.54.123 | 0
  Parsing select * from lookup_code where type='research_area' LIMIT 1; | 04:20:15,319 | 74.50.54.123 | 64
  Preparing statement | 04:20:15,320 | 74.50.54.123 | 204
  Executing single-partition query on lookup_code | 04:20:15,320 | 74.50.54.123 | 849
  Acquiring sstable references | 04:20:15,320 | 74.50.54.123 | 870
  Merging memtable tombstones | 04:20:15,320 | 74.50.54.123 | 894
  Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones | 04:20:15,320 | 74.50.54.123 | 958
  Merging data from memtables and 0 sstables | 04:20:15,320 | 74.50.54.123 | 976
  Read 168 live and 0 tombstoned cells | 04:20:15,321 | 74.50.54.123 | 1412
  Request complete | 04:20:15,321 | 74.50.54.123 | 2043
 
 
 select * from organisation:
 
  activity | timestamp | source | source_elapsed
 ----------+-----------+--------+----------------
  execute_cql3_query | 04:21:03,641 | 74.50.54.123 | 0
  Parsing select * from organisation LIMIT 1; | 04:21:03,641 | 74.50.54.123 | 68
  Preparing statement | 04:21:03,641 | 74.50.54.123 | 174
  Determining replicas to query | 04:21:03,642 | 74.50.54.123 | 307
  Enqueuing request to /72.249.82.85 | 04:21:03,642 | 74.50.54.123 | 1034
  Sending message to /72.249.82.85 | 04:21:03,643 | 74.50.54.123 | 1402
  Message received from /74.50.54.123 | 04:21:03,644 | 72.249.82.85 | 47
  Executing seq scan across 0 sstables for [min(-9223372036854775808), min(-9223372036854775808)] | 04:21:03,644 | 72.249.82.85 | 461
  Read 1 live and 0 tombstoned cells | 04:21:03,644 | 72.249.82.85 | 560
 

Re: tuning concurrent_reads param

2014-11-05 Thread Jimmy Lin
Sorry, I have a late follow-up question.

In the cassandra.yaml file, the concurrent_reads section has the following
comment (quoted below):

What does it mean by "the operations to enqueue low enough in the stack
that the OS and drives can reorder them"? How does it help keep the
system healthy?
What really happens if we increase it to too high a value? (Maybe it affects
other read or write operations as they eat up all the disk I/O resources?)


thanks


# For workloads with more data than can fit in memory, Cassandra's
# bottleneck will be reads that need to fetch data from
# disk. concurrent_reads should be set to (16 * number_of_drives) in
# order to allow the operations to enqueue low enough in the stack
# that the OS and drives can reorder them.
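
A hedged reading of that comment: it is about I/O queue depth. With enough reads
outstanding at the OS level, the kernel's I/O scheduler and the drive itself can
merge and reorder requests so seeks are served in an efficient order, which is why
the guidance scales with the number of drives. Push it far too high and you simply
pile more requests onto an already saturated disk: read latencies climb, and
compaction and commitlog writes contend for the same I/O. For illustration, the
setting itself (values are assumptions to test against nodetool tpstats, not
recommendations):

concurrent_reads: 32
# or, on SSD-backed nodes, a plausible starting point might be:
# concurrent_reads: 64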

On Wed, Oct 29, 2014 at 8:47 PM, Chris Lohfink chris.lohf...@datastax.com
wrote:

 There's a bit to it; sometimes it can use tweaking though. It's a good
 default for most systems, so I wouldn't increase it right off the bat. When
 using SSDs or something with a lot of horsepower it could be higher though
 (i.e. i2.xlarge+ on EC2). If you monitor the number of active threads in the
 read thread pool (nodetool tpstats) you can see whether they are actually all
 busy or not. If it's near 32 (or whatever you set it at) all the time, it
 may be a bottleneck.

 ---
 Chris Lohfink

 On Wed, Oct 29, 2014 at 10:41 PM, Jimmy Lin y2klyf+w...@gmail.com wrote:

 Hi,
 Looking at the docs, the default value for concurrent_reads is 32, which
 seems a bit small to me (compared to, say, an HTTP server), because if my node is
 receiving even slight traffic, any more than 32 concurrent read queries will have
 to wait(?)

 The recommended rule is 16 * number of drives. Would that be different if I
 have SSDs?

 I am attempting to increase it because I have a few tables with wide rows
 that the app will fetch; the sheer size of the data may already be eating up
 the thread time, which can cause other read threads to wait and essentially
 slow down.

 thanks







Re: Storing files in Cassandra with Spring Data / Astyanax

2014-11-05 Thread Wim Deblauwe
Hi,

We are building an application that we install on-premise; usually
there is no internet connection there at all. As I am using Cassandra for
storing everything else in the application, it would be very convenient to
also use Cassandra for those files, so I don't have to set up two distributed
systems for each installation we do.

Is there documentation somewhere on how to integrate/get started with
Astyanax with Spring Data Cassandra?

regards,

Wim
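
If the Astyanax integration proves awkward, one fallback is a hand-rolled chunked
store over plain CQL, which is essentially what the Chunked Object Store does under
the hood. A minimal sketch, assuming the application splits each file into fixed-size
chunks (say 1 MB) on write and concatenates them on read; all table and column names
here are illustrative:

CREATE TABLE file_metadata (
    file_id uuid PRIMARY KEY,
    name text,
    content_type text,
    size bigint,
    chunk_count int
);

CREATE TABLE file_chunks (
    file_id uuid,
    chunk_index int,
    data blob,
    PRIMARY KEY ((file_id), chunk_index)
);

-- chunks come back in chunk_index order for reassembly:
SELECT data FROM file_chunks WHERE file_id = ?;

Keeping chunks small avoids very large cells, and Spring Data Cassandra (or any CQL
driver) can read and write these tables directly, with no Astyanax dependency.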

2014-11-05 23:40 GMT+01:00 Redmumba redmu...@gmail.com:

 Astyanax isn't deprecated; that user is wrong and has been downvoted, with a
 comment mentioning the same.

 What you're describing doesn't sound like you need a data store at all; it
 /sounds/ like you need a file store.  Why not use S3 or similar to store
 your images?  What benefits are you expecting to receive from Cassandra?
 It sounds like you're incurring an awful lot of overhead for what amounts
 to a file lookup.

 On Wed, Nov 5, 2014 at 8:19 AM, Wim Deblauwe wim.debla...@gmail.com
 wrote:

 Hi,

 I am currently testing with Cassandra and Spring Data Cassandra. I would
 now need to store files (images and avi files, normally up to 50 MB in size).

 I did find the Chunked Object Store
 https://github.com/Netflix/astyanax/wiki/Chunked-Object-Store from
 Astyanax, which looks promising. However, I have no idea how to combine
 Astyanax with Spring Data Cassandra.

 Also, this answer on SO http://stackoverflow.com/a/25926062/40064
 states that Netflix is no longer working on Astyanax, so maybe this is not
 a good option to base my application on?

 Are there any other options (where I can keep using Spring Data
 Cassandra)?

 I also read
 http://www.datastax.com/docs/datastax_enterprise3.0/solutions/hadoop_multiple_cfs
 but it is unclear to me whether I would need to install Hadoop as well if I want
 to use this.

 regards,

 Wim





Counter column impossible to delete and re-insert

2014-11-05 Thread Clément Fumey
Hi,

I have a table with a counter column. When I insert (update) a row, delete
it, and try to re-insert it, the re-insert fails. Here are the commands
I use:

CREATE TABLE test(
testId int,
year int,
testCounter counter,
PRIMARY KEY (testId, year)
)WITH CLUSTERING ORDER BY (year DESC);

UPDATE test SET testcounter = testcounter +5 WHERE testid = 2 AND year =
2014;
DELETE FROM test WHERE testid = 2 AND year = 2014;
UPDATE test SET testcounter = testcounter +5 WHERE testid = 2 AND year =
2014;

The last command fails: there is no error message, but the table is empty
afterwards.
Is that normal? Am I doing something wrong?
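
This is expected, if surprising, behavior rather than a bug: counter deletion in
Cassandra is not safely reusable. Once a counter cell has been deleted, later
increments to the same cell can be silently dropped while the tombstone is in
effect, so the guidance is to never re-update a deleted counter. A workaround
sketch, resetting instead of deleting:

-- instead of DELETE, read the current value and subtract it back to zero:
SELECT testcounter FROM test WHERE testid = 2 AND year = 2014;  -- returns 5
UPDATE test SET testcounter = testcounter - 5 WHERE testid = 2 AND year = 2014;

The read-then-subtract is racy under concurrent writers, so it only works if
updates to that row can be quiesced first.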

Regards

Clément