Recommended storage configuration for Cassandra on an Amazon m1.xlarge instance
Hello, I'd like to ask what the best option is for separating the commit log and the data on an Amazon m1.xlarge instance, given 4x420 GB instance-store (ephemeral) volumes and an EBS volume. As far as I understand, EBS is not the right choice and it's recommended to use the instance storage instead. Is it better to combine the 4 ephemeral drives into 2 RAID 0 (or RAID 1?) arrays, and store the data on the first and the commit log on the second? Or maybe try other combinations, like 1 instance-store volume for the commit log and the other 3 grouped into RAID 0 for the data? Thank you.
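For reference, the second layout described (3 disks in RAID 0 for data, the 4th dedicated to the commit log) could be set up roughly as follows. This is an illustrative sketch only: it assumes Linux with mdadm, and the device names /dev/xvdb through /dev/xvde for the four ephemeral volumes are placeholders (actual names vary by AMI).

```shell
# Sketch: group three ephemeral disks into one RAID 0 array for the data files
mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/xvdb /dev/xvdc /dev/xvdd
mkfs.ext4 /dev/md0
mkdir -p /var/lib/cassandra/data /var/lib/cassandra/commitlog
mount /dev/md0 /var/lib/cassandra/data

# Keep the fourth disk separate for the commit log, so its sequential writes
# are not interleaved with data-file reads and compaction I/O
mkfs.ext4 /dev/xvde
mount /dev/xvde /var/lib/cassandra/commitlog

# Then point cassandra.yaml at the two mount points:
#   data_file_directories:
#       - /var/lib/cassandra/data
#   commitlog_directory: /var/lib/cassandra/commitlog
```

The rationale for the split is that the commit log is append-only and benefits from a disk with no competing seeks, while the data directories benefit from the aggregate throughput of the striped array.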
Re[2]: Cassandra cluster migration in Amazon EC2
Thanks for the quick reply! If I launch a new Cassandra node, should I first add its IP to cassandra-topology.properties and to the "seeds" parameter in cassandra.yaml on all existing nodes and restart them?

> If you launch the new servers, have them join the cluster, then decommission
> the old ones, you'll be able to do it without downtime. It'll also have the
> effect of randomizing the tokens, I believe.
>
> On Sep 2, 2013, at 4:21 PM, Renat Gilfanov < gren...@mail.ru > wrote:
>> [...]

--
Renat Gilfanov
Re: Cassandra cluster migration in Amazon EC2
If you launch the new servers, have them join the cluster, then decommission the old ones, you'll be able to do it without downtime. It'll also have the effect of randomizing the tokens, I believe.

On Sep 2, 2013, at 4:21 PM, Renat Gilfanov wrote:
> [...]
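The join-then-decommission procedure described above can be sketched per replacement node as follows. This is an operational sketch, not a tested runbook: it assumes vnodes are enabled, a typical package install, and that the new node's cassandra.yaml lists the existing seed nodes (the new node's own IP does not need to be added as a seed for it to join).

```shell
# On each NEW m1.xlarge node: configure seeds to point at existing nodes,
# then start it and let it bootstrap its token ranges.
sudo service cassandra start
nodetool status            # wait until the new node shows UN (Up/Normal)

# On each OLD node, once enough replacements have joined:
nodetool decommission      # leaves the ring, streaming its ranges to the others
sudo service cassandra stop
```

Repeat node by node; because nodes are added before old ones are removed, the cluster keeps serving reads and writes throughout, and the old IPs simply drop out of the ring rather than needing to be reassigned.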
Cassandra cluster migration in Amazon EC2
Hello,

Currently we have a Cassandra cluster in Amazon EC2, and we are planning to upgrade our deployment configuration to achieve better performance and stability. However, a lot of open questions arise when planning this migration. I'll be very thankful if somebody could answer my questions.

Current state:

We use Apache Cassandra 1.2.8 on 5 nodes deployed in Amazon EC2, on m1.large instances, each with an EBS volume. In Cassandra we have set up 2 datacenters: the first has 3 nodes, each in a separate rack; the second has 2 nodes in separate racks. However, all Amazon instances belong to the same region and even the same availability zone. The replication settings for our keyspace are: {'class': 'NetworkTopologyStrategy', 'DC2': '1', 'DC1': '2'}. We have virtual nodes enabled, but the shuffle hasn't been completed yet and the nodes are unbalanced.

What we want to achieve:

- We would like to move to M1 Extra Large instances with 4x420 GB instance storage volumes.
- Group 3 of the volumes into a RAID 0 array, move the data directory to the RAID 0 array, and the commit log to the 4th remaining volume.

Open questions:

- Does the suggested configuration look reasonable from a performance-optimization point of view?
- As far as I understand, separating the commit log and the data directory should improve performance, but what about separating the OS from those two? Is it worth doing?
- What are the steps to perform such a migration? Will it be possible to perform it without downtime, restarting node by node with the new configuration applied? I'm especially worried about IP changes when we upgrade the instance type. What's the recommended way to handle those IP changes?

Best Regards,
Renat.
Re: How to perform range queries efficiently?
Sorry, I was not very clear. We simply created another CF whose row keys were given by the secondary index that we needed. The value of each row in this new CF was the key associated with a row in the first CF (the original one).

Francisco

On Sep 2, 2013, at 4:13 PM, Sávio Teles wrote:
> [...]
Re: Temporarily slow nodes on Cassandra
In general, with LOCAL_QUORUM you should not see such an issue when one node is slow. However, it could be because clients are still sending requests to that node. Depending on what client library you are using, you could try to take that node out of your connection pool. Not knowing the exact issue you are facing, this is just a hunch at this point.

On Mon, Sep 2, 2013 at 12:33 PM, Michael Theroux wrote:
> [...]
Temporarily slow nodes on Cassandra
Hello,

We are experiencing an issue where nodes are temporarily slow due to I/O contention, anywhere from 10 minutes to 2 hours. I don't believe this slowdown is Cassandra related, but rather caused by factors outside of Cassandra. We run Cassandra 1.1.9 in a 12-node cluster, with a replication factor of 3, and all queries use LOCAL_QUORUM consistency.

Our problem is (other than the contention issue, which we are working on) that when this one node slows down, the whole system's performance appears to slow down. Is there a way in Cassandra to accommodate or mitigate slower nodes? Shutting down the node in question during the period of contention does "resolve" the performance problem, but is there anything in Cassandra that can assist with this situation while we resolve the hardware problem?

Thanks,
-Mike
Re: Timeout Exception with row_cache enabled
Is it related to https://issues.apache.org/jira/browse/CASSANDRA-4973? And https://issues.apache.org/jira/browse/CASSANDRA-4785?

2013/9/2 Nate McCall
> [...]

--
Best regards,
Sávio S. Teles de Oliveira
Re: Timeout Exception with row_cache enabled
Your experience is not uncommon. There was a recent thread on this with a variety of details on when to use row caching: http://www.mail-archive.com/user@cassandra.apache.org/msg31693.html

tl;dr - it depends completely on the use case. Small static rows work best.

On Mon, Sep 2, 2013 at 2:05 PM, Sávio Teles wrote:
> [...]
Timeout Exception with row_cache enabled
I'm running Cassandra 1.2.4, and when I enable the row cache the system throws a TimeoutException and garbage collection doesn't stop. When I disable it, the query returns in 700 ms.

Configuration:

- row_cache_size_in_mb: 256
- row_cache_save_period: 0
- # row_cache_keys_to_save: 100
- row_cache_provider: SerializingCacheProvider

Why is this happening?

Thanks in advance!

--
Best regards,
Sávio S. Teles de Oliveira
voice: +55 62 9136 6996
http://br.linkedin.com/in/savioteles
MSc student in Computer Science - UFG
Software Architect
Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
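Following the advice in this thread that the row cache suits only small static rows, one practical mitigation while investigating is to turn the row cache off for the affected table, since in Cassandra 1.2 caching is configured per table. A sketch (the keyspace and table names are placeholders):

```sql
-- Keep only the key cache for a table whose rows are too large to row-cache
ALTER TABLE my_keyspace.my_table WITH caching = 'keys_only';

-- If the rows really are small and static, the row cache can be re-enabled:
-- valid values in 1.2 are 'all', 'keys_only', 'rows_only', and 'none'
ALTER TABLE my_keyspace.my_table WITH caching = 'all';
```

This keeps the global row_cache_size_in_mb setting intact while excluding the table whose large rows trigger the GC pressure.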
Re: How to perform range queries efficiently?
> We performed some modifications and created another column family, which
> maps the secondary index to the key of the original column family. The
> improvements were very impressive in our case!

Sorry, I couldn't understand! What did you change? Have you built a B-tree?

2013/9/2 Francisco Nogueira Calmon Sobral
> [...]

--
Best regards,
Sávio S. Teles de Oliveira
Re: Versioning in cassandra
In my case the version can be a timestamp as well. What do you suggest the version number should be? Do you see any problems if I keep the version as a counter / timestamp?

On Tue, Sep 3, 2013 at 12:22 AM, Jan Algermissen wrote:
> [...]
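If the version must be an integer, the counter approach mentioned in this thread could look like the sketch below. One caveat worth labeling explicitly: a counter column cannot be part of a primary key, so the counter only tracks the latest version number, and the application then uses that number to write or read the corresponding versioned content row.

```sql
-- Counter table tracking the latest version number per file
CREATE TABLE file_version (
    file_id text PRIMARY KEY,
    version counter
);

-- Bump the version on every modification
UPDATE file_version SET version = version + 1 WHERE file_id = 'xxx';

-- Read it back, then write/read the content row keyed by (file_id, version)
SELECT version FROM file_version WHERE file_id = 'xxx';
```

Note that counter updates are not atomic with the content write, so the timestamp-clustering approach shown earlier in the thread is simpler if a timestamp is acceptable as the version.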
Re: Versioning in cassandra
On 02.09.2013, at 20:44, dawood abdullah wrote:

> Requirement is like I have a column family say File
>
> create table file(id text primary key, fname text, version int, mimetype text, content text);
>
> Say, I have few records inserted, when I modify an existing record (content is modified) a new version needs to be created. As I need to have provision to revert back to any old version whenever required.

So, can version be a timestamp? Or does it need to be an integer?

In the former case, make use of C*'s ordering like so:

CREATE TABLE file (
    file_id text,
    version timestamp,
    fname text,
    PRIMARY KEY (file_id, version)
) WITH CLUSTERING ORDER BY (version DESC);

Get the latest file version with:

select * from file where file_id = 'xxx' limit 1;

If it has to be an integer, use counter columns.

Jan

> [...]
Re: Versioning in cassandra
Requirement is like I have a column family, say File:

create table file(id text primary key, fname text, version int, mimetype text, content text);

Say I have a few records inserted; when I modify an existing record (the content is modified), a new version needs to be created, as I need a provision to revert back to any old version whenever required.

Regards,
Dawood

On Mon, Sep 2, 2013 at 10:47 PM, Jan Algermissen wrote:
> [...]
Re: How to perform range queries efficiently?
We had some problems when using secondary indexes because of three issues:

- The query is a range query, which means that it is slow.
- There is an open bug regarding the use of the row cache for secondary indexes (CASSANDRA-4973).
- The cardinality of our secondary key was very low (this was bad).

We performed some modifications and created another column family, which maps the secondary index to the key of the original column family. The improvements were very impressive in our case!

Best regards,
Francisco

On Aug 28, 2013, at 12:22 PM, Vivek Mishra wrote:

> Create a column family of compositeType (or PRIMARY KEY) as (user_id, age, salary).
>
> Then you will be able to query using the eq operator over the partition key as well as over the clustering key.
>
> You may also use salary as a secondary index rather than as part of the clustering key (e.g. age, salary).
>
> I am sure that based on your query usage, you can opt for either a composite key or mix a composite key with a secondary index!
>
> Have a look at:
> http://www.datastax.com/dev/blog/introduction-to-composite-columns-part-1
>
> Hope it helps.
>
> -Vivek
>
> On Wed, Aug 28, 2013 at 5:49 PM, Sávio Teles wrote:
> I can populate again. We are modelling the data yet! Tks.
>
> 2013/8/28 Vivek Mishra
> Just saw that you already have data populated, so I guess modifying for a composite key may not work for you.
> -Vivek
>
> On Tue, Aug 27, 2013 at 11:55 PM, Sávio Teles wrote:
> Vivek, using a composite key, how would the query look?
>
> 2013/8/27 Vivek Mishra
> For such queries, it looks like you may create a composite key as (user_id, age, salary).
> Too much indexing always kills (irrespective of RDBMS or NoSQL). Remember that every search request on secondary indexes will be passed to each node in the ring.
> -Vivek
>
> On Tue, Aug 27, 2013 at 11:11 PM, Sávio Teles wrote:
>> Use a database that is designed for efficient range queries? ;D
> Is there no way to do this with Cassandra? Like using Hive, Solr...
>
> 2013/8/27 Robert Coli
> On Fri, Aug 23, 2013 at 5:53 AM, Sávio Teles wrote:
>> I need to perform range queries efficiently.
>> ...
>> This query takes a long time to run. Any ideas to perform it efficiently?
> Use a database that is designed for efficient range queries? ;D
> =Rob
>
> --
> Best regards,
> Sávio S. Teles de Oliveira
> voice: +55 62 9136 6996
> http://br.linkedin.com/in/savioteles
> MSc student in Computer Science - UFG
> Software Architect
> Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
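The lookup column family described in this thread might look like the sketch below, using the user_id/age/salary example from the discussion (the table and column names are hypothetical).

```sql
-- Original data, keyed by user_id
CREATE TABLE users (
    user_id text PRIMARY KEY,
    age int,
    salary int
);

-- Hand-maintained "index": maps the indexed value back to the original keys.
-- Here age is the partition key, so this supports equality lookups on age;
-- for a range over age, use a bucketed partition key and make age a
-- clustering column instead.
CREATE TABLE users_by_age (
    age int,
    user_id text,
    PRIMARY KEY (age, user_id)
);

-- The application writes to both tables, then reads index-first:
SELECT user_id FROM users_by_age WHERE age = 30;
```

Unlike a built-in secondary index, this query hits only the replicas for the age=30 partition rather than fanning out to every node in the ring, which is where the reported improvement comes from.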
Re: Versioning in cassandra
Hi Dawood,

On 02.09.2013, at 16:36, dawood abdullah wrote:

> Hi
> I have a requirement of versioning to be done in Cassandra.
>
> Following is my column family definition:
>
> create table file_details(id text primary key, fname text, version int, mimetype text);
>
> I have a secondary index created on the fname column.
>
> Whenever I do an insert for the same 'fname', the version should be incremented. And when I retrieve a row with fname, it should return me the latest version row.
>
> Is there a better way to do this in Cassandra? Please suggest what approach needs to be taken.

Can you explain more about your use case?

If the version need not be a small number, but could be a timestamp, you could make use of C*'s ordering feature: have the database set the new version as a timestamp and retrieve the latest one with a simple LIMIT 1 query. (I'll explain more when this is an option for you.)

Jan

P.S. Me being a REST/HTTP head, an alarm rings when I see 'version' next to 'mimetype' :-) What exactly are you versioning here? Maybe we can even change the situation from a functional POV?
Re: Does collection in CQL3 have certain limits?
I believe CQL has to fetch and transport the entire row, so if it contains a collection you transmit the entire collection. C* is mostly about low-latency queries, and as the row gets larger, keeping latency low becomes impossible. Collections do not support a large number of columns; they were not designed for that. IMHO, if you are talking about 2K+ columns, collections are not for you; use old-school C* wide rows.

On Mon, Sep 2, 2013 at 10:36 AM, Keith Wright wrote:

> I know that the size is limited to max short (~32k) because when deserializing the response from the server, the first item returned is the number of items, and it's a short. That being said, you could likely handle this by looking for the overflow and allowing double max short.
>
> Vikas Goyal wrote:
>
>> There are two ways to support wide rows in CQL3: one is to use composite keys and the other is to use collections like Map, List and Set. The composite-key method can have millions of columns (transposed to rows), which solves some of our use cases. However, if we use collections, I want to know if there is a limit on the number/amount of data the collections can store (earlier, with Thrift, C* supported up to 2 billion columns in a row).
>>
>> Thanks,
>> Vikas Goyal
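To illustrate the distinction drawn above with a sketch (hypothetical table and column names): a collection travels with the row as a whole on every read and is capped at roughly 64K elements in this era of CQL3, while a clustering-column "wide row" can hold a very large number of entries per partition and be read in slices.

```sql
-- Collection: the whole map is fetched with the row on every read,
-- and the number of elements is bounded.
CREATE TABLE user_events_coll (
    user_id text PRIMARY KEY,
    events map<timestamp, text>
);

-- Wide row via a clustering column: many entries per partition,
-- readable in bounded slices.
CREATE TABLE user_events_wide (
    user_id text,
    event_time timestamp,
    payload text,
    PRIMARY KEY (user_id, event_time)
);

SELECT payload FROM user_events_wide
 WHERE user_id = 'u1' AND event_time > '2013-01-01' LIMIT 100;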
Re: Upgrade from 1.0.9 to 1.2.8
> 1.0.9 -> 1.0.12 -> 1.1.12 -> 1.2.x?

Yes, because of this fix in 1.0.11:
* fix 1.0.x node join to mixed version cluster, other nodes >= 1.1 (CASSANDRA-4195)

-Jeremiah

On Aug 30, 2013, at 2:00 PM, Mike Neir wrote:

> Is there anything that you can link that describes the pitfalls you mention? I'd like a bit more information. Just for clarity's sake, are you recommending 1.0.9 -> 1.0.12 -> 1.1.12 -> 1.2.x? Or would 1.0.9 -> 1.1.12 -> 1.2.x suffice?
>
> Regarding the placement strategy mentioned in a different post, I'm using the Simple placement strategy with the RackInferringSnitch. How does that play into the bugs mentioned previously about cross-DC replication?
>
> MN
>
> On 08/30/2013 01:28 PM, Jeremiah D Jordan wrote:
>> You probably want to go to 1.0.11/12 first no matter what. If you want the least chance of issues, you should then go to 1.1.12. While there is a high probability that going from 1.0.x to 1.2 will work, you have the best chance of no failures if you go through 1.1.12. There are some edge cases that can cause errors if you don't do that.
>>
>> -Jeremiah
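Whichever path is chosen, each version hop is typically done as a rolling upgrade, node by node. An operational sketch (not a tested runbook; service commands and paths depend on the install):

```shell
# On each node in turn, for each hop (e.g. 1.0.12 -> 1.1.12):
nodetool drain             # flush memtables; the node stops accepting writes
sudo service cassandra stop
# ... install the new Cassandra version and merge any config changes ...
sudo service cassandra start
nodetool upgradesstables   # rewrite SSTables into the new on-disk format
nodetool ring              # verify the ring is healthy before the next node
```

Running `upgradesstables` before starting the next hop matters here, since the mixed-version edge cases Jeremiah mentions are exactly what the intermediate releases are meant to avoid.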
Re: CqlStorage creates wrong schema for Pig
Hi,

1. Maybe:

-- Register the UDF
REGISTER /path/to/cqlstorageudf-1.0-SNAPSHOT.jar
-- FromCqlColumn will convert chararray, int, long, float, double
DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();
-- Load data as normal
data_raw = LOAD 'cql://bookcrossing/books' USING CqlStorage();
-- Use the UDF
data = FOREACH data_raw GENERATE
    FromCqlColumn(isbn) AS ISBN,
    FromCqlColumn(bookauthor) AS BookAuthor,
    FromCqlColumn(booktitle) AS BookTitle,
    FromCqlColumn(publisher) AS Publisher,
    FromCqlColumn(yearofpublication) AS YearOfPublication;

2. With the data in CQL (Cassandra 1.2.8, Pig 0.11.1 and CQL3):

CREATE KEYSPACE keyspace1
  WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
  AND durable_writes = true;

use keyspace1;

CREATE TABLE test (
  id text PRIMARY KEY,
  title text,
  age int
) WITH COMPACT STORAGE;

insert into test (id, title, age) values('1', 'child', 21);
insert into test (id, title, age) values('2', 'support', 21);
insert into test (id, title, age) values('3', 'manager', 31);
insert into test (id, title, age) values('4', 'QA', 41);
insert into test (id, title, age) values('5', 'QA', 30);
insert into test (id, title, age) values('6', 'QA', 30);

and the script:

register './libs/cqlstorageudf-1.0-SNAPSHOT.jar';
DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();
rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
dump rows;
ILLUSTRATE rows;
describe rows;
A = FOREACH rows GENERATE FLATTEN(title);
dump A;
values3 = FOREACH A GENERATE FromCqlColumn(title) AS title;
dump values3;
describe values3;

I have this error:

| rows | id:chararray | age:int | title:chararray |
| | (id, 5) | (age, 30) | (title, QA) |

rows: {id: chararray,age: int,title: chararray}
...
(title,QA)
(title,QA)
...
2013-09-02 16:40:52,454 [Thread-11] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0003
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple
    at com.megatome.pig.piggybank.tuple.ColumnBase.exec(ColumnBase.java:32)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-09-02 16:40:52,832 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0003

8-|

Regards ...

Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com

2013/9/2 Miguel Angel Martin junquera
> hi all:
>
> More info:
>
> https://issues.apache.org/jira/browse/CASSANDRA-5941
>
> I tried this (and gen.
cassandra 1.2.9), but it does not work for me:
>
> git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
> cd cassandra
> git checkout cassandra-1.2
> patch -p1 < 5867-bug-fix-filter-push-down-1.2-branch.txt
> ant
>
> Miguel Angel Martín Junquera
> Analyst Engineer.
> miguelangel.mar...@brainsins.com
>
> 2013/9/2 Miguel Angel Martin junquera
>
>> good/nice job !!!
>>
>> I was testing with a UDF only with the string schema type; this is better and more elaborate work.
>>
>> Regards
>>
>> Miguel Angel Martín Junquera
>> Analyst Engineer.
>> miguelangel.mar...@brainsins.com
>>
>> 2013/8/31 Chad Johnston
>>
>>> I threw together a quick UDF to work around this issue. It just extracts
>>> the value portion of the tuple while taking advantage of the CqlStorage
>>> generated schema to keep the type correct.
>>>
>>> You can get it here: https://github.com/iamthechad/cqlstorage-udf
Re: Does collection in CQL3 have certain limits?
I know that the size is limited to max short (~32K) because when deserializing the response from the server, the first item returned is the number of items, and it's a short. That being said, you could likely handle this by looking for the overflow and allowing double max short.

Vikas Goyal wrote:

As there are two ways to support wide rows in CQL3: one is to use composite keys and another is to use collections like Map, List and Set. The composite-key method can have millions of columns (transposed to rows), which solves some of our use cases.

However, if we use collections, I want to know if there is a limit on the number/amount of data a collection can store (earlier, with Thrift, C* supported up to 2 billion columns in a row).

Thanks,
Vikas Goyal
Versioning in cassandra
Hi,

I have a requirement to implement versioning in Cassandra. Following is my column family definition:

create table file_details(id text primary key, fname text, version int, mimetype text);

I have a secondary index on the fname column. Whenever I insert a row for the same fname, the version should be incremented, and when I retrieve a row by fname it should return the row with the latest version. Is there a better way to do this in Cassandra? Please suggest what approach should be taken.

Regards,
Dawood
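A common alternative, sketched below with an assumed re-modelling of the table (not Dawood's actual schema), is to make the version a clustering column in descending order. The latest version is then simply the first row of the partition, and no secondary index on fname is needed:

```sql
-- Hypothetical re-modelling of file_details with the version in the primary key
CREATE TABLE file_details (
    fname    text,
    version  int,
    id       text,
    mimetype text,
    PRIMARY KEY (fname, version)
) WITH CLUSTERING ORDER BY (version DESC);

-- Every save inserts a new row with an incremented version
INSERT INTO file_details (fname, version, id, mimetype)
VALUES ('report.doc', 2, 'doc-uuid-2', 'application/msword');

-- Latest version = first row in clustering order
SELECT * FROM file_details WHERE fname = 'report.doc' LIMIT 1;
```

Note that the client still has to pick the next version number itself (e.g. by reading the latest row first); Cassandra 1.2 has no compare-and-set, so two concurrent writers can race.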
Re: CqlStorage creates wrong schema for Pig
hi all:

More info:

https://issues.apache.org/jira/browse/CASSANDRA-5941

I tried this (and built Cassandra 1.2.9), but it does not work for me:

git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
cd cassandra
git checkout cassandra-1.2
patch -p1 < 5867-bug-fix-filter-push-down-1.2-branch.txt
ant

Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com

2013/9/2 Miguel Angel Martin junquera
> good/nice job !!!
>
> I was testing with a UDF only with the string schema type; this is better and more elaborate work.
>
> Regards
>
> Miguel Angel Martín Junquera
> Analyst Engineer.
> miguelangel.mar...@brainsins.com
>
> 2013/8/31 Chad Johnston
>
>> I threw together a quick UDF to work around this issue. It just extracts
>> the value portion of the tuple while taking advantage of the CqlStorage
>> generated schema to keep the type correct.
>>
>> You can get it here: https://github.com/iamthechad/cqlstorage-udf
>>
>> I'll see if I can find more useful information and open a defect, since
>> that's what this seems to be.
>> Chad
>>
>> On Fri, Aug 30, 2013 at 2:02 AM, Miguel Angel Martin junquera <
>> mianmarjun.mailingl...@gmail.com> wrote:
>>
>>> I tried this:
>>>
>>> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
>>> dump rows;
>>> ILLUSTRATE rows;
>>> describe rows;
>>>
>>> values2 = FOREACH rows GENERATE TOTUPLE(id) as (mycolumn:tuple(name,value));
>>> dump values2;
>>> describe values2;
>>>
>>> But I get these results:
>>>
>>> | rows | id:chararray | age:int | title:chararray |
>>> | | (id, 6) | (age, 30) | (title, QA) |
>>>
>>> rows: {id: chararray,age: int,title: chararray}
>>> 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>> ERROR 1031: Incompatable field schema: left is
>>> "tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray))", right is
>>> "org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)"
>>>
>>> or:
>>>
>>> values2 = FOREACH rows GENERATE TOTUPLE(id);
>>> dump values2;
>>> describe values2;
>>>
>>> and the results are:
>>>
>>> ...
>>> (((id,6)))
>>> (((id,5)))
>>> values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
>>>
>>> Aggg!
>>>
>>> Miguel Angel Martín Junquera
>>> Analyst Engineer.
>>> miguelangel.mar...@brainsins.com
>>>
>>> 2013/8/26 Miguel Angel Martin junquera
>>>
>>>> hi Chad.
>>>>
>>>> I have this issue too. I sent a mail to the user-pig list, and I still cannot resolve it; I cannot access the column values. In that mail I describe some things I tried without results, and give more information about this issue:
>>>>
>>>> http://mail-archives.apache.org/mod_mbox/pig-user/201308.mbox/%3ccajeg_hq9s2po3_xytzx5xki4j1mao8q26jydg2wndy_kyiv...@mail.gmail.com%3E
>>>>
>>>> I hope someone replies with a comment, idea or solution for this issue or bug.
I have reviewed the CqlStorage class in the Cassandra 1.2.8 code, but I have not configured the environment to debug and trace this issue. I only found comments like the following, which I do not fully understand:

/**
 * A LoadStoreFunc for retrieving data from and storing data to Cassandra
 *
 * A row from a standard CF will be returned as nested tuples:
 * (((key1, value1), (key2, value2)), ((name1, val1), (name2, val2))).
 */

If you find some idea or solution, please post it.

Thanks

2013/8/23 Chad Johnston
> (I'm using Cassandra 1.2.8 and Pig 0.11.1)
>
> I'm loading some simple data from Cassandra into Pig using CqlStorage.
> The CqlStorage loader defines a Pig schema based on the Cassandra schema,
> but it seems to be wrong.
>
> If I do:
>
> data = LOAD 'cql://bookdata/books' USING CqlStorage();
> DESCRIBE data;
>
> I get this:
>
> data: {isbn: chararray,bookauthor: chararray,booktitle: chararray,publisher: chararray,yearofpublication: int}
>
> However, if I DUMP data, I get results like these:
>
> ((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in the Stocks),(publisher,Berkley Pub Group),(yearofpublication,1986))
>
> Clearly the results from Cassandra are key/value pairs, as would be
> expected. I don't know why the schema generated by CqlStorage() would be so
> different.
>
> This is really causing me problems trying to access the column values.
Re: how can i get the column value? Need help!.. cassandra 1.28 and pig 0.11.1
hi all:

More info:

https://issues.apache.org/jira/browse/CASSANDRA-5941

I tried this (and built Cassandra 1.2.9), but it does not work for me:

git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
cd cassandra
git checkout cassandra-1.2
patch -p1 < 5867-bug-fix-filter-push-down-1.2-branch.txt
ant

Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com

2013/9/2 Miguel Angel Martin junquera
> hi:
>
> I tested this on the new Cassandra 1.2.9 and the issue still persists.
>
> :-(
>
> Miguel Angel Martín Junquera
> Analyst Engineer.
> miguelangel.mar...@brainsins.com
>
> 2013/8/30 Miguel Angel Martin junquera
>
>> I tried this:
>>
>> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
>> dump rows;
>> ILLUSTRATE rows;
>> describe rows;
>>
>> values2 = FOREACH rows GENERATE TOTUPLE(id) as (mycolumn:tuple(name,value));
>> dump values2;
>> describe values2;
>>
>> But I get these results:
>>
>> | rows | id:chararray | age:int | title:chararray |
>> | | (id, 6) | (age, 30) | (title, QA) |
>>
>> rows: {id: chararray,age: int,title: chararray}
>> 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>> ERROR 1031: Incompatable field schema: left is
>> "tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray))", right is
>> "org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)"
>>
>> or:
>>
>> values2 = FOREACH rows GENERATE TOTUPLE(id);
>> dump values2;
>> describe values2;
>>
>> and the results are:
>>
>> ...
>> (((id,6)))
>> (((id,5)))
>> values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
>>
>> Aggg!
>>
>> Miguel Angel Martín Junquera
>> Analyst Engineer.
>> miguelangel.mar...@brainsins.com
>>
>> 2013/8/28 Miguel Angel Martin junquera
>>
>>> hi:
>>>
>>> I cannot understand why the schema is defined like "id:chararray,age:int,title:chararray" and not like tuples or bags of tuples, since we have key-value pair columns.
>>>
>>> I tried again to change the schema, but it does not work.
>>>
>>> Any ideas?
>>>
>>> Perhaps the issue is in the definition of the CQL3 tables?
>>>
>>> regards
>>>
>>> 2013/8/28 Miguel Angel Martin junquera
>>>
>>>> hi all:
>>>>
>>>> Regards
>>>>
>>>> I still cannot resolve this issue. Does anybody have this issue, or has anyone tried this simple example? I am stumped; I cannot find a working solution. I appreciate any comment or help.
>>>>
>>>> 2013/8/22 Miguel Angel Martin junquera < mianmarjun.mailingl...@gmail.com>
>>>>
>>>>> hi all:
>>>>>
>>>>> I'm testing the new CqlStorage() with Cassandra 1.2.8 and Pig 0.11.1.
>>>>>
>>>>> I am using this sample test data:
>>>>>
>>>>> http://frommyworkshop.blogspot.com.es/2013/07/hadoop-map-reduce-with-cassandra.html
>>>>>
>>>>> And I load and dump data right with this script:
>>>>>
>>>>> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
>>>>> dump rows;
>>>>> describe rows;
>>>>>
>>>>> results:
>>>>>
>>>>> ((id,6),(age,30),(title,QA))
>>>>> ((id,5),(age,30),(title,QA))
>>>>> rows: {id: chararray,age: int,title: chararray}
>>>>>
>>>>> But I cannot get the column values.
>>>>>
>>>>> I tried to define another schema in LOAD, like I used with CassandraStorage():
>>>>>
>>>>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-and-Pig-how-to-get-column-values-td5641158.html
>>>>>
>>>>> example:
>>>>>
>>>>> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage() AS (columns: bag {T: tuple(name, value)});
>>>>>
>>>>> and I get this error:
>>>>>
>>>>> 2013-08-22 12:24:45,426 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>>>> ERROR 1031: Incompatable schema: left is
>>>>> "columns:bag{T:tuple(name:bytearray,value:bytearray)}", right is
>>>>> "id:chararray,age:int,title:chararray"
>>>>>
>>>>> I tried to use the FLATTEN, SUBSTRING and SPLIT UDFs, but I did not get good results. For example, when I FLATTEN, I get a set of tuples like:
>>>>>
>>>>> (title,QA)
>>>>> (title,QA)
>>>>>
>>>>> 2013-08-22 12:42:20,673 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
>>>>> A: {title: chararray}
Cassandra SSTable Access Distribution ? How To ?
Hi,

I'm trying to get the SSTable access distribution for reads from the Cassandra stress tool. When I try to dump the cfhistograms output, I don't see any entries in the SSTables column; they all turn out to be zero. Any idea what might be going wrong? Please suggest how to dump the histogram with the SSTable access distribution.

Thanks,
/BK
Does collection in CQL3 have certain limits?
There are two ways to support wide rows in CQL3: one is to use composite keys, and another is to use collections like Map, List and Set. The composite-key method can have millions of columns (transposed to rows), which solves some of our use cases.

However, if we use collections, I want to know if there is a limit on the number/amount of data a collection can store (earlier, with Thrift, C* supported up to 2 billion columns in a row).

Thanks,
Vikas Goyal
Re: CqlStorage creates wrong schema for Pig
good/nice job !!!

I was testing with a UDF only with the string schema type; this is better and more elaborate work.

Regards

Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com

2013/8/31 Chad Johnston
> I threw together a quick UDF to work around this issue. It just extracts
> the value portion of the tuple while taking advantage of the CqlStorage
> generated schema to keep the type correct.
>
> You can get it here: https://github.com/iamthechad/cqlstorage-udf
>
> I'll see if I can find more useful information and open a defect, since
> that's what this seems to be.
>
> Chad
>
> On Fri, Aug 30, 2013 at 2:02 AM, Miguel Angel Martin junquera <
> mianmarjun.mailingl...@gmail.com> wrote:
>
>> I tried this:
>>
>> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
>> dump rows;
>> ILLUSTRATE rows;
>> describe rows;
>>
>> values2 = FOREACH rows GENERATE TOTUPLE(id) as (mycolumn:tuple(name,value));
>> dump values2;
>> describe values2;
>>
>> But I get these results:
>>
>> | rows | id:chararray | age:int | title:chararray |
>> | | (id, 6) | (age, 30) | (title, QA) |
>>
>> rows: {id: chararray,age: int,title: chararray}
>> 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>> ERROR 1031: Incompatable field schema: left is
>> "tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray))", right is
>> "org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)"
>>
>> or:
>>
>> values2 = FOREACH rows GENERATE TOTUPLE(id);
>> dump values2;
>> describe values2;
>>
>> and the results are:
>>
>> ...
>> (((id,6)))
>> (((id,5)))
>> values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
>>
>> Aggg!
>>
>> Miguel Angel Martín Junquera
>> Analyst Engineer.
>> miguelangel.mar...@brainsins.com
>>
>> 2013/8/26 Miguel Angel Martin junquera
>>
>>> hi Chad.
>>>
>>> I have this issue too. I sent a mail to the user-pig list, and I still
>>> cannot resolve it; I cannot access the column values.
>>> In that mail I describe some things I tried without results, and give
>>> more information about this issue:
>>>
>>> http://mail-archives.apache.org/mod_mbox/pig-user/201308.mbox/%3ccajeg_hq9s2po3_xytzx5xki4j1mao8q26jydg2wndy_kyiv...@mail.gmail.com%3E
>>>
>>> I hope someone replies with a comment, idea or solution for this issue
>>> or bug.
>>>
>>> I have reviewed the CqlStorage class in the Cassandra 1.2.8 code, but I
>>> have not configured the environment to debug and trace this issue.
>>>
>>> I only found comments like the following, which I do not fully understand:
>>>
>>> /**
>>>  * A LoadStoreFunc for retrieving data from and storing data to Cassandra
>>>  *
>>>  * A row from a standard CF will be returned as nested tuples:
>>>  * (((key1, value1), (key2, value2)), ((name1, val1), (name2, val2))).
>>>  */
>>>
>>> If you find some idea or solution, please post it.
>>>
>>> Thanks
>>>
>>> 2013/8/23 Chad Johnston
>>>
>>>> (I'm using Cassandra 1.2.8 and Pig 0.11.1)
>>>>
>>>> I'm loading some simple data from Cassandra into Pig using CqlStorage.
>>>> The CqlStorage loader defines a Pig schema based on the Cassandra schema,
>>>> but it seems to be wrong.
>>>>
>>>> If I do:
>>>>
>>>> data = LOAD 'cql://bookdata/books' USING CqlStorage();
>>>> DESCRIBE data;
>>>>
>>>> I get this:
>>>>
>>>> data: {isbn: chararray,bookauthor: chararray,booktitle: chararray,publisher: chararray,yearofpublication: int}
>>>>
>>>> However, if I DUMP data, I get results like these:
>>>>
>>>> ((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in the Stocks),(publisher,Berkley Pub Group),(yearofpublication,1986))
>>>>
>>>> Clearly the results from Cassandra are key/value pairs, as would be
>>>> expected. I don't know why the schema generated by CqlStorage() would be
>>>> so different.
>>>>
>>>> This is really causing me problems trying to access the column values. I
>>>> tried a naive approach of FLATTENing each tuple, then trying to access
>>>> the values that way:
>>>>
>>>> flattened = FOREACH data GENERATE FLATTEN(isbn), FLATTEN(booktitle), ...
>>>> values = FOREACH flattened GENERATE $1 AS ISBN, $3 AS BookTitle, ...
>>>>
>>>> As soon as I try to access field $5, Pig complains about the index being
>>>> out of bounds.
>>>>
>>>> Is there a way to solve the schema/reality mismatch? Am I doing something
>>>> wrong, or have I stumbled across a defect?
>>>>
>>>> Thanks,
>>>> Chad
Re: how can i get the column value? Need help!.. cassandra 1.28 and pig 0.11.1
hi:

I tested this on the new Cassandra 1.2.9 and the issue still persists.

:-(

Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com

2013/8/30 Miguel Angel Martin junquera
> I tried this:
>
> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
> dump rows;
> ILLUSTRATE rows;
> describe rows;
>
> values2 = FOREACH rows GENERATE TOTUPLE(id) as (mycolumn:tuple(name,value));
> dump values2;
> describe values2;
>
> But I get these results:
>
> | rows | id:chararray | age:int | title:chararray |
> | | (id, 6) | (age, 30) | (title, QA) |
>
> rows: {id: chararray,age: int,title: chararray}
> 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1031: Incompatable field schema: left is
> "tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray))", right is
> "org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)"
>
> or:
>
> values2 = FOREACH rows GENERATE TOTUPLE(id);
> dump values2;
> describe values2;
>
> and the results are:
>
> ...
> (((id,6)))
> (((id,5)))
> values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
>
> Aggg!
>
> Miguel Angel Martín Junquera
> Analyst Engineer.
> miguelangel.mar...@brainsins.com
>
> 2013/8/28 Miguel Angel Martin junquera
>
>> hi:
>>
>> I cannot understand why the schema is defined like "id:chararray,age:int,title:chararray" and not like tuples or bags of tuples, since we have key-value pair columns.
>>
>> I tried again to change the schema, but it does not work.
>>
>> Any ideas?
>>
>> Perhaps the issue is in the definition of the CQL3 tables?
>>
>> regards
>>
>> 2013/8/28 Miguel Angel Martin junquera
>>
>>> hi all:
>>>
>>> Regards
>>>
>>> I still cannot resolve this issue. Does anybody have this issue, or has anyone tried this simple example? I am stumped; I cannot find a working solution. I appreciate any comment or help.
>>>
>>> 2013/8/22 Miguel Angel Martin junquera < mianmarjun.mailingl...@gmail.com>
>>>
>>>> hi all:
>>>>
>>>> I'm testing the new CqlStorage() with Cassandra 1.2.8 and Pig 0.11.1.
>>>>
>>>> I am using this sample test data:
>>>>
>>>> http://frommyworkshop.blogspot.com.es/2013/07/hadoop-map-reduce-with-cassandra.html
>>>>
>>>> And I load and dump data right with this script:
>>>>
>>>> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
>>>> dump rows;
>>>> describe rows;
>>>>
>>>> results:
>>>>
>>>> ((id,6),(age,30),(title,QA))
>>>> ((id,5),(age,30),(title,QA))
>>>> rows: {id: chararray,age: int,title: chararray}
>>>>
>>>> But I cannot get the column values.
>>>>
>>>> I tried to define another schema in LOAD, like I used with CassandraStorage():
>>>>
>>>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-and-Pig-how-to-get-column-values-td5641158.html
>>>>
>>>> example:
>>>>
>>>> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage() AS (columns: bag {T: tuple(name, value)});
>>>>
>>>> and I get this error:
>>>>
>>>> 2013-08-22 12:24:45,426 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>>> ERROR 1031: Incompatable schema: left is
>>>> "columns:bag{T:tuple(name:bytearray,value:bytearray)}", right is
>>>> "id:chararray,age:int,title:chararray"
>>>>
>>>> I tried to use the FLATTEN, SUBSTRING and SPLIT UDFs, but I did not get good results. For example, when I FLATTEN, I get a set of tuples like:
>>>>
>>>> (title,QA)
>>>> (title,QA)
>>>>
>>>> 2013-08-22 12:42:20,673 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
>>>> A: {title: chararray}
>>>>
>>>> but I cannot get the value "QA". SUBSTRING only works on "title", for example:
>>>>
>>>> B = FOREACH A GENERATE SUBSTRING(title,2,5);
>>>> dump B;
>>>> describe B;
>>>>
>>>> results:
>>>>
>>>> (tle)
>>>> (tle)
>>>> B: {chararray}
>>>>
>>>> I tried this, like Eric Lee in the other mail, and got the same results: "Anyways, what I really want is the column value, not the name. Is there a way to do that? I listed all of the failed attempts I made below."
>>>>
>>>> - colnames = FOREACH cols GENERATE $1 and was told $1 was out of bounds.
>>>> - casted = FOREACH cols GENERATE (
Re: successful use of "shuffle"?
On 30 August 2013 18:42, Jeremiah D Jordan wrote:
> You need to introduce the new "vnode enabled" nodes in a new DC. Or you
> will have similar issues to
> https://issues.apache.org/jira/browse/CASSANDRA-5525
>
> Add vnode DC:
> http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/operations/ops_add_dc_to_cluster_t.html
>
> Point clients to new DC
>
> Remove non-vnode DC:
> http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/operations/ops_decomission_dc_t.html

This is a good workaround if you have the hardware to temporarily run a cluster that's double the size. If you don't, then I think shuffle is the only option, but it is known to have issues.

Richard.