Recommended storage choice for Cassandra on Amazon m1.xlarge instance

2013-09-02 Thread Renat Gilfanov
 Hello,

I'd like to ask what are the best options for separating the commit log and data on an 
Amazon m1.xlarge instance, given the 4x420 GB attached (ephemeral) storage volumes and an EBS volume?

As far as I understand, EBS is not the right choice and it's recommended to use the 
attached storage instead.
Is it better to combine the 4 ephemeral drives into 2 RAID0 (or RAID1?) arrays, and store 
data on the first and the commit log on the second? Or maybe try other 
combinations, like 1 attached drive for the commit log, and the 3 others grouped in 
RAID0 for data?

Thank you. 
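For reference, a minimal shell sketch of one such layout (3 ephemeral drives in
RAID0 for data, the 4th for the commit log). The device names /dev/xvdb../dev/xvde
and the mount points are assumptions; check yours with lsblk and adjust:

# Build a 3-drive RAID0 for the data directory; keep one drive for the commit log.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/xvdb /dev/xvdc /dev/xvdd
sudo mkfs.ext4 /dev/md0
sudo mkfs.ext4 /dev/xvde
sudo mkdir -p /var/lib/cassandra/data /var/lib/cassandra/commitlog
sudo mount /dev/md0  /var/lib/cassandra/data
sudo mount /dev/xvde /var/lib/cassandra/commitlog
# cassandra.yaml then points at the two mounts:
#   data_file_directories:
#       - /var/lib/cassandra/data
#   commitlog_directory: /var/lib/cassandra/commitlog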




Re[2]: Cassandra cluster migration in Amazon EC2

2013-09-02 Thread Renat Gilfanov
 Thanks for the quick reply!

If I launch the new Cassandra node, should I first add its IP to the 
cassandra-topology.properties and the "seeds" parameter in the cassandra.yaml on 
all existing nodes and restart them?

>If you launch the new servers, have them join the cluster, then decommission 
>the old ones, you'll be able to do it without downtime.  It'll also have the 
>effect of randomizing the tokens, I believe. 
>
>On Sep 2, 2013, at 4:21 PM, Renat Gilfanov < gren...@mail.ru > wrote:
>
>> Hello,
>> 
>> Currently we have a Cassandra cluster in Amazon EC2, and we are planning 
>> to upgrade our deployment configuration to achieve better 
>> performance and stability. However, a lot of open questions arise when 
>> planning this migration. I'll be very thankful if somebody could answer my 
>> questions.
>> 
>> Current state:
>> 
>> We use Apache Cassandra 1.2.8 on 5 nodes deployed in Amazon EC2, on 
>> m1.large instances with an EBS volume each. In Cassandra we have set up 2 
>> datacenters: the first has 3 nodes, each in a separate rack; the second has 2 
>> nodes, each in a separate rack. However, all Amazon instances belong to the 
>> same region and even the same availability zone. The replication for our 
>> keyspace is the following: {'class': 'NetworkTopologyStrategy',  'DC2': '1', 
>>  'DC1': '2'}.
>> We have virtual nodes enabled, however the shuffle hasn't been completed 
>> yet, and the nodes are unbalanced.
>> 
>> What we want to achieve:
>> 
>> - We would like to move to the M1 Extra Large (m1.xlarge) instances with 4x420 GB 
>> instance storage volumes. 
>> - Group 3 of the volumes into a RAID0 array, move the data directory to the RAID0, 
>> and the commit log to the 4th remaining volume.
>> 
>> Open questions:
>>  - Does the suggested configuration look reasonable from the performance 
>> optimization point of view?
>>  - As far as I understand, separating the commit log and the data directory 
>> should improve performance - but what about separating the OS from those 
>> two - is it worth doing?
>>  - What are the steps to perform such a migration? Will it be possible to 
>> perform it without downtime, restarting node by node with the new configuration 
>> applied?
>>  I'm especially worried about IP changes when we upgrade the instance 
>> type. What's the recommended way to handle those IP changes?
>> 
>> Best Regards,
>> Renat.
>


-- 
Renat Gilfanov


Re: Cassandra cluster migration in Amazon EC2

2013-09-02 Thread Jon Haddad
If you launch the new servers, have them join the cluster, then decommission 
the old ones, you'll be able to do it without downtime.  It'll also have the 
effect of randomizing the tokens, I believe. 

On Sep 2, 2013, at 4:21 PM, Renat Gilfanov  wrote:

> Hello,
> 
> Currently we have a Cassandra cluster in Amazon EC2, and we are planning 
> to upgrade our deployment configuration to achieve better 
> performance and stability. However, a lot of open questions arise when 
> planning this migration. I'll be very thankful if somebody could answer my 
> questions.
> 
> Current state:
> 
> We use Apache Cassandra 1.2.8 on 5 nodes deployed in Amazon EC2, on 
> m1.large instances with an EBS volume each. In Cassandra we have set up 2 
> datacenters: the first has 3 nodes, each in a separate rack; the second has 2 
> nodes, each in a separate rack. However, all Amazon instances belong to the 
> same region and even the same availability zone. The replication for our 
> keyspace is the following: {'class': 'NetworkTopologyStrategy',  'DC2': '1',  
> 'DC1': '2'}.
> We have virtual nodes enabled, however the shuffle hasn't been completed yet, 
> and the nodes are unbalanced.
> 
> What we want to achieve:
> 
> - We would like to move to the M1 Extra Large (m1.xlarge) instances with 4x420 GB 
> instance storage volumes. 
> - Group 3 of the volumes into a RAID0 array, move the data directory to the RAID0, and 
> the commit log to the 4th remaining volume.
> 
> Open questions:
>  - Does the suggested configuration look reasonable from the performance 
> optimization point of view?
>  - As far as I understand, separating the commit log and the data directory should 
> improve performance - but what about separating the OS from those two - 
> is it worth doing?
>  - What are the steps to perform such a migration? Will it be possible to 
> perform it without downtime, restarting node by node with the new configuration 
> applied?
>  I'm especially worried about IP changes when we upgrade the instance 
> type. What's the recommended way to handle those IP changes?
> 
> Best Regards,
> Renat.



Cassandra cluster migration in Amazon EC2

2013-09-02 Thread Renat Gilfanov
 Hello,

Currently we have a Cassandra cluster in Amazon EC2, and we are planning to 
upgrade our deployment configuration to achieve better 
performance and stability. However, a lot of open questions arise when planning 
this migration. I'll be very thankful if somebody could answer my 
questions.

Current state:

We use Apache Cassandra 1.2.8 on 5 nodes deployed in Amazon EC2, on 
m1.large instances with an EBS volume each. In Cassandra we have set up 2 
datacenters: the first has 3 nodes, each in a separate rack; the second has 2 nodes, 
each in a separate rack. However, all Amazon instances belong to the 
same region and even the same availability zone. The replication for our keyspace 
is the following: {'class': 'NetworkTopologyStrategy',  'DC2': '1',  'DC1': 
'2'}.
We have virtual nodes enabled, however the shuffle hasn't been completed yet, 
and the nodes are unbalanced.

What we want to achieve:

- We would like to move to the M1 Extra Large (m1.xlarge) instances with 4x420 GB instance 
storage volumes. 
- Group 3 of the volumes into a RAID0 array, move the data directory to the RAID0, and 
the commit log to the 4th remaining volume.

Open questions:
 - Does the suggested configuration look reasonable from the performance 
optimization point of view?
 - As far as I understand, separating the commit log and the data directory should 
improve performance - but what about separating the OS from those two - is 
it worth doing?
 - What are the steps to perform such a migration? Will it be possible to perform 
it without downtime, restarting node by node with the new configuration applied?
 I'm especially worried about IP changes when we upgrade the instance type. 
What's the recommended way to handle those IP changes?

Best Regards,
Renat.
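For what it's worth, a rough sketch of the launch-then-decommission flow suggested
in the replies. Service names and the exact cassandra.yaml fields vary by install;
the nodetool commands are standard:

# On each new m1.xlarge node: set cluster_name, seeds and snitch/topology entries
# in cassandra.yaml, then start it; with auto_bootstrap on (the default) it joins
# the ring and streams its share of the data.
sudo service cassandra start
nodetool status                       # wait for the new node to show as UN (Up/Normal)

# Then retire the old m1.large nodes one at a time:
nodetool -h <old-node> decommission   # streams its ranges away and leaves the ring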

Re: How to perform range queries efficiently?

2013-09-02 Thread Francisco Nogueira Calmon Sobral
Sorry, I was not very clear.

We simply created another CF whose row keys were given by the secondary index 
that we needed. The value of each row in this new CF was the key associated 
with a row in the first CF (the original one).

Francisco
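For concreteness, a sketch of that pattern with hypothetical table and column names
(illustrative only, not the actual schema): a second CF keyed by the indexed value,
whose rows carry the original CF's keys:

cqlsh <<'EOF'
CREATE TABLE items (
    item_id text PRIMARY KEY,
    category text,
    payload text
);

-- Manual "index": one partition per category, original keys as clustering columns.
CREATE TABLE items_by_category (
    category text,
    item_id text,
    PRIMARY KEY (category, item_id)
);

-- Look up the keys for a category, then fetch those rows from the original CF.
SELECT item_id FROM items_by_category WHERE category = 'books';
EOF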


On Sep 2, 2013, at 4:13 PM, Sávio Teles  wrote:

> 
> We performed some modifications and created another column family, which maps 
> the secondary index to the key of the original column family. The 
> improvements were very impressive in our case!
> 
> Sorry, I couldn't understand! What changes? Have you built a B-tree? 
> 
> 
> 2013/9/2 Francisco Nogueira Calmon Sobral 
> We had some problems when using secondary indexes because of three issues:
> 
> - The query is a Range Query, which means that it is slow.
> - There is an open bug regarding the use of row cache for secondary indexes 
> (CASSANDRA-4973)
> - The cardinality of our secondary key was very low (this was bad)
> 
> We performed some modifications and created another column family, which maps 
> the secondary index to the key of the original column family. The 
> improvements were very impressive in our case!
> 
> Best regards
> Francisco
> 
> 
> On Aug 28, 2013, at 12:22 PM, Vivek Mishra  wrote:
> 
>> Create a column family with a compositeType (or PRIMARY KEY) as (user_id, age, 
>> salary).
>> 
>> Then you will be able to query using the eq operator over the partition key as 
>> well as over the clustering key.
>> 
>> You may also make salary a secondary index rather than part of the clustering 
>> key (e.g. age, salary).
>> 
>> I am sure that based on your query usage, you can opt for either a composite key 
>> or mix a composite key with a secondary index!
>> 
>> Have a look at: 
>> http://www.datastax.com/dev/blog/introduction-to-composite-columns-part-1
>> 
>> Hope it helps.
>> 
>> 
>> -Vivek
>> 
>> 
>> On Wed, Aug 28, 2013 at 5:49 PM, Sávio Teles  
>> wrote:
>> I can populate it again. We are still modelling the data! Thanks.
>> 
>> 
>> 2013/8/28 Vivek Mishra 
>> Just saw that you already have data populated, so I guess modifying for 
>> a composite key may not work for you.
>> 
>> -Vivek
>> 
>> 
>> On Tue, Aug 27, 2013 at 11:55 PM, Sávio Teles  
>> wrote:
>> Vivek, using a composite key, how would the query look?
>> 
>> 
>> 2013/8/27 Vivek Mishra 
>> For such queries, it looks like you may create a composite key as (user_id, age, 
>> salary).
>> 
>> Too much indexing always kills (irrespective of RDBMS or NoSQL). Remember, 
>> every search request on secondary indexes will be passed to each node in 
>> the ring.
>> 
>> -Vivek
>> 
>> 
>> On Tue, Aug 27, 2013 at 11:11 PM, Sávio Teles  
>> wrote:
>> Use a database that is designed for efficient range queries? ;D
>> 
>> Is there no way to do this with Cassandra? Like using Hive, Solr...
>> 
>> 
>> 2013/8/27 Robert Coli 
>> On Fri, Aug 23, 2013 at 5:53 AM, Sávio Teles  
>> wrote:
>> I need to perform range queries efficiently. 
>> ... 
>> This query takes a long time to run. Any ideas to perform it efficiently?
>> 
>> Use a database that is designed for efficient range queries? ;D
>> 
>> =Rob
>>  
>> 
>> 
>> 
>> -- 
>> Atenciosamente,
>> Sávio S. Teles de Oliveira
>> voice: +55 62 9136 6996
>> http://br.linkedin.com/in/savioteles
>> Mestrando em Ciências da Computação - UFG 
>> Arquiteto de Software
>> Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
>> 
>> 
>> 
>> 
>> -- 
>> Atenciosamente,
>> Sávio S. Teles de Oliveira
>> voice: +55 62 9136 6996
>> http://br.linkedin.com/in/savioteles
>> Mestrando em Ciências da Computação - UFG 
>> Arquiteto de Software
>> Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
>> 
>> 
>> 
>> 
>> -- 
>> Atenciosamente,
>> Sávio S. Teles de Oliveira
>> voice: +55 62 9136 6996
>> http://br.linkedin.com/in/savioteles
>> Mestrando em Ciências da Computação - UFG 
>> Arquiteto de Software
>> Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
>> 
> 
> 
> 
> 
> -- 
> Atenciosamente,
> Sávio S. Teles de Oliveira
> voice: +55 62 9136 6996
> http://br.linkedin.com/in/savioteles
> Mestrando em Ciências da Computação - UFG 
> Arquiteto de Software
> Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG



Re: Temporarily slow nodes on Cassandra

2013-09-02 Thread Mohit Anchlia
In general with LOCAL_QUORUM you should not see such an issue when one node
is slow. However, it could be because clients are still sending requests
to that node. Depending on what client library you are using, you could
try to take that node out of your connection pool. Not knowing the exact issue
you are facing, this is just a hunch at this point.

On Mon, Sep 2, 2013 at 12:33 PM, Michael Theroux wrote:

> Hello,
>
> We are experiencing an issue where nodes are temporarily slow due to I/O
> contention, anywhere from 10 minutes to 2 hours.  I don't believe this
> slowdown is Cassandra related, but due to factors outside of Cassandra.  We run
> Cassandra 1.1.9.  We run a 12 node cluster, with a replication factor of 3,
> and all queries use LOCAL_QUORUM consistency.
>
> Our problem is (other than the contention issue, which we are working on)
> that when this one node slows down, the whole system performance appears to slow
> down.  Is there a way in Cassandra to accommodate or mitigate slower nodes?
>  Shutting down the node in question during the period of contention does
> "resolve" the performance problem, but is there anything in Cassandra that
> can assist in this situation while we resolve the hardware problem?
>
> Thanks,
> -Mike


Temporarily slow nodes on Cassandra

2013-09-02 Thread Michael Theroux
Hello,

We are experiencing an issue where nodes are temporarily slow due to I/O 
contention, anywhere from 10 minutes to 2 hours.  I don't believe this slowdown 
is Cassandra related, but due to factors outside of Cassandra.  We run Cassandra 
1.1.9.  We run a 12 node cluster, with a replication factor of 3, and all 
queries use LOCAL_QUORUM consistency.

Our problem is (other than the contention issue, which we are working on) that when 
this one node slows down, the whole system performance appears to slow down.  
Is there a way in Cassandra to accommodate or mitigate slower nodes?  Shutting 
down the node in question during the period of contention does "resolve" the 
performance problem, but is there anything in Cassandra that can assist in this 
situation while we resolve the hardware problem?

Thanks,
-Mike
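One possible stopgap (not a fix for the I/O contention itself): take the slow node
out of the client/coordinator path without shutting it down. These nodetool commands
exist in the 1.1/1.2 line; <slow-node> is a placeholder:

nodetool -h <slow-node> disablethrift   # stop serving client (Thrift) requests
nodetool -h <slow-node> disablegossip   # appear down to the rest of the ring

# When the contention passes:
nodetool -h <slow-node> enablegossip
nodetool -h <slow-node> enablethrift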

Re: Timeout Exception with row_cache enabled

2013-09-02 Thread Sávio Teles
Is it related to https://issues.apache.org/jira/browse/CASSANDRA-4973? And
https://issues.apache.org/jira/browse/CASSANDRA-4785?


2013/9/2 Nate McCall 

> Your experience is not uncommon. There was a recent thread on this with a
> variety of details on when to use row caching:
> http://www.mail-archive.com/user@cassandra.apache.org/msg31693.html
>
> tl;dr - it depends completely on use case. Small static rows work best.
>
>
>
> On Mon, Sep 2, 2013 at 2:05 PM, Sávio Teles 
> wrote:
>
>> I'm running Cassandra 1.2.4, and when I enable the row_cache, the
>> system throws TimeoutExceptions and garbage collection doesn't stop.
>>
>> When I disable it, the query returns in 700 ms.
>>
>> Configuration:
>>
>>    - row_cache_size_in_mb: 256
>>    - row_cache_save_period: 0
>>    - # row_cache_keys_to_save: 100
>>    - row_cache_provider: SerializingCacheProvider
>>
>> Why is this happening?
>>
>>
>> Thanks in advance!!
>>
>> --
>> Atenciosamente,
>> Sávio S. Teles de Oliveira
>> voice: +55 62 9136 6996
>> http://br.linkedin.com/in/savioteles
>>  Mestrando em Ciências da Computação - UFG
>> Arquiteto de Software
>> Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
>>
>
>


-- 
Atenciosamente,
Sávio S. Teles de Oliveira
voice: +55 62 9136 6996
http://br.linkedin.com/in/savioteles
Mestrando em Ciências da Computação - UFG
Arquiteto de Software
Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG


Re: Timeout Exception with row_cache enabled

2013-09-02 Thread Nate McCall
Your experience is not uncommon. There was a recent thread on this with a
variety of details on when to use row caching:
http://www.mail-archive.com/user@cassandra.apache.org/msg31693.html

tl;dr - it depends completely on use case. Small static rows work best.



On Mon, Sep 2, 2013 at 2:05 PM, Sávio Teles wrote:

> I'm running Cassandra 1.2.4, and when I enable the row_cache, the
> system throws TimeoutExceptions and garbage collection doesn't stop.
>
> When I disable it, the query returns in 700 ms.
>
> Configuration:
>
>    - row_cache_size_in_mb: 256
>    - row_cache_save_period: 0
>    - # row_cache_keys_to_save: 100
>    - row_cache_provider: SerializingCacheProvider
>
> Why is this happening?
>
>
> Thanks in advance!!
>
> --
> Atenciosamente,
> Sávio S. Teles de Oliveira
> voice: +55 62 9136 6996
> http://br.linkedin.com/in/savioteles
>  Mestrando em Ciências da Computação - UFG
> Arquiteto de Software
> Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
>


Timeout Exception with row_cache enabled

2013-09-02 Thread Sávio Teles
I'm running Cassandra 1.2.4, and when I enable the row_cache, the system
throws TimeoutExceptions and garbage collection doesn't stop.

When I disable it, the query returns in 700 ms.

Configuration:

   - row_cache_size_in_mb: 256
   - row_cache_save_period: 0
   - # row_cache_keys_to_save: 100
   - row_cache_provider: SerializingCacheProvider

Why is this happening?


Thanks in advance!!

-- 
Atenciosamente,
Sávio S. Teles de Oliveira
voice: +55 62 9136 6996
http://br.linkedin.com/in/savioteles
Mestrando em Ciências da Computação - UFG
Arquiteto de Software
Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
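One way to narrow this down, sketched with a hypothetical keyspace/table name: in
1.2 the row cache is opt-in per table, so you can keep row_cache_size_in_mb set but
exclude the suspect table and see whether the timeouts stop:

cqlsh <<'EOF'
ALTER TABLE my_keyspace.my_table WITH caching = 'keys_only';
EOF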


Re: How to perform range queries efficiently?

2013-09-02 Thread Sávio Teles
> We performed some modifications and created another column family, which
> maps the secondary index to the key of the original column family. The
> improvements were very impressive in our case!


Sorry, I couldn't understand! What changes? Have you built a B-tree?


2013/9/2 Francisco Nogueira Calmon Sobral 

> We had some problems when using secondary indexes because of three issues:
>
> - The query is a Range Query, which means that it is slow.
> - There is an open bug regarding the use of row cache for secondary
> indexes (CASSANDRA-4973)
> - The cardinality of our secondary key was very low (this was bad)
>
> We performed some modifications and created another column family, which
> maps the secondary index to the key of the original column family. The
> improvements were very impressive in our case!
>
> Best regards
> Francisco
>
>
> On Aug 28, 2013, at 12:22 PM, Vivek Mishra  wrote:
>
> Create a column family with a compositeType (or PRIMARY KEY) as (user_id, age,
> salary).
>
> Then you will be able to query using the eq operator over the partition key as
> well as over the clustering key.
>
> You may also make salary a secondary index rather than part of the
> clustering key (e.g. age, salary).
>
> I am sure that based on your query usage, you can opt for either a composite
> key or mix a composite key with a secondary index!
>
> Have a look at:
> http://www.datastax.com/dev/blog/introduction-to-composite-columns-part-1
>
> Hope it helps.
>
>
> -Vivek
>
>
> On Wed, Aug 28, 2013 at 5:49 PM, Sávio Teles 
> wrote:
>
>> I can populate it again. We are still modelling the data! Thanks.
>>
>>
>> 2013/8/28 Vivek Mishra 
>>
>>> Just saw that you already have data populated, so I guess modifying for
>>> a composite key may not work for you.
>>>
>>> -Vivek
>>>
>>>
>>> On Tue, Aug 27, 2013 at 11:55 PM, Sávio Teles <
>>> savio.te...@lupa.inf.ufg.br> wrote:
>>>
 Vivek, using a composite key, how would the query look?


 2013/8/27 Vivek Mishra 

> For such queries, it looks like you may create a composite key as
> (user_id, age, salary).
>
> Too much indexing always kills (irrespective of RDBMS or NoSQL).
> Remember, every search request on secondary indexes will be passed to each
> node in the ring.
>
> -Vivek
>
>
> On Tue, Aug 27, 2013 at 11:11 PM, Sávio Teles <
> savio.te...@lupa.inf.ufg.br> wrote:
>
>> Use a database that is designed for efficient range queries? ;D
>>>
>>
>> Is there no way to do this with Cassandra? Like using Hive, Solr...
>>
>>
>> 2013/8/27 Robert Coli 
>>
>>> On Fri, Aug 23, 2013 at 5:53 AM, Sávio Teles <
>>> savio.te...@lupa.inf.ufg.br> wrote:
>>>
 I need to perform range queries efficiently.

>>> ...
>>>
 This query takes a long time to run. Any ideas to perform it
 efficiently?

>>>
>>> Use a database that is designed for efficient range queries? ;D
>>>
>>> =Rob
>>>
>>>
>>
>>
>>
>> --
>> Atenciosamente,
>> Sávio S. Teles de Oliveira
>> voice: +55 62 9136 6996
>> http://br.linkedin.com/in/savioteles
>>  Mestrando em Ciências da Computação - UFG
>> Arquiteto de Software
>> Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
>>
>
>


 --
 Atenciosamente,
 Sávio S. Teles de Oliveira
 voice: +55 62 9136 6996
 http://br.linkedin.com/in/savioteles
  Mestrando em Ciências da Computação - UFG
 Arquiteto de Software
 Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG

>>>
>>>
>>
>>
>> --
>> Atenciosamente,
>> Sávio S. Teles de Oliveira
>> voice: +55 62 9136 6996
>> http://br.linkedin.com/in/savioteles
>>  Mestrando em Ciências da Computação - UFG
>> Arquiteto de Software
>> Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
>>
>
>
>


-- 
Atenciosamente,
Sávio S. Teles de Oliveira
voice: +55 62 9136 6996
http://br.linkedin.com/in/savioteles
Mestrando em Ciências da Computação - UFG
Arquiteto de Software
Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG


Re: Versioning in cassandra

2013-09-02 Thread dawood abdullah
In my case the version can be a timestamp as well. What do you suggest the version
number should be? Do you see any problems if I keep the version as a counter /
timestamp?


On Tue, Sep 3, 2013 at 12:22 AM, Jan Algermissen  wrote:

>
> On 02.09.2013, at 20:44, dawood abdullah 
> wrote:
>
> > Requirement is like I have a column family say File
> >
> > create table file(id text primary key, fname text, version int, mimetype
> text, content text);
> >
> > Say I have a few records inserted; when I modify an existing record
> (the content is modified), a new version needs to be created, as I need to have
> the ability to revert back to any old version whenever required.
> >
>
> So, can version be a timestamp? Or does it need to be an integer?
>
> In the former case, make use of C*'s ordering like so:
>
> CREATE TABLE file (
>file_id text,
>version timestamp,
>fname text,
>
>PRIMARY KEY (file_id,version)
> ) WITH CLUSTERING ORDER BY (version DESC);
>
> Get the latest file version with
>
> select * from file where file_id = 'xxx' limit 1;
>
> If it has to be an integer, use counter columns.
>
> Jan
>
>
> > Regards,
> > Dawood
> >
> >
> > On Mon, Sep 2, 2013 at 10:47 PM, Jan Algermissen <
> jan.algermis...@nordsc.com> wrote:
> > Hi Dawood,
> >
> > On 02.09.2013, at 16:36, dawood abdullah 
> wrote:
> >
> > > Hi
> > > I have a requirement of versioning to be done in Cassandra.
> > >
> > > Following is my column family definition
> > >
> > > create table file_details(id text primary key, fname text, version
> int, mimetype text);
> > >
> > > I have a secondary index created on fname column.
> > >
> > > Whenever I do an insert for the same 'fname', the version should be
> incremented. And when I retrieve a row with fname it should return me the
> latest version row.
> > >
> > > Is there a better way to do this in Cassandra? Please suggest what approach
> needs to be taken.
> >
> > Can you explain more about your use case?
> >
> If the version need not be a small number, but could be a timestamp, you
> could make use of C*'s ordering feature, have the database set the new
> version as a timestamp, and retrieve the latest one with a simple LIMIT 1
> query. (I'll explain more when this is an option for you).
> >
> > Jan
> >
> > P.S. Me being a REST/HTTP head, an alarm rings when I see 'version' next
> to 'mimetype' :-) What exactly are you versioning here? Maybe we can even
> change the situation from a functional POV?
> >
> >
> > >
> > > Regards,
> > >
> > > Dawood
> > >
> > >
> > >
> > >
> >
> >
>
>


Re: Versioning in cassandra

2013-09-02 Thread Jan Algermissen

On 02.09.2013, at 20:44, dawood abdullah  wrote:

> Requirement is like I have a column family say File
> 
> create table file(id text primary key, fname text, version int, mimetype 
> text, content text);
> 
> Say I have a few records inserted; when I modify an existing record (the content 
> is modified), a new version needs to be created, as I need to have the ability 
> to revert back to any old version whenever required.
> 

So, can version be a timestamp? Or does it need to be an integer?

In the former case, make use of C*'s ordering like so:

CREATE TABLE file (
   file_id text,
   version timestamp,
   fname text,
   
   PRIMARY KEY (file_id,version)
) WITH CLUSTERING ORDER BY (version DESC);

Get the latest file version with

select * from file where file_id = 'xxx' limit 1;

If it has to be an integer, use counter columns.

Jan
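For concreteness, a usage sketch of the schema above (the file_id and timestamps
are made-up values):

cqlsh <<'EOF'
INSERT INTO file (file_id, version, fname)
    VALUES ('doc-1', '2013-09-02 12:00:00', 'a.txt');
INSERT INTO file (file_id, version, fname)
    VALUES ('doc-1', '2013-09-02 13:00:00', 'a.txt');

-- Latest version first, thanks to CLUSTERING ORDER BY (version DESC):
SELECT * FROM file WHERE file_id = 'doc-1' LIMIT 1;

-- Reverting can then be done by reading an old version and writing it back
-- with a new (latest) version timestamp:
SELECT * FROM file WHERE file_id = 'doc-1';
EOF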


> Regards,
> Dawood
> 
> 
> On Mon, Sep 2, 2013 at 10:47 PM, Jan Algermissen  
> wrote:
> Hi Dawood,
> 
> On 02.09.2013, at 16:36, dawood abdullah  wrote:
> 
> > Hi
> > I have a requirement of versioning to be done in Cassandra.
> >
> > Following is my column family definition
> >
> > create table file_details(id text primary key, fname text, version int, 
> > mimetype text);
> >
> > I have a secondary index created on fname column.
> >
> > Whenever I do an insert for the same 'fname', the version should be 
> > incremented. And when I retrieve a row with fname it should return me the 
> > latest version row.
> >
> > Is there a better way to do this in Cassandra? Please suggest what approach 
> > needs to be taken.
> 
> Can you explain more about your use case?
> 
> If the version need not be a small number, but could be a timestamp, you 
> could make use of C*'s ordering feature, have the database set the new 
> version as a timestamp, and retrieve the latest one with a simple LIMIT 1 
> query. (I'll explain more when this is an option for you).
> 
> Jan
> 
> P.S. Me being a REST/HTTP head, an alarm rings when I see 'version' next to 
> 'mimetype' :-) What exactly are you versioning here? Maybe we can even change 
> the situation from a functional POV?
> 
> 
> >
> > Regards,
> >
> > Dawood
> >
> >
> >
> >
> 
> 



Re: Versioning in cassandra

2013-09-02 Thread dawood abdullah
Requirement is like I have a column family say File

create table file(id text primary key, fname text, version int, mimetype
text, content text);

Say I have a few records inserted; when I modify an existing record (the content
is modified), a new version needs to be created, as I need to have the ability
to revert back to any old version whenever required.

Regards,
Dawood


On Mon, Sep 2, 2013 at 10:47 PM, Jan Algermissen  wrote:

> Hi Dawood,
>
> On 02.09.2013, at 16:36, dawood abdullah 
> wrote:
>
> > Hi
> > I have a requirement of versioning to be done in Cassandra.
> >
> > Following is my column family definition
> >
> > create table file_details(id text primary key, fname text, version int,
> mimetype text);
> >
> > I have a secondary index created on fname column.
> >
> > Whenever I do an insert for the same 'fname', the version should be
> incremented. And when I retrieve a row with fname it should return me the
> latest version row.
> >
> > Is there a better way to do this in Cassandra? Please suggest what approach
> needs to be taken.
>
> Can you explain more about your use case?
>
> If the version need not be a small number, but could be a timestamp, you
> could make use of C*'s ordering feature, have the database set the new
> version as a timestamp, and retrieve the latest one with a simple LIMIT 1
> query. (I'll explain more when this is an option for you).
>
> Jan
>
> P.S. Me being a REST/HTTP head, an alarm rings when I see 'version' next
> to 'mimetype' :-) What exactly are you versioning here? Maybe we can even
> change the situation from a functional POV?
>
>
> >
> > Regards,
> >
> > Dawood
> >
> >
> >
> >
>
>


Re: How to perform range queries efficiently?

2013-09-02 Thread Francisco Nogueira Calmon Sobral
We had some problems when using secondary indexes because of three issues:

- The query is a Range Query, which means that it is slow.
- There is an open bug regarding the use of row cache for secondary indexes 
(CASSANDRA-4973)
- The cardinality of our secondary key was very low (this was bad)

We performed some modifications and created another column family, which maps 
the secondary index to the key of the original column family. The improvements 
were very impressive in our case!

Best regards
Francisco


On Aug 28, 2013, at 12:22 PM, Vivek Mishra  wrote:

> Create a column family with a compositeType (or PRIMARY KEY) as (user_id, age, 
> salary).
> 
> Then you will be able to query using the eq operator over the partition key as 
> well as over the clustering key.
> 
> You may also make salary a secondary index rather than part of the clustering 
> key (e.g. age, salary).
> 
> I am sure that based on your query usage, you can opt for either a composite key 
> or mix a composite key with a secondary index!
> 
> Have a look at: 
> http://www.datastax.com/dev/blog/introduction-to-composite-columns-part-1
> 
> Hope it helps.
> 
> 
> -Vivek
> 
> 
> On Wed, Aug 28, 2013 at 5:49 PM, Sávio Teles  
> wrote:
> I can populate it again. We are still modelling the data! Thanks.
> 
> 
> 2013/8/28 Vivek Mishra 
> Just saw that you already have data populated, so I guess modifying for 
> a composite key may not work for you.
> 
> -Vivek
> 
> 
> On Tue, Aug 27, 2013 at 11:55 PM, Sávio Teles  
> wrote:
> Vivek, using a composite key, how would the query look?
> 
> 
> 2013/8/27 Vivek Mishra 
> For such queries, it looks like you may create a composite key as (user_id, age, 
> salary).
> 
> Too much indexing always kills (irrespective of RDBMS or NoSQL). Remember, 
> every search request on secondary indexes will be passed to each node in the ring.
> 
> -Vivek
> 
> 
> On Tue, Aug 27, 2013 at 11:11 PM, Sávio Teles  
> wrote:
> Use a database that is designed for efficient range queries? ;D
> 
> Is there no way to do this with Cassandra? Like using Hive, Solr...
> 
> 
> 2013/8/27 Robert Coli 
> On Fri, Aug 23, 2013 at 5:53 AM, Sávio Teles  
> wrote:
> I need to perform range queries efficiently. 
> ... 
> This query takes a long time to run. Any ideas to perform it efficiently?
> 
> Use a database that is designed for efficient range queries? ;D
> 
> =Rob
>  
> 
> 
> 
> -- 
> Atenciosamente,
> Sávio S. Teles de Oliveira
> voice: +55 62 9136 6996
> http://br.linkedin.com/in/savioteles
> Mestrando em Ciências da Computação - UFG 
> Arquiteto de Software
> Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
> 
> 
> 
> 
> -- 
> Atenciosamente,
> Sávio S. Teles de Oliveira
> voice: +55 62 9136 6996
> http://br.linkedin.com/in/savioteles
> Mestrando em Ciências da Computação - UFG 
> Arquiteto de Software
> Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
> 
> 
> 
> 
> -- 
> Atenciosamente,
> Sávio S. Teles de Oliveira
> voice: +55 62 9136 6996
> http://br.linkedin.com/in/savioteles
> Mestrando em Ciências da Computação - UFG 
> Arquiteto de Software
> Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
> 



Re: Versioning in cassandra

2013-09-02 Thread Jan Algermissen
Hi Dawood,

On 02.09.2013, at 16:36, dawood abdullah  wrote:

> Hi
> I have a requirement of versioning to be done in Cassandra.
> 
> Following is my column family definition
> 
> create table file_details(id text primary key, fname text, version int, 
> mimetype text);
> 
> I have a secondary index created on fname column.
> 
> Whenever I do an insert for the same 'fname', the version should be 
> incremented. And when I retrieve a row with fname it should return me the 
> latest version row.
> 
> Is there a better way to do this in Cassandra? Please suggest what approach needs 
> to be taken.

Can you explain more about your use case?

If the version need not be a small number, but could be a timestamp, you could 
make use of C*'s ordering feature, have the database set the new version as a 
timestamp, and retrieve the latest one with a simple LIMIT 1 query. (I'll 
explain more when this is an option for you).

Jan

P.S. Me being a REST/HTTP head, an alarm rings when I see 'version' next to 
'mimetype' :-) What exactly are you versioning here? Maybe we can even change 
the situation from a functional POV?


> 
> Regards,
> 
> Dawood
> 
> 
> 
> 



Re: Does collection in CQL3 have certain limits?

2013-09-02 Thread Edward Capriolo
I believe CQL has to fetch and transport the entire row, so if it contains
a collection you transmit the entire collection. C* is mostly about low
latency queries, and as the row gets larger, keeping latency low becomes
impossible.

Collections do not support a large number of columns; they were not
designed to do that. IMHO, if you are talking about 2K+ columns, collections
are not for you; use old-school C* wide rows.


On Mon, Sep 2, 2013 at 10:36 AM, Keith Wright  wrote:

>  I know that the size is limited to max short (~32k) because when 
> deserializing the response from the server, the first item returned is the 
> number of items, and it's a short.  That being said, you could likely handle 
> this by looking for the overflow and allowing double max short.
>
> Vikas Goyal  wrote:
>
>
>   There are two ways to support wide rows in CQL3: one is to use
> composite keys and another is to use collections like Map, List and Set.
> The composite keys method can have millions of columns (transposed to
> rows), which is solving some of our use cases.
>
> However, if we use collections, I want to know if there is a limit on the
> number/amount of data the collections can store (like earlier, where with
> Thrift, C* supports up to 2 billion columns in a row).
>
>
>  Thanks,
>
> Vikas Goyal
>


Re: Upgrade from 1.0.9 to 1.2.8

2013-09-02 Thread Jeremiah D Jordan
> 1.0.9 -> 1.0.12 -> 1.1.12 -> 1.2.x?

Because of this fix in 1.0.11:
* fix 1.0.x node join to mixed version cluster, other nodes >= 1.1 
(CASSANDRA-4195)

-Jeremiah

On Aug 30, 2013, at 2:00 PM, Mike Neir  wrote:

> Is there anything that you can link that describes the pitfalls you mention? 
> I'd like a bit more information. Just for clarity's sake, are you 
> recommending 1.0.9 -> 1.0.12 -> 1.1.12 -> 1.2.x? Or would  1.0.9 -> 1.1.12 -> 
> 1.2.x suffice?
> 
> Regarding the placement strategy mentioned in a different post, I'm using the 
> Simple placement strategy, with the RackInferringSnitch. How does that play 
> into the bugs mentioned previously about cross-DC replication?
> 
> MN
> 
> On 08/30/2013 01:28 PM, Jeremiah D Jordan wrote:
>> You probably want to go to 1.0.11/12 first no matter what.  If you want the 
>> least chance of issues you should then go to 1.1.12.  While there is a high 
>> probability that going from 1.0.x -> 1.2 will work, you have the best chance 
>> of no failures if you go through 1.1.12.  There are some edge cases that can 
>> cause errors if you don't do that.
>> 
>> -Jeremiah
>> 
>> 
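For each hop on that path, the per-node steps are the usual rolling upgrade. A rough
sketch (package/service names vary by install; the nodetool commands are standard):

# Repeat on each node, one at a time, for every version hop
# (1.0.9 -> 1.0.12 -> 1.1.12 -> 1.2.x):
nodetool drain                 # flush memtables; the node stops accepting writes
sudo service cassandra stop
# ...install the next Cassandra version and merge configuration changes...
sudo service cassandra start
nodetool upgradesstables       # rewrite SSTables in the new on-disk format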



Re: CqlStorage creates wrong schema for Pig

2013-09-02 Thread Miguel Angel Martin junquera
Hi


1.-

Maybe this:

-- Register the UDF
REGISTER /path/to/cqlstorageudf-1.0-SNAPSHOT.jar;

-- FromCqlColumn will convert chararray, int, long, float, double
DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();

-- Load data as normal
data_raw = LOAD 'cql://bookcrossing/books' USING CqlStorage();

-- Use the UDF
data = FOREACH data_raw GENERATE
FromCqlColumn(isbn) AS ISBN,
FromCqlColumn(bookauthor) AS BookAuthor,
FromCqlColumn(booktitle) AS BookTitle,
FromCqlColumn(publisher) AS Publisher,
FromCqlColumn(yearofpublication) AS YearOfPublication;





and  2.:

with the data in CQL (Cassandra 1.2.8, Pig 0.11.1 and CQL3):

CREATE KEYSPACE keyspace1
  WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
  AND durable_writes = true;

use keyspace1;

CREATE TABLE test (
  id text PRIMARY KEY,
  title text,
  age int
) WITH COMPACT STORAGE;

insert into test (id, title, age) values('1', 'child', 21);
insert into test (id, title, age) values('2', 'support', 21);
insert into test (id, title, age) values('3', 'manager', 31);
insert into test (id, title, age) values('4', 'QA', 41);
insert into test (id, title, age) values('5', 'QA', 30);
insert into test (id, title, age) values('6', 'QA', 30);





and script:

register './libs/cqlstorageudf-1.0-SNAPSHOT.jar';
DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();
rows = LOAD
'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING
CqlStorage();
dump rows;
ILLUSTRATE rows;
describe rows;
A = FOREACH rows GENERATE FLATTEN(title);
dump A;
values3 = FOREACH A GENERATE FromCqlColumn(title) AS title;
dump values3;
describe values3;


--



I have this error:

---------------------------------------------------------
| rows | id:chararray   | age:int   | title:chararray   |
---------------------------------------------------------
|      | (id, 5)        | (age, 30) | (title, QA)       |
---------------------------------------------------------

rows: {id: chararray,age: int,title: chararray}


...

(title,QA)
(title,QA)
..
2013-09-02 16:40:52,454 [Thread-11] WARN
 org.apache.hadoop.mapred.LocalJobRunner - job_local_0003
java.lang.ClassCastException: java.lang.String cannot be cast to
org.apache.pig.data.Tuple
at com.megatome.pig.piggybank.tuple.ColumnBase.exec(ColumnBase.java:32)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-09-02 16:40:52,832 [main] INFO
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- HadoopJobId: job_local_0003



8-|

Regards

...


Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com



2013/9/2 Miguel Angel Martin junquera 

> hi all:
>
> More info :
>
> https://issues.apache.org/jira/browse/CASSANDRA-5941
>
>
>
> I tried this (and generated Cassandra 1.2.9), but it does not work for me:
>
> git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
> cd cassandra
> git checkout cassandra-1.2
> patch -p1 < 5867-bug-fix-filter-push-down-1.2-branch.txt
> ant
>
>
>
> Miguel Angel Martín Junquera
> Analyst Engineer.
> miguelangel.mar...@brainsins.com
>
>
>
> 2013/9/2 Miguel Angel Martin junquera 
>
>> Good/nice job!!!
>>
>> I'd been testing with a UDF handling only the string schema type; this is
>> better and more elaborate work.
>>
>> Regards
>>
>>
>> Miguel Angel Martín Junquera
>> Analyst Engineer.
>> miguelangel.mar...@brainsins.com
>>
>>
>>
>> 2013/8/31 Chad Johnston 
>>
>>> I threw together a quick UDF to work around this issue. It just extracts
>>> the value portion of the tuple while taking advantage of the CqlStorage
>>> generated schema to keep the type correct.
>>>
>>> You can get it her

Re: Does collection in CQL3 have certain limits?

2013-09-02 Thread Keith Wright
I know that the size is limited to max short (~32k) because when deserializing 
the response from the server, the first item returned is the number of items, 
and it's a short.  That being said, you could likely handle this by looking for 
the overflow and allowing double max short.

Vikas Goyal  wrote:



There are two ways to support wide rows in CQL3: one is to use composite 
keys and another is to use collections like Map, List and Set. The composite 
keys method can have millions of columns (transposed to rows), which is solving 
some of our use cases.

However, if we use collections, I want to know if there is a limit on the 
number/amount of data the collections can store (like earlier, where with Thrift, 
C* supports up to 2 billion columns in a row).


Thanks,

Vikas Goyal


Versioning in cassandra

2013-09-02 Thread dawood abdullah
Hi

I have a requirement of versioning to be done in Cassandra.

Following is my column family definition

create table file_details(id text primary key, fname text, version int,
mimetype text);

I have a secondary index created on fname column.

Whenever I do an insert for the same 'fname', the version should be
incremented. And when I retrieve a row with fname it should return me the
latest version row.

Is there a better way to do this in Cassandra? Please suggest what approach
needs to be taken.

Regards,

Dawood


Re: CqlStorage creates wrong schema for Pig

2013-09-02 Thread Miguel Angel Martin junquera
hi all:

More info :

https://issues.apache.org/jira/browse/CASSANDRA-5941



I tried this (and generated Cassandra 1.2.9), but it does not work for me:

git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
cd cassandra
git checkout cassandra-1.2
patch -p1 < 5867-bug-fix-filter-push-down-1.2-branch.txt
ant



Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com



2013/9/2 Miguel Angel Martin junquera 

> Good/nice job!!!
>
> I'd been testing with a UDF handling only the string schema type; this is
> better and more elaborate work.
>
> Regards
>
>
> Miguel Angel Martín Junquera
> Analyst Engineer.
> miguelangel.mar...@brainsins.com
>
>
>
> 2013/8/31 Chad Johnston 
>
>> I threw together a quick UDF to work around this issue. It just extracts
>> the value portion of the tuple while taking advantage of the CqlStorage
>> generated schema to keep the type correct.
>>
>> You can get it here: https://github.com/iamthechad/cqlstorage-udf
>>
>> I'll see if I can find more useful information and open a defect, since
>> that's what this seems to be.
>>
>> Chad
>>
>>
>> On Fri, Aug 30, 2013 at 2:02 AM, Miguel Angel Martin junquera <
>> mianmarjun.mailingl...@gmail.com> wrote:
>>
>>> I try this:
>>>
>>> rows = LOAD
>>> 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING
>>> CqlStorage();
>>>
>>> dump rows;
>>>
>>> ILLUSTRATE rows;
>>>
>>> describe rows;
>>>
>>> values2 = FOREACH rows GENERATE TOTUPLE (id) as
>>> (mycolumn:tuple(name,value));
>>>
>>> dump values2;
>>>
>>> describe values2;
>>>
>>> But I get these results:
>>>
>>> ---------------------------------------------------------
>>> | rows | id:chararray   | age:int   | title:chararray   |
>>> ---------------------------------------------------------
>>> |      | (id, 6)        | (age, 30) | (title, QA)       |
>>> ---------------------------------------------------------
>>>
>>> rows: {id: chararray,age: int,title: chararray}
>>> 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>> ERROR 1031: Incompatable field schema: left is
>>> "tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray))", right is
>>> "org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)"
>>>
>>>
>>>
>>>
>>>
>>> or
>>>
>>>
>>>
>>> 
>>>
>>> values2 = FOREACH rows GENERATE TOTUPLE (id);
>>> dump values2;
>>> describe values2;
>>>
>>>
>>>
>>>
>>> and  the results are:
>>>
>>>
>>> ...
>>> (((id,6)))
>>> (((id,5)))
>>> values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
>>>
>>>
>>>
>>> Aggg!
>>>
>>>
>>> *
>>> *
>>>
>>>
>>>
>>> Miguel Angel Martín Junquera
>>> Analyst Engineer.
>>> miguelangel.mar...@brainsins.com
>>>
>>>
>>>
>>> 2013/8/26 Miguel Angel Martin junquera >> >
>>>
 hi Chad .

 I have this issue

 I sent a mail to the user-pig list and I still cannot resolve this; I
 cannot access the column values.
 In that mail I wrote some things that I tried, without results, and
 information about this issue.



 http://mail-archives.apache.org/mod_mbox/pig-user/201308.mbox/%3ccajeg_hq9s2po3_xytzx5xki4j1mao8q26jydg2wndy_kyiv...@mail.gmail.com%3E



 I hope  someOne reply  one comment, idea or  solution about  this issue
 or bug.


 I have reviewed the CqlStorage class in the Cassandra 1.2.8 code, but I do
 not have the environment configured to debug and trace this issue.

 I only find some comments like the following, but I do not understand them at all.


 /**

  * A LoadStoreFunc for retrieving data from and storing data to
 Cassandra

  *

  * A row from a standard CF will be returned as nested tuples:

  * (((key1, value1), (key2, value2)), ((name1, val1), (name2, val2))).
  */


 If you find some idea or solution, please post it.

 thanks









 2013/8/23 Chad Johnston 

> (I'm using Cassandra 1.2.8 and Pig 0.11.1)
>
> I'm loading some simple data from Cassandra into Pig using CqlStorage.
> The CqlStorage loader defines a Pig schema based on the Cassandra schema,
> but it seems to be wrong.
>
> If I do:
>
> data = LOAD 'cql://bookdata/books' USING CqlStorage();
> DESCRIBE data;
>
> I get this:
>
> data: {isbn: chararray,bookauthor: chararray,booktitle:
> chararray,publisher: chararray,yearofpublication: int}
>
> However, if I DUMP data, I get results like these:
>
> ((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in
> the Stocks),(publisher,Berkley Pub Group),(yearofpublication,1986))
>
> Clearly the results from Cassandra are key/value pairs, as would be
> expected. I don't know why the schema generated by CqlStorage() would be 
> so
> different.
>
> This is really causing me problems trying to access the column values.

Re: how can I get the column value? Need help!.. Cassandra 1.2.8 and Pig 0.11.1

2013-09-02 Thread Miguel Angel Martin junquera
hi all:

More info :

https://issues.apache.org/jira/browse/CASSANDRA-5941



I tried this (and generated Cassandra 1.2.9), but it does not work for me:

git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
cd cassandra
git checkout cassandra-1.2
patch -p1 < 5867-bug-fix-filter-push-down-1.2-branch.txt
ant



Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com



2013/9/2 Miguel Angel Martin junquera 

> hi:
>
> I tested this in the new Cassandra 1.2.9 version and the issue still persists.
>
> :-(
>
>
>
>
> Miguel Angel Martín Junquera
> Analyst Engineer.
> miguelangel.mar...@brainsins.com
>
>
>
> 2013/8/30 Miguel Angel Martin junquera 
>
>> I try this:
>>
>> rows = LOAD
>> 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING
>> CqlStorage();
>>
>> dump rows;
>>
>> ILLUSTRATE rows;
>>
>> describe rows;
>>
>> values2 = FOREACH rows GENERATE TOTUPLE (id) as
>> (mycolumn:tuple(name,value));
>>
>> dump values2;
>>
>> describe values2;
>>
>> But I get these results:
>>
>> ---------------------------------------------------------
>> | rows | id:chararray   | age:int   | title:chararray   |
>> ---------------------------------------------------------
>> |      | (id, 6)        | (age, 30) | (title, QA)       |
>> ---------------------------------------------------------
>>
>> rows: {id: chararray,age: int,title: chararray}
>> 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>> ERROR 1031: Incompatable field schema: left is
>> "tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray))", right is
>> "org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)"
>>
>>
>>
>>
>>
>> or
>>
>>
>>
>> 
>>
>> values2 = FOREACH rows GENERATE TOTUPLE (id);
>> dump values2;
>> describe values2;
>>
>>
>>
>>
>> and  the results are:
>>
>>
>> ...
>> (((id,6)))
>> (((id,5)))
>> values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
>>
>>
>>
>> Aggg!
>>
>>
>> *
>> *
>>
>>
>>
>>
>> Miguel Angel Martín Junquera
>> Analyst Engineer.
>> miguelangel.mar...@brainsins.com
>>
>>
>>
>> 2013/8/28 Miguel Angel Martin junquera 
>>
>>> hi:
>>>
>>> I cannot understand why the schema is defined like
>>> "id:chararray,age:int,title:chararray"
>>> and is not defined as tuples or bags of tuples, if we have pairs of
>>> key-value columns.
>>>
>>> I tried again to change the schema but it does not work.
>>>
>>> Any ideas...
>>>
>>> Perhaps the issue is in the definition of the CQL3 tables?
>>>
>>> Regards
>>>
>>>
>>> 2013/8/28 Miguel Angel Martin junquera >> >
>>>
 hi all:


 Regards

 Still I cannot resolve this issue.

 Does anybody have this issue, or can anybody try to test this simple example?


 I am stumped; I cannot find a working solution.

 I appreciate any comment or help


 2013/8/22 Miguel Angel Martin junquera <
 mianmarjun.mailingl...@gmail.com>

> hi all:
>
>
>
>
> I'm testing the new CqlStorage() with Cassandra 1.2.8 and Pig 0.11.1
>
>
> I am using this sample data test:
>
>
> http://frommyworkshop.blogspot.com.es/2013/07/hadoop-map-reduce-with-cassandra.html
>
> And I load and dump data right with this script:
>
> *rows = LOAD
> 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' 
> USING
> CqlStorage();*
> *
> *
> *dump rows;*
> *describe rows;*
> *
> *
>
> results:
>
> ((id,6),(age,30),(title,QA))
>
> ((id,5),(age,30),(title,QA))
>
> rows: {id: chararray,age: int,title: chararray}
>
>
> *
>
>
> But I can not get the column values.
>
> I tried to define other schemas in LOAD like I used with
> cassandraStorage()
>
>
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-and-Pig-how-to-get-column-values-td5641158.html
>
>
> example:
>
> *rows = LOAD
> 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' 
> USING
> CqlStorage() AS (columns: bag {T: tuple(name, value)});*
>
>
> and I get this error:
>
> *2013-08-22 12:24:45,426 [main] ERROR
> org.apache.pig.tools.grunt.Grunt - ERROR 1031: Incompatable schema: left 
> is
> "columns:bag{T:tuple(name:bytearray,value:bytearray)}", right is
> "id:chararray,age:int,title:chararray"*
>
>
>
>
> I tried to use the FLATTEN, SUBSTRING, and SPLIT UDFs but I have not gotten a good
> result:
>
> Example:
>
>
>- when I flatten , I get a set of tuples like
>
> *(title,QA)*
>
> *(title,QA)*
>
> *2013-08-22 12:42:20,673 [main] INFO
>  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> input paths to process : 1*
>
> *A: {title: chararray}*
>>>

Cassandra SSTable Access Distribution ? How To ?

2013-09-02 Thread Girish Kumar
Hi,

I'm trying to get the SSTable access distribution for reads from the Cassandra
stress tool.  When I try to dump the cfhistograms output I don't see entries in the
SSTables column, meaning they all turn out to be zero.

Any idea what must be going wrong?  Please suggest how to dump the
histogram with the SSTable access distribution.

Thanks,
/BK
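One thing worth checking (an assumption about the cause, not a confirmed diagnosis):
cfhistograms is per-node, and the "recent" histograms it reports reset each time they
are read, so the SSTables column shows zeros unless reads hit that node since the
last dump. Run the read workload, then immediately:

nodetool -h <node> cfhistograms <keyspace> <column_family>   # names are placeholders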


Does collection in CQL3 have certain limits?

2013-09-02 Thread Vikas Goyal
There are two ways to support wide rows in CQL3: one is to use composite
keys and another is to use collections like Map, List and Set. The
composite keys method can have millions of columns (transposed to rows),
which is solving some of our use cases.

However, if we use collections, I want to know if there is a limit on the
number/amount of data the collections can store (like earlier, where with
Thrift, C* supports up to 2 billion columns in a row).


Thanks,

Vikas Goyal
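To make the contrast discussed in this thread concrete, a sketch with hypothetical
tables: the same data held as a collection versus as clustering columns (the CQL3
equivalent of the old-school wide row, sliceable and effectively unbounded):

cqlsh <<'EOF'
-- Collection version: the whole map is size-limited and read as one unit.
CREATE TABLE user_events_map (
    user_id text PRIMARY KEY,
    events map<timestamp, text>
);

-- Wide-row version: one CQL row per entry, millions of entries per partition.
CREATE TABLE user_events (
    user_id text,
    event_time timestamp,
    payload text,
    PRIMARY KEY (user_id, event_time)
);
EOF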


Re: CqlStorage creates wrong schema for Pig

2013-09-02 Thread Miguel Angel Martin junquera
Good/nice job!!!

I'd been testing with a UDF handling only the string schema type; this is better and
more elaborate work.

Regards


Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com



2013/8/31 Chad Johnston 

> I threw together a quick UDF to work around this issue. It just extracts
> the value portion of the tuple while taking advantage of the CqlStorage
> generated schema to keep the type correct.
>
> You can get it here: https://github.com/iamthechad/cqlstorage-udf
>
> I'll see if I can find more useful information and open a defect, since
> that's what this seems to be.
>
> Chad
>
>
> On Fri, Aug 30, 2013 at 2:02 AM, Miguel Angel Martin junquera <
> mianmarjun.mailingl...@gmail.com> wrote:
>
>> I try this:
>>
>> rows = LOAD
>> 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING
>> CqlStorage();
>>
>> dump rows;
>>
>> ILLUSTRATE rows;
>>
>> describe rows;
>>
>> values2 = FOREACH rows GENERATE TOTUPLE (id) as
>> (mycolumn:tuple(name,value));
>>
>> dump values2;
>>
>> describe values2;
>>
>> But I get these results:
>>
>> ---------------------------------------------------------
>> | rows | id:chararray   | age:int   | title:chararray   |
>> ---------------------------------------------------------
>> |      | (id, 6)        | (age, 30) | (title, QA)       |
>> ---------------------------------------------------------
>>
>> rows: {id: chararray,age: int,title: chararray}
>> 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>> ERROR 1031: Incompatable field schema: left is
>> "tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray))", right is
>> "org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)"
>>
>>
>>
>>
>>
>> or
>>
>>
>>
>> 
>>
>> values2 = FOREACH rows GENERATE TOTUPLE (id);
>> dump values2;
>> describe values2;
>>
>>
>>
>>
>> and  the results are:
>>
>>
>> ...
>> (((id,6)))
>> (((id,5)))
>> values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
>>
>>
>>
>> Aggg!
>>
>>
>> *
>> *
>>
>>
>>
>> Miguel Angel Martín Junquera
>> Analyst Engineer.
>> miguelangel.mar...@brainsins.com
>>
>>
>>
>> 2013/8/26 Miguel Angel Martin junquera 
>>
>>> hi Chad .
>>>
>>> I have this issue
>>>
>>> I sent a mail to the user-pig list and I still cannot resolve this; I
>>> cannot access the column values.
>>> In that mail I wrote some things that I tried, without results, and
>>> information about this issue.
>>>
>>>
>>>
>>> http://mail-archives.apache.org/mod_mbox/pig-user/201308.mbox/%3ccajeg_hq9s2po3_xytzx5xki4j1mao8q26jydg2wndy_kyiv...@mail.gmail.com%3E
>>>
>>>
>>>
>>> I hope  someOne reply  one comment, idea or  solution about  this issue
>>> or bug.
>>>
>>>
>>> I have reviewed the CqlStorage class in the Cassandra 1.2.8 code, but I do
>>> not have the environment configured to debug and trace this issue.
>>>
>>> I only find some comments like the following, but I do not understand them at all.
>>>
>>>
>>> /**
>>>
>>>  * A LoadStoreFunc for retrieving data from and storing data to
>>> Cassandra
>>>
>>>  *
>>>
>>>  * A row from a standard CF will be returned as nested tuples:
>>>
>>>  * (((key1, value1), (key2, value2)), ((name1, val1), (name2, val2))).
>>>  */
>>>
>>>
>>> If you find some idea or solution, please post it.
>>>
>>> thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> 2013/8/23 Chad Johnston 
>>>
 (I'm using Cassandra 1.2.8 and Pig 0.11.1)

 I'm loading some simple data from Cassandra into Pig using CqlStorage.
 The CqlStorage loader defines a Pig schema based on the Cassandra schema,
 but it seems to be wrong.

 If I do:

 data = LOAD 'cql://bookdata/books' USING CqlStorage();
 DESCRIBE data;

 I get this:

 data: {isbn: chararray,bookauthor: chararray,booktitle:
 chararray,publisher: chararray,yearofpublication: int}

 However, if I DUMP data, I get results like these:

 ((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in the
 Stocks),(publisher,Berkley Pub Group),(yearofpublication,1986))

 Clearly the results from Cassandra are key/value pairs, as would be
 expected. I don't know why the schema generated by CqlStorage() would be so
 different.

 This is really causing me problems trying to access the column values.
 I tried a naive approach of FLATTENing each tuple, then trying to access
 the values that way:

 flattened = FOREACH data GENERATE
   FLATTEN(isbn),
   FLATTEN(booktitle),
   ...
 values = FOREACH flattened GENERATE
   $1 AS ISBN,
   $3 AS BookTitle,
   ...

 As soon as I try to access field $5, Pig complains about the index
 being out of bounds.

 Is there a way to solve the schema/reality mismatch? Am I doing
 something wrong, or have I stumbled across a defect?

 Thanks,
 Chad

>>>
>>>
>>
>


Re: how can I get the column value? Need help!.. Cassandra 1.2.8 and Pig 0.11.1

2013-09-02 Thread Miguel Angel Martin junquera
hi:

I tested this in the new Cassandra 1.2.9 version and the issue still persists.

:-(




Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com



2013/8/30 Miguel Angel Martin junquera 

> I try this:
>
> rows = LOAD
> 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING
> CqlStorage();
>
> dump rows;
>
> ILLUSTRATE rows;
>
> describe rows;
>
> values2 = FOREACH rows GENERATE TOTUPLE (id) as
> (mycolumn:tuple(name,value));
>
> dump values2;
>
> describe values2;
>
> But I get these results:
>
> ---------------------------------------------------------
> | rows | id:chararray   | age:int   | title:chararray   |
> ---------------------------------------------------------
> |      | (id, 6)        | (age, 30) | (title, QA)       |
> ---------------------------------------------------------
>
> rows: {id: chararray,age: int,title: chararray}
> 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1031: Incompatable field schema: left is
> "tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray))", right is
> "org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)"
>
>
>
>
>
> or
>
>
>
> 
>
> values2 = FOREACH rows GENERATE TOTUPLE (id);
> dump values2;
> describe values2;
>
>
>
>
> and  the results are:
>
>
> ...
> (((id,6)))
> (((id,5)))
> values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
>
>
>
> Aggg!
>
>
> *
> *
>
>
>
>
> Miguel Angel Martín Junquera
> Analyst Engineer.
> miguelangel.mar...@brainsins.com
>
>
>
> 2013/8/28 Miguel Angel Martin junquera 
>
>> hi:
>>
>> I cannot understand why the schema is defined like
>> "id:chararray,age:int,title:chararray"
>> and is not defined as tuples or bags of tuples, if we have pairs of
>> key-value columns.
>>
>> I tried again to change the schema but it does not work.
>>
>> Any ideas...
>>
>> Perhaps the issue is in the definition of the CQL3 tables?
>>
>> Regards
>>
>>
>> 2013/8/28 Miguel Angel Martin junquera 
>>
>>> hi all:
>>>
>>>
>>> Regards
>>>
>>> Still I cannot resolve this issue.
>>>
>>> Does anybody have this issue, or can anybody try to test this simple example?
>>>
>>>
>>> I am stumped; I cannot find a working solution.
>>>
>>> I appreciate any comment or help
>>>
>>>
>>> 2013/8/22 Miguel Angel Martin junquera >> >
>>>
 hi all:




 I'm testing the new CqlStorage() with Cassandra 1.2.8 and Pig 0.11.1


 I am using this sample data test:


 http://frommyworkshop.blogspot.com.es/2013/07/hadoop-map-reduce-with-cassandra.html

 And I load and dump data right with this script:

 *rows = LOAD
 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING
 CqlStorage();*
 *
 *
 *dump rows;*
 *describe rows;*
 *
 *

 results:

 ((id,6),(age,30),(title,QA))

 ((id,5),(age,30),(title,QA))

 rows: {id: chararray,age: int,title: chararray}


 *


 But I can not get the column values.

 I tried to define other schemas in LOAD like I used with
 CassandraStorage():


 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-and-Pig-how-to-get-column-values-td5641158.html


 example:

 *rows = LOAD
 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING
 CqlStorage() AS (columns: bag {T: tuple(name, value)});*


 and I get this error:

 *2013-08-22 12:24:45,426 [main] ERROR org.apache.pig.tools.grunt.Grunt
 - ERROR 1031: Incompatable schema: left is
 "columns:bag{T:tuple(name:bytearray,value:bytearray)}", right is
 "id:chararray,age:int,title:chararray"*




 I tried to use the FLATTEN, SUBSTRING, and SPLIT UDFs but I have not gotten good
 results:

 Example:


- when I flatten , I get a set of tuples like

 *(title,QA)*

 *(title,QA)*

 *2013-08-22 12:42:20,673 [main] INFO
  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
 input paths to process : 1*

 *A: {title: chararray}*



 but I can not get the value QA

 Substring only works with "title"



 example:

 *B = FOREACH A GENERATE SUBSTRING(title,2,5);*
 *
 *
 *dump B;*
 *describe B;*
 *
 *
 *
 *

 *results:*
 *
 *

 *(tle)*
 *(tle)*
 *B: {chararray}*




 I tried this, like Eric Lee in the other mail, and have the same results:


  Anyways, what I really want is the column value, not the name. Is
 there a way to do that? I listed all of the failed attempts I made below.

- colnames = FOREACH cols GENERATE $1 and was told $1 was out of
bounds.
- casted = FOREACH cols GENERATE (

Re: successful use of "shuffle"?

2013-09-02 Thread Richard Low
On 30 August 2013 18:42, Jeremiah D Jordan wrote:

You need to introduce the new "vnode enabled" nodes in a new DC.  Or you
> will have similar issues to
> https://issues.apache.org/jira/browse/CASSANDRA-5525
>
> Add vnode DC:
>
> http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/operations/ops_add_dc_to_cluster_t.html
>
> Point clients to new DC
>
> Remove non vnode DC:
>
> http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/operations/ops_decomission_dc_t.html
>

This is a good workaround if you have the hardware to temporarily run a
cluster that's double the size.  If you don't, then I think shuffle is the
only option, but it is known to have issues.

Richard.
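For reference, the rough shape of that add-a-DC path (a sketch following the linked
docs; the DC name and token count are illustrative):

# New nodes' cassandra.yaml: num_tokens: 256 (no initial_token), the new DC name
# in the snitch configuration, and auto_bootstrap: false.
# After starting them and updating the keyspace replication to include the new DC,
# stream the data on each new node from the old DC:
nodetool rebuild -- <old-dc-name>
# Then repoint clients at the new DC and decommission the old DC's nodes.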