Re: How can I add blank values instead of null values in cassandra ?

2019-09-10 Thread Swen Moczarski
When using prepared statements, you could use "unset":
https://github.com/datastax/java-driver/blob/4.x/manual/core/statements/prepared/README.md#unset-values


That should solve the tombstone problem but might need code changes.
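
A minimal sketch of that code change, assuming the Java driver 4.x API from the manual linked above; the keyspace, table and column names here are made up:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.BoundStatementBuilder;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

public class UnsetExample {
    public static void insertUser(CqlSession session, String id, String name, String email) {
        PreparedStatement ps = session.prepare(
            "INSERT INTO my_ks.users (id, name, email) VALUES (?, ?, ?)");
        BoundStatementBuilder bound = ps.boundStatementBuilder()
            .setString("id", id)
            .setString("name", name);
        // Bind the column only when there is a value: a value left unset is simply
        // not written, so no null and therefore no tombstone is created.
        if (email != null) {
            bound = bound.setString("email", email);
        }
        session.execute(bound.build());
    }
}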

Regards,
Swen

On Tue, 10 Sept 2019 at 04:50, Nitan Kainth <nitankai...@gmail.com> wrote:

> You can set default values in the driver, but that also requires a little code change.
>
>
> Regards,
>
> Nitan
>
> Cell: 510 449 9629
>
> On Sep 9, 2019, at 8:15 PM, buchi adddagada  wrote:
>
> We are using DSE 5.1.0 & Spring Boot (Java).
>
> While we are trying to insert data into Cassandra, the Java code by default
> inserts null values into Cassandra tables, which is causing huge numbers of
> tombstones.
>
> Instead of changing the Java code to avoid inserting null values, can this
> be controlled anywhere at the driver level?
>
>
> Thanks,
>
> Buchi Babu
>
>


Re: How can I add blank values instead of null values in cassandra ?

2019-09-09 Thread Nitan Kainth
You can set default values in the driver, but that also requires a little code change.


Regards,
Nitan
Cell: 510 449 9629

> On Sep 9, 2019, at 8:15 PM, buchi adddagada  wrote:
> 
> We are using DSE 5.1.0 & Spring Boot (Java).
> 
> While we are trying to insert data into Cassandra, the Java code by default
> inserts null values into Cassandra tables, which is causing huge numbers of
> tombstones.
> 
> Instead of changing the Java code to avoid inserting null values, can this
> be controlled anywhere at the driver level?
> 
> 
> 
> Thanks,
> 
> Buchi Babu


How can I add blank values instead of null values in cassandra ?

2019-09-09 Thread buchi adddagada
We are using DSE 5.1.0 & Spring Boot (Java).

While we are trying to insert data into Cassandra, the Java code by default
inserts null values into Cassandra tables, which is causing huge numbers of
tombstones.

Instead of changing the Java code to avoid inserting null values, can this be
controlled anywhere at the driver level?


Thanks,

Buchi Babu


Re: How can I check cassandra cluster has a real working function of high availability?

2019-06-28 Thread Nimbus Lin
To Sir Oleksandr :




Thank you!

Sincerely
Nimbuslin(Lin JiaXin)
Mobile: 0086 180 5986 1565
Mail: jiaxin...@live.com



From: Oleksandr Shulgin 
Sent: Monday, June 17, 2019 7:19 AM
To: User
Subject: Re: How can I check cassandra cluster has a real working function of 
high availability?

On Sat, Jun 15, 2019 at 4:31 PM Nimbus Lin <jiaxin...@live.com> wrote:
Dear Cassandra pioneers:
    I have been a newbie for 5 years; only now do I have time to use Cassandra.
But I can't verify Cassandra's high availability when I stop a seed node or a
non-seed data node, the way I can with CGE or Greenplum.
    Could someone tell me how to check Cassandra's high availability? Even when
I change the consistency level from ONE to LOCAL_ONE, cqlsh's SELECT always
returns a NoHostAvailable error.

    By the way, would you mind answering two other questions:
2nd question: although Cassandra's consistency is a per-operation setting, is
there a way to set the consistency level for all operations at once?
3rd question: how can I view the cluster's runtime settings, like MySQL's SHOW
GLOBAL VARIABLES, for example the hidden auto_bootstrap setting?

Hi,

For the purpose of serving client requests, all nodes are equal -- seed or not. 
 So it shouldn't matter which node you are stopping (or making it unavailable 
for the rest of the cluster using other means).

In order to test it with cqlsh you should ensure that the replication factor
of the keyspace you're testing with is sufficient.  Given the NoHostAvailable
exception that you are experiencing at consistency level ONE (or LOCAL_ONE), I
can guess that you are testing with a keyspace with replication factor 1, and
the node which is unavailable happens to be responsible for the particular
partition.

For your second question: it depends on the client (or "client driver") you are
using.  In cqlsh you can set the consistency level that will be applied to all
subsequent queries using the "CONSISTENCY ..." command.  I think that the Java
driver has an option to set the default consistency level, as well as an
option to set the consistency level per query.  Most likely this is also true
for Python and other drivers.

And for the third question: I'm not aware of a CQL or nodetool command that 
would fulfill the need.  Most likely it is possible to learn (and update) most 
of the configuration parameters using JMX, e.g. with JConsole: 
https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsMonitoring.html#opsMonitoringJconsole

Cheers,
--
Alex





Re: How can I check cassandra cluster has a real working function of high availability?

2019-06-17 Thread Oleksandr Shulgin
On Sat, Jun 15, 2019 at 4:31 PM Nimbus Lin  wrote:

> Dear Cassandra pioneers:
>     I have been a newbie for 5 years; only now do I have time to use
> Cassandra. But I can't verify Cassandra's high availability when I stop a
> seed node or a non-seed data node, the way I can with CGE or Greenplum.
>     Could someone tell me how to check Cassandra's high availability? Even
> when I change the consistency level from ONE to LOCAL_ONE, cqlsh's SELECT
> always returns a NoHostAvailable error.
>
>  By the way, would you mind answering two other questions:
> 2nd question: although Cassandra's consistency is a per-operation setting,
> is there a way to set the consistency level for all operations at once?
> 3rd question: how can I view the cluster's runtime settings, like MySQL's
> SHOW GLOBAL VARIABLES, for example the hidden auto_bootstrap setting?
>

Hi,

For the purpose of serving client requests, all nodes are equal -- seed or
not.  So it shouldn't matter which node you are stopping (or making it
unavailable for the rest of the cluster using other means).

In order to test it with cqlsh you should ensure that the replication factor
of the keyspace you're testing with is sufficient.  Given the NoHostAvailable
exception that you are experiencing at consistency level ONE (or LOCAL_ONE), I
can guess that you are testing with a keyspace with replication factor 1, and
the node which is unavailable happens to be responsible for the particular
partition.

For your second question: it depends on the client (or "client driver") you are
using.  In cqlsh you can set the consistency level that will be applied to all
subsequent queries using the "CONSISTENCY ..." command.  I think that the Java
driver has an option to set the default consistency level, as well as an
option to set the consistency level per query.  Most likely this is also true
for Python and other drivers.
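
A minimal sketch of both options with the Java driver 3.x; the contact point, keyspace and table are placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ConsistencyExample {
    public static void main(String[] args) {
        // Default consistency level for every request made through this Cluster.
        Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")
            .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
            .build();
        Session session = cluster.connect();

        // Per-query override (cqlsh's "CONSISTENCY ..." sets the same thing for the whole cqlsh session).
        SimpleStatement stmt = new SimpleStatement("SELECT * FROM my_ks.my_table WHERE id = 42");
        stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
        session.execute(stmt);

        cluster.close();
    }
}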

And for the third question: I'm not aware of a CQL or nodetool command that
would fulfill the need.  Most likely it is possible to learn (and update)
most of the configuration parameters using JMX, e.g. with JConsole:
https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsMonitoring.html#opsMonitoringJconsole

Cheers,
--
Alex


How can I check cassandra cluster has a real working function of high availability?

2019-06-15 Thread Nimbus Lin
Dear Cassandra pioneers:
    I have been a newbie for 5 years; only now do I have time to use Cassandra.
But I can't verify Cassandra's high availability when I stop a seed node or a
non-seed data node, the way I can with CGE or Greenplum.
    Could someone tell me how to check Cassandra's high availability? Even when
I change the consistency level from ONE to LOCAL_ONE, cqlsh's SELECT always
returns a NoHostAvailable error.

    By the way, would you mind answering two other questions:
2nd question: although Cassandra's consistency is a per-operation setting, is
there a way to set the consistency level for all operations at once?
3rd question: how can I view the cluster's runtime settings, like MySQL's SHOW
GLOBAL VARIABLES, for example the hidden auto_bootstrap setting?

Thank you!

Sincerely
Nimbuslin(Lin JiaXin)
Mobile: 0086 180 5986 1565
Mail: jiaxin...@live.com




Re: Can I cancel a decommissioning procedure??

2019-06-05 Thread Alain RODRIGUEZ
Sure, you're welcome, glad to hear it worked! =)

Thanks for letting us know/reporting this back here, it might matter for
other people as well.

C*heers!
Alain


On Wed, 5 June 2019 at 07:45, William R wrote:

> Eventually after the reboot the decommission was cancelled. Thanks a lot
> for the info!
>
> Cheers
>
>
> Sent with ProtonMail  Secure Email.
>
> ‐‐‐ Original Message ‐‐‐
> On Tuesday, June 4, 2019 10:59 PM, Alain RODRIGUEZ 
> wrote:
>
> > the issue is that the rest of the nodes in the cluster marked it as DL
> (DOWN/LEAVING), that's why I am kinda stressed.. Let's see once it's up!
>
> The last information other nodes had is that this node is leaving, and
> down, that's expected in this situation. When the node comes back online,
> it should come back UN and 'quickly' other nodes should ACK it.
>
> During decommission, the node itself is responsible for streaming its data
> over. Streams were stopped as the node went down, and Cassandra won't remove
> the node unless data was streamed properly (or if you force the node out).
> I don't think that there is a decommission 'resume', and even less that it
> is enabled by default.
> Thus when the node comes back, the only possible option I see is a
> 'regular' start for that node and for the others to acknowledge that the
> node is up and not leaving anymore.
>
> The only consequence I expect (other than the node missing the latest
> data) is that other nodes might have some extra data due to the
> decommission attempts. If that's needed (streaming for long or no TTL), you
> can consider using 'nodetool cleanup -j 2' on all nodes other than the
> one that went down, to remove the extra data (and free space).
>
>  I did restart, still waiting to come up (normally takes ~ 30 minutes)
>>
>
> 30 minutes to start the nodes sounds like a long time to me, but well,
> that's another topic.
>
> C*heers
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Tue, 4 June 2019 at 18:31, William R wrote:
>
>> Hi Alain,
>>
>> Thank you for your comforting reply :)  I did restart, still waiting for it
>> to come up (normally takes ~30 minutes). The issue is that the rest of the
>> nodes in the cluster marked it as DL (DOWN/LEAVING), that's why I am kinda
>> stressed.. Let's see once it's up!
>>
>>
>> Sent with ProtonMail  Secure Email.
>>
>> ‐‐‐ Original Message ‐‐‐
>> On Tuesday, June 4, 2019 7:25 PM, Alain RODRIGUEZ 
>> wrote:
>>
>> Hello William,
>>
>> At the moment we are keeping the node down until we figure out a way to
>>> cancel it.
>>>
>>
>> Off the top of my head, a restart of the node is the way to go to cancel
>> a decommission.
>> I think you did the right thing and your safety measure is also the fix
>> here :).
>>
>> Did you try to bring it up again?
>>
>> If it's really critical, you can probably test that quickly with ccm (
>> https://github.com/riptano/ccm), tlp-cluster (
>> https://github.com/thelastpickle/tlp-cluster) or simply with any
>> existing dev/test environment if you have any available with some data.
>>
>> Good luck with that, PEBKAC issues are the worst. You can do a lot of
>> damage, it could always have been avoided, and it makes you feel terrible.
>> It doesn't sound that bad in your case though, I've seen (and done)
>> worse  ¯\_(ツ)_/¯. It's hard to fight PEBKACs, we, operators, are
>> unpredictable :).
>> Nonetheless, and to go back to something more serious, there are ways to
>> limit the amount and possible scope of those, such as good practices,
>> testing and automation.
>>
>> C*heers,
>> ---
>> Alain Rodriguez - al...@thelastpickle.com
>> France / Spain
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>>
>>
>> On Tue, 4 June 2019 at 17:55, William R wrote:
>>
>>> Hi,
>>>
>>> There was an accidental decommissioning of a node and we really need to
>>> cancel it.. is there any way? At the moment we are keeping the node down
>>> until we figure out a way to cancel it.
>>>
>>> Thanks
>>>
>>
>>
>


Re: Can I cancel a decommissioning procedure??

2019-06-05 Thread William R
Eventually after the reboot the decommission was cancelled. Thanks a lot for 
the info!

Cheers

Sent with [ProtonMail](https://protonmail.com) Secure Email.

‐‐‐ Original Message ‐‐‐
On Tuesday, June 4, 2019 10:59 PM, Alain RODRIGUEZ  wrote:

>> the issue is that the rest of the nodes in the cluster marked it as DL
>> (DOWN/LEAVING), that's why I am kinda stressed.. Let's see once it's up!
>
> The last information other nodes had is that this node is leaving, and down, 
> that's expected in this situation. When the node comes back online, it should 
> come back UN and 'quickly' other nodes should ACK it.
>
> During decommission, the node itself is responsible for streaming its data
> over. Streams were stopped as the node went down, and Cassandra won't remove
> the node unless data was streamed properly (or if you force the node out). I
> don't think that there is a decommission 'resume', and even less that it is
> enabled by default.
> Thus when the node comes back, the only possible option I see is a 'regular'
> start for that node and for the others to acknowledge that the node is up and
> not leaving anymore.
>
> The only consequence I expect (other than the node missing the latest data)
> is that other nodes might have some extra data due to the decommission
> attempts. If that's needed (streaming for long or no TTL), you can consider
> using 'nodetool cleanup -j 2' on all nodes other than the one that went
> down, to remove the extra data (and free space).
>
>>  I did restart, still waiting to come up (normally takes ~ 30 minutes)
>
> 30 minutes to start the nodes sounds like a long time to me, but well, that's 
> another topic.
>
> C*heers
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Tue, 4 June 2019 at 18:31, William R wrote:
>
>> Hi Alain,
>>
>> Thank you for your comforting reply :)  I did restart, still waiting for it
>> to come up (normally takes ~30 minutes). The issue is that the rest of the
>> nodes in the cluster marked it as DL (DOWN/LEAVING), that's why I am kinda
>> stressed.. Let's see once it's up!
>>
>> Sent with [ProtonMail](https://protonmail.com) Secure Email.
>>
>> ‐‐‐ Original Message ‐‐‐
>> On Tuesday, June 4, 2019 7:25 PM, Alain RODRIGUEZ  wrote:
>>
>>> Hello William,
>>>
 At the moment we are keeping the node down until we figure out a way to cancel it.
>>>
>>> Off the top of my head, a restart of the node is the way to go to cancel a 
>>> decommission.
>>> I think you did the right thing and your safety measure is also the fix 
>>> here :).
>>>
>>> Did you try to bring it up again?
>>>
>>> If it's really critical, you can probably test that quickly with ccm 
>>> (https://github.com/riptano/ccm), tlp-cluster 
>>> (https://github.com/thelastpickle/tlp-cluster) or simply with any existing 
>>> dev/test environment if you have any available with some data.
>>>
>>> Good luck with that, PEBKAC issues are the worst. You can do a lot of
>>> damage, it could always have been avoided, and it makes you feel terrible.
>>> It doesn't sound that bad in your case though, I've seen (and done) worse
>>> ¯\_(ツ)_/¯. It's hard to fight PEBKACs, we, operators, are unpredictable :).
>>> Nonetheless, and to go back to something more serious, there are ways to
>>> limit the amount and possible scope of those, such as good practices,
>>> testing and automation.
>>>
>>> C*heers,
>>> ---
>>> Alain Rodriguez - al...@thelastpickle.com
>>> France / Spain
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>> On Tue, 4 June 2019 at 17:55, William R wrote:
>>>
 Hi,

 There was an accidental decommissioning of a node and we really need to
 cancel it.. is there any way? At the moment we are keeping the node down
 until we figure out a way to cancel it.

 Thanks

Re: Can I cancel a decommissioning procedure??

2019-06-04 Thread William R
Hi Alain,

Thank you for your comforting reply :)  I did restart, still waiting for it to
come up (normally takes ~30 minutes). The issue is that the rest of the nodes
in the cluster marked it as DL (DOWN/LEAVING), that's why I am kinda stressed..
Let's see once it's up!

Sent with [ProtonMail](https://protonmail.com) Secure Email.

‐‐‐ Original Message ‐‐‐
On Tuesday, June 4, 2019 7:25 PM, Alain RODRIGUEZ  wrote:

> Hello William,
>
>> At the moment we are keeping the node down until we figure out a way to cancel it.
>
> Off the top of my head, a restart of the node is the way to go to cancel a 
> decommission.
> I think you did the right thing and your safety measure is also the fix here 
> :).
>
> Did you try to bring it up again?
>
> If it's really critical, you can probably test that quickly with ccm 
> (https://github.com/riptano/ccm), tlp-cluster 
> (https://github.com/thelastpickle/tlp-cluster) or simply with any existing 
> dev/test environment if you have any available with some data.
>
> Good luck with that, PEBKAC issues are the worst. You can do a lot of
> damage, it could always have been avoided, and it makes you feel terrible.
> It doesn't sound that bad in your case though, I've seen (and done) worse
> ¯\_(ツ)_/¯. It's hard to fight PEBKACs, we, operators, are unpredictable :).
> Nonetheless, and to go back to something more serious, there are ways to
> limit the amount and possible scope of those, such as good practices, testing
> and automation.
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Tue, 4 June 2019 at 17:55, William R wrote:
>
>> Hi,
>>
>> There was an accidental decommissioning of a node and we really need to
>> cancel it.. is there any way? At the moment we are keeping the node down
>> until we figure out a way to cancel it.
>>
>> Thanks

Re: Can I cancel a decommissioning procedure??

2019-06-04 Thread Alain RODRIGUEZ
Hello William,

At the moment we are keeping the node down until we figure out a way to cancel it.
>

Off the top of my head, a restart of the node is the way to go to cancel a
decommission.
I think you did the right thing and your safety measure is also the fix
here :).

Did you try to bring it up again?

If it's really critical, you can probably test that quickly with ccm (
https://github.com/riptano/ccm), tlp-cluster (
https://github.com/thelastpickle/tlp-cluster) or simply with any existing
dev/test environment if you have any available with some data.

Good luck with that, PEBKAC issues are the worst. You can do a lot of
damage, it could always have been avoided, and it makes you feel terrible.
It doesn't sound that bad in your case though, I've seen (and done) worse
¯\_(ツ)_/¯. It's hard to fight PEBKACs, we, operators, are unpredictable :).
Nonetheless, and to go back to something more serious, there are ways to
limit the amount and possible scope of those, such as good practices,
testing and automation.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



On Tue, 4 June 2019 at 17:55, William R wrote:

> Hi,
>
> There was an accidental decommissioning of a node and we really need to
> cancel it.. is there any way? At the moment we are keeping the node down
> until we figure out a way to cancel it.
>
> Thanks
>


Can I cancel a decommissioning procedure??

2019-06-04 Thread William R
Hi,

There was an accidental decommissioning of a node and we really need to cancel
it.. is there any way? At the moment we are keeping the node down until we
figure out a way to cancel it.

Thanks

can i delete a sstable with Estimated droppable tombstones > 1, manually?

2019-03-19 Thread onmstester onmstester
Running:
sstablemetadata /THE_KEYSPACE_DIR/mc-1421-big-Data.db



result was:

Estimated droppable tombstones: 1.2



Having STCS and data disk usage of 80% (we do not have enough free space for a
normal compaction), is it OK to just: 1. stop Cassandra, 2. delete mc-1421*,
and then 3. start Cassandra?
Sent using https://www.zoho.com/mail/

Re: can i...

2019-03-07 Thread Nick Hatfield
-big-Data.db
Max: 12/03/2018 Min: 12/02/2018 Estimated droppable tombstones: 0.8853087223736261  12G Mar 5 15:16 mc-231551-big-Data.db
Max: 12/04/2018 Min: 12/03/2018 Estimated droppable tombstones: 0.8803649205154546  12G Mar 5 10:06 mc-231309-big-Data.db
Max: 12/05/2018 Min: 12/04/2018 Estimated droppable tombstones: 0.8824805012470633  12G Mar 5 10:33 mc-231334-big-Data.db
Max: 12/06/2018 Min: 12/05/2018 Estimated droppable tombstones: 0.7605552563033167  4.1G Mar 5 08:12 mc-231253-big-Data.db
Max: 12/07/2018 Min: 12/06/2018 Estimated droppable tombstones: 0.7748787995519647  3.9G Mar 5 10:55 mc-231386-big-Data.db
Max: 12/08/2018 Min: 12/07/2018 Estimated droppable tombstones: 0.7998981602050579  4.1G Mar 5 08:37 mc-231275-big-Data.db
Max: 12/09/2018 Min: 12/08/2018 Estimated droppable tombstones: 0.8047662079574316  4.5G Mar 5 03:35 mc-231043-big-Data.db
Max: 12/10/2018 Min: 12/09/2018 Estimated droppable tombstones: 0.7987046261073453  4.8G Mar 4 23:36 mc-230870-big-Data.db
Max: 12/11/2018 Min: 12/10/2018 Estimated droppable tombstones: 0.8346316850246404  5.6G Mar 5 13:10 mc-231478-big-Data.db
Max: 12/12/2018 Min: 12/11/2018 Estimated droppable tombstones: 0.8336216107728608  6.1G Mar 5 00:06 mc-230888-big-Data.db
Max: 12/13/2018 Min: 12/12/2018 Estimated droppable tombstones: 0.8566337089121  7.2G Mar 5 02:46 mc-230993-big-Data.db
Max: 12/14/2018 Min: 12/13/2018 Estimated droppable tombstones: 0.8137644691768783  4.7G Mar 5 10:32 mc-231358-big-Data.db
Max: 12/15/2018 Min: 12/14/2018 Estimated droppable tombstones: 0.8166609937509232  4.6G Mar 5 13:59 mc-231525-big-Data.db
Max: 12/16/2018 Min: 12/15/2018 Estimated droppable tombstones: 0.8085604043211527  4.8G Mar 5 05:00 mc-231110-big-Data.db
Max: 12/17/2018 Min: 12/16/2018 Estimated droppable tombstones: 0.8124008277006111  5.0G Mar 4 20:34 mc-230739-big-Data.db
Max: 12/18/2018 Min: 12/17/2018 Estimated droppable tombstones: 0.8197544452946743  5.0G Mar 5 12:03 mc-231430-big-Data.db
Max: 12/19/2018 Min: 12/18/2018 Estimated droppable tombstones: 0.7604684134873694  5.7G Mar 4 21:08 mc-230768-big-Data.db
Max: 12/20/2018 Min: 12/19/2018 Estimated droppable tombstones: 0.6276716162431576  6.8G Mar 4 22:39 mc-230832-big-Data.db
Max: 12/21/2018 Min: 12/20/2018 Estimated droppable tombstones: 0.6262830796548643  6.9G Mar 4 21:23 mc-230778-big-Data.db
Max: 12/22/2018 Min: 12/21/2018 Estimated droppable tombstones: 0.6245678218315354  6.7G Mar 5 09:22 mc-231304-big-Data.db
Max: 12/23/2018 Min: 12/22/2018 Estimated droppable tombstones: 0.6339901894339154  6.7G Mar 5 00:06 mc-230897-big-Data.db
Max: 12/24/2018 Min: 12/23/2018 Estimated droppable tombstones: 0.6401085489180292  6.8G Mar 5 00:17 mc-230901-big-Data.db
Max: 12/25/2018 Min: 12/24/2018 Estimated droppable tombstones: 0.648027924752315  6.9G Mar 4 22:04 mc-230809-big-Data.db
Max: 12/26/2018 Min: 12/25/2018 Estimated droppable tombstones: 0.6465660696516876  7.0G Mar 4 23:16 mc-230856-big-Data.db
Max: 12/27/2018 Min: 12/26/2018 Estimated droppable tombstones: 0.5464676457788102  5.9G Mar 5 08:46 mc-231285-big-Data.db
Max: 12/28/2018 Min: 12/27/2018 Estimated droppable tombstones: 0.5556336150105652  5.8G Mar 5 09:03 mc-231298-big-Data.db
Max: 12/29/2018 Min: 12/28/2018 Estimated droppable tombstones: 0.5884672237873865  6.1G Mar 4 20:32 mc-230741-big-Data.db
Max: 12/30/2018 Min: 12/29/2018 Estimated droppable tombstones: 0.6116207911770754  6.3G Mar 4 21:52 mc-230801-big-Data.db
Max: 12/31/2018 Min: 12/30/2018 Estimated droppable tombstones: 0.6156449592384619  6.6G Mar 5 09:48 mc-231332-big-Data.db



Currently our data on disk is filling up quickly because we are unable to
successfully evict this data. Is there a way to:

1. Clean up what is currently taking up so much disk space
2. Mitigate this entirely in the future


Any help would be greatly appreciated!!

Thanks,

Nick Hatfield

From: Surbhi Gupta <surbhi.gupt...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, March 7, 2019 at 11:50 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: can i...

Send the details

On Thu, Mar 7, 2019 at 8:45 AM Nick Hatfield <nick.hatfi...@metricly.com> wrote:
Use this email to get some insight on how to fix database issues in our cluster?


Re: can i...

2019-03-07 Thread Surbhi Gupta
Send the details

On Thu, Mar 7, 2019 at 8:45 AM Nick Hatfield 
wrote:

> Use this email to get some insight on how to fix database issues in our
> cluster?
>


can i...

2019-03-07 Thread Nick Hatfield
Use this email to get some insight on how to fix database issues in our cluster?


Re: How can I limit the non-heap memory for Cassandra

2019-01-18 Thread Alain RODRIGUEZ
Hello Chris,

I must admit I am a bit confused about what you need exactly, I'll try to
do my best :).


> would like to place limits on it to avoid it becoming a “noisy neighbor”
>
> But we also don’t want it killed by the oom killer, so just placing limits
> on the container won't help.


This sounds contradictory to me. When the available memory is fully used and
memory cannot be taken from elsewhere, the program cannot continue and an OOM
cannot be avoided. So it's probably one thing or the other, I would say
¯\_(ツ)_/¯.

Is there a way to limit Cassandra’s off-heap memory usage?


If we consider this perspective: "containerMemory = NativeMemory +
HeapMemory",
then by controlling the JVM heap memory and the container memory, you also
control the off-heap/native memory. So practically yes, you can set the
off-heap memory size, not directly, but by reducing the JVM heap size.
The option is in jvm.options (or cassandra-env.sh): MAX_HEAP_SIZE="X"  (it
comes with some tradeoff as well of course, if you're going this path I
recommend this post from Jon about JVM/GarbageCollection which is a tricky
piece of Cassandra operations
http://thelastpickle.com/blog/2018/04/11/gc-tuning.html)

You can also control each (most?) of the off-heap structures individually.
It's a bit split here and there between the distinct configuration files
and at the table level.
For example, if you're running out of Native Memory, you can maybe:

- Consider adding RAM or use a bigger instance type in the cloud.
- Reduce bloom filters ? -->
http://cassandra.apache.org/doc/latest/operating/bloom_filters.html?highlight=bloom_filter_fp_chance#bloom-filters
- Disable Row caches ? If you have troubles with memory, I would start
there probably (You did not give us your version of Cassandra though).
- Reduce the max_index_interval? -->
https://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlCreateTable.html#tabProp__cqlTableMax_index_interval
- ...

*Long Story :)*

It's the C* operator's job to ensure that the hardware choice and usage is
optimal, or at least that the sum of the memory needed by off-heap
structures stays below what's available and does not produce any OOM. Some
structures are capped (like the key cache size) some other will grow with
the data (like bloom filters). Thus it's good to have some breathing room
for growth and to have a monitoring system in place (this is something I
advocate for at any occasion :D). Finding the right balance is part of the
job of many of us here around :).

That being said, it's rare that we are fighting this kind of OOM because, in
the huge majority of clusters, we rely strongly on page caching and we try
to have as much 'free' native memory as possible for that purpose. We run into
problems way before running out of native memory in many cases.

Generally, a Cassandra cluster with the recommended (64 GB of RAM maybe?)
or at least decent (32 GB?) hardware and the default configuration should
hopefully work nicely. The schema design might make things worse and of
course, you can optimize and reduce the cost, sometimes in a
substantial way. But globally Cassandra and the default configuration give
a good starting point I think.

One last thing is that the more details you share, the sharper and more
accurate our answers can be. I feel like I told you *everything* I know about
memory because it wasn't clear to me what you needed :). Specifying the
Cassandra version, and telling us something about your specific case, like the
total memory size or the JVM configuration, would probably induce
(faster/better) responses from us :).

I hope this still helps.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com





On Thu, 3 Jan 2019 at 00:03, Chris Mildebrandt wrote:

> Hi,
>
> Is there a way to limit Cassandra’s off-heap memory usage? I can’t find
> a way to limit the memory used for row caches, bloom filters, etc. We’re
> running Cassandra in a container and would like to place limits on it to
> avoid it becoming a “noisy neighbor”. But we also don’t want it killed by
> the oom killer, so just placing limits on the container won't help.
>
> Thanks,
> -Chris
>


How can I limit the non-heap memory for Cassandra

2019-01-02 Thread Chris Mildebrandt
Hi,

Is there a way to limit Cassandra’s off-heap memory usage? I can’t find a
way to limit the memory used for row caches, bloom filters, etc. We’re
running Cassandra in a container and would like to place limits on it to
avoid it becoming a “noisy neighbor”. But we also don’t want it killed by
the oom killer, so just placing limits on the container won't help.

Thanks,
-Chris


Re: Can I sort it as a result of group by?

2018-04-10 Thread onmstester onmstester
I'm using apache spark on top of cassandra for such cases



Sent using Zoho Mail






 On Mon, 09 Apr 2018 18:00:33 +0430, DuyHai Doan <doanduy...@gmail.com> wrote:

No, sorting by column other than clustering column is not possible

On Mon, Apr 9, 2018 at 11:42 AM, Eunsu Kim <eunsu.bil...@gmail.com> wrote:

Hello, everyone.

I am using 3.11.0 and I have the following table.

CREATE TABLE summary_5m (
service_key text,
hash_key int,
instance_hash int,
collected_time timestamp,
count int,
PRIMARY KEY ((service_key), hash_key, instance_hash, collected_time)
)

And I can sum count grouping by primary key.

select service_key, hash_key, instance_hash, sum(count) as count_summ
from apm.ip_summary_5m
where service_key='ABCED'
group by service_key, hash_key, instance_hash;

But what I want is to get only the top 100 rows with the highest summed values.

Something like the following query … (syntax error, of course)

order by count_sum limit 100;

Has anybody ever solved this problem?

Thank you in advance.


Re: Can I sort it as a result of group by?

2018-04-09 Thread DuyHai Doan
No, sorting by column other than clustering column is not possible
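
If the number of groups for one service_key is manageable, a common workaround is to run the GROUP BY query from the thread and sort client side. A rough sketch with the Java driver 3.x, assuming the schema above:

import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

import java.util.Comparator;
import java.util.List;

public class TopNByCount {
    // Returns the 100 groups with the highest sums; all groups are pulled to the client first.
    public static List<Row> top100(Session session) {
        List<Row> rows = session.execute(
            "select service_key, hash_key, instance_hash, sum(count) as count_summ "
            + "from apm.ip_summary_5m where service_key='ABCED' "
            + "group by service_key, hash_key, instance_hash").all();
        // sum(count) on an int column is itself an int; sort descending on it.
        rows.sort(Comparator.comparingInt((Row r) -> r.getInt("count_summ")).reversed());
        return rows.subList(0, Math.min(100, rows.size()));
    }
}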

On Mon, Apr 9, 2018 at 11:42 AM, Eunsu Kim  wrote:

> Hello, everyone.
>
> I am using 3.11.0 and I have the following table.
>
> CREATE TABLE summary_5m (
> service_key text,
> hash_key int,
> instance_hash int,
> collected_time timestamp,
> count int,
> PRIMARY KEY ((service_key), hash_key, instance_hash, collected_time)
> )
>
>
> And I can sum count grouping by primary key.
>
> select service_key, hash_key, instance_hash, sum(count) as count_summ
> from apm.ip_summary_5m
> where service_key='ABCED'
> group by service_key, hash_key, instance_hash;
>
>
> But what I want is to get only the top 100 rows with the highest summed values.
>
> Something like the following query … (syntax error, of course)
>
> order by count_sum limit 100;
>
> Has anybody ever solved this problem?
>
> Thank you in advance.
>
>
>


Can I sort it as a result of group by?

2018-04-09 Thread Eunsu Kim
Hello, everyone.

I am using 3.11.0 and I have the following table.

CREATE TABLE summary_5m (
service_key text,
hash_key int,
instance_hash int,
collected_time timestamp,
count int,
PRIMARY KEY ((service_key), hash_key, instance_hash, collected_time)
)


And I can sum count grouping by primary key.

select service_key, hash_key, instance_hash, sum(count) as count_summ 
from apm.ip_summary_5m 
where service_key='ABCED'
group by service_key, hash_key, instance_hash;


But what I want is to get only the top 100 rows with the highest summed values.

Something like the following query … (syntax error, of course)

order by count_sum limit 100;

Has anybody ever solved this problem?

Thank you in advance.




where can i buy cassandra spring applications

2017-10-17 Thread Lutaya Shafiq Holmes
Where can I buy Cassandra Spring applications?

I need to purchase a fully built Cassandra web application in Eclipse.

Where can I get one?

For example, on Envato Market and ThemeForest I can get WordPress,
Drupal, PHP and other systems.

Where can I get Spring Cassandra applications?




Re: How can I install a Java Spring Application running Cassandra on to AWS

2017-10-17 Thread Lutaya Shafiq Holmes
Thank YOU

On 10/17/17, Who Dadddy <qwerty15...@gmail.com> wrote:
> http://lmgtfy.com/?q=install+java+app+on+AWS
>
>> On 17 Oct 2017, at 15:32, Lutaya Shafiq Holmes <lutayasha...@gmail.com>
>> wrote:
>>
>> How can I install a Java Spring Application running Cassandra  on to  AWS
>> --
>> Lutaaya Shafiq
>> Web: www.ronzag.com | i...@ronzag.com
>> Mobile: +256702772721 | +256783564130
>> Twitter: @lutayashafiq
>> Skype: lutaya5
>> Blog: lutayashafiq.com
>> http://www.fourcornersalliancegroup.com/?a=shafiqholmes
>>
>> "The most beautiful people we have known are those who have known defeat,
>> known suffering, known struggle, known loss and have found their way out
>> of
>> the depths. These persons have an appreciation, a sensitivity and an
>> understanding of life that fills them with compassion, gentleness and a
>> deep loving concern. Beautiful people do not just happen." - *Elisabeth
>> Kubler-Ross*
>>
>>
>
>


-- 
Lutaaya Shafiq
Web: www.ronzag.com | i...@ronzag.com
Mobile: +256702772721 | +256783564130
Twitter: @lutayashafiq
Skype: lutaya5
Blog: lutayashafiq.com
http://www.fourcornersalliancegroup.com/?a=shafiqholmes

"The most beautiful people we have known are those who have known defeat,
known suffering, known struggle, known loss and have found their way out of
the depths. These persons have an appreciation, a sensitivity and an
understanding of life that fills them with compassion, gentleness and a
deep loving concern. Beautiful people do not just happen." - *Elisabeth
Kubler-Ross*




Re: How can I install a Java Spring Application running Cassandra on to AWS

2017-10-17 Thread Who Dadddy
http://lmgtfy.com/?q=install+java+app+on+AWS 

> On 17 Oct 2017, at 15:32, Lutaya Shafiq Holmes <lutayasha...@gmail.com> wrote:
> 
> How can I install a Java Spring Application running Cassandra  on to  AWS
> -- 
> Lutaaya Shafiq
> Web: www.ronzag.com | i...@ronzag.com
> Mobile: +256702772721 | +256783564130
> Twitter: @lutayashafiq
> Skype: lutaya5
> Blog: lutayashafiq.com
> http://www.fourcornersalliancegroup.com/?a=shafiqholmes
> 
> "The most beautiful people we have known are those who have known defeat,
> known suffering, known struggle, known loss and have found their way out of
> the depths. These persons have an appreciation, a sensitivity and an
> understanding of life that fills them with compassion, gentleness and a
> deep loving concern. Beautiful people do not just happen." - *Elisabeth
> Kubler-Ross*
> 
> 



How can I install a Java Spring Application running Cassandra on to AWS

2017-10-17 Thread Lutaya Shafiq Holmes
How can I install a Java Spring Application running Cassandra  on to  AWS
-- 
Lutaaya Shafiq
Web: www.ronzag.com | i...@ronzag.com
Mobile: +256702772721 | +256783564130
Twitter: @lutayashafiq
Skype: lutaya5
Blog: lutayashafiq.com
http://www.fourcornersalliancegroup.com/?a=shafiqholmes

"The most beautiful people we have known are those who have known defeat,
known suffering, known struggle, known loss and have found their way out of
the depths. These persons have an appreciation, a sensitivity and an
understanding of life that fills them with compassion, gentleness and a
deep loving concern. Beautiful people do not just happen." - *Elisabeth
Kubler-Ross*




How Can I get started with Using Cassandra and Netbeans- Please help

2017-09-29 Thread Lutaya Shafiq Holmes
How Can I get started with Using Cassandra and Netbeans- Please help
-- 
Lutaaya Shafiq
Web: www.ronzag.com | i...@ronzag.com
Mobile: +256702772721 | +256783564130
Twitter: @lutayashafiq
Skype: lutaya5
Blog: lutayashafiq.com
http://www.fourcornersalliancegroup.com/?a=shafiqholmes

"The most beautiful people we have known are those who have known defeat,
known suffering, known struggle, known loss and have found their way out of
the depths. These persons have an appreciation, a sensitivity and an
understanding of life that fills them with compassion, gentleness and a
deep loving concern. Beautiful people do not just happen." - *Elisabeth
Kubler-Ross*




RE: Can I have multiple datacenter with different versions of Cassandra

2017-09-12 Thread Durity, Sean R
No – the general answer is that you cannot stream between major versions of 
Cassandra. I would upgrade the existing ring, then add the new DC.


Sean Durity

From: Chuck Reynolds [mailto:creyno...@ancestry.com]
Sent: Thursday, May 18, 2017 11:20 AM
To: user@cassandra.apache.org
Subject: Can I have multiple datacenter with different versions of Cassandra

I have a need to create another datacenter and upgrade my existing Cassandra 
from 2.1.13 to Cassandra 3.0.9.

Can I do this as one step?  Create a new Cassandra ring that is version 3.0.9 
and replicate the data from an existing ring that is Cassandra 2.1.13?

After replicating to the new ring, if possible, then I would upgrade the old
ring to Cassandra 3.0.9





Re: Can I have multiple datacenter with different versions of Cassandra

2017-05-18 Thread Jonathan Haddad
No you can't do it in one step. Streaming between versions isn't supported
On Thu, May 18, 2017 at 8:26 AM daemeon reiydelle <daeme...@gmail.com>
wrote:

> Yes, or decommission the old one and build anew after the new one is operational
>
> “All men dream, but not equally. Those who dream by night in the dusty
> recesses of their minds wake up in the day to find it was vanity, but the
> dreamers of the day are dangerous men, for they may act their dreams with
> open eyes, to make it possible.” — T.E. Lawrence
>
> sent from my mobile
>
> Daemeon Reiydelle
> skype daemeon.c.m.reiydelle
> USA 415.501.0198
>
> On May 18, 2017 8:20 AM, "Chuck Reynolds" <creyno...@ancestry.com> wrote:
>
>> I have a need to create another datacenter and upgrade my existing
>> Cassandra from 2.1.13 to Cassandra 3.0.9.
>>
>>
>>
>> Can I do this as one step?  Create a new Cassandra ring that is version
>> 3.0.9 and replicate the data from an existing ring that is Cassandra 2.1.13?
>>
>>
>>
>> After replicating to the new ring, if possible, then I would upgrade the
>> old ring to Cassandra 3.0.9
>>
>


Re: Can I have multiple datacenter with different versions of Cassandra

2017-05-18 Thread daemeon reiydelle
Yes, or decommission the old one and build anew after the new one is operational

“All men dream, but not equally. Those who dream by night in the dusty
recesses of their minds wake up in the day to find it was vanity, but the
dreamers of the day are dangerous men, for they may act their dreams with
open eyes, to make it possible.” — T.E. Lawrence

sent from my mobile
Daemeon Reiydelle
skype daemeon.c.m.reiydelle
USA 415.501.0198

On May 18, 2017 8:20 AM, "Chuck Reynolds" <creyno...@ancestry.com> wrote:

> I have a need to create another datacenter and upgrade my existing
> Cassandra from 2.1.13 to Cassandra 3.0.9.
>
>
>
> Can I do this as one step?  Create a new Cassandra ring that is version
> 3.0.9 and replicate the data from an existing ring that is Cassandra 2.1.13?
>
>
>
> After replicating to the new ring, if possible, then I would upgrade the old
> ring to Cassandra 3.0.9
>


Can I have multiple datacenter with different versions of Cassandra

2017-05-18 Thread Chuck Reynolds
I have a need to create another datacenter and upgrade my existing Cassandra 
from 2.1.13 to Cassandra 3.0.9.

Can I do this as one step?  Create a new Cassandra ring that is version 3.0.9 
and replicate the data from an existing ring that is Cassandra 2.1.13?

After replicating to the new ring, if possible, then I would upgrade the old
ring to Cassandra 3.0.9


Re: How can I efficiently export the content of my table to KAFKA

2017-04-28 Thread Tobias Eriksson
Hi Chris,
Well, that seemed like a good idea at first; I would like to read from
Cassandra and post to KAFKA.
But the Kafka Connect Cassandra Source connector requires that the table has a
time-series order, and not all of my tables do.
So thanks for the tip, but it did not work ☹
-Tobias

From: Chris Stromberger <chris.stromber...@gmail.com>
Date: Thursday, 27 April 2017 at 15:50
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: How can I efficiently export the content of my table to KAFKA

Maybe 
https://www.confluent.io/blog/kafka-connect-cassandra-sink-the-perfect-match/



On Wed, Apr 26, 2017 at 2:49 PM, Tobias Eriksson <tobias.eriks...@qvantel.com> wrote:
Hi
I would like to make a dump of the database, in JSON format, to KAFKA
The database contains lots of data, millions and in some cases billions of 
“rows”
I will provide the customer with an export of the data, where they can read it 
off of a KAFKA topic

My thinking was to have it scalable such that I will distribute the token range 
of all available partition-keys to a number of (N) processes (JSON-Producers)
First I will have a process which will read through the available tokens and 
then publish them on a KAFKA “Coordinator” Topic
And then I can create 1, 10, 20 or N processes that will act as Producers to 
the real KAFKA topic, and pick available tokens/partition-keys off of the 
“Coordinator” Topic
One by one until all the “rows” have been processed.
So the JSON-Producer will take e.g. a range of 1000 “rows” and convert them
into my own JSON format and post to KAFKA
And then after that take another 1000 “rows” and then …. And then another 1000 
“rows” and so on, until it is done.

I base my idea on how I believe Apache Spark Connector accomplishes data 
locality, i.e. being aware of where tokens reside and figured that since that 
is possible it should be possible to create a job-list in a KAFKA topic, and 
have each Producer pick jobs from there, and read up data from Cassandra based 
on the partition key (token) and then post the JSON on the export KAFKA topic.
https://dzone.com/articles/data-locality-w-cassandra-how


Would you consider this a good idea ?
Would there in fact be a better idea, what would that be then ?

-Tobias




Re: How can I efficiently export the content of my table to KAFKA

2017-04-27 Thread Chris Stromberger
Maybe
https://www.confluent.io/blog/kafka-connect-cassandra-sink-the-perfect-match/



On Wed, Apr 26, 2017 at 2:49 PM, Tobias Eriksson <
tobias.eriks...@qvantel.com> wrote:

> Hi
>
> I would like to make a dump of the database, in JSON format, to KAFKA
>
> The database contains lots of data, millions and in some cases billions of
> “rows”
>
> I will provide the customer with an export of the data, where they can
> read it off of a KAFKA topic
>
>
>
> My thinking was to have it scalable such that I will distribute the token
> range of all available partition-keys to a number of (N) processes
> (JSON-Producers)
>
> First I will have a process which will read through the available tokens
> and then publish them on a KAFKA “Coordinator” Topic
>
> And then I can create 1, 10, 20 or N processes that will act as Producers
> to the real KAFKA topic, and pick available tokens/partition-keys off of
> the “Coordinator” Topic
>
> One by one until all the “rows” have been processed.
>
> So the JSON-Producer will take e.g. a range of 1000 “rows” and convert
> them into my own JSON format and post to KAFKA
>
> And then after that take another 1000 “rows” and then …. And then another
> 1000 “rows” and so on, until it is done.
>
>
>
> I base my idea on how I believe Apache Spark Connector accomplishes data
> locality, i.e. being aware of where tokens reside and figured that since
> that is possible it should be possible to create a job-list in a KAFKA
> topic, and have each Producer pick jobs from there, and read up data from
> Cassandra based on the partition key (token) and then post the JSON on the
> export KAFKA topic.
>
> https://dzone.com/articles/data-locality-w-cassandra-how
>
>
>
>
>
> Would you consider this a good idea ?
>
> Would there in fact be a better idea, what would that be then ?
>
>
>
> -Tobias
>
>
>


Re: How can I efficiently export the content of my table to KAFKA

2017-04-26 Thread Justin Cameron
You can run multiple applications in parallel in Standalone mode - you just
need to configure spark to allocate resources between your jobs the way you
want (by default it assigns all resources to the first application you run,
so they won't be freed up until it has finished).

You can use Spark's web UI to check the resources that are available and
those allocated to each job. See
http://spark.apache.org/docs/latest/job-scheduling.html for more details.

On Thu, 27 Apr 2017 at 15:12 Tobias Eriksson <tobias.eriks...@qvantel.com>
wrote:

> Well, I have been working some with Spark, and the biggest hurdle is that
> Spark does not allow me to run multiple jobs in parallel,
>
> i.e. once I start the job that takes the “Individuals” table I will have to
> wait until all that processing is done before I can start an additional one,
>
> so I will need to start various additional jobs on demand where I get
> “Addresses”, “Invoices”, … and so on.
>
> I know I could increase the number of Workers/Executors and use Mesos for
> handling the scheduling and resource management, but we have so far not been
> able to get it dynamic/flexible enough.
>
> Although I admit that this could still be a way forward, we have not
> evaluated it 100% yet, so I have not completely given up on that thought.
>
>
>
> -Tobias
>
>
>
>
>
> *From: *Justin Cameron <jus...@instaclustr.com>
> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Date: *Thursday, 27 April 2017 at 01:36
> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Subject: *Re: How can I efficiently export the content of my table to
> KAFKA
>
>
>
> You could probably save yourself a lot of hassle by just writing a Spark
> job that scans through the entire table, converts each row to JSON and
> dumps the output into a Kafka topic. It should be fairly straightforward to
> implement.
>
>
>
> Spark will manage the partitioning of "Producer" processes for you - no
> need for a "Coordinator" topic.
>
>
>
> On Thu, 27 Apr 2017 at 05:49 Tobias Eriksson <tobias.eriks...@qvantel.com>
> wrote:
>
> Hi
>
> I would like to make a dump of the database, in JSON format, to KAFKA
>
> The database contains lots of data, millions and in some cases billions of
> “rows”
>
> I will provide the customer with an export of the data, where they can
> read it off of a KAFKA topic
>
>
>
> My thinking was to have it scalable such that I will distribute the token
> range of all available partition-keys to a number of (N) processes
> (JSON-Producers)
>
> First I will have a process which will read through the available tokens
> and then publish them on a KAFKA “Coordinator” Topic
>
> And then I can create 1, 10, 20 or N processes that will act as Producers
> to the real KAFKA topic, and pick available tokens/partition-keys off of
> the “Coordinator” Topic
>
> One by one until all the “rows” have been processed.
>
> So the JSON-Producer will take e.g. a range of 1000 “rows” and convert
> them into my own JSON format and post to KAFKA
>
> And then after that take another 1000 “rows” and then …. And then another
> 1000 “rows” and so on, until it is done.
>
>
>
> I base my idea on how I believe Apache Spark Connector accomplishes data
> locality, i.e. being aware of where tokens reside and figured that since
> that is possible it should be possible to create a job-list in a KAFKA
> topic, and have each Producer pick jobs from there, and read up data from
> Cassandra based on the partition key (token) and then post the JSON on the
> export KAFKA topic.
>
> https://dzone.com/articles/data-locality-w-cassandra-how
>
>
>
>
>
> Would you consider this a good idea ?
>
> Would there in fact be a better idea, what would that be then ?
>
>
>
> -Tobias
>
>
>
> --
>
> *Justin Cameron*
> Senior Software Engineer
>
>
>
>
-- 


*Justin Cameron*
Senior Software Engineer




Re: How can I efficiently export the content of my table to KAFKA

2017-04-26 Thread Tobias Eriksson
Well, I have been working some with Spark, and the biggest hurdle is that Spark
does not allow me to run multiple jobs in parallel,
i.e. once I start the job that takes the “Individuals” table I will have to wait
until all that processing is done before I can start an additional one,
so I will need to start various additional jobs on demand where I get
“Addresses”, “Invoices”, … and so on.
I know I could increase the number of Workers/Executors and use Mesos for
handling the scheduling and resource management, but we have so far not been
able to get it dynamic/flexible enough.
Although I admit that this could still be a way forward, we have not evaluated
it 100% yet, so I have not completely given up on that thought.

-Tobias


From: Justin Cameron <jus...@instaclustr.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, 27 April 2017 at 01:36
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: How can I efficiently export the content of my table to KAFKA

You could probably save yourself a lot of hassle by just writing a Spark job 
that scans through the entire table, converts each row to JSON and dumps the 
output into a Kafka topic. It should be fairly straightforward to implement.

Spark will manage the partitioning of "Producer" processes for you - no need 
for a "Coordinator" topic.

On Thu, 27 Apr 2017 at 05:49 Tobias Eriksson <tobias.eriks...@qvantel.com> wrote:
Hi
I would like to make a dump of the database, in JSON format, to KAFKA
The database contains lots of data, millions and in some cases billions of 
“rows”
I will provide the customer with an export of the data, where they can read it 
off of a KAFKA topic

My thinking was to have it scalable such that I will distribute the token range 
of all available partition-keys to a number of (N) processes (JSON-Producers)
First I will have a process which will read through the available tokens and 
then publish them on a KAFKA “Coordinator” Topic
And then I can create 1, 10, 20 or N processes that will act as Producers to 
the real KAFKA topic, and pick available tokens/partition-keys off of the 
“Coordinator” Topic
One by one until all the “rows” have been processed.
So the JSON-Producer will take e.g. a range of 1000 “rows” and convert them
into my own JSON format and post to KAFKA
And then after that take another 1000 “rows” and then …. And then another 1000 
“rows” and so on, until it is done.

I base my idea on how I believe Apache Spark Connector accomplishes data 
locality, i.e. being aware of where tokens reside and figured that since that 
is possible it should be possible to create a job-list in a KAFKA topic, and 
have each Producer pick jobs from there, and read up data from Cassandra based 
on the partition key (token) and then post the JSON on the export KAFKA topic.
https://dzone.com/articles/data-locality-w-cassandra-how


Would you consider this a good idea ?
Would there in fact be a better idea, what would that be then ?

-Tobias

--
Justin Cameron
Senior Software Engineer




Re: How can I efficiently export the content of my table to KAFKA

2017-04-26 Thread Justin Cameron
You could probably save yourself a lot of hassle by just writing a Spark
job that scans through the entire table, converts each row to JSON and
dumps the output into a Kafka topic. It should be fairly straightforward to
implement.

Spark will manage the partitioning of "Producer" processes for you - no
need for a "Coordinator" topic.
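
A rough sketch of such a job, assuming Spark 2.x with the spark-cassandra-connector and Spark's Kafka batch sink on the classpath; host, keyspace, table, broker and topic names are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class TableToKafka {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("cassandra-table-to-kafka")
            .config("spark.cassandra.connection.host", "cassandra-host")  // placeholder
            .getOrCreate();

        // Full scan of the table; the connector splits it by token range across executors.
        Dataset<String> json = spark.read()
            .format("org.apache.spark.sql.cassandra")
            .option("keyspace", "my_keyspace")
            .option("table", "my_table")
            .load()
            .toJSON();  // one JSON string per row, exposed as a column named "value"

        // Batch write to Kafka; the Kafka sink picks up the "value" column as the message body.
        json.write()
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")
            .option("topic", "cassandra-export")
            .save();

        spark.stop();
    }
}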

On Thu, 27 Apr 2017 at 05:49 Tobias Eriksson 
wrote:

> Hi
>
> I would like to make a dump of the database, in JSON format, to KAFKA
>
> The database contains lots of data, millions and in some cases billions of
> “rows”
>
> I will provide the customer with an export of the data, where they can
> read it off of a KAFKA topic
>
>
>
> My thinking was to have it scalable such that I will distribute the token
> range of all available partition-keys to a number of (N) processes
> (JSON-Producers)
>
> First I will have a process which will read through the available tokens
> and then publish them on a KAFKA “Coordinator” Topic
>
> And then I can create 1, 10, 20 or N processes that will act as Producers
> to the real KAFKA topic, and pick available tokens/partition-keys off of
> the “Coordinator” Topic
>
> One by one until all the “rows” have been processed.
>
> So the JSON-Producer will take e.g. a range of 1000 “rows” and convert
> them into my own JSON format and post to KAFKA
>
> And then after that take another 1000 “rows” and then …. And then another
> 1000 “rows” and so on, until it is done.
>
>
>
> I base my idea on how I believe Apache Spark Connector accomplishes data
> locality, i.e. being aware of where tokens reside and figured that since
> that is possible it should be possible to create a job-list in a KAFKA
> topic, and have each Producer pick jobs from there, and read up data from
> Cassandra based on the partition key (token) and then post the JSON on the
> export KAFKA topic.
>
> https://dzone.com/articles/data-locality-w-cassandra-how
>
>
>
>
>
> Would you consider this a good idea ?
>
> Would there in fact be a better idea, what would that be then ?
>
>
>
> -Tobias
>
>
>
-- 


*Justin Cameron*Senior Software Engineer





This email has been sent on behalf of Instaclustr Pty. Limited (Australia)
and Instaclustr Inc (USA).

This email and any attachments may contain confidential and legally
privileged information.  If you are not the intended recipient, do not copy
or disclose its content, but please reply to this email immediately and
highlight the error to the sender and then immediately delete the message.


How can I efficiently export the content of my table to KAFKA

2017-04-26 Thread Tobias Eriksson
Hi
I would like to make a dump of the database, in JSON format, to KAFKA
The database contains lots of data, millions and in some cases billions of 
“rows”
I will provide the customer with an export of the data, where they can read it 
off of a KAFKA topic

My thinking was to have it scalable such that I will distribute the token range 
of all available partition-keys to a number of (N) processes (JSON-Producers)
First I will have a process which will read through the available tokens and 
then publish them on a KAFKA “Coordinator” Topic
And then I can create 1, 10, 20 or N processes that will act as Producers to 
the real KAFKA topic, and pick available tokens/partition-keys off of the 
“Coordinator” Topic
One by one until all the “rows” have been processed.
So the JSON-Producer will take e.g. a range of 1000 “rows” and convert them 
into my own JSON format and post to KAFKA
And then after that take another 1000 “rows” and then …. And then another 1000 
“rows” and so on, until it is done.

I base my idea on how I believe Apache Spark Connector accomplishes data 
locality, i.e. being aware of where tokens reside and figured that since that 
is possible it should be possible to create a job-list in a KAFKA topic, and 
have each Producer pick jobs from there, and read up data from Cassandra based 
on the partition key (token) and then post the JSON on the export KAFKA topic.
https://dzone.com/articles/data-locality-w-cassandra-how


Would you consider this a good idea ?
Would there in fact be a better idea, what would that be then ?

-Tobias
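
For reference, a rough sketch of one such producer using the Java driver's token-range
metadata (illustrative only, not the actual design: "ks.my_table", the partition key "pk",
the contact point, broker and topic names are placeholders, and row.toString() stands in
for the custom JSON format):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Metadata;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.TokenRange;

public class TokenRangeJsonProducer {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
             Session session = cluster.connect()) {

            Metadata metadata = cluster.getMetadata();
            PreparedStatement rangeQuery = session.prepare(
                    "SELECT * FROM ks.my_table WHERE token(pk) > ? AND token(pk) <= ?");

            Properties kafkaProps = new Properties();
            kafkaProps.put("bootstrap.servers", "kafka:9092");
            kafkaProps.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            kafkaProps.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(kafkaProps)) {
                // In the real design each producer would only process the (sub)ranges it was
                // assigned via the "Coordinator" topic; here we simply loop over all of them.
                for (TokenRange range : metadata.getTokenRanges()) {
                    for (TokenRange subRange : range.unwrap()) {   // handle wrap-around ranges
                        ResultSet rs = session.execute(rangeQuery.bind()
                                .setToken(0, subRange.getStart())
                                .setToken(1, subRange.getEnd()));
                        for (Row row : rs) {
                            // row.toString() stands in for the custom JSON conversion
                            producer.send(new ProducerRecord<>("export-topic", row.toString()));
                        }
                    }
                }
            }
        }
    }
}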



Re: How can I scale my read rate?

2017-03-27 Thread Alexander Dejanovski
By default the TokenAwarePolicy does shuffle replicas, and it can be
disabled if you want to only hit the primary replica for the token range
you're querying :
http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html
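
For instance, shuffling can be switched off when the Cluster is built (driver 3.x
fragment; the contact point is a placeholder):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

// shuffleReplicas = false: always prefer the primary replica of the token range
Cluster cluster = Cluster.builder()
        .addContactPoint("10.0.0.1")
        .withLoadBalancingPolicy(
                new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build(), false))
        .build();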

On Mon, Mar 27, 2017 at 9:41 AM Avi Kivity  wrote:

> Is the driver doing the right thing by directing all reads for a given
> token to the same node?  If that node fails, then all of those reads will
> be directed at other nodes, all of whom will be cache-cold for the
> failed node's primary token range.  Seems like the driver should distribute
> reads among all the replicas for a token, at least as an option, to
> keep the caches warm for latency-sensitive loads.
>
> On 03/26/2017 07:46 PM, Eric Stevens wrote:
>
> Yes, throughput for a given partition key cannot be improved with
> horizontal scaling.  You can increase RF to theoretically improve
> throughput on that key, but actually in this case smart clients might hold
> you back, because they're probably token aware, and will try to serve that
> read off the key's primary replica, so all reads would be directed at a
> single node for that key.
>
> If you're reading at CL=QUORUM, there's a chance that increasing RF will
> actually reduce performance rather than improve it, because you've
> increased the total amount of work to serve the read (as well as the
> write).  If you're reading at CL=ONE, increasing RF will increase the
> chances of falling afoul of eventual consistency.
>
> However that's not really a real-world scenario.  Or if it is, Cassandra
> is probably the wrong tool to satisfy that kind of workload.
>
> On Thu, Mar 23, 2017 at 11:43 PM Alain Rastoul 
> wrote:
>
> On 24/03/2017 01:00, Eric Stevens wrote:
> > Assuming an even distribution of data in your cluster, and an even
> > distribution across those keys by your readers, you would not need to
> > increase RF with cluster size to increase read performance.  If you have
> > 3 nodes with RF=3, and do 3 million reads, with good distribution, each
> > node has served 1 million read requests.  If you increase to 6 nodes and
> > keep RF=3, then each node now owns half as much data and serves only
> > 500,000 reads.  Or more meaningfully in the same time it takes to do 3
> > million reads under the 3 node cluster you ought to be able to do 6
> > million reads under the 6 node cluster since each node is just
> > responsible for 1 million total reads.
> >
> Hi Eric,
>
> I think I got your point.
> In case of really evenly distributed  reads it may (or should?) not make
> any difference,
>
> But when you do not distribute well the reads (and in that case only),
> my understanding about RF was that it could help spreading the load :
> In that case, with RF= 4 instead of 3,  with several clients accessing the
> same key ranges, a coordinator could pick up one node to handle the request
> in 4 replicas instead of picking up one node in 3 , thus having
> more "workers" to handle a request ?
>
> Am I wrong here ?
>
> Thank you for the clarification
>
>
> --
> best,
> Alain
>
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: How can I scale my read rate?

2017-03-27 Thread Avi Kivity
Is the driver doing the right thing by directing all reads for a given 
token to the same node?  If that node fails, then all of those reads 
will be directed at other nodes, all of whom will be cache-cold for 
the failed node's primary token range.  Seems like the driver should 
distribute reads among all the replicas for a token, at least as an 
option, to keep the caches warm for latency-sensitive loads.



On 03/26/2017 07:46 PM, Eric Stevens wrote:
Yes, throughput for a given partition key cannot be improved with 
horizontal scaling.  You can increase RF to theoretically improve 
throughput on that key, but actually in this case smart clients might 
hold you back, because they're probably token aware, and will try to 
serve that read off the key's primary replica, so all reads would be 
directed at a single node for that key.


If you're reading at CL=QUORUM, there's a chance that increasing RF 
will actually reduce performance rather than improve it, because 
you've increased the total amount of work to serve the read (as well 
as the write).  If you're reading at CL=ONE, increasing RF will 
increase the chances of falling afoul of eventual consistency.


However that's not really a real-world scenario.  Or if it is, 
Cassandra is probably the wrong tool to satisfy that kind of workload.


On Thu, Mar 23, 2017 at 11:43 PM Alain Rastoul > wrote:


On 24/03/2017 01:00, Eric Stevens wrote:
> Assuming an even distribution of data in your cluster, and an even
> distribution across those keys by your readers, you would not
need to
> increase RF with cluster size to increase read performance.  If
you have
> 3 nodes with RF=3, and do 3 million reads, with good
distribution, each
> node has served 1 million read requests.  If you increase to 6
nodes and
> keep RF=3, then each node now owns half as much data and serves only
> 500,000 reads.  Or more meaningfully in the same time it takes
to do 3
> million reads under the 3 node cluster you ought to be able to do 6
> million reads under the 6 node cluster since each node is just
> responsible for 1 million total reads.
>
Hi Eric,

I think I got your point.
In case of really evenly distributed  reads it may (or should?)
not make
any difference,

But when you do not distribute well the reads (and in that case only),
my understanding about RF was that it could help spreading the load :
In that case, with RF= 4 instead of 3,  with several clients
accessing the
same key ranges, a coordinator could pick up one node to handle
the request
in 4 replicas instead of picking up one node in 3 , thus having
more "workers" to handle a request ?

Am I wrong here ?

Thank you for the clarification


--
best,
Alain





Re: How can I scale my read rate?

2017-03-26 Thread Anthony Grasso
Keep in mind there are side effects to increasing to RF = 4

   - Storage requirements for each node will increase. Depending on the
   number of nodes in the cluster and the size of the data this could be
   significant.
   - Whilst the number of available coordinators increases, the number of
   nodes involved in QUORUM reads/writes will increase from 2 to 3.



On 24 March 2017 at 16:43, Alain Rastoul  wrote:

> On 24/03/2017 01:00, Eric Stevens wrote:
>
>> Assuming an even distribution of data in your cluster, and an even
>> distribution across those keys by your readers, you would not need to
>> increase RF with cluster size to increase read performance.  If you have
>> 3 nodes with RF=3, and do 3 million reads, with good distribution, each
>> node has served 1 million read requests.  If you increase to 6 nodes and
>> keep RF=3, then each node now owns half as much data and serves only
>> 500,000 reads.  Or more meaningfully in the same time it takes to do 3
>> million reads under the 3 node cluster you ought to be able to do 6
>> million reads under the 6 node cluster since each node is just
>> responsible for 1 million total reads.
>>
>> Hi Eric,
>
> I think I got your point.
> In case of really evenly distributed  reads it may (or should?) not make
> any difference,
>
> But when you do not distribute well the reads (and in that case only),
> my understanding about RF was that it could help spreading the load :
> In that case, with RF= 4 instead of 3,  with several clients accessing the
> same key ranges, a coordinator could pick up one node to handle the request
> in 4 replicas instead of picking up one node in 3 , thus having
> more "workers" to handle a request ?
>
> Am I wrong here ?
>
> Thank you for the clarification
>
>
> --
> best,
> Alain
>
>


Re: How can I scale my read rate?

2017-03-26 Thread Eric Stevens
Yes, throughput for a given partition key cannot be improved with
horizontal scaling.  You can increase RF to theoretically improve
throughput on that key, but actually in this case smart clients might hold
you back, because they're probably token aware, and will try to serve that
read off the key's primary replica, so all reads would be directed at a
single node for that key.

If you're reading at CL=QUORUM, there's a chance that increasing RF will
actually reduce performance rather than improve it, because you've
increased the total amount of work to serve the read (as well as the
write).  If you're reading at CL=ONE, increasing RF will increase the
chances of falling afoul of eventual consistency.

However that's not really a real-world scenario.  Or if it is, Cassandra is
probably the wrong tool to satisfy that kind of workload.

On Thu, Mar 23, 2017 at 11:43 PM Alain Rastoul 
wrote:

On 24/03/2017 01:00, Eric Stevens wrote:
> Assuming an even distribution of data in your cluster, and an even
> distribution across those keys by your readers, you would not need to
> increase RF with cluster size to increase read performance.  If you have
> 3 nodes with RF=3, and do 3 million reads, with good distribution, each
> node has served 1 million read requests.  If you increase to 6 nodes and
> keep RF=3, then each node now owns half as much data and serves only
> 500,000 reads.  Or more meaningfully in the same time it takes to do 3
> million reads under the 3 node cluster you ought to be able to do 6
> million reads under the 6 node cluster since each node is just
> responsible for 1 million total reads.
>
Hi Eric,

I think I got your point.
In case of really evenly distributed  reads it may (or should?) not make
any difference,

But when you do not distribute well the reads (and in that case only),
my understanding about RF was that it could help spreading the load :
In that case, with RF= 4 instead of 3,  with several clients accessing the
same key ranges, a coordinator could pick up one node to handle the request
in 4 replicas instead of picking up one node in 3 , thus having
more "workers" to handle a request ?

Am I wrong here ?

Thank you for the clarification


--
best,
Alain


Re: Using datastax driver, how can I read a non-primitive column as a JSON string?

2017-03-24 Thread Vladimir Yudovin
Hi,



why not use SELECT JSON * FROM as described here 
https://www.datastax.com/dev/blog/whats-new-in-cassandra-2-2-json-support ?
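
For instance, a minimal sketch with the Java driver (contact point, keyspace and table are 
placeholders): Cassandra 2.2+ returns each row as a single text column named "[json]", UDTs 
and collections included, so no per-UDT codec is needed.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SelectJsonExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
             Session session = cluster.connect("ks")) {
            ResultSet rs = session.execute("SELECT JSON * FROM my_table");
            for (Row row : rs) {
                String json = row.getString("[json]");   // whole row as a JSON string
                System.out.println(json);
            }
        }
    }
}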



Best regards, Vladimir Yudovin, 

Winguzone - Cloud Cassandra Hosting






 On Thu, 23 Mar 2017 13:08:30 -0400 S G sg.online.em...@gmail.com 
wrote 




Hi,



I have several non-primitive columns in my cassandra tables.

Some of them are user-defined-types UDTs.



While querying them through datastax driver, I want to convert such UDTs into 
JSON values.

More specifically, I want to get JSON string for the value object below:

Row row = itr.next();

ColumnDefinitions cds = row.getColumnDefinitions();

cds.asList().forEach((ColumnDefinitions.Definition cd) -> {

String name = cd.getName();

Object value = row.getObject(name);

});

I have gone through 
http://docs.datastax.com/en/developer/java-driver/3.1/manual/custom_codecs/

But I do not want to add a codec for every UDT I have.



Can the driver somehow return me direct JSON without explicit meddling with 
codecs and all?



Thanks

SG

















Re: How can I scale my read rate?

2017-03-23 Thread Alain Rastoul

On 24/03/2017 01:00, Eric Stevens wrote:

Assuming an even distribution of data in your cluster, and an even
distribution across those keys by your readers, you would not need to
increase RF with cluster size to increase read performance.  If you have
3 nodes with RF=3, and do 3 million reads, with good distribution, each
node has served 1 million read requests.  If you increase to 6 nodes and
keep RF=3, then each node now owns half as much data and serves only
500,000 reads.  Or more meaningfully in the same time it takes to do 3
million reads under the 3 node cluster you ought to be able to do 6
million reads under the 6 node cluster since each node is just
responsible for 1 million total reads.


Hi Eric,

I think I got your point.
In case of really evenly distributed  reads it may (or should?) not make 
any difference,


But when you do not distribute well the reads (and in that case only),
my understanding about RF was that it could help spreading the load :
In that case, with RF= 4 instead of 3,  with several clients accessing the
same key ranges, a coordinator could pick up one node to handle the request
in 4 replicas instead of picking up one node in 3 , thus having
more "workers" to handle a request ?

Am I wrong here ?

Thank you for the clarification


--
best,
Alain



Re: How can I scale my read rate?

2017-03-23 Thread Eric Stevens
Assuming an even distribution of data in your cluster, and an even
distribution across those keys by your readers, you would not need to
increase RF with cluster size to increase read performance.  If you have 3
nodes with RF=3, and do 3 million reads, with good distribution, each node
has served 1 million read requests.  If you increase to 6 nodes and keep
RF=3, then each node now owns half as much data and serves only 500,000
reads.  Or more meaningfully in the same time it takes to do 3 million
reads under the 3 node cluster you ought to be able to do 6 million reads
under the 6 node cluster since each node is just responsible for 1 million
total reads.

On Mon, Mar 20, 2017 at 11:24 PM Alain Rastoul 
wrote:

> On 20/03/2017 22:05, Michael Wojcikiewicz wrote:
> > Not sure if someone has suggested this, but I believe it's not
> > sufficient to simply add nodes to a cluster to increase read
> > performance: you also need to alter the ReplicationFactor of the
> > keyspace to a larger value as you increase your cluster gets larger.
> >
> > ie. data is available from more nodes in the cluster for each query.
> >
> Yes, good point in case of cluster growth, there would be more replica
> to handle same key ranges.
> And also readjust token ranges :
> https://cassandra.apache.org/doc/latest/operating/topo_changes.html
>
> SG, can you give some information (or share your code) about how you
> generate your data and how you read it ?
>
> --
> best,
> Alain
>
>


Using datastax driver, how can I read a non-primitive column as a JSON string?

2017-03-23 Thread S G
Hi,

I have several non-primitive columns in my cassandra tables.
Some of them are user-defined-types UDTs.

While querying them through datastax driver, I want to convert such UDTs
into JSON values.
More specifically, I want to get JSON string for the value object below:

Row row = itr.next();

ColumnDefinitions cds = row.getColumnDefinitions();

cds.asList().forEach((ColumnDefinitions.Definition cd) -> {

String name = cd.getName();

Object value = row.getObject(name);

});

I have gone through
http://docs.datastax.com/en/developer/java-driver/3.1/manual/custom_codecs/

But I do not want to add a codec for every UDT I have.


Can the driver somehow return me direct JSON without explicit meddling with
codecs and all?


Thanks

SG


Re: How can I scale my read rate?

2017-03-20 Thread Alain Rastoul

On 20/03/2017 22:05, Michael Wojcikiewicz wrote:

Not sure if someone has suggested this, but I believe it's not
sufficient to simply add nodes to a cluster to increase read
performance: you also need to alter the ReplicationFactor of the
keyspace to a larger value as you increase your cluster gets larger.

ie. data is available from more nodes in the cluster for each query.

Yes, good point in case of cluster growth, there would be more replica 
to handle same key ranges.

And also readjust token ranges :
https://cassandra.apache.org/doc/latest/operating/topo_changes.html

SG, can you give some information (or share your code) about how you 
generate your data and how you read it ?


--
best,
Alain



Re: How can I scale my read rate?

2017-03-20 Thread Alain Rastoul

On 20/03/2017 02:35, S G wrote:

2)
https://docs.datastax.com/en/developer/java-driver/3.1/manual/statements/prepared/
tells me to avoid preparing select queries if I expect a change of
columns in my table down the road.
The problem is also related to select * which is considered bad practice 
with most databases...
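
One possible middle ground, sketched below as a fragment (table/column names and someKey
are placeholders, session is an existing Session): prepare the read once with an explicit
column list, so it is parsed only once and is not a prepared SELECT *.

// Prepared once at startup, reused for every read
PreparedStatement byKey = session.prepare(
        "SELECT col1, col2, col3 FROM ks.my_table WHERE pk = ?");

// Per request (someKey is a placeholder for the partition key value)
Row row = session.execute(byKey.bind(someKey)).one();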



I did some more testing to see if my client machines were the bottleneck.
For a 6-node Cassandra cluster (each VM having 8-cores), I got 26,000
reads/sec for all of the following:
1) Client nodes:1, Threads: 60
2) Client nodes:3, Threads: 180
3) Client nodes:5, Threads: 300
4) Client nodes:10, Threads: 600
5) Client nodes:20, Threads: 1200

So adding more client nodes or threads to those client nodes is not
having any effect.
I am suspecting Cassandra is simply not allowing me to go any further.

> Primary keys for my schema are:
>  PRIMARY KEY((name, phone), age)
> name: text
> phone: int
> age: int

Yes with such a PK data must be spread on the whole cluster (also taking 
into account the partitioner), strange that the throughput doesn't scale.

I guess you also have verified that you select data randomly?

May be you could have a look at the system traces to see the query plan 
for some requests:
If you are on a test cluster you can truncate the tables before 
(truncate system_traces.sessions; and truncate system_traces.events;), 
run a test, then select * from system_traces.events where session_id = xxx,
xxx being one of the session ids you pick in system_traces.sessions.

Try to see if you are not always hitting the same nodes.
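
The same check can also be done from the client side (fragment; session is an existing 
Session, the query and someKey are placeholders): enable tracing on a few requests and 
print which coordinator and replicas served them.

import com.datastax.driver.core.QueryTrace;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

Statement stmt = new SimpleStatement(
        "SELECT * FROM ks.my_table WHERE pk = ?", someKey).enableTracing();
ResultSet rs = session.execute(stmt);

QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
System.out.println("coordinator: " + trace.getCoordinator());
for (QueryTrace.Event event : trace.getEvents()) {
    System.out.println(event.getSource() + "  " + event.getDescription());
}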


--
best,
Alain



Re: How can I scale my read rate?

2017-03-19 Thread James Carman
Have you tried using PreparedStatements?

On Sat, Mar 18, 2017 at 9:47 PM S G <sg.online.em...@gmail.com> wrote:

> ok, I gave the executeAsync() a try.
> Good part is that it was really easy to write the code for that.
> Bad part is that it did not have a huge effect on my throughput - I gained
> about 5% increase in throughput.
> I suspect it is so because my queries are all get-by-primary-key queries
> and were anyways completing in less than 2 milliseconds.
> So there was not much wait to begin with.
>
>
> Here is my code:
>
> String getByKeyQueryStr = "Select * from fooTable where key = " + key;
> //ResultSet result = session.execute(getByKeyQueryStr);  // Previous code
> ResultSetFuture future = session.executeAsync(getByKeyQueryStr);
> FutureCallback callback = new MyFutureCallback();
> executor = MoreExecutors.sameThreadExecutor();
> //executor = Executors.newFixedThreadPool(3); // Tried this too, no effect
> //executor = Executors.newFixedThreadPool(10); // Tried this too, no effect
> Futures.addCallback(future, callback, executor);
>
> Can I improve the above code in some way?
> Are there any JMX metrics that can tell me what's going on?
>
> From the vmstat command, I see that CPU idle time is about 70% even though
> I am running about 60 threads per VM
> Total 20 client-VMs with 8 cores each are querying a Cassandra cluster
> with 16 VMs, 8-core each too.
>
> [image: Screen Shot 2017-03-18 at 6.46.03 PM.png]
>
>
> Thanks
> SG
>
>
> On Sat, Mar 18, 2017 at 5:38 PM, S G <sg.online.em...@gmail.com> wrote:
>
> Thanks. It seems that you guys have found executeAsync to yield good
> results.
> I want to share my understanding how this could benefit performance and
> some validation from the group will be awesome.
>
> I will call executeAsync() each time I want to get by primary-key.
> That way, my client thread is not blocked anymore and I can submit a lot
> more requests per unit time.
> The async requests get piled on the underlying Netty I/O thread which
> ensures that it is always busy all the time.
> Earlier, the Netty I/O thread would have wasted some cycles when the
> sync-execute method was processing the results.
> And earlier, the client thread would also have wasted some cycles waiting
> for netty-thread to complete.
>
> With executeAsync(), none of them is waiting.
> Only thing to ensure is that the Netty thread's queue does not grow
> indefinitely.
>
> If the above theory is correct, then it sounds like a really good thing to
> try.
> If not, please do share some more details.
>
>
>
>
> On Sat, Mar 18, 2017 at 2:00 PM, <j.kes...@enercast.de> wrote:
>
> +1 for executeAsync – had a long time to argue that it’s not bad as with
> good old rdbms.
>
>
>
>
>
>
>
> Gesendet von meinem Windows 10 Phone
>
>
>
> *Von: *Arvydas Jonusonis <arvydas.jonuso...@gmail.com>
> *Gesendet: *Samstag, 18. März 2017 19:08
> *An: *user@cassandra.apache.org
> *Betreff: *Re: How can I scale my read rate?
>
>
>
> ..then you're not taking advantage of request pipelining. Use executeAsync
> - this will increase your throughput for sure.
>
>
>
> http://www.datastax.com/dev/blog/java-driver-async-queries
>
>
>
>
>
> On Sat, Mar 18, 2017 at 08:00 S G <sg.online.em...@gmail.com> wrote:
>
> I have enabled JMX but not sure what metrics to look for - they are way
> too many of them.
>
> I am using session.execute(...)
>
>
>
>
>
> On Fri, Mar 17, 2017 at 2:07 PM, Arvydas Jonusonis <
> arvydas.jonuso...@gmail.com> wrote:
>
> It would be interesting to see some of the driver metrics (in your stress
> test tool) - if you enable JMX, they should be exposed by default.
>
> Also, are you using session.execute(..) or session.executeAsync(..) ?
>
>
>
>
>
>
>


Re: How can I scale my read rate?

2017-03-19 Thread Alain Rastoul

On 19/03/2017 02:54, S G wrote:

Forgot to mention that this vmstat picture is for the client-cluster
reading from Cassandra.



Hi SG,


Your numbers are low: 15k req/sec would be ok for a single node; for a 
12-node cluster, something is wrong... how do you measure the 
throughput?


As suggested by others, to achieve good results you have to add threads 
and client VMs: Cassandra scales horizontally, not vertically, i.e. each 
single node's performance will not go up, but if you spread the load by 
adding nodes the global cluster performance will.


Theoretically,
assuming the data and the load are spread on the cluster (*1)
from what you say, with each request at 2ms avg (*2)
you should have 500 req/sec in each thread,
40 threads should give 20k req/sec on each client VM stress application (*3)
and 10 client VMs should give 200k req/sec on the whole cluster (*4)

=
(*1) the partition key (first PK column) must spread data on all nodes
and your testing code must spread the load by selecting evenly spread data.
(This point is very important: can you give information on your schema 
and your data ?)


(*2) to achieve better single client throughput, maybe you could 
prepare the requests, since you are always executing the same requests


(*3) => run more client tests application on each VM

(*4) add more client  VMs (Patrick's suggestion)

with (3) and (4) the throughput of each client will not be better, but 
the global cluster throughput will.

=

There are other factors to take into account if you are also writing to 
the cluster : read path, tombstones, replication, repairs etc. but 
that's not the case here?


Performance testing goes to the limit of our understanding of the system
and is very difficult... hence interesting :)



--
best,
Alain



Re: How can I scale my read rate?

2017-03-18 Thread S G
Forgot to mention that this vmstat picture is for the client-cluster
reading from Cassandra.

On Sat, Mar 18, 2017 at 6:47 PM, S G <sg.online.em...@gmail.com> wrote:

> ok, I gave the executeAsync() a try.
> Good part is that it was really easy to write the code for that.
> Bad part is that it did not have a huge effect on my throughput - I gained
> about 5% increase in throughput.
> I suspect it is so because my queries are all get-by-primary-key queries
> and were anyways completing in less than 2 milliseconds.
> So there was not much wait to begin with.
>
>
> Here is my code:
>
> String getByKeyQueryStr = "Select * from fooTable where key = " + key;
> //ResultSet result = session.execute(getByKeyQueryStr);  // Previous code
> ResultSetFuture future = session.executeAsync(getByKeyQueryStr);
> FutureCallback callback = new MyFutureCallback();
> executor = MoreExecutors.sameThreadExecutor();
> //executor = Executors.newFixedThreadPool(3); // Tried this too, no effect
> //executor = Executors.newFixedThreadPool(10); // Tried this too, no
> effect
> Futures.addCallback(future, callback, executor);
>
> Can I improve the above code in some way?
> Are there any JMX metrics that can tell me what's going on?
>
> From the vmstat command, I see that CPU idle time is about 70% even though
> I am running about 60 threads per VM
> Total 20 client-VMs with 8 cores each are querying a Cassandra cluster
> with 16 VMs, 8-core each too.
>
>
> [image: Screen Shot 2017-03-18 at 6.46.03 PM.png]
>
>
> Thanks
> SG
>
>
> On Sat, Mar 18, 2017 at 5:38 PM, S G <sg.online.em...@gmail.com> wrote:
>
>> Thanks. It seems that you guys have found executeAsync to yield good
>> results.
>> I want to share my understanding how this could benefit performance and
>> some validation from the group will be awesome.
>>
>> I will call executeAsync() each time I want to get by primary-key.
>> That way, my client thread is not blocked anymore and I can submit a lot
>> more requests per unit time.
>> The async requests get piled on the underlying Netty I/O thread which
>> ensures that it is always busy all the time.
>> Earlier, the Netty I/O thread would have wasted some cycles when the
>> sync-execute method was processing the results.
>> And earlier, the client thread would also have wasted some cycles waiting
>> for netty-thread to complete.
>>
>> With executeAsync(), none of them is waiting.
>> Only thing to ensure is that the Netty thread's queue does not grow
>> indefinitely.
>>
>> If the above theory is correct, then it sounds like a really good thing
>> to try.
>> If not, please do share some more details.
>>
>>
>>
>>
>> On Sat, Mar 18, 2017 at 2:00 PM, <j.kes...@enercast.de> wrote:
>>
>>> +1 for executeAsync – had a long time to argue that it’s not bad as with
>>> good old rdbms.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Gesendet von meinem Windows 10 Phone
>>>
>>>
>>>
>>> *Von: *Arvydas Jonusonis <arvydas.jonuso...@gmail.com>
>>> *Gesendet: *Samstag, 18. März 2017 19:08
>>> *An: *user@cassandra.apache.org
>>> *Betreff: *Re: How can I scale my read rate?
>>>
>>>
>>>
>>> ..then you're not taking advantage of request pipelining. Use
>>> executeAsync - this will increase your throughput for sure.
>>>
>>>
>>>
>>> http://www.datastax.com/dev/blog/java-driver-async-queries
>>>
>>>
>>>
>>>
>>>
>>> On Sat, Mar 18, 2017 at 08:00 S G <sg.online.em...@gmail.com> wrote:
>>>
>>> I have enabled JMX but not sure what metrics to look for - they are way
>>> too many of them.
>>>
>>> I am using session.execute(...)
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Mar 17, 2017 at 2:07 PM, Arvydas Jonusonis <
>>> arvydas.jonuso...@gmail.com> wrote:
>>>
>>> It would be interesting to see some of the driver metrics (in your
>>> stress test tool) - if you enable JMX, they should be exposed by default.
>>>
>>> Also, are you using session.execute(..) or session.executeAsync(..) ?
>>>
>>>
>>>
>>>
>>>
>>
>


Re: How can I scale my read rate?

2017-03-18 Thread S G
ok, I gave the executeAsync() a try.
Good part is that it was really easy to write the code for that.
Bad part is that it did not have a huge effect on my throughput - I gained
about 5% increase in throughput.
I suspect it is so because my queries are all get-by-primary-key queries
and were anyways completing in less than 2 milliseconds.
So there was not much wait to begin with.


Here is my code:

String getByKeyQueryStr = "Select * from fooTable where key = " + key;
//ResultSet result = session.execute(getByKeyQueryStr);  // Previous code
ResultSetFuture future = session.executeAsync(getByKeyQueryStr);
FutureCallback callback = new MyFutureCallback();
executor = MoreExecutors.sameThreadExecutor();
//executor = Executors.newFixedThreadPool(3); // Tried this too, no effect
//executor = Executors.newFixedThreadPool(10); // Tried this too, no effect
Futures.addCallback(future, callback, executor);

Can I improve the above code in some way?
Are there any JMX metrics that can tell me what's going on?

>From the vmstat command, I see that CPU idle time is about 70% even though
I am running about 60 threads per VM
Total 20 client-VMs with 8 cores each are querying a Cassandra cluster with
16 VMs, 8-core each too.


​
​


Thanks
SG


On Sat, Mar 18, 2017 at 5:38 PM, S G <sg.online.em...@gmail.com> wrote:

> Thanks. It seems that you guys have found executeAsync to yield good
> results.
> I want to share my understanding how this could benefit performance and
> some validation from the group will be awesome.
>
> I will call executeAsync() each time I want to get by primary-key.
> That way, my client thread is not blocked anymore and I can submit a lot
> more requests per unit time.
> The async requests get piled on the underlying Netty I/O thread which
> ensures that it is always busy all the time.
> Earlier, the Netty I/O thread would have wasted some cycles when the
> sync-execute method was processing the results.
> And earlier, the client thread would also have wasted some cycles waiting
> for netty-thread to complete.
>
> With executeAsync(), none of them is waiting.
> Only thing to ensure is that the Netty thread's queue does not grow
> indefinitely.
>
> If the above theory is correct, then it sounds like a really good thing to
> try.
> If not, please do share some more details.
>
>
>
>
> On Sat, Mar 18, 2017 at 2:00 PM, <j.kes...@enercast.de> wrote:
>
>> +1 for executeAsync – had a long time to argue that it’s not bad as with
>> good old rdbms.
>>
>>
>>
>>
>>
>>
>>
>> Gesendet von meinem Windows 10 Phone
>>
>>
>>
>> *Von: *Arvydas Jonusonis <arvydas.jonuso...@gmail.com>
>> *Gesendet: *Samstag, 18. März 2017 19:08
>> *An: *user@cassandra.apache.org
>> *Betreff: *Re: How can I scale my read rate?
>>
>>
>>
>> ..then you're not taking advantage of request pipelining. Use
>> executeAsync - this will increase your throughput for sure.
>>
>>
>>
>> http://www.datastax.com/dev/blog/java-driver-async-queries
>>
>>
>>
>>
>>
>> On Sat, Mar 18, 2017 at 08:00 S G <sg.online.em...@gmail.com> wrote:
>>
>> I have enabled JMX but not sure what metrics to look for - they are way
>> too many of them.
>>
>> I am using session.execute(...)
>>
>>
>>
>>
>>
>> On Fri, Mar 17, 2017 at 2:07 PM, Arvydas Jonusonis <
>> arvydas.jonuso...@gmail.com> wrote:
>>
>> It would be interesting to see some of the driver metrics (in your stress
>> test tool) - if you enable JMX, they should be exposed by default.
>>
>> Also, are you using session.execute(..) or session.executeAsync(..) ?
>>
>>
>>
>>
>>
>


Re: How can I scale my read rate?

2017-03-18 Thread S G
Thanks. It seems that you guys have found executeAsync to yield good
results.
I want to share my understanding how this could benefit performance and
some validation from the group will be awesome.

I will call executeAsync() each time I want to get by primary-key.
That way, my client thread is not blocked anymore and I can submit a lot
more requests per unit time.
The async requests get piled on the underlying Netty I/O thread which
ensures that it is always busy all the time.
Earlier, the Netty I/O thread would have wasted some cycles when the
sync-execute method was processing the results.
And earlier, the client thread would also have wasted some cycles waiting
for netty-thread to complete.

With executeAsync(), none of them is waiting.
Only thing to ensure is that the Netty thread's queue does not grow
indefinitely.

If the above theory is correct, then it sounds like a really good thing to
try.
If not, please do share some more details.
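
On the last point, a small sketch of one way to bound the backlog (not the original test 
code; session, byKey and keys are placeholders): cap the number of in-flight async requests 
with a Semaphore so the driver's internal queues cannot grow without bound.

import java.util.concurrent.Semaphore;

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.MoreExecutors;

final Semaphore inFlight = new Semaphore(512);        // max concurrent requests, tune as needed

for (String key : keys) {
    inFlight.acquireUninterruptibly();                // blocks once 512 requests are pending
    ResultSetFuture future = session.executeAsync(byKey.bind(key));
    Futures.addCallback(future, new FutureCallback<ResultSet>() {
        @Override public void onSuccess(ResultSet rs) {
            inFlight.release();
            // consume rs here
        }
        @Override public void onFailure(Throwable t) {
            inFlight.release();
            t.printStackTrace();
        }
    }, MoreExecutors.sameThreadExecutor());
}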




On Sat, Mar 18, 2017 at 2:00 PM, <j.kes...@enercast.de> wrote:

> +1 for executeAsync – had a long time to argue that it’s not bad as with
> good old rdbms.
>
>
>
>
>
>
>
> Gesendet von meinem Windows 10 Phone
>
>
>
> *Von: *Arvydas Jonusonis <arvydas.jonuso...@gmail.com>
> *Gesendet: *Samstag, 18. März 2017 19:08
> *An: *user@cassandra.apache.org
> *Betreff: *Re: How can I scale my read rate?
>
>
>
> ..then you're not taking advantage of request pipelining. Use executeAsync
> - this will increase your throughput for sure.
>
>
>
> http://www.datastax.com/dev/blog/java-driver-async-queries
>
>
>
>
>
> On Sat, Mar 18, 2017 at 08:00 S G <sg.online.em...@gmail.com> wrote:
>
> I have enabled JMX but not sure what metrics to look for - they are way
> too many of them.
>
> I am using session.execute(...)
>
>
>
>
>
> On Fri, Mar 17, 2017 at 2:07 PM, Arvydas Jonusonis <
> arvydas.jonuso...@gmail.com> wrote:
>
> It would be interesting to see some of the driver metrics (in your stress
> test tool) - if you enable JMX, they should be exposed by default.
>
> Also, are you using session.execute(..) or session.executeAsync(..) ?
>
>
>
>
>


AW: How can I scale my read rate?

2017-03-18 Thread j.kesten
+1 for executeAsync – had a long time to argue that it’s not bad as with good 
old rdbms. 



Gesendet von meinem Windows 10 Phone

Von: Arvydas Jonusonis
Gesendet: Samstag, 18. März 2017 19:08
An: user@cassandra.apache.org
Betreff: Re: How can I scale my read rate?

..then you're not taking advantage of request pipelining. Use executeAsync - 
this will increase your throughput for sure.

http://www.datastax.com/dev/blog/java-driver-async-queries


On Sat, Mar 18, 2017 at 08:00 S G <sg.online.em...@gmail.com> wrote:
I have enabled JMX but not sure what metrics to look for - they are way too 
many of them.
I am using session.execute(...)


On Fri, Mar 17, 2017 at 2:07 PM, Arvydas Jonusonis 
<arvydas.jonuso...@gmail.com> wrote:
It would be interesting to see some of the driver metrics (in your stress test 
tool) - if you enable JMX, they should be exposed by default.

Also, are you using session.execute(..) or session.executeAsync(..) ?





Re: Can I do point in time recover using nodetool

2017-03-08 Thread Hannu Kröger
Yes,

It's possible. I haven't seen good instructions online though. The
Cassandra docs are quite bad as well.

I think I asked about it in this list and therefore I suggest you check the
mailing list archive as Mr. Roth suggested.

Hannu
On Wed, 8 Mar 2017 at 10.50, benjamin roth  wrote:

> I remember a very similar question on the list some months ago.
> The short answer is that there is no short answer. I'd recommend you
> search the mailing list archive for "backup" or "recover".
>
> 2017-03-08 10:17 GMT+01:00 Bhardwaj, Rahul :
>
> Hi All,
>
>
>
> Is there any possibility of restoring cassandra snapshots to point in time
> without using opscenter ?
>
>
>
>
>
>
>
>
>
> *Thanks and Regards*
>
> *Rahul Bhardwaj*
>
>
>
>
>


Re: Can I do point in time recover using nodetool

2017-03-08 Thread benjamin roth
I remember a very similar question on the list some months ago.
The short answer is that there is no short answer. I'd recommend you search
the mailing list archive for "backup" or "recover".

2017-03-08 10:17 GMT+01:00 Bhardwaj, Rahul :

> Hi All,
>
>
>
> Is there any possibility of restoring cassandra snapshots to point in time
> without using opscenter ?
>
>
>
>
>
>
>
>
>
> *Thanks and Regards*
>
> *Rahul Bhardwaj*
>
>
>


Can I do point in time recover using nodetool

2017-03-08 Thread Bhardwaj, Rahul
Hi All,

Is there any possibility of restoring cassandra snapshots to point in time 
without using opscenter ?




Thanks and Regards
Rahul Bhardwaj



Can I monitor Read Repair from the logs

2016-11-04 Thread James Rothering
What should I grep for in the logs to see if read repair is happening on a
table?


Re: How can I make Cassandra stable in a 2GB RAM node environment ?

2016-03-19 Thread Alain RODRIGUEZ
Hi, I am not sure I understood your message correctly but I will try to
answer it.

but, I think, in Cassandra case, it seems a matter of how much data we use
> with how much memory we have.


If you are saying you can use poor commodity servers (vertically scale
poorly) and just add nodes (horizontal scaling) when the cluster is not
powerful enough, you need to know that a minimum of vertical scaling is
needed to have good performance and stability. Yet, with some tuning,
you can probably reach a stable state with t2.mediums if there are enough
of them to handle the load.

with default configuration except for leveledCompactionStrategy


LeveledCompactionStrategy is heavier to maintain than STCS. In such an
environment, read latency is probably not your main concern, and using
STCS could give better results as it is way lighter in terms of compactions
(depends on your use case though).
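
For example, switching an existing table back to STCS is a single statement
(keyspace/table names are placeholders):

ALTER TABLE ks.my_table
  WITH compaction = {'class': 'SizeTieredCompactionStrategy'};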

I also used a 4GB RAM machine (t2.medium)
>

With 4GB of RAM you probably want to use 1 GB of heap. What version of
cassandra are you using ?
You might also need to tune bloomfilters, index_interval, memtables size
and type, and a few other things to reduce the memory footprint.

About compaction, use only half of the cores as concurrent compactors (one
core) and see if this improves stability and compaction can still keep up.
Or keep 2 and reduce its speed by lowering the compaction throughput.

Use nodetool {tpstats, compactionstats, cfstats, cfhistograms} to monitor
things and see what to tune.
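
For example (illustrative values only, not a recommendation):

# cassandra.yaml
concurrent_compactors: 1              # one compactor on a 2-core box
compaction_throughput_mb_per_sec: 8   # throttle compaction I/O (default is 16)

# or adjust the throttle at runtime and watch the effect:
nodetool setcompactionthroughput 8
nodetool compactionstats
nodetool tpstats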

As told earlier, using this low spec machines is fine if you know how to
tune Cassandra and can afford some research / tuning time...

Alain
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-03-12 6:58 GMT+01:00 Hiroyuki Yamada :

> Thank you all to respond and discuss my question.
>
> I agree with you all basically,
> but, I think, in Cassandra case, it seems a matter of how much data we use
> with how much memory we have.
>
> As Jack's (and datastax's) suggestion,
> I also used a 4GB RAM machine (t2.medium) with 1 billion records (about
> 100GB in size) with default configuration except for
> leveledCompactionStrategy,
> but after completion of insertion from an application program, probably
> compaction kept working,
> and again, later Cassandra was killed by OOM killer.
>
> Insertion from application side is finished, so the issue is maybe from
> compaction happening in background.
> Is there any recommended configuration in compaction to make Cassandra
> stable with large dataset (more than 100GB) with kind of low memory (4GB)
> environment ?
>
> I think it would be the same thing if I try the experiment with 8GB memory
> and larger data set (maybe more than 2 billion records).
> (If it is not correct, please explain why.)
>
>
> Best regards,
> Hiro
>
> On Fri, Mar 11, 2016 at 4:19 AM, Robert Coli  wrote:
>
>> On Thu, Mar 10, 2016 at 3:27 AM, Alain RODRIGUEZ 
>> wrote:
>>
>>> So, like Jack, I globally really do not recommend it unless you know what
>>> you are doing and don't care about facing those issues.
>>>
>>
>> Certainly a spectrum of views here, but everyone (including OP) seems to
>> agree with the above. :D
>>
>> =Rob
>>
>>
>
>


Re: How can I make Cassandra stable in a 2GB RAM node environment ?

2016-03-11 Thread Hiroyuki Yamada
Thank you all to respond and discuss my question.

I agree with you all basically,
but, I think, in Cassandra case, it seems a matter of how much data we use
with how much memory we have.

As Jack's (and datastax's) suggestion,
I also used a 4GB RAM machine (t2.medium) with 1 billion records (about 100GB
in size) with default configuration except for leveledCompactionStrategy,
but after completion of insertion from an application program, probably
compaction kept working,
and again, later Cassandra was killed by OOM killer.

Insertion from application side is finished, so the issue is maybe from
compaction happening in background.
Is there any recommended configuration in compaction to make Cassandra
stable with large dataset (more than 100GB) with kind of low memory (4GB)
environment ?

I think it would be the same thing if I try the experiment with 8GB memory
and larger data set (maybe more than 2 billion records).
(If it is not correct, please explain why.)


Best regards,
Hiro

On Fri, Mar 11, 2016 at 4:19 AM, Robert Coli  wrote:

> On Thu, Mar 10, 2016 at 3:27 AM, Alain RODRIGUEZ 
> wrote:
>
>> So, like Jack, I globally really not recommend it unless you know what
>> you are doing and don't care about facing those issues.
>>
>
> Certainly a spectrum of views here, but everyone (including OP) seems to
> agree with the above. :D
>
> =Rob
>
>


Re: How can I make Cassandra stable in a 2GB RAM node environment ?

2016-03-10 Thread Robert Coli
On Thu, Mar 10, 2016 at 3:27 AM, Alain RODRIGUEZ  wrote:

> So, like Jack, I globally really do not recommend it unless you know what you
> are doing and don't care about facing those issues.
>

Certainly a spectrum of views here, but everyone (including OP) seems to
agree with the above. :D

=Rob


Re: How can I make Cassandra stable in a 2GB RAM node environment ?

2016-03-10 Thread Alain RODRIGUEZ
+1 for Rob comment.

I would add that I have been learning a lot from running t1.micro (then
small, medium, Large, ..., i2.2XL) on AWS machines (800 MB RAM). I had to
tweak every single parameter in cassandra.yaml and cassandra-env.sh. So I
learned a lot about internals, I had to! Even if I am glad I had this chance
to learn, I must say production wasn't that stable (latency was not
predictable, a compaction was a big event to handle...).

So, like Jack, I globally really do not recommend it unless you know what you
are doing and don't care about facing those issues.

There are also people running Cassandra on Raspberry, so yes, it is doable
and it is really up to you =).

Good luck if you go this way.

C*heers
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-03-09 23:31 GMT+01:00 Jack Krupansky :

> Thanks, Rob, but... I'll continue to do my best to strongly (vehemently,
> or is there an even stronger word for me to use?!) discourage use of
> Cassandra in under 4/8 GB of memory. Hey, I just want people to be happy,
> and trying to run Cassandra in under 8 GB (or 4 GB for dev) is just...
> asking for trouble, unhappiness, even despair. Hey, if somebody is smart
> enough to figure out how to do it on their own, then great, they are set
> and don't need our help, but personally I would declare it as out of
> bounds/off limits. But if anybody else here wants to support/encourage it,
> they are free to do so and I won't get in their way other than to state my
> own view.
>
> By "support", I primarily mean what the (open source) code does out of the
> box without superhuman effort (BTW, all of the guys at Open Source
> Connection ARE superhuman!!) as well as the support of memory of the
> community here on this list.
>
> Doc? If anybody thinks there is a better source of doc for open source
> Cassandra than the DataStax doc, please point me to it. Until then, I'll
> stick with the DataStax doc
>
> That said, it might be interesting to have a no-memory/low-memory mode for
> Cassandra which trades off performance for storage capacity. But... that
> would be an enhancement, not something that is "supported" out of the box
> today. What use cases would this satisfy? I mean, who is it that can get
> away with sacrificing performance these days?
>
> -- Jack Krupansky
>
> On Mon, Mar 7, 2016 at 3:29 PM, Ben Bromhead  wrote:
>
>> +1 for
>> http://opensourceconnections.com/blog/2013/08/31/building-the-perfect-cassandra-test-environment/
>> 
>>
>>
>> We also run Cassandra on t2.mediums for our Developer clusters. You can
>> force Cassandra to do most "memory" things by hitting the disk instead (on
>> disk compaction passes, flush immediately to disk) and by throttling client
>> connections. In fact on the t2 series memory is not the biggest concern,
>> but rather the CPU credit issue.
>>
>> On Mon, 7 Mar 2016 at 11:53 Robert Coli  wrote:
>>
>>> On Fri, Mar 4, 2016 at 8:27 PM, Jack Krupansky >> > wrote:
>>>
 Please review the minimum hardware requirements as clearly documented:

 http://docs.datastax.com/en/cassandra/3.x/cassandra/planning/planPlanningHardware.html

>>>
>>> That is a document for Datastax Cassandra, not Apache Cassandra. It's
>>> wonderful that Datastax provides docs, but Datastax Cassandra is a superset
>>> of Apache Cassandra. Presuming that the requirements of one are exactly
>>> equivalent to the requirements of the other is not necessarily reasonable.
>>>
>>> Please adjust your hardware usage to at least meet the clearly
 documented minimum requirements. If you continue to encounter problems once
 you have corrected your configuration error, please resubmit the details
 with updated hardware configuration details.

>>>
>>> Disagree. OP specifically stated that they knew this was not a
>>> recommended practice. It does not seem unlikely that they are constrained
>>> to use this hardware for reasons outside of their control.
>>>
>>>
 Just to be clear, development on less than 4 GB is not supported and
 production on less than 8 GB is not supported. Those are not suggestions or
 guidelines or recommendations, they are absolute requirements.

>>>
>>> What does "supported" mean here? That Datastax will not provide support
>>> if you do not follow the above recommendations? Because it certainly is
>>> "supported" in the sense of "it can be made to work" ... ?
>>>
>>> The premise of a minimum RAM level seems meaningless without context.
>>> How much data are you serving from your 2GB RAM node? What is the rate of
>>> client requests?
>>>
>>> To be clear, I don't recommend trying to run production Cassandra with
>>> under 8GB of RAM on your node, but "absolute requirement" is a serious overstatement.

Re: How can I make Cassandra stable in a 2GB RAM node environment ?

2016-03-09 Thread Jack Krupansky
Thanks, Rob, but... I'll continue to do my best to strongly (vehemently, or
is there an even stronger word for me to use?!) discourage use of Cassandra
in under 4/8 GB of memory. Hey, I just want people to be happy, and trying
to run Cassandra in under 8 GB (or 4 GB for dev) is just... asking for
trouble, unhappiness, even despair. Hey, if somebody is smart enough to
figure out how to do it on their own, then great, they are set and don't
need our help, but personally I would declare it as out of bounds/off
limits. But if anybody else here wants to support/encourage it, they are
free to do so and I won't get in their way other than to state my own view.

By "support", I primarily mean what the (open source) code does out of the
box without superhuman effort (BTW, all of the guys at Open Source
Connection ARE superhuman!!) as well as the support of memory of the
community here on this list.

Doc? If anybody thinks there is a better source of doc for open source
Cassandra than the DataStax doc, please point me to it. Until then, I'll
stick with the DataStax doc

That said, it might be interesting to have a no-memory/low-memory mode for
Cassandra which trades off performance for storage capacity. But... that
would be an enhancement, not something that is "supported" out of the box
today. What use cases would this satisfy? I mean, who is it that can get
away with sacrificing performance these days?

-- Jack Krupansky

On Mon, Mar 7, 2016 at 3:29 PM, Ben Bromhead  wrote:

> +1 for
> http://opensourceconnections.com/blog/2013/08/31/building-the-perfect-cassandra-test-environment/
> 
>
>
> We also run Cassandra on t2.mediums for our Developer clusters. You can
> force Cassandra to do most "memory" things by hitting the disk instead (on
> disk compaction passes, flush immediately to disk) and by throttling client
> connections. In fact on the t2 series memory is not the biggest concern,
> but rather the CPU credit issue.
>
> On Mon, 7 Mar 2016 at 11:53 Robert Coli  wrote:
>
>> On Fri, Mar 4, 2016 at 8:27 PM, Jack Krupansky 
>> wrote:
>>
>>> Please review the minimum hardware requirements as clearly documented:
>>>
>>> http://docs.datastax.com/en/cassandra/3.x/cassandra/planning/planPlanningHardware.html
>>>
>>
>> That is a document for Datastax Cassandra, not Apache Cassandra. It's
>> wonderful that Datastax provides docs, but Datastax Cassandra is a superset
>> of Apache Cassandra. Presuming that the requirements of one are exactly
>> equivalent to the requirements of the other is not necessarily reasonable.
>>
>> Please adjust your hardware usage to at least meet the clearly documented
>>> minimum requirements. If you continue to encounter problems once you have
>>> corrected your configuration error, please resubmit the details with
>>> updated hardware configuration details.
>>>
>>
>> Disagree. OP specifically stated that they knew this was not a
>> recommended practice. It does not seem unlikely that they are constrained
>> to use this hardware for reasons outside of their control.
>>
>>
>>> Just to be clear, development on less than 4 GB is not supported and
>>> production on less than 8 GB is not supported. Those are not suggestions or
>>> guidelines or recommendations, they are absolute requirements.
>>>
>>
>> What does "supported" mean here? That Datastax will not provide support
>> if you do not follow the above recommendations? Because it certainly is
>> "supported" in the sense of "it can be made to work" ... ?
>>
>> The premise of a minimum RAM level seems meaningless without context. How
>> much data are you serving from your 2GB RAM node? What is the rate of
>> client requests?
>>
>> To be clear, I don't recommend trying to run production Cassandra with
>> under 8GB of RAM on your node, but "absolute requirement" is a serious
>> overstatement.
>>
>>
>> http://opensourceconnections.com/blog/2013/08/31/building-the-perfect-cassandra-test-environment/
>>
>> Has some good discussion of how to run Cassandra in a low memory
>> environment. Maybe someone should tell John that his 64MB of JVM heap for a
>> test node is 62x too small to be "supported"? :D
>>
>> =Rob
>>
>> --
> Ben Bromhead
> CTO | Instaclustr 
> +1 650 284 9692
> Managed Cassandra / Spark on AWS, Azure and Softlayer
>


Re: How can I make Cassandra stable in a 2GB RAM node environment ?

2016-03-07 Thread Ben Bromhead
+1 for
http://opensourceconnections.com/blog/2013/08/31/building-the-perfect-cassandra-test-environment/



We also run Cassandra on t2.mediums for our Developer clusters. You can
force Cassandra to do most "memory" things by hitting the disk instead (on
disk compaction passes, flush immediately to disk) and by throttling client
connections. In fact on the t2 series memory is not the biggest concern,
but rather the CPU credit issue.

On Mon, 7 Mar 2016 at 11:53 Robert Coli  wrote:

> On Fri, Mar 4, 2016 at 8:27 PM, Jack Krupansky 
> wrote:
>
>> Please review the minimum hardware requirements as clearly documented:
>>
>> http://docs.datastax.com/en/cassandra/3.x/cassandra/planning/planPlanningHardware.html
>>
>
> That is a document for Datastax Cassandra, not Apache Cassandra. It's
> wonderful that Datastax provides docs, but Datastax Cassandra is a superset
> of Apache Cassandra. Presuming that the requirements of one are exactly
> equivalent to the requirements of the other is not necessarily reasonable.
>
> Please adjust your hardware usage to at least meet the clearly documented
>> minimum requirements. If you continue to encounter problems once you have
>> corrected your configuration error, please resubmit the details with
>> updated hardware configuration details.
>>
>
> Disagree. OP specifically stated that they knew this was not a recommended
> practice. It does not seem unlikely that they are constrained to use this
> hardware for reasons outside of their control.
>
>
>> Just to be clear, development on less than 4 GB is not supported and
>> production on less than 8 GB is not supported. Those are not suggestions or
>> guidelines or recommendations, they are absolute requirements.
>>
>
> What does "supported" mean here? That Datastax will not provide support if
> you do not follow the above recommendations? Because it certainly is
> "supported" in the sense of "it can be made to work" ... ?
>
> The premise of a minimum RAM level seems meaningless without context. How
> much data are you serving from your 2GB RAM node? What is the rate of
> client requests?
>
> To be clear, I don't recommend trying to run production Cassandra with
> under 8GB of RAM on your node, but "absolute requirement" is a serious
> overstatement.
>
>
> http://opensourceconnections.com/blog/2013/08/31/building-the-perfect-cassandra-test-environment/
>
> Has some good discussion of how to run Cassandra in a low memory
> environment. Maybe someone should tell John that his 64MB of JVM heap for a
> test node is 62x too small to be "supported"? :D
>
> =Rob
>
> --
Ben Bromhead
CTO | Instaclustr 
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer


Re: How can I make Cassandra stable in a 2GB RAM node environment ?

2016-03-07 Thread Robert Coli
On Fri, Mar 4, 2016 at 8:27 PM, Jack Krupansky 
wrote:

> Please review the minimum hardware requirements as clearly documented:
>
> http://docs.datastax.com/en/cassandra/3.x/cassandra/planning/planPlanningHardware.html
>

That is a document for Datastax Cassandra, not Apache Cassandra. It's
wonderful that Datastax provides docs, but Datastax Cassandra is a superset
of Apache Cassandra. Presuming that the requirements of one are exactly
equivalent to the requirements of the other is not necessarily reasonable.

Please adjust your hardware usage to at least meet the clearly documented
> minimum requirements. If you continue to encounter problems once you have
> corrected your configuration error, please resubmit the details with
> updated hardware configuration details.
>

Disagree. OP specifically stated that they knew this was not a recommended
practice. It does not seem unlikely that they are constrained to use this
hardware for reasons outside of their control.


> Just to be clear, development on less than 4 GB is not supported and
> production on less than 8 GB is not supported. Those are not suggestions or
> guidelines or recommendations, they are absolute requirements.
>

What does "supported" mean here? That Datastax will not provide support if
you do not follow the above recommendations? Because it certainly is
"supported" in the sense of "it can be made to work" ... ?

The premise of a minimum RAM level seems meaningless without context. How
much data are you serving from your 2GB RAM node? What is the rate of
client requests?

To be clear, I don't recommend trying to run production Cassandra with
under 8GB of RAM on your node, but "absolute requirement" is a serious
overstatement.

http://opensourceconnections.com/blog/2013/08/31/building-the-perfect-cassandra-test-environment/

Has some good discussion of how to run Cassandra in a low memory
environment. Maybe someone should tell John that his 64MB of JVM heap for a
test node is 62x too small to be "supported"? :D

=Rob


Re: How can I make Cassandra stable in a 2GB RAM node environment ?

2016-03-04 Thread Jack Krupansky
Please review the minimum hardware requirements as clearly documented:
http://docs.datastax.com/en/cassandra/3.x/cassandra/planning/planPlanningHardware.html

Please adjust your hardware usage to at least meet the clearly documented
minimum requirements. If you continue to encounter problems once you have
corrected your configuration error, please resubmit the details with
updated hardware configuration details.

Just to be clear, development on less than 4 GB is not supported and
production on less than 8 GB is not supported. Those are not suggestions or
guidelines or recommendations, they are absolute requirements.

-- Jack Krupansky

On Fri, Mar 4, 2016 at 9:04 PM, Hiroyuki Yamada <mogwa...@gmail.com> wrote:

> Hi,
>
> I'm working on some POCs for Cassandra in a single-node 2GB RAM
> environment, and some issues came up, so let me ask here.
>
> I have tried to insert about 200 million records (about 11GB in size) to
> the node,
> and the insertion from an application program seems completed,
> but something (probably compaction?) was happening after the insertion and
> later Cassandra itself was killed by OOM killer.
>
> I've tried to tune the configuration, including the heap size, compaction
> memory settings and the bloom filter settings, to make C* work nicely in the
> low-memory environment, but in any case it doesn't work so far (which means
> I still get an OOM eventually).
>
> I know it is not really recommended to run C* in such a low-memory
> environment, but I am wondering what I can do (what configurations to
> change) to make it a little more stable there.
> (I understand the following configuration is very tight and not really
> recommended, but I just want to make it work for now.)
>
> Could anyone give me some help?
>
>
> Hardware and software :
> - EC2 instance (t2.small: 1vCPU, 2GB RAM)
> - Cassandra 2.2.5
> - JDK 8 (8u73)
>
> Cassandra configurations (what I changed from the default):
> - leveledCompactionStrategy
> - custom configuration settings of cassandra-env.sh
> - MAX_HEAP_SIZE: 640MB
> - HEAP_NEWSIZE: 128MB
> - custom configuration settings of cassandra.yaml
> - commitlog_segment_size_in_mb: 4
> - commitlog_total_space_in_mb: 512
> - sstable_preemptive_open_interval_in_mb: 16
> - file_cache_size_in_mb: 40
> - memtable_heap_space_in_mb: 40
> - key_cache_size_in_mb: 0
> - bloom filter is disabled
>
>
> === debug.log around when Cassandra was killed by OOM killer ===
> DEBUG [NonPeriodicTasks:1] 2016-03-04 00:36:02,378
> FileCacheService.java:177 - Invalidating cache for
> /var/lib/cassandra/data/test/user-adc91d20e15011e586c53fd5b957bea8/tmplink-la-15626-big-Data.db
> DEBUG [NonPeriodicTasks:1] 2016-03-04 00:36:09,903
> FileCacheService.java:177 - Invalidating cache for
> /var/lib/cassandra/data/test/user-adc91d20e15011e586c53fd5b957bea8/tmplink-la-15622-big-Data.db
> DEBUG [NonPeriodicTasks:1] 2016-03-04 00:36:14,360
> FileCacheService.java:177 - Invalidating cache for
> /var/lib/cassandra/data/test/user-adc91d20e15011e586c53fd5b957bea8/tmplink-la-15626-big-Data.db
> DEBUG [NonPeriodicTasks:1] 2016-03-04 00:36:20,004
> FileCacheService.java:177 - Invalidating cache for
> /var/lib/cassandra/data/test/user-adc91d20e15011e586c53fd5b957bea8/tmplink-la-15622-big-Data.db
> ==
>
> === /var/log/message ===
> Mar  4 00:36:22 ip-10-0-0-11 kernel: Killed process 8919 (java)
> total-vm:32407840kB, anon-rss:1535020kB, file-rss:123096kB
> ==
>
>
> Best regards,
> Hiro
>
>


How can I make Cassandra stable in a 2GB RAM node environment ?

2016-03-04 Thread Hiroyuki Yamada
Hi,

I'm working on some POCs for Cassandra in a single-node 2GB RAM environment,
and some issues came up, so let me ask here.

I have tried to insert about 200 million records (about 11GB in size) to
the node,
and the insertion from an application program seems completed,
but something (probably compaction?) was happening after the insertion and
later Cassandra itself was killed by OOM killer.

I've tried to tune the configuration, including the heap size, compaction
memory settings and the bloom filter settings, to make C* work nicely in the
low-memory environment, but in any case it doesn't work so far (which means
I still get an OOM eventually).

I know it is not really recommended to run C* in such a low-memory
environment, but I am wondering what I can do (what configurations to change)
to make it a little more stable there.
(I understand the following configuration is very tight and not really
recommended, but I just want to make it work for now.)

Could anyone give me some help?


Hardware and software :
- EC2 instance (t2.small: 1vCPU, 2GB RAM)
- Cassandra 2.2.5
- JDK 8 (8u73)

Cassandra configurations (what I changed from the default):
- leveledCompactionStrategy
- custom configuration settings of cassandra-env.sh
- MAX_HEAP_SIZE: 640MB
- HEAP_NEWSIZE: 128MB
- custom configuration settings of cassandra.yaml
- commitlog_segment_size_in_mb: 4
- commitlog_total_space_in_mb: 512
- sstable_preemptive_open_interval_in_mb: 16
- file_cache_size_in_mb: 40
- memtable_heap_space_in_mb: 40
- key_cache_size_in_mb: 0
- bloom filter is disabled


=== debug.log around when Cassandra was killed by OOM killer ===
DEBUG [NonPeriodicTasks:1] 2016-03-04 00:36:02,378
FileCacheService.java:177 - Invalidating cache for
/var/lib/cassandra/data/test/user-adc91d20e15011e586c53fd5b957bea8/tmplink-la-15626-big-Data.db
DEBUG [NonPeriodicTasks:1] 2016-03-04 00:36:09,903
FileCacheService.java:177 - Invalidating cache for
/var/lib/cassandra/data/test/user-adc91d20e15011e586c53fd5b957bea8/tmplink-la-15622-big-Data.db
DEBUG [NonPeriodicTasks:1] 2016-03-04 00:36:14,360
FileCacheService.java:177 - Invalidating cache for
/var/lib/cassandra/data/test/user-adc91d20e15011e586c53fd5b957bea8/tmplink-la-15626-big-Data.db
DEBUG [NonPeriodicTasks:1] 2016-03-04 00:36:20,004
FileCacheService.java:177 - Invalidating cache for
/var/lib/cassandra/data/test/user-adc91d20e15011e586c53fd5b957bea8/tmplink-la-15622-big-Data.db
==

=== /var/log/message ===
Mar  4 00:36:22 ip-10-0-0-11 kernel: Killed process 8919 (java)
total-vm:32407840kB, anon-rss:1535020kB, file-rss:123096kB
==


Best regards,
Hiro


Re: How can I specify the file_data_directories for a keyspace

2015-08-25 Thread Jeff Jirsa
At this point, it is only/automatically managed by cassandra, but if you’re 
clever with mount points you can probably work around the limitation.



From:  Ahmed Eljami
Reply-To:  user@cassandra.apache.org
Date:  Tuesday, August 25, 2015 at 2:09 AM
To:  user@cassandra.apache.org
Subject:  How can I specify the file_data_directories for a keyspace

When I defines several file_data_directories in cassandra.yaml, would it be 
possible to specify the location keyspace and tables ? or it is only and 
automatically managed by Cassandra.

Thx.

-- 
Ahmed ELJAMI





How can I specify the file_data_directories for a keyspace

2015-08-25 Thread Ahmed Eljami
When I defines several file_data_directories in cassandra.yaml, would it be
possible to specify the location keyspace and tables ? or it is * only* and
*automatically* managed by Cassandra.

Thx.

-- 
Ahmed ELJAMI


Can I run upgrade sstables on many nodes on one time

2015-08-13 Thread Ola Nowak
Hi all,
I'm trying to update my 6 node cluster from 2.0.11 to 2.1.8.
I'm following this update procedure:
http://docs.datastax.com/en/upgrade/doc/upgrade/cassandra/upgradeCassandraDetails.html
and the point 8 says: If you are upgrading from a major version (for
example, from Cassandra 1.2 to 2.0) or a major point release (for example,
from Cassandra 2.0 to 2.1), upgrade the SSTables on each node.
$ nodetool upgradesstables
As far as I understand it, I should run nodetool upgradesstables
on every node after upgrading the version on each node. Is that right?
As it is a really time-consuming operation, I wonder if I could run
upgradesstables on multiple nodes at the same time (in parallel)?
Regards,
Ola


RE: Can I run upgrade sstables on many nodes on one time

2015-08-13 Thread SEAN_R_DURITY
Yes, you should run upgradesstables on each node. If the sstable structure has 
changed, you will need this completed before you can do streaming operations 
like repairs or adding nodes.

As for running in parallel, that should be fine. It is a “within the node” 
operation that pounds I/O (but is capped by compaction threshold). You need to 
look at the level of activity from normal operations, though. If Cassandra is 
running without much stress/sweat, go ahead and run 2 at once. (Conservatively, 
that’s all I would do on 6 nodes.) If the cluster is inactive, let it fly on 
all nodes.


Sean Durity
Lead Cassandra Admin, Big Data Team

From: Ola Nowak [mailto:ola.nowa...@gmail.com]
Sent: Thursday, August 13, 2015 5:30 AM
To: user@cassandra.apache.org
Subject: Can I run upgrade sstables on many nodes on one time

Hi all,
I'm trying to update my 6 node cluster from 2.0.11 to 2.1.8.
I'm following this update procedure: 
http://docs.datastax.com/en/upgrade/doc/upgrade/cassandra/upgradeCassandraDetails.html
 and the point 8 says: If you are upgrading from a major version (for example, 
from Cassandra 1.2 to 2.0) or a major point release (for example, from 
Cassandra 2.0 to 2.1), upgrade the SSTables on each node.
$ nodetool upgradesstables
As far as I understand it, I should run nodetool upgradesstables on
every node after upgrading the version on each node. Is that right?
As it is a really time-consuming operation, I wonder if I could run
upgradesstables on multiple nodes at the same time (in parallel)?
Regards,
Ola






Which JMX item can I use to see total cluster (or data center) Read and Write volumes?

2014-11-14 Thread Bob Nilsen
Hi all,

Within DataStax OpsCenter I can see metrics that show total traffic volume
for a cluster and each data center.

How can I find these same numbers amongst all the JMX items?

Thanks,

-- 
Bob Nilsen
rwnils...@gmail.com


Re: Which JMX item can I use to see total cluster (or data center) Read and Write volumes?

2014-11-14 Thread Tyler Hobbs
OpsCenter is aggregating individual metrics across the whole datacenter (or
cluster).  The individual metrics are in
org.apache.cassandra.metrics.ClientRequest.Read.Latency.count and
Write.Latency.count.
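
For illustration (not part of the original reply), a minimal sketch that reads
those two counters from a single node over JMX. It assumes the standard
metrics MBean names and the default JMX port 7199, and that remote JMX access
is enabled; sum the per-node values to get the cluster or datacenter total
that OpsCenter displays.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RequestCounters {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName reads = new ObjectName(
                    "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency");
            ObjectName writes = new ObjectName(
                    "org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency");
            // "Count" is the cumulative number of requests coordinated by this node.
            long readCount = (Long) mbs.getAttribute(reads, "Count");
            long writeCount = (Long) mbs.getAttribute(writes, "Count");
            System.out.println("reads=" + readCount + " writes=" + writeCount);
        } finally {
            connector.close();
        }
    }
}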

On Fri, Nov 14, 2014 at 10:04 AM, Bob Nilsen rwnils...@gmail.com wrote:

 Hi all,

 Within DataStax OpsCenter I can see metrics that show total traffic volume
 for a cluster and each data center.

 How can I find these same numbers amongst all the JMX items?

 Thanks,

 --
 Bob Nilsen
 rwnils...@gmail.com




-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: Can I call getBytes on a text column to get the raw (already encoded UTF8)

2014-06-24 Thread Olivier Michallat
Assuming we're talking about the DataStax Java driver:

getBytes will throw an exception, because it validates that the column is
of type BLOB. But you can use getBytesUnsafe:

ByteBuffer b = row.getBytesUnsafe("aTextColumn");
// if you want to check it:
Charset.forName("UTF-8").decode(b);

Regarding whether this will continue working in the future: from the
driver's perspective, the fact that the native protocol uses UTF-8 is an
implementation detail, but I doubt this will change any time soon.
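
For illustration (not from the original mail), a minimal end-to-end sketch in
the driver 2.x style, assuming a made-up table demo.articles(id text PRIMARY
KEY, body text):

import java.nio.ByteBuffer;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class RawTextRead {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        try {
            Session session = cluster.connect();
            Row row = session.execute(
                    "SELECT body FROM demo.articles WHERE id = ?", "article-1").one();
            if (row != null) {
                // getBytes("body") would throw because the column is text, not blob;
                // getBytesUnsafe skips the type check and returns the raw protocol
                // bytes, which for a text column are already UTF-8 encoded.
                ByteBuffer raw = row.getBytesUnsafe("body");
                // hand `raw` straight to the response without decoding it
                System.out.println("raw UTF-8 payload: " + raw.remaining() + " bytes");
            }
        } finally {
            cluster.close();
        }
    }
}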




On Tue, Jun 24, 2014 at 7:23 AM, DuyHai Doan doanduy...@gmail.com wrote:

 Good idea, bytes are merely processed by the server so you're saving a lot
 of Cpu. AFAIK getBytes should work fine.
 On 24 June 2014 at 05:50, Kevin Burton bur...@spinn3r.com wrote:

 I'm building a webservice whereby I read the data from cassandra, then
 write it over the wire.

 It's going to push LOTS of content, and encoding/decoding performance has
 really bitten us in the future.  So I try to avoid transparent
 encoding/decoding if I can avoid it.

 So right now, I have a huge blob of text that's a 'text' column.

 Logically it *should* be text, because that's what it is...

 Can I just keep it as text so our normal tools work on it, but get it as
 raw UTF8 if I call getBytes?

 This way I can call getBytes and then send it right over the wire as
 pre-encoded UTF8 data.

 ... and of course the question is whether it will continue working in the
 future :-P

 I'll write a test of it of course but I wanted to see what you guys
 thought of this idea.

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.




Re: Can I call getBytes on a text column to get the raw (already encoded UTF8)

2014-06-24 Thread Robert Stupp
You can use getBytesUnsafe on the UTF8 column

--
Sent from my iPhone 

 Am 24.06.2014 um 09:13 schrieb Olivier Michallat 
 olivier.michal...@datastax.com:
 
 Assuming we're talking about the DataStax Java driver:
 
 getBytes will throw an exception, because it validates that the column is of 
 type BLOB. But you can use getBytesUnsafe:
 
 ByteBuffer b = row.getBytesUnsafe("aTextColumn");
 // if you want to check it:
 Charset.forName("UTF-8").decode(b);
 
 Regarding whether this will continue working in the future: from the driver's 
 perspective, the fact that the native protocol uses UTF-8 is an 
 implementation detail, but I doubt this will change any time soon.
 
 
 
 
 On Tue, Jun 24, 2014 at 7:23 AM, DuyHai Doan doanduy...@gmail.com wrote:
 Good idea, bytes are merely processed by the server so you're saving a lot 
 of Cpu. AFAIK getBytes should work fine.
 
 On 24 June 2014 at 05:50, Kevin Burton bur...@spinn3r.com wrote:
 
 I'm building a webservice whereby I read the data from cassandra, then 
 write it over the wire.
 
 It's going to push LOTS of content, and encoding/decoding performance has 
 really bitten us in the future.  So I try to avoid transparent 
 encoding/decoding if I can avoid it.
 
 So right now, I have a huge blob of text that's a 'text' column.
 
 Logically it *should* be text, because that's what it is...
 
 Can I just keep it as text so our normal tools work on it, but get it as 
 raw UTF8 if I call getBytes?
 
 This way I can call getBytes and then send it right over the wire as 
 pre-encoded UTF8 data.
 
 ... and of course the question is whether it will continue working in the 
 future :-P
 
 I'll write a test of it of course but I wanted to see what you guys thought 
 of this idea.
 
 -- 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 Skype: burtonator
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
 people.
 


Re: Can I call getBytes on a text column to get the raw (already encoded UTF8)

2014-06-24 Thread Kevin Burton
Yes… I confirmed that getBytesUnsafe works…

I also have a unit test for it so if cassandra ever changes anything we'll
pick it up.

One point in your above code.  I still think charsets are behind a
synchronized code block.

So your above code wouldn't be super fast on multi-core machines.  I
usually use guava's Charsets class since they have static references to all
of them.

… just wanted to point that out since it could bite someone :-P …
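
For what it's worth, a tiny sketch of that point (not from the original mail):
keep one static Charset reference (JDK 7+ StandardCharsets, or Guava's
Charsets) and only decode when you actually need a String, e.g. in a test.

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8Util {
    // One shared reference; avoids repeated Charset.forName("UTF-8") lookups.
    private static final Charset UTF_8 = StandardCharsets.UTF_8;

    // Charset.decode creates a fresh decoder per call, so this is thread-safe.
    // Decoding a duplicate leaves the caller's buffer position untouched.
    static String decodeForTest(ByteBuffer raw) {
        return UTF_8.decode(raw.duplicate()).toString();
    }
}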




On Tue, Jun 24, 2014 at 12:13 AM, Olivier Michallat 
olivier.michal...@datastax.com wrote:

 Assuming we're talking about the DataStax Java driver:

 getBytes will throw an exception, because it validates that the column is
 of type BLOB. But you can use getBytesUnsafe:

 ByteBuffer b = row.getBytesUnsafe("aTextColumn");
 // if you want to check it:
 Charset.forName("UTF-8").decode(b);

 Regarding whether this will continue working in the future: from the
 driver's perspective, the fact that the native protocol uses UTF-8 is an
 implementation detail, but I doubt this will change any time soon.




 On Tue, Jun 24, 2014 at 7:23 AM, DuyHai Doan doanduy...@gmail.com wrote:

 Good idea, bytes are merely processed by the server so you're saving a
 lot of Cpu. AFAIK getBytes should work fine.
 On 24 June 2014 at 05:50, Kevin Burton bur...@spinn3r.com wrote:

 I'm building a webservice whereby I read the data from cassandra, then
 write it over the wire.

 It's going to push LOTS of content, and encoding/decoding performance
 has really bitten us in the future.  So I try to avoid transparent
 encoding/decoding if I can avoid it.

 So right now, I have a huge blob of text that's a 'text' column.

 Logically it *should* be text, because that's what it is...

 Can I just keep it as text so our normal tools work on it, but get it as
 raw UTF8 if I call getBytes?

 This way I can call getBytes and then send it right over the wire as
 pre-encoded UTF8 data.

 ... and of course the question is whether it will continue working in
 the future :-P

 I'll write a test of it of course but I wanted to see what you guys
 thought of this idea.

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations
 are people.





-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Can I call getBytes on a text column to get the raw (already encoded UTF8)

2014-06-23 Thread Kevin Burton
I'm building a webservice whereby I read the data from cassandra, then
write it over the wire.

It's going to push LOTS of content, and encoding/decoding performance has
really bitten us in the past.  So I try to avoid transparent
encoding/decoding if I can.

So right now, I have a huge blob of text that's a 'text' column.

Logically it *should* be text, because that's what it is...

Can I just keep it as text so our normal tools work on it, but get it as
raw UTF8 if I call getBytes?

This way I can call getBytes and then send it right over the wire as
pre-encoded UTF8 data.

... and of course the question is whether it will continue working in the
future :-P

I'll write a test of it of course but I wanted to see what you guys thought
of this idea.

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: Can I call getBytes on a text column to get the raw (already encoded UTF8)

2014-06-23 Thread DuyHai Doan
Good idea, bytes are merely processed by the server so you're saving a lot
of Cpu. AFAIK getBytes should work fine.
On 24 June 2014 at 05:50, Kevin Burton bur...@spinn3r.com wrote:

 I'm building a webservice whereby I read the data from cassandra, then
 write it over the wire.

 It's going to push LOTS of content, and encoding/decoding performance has
 really bitten us in the future.  So I try to avoid transparent
 encoding/decoding if I can avoid it.

 So right now, I have a huge blob of text that's a 'text' column.

 Logically it *should* be text, because that's what it is...

 Can I just keep it as text so our normal tools work on it, but get it as
 raw UTF8 if I call getBytes?

 This way I can call getBytes and then send it right over the wire as
 pre-encoded UTF8 data.

 ... and of course the question is whether it will continue working in the
 future :-P

 I'll write a test of it of course but I wanted to see what you guys
 thought of this idea.

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.




Re: can I kill very old data files in my data folder (I know that sounds crazy but....)

2014-06-19 Thread Jens Rantil
...and temporarily adding more nodes and rebalancing is not an option?—
Sent from Mailbox

On Wed, Jun 18, 2014 at 9:39 PM, Brian Tarbox tar...@cabotresearch.com
wrote:

 I don't think I have the space to run a major compaction right now (I'm
 above 50% disk space used already) and compaction can take extra space I
 think?
 On Wed, Jun 18, 2014 at 3:24 PM, Robert Coli rc...@eventbrite.com wrote:
 On Wed, Jun 18, 2014 at 12:05 PM, Brian Tarbox tar...@cabotresearch.com
 wrote:

 Thank you!   We are not using TTL, we're manually deleting data more than
 5 days old for this CF.  We're running 1.2.13 and are using size tiered
 compaction (this cf is append-only i.e.zero updates).

 Sounds like we can get away with doing a (stop, delete old-data-file,
 restart) process on a rolling basis if I understand you.


 Sure, though in your case (because you're using STS and can) I'd probably
 just run a major compaction.

 =Rob



can I kill very old data files in my data folder (I know that sounds crazy but....)

2014-06-18 Thread Brian Tarbox
I have a column family that only stores the last 5 days worth of some
data...and yet I have files in the data directory for this CF that are 3
weeks old.  They take the form:

keyspace-CFName-ic--Filter.db
keyspace-CFName-ic--Index.db
keyspace-CFName-ic--Data.db
keyspace-CFName-ic--Statistics.db
keyspace-CFName-ic--TOC.txt
keyspace-CFName-ic--Summary.db

I have six bunches of these file groups, each with a different 
value...and with timestamps of each of the last five days...plus one group
from 3 weeks ago...which makes me wonder if that group somehow should have
been deleted but was not.

The files are tens or hundreds of gigs so deleting would be good, unless
its really bad!

Thanks,

Brian Tarbox


Re: can I kill very old data files in my data folder (I know that sounds crazy but....)

2014-06-18 Thread Robert Coli
On Wed, Jun 18, 2014 at 10:56 AM, Brian Tarbox tar...@cabotresearch.com
wrote:

 I have a column family that only stores the last 5 days worth of some
 data...and yet I have files in the data directory for this CF that are 3
 weeks old.


Are you using TTL? If so :

https://issues.apache.org/jira/browse/CASSANDRA-6654

Are you using size tiered or level compaction?

I have six bunches of these file groups, each with a different 
 value...and with timestamps of each of the last five days...plus one group
 from 3 weeks ago...which makes me wonder if that group  somehow should have
 been deleted but were not.

 The files are tens or hundreds of gigs so deleting would be good, unless
 its really bad!


Data files can't be deleted from the data dir with Cassandra running, but
it should be fine (if probably technically unsupported) to delete them with
Cassandra stopped. In most cases you don't want to do so, because you might
un-mask deleted rows or cause unexpected consistency characteristics.

In your case, you know that no data in files created 3 weeks old can
possibly have any value, so it is safe to delete them.

=Rob


Re: can I kill very old data files in my data folder (I know that sounds crazy but....)

2014-06-18 Thread Brian Tarbox
Rob,
Thank you!   We are not using TTL, we're manually deleting data more than 5
days old for this CF.  We're running 1.2.13 and are using size tiered
compaction (this cf is append-only i.e.zero updates).

Sounds like we can get away with doing a (stop, delete old-data-file,
restart) process on a rolling basis if I understand you.

Thanks,

Brian


On Wed, Jun 18, 2014 at 2:37 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Jun 18, 2014 at 10:56 AM, Brian Tarbox tar...@cabotresearch.com
 wrote:

 I have a column family that only stores the last 5 days worth of some
 data...and yet I have files in the data directory for this CF that are 3
 weeks old.


 Are you using TTL? If so :

 https://issues.apache.org/jira/browse/CASSANDRA-6654

 Are you using size tiered or level compaction?

 I have six bunches of these file groups, each with a different 
 value...and with timestamps of each of the last five days...plus one group
 from 3 weeks ago...which makes me wonder if that group  somehow should have
 been deleted but were not.

 The files are tens or hundreds of gigs so deleting would be good, unless
 its really bad!


 Data files can't be deleted from the data dir with Cassandra running, but
 it should be fine (if probably technically unsupported) to delete them with
 Cassandra stopped. In most cases you don't want to do so, because you might
 un-mask deleted rows or cause unexpected consistency characteristics.

 In your case, you know that no data in files created 3 weeks old can
 possibly have any value, so it is safe to delete them.

 =Rob




Re: can I kill very old data files in my data folder (I know that sounds crazy but....)

2014-06-18 Thread Robert Coli
On Wed, Jun 18, 2014 at 12:05 PM, Brian Tarbox tar...@cabotresearch.com
wrote:

 Thank you!   We are not using TTL, we're manually deleting data more than
 5 days old for this CF.  We're running 1.2.13 and are using size tiered
 compaction (this cf is append-only i.e.zero updates).

 Sounds like we can get away with doing a (stop, delete old-data-file,
 restart) process on a rolling basis if I understand you.


Sure, though in your case (because you're using STS and can) I'd probably
just run a major compaction.

=Rob


Re: can I kill very old data files in my data folder (I know that sounds crazy but....)

2014-06-18 Thread Brian Tarbox
I don't think I have the space to run a major compaction right now (I'm
above 50% disk space used already) and compaction can take extra space I
think?


On Wed, Jun 18, 2014 at 3:24 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Jun 18, 2014 at 12:05 PM, Brian Tarbox tar...@cabotresearch.com
 wrote:

 Thank you!   We are not using TTL, we're manually deleting data more than
 5 days old for this CF.  We're running 1.2.13 and are using size tiered
 compaction (this cf is append-only i.e.zero updates).

 Sounds like we can get away with doing a (stop, delete old-data-file,
 restart) process on a rolling basis if I understand you.


 Sure, though in your case (because you're using STS and can) I'd probably
 just run a major compaction.

 =Rob




Re: [OT]: Can I have a non-delivering subscription?

2014-02-24 Thread Edward Capriolo
You can set up the list to deliver one digest per day as well.

On Saturday, February 22, 2014, Robert Wille rwi...@fold3.com wrote:
 Yeah, it's called a rule. Set one up to delete everything from
 user@cassandra.apache.org.

 On 2/22/14, 10:32 AM, Paul LeoNerd Evans leon...@leonerd.org.uk
 wrote:

A question about the mailing list itself, rather than Cassandra.

I've re-subscribed simply because I have to be subscribed in order to
send to the list, as I sometimes try to when people Cc questions about
my Net::Async::CassandraCQL perl module to me. However, if I want to
read the list, I usually do so on the online archives and not by mail.

Is it possible to have a non-delivering subscription, which would let
me send messages, but doesn't deliver anything back to me?

--
Paul LeoNerd Evans

leon...@leonerd.org.uk
ICQ# 4135350   |  Registered Linux# 179460
http://www.leonerd.org.uk/




-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.


[OT]: Can I have a non-delivering subscription?

2014-02-22 Thread Paul LeoNerd Evans
A question about the mailing list itself, rather than Cassandra.

I've re-subscribed simply because I have to be subscribed in order to
send to the list, as I sometimes try to when people Cc questions about
my Net::Async::CassandraCQL perl module to me. However, if I want to
read the list, I usually do so on the online archives and not by mail.

Is it possible to have a non-delivering subscription, which would let
me send messages, but doesn't deliver anything back to me?

-- 
Paul LeoNerd Evans

leon...@leonerd.org.uk
ICQ# 4135350   |  Registered Linux# 179460
http://www.leonerd.org.uk/




Re: [OT]: Can I have a non-delivering subscription?

2014-02-22 Thread Robert Wille
Yeah, it's called a rule. Set one up to delete everything from
user@cassandra.apache.org.

On 2/22/14, 10:32 AM, Paul LeoNerd Evans leon...@leonerd.org.uk
wrote:

A question about the mailing list itself, rather than Cassandra.

I've re-subscribed simply because I have to be subscribed in order to
send to the list, as I sometimes try to when people Cc questions about
my Net::Async::CassandraCQL perl module to me. However, if I want to
read the list, I usually do so on the online archives and not by mail.

Is it possible to have a non-delivering subscription, which would let
me send messages, but doesn't deliver anything back to me?

-- 
Paul LeoNerd Evans

leon...@leonerd.org.uk
ICQ# 4135350   |  Registered Linux# 179460
http://www.leonerd.org.uk/




Re: how can i get the column value? Need help!.. cassandra 1.28 and pig 0.11.1

2013-09-23 Thread Cyril Scetbon
I tried with 1.2.10 and don't meet the issue anymore.

Regards
-- 
Cyril SCETBON

On Sep 19, 2013, at 10:28 PM, Cyril Scetbon cyril.scet...@free.fr wrote:

 Hi,
 
 Did you try to build 1.2.10 and to use it for your tests ? I've got the same 
 issue and will give it a try as soon as it's released (expected at the end of 
 the week).
 
 Regards
 -- 
 Cyril SCETBON
 
 On Sep 2, 2013, at 3:09 PM, Miguel Angel Martin junquera 
 mianmarjun.mailingl...@gmail.com wrote:
 
 hi all:
 
 More info :
 
 https://issues.apache.org/jira/browse/CASSANDRA-5941
 
 
 
 I tried this (and gen. cassandra 1.2.9)  but do not work for me, 
 
 git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
 cd cassandra
 git checkout cassandra-1.2
 patch -p1  5867-bug-fix-filter-push-down-1.2-branch.txt
 ant
 
 
 
 Miguel Angel Martín Junquera
 Analyst Engineer.
 miguelangel.mar...@brainsins.com
 
 
 
 2013/9/2 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com
 hi:
 
 I test this in cassandra 1.2.9 new  version and the issue still persists .
 
 :-(
 
 
 
 
 
 
 Miguel Angel Martín Junquera
 Analyst Engineer.
 miguelangel.mar...@brainsins.com
 
 
 
 2013/8/30 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com
 I try this:
 
 rows = LOAD 
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING 
 CqlStorage();
 dump rows;
 ILLUSTRATE rows;
 describe rows;
 
 values2= FOREACH rows GENERATE  TOTUPLE (id) as (mycolumn:tuple(name,value));
 dump values2;
 describe values2;
 
 But I get this results:
 
 
 
 -
 | rows | id:chararray   | age:int   | title:chararray   | 
 -
 |  | (id, 6)| (age, 30) | (title, QA)   | 
 -
 
 rows: {id: chararray,age: int,title: chararray}
 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
 ERROR 1031: Incompatable field schema: left is 
 tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray)), right is 
 org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)
 
 
 
 
 
 or 
 
 
 
 
 
 values2= FOREACH rows GENERATE  TOTUPLE (id) ;
 dump values2;
 describe values2;
 
 
 
 and  the results are:
 
 
 ...
 (((id,6)))
 (((id,5)))
 values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
 
 
 
 Aggg!
 
 
 
 
 
 
 
 
 
 Miguel Angel Martín Junquera
 Analyst Engineer.
 miguelangel.mar...@brainsins.com
 
 
 
 2013/8/28 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com
 hi:
 
 I can not understand why the schema is  define like 
 id:chararray,age:int,title:chararray  and it does not define like tuples 
 or bag tuples,  if we have pair key-values  columns
 
 
 I try other time to change schema  but it does not work.
 
 any ideas ...
 
 perhaps, is the issue in the definition cql3 tables ?
 
 regards
 
 
 2013/8/28 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com
 hi all:
 
 
 Regards
 
 Still i can resolve this issue. .
 
 does anybody have this issue or try to test this simple example?
 
 
 i am stumped I can not find a solution working. 
 
 I appreciate any comment or help
 
 
 2013/8/22 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com
 hi all:
 
 
 
 
 I,m testing the new CqlStorage() with cassandra 1.28 and pig 0.11.1 
 
 
 I am using this sample data test:
 
  
 http://frommyworkshop.blogspot.com.es/2013/07/hadoop-map-reduce-with-cassandra.html
 
 And I load and dump data Righ with this script:
 
 rows = LOAD 
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING 
 CqlStorage();
 
 dump rows;
 describe rows;
 
 resutls:
 
 ((id,6),(age,30),(title,QA))
 ((id,5),(age,30),(title,QA))
 rows: {id: chararray,age: int,title: chararray}
 
 
 But i can not  get  the column values 
 
 I try to define   another schemas in Load like I used with cassandraStorage()
 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-and-Pig-how-to-get-column-values-td5641158.html
 
 
 example:
 
 rows = LOAD 
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING 
 CqlStorage() AS (columns: bag {T: tuple(name, value)});
 
 
 and I get this error:
 
 2013-08-22 12:24:45,426 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
 ERROR 1031: Incompatable schema: left is 
 columns:bag{T:tuple(name:bytearray,value:bytearray)}, right is 
 id:chararray,age:int,title:chararray
 
 
 
 I try to use, FLATTEN, SUBSTRING, SPLIT UDF`s but i have not get good result:
 
 Example:
 
 when I flatten , I get a set of tuples like
 (title,QA)
 (title,QA)
 2013-08-22 12:42:20,673 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input 
 paths to process : 1
 A: {title: chararray}
 
 
 but i can get value QA 
 
 Sustring only works with title
 
 
 
 example:
 
 B = FOREACH A GENERATE SUBSTRING(title,2,5);
 
 dump B;
 describe B;
 
 
 results:
 
 (tle)

Re: how can i get the column value? Need help!.. cassandra 1.28 and pig 0.11.1

2013-09-19 Thread Cyril Scetbon
Hi,

Did you try to build 1.2.10 and to use it for your tests ? I've got the same 
issue and will give it a try as soon as it's released (expected at the end of 
the week).

Regards
-- 
Cyril SCETBON

On Sep 2, 2013, at 3:09 PM, Miguel Angel Martin junquera 
mianmarjun.mailingl...@gmail.com wrote:

 hi all:
 
 More info :
 
 https://issues.apache.org/jira/browse/CASSANDRA-5941
 
 
 
 I tried this (and gen. cassandra 1.2.9)  but do not work for me, 
 
 git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
 cd cassandra
 git checkout cassandra-1.2
 patch -p1  5867-bug-fix-filter-push-down-1.2-branch.txt
 ant
 
 
 
 Miguel Angel Martín Junquera
 Analyst Engineer.
 miguelangel.mar...@brainsins.com
 
 
 
 2013/9/2 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com
 hi:
 
 I test this in cassandra 1.2.9 new  version and the issue still persists .
 
 :-(
 
 
 
 
 
 
 Miguel Angel Martín Junquera
 Analyst Engineer.
 miguelangel.mar...@brainsins.com
 
 
 
 2013/8/30 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com
 I try this:
 
 rows = LOAD 
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING 
 CqlStorage();
 dump rows;
 ILLUSTRATE rows;
 describe rows;
 
 values2= FOREACH rows GENERATE  TOTUPLE (id) as (mycolumn:tuple(name,value));
 dump values2;
 describe values2;
 
 But I get this results:
 
 
 
 -
 | rows | id:chararray   | age:int   | title:chararray   | 
 -
 |  | (id, 6)| (age, 30) | (title, QA)   | 
 -
 
 rows: {id: chararray,age: int,title: chararray}
 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1031: Incompatable field schema: left is 
 tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray)), right is 
 org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)
 
 
 
 
 
 or 
 
 
 
 
 
 values2= FOREACH rows GENERATE  TOTUPLE (id) ;
 dump values2;
 describe values2;
 
 
 
 and  the results are:
 
 
 ...
 (((id,6)))
 (((id,5)))
 values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
 
 
 
 Aggg!
 
 
 
 
 
 
 
 
 
 Miguel Angel Martín Junquera
 Analyst Engineer.
 miguelangel.mar...@brainsins.com
 
 
 
 2013/8/28 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com
 hi:
 
 I can not understand why the schema is  define like 
 id:chararray,age:int,title:chararray  and it does not define like tuples or 
 bag tuples,  if we have pair key-values  columns
 
 
 I try other time to change schema  but it does not work.
 
 any ideas ...
 
 perhaps, is the issue in the definition cql3 tables ?
 
 regards
 
 
 2013/8/28 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com
 hi all:
 
 
 Regards
 
 Still i can resolve this issue. .
 
 does anybody have this issue or try to test this simple example?
 
 
 i am stumped I can not find a solution working. 
 
 I appreciate any comment or help
 
 
 2013/8/22 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com
 hi all:
 
 
 
 
 I,m testing the new CqlStorage() with cassandra 1.28 and pig 0.11.1 
 
 
 I am using this sample data test:
 
  
 http://frommyworkshop.blogspot.com.es/2013/07/hadoop-map-reduce-with-cassandra.html
 
 And I load and dump data Righ with this script:
 
 rows = LOAD 
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING 
 CqlStorage();
 
 dump rows;
 describe rows;
 
 resutls:
 
 ((id,6),(age,30),(title,QA))
 ((id,5),(age,30),(title,QA))
 rows: {id: chararray,age: int,title: chararray}
 
 
 But i can not  get  the column values 
 
 I try to define   another schemas in Load like I used with cassandraStorage()
 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-and-Pig-how-to-get-column-values-td5641158.html
 
 
 example:
 
 rows = LOAD 
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING 
 CqlStorage() AS (columns: bag {T: tuple(name, value)});
 
 
 and I get this error:
 
 2013-08-22 12:24:45,426 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1031: Incompatable schema: left is 
 columns:bag{T:tuple(name:bytearray,value:bytearray)}, right is 
 id:chararray,age:int,title:chararray
 
 
 
 I try to use, FLATTEN, SUBSTRING, SPLIT UDF`s but i have not get good result:
 
 Example:
 
 when I flatten , I get a set of tuples like
 (title,QA)
 (title,QA)
 2013-08-22 12:42:20,673 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input 
 paths to process : 1
 A: {title: chararray}
 
 
 but i can get value QA 
 
 Sustring only works with title
 
 
 
 example:
 
 B = FOREACH A GENERATE SUBSTRING(title,2,5);
 
 dump B;
 describe B;
 
 
 results:
 
 (tle)
 (tle)
 B: {chararray}
 
 
 
 I tried this like ERIC LEE in the other mail and have the same results:
 
 
  Anyways, what I really want is the 

Re: How can I switch from multiple disks to a single disk?

2013-09-17 Thread Robert Coli
On Tue, Sep 17, 2013 at 4:01 PM, Juan Manuel Formoso jform...@gmail.comwrote:

 Anyone who knows for sure if this would work?


Sankalp Kohli (whose last name is phonetically awesome!) has pointed you in
the correct direction.

To be a bit more explicit :

1) determine if sstable names are unique across drives (they should be)
2) pre-copy all sstables from all source drives to target single drive
3) drain and stop cassandra
4) re-copy all sstables from all source drives to target single drive, with
--delete or equivalent option to rsync such that you delete any files
missing from source drives due to compaction in the interim
5) start cassandra with new conf file with single drive
6) if it doesn't work for some unforseen reason, you still have all your
sstables in the old dirs, so just revert the conf file and fail back

=Rob


Re: How can I switch from multiple disks to a single disk?

2013-09-17 Thread Juan Manuel Formoso
Thanks! But, shouldn't I be able to just stop Cassandra, copy the files,
change the config and restart? Why should I drain?

My RF+consistency level can handle one replica down (I forgot to mention
that in my OP, apologies)

Would it work in theory?

On Tuesday, September 17, 2013, Robert Coli wrote:

 On Tue, Sep 17, 2013 at 4:01 PM, Juan Manuel Formoso 
 jform...@gmail.comjavascript:_e({}, 'cvml', 'jform...@gmail.com');
  wrote:

 Anyone who knows for sure if this would work?


 Sankalp Kohli (whose last name is phonetically awesome!) has pointed you
 in the correct direction.

 To be a bit more explicit :

 1) determine if sstable names are unique across drives (they should be)
 2) pre-copy all sstables from all source drives to target single drive
 3) drain and stop cassandra
 4) re-copy all sstables from all source drives to target single drive,
 with --delete or equivalent option to rsync such that you delete any files
 missing from source drives due to compaction in the interim
 5) start cassandra with new conf file with single drive
 6) if it doesn't work for some unforseen reason, you still have all your
 sstables in the old dirs, so just revert the conf file and fail back

 =Rob



-- 
*Juan Manuel Formoso
*Senior Geek
http://twitter.com/juanformoso
http://seniorgeek.com.ar
LLAP


Re: How can I switch from multiple disks to a single disk?

2013-09-17 Thread Robert Coli
On Tue, Sep 17, 2013 at 5:57 PM, Juan Manuel Formoso jform...@gmail.comwrote:

 Thanks! But, shouldn't I be able to just stop Cassandra, copy the files,
 change the config and restart? Why should I drain?


If you drain, you reduce to zero the chance of having some problem with the
SSTables flushed as a result of the restart.

However you are correct that you probably do not need to do so... :D

=Rob


Re: How can I switch from multiple disks to a single disk?

2013-09-16 Thread sankalp kohli
I think you can do it by moving all the sstables under one drive. I am not
sure though. The sstable names should be unique across drives.


On Mon, Sep 16, 2013 at 10:14 AM, Juan Manuel Formoso jform...@gmail.comwrote:

 Because I ran out of space when shuffling, I was forced to add multiple
 disks on my Cassandra nodes.

 When I finish compacting, cleaning up, and repairing, I'd like to remove
 them and return to one disk per node.

 What is the procedure to make the switch?
 Can I just kill cassandra, move the data from one disk to the other,
 remove the configuration for the second disk, and re-start cassandra?

 I assume files will not have the same name and thus not be overwritten, is
 this the case? Does it pick it up just like that?

 Thanks

 --
 *Juan Manuel Formoso
 *Senior Geek
 http://twitter.com/juanformoso
 http://seniorgeek.com.ar
 LLAP



Re: how can i get the column value? Need help!.. cassandra 1.28 and pig 0.11.1

2013-09-02 Thread Miguel Angel Martin junquera
hi:

I tested this with the new Cassandra 1.2.9 version and the issue still persists.

:-(




Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com



2013/8/30 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com

 I try this:

 *rows = LOAD
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING
 CqlStorage();*

 *dump rows;*

 *ILLUSTRATE rows;*

 *describe rows;*

 *
 *

 *values2= FOREACH rows GENERATE  TOTUPLE (id) as
 (mycolumn:tuple(name,value));*

 *dump values2;*

 *describe values2;*
 *
 *

 But I get this results:



 -
 | rows | id:chararray   | age:int   | title:chararray   |
 -
 |  | (id, 6)| (age, 30) | (title, QA)   |
 -

 rows: {id: chararray,age: int,title: chararray}
 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt -
 ERROR 1031: Incompatable field schema: left is
 tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray)), right is
 org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)





 or



 

 *values2= FOREACH rows GENERATE  TOTUPLE (id) ;*
 *dump values2;*
 *describe values2;*




 and  the results are:


 ...
 (((id,6)))
 (((id,5)))
 values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}



 Aggg!


 *
 *




 Miguel Angel Martín Junquera
 Analyst Engineer.
 miguelangel.mar...@brainsins.com



 2013/8/28 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com

 hi:

 I can not understand why the schema is  define like 
 *id:chararray,age:int,title:chararray
  and it does not define like tuples or bag tuples,  if we have pair
 key-values  columns*
 *
 *
 *
 *
 *I try other time to change schema  but it does not work.*
 *
 *
 *any ideas ...*
 *
 *
 *perhaps, is the issue in the definition cql3 tables ?*
 *
 *
 *regards*


 2013/8/28 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com

 hi all:


 Regards

 Still i can resolve this issue. .

 does anybody have this issue or try to test this simple example?


 i am stumped I can not find a solution working.

 I appreciate any comment or help


 2013/8/22 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com
 

 hi all:




 I,m testing the new CqlStorage() with cassandra 1.28 and pig 0.11.1


 I am using this sample data test:


 http://frommyworkshop.blogspot.com.es/2013/07/hadoop-map-reduce-with-cassandra.html

 And I load and dump data Righ with this script:

 *rows = LOAD
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING
 CqlStorage();*
 *
 *
 *dump rows;*
 *describe rows;*
 *
 *

 *resutls:

 ((id,6),(age,30),(title,QA))

 ((id,5),(age,30),(title,QA))

 rows: {id: chararray,age: int,title: chararray}


 *


 But i can not  get  the column values

 I try to define   another schemas in Load like I used with
 cassandraStorage()


 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-and-Pig-how-to-get-column-values-td5641158.html


 example:

 *rows = LOAD
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING
 CqlStorage() AS (columns: bag {T: tuple(name, value)});*


 and I get this error:

 *2013-08-22 12:24:45,426 [main] ERROR org.apache.pig.tools.grunt.Grunt
 - ERROR 1031: Incompatable schema: left is
 columns:bag{T:tuple(name:bytearray,value:bytearray)}, right is
 id:chararray,age:int,title:chararray*




 I try to use, FLATTEN, SUBSTRING, SPLIT UDF`s but i have not get good
 result:

 Example:


- when I flatten , I get a set of tuples like

 *(title,QA)*

 *(title,QA)*

 *2013-08-22 12:42:20,673 [main] INFO
  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
 input paths to process : 1*

 *A: {title: chararray}*



 but i can get value QA

 Sustring only works with title



 example:

 *B = FOREACH A GENERATE SUBSTRING(title,2,5);*
 *
 *
 *dump B;*
 *describe B;*
 *
 *
 *
 *

 *results:*
 *
 *

 *(tle)*
 *(tle)*
 *B: {chararray}*




 I tried this like ERIC LEE in the other mail and have the same results:


  Anyways, what I really want is the column value, not the name. Is
 there a way to do that? I listed all of the failed attempts I made below.

- colnames = FOREACH cols GENERATE $1 and was told $1 was out of
bounds.
- casted = FOREACH cols GENERATE (tuple(chararray, chararray))$0;
but all I got back were empty tuples
- values = FOREACH cols GENERATE $0.$1; but I got an error telling
me data byte array can't be casted to tuple


 Please, I will appreciate any help


 Regards









 --

 Miguel Angel Martín Junquera
 Analyst Engineer.
 miguelangel.mar...@brainsins.com
 Tel. / Fax: (+34) 91 485 56 66
 *http://www.brainsins.com*
 Smart eCommerce
 *Madrid*: http://goo.gl/4B5kv
  *London*: http://goo.gl/uIXdv
  *Barcelona*: http://goo.gl/NZslW


Re: how can i get the column value? Need help!.. cassandra 1.28 and pig 0.11.1

2013-09-02 Thread Miguel Angel Martin junquera
hi all:

More info :

https://issues.apache.org/jira/browse/CASSANDRA-5941



I tried this (and built Cassandra 1.2.9), but it does not work for me:

git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
cd cassandra
git checkout cassandra-1.2
patch -p1  5867-bug-fix-filter-push-down-1.2-branch.txt
ant



Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com



2013/9/2 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com

 hi:

 I test this in cassandra 1.2.9 new  version and the issue still persists .

 :-(




 Miguel Angel Martín Junquera
 Analyst Engineer.
 miguelangel.mar...@brainsins.com



 2013/8/30 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com

 I try this:

 *rows = LOAD
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING
 CqlStorage();*

 *dump rows;*

 *ILLUSTRATE rows;*

 *describe rows;*

 *
 *

 *values2= FOREACH rows GENERATE  TOTUPLE (id) as
 (mycolumn:tuple(name,value));*

 *dump values2;*

 *describe values2;*
 *
 *

 But I get this results:



 -
 | rows | id:chararray   | age:int   | title:chararray   |
 -
 |  | (id, 6)| (age, 30) | (title, QA)   |
 -

 rows: {id: chararray,age: int,title: chararray}
 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt -
 ERROR 1031: Incompatable field schema: left is
 tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray)), right is
 org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)





 or



 

 *values2= FOREACH rows GENERATE  TOTUPLE (id) ;*
 *dump values2;*
 *describe values2;*




 and  the results are:


 ...
 (((id,6)))
 (((id,5)))
 values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}



 Aggg!


 *
 *




 Miguel Angel Martín Junquera
 Analyst Engineer.
 miguelangel.mar...@brainsins.com



 2013/8/28 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com

 hi:

 I can not understand why the schema is  define like 
 *id:chararray,age:int,title:chararray
  and it does not define like tuples or bag tuples,  if we have pair
 key-values  columns*
 *
 *
 *
 *
 *I try other time to change schema  but it does not work.*
 *
 *
 *any ideas ...*
 *
 *
 *perhaps, is the issue in the definition cql3 tables ?*
 *
 *
 *regards*


 2013/8/28 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com
 

 hi all:


 Regards

 Still i can resolve this issue. .

 does anybody have this issue or try to test this simple example?


 i am stumped I can not find a solution working.

 I appreciate any comment or help


 2013/8/22 Miguel Angel Martin junquera 
 mianmarjun.mailingl...@gmail.com

 hi all:




 I,m testing the new CqlStorage() with cassandra 1.28 and pig 0.11.1


 I am using this sample data test:


 http://frommyworkshop.blogspot.com.es/2013/07/hadoop-map-reduce-with-cassandra.html

 And I load and dump data Righ with this script:

 *rows = LOAD
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' 
 USING
 CqlStorage();*
 *
 *
 *dump rows;*
 *describe rows;*
 *
 *

 *resutls:

 ((id,6),(age,30),(title,QA))

 ((id,5),(age,30),(title,QA))

 rows: {id: chararray,age: int,title: chararray}


 *


 But i can not  get  the column values

 I try to define   another schemas in Load like I used with
 cassandraStorage()


 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-and-Pig-how-to-get-column-values-td5641158.html


 example:

 *rows = LOAD
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' 
 USING
 CqlStorage() AS (columns: bag {T: tuple(name, value)});*


 and I get this error:

 *2013-08-22 12:24:45,426 [main] ERROR
 org.apache.pig.tools.grunt.Grunt - ERROR 1031: Incompatable schema: left 
 is
 columns:bag{T:tuple(name:bytearray,value:bytearray)}, right is
 id:chararray,age:int,title:chararray*




 I try to use, FLATTEN, SUBSTRING, SPLIT UDF`s but i have not get good
 result:

 Example:


- when I flatten , I get a set of tuples like

 *(title,QA)*

 *(title,QA)*

 *2013-08-22 12:42:20,673 [main] INFO
  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
 input paths to process : 1*

 *A: {title: chararray}*



 but i can get value QA

 Sustring only works with title



 example:

 *B = FOREACH A GENERATE SUBSTRING(title,2,5);*
 *
 *
 *dump B;*
 *describe B;*
 *
 *
 *
 *

 *results:*
 *
 *

 *(tle)*
 *(tle)*
 *B: {chararray}*




 I tried this like ERIC LEE in the other mail and have the same results:


  Anyways, what I really want is the column value, not the name. Is
 there a way to do that? I listed all of the failed attempts I made below.

- colnames = FOREACH cols GENERATE $1 and was told $1 was out of
bounds.
- casted = FOREACH cols GENERATE (tuple(chararray, chararray))$0;
but all I got back were empty tuples

Re: how can i get the column value? Need help!.. cassandra 1.28 and pig 0.11.1

2013-08-30 Thread Miguel Angel Martin junquera
I try this:

*rows = LOAD
'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING
CqlStorage();*

*dump rows;*

*ILLUSTRATE rows;*

*describe rows;*

*
*

*values2= FOREACH rows GENERATE  TOTUPLE (id) as
(mycolumn:tuple(name,value));*

*dump values2;*

*describe values2;*
*
*

But I get this results:



-
| rows | id:chararray   | age:int   | title:chararray   |
-
|  | (id, 6)| (age, 30) | (title, QA)   |
-

rows: {id: chararray,age: int,title: chararray}
2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1031: Incompatable field schema: left is
tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray)), right is
org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)





or





*values2= FOREACH rows GENERATE  TOTUPLE (id) ;*
*dump values2;*
*describe values2;*




and  the results are:


...
(((id,6)))
(((id,5)))
values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}



Aggg!


*
*




Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com



2013/8/28 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com

 hi:

 I can not understand why the schema is  define like 
 *id:chararray,age:int,title:chararray
  and it does not define like tuples or bag tuples,  if we have pair
 key-values  columns*
 *
 *
 *
 *
 *I try other time to change schema  but it does not work.*
 *
 *
 *any ideas ...*
 *
 *
 *perhaps, is the issue in the definition cql3 tables ?*
 *
 *
 *regards*


 2013/8/28 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com

 hi all:


 Regards

 Still i can resolve this issue. .

 does anybody have this issue or try to test this simple example?


 i am stumped I can not find a solution working.

 I appreciate any comment or help


 2013/8/22 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com

 hi all:




 I,m testing the new CqlStorage() with cassandra 1.28 and pig 0.11.1


 I am using this sample data test:


 http://frommyworkshop.blogspot.com.es/2013/07/hadoop-map-reduce-with-cassandra.html

 And I load and dump data Righ with this script:

 *rows = LOAD
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING
 CqlStorage();*
 *
 *
 *dump rows;*
 *describe rows;*
 *
 *

 *resutls:

 ((id,6),(age,30),(title,QA))

 ((id,5),(age,30),(title,QA))

 rows: {id: chararray,age: int,title: chararray}


 *


 But i can not  get  the column values

 I try to define   another schemas in Load like I used with
 cassandraStorage()


 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-and-Pig-how-to-get-column-values-td5641158.html


 example:

 *rows = LOAD
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING
 CqlStorage() AS (columns: bag {T: tuple(name, value)});*


 and I get this error:

 *2013-08-22 12:24:45,426 [main] ERROR org.apache.pig.tools.grunt.Grunt
 - ERROR 1031: Incompatable schema: left is
 columns:bag{T:tuple(name:bytearray,value:bytearray)}, right is
 id:chararray,age:int,title:chararray*




 I try to use, FLATTEN, SUBSTRING, SPLIT UDF`s but i have not get good
 result:

 Example:


- when I flatten , I get a set of tuples like

 *(title,QA)*

 *(title,QA)*

 *2013-08-22 12:42:20,673 [main] INFO
  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
 input paths to process : 1*

 *A: {title: chararray}*



 but i can get value QA

 Sustring only works with title



 example:

 *B = FOREACH A GENERATE SUBSTRING(title,2,5);*
 *
 *
 *dump B;*
 *describe B;*
 *
 *
 *
 *

 *results:*
 *
 *

 *(tle)*
 *(tle)*
 *B: {chararray}*




 i try, this like ERIC LEE inthe other mail  and have the same results:


  Anyways, what I really what is the column value, not the name. Is there
 a way to do that? I listed all of the failed attempts I made below.

- colnames = FOREACH cols GENERATE $1 and was told $1 was out of
bounds.
- casted = FOREACH cols GENERATE (tuple(chararray, chararray))$0;
but all I got back were empty tuples
- values = FOREACH cols GENERATE $0.$1; but I got an error telling
me data byte array can't be casted to tuple


 Please, I will appreciate any help


 Regards









 --

 Miguel Angel Martín Junquera
 Analyst Engineer.
 miguelangel.mar...@brainsins.com
 Tel. / Fax: (+34) 91 485 56 66
 *http://www.brainsins.com*
 Smart eCommerce
 *Madrid*: http://goo.gl/4B5kv
  *London*: http://goo.gl/uIXdv
  *Barcelona*: http://goo.gl/NZslW





 

Re: how can i get the column value? Need help!.. cassandra 1.28 and pig 0.11.1

2013-08-28 Thread Miguel Angel Martin junquera
hi all:


Regards

Still I can not resolve this issue.

Does anybody have this issue, or has anyone tried this simple example?


I am stumped; I can not find a working solution.

I appreciate any comment or help.


2013/8/22 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com

 hi all:




 I,m testing the new CqlStorage() with cassandra 1.28 and pig 0.11.1


 I am using this sample data test:


 http://frommyworkshop.blogspot.com.es/2013/07/hadoop-map-reduce-with-cassandra.html

 And I load and dump data Righ with this script:

 *rows = LOAD
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING
 CqlStorage();*
 *
 *
 *dump rows;*
 *describe rows;*
 *
 *

 *resutls:

 ((id,6),(age,30),(title,QA))

 ((id,5),(age,30),(title,QA))

 rows: {id: chararray,age: int,title: chararray}


 *


 But i can not  get  the column values

 I try to define   another schemas in Load like I used with
 cassandraStorage()


 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-and-Pig-how-to-get-column-values-td5641158.html


 example:

 *rows = LOAD
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING
 CqlStorage() AS (columns: bag {T: tuple(name, value)});*


 and I get this error:

 *2013-08-22 12:24:45,426 [main] ERROR org.apache.pig.tools.grunt.Grunt -
 ERROR 1031: Incompatable schema: left is
 columns:bag{T:tuple(name:bytearray,value:bytearray)}, right is
 id:chararray,age:int,title:chararray*




 I try to use, FLATTEN, SUBSTRING, SPLIT UDF`s but i have not get good
 result:

 Example:


- when I flatten , I get a set of tuples like

 *(title,QA)*

 *(title,QA)*

 *2013-08-22 12:42:20,673 [main] INFO
  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
 input paths to process : 1*

 *A: {title: chararray}*



 but i can get value QA

 Sustring only works with title



 example:

 *B = FOREACH A GENERATE SUBSTRING(title,2,5);*
 *
 *
 *dump B;*
 *describe B;*
 *
 *
 *
 *

 *results:*
 *
 *

 *(tle)*
 *(tle)*
 *B: {chararray}*




 i try, this like ERIC LEE inthe other mail  and have the same results:


  Anyways, what I really what is the column value, not the name. Is there a
 way to do that? I listed all of the failed attempts I made below.

- colnames = FOREACH cols GENERATE $1 and was told $1 was out of
bounds.
- casted = FOREACH cols GENERATE (tuple(chararray, chararray))$0; but
all I got back were empty tuples
- values = FOREACH cols GENERATE $0.$1; but I got an error telling me
data byte array can't be casted to tuple


 Please, I will appreciate any help


 Regards









-- 

Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com
Tel. / Fax: (+34) 91 485 56 66
*http://www.brainsins.com*
Smart eCommerce
*Madrid*: http://goo.gl/4B5kv
*London*: http://goo.gl/uIXdv
*Barcelona*: http://goo.gl/NZslW



Re: how can i get the column value? Need help!.. cassandra 1.28 and pig 0.11.1

2013-08-28 Thread Miguel Angel Martin junquera
hi:

I can not understand why the schema is defined as id:chararray,age:int,title:chararray and not as tuples or a bag of tuples, given that the columns come back as name/value pairs.

I tried again to change the schema, but it does not work.

Any ideas?

Perhaps the issue is in the CQL3 table definition?
*regards*
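
For what it's worth, the two loaders expose rows differently, which may be the source of the confusion: CassandraStorage (the Thrift view) returns each row as a key plus a bag of (name, value) column tuples, while CqlStorage builds its schema from the CQL3 table definition, one named and typed field per CQL column. That is also why an AS (columns: bag {T: tuple(name, value)}) clause is rejected: CqlStorage already supplies a fixed schema, and any AS clause has to match it. A rough, untested sketch of the two shapes (the cassandra:// URL is only illustrative):

-- Thrift view: row key plus a bag of (name, value) columns
-- rowsA: {key: chararray, columns: {(name: chararray, value: bytearray)}}
rowsA = LOAD 'cassandra://keyspace1/test' USING CassandraStorage();

-- CQL3 view: one named field per column, as reported by describe
-- rowsB: {id: chararray, age: int, title: chararray}
rowsB = LOAD 'cql://keyspace1/test' USING CqlStorage();
vals  = FOREACH rowsB GENERATE id, age, title;

If the fields really do come back as (name, value) pairs at runtime even though describe reports plain scalars, the declared schema and the actual data disagree, and that would point at the CqlStorage/driver version rather than the Pig script.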


2013/8/28 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com

 hi all:


 Regards

 Still i can resolve this issue. .

 does anybody have this issue or try to test this simple example?


 i am stumped I can not find a solution working.

 I appreciate any comment or help


 2013/8/22 Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com

 hi all:




 I,m testing the new CqlStorage() with cassandra 1.28 and pig 0.11.1


 I am using this sample data test:


 http://frommyworkshop.blogspot.com.es/2013/07/hadoop-map-reduce-with-cassandra.html

 And I load and dump data Righ with this script:

 *rows = LOAD
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING
 CqlStorage();*
 *
 *
 *dump rows;*
 *describe rows;*
 *
 *

 *resutls:

 ((id,6),(age,30),(title,QA))

 ((id,5),(age,30),(title,QA))

 rows: {id: chararray,age: int,title: chararray}


 *


 But i can not  get  the column values

 I try to define   another schemas in Load like I used with
 cassandraStorage()


 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-and-Pig-how-to-get-column-values-td5641158.html


 example:

 *rows = LOAD
 'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING
 CqlStorage() AS (columns: bag {T: tuple(name, value)});*


 and I get this error:

 *2013-08-22 12:24:45,426 [main] ERROR org.apache.pig.tools.grunt.Grunt -
 ERROR 1031: Incompatable schema: left is
 columns:bag{T:tuple(name:bytearray,value:bytearray)}, right is
 id:chararray,age:int,title:chararray*




 I try to use, FLATTEN, SUBSTRING, SPLIT UDF`s but i have not get good
 result:

 Example:


- when I flatten , I get a set of tuples like

 *(title,QA)*

 *(title,QA)*

 *2013-08-22 12:42:20,673 [main] INFO
  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
 input paths to process : 1*

 *A: {title: chararray}*



 but i can get value QA

 Sustring only works with title



 example:

 *B = FOREACH A GENERATE SUBSTRING(title,2,5);*
 *
 *
 *dump B;*
 *describe B;*
 *
 *
 *
 *

 *results:*
 *
 *

 *(tle)*
 *(tle)*
 *B: {chararray}*




 i try, this like ERIC LEE inthe other mail  and have the same results:


  Anyways, what I really what is the column value, not the name. Is there
 a way to do that? I listed all of the failed attempts I made below.

- colnames = FOREACH cols GENERATE $1 and was told $1 was out of
bounds.
- casted = FOREACH cols GENERATE (tuple(chararray, chararray))$0; but
all I got back were empty tuples
- values = FOREACH cols GENERATE $0.$1; but I got an error telling me
data byte array can't be casted to tuple


 Please, I will appreciate any help


 Regards









 --

 Miguel Angel Martín Junquera
 Analyst Engineer.
 miguelangel.mar...@brainsins.com
 Tel. / Fax: (+34) 91 485 56 66
 *http://www.brainsins.com*
 Smart eCommerce
 *Madrid*: http://goo.gl/4B5kv
  *London*: http://goo.gl/uIXdv
  *Barcelona*: http://goo.gl/NZslW





-- 

Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com


how can i get the column value? Need help!.. cassandra 1.28 and pig 0.11.1

2013-08-22 Thread Miguel Angel Martin junquera
hi all:




I'm testing the new CqlStorage() with Cassandra 1.2.8 and Pig 0.11.1.


I am using this sample data test:


http://frommyworkshop.blogspot.com.es/2013/07/hadoop-map-reduce-with-cassandra.html

And I load and dump data right with this script:

*rows = LOAD
'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING
CqlStorage();*
*
*
*dump rows;*
*describe rows;*
*
*

*results:

((id,6),(age,30),(title,QA))

((id,5),(age,30),(title,QA))

rows: {id: chararray,age: int,title: chararray}


*


But I can not get the column values.

I tried to define other schemas in LOAD, as I used to do with
CassandraStorage():

http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-and-Pig-how-to-get-column-values-td5641158.html


example:

*rows = LOAD
'cql://keyspace1/test?page_size=1split_size=4where_clause=age%3D30' USING
CqlStorage() AS (columns: bag {T: tuple(name, value)});*


and I get this error:

*2013-08-22 12:24:45,426 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1031: Incompatable schema: left is
columns:bag{T:tuple(name:bytearray,value:bytearray)}, right is
id:chararray,age:int,title:chararray*




I tried the FLATTEN, SUBSTRING and SPLIT UDFs, but I have not got good
results:

Example:


   - when I FLATTEN, I get a set of tuples like

*(title,QA)*

*(title,QA)*

*2013-08-22 12:42:20,673 [main] INFO
 org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
input paths to process : 1*

*A: {title: chararray}*



but I can not get the value QA.

SUBSTRING only works on the name 'title':



example:

*B = FOREACH A GENERATE SUBSTRING(title,2,5);*
*
*
*dump B;*
*describe B;*
*
*
*
*

*results:*
*
*

*(tle)*
*(tle)*
*B: {chararray}*




I tried this like Eric Lee in the other mail and got the same results:


 Anyway, what I really want is the column value, not the name. Is there a
way to do that? I listed all of the failed attempts I made below.

   - colnames = FOREACH cols GENERATE $1 and was told $1 was out of bounds.
   - casted = FOREACH cols GENERATE (tuple(chararray, chararray))$0; but
   all I got back were empty tuples
   - values = FOREACH cols GENERATE $0.$1; but I got an error telling me
   data byte array can't be casted to tuple


Please, I will appreciate any help


Regards
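
Since describe reports rows as three named, typed fields (id, age, title), one thing worth trying is to project the columns directly by those names instead of treating them as (name, value) pairs. A minimal, untested sketch against the same test table (the separators in the URL are assumed to be the usual query-string ampersands):

rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30'
       USING CqlStorage();
-- project the CQL columns by the names reported by describe
vals = FOREACH rows GENERATE id, age, title;
dump vals;
describe vals;

If that works, the values QA and 30 should appear as plain fields in vals; if dump still prints (id,6)-style pairs, see the notes in the replies above about the schema mismatch.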


Re: Can I create a counter column family with many rows in 1.1.10?

2013-03-06 Thread Alain RODRIGUEZ
What would be the exact CQL3 syntax to create a counter CF with a composite
row key and no predefined column names?

Is the following supposed to work ?

CREATE TABLE composite_counter (
   aid   text,
   key1  text,
   key2  text,
   key3  text,
   value counter,
   PRIMARY KEY (aid, key1, key2, key3)
)

First, when I do so I have no error shown, but I *can't* see this CF appear
in my OpsCenter.

update composite_counter set value = value + 5 where aid = '1' and key1 =
'test1' and key2 = 'test2' and key3 = 'test3'; works as expected too.

But how can I have multiple counter columns using the schemaless property
of Cassandra? I mean, before, when I created counter CFs with the cli, things
like this used to work:
update composite_counter set 'value2' = 'value2' + 5 where aid = '1' and
key1 = 'test1' and key2 = 'test2' and key3 = 'test3'; = Bad Request: line
1:29 no viable alternative at input 'value2'

I also tried:
update composite_counter set value2 = value2 + 5 where aid = '1' and key1
= 'test1' and key2 = 'test2' and key3 = 'test3';   = Bad Request: Unknown
identifier value2 (as expected I guess)

I want to make a counter CF with composite keys and a lot of counters using
this pattern 20130306#event or (20130306, event), not sure if I should
use composite columns there.

Is it mandatory to create the CF with at least one column with the
counter type ? I mean I will probably never use a column named 'value', I
defined it just to be sure the CF is defined as a counter CF.
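
One way to get many counters per logical row, while still respecting the CQL3 rule that every non-key column of a counter table must be a counter, is to move the counter name into the clustering key, so each counter becomes its own CQL row inside the same partition. A hedged sketch along those lines (the name column is just illustrative):

CREATE TABLE composite_counter (
    aid   text,
    key1  text,
    key2  text,
    key3  text,
    name  text,      -- e.g. '20130306#event'
    value counter,
    PRIMARY KEY (aid, key1, key2, key3, name)
);

UPDATE composite_counter SET value = value + 5
 WHERE aid = '1' AND key1 = 'test1' AND key2 = 'test2'
   AND key3 = 'test3' AND name = '20130306#event';

Internally this still maps to one wide row per aid, with one counter column per (key1, key2, key3, name) combination, which is roughly the schemaless behaviour the cli used to give.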




2013/3/6 Abhijit Chanda abhijit.chan...@gmail.com

 Thanks @aaron  for the rectification


 On Wed, Mar 6, 2013 at 1:17 PM, aaron morton aa...@thelastpickle.comwrote:

 Note that CQL 3 in 1.1 is  compatible with CQL 3 in 1.2. Also you do not
 have to use CQL 3, you can still use the cassandra-cli to create CF's.

 The syntax you use to populate it depends on the client you are using.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 5/03/2013, at 9:16 PM, Abhijit Chanda abhijit.chan...@gmail.com
 wrote:

 Yes you can , you just have to use CQL3 and 1.1.10 onward cassandra
 supports CQL3.  Just you have to aware of the fact that a column family
 that contains a counter column can only contain counters. In other other
 words either all the columns of the column family excluding KEY have the
 counter type or none of them can have it.

 Best Regards,
 --
 Abhijit Chanda
 +91-974395





 --
 Abhijit Chanda
 +91-974395



RE: Can I create a counter column family with many rows in 1.1.10?

2013-03-06 Thread Mateus Ferreira e Freitas

Ah, it's with many columns, not rows. I use this in CQL 2/3:

create table cnt (key text PRIMARY KEY, y2003 counter, y2004 counter);

It says this is not a counter column family, and if I try to use
default_validation_class=CounterType, it says this is not a valid keyword.
What am I supposed to type in order to create it?

From: aa...@thelastpickle.com
Subject: Re: Can I create a counter column family with many rows in 1.1.10?
Date: Tue, 5 Mar 2013 23:47:38 -0800
To: user@cassandra.apache.org

Note that CQL 3 in 1.1 is  compatible with CQL 3 in 1.2. Also you do not have 
to use CQL 3, you can still use the cassandra-cli to create CF's. 
The syntax you use to populate it depends on the client you are using. 
Cheers 

-Aaron MortonFreelance Cassandra DeveloperNew Zealand
@aaronmortonhttp://www.thelastpickle.com



On 5/03/2013, at 9:16 PM, Abhijit Chanda abhijit.chan...@gmail.com wrote:Yes 
you can , you just have to use CQL3 and 1.1.10 onward cassandra supports CQL3.  
Just you have to aware of the fact that a column family that contains a counter 
column can only contain counters. In other other words either all the columns 
of the column family excluding KEY have the counter type or none of them can 
have it.

Best Regards,
-- 
Abhijit Chanda
+91-974395



  

RE: Can I create a counter column family with many rows in 1.1.10?

2013-03-06 Thread Mateus Ferreira e Freitas

I got it now.

From: mateus.ffrei...@hotmail.com
To: user@cassandra.apache.org
Subject: RE: Can I create a counter column family with many rows in 1.1.10?
Date: Wed, 6 Mar 2013 08:42:37 -0300





Ah, I'ts with many columns, not rows. I use this in cql 2-3 create table cnt 
(key text PRIMARY KEY, y2003 counter, y2004 counter);it says this is not a 
counter column family, and if I try to use 
default_validation_class=CounterType,it says this is not a valid keyword.What 
I'm supposed to type in order to create it?

From: aa...@thelastpickle.com
Subject: Re: Can I create a counter column family with many rows in 1.1.10?
Date: Tue, 5 Mar 2013 23:47:38 -0800
To: user@cassandra.apache.org

Note that CQL 3 in 1.1 is  compatible with CQL 3 in 1.2. Also you do not have 
to use CQL 3, you can still use the cassandra-cli to create CF's. 
The syntax you use to populate it depends on the client you are using. 
Cheers 

-Aaron MortonFreelance Cassandra DeveloperNew Zealand
@aaronmortonhttp://www.thelastpickle.com



On 5/03/2013, at 9:16 PM, Abhijit Chanda abhijit.chan...@gmail.com wrote:Yes 
you can , you just have to use CQL3 and 1.1.10 onward cassandra supports CQL3.  
Just you have to aware of the fact that a column family that contains a counter 
column can only contain counters. In other other words either all the columns 
of the column family excluding KEY have the counter type or none of them can 
have it.

Best Regards,
-- 
Abhijit Chanda
+91-974395




  

Re: Can I create a counter column family with many rows in 1.1.10?

2013-03-06 Thread aaron morton
If you have one column in the table that is not part of the primary key and is 
a counter, then all columns that are not part of the primary key must also be a 
counter. 
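
As a quick illustration of that rule, a hedged CQL3 sketch (table and column names are made up): the first definition is accepted because every non-primary-key column is a counter, while the second is rejected because it mixes a counter with a regular column.

-- accepted: all non-primary-key columns are counters
CREATE TABLE page_hits (
    page   text PRIMARY KEY,
    views  counter,
    clicks counter
);

-- rejected: mixes a counter with a non-counter column
CREATE TABLE page_hits_bad (
    page   text PRIMARY KEY,
    views  counter,
    owner  text
);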

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 6/03/2013, at 2:56 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 What would be the exact CQL3 syntax to create a counter CF with composite row 
 key and not predefined column names ?
 
 Is the following supposed to work ?
 
 CREATE TABLE composite_counter (
aid   text,
key1  text,
key2  text,
key3  text,
value counter,
PRIMARY KEY (aid, key1, key2, key3)
 )
 
 First, when I do so I have no error shown, but I *can't* see this CF appear 
 in my OpsCenter.
 
 update composite_counter set value = value + 5 where aid = '1' and key1 = 
 'test1' and key2 = 'test2' and key3 = 'test3'; works as expected too.
 
 But how can I have multiple counter columns using the schemaless property of 
 cassandra ? I mean before, when I created counter CF with cli, things like 
 this used to work:
 update composite_counter set 'value2' = 'value2' + 5 where aid = '1' and 
 key1 = 'test1' and key2 = 'test2' and key3 = 'test3'; = Bad Request: line 
 1:29 no viable alternative at input 'value2'
 
 I also tried:
 update composite_counter set value2 = value2 + 5 where aid = '1' and key1 = 
 'test1' and key2 = 'test2' and key3 = 'test3';   = Bad Request: Unknown 
 identifier value2 (as expected I guess)
 
 I want to make a counter CF with composite keys and a lot of counters using 
 this pattern 20130306#event or (20130306, event), not sure if I should 
 use composite columns there.
 
 Is it mandatory to create the CF with at least one column with the counter 
 type ? I mean I will probably never use a column named 'value', I defined it 
 just to be sure the CF is defined as a counter CF.
 
 
 
 
 2013/3/6 Abhijit Chanda abhijit.chan...@gmail.com
 Thanks @aaron  for the rectification
 
 
 On Wed, Mar 6, 2013 at 1:17 PM, aaron morton aa...@thelastpickle.com wrote:
 Note that CQL 3 in 1.1 is  compatible with CQL 3 in 1.2. Also you do not have 
 to use CQL 3, you can still use the cassandra-cli to create CF's. 
 
 The syntax you use to populate it depends on the client you are using. 
 
 Cheers
  
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 5/03/2013, at 9:16 PM, Abhijit Chanda abhijit.chan...@gmail.com wrote:
 
 Yes you can , you just have to use CQL3 and 1.1.10 onward cassandra supports 
 CQL3.  Just you have to aware of the fact that a column family that contains 
 a counter column can only contain counters. In other other words either all 
 the columns of the column family excluding KEY have the counter type or none 
 of them can have it.
 
 Best Regards,
 -- 
 Abhijit Chanda
 +91-974395
 
 
 
 
 -- 
 Abhijit Chanda
 +91-974395
 



  1   2   >