Re: Which version is stable enough for production environment?

2016-12-02 Thread Hugo José Pinto
All,

Many thanks for this enlightening thread.

We're about to go live with a client for a pre-production environment, and
must decide on which 3.x version to use. We will probably need to perform
regular repairs, so we are obviously worried about both CASSANDRA-12905 and
CASSANDRA-12888 that Benjamin referred to.

Hence, the two golden questions:

1) Are these issues already present in 3.0.x?

2) What would be the best 3.x version to put in production at this moment?

Many thanks for any help you can come up with,

--
Hugo José Pinto


>
> LeveledCompaction: Have you checked if there were major changes in the
> LeveledCompactionStrategy between 2.x and 3.x?
>
> 2016-11-30 21:04 GMT+01:00 Harikrishnan Pillai :
>
>> https://issues.apache.org/jira/browse/CASSANDRA-12728
>>
>> [CASSANDRA-12728] Handling partially written hint files
>> https://issues.apache.org/jira/browse/CASSANDRA-12844
>>
>>
>> Also, when I tested some of our write-heavy workloads, Leveled Compaction
>> was not keeping up. With the same system settings, 2.1.16 performed better
>> and all levels were properly aligned.
>> --
>> *From:* Benjamin Roth 
>> *Sent:* Tuesday, November 29, 2016 11:20:19 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Which version is stable enough for production environment?
>>
>> What are the compaction issues / hint corruptions you encountered? Are
>> there JIRA tickets for them?
>> I am curious because I use 3.10 (trunk) in production.
>>
>> For anyone who is planning to use MVs:
>> They basically work. We have been using them in production for some months,
>> BUT (and it's quite a big one) maintenance is a pain. Bootstrapping and
>> repairs may be - depending on the model, config, and amount of data - really,
>> really painful. I'm currently investigating intensively.
>>
>> 2016-11-30 3:11 GMT+01:00 Harikrishnan Pillai :
>>
>>> 3.0 has the "off-heap memtable" implementation removed, so if you have a
>>> requirement for it, it's not available. If you don't have that requirement,
>>> 3.0.9 can be tried out. We did some testing on 3.9 and found a lot of
>>> issues with compaction, hint corruption, etc.
>>>
>>> Regards
>>>
>>> Hari
>>>
>>>
>>> --
>>> *From:* Discovery 
>>> *Sent:* Tuesday, November 29, 2016 5:59 PM
>>> *To:* user
>>> *Subject:* Re: Which version is stable enough for production
>>> environment?
>>>
>>> Why is version 3.x not recommended?  Thanks.
>>>
>>>
>>> -- Original --
>>> *From:* "Harikrishnan Pillai"
>>> *Date:* Wed, Nov 30, 2016 09:57 AM
>>> *To:* "user"
>>> *Subject: * Re: Which version is stable enough for production
>>> environment?
>>>
>>> Cassandra 2.1.16
>>>
>>>
>>> --
>>> *From:* Discovery 
>>> *Sent:* Tuesday, November 29, 2016 5:42 PM
>>> *To:* user
>>> *Subject:* Which version is stable enough for production environment?
>>>
>>> Hi Cassandra Experts,
>>>
>>>   We are preparing to deploy Cassandra in a production environment, but
>>> we cannot determine which version is stable and recommended. Could someone
>>> on this mailing list give a suggestion? Thanks in advance!
>>>
>>>
>>> Best Regards
>>> Discovery
>>> 11/30/2016
>>>
>>
>>
>>
>> --
>> Benjamin Roth
>> Prokurist (Authorized Signatory)
>>
>> Jaumo GmbH · www.jaumo.com
>> Wehrstraße 46 · 73035 Göppingen · Germany
>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>
>
>
>
> --
> Benjamin Roth
> Prokurist (Authorized Signatory)
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>


Timeseries: Include rows immediately adjacent to range query?

2015-01-13 Thread Hugo José Pinto
Hi!

We're using cassandra to store a time series, using a table similar to:

CREATE TABLE timeseries (
    source_id uuid,
    tstamp timestamp,
    value text,
    PRIMARY KEY (source_id, tstamp)
) WITH CLUSTERING ORDER BY (tstamp DESC);

With that, we do a ranged query with tstamp > x and tstamp < y to gather
all events within a time window.

Now, due to the variable granularity of incoming data, we have no guarantee
that our data points are close to the requested interval. In order to
compute derived values, we'd need the values immediately before and after
the requested range.
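
For concreteness, here is the window query we run today plus the kind of
"one row either side" lookups we think we need, sketched with the DataStax
Python driver (connection details, names, and values are placeholders):

import uuid
from datetime import datetime

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])          # placeholder contact point
session = cluster.connect('my_keyspace')  # placeholder keyspace

source = uuid.uuid4()                     # stand-in for a real source_id
start = datetime(2015, 1, 1)
end = datetime(2015, 1, 2)

# 1. The window itself -- what we already do today.
window = session.execute(
    "SELECT tstamp, value FROM timeseries "
    "WHERE source_id = %s AND tstamp > %s AND tstamp < %s",
    (source, start, end))

# 2. The row immediately *before* the window: clustering is tstamp DESC,
#    so the first row at or below `start` is the closest older one.
before = session.execute(
    "SELECT tstamp, value FROM timeseries "
    "WHERE source_id = %s AND tstamp <= %s LIMIT 1",
    (source, start))

# 3. The row immediately *after* the window: reverse the clustering
#    order so the closest newer row comes first.
after = session.execute(
    "SELECT tstamp, value FROM timeseries "
    "WHERE source_id = %s AND tstamp >= %s "
    "ORDER BY tstamp ASC LIMIT 1",
    (source, end))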

Any ideas on what would be the best way to approach this in Cassandra CQL?

Many thanks for any help!

-- 
Hugo José Pinto


Re: Best approach in Cassandra (+ Spark?) for Continuous Queries?

2015-01-04 Thread Hugo José Pinto
Many thanks once again.

I rethought the target data structure, and things started coming together to 
allow for really elegant, compact ESP preprocessing and storage.

Best.

Sent from my iPhone

On 03/01/2015, at 23:53, Peter Lin  wrote:

> 
> if you like SQL dialects, try out products that use StreamSQL to do continuous 
> queries. Esper comes to mind. Google to see what other products support 
> StreamSQL
> 
>> On Sat, Jan 3, 2015 at 6:48 PM, Hugo José Pinto  
>> wrote:
>> Thanks :)
>> 
>> Duly noted - this is all uncharted territory for us, hence the value of 
>> seasoned advice.
>> 
>> 
>> Best
>> 
>> --
>> Hugo José Pinto
>> 
>> On 03/01/2015, at 23:43, Peter Lin  wrote:
>> 
>>> 
>>> listen to colin's advice, avoid the temptation of anti-patterns.
>>> 
>>>> On Sat, Jan 3, 2015 at 6:10 PM, Colin  wrote:
>>>> Use a message bus with a transactional get: get the message, send it to 
>>>> Cassandra, and upon write success, submit it to the ESP and commit the get 
>>>> on the bus. Messaging systems like RabbitMQ support this semantic.
>>>> 
>>>> Using Cassandra as a queuing mechanism is an anti-pattern.
>>>> 
>>>> --
>>>> Colin Clark 
>>>> +1-320-221-9531
>>>>  
>>>> 
>>>>> On Jan 3, 2015, at 6:07 PM, Hugo José Pinto  
>>>>> wrote:
>>>>> 
>>>>> Thank you all for your answers.
>>>>> 
>>>>> It seems I'll have to go with some event-driven processing before/during 
>>>>> the Cassandra write path. 
>>>>> 
>>>>> My concern would be that I'd love to first guarantee the disk write of 
>>>>> the Cassandra persistence and then do the event processing (which is 
>>>>> mostly CRUD intercepts at this point), even if slightly delayed, and 
>>>>> doing so via triggers would probably bog down the whole processing 
>>>>> pipeline. 
>>>>> 
>>>>> What I'd probably do is to write, in a trigger, to a separate key table 
>>>>> with all the CRUDed elements and have the ESP process that table.
>>>>> 
>>>>> Thank you for your contribution. Should anyone else have any experience 
>>>>> in these scenarios, I'm obviously all ears as well. 
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Hugo 
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>> On 03/01/2015, at 11:09, DuyHai Doan  wrote:
>>>>> 
>>>>>> Hello Hugo
>>>>>> 
>>>>>>  I was facing the same kind of requirement from some users. Long story 
>>>>>> short, below are the possible strategies with the advantages and drawbacks 
>>>>>> of each
>>>>>> 
>>>>>> 1) Put Spark in front of the back-end, every incoming 
>>>>>> modification/update/insert goes into Spark first, then Spark will 
>>>>>> forward it to Cassandra for persistence. With Spark, you can perform pre 
>>>>>> or post-processing and notify external clients of mutation.
>>>>>> 
>>>>>>  The drawback of this solution is that all the incoming mutations must 
>>>>>> go through Spark. You may set up a Kafka queue as temporary storage to 
>>>>>> distribute the load and consume mutations with Spark, but it adds to 
>>>>>> the architecture complexity with additional components & technologies.
>>>>>> 
>>>>>> 2) For high availability and resilience, you probably want to have all 
>>>>>> mutations saved first into Cassandra then process notifications with 
>>>>>> Spark. In this case the only way to have notifications from Cassandra, 
>>>>>> as of version 2.1, is to rely on manually coded triggers (which are still 
>>>>>> an experimental feature).
>>>>>> 
>>>>>> With the triggers you can notify whatever clients you want, not only 
>>>>>> Spark.
>>>>>> 
>>>>>> The big drawback of this solution is that playing with triggers is 
>>>>>> dangerous if you are not familiar with Cassandra internals. Indeed the 
>>>>>> trigger is on the write path and may hurt performance if you are doing 
>>>>>> complex and blocking tasks.

Re: Best approach in Cassandra (+ Spark?) for Continuous Queries?

2015-01-03 Thread Hugo José Pinto
Thanks :)

Duly noted - this is all uncharted territory for us, hence the value of 
seasoned advice.


Best

--
Hugo José Pinto

On 03/01/2015, at 23:43, Peter Lin  wrote:

> 
> listen to colin's advice, avoid the temptation of anti-patterns.
> 
>> On Sat, Jan 3, 2015 at 6:10 PM, Colin  wrote:
>> Use a message bus with a transactional get: get the message, send it to 
>> Cassandra, and upon write success, submit it to the ESP and commit the get 
>> on the bus. Messaging systems like RabbitMQ support this semantic.
>> 
>> Using Cassandra as a queuing mechanism is an anti-pattern.
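>> 
>> (A minimal sketch of this consume -> write -> ack loop, assuming RabbitMQ's
>> Python client pika and the DataStax Cassandra driver; the queue, table and
>> ESP hand-off below are illustrative:)
>> 
>> import pika
>> from cassandra.cluster import Cluster
>> 
>> session = Cluster(['127.0.0.1']).connect('my_keyspace')  # placeholder
>> insert = session.prepare(
>>     "INSERT INTO events (id, payload) VALUES (uuid(), ?)")  # illustrative
>> 
>> def submit_to_esp(body):
>>     pass  # placeholder: hand the event to the stream processor
>> 
>> def handle(channel, method, properties, body):
>>     # Write to Cassandra first; an exception skips the ack below,
>>     # so the message stays on the bus and is redelivered.
>>     session.execute(insert, (body.decode(),))
>>     submit_to_esp(body)
>>     # Only now commit the get: the event is safely persisted.
>>     channel.basic_ack(delivery_tag=method.delivery_tag)
>> 
>> conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
>> channel = conn.channel()
>> channel.basic_consume(queue='mutations', on_message_callback=handle)
>> channel.start_consuming()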
>> 
>> --
>> Colin Clark 
>> +1-320-221-9531
>>  
>> 
>>> On Jan 3, 2015, at 6:07 PM, Hugo José Pinto  
>>> wrote:
>>> 
>>> Thank you all for your answers.
>>> 
>>> It seems I'll have to go with some event-driven processing before/during 
>>> the Cassandra write path. 
>>> 
>>> My concern would be that I'd love to first guarantee the disk write of the 
>>> Cassandra persistence and then do the event processing (which is mostly 
>>> CRUD intercepts at this point), even if slightly delayed, and doing so via 
>>> triggers would probably bog down the whole processing pipeline. 
>>> 
>>> What I'd probably do is to write, in a trigger, to a separate key table with 
>>> all the CRUDed elements and have the ESP process that table.
>>> 
>>> Thank you for your contribution. Should anyone else have any experience 
>>> in these scenarios, I'm obviously all ears as well. 
>>> 
>>> Best,
>>> 
>>> Hugo 
>>> 
>>> Sent from my iPhone
>>> 
>>> On 03/01/2015, at 11:09, DuyHai Doan  wrote:
>>> 
>>>> Hello Hugo
>>>> 
>>>>  I was facing the same kind of requirement from some users. Long story 
>>>> short, below are the possible strategies with the advantages and drawbacks of 
>>>> each
>>>> 
>>>> 1) Put Spark in front of the back-end, every incoming 
>>>> modification/update/insert goes into Spark first, then Spark will forward 
>>>> it to Cassandra for persistence. With Spark, you can perform pre or 
>>>> post-processing and notify external clients of mutation.
>>>> 
>>>>  The drawback of this solution is that all the incoming mutations must go 
>>>> through Spark. You may set up a Kafka queue as temporary storage to 
>>>> distribute the load and consume mutations with Spark, but it adds to the 
>>>> architecture complexity with additional components & technologies.
>>>> 
>>>> 2) For high availability and resilience, you probably want to have all 
>>>> mutations saved first into Cassandra then process notifications with 
>>>> Spark. In this case the only way to have notifications from Cassandra, as 
>>>> of version 2.1, is to rely on manually coded triggers (which are still an 
>>>> experimental feature).
>>>> 
>>>> With the triggers you can notify whatever clients you want, not only Spark.
>>>> 
>>>> The big drawback of this solution is that playing with triggers is 
>>>> dangerous if you are not familiar with Cassandra internals. Indeed the 
>>>> trigger is on the write path and may hurt performance if you are doing 
>>>> complex and blocking tasks.
>>>> 
>>>> That's the 2 solutions I can see, maybe the ML members will propose other 
>>>> innovative choices
>>>> 
>>>>  Regards
>>>> 
>>>>> On Sat, Jan 3, 2015 at 11:46 AM, Hugo José Pinto 
>>>>>  wrote:
>>>>> Hello.
>>>>> 
>>>>> We're currently using Hazelcast (http://hazelcast.org/) as a distributed 
>>>>> in-memory data grid. That's been working sort-of-well for us, but going 
>>>>> solely in-memory has exhausted its path in our use case, and we're 
>>>>> considering porting our application to a NoSQL persistent store. After 
>>>>> the usual comparisons and evaluations, we're borderline close to picking 
>>>>> Cassandra, plus eventually Spark for analytics.
>>>>> 
>>>>> Nonetheless, there is a gap in our architectural needs that we're still 
>>>>> not grasping how to solve in Cassandra (with or without Spark): Hazelcast 
>>>>> allows us to create a Continuous Query such that, whenever a row is 
>>>>> added/

Re: Best approach in Cassandra (+ Spark?) for Continuous Queries?

2015-01-03 Thread Hugo José Pinto
Thank you all for your answers.

It seems I'll have to go with some event-driven processing before/during the 
Cassandra write path. 

My concern would be that I'd love to first guarantee the disk write of the 
Cassandra persistence and then do the event processing (which is mostly CRUD 
intercepts at this point), even if slightly delayed, and doing so via triggers 
would probably bog down the whole processing pipeline. 

What I'd probably do is to write, in a trigger, to a separate key table with all 
the CRUDed elements and have the ESP process that table.
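
Roughly, something like the following (a sketch with illustrative names; the
trigger itself would be Java on the write path, so only the log table and the
ESP-side read are shown):

from datetime import datetime, timedelta

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')  # placeholder

# The separate key table the trigger would append to, bucketed by minute
# so the ESP can drain one partition at a time (names are illustrative):
session.execute("""
    CREATE TABLE IF NOT EXISTS crud_log (
        bucket     timestamp,
        mutated_at timeuuid,
        table_name text,
        row_key    text,
        PRIMARY KEY (bucket, mutated_at)
    )""")

def process_in_esp(row):
    pass  # placeholder: feed the CRUDed key into the stream processor

def drain_bucket(bucket):
    # ESP side: read everything the triggers logged into one time bucket.
    rows = session.execute(
        "SELECT mutated_at, table_name, row_key FROM crud_log "
        "WHERE bucket = %s", (bucket,))
    for row in rows:
        process_in_esp(row)

# e.g. drain the previous minute's bucket
now = datetime.utcnow().replace(second=0, microsecond=0)
drain_bucket(now - timedelta(minutes=1))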

Thank you for your contribution. Should anyone else have any experience 
in these scenarios, I'm obviously all ears as well. 

Best,

Hugo 

Sent from my iPhone

On 03/01/2015, at 11:09, DuyHai Doan  wrote:

> Hello Hugo
> 
>  I was facing the same kind of requirement from some users. Long story short, 
> below are the possible strategies with the advantages and drawbacks of each
> 
> 1) Put Spark in front of the back-end, every incoming 
> modification/update/insert goes into Spark first, then Spark will forward it 
> to Cassandra for persistence. With Spark, you can perform pre or 
> post-processing and notify external clients of mutation.
> 
>  The drawback of this solution is that all the incoming mutations must go 
> through Spark. You may set up a Kafka queue as temporary storage to 
> distribute the load and consume mutations with Spark, but it adds to the 
> architecture complexity with additional components & technologies.
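> 
> (A minimal sketch of this pipeline, using a plain Kafka consumer in place of
> Spark for brevity; the topic, table and notification hook are illustrative:)
> 
> from kafka import KafkaConsumer
> from cassandra.cluster import Cluster
> 
> session = Cluster(['127.0.0.1']).connect('my_keyspace')  # placeholder
> insert = session.prepare(
>     "INSERT INTO events (id, payload) VALUES (uuid(), ?)")  # illustrative
> 
> def notify_clients(payload):
>     pass  # placeholder: push the change to external subscribers
> 
> # Mutations land on a Kafka topic first, then get persisted and fanned out.
> consumer = KafkaConsumer('mutations', bootstrap_servers='localhost:9092')
> for message in consumer:
>     payload = message.value.decode()
>     session.execute(insert, (payload,))  # persist before notifying
>     notify_clients(payload)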
> 
> 2) For high availability and resilience, you probably want to have all 
> mutations saved first into Cassandra then process notifications with Spark. 
> In this case the only way to have notifications from Cassandra, as of version 
> 2.1, is to rely on manually coded triggers (which are still an experimental 
> feature).
> 
> With the triggers you can notify whatever clients you want, not only Spark.
> 
> The big drawback of this solution is that playing with triggers is dangerous 
> if you are not familiar with Cassandra internals. Indeed the trigger is on 
> the write path and may hurt performance if you are doing complex and blocking 
> tasks.
> 
> That's the 2 solutions I can see, maybe the ML members will propose other 
> innovative choices
> 
>  Regards
> 
>> On Sat, Jan 3, 2015 at 11:46 AM, Hugo José Pinto  
>> wrote:
>> Hello.
>> 
>> We're currently using Hazelcast (http://hazelcast.org/) as a distributed 
>> in-memory data grid. That's been working sort-of-well for us, but going 
>> solely in-memory has exhausted its path in our use case, and we're 
>> considering porting our application to a NoSQL persistent store. After the 
>> usual comparisons and evaluations, we're borderline close to picking 
>> Cassandra, plus eventually Spark for analytics.
>> 
>> Nonetheless, there is a gap in our architectural needs that we're still not 
>> grasping how to solve in Cassandra (with or without Spark): Hazelcast allows 
>> us to create a Continuous Query in that, whenever a row is 
>> added/removed/modified from the clause's resultset, Hazelcast calls up back 
>> with the corresponding notification. We use this to continuously update the 
>> clients via AJAX streaming with the new/changed rows.
>> 
>> This is probably a conceptual mismatch we're making, so - how to best 
>> address this use case in Cassandra (with or without Spark's help)? Is there 
>> something in the API that allows for Continuous Queries on key/clause 
>> changes (haven't found it)? Is there some other way to get a stream of 
>> key/clause updates? Events of some sort?
>> 
>> I'm aware that we could, eventually, periodically poll Cassandra, but in our 
>> use case, the client is potentially interested in a large number of table 
>> clause notifications (think "all changes to Ship positions on California's 
>> coastline"), and iterating out of the store would kill the streamer's 
>> scalability.
>> 
>> Hence, the magic question: what are we missing? Is Cassandra the wrong tool 
>> for the job? Are we not aware of a particular part of the API or external 
>> library in/outside the apache realm that would allow for this?
>> 
>> Many thanks for any assistance!
>> 
>> Hugo
>> 
> 


Best approach in Cassandra (+ Spark?) for Continuous Queries?

2015-01-03 Thread Hugo José Pinto
Hello.

We're currently using Hazelcast (http://hazelcast.org/) as a distributed
in-memory data grid. That's been working sort-of-well for us, but going
solely in-memory has exhausted its path in our use case, and we're
considering porting our application to a NoSQL persistent store. After the
usual comparisons and evaluations, we're borderline close to picking
Cassandra, plus eventually Spark for analytics.

Nonetheless, there is a gap in our architectural needs that we're still not
grasping how to solve in Cassandra (with or without Spark): Hazelcast
allows us to create a Continuous Query such that, whenever a row is
added/removed/modified from the clause's resultset, Hazelcast calls us back
with the corresponding notification. We use this to continuously update the
clients via AJAX streaming with the new/changed rows.

This is probably a conceptual mismatch we're making, so - how to best
address this use case in Cassandra (with or without Spark's help)? Is there
something in the API that allows for Continuous Queries on key/clause
changes (haven't found it)? Is there some other way to get a stream of
key/clause updates? Events of some sort?

I'm aware that we could, eventually, periodically poll Cassandra, but in
our use case, the client is potentially interested in a large number of
table clause notifications (think "all changes to Ship positions on
California's coastline"), and iterating out of the store would kill the
streamer's scalability.

Hence, the magic question: what are we missing? Is Cassandra the wrong tool
for the job? Are we not aware of a particular part of the API or external
library in/outside the apache realm that would allow for this?

Many thanks for any assistance!

Hugo