Re: Sorting keys for batch reads to minimize seeks

2013-10-22 Thread Manoj Khangaonkar
Hi,

Apologies if my response is a little off track, but instead of trying to
squeeze the last ounce of performance out of Cassandra, have you considered
putting an external in-memory cache (such as Redis or memcached) in front of
or alongside Cassandra to cache frequently used rows?

You get fast reads for data served from the cache, and you still get
scalability, high availability, etc. from Cassandra at the back.

Given the suggested 8 GB heap size limit for Java servers, in-JVM row
caching might not be sufficient (not enough rows found in the cache). I read
somewhere that Cassandra is capable of using memory outside the JVM heap for
row caching. I am not familiar with it, but that might be something to look
into.

Bottom line: reads can be much faster if your rows are found in memory.
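
As a rough illustration of the caching idea above, here is a minimal
cache-aside sketch in Python. A plain dict stands in for Redis or memcached,
and `load_row` is a hypothetical callback standing in for a Cassandra read;
neither is a real client API, so treat this as a sketch under those
assumptions rather than a working integration.

```python
from typing import Callable, Dict, Optional


class ReadThroughCache:
    """Cache-aside wrapper: check memory first, fall back to the backing store."""

    def __init__(self, load_row: Callable[[str], Optional[dict]]):
        self._load_row = load_row          # hypothetical loader, e.g. a Cassandra read
        self._rows: Dict[str, dict] = {}   # stand-in for Redis/memcached
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[dict]:
        if key in self._rows:
            self.hits += 1
            return self._rows[key]
        # Cache miss: read from the backing store and remember the row.
        self.misses += 1
        row = self._load_row(key)
        if row is not None:
            self._rows[key] = row
        return row

    def invalidate(self, key: str) -> None:
        # Call this after a write so that stale rows are not served.
        self._rows.pop(key, None)


# A dict plays the role of the database in this sketch.
backing_store = {"user:1": {"name": "a"}, "user:2": {"name": "b"}}
cache = ReadThroughCache(backing_store.get)
cache.get("user:1")   # miss, loaded from the backing store
cache.get("user:1")   # hit, served from memory
```

The second read never touches the backing store, which is the whole point.
The cost is that writes have to invalidate or update the cached row.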

regards



On Tue, Oct 22, 2013 at 2:15 AM, Artur Kronenberg <
artur.kronenb...@openmarket.com> wrote:

>  Hi,
>
> we did some testing and found that doing range queries is much quicker
> then querying data regularly. I am guessing that a range query request is
> going to seek much more efficiently on disk.
>
> This is where the idea of sorting our tokens comes in. We have a batch
> request of say 1000 items and instead of doing a multiget from cassandra
> which involves a lot of random I/O seeks, we would like to have a way to
> seek for the range. It doesn't actually matter if the range is slightly
> biggern then the amount of items we would like to retrieve as the time we
> loose filtering unneeded items in code is quicker then doing a multiget for
> 1000 items in the first place.
>
> Is there a way for basing token ranges somewhat on a certain value in our
> schema? Say every row has a value A and B. While A is just a random
> identifier and we can't really rely on what this will be, all our queries
> operate on a way that B is going to be the same value for all items in the
> query. If we had the token range being random however with regards that the
> random values are generated based on the B value and therefore all items
> with B are close together in range and therefore optimized for range
> queries rather then gets, that could possibly speed up read performance
> significantly.
>
> Thanks!
>
> Artur
>
>
> On 21/10/13 16:58, Edward Capriolo wrote:
>
> I am not sure what you are working on will have an effect. You can not
> actually control the way the operating system seeks data on disk. The io
> scheduling is done outside cassandra. You can try to write the code in an
> optimistic way taking phyical hardware into account, but then you have to
> consider there are n concurrent requests on the io system.
>
> On Friday, October 18, 2013, Viktor Jevdokimov <
> viktor.jevdoki...@adform.com> wrote:
> > Read latency depends on many factors, don't forget "physics".
> > If it meets your requirements, it is good.
> >
> >
> > -----Original Message-
> > From: Artur Kronenberg [mailto:artur.kronenb...@openmarket.com]
> > Sent: Friday, October 18, 2013 1:03 PM
> > To: user@cassandra.apache.org
> > Subject: Re: Sorting keys for batch reads to minimize seeks
> >
> > Hi,
> >
> > Thanks for your reply. Our latency currently is 23.618ms. However I
> simply read that off one node just now while it wasn't under a load test. I
> am going to be able to get a better number after the next test run.
> >
> > What is a good value for read latency?
> >
> >
> > On 18/10/13 08:31, Viktor Jevdokimov wrote:
> >> The only thing you may win - avoid unnecessary network hops if:
> >> - request sorted keys (by token) from appropriate replica with
> ConsistencyLevel.ONE and "dynamic_snitch: false".
> >> - nodes has the same load
> >> - replica not doing GC, and GC pauses are much higher than internode
> communication.
> >>
> >> For multiple keys request C* will do multiple single key reads, except
> for range scan requests, where only starting key and batch size is used in
> request.
> >>
> >> Consider multiple key request as a slow request by design, try to model
> your data for low latency single key requests.
> >>
> >> So, what latencies do you want to achieve?

Re: Sorting keys for batch reads to minimize seeks

2013-10-22 Thread Artur Kronenberg

Hi,

we did some testing and found that doing range queries is much quicker
than doing regular key lookups. I am guessing that a range query request
is going to seek much more efficiently on disk.


This is where the idea of sorting our tokens comes in. We have a batch
request of, say, 1000 items, and instead of doing a multiget from Cassandra,
which involves a lot of random I/O seeks, we would like a way to seek over
a range. It doesn't matter if the range is slightly bigger than the set of
items we want to retrieve, as the time we lose filtering out unneeded items
in code is less than the time a multiget for 1000 items takes in the first
place.


Is there a way to base token ranges on a certain value in our schema? Say
every row has a value A and a value B. A is just a random identifier and we
can't rely on what it will be, but all our queries operate in such a way
that B is the same for all items in the query. If tokens were still random
but generated from the B value, all items with the same B would be close
together in the token range and therefore optimized for range queries
rather than gets. That could possibly speed up read performance
significantly.
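
To make the idea concrete, here is a toy Python model of what I have in
mind. It is not Cassandra or Astyanax code: `WideRowStore` is a made-up
in-memory structure in which B acts as the row key and A as a sorted column
name, so all items sharing a B sit together and can be fetched with one
contiguous slice. The key names are invented for the example.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class WideRowStore:
    """Toy model: B is the row key, A is a column name kept in sorted order."""

    def __init__(self):
        self._names = defaultdict(list)    # B -> sorted list of A values
        self._values = defaultdict(list)   # B -> values, in the same order

    def put(self, b, a, value):
        names, values = self._names[b], self._values[b]
        i = bisect_left(names, a)
        if i < len(names) and names[i] == a:
            values[i] = value              # overwrite an existing column
        else:
            names.insert(i, a)
            values.insert(i, value)

    def slice(self, b, a_start, a_end):
        """Return all (A, value) pairs with a_start <= A <= a_end for one B."""
        names = self._names.get(b, [])
        values = self._values.get(b, [])
        lo = bisect_left(names, a_start)
        hi = bisect_right(names, a_end)
        return list(zip(names[lo:hi], values[lo:hi]))


store = WideRowStore()
for a in ["a17", "a03", "a42", "a25"]:
    store.put("campaign-7", a, {"id": a})
store.put("campaign-9", "a10", {"id": "a10"})

wanted = {"a03", "a25"}
# One contiguous read that is slightly too wide ...
candidates = store.slice("campaign-7", "a03", "a25")
# ... followed by filtering out the unneeded items in code.
result = [value for name, value in candidates if name in wanted]
```

The slice returns one item we do not need ("a17"), which is the trade-off
described above: a slightly oversized sequential read plus a cheap filter
instead of many independent lookups.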


Thanks!

Artur

On 21/10/13 16:58, Edward Capriolo wrote:
I am not sure what you are working on will have an effect. You can not 
actually control the way the operating system seeks data on disk. The 
io scheduling is done outside cassandra. You can try to write the code 
in an optimistic way taking phyical hardware into account, but then 
you have to consider there are n concurrent requests on the io system.


On Friday, October 18, 2013, Viktor Jevdokimov <viktor.jevdoki...@adform.com> wrote:

> Read latency depends on many factors, don't forget "physics".
> If it meets your requirements, it is good.
>
>
> -----Original Message-----
> From: Artur Kronenberg [mailto:artur.kronenb...@openmarket.com]
> Sent: Friday, October 18, 2013 1:03 PM
> To: user@cassandra.apache.org
> Subject: Re: Sorting keys for batch reads to minimize seeks
>
> Hi,
>
> Thanks for your reply. Our latency currently is 23.618ms. However I
> simply read that off one node just now while it wasn't under a load
> test. I am going to be able to get a better number after the next test
> run.
>
> What is a good value for read latency?
>
>
> On 18/10/13 08:31, Viktor Jevdokimov wrote:
>> The only thing you may win - avoid unnecessary network hops if:
>> - request sorted keys (by token) from appropriate replica with
>> ConsistencyLevel.ONE and "dynamic_snitch: false".
>> - nodes has the same load
>> - replica not doing GC, and GC pauses are much higher than
>> internode communication.
>>
>> For multiple keys request C* will do multiple single key reads,
>> except for range scan requests, where only starting key and batch size
>> is used in request.
>>
>> Consider multiple key request as a slow request by design, try to
>> model your data for low latency single key requests.
>>
>> So, what latencies do you want to achieve?
>>
>>
>>
>> -----Original Message-----
>> From: Artur Kronenberg [mailto:artur.kronenb...@openmarket.com]
>> Sent: Thursday, October 17, 2013 7:40 PM
>> To: user@cassandra.apache.org
>> Subject: Sorting keys for batch reads to minimize seeks
>>
>> Hi,
>>
>> I am looking to somehow increase read performance on cassandra. We
>> are still playing with configurations but I was thinking if there
>> would be solutions in software that might help us speed up our read
>> performance.
>>
>> E.g. one idea, not sure how sane that is, was to sort read-batches 

Re: Sorting keys for batch reads to minimize seeks

2013-10-21 Thread Edward Capriolo
I am not sure that what you are working on will have an effect. You cannot
actually control the way the operating system seeks data on disk. The I/O
scheduling is done outside Cassandra. You can try to write the code in an
optimistic way, taking physical hardware into account, but then you have to
consider that there are n concurrent requests on the I/O system.

On Friday, October 18, 2013, Viktor Jevdokimov wrote:
> Read latency depends on many factors, don't forget "physics".
> If it meets your requirements, it is good.
>
>
> -Original Message-
> From: Artur Kronenberg [mailto:artur.kronenb...@openmarket.com]
> Sent: Friday, October 18, 2013 1:03 PM
> To: user@cassandra.apache.org
> Subject: Re: Sorting keys for batch reads to minimize seeks
>
> Hi,
>
> Thanks for your reply. Our latency currently is 23.618ms. However I
> simply read that off one node just now while it wasn't under a load test. I
> am going to be able to get a better number after the next test run.
>
> What is a good value for read latency?
>
>
> On 18/10/13 08:31, Viktor Jevdokimov wrote:
>> The only thing you may win - avoid unnecessary network hops if:
>> - request sorted keys (by token) from appropriate replica with
>> ConsistencyLevel.ONE and "dynamic_snitch: false".
>> - nodes has the same load
>> - replica not doing GC, and GC pauses are much higher than internode
>> communication.
>>
>> For multiple keys request C* will do multiple single key reads, except
>> for range scan requests, where only starting key and batch size is used in
>> request.
>>
>> Consider multiple key request as a slow request by design, try to model
>> your data for low latency single key requests.
>>
>> So, what latencies do you want to achieve?
>>
>>
>>
>> -----Original Message-----
>> From: Artur Kronenberg [mailto:artur.kronenb...@openmarket.com]
>> Sent: Thursday, October 17, 2013 7:40 PM
>> To: user@cassandra.apache.org
>> Subject: Sorting keys for batch reads to minimize seeks
>>
>> Hi,
>>
>> I am looking to somehow increase read performance on cassandra. We are
>> still playing with configurations but I was thinking if there would be
>> solutions in software that might help us speed up our read performance.
>>
>> E.g. one idea, not sure how sane that is, was to sort read-batches by
>> row-keys before submitting them to cassandra. The idea is that row-keys
>> should be closer together on the physical disk and therefor this may
>> minimize the amount of random seeks we have to do when querying say 1000
>> entries from cassandra. Does that make any sense?
>>
>> Is there anything else that we can do in software to improve
>> performance? Like specific batch sizes for reads? We are using the astyanax
>> library to access cassandra.
>>
>> Thanks!
>>
>>
>
>


RE: Sorting keys for batch reads to minimize seeks

2013-10-18 Thread Viktor Jevdokimov
Read latency depends on many factors, don't forget "physics".
If it meets your requirements, it is good.


-----Original Message-----
From: Artur Kronenberg [mailto:artur.kronenb...@openmarket.com] 
Sent: Friday, October 18, 2013 1:03 PM
To: user@cassandra.apache.org
Subject: Re: Sorting keys for batch reads to minimize seeks

Hi,

Thanks for your reply. Our latency currently is 23.618ms. However I simply read 
that off one node just now while it wasn't under a load test. I am going to be 
able to get a better number after the next test run.

What is a good value for read latency?


On 18/10/13 08:31, Viktor Jevdokimov wrote:
> The only thing you may win - avoid unnecessary network hops if:
> - request sorted keys (by token) from appropriate replica with 
> ConsistencyLevel.ONE and "dynamic_snitch: false".
> - nodes has the same load
> - replica not doing GC, and GC pauses are much higher than internode 
> communication.
>
> For multiple keys request C* will do multiple single key reads, except for 
> range scan requests, where only starting key and batch size is used in 
> request.
>
> Consider multiple key request as a slow request by design, try to model your 
> data for low latency single key requests.
>
> So, what latencies do you want to achieve?
>
>
>
> -----Original Message-----
> From: Artur Kronenberg [mailto:artur.kronenb...@openmarket.com]
> Sent: Thursday, October 17, 2013 7:40 PM
> To: user@cassandra.apache.org
> Subject: Sorting keys for batch reads to minimize seeks
>
> Hi,
>
> I am looking to somehow increase read performance on cassandra. We are still 
> playing with configurations but I was thinking if there would be solutions in 
> software that might help us speed up our read performance.
>
> E.g. one idea, not sure how sane that is, was to sort read-batches by 
> row-keys before submitting them to cassandra. The idea is that row-keys 
> should be closer together on the physical disk and therefor this may minimize 
> the amount of random seeks we have to do when querying say 1000 entries from 
> cassandra. Does that make any sense?
>
> Is there anything else that we can do in software to improve performance? 
> Like specific batch sizes for reads? We are using the astyanax library to 
> access cassandra.
>
> Thanks!
>
>



Re: Sorting keys for batch reads to minimize seeks

2013-10-18 Thread Artur Kronenberg

Hi,

Thanks for your reply. Our latency currently is 23.618ms. However I 
simply read that off one node just now while it wasn't under a load 
test. I am going to be able to get a better number after the next test run.


What is a good value for read latency?


On 18/10/13 08:31, Viktor Jevdokimov wrote:

The only thing you may win - avoid unnecessary network hops if:
- request sorted keys (by token) from appropriate replica with ConsistencyLevel.ONE and 
"dynamic_snitch: false".
- nodes has the same load
- replica not doing GC, and GC pauses are much higher than internode 
communication.

For multiple keys request C* will do multiple single key reads, except for 
range scan requests, where only starting key and batch size is used in request.

Consider multiple key request as a slow request by design, try to model your 
data for low latency single key requests.

So, what latencies do you want to achieve?



-----Original Message-----
From: Artur Kronenberg [mailto:artur.kronenb...@openmarket.com]
Sent: Thursday, October 17, 2013 7:40 PM
To: user@cassandra.apache.org
Subject: Sorting keys for batch reads to minimize seeks

Hi,

I am looking to somehow increase read performance on cassandra. We are still 
playing with configurations but I was thinking if there would be solutions in 
software that might help us speed up our read performance.

E.g. one idea, not sure how sane that is, was to sort read-batches by row-keys 
before submitting them to cassandra. The idea is that row-keys should be closer 
together on the physical disk and therefor this may minimize the amount of 
random seeks we have to do when querying say 1000 entries from cassandra. Does 
that make any sense?

Is there anything else that we can do in software to improve performance? Like 
specific batch sizes for reads? We are using the astyanax library to access 
cassandra.

Thanks!






RE: Sorting keys for batch reads to minimize seeks

2013-10-18 Thread Viktor Jevdokimov
> Sorting a random set of keys will not help.
False

> If you know that you set of keys are on a particular node, then sorting might 
> help.
True


Two different answers to the same question.


> But I doubt that it is a sound practice, given that sets of keys can be moved 
> - as nodes are added or removed from the cluster
Just be aware of that, and get the token ranges from Cassandra.
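
A minimal Python sketch of what sorting by token and grouping per node could
look like. Everything here is an assumption for illustration: `token_for`
uses a plain MD5 hash as a stand-in for the cluster's partitioner, and `RING`
is a hypothetical three-node ring. In practice the token ranges would come
from the cluster itself and would have to be refreshed when nodes are added
or removed.

```python
import hashlib
from bisect import bisect_left
from collections import defaultdict


def token_for(key: str) -> int:
    """Stand-in token function (assumption): MD5 of the key as an integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def group_keys_by_owner(keys, ring):
    """Group keys by owning node and sort each group by token.

    ring is a list of (end_token, node) sorted by end_token. Each node owns
    the tokens after the previous end_token up to and including its own;
    tokens past the last end_token wrap around to the first node.
    """
    end_tokens = [end for end, _ in ring]
    batches = defaultdict(list)
    for key in keys:
        tok = token_for(key)
        i = bisect_left(end_tokens, tok)       # first node whose end_token >= tok
        node = ring[i % len(ring)][1]          # wrap around past the last token
        batches[node].append((tok, key))
    return {node: [k for _, k in sorted(pairs)] for node, pairs in batches.items()}


# Hypothetical ring: three nodes splitting the 128-bit MD5 space evenly.
RING = [
    (2**128 // 3, "node-a"),
    (2 * 2**128 // 3, "node-b"),
    (2**128 - 1, "node-c"),
]

keys = [f"row-{i}" for i in range(1000)]
batches = group_keys_by_owner(keys, RING)
```

Each batch can then be sent to its own replica, which is where the saved
network hops come from. Sorting the raw key strings alone would not give this
grouping, because hashed tokens do not follow key order.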



Best regards / Pagarbiai

Viktor Jevdokimov
Senior Developer

Email: viktor.jevdoki...@adform.com
Phone: +370 5 212 3063
Fax: +370 5 261 0453

J. Jasinskio 16C,
LT-03163 Vilnius,
Lithuania



Disclaimer: The information contained in this message and attachments is 
intended solely for the attention and use of the named addressee and may be 
confidential. If you are not the intended recipient, you are reminded that the 
information remains the property of the sender. You must not use, disclose, 
distribute, copy, print or rely on this e-mail. If you have received this 
message in error, please contact the sender immediately and irrevocably delete 
this message and any copies.


RE: Sorting keys for batch reads to minimize seeks

2013-10-18 Thread Viktor Jevdokimov
The only thing you may win is avoiding unnecessary network hops, if:
- you request sorted keys (by token) from the appropriate replica with
ConsistencyLevel.ONE and "dynamic_snitch: false";
- the nodes have the same load;
- the replica is not doing GC, and GC pauses are much longer than internode
communication.

For a multiple-key request, C* will do multiple single-key reads, except for
range scan requests, where only the starting key and batch size are used in
the request.

Consider a multiple-key request as slow by design; try to model your
data for low-latency single-key requests.
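
To illustrate the "starting key and batch size" shape of a range scan, here
is a toy Python sketch. It runs against an in-memory sorted list, not against
C*; `range_scan` and `scan_all` are hypothetical helpers, and list order
merely plays the role of the order in which a range scan returns rows.

```python
from bisect import bisect_left


def range_scan(sorted_keys, start_key, batch_size):
    """Return up to batch_size keys, starting at start_key (inclusive)."""
    i = bisect_left(sorted_keys, start_key)
    return sorted_keys[i:i + batch_size]


def scan_all(sorted_keys, batch_size):
    """Page through everything using only a start key and a batch size."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 to make progress")
    out = []
    start = ""            # the empty string sorts before every other key
    first_page = True
    while True:
        page = range_scan(sorted_keys, start, batch_size)
        # Every page after the first begins with the previous page's last
        # key, so that duplicate is dropped.
        out.extend(page if first_page else page[1:])
        if len(page) < batch_size:
            break
        start = page[-1]
        first_page = False
    return out


keys = [f"k{i:02d}" for i in range(10)]
scanned = scan_all(keys, batch_size=4)
```

A multiple-key request for the same rows would instead be one single-key read
per key, which is why it is the slower pattern.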

So, what latencies do you want to achieve?



Best regards / Pagarbiai

Viktor Jevdokimov
Senior Developer

Email: viktor.jevdoki...@adform.com
Phone: +370 5 212 3063
Fax: +370 5 261 0453

J. Jasinskio 16C,
LT-03163 Vilnius,
Lithuania



Disclaimer: The information contained in this message and attachments is 
intended solely for the attention and use of the named addressee and may be 
confidential. If you are not the intended recipient, you are reminded that the 
information remains the property of the sender. You must not use, disclose, 
distribute, copy, print or rely on this e-mail. If you have received this 
message in error, please contact the sender immediately and irrevocably delete 
this message and any copies.

-----Original Message-----
From: Artur Kronenberg [mailto:artur.kronenb...@openmarket.com]
Sent: Thursday, October 17, 2013 7:40 PM
To: user@cassandra.apache.org
Subject: Sorting keys for batch reads to minimize seeks

Hi,

I am looking to somehow increase read performance on cassandra. We are still 
playing with configurations but I was thinking if there would be solutions in 
software that might help us speed up our read performance.

E.g. one idea, not sure how sane that is, was to sort read-batches by row-keys 
before submitting them to cassandra. The idea is that row-keys should be closer 
together on the physical disk and therefor this may minimize the amount of 
random seeks we have to do when querying say 1000 entries from cassandra. Does 
that make any sense?

Is there anything else that we can do in software to improve performance? Like 
specific batch sizes for reads? We are using the astyanax library to access 
cassandra.

Thanks!




Re: Sorting keys for batch reads to minimize seeks

2013-10-17 Thread Manoj Khangaonkar
Unless I misunderstood your statement on sorting by row keys:

Cassandra partitions rows across nodes based on row keys. Sorting a random
set of keys will not help.
If you know that your set of keys is on a particular node, then sorting
might help. But I doubt that it is a sound practice, given that sets of
keys can move as nodes are added to or removed from the cluster.

regards


On Thu, Oct 17, 2013 at 9:40 AM, Artur Kronenberg <
artur.kronenb...@openmarket.com> wrote:

> Hi,
>
> I am looking to somehow increase read performance on Cassandra. We are
> still playing with configurations, but I was wondering whether there are
> solutions in software that might help us speed up our read performance.
>
> E.g. one idea, not sure how sane that is, was to sort read batches by
> row keys before submitting them to Cassandra. The idea is that row keys
> should be closer together on the physical disk and therefore this may
> minimize the number of random seeks we have to do when querying, say, 1000
> entries from Cassandra. Does that make any sense?
>
> Is there anything else that we can do in software to improve performance?
> Like specific batch sizes for reads? We are using the Astyanax library to
> access Cassandra.
>
> Thanks!
>
>
>


-- 
http://khangaonkar.blogspot.com/