Re: object sizes

2015-04-20 Thread Alex De la rosa
Hi Brett,

Yeah, that was my assumption too: an overhead in RAM for creating the
object structures, etc. That's also why simple objects (raw binary) give a
pretty accurate measure compared to cURL, but maps/sets/etc. don't.

Exactly, I would like a way to know how big the object stored inside Riak
is (using the Python client instead of making extra cURL calls), so I can
make sure no object bigger than 1MB in storage space gets saved (and then
implement some kind of key-splitting mechanism when approaching the limit).
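
Something like the following is what I have in mind; a minimal sketch, assuming
the standard riak Python client, with the bucket, key and file name as
placeholders (and the 1MB cap is just my own constant):

import riak

MAX_BYTES = 1024 * 1024  # the storage limit I want to stay under

client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
bucket = client.bucket('images')

obj = bucket.new('photo1')
obj.content_type = 'image/jpeg'
obj.encoded_data = open('photo.jpg', 'rb').read()  # placeholder payload

# Approximate the stored size from the serialized payload rather than
# sys.getsizeof(), which measures the in-memory Python structure.
if len(obj.encoded_data) > MAX_BYTES:
    raise ValueError('object exceeds the 1MB storage limit')
obj.store()

This works for raw binary objects; for Maps there is no equivalent client-side
payload to measure before storing, which is exactly the gap.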

Thanks!
Alex

On Mon, Apr 20, 2015 at 11:42 PM, Brett Hazen br...@basho.com wrote:

 Alex -

 Looks like Matt created a GitHub issue to track this.
 https://github.com/basho/riak-python-client/issues/403 Thanks!

 It occurs to me that sys.getsizeof() returns the size of the Python Riak
 Object stored in memory which is most certainly not exactly the same as
 what curl reports.  Curl is measuring the JSON across the wire and the
 Python client is converting it into a native format.  There is extra
 information in memory such as indexes into dictionaries and CRDT metadata
 used in maps.

 Just to clarify, you want to know the size of the object stored in Riak as
 opposed to in memory, right?  The 1MB limit is on Riak storage?

 thanks,
 Brett

 On April 17, 2015 at 2:41:56 PM, Alex De la rosa (alex.rosa@gmail.com)
 wrote:

 Hi Matthew,

 I don't have a GitHub account, so it seems I'm not able to create the ticket
 for this feature. Could you do it?

 Thanks,
 Alex

 On Thu, Apr 16, 2015 at 10:08 PM, Alex De la rosa alex.rosa@gmail.com
  wrote:

 Hi Matthew,

 Thanks for your answer :) I always have interesting questions :P

 About point [2]: if you look at my examples, I'm already using
 sys.getsizeof(), but the sizes are not very accurate. Also, I believe that is
 the size they take in RAM when loaded by Python, not the exact size of the
 stored object (especially for Maps, where it differs quite a bit).

 I will open the ticket then :) I think it could be a very helpful future
 feature.

 Thanks,
 Alex

 On Thu, Apr 16, 2015 at 10:03 PM, Matthew Brender mbren...@basho.com
 wrote:

 Hi Alex,

 That is an interesting question! I haven't seen a request like that in
 our backlog, so feel free to open a new issue [1]. I'm curious: why
 not use something like sys.getsizeof [2]?

 [1] https://github.com/basho/riak-python-client/issues
 [2]
 http://stackoverflow.com/questions/449560/how-do-i-determine-the-size-of-an-object-in-python

 Matt Brender | Developer Advocacy Lead
 Basho Technologies
 t: @mjbrender


 On Mon, Apr 13, 2015 at 7:26 AM, Alex De la rosa
  alex.rosa@gmail.com wrote:
  Hi Bryan,
 
  Thanks for your answer; I don't know how to code in Erlang, so all my
  system relies on Python.
 
  Following Ciprian's curl suggestion, I tried to compare it with this
  Python code over the weekend:
 
  Map object:
  curl -I
  1058 bytes
  print sys.getsizeof(obj.value)
  3352 bytes
 
  Standard object:
  curl -I
  9718 bytes
  print sys.getsizeof(obj.encoded_data)
  9755 bytes
 
  The standard object seems pretty accurate with both approaches, even though
  the image binary data was only 5KB (I assume some overhead here).
 
  The map object shows about a 3x difference between curl and getting the
  object via Python.
 
  Not so sure if this is a realistic way to measure their growth (especially
  because the objects I would need to monitor are Maps, not unaltered binary
  data whose size I can know before storing it).
 
  Would it be possible in some way for the Python get() function to return
  something like obj.content-length, giving the size the object is currently
  taking? That would be a pretty nice feature.
 
  Thanks!
  Alex
 
  On Mon, Apr 13, 2015 at 12:47 PM, bryan hunt bh...@basho.com wrote:
 
  Alex,
 
 
  Maps and Sets are stored just like a regular Riak object, but using a
  particular data structure and object serialization format. As you have
  observed, there is an overhead, and you want to monitor the growth of
 these
  data structures.
 
  It is possible to write a MapReduce map function (in Erlang) which
  retrieves a provided object by type/bucket/id and returns the size of its
  data. Would such a thing be of use?
 
  It would not be hard to write such a module, and I might even have
 some
  code for doing so if you are interested. There are also reasonably
 good
  examples in our documentation -
  http://docs.basho.com/riak/latest/dev/advanced/mapreduce
 
  I haven't looked at the Python PB API in a while, but I'm reasonably
  certain it supports the invocation of MapReduce jobs.
 
  Bryan
 
 
  On 10 Apr 2015, at 13:51, Alex De la rosa alex.rosa@gmail.com
 wrote:
 
  Also, I forgot: I'm most interested in bucket_types instead of simple Riak
  buckets - being able to see how my mutable data inside a MAP/SET has grown.
 
  For a traditional standard bucket I can calculate the size of what I'm
  sending before, so Riak won't get data bigger 

Re: object sizes

2015-04-20 Thread Brett Hazen
Alex -

Looks like Matt created a GitHub issue to track this. 
https://github.com/basho/riak-python-client/issues/403 Thanks!

It occurs to me that sys.getsizeof() returns the size of the Python Riak Object 
stored in memory which is most certainly not exactly the same as what curl 
reports.  Curl is measuring the JSON across the wire and the Python client is 
converting it into a native format.  There is extra information in memory such 
as indexes into dictionaries and CRDT metadata used in maps.
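
To illustrate the difference on a plain key/value object (a rough sketch,
assuming the standard riak Python client; the bucket/key and connection details
are placeholders, and the numbers will vary with your data):

import sys
import riak

client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
obj = client.bucket('test').get('demo')

# Serialized payload, roughly what curl's Content-Length reports
print(len(obj.encoded_data))

# Deserialized Python structure in RAM, typically larger; for datatypes it
# also carries map/set metadata
print(sys.getsizeof(obj.data))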

Just to clarify, you want to know the size of the object stored in Riak as 
opposed to in memory, right?  The 1MB limit is on Riak storage?

thanks,
Brett

On April 17, 2015 at 2:41:56 PM, Alex De la rosa (alex.rosa@gmail.com) 
wrote:

Hi Matthew,

I don't have a GitHub account, so it seems I'm not able to create the ticket for
this feature. Could you do it?

Thanks,
Alex

On Thu, Apr 16, 2015 at 10:08 PM, Alex De la rosa alex.rosa@gmail.com 
wrote:
Hi Matthew,

Thanks for your answer :) I always have interesting questions :P

About point [2]: if you look at my examples, I'm already using sys.getsizeof(),
but the sizes are not very accurate. Also, I believe that is the size they take in
RAM when loaded by Python, not the exact size of the stored object (especially
for Maps, where it differs quite a bit).

I will open the ticket then :) I think it could be a very helpful future feature.

Thanks,
Alex

On Thu, Apr 16, 2015 at 10:03 PM, Matthew Brender mbren...@basho.com wrote:
Hi Alex,

That is an interesting question! I haven't seen a request like that in
our backlog, so feel free to open a new issue [1]. I'm curious: why
not use something like sys.getsizeof [2]?

[1] https://github.com/basho/riak-python-client/issues
[2] 
http://stackoverflow.com/questions/449560/how-do-i-determine-the-size-of-an-object-in-python

Matt Brender | Developer Advocacy Lead
Basho Technologies
t: @mjbrender


On Mon, Apr 13, 2015 at 7:26 AM, Alex De la rosa
alex.rosa@gmail.com wrote:
 Hi Bryan,

 Thanks for your answer; I don't know how to code in Erlang, so all my system
 relies on Python.

 Following Ciprian's curl suggestion, I tried to compare it with this Python
 code over the weekend:

 Map object:
 curl -I
 1058 bytes
 print sys.getsizeof(obj.value)
 3352 bytes

 Standard object:
 curl -I
 9718 bytes
 print sys.getsizeof(obj.encoded_data)
 9755 bytes

 The standard object seems pretty accurate with both approaches, even though the
 image binary data was only 5KB (I assume some overhead here).

 The map object shows about a 3x difference between curl and getting the
 object via Python.

 Not so sure if this is a realistic way to measure their growth (especially
 because the objects I would need to monitor are Maps, not unaltered binary
 data whose size I can know before storing it).

 Would it be possible in some way for the Python get() function to return
 something like obj.content-length, giving the size the object is currently
 taking? That would be a pretty nice feature.

 Thanks!
 Alex

 On Mon, Apr 13, 2015 at 12:47 PM, bryan hunt bh...@basho.com wrote:

 Alex,


 Maps and Sets are stored just like a regular Riak object, but using a
 particular data structure and object serialization format. As you have
 observed, there is an overhead, and you want to monitor the growth of these
 data structures.

 It is possible to write a MapReduce map function (in Erlang) which
 retrieves a provided object by type/bucket/id and returns the size of its
 data. Would such a thing be of use?

 It would not be hard to write such a module, and I might even have some
 code for doing so if you are interested. There are also reasonably good
 examples in our documentation -
 http://docs.basho.com/riak/latest/dev/advanced/mapreduce

 I haven't looked at the Python PB API in a while, but I'm reasonably
 certain it supports the invocation of MapReduce jobs.
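
 From the Python client, invoking such a job would look roughly like this (a
 sketch only; size_mr:map_object_size is a hypothetical Erlang module/function
 that would have to be compiled and put on Riak's code path first, and the
 bucket/key are placeholders):

 import riak

 client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)

 mr = riak.RiakMapReduce(client)
 mr.add('test', 'demo')                  # bucket, key
 mr.map(['size_mr', 'map_object_size'])  # Erlang map phase: [module, function]
 print(mr.run())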

 Bryan


 On 10 Apr 2015, at 13:51, Alex De la rosa alex.rosa@gmail.com wrote:

 Also, I forgot: I'm most interested in bucket_types instead of simple Riak
 buckets - being able to see how my mutable data inside a MAP/SET has grown.

 For a traditional standard bucket I can calculate the size of what I'm
 sending before, so Riak won't get data bigger than 1MB. The problem arises
 with MAPS/SETS that can grow.

 Thanks,
 Alex

 On Fri, Apr 10, 2015 at 2:47 PM, Alex De la rosa alex.rosa@gmail.com
 wrote:

 Well... using the HTTP REST API would make no sense when using the PB
 API; it would be extremely costly to maintain, and it may also add some extra
 bytes on the wire.

 I would be interested in being able to know the size via Python itself,
 using the PB API as I'm doing.

 Thanks anyway,
 Alex

 On Fri, Apr 10, 2015 at 1:58 PM, Ciprian Manea cipr...@basho.com wrote:

 Hi Alex,

 You can always query the size of a riak object using `curl` and the REST
 API:

 i.e. curl -I riak-node-ip:8098/buckets/test/keys/demo
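
 The same check from Python over HTTP (a sketch using the requests library,
 assuming the HTTP listener is enabled on port 8098):

 import requests

 # HEAD request, like curl -I; Content-Length is the size of the stored value
 resp = requests.head('http://riak-node-ip:8098/buckets/test/keys/demo')
 print(resp.headers['Content-Length'])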


 Regards,
 Ciprian

 On Thu, Apr 9, 2015 at 12:11 PM, Alex De la rosa
 

Re: RAM and CPU requirements

2015-04-20 Thread Shankar Dhanasekaran
Thanks, Basho team, for addressing every single question :-) That's very
supportive.

On Mon, Apr 20, 2015 at 7:08 AM, Kota Uenishi k...@basho.com wrote:

 Hi,

 Generally speaking, 48/64GB of RAM with 20TB of storage is strong enough for
 typical usage of Riak CS, like yours. I would like to mention that too large
 a disk size per node combined with narrow bandwidth may lead to lower
 availability - the time to recover degraded replicas in case of node/disk
 failure may take longer (because a larger disk means a longer time to
 recover). That should not be longer than the MTBF. To ensure availability,
 please make sure you have 5 nodes or more.
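
 For a rough sense of scale (illustrative figures only; the 1Gbps of usable
 bandwidth below is an assumption, not a measurement):

 # Back-of-the-envelope estimate of how long refilling a full node could take
 disk_bytes = 20 * 10**12           # 20 TB of data per node
 net_bytes_per_sec = 1e9 / 8        # ~1 Gbps of usable bandwidth
 print(disk_bytes / net_bytes_per_sec / 3600)  # roughly 44 hours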


 On Sat, Apr 4, 2015 at 3:53 AM, Shankar Dhanasekaran
 shan...@opendrops.com wrote:
  Hi all,
  What are the RAM and CPU requirements for a Riak CS server with 20 TB of
  storage and a max of about 100 concurrent requests (a mix of puts and gets),
  with the maximum file size per request being less than 50 MB? I am just
  looking for a rough figure so that I don't have to buy a really powerful
  server where it's not needed.
 
  In this talk https://www.youtube.com/watch?v=nMyU6-pU6aw, Andy Gross says
  you don't want to run these on big enterprise servers but on basically crap
  and when it dies ... install new servers
 
  My servers come with 48/64 GB RAM and i7 quad-core processors. My need is
  basically more data storage rather than handling more requests concurrently.
  So what is the ideal 'crap' config for my need?
 
  Thanks,
  Shankar
 
  ___
  riak-users mailing list
  riak-users@lists.basho.com
  http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
 



 --
 Kota UENISHI / @kuenishi
 Basho Japan KK

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Yokozuna queries slow

2015-04-20 Thread Jason Campbell
Hello,

I'm currently trying to debug slow YZ queries. I've narrowed down the
issue, but I'm not sure how to solve it.

First off, we have about 80 million records in Riak (and YZ), but the queries 
return relatively few (a thousand or so at most).  Our query times are anywhere 
from 800ms to 1.5s.

I have been experimenting with queries directly on the Solr node, and it seems 
to be a problem with YZ and the way it does vnode filters.

Here is the same query, emulating YZ first:

{
  "responseHeader": {
    "status": 0,
    "QTime": 958,
    "params": {
      "q": "timestamp:[1429579919010 TO 1429579921010]",
      "indent": "true",
      "fq": "_yz_pn:55 OR _yz_pn:40 OR _yz_pn:25 OR _yz_pn:10",
      "rows": "0",
      "wt": "json"}},
  "response": {"numFound": 80, "start": 0, "docs": []
  }}

And the same query, but including the vnode filter in the main body instead of 
using a filter query:

{
  "responseHeader": {
    "status": 0,
    "QTime": 1,
    "params": {
      "q": "timestamp:[1429579919010 TO 1429579921010] AND (_yz_pn:55 OR _yz_pn:40 OR _yz_pn:25 OR _yz_pn:10)",
      "indent": "true",
      "rows": "0",
      "wt": "json"}},
  "response": {"numFound": 80, "start": 0, "docs": []
  }}

I understand there is a caching benefit to using filter queries, but a 
performance difference of 100x or greater doesn't seem worth it, especially 
with a constant data stream.

Is there a way to make YZ do this, or is the only way to query Solr directly, 
bypassing YZ?  Does anyone have any other suggestions of how to make this 
faster?
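
For reference, hitting Solr directly from Python and folding the partition
filter into q looks roughly like this (a sketch using the requests library;
the host, index name and partition list are placeholders, and going straight
to Solr bypasses YZ's coverage plan, so the _yz_pn list has to be built by
hand):

import requests

solr_url = 'http://solr-node:8093/internal_solr/myindex/select'
pn_filter = ' OR '.join('_yz_pn:%d' % pn for pn in (55, 40, 25, 10))
params = {
    'q': 'timestamp:[1429579919010 TO 1429579921010] AND (%s)' % pn_filter,
    'rows': 0,
    'wt': 'json',
}
resp = requests.get(solr_url, params=params).json()
print(resp['responseHeader']['QTime'], resp['response']['numFound'])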

The timestamp field is a Solr TrieLongField with default settings, if anyone is
curious.

Thanks,
Jason
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Ensembles failing to reach Leader ready state

2015-04-20 Thread Alexander Sicular
Hi Jonathan,

"staging (3 servers across NA)"

If this means you're spreading your cluster across North America, I would
suggest you reconsider. A Riak cluster is meant to be deployed in one data
center, more specifically in one LAN. Connecting Riak nodes over a WAN
introduces network latencies. Riak's approach to multi-datacenter replication
is as a cluster of clusters. That said, I don't believe strong consistency is
supported yet in an MDC environment.

-Alexander 

@siculars
http://siculars.posthaven.com

Sent from my iRotaryPhone

 On Apr 17, 2015, at 16:19, Andrew Stone ast...@basho.com wrote:
 
 Hi Jonathan,
  
 Sorry for the late reply. It looks like riak_ensemble still thinks that those 
 old nodes are part of the cluster. Did you remove them with 'riak-admin 
 cluster leave' ? If so they should have been removed from the root ensemble 
 also, and the machines shouldn't have actually left the cluster until all the 
 ensembles were reconfigured via joint consensus. Can you paste the results 
 from the following commands:
 
 riak-admin member-status
 riak-admin ring-status
 
 Thanks,
 Andrew
 
 
 On Mon, Mar 23, 2015 at 11:25 AM, Jonathan Koff jonat...@projexity.com 
 wrote:
 Hi all,
 
 I recently used Riak’s Strong Consistency functionality to get 
 auto-incrementing IDs for a feature of an application I’m working on, and 
 although this worked great in dev (5 nodes in 1 VM) and staging (3 servers 
 across NA) environments, I’ve run into some odd behaviour in production 
 (originally 3 servers, now 4) that prevents it from working.
 
 I initially noticed that consistent requests were immediately failing as 
 timeouts, and upon checking `riak-admin ensemble-status` saw that many 
 ensembles were at 0 / 3, from the vantage point of the box I was SSH’d into. 
 Interestingly, SSH-ing into different boxes showed different results. Here’s 
 a brief snippet of what I see now, after adding a fourth server in a 
 troubleshooting attempt:
 
 *Machine 1* (104.131.39.61)
 
 == Consensus System 
 ===
 Enabled: true
 Active:  true
 Ring Ready:  true
 Validation:  strong (trusted majority required)
 Metadata:best-effort replication (asynchronous)
 
 == Ensembles 
 ==
  Ensemble QuorumNodes  Leader
 ---
root   0 / 6 3 / 6  --
 2 0 / 3 3 / 3  --
 3 3 / 3 3 / 3  riak@104.131.130.237
 4 3 / 3 3 / 3  riak@104.131.130.237
 5 3 / 3 3 / 3  riak@104.131.130.237
 6 0 / 3 3 / 3  --
 7 0 / 3 3 / 3  --
 8 0 / 3 3 / 3  --
 9 3 / 3 3 / 3  riak@104.131.130.237
 10 3 / 3 3 / 3  riak@104.131.130.237
 11 0 / 3 3 / 3  --
 
 *Machine 2* (104.236.79.78)
 
 == Consensus System 
 ===
 Enabled: true
 Active:  true
 Ring Ready:  true
 Validation:  strong (trusted majority required)
 Metadata:best-effort replication (asynchronous)
 
 == Ensembles 
 ==
  Ensemble QuorumNodes  Leader
 ---
root   0 / 6 3 / 6  --
 2 3 / 3 3 / 3  riak@104.236.79.78
 3 3 / 3 3 / 3  riak@104.131.130.237
 4 3 / 3 3 / 3  riak@104.131.130.237
 5 3 / 3 3 / 3  riak@104.131.130.237
 6 3 / 3 3 / 3  riak@104.236.79.78
 7 0 / 3 3 / 3  --
 8 0 / 3 3 / 3  --
 9 3 / 3 3 / 3  riak@104.131.130.237
 10 3 / 3 3 / 3  riak@104.131.130.237
 11 3 / 3 3 / 3  riak@104.236.79.78
 
 *Machine 3* (104.131.130.237)
 
 == Consensus System 
 ===
 Enabled: true
 Active:  true
 Ring Ready:  true
 Validation:  strong (trusted majority required)
 Metadata:best-effort replication (asynchronous)
 
 == Ensembles 
 ==
  Ensemble QuorumNodes  Leader
 ---
root   0 / 6 3 / 6  --
 2 0 / 3 3 / 3  --
 3 3 / 3 3 / 3  riak@104.131.130.237
 4 3 / 3 3 / 3  riak@104.131.130.237
 5 3 / 3 3 / 3  riak@104.131.130.237
 6 0 / 3 3 / 3  --
 7 0 / 3 3 / 3  --
 8 0 / 3 

Re: Ensembles failing to reach Leader ready state

2015-04-20 Thread Jonathan Koff
Hi Alexander and Andrew,

Thanks for the follow-up!

Although I would expect to have used `riak-admin cluster leave`, it’s been 
months at this point and I can’t be sure. Perhaps I did something weird when I 
was getting started…

Given the uncertain state of the system, it may make sense for me to migrate 
everything to a fresh cluster, unless a simple solution exists. It’s small 
enough that this would be practical, albeit inconvenient.

Your timing in following up is interesting: just today I attempted to
`riak-admin cluster leave` a node (104.131.130.237), and it’s still in the
“leaving” state with 0.0% of the ring, with the logs filling up with messages like:
2015-04-18 02:45:30.927 [warning]
<0.9069.0>@riak_kv_ensemble_backend:handle_down:173 Vnode for Idx:
548063113999088594326381812268606132370974703616 crashed with reason: normal.

Output of `riak-admin member-status`:
= Membership ==
Status RingPendingNode
---
leaving 0.0%  --  'riak@104.131.130.237'
valid  34.4%  --  'riak@104.131.39.61'
valid  32.8%  --  'riak@104.236.79.78'
valid  32.8%  --  'riak@162.243.5.87'
---
Valid:3 / Leaving:1 / Exiting:0 / Joining:0 / Down:0

Output of `riak-admin ring-status`:
== Claimant ===
Claimant:  'riak@104.131.130.237'
Status: up
Ring Ready: true

== Ownership Handoff ==
No pending changes.

== Unreachable Nodes ==
All nodes are up and reachable



With regard to staging being spread out across NA, my thinking was that staging 
under extreme conditions would serve as a canary as well as help me familiarize 
myself with the performance characteristics of Riak. However, it ended up
working perfectly (including strong consistency), so I never ended up moving 
the servers to be in the same geographical area.

I'd be reluctant to put everything in one LAN when the key requirement that
led us to pick Riak was high availability, and network issues at a single
datacenter seem to be our most frequent mode of failure. I benchmarked under
various network configurations and all seemed to work flawlessly and with 
acceptable performance. Do you think this is reasonable?


Thanks again!

Jonathan Koff B.CS.
co-founder of Projexity
www.projexity.com

follow us on facebook at: www.facebook.com/projexity
follow us on twitter at: twitter.com/projexity
 On Apr 17, 2015, at 7:49 PM, Alexander Sicular sicul...@gmail.com wrote:
 
 Hi Jonathan,
 
 "staging (3 servers across NA)"
 
 If this means you're spreading your cluster across North America, I would
 suggest you reconsider. A Riak cluster is meant to be deployed in one data
 center, more specifically in one LAN. Connecting Riak nodes over a WAN
 introduces network latencies. Riak's approach to multi-datacenter replication
 is as a cluster of clusters. That said, I don't believe strong consistency is
 supported yet in an MDC environment.
 
 -Alexander 
 
 @siculars
 http://siculars.posthaven.com
 
 Sent from my iRotaryPhone
 
 On Apr 17, 2015, at 16:19, Andrew Stone ast...@basho.com wrote:
 
 Hi Jonathan,
  
 Sorry for the late reply. It looks like riak_ensemble still thinks that 
 those old nodes are part of the cluster. Did you remove them with 
 'riak-admin cluster leave' ? If so they should have been removed from the 
 root ensemble also, and the machines shouldn't have actually left the 
 cluster until all the ensembles were reconfigured via joint consensus. Can 
 you paste the results from the following commands:
 
 riak-admin member-status
 riak-admin ring-status
 
 Thanks,
 Andrew
 
 
 On Mon, Mar 23, 2015 at 11:25 AM, Jonathan Koff jonat...@projexity.com wrote:
 Hi all,
 
 I recently used Riak’s Strong Consistency functionality to get 
 auto-incrementing IDs for a feature of an application I’m working on, and 
 although this worked great in dev (5 nodes in 1 VM) and staging (3 servers 
 across NA) environments, I’ve run into some odd behaviour in production 
 (originally 3 servers, now 4) that prevents it from working.
 
 I initially noticed that consistent requests were immediately failing as 
 timeouts, and upon checking `riak-admin ensemble-status` saw that many 
 ensembles were at 0 / 3, from the vantage point of the box I was SSH’d into. 
 Interestingly, SSH-ing into different boxes showed different results. Here’s 
 a brief snippet of what I see now, after adding a fourth server in a 
 troubleshooting attempt:
 
 *Machine 1* (104.131.39.61)