Hey Kelly,

Thanks for reaching out! We’re using the following versions of RiakCS & Riak:

# Download RiakCS 
# Version: 1.4.5
# OS: Ubuntu 12.04 (Precise) AMD 64
curl -O http://s3.amazonaws.com/downloads.basho.com/riak-cs/1.4/1.4.5/ubuntu/precise/riak-cs_1.4.5-1_amd64.deb

# Download Riak
# Version: 1.4.8
# OS: Ubuntu 12.04 (Precise) AMD 64
curl -O http://s3.amazonaws.com/downloads.basho.com/riak/1.4/1.4.8/ubuntu/precise/riak_1.4.8-1_amd64.deb

The intent of having performant ls operations was so that we could connect 
via Transmit to view and navigate the contents of the bucket, similar to how 
you can access the contents of your S3 buckets in the AWS web UI. 
That being said, our keys are akin to a folder structure, for example...

/organizations/OrganizationID-[OrganizationID]/documents/proposals/ProposalID-[ProposalID]/DocumentSlotID-[DocumentSlotID]

S3 must be doing some sort of secondary indexing to allow for fast lookups 
here, because the bucket with the performance issues has only two 
“folders” under s3://bonfirehub-resources-can-east-doc-conversion, yet it 
takes the longest to list with s3cmd ls, since Riak is clearly traversing 
all the keys to fulfill the request.
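
For what it’s worth, the “folder” view Transmit and s3cmd present is just a 
delimiter listing under the hood. A rough sketch with boto (the endpoint, 
port, and credentials below are placeholders, not our real config):

# Rough sketch of the delimiter listing behind the "folder" view.
# Endpoint, port, and credentials are placeholders, not our real config.
import boto
from boto.s3.connection import OrdinaryCallingFormat

conn = boto.connect_s3('ACCESS_KEY', 'SECRET_KEY',
                       host='riakcs.example.com', port=8080,
                       is_secure=False,
                       calling_format=OrdinaryCallingFormat())
bucket = conn.get_bucket('bonfirehub-resources-can-east-doc-conversion',
                         validate=False)

# With delimiter='/' the listing returns Key objects plus Prefix objects
# (the "folders") instead of every key in the bucket.
for item in bucket.get_all_keys(delimiter='/'):
    print(item.name)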

Short story: this is not a requirement for us in order to use RiakCS. 
However, going forward it would be desirable if RiakCS could maintain this 
form of secondary index (and potentially offer a web UI) to better match the 
use cases of clients who are used to S3.

                 Alex Millar, CTO  
Office: 1-800-354-8010 ext. 704  
Mobile: 519-729-2539  
GoBonfire.com

From: Kelly McLaughlin <ke...@basho.com>
Reply: Kelly McLaughlin <ke...@basho.com>
Date: August 15, 2014 at 7:03:47 PM
To: Alex Millar <a...@gobonfire.com>, riak-users@lists.basho.com 
<riak-users@lists.basho.com>
Subject:  Re: Slow s3cmd ls queries + HAProxy 504 timeouts  

Hello Alex. Would you mind sharing what version of Riak and Riak CS you are 
using? Also, if you can post the contents of your Riak CS app.config file, it 
might help give a better idea of what might be going on.

Generally, listing the contents of a bucket is more expensive than a normal 
download or upload request, but there have been performance improvements in 
recent versions of Riak CS, and there are settings that can be adjusted 
depending on the version you are using. The time required to list the 
contents of the entire bucket is definitely related to the number of objects 
in that bucket, so the time will continue to increase as the number of 
objects increases, but we do continue to work to make the process as 
efficient as possible.

Depending on why you need to list the contents of the bucket, the max-keys 
query parameter available with the bucket listing operation may be useful. By 
default this limit is 1000 keys, but s3cmd does not expose it as far as I’m 
aware, and instead buffers all the results until the end of the contents is 
reached. But if you need to list the contents for the purpose of some 
processing step, it may work better for you to break the process up into 
smaller chunks using max-keys.
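
If it helps, here is a rough sketch of that chunked approach using boto (the 
host, port, and credentials are placeholders to substitute with your own):

# Sketch: walk the bucket 1000 keys at a time with max-keys and a marker,
# rather than buffering the entire listing the way s3cmd does.
# Host, port, and credentials are placeholders.
import boto
from boto.s3.connection import OrdinaryCallingFormat

conn = boto.connect_s3('ACCESS_KEY', 'SECRET_KEY',
                       host='riakcs.example.com', port=8080,
                       is_secure=False,
                       calling_format=OrdinaryCallingFormat())
bucket = conn.get_bucket('bonfirehub-resources-can-east-doc-conversion',
                         validate=False)

marker = ''
while True:
    page = bucket.get_all_keys(max_keys=1000, marker=marker)
    for key in page:
        print(key.name)        # stand-in for your per-key processing
    if not page.is_truncated:
        break
    marker = page[-1].name     # resume after the last key returned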

Kelly

On 08/15/2014 06:39 AM, Alex Millar wrote:
So the issue we’re having is only with bucket listing.

alxndrmlr@alxndrmlr-mbp $ time s3cmd -c .s3cfg-riakcs-admin ls s3://bonfirehub-resources-can-east-doc-conversion
                       DIR   s3://bonfirehub-resources-can-east-doc-conversion/organizations/

real 2m0.747s
user 0m0.076s
sys 0m0.030s

whereas…

alxndrmlr@alxndrmlr-mbp $ time s3cmd -c .s3cfg-riakcs-admin ls s3://bonfirehub-resources-can-east-doc-conversion/organizations/OrganizationID-1/documents/proposals
                       DIR   s3://bonfirehub-resources-can-east-doc-conversion/organizations/OrganizationID-1/documents/proposals/

real 0m10.262s
user 0m0.075s
sys 0m0.028s

This bucket contains a lot of very small files (basically, for each PDF we 
receive, I split it into one .JPG per page and store them here). Based on my 
latest counts, it looks like we have around 170,000 .JPG files in that 
bucket.

Now I’ve had a hunch this is just a fundamentally expensive operation, one 
that exceeds the 5000ms response-time threshold set in our HAProxy config 
(which I raised during the video to illustrate what’s going on; a sketch of 
the relevant section is included below). After reading 
http://www.quora.com/Riak/Is-it-really-expensive-for-Riak-to-list-all-buckets-Why
 and http://www.paperplanes.de/2011/12/13/list-all-of-the-riak-keys.html I’m 
feeling like this is just a fundamental issue with the data structure in Riak.

Based on this, I’m thinking the cost of this type of query is only going to 
get worse over time as we add more keys to this bucket (unless secondary 
indexes can be added). Or am I totally out to lunch here, and there’s some 
other underlying problem?
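
For reference, the relevant piece of our HAProxy config looks roughly like 
this (a simplified sketch, not our exact file):

# Simplified sketch of the HAProxy timeout in play; only the 5000ms
# threshold is the real value from our setup.
defaults
    mode http
    timeout server 5000ms    # listings that run longer than this surface as 504s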

        Alex Millar, CTO
Office: 1-800-354-8010 ext. 704
Mobile: 519-729-2539  
GoBonfire.com

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

