Re: [ceph-users] radosgw daemon stalls on download of some files

2013-11-30 Thread Artem Silenkov
Good day!

Is it possible to change the frontend to something other than Apache? For
example, Nginx.


Regards, Artem Silenkov


2013/11/30 Sebastian webmas...@mailz.de

 Hi Yehuda,


  It's interesting: the responses are received, but it seems that they
  aren't being handled (hence the following pings). There are a few
  things that you could look at. First, try to connect to the admin
  socket and see if you get any useful information from there. This
  could include in-flight requests; look for other requests that have
  not completed. Also see if there's any indication of request throttling.

 Do you mean the methods mentioned here:
 http://ceph.com/docs/dumpling/radosgw/troubleshooting/?
 Unfortunately the socket file is not present. Do I have to activate it in
 the config somehow? I could not find any reference to that in the docs. Is
 it already included in my radosgw version?
 radosgw -v
 ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)

  Another thing to look at would be the seemingly unrelated timeout
  messages. These should not happen and might indicate that there's
  something holding you up that shouldn't be. Try searching for the
  same thread id that is specified in these messages (omit the 0x
  prefix), and see what the last thing it was doing is.

 I checked that:
 http://pastebin.com/Z23PWwjt
 I do not see anything unusual before the messages happen, but maybe you
 see something odd.
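
The suggestion above, following one thread id through the log, could be sketched roughly as below. This is only an illustration: the function name `lines_for_thread` is made up, it assumes the logging thread's id appears as a whitespace-separated field on each log line (as in the heartbeat_map lines quoted in this thread), and the sample messages are placeholders rather than real radosgw output.

```python
def lines_for_thread(log_lines, thread_id, count=5):
    """Return the last `count` lines logged by `thread_id`.

    Assumption: the logging thread's id is a whitespace-separated field
    written without the 0x prefix.
    """
    tid = thread_id.lower()
    if tid.startswith("0x"):
        tid = tid[2:]  # omit the 0x prefix, as suggested above
    matches = [line for line in log_lines if tid in line.lower().split()]
    return matches[-count:]


# Placeholder lines; only the heartbeat_map line is taken from this thread.
sample = [
    "7f00934bb700 <placeholder: some earlier message>",
    "7f00bc66a700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread "
    "0x7f00934bb700' had timed out after 600",
    "7f00934bb700 <placeholder: last message before the stall>",
]

for line in lines_for_thread(sample, "0x7f00934bb700"):
    print(line)
```

The heartbeat_map line is skipped on purpose: it only mentions the stalled thread (with the 0x prefix) and is logged by a different thread.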


  You could also try turning on 'debug objecter = 20' to see if it
  provides more info (it's very verbose though).
 

 Did that, but that is way too verbose for me ;) I uploaded it here:
 http://pastebin.com/VBPAVP6z
 There might be some requests mixed into it, but the one for
 cdn/52974400c6dd6ca71904/source.avi is the one that stalled.

  How much are you loading the gateway before that happens? We've seen a
  similar issue in the past that was related to the fcgi library that is
  dynamically linked with the radosgw process (that is, not the apache
  mod_fastcgi module). This, however, would only happen when there's
  heavy load and the fd numbers handled by the radosgw surpassed 1024
  (buggy library that was using select() instead of poll()).

 There are not that many requests on the storage, maybe 10-20 req/min. The
 cluster serves as a source for a CDN, so once a resource is fetched it
 should not be fetched again soon. I checked the open files, and there
 are only about 10-20 open file handles for the radosgw process. So this
 is probably not the issue.
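
A rough way to repeat that open-file check in a script is sketched below. It is an assumption-laden illustration, not part of radosgw: the helper names are made up, it relies on the Linux /proc filesystem, and 1024 is taken as the usual FD_SETSIZE limit that a select()-based library cannot go beyond.

```python
import os

# Assumption: FD_SETSIZE is 1024, the common value on Linux.
SELECT_FD_LIMIT = 1024


def open_fd_numbers(pid):
    """List the numeric file descriptors open in process `pid` (Linux only)."""
    return sorted(int(name) for name in os.listdir(f"/proc/{pid}/fd"))


def fds_beyond_select(fd_numbers, limit=SELECT_FD_LIMIT):
    """Return the descriptors that a select()-based library could not watch."""
    return [fd for fd in fd_numbers if fd >= limit]


# Made-up descriptor numbers, for illustration only.
print(fds_beyond_select([0, 1, 2, 17, 1023, 1024, 2000]))
```

For a live check one would pass `open_fd_numbers(pid)` for the radosgw process id into `fds_beyond_select`; reading another user's /proc entry usually requires root. With only 10-20 open handles the result would be expected to be empty.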

 Sebastian


 
  Yehuda
 
  On Fri, Nov 29, 2013 at 7:28 AM, Sebastian webmas...@mailz.de wrote:
  Hi,
 
  thanks for the hint. I tried this again and noticed that the timeout
 message does seem to be unrelated. Here is the log file for a stalling
 request with debug turned on:
  http://pastebin.com/DcQuc9wP
 
  I really cannot find an actual error in the log. The download
 stalls at about 500kb at that point though. Restarting radosgw fixes it for
 1 download only; the next one is broken again. But as I said, this does not
 happen for all files.
 
  Sebastian
 
  On 27.11.2013, at 21:53, Yehuda Sadeh wrote:
 
  On Wed, Nov 27, 2013 at 4:46 AM, Sebastian webmas...@mailz.de wrote:
  Hi,
 
  we have a setup of 4 servers running ceph and radosgw. We use it as
 an internal S3 service for our files. The servers run Debian Squeeze with
 Ceph 0.67.4.
 
  The cluster has been running smoothly for quite a while, but we are
 currently experiencing issues with the radosgw. For some files the HTTP
 Download just stalls at around 500kb.
 
  The Apache error log just says:
  [error] [client ] FastCGI: comm with server /var/www/s3gw.fcgi
 aborted: idle timeout (30 sec)
  [error] [client ] Handler for fastcgi-script returned invalid result
 code 1
 
  radosgw logging:
  7f00bc66a700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread
 0x7f00934bb700' had timed out after 600
  7f00bc66a700  1 heartbeat_map is_healthy 'RGWProcess::m_tp thread
 0x7f00ab4eb700' had timed out after 600
 
  The interesting thing is that the cluster health is fine and only some
 files are not working properly. Most of them work just fine. A restart of
 radosgw fixes the issue. The other ceph logs are also clean.
 
  Any idea why this happens?
 
 
  No, but you can turn on 'debug ms = 1' in your gateway's ceph.conf, and
  that might give some better indication.
 
  Yehuda
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] installing OS on software RAID

2013-11-30 Thread Gandalf Corvotempesta
2013/11/25 James Harper james.har...@bendigoit.com.au:
 Is the OS doing anything apart from ceph? Would booting a ramdisk-only system 
 from USB or compact flash work?

This is the same question I asked some time ago.
Is it OK to use a USB drive as the standard OS (OS, not OSD!) disk? OSDs and
journals will be on dedicated disks.
The USB drive will be used just to store and boot the operating system.


Re: [ceph-users] radosgw daemon stalls on download of some files

2013-11-30 Thread Andrew Woodward
Are you using the Inktank-patched FastCGI server? http://gitbuilder.ceph.com

Alternatively, try another server such as nginx, as already suggested.
On Nov 29, 2013 12:23 PM, German Anders gand...@despegar.com wrote:

  Thanks a lot Sebastian, I'm going to try that. I'm also having an issue
 while trying to test creating an rbd image. I've installed the ceph-client
 on the deploy server:

 ceph@ceph-deploy01:/etc/ceph$ sudo rbd -n client.ceph-test -k
 /home/ceph/ceph-cluster/ceph.client.admin.keyring create --size 10240
 cephdata
 2013-11-29 15:20:25.683930 7fcd9979c780  0 librados: client.ceph-openstack
 authentication error (1) Operation not permitted
 rbd: couldn't connect to the cluster!

  Does anyone know what could be the issue here? Maybe it has something to do
 with keys, or maybe not...

 Thanks in advance,

 Best regards,


 *German Anders*

 --- Original message ---
 *Subject:* Re: [ceph-users] radosgw daemon stalls on download of some
 files
 *From:* Sebastian webmas...@mailz.de
 *To:* ceph-users ceph-users@lists.ceph.com
 *Date:* Friday, 29/11/2013 16:18


Re: [ceph-users] installing OS on software RAID

2013-11-30 Thread Kyle Bader
  Is the OS doing anything apart from ceph? Would booting a ramdisk-only
system from USB or compact flash work?

I haven't tested this kind of configuration myself, but I can't think of
anything that would preclude this type of setup. I'd probably use squashfs
layered with a tmpfs via aufs to avoid any writes to the USB drive. I would
also mount spinning high-capacity media for /var/log or set up log streaming
to something like rsyslog/syslog-ng/logstash.


Re: [ceph-users] Impact of fancy striping

2013-11-30 Thread Kyle Bader
 This journal problem is a bit of wizardry to me, I even had weird
intermittent issues with OSDs not starting because the journal was not
found, so please do not hesitate to suggest a better journal setup.

You mentioned using SAS for the journal. If your OSDs are SATA and an expander
is in the data path, it might be slow from MUX/STP/etc overhead. If the
setup is all SAS, you might try collocating each journal with its matching
data partition on a single disk. Two spindles shared by 9 OSDs must see a lot
of contention. How are your drives attached?


[ceph-users] odd performance graph

2013-11-30 Thread James Harper
I ran HDTach on one of my VMs and got a graph that looks like this:

___--

The low points are all ~35Mbytes/sec and the high points are all ~60Mbytes/sec. 
This is very reproducible.

HDTach does sample reads across the whole disk, so would I be right in thinking
that the variation is due to PGs being on different OSDs, and that there is a
difference in performance because of a difference in my OSDs?

Is there a way for me to identify which OSDs are letting me down here?
Presently I have 3:

#1 - xfs with isize=256
#2 - xfs with isize=2048
#3 - btrfs

I have my suspicions about which is dragging the chain, but how could I confirm 
it?

Thanks

James



Re: [ceph-users] odd performance graph

2013-11-30 Thread James Harper
 
 I ran HDTach on one of my VMs and got a graph that looks like this:
 
 ___--
 
 The low points are all ~35Mbytes/sec and the high points are all
 ~60Mbytes/sec. This is very reproducible.
 
 HDTach does sample reads across the whole disk, so would I be right in
 thinking that the variation is due to PGs being on different OSDs, and
 that there is a difference in performance because of a difference in my OSDs?
 
 Is there a way for me to identify which OSDs are letting me down here?
 Presently I have 3:
 
 #1 - xfs with isize=256
 #2 - xfs with isize=2048
 #3 - btrfs
 
 I have my suspicions about which is dragging the chain, but how could I
 confirm it?
 

It occurred to me that just stopping the OSDs selectively would allow me to
see if there was a change when one was ejected, but at no time was there a
change to the graph...

James


[ceph-users] Adding new OSDs, need to increase PGs?

2013-11-30 Thread Indra Pramana
Dear all,

Greetings to all, I am new to this list. Please pardon my newbie question. :)

I am running a Ceph cluster with 3 servers and 4 drives / OSDs per server,
so there are currently 12 OSDs in total running in the cluster. I set the PG
(placement group) count to 600 based on the recommendation of calculating
number of PGs = number of OSDs * 100 / number of replicas, with 2 replicas.

Now I am going to add one more OSD node with 4 drives (OSDs) to make up to
16 OSDs in total.

My question: do I need to increase the PG count to 800, or leave it at
600? If I need to increase it, at which step during the insertion of the new
node should I increase the values (pg_num and pgp_num)?
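
For reference, the rule of thumb quoted above works out as in the small sketch below. The function name and the `pgs_per_osd` parameter are made up for illustration; the sketch only reproduces the arithmetic from this message and does not say whether or when the value should be changed on a live cluster.

```python
def suggested_pg_count(num_osds, replicas, pgs_per_osd=100):
    """Rule of thumb from the message: OSDs * 100 / replicas."""
    return num_osds * pgs_per_osd // replicas


# Current cluster: 3 servers * 4 OSDs, 2 replicas.
print(suggested_pg_count(12, 2))

# After adding one more node with 4 OSDs.
print(suggested_pg_count(16, 2))
```

The two calls correspond to the 600 and 800 figures mentioned in the question.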

Any advice is greatly appreciated, thank you.

Cheers.