Re: [ceph-users] Is lttng enabled by default in debian hammer-0.94.5?

2015-10-30 Thread shylesh kumar
Hi,

Have you disabled apparmor?

Please check apparmor_status; if libvirtd is in enforcing mode, then
lttng traces will not be generated.
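
A minimal check, assuming an Ubuntu/Debian host with apparmor-utils
installed (the libvirtd profile path below is the usual default; adjust
if yours differs):

  # show whether an apparmor profile for libvirtd is loaded, and in which mode
  sudo apparmor_status | grep -i libvirt
  # switch the profile to complain mode so it cannot block the tracing
  sudo aa-complain /etc/apparmor.d/usr.sbin.libvirtd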

Thanks,
Shylesh

On Fri, Oct 30, 2015 at 9:15 AM, hzwulibin  wrote:
> Hi, everyone
>
> After installing hammer-0.94.5 on Debian, I wanted to trace librbd with
> lttng, but after running the following steps, I got nothing:
>  mkdir -p traces
>  lttng create -o traces librbd
>  lttng enable-event -u 'librbd:*'
>  lttng add-context -u -t pthread_id
>  lttng start
>  lttng stop
>
> So, is lttng enabled in this version on Debian?
>
> Thanks!
>
> --
> hzwulibin
> 2015-10-30
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
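
For reference: in the steps quoted above, the session is started and then
immediately stopped, so no librbd I/O runs while the session is recording.
A complete session might look like the sketch below (the pool/image names
and the bench command are illustrative, and some builds additionally gate
the tracepoints behind "rbd tracing = true" in ceph.conf):

  lttng create -o traces librbd
  lttng enable-event -u 'librbd:*'
  lttng add-context -u -t pthread_id
  lttng start
  # generate librbd traffic while the session is recording, e.g.:
  rbd -p rbd bench-write testimg --io-total 16777216
  lttng stop
  lttng view | head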



-- 
Thanks & Regards
Shylesh Kumar M
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] No Presto metadata available for Ceph-noarch ceph-release-1-1.el7.noarch.rp FAILED

2015-10-30 Thread Andrey Shevel
Does somebody know what this error message means?

===
[ceph@ceph-admin yum.repos.d]$ sudo yum update -y && sudo yum install ceph-deploy -y
Loaded plugins: priorities
Ceph                                                     |  951 B  00:00:00
Not using downloaded repomd.xml because it is older than what we have:
  Current   : Mon Oct 26 21:21:26 2015
  Downloaded: Mon Oct 19 20:36:49 2015
Ceph-noarch                                              |  951 B  00:00:00
Not using downloaded repomd.xml because it is older than what we have:
  Current   : Mon Oct 26 21:21:06 2015
  Downloaded: Mon Oct 19 20:36:38 2015
ceph-source                                              |  951 B  00:00:00
Not using downloaded repomd.xml because it is older than what we have:
  Current   : Mon Oct 26 21:21:08 2015
  Downloaded: Mon Oct 19 20:36:39 2015
Resolving Dependencies
--> Running transaction check
---> Package ceph-release.noarch 0:1-0.el7 will be updated
---> Package ceph-release.noarch 0:1-1.el7 will be an update
---> Package fcgi.x86_64 0:2.4.0-22.el7 will be updated
---> Package fcgi.x86_64 0:2.4.0-25.el7 will be an update
---> Package gperftools-libs.x86_64 0:2.1-1.el7 will be updated
---> Package gperftools-libs.x86_64 0:2.4-5.el7 will be an update
---> Package python-flask.noarch 1:0.10.1-3.el7 will be updated
---> Package python-flask.noarch 1:0.10.1-4.el7 will be an update
---> Package python-itsdangerous.noarch 0:0.23-1.el7 will be updated
---> Package python-itsdangerous.noarch 0:0.23-2.el7 will be an update
---> Package python-werkzeug.noarch 0:0.9.1-1.el7 will be updated
---> Package python-werkzeug.noarch 0:0.9.1-2.el7 will be an update
--> Finished Dependency Resolution

Dependencies Resolved


 Package              Arch     Version          Repository    Size
================================================================================
Updating:
 ceph-release         noarch   1-1.el7          Ceph-noarch   4.0 k
 fcgi                 x86_64   2.4.0-25.el7     epel           47 k
 gperftools-libs      x86_64   2.4-5.el7        epel          279 k
 python-flask         noarch   1:0.10.1-4.el7   sl-extras     203 k
 python-itsdangerous  noarch   0.23-2.el7       sl-extras      23 k
 python-werkzeug      noarch   0.9.1-2.el7      sl-extras     561 k

Transaction Summary

Upgrade  6 Packages

Total size: 1.1 M
Total download size: 4.0 k
Downloading packages:
No Presto metadata available for Ceph-noarch
ceph-release-1-1.el7.noarch.rp FAILED
http://download.ceph.com/rpm-giant/el7/noarch/ceph-release-1-1.el7.noarch.rpm:
[Errno 14] HTTP Error 404 - Not Found
]  0.0 B/s |0 B  --:--:-- ETA
Trying other mirror.


Error downloading packages:
  ceph-release-1-1.el7.noarch: [Errno 256] No more mirrors to try.

[ceph@ceph-admin yum.repos.d]$
=
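
The 404 above means the ceph-release RPM named in the repo metadata is no
longer present at that URL. A minimal way to confirm and work around it,
assuming the stock /etc/yum.repos.d/ceph.repo layout (the rpm-hammer
baseurl is only an example of a populated path, not a confirmed fix):

  # confirm the file really is gone (URL taken from the error above)
  curl -I http://download.ceph.com/rpm-giant/el7/noarch/ceph-release-1-1.el7.noarch.rpm
  # discard the stale metadata and retry
  sudo yum clean all
  sudo yum update -y
  # if rpm-giant no longer carries the package, point baseurl in
  # /etc/yum.repos.d/ceph.repo at a current release, e.g.:
  #   baseurl=http://download.ceph.com/rpm-hammer/el7/noarch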

-- 
Andrey Y Shevel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd

2015-10-30 Thread Wido den Hollander


On 29-10-15 16:38, Voloshanenko Igor wrote:
> Hi Wido and all community.
> 
> We caught a very bizarre issue on our CloudStack installation, which is
> related to Ceph and possibly to the java-rados lib.
> 

I think you ran into this one:
https://issues.apache.org/jira/browse/CLOUDSTACK-8879

Cleaning up RBD snapshots for volumes didn't go well and caused the JVM
to crash.

Wido

> So, the agent crashes constantly (which causes a very big problem for
> us...).
> 
> When the agent crashes, it crashes the JVM, and there is no event in the logs at all.
> We enabled crash dumps, and after a crash we see the following:
> 
> #grep -A1 "Problematic frame" < /hs_err_pid30260.log
>  Problematic frame:
>  C  [librbd.so.1.0.0+0x5d681]
> 
> # gdb /usr/lib/librbd.so.1.0.0 /var/tmp/cores/jsvc.25526.0.core
> (gdb)  bt
> ...
> #7  0x7f30b9a1fed2 in ceph::log::SubsystemMap::should_gather
> (level=, sub=, this=)
> at ./log/SubsystemMap.h:62
> #8  0x7f30b9a3b693 in ceph::log::SubsystemMap::should_gather
> (this=, sub=, level=)
> at ./log/SubsystemMap.h:61
> #9  0x7f30b9d879be in ObjectCacher::flusher_entry
> (this=0x7f2fb4017910) at osdc/ObjectCacher.cc:1527
> #10 0x7f30b9d9851d in ObjectCacher::FlusherThread::entry
> (this=) at osdc/ObjectCacher.h:374
> 
> From the Ceph code, this part is executed when flushing a cache object... and
> we don't understand why, because the race condition we use to reproduce it
> is completely different.
> 
> As CloudStack does not yet have a good implementation of the snapshot
> lifecycle, it sometimes happens that some volumes are already marked as
> EXPUNGED in the DB, and CloudStack then tries to delete the base volume
> before it tries to unprotect it.
> 
> Sure enough, unprotecting fails and a normal exception is returned (it
> fails because the snap has children...)
> 
> 2015-10-29 09:02:19,401 DEBUG [kvm.resource.KVMHAMonitor]
> (Thread-1304:null) Executing:
> /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh
> -i 10.44.253.13 -p /var/lib/libvirt/PRIMARY -m
> /mnt/93655746-a9ef-394d-95e9-6e62471dd39f -h 10.44.253.11
> 2015-10-29 09:02:19,412 DEBUG [kvm.resource.KVMHAMonitor]
> (Thread-1304:null) Execution is successful.
> 2015-10-29 09:02:19,554 INFO  [kvm.storage.LibvirtStorageAdaptor]
> (agentRequest-Handler-5:null) Unprotecting and Removing RBD snapshots of
> image 6789/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c prior to removing the image
> 2015-10-29 09:02:19,571 DEBUG [kvm.storage.LibvirtStorageAdaptor]
> (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at
> cephmon.anolim.net:6789 
> 2015-10-29 09:02:19,608 DEBUG [kvm.storage.LibvirtStorageAdaptor]
> (agentRequest-Handler-5:null) Unprotecting snapshot
> cloudstack/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c@cloudstack-base-snap
> 2015-10-29 09:02:19,627 DEBUG [kvm.storage.KVMStorageProcessor]
> (agentRequest-Handler-5:null) Failed to delete volume:
> com.cloud.utils.exception.CloudRuntimeException:
> com.ceph.rbd.RbdException: Failed to unprotect snapshot cloudstack-base-snap
> 2015-10-29 09:02:19,628 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-5:null) Seq 4-1921583831:  { Ans: , MgmtId:
> 161344838950, via: 4, Ver: v1, Flags: 10,
> [{"com.cloud.agent.api.Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:
> com.ceph.rbd.RbdException: Failed to unprotect snapshot
> cloudstack-base-snap","wait":0}}] }
> 2015-10-29 09:02:25,722 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-2:null) Processing command:
> com.cloud.agent.api.GetHostStatsCommand
> 2015-10-29 09:02:25,722 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-2:null) Executing: /bin/bash -c idle=$(top -b -n
> 1| awk -F, '/^[%]*[Cc]pu/{$0=$4; gsub(/[^0-9.,]+/,""); print }'); echo $idle
> 2015-10-29 09:02:26,249 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-2:null) Execution is successful.
> 2015-10-29 09:02:26,250 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-2:null) Executing: /bin/bash -c
> freeMem=$(free|grep cache:|awk '{print $4}');echo $freeMem
> 2015-10-29 09:02:26,254 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-2:null) Execution is successful.
> 
> BUT, after 20 minutes the agent crashed... If we remove all children and
> create the conditions for CloudStack to delete the volume, everything is
> OK; no agent crash within 20 minutes...
> 
> We can't connect this action (the volume delete) with the agent crash...
> Also, we don't understand why roughly 20 minutes have to pass before the
> agent crashes...
> 
> From the logs, before the crash there is only GetVMStats... and then the agent started again...
> 
> 2015-10-29 09:21:55,143 DEBUG [cloud.agent.Agent] (UgentTask-5:null)
> Sending ping: Seq 4-1343:  { Cmd , MgmtId: -1, via: 4, Ver: v1, Flags:
> 11,
> [{"com.cloud.agent.api.PingRoutingCommand":{"newStates":{},"_hostVmStateReport":{"i-881-1117-VM":{"state":"PowerOn","host":"cs2.anolim.net
> "},"i-7-106-VM":{"state":"PowerOn","host":"cs2.anolim.net
> 
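
As a side note, the offset in the "Problematic frame" line can be resolved
to a symbol directly; a sketch using the paths from the messages above (the
jsvc binary path is illustrative):

  # map the faulting offset inside librbd to a demangled symbol (and, with
  # debug symbols installed, a source line)
  addr2line -Cfe /usr/lib/librbd.so.1.0.0 0x5d681
  # or open the core alongside the JVM launcher and print the backtrace
  gdb -batch -ex bt /usr/bin/jsvc /var/tmp/cores/jsvc.25526.0.core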

Re: [ceph-users] Cloudstack agent crashed JVM with exception in librbd

2015-10-30 Thread Voloshanenko Igor
It's a pain, but no... :(
We already use your updated lib in our dev env... :(

2015-10-30 10:06 GMT+02:00 Wido den Hollander :

>
>
> On 29-10-15 16:38, Voloshanenko Igor wrote:
> > Hi Wido and all community.
> >
> > We catched very idiotic issue on our Cloudstack installation, which
> > related to ceph and possible to java-rados lib.
> >
>
> I think you ran into this one:
> https://issues.apache.org/jira/browse/CLOUDSTACK-8879
>
> Cleaning up RBD snapshots for volumes didn't go well and caused the JVM
> to crash.
>
> Wido

Re: [ceph-users] Is lttng enabled by default in debian hammer-0.94.5?

2015-10-30 Thread hzwulibin
Hi, 

We didn't enable AppArmor.

Thanks!

--   
hzwulibin
2015-10-30

-
From: shylesh kumar 
Date: 2015-10-30 20:44
To: hzwulibin
Cc: ceph-devel, ceph-users
Subject: Re: [ceph-users] Is lttng enabled by default in debian hammer-0.94.5?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Write throughput drops to zero

2015-10-30 Thread Brendan Moloney
Hi,

I recently got my first Ceph cluster up and running and have been doing some 
stress tests. I quickly found that during sequential write benchmarks the 
throughput would often drop to zero. Initially I saw this inside QEMU virtual 
machines, but I can also reproduce the issue with "rados bench" within 5-10 
minutes of sustained writes.  If left alone the writes will eventually start 
going again, but it takes quite a while (at least a couple minutes). If I stop 
and restart the benchmark the write throughput will immediately be where it is 
supposed to be.
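
For reference, a typical rados bench invocation for this kind of test looks
like the sketch below (pool name, duration, and thread count are
illustrative, not the exact parameters I used):

  ceph osd pool create bench 128                     # throwaway test pool
  rados bench -p bench 600 write -t 16 --no-cleanup  # 10 min of sustained writes
  rados -p bench cleanup                             # remove benchmark objects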

I have convinced myself it is not a network hardware issue.  I can load up the 
network with a bunch of parallel iperf benchmarks and it keeps chugging along 
happily. When the issue occurs with Ceph I don't see any indications of network 
issues (e.g. dropped packets).  Adding additional network load during the rados 
bench (using iperf) doesn't seem to trigger the issue any faster or more often.

I have also convinced myself it isn't an issue with a journal getting full or 
an OSD being too busy.  The amount of data being written before the problem 
occurs is much larger than the total journal capacity. Watching the load on the 
OSD servers with top/iostat I don't see anything being overloaded; rather I
see the load everywhere drop to essentially zero when the writes stall. Before 
the writes stall the load is well distributed with no visible hot spots. The 
OSDs and hosts that report slow requests are random, so I don't think it is a 
failing disk or server.  I don't see anything interesting going on in the logs 
so far (I am just about to do some tests with Ceph's debug logging cranked up).

The cluster specs are:

OS: Ubuntu 14.04 with 3.16 kernel
Ceph: 9.1.0
OSD Filesystem: XFS
Replication: 3X
Two racks with IPoIB network
10Gbps Ethernet between racks
8 OSD servers with:
  * Dual Xeon E5-2630L (12 cores @ 2.4GHz)
  * 128GB RAM
  * 12 6TB Seagate drives (connected to LSI 2208 chip in JBOD mode)
  * Two 400GB Intel P3600 NVMe drives (OS on RAID1 partition, 6 partitions for 
OSD journals each)
  * Mellanox ConnectX-3 NIC (for both Infiniband and 10Gbps Ethernet)
3 Mons collocated on OSD servers

Any advice is greatly appreciated. I am planning to try this with Hammer too.

Thanks,
Brendan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph -s hangs; need troubleshooting ideas

2015-10-30 Thread Artie Ziff
Hello,

I have one Ceph cluster that works fine and one that is not starting.
On the VM cluster Ceph works OK; on my native hardware Ceph is not starting.
The OS is the same: a recently updated Ubuntu 14.

Following the *exact* same procedure as on the working cluster (the
fetch/checkout/build/post-install-monitor process is automated end-to-end),
I observe that Ceph is not writing a log file on the host that is not
starting.

What does that indicate? To save me from reading the source code: which
program in the Ceph suite writes the log file, and at what point in its
start procedure does it do so?

Creating the monitor post-installation did not throw any errors, and the
monitor appears to start without error or other complaints.

ceph-mon -i $HOSTNAME

ceph -s simply hangs with no output.

I have some experience (by now) removing all Ceph app data, re-installing,
and performing some diagnostics, but there is little for me to go on based
on what I have learned so far.

Any common or routine things to look at when Ceph is simply not starting?
I am reading one Ceph administrator's troubleshooting page. Are there others?
Maybe a blog that I should know about?  :)

Looks like my next steps are related to using something called the
monitor's admin socket.
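
A minimal admin-socket probe, assuming the default socket path and a mon id
equal to the hostname:

  sudo ceph daemon mon.$HOSTNAME mon_status
  # equivalently, naming the socket explicitly:
  sudo ceph --admin-daemon /var/run/ceph/ceph-mon.$HOSTNAME.asok mon_status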

I will search and read previous threads to find some direction.

thx in adv!
-az
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SHA1 wrt hammer release and tag v0.94.3

2015-10-30 Thread Gregory Farnum
On Fri, Oct 30, 2015 at 6:20 PM, Artie Ziff  wrote:
> Hello,
>
> In the RELEASE INFORMATION section of the hammer v0.94.3 issue tracker [1]
> the git commit SHA1 is: b2503b0e15c0b13f480f0835060479717b9cf935
>
> On the github page for Ceph Release v0.94.3 [2], when I click on the
> "95cefea" link [3]
> we see the commit SHA1 of: 95cefea9fd9ab740263bf8bb4796fd864d9afe2b
>
> [1] http://tracker.ceph.com/issues/11990
> [2] https://github.com/ceph/ceph/releases/tag/v0.94.3
> [3]
> https://github.com/ceph/ceph/commit/95cefea9fd9ab740263bf8bb4796fd864d9afe2b
>
> I'm looking forward to learning...
> Why the different SHA1 values in two places that reference v0.94.3?

Odd. It's probably to do with that PGP signing in one of the commits,
but I bet Alfredo knows.
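
For what it's worth, a signed (annotated) tag is its own git object with its
own SHA1, distinct from the commit it points to, which could account for the
two values. A quick check against a local clone:

  git rev-parse v0.94.3             # SHA1 of the tag object itself
  git rev-parse 'v0.94.3^{commit}'  # SHA1 of the commit the tag points to
  git cat-file -p v0.94.3           # tag body, including any PGP signature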

>
> Thanks for reading my question!
>
> PS: Tried to send this to ceph-dev but Gmail can't send plain text e-mails,
> evidently.
> So annoying! Maybe somebody knows the trick if it is even possible.

http://lmgtfy.com/?q=gmail+plain+text :)

>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Write throughput drops to zero

2015-10-30 Thread K K

I got the same situation on a 1Gbit network. Try changing to MTU 9000 on the
NIC and the switch.
Can you show your cluster configs?
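
One way to verify jumbo frames end-to-end before relying on them (interface
and host names are placeholders):

  ip link set dev eth0 mtu 9000
  # -M do forbids fragmentation; 8972 = 9000 minus 20 (IP) and 8 (ICMP)
  # header bytes, so a reply proves the path really carries 9000
  ping -M do -s 8972 osd-host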
--
Kostya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SHA1 wrt hammer release and tag v0.94.3

2015-10-30 Thread Artie Ziff
Hello,

In the RELEASE INFORMATION section of the hammer v0.94.3 issue tracker [1]
the git commit SHA1 is: b2503b0e15c0b13f480f0835060479717b9cf935

On the github page for Ceph Release v0.94.3 [2], when I click on the
"95cefea" link [3]
we see the commit SHA1 of: 95cefea9fd9ab740263bf8bb4796fd864d9afe2b

[1] http://tracker.ceph.com/issues/11990
[2] https://github.com/ceph/ceph/releases/tag/v0.94.3
[3]
https://github.com/ceph/ceph/commit/95cefea9fd9ab740263bf8bb4796fd864d9afe2b

I'm looking forward to learning...
Why the different SHA1 values in two places that reference v0.94.3?

Thanks for reading my question!

PS: Tried to send this to ceph-dev but Gmail can't send plain text e-mails,
evidently.
So annoying! Maybe somebody knows the trick if it is even possible.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] proxmox 4.0 release : lxc with krbd support and qemu librbd improvements

2015-10-30 Thread Ilya Dryomov
On Fri, Oct 30, 2015 at 1:18 PM, Florent B  wrote:
> Hi,
>
> Just a little question for the krbd gurus: Proxmox 4 uses the 4.2.2 kernel.
> Is krbd stable for Hammer? And will it be for Infernalis?

krbd is in the upstream kernel, so speaking in terms of Ceph releases has
little point. It is stable, and 4.2.2 is pretty close to the latest, so you
should be in good shape with both hammer and infernalis on the server
side.
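
A quick client-side sanity check, assuming an image named test in the rbd
pool (both illustrative; the mapped device name may differ):

  uname -r              # confirm the running kernel (4.2.2 here)
  modinfo rbd | head    # confirm the rbd module shipped with it
  sudo rbd map rbd/test && sudo rbd unmap /dev/rbd0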

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-disk prepare with systemd and infernalis

2015-10-30 Thread Loic Dachary
Hi Mathias,

On 31/10/2015 02:05, MATHIAS, Bryn (Bryn) wrote:
> Hi All,
> 
> I have been rolling out an infernalis cluster; however, I get stuck on the
> ceph-disk prepare stage.
> 
> I am deploying ceph via ansible along with a whole load of other software.
> 
> Log output is at the end of the message, but the solution is to copy the
> "/lib/systemd/system/ceph-osd@.service" file to
> "/etc/systemd/system/ceph-osd@.service" in the ceph-disk script.
> 
> Without doing this I need to somehow grab the number of the OSD that is
> going to be created and copy the file before ceph-disk runs
> "'/usr/bin/systemctl', 'enable', 'ceph-osd@'".
> 
> Whilst this works ok if I am doing everything in serial, the ideal is to be 
> able to spin up a cluster asynchronously.
> 
> 
> Unless there’s something I’m missing?
> 

On a given machine, ceph-disk prepare /dev/sdb should run fine in parallel
with ceph-disk prepare /dev/sdc. Is there a simple way to reproduce the
problem you are seeing?
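
For instance (device names taken from your log), something like:

  ceph-disk prepare /dev/sdb & ceph-disk prepare /dev/sdc & wait
  systemctl list-units 'ceph-osd@*'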

Cheers

> 
> Cheers,
> Bryn

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-disk prepare with systemd and infernalis

2015-10-30 Thread MATHIAS, Bryn (Bryn)
Hi All,

I have been rolling out an infernalis cluster; however, I get stuck on the
ceph-disk prepare stage.

I am deploying ceph via ansible along with a whole load of other software.

Log output is at the end of the message, but the solution is to copy the
"/lib/systemd/system/ceph-osd@.service" file to
"/etc/systemd/system/ceph-osd@.service" in the ceph-disk script.

Without doing this I need to somehow grab the number of the OSD that is
going to be created and copy the file before ceph-disk runs
"'/usr/bin/systemctl', 'enable', 'ceph-osd@'".

Whilst this works ok if I am doing everything in serial, the ideal is to be 
able to spin up a cluster asynchronously.


Unless there’s something I’m missing?
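
Concretely, the workaround can be scripted ahead of ceph-disk; a minimal
sketch (unit paths as given earlier in this message):

  cp /lib/systemd/system/ceph-osd@.service /etc/systemd/system/ceph-osd@.service
  systemctl daemon-reload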



Cheers,
Bryn

Log lines from ansible below.


failed: [tapir5.eng.velocix.com -> 127.0.0.1] => (item=tapir5.eng.velocix.com) 
=> {"changed": false, "cmd": ["ssh", "r...@tapir5.eng.velocix.com", 
"/tmp/ceph_disk.sh 5 ceph 749cee00-c818-4abc-90ee-7b1193c2a8b9"], "delta": 
"0:00:24.432348", "end": "2015-10-30 16:36:03.214603", "failed": true, 
"failed_when_result": true, "item": "tapir5.eng.velocix.com", "rc": 1, "start": 
"2015-10-30 16:35:38.782255", "stdout_lines": ["/dev/sdb1 on 
/var/lib/ceph/osd/ceph-0 type xfs 
(rw,noatime,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota)", 
"/dev/sdc1 on /var/lib/ceph/osd/ceph-2 type xfs 
(rw,noatime,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota)", 
"/dev/sdd1 on /var/lib/ceph/osd/ceph-4 type xfs 
(rw,noatime,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota)", 
"Creating new GPT entries.", "GPT data structures destroyed! You may now 
partition the disk using fdisk or", "other utilities.", "Creating new GPT 
entries.", "The operation has completed successfully.", "The operation has 
completed successfully.", "The operation has completed successfully.", 
"meta-data=/dev/sde1  isize=2048   agcount=32, agsize=45744064 
blks", " =   sectsz=512   attr=2, projid32bit=1", " 
=   crc=0finobt=0", "data = 
  bsize=4096   blocks=1463810048, imaxpct=5", " =   
sunit=64 swidth=64 blks", "naming   =version 2  
bsize=4096   ascii-ci=0 ftype=0", "log  =internal log   bsize=4096  
 blocks=521728, version=2", " =   sectsz=512   
sunit=64 blks, lazy-count=1", "realtime =none   extsz=4096   
blocks=0, rtextents=0", "The operation has completed successfully."], 
"warnings": []}
stderr: Failed to issue method call: No such file or directory
ceph-disk: Error: ceph osd start failed: Command '['/usr/bin/systemctl', 
'enable', 'ceph-osd@9']' returned non-zero exit status 1
stdout: /dev/sdb1 on /var/lib/ceph/osd/ceph-0 type xfs 
(rw,noatime,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota)
/dev/sdc1 on /var/lib/ceph/osd/ceph-2 type xfs 
(rw,noatime,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota)
/dev/sdd1 on /var/lib/ceph/osd/ceph-4 type xfs 
(rw,noatime,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota)
Creating new GPT entries.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
The operation has completed successfully.
The operation has completed successfully.
meta-data=/dev/sde1  isize=2048   agcount=32, agsize=45744064 blks
 =   sectsz=512   attr=2, projid32bit=1
 =   crc=0finobt=0
data =   bsize=4096   blocks=1463810048, imaxpct=5
 =   sunit=64 swidth=64 blks
naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
log  =internal log   bsize=4096   blocks=521728, version=2
 =   sectsz=512   sunit=64 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0
The operation has completed successfully.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com