Re: [ceph-users] No module named rados

2016-11-28 Thread JiaJia Zhong
hi,
since you are on CentOS 7, why not follow 
http://docs.ceph.com/docs/master/install/get-packages/ or simply download the 
binary packages from https://download.ceph.com/rpm-jewel/ ? :)
if you insist on building ceph from ceph-10.2.2.tar.gz, please follow 
http://docs.ceph.com/docs/giant/install/build-ceph/, and make sure your system is 
clean before you start, otherwise it will be hard to tell what went wrong.


Which to choose depends on your purpose: development or plain usage.


I think you should understand why the error occurs rather than only how to work around it.
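
A minimal sketch of the package route on CentOS 7, in case it helps; the repo file content follows the get-packages documentation and is illustrative only:

# cat > /etc/yum.repos.d/ceph.repo <<'EOF'
[ceph]
name=Ceph packages
baseurl=https://download.ceph.com/rpm-jewel/el7/x86_64/
enabled=1
gpgcheck=1
gpgkey=https://download.ceph.com/keys/release.asc
EOF
# yum install -y epel-release
# yum install -y ceph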
 
 
-- Original --
From:  "鹏";
Date:  Tue, Nov 29, 2016 02:02 PM
To:  "“ceph-us...@ceph.co"; 

Subject:  [ceph-users] No module named rados

 
hi,


I built ceph from ceph-10.2.2.tar.gz, but I get an error like this:

[root@mds0 ceph-10.2.2]# ceph -s
Traceback (most recent call last):
  File "/usr/local/bin/ceph", line 118, in 
import rados
ImportError: No module named rados


I searched for the rados module like this:
[root@mds0 ceph-10.2.2]# locate rados.py
/usr/local/ceph-10.2.2/src/pybind/rados/rados.pyx


There is no file named rados.py:
[root@mds0 ceph-10.2.2]# ls /usr/local/lib/python2.7/site-packages/
ceph_argparse.py   ceph_argparse.pyo  ceph_daemon.pyc  ceph_rest_api.py   
ceph_rest_api.pyo  ceph_volume_client.pyc
ceph_argparse.pyc  ceph_daemon.py ceph_daemon.pyo  ceph_rest_api.pyc  
ceph_volume_client.py  ceph_volume_client.pyo



ls /usr/lib/python2.7/site-packages/ceph_*
/usr/lib/python2.7/site-packages/ceph_detect_init-1.0.1-py2.7.egg  
/usr/lib/python2.7/site-packages/ceph_disk-1.0.0-py2.7.egg





I built and installed ceph-10.2.2.tar.gz like this:



#./install_deps.sh
 
#./autogen.sh
 
#./configure 
 
# make -j2
 
# make install 




# cat /proc/version 
Linux version 3.10.0-327.el7.x86_64 (buil...@kbuilder.dev.centos.org) (gcc 
version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Thu Nov 19 22:10:57 UTC 
2015



centos7


How can I solve this error?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] general ceph cluster design

2016-11-28 Thread nick
Hi Ben,
thanks for the information as well. It looks like we will first do some latency 
tests between our data centers (thanks for the netem hint) before deciding 
which topology is best for us. For simple DR scenarios, rbd mirroring sounds 
like the better solution so far.
We are still fans of the hyperconverged setup (compute + ceph on one node) and 
are searching for suitable hardware. I think the resource usage separation 
should be doable with CPU pinning and plain old cgroups.
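
A minimal sketch of that kind of pinning, assuming systemd-managed OSDs (the unit name matches the stock jewel packaging; the core list is purely illustrative):

# dedicate cores 0-7 to the ceph-osd daemons via a systemd drop-in
# mkdir -p /etc/systemd/system/ceph-osd@.service.d
# cat > /etc/systemd/system/ceph-osd@.service.d/cpuaffinity.conf <<'EOF'
[Service]
CPUAffinity=0 1 2 3 4 5 6 7
EOF
# systemctl daemon-reload
# systemctl restart ceph-osd.target

The guest VMs could then be pinned to the remaining cores with libvirt vcpupin or an equivalent mechanism.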

Cheers
Nick

On Monday, November 28, 2016 12:23:07 PM Benjeman Meekhof wrote:
> Hi Nick,
> 
> We have a Ceph cluster spread across 3 datacenters at 3 institutions
> in Michigan (UM, MSU, WSU).  It certainly is possible.  As noted you
> will have increased latency for write operations and overall reduced
> throughput as latency increases.  Latency between our sites is 3-5ms.
> 
> We did some simulated latency testing with netem where we induced
> varying levels of latency on one of our storage hosts (60 OSD).  Some
> information about the results is on our website:
> http://www.osris.org/performance/latency
> 
> We also had success running a 4th cluster site at Supercomputing in
> SLC.  We'll be putting up information on experiences there in the near
> future.
> 
> thanks,
> Ben
> 
> On Mon, Nov 28, 2016 at 12:06 AM, nick  wrote:
> > Hi Maxime,
> > thank you for the information given. We will have a look and check.
> > 
> > Cheers
> > Nick
> > 
> > On Friday, November 25, 2016 09:48:35 PM Maxime Guyot wrote:
> >> Hi Nick,
> >> 
> >> See inline comments.
> >> 
> >> Cheers,
> >> Maxime
> >> 
> >> On 25/11/16 16:01, "ceph-users on behalf of nick"
> >> 
> >>  wrote:
> >> >Hi,
> >> >we are currently planning a new ceph cluster which will be used for
> >> >virtualization (providing RBD storage for KVM machines) and we have
> >> >some
> >> >general questions.
> >> >
> >> >* Is it advisable to have one ceph cluster spread over multiple
> >> >datacenters (latency is low, as they are not so far from each
> >> >other)? Is anybody doing this in a production setup? We know that
> >> >any
> >> >network issue would affect virtual machines in all locations instead
> >> >just one, but we can see a lot of advantages as well.
> >> 
> >> I think the general consensus is to limit the size of the failure domain.
> >> That said, it depends the use case and what you mean by “multiple
> >> datacenters” and “latency is low”: writes will have to be journal-ACK:ed
> >> by the OSDs in the other datacenter. If there is 10ms latency between
> >> Location1 and Location2, then it would add 10ms to each write operation
> >> if
> >> crushmap requires replicas in each location. Speaking of which a 3rd
> >> location would help with sorting our quorum (1 mon at each location) in
> >> “triangle” configuration.
> >> 
> >> If this is for DR: RBD-mirroring is supposed to address that, you might
> >> not
> >> want to have 1 big cluster ( = failure domain).
> >> 
> >> If this is for VM live
> >> migration: Usually requires spread L2 adjacency (failure domain) or
> >> overlays (VXLAN and the likes), “network trombone” effect can be a
> >> problem
> >> depending on the setup
> >> I know of Nantes University who used/is using a 3 datacenter Ceph
> >> cluster:
> >> http://dachary.org/?p=2087
> >> 
> >> >* We are planning to combine the hosts for ceph and KVM (so far we
> >> >are
> >> >using
> >> >separate hosts for virtual machines and ceph storage). We see
> >> >the big advantage (next to the price drop) of an automatic ceph
> >> >expansion when adding more compute nodes as we got into situations
> >> >in
> >> >the past where we had too many compute nodes and the ceph cluster
> >> >was
> >> >not expanded properly (performance dropped over time). On the other
> >> >side there would be changes to the crush map every time we add a
> >> >compute node and that might end in a lot of data movement in ceph.
> >> >Is
> >> >anybody using combined servers for compute and ceph storage and has
> >> >some experience?
> >> 
> >> The challenge is to avoid ceph-osd to become a noisy neighbor for the VMs
> >> hosted on the hypervisor, especially under recovery. I’ve heard people
> >> using CPU pinning, containers, and QoS to keep it under control.
> >> 
> >> Sebastian
> >> has an article on his blog this topic:
> >> https://www.sebastien-han.fr/blog/2016/07/11/Quick-dive-into-hyperconverg
> >> ed
> >> -architecture-with-OpenStack-and-Ceph/
> >> For the performance dropped over time, you can look to improve your
> >> capacity:performance ratio.
> >> 
> >> >* is there a maximum amount of OSDs in a ceph cluster? We are
> >> >planning
> >> >to use
> >> >a minimum of 8 OSDs per server and going to have a cluster
> >> >with about 100 servers which would 



Re: [ceph-users] Ceph OSDs cause kernel unresponsive

2016-11-28 Thread Craig Chi
Hi Brad,

We fully understand that the hardware we currently use is below Ceph's 
recommendation, so we are looking for a method to lower or restrict the 
resources needed by the OSDs. Losing some performance is definitely acceptable 
for us.

The reason why we did these experiments and discussed the causes is that we want to 
find the true factors that drive the memory usage. I think it is beneficial 
for the Ceph community, and it lets us convince our customers and other Ceph users 
of the feasibility and stability of Ceph on different hardware 
infrastructure for production.

With your comments, we have more confidence in the memory consumption behavior of the 
Ceph OSD.

We hope there are still some methods or workarounds to bound the memory 
consumption (tuning configs?); otherwise we will just accept the recommendations on 
the website. (Also, can we treat 1 GB per 1 TB as the maximum requirement, or is it 
just enough under normal circumstances?)
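
(For reference, a sketch of jewel-era [osd] options that are sometimes lowered to trade performance for a smaller memory and recovery footprint -- the option names exist in jewel, but the values below are illustrative only, not recommendations from this thread:)

[osd]
    osd map cache size = 200
    osd map max advance = 100
    osd max backfills = 1
    osd recovery max active = 1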

Thank you very much.

Sincerely,
Craig Chi

On 2016-11-29 10:27, Brad Hubbard wrote:
>   
>   
> On Tue, Nov 29, 2016 at 3:12 AM, Craig Chi wrote:
> > Hi guys,
> >   
> > Thanks to both of your suggestions, we had some progression on this issue.
> >   
> > I tuned vm.min_free_kbytes to 16GB and raised vm.vfs_cache_pressure to 200, 
> > and I did observe that the OS keep releasing cache while the OSDs want more 
> > and more memory.
>   
> vfs_cache_pressure is a percentage, so values >100 have always seemed odd to me.
> >   
> > OK. Now we are going to reproduce the hanging issue.
> >   
> > 1. set the cluster with noup flag
> > 2. restart all ceph-osd process (then we can see all OSDs are down from 
> > ceph monitor)
> > 3. unset noup flag
> >   
> > As expected the OSDs started to consume memory, and eventually the kernel 
> > still hanged without response.
> >   
> > Therefore I learned to gather the vmcore and tried to investigate further 
> > as Brad advised.
> >   
> > The vmcore dump file was unbelievably huge -- about 6 GB per dump. However 
> > it's helpful that we quickly found the following abnormal things:
> >   
> > 1. The memory was exhausted as expected.
> >   
> > crash> kmem -i
> >                    PAGES        TOTAL      PERCENTAGE
> >    TOTAL MEM    63322527     241.6 GB         ----
> >         FREE      676446       2.6 GB    1% of TOTAL MEM
> >         USED    62646081       239 GB   98% of TOTAL MEM
> >       SHARED      621336       2.4 GB    0% of TOTAL MEM
> >      BUFFERS       47307     184.8 MB    0% of TOTAL MEM
> >       CACHED      376205       1.4 GB    0% of TOTAL MEM
> >         SLAB      455400       1.7 GB    0% of TOTAL MEM
> >
> >   TOTAL SWAP     4887039      18.6 GB         ----
> >    SWAP USED     3855938      14.7 GB   78% of TOTAL SWAP
> >    SWAP FREE     1031101       3.9 GB   21% of TOTAL SWAP
> >
> > COMMIT LIMIT    36548302     139.4 GB         ----
> >    COMMITTED    92434847     352.6 GB  252% of TOTAL LIMIT
>   
> As Nick already mentioned, 90 x 8 TB disks is 720 TB of storage and, according 
> to http://docs.ceph.com/docs/jewel/start/hardware-recommendations/#ram, during 
> recovery you may require ~1GB per 1TB of storage per daemon.
> >   
> >   
> > 2. Each OSD used a lot of memory. (We have only total 256 GB RAM but there 
> > are 90 OSDs in a node)
> >   
> > # Find 10 largest memory consumption processes
> > crash>ps -G | sed 's/>//g' | sort -k 8,8 -n | awk '$8 ~ /[0-9]/{ $8 = 
> > $8/1024" MB"; print}' | tail -10
> > 100864 1 12 883a43e1b700 IN 1.1 7484884 2973.33 MB ceph-osd
> > 87400 1 27 8838538ae040 IN 1.1 7557500 3036.92 MB ceph-osd
> > 108126 1 22 882bcca91b80 IN 1.2 7273068 3045.8 MB ceph-osd
> > 39787 1 28 883f468ab700 IN 1.2 7300756 3067.88 MB ceph-osd
> > 44861 1 20 883cf925 IN 1.2 7327496 3067.89 MB ceph-osd
> > 30486 1 23 883f59e1c4c0 IN 1.2 7332828 3083.58 MB ceph-osd
> > 125239 1 15 882687018000 IN 1.2 6965560 3103.36 MB ceph-osd
> > 123807 1 19 88275d90ee00 IN 1.2 7314484 3173.48 MB ceph-osd
> > 116445 1 1 882863926e00 IN 1.2 7279040 3269.09 MB ceph-osd
> > 94442 1 0 882ed2d01b80 IN 1.3 7566148 3418.69 MB ceph-osd
>   
> Based on the information above this is not excessive memory usage AFAICS.
> >   
> >   
> > 3. The excessive amount of message threads.
> >   
> > crash>ps | grep ms_pipe_read | wc -l
> > 144112
> > crash>ps | grep ms_pipe_write | wc -l
> > 146692
> >   
> > Totally up to 290k threads in ms_pipe_*.
> >   
> >   
> > 4. Several tries we had, and we luckily got some memory profiles before oom 
> > killer started to work.
> >   
> > # Parse the smaps of a ceph-osd process with 
> > parse_smaps.py (https://github.com/craig08/parse_smaps)
> >   
> > root@ceph2:~# ./parse_smaps.py /proc/198557/smaps
> > ===
> >   Private      Private      Shared     Shared
> >   Clean    +   Dirty    +   Clean   +  Dirty   =   Total : library
> > ===
> > 16660 kB + 5804548 kB +0 kB +0 kB = 5821208 kB : [heap]
> > 40 kB +92 kB +7640 kB +0 kB =7772 kB : ceph-osd
> > 56 kB +2472 kB +0 kB +0 kB =2528 kB : [anonymous]
> > 2084 kB +0 kB +0 kB +0 kB =2084 kB : 007656.ldb
> > 2080 kB +0 kB +0 kB 

Re: [ceph-users] High ops/s with kRBD and "--object-size 32M"

2016-11-28 Thread Alex Gorbachev
On Mon, Nov 28, 2016 at 2:59 PM Ilya Dryomov  wrote:

> On Mon, Nov 28, 2016 at 6:20 PM, Francois Blondel 
> wrote:
> > Hi *,
> >
> > I am currently testing different scenarios to try to optimize sequential
> > read and write speeds using Kernel RBD.
> >
> > I have two block devices created with :
> >   rbd create block1 --size 500G --pool rbd --image-feature layering
> >   rbd create block132m --size 500G --pool rbd --image-feature layering
> > --object-size 32M
> >
> > -> Writing to block1 works quite fine  (about 200ops/s, 310MB/s in
> average,
> > for a 250GB file) (tests running with dd)
> > -> Writing to block132m is much slower (about 40MB/s in average), and
> > generates high ops/s (seen from a ceph -w) (from 4000 to 13000)
> >
> > Current test cluster:
> >
> >  health HEALTH_WARN
> > noscrub,nodeep-scrub,sortbitwise flag(s) set
> >  monmap e2: 3 mons at
> > {aac=10.113.49.48:6789/0,aad=10.112.33.36:6789/0,aae=10.112.48.60:6789/0
> }
> > election epoch 26, quorum 0,1,2 aad,aae,aac
> >  osdmap e10962: 38 osds: 38 up, 38 in
> > flags noscrub,nodeep-scrub,sortbitwise
> >   pgmap v120464: 1024 pgs, 1 pools, 486 GB data, 122 kobjects
> > 4245 GB used, 50571 GB / 54816 GB avail
> > 1024 active+clean
> >
> > The OSDs (using bluestore) have been created using:
> > ceph-disk prepare --zap-disk --bluestore --cluster ceph
> --cluster-uuid
> > XX..XX  /dev/sdX
> >
> > ceph -v :   ceph version 10.2.3
> > (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> >
> >
> > Does someone have any experience involving "non-standard" RBD
> "object-size"
> > ?
> >
> > Could this be due to "bluestore", or has someone already encountered that
> > issue using "filestore" OSDs ?
>
> It's hard to tell without any additional information: dd command,
> iostat or blktrace, probably some OSD logs as well.
>
> A ton of work has gone into bluestore in kraken, mostly on the
> performance front - jewel bluestore has little in common with the
> current version.
>
> >
> > Should switching to a higher RBD "object-size" at least theoretically
> > improve seq r/w speeds ?
>
> Well, it really depends on the workload.  It may result in an
> improvement in certain cases, but there are many downsides - RADOS (be
> it with filestore or bluestore) works much better with smaller objects.
>
> I agree with Jason in that you are probably better off with the
> default.  Try experimenting with krbd readahead - bump it to 4M or 8M
> or even higher and make sure you have a recent kernel on the client
> machine (4.4 or newer).
>
> There were a number of threads on this subject on ceph-users.  Search
> for: single thread sequential kernel rbd readahead, or so.
>
> Thanks,
>
> Ilya
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Our experience on a busy production cluster shows better read and
write latency response with 16MB objects. With 4MB objects we were seeing
sometimes 20 second delays; with 16MB it is more like 5 seconds at most.
There are a few caveats to our current cluster:

- it is made of about 200 NL-SAS 4TB HDDs with Micron SSDs as journals. I
have been told that on the 7.2k rpm drives latency jumps after about 200
iops per spindle.

- our workload is 100% VMware VMs running replicated databases. Now with
NFS, but likely still a lot of small IO.

I wonder if we are a corner case. But with 16 MB objects, with both the iSCSI
gateway as well as NFS, we saw a clear improvement in latency and
throughput. I will reach out to our performance engineer if anyone is
interested in the details of the tests.

Any thoughts on why this is the case? Nick Fisk thought that maybe the thin
space allocation overhead was smaller with larger object sizes?

Regards,
Alex
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] export-diff behavior if an initial snapshot is NOT specified

2016-11-28 Thread Zhongyan Gu
Thank you Jason.
We are designing a backup system for our production cluster based on ceph's
export/import diff feature.
We found this issue and hopefully it can be confirmed and then fixed soon.
If you need any more information for debugging, please just let me know.

Thanks,
Zhongyan

On Mon, Nov 28, 2016 at 10:05 PM, Jason Dillaman 
wrote:

> OK, in that case, none of my previous explanation is relevant. I'll
> spin up a hammer cluster and try to reproduce.
>
> On Wed, Nov 23, 2016 at 9:13 PM, Zhongyan Gu 
> wrote:
> > BTW, I used Hammer 0.94.5 to do the test.
> >
> > Zhongyan
> >
> > On Thu, Nov 24, 2016 at 10:07 AM, Zhongyan Gu 
> wrote:
> >>
> >> Thank you Jason. My test shows in the following case, image B will be
> >> exactly same:
> >> 1. clone image A from parent:
> >> #rbd clone 1124-parent@snap1 A
> >>
> >> 2. create snap for A
> >> #rbd snap create A@snap1
> >>
> >> 3. create empty image B
> >> #rbd create B -s 1
> >>
> >> 4. export-diff A then import-diff B:
> >> #rbd export-diff A@snap1 -|./rbd import-diff - B
> >>
> >> 5. check A@snap1 equals B@snap1
> >> #rbd export A@snap1 -|md5sum
> >> Exporting image: 100% complete...done.
> >> 880709d7352b6c9926beb1d829673366  -
> >> #rbd export B@snap1 -|md5sum
> >> Exporting image: 100% complete...done.
> >> 880709d7352b6c9926beb1d829673366  -
> >> output shows A@snap1 equals B@snap1
> >>
> >> However, in the following case, image B will not be exactly same:
> >> 1. clone image A from parent:
> >> #rbd clone 1124-parent@snap1 A
> >>
> >> 2. create snap for A
> >> #rbd snap create A@snap1
> >>
> >> 3. use fio make some change to A
> >>
> >> 4. create empty image B
> >> #rbd create B -s 1
> >>
> >> 4. export-diff A then import-diff B:
> >> #rbd export-diff A@snap1 -|./rbd import-diff - B
> >>
> >> 5. check A@snap1 equals B@snap1
> >> #rbd export A@snap1 -|md5sum
> >> Exporting image: 100% complete...done.
> >> 880709d7352b6c9926beb1d829673366  -
> >> #rbd export B@snap1 -|md5sum
> >> Exporting image: 100% complete...done.
> >> bbf7cf69a84f3978c66f5eb082fb91ec  -
> >> output shows A@snap1 DOES NOT equal B@snap1
> >>
> >> The second case can always be reproduced. What is wrong with the second
> >> case?
> >>
> >> Thanks,
> >> Zhongyan
> >>
> >>
> >> On Wed, Nov 23, 2016 at 10:11 PM, Jason Dillaman 
> >> wrote:
> >>>
> >>> What you are seeing sounds like a side-effect of deep-flatten support.
> >>> If you write to an unallocated extent within a cloned image, the
> >>> associated object extent must be read from the parent image, modified,
> >>> and written to the clone image.
> >>>
> >>> Since the Infernalis release, this process has been tweaked if the
> >>> cloned image has a snapshot. In that case, the associated object
> >>> extent is still read from the parent, but instead of being modified
> >>> and written to the HEAD revision, it is left unmodified and is written
> >>> to "pre" snapshot history followed by writing the original
> >>> modification (w/o the parent's object extent data) to the HEAD
> >>> revision.
> >>>
> >>> This change to the IO path was made to support flattening clones and
> >>> dissociating them from their parents even if the clone had snapshots.
> >>>
> >>> Therefore, what you are seeing with export-diff is actually the
> >>> backing object extent of data from the parent image written to the
> >>> clone's "pre" snapshot history. If you had two snapshots and your
> >>> export-diff'ed from the first to second snapshot, you wouldn't see
> >>> this extra data.
> >>>
> >>> To your question about how to prepare image B to make sure it will be
> >>> exactly the same, the answer is that you don't need to do anything. In
> >>> your example above, I am assuming you are manually creating an empty
> >>> Image B and using "import-diff" to populate it. The difference in the
> >>> export-diff is most likely related to fact that the clone lost its
> >>> sparseness on any backing object that was written (e.g. instead of a
> >>> one or more 512 byte diffs within a backing object extent, you will
> >>> see a single, full-object extent with zeroes where the parent image
> >>> had no data).
> >>>
> >>>
> >>> On Wed, Nov 23, 2016 at 5:06 AM, Zhongyan Gu 
> >>> wrote:
> >>> > Let me make the issue more clear.
> >>> > Suppose I cloned image A from a parent image and create snap1 for
> image
> >>> > A
> >>> > and  then make some change of image A.
> >>> > If I did the rbd export-diff @snap1. how should I prepare the
> existing
> >>> > image
> >>> > B to make sure it  will be exactly same with image A@snap1 after
> >>> > import-diff
> >>> > against this image B.
> >>> >
> >>> > Thanks,
> >>> > Zhongyan
> >>> >
> >>> >
> >>> > On Wed, Nov 23, 2016 at 11:34 AM, Zhongyan Gu  >
> >>> > wrote:
> >>> >>
> >>> >> Thanks Jason, very clear explanation.
> >>> >> However, I found some strange behavior when export-diff on a cloned
> >>> 

Re: [ceph-users] undefined symbol: rados_inconsistent_pg_list

2016-11-28 Thread JiaJia Zhong
hi, 
1. Try the following: remove /root/.python-eggs/rados-0-py2.7-linux-x86_64.egg-tmp/
(if you are sure that you want to keep it, back it up first).




2. The command you ran, #cp -vf /usr/local/lib/python2.7/site-packages/* 
/usr/lib64/python2.7/, is generally not recommended.
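
A small sketch of step 1 with the backup-first variant; the path is taken from the traceback, the backup location is arbitrary:

# cp -a /root/.python-eggs/rados-0-py2.7-linux-x86_64.egg-tmp /root/rados-egg-tmp.bak
# rm -rf /root/.python-eggs/rados-0-py2.7-linux-x86_64.egg-tmp
# ceph -s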


 
-- Original --
From:  "鹏";
Date:  Tue, Nov 29, 2016 10:56 AM
To:  "ceph-us...@ceph.com"; 

Subject:  [ceph-users] undefined symbol: rados_inconsistent_pg_list

 


The ceph version is ceph-10.2.2.tar.gz.


The error message is:

[root@mon0 ceph-10.2.2]# ceph -s
Traceback (most recent call last):
  File "/usr/local/bin/ceph", line 118, in 
import rados
  File "build/bdist.linux-x86_64/egg/rados.py", line 7, in 
  File "build/bdist.linux-x86_64/egg/rados.py", line 6, in __bootstrap__
ImportError: /root/.python-eggs/rados-0-py2.7-linux-x86_64.egg-tmp/rados.so: 
undefined symbol: rados_inconsistent_pg_list



I built ceph following these steps:

#yum groupinstall "Development tools" 
 
#./install_deps.sh
 
#./autogen.sh
 
#./configure 
 
# make -j2
 
# make install 
 
#cp -vf /usr/local/lib/python2.7/site-packages/*  /usr/lib64/python2.7/


How can I fix this error?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] - cluster stuck and undersized if at least one osd is down

2016-11-28 Thread Brad Hubbard


On Mon, Nov 28, 2016 at 9:54 PM, Piotr Dzionek  wrote:
> Hi,
> I recently installed 3 nodes ceph cluster v.10.2.3. It has 3 mons, and 12
> osds. I removed default pool and created the following one:
>
> pool 7 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
> rjenkins pg_num 1024 pgp_num 1024 last_change 126 flags hashpspool
> stripe_width 0

Do you understand the significance of min_size 1?

Are you OK with the likelihood of data loss that this value introduces?
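
If the answer is no and there is capacity for a third copy, one possible adjustment is to raise the replica count and minimum (pool name taken from the pool dump quoted above; whether size 3 fits your capacity is your call):

# ceph osd pool set data size 3
# ceph osd pool set data min_size 2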

>
> Cluster is healthy if all osds are up, however if I stop any of the osds, it
> becomes stuck and undersized - it is not rebuilding.
>
> cluster *
>  health HEALTH_WARN
> 166 pgs degraded
> 108 pgs stuck unclean
> 166 pgs undersized
> recovery 67261/827220 objects degraded (8.131%)
> 1/12 in osds are down
>  monmap e3: 3 mons at
> {**osd01=***.144:6789/0,***osd02=***.145:6789/0,**osd03=*.146:6789/0}
> election epoch 14, quorum 0,1,2 **osd01,**osd02,**osd03
>  osdmap e161: 12 osds: 11 up, 12 in; 166 remapped pgs
> flags sortbitwise
>   pgmap v307710: 1024 pgs, 1 pools, 1230 GB data, 403 kobjects
> 2452 GB used, 42231 GB / 44684 GB avail
> 67261/827220 objects degraded (8.131%)
>  858 active+clean
>  166 active+undersized+degraded
>
> Replica size is 2 and and I use the following crushmap:
>
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> tunable straw_calc_version 1
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
> device 9 osd.9
> device 10 osd.10
> device 11 osd.11
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> host osd01 {
> id -2   # do not change unnecessarily
> # weight 14.546
> alg straw
> hash 0  # rjenkins1
> item osd.0 weight 3.636
> item osd.1 weight 3.636
> item osd.2 weight 3.636
> item osd.3 weight 3.636
> }
> host osd02 {
> id -3   # do not change unnecessarily
> # weight 14.546
> alg straw
> hash 0  # rjenkins1
> item osd.4 weight 3.636
> item osd.5 weight 3.636
> item osd.6 weight 3.636
> item osd.7 weight 3.636
> }
> host osd03 {
> id -4   # do not change unnecessarily
> # weight 14.546
> alg straw
> hash 0  # rjenkins1
> item osd.8 weight 3.636
> item osd.9 weight 3.636
> item osd.10 weight 3.636
> item osd.11 weight 3.636
> }
> root default {
> id -1   # do not change unnecessarily
> # weight 43.637
> alg straw
> hash 0  # rjenkins1
> item osd01 weight 14.546
> item osd02 weight 14.546
> item osd03 weight 14.546
> }
>
> # rules
> rule replicated_ruleset {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type host
> step emit
> }
>
> # end crush map
>
> I am not sure what is the reason for undersized state. All osd disks are the
> same size and replica size is 2. Also data is only replicated per hosts
> basis and I have 3 separate hosts. Maybe number of pg is incorrect ?  Is
> 1024 too big ? or maybe there is some misconfiguration in crushmap ?
>
>
> Kind regards,
> Piotr Dzionek
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High ops/s with kRBD and "--object-size 32M"

2016-11-28 Thread Ilya Dryomov
On Mon, Nov 28, 2016 at 6:20 PM, Francois Blondel  wrote:
> Hi *,
>
> I am currently testing different scenarios to try to optimize sequential
> read and write speeds using Kernel RBD.
>
> I have two block devices created with :
>   rbd create block1 --size 500G --pool rbd --image-feature layering
>   rbd create block132m --size 500G --pool rbd --image-feature layering
> --object-size 32M
>
> -> Writing to block1 works quite fine  (about 200ops/s, 310MB/s in average,
> for a 250GB file) (tests running with dd)
> -> Writing to block132m is much slower (about 40MB/s in average), and
> generates high ops/s (seen from a ceph -w) (from 4000 to 13000)
>
> Current test cluster:
>
>  health HEALTH_WARN
> noscrub,nodeep-scrub,sortbitwise flag(s) set
>  monmap e2: 3 mons at
> {aac=10.113.49.48:6789/0,aad=10.112.33.36:6789/0,aae=10.112.48.60:6789/0}
> election epoch 26, quorum 0,1,2 aad,aae,aac
>  osdmap e10962: 38 osds: 38 up, 38 in
> flags noscrub,nodeep-scrub,sortbitwise
>   pgmap v120464: 1024 pgs, 1 pools, 486 GB data, 122 kobjects
> 4245 GB used, 50571 GB / 54816 GB avail
> 1024 active+clean
>
> The OSDs (using bluestore) have been created using:
> ceph-disk prepare --zap-disk --bluestore --cluster ceph --cluster-uuid
> XX..XX  /dev/sdX
>
> ceph -v :   ceph version 10.2.3
> (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>
>
> Does someone have any experience involving "non-standard" RBD "object-size"
> ?
>
> Could this be due to "bluestore", or has someone already encountered that
> issue using "filestore" OSDs ?

It's hard to tell without any additional information: dd command,
iostat or blktrace, probably some OSD logs as well.

A ton of work has gone into bluestore in kraken, mostly on the
performance front - jewel bluestore has little in common with the
current version.

>
> Should switching to a higher RBD "object-size" at least theoretically improve
> seq r/w speeds ?

Well, it really depends on the workload.  It may result in an
improvement in certain cases, but there are many downsides - RADOS (be
it with filestore or bluestore) works much better with smaller objects.

I agree with Jason in that you are probably better off with the
default.  Try experimenting with krbd readahead - bump it to 4M or 8M
or even higher and make sure you have a recent kernel on the client
machine (4.4 or newer).
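
For reference, a sketch of how readahead can be bumped on a mapped krbd device; the device name and the 8 MB value are illustrative:

# echo 8192 > /sys/block/rbd0/queue/read_ahead_kb

blockdev --setra 16384 /dev/rbd0 should be equivalent (that value is in 512-byte sectors).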

There were a number of threads on this subject on ceph-users.  Search
for: single thread sequential kernel rbd readahead, or so.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] general ceph cluster design

2016-11-28 Thread Benjeman Meekhof
Hi Nick,

We have a Ceph cluster spread across 3 datacenters at 3 institutions
in Michigan (UM, MSU, WSU).  It certainly is possible.  As noted you
will have increased latency for write operations and overall reduced
throughput as latency increases.  Latency between our sites is 3-5ms.

We did some simulated latency testing with netem where we induced
varying levels of latency on one of our storage hosts (60 OSD).  Some
information about the results is on our website:
http://www.osris.org/performance/latency
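
For anyone wanting to repeat such a test, a rough netem sketch (interface name and delay value are illustrative):

# tc qdisc add dev eth0 root netem delay 5ms
  ... run the benchmark ...
# tc qdisc del dev eth0 root netem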

We also had success running a 4th cluster site at Supercomputing in
SLC.  We'll be putting up information on experiences there in the near
future.

thanks,
Ben


On Mon, Nov 28, 2016 at 12:06 AM, nick  wrote:
> Hi Maxime,
> thank you for the information given. We will have a look and check.
>
> Cheers
> Nick
>
> On Friday, November 25, 2016 09:48:35 PM Maxime Guyot wrote:
>> Hi Nick,
>>
>> See inline comments.
>>
>> Cheers,
>> Maxime
>>
>> On 25/11/16 16:01, "ceph-users on behalf of nick"
>>  wrote:
>
>>
>> >Hi,
>> >we are currently planning a new ceph cluster which will be used for
>> >virtualization (providing RBD storage for KVM machines) and we have
>> >some
>> >general questions.
>> >
>> >* Is it advisable to have one ceph cluster spread over multiple
>> >datacenters (latency is low, as they are not so far from each
>> >other)? Is anybody doing this in a production setup? We know that any
>> >network issue would affect virtual machines in all locations instead
>> >just one, but we can see a lot of advantages as well.
>>
>>
>> I think the general consensus is to limit the size of the failure domain.
>> That said, it depends the use case and what you mean by “multiple
>> datacenters” and “latency is low”: writes will have to be journal-ACK:ed
>> by the OSDs in the other datacenter. If there is 10ms latency between
>> Location1 and Location2, then it would add 10ms to each write operation if
>> crushmap requires replicas in each location. Speaking of which a 3rd
>> location would help with sorting our quorum (1 mon at each location) in
>> “triangle” configuration.
>
>> If this is for DR: RBD-mirroring is supposed to address that, you might not
>> want to have 1 big cluster ( = failure domain).
>> If this is for VM live
>> migration: Usually requires spread L2 adjacency (failure domain) or
>> overlays (VXLAN and the likes), “network trombone” effect can be a problem
>> depending on the setup
>> I know of Nantes University who used/is using a 3 datacenter Ceph cluster:
>> http://dachary.org/?p=2087
>
>>
>> >
>> >* We are planning to combine the hosts for ceph and KVM (so far we are
>> >using
>> >separate hosts for virtual machines and ceph storage). We see
>> >the big advantage (next to the price drop) of an automatic ceph
>> >expansion when adding more compute nodes as we got into situations in
>> >the past where we had too many compute nodes and the ceph cluster was
>> >not expanded properly (performance dropped over time). On the other
>> >side there would be changes to the crush map every time we add a
>> >compute node and that might end in a lot of data movement in ceph. Is
>> >anybody using combined servers for compute and ceph storage and has
>> >some experience?
>>
>>
>> The challenge is to avoid ceph-osd to become a noisy neighbor for the VMs
>> hosted on the hypervisor, especially under recovery. I’ve heard people
>> using CPU pinning, containers, and QoS to keep it under control.
>> Sebastian
>> has an article on his blog this topic:
>> https://www.sebastien-han.fr/blog/2016/07/11/Quick-dive-into-hyperconverged
>> -architecture-with-OpenStack-and-Ceph/
>> For the performance dropped over time, you can look to improve your
>> capacity:performance ratio.
>
>>
>> >* is there a maximum amount of OSDs in a ceph cluster? We are planning
>> >to use
>> >a minimum of 8 OSDs per server and going to have a cluster
>> >with about 100 servers which would end in about 800 OSDs.
>>
>>
>> There are a couple of thread from the ML about this:
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/028371.html
>> and
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-November/014246.ht
>> ml
>
>>
>> >
>> >Thanks for any help...
>> >
>> >Cheers
>> >Nick
>>
>>
>
>
> --
> Sebastian Nickel
> Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] metrics.ceph.com

2016-11-28 Thread Patrick McGarry
Thanks guys, I'll make sure the dashboard gets updated


On Thu, Nov 24, 2016 at 6:25 PM, Brad Hubbard  wrote:
> Patrick,
>
> I remember hearing you talk about this site recently. Do you know who
> can help with this query?
>
> On Fri, Nov 25, 2016 at 2:13 AM, Nick Fisk  wrote:
>> Who is responsible for the metrics.ceph.com site? I noticed that the mailing 
>> list stats are still trying to retrieve data from the
>> gmane archives which are no longer active.
>>
>> Nick
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Cheers,
> Brad



-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High ops/s with kRBD and "--object-size 32M"

2016-11-28 Thread Jason Dillaman
To optimize for non-direct, sequential IO, you'd actually most likely
be better off with smaller RBD object sizes. The rationale is that
each backing object is handled by a single PG and by using smaller
objects, you can distribute the IO load to more PGs (and associated
OSDs) in parallel. The 4MB default object size was somewhat randomly
picked to not be too large to reduce parallelism but also not too
small to result in FileStore requiring an order of magnitude more
files to manage. This is why librbd supports "fancy" striping to
create the illusion of small objects and increase the parallelism for
sequential IO. With BlueStore, the eventual hope is that we will be
able to reduce the default RBD object size since it *should* more
efficiently handle small objects.
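
For illustration, a hedged sketch of what enabling such fancy striping looks like (librbd only -- the krbd used in this thread does not support it); the image name and the numbers are invented, and --stripe-unit is given in bytes:

# rbd create rbd/striped1 --size 500G --object-size 4M \
      --stripe-unit 1048576 --stripe-count 8 --image-feature layering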

On Mon, Nov 28, 2016 at 12:20 PM, Francois Blondel
 wrote:
> Hi *,
>
> I am currently testing different scenarios to try to optimize sequential
> read and write speeds using Kernel RBD.
>
> I have two block devices created with :
>   rbd create block1 --size 500G --pool rbd --image-feature layering
>   rbd create block132m --size 500G --pool rbd --image-feature layering
> --object-size 32M
>
> -> Writing to block1 works quite fine  (about 200ops/s, 310MB/s in average,
> for a 250GB file) (tests running with dd)
> -> Writing to block132m is much slower (about 40MB/s in average), and
> generates high ops/s (seen from a ceph -w) (from 4000 to 13000)
>
> Current test cluster:
>
>  health HEALTH_WARN
> noscrub,nodeep-scrub,sortbitwise flag(s) set
>  monmap e2: 3 mons at
> {aac=10.113.49.48:6789/0,aad=10.112.33.36:6789/0,aae=10.112.48.60:6789/0}
> election epoch 26, quorum 0,1,2 aad,aae,aac
>  osdmap e10962: 38 osds: 38 up, 38 in
> flags noscrub,nodeep-scrub,sortbitwise
>   pgmap v120464: 1024 pgs, 1 pools, 486 GB data, 122 kobjects
> 4245 GB used, 50571 GB / 54816 GB avail
> 1024 active+clean
>
> The OSDs (using bluestore) have been created using:
> ceph-disk prepare --zap-disk --bluestore --cluster ceph --cluster-uuid
> XX..XX  /dev/sdX
>
> ceph -v :   ceph version 10.2.3
> (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>
>
> Does someone have any experience involving "non-standard" RBD "object-size"
> ?
>
> Could this be due to "bluestore", or has someone already encountered that
> issue using "filestore" OSDs ?
>
> Should switching to a higher RBD "object-size" at least theoretically improve
> seq r/w speeds ?
>
> Many thanks,
> François
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] High ops/s with kRBD and "--object-size 32M"

2016-11-28 Thread Francois Blondel
Hi *,

I am currently testing different scenarios to try to optimize sequential read 
and write speeds using Kernel RBD.

I have two block devices created with :
  rbd create block1 --size 500G --pool rbd --image-feature layering
  rbd create block132m --size 500G --pool rbd --image-feature layering 
--object-size 32M

-> Writing to block1 works quite fine  (about 200ops/s, 310MB/s in average, for 
a 250GB file) (tests running with dd)
-> Writing to block132m is much slower (about 40MB/s in average), and generates 
high ops/s (seen from a ceph -w) (from 4000 to 13000)

Current test cluster:

 health HEALTH_WARN
noscrub,nodeep-scrub,sortbitwise flag(s) set
 monmap e2: 3 mons at 
{aac=10.113.49.48:6789/0,aad=10.112.33.36:6789/0,aae=10.112.48.60:6789/0}
election epoch 26, quorum 0,1,2 aad,aae,aac
 osdmap e10962: 38 osds: 38 up, 38 in
flags noscrub,nodeep-scrub,sortbitwise
  pgmap v120464: 1024 pgs, 1 pools, 486 GB data, 122 kobjects
4245 GB used, 50571 GB / 54816 GB avail
1024 active+clean

The OSDs (using bluestore) have been created using:
ceph-disk prepare --zap-disk --bluestore --cluster ceph --cluster-uuid 
XX..XX  /dev/sdX

ceph -v :   ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)


Does someone have any experience involving "non-standard" RBD "object-size" ?

Could this be due to "bluestore", or has someone already encountered that issue 
using "filestore" OSDs ?

Should switching to a higher RBD "object-size" at least theoretically improve 
seq r/w speeds ?

Many thanks,
François
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSDs cause kernel unresponsive

2016-11-28 Thread Craig Chi
Hi guys,

Thanks to both of your suggestions, we had some progression on this issue.

I tuned vm.min_free_kbytes to 16GB and raised vm.vfs_cache_pressure to 200, and 
I did observe that the OS keeps releasing cache while the OSDs want more and 
more memory.
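
(The sysctls mentioned above, applied at runtime -- 16 GB expressed in kB:)

# sysctl -w vm.min_free_kbytes=16777216
# sysctl -w vm.vfs_cache_pressure=200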

OK. Now we are going to reproduce the hanging issue.

1. set the cluster with noup flag
2. restart all ceph-osd process (then we can see all OSDs are down from ceph 
monitor)
3. unset noup flag
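
The same steps as commands, assuming systemd-managed OSDs whose units are PartOf ceph-osd.target (as in the stock jewel packaging):

# ceph osd set noup
# systemctl restart ceph-osd.target
# ceph osd unset noup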

As expected the OSDs started to consume memory, and eventually the kernel still 
hanged without response.

Therefore I learned to gather the vmcore and tried to investigate further as 
Brad advised.

The vmcore dump file was unbelievably huge -- about 6 GB per dump. However it's 
helpful that we quickly found the following abnormal things:

1. The memory was exhausted as expected.

crash> kmem -i
                   PAGES        TOTAL      PERCENTAGE
   TOTAL MEM    63322527     241.6 GB         ----
        FREE      676446       2.6 GB    1% of TOTAL MEM
        USED    62646081       239 GB   98% of TOTAL MEM
      SHARED      621336       2.4 GB    0% of TOTAL MEM
     BUFFERS       47307     184.8 MB    0% of TOTAL MEM
      CACHED      376205       1.4 GB    0% of TOTAL MEM
        SLAB      455400       1.7 GB    0% of TOTAL MEM

  TOTAL SWAP     4887039      18.6 GB         ----
   SWAP USED     3855938      14.7 GB   78% of TOTAL SWAP
   SWAP FREE     1031101       3.9 GB   21% of TOTAL SWAP

COMMIT LIMIT    36548302     139.4 GB         ----
   COMMITTED    92434847     352.6 GB  252% of TOTAL LIMIT


2. Each OSD used a lot of memory. (We have only total 256 GB RAM but there are 
90 OSDs in a node)

# Find 10 largest memory consumption processes
crash>ps -G | sed 's/>//g' | sort -k 8,8 -n | awk '$8 ~ /[0-9]/{ $8 = $8/1024" 
MB"; print}' | tail -10
100864 1 12 883a43e1b700 IN 1.1 7484884 2973.33 MB ceph-osd
87400 1 27 8838538ae040 IN 1.1 7557500 3036.92 MB ceph-osd
108126 1 22 882bcca91b80 IN 1.2 7273068 3045.8 MB ceph-osd
39787 1 28 883f468ab700 IN 1.2 7300756 3067.88 MB ceph-osd
44861 1 20 883cf925 IN 1.2 7327496 3067.89 MB ceph-osd
30486 1 23 883f59e1c4c0 IN 1.2 7332828 3083.58 MB ceph-osd
125239 1 15 882687018000 IN 1.2 6965560 3103.36 MB ceph-osd
123807 1 19 88275d90ee00 IN 1.2 7314484 3173.48 MB ceph-osd
116445 1 1 882863926e00 IN 1.2 7279040 3269.09 MB ceph-osd
94442 1 0 882ed2d01b80 IN 1.3 7566148 3418.69 MB ceph-osd


3. The excessive amount of message threads.

crash>ps | grep ms_pipe_read | wc -l
144112
crash>ps | grep ms_pipe_write | wc -l
146692

Totally up to 290k threads in ms_pipe_*.


4. Several tries we had, and we luckily got some memory profiles before oom 
killer started to work.

# Parse the smaps of a ceph-osd process with 
parse_smaps.py (https://github.com/craig08/parse_smaps)

root@ceph2:~# ./parse_smaps.py /proc/198557/smaps
===
  Private      Private      Shared     Shared
  Clean    +   Dirty    +   Clean   +  Dirty   =   Total : library
===
16660 kB + 5804548 kB +0 kB +0 kB = 5821208 kB : [heap]
40 kB +92 kB +7640 kB +0 kB =7772 kB : ceph-osd
56 kB +2472 kB +0 kB +0 kB =2528 kB : [anonymous]
2084 kB +0 kB +0 kB +0 kB =2084 kB : 007656.ldb
2080 kB +0 kB +0 kB +0 kB =2080 kB : 007657.ldb
2080 kB +0 kB +0 kB +0 kB =2080 kB : 007653.ldb
2080 kB +0 kB +0 kB +0 kB =2080 kB : 007658.ldb
2080 kB +0 kB +0 kB +0 kB =2080 kB : 011125.ldb
2076 kB +0 kB +0 kB +0 kB =2076 kB : 009607.ldb
2072 kB +0 kB +0 kB +0 kB =2072 kB : 011127.ldb
2072 kB +0 kB +0 kB +0 kB =2072 kB : 011126.ldb
0 kB +24 kB +1636 kB +0 kB =1660 kB : libc-2.23.so
0 kB +0 kB +1060 kB +0 kB =1060 kB : libec_lrc.so
4 kB +28 kB +1024 kB +0 kB =1056 kB : libstdc++.so.6.0.21
996 kB +0 kB +0 kB +0 kB =996 kB : 011168.ldb
908 kB +0 kB +0 kB +0 kB =908 kB : 009608.ldb
840 kB +0 kB +0 kB +0 kB =840 kB : 007648.ldb
0 kB +0 kB +812 kB +0 kB =812 kB : libcls_rgw.so
0 kB +0 kB +716 kB +0 kB =716 kB : libcls_refcount.so
684 kB +0 kB +0 kB +0 kB =684 kB : 011128.ldb
0 kB +0 kB +552 kB +0 kB =552 kB : libm-2.23.so
0 kB +0 kB +472 kB +0 kB =472 kB : libec_jerasure_sse4.so
4 kB +0 kB +372 kB +0 kB =376 kB : libfreebl3.so
8 kB +4 kB +356 kB +0 kB =368 kB : libnss3.so
0 kB +12 kB +352 kB +0 kB =364 kB : libleveldb.so.1.18
0 kB +0 kB +296 kB +0 kB =296 kB : libec_jerasure.so
4 kB +4 kB +224 kB +0 kB =232 kB : libsoftokn3.so
8 kB +0 kB +208 kB +0 kB =216 kB : libnspr4.so
8 kB +0 kB +196 kB +0 kB =204 kB : libec_isa.so
0 kB +8 kB +196 kB +0 kB =204 kB : libtcmalloc.so.4.2.6
0 kB +8 kB +152 kB +0 kB =160 kB : ld-2.23.so
4 kB +4 kB +136 kB +0 kB =144 kB : libboost_thread.so.1.58.0
4 kB +0 kB +132 kB +0 kB =136 kB : libnssutil3.so
..
===
37736 kB + 5807224 kB + 18128 kB +0 kB = 5863088 kB : Total


5. Heap profiler by Ceph.

root@ceph2:~# ceph tell osd.163 heap stats
osd.163 tcmalloc heap stats:
MALLOC:5861094560 ( 5589.6 MiB) Bytes in use by application
MALLOC: +0 (0.0 MiB) Bytes in page heap freelist
MALLOC: +38945176 (37.1 MiB) Bytes in central cache freelist
MALLOC: +13279168 (12.7 MiB) Bytes in 

Re: [ceph-users] - cluster stuck and undersized if at least one osd is down

2016-11-28 Thread David Turner
In the cluster your OSD is down, not out.  When an osd goes out, that is when 
the data will start to rebuild.  Once the osd is marked out, it will show as 
11/11 osds are up instead of 1/12 osds are down.
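
For reference, the OSD can also be marked out by hand rather than waiting for the monitor to do it after the "mon osd down out interval" has elapsed; the osd id here is just an example:

# ceph osd out 3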



David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.




From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Piotr Dzionek 
[piotr.dzio...@seqr.com]
Sent: Monday, November 28, 2016 4:54 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] - cluster stuck and undersized if at least one osd is down


Hi,
I recently installed 3 nodes ceph cluster v.10.2.3. It has 3 mons, and 12 osds. 
I removed default pool and created the following one:

pool 7 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins 
pg_num 1024 pgp_num 1024 last_change 126 flags hashpspool stripe_width 0

Cluster is healthy if all osds are up, however if I stop any of the osds, it 
becomes stuck and undersized - it is not rebuilding.

cluster *
 health HEALTH_WARN
166 pgs degraded
108 pgs stuck unclean
166 pgs undersized
recovery 67261/827220 objects degraded (8.131%)
1/12 in osds are down
 monmap e3: 3 mons at 
{**osd01=***.144:6789/0,***osd02=***.145:6789/0,**osd03=*.146:6789/0}
election epoch 14, quorum 0,1,2 **osd01,**osd02,**osd03
 osdmap e161: 12 osds: 11 up, 12 in; 166 remapped pgs
flags sortbitwise
  pgmap v307710: 1024 pgs, 1 pools, 1230 GB data, 403 kobjects
2452 GB used, 42231 GB / 44684 GB avail
67261/827220 objects degraded (8.131%)
 858 active+clean
 166 active+undersized+degraded

Replica size is 2 and and I use the following crushmap:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host osd01 {
id -2   # do not change unnecessarily
# weight 14.546
alg straw
hash 0  # rjenkins1
item osd.0 weight 3.636
item osd.1 weight 3.636
item osd.2 weight 3.636
item osd.3 weight 3.636
}
host osd02 {
id -3   # do not change unnecessarily
# weight 14.546
alg straw
hash 0  # rjenkins1
item osd.4 weight 3.636
item osd.5 weight 3.636
item osd.6 weight 3.636
item osd.7 weight 3.636
}
host osd03 {
id -4   # do not change unnecessarily
# weight 14.546
alg straw
hash 0  # rjenkins1
item osd.8 weight 3.636
item osd.9 weight 3.636
item osd.10 weight 3.636
item osd.11 weight 3.636
}
root default {
id -1   # do not change unnecessarily
# weight 43.637
alg straw
hash 0  # rjenkins1
item osd01 weight 14.546
item osd02 weight 14.546
item osd03 weight 14.546
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map

I am not sure what is the reason for undersized state. All osd disks are the 
same size and replica size is 2. Also data is only replicated per hosts basis 
and I have 3 separate hosts. Maybe number of pg is incorrect ?  Is 1024 too big 
? or maybe there is some misconfiguration in crushmap ?

Kind regards,
Piotr Dzionek
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] export-diff behavior if an initial snapshot is NOT specified

2016-11-28 Thread Jason Dillaman
OK, in that case, none of my previous explanation is relevant. I'll
spin up a hammer cluster and try to reproduce.

On Wed, Nov 23, 2016 at 9:13 PM, Zhongyan Gu  wrote:
> BTW, I used Hammer 0.94.5 to do the test.
>
> Zhongyan
>
> On Thu, Nov 24, 2016 at 10:07 AM, Zhongyan Gu  wrote:
>>
>> Thank you Jason. My test shows in the following case, image B will be
>> exactly same:
>> 1. clone image A from parent:
>> #rbd clone 1124-parent@snap1 A
>>
>> 2. create snap for A
>> #rbd snap create A@snap1
>>
>> 3. create empty image B
>> #rbd create B -s 1
>>
>> 4. export-diff A then import-diff B:
>> #rbd export-diff A@snap1 -|./rbd import-diff - B
>>
>> 5. check A@snap1 equals B@snap1
>> #rbd export A@snap1 -|md5sum
>> Exporting image: 100% complete...done.
>> 880709d7352b6c9926beb1d829673366  -
>> #rbd export B@snap1 -|md5sum
>> Exporting image: 100% complete...done.
>> 880709d7352b6c9926beb1d829673366  -
>> output shows A@snap1 equals B@snap1
>>
>> However, in the following case, image B will not be exactly same:
>> 1. clone image A from parent:
>> #rbd clone 1124-parent@snap1 A
>>
>> 2. create snap for A
>> #rbd snap create A@snap1
>>
>> 3. use fio make some change to A
>>
>> 4. create empty image B
>> #rbd create B -s 1
>>
>> 4. export-diff A then import-diff B:
>> #rbd export-diff A@snap1 -|./rbd import-diff - B
>>
>> 5. check A@snap1 equals B@snap1
>> #rbd export A@snap1 -|md5sum
>> Exporting image: 100% complete...done.
>> 880709d7352b6c9926beb1d829673366  -
>> #rbd export B@snap1 -|md5sum
>> Exporting image: 100% complete...done.
>> bbf7cf69a84f3978c66f5eb082fb91ec  -
>> output shows A@snap1 DOES NOT equal B@snap1
>>
>> The second case can always be reproduced. What is wrong with the second
>> case?
>>
>> Thanks,
>> Zhongyan
>>
>>
>> On Wed, Nov 23, 2016 at 10:11 PM, Jason Dillaman 
>> wrote:
>>>
>>> What you are seeing sounds like a side-effect of deep-flatten support.
>>> If you write to an unallocated extent within a cloned image, the
>>> associated object extent must be read from the parent image, modified,
>>> and written to the clone image.
>>>
>>> Since the Infernalis release, this process has been tweaked if the
>>> cloned image has a snapshot. In that case, the associated object
>>> extent is still read from the parent, but instead of being modified
>>> and written to the HEAD revision, it is left unmodified and is written
>>> to "pre" snapshot history followed by writing the original
>>> modification (w/o the parent's object extent data) to the HEAD
>>> revision.
>>>
>>> This change to the IO path was made to support flattening clones and
>>> dissociating them from their parents even if the clone had snapshots.
>>>
>>> Therefore, what you are seeing with export-diff is actually the
>>> backing object extent of data from the parent image written to the
>>> clone's "pre" snapshot history. If you had two snapshots and your
>>> export-diff'ed from the first to second snapshot, you wouldn't see
>>> this extra data.
>>>
>>> To your question about how to prepare image B to make sure it will be
>>> exactly the same, the answer is that you don't need to do anything. In
>>> your example above, I am assuming you are manually creating an empty
>>> Image B and using "import-diff" to populate it. The difference in the
>>> export-diff is most likely related to fact that the clone lost its
>>> sparseness on any backing object that was written (e.g. instead of a
>>> one or more 512 byte diffs within a backing object extent, you will
>>> see a single, full-object extent with zeroes where the parent image
>>> had no data).
>>>
>>>
>>> On Wed, Nov 23, 2016 at 5:06 AM, Zhongyan Gu 
>>> wrote:
>>> > Let me make the issue more clear.
>>> > Suppose I cloned image A from a parent image and create snap1 for image
>>> > A
>>> > and  then make some change of image A.
>>> > If I did the rbd export-diff @snap1. how should I prepare the existing
>>> > image
>>> > B to make sure it  will be exactly same with image A@snap1 after
>>> > import-diff
>>> > against this image B.
>>> >
>>> > Thanks,
>>> > Zhongyan
>>> >
>>> >
>>> > On Wed, Nov 23, 2016 at 11:34 AM, Zhongyan Gu 
>>> > wrote:
>>> >>
>>> >> Thanks Jason, very clear explanation.
>>> >> However, I found some strange behavior when export-diff on a cloned
>>> >> image,
>>> >> not sure it is a bug on calc_snap_set_diff().
>>> >> The test is,
>>> >> Image A is cloned from a parent image. then create snap1 for image A.
>>> >> The content of export-diff A@snap1 will be changed when update image
>>> >> A.
>>> >> Only after image A has no overlap with parent, the content of
>>> >> export-diff
>>> >> A@snap1 is stabled, which is almost zero.
>>> >> I don't think it is a designed behavior. export-diff A@snap1 should
>>> >> always
>>> >> get a stable output no matter image A is cloned or not.
>>> >>
>>> >> Please correct me if anything wrong.
>>> >>
>>> >> Thanks,
>>> >> 

Re: [ceph-users] Production System Evaluation / Problems

2016-11-28 Thread Maxime Guyot
Hi,


1.   It is possible to do that with the primary affinity setting (a small sketch 
follows after this list). The documentation gives an example with SSDs as primary 
OSDs and HDDs as secondary. I think it would work for an Active/Passive DC scenario 
but might be tricky for Active/Active. If you do Ceph across 2 DCs you might have 
problems with quorum; a third location with 1 MON can help break ties.

2.   Zap & re-create?

3.   It is common to use 2 VLANs on a LACP bond instead of 1 NIC on each 
VLAN. You just need to size the pipes accordingly to avoid bottlenecks.
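
A hedged sketch of point 1: lowering the primary affinity of the OSDs in the "passive" datacenter so reads are normally served from the other one. The osd ids are examples, and 'mon osd allow primary affinity = true' may need to be enabled for the command to take effect:

# ceph osd primary-affinity osd.3 0
# ceph osd primary-affinity osd.4 0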

Cheers,

Maxime Guyot

From: ceph-users  on behalf of Stefan 
Lissmats 
Date: Monday 28 November 2016 11:12
To: "Strankowski, Florian" , 
"'ceph-users@lists.ceph.com'" 
Subject: Re: [ceph-users] Production System Evaluation / Problems

Hey!

I have been using ceph for a while but am not a real expert; still, I will give you 
some pointers so that everyone is able to help you further.

1. The crush map is divided into two parts: the topology description 
(which you provided us with) and the crush rules that define how the data 
is placed in the topology. Have you made any changes to the rules? If you have 
made any changes it would be great if you showed how the rules are defined. 
I think you can get the data placed the way you want with some more advanced 
crush rules (a rough sketch follows below), but I don't think there is any 
possibility to have a read-only copy. I guess you have seen this? 
http://docs.ceph.com/docs/jewel/rados/operations/crush-map/


2. Have you looked into the osd logs on the server that osd.0 resides on? That could 
give some information on why osd.0 never comes up. It should normally be in 
/var/log/ceph/ceph-osd.0.log

Other notes:
You have 6 mons but you normally want an odd number, and you do not normally need 
more than 5 (or even 3 is enough).
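
Regarding point 1 above, a rough, untested sketch of a rule that keeps 2 copies in datacenter1 and 1 copy in datacenter2, using the bucket names and the 'blade' type from the map quoted below (note that it only controls placement, it does not make reads DC-local):

rule dc1_primary_dc2_backup {
        ruleset 1
        type replicated
        min_size 2
        max_size 3
        step take datacenter1
        step chooseleaf firstn 2 type blade
        step emit
        step take datacenter2
        step chooseleaf firstn 1 type blade
        step emit
}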


From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Strankowski, Florian 
[fstrankow...@stadtwerke-norderstedt.de]
Sent: 28 November 2016 10:29
To: 'ceph-users@lists.ceph.com'
Subject: [ceph-users] Production System Evaluation / Problems
Hey guys,

we’re evaluating ceph at the moment for a bigger production-ready 
implementation. So far we’ve had some success and
some problems with ceph. In combination with Proxmox, CEPH works quite well if 
taken out of the box. I’ve tried to cover my questions
with existing answers and solutions, but I still find some things unclear. Here 
are the things I’m having problems with:


1.   The first question is just for my understanding: how does Ceph account for 
failure domains? From what I've read so far, I create a new CRUSH map with, for 
example, 2 datacenters; each DC has a rack, and in this rack there is a chassis 
with nodes. By using my own CRUSH map, Ceph will "see" it and deal with the 
data automatically. What I am missing here is some more possible adjustment. 
For example, with a replica count of 3, I want Ceph to store the data 2 times 
in datacenter A and one time in datacenter B. Furthermore, I want read access 
exclusively within 1 datacenter (if possible and the data is available) to keep 
RTT low. Is this possible?

2.   I've built my own CRUSH map and tried to get it working. No success at 
all. I'm literally "done with this s…" ☺ that's why I'm here right now. Here is 
the state

of the cluster:



cluster 42f04e55-0a3f-4644-8543-516cd46cd4e9

 health HEALTH_WARN

79 pgs degraded

262 pgs stale

79 pgs stuck degraded

262 pgs stuck stale

512 pgs stuck unclean

79 pgs stuck undersized

79 pgs undersized

 monmap e8: 6 mons at 
{0=192.168.40.20:6789/0,1=192.168.40.21:6789/0,2=192.168.40.22:6789/0,3=192.168.40.23:6789/0,4=192.168.40.24:6789/0,5=192.168.40.25:6789/0}

election epoch 86, quorum 0,1,2,3,4,5 0,1,2,3,4,5

 mdsmap e2: 0/0/1 up

 osdmap e212: 6 osds: 5 up, 5 in; 250 remapped pgs

  pgmap v366013: 512 pgs, 2 pools, 0 bytes data, 0 objects

278 MB used, 900 GB / 901 GB avail

 250 active+remapped

 183 stale+active+remapped

  79 stale+active+undersized+degraded+remapped



Here the config:





ID  WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY

-27 1.07997 root default

-25 0.53998 datacenter datacenter1

-23 0.53998 chassis chassis1

-1 0.17999 blade blade3

  0 0.17999 osd.0 down0  1.0

-2 0.17999 blade blade4

  1 0.17999 osd.1   up  1.0  1.0

-3 0.17999 blade blade5

  2 0.17999 osd.2   up  1.0  1.0

-26 0.53999 datacenter datacenter2

-24 0.53999 chassis chassis2

-17 0.17999 

Re: [ceph-users] cephfs and manila

2016-11-28 Thread John Spray
(Copying ceph-users to share the info more broadly)

On Thu, Nov 24, 2016 at 10:50 AM,   wrote:
> Hi John
>
> I have some questions about the use of cephfs,
> Can you help me answer, Thank you!
>
> We built an OpenStack (M) file share and use the Manila component based on CephFS.
> I can export CephFS's POSIX filesystem, referring to
> http://docs.openstack.org/developer/manila/devref/cephfs_native_driver.html (CephFS
> Native driver from you, :+1:)
> As we all know:
> with "nfs-ganesha", I can manually export NFS based on CephFS, but not in
> Manila;
> with "samba 4.x", I can manually export CIFS based on CephFS, but not in
> Manila.
> But if I want to directly export CIFS and NFS based on CephFS in Manila,
> it seems not to be supported,
> although Manila supports a Ganesha library (but not a Samba library).
> So is there a plan to support the above functions?

An NFS+CephFS driver for Manila is a work in progress.  The rough plan
is to have some initial functionality for auto-configuring ganesha
exports using the existing ganesha modules in OpenStack Ocata, and
then to have it automatically create gateway VMs in the
subsequent Pike release.

> Two other issues outside
> 1. ceph-fuse cannot mount cephfs snapshot directory?
>Example:ceph-fuse /root/mycephfs5 --id=admin --conf=./client.conf
> --keyring=/etc/ceph/ceph.client.admin.keyring
> --client-mountpoint=/volumes/_nogroup/b53cbff4-a3f2-402b-91c2-aaf967f32d40/.snap/

Hmm, we've never tested this, and it probably needs some work because
snapshot dirs are special.  This will be necessary to enable Manila's
mountable snapshots feature:
https://github.com/openstack/manila-specs/blob/master/specs/ocata/mountable-snapshots.rst

I've created a ticket here: http://tracker.ceph.com/issues/18050,
although I'm not sure how soon it would reach the top of anyone's
priority list given the current focus on robustness and multi-mds.

> 2. If I delete cephfs using "ceph fs rm {cephfs_name}", the
> cephfs_data_pool and cephfs_meta_pool still exist.
> Based on the old cephfs_data_pool and cephfs_meta_pool, can
> cephfs be restored?

This would be considered a disaster recovery situation.  Removing the
filesystem doesn't touch anything in the pools, so you'd do something
like:
 * Stop all MDS daemons
 * Do a "ceph fs new" with the same pools
 * Do a "ceph fs reset" on the new filesystem to make it skip the
creating stage.
 * If you had multiple active MDSs you would probably also need to do
some extra work to truncate journals etc.

You can also use cephfs-data-scan to (best effort) scrape files
directly out of a cephfs data pool to some other filesystem.
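
A rough sketch of that sequence, assuming a single active MDS and reusing the
original pool names (names are illustrative; treat this as last-resort disaster
recovery, not a routine procedure):

# with all MDS daemons stopped:
ceph fs new cephfs cephfs_metadata cephfs_data    # may need --force since the pools are not empty
ceph fs reset cephfs --yes-i-really-mean-it       # skip the creating stage

# or, best-effort scraping of file data out of the data pool:
cephfs-data-scan scan_extents cephfs_data
cephfs-data-scan scan_inodes cephfs_data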

The usual warnings for disaster recovery tools all apply to the
situation of trying to recover a cephfs filesystem from pools: this is
a last resort, they can do harm as well as good, and anyone unsure
should seek expert advice before using them.

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deploying new OSDs in parallel or one after another

2016-11-28 Thread Peter Maloney
On 11/28/16 10:02, Kevin Olbrich wrote:
> Hi!
>
> I want to deploy two nodes with 4 OSDs each. I already prepared OSDs
> and only need to activate them.
> What is better? One by one or all at once?
>
> Kind regards,
> Kevin.
I think the general statement is that if your cluster is very small, you
might want to do it more gradually. And if it's large, a whole node at
once might not cause any noticeable effect. And doing it all at once
will cause less overall data movement: the data moves once, to its final place.


But either way, make sure you know about settings such as:

ceph osd set noscrub
ceph osd set nodeep-scrub

# 1 makes recovery slow...raise it to where you can still tolerate
the load
osd max backfills = 1

osd recovery max active = 1
osd recovery op priority = 1
osd recovery max single start = 1
osd op threads = 12

# you probably have this already (default)
osd client op priority = 63

And if that's not enough, I found this one that worked better than the
rest (and for my 27 osd 3 node cluster, 0.6 here and 2 max active was
tolerable and faster than 1 max active and no setting here):
osd recovery sleep = 0.6

and when you want to give it a rest due to some issues:
ceph osd set nobackfill
ceph osd set norecover


To use the config options at runtime, you can use a command like:
ceph tell osd.* injectargs --osd-max-backfills=1

And I'm sure I missed some options and someone can mention them too.
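
For example, a couple of those can be applied at runtime and the flags relaxed
again once backfill has finished (the values are only illustrative):

# throttle recovery while the new OSDs backfill
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

# when everything is active+clean again, re-enable scrubbing
ceph osd unset noscrub
ceph osd unset nodeep-scrub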

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] - cluster stuck and undersized if at least one osd is down

2016-11-28 Thread Piotr Dzionek

Hi,
I recently installed a 3-node Ceph cluster, v10.2.3. It has 3 mons and 
12 osds. I removed the default pool and created the following one:


pool 7 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 1024 pgp_num 1024 last_change 126 flags hashpspool 
stripe_width 0


Cluster is healthy if all osds are up, however if I stop any of the 
osds, it becomes stuck and undersized - it is not rebuilding.


cluster *
 health HEALTH_WARN
166 pgs degraded
108 pgs stuck unclean
166 pgs undersized
recovery 67261/827220 objects degraded (8.131%)
1/12 in osds are down
 monmap e3: 3 mons at 
{**osd01=***.144:6789/0,***osd02=***.145:6789/0,**osd03=*.146:6789/0}

election epoch 14, quorum 0,1,2 **osd01,**osd02,**osd03
 osdmap e161: 12 osds: 11 up, 12 in; 166 remapped pgs
flags sortbitwise
  pgmap v307710: 1024 pgs, 1 pools, 1230 GB data, 403 kobjects
2452 GB used, 42231 GB / 44684 GB avail
67261/827220 objects degraded (8.131%)
 858 active+clean
 166 active+undersized+degraded

Replica size is 2 and and I use the following crushmap:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host osd01 {
id -2   # do not change unnecessarily
# weight 14.546
alg straw
hash 0  # rjenkins1
item osd.0 weight 3.636
item osd.1 weight 3.636
item osd.2 weight 3.636
item osd.3 weight 3.636
}
host osd02 {
id -3   # do not change unnecessarily
# weight 14.546
alg straw
hash 0  # rjenkins1
item osd.4 weight 3.636
item osd.5 weight 3.636
item osd.6 weight 3.636
item osd.7 weight 3.636
}
host osd03 {
id -4   # do not change unnecessarily
# weight 14.546
alg straw
hash 0  # rjenkins1
item osd.8 weight 3.636
item osd.9 weight 3.636
item osd.10 weight 3.636
item osd.11 weight 3.636
}
root default {
id -1   # do not change unnecessarily
# weight 43.637
alg straw
hash 0  # rjenkins1
item osd01 weight 14.546
item osd02 weight 14.546
item osd03 weight 14.546
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map
I am not sure what the reason for the undersized state is. All osd disks are 
the same size and the replica size is 2. Also, data is only replicated on a 
per-host basis and I have 3 separate hosts. Maybe the number of PGs is 
incorrect? Is 1024 too big? Or maybe there is some misconfiguration in the crushmap?
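
For reference, a few commands that usually help narrow down why PGs stay
undersized instead of re-replicating (the pg id 7.1a is only a placeholder):

ceph health detail                    # lists the affected PGs
ceph pg dump_stuck unclean            # which PGs are stuck and on which OSDs
ceph pg 7.1a query                    # inspect "up", "acting" and recovery_state of one PG
ceph osd pool get data min_size       # confirm min_size/size for the pool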



Kind regards,
Piotr Dzionek

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deploying new OSDs in parallel or one after another

2016-11-28 Thread Kevin Olbrich
I need to note that I already have 5 hosts with one OSD each.


Mit freundlichen Grüßen / best regards,
Kevin Olbrich.

2016-11-28 10:02 GMT+01:00 Kevin Olbrich :

> Hi!
>
> I want to deploy two nodes with 4 OSDs each. I already prepared OSDs and
> only need to activate them.
> What is better? One by one or all at once?
>
> Kind regards,
> Kevin.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Introducing DeepSea: A tool for deploying Ceph using Salt

2016-11-28 Thread M Ranga Swami Reddy
Hello Tim,
Can you please confirm, if the DeepSea works on Ubuntu also?

Thanks

On Fri, Nov 25, 2016 at 3:34 PM, M Ranga Swami Reddy 
wrote:

> Hello Tim,
> Can you please confirm, if the DeepSea works on Ubuntu also?
>
> Thanks
> Swami
>
> On Thu, Nov 3, 2016 at 11:22 AM, Tim Serong  wrote:
>
>> Hi All,
>>
>> I thought I should make a little noise about a project some of us at
>> SUSE have been working on, called DeepSea.  It's a collection of Salt
>> states, runners and modules for orchestrating deployment of Ceph
>> clusters.  To help everyone get a feel for it, I've written a blog post
>> which walks through using DeepSea to set up a small test cluster:
>>
>>   http://ourobengr.com/2016/11/hello-salty-goodness/
>>
>> If you'd like to try it out yourself, the code is on GitHub:
>>
>>   https://github.com/SUSE/DeepSea
>>
>> More detailed documentation can be found at:
>>
>>   https://github.com/SUSE/DeepSea/wiki/intro
>>   https://github.com/SUSE/DeepSea/wiki/management
>>   https://github.com/SUSE/DeepSea/wiki/policy
>>
>> Usual story: feedback, issues, pull requests are all welcome ;)
>>
>> Enjoy,
>>
>> Tim
>> --
>> Tim Serong
>> Senior Clustering Engineer
>> SUSE
>> tser...@suse.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Production System Evaluation / Problems

2016-11-28 Thread Stefan Lissmats
Hey!

I have been using Ceph for a while but am not a real expert; still, I will give 
you some pointers so that everyone can help you further.

1. The crush map is kind of divided into two parts: the topology description 
(which you provided us with) and the crush rules that define how the data 
is placed in the topology. Have you made any changes to the rules? If you have 
made any changes, it would be great if you provided how the rules are defined. 
However, I think you can get the data placed the way you want with some more 
advanced crush rules, but I don't think there is any possibility to have a 
read-only copy. I guess you have seen this? 
http://docs.ceph.com/docs/jewel/rados/operations/crush-map/
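
For the "2 copies in datacenter A, 1 copy in datacenter B" placement from
question 1, a rule along these lines is the usual approach (only a sketch based
on the bucket names in your tree; compile and test it before applying):

rule replicated_2dc {
        ruleset 1
        type replicated
        min_size 2
        max_size 3
        step take datacenter1
        step chooseleaf firstn 2 type blade      # two copies in DC 1
        step emit
        step take datacenter2
        step chooseleaf firstn -2 type blade     # the remaining copy in DC 2
        step emit
}

Something like "crushtool -i compiled.crushmap --test --rule 1 --num-rep 3
--show-mappings" (with your own compiled map file) gives a quick sanity check
of the resulting mappings.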


2.  Have you looked into the OSD log on the server that osd.0 resides on? That 
could give some information on why osd.0 never comes up. It should normally be 
in /var/log/ceph/ceph-osd.0.log
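
If the log shows the OSD is beyond saving and you want to recreate it cleanly,
the usual removal sequence is roughly the following (a sketch; double-check the
ID and device before running anything):

ceph osd out osd.0
systemctl stop ceph-osd@0        # or whatever init system is in use
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm osd.0
# then prepare and activate the disk again, e.g. with ceph-disk or ceph-deploy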

Other notes:
You have 6 mons, but you normally want an odd number and do not normally need 
more than 5 (3 is usually enough).


From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Strankowski, 
Florian [fstrankow...@stadtwerke-norderstedt.de]
Sent: 28 November 2016 10:29
To: 'ceph-users@lists.ceph.com'
Subject: [ceph-users] Production System Evaluation / Problems

Hey guys,

we're evaluating Ceph at the moment for a bigger production-ready 
implementation. So far we've had some success and some problems with Ceph. In 
combination with Proxmox, Ceph works quite well out of the box. I've tried to 
cover my questions with existing answers and solutions, but I still find some 
things unclear. Here are the things I'm having problems with:


1.   The first question is just for my understanding: how does Ceph account for 
failure domains? From what I've read so far, I create a new CRUSH map with, for 
example, 2 datacenters; each DC has a rack, and in this rack there is a chassis 
with nodes. By using my own CRUSH map, Ceph will "see" it and deal with the 
data automatically. What I am missing here is some more possible adjustment. 
For example, with a replica count of 3, I want Ceph to store the data 2 times 
in datacenter A and one time in datacenter B. Furthermore, I want read access 
exclusively within 1 datacenter (if possible and the data is available) to keep 
RTT low. Is this possible?

2.   I've built my own CRUSH map and tried to get it working. No success at 
all. I'm literally "done with this s..." :) that's why I'm here right now. Here 
is the state

of the cluster:



cluster 42f04e55-0a3f-4644-8543-516cd46cd4e9

 health HEALTH_WARN

79 pgs degraded

262 pgs stale

79 pgs stuck degraded

262 pgs stuck stale

512 pgs stuck unclean

79 pgs stuck undersized

79 pgs undersized

 monmap e8: 6 mons at 
{0=192.168.40.20:6789/0,1=192.168.40.21:6789/0,2=192.168.40.22:6789/0,3=192.168.40.23:6789/0,4=192.168.40.24:6789/0,5=192.168.40.25:6789/0}

election epoch 86, quorum 0,1,2,3,4,5 0,1,2,3,4,5

 mdsmap e2: 0/0/1 up

 osdmap e212: 6 osds: 5 up, 5 in; 250 remapped pgs

  pgmap v366013: 512 pgs, 2 pools, 0 bytes data, 0 objects

278 MB used, 900 GB / 901 GB avail

 250 active+remapped

 183 stale+active+remapped

  79 stale+active+undersized+degraded+remapped



Here the config:





ID  WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY

-27 1.07997 root default

-25 0.53998 datacenter datacenter1

-23 0.53998 chassis chassis1

-1 0.17999 blade blade3

  0 0.17999 osd.0 down0  1.0

-2 0.17999 blade blade4

  1 0.17999 osd.1   up  1.0  1.0

-3 0.17999 blade blade5

  2 0.17999 osd.2   up  1.0  1.0

-26 0.53999 datacenter datacenter2

-24 0.53999 chassis chassis2

-17 0.17999 blade blade17

  3 0.17999 osd.3   up  0.95001  1.0

-18 0.17999 blade blade18

  4 0.17999 osd.4   up  1.0  1.0

-19 0.17999 blade blade19

  5 0.17999 osd.5   up  1.0  1.0



I simply can't get osd.0 back up. I took it offline, out, reinserted it, set it 
up again, deleted the osd configs and remade them - no success whatsoever. 
IMHO the documentation on this part is a bit "lousy", so I'm missing some 
points of information here, sorry folks.



3.   Last but not least, I would like to know whether it is a good idea to 
have the data and config networks on 2 dedicated VLANs instead of on 2 
dedicated NICs. Our hardware is redundant and we have 10 Gig fibre optics 
in-house and 80 Gig between the two datacenters. The data VLAN uses jumbo 
frames while the others don't.



4.   Do you guys 

Re: [ceph-users] Missing heartbeats, OSD spending time reconnecting - possible bug?

2016-11-28 Thread Trygve Vea
- Den 11.nov.2016 14:35 skrev Wido den Hollander w...@42on.com:
>> Op 11 november 2016 om 14:23 schreef Trygve Vea 
>> :
>> 
>> 
>> Hi,
>> 
>> We recently experienced a problem with a single OSD.  This occurred twice.
>> 
>> The problem manifested itself thus:
>> 
>> - 8 placement groups stuck peering, all of which had the problematic OSD as 
>> one
>> of the acting OSDs in the set.
>> - The OSD had a lot of active placement groups
>> - The OSD were blocking IO on placement groups that were active (waiting for
>> subops logged on the monitors)
>> - The OSD logged that a single other OSD didn't respond to heartbeats.  This 
>> OSD
>> was not involved in any of the PG's stuck in peering.
>> 
>> 2016-11-10 21:40:30.373352 7fad7fdc4700 -1 osd.29 32033 heartbeat_check: no
>> reply from osd.14 since back 2016-11-10 21:40:03.465758 front 2016-11-10
>> 21:40:03.465758 (cutoff 2016-11-10 21:40:10.373339)
>> 
>> This was logged until it was restarted.  osd.14 in its turn logged a few
>> instances of this:
>> 
>> 2016-11-10 21:40:30.697238 7f1a8f9cb700  0 -- 10.20.9.21:6808/18024412 >>
>> 10.20.9.22:6810/41828 pipe(0x7f1b15af8800 sd=20 :38625 s=2 pgs=9449 cs=1 l=0
>> c=0x7f1b499a4a80).fault, initiating reconnect
>> 2016-11-10 21:40:30.697860 7f1a8a16f700  0 -- 10.20.9.21:6808/18024412 >>
>> 10.20.9.22:6810/41828 pipe(0x7f1b15af8800 sd=20 :38627 s=1 pgs=9449 cs=2 l=0
>> c=0x7f1b499a4a80).connect got RESETSESSION
>> 
>> 
>> No real IO-problem on the OSD against the disk.  Using more CPU than usual, 
>> but
>> no indication that the bottleneck is the drive, and the drive is healthy.
>> 
>> 
>> Killing osd.29 off unblocked the traffic.  The OSD were then started again, 
>> and
>> things recovered nicely and things worked fine throughout the night.
>> 
>> The next morning, the same behaviour as described above reoccurred on osd.29.
>> Less PGs stuck in peering, but blocking IO.  The OSD were then killed off, 
>> and
>> have not been started since.  I'm leaving it as it is if there is any
>> possibility of using the OSD partition for forensics (ordinary xfs 
>> filesystem,
>> journal on ssd).
>> 
>> 
>> Not an expert of the low-level behaviour of Ceph, but the logged
>> reconnection-attempts from osd.14, and the complaining about missing 
>> heartbeats
>> on osd.29 sounds to me like a bug.
>> 
>> Have anyone else seen this behaviour?
>> 
> 
> Yes, but that usually indicates that there is something wrong with the network
> or the machine.
> 
> Is osd.29 alone on that machine? Did you verify that the network is OK? Any
> firewalls present?

There are four OSDs on each machine, and they are dedicated for OSDs.

We've addressed a bottleneck where the buffers on the attached switch got full 
and we occasionally dropped packets, which I suspect contributed to this issue.

There are no firewalls present.

However, we still see the occasional heartbeat_map: reset_timeout message 
(2-3 a day, per OSD, at varying times).  So there is still something funky 
going on here.


We are also experiencing a significantly increased CPU footprint, which 
likewise started after we upgraded to Jewel; 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013959.html.



-- 
Trygve
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Production System Evaluation / Problems

2016-11-28 Thread Strankowski, Florian
Hey guys,

we're evaluating Ceph at the moment for a bigger production-ready 
implementation. So far we've had some success and some problems with Ceph. In 
combination with Proxmox, Ceph works quite well out of the box. I've tried to 
cover my questions with existing answers and solutions, but I still find some 
things unclear. Here are the things I'm having problems with:


1.   The first question is just for my understanding: how does Ceph account for 
failure domains? From what I've read so far, I create a new CRUSH map with, for 
example, 2 datacenters; each DC has a rack, and in this rack there is a chassis 
with nodes. By using my own CRUSH map, Ceph will "see" it and deal with the 
data automatically. What I am missing here is some more possible adjustment. 
For example, with a replica count of 3, I want Ceph to store the data 2 times 
in datacenter A and one time in datacenter B. Furthermore, I want read access 
exclusively within 1 datacenter (if possible and the data is available) to keep 
RTT low. Is this possible?

2.   I've built my own CRUSH map and tried to get it working. No success at 
all. I'm literally "done with this s..." :) that's why I'm here right now. Here 
is the state

of the cluster:



cluster 42f04e55-0a3f-4644-8543-516cd46cd4e9

 health HEALTH_WARN

79 pgs degraded

262 pgs stale

79 pgs stuck degraded

262 pgs stuck stale

512 pgs stuck unclean

79 pgs stuck undersized

79 pgs undersized

 monmap e8: 6 mons at 
{0=192.168.40.20:6789/0,1=192.168.40.21:6789/0,2=192.168.40.22:6789/0,3=192.168.40.23:6789/0,4=192.168.40.24:6789/0,5=192.168.40.25:6789/0}

election epoch 86, quorum 0,1,2,3,4,5 0,1,2,3,4,5

 mdsmap e2: 0/0/1 up

 osdmap e212: 6 osds: 5 up, 5 in; 250 remapped pgs

  pgmap v366013: 512 pgs, 2 pools, 0 bytes data, 0 objects

278 MB used, 900 GB / 901 GB avail

 250 active+remapped

 183 stale+active+remapped

  79 stale+active+undersized+degraded+remapped



Here the config:





ID  WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY

-27 1.07997 root default

-25 0.53998 datacenter datacenter1

-23 0.53998 chassis chassis1

-1 0.17999 blade blade3

  0 0.17999 osd.0 down0  1.0

-2 0.17999 blade blade4

  1 0.17999 osd.1   up  1.0  1.0

-3 0.17999 blade blade5

  2 0.17999 osd.2   up  1.0  1.0

-26 0.53999 datacenter datacenter2

-24 0.53999 chassis chassis2

-17 0.17999 blade blade17

  3 0.17999 osd.3   up  0.95001  1.0

-18 0.17999 blade blade18

  4 0.17999 osd.4   up  1.0  1.0

-19 0.17999 blade blade19

  5 0.17999 osd.5   up  1.0  1.0



I simply can't get osd.0 back up. I took it offline, out, reinserted it, set it 
up again, deleted the osd configs and remade them - no success whatsoever. 
IMHO the documentation on this part is a bit "lousy", so I'm missing some 
points of information here, sorry folks.



3.   Last but not least, I would like to know whether it is a good idea to 
have the data and config networks on 2 dedicated VLANs instead of on 2 
dedicated NICs. Our hardware is redundant and we have 10 Gig fibre optics 
in-house and 80 Gig between the two datacenters. The data VLAN uses jumbo 
frames while the others don't.



4.   Do you guys have some kind of "best practice" book available for 
large-scale deployments? 20+ servers, up to 100+ and 1000+?

Regards

Florian

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Deploying new OSDs in parallel or one after another

2016-11-28 Thread Kevin Olbrich
Hi!

I want to deploy two nodes with 4 OSDs each. I already prepared OSDs and
only need to activate them.
What is better? One by one or all at once?

Kind regards,
Kevin.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com