[ceph-users] Re: OSD crash on Onode::put

2023-01-10 Thread Serkan Çoban
Is slot 19 inside the chassis? Do you monitor the chassis temperature? I
sometimes see a higher failure rate for HDDs inside the chassis than for
those at the front of the chassis. In our case it was related to the
temperature difference.

On Tue, Jan 10, 2023 at 1:28 PM Frank Schilder  wrote:
>
> Following up on my previous post, we have identical OSD hosts. The very
> strange observation now is that all outlier OSDs are in exactly the same
> disk slot on these hosts. We have 5 problematic OSDs and they are all in slot
> 19 on 5 different hosts. This is an extremely strange and unlikely
> coincidence.
>
> Are there any hardware-related conditions that could cause or amplify this
> problem?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: What is the max size of cephfs (filesystem)

2022-06-20 Thread Serkan Çoban
Currently the biggest HDD is 20TB, so 1 exabyte means a 50,000-OSD
cluster (without replication or EC).
AFAIK CERN did some tests using 5,000 OSDs; I don't know of any larger
clusters than CERN's.
So I am not saying it is impossible, but it is very unlikely that a
single Ceph cluster grows to that size.
Maybe you should look at alternatives such as HDFS, which I have known
and worked with at more than 50,000 HDDs without problems.
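
Just to show where those numbers come from, here is a rough back-of-the-envelope
sketch (the 8+3 EC profile is only an example I picked, not a recommendation):

# Sketch: OSD count needed for ~1 EB of usable CephFS capacity, assuming
# one OSD per 20TB HDD. Purely illustrative figures.
TB = 10**12
usable = 10**18                      # 1 exabyte usable
hdd = 20 * TB                        # biggest HDD today

scenarios = {
    "no redundancy":      usable,
    "3x replication":     usable * 3,
    "8+3 erasure coding": usable * (8 + 3) / 8,
}
for label, raw in scenarios.items():
    print(f"{label}: {raw / hdd:,.0f} OSDs")
# no redundancy: 50,000 OSDs; 3x replication: 150,000; 8+3 EC: 68,750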

On Mon, Jun 20, 2022 at 10:46 AM Arnaud M  wrote:
>
> Hello to everyone
>
> I have looked on the internet but couldn't find an answer.
> Do you know the maximum size of a Ceph filesystem? Not the max size of a
> single file, but the limit of the whole filesystem?
>
> For example, a quick search on ZFS on Google outputs:
> A ZFS file system can store up to *256 quadrillion zettabytes* (ZB).
>
> I would like to have the same answer for CephFS.
>
> And if there is a limit, where is this limit coded? Is it hard-coded or is
> it configurable?
>
> Let's say someone wants to grow a CephFS up to an exabyte: would it be
> completely foolish, or would the system, given enough MDSs and servers and
> everything needed, be usable?
>
> Is there any other limit to a Ceph filesystem?
>
> All the best
>
> Arnaud
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph as a HDFS alternative?

2021-08-26 Thread Serkan Çoban
Ceph cannot scale like HDFS; there are 10K-20K node HDFS clusters in production.
There is no data locality concept if you use Ceph: every I/O will be
served over the network.

On Thu, Aug 26, 2021 at 12:04 PM zhang listar  wrote:
>
> Hi, all.
>
> I want to use Ceph instead of HDFS in a big data analysis scenario. Does Ceph
> have some potential problems when the cluster becomes big, say 100PB or
> 500PB?
>
> As far as I know, there are some cons:
>
>1. No short-circuit read, so we need a fast network, say 10G or better 50G?
>
>2. Not exactly du, but it is acceptable for applications.
>
>3. Ceph can't handle slow disks as HDFS does, for example hedged reads or writes.
>
> Is that right? Or are there many other cons?
>
> Thanks in advance.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph with BGP?

2021-07-06 Thread Serkan Çoban
Sorry, I have not used it before; we evaluated it and went for L2 below the ToR.
You need to design the BGP network with your network team because
every node will have an ASN too.
You need to set up a test network before going to production.

On Tue, Jul 6, 2021 at 4:33 PM German Anders  wrote:
>
> Great. Regarding FRR: no, the truth is that I have no prior experience with it.
> Is there any specific guide to FRR + Ceph configuration, or any advice?
> Currently we have a bond of two 10GbE links for the public network, so we
> could break the bond and get two 10GbE fibers per node for the public
> network.
>
> On Tue, Jul 6, 2021 at 10:27 AM Serkan Çoban  wrote:
>>
>> Without using FRR you are still L2 below ToR. Do you have any
>> experience with FRR?
>> You should think about multiple uplinks and how you handle them with FRR.
>> Other than that, it should work, we have been in production with BGP
>> above ToR for years without any problem.
>>
>> On Tue, Jul 6, 2021 at 4:10 PM German Anders  wrote:
>> >
>> > Hello Stefan, thanks a lot for the reply.
>> >
>> > We have everything in the same datacenter, and the fault domains are per
>> > rack. Regarding BGP, the idea of the networking team is to stop using layer
>> > 2+3 and move everything to full layer 3; for this they want to implement
>> > BGP. The idea is that each node is connected to a different ToR edge
>> > switch and that the clients have a single IP and can then reach the entire
>> > cluster.
>> > Currently, we have an environment configured with a specific VLAN
>> > without a GW, and they want to get the VLAN out of the way so that each node
>> > has its own IP with its own GW (which is the ToR switch). We already
>> > have a separate cluster network running on InfiniBand, and it's
>> > completely separate. So the idea is to use BGP on the public network only.
>> >
>> > Thanks in advance,
>> >
>> > Cheers,
>> >
>> > On Tue, Jul 6, 2021 at 2:10 AM Stefan Kooman  wrote:
>> >
>> > > On 7/5/21 6:26 PM, German Anders wrote:
>> > > > Hi All,
>> > > >
>> > > > I have an already created and functional ceph cluster (latest luminous
>> > > > release) with two networks, one for the public (layer 2+3) and the other
>> > > > for the cluster. The public one uses VLAN and is 10GbE, and the other one
>> > > > uses Infiniband with 56Gb/s; the cluster works OK. The public network uses
>> > > > Juniper QFX5100 switches with VLAN in a layer 2+3 configuration, but the
>> > > > network team needs to move to full layer 3 and they want to use BGP, so
>> > > > the question is: how can we move to that scheme? What are the
>> > > > considerations? Is it possible? Is there any step-by-step way to move to
>> > > > that scheme? Also, is there anything better than BGP, or other alternatives?
>> > >
>> > > Ceph doesn't care at all. Just as long as the nodes can communicate to
>> > > each other, it's fine. It depends on your failure domains how easy you
>> > > can move to this L3 model. Do you have separate datacenters that you can
>> > > do one by one, or separate racks?
>> > >
>> > > And you can do BGP on different levels: router, top-of-rack switches, or
>> > > even on the Ceph host itself (FRR).
>> > >
>> > > We use BGP / VXLAN / EVPN for our Ceph cluster. But it all depends on
>> > > why your networking team wants to change to L3.
>> > >
>> > > There are no step by step guides, as most deployments are unique.
>> > >
>> > > This might be a good time to reconsider a separate cluster network.
>> > > Normally there is no need for that, and dropping it might make things simpler.
>> > >
>> > > Do you have separate storage switches? Where are your clients connected
>> > > to (separate switches, or connected to the storage switches as well)?
>> > >
>> > > This is not easy to answer without all the details. But for sure there
>> > > are clusters running with BGP in the field just fine.
>> > >
>> > > Gr. Stefan
>> > > ___
>> > > ceph-users mailing list -- ceph-users@ceph.io
>> > > To unsubscribe send an email to ceph-users-le...@ceph.io
>> > >
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph with BGP?

2021-07-06 Thread Serkan Çoban
Without using FRR you are still L2 below ToR. Do you have any
experience with FRR?
You should think about multiple uplinks and how you handle them with FRR.
Other than that, it should work, we have been in production with BGP
above ToR for years without any problem.

On Tue, Jul 6, 2021 at 4:10 PM German Anders  wrote:
>
> Hello Stefan, thanks a lot for the reply.
>
> We have everything in the same datacenter, and the fault domains are per
> rack. Regarding BGP, the idea of the networking team is to stop using layer
> 2+3 and move everything to full layer 3; for this they want to implement
> BGP. The idea is that each node is connected to a different ToR edge
> switch and that the clients have a single IP and can then reach the entire
> cluster.
> Currently, we have an environment configured with a specific VLAN
> without a GW, and they want to get the VLAN out of the way so that each node
> has its own IP with its own GW (which is the ToR switch). We already
> have a separate cluster network running on InfiniBand, and it's
> completely separate. So the idea is to use BGP on the public network only.
>
> Thanks in advance,
>
> Cheers,
>
> On Tue, Jul 6, 2021 at 2:10 AM Stefan Kooman  wrote:
>
> > On 7/5/21 6:26 PM, German Anders wrote:
> > > Hi All,
> > >
> > > I have an already created and functional ceph cluster (latest luminous
> > > release) with two networks, one for the public (layer 2+3) and the other
> > > for the cluster. The public one uses VLAN and is 10GbE, and the other one
> > > uses Infiniband with 56Gb/s; the cluster works OK. The public network uses
> > > Juniper QFX5100 switches with VLAN in a layer 2+3 configuration, but the
> > > network team needs to move to full layer 3 and they want to use BGP, so
> > > the question is: how can we move to that scheme? What are the
> > > considerations? Is it possible? Is there any step-by-step way to move to
> > > that scheme? Also, is there anything better than BGP, or other alternatives?
> >
> > Ceph doesn't care at all. Just as long as the nodes can communicate to
> > each other, it's fine. It depends on your failure domains how easy you
> > can move to this L3 model. Do you have separate datacenters that you can
> > do one by one, or separate racks?
> >
> > And you can do BGP on different levels: router, top-of-rack switches, or
> > even on the Ceph host itself (FRR).
> >
> > We use BGP / VXLAN / EVPN for our Ceph cluster. But it all depends on
> > why your networking team wants to change to L3.
> >
> > There are no step by step guides, as most deployments are unique.
> >
> > This might be a good time to reconsider a separate cluster network.
> > Normally there is no need for that, and dropping it might make things simpler.
> >
> > Do you have separate storage switches? Where are your clients connected
> > to (separate switches, or connected to the storage switches as well)?
> >
> > This is not easy to answer without all the details. But for sure there
> > are clusters running with BGP in the field just fine.
> >
> > Gr. Stefan
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: speeding up EC recovery

2021-06-25 Thread Serkan Çoban
You can use clay codes [1].
They read less data for reconstruction.

1- https://docs.ceph.com/en/latest/rados/operations/erasure-code-clay/
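
Roughly, the benefit for your 16+3 profile looks like the sketch below. It
assumes the repair-read figure of d/(d-k+1) chunk-sizes from the clay docs,
with d defaulting to k+m-1, versus k full chunk reads for plain Reed-Solomon,
so treat it as an estimate, not a measurement:

# Sketch: data read to rebuild a single lost chunk, Reed-Solomon vs clay.
# Assumes clay reads d/(d-k+1) chunk-sizes from d = k+m-1 helpers, per the
# erasure-code-clay documentation linked above.
def repair_reads(k, m, d=None):
    d = d if d is not None else k + m - 1
    rs = k                        # jerasure/RS: read k whole chunks
    clay = d / (d - k + 1)        # clay: partial reads from d helpers
    return rs, clay

k, m = 16, 3                      # the 16+3 profile from this thread
rs, clay = repair_reads(k, m)
print(f"RS reads ~{rs} chunk-sizes per lost chunk, clay reads ~{clay:.1f}")
# -> RS reads ~16 chunk-sizes per lost chunk, clay reads ~6.0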

On Fri, Jun 25, 2021 at 2:50 PM Andrej Filipcic  wrote:
>
>
> Hi,
>
> on a large cluster with ~1600 OSDs, 60 servers and using 16+3 erasure
> coded pools, the recovery after OSD failure (HDD) is quite slow. Typical
> values are at 4GB/s with 125 ops/s and 32MB object sizes, which then
> takes 6-8 hours, during which the PGs are degraded. I tried to speed
> it up with
>
>osd advanced  osd_max_backfills 32
>osd advanced  osd_recovery_max_active 10
>osd advanced  osd_recovery_op_priority 63
>osd advanced  osd_recovery_sleep_hdd 0.00
>
> which at least kept the IOPS at a constant level. The recovery does
> not seem to be CPU or memory bound. Is there any way to speed it up?
> While testing the recovery on replicated pools, it reached 50GB/s.
>
> In contrast, replacing the failed drive with a new one and re-adding the
> OSD is quite fast, with a 1GB/s recovery rate of misplaced PGs, or
> ~120MB/s average HDD write speed, which is not very far from HDD throughput.
>
> Regards,
> Andrej
>
> --
> _
> prof. dr. Andrej Filipcic,   E-mail: andrej.filip...@ijs.si
> Department of Experimental High Energy Physics - F9
> Jozef Stefan Institute, Jamova 39, P.o.Box 3000
> SI-1001 Ljubljana, Slovenia
> Tel.: +386-1-477-3674Fax: +386-1-425-7074
> -
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Fwd: Re: Issues with Ceph network redundancy using L2 MC-LAG

2021-06-16 Thread Serkan Çoban
You cannot do much if the link is flapping or the cable is bad.
Maybe you can write some rules to shut the port down on the switch if
the error-packet ratio goes up.
I also remember there are some configuration options on the switch side
for handling link flapping.

On Wed, Jun 16, 2021 at 10:57 AM huxia...@horebdata.cn
 wrote:
>
> Is it true that MC-LAG and 802.3ad, by default, work active-active?
>
> What else should I take care of to ensure fault tolerance when one path is bad?
>
> best regards,
>
> samuel
>
>
>
> huxia...@horebdata.cn
>
> From: Joe Comeau
> Date: 2021-06-15 23:44
> To: ceph-users@ceph.io
> Subject: [ceph-users] Fwd: Re: Issues with Ceph network redundancy using L2 
> MC-LAG
> We also run with Dell VLT switches (40 GbE).
> Everything is active/active, so multiple paths, as Andrew describes in
> his config.
> Our config allows us to:
>bring down one of the switches for upgrades
>bring down an iSCSI gateway for patching
> all the while at least one path is up and servicing.
> Thanks Joe
>
>
> >>> Andrew Walker-Brown  6/15/2021 10:26 AM
> >>>
> With an unstable link/port you could see the issues you describe.  Ping
> doesn’t have the packet rate for you to necessarily have a packet in
> transit at exactly the same time as the port fails temporarily.  Iperf
> on the other hand could certainly show the issue, higher packet rate and
> more likely to have packets in flight at the time of a link
> fail...combined with packet loss/retries gives poor throughput.
>
> Depending on what you want to happen, there are a number of tuning
> options both on the switches and in Linux.  If you want the LAG to be down
> if any link fails, then you should be able to configure this on the switches
> and/or in Linux (minimum number of links = 2 if you have 2 links in the
> LAG).
>
> You can also tune the link monitoring, how frequently the links are
> checked (e.g. miimon) etc.  Bringing this value down from the default of
> 100ms may allow you to detect a link failure more quickly.  But you then
> run into the chance of detecting a transient failure that wouldn’t have
> caused any issues, and the LAG becoming more unstable.
>
> Flapping/unstable links are the worst kind of situation.  Ideally you’d
> pick that up quickly from monitoring/alerts and either fix immediately
> or take the link down until you can fix it.
>
> I run 2x10G from my hosts into separate switches (Dell S series – VLT
> between switches).  Pulling a single interface has no impact on Ceph,
> any packet loss is tiny and we’re not exceeding 10G bandwidth per host.
>
> If you’re running 1G links and the LAG is already busy, a link failure
> could be causing slow writes to the host, just down to
> congestion...which then starts to impact the wider cluster based on how
> Ceph works.
>
> Just caveating the above with - I’m relatively new to Ceph myself
>
>
> From: huxia...@horebdata.cn
> Sent: 15 June 2021 17:52
> To: Serkan Çoban
> Cc: ceph-users
> Subject: [ceph-users] Re: Issues with Ceph network redundancy using L2
> MC-LAG
>
> When I pull out the cable, the bond works properly.
>
> Does it mean that the port is somehow flapping? Ping can still work,
> but the iperf test yields very low results.
>
>
>
>
>
> huxia...@horebdata.cn
>
> From: Serkan Çoban
> Date: 2021-06-15 18:47
> To: huxia...@horebdata.cn
> CC: ceph-users
> Subject: Re: [ceph-users] Issues with Ceph network redundancy using L2
> MC-LAG
> Do you observe the same behaviour when you pull a cable?
> Maybe a flapping port might cause this kind of behaviour; other than
> that you shouldn't see any network disconnects.
> Are you sure about LACP configuration, what is the output of 'cat
> /proc/net/bonding/bond0'
>
> On Tue, Jun 15, 2021 at 7:19 PM huxia...@horebdata.cn
>  wrote:
> >
> > Dear Cephers,
> >
> > I encountered the following networking issue several times, and I
> > wonder whether there is a solution for network HA.
> >
> > We build Ceph using an L2 multi-chassis link aggregation group (MC-LAG)
> > to provide switch redundancy. On each host, we use 802.3ad (LACP)
> > mode for NIC redundancy. However, we have observed several times that when a
> > single network port fails, either the cable or the SFP+ optical module, the
> > Ceph cluster is badly affected by networking, although in theory it
> > should be able to tolerate it.
> >
> > Did I miss something important here? And how can we really achieve
> > network HA in a Ceph cluster?

[ceph-users] Re: Issues with Ceph network redundancy using L2 MC-LAG

2021-06-15 Thread Serkan Çoban
Do you observe the same behaviour when you pull a cable?
Maybe a flapping port might cause this kind of behaviour; other than
that you shouldn't see any network disconnects.
Are you sure about the LACP configuration? What is the output of 'cat
/proc/net/bonding/bond0'?
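
If it helps, here is a minimal sketch of what I look for in that file (the
field names are the ones the Linux bonding driver prints; the path is the
standard one for bond0):

# Sketch: summarise /proc/net/bonding/bond0 -- bonding mode, per-slave MII
# status and link failure counts. A climbing failure count points at a
# flapping port.
from pathlib import Path

def bond_report(path="/proc/net/bonding/bond0"):
    slave = None
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if line.startswith("Bonding Mode:"):
            print(line)                              # expect 802.3ad here
        elif line.startswith("Slave Interface:"):
            slave = line.split(":", 1)[1].strip()
        elif slave and line.startswith("MII Status:"):
            print(f"{slave}: {line}")
        elif slave and line.startswith("Link Failure Count:"):
            print(f"{slave}: {line}")

bond_report()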

On Tue, Jun 15, 2021 at 7:19 PM huxia...@horebdata.cn
 wrote:
>
> Dear Cephers,
>
> I encountered the following networking issue several times, and I wonder
> whether there is a solution for network HA.
>
> We build Ceph using an L2 multi-chassis link aggregation group (MC-LAG) to
> provide switch redundancy. On each host, we use 802.3ad (LACP)
> mode for NIC redundancy. However, we have observed several times that when a single
> network port fails, either the cable or the SFP+ optical module, the Ceph
> cluster is badly affected by networking, although in theory it should be
> able to tolerate it.
>
> Did I miss something important here? And how can we really achieve network HA
> in a Ceph cluster?
>
> best regards,
>
> Samuel
>
>
>
>
> huxia...@horebdata.cn
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Storing 20 billions of immutable objects in Ceph, 75% <16KB

2021-02-17 Thread Serkan Çoban
I still prefer the simplest solution. There are 4U servers with 110 x
20TB disks on the market.
After RAID you get about 1.5PiB per server; that is roughly 30 months of data.
Two such servers will hold 5 years of data with minimal problems.
If you need backup, then buy two more sets and just send ZFS snapshot
diffs to that set.
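
For what it's worth, the sizing arithmetic as a quick sketch (using 1.5PiB
usable per server and the ~50TB/month growth mentioned in the quoted message
below):

# Sketch: how long big JBOD/ZFS servers last at ~50 TB/month of growth.
PiB = 2**50
TB = 10**12
usable_per_server = 1.5 * PiB          # after RAID, per 110 x 20TB server
growth_per_month = 50 * TB             # growth rate from the quoted message

months = usable_per_server / growth_per_month
print(f"one server: ~{months:.0f} months")            # ~34 months, close to the ~30 above
print(f"two servers: ~{2 * months / 12:.1f} years")   # ~5.6 years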


On Wed, Feb 17, 2021 at 11:15 PM Loïc Dachary  wrote:
>
>
>
> On 17/02/2021 18:27, Serkan Çoban wrote:
> > Why not put all the data in a ZFS pool with a 3-4 level deep directory
> > structure, each directory named with two hex bytes in the range 00-FF?
> > Four levels deep, you get 256^4 = ~4.3B folders with 3-4 objects per folder,
> > or three levels deep you get 256^3 = ~16.8M folders with ~1000 objects
> > each.
> It is more or less the current setup :-) I should have mentioned that there
> currently are ~750TB and 10 billion objects. But it's growing by 50TB every
> month and it will keep growing indefinitely, which is why a solution that
> scales out is desirable.
> >
> > On Wed, Feb 17, 2021 at 8:14 PM Loïc Dachary  wrote:
> >> Hi Nathan,
> >>
> >> Good thinking :-) The names of the objects are indeed the SHA256 of their 
> >> content, which provides deduplication.
> >>
> >> Cheers
> >>
> >> On 17/02/2021 18:04, Nathan Fish wrote:
> >>> I'm not much of a programmer, but as soon as I hear "immutable
> >>> objects" I think "content-addressed". I don't know if you have many
> >>> duplicate objects in this set, but content-addressing gives you
> >>> object-level dedup for free. Do you have to preserve some meaningful
> >>> object names from the original dataset, or just do you just need some
> >>> kind of ID?
> >>>
> >>> On Wed, Feb 17, 2021 at 11:37 AM Loïc Dachary  wrote:
> >>>> Bonjour,
> >>>>
> >>>> TL;DR: Is it more advisable to work on Ceph internals to make it 
> >>>> friendly to this particular workload or write something similar to 
> >>>> EOS[0] (i.e Rocksdb + Xrootd + RBD)?
> >>>>
> >>>> This is a followup of two previous mails[1] sent while researching this 
> >>>> topic. In a nutshell, the Software Heritage project[1] currently has 
> >>>> ~750TB and 10 billion objects, 75% of which have a size smaller than 
> >>>> 16KB and 50% have a size smaller than 4KB. But they only account for ~5% 
> >>>> of the 750TB: 25% of the objects have a size > 16KB and total ~700TB. 
> >>>> The objects can be compressed by ~50% and 750TB only needs 350TB of 
> >>>> actual storage. (if you're interested in the details see [2]).
> >>>>
> >>>> Let's say those 10 billion objects are stored in a single 4+2 erasure 
> >>>> coded pool with bluestore compression set for objects that have a size > 
> >>>> 32KB and the smallest allocation size for bluestore set to 4KB[3]. The 
> >>>> 750TB won't use the expected 350TB but about 30% more, i.e. ~450TB (see 
> >>>> [4] for the maths). This space amplification is because storing a 1 byte 
> >>>> object uses the same space as storing a 16KB object (see [5] to repeat 
> >>>> the experiment at home). In a 4+2 erasure coded pool, each of the 6 
> >>>> chunks will use no less than 4KB because that's the smallest allocation 
> >>>> size for bluestore. That's 4 * 4KB = 16KB even when all that is needed 
> >>>> is 1 byte.
> >>>>
> >>>> It was suggested[6] to have two different pools: one with a 4+2 erasure 
> >>>> pool and compression for all objects with a size > 32KB that are 
> >>>> expected to compress to 16KB. And another with 3 replicas for the 
> >>>> smaller objects to reduce space amplification to a minimum without 
> >>>> compromising on durability. A client looking for the object could make 
> >>>> two simultaneous requests to the two pools. They would get 404 from one 
> >>>> of them and the object from the other.
> >>>>
> >>>> Another workaround is best described in the "Finding a needle in 
> >>>> Haystack: Facebook’s photo storage"[9] paper and essentially boils down 
> >>>> to using a database to store a map between the object name and its 
> >>>> location. That does not scale out (writing the database index is the 
> >>>> bottleneck) but it's simple enough and is successfully implemented in 
> >>>> EOS[0] with >200PB worth of

[ceph-users] Re: Storing 20 billions of immutable objects in Ceph, 75% <16KB

2021-02-17 Thread Serkan Çoban
Why not put all the data in a ZFS pool with a 3-4 level deep directory
structure, each directory named with two hex bytes in the range 00-FF?
Four levels deep, you get 256^4 = ~4.3B folders with 3-4 objects per folder,
or three levels deep you get 256^3 = ~16.8M folders with ~1000 objects
each.
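
A minimal sketch of that layout, assuming the object name is the SHA256 of its
content (which Loïc confirms below); the root path is just an example:

# Sketch: content-addressed path with a fixed-depth hex fan-out.
# 3 levels give 256**3 = ~16.8M leaf directories, 4 levels 256**4 = ~4.3B.
import hashlib
from pathlib import Path

def object_path(root, content, depth=3):
    digest = hashlib.sha256(content).hexdigest()
    parts = [digest[2 * i:2 * i + 2] for i in range(depth)]   # e.g. ['b9', '4d', '27']
    return Path(root, *parts, digest)

print(object_path("/tank/objects", b"hello world"))
# /tank/objects/b9/4d/27/b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9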

On Wed, Feb 17, 2021 at 8:14 PM Loïc Dachary  wrote:
>
> Hi Nathan,
>
> Good thinking :-) The names of the objects are indeed the SHA256 of their 
> content, which provides deduplication.
>
> Cheers
>
> On 17/02/2021 18:04, Nathan Fish wrote:
> > I'm not much of a programmer, but as soon as I hear "immutable
> > objects" I think "content-addressed". I don't know if you have many
> > duplicate objects in this set, but content-addressing gives you
> > object-level dedup for free. Do you have to preserve some meaningful
> > object names from the original dataset, or do you just need some
> > kind of ID?
> >
> > On Wed, Feb 17, 2021 at 11:37 AM Loïc Dachary  wrote:
> >> Bonjour,
> >>
> >> TL;DR: Is it more advisable to work on Ceph internals to make it friendly 
> >> to this particular workload or write something similar to EOS[0] (i.e 
> >> Rocksdb + Xrootd + RBD)?
> >>
> >> This is a followup of two previous mails[1] sent while researching this 
> >> topic. In a nutshell, the Software Heritage project[1] currently has 
> >> ~750TB and 10 billions objects, 75% of which have a size smaller than 16KB 
> >> and 50% have a size smaller than 4KB. But they only account for ~5% of the 
> >> 750TB: 25% of the objects have a size > 16KB and total ~700TB. The objects 
> >> can be compressed by ~50% and 750TB only needs 350TB of actual storage. 
> >> (if you're interested in the details see [2]).
> >>
> >> Let's say those 10 billion objects are stored in a single 4+2 erasure coded 
> >> pool with bluestore compression set for objects that have a size > 32KB 
> >> and the smallest allocation size for bluestore set to 4KB[3]. The 750TB 
> >> won't use the expected 350TB but about 30% more, i.e. ~450TB (see [4] for 
> >> the maths). This space amplification is because storing a 1 byte object 
> >> uses the same space as storing a 16KB object (see [5] to repeat the 
> >> experiment at home). In a 4+2 erasure coded pool, each of the 6 chunks 
> >> will use no less than 4KB because that's the smallest allocation size for 
> >> bluestore. That's 4 * 4KB = 16KB even when all that is needed is 1 byte.
> >>
> >> It was suggested[6] to have two different pools: one with a 4+2 erasure 
> >> pool and compression for all objects with a size > 32KB that are expected 
> >> to compress to 16KB. And another with 3 replicas for the smaller objects 
> >> to reduce space amplification to a minimum without compromising on 
> >> durability. A client looking for the object could make two simultaneous 
> >> requests to the two pools. They would get 404 from one of them and the 
> >> object from the other.
> >>
> >> Another workaround is best described in the "Finding a needle in 
> >> Haystack: Facebook’s photo storage"[9] paper and essentially boils down to 
> >> using a database to store a map between the object name and its location. 
> >> That does not scale out (writing the database index is the bottleneck) but 
> >> it's simple enough and is successfully implemented in EOS[0] with >200PB 
> >> worth of data and in seaweedfs[10], another promising object store 
> >> software based on the same idea.
> >>
> >> Instead of working around the problem, maybe Ceph could be modified to 
> >> make better use of the immutability of these objects[7], a hint that is 
> >> apparently only used to figure out how to best compress it and for 
> >> checksum calculation[8]. I honestly have no clue how difficult it would 
> >> be. All I know is that it's not easy, otherwise it would have been done 
> >> already: there seems to be a general need for efficiently (space-wise and 
> >> performance-wise) storing large quantities of objects smaller than 4KB.
> >>
> >> Is it more advisable to:
> >>
> >>   * work on Ceph internals to make it friendly to this particular workload 
> >> or,
> >>   * write another implementation of "Finding a needle in Haystack: 
> >> Facebook’s photo storage"[9] based on RBD[11]?
> >>
> >> I'm currently leaning toward working on Ceph internals but there are pros 
> >> and cons to both approaches[12]. And since all this is still very new to 
> >> me, there also is the possibility that I'm missing something. Maybe it's 
> >> *super* difficult  to improve Ceph in this way. I should try to figure 
> >> that out sooner rather than later.
> >>
> >> I realize it's a lot to take in and unless you're facing the exact same 
> >> problem there is very little chance you read that far :-) But if you 
> >> did... I'm *really* interested to hear what you think. In any case I'll 
> >> report back to this thread once a decision has been made.
> >>
> >> Cheers
> >>
> >> [0] https://eos-web.web.cern.ch/eos-web/
> >> [1] 
> >> 

[ceph-users] Re: OSD crashes regularely

2020-05-20 Thread Serkan Çoban
The disk is not OK; look at the output below:
SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE

You should replace the disk.
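
If you want to catch drives in this state before the OSD starts crashing,
something like the sketch below can sweep the controller. The device-ID range
and the /dev/sda path are assumptions on my side, but the smartctl -d
megaraid,N invocation is the same one used in the quoted output:

# Sketch: flag SAS drives behind a MegaRAID controller whose smartctl health
# status is not "OK". Device-ID range and /dev/sda are illustrative only.
import subprocess

def health_status(dev_id, block_dev="/dev/sda"):
    out = subprocess.run(
        ["smartctl", "-H", "-d", f"megaraid,{dev_id}", block_dev],
        capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.startswith("SMART Health Status:"):
            return line.split(":", 1)[1].strip()
    return "UNKNOWN"

for dev_id in range(24):                 # assumed number of slots
    status = health_status(dev_id)
    if status != "OK":
        print(f"megaraid,{dev_id}: {status}")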

On Wed, May 20, 2020 at 5:11 PM Thomas <74cmo...@gmail.com> wrote:
>
> Hello,
>
> I have a pool of 300+ OSDs that are all the identical model (Seagate model:
> ST1800MM0129, size: 1.64 TiB).
> Only 1 OSD crashes regularly; however, I cannot identify a root cause.
>
> Based on the output of smartctl, the disk is OK.
>
> # smartctl -a -d megaraid,1 /dev/sda
> smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.18-2-pve] (local build)
> Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Vendor:   LENOVO-X
> Product:  ST1800MM0129
> Revision: L2B6
> Compliance:   SPC-4
> User Capacity:1,800,360,124,416 bytes [1.80 TB]
> Logical block size:   512 bytes
> Physical block size:  4096 bytes
> LU is fully provisioned
> Rotation Rate:10500 rpm
> Form Factor:  2.5 inches
> Logical Unit id:  0x5000c500bb7822cf
> Serial number:WBN0QHX8E852944J
> Device type:  disk
> Transport protocol:   SAS (SPL-3)
> Local Time is:Mon May 18 09:19:41 2020 CEST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> Temperature Warning:  Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=10]
>
> Grown defects during certification 
> Total blocks reassigned during format 
> Total new blocks reassigned = 68
> Power on minutes since format 
> Current Drive Temperature: 33 C
> Drive Trip Temperature:65 C
>
> Manufactured in week 31 of year 2018
> Specified cycle count over device lifetime:  1
> Accumulated start-stop cycles:  21
> Specified load-unload count over device lifetime:  30
> Accumulated load-unload cycles:  709
> Elements in grown defect list: 18
>
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/      errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:   3278853896        1         0  3278853897          32      83933.567          19
> write:           0        0         0           0           0      24093.894           0
> verify: 3080361880        0         0  3080361880           0      12630.494           0
>
> Non-medium error count:  244
>
> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                              number   (hours)
> # 1  Background short  Completed                   -      3761                 - [-   -    -]
> # 2  Background short  Completed                   -      3737                 - [-   -    -]
> # 3  Background short  Completed                   -      3713                 - [-   -    -]
> # 4  Background short  Completed                   -      3689                 - [-   -    -]
> # 5  Background short  Completed                   -      3665                 - [-   -    -]
> # 6  Background short  Completed                   -      3641                 - [-   -    -]
> # 7  Background short  Completed                   -      3617                 - [-   -    -]
> # 8  Background short  Completed                   -      3593                 - [-   -    -]
> # 9  Background long   Completed                   -      3569                 - [-   -    -]
> #10  Background short  Completed                   -      3545                 - [-   -    -]
> #11  Background short  Completed                   -      3521                 - [-   -    -]
> #12  Background short  Completed                   -      3497                 - [-   -    -]
> #13  Background short  Completed                   -      3473                 - [-   -    -]
> #14  Background short  Completed                   -      3449                 - [-   -    -]
> #15  Background short  Completed                   -      3425                 - [-   -    -]
> #16  Background short  Completed                   -      3401                 - [-   -    -]
> #17  Background short  Completed                   -      3377                 - [-   -    -]
> #18  Background short  Completed                   -      3353                 - [-   -    -]
> #19  Background short  Completed                   -      3329                 - [-   -    -]
> #20  Background short  Completed                   -      3305                 - [-   -    -]
>
> Long (extended) Self-test duration: 9459 seconds [157.7 minutes]
>
> I have attached the log of the affected OSD.
>
> THX
> Thomas
>
> I have uploaded 1 file belonging to this email via WeTransfer:
> ceph-osd.92.log.1.gz (578 KB)
> https://we.tl/t-7DzNCDP3iZ

[ceph-users] Re: HBase/HDFS on Ceph/CephFS

2020-04-24 Thread Serkan Çoban
You do not want to mix Ceph with Hadoop, because you'll lose data
locality, which is the main point of Hadoop systems.
Every read/write request will go over the network; this is not optimal.

On Fri, Apr 24, 2020 at 9:04 AM  wrote:
>
> Hi
>
> We have a 3-year-old Hadoop cluster - up for refresh - so it is time
> to evaluate options. The "only" use case is running an HBase installation
> which is important for us, and migrating out of HBase would be a hassle.
>
> Our Ceph usage has expanded and in general - we really like what we see.
>
> Thus - can this be "sanely" consolidated somehow? I have seen this:
> https://docs.ceph.com/docs/jewel/cephfs/hadoop/
> But it seems really, really bogus to me.
>
> It recommends that you set:
> pool 3 'hadoop1' rep size 1 min_size 1
>
> which would - if I understand correctly - be disastrous. The Hadoop end would
> replicate 3x across - but within Ceph the replication would be 1.
> The 1x replication in Ceph means pulling an OSD node would "guarantee" the
> PGs go inactive - which could be OK - but there is nothing
> guaranteeing that the other Hadoop replicas are not served out of the same
> OSD node/PG. In which case - rebooting an OSD node would make the Hadoop
> cluster unavailable.
>
> Is anyone serving HBase out of Ceph - how does the stack and
> configuration look? If I went for 3x replication in both Ceph and HDFS
> then it would definitely work, but 9x copies of the dataset is a bit more
> than what looks feasible at the moment.
>
> Thanks for your reflections/input.
>
> Jesper
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: EC pool used space high

2019-11-26 Thread Serkan Çoban
Maybe the following link helps:
https://www.spinics.net/lists/dev-ceph/msg00795.html
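
The arithmetic being discussed, as a quick sketch (the 1.12x factor is the
discrepancy Erdem reports below, not something I measured):

# Sketch: expected vs observed raw/stored ratio for a k=6, m=3 EC pool.
k, m = 6, 3
expected = (k + m) / k                 # 1.50x EC overhead
observed_extra = 1.12                  # cluster_used vs pool_stored_raw, per the thread
print(f"expected raw/stored:  {expected:.2f}x")                  # 1.50x
print(f"effective raw/stored: {expected * observed_extra:.2f}x") # ~1.68x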

On Tue, Nov 26, 2019 at 6:17 PM Erdem Agaoglu  wrote:
>
> I thought of that but it doesn't make much sense. AFAICT min_size should
> block I/O when I lose 3 OSDs, but it shouldn't affect the amount of stored
> data. Am I missing something?
>
> On Tue, Nov 26, 2019 at 6:04 AM Konstantin Shalygin  wrote:
>>
>> On 11/25/19 6:05 PM, Erdem Agaoglu wrote:
>>
>>
>> What I can't find is the 138,509 G difference between the 
>> ceph_cluster_total_used_bytes and ceph_pool_stored_raw. This is not static 
>> BTW, checking the same data historically shows we have about 1.12x of what 
>> we expect. This seems to make our 1.5x EC overhead a 1.68x overhead in 
>> reality. Anyone have any ideas for why this is the case?
>>
>> Maybe it's min_size related? Because you are right, 6+3 is 1.50, but 6+3 (+1)
>> is your calculated 1.67.
>>
>>
>>
>> k
>
>
>
> --
> erdem agaoglu
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io