Re: [ceph-users] Intel P4600 3.2TB U.2 form factor NVMe firmware problems causing dead disks

2019-02-26 Thread Jeff Smith
We had several PostgreSQL servers running these disks from Dell.  We saw
numerous failures, including one server that had 3 die at once.  Dell claims
it is a firmware issue and instructed us to upgrade from QDV1DP12 to QDV1DP15
(I am not sure how these line up with the Intel firmware versions).  We lost
several more during the upgrade process.  We are using ZFS with these drives,
so I can confirm it is not a Ceph Bluestore-only issue.

On Mon, Feb 18, 2019 at 8:44 AM David Turner  wrote:

> We have 2 clusters of [1] these disks that have 2 Bluestore OSDs per disk
> (partitioned), 3 disks per node, 5 nodes per cluster.  The clusters are
> 12.2.4 running CephFS and RBDs.  So we have 15 NVMes per cluster
> and 30 NVMes in total.  They were all built at the same time and were
> running firmware version QDV10130.  On this firmware version we had
> 2 disk failures early on, 1 more a few months later, and then a month
> after that (just a few weeks ago) 7 disk failures in 1 week.
>
> The failures are such that the disk is no longer visible to the OS.  This
> holds true beyond server reboots as well as placing the failed disks into a
> new server.  With a firmware upgrade tool we got an error that pretty much
> said there's no way to get data back and to RMA the disk.  We upgraded all
> of our remaining disks' firmware to QDV101D1 and haven't had any problems
> since then.  Most of our failures happened while rebalancing the cluster
> after replacing dead disks and we tested rigorously around that use case
> after upgrading the firmware.  This firmware version seems to have resolved
> whatever the problem was.
>
> We have about 100 more of these scattered among database servers and other
> servers that have never had this problem while running the
> QDV10130 firmware as well as firmwares between this one and the one we
> upgraded to.  Bluestore on Ceph is the only use case we've had so far with
> this sort of failure.
>
> Has anyone else come across this issue before?  Our current theory is that
> Bluestore is accessing the disk in a way that is triggering a bug in the
> older firmware version that isn't triggered by more traditional
> filesystems.  We have a scheduled call with Intel to discuss this, but
> their preliminary searches into the bugfixes and known problems between
> firmware versions didn't indicate the bug that we triggered.  It would be
> good to have some more information about what those differences for disk
> accessing might be to hopefully get a better answer from them as to what
> the problem is.
>
>
> [1]
> https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] list admin issues

2018-10-08 Thread Jeff Smith
I just got dumped again.  I have not sent any attachments/images.
On Mon, Oct 8, 2018 at 5:48 AM Elias Abacioglu
 wrote:
>
> If it's attachments causing this, perhaps forbid attachments? Force people to 
> use pastebin / imgur type of services?
>
> /E
>
> On Mon, Oct 8, 2018 at 1:33 PM Martin Palma  wrote:
>>
>> Same here also on Gmail with G Suite.
>> On Mon, Oct 8, 2018 at 12:31 AM Paul Emmerich  wrote:
>> >
>> > I'm also seeing this once every few months or so on Gmail with G Suite.
>> >
>> > Paul
>> > On Sun, 7 Oct 2018 at 08:18, Joshua Chen wrote:
>> > >
>> > > I also got removed once, got another warning once (need to re-enable).
>> > >
>> > > Cheers
>> > > Joshua
>> > >
>> > >
>> > > On Sun, Oct 7, 2018 at 5:38 AM Svante Karlsson  
>> > > wrote:
>> > >>
>> > >> I'm also getting removed, but not only from ceph. I subscribe to the 
>> > >> d...@kafka.apache.org list and the same thing happens there.
>> > >>
>> > >> On Sat, 6 Oct 2018 at 23:24, Jeff Smith wrote:
>> > >>>
>> > >>> I have been removed twice.
>> > >>> On Sat, Oct 6, 2018 at 7:07 AM Elias Abacioglu
>> > >>>  wrote:
>> > >>> >
>> > >>> > Hi,
>> > >>> >
>> > >>> > I'm bumping this old thread because it's getting annoying. My 
>> > >>> > membership gets disabled twice a month.
>> > >>> > Between my two Gmail accounts I'm on more than 25 mailing lists and 
>> > >>> > I see this behavior only here. Why is only ceph-users affected? 
>> > >>> > Maybe Christian was on to something; is this intentional?
>> > >>> > The reality is that there are a lot of ceph-users subscribers with 
>> > >>> > Gmail accounts, so perhaps it wouldn't be so bad to actually try to 
>> > >>> > figure this one out?
>> > >>> >
>> > >>> > So can the maintainers of this list please investigate what actually 
>> > >>> > gets bounced? Look at my address if you want.
>> > >>> > I got disabled 20181006, 20180927, 20180916, 20180725, 20180718 most 
>> > >>> > recently.
>> > >>> > Please help!
>> > >>> >
>> > >>> > Thanks,
>> > >>> > Elias
>> > >>> >
>> > >>> > On Mon, Oct 16, 2017 at 5:41 AM Christian Balzer  
>> > >>> > wrote:
>> > >>> >>
>> > >>> >>
>> > >>> >> Most mails to this ML score low or negatively with SpamAssassin,
>> > >>> >> however once in a while (this is a recent one) we get relatively
>> > >>> >> high scores. Note that the forged bits are false positives, but SA
>> > >>> >> is up to date and Google will have similar checks:
>> > >>> >> ---
>> > >>> >> X-Spam-Status: No, score=3.9 required=10.0 tests=BAYES_00,DCC_CHECK,
>> > >>> >>  FORGED_MUA_MOZILLA,FORGED_YAHOO_RCVD,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM,
>> > >>> >>  HEADER_FROM_DIFFERENT_DOMAINS,HTML_MESSAGE,MIME_HTML_MOSTLY,RCVD_IN_MSPIKE_H4,
>> > >>> >>  RCVD_IN_MSPIKE_WL,RDNS_NONE,T_DKIM_INVALID shortcircuit=no autolearn=no
>> > >>> >> ---
>> > >>> >>
>> > >>> >> Between attachment mails and some of these, you're well on your way
>> > >>> >> out.
>> > >>> >>
>> > >>> >> The default mailman settings and logic require 5 bounces to trigger
>> > >>> >> unsubscription and 7 days of NO bounces to reset the counter.
>> > >>> >>
>> > >>> >> Christian
>> > >>> >>
>> > >>> >> On Mon, 16 Oct 2017 12:23:25 +0900 Christian Balzer wrote:
>> > >>> >>
>> > >>> >> > On Mon, 16 Oct 2017 14:15:22 +1100 Blair Bethwaite wrote:
>> > >>> >> >

[ceph-users] mds will not activate

2018-10-06 Thread Jeff Smith
I had to reboot my mds.  The hot spare did not kick in and now I am
showing the filesystem is degraded and offline.  Both mds are showing
as up:standby.  I am not sure how to proceed.

  cluster:
    id:     188c7fba-288f-45e9-bca1-cc5fceccd2a1
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            646909/1843113 objects misplaced (35.099%)

  services:
    mon: 1 daemons, quorum mon.b
    mgr: copious(active)
    mds: bulkfs-0/1/1 up , 2 up:standby, 1 damaged
    osd: 8 osds: 8 up, 8 in; 47 remapped pgs

  data:
    pools:   2 pools, 94 pgs
    objects: 614.4 k objects, 2.3 TiB
    usage:   7.1 TiB used, 12 TiB / 19 TiB avail
    pgs:     646909/1843113 objects misplaced (35.099%)
             47 active+clean
             44 active+remapped+backfill_wait
             3  active+remapped+backfilling

  io:
    recovery: 37 MiB/s, 9 objects/s
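
The figures in the status output above can be cross-checked with some quick
arithmetic. The sketch below uses only the numbers shown. The idea that the
denominator counts object copies (3 per object) is an inference from
1843113 = 3 x 614371, not something the output states, and the helper name
misplaced_percent is made up for illustration.

```python
# Numbers copied from the status output above.
MISPLACED = 646909
TOTAL = 1843113
PG_STATES = {
    "active+clean": 47,
    "active+remapped+backfill_wait": 44,
    "active+remapped+backfilling": 3,
}


def misplaced_percent(misplaced, total):
    """Return the misplaced share as a percentage."""
    return 100.0 * misplaced / total


# Matches the 35.099% shown in the health and data sections.
print(f"{misplaced_percent(MISPLACED, TOTAL):.3f}%")

# The three PG states add up to the 94 pgs reported under "pools".
print(sum(PG_STATES.values()))

# Assumption: if every object had 3 copies, the object count would be
# TOTAL / 3, which is close to the "614.4 k objects" line.
print(TOTAL // 3, TOTAL % 3)
```

Note that none of this says anything about the damaged MDS rank; it only
shows that the backfill numbers are internally consistent.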


Re: [ceph-users] list admin issues

2018-10-06 Thread Jeff Smith
I have been removed twice.
On Sat, Oct 6, 2018 at 7:07 AM Elias Abacioglu
 wrote:
>
> Hi,
>
> I'm bumping this old thread because it's getting annoying. My membership 
> gets disabled twice a month.
> Between my two Gmail accounts I'm on more than 25 mailing lists and I see 
> this behavior only here. Why is only ceph-users affected? Maybe 
> Christian was on to something; is this intentional?
> The reality is that there are a lot of ceph-users subscribers with Gmail 
> accounts, so perhaps it wouldn't be so bad to actually try to figure this 
> one out?
>
> So can the maintainers of this list please investigate what actually gets 
> bounced? Look at my address if you want.
> I got disabled 20181006, 20180927, 20180916, 20180725, 20180718 most recently.
> Please help!
>
> Thanks,
> Elias
>
> On Mon, Oct 16, 2017 at 5:41 AM Christian Balzer  wrote:
>>
>>
>> Most mails to this ML score low or negatively with SpamAssassin, however
>> once in a while (this is a recent one) we get relatively high scores.
>> Note that the forged bits are false positives, but SA is up to date and
>> Google will have similar checks:
>> ---
>> X-Spam-Status: No, score=3.9 required=10.0 tests=BAYES_00,DCC_CHECK,
>>  FORGED_MUA_MOZILLA,FORGED_YAHOO_RCVD,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM,
>>  HEADER_FROM_DIFFERENT_DOMAINS,HTML_MESSAGE,MIME_HTML_MOSTLY,RCVD_IN_MSPIKE_H4,
>>  RCVD_IN_MSPIKE_WL,RDNS_NONE,T_DKIM_INVALID shortcircuit=no autolearn=no
>> ---
>>
>> Between attachment mails and some of these, you're well on your way out.
>>
>> The default mailman settings and logic require 5 bounces to trigger
>> unsubscription and 7 days of NO bounces to reset the counter.
>>
>> Christian
>>
>> On Mon, 16 Oct 2017 12:23:25 +0900 Christian Balzer wrote:
>>
>> > On Mon, 16 Oct 2017 14:15:22 +1100 Blair Bethwaite wrote:
>> >
>> > > Thanks Christian,
>> > >
>> > > You're no doubt on the right track, but I'd really like to figure out
>> > > what it is at my end - I'm unlikely to be the only person subscribed
>> > > to ceph-users via a gmail account.
>> > >
>> > > Re. attachments, I'm surprised mailman would be allowing them in the
>> > > first place, and even so gmail's attachment requirements are less
>> > > strict than most corporate email setups (those that don't already use
>> > > a cloud provider).
>> > >
>> > Mailman doesn't do anything with this by default AFAIK, but see below.
>> > Strict is fine if you're in control, corporate mail can be hell, doubly so
>> > if on M$ cloud.
>> >
>> > > This started happening earlier in the year after I turned off digest
>> > > mode. I also have a paid google domain, maybe I'll try setting
>> > > delivery to that address and seeing if anything changes...
>> > >
>> > Don't think google domain is handled differently, but what do I know.
>> >
>> > Though the digest bit confirms my suspicion about attachments:
>> > ---
>> > When a subscriber chooses to receive plain text daily “digests” of list
>> > messages, Mailman sends the digest messages without any original
>> > attachments (in Mailman lingo, it “scrubs” the messages of attachments).
>> > However, Mailman also includes links to the original attachments that the
>> > recipient can click on.
>> > ---
>> >
>> > Christian
>> >
>> > > Cheers,
>> > >
>> > > On 16 October 2017 at 13:54, Christian Balzer  wrote:
>> > > >
>> > > > Hello,
>> > > >
>> > > > You're on gmail.
>> > > >
>> > > > Aside from various potential false positives with regards to spam my 
>> > > > bet
>> > > > is that gmail's known dislike for attachments is the cause of these
>> > > > bounces and that setting is beyond your control.
>> > > >
>> > > > Because Google knows best[tm].
>> > > >
>> > > > Christian
>> > > >
>> > > > On Mon, 16 Oct 2017 13:50:43 +1100 Blair Bethwaite wrote:
>> > > >
>> > > >> Hi all,
>> > > >>
>> > > >> This is a mailing-list admin issue - I keep being unsubscribed from
>> > > >> ceph-users with the message:
>> > > >> "Your membership in the mailing list ceph-users has been disabled due
>> > > >> to excessive bounces..."
>> > > >> This seems to be happening on roughly a monthly basis.
>> > > >>
>> > > >> Thing is I have no idea what the bounce is or where it is coming from.
>> > > >> I've tried emailing ceph-users-ow...@lists.ceph.com and the contact
>> > > >> listed in Mailman (l...@redhat.com) to get more info but haven't
>> > > >> received any response despite several attempts.
>> > > >>
>> > > >> Help!
>> > > >>
>> > > >
>> > > >
>> > > > --
>> > > > Christian Balzer        Network/Systems Engineer
>> > > > ch...@gol.com           Rakuten Communications
>> > >
>> > >
>> > >
>> >
>> >
>>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> ch...@gol.com           Rakuten Communications
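
The bounce rule Christian describes above (5 bounces trigger unsubscription,
7 days with no bounces reset the counter) can be modelled in a few lines of
Python. This is a simplified sketch of the rule as stated in that mail, not
Mailman's actual code. The function name disabled_on, the one-point-per-bounce
scoring, and the choice to reset only when the gap is strictly longer than
7 days are assumptions made for illustration.

```python
from datetime import date, timedelta

BOUNCE_THRESHOLD = 5               # bounces needed to disable delivery (from the mail)
STALE_AFTER = timedelta(days=7)    # quiet period that resets the counter (from the mail)


def disabled_on(bounce_dates):
    """Return the date on which the threshold is reached, or None.

    Assumption: every bounce counts as one point, and the counter resets
    when more than STALE_AFTER has passed since the previous bounce.
    """
    score = 0
    last = None
    for day in sorted(bounce_dates):
        if last is not None and day - last > STALE_AFTER:
            score = 0  # the old bounces are considered stale
        score += 1
        last = day
        if score >= BOUNCE_THRESHOLD:
            return day
    return None


# Five bounces on consecutive days reach the threshold on the fifth day.
print(disabled_on([date(2018, 10, d) for d in (1, 2, 3, 4, 5)]))

# Four bounces, then a long quiet period: the counter starts over.
print(disabled_on([date(2018, 10, d) for d in (1, 2, 3, 4, 20)]))
```

Under this model a subscriber who bounces only a handful of messages, spread
less than a week apart, would still be disabled eventually, which fits the
roughly monthly removals reported in this thread.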

[ceph-users] interpreting ceph mds stat

2018-10-03 Thread Jeff Smith
I need some help deciphering the results of ceph mds stat.  I have
been digging in the docs for hours.  I would appreciate it if someone
could point me in the right direction or help me understand.

The documentation shows a result like this:

cephfs-1/1/1 up {0=a=up:active}

What does each of the 1s represent?  What is the 0=a=up:active?  Is
that saying rank 0 of file system a is up:active?

Jeff Smith
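
One possible reading of that line, offered as an assumption and not confirmed
anywhere in this thread, is <fs name>-<up>/<in>/<max_mds>, followed by a
{rank=daemon name=state} map. Under that reading, "a" would be the name of
the MDS daemon holding rank 0, not a file system. A small Python sketch that
splits the line along those lines (parse_mds_stat and the field labels are
hypothetical names, not part of any Ceph tool):

```python
import re

# Assumed layout: <fs name>-<up>/<in>/<max_mds> up {rank=name=state,...}
SUMMARY_RE = re.compile(
    r"^(?P<fs>.+)-(?P<up>\d+)/(?P<in_>\d+)/(?P<max_mds>\d+) up\s*"
    r"(?:\{(?P<ranks>[^}]*)\})?"
)


def parse_mds_stat(line):
    """Split an mds stat summary line into labelled fields."""
    m = SUMMARY_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised mds stat line: {line!r}")
    ranks = {}
    # The braces may be missing entirely when no rank is held by a daemon.
    for entry in filter(None, (m.group("ranks") or "").split(",")):
        rank, name, state = entry.strip().split("=", 2)
        ranks[int(rank)] = (name, state)
    return {
        "fs_name": m.group("fs"),
        "up": int(m.group("up")),
        "in": int(m.group("in_")),
        "max_mds": int(m.group("max_mds")),
        "ranks": ranks,
    }


print(parse_mds_stat("cephfs-1/1/1 up {0=a=up:active}"))
```

If the reading is right, the example from the docs would mean: file system
cephfs, 1 rank up, 1 rank in, max_mds of 1, with rank 0 held by daemon a in
state up:active.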