Re: [ceph-users] Erasure code profiles and crush rules. Missing link...?

2019-05-22 Thread Rainer Krienke
Hello,

thanks for the hint. I opened a ticket with a feature request to include
the ec-profile information in the output of ceph osd pool ls detail.

http://tracker.ceph.com/issues/40009

Rainer

On 22.05.19 at 17:04, Jan Fajerski wrote:
> On Wed, May 22, 2019 at 03:38:27PM +0200, Rainer Krienke wrote:
>> On 22.05.19 at 15:16, Dan van der Ster wrote:
>>
>> Yes, this is basically what I was looking for; however, I had expected it
>> to be a little more visible in the output...
> Mind opening a tracker ticket on http://tracker.ceph.com/ so we can have
> this added to the non-json output of ceph osd pool ls detail?
>>

-- 
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
Web: http://userpages.uni-koblenz.de/~krienke
PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] assume_role() :http_code 405 error

2019-05-22 Thread Pritha Srivastava
On Thu, May 23, 2019 at 9:24 AM Yuan Minghui  wrote:

> Hello,
>
>  The version I am using is Ceph Luminous 12.2.4. Which versions of Ceph
> support AssumeRole or STS?
>
>
>
STS is available in Nautilus (v14.2.0), and versions after that.
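For reference, a rough sketch of what STS needs in ceph.conf on the Nautilus RGW
side (the section name and key value below are placeholders, not taken from this
thread):

[client.rgw.gateway-node1]
rgw sts key = abcdefghijklmnop     # 16-character key used to encrypt the session token
rgw s3 auth use sts = true         # accept S3 requests signed with STS credentials

The role referenced in the request (e.g. arn:aws:iam:::role/S3Access1) would also
need to exist first, created with something like
`radosgw-admin role create --role-name=S3Access1 ...`.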

Thanks,
Pritha

> Thanks a lot.
>
> kyle
>
>
>
> *From:* Pritha Srivastava 
> *Date:* Thursday, May 23, 2019, 11:49 AM
> *To:* Yuan Minghui 
> *Cc:* "ceph-users@lists.ceph.com" 
> *Subject:* Re: [ceph-users] assume_role() :http_code 405 error
>
>
>
> Hello,
>
> It looks like the version that you are trying this on doesn't support
> AssumeRole or STS. What version of Ceph are you using?
>
> Thanks,
>
> Pritha
>
>
>
> On Thu, May 23, 2019 at 9:10 AM Yuan Minghui 
> wrote:
>
> Hello,
>
> When I try to create a secure temporary session (STS), I run the following
> actions:
>
> s3 = session.client('sts',
>                     aws_access_key_id=tomAccessKey,
>                     aws_secret_access_key=tomSecretKey,
>                     endpoint_url=host
>                     )  # returns a low-level client instance
>
> response = s3.assume_role(RoleArn='arn:aws:iam:::role/S3Access1',
>                           RoleSessionName='test_session1',
>                           )
>
>
>
> however, it returns the following error:
>
> [image attachment: error screenshot]
>
> Can someone help with this problem?
>
>
>
> Thanks  a lot.
>
> Kyle
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] assume_role() :http_code 405 error

2019-05-22 Thread Yuan Minghui
Hello,

  The version I am using is Ceph Luminous 12.2.4. Which versions of Ceph
support AssumeRole or STS?

 

Thanks a lot.

kyle

 

From: Pritha Srivastava 
Date: Thursday, May 23, 2019, 11:49 AM
To: Yuan Minghui 
Cc: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] assume_role() :http_code 405 error

 

Hello,

It looks like the version that you are trying this on doesn't support
AssumeRole or STS. What version of Ceph are you using?

Thanks,

Pritha

 

On Thu, May 23, 2019 at 9:10 AM Yuan Minghui  wrote:

Hello,

   When I try to create a secure temporary session (STS), I run the following
actions:

s3 = session.client('sts',
                    aws_access_key_id=tomAccessKey,
                    aws_secret_access_key=tomSecretKey,
                    endpoint_url=host
                    )  # returns a low-level client instance

response = s3.assume_role(RoleArn='arn:aws:iam:::role/S3Access1',
                          RoleSessionName='test_session1',
                          )

 

however, it returns an error:

Can someone help with this problem?

 

Thanks  a lot.

Kyle

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] assume_role() :http_code 405 error

2019-05-22 Thread Pritha Srivastava
Hello,

It looks like the version that you are trying this on doesn't support
AssumeRole or STS. What version of Ceph are you using?

Thanks,
Pritha

On Thu, May 23, 2019 at 9:10 AM Yuan Minghui  wrote:

> Hello,
>
> When I try to create a secure temporary session (STS), I run the following
> actions:
>
> s3 = session.client('sts',
>                     aws_access_key_id=tomAccessKey,
>                     aws_secret_access_key=tomSecretKey,
>                     endpoint_url=host
>                     )  # returns a low-level client instance
>
> response = s3.assume_role(RoleArn='arn:aws:iam:::role/S3Access1',
>                           RoleSessionName='test_session1',
>                           )
>
>
>
> however, it returns an error:
>
> Can someone help with this problem?
>
>
>
> Thanks  a lot.
>
> Kyle
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] assume_role() :http_code 405 error

2019-05-22 Thread Yuan Minghui
Hello,

   When I try to create a secure temporary session (STS), I run the following
actions:

s3 = session.client('sts',
                    aws_access_key_id=tomAccessKey,
                    aws_secret_access_key=tomSecretKey,
                    endpoint_url=host
                    )  # returns a low-level client instance

response = s3.assume_role(RoleArn='arn:aws:iam:::role/S3Access1',
                          RoleSessionName='test_session1',
                          )

 

however, it returns an error:

Can someone help with this problem?

 

Thanks  a lot.

Kyle

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Massive TCP connection on radosgw

2019-05-22 Thread Li Wang
Thank you for your reply. We will run the script and let you know the results
once the number of TCP connections rises again. We just restarted the server
several days ago.

Sent from my iPhone

> On 23 May 2019, at 12:26 AM, Igor Podlesny  wrote:
> 
>> On Wed, 22 May 2019 at 20:32, Torben Hørup  wrote:
>> 
>> Which states are all these connections in ?
>> 
>> ss -tn
> 
> That set of args won't display anything but ESTABLISHED connections.
> 
> One typically needs `-atn` instead.
> 
> -- 
> End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Massive TCP connection on radosgw

2019-05-22 Thread Igor Podlesny
On Wed, 22 May 2019 at 20:32, Torben Hørup  wrote:
>
> Which states are all these connections in ?
>
> ss -tn

That set of args won't display anything but ESTABLISHED connections.

One typically needs `-atn` instead.
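Combined with Torben's one-liner, something like this should summarize
connections in all states (the NR>1 just skips the header line):

$ ss -atn | awk 'NR>1 {print $1}' | sort | uniq -c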

-- 
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-22 Thread Robert LeBlanc
On Wed, May 22, 2019 at 4:31 AM Kevin Flöh  wrote:

> Hi,
>
> thank you, it worked. The PGs are not incomplete anymore. Still we have
> another problem, there are 7 PGs inconsistent and a cpeh pg repair is
> not doing anything. I just get "instructing pg 1.5dd on osd.24 to
> repair" and nothing happens. Does somebody know how we can get the PGs
> to repair?
>
> Regards,
>
> Kevin
>

Kevin,

I just fixed an inconsistent PG yesterday. You will need to figure out why
they are inconsistent. Do these steps (rough example commands are shown below
the list) and then we can figure out how to proceed.
1. Do a deep-scrub on each PG that is inconsistent. (This may fix some of
them)
2. Print out the inconsistent report for each inconsistent PG. `rados
list-inconsistent-obj  --format=json-pretty`
3. You will want to look at the error messages and see if all the shards
have the same data.
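
Roughly, the commands for those steps look like this (using pg 1.5dd from your
message as the example; substitute each inconsistent PG):

$ ceph pg deep-scrub 1.5dd
$ rados list-inconsistent-obj 1.5dd --format=json-pretty
$ ceph pg repair 1.5dd    # once the cause is understood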

Robert LeBlanc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS object mapping.

2019-05-22 Thread Robert LeBlanc
On Wed, May 22, 2019 at 12:22 AM Burkhard Linke <
burkhard.li...@computational.bio.uni-giessen.de> wrote:

> Hi,
>
> On 5/21/19 9:46 PM, Robert LeBlanc wrote:
> > I'm at a new job working with Ceph again and am excited to be back in the
> > community!
> >
> > I can't find any documentation to support this, so please help me
> > understand if I got this right.
> >
> > I've got a Jewel cluster with CephFS and we have an inconsistent PG.
> > All copies of the object are zero size, but the digest says that it
> > should be a non-zero size, so it seems that my two options are, delete
> > the file that the object is part of, or rewrite the object with RADOS
> > to update the digest. So, this leads to my question, how to I tell
> > which file the object belongs to.
> >
> > From what I found, the object is prefixed with the hex value of the
> > inode and suffixed by the stripe number:
> > 1000d2ba15c.0005
> > .
> >
> > I then ran `find . -xdev -inum 1099732590940` and found a file on the
> > CephFS file system. I just want to make sure that I found the right
> > file before I start trying recovery options.
> >
>
> The first stripe XYZ. has some metadata stored as xattr (rados
> xattr, not cephfs xattr). One of the entries has the key 'parent':
>

When you say 'some', is there a fixed offset at which the file data starts? Is
the first stripe just metadata?


> # ls Ubuntu16.04-WS2016-17.ova
> Ubuntu16.04-WS2016-17.ova
>
> # ls -i Ubuntu16.04-WS2016-17.ova
> 1099751898435 Ubuntu16.04-WS2016-17.ova
>
> # rados -p cephfs_test_data stat 1000e523d43.
> cephfs_test_data/1000e523d43. mtime 2016-10-13 16:20:10.00,
> size 4194304
>
> # rados -p cephfs_test_data listxattr 1000e523d43.
> layout
> parent
>
> # rados -p cephfs_test_data getxattr 1000e523d43. parent | strings
> Ubuntu16.04-WS2016-17.ova5:
> adm2
> volumes
>
>
> The complete path of the file is
> /volumes/adm/Ubuntu16.04-WS2016-17.ova5. For a complete check you can
> store the content of the parent key and use ceph-dencoder to print its
> content:
>
> # rados -p cephfs_test_data getxattr 1000e523d43. parent >
> parent.bin
>
> # ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json
> {
>  "ino": 1099751898435,
>  "ancestors": [
>  {
>  "dirino": 1099527190071,
>  "dname": "Ubuntu16.04-WS2016-17.ova",
>  "version": 14901
>  },
>  {
>  "dirino": 1099521974514,
>  "dname": "adm",
>  "version": 61190706
>  },
>  {
>  "dirino": 1,
>  "dname": "volumes",
>  "version": 48394885
>  }
>  ],
>  "pool": 7,
>  "old_pools": []
> }
>
>
> One important thing to note: ls -i prints the inode id in decimal,
> cephfs uses hexadecimal for the rados object names. Thus the different
> value in the above commands.
>

Thank you for this; it is much faster than doing a find for the inode (that
took many hours; I let it run overnight and it found the file at some point.
It took about 21 hours to search the whole filesystem.)


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW metadata pool migration

2019-05-22 Thread Nikhil Mitra (nikmitra)
Hi All,

Which metadata pools in an RGW deployment need to sit on the fastest medium to
improve the client experience from an access standpoint?
Also, is there an easy way to migrate these pools in a production scenario with
minimal to no outage, if possible?

Regards,
Nikhil

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS msg length greater than osd_max_write_size

2019-05-22 Thread Ryan Leimenstoll
Thanks for the reply! We will be more proactive about evicting clients in the 
future rather than waiting.


One follow-up, however: it seems that the filesystem going read-only was only a
WARNING state, which didn't immediately catch our eye due to some other
rebalancing operations. Is there a reason that this wouldn't be a HEALTH_ERR
condition, since it represents a significant service degradation?


Thanks!
Ryan


> On May 22, 2019, at 4:20 AM, Yan, Zheng  wrote:
> 
> On Tue, May 21, 2019 at 6:10 AM Ryan Leimenstoll
>  wrote:
>> 
>> Hi all,
>> 
>> We recently encountered an issue where our CephFS filesystem unexpectedly 
>> was set to read-only. When we look at some of the logs from the daemons I 
>> can see the following:
>> 
>> On the MDS:
>> ...
>> 2019-05-18 16:34:24.341 7fb3bd610700 -1 mds.0.89098 unhandled write error 
>> (90) Message too long, force readonly...
>> 2019-05-18 16:34:24.341 7fb3bd610700  1 mds.0.cache force file system 
>> read-only
>> 2019-05-18 16:34:24.341 7fb3bd610700  0 log_channel(cluster) log [WRN] : 
>> force file system read-only
>> 2019-05-18 16:34:41.289 7fb3c0616700  1 heartbeat_map is_healthy 'MDSRank' 
>> had timed out after 15
>> 2019-05-18 16:34:41.289 7fb3c0616700  0 mds.beacon.objmds00 Skipping beacon 
>> heartbeat to monitors (last acked 4.00101s ago); MDS internal heartbeat is 
>> not healthy!
>> ...
>> 
>> On one of the OSDs it was most likely targeting:
>> ...
>> 2019-05-18 16:34:24.140 7f8134e6c700 -1 osd.602 pg_epoch: 682796 pg[49.20b( 
>> v 682796'15706523 (682693'15703449,682796'15706523] 
>> local-lis/les=673041/673042 n=10524 ec=245563/245563 lis/c 673041/673041 
>> les/c/f 673042/673042/0 673038/673041/668565) [602,530,558] r=0 lpr=673041 
>> crt=682796'15706523 lcod 682796'15706522 mlcod 682796'15706522 active+clean] 
>> do_op msg data len 95146005 > osd_max_write_size 94371840 on 
>> osd_op(mds.0.89098:48609421 49.20b 49:d0630e4c:::mds0_sessionmap:head 
>> [omap-set-header,omap-set-vals] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e682796) v8
>> 2019-05-18 17:10:33.695 7f813466b700  0 log_channel(cluster) log [DBG] : 
>> 49.31c scrub starts
>> 2019-05-18 17:10:34.980 7f813466b700  0 log_channel(cluster) log [DBG] : 
>> 49.31c scrub ok
>> 2019-05-18 22:17:37.320 7f8134e6c700 -1 osd.602 pg_epoch: 683434 pg[49.20b( 
>> v 682861'15706526 (682693'15703449,682861'15706526] 
>> local-lis/les=673041/673042 n=10525 ec=245563/245563 lis/c 673041/673041 
>> les/c/f 673042/673042/0 673038/673041/668565) [602,530,558] r=0 lpr=673041 
>> crt=682861'15706526 lcod 682859'15706525 mlcod 682859'15706525 active+clean] 
>> do_op msg data len 95903764 > osd_max_write_size 94371840 on 
>> osd_op(mds.0.91565:357877 49.20b 49:d0630e4c:::mds0_sessionmap:head 
>> [omap-set-header,omap-set-vals,omap-rm-keys] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e683434) v8
>> …
>> 
>> During this time there were some health concerns with the cluster. 
>> Significantly, since the error above seems to be related to the SessionMap, 
>> we had a client that had a few blocked requests for over 35948 secs (it’s a 
>> member of a compute cluster so we let the node drain/finish jobs before 
>> rebooting). We have also had some issues with certain OSDs running older 
>> hardware staying up/responding timely to heartbeats after upgrading to 
>> Nautilus, although that seems to be an iowait/load issue that we are 
>> actively working to mitigate separately.
>> 
> 
> This prevents the MDS from trimming completed requests recorded in the
> session, which results in a very large session item. To recover, blacklist
> the client that has the blocked request, then restart the MDS.
> 
>> We are running Nautilus 14.2.1 on RHEL7.6. There is only one MDS Rank, with 
>> an active/standby setup between two MDS nodes. MDS clients are mounted using 
>> the RHEL7.6 kernel driver.
>> 
>> My read here would be that the MDS is sending too large a message to the 
>> OSD, however my understanding was that the MDS should be using 
>> osd_max_write_size to determine the size of that message [0]. Is this maybe 
>> a bug in how this is calculated on the MDS side?
>> 
>> 
>> Thanks!
>> Ryan Leimenstoll
>> rleim...@umiacs.umd.edu
>> University of Maryland Institute for Advanced Computer Studies
>> 
>> 
>> 
>> [0] https://www.spinics.net/lists/ceph-devel/msg11951.html
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code profiles and crush rules. Missing link...?

2019-05-22 Thread Jan Fajerski

On Wed, May 22, 2019 at 03:38:27PM +0200, Rainer Krienke wrote:

On 22.05.19 at 15:16, Dan van der Ster wrote:

Yes, this is basically what I was looking for; however, I had expected it to
be a little more visible in the output...
Mind opening a tracker ticket on http://tracker.ceph.com/ so we can have this 
added to the non-json output of ceph osd pool ls detail?


Rainer


Is this what you're looking for?

# ceph osd pool ls detail  -f json | jq .[0].erasure_code_profile
"jera_4plus2"

-- Dan


--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
Web: http://userpages.uni-koblenz.de/~krienke
PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Jan Fajerski
Engineer Enterprise Storage
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code profiles and crush rules. Missing link...?

2019-05-22 Thread Rainer Krienke
On 22.05.19 at 15:16, Dan van der Ster wrote:

Yes, this is basically what I was looking for; however, I had expected it to
be a little more visible in the output...

Rainer
> 
> Is this what you're looking for?
> 
> # ceph osd pool ls detail  -f json | jq .[0].erasure_code_profile
> "jera_4plus2"
> 
> -- Dan

-- 
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
Web: http://userpages.uni-koblenz.de/~krienke
PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Massive TCP connection on radosgw

2019-05-22 Thread Torben Hørup
Which states are all these connections in?

ss -tn | awk '{print $1}' | sort | uniq -c 

/Torben 

On 22.05.2019 15:19, Li Wang wrote:

> Hi guys,  
> 
> Any help here?
> 
> Sent from my iPhone 
> 
> On 20 May 2019, at 2:48 PM, John Hearns  wrote:
> 
> I found similar behaviour on a Nautilus cluster on Friday. Around 300 000 
> open connections which I think were the result of a benchmarking run which 
> was terminated. I restarted the radosgw service to get rid of them. 
> 
> On Mon, 20 May 2019 at 06:56, Li Wang  wrote: Dear ceph 
> community members,
> 
> We have a ceph cluster (mimic 13.2.4) with 7 nodes and 130+ OSDs. However, we 
> observed over 70 million active TCP connections on the radosgw host, which 
> makes the radosgw very unstable. 
> 
> After further investigation, we found most of the TCP connections on the 
> radosgw are connected to OSDs.
> 
> May I ask what might be the possible reason causing the massive number of 
> TCP connections? And is there any configuration or tuning work that I can 
> do to solve this issue?
> 
> Any suggestion is highly appreciated.
> 
> Regards,
> Li Wang
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Massive TCP connection on radosgw

2019-05-22 Thread Li Wang
Hi guys, 

Any help here?

Sent from my iPhone

> On 20 May 2019, at 2:48 PM, John Hearns  wrote:
> 
> I found similar behaviour on a Nautilus cluster on Friday. Around 300 000 
> open connections which I think were the result of a benchmarking run which 
> was terminated. I restarted the radosgw service to get rid of them.
> 
>> On Mon, 20 May 2019 at 06:56, Li Wang  wrote:
>> Dear ceph community members,
>> 
>> We have a ceph cluster (mimic 13.2.4) with 7 nodes and 130+ OSDs. However, 
>> we observed over 70 million active TCP connections on the radosgw host, 
>> which makes the radosgw very unstable. 
>> 
>> After further investigation, we found most of the TCP connections on the 
>> radosgw are connected to OSDs.
>> 
>> May I ask what might be the possible reason causing the massive number 
>> of TCP connections? And is there any configuration or tuning work that I 
>> can do to solve this issue?
>> 
>> Any suggestion is highly appreciated.
>> 
>> Regards,
>> Li Wang
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code profiles and crush rules. Missing link...?

2019-05-22 Thread Dan van der Ster
On Wed, May 22, 2019 at 3:03 PM Rainer Krienke  wrote:
>
> Hello,
>
> I created an erasure code profile named ecprofile-42 with the following
> parameters:
>
> $ ceph osd erasure-code-profile set ecprofile-42 plugin=jerasure k=4 m=2
>
> Next I created a new pool using the ec profile from above:
>
> $ ceph osd pool create my_erasure_pool 64 64  erasure ecprofile-42
>
> The pool created then has an autogenerated crush rule with the contents
> as shown at the end of this mail (see: ceph osd crush rule dump
> my_erasure_pool).
>
> What I am missing in the output of the crush rule dump below are the k,m
> values used for this pool, or a "link" from the crush rule to the erasure
> code profile that contains these settings and was used when creating the pool
> and thus the EC crush rule. If I had several EC profiles and pools
> created with different EC profiles, how else could I see which k,m
> values were used for the different pools?
>
> For a replicated crush rule there is the size parameter which is part of
> the crush-rule and indirectly tells you the number of replicas, but what
> about erasure coded pools?

Is this what you're looking for?

# ceph osd pool ls detail  -f json | jq .[0].erasure_code_profile
"jera_4plus2"

-- Dan


>
> Probably the link I am looking for exists somewhere, but I didn't find
> it yet...
>
> Thanks Rainer
>
> #
> # autogenerated crush rule my_erasure_pool:
> #
> $ ceph osd crush rule dump my_erasure_pool
> {
> "rule_id": 1,
> "rule_name": "my_erasure_pool",
> "ruleset": 1,
> "type": 3,
> "min_size": 3,
> "max_size": 6,
> "steps": [
> {
> "op": "set_chooseleaf_tries",
> "num": 5
> },
> {
> "op": "set_choose_tries",
> "num": 100
> },
> {
> "op": "take",
> "item": -1,
> "item_name": "default"
> },
> {
> "op": "chooseleaf_indep",
> "num": 0,
> "type": "host"
> },
> {
> "op": "emit"
> }
> ]
> }
>
> --
> Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
> 56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
> Web: http://userpages.uni-koblenz.de/~krienke
> PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code profiles and crush rules. Missing link...?

2019-05-22 Thread Paul Emmerich
CRUSH only specifies where the chunks are placed, not how many chunks there
are (the pool specifies this).
This is the same as with replicated rules: the pool specifies the number of
replicas, and the rule specifies where they are put.

You can use one CRUSH rule for multiple ec pools
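
So to see the k/m for a given pool you go pool -> profile -> profile contents,
roughly like this (pool and profile names taken from the example earlier in the
thread; output abbreviated):

$ ceph osd pool get my_erasure_pool erasure_code_profile
erasure_code_profile: ecprofile-42
$ ceph osd erasure-code-profile get ecprofile-42
k=4
m=2
plugin=jerasure
...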


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, May 22, 2019 at 3:03 PM Rainer Krienke 
wrote:

> Hello,
>
> I created an erasure code profile named ecprofile-42 with the following
> parameters:
>
> $ ceph osd erasure-code-profile set ecprofile-42 plugin=jerasure k=4 m=2
>
> Next I created a new pool using the ec profile from above:
>
> $ ceph osd pool create my_erasure_pool 64 64  erasure ecprofile-42
>
> The pool created then has an autogenerated crush rule with the contents
> as shown at the end of this mail (see: ceph osd crush rule dump
> my_erasure_pool).
>
> What I am missing in the output of the crush rule dump below are the k,m
> values used for this pool, or a "link" from the crush rule to the erasure
> code profile that contains these settings and was used when creating the pool
> and thus the EC crush rule. If I had several EC profiles and pools
> created with different EC profiles, how else could I see which k,m
> values were used for the different pools?
>
> For a replicated crush rule there is the size parameter which is part of
> the crush-rule and indirectly tells you the number of replicas, but what
> about erasure coded pools?
>
> Probably the link I am looking for exists somewhere, but I didn't find
> it yet...
>
> Thanks Rainer
>
> #
> # autogenerated crush rule my_erasure_pool:
> #
> $ ceph osd crush rule dump my_erasure_pool
> {
> "rule_id": 1,
> "rule_name": "my_erasure_pool",
> "ruleset": 1,
> "type": 3,
> "min_size": 3,
> "max_size": 6,
> "steps": [
> {
> "op": "set_chooseleaf_tries",
> "num": 5
> },
> {
> "op": "set_choose_tries",
> "num": 100
> },
> {
> "op": "take",
> "item": -1,
> "item_name": "default"
> },
> {
> "op": "chooseleaf_indep",
> "num": 0,
> "type": "host"
> },
> {
> "op": "emit"
> }
> ]
> }
>
> --
> Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
> 56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
> Web: http://userpages.uni-koblenz.de/~krienke
> PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Erasure code profiles and crush rules. Missing link...?

2019-05-22 Thread Rainer Krienke
Hello,

I created an erasure code profile named ecprofile-42 with the following
parameters:

$ ceph osd erasure-code-profile set ecprofile-42 plugin=jerasure k=4 m=2

Next I created a new pool using the ec profile from above:

$ ceph osd pool create my_erasure_pool 64 64  erasure ecprofile-42

The pool created then has an autogenerated crush rule with the contents
as shown at the end of this mail (see: ceph osd crush rule dump
my_erasure_pool).

What I am missing in the output of the crush rule dump below are the k,m
values used for this pool, or a "link" from the crush rule to the erasure
code profile that contains these settings and was used when creating the pool
and thus the EC crush rule. If I had several EC profiles and pools
created with different EC profiles, how else could I see which k,m
values were used for the different pools?

For a replicated crush rule there is the size parameter which is part of
the crush-rule and indirectly tells you the number of replicas, but what
about erasure coded pools?

Probably the link I am looking for exists somewhere, but I didn't find
it yet...

Thanks Rainer

#
# autogenerated crush rule my_erasure_pool:
#
$ ceph osd crush rule dump my_erasure_pool
{
"rule_id": 1,
"rule_name": "my_erasure_pool",
"ruleset": 1,
"type": 3,
"min_size": 3,
"max_size": 6,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 5
},
{
"op": "set_choose_tries",
"num": 100
},
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}

-- 
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
Web: http://userpages.uni-koblenz.de/~krienke
PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-22 Thread John Petrini
It's been suggested here in the past to disable deep scrubbing temporarily
before running the repair because it does not execute immediately but gets
queued up behind deep scrubs.
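
In practice that would look roughly like this (using the PG mentioned earlier
in the thread as an example):

$ ceph osd set nodeep-scrub
$ ceph pg repair 1.5dd          # repeat for each inconsistent PG
$ ceph osd unset nodeep-scrub   # re-enable deep scrubs once repairs have run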
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-22 Thread Kevin Flöh

Hi,

thank you, it worked. The PGs are not incomplete anymore. Still, we have
another problem: there are 7 PGs inconsistent and a ceph pg repair is
not doing anything. I just get "instructing pg 1.5dd on osd.24 to 
repair" and nothing happens. Does somebody know how we can get the PGs 
to repair?


Regards,

Kevin

On 21.05.19 at 4:52 PM, Wido den Hollander wrote:


On 5/21/19 4:48 PM, Kevin Flöh wrote:

Hi,

we gave up on the incomplete pgs since we do not have enough complete
shards to restore them. What is the procedure to get rid of these pgs?


You need to start with marking the OSDs as 'lost' and then you can
force_create_pg to get the PGs back (empty).

Wido


regards,

Kevin

On 20.05.19 at 9:22 AM, Kevin Flöh wrote:

Hi Frederic,

we do not have access to the original OSDs. We exported the remaining
shards of the two pgs but we are only left with two shards (of
reasonable size) per pg. The rest of the shards displayed by ceph pg
query are empty. I guess marking the OSD as complete doesn't make
sense then.

Best,
Kevin

On 17.05.19 at 2:36 PM, Frédéric Nass wrote:


On 14/05/2019 at 10:04, Kevin Flöh wrote:

On 13.05.19 at 11:21 PM, Dan van der Ster wrote:

Presumably the 2 OSDs you marked as lost were hosting those
incomplete PGs?
It would be useful to double confirm that: check with `ceph pg 
query` and `ceph pg dump`.
(If so, this is why the ignore history les thing isn't helping; you
don't have the minimum 3 stripes up for those 3+1 PGs.)

yes, but as written in my other mail, we still have enough shards,
at least I think so.


If those "lost" OSDs by some miracle still have the PG data, you might
be able to export the relevant PG stripes with the
ceph-objectstore-tool. I've never tried this myself, but there have
been threads in the past where people export a PG from a nearly dead
hdd, import to another OSD, then backfilling works.

guess that is not possible.

Hi Kevin,

You want to make sure of this.

Unless you recreated the OSDs 4 and 23 and had new data written on
them, they should still host the data you need.
What Dan suggested (export the 7 inconsistent PGs and import them on
a healthy OSD) seems to be the only way to recover your lost data, as
with 4 hosts and 2 OSDs lost, you're left with 2 chunks of
data/parity when you actually need 3 to access it. Reducing min_size
to 3 will not help.

Have a look here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019673.html

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023736.html


This is probably the best way you want to follow form now on.

Regards,
Frédéric.


If OTOH those PGs are really lost forever, and someone else should
confirm what I say here, I think the next step would be to force
recreate the incomplete PGs then run a set of cephfs scrub/repair
disaster recovery cmds to recover what you can from the cephfs.

-- dan

would this let us recover at least some of the data on the pgs? If
not we would just set up a new ceph directly without fixing the old
one and copy whatever is left.

Best regards,

Kevin




On Mon, May 13, 2019 at 4:20 PM Kevin Flöh 
wrote:

Dear ceph experts,

we have several (maybe related) problems with our ceph cluster,
let me
first show you the current ceph status:

     cluster:
   id: 23e72372-0d44-4cad-b24f-3641b14b86f4
   health: HEALTH_ERR
   1 MDSs report slow metadata IOs
   1 MDSs report slow requests
   1 MDSs behind on trimming
   1/126319678 objects unfound (0.000%)
   19 scrub errors
   Reduced data availability: 2 pgs inactive, 2 pgs
incomplete
   Possible data damage: 7 pgs inconsistent
   Degraded data redundancy: 1/500333881 objects degraded
(0.000%), 1 pg degraded
   118 stuck requests are blocked > 4096 sec.
Implicated osds
24,32,91

     services:
   mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
   mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
   mds: cephfs-1/1/1 up {0=ceph-node02.etp.kit.edu=up:active}, 3
up:standby
   osd: 96 osds: 96 up, 96 in

     data:
   pools:   2 pools, 4096 pgs
   objects: 126.32M objects, 260TiB
   usage:   372TiB used, 152TiB / 524TiB avail
   pgs: 0.049% pgs not active
    1/500333881 objects degraded (0.000%)
    1/126319678 objects unfound (0.000%)
    4076 active+clean
    10   active+clean+scrubbing+deep
    7    active+clean+inconsistent
    2    incomplete
    1    active+recovery_wait+degraded

     io:
   client:   449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr


and ceph health detail:


HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow
requests;
1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19
scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs
incomplete; Possible data damage: 7 pgs inconsistent; De

Re: [ceph-users] CephFS msg length greater than osd_max_write_size

2019-05-22 Thread Yan, Zheng
On Tue, May 21, 2019 at 6:10 AM Ryan Leimenstoll
 wrote:
>
> Hi all,
>
> We recently encountered an issue where our CephFS filesystem unexpectedly was 
> set to read-only. When we look at some of the logs from the daemons I can see 
> the following:
>
> On the MDS:
> ...
> 2019-05-18 16:34:24.341 7fb3bd610700 -1 mds.0.89098 unhandled write error 
> (90) Message too long, force readonly...
> 2019-05-18 16:34:24.341 7fb3bd610700  1 mds.0.cache force file system 
> read-only
> 2019-05-18 16:34:24.341 7fb3bd610700  0 log_channel(cluster) log [WRN] : 
> force file system read-only
> 2019-05-18 16:34:41.289 7fb3c0616700  1 heartbeat_map is_healthy 'MDSRank' 
> had timed out after 15
> 2019-05-18 16:34:41.289 7fb3c0616700  0 mds.beacon.objmds00 Skipping beacon 
> heartbeat to monitors (last acked 4.00101s ago); MDS internal heartbeat is 
> not healthy!
> ...
>
> On one of the OSDs it was most likely targeting:
> ...
> 2019-05-18 16:34:24.140 7f8134e6c700 -1 osd.602 pg_epoch: 682796 pg[49.20b( v 
> 682796'15706523 (682693'15703449,682796'15706523] local-lis/les=673041/673042 
> n=10524 ec=245563/245563 lis/c 673041/673041 les/c/f 673042/673042/0 
> 673038/673041/668565) [602,530,558] r=0 lpr=673041 crt=682796'15706523 lcod 
> 682796'15706522 mlcod 682796'15706522 active+clean] do_op msg data len 
> 95146005 > osd_max_write_size 94371840 on osd_op(mds.0.89098:48609421 49.20b 
> 49:d0630e4c:::mds0_sessionmap:head [omap-set-header,omap-set-vals] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e682796) v8
> 2019-05-18 17:10:33.695 7f813466b700  0 log_channel(cluster) log [DBG] : 
> 49.31c scrub starts
> 2019-05-18 17:10:34.980 7f813466b700  0 log_channel(cluster) log [DBG] : 
> 49.31c scrub ok
> 2019-05-18 22:17:37.320 7f8134e6c700 -1 osd.602 pg_epoch: 683434 pg[49.20b( v 
> 682861'15706526 (682693'15703449,682861'15706526] local-lis/les=673041/673042 
> n=10525 ec=245563/245563 lis/c 673041/673041 les/c/f 673042/673042/0 
> 673038/673041/668565) [602,530,558] r=0 lpr=673041 crt=682861'15706526 lcod 
> 682859'15706525 mlcod 682859'15706525 active+clean] do_op msg data len 
> 95903764 > osd_max_write_size 94371840 on osd_op(mds.0.91565:357877 49.20b 
> 49:d0630e4c:::mds0_sessionmap:head 
> [omap-set-header,omap-set-vals,omap-rm-keys] snapc 0=[] 
> ondisk+write+known_if_redirected+full_force e683434) v8
> …
>
> During this time there were some health concerns with the cluster. 
> Significantly, since the error above seems to be related to the SessionMap, 
> we had a client that had a few blocked requests for over 35948 secs (it’s a 
> member of a compute cluster so we let the node drain/finish jobs before 
> rebooting). We have also had some issues with certain OSDs running older 
> hardware staying up/responding timely to heartbeats after upgrading to 
> Nautilus, although that seems to be an iowait/load issue that we are actively 
> working to mitigate separately.
>

This prevents the MDS from trimming completed requests recorded in the
session, which results in a very large session item. To recover, blacklist the
client that has the blocked request, then restart the MDS.
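
A rough sketch of those steps, with placeholder IDs (check the client ls output
for the actual client):

$ ceph tell mds.0 client ls                 # identify the client with the stuck request
$ ceph tell mds.0 client evict id=12345     # evict (and, by default, blacklist) that client
$ ceph mds fail 0                           # or restart the active ceph-mds daemon directly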

> We are running Nautilus 14.2.1 on RHEL7.6. There is only one MDS Rank, with 
> an active/standby setup between two MDS nodes. MDS clients are mounted using 
> the RHEL7.6 kernel driver.
>
> My read here would be that the MDS is sending too large a message to the OSD, 
> however my understanding was that the MDS should be using osd_max_write_size 
> to determine the size of that message [0]. Is this maybe a bug in how this is 
> calculated on the MDS side?
>
>
> Thanks!
> Ryan Leimenstoll
> rleim...@umiacs.umd.edu
> University of Maryland Institute for Advanced Computer Studies
>
>
>
> [0] https://www.spinics.net/lists/ceph-devel/msg11951.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs client evicted, how to unmount the filesystem on the client?

2019-05-22 Thread Yan, Zheng
try 'umount -f'
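
For example, against the hung mounts shown in the quoted output below:

$ umount -f /home/mail-archive
$ umount -f /home/archiveindex

(If -f also hangs, `umount -l` can at least detach the mount point lazily.)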

On Tue, May 21, 2019 at 4:41 PM Marc Roos  wrote:
>
>
>
>
>
> [@ceph]# ps -aux | grep D
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> root 12527  0.0  0.0 123520   932 pts/1D+   09:26   0:00 umount
> /home/mail-archive
> root 14549  0.2  0.0  0 0 ?D09:29   0:09
> [kworker/0:0]
> root 23350  0.0  0.0 123520   932 pts/0D09:38   0:00 umount
> /home/archiveindex
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failed Disk simulation question

2019-05-22 Thread Eugen Block

Hi Alex,

The cluster has been idle at the moment, being new and all. I noticed some
disk-related errors in dmesg, but that was about it.
It looked to me like the failure had not been detected for the next 20-30
minutes. All OSDs were up and in, and health was OK. OSD logs had no smoking
gun either.
After 30 minutes, I restarted the OSD container and it failed to start, as
expected.


If the cluster doesn't have to read or write to specific OSDs (or sectors on
that OSD), the failure won't be detected immediately. We had an issue last year
where one of the SSDs (used for RocksDB and WAL) had a failure, but that was
never reported. We only discovered it when we tried to migrate the LVM volume
to a new device and got read errors.


Later on, I performed the same operation during the fio benchmark and the OSD
failed immediately.


This confirms our experience: if there's data to read/write on that disk, the
failure will be detected.
Please note that this was in a Luminous cluster; I don't know if and how
Nautilus has improved at sensing disk failures.


Regards,
Eugen


Quoting Alex Litvak :


Hello cephers,

I know that there was a similar question posted 5 years ago. However, the
answer was inconclusive for me.
I installed a new Nautilus 14.2.1 cluster and started pre-production testing.
I followed a Red Hat document and simulated a soft disk failure by running


#  echo 1 > /sys/block/sdc/device/delete

The cluster has been idle at the moment, being new and all. I noticed some
disk-related errors in dmesg, but that was about it.
It looked to me like the failure had not been detected for the next 20-30
minutes. All OSDs were up and in, and health was OK. OSD logs had no smoking
gun either.
After 30 minutes, I restarted the OSD container and it failed to start, as
expected.


Later on, I performed the same operation during the fio benchmark and the OSD
failed immediately.


My question is: should the disk problem have been detected quickly enough even
on the idle cluster? I thought Nautilus had the means to sense a failure before
intensive IO hit the disk.

Am I wrong to expect that?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS object mapping.

2019-05-22 Thread Burkhard Linke

Hi,

On 5/21/19 9:46 PM, Robert LeBlanc wrote:
I'm at a new job working with Ceph again and am excited to be back in the
community!


I can't find any documentation to support this, so please help me 
understand if I got this right.


I've got a Jewel cluster with CephFS and we have an inconsistent PG. 
All copies of the object are zero size, but the digest says that it 
should be a non-zero size, so it seems that my two options are, delete 
the file that the object is part of, or rewrite the object with RADOS 
to update the digest. So, this leads to my question, how to I tell 
which file the object belongs to.


From what I found, the object is prefixed with the hex value of the 
inode and suffixed by the stripe number:

1000d2ba15c.0005
.

I then ran `find . -xdev -inum 1099732590940` and found a file on the 
CephFS file system. I just want to make sure that I found the right 
file before I start trying recovery options.




The first stripe XYZ. has some metadata stored as xattr (rados 
xattr, not cephfs xattr). One of the entries has the key 'parent':


# ls Ubuntu16.04-WS2016-17.ova
Ubuntu16.04-WS2016-17.ova

# ls -i Ubuntu16.04-WS2016-17.ova
1099751898435 Ubuntu16.04-WS2016-17.ova

# rados -p cephfs_test_data stat 1000e523d43.
cephfs_test_data/1000e523d43. mtime 2016-10-13 16:20:10.00, 
size 4194304


# rados -p cephfs_test_data listxattr 1000e523d43.
layout
parent

# rados -p cephfs_test_data getxattr 1000e523d43. parent | strings
Ubuntu16.04-WS2016-17.ova5:
adm2
volumes


The complete path of the file is 
/volumes/adm/Ubuntu16.04-WS2016-17.ova5. For a complete check you can 
store the content of the parent key and use ceph-dencoder to print its 
content:


# rados -p cephfs_test_data getxattr 1000e523d43. parent > 
parent.bin


# ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json
{
    "ino": 1099751898435,
    "ancestors": [
    {
    "dirino": 1099527190071,
    "dname": "Ubuntu16.04-WS2016-17.ova",
    "version": 14901
    },
    {
    "dirino": 1099521974514,
    "dname": "adm",
    "version": 61190706
    },
    {
    "dirino": 1,
    "dname": "volumes",
    "version": 48394885
    }
    ],
    "pool": 7,
    "old_pools": []
}


One important thing to note: ls -i prints the inode id in decimal, while
cephfs uses hexadecimal for the rados object names. Hence the different
values in the commands above.
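
For example, a quick way to convert between the two with the shell's printf
(using the inode from above):

# printf '%x\n' 1099751898435
1000e523d43
# printf '%d\n' 0x1000e523d43
1099751898435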



Regards,

Burkhard


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com