Re: [ceph-users] RGW: Is 'radosgw-admin reshard stale-instances rm' safe?

2019-07-12 Thread Rudenko Aleksandr
Hi, Casey.

Can you help me with my question?

From: Konstantin Shalygin 
Date: Wednesday, 26 June 2019 at 07:29
To: Rudenko Aleksandr 
Cc: "ceph-users@lists.ceph.com" , Casey Bodley 

Subject: Re: [ceph-users] RGW: Is 'radosgw-admin reshard stale-instances rm' 
safe?



On 6/25/19 12:46 AM, Rudenko Aleksandr wrote:
Hi, Konstantin.

Thanks for the reply.

I know about stale instances and that they remain from prior versions.

I am asking about the bucket's "marker". I have a bucket "clx" and I can see its current
marker in the stale-instances list.
As far as I know, the stale-instances list should contain only previous marker IDs.



Good question! I've CC'ed Casey for an answer...





k


Re: [ceph-users] RGW: Is 'radosgw-admin reshard stale-instances rm' safe?

2019-06-24 Thread Rudenko Aleksandr
Hi, Konstantin.

Thanks for the reply.

I know about stale instances and that they remain from prior versions.

I am asking about the bucket's "marker". I have a bucket "clx" and I can see its current
marker in the stale-instances list.
As far as I know, the stale-instances list should contain only previous marker IDs.

From: Konstantin Shalygin 
Date: Friday, 21 June 2019 at 15:30
To: Rudenko Aleksandr 
Cc: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] RGW: Is 'radosgw-admin reshard stale-instances rm' 
safe?


Hi, folks.



I have Luminous 12.2.12. Auto-resharding is enabled.



In the stale-instances list I have:



# radosgw-admin reshard stale-instances list | grep clx

"clx:default.422998.196",



I see the same marker ID in the bucket stats for this bucket:



# radosgw-admin bucket stats --bucket clx | grep marker

"marker": "default.422998.196",

"max_marker": 
"0#,1#,2#,3#,4#,5#,6#,7#,8#,9#,10#,11#,12#,13#,14#,15#,16#,17#,18#,19#,20#,21#,22#,23#,24#,25#,26#,27#,28#,29#,30#,31#,32#,33#,34#,35#,36#,37#,38#,39#,40#,41#,42#,43#,44#,45#,46#,47#,48#,49#,50#,51#,52#",



I don't think this is correct: the active marker (in bucket stats) should not appear in
the stale-instances list.



I need to run 'radosgw-admin reshard stale-instances rm' because I have a large
OMAP warning, but I am not sure it is safe.



Is it safe to run: radosgw-admin reshard stale-instances rm ?

Yes. These stale instances were left behind by dynamic resharding, mostly prior to
12.2.11. At least I have not seen any new stale instances in my cluster.
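
If in doubt, one way to double-check before removing anything (a rough sketch, not an
official procedure): compare the stale entries against the live bucket instance id. My
understanding is that the "marker" can stay the same across reshards while the "id"
changes, so "id" is the safer field to compare, but I have not verified that against
this exact release.

# radosgw-admin bucket stats --bucket clx | grep '"id"'
# radosgw-admin metadata list bucket.instance | grep clx
# radosgw-admin reshard stale-instances list | grep clx

If the "id" reported by bucket stats does not show up as a stale entry, the rm command
should only touch old index objects.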





k


[ceph-users] RGW: Is 'radosgw-admin reshard stale-instances rm' safe?

2019-06-21 Thread Rudenko Aleksandr
Hi, folks.

I have Luminous 12.2.12. Auto-resharding is enabled.

In the stale-instances list I have:

# radosgw-admin reshard stale-instances list | grep clx
"clx:default.422998.196",

I see the same marker ID in the bucket stats for this bucket:

# radosgw-admin bucket stats --bucket clx | grep marker
"marker": "default.422998.196",
"max_marker": 
"0#,1#,2#,3#,4#,5#,6#,7#,8#,9#,10#,11#,12#,13#,14#,15#,16#,17#,18#,19#,20#,21#,22#,23#,24#,25#,26#,27#,28#,29#,30#,31#,32#,33#,34#,35#,36#,37#,38#,39#,40#,41#,42#,43#,44#,45#,46#,47#,48#,49#,50#,51#,52#",

I don't think this is correct: the active marker (in bucket stats) should not appear in
the stale-instances list.

I need to run 'radosgw-admin reshard stale-instances rm' because I have a large
OMAP warning, but I am not sure it is safe.

Is it safe to run: radosgw-admin reshard stale-instances rm ?
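
For the large OMAP warning itself, a rough way to see which index objects and shards
triggered it (a sketch; log locations depend on your setup, and 'bucket limit check'
assumes a recent Luminous build):

# ceph health detail | grep -i 'large omap'
# grep -i 'large omap object found' /var/log/ceph/ceph.log
# radosgw-admin bucket limit check

The last command prints objects per shard and the fill status for each bucket, which
helps to confirm whether "clx" is the bucket behind the warning.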




[ceph-users] osd_journal_aio=false and performance

2018-09-04 Thread Rudenko Aleksandr
Hi, guys.

I ran a few tests and I see that performance is better with osd_journal_aio=false for
LV journals.

Setup:
2 servers x 4 OSD (SATA HDD + journal on SSD LV)
12.2.5, filestore

  cluster:
id: ce305aae-4c56-41ec-be54-529b05eb45ed
health: HEALTH_OK

  services:
mon: 2 daemons, quorum a,b
mgr: a(active), standbys: b
osd: 8 osds: 8 up, 8 in

  data:
pools:   1 pools, 512 pgs
objects: 0 objects, 0 bytes
usage:   904 MB used, 11440 GB / 11441 GB avail
pgs: 512 active+clean


0 objects before each test.

I ran rados bench from two servers in parallel:

2x of:
rados bench -p test 30 write -b 1M -t 32

I ran each test four times.

file-journal on XFS FS, dio=1, aio=0:

Average IOPS:   102.5
Average Latency(s):   0.30

LV-journal, dio=1, aio=1:

Average IOPS:   96.5
Average Latency(s):   32.5

LV-journal, dio=1, aio=0:

Average IOPS:   104
Average Latency(s):   0.30


Is it safe to disable aio on LV journals? Does it make sense?
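
For reference, a minimal ceph.conf sketch of the two setups being compared (an
assumption on my side: the relevant filestore options are spelled journal dio /
journal aio, and aio only applies when dio is enabled):

[osd]
# direct I/O to the journal device or file
journal dio = true
# libaio for journal writes; false corresponds to the "aio=0" runs above
journal aio = false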


Re: [ceph-users] Ceph logging into graylog

2018-08-16 Thread Rudenko Aleksandr
Hi,

Yes, we use GELF UDP:

(screenshot of the Graylog "GELF UDP" input was attached here)

On 15 Aug 2018, at 14:28, Roman Steinhart <ro...@aternos.org> wrote:

Hi,

thanks for your reply.
May I ask which type of input you use in Graylog?
"GELF UDP" or another one?
And which versions of Graylog/Ceph do you use?

Thanks,

Roman


On Aug 9 2018, at 7:47 pm, Rudenko Aleksandr <arude...@croc.ru> wrote:

Hi,

All our settings for this:

mon cluster log to graylog = true
mon cluster log to graylog host = {graylog-server-hostname}
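
A quick sanity check we find useful (just a sketch; 12201 is the GELF default and the
port used in the config quoted below) is to confirm on the monitor host that the GELF
datagrams actually leave for the Graylog server:

# tcpdump -ni any udp port 12201 -c 10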





On 9 Aug 2018, at 19:33, Roman Steinhart <ro...@aternos.org> wrote:

Hi all,

I'm trying to set up ceph logging into graylog.
For that I've set the following options in ceph.conf:
log_to_graylog = true
err_to_graylog = true
log_to_graylog_host = graylog.service.consul
log_to_graylog_port = 12201
mon_cluster_log_to_graylog = true
mon_cluster_log_to_graylog_host = graylog.service.consul
mon_cluster_log_to_graylog_port = 12201
clog_to_graylog = true
clog_to_graylog_host = graylog.service.consul
clog_to_graylog_port = 12201

According to the Graylog server.log it looks like Ceph accepted these config options
and is sending log messages to Graylog; however, Graylog is not able to process these
messages because of this error:
https://paste.steinh.art/jezerobevu.apache

It says: "has empty mandatory "host" field."
How can I tell Ceph to fill this host field?
Or is it a version incompatibility between Ceph and Graylog?

We're using ceph 12.2.7 and graylog 2.4.6+ceaa7e4

Maybe one of you has already gotten Graylog working and can help me with this
problem?

Kind regards,

Roman


Re: [ceph-users] osd.X down, but it is still running on Luminous

2018-08-10 Thread Rudenko Aleksandr
Thanks for the reply.

I don't see any "Segmentation fault" in the logs :(
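
A rough set of checks for this (just a sketch; osd.12 is an example id):

# dmesg -T | grep -iE 'segfault|ceph-osd'
# journalctl -u ceph-osd@12 | grep -iE 'signal|abort'
# grep -i 'heartbeat_check: no reply' /var/log/ceph/ceph-osd.12.log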



On 10 Aug 2018, at 09:35, Eugen Block <ebl...@nde.ag> wrote:

Hi,

could you be hitting the bug from [1]? Watch out for segfaults in dmesg.
For a couple of days we have been seeing random OSDs segfault in safe_timer, even
though we haven't updated any packages for months.

Regards

[1] https://tracker.ceph.com/issues/23352


Quoting Rudenko Aleksandr <arude...@croc.ru>:

Hi, guys.

After upgrading to Luminous I see:

Monitor daemon marked osd.xx down, but it is still running

This happens 3-5 times a day on different OSDs.

I have spent a lot of time debugging but I haven't found the problem.

The network works perfectly. CPU, network and disk utilization are low. There is
enough memory.

Maybe it's deep-scrub, but we have the following config:

osd scrub sleep = 0.2
osd scrub chunk min = 1
osd scrub chunk max = 2

and we didn't see OSDs flapping during scrub on Hammer or Jewel.





Re: [ceph-users] Ceph logging into graylog

2018-08-09 Thread Rudenko Aleksandr
Hi,

All our settings for this:

mon cluster log to graylog = true
mon cluster log to graylog host = {graylog-server-hostname}





On 9 Aug 2018, at 19:33, Roman Steinhart <ro...@aternos.org> wrote:

Hi all,

I'm trying to set up ceph logging into graylog.
For that I've set the following options in ceph.conf:
log_to_graylog = true
err_to_graylog = true
log_to_graylog_host = graylog.service.consul
log_to_graylog_port = 12201
mon_cluster_log_to_graylog = true
mon_cluster_log_to_graylog_host = graylog.service.consul
mon_cluster_log_to_graylog_port = 12201
clog_to_graylog = true
clog_to_graylog_host = graylog.service.consul
clog_to_graylog_port = 12201

According to the Graylog server.log it looks like Ceph accepted these config options
and is sending log messages to Graylog; however, Graylog is not able to process these
messages because of this error:
https://paste.steinh.art/jezerobevu.apache

It says: "has empty mandatory "host" field."
How can I tell Ceph to fill this host field?
Or is it a version incompatibility between Ceph and Graylog?

We're using ceph 12.2.7 and graylog 2.4.6+ceaa7e4

Maybe one of you has already gotten Graylog working and can help me with this
problem?

Kind regards,

Roman


[ceph-users] osd.X down, but it is still running on Luminous

2018-08-09 Thread Rudenko Aleksandr
Hi, guys.

After upgrading to Luminous I see:

Monitor daemon marked osd.xx down, but it is still running

This happens 3-5 times a day on different OSDs.

I have spent a lot of time debugging but I haven't found the problem.

The network works perfectly. CPU, network and disk utilization are low. There is
enough memory.

Maybe it's deep-scrub, but we have the following config:

osd scrub sleep = 0.2
osd scrub chunk min = 1
osd scrub chunk max = 2

and we didn't see OSDs flapping during scrub on Hammer or Jewel.
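
A sketch of what can be inspected when this happens (osd.12 is an example id; option
names as in Luminous):

# ceph daemon osd.12 config show | grep -E 'osd_heartbeat_(grace|interval)'
# grep -i 'wrongly marked me down' /var/log/ceph/ceph-osd.12.log
# grep 'osd.12' /var/log/ceph/ceph.log | grep -i 'reported failed'

The last two greps show whether the OSD itself noticed being marked down and which
peers reported it as failed.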



Re: [ceph-users] [rgw] Very high cache misses with automatic bucket resharding

2018-07-16 Thread Rudenko Aleksandr
Yes, I have tasks in `radosgw-admin reshard list`.

And the object count in .rgw.buckets.index is increasing, slowly.

But I am a bit confused. I have one big bucket with 161 shards.

…
"max_marker": 
"0#,1#,2#,3#,4#,5#,6#,7#,8#,9#,10#,11#,12#,13#,14#,15#,16#,17#,18#,19#,20#,21#,22#,23#,24#,25#,26#,27#,28#,29#,30#,31#,32#,33#,34#,35#,36#,37#,38#,39#,40#,41#,42#,43#,44#,45#,46#,47#,48#,49#,50#,51#,52#,53#,54#,55#,56#,57#,58#,59#,60#,61#,62#,63#,64#,65#,66#,67#,68#,69#,70#,71#,72#,73#,74#,75#,76#,77#,78#,79#,80#,81#,82#,83#,84#,85#,86#,87#,88#,89#,90#,91#,92#,93#,94#,95#,96#,97#,98#,99#,100#,101#,102#,103#,104#,105#,106#,107#,108#,109#,110#,111#,112#,113#,114#,115#,116#,117#,118#,119#,120#,121#,122#,123#,124#,125#,126#,127#,128#,129#,130#,131#,132#,133#,134#,135#,136#,137#,138#,139#,140#,141#,142#,143#,144#,145#,146#,147#,148#,149#,150#,151#,152#,153#,154#,155#,156#,157#,158#,159#,160#»,
…

But in the reshard list I see:

{
"time": "2018-07-15 21:11:31.290620Z",
"tenant": "",
"bucket_name": "my-bucket",
"bucket_id": "default.32785769.2",
"new_instance_id": "",
"old_num_shards": 1,
"new_num_shards": 162
},

"old_num_shards": 1 - it’s correct?

I hit a lot of problems trying to use auto resharding in 12.2.5

Which problems did you hit?
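
A sketch of how to look at a single bucket's resharding state on Luminous (bucket name
from above; I would treat the exact output as version-dependent):

# radosgw-admin reshard status --bucket my-bucket
# radosgw-admin reshard list

There is also 'radosgw-admin reshard cancel --bucket my-bucket' for removing a queued
entry, but on 12.2.5 I would check the tracker issues before relying on it.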

On 16 Jul 2018, at 16:57, Sean Redmond <sean.redmo...@gmail.com> wrote:

Hi,

Do you have ongoing resharding? 'radosgw-admin reshard list' should show you the
status.

Do you see the number of objects in the .rgw.buckets.index pool increasing?

I hit a lot of problems trying to use auto resharding in 12.2.5 - I have 
disabled it for the moment.

Thanks

[1] https://tracker.ceph.com/issues/24551

On Mon, Jul 16, 2018 at 12:32 PM, Rudenko Aleksandr <arude...@croc.ru> wrote:

Hi, guys.

I use Luminous 12.2.5.

Automatic bucket index resharding had not been activated in the past.

A few days ago I activated automatic resharding.

Since then I see:

- very high Ceph read I/O (~300 IOPS before activating resharding, ~4k now),
- very high Ceph read bandwidth (50 MB/s before activating resharding, 250 MB/s now),
- a very high RGW cache miss rate (~400/s before activating resharding, ~3.5k/s now).

For Ceph monitoring I use the MGR Zabbix plugin and the Zabbix template from the Ceph
GitHub repo.
For RGW monitoring I use RGW perf dump and my own script.

Why is this happening? When will it end?



[ceph-users] [rgw] Very high cache misses with automatic bucket resharding

2018-07-16 Thread Rudenko Aleksandr
Hi, guys.

I use Luminous 12.2.5.

Automatic bucket index resharding had not been activated in the past.

A few days ago I activated automatic resharding.

Since then I see:

- very high Ceph read I/O (~300 IOPS before activating resharding, ~4k now),
- very high Ceph read bandwidth (50 MB/s before activating resharding, 250 MB/s now),
- a very high RGW cache miss rate (~400/s before activating resharding, ~3.5k/s now).

For Ceph monitoring I use the MGR Zabbix plugin and the Zabbix template from the Ceph
GitHub repo.
For RGW monitoring I use RGW perf dump and my own script.

Why is this happening? When will it end?
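
For reference, the cache counters worth watching come from the RGW admin socket,
roughly like this (the socket path is an example; adjust it to your client name):

# ceph daemon /var/run/ceph/ceph-client.rgw.a.asok perf dump | grep -E '"cache_(hit|miss)"'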


[ceph-users] Jewel -> Luminous: can't decode unknown message type 1544 MSG_AUTH=17

2018-06-09 Thread Rudenko Aleksandr
Hi, friends.

I recently updated the cluster from Hammer (0.94.10) to Jewel (10.2.10) and everything
works fine.

Now I can't update the cluster from Jewel to Luminous (12.2.5).

I have 5 MONs.

I updated the packages and restarted the MON service, but the MON didn't join the
cluster.

In the logs of the new MON I see:

7f53bc797700  0 -- 192.168.144.96:6789/0 >> 192.168.144.6:6789/0 
conn(0x7f53cf48e800 :-1 s=STATE_OPEN pgs=7342571 cs=1 l=0).fault initiating 
reconnect
7f53bc797700  0 -- 192.168.144.96:6789/0 >> 192.168.144.6:6789/0 
conn(0x7f53cf48e800 :-1 s=STATE_OPEN pgs=7342573 cs=3 l=0).fault initiating 
reconnect


In the logs of the currently active MON I see:


7f5415beb700 10 mon.e@0(leader) e14 ms_verify_authorizer 192.168.144.96:6789/0 
mon protocol 2
2018-06-09 19:55:14.979566 7f5415beb700  0 -- 192.168.144.6:6789/0 >> 
192.168.144.96:6789/0 pipe(0x7f542d934000 sd=8 :6789 s=0 pgs=0 cs=0 l=0 
c=0x7f542d99db00).accept connect_seq 24080 vs existing 24079 state
 standby
7f54178f3700 10 mon.e@0(leader) e14 ms_handle_reset 0x7f542d99db00 
192.168.144.96:6789/0
7f5415beb700  0 can't decode unknown message type 1544 MSG_AUTH=17
7f5415beb700  0 -- 192.168.144.6:6789/0 >> 192.168.144.96:6789/0 
pipe(0x7f542d934000 sd=8 :6789 s=2 pgs=6779228 cs=24081 l=0 
c=0x7f542ca94a00).fault with nothing to send, going to standby


and sometimes this:

7fc6ddb8b700  0 mon.e@0(leader) e8 ms_verify_authorizer cephx enabled, but no 
authorizer (required for mon)
7fc6ddb8b700  0 -- 192.168.144.6:6789/0 >> 192.168.144.94:6789/0 
pipe(0x7fc6ff23e000 sd=12 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fc6fb237800).accept: 
got bad authorizer


I haven't been able to reproduce this in a test environment. I have tested the
Hammer -> Jewel -> Luminous update many times in different setups and everything
was always fine.

Any recommendations, please.
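
A sketch of things worth comparing between the upgraded MON and the others (mon names
are examples):

# ceph daemon mon.a version
# ceph daemon mon.a mon_status
# ceph mon stat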






[ceph-users] [rgw] user stats understanding

2018-04-24 Thread Rudenko Aleksandr
Hi, friends.

We use RGW user stats in our billing.

Example on Luminous:

radosgw-admin usage show --uid 5300c830-82e2-4dce-ac6d-1d97a65def33

{
"entries": [
{
"user": "5300c830-82e2-4dce-ac6d-1d97a65def33",
"buckets": [
{
"bucket": "",
"time": "2018-04-06 19:00:00.00Z",
"epoch": 1523041200,
"owner": "5300c830-82e2-4dce-ac6d-1d97a65def33",
"categories": [
{
"category": "list_buckets",
"bytes_sent": 141032,
"bytes_received": 0,
"ops": 402,
"successful_ops": 402
}
]
},
{
"bucket": "-",
"time": "2018-04-24 13:00:00.00Z",
"epoch": 1524574800,
"owner": "5300c830-82e2-4dce-ac6d-1d97a65def33",
"categories": [
{
"category": "get_obj",
"bytes_sent": 422,
"bytes_received": 0,
"ops": 2,
"successful_ops": 0
}
]
},
{
"bucket": "test",
"time": "2018-04-06 19:00:00.00Z",
"epoch": 1523041200,
"owner": "5300c830-82e2-4dce-ac6d-1d97a65def33",
"categories": [
…

...
{
"category": "get_obj",
"bytes_sent": 642,
"bytes_received": 0,
"ops": 3,
"successful_ops": 0
},
…

...
 ]
}
]
}
],
"summary": [
{
"user": "5300c830-82e2-4dce-ac6d-1d97a65def33",
"categories": [
…

...
{
"category": "get_obj",
"bytes_sent": 2569,
"bytes_received": 0,
"ops": 12,
"successful_ops": 0
},
{
"category": "list_bucket",
"bytes_sent": 185537,
"bytes_received": 0,
"ops": 302,
"successful_ops": 302
},
{
"category": "list_buckets",
"bytes_sent": 141032,
"bytes_received": 0,
"ops": 402,
"successful_ops": 402
},
…

...
],
"total": {
"bytes_sent": 884974,
"bytes_received": 0,
"ops": 1521,
"successful_ops": 1507
}
}
]
}

What statistics are accounted in the dictionaries with "bucket": "" and with "bucket": "-"?
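
A guess (not verified in the code): the entry with an empty bucket name collects
service-level operations that are not tied to any bucket (note it only contains
list_buckets), while the "-" entry collects requests where no existing bucket could be
resolved (its get_obj ops have 0 successful_ops). To narrow things down, usage show
also accepts per-bucket and date filters, for example:

radosgw-admin usage show --uid 5300c830-82e2-4dce-ac6d-1d97a65def33 --bucket test --start-date 2018-04-01 --end-date 2018-04-24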
 


Re: [ceph-users] [rgw] civetweb behind haproxy doesn't work with absolute URI

2018-03-31 Thread Rudenko Aleksandr
Hi, Sean.

Thank you for the reply.

What do you mean by "We had to disable "rgw dns name" in the end"?

Setting "rgw_dns_name": "" has no effect for me.



On 29 Mar 2018, at 11:23, Sean Purdy <s.pu...@cv-library.co.uk> wrote:

We had something similar recently. We had to disable "rgw dns name" in the end.



Re: [ceph-users] One object degraded cause all ceph requests hang - Jewel 10.2.6 (rbd + radosgw)

2018-03-29 Thread Rudenko Aleksandr
Thank you, Vincent, this is very helpful for me!


On 11 Jan 2018, at 14:24, Vincent Godin <vince.ml...@gmail.com> wrote:

As no response was given, I will explain what I found; maybe it will help other
people.

The .dirXXX object is an index marker with a data size of 0. The metadata associated
with this object (located in the LevelDB of the OSDs currently holding this marker)
is the index of the bucket corresponding to this marker.
My problem came from the number of objects stored in this bucket: more than 50
million. As an entry in the index takes between 200 and 250 bytes, the index must
have been around 12 GB in size. That's why it is recommended to add a shard to the
index for every 100,000 objects.
During a Ceph rebuild, some PGs move from some OSDs to others. When an index is
moving, all write requests to the bucket are blocked until the operation completes.
During this move, the user had launched an upload batch on the bucket, so a lot of
requests were blocked, which ended up blocking all the requests on the primary PGs
held by the OSD.
So the loop I saw was in fact just normal, but moving a 12 GB object from one SATA
disk to another takes several minutes, which is too long for a Ceph cluster with a
lot of clients to survive.
The lesson of this story is: don't forget to shard your bucket!!!
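
For completeness, a sketch of the knobs this lesson points at (an assumption on my
side: a Jewel release recent enough to ship the offline reshard command; the numbers
are examples):

# pre-split new buckets at creation time (rgw section of ceph.conf)
rgw override bucket index max shards = 16

# offline reshard of an existing bucket; writes to the bucket should be stopped first
radosgw-admin bucket reshard --bucket=my-bucket --num-shards=512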


---
Yesterday we just encountered this bug. One OSD was looping on
"2018-01-03 16:20:59.148121 7f011a6a1700  0 log_channel(cluster) log
[WRN] : slow request 30.254269 seconds old, received at 2018-01-03
16:20:28.883837: osd_op(client.48285929.0:14601958 35.8abfc02e
.dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 [call
rgw.bucket_prepare_op] snapc 0=[] ondisk+write+known_if_redirected
e359833) currently waiting for degraded object".

The requests on this OSD.150 quickly went into a blocked state:

2018-01-03 16:25:56.241064 7f011a6a1700  0 log_channel(cluster) log
[WRN] : 20 slow requests, 1 included below; oldest blocked for >
327.357139 secs
2018-01-03 16:30:19.299288 7f011a6a1700  0 log_channel(cluster) log
[WRN] : 45 slow requests, 1 included below; oldest blocked for >
590.415387 secs
...
...
2018-01-03 16:46:04.900204 7f011a6a1700  0 log_channel(cluster) log
[WRN] : 100 slow requests, 2 included below; oldest blocked for >
1204.060056 secs

while still looping

2018-01-03 16:46:04.900220 7f011a6a1700  0 log_channel(cluster) log
[WRN] : slow request 123.294762 seconds old, received at 2018-01-03
16:44:01.605320 : osd_op(client.48285929.0:14605228 35.8abfc02e
.dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 [call
rgw.bucket_complete_op] snapc 0=[]
ack+ondisk+write+known_if_redirected e359833) currently waiting for
degraded object

All these requests were blocked on OSD.150.
A lot of VMs attached to Ceph were hanging.

The degraded object was
.dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 in pg 35.2e.
This PG was located on 4 OSDs. The object had a size of 0 on all 4 OSDs.
It was not possible to get a response from 'ceph pg 35.2e query'.
Killing OSD.150 just led to the requests being blocked on the new primary.

I found the relatively new bug #22072, which looks like mine, but there was no
response from the Ceph team. I finally tried the same solution,
'rados rm -p pool/degraded_object', but got no response from the command. I stopped
the command after 15 minutes. A few minutes later, the 4 OSDs holding pg 35.2e
suddenly rebooted and the problem was solved. The object was deleted on the 4 OSDs.

Anyway, it led to a production outage, I have no idea what produced the "degraded
object", and I'm not sure whether the solution came from my command or from an
internal process. At this point we are still trying to repair some filesystems of the
VMs attached to Ceph, and I have to explain that this whole production outage came
from one empty object... The real question is why Ceph was unable to handle this
"degraded object" and looped on it, blocking all the requests on OSD.150.


[ceph-users] [rgw] civetweb behind haproxy doesn't work with absolute URI

2018-03-29 Thread Rudenko Aleksandr

Hi friends.


I'm sorry, maybe it isn't a bug, but I don't know how to solve this problem.

I know that absolute URIs are supported by civetweb, and they work fine for me
without haproxy in the middle.

But if a client sends absolute URIs through a reverse proxy (haproxy) to civetweb,
civetweb breaks the connection without a response.

I set:

debug rgw = 20
debug civetweb = 10


but there are no messages in the civetweb logs (access, error) or in the rgw logs.
In tcpdump I only see rgw closing the connection after a request with an absolute URI.
Requests with relative URIs work fine through haproxy.

Client:
Docker registry v2.6.2; its s3 driver, based on aws-sdk-go/1.2.4 (go1.7.6; linux;
amd64), uses absolute URIs in requests.

s3 driver options of docker registry:

  s3:
region: us-east-1
bucket: docker
accesskey: 'access_key'
secretkey: 'secret_key'
regionendpoint: http://storage.my-domain.ru
secure: false
v4auth: true


ceph.conf for rgw instance:

[client]
rgw dns name = storage.my-domain.ru
rgw enable apis = s3, admin
rgw dynamic resharding = false
rgw enable usage log = true
rgw num rados handles = 8
rgw thread pool size = 256

[client.rgw.a]
host = aj15
keyring = /var/lib/ceph/radosgw/rgw.a.keyring
rgw enable static website = true
rgw frontends = civetweb 
authentication_domain=storage.my-domain.ru 
num_threads=128 port=0.0.0.0:7480 
access_log_file=/var/log/ceph/civetweb.rgw.access.log 
error_log_file=/var/log/ceph/civetweb.rgw.error.log
debug rgw = 20
debug civetweb = 10


very simple haproxy.cfg:

global
chroot /var/empty
# /log is chroot path
log /haproxy-log local2

pidfile /var/run/haproxy.pid

user haproxy
group haproxy
daemon

ssl-default-bind-options no-sslv3
ssl-default-bind-ciphers 
ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA
ssl-dh-param-file /etc/pki/tls/dhparams.pem

defaults
mode http
log global

frontend s3

bind *:80
bind *:443 ssl crt /etc/pki/tls/certs/s3.pem crt 
/etc/pki/tls/certs/s3-buckets.pem

use_backend rgw

backend rgw

balance roundrobin

server a aj15:7480 check fall 1
server a aj16:7480 check fall 1


HTTP headers from tcpdump before and after haproxy:

GET http://storage.my-domain.ru/docker?max-keys=1&prefix= HTTP/1.1
Host: storage.my-domain.ru
User-Agent: aws-sdk-go/1.2.4 (go1.7.6; linux; amd64)
Authorization: AWS4-HMAC-SHA256 
Credential=user:u...@cloud.croc.ru/20180328/us-east-1/s3/aws4_request,
 SignedHeaders=host;x-amz-content-sha256;x-amz-date, 
Signature=10043867bbb2833d50f9fe16a6991436a5c328adc5042556ce1ddf1101ee2cb9
X-Amz-Content-Sha256: 
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20180328T111255Z
Accept-Encoding: gzip

I don't understand how to use haproxy with absolute URIs in requests :(
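
One idea, which I have not verified (it assumes HAProxy 1.6+ for 'http-request
set-uri' and the 'regsub' converter): rewrite the absolute-form request line back to
origin-form in the frontend before it reaches civetweb. The path and query stay the
same, so the AWS v4 signature should still validate, but treat this as a sketch only:

frontend s3
    ...
    # request lines whose URI is absolute ("GET http://host/path HTTP/1.1")
    acl abs_uri url -m reg ^https?://
    # strip the scheme and authority, leaving only /path?query
    http-request set-uri %[url,regsub(^https?://[^/]+,,)] if abs_uri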



Re: [ceph-users] Problem with UID starting with underscores

2018-03-15 Thread Rudenko Aleksandr
Hi,

I have the same issue.

Try using two underscores:

radosgw-admin user info --uid="__pro_"

I have a user with two underscores on Hammer and I can work with it using one
underscore :)

I recommend removing this user and not using underscores in user names and access
keys, because after upgrading to Luminous I can't work with a user whose name starts
with an underscore :(

http://tracker.ceph.com/issues/23373?next_issue_id=23372
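
If removal is the goal, the metadata layer sometimes still reaches such users when the
normal user commands do not (a sketch; this only deletes the user metadata object and
does not clean up buckets or keys the user may own, so use it with care):

# radosgw-admin metadata get user:___pro_
# radosgw-admin metadata rm user:___pro_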


On 6 Mar 2018, at 11:52, Arvydas Opulskis <zebedie...@gmail.com> wrote:

Hi all,

because one of our scripts misbehaved, a new user with a bad UID was created via the
API, and now we can't remove, view or modify it. I believe it's because the UID has
three underscores at the beginning:
[root@rgw001 /]# radosgw-admin metadata list user | grep "___pro_"
"___pro_",

[root@rgw001 /]# radosgw-admin user info --uid="___pro_"
could not fetch user info: no user info saved

Do you have any ideas how to work around this problem? If this naming is not
supported, maybe the API shouldn't allow creating it?

We are using Jewel 10.2.10 on CentOS 7.4.

Thanks for any ideas,



[ceph-users] [rgw] Underscore at the beginning of access key not works after upgrade Jewel->Luminous

2018-02-12 Thread Rudenko Aleksandr
Hi friends,

I have an RGW user (_sc) whose access key is the same string:

radosgw-admin metadata user info --uid _sc
{
"user_id": "_sc",
"display_name": "_sc",
"email": "",
"suspended": 0,
"max_buckets": 0,
"auid": 0,
"subusers": [],
"keys": [
{
"user": "_sc",
"access_key": "_sc",
"secret_key": "Sbg6C7wkSK+jO2t3D\/719A"
}
],
….


Everything works fine on Jewel.
But after upgrading to Luminous this user receives "InvalidAccessKeyId".


Radosgw-admin says:

radosgw-admin metadata user info --uid _sc
could not fetch user info: no user info saved

Why does this happen?



Re: [ceph-users] Problems with CORS

2017-10-23 Thread Rudenko Aleksandr
Thank you, David, for your suggestion.

We added our domain (Origin) to the zonegroup's endpoints and hostnames:

{
"id": "default",
"name": "default",
"api_name": "",
"is_master": "true",
"endpoints": [
"https://console.{our_domain}.ru";,
],
"hostnames": [
"https://console.{our_domain}.ru";,
],
"hostnames_s3website": [],
"master_zone": "default",
"zones": [
{
"id": "default",
"name": "default",
"endpoints": [],
"log_meta": "false",
"log_data": "false",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": []
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": []
}
],
"default_placement": "default-placement",
"realm_id": "9c7666df-132d-4db0-988e-6b28767ff3cf"
}

But this does not solve our problem.

In the RGW logs:

2017-10-23 10:51:25.301934 7f39f2e73700  1 == starting new request 
req=0x7f39f2e6d190 =
2017-10-23 10:51:25.301956 7f39f2e73700  2 req 22:0.22::OPTIONS 
/aaa::initializing for trans_id = 
tx00016-0059ed9f7d-fc80-default
2017-10-23 10:51:25.301993 7f39f2e73700  2 req 22:0.58:s3:OPTIONS 
/aaa::getting op 6
2017-10-23 10:51:25.302004 7f39f2e73700  2 req 22:0.71:s3:OPTIONS 
/aaa:options_cors:verifying requester
2017-10-23 10:51:25.302013 7f39f2e73700  2 req 22:0.80:s3:OPTIONS 
/aaa:options_cors:normalizing buckets and tenants
2017-10-23 10:51:25.302018 7f39f2e73700  2 req 22:0.84:s3:OPTIONS 
/aaa:options_cors:init permissions
2017-10-23 10:51:25.302065 7f39f2e73700  2 req 22:0.000131:s3:OPTIONS 
/aaa:options_cors:recalculating target
2017-10-23 10:51:25.302070 7f39f2e73700  2 req 22:0.000136:s3:OPTIONS 
/aaa:options_cors:reading permissions
2017-10-23 10:51:25.302075 7f39f2e73700  2 req 22:0.000141:s3:OPTIONS 
/aaa:options_cors:init op
2017-10-23 10:51:25.302076 7f39f2e73700  2 req 22:0.000143:s3:OPTIONS 
/aaa:options_cors:verifying op mask
2017-10-23 10:51:25.302078 7f39f2e73700  2 req 22:0.000144:s3:OPTIONS 
/aaa:options_cors:verifying op permissions
2017-10-23 10:51:25.302080 7f39f2e73700  2 req 22:0.000146:s3:OPTIONS 
/aaa:options_cors:verifying op params
2017-10-23 10:51:25.302081 7f39f2e73700  2 req 22:0.000148:s3:OPTIONS 
/aaa:options_cors:pre-executing
2017-10-23 10:51:25.302111 7f39f2e73700  2 req 22:0.000149:s3:OPTIONS 
/aaa:options_cors:executing
2017-10-23 10:51:25.302124 7f39f2e73700  2 No CORS configuration set yet for 
this bucket
2017-10-23 10:51:25.302126 7f39f2e73700  2 req 22:0.000193:s3:OPTIONS 
/aaa:options_cors:completing
2017-10-23 10:51:25.302191 7f39f2e73700  2 req 22:0.000258:s3:OPTIONS 
/aaa:options_cors:op status=-13
2017-10-23 10:51:25.302198 7f39f2e73700  2 req 22:0.000264:s3:OPTIONS 
/aaa:options_cors:http status=403
2017-10-23 10:51:25.302203 7f39f2e73700  1 == req done req=0x7f39f2e6d190 
op status=-13 http_status=403 ==
2017-10-23 10:51:25.302260 7f39f2e73700  1 civetweb: 0x7f3a30c7c000: 
172.20.41.101 - - [23/Oct/2017:10:51:25 +0300] "OPTIONS /aaa HTTP/1.1" 1 0 - 
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:56.0) Gecko/20100101 
Firefox/56.0

OPTIONS requests fail with 403.

Robin, thank you so much!

We plan to use haproxy in front of civetweb, and your rules solve our problem with
OPTIONS requests!

Thank you guys!


> On 22 Oct 2017, at 23:10, Robin H. Johnson  wrote:
> 
> On Sun, Oct 22, 2017 at 01:31:03PM +, Rudenko Aleksandr wrote:
>> In the past we rewrote the HTTP response headers with Apache rules for our
>> web interface and passed the CORS check. But now it seems impossible to solve at
>> the balancer level.
> You CAN modify the CORS responses at the load-balancer level.
> 
> Find below the snippets needed to do it in HAProxy w/ Jewel-Civetweb;
> specifically, this completely overrides the CORS if the Origin matches some
> strings.
> 
> We use this to override the CORS for access via our customer interface panel,
> so regardless of what CORS they set on the bucket, the panel always works.
> 
> frontend ...
>  # Store variable for using later in the response.
>  http-request set-var(txn.origin) req.hdr(Origin)
>  acl override_cors var(txn.origin) -m end -i SOMEDOMAIN
>  acl override_cors var(txn.origin) -m sub -i SOMEDOMAIN
>  # Export fact as a boolean
>  http-request set-var(txn.override_cors) bool(true) if override_cors
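
Robin's snippet continues beyond what is quoted here; a possible response-side
counterpart (my own sketch for HAProxy 1.6+, not his actual config) could look like
this. The 403 that RGW returns for the OPTIONS preflight itself would still need
separate handling:

backend rgw
    # override whatever CORS headers RGW produced when the Origin matched
    http-response set-header Access-Control-Allow-Origin %[var(txn.origin)] if { var(txn.override_cors) -m bool }
    http-response set-header Access-Control-Allow-Methods "GET, PUT, POST, DELETE, HEAD, OPTIONS" if { var(txn.override_cors) -m bool }
    http-response set-header Access-Control-Allow-Headers "*" if { var(txn.override_cors) -m bool }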

[ceph-users] Problems with CORS

2017-10-22 Thread Rudenko Aleksandr
Hi guys,

We use Ceph as an S3-compatible object store, and we have a self-developed web
interface for our customers on a different domain.

Currently we use Hammer (FCGI + Apache as the RGW frontend), but we plan to upgrade
Ceph from Hammer to Luminous.

In the Luminous release the FCGI frontend was dropped, and the civetweb frontend
always checks CORS.

As I said above, our web interface runs on a different domain, and by default civetweb
returns 403 on HTTP OPTIONS requests coming from our web interface.

If I PUT a CORS configuration on a bucket (see the sketch below), everything works
fine. But that isn't a good idea, because the bucket owner can PUT their own CORS
configuration, overwrite our default CORS and lose access to the bucket from our web
interface. It's not cool :)

In the past we rewrote the HTTP response headers with Apache rules for our web
interface and passed the CORS check. But now it seems impossible to solve at the
balancer level.

What is the right way?
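
For reference, putting a default CORS document on a bucket with awscli looks roughly
like this (the domain and bucket name are placeholders):

aws --endpoint-url https://storage.example.ru s3api put-bucket-cors --bucket my-bucket --cors-configuration file://cors.json

where cors.json contains:

{
  "CORSRules": [
    {
      "AllowedOrigins": ["https://console.example.ru"],
      "AllowedMethods": ["GET", "PUT", "POST", "DELETE", "HEAD"],
      "AllowedHeaders": ["*"],
      "MaxAgeSeconds": 3000
    }
  ]
}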

---
Best regards,

Aleksandr Rudenko




Re: [ceph-users] [rgw][s3] Object not in objects list

2017-08-31 Thread Rudenko Aleksandr
Hi,

Does anyone have any thoughts?

---
Best regards,

Alexander Rudenko



On 30 Aug 2017, at 12:28, Rudenko Aleksandr <arude...@croc.ru> wrote:

Hi,

I use Ceph 0.94.10 (Hammer) with radosgw as an S3-compatible object store.

I have a few objects in one bucket with a strange problem.

I use awscli as s3 client.

GET/HEAD requests for these objects work fine, but listing doesn't:
I don't see these objects in the bucket listing.

Object metadata:

radosgw-admin bi list --bucket={my-bucket} --object={my-object}

This returns [].

But:

rados -p .rgw.buckets stat default.32785769.2_{my-object}

.rgw.buckets/default.32785769.2_{my-object} mtime 2017-08-15 18:07:29.00, 
size 97430


Bucket versioning is not enabled.
The bucket has more than 13M objects.

Where can I find the problem?

---
Best regards,

Alexander Rudenko






[ceph-users] [rgw][s3] Object not in objects list

2017-08-30 Thread Rudenko Aleksandr
Hi,

I use Ceph 0.94.10 (Hammer) with radosgw as an S3-compatible object store.

I have a few objects in one bucket with a strange problem.

I use awscli as s3 client.

GET/HEAD requests for these objects work fine, but listing doesn't:
I don't see these objects in the bucket listing.

Object metadata:

radosgw-admin bi list --bucket={my-bucket} --object={my-object}

This returns [].

But:

rados -p .rgw.buckets stat default.32785769.2_{my-object}

.rgw.buckets/default.32785769.2_{my-object} mtime 2017-08-15 18:07:29.00, 
size 97430


Bucket versioning is not enabled.
The bucket has more than 13M objects.

Where can I find the problem?
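
A sketch of what can be tried for a listing/index mismatch like this (the subcommands
exist in Hammer's radosgw-admin; the --fix variant rewrites index entries, and with
13M+ objects --check-objects can take a long time, so use it with care):

# radosgw-admin bucket check --bucket=my-bucket
# radosgw-admin bucket check --bucket=my-bucket --check-objects --fix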

---
Best regards,

Alexander Rudenko


