[ceph-users] BUG 14154 on erasure coded PG

2016-09-09 Thread Gerd Jakobovitsch

Dear all,

I am using an erasure-coded pool, and I have gotten into a situation where I'm
not able to recover a PG. The OSDs that contain this PG keep crashing, with the
same behavior registered at http://tracker.ceph.com/issues/14154.


I'm using ceph 0.94.9 (the problem first appeared on 0.94.7; the upgrade didn't
solve it) on CentOS 7.2, kernel 3.10.0-327.18.2.el7.x86_64.


My EC profile:

directory=/usr/lib64/ceph/erasure-code
k=3
m=2
plugin=isa
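
For reference, a profile like this can be created with something along these
lines (the profile name, pool name and pg count below are just examples):

ceph osd erasure-code-profile set ec-k3m2-isa plugin=isa k=3 m=2
ceph osd pool create ecpool 1024 1024 erasure ec-k3m2-isa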

Is this issue being handled? Is there any hint on how to handle it?


Re: [ceph-users] Recovering full OSD

2016-08-08 Thread Gerd Jakobovitsch
I got into this situation several times, due to strange behavior in the xfs
filesystem - I initially ran on debian, and afterwards reinstalled the nodes
with centos7, kernel 3.10.0-229.14.1.el7.x86_64, package
xfsprogs-3.2.1-6.el7.x86_64. At around 75-80% usage as shown by df, the disk
is already full.


To delete PGs in order to restart the OSD, I first lowered the weight of 
the affected OSD, and observed which PGs started backfilling elsewhere. 
Then I deleted some of these backfilling PGs before trying to restart 
the OSD. It worked without data loss.
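
Roughly, the sequence was something like this (the OSD and PG ids below are
just examples; only delete a PG copy that has already fully backfilled
elsewhere, and keep the OSD stopped while doing it):

ceph osd reweight 12 0.8                  # lower the weight so PGs start moving
ceph pg dump pgs_brief | grep backfill    # note which PGs are backfilling elsewhere
service ceph stop osd.12
rm -rf /var/lib/ceph/osd/ceph-12/current/34.1f_head   # one of the backfilling PGs
service ceph start osd.12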



On 08-08-2016 08:19, Mykola Dvornik wrote:

@Shinobu

According to
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/

"If you cannot start an OSD because it is full, you may delete some 
data by deleting some placement group directories in the full OSD."



On 8 August 2016 at 13:16, Shinobu Kinjo <shinobu...@gmail.com> wrote:


On Mon, Aug 8, 2016 at 8:01 PM, Mykola Dvornik <mykola.dvor...@gmail.com> wrote:
> Dear ceph community,
>
> One of the OSDs in my cluster cannot start due to the
>
> ERROR: osd init failed: (28) No space left on device
>
> A while ago it was recommended to manually delete PGs on the OSD to let it
> start.

Who recommended that?

>
> So I am wondering what is the recommended way to fix this issue for a
> cluster running the Jewel release (10.2.2)?
>
> Regards,
>
> --
>  Mykola
>



--
Email:
shin...@linux.com
shin...@redhat.com




--
 Mykola




[ceph-users] Lost access when removing cache pool overlay

2016-01-29 Thread Gerd Jakobovitsch

Dear all,

I had to move the .rgw.buckets.index pool to another structure; therefore, I
created a new pool, .rgw.buckets.index.new; added the old pool as a cache
pool; and flushed the data.
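
Roughly, the sequence I used looked like this (from memory; the pg count and
exact flags are examples, so double-check against the cache tiering docs for
your release):

ceph osd pool create .rgw.buckets.index.new 64 64
ceph osd tier add .rgw.buckets.index.new .rgw.buckets.index --force-nonempty
ceph osd tier cache-mode .rgw.buckets.index forward
ceph osd tier set-overlay .rgw.buckets.index.new .rgw.buckets.index
rados -p .rgw.buckets.index cache-flush-evict-all
ceph osd tier remove-overlay .rgw.buckets.index.new
ceph osd tier remove .rgw.buckets.index.new .rgw.buckets.index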


Up to this point everything was OK. With radosgw -p  df, I saw the objects
moving to the new pool; the moved objects were OK - I could list omap keys
and so on.


When everything had been moved, I removed the overlay cache pool. But at that
moment, the objects became unresponsive:


[(13:39:20) ceph@spchaog1 ~]$ rados -p .rgw.buckets.index listomapkeys 
.dir.default.198764998.1
error getting omap key set .rgw.buckets.index/.dir.default.198764998.1: 
(5) Input/output error


That happens for all objects. When trying to access the bucket through
radosgw, I also get errors:


[(13:16:01) root@spcogp1 ~]# radosgw-admin bucket stats --bucket="mybucket"
error getting bucket stats ret=-2

Looking at the disk, data seems to be there:

[(13:47:10) root@spcsnp1 ~]# ls /var/lib/ceph/osd/ceph-23/current/34.1f_head/ | grep 198764998.1
\.dir.default.198764998.1__head_8A7482FF__22

Does anyone have a hint? Could I have lost ownership of the objects?

Regards.





Re: [ceph-users] Lost access when removing cache pool overlay

2016-01-29 Thread Gerd Jakobovitsch
Thank you for the response. It seems to me this is a transient situation. At
this moment, I have regained access to most, but not all, buckets/index
objects. But the overall performance has dropped once again - I already have
huge performance issues.


Regards.

On 29-01-2016 14:41, Robert LeBlanc wrote:


Does the client key have access to the base pool? Something similar bit
us when adding a caching tier. Since the cache tier may be proxying
all the I/O, the client may not have had access to the base pool and
it still worked OK. Once you removed the cache tier, it could no
longer access the pool.
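
Something along these lines is worth checking (the client name is just an
example; adapt the caps to your pools):

ceph auth get client.radosgw.gateway
ceph auth caps client.radosgw.gateway mon 'allow rwx' osd 'allow rwx'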
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Jan 29, 2016 at 8:47 AM, Gerd Jakobovitsch  wrote:

Dear all,

I had to move .rgw.buckets.index pool to another structure; therefore, I
created a new pool .rgw.buckets.index.new ; added the old pool as cache
pool, and flushed the data.

Up to this moment everything was ok. With radosgw -p  df, I saw the
objects moving to the new pool; the moved objects where ok, I could list
omap keys and so on.

When everything got moved, I removed the overlay cache pool. But at this
moment, the objects became unresponsive:

[(13:39:20) ceph@spchaog1 ~]$ rados -p .rgw.buckets.index listomapkeys
.dir.default.198764998.1
error getting omap key set .rgw.buckets.index/.dir.default.198764998.1: (5)
Input/output error

That happens to all objects. When trying the access to the bucket through
radosgw, I also get problems:

[(13:16:01) root@spcogp1 ~]# radosgw-admin bucket stats --bucket="mybucket"
error getting bucket stats ret=-2

Looking at the disk, data seems to be there:

[(13:47:10) root@spcsnp1 ~]# ls
/var/lib/ceph/osd/ceph-23/current/34.1f_head/|grep 198764998.1
\.dir.default.198764998.1__head_8A7482FF__22

Does anyone have a hint? Could I have lost ownership of the objects?

Regards.




[ceph-users] leveldb on OSD with missing file after hard boot

2016-01-27 Thread Gerd Jakobovitsch

Hello all,

I had a hard reset on a ceph node, and one of the OSDs is not starting due to
a leveldb error. At that moment, the node was starting up, but there was no
actual writing of new data:


2016-01-27 12:00:37.068431 7f367f654880  0 ceph version 0.94.5 
(9764da52395923e0b32908d83a9f7304401fee43), process ceph-osd, pid 24734
2016-01-27 12:00:37.115800 7f367f654880  0 
filestore(/var/lib/ceph/osd/ceph-26) backend xfs (magic 0x58465342)
2016-01-27 12:00:37.133031 7f367f654880  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-26) detect_features: 
FIEMAP ioctl is supported and appears to work
2016-01-27 12:00:37.133042 7f367f654880  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-26) detect_features: 
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2016-01-27 12:00:37.136538 7f367f654880  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-26) detect_features: 
syncfs(2) syscall fully supported (by glibc and kernel)
2016-01-27 12:00:37.137584 7f367f654880  0 
xfsfilestorebackend(/var/lib/ceph/osd/ceph-26) detect_feature: extsize 
is supported and kernel 3.10.0-123.el7.x86_64 >= 3.5
2016-01-27 12:00:37.176226 7f367f654880 -1 
filestore(/var/lib/ceph/osd/ceph-26) Error initializing leveldb : 
Corruption: 1 missing files; e.g.: 
/var/lib/ceph/osd/ceph-26/current/omap/075074.sst


2016-01-27 12:00:37.176286 7f367f654880 -1 osd.26 0 OSD:init: unable to 
mount object store
2016-01-27 12:00:37.176315 7f367f654880 -1  ** ERROR: osd init failed: 
(1) Operation not permitted


The file 075074.sst is missing indeed.

Since I was not able to restart the OSD, and I could not find information on
recovering the leveldb, I marked the OSD as lost, but then I got 3 incomplete
PGs. I tried to follow the recovery howto at
https://ceph.com/community/incomplete-pgs-oh-my/, but it hit the same leveldb
error, since it has the same dependency.


Is there any means to recover from this situation - to check and repair the
leveldb as well as possible? Or, alternatively, to get rid of the incomplete
status, even at the penalty of losing some objects?
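
One thing I am considering is a plain leveldb repair on the omap directory -
this is only a sketch, it assumes the python leveldb bindings (py-leveldb) are
installed, and I don't know whether RepairDB can cope with a missing .sst file:

service ceph stop osd.26
cp -a /var/lib/ceph/osd/ceph-26/current/omap /root/omap-backup
python -c "import leveldb; leveldb.RepairDB('/var/lib/ceph/osd/ceph-26/current/omap')"
service ceph start osd.26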


Regards.




[ceph-users] One object in .rgw.buckets.index causes systemic instability

2015-11-03 Thread Gerd Jakobovitsch

Dear all,

I have a cluster running hammer (0.94.5), with 5 nodes. The main usage is for
S3-compatible object storage.
I am running into a very troublesome problem on this cluster. A single object
in .rgw.buckets.index is not responding to requests and takes a very long time
to recover after an OSD restart. During this time, the OSDs where this object
is mapped get heavily loaded, with high CPU as well as memory usage. At the
same time, the directory /var/lib/ceph/osd/ceph-XX/current/omap gets a large
number of entries (> 1) that won't decrease.


Very frequently, I get >100 blocked requests for this object, and the main OSD
that stores it ends up accepting no other requests. Quite often the OSD ends up
crashing due to the filestore timeout, and getting it up again is very
troublesome - it usually has to run alone on the node for a long time, until
the object somehow gets recovered.


In the OSD logs, there are several entries like these:

 -7051> 2015-11-03 10:46:08.339283 7f776974f700 10 log_client logged 2015-11-03 10:46:02.942023 osd.63 10.17.0.9:6857/2002 41 : cluster [WRN] slow request 120.003081 seconds old, received at 2015-11-03 10:43:56.472825: osd_repop(osd.53.236531:7 34.7 8a7482ff/.dir.default.198764998.1/head//34 v 236984'22) currently commit_sent


2015-11-03 10:28:32.405265 7f0035982700  0 log_channel(cluster) log [WRN] : 97 slow requests, 1 included below; oldest blocked for > 2046.502848 secs
2015-11-03 10:28:32.405269 7f0035982700  0 log_channel(cluster) log [WRN] : slow request 1920.676998 seconds old, received at 2015-11-03 09:56:31.728224: osd_op(client.210508702.0:14696798 .dir.default.198764998.1 [call rgw.bucket_prepare_op] 15.8a7482ff ondisk+write+known_if_redirected e236956) currently waiting for blocked object

Is there any way to go deeper into this problem, or to rebuild the .rgw index
without losing data? I currently have 30 TB of data in the cluster - most of
it concentrated in a handful of buckets - that I can't lose.
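
Would something like the following be safe to run against a bucket in this
state (the bucket name is just a placeholder)?

radosgw-admin bucket check --bucket=mybucket
radosgw-admin bucket check --bucket=mybucket --fix --check-objects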


Regards.


[ceph-users] ISA erasure code plugin in debian

2015-09-15 Thread Gerd Jakobovitsch

Dear all,

I have a ceph cluster deployed on debian; I'm trying to test ISA erasure-coded
pools, but the plugin (libec_isa.so) is not included in the erasure-code
plugin directory.


Looking at the packages in the debian Ceph repository, I found a "trusty"
package that includes the plugin. Is it meant to be used with debian? For
which version? Is there any documentation for it?


Otherwise, is there any other way to get the plugin working? Are there 
any kernel requirements?
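
For reference, the check I ran and the test I intend to run, roughly (the
plugin path and the names are examples and may differ per distro):

ls /usr/lib/ceph/erasure-code/ | grep -i isa
ceph osd erasure-code-profile set isa-test plugin=isa k=3 m=2
ceph osd pool create ecpool-isa-test 64 64 erasure isa-test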


Regards.





Re: [ceph-users] PGs stuck stale during data migration and OSD restart

2015-08-31 Thread Gerd Jakobovitsch
I tried pg query, but it doesn't return - it hangs forever. As I understand
it, when the PG is stale, there is no OSD to answer the query. Am I right?


I changed the tunables in 2 steps, but didn't wait for all the data to be
moved before doing the second step.

I rolled back to intermediate tunables - undefining the optimization below:

chooseleaf_descend_once: Whether a recursive chooseleaf attempt will 
retry, or only try once and allow the original placement to retry. 
Legacy default is 0, optimal value is 1.


Doing so, the stale PGs immediately disappeared. Since I rolled back, I
can't give you the output of ceph -s.
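
For reference, one way to flip just that single tunable is via the decompiled
crush map - this is only a sketch; the preset profiles via 'ceph osd crush
tunables' are the other option:

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt: set 'tunable chooseleaf_descend_once 0'
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new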


I believe part of the issue is related to under-dimensioned hardware. The OSDs
are being killed by the watchdog, and memory is swapping. But even so, I
didn't expect to lose the data mapping.


Regards.

On 31-08-2015 05:48, Gregory Farnum wrote:

On Sat, Aug 29, 2015 at 11:50 AM, Gerd Jakobovitsch <g...@mandic.net.br> wrote:

Dear all,

During a cluster reconfiguration (change of crush tunables from legacy to
TUNABLES2) with large data replacement, several OSDs get overloaded and had
to be restarted; when OSDs stabilize, I got a number of PGs marked stale,
even when all OSDs where this data used to be located show up again.

When I look at the OSDs current directory for the last placement, there is
still some data. But it never shows up again.

Is there any way to force these OSDs to resume being used?

This sounds very strange. Can you provide the output of "ceph -s" and
run pg query against one of the stuck PGs?
-Greg




[ceph-users] PGs stuck stale during data migration and OSD restart

2015-08-29 Thread Gerd Jakobovitsch

Dear all,

During a cluster reconfiguration (change of crush tunables from legacy to
TUNABLES2) with a lot of data movement, several OSDs got overloaded and had to
be restarted; when the OSDs stabilized, I got a number of PGs marked stale,
even though all the OSDs where this data used to be located came up again.


When I look at the OSDs' current directories for the last placement, there is
still some data there. But it never shows up again.


Is there any way to force these OSDs to resume being used?
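
For reference, this is roughly what I have been looking at so far (the pg id
is just an example):

ceph pg dump_stuck stale
ceph pg map 34.1f
ceph pg 34.1f query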

regards.



Re: [ceph-users] Fwd: OSD crashes after upgrade to 0.80.10

2015-08-12 Thread Gerd Jakobovitsch

An update:

It seems that I am running into memory shortage. Even with 32 GB for 20 OSDs
and 2 GB of swap, ceph-osd uses all available memory.
I created another swap device with 10 GB, and I managed to get the failed OSD
running without a crash, but consuming an extra 5 GB.
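
The extra swap was added roughly like this (a 10 GB file is shown here as an
example; a dedicated partition works the same way with mkswap/swapon):

dd if=/dev/zero of=/swapfile.extra bs=1M count=10240
chmod 600 /swapfile.extra
mkswap /swapfile.extra
swapon /swapfile.extra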

Are there known issues regarding memory on ceph osd?

But I still get the problem of the incomplete+inactive PG.

Regards.

Gerd

On 12-08-2015 10:11, Gerd Jakobovitsch wrote:

I tried it, the error propagates to whichever OSD gets the errorred PG.

For the moment, this is my worst problem. I have one PG 
incomplete+inactive, and the OSD with the highest priority in it gets 
100 blocked requests (I guess that is the maximum), and, although 
running, doesn't get other requests - for example, ceph tell osd.21 
injectargs '--osd-max-backfills 1'. After some time, it crashes, and 
the blocked requests go to the second OSD for the errorred PG. I can't 
get rid of these slow requests.


I guessed a problem with leveldb, I checked, and had the default 
version for debian wheezy (0+20120530.gitdd0d562-1). I updated it for 
wheezy-backports (1.17-1~bpo70+1), but the error was the same.


I use regular wheezy kernel (3.2+46).

On 11-08-2015 23:52, Haomai Wang wrote:

it seems like a leveldb problem. could you just kick it out and add a
new osd to make cluster healthy firstly?

On Wed, Aug 12, 2015 at 1:31 AM, Gerd Jakobovitsch g...@mandic.net.br wrote:

Dear all,

I run a ceph system with 4 nodes and ~80 OSDs using xfs, with currently 75%
usage, running firefly. On friday I upgraded it from 0.80.8 to 0.80.10, and
since then I got several OSDs crashing and never recovering: trying to run
it, ends up crashing as follows.

Is this problem known? Is there any configuration that should be checked?
Any way to try to recover these OSDs without losing all data?

After that, setting the OSD to lost, I got one incomplete, inactive PG. Is
there any way to recover it? Data still exists in crashed OSDs.

Regards.

[(12:58:13) root@spcsnp3 ~]# service ceph start osd.7
=== osd.7 ===
2015-08-11 12:58:21.003876 7f17ed52b700  1 monclient(hunting): found
mon.spcsmp2
2015-08-11 12:58:21.003915 7f17ef493700  5 monclient: authenticate success,
global_id 206010466
create-or-move updated item name 'osd.7' weight 3.64 at location
{host=spcsnp3,root=default} to crush map
Starting Ceph osd.7 on spcsnp3...
2015-08-11 12:58:21.279878 7f200fa8f780  0 ceph version 0.80.10
(ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 31918
starting osd.7 at :/0 osd_data /var/lib/ceph/osd/ceph-7
/var/lib/ceph/osd/ceph-7/journal
[(12:58:21) root@spcsnp3 ~]# 2015-08-11 12:58:21.348094 7f200fa8f780 10
filestore(/var/lib/ceph/osd/ceph-7) dump_stop
2015-08-11 12:58:21.348291 7f200fa8f780  5
filestore(/var/lib/ceph/osd/ceph-7) basedir /var/lib/ceph/osd/ceph-7 journal
/var/lib/ceph/osd/ceph-7/journal
2015-08-11 12:58:21.348326 7f200fa8f780 10
filestore(/var/lib/ceph/osd/ceph-7) mount fsid is
54c136da-c51c-4799-b2dc-b7988982ee00
2015-08-11 12:58:21.349010 7f200fa8f780  0
filestore(/var/lib/ceph/osd/ceph-7) mount detected xfs (libxfs)
2015-08-11 12:58:21.349026 7f200fa8f780  1
filestore(/var/lib/ceph/osd/ceph-7)  disabling 'filestore replica fadvise'
due to known issues with fadvise(DONTNEED) on xfs
2015-08-11 12:58:21.353277 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
ioctl is supported and appears to work
2015-08-11 12:58:21.353302 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2015-08-11 12:58:21.362106 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features:
syscall(SYS_syncfs, fd) fully supported
2015-08-11 12:58:21.362195 7f200fa8f780  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is
disabled by conf
2015-08-11 12:58:21.362701 7f200fa8f780  5
filestore(/var/lib/ceph/osd/ceph-7) mount op_seq is 35490995
2015-08-11 12:58:59.383179 7f200fa8f780 -1 *** Caught signal (Aborted) **
  in thread 7f200fa8f780

  ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
  1: /usr/bin/ceph-osd() [0xab7562]
  2: (()+0xf0a0) [0x7f200efcd0a0]
  3: (gsignal()+0x35) [0x7f200db3f165]
  4: (abort()+0x180) [0x7f200db423e0]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f200e39589d]
  6: (()+0x63996) [0x7f200e393996]
  7: (()+0x639c3) [0x7f200e3939c3]
  8: (()+0x63bee) [0x7f200e393bee]
  9: (tc_new()+0x48e) [0x7f200f213aee]
  10: (std::string::_Rep::_S_create(unsigned long, unsigned long,
std::allocatorchar const)+0x59) [0x7f200e3ef999]
  11: (std::string::_Rep::_M_clone(std::allocatorchar const, unsigned
long)+0x28) [0x7f200e3f0708]
  12: (std::string::reserve(unsigned long)+0x30) [0x7f200e3f07f0]
  13: (std::string::append(char const*, unsigned long)+0xb5) [0x7f200e3f0ab5]
  14: (leveldb::log::Reader::ReadRecord(leveldb::Slice*, std::string*)+0x2a2)
[0x7f200f46ffa2]
  15: (leveldb::DBImpl

Re: [ceph-users] Fwd: OSD crashes after upgrade to 0.80.10

2015-08-12 Thread Gerd Jakobovitsch

I tried it, the error propagates to whichever OSD gets the errored PG.

For the moment, this is my worst problem. I have one PG incomplete+inactive,
and the OSD with the highest priority for it gets 100 blocked requests (I
guess that is the maximum) and, although running, doesn't respond to other
requests - for example, ceph tell osd.21 injectargs '--osd-max-backfills 1'.
After some time it crashes, and the blocked requests go to the second OSD for
the errored PG. I can't get rid of these slow requests.


I suspected a problem with leveldb; I checked, and had the default version
for debian wheezy (0+20120530.gitdd0d562-1). I updated it to the
wheezy-backports version (1.17-1~bpo70+1), but the error was the same.

I use the regular wheezy kernel (3.2+46).

On 11-08-2015 23:52, Haomai Wang wrote:

it seems like a leveldb problem. could you just kick it out and add a
new osd to make cluster healthy firstly?

On Wed, Aug 12, 2015 at 1:31 AM, Gerd Jakobovitsch g...@mandic.net.br wrote:


Dear all,

I run a ceph system with 4 nodes and ~80 OSDs using xfs, with currently 75%
usage, running firefly. On friday I upgraded it from 0.80.8 to 0.80.10, and
since then I got several OSDs crashing and never recovering: trying to run
it, ends up crashing as follows.

Is this problem known? Is there any configuration that should be checked?
Any way to try to recover these OSDs without losing all data?

After that, setting the OSD to lost, I got one incomplete, inactive PG. Is
there any way to recover it? Data still exists in crashed OSDs.

Regards.

[(12:58:13) root@spcsnp3 ~]# service ceph start osd.7
=== osd.7 ===
2015-08-11 12:58:21.003876 7f17ed52b700  1 monclient(hunting): found
mon.spcsmp2
2015-08-11 12:58:21.003915 7f17ef493700  5 monclient: authenticate success,
global_id 206010466
create-or-move updated item name 'osd.7' weight 3.64 at location
{host=spcsnp3,root=default} to crush map
Starting Ceph osd.7 on spcsnp3...
2015-08-11 12:58:21.279878 7f200fa8f780  0 ceph version 0.80.10
(ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 31918
starting osd.7 at :/0 osd_data /var/lib/ceph/osd/ceph-7
/var/lib/ceph/osd/ceph-7/journal
[(12:58:21) root@spcsnp3 ~]# 2015-08-11 12:58:21.348094 7f200fa8f780 10
filestore(/var/lib/ceph/osd/ceph-7) dump_stop
2015-08-11 12:58:21.348291 7f200fa8f780  5
filestore(/var/lib/ceph/osd/ceph-7) basedir /var/lib/ceph/osd/ceph-7 journal
/var/lib/ceph/osd/ceph-7/journal
2015-08-11 12:58:21.348326 7f200fa8f780 10
filestore(/var/lib/ceph/osd/ceph-7) mount fsid is
54c136da-c51c-4799-b2dc-b7988982ee00
2015-08-11 12:58:21.349010 7f200fa8f780  0
filestore(/var/lib/ceph/osd/ceph-7) mount detected xfs (libxfs)
2015-08-11 12:58:21.349026 7f200fa8f780  1
filestore(/var/lib/ceph/osd/ceph-7)  disabling 'filestore replica fadvise'
due to known issues with fadvise(DONTNEED) on xfs
2015-08-11 12:58:21.353277 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
ioctl is supported and appears to work
2015-08-11 12:58:21.353302 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2015-08-11 12:58:21.362106 7f200fa8f780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features:
syscall(SYS_syncfs, fd) fully supported
2015-08-11 12:58:21.362195 7f200fa8f780  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is
disabled by conf
2015-08-11 12:58:21.362701 7f200fa8f780  5
filestore(/var/lib/ceph/osd/ceph-7) mount op_seq is 35490995
2015-08-11 12:58:59.383179 7f200fa8f780 -1 *** Caught signal (Aborted) **
  in thread 7f200fa8f780

  ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
  1: /usr/bin/ceph-osd() [0xab7562]
  2: (()+0xf0a0) [0x7f200efcd0a0]
  3: (gsignal()+0x35) [0x7f200db3f165]
  4: (abort()+0x180) [0x7f200db423e0]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f200e39589d]
  6: (()+0x63996) [0x7f200e393996]
  7: (()+0x639c3) [0x7f200e3939c3]
  8: (()+0x63bee) [0x7f200e393bee]
  9: (tc_new()+0x48e) [0x7f200f213aee]
  10: (std::string::_Rep::_S_create(unsigned long, unsigned long,
std::allocatorchar const)+0x59) [0x7f200e3ef999]
  11: (std::string::_Rep::_M_clone(std::allocatorchar const, unsigned
long)+0x28) [0x7f200e3f0708]
  12: (std::string::reserve(unsigned long)+0x30) [0x7f200e3f07f0]
  13: (std::string::append(char const*, unsigned long)+0xb5) [0x7f200e3f0ab5]
  14: (leveldb::log::Reader::ReadRecord(leveldb::Slice*, std::string*)+0x2a2)
[0x7f200f46ffa2]
  15: (leveldb::DBImpl::RecoverLogFile(unsigned long, leveldb::VersionEdit*,
unsigned long*)+0x180) [0x7f200f468360]
  16: (leveldb::DBImpl::Recover(leveldb::VersionEdit*)+0x5c2)
[0x7f200f46adf2]
  17: (leveldb::DB::Open(leveldb::Options const, std::string const,
leveldb::DB**)+0xff) [0x7f200f46b11f]
  18: (LevelDBStore::do_open(std::ostream, bool)+0xd8) [0xa123a8]
  19: (FileStore::mount()+0x18e0) [0x9b7080]
  20: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78f52a

[ceph-users] OSD crashes when starting

2015-08-07 Thread Gerd Jakobovitsch

Dear all,

I got to an unrecoverable crash on one specific OSD, every time I try to
restart it. It happened first on firefly 0.80.8; I updated to 0.80.10, but it
continued to happen.


Due to this failure, I have several PGs down+peering that won't recover even
after marking the OSD out.


Could someone help me? Is it possible to edit/rebuild the leveldb-based 
log that seems to be causing the problem?


Here is what the logfile informs me:

[(12:54:45) root@spcsnp2 ~]# service ceph start osd.31
=== osd.31 ===
create-or-move updated item name 'osd.31' weight 2.73 at location 
{host=spcsnp2,root=default} to crush map

Starting Ceph osd.31 on spcsnp2...
starting osd.31 at :/0 osd_data /var/lib/ceph/osd/ceph-31 
/var/lib/ceph/osd/ceph-31/journal
2015-08-07 12:55:12.916880 7fd614c8f780  0 ceph version 0.80.10 
(ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 23260
[(12:55:12) root@spcsnp2 ~]# 2015-08-07 12:55:12.928614 7fd614c8f780  0 
filestore(/var/lib/ceph/osd/ceph-31) mount detected xfs (libxfs)
2015-08-07 12:55:12.928622 7fd614c8f780  1 
filestore(/var/lib/ceph/osd/ceph-31)  disabling 'filestore replica 
fadvise' due to known issues with fadvise(DONTNEED) on xfs
2015-08-07 12:55:12.931410 7fd614c8f780  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-31) detect_features: 
FIEMAP ioctl is supported and appears to work
2015-08-07 12:55:12.931419 7fd614c8f780  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-31) detect_features: 
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-08-07 12:55:12.939290 7fd614c8f780  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-31) detect_features: 
syscall(SYS_syncfs, fd) fully supported
2015-08-07 12:55:12.939326 7fd614c8f780  0 
xfsfilestorebackend(/var/lib/ceph/osd/ceph-31) detect_feature: extsize 
is disabled by conf

2015-08-07 12:55:45.587019 7fd614c8f780 -1 *** Caught signal (Aborted) **
 in thread 7fd614c8f780

 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
 1: /usr/bin/ceph-osd() [0xab7562]
 2: (()+0xf030) [0x7fd6141ce030]
 3: (gsignal()+0x35) [0x7fd612d41475]
 4: (abort()+0x180) [0x7fd612d446f0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd61359689d]
 6: (()+0x63996) [0x7fd613594996]
 7: (()+0x639c3) [0x7fd6135949c3]
 8: (()+0x63bee) [0x7fd613594bee]
 9: (tc_new()+0x48e) [0x7fd614414aee]
 10: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7fd6135f0999]
 11: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x28) [0x7fd6135f1708]
 12: (std::string::reserve(unsigned long)+0x30) [0x7fd6135f17f0]
 13: (std::string::append(char const*, unsigned long)+0xb5) [0x7fd6135f1ab5]
 14: (leveldb::log::Reader::ReadRecord(leveldb::Slice*, std::string*)+0x2a2) [0x7fd614670fa2]
 15: (leveldb::DBImpl::RecoverLogFile(unsigned long, leveldb::VersionEdit*, unsigned long*)+0x180) [0x7fd614669360]
 16: (leveldb::DBImpl::Recover(leveldb::VersionEdit*)+0x5c2) [0x7fd61466bdf2]
 17: (leveldb::DB::Open(leveldb::Options const&, std::string const&, leveldb::DB**)+0xff) [0x7fd61466c11f]
 18: (LevelDBStore::do_open(std::ostream&, bool)+0xd8) [0xa123a8]
 19: (FileStore::mount()+0x18e0) [0x9b7080]
 20: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78f52a]
 21: (main()+0x2234) [0x7331c4]
 22: (__libc_start_main()+0xfd) [0x7fd612d2dead]
 23: /usr/bin/ceph-osd() [0x736e99]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


--- begin dump of recent events ---
   -56 2015-08-07 12:55:12.915675 7fd614c8f780  5 asok(0x1a20230) 
register_command perfcounters_dump hook 0x1a10010
   -55 2015-08-07 12:55:12.915697 7fd614c8f780  5 asok(0x1a20230) 
register_command 1 hook 0x1a10010
   -54 2015-08-07 12:55:12.915700 7fd614c8f780  5 asok(0x1a20230) 
register_command perf dump hook 0x1a10010
   -53 2015-08-07 12:55:12.915704 7fd614c8f780  5 asok(0x1a20230) 
register_command perfcounters_schema hook 0x1a10010
   -52 2015-08-07 12:55:12.915706 7fd614c8f780  5 asok(0x1a20230) 
register_command 2 hook 0x1a10010
   -51 2015-08-07 12:55:12.915709 7fd614c8f780  5 asok(0x1a20230) 
register_command perf schema hook 0x1a10010
   -50 2015-08-07 12:55:12.915711 7fd614c8f780  5 asok(0x1a20230) 
register_command config show hook 0x1a10010
   -49 2015-08-07 12:55:12.915714 7fd614c8f780  5 asok(0x1a20230) 
register_command config set hook 0x1a10010
   -48 2015-08-07 12:55:12.915716 7fd614c8f780  5 asok(0x1a20230) 
register_command config get hook 0x1a10010
   -47 2015-08-07 12:55:12.915718 7fd614c8f780  5 asok(0x1a20230) 
register_command log flush hook 0x1a10010
   -46 2015-08-07 12:55:12.915721 7fd614c8f780  5 asok(0x1a20230) 
register_command log dump hook 0x1a10010
   -45 2015-08-07 12:55:12.915723 7fd614c8f780  5 asok(0x1a20230) 
register_command log reopen hook 0x1a10010
   -44 2015-08-07 12:55:12.916880 7fd614c8f780  0 ceph version 0.80.10 
(ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 23260
   -43 2015-08-07 

Re: [ceph-users] Uploading large files to swift interface on radosgw

2013-09-19 Thread Gerd Jakobovitsch

Thank you very much, now it worked, with the value you suggested.

Regards.

On 09/19/2013 12:10 PM, Yehuda Sadeh wrote:

Now you're hitting issue #6336 (it's a regression in dumpling that
we'll fix soon). The current workaround is setting the following in
your osd:

osd max attr size = large number here

try a value of 10485760 (10M) which I think is large enough.
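
For example, either in ceph.conf under [osd] (and restart the OSDs):

[osd]
osd max attr size = 10485760

or injected at runtime on all OSDs:

ceph tell osd.\* injectargs '--osd_max_attr_size 10485760'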

Yehuda



On Thu, Sep 19, 2013 at 7:30 AM, Gerd Jakobovitsch g...@mandic.net.br wrote:

Hello Yehuda, thank you for your help.


On 09/17/2013 08:35 PM, Yehuda Sadeh wrote:

On Tue, Sep 17, 2013 at 3:21 PM, Gerd Jakobovitsch g...@mandic.net.br wrote:

Hi all,

I am testing a ceph environment installed in debian wheezy, and, when
testing file upload of more than 1 GB, I am getting errors. For files larger
than 5 GB, I get a 400 Bad Request   EntityTooLarge response; looking at

The EntityTooLarge is expected, as there's a 5GB limit on objects.
Bigger objects need to be uploaded using the large object api.


the radosgw server, I notice that only the apache process is consuming cpu
time, and I only have traffic on the external interface used by apache.
For files between 2 GB  and 5 GB, I get stuck for a very long time, and I
see relatively high processing for both apache and radosgw. Finally, I get a
response 500 Internal Server Error UnknownError. The object is created on
rados, but is empty.

I am wondering whether there are any configuration I should change on
apache, fastcgi or rgw, or if there are hardware limitations.

Apache and fastCGI where installed from the distro. My ceph configuration:

Are you by any chance using the fcgi module rather than the fastcgi
module? It had a problem with caching the entire object before sending
it to the backend, which would result in the same symptoms as you just
described.

Yehuda

Well, I followed the installation instructions, which explicitly refer to
fastcgi. Now I have disabled the cgid module and repeated the test: I got the
same problem.

Apache and fastcgi versions:
apache2:
   Installed: 2.2.22-13
libapache2-mod-fastcgi:
   Installed: 2.4.7~0910052141-1

I enabled radosgw logging; please find the log file attached. There is a lot
of information in it, but I couldn't figure out the problem.
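
In case it helps, this is roughly how I am checking the apache side (just a
sketch of the commands):

apache2ctl -M | grep -i -E 'fastcgi|fcgid|cgid'
grep -ri FastCgiExternalServer /etc/apache2/
# the socket configured there should match rgw_socket_path (/tmp/radosgw.sock)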

Regards.




[global]
mon_initial_members = spcsmp1, spcsmp2, spcsmp3
mon_host = 10.17.0.2,10.17.0.3,10.17.0.4
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_journal_size = 1024
filestore_xattr_use_omap = true
public_network = 10.17.0.0/24
cluster_network = 10.18.0.0/24

[osd]
osd_journal_size = 1024

[client.radosgw.gateway]
host = mss.mandic.com.br
keyring = /etc/ceph/keyring.radosgw.gateway
rgw_socket_path = /tmp/radosgw.sock
log_file = /var/log/ceph/radosgw.log
rgw_enable_ops_log = false


