Re: [ceph-users] reliable monitor restarts

2016-10-21 Thread Wido den Hollander

> Op 21 oktober 2016 om 21:31 schreef Steffen Weißgerber :
> 
> 
> Hello,
> 
> we're running a 6 node ceph cluster with 3 mons on Ubuntu (14.04.4).
> 
> Sometimes it happen's that the mon services die and have to restarted
> manually.
> 

That they die is not the thing which should happen! MONs are usually very 
stable. Having them crash is odd and slightly scary.

> To have reliable service restarts I normally use D.J. Bernsteins deamontools
> on other Linux distributions. Until now I never did this on Ubuntu.
> 
> Is there a comparable way to configure such a watcher on services on Ubuntu
> (i.e. under systemd)?

You can run them under upstart which is supported by Ubuntu and Ceph, the 
upstart files are in the ceph packages.

I'd however double-check WHY they crash. Which Ceph version?

Wido

> 
> Regards and have a nice weekend.
> 
> Steffen
> 
> 
> 
> 
> -- 
> Klinik-Service Neubrandenburg GmbH
> Allendestr. 30, 17036 Neubrandenburg
> Amtsgericht Neubrandenburg, HRB 2457
> Geschaeftsfuehrerin: Gudrun Kappich
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph rbd jewel

2016-10-21 Thread fridifree
Hi,

What is the ceph tunables? how it affects the cluster?
I upgrade my kernel I do not understand why I have to disable features?

On Oct 21, 2016 19:39, "Ilya Dryomov"  wrote:

> On Fri, Oct 21, 2016 at 5:50 PM, fridifree  wrote:
> > Hi everyone,
> > I'm using ceph jewel running on Ubuntu 16.04 (kernel 4.4) and Ubuntu
> 14.04
> > clients (kernel 3.13)
> > When trying to map rbd to the clients and to servers I get error about
> > feature set mismatch which I didnt get on hammer.
> > Tried to upgrade my clients to kernel 4.8 and 4.9rc1 I got an error about
> > missing 0x38 feature.
> >
> > Any suggestions?
>
> See
>
> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/
> 2016-May/009635.html
>
> Thanks,
>
> Ilya
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] tgt with ceph

2016-10-21 Thread Lu Dillon
Hi all,


I'm using tgt for iSCSI service. Are there any parameters of tgt to specific 
the user and keyring to access the RBD? Right now, I'm using admin user to do 
this. Thanks for advise.


Thanks,

Dillon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Three tier cache

2016-10-21 Thread Robert Sanders
Hello,

Is it possible to create a three level cache tier?  Searching documentation and 
archives suggests that I’m not the first one to ask about it, but I can’t tell 
if it is supported yet.  

Thanks,
Rob
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] effect of changing ceph osd primary affinity

2016-10-21 Thread Ridwan Rashid Noel
Thank you for your reply Greg. Is there any detailed resource that describe
about how the primary affinity changing works? All I got from searching was
one paragraph from the documentation.

Regards,

Ridwan Noel

On Oct 21, 2016 3:15 PM, "Gregory Farnum"  wrote:

> On Fri, Oct 21, 2016 at 8:38 AM, Ridwan Rashid Noel 
> wrote:
> > Hi,
> >
> > While reading about Ceph osd primary affinity in the documentation of
> Ceph I
> > found that it is mentioned "When the weight is < 1, it is less likely
> that
> > CRUSH will select the Ceph OSD Daemon to act as a primary". My question
> is
> > if the primary affinity of an OSD is set to be <1 will there be any data
> > movement happening among this OSD and the other OSDs?
>
> Nope, the purpose of the primary affinity is to redistribute workload
> without changing the data distribution. :)
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] effect of changing ceph osd primary affinity

2016-10-21 Thread Gregory Farnum
On Fri, Oct 21, 2016 at 8:38 AM, Ridwan Rashid Noel  wrote:
> Hi,
>
> While reading about Ceph osd primary affinity in the documentation of Ceph I
> found that it is mentioned "When the weight is < 1, it is less likely that
> CRUSH will select the Ceph OSD Daemon to act as a primary". My question is
> if the primary affinity of an OSD is set to be <1 will there be any data
> movement happening among this OSD and the other OSDs?

Nope, the purpose of the primary affinity is to redistribute workload
without changing the data distribution. :)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and TCP States

2016-10-21 Thread Gregory Farnum
On Fri, Oct 21, 2016 at 7:56 AM, Nick Fisk  wrote:
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> Haomai Wang
>> Sent: 21 October 2016 15:40
>> To: Nick Fisk 
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph and TCP States
>>
>>
>>
>> On Fri, Oct 21, 2016 at 10:31 PM, Nick Fisk  wrote:
>> > -Original Message-
>> > From: ceph-users [mailto:mailto:ceph-users-boun...@lists.ceph.com] On 
>> > Behalf Of Haomai Wang
>> > Sent: 21 October 2016 15:28
>> > To: Nick Fisk 
>> > Cc: mailto:ceph-users@lists.ceph.com
>> > Subject: Re: [ceph-users] Ceph and TCP States
>> >
>> >
>> >
>> > On Fri, Oct 21, 2016 at 10:19 PM, Nick Fisk 
>> >  wrote:
>> > Hi,
>> >
>> > I'm just testing out using a Ceph client in a DMZ behind a FW from the 
>> > main Ceph cluster. One thing I have noticed is that if the
>> > state table on the FW is emptied maybe by restarting it or just clearing 
>> > the state table...etc. Then the Ceph client will hang for a
>> > long time as the TCP session can no longer pass through the FW and just 
>> > gets blocked instead.
>> >
>> > This "FW" is linux firewall or hardware FW?
>>
>> PFSense running on dedicated HW. Eventually they will be in a HA pair so 
>> states should persist, but trying to work around this for now.
>> Bit annoying having CephFS lock hard for 15 minutes even though the network 
>> connection only went down for a few seconds.
>>
>> hmm, I'm not familiar with this fw. And from my view, whether RST packet 
>> sent is decided by FW. But I think you can try
>> "/proc/sys/net/ipv4/tcp_keepalive_time", if FW reset tcp session, tcp 
>> keepalive should detect and send a rst.
>
> Yeah I think that’s where the problem lies. Most Firewalls tend to silently 
> drop denied packets without sending RST's, so Ceph effectively just thinks 
> that its experiencing packet loss and will never retry until the 15 minute 
> timeout period is up. Am I right in thinking I can't tune down this parameter 
> for a CephFS kernel client as it doesn't use the ceph.conf file?

The kernel client has a lot of mount options and can be configured in
a few ways via debugfs et al; I think there's a setting for the
timeout as well. If you can't find it, I'm sure Zheng knows. :)
-Greg

>
>>
>> >
>> >
>> > I believe this behaviour can be adjusted by the "ms tcp read timeout" 
>> > setting to limit its impact, but wondering if anybody has any
>> > other ideas. I'm also thinking of experimenting with either stateless FW 
>> > rules for Ceph or getting the FW to send back RST packets
>> > instead of silently dropping packets.
>> >
>> > hmm, I think it depends on FW
>> >
>> >
>> > Thanks,
>> > Nick
>> >
>> > ___
>> > ceph-users mailing list
>> > mailto:mailto:ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> ___
>> ceph-users mailing list
>> mailto:ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] reliable monitor restarts

2016-10-21 Thread Steffen Weißgerber
Hello,

we're running a 6 node ceph cluster with 3 mons on Ubuntu (14.04.4).

Sometimes it happen's that the mon services die and have to restarted
manually.

To have reliable service restarts I normally use D.J. Bernsteins deamontools
on other Linux distributions. Until now I never did this on Ubuntu.

Is there a comparable way to configure such a watcher on services on Ubuntu
(i.e. under systemd)?

Regards and have a nice weekend.

Steffen




-- 
Klinik-Service Neubrandenburg GmbH
Allendestr. 30, 17036 Neubrandenburg
Amtsgericht Neubrandenburg, HRB 2457
Geschaeftsfuehrerin: Gudrun Kappich
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd cache writethrough until flush

2016-10-21 Thread Jason Dillaman
Thanks for pointing that out, since it is incorrect for (semi-)modern
QEMUs. All configuration starts and the Ceph defaults, are overwritten
by your ceph.conf, and then are further overwritten by any
QEMU-specific override.  I would recommend retesting with
"cache=writeback" to see if that helps.

On Fri, Oct 21, 2016 at 2:10 PM, Pavan Rallabhandi
 wrote:
> The VM am testing against is created after the librbd upgrade.
>
> Always had this confusion around this bit in the docs here  
> http://docs.ceph.com/docs/jewel/rbd/qemu-rbd/#qemu-cache-options that:
>
> “QEMU’s cache settings override Ceph’s default settings (i.e., settings that 
> are not explicitly set in the Ceph configuration file). If you explicitly set 
> RBD Cache settings in your Ceph configuration file, your Ceph settings 
> override the QEMU cache settings. If you set cache settings on the QEMU 
> command line, the QEMU command line settings override the Ceph configuration 
> file settings.”
>
> Thanks,
> -Pavan.
>
> On 10/21/16, 11:31 PM, "Jason Dillaman"  wrote:
>
> On Fri, Oct 21, 2016 at 1:15 PM, Pavan Rallabhandi
>  wrote:
> > The QEMU cache is none for all of the rbd drives
>
> Hmm -- if you have QEMU cache disabled, I would expect it to disable
> the librbd cache.
>
> I have to ask, but did you (re)start/live-migrate these VMs you are
> testing against after you upgraded to librbd v10.2.3?
>
> --
> Jason
>
>
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd cache writethrough until flush

2016-10-21 Thread Pavan Rallabhandi
The VM am testing against is created after the librbd upgrade.

Always had this confusion around this bit in the docs here  
http://docs.ceph.com/docs/jewel/rbd/qemu-rbd/#qemu-cache-options that:

“QEMU’s cache settings override Ceph’s default settings (i.e., settings that 
are not explicitly set in the Ceph configuration file). If you explicitly set 
RBD Cache settings in your Ceph configuration file, your Ceph settings override 
the QEMU cache settings. If you set cache settings on the QEMU command line, 
the QEMU command line settings override the Ceph configuration file settings.”

Thanks,
-Pavan.

On 10/21/16, 11:31 PM, "Jason Dillaman"  wrote:

On Fri, Oct 21, 2016 at 1:15 PM, Pavan Rallabhandi
 wrote:
> The QEMU cache is none for all of the rbd drives

Hmm -- if you have QEMU cache disabled, I would expect it to disable
the librbd cache.

I have to ask, but did you (re)start/live-migrate these VMs you are
testing against after you upgraded to librbd v10.2.3?

-- 
Jason



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph on two data centers far away

2016-10-21 Thread Wes Dillingham
What is the use case that requires you to have it in two datacenters?
In addition to RBD mirroring already mentioned by others, you can do
RBD snapshots and ship those snapshots to a remote location (separate
cluster or separate pool). Similar to RBD mirroring, in this situation
your client writes are not subject to that latency.

On Thu, Oct 20, 2016 at 1:51 PM, German Anders  wrote:
> Thanks, that's too far actually lol. And how things going with rbd
> mirroring?
>
> German
>
> 2016-10-20 14:49 GMT-03:00 yan cui :
>>
>> The two data centers are actually cross US.  One is in the west, and the
>> other in the east.
>> We try to sync rdb images using RDB mirroring.
>>
>> 2016-10-20 9:54 GMT-07:00 German Anders :
>>>
>>> from curiosity I wanted to ask you what kind of network topology are you
>>> trying to use across the cluster? In this type of scenario you really need
>>> an ultra low latency network, how far from each other?
>>>
>>> Best,
>>>
>>> German
>>>
>>> 2016-10-18 16:22 GMT-03:00 Sean Redmond :

 Maybe this would be an option for you:

 http://docs.ceph.com/docs/jewel/rbd/rbd-mirroring/


 On Tue, Oct 18, 2016 at 8:18 PM, yan cui  wrote:
>
> Hi Guys,
>
>Our company has a use case which needs the support of Ceph across
> two data centers (one data center is far away from the other). The
> experience of using one data center is good. We did some benchmarking on 
> two
> data centers, and the performance is bad because of the synchronization
> feature in Ceph and large latency between data centers. So, are there
> setting ups like data center aware features in Ceph, so that we have good
> locality? Usually, we use rbd to create volume and snapshot. But we want 
> the
> volume is high available with acceptable performance in case one data 
> center
> is down. Our current setting ups does not consider data center difference.
> Any ideas?
>
>
> Thanks, Yan
>
> --
> Think big; Dream impossible; Make it happen.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>
>>
>>
>>
>> --
>> Think big; Dream impossible; Make it happen.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd cache writethrough until flush

2016-10-21 Thread Jason Dillaman
On Fri, Oct 21, 2016 at 1:15 PM, Pavan Rallabhandi
 wrote:
> The QEMU cache is none for all of the rbd drives

Hmm -- if you have QEMU cache disabled, I would expect it to disable
the librbd cache.

I have to ask, but did you (re)start/live-migrate these VMs you are
testing against after you upgraded to librbd v10.2.3?

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-21 Thread Jim Kilborn
Reed/Christian,

So if I put the OSD journals on an SSD that has power loss protection (Samsung 
SM863) , all the write then go through those journals. Can I then leave write 
caching turn on for the spinner OSDs, even without BBU caching controller? In 
the event of a power outage past our ups time, I want to ensure all the osds 
aren’t corrupt after bring the nodes back up.

Secondly, Seagate 8TB enterprise drives say they employ power loss protection 
as well. Apparently, in your case, this turned out to be untrue?




Sent from Mail for Windows 10

From: Reed Dier
Sent: Friday, October 21, 2016 10:06 AM
To: Christian Balzer
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size


On Oct 19, 2016, at 7:54 PM, Christian Balzer 
mailto:ch...@gol.com>> wrote:


Hello,

On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:

I have setup a new linux cluster to allow migration from our old SAN based 
cluster to a new cluster with ceph.
All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
As others mentioned, not a good choice, but also not the (main) cause of
your problems.

I am basically running stock ceph settings, with just turning the write cache 
off via hdparm on the drives, and temporarily turning of scrubbing.

The former is bound to kill performance, if you care that much for your
data but can't guarantee constant power (UPS, dual PSUs, etc), consider
using a BBU caching controller.

I wanted to comment on this small bolded bit, in the early days of my ceph 
cluster, testing resiliency to power failure (worst case scenario), when the 
on-disk write cache was enabled on my drives, I would lose that OSD to leveldb 
corruption, even with BBU.

With BBU + no disk-level cache, the OSD would come back, with no data loss, 
however performance would be significantly degraded. (xfsaild process with 99% 
iowait, cured by zapping disk and recreating OSD)

For reference, these were Seagate ST8000NM0065, backed by an LSI 3108 RoC, with 
the OSD set as a single RAID0 VD. On disk journaling.

There was a decent enough hit to write performance after disabling write 
caching at the disk layer, but write-back caching at the controller layer 
provided enough of a negating increase, that the data security was an 
acceptable trade off.

Was a tough way to learn how important this was after data center was struck by 
lightning two weeks after initial ceph cluster install and one phase of power 
was knocked out for 15 minutes, taking half the non-dual-PSU nodes with it.

Just want to make sure that people learn from that painful experience.

Reed


The later I venture you did because performance was abysmal with scrubbing
enabled.
Which is always a good indicator that your cluster needs tuning, improving.

The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
Server performance should be good.
Memory is fine, CPU I can't tell from the model number and I'm not
inclined to look up or guess, but that usually only becomes a bottleneck
when dealing with all SSD setup and things requiring the lowest latency
possible.


Since I am running cephfs, I have tiering setup.
That should read "on top of EC pools", and as John said, not a good idea
at all, both EC pools and cache-tiering.

Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. So 
the idea is to ensure a single host failure.
Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
replicated set with size=2

This isn't a Seagate, you mean Samsung. And that's a consumer model,
ill suited for this task, even with the DC level SSDs below as journals.

And as such a replication of 2 is also ill advised, I've seen these SSDs
die w/o ANY warning whatsoever and long before their (abysmal) endurance
was exhausted.

The cache tier also has a 128GB SM863 SSD that is being used as a journal for 
the cache SSD. It has power loss protection

Those are fine. If you re-do you cluster, don't put more than 4-5 journals
on them.

My crush map is setup to ensure the cache pool uses only the 4 850 pro and the 
erasure code uses only the 16 spinning 4TB drives.

The problems that I am seeing is that I start copying data from our old san to 
the ceph volume, and once the cache tier gets to my  target_max_bytes of 1.4 
TB, I start seeing:

HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
26 ops are blocked > 65.536 sec on osd.0
37 ops are blocked > 32.768 sec on osd.0
1 osds have slow requests
noout,noscrub,nodeep-scrub,sortbitwise flag(s) set

osd.0 is the cache ssd

If I watch iostat on the cache ssd, I see the queue lengths are high and the 
await are high
Below is the iostat on the cache drive (osd

Re: [ceph-users] rbd cache writethrough until flush

2016-10-21 Thread Pavan Rallabhandi
Thanks for verifying at your end Jason.

It’s pretty weird that the difference is >~10X, with 
"rbd_cache_writethrough_until_flush = true" I see ~400 IOPS vs with 
"rbd_cache_writethrough_until_flush = false" I see them to be ~6000 IOPS. 

The QEMU cache is none for all of the rbd drives. On that note, would older 
librbd versions (like Hammer) have any caching issues while dealing with Jewel 
clusters?

Thanks,
-Pavan.

On 10/21/16, 8:17 PM, "Jason Dillaman"  wrote:

QEMU cache setting for the rbd drive?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph rbd jewel

2016-10-21 Thread Ilya Dryomov
On Fri, Oct 21, 2016 at 5:50 PM, fridifree  wrote:
> Hi everyone,
> I'm using ceph jewel running on Ubuntu 16.04 (kernel 4.4) and Ubuntu 14.04
> clients (kernel 3.13)
> When trying to map rbd to the clients and to servers I get error about
> feature set mismatch which I didnt get on hammer.
> Tried to upgrade my clients to kernel 4.8 and 4.9rc1 I got an error about
> missing 0x38 feature.
>
> Any suggestions?

See

http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2016-May/009635.html

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph rbd jewel

2016-10-21 Thread fridifree
Hi everyone,
I'm using ceph jewel running on Ubuntu 16.04 (kernel 4.4) and Ubuntu 14.04
clients (kernel 3.13)
When trying to map rbd to the clients and to servers I get error about
feature set mismatch which I didnt get on hammer.
Tried to upgrade my clients to kernel 4.8 and 4.9rc1 I got an error about
missing 0x38 feature.

Any suggestions?

Thank you
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] effect of changing ceph osd primary affinity

2016-10-21 Thread Ridwan Rashid Noel
Hi,

While reading about Ceph osd primary affinity in the documentation of Ceph
I found that it is mentioned "When the weight is < 1, it is less likely
that CRUSH will select the Ceph OSD Daemon to act as a primary". My
question is if the primary affinity of an OSD is set to be <1 will there be
any data movement happening among this OSD and the other OSDs?

Regards,

Ridwan Noel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crash in ceph_read_iter->__free_pages due to null page

2016-10-21 Thread Markus Blank-Burian
Thanks for the fix and the quick reply!

From: Nikolay Borisov [mailto:ker...@kyup.com]
Sent: Freitag, 21. Oktober 2016 17:09
To: Markus Blank-Burian 
Cc: Nikolay Borisov ; Ilya Dryomov ; Yan, 
Zheng ; ceph-users 
Subject: Re: Crash in ceph_read_iter->__free_pages due to null page



On Friday, October 21, 2016, Markus Blank-Burian 
mailto:bur...@muenster.de>> wrote:
Hi,

is there any update regarding this bug?

I did send a patch and i believe it should find its way in upstream releaes 
rather soon.



I can easily reproduce this issue on our cluster with the following
scenario:
- Start a few hundred processes on different nodes, each process writing
slowly some text into its own output file
- Call: watch -n1 'grep mycustomerrorstring *.out'
- Hit CTRL+C (crashes the machine not always, but on a regular basis)

We are using a 4.4.25 kernel with some additional ceph patches borrowed from
newer kernel releases.

Thanks,
Markus

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On 
Behalf Of
Nikolay Borisov
Sent: Montag, 10. Oktober 2016 12:36
To: Ilya Dryomov >
Cc: Yan, Zheng >; ceph-users 
>
Subject: Re: [ceph-users] Crash in ceph_read_iter->__free_pages due to null
page



On 10/10/2016 12:22 PM, Ilya Dryomov wrote:
> On Fri, Oct 7, 2016 at 1:40 PM, Nikolay Borisov 
> > wrote:
>> Hello,
>>
>> I've encountered yet another cephfs crash:
>>
>> [990188.822271] BUG: unable to handle kernel NULL pointer dereference
>> at 001c [990188.822790] IP: []
>> __free_pages+0x5/0x30 [990188.823090] PGD 180dd8f067 PUD 1bf2722067
>> PMD 0 [990188.823506] Oops: 0002 [#1] SMP
>> [990188.831274] CPU: 25 PID: 18418 Comm: php-fpm Tainted: G   O
4.4.20-clouder2 #6
>> [990188.831650] Hardware name: Supermicro X10DRi/X10DRi, BIOS 2.0
>> 12/28/2015 [990188.831876] task: 8822a3b7b700 ti:
>> 88022427c000 task.ti: 88022427c000 [990188.832249] RIP:
>> 0010:[]  [] __free_pages+0x5/0x30
>> [990188.832691] RSP: :88022427fda8  EFLAGS: 00010246
>> [990188.832914] RAX: fe00 RBX: 0f3d RCX:
>> c100 [990188.833292] RDX: 47f2 RSI:
>>  RDI:  [990188.833670] RBP:
>> 88022427fe50 R08: 88022427c000 R09: 00038459d3aa3ee4
>> [990188.834049] R10: 00013b00e4b8 R11:  R12:
>>  [990188.834429] R13: 8802c5189f88 R14:
>> 881091270ca8 R15: 88022427fe70 [990188.838820] FS:
>> 7fc8ff5cb7c0() GS:881fffba() knlGS:
[990188.839197] CS:  0010 DS:  ES:  CR0: 80050033
[990188.839420] CR2: 001c CR3: 000405f7e000 CR4:
001406e0 [990188.839797] Stack:
>> [990188.840013]  a044a1bc 8806 
>> 88022427fe70 [990188.840639]  8802c5189f88 88189297b6a0
>> 0f3d 8810fe00 [990188.841263]  88022427fe98
>>  2000 8802c5189c20 [990188.841886] Call
Trace:
>> [990188.842115]  [] ? ceph_read_iter+0x19c/0x5f0
>> [ceph] [990188.842345]  [] __vfs_read+0xa7/0xd0
>> [990188.842568]  [] vfs_read+0x86/0x130
>> [990188.842792]  [] SyS_read+0x46/0xa0
>> [990188.843018]  []
>> entry_SYSCALL_64_fastpath+0x16/0x6e
>> [990188.843243] Code: e2 48 89 de ff d1 49 8b 0f 48 85 c9 75 e8 65 ff
>> 0d 99 a7 ed 7e eb 85 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f
>> 1f 44 00 00  ff 4f 1c 74 01 c3 55 85 f6 48 89 e5 74 07 e8 f7 f5
>> ff ff 5d [990188.847887] RIP  []
>> __free_pages+0x5/0x30 [990188.848183]  RSP 
>> [990188.848404] CR2: 001c
>>
>> The problem is that page(%RDI) being passed to __free_pages is NULL.
>> Also retry_op is CHECK_EOF(1), so the page allocation didn't execute
>> which leads to the null page. statret is : fe00 which seems to be
-ERESTARTSYS.
>
> Looks like this one exists upsteam - -ERESTARTSYS is returned from
> __ceph_do_getattr() if the process is killed while waiting for the
> reply from the MDS.  At first sight it's just a busted error path, but
> it could use more testing.  Zheng?

Checking the thread_info struct of the task in question it does have
TIF_SIGPENDING set and indeed the crash's "sig" command (if I'm reading
correctly the output" indicates that signal 15 (SIGTERM) is pending:

SHARED_PENDING
SIGNAL: 4000
  SIGQUEUE:  SIG  SIGINFO
  15  8801439a5d78

>
> Thanks,
>
> Ilya
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crash in ceph_read_iter->__free_pages due to null page

2016-10-21 Thread Ilya Dryomov
On Fri, Oct 21, 2016 at 5:01 PM, Markus Blank-Burian  wrote:
> Hi,
>
> is there any update regarding this bug?

Nikolay's patch made mainline yesterday and should show up in various
stable kernels in the forthcoming weeks.

>
> I can easily reproduce this issue on our cluster with the following
> scenario:
> - Start a few hundred processes on different nodes, each process writing
> slowly some text into its own output file
> - Call: watch -n1 'grep mycustomerrorstring *.out'
> - Hit CTRL+C (crashes the machine not always, but on a regular basis)
>
> We are using a 4.4.25 kernel with some additional ceph patches borrowed from
> newer kernel releases.

Borrow

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0d7718f666be181fda1ba2d08f137d87c1419347

and you will be set ;)

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crash in ceph_read_iter->__free_pages due to null page

2016-10-21 Thread Nikolay Borisov
On Friday, October 21, 2016, Markus Blank-Burian  wrote:

> Hi,
>
> is there any update regarding this bug?


I did send a patch and i believe it should find its way in upstream releaes
rather soon.



>
> I can easily reproduce this issue on our cluster with the following
> scenario:
> - Start a few hundred processes on different nodes, each process writing
> slowly some text into its own output file
> - Call: watch -n1 'grep mycustomerrorstring *.out'
> - Hit CTRL+C (crashes the machine not always, but on a regular basis)
>
> We are using a 4.4.25 kernel with some additional ceph patches borrowed
> from
> newer kernel releases.
>
> Thanks,
> Markus
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com ]
> On Behalf Of
> Nikolay Borisov
> Sent: Montag, 10. Oktober 2016 12:36
> To: Ilya Dryomov >
> Cc: Yan, Zheng >; ceph-users <
> ceph-users@lists.ceph.com >
> Subject: Re: [ceph-users] Crash in ceph_read_iter->__free_pages due to null
> page
>
>
>
> On 10/10/2016 12:22 PM, Ilya Dryomov wrote:
> > On Fri, Oct 7, 2016 at 1:40 PM, Nikolay Borisov  > wrote:
> >> Hello,
> >>
> >> I've encountered yet another cephfs crash:
> >>
> >> [990188.822271] BUG: unable to handle kernel NULL pointer dereference
> >> at 001c [990188.822790] IP: []
> >> __free_pages+0x5/0x30 [990188.823090] PGD 180dd8f067 PUD 1bf2722067
> >> PMD 0 [990188.823506] Oops: 0002 [#1] SMP
> >> [990188.831274] CPU: 25 PID: 18418 Comm: php-fpm Tainted: G   O
> 4.4.20-clouder2 #6
> >> [990188.831650] Hardware name: Supermicro X10DRi/X10DRi, BIOS 2.0
> >> 12/28/2015 [990188.831876] task: 8822a3b7b700 ti:
> >> 88022427c000 task.ti: 88022427c000 [990188.832249] RIP:
> >> 0010:[]  [] __free_pages+0x5/0x30
> >> [990188.832691] RSP: :88022427fda8  EFLAGS: 00010246
> >> [990188.832914] RAX: fe00 RBX: 0f3d RCX:
> >> c100 [990188.833292] RDX: 47f2 RSI:
> >>  RDI:  [990188.833670] RBP:
> >> 88022427fe50 R08: 88022427c000 R09: 00038459d3aa3ee4
> >> [990188.834049] R10: 00013b00e4b8 R11:  R12:
> >>  [990188.834429] R13: 8802c5189f88 R14:
> >> 881091270ca8 R15: 88022427fe70 [990188.838820] FS:
> >> 7fc8ff5cb7c0() GS:881fffba() knlGS:
> [990188.839197] CS:  0010 DS:  ES:  CR0: 80050033
> [990188.839420] CR2: 001c CR3: 000405f7e000 CR4:
> 001406e0 [990188.839797] Stack:
> >> [990188.840013]  a044a1bc 8806 
> >> 88022427fe70 [990188.840639]  8802c5189f88 88189297b6a0
> >> 0f3d 8810fe00 [990188.841263]  88022427fe98
> >>  2000 8802c5189c20 [990188.841886] Call
> Trace:
> >> [990188.842115]  [] ? ceph_read_iter+0x19c/0x5f0
> >> [ceph] [990188.842345]  [] __vfs_read+0xa7/0xd0
> >> [990188.842568]  [] vfs_read+0x86/0x130
> >> [990188.842792]  [] SyS_read+0x46/0xa0
> >> [990188.843018]  []
> >> entry_SYSCALL_64_fastpath+0x16/0x6e
> >> [990188.843243] Code: e2 48 89 de ff d1 49 8b 0f 48 85 c9 75 e8 65 ff
> >> 0d 99 a7 ed 7e eb 85 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f
> >> 1f 44 00 00  ff 4f 1c 74 01 c3 55 85 f6 48 89 e5 74 07 e8 f7 f5
> >> ff ff 5d [990188.847887] RIP  []
> >> __free_pages+0x5/0x30 [990188.848183]  RSP 
> >> [990188.848404] CR2: 001c
> >>
> >> The problem is that page(%RDI) being passed to __free_pages is NULL.
> >> Also retry_op is CHECK_EOF(1), so the page allocation didn't execute
> >> which leads to the null page. statret is : fe00 which seems to be
> -ERESTARTSYS.
> >
> > Looks like this one exists upsteam - -ERESTARTSYS is returned from
> > __ceph_do_getattr() if the process is killed while waiting for the
> > reply from the MDS.  At first sight it's just a busted error path, but
> > it could use more testing.  Zheng?
>
> Checking the thread_info struct of the task in question it does have
> TIF_SIGPENDING set and indeed the crash's "sig" command (if I'm reading
> correctly the output" indicates that signal 15 (SIGTERM) is pending:
>
> SHARED_PENDING
> SIGNAL: 4000
>   SIGQUEUE:  SIG  SIGINFO
>   15  8801439a5d78
>
> >
> > Thanks,
> >
> > Ilya
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-21 Thread Reed Dier

> On Oct 19, 2016, at 7:54 PM, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:
> 
>> I have setup a new linux cluster to allow migration from our old SAN based 
>> cluster to a new cluster with ceph.
>> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
> As others mentioned, not a good choice, but also not the (main) cause of
> your problems.
> 
>> I am basically running stock ceph settings, with just turning the write 
>> cache off via hdparm on the drives, and temporarily turning of scrubbing.
>> 
> The former is bound to kill performance, if you care that much for your
> data but can't guarantee constant power (UPS, dual PSUs, etc), consider
> using a BBU caching controller.

I wanted to comment on this small bolded bit, in the early days of my ceph 
cluster, testing resiliency to power failure (worst case scenario), when the 
on-disk write cache was enabled on my drives, I would lose that OSD to leveldb 
corruption, even with BBU.

With BBU + no disk-level cache, the OSD would come back, with no data loss, 
however performance would be significantly degraded. (xfsaild process with 99% 
iowait, cured by zapping disk and recreating OSD)

For reference, these were Seagate ST8000NM0065, backed by an LSI 3108 RoC, with 
the OSD set as a single RAID0 VD. On disk journaling.

There was a decent enough hit to write performance after disabling write 
caching at the disk layer, but write-back caching at the controller layer 
provided enough of a negating increase, that the data security was an 
acceptable trade off.

Was a tough way to learn how important this was after data center was struck by 
lightning two weeks after initial ceph cluster install and one phase of power 
was knocked out for 15 minutes, taking half the non-dual-PSU nodes with it.

Just want to make sure that people learn from that painful experience.

Reed

> 
> The later I venture you did because performance was abysmal with scrubbing
> enabled.
> Which is always a good indicator that your cluster needs tuning, improving.
> 
>> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
>> Server performance should be good.  
> Memory is fine, CPU I can't tell from the model number and I'm not
> inclined to look up or guess, but that usually only becomes a bottleneck
> when dealing with all SSD setup and things requiring the lowest latency
> possible.
> 
> 
>> Since I am running cephfs, I have tiering setup.
> That should read "on top of EC pools", and as John said, not a good idea
> at all, both EC pools and cache-tiering.
> 
>> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
>> So the idea is to ensure a single host failure.
>> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
>> replicated set with size=2
> 
> This isn't a Seagate, you mean Samsung. And that's a consumer model,
> ill suited for this task, even with the DC level SSDs below as journals.
> 
> And as such a replication of 2 is also ill advised, I've seen these SSDs
> die w/o ANY warning whatsoever and long before their (abysmal) endurance
> was exhausted.
> 
>> The cache tier also has a 128GB SM863 SSD that is being used as a journal 
>> for the cache SSD. It has power loss protection
> 
> Those are fine. If you re-do you cluster, don't put more than 4-5 journals
> on them.
> 
>> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
>> the erasure code uses only the 16 spinning 4TB drives.
>> 
>> The problems that I am seeing is that I start copying data from our old san 
>> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
>> 1.4 TB, I start seeing:
>> 
>> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>> 26 ops are blocked > 65.536 sec on osd.0
>> 37 ops are blocked > 32.768 sec on osd.0
>> 1 osds have slow requests
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>> 
>> osd.0 is the cache ssd
>> 
>> If I watch iostat on the cache ssd, I see the queue lengths are high and the 
>> await are high
>> Below is the iostat on the cache drive (osd.0) on the first host. The 
>> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>> 
>> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
>> avgqu-sz   await r_await w_await  svctm  %util
>> sdb
>>  0.00 0.339.00   84.33 0.9620.11   462.40
>> 75.92  397.56  125.67  426.58  10.70  99.90
>>  0.00 0.67   30.00   87.33 5.9621.03   471.20
>> 67.86  910.95   87.00 1193.99   8.27  97.07
>>  0.0016.67   33.00  289.33 4.2118.80   146.20
>> 29.83   88.99   93.91   88.43   3.10  99.83
>>  0.00 7.337.67  261.67 1.9219.63   163.81   
>> 117.42  331.97  182.04  336.36   3.71 100.00
>> 
>> 
>> If I look at the iostat f

Re: [ceph-users] Memory leak in radosgw

2016-10-21 Thread Trey Palmer
Hi Ben,

I previously hit this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1327142

So I updated from libcurl 7.29.0-25 to the new update package libcurl
7.29.0-32 on RHEL 7, which fixed the deadlock problem.

I had not seen the issue you linked.   It doesn't seem directly related,
since my problem is a memory leak and not CPU.   Clearly, though, older
libcurl versions remain problematic for multiple reasons, so I'll give a
newer one a try.

Thanks for the input!

   -- Trey



On Fri, Oct 21, 2016 at 3:21 AM, Ben Morrice  wrote:

> What version of libcurl are you using?
>
> I was hitting this bug with RHEL7/libcurl 7.29 which could also be your
> catalyst.
>
> http://tracker.ceph.com/issues/15915
>
> Kind regards,
>
> Ben Morrice
>
> __
> Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
> EPFL ENT CBS BBP
> Biotech Campus
> Chemin des Mines 9
> 1202 Geneva
> Switzerland
>
> On 20/10/16 21:41, Trey Palmer wrote:
>
> I've been trying to test radosgw multisite and have a pretty bad memory
> leak.It appears to be associated only with multisite sync.
>
> Multisite works well for a small numbers of objects.However, it all
> fell over when I wrote in 8M 64K objects to two buckets overnight for
> testing (via cosbench).
>
> The leak appears to happen on the multisite transfer source -- that is, the
> node where the objects were written originally.   The radosgw process
> eventually dies, I'm sure via the OOM killer, and systemd restarts it.
> Then repeat, though multisite sync pretty much stops at that point.
>
> I have tried 10.2.2, 10.2.3 and a combination of the two.   I'm running on
> CentOS 7.2, using civetweb with SSL.   I saw that the memory profiler only
> works on mon, osd and mds processes.
>
> Anyone else seen anything like this?
>
>-- Trey
>
>
>
>
> ___
> ceph-users mailing 
> listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and TCP States

2016-10-21 Thread Haomai Wang
On Fri, Oct 21, 2016 at 10:56 PM, Nick Fisk  wrote:

> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of Haomai Wang
> > Sent: 21 October 2016 15:40
> > To: Nick Fisk 
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Ceph and TCP States
> >
> >
> >
> > On Fri, Oct 21, 2016 at 10:31 PM, Nick Fisk 
> wrote:
> > > -Original Message-
> > > From: ceph-users [mailto:mailto:ceph-users-boun...@lists.ceph.com] On
> Behalf Of Haomai Wang
> > > Sent: 21 October 2016 15:28
> > > To: Nick Fisk 
> > > Cc: mailto:ceph-users@lists.ceph.com
> > > Subject: Re: [ceph-users] Ceph and TCP States
> > >
> > >
> > >
> > > On Fri, Oct 21, 2016 at 10:19 PM, Nick Fisk  n...@fisk.me.uk> wrote:
> > > Hi,
> > >
> > > I'm just testing out using a Ceph client in a DMZ behind a FW from the
> main Ceph cluster. One thing I have noticed is that if the
> > > state table on the FW is emptied maybe by restarting it or just
> clearing the state table...etc. Then the Ceph client will hang for a
> > > long time as the TCP session can no longer pass through the FW and
> just gets blocked instead.
> > >
> > > This "FW" is linux firewall or hardware FW?
> >
> > PFSense running on dedicated HW. Eventually they will be in a HA pair so
> states should persist, but trying to work around this for now.
> > Bit annoying having CephFS lock hard for 15 minutes even though the
> network connection only went down for a few seconds.
> >
> > hmm, I'm not familiar with this fw. And from my view, whether RST
> packet sent is decided by FW. But I think you can try
> > "/proc/sys/net/ipv4/tcp_keepalive_time", if FW reset tcp session, tcp
> keepalive should detect and send a rst.
>
> Yeah I think that’s where the problem lies. Most Firewalls tend to
> silently drop denied packets without sending RST's, so Ceph effectively
> just thinks that its experiencing packet loss and will never retry until
> the 15 minute timeout period is up. Am I right in thinking I can't tune
> down this parameter for a CephFS kernel client as it doesn't use the
> ceph.conf file?
>

I think cephfs kernel client doesn't have any timeout behavior. 15mins
timeout is triggered by server side I guess. So you can turn down this in
server side. Keep in mind that a very low value will cause frequency
connection lossy.


>
> >
> > >
> > >
> > > I believe this behaviour can be adjusted by the "ms tcp read timeout"
> setting to limit its impact, but wondering if anybody has any
> > > other ideas. I'm also thinking of experimenting with either stateless
> FW rules for Ceph or getting the FW to send back RST packets
> > > instead of silently dropping packets.
> > >
> > > hmm, I think it depends on FW
> > >
> > >
> > > Thanks,
> > > Nick
> > >
> > > ___
> > > ceph-users mailing list
> > > mailto:mailto:ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > ___
> > ceph-users mailing list
> > mailto:ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crash in ceph_read_iter->__free_pages due to null page

2016-10-21 Thread Markus Blank-Burian
Hi,

is there any update regarding this bug?

I can easily reproduce this issue on our cluster with the following
scenario:
- Start a few hundred processes on different nodes, each process writing
slowly some text into its own output file
- Call: watch -n1 'grep mycustomerrorstring *.out'
- Hit CTRL+C (crashes the machine not always, but on a regular basis)

We are using a 4.4.25 kernel with some additional ceph patches borrowed from
newer kernel releases.

Thanks,
Markus

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Nikolay Borisov
Sent: Montag, 10. Oktober 2016 12:36
To: Ilya Dryomov 
Cc: Yan, Zheng ; ceph-users 
Subject: Re: [ceph-users] Crash in ceph_read_iter->__free_pages due to null
page



On 10/10/2016 12:22 PM, Ilya Dryomov wrote:
> On Fri, Oct 7, 2016 at 1:40 PM, Nikolay Borisov  wrote:
>> Hello,
>>
>> I've encountered yet another cephfs crash:
>>
>> [990188.822271] BUG: unable to handle kernel NULL pointer dereference 
>> at 001c [990188.822790] IP: [] 
>> __free_pages+0x5/0x30 [990188.823090] PGD 180dd8f067 PUD 1bf2722067 
>> PMD 0 [990188.823506] Oops: 0002 [#1] SMP
>> [990188.831274] CPU: 25 PID: 18418 Comm: php-fpm Tainted: G   O
4.4.20-clouder2 #6
>> [990188.831650] Hardware name: Supermicro X10DRi/X10DRi, BIOS 2.0 
>> 12/28/2015 [990188.831876] task: 8822a3b7b700 ti: 
>> 88022427c000 task.ti: 88022427c000 [990188.832249] RIP: 
>> 0010:[]  [] __free_pages+0x5/0x30 
>> [990188.832691] RSP: :88022427fda8  EFLAGS: 00010246 
>> [990188.832914] RAX: fe00 RBX: 0f3d RCX: 
>> c100 [990188.833292] RDX: 47f2 RSI: 
>>  RDI:  [990188.833670] RBP: 
>> 88022427fe50 R08: 88022427c000 R09: 00038459d3aa3ee4 
>> [990188.834049] R10: 00013b00e4b8 R11:  R12: 
>>  [990188.834429] R13: 8802c5189f88 R14: 
>> 881091270ca8 R15: 88022427fe70 [990188.838820] FS:  
>> 7fc8ff5cb7c0() GS:881fffba() knlGS:
[990188.839197] CS:  0010 DS:  ES:  CR0: 80050033
[990188.839420] CR2: 001c CR3: 000405f7e000 CR4:
001406e0 [990188.839797] Stack:
>> [990188.840013]  a044a1bc 8806  
>> 88022427fe70 [990188.840639]  8802c5189f88 88189297b6a0 
>> 0f3d 8810fe00 [990188.841263]  88022427fe98 
>>  2000 8802c5189c20 [990188.841886] Call
Trace:
>> [990188.842115]  [] ? ceph_read_iter+0x19c/0x5f0 
>> [ceph] [990188.842345]  [] __vfs_read+0xa7/0xd0 
>> [990188.842568]  [] vfs_read+0x86/0x130 
>> [990188.842792]  [] SyS_read+0x46/0xa0 
>> [990188.843018]  [] 
>> entry_SYSCALL_64_fastpath+0x16/0x6e
>> [990188.843243] Code: e2 48 89 de ff d1 49 8b 0f 48 85 c9 75 e8 65 ff 
>> 0d 99 a7 ed 7e eb 85 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 
>> 1f 44 00 00  ff 4f 1c 74 01 c3 55 85 f6 48 89 e5 74 07 e8 f7 f5 
>> ff ff 5d [990188.847887] RIP  [] 
>> __free_pages+0x5/0x30 [990188.848183]  RSP  
>> [990188.848404] CR2: 001c
>>
>> The problem is that page(%RDI) being passed to __free_pages is NULL. 
>> Also retry_op is CHECK_EOF(1), so the page allocation didn't execute 
>> which leads to the null page. statret is : fe00 which seems to be
-ERESTARTSYS.
> 
> Looks like this one exists upsteam - -ERESTARTSYS is returned from
> __ceph_do_getattr() if the process is killed while waiting for the 
> reply from the MDS.  At first sight it's just a busted error path, but 
> it could use more testing.  Zheng?

Checking the thread_info struct of the task in question it does have
TIF_SIGPENDING set and indeed the crash's "sig" command (if I'm reading
correctly the output" indicates that signal 15 (SIGTERM) is pending:

SHARED_PENDING
SIGNAL: 4000
  SIGQUEUE:  SIG  SIGINFO
  15  8801439a5d78

> 
> Thanks,
> 
> Ilya
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and TCP States

2016-10-21 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Haomai Wang
> Sent: 21 October 2016 15:40
> To: Nick Fisk 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph and TCP States
> 
> 
> 
> On Fri, Oct 21, 2016 at 10:31 PM, Nick Fisk  wrote:
> > -Original Message-
> > From: ceph-users [mailto:mailto:ceph-users-boun...@lists.ceph.com] On 
> > Behalf Of Haomai Wang
> > Sent: 21 October 2016 15:28
> > To: Nick Fisk 
> > Cc: mailto:ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Ceph and TCP States
> >
> >
> >
> > On Fri, Oct 21, 2016 at 10:19 PM, Nick Fisk  
> > wrote:
> > Hi,
> >
> > I'm just testing out using a Ceph client in a DMZ behind a FW from the main 
> > Ceph cluster. One thing I have noticed is that if the
> > state table on the FW is emptied maybe by restarting it or just clearing 
> > the state table...etc. Then the Ceph client will hang for a
> > long time as the TCP session can no longer pass through the FW and just 
> > gets blocked instead.
> >
> > This "FW" is linux firewall or hardware FW?
> 
> PFSense running on dedicated HW. Eventually they will be in a HA pair so 
> states should persist, but trying to work around this for now.
> Bit annoying having CephFS lock hard for 15 minutes even though the network 
> connection only went down for a few seconds.
> 
> hmm, I'm not familiar with this fw. And from my view, whether RST packet 
> sent is decided by FW. But I think you can try
> "/proc/sys/net/ipv4/tcp_keepalive_time", if FW reset tcp session, tcp 
> keepalive should detect and send a rst.

Yeah I think that’s where the problem lies. Most Firewalls tend to silently 
drop denied packets without sending RST's, so Ceph effectively just thinks that 
its experiencing packet loss and will never retry until the 15 minute timeout 
period is up. Am I right in thinking I can't tune down this parameter for a 
CephFS kernel client as it doesn't use the ceph.conf file?

> 
> >
> >
> > I believe this behaviour can be adjusted by the "ms tcp read timeout" 
> > setting to limit its impact, but wondering if anybody has any
> > other ideas. I'm also thinking of experimenting with either stateless FW 
> > rules for Ceph or getting the FW to send back RST packets
> > instead of silently dropping packets.
> >
> > hmm, I think it depends on FW
> >
> >
> > Thanks,
> > Nick
> >
> > ___
> > ceph-users mailing list
> > mailto:mailto:ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
> ceph-users mailing list
> mailto:ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd cache writethrough until flush

2016-10-21 Thread Jason Dillaman
I just tested from the v10.2.3 git tag on my local machine and
averaged 2912.54 4K writes / second with
"rbd_cache_writethrough_until_flush = false" and averaged 3035.09 4K
writes / second with "rbd_cache_writethrough_until_flush = true"
(queue depth of 1 in both cases). I used new images between each run
to ensure the there wasn't any warm data.

What is the IOPS delta percentage between your two cases? What is your
QEMU cache setting for the rbd drive?

On Fri, Oct 21, 2016 at 9:56 AM, Pavan Rallabhandi
 wrote:
> From my VMs that have cinder provisioned volumes, I tried dd / fio (like 
> below) to find the IOPS to be less, even a sync before the runs didn’t help. 
> Same runs by setting the option to false yield better results.
>
> Both the clients and the cluster are running 10.2.3, perhaps the only 
> difference is that the clients are on Trusty and the cluster is Xenial.
>
> dd if=/dev/zero of=/dev/vdd bs=4K count=1000 oflag=direct
>
> fio -name iops -rw=write -bs=4k -direct=1  -runtime=60 -iodepth 1 -filename 
> /dev/vde -ioengine=libaio
>
> Thanks,
> -Pavan.
>
> On 10/21/16, 6:15 PM, "Jason Dillaman"  wrote:
>
> It's in the build and has tests to verify that it is properly being
> triggered [1].
>
> $ git tag --contains 5498377205523052476ed81aebb2c2e6973f67ef
> v10.2.3
>
> What are your tests that say otherwise?
>
> [1] 
> https://github.com/ceph/ceph/pull/10797/commits/5498377205523052476ed81aebb2c2e6973f67ef
>
> On Fri, Oct 21, 2016 at 7:42 AM, Pavan Rallabhandi
>  wrote:
> > I see the fix for write back cache not getting turned on after flush 
> has made into Jewel 10.2.3 ( http://tracker.ceph.com/issues/17080 ) but our 
> testing says otherwise.
> >
> > The cache is still behaving as if its writethrough, though the setting 
> is set to true. Wanted to check if it’s still broken in Jewel 10.2.3 or am I 
> missing anything here?
> >
> > Thanks,
> > -Pavan.
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason
>
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and TCP States

2016-10-21 Thread Haomai Wang
On Fri, Oct 21, 2016 at 10:31 PM, Nick Fisk  wrote:

> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of Haomai Wang
> > Sent: 21 October 2016 15:28
> > To: Nick Fisk 
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Ceph and TCP States
> >
> >
> >
> > On Fri, Oct 21, 2016 at 10:19 PM, Nick Fisk 
> wrote:
> > Hi,
> >
> > I'm just testing out using a Ceph client in a DMZ behind a FW from the
> main Ceph cluster. One thing I have noticed is that if the
> > state table on the FW is emptied maybe by restarting it or just clearing
> the state table...etc. Then the Ceph client will hang for a
> > long time as the TCP session can no longer pass through the FW and just
> gets blocked instead.
> >
> > This "FW" is linux firewall or hardware FW?
>
> PFSense running on dedicated HW. Eventually they will be in a HA pair so
> states should persist, but trying to work around this for now. Bit annoying
> having CephFS lock hard for 15 minutes even though the network connection
> only went down for a few seconds.
>

hmm, I'm not familiar with this fw. And from my view, whether RST packet
sent is decided by FW. But I think you can try
"/proc/sys/net/ipv4/tcp_keepalive_time", if FW reset tcp session, tcp
keepalive should detect and send a rst.

>
> >
> >
> > I believe this behaviour can be adjusted by the "ms tcp read timeout"
> setting to limit its impact, but wondering if anybody has any
> > other ideas. I'm also thinking of experimenting with either stateless FW
> rules for Ceph or getting the FW to send back RST packets
> > instead of silently dropping packets.
> >
> > hmm, I think it depends on FW
> >
> >
> > Thanks,
> > Nick
> >
> > ___
> > ceph-users mailing list
> > mailto:ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd cache writethrough until flush

2016-10-21 Thread Pavan Rallabhandi
And to add, the host running Cinder services is having Hammer 0.94.9 but the 
rest of them like Nova are on Jewel 10.2.3

FWIW, the rbd info for one such image looks like this:

rbd image 'volume-f6ec45e2-b644-4b58-b6b5-b3a418c3c5b2':
size 2048 MB in 512 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.5ebf12d1934e
format: 2
features: layering, striping
flags: 
stripe unit: 4096 kB
stripe count: 1

Thanks!

On 10/21/16, 7:26 PM, "ceph-users on behalf of Pavan Rallabhandi" 
 
wrote:

Both the clients and the cluster are running 10.2.3, perhaps the only 
difference is that the clients are on Trusty and the cluster is Xenial.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and TCP States

2016-10-21 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Haomai Wang
> Sent: 21 October 2016 15:28
> To: Nick Fisk 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph and TCP States
> 
> 
> 
> On Fri, Oct 21, 2016 at 10:19 PM, Nick Fisk  wrote:
> Hi,
> 
> I'm just testing out using a Ceph client in a DMZ behind a FW from the main 
> Ceph cluster. One thing I have noticed is that if the
> state table on the FW is emptied maybe by restarting it or just clearing the 
> state table...etc. Then the Ceph client will hang for a
> long time as the TCP session can no longer pass through the FW and just gets 
> blocked instead.
> 
> This "FW" is linux firewall or hardware FW?

PFSense running on dedicated HW. Eventually they will be in a HA pair so states 
should persist, but trying to work around this for now. Bit annoying having 
CephFS lock hard for 15 minutes even though the network connection only went 
down for a few seconds.

> 
> 
> I believe this behaviour can be adjusted by the "ms tcp read timeout" setting 
> to limit its impact, but wondering if anybody has any
> other ideas. I'm also thinking of experimenting with either stateless FW 
> rules for Ceph or getting the FW to send back RST packets
> instead of silently dropping packets.
> 
> hmm, I think it depends on FW
> 
> 
> Thanks,
> Nick
> 
> ___
> ceph-users mailing list
> mailto:ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and TCP States

2016-10-21 Thread Haomai Wang
On Fri, Oct 21, 2016 at 10:19 PM, Nick Fisk  wrote:

> Hi,
>
> I'm just testing out using a Ceph client in a DMZ behind a FW from the
> main Ceph cluster. One thing I have noticed is that if the
> state table on the FW is emptied maybe by restarting it or just clearing
> the state table...etc. Then the Ceph client will hang for a
> long time as the TCP session can no longer pass through the FW and just
> gets blocked instead.
>

This "FW" is linux firewall or hardware FW?


>
> I believe this behaviour can be adjusted by the "ms tcp read timeout"
> setting to limit its impact, but wondering if anybody has any
> other ideas. I'm also thinking of experimenting with either stateless FW
> rules for Ceph or getting the FW to send back RST packets
> instead of silently dropping packets.
>

hmm, I think it depends on FW


>
> Thanks,
> Nick
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph and TCP States

2016-10-21 Thread Nick Fisk
Hi,

I'm just testing out using a Ceph client in a DMZ behind a FW from the main 
Ceph cluster. One thing I have noticed is that if the
state table on the FW is emptied maybe by restarting it or just clearing the 
state table...etc. Then the Ceph client will hang for a
long time as the TCP session can no longer pass through the FW and just gets 
blocked instead.

I believe this behaviour can be adjusted by the "ms tcp read timeout" setting 
to limit its impact, but wondering if anybody has any
other ideas. I'm also thinking of experimenting with either stateless FW rules 
for Ceph or getting the FW to send back RST packets
instead of silently dropping packets.

Thanks,
Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd cache writethrough until flush

2016-10-21 Thread Pavan Rallabhandi
From my VMs that have cinder provisioned volumes, I tried dd / fio (like below) 
to find the IOPS to be less, even a sync before the runs didn’t help. Same runs 
by setting the option to false yield better results.

Both the clients and the cluster are running 10.2.3, perhaps the only 
difference is that the clients are on Trusty and the cluster is Xenial.

dd if=/dev/zero of=/dev/vdd bs=4K count=1000 oflag=direct

fio -name iops -rw=write -bs=4k -direct=1  -runtime=60 -iodepth 1 -filename 
/dev/vde -ioengine=libaio 

Thanks,
-Pavan.

On 10/21/16, 6:15 PM, "Jason Dillaman"  wrote:

It's in the build and has tests to verify that it is properly being
triggered [1].

$ git tag --contains 5498377205523052476ed81aebb2c2e6973f67ef
v10.2.3

What are your tests that say otherwise?

[1] 
https://github.com/ceph/ceph/pull/10797/commits/5498377205523052476ed81aebb2c2e6973f67ef

On Fri, Oct 21, 2016 at 7:42 AM, Pavan Rallabhandi
 wrote:
> I see the fix for write back cache not getting turned on after flush has 
made into Jewel 10.2.3 ( http://tracker.ceph.com/issues/17080 ) but our testing 
says otherwise.
>
> The cache is still behaving as if its writethrough, though the setting is 
set to true. Wanted to check if it’s still broken in Jewel 10.2.3 or am I 
missing anything here?
>
> Thanks,
> -Pavan.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd cache writethrough until flush

2016-10-21 Thread Jason Dillaman
It's in the build and has tests to verify that it is properly being
triggered [1].

$ git tag --contains 5498377205523052476ed81aebb2c2e6973f67ef
v10.2.3

What are your tests that say otherwise?

[1] 
https://github.com/ceph/ceph/pull/10797/commits/5498377205523052476ed81aebb2c2e6973f67ef

On Fri, Oct 21, 2016 at 7:42 AM, Pavan Rallabhandi
 wrote:
> I see the fix for write back cache not getting turned on after flush has made 
> into Jewel 10.2.3 ( http://tracker.ceph.com/issues/17080 ) but our testing 
> says otherwise.
>
> The cache is still behaving as if its writethrough, though the setting is set 
> to true. Wanted to check if it’s still broken in Jewel 10.2.3 or am I missing 
> anything here?
>
> Thanks,
> -Pavan.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd cache writethrough until flush

2016-10-21 Thread Pavan Rallabhandi
I see the fix for write back cache not getting turned on after flush has made 
into Jewel 10.2.3 ( http://tracker.ceph.com/issues/17080 ) but our testing says 
otherwise. 

The cache is still behaving as if its writethrough, though the setting is set 
to true. Wanted to check if it’s still broken in Jewel 10.2.3 or am I missing 
anything here?

Thanks,
-Pavan.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd multipath by export iscsi gateway

2016-10-21 Thread Iban Cabrillo
HI tao,
 I would make something like this :


https://support.zadarastorage.com/hc/en-us/articles/213024386-How-To-setup-Multiple-iSCSI-sessions-and-MultiPath-on-your-Linux-Cloud-Server

regards, I

2016-10-21 11:33 GMT+02:00 tao chang :

> HI All,
>
> I try to configure multipath  by export two iscsi gateway on two host
> with same one rbd volume,   The steps is below:
>
> one host1:
> 1)   mapped a rbd volume as a local blockdevice
>  rbd map fastpool/vdisk
>
>   as  /dev/rbd0
>
>  2) export /dev/rbd0 as iscsi taget:
>   [root@zk23-03 ~]# targetcli / ls
> o- / 
> .
> [...]
>   o- backstores
> 
> ..
> [...]
>   | o- block 
> ..
> [Storage Objects: 1]
>   | | o- fastpool_vdisk
> 
> [/dev/rbd0 (195.3GiB) write-thru activated]
>   | o- fileio 
> .
> [Storage Objects: 0]
>   | o- pscsi 
> ..
> [Storage Objects: 0]
>   | o- ramdisk ..
> ..
> [Storage Objects: 0]
>   o- iscsi 
> 
> [Targets: 1]
>   | o- iqn.2016-05.com.zettakit.www:fastpool.vdisk
> ...
> [TPGs: 1]
>   |   o- tpg1 
> ...
> [no-gen-acls, no-auth]
>   | o- acls
> 
> ..
> [ACLs: 1]
>   | | o- iqn.2016-05.com.zettakit:node2202.local1
> . [Mapped
> LUNs: 1]
>   | |   o- mapped_lun0
> 
> [lun0 block/fastpool_vdisk (rw)]
>   | o- luns
> 
> ..
> [LUNs: 1]
>   | | o- lun0
> 
> ...
> [block/fastpool_vdisk (/dev/rbd0)]
>   | o- portals
> 
> 
> [Portals: 0]
>   o- loopback 
> .
> [Targets: 0]
>   o- srpt 
> .
> [Targets: 0]
>
>
>
>
> one host2, do the same:
>   one host1:
> 1)   mapped a rbd volume as a local blockdevice
>  rbd map fastpool/vdisk
>
>   as  /dev/rbd0
>
>  2) export /dev/rbd0 as iscsi taget:
>
> [root@zk23-01 ~]# targetcli / ls
> o- / 
> .
> [...]
>   o- backstores
> 
> ..
> [...]
>   | o- block 
> ..
> [Storage Objects: 1]
>   | | o- fastpool_vdisk
> 
> [/dev/rbd0 (195.3GiB) write-thru activated]
>   | o- fileio 
> .
> [Storage Objects: 0]
>   | o- pscsi 
> ..
> [Storage Objects: 0]
>   | o- ramdisk ..
> ..
> [Storage Objects: 0]
>   o- iscsi 
> 
> [Targets: 1]
>   | o- iqn.2016-05.com.zettakit.www:fastpool.vdisk
> ...
> [TPGs: 1]
>   |   o- tpg1 
> ...
> [no-gen-acls, no-auth]
>   | o- acls
> 
> ..
> [ACLs: 1]
>   | | o- iqn.2016-05.com.zettakit:node2202.local1
> 

[ceph-users] rbd multipath by export iscsi gateway

2016-10-21 Thread tao chang
HI All,

I try to configure multipath  by export two iscsi gateway on two host
with same one rbd volume,   The steps is below:

one host1:
1)   mapped a rbd volume as a local blockdevice
 rbd map fastpool/vdisk

  as  /dev/rbd0

 2) export /dev/rbd0 as iscsi taget:
  [root@zk23-03 ~]# targetcli / ls
o- / 
.
[...]
  o- backstores
..
[...]
  | o- block 
..
[Storage Objects: 1]
  | | o- fastpool_vdisk

[/dev/rbd0 (195.3GiB) write-thru activated]
  | o- fileio 
.
[Storage Objects: 0]
  | o- pscsi 
..
[Storage Objects: 0]
  | o- ramdisk 

[Storage Objects: 0]
  o- iscsi 

[Targets: 1]
  | o- iqn.2016-05.com.zettakit.www:fastpool.vdisk
...
[TPGs: 1]
  |   o- tpg1 
...
[no-gen-acls, no-auth]
  | o- acls
..
[ACLs: 1]
  | | o- iqn.2016-05.com.zettakit:node2202.local1
. [Mapped
LUNs: 1]
  | |   o- mapped_lun0

[lun0 block/fastpool_vdisk (rw)]
  | o- luns
..
[LUNs: 1]
  | | o- lun0
...
[block/fastpool_vdisk (/dev/rbd0)]
  | o- portals

[Portals: 0]
  o- loopback 
.
[Targets: 0]
  o- srpt 
.
[Targets: 0]




one host2, do the same:
  one host1:
1)   mapped a rbd volume as a local blockdevice
 rbd map fastpool/vdisk

  as  /dev/rbd0

 2) export /dev/rbd0 as iscsi taget:

[root@zk23-01 ~]# targetcli / ls
o- / 
.
[...]
  o- backstores
..
[...]
  | o- block 
..
[Storage Objects: 1]
  | | o- fastpool_vdisk

[/dev/rbd0 (195.3GiB) write-thru activated]
  | o- fileio 
.
[Storage Objects: 0]
  | o- pscsi 
..
[Storage Objects: 0]
  | o- ramdisk 

[Storage Objects: 0]
  o- iscsi 

[Targets: 1]
  | o- iqn.2016-05.com.zettakit.www:fastpool.vdisk
...
[TPGs: 1]
  |   o- tpg1 
...
[no-gen-acls, no-auth]
  | o- acls
..
[ACLs: 1]
  | | o- iqn.2016-05.com.zettakit:node2202.local1
. [Mapped
LUNs: 1]
  | |   o- mapped_lun0

[lun0 block/fastpool_vdisk (rw)]
  | o- luns
..
[LUNs: 1]
  | | o- lun0
...
[block/fastpool_vdisk (/dev/rbd0)]
  | o- portals

Re: [ceph-users] offending shards are crashing osd's

2016-10-21 Thread Ronny Aasen

On 19. okt. 2016 13:00, Ronny Aasen wrote:

On 06. okt. 2016 13:41, Ronny Aasen wrote:

hello

I have a few osd's in my cluster that are regularly crashing.


[snip]



ofcourse having 3 osd's dying regularly is not good for my health. so i
have set noout, to avoid heavy recoveries.

googeling this error messages gives exactly 1 hit:
https://github.com/ceph/ceph/pull/6946

where it saies:  "the shard must be removed so it can be reconstructed"
but with my 3 osd's failing, i am not certain witch of them contain the
broken shard. (or perhaps all 3 of them?)

a bit reluctant to delete on all 3. I have 4+2 erasure coding.
( erasure size 6 min_size 4 ) so finding out witch one is bad would be
nice.

hope someone have an idea how to progress.

kind regards
Ronny Aasen


i again have this problem with crashing osd's. a more detailed log is on
the tail of this mail.

Does anyone have any suggestions on how i can identify what shard that
needs to be removed to allow the EC to recover. ?

and more importantly how i can stop the osd's from crashing?


kind regards
Ronny Aasen



Answering my own question for googleabillity.

using this one liner.

for dir in $(find /var/lib/ceph/osd/ceph-* -maxdepth 2  -type d -name 
'5.26*' | sort | uniq) ; do find $dir -name 
'*3a3938238e1f29.002d80ca*' -type f -ls ;done


i got a list of all shards of the problematic object.
One of the object had size 0 but was otherways readable without any io 
errors. I guess this explains the inconsistent size, but it does not 
explain why ceph decides it's better to crash 3 osd's, rather then move 
a 0 byte file into a "LOST+FOUND" style directory structure.

Or just delete it, since it will not have any useful data anyway.

Deleting this file (mv to /tmp). allowed the 3 broken osd's to start, 
and have been running for >24h now. while usualy they crash within 10 
minutes. Yay!


Generally you need to check _all_ shards on the given pg. Not just the 3 
crashing. This was what confused me since i only focused on the crashing 
osd's


I used the oneliner that checked osd's for the pg since due to 
backfilling the pg was spread all over the place. And i could run it 
from ansible to reduce tedious work.


Also it would be convinient to be able to mark a broken/inconsistent pg 
manually "inactive". Instead of crashing 3 osd's and taking lots of 
other pg's with them down. One could set the pg inactive while 
troubleshooting, and unset pg-inactive when done. without having osd's 
crash and all the following high load rebalancing.


Also i ran a find for 0 size files on that pg and there are multiple 
other files.  are a 0 byte rbd_data file on a pg a normal occurence, or 
can i have more similar problems in the future due to the other 0 size 
files ?



kind regards
Ronny Aasen


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Memory leak in radosgw

2016-10-21 Thread Ben Morrice
What version of libcurl are you using?

I was hitting this bug with RHEL7/libcurl 7.29 which could also be your
catalyst.

http://tracker.ceph.com/issues/15915

Kind regards,

Ben Morrice

__
Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
EPFL ENT CBS BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

On 20/10/16 21:41, Trey Palmer wrote:
> I've been trying to test radosgw multisite and have a pretty bad memory
> leak.It appears to be associated only with multisite sync.
>
> Multisite works well for a small numbers of objects.However, it all
> fell over when I wrote in 8M 64K objects to two buckets overnight for
> testing (via cosbench).
>
> The leak appears to happen on the multisite transfer source -- that is, the
> node where the objects were written originally.   The radosgw process
> eventually dies, I'm sure via the OOM killer, and systemd restarts it.
> Then repeat, though multisite sync pretty much stops at that point.
>
> I have tried 10.2.2, 10.2.3 and a combination of the two.   I'm running on
> CentOS 7.2, using civetweb with SSL.   I saw that the memory profiler only
> works on mon, osd and mds processes.
>
> Anyone else seen anything like this?
>
>-- Trey
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com