Use 2 as fallback inode number on 32-bit kernels

2012-01-23 Thread Amon Ott
Hi folks,

the current code on 32-bit uses 1 as the fallback number if the inode number 
calculation returns 0. This always collides with the inode number of the root 
inode. The attached patch changes the fallback number to 2, which appears to be 
unused on our test systems.

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH   Tel: +49 30 24342334
Am Köllnischen Park 1    Fax: +49 30 24342336
10179 Berlin http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649
commit e9b257a3b1364980f3247e110c4ded56b4fbe54a
Author: Amon Ott 
Date:   Fri Jan 20 15:40:51 2012 +0100

Use 2 instead of 1 as fallback for 32-bit inode numbers.

The root directory of the Ceph mount has inode number 1, so falling back
to 1 always creates a collision. 2 is unused on my test systems and seems
less likely to collide.

Signed-off-by: Amon Ott 

diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 9d19790..595f026 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -351,7 +351,7 @@ static inline u32 ceph_ino_to_ino32(__u64 vino)
 	u32 ino = vino & 0xffffffff;
 	ino ^= vino >> 32;
 	if (!ino)
-		ino = 1;
+		ino = 2;
 	return ino;
 }
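
For illustration, a small user-space sketch of the folding logic above (not the kernel code itself; the mask and types are assumed to match the mainline helper):

/* Fold a 64-bit Ceph inode number into 32 bits by XORing the low and
 * high 32-bit halves; if the result is 0, use the fallback value.
 * With the patch above the fallback is 2, so it no longer collides
 * with the root inode (inode number 1). */
#include <stdint.h>
#include <stdio.h>

static uint32_t ino_to_ino32(uint64_t vino)
{
	uint32_t ino = (uint32_t)(vino & 0xffffffffULL);

	ino ^= (uint32_t)(vino >> 32);
	if (!ino)
		ino = 2;	/* fallback; was 1 before the patch */
	return ino;
}

int main(void)
{
	/* Low and high halves differ: folds to a non-zero value (3 here). */
	printf("%u\n", ino_to_ino32(0x100000002ULL));
	/* Low half equals high half: the XOR is 0, so the fallback 2 is used. */
	printf("%u\n", ino_to_ino32(0xabcdef12abcdef12ULL));
	return 0;
}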
 


Re: Btrfs slowdown with ceph (how to reproduce)

2012-01-23 Thread Josef Bacik
On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
> As you might know, I have been seeing btrfs slowdowns in our ceph
> cluster for quite some time. Even with the latest btrfs code for 3.3
> I'm still seeing these problems. To make things reproducible, I've now
> written a small test that imitates ceph's behavior:
> 
> On a freshly created btrfs filesystem (2 TB size, mounted with
> "noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening
> 100 files. After that I'm doing random writes on these files with a
> sync_file_range after each write (each write has a size of 100 bytes)
> and ioctl(BTRFS_IOC_SYNC) after every 100 writes.
> 
> After approximately 20 minutes, write activity suddenly increases
> fourfold and the average request size decreases (see chart in the
> attachment).
> 
> You can find IOstat output here: http://pastebin.com/Smbfg1aG
> 
> I hope that you are able to trace down the problem with the test
> program in the attachment.
 
Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's tree and
formatted the fs with 64k node and leaf sizes, and the problem appeared to go
away.  So surprise surprise, fragmentation is biting us in the ass.  If you can,
try running that branch with 64k node and leaf sizes on your ceph cluster and
see how that works out.  Of course you should only do that if you don't mind
losing everything :).  Thanks,

Josef
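
For anyone who wants to try this without the original attachment, here is a minimal sketch of the workload Christian describes above (this is not his actual test program; the file size, the sync_file_range flags and the BTRFS_IOC_SYNC definition are assumptions):

/* Sketch: open 100 files on a btrfs mount, do random 100-byte writes,
 * call sync_file_range() after every write and BTRFS_IOC_SYNC after
 * every 100 writes.  Runs until interrupted; the reported slowdown
 * appeared after roughly 20 minutes. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define BTRFS_IOC_SYNC	_IO(0x94, 8)	/* assumed btrfs ioctl magic/number */
#define NFILES		100
#define FILESIZE	(4 * 1024 * 1024)	/* assumed; not given in the mail */
#define WRITESIZE	100

int main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : ".";
	char path[4096], buf[WRITESIZE];
	int fds[NFILES];
	long i;

	memset(buf, 'x', sizeof(buf));
	for (i = 0; i < NFILES; i++) {
		snprintf(path, sizeof(path), "%s/testfile.%ld", dir, i);
		fds[i] = open(path, O_RDWR | O_CREAT, 0644);
		if (fds[i] < 0) {
			perror("open");
			return 1;
		}
	}

	for (i = 0; ; i++) {
		int fd = fds[rand() % NFILES];
		off_t off = (off_t)(rand() % (FILESIZE / WRITESIZE)) * WRITESIZE;

		if (pwrite(fd, buf, WRITESIZE, off) != WRITESIZE) {
			perror("pwrite");
			return 1;
		}
		/* flush only the range that was just written */
		sync_file_range(fd, off, WRITESIZE,
				SYNC_FILE_RANGE_WAIT_BEFORE |
				SYNC_FILE_RANGE_WRITE |
				SYNC_FILE_RANGE_WAIT_AFTER);
		if (i % 100 == 99)
			ioctl(fd, BTRFS_IOC_SYNC);	/* filesystem-wide sync */
	}
	return 0;
}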


Re: upgrade from 0.39 to 0.40 failed...

2012-01-23 Thread Gregory Farnum
On Sun, Jan 22, 2012 at 4:25 AM, Smart Weblications GmbH - Florian
Wiessner  wrote:
> On 22.01.2012 02:19, Yehuda Sadeh Weinraub wrote:
>> On Sat, Jan 21, 2012 at 9:43 AM, Smart Weblications GmbH - Florian
>> Wiessner  wrote:
>>>
>>> 2) v1 -- ?+0 0x184dc00
>>> 2012-01-21 18:42:36.758683 7f60e26e0700 -- 192.168.0.6:6789/0 --> mon.3
>>> 192.168.0.7:6789/0 -- mon_probe(probe 6bac7900-17c9-47c4-8b8e-f3dd7c22c73d 
>>> name
>>> 2) v1 -- ?+0 0x184d900
>>> 2012-01-21 18:42:36.759270 7f60e6d62700 cephx: verify_authorizer_reply 
>>> exception
>>> in decode_decrypt with AQAM+RpP0JI9LRAAWyy1Flf5X6RUVxEjhAEFtg==
>>> 2012-01-21 18:42:36.759287 7f60e6d62700 -- 192.168.0.6:6789/0 >>
>>> 192.168.0.4:6789/0 pipe(0xc91500 sd=9 pgs=0 cs=0 l=0).failed verifying 
>>> authorize
>>> reply
>>> ^C
>>>
>>>
>> Can you verify that there's no 'auth supported' line in your mon ceph.conf?
>>
>> Also, try running (with the correct ceph.conf path):
>>
>> # ceph-conf --lookup -c /path/to/ceph.conf 'auth supported' --name mon.2
>>
>> or just:
>>
>> # ceph-conf --lookup -c /path/to/ceph.conf 'auth supported'
> node04:/tmp# ceph-conf --lookup -c /etc/ceph/ceph.conf 'auth supported' 
> --name mon.2
> none
> node04:/tmp# ceph-conf --lookup -c /etc/ceph/ceph.conf 'auth supported' 
> --name mon.0
> none
>
>
> i now explicitly set auth supported = none, but still it is not working :(

Have you previously used cephx? I notice that it was a commented-out
line in your initial conf. :) I tried to reproduce quickly by creating
a v0.39 system and upgrading to v0.40 and didn't see any trouble.

Otherwise we'll need to start generating very explicit logs and stuff.
-Greg
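
For reference, the option being looked up above normally lives in the [global] section of ceph.conf; a minimal sketch of explicitly disabling authentication (section placement assumed):

[global]
        auth supported = none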


Re: How to remove lost objects.

2012-01-23 Thread Gregory Farnum
On Thu, Jan 19, 2012 at 12:36 PM, Andrey Stepachev  wrote:
> 2012/1/19 Gregory Farnum :
>> On Thu, Jan 19, 2012 at 12:53 AM, Andrey Stepachev  wrote:
>>> 2012/1/19 Gregory Farnum :
>>>> On Wednesday, January 18, 2012, Andrey Stepachev  wrote:
>>>>> 2012/1/19 Gregory Farnum :
>>>>>> On Wed, Jan 18, 2012 at 12:48 PM, Andrey Stepachev 
>>>>>> wrote:
>>>>>>> But I still don't know what happens with ceph, so that it can't
>>>>>>> respond and hangs. This is not good behavior, because such a
>>>>>>> situation leads to an unresponsive cluster in case of a temporary
>>>>>>> network failure.
>>>>>>
>>>>>> I'm a little concerned about this — I would expect to see hangs of up
>>>>>> to ~30 seconds (the timeout period), but for operations to then
>>>>>> continue. Are you putting the MDS down? If so, do you have any
>>>>>> standbys specified?
>>>>>
>>>>> Yes, the MDS goes down (I restart it at some point while changing
>>>>> something in the config).
>>>>> Yes, I have 2 standbys.
>>>>> Clients hang for more than 10 minutes.
>>>>
>>>> Okay, so it's probably an issue with the MDS not entering recovery when it
>>>> should. Are you also taking down one of the monitor nodes? There's a known
>>>> bug which can cause a standby MDS to wait up to 15 minutes if its monitor
>>>> goes down, which is fixed in latest master (and maybe .40; I'd have to
>>>> check).
>>>
>>> Yes. I have collocated mon mds and osd on some nodes.
>>> And restart all daemons at once. I use 0.40. (built from my github fork).
>>
>> Hrm. I checked and the fix is in 0.40. Can you reproduce this with
>> client logging enabled (--debug_ms 1 --debug_client 10) and post the
>> logs somewhere for me to check out? That should be able to isolate the
>> problem area at least.
>
> Client writes "renew caps" and nothing more.
> I tried to reproduce the problem with more logging, but still no luck.
> Maybe the debug logging serializes the race somewhere and prevents
> this bug from occurring.

Any updates on this? "renew caps" being the last thing in the log
doesn't actually mean much, unfortunately. We're going to need logs of
some description in order to give you any more help.
-Greg
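
For anyone following along, the client logging Greg asks for can also be set in ceph.conf instead of on the command line; a minimal sketch, with the option spelling assumed from the --debug_ms/--debug_client flags:

[client]
        debug ms = 1
        debug client = 10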


Re: Btrfs slowdown with ceph (how to reproduce)

2012-01-23 Thread Chris Mason
On Mon, Jan 23, 2012 at 01:19:29PM -0500, Josef Bacik wrote:
> On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
> > As you might know, I have been seeing btrfs slowdowns in our ceph
> > cluster for quite some time. Even with the latest btrfs code for 3.3
> > I'm still seeing these problems. To make things reproducible, I've now
> > written a small test that imitates ceph's behavior:
> > 
> > On a freshly created btrfs filesystem (2 TB size, mounted with
> > "noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening
> > 100 files. After that I'm doing random writes on these files with a
> > sync_file_range after each write (each write has a size of 100 bytes)
> > and ioctl(BTRFS_IOC_SYNC) after every 100 writes.
> > 
> > After approximately 20 minutes, write activity suddenly increases
> > fourfold and the average request size decreases (see chart in the
> > attachment).
> > 
> > You can find IOstat output here: http://pastebin.com/Smbfg1aG
> > 
> > I hope that you are able to trace down the problem with the test
> > program in the attachment.
>  
> Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's tree and
> formatted the fs with 64k node and leaf sizes, and the problem appeared to go
> away.  So surprise surprise, fragmentation is biting us in the ass.  If you can,
> try running that branch with 64k node and leaf sizes on your ceph cluster and
> see how that works out.  Of course you should only do that if you don't mind
> losing everything :).  Thanks,
> 

Please keep in mind this branch is only out there for development, and
it really might have huge flaws.  scrub doesn't work with it correctly
right now, and the IO error recovery code is probably broken too.

Long term though, I think the bigger block sizes are going to make a
huge difference in these workloads.

If you use the very dangerous code:

mkfs.btrfs -l 64k -n 64k /dev/xxx

(-l is leaf size, -n is node size).

64K is the max right now, 32K may help just as much at a lower CPU cost.

-chris



Re: Btrfs slowdown with ceph (how to reproduce)

2012-01-23 Thread Christian Brunner
2012/1/23 Chris Mason :
> On Mon, Jan 23, 2012 at 01:19:29PM -0500, Josef Bacik wrote:
>> On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
>> > As you might know, I have been seeing btrfs slowdowns in our ceph
>> > cluster for quite some time. Even with the latest btrfs code for 3.3
>> > I'm still seeing these problems. To make things reproducible, I've now
>> > written a small test that imitates ceph's behavior:
>> >
>> > On a freshly created btrfs filesystem (2 TB size, mounted with
>> > "noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening
>> > 100 files. After that I'm doing random writes on these files with a
>> > sync_file_range after each write (each write has a size of 100 bytes)
>> > and ioctl(BTRFS_IOC_SYNC) after every 100 writes.
>> >
>> > After approximately 20 minutes, write activity suddenly increases
>> > fourfold and the average request size decreases (see chart in the
>> > attachment).
>> >
>> > You can find IOstat output here: http://pastebin.com/Smbfg1aG
>> >
>> > I hope that you are able to trace down the problem with the test
>> > program in the attachment.
>>
>> Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's tree and
>> formatted the fs with 64k node and leaf sizes, and the problem appeared to go
>> away.  So surprise surprise, fragmentation is biting us in the ass.  If you can,
>> try running that branch with 64k node and leaf sizes on your ceph cluster and
>> see how that works out.  Of course you should only do that if you don't mind
>> losing everything :).  Thanks,
>>
>
> Please keep in mind this branch is only out there for development, and
> it really might have huge flaws.  scrub doesn't work with it correctly
> right now, and the IO error recovery code is probably broken too.
>
> Long term though, I think the bigger block sizes are going to make a
> huge difference in these workloads.
>
> If you use the very dangerous code:
>
> mkfs.btrfs -l 64k -n 64k /dev/xxx
>
> (-l is leaf size, -n is node size).
>
> 64K is the max right now, 32K may help just as much at a lower CPU cost.

Thanks for taking a look. I'm glad to hear that there is a solution
on the horizon, but I'm not brave enough to try this on our ceph
cluster. I'll try it when the code has stabilized a bit.

Regards,
Christian


Re: How to remove lost objects.

2012-01-23 Thread Andrey Stepachev
2012/1/23 Gregory Farnum :
> On Thu, Jan 19, 2012 at 12:36 PM, Andrey Stepachev  wrote:
>> 2012/1/19 Gregory Farnum :
>>> On Thu, Jan 19, 2012 at 12:53 AM, Andrey Stepachev  wrote:
>>>> 2012/1/19 Gregory Farnum :
>>>>> On Wednesday, January 18, 2012, Andrey Stepachev  wrote:
>>>>>> 2012/1/19 Gregory Farnum :
>>>>>>> On Wed, Jan 18, 2012 at 12:48 PM, Andrey Stepachev 
>>>>>>> wrote:
>>>>>>>> But I still don't know what happens with ceph, so that it can't
>>>>>>>> respond and hangs. This is not good behavior, because such a
>>>>>>>> situation leads to an unresponsive cluster in case of a temporary
>>>>>>>> network failure.
>>>>>>>
>>>>>>> I'm a little concerned about this — I would expect to see hangs of up
>>>>>>> to ~30 seconds (the timeout period), but for operations to then
>>>>>>> continue. Are you putting the MDS down? If so, do you have any
>>>>>>> standbys specified?
>>>>>>
>>>>>> Yes, the MDS goes down (I restart it at some point while changing
>>>>>> something in the config).
>>>>>> Yes, I have 2 standbys.
>>>>>> Clients hang for more than 10 minutes.
>>>>>
>>>>> Okay, so it's probably an issue with the MDS not entering recovery when it
>>>>> should. Are you also taking down one of the monitor nodes? There's a known
>>>>> bug which can cause a standby MDS to wait up to 15 minutes if its monitor
>>>>> goes down, which is fixed in latest master (and maybe .40; I'd have to
>>>>> check).
>>>>
>>>> Yes. I have collocated mon mds and osd on some nodes.
>>>> And restart all daemons at once. I use 0.40. (built from my github fork).
>>>
>>> Hrm. I checked and the fix is in 0.40. Can you reproduce this with
>>> client logging enabled (--debug_ms 1 --debug_client 10) and post the
>>> logs somewhere for me to check out? That should be able to isolate the
>>> problem area at least.
>>
>> Client writes "renew caps" and nothing more.
>> I tried to reproduce the problem with more logging, but still no luck.
>> Maybe the debug logging serializes the race somewhere and prevents
>> this bug from occurring.
>
> Any updates on this? "renew caps" being the last thing in the log
> doesn't actually mean much, unfortunately. We're going to need logs of
> some description in order to give you any more help.

I've been switched to another urgent task now, so in a week or two
I'll get back to ceph and try to reproduce these hangs to find out
what is going on.


> -Greg



-- 
Andrey.