Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-29 Thread Ric Wheeler

On 09/29/2010 09:19 AM, Lennart Poettering wrote:
> On Tue, 28.09.10 20:08, Josef Bacik (jo...@redhat.com) wrote:
>
>> On Tue, Sep 28, 2010 at 07:25:13PM -0400, Christoph Hellwig wrote:
>>> On Tue, Sep 28, 2010 at 04:53:16PM -0400, Josef Bacik wrote:
>>>> This was a request from the systemd guys.  They need a quick and easy way to get
>>>> all devices attached to a Btrfs filesystem in order to check if any of the disks
>>>> are SSD for...something, I didn't ask :).   I've tested this with the
>>>> btrfs-progs patch that accompanies this patch.  Thanks,
>>>
>>> So please tell the "systemd guys" to explain what the fuck they're doing
>>> to linux-fsdevel and find a proper interface.  Chances they will fuck
>>> up as much as just about every other lowlevel userspace tool are very
>>> high.
>>
>> Lennart? :).  And Christoph, what would be a good interface?  LVM has a slaves/
>> subdir in sysfs which symlinks to all of their dev's, would you rather I
>> resurrect the sysfs stuff for Btrfs and do a similar thing?  I'm open to
>> suggestions, I just took the quick and painless way out.  Thanks,
>
> When doing readahead you want to know whether you are on SSD or rotating
> media, because you a) want to order the readahead requests on bootup
> after access time on SSD and after location on disk on rotating
> media. And b) because you might want to prioritize readahead reads over
> other reads on rotating media, but prefer other reads over readahead
> reads on SSD.
>
> This in fact is how all current readahead implementations work, be it
> the fedora, the suse or ubuntu's readahead or Arjan's sreadahead. What's
> new is that in the systemd case we try to test for ssd/rotating
> properly, instead of just hardcoding a check for
> /sys/class/block/sda/queue/rotational.



A couple of questions pop into mind - is systemd the right place to 
automatically tune readahead?  If this is a generic feature for the type of 
device, it sounds like something that we should be doing somewhere else in the 
stack (not relying on tuning from user space).


Second question is why is checking in /sys a big deal, would  you prefer an 
interface like we did for alignment in libblkid?


Ric
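
A minimal user-space sketch of the per-disk check mentioned above, assuming
the whole-disk name (e.g. "sda") is already known, which is exactly the part
that is missing for a btrfs mount. The helper names parse_rotational() and
disk_is_rotational() are made up for illustration; only the sysfs path comes
from the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Interpret the contents of a queue/rotational attribute.
 * Returns 1 for rotating media, 0 for non-rotating media (e.g. SSD),
 * and -1 if the text is not recognised.
 */
static int parse_rotational(const char *text)
{
	if (text == NULL)
		return -1;
	if (strcmp(text, "1\n") == 0 || strcmp(text, "1") == 0)
		return 1;
	if (strcmp(text, "0\n") == 0 || strcmp(text, "0") == 0)
		return 0;
	return -1;
}

/*
 * Read /sys/class/block/<disk>/queue/rotational for a whole-disk name.
 * Returns the parse_rotational() result, or -1 if the file cannot be read.
 */
static int disk_is_rotational(const char *disk)
{
	char path[256];
	char buf[8] = "";
	FILE *f;
	int n;

	assert(disk != NULL);
	n = snprintf(path, sizeof(path),
		     "/sys/class/block/%s/queue/rotational", disk);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fgets(buf, sizeof(buf), f) == NULL)
		buf[0] = '\0';
	fclose(f);
	return parse_rotational(buf);
}
```

The caller still has to map a mount point to disk names before any of this is
useful, which is the gap the proposed ioctl is meant to fill.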

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-29 Thread Kay Sievers
On Wed, Sep 29, 2010 at 09:25, Ric Wheeler  wrote:

> Second question is why is checking in /sys a big deal, would  you prefer an
> interface like we did for alignment in libblkid?

It's about knowing what's behind the 'nodev' major == 0 of a btrfs
mount. There is no way to get that from /sys or anywhere else at the
moment.

Usually filesystems backed by a disk have the dev_t of the device, or
the fake block devices like md/dm/raid have their own major and the
slaves/ directory pointing to the devices.

This is not only about readahead, it's every other tool, that needs to
know what kind of disks are behind a btrfs 'nodev' major == 0 mount.

Kay


Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-29 Thread Christoph Hellwig
Here is the patch that I already proposed a while ago.  I've tested
xfstests on btrfs and xfstests to make sure the btrfs issue is fixed,
and I've also tested the original dirtying of device files issue
and I/O operations on block device files to test the special case
in the patch.

---
From: Christoph Hellwig 
Subject: [PATCH] writeback: always use sb->s_bdi for writeback purposes

We currently use struct backing_dev_info for various different purposes.
Originally it was introduced to describe a backing device which includes
an unplug and congestion function and various bits of readahead information
and VM-relevant flags.  We're also using it for tracking dirty inodes for
writeback.

To make writeback properly find all inodes we need to only access the
per-filesystem backing_dev_info pointed to by the superblock in ->s_bdi
inside the writeback code, and not the instances pointed to by
inode->i_mapping->backing_dev_info which can be overridden by special devices
or might not be set at all by some filesystems.

Long term we should split out the writeback-relevant bits of struct
backing_dev_info (which includes more than the current bdi_writeback)
and only point to it from the superblock while leaving the traditional
backing device as a separate structure that can be overridden by devices.

The one exception for now is the block device filesystem which really
wants different writeback contexts for its different (internal) inodes
to handle the writeout more efficiently.  For now we do this with
a hack in fs-writeback.c because we're so late in the cycle, but in
the future I plan to replace this with a superblock method that allows
for multiple writeback contexts per filesystem.

Signed-off-by: Christoph Hellwig 

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-09-29 16:58:41.750557721 +0900
+++ linux-2.6/fs/fs-writeback.c	2010-09-29 17:11:35.040557719 +0900
@@ -72,22 +72,10 @@ int writeback_in_progress(struct backing
 static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
-	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
 
-	/*
-	 * For inodes on standard filesystems, we use superblock's bdi. For
-	 * inodes on virtual filesystems, we want to use inode mapping's bdi
-	 * because they can possibly point to something useful (think about
-	 * block_dev filesystem).
-	 */
-	if (sb->s_bdi && sb->s_bdi != &noop_backing_dev_info) {
-		/* Some device inodes could play dirty tricks. Catch them... */
-		WARN(bdi != sb->s_bdi && bdi_cap_writeback_dirty(bdi),
-			"Dirtiable inode bdi %s != sb bdi %s\n",
-			bdi->name, sb->s_bdi->name);
-		return sb->s_bdi;
-	}
-	return bdi;
+	if (strcmp(sb->s_type->name, "bdev") == 0)
+		return inode->i_mapping->backing_dev_info;
+	return sb->s_bdi;
 }
 
 static void bdi_queue_work(struct backing_dev_info *bdi,
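
For readers who want to poke at the selection rule outside the kernel, here
is a user-space model of the new inode_to_bdi() logic. The model_* structs
are simplified stand-ins invented for this sketch and are not the kernel
definitions; only the "bdev" name comparison mirrors the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for the kernel structures (hypothetical). */
struct model_bdi {
	const char *name;
};

struct model_sb {
	const char *fs_name;          /* plays the role of sb->s_type->name */
	struct model_bdi *s_bdi;      /* per-filesystem bdi */
};

struct model_inode {
	struct model_sb *i_sb;
	struct model_bdi *mapping_bdi; /* plays the role of i_mapping->backing_dev_info */
};

/* Only the block device filesystem keeps using the per-inode bdi. */
static struct model_bdi *model_inode_to_bdi(const struct model_inode *inode)
{
	const struct model_sb *sb;

	assert(inode != NULL && inode->i_sb != NULL);
	sb = inode->i_sb;
	if (strcmp(sb->fs_name, "bdev") == 0)
		return inode->mapping_bdi;
	return sb->s_bdi;
}
```

In this model a btrfs directory inode whose mapping still points at the
default bdi is filed under the btrfs superblock bdi, which is the case that
triggered the warning in the first place.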


Another checksum error bugreport

2010-09-29 Thread Sebastian 'gonX' Jensen
Hey guys,

Today I experienced my first checksum error just out of the blue - and
it's not just the 'csum + 1 = private' issue, it's a completely
different one. Because of this, I am unable to retrieve the data off
the drive, even with nodatasum enabled - I simply get an I/O error.
Here's the dmesg output:

[149423.845177] btrfs: setting nodatasum
[149423.850339] Btrfs detected SSD devices, enabling SSD mode
[149432.094728] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
[149432.117938] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
[149432.118340] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
[149432.125671] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
[149432.126075] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
[149432.135671] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550

I would really like to have the files on the drive retrieved in their
entirety, but if that is not possible then that is also OK. Consider
this a bugreport and a question on how to retrieve the data now.

Thanks,
Sebastian J.
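
If it helps anyone triaging similar reports, here is a small sketch that
pulls the four numbers out of a "btrfs csum failed" line so they can be
compared across messages. The struct and function names are made up for
illustration; the only thing taken from the report is the message layout
shown in the dmesg output above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields as they are labelled in the log line. */
struct csum_failure {
	unsigned long long ino;
	unsigned long long off;
	unsigned long long csum;
	unsigned long long private_csum; /* the value printed after "private" */
};

/*
 * Parse one log line (a leading "[timestamp]" and a wrapped line are both
 * tolerated). Returns 0 on success, -1 if the line does not match.
 */
static int parse_csum_failure(const char *line, struct csum_failure *out)
{
	const char *p;

	assert(line != NULL && out != NULL);
	p = strstr(line, "btrfs csum failed");
	if (p == NULL)
		return -1;
	if (sscanf(p, "btrfs csum failed ino %llu off %llu csum %llu private %llu",
		   &out->ino, &out->off, &out->csum, &out->private_csum) != 4)
		return -1;
	return 0;
}
```

All six lines in the report carry the same inode and offset, so they point
at repeated reads of a single block rather than at several damaged blocks.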


Re: Another checksum error bugreport

2010-09-29 Thread Francis Galiegue
On Wed, Sep 29, 2010 at 12:48, Sebastian 'gonX' Jensen
 wrote:
> Hey guys,
>
> Today I experienced my first checksum error just out of the blue - and
> it's not just the 'csum + 1 = private' issue, it's a completely
> different one. Because of this, I am unable to retrieve the data off
> the drive, even with nodatasum enabled - I simply get an I/O error.
> Here's the dmesg output:
>
> [149423.845177] btrfs: setting nodatasum
> [149423.850339] Btrfs detected SSD devices, enabling SSD mode
> [149432.094728] btrfs csum failed ino 259 off 26701824 csum 3875867041
> private 371726550
> [149432.117938] btrfs csum failed ino 259 off 26701824 csum 3875867041
> private 371726550
> [149432.118340] btrfs csum failed ino 259 off 26701824 csum 3875867041
> private 371726550
> [149432.125671] btrfs csum failed ino 259 off 26701824 csum 3875867041
> private 371726550
> [149432.126075] btrfs csum failed ino 259 off 26701824 csum 3875867041
> private 371726550
> [149432.135671] btrfs csum failed ino 259 off 26701824 csum 3875867041
> private 371726550
>
> I would really like to have the files on the drive retrieved in their
> entirety, but if that is not possible then that is also OK. Consider
> this a bugreport and a question on how to retrieve the data now.
>

Which kernel is that?

A patch made it into 2.6.36-rc6 which fixed an important bug in the block
layer code, wherein write requests and discard requests could be merged,
transforming the merged request into a discard request.

And you use an SSD... Hmmm.

-- 
Francis Galiegue, fgalie...@gmail.com
"It seems obvious [...] that at least some 'business intelligence'
tools invest so much intelligence on the business side that they have
nothing left for generating SQL queries" (Stéphane Faroult, in "The
Art of SQL", ISBN 0-596-00894-5)


Re: Another checksum error bugreport

2010-09-29 Thread Sebastian 'gonX' Jensen
On 29 September 2010 13:12, Francis Galiegue  wrote:
> On Wed, Sep 29, 2010 at 12:48, Sebastian 'gonX' Jensen
>  wrote:
>> Hey guys,
>>
>> Today I experienced my first checksum error just out of the blue - and
>> it's not just the 'csum + 1 = private' issue, it's a completely
>> different one. Because of this, I am unable to retrieve the data off
>> the drive, even with nodatasum enabled - I simply get an I/O error.
>> Here's the dmesg output:
>>
>> [149423.845177] btrfs: setting nodatasum
>> [149423.850339] Btrfs detected SSD devices, enabling SSD mode
>> [149432.094728] btrfs csum failed ino 259 off 26701824 csum 3875867041
>> private 371726550
>> [149432.117938] btrfs csum failed ino 259 off 26701824 csum 3875867041
>> private 371726550
>> [149432.118340] btrfs csum failed ino 259 off 26701824 csum 3875867041
>> private 371726550
>> [149432.125671] btrfs csum failed ino 259 off 26701824 csum 3875867041
>> private 371726550
>> [149432.126075] btrfs csum failed ino 259 off 26701824 csum 3875867041
>> private 371726550
>> [149432.135671] btrfs csum failed ino 259 off 26701824 csum 3875867041
>> private 371726550
>>
>> I would really like to have the files on the drive retrieved in their
>> entirety, but if that is not possible then that is also OK. Consider
>> this a bugreport and a question on how to retrieve the data now.
>>
>
> Which kernel is that?
It was one of the 2.6.35 versions from the Ubuntu repository. I'm
running Ubuntu 10.04 Server.

> A patch made it in 2.6.36-rc6 which fixed an important bug in the bdi
> code, wherein write requests and discard requests were merged,
> transforming all requests in discard requests.
>
> And you use an SSD... Hmmm.
>
> --
> Francis Galiegue, fgalie...@gmail.com
> "It seems obvious [...] that at least some 'business intelligence'
> tools invest so much intelligence on the business side that they have
> nothing left for generating SQL queries" (Stéphane Faroult, in "The
> Art of SQL", ISBN 0-596-00894-5)
>

Well, overall it seems to work now. I downgraded to the .32 version in
the Ubuntu 10.04 repository and now I do not get any errors from
dmesg. I don't know what caused it, but I think I'll stick to stable
kernel versions instead. Since this is a production system it's not
very easy for me to troubleshoot this any further if it requires a
reboot. I can unmount and mount the drive from time to time, but not
reboot. If you want btrfs-debug-tree output or something, let me know.

Regards,
Sebastian J.


Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-29 Thread Lennart Poettering
On Wed, 29.09.10 16:25, Ric Wheeler (rwhee...@redhat.com) wrote:

> >This in fact is how all current readahead implementations work, be it
> >the fedora, the suse or ubuntu's readahead or Arjan's sreadahead. What's
> >new is that in the systemd case we try to test for ssd/rotating
> >properly, instead of just hardcoding a check for
> >/sys/class/block/sda/queue/rotational.
> >
> 
> A couple of questions pop into mind - is systemd the right place to
> automatically tune readahead?  If this is a generic feature for the
> type of device, it sounds like something that we should be doing
> somewhere else in the stack (not relying on tuning from user space).

Note that this is not the kind of readahead that is controllable via 
/sys/class/block/sda/queue/read_ahead_kb, this is about detecting "hot"
files at boot, and then preloading them on the next boot. i.e. the
problem Jens once proposed fcache for.

> Second question is why is checking in /sys a big deal, would  you
> prefer an interface like we did for alignment in libblkid?

Well, currently there's no way to discover the underlying block devices
if you have a btrfs mount point. This is what Josef's patch added for
us.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.
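
The ordering rule described earlier in the thread (sort by location on disk
for rotating media, by access time for SSD) can be sketched as below. This
is only an illustration under stated assumptions: the ra_entry struct, its
fields and order_readahead() are hypothetical and are not taken from systemd
or any other readahead implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One file recorded during boot (hypothetical layout). */
struct ra_entry {
	const char *path;
	unsigned long long first_access_usec; /* when the file was first touched */
	unsigned long long disk_block;        /* where its first extent lives */
};

static int cmp_by_access(const void *a, const void *b)
{
	const struct ra_entry *x = a, *y = b;

	return (x->first_access_usec > y->first_access_usec) -
	       (x->first_access_usec < y->first_access_usec);
}

static int cmp_by_block(const void *a, const void *b)
{
	const struct ra_entry *x = a, *y = b;

	return (x->disk_block > y->disk_block) -
	       (x->disk_block < y->disk_block);
}

/* Rotating media: minimise seeks. SSD: replay in the order files were needed. */
static void order_readahead(struct ra_entry *entries, size_t n, int rotational)
{
	assert(entries != NULL);
	qsort(entries, n, sizeof(*entries),
	      rotational ? cmp_by_block : cmp_by_access);
}
```

Choosing the comparator is the only place the SSD/rotating distinction
matters here, which is why the tool needs to know what sits under the mount.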


Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-29 Thread Ric Wheeler

On 09/29/2010 08:59 PM, Lennart Poettering wrote:
> On Wed, 29.09.10 16:25, Ric Wheeler (rwhee...@redhat.com) wrote:
>
>>> This in fact is how all current readahead implementations work, be it
>>> the fedora, the suse or ubuntu's readahead or Arjan's sreadahead. What's
>>> new is that in the systemd case we try to test for ssd/rotating
>>> properly, instead of just hardcoding a check for
>>> /sys/class/block/sda/queue/rotational.
>>
>> A couple of questions pop into mind - is systemd the right place to
>> automatically tune readahead?  If this is a generic feature for the
>> type of device, it sounds like something that we should be doing
>> somewhere else in the stack (not relying on tuning from user space).
>
> Note that this is not the kind of readahead that is controllable via
> /sys/class/block/sda/queue/read_ahead_kb, this is about detecting "hot"
> files at boot, and then preloading them on the next boot. i.e. the
> problem Jens once proposed fcache for.
>
>> Second question is why is checking in /sys a big deal, would  you
>> prefer an interface like we did for alignment in libblkid?
>
> Well, currently there's no way to discover the underlying block devices
> if you have a btrfs mount point. This is what Josef's patch added for
> us.
>
> Lennart



Makes sense to me, thanks!

Ric



Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-29 Thread Jan Kara
On Wed 29-09-10 10:19:36, Christoph Hellwig wrote:
> ---
> From: Christoph Hellwig 
> Subject: [PATCH] writeback: always use sb->s_bdi for writeback purposes
> 
...
> The one exception for now is the block device filesystem which really
> wants different writeback contexts for it's different (internal) inodes
> to handle the writeout more efficiently.  For now we do this with
> a hack in fs-writeback.c because we're so late in the cycle, but in
> the future I plan to replace this with a superblock method that allows
> for multiple writeback contexts per filesystem.
  Another exception I know about is mtd_inodefs filesystem
(drivers/mtd/mtdchar.c).

> Signed-off-by: Christoph Hellwig 
> 
> Index: linux-2.6/fs/fs-writeback.c
> ===
> --- linux-2.6.orig/fs/fs-writeback.c  2010-09-29 16:58:41.750557721 +0900
> +++ linux-2.6/fs/fs-writeback.c   2010-09-29 17:11:35.040557719 +0900
> @@ -72,22 +72,10 @@ int writeback_in_progress(struct backing
>  static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
>  {
>   struct super_block *sb = inode->i_sb;
> - struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
>  
> - /*
> -  * For inodes on standard filesystems, we use superblock's bdi. For
> -  * inodes on virtual filesystems, we want to use inode mapping's bdi
> -  * because they can possibly point to something useful (think about
> -  * block_dev filesystem).
> -  */
> - if (sb->s_bdi && sb->s_bdi != &noop_backing_dev_info) {
> - /* Some device inodes could play dirty tricks. Catch them... */
> - WARN(bdi != sb->s_bdi && bdi_cap_writeback_dirty(bdi),
> - "Dirtiable inode bdi %s != sb bdi %s\n",
> - bdi->name, sb->s_bdi->name);
> - return sb->s_bdi;
> - }
> - return bdi;
> + if (strcmp(sb->s_type->name, "bdev") == 0)
> + return inode->i_mapping->backing_dev_info;
> + return sb->s_bdi;
  So at least here you'd also need to add a similar exception for
"mtd_inodefs". Because of these exceptions I've chosen the
(sb->s_bdi && sb->s_bdi != &noop_backing_dev_info) check rather than your
exception based check. All in all I don't care much what ends up in the
kernel as it's just a temporary solution...
  Also I've added the warning to catch situations where inodes would get
filed to a different backing device after the patch. So far the reported
warnings were harmless but still I'm more comfortable when it's there
because otherwise we can so easily miss some device-driver-invented
filesystem like mtd_inodefs which would break silently after the change...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-29 Thread Kay Sievers
On Wed, Sep 29, 2010 at 01:25, Christoph Hellwig  wrote:
> On Tue, Sep 28, 2010 at 04:53:16PM -0400, Josef Bacik wrote:
>> This was a request from the systemd guys.  They need a quick and easy way to 
>> get
>> all devices attached to a Btrfs filesystem in order to check if any of the 
>> disks
>> are SSD for...something, I didn't ask :).   I've tested this with the
>> btrfs-progs patch that accompanies this patch.  Thanks,
>
> So please tell the "systemd guys" to explain what the fuck they're doing
> to linux-fsdevel and find a proper interface.  Chances they will fuck
> up as much as just about every other lowlevel userspace tool are very
> high.

Fuck like these comments make it incredibly hard to find the few
statements where you are right, in all the fucking noise you are
creating.

Thanks,
Kay


Re: Another checksum error bugreport

2010-09-29 Thread Francis Galiegue
On Wed, Sep 29, 2010 at 13:37, Sebastian 'gonX' Jensen
 wrote:
[...]
>>
>> Which kernel is that?
> It was one of the 2.6.35 versions from the Ubuntu repository. I'm
> running Ubuntu 10.04 Server.
>

Since 2.6.32 works, you should report that bug to Ubuntu.

The upstream commit is f281fb5fe54e15a7ab802945e42f8e24fceb56b2,
pasted below, merged Sep 25:


commit f281fb5fe54e15a7ab802945e42f8e24fceb56b2
Author: Adrian Hunter 
Date:   Sat Sep 25 12:42:55 2010 +0200

block: prevent merges of discard and write requests

Add logic to prevent two I/O requests being merged if
only one of them is a discard.  Ditto secure discard.

Without this fix, it is possible for write requests
to transform into discard requests.  For example:

  Submit bio 1 to discard 8 sectors from sector n
  Submit bio 2 to write 8 sectors from sector n + 16
  Submit bio 3 to write 8 sectors from sector n + 8

Bio 1 becomes request 1.  Bio 2 becomes request 2.
Bio 3 is merged with request 2, and then subsequently
request 2 is merged with request 1 resulting in just
one I/O request which discards all 24 sectors.

Signed-off-by: Adrian Hunter 

(Moved the checks above the position checks /Jens)

Signed-off-by: Jens Axboe 


-- 
Francis Galiegue, fgalie...@gmail.com
"It seems obvious [...] that at least some 'business intelligence'
tools invest so much intelligence on the business side that they have
nothing left for generating SQL queries" (Stéphane Faroult, in "The
Art of SQL", ISBN 0-596-00894-5)
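
The three-bio scenario from the commit message can be replayed with a toy
model. This is a sketch under stated assumptions, not the block layer code:
the model_req struct and both functions are invented here, and the `guard`
flag stands in for the check the commit adds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy request: a sector range plus a discard flag (hypothetical). */
struct model_req {
	unsigned long long sector;
	unsigned int nr_sectors;
	bool discard;
};

/*
 * May `second` be glued onto the end of `first`?  With `guard` set, a
 * discard and a non-discard are never merged.
 */
static bool model_mergeable(const struct model_req *first,
			    const struct model_req *second, bool guard)
{
	assert(first != NULL && second != NULL);
	if (guard && first->discard != second->discard)
		return false;
	return first->sector + first->nr_sectors == second->sector;
}

/* Replays bios 1..3 and returns how many sectors end up being discarded. */
static unsigned int model_discarded_sectors(unsigned long long n, bool guard)
{
	struct model_req r1 = { n, 8, true };        /* bio 1: discard at n */
	struct model_req r2 = { n + 16, 8, false };  /* bio 2: write at n + 16 */
	struct model_req bio3 = { n + 8, 8, false }; /* bio 3: write at n + 8 */

	/* bio 3 ends where request 2 starts: it is merged in front of it. */
	if (model_mergeable(&bio3, &r2, guard)) {
		r2.sector = bio3.sector;
		r2.nr_sectors += bio3.nr_sectors;
	}
	/* Request 2 now starts where request 1 ends; the result keeps r1's type. */
	if (model_mergeable(&r1, &r2, guard))
		r1.nr_sectors += r2.nr_sectors;
	return r1.nr_sectors;
}
```

Without the guard the model discards all 24 sectors, matching the commit
message; with the guard only the original 8 sectors are discarded.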


Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-29 Thread Jan Kara
On Tue 28-09-10 10:05:49, Artem Bityutskiy wrote:
> On Mon, 2010-09-27 at 18:54 -0400, Chris Mason wrote:
> > On Tue, Sep 28, 2010 at 12:25:48AM +0200, Jan Kara wrote:
> > > [Added CCs for similar ecryptfs warning]
> > > On Thu 23-09-10 12:38:49, Andrew Morton wrote:
> > > > > This started appearing for me on v2.6.36-rc5-49-gc79bd89; it did not 
> > > > > happen on v2.6.36-rc5-33-g1ce1e41, probably because it does not have 
> > > > > commit 692ebd17c2905313fff3c504c249c6a0faad16ec which introduces the 
> > > > > warning.
> > > > > [...]
> > > > > device fsid 44d595920ddedfa-3ece6b56e80f689e devid 1 transid 22342 
> > > > > /dev/mapper/vg_cesarbinspiro-lv_home
> > > > > SELinux: initialized (dev dm-3, type btrfs), uses xattr
> > > > > [ cut here ]
> > > > > WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x62/0x6d()
> > > > > Hardware name: Inspiron N4010
> > > > > Dirtiable inode bdi default != sb bdi btrfs
> > > > > Modules linked in: ipv6 kvm_intel kvm uinput arc4 ecb 
> > > > > snd_hda_codec_intelhdmi snd_hda_codec_realtek iwlagn snd_hda_intel 
> > > > > iwlcore snd_hda_codec uvcvideo snd_hwdep mac80211 videodev snd_seq 
> > > > > snd_seq_device v4l1_compat snd_pcm atl1c v4l2_compat_ioctl32 btusb 
> > > > > cfg80211 snd_timer i2c_i801 bluetooth iTCO_wdt dell_wmi dell_laptop 
> > > > > snd 
> > > > > pcspkr wmi dcdbas shpchp iTCO_vendor_support soundcore snd_page_alloc 
> > > > > rfkill joydev microcode btrfs zlib_deflate libcrc32c cryptd 
> > > > > aes_x86_64 
> > > > > aes_generic xts gf128mul dm_crypt usb_storage i915 drm_kms_helper drm 
> > > > > i2c_algo_bit i2c_core video output [last unloaded: scsi_wait_scan]
> > > > > Pid: 1073, comm: find Not tainted 2.6.36-rc5+ #8
> > > > > Call Trace:
> > > > >   [] warn_slowpath_common+0x85/0x9d
> > > > >   [] warn_slowpath_fmt+0x46/0x48
> > > > >   [] inode_to_bdi+0x62/0x6d
> > > > >   [] __mark_inode_dirty+0xd0/0x177
> > > > >   [] touch_atime+0x107/0x12a
> > > > >   [] ? filldir+0x0/0xd0
> > > > >   [] vfs_readdir+0x8d/0xb4
> > > > >   [] sys_getdents+0x81/0xd1
> > > > >   [] system_call_fastpath+0x16/0x1b
> > >   Thanks for the report. These bdi pointers are a mess. As Chris pointed
> > > out, btrfs forgets to properly initialize 
> > > inode->i_mapping.backing_dev_info
> > > for directories and special inodes and thus these were previously attached
> > > to default_backing_dev_info which probably isn't what Chris would like to
> > > see.
> > 
> > There's no actual writeback for these, so it works fine for btrfs either
> > way.
> 
> Side note: every time inode is marked as dirty, we wake up a bdi thread
> or the default bdi thread. So if we have inodes which do not need
> write-back, we should never mark them as dirty.
  Are you sure? I think we wake up the thread only when it's the first
dirty inode for the bdi...
  And a side side note ;): It's harder not to dirty the inode than it seems.
E.g. btrfs (or similarly ext3) add the new inode data to the journal already
at inode dirty time but still they need to track that the transaction
carrying the inode is still uncommitted. Thus the inode *is* dirty in some
sense. Only it does not need any writeout to happen...

Honza
-- 
Jan Kara 
SUSE Labs, CR
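
A toy model of the "wake the thread only for the first dirty inode"
behaviour described above, in case it helps the discussion. The model_wb
struct and model_mark_inode_dirty() are hypothetical and only count events;
they do not reproduce the real bdi or writeback data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-bdi bookkeeping for the model (hypothetical). */
struct model_wb {
	unsigned int nr_dirty; /* inodes currently on the dirty list */
	unsigned int wakeups;  /* how often the flusher thread was woken */
};

/* Returns true when this call had to wake the flusher thread. */
static bool model_mark_inode_dirty(struct model_wb *wb)
{
	bool first;

	assert(wb != NULL);
	first = (wb->nr_dirty == 0);
	wb->nr_dirty++;
	if (first)
		wb->wakeups++;
	return first;
}
```

Under this rule the cost is paid once per transition from "no dirty inodes"
to "some dirty inodes", not once per dirtied inode.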


Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-29 Thread Jan Kara
On Mon 27-09-10 20:55:45, Cesar Eduardo Barros wrote:
> Em 27-09-2010 19:25, Jan Kara escreveu:
> >[Added CCs for similar ecryptfs warning]
> >On Thu 23-09-10 12:38:49, Andrew Morton wrote:
> >>>[...]
> >>>device fsid 44d595920ddedfa-3ece6b56e80f689e devid 1 transid 22342
> >>>/dev/mapper/vg_cesarbinspiro-lv_home
> >>>SELinux: initialized (dev dm-3, type btrfs), uses xattr
> >>>[ cut here ]
> >>>WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x62/0x6d()
> >>>Hardware name: Inspiron N4010
> >>>Dirtiable inode bdi default != sb bdi btrfs
> >   That suggests that we should probably handle such cases in a more generic
> >way by changing the code in inode_init_always(). The patch below makes at
> >least btrfs happy for me... Could you maybe test it? Thanks.
> 
> Applied on top of v2.6.36-rc5-151-g32163f4, running it right now.
> The warning messages no longer happen, and everything seems to be
> working fine.
> 
> Tested-by: Cesar Eduardo Barros 
  Great, thanks for testing.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: Another checksum error bugreport

2010-09-29 Thread cwillu
>>> Which kernel is that?
>> It was one of the 2.6.35 versions from the Ubuntu repository. I'm
>> running Ubuntu 10.04 Server.
>>
>
> Since 2.6.32 works, you should report that bug to Ubuntu.

Alternatively, retest using ubuntu's mainline kernel ppa
(http://kernel.ubuntu.com/~kernel-ppa/mainline/), which doesn't
include any ubuntu patches.


Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-29 Thread Artem Bityutskiy
On Wed, 2010-09-29 at 15:00 +0200, Jan Kara wrote:
> On Tue 28-09-10 10:05:49, Artem Bityutskiy wrote:
> > On Mon, 2010-09-27 at 18:54 -0400, Chris Mason wrote:
> > > On Tue, Sep 28, 2010 at 12:25:48AM +0200, Jan Kara wrote:
> > > > [Added CCs for similar ecryptfs warning]
> > > > On Thu 23-09-10 12:38:49, Andrew Morton wrote:
> > > > > > This started appearing for me on v2.6.36-rc5-49-gc79bd89; it did 
> > > > > > not 
> > > > > > happen on v2.6.36-rc5-33-g1ce1e41, probably because it does not 
> > > > > > have 
> > > > > > commit 692ebd17c2905313fff3c504c249c6a0faad16ec which introduces 
> > > > > > the 
> > > > > > warning.
> > > > > > [...]
> > > > > > device fsid 44d595920ddedfa-3ece6b56e80f689e devid 1 transid 22342 
> > > > > > /dev/mapper/vg_cesarbinspiro-lv_home
> > > > > > SELinux: initialized (dev dm-3, type btrfs), uses xattr
> > > > > > [ cut here ]
> > > > > > WARNING: at fs/fs-writeback.c:87 inode_to_bdi+0x62/0x6d()
> > > > > > Hardware name: Inspiron N4010
> > > > > > Dirtiable inode bdi default != sb bdi btrfs
> > > > > > Modules linked in: ipv6 kvm_intel kvm uinput arc4 ecb 
> > > > > > snd_hda_codec_intelhdmi snd_hda_codec_realtek iwlagn snd_hda_intel 
> > > > > > iwlcore snd_hda_codec uvcvideo snd_hwdep mac80211 videodev snd_seq 
> > > > > > snd_seq_device v4l1_compat snd_pcm atl1c v4l2_compat_ioctl32 btusb 
> > > > > > cfg80211 snd_timer i2c_i801 bluetooth iTCO_wdt dell_wmi dell_laptop 
> > > > > > snd 
> > > > > > pcspkr wmi dcdbas shpchp iTCO_vendor_support soundcore 
> > > > > > snd_page_alloc 
> > > > > > rfkill joydev microcode btrfs zlib_deflate libcrc32c cryptd 
> > > > > > aes_x86_64 
> > > > > > aes_generic xts gf128mul dm_crypt usb_storage i915 drm_kms_helper 
> > > > > > drm 
> > > > > > i2c_algo_bit i2c_core video output [last unloaded: scsi_wait_scan]
> > > > > > Pid: 1073, comm: find Not tainted 2.6.36-rc5+ #8
> > > > > > Call Trace:
> > > > > >   [] warn_slowpath_common+0x85/0x9d
> > > > > >   [] warn_slowpath_fmt+0x46/0x48
> > > > > >   [] inode_to_bdi+0x62/0x6d
> > > > > >   [] __mark_inode_dirty+0xd0/0x177
> > > > > >   [] touch_atime+0x107/0x12a
> > > > > >   [] ? filldir+0x0/0xd0
> > > > > >   [] vfs_readdir+0x8d/0xb4
> > > > > >   [] sys_getdents+0x81/0xd1
> > > > > >   [] system_call_fastpath+0x16/0x1b
> > > >   Thanks for the report. These bdi pointers are a mess. As Chris pointed
> > > > out, btrfs forgets to properly initialize 
> > > > inode->i_mapping.backing_dev_info
> > > > for directories and special inodes and thus these were previously 
> > > > attached
> > > > to default_backing_dev_info which probably isn't what Chris would like 
> > > > to
> > > > see.
> > > 
> > > There's no actual writeback for these, so it works fine for btrfs either
> > > way.
> > 
> > Side note: every time inode is marked as dirty, we wake up a bdi thread
> > or the default bdi thread. So if we have inodes which do not need
> > write-back, we should never mark them as dirty.
>   Are you sure? I think we wake up the thread only when it's the first
> dirty inode for the bdi...

Err, right. If no one ever marks it as clean then we won't wake-up the
thread. But I thought that marking it as dirty even once is bad because
this causes bdi thread creation, which consumes resources.

Sorry for my ignorance, I did not really follow the conversation, I just
remember that when I looked at bdi stuff, I noticed that during boot the
kernel created many bdi threads which were never used then. They
eventually exited. But I thought that creating useless bdi threads is
about consuming resources and slowing down the boot.

As I remember, the reason was touch_atime() for some of the threads.

But really, I did not dig this, I just noticed this conversation and
wanted to let you know about the issue I noticed this summer.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)



Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-29 Thread Christoph Hellwig
On Wed, Sep 29, 2010 at 02:18:08PM +0200, Jan Kara wrote:
> On Wed 29-09-10 10:19:36, Christoph Hellwig wrote:
> > ---
> > From: Christoph Hellwig 
> > Subject: [PATCH] writeback: always use sb->s_bdi for writeback purposes
> > 
> ...
> > The one exception for now is the block device filesystem which really
> > wants different writeback contexts for it's different (internal) inodes
> > to handle the writeout more efficiently.  For now we do this with
> > a hack in fs-writeback.c because we're so late in the cycle, but in
> > the future I plan to replace this with a superblock method that allows
> > for multiple writeback contexts per filesystem.
>   Another exception I know about is mtd_inodefs filesystem
> (drivers/mtd/mtdchar.c).

No, it's not.  MTD only has three different backing_dev_info instances
which have different flags in the mapping-relevant portion of the
backing_dev. 

>   So at least here you'd need also add a similar exception for
> "mtd_inodefs".

No.  For one thing we don't need any exception for correctness alone -
even the block device variant would work fine with the default case.
But for mtd specifically we don't need an exception for performance either
given that there are no per-device bdis in mtd.



Re: Another checksum error bugreport

2010-09-29 Thread Sebastian 'gonX' Jensen
On 29 September 2010 15:15, cwillu  wrote:
>>>> Which kernel is that?
>>> It was one of the 2.6.35 versions from the Ubuntu repository. I'm
>>> running Ubuntu 10.04 Server.
>>>
>>
>> Since 2.6.32 works, you should report that bug to Ubuntu.
>
> Alternatively, retest using ubuntu's mainline kernel ppa
> (http://kernel.ubuntu.com/~kernel-ppa/mainline/), which doesn't
> include any ubuntu patches.
>

I used the mainline ppa when I used 2.6.35. That is where I had the
issue. Forgive me for saying it was in the repository, but I did not
realize they were not the same thing.

Thanks,
Sebastian J.


Re: Another checksum error bugreport

2010-09-29 Thread cwillu
On Wed, Sep 29, 2010 at 8:31 AM, Sebastian 'gonX' Jensen
 wrote:
> On 29 September 2010 15:15, cwillu  wrote:
>>>>> Which kernel is that?
>>>> It was one of the 2.6.35 versions from the Ubuntu repository. I'm
>>>> running Ubuntu 10.04 Server.
>>>>
>>>
>>> Since 2.6.32 works, you should report that bug to Ubuntu.
>>
>> Alternatively, retest using ubuntu's mainline kernel ppa
>> (http://kernel.ubuntu.com/~kernel-ppa/mainline/), which doesn't
>> include any ubuntu patches.
>
> I used the mainline ppa when I used 2.6.35. That is where I had the
> issue. Forgive me for saying it was in the repository, but I did not
> realize they were not the same thing.

Well, it is a repository, but that one specifically doesn't include
ubuntu patches, as opposed to what's in the default repositories.
Given that you used the mainline kernels, it's unlikely to be an
ubuntu bug.


Re: Francis Galiegue would like your help testing a survey

2010-09-29 Thread K. Richard Pixley

 On 9/28/10 11:47 , Francis Galiegue wrote:

> As to file system hardening, what do you mean apart from checksums?
> Fundamental filesystem design?

I specifically mean the intended ability to survive a power failure.

Historically, unix file systems lived in the disk cache such that a 
power failure would result in a polluted file system.  This would 
require an fsck pass to clear out the errors and "recover" the file 
system although data lost was still data lost.


A while back, (some time in the 90's), most unix file systems were 
"hardened" such that a power failure would generally _not_ result in a 
file system pollution.  Ext2 is not hardened.  Ext3 has an optional 
journal which provides file system hardening.


Ext2 is faster than ext3 but also suffers from smaller file system size 
limits.  Btrfs, in "-m single -d single" mode is hardened and competes 
favorably against ext2 for speed.  All other linux file systems are 
either not hardened, slower, or both.  (Although nilfs2 is also hardened 
and somewhere between ext2 and ext3 speeds.)

>> #15 presupposes its own answer.  While I've had no filesystems fail, every
>> machine I use with btrfs file systems has failed numerous times -
>> pathological behavior, kernel crashes, etc.  In the absence of a btrfsck I
>> can't be sure that the file system has actually failed although rebuilding
>> the file system seems to alleviate the symptoms temporarily.
>
> I don't really see your point here. Can you elaborate? And yes, I _do_
> mean filesystem failures, not machine failure. I made that explicit.

It's simple.  I can't tell if I've had file system pollution because we 
don't have a functional btrfsck.  I only know that I have file systems 
which have reached a state where the kernel was unable to use them 
constructively.  I can't tell whether this state was due to a data error 
in the file system or a coding error in the file system driver which 
couldn't cope with a valid state of the file system.

>> #16 presupposes a failure mode. Again, my issues have more to do with
>> stability than with clear cases of file system pollution
>
> Point taken, but again, this is on purpose, I talk here about hosed
> filesystems indeed.

Then I think you need to ask the same question again with respect to 
system failures due to btrfs which aren't necessarily file system failures.


Imagine this for a moment - pretend that any time btrfs were in your 
kernel your kernel were only capable of network speeds of 1Mbps.  Data 
was correct both in your btrfs file systems and in your network 
interfaces - but you were horribly restricted in your network 
interfaces.  This would not represent a polluted btrfs file system and 
yet it would clearly represent a "broken" system by most people's 
definitions.  It's these cases I'm looking to see represented in the 
questionnaire because these are the types of failures I've been seeing.  
And in the absence of a reliable btrfsck, we can't really determine the 
existence of file system pollution anyway - we can only guess that we 
might have polluted file systems.


--rich


[PATCH] Btrfs: fix the df ioctl to report raid types

2010-09-29 Thread Josef Bacik
The new ENOSPC stuff broke the df ioctl since we no longer create separate space
infos for each RAID type.  So instead, loop through each space info's raid
lists so we can get the right RAID information which will allow the df ioctl to
tell us RAID types again.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ioctl.c |  100 +-
 1 files changed, 76 insertions(+), 24 deletions(-)
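
For readers who want to see the reporting scheme from the description in
isolation, here is a minimal user-space sketch. It is not the kernel code:
the structures, the names sum_raid_list() and report_space(), and the value
of NR_RAID_TYPES are all made up for illustration. It only mirrors the idea
of emitting one slot per non-empty RAID list and of treating a request for
zero slots as a pure count.

```c
#include <assert.h>
#include <stddef.h>

#define NR_RAID_TYPES 5 /* hypothetical; stands in for BTRFS_NR_RAID_TYPES */

/* Toy stand-ins for the kernel structures; not the real layouts. */
struct block_group {
    unsigned long long flags;
    unsigned long long total_bytes;
    unsigned long long used_bytes;
    struct block_group *next;
};

struct space_info {
    struct block_group *raid_lists[NR_RAID_TYPES]; /* one list per RAID type */
};

struct space_slot {
    unsigned long long flags;
    unsigned long long total_bytes;
    unsigned long long used_bytes;
};

/* Sum one RAID list into a single slot; the last group's flags win. */
static void sum_raid_list(const struct block_group *bg, struct space_slot *slot)
{
    slot->flags = 0;
    slot->total_bytes = 0;
    slot->used_bytes = 0;
    for (; bg; bg = bg->next) {
        slot->flags = bg->flags;
        slot->total_bytes += bg->total_bytes;
        slot->used_bytes += bg->used_bytes;
    }
}

/*
 * Fill at most max_slots entries, one per non-empty RAID list, and return
 * how many non-empty lists exist. Calling with max_slots == 0 only counts.
 */
static size_t report_space(const struct space_info *info,
                           struct space_slot *slots, size_t max_slots)
{
    size_t needed = 0;
    int c;

    for (c = 0; c < NR_RAID_TYPES; c++) {
        if (!info->raid_lists[c])
            continue;
        if (needed < max_slots)
            sum_raid_list(info->raid_lists[c], &slots[needed]);
        needed++;
    }
    return needed;
}
```

A caller would first call report_space(&info, NULL, 0) to learn how many
slots to allocate, then call it again with a buffer of that size.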

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f59b0bc..e264072 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1879,6 +1879,22 @@ static long btrfs_ioctl_default_subvol(struct file *file, void __user *argp)
         return 0;
 }
 
+static void get_block_group_info(struct list_head *groups_list,
+                                 struct btrfs_ioctl_space_info *space)
+{
+        struct btrfs_block_group_cache *block_group;
+
+        space->total_bytes = 0;
+        space->used_bytes = 0;
+        space->flags = 0;
+        list_for_each_entry(block_group, groups_list, list) {
+                space->flags = block_group->flags;
+                space->total_bytes += block_group->key.offset;
+                space->used_bytes +=
+                        btrfs_block_group_used(&block_group->item);
+        }
+}
+
 long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg)
 {
         struct btrfs_ioctl_space_args space_args;
@@ -1887,27 +1903,56 @@ long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg)
         struct btrfs_ioctl_space_info *dest_orig;
         struct btrfs_ioctl_space_info *user_dest;
         struct btrfs_space_info *info;
+        u64 types[] = {BTRFS_BLOCK_GROUP_DATA,
+                       BTRFS_BLOCK_GROUP_SYSTEM,
+                       BTRFS_BLOCK_GROUP_METADATA,
+                       BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA};
+        int num_types = 4;
         int alloc_size;
         int ret = 0;
         int slot_count = 0;
+        int i, c;
 
         if (copy_from_user(&space_args,
                            (struct btrfs_ioctl_space_args __user *)arg,
                            sizeof(space_args)))
                 return -EFAULT;
 
-        /* first we count slots */
-        rcu_read_lock();
-        list_for_each_entry_rcu(info, &root->fs_info->space_info, list)
-                slot_count++;
-        rcu_read_unlock();
+        for (i = 0; i < num_types; i++) {
+                struct btrfs_space_info *tmp;
+
+                info = NULL;
+                rcu_read_lock();
+                list_for_each_entry_rcu(tmp, &root->fs_info->space_info,
+                                        list) {
+                        if (tmp->flags == types[i]) {
+                                info = tmp;
+                                break;
+                        }
+                }
+                rcu_read_unlock();
+
+                if (!info)
+                        continue;
+
+                down_read(&info->groups_sem);
+                for (c = 0; c < BTRFS_NR_RAID_TYPES; c++) {
+                        if (!list_empty(&info->block_groups[c]))
+                                slot_count++;
+                }
+                up_read(&info->groups_sem);
+        }
 
         /* space_slots == 0 means they are asking for a count */
         if (space_args.space_slots == 0) {
                 space_args.total_spaces = slot_count;
                 goto out;
         }
+
+        slot_count = min_t(int, space_args.space_slots, slot_count);
+
         alloc_size = sizeof(*dest) * slot_count;
+
         /* we generally have at most 6 or so space infos, one for each raid
          * level.  So, a whole page should be more than enough for everyone
          */
@@ -1921,27 +1966,34 @@ long btrfs_ioctl_space_info(struct btrfs_root *root, void __user *arg)
         dest_orig = dest;
 
         /* now we have a buffer to copy into */
-        rcu_read_lock();
-        list_for_each_entry_rcu(info, &root->fs_info->space_info, list) {
-                /* make sure we don't copy more than we allocated
-                 * in our buffer
-                 */
-                if (slot_count == 0)
-                        break;
-                slot_count--;
-
-                /* make sure userland has enough room in their buffer */
-                if (space_args.total_spaces >= space_args.space_slots)
-                        break;
+        for (i = 0; i < num_types; i++) {
+                struct btrfs_space_info *tmp;
+
+                info = NULL;
+                rcu_read_lock();
+                list_for_each_entry_rcu(tmp, &root->fs_info->space_info,
+                                        list) {
+                        if (tmp->flags == types[i]) {
+                                info = tmp;
+                                break;
+                        }
+                }
+                rcu_read_unlock();
 
-                space.flags = info->flags;
-                space.total_bytes = info->total_bytes;
-                space.used_bytes = info->bytes_used;
-   

BTRFS && SSD

2010-09-29 Thread Yuehai Xu
Hi,

I understand that BTRFS is a kind of log-structured file system, which does
not overwrite data in place. Here is my question: suppose file A is
overwritten by A'. Instead of writing A' to the original location of A, a
new location is selected to store it. However, the address of a file is
recorded in its inode, so the corresponding part of A's inode has to be
updated from the original location of A to the new location of A'. Isn't
that itself a kind of overwrite? I think that no matter how a
log-structured FS is designed, a mapping table is always needed, such as
an inode map, DAT, etc. When this mapping table is updated, is that
actually an overwrite? If it is, is it a bottleneck for write
performance on SSDs?

What do you think is the major work BTRFS can do to improve performance
on SSDs? I know FTLs have become smarter and smarter, and the idea of a
log-structured file system is always implemented inside the SSD by the
FTL. In that case, it sounds as if all the issues have been solved no
matter which FS sits higher in the stack. But benchmark results on the
internet show that performance differs considerably between file
systems, such as NILFS2 and BTRFS.

Any comments?

Thanks,
Yuehai


Re: BTRFS && SSD

2010-09-29 Thread Sean Bartell
On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
> I know BTRFS is a kind of Log-structured File System, which doesn't do
> overwrite. Here is my question, suppose file A is overwritten by A',
> instead of writing A' to the original place of A, a new place is
> selected to store it. However, we know that the address of a file
> should be recorded in its inode. In such case, the corresponding part
> in inode of A should update from the original place A to the new place
> A', is this a kind of overwrite actually? I think no matter what
> design it is for Log-Structured FS, a mapping table is always needed,
> such as inode map, DAT, etc. When a update operation happens for this
> mapping table, is it actually a kind of over-write? If it is, is it a
> bottleneck for the performance of write for SSD?

In btrfs, this is solved by doing the same thing for the inode--a new
place for the leaf holding the inode is chosen. Then the parent of the
leaf must point to the new position of the leaf, so the parent is moved,
and the parent's parent, etc. This goes all the way up to the
superblocks, which are actually overwritten one at a time.
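
To make the "move the leaf, then the parent, then the parent's parent"
idea concrete, here is a small toy model in C. It is only a sketch under
stated assumptions and has nothing to do with the real btrfs code: struct
node, write_node(), build_tree() and cow_update() are made-up names, the
tree is a plain binary tree, and "disk addresses" are just a counter.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model only: a node "lives" at an address and points to children. */
struct node {
    unsigned long addr;      /* where this copy was written */
    struct node *child[2];   /* both NULL for leaves */
    int value;               /* payload, only meaningful in leaves */
};

static unsigned long next_free_addr = 1000; /* hypothetical bump allocator */

/* "Write" a node: always to a fresh address, never in place. */
static struct node *write_node(const struct node *tmpl)
{
    struct node *n = malloc(sizeof(*n));

    if (!n)
        abort();
    *n = *tmpl;
    n->addr = next_free_addr++;
    return n;
}

/* Build a complete binary tree with `depth` levels below the root. */
static struct node *build_tree(int depth)
{
    struct node n = {0};

    if (depth > 0) {
        n.child[0] = build_tree(depth - 1);
        n.child[1] = build_tree(depth - 1);
    }
    return write_node(&n);
}

/*
 * Return a NEW root in which the leaf selected by the low `depth` bits of
 * `path` holds `value`. Every node on the path is copied to a new address;
 * everything off the path is shared with the old tree, which stays intact.
 */
static struct node *cow_update(const struct node *root, unsigned path,
                               int depth, int value)
{
    struct node copy = *root;

    if (depth == 0) {
        copy.value = value;
    } else {
        int side = path & 1;

        copy.child[side] = cow_update(root->child[side], path >> 1,
                                      depth - 1, value);
    }
    return write_node(&copy);
}
```

With a tree built by build_tree(2), one call to cow_update() writes three
new nodes (leaf, parent, root) and leaves the old root usable as a
consistent older version.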

> What do you think the major work that BTRFS can do to improve the
> performance for SSD? I know FTL has becomes smarter and smarter, the
> idea of log-structured file system is always implemented inside the
> SSD by FTL, in that case, it sounds all the issues have been solved no
> matter what the FS it is in upper stack. But at least, from the
> results of benchmarks on the internet show that the performance from
> different FS are quite different, such as NILFS2 and BTRFS.


Re: Another checksum error bugreport

2010-09-29 Thread Lubos Kolouch
Sebastian 'gonX' Jensen, Wed, 29 Sep 2010 12:48:56 +0200:

> Hey guys,
> 
> Today I experienced my first checksum error just out of the blue - and
> it's not just the 'csum + 1 = private' issue, it's a completely
> different one. Because of this, I am unable to retrieve the data off the
> drive, even with nodatasum enabled - I simply get an I/O error. Here's
> the dmesg output:
> 
> [149423.845177] btrfs: setting nodatasum
> [149423.850339] Btrfs detected SSD devices, enabling SSD mode
> [149432.094728] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
> [149432.117938] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
> [149432.118340] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
> [149432.125671] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
> [149432.126075] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
> [149432.135671] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
> 
> I would really like to have the files on the drive retrieved in their
> entirety, but if that is not possible then that is also OK. Consider
> this a bugreport and a question on how to retrieve the data now.
> 
> Thanks,
> Sebastian J.

I have seen this now too, on a usb flash drive (yes, with LUKS on it) that
has always been correctly unmounted and luksClosed. In fact I created 
the fs on it a couple of days ago and now I cannot read one file with the 
same error.

I do not need to recover the file, it's just that you are not the only 
one with this error.

Lubos



Re: Another checksum error bugreport

2010-09-29 Thread Sebastian 'gonX' Jensen
On 29 September 2010 19:35, Lubos Kolouch  wrote:
> Sebastian 'gonX' Jensen, Wed, 29 Sep 2010 12:48:56 +0200:
>
>> Hey guys,
>>
>> Today I experienced my first checksum error just out of the blue - and
>> it's not just the 'csum + 1 = private' issue, it's a completely
>> different one. Because of this, I am unable to retrieve the data off the
>> drive, even with nodatasum enabled - I simply get an I/O error. Here's
>> the dmesg output:
>>
>> [149423.845177] btrfs: setting nodatasum
>> [149423.850339] Btrfs detected SSD devices, enabling SSD mode
>> [149432.094728] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
>> [149432.117938] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
>> [149432.118340] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
>> [149432.125671] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
>> [149432.126075] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
>> [149432.135671] btrfs csum failed ino 259 off 26701824 csum 3875867041 private 371726550
>>
>> I would really like to have the files on the drive retrieved in their
>> entirety, but if that is not possible then that is also OK. Consider
>> this a bugreport and a question on how to retrieve the data now.
>>
>> Thanks,
>> Sebastian J.
>
> I have seen this now too, on a usb flash drive (yes, with LUKS on it) that
> has been always correctly unmounted and luksClosed. In facts I created
> the fs on it couple of days ago and now I cannot read one file with the
> same error.
>
> I do not need to recover the file, it's just that you are not the only
> with this error.
>
> Lubos

Good to hear I am not alone with this. It seemed more like a fluke
than an actual issue since I have no issues reading the drive in
2.6.32

Regards,
Sebastian J.


Re: BTRFS && SSD

2010-09-29 Thread Yuehai Xu
Hi,

On Wed, Sep 29, 2010 at 11:37 AM, Dipl.-Ing. Michael Niederle
 wrote:
> Hi Yuehai!
>
> I tested nilfs2 and btrfs for the use with flash based pen drives.
>
> nilfs2 performed incredibly well as long as there were enough free blocks. But
> the garbage collector of nilfs used too much IO-bandwidth to be useable (with
> slow-write flash devices).

I also tested write performance on an Intel X25-V SSD with PostMark, and
the results are totally different from the published results for the Intel
X25-M (http://www.usenix.org/event/lsf08/tech/shin_SSD.pdf). In that
test, NILFS2 performed best overall; in my test, however, ext3 is the best
and NILFS2 is the worst, with write throughput almost 10 times lower than
ext3.

So what is the role of the file system in handling such tricky storage?
Different file systems apparently achieve different throughput.

Setting my own results aside, the question is why nilfs2 and btrfs
perform so well compared with ext3. I am only talking about SSDs here:
the FTL should internally do the same thing as these file systems,
redirecting each write to a new location instead of overwriting the
original one, so I would expect the throughput of different file systems
to be more or less the same.



>
> btrfs on the other side performed very well - a lot better than conventional
> file systems like ext2/3 or reiserfs. After switching the mount-options to
> "noatime" I was able to run a complete Linux system from a (quite slow) pen
> drive without (much) problems. Performance on a fast pen drive is great. I'm
> using btrfs as the root file system on a daily basis since last Christmas
> without running into any problems.
>

Is the performance determined by the internal structure of the SSD, by
the structure of the file system, or by the interaction between the
two?

Thanks very much for replying.

> Greetings, Michael
>

Thanks,
Yuehai


Re: BTRFS && SSD

2010-09-29 Thread Yuehai Xu
On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell  wrote:
> On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
>> I know BTRFS is a kind of Log-structured File System, which doesn't do
>> overwrite. Here is my question, suppose file A is overwritten by A',
>> instead of writing A' to the original place of A, a new place is
>> selected to store it. However, we know that the address of a file
>> should be recorded in its inode. In such case, the corresponding part
>> in inode of A should update from the original place A to the new place
>> A', is this a kind of overwrite actually? I think no matter what
>> design it is for Log-Structured FS, a mapping table is always needed,
>> such as inode map, DAT, etc. When a update operation happens for this
>> mapping table, is it actually a kind of over-write? If it is, is it a
>> bottleneck for the performance of write for SSD?
>
> In btrfs, this is solved by doing the same thing for the inode--a new
> place for the leaf holding the inode is chosen. Then the parent of the
> leaf must point to the new position of the leaf, so the parent is moved,
> and the parent's parent, etc. This goes all the way up to the
> superblocks, which are actually overwritten one at a time.

So you mean that inodes are not overwritten either: once an inode needs
to be updated, it is written to a new place, and the only other thing to
do is to change the pointer in its parent to this new place. However,
does the last parent, i.e. the superblock, need to be overwritten?

I am afraid I don't quite understand the meaning of your last sentence.

Thanks for replying,
Yuehai


>
>> What do you think the major work that BTRFS can do to improve the
>> performance for SSD? I know FTL has becomes smarter and smarter, the
>> idea of log-structured file system is always implemented inside the
>> SSD by FTL, in that case, it sounds all the issues have been solved no
>> matter what the FS it is in upper stack. But at least, from the
>> results of benchmarks on the internet show that the performance from
>> different FS are quite different, such as NILFS2 and BTRFS.
>


Re: BTRFS && SSD

2010-09-29 Thread Aryeh Gregor
On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell  wrote:
> In btrfs, this is solved by doing the same thing for the inode--a new
> place for the leaf holding the inode is chosen. Then the parent of the
> leaf must point to the new position of the leaf, so the parent is moved,
> and the parent's parent, etc. This goes all the way up to the
> superblocks, which are actually overwritten one at a time.

Sorry for the useless question, but just out of curiosity: doesn't
this mean that btrfs has to do quite a lot more writes than ext4 for
small file operations?  E.g., if you append one block to a file, like
a log file, then ext3 should have to do about three writes: data,
metadata, and journal (and the latter is always sequential, so it's
cheap).  But btrfs will need to do more, rewriting parent nodes all
the way up the line for both the data and metadata blocks.  Why
doesn't this hurt performance a lot?


Re: BTRFS && SSD

2010-09-29 Thread Sean Bartell
On Wed, Sep 29, 2010 at 02:45:29PM -0400, Yuehai Xu wrote:
> On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell  
> wrote:
> > On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
> >> I know BTRFS is a kind of Log-structured File System, which doesn't do
> >> overwrite. Here is my question, suppose file A is overwritten by A',
> >> instead of writing A' to the original place of A, a new place is
> >> selected to store it. However, we know that the address of a file
> >> should be recorded in its inode. In such case, the corresponding part
> >> in inode of A should update from the original place A to the new place
> >> A', is this a kind of overwrite actually? I think no matter what
> >> design it is for Log-Structured FS, a mapping table is always needed,
> >> such as inode map, DAT, etc. When a update operation happens for this
> >> mapping table, is it actually a kind of over-write? If it is, is it a
> >> bottleneck for the performance of write for SSD?
> >
> > In btrfs, this is solved by doing the same thing for the inode--a new
> > place for the leaf holding the inode is chosen. Then the parent of the
> > leaf must point to the new position of the leaf, so the parent is moved,
> > and the parent's parent, etc. This goes all the way up to the
> > superblocks, which are actually overwritten one at a time.
> 
> You mean that there is no over-write for inode too, once the inode
> need to be updated, this inode is actually written to a new place
> while the only thing to do is to change the point of its parent to
> this new place. However, for the last parent, or the superblock, does
> it need to be overwritten?

Yes. The idea of copy-on-write, as used by btrfs, is that whenever
*anything* is changed, it is simply written to a new location. This
applies to data, inodes, and all of the B-trees used by the filesystem.
However, it's necessary to have *something* in a fixed place on disk
pointing to everything else. So the superblocks can't move, and they are
overwritten instead.
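
As a rough illustration of the fixed-location point, the following toy
"disk" in C writes every new tree root to a fresh block and then repoints
a superblock that always sits in block 0. This is a sketch only: the
16-block disk, SUPER_BLOCK, write_block() and commit_root() are invented
for the example and say nothing about the real on-disk layout.

```c
#include <assert.h>
#include <stddef.h>

#define NBLOCKS     16
#define SUPER_BLOCK 0    /* fixed, well-known location (sketch assumption) */

static unsigned long disk[NBLOCKS];   /* block contents */
static unsigned writes[NBLOCKS];      /* how often each block was written */
static size_t next_free = 1;          /* first block after the superblock */

static void write_block(size_t blk, unsigned long payload)
{
    disk[blk] = payload;
    writes[blk]++;
}

/*
 * Write a new root to a block that was never used before, then overwrite
 * the superblock so that it points at the new root. Returns the block
 * number of the new root, or 0 if the toy disk is full.
 */
static size_t commit_root(unsigned long root_payload)
{
    size_t blk;

    if (next_free >= NBLOCKS)
        return 0;
    blk = next_free++;
    write_block(blk, root_payload);
    write_block(SUPER_BLOCK, (unsigned long)blk);
    return blk;
}
```

After three commits every root block has been written exactly once, while
block 0 has been written three times; it is the only block that is ever
overwritten in place.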


Re: BTRFS && SSD

2010-09-29 Thread Sean Bartell
On Wed, Sep 29, 2010 at 03:39:07PM -0400, Aryeh Gregor wrote:
> On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell  
> wrote:
> > In btrfs, this is solved by doing the same thing for the inode--a new
> > place for the leaf holding the inode is chosen. Then the parent of the
> > leaf must point to the new position of the leaf, so the parent is moved,
> > and the parent's parent, etc. This goes all the way up to the
> > superblocks, which are actually overwritten one at a time.
> 
> Sorry for the useless question, but just out of curiosity: doesn't
> this mean that btrfs has to do quite a lot more writes than ext4 for
> small file operations?  E.g., if you append one block to a file, like
> a log file, then ext3 should have to do about three writes: data,
> metadata, and journal (and the latter is always sequential, so it's
> cheap).  But btrfs will need to do more, rewriting parent nodes all
> the way up the line for both the data and metadata blocks.  Why
> doesn't this hurt performance a lot?

For a single change, it does write more. However, there are usually many
changes to children being performed at once, which only require one
change to the parent. Since it's moving everything to new places, btrfs
also has much more control over where writes occur, so all the leaves
and parents can be written sequentially. ext3 is a slave to the current
locations on disk.
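
The batching effect can be counted with a small C sketch. The assumptions
here are mine and purely illustrative: a complete binary tree with DEPTH
levels above the leaves, and the hypothetical helpers batched_writes() and
separate_writes(). The first counts the union of all leaf-to-root paths
(one commit for all changes), the second counts one full path per change.

```c
#include <assert.h>
#include <string.h>

#define DEPTH   4               /* levels above the leaves */
#define NLEAVES (1u << DEPTH)   /* 16 leaves in this toy tree */

/*
 * Blocks rewritten when all listed leaves change in ONE commit: every
 * node on any leaf-to-root path is written once, however many changed
 * leaves sit below it. Each leaf index must be smaller than NLEAVES.
 */
static unsigned batched_writes(const unsigned *leaves, unsigned n)
{
    unsigned char seen[DEPTH + 1][NLEAVES];
    unsigned count = 0, i, level;

    memset(seen, 0, sizeof(seen));
    for (i = 0; i < n; i++) {
        for (level = 0; level <= DEPTH; level++) {
            unsigned idx = leaves[i] >> level; /* ancestor at this level */

            if (!seen[level][idx]) {
                seen[level][idx] = 1;
                count++;
            }
        }
    }
    return count;
}

/* Blocks rewritten when each change is committed on its own. */
static unsigned separate_writes(unsigned n)
{
    return n * (DEPTH + 1);
}
```

For four neighbouring leaves {0, 1, 2, 3} the batched commit rewrites 9
blocks, against 20 when each change is committed separately.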


[PATCH] Btrfs: Don't dereference extent_mapping if NULL

2010-09-29 Thread Roel Kluin
Don't dereference em if it's NULL or an error pointer.

Signed-off-by: Roel Kluin 
---
I just noticed this by code analysis. It wasn't tested in any way.
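
For readers unfamiliar with the idiom, the check can be tried out in user
space with the sketch below. It is an approximation under stated
assumptions: err_ptr() and is_err() are local re-implementations modelled
on the kernel's ERR_PTR()/IS_ERR(), and struct extent and extent_covers()
are hypothetical stand-ins, not btrfs structures. The range test mirrors
the comparison in the patch, including its inclusive upper bound.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer (user-space imitation). */
static inline void *err_ptr(long error)
{
    return (void *)error;
}

/* True if ptr lies in the top MAX_ERRNO bytes of the address space. */
static inline int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct extent {
    unsigned long start;
    unsigned long len;
};

/*
 * Return 1 only if the lookup result is a real extent that covers offset.
 * NULL and error pointers are rejected before any field is touched.
 */
static int extent_covers(const struct extent *em, unsigned long offset)
{
    if (!em || is_err(em))
        return 0;
    return em->start <= offset && offset <= em->start + em->len;
}
```

The order of the tests matters: the NULL and error-pointer checks have to
come first so that em->start is never evaluated for an invalid pointer.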

 fs/btrfs/inode.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c038644..d4a37f8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1787,7 +1787,8 @@ static int btrfs_io_failed_hook(struct bio *failed_bio,
 
                 read_lock(&em_tree->lock);
                 em = lookup_extent_mapping(em_tree, start, failrec->len);
-                if (em->start > start || em->start + em->len < start) {
+                if (em && !IS_ERR(em) && (em->start > start ||
+                                          em->start + em->len < start)) {
                         free_extent_map(em);
                         em = NULL;
                 }


Re: BTRFS && SSD

2010-09-29 Thread Yuehai Xu
On Wed, Sep 29, 2010 at 3:59 PM, Sean Bartell  wrote:
> On Wed, Sep 29, 2010 at 02:45:29PM -0400, Yuehai Xu wrote:
>> On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell  
>> wrote:
>> > On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:
>> >> I know BTRFS is a kind of Log-structured File System, which doesn't do
>> >> overwrite. Here is my question, suppose file A is overwritten by A',
>> >> instead of writing A' to the original place of A, a new place is
>> >> selected to store it. However, we know that the address of a file
>> >> should be recorded in its inode. In such case, the corresponding part
>> >> in inode of A should update from the original place A to the new place
>> >> A', is this a kind of overwrite actually? I think no matter what
>> >> design it is for Log-Structured FS, a mapping table is always needed,
>> >> such as inode map, DAT, etc. When a update operation happens for this
>> >> mapping table, is it actually a kind of over-write? If it is, is it a
>> >> bottleneck for the performance of write for SSD?
>> >
>> > In btrfs, this is solved by doing the same thing for the inode--a new
>> > place for the leaf holding the inode is chosen. Then the parent of the
>> > leaf must point to the new position of the leaf, so the parent is moved,
>> > and the parent's parent, etc. This goes all the way up to the
>> > superblocks, which are actually overwritten one at a time.
>>
>> You mean that there is no over-write for inode too, once the inode
>> need to be updated, this inode is actually written to a new place
>> while the only thing to do is to change the point of its parent to
>> this new place. However, for the last parent, or the superblock, does
>> it need to be overwritten?
>
> Yes. The idea of copy-on-write, as used by btrfs, is that whenever
> *anything* is changed, it is simply written to a new location. This
> applies to data, inodes, and all of the B-trees used by the filesystem.
> However, it's necessary to have *something* in a fixed place on disk
> pointing to everything else. So the superblocks can't move, and they are
> overwritten instead.
>

So, is this a bottleneck on SSDs, since the cost of an overwrite is very
high there? I think the superblocks have to be overwritten for every
write, which would be much more frequent than for other blocks on the
SSD, even though the SSD does wear leveling internally in its FTL.

What I currently know is that on an Intel X25-V SSD, the write throughput
of BTRFS is almost 80% lower than that of EXT3 under PostMark. This
really confuses me.

Thanks,
Yuehai


Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-29 Thread Jan Kara
On Wed 29-09-10 16:10:06, Christoph Hellwig wrote:
> On Wed, Sep 29, 2010 at 02:18:08PM +0200, Jan Kara wrote:
> > On Wed 29-09-10 10:19:36, Christoph Hellwig wrote:
> > > ---
> > > From: Christoph Hellwig 
> > > Subject: [PATCH] writeback: always use sb->s_bdi for writeback purposes
> > > 
> > ...
> > > The one exception for now is the block device filesystem which really
> > > > wants different writeback contexts for its different (internal) inodes
> > > to handle the writeout more efficiently.  For now we do this with
> > > a hack in fs-writeback.c because we're so late in the cycle, but in
> > > the future I plan to replace this with a superblock method that allows
> > > for multiple writeback contexts per filesystem.
> >   Another exception I know about is mtd_inodefs filesystem
> > (drivers/mtd/mtdchar.c).
> 
> No, it's not.  MTD only has three different backing_dev_info instances
> which have different flags in the mapping-relevant portion of the
> backing_dev. 
  In the end I agree I was probably wrong but it's not that simple ;)

> >   So at least here you'd need also add a similar exception for
> > "mtd_inodefs".
> 
> No.  For one thing we don't need any exception for correctness alone -
> even the block device variant would work fine with the default case.
Here I don't agree. If you don't have some kind of exception, sb->s_bdi
for both "block" and "mtd_inodefs" filesystems points to
noop_backing_dev_info and you get no writeback for that one. So it isn't
just a performance issue but also a correctness one.

Regarding mtd_inodefs, I have now looked in more detail at what MTD
actually does, and it seems to me that MTD device inodes do not carry any
cached state that flusher threads could write back. So returning
noop_backing_dev_info might be the right thing for them after all...
(I added David Woodhouse and the MTD list to CC so that they can shout if
that's not the case). Having come to this conclusion, I'm happy with your
patch going in as is...
Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-29 Thread Christoph Hellwig
On Wed, Sep 29, 2010 at 10:04:31AM +0200, Kay Sievers wrote:
> On Wed, Sep 29, 2010 at 09:25, Ric Wheeler  wrote:
> 
> > Second question is why is checking in /sys a big deal, would you prefer an
> > interface like we did for alignment in libblkid?
> 
> It's about knowing what's behind the 'nodev' major == 0 of a btrfs
> mount. There is no way to get that from /sys or anywhere else at the
> moment.
> 
> Usually filesystems backed by a disk have the dev_t of the device, or
> the fake block devices like md/dm/raid have their own major and the
> slaves/ directory pointing to the devices.
> 
> This is not only about readahead, it's every other tool, that needs to
> know what kind of disks are behind a btrfs 'nodev' major == 0 mount.

Thanks for explaining the problem.  It's one that affects everything
with more than one underlying block device, so adding a
filesystem-specific ioctl hack is not a good idea.  As mentioned in this
mail we already have a solution for that - the block device slaves
links used for raid and volume managers.  The most logical fix is to
re-use that for btrfs as well and stop it from abusing the anonymous
block major that was never intended for block based filesystems (and
already has caused trouble in other areas).  One way to do this might
be to allocate a block major for btrfs that only gets used for
representing these links.
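
For illustration only, here is roughly how a userspace tool could consume
such slaves links once they exist. This is a minimal sketch in Python, not
anything from the patch under discussion: the function names are made up,
and it assumes the usual sysfs layout of /sys/class/block/<name>/slaves/
and queue/rotational (falling back to the parent disk for partitions).

```python
import os


def leaf_devices(name, sysfs_root="/sys/class/block"):
    """Return the names of the bottom-most devices below block device `name`.

    Follows slaves/ entries recursively; a device without slaves is a leaf.
    """
    slaves_dir = os.path.join(sysfs_root, name, "slaves")
    try:
        slaves = sorted(os.listdir(slaves_dir))
    except OSError:
        slaves = []  # no slaves/ directory: treat the device as a leaf
    if not slaves:
        return [name]
    leaves = []
    for slave in slaves:
        for leaf in leaf_devices(slave, sysfs_root):
            if leaf not in leaves:
                leaves.append(leaf)
    return leaves


def is_rotational(name, sysfs_root="/sys/class/block"):
    """Return True/False from queue/rotational, or None if it cannot be read."""
    dev_dir = os.path.realpath(os.path.join(sysfs_root, name))
    # Partitions have no queue/ of their own, so also try the parent disk.
    for candidate in (dev_dir, os.path.dirname(dev_dir)):
        path = os.path.join(candidate, "queue", "rotational")
        try:
            with open(path) as f:
                return f.read().strip() == "1"
        except OSError:
            continue
    return None


if __name__ == "__main__":
    import sys

    for dev in leaf_devices(sys.argv[1] if len(sys.argv) > 1 else "sda"):
        print(dev, is_rotational(dev))
```

Taking the sysfs root as a parameter keeps the walk testable against a
scratch directory instead of the live /sys tree.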



Re: Dirtiable inode bdi default != sb bdi btrfs

2010-09-29 Thread Christoph Hellwig
On Thu, Sep 30, 2010 at 01:38:07AM +0200, Jan Kara wrote:
> > No.  For one thing we don't need any exception for correctness alone -
> > even the block device variant would work fine with the default case.
> Here I don't agree. If you don't have some kind of exception, sb->s_bdi
> for both "block" and "mtd_inodefs" filesystems points to
> noop_backing_dev_info and you get no writeback for that one. So it isn't
> just a performance issue but also a correctness one.

Indeed - for internal filesystems that require writeback the change
causes trouble if they haven't registered a s_bdi.  But for all user
visible filesystems that doesn't happen as we require s_bdi for
sync or even unmounts to work.


Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-29 Thread Josef Bacik
On Wed, Sep 29, 2010 at 07:43:27PM -0400, Christoph Hellwig wrote:
> On Wed, Sep 29, 2010 at 10:04:31AM +0200, Kay Sievers wrote:
> > On Wed, Sep 29, 2010 at 09:25, Ric Wheeler  wrote:
> > 
> > > > Second question is why is checking in /sys a big deal, would you prefer an
> > > > interface like we did for alignment in libblkid?
> > 
> > It's about knowing what's behind the 'nodev' major == 0 of a btrfs
> > mount. There is no way to get that from /sys or anywhere else at the
> > moment.
> > 
> > Usually filesystems backed by a disk have the dev_t of the device, or
> > the fake block devices like md/dm/raid have their own major and the
> > slaves/ directory pointing to the devices.
> > 
> > This is not only about readahead, it's every other tool, that needs to
> > know what kind of disks are behind a btrfs 'nodev' major == 0 mount.
> 
> Thanks for explaining the problem.  It's one that affects everything
> with more than one underlying block device, so adding a
> filesystem-specific ioctl hack is not a good idea.  As mentioned in this
> mail we already have a solution for that - the block device slaves
> links used for raid and volume managers.  The most logical fix is to
> re-use that for btrfs as well and stop it from abusing the anonymous
> block major that was never intended for block based filesystems (and
> already has caused trouble in other areas).  One way to do this might
> be to allocate a block major for btrfs that only gets used for
> representing these links.
>

Fair enough, I will look into this next week sometime.  Thanks,

Josef 


ring btrfs

2010-09-29 Thread Simon Kirby
Just curious...

I've been wondering if it would be possible/useful to make a btrfs mode
where it _always_ writes in a ring on the entire disk.  Since it can
write "anywhere" already, it could just write sequentially always.

The write process would have to be changed to read a blob, throw away
the garbage, and fit in the new stuff in the gaps, then write the blob.
Write performance would degrade to half once the disk is half full of
stuff that can't be thrown away, but then almost all file I/O patterns
should be about the same speed, because disk writes are only sequential.
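
To put rough numbers on the "half once the disk is half full" estimate,
here is a back-of-the-envelope sketch in Python. It is only a toy model
with made-up function names: it assumes the live data is spread evenly
around the ring, counts only write bandwidth, and ignores the cost of
reading each blob back in.

```python
def effective_write_rate(raw_rate, live_fraction):
    """New-data write rate when every rewritten blob carries live data along.

    raw_rate      -- sequential write bandwidth of the disk (any unit)
    live_fraction -- share of each blob that cannot be thrown away (0 <= f < 1)
    """
    if not 0.0 <= live_fraction < 1.0:
        raise ValueError("live_fraction must be in [0, 1)")
    # Only the garbage share of each blob is available for new data.
    return raw_rate * (1.0 - live_fraction)


def write_amplification(live_fraction):
    """Bytes physically written per byte of new data in the same model."""
    if not 0.0 <= live_fraction < 1.0:
        raise ValueError("live_fraction must be in [0, 1)")
    return 1.0 / (1.0 - live_fraction)


if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 0.75, 0.9):
        rate = effective_write_rate(100.0, fraction)
        amp = write_amplification(fraction)
        print(f"live={fraction:.2f} rate={rate:.1f} amplification={amp:.2f}")
```

In this model the rate falls off linearly with the live fraction, so the
cost climbs steeply as the disk approaches full.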

Or is it that it's pretty much always like this until space is tight
anyway?  I suspect that eventually free space will be pretty fragmented,
and it'll have to seek a lot just to write in available space.  Perhaps
sequential read+write performs similarly to, or worse than, just
writing in the gaps in many/all cases?

Maybe this is what nilfs2, etc., are all about, in which case maybe it would
be neat to have a mount option or even realtime heuristic-based switching
if load patterns fit better to a particular mode...

Simon-