Re: User feedback: raise the default leaf size to 16k

2013-03-03 Thread Brendan Hide

On 2013/02/13 12:33 PM, Holger Hoffstaette wrote:

- raise the leaf size to 16k
- use single metadata profile

...

the difference in behaviour on a single disk is *very* noticeable.
Did you try an isolated change of leaf size? I think the devs would be 
willing to look into the default size if it makes a dramatic difference 
on its own. Personally I think you are seeing an improvement more as a 
result of the metadata profile rather than the leafsize.


I don't think changing the default profile for metadata will be easily 
entertained as this is very important for protecting against corruption 
due to bitrot.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs/raid56: Add missing #include <linux/vmalloc.h>

2013-03-03 Thread Geert Uytterhoeven
tilegx_defconfig:

fs/btrfs/raid56.c: In function 'btrfs_alloc_stripe_hash_table':
fs/btrfs/raid56.c:206:3: error: implicit declaration of function 'vzalloc' 
[-Werror=implicit-function-declaration]
fs/btrfs/raid56.c:206:9: warning: assignment makes pointer from integer without 
a cast [enabled by default]
fs/btrfs/raid56.c:226:4: error: implicit declaration of function 'vfree' 
[-Werror=implicit-function-declaration]

Signed-off-by: Geert Uytterhoeven ge...@linux-m68k.org
---
http://kisskb.ellerman.id.au/kisskb/buildresult/8311887/

 fs/btrfs/raid56.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 0722205..9a79fb7 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -31,6 +31,7 @@
 #include <linux/hash.h>
 #include <linux/list_sort.h>
 #include <linux/raid/xor.h>
+#include <linux/vmalloc.h>
 #include <asm/div64.h>
 #include "compat.h"
 #include "ctree.h"
-- 
1.7.0.4



Re: [PATCH] btrfs/raid56: Add missing #include <linux/vmalloc.h>

2013-03-03 Thread Chris Mason
On Sun, Mar 03, 2013 at 04:44:41AM -0700, Geert Uytterhoeven wrote:
> tilegx_defconfig:
>
> fs/btrfs/raid56.c: In function 'btrfs_alloc_stripe_hash_table':
> fs/btrfs/raid56.c:206:3: error: implicit declaration of function 'vzalloc'
> [-Werror=implicit-function-declaration]
> fs/btrfs/raid56.c:206:9: warning: assignment makes pointer from integer
> without a cast [enabled by default]
> fs/btrfs/raid56.c:226:4: error: implicit declaration of function 'vfree'
> [-Werror=implicit-function-declaration]

Thanks, I've got this one in my for-linus now.  It'll go with the next
pull.

-chris


[PATCH] btrfs: fix compile failure on parisc

2013-03-03 Thread James Bottomley
x86 seems to include vmalloc.h by default along some of its arch paths,
but most other architectures don't, leading to this compile failure:

fs/btrfs/raid56.c: In function 'btrfs_alloc_stripe_hash_table':
fs/btrfs/raid56.c:206: error: implicit declaration of function 'vzalloc'
fs/btrfs/raid56.c:206: warning: assignment makes pointer from integer
without a cast
fs/btrfs/raid56.c:226: error: implicit declaration of function 'vfree'
make[2]: *** [fs/btrfs/raid56.o] Error 1

Fix this by adding vmalloc.h explicitly to the includes list

Signed-off-by: James Bottomley jbottom...@parallels.com

---

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 0722205..1f0f57e 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -30,6 +30,7 @@
 #include <linux/raid/pq.h>
 #include <linux/hash.h>
 #include <linux/list_sort.h>
+#include <linux/vmalloc.h>
 #include <linux/raid/xor.h>
 #include <asm/div64.h>
 #include "compat.h"



[GIT PULL] Btrfs fixup

2013-03-03 Thread Chris Mason
Hi Linus,

Geert and James both sent this one in, sorry guys.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Geert Uytterhoeven (1) commits (+1/-0):
btrfs/raid56: Add missing #include <linux/vmalloc.h>

Total: (1) commits (+1/-0)

 fs/btrfs/raid56.c | 1 +
 1 file changed, 1 insertion(+)


same EXTENT_ITEM appears twice in the extent tree

2013-03-03 Thread Alex Lyakas
Greetings all,
I have an extent tree that looks as follows:

item 22 key (27059916800 EXTENT_ITEM 16384) itemoff 2656 itemsize 24
    extent refs 1 gen 164 flags 1
item 23 key (27059916800 EXTENT_ITEM 98304) itemoff 2603 itemsize 53
    extent refs 1 gen 165 flags 1
    extent data backref root 257 objectid 257 offset 17446191104 count 1
item 24 key (27059916800 SHARED_DATA_REF 47169536) itemoff 2599 itemsize 4
    shared data backref count 1

As can be seen, the same extent start (27059916800) appears in two
EXTENT_ITEMs with different lengths. This went undetected until
__btrfs_free_extent was called, after the cleaner deleted one of the
snapshots. It then hit this assertion:
        if (found_extent) {
                BUG_ON(is_data && refs_to_drop !=
                       extent_data_ref_count(root, path, iref));
                if (iref) {
                        BUG_ON(path->slots[0] != extent_slot);
                } else {
                        BUG_ON(path->slots[0] != extent_slot + 1);  /* CRASH */
                        path->slots[0] = extent_slot;
                        num_to_del = 2;
                }

As for the usage of this bad extent, there are multiple snapshots
sharing the 98304-length extent, but only one that uses the 16384
extent:
file tree key (257 ROOT_ITEM 0)
item 19 key (257 EXTENT_DATA 17446191104) itemoff 2935 itemsize 53
extent data disk byte 27059916800 nr 98304
extent data offset 0 nr 98304 ram 98304
extent compression 0
...
file tree key (350 ROOT_ITEM 164)
item 21 key (257 EXTENT_DATA 17446191104) itemoff 2829 itemsize 53
extent data disk byte 27059916800 nr 16384
extent data offset 0 nr 16384 ram 16384
extent compression 0

...
file tree key (352 ROOT_ITEM 167)
item 19 key (257 EXTENT_DATA 17446191104) itemoff 2935 itemsize 53
extent data disk byte 27059916800 nr 98304
extent data offset 0 nr 98304 ram 98304
extent compression 0

Kernel is for-linus, top commit:

commit 1eafa6c73791e4f312324ddad9cbcaf6a1b6052b
Author: Miao Xie mi...@cn.fujitsu.com
Date:   Tue Jan 22 10:49:00 2013 +

Btrfs: fix repeated delalloc work allocation

I believe I might have more extents like this, because btrfs-debug-tree warns:
warning, bad space info total_bytes 26851934208 used 26852773888
warning, bad space info total_bytes 27925676032 used 27926892544

Mount options: nodatasum,nodatacow,noatime,nospace_cache. Metadata
profile is DUP, data profile is single.

Can anybody advise on how this could have happened? I can provide the
whole debug-tree, btrfs-image or any additional info.

Thanks,
Alex.


weird kernel-oopses while deleting files on btrfs

2013-03-03 Thread Michael Schmitt

Hi list,

some rather unexpected btrfs-oopses for my taste. I use btrfs for some 
time now (mostly on external harddisks) and these oopses happened 
during some simple file and folder deletion operation on that device. It 
is a luks-encrypted 80GB drive. Anything like that known? And the fs was 
created just yesterday, how come there is a message like...


[91491.919358] btrfs: mismatching generation and generation_v2 found in root 
item. This root was probably mounted with an older kernel. Resetting all new 
fields.

... but the kernel used (3.7.3 from Debian experimental on Debian sid) 
was installed several days ago. What kind of oopses are these? As of now 
there is no real data on that device. But if there were, would I need to 
be concerned about the integrity of those files?


ii  btrfs-tools   0.19+20130131-2 (if 
that matters)


The whole log is at http://paste.debian.net/hidden/6ee00823/ (if a mua 
fails to display the text unwrapped) and a copy right here:


[91491.900736] device label samsung_S0DWJ30L373663 devid 1 transid 10 
/dev/mapper/udisks-luks-uuid-64e6f540-8df0-49b2-af3d-ea18e07355d2-uid1000
[91491.904416] btrfs: disk space caching is enabled
[91491.919358] btrfs: mismatching generation and generation_v2 found in root 
item. This root was probably mounted with an older kernel. Resetting all new 
fields.
[91978.944644] device label seagate_W1E2Z3TA devid 1 transid 439 /dev/dm-8
[91979.320743] device label seagate_W1E2Z3TA devid 1 transid 439 
/dev/mapper/udisks-luks-uuid-24593edd-349c-451f-9b6d-eab1120471f6-uid1000
[91979.31] btrfs: disk space caching is enabled
[93283.761960] btrfs: block rsv returned -28
[93283.761965] ------------[ cut here ]------------
[93283.762006] WARNING: at 
/build/buildd-linux_3.7.3-1~experimental.1-i386-eX5kUQ/linux-3.7.3/fs/btrfs/extent-tree.c:6297
 btrfs_alloc_free_block+0xcd/0x2a4 [btrfs]()
[93283.762010] Hardware name: System Product Name
[93283.762012] Modules linked in: hid_logitech usbhid ff_memless 
ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat 
nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack 
ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables 
x_tables cpufreq_powersave cpufreq_conservative cpufreq_stats cpufreq_userspace 
ppdev lp bnep rfcomm bluetooth rfkill binfmt_misc uinput nfsd auth_rpcgss 
nfs_acl nfs lockd dns_resolver fscache sunrpc bridge stp llc ext4 crc16 jbd2 
hwmon_vid loop fuse snd_hda_codec_analog snd_wavefront snd_cs4236 snd_hda_intel 
btrfs sg snd_opl3_lib sr_mod snd_hda_codec nouveau snd_hwdep cdrom snd_pcm_oss 
crc32c libcrc32c zlib_deflate snd_wss_lib joydev usb_storage hid_generic 
sata_sil snd_mpu401 snd_mixer_oss snd_mpu401_uart coretemp kvm_intel usbled 
snd_pcm snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq 
mxm_wmi wmi video snd_seq_device snd_timer ttm i2c_i801 iTCO_wdt snd ns558 
drm_kms_helper drm i2c_algo_bit soundcore gameport iTCO_vendor_support kvm 
lpc_ich mfd_core rng_core pcspkr i2c_core psmouse evdev acpi_cpufreq mperf 
parport_pc parport processor r8169 mii serio_raw ehci_hcd asus_atk0110 
thermal_sys button ext3 mbcache jbd dm_crypt dm_mod raid1 md_mod sha256_generic 
aes_i586 cbc hid sd_mod crc_t10dif ata_generic microcode ata_piix uhci_hcd 
libata scsi_mod usbcore usb_common [last unloaded: usbhid]
[93283.762149] Pid: 10948, comm: pool Not tainted 3.7-trunk-686-pae #1 Debian 
3.7.3-1~experimental.1
[93283.762151] Call Trace:
[93283.762160]  [<c10310a1>] ? warn_slowpath_common+0x68/0x79
[93283.762187]  [<fbe07206>] ? btrfs_alloc_free_block+0xcd/0x2a4 [btrfs]
[93283.762193]  [<c10310bf>] ? warn_slowpath_null+0xd/0x10
[93283.762219]  [<fbe07206>] ? btrfs_alloc_free_block+0xcd/0x2a4 [btrfs]
[93283.762226]  [<c10b9541>] ? page_address+0x1b/0x85
[93283.762254]  [<fbe0ed11>] ? btrfs_header_generation.isra.75+0xb/0x14 [btrfs]
[93283.762277]  [<fbdf929c>] ? __btrfs_cow_block+0xfb/0x3b4 [btrfs]
[93283.762301]  [<fbdfa8b9>] ? read_block_for_search.isra.42+0x91/0x31e [btrfs]
[93283.762325]  [<fbdf966d>] ? btrfs_cow_block+0xe2/0x11f [btrfs]
[93283.762349]  [<fbdfbea3>] ? btrfs_search_slot+0x1e6/0x5ab [btrfs]
[93283.762377]  [<fbe0c613>] ? btrfs_del_csums+0xd7/0x30a [btrfs]
[93283.762402]  [<fbe029d7>] ? __btrfs_free_extent+0x5f8/0x67f [btrfs]
[93283.762428]  [<fbe0654e>] ? run_clustered_refs+0x7a7/0x803 [btrfs]
[93283.762435]  [<c10a89ec>] ? __set_page_dirty_nobuffers+0x11/0xb7
[93283.762462]  [<fbe088f2>] ? btrfs_run_delayed_refs+0xe7/0x220 [btrfs]
[93283.762491]  [<fbe151c4>] ? __btrfs_end_transaction+0xfb/0x275 [btrfs]
[93283.762521]  [<fbe1df70>] ? btrfs_evict_inode+0x277/0x2a1 [btrfs]
[93283.762528]  [<c10eae11>] ? evict+0x89/0x122
[93283.762533]  [<c10e3721>] ? do_unlinkat+0xcc/0x108
[93283.762538]  [<c10dbc57>] ? fput+0xc/0x8a
[93283.762543]  [<c10e6b16>] ? sys_getdents64+0xaa/0xc4
[93283.762549]  [<c12ecd4d>] ? sysenter_do_call+0x12/0x28
[93283.762555]  [<c12e007b>] ? set_cpu_sibling_map+0x2cf/0x2e5
[93283.762560]  

Re: same EXTENT_ITEM appears twice in the extent tree

2013-03-03 Thread Chris Mason
On Sun, Mar 03, 2013 at 06:40:50AM -0700, Alex Lyakas wrote:
> Greetings all,
> I have an extent tree that looks like follows:
>
>   item 22 key (27059916800 EXTENT_ITEM 16384) itemoff 2656 itemsize 24
>   extent refs 1 gen 164 flags 1
>   item 23 key (27059916800 EXTENT_ITEM 98304) itemoff 2603 itemsize 53
>   extent refs 1 gen 165 flags 1
>   extent data backref root 257 objectid 257 offset 17446191104 count 1
>   item 24 key (27059916800 SHARED_DATA_REF 47169536) itemoff 2599 itemsize 4
>   shared data backref count 1

Have you been experimenting on this FS with snapshot deletion patches?

-chris


Re: basic questions regarding COW in Btrfs

2013-03-03 Thread Aastha Mehta
Hi Josef,

I have some more questions following up on my previous e-mails.
I now do somewhat understand the place where extent entries get
cow'ed. But I am unclear about the order of operations.

Is it correct that the data extent is written first, then the pointer in
the indirect block needs to be updated, so that block is cowed and
written to disk, and so on recursively up the tree? Or is the entire
path from leaf to node that is going to be affected by the write cowed
first, then all the cowed extents written to disk, and then the rest of
the metadata pointers (for example, in the checksum tree, extent tree,
etc.; I am not sure about this)?

Also, I need to understand specifically how the data (leaf nodes) of a
file is written to disk v/s the metadata including the indirect nodes
of the file. In extent_writepage I only know the pages of a file that
are to be written. I guess, I can identify metadata pages based on the
inode of the page's owner. But is it possible to distinguish the pages
available in extent_writepage path as belonging to the leaf node or
internal node for a file? If it cannot be identified at this point,
where earlier in the path can this be decided?

Many thanks,
Aastha.

On 25 February 2013 20:00, Aastha Mehta aasth...@gmail.com wrote:
 Ah okay, I now see how it works. Thanks a lot for your response.

 Regards,
 Aastha.


 On 25 February 2013 18:27, Josef Bacik jba...@fusionio.com wrote:
 On Mon, Feb 25, 2013 at 08:15:40AM -0700, Aastha Mehta wrote:
 Thanks again Josef.

 I understood that cow_file_range is called for a regular file. Just to
 clarify, in cow_file_range is cow done at the time of reserving
 extents in the extent btree for the io to be done in this delalloc? I
 see the following comment above find_free_extent() which is called
 while trying to reserve extents:

 /*
  * walks the btree of allocated extents and find a hole of a given size.
  * The key ins is changed to record the hole:
  * ins->objectid == block start
  * ins->flags = BTRFS_EXTENT_ITEM_KEY
  * ins->offset == number of blocks
  * Any available blocks before search_start are skipped.
  */

 This seems to be the only place where a cow might be done, because a
 key is being inserted into an extent which modifies it.


 The key isn't inserted at this time, it's just returned with those values
 for us to do as we please.  There is no update of the btree until
 insert_reserved_extent/btrfs_mark_extent_written in btrfs_finish_ordered_io.
 Thanks,

 Josef


Re: same EXTENT_ITEM appears twice in the extent tree

2013-03-03 Thread Alex Lyakas
Hi Chris,

On Sun, Mar 3, 2013 at 5:28 PM, Chris Mason chris.ma...@fusionio.com wrote:
> On Sun, Mar 03, 2013 at 06:40:50AM -0700, Alex Lyakas wrote:
>> Greetings all,
>> I have an extent tree that looks like follows:
>>
>>   item 22 key (27059916800 EXTENT_ITEM 16384) itemoff 2656 itemsize 24
>>   extent refs 1 gen 164 flags 1
>>   item 23 key (27059916800 EXTENT_ITEM 98304) itemoff 2603 itemsize 53
>>   extent refs 1 gen 165 flags 1
>>   extent data backref root 257 objectid 257 offset 17446191104 count 1
>>   item 24 key (27059916800 SHARED_DATA_REF 47169536) itemoff 2599 itemsize 4
>>   shared data backref count 1
>
> Have you been experimenting on this FS with snapshot deletion patches?

No, I haven't applied any patches on top of the commit I mentioned. (I
presume you mean David's patch for one-by-one deletion). Since
created, this FS has only seen straight IO with parallel snapshot
creation and deletion. However, the kernel was crashing pretty
frequently during this test, so I presume log replay was taking place.

Any particular thing I can look for in the debug-tree output, except
searching for more double-allocations?
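
One way to look for more such cases is to scan the debug-tree text for
EXTENT_ITEM keys whose byte ranges overlap. The following is only a sketch:
it assumes the "key (<start> EXTENT_ITEM <length>)" line format shown in the
dump earlier in this thread, and find_overlapping_extents is a hypothetical
helper, not part of btrfs-progs.

```python
import re

# Matches e.g. "key (27059916800 EXTENT_ITEM 16384)" in debug-tree output.
KEY_RE = re.compile(r"key \((\d+) EXTENT_ITEM (\d+)\)")


def find_overlapping_extents(dump_text):
    """Return pairs of (start, length) extents whose byte ranges overlap."""
    extents = sorted(
        (int(m.group(1)), int(m.group(2))) for m in KEY_RE.finditer(dump_text)
    )
    overlaps = []
    prev = None  # extent seen so far that reaches furthest
    for start, length in extents:
        if prev is not None and start < prev[0] + prev[1]:
            overlaps.append((prev, (start, length)))
        if prev is None or start + length > prev[0] + prev[1]:
            prev = (start, length)
    return overlaps
```

Fed the three items quoted above, this reports the 16384-byte and 98304-byte
extents at 27059916800 as one overlapping pair. It only flags candidates; it
says nothing about how the duplicate was created.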

Thanks,
Alex.



> -chris


Re: [btrfs] Periodic write spikes while idling, on btrfs root

2013-03-03 Thread Brendan Hide

On 2013/02/14 12:15 PM, Vedant Kumar wrote:

Hello,

I'm experiencing periodic write spikes while my system is idle.

...

turned out to be some systemd log in
/var/log/journal. I turned off journald and rebooted, but the write spike
behavior remained.

...

best,
-vk

I believe btrfs syncs every 30 seconds (if anything's changed).

This sounds like systemd's journal is not actually disabled and that it 
is simply logging new information every few seconds and forcing it to be 
synced to disk. Have you tried following the journal as root to see what 
is being logged?

journalctl -f

Alternatively, as another measure to troubleshoot, in 
/etc/systemd/journald.conf, change the Storage= option either to none 
(which disables logging completely) or to a path inside a tmpfs, thereby 
eliminating btrfs' involvement.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [PATCH] btrfs-progs: usage should match what is coded

2013-03-03 Thread David Sterba
On Fri, Mar 01, 2013 at 06:05:21PM +, Hugo Mills wrote:
 On Fri, Mar 01, 2013 at 11:47:50AM -0600, Eric Sandeen wrote:
  On 3/1/13 4:10 AM, Anand Jain wrote:
   Signed-off-by: Anand Jain anand.j...@oracle.com
  
  Reviewed-by: Eric Sandeen sand...@redhat.com
  
  But the curious side of me wonders how it got this way.
  
  commit e43cc461550130494194201037590a2b1f0f6880
  Author: Ian Kumlien po...@demius.net
  Date:   Fri Feb 8 01:37:02 2013 +0100
  
  Btrfs-progs: add restore command to btrfs
  
  added the usage text below, but didn't change the getopt
  or add code to handle them.
  
  No idea where it came from, it wasn't in the standalone
  restore either.  *shrug*  I guess nothing got lost.
 
-m was definitely a thing at some point, as I recall using it. I
 think the code was in josef's progs tree. I suspect the other options
 were part of that, too. (And -m was definitely really useful for me).

My fault here, I cherry-picked Ian's commit from a branch with Josef's
updates to restore (adding all the commandline options). I'll pick
Anand's fix to keep help and functionality matching. The updates to
restore are wanted, but as they are based on an old progs version it's
not all trivial to merge them.

david


Re: User feedback: raise the default leaf size to 16k

2013-03-03 Thread Chris Mason
On Sun, Mar 03, 2013 at 03:33:30AM -0700, Brendan Hide wrote:
> On 2013/02/13 12:33 PM, Holger Hoffstaette wrote:
> > - raise the leaf size to 16k
> > - use single metadata profile
> >
> > ...
> >
> > the difference in behaviour on a single disk is *very* noticeable.
> Did you try an isolated change of leaf size? I think the devs would be
> willing to look into the default size if it makes a dramatic difference
> on its own. Personally I think you are seeing an improvement more as a
> result of the metadata profile rather than the leafsize.
>
> I don't think changing the default profile for metadata will be easily
> entertained as this is very important for protecting against corruption
> due to bitrot.

The long term plan is to set the default size to 16K, since this does
cut down on metadata fragmentation.  But in some benchmarks, it adds lock
contention because we have fewer blocks to spread the locks over.

The 3.9 merge window has fixes for lock contention, so I need to
benchmark things again.

-chris


Re: [PATCH v3] Btrfs-progs: check out if the swap device

2013-03-03 Thread Brendan Hide

On 2013/02/14 09:53 AM, Tsutomu Itoh wrote:

+       if (ret < 0) {
+               fprintf(stderr, "error checking %s status: %s\n", file,
+                       strerror(-ret));
+               exit(1);
+       }

...

+       /* check if the device is busy */
+       fd = open(file, O_RDWR|O_EXCL);
+       if (fd < 0) {
+               fprintf(stderr, "unable to open %s: %s\n", file,
+                       strerror(errno));
+               exit(1);
+       }

This is fine and works (as tested by David) - but I'm not sure if the 
below suggestions from Zach were taken into account.


1. If the check with open(file, O_RDWR|O_EXCL) shows that the device 
is available, there's no point in checking if it is mounted as a swap 
device. A preliminary check using this could precede all other checks 
which should be skipped if it shows success.


2. If there's an error checking the status (for example, let's say 
/proc/swaps is deprecated), we should print the informational message 
but not error out.
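
The ordering suggested in the two points above can be sketched roughly as
follows. This is an illustration in Python rather than the patch's C, and
the names swap_devices and why_busy are made up for the example; only the
open() flags and the /proc/swaps path come from the thread.

```python
import errno
import os


def swap_devices(swaps_text):
    """Return the device paths listed in /proc/swaps-style text."""
    devices = []
    for line in swaps_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if fields:
            devices.append(fields[0])
    return devices


def _open_excl(path):
    # Exclusive open, as in the patch; close again right away.
    os.close(os.open(path, os.O_RDWR | os.O_EXCL))


def _read_swaps():
    with open("/proc/swaps") as f:
        return f.read()


def why_busy(path, open_excl=_open_excl, read_swaps=_read_swaps):
    """Return None if path opens exclusively, else a human-readable reason."""
    try:
        open_excl(path)
        return None  # point 1: success means no further checks are needed
    except OSError as exc:
        reason = os.strerror(exc.errno) if exc.errno else str(exc)
    try:
        if path in swap_devices(read_swaps()):
            return "%s is a swap device" % path
    except OSError:
        pass  # point 2: failure to read /proc/swaps is never fatal
    return "unable to open %s: %s" % (path, reason)
```

The swap lookup only runs to produce a nicer message after the exclusive
open has already failed, so a missing or unreadable /proc/swaps changes the
wording of the error and nothing else.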


On 2013/02/13 11:58 AM, Zach Brown wrote:

- First always open with O_EXCL.  If it succeeds then there's no reason
   to check /proc/swaps at all.  (Maybe it doesn't need to try
   check_mounted() there either?  Not sure if it's protecting against
   accidentally mounting mounted shared storage or not.)

...

- At no point is failure of any of the /proc/swaps parsing fatal.  It'd
   carry on ignoring errors until it doesnt have work to do.  It'd only
   ever print the nice message when it finds a match.



--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [PATCH] Btrfs: Include the device in most error printk()s

2013-03-03 Thread David Sterba
On Fri, Feb 15, 2013 at 05:12:37PM -0600, Simon Kirby wrote:
[...]
 Signed-off-by: Simon Kirby s...@hostway.ca

Thanks! 2 comments below.

Reviewed-by: David Sterba dste...@suse.cz

> @@ -2919,8 +2923,9 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root,
>       if (ret) {
>               btrfs_delalloc_release_metadata(inode, inode->i_size);
>  #ifdef DEBUG
> -             printk(KERN_ERR "btrfs: failed to write free ino cache "
> -                    "for root %llu\n", root->root_key.objectid);
> +             btrfs_err(root->fs_info,
> +                     "btrfs %s: failed to write free ino cache for root %llu",
> +                     root->root_key.objectid);

                        "failed to write free ino cache for root %llu",

>  #endif
>       }
>
> @@ -2454,8 +2456,8 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
>                               ret = PTR_ERR(trans);
>                               goto out;
>                       }
> -                     printk(KERN_ERR "auto deleting %Lu\n",
> -                            found_key.objectid);
> +                     btrfs_err(root->fs_info, "auto deleting %Lu",
> +                             found_key.objectid);

That's probably only a debugging message, so btrfs_debug would be more
appropriate here.

>                       ret = btrfs_del_orphan_item(trans, root,
>                                                   found_key.objectid);
>                       BUG_ON(ret); /* -ENOMEM or corruption (JDM: Recheck) */


Re: btrfs send receive produces Too many open files in system

2013-03-03 Thread Brendan Hide

On 2013/02/18 12:37 PM, Adam Ryczkowski wrote:

...
to migrate btrfs from one partition layout to another.
...
source sits on top of lvm2 logical volume, which sits on top of 
cryptsetup Luks device which subsequentely sits on top of mdadm RAID-6 
spanning a partition on each of 4 hard drives ... is a read-only 
snaphot which I estimate contain ca. 100GB data.

...
destination is btrfs multidevice raid10 filesystem, which is based 
on 4 cryptsetup Luks devices, each live as a separate partition on the 
same 4 physical hard drives ...

...
about 8MB/sek read (and the same speed of write) from each of all 4 
hard drives).



I hope you've solved this already - but if not:

The unnecessarily complex setup aside, a 4-disk RAID6 is going to be 
slow - most would have gone for a RAID10 configuration, albeit that it 
has less redundancy.


Another real problem here is that you are copying data from these disks 
to themselves. This means that for every read and write, all four of 
the disks have to do two seeks. This is time-consuming, on the order of 
7ms per seek depending on the disks you have. The way to avoid these 
unnecessary seeks is to first copy the data to a separate unrelated 
device and then to copy from that device to your final destination device.


To increase RAID6 write performance (Perhaps irrelevant here) you can 
try optimising the stripe_cache_size value. It can use a ton of memory 
depending on how large a stripe cache setting you end up with. Search 
online for mdraid stripe_cache_size.
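
For a rough idea of the memory cost: the kernel's md documentation gives the
stripe cache footprint as page size times number of member disks times
stripe_cache_size. A small sketch of that arithmetic, assuming 4 KiB pages
and the 4-disk array described above (the function name is made up for
illustration):

```python
def stripe_cache_bytes(stripe_cache_size, nr_disks, page_size=4096):
    """Approximate memory used by the md stripe cache, in bytes."""
    return page_size * nr_disks * stripe_cache_size


for entries in (256, 4096, 32768):
    mib = stripe_cache_bytes(entries, nr_disks=4) / (1024 * 1024)
    print("stripe_cache_size=%d -> %.0f MiB" % (entries, mib))
```

Under those assumptions the three settings come to 4 MiB, 64 MiB and 512 MiB
respectively.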


To increase the read performance you can try optimising the md arrays' 
readahead. As above, search online for blockdev setra. This should 
hopefully make a noticeable difference.


Good luck.

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: basic questions regarding COW in Btrfs

2013-03-03 Thread Josef Bacik
On Sat, Mar 2, 2013 at 4:07 PM, Alex Lyakas
alex.bt...@zadarastorage.com wrote:
> Hi Josef,
> I hope it's ok to piggy back on this thread for the following question:
>
> I see that in btrfs_cross_ref_exist()=>check_committed_ref() path,
> there is the following check:
>
> if (btrfs_extent_generation(leaf, ei) <=
>     btrfs_root_last_snapshot(&root->root_item))
>         goto out;
>
> So this basically means that after we have taken a snap of a subvol,
> then all subvol's extents must be COW'ed, even if we delete the snap a
> minute later.
> I wonder, why is that so?
> Is this because file extents can be shared indirectly, like when we
> create a snap, we only COW the root and only mark all root's
> *immediate* children shared in the extent tree?

Yes that's exactly it.  We have no way of knowing that there are no
snapshots left for this particular root so if there ever was a
snapshot we have to err on the side of caution.
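
As a toy restatement of that rule (not kernel code): assuming the comparison
in the quoted check is "<=", an extent is treated as possibly shared whenever
it was created at or before the generation of the root's last snapshot,
whether or not that snapshot still exists. The function name below is
hypothetical.

```python
def must_cow(extent_generation, root_last_snapshot):
    """True if the extent may be shared with a snapshot and so must be COWed.

    Mirrors the quoted comparison: only extents created after the root's
    last snapshot are known to be unshared.
    """
    return extent_generation <= root_last_snapshot
```

For example, with a last snapshot at generation 165, extents from
generations 164 and 165 are COWed while one from generation 166 is not.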

> Can the new backref walking code be used here to check more
> accurately, if the extent is shared by anybody else?


Probably, if we could figure out if there is a way for more than one
root to point to this extent then yes this would be ideal so we don't
have to force COW in cases we would rather not.  Thanks,

Josef


Re: basic questions regarding COW in Btrfs

2013-03-03 Thread Josef Bacik
On Sun, Mar 3, 2013 at 10:41 AM, Aastha Mehta aasth...@gmail.com wrote:
> Hi Josef,
>
> I have some more questions following up on my previous e-mails.
> I now do somewhat understand the place where extent entries get
> cow'ed. But I am unclear about the order of operations.
>
> Is it correct that the data extent written first, then the pointer in
> the indirect block needs to be updated, so then it is cowed and
> written to disk and so on recursively up the tree? Or is the entire
> path from leaf to node that is going to be affected by the write cowed
> first and then all the cowed extents are written to the disk and then
> the rest of the metadata pointers, (for example, in checksum tree,
> extent tree, etc., I am not sure about this)?

The second one.  We COW the entire path from root to leaf as things
need COW'ing.  We start a transaction, we insert the file extent
entries, we add the checksums, and we add the delayed ref updates to
the extent tree.  The delayed things are guaranteed to happen in that
transaction so we have consistency there.  The COW'ing from top to
bottom works like that for all trees.
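
The root-to-leaf COW described here can be illustrated with a toy
path-copying tree. This is only a sketch of the general technique, not btrfs
code: Node and cow_update are made-up names, and real btrfs blocks, keys and
transactions are far more involved.

```python
class Node:
    """A tree block: a leaf holding items, or an internal node with children."""

    def __init__(self, children=None, items=None):
        self.children = dict(children or {})
        self.items = dict(items or {})


def cow_update(node, path, key, value):
    """Return a new root in which the leaf reached via `path` has key=value.

    Every block on the path is copied; blocks off the path are shared with
    the old tree, which stays intact.
    """
    if not path:
        new_leaf = Node(items=node.items)  # copy the leaf, then modify the copy
        new_leaf.items[key] = value
        return new_leaf
    new_node = Node(children=node.children)  # shallow copy: siblings stay shared
    child = node.children[path[0]]
    new_node.children[path[0]] = cow_update(child, path[1:], key, value)
    return new_node
```

After an update the old root still describes the previous state, which is
what lets a commit switch to the new root in one step.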


> Also, I need to understand specifically how the data (leaf nodes) of a
> file is written to disk v/s the metadata including the indirect nodes
> of the file. In extent_writepage I only know the pages of a file that
> are to be written. I guess, I can identify metadata pages based on the
> inode of the page's owner. But is it possible to distinguish the pages
> available in extent_writepage path as belonging to the leaf node or
> internal node for a file? If it cannot be identified at this point,
> where earlier in the path can this be decided?


So they are different things, and they could change from the time we
write to the time that the write completes because of COW.  Also keep
in mind that the metadata (the file extent items and such) for the
inodes are not stored specifically within the inode, they're stored
inside the same tree that the inode resides in.  So you can have a
leaf node with multiple inodes and extents for those different inodes.
 And so any sort of random things can happen, other inodes can be
deleted and this inode's metadata will be shifted into a new leaf, or
another inode could be added and this inode's data could be pushed off
into an adjacent leaf.  The only way to know which leaf/page the inode
is associated with is to search for whatever you are looking for in
the tree, and then while you are holding all of the locks and
reference counting you can be sure that those pages contain the
metadata you are looking for, but once you let that go there are no
guarantees.

So as far as how it is written to disk, that is where transactions
come in.  We track all the dirty metadata pages we have per
transaction, and then at transaction commit time we make sure that all
of those pages are written to disk and then we commit our super to
point to the new root of the tree root, which in turn points at all of
our new roots because of COW.  These pages can be written before the
commit though because of memory pressure, and if they are written and
then modified again within the same transaction we will re-cow them
to make sure we don't have any partial-page updates.  Keeping track of
where a specific inode's metadata is contained is a tricky business.
Let me know if that helped.  Thanks,

Josef


Re: weird kernel-oopses while deleting files on btrfs

2013-03-03 Thread Chris Mason
On Sun, Mar 03, 2013 at 06:57:41AM -0700, Michael Schmitt wrote:
 Hi list,
 
 some rather unexpected btrfs-oopses for my taste. I use btrfs for some
 time now (mostly on external harddisks) and these oopses happened
 during some simple file and folder deletion operation on that device. It
 is a luks-encrypted 80GB drive. Anything like that known? And the fs was
 created just yesterday, how come there is a message like...
 
 [91491.919358] btrfs: mismatching generation and generation_v2 found in root item. This root was probably mounted with an older kernel. Resetting all new fields.

This may be from first mount after mkfs.  It depends on your tools.

 
 ... but the kernel used (3.7.3 from Debian experimental on Debian sid)
 was installed several days ago. What kind of oopses are these? As of now
 there is no real data on that device. But if there were, would I need to
 be concerned about the integrity of those files?

 [93283.762006] WARNING: at /build/buildd-linux_3.7.3-1~experimental.1-i386-eX5kUQ/linux-3.7.3/fs/btrfs/extent-tree.c:6297 btrfs_alloc_free_block+0xcd/0x2a4 [btrfs]()

These are not oopsen but warnings.  It's an ENOSPC warning as we try to
delete the extents.  It did happen sometimes in this kernel, but it is
only a warning.

-chris


Re: collapse concurrent forced allocations

2013-03-03 Thread Alexandre Oliva
On Feb 23, 2013, Alexandre Oliva ol...@gnu.org wrote:

 On Feb 22, 2013, Josef Bacik jba...@fusionio.com wrote:
 So I understand what you are getting at, but I think you are doing it wrong.
 If we're calling with CHUNK_ALLOC_FORCE, but somebody has already started to
 allocate with CHUNK_ALLOC_NO_FORCE, we'll reset the space_info->force_alloc
 to our original caller's CHUNK_ALLOC_FORCE.

 But that's ok, do_chunk_alloc will set space_info->force_alloc to
 CHUNK_ALLOC_NO_FORCE at the end, when it succeeds allocating, and then
 anyone else waiting on the mutex to try to allocate will load the
 NO_FORCE from space_info.

 So we only really care about making sure a chunk is actually
 allocated, instead of doing this flag shuffling we should just do

 if (space_info->chunk_alloc) { spin_unlock(space_info->lock);
 wait_event(!space_info->chunk_alloc); return 0;

I looked a bit further into it.  I think this would work if we had a
wait_queue for space_info->chunk_alloc.  We don't, so the mutex
interface is probably the best we can do.
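
The flag handoff discussed in the quoted exchange can be modeled as below. This is a single-threaded sketch, not the patch: the toy_* names are hypothetical, the spinlock and mutex are omitted, and a "waiter" is simply a caller that found an allocation in flight and checks again after it ends.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_force {
	TOY_ALLOC_NO_FORCE = 0,
	TOY_ALLOC_FORCE = 1,
};

struct toy_space_info {
	enum toy_force force_alloc;	/* strongest request still pending */
	bool chunk_alloc;		/* an allocation is in flight */
	unsigned int chunks;		/* chunks allocated so far */
};

/* Returns true if the caller becomes the allocator. Otherwise the caller
 * records its request in space_info and has to wait. */
static bool toy_begin(struct toy_space_info *si, enum toy_force force)
{
	if (si->chunk_alloc) {
		if (force > si->force_alloc)
			si->force_alloc = force;
		return false;
	}
	si->chunk_alloc = true;
	return true;
}

/* End of an allocation attempt. Only success clears the pending force,
 * so a failed attempt leaves the request for the next caller. */
static void toy_finish(struct toy_space_info *si, bool ok)
{
	if (ok) {
		si->chunks++;
		si->force_alloc = TOY_ALLOC_NO_FORCE;
	}
	si->chunk_alloc = false;
}

/* A waiter reloads the flag after waiting: it allocates only if a forced
 * request is still outstanding. */
static bool toy_waiter_should_alloc(const struct toy_space_info *si)
{
	return si->force_alloc == TOY_ALLOC_FORCE;
}
```

Two forced requests that overlap end up allocating one chunk in this model, because the waiter sees NO_FORCE once the first allocation succeeds. If the first attempt fails, the flag is left set and the waiter retries.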

OTOH, I found out we seem to get into an allocation spree when a large
file is being quickly created, such as when creating a ceph journal or
making a copy of a multi-GB file.  I suppose btrfs is just trying to
allocate contiguous space for the file, but unfortunately there doesn't
seem to be a fallback for allocation failure: as soon as data allocation
fails and space_info is set as full, the large write fails and the
filesystem becomes full, without even trying to use non-contiguous
storage.  Isn't that a bug?


I've also been trying to track down why, on a single-data filesystem,
(compressed?) data reads that fail because of bad blocks also spike the
CPU load and lock the file that failed to map in and the entire
filesystem, so that the only way to recover is to force a reboot.
Does this sound familiar to anyone?

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer


[PATCH] btrfs-progs: update mkfs.btrfs help info for raid5/6

2013-03-03 Thread zwu . kernel
From: Zhi Yong Wu wu...@linux.vnet.ibm.com

  Since raid5/6 support was introduced, we should update mkfs.btrfs help info.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 mkfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mkfs.c b/mkfs.c
index 5ece186..f9f26a5 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -326,7 +326,7 @@ static void print_usage(void)
 	fprintf(stderr, "options:\n");
 	fprintf(stderr, "\t -A --alloc-start the offset to start the FS\n");
 	fprintf(stderr, "\t -b --byte-count total number of bytes in the FS\n");
-	fprintf(stderr, "\t -d --data data profile, raid0, raid1, raid10, dup or single\n");
+	fprintf(stderr, "\t -d --data data profile, raid0, raid1, raid5, raid6, raid10, dup or single\n");
 	fprintf(stderr, "\t -l --leafsize size of btree leaves\n");
 	fprintf(stderr, "\t -L --label set a label\n");
 	fprintf(stderr, "\t -m --metadata metadata profile, values like data profile\n");
-- 
1.7.11.7



Re: [PATCH] btrfs-progs: traverse to backup super-block only when indicated

2013-03-03 Thread Anand Jain




flags = BTRFS_SCAN_REGISTER | BTRFS_SCAN_PRIMARY_SB;
btrfs_scan_one_dir("/dev/", flags)


 I just got too used to the current coding style
 in btrfs-progs :-)

 But let me at least get this part of the code
 right.

 Thanks, Eric, for pointing this out.

-Anand
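
The flag-based call quoted above can be illustrated with a small sketch of how a scanner might act on such flags. Everything here is an assumption made for illustration: the TOY_* names, their bit values, the number of superblock copies, and both helper functions are hypothetical and are not taken from btrfs-progs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; only the idea of OR-ing flags together comes
 * from the thread. */
enum toy_scan_flags {
	TOY_SCAN_REGISTER   = 1 << 0,	/* register found devices */
	TOY_SCAN_PRIMARY_SB = 1 << 1,	/* look at the primary super only */
};

/* Assumed number of superblock copies per device in this toy model. */
#define TOY_SUPER_COPIES 3

/* How many superblock copies a scan would be allowed to try. Backup
 * copies are considered only when the caller did not restrict the scan
 * to the primary superblock. */
static int toy_supers_to_try(unsigned int flags)
{
	return (flags & TOY_SCAN_PRIMARY_SB) ? 1 : TOY_SUPER_COPIES;
}

/* Whether a device found by the scan should also be registered. */
static bool toy_wants_register(unsigned int flags)
{
	return (flags & TOY_SCAN_REGISTER) != 0;
}
```

Passing one flags word keeps the function signature stable when another scan option is added later, which is presumably the appeal of the suggested interface.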