Systemd 219 journald now sets the FS_NOCOW file flag for its journal files, possibly breaking RAID repairs.

2015-02-19 Thread Konstantinos Skarlatos
Systemd 219 now sets the special FS_NOCOW file flag for its journal 
files[1]. This unfortunately breaks the ability to repair the journal on 
RAID 1/5/6 btrfs volumes, should a bad sector happen to appear there. Is 
this something that can be configured for systemd? Is btrfs going to 
someday fix the fragmentation problem, making this option redundant?



[1] 
http://lists.freedesktop.org/archives/systemd-devel/2015-February/028447.html


* journald now sets the special FS_NOCOW file flag for its
  journal files. This should improve performance on btrfs, by
  avoiding heavy fragmentation when journald's write-pattern
  is used on COW file systems. It degrades btrfs' data
  integrity guarantees for the files to the same levels as for
  ext3/ext4 however. This should be OK though as journald does
  its own data integrity checks and all its objects are
  checksummed on disk. Also, journald should handle btrfs disk
  full events a lot more gracefully now, by processing SIGBUS
  errors, and not relying on fallocate() anymore.
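
For reference, the flag journald sets here is the same one exposed to
userspace as the 'C' attribute; a minimal sketch for inspecting or
reproducing it by hand (the journal path is the usual default and may
differ per distribution):

lsattr /var/log/journal/*/*.journal        # a 'C' in the flags means NOCOW is set
chattr +C some-empty-file                  # sets the flag manually; only effective on empty files
btrfs filesystem defragment -r /var/log/journal   # defragment already-rotated journals by hand

Note that NOCOW files also lose btrfs data checksums, which is exactly
why scrub and RAID repair cannot fix them, as raised above.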


Re: price to pay for nocow file bit?

2015-01-08 Thread Konstantinos Skarlatos

On 8/1/2015 3:30 μμ, Lennart Poettering wrote:

On Wed, 07.01.15 15:10, Josef Bacik (jba...@fb.com) wrote:


On 01/07/2015 12:43 PM, Lennart Poettering wrote:

Heya!

Currently, systemd-journald's disk access patterns (appending to the
end of files, then updating a few pointers in the front) result in
awfully fragmented journal files on btrfs, which has a pretty
negative effect on performance when accessing them.

I've been wondering if mount -o autodefrag would deal with this problem but
I haven't had the chance to look into it.

Hmm, I am kinda interested in a solution that I can just implement in
systemd/journald now and that will then just make things work for
people suffering from the problem. I mean, I can hardly make systemd
patch the mount options of btrfs just because I place a journal file
on some fs...

Is autodefrag supposed to become a default one day?

Anyway, given the pros and cons I have now changed journald to set the
nocow bit on newly created journal files. When files are rotated (and
we hence know we will never ever write to them again) we try to unset
the bit again, and a defrag ioctl is invoked right after. btrfs
currently silently ignores that we unset the bit and leaves it set, but
I figure I should try to unset it anyway, in case it learns to honor
that one day. After all, after rotating the files there's no
reason to treat the files specially anymore...
Can this behaviour be made optional? I don't mind some fragmentation if I can 
keep the checksums and the ability of RAID 1 to repair those files.



I'll keep an eye on this, and see if I still get user complaints about
it. Should autodefrag become default eventually we can get rid of this
code in journald again.

One question regarding the btrfs defrag ioctl: playing around with it,
it appears to be asynchronous; the defrag request is simply queued and
the ioctl returns immediately, which is great for my use case. However,
I was wondering if it was always async like this? I googled a bit and
found reports that defrag might take a while, but I am not sure if
those reports were about the ioctl taking that long, or about the effect
of the defrag actually hitting the disk...

Lennart





Re: btrfs scrub status misreports as interrupted

2014-12-10 Thread Konstantinos Skarlatos

On 10/12/2014 9:28 μμ, Marc Joliet wrote:

On Wed, 10 Dec 2014 10:51:15 +0800,
Anand Jain anand.j...@oracle.com wrote:


   Is there any relevant log in dmesg?

Not in my case; at least, nothing that made it into the syslog.


Same with me, no messages at all


Re: btrfs scrub status misreports as interrupted

2014-12-09 Thread Konstantinos Skarlatos
I've got the exact same problem, with a 4-drive RAID1, kernel 3.18-git 
and btrfs tools from git, all built yesterday.


On 22/11/2014 2:13 μμ, Marc Joliet wrote:

Hi all,

While I haven't gotten any "scrub already running" type errors any more, I do
get one strange case of state misreport.  When running scrub on /home (btrfs
RAID10), after 3 of 4 drives have completed, the 4th drive (sdb) will report as
interrupted, despite still running:

# btrfs scrub status -d /home
scrub status for 472c9290-3ff2-4096-9c47-0612d3a52cef
scrub device /dev/sda (id 1) history
scrub started at Sat Nov 22 11:57:34 2014 and finished after 3380 
seconds
total bytes scrubbed: 252.86GiB with 0 errors
scrub device /dev/sdb (id 2) status
scrub started at Sat Nov 22 11:57:34 2014, interrupted after 3698 
seconds, not running
total bytes scrubbed: 217.50GiB with 0 errors
scrub device /dev/sdc (id 3) history
scrub started at Sat Nov 22 11:57:34 2014 and finished after 3013 
seconds
total bytes scrubbed: 252.85GiB with 0 errors
scrub device /dev/sdd (id 4) history
scrub started at Sat Nov 22 11:57:34 2014 and finished after 2994 
seconds
total bytes scrubbed: 252.85GiB with 0 errors

The funny thing is, the time will still update as the scrub keeps going:

# btrfs scrub status -d /home
scrub status for 472c9290-3ff2-4096-9c47-0612d3a52cef
scrub device /dev/sda (id 1) history
scrub started at Sat Nov 22 11:57:34 2014 and finished after 3380 
seconds
total bytes scrubbed: 252.86GiB with 0 errors
scrub device /dev/sdb (id 2) status
scrub started at Sat Nov 22 11:57:34 2014, interrupted after 4136 
seconds, not running
 

total bytes scrubbed: 239.44GiB with 0 errors
scrub device /dev/sdc (id 3) history
scrub started at Sat Nov 22 11:57:34 2014 and finished after 3013 
seconds
total bytes scrubbed: 252.85GiB with 0 errors
scrub device /dev/sdd (id 4) history
scrub started at Sat Nov 22 11:57:34 2014 and finished after 2994 
seconds
total bytes scrubbed: 252.85GiB with 0 errors

This has happened a few times, and when sdb finally finishes, the status is then
reported correctly as finished:

# btrfs scrub status -d /home
scrub status for 472c9290-3ff2-4096-9c47-0612d3a52cef
scrub device /dev/sda (id 1) history
scrub started at Sat Nov 22 11:57:34 2014 and finished after 3380 
seconds
total bytes scrubbed: 252.86GiB with 0 errors
scrub device /dev/sdb (id 2) history
scrub started at Sat Nov 22 11:57:34 2014 and finished after 4426 
seconds
total bytes scrubbed: 252.88GiB with 0 errors
scrub device /dev/sdc (id 3) history
scrub started at Sat Nov 22 11:57:34 2014 and finished after 3013 
seconds
total bytes scrubbed: 252.85GiB with 0 errors
scrub device /dev/sdd (id 4) history
scrub started at Sat Nov 22 11:57:34 2014 and finished after 2994 
seconds
total bytes scrubbed: 252.85GiB with 0 errors

Kernel and btrfs-progs version:

# uname -a
Linux marcec 3.16.7-gentoo #1 SMP PREEMPT Fri Oct 31 22:45:54 CET 2014 x86_64 
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ AuthenticAMD GNU/Linux

# btrfs --version
Btrfs v3.17.1

Should I open a report on bugzilla?





Re: Poll: time to switch skinny-metadata on by default?

2014-10-21 Thread Konstantinos Skarlatos

On 21/10/2014 2:02 μμ, Austin S Hemmelgarn wrote:

On 2014-10-21 05:29, Duncan wrote:

David Sterba posted on Mon, 20 Oct 2014 18:34:03 +0200 as excerpted:


On Thu, Oct 16, 2014 at 01:33:37PM +0200, David Sterba wrote:

I'd like to make it default with the 3.17 release of btrfs-progs.
Please let me know if you have objections.


For the record, 3.17 will not change the defaults. The timing of the
poll was very bad to get enough feedback before the release. Let's keep
it open for now.


FWIW my own results agree with yours, I've had no problem with skinny-
metadata here, and it has been my default now for a couple
backup-and-new-mkfs.btrfs generations.

As you know there were some problems with it in the first kernel cycle
or two after it was introduced as an option, and I waited awhile until
they died down before trying it here, but as I said, no problems since
I switched it on, and I've been running it awhile now.

So defaulting to skinny-metadata looks good from here. =:^)

Same here, I've been using it on all my systems since I switched from 
3.15 to 3.16, and have had no issues whatsoever.


I have been using skinny-metadata for years, and have only once had an issue 
with it. It was with scrub and was fixed by Liu Bo[1], so I think 
skinny-metadata is mature enough to be a default.


[1] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg34493.html

--
Konstantinos Skarlatos



Re: Undelete files / directory

2014-09-01 Thread Konstantinos Skarlatos

On 1/9/2014 7:27 μμ, Marc MERLIN wrote:

On Sat, Aug 30, 2014 at 11:26:52AM -1000, Jean-Denis Girard wrote:

So I commented out the break on line 238 of btrfs-find-root so that it

Thanks for that report.
Can a developer review this and see if it should be made an option or
removed entirely?
I think that is the best way to proceed, or, maybe even better, add a 
brute-force option to btrfs restore that does something like my for 
loop, recovering what it can from the filesystem.


Until then, can we make this into a concise set of instructions so we 
can post it on the wiki?
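
Until such an option exists, here is a minimal sketch of the brute-force
loop being described, consolidated from the commands quoted later in this
thread (the device, output file and target paths are placeholders):

btrfs-find-root /dev/sdX -o 5 > roots.txt
for i in `awk '{print $3}' roots.txt`; do
    echo "=== trying root at $i ==="
    # dry run: count how many files each candidate root would recover
    btrfs restore -Dv -f $i /dev/sdX /tmp/ignored | wc -l
done
# then do the real restore with the offset that recovered the most files
btrfs restore -v -f <best-offset> /dev/sdX /mnt/restore/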




Marc


continues even if it thinks it went past the fs size, rerun the command,
and I finally got a list of blocks to try!

Then as you suggested I did:
for i in `awk '{print $3}' root.txt`
  do echo   $i 
  btrfs restore -v -f $i --path-regex '^/(|jdg(|/tmp(|/.*)))$' \
../x220_home.img .
done

And I now have back my ~2800 photos (~13 Gb).

Many thanks to those who helped!


I am glad I could help!





Best regards,
Jean-Denis Girard


On 30/08/2014 10:12, Jean-Denis Girard wrote:

On 28/08/2014 21:40, Konstantinos Skarlatos wrote:

On 28/8/2014 8:04 μμ, Jean-Denis Girard wrote:

Hi Chris,

Thanks for your detailed answer.

On 28/08/2014 06:25, Chris Murphy wrote:

9. btrfs-find-root /dev/sdc
Super think's the tree root is at 29917184, chunk root 20987904
Well block 4194304 seems great, but generation doesn't match, have=2,
want=9 level 0
Well block 4243456 seems great, but generation doesn't match, have=3,
want=9 level 0
Well block 29376512 seems great, but generation doesn't match,
have=4, want=9 level 0
Well block 29474816 seems great, but generation doesn't match,
have=5, want=9 level 0
Well block 29556736 seems great, but generation doesn't match,
have=6, want=9 level 0
Well block 29736960 seems great, but generation doesn't match,
have=7, want=9 level 0
Well block 29900800 seems great, but generation doesn't match,
have=8, want=9 level 0

Hi all,

I did a successful btrfs restore a few months ago, saving all of my
deleted files except 2 (so I lost about 1GB on a 4TB filesystem).
Here is what I did (this is from memory and from my .zsh_history file,
so I may be missing something):

btrfs-find-root /dev/sdd -o 5 > b1.txt
I think the -o 5 option is quite important here.

Thanks for the reply, but for some reason btrfs-find-root does not work
on this file system. Here is what I get:

[jdg@tiare tmp]$ btrfs-find-root x220_home.img -o 5
Super think's the tree root is at 115230801920, chunk root 131072
Went past the fs size, exiting[jdg@tiare tmp]$

I can mount the file system, access the files, though obviously not the
deleted directory.



Regards,
Jean-Denis Girard








--
Konstantinos Skarlatos



Re: Undelete files / directory

2014-08-29 Thread Konstantinos Skarlatos

On 28/8/2014 8:04 μμ, Jean-Denis Girard wrote:

Hi Chris,

Thanks for your detailed answer.

On 28/08/2014 06:25, Chris Murphy wrote:

9. btrfs-find-root /dev/sdc
Super think's the tree root is at 29917184, chunk root 20987904
Well block 4194304 seems great, but generation doesn't match, have=2, want=9 
level 0
Well block 4243456 seems great, but generation doesn't match, have=3, want=9 
level 0
Well block 29376512 seems great, but generation doesn't match, have=4, want=9 
level 0
Well block 29474816 seems great, but generation doesn't match, have=5, want=9 
level 0
Well block 29556736 seems great, but generation doesn't match, have=6, want=9 
level 0
Well block 29736960 seems great, but generation doesn't match, have=7, want=9 
level 0
Well block 29900800 seems great, but generation doesn't match, have=8, want=9 
level 0

Hi all,

I did a successful btrfs restore a few months ago, saving all of my 
deleted files except 2 (so I lost about 1GB on a 4TB filesystem).
Here is what I did (this is from memory and from my .zsh_history file, 
so I may be missing something):


btrfs-find-root /dev/sdd -o 5 > b1.txt
I think the -o 5 option is quite important here.
After that, I ran this:

for i in `awk '{print $3}' b1.txt`; do echo $i; btrfs restore /dev/sdd /storage/A3/ -Dv -f $i; done


I think I did that in order to brute-force a correct offset.

I have also done this, in order to find the offset that gave the largest 
number of files:

for i in `awk '{print $3}' b1.txt`; do echo $i; btrfs restore /dev/sdd /storage/A3/ -Dv -f $i | wc -l; done



Then I did some test restores using various addresses:
btrfs restore /dev/sdd /storage/A3/B1/  -vD -f 2149617336320
btrfs restore /dev/sdd /storage/A3/B1/  -vD -f 1607682736128
btrfs restore /dev/sdd /storage/A3/B1/  -vD -f 2688721551360

and then I finally did the restore using the offset that looked best:

btrfs restore /dev/sdd /storage/A3/B1/  -v -f 2688721551360

I hope this helps, good luck!



Here is what the command returns :

[root@x220 ~]# btrfs-find-root /dev/mapper/home
Super think's the tree root is at 115230801920, chunk root 131072
Went past the fs size, exiting[root@x220 ~]#

I just tried with latest btrfs-progs (from git), it returns exactly the
same.

The btrfs partition is on top of dm-crypt, could it be a problem?


Thanks,
Jean-Denis Girard




--
Konstantinos Skarlatos



Re: Significance of high number of mails on this list?

2014-08-22 Thread Konstantinos Skarlatos

On 22/8/2014 6:40 πμ, Shriramana Sharma wrote:

Hello people. Thank you for your detailed replies, esp Duncan.

In essence, I plan on using BTRFS for my production data -- mainly
programs/documents I write in connection with my academic research.
I'm not a professional sysadmin and I'm not running a business server.
I'm just managing my own data, and as I have mentioned, my chief
reason for looking at BTRFS is the ease of snapshots and backups using
send/receive.

It is clear now that snapshots are by and large stable but
send/receive is not. But, IIUC, even if send/receive fails I still
have the older data which is not overwritten due to COW and atomic
operations, and I can always retry send/receive again. Is this
correct?

If yes, then I guess I can take the plunge but ensure I have daily
backups (which BTRFS itself should help me do easily).


I would stay with rsync for a while, because there is always the 
possibility of a bug that corrupts both your primary filesystem and your 
backup one, or of send propagating corruption from one filesystem to 
another. (Or maybe I am too paranoid; it would be good if we could have 
the opinion of a btrfs developer on this.)


I would also suggest lsyncd if rsync runs become slow due to too many 
files and directories, or if you have something like my use case, where I 
have filesystems with millions of files and my backup servers are a few 
km away, reachable over relatively slow wireless links.


Finally, be sure to use the --inplace option of rsync.
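
To make that concrete, a minimal sketch of the kind of invocation meant
here (paths and host are placeholders):

rsync -aHAX --inplace --delete /data/ backupserver:/backup/data/

The reasoning, as I understand it: rsync's default write-to-temp-file-
and-rename behaviour rewrites whole files, which breaks block sharing
with existing snapshots on the backup filesystem, while --inplace only
rewrites the changed blocks.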

--
Konstantinos Skarlatos



Re: Significance of high number of mails on this list?

2014-08-22 Thread Konstantinos Skarlatos

On 22/8/2014 12:58 μμ, Filipe David Manana wrote:

On Fri, Aug 22, 2014 at 8:35 AM, Duncan 1i5t5.dun...@cox.net wrote:

Konstantinos Skarlatos posted on Fri, 22 Aug 2014 09:56:55 +0300 as
excerpted:


I would stay with rsync for a while, because there is always the
possibility of a bug that corrupts both your primary filesystem and your
backup one, or send propagating corruption from one filesystem to
another (Or maybe I am too paranoid, it would be good if we could have
the opinion of a btrfs developer on this)

No claim to be a dev, btrfs or otherwise, here, but I believe in this
case you /are/ being too paranoid.

Both btrfs send and receive only deal with data/metadata they know how to
deal with.  If it's corrupt in some way or if they don't understand it,
they don't send/write it, they fail.

Most of the time yes, however we have at least one known bug that affects
3.14.x only where send silently corrupts file data (replaces valid
data with zeroes) at the destination:

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=766b5e5ae78dd04a93a275690a49e23d7dcb1f39

The fix landed in 3.15, but wasn't backported to 3.14.x yet (adding
Chris to cc).


I didn't know about this one, but bugs like this are exactly the reason 
somebody should be paranoid and not rush to use new features, 
especially when they concern their only backup, held on an experimental 
filesystem.






IOW, if it works without error it's as guaranteed to be golden as these
things get.  The problem is that it doesn't always work without error in
the first place, sometimes it /does/ fail.  In that instance you can
always try again as the existing data/metadata shouldn't be damaged, but
if it keeps failing you may have to try something else, rsync, etc.

--
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman







--
Konstantinos Skarlatos



Re: Ideas for a feature implementation

2014-08-13 Thread Konstantinos Skarlatos

On 13/8/2014 2:01 μμ, David Pottage wrote:


On 12/08/14 12:00, Konstantinos Skarlatos wrote:
Maybe help with Andrea Mazzoleni's New RAID library supporting up to 
six parities? It seems to be a great feature for btrfs.


https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg31735.html


That would be very cool, but at present vanilla RAID 5 or 6 does not 
work properly, so I think getting that fully working would be a better 
idea. (Unless it would make more sense to merge the whole lot into one 
bit of work where RAID 5 or 6 are just a special case of arbitrary 
parity level support).


At present, you can write RAID 5 or 6 data, but if anything goes 
wrong, btrfs cannot use the parity information to help you get your 
data back, so in general you are better off with RAID 1 or 10. Also, I 
don't think I/O is done in parallel, so you get no speed advantage from 
having multiple discs either.


Yeah, that's one of the features I am waiting to see finished, because I 
already have 5 multi-disk systems that I would prefer to migrate to 
RAID5/6 from the RAID1/JBOD they are on now.


I don't know what the best sequencing is; I just think that these are 
great patches/features and it's a pity for them to languish.



--
Konstantinos Skarlatos



Re: Ideas for a feature implementation

2014-08-12 Thread Konstantinos Skarlatos

On 10/8/2014 10:21 μμ, Vimal A R wrote:

Hello,

I came across the to-do list at 
https://btrfs.wiki.kernel.org/index.php/Project_ideas and would like to know if 
this list is updated and recent.

I am looking for a project idea for my undergraduate degree which can be 
completed in around 3-4 months. Are there any suggestions and ideas to help me 
further?
Maybe help with Andrea Mazzoleni's New RAID library supporting up to six 
parities? It seems to be a great feature for btrfs.


https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg31735.html




Thank you,
Vimal



--
Konstantinos Skarlatos



Re: mount time of multi-disk arrays

2014-07-07 Thread Konstantinos Skarlatos

On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:

Hello List,

can anyone tell me how much time is acceptable and assumable for a
multi-disk btrfs array with classical hard disk drives to mount?

I'm having a bit of trouble with my current systemd setup, because it
couldn't mount my btrfs raid anymore after adding the 5th drive. With
the 4 drive setup it failed to mount once every few attempts. Now it fails
every time because the default timeout of 1m 30s is reached and mount is
aborted.
My last 10 manual mounts took between 1m57s and 2m12s to finish.
I have the exact same problem, and have to manually mount my large 
multi-disk btrfs filesystems, so I would be interested in a solution as 
well.




My hardware setup contains a
- Intel Core i7 4770
- Kernel 3.15.2-1-ARCH
- 32GB RAM
- dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm)
- dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm)

Thanks in advance

André-Sebastian Liebe
--

# btrfs fi sh
Label: 'apc01_pool0'  uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb
 Total devices 5 FS bytes used 14.21TiB
 devid1 size 3.64TiB used 2.86TiB path /dev/sdd
 devid2 size 3.64TiB used 2.86TiB path /dev/sdc
 devid3 size 3.64TiB used 2.86TiB path /dev/sdf
 devid4 size 3.64TiB used 2.86TiB path /dev/sde
 devid5 size 3.64TiB used 2.88TiB path /dev/sdb

Btrfs v3.14.2-dirty

# btrfs fi df /data/pool0/
Data, single: total=14.28TiB, used=14.19TiB
System, RAID1: total=8.00MiB, used=1.54MiB
Metadata, RAID1: total=26.00GiB, used=20.20GiB
unknown, single: total=512.00MiB, used=0.00





--
Konstantinos Skarlatos



Re: mount time of multi-disk arrays

2014-07-07 Thread Konstantinos Skarlatos

On 7/7/2014 6:48 μμ, Duncan wrote:

Konstantinos Skarlatos posted on Mon, 07 Jul 2014 16:54:05 +0300 as
excerpted:


On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:

can anyone tell me how much time is acceptable and assumable for a
multi-disk btrfs array with classical hard disk drives to mount?

I'm having a bit of trouble with my current systemd setup, because it
couldn't mount my btrfs raid anymore after adding the 5th drive. With
the 4 drive setup it failed to mount once every few attempts. Now it fails
every time because the default timeout of 1m 30s is reached and mount is
aborted.
My last 10 manual mounts took between 1m57s and 2m12s to finish.

I have the exact same problem, and have to manually mount my large
multi-disk btrfs filesystems, so I would be interested in a solution as
well.

I don't have a direct answer, as my btrfs devices are all SSD, but...

a) Btrfs, like some other filesystems, is designed not to need a
pre-mount (or pre-rw-mount) fsck, because it does what /should/ be a
quick-scan at mount-time.  However, that isn't always as quick as it
might be for a number of reasons:

a1) Btrfs is still a relatively immature filesystem and certain
operations are not yet optimized.  In particular, multi-device btrfs
operations tend to still be using a first-working-implementation type of
algorithm instead of a well optimized for parallel operation algorithm,
and thus often serialize access to multiple devices where a more
optimized algorithm would parallelize operations across multiple devices
at the same time.  That will come, but it's not there yet.

a2) Certain operations such as orphan cleanup (orphans are files that
were deleted while they were in use and thus weren't fully deleted at the
time; if they were still in use at unmount (remount-read-only), cleanup
is done at mount-time) can delay mount as well.

a3) Inode_cache mount option:  Don't use this unless you can explain
exactly WHY you are using it, preferably backed up with benchmark
numbers, etc.  It's useful only on 32-bit, generally high-file-activity
server systems and has general-case problems, including long mount times
and possible overflow issues that make it inappropriate for normal use.
Unfortunately there's a lot of people out there using it that shouldn't
be, and I even saw it listed on at least one distro (not mine!) wiki. =:^(

a4) The space_cache mount option OTOH *IS* appropriate for normal use
(and is in fact enabled by default these days), but particularly in
improper shutdown cases can require rebuilding at mount time -- altho
this should happen /after/ mount, the system will just be busy for some
minutes, until the space-cache is rebuilt.  But the IO from a space_cache
rebuild on one filesystem could slow down the mounting of filesystems
that mount after it, as well as the boot-time launching of other post-
mount launched services.

If you're seeing the time go up dramatically with the addition of more
filesystem devices, however, and you do /not/ have inode_cache active,
I'd guess it's mainly the not-yet-optimized multi-device operations.


b) As with any systemd launched unit, however, there's systemd
configuration mechanisms for working around specific unit issues,
including timeout issues.  Of course most systems continue to use fstab
and let systemd auto-generate the mount units, and in fact that is
recommended, but either with fstab or directly created mount units,
there's a timeout configuration option that can be set.

b1) The general systemd *.mount unit [Mount] section option appears to be
TimeoutSec=.  As is usual with systemd times, the default is seconds, or
pass the unit(s, like 5min 20s).

b2) I don't see it /specifically/ stated, but with a bit of reading
between the lines, the corresponding fstab option appears to be either
x-systemd.timeoutsec= or x-systemd.TimeoutSec= (IOW I'm not sure of the
case).  You may also want to try x-systemd.device-timeout=, which /is/
specifically mentioned, altho that appears to be specifically the timeout
for the device to appear, NOT for the filesystem to mount after it does.

b3) See the systemd.mount (5) and systemd-fstab-generator (8) manpages
for more, that being what the above is based on.
Thanks for your detailed answer. A mount unit with a larger timeout 
works fine; maybe we should tell distro maintainers to raise the default 
limit for btrfs to 5 minutes or so?
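
For reference, a rough fstab sketch along the lines Duncan describes
above (the UUID and options are taken from André's setup; the exact
x-systemd option name depends on the systemd version, so treat this as
a starting point rather than a recipe):

UUID=066141c6-16ca-4a30-b55c-e606b90ad0fb  /data/pool0  btrfs  rw,relatime,skip_balance,compress,x-systemd.device-timeout=5min  0  0

Newer systemd versions also understand x-systemd.mount-timeout=, which
covers the mount step itself rather than just waiting for the device to
appear.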


In my experience, mount time definitely grows as the filesystem grows 
older, and mounting times out once the snapshot count gets above 500-1000. I 
guess that's something that can be optimized in the future, but I believe 
stability is a much more urgent need now...




So it might take a bit of experimentation to find the exact command, but
based on the above anyway, it /should/ be pretty easy to tell systemd to
wait a bit longer for that filesystem.

When you find the right invocation, please reply with it here, as I'm
sure there's others who will benefit as well.  FWIW, I'm still on
reiserfs for my spinning

Re: mount time of multi-disk arrays

2014-07-07 Thread Konstantinos Skarlatos

On 7/7/2014 5:24 μμ, André-Sebastian Liebe wrote:

On 07/07/2014 03:54 PM, Konstantinos Skarlatos wrote:

On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:

Hello List,

can anyone tell me how much time is acceptable and assumable for a
multi-disk btrfs array with classical hard disk drives to mount?

I'm having a bit of trouble with my current systemd setup, because it
couldn't mount my btrfs raid anymore after adding the 5th drive. With
the 4 drive setup it failed to mount once every few attempts. Now it fails
every time because the default timeout of 1m 30s is reached and mount is
aborted.
My last 10 manual mounts took between 1m57s and 2m12s to finish.

I have the exact same problem, and have to manually mount my large
multi-disk btrfs filesystems, so I would be interested in a solution
as well.

Hi Konstantinos, you can work around this by manually creating a systemd
mount unit.

- First review the autogenerated systemd mount unit (systemctl show
your-mount-unit.mount). You can get the unit name by running 'systemctl'
and looking for your failed mount.
- Then take the needed values (After, Before, Conflicts,
RequiresMountsFor, Where, What, Options, Type, WantedBy) and put them
into a new systemd mount unit file (for example under
/usr/lib/systemd/system/your-mount-unit.mount).
- Now add TimeoutSec with a large enough value below [Mount].
- If you later want to automount your raid, add the WantedBy under [Install].
- Now issue a 'systemctl daemon-reload' and look for error messages in
syslog.
- If there are no errors you can enable your manual mount entry with
'systemctl enable your-mount-unit.mount' and safely comment out the
old fstab entry (so systemd no longer generates the automatic unit).

-- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< ---
[Unit]
Description=Mount /data/pool0
After=dev-disk-by\x2duuid-066141c6\x2d16ca\x2d4a30\x2db55c\x2de606b90ad0fb.device
systemd-journald.socket local-fs-pre.target system.slice -.mount
Before=umount.target
Conflicts=umount.target
RequiresMountsFor=/data
/dev/disk/by-uuid/066141c6-16ca-4a30-b55c-e606b90ad0fb

[Mount]
Where=/data/pool0
What=/dev/disk/by-uuid/066141c6-16ca-4a30-b55c-e606b90ad0fb
Options=rw,relatime,skip_balance,compress
Type=btrfs
TimeoutSec=3min

[Install]
WantedBy=dev-disk-by\x2duuid-066141c6\x2d16ca\x2d4a30\x2db55c\x2de606b90ad0fb.device
-- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< ---


Hi André,
This unit file works for me, thank you for creating it! Can somebody put 
it on the wiki?









My hardware setup contains a
- Intel Core i7 4770
- Kernel 3.15.2-1-ARCH
- 32GB RAM
- dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm)
- dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm)

Thanks in advance

André-Sebastian Liebe
--


# btrfs fi sh
Label: 'apc01_pool0'  uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb
  Total devices 5 FS bytes used 14.21TiB
  devid1 size 3.64TiB used 2.86TiB path /dev/sdd
  devid2 size 3.64TiB used 2.86TiB path /dev/sdc
  devid3 size 3.64TiB used 2.86TiB path /dev/sdf
  devid4 size 3.64TiB used 2.86TiB path /dev/sde
  devid5 size 3.64TiB used 2.88TiB path /dev/sdb

Btrfs v3.14.2-dirty

# btrfs fi df /data/pool0/
Data, single: total=14.28TiB, used=14.19TiB
System, RAID1: total=8.00MiB, used=1.54MiB
Metadata, RAID1: total=26.00GiB, used=20.20GiB
unknown, single: total=512.00MiB, used=0.00



--
Konstantinos Skarlatos

--
André-Sebastian Liebe




--
Konstantinos Skarlatos



Re: btrfs data dup on single device?

2014-06-25 Thread Konstantinos Skarlatos

On 25/6/2014 5:41 μμ, Christoph Anton Mitterer wrote:

On Wed, 2014-06-25 at 08:47 +0100, Hugo Mills wrote:

This has variously been possible and not over the last few years. I
think it's finally come down on the side of not,

I think that would really be a loss... :(



The question is, why?

Well imagine you have some computer which can only have one disk drive
(laptop, etc.) and you still want at least some kind of redundancy
against bit rot errors.


IMO, btrfs should support most flavours out there...
- n-way duplicates on the same device (and not just DUP with n=2)
For the same device there is also erasure coding, where you lose, let's 
say, 10% capacity and gain the ability to recover from the most probable 
disk errors that don't take the whole disk with them: bad sectors. (See 
the mkfs sketch after this reply.)



- n-way mirrors on multiple devices (i.e. what we have right now with
RAID1, plus up to classic RAID1 with copies on each device)
- RAID5/6
- n-way striped+parity with n > 2
- stacked layouts (RAID 10 as e.g. MD has it,... RAID50, 60)


And terminology should really be re-worked... IMHO it's very bad to use
the term RAID1, if it's not what classic RAID1 does.


Cheers,
Chris.
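
Coming back to the single-device duplication question above, a rough
sketch of what requesting duplicated profiles at mkfs time looks like
(the device is a placeholder, and whether these exact invocations are
accepted on a single non-mixed device depends on the mkfs.btrfs version,
as the thread notes):

mkfs.btrfs -m dup -d dup /dev/sdX
# on older progs the same effect needed mixed block groups:
mkfs.btrfs --mixed -m dup -d dup /dev/sdX

Either way this only duplicates data on the one device; it does not
protect against whole-device failure.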




--
Konstantinos Skarlatos



Re: frustrations with handling of crash reports

2014-06-19 Thread Konstantinos Skarlatos

On 19/6/2014 12:22 πμ, Duncan wrote:

Konstantinos Skarlatos posted on Wed, 18 Jun 2014 16:23:04 +0300 as
excerpted:


I guess that btrfs developers have put these BUG_ONs so that they get
reports from users when btrfs gets in these unexpected situations. But
if most of these reports are ignored or not resolved, then maybe there
is no use for these BUG_ONs and they should be replaced with something
more mild.

Keep in mind that if a system panics, then the only way to get logs from
it is with serial or netconsole, so BUG_ON really makes it much harder
for users to know what happened and send reports, and only the most
technical and determined users will manage to send reports here.

In terms of the BUGONs, they've been converting them to WARNONs recently,
exactly due to the point you and Marc have made.  Not being a dev and
simply based on the patch-flow I've seen as btrfs has been basically
behaving itself so far here[1], I had /thought/ that was more or less
done (perhaps some really bad bug-ons left but only a few, and basically
only where the kernel couldn't be sure it was in a logical enough state
to continue writing to other filesystems too, so bugon being logical in
that case), but based on you guys' comments there's apparently more to go.

So at least for BUGONs they agree.  I guess it's simply a matter of
getting them all converted.
That's good to hear. But we should have a way to recover from these kinds 
of problems: first of all having btrfs report the exact location, disk 
and file name that is affected, then making scrub fix it or at least 
report it, and finally making fsck able to repair it.


My filesystem, which consistently kernel panics when a specific logical 
address is read, passes scrub without anything bad reported. What's the 
use of scrub if it can't deal with this?




Tho at least in Marc's case, he's running kernels a couple back in some
cases and they may still have BUGONs already replaced in the most current
kernel.

As for experimental, they've been toning down and removing the warnings
recently.  Yes, the on-device format may come with some level of
compatibility guarantee now so I do agree with that bit, but IMO anyway,
that warning should be being replaced with a more explicit on-device-
format is now stable but the code is not yet entirely so, so keep your
backups and be prepared to use them, and run current kernels, language,
and that's not happening, they're mostly just toning it down without the
still explicit warnings, ATM.

---
[1] Btrfs (so far) behaving itself here: Possibly because my filesystems
are relatively small and I don't use snapshots much and prefer several
smaller independent filesystems rather than doing subvolumes, thus
keeping the number of eggs in a single basket small.  Plus, with small
filesystems on SSD, I can balance reasonably regularly, and I do full
fresh mkfs.btrfs rounds every few kernels as well to take advantage of
newer features, which may well have the result of killing smaller
problems that aren't yet showing up before they get big enough to cause
real issues.  Anyway, I'm not complaining! =:^)
Well, my use case is about 25 filesystems on rotating disks, 20 of them 
on single disks, and the rest multiple-disk filesystems, either 
raid1 or single. I have many subvolumes and in some cases thousands of 
snapshots, but no databases, systemd and the like on them. Of course I 
have everything backed up, but /nag mode on/ I believe that after all 
those years of development I shouldn't still be forced to do a mkfs every 6 
months or so, when I use no new features. /nag mode off/





--
Konstantinos Skarlatos



btrfs-transacti:516 blocked 120 seconds on 3.16-rc1

2014-06-19 Thread Konstantinos Skarlatos
I am not sure whether this is related to the other reports of lockups etc. on 
3.16-rc1, so I am sending it. Full dmesg is attached; this happened after some 
heavy I/O on a multi-disk btrfs filesystem.


[69932.966704] INFO: task btrfs-transacti:516 blocked for more than 120 
seconds.

[69932.966837]   Not tainted 3.16.0-rc1-ge99cfa2 #1
[69932.966921] echo 0 > /proc/sys/kernel/hung_task_timeout_secs 
disables this message.
[69932.967051] btrfs-transacti D 0001 0   516  2 
0x
[69932.967060]  8801f422fac0 0046 880203f3bd20 
000145c0
[69932.967069]  8801f422ffd8 000145c0 880203f3bd20 
8801f422fa30
[69932.967076]  a062e392 8800cda63300 8802010b1e60 
0c73d192

[69932.967083] Call Trace:
[69932.967133]  [a062e392] ? add_delayed_tree_ref+0x102/0x1b0 
[btrfs]

[69932.967146]  [8119937a] ? kmem_cache_alloc_trace+0x1fa/0x220
[69932.967155]  [814fd759] schedule+0x29/0x70
[69932.967179]  [a05c8571] cache_block_group+0x121/0x390 [btrfs]
[69932.967187]  [810b0990] ? __wake_up_sync+0x20/0x20
[69932.967212]  [a05d16fa] find_free_extent+0x5fa/0xc80 [btrfs]
[69932.967243]  [a0606f00] ? free_extent_buffer+0x10/0xa0 [btrfs]
[69932.967269]  [a05d1f52] btrfs_reserve_extent+0x62/0x140 [btrfs]
[69932.967298]  [a05ed388] 
__btrfs_prealloc_file_range+0xe8/0x380 [btrfs]
[69932.967328]  [a05f52b0] 
btrfs_prealloc_file_range_trans+0x30/0x40 [btrfs]
[69932.967353]  [a05d4a97] 
btrfs_write_dirty_block_groups+0x5c7/0x700 [btrfs]
[69932.967380]  [a05e2b5d] commit_cowonly_roots+0x18d/0x240 
[btrfs]
[69932.967408]  [a05e4c87] 
btrfs_commit_transaction+0x4f7/0xa40 [btrfs]

[69932.967435]  [a05e0835] transaction_kthread+0x1e5/0x250 [btrfs]
[69932.967462]  [a05e0650] ? 
btrfs_cleanup_transaction+0x570/0x570 [btrfs]

[69932.967471]  [8108c97b] kthread+0xdb/0x100
[69932.967478]  [8108c8a0] ? kthread_create_on_node+0x180/0x180
[69932.967486]  [8150137c] ret_from_fork+0x7c/0xb0
[69932.967493]  [8108c8a0] ? kthread_create_on_node+0x180/0x180
[69932.967505] INFO: task kworker/u16:15:30882 blocked for more than 120 
seconds.


--
Konstantinos Skarlatos

[  995.654816] BTRFS info (device sdh): force zlib compression
[  995.654827] BTRFS info (device sdh): disk space caching is enabled
[  995.654832] BTRFS: has skinny extents
[  995.785405] BTRFS: bdev /dev/sda errs: wr 0, rd 0, flush 0, corrupt 0, gen 2
[69932.966704] INFO: task btrfs-transacti:516 blocked for more than 120 seconds.
[69932.966837]   Not tainted 3.16.0-rc1-ge99cfa2 #1
[69932.966921] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this 
message.
[69932.967051] btrfs-transacti D 0001 0   516  2 0x
[69932.967060]  8801f422fac0 0046 880203f3bd20 
000145c0
[69932.967069]  8801f422ffd8 000145c0 880203f3bd20 
8801f422fa30
[69932.967076]  a062e392 8800cda63300 8802010b1e60 
0c73d192
[69932.967083] Call Trace:
[69932.967133]  [a062e392] ? add_delayed_tree_ref+0x102/0x1b0 [btrfs]
[69932.967146]  [8119937a] ? kmem_cache_alloc_trace+0x1fa/0x220
[69932.967155]  [814fd759] schedule+0x29/0x70
[69932.967179]  [a05c8571] cache_block_group+0x121/0x390 [btrfs]
[69932.967187]  [810b0990] ? __wake_up_sync+0x20/0x20
[69932.967212]  [a05d16fa] find_free_extent+0x5fa/0xc80 [btrfs]
[69932.967243]  [a0606f00] ? free_extent_buffer+0x10/0xa0 [btrfs]
[69932.967269]  [a05d1f52] btrfs_reserve_extent+0x62/0x140 [btrfs]
[69932.967298]  [a05ed388] __btrfs_prealloc_file_range+0xe8/0x380 
[btrfs]
[69932.967328]  [a05f52b0] btrfs_prealloc_file_range_trans+0x30/0x40 
[btrfs]
[69932.967353]  [a05d4a97] btrfs_write_dirty_block_groups+0x5c7/0x700 
[btrfs]
[69932.967380]  [a05e2b5d] commit_cowonly_roots+0x18d/0x240 [btrfs]
[69932.967408]  [a05e4c87] btrfs_commit_transaction+0x4f7/0xa40 
[btrfs]
[69932.967435]  [a05e0835] transaction_kthread+0x1e5/0x250 [btrfs]
[69932.967462]  [a05e0650] ? btrfs_cleanup_transaction+0x570/0x570 
[btrfs]
[69932.967471]  [8108c97b] kthread+0xdb/0x100
[69932.967478]  [8108c8a0] ? kthread_create_on_node+0x180/0x180
[69932.967486]  [8150137c] ret_from_fork+0x7c/0xb0
[69932.967493]  [8108c8a0] ? kthread_create_on_node+0x180/0x180
[69932.967505] INFO: task kworker/u16:15:30882 blocked for more than 120 
seconds.
[69932.967625]   Not tainted 3.16.0-rc1-ge99cfa2 #1
[69932.967707] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this 
message.
[69932.967835] kworker/u16:15  D  0 30882  2 0x
[69932.967867] Workqueue: btrfs-delalloc normal_work_helper [btrfs]
[69932.967871]  88003e537858 0046 8801fc599e90 
000145c0
[69932.967878

Re: commit 762380a block: add notion of a chunk size for request merging stops io on btrfs

2014-06-18 Thread Konstantinos Skarlatos

On 18/6/2014 5:11 πμ, Jens Axboe wrote:

On 2014-06-17 14:35, Konstantinos Skarlatos wrote:

Hi all,
with 3.16-rc1, rsync stops writing to my btrfs filesystem and stays in a
D+ state.
git bisect showed that the problematic commit is:

762380ad9322951cea4ce9d24864265f9c66a916 is the first bad commit
commit 762380ad9322951cea4ce9d24864265f9c66a916
Author: Jens Axboe ax...@fb.com
Date:   Thu Jun 5 13:38:39 2014 -0600

 block: add notion of a chunk size for request merging

 Some drivers have different limits on what size a request should
 optimally be, depending on the offset of the request. Similar to
 dividing a device into chunks. Add a setting that allows the driver
 to inform the block layer of such a chunk size. The block layer 
will

 then prevent merging across the chunks.

 This is needed to optimally support NVMe with a non-zero stripe 
size.


 Signed-off-by: Jens Axboe ax...@fb.com


That's odd, should not have any effect since nobody enables stripe 
sizes in the kernel. I'll double check, perhaps it's not always being 
cleared.


Ah wait, does the attached help?


Yes, it works! I recompiled at commit 
762380ad9322951cea4ce9d24864265f9c66a916 with your patch and it looks 
ok. Rebooted back to the unpatched kernel and the bug showed up again 
immediately.


The funny thing is that the problem only showed up on my (multi-disk) btrfs 
filesystem; /, which is on ext4, seems to work fine.
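
For completeness, a rough sketch of the bisect workflow used to find
that commit (the good/bad endpoints are assumptions based on the
thread):

git bisect start
git bisect bad v3.16-rc1      # rsync hangs in D+ state here
git bisect good v3.15         # known-good baseline
# build and boot each kernel git offers, test rsync on the btrfs volume,
# then mark the result and repeat until the first bad commit is printed:
git bisect good               # or: git bisect bad
git bisect reset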









Re: frustrations with handling of crash reports

2014-06-18 Thread Konstantinos Skarlatos

On 17/6/2014 9:27 μμ, Marc MERLIN wrote:

On Tue, Jun 17, 2014 at 07:59:57AM -0700, Marc MERLIN wrote:

It is also ok to answer "Any FS created or used before kernel 3.x can be
corrupted due to bugs we fixed in 3.y; thank you for your report but it's
not a good use of our time to investigate this"
(although newer kernels should not just crash with BUG(xxx) on unexpected
data, they should remount the FS read only).

I was thinking about this some more, and I know I have no right to tell
others what to do, so take this as a mere suggestion :)

How about doing a release with cleanups and stabilization and better state
reporting when things go wrong?

This would give a good known version for users who have actual data and
backups that can take many hours or days to restore (never mind downtime).

A few things I was thinking about:
1) Wouldn't it be a good time to replace all the BUG_ON statements with
appropriate error handling? Unexpected data can happen, the kernel shouldn't
crash on that.
At the very least it should remount read only and maybe give a wiki link to
the user on what to do next (some bug reporting and recovery page).

2) On unexpected cases, output basic information on the filesystem or printk
instructions to the user on how to gather data that would be sent to the
list to be reviewed.
This would include information on how old the filesystem is when it's
possible to detect, and the instruction page could say "sorry, anything
older than X we don't want to hear about, we already fixed corruption bugs
since then".

3) getting printk data on an end user machine when it just started refusing
to write to disk can be challenging and cause useful debug info to be lost.
Things I am thinking about:
a) make sure most btrfs bugs do not just hang the kernel
b) recommend to users to send kernel syslog messages to an ext4 partition

How does that sound?
I 100% agree with this. I also have a problem where btrfs decides to 
BUG_ON and force a kernel panic because it has found an unexpected type 
of metadata. Although in my case I was more lucky and had help and test 
patches from Liu Bo, I am still of the opinion that btrfs should not 
take down a whole system because it found something unexpected.


I guess that btrfs developers have put these BUG_ONs so that they get 
reports from users when btrfs gets in these unexpected situations. But 
if most of these reports are ignored or not resolved, then maybe there 
is no use for these BUG_ONs and they should be replaced with something 
more mild.


Keep in mind that if a system panics, then the only way to get logs from 
it is with serial or netconsole, so BUG_ON really makes it much harder 
for users to know what happened and send reports, and only the most 
technical and determined users will manage to send reports here. So I 
can guess that the real number of kernel panics due to btrfs is much 
higher, and most people are unable to report them, because they _never 
know_ that it was btrfs that caused their crash.
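
For anyone who does want to capture such a panic, a minimal netconsole
sketch (IP addresses, interface name and MAC are placeholders):

# on the crashing machine: stream kernel messages over UDP
modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.20/aa:bb:cc:dd:ee:ff
# on the collecting machine: listen and log
nc -u -l 6666 | tee netconsole.log    # some netcat flavors need: nc -u -l -p 6666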


I know btrfs is still experimental, but it has been in the kernel since 
2009-01-09, so I think most users have some expectation of stability 
after something has been 5.5 years in the mainline kernel.


So my suggestion is basically the same as Marc's:

These BUG_ONs should be replaced with something that does not crash the 
system and gives out as much info as possible, so that users do not have 
to get here and ask for a debugging patch.  After all, btrfs is still 
experimental, right? :)


Furthermore, these problems should either remount the fs read-only, or 
try to make the file that is implicated read-only, and report the 
filename, so users can delete it and continue with their lives without 
having to mkfs every few months. Or even make fsck able to fix these, 
and not choke on a few-TB filesystem because it wants to use ridiculous 
amounts of RAM.


In general, btrfs must get _much_ better at reporting what happened, 
which file was implicated and, if it is a multiple-disk fs, the disk 
where the problem is and the sector where it occurred.


PS.
I am not a kernel developer, so please be kind if I have said something 
completely wrong :)




Thanks,
Marc



--
Konstantinos Skarlatos



Re: kernel BUG at fs/btrfs/ctree.h:2456

2014-06-05 Thread Konstantinos Skarlatos

On 5/6/2014 1:59 πμ, Konstantinos Skarlatos wrote:

Hi, I get this after doing a few runs of rsync on my btrfs filesystem.
kernel: 3.15.0-rc8
The filesystem has 6x2TB disks, data is raid 0, the fs was created with skinny 
metadata, and the mount options are noatime, compress-force=zlib. No quota, 
defrag or any of the new features is being used. Attached is a full dmesg 
capture via netconsole.

adding some more info

$ btrfs fi df /storage/btrfs
Data, single: total=8.89TiB, used=8.43TiB
System, RAID1: total=32.00MiB, used=992.00KiB
Metadata, RAID1: total=69.00GiB, used=66.75GiB
unknown, single: total=512.00MiB, used=112.00KiB

$ btrfs fi show
Label: none  uuid: bde3c349-9e08-45bb-8517-b9a6dda81e88
Total devices 6 FS bytes used 8.50TiB
devid1 size 3.64TiB used 3.02TiB path /dev/sdf
devid2 size 1.82TiB used 1.20TiB path /dev/sda
devid3 size 1.82TiB used 1.20TiB path /dev/sdb
devid4 size 1.82TiB used 1.20TiB path /dev/sdc
devid5 size 1.82TiB used 1.20TiB path /dev/sdd
devid6 size 1.82TiB used 1.20TiB path /dev/sdh

Btrfs v3.14.2-dirty

btrfs su li /storage/btrfs -q | grep "parent_uuid -" | wc -l
22

btrfs su li /storage/btrfs -q | grep -v "parent_uuid -" | wc -l
5855

So the filesystem has 22 subvolumes and 5855 snapshots.
No vm images or databases are stored here, everything comes and goes 
with rsync, as this is a backup server.




[  855.493495] BTRFS info (device sdc): force zlib compression
[  855.498427] BTRFS info (device sdc): disk space caching is enabled
[  855.503348] BTRFS: has skinny extents
[27199.947244] [ cut here ]
[27199.952216] kernel BUG at fs/btrfs/ctree.h:2456!
[27199.957188] invalid opcode:  [#1] PREEMPT SMP
[27199.962184] Modules linked in: netconsole radeon kvm_amd 
snd_hda_codec_hdmi ttm drm_kms_helper drm kvm r8169 microcode evdev 
snd_hda_intel snd_hda_controller mac_hid edac_core snd_hda_codec 
snd_hwdep edac_mce_amd snd_pcm pcspkr snd_timer snd serio_raw 
i2c_algo_bit k10temp hwmon sp5100_tco i2c_piix4 i2c_core soundcore mii 
wmi shpchp button acpi_cpufreq processor ext4 crc16 mbcache jbd2 
crc32c_generic btrfs xor raid6_pq sd_mod crc_t10dif crct10dif_common 
ata_generic pata_acpi atkbd libps2 pata_jmicron ahci libahci ohci_pci 
libata ohci_hcd ehci_pci xhci_hcd ehci_hcd scsi_mod usbcore usb_common 
i8042 serio
[27199.990017] CPU: 1 PID: 7953 Comm: rsync Not tainted 
3.15.0-rc8-gfad01e8 #1
[27199.995748] Hardware name: Gigabyte Technology Co., Ltd. 
GA-890GPA-UD3H/GA-890GPA-UD3H, BIOS FD 07/23/2010
[27200.001584] task: 880202928000 ti: 8800129e task.ti: 
8800129e
[27200.007439] RIP: 0010:[a0594017] [a0594017] 
lookup_inline_extent_backref+0x407/0x5d0 [btrfs]

[27200.013445] RSP: 0018:8800129e3a90  EFLAGS: 00010283
[27200.019397] RAX: 0038 RBX: 88002ef9af00 RCX: 
8800129e3a40
[27200.025312] RDX: 8800 RSI: 36b7 RDI: 
88002ef9af00
[27200.031119] RBP: 8800129e3b28 R08: 4000 R09: 
8800129e3a50
[27200.036801] R10:  R11: 0003 R12: 
00b8
[27200.042377] R13: 0038 R14: 36b7 R15: 
399c
[27200.047899] FS:  7f543bbec700() GS:88020fc4() 
knlGS:

[27200.053379] CS:  0010 DS:  ES:  CR0: 8005003b
[27200.058751] CR2: 015b6fd8 CR3: 00018a0da000 CR4: 
07e0

[27200.064060] Stack:
[27200.069215]  0c14cecab000 88019abf1480 0327 
8800129e3b68
[27200.074437]  399c 00b8 88002ef9af00 
000d00b8
[27200.079585]  8800ce8fc800 b000a0594e27 00a80c14ceca 
0020

[27200.084647] Call Trace:
[27200.089554]  [a0595265] 
insert_inline_extent_backref+0x55/0xe0 [btrfs]
[27200.094467]  [a0595386] __btrfs_inc_extent_ref+0x96/0x200 
[btrfs]
[27200.099290]  [a059c0f9] 
__btrfs_run_delayed_refs+0x819/0x1240 [btrfs]
[27200.104035]  [a058979d] ? 
btrfs_put_tree_mod_seq+0x10d/0x150 [btrfs]
[27200.108676]  [a05a091b] 
btrfs_run_delayed_refs.part.52+0x7b/0x260 [btrfs]
[27200.113241]  [a05a0b17] btrfs_run_delayed_refs+0x17/0x20 
[btrfs]
[27200.117675]  [a05b1be3] 
__btrfs_end_transaction+0x243/0x380 [btrfs]
[27200.122031]  [a05b1d30] btrfs_end_transaction+0x10/0x20 
[btrfs]

[27200.126275]  [a05bb31e] btrfs_truncate+0x23e/0x330 [btrfs]
[27200.130452]  [a05bbe48] btrfs_setattr+0x228/0x2e0 [btrfs]
[27200.134549]  [811c6781] notify_change+0x221/0x380
[27200.138641]  [811a9006] do_truncate+0x66/0x90
[27200.142715]  [811ad159] ? __sb_start_write+0x49/0xf0
[27200.146795]  [811a937b] 
do_sys_ftruncate.constprop.10+0x10b/0x160

[27200.150927]  [811a940e] SyS_ftruncate+0xe/0x10
[27200.155104]  [814f56a9] system_call_fastpath+0x16/0x1b
[27200.159297] Code: 48 39 45 10 74 74 0f 87 28 01 00 00

Re: kernel BUG at fs/btrfs/ctree.h:2456

2014-06-05 Thread Konstantinos Skarlatos

On 5/6/2014 10:05 πμ, Liu Bo wrote:

Hi, Konstantinos

On Thu, Jun 05, 2014 at 09:28:16AM +0300, Konstantinos Skarlatos wrote:

On 5/6/2014 1:59 πμ, Konstantinos Skarlatos wrote:

Hi, I get this after doing a few runs of rsync on my btrfs filesystem.
kernel: 3.15.0-rc8
filesystem has 6x2tb disks, data is raid 0, fs was created with
skinny metadata, mount options are noatime, compress-force=zlib.
No quota or defrag or any of the new features is being used.
attached full dmesg capture via netconsole.

adding some more info

Can you reproduce it?  Or everything becomes good after a hard reboot?

Looks that this is an 'impossible' case from code analysis.

-liubo
I recompiled my kernel with CONFIG_BTRFS_DEBUG=y. After a few minutes of 
scrub and rsync, I got this:



[  264.271695] BTRFS info (device sda): force zlib compression
[  264.276668] BTRFS info (device sda): disk space caching is enabled
[  264.282950] BTRFS: has skinny extents
[  363.412708] BTRFS: checking UUID tree
[ 1115.402092] BTRFS: checksum/header error at logical 4003307880448 on 
dev /dev/sda, sector 66783040: metadata node (level -1) in tree 
18446744073709551615

[ 1115.406251] [ cut here ]
[ 1115.408251] kernel BUG at fs/btrfs/ctree.h:2456!
[ 1115.410257] invalid opcode:  [#1] PREEMPT SMP
[ 1115.412291] Modules linked in: netconsole kvm_amd radeon kvm ttm 
drm_kms_helper snd_hda_codec_hdmi serio_raw drm k10temp edac_core evdev 
mac_hid microcode hwmon edac_mce_amd r8169 mii i2c_algo_bit pcspkr 
snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm 
snd_timer snd soundcore wmi shpchp sp5100_tco i2c_piix4 i2c_core button 
acpi_cpufreq processor ext4 crc16 mbcache jbd2 crc32c_generic btrfs xor 
raid6_pq sd_mod crc_t10dif crct10dif_common ata_generic pata_acpi atkbd 
libps2 ahci pata_jmicron libahci libata ohci_pci ohci_hcd ehci_pci 
ehci_hcd scsi_mod xhci_hcd usbcore i8042 serio usb_common
[ 1115.423444] CPU: 2 PID: 101 Comm: kworker/u16:6 Not tainted 
3.15.0-rc8-g54539cd #1
[ 1115.425705] Hardware name: Gigabyte Technology Co., Ltd. 
GA-890GPA-UD3H/GA-890GPA-UD3H, BIOS FD 07/23/2010

[ 1115.428023] Workqueue: btrfs-btrfs-scrub normal_work_helper [btrfs]
[ 1115.430296] task: 880203451e90 ti: 88020313 task.ti: 
88020313
[ 1115.432548] RIP: 0010:[a04437ac] [a04437ac] 
tree_backref_for_extent+0x1cc/0x1d0 [btrfs]

[ 1115.435433] RSP: 0018:880203133b40  EFLAGS: 00010283
[ 1115.438253] RAX: 0019 RBX: 2c05 RCX: 
880203133af0
[ 1115.441083] RDX: 8800 RSI: 2c0e RDI: 
88017e90efc0
[ 1115.443858] RBP: 880203133b88 R08: 4000 R09: 
880203133b00
[ 1115.446573] R10:  R11: 0002 R12: 
88017e90efc0
[ 1115.449230] R13: 2be4 R14: 880203133bc0 R15: 
2c0e
[ 1115.451835] FS:  7f9a2c9f1700() GS:88020fc8() 
knlGS:

[ 1115.454422] CS:  0010 DS:  ES:  CR0: 8005003b
[ 1115.456960] CR2: 7f05da035000 CR3: 0001ddbc CR4: 
07e0

[ 1115.459467] Stack:
[ 1115.461894]  880203133bbf 880203133bd0 2bfc 
2c0e
[ 1115.464345]  fffe 0021 a0460834 
88017e90efc0
[ 1115.466760]  880202cbc000 880203133c60 a043ab5c 


[ 1115.469138] Call Trace:
[ 1115.471447]  [a043ab5c] scrub_print_warning+0x28c/0x2d0 [btrfs]
[ 1115.473737]  [a03de746] ? btrfs_csum_data+0x16/0x20 [btrfs]
[ 1115.475975]  [a043de94] 
scrub_handle_errored_block+0x974/0xae0 [btrfs]
[ 1115.478176]  [a043e228] scrub_bio_end_io_worker+0x228/0x810 
[btrfs]

[ 1115.480327]  [a0414b77] normal_work_helper+0x77/0x350 [btrfs]
[ 1115.482438]  [810821c8] process_one_work+0x168/0x450
[ 1115.484518]  [81082c02] worker_thread+0x132/0x3e0
[ 1115.486601]  [81082ad0] ? manage_workers.isra.23+0x2d0/0x2d0
[ 1115.488693]  [8108908b] kthread+0xdb/0x100
[ 1115.490770]  [81088fb0] ? kthread_create_on_node+0x180/0x180
[ 1115.492881]  [814f573c] ret_from_fork+0x7c/0xb0
[ 1115.494997]  [81088fb0] ? kthread_create_on_node+0x180/0x180
[ 1115.497129] Code: ff 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 
1f 80 00 00 00 00 48 83 c4 20 b8 fe ff ff ff 5b 41 5c 41 5d 41 5e 41 5f 
5d c3 0f 0b 66 90 66 66 66 66 90 55 48 89 e5 41 57 41 56 41 55 41 54
[ 1115.501825] RIP  [a04437ac] 
tree_backref_for_extent+0x1cc/0x1d0 [btrfs]

[ 1115.504103]  RSP 880203133b40
[ 1115.514902] ---[ end trace 54741a57d59e0263 ]---
[ 1115.516654] BUG: unable to handle kernel paging request at 
ffd8

[ 1115.518247] IP: [810896f0] kthread_data+0x10/0x20
[ 1115.519811] PGD 1814067 PUD 1816067 PMD 0
[ 1115.521277] Oops:  [#2] PREEMPT SMP
[ 1115.522735] Modules linked in: netconsole kvm_amd radeon kvm ttm 
drm_kms_helper snd_hda_codec_hdmi serio_raw drm k10temp

Re: send/receive and bedup

2014-05-23 Thread Konstantinos Skarlatos

On 21/5/2014 3:58 AM, Chris Murphy wrote:

On May 20, 2014, at 4:56 PM, Konstantinos Skarlatos k.skarla...@gmail.com 
wrote:


On 21/5/2014 1:37 AM, Mark Fasheh wrote:

On Tue, May 20, 2014 at 01:07:50AM +0300, Konstantinos Skarlatos wrote:

Duperemove will be shipping as supported software in a major SUSE release so
it will be bug fixed, etc as you would expect. At the moment I'm very busy
trying to fix qgroup bugs so I haven't had much time to add features, or
handle external bug reports, etc. Also I'm not very good at advertising my
software which would be why it hasn't really been mentioned on list lately
:)

I would say that state that it's in is that I've gotten the feature set to a
point which feels reasonable, and I've fixed enough bugs that I'd appreciate
folks giving it a spin and providing reasonable feedback.

Well, after having good results with duperemove with a few gigs of data, i
tried it on a 500gb subvolume. After it scanned all files, it is stuck at
100% of one cpu core for about 5 hours, and still hasn't done any deduping.
My cpu is an Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz, so i guess thats
not the problem. So I guess the speed of duperemove drops dramatically as
data volume increases.

Yeah I doubt it's your CPU. Duperemove is right now targeted at smaller data
sets (a few VMS, iso images, etc) than you threw it at as you undoubtedly
have figured out. It will need a bit of work before it can handle entire
file systems. My guess is that it was spending an enormous amount of time
finding duplicates (it has a very thorough check that could probably be
optimized).

It finished after 9 or so hours, so I agree it was checking for duplicates. It 
does a few GB in just seconds, so time probably scales exponentially with data 
size.

I'm going to guess it ran out of memory. I wonder what happens if you take an 
SSD and specify a humongous swap partition on it. Like, 4x, or more, the amount 
of installed memory.
Just tried it again, with 32GiB swap added on an SSD. My test files are 
633GiB.
duperemove -rv /storage/test 19537.67s user 183.86s system 89% cpu 
6:06:56.96 total


Duperemove was using about 1GiB of RAM, had one core at 100%, and I 
think swap was not touched at all.





This same trick has been mentioned on the XFS list for use with xfsrepair when 
memory requirements exceed system memory, and is immensely faster.


Chris Murphy





Re: ditto blocks on ZFS

2014-05-21 Thread Konstantinos Skarlatos

On 20/5/2014 5:07 AM, Russell Coker wrote:

On Mon, 19 May 2014 23:47:37 Brendan Hide wrote:

This is extremely difficult to measure objectively. Subjectively ... see
below.


[snip]

*What other failure modes* should we guard against?

I know I'd sleep a /little/ better at night knowing that a double disk
failure on a raid5/1/10 configuration might ruin a ton of data along
with an obscure set of metadata in some long tree paths - but not the
entire filesystem.

My experience is that most disk failures that don't involve extreme physical
damage (EG dropping a drive on concrete) don't involve totally losing the
disk.  Much of the discussion about RAID failures concerns entirely failed
disks, but I believe that is due to RAID implementations such as Linux
software RAID that will entirely remove a disk when it gives errors.

I have a disk which had ~14,000 errors of which ~2000 errors were corrected by
duplicate metadata.  If two disks with that problem were in a RAID-1 array
then duplicate metadata would be a significant benefit.


The other use-case/failure mode - where you are somehow unlucky enough
to have sets of bad sectors/bitrot on multiple disks that simultaneously
affect the only copies of the tree roots - is an extremely unlikely
scenario. As unlikely as it may be, the scenario is a very painful
consequence in spite of VERY little corruption. That is where the
peace-of-mind/bragging rights come in.

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

The NetApp research on latent errors on drives is worth reading.  On page 12
they report latent sector errors on 9.5% of SATA disks per year.  So if you
lose one disk entirely the risk of having errors on a second disk is higher
than you would want for RAID-5.  While losing the root of the tree is
unlikely, losing a directory in the middle that has lots of subdirectories is
a risk.
Seeing the results of that paper, I think erasure coding is a better 
solution. Instead of keeping many copies of metadata or data, we could do 
erasure coding using something like zfec[1], which is used by 
Tahoe-LAFS, increasing their size by, let's say, 5-10%, and be quite safe 
even against multiple contiguous bad sectors.


[1] https://pypi.python.org/pypi/zfec


I can understand why people wouldn't want ditto blocks to be mandatory.  But
why are people arguing against them as an option?


As an aside, I'd really like to be able to set RAID levels by subtree.  I'd
like to use RAID-1 with ditto blocks for my important data and RAID-0 for
unimportant data.





Re: send/receive and bedup

2014-05-20 Thread Konstantinos Skarlatos

On 21/5/2014 1:37 AM, Mark Fasheh wrote:

On Tue, May 20, 2014 at 01:07:50AM +0300, Konstantinos Skarlatos wrote:

Duperemove will be shipping as supported software in a major SUSE release so
it will be bug fixed, etc as you would expect. At the moment I'm very busy
trying to fix qgroup bugs so I haven't had much time to add features, or
handle external bug reports, etc. Also I'm not very good at advertising my
software which would be why it hasn't really been mentioned on list lately
:)

I would say that state that it's in is that I've gotten the feature set to a
point which feels reasonable, and I've fixed enough bugs that I'd appreciate
folks giving it a spin and providing reasonable feedback.

Well, after having good results with duperemove with a few gigs of data, i
tried it on a 500gb subvolume. After it scanned all files, it is stuck at
100% of one cpu core for about 5 hours, and still hasn't done any deduping.
My cpu is an Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz, so i guess thats
not the problem. So I guess the speed of duperemove drops dramatically as
data volume increases.

Yeah I doubt it's your CPU. Duperemove is right now targeted at smaller data
sets (a few VMS, iso images, etc) than you threw it at as you undoubtedly
have figured out. It will need a bit of work before it can handle entire
file systems. My guess is that it was spending an enormous amount of time
finding duplicates (it has a very thorough check that could probably be
optimized).
It finished after 9 or so hours, so I agree it was checking for 
duplicates. It does a few GB in just seconds, so time probably scales 
exponentially with data size.


For what it's worth, handling larger data sets is the type of work I want to
be doing on it in the future.

I can help with testing :)
I would also suggest that you post to this list any changes you make, 
so that your program becomes better known among btrfs users. Or even 
send a new announcement mail or add a page to the btrfs wiki.


Finally, I would like to request the ability to do file-level dedup 
with a reflink. That has the advantage of consuming very little metadata 
compared to block-level dedup. It could be done as a two-pass dedup: 
first compare all the same-sized files, and after that do your 
normal block-level dedup.


By the way, does anybody have a good program/script that can do file-level 
dedup with reflinks and checksum comparison?
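
For lack of a ready-made tool, a minimal sketch of such a pass could look like
the script below (my own illustration, not an existing utility). It assumes GNU
coreutils, bash 4+ for associative arrays, and that all paths live on the same
btrfs filesystem; owner/mode/timestamps of the replaced duplicates are not
preserved here.

#!/bin/bash
# Sketch: file-level dedup via checksum comparison and cp --reflink.
# Usage: ./reflink-dedup.sh /path/to/subvolume
set -eu
dir=$1
declare -A first_seen      # checksum -> first file seen with that checksum

while IFS= read -r -d '' file; do
    sum=$(sha256sum "$file" | cut -d' ' -f1)
    if [[ -n "${first_seen[$sum]:-}" ]]; then
        # Same checksum: verify byte-for-byte, then swap in a reflink copy
        # that shares its extents with the first file.
        if cmp -s "${first_seen[$sum]}" "$file"; then
            cp --reflink=always "${first_seen[$sum]}" "$file.dedup.$$"
            mv "$file.dedup.$$" "$file"
        fi
    else
        first_seen[$sum]=$file
    fi
done < <(find "$dir" -type f -print0)

A real tool would also want to skip zero-length files, group by size before
checksumming, and preserve the attributes of the files it replaces.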


Kind regards,
Konstantinos Skarlatos

--Mark

--
Mark Fasheh




Re: send/receive and bedup

2014-05-19 Thread Konstantinos Skarlatos

On 19/5/2014 7:01 PM, Brendan Hide wrote:

On 19/05/14 15:00, Scott Middleton wrote:

On 19 May 2014 09:07, Marc MERLIN m...@merlins.org wrote:

On Wed, May 14, 2014 at 11:36:03PM +0800, Scott Middleton wrote:

I read so much about BtrFS that I mistaked Bedup with Duperemove.
Duperemove is actually what I am testing.

I'm currently using programs that find files that are the same, and
hardlink them together:
http://marc.merlins.org/perso/linux/post_2012-05-01_Handy-tip-to-save-on-inodes-and-disk-space_-finddupes_-fdupes_-and-hardlink_py.html 



hardlink.py actually seems to be the faster (memory and CPU) one event
though it's in python.
I can get others to run out of RAM on my 8GB server easily :(


Interesting app.

An issue with hardlinking (with the backups use-case, this problem 
isn't likely to happen), is that if you modify a file, all the 
hardlinks get changed along with it - including the ones that you 
don't want changed.


@Marc: Since you've been using btrfs for a while now I'm sure you've 
already considered whether or not a reflink copy is the better/worse 
option.




Bedup should be better, but last I tried I couldn't get it to work.
It's been updated since then, I just haven't had the chance to try it
again since then.

Please post what you find out, or if you have a hardlink maker that's
better than the ones I found :)



Thanks for that.

I may be  completely wrong in my approach.

I am not looking for a file level comparison. Bedup worked fine for
that. I have a lot of virtual images and shadow protect images where
only a few megabytes may be the difference. So a file level hash and
comparison doesn't really achieve my goals.

I thought duperemove may be on a lower level.

https://github.com/markfasheh/duperemove

Duperemove is a simple tool for finding duplicated extents and
submitting them for deduplication. When given a list of files it will
hash their contents on a block by block basis and compare those hashes
to each other, finding and categorizing extents that match each
other. When given the -d option, duperemove will submit those
extents for deduplication using the btrfs-extent-same ioctl.

It defaults to 128k but you can make it smaller.

I hit a hurdle though. The 3TB HDD  I used seemed OK when I did a long
SMART test but seems to die every few hours. Admittedly it was part of
a failed mdadm RAID array that I pulled out of a clients machine.

The only other copy I have of the data is the original mdadm array
that was recently replaced with a new server, so I am loathe to use
that HDD yet. At least for another couple of weeks!


I am still hopeful duperemove will work.
Duperemove does look exactly like what you are looking for. The last 
traffic on the mailing list regarding that was in August last year. It 
looks like it was pulled into the main kernel repository on September 
1st.


The last commit to the duperemove application was on April 20th this 
year. Maybe Mark (cc'd) can provide further insight on its current 
status.


I have been testing duperemove and it seems to work just fine, in 
contrast with bedup, which I have been unable to install or compile 
because of a mess with Python versions. I have 2 questions about duperemove:

1) can it use existing filesystem csums instead of calculating its own?
2) can it be included in btrfs-progs so that it becomes a standard 
feature of btrfs?

Thanks


Re: send/receive and bedup

2014-05-19 Thread Konstantinos Skarlatos

On 19/5/2014 8:38 PM, Mark Fasheh wrote:

On Mon, May 19, 2014 at 06:01:25PM +0200, Brendan Hide wrote:

On 19/05/14 15:00, Scott Middleton wrote:

On 19 May 2014 09:07, Marc MERLIN m...@merlins.org wrote:
Thanks for that.

I may be  completely wrong in my approach.

I am not looking for a file level comparison. Bedup worked fine for
that. I have a lot of virtual images and shadow protect images where
only a few megabytes may be the difference. So a file level hash and
comparison doesn't really achieve my goals.

I thought duperemove may be on a lower level.

https://github.com/markfasheh/duperemove

Duperemove is a simple tool for finding duplicated extents and
submitting them for deduplication. When given a list of files it will
hash their contents on a block by block basis and compare those hashes
to each other, finding and categorizing extents that match each
other. When given the -d option, duperemove will submit those
extents for deduplication using the btrfs-extent-same ioctl.

It defaults to 128k but you can make it smaller.

I hit a hurdle though. The 3TB HDD  I used seemed OK when I did a long
SMART test but seems to die every few hours. Admittedly it was part of
a failed mdadm RAID array that I pulled out of a clients machine.

The only other copy I have of the data is the original mdadm array
that was recently replaced with a new server, so I am loathe to use
that HDD yet. At least for another couple of weeks!


I am still hopeful duperemove will work.

Duperemove does look exactly like what you are looking for. The last
traffic on the mailing list regarding that was in August last year. It
looks like it was pulled into the main kernel repository on September 1st.

I'm confused - you need to avoid a file scan completely? Duperemove does do
that just to be clear.

In your mind, what would be the alternative to that sort of a scan?

By the way, if you know exactly where the changes are you
could just feed the duplicate extents directly to the ioctl via a script. I
have a small tool in the duperemove repository that can do that for you
('make btrfs-extent-same').
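
For illustration, a call to that helper might look like the lines below. The
argument order (length first, then file/offset pairs, with the first pair
acting as the source) is how I remember the tool's usage text, so please
verify it against ./btrfs-extent-same run with no arguments:

 # dedupe the first 1MiB of two files known to be identical in that range
 $ len=$((1024 * 1024))
 $ ./btrfs-extent-same "$len" fileA 0 fileB 0

The kernel compares the ranges byte-for-byte before it rewrites the target
extent to share the source's blocks, so feeding it a wrong pair is safe, just
wasted work.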



The last commit to the duperemove application was on April 20th this year.
Maybe Mark (cc'd) can provide further insight on its current status.

Duperemove will be shipping as supported software in a major SUSE release so
it will be bug fixed, etc as you would expect. At the moment I'm very busy
trying to fix qgroup bugs so I haven't had much time to add features, or
handle external bug reports, etc. Also I'm not very good at advertising my
software which would be why it hasn't really been mentioned on list lately
:)

I would say that state that it's in is that I've gotten the feature set to a
point which feels reasonable, and I've fixed enough bugs that I'd appreciate
folks giving it a spin and providing reasonable feedback.
Well, after having good results with duperemove on a few gigs of data, 
I tried it on a 500GB subvolume. After it scanned all files, it has been stuck 
at 100% of one CPU core for about 5 hours, and still hasn't done any 
deduping. My CPU is an Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz, so I 
guess that's not the problem. So I guess the speed of duperemove drops 
dramatically as data volume increases.




There's a TODO list which gives a decent idea of what's on my mind for
possible future improvements. I think what I'm most wanting to do right now
is some sort of (optional) writeout to a file of what was done during a run.
The idea is that you could feed that data back to duperemove to improve the
speed of subsequent runs. My priorities may change depending on feedback
from users of course.

I also at some point want to rewrite some of the duplicate extent finding
code as it got messy and could be a bit faster.
--Mark

--
Mark Fasheh




Re: [PATCH] Btrfs-progs: fsck: add an option to check data csums

2014-05-08 Thread Konstantinos Skarlatos

On 8/5/2014 4:26 AM, Wang Shilong wrote:

This patch adds an option '--check-data-csum' to verify data csums.
fsck won't check data csums unless users specify this option explictly.
Can this option be added to btrfs restore as well? I think it would be a 
good thing if users could tell restore to only recover non-corrupt files.


Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com
---
  Documentation/btrfs-check.txt |   2 +
  cmds-check.c  | 122 --
  2 files changed, 120 insertions(+), 4 deletions(-)

diff --git a/Documentation/btrfs-check.txt b/Documentation/btrfs-check.txt
index 485a49c..bc10755 100644
--- a/Documentation/btrfs-check.txt
+++ b/Documentation/btrfs-check.txt
@@ -30,6 +30,8 @@ try to repair the filesystem.
  create a new CRC tree.
  --init-extent-tree::
  create a new extent tree.
+--check-data-csum::
+check data csums.
  
  EXIT STATUS

  ---
diff --git a/cmds-check.c b/cmds-check.c
index 103efc5..b53d49c 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -53,6 +53,7 @@ static LIST_HEAD(delete_items);
  static int repair = 0;
  static int no_holes = 0;
  static int init_extent_tree = 0;
+static int check_data_csum = 0;
  
  struct extent_backref {

struct list_head list;
@@ -3634,6 +3635,106 @@ static int check_space_cache(struct btrfs_root *root)
return error ? -EINVAL : 0;
  }
  
+static int read_extent_data(struct btrfs_root *root, char *data,
+                            u64 logical, u64 len, int mirror)
+{
+        u64 offset = 0;
+        struct btrfs_multi_bio *multi = NULL;
+        struct btrfs_fs_info *info = root->fs_info;
+        struct btrfs_device *device;
+        int ret = 0;
+        u64 read_len;
+        unsigned long bytes_left = len;
+
+        while (bytes_left) {
+                read_len = bytes_left;
+                device = NULL;
+                ret = btrfs_map_block(&info->mapping_tree, READ,
+                                      logical + offset, &read_len, &multi,
+                                      mirror, NULL);
+                if (ret) {
+                        fprintf(stderr, "Couldn't map the block %llu\n",
+                                logical + offset);
+                        goto error;
+                }
+                device = multi->stripes[0].dev;
+
+                if (device->fd == 0)
+                        goto error;
+
+                if (read_len > root->sectorsize)
+                        read_len = root->sectorsize;
+                if (read_len > bytes_left)
+                        read_len = bytes_left;
+
+                ret = pread64(device->fd, data + offset, read_len,
+                              multi->stripes[0].physical);
+                if (ret != read_len)
+                        goto error;
+                offset += read_len;
+                bytes_left -= read_len;
+                kfree(multi);
+                multi = NULL;
+        }
+        return 0;
+error:
+        kfree(multi);
+        return -EIO;
+}
+
+static int check_extent_csums(struct btrfs_root *root, u64 bytenr,
+                              u64 num_bytes, unsigned long leaf_offset,
+                              struct extent_buffer *eb) {
+
+        u64 offset = 0;
+        u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
+        char *data;
+        u32 crc;
+        unsigned long tmp;
+        char result[csum_size];
+        char out[csum_size];
+        int ret = 0;
+        __s64 cmp;
+        int mirror;
+        int num_copies = btrfs_num_copies(&root->fs_info->mapping_tree,
+                                          bytenr, num_bytes);
+
+        BUG_ON(num_bytes % root->sectorsize);
+        data = malloc(root->sectorsize);
+        if (!data)
+                return -ENOMEM;
+
+        while (offset < num_bytes) {
+                mirror = 0;
+again:
+                ret = read_extent_data(root, data, bytenr + offset,
+                                       root->sectorsize, mirror);
+                if (ret)
+                        goto out;
+
+                crc = ~(u32)0;
+                crc = btrfs_csum_data(NULL, (char *)data, crc,
+                                      root->sectorsize);
+                btrfs_csum_final(crc, result);
+
+                tmp = leaf_offset + offset / root->sectorsize * csum_size;
+                read_extent_buffer(eb, out, tmp, csum_size);
+                cmp = memcmp(out, result, csum_size);
+                if (cmp) {
+                        fprintf(stderr,
+                                "mirror: %d range bytenr: %llu, len: %d checksum mismatch\n",
+                                mirror, bytenr + offset, root->sectorsize);
+                        if (mirror < num_copies - 1) {
+                                mirror += 1;
+                                goto again;
+                        }
+                }
+                offset += root->sectorsize;
+        }
+out:
+        free(data);
+        return ret;
+}
+
  static int check_extent_exists(struct btrfs_root *root, u64 bytenr,
   u64 num_bytes)
  {
@@ -3771,6 +3872,8 

Test results for [RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-14 Thread Konstantinos Skarlatos

Hello,

Here are the results of my testing of the latest btrfs dedup patches.


TLDR;
I rsynced 10 separate copies of a 3.8GB folder with 138 RAW photographs 
(23-36MiB) on a btrfs volume with dedup enabled.
On the first try, the copy was very slow, and a sync after that took 
over 10 minutes to complete.
For the next copies sync was much faster, but still took up to one 
minute to complete.
The copy itself was quite slow, until the fifth try when it went from 
8MB/sec to 22-40MB/sec.
Each copy after the first consumed about 60-65MiB of metadata, or 
120-130MiB of free space due to metadata being DUP.


Obvious question:
Can dedup recognize that 2 files are the same and dedup them on a file 
level, saving much more space in the process?


In any case I am very thankful for the work being done here, and I am 
willing to help in any way I can.





AMD Phenom(tm) II X4 955 Processor
MemTotal:  8 GB
Hard Disk: Seagate Barracuda 7200.12 [160 GB]
kernel: 3.14.0-1-git

$ mkfs.btrfs /dev/loop0 -f && mount /storage/btrfs_dedup && mount |grep 
dedup && btrfs dedup enable /storage/btrfs_dedup && btrfs dedup on 
/storage/btrfs_dedup && for i in {01..10}; do time rsync -a 
/storage/btrfs/costas/Photo_library/2014/ /storage/btrfs_dedup/copy$i/ 
--stats && time btrfs fi sync /storage/btrfs_dedup/ && df 
/storage/btrfs_dedup/ && btrfs fi df /storage/btrfs_dedup ; done && time 
umount /storage/btrfs_dedup


/root/btrfs.img on /storage/btrfs_dedup type btrfs 
(rw,noatime,nodiratime,space_cache)


sent 4,017,134,246 bytes  received 2,689 bytes  8,274,226.44 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  21.85s user 
45.04s system 13% cpu 8:05.48 total
btrfs fi sync /storage/btrfs_dedup/  0.00s user 0.36s system 0% cpu 
10:43.27 total

/dev/loop1 46080  4119 40173  10% /storage/btrfs_dedup
Data, single: total=4.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=143.45MiB

sent 4,017,134,246 bytes  received 2,689 bytes  8,956,827.06 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  21.29s user 
42.32s system 14% cpu 7:28.74 total
btrfs fi sync /storage/btrfs_dedup/  0.00s user 0.01s system 0% cpu 
4.173 total

/dev/loop1 46080  4250 40173  10% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=208.72MiB

sent 4,017,134,246 bytes  received 2,689 bytes  9,691,524.57 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  20.95s user 
31.69s system 12% cpu 6:54.90 total
btrfs fi sync /storage/btrfs_dedup/  0.00s user 0.00s system 0% cpu 
3.254 total

/dev/loop1 46080  4371 40172  10% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=269.39MiB

sent 4,017,134,246 bytes  received 2,689 bytes  9,037,428.43 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  20.54s user 
36.70s system 12% cpu 7:23.93 total
btrfs fi sync /storage/btrfs_dedup/  0.00s user 0.01s system 0% cpu 
5.578 total

/dev/loop1 46080  4497 40172  11% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=331.98MiB

sent 4,017,134,246 bytes  received 2,689 bytes  29,004,598.81 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  22.30s user 
13.01s system 25% cpu 2:18.15 total
btrfs fi sync /storage/btrfs_dedup/  0.00s user 0.01s system 0% cpu 
23.447 total

/dev/loop1 46080  4617 40172  11% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=391.91MiB

sent 4,017,134,246 bytes  received 2,689 bytes  39,971,511.79 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  21.60s user 
11.85s system 33% cpu 1:39.74 total
btrfs fi sync /storage/btrfs_dedup/  0.00s user 0.01s system 0% cpu 
32.178 total

/dev/loop1 46080  4747 40171  11% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=456.48MiB

sent 4,017,134,246 bytes  received 2,689 bytes  32,009,059.24 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  25.68s user 
13.94s system 31% cpu 2:04.42 total
btrfs fi sync /storage/btrfs_dedup/  0.00s user 0.01s system 0% cpu 
29.313 total

/dev/loop1 46080  4870 40171  11% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=518.09MiB

sent 4,017,134,246 bytes  received 2,689 bytes  30,782,658.51 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  21.84s user 
12.63s system 26% cpu 2:10.20 total
btrfs fi sync /storage/btrfs_dedup/  0.00s user 0.00s system 0% cpu 
41.074 total

/dev/loop1 46080  4990 40171  12% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=578.16MiB

sent 4,017,134,246 bytes  received 2,689 bytes  22,379,592.95 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/  --stats  28.57s user 

Re: [RFC PATCH v10 00/16] Online(inband) data deduplication

2014-04-10 Thread Konstantinos Skarlatos

On 10/4/2014 6:48 AM, Liu Bo wrote:

Hello,

This the 10th attempt for in-band data dedupe, based on Linux _3.14_ kernel.

Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]

This patch set is also related to Content based storage in project ideas[2],
it introduces inband data deduplication for btrfs and dedup/dedupe is for short.

* PATCH 1 is a speed-up improvement, which is about dedup and quota.

* PATCH 2-5 is the preparation work for dedup implementation.

* PATCH 6 shows how we implement dedup feature.

* PATCH 7 fixes a backref walking bug with dedup.

* PATCH 8 fixes a free space bug of dedup extents on error handling.

* PATCH 9 adds the ioctl to control dedup feature.

* PATCH 10 targets delayed refs' scalability problem of deleting refs, which is
   uncovered by the dedup feature.

* PATCH 11-16 fixes bugs of dedupe including race bug, deadlock, abnormal
   transaction abortion and crash.

* btrfs-progs patch(PATCH 17) offers all details about how to control the
   dedup feature on progs side.

I've tested this with xfstests by adding an inline dedup 'enable & on' in 
xfstests'
mount and scratch_mount.


***NOTE***
Known bugs:
* Mounting with options flushoncommit and enabling dedupe feature will end up
   with _deadlock_.


TODO:
* a bit-to-bit comparison callback.

All comments are welcome!

Hi Liu,
Thanks for doing this work.
I tested your previous patches a few months ago, and will now test the 
new ones. One question about memory requirements: are they in the same 
league as ZFS dedup (i.e. needing tens of GB of RAM for multi-TB 
filesystems), or are they more reasonable?

Thanks



[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage

v10:
- fix a typo in the subject line.
- update struct 'btrfs_ioctl_dedup_args' in the kernel side to fix
   'Inappropriate ioctl for device'.

v9:
- fix a deadlock and a crash reported by users.
- fix the metadata ENOSPC problem with dedup again.

v8:
- fix the race crash of dedup ref again.
- fix the metadata ENOSPC problem with dedup.

v7:
- rebase onto the lastest btrfs
- break a big patch into smaller ones to make reviewers happy.
- kill mount options of dedup and use ioctl method instead.
- fix two crash due to the special dedup ref

For former patch sets:
v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959

Liu Bo (16):
   Btrfs: disable qgroups accounting when quota_enable is 0
   Btrfs: introduce dedup tree and relatives
   Btrfs: introduce dedup tree operations
   Btrfs: introduce dedup state
   Btrfs: make ordered extent aware of dedup
   Btrfs: online(inband) data dedup
   Btrfs: skip dedup reference during backref walking
   Btrfs: don't return space for dedup extent
   Btrfs: add ioctl of dedup control
   Btrfs: improve the delayed refs process in rm case
   Btrfs: fix a crash of dedup ref
   Btrfs: fix deadlock of dedup work
   Btrfs: fix transactin abortion in __btrfs_free_extent
   Btrfs: fix wrong pinned bytes in __btrfs_free_extent
   Btrfs: use total_bytes instead of bytes_used for global_rsv
   Btrfs: fix dedup enospc problem

  fs/btrfs/backref.c   |   9 +
  fs/btrfs/ctree.c |   2 +-
  fs/btrfs/ctree.h |  86 ++
  fs/btrfs/delayed-ref.c   |  26 +-
  fs/btrfs/delayed-ref.h   |   3 +
  fs/btrfs/disk-io.c   |  37 +++
  fs/btrfs/extent-tree.c   | 235 +---
  fs/btrfs/extent_io.c |  22 +-
  fs/btrfs/extent_io.h |  16 ++
  fs/btrfs/file-item.c | 244 +
  fs/btrfs/inode.c | 635 ++-
  fs/btrfs/ioctl.c | 167 
  fs/btrfs/ordered-data.c  |  44 ++-
  fs/btrfs/ordered-data.h  |  13 +-
  fs/btrfs/qgroup.c|   3 +
  fs/btrfs/relocation.c|   3 +
  fs/btrfs/transaction.c   |  41 +++
  fs/btrfs/transaction.h   |   1 +
  include/trace/events/btrfs.h |   3 +-
  include/uapi/linux/btrfs.h   |  12 +
  20 files changed, 1471 insertions(+), 131 deletions(-)





Re: [RFC PATCH] Btrfs: send, add calculate data size flag to allow for progress estimation

2014-04-04 Thread Konstantinos Skarlatos

On 4/4/2014 6:20 PM, Filipe David Borba Manana wrote:

This new send flag makes send calculate first the amount of new file data (in 
bytes)
the send root has relatively to the parent root, or for the case of a 
non-incremental
send, the total amount of file data we will send through the send stream. In 
other words,
it computes the sum of the lengths of all write and clone operations that will 
be sent
through the send stream.

This data size value is sent in a new command, named 
BTRFS_SEND_C_TOTAL_DATA_SIZE, that
immediately follows a BTRFS_SEND_C_SUBVOL or BTRFS_SEND_C_SNAPSHOT command, and 
precedes
any command that changes a file or the filesystem hierarchy. Upon receiving a 
write or
clone command, the receiving end can increment a counter by the data length of 
that
command and therefore report progress by comparing the counter's value with the 
data size
value received in the BTRFS_SEND_C_TOTAL_DATA_SIZE command.

The approach is simple, before the normal operation of send, do a scan in the 
file system
tree for new inodes and file extent items, just like in send's normal 
operation, and keep
incrementing a counter with new inodes' size and the size of file extents that 
are going
to be written or cloned. This is actually a simpler and more lightweight tree 
scan/processing
than the one we do when sending the changes, as it doesn't process inode 
references nor does
any lookups in the extent tree for example.

After modifying btrfs-progs to understand this new command and report progress, 
here's an
example (the -o flag tells btrfs send to pass the new flag to the kernel's send 
ioctl):

 $ btrfs send -o /mnt/sdd/base | btrfs receive /mnt/sdc
 At subvol /mnt/sdd/base
 At subvol base
 About to receive 9211507211 bytes
 Subvolume/snapshot /mnt/sdc//base, progress  24.73%, 2278015008 bytes 
received (9211507211 total bytes)

 $ btrfs send -o -p /mnt/sdd/base /mnt/sdd/incr | btrfs receive /mnt/sdc
 At subvol /mnt/sdd/incr
 At snapshot incr
 About to receive 9211747739 bytes
 Subvolume/snapshot /mnt/sdc//incr, progress  63.42%, 5843024211 bytes 
received (9211747739 total bytes)
Hi, as a user of send I can say that this feature is very useful. Is it 
possible to also add a current speed indication (MB/sec)?
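
Until something like that exists in the tools themselves, throughput can
already be watched from userspace by putting pv(1) into the pipe. This is
only a workaround and assumes pv is installed:

 $ btrfs send -p /mnt/sdd/base /mnt/sdd/incr | pv -brat | btrfs receive /mnt/sdc

pv prints bytes transferred, current rate, average rate and elapsed time on
stderr while the stream passes through.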




Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
---
  fs/btrfs/send.c| 194 +
  fs/btrfs/send.h|   1 +
  include/uapi/linux/btrfs.h |  13 ++-
  3 files changed, 175 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index c81e0d9..fa378c7 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -81,7 +81,13 @@ struct clone_root {
  #define SEND_CTX_MAX_NAME_CACHE_SIZE 128
  #define SEND_CTX_NAME_CACHE_CLEAN_SIZE (SEND_CTX_MAX_NAME_CACHE_SIZE * 2)
  
+enum btrfs_send_phase {

+   SEND_PHASE_STREAM_CHANGES,
+   SEND_PHASE_COMPUTE_DATA_SIZE,
+};
+
  struct send_ctx {
+   enum btrfs_send_phase phase;
struct file *send_filp;
loff_t send_off;
char *send_buf;
@@ -116,6 +122,7 @@ struct send_ctx {
u64 cur_inode_last_extent;
  
  	u64 send_progress;

+   u64 total_data_size;
  
  	struct list_head new_refs;

struct list_head deleted_refs;
@@ -687,6 +694,8 @@ static int send_rename(struct send_ctx *sctx,
  {
int ret;
  
+        ASSERT(sctx->phase != SEND_PHASE_COMPUTE_DATA_SIZE);

+
   verbose_printk("btrfs: send_rename %s -> %s\n", from->start, to->start);
  
  	ret = begin_cmd(sctx, BTRFS_SEND_C_RENAME);

@@ -711,6 +720,8 @@ static int send_link(struct send_ctx *sctx,
  {
int ret;
  
+        ASSERT(sctx->phase != SEND_PHASE_COMPUTE_DATA_SIZE);

+
   verbose_printk("btrfs: send_link %s -> %s\n", path->start, lnk->start);
  
  	ret = begin_cmd(sctx, BTRFS_SEND_C_LINK);

@@ -734,6 +745,8 @@ static int send_unlink(struct send_ctx *sctx, struct 
fs_path *path)
  {
int ret;
  
+        ASSERT(sctx->phase != SEND_PHASE_COMPUTE_DATA_SIZE);

+
   verbose_printk("btrfs: send_unlink %s\n", path->start);
  
  	ret = begin_cmd(sctx, BTRFS_SEND_C_UNLINK);

@@ -756,6 +769,8 @@ static int send_rmdir(struct send_ctx *sctx, struct fs_path 
*path)
  {
int ret;
  
+        ASSERT(sctx->phase != SEND_PHASE_COMPUTE_DATA_SIZE);

+
   verbose_printk("btrfs: send_rmdir %s\n", path->start);
  
  	ret = begin_cmd(sctx, BTRFS_SEND_C_RMDIR);

@@ -2286,6 +2301,9 @@ static int send_truncate(struct send_ctx *sctx, u64 ino, 
u64 gen, u64 size)
int ret = 0;
struct fs_path *p;
  
+        if (sctx->phase == SEND_PHASE_COMPUTE_DATA_SIZE)

+                return 0;
+
   verbose_printk("btrfs: send_truncate %llu size=%llu\n", ino, size);
  
  	p = fs_path_alloc();

@@ -2315,6 +2333,8 @@ static int send_chmod(struct send_ctx *sctx, u64 ino, u64 
gen, u64 mode)
int ret = 0;
struct fs_path *p;
  
+        ASSERT(sctx->phase != SEND_PHASE_COMPUTE_DATA_SIZE);

+
   verbose_printk("btrfs: send_chmod %llu mode=%llu\n", ino, mode);
  
  	p = fs_path_alloc();

@@ 

help with btrfs device delete of a disk with errors (resent from subscribed mail)

2014-01-29 Thread Konstantinos Skarlatos
I am trying to delete a device (device 5, /dev/sdg) that has some read 
errors from a multi-device filesystem:


Label: none  uuid: f379d9aa-ddfd-4b4e-84c1-cd93d4592862
Total devices 6 FS bytes used 7.11TiB
devid1 size 1.82TiB used 1.21TiB path /dev/sda
devid2 size 1.82TiB used 1.23TiB path /dev/sdb
devid3 size 1.82TiB used 1.23TiB path /dev/sdc
devid4 size 1.82TiB used 1.23TiB path /dev/sdd
devid5 size 0.00 used 1.12TiB path /dev/sdg
devid6 size 1.82TiB used 1.23TiB path /dev/sdh

$ btrfs fi df /storage/btrfs2
Data, RAID0: total=7.07TiB, used=7.07TiB
Data, single: total=8.00MiB, used=7.94MiB
System, RAID1: total=8.00MiB, used=416.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, RAID1: total=81.00GiB, used=35.02GiB
Metadata, single: total=8.00MiB, used=0.00

btrfs: bdev /dev/sdg errs: wr 0, rd 510, flush 0, corrupt 0, gen 0

Device delete works fine until it gets to a block group that has a read 
error, then it crashes and remounts the filesystem as readonly.
I have found via btrfs inspect-internal logical-resolve the file that 
corresponds to that block group, and deleted it.

After that, btrfs inspect-internal logical-resolve returns:

ioctl ret=-1, error: No such file or directory

When I retry the device delete operation it still tries to relocate that 
same block group and crashes... Is there something else I can do to skip 
that block group and continue the device delete?
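
For reference, the block group was mapped to a file with a command along
these lines; the logical address is the one reported in the dmesg below, and
the exact output format may differ between btrfs-progs versions:

 $ btrfs inspect-internal logical-resolve -v 7349792145408 /storage/btrfs2

Deleting the file it printed and re-running "btrfs device delete /dev/sdg
/storage/btrfs2" is what leads to the crash shown below.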


my kernel is linux-3.13.0-rc6-git




[2279324.794890] btrfs: found 55688 extents
[2279325.525990] btrfs: relocating block group 7349792145408 flags 9
[2279360.657953] btrfs: found 64189 extents
[2279367.861713] [ cut here ]
[2279367.861753] WARNING: CPU: 1 PID: 29088 at 
fs/btrfs/extent-tree.c:1597 lookup_inline_extent_backref+0x1d9/0x5c0 
[btrfs]()
[2279367.861758] Modules linked in: sha256_generic btrfs raid6_pq crc32c 
libcrc32c radeon xor snd_hda_codec_hdmi snd_hda_intel snd_hda_codec 
snd_hwdep pcspkr ttm snd_pcm snd_page_alloc snd_timer snd drm_kms_helper 
edac_core sp5100_tco i2c_piix4 serio_raw k10temp soundcore edac_mce_amd 
drm evdev i2c_algo_bit r8169 i2c_core mii wmi shpchp button acpi_cpufreq 
processor ext4 crc16 mbcache jbd2 ata_generic pata_acpi sd_mod 
hid_generic usbhid hid ohci_pci ehci_pci ohci_hcd xhci_hcd pata_jmicron 
ehci_hcd ahci libahci libata scsi_mod usbcore usb_common
[2279367.861839] CPU: 1 PID: 29088 Comm: btrfs Tainted: G W 
3.13.0-rc6-git #1
[2279367.861845] Hardware name: Gigabyte Technology Co., Ltd. 
GA-890GPA-UD3H/GA-890GPA-UD3H, BIOS FD 07/23/2010
[2279367.861849]  0009 8800827f96e8 814f5648 

[2279367.861858]  8800827f9720 81061b5d 8801fad0be10 

[2279367.861866]   8800c92b1500 0009 
8800827f9730

[2279367.861873] Call Trace:
[2279367.861886]  [814f5648] dump_stack+0x4d/0x6f
[2279367.861897]  [81061b5d] warn_slowpath_common+0x7d/0xa0
[2279367.861905]  [81061c3a] warn_slowpath_null+0x1a/0x20
[2279367.861929]  [a09229d9] 
lookup_inline_extent_backref+0x1d9/0x5c0 [btrfs]
[2279367.861954]  [a0923e15] 
insert_inline_extent_backref+0x55/0xd0 [btrfs]
[2279367.861978]  [a0923f27] __btrfs_inc_extent_ref+0x97/0x200 
[btrfs]
[2279367.862003]  [a092b016] run_clustered_refs+0xb46/0x1180 
[btrfs]
[2279367.862027]  [a091a63d] ? 
generic_bin_search.constprop.34+0x8d/0x1a0 [btrfs]
[2279367.862054]  [a092f3f0] btrfs_run_delayed_refs+0xe0/0x550 
[btrfs]
[2279367.862083]  [a093fdee] 
btrfs_commit_transaction+0x4e/0x9a0 [btrfs]

[2279367.862104]  [a09acd6f] prepare_to_merge+0x1d2/0x1ed [btrfs]
[2279367.862131]  [a098d613] relocate_block_group+0x393/0x640 
[btrfs]
[2279367.862156]  [a098da62] 
btrfs_relocate_block_group+0x1a2/0x2f0 [btrfs]
[2279367.862184]  [a0965568] 
btrfs_relocate_chunk.isra.28+0x68/0x760 [btrfs]
[2279367.862207]  [a091d066] ? btrfs_search_slot+0x496/0x970 
[btrfs]
[2279367.862237]  [a095b01b] ? release_extent_buffer+0x2b/0xd0 
[btrfs]
[2279367.862265]  [a096082f] ? free_extent_buffer+0x4f/0xb0 
[btrfs]
[2279367.862294]  [a0967df9] btrfs_shrink_device+0x1e9/0x420 
[btrfs]

[2279367.862322]  [a096ab58] btrfs_rm_device+0x328/0x800 [btrfs]
[2279367.862330]  [8118b192] ? __kmalloc_track_caller+0x32/0x250
[2279367.862358]  [a0974ed0] btrfs_ioctl+0x2250/0x2d90 [btrfs]
[2279367.862366]  [811b350f] ? user_path_at_empty+0x5f/0x90
[2279367.862374]  [814ff9c4] ? __do_page_fault+0x2c4/0x5b0
[2279367.862382]  [811650b7] ? vma_link+0xb7/0xc0
[2279367.862389]  [811b58a0] do_vfs_ioctl+0x2e0/0x4c0
[2279367.862397]  [811b5b01] SyS_ioctl+0x81/0xa0
[2279367.862404]  [814ffcbe] ? do_page_fault+0xe/0x10
[2279367.862412]  [81503aad] system_call_fastpath+0x1a/0x1f

Btrfs send 4-5 times slower than rsync on local

2014-01-27 Thread Konstantinos Skarlatos
Hello, I am using btrfs send to copy a snapshot to another btrfs 
filesystem on the same machine, and it reaches a maximum speed of 
30-35MByte/sec.
Incredibly, rsync is much faster, at 120-140MB/sec. The source btrfs is a 
5x2TB RAID 0 and the target is a single 4TB disk.


mount options: rw,noatime,compress-force=zlib,space_cache
kernel is linux-3.13.0-rc6-git and btrfs tools is built from git at 
about the same time linux-3.13.0-rc6 was released


Finally, is there a way to resume an interrupted send?
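
For what it's worth, two partial workarounds (neither is a real fix, the
paths are placeholders, and mbuffer is an extra package):

 # smooth out the bursty send/receive pipe with a large in-memory buffer
 $ btrfs send /mnt/src/snap | mbuffer -m 2G | btrfs receive /mnt/dst

 # or dump the stream to a file first, so a failed receive can be retried
 # without repeating the whole send
 $ btrfs send -f /mnt/scratch/snap.stream /mnt/src/snap
 $ btrfs receive -f /mnt/scratch/snap.stream /mnt/dst

A true resume of a half-finished send stream is not possible with the
current tools.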



Re: [RFC PATCH v8 00/14] Online(inband) data deduplication

2014-01-02 Thread Konstantinos Skarlatos
Sorry for the spam, I just mixed up the order of your patches. They now 
apply cleanly to 3.13 git.

Thanks

On 2/1/2014 4:32 PM, Konstantinos Skarlatos wrote:
Hello, I am trying to test your patches and they do not apply to 
latest 3.12 source or 3.13 git. Am I doing something wrong?


---logs for 3.12---

Hunk #1 succeeded at 59 with fuzz 2 (offset 1 line).
patching file init/Kconfig
Hunk #1 succeeded at 1085 (offset 96 lines).
Hunk #2 succeeded at 1096 (offset 96 lines).
patching file fs/btrfs/ctree.h
Hunk #1 FAILED at 3692.
1 out of 1 hunk FAILED -- saving rejects to file fs/btrfs/ctree.h.rej
patching file fs/btrfs/extent-tree.c
Hunk #1 FAILED at 5996.
Hunk #2 FAILED at 6023.
2 out of 2 hunks FAILED -- saving rejects to file 
fs/btrfs/extent-tree.c.rej

patching file fs/btrfs/file-item.c
Hunk #1 FAILED at 887.
Hunk #2 succeeded at 765 with fuzz 2 (offset -151 lines).
Hunk #3 FAILED at 978.
Hunk #4 FAILED at 1061.
Hunk #5 FAILED at 1094.
4 out of 5 hunks FAILED -- saving rejects to file 
fs/btrfs/file-item.c.rej

patching file fs/btrfs/inode.c
Hunk #1 FAILED at 969.
Hunk #2 FAILED at 2364.
2 out of 2 hunks FAILED -- saving rejects to file fs/btrfs/inode.c.rej

---logs for 3.13---
Hunk #1 succeeded at 59 with fuzz 2 (offset 1 line).
patching file init/Kconfig
Hunk #1 succeeded at 1078 (offset 89 lines).
Hunk #2 succeeded at 1089 (offset 89 lines).
patching file fs/btrfs/ctree.h
Hunk #1 FAILED at 3692.
1 out of 1 hunk FAILED -- saving rejects to file fs/btrfs/ctree.h.rej
patching file fs/btrfs/extent-tree.c
Hunk #1 FAILED at 5996.
Hunk #2 FAILED at 6023.
2 out of 2 hunks FAILED -- saving rejects to file 
fs/btrfs/extent-tree.c.rej

patching file fs/btrfs/file-item.c
Hunk #1 FAILED at 887.
Hunk #2 succeeded at 768 with fuzz 2 (offset -148 lines).
Hunk #3 FAILED at 978.
Hunk #4 FAILED at 1061.
Hunk #5 FAILED at 1094.
4 out of 5 hunks FAILED -- saving rejects to file 
fs/btrfs/file-item.c.rej

patching file fs/btrfs/inode.c
Hunk #1 FAILED at 969.
Hunk #2 FAILED at 2364.
2 out of 2 hunks FAILED -- saving rejects to file fs/btrfs/inode.c.rej


On 30/12/2013 10:12 AM, Liu Bo wrote:

Hello,

Here is the New Year patch bomb :-)

Data deduplication is a specialized data compression technique for 
eliminating

duplicate copies of repeating data.[1]

This patch set is also related to Content based storage in project 
ideas[2],
it introduces inband data deduplication for btrfs and dedup/dedupe is 
for short.


PATCH 1 is a hang fix with deduplication on, but it's also useful 
without

dedup in practice use.

PATCH 2 and 3 are targetting delayed refs' scalability problems, 
which are

uncovered by the dedup feature.

PATCH 4 is a speed-up improvement, which is about dedup and quota.

PATCH 5-8 is the preparation work for dedup implementation.

PATCH 9 shows how we implement dedup feature.

PATCH 10 fixes a backref walking bug with dedup.

PATCH 11 fixes a free space bug of dedup extents on error handling.

PATCH 12 adds the ioctl to control dedup feature.

PATCH 13 fixes the metadata ENOSPC problem with dedup which has been 
there

WAY TOO LONG.

PATCH 14 fixes a race bug on dedup writes.

And there is also a btrfs-progs patch(PATCH 15) which offers all 
details about

how to control the dedup feature.

I've tested this with xfstests by adding an inline dedup 'enable & on' 
in xfstests'

mount and scratch_mount.

TODO:
* a bit-to-bit comparison callback.

All comments are welcome!


[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: 
https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage


v8:
- fix the race crash of dedup ref again.
- fix the metadata ENOSPC problem with dedup.

v7:
- rebase onto the lastest btrfs
- break a big patch into smaller ones to make reviewers happy.
- kill mount options of dedup and use ioctl method instead.
- fix two crash due to the special dedup ref

For former patch sets:
v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959

Liu Bo (14):
   Btrfs: skip merge part for delayed data refs
   Btrfs: improve the delayed refs process in rm case
   Btrfs: introduce a head ref rbtree
   Btrfs: disable qgroups accounting when quata_enable is 0
   Btrfs: introduce dedup tree and relatives
   Btrfs: introduce dedup tree operations
   Btrfs: introduce dedup state
   Btrfs: make ordered extent aware of dedup
   Btrfs: online(inband) data dedup
   Btrfs: skip dedup reference during backref walking
   Btrfs: don't return space for dedup extent
   Btrfs: add ioctl of dedup control
   Btrfs: fix dedupe 'ENOSPC' problem
   Btrfs: fix a crash of dedup ref

  fs/btrfs/backref.c   |   9 +
  fs/btrfs/ctree.c |   2 +-
  fs/btrfs/ctree.h |  86 ++
  fs/btrfs/delayed-ref.c   | 161

Re: [PATCH] BTRFS-PROG: recursively subvolume snapshot and delete

2013-11-27 Thread Konstantinos Skarlatos

On 26/11/2013 7:44 PM, Goffredo Baroncelli wrote:

On 2013-11-26 16:12, Konstantinos Skarlatos wrote:

On 25/11/2013 11:23 μμ, Goffredo Baroncelli wrote:

Hi all,

nobody is interested in these new features ?

Is this ZFS-style recursive snapshotting? If yes, i am interested, and
thanks for your great work :)

No it is not equal. My recursive snapshotting is not atomic as the ZFS
one; every subvolume snapshot is atomic, but each snapshot is taken at
different time.

For my use case that is not a problem, but others may disagree


BR
G.Baroncelli


On 2013-11-16 18:09, Goffredo Baroncelli wrote:

Hi All,

the following patches implement the recursively snapshotting and
deleting of a subvolume.

To snapshot recursively you must pass the -R switch:

# btrfs subvolume create sub1
Create subvolume './sub1'
# btrfs subvolume create sub1/sub2
Create subvolume 'sub1/sub2'

# btrfs subvolume snapshot -R sub1 sub1-snap
Create a snapshot of 'sub1' in './sub1-snap'
Create a snapshot of 'sub1/sub2' in './sub1-snap/sub2'

To recursively delete subvolumes, you must pass the switch '-R':

# btrfs subvolume create sub1
Create subvolume './sub1'
# btrfs subvolume create sub1/sub2
Create subvolume 'sub1/sub2'

# btrfs subvolume delete -R sub1
Delete subvolume '/root/sub1/sub2'
Delete subvolume '/root/sub1'


Some caveats:
1) the recursively behaviour need the root capability
This because how the subvolume are discovered

2) it is not possible to recursively snapshot a subvolume
in read-only mode
This because when a subvolume is snapshotted, its
nested subvolumes appear as directory in the snapshot.
These directories are removed before snapshotting the
nested subvolumes. This is incompatible with a read
only subvolume.

BR
G.Baroncelli










Btrfs-tools build instructions for Centos

2013-11-26 Thread Konstantinos Skarlatos

Hello,
in https://btrfs.wiki.kernel.org/index.php/Btrfs_source_repositories, I 
used the Fedora instructions for CentOS.
The problem is that lzo2-devel is named lzo-devel on CentOS, so if 
somebody follows the Fedora instructions and doesn't notice that 
lzo2-devel is missing, the btrfs-progs build will fail with "/usr/bin/ld: 
cannot find -llzo2". The solution is to install lzo-devel instead. Can 
this be added to the wiki?




Re: [PATCH] BTRFS-PROG: recursively subvolume snapshot and delete

2013-11-26 Thread Konstantinos Skarlatos

On 25/11/2013 11:23 PM, Goffredo Baroncelli wrote:

Hi all,

nobody is interested in these new features ?
Is this ZFS-style recursive snapshotting? If yes, I am interested, and 
thanks for your great work :)


On 2013-11-16 18:09, Goffredo Baroncelli wrote:

Hi All,

the following patches implement the recursively snapshotting and
deleting of a subvolume.

To snapshot recursively you must pass the -R switch:

# btrfs subvolume create sub1
Create subvolume './sub1'
# btrfs subvolume create sub1/sub2
Create subvolume 'sub1/sub2'

# btrfs subvolume snapshot -R sub1 sub1-snap
Create a snapshot of 'sub1' in './sub1-snap'
Create a snapshot of 'sub1/sub2' in './sub1-snap/sub2'

To recursively delete subvolumes, you must pass the switch '-R':

# btrfs subvolume create sub1
Create subvolume './sub1'
# btrfs subvolume create sub1/sub2
Create subvolume 'sub1/sub2'

# btrfs subvolume delete -R sub1
Delete subvolume '/root/sub1/sub2'
Delete subvolume '/root/sub1'


Some caveats:
1) the recursively behaviour need the root capability
This because how the subvolume are discovered

2) it is not possible to recursively snapshot a subvolume
in read-only mode
This because when a subvolume is snapshotted, its
nested subvolumes appear as directory in the snapshot.
These directories are removed before snapshotting the
nested subvolumes. This is incompatible with a read
only subvolume.

BR
G.Baroncelli








Dedup on read-only snapshots

2013-11-26 Thread Konstantinos Skarlatos

According to https://github.com/g2p/bedup/tree/wip/dedup-syscall:
"The clone call is considered a write operation and won't work on
read-only snapshots."

Is this fixed in newer kernels?



btrfs filesystems can only be mounted after an unclean shutdown if btrfsck is run and immediately killed!

2012-06-08 Thread Konstantinos Skarlatos

Hi all,
I have two multi-disk btrfs filesystems on an Arch Linux 3.4.0 system. 
After a power failure, both filesystems refuse to mount:


[   10.402284] Btrfs loaded
[   10.402714] device fsid 1e7c18a4-02d6-44b1-8eaf-c01378009cd3 devid 4 
transid 65282 /dev/sdc

[   10.403108] btrfs: force zlib compression
[   10.403130] btrfs: enabling inode map caching
[   10.403152] btrfs: disk space caching is enabled
[   10.403377] btrfs: failed to read the system array on sdc
[   10.403557] btrfs: open_ctree failed
[   10.431763] device fsid 7f7be913-e359-400f-8bdb-7ef48aad3f03 devid 2 
transid 3916 /dev/sdb

[   10.432180] btrfs: force zlib compression
[   10.433040] btrfs: enabling inode map caching
[   10.433892] btrfs: disk space caching is enabled
[   10.434930] btrfs: failed to read the system array on sdb
[   10.435945] btrfs: open_ctree failed


fstab:

UUID=1e7c18a4-02d6-44b1-8eaf-c01378009cd3 /storage/btrfs btrfs 
noatime,compress-force=zlib,space_cache,inode_cache 0 0
UUID=7f7be913-e359-400f-8bdb-7ef48aad3f03 /storage/btrfs2 btrfs 
noatime,compress-force=zlib,space_cache,inode_cache 0 0



The funny thing is that if I run btrfsck for one second on the first 
filesystem and then kill it with Ctrl-C, then both filesystems can be 
mounted without any problems!


I have had this problem for many months, probably with all 3.x kernels and 
maybe a bit older, and with all git btrfs tools since at least late last year.


 [root@linuxserver ~/btrfs-progs]# btrfs fi show /dev/sdb
Label: none  uuid: 7f7be913-e359-400f-8bdb-7ef48aad3f03
Total devices 2 FS bytes used 1.54TB
devid1 size 1.82TB used 1.04TB path /dev/sda
devid2 size 1.82TB used 1.04TB path /dev/sdb

Btrfs Btrfs v0.19
 [root@linuxserver ~/btrfs-progs]# btrfs fi show /dev/sdf
Label: none  uuid: 1e7c18a4-02d6-44b1-8eaf-c01378009cd3
Total devices 4 FS bytes used 4.33TB
devid5 size 1.82TB used 1.82TB path /dev/sdg
devid4 size 1.82TB used 1.82TB path /dev/sdc
devid3 size 1.82TB used 1.79TB path /dev/sdf
devid1 size 1.82TB used 1.82TB path /dev/sdd

Btrfs Btrfs v0.19


Re: btrfs filesystems can only be mounted after an unclean shutdown if btrfsck is run and immediately killed!

2012-06-08 Thread Konstantinos Skarlatos

On Friday, 8 June 2012 11:28:39 AM, Tomasz Torcz wrote:

On Fri, Jun 08, 2012 at 11:26:21AM +0300, Konstantinos Skarlatos wrote:

Hi all,
I have two multi-disk btrfs filesystems on a Arch linux 3.4.0
system. After a power failure, both filesystems refuse to mount


   Multi-device filesystem had to be first fully discovered by
btrfs device scan.  It is typically done from udev rules. Also,
dracut does it in initramfs for quite a long time.


(Added cc to btrfs list)

You are right, I had forgotten to enable it (Arch Linux has a new 
rc.conf option for that); I will reboot in a few minutes to test it.
Maybe it would be prudent to give a better error message when such a 
thing happens, or even have mount run btrfs device scan if it detects 
that a multi-drive fs is being mounted?
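
For anyone else hitting the same "failed to read the system array" error on
a multi-device filesystem, the manual recovery is simply to register the
devices and mount again (assuming btrfs-progs is installed; mount points as
in the fstab from the original report):

 # btrfs device scan
 # mount /storage/btrfs
 # mount /storage/btrfs2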



Re: cross-subvolume cp --reflink

2012-04-01 Thread Konstantinos Skarlatos

On Sunday, 1 April 2012 8:07:54 PM, Norbert Scheibner wrote:

On: Sun, 01 Apr 2012 19:45:13 +0300 Konstantinos Skarlatos wrote



That's my point. This poor man's dedupe would solve my problems here

very well. I don't need a zfs-variant of dedupe. I can implement such a
file-based dedupe with userland tools and would be happy.

do you have any scripts that can search a btrfs filesystem for dupes
and replace them with cp --reflink?


Nothing really working and tested very well. After I get to known the missing 
cp --reflink feature I stopped to develop the script any further.

I use btrfs for my backups. Ones a day I rsync --delete --inplace the complete 
system to a subvolume, snapshot it, delete some tempfiles in the snapshot.


In my setup I rsync --inplace many servers and workstations, 4-6 times 
a day into a 12TB btrfs volume, each one in its own subvolume. After 
every backup a new ro snapshot is created.
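
For anyone wanting to reproduce that kind of setup, the per-host cycle boils
down to something like this sketch; the host name, paths and snapshot naming
are placeholders, and the rsync flags are just the usual archive options:

 host=server1
 # current/ must itself be a btrfs subvolume; in practice you would also
 # exclude /proc, /sys, /dev and friends
 rsync -aHAX --inplace --delete "$host":/ /mnt/backup/"$host"/current/
 btrfs subvolume snapshot -r /mnt/backup/"$host"/current \
     /mnt/backup/"$host"/$(date +%Y%m%d-%H%M)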


I have many cross-subvolume duplicate files (OS files, programs, many 
huge media files that are copied locally from the servers to the 
workstations etc), so a good dedupe script could save lots of space, 
and allow me to keep snapshots for much longer.




In addition to that I wanted to shrink file-duplicates.

What the script should do:
1. I md5sum every file
2. If the checksums are identical, I compare the files
3. If 2 or more files are really identical:
- move one to a temp-dir
- cp --reflink the second to the position and name of the first
- do a chown --reference, chmod --reference and touch --reference
  to copy owner, file mode bits and time from the original to the
  reflink-copy and then delete the original in temp-dir

Everything could be done with bash. One could also think about using a database for the 
md5sums, which could then be used for other purposes in the future.
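
As a very rough sketch of those steps (untested, assumes bash 4 and GNU 
coreutils, has no locking, and only helps within one subvolume as long as 
cross-subvolume reflinks are not allowed):

  #!/bin/bash
  # naive offline dedupe: group files by md5sum, verify byte-for-byte, then reflink
  declare -A seen
  while IFS= read -r -d '' f; do
      sum=$(md5sum "$f" | cut -d' ' -f1)
      orig=${seen[$sum]}
      if [ -z "$orig" ]; then
          seen[$sum]=$f
      elif cmp -s "$orig" "$f"; then
          tmp=$(mktemp --tmpdir="$(dirname "$f")")
          cp --reflink=always "$orig" "$tmp"    # the new copy shares extents with $orig
          chown --reference="$f" "$tmp"
          chmod --reference="$f" "$tmp"
          touch --reference="$f" "$tmp"
          mv "$tmp" "$f"                        # replace the duplicate with the reflinked copy
      fi
  done < <(find "$1" -type f -print0)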





Re: cross-subvolume cp --reflink

2012-04-01 Thread Konstantinos Skarlatos

On 1/4/2012 9:39 PM, Norbert Scheibner wrote:

On: Sun, 01 Apr 2012 19:22:42 +0200, Klaus A. Kreil wrote



I am just an interested reader on the btrfs list and so far have never
posted or sent a message to the list, but I do have a dedup bash script
that searches for duplicates underneath a directory (provided as an
argument) and hard links identical files.

It works very well for an ext3 filesystem, but I guess the basics should
be the same for a btrfs filesystem.

Thanks for the nice script, it works fine here!
I just added a du -sh $1 line at the beginning and end to see how much 
space it saves.


Everyone feel free to correct me here, but:
At the moment there is a little problem with the maximum number of hard links 
in a directory, so I would avoid them wherever possible to stay clear of any 
conceivable problems in the near future.

Plus, hard linking 2 files means that if you change one file you change the other one. 
That is either something you don't want to happen, or something which could be 
done in better ways. The cp --reflink method on a COW fs is a much smarter 
approach.
That's true, cp --reflink is much better. Also, am I wrong that btrfs has 
a limitation on the number of hard links that can only be fixed with a 
disk format change?
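
Just to illustrate the difference with a toy example (file names made up):

  echo hello > a
  ln a a.hard                        # same inode: writing through either name changes both
  cp --reflink=always a a.reflink    # separate inode, data extents shared copy-on-write
  echo world >> a.reflink            # only a.reflink changes, a stays untouched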


Plus, hard links across subvolumes would match the case of hard links across 
devices on a traditional fs, which is forbidden.

Plus, hard links in my opinion should really be substituted by soft links, 
because hard links are not transparent at first sight and cannot be copied 
as such.

So no, I'd rather want the patch to allow cross-subvolume cp --reflink in the 
kernel and I will wait for that to happen.

Greetings
Norbert




Re: cross-subvolume cp --reflink

2012-04-01 Thread Konstantinos Skarlatos

On 1/4/2012 9:11 PM, Norbert Scheibner wrote:

On: Sun, 01 Apr 2012 20:19:24 +0300 Konstantinos Skarlatos wrote



I use btrfs for my backups. Once a day I rsync --delete --inplace the
complete system to a subvolume, snapshot it, and delete some tempfiles
in the snapshot.

In my setup I rsync --inplace many servers and workstations, 4-6
times a day into a 12TB btrfs volume, each one in its own
subvolume. After every backup a new ro snapshot is created.

I have many cross-subvolume duplicate files (OS files, programs,
many huge media files that are copied locally from the servers to
the workstations etc), so a good dedupe script could save lots of
space, and allow me to keep snapshots for much longer.


So the script should be optimized not to try to deduplicate the whole
fs every time, but only the newly written files. You could take such a file
list out of the rsync output or from the btrfs subvolume find-new
command.


A cron task with btrfs subvolume find-new would be ideal, I think.
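
Something along these lines (paths are made up and the generation handling 
is simplified) could feed only the changed files to the dedupe pass:

  subvol=/mnt/backup/server1
  gen_file=/var/lib/dedupe/last-gen
  last=$(cat "$gen_file" 2>/dev/null || echo 0)
  # list everything written since generation $last
  btrfs subvolume find-new "$subvol" "$last" > /tmp/new-files.txt
  # the last line reports the current transid; remember it for the next run
  awk '/transid marker/ {print $NF}' /tmp/new-files.txt > "$gen_file"
  # the path fields in /tmp/new-files.txt are the candidates for deduplication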

Even without the reflink patch, you could use such a bash script inside one
subvolume, after the rsync and before the snapshot. I don't know how
much space it would save for you in this situation, but it's worth a try
and a good way to develop such a script, because before you write
anything to disk you can see how many duplicates there are and how
much space could be freed.

Regards, Norbert




Re: [PATCH 0/2] btrfs: allow cross-subvolume BTRFS_IOC_CLONE

2012-01-06 Thread Konstantinos Skarlatos

On 22/12/2011 2:24 PM, Chris Samuel wrote:

Christoph,

On Sat, 2 Apr 2011 12:40:11 AM Chris Mason wrote:


Excerpts from Christoph Hellwig's message of 2011-04-01 09:34:05

-0400:



I don't think it's a good idea to introduce any user visible
operations over subvolume boundaries.  Currently we don't have
any operations over mount boundaries, which is pretty
fumdamental to the unix filesystem semantics.  If you want to
change this please come up with a clear description of the
semantics and post it to linux-fsdevel for discussion.  That of
course requires a clear description of the btrfs subvolumes,
which is still completely missing.


The subvolume is just a directory tree that can be snapshotted, and
has its own private inode number space.

reflink across subvolumes is no different from copying a file from
one subvolume to another at the VFS level.  The src and
destination are different files and different inodes, they just
happen to share data extents.


Were Chris Mason's points above enough to sway your opposition to this
functionality/patch?

There is demand for the ability to move data between subvolumes
without needing to copy the extents themselves, it's cropped up again
on the list in recent days.

It seems a little hard (and counterintuitive) to enforce a wasteful
use of resources to copy data between different parts of the same
filesystem which happen to be on a different subvolume, when it's
permitted and functional on the same filesystem within the same subvolume.

I don't dispute the comment about documentation on subvolumes though,
there is a short discussion of them on the btrfs wiki in the sysadmins
guide, but not really a lot of detail. :-)

All the best,
Chris


I too want cp --reflink across subvolumes. Please make this feature 
available to us, as it's a poor man's dedupe and would give big space 
savings for many use cases.



Status of dedupe in btrfs

2012-01-05 Thread Konstantinos Skarlatos

Hello everyone,

I was reading this article on Slashdot about dedupe [1] and I was 
wondering about the status of the (offline) dedupe patches for btrfs. Are 
they applicable to a recent kernel? Do the userspace tools support them?


Kind regards


[1] 
http://sk.slashdot.org/story/12/01/04/1955248/ask-slashdot-freeopen-deduplication-software



Btrfs: blocked for more than 120 seconds, made worse by 3.2 rc7

2011-12-28 Thread Konstantinos Skarlatos

Hello all:
I have two machines with btrfs that give me the "blocked for more than 
120 seconds" message. After that I cannot write anything to disk, I am 
unable to unmount the btrfs filesystem, and I can only reboot with 
sysrq-trigger.


It always happens when I write many files with rsync over the network. When 
I used 3.2rc6 it happened randomly on both machines after 50-500GB of 
writes. With rc7 it happens after far fewer writes, probably 10GB or so, 
but only on machine 1 for the time being. Machine 2 has not crashed yet 
after 200GB of writes and I am still testing that.


Machine 1: btrfs on a 6TB sparse file, mounted as loop, on an XFS 
filesystem that lies on a 10TB md RAID 5. Mount options: 
compress=zlib,compress-force.


Machine 2: btrfs over md RAID 5 (4x2TB) = 5.5TB filesystem. Mount options: 
compress=zlib,compress-force.
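
For what it's worth, a layout like machine 1's can be reproduced with 
something like this (paths are examples):

  truncate -s 6T /xfs/btrfs-backing.img      # sparse file on the XFS filesystem
  losetup /dev/loop0 /xfs/btrfs-backing.img
  mkfs.btrfs /dev/loop0
  mount -o compress=zlib,compress-force /dev/loop0 /mnt/btrfs-loop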


pastebins:

machine1:
3.2rc7 http://pastebin.com/u583G7jK
3.2rc6 http://pastebin.com/L12TDaXa

machine2:
3.2rc6 http://pastebin.com/khD0wGXx
3.2rc7 (not crashed yet)


Re: Btrfs: blocked for more than 120 seconds, made worse by 3.2 rc7

2011-12-28 Thread Konstantinos Skarlatos

Well now machine2 has just crashed too...
http://pastebin.com/gvfUm0az

On Wednesday, 28 December 2011 9:26:07 PM, Konstantinos Skarlatos wrote:

Hello all:
I have two machines with btrfs, that give me the blocked for more 
than 120 seconds message. After that I cannot write anything to disk, 
i am unable to unmount the btrfs filesystem and i can only reboot with 
sysrq-trigger.


It always happens when i write many files with rsync over network. 
When i used 3.2rc6 it happened randomly on both machines after 
50-500gb of writes. with rc7 it happens after much less writes, 
probably 10gb or so, but only on machine 1 for the time being. machine 
2 has not crashed yet after 200gb of writes and I am still testing that.


machine 1: btrfs on a 6tb sparse file, mounted as loop, on a xfs 
filesystem that lies on a 10TB md raid5. mount options 
compress=zlib,compress-force


machine 2: btrfs over md raid 5 (4x2TB)=5.5TB filesystem. mount 
options compress=zlib,compress-force


pastebins:

machine1:
3.2rc7 http://pastebin.com/u583G7jK
3.2rc6 http://pastebin.com/L12TDaXa

machine2:
3.2rc6 http://pastebin.com/khD0wGXx
3.2rc7 (not crashed yet)



Re: Btrfs: blocked for more than 120 seconds, made worse by 3.2 rc7

2011-12-28 Thread Konstantinos Skarlatos

On Wednesday, 28 December 2011 11:48:32 PM, Dave Chinner wrote:

On Wed, Dec 28, 2011 at 09:26:07PM +0200, Konstantinos Skarlatos wrote:

Hello all:
I have two machines with btrfs, that give me the blocked for more
than 120 seconds message. After that I cannot write anything to
disk, i am unable to unmount the btrfs filesystem and i can only
reboot with sysrq-trigger.

It always happens when i write many files with rsync over network.
When i used 3.2rc6 it happened randomly on both machines after
50-500gb of writes. with rc7 it happens after much less writes,
probably 10gb or so, but only on machine 1 for the time being.
machine 2 has not crashed yet after 200gb of writes and I am still
testing that.

machine 1: btrfs on a 6tb sparse file, mounted as loop, on a xfs
filesystem that lies on a 10TB md raid5. mount options
compress=zlib,compress-force

machine 2: btrfs over md raid 5 (4x2TB)=5.5TB filesystem. mount
options compress=zlib,compress-force

pastebins:

machine1:
3.2rc7 http://pastebin.com/u583G7jK
3.2rc6 http://pastebin.com/L12TDaXa


These two are caused by it taking longer than 120s for XFS to fsync
the loop file. Writing a significant chunk of a sparse 6TB file on a
software RAID5  volume is going to take some time.  However, if IO
is not occurring, then somewhere below XFS an IO has gone missing
(MD or hardware problem) because the fsync on the XFS file is
blocked waiting for an IO completion.


machine2:
3.2rc6 http://pastebin.com/khD0wGXx
3.2rc7 (not crashed yet)

Crashed a few hours ago, here is the rc7 pastebin
http://pastebin.com/gvfUm0az 


These don't have XFS in the picture, but also appear to be hung
waiting on IO completion with MD stuck in
make_request()->get_active_stripe(). That, to me, indicates an MD
problem.


Added the linux-raid mailing list to CC.
Please reply to me too, because I am not subscribed.


Cheers,

Dave.



Re: Blocked for more than 120 seconds

2011-12-04 Thread Konstantinos Skarlatos
Even more kernel messages from btrfs crashing when rsyncing large 
amounts of data on 3.2rc4:



Dec  3 15:12:14 mail kernel: [15481.100564] loop0   D 
00010044b6c5 0  1729  2 0x
Dec  3 15:12:14 mail kernel: [15481.101550]  8801f9b31b30 
0046  
Dec  3 15:12:14 mail kernel: [15481.102548]  880200950e40 
8801f9b31fd8 8801f9b31fd8 8801f9b31fd8
Dec  3 15:12:14 mail kernel: [15481.103539]  880202cb7200 
880200950e40 0002 8801f9b31b78

Dec  3 15:12:14 mail kernel: [15481.104533] Call Trace:
Dec  3 15:12:14 mail kernel: [15481.105531]  [81101a55] ? 
find_get_pages_tag+0x125/0x150
Dec  3 15:12:14 mail kernel: [15481.106541]  [8110e205] ? 
pagevec_lookup_tag+0x25/0x40
Dec  3 15:12:14 mail kernel: [15481.107552]  [8101d639] ? 
read_tsc+0x9/0x20
Dec  3 15:12:14 mail kernel: [15481.108576]  [8108f14d] ? 
ktime_get_ts+0xad/0xe0
Dec  3 15:12:14 mail kernel: [15481.109592]  [81101d60] ? 
__lock_page+0x70/0x70
Dec  3 15:12:14 mail kernel: [15481.110607]  [814140bf] 
schedule+0x3f/0x60
Dec  3 15:12:14 mail kernel: [15481.111619]  [8141416f] 
io_schedule+0x8f/0xd0
Dec  3 15:12:14 mail kernel: [15481.112641]  [81101d6e] 
sleep_on_page+0xe/0x20
Dec  3 15:12:14 mail kernel: [15481.113639]  [8141491f] 
__wait_on_bit+0x5f/0x90
Dec  3 15:12:14 mail kernel: [15481.114629]  [81101f58] 
wait_on_page_bit+0x78/0x80
Dec  3 15:12:14 mail kernel: [15481.115628]  [81085790] ? 
autoremove_wake_function+0x40/0x40
Dec  3 15:12:14 mail kernel: [15481.116614]  [811020cc] 
filemap_fdatawait_range+0x10c/0x1a0
Dec  3 15:12:14 mail kernel: [15481.117613]  [811030c8] 
filemap_write_and_wait_range+0x68/0x80
Dec  3 15:12:14 mail kernel: [15481.118630]  [a03a7234] 
xfs_file_fsync+0x54/0x340 [xfs]
Dec  3 15:12:14 mail kernel: [15481.119629]  [8119148b] 
vfs_fsync+0x2b/0x40
Dec  3 15:12:14 mail kernel: [15481.120627]  [a04dacf2] 
do_bio_filebacked+0x1b2/0x320 [loop]
Dec  3 15:12:14 mail kernel: [15481.121645]  [a050efac] ? 
end_workqueue_bio+0x9c/0xa0 [btrfs]
Dec  3 15:12:14 mail kernel: [15481.122668]  [a04daf1b] 
loop_thread+0xbb/0x260 [loop]
Dec  3 15:12:14 mail kernel: [15481.123674]  [81085750] ? 
abort_exclusive_wait+0xb0/0xb0
Dec  3 15:12:14 mail kernel: [15481.124676]  [a04dae60] ? 
do_bio_filebacked+0x320/0x320 [loop]
Dec  3 15:12:14 mail kernel: [15481.125698]  [81084e0c] 
kthread+0x8c/0xa0
Dec  3 15:12:14 mail kernel: [15481.126710]  [81419a34] 
kernel_thread_helper+0x4/0x10
Dec  3 15:12:14 mail kernel: [15481.127721]  [81084d80] ? 
kthread_worker_fn+0x190/0x190
Dec  3 15:12:14 mail kernel: [15481.128742]  [81419a30] ? 
gs_change+0x13/0x13
Dec  3 15:12:14 mail kernel: [15481.131702] btrfs-transacti D 
8801f9ab7200 0  1756  2 0x
Dec  3 15:12:14 mail kernel: [15481.132723]  8801e7533bc0 
0046 88020fc93400 0002
Dec  3 15:12:14 mail kernel: [15481.133744]  8801f9ab7200 
8801e7533fd8 8801e7533fd8 8801e7533fd8
Dec  3 15:12:14 mail kernel: [15481.134771]  880200950e40 
8801f9ab7200 8801e7533b10 81051ae2

Dec  3 15:12:14 mail kernel: [15481.135813] Call Trace:
Dec  3 15:12:14 mail kernel: [15481.136828]  [8105ad36] ? 
ttwu_do_activate.constprop.172+0x66/0x70
Dec  3 15:12:14 mail kernel: [15481.137863]  [8105bd6e] ? 
try_to_wake_up+0x1de/0x290
Dec  3 15:12:14 mail kernel: [15481.138914]  [814140bf] 
schedule+0x3f/0x60
Dec  3 15:12:14 mail kernel: [15481.139956]  [814147d5] 
schedule_timeout+0x305/0x390
Dec  3 15:12:14 mail kernel: [15481.141007]  [8104d003] ? 
__wake_up+0x53/0x70
Dec  3 15:12:14 mail kernel: [15481.142074]  [81413348] 
wait_for_common+0xc8/0x160
Dec  3 15:12:14 mail kernel: [15481.143124]  [8105be20] ? 
try_to_wake_up+0x290/0x290
Dec  3 15:12:14 mail kernel: [15481.144170]  [814133fd] 
wait_for_completion+0x1d/0x20
Dec  3 15:12:14 mail kernel: [15481.145229]  [a050f0bb] 
write_dev_flush+0x4b/0x140 [btrfs]
Dec  3 15:12:14 mail kernel: [15481.146275]  [a0511086] 
write_all_supers+0x6f6/0x800 [btrfs]
Dec  3 15:12:14 mail kernel: [15481.147317]  [a05111a3] 
write_ctree_super+0x13/0x20 [btrfs]
Dec  3 15:12:14 mail kernel: [15481.148354]  [a05164dd] 
btrfs_commit_transaction+0x63d/0x880 [btrfs]
Dec  3 15:12:14 mail kernel: [15481.149397]  [81085750] ? 
abort_exclusive_wait+0xb0/0xb0
Dec  3 15:12:14 mail kernel: [15481.150416]  [a0516b74] ? 
start_transaction+0x94/0x2b0 [btrfs]
Dec  3 15:12:14 mail kernel: [15481.151444]  [a050ed4d] 
transaction_kthread+0x26d/0x290 [btrfs]
Dec  3 15:12:14 mail kernel: [15481.152492]  [a050eae0] ? 
btrfs_congested_fn+0xd0/0xd0 [btrfs]
Dec  3 15:12:14 mail kernel: [15481.153519]  

Re: Blocked for more than 120 seconds

2011-12-03 Thread Konstantinos Skarlatos
] schedule+0x3f/0x60
[15601.348711]  [8141416f] io_schedule+0x8f/0xd0
[15601.348714]  [81101d6e] sleep_on_page+0xe/0x20
[15601.348716]  [8141491f] __wait_on_bit+0x5f/0x90
[15601.348719]  [81101f58] wait_on_page_bit+0x78/0x80
[15601.348722]  [81085790] ? 
autoremove_wake_function+0x40/0x40
[15601.348725]  [81102845] 
grab_cache_page_write_begin+0x95/0xe0
[15601.348732]  [a03a1150] ? xfs_get_blocks_direct+0x20/0x20 
[xfs]

[15601.348736]  [811967b8] block_write_begin+0x38/0xa0
[15601.348743]  [a03a1213] xfs_vm_write_begin+0x43/0x70 [xfs]
[15601.348746]  [8110233c] 
generic_file_buffered_write+0x10c/0x270

[15601.348754]  [a03aad66] ? xfs_iunlock+0x116/0x180 [xfs]
[15601.348761]  [a03a7fef] 
xfs_file_buffered_aio_write+0x10f/0x200 [xfs]
[15601.348768]  [a03a8252] xfs_file_aio_write+0x172/0x2a0 
[xfs]

[15601.348772]  [81162d62] do_sync_write+0xd2/0x110
[15601.348775]  [811f0fcc] ? 
security_file_permission+0x2c/0xb0

[15601.348778]  [81163311] ? rw_verify_area+0x61/0xf0
[15601.348781]  [8116366f] vfs_write+0xaf/0x180
[15601.348784]  [81163b12] sys_pwrite64+0x82/0xb0
[15601.348787]  [814178c2] system_call_fastpath+0x16/0x1b


On Saturday, 3 December 2011 2:35:50 AM, Konstantinos Skarlatos wrote:
After about 1TB of rsyncs from multiple servers at the same time, plus 
some heavy filesystem loading, I believe that 3.2rc4 solves the 
problem for me. Now if only we had deduplication and an fsck tool :)

On Friday, 2 December 2011 9:53:10 PM, Konstantinos Skarlatos wrote:
I see they got into 3.2rc4, so I am now compiling it. I will report 
back in a few hours


On Friday, 2 December 2011 5:48:31 PM, Tobias wrote:

On 02.12.2011 16:22, Konstantinos Skarlatos wrote:
So, the transaction close is in btrfs_evict_inode, which sounds 
like a

deadlock recently fixed by this commit:

http://git.kernel.org/?p=linux/kernel/git/mason/linux-btrfs.git;a=commit;h=aa38a711a893accf5b5192f3d705a120deaa81e0 



If you pull the for-linus branch from today, hopefully the 
problem will

be gone.



This looks very good. With this Kernel i still have some hangs, 
but only in rsync, only under high load and they don't lock up the 
system - so i guess it's ok now.


I still have hangs and lock ups under the same situation (rsync of 
many files) under 3.2rc3. rc3 made the hang appear after 200gb of 
files, while in rc2 i had hangs after only 11gb .


Yes, I had them too in 3.2rc3! The problems were solved with patches 
from the btrfs-for-linus branch (see link above).


Tobias




Re: Blocked for more than 120 seconds

2011-12-02 Thread Konstantinos Skarlatos

Hi all

On 2/12/2011 3:46 PM, Tobias wrote:

Hi Chris!

On 01.12.2011 19:41, Chris Mason wrote:


So, the transaction close is in btrfs_evict_inode, which sounds like a
deadlock recently fixed by this commit:

http://git.kernel.org/?p=linux/kernel/git/mason/linux-btrfs.git;a=commit;h=aa38a711a893accf5b5192f3d705a120deaa81e0 



If you pull the for-linus branch from today, hopefully the problem will
be gone.



This looks very good. With this kernel I still have some hangs, but 
only in rsync, only under high load, and they don't lock up the system 
- so I guess it's OK now.
I still have hangs and lockups in the same situation (rsync of many 
files) under 3.2rc3. rc3 made the hang appear after 200GB of files, 
while in rc2 I had hangs after only 11GB.


Thank you very much for your help!

When will these patches go into the main kernel?

Tobias



Re: Blocked for more than 120 seconds

2011-12-02 Thread Konstantinos Skarlatos
I see they got into 3.2rc4, so I am now compiling it. I will report 
back in a few hours


On Friday, 2 December 2011 5:48:31 PM, Tobias wrote:

On 02.12.2011 16:22, Konstantinos Skarlatos wrote:

So, the transaction close is in btrfs_evict_inode, which sounds like a
deadlock recently fixed by this commit:

http://git.kernel.org/?p=linux/kernel/git/mason/linux-btrfs.git;a=commit;h=aa38a711a893accf5b5192f3d705a120deaa81e0 



If you pull the for-linus branch from today, hopefully the problem 
will

be gone.



This looks very good. With this Kernel i still have some hangs, but 
only in rsync, only under high load and they don't lock up the 
system - so i guess it's ok now.


I still have hangs and lock ups under the same situation (rsync of 
many files) under 3.2rc3. rc3 made the hang appear after 200gb of 
files, while in rc2 i had hangs after only 11gb .


Yes, I had them too in 3.2rc3! The problems were solved with patches 
from the btrfs-for-linus branch (see link above).


Tobias




Re: Blocked for more than 120 seconds

2011-12-02 Thread Konstantinos Skarlatos
After about 1TB of rsyncs from multiple servers at the same time, plus 
some heavy filesystem loading, i believe that 3.2rc4 solves the problem 
for me. Now if only we had deduplication and an fsck tool :)
On Friday, 2 December 2011 9:53:10 PM, Konstantinos Skarlatos wrote:
I see they got into 3.2rc4, so I am now compiling it. I will report 
back in a few hours


On Friday, 2 December 2011 5:48:31 PM, Tobias wrote:

On 02.12.2011 16:22, Konstantinos Skarlatos wrote:
So, the transaction close is in btrfs_evict_inode, which sounds 
like a

deadlock recently fixed by this commit:

http://git.kernel.org/?p=linux/kernel/git/mason/linux-btrfs.git;a=commit;h=aa38a711a893accf5b5192f3d705a120deaa81e0 



If you pull the for-linus branch from today, hopefully the problem 
will

be gone.



This looks very good. With this Kernel i still have some hangs, but 
only in rsync, only under high load and they don't lock up the 
system - so i guess it's ok now.


I still have hangs and lock ups under the same situation (rsync of 
many files) under 3.2rc3. rc3 made the hang appear after 200gb of 
files, while in rc2 i had hangs after only 11gb .


Yes, I had them too in 3.2rc3! The problems were solved with patches 
from the btrfs-for-linus branch (see link above).


Tobias




Having parent transid verify failed

2011-05-05 Thread Konstantinos Skarlatos
Hello, I have a 5.5TB btrfs filesystem on top of an md RAID 5 device. Now 
if I run some file operations like find, I get these messages.

The kernel is 2.6.38.5-1 on Arch Linux.

May  5 14:15:12 mail kernel: [13559.089713] parent transid verify failed 
on 3062073683968 wanted 5181 found 5188
May  5 14:15:12 mail kernel: [13559.089834] parent transid verify failed 
on 3062073683968 wanted 5181 found 5188
May  5 14:15:14 mail kernel: [13560.752074] btrfs-transacti D 
88007211ac78 0  5339  2 0x
May  5 14:15:14 mail kernel: [13560.752078]  880023167d30 
0046 8800 8800195b6000
May  5 14:15:14 mail kernel: [13560.752082]  880023167c10 
02c8f27b4000 880023167fd8 88007211a9a0
May  5 14:15:14 mail kernel: [13560.752085]  880023167fd8 
880023167fd8 88007211ac80 880023167fd8

May  5 14:15:14 mail kernel: [13560.752087] Call Trace:
May  5 14:15:14 mail kernel: [13560.752101]  [a0850d02] ? 
run_clustered_refs+0x132/0x830 [btrfs]
May  5 14:15:14 mail kernel: [13560.752105]  [813aff3d] 
schedule_timeout+0x2fd/0x380
May  5 14:15:14 mail kernel: [13560.752108]  [813b0cf9] ? 
mutex_unlock+0x9/0x10
May  5 14:15:14 mail kernel: [13560.752115]  [a087e9f4] ? 
btrfs_run_ordered_operations+0x1f4/0x210 [btrfs]
May  5 14:15:14 mail kernel: [13560.752122]  [a0860fa3] 
btrfs_commit_transaction+0x263/0x750 [btrfs]
May  5 14:15:14 mail kernel: [13560.752126]  [81079ff0] ? 
autoremove_wake_function+0x0/0x40
May  5 14:15:14 mail kernel: [13560.752131]  [a085a9bd] 
transaction_kthread+0x26d/0x290 [btrfs]
May  5 14:15:14 mail kernel: [13560.752137]  [a085a750] ? 
transaction_kthread+0x0/0x290 [btrfs]
May  5 14:15:14 mail kernel: [13560.752139]  [81079717] 
kthread+0x87/0x90
May  5 14:15:14 mail kernel: [13560.752142]  [8100bc24] 
kernel_thread_helper+0x4/0x10
May  5 14:15:14 mail kernel: [13560.752145]  [81079690] ? 
kthread+0x0/0x90
May  5 14:15:14 mail kernel: [13560.752147]  [8100bc20] ? 
kernel_thread_helper+0x0/0x10
May  5 14:15:17 mail kernel: [13564.092081] verify_parent_transid: 40736 
callbacks suppressed
May  5 14:15:17 mail kernel: [13564.092084] parent transid verify failed 
on 3062073683968 wanted 5181 found 5188


--snip--
May  5 14:17:13 mail kernel: [13679.169772] parent transid verify failed 
on 3062073683968 wanted 5181 found 5188

--snip--
May  5 14:17:14 mail kernel: [13680.751996] btrfs-transacti D 
88007211ac78 0  5339  2 0x
May  5 14:17:14 mail kernel: [13680.752000]  880023167d30 
0046 8800 8800195b6000
May  5 14:17:14 mail kernel: [13680.752004]  880023167c10 
02c8f27b4000 880023167fd8 88007211a9a0
May  5 14:17:14 mail kernel: [13680.752006]  880023167fd8 
880023167fd8 88007211ac80 880023167fd8

May  5 14:17:14 mail kernel: [13680.752009] Call Trace:
May  5 14:17:14 mail kernel: [13680.752024]  [a0850d02] ? 
run_clustered_refs+0x132/0x830 [btrfs]
May  5 14:17:14 mail kernel: [13680.752030]  [813aff3d] 
schedule_timeout+0x2fd/0x380
May  5 14:17:14 mail kernel: [13680.752032]  [813b0cf9] ? 
mutex_unlock+0x9/0x10
May  5 14:17:14 mail kernel: [13680.752040]  [a087e9f4] ? 
btrfs_run_ordered_operations+0x1f4/0x210 [btrfs]
May  5 14:17:14 mail kernel: [13680.752046]  [a0860fa3] 
btrfs_commit_transaction+0x263/0x750 [btrfs]
May  5 14:17:14 mail kernel: [13680.752051]  [81079ff0] ? 
autoremove_wake_function+0x0/0x40
May  5 14:17:14 mail kernel: [13680.752057]  [a085a9bd] 
transaction_kthread+0x26d/0x290 [btrfs]
May  5 14:17:14 mail kernel: [13680.752062]  [a085a750] ? 
transaction_kthread+0x0/0x290 [btrfs]
May  5 14:17:14 mail kernel: [13680.752065]  [81079717] 
kthread+0x87/0x90
May  5 14:17:14 mail kernel: [13680.752068]  [8100bc24] 
kernel_thread_helper+0x4/0x10
May  5 14:17:14 mail kernel: [13680.752070]  [81079690] ? 
kthread+0x0/0x90
May  5 14:17:14 mail kernel: [13680.752072]  [8100bc20] ? 
kernel_thread_helper+0x0/0x10
May  5 14:17:14 mail kernel: [13680.752079] dd  D 
8800714c4838 0  5792   5740 0x0004
May  5 14:17:14 mail kernel: [13680.752082]  88006a205b38 
0082 88006a205af8 0246
May  5 14:17:14 mail kernel: [13680.752085]  ea00017f57e8 
88006a205fd8 88006a205fd8 8800714c4560
May  5 14:17:14 mail kernel: [13680.752088]  88006a205fd8 
88006a205fd8 8800714c4840 88006a205fd8

May  5 14:17:14 mail kernel: [13680.752090] Call Trace:
May  5 14:17:14 mail kernel: [13680.752095]  [810ff145] ? 
zone_statistics+0x75/0x90
May  5 14:17:14 mail kernel: [13680.752098]  [810ea8b7] ? 
get_page_from_freelist+0x3c7/0x820
May  5 14:17:14 mail kernel: [13680.752101]  [810e3588] ? 
find_get_page+0x68/0xb0
May  5 14:17:14 mail kernel: [13680.752108]  [a08603f9] 

Re: Having parent transid verify failed

2011-05-05 Thread Konstantinos Skarlatos



On 5/5/2011 2:42 PM, Chris Mason wrote:

Excerpts from Konstantinos Skarlatos's message of 2011-05-05 07:19:52 -0400:

Hello, I have a 5.5TB Btrfs filesystem on top of a md-raid 5 device. Now
if i run some file operations like find, i get these messages.
kernel is 2.6.38.5-1 on arch linux


Are all of the messages for this one block?

parent transid verify failed on 3062073683968 wanted 5181 found 5188

yes, only this block


-chris



Re: Having parent transid verify failed

2011-05-05 Thread Konstantinos Skarlatos



On 5/5/2011 6:06 PM, Chris Mason wrote:

Excerpts from Konstantinos Skarlatos's message of 2011-05-05 10:27:30 -0400:

Attached you can find the whole dmesg log. I can trigger the error again
if more logs are needed.


Yes, I'll send you a patch to get rid of the printk for the transid
failed message.  That way we can get a clean view of the other errors.

Will you be able to compile/test it?


Yes, I think I will be able to do it, but because I have only done 
this once, and in a quite hackish way, I may need some help in order to 
do it right.




-chris



Re: Having parent transid verify failed

2011-05-05 Thread Konstantinos Skarlatos
I think I made some progress. When I tried to remove the directory that 
I suspect contains the problematic file, I got this on the console:


rm -rf serverloft/

2011 May  5 23:32:53 mail [  200.580195] Oops:  [#1] PREEMPT SMP
2011 May  5 23:32:53 mail [  200.580220] last sysfs file: 
/sys/module/vt/parameters/default_utf8

2011 May  5 23:32:53 mail [  200.581145] Stack:
2011 May  5 23:32:53 mail [  200.581276] Call Trace:
2011 May  5 23:32:53 mail [  200.581732] Code: cc 00 00 48 8d 91 28 e0 
ff ff 48 89 e5 48 81 ec 90 00 00 00 48 89 5d d8 4c 89 65 e0 48 89 f3 4c 
89 6d e8 4c 89 75 f0 4c 89 7d f8 48 8b 76 30 83 42 1c 01 48 b8 00 00 
00 00 00 16 00 00 48 01 f0

2011 May  5 23:32:53 mail [  200.583376] CR2: 0030


Here is the part of dmesg that does not contain the thousands of 
"parent transid verify failed" messages:



May  5 23:32:51 mail kernel: [  198.371084] parent transid verify failed 
on 3062073683968 wanted 5181 found 5188
May  5 23:32:51 mail kernel: [  198.371204] parent transid verify failed 
on 3062073683968 wanted 5181 found 5188
May  5 23:32:53 mail kernel: [  200.572774] Modules linked in: ipv6 
btrfs zlib_deflate crc32c libcrc32c ext2 raid456 async_raid6_recov 
async_pq raid6_pq async_xor xor async_memcpy async_tx md_mod usb_storage 
uas snd_seq_dummy snd_seq_oss radeon snd_seq_midi_event ttm snd_seq 
snd_hda_codec_hdmi snd_seq_device drm_kms_helper ohci_hcd snd_hda_intel 
snd_hda_codec snd_pcm_oss snd_hwdep drm i2c_algo_bit snd_mixer_oss 
snd_pcm i2c_piix4 snd_timer snd soundcore snd_page_alloc ehci_hcd wmi 
i2c_core usbcore evdev processor button k10temp serio_raw pcspkr sg 
r8169 edac_core shpchp pci_hotplug edac_mce_amd mii sp5100_tco ext4 
mbcache jbd2 crc16 sd_mod pata_acpi ahci libahci pata_atiixp libata scsi_mod
May  5 23:32:53 mail kernel: [  200.572808] Pid: 1037, comm: 
btrfs-transacti Not tainted 2.6.38-ARCH #1

May  5 23:32:53 mail kernel: [  200.572810] Call Trace:
May  5 23:32:53 mail kernel: [  200.572817]  [813a932b] ? 
__schedule_bug+0x59/0x5d
May  5 23:32:53 mail kernel: [  200.572820]  [813af827] ? 
schedule+0x9f7/0xad0
May  5 23:32:53 mail kernel: [  200.572823]  [811e5827] ? 
generic_unplug_device+0x37/0x40
May  5 23:32:53 mail kernel: [  200.572827]  [a07ac164] ? 
md_raid5_unplug_device+0x64/0x110 [raid456]
May  5 23:32:53 mail kernel: [  200.572830]  [a07ac223] ? 
raid5_unplug_queue+0x13/0x20 [raid456]
May  5 23:32:53 mail kernel: [  200.572833]  [81012d79] ? 
read_tsc+0x9/0x20
May  5 23:32:53 mail kernel: [  200.572837]  [8108418c] ? 
ktime_get_ts+0xac/0xe0
May  5 23:32:53 mail kernel: [  200.572840]  [810e36c0] ? 
sync_page+0x0/0x50
May  5 23:32:53 mail kernel: [  200.572842]  [813af96e] ? 
io_schedule+0x6e/0xb0
May  5 23:32:53 mail kernel: [  200.572844]  [810e36fb] ? 
sync_page+0x3b/0x50
May  5 23:32:53 mail kernel: [  200.572846]  [813b0077] ? 
__wait_on_bit+0x57/0x80
May  5 23:32:53 mail kernel: [  200.572848]  [810e38c0] ? 
wait_on_page_bit+0x70/0x80
May  5 23:32:53 mail kernel: [  200.572851]  [8107a030] ? 
wake_bit_function+0x0/0x40
May  5 23:32:53 mail kernel: [  200.572861]  [a08348d2] ? 
read_extent_buffer_pages+0x412/0x480 [btrfs]
May  5 23:32:53 mail kernel: [  200.572867]  [a0809e00] ? 
btree_get_extent+0x0/0x1b0 [btrfs]
May  5 23:32:53 mail kernel: [  200.572873]  [a080ac7e] ? 
btree_read_extent_buffer_pages.isra.60+0x5e/0xb0 [btrfs]
May  5 23:32:53 mail kernel: [  200.572880]  [a080c0bc] ? 
read_tree_block+0x3c/0x60 [btrfs]
May  5 23:32:53 mail kernel: [  200.572884]  [a07f272b] ? 
read_block_for_search.isra.34+0x1fb/0x410 [btrfs]
May  5 23:32:53 mail kernel: [  200.572890]  [a08417d1] ? 
btrfs_tree_unlock+0x51/0x60 [btrfs]
May  5 23:32:53 mail kernel: [  200.572895]  [a07f5ca0] ? 
btrfs_search_slot+0x430/0xa30 [btrfs]
May  5 23:32:53 mail kernel: [  200.572900]  [a07fb3a6] ? 
lookup_inline_extent_backref+0x96/0x460 [btrfs]
May  5 23:32:53 mail kernel: [  200.572904]  [8112b8d3] ? 
kmem_cache_alloc+0x133/0x150
May  5 23:32:53 mail kernel: [  200.572908]  [a07fd452] ? 
__btrfs_free_extent+0xc2/0x6d0 [btrfs]
May  5 23:32:53 mail kernel: [  200.572914]  [a0800f59] ? 
run_clustered_refs+0x389/0x830 [btrfs]
May  5 23:32:53 mail kernel: [  200.572920]  [a084d900] ? 
btrfs_find_ref_cluster+0x10/0x190 [btrfs]
May  5 23:32:53 mail kernel: [  200.572925]  [a08014c0] ? 
btrfs_run_delayed_refs+0xc0/0x210 [btrfs]
May  5 23:32:53 mail kernel: [  200.572927]  [813b0cf9] ? 
mutex_unlock+0x9/0x10
May  5 23:32:53 mail kernel: [  200.572933]  [a0810db8] ? 
btrfs_commit_transaction+0x78/0x750 [btrfs]
May  5 23:32:53 mail kernel: [  200.572936]  [81079ff0] ? 
autoremove_wake_function+0x0/0x40
May  5 23:32:53 mail kernel: [  200.572941]  [a080a9bd] ? 
transaction_kthread+0x26d/0x290 [btrfs]
May  5 23:32:53 mail kernel: 

Re: Having parent transid verify failed

2011-05-05 Thread Konstantinos Skarlatos

On 5/5/2011 11:32 PM, Chris Mason wrote:

Excerpts from Konstantinos Skarlatos's message of 2011-05-05 16:27:54 -0400:

I think i made some progress. When i tried to remove the directory that
i suspect contains the problematic file, i got this on the console

rm -rf serverloft/


Ok, our one bad block is in the extent allocation tree.  This is going
to be the very hardest thing to fix.

Until I finish off the code to rebuild parts of the extent allocation
tree, I think your best bet is to copy the files off.

The big question is, what happened to make this error?  Can you describe
your setup in more detail?


I created this btrfs filesystem on an Arch Linux system (amd64, quad 
core) with kernel 2.6.38.1. It is on top of an md RAID 5.


[root@linuxserver ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde1[3] sdc1[1] sda1[0] sdf1[4]
  5860535808 blocks super 1.2 level 5, 512k chunk, algorithm 2 
[4/4] [UUUU]


The RAID was grown from 3 devices to 4, and then btrfs was grown to max 
size. Mount options were clear_cache,compress-force.
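
Roughly, the grow sequence was the equivalent of this (device names and 
mount point are examples, and the reshape details are simplified):

  mdadm --add /dev/md0 /dev/sdf1              # the new disk starts out as a spare
  mdadm --grow /dev/md0 --raid-devices=4      # reshape the RAID 5 from 3 to 4 drives
  # once the reshape is done, let btrfs claim the new space
  btrfs filesystem resize max /mnt/btrfs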


I was investigating a performance issue I had, because over the 
network I could only write to the filesystem at about 32 MB/sec.


When writing, CPU usage of the btrfs-delalloc thread was at 100%.

While investigating, I disabled compression, enabled space_cache and 
tried zlib compression, and various combinations, while copying large 
files back and forth using Samba.


BTW, I tried to change some mount options using mount -o remount, but 
although the new options were printed in dmesg, I think they were 
not actually enabled.
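
A quick way to check whether a remount really took effect is to compare 
against what the kernel reports afterwards (mount point is an example):

  mount -o remount,compress=zlib /mnt/btrfs
  # the options the kernel is actually using show up here
  grep /mnt/btrfs /proc/mounts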


I got the first error when I was copying some files and at the same time 
created a directory over Samba. After a while I upgraded to 2.6.38.5 but 
nothing seems to have changed.


I really don't think there is a hardware error here, but to be safe I am 
now running a check on the RAID.





-chris



2011 May  5 23:32:53 mail [  200.580195] Oops:  [#1] PREEMPT SMP
2011 May  5 23:32:53 mail [  200.580220] last sysfs file:
/sys/module/vt/parameters/default_utf8
2011 May  5 23:32:53 mail [  200.581145] Stack:
2011 May  5 23:32:53 mail [  200.581276] Call Trace:
2011 May  5 23:32:53 mail [  200.581732] Code: cc 00 00 48 8d 91 28 e0
ff ff 48 89 e5 48 81 ec 90 00 00 00 48 89 5d d8 4c 89 65 e0 48 89 f3 4c
89 6d e8 4c 89 75 f0 4c 89 7d f848  8b 76 30 83 42 1c 01 48 b8 00 00
00 00 00 16 00 00 48 01 f0
2011 May  5 23:32:53 mail [  200.583376] CR2: 0030


here is the  part of dmesg that does not contain the  thousands of
parent transid verify failed messages


May  5 23:32:51 mail kernel: [  198.371084] parent transid verify failed
on 3062073683968 wanted 5181 found 5188
May  5 23:32:51 mail kernel: [  198.371204] parent transid verify failed
on 3062073683968 wanted 5181 found 5188
May  5 23:32:53 mail kernel: [  200.572774] Modules linked in: ipv6
btrfs zlib_deflate crc32c libcrc32c ext2 raid456 async_raid6_recov
async_pq raid6_pq async_xor xor async_memcpy async_tx md_mod usb_storage
uas snd_seq_dummy snd_seq_oss radeon snd_seq_midi_event ttm snd_seq
snd_hda_codec_hdmi snd_seq_device drm_kms_helper ohci_hcd snd_hda_intel
snd_hda_codec snd_pcm_oss snd_hwdep drm i2c_algo_bit snd_mixer_oss
snd_pcm i2c_piix4 snd_timer snd soundcore snd_page_alloc ehci_hcd wmi
i2c_core usbcore evdev processor button k10temp serio_raw pcspkr sg
r8169 edac_core shpchp pci_hotplug edac_mce_amd mii sp5100_tco ext4
mbcache jbd2 crc16 sd_mod pata_acpi ahci libahci pata_atiixp libata scsi_mod
May  5 23:32:53 mail kernel: [  200.572808] Pid: 1037, comm:
btrfs-transacti Not tainted 2.6.38-ARCH #1
May  5 23:32:53 mail kernel: [  200.572810] Call Trace:
May  5 23:32:53 mail kernel: [  200.572817]  [813a932b] ?
__schedule_bug+0x59/0x5d
May  5 23:32:53 mail kernel: [  200.572820]  [813af827] ?
schedule+0x9f7/0xad0
May  5 23:32:53 mail kernel: [  200.572823]  [811e5827] ?
generic_unplug_device+0x37/0x40
May  5 23:32:53 mail kernel: [  200.572827]  [a07ac164] ?
md_raid5_unplug_device+0x64/0x110 [raid456]
May  5 23:32:53 mail kernel: [  200.572830]  [a07ac223] ?
raid5_unplug_queue+0x13/0x20 [raid456]
May  5 23:32:53 mail kernel: [  200.572833]  [81012d79] ?
read_tsc+0x9/0x20
May  5 23:32:53 mail kernel: [  200.572837]  [8108418c] ?
ktime_get_ts+0xac/0xe0
May  5 23:32:53 mail kernel: [  200.572840]  [810e36c0] ?
sync_page+0x0/0x50
May  5 23:32:53 mail kernel: [  200.572842]  [813af96e] ?
io_schedule+0x6e/0xb0
May  5 23:32:53 mail kernel: [  200.572844]  [810e36fb] ?
sync_page+0x3b/0x50
May  5 23:32:53 mail kernel: [  200.572846]  [813b0077] ?
__wait_on_bit+0x57/0x80
May  5 23:32:53 mail kernel: [  200.572848]  [810e38c0] ?
wait_on_page_bit+0x70/0x80
May  5 23:32:53 mail kernel: [  200.572851]  [8107a030] ?
wake_bit_function+0x0/0x40
May  5 23:32:53 mail kernel: [  200.572861]  [a08348d2] ?
read_extent_buffer_pages+0x412/0x480 [btrfs]
May  5 23:32:53 

Re: Having parent transid verify failed

2011-05-05 Thread Konstantinos Skarlatos



On 6/5/2011 2:50 AM, Chris Mason wrote:

Excerpts from Konstantinos Skarlatos's message of 2011-05-05 17:04:00 -0400:

On 5/5/2011 11:32 μμ, Chris Mason wrote:

Excerpts from Konstantinos Skarlatos's message of 2011-05-05 16:27:54 -0400:

I think i made some progress. When i tried to remove the directory that
i suspect contains the problematic file, i got this on the console

rm -rf serverloft/


Ok, our one bad block is in the extent allocation tree.  This is going
to be the very hardest thing to fix.

Until I finish off the code to rebuild parts of the extent allocation
tree, I think your best bet is to copy the files off.

The big question is, what happened to make this error?  Can you describe
your setup in more detail?


I created this btrfs filesystem on an arch linux system (amd64, quad
core) with kernel 2.3.38.1. it is on top of a md raid 5.

[root@linuxserver ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde1[3] sdc1[1] sda1[0] sdf1[4]
5860535808 blocks super 1.2 level 5, 512k chunk, algorithm 2
[4/4] []

the raid was grown from 3 devices to 4, and then btrfs was grown to max
size. mount options were clear_cache,compress-force.

I was investigating a performance issue that i had, because over the
network i could only write to the filesystem at about 32mb/sec.

when writing btrfs-delalloc- cpu usage was at 100%.

While investigating i disabled compression, enabled space_cache and
tried zlib compression, and various combinations, while copying large
files back and forth using samba.

BTW I tried to change some mount options using mount -o remount but
although the new options were printed on dmesg i think that they were
not enabled.

I got the first error when i was copying some files and at the same time
created a directory over samba. After a while i upgraded to 2.6.38.5 but
nothing seems to have changed.

I really dont think there is a hardware error here, but to be safe I am
now running a check on the raid


This error basically means we didn't write the block.  It could be
because the write went to the wrong spot, or the hardware stack messed
it up, or because of a btrfs bug.  But, 2.6.38 is relatively recent.  It
doesn't look like memory corruption because the transids are fairly
close.

When you grew the raid device, did you grow a partition as well?  We've
had trouble in the past with block dev flushing code kicking in as
devices are resized.


No, I did not grow any partitions. I just added one disk to the RAID 5 
md0 device, and then grew the btrfs filesystem to max size (there are no 
partitions on md0).


I remember that as a test (to see if shrink works) I shrank the fs 
by 1 GB and then grew it again to max size.




Samba isn't doing anything exotic, and 2.6.38 has my recent fixes for
rare metadata corruption bugs in btrfs.

-chris



Re: [PATCH 2/2 v2] Btrfs: Per file/directory controls for COW and compression

2011-04-04 Thread Konstantinos Skarlatos

Hello,
I would like to ask about the status of this feature/patch: has it been 
accepted into the btrfs code, and how can I use it?


I am interested in enabling compression for a specific 
folder (force-compress would be ideal) of a large btrfs volume, and 
disabling it for the rest.
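
As far as I understand it, once the patch is in, the flags are driven through 
the generic FS_IOC_SETFLAGS ioctl, so the plain chattr/lsattr tools should 
work; something like this (paths are examples, and per-directory 
force-compress is not part of the patch):

  chattr +c /srv/big-volume/compress-me     # files created in here inherit the compress flag
  chattr +C /srv/big-volume/vm-images       # +C disables COW for new files instead
  lsattr -d /srv/big-volume/compress-me /srv/big-volume/vm-images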



On 21/3/2011 10:57 AM, liubo wrote:

Data compression and data cow are controlled across the entire FS by mount
options right now.  ioctls are needed to set this on a per file or per
directory basis.  This has been proposed previously, but VFS developers
wanted us to use generic ioctls rather than btrfs-specific ones.

According to chris's comment, there should be just one true compression
method(probably LZO) stored in the super.  However, before this, we would
wait for that one method is stable enough to be adopted into the super.
So I list it as a long term goal, and just store it in ram today.

After applying this patch, we can use the generic FS_IOC_SETFLAGS ioctl to
control file and directory's datacow and compression attribute.

NOTE:
  - The compression type is selected by such rules:
If we mount btrfs with compress options, ie, zlib/lzo, the type is it.
Otherwise, we'll use the default compress type (zlib today).

v1-v2:
Rebase the patch with the latest btrfs.

Signed-off-by: Liu Bo <liubo2...@cn.fujitsu.com>
---
  fs/btrfs/ctree.h   |1 +
  fs/btrfs/disk-io.c |6 ++
  fs/btrfs/inode.c   |   32 
  fs/btrfs/ioctl.c   |   41 +
  4 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8b4b9d1..b77d1a5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1283,6 +1283,7 @@ struct btrfs_root {
  #define BTRFS_INODE_NODUMP(1 << 8)
  #define BTRFS_INODE_NOATIME   (1 << 9)
  #define BTRFS_INODE_DIRSYNC   (1 << 10)
+#define BTRFS_INODE_COMPRESS   (1 << 11)

  /* some macros to generate set/get funcs for the struct fields.  This
   * assumes there is a lefoo_to_cpu for every type, so lets make a simple
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3e1ea3e..a894c12 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1762,6 +1762,12 @@ struct btrfs_root *open_ctree(struct super_block *sb,

btrfs_check_super_valid(fs_info, sb->s_flags & MS_RDONLY);

+   /*
+* In the long term, we'll store the compression type in the super
+* block, and it'll be used for per file compression control.
+*/
+   fs_info->compress_type = BTRFS_COMPRESS_ZLIB;
+
ret = btrfs_parse_options(tree_root, options);
if (ret) {
err = ret;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index db67821..e687bb9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -381,7 +381,8 @@ again:
 */
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS) &&
(btrfs_test_opt(root, COMPRESS) ||
-(BTRFS_I(inode)->force_compress))) {
+(BTRFS_I(inode)->force_compress) ||
+(BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))) {
WARN_ON(pages);
pages = kzalloc(sizeof(struct page *) * nr_pages, GFP_NOFS);

@@ -1253,7 +1254,8 @@ static int run_delalloc_range(struct inode *inode, struct 
page *locked_page,
ret = run_delalloc_nocow(inode, locked_page, start, end,
 page_started, 0, nr_written);
else if (!btrfs_test_opt(root, COMPRESS) &&
-!(BTRFS_I(inode)->force_compress))
+!(BTRFS_I(inode)->force_compress) &&
+!(BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))
ret = cow_file_range(inode, locked_page, start, end,
  page_started, nr_written, 1);
else
@@ -4581,8 +4583,6 @@ static struct inode *btrfs_new_inode(struct 
btrfs_trans_handle *trans,
location->offset = 0;
btrfs_set_key_type(location, BTRFS_INODE_ITEM_KEY);

-   btrfs_inherit_iflags(inode, dir);
-
if ((mode & S_IFREG)) {
if (btrfs_test_opt(root, NODATASUM))
BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM;
@@ -4590,6 +4590,8 @@ static struct inode *btrfs_new_inode(struct 
btrfs_trans_handle *trans,
BTRFS_I(inode)->flags |= BTRFS_INODE_NODATACOW;
}

+   btrfs_inherit_iflags(inode, dir);
+
insert_inode_hash(inode);
inode_tree_add(inode);
return inode;
@@ -6803,6 +6805,26 @@ static int btrfs_getattr(struct vfsmount *mnt,
return 0;
  }

+/*
+ * If a file is moved, it will inherit the cow and compression flags of the new
+ * directory.
+ */
+static void fixup_inode_flags(struct inode *dir, struct inode *inode)
+{
+   struct btrfs_inode *b_dir = BTRFS_I(dir);
+   struct btrfs_inode *b_inode = BTRFS_I(inode);
+
+   if (b_dir->flags & BTRFS_INODE_NODATACOW)
+   b_inode->flags |= BTRFS_INODE_NODATACOW;
+  

Re: btrfs balancing start - and stop?

2011-04-01 Thread Konstantinos Skarlatos

On 1/4/2011 3:12 PM, Helmut Hullen wrote:

Hallo, Struan,

You wrote on 01.04.11:


1) Is the balancing operation expected to take many hours (or days?)
on a filesystem such as this? Or are there known issues with the
algorithm that are yet to be addressed?

Maybe. Balancing about 15 GByte took about 2 hours (or less);
balancing about 2 TByte took about 20 hours.

dmesg counts down the number of remaining jobs.
Are you sure? Here is a snippet of dmesg from a balance I did yesterday 
(2.6.38.1):


btrfs: relocating block group 15338569728 flags 9
btrfs: found 17296 extents
btrfs: found 17296 extents
btrfs: relocating block group 13191086080 flags 9
btrfs: found 21029 extents
btrfs: found 21029 extents
btrfs: relocating block group 11043602432 flags 9
btrfs: found 4728 extents
btrfs: found 4728 extents



Best regards!
Helmut


Re: Do not use free space caching!

2011-04-01 Thread Konstantinos Skarlatos

On 1/4/2011 1:59 AM, Josef Bacik wrote:

On Thu, Mar 31, 2011 at 05:06:42PM -0400, Calvin Walton wrote:

On Wed, 2011-03-30 at 17:19 -0400, Josef Bacik wrote:

Hello,

Just found a big bug in the free space caching stuff that will result in
early ENOSPC.  I'm working on fixing this bug, but it won't be until
tomorrow that I'll have it completely working, so for now make sure to
mount -o clear_cache so that it just clears the cache and doesn't use it.

NOTE: It doesn't cause problems other than early ENOSPC, you won't get
corruption or anything like that, tho you could possibly panic.

Sorry for the inconvenience.  Thanks,

Any chance you could provide a little more information about which
kernels are affected? Is it any kernel with free space cache support (is
2.6.38.x included?) - and if so, do you plan on submitting the fix to
the stable kernel series?


Yeah it affects any kernel that has the free space cache feature, which I think
started in .37.  Course you have to have specifically enabled it, so it's not a
huge problem.  I've submitted a patch, but since it's currently an optional
feature I don't think it needs to go to stable.  Thanks,
So it will have to wait for 2.6.39? If possible please push it for 
inclusion in the next 2.6.38 stable release, as 2.6.39 is a few months 
away and I won't risk an early RC on my system.


Thanks

Josef


Re: btrfs balancing start - and stop?

2011-04-01 Thread Konstantinos Skarlatos

On 1/4/2011 4:37 PM, Hugo Mills wrote:

On Fri, Apr 01, 2011 at 04:22:39PM +0300, Konstantinos Skarlatos wrote:

On 1/4/2011 3:12 PM, Helmut Hullen wrote:

You wrote on 01.04.11:

dmesg counts down the number of remaining jobs.

are you sure? here is a snippet of dmesg from a balance i did
yesterday (2.6.38.1)

btrfs: relocating block group 15338569728 flags 9
btrfs: found 17296 extents
btrfs: found 17296 extents
btrfs: relocating block group 13191086080 flags 9
btrfs: found 21029 extents
btrfs: found 21029 extents
btrfs: relocating block group 11043602432 flags 9
btrfs: found 4728 extents
btrfs: found 4728 extents

Count the number of block groups in the system (1GiB for data,
256MiB for metadata on a typical filesystem), and subtract the number
of "relocating block group" messages... Not ideal, but it's possible.
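
A crude manual version of that bookkeeping (mount point is an example; data 
block groups are about 1GiB and metadata ones about 256MiB each):

  # how many block groups the balance has handled so far
  dmesg | grep -c 'relocating block group'
  # compare against the totals reported here to guess how much is left
  btrfs filesystem df /mnt/btrfs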

The balance cancel patch I mentioned earlier also supplies an
additional patch for monitoring progress, which does show up in the
dmesg output (as well as user-space support for prettier output).
Great, I think it is very important to have a human-readable progress 
monitor for operations like that.

Hugo.





Btrfs troubles

2010-09-07 Thread Konstantinos Skarlatos
Hello, I get these messages from two filesystems that are not full (111GB 
of 2TB and 452GB of 2TB free).
Eventually the filesystem mounts, but I am unable to create new files, 
even when I delete data.

Most files are 1.45GB.

[r...@linuxserver ~]# btrfs filesystem df /storage/WD20_1
Data: total=1.81TB, used=1.71TB
Metadata: total=2.63GB, used=2.43GB
System: total=12.00MB, used=204.00KB

[r...@linuxserver ~]# btrfsck /dev/sdb1
found 1878986375168 bytes used err is 0
total csum bytes: 1832396384
total tree bytes: 2612477952
total fs tree bytes: 22331392
btree space waste bytes: 595037490
file data blocks allocated: 1922007609344
 referenced 1876364451840
Btrfs Btrfs v0.19

[r...@linuxserver ~]# btrfs filesystem df /storage/WD20_2
Data: total=1.81TB, used=1.37TB
Metadata: total=2.51GB, used=2.47GB
System: total=12.00MB, used=204.00KB

 [r...@linuxserver ~]# btrfsck /dev/sda1
found 1512592834560 bytes used err is 0
total csum bytes: 1474551008
total tree bytes: 2652602368
total fs tree bytes: 301985792
btree space waste bytes: 599008365
file data blocks allocated: 1510591008768
 referenced 1607206682624
Btrfs Btrfs v0.19




[ cut here ]
WARNING: at fs/btrfs/extent-tree.c:3441 
btrfs_block_rsv_check+0x15e/0x190 [btrfs]()

Hardware name: GA-MA785G-UD3H
Modules linked in: btrfs zlib_deflate crc32c libcrc32c ipv6 ext2 usbhid 
hid usb_storage snd_hda_codec_atihdmi radeon snd_hda_intel snd_hda_codec 
ttm ohci_hcd drm_kms_helper ehci_hcd drm i2c_algo_bit snd_seq_dummy 
snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss 
snd_hwdep snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc 
usbcore shpchp pcspkr serio_raw r8169 evdev i2c_piix4 pci_hotplug 
processor thermal edac_core k10temp button edac_mce_amd i2c_core mii sg 
wmi rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 sd_mod pata_acpi 
ahci libahci pata_atiixp libata scsi_mod

Pid: 12735, comm: ls Not tainted 2.6.35-ARCH #1
Call Trace:
 [8105288a] warn_slowpath_common+0x7a/0xb0
 [810528d5] warn_slowpath_null+0x15/0x20
 [a07f8c9e] btrfs_block_rsv_check+0x15e/0x190 [btrfs]
 [a080976a] __btrfs_end_transaction+0x19a/0x220 [btrfs]
 [a080980b] btrfs_end_transaction+0xb/0x10 [btrfs]
 [a081342b] btrfs_dirty_inode+0x8b/0x120 [btrfs]
 [81145a86] __mark_inode_dirty+0x36/0x170
 [81139c0d] touch_atime+0x12d/0x170
 [81134330] ? filldir+0x0/0xd0
 [81134586] vfs_readdir+0xc6/0xd0
 [81134670] sys_getdents+0x80/0xe0
 [81373765] ? page_fault+0x25/0x30
 [81009e82] system_call_fastpath+0x16/0x1b
---[ end trace a296d77e7bd54918 ]---
block_rsv size 872415232 reserved 206303232 freed 0 0
INFO: task ls:12735 blocked for more than 120 seconds.
echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
lsD  0 12735  11939 0x
 88004aa5fd58 0082 880001894fa8 
 00014f40 00014f40 88004aa5ffd8 88004aa5ffd8
 88004aa5ffd8 880057c5ef00 88004aa5ffd8 00014f40
Call Trace:
 [a0808654] wait_current_trans.clone.19+0x84/0xe0 [btrfs]
 [810718d0] ? autoremove_wake_function+0x0/0x40
 [a08098ef] start_transaction+0xdf/0x250 [btrfs]
 [a0809aae] btrfs_start_transaction+0xe/0x10 [btrfs]
 [a0813438] btrfs_dirty_inode+0x98/0x120 [btrfs]
 [81145a86] __mark_inode_dirty+0x36/0x170
 [81139c0d] touch_atime+0x12d/0x170
 [81134330] ? filldir+0x0/0xd0
 [81134586] vfs_readdir+0xc6/0xd0
 [81134670] sys_getdents+0x80/0xe0
 [81373765] ? page_fault+0x25/0x30
 [81009e82] system_call_fastpath+0x16/0x1b
[ cut here ]
WARNING: at fs/btrfs/extent-tree.c:3441 
btrfs_block_rsv_check+0x15e/0x190 [btrfs]()

Hardware name: GA-MA785G-UD3H
Modules linked in: btrfs zlib_deflate crc32c libcrc32c ipv6 ext2 usbhid 
hid usb_storage snd_hda_codec_atihdmi radeon snd_hda_intel snd_hda_codec 
ttm ohci_hcd drm_kms_helper ehci_hcd drm i2c_algo_bit snd_seq_dummy 
snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss 
snd_hwdep snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc 
usbcore shpchp pcspkr serio_raw r8169 evdev i2c_piix4 pci_hotplug 
processor thermal edac_core k10temp button edac_mce_amd i2c_core mii sg 
wmi rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 sd_mod pata_acpi 
ahci libahci pata_atiixp libata scsi_mod

Pid: 12726, comm: btrfs-transacti Tainted: GW   2.6.35-ARCH #1
Call Trace:
 [8105288a] warn_slowpath_common+0x7a/0xb0
 [810528d5] warn_slowpath_null+0x15/0x20
 [a07f8c9e] btrfs_block_rsv_check+0x15e/0x190 [btrfs]
 [a080976a] __btrfs_end_transaction+0x19a/0x220 [btrfs]
 [a080980b] btrfs_end_transaction+0xb/0x10 [btrfs]
 [a080948e] btrfs_commit_transaction+0x62e/0x770 [btrfs]
 [81371739] ? mutex_unlock+0x9/0x10
 [a08099d3] ?