Re: Adventures in btrfs raid5 disk recovery - update

2016-06-21 Thread Zygo Blaxell
TL;DR:

Kernel 4.6.2 causes a world of pain.  Use 4.5.7 instead.

'btrfs dev stat' doesn't seem to count "csum failed"
(i.e. corruption) errors in compressed extents.



On Sun, Jun 19, 2016 at 11:44:27PM -0400, Zygo Blaxell wrote:
> Not so long ago, I had a disk fail in a btrfs filesystem with raid1
> metadata and raid5 data.  I mounted the filesystem readonly, replaced
> the failing disk, and attempted to recover by adding the new disk and
> deleting the missing disk.

> I'm currently using kernel 4.6.2

That turned out to be a mistake.  4.6.2 has some severe problems.

Over the past few days I've been upgrading other machines from 4.5.7
to 4.6.2.  This morning I saw the aggregate data coming back from
those machines, and it's all bad:  stalls in snapshot delete, balance,
and sync; some machines just lock up with no console messages; a lot of
watchdog timeouts.  None of the machines could get to an uptime over 26
hours and still be in a usable state.

I switched to 4.5.7 and the crashes, balance/delete hangs, and some of
the data corruption modes stopped.

> I'm
> getting EIO randomly all over the filesystem, including in files that were
> written entirely _after_ the disk failure.

There were actually four distinct corruption modes happening:

1.  There are a number (16500 so far) of "normal" corrupt blocks:  read
repeatably returns EIO, they show up in scrub with sane log messages,
and replacing the files that contain these blocks makes them go away.
These blocks appear to be contained in extents that coincide with the
date of the disk failure.  Interestingly, no matter how many times I
read these blocks, I get no increase in the 'btrfs dev stat' numbers
even though I get kernel csum failure messages.  That looks like a bug.

2.  When attempting to replace corrupted files with rsync, I had used
'rsync --inplace'.  This caused bad blocks to be overwritten within
extents, but did not necessarily replace the _entire_ extent containing
a bad block.  This creates corrupt blocks that show up in scrub, balance,
and device delete, but not when reading files.  It also updates the
timestamps, so a file with old corruption looks "new" to an insufficiently
sophisticated analysis tool.

3.  Files were corrupted while they were written and accessed via NFS.
This created files with correct btrfs checksums, but garbage contents.
This would show up as failures during 'git gc' or rsync checksum
mismatches.  During one of the many VM crashes, any writes in progress at
the time of the crash were lost.  This effectively rewound the filesystem
several minutes each time as btrfs reverts to the previous committed
tree on the next mount.  4.6.2's hanging issues made this worse by
delaying btrfs commits indefinitely.  The NFS clients were completely
unaware of this, so when the VM rebooted, files ended up with holes,
or would just disappear while in use.

4.  After a VM crash and the filesystem reverted to the previous
committed tree, files with bad blocks that had been repaired through
the NFS server or with rsync would be "unrepaired" (i.e. the filesystem
would revert back to the original corrupted blocks after the mount).

Combinations of these could occur as well for extra confusion, and some
corrupted blocks are contained in many files thanks to dedup.
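As an aside, the in-place vs. whole-file distinction behind mode 2 can be
illustrated with plain coreutils (an illustrative sketch only; the file
names and the dd/mv stand-ins are mine, not rsync internals): '--inplace'
rewrites inside the existing inode, while rsync's default mode writes a
temporary copy and renames it over the original, which allocates entirely
new extents.

```shell
dir=$(mktemp -d)
printf 'old data\n' > "$dir/f"
ino_before=$(stat -c %i "$dir/f")

# "--inplace" style: overwrite within the existing inode (and extents)
printf 'new data\n' | dd of="$dir/f" conv=notrunc status=none
ino_inplace=$(stat -c %i "$dir/f")

# default rsync style: write a whole new file, then rename it into place
printf 'new data\n' > "$dir/f.tmp"
mv "$dir/f.tmp" "$dir/f"
ino_replaced=$(stat -c %i "$dir/f")

echo "in-place kept inode:   $ino_before -> $ino_inplace"
echo "rename changed inode:  $ino_inplace -> $ino_replaced"
rm -rf "$dir"
```

On btrfs this matters because in the in-place case only the overwritten
blocks get new extents; the rest of the original (possibly corrupt)
extent stays referenced.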

With kernel 4.5.7 there have been no lockups during commit and no VM
crashes, so I haven't seen corruption modes 3 or 4 since the switch.

Balance is now running normally to move the remaining data off the
missing disk.  ETA is 558 hours.  See you in mid-July!  ;)





[PATCH] btrfs: fix disk_i_size update bug when fallocate() fails

2016-06-21 Thread Wang Xiaoguang
When doing a truncate operation, btrfs_setsize() first calls
truncate_setsize() to set the new inode->i_size, but if btrfs_truncate()
later fails, btrfs_setsize() calls
"i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the
in-memory inode size, and this is where the bug occurs: in the truncate
case, btrfs_ordered_update_i_size() uses inode->i_size directly to
update BTRFS_I(inode)->disk_i_size, when it should use the "offset"
argument instead. Here is the call graph:
==>btrfs_truncate()
====>btrfs_truncate_inode_items()
======>btrfs_ordered_update_i_size(inode, last_size, NULL);
Here btrfs_ordered_update_i_size()'s offset argument is last_size.

The test case below can reveal this bug:

dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
dev=$(losetup --show -f fs.img)
mkdir -p /mnt/mntpoint
mkfs.btrfs  -f $dev
mount $dev /mnt/mntpoint
cd /mnt/mntpoint

echo "workdir is: /mnt/mntpoint"
blocksize=$((128 * 1024))
dd if=/dev/zero of=testfile bs=$blocksize count=1
sync
count=$((17*1024*1024*1024/blocksize))
echo "file size is:" $((count*blocksize))
for ((i = 1; i <= $count; i++)); do
i=$((i + 1))
dst_offset=$((blocksize * i))
xfs_io -f -c "reflink testfile 0 $dst_offset $blocksize"\
testfile > /dev/null
done
sync

truncate --size 0 testfile
ls -l testfile
du -sh testfile
exit

In this case the truncate operation fails with ENOSPC, and
"du -sh testfile" returns a value greater than 0 while testfile's
size is 0; inode->i_size needs to reflect the correct value.
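For illustration, here is a toy shell model of the "/* truncate file */"
branch in btrfs_ordered_update_i_size() (the helper names and the
131072/65536 byte values are made up for the example; this is not kernel
code): with the bug, a failed truncate leaves disk_i_size equal to the
transient in-memory i_size, while the fix records the offset the
truncate actually reached.

```shell
# Model of the truncate branch: args are disk_i_size, i_size, offset.
buggy_update() { if [ "$1" -gt "$2" ]; then echo "$2"; else echo "$1"; fi; }
fixed_update() { if [ "$1" -gt "$2" ]; then echo "$3"; else echo "$1"; fi; }

# Failed truncate-to-0: truncate_setsize() already set i_size to 0, and
# btrfs_truncate_inode_items() only got as far as last_size = 65536.
disk_i_size=131072; i_size=0; last_size=65536

buggy=$(buggy_update "$disk_i_size" "$i_size" "$last_size")   # uses i_size
fixed=$(fixed_update "$disk_i_size" "$i_size" "$last_size")   # uses offset

echo "buggy disk_i_size: $buggy"   # i_size gets reset to 0, yet du > 0
echo "fixed disk_i_size: $fixed"   # matches what is actually on disk
```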

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ordered-data.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index e96634a..aca8264 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -968,6 +968,7 @@ int btrfs_ordered_update_i_size(struct inode *inode, u64 offset,
struct rb_node *prev = NULL;
struct btrfs_ordered_extent *test;
int ret = 1;
+   u64 orig_offset = offset;
 
	spin_lock_irq(&tree->lock);
if (ordered) {
@@ -983,7 +984,7 @@ int btrfs_ordered_update_i_size(struct inode *inode, u64 offset,
 
/* truncate file */
if (disk_i_size > i_size) {
-   BTRFS_I(inode)->disk_i_size = i_size;
+   BTRFS_I(inode)->disk_i_size = orig_offset;
ret = 0;
goto out;
}
-- 
1.8.3.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v11 00/13] Btrfs dedupe framework

2016-06-21 Thread Qu Wenruo
Here is the long-awaited (simple and theoretical) performance test for
dedupe.

The results could be added to the btrfs wiki page as advice for the
dedupe use case.


The full results can be checked on Google Drive:
https://drive.google.com/file/d/0BxpkL3ehzX3pb05WT1lZSnRGbjA/view?usp=sharing

[Short Conclusion]
For high dedupe rate and easily compressible data:

If cpu cores >= 4, dedupe speed is on par with lzo compression and
about 35% faster than default dd.

If cpu cores == 2, lzo compression is faster than dedupe, but both are
faster than default dd.

If cpu cores == 1, lzo compression is on par with the SAS HDD, while
dedupe is slower than default dd.


[Test Platform]
The test platform is Fujitsu PRIMERGY RX300 S7.
CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (2 nodes)
Memory: 32G, limited to 8G in the performance tests
Disk: 300G SAS HDDs with hardware RAID 5/6

[Test method]
Just do a 40G buffered write into a new btrfs.
Since that is 5 times the total usable memory (in fact 7 times, since
only 5.7G is actually available), flushing will happen several times.


dd if=/dev/zero bs=1M count=40960 of=/mnt/btrfs/out

[Future plan]
More tests on less theoretical cases, like low-to-medium dedupe rate,
which may lead to slower performance than raw dd.

Considering that lzo is already the fastest compression method btrfs
provides, SHA512 should make dedupe even faster, i.e. faster than
compression.


Also, current dedupe splits the delalloc range into 512K segments
first, then into 128K blocks (the default dedupe size), and balances
the hash work across CPUs; so for smaller dedupe block sizes, dedupe
should be faster and make full use of all CPUs.
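A userspace analogue of that hashing step (coreutils only, a sketch of
the idea rather than the kernel implementation; the block layout and
temporary files are mine): split data into 128K blocks, hash each block,
and count distinct hashes, which is what determines the dedupe rate.

```shell
dir=$(mktemp -d)
# Build 16 blocks of 128K with only 2 distinct contents (high dedupe rate)
for i in 1 2 3 4 5 6 7 8; do
    dd if=/dev/zero bs=128K count=1 status=none >> "$dir/data"
    tr '\0' 'A' < /dev/zero | head -c $((128 * 1024)) >> "$dir/data"
done
split -b 128K "$dir/data" "$dir/blk."
total=$(ls "$dir"/blk.* | wc -l)
unique=$(sha256sum "$dir"/blk.* | awk '{print $1}' | sort -u | wc -l)
echo "$total blocks, $unique unique"
rm -rf "$dir"
```

The kernel spreads the per-block hashing across CPUs; the rough shell
equivalent would be feeding the block list through
`xargs -P"$(nproc)" sha256sum`.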


Thanks,
Qu


At 06/15/2016 10:09 AM, Qu Wenruo wrote:

This patchset can be fetched from github:
https://github.com/adam900710/linux.git wang_dedupe_20160524

In this update, the patchset goes through another re-organization along
with other fixes to address comments from community.
1) Move on-disk backend and dedupe props out of the patchset
   Suggested by David.
   There is still some discussion on the on-disk format.
   And dedupe prop is still not 100% determined.

   So it's better to focus on the current in-memory backend only, which
   doesn't bring any on-disk format change.

   Once the framework is done, new backends and props can be added more
   easily.

2) Better enable/disable and buffered write race avoidance
   Inspired by Mark.
   Although we didn't trigger it in the previous version with our test
   case, manually adding a 5s delay to __btrfs_buffered_write() makes it
   possible to trigger a race between disable and buffered write.

   The cause is a window between __btrfs_buffered_write() and
   btrfs_dirty_pages(): in that window, sync_filesystem() can return
   very quickly since there are no dirty pages, so dedupe disable can
   start and finish, and the buffered writer may then dereference a
   NULL dedupe info pointer.

   Now we use sb->s_writers.rw_sem to wait for all current writers and
   block further writers, then sync the fs, change the dedupe status and
   finally unblock writers (like freeze).
   This gives clearer logic and code, and is safer than the previous
   method, because there is no window before we dirty the pages.

3) Fix the ENOSPC problem with a better solution.
   Pointed out by Josef.
   The last 2 patches from Wang fix the ENOSPC problem using a more
   comprehensive method for delalloc metadata reservation, along with a
   small outstanding-extents improvement to cooperate with the tunable
   maximum extent size.

Now the whole patchset will only add in-memory backend as a whole.
No other backend nor prop.
So we can focus on the framework itself.

Next version will focus on ioctl interface modification suggested by
David.

Thanks,
Qu

Changelog:
v2:
  Totally reworked to handle multiple backends
v3:
  Fix a stupid but deadly on-disk backend bug
  Add handling for the multiple-hashes-on-same-bytenr corner case to fix
  an abort trans error
  Increase dedup rate by enhancing the delayed ref handler for both
  backends.
  Move dedup_add() to run_delayed_ref() time, to fix abort trans error.
  Increase dedup block size upper limit to 8M.
v4:
  Add dedup prop for disabling dedup for given files/dirs.
  Merge inmem_search() and ondisk_search() into generic_search() to save
  some code
  Fix another delayed_ref related bug.
  Use the same mutex for both inmem and ondisk backend.
  Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup
  rate.
v5:
  Reuse compress routine for much simpler dedup function.
  Slightly improved performance due to above modification.
  Fix race between dedup enable/disable
  Fix for false ENOSPC report
v6:
  Further enable/disable race window fix.
  Minor format change according to checkpatch.
v7:
  Fix one concurrency bug with balance.
  Slightly modify return value from -EINVAL to -EOPNOTSUPP for
  btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands
  and wrong parameter.
  Rebased to integration-4.6.
v8:
  Rename 

btrfs defrag: success or failure?

2016-06-21 Thread Dmitry Katsubo
Hi everyone,

I got the following message:

# btrfs fi defrag -r /home
ERROR: defrag failed on /home/user/.dropbox-dist/dropbox: Success
total 1 failures

I feel that something went wrong, but the message is a bit misleading.

Anyway: provided that Dropbox is running on the system, does it mean
that it cannot be defragmented?

-- 
With best regards,
Dmitry


Re: Is "btrfs balance start" truly asynchronous?

2016-06-21 Thread Dmitry Katsubo
On 2016-06-21 15:17, Graham Cobb wrote:
> On 21/06/16 12:51, Austin S. Hemmelgarn wrote:
>> The scrub design works, but the whole state file thing has some rather
>> irritating side effects and other implications, and developed out of
>> requirements that aren't present for balance (it might be nice to check
>> how many chunks actually got balanced after the fact, but it's not
>> absolutely necessary).
> 
> Actually, that would be **really** useful.  I have been experimenting
> with cancelling balances after a certain time (as part of my
> "balance-slowly" script).  I have got it working, just using bash
> scripting, but it means my script does not know whether any work has
> actually been done by the balance run which was cancelled (if no work
> was done, but it timed out anyway, there is probably no point trying
> again with the same timeout later!).

Additionally, it would be nice if balance/scrub reported their status
via /proc in a human-readable manner (similar to /proc/mdstat).

-- 
With best regards,
Dmitry


Rescue a single-device btrfs instance with zeroed tree root

2016-06-21 Thread Ivan Shapovalov
Hello,

So this is another case of "I lost my partition and do not have
backups". More precisely, _this_ is the backup and it turned out to be
damaged.

(The backup was made by partclone.btrfs. Together with the zeroed-out
tree root, this suggests a bug in partclone...)

So: the tree root is zeroes, backup roots are zeroes too,
btrfs-find-root only reports blocks of level 0 (needed is 1).
Is there something that can be done? Maybe it is possible to
reconstruct the root from its children?
Operations log following.
Please Cc: me in replies as I'm not subscribed to the list.

1. regular mount

# mount /dev/loop0p3 /mnt/temp

=== dmesg ===
[106737.299592] BTRFS info (device loop0p3): disk space caching is enabled
[106737.299604] BTRFS: has skinny extents
[106737.299884] BTRFS error (device loop0p3): bad tree block start 0 162633449472
[106737.299888] BTRFS: failed to read tree root on loop0p3
[106737.314359] BTRFS: open_ctree failed
=== end dmesg ===

2. mount with -o recovery

# mount -o recovery /dev/loop0p3 /mnt/temp

=== dmesg ===
[106742.305720] BTRFS warning (device loop0p3): 'recovery' is deprecated, use 'usebackuproot' instead
[106742.305722] BTRFS info (device loop0p3): trying to use backup root at mount time
[106742.305724] BTRFS info (device loop0p3): disk space caching is enabled
[106742.305725] BTRFS: has skinny extents
[106742.306056] BTRFS error (device loop0p3): bad tree block start 0 162633449472
[106742.306060] BTRFS: failed to read tree root on loop0p3
[106742.306069] BTRFS error (device loop0p3): bad tree block start 0 162633449472
[106742.306071] BTRFS: failed to read tree root on loop0p3
[106742.306084] BTRFS error (device loop0p3): bad tree block start 0 162632237056
[106742.306086] BTRFS: failed to read tree root on loop0p3
[106742.306097] BTRFS error (device loop0p3): bad tree block start 0 162626682880
[106742.306100] BTRFS: failed to read tree root on loop0p3
[106742.306111] BTRFS error (device loop0p3): bad tree block start 0 162609168384
[106742.306114] BTRFS: failed to read tree root on loop0p3
[106742.327272] BTRFS: open_ctree failed
=== end dmesg ===


3. btrfs-find-root

# btrfs-find-root /dev/loop0p3
Couldn't read tree root
Superblock thinks the generation is 22332
Superblock thinks the level is 1
Well block 162633646080(gen: 22332 level: 0) seems good, but generation/level doesn't match, want gen: 22332 level: 1
Well block 162633596928(gen: 22332 level: 0) seems good, but generation/level doesn't match, want gen: 22332 level: 1
Well block 162633515008(gen: 22332 level: 0) seems good, but generation/level doesn't match, want gen: 22332 level: 1

Thanks,
-- 

Ivan Shapovalov / intelfx /



Re: [PATCH v11 00/13] Btrfs dedupe framework

2016-06-21 Thread Chandan Rajendra
On Tuesday, June 21, 2016 11:34:57 AM David Sterba wrote:
> On Tue, Jun 21, 2016 at 05:26:23PM +0800, Qu Wenruo wrote:
> > > Yeah, but I'm now concerned about the way both will be integrated in the
> > > development or preview branches, not really the functionality itself.
> > >
> > > Now the conflicts are not trivial, so this takes extra time on my side
> > > and I can't be sure about the result in the end if I put only minor
> > > efforts to resolve the conflicts ("make it compile"). And I don't want
> > > to do that too often.
> > >
> > > As stated in past discussions, the features of this impact should spend
> > > one development cycle in for-next, even if it's not ready for merge or
> > > there are reviews going on.
> > >
> > > The subpage patchset is now in a relatively good shape to start actual
> > > testing, which already revealed some problems.
> > >
> > >
> > I'm completely OK with doing the rebase, but since I don't have a 64K
> > page size machine to test on, we can only check that 4K systems are
> > unaffected.
> > 
> > Although that's not much help, it would at least be better than just
> > making it compile.
> > 
> > Also, such a rebase may help us expose bad design or unexpected corner
> > cases in dedupe.
> > So if it's OK, please let me try to do the rebase.
> 
> Well, if you base dedupe on subpage, then it could be hard to find the
> patchset that introduces bugs, or combination of both. We should be able
> to test the features independently, and thus I'm proposing to first find
> some common patchset that makes that possible.
>

Hi David,

I am not sure if I understood the above statement correctly. Do you mean to
commit the 'common/simple' patches from both the subpage-blocksize & dedupe
patchset first and then bring in the complicated ones later?

If yes, then we have a problem doing that w.r.t subpage-blocksize
patchset. The first few patches bring in the core changes necessary for the
other remaining patches.

-- 
chandan




Re: Is "btrfs balance start" truly asynchronous?

2016-06-21 Thread Lionel Bouton
Le 21/06/2016 15:17, Graham Cobb a écrit :
> On 21/06/16 12:51, Austin S. Hemmelgarn wrote:
>> The scrub design works, but the whole state file thing has some rather
>> irritating side effects and other implications, and developed out of
>> requirements that aren't present for balance (it might be nice to check
>> how many chunks actually got balanced after the fact, but it's not
>> absolutely necessary).
> Actually, that would be **really** useful.  I have been experimenting
> with cancelling balances after a certain time (as part of my
> "balance-slowly" script).  I have got it working, just using bash
> scripting, but it means my script does not know whether any work has
> actually been done by the balance run which was cancelled (if no work
> was done, but it timed out anyway, there is probably no point trying
> again with the same timeout later!).

I have the exact same use case.

We trigger balances when we detect that the free space is mostly
allocated but unused to prevent possible ENOSPC events. A balance on
busy disks can slow other I/Os so we try to limit them in time (in our
use case 15 to 30 min max is mostly OK).
Trying to emulate this by using [d|v]range was a possibility too but I
thought it could be hard to get right. We actually inspect the allocated
space before and after to report the difference but we don't know if
this difference is caused by the aborted balance or other activity (we
have to read the kernel logs to find out).

Lionel


[PATCH] btrfs-progs: add option to run balance as daemon

2016-06-21 Thread Austin S. Hemmelgarn
Currently, balance operations are run synchronously in the foreground.
This is nice for interactive management, but is kind of crappy when you
start looking at automation and similar things.

This patch adds an option to `btrfs balance start` to tell it to
daemonize prior to running the balance operation, thus allowing us to
perform balances asynchronously.  The two biggest use cases I have for
this are starting a balance on a remote server without establishing a
full shell session, and being able to background the balance in a
recovery shell (which usually has no job control) so I can still get
progress information.

Because it simply daemonizes prior to calling the balance ioctl, this
doesn't actually need any kernel support.

Signed-off-by: Austin S. Hemmelgarn 
---
This works as is, but there are two specific things I would love to
eventually fix but don't have the time to fix right now:
* There is no way to get any feedback from the balance operation.
* Because of how everything works, trying to start a new balance with
  --background while one is already running won't return an error, but
  won't queue or start a new balance either.

The first one is more a utility item than anything else, and probably
would not be hard to add.  Ideally, it should be output to a user
specified file, and this should work even for a normal foreground balance.

The second is very much a UX issue, but can't be easily solved without
doing some creative process monitoring from the parent processes.
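For reference, the detach trick the patch implements in C (fork +
setsid + fork) can be sketched in shell with setsid(1), assuming
util-linux is available; here 'sleep' and the log file stand in for the
balance ioctl and its output:

```shell
log=$(mktemp)
# New session, detached from the controlling terminal, stdio redirected;
# this mirrors the fork()+setsid()+fork() sequence in the patch.
setsid sh -c 'sleep 1; echo "balance finished"' > "$log" 2>&1 < /dev/null &
echo "caller returns immediately while the work continues"
sleep 2                     # only so this demo can observe the result
result=$(cat "$log")
echo "detached job said: $result"
rm -f "$log"
```

In the real patch the parent exits right after forking, which is why no
feedback channel exists yet; redirecting to a user-specified file, as
suggested above, would fill that gap.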

 Documentation/btrfs-balance.asciidoc |  2 ++
 cmds-balance.c   | 43 +++-
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-balance.asciidoc 
b/Documentation/btrfs-balance.asciidoc
index 7df40b9..f487dbb 100644
--- a/Documentation/btrfs-balance.asciidoc
+++ b/Documentation/btrfs-balance.asciidoc
@@ -85,6 +85,8 @@ act on system chunks (requires '-f'), see `FILTERS` section 
for details about 'f
 be verbose and print balance filter arguments
 -f
 force reducing of metadata integrity, eg. when going from 'raid1' to 'single'
+--background
+run the balance operation asynchronously in the background
 
 *status* [-v] ::
 Show status of running or paused balance.
diff --git a/cmds-balance.c b/cmds-balance.c
index 708bbf4..66169b7 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -20,6 +20,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include 
 
 #include "kerncompat.h"
@@ -510,6 +513,7 @@ static const char * const cmd_balance_start_usage[] = {
"-v be verbose",
"-f force reducing of metadata integrity",
"--full-balance do not print warning and do not delay start",
+   "--background   run the balance as a background process",
NULL
 };
 
@@ -520,6 +524,7 @@ static int cmd_balance_start(int argc, char **argv)
 	struct btrfs_balance_args *ptrs[] = { &args.data, &args.sys, &args.meta, NULL };
int force = 0;
int verbose = 0;
+   int background = 0;
unsigned start_flags = 0;
int i;
 
@@ -527,7 +532,8 @@ static int cmd_balance_start(int argc, char **argv)
 
optind = 1;
while (1) {
-   enum { GETOPT_VAL_FULL_BALANCE = 256 };
+   enum { GETOPT_VAL_FULL_BALANCE = 256,
+   GETOPT_VAL_BACKGROUND = 257 };
static const struct option longopts[] = {
{ "data", optional_argument, NULL, 'd'},
{ "metadata", optional_argument, NULL, 'm' },
@@ -536,6 +542,8 @@ static int cmd_balance_start(int argc, char **argv)
{ "verbose", no_argument, NULL, 'v' },
{ "full-balance", no_argument, NULL,
GETOPT_VAL_FULL_BALANCE },
+   { "background", no_argument, NULL,
+   GETOPT_VAL_BACKGROUND },
{ NULL, 0, NULL, 0 }
};
 
@@ -574,6 +582,9 @@ static int cmd_balance_start(int argc, char **argv)
case GETOPT_VAL_FULL_BALANCE:
start_flags |= BALANCE_START_NOWARN;
break;
+   case GETOPT_VAL_BACKGROUND:
+   background = 1;
+   break;
default:
usage(cmd_balance_start_usage);
}
@@ -626,6 +637,36 @@ static int cmd_balance_start(int argc, char **argv)
args.flags |= BTRFS_BALANCE_FORCE;
if (verbose)
dump_ioctl_balance_args();
+   if (background) {
+   switch (fork()) {
+   case (-1):
+   error("Unable to fork to run balance in background");
+   return 1;
+   break;
+   case (0):
+   setsid();
+   switch(fork()) {
+   

can't use btrfs on USB-stick (write errors)

2016-06-21 Thread Tomasz Chmielewski
I've tried to use btrfs on a USB stick, but unfortunately it fails with
write errors.


The below is for kernel 4.4; I've tried with 4.6.2, and it fails in a 
similar way.



I'm not sure how to reliably reproduce it, but it seems to me it has 
something to do with:


- plenty of random writes

- USB stick sometimes "very slow to reply" (may be USB-stick dependent)


The closest thing to reproduce pretty much within an hour was launching 
ktorrent and downloading a couple of Linux isos.


ext4 does not fail under similar load. USB stick is brand new, writing 
data with dd is successful, reading data with dd is successful (whether 
it's ext4, btrfs or raw partition); it only seems to fail with btrfs and 
plenty of random writes.



Jun 14 07:50:26 ativ kernel: [57362.029895] btrfs_dev_stat_print_on_error: 3 callbacks suppressed
Jun 14 07:50:26 ativ kernel: [57362.029912] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Jun 14 07:50:26 ativ kernel: [57362.030912] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Jun 14 07:50:26 ativ kernel: [57362.031585] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Jun 14 07:50:26 ativ kernel: [57362.032260] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Jun 14 07:50:26 ativ kernel: [57362.033040] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
Jun 14 07:50:26 ativ kernel: [57362.033400] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
Jun 14 07:50:26 ativ kernel: [57362.033408] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
Jun 14 07:50:26 ativ kernel: [57362.033491] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
Jun 14 07:50:26 ativ kernel: [57362.033498] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
Jun 14 07:50:26 ativ kernel: [57362.033566] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
Jun 14 07:50:26 ativ kernel: [57362.033587] BTRFS: error (device sdb1) in btrfs_commit_transaction:2124: errno=-5 IO failure (Error while writing out transaction)
Jun 14 07:50:26 ativ kernel: [57362.033591] BTRFS info (device sdb1): forced readonly
Jun 14 07:50:26 ativ kernel: [57362.033597] BTRFS warning (device sdb1): Skipping commit of aborted transaction.
Jun 14 07:50:26 ativ kernel: [57362.033601] [ cut here ]
Jun 14 07:50:26 ativ kernel: [57362.033659] WARNING: CPU: 3 PID: 24844 at /build/linux-oXTOqc/linux-4.4.0/fs/btrfs/transaction.c:1746 cleanup_transaction+0x92/0x300 [btrfs]()
Jun 14 07:50:26 ativ kernel: [57362.033662] BTRFS: Transaction aborted (error -5)
Jun 14 07:50:26 ativ kernel: [57362.033665] Modules linked in: 
nls_iso8859_1 ctr ccm hid_generic usbhid hid rfcomm pci_stub vboxpci(OE) 
vboxnetadp(OE) vboxnetflt(OE) msr vboxdrv(OE) drbg ansi_cprng 
xt_CHECKSUM iptable_mangle bnep ipt_MASQUERADE nf_nat_masquerade_ipv4 
iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 
xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp 
llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter 
ip_tables x_tables rtsx_usb_ms memstick ath3k btusb btrtl btbcm btintel 
amd_freq_sensitivity uvcvideo crct10dif_pclmul videobuf2_vmalloc 
videobuf2_memops crc32_pclmul bluetooth samsung_laptop videobuf2_v4l2 
videobuf2_core v4l2_common arc4 videodev media ath9k aesni_intel 
ath9k_common aes_x86_64 snd_hda_codec_realtek lrw snd_hda_codec_generic 
gf128mul ath9k_hw glue_helper snd_hda_codec_hdmi ablk_helper 
snd_hda_intel cryptd snd_hda_codec ath snd_hda_core snd_hwdep mac80211 
joydev snd_pcm edac_mce_amd input_leds snd_seq_midi edac_core serio_raw 
snd_seq_midi_event fam15h_power k10temp i2c_piix4 snd_rawmidi cfg80211 
snd_seq snd_seq_device snd_timer shpchp snd soundcore mac_hid 
binfmt_misc kvm_amd kvm irqbypass parport_pc ppdev sunrpc lp parport 
autofs4 btrfs xor raid6_pq rtsx_usb_sdmmc uas usb_storage rtsx_usb 
amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper syscopyarea 
psmouse sysfillrect sysimgblt fb_sys_fops ahci drm libahci r8169 mii wmi 
fjes video
Jun 14 07:50:26 ativ kernel: [57362.033842] CPU: 3 PID: 24844 Comm: 
ktorrent Tainted: GW  OE   4.4.0-24-generic #43-Ubuntu
Jun 14 07:50:26 ativ kernel: [57362.033846] Hardware name: SAMSUNG 
ELECTRONICS CO., LTD. 905S3G/906S3G/915S3G/NP905S3G-K01PL, BIOS 
P06RBV.074.130822.FL 08/22/2013
Jun 14 07:50:26 ativ kernel: [57362.033851]  0286 
4d3ecc09 8800043a3bf0 813eab23
Jun 14 07:50:26 ativ kernel: [57362.033858]  8800043a3c38 
c03fa060 8800043a3c28 810810d2
Jun 14 07:50:26 ativ kernel: [57362.033864]  8801211a5cb0 
880137001800 8800442a93c0 fffb

Jun 14 07:50:26 ativ kernel: 

Re: [Y2038] [PATCH v2 00/24] Delete CURRENT_TIME and CURRENT_TIME_SEC macros

2016-06-21 Thread Arnd Bergmann
On Monday, June 20, 2016 11:03:01 AM CEST you wrote:
> On Sun, Jun 19, 2016 at 5:26 PM, Deepa Dinamani  
> wrote:
> > The series is aimed at getting rid of CURRENT_TIME and CURRENT_TIME_SEC 
> > macros.

> Gcc handles 8-byte structure returns (on most architectures) by
> returning them as two 32-bit registers (%edx:%eax on x86). But once it
> is timespec64, that will no longer be the case, and the calling
> convention will end up using a pointer to the local stack instead.

I guess we already have that today, as the implementation of
current_fs_time() is

static inline struct timespec64 tk_xtime(struct timekeeper *tk)
{
struct timespec64 ts;

ts.tv_sec = tk->xtime_sec;
ts.tv_nsec = (long)(tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift);
return ts;
}
extern struct timespec64 current_kernel_time64(void);
struct timespec64 current_kernel_time64(void)
{
struct timekeeper *tk = &tk_core.timekeeper;
struct timespec64 now;
unsigned long seq;

do {
seq = read_seqcount_begin(&tk_core.seq);

now = tk_xtime(tk);
} while (read_seqcount_retry(&tk_core.seq, seq));

return now;
}
static inline struct timespec current_kernel_time(void)
{
struct timespec64 now = current_kernel_time64();

return timespec64_to_timespec(now);
}
extern struct timespec current_fs_time(struct super_block *sb);
struct timespec current_fs_time(struct super_block *sb)
{   
struct timespec now = current_kernel_time();
return timespec_trunc(now, sb->s_time_gran);
}   

We can surely do a little better than this, independent of the
conversion in Deepa's patch set.
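As an aside, the 8-byte vs. 16-byte struct return convention described
above can be seen in plain user-space C. This is only a sketch (not
kernel code; the `ts32`/`ts64` names are made up for illustration): an
8-byte struct typically comes back in a register pair on 32-bit ABIs,
while the 16-byte timespec64-style struct is returned through a hidden
pointer to the caller's stack.

```c
#include <stdint.h>

/* 8 bytes: fits in two 32-bit registers (%edx:%eax on i386) */
struct ts32 { int32_t tv_sec; int32_t tv_nsec; };

/* 16 bytes: returned via a hidden result pointer on 32-bit ABIs */
struct ts64 { int64_t tv_sec; int64_t tv_nsec; };

static struct ts32 get_ts32(void)
{
	struct ts32 ts = { 1466553600, 500 };
	return ts;	/* register-pair return on most 32-bit targets */
}

static struct ts64 get_ts64(void)
{
	struct ts64 ts = { 1466553600LL, 500LL };
	return ts;	/* caller passes a pointer to stack storage */
}
```

Compiling `get_ts32()` and `get_ts64()` with `-m32 -S` makes the
difference in the generated calling sequence easy to inspect.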

> So for 32-bit code generation, we *may* want to introduce a new model of doing
> 
> set_inode_time(inode, ATTR_ATIME | ATTR_MTIME);
> 
> which basically just does
> 
> inode->i_atime = inode->i_mtime = current_time(inode);
> 
> but with a much easier calling convention on 32-bit architectures.
> 
> But that is entirely orthogonal to this patch-set, and should be seen
> as a separate issue.

I've played around with that, but found it hard to avoid going
through memory, short of going all the way into the tk_xtime()
access to copy both tk->xtime_sec and the nanoseconds directly into
the inode fields.

Without that, the set_inode_time() implementation ends up
being more expensive than
inode->i_atime = inode->i_ctime = inode->i_mtime = current_time(inode);
because we still copy through the stack but also have
a couple of conditional branches that we don't have at the
moment.

At the moment, the triple assignment becomes (here on ARM)

   c:   4668mov r0, sp
  12:   f7ff fffe   bl  0 <current_kernel_time64>
  3e:   f107 0520   add.w   r5, r7, #32
12: R_ARM_THM_CALL  current_kernel_time64
  16:   f106 0410   add.w   r4, r6, #16
  1a:   e89d 000f   ldmia.w sp, {r0, r1, r2, r3} # load from stack
  1e:   e885 000f   stmia.w r5, {r0, r1, r2, r3} # store into i_atime
  22:   e884 000f   stmia.w r4, {r0, r1, r2, r3} #i_ctime
  26:   e886 000f   stmia.w r6, {r0, r1, r2, r3} #i_mtime

and a slightly more verbose version of the same thing on x86
(storing only 12 bytes instead of 16 is cheaper there, while
ARM does a store-multiple to copy the entire structure).

Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/6] fstests: btrfs: test RAID1 device reappear and balanced

2016-06-21 Thread Eryu Guan
On Wed, Jun 15, 2016 at 04:48:47PM +0800, Anand Jain wrote:
> From: Anand Jain 
> 
> The test does the following:
>   Initialize a RAID1 with some data
> 
>   Re-mount RAID1 degraded with _dev1_ and write up to
>   half of the FS capacity

If the test devices are big enough, this test takes a much longer
time. I tested with a 15G scratch dev pool and this test ran ~200s on my
4vcpu 8G memory test vm.

Is it possible to limit the file size or the device size used, so that
the runtime won't grow with device size? I'm thinking about something
like _scratch_mkfs_sized, but that doesn't work for a dev pool.

>   Save md5sum checkpoint1
> 
>   Re-mount healthy RAID1
> 
>   Let balance re-silver.
>   Save md5sum checkpoint2
> 
>   Re-mount RAID1 degraded with _dev2_
>   Save md5sum checkpoint3
> 
>   Verify if all three md5sum match
> 
> Signed-off-by: Anand Jain 
> ---
> v2:
>   add tmp= and its rm
>   add comments to why _reload_btrfs_ko is used
>   add missing put and test_mount at notrun exit
>   use echo instead of _fail when checkpoints are checked
>   .out updated to remove Silence..
> 
>  tests/btrfs/123 | 169 
> 
>  tests/btrfs/123.out |   7 +++
>  tests/btrfs/group   |   1 +
>  3 files changed, 177 insertions(+)
>  create mode 100755 tests/btrfs/123
>  create mode 100644 tests/btrfs/123.out
> 
> diff --git a/tests/btrfs/123 b/tests/btrfs/123
> new file mode 100755
> index ..33decfd1c434
> --- /dev/null
> +++ b/tests/btrfs/123
> @@ -0,0 +1,169 @@
> +#! /bin/bash
> +# FS QA Test 123
> +#
> +# This test verify the RAID1 reconstruction on the reappeared
> +# device. By using the following steps:
> +# Initialize a RAID1 with some data
> +#
> +# Re-mount RAID1 degraded with dev2 missing and write up to
> +# half of the FS capacity.
> +# Save md5sum checkpoint1
> +#
> +# Re-mount healthy RAID1
> +#
> +# Let balance re-silver.
> +# Save md5sum checkpoint2
> +#
> +# Re-mount RAID1 degraded with dev1 missing
> +# Save md5sum checkpoint3
> +#
> +# Verify if all three checkpoints match
> +#
> +#-
> +# Copyright (c) 2016 Oracle.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#-
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + cd /
> + rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch_nocheck

Why not check the filesystem after the test? A comment would be good if
there's a good reason. Patch 6 needs it as well :)

Thanks,
Eryu

> +_require_scratch_dev_pool 2
> +
> +# the mounted test dir prevent btrfs unload, we need to unmount
> +_test_unmount
> +_require_btrfs_loadable
> +
> +_scratch_dev_pool_get 2
> +
> +dev1=`echo $SCRATCH_DEV_POOL | awk '{print $1}'`
> +dev2=`echo $SCRATCH_DEV_POOL | awk '{print $2}'`
> +
> +dev1_sz=`blockdev --getsize64 $dev1`
> +dev2_sz=`blockdev --getsize64 $dev2`
> +# get min of both
> +max_fs_sz=`echo -e "$dev1_sz\n$dev2_sz" | sort | head -1`
> +max_fs_sz=$(( max_fs_sz/2 ))
> +if [ $max_fs_sz -gt 100 ]; then
> + bs="1M"
> + count=$(( max_fs_sz/100 ))
> +else
> + max_fs_sz=$(( max_fs_sz*2 ))
> + _scratch_dev_pool_put
> + _test_mount
> + _notrun "Smallest dev size $max_fs_sz, Need at least 2M"
> +fi
> +
> +echo >> $seqres.full
> +echo "max_fs_sz=$max_fs_sz count=$count" >> $seqres.full
> +echo "-Initialize -" >> $seqres.full
> +_scratch_pool_mkfs "-mraid1 -draid1" >> $seqres.full 2>&1
> +_scratch_mount >> $seqres.full 2>&1
> +_run_btrfs_util_prog filesystem show
> +dd if=/dev/zero of="$SCRATCH_MNT"/tf1 bs=$bs count=1 \
> + >>$seqres.full 2>&1
> +count=$(( count-- ))
> +echo "unmount" >> $seqres.full
> +echo "clean btrfs ko" >> $seqres.full
> +_scratch_unmount
> +
> +# un-scan the btrfs devices
> +_reload_btrfs_ko
> +
> +
> +echo >> 

Re: [PATCH v2 2/2] btrfs: wait for bdev put

2016-06-21 Thread Chris Mason

On 06/21/2016 06:24 AM, Anand Jain wrote:

From: Anand Jain 

Further to the commit
  bc178622d40d87e75abc131007342429c9b03351
  btrfs: use rcu_barrier() to wait for bdev puts at unmount

This patch implements a method to time wait on the __free_device()
which actually does the bdev put. This is needed as the user space
running 'btrfs fi show -d' immediately after the replace and
unmount, is still reading older information from the device.


Thanks for working on this Anand.  Since it looks like blkdev_put can 
deadlock against us, can we please switch to making sure we fully flush 
the outstanding IO?  It's probably enough to do a sync_blockdev() call 
before we allow the unmount to finish, but we can toss in an 
invalidate_bdev for good measure.


Then we can get rid of the mdelay loop completely, which seems pretty 
error prone to me.


Thanks!

-chris



Re: Is "btrfs balance start" truly asynchronous?

2016-06-21 Thread Graham Cobb
On 21/06/16 12:51, Austin S. Hemmelgarn wrote:
> The scrub design works, but the whole state file thing has some rather
> irritating side effects and other implications, and developed out of
> requirements that aren't present for balance (it might be nice to check
> how many chunks actually got balanced after the fact, but it's not
> absolutely necessary).

Actually, that would be **really** useful.  I have been experimenting
with cancelling balances after a certain time (as part of my
"balance-slowly" script).  I have got it working, just using bash
scripting, but it means my script does not know whether any work has
actually been done by the balance run which was cancelled (if no work
was done, but it timed out anyway, there is probably no point trying
again with the same timeout later!).

Graham





Re: [PATCH v2 1/6] fstests: btrfs: add functions to set and reset required number of SCRATCH_DEV_POOL

2016-06-21 Thread Eryu Guan
On Wed, Jun 15, 2016 at 04:46:03PM +0800, Anand Jain wrote:
> From: Anand Jain 
> 
> This patch provides functions
>  _scratch_dev_pool_get()
>  _scratch_dev_pool_put()
> 
> Which will help to set/reset SCRATCH_DEV_POOL with the required
> number of devices. SCRATCH_DEV_POOL_SAVED will hold all the devices.
> 
> Usage:
>   _scratch_dev_pool_get() 
>   :: do stuff
> 
>   _scratch_dev_pool_put()
> 
> Signed-off-by: Anand Jain 

I think the helpers should be introduced when they are first used, so
that we know why they're introduced, and know how they're used with
clear examples.

So patch 1, 2 and 4 can be merged as one patch, patch 3 updates
btrfs/027, patch 4 and patch 5 can be merged as one patch, then comes
patch 6.

What do you think?

> ---
>  v2: Error message and comment section updates.
> 
>  common/rc | 55 +++
>  1 file changed, 55 insertions(+)
> 
> diff --git a/common/rc b/common/rc
> index 51092a0644f0..31c46ba1226e 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -802,6 +802,61 @@ _scratch_mkfs()
>  esac
>  }
>  
> +#
> +# Generally test cases will have..
> +#   _require_scratch_dev_pool X
> +# to make sure it has the enough scratch devices including
> +# replace-target and spare device. Now arg1 here is the
> +# required number of scratch devices by a-test-case excluding
> +# the replace-target and spare device. So this function will
> +# set SCRATCH_DEV_POOL to the specified number of devices.
> +#
> +# Usage:
> +#  _scratch_dev_pool_get() 
> +# :: do stuff
> +#
> +#  _scratch_dev_pool_put()
> +#
> +_scratch_dev_pool_get()
> +{
> + if [ $# != 1 ]; then

"-ne", "-eq" etc. are used for comparing integers; "!=", "==" are for
strings. And I think $# and $? are integers here, so "-ne" would be better.

Thanks,
Eryu

> + _fail "Usage: _scratch_dev_pool_get ndevs"
> + fi
> +
> + local test_ndevs=$1
> + local config_ndevs=`echo $SCRATCH_DEV_POOL| wc -w`
> + local devs[]="( $SCRATCH_DEV_POOL )"
> +
> + typeset -p config_ndevs >/dev/null 2>&1
> + if [ $? != 0 ]; then
> + _fail "Bug: cant find SCRATCH_DEV_POOL ndevs"
> + fi
> +
> + if [ $config_ndevs -lt $test_ndevs ]; then
> + _notrun "Need at least test requested number of ndevs 
> $test_ndevs"
> + fi
> +
> + SCRATCH_DEV_POOL_SAVED=${SCRATCH_DEV_POOL}
> + export SCRATCH_DEV_POOL_SAVED
> + SCRATCH_DEV_POOL=${devs[@]:0:$test_ndevs}
> + export SCRATCH_DEV_POOL
> +}
> +
> +_scratch_dev_pool_put()
> +{
> + typeset -p SCRATCH_DEV_POOL_SAVED >/dev/null 2>&1
> + if [ $? != 0 ]; then
> + _fail "Bug: unset val, must call _scratch_dev_pool_get before 
> _scratch_dev_pool_put"
> + fi
> +
> + if [ -z "$SCRATCH_DEV_POOL_SAVED" ]; then
> + _fail "Bug: str empty, must call _scratch_dev_pool_get before 
> _scratch_dev_pool_put"
> + fi
> +
> + export SCRATCH_DEV_POOL=$SCRATCH_DEV_POOL_SAVED
> + export SCRATCH_DEV_POOL_SAVED=""
> +}
> +
>  _scratch_pool_mkfs()
>  {
>  case $FSTYP in
> -- 
> 2.7.0
> 


Re: Is "btrfs balance start" truly asynchronous?

2016-06-21 Thread Zygo Blaxell
On Tue, Jun 21, 2016 at 07:24:24AM -0400, Austin S. Hemmelgarn wrote:
> (for example, you can't easily start a balance on a remote
> system via a ssh command, which is the specific use case I have).

Wait, what?

ssh remotehost -n btrfs balance start -d... -m... /foo \&

or

ssh remotehost -f btrfs balance start -d... -m... /foo

It even works with systemd's auto-kill feature (send btrfs balance all the
SIGKILLs you want, the kernel just ignores them).





Re: Is "btrfs balance start" truly asynchronous?

2016-06-21 Thread Hugo Mills
On Tue, Jun 21, 2016 at 07:24:24AM -0400, Austin S. Hemmelgarn wrote:
> On 2016-06-21 04:55, Duncan wrote:
> >Dmitry Katsubo posted on Mon, 20 Jun 2016 18:33:54 +0200 as excerpted:
> >
> >>Dear btfs community,
> >>
> >>I have added a drive to existing raid1 btrfs volume and decided to
> >>perform balancing so that data distributes "fairly" among drives. I have
> >>started "btrfs balance start", but it stalled for about 5-10 minutes
> >>intensively doing the work. After that time it has printed something
> >>like "had to relocate 50 chunks" and exited. According to drive I/O,
> >>"btrfs balance" did most (if not all) of the work, so after it has
> >>exited the job was done.
> >>
> >>Shouldn't "btrfs balance start" do the operation in the background?
> >
> >From the btrfs-balance (8) manpage (from btrfs-progs-4.5.3):
> >
start [options] <path>
> >start the balance operation according to the specified filters,
> >no filters will rewrite the entire filesystem. The process runs
> >in the foreground.
> >
> >
> >So the balance start operation runs in the foreground, but as explained
> >elsewhere in the manpage, the balance is interruptible by unmount and
> >will automatically restart after a remount.  It can also be paused and
> >resumed or canceled with the appropriate btrfs balance subcommands.
> >
> FWIW, there was some talk a while back about possibly providing an
> option to run balance in the background.  If I end up finding the
> time, I may write a patch for this (userland only, I'm not
> interested in mucking around with the kernel side of things, and
> it's fully possible to do this just using libc functions), as it's
> something I'd rather like to have myself, as the current method of
> using job control in a shell doesn't really work in some
> circumstances (for example, you can't easily start a balance on a
> remote system via a ssh command, which is the specific use case I
> have).

   There's quite a bit of infrastructure in the userspace tools to
deal with managing an asynchronous scrub. It would probably be worth
looking at that in the first instance to see if it can be reused for
balance.

   Hugo.

-- 
Hugo Mills |
hugo@... carfax.org.uk | __(_'>
http://carfax.org.uk/  | Squeak!
PGP: E2AB1DE4  |




Re: Is "btrfs balance start" truly asynchronous?

2016-06-21 Thread Austin S. Hemmelgarn

On 2016-06-21 07:33, Hugo Mills wrote:

On Tue, Jun 21, 2016 at 07:24:24AM -0400, Austin S. Hemmelgarn wrote:

On 2016-06-21 04:55, Duncan wrote:

Dmitry Katsubo posted on Mon, 20 Jun 2016 18:33:54 +0200 as excerpted:


Dear btfs community,

I have added a drive to existing raid1 btrfs volume and decided to
perform balancing so that data distributes "fairly" among drives. I have
started "btrfs balance start", but it stalled for about 5-10 minutes
intensively doing the work. After that time it has printed something
like "had to relocate 50 chunks" and exited. According to drive I/O,
"btrfs balance" did most (if not all) of the work, so after it has
exited the job was done.

Shouldn't "btrfs balance start" do the operation in the background?



>From the btrfs-balance (8) manpage (from btrfs-progs-4.5.3):


start [options] <path>
   start the balance operation according to the specified filters,
   no filters will rewrite the entire filesystem. The process runs
   in the foreground.


So the balance start operation runs in the foreground, but as explained
elsewhere in the manpage, the balance is interruptible by unmount and
will automatically restart after a remount.  It can also be paused and
resumed or canceled with the appropriate btrfs balance subcommands.


FWIW, there was some talk a while back about possibly providing an
option to run balance in the background.  If I end up finding the
time, I may write a patch for this (userland only, I'm not
interested in mucking around with the kernel side of things, and
it's fully possible to do this just using libc functions), as it's
something I'd rather like to have myself, as the current method of
using job control in a shell doesn't really work in some
circumstances (for example, you can't easily start a balance on a
remote system via a ssh command, which is the specific use case I
have).


   There's quite a bit of infrastructure in the userspace tools to
deal with managing an asynchronous scrub. It would probably be worth
looking at that in the first instance to see if it can be reused for
balance.
Yeah, but we've also already got most of what's needed though for an 
asynchronous balance.  The kernel itself functionally mutexes balances 
(at least, I'm pretty certain it does), we already store state in the 
filesystem itself (so that it can be auto-resumed on remount), and we 
already have commands for pausing, resuming, canceling, and checking 
status.  The only thing that appears to be missing is the ability to 
have the balance backgrounded by the tools themselves instead of needing 
POSIX sh job control or something to daemonize it.
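The libc-only backgrounding mentioned above could look roughly like
this. It is only a user-space sketch, not btrfs-progs code:
run_balance() is a hypothetical stand-in for whatever blocking call
(the balance ioctl) the tool would make; the fork/setsid/redirect
pattern is the standard way to detach without shell job control.

```c
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

/* hypothetical placeholder for the blocking balance operation */
static int run_balance(void)
{
	return 0;
}

/* Return to the caller immediately in the parent; run the blocking
 * work in a detached child.  Returns 0 on success, -1 if fork failed. */
static int background_and_run(void)
{
	pid_t pid = fork();

	if (pid < 0)
		return -1;	/* fork failed */
	if (pid > 0)
		return 0;	/* parent: back to the shell at once */

	/* child: new session, detach stdio, then do the work */
	setsid();
	int devnull = open("/dev/null", O_RDWR);
	if (devnull >= 0) {
		dup2(devnull, STDIN_FILENO);
		dup2(devnull, STDOUT_FILENO);
		dup2(devnull, STDERR_FILENO);
		if (devnull > 2)
			close(devnull);
	}
	_exit(run_balance() ? EXIT_FAILURE : EXIT_SUCCESS);
}
```

With something like this wired into the tool, `ssh host btrfs balance
start --background /foo` style invocations would return promptly
without any shell trickery.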


The scrub design works, but the whole state file thing has some rather 
irritating side effects and other implications, and developed out of 
requirements that aren't present for balance (it might be nice to check 
how many chunks actually got balanced after the fact, but it's not 
absolutely necessary).




Re: [PATCH v2 2/2] btrfs: wait for bdev put

2016-06-21 Thread Holger Hoffstätte
On 06/21/16 12:24, Anand Jain wrote:
> From: Anand Jain 
> 
> Further to the commit
>  bc178622d40d87e75abc131007342429c9b03351
>  btrfs: use rcu_barrier() to wait for bdev puts at unmount
> 
> This patch implements a method to time wait on the __free_device()
> which actually does the bdev put. This is needed as the user space
> running 'btrfs fi show -d' immediately after the replace and
> unmount, is still reading older information from the device.
> 
>  mail-archive.com/linux-btrfs@vger.kernel.org/msg54188.html
> 
> Signed-off-by: Anand Jain 
> [updates: bc178622d40d87e75abc131007342429c9b03351]
> ---
> v2: Also to make sure bdev_closing is set it needs rcu_barrier(),
> restored rcu_barrier().

Looks like this one works reliably again. ;)
Tested with a slow disk, no long unmounts or timeout messages.

Tested-by: Holger Hoffstätte 

thanks!
Holger



Re: Is "btrfs balance start" truly asynchronous?

2016-06-21 Thread Austin S. Hemmelgarn

On 2016-06-21 04:55, Duncan wrote:

Dmitry Katsubo posted on Mon, 20 Jun 2016 18:33:54 +0200 as excerpted:


Dear btfs community,

I have added a drive to existing raid1 btrfs volume and decided to
perform balancing so that data distributes "fairly" among drives. I have
started "btrfs balance start", but it stalled for about 5-10 minutes
intensively doing the work. After that time it has printed something
like "had to relocate 50 chunks" and exited. According to drive I/O,
"btrfs balance" did most (if not all) of the work, so after it has
exited the job was done.

Shouldn't "btrfs balance start" do the operation in the background?


From the btrfs-balance (8) manpage (from btrfs-progs-4.5.3):

start [options] <path>
start the balance operation according to the specified filters,
no filters will rewrite the entire filesystem. The process runs
in the foreground.


So the balance start operation runs in the foreground, but as explained
elsewhere in the manpage, the balance is interruptible by unmount and
will automatically restart after a remount.  It can also be paused and
resumed or canceled with the appropriate btrfs balance subcommands.

FWIW, there was some talk a while back about possibly providing an 
option to run balance in the background.  If I end up finding the time, 
I may write a patch for this (userland only, I'm not interested in 
mucking around with the kernel side of things, and it's fully possible 
to do this just using libc functions), as it's something I'd rather like 
to have myself, as the current method of using job control in a shell 
doesn't really work in some circumstances (for example, you can't easily 
start a balance on a remote system via a ssh command, which is the 
specific use case I have).



Re: Confusing output from fi us/df

2016-06-21 Thread Duncan
Hans van Kranenburg posted on Tue, 21 Jun 2016 02:25:20 +0200 as
excerpted:

> On 06/21/2016 01:30 AM, Marc Grondin wrote:
>>
>> I have a btrfs filesystem ontop of a 4x1tb mdraid raid5 array and I've
>> been getting confusing output on metadata usage. Seems that even tho
>> both data and metadata are in single profile metadata is reporting
>> double the space(as if it was in dupe profile)
> 
> I guess that's a coincidence.

Yes.

>  From the total amount of space you have (on top of the mdraid), there's
> 3 GiB allocated/reserved for use as metadata. Inside that 3 GiB, 1.53GiB
> of actual metadata is present.
> 
>>[...]
>> Metadata,single: Size:3.00GiB, Used:1.53GiB
>> /dev/mapper/storage2 3.00GiB
> 
>> Metadata, single: total=3.00GiB, used=1.53GiB

[Explaining a bit more than HvK or ST did.]

Btrfs does two-stage allocation.  First it allocates chunks, separately 
for data vs. metadata.  Then it uses the space in those chunks, until it 
needs to allocate more.

It's simply coincidence that the actual used metadata space within the 
allocation happens to be approximately half of the allocated metadata 
chunk space.

Tho unlike data, metadata space should never get completely full -- it'll 
need to allocate a new chunk before it reports full, because the global 
reserve space comes from metadata as well.  (The reserve always shows 
zero usage unless the filesystem is in severe straits -- if you ever see 
global reserve usage above zero, you know the filesystem is in serious 
trouble, space-wise.)

So in reality, you have 3 gigs of metadata chunks allocated, 1.53 gigs of 
it used for actual metadata, and half a gig (512 MiB) reserved as global-
reserve (none of which is used =:^), so in actuality, approximately 2.03 
GiB of the 3.00 GiB of metadata chunks are used, with 0.97 GiB of 
metadata free.

Now metadata chunks are nominally 256 MiB (quarter GiB) each, while data 
chunks are nominally 1 GiB each.  However, that's just the _nominal_ 
size.  On TB-scale filesystems they may be larger.

In any case, you could balance the metadata and possibly reclaim a bit of 
space, but with usage including the reserve slightly over 2 GiB, you 
could only get down to 2.25 GiB metadata allocation best-case, and may be 
stuck with 2.5 or even the same 3 GiB, depending on actual metadata chunk 
size.
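The arithmetic above can be sketched as a back-of-envelope calculation
(not from the mail itself; the figures are the ones quoted, in MiB, and
the nominal 256 MiB metadata chunk size is assumed):

```c
#define CHUNK_MIB 256	/* nominal metadata chunk size */

/* Best-case metadata allocation a balance could reach: actual usage
 * plus the global reserve, rounded up to whole metadata chunks. */
static unsigned int best_case_alloc_mib(unsigned int used_mib,
					unsigned int reserve_mib)
{
	unsigned int need = used_mib + reserve_mib;

	/* round up to a whole chunk */
	return (need + CHUNK_MIB - 1) / CHUNK_MIB * CHUNK_MIB;
}
```

Plugging in ~1.53 GiB (~1567 MiB) of metadata plus the 512 MiB reserve
gives 2079 MiB, which rounds up to nine 256 MiB chunks, i.e. the
2.25 GiB best case mentioned above.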

But I'd not worry about it yet.  Once unallocated space gets down to 
about half a TB, or either data or metadata size becomes multiple times 
actual usage, a balance will arguably be useful.  But the numbers look 
pretty healthy ATM.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



[PATCH v2 2/2] btrfs: wait for bdev put

2016-06-21 Thread Anand Jain
From: Anand Jain 

Further to the commit
 bc178622d40d87e75abc131007342429c9b03351
 btrfs: use rcu_barrier() to wait for bdev puts at unmount

This patch implements a method to time wait on the __free_device()
which actually does the bdev put. This is needed as the user space
running 'btrfs fi show -d' immediately after the replace and
unmount, is still reading older information from the device.

 mail-archive.com/linux-btrfs@vger.kernel.org/msg54188.html

Signed-off-by: Anand Jain 
[updates: bc178622d40d87e75abc131007342429c9b03351]
---
v2: Also to make sure bdev_closing is set it needs rcu_barrier(),
restored rcu_barrier().

 fs/btrfs/volumes.c | 45 +++--
 fs/btrfs/volumes.h |  1 +
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 604daf315669..ef61c34cafbf 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include <linux/delay.h>
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -254,6 +255,17 @@ static struct btrfs_device *__alloc_device(void)
return dev;
 }
 
+static int is_device_closing(struct list_head *head)
+{
+   struct btrfs_device *dev;
+
+   list_for_each_entry(dev, head, dev_list) {
+   if (dev->bdev_closing)
+   return 1;
+   }
+   return 0;
+}
+
 static noinline struct btrfs_device *__find_device(struct list_head *head,
   u64 devid, u8 *uuid)
 {
@@ -832,12 +844,22 @@ again:
 static void __free_device(struct work_struct *work)
 {
struct btrfs_device *device;
+   struct btrfs_device *new_device_addr;
 
device = container_of(work, struct btrfs_device, rcu_work);
 
if (device->bdev)
blkdev_put(device->bdev, device->mode);
 
+   /*
+* If we are coming here from btrfs_close_one_device()
+* then it allocates a new device structure for the same
+* devid, so find device again with the devid
+*/
+   new_device_addr = __find_device(&device->fs_devices->devices,
+                                   device->devid, NULL);
+
+   new_device_addr->bdev_closing = 0;
rcu_string_free(device->name);
kfree(device);
 }
@@ -884,6 +906,12 @@ static void btrfs_close_one_device(struct btrfs_device 
*device)
list_replace_rcu(&device->dev_list, &new_device->dev_list);
new_device->fs_devices = device->fs_devices;
 
+   /*
+* So to wait for kworkers to finish all blkdev_puts,
+* so device is really free when umount is done.
+*/
+   new_device->bdev_closing = 1;
+
call_rcu(&device->rcu, free_device);
 }
 
@@ -912,6 +940,7 @@ int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
 {
struct btrfs_fs_devices *seed_devices = NULL;
int ret;
+   int retry_cnt = 5;
 
mutex_lock(_mutex);
ret = __btrfs_close_devices(fs_devices);
@@ -929,10 +958,22 @@ int btrfs_close_devices(struct btrfs_fs_devices 
*fs_devices)
}
/*
 * Wait for rcu kworkers under __btrfs_close_devices
-* to finish all blkdev_puts so device is really
-* free when umount is done.
+* to finish all free_device()
 */
rcu_barrier();
+
+   /*
+* Wait for a grace period so that __free_device()
+* will actually do the device close.
+*/
+   while (is_device_closing(&fs_devices->devices) &&
+   --retry_cnt) {
+   mdelay(1000); //1 sec
+   }
+
+   if (!(retry_cnt > 0))
+   printk(KERN_WARNING "BTRFS: %pU bdev_put didn't complete, 
giving up\n",
+   fs_devices->fsid);
return ret;
 }
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 0ac90f8d85bd..945e49f5e17d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -150,6 +150,7 @@ struct btrfs_device {
/* Counter to record the change of device stats */
atomic_t dev_stats_ccnt;
atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
+   int bdev_closing;
 };
 
 /*
-- 
2.7.0



Re: [PATCH v11 00/13] Btrfs dedupe framework

2016-06-21 Thread David Sterba
On Tue, Jun 21, 2016 at 05:26:23PM +0800, Qu Wenruo wrote:
> > Yeah, but I'm now concerned about the way both will be integrated in the
> > development or preview branches, not really the functionality itself.
> >
> > Now the conflicts are not trivial, so this takes extra time on my side
> > and I can't be sure about the result in the end if I put only minor
> > efforts to resolve the conflicts ("make it compile"). And I don't want
> > to do that too often.
> >
> > As stated in past discussions, the features of this impact should spend
> > one development cycle in for-next, even if it's not ready for merge or
> > there are reviews going on.
> >
> > The subpage patchset is now in a relatively good shape to start actual
> > testing, which already revealed some problems.
> >
> >
> I'm completely OK to do the rebase, but since I don't have 64K page size 
> machine to test the rebase, we can only test if 4K system is unaffected.
> 
> Although not much help, at least it would be better than making it compile.
> 
> Also such rebase may help us to expose bad design/unexpected corner case 
> in dedupe.
> So if it's OK, please let me try to do the rebase.

Well, if you base dedupe on subpage, then it could be hard to find the
patchset that introduces bugs, or combination of both. We should be able
to test the features independently, and thus I'm proposing to first find
some common patchset that makes that possible.


Re: [PATCH v11 00/13] Btrfs dedupe framework

2016-06-21 Thread Qu Wenruo



At 06/21/2016 05:13 PM, David Sterba wrote:

On Tue, Jun 21, 2016 at 08:36:49AM +0800, Qu Wenruo wrote:

I'm looking how well does this patchset merges with the rest, so far
there are excpected conflicts with Chandan's subpage-blocksize
patchset. For the easy parts, we can add stub patches to extend
functions like cow_file_range with parameters that are added by the
other patch.

Honestly I don't know which patchset to take first. As the
subpage-blockszie is in the core, there are no user visibility and
interface questions, but it must not cause any regressions.

Dedupe is optional, not default, and we have to mainly make sure it does
not have any impact when turned off.

So I see three possible ways:

* merge subpage first, as it defines the API, adapt dedupe


Personally, I'd like to merge subpage first.

AFAIK, it's more important than dedupe.
It affects whether a fs created in 64K page size environment can be
mounted on a 4K page size system.


Yeah, but I'm now concerned about the way both will be integrated in the
development or preview branches, not really the functionality itself.

Now the conflicts are not trivial, so this takes extra time on my side
and I can't be sure about the result in the end if I put only minor
efforts to resolve the conflicts ("make it compile"). And I don't want
to do that too often.

As stated in past discussions, the features of this impact should spend
one development cycle in for-next, even if it's not ready for merge or
there are reviews going on.

The subpage patchset is now in a relatively good shape to start actual
testing, which already revealed some problems.


I'm completely OK to do the rebase, but since I don't have a 64K page size 
machine to test the rebase on, we can only test whether 4K systems are unaffected.


Although that's not much help, it would at least be better than just making it compile.

Also, such a rebase may help us expose bad design or unexpected corner cases 
in dedupe.

So if it's OK, please let me try to do the rebase.

Thanks,
Qu




Re: [PATCH v11 00/13] Btrfs dedupe framework

2016-06-21 Thread David Sterba
On Tue, Jun 21, 2016 at 08:36:49AM +0800, Qu Wenruo wrote:
> > I'm looking at how well this patchset merges with the rest; so far
> > there are expected conflicts with Chandan's subpage-blocksize
> > patchset. For the easy parts, we can add stub patches to extend
> > functions like cow_file_range with parameters that are added by the
> > other patch.
> >
> > Honestly I don't know which patchset to take first. As the
> > subpage-blocksize work is in the core, there are no user-visibility or
> > interface questions, but it must not cause any regressions.
> >
> > Dedupe is optional, not default, and we have to mainly make sure it does
> > not have any impact when turned off.
> >
> > So I see three possible ways:
> >
> > * merge subpage first, as it defines the API, adapt dedupe
> 
> Personally, I'd like to merge subpage first.
> 
> AFAIK, it's more important than dedupe.
> It affects whether a fs created in 64K page size environment can be 
> mounted on a 4K page size system.

Yeah, but I'm now concerned about the way both will be integrated in the
development or preview branches, not really the functionality itself.

Now the conflicts are not trivial, so this takes extra time on my side
and I can't be sure about the result in the end if I put only minor
efforts to resolve the conflicts ("make it compile"). And I don't want
to do that too often.

As stated in past discussions, the features of this impact should spend
one development cycle in for-next, even if it's not ready for merge or
there are reviews going on.

The subpage patchset is now in a relatively good shape to start actual
testing, which already revealed some problems.


Re: Is "btrfs balance start" truly asynchronous?

2016-06-21 Thread Duncan
Dmitry Katsubo posted on Mon, 20 Jun 2016 18:33:54 +0200 as excerpted:

> Dear btrfs community,
> 
> I have added a drive to existing raid1 btrfs volume and decided to
> perform balancing so that data distributes "fairly" among drives. I have
> started "btrfs balance start", but it blocked for about 5-10 minutes
> while intensively doing the work. After that time it has printed something
> like "had to relocate 50 chunks" and exited. According to drive I/O,
> "btrfs balance" did most (if not all) of the work, so after it has
> exited the job was done.
> 
> Shouldn't "btrfs balance start" do the operation in the background?

From the btrfs-balance(8) manpage (from btrfs-progs-4.5.3):

start [options] <path>
start the balance operation according to the specified filters,
no filters will rewrite the entire filesystem. The process runs
in the foreground.


So the balance start operation runs in the foreground, but as explained 
elsewhere in the manpage, the balance is interruptible by unmount and 
will automatically restart after a remount.  It can also be paused and 
resumed or canceled with the appropriate btrfs balance subcommands.
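For completeness, here is one way to detach the foreground balance and drive it from another shell. The mount point is hypothetical, and DRY_RUN=1 (the default in this sketch) only prints the commands instead of executing them, so the sketch is safe to run anywhere:

```shell
#!/bin/sh
# Run "btrfs balance start" detached from the terminal, then manage it
# with the status/pause/resume/cancel subcommands.
MNT="${MNT:-/mnt/vol}"          # hypothetical mount point
DRY_RUN="${DRY_RUN:-1}"         # 1 = print commands only

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

# nohup + & detaches the otherwise-foreground balance, so it survives
# the terminal closing (the kernel also resumes it after a remount).
if [ "$DRY_RUN" = 1 ]; then
    echo "nohup btrfs balance start $MNT &"
else
    nohup btrfs balance start "$MNT" >/var/log/balance.log 2>&1 &
fi

# From any other shell, later:
run btrfs balance status "$MNT"   # check progress
run btrfs balance pause "$MNT"
run btrfs balance resume "$MNT"
run btrfs balance cancel "$MNT"
```

Note that only the start subcommand blocks; status/pause/resume/cancel return immediately.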

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
