Re: Possible deadlock when writing

2018-12-01 Thread Martin Bakiev
I was having the same issue with kernels 4.19.2 and 4.19.4. I don’t appear to 
have the issue with 4.20.0-0.rc1 on Fedora Server 29.

The issue is very easy to reproduce on my setup, not sure how much of it is 
actually relevant, but here it is:

- 3 drive RAID5 created
- Some data moved to it
- Expanded to 7 drives
- No balancing

The issue is easily reproduced (within 30 mins) by starting multiple transfers 
to the volume (several TB in the form of many 30GB+ files). Multiple concurrent 
‘rsync’ transfers seem to take a bit longer to trigger the issue, but multiple 
‘cp’ commands will do it much more quickly (again, not sure if relevant).
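For reference, a rough sketch of the kind of concurrent copy workload that triggers it for me (all paths below are just placeholders):

# several large source trees copied into the RAID5 volume in parallel
for src in /mnt/source1 /mnt/source2 /mnt/source3; do
    cp -a "$src"/. /mnt/raid5/ &
done
wait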

I have not seen the issue occur with a single ‘rsync’ or ‘cp’ transfer, but I 
haven’t left one running alone for too long (copying the data from multiple 
drives, so there is a lot to be gained from parallelizing the transfers).

I’m not sure what state the FS is left in after a Magic SysRq reboot following the 
deadlock, but it seems fine. No problems mounting, and ‘btrfs check’ 
passes OK. I’m sure some of the data doesn’t get flushed, but that’s no problem 
for my use case.

I’ve been running concurrent transfers nonstop with kernel 4.20.0-0.rc1 for 
24 hours and I haven’t experienced the issue.

Hope this helps.

Re: [PATCH RESEND 0/8] btrfs-progs: sub: Relax the privileges of "subvolume list/show"

2018-11-27 Thread Martin Steigerwald
Misono Tomohiro - 27.11.18, 06:24:
> Importantly, in order to make output consistent for both root and
> non-privileged user, this changes the behavior of "subvolume list":
>  - (default) Only list in subvolume under the specified path.
>Path needs to be a subvolume.

Does that work recursively?

I would find it quite unexpected if running btrfs subvol list in or on the 
root directory of a BTRFS filesystem did not display any subvolumes on 
that filesystem, no matter where they are.

Thanks,
-- 
Martin




Re: Interpreting `btrfs filesystem show'

2018-10-15 Thread Martin Steigerwald
Hugo Mills - 15.10.18, 16:26:
> On Mon, Oct 15, 2018 at 05:24:08PM +0300, Anton Shepelev wrote:
> > Hello, all
> > 
> > While trying to resolve free space problems, and found that
> > 
> > I cannot interpret the output of:
> > > btrfs filesystem show
> > 
> > Label: none  uuid: 8971ce5b-71d9-4e46-ab25-ca37485784c8
> > Total devices 1 FS bytes used 34.06GiB
> > devid1 size 40.00GiB used 37.82GiB path /dev/sda2
> > 
> > How come the total used value is less than the value listed
> > for the only device?
> 
>"Used" on the device is the mount of space allocated. "Used" on the
> FS is the total amount of actual data and metadata in that
> allocation.
> 
>You will also need to look at the output of "btrfs fi df" to see
> the breakdown of the 37.82 GiB into data, metadata and currently
> unused.

I usually use btrfs fi usage -T, because

1. It has all the information.

2. It differentiates between used and allocated.

% btrfs fi usage -T /
Overall:
Device size: 100.00GiB
Device allocated: 54.06GiB
Device unallocated:   45.94GiB
Device missing:  0.00B
Used: 46.24GiB
Free (estimated): 25.58GiB  (min: 25.58GiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:   70.91MiB  (used: 0.00B)

                            Data      Metadata  System
Id Path                     RAID1     RAID1     RAID1     Unallocated
-- ------------------------ --------- --------- --------- -----------
 2 /dev/mapper/msata-debian  25.00GiB   2.00GiB  32.00MiB    22.97GiB
 1 /dev/mapper/sata-debian   25.00GiB   2.00GiB  32.00MiB    22.97GiB
-- ------------------------ --------- --------- --------- -----------
   Total                     25.00GiB   2.00GiB  32.00MiB    45.94GiB
   Used                      22.38GiB 754.66MiB  16.00KiB


For RAID it reports the raw size in some places and the logical size in 
others. Especially in the "Total" line I find this a bit inconsistent: 
the "RAID1" columns show logical size, while "Unallocated" shows raw size.

Also, "Used:" in the global section shows raw size, while "Free 
(estimated):" shows logical size.

Thanks
-- 
Martin




Re: BTRFS related kernel backtrace on boot on 4.18.7 after blackout due to discharged battery

2018-10-05 Thread Martin Steigerwald
Filipe Manana - 05.10.18, 17:21:
> On Fri, Oct 5, 2018 at 3:23 PM Martin Steigerwald 
 wrote:
> > Hello!
> > 
> > On ThinkPad T520 after battery was discharged and machine just
> > blacked out.
> > 
> > Is that some sign of regular consistency check / replay or something
> > to investigate further?
> 
> I think it's harmless, if anything were messed up with link counts or
> mismatches between those and dir entries, fsck (btrfs check) should
> have reported something.
> I'll dig a bit further and remove the warning if it's really harmless.

I just scrubbed the filesystem. I did not run btrfs check on it.

> > I already scrubbed all data and there are no errors. Also btrfs
> > device stats reports no errors. SMART status appears to be okay as
> > well on both SSD.
> > 
> > [4.524355] BTRFS info (device dm-4): disk space caching is
> > enabled [… backtrace …]
-- 
Martin




BTRFS related kernel backtrace on boot on 4.18.7 after blackout due to discharged battery

2018-10-05 Thread Martin Steigerwald
 83 c8 ff c3 66 2e 
0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d 3e e4 0b 00 f7 d8 64 89 01 48 
[6.123872] RSP: 002b:7ffc0e3466a8 EFLAGS: 0202 ORIG_RAX: 
00a5
[6.131285] RAX: ffda RBX: 55f3ed7ee9c0 RCX: 7f0715b89a1a
[6.131286] RDX: 55f3ed7eebc0 RSI: 55f3ed7eec40 RDI: 55f3ed7ef900
[6.131287] RBP: 7f0715ecff04 R08: 55f3ed7eec00 R09: 55f3ed7eebc0
[6.131288] R10: c0ed0400 R11: 0202 R12: 
[6.131289] R13: c0ed0400 R14: 55f3ed7ef900 R15: 55f3ed7eebc0
[6.131292] ---[ end trace bd5d30b2fea7fb77 ]---
[6.251219] BTRFS info (device dm-3): checking UUID tree

Thanks,
-- 
Martin




Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)

2018-09-19 Thread Martin Steigerwald
Hans van Kranenburg - 19.09.18, 19:58:
> However, as soon as we remount the filesystem with space_cache=v2 -
> 
> > writes drop to just around 3-10 MB/s to each disk. If we remount to
> > space_cache - lots of writes, system unresponsive. Again remount to
> > space_cache=v2 - low writes, system responsive.
> > 
> > That's a huuge, 10x overhead! Is it expected? Especially that
> > space_cache=v1 is still the default mount option?
> 
> Yes, that does not surprise me.
> 
> https://events.static.linuxfound.org/sites/events/files/slides/vault20
> 16_0.pdf
> 
> Free space cache v1 is the default because of issues with btrfs-progs,
> not because it's unwise to use the kernel code. I can totally
> recommend using it. The linked presentation above gives some good
> background information.

What issues in btrfs-progs are those?

I am wondering whether to switch to the free space tree (v2). Would it provide 
a benefit for regular / and /home filesystems on a dual-SSD BTRFS RAID 1 
on a laptop?
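(For context, my understanding is that the switch itself would just be a one-time 
mount with the option, roughly like the following; the device path is a 
placeholder, and later mounts pick the free space tree up automatically:)

mount -o space_cache=v2 /dev/mapper/sata-home /home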

Thanks,
-- 
Martin




Re: Transactional btrfs

2018-09-08 Thread Martin Raiber
Am 08.09.2018 um 18:24 schrieb Adam Borowski:
> On Thu, Sep 06, 2018 at 06:08:33AM -0400, Austin S. Hemmelgarn wrote:
>> On 2018-09-06 03:23, Nathan Dehnel wrote:
>>> So I guess my question is, does btrfs support atomic writes across
>>> multiple files? Or is anyone interested in such a feature?
>>>
>> I'm fairly certain that it does not currently, but in theory it would not be
>> hard to add.
>>
>> Realistically, the only cases I can think of where cross-file atomic
>> _writes_ would be of any benefit are database systems.
>>
>> However, if this were extended to include rename, unlink, touch, and a
>> handful of other VFS operations, then I can easily think of a few dozen use
>> cases.  Package managers in particular would likely be very interested in
>> being able to atomically rename a group of files as a single transaction, as
>> it would make their job _much_ easier.
> I wonder, what about:
> sync; mount -o remount,commit=999,flushoncommit
> eatmydata apt dist-upgrade
> sync; mount -o remount,commit=30,noflushoncommit
>
> Obviously, this gets fooled by fsyncs, and makes the transaction affect the
> whole system (if you have unrelated writes they won't get committed until
> the end of transaction).  Then there are nocow files, but you already made
> the decision to disable most features of btrfs for them.
>
> So unless something forces a commit, this should already work, giving
> cross-file atomic writes, renames and so on.

Now combine this with snapshotting the root, then on success rename-exchange it
back to root, and you are there.

Btrfs had in the past TRANS_START and TRANS_END ioctls (for ceph, I
think), but no rollback (and therefore no error handling incl. ENOSPC).

If you want to look at a working file system transaction mechanism, you
should look at transactional NTFS (TxF). Microsoft writes that they are
deprecating it, so it's perhaps not very widely used. Windows uses it
for updates, I think.

Specifically for btrfs, the problem would be that it really needs to
support multiple simultaneous writers, otherwise one transaction can
block the whole system.




Re: lazytime mount option—no support in Btrfs

2018-08-19 Thread Martin Steigerwald
waxhead - 18.08.18, 22:45:
> Adam Hunt wrote:
> > Back in 2014 Ted Tso introduced the lazytime mount option for ext4
> > and shortly thereafter a more generic VFS implementation which was
> > then merged into mainline. His early patches included support for
> > Btrfs but those changes were removed prior to the feature being
> > merged. His> 
> > changelog includes the following note about the removal:
> >- Per Christoph's suggestion, drop support for btrfs and xfs for
> >now,
> >
> >  issues with how btrfs and xfs handle dirty inode tracking.  We
> >  can add btrfs and xfs support back later or at the end of this
> >  series if we want to revisit this decision.
> > 
> > My reading of the current mainline shows that Btrfs still lacks any
> > support for lazytime. Has any thought been given to adding support
> > for lazytime to Btrfs?
[…]
> Is there any news regarding this?

I´d like to know whether there is any news about this as well.

If I understand it correctly, this could even help BTRFS performance a 
lot, because it is COW´ing metadata.

Thanks,
-- 
Martin




Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-18 Thread Martin Steigerwald
Roman Mamedov - 18.08.18, 09:12:
> On Fri, 17 Aug 2018 23:17:33 +0200
> 
> Martin Steigerwald  wrote:
> > > Do not consider SSD "compression" as a factor in any of your
> > > calculations or planning. Modern controllers do not do it anymore,
> > > the last ones that did are SandForce, and that's 2010 era stuff.
> > > You
> > > can check for yourself by comparing write speeds of compressible
> > > vs
> > > incompressible data, it should be the same. At most, the modern
> > > ones
> > > know to recognize a stream of binary zeroes and have a special
> > > case
> > > for that.
> > 
> > Interesting. Do you have any backup for your claim?
> 
> Just "something I read". I follow quote a bit of SSD-related articles
> and reviews which often also include a section to talk about the
> controller utilized, its background and technological
> improvements/changes -- and the compression going out of fashion
> after SandForce seems to be considered a well-known fact.
> 
> Incidentally, your old Intel 320 SSDs actually seem to be based on
> that old SandForce controller (or at least license some of that IP to
> extend on it), and hence those indeed might perform compression.

Interesting. Back then I read that the Intel SSD 320 would not compress.
I think it's difficult to know for sure with those proprietary controllers.

> > As the data still needs to be transferred to the SSD at least when
> > the SATA connection is maxed out I bet you won´t see any difference
> > in write speed whether the SSD compresses in real time or not.
> 
> Most controllers expose two readings in SMART:
> 
>   - Lifetime writes from host (SMART attribute 241)
>   - Lifetime writes to flash (attribute 233, or 177, or 173...)
>
> It might be difficult to get the second one, as often it needs to be
> decoded from others such as "Average block erase count" or "Wear
> leveling count". (And seems to be impossible on Samsung NVMe ones,
> for example)

I got the impression every manufacturer does their own thing here. And I
would not even be surprised if it's different between different generations
of SSDs from the same manufacturer.
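For what it's worth, a rough way to pull both counters with smartctl, assuming 
the drive exposes attribute 241 for host writes and one of 173/177/233 for 
flash writes (IDs differ per vendor):

smartctl -A /dev/sda | awk '$1==241 || $1==233 || $1==177 || $1==173'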

# Crucial mSATA

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   000    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       16345
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4193
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Wear_Leveling_Count     0x0032   078   078   000    Old_age   Always       -       663
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       362
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   000   000   000    Pre-fail  Always       -       8219
183 SATA_Iface_Downshift    0x0032   100   100   000    Old_age   Always       -       1
184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   046   020   000    Old_age   Always       -       54 (Min/Max -10/80)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       16
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Used   0x0031   078   078   000    Pre-fail  Offline      -       22

I expect the raw value of this to rise more slowly now that there are almost
100 GiB completely unused and there is lots of free space in the filesystems.
But even if not, the SSD has been in use since March 2014, so it has plenty of
time to go.

206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   ---    Old_age   Always       -       91288276930

^^ In sectors. 91288276930 * 512 / 1024 / 1024 / 1024 ~= 43529 GiB

Could be 4 KiB… but as it's talking about Host_Sector and the value multiplied
by eight does not make any sense, I bet it's 512 bytes.
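(A quick shell sanity check of that conversion:)

echo $(( 91288276930 * 512 / 1024 / 1024 / 1024 ))   # prints 43529, i.e. ~43529 GiB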

% smartctl /dev/sdb --all |grep "

Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Martin Steigerwald
Austin S. Hemmelgarn - 17.08.18, 14:55:
> On 2018-08-17 08:28, Martin Steigerwald wrote:
> > Thanks for your detailed answer.
> > 
> > Austin S. Hemmelgarn - 17.08.18, 13:58:
> >> On 2018-08-17 05:08, Martin Steigerwald wrote:
[…]
> >>> Anyway, creating a new filesystem may have been better here
> >>> anyway,
> >>> cause it replaced an BTRFS that aged over several years with a new
> >>> one. Due to the increased capacity and due to me thinking that
> >>> Samsung 860 Pro compresses itself, I removed LZO compression. This
> >>> would also give larger extents on files that are not fragmented or
> >>> only slightly fragmented. I think that Intel SSD 320 did not
> >>> compress, but Crucial m500 mSATA SSD does. That has been the
> >>> secondary SSD that still had all the data after the outage of the
> >>> Intel SSD 320.
> >> 
> >> First off, keep in mind that the SSD firmware doing compression
> >> only
> >> really helps with wear-leveling.  Doing it in the filesystem will
> >> help not only with that, but will also give you more space to work
> >> with.> 
> > While also reducing the ability of the SSD to wear-level. The more
> > data I fit on the SSD, the less it can wear-level. And the better I
> > compress that data, the less it can wear-level.
> 
> No, the better you compress the data, the _less_ data you are
> physically putting on the SSD, just like compressing a file makes it
> take up less space.  This actually makes it easier for the firmware
> to do wear-leveling.  Wear-leveling is entirely about picking where
> to put data, and by reducing the total amount of data you are writing
> to the SSD, you're making that decision easier for the firmware, and
> also reducing the number of blocks of flash memory needed (which also
> helps with SSD life expectancy because it translates to fewer erase
> cycles).

On one hand I can go with this, but:

If I fill the SSD to 99% with already-compressed data, then, in case it 
compresses data itself for wear leveling, it has less chance to wear-level 
than with 99% of not-yet-compressed data that it could compress itself.

That was the point I was trying to make.

Sure, with a fill rate of about 46% for home, compression would help the 
wear leveling. And if the controller does not compress at all, it would 
help as well.

Hmmm, maybe I'll enable "zstd", but on the other hand I save CPU cycles 
by not enabling it. 

> > However… I am not all that convinced that it would benefit me as
> > long as I have enough space. That SSD replacement more than doubled
> > capacity from about 680 GB to 1480 GB. I have a ton of free space in
> > the filesystems – usage of /home is only 46% for example – and
> > there are 96 GiB completely unused in LVM on the Crucial SSD and
> > even more than 183 GiB completely unused on Samsung SSD. The system
> > is doing weekly "fstrim" on all filesystems. I think that this is
> > more than is needed for the longevity of the SSDs, but well
> > actually I just don´t need the space, so…
> > 
> > Of course, in case I manage to fill up all that space, I consider
> > using compression. Until then, I am not all that convinced that I´d
> > benefit from it.
> > 
> > Of course it may increase read speeds and in case of nicely
> > compressible data also write speeds, I am not sure whether it even
> > matters. Also it uses up some CPU cycles on a dual core (+
> > hyperthreading) Sandybridge mobile i5. While I am not sure about
> > it, I bet also having larger possible extent sizes may help a bit.
> > As well as no compression may also help a bit with fragmentation.
> 
> It generally does actually. Less data physically on the device means
> lower chances of fragmentation.  In your case, it may not improve

I thought "no compression" may help with fragmentation, but I think you 
think that "compression" helps with fragmentation and misunderstood what 
I wrote.

> speed much though (your i5 _probably_ can't compress data much faster
> than it can access your SSD's, which means you likely won't see much
> performance benefit other than reducing fragmentation).
> 
> > Well putting this to a (non-scientific) test:
> > 
> > […]/.local/share/akonadi/db_data/akonadi> du -sh * | sort -rh | head
> > -5 3,1Gparttable.ibd
> > 
> > […]/.local/share/akonadi/db_data/akonadi> filefrag parttable.ibd
> > parttable.ibd: 11583 extents found
> > 
> > Hmmm, already quite many extents after just about one week with the
> > new filesystem. On the old filesystem I had somewhat around
> > 4-5 ex

Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Martin Steigerwald
Hi Roman.

Now with proper CC.

Roman Mamedov - 17.08.18, 14:50:
> On Fri, 17 Aug 2018 14:28:25 +0200
> 
> Martin Steigerwald  wrote:
> > > First off, keep in mind that the SSD firmware doing compression
> > > only
> > > really helps with wear-leveling.  Doing it in the filesystem will
> > > help not only with that, but will also give you more space to
> > > work with.> 
> > While also reducing the ability of the SSD to wear-level. The more
> > data I fit on the SSD, the less it can wear-level. And the better I
> > compress that data, the less it can wear-level.
> 
> Do not consider SSD "compression" as a factor in any of your
> calculations or planning. Modern controllers do not do it anymore,
> the last ones that did are SandForce, and that's 2010 era stuff. You
> can check for yourself by comparing write speeds of compressible vs
> incompressible data, it should be the same. At most, the modern ones
> know to recognize a stream of binary zeroes and have a special case
> for that.

Interesting. Do you have any backup for your claim?

> As for general comment on this thread, always try to save the exact
> messages you get when troubleshooting or getting failures from your
> system. Saying just "was not able to add" or "btrfs replace not
> working" without any exact details isn't really helpful as a bug
> report or even as a general "experiences" story, as we don't know
> what was the exact cause of those, could that have been avoided or
> worked around, not to mention what was your FS state at the time (as
> in "btrfs fi show" and "fi df").

I had a screen.log, but I put it on the filesystem after the 
backup was made, so it was lost.

Anyway, the reason for not being able to add the device was the read 
only state of the BTRFS, as I wrote. Same goes for replace. I was able 
to read the error message just fine. AFAIR the exact wording was "read 
only filesystem".

In any case: it was an experience report, not a request for help, so I don´t 
see why exact error messages are absolutely needed. If I had a support 
inquiry, that would be different, I agree.

Thanks,
-- 
Martin




Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Martin Steigerwald
Austin S. Hemmelgarn - 17.08.18, 15:01:
> On 2018-08-17 08:50, Roman Mamedov wrote:
> > On Fri, 17 Aug 2018 14:28:25 +0200
> > 
> > Martin Steigerwald  wrote:
> >>> First off, keep in mind that the SSD firmware doing compression
> >>> only
> >>> really helps with wear-leveling.  Doing it in the filesystem will
> >>> help not only with that, but will also give you more space to
> >>> work with.>> 
> >> While also reducing the ability of the SSD to wear-level. The more
> >> data I fit on the SSD, the less it can wear-level. And the better
> >> I compress that data, the less it can wear-level.
> > 
> > Do not consider SSD "compression" as a factor in any of your
> > calculations or planning. Modern controllers do not do it anymore,
> > the last ones that did are SandForce, and that's 2010 era stuff.
> > You can check for yourself by comparing write speeds of
> > compressible vs incompressible data, it should be the same. At
> > most, the modern ones know to recognize a stream of binary zeroes
> > and have a special case for that.
> 
> All that testing write speeds for compressible versus incompressible
> data tells you is if the SSD is doing real-time compression of data,
> not if they are doing any compression at all.  Also, this test only
> works if you turn the write-cache on the device off.

As the data still needs to be transferred to the SSD, I bet that at least 
when the SATA connection is maxed out you won´t see any difference in write 
speed whether the SSD compresses in real time or not.
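For what it's worth, a rough version of that test (write cache off as Austin 
says; devices and paths below are placeholders, and the target should be a 
filesystem mounted without compression):

hdparm -W 0 /dev/sdX                                 # disable the drive's write cache
dd if=/dev/zero of=/mnt/test-zero bs=1M count=1024 oflag=direct conv=fsync
head -c 1G /dev/urandom > /tmp/random.bin            # pre-generate incompressible data
dd if=/tmp/random.bin of=/mnt/test-random bs=1M oflag=direct conv=fsync
hdparm -W 1 /dev/sdX                                 # re-enable the write cache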

> Besides, you can't prove 100% for certain that any manufacturer who
> does not sell their controller chips isn't doing this, which means
> there are a few manufacturers that may still be doing it.

Who really knows what SSD controller manufacturers are doing? I have not 
seen any Open Channel SSD stuff for laptops so far.

Thanks,
-- 
Martin




Hang after growing file system (4.14.48)

2018-08-17 Thread Martin Raiber
Hi,

after growing a single btrfs file system (on a loop device, with btrfs
fi resize max /fs) it hangs later (sometimes much later). Symptoms:

* An unkillable btrfs process using 100% (of one) CPU in R state (no
kernel trace, cannot attach with strace or gdb, or run linux perf)
* Other processes with the following stack trace:

[Fri Aug 17 16:21:06 2018] INFO: task python3:46794 blocked for more than
120 seconds.
[Fri Aug 17 16:21:06 2018]   Not tainted 4.14.48 #2
[Fri Aug 17 16:21:06 2018] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Aug 17 16:21:06 2018] python3 D    0 46794  46702 0x
[Fri Aug 17 16:21:06 2018] Call Trace:
[Fri Aug 17 16:21:06 2018]  ? __schedule+0x2de/0x7b0
[Fri Aug 17 16:21:06 2018]  schedule+0x32/0x80
[Fri Aug 17 16:21:06 2018]  schedule_preempt_disabled+0xa/0x10
[Fri Aug 17 16:21:06 2018]  __mutex_lock.isra.1+0x295/0x4c0
[Fri Aug 17 16:21:06 2018]  ? btrfs_show_devname+0x25/0xd0
[Fri Aug 17 16:21:06 2018]  btrfs_show_devname+0x25/0xd0
[Fri Aug 17 16:21:06 2018]  show_vfsmnt+0x44/0x150
[Fri Aug 17 16:21:06 2018]  seq_read+0x314/0x3d0
[Fri Aug 17 16:21:06 2018]  __vfs_read+0x26/0x130
[Fri Aug 17 16:21:06 2018]  vfs_read+0x91/0x130
[Fri Aug 17 16:21:06 2018]  SyS_read+0x42/0x90
[Fri Aug 17 16:21:06 2018]  do_syscall_64+0x6e/0x120
[Fri Aug 17 16:21:06 2018]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Fri Aug 17 16:21:06 2018] RIP: 0033:0x7f67fd41b6d0
[Fri Aug 17 16:21:06 2018] RSP: 002b:7ffd80be2678 EFLAGS: 0246
ORIG_RAX: 
[Fri Aug 17 16:21:06 2018] RAX: ffda RBX: 56521bf7bb00
RCX: 7f67fd41b6d0
[Fri Aug 17 16:21:06 2018] RDX: 0400 RSI: 56521bf7bd30
RDI: 0004
[Fri Aug 17 16:21:06 2018] RBP: 0d68 R08: 7f67fe655700
R09: 0101
[Fri Aug 17 16:21:06 2018] R10: 56521bf7c0cc R11: 0246
R12: 7f67fd6d6440
[Fri Aug 17 16:21:06 2018] R13: 7f67fd6d5900 R14: 0064
R15: 0000
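In case someone wants to try to reproduce it, the grow sequence is roughly the
following (paths and sizes are just examples):

truncate -s 100G /data/fs.img
losetup /dev/loop0 /data/fs.img
mkfs.btrfs /dev/loop0
mount /dev/loop0 /fs
truncate -s 200G /data/fs.img      # grow the backing file
losetup -c /dev/loop0              # let the loop device pick up the new size
btrfs filesystem resize max /fs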

Regards,
Martin Raiber



Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Martin Steigerwald
Thanks for your detailed answer.  

Austin S. Hemmelgarn - 17.08.18, 13:58:
> On 2018-08-17 05:08, Martin Steigerwald wrote:
[…]
> > I have seen a discussion about the limitation in point 2. That
> > allowing to add a device and make it into RAID 1 again might be
> > dangerous, cause of system chunk and probably other reasons. I did
> > not completely read and understand it tough.
> > 
> > So I still don´t get it, cause:
> > 
> > Either it is a RAID 1, then, one disk may fail and I still have
> > *all*
> > data. Also for the system chunk, which according to btrfs fi df /
> > btrfs fi sh was indeed RAID 1. If so, then period. Then I don´t see
> > why it would need to disallow me to make it into an RAID 1 again
> > after one device has been lost.
> > 
> > Or it is no RAID 1 and then what is the point to begin with? As I
> > was
> > able to copy of all date of the degraded mount, I´d say it was a
> > RAID 1.
> > 
> > (I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just
> > does two copies regardless of how many drives you use.)
> 
> So, what's happening here is a bit complicated.  The issue is entirely
> with older kernels that are missing a couple of specific patches, but
> it appears that not all distributions have their kernels updated to
> include those patches yet.
> 
> In short, when you have a volume consisting of _exactly_ two devices
> using raid1 profiles that is missing one device, and you mount it
> writable and degraded on such a kernel, newly created chunks will be
> single-profile chunks instead of raid1 chunks with one half missing.
> Any write has the potential to trigger allocation of a new chunk, and
> more importantly any _read_ has the potential to trigger allocation of
> a new chunk if you don't use the `noatime` mount option (because a
> read will trigger an atime update, which results in a write).
> 
> When older kernels then go and try to mount that volume a second time,
> they see that there are single-profile chunks (which can't tolerate
> _any_ device failures), and refuse to mount at all (because they
> can't guarantee that metadata is intact).  Newer kernels fix this
> part by checking per-chunk if a chunk is degraded/complete/missing,
> which avoids this because all the single chunks are on the remaining
> device.

How new does the kernel need to be for that to happen?

Do I get this right that it would be the kernel used for recovery, i.e. 
the one on the live distro, that needs to be new enough? The one on this 
laptop meanwhile is already 4.18.1.

I used the latest GRML stable release, 2017.05, which has a 4.9 kernel.

> As far as avoiding this in the future:

I hope that with the new Samsung 860 Pro together with the existing 
Crucial m500 I am spared from this for years to come. That Crucial SSD, 
according to its SMART lifetime-used status, still has quite some time 
to go.

> * If you're just pulling data off the device, mark the device
> read-only in the _block layer_, not the filesystem, before you mount
> it.  If you're using LVM, just mark the LV read-only using LVM
> commands  This will make 100% certain that nothing gets written to
> the device, and thus makes sure that you won't accidentally cause
> issues like this.

> * If you're going to convert to a single device,
> just do it and don't stop it part way through.  In particular, make
> sure that your system will not lose power.

> * Otherwise, don't mount the volume unless you know you're going to
> repair it.

Thanks for those. Good to keep in mind.
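For my own notes, the block-layer read-only part would be something like this 
(device and LV names are just examples):

blockdev --setro /dev/sdb2             # plain block device
lvchange --permission r vg/home        # or, for an LVM logical volume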

> > For this laptop it was not all that important but I wonder about
> > BTRFS RAID 1 in enterprise environment, cause restoring from backup
> > adds a significantly higher downtime.
> > 
> > Anyway, creating a new filesystem may have been better here anyway,
> > cause it replaced an BTRFS that aged over several years with a new
> > one. Due to the increased capacity and due to me thinking that
> > Samsung 860 Pro compresses itself, I removed LZO compression. This
> > would also give larger extents on files that are not fragmented or
> > only slightly fragmented. I think that Intel SSD 320 did not
> > compress, but Crucial m500 mSATA SSD does. That has been the
> > secondary SSD that still had all the data after the outage of the
> > Intel SSD 320.
> 
> First off, keep in mind that the SSD firmware doing compression only
> really helps with wear-leveling.  Doing it in the filesystem will help
> not only with that, but will also give you more space to work with.

While also reducing the ability of the SSD to wear-level. The more data 
I fit on the SSD, the less it can wear-level. And the better I compress 
that data, the less it can wear-level

Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Martin Steigerwald
Hi!

This happened about two weeks ago. I already dealt with it and all is 
well.

Linux hung on suspend so I switched off this ThinkPad T520 forcefully. 
After that it did not boot the operating system anymore. The Intel SSD 320 
(latest firmware, which should patch this bug, but apparently does not) 
now reports only 8 MiB of capacity. Those 8 MiB just contain zeros.

Access via GRML and "mount -fo degraded" worked. I initially was even 
able to write onto this degraded filesystem. First I copied all data to 
a backup drive.

I even started a balance to "single" so that it would work with one SSD.

But later I learned that a secure erase may recover the Intel SSD 320 and, 
since I had no other SSD at hand, I did that. And yes, it did recover. So I 
canceled the balance.

I partitioned the Intel SSD 320 and put LVM on it, just as I had it before. But 
at that time I was not able to mount the degraded BTRFS on the other SSD 
as writable anymore, not even with "-f" ("I know what I am doing"). Thus I 
was not able to add a device to it and btrfs balance it to RAID 1. Even 
"btrfs replace" was not working.

I thus formatted a new BTRFS RAID 1 and restored.
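(For reference, the repair I was attempting, and which failed due to the 
read-only state, would roughly have been the following; device names, devid 
and mount point are just examples:)

mount -o degraded /dev/mapper/msata-home /mnt
btrfs device add /dev/mapper/sata-home /mnt
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
# or, replacing the missing device directly (devid 1 here is an example):
btrfs replace start 1 /dev/mapper/sata-home /mnt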

A week later I migrated the Intel SSD 320 to a Samsung 860 Pro. Again 
via one full backup and restore cycle. However, this time I was able to 
copy most of the data of the Intel SSD 320 with "mount -fo degraded" via 
eSATA and thus the copy operation was way faster.

So conclusion:

1. Pro: BTRFS RAID 1 really protected my data against a complete SSD 
outage.

2. Con:  It does not allow me to add a device and balance to RAID 1 or 
replace one device that is already missing at this time.

3. I keep using BTRFS RAID 1 on two SSDs for often changed, critical 
data.

4. And yes, I know it does not replace a backup. As it was holidays and 
I was lazy, the backup was already two weeks old, so I was happy to have all 
my data still on the other SSD.

5. The error messages in kernel when mounting without "-o degraded" are 
less than helpful. They indicate a corrupted filesystem instead of just 
telling that one device is missing and "-o degraded" would help here.


I have seen a discussion about the limitation in point 2: that allowing 
one to add a device and make it into RAID 1 again might be dangerous, because 
of the system chunk and probably other reasons. I did not completely read 
and understand it, though.

So I still don´t get it, because:

Either it is a RAID 1; then one disk may fail and I still have *all* 
data. Also for the system chunk, which according to btrfs fi df / btrfs 
fi sh was indeed RAID 1. If so, then period. Then I don´t see why it 
would need to disallow me to make it into a RAID 1 again after one 
device has been lost.

Or it is no RAID 1, and then what is the point to begin with? As I was 
able to copy off all data of the degraded mount, I´d say it was a RAID 1.

(I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just does 
two copies regardless of how many drives you use.)


For this laptop it was not all that important, but I wonder about BTRFS 
RAID 1 in an enterprise environment, because restoring from backup adds 
significantly more downtime.

Anyway, creating a new filesystem may have been better here anyway, 
because it replaced a BTRFS that had aged over several years with a new one. 
Due to the increased capacity and due to me thinking that the Samsung 860 
Pro compresses itself, I removed LZO compression. This would also give 
larger extents on files that are not fragmented or only slightly 
fragmented. I think that the Intel SSD 320 did not compress, but the Crucial 
m500 mSATA SSD does. That has been the secondary SSD that still had all 
the data after the outage of the Intel SSD 320.


Overall I am happy, cause BTRFS RAID 1 gave me access to the data after 
the SSD outage. That is the most important thing about it for me.

Thanks,
-- 
Martin




Re: BTRFS and databases

2018-08-02 Thread Martin Raiber
On 02.08.2018 14:27 Austin S. Hemmelgarn wrote:
> On 2018-08-02 06:56, Qu Wenruo wrote:
>>
>> On 2018-08-02 18:45, Andrei Borzenkov wrote:
>>>
>>> Sent from my iPhone
>>>
 On 2 Aug 2018, at 10:02, Qu Wenruo 
 wrote:

> On 2018-08-01 11:45, MegaBrutal wrote:
> Hi all,
>
> I know it's a decade-old question, but I'd like to hear your thoughts
> of today. By now, I became a heavy BTRFS user. Almost everywhere I
> use
> BTRFS, except in situations when it is obvious there is no benefit
> (e.g. /var/log, /boot). At home, all my desktop, laptop and server
> computers are mainly running on BTRFS with only a few file systems on
> ext4. I even installed BTRFS in corporate productive systems (in
> those
> cases, the systems were mainly on ext4; but there were some specific
> file systems those exploited BTRFS features).
>
> But there is still one question that I can't get over: if you store a
> database (e.g. MySQL), would you prefer having a BTRFS volume mounted
> with nodatacow, or would you just simply use ext4?
>
> I know that with nodatacow, I take away most of the benefits of BTRFS
> (those are actually hurting database performance – the exact CoW
> nature that is elsewhere a blessing, with databases it's a drawback).
> But are there any advantages of still sticking to BTRFS for a
> database
> albeit CoW is disabled, or should I just return to the old and
> reliable ext4 for those applications?

 Since I'm not a expert in database, so I can totally be wrong, but
 what
 about completely disabling database write-ahead-log (WAL), and let
 btrfs' data CoW to handle data consistency completely?

>>>
>>> This would make content of database after crash completely
>>> unpredictable, thus making it impossible to reliably roll back
>>> transaction.
>>
>> Btrfs itself (with datacow) can ensure the fs is updated completely.
>>
>> That's to say, even a crash happens, the content of the fs will be the
>> same state as previous btrfs transaction (btrfs sync).
>>
>> Thus there is no need to rollback database transaction though.
>> (Unless database transaction is not sync to btrfs transaction)
>>
> Two issues with this statement:
>
> 1. Not all database software properly groups logically related
> operations that need to be atomic as a unit into transactions.
> 2. Even aside from point 1 and the possibility of database corruption,
> there are other legitimate reasons that you might need to roll-back a
> transaction (for example, the rather obvious case of a transaction
> that should not have happened in the first place).

I have thought of a database transaction scheme based on btrfs
features before. It has practical issues, though.
One would put a b-tree database file into a subvolume (e.g. trans_0).
When changing the b-tree database one would create a snapshot (trans_1),
then change the file in the snapshot. On commit, sync trans_1, then
delete trans_0. On rollback, delete trans_1.

Problems:
* Large overhead for small transactions (OLTP) -- a problem in general for
copy-on-write b-tree databases
* Only root can create or destroy snapshots
* By default the Linux memory system starts write-back pretty much
immediately, so pages that get overwritten more than once in a
transaction get written out multiple times (instead of being kept in RAM),
unless Linux is tuned not to do this.

I have used this method, albeit by reflinking the database, then
modifying the reflink, but I think reflinking is slower than creating a
snapshot?
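A rough sketch of the snapshot-based variant described above (subvolume names
as in the example, everything else hypothetical):

# begin transaction: work on a snapshot of the current state
btrfs subvolume snapshot /db/trans_0 /db/trans_1
update-database /db/trans_1/data.db        # hypothetical writer touching the copy
# commit: persist the new state, then drop the old one
btrfs filesystem sync /db
btrfs subvolume delete /db/trans_0
# rollback instead: just drop the snapshot
# btrfs subvolume delete /db/trans_1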


Re: BTRFS and databases

2018-08-02 Thread Martin Steigerwald
Andrei Borzenkov - 02.08.18, 12:35:
> Sent from my iPhone
> 
> > On 2 Aug 2018, at 12:16, Martin Steigerwald
> > wrote:
> > Hugo Mills - 01.08.18, 10:56:
> >>> On Wed, Aug 01, 2018 at 05:45:15AM +0200, MegaBrutal wrote:
> >>> I know it's a decade-old question, but I'd like to hear your
> >>> thoughts
> >>> of today. By now, I became a heavy BTRFS user. Almost everywhere I
> >>> use BTRFS, except in situations when it is obvious there is no
> >>> benefit (e.g. /var/log, /boot). At home, all my desktop, laptop
> >>> and
> >>> server computers are mainly running on BTRFS with only a few file
> >>> systems on ext4. I even installed BTRFS in corporate productive
> >>> systems (in those cases, the systems were mainly on ext4; but
> >>> there
> >>> were some specific file systems those exploited BTRFS features).
> >>> 
> >>> But there is still one question that I can't get over: if you
> >>> store
> >>> a
> >>> database (e.g. MySQL), would you prefer having a BTRFS volume
> >>> mounted
> >>> with nodatacow, or would you just simply use ext4?
> >>> 
> >>   Personally, I'd start with btrfs with autodefrag. It has some
> >> 
> >> degree of I/O overhead, but if the database isn't
> >> performance-critical and already near the limits of the hardware,
> >> it's unlikely to make much difference. Autodefrag should keep the
> >> fragmentation down to a minimum.
> > 
> > I read that autodefrag would only help with small databases.
> 
> I wonder if anyone actually
> 
> a) quantified performance impact
> b) analyzed the cause
> 
> I work with NetApp for a long time and I can say from first hand
> experience that fragmentation had zero impact on OLTP workload. It
> did affect backup performance as was expected, but this could be
> fixed by periodic reallocation (defragmentation).
> 
> And even that needed quite some time to observe (years) on pretty high
>  load database with regular backup and replication snapshots.
> 
> If btrfs is so susceptible to fragmentation, what is the reason for
> it?

At the end of my original mail I mentioned a blog article that also had 
some performance graphs. Did you actually read it?

Thanks,
-- 
Martin




Re: BTRFS and databases

2018-08-02 Thread Martin Steigerwald
Hugo Mills - 01.08.18, 10:56:
> On Wed, Aug 01, 2018 at 05:45:15AM +0200, MegaBrutal wrote:
> > I know it's a decade-old question, but I'd like to hear your
> > thoughts
> > of today. By now, I became a heavy BTRFS user. Almost everywhere I
> > use BTRFS, except in situations when it is obvious there is no
> > benefit (e.g. /var/log, /boot). At home, all my desktop, laptop and
> > server computers are mainly running on BTRFS with only a few file
> > systems on ext4. I even installed BTRFS in corporate productive
> > systems (in those cases, the systems were mainly on ext4; but there
> > were some specific file systems those exploited BTRFS features).
> > 
> > But there is still one question that I can't get over: if you store
> > a
> > database (e.g. MySQL), would you prefer having a BTRFS volume
> > mounted
> > with nodatacow, or would you just simply use ext4?
> 
>Personally, I'd start with btrfs with autodefrag. It has some
> degree of I/O overhead, but if the database isn't performance-critical
> and already near the limits of the hardware, it's unlikely to make
> much difference. Autodefrag should keep the fragmentation down to a
> minimum.

I read that autodefrag would only help with small databases.

I also read that even on SSDs there is a notable performance penalty. 
A 4.2 GiB akonadi database for tons of mails appears to work okayish on 
dual SSD BTRFS RAID 1 with LZO compression here. However I have no 
comparison, for example how it would run on XFS. And it's fragmented 
quite a bit; here is the largest file of 3 GiB for example – I know this in part 
is also due to LZO compression.

[…].local/share/akonadi/db_data/akonadi> time /usr/sbin/filefrag 
parttable.ibd
parttable.ibd: 45380 extents found
/usr/sbin/filefrag parttable.ibd  0,00s user 0,86s system 41% cpu 2,054 
total

However it digs out those extents quite fast.

I would not feel comfortable with setting this file to nodatacow.


However I wonder: Is this it? Is there nothing that can be improved in 
BTRFS to handle database and VM files in a better way, without altering 
any default settings?

Is it also an issue on ZFS? ZFS also does copy on write. How does ZFS 
handle this? Can anything be learned from it? I never heard people 
complain about poor database performance on ZFS, but… I don´t use it and 
I am not subscribed to any ZFS mailing lists, so they may have similar 
issues and I just do not know about them.

Well there seems to be a performance penalty at least when compared to 
XFS:

About ZFS Performance
Yves Trudeau, May 15, 2018

https://www.percona.com/blog/2018/05/15/about-zfs-performance/

The article described how you can use NVMe devices as cache to mitigate 
the performance impact. That would hint that BTRFS with VFS Hot Data 
Tracking and relocating data to SSD or NVMe devices could be a way to 
set this up.


But as said, I read about bad database performance with BTRFS even on SSDs. 
I cannot find the original reference at the moment, but I found 
this for example; however, it is from 2015 (on kernel 4.0, which is a bit 
old):

Friends don't let friends use BTRFS for OLTP
2015/09/16 by Tomas Vondra

https://blog.pgaddict.com/posts/friends-dont-let-friends-use-btrfs-for-oltp

Interestingly it also compares with ZFS which is doing much better. So 
maybe there is really something to be learned from ZFS.

It was not clear to me whether the benchmark was run on an SSD; as Tomas 
notes the "ssd" mount option, it might have been.

Thanks,
-- 
Martin




Re: Healthy amount of free space?

2018-07-17 Thread Martin Steigerwald
Nikolay Borisov - 17.07.18, 10:16:
> On 17.07.2018 11:02, Martin Steigerwald wrote:
> > Nikolay Borisov - 17.07.18, 09:20:
> >> On 16.07.2018 23:58, Wolf wrote:
> >>> Greetings,
> >>> I would like to ask what what is healthy amount of free space to
> >>> keep on each device for btrfs to be happy?
> >>> 
> >>> This is how my disk array currently looks like
> >>> 
> >>> [root@dennas ~]# btrfs fi usage /raid
> >>> 
> >>> Overall:
> >>> Device size:  29.11TiB
> >>> Device allocated: 21.26TiB
> >>> Device unallocated:7.85TiB
> >>> Device missing:  0.00B
> >>> Used: 21.18TiB
> >>> Free (estimated):  3.96TiB  (min: 3.96TiB)
> >>> Data ratio:   2.00
> >>> Metadata ratio:   2.00
> >>> Global reserve:  512.00MiB  (used: 0.00B)
> > 
> > […]
> > 
> >>> Btrfs does quite good job of evenly using space on all devices.
> >>> No,
> >>> how low can I let that go? In other words, with how much space
> >>> free/unallocated remaining space should I consider adding new
> >>> disk?
> >> 
> >> Btrfs will start running into problems when you run out of
> >> unallocated space. So the best advice will be monitor your device
> >> unallocated, once it gets really low - like 2-3 gb I will suggest
> >> you run balance which will try to free up unallocated space by
> >> rewriting data more compactly into sparsely populated block
> >> groups. If after running balance you haven't really freed any
> >> space then you should consider adding a new drive and running
> >> balance to even out the spread of data/metadata.
> > 
> > What are these issues exactly?
> 
> For example if you have plenty of data space but your metadata is full
> then you will be getting ENOSPC.

Of that one I am aware.

This just did not happen so far.

I did not yet add it explicitly to the training slides, but I just make 
myself a note to do that.

Anything else?

> > I have
> > 
> > % btrfs fi us -T /home
> > 
> > Overall:
> > Device size: 340.00GiB
> > Device allocated:340.00GiB
> > Device unallocated:2.00MiB
> > Device missing:  0.00B
> > Used:308.37GiB
> > Free (estimated): 14.65GiB  (min: 14.65GiB)
> > Data ratio:   2.00
> > Metadata ratio:   2.00
> > Global reserve:  512.00MiB  (used: 0.00B)
> > 
> >   Data  Metadata System
> > 
> > Id Path   RAID1 RAID1RAID1Unallocated
> > -- -- -   ---
> > 
> >  1 /dev/mapper/msata-home 165.89GiB  4.08GiB 32.00MiB 1.00MiB
> >  2 /dev/mapper/sata-home  165.89GiB  4.08GiB 32.00MiB 1.00MiB
> > 
> > -- -- -   ---
> > 
> >Total  165.89GiB  4.08GiB 32.00MiB 2.00MiB
> >Used   151.24GiB  2.95GiB 48.00KiB
>
> You already have only 33% of your metadata full so if your workload
> turned out to actually be making more metadata-heavy changed i.e
> snapshots you could exhaust this and get ENOSPC, despite having around
> 14gb of free data space. Furthermore this data space is spread around
> multiple data chunks, depending on how populated they are a balance
> could be able to free up unallocated space which later could be
> re-purposed for metadata (again, depending on what you are doing).

The filesystem above IMO is not fit for snapshots. It would fill up 
rather quickly, I think even when I balance metadata. Actually I tried 
this and as I remember it took at most a day until it was full.

If I read the above figures correctly, at maximum I could currently gain one 
additional GiB by balancing metadata. That would not make a huge difference.

I bet I am already running this filesystem beyond recommendation, as I 
bet many would argue it is too full already for regular usage… I do not 
see the benefit of squeezing the last free space out of it just to fit 
in another GiB.

So I still do not get the point why it would make sense to balance it at 
this point in time. Especially as this 1 GiB I could regain is not even 
needed. And I do not see th

Re: Healthy amount of free space?

2018-07-17 Thread Martin Steigerwald
Hi Nikolay.

Nikolay Borisov - 17.07.18, 09:20:
> On 16.07.2018 23:58, Wolf wrote:
> > Greetings,
> > I would like to ask what what is healthy amount of free space to
> > keep on each device for btrfs to be happy?
> > 
> > This is how my disk array currently looks like
> > 
> > [root@dennas ~]# btrfs fi usage /raid
> > 
> > Overall:
> > Device size:  29.11TiB
> > Device allocated: 21.26TiB
> > Device unallocated:7.85TiB
> > Device missing:  0.00B
> > Used: 21.18TiB
> > Free (estimated):  3.96TiB  (min: 3.96TiB)
> > Data ratio:   2.00
> > Metadata ratio:   2.00
> > Global reserve:  512.00MiB  (used: 0.00B)
[…]
> > Btrfs does quite good job of evenly using space on all devices. No,
> > how low can I let that go? In other words, with how much space
> > free/unallocated remaining space should I consider adding new disk?
> 
> Btrfs will start running into problems when you run out of unallocated
> space. So the best advice will be monitor your device unallocated,
> once it gets really low - like 2-3 gb I will suggest you run balance
> which will try to free up unallocated space by rewriting data more
> compactly into sparsely populated block groups. If after running
> balance you haven't really freed any space then you should consider
> adding a new drive and running balance to even out the spread of
> data/metadata.

What are these issues exactly?

I have

% btrfs fi us -T /home
Overall:
Device size: 340.00GiB
Device allocated:340.00GiB
Device unallocated:2.00MiB
Device missing:  0.00B
Used:308.37GiB
Free (estimated): 14.65GiB  (min: 14.65GiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)

  Data  Metadata System  
Id Path   RAID1 RAID1RAID1Unallocated
-- -- -   ---
 1 /dev/mapper/msata-home 165.89GiB  4.08GiB 32.00MiB 1.00MiB
 2 /dev/mapper/sata-home  165.89GiB  4.08GiB 32.00MiB 1.00MiB
-- -- -   ---
   Total  165.89GiB  4.08GiB 32.00MiB 2.00MiB
   Used   151.24GiB  2.95GiB 48.00KiB

on a RAID-1 filesystem to which one, and part of the time two, Plasma desktops + 
KDEPIM and Akonadi + Baloo desktop search + you name it write like 
mad.

Since kernel 4.5 or 4.6 this simply works. Before that, sometimes BTRFS 
crawled to a halt searching for free blocks, and I had to switch off 
the laptop uncleanly. If that happened, a balance helped for a while. 
But since 4.5 or 4.6 this has not happened anymore.

I found that with SLES 12 SP 3 or so there is btrfsmaintenance running a 
balance weekly, which created an issue on our Proxmox + Ceph on Intel 
NUC based opensource demo lab. This is for sure no recommended 
configuration for Ceph, and Ceph is quite slow on these 2,5 inch 
harddisks and a 1 GBit network link, despite the (albeit somewhat minimal) 
5 GiB of m.2 SSD caching. What happened is that the VM crawled 
to a halt and the kernel gave "task hung for more than 120 seconds" 
messages. The VM was basically unusable during the balance. Sure, that 
should not happen with a "proper" setup, but it also did not happen 
without the automatic balance.

Also, what would happen on a hypervisor setup with several thousands of 
VMs on BTRFS, when several hundred of them decide to start the balance at 
a similar time? It could probably bring the I/O system below to a halt, 
as many enterprise storage systems are designed to sustain burst I/O 
loads, but not maximum utilization over an extended period of time.

I am really wondering what to recommend in my Linux performance tuning 
and analysis courses. So far I do not do regular balances on my own 
laptop, due to my thinking: if it is not broken, do not fix it.

My personal opinion here also is: if the filesystem degrades so much 
that it becomes unusable without regular maintenance from user space, 
the filesystem needs to be fixed. Ideally I would not have to worry about 
whether to regularly balance a BTRFS or not. In other words: I should 
not have to attend a performance analysis and tuning course in order to 
use a computer with a BTRFS filesystem.

Thanks,
-- 
Martin




Re: Transaction aborted (error -28) btrfs_run_delayed_refs*0x163/0x190

2018-07-10 Thread Martin Raiber
On 10.07.2018 09:04 Pete wrote:
> I've just had the error in the subject which caused the file system to
> go read-only.
>
> Further part of error message:
> WARNING: CPU: 14 PID: 1351 at fs/btrfs/extent-tree.c:3076
> btrfs_run_delayed_refs*0x163/0x190
>
> 'Screenshot' here:
> https://drive.google.com/file/d/1qw7TE1bec8BKcmffrOmg2LS15IOq8Jwc/view?usp=sharing
>
> The kernel is 4.17.4.  There are three hard drives in the file system.
> dmcrypt (luks) is used between btrfs and the disks.
This is probably a known issue. See
https://www.spinics.net/lists/linux-btrfs/msg75647.html
You could apply the patch in this thread and mount with enospc_debug to
confirm it is the same issue.
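Something along these lines (mount point is just an example):

mount -o remount,enospc_debug /mnt/data   # then watch dmesg for the extra ENOSPC details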


Re: [PATCH v2 1/2] btrfs: Check each block group has corresponding chunk at mount time

2018-07-03 Thread Martin Steigerwald
Nikolay Borisov - 03.07.18, 11:08:
> On  3.07.2018 11:47, Qu Wenruo wrote:
> > On 2018-07-03 16:33, Nikolay Borisov wrote:
> >> On  3.07.2018 11:08, Qu Wenruo wrote:
> >>> Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199837, if
> >>> a
> >>> crafted btrfs with incorrect chunk<->block group mapping, it could
> >>> lead to a lot of unexpected behavior.
> >>> 
> >>> Although the crafted image can be caught by the block group item
> >>> checker
> >>> added in "[PATCH] btrfs: tree-checker: Verify block_group_item",
> >>> if one crafted a valid enough block group item which can pass
> >>> above check but still mismatch with existing chunk, it could
> >>> cause a lot of undefined behavior.
> >>> 
> >>> This patch will add extra block group -> chunk mapping check, to
> >>> ensure we have a completely matching (start, len, flags) chunk
> >>> for each block group at mount time.
> >>> 
> >>> Reported-by: Xu Wen 
> >>> Signed-off-by: Qu Wenruo 
> >>> ---
> >>> changelog:
> >>> 
> >>> v2:
> >>>   Add better error message for each mismatch case.
> >>>   Rename function name, to co-operate with later patch.
> >>>   Add flags mismatch check.
> >>> 
> >>> ---
> >> 
> >> It's getting really hard to keep track of the various validation
> >> patches you sent with multiple versions + new checks. Please batch
> >> everything in a topic series i.e "Making checks stricter" or some
> >> such and send everything again nicely packed, otherwise the risk
> >> of mis-merging is increased.
> > 
> > Indeed, I'll send the branch and push it to github.
> > 
> >> I now see that Gu Jinxiang from fujitsu also started sending
> >> validation fixes.
> > 
> > No need to worry, that will be the only patch related to that thread
> > of bugzilla from Fujitsu.
> > As all the other cases can be addressed by my patches, sorry Fujitsu
> > guys :)> 
> >> Also for evry patch which fixes a specific issue from one of the
> >> reported on bugzilla.kernel.org just use the Link: tag to point to
> >> the original report on bugzilla that will make it easier to relate
> >> the fixes to the original report.
> > 
> > Never heard of "Link:" tag.
> > Maybe it's a good idea to added it to "submitting-patches.rst"?
> 
> I guess it's not officially documented but if you do git log --grep
> "Link:" you'd see quite a lot of patches actually have a Link pointing
> to the original thread if it has sparked some pertinent discussion.
> In this case those patches are a direct result of a bugzilla
> bugreport so having a Link: tag makes sense.

For Bugzilla reports I saw something like

Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=43511

in a patch I was Cc´d to.

Of course that does only apply if the patch in question fixes the 
reported bug.

> In the example of the qgroup patch I sent yesterday resulting from
> Misono's report there was also an involved discussion hence I added a
> link to the original thread.
[…]
-- 
Martin




Re: "decompress failed" in 1-2 files always causes kernel oops, check/scrub pass

2018-05-12 Thread Martin Steigerwald
Hey James.

james harvey - 12.05.18, 07:08:
> 100% reproducible, booting from disk, or even Arch installation ISO.
> Kernel 4.16.7.  btrfs-progs v4.16.
> 
> Reading one of two journalctl files causes a kernel oops.  Initially
> ran into it from "journalctl --list-boots", but cat'ing the file does
> it too.  I believe this shows there's compressed data that is invalid,
> but its btrfs checksum is valid.  I've cat'ed every file on the
> disk, and luckily have the problems narrowed down to only these 2
> files in /var/log/journal.
> 
> This volume has always been mounted with lzo compression.
> 
> scrub has never found anything, and have ran it since the oops.
> 
> Found a user a few years ago who also ran into this, without
> resolution, at:
> https://www.spinics.net/lists/linux-btrfs/msg52218.html
> 
> 1. Cat'ing a (non-essential) file shouldn't be able to bring down the
> system.
> 
> 2. If this is in fact invalid compressed data, there should be a way to
> check for that.  Btrfs check and scrub pass.

I think systemd-journald sets those files to nocow on BTRFS in order to 
reduce fragmentation: That means no checksums, no snapshots, no nothing. 
I just removed /var/log/journal and thus disabled journalling to disk. 
It's sufficient for me to have the recent state in /run/journal.

Can you confirm nocow being set via lsattr on those files?
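
In case it helps, this is roughly how I'd check (paths are just the usual
journald locations, adjust as needed; a 'C' in the lsattr output is the
No_COW attribute):

$ lsattr -d /var/log/journal
$ lsattr /var/log/journal/*/system.journal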

Still they should be decompressible just fine.

> Hardware is fine.  Passes memtest86+ in SMP mode.  Works fine on all
> other files.
> 
> 
> 
> [  381.869940] BUG: unable to handle kernel paging request at
> 00390e50 [  381.870881] BTRFS: decompress failed
[…]
-- 
Martin




Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs cont. [Was: Re: metadata_ratio mount option?]

2018-05-08 Thread Martin Svec
Hello Chris,

Dne 7.5.2018 v 18:37 Chris Mason napsal(a):
>
>
> On 7 May 2018, at 12:16, Martin Svec wrote:
>
>> Hello Chris,
>>
>> Dne 7.5.2018 v 16:49 Chris Mason napsal(a):
>>> On 7 May 2018, at 7:40, Martin Svec wrote:
>>>
>>>> Hi,
>>>>
>>>> According to man btrfs [1], I assume that metadata_ratio=1 mount option 
>>>> should
>>>> force allocation of one metadata chunk after every allocated data chunk. 
>>>> However,
>>>> when I set this option and start filling btrfs with "dd if=/dev/zero 
>>>> of=dummyfile.dat",
>>>> only data chunks are allocated but no metadata ones. So, how does the 
>>>> metadata_ratio
>>>> option really work?
>>>>
>>>> Note that I'm trying to use this option as a workaround of the bug 
>>>> reported here:
>>>>
>>>
>>> [ urls that FB email server eats, sorry ]
>>
>> It's link to "Btrfs remounted read-only due to ENOSPC in 
>> btrfs_run_delayed_refs" thread :)
>
> Oh yeah, the link worked fine, it just goes through this url defense monster 
> that munges it in
> replies.
>
>>
>>>
>>>>
>>>> i.e. I want to manually preallocate metadata chunks to avoid nightly 
>>>> ENOSPC errors.
>>>
>>>
>>> metadata_ratio is almost but not quite what you want.  It sets a flag on 
>>> the space_info to force a
>>> chunk allocation the next time we decide to call should_alloc_chunk().  
>>> Thanks to the overcommit
>>> code, we usually don't call that until the metadata we think we're going to 
>>> need is bigger than
>>> the metadata space available.  In other words, by the time we're into the 
>>> code that honors the
>>> force flag, reservations are already high enough to make us allocate the 
>>> chunk anyway.
>>
>> Yeah, that's how I understood the code. So I think metadata_ratio man 
>> section is quite confusing
>> because it implies that btrfs guarantees given metadata to data chunk space 
>> ratio, which isn't true.
>>
>>>
>>> I tried to use metadata_ratio to experiment with forcing more metadata slop 
>>> space, but really I
>>> have to tweak the overcommit code first.
>>> Omar beat me to a better solution, tracking down our transient ENOSPC 
>>> problems here at FB to
>>> reservations done for orphans.  Do you have a lot of deleted files still 
>>> being held open?  lsof
>>> /mntpoint | grep deleted will list them.
>>
>> I'll take a look during backup window. The initial bug report describes our 
>> rsync workload in
>> detail, for your reference. 

No, there are no lingering deleted files during the backup. However, I noticed
something interesting in the strace output: rsync does an ftruncate() on every
transferred file before closing it. In 99.9% of cases the file is truncated to
its own size, so it should be a no-op. But these ftruncates are by far the
slowest syscalls according to strace timing, and btrfs_truncate() describes
itself in its comments as "indeed ugly".
Could it be the root cause of the global reservation pressure?
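
For reference, something like this should reproduce the timings I'm describing
(the rsync options and paths are only an example; -T makes strace print the
time spent in each syscall):

$ strace -f -T -e trace=ftruncate,close -o /tmp/rsync.trace \
    rsync -a --whole-file /source/ /backup/current/
$ grep ftruncate /tmp/rsync.trace | sort -t'<' -k2 -rn | head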

I've found this patch from Filipe (Cc'd): 
https://patchwork.kernel.org/patch/10205013/. Should I
apply it to our 4.14.y kernel and try the impact on intensive rsync workloads?

Thank you
Martin




Re: metadata_ratio mount option?

2018-05-07 Thread Martin Svec
Hello Chris,

Dne 7.5.2018 v 16:49 Chris Mason napsal(a):
> On 7 May 2018, at 7:40, Martin Svec wrote:
>
>> Hi,
>>
>> According to man btrfs [1], I assume that metadata_ratio=1 mount option 
>> should
>> force allocation of one metadata chunk after every allocated data chunk. 
>> However,
>> when I set this option and start filling btrfs with "dd if=/dev/zero 
>> of=dummyfile.dat",
>> only data chunks are allocated but no metadata ones. So, how does the 
>> metadata_ratio
>> option really work?
>>
>> Note that I'm trying to use this option as a workaround of the bug reported 
>> here:
>>
>
> [ urls that FB email server eats, sorry ]

It's link to "Btrfs remounted read-only due to ENOSPC in 
btrfs_run_delayed_refs" thread :)

>
>>
>> i.e. I want to manually preallocate metadata chunks to avoid nightly ENOSPC 
>> errors.
>
>
> metadata_ratio is almost but not quite what you want.  It sets a flag on the 
> space_info to force a
> chunk allocation the next time we decide to call should_alloc_chunk().  
> Thanks to the overcommit
> code, we usually don't call that until the metadata we think we're going to 
> need is bigger than
> the metadata space available.  In other words, by the time we're into the 
> code that honors the
> force flag, reservations are already high enough to make us allocate the 
> chunk anyway.

Yeah, that's how I understood the code. So I think the metadata_ratio man
section is quite confusing, because it implies that btrfs guarantees a given
metadata-to-data chunk space ratio, which isn't true.

>
> I tried to use metadata_ratio to experiment with forcing more metadata slop 
> space, but really I
> have to tweak the overcommit code first.
> Omar beat me to a better solution, tracking down our transient ENOSPC 
> problems here at FB to
> reservations done for orphans.  Do you have a lot of deleted files still 
> being held open?  lsof
> /mntpoint | grep deleted will list them.

I'll take a look during backup window. The initial bug report describes our 
rsync workload in
detail, for your reference.

>
> We're working through a patch for the orphans here.  You've got a ton of 
> bytes pinned, which isn't
> a great match for the symptoms we see:
>
> [285169.096630] BTRFS info (device sdb): space_info 4 has 
> 18446744072120172544 free, is not full
> [285169.096633] BTRFS info (device sdb): space_info total=273804165120, 
> used=269218267136,
> pinned=3459629056, reserved=52396032, may_use=2663120896, readonly=131072
>
> But, your may_use count is high enough that you might be hitting this 
> problem.  Otherwise I'll
> work out a patch to make some more metadata chunks while Josef is perfecting 
> his great delayed ref
> update.

As mentioned in the bug report, we have a custom patch that dedicates SSDs for 
metadata chunks and
HDDs for data chunks. So, all we need is to preallocate metadata chunks to 
occupy all of the SSD
space and our issues will be gone.
Note that btrfs with SSD-backed metadata works absolutely great for rsync 
backups, even if there're
billions of files and thousands of snapshots. The global reservation ENOSPC is 
the last issue we're
struggling with.

Thank you

Martin




metadata_ratio mount option?

2018-05-07 Thread Martin Svec
Hi,

According to man btrfs [1], I assume that metadata_ratio=1 mount option should
force allocation of one metadata chunk after every allocated data chunk. 
However,
when I set this option and start filling btrfs with "dd if=/dev/zero 
of=dummyfile.dat",
only data chunks are allocated but no metadata ones. So, how does the 
metadata_ratio
option really work?
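
For completeness, the test boils down to something like this (device, mount
point and sizes are placeholders):

$ mount -o metadata_ratio=1 /dev/sdX /mnt/test
$ dd if=/dev/zero of=/mnt/test/dummyfile.dat bs=1M count=20480
$ btrfs fi df /mnt/test   # only the Data total grows, no new Metadata chunks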

Note that I'm trying to use this option as a workaround of the bug reported 
here: 

https://www.spinics.net/lists/linux-btrfs/msg75104.html

i.e. I want to manually preallocate metadata chunks to avoid nightly ENOSPC 
errors.

Best regards.

Martin


[1] https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs(5)#MOUNT_OPTIONS





Re: extent-tree.c no space left (4.9.77 + 4.16.2)

2018-04-21 Thread Martin Svec
Hi David,

this looks like the bug that I already reported two times:

https://www.spinics.net/lists/linux-btrfs/msg54394.html
https://www.spinics.net/lists/linux-btrfs/msg75104.html

The second thread contains Nikolay's debug patch that can confirm if you run 
out of global metadata
reservations too.

Martin

Dne 21.4.2018 v 9:38 David Goodwin napsal(a):
> Hi,
>
> I'm running a 3TiB EBS based (2+1TiB devices) volume in EC2 which contains 
> about 500 read-only
> snapshots.
>
> btrfs-progs v4.7.3
>
> There are two dmesg trace things below. The first one from a 4.9.77 kernel -
>
> [ cut here ]
> BTRFS: error (device xvdg) in btrfs_run_delayed_refs:2967: errno=-28 No space 
> left
> BTRFS info (device xvdg): forced readonlyApr 19 11:44:40 gateway1 kernel: 
> [7648104.300115]
> WARNING: CPU: 2 PID: 963 at fs/btrfs/extent-tree.c:2967 
> btrfs_run_delayed_refs+0x27e/0x2b0
> [btrfs]Apr 19 11:44:40 gateway1 kernel: [7648104.313268] BTRFS: Transaction 
> aborted (error -28)
> Modules linked in: dm_mod nfsv3 ipt_REJECT nf_reject_ipv4 ipt_MASQUERADE 
> nf_nat_masquerade_ipv4
> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat_ftp 
> nf_conntrack_ftp nf_nat
> nf_conntrack xt_mu
> nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc evdev 
> intel_rapl
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_pcsp snd_pcm 
> aesni_intel aes_x86_64 lrw
> gf128mul glue_helper snd_timer ablk_helper snd cryptd soundcore ext4 crc16 
> jbd2 mbcache btrfs xor
> raid6_pq xen_netfront xen_blkfront crc32c_intel
> CPU: 2 PID: 963 Comm: btrfs-transacti Not tainted 4.9.77-dg1 #1Apr 19 
> 11:44:40 gateway1 kernel:
> [7648104.408561]   812f17a4 c90043203d08 
> 
>  8107389e a0157d5a c90043203d58 8802ccfd7170
>  880394684800 880394684800 0007315c 8107390f
> Call Trace:
>  [] ? dump_stack+0x5c/0x78
>  [] ? __warn+0xbe/0xe0
>  [] ? warn_slowpath_fmt+0x4f/0x60
>  [] ? btrfs_run_delayed_refs+0x27e/0x2b0 [btrfs]
>  [] ? btrfs_release_path+0x13/0x80 [btrfs]
>  [] ? btrfs_start_dirty_block_groups+0x2c2/0x450 [btrfs]
>  [] ? btrfs_commit_transaction+0x14c/0xa30 [btrfs]
>  [] ? start_transaction+0x96/0x480 [btrfs]
>  [] ? transaction_kthread+0x1dc/0x200 [btrfs]
>  [] ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
>  [] ? kthread+0xc7/0xe0
>  [] ? kthread_park+0x60/0x60
>  [] ? ret_from_fork+0x54/0x60
> ---[ end trace 69ca1332d91b4310 ]---
> BTRFS: error (device xvdg) in btrfs_run_delayed_refs:2967: errno=-28 No space 
> left
> BTRFS error (device xvdg): parent transid verify failed on 5400398217216 
> wanted 1893543 found 1893366
>
>
>
> On checking btrfs fi us there was plenty of unallocated space left.
>
> % btrfs fi us /broken/
>
> Overall:
>     Device size:   3.06TiB
>     Device allocated:   2.43TiB
>     Device unallocated: 643.09GiB
>     Device missing: 0.00B
>     Used:   2.43TiB
>     Free (estimated): 646.41GiB    (min: 646.41GiB)
>     Data ratio:  1.00
>     Metadata ratio:  1.00
>     Global reserve: 512.00MiB    (used: 0.00B)
>
> 
>
> The VM was then rebooted with a 4.16.2 kernel, which encountered what I 
> assume is the same problem:
>
>
> [ cut here ]
> BTRFS: Transaction aborted (error -28)
> WARNING: CPU: 2 PID: 981 at fs/btrfs/extent-tree.c:6990 
> __btrfs_free_extent.isra.63+0x3d2/0xd20
> [btrfs]
> Modules linked in: nfsv3 ipt_REJECT nf_reject_ipv4 ipt_MASQUERADE 
> nf_nat_masquerade_ipv4
> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat_ftp 
> nf_conntrack_ftp nf_nat
> nf_conntrack libcrc32c crc32c_generic xt_multiport iptable_filter ip_tables 
> x_tables autofs4 nfsd
> auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc intel_rapl 
> crct10dif_pclmul crc32_pclmul
> ghash_clmulni_intel evdev pcbc snd_pcsp aesni_intel snd_pcm aes_x86_64 
> snd_timer crypto_simd
> glue_helper snd cryptd soundcore ext4 crc16 mbcache jbd2 btrfs xor 
> zstd_decompress zstd_compress
> xxhash raid6_pq xen_netfront xen_blkfront crc32c_intel
> CPU: 2 PID: 981 Comm: btrfs-transacti Not tainted 4.16.2-dg1 #1
> RIP: e030:__btrfs_free_extent.isra.63+0x3d2/0xd20 [btrfs]
> RSP: e02b:c900428d7c68 EFLAGS: 00010292
> RAX: 0026 RBX: 01fb8031c000 RCX: 0006
> RDX: 0007 RSI: 0001 RDI: 88039a916650
> RBP: ffe4 R08: 0001 R09: 010a
> R10: 0001 R11: 010a R12: 8803957e6000
> R13: 88036f5a9e70 R14:  R15: 0

Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-13 Thread Martin Svec
Dne 10.3.2018 v 15:51 Martin Svec napsal(a):
> Dne 10.3.2018 v 13:13 Nikolay Borisov napsal(a):
>> 
>>
>>>>> And then report back on the output of the extra debug 
>>>>> statements. 
>>>>>
>>>>> Your global rsv is essentially unused, this means 
>>>>> in the worst case the code should fallback to using the global rsv
>>>>> for satisfying the memory allocation for delayed refs. So we should
>>>>> figure out why this isn't' happening. 
>>>> Patch applied. Thank you very much, Nikolay. I'll let you know as soon as 
>>>> we hit ENOSPC again.
>>> There is the output:
>>>
>>> [24672.573075] BTRFS info (device sdb): space_info 4 has 
>>> 18446744072971649024 free, is not full
>>> [24672.573077] BTRFS info (device sdb): space_info total=308163903488, 
>>> used=304593289216, pinned=2321940480, reserved=174800896, 
>>> may_use=1811644416, readonly=131072
>>> [24672.573079] use_block_rsv: Not using global blockrsv! Current 
>>> blockrsv->type = 1 blockrsv->space_info = 999a57db7000 
>>> global_rsv->space_info = 999a57db7000
>>> [24672.573083] BTRFS: Transaction aborted (error -28)
>> Bummer, so you are indeed running out of global space reservations in
>> context which can't really use any other reservation type, thus the
>> ENOSPC. Was the stacktrace again during processing of running delayed refs?
> Yes, the stacktrace is below.
>
> [24672.573132] WARNING: CPU: 3 PID: 808 at fs/btrfs/extent-tree.c:3089 
> btrfs_run_delayed_refs+0x259/0x270 [btrfs]
> [24672.573132] Modules linked in: binfmt_misc xt_comment xt_tcpudp 
> iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw 
> ip6table_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
> nf_nat nf_conntrack ip6table_mangle ip6table_raw ip6_tables iptable_mangle 
> intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel pcbc aesni_intel snd_pcm aes_x86_64 snd_timer crypto_simd 
> glue_helper snd cryptd soundcore iTCO_wdt intel_cstate joydev 
> iTCO_vendor_support pcspkr dcdbas intel_uncore sg serio_raw evdev lpc_ich 
> mgag200 ttm drm_kms_helper drm i2c_algo_bit shpchp mfd_core i7core_edac 
> ipmi_si ipmi_devintf acpi_power_meter ipmi_msghandler button acpi_cpufreq 
> ip_tables x_tables autofs4 xfs libcrc32c crc32c_generic btrfs xor 
> zstd_decompress zstd_compress
> [24672.573161]  xxhash hid_generic usbhid hid raid6_pq sd_mod crc32c_intel 
> psmouse uhci_hcd ehci_pci ehci_hcd megaraid_sas usbcore scsi_mod bnx2
> [24672.573170] CPU: 3 PID: 808 Comm: btrfs-transacti Tainted: GW I
>  4.14.23-znr8+ #73
> [24672.573171] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.6.3 
> 02/01/2011
> [24672.573172] task: 999a23229140 task.stack: a85642094000
> [24672.573186] RIP: 0010:btrfs_run_delayed_refs+0x259/0x270 [btrfs]
> [24672.573187] RSP: 0018:a85642097de0 EFLAGS: 00010282
> [24672.573188] RAX: 0026 RBX: 99975c75c3c0 RCX: 
> 0006
> [24672.573189] RDX:  RSI: 0082 RDI: 
> 999a6fcd66f0
> [24672.573190] RBP: 95c24d68 R08: 0001 R09: 
> 0479
> [24672.573190] R10: 99974b1960e0 R11: 0479 R12: 
> 999a5a65
> [24672.573191] R13: 999a5a6511f0 R14:  R15: 
> 
> [24672.573192] FS:  () GS:999a6fcc() 
> knlGS:
> [24672.573193] CS:  0010 DS:  ES:  CR0: 80050033
> [24672.573194] CR2: 558bfd56dfd0 CR3: 00030a60a005 CR4: 
> 000206e0
> [24672.573195] Call Trace:
> [24672.573215]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
> [24672.573231]  ? start_transaction+0x89/0x410 [btrfs]
> [24672.573246]  transaction_kthread+0x195/0x1b0 [btrfs]
> [24672.573249]  kthread+0xfc/0x130
> [24672.573265]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
> [24672.573266]  ? kthread_create_on_node+0x70/0x70
> [24672.573269]  ret_from_fork+0x35/0x40
> [24672.573270] Code: c7 c6 20 e8 37 c0 48 89 df 44 89 04 24 e8 59 bc 09 00 44 
> 8b 04 24 eb 86 44 89 c6 48 c7 c7 30 58 38 c0 44 89 04 24 e8 82 30 3f cf <0f> 
> 0b 44 8b 04 24 eb c4 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00
> [24672.573292] ---[ end trace b17d927a946cb02e ]---
>
>

Again, another ENOSPC due to lack of global rsv space in the context of delayed 
refs:

[1933

Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-10 Thread Martin Svec
Dne 10.3.2018 v 13:13 Nikolay Borisov napsal(a):
>
> 
>
>>>> And then report back on the output of the extra debug
>>>> statements.
>>>>
>>>> Your global rsv is essentially unused, this means
>>>> in the worst case the code should fall back to using the global rsv
>>>> for satisfying the memory allocation for delayed refs. So we should
>>>> figure out why this isn't happening.
>>> Patch applied. Thank you very much, Nikolay. I'll let you know as soon as 
>>> we hit ENOSPC again.
>> There is the output:
>>
>> [24672.573075] BTRFS info (device sdb): space_info 4 has 
>> 18446744072971649024 free, is not full
>> [24672.573077] BTRFS info (device sdb): space_info total=308163903488, 
>> used=304593289216, pinned=2321940480, reserved=174800896, 
>> may_use=1811644416, readonly=131072
>> [24672.573079] use_block_rsv: Not using global blockrsv! Current 
>> blockrsv->type = 1 blockrsv->space_info = 999a57db7000 
>> global_rsv->space_info = 999a57db7000
>> [24672.573083] BTRFS: Transaction aborted (error -28)
> Bummer, so you are indeed running out of global space reservations in
> context which can't really use any other reservation type, thus the
> ENOSPC. Was the stacktrace again during processing of running delayed refs?

Yes, the stacktrace is below.

[24672.573132] WARNING: CPU: 3 PID: 808 at fs/btrfs/extent-tree.c:3089 
btrfs_run_delayed_refs+0x259/0x270 [btrfs]
[24672.573132] Modules linked in: binfmt_misc xt_comment xt_tcpudp 
iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw 
ip6table_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntrack ip6table_mangle ip6table_raw ip6_tables iptable_mangle 
intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel pcbc aesni_intel snd_pcm aes_x86_64 snd_timer crypto_simd 
glue_helper snd cryptd soundcore iTCO_wdt intel_cstate joydev 
iTCO_vendor_support pcspkr dcdbas intel_uncore sg serio_raw evdev lpc_ich 
mgag200 ttm drm_kms_helper drm i2c_algo_bit shpchp mfd_core i7core_edac ipmi_si 
ipmi_devintf acpi_power_meter ipmi_msghandler button acpi_cpufreq ip_tables 
x_tables autofs4 xfs libcrc32c crc32c_generic btrfs xor zstd_decompress 
zstd_compress
[24672.573161]  xxhash hid_generic usbhid hid raid6_pq sd_mod crc32c_intel 
psmouse uhci_hcd ehci_pci ehci_hcd megaraid_sas usbcore scsi_mod bnx2
[24672.573170] CPU: 3 PID: 808 Comm: btrfs-transacti Tainted: GW I 
4.14.23-znr8+ #73
[24672.573171] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.6.3 
02/01/2011
[24672.573172] task: 999a23229140 task.stack: a85642094000
[24672.573186] RIP: 0010:btrfs_run_delayed_refs+0x259/0x270 [btrfs]
[24672.573187] RSP: 0018:a85642097de0 EFLAGS: 00010282
[24672.573188] RAX: 0026 RBX: 99975c75c3c0 RCX: 0006
[24672.573189] RDX:  RSI: 0082 RDI: 999a6fcd66f0
[24672.573190] RBP: 95c24d68 R08: 0001 R09: 0479
[24672.573190] R10: 99974b1960e0 R11: 0479 R12: 999a5a65
[24672.573191] R13: 999a5a6511f0 R14:  R15: 
[24672.573192] FS:  () GS:999a6fcc() 
knlGS:
[24672.573193] CS:  0010 DS:  ES:  CR0: 80050033
[24672.573194] CR2: 558bfd56dfd0 CR3: 00030a60a005 CR4: 000206e0
[24672.573195] Call Trace:
[24672.573215]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
[24672.573231]  ? start_transaction+0x89/0x410 [btrfs]
[24672.573246]  transaction_kthread+0x195/0x1b0 [btrfs]
[24672.573249]  kthread+0xfc/0x130
[24672.573265]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[24672.573266]  ? kthread_create_on_node+0x70/0x70
[24672.573269]  ret_from_fork+0x35/0x40
[24672.573270] Code: c7 c6 20 e8 37 c0 48 89 df 44 89 04 24 e8 59 bc 09 00 44 
8b 04 24 eb 86 44 89 c6 48 c7 c7 30 58 38 c0 44 89 04 24 e8 82 30 3f cf <0f> 0b 
44 8b 04 24 eb c4 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00
[24672.573292] ---[ end trace b17d927a946cb02e ]---




Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-10 Thread Martin Svec
Dne 9.3.2018 v 20:03 Martin Svec napsal(a):
> Dne 9.3.2018 v 17:36 Nikolay Borisov napsal(a):
>> On 23.02.2018 16:28, Martin Svec wrote:
>>> Hello,
>>>
>>> we have a btrfs-based backup system using btrfs snapshots and rsync. 
>>> Sometimes,
>>> we hit ENOSPC bug and the filesystem is remounted read-only. However, 
>>> there's 
>>> still plenty of unallocated space according to "btrfs fi usage". So I think 
>>> this
>>> isn't another edge condition when btrfs runs out of space due to fragmented 
>>> chunks,
>>> but a bug in disk space allocation code. It suffices to umount the 
>>> filesystem and
>>> remount it back and it works fine again. The frequency of ENOSPC seems to be
>>> dependent on metadata chunks usage. When there's a lot of free space in 
>>> existing
>>> metadata chunks, the bug doesn't happen for months. If most metadata chunks 
>>> are
>>> above ~98%, we hit the bug every few days. Below are details regarding the 
>>> backup
>>> server and btrfs.
>>>
>>> The backup works as follows: 
>>>
>>>   * Every night, we create a btrfs snapshot on the backup server and rsync 
>>> data
>>> from a production server into it. This snapshot is then marked 
>>> read-only and
>>> will be used as a base subvolume for the next backup snapshot.
>>>   * Every day, expired snapshots are removed and their space is freed. 
>>> Cleanup
>>> is scheduled in such a way that it doesn't interfere with the backup 
>>> window.
>>>   * Multiple production servers are backed up in parallel to one backup 
>>> server.
>>>   * The backed up servers are mostly webhosting servers and mail servers, 
>>> i.e.
>>> hundreds of billions of small files. (Yes, we push btrfs to the limits 
>>> :-))
>>>   * Backup server contains ~1080 snapshots, Zlib compression is enabled.
>>>   * Rsync is configured to use whole file copying.
>>>
>>> System configuration:
>>>
>>> Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch 
>>> (see below) and
>>> Nikolay's patch 1b816c23e9 (btrfs: Add enospc_debug printing in 
>>> metadata_reserve_bytes)
>>>
>>> btrfs mount options: 
>>> noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15
>>>
>>> $ btrfs fi df /backup:
>>>
>>> Data, single: total=28.05TiB, used=26.37TiB
>>> System, single: total=32.00MiB, used=3.53MiB
>>> Metadata, single: total=255.00GiB, used=250.73GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>
>>> $ btrfs fi show /backup:
>>>
>>> Label: none  uuid: a52501a9-651c-4712-a76b-7b4238cfff63
>>> Total devices 2 FS bytes used 26.62TiB
>>> devid1 size 416.62GiB used 255.03GiB path /dev/sdb
>>> devid2 size 36.38TiB used 28.05TiB path /dev/sdc
>>>
>>> $ btrfs fi usage /backup:
>>>
>>> Overall:
>>> Device size:  36.79TiB
>>> Device allocated: 28.30TiB
>>> Device unallocated:8.49TiB
>>> Device missing:  0.00B
>>> Used: 26.62TiB
>>> Free (estimated): 10.17TiB  (min: 10.17TiB)
>>> Data ratio:   1.00
>>> Metadata ratio:   1.00
>>> Global reserve:  512.00MiB  (used: 0.00B)
>>>
>>> Data,single: Size:28.05TiB, Used:26.37TiB
>>>/dev/sdc   28.05TiB
>>>
>>> Metadata,single: Size:255.00GiB, Used:250.73GiB
>>>/dev/sdb  255.00GiB
>>>
>>> System,single: Size:32.00MiB, Used:3.53MiB
>>>/dev/sdb   32.00MiB
>>>
>>> Unallocated:
>>>/dev/sdb  161.59GiB
>>>/dev/sdc8.33TiB
>>>
>>> Btrfs filesystem uses two logical drives in single mode, backed by
>>> hardware RAID controller PERC H710; /dev/sdb is HW RAID1 consisting
>>> of two SATA SSDs and /dev/sdc is HW RAID6 SATA volume.
>>>
>>> Please note that we have a simple custom patch in btrfs which ensures
>>> that metadata chunks are allocated preferably on SSD volume and data
>>> chunks are allocated only on SATA volume. The patch slightly modifies
>>> __btrfs_alloc_chunk() so that its loop over devices ignores rotating
>>> devices when a metadata chunk is 

Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-09 Thread Martin Svec
Dne 9.3.2018 v 17:36 Nikolay Borisov napsal(a):
>
> On 23.02.2018 16:28, Martin Svec wrote:
>> Hello,
>>
>> we have a btrfs-based backup system using btrfs snapshots and rsync. 
>> Sometimes,
>> we hit ENOSPC bug and the filesystem is remounted read-only. However, 
>> there's 
>> still plenty of unallocated space according to "btrfs fi usage". So I think 
>> this
>> isn't another edge condition when btrfs runs out of space due to fragmented 
>> chunks,
>> but a bug in disk space allocation code. It suffices to umount the 
>> filesystem and
>> remount it back and it works fine again. The frequency of ENOSPC seems to be
>> dependent on metadata chunks usage. When there's a lot of free space in 
>> existing
>> metadata chunks, the bug doesn't happen for months. If most metadata chunks 
>> are
>> above ~98%, we hit the bug every few days. Below are details regarding the 
>> backup
>> server and btrfs.
>>
>> The backup works as follows: 
>>
>>   * Every night, we create a btrfs snapshot on the backup server and rsync 
>> data
>> from a production server into it. This snapshot is then marked read-only 
>> and
>> will be used as a base subvolume for the next backup snapshot.
>>   * Every day, expired snapshots are removed and their space is freed. 
>> Cleanup
>> is scheduled in such a way that it doesn't interfere with the backup 
>> window.
>>   * Multiple production servers are backed up in parallel to one backup 
>> server.
>>   * The backed up servers are mostly webhosting servers and mail servers, 
>> i.e.
>> hundreds of billions of small files. (Yes, we push btrfs to the limits 
>> :-))
>>   * Backup server contains ~1080 snapshots, Zlib compression is enabled.
>>   * Rsync is configured to use whole file copying.
>>
>> System configuration:
>>
>> Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch 
>> (see below) and
>> Nikolay's patch 1b816c23e9 (btrfs: Add enospc_debug printing in 
>> metadata_reserve_bytes)
>>
>> btrfs mount options: 
>> noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15
>>
>> $ btrfs fi df /backup:
>>
>> Data, single: total=28.05TiB, used=26.37TiB
>> System, single: total=32.00MiB, used=3.53MiB
>> Metadata, single: total=255.00GiB, used=250.73GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> $ btrfs fi show /backup:
>>
>> Label: none  uuid: a52501a9-651c-4712-a76b-7b4238cfff63
>> Total devices 2 FS bytes used 26.62TiB
>> devid1 size 416.62GiB used 255.03GiB path /dev/sdb
>> devid2 size 36.38TiB used 28.05TiB path /dev/sdc
>>
>> $ btrfs fi usage /backup:
>>
>> Overall:
>> Device size:  36.79TiB
>> Device allocated: 28.30TiB
>> Device unallocated:8.49TiB
>> Device missing:  0.00B
>> Used: 26.62TiB
>> Free (estimated): 10.17TiB  (min: 10.17TiB)
>> Data ratio:   1.00
>> Metadata ratio:   1.00
>> Global reserve:  512.00MiB  (used: 0.00B)
>>
>> Data,single: Size:28.05TiB, Used:26.37TiB
>>/dev/sdc   28.05TiB
>>
>> Metadata,single: Size:255.00GiB, Used:250.73GiB
>>/dev/sdb  255.00GiB
>>
>> System,single: Size:32.00MiB, Used:3.53MiB
>>/dev/sdb   32.00MiB
>>
>> Unallocated:
>>/dev/sdb  161.59GiB
>>/dev/sdc8.33TiB
>>
>> Btrfs filesystem uses two logical drives in single mode, backed by
>> hardware RAID controller PERC H710; /dev/sdb is HW RAID1 consisting
>> of two SATA SSDs and /dev/sdc is HW RAID6 SATA volume.
>>
>> Please note that we have a simple custom patch in btrfs which ensures
>> that metadata chunks are allocated preferably on SSD volume and data
>> chunks are allocated only on SATA volume. The patch slightly modifies
>> __btrfs_alloc_chunk() so that its loop over devices ignores rotating
>> devices when a metadata chunk is requested and vice versa. However, 
>> I'm quite sure that this patch doesn't cause the reported bug because
>> we log every call of the modified code and there're no __btrfs_alloc_chunk()
>> calls when ENOSPC is triggered. Moreover, we observed the same bug before
>> we developed the patch. (IIRC, Chris Mason mentioned that they work on
>> a similar feature in facebook, but I've found no official patches yet.)

Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-09 Thread Martin Svec
Nobody knows?

I'm particularly interested in why the space_info 4 debug output shows a
negative (unsigned 18446744072120172544) value as free metadata space, please
see the original report. Is it a bug in dump_space_info(), can metadata
reservations temporarily exceed the total space, or is it an indication of a
damaged filesystem? Also note that rebuilding the free space cache doesn't help.
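
For what it's worth, plugging the numbers from the original report into
total - (used + pinned + reserved + may_use + readonly) gives exactly that
value, so it looks like a plain negative result printed as an unsigned 64-bit
integer rather than on-disk damage:

  273804165120 - (269218267136 + 3459629056 + 52396032 + 2663120896 + 131072)
      = -1589379072  (about -1.48 GiB)
  -1589379072 as unsigned 64-bit = 18446744072120172544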

Thank you.

Martin

Dne 23.2.2018 v 15:28 Martin Svec napsal(a):
> Hello,
>
> we have a btrfs-based backup system using btrfs snapshots and rsync. 
> Sometimes,
> we hit ENOSPC bug and the filesystem is remounted read-only. However, there's 
> still plenty of unallocated space according to "btrfs fi usage". So I think 
> this
> isn't another edge condition when btrfs runs out of space due to fragmented 
> chunks,
> but a bug in disk space allocation code. It suffices to umount the filesystem 
> and
> remount it back and it works fine again. The frequency of ENOSPC seems to be
> dependent on metadata chunks usage. When there's a lot of free space in 
> existing
> metadata chunks, the bug doesn't happen for months. If most metadata chunks 
> are
> above ~98%, we hit the bug every few days. Below are details regarding the 
> backup
> server and btrfs.
>
> The backup works as follows: 
>
>   * Every night, we create a btrfs snapshot on the backup server and rsync 
> data
> from a production server into it. This snapshot is then marked read-only 
> and
> will be used as a base subvolume for the next backup snapshot.
>   * Every day, expired snapshots are removed and their space is freed. Cleanup
> is scheduled in such a way that it doesn't interfere with the backup 
> window.
>   * Multiple production servers are backed up in parallel to one backup 
> server.
>   * The backed up servers are mostly webhosting servers and mail servers, i.e.
> hundreds of billions of small files. (Yes, we push btrfs to the limits 
> :-))
>   * Backup server contains ~1080 snapshots, Zlib compression is enabled.
>   * Rsync is configured to use whole file copying.
>
> System configuration:
>
> Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch 
> (see below) and
> Nikolay's patch 1b816c23e9 (btrfs: Add enospc_debug printing in 
> metadata_reserve_bytes)
>
> btrfs mount options: 
> noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15
>
> $ btrfs fi df /backup:
>
> Data, single: total=28.05TiB, used=26.37TiB
> System, single: total=32.00MiB, used=3.53MiB
> Metadata, single: total=255.00GiB, used=250.73GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> $ btrfs fi show /backup:
>
> Label: none  uuid: a52501a9-651c-4712-a76b-7b4238cfff63
> Total devices 2 FS bytes used 26.62TiB
> devid1 size 416.62GiB used 255.03GiB path /dev/sdb
> devid2 size 36.38TiB used 28.05TiB path /dev/sdc
>
> $ btrfs fi usage /backup:
>
> Overall:
> Device size:  36.79TiB
> Device allocated: 28.30TiB
> Device unallocated:8.49TiB
> Device missing:  0.00B
> Used: 26.62TiB
> Free (estimated): 10.17TiB  (min: 10.17TiB)
> Data ratio:   1.00
> Metadata ratio:   1.00
> Global reserve:  512.00MiB  (used: 0.00B)
>
> Data,single: Size:28.05TiB, Used:26.37TiB
>/dev/sdc   28.05TiB
>
> Metadata,single: Size:255.00GiB, Used:250.73GiB
>/dev/sdb  255.00GiB
>
> System,single: Size:32.00MiB, Used:3.53MiB
>/dev/sdb   32.00MiB
>
> Unallocated:
>/dev/sdb  161.59GiB
>/dev/sdc8.33TiB
>
> Btrfs filesystem uses two logical drives in single mode, backed by
> hardware RAID controller PERC H710; /dev/sdb is HW RAID1 consisting
> of two SATA SSDs and /dev/sdc is HW RAID6 SATA volume.
>
> Please note that we have a simple custom patch in btrfs which ensures
> that metadata chunks are allocated preferably on SSD volume and data
> chunks are allocated only on SATA volume. The patch slightly modifies
> __btrfs_alloc_chunk() so that its loop over devices ignores rotating
> devices when a metadata chunk is requested and vice versa. However, 
> I'm quite sure that this patch doesn't cause the reported bug because
> we log every call of the modified code and there're no __btrfs_alloc_chunk()
> calls when ENOSPC is triggered. Moreover, we observed the same bug before
> we developed the patch. (IIRC, Chris Mason mentioned that they work on
> a similar feature in facebook, but I've found no official patches yet.)
>
> Dmesg dump:
>
> [285167.750763] use_block_rsv: 62468 callbacks suppressed

Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-02-23 Thread Martin Svec
: 00010282
[285167.750880] RAX: 001d RBX: 9c4a1c2ce128 RCX: 
0006
[285167.750881] RDX:  RSI: 0096 RDI: 
9c4a2fd566f0
[285167.750882] RBP: 4000 R08: 0001 R09: 
03dc
[285167.750883] R10: 0001 R11: 03dc R12: 
9c4a1c2ce000
[285167.750883] R13: 9c4a17692800 R14: 0001 R15: 
ffe4
[285167.750885] FS:  () GS:9c4a2fd4() 
knlGS:
[285167.750885] CS:  0010 DS:  ES:  CR0: 80050033
[285167.750886] CR2: 56250e55bfd0 CR3: 0ee0a003 CR4: 
000206e0
[285167.750887] Call Trace:
[285167.750903]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
[285167.750917]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
[285167.750929]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
[285167.750943]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
[285167.750961]  ? set_extent_bit+0x19/0x20 [btrfs]
[285167.750974]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
[285167.750992]  ? btrfs_merge_delayed_refs+0x63/0x560 [btrfs]
[285167.751006]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
[285167.751021]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
[285167.751037]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
[285167.751053]  ? start_transaction+0x89/0x410 [btrfs]
[285167.751068]  transaction_kthread+0x195/0x1b0 [btrfs]
[285167.751071]  kthread+0xfc/0x130
[285167.751087]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[285167.751088]  ? kthread_create_on_node+0x70/0x70
[285167.751091]  ret_from_fork+0x35/0x40
[285167.751092] Code: ff 48 c7 c6 28 d7 44 c0 48 c7 c7 a0 21 4a c0 e8 3c a5 4b 
cb 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 c0 4c 45 c0 e8 80 fd f1 ca <0f> ff 
e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 4e eb
[285167.751114] ---[ end trace 8721883b5af677ec ]---
[285169.096630] BTRFS info (device sdb): space_info 4 has 18446744072120172544 
free, is not full
[285169.096633] BTRFS info (device sdb): space_info total=273804165120, 
used=269218267136, pinned=3459629056, reserved=52396032, may_use=2663120896, 
readonly=131072
[285169.096638] BTRFS: Transaction aborted (error -28)
[285169.096664] [ cut here ]
[285169.096691] WARNING: CPU: 7 PID: 443 at fs/btrfs/extent-tree.c:3089 
btrfs_run_delayed_refs+0x259/0x270 [btrfs]
[285169.096692] Modules linked in: binfmt_misc xt_comment xt_tcpudp 
iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw 
ip6table_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntr
[285169.096722]  zstd_compress xxhash raid6_pq sd_mod crc32c_intel psmouse 
uhci_hcd ehci_pci ehci_hcd megaraid_sas usbcore scsi_mod bnx2
[285169.096729] CPU: 7 PID: 443 Comm: btrfs-transacti Tainted: GW I 
4.14.20-znr1+ #69
[285169.096730] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.6.3 
02/01/2011
[285169.096731] task: 9c4a1740e280 task.stack: ba48c1ecc000
[285169.096745] RIP: 0010:btrfs_run_delayed_refs+0x259/0x270 [btrfs]
[285169.096746] RSP: 0018:ba48c1ecfde0 EFLAGS: 00010282
[285169.096747] RAX: 0026 RBX: 9c47990c0780 RCX: 
0006
[285169.096748] RDX:  RSI: 0082 RDI: 
9c4a2fdd66f0
[285169.096749] RBP: 9c493d509b68 R08: 0001 R09: 
0403
[285169.096749] R10: 9c49731d6620 R11: 0403 R12: 
9c4a1c2ce000
[285169.096750] R13: 9c4a1c2cf1f0 R14:  R15: 

[285169.096751] FS:  () GS:9c4a2fdc() 
knlGS:
[285169.096752] CS:  0010 DS:  ES:  CR0: 80050033
[285169.096753] CR2: 55e70555bfe0 CR3: 0ee0a005 CR4: 
000206e0
[285169.096754] Call Trace:
[285169.096774]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
[285169.096790]  ? start_transaction+0x89/0x410 [btrfs]
[285169.096806]  transaction_kthread+0x195/0x1b0 [btrfs]
[285169.096809]  kthread+0xfc/0x130
[285169.096825]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[285169.096826]  ? kthread_create_on_node+0x70/0x70
[285169.096828]  ret_from_fork+0x35/0x40
[285169.096830] Code: c7 c6 20 d8 44 c0 48 89 df 44 89 04 24 e8 19 bb 09 00 44 
8b 04 24 eb 86 44 89 c6 48 c7 c7 30 48 45 c0 44 89 04 24 e8 d2 40 f2 ca <0f> ff 
44 8b 04 24 eb c4 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00
[285169.096852] ---[ end trace 8721883b5af677ed ]---
[285169.096918] BTRFS: error (device sdb) in btrfs_run_delayed_refs:3089: 
errno=-28 No space left
[285169.096976] BTRFS info (device sdb): forced readonly
[285169.096979] BTRFS warning (device sdb): Skipping commit of aborted 
transaction.
[285169.096981] BTRFS: error (device sdb) in cleanup_transaction:1873: 
errno=-28 No space left


How can I help you to fix this issue?

Regards,

Martin Svec





Re: Recommendations for balancing as part of regular maintenance?

2018-01-08 Thread Martin Raiber
On 08.01.2018 19:34 Austin S. Hemmelgarn wrote:
> On 2018-01-08 13:17, Graham Cobb wrote:
>> On 08/01/18 16:34, Austin S. Hemmelgarn wrote:
>>> Ideally, I think it should be as generic as reasonably possible,
>>> possibly something along the lines of:
>>>
>>> A: While not strictly necessary, running regular filtered balances (for
>>> example `btrfs balance start -dusage=50 -dlimit=2 -musage=50
>>> -mlimit=4`,
>>> see `man btrfs-balance` for more info on what the options mean) can
>>> help
>>> keep a volume healthy by mitigating the things that typically cause
>>> ENOSPC errors.  Full balances by contrast are long and expensive
>>> operations, and should be done only as a last resort.
>>
>> That recommendation is similar to what I do and it works well for my use
>> case. I would recommend it to anyone with my usage, but cannot say how
>> well it would work for other uses. In my case, I run balances like that
>> once a week: some weeks nothing happens, other weeks 5 or 10 blocks may
>> get moved.
>
> In my own usage I've got a pretty varied mix of other stuff going on.
> All my systems are Gentoo, so system updates mean that I'm building
> software regularly (though on most of the systems that happens on
> tmpfs in RAM), I run a home server with a dozen low use QEMU VM's and
> a bunch of transient test VM's, all of which I'm currently storing
> disk images for raw on top of BTRFS (which is actually handling all of
> it pretty well, though that may be thanks to all the VM's using
> PV-SCSI for their disks), I run a BOINC client system that sees pretty
> heavy filesystem usage, and have a lot of personal files that get
> synced regularly across systems, and all of this is on raid1 with
> essentially no snapshots.  For me the balance command I mentioned
> above run daily seems to help, even if the balance doesn't move much
> most of the time on most filesystems, and the actual balance
> operations take at most a few seconds most of the time (I've got
> reasonably nice SSD's in everything).

There have been reports of (rare) corruption caused by balance (won't be
detected by a scrub) here on the mailing list. So I would stay away
from btrfs balance unless it is absolutely needed (ENOSPC), and while it
is running I would try not to do any other writes simultaneously.



Btrfs blocked by too many delayed refs

2017-12-21 Thread Martin Raiber
Hi,

I have the problem that too many delayed refs block a btrfs storage. I
have one thread that does work:

[] io_schedule+0x16/0x40
[] wait_on_page_bit+0x116/0x150
[] read_extent_buffer_pages+0x1c5/0x290
[] btree_read_extent_buffer_pages+0x9d/0x100
[] read_tree_block+0x32/0x50
[] read_block_for_search.isra.30+0x120/0x2e0
[] btrfs_search_slot+0x385/0x990
[] btrfs_insert_empty_items+0x71/0xc0
[] insert_extent_data_ref.isra.49+0x11b/0x2a0
[] __btrfs_inc_extent_ref.isra.59+0x1ee/0x220
[] __btrfs_run_delayed_refs+0x924/0x12c0
[] btrfs_run_delayed_refs+0x7a/0x260
[] create_pending_snapshot+0x5e4/0xf00
[] create_pending_snapshots+0x97/0xc0
[] btrfs_commit_transaction+0x395/0x930
[] btrfs_mksubvol+0x4a6/0x4f0
[] btrfs_ioctl_snap_create_transid+0x185/0x190
[] btrfs_ioctl_snap_create_v2+0x104/0x150
[] btrfs_ioctl+0x5e1/0x23b0
[] do_vfs_ioctl+0x92/0x5a0
[] SyS_ioctl+0x79/0x9

the others are in 'D' state e.g. with

[] call_rwsem_down_write_failed+0x17/0x30
[] filename_create+0x6b/0x150
[] SyS_mkdir+0x44/0xe0

Slabtop shows 2423910 btrfs_delayed_ref_head structs, slowly decreasing.

What I think is happening is that delayed refs are added without being
throttled via btrfs_should_throttle_delayed_refs. Maybe by creating a
snapshot of a file and then modifying it (some action that creates
delayed refs, is not a truncate, which is already throttled, and does
not commit a transaction, which is also throttled).
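
To make that concrete, an (entirely untested) way to provoke it might be a
reflinked copy of a big file followed by lots of small in-place overwrites,
which generate delayed refs without truncating or committing (paths and sizes
made up):

$ dd if=/dev/urandom of=/mnt/test/big.img bs=1M count=4096
$ cp --reflink=always /mnt/test/big.img /mnt/test/big.snap
$ dd if=/dev/urandom of=/mnt/test/big.img bs=4k count=500000 conv=notrunc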

Regards,
Martin Raiber



Re: again "out of space" and remount read only, with 4.14

2017-12-18 Thread Martin Raiber
On 03.12.2017 16:39 Martin Raiber wrote:
> Am 26.11.2017 um 17:02 schrieb Tomasz Chmielewski:
>> On 2017-11-27 00:37, Martin Raiber wrote:
>>> On 26.11.2017 08:46 Tomasz Chmielewski wrote:
>>>> Got this one on a 4.14-rc7 filesystem with some 400 GB left:
>>> I guess it is too late now, but I guess the "btrfs fi usage" output of
>>> the file system (especially after it went ro) would be useful.
>> It was more or less similar as it went ro:
>>
>> # btrfs fi usage /srv
>> Overall:
>>     Device size:   5.25TiB
>>     Device allocated:  4.45TiB
>>     Device unallocated:  823.97GiB
>>     Device missing:  0.00B
>>     Used:  4.33TiB
>>     Free (estimated):    471.91GiB  (min: 471.91GiB)
>>     Data ratio:   2.00
>>     Metadata ratio:   2.00
>>     Global reserve:  512.00MiB  (used: 0.00B)
>>
>> Unallocated:
>>    /dev/sda4 411.99GiB
>>    /dev/sdb4 411.99GiB
> I wanted to check if is the same issue I have, e.g. with 4.14.1
> space_cache=v2:
>
> [153245.341823] BTRFS: error (device loop0) in
> btrfs_run_delayed_refs:3089: errno=-28 No space left
> [153245.341845] BTRFS: error (device loop0) in btrfs_drop_snapshot:9317:
> errno=-28 No space left
> [153245.341848] BTRFS info (device loop0): forced readonly
> [153245.341972] BTRFS warning (device loop0): Skipping commit of aborted
> transaction.
> [153245.341975] BTRFS: error (device loop0) in cleanup_transaction:1873:
> errno=-28 No space left
> # btrfs fi usage /media/backup
> Overall:
>     Device size:  49.60TiB
>     Device allocated: 38.10TiB
>     Device unallocated:   11.50TiB
>     Device missing:  0.00B
>     Used: 36.98TiB
>     Free (estimated): 12.59TiB  (min: 12.59TiB)
>     Data ratio:   1.00
>     Metadata ratio:   1.00
>     Global reserve:    2.00GiB  (used: 1.99GiB)
>
> Data,single: Size:37.70TiB, Used:36.61TiB
>    /dev/loop0 37.70TiB
>
> Metadata,single: Size:411.01GiB, Used:380.98GiB
>    /dev/loop0    411.01GiB
>
> System,single: Size:36.00MiB, Used:4.00MiB
>    /dev/loop0 36.00MiB
>
> Unallocated:
>    /dev/loop0 11.50TiB
>
> Note the global reserve being at maximum. I already increased that in
> the code to 2G and that seems to make this issue appear more rarely.

This time with enospc_debug mount option:

With Linux 4.14.3. Single large device.

[15179.739038] [ cut here ]
[15179.739059] WARNING: CPU: 0 PID: 28694 at fs/btrfs/extent-tree.c:8458
btrfs_alloc_tree_block+0x38f/0x4a0
[15179.739060] Modules linked in: bcache loop dm_crypt algif_skcipher
af_alg st sr_mod cdrom xfs libcrc32c zbud intel_rapl sb_edac
x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt kvm_intel kvm
iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel pcbc raid1 mgag200 snd_pcm aesni_intel ttm snd_timer
drm_kms_helper snd soundcore aes_x86_64 crypto_simd glue_helper cryptd
pcspkr i2c_i801 joydev drm mei_me evdev lpc_ich mei mfd_core ipmi_si
ipmi_devintf ipmi_msghandler tpm_tis tpm_tis_core tpm wmi ioatdma button
shpchp fuse autofs4 hid_generic usbhid hid sg sd_mod dm_mod dax md_mod
crc32c_intel isci ahci mpt3sas libsas libahci igb raid_class ehci_pci
i2c_algo_bit libata dca ehci_hcd scsi_transport_sas ptp nvme pps_core
scsi_mod usbcore nvme_core
[15179.739133] CPU: 0 PID: 28694 Comm: btrfs Not tainted 4.14.3 #2
[15179.739134] Hardware name: Supermicro
X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
[15179.739136] task: 8813e4f02ac0 task.stack: c9000aea
[15179.739140] RIP: 0010:btrfs_alloc_tree_block+0x38f/0x4a0
[15179.739141] RSP: 0018:c9000aea3558 EFLAGS: 00010292
[15179.739144] RAX: 001d RBX: 4000 RCX:

[15179.739146] RDX: 880c4fa15b38 RSI: 880c4fa0de58 RDI:
880c4fa0de58
[15179.739147] RBP: c9000aea35d0 R08: 0001 R09:
0662
[15179.739149] R10: 1600 R11: 0662 R12:
880c0a454000
[15179.739151] R13: 880c4ba33800 R14: 0001 R15:
880c0a454128
[15179.739153] FS:  7f0d699128c0() GS:880c4fa0()
knlGS:
[15179.739155] CS:  0010 DS:  ES:  CR0: 80050033
[15179.739156] CR2: 7bbfcdf2c6e8 CR3: 00151da91003 CR4:
000606f0
[15179.739158] Call Trace:
[15179.739166]  __btrfs_cow_block+0x117/0x580
[15179.739169]  btrfs_cow_block+0xdf/0x200
[15179.739171]  btrfs_search_slot+0x1ea/0x990
[15179.739174]  lookup_inline_extent_backref+0x

Re: again "out of space" and remount read only, with 4.14

2017-12-03 Thread Martin Raiber
Am 26.11.2017 um 17:02 schrieb Tomasz Chmielewski:
> On 2017-11-27 00:37, Martin Raiber wrote:
>> On 26.11.2017 08:46 Tomasz Chmielewski wrote:
>>> Got this one on a 4.14-rc7 filesystem with some 400 GB left:
>> I guess it is too late now, but I guess the "btrfs fi usage" output of
>> the file system (especially after it went ro) would be useful.
> It was more or less similar as it went ro:
>
> # btrfs fi usage /srv
> Overall:
>     Device size:   5.25TiB
>     Device allocated:  4.45TiB
>     Device unallocated:  823.97GiB
>     Device missing:  0.00B
>     Used:  4.33TiB
>     Free (estimated):    471.91GiB  (min: 471.91GiB)
>     Data ratio:   2.00
>     Metadata ratio:   2.00
>     Global reserve:  512.00MiB  (used: 0.00B)
>
> Unallocated:
>    /dev/sda4 411.99GiB
>    /dev/sdb4 411.99GiB

I wanted to check if it is the same issue I have, e.g. with 4.14.1
space_cache=v2:

[153245.341823] BTRFS: error (device loop0) in
btrfs_run_delayed_refs:3089: errno=-28 No space left
[153245.341845] BTRFS: error (device loop0) in btrfs_drop_snapshot:9317:
errno=-28 No space left
[153245.341848] BTRFS info (device loop0): forced readonly
[153245.341972] BTRFS warning (device loop0): Skipping commit of aborted
transaction.
[153245.341975] BTRFS: error (device loop0) in cleanup_transaction:1873:
errno=-28 No space left
# btrfs fi usage /media/backup
Overall:
    Device size:  49.60TiB
    Device allocated: 38.10TiB
    Device unallocated:   11.50TiB
    Device missing:  0.00B
    Used: 36.98TiB
    Free (estimated): 12.59TiB  (min: 12.59TiB)
    Data ratio:   1.00
    Metadata ratio:   1.00
    Global reserve:    2.00GiB  (used: 1.99GiB)

Data,single: Size:37.70TiB, Used:36.61TiB
   /dev/loop0 37.70TiB

Metadata,single: Size:411.01GiB, Used:380.98GiB
   /dev/loop0    411.01GiB

System,single: Size:36.00MiB, Used:4.00MiB
   /dev/loop0 36.00MiB

Unallocated:
   /dev/loop0 11.50TiB

Note the global reserve being at maximum. I already increased that in
the code to 2G and that seems to make this issue appear more rarely.

Regards,
Martin Raiber




Re: Read before you deploy btrfs + zstd

2017-11-15 Thread Martin Steigerwald
David Sterba - 15.11.17, 15:39:
> On Tue, Nov 14, 2017 at 07:53:31PM +0100, David Sterba wrote:
> > On Mon, Nov 13, 2017 at 11:50:46PM +0100, David Sterba wrote:
> > > Up to now, there are no bootloaders supporting ZSTD.
> > 
> > I've tried to implement the support to GRUB, still incomplete and hacky
> > but most of the code is there.  The ZSTD implementation is copied from
> > kernel. The allocators need to be properly set up, as it needs to use
> > grub_malloc/grub_free for the workspace thats called from some ZSTD_*
> > functions.
> > 
> > https://github.com/kdave/grub/tree/btrfs-zstd
> 
> The branch is now in a state that can be tested. Turns out the memory
> requirements are too much for grub, so the boot fails with "not enough
> memory". The calculated value
> 
> ZSTD_BTRFS_MAX_INPUT: 131072
> ZSTD_DStreamWorkspaceBound with ZSTD_BTRFS_MAX_INPUT: 549424
> 
> This is not something I could fix easily, we'd probably need a tuned
> version of ZSTD for grub constraints. Adding Nick to CC.

Somehow I am happy that I still have a plain Ext4 for /boot. :)

Thanks for looking into Grub support anyway.

Thanks,
-- 
Martin


Re: Read before you deploy btrfs + zstd

2017-11-14 Thread Martin Steigerwald
David Sterba - 14.11.17, 19:49:
> On Tue, Nov 14, 2017 at 08:34:37AM +0100, Martin Steigerwald wrote:
> > Hello David.
> > 
> > David Sterba - 13.11.17, 23:50:
> > > while 4.14 is still fresh, let me address some concerns I've seen on
> > > linux
> > > forums already.
> > > 
> > > The newly added ZSTD support is a feature that has broader impact than
> > > just the runtime compression. The btrfs-progs understand filesystem with
> > > ZSTD since 4.13. The remaining key part is the bootloader.
> > > 
> > > Up to now, there are no bootloaders supporting ZSTD. This could lead to
> > > an
> > > unmountable filesystem if the critical files under /boot get
> > > accidentally
> > > or intentionally compressed by ZSTD.
> > 
> > But otherwise ZSTD is safe to use? Are you aware of any other issues?
> 
> No issues from my own testing or reported by other users.

Thanks to you and the others. I think I'll try this soon.

Thanks,
-- 
Martin


Re: Read before you deploy btrfs + zstd

2017-11-13 Thread Martin Steigerwald
Hello David.

David Sterba - 13.11.17, 23:50:
> while 4.14 is still fresh, let me address some concerns I've seen on linux
> forums already.
> 
> The newly added ZSTD support is a feature that has broader impact than
> just the runtime compression. The btrfs-progs understand filesystem with
> ZSTD since 4.13. The remaining key part is the bootloader.
> 
> Up to now, there are no bootloaders supporting ZSTD. This could lead to an
> unmountable filesystem if the critical files under /boot get accidentally
> or intentionally compressed by ZSTD.

But otherwise ZSTD is safe to use? Are you aware of any other issues?

I am considering switching from LZO to ZSTD on this ThinkPad T520 with Sandybridge.

Thank you,
-- 
Martin


Re: how to run balance successfully (No space left on device)?

2017-11-10 Thread Martin Raiber
On 10.11.2017 22:51 Chris Murphy wrote:
>> Combined with evidence that "No space left on device" during balance can
>> lead to various file corruption (we've witnessed it with MySQL), I'd day
>> btrfs balance is a dangerous operation and decision to use it should be
>> considered very thoroughly.
> I've never heard of this. Balance is COW at the chunk level. The old
> chunk is not dereferenced until it's written in the new location
> correctly. Corruption during balance shouldn't be possible so if you
> have a reproducer, the devs need to know about it.

I didn't say anything before, because I could not reproduce the problem.
I had (I guess) a corruption caused by balance as well. It had ENOSPC in
spite of enough free space (4.9.x), which made me balance it regularly
to keep unallocated space around. The corruption probably occurred after or
shortly before a power reset during a balance -- no skip_balance was specified,
so it continued directly after mount -- and data was moved relatively fast
after the mount operation (copy file, then delete old file). I think
space_cache=v2 was active at the time. I'm of course not completely sure
it was btrfs's fault, and as usual not all of the conditions may be
relevant. It could instead also be an upper-layer error (Hyper-V storage),
a memory issue or an application error.

Regards,
Martin Raiber



Re: Multiple btrfs-cleaner threads per volume

2017-11-02 Thread Martin Raiber
On 02.11.2017 16:10 Hans van Kranenburg wrote:
> On 11/02/2017 04:02 PM, Martin Raiber wrote:
>> snapshot cleanup is a little slow in my case (50TB volume). Would it
>> help to have multiple btrfs-cleaner threads? The block layer underneath
>> would have higher throughput with more simultaneous read/write requests.
> Just curious:
> * How many subvolumes/snapshots are you removing, and what's the
> complexity level (like, how many other subvolumes/snapshots reference
> the same data extents?)
> * Do you see a lot of cpu usage, or mainly a lot of disk I/O? If it's
> disk IO, is it mainly random read IO, or is it a lot of write traffic?
> * What mount options are you running with (from /proc/mounts)?

It is a single block device, not a multi-device btrfs, so
optimizations in that area wouldn't help. It is a UrBackup system with
about 200 snapshots per client, 20009 snapshots in total. UrBackup reflinks
files between them, but btrfs-cleaner doesn't use much CPU (so it
doesn't seem like the backref walking is the problem). btrfs-cleaner is
probably limited mainly by random read/write IO. The device has a cache,
so parallel accesses would help, as some of them may hit the cache.
Looking at the code it seems easy enough to do. The question is whether there
are any obvious reasons why this wouldn't work (like some lock etc.).


Multiple btrfs-cleaner threads per volume

2017-11-02 Thread Martin Raiber
Hi,

snapshot cleanup is a little slow in my case (50TB volume). Would it
help to have multiple btrfs-cleaner threads? The block layer underneath
would have higher throughput with more simultaneous read/write requests.

Regards,
Martin Raiber



Re: Data and metadata extent allocators [1/2]: Recap: The data story

2017-10-27 Thread Martin Steigerwald

I see a difference in behavior but I do not yet fully understand what I am 
looking at.
 
> Q: But what if all my chunks have badly fragmented free space right now?
> A: If your situation allows for it, the simplest way is running a full
> balance of the data, as some sort of big reset button. If you only want
> to clean up chunks with excessive free space fragmentation, then you can
> use the helper I used to identify them, which is
> show_free_space_fragmentation.py in [8]. Just feed the chunks to balance
> starting with the one with the highest score. The script requires the
> free space tree to be used, which is a good idea anyway.

Okay, if I understand this correctly I don't need to use "nossd" with kernel 
4.14, but it would be good to do a full "btrfs filesystem balance" run on all 
the SSD BTRFS filesystems, or on all other ones with rotational=0.
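
(To see which filesystems that would be, the rotational flag can be checked
like this; sda is only an example:)

lsblk -d -o NAME,ROTA               # ROTA 0 = non-rotational (SSD)
cat /sys/block/sda/queue/rotational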

What would be the benefit of that? Would the filesystem run faster again? My 
subjective impression is that performance got worse over time. *However*, all 
my previous full balance attempts made the performance even worse. So… is a 
full balance safe for filesystem performance these days?

I still have the issue that fstrim on /home only works with the patch from Lutz 
Euler from 2014, which is still not in mainline BTRFS. Maybe it would be a 
good idea to recreate /home in order to get rid of that special "anomaly" of 
the filesystem, namely that fstrim doesn't work without this patch.

Maybe at least a part of this should go into the BTRFS kernel wiki, as it would 
be easier for users to find there.

I wonder about an "upgrade notes for users" / "BTRFS maintenance" page that 
gives recommendations in case some step is advisable after a major kernel 
update, plus general recommendations for maintenance. Ideally most of this would 
be integrated into BTRFS or a userspace daemon for it and be handled 
transparently and automatically. Yet a full balance is an expensive operation 
time-wise and probably should not be started without user consent.

I do wonder about the ton of tools here and there, and I would love some btrfsd 
or… maybe an even more generic fsd filesystem maintenance daemon which would do 
regular scrubs and whatever else makes sense. It could use some configuration 
in the root directory of a filesystem and work for BTRFS and other filesystems 
that have beneficial online / background operations, like XFS, which also has 
online scrubbing by now (at least for metadata).

> [0] https://www.spinics.net/lists/linux-btrfs/msg64446.html
> [1] https://www.spinics.net/lists/linux-btrfs/msg64771.html
> [2] https://github.com/knorrie/btrfs-heatmap/
> [3]
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-10-27-shotgunblast.png
> [4]
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2016-12-18-heatmap-scripting/
> fsid_ed10a358-c846-4e76-a071-3821d423a99d_startat_320029589504_at_1482095269
> .png [5] https://www.spinics.net/lists/linux-btrfs/msg64418.html
> [6]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i
> d=583b723151794e2ff1691f1510b4e43710293875 [7]
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-ssd-to-nossd.mp4 [8]
> https://github.com/knorrie/python-btrfs/tree/develop/examples

Thanks,
-- 
Martin


Re: 4.13: "error in btrfs_run_delayed_refs:3009: errno=-28 No space left" with 1.3TB unallocated / 737G free?

2017-10-19 Thread Martin Raiber
On 19.10.2017 10:16 Vladimir Panteleev wrote:
> On Tue, 17 Oct 2017 16:21:04 -0700, Duncan wrote:
>> * try the balance on 4.14-rc5+, where the known bug should be fixed
>
> Thanks! However, I'm getting the same error on
> 4.14.0-rc5-g9aa0d2dde6eb. The stack trace is different, though:
>
> Aside from rebuilding the filesystem, what are my options? Should I
> try to temporarily add a file from another volume as a device and
> retry the balance? If so, what would be a good size for the temporary
> device?
>
Hi,

for me a work-around for something like this has been to reduce the
amount of dirty memory via e.g.

sysctl vm.dirty_background_bytes=$((100*1024*1024))
sysctl vm.dirty_bytes=$((400*1024*1024))

this reduces performance, however. You could also mount with
"enospc_debug" to give the devs more infos about this issue.
I am having more ENOSPC issues with 4.9.x than with the latest 4.14.
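
(In case it is useful, one way to make those values persistent across reboots
-- the file name is arbitrary:)

cat > /etc/sysctl.d/90-dirty-limits.conf <<EOF
vm.dirty_background_bytes = 104857600
vm.dirty_bytes = 419430400
EOF
sysctl --system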

Regards,
Martin



Something like ZFS Channel Programs for BTRFS & probably XFS or even VFS?

2017-10-03 Thread Martin Steigerwald
[Repost. I didn't notice that autocompletion gave me the wrong address for 
fsdevel; it is blacklisted now.]

Hello.

What do you think of

http://open-zfs.org/wiki/Projects/ZFS_Channel_Programs

?

There are quite a few BTRFS maintenance programs, like the deduplication stuff. 
Also regular scrubs… and in certain circumstances balances can probably make 
sense.

In addition to this XFS got scrub functionality as well.

Now, putting the foundation for such functionality into the kernel would, I 
think, only be reasonable if it cannot be done purely within user space, so I 
wonder about the safety from other concurrent ZFS modifications and the 
atomicity that are mentioned on the wiki page. The second set of slides, the 
ones from the OpenZFS Developer Summit 2014, which are linked to on the wiki 
page, explains this in more detail. (I didn't look at the first ones, as I am 
no fan of slideshare.net and prefer a simple PDF to download and view locally 
anytime -- not for privacy reasons alone, but also to avoid using a crappy 
webpage over a wonderfully functional fat-client PDF viewer like Okular.)

Also I wonder about putting a lua interpreter into the kernel, but it seems at 
least the NetBSD developers added one to their kernel with version 7.0 [1].

I also ask this because I have wondered about a kind of fsmaintd or volmaintd 
for quite a while, and thought… it would be nice to do this in a generic way, as 
BTRFS is not the only filesystem which supports maintenance operations. However, 
if it can all just nicely be done in userspace, I am all for it.
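
(As a crude stand-in for such a daemon, a cron-driven scrub along these lines
is already possible today -- schedule and mount point are just examples:)

# /etc/cron.d/btrfs-scrub
0 3 1 * * root /usr/bin/btrfs scrub start -Bq /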

[1] http://www.netbsd.org/releases/formal-7/NetBSD-7.0.html
(tons of presentation PDFs on their site as well)

Thanks,
-- 
Martin



Re: Regarding handling of file renames in Btrfs

2017-09-16 Thread Martin Raiber
Hi,

On 16.09.2017 14:27 Hans van Kranenburg wrote:
> On 09/10/2017 01:50 AM, Rohan Kadekodi wrote:
>> I was trying to understand how file renames are handled in Btrfs. I
>> read the code documentation, but had a problem understanding a few
>> things.
>>
>> During a file rename, btrfs_commit_transaction() is called which is
>> because Btrfs has to commit the whole FS before storing the
>> information related to the new renamed file.
> Can you point to which lines of code you're looking at?
>
>> It has to commit the FS
>> because a rename first does an unlink, which is not recorded in the
>> btrfs_rename() transaction and so is not logged in the log tree. Is my
>> understanding correct? [...]
> Can you also point to where exactly you see this happening? I'd also
> like to understand more about this.
>
> The whole mail thread following this message continues about what a
> transaction commit is and does etc, but the above question is never
> answered I think.
>
> And I think it's an interesting question. Is a rename a "heavier"
> operation relative to other file operations?
>
As far as I can see, it only uses the log tree in some cases where the
log tree was already used for the file or the parent directory. The
cases are documented here:
https://github.com/torvalds/linux/blob/master/fs/btrfs/tree-log.c#L45
So a rename isn't much heavier than unlink+create.
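
(A quick way to sanity-check that on a scratch filesystem -- path and file
count are just examples:)

cd /mnt/scratch && touch src{1..1000}
time for i in $(seq 1000); do mv src$i dst$i; done         # rename path
time for i in $(seq 1000); do touch new$i; rm new$i; done  # create+unlink path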

Regards,
Martin Raiber



Re: qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-13 Thread Martin Raiber
Hi,

On 12.09.2017 23:13 Adam Borowski wrote:
> On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-09-12 16:00, Adam Borowski wrote:
>>> Noted.  Both Marat's and my use cases, though, involve VMs that are off most
>>> of the time, and at least for me, turned on only to test something.
>>> Touching mtime makes rsync run again, and it's freaking _slow_: worse than
>>> 40 minutes for a 40GB VM (source:SSD target:deduped HDD).
>> 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
>> you're going direct to a hard drive.  I get better performance than that on
>> my somewhat pathetic NUC based storage cluster (I get roughly 20 MB/s there,
>> but it's for archival storage so I don't really care).  I'm actually curious
>> what the exact rsync command you are using is (you can obviously redact
>> paths as you see fit), as the only way I can think of that it should be that
>> slow is if you're using both --checksum (but if you're using this, you can
>> tell rsync to skip the mtime check, and that issue goes away) and --inplace,
>> _and_ your HDD is slow to begin with.
> rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ mordor:$BASE/qemu
> The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
> with nothing notable on SMART, in a Qnap 253a, kernel 4.9.
>
> Both source and target are btrfs, but here switching to send|receive
> wouldn't give much as this particular guest is Win10 Insider Edition --
> a thingy that shows what the folks from Redmond have cooked up, with roughly
> weekly updates to the tune of ~10GB writes 10GB deletions (if they do
> incremental transfers, installation still rewrites everything system).
>
> Lemme look a bit more, rsync performance is indeed really abysmal compared
> to what it should be.

Self-promotion, but consider using UrBackup (OSS software, too) instead? For
Windows VMs I would install the client in the VM. It excludes unnecessary
stuff, e.g. page files or the shadow storage area, from the image backups,
and it has a mode to store image backups as raw btrfs files.
Linux VMs I'd back up as files, either from the hypervisor or from inside the
VM. If you want to back up big btrfs image files it can do that too, and
faster than rsync, plus it can do incremental backups with sparse files.
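
(For completeness, the incremental send/receive variant mentioned above would
look roughly like this, assuming the qemu directory is a subvolume -- host,
paths and snapshot names are only examples:)

btrfs subvolume snapshot -r /mnt/btr1/qemu /mnt/btr1/qemu.new
btrfs send -p /mnt/btr1/qemu.old /mnt/btr1/qemu.new | ssh mordor btrfs receive /backup/qemu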

Regards,
Martin Raiber



Re: Regarding handling of file renames in Btrfs

2017-09-10 Thread Martin Raiber
Hi,

On 10.09.2017 08:45 Qu Wenruo wrote:
>
>
> On 2017年09月10日 14:41, Qu Wenruo wrote:
>>
>>
>> On 2017年09月10日 07:50, Rohan Kadekodi wrote:
>>> Hello,
>>>
>>> I was trying to understand how file renames are handled in Btrfs. I
>>> read the code documentation, but had a problem understanding a few
>>> things.
>>>
>>> During a file rename, btrfs_commit_transaction() is called which is
>>> because Btrfs has to commit the whole FS before storing the
>>> information related to the new renamed file. It has to commit the FS
>>> because a rename first does an unlink, which is not recorded in the
>>> btrfs_rename() transaction and so is not logged in the log tree. Is my
>>> understanding correct? If yes, my questions are as follows:
>>
>> Not familiar with rename kernel code, so not much help for rename
>> opeartion.
>>
>>>
>>> 1. What does committing the whole FS mean?
>>
>> Committing the whole fs means a lot of things, but generally
>> speaking, it makes sure that the on-disk data is consistent with each
>> other.
>
>> For obvious part, it writes modified fs/subvolume trees to disk (with
>> handling of tree operations so no half modified trees).
>>
>> Also other trees like extent tree (very hot since every CoW will
>> update it, and the most complicated one), csum tree if modified.
>>
>> After transaction is committed, the on-disk btrfs will represent the
>> states when commit trans is called, and every tree should match each
>> other.
>>
>> Despite of this, after a transaction is committed, generation of the
>> fs get increased and modified tree blocks will have the same
>> generation number.
>>
>>> Blktrace shows that there
>>> are 2   256KB writes, which are essentially writes to the data of
>>> the root directory of the file system (which I found out through
>>> btrfs-debug-tree).
>>
>> I'd say you didn't check btrfs-debug-tree output carefully enough.
>> I strongly recommend to do vimdiff to get what tree is modified.
>>
>> At least the following trees are modified:
>>
>> 1) fs/subvolume tree
>>     Rename modified the DIR_INDEX/DIR_ITEM/INODE_REF at least, and
>>     updated inode time.
>>     So fs/subvolume tree must be CoWed.
>>
>> 2) extent tree
>>     CoW of above metadata operation will definitely cause extent
>>     allocation and freeing, extent tree will also get updated.
>>
>> 3) root tree
>>     Both extent tree and fs/subvolume tree modified, their root bytenr
>>     needs to be updated and root tree must be updated.
>>
>> And finally superblocks.
>>
>> I just verified the behavior with empty btrfs created on a 1G file,
>> only one file to do the rename.
>>
>> In that case (with 4K sectorsize and 16K nodesize), the total IO
>> should be (3 * 16K) * 2 + 4K * 2 = 104K.
>>
>> "3" = number of tree blocks get modified
>> "16K" = nodesize
>> 1st "*2" = DUP profile for metadata
>> "4K" = superblock size
>> 2nd "*2" = 2 superblocks for 1G fs.
>>
>> If your extent/root/fs trees have higher level, then more tree blocks
>> needs to be updated.
>> And if your fs is very large, you may have 3 superblocks.
>>
>>> Is this equivalent to doing a shell sync, as the
>>> same block groups are written during a shell sync too?
>>
>> For shell "sync" the difference is that, "sync" will write all dirty
>> data pages to disk, and then commit transaction.
>> While only calling btrfs_commit_transacation() doesn't trigger dirty
>> page writeback.
>>
>> So there is a difference.

This conversation made me realize why btrfs has sub-optimal metadata
performance. CoW b-trees are not the best data structure for such small
changes. In my application I have multiple operations (e.g. renames)
which can be bundled up, and (mostly) one writer.
I guess using BTRFS_IOC_TRANS_START and BTRFS_IOC_TRANS_END would be one
way to reduce the CoW overhead, but those are dangerous with respect to
ENOSPC and there have been discussions about removing them.
Best would be if there were delayed metadata, where metadata is handled
the same way as delayed allocations and data changes, i.e. committed on
fsync, on the commit interval or on syncfs. I assumed this was already
the case...

Please correct me if I got this wrong.
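
(To illustrate the bundling idea from userspace, a rough sketch -- paths and
counts are just examples; "sync -f" is coreutils' sync --file-system, i.e.
syncfs():)

for i in $(seq 1000); do mv /mnt/data/f$i /mnt/data/g$i; done
sync -f /mnt/data    # force a single filesystem-wide commit for the whole batch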

Regards,
Martin Raiber


Re: 4.11.6 / more corruption / root 15455 has a root item with a more recent gen (33682) compared to the found root node (0)

2017-07-09 Thread Martin Steigerwald
Hello Duncan.

Duncan - 09.07.17, 11:17:
> Paul Jones posted on Sun, 09 Jul 2017 09:16:36 + as excerpted:
> >> Marc MERLIN - 08.07.17, 21:34:
> >> > This is now the 3rd filesystem I have (on 3 different machines) that
> >> > is getting corruption of some kind (on 4.11.6).
> >> 
> >> Anyone else getting corruptions with 4.11?
> >> 
> >> I happily switch back to 4.10.17 or even 4.9 if that is the case. I may
> >> even do so just from your reports. Well, yes, I will do exactly that. I
> >> just switch back for 4.10 for now. Better be safe, than sorry.
> > 
> > No corruption for me - I've been on 4.11 since about .2 and everything
> > seems fine. Currently on 4.11.8
> 
> No corruptions here either. 4.12.0 now, previously 4.12-rc5(ish, git),
> before that 4.11.0.
> 
> I have however just upgraded to new ssds then wiped and setup the old
[…]
> Also, all my btrfs are raid1 or dup for checksummed redundancy, and
> relatively small, the largest now 80 GiB per device, after the upgrade.
> And my use-case doesn't involve snapshots or subvolumes.
> 
> So any bug that is most likely on older filesystems, say those without
> the no-holes feature, for instance, or that doesn't tend to hit raid1 or
> dup mode, or that is less likely on small filesystems on fast ssds, or
> that triggers most often with reflinks and thus on filesystems with
> snapshots, is unlikely to hit me.

Hmmm, the BTRFS filesystems on my laptop are 3 to 5 or even more years old. I'll 
stick with 4.10 for now, I think.

The older ones are RAID 1 across two SSDs, the newer one is single device, on 
one SSD.

These filesystems haven't failed me in years, and since 4.5 or 4.6 even the "I 
search for free space" kernel hang (hung tasks and all that) is gone as well.

Thanks,
-- 
Martin


Re: 4.11.6 / more corruption / root 15455 has a root item with a more recent gen (33682) compared to the found root node (0)

2017-07-09 Thread Martin Steigerwald
Hello Marc.

Marc MERLIN - 08.07.17, 21:34:
> Sigh,
> 
> This is now the 3rd filesystem I have (on 3 different machines) that is
> getting corruption of some kind (on 4.11.6).

Anyone else getting corruptions with 4.11?

I will happily switch back to 4.10.17 or even 4.9 if that is the case. I may even 
do so just from your reports. Well, yes, I will do exactly that. I'll just switch 
back to 4.10 for now. Better safe than sorry.

I know how you feel, Marc. I posted about a corruption on one of my backup 
harddisks here some time ago that btrfs check --repair wasn't able to handle. 
I redid that disk from scratch and it took a long, long time.

I agree with you that this has to stop. Until then I will never *ever* 
recommend this to a customer. Ideally there would be no corruptions in stable 
kernels, especially when there is a .6 at the end of the version number. But if 
there are… then they should be fixable. Other filesystems like Ext4 and XFS can 
do it… so this should be possible with BTRFS as well.

Thanks,
-- 
Martin


Re: [PATCH 02/13] scsi/osd: don't save block errors into req_results

2017-05-26 Thread Martin K. Petersen

Christoph,

> We will only have sense data if the command exectured and got a SCSI
> result, so this is pointless.

"executed"

Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [dm-devel] [PATCH 08/15] dm mpath: merge do_end_io_bio into multipath_end_io_bio

2017-05-22 Thread Martin Wilck
On Thu, 2017-05-18 at 15:18 +0200, Christoph Hellwig wrote:
> This simplifies the code and especially the error passing a bit and
> will help with the next patch.
> 
> Signed-off-by: Christoph Hellwig <h...@lst.de>
> ---
>  drivers/md/dm-mpath.c | 42 ++++++++++++++++--------------------------
>  1 file changed, 16 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 3df056b73b66..b1cb0273b081 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -1510,24 +1510,26 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
>   return r;
>  }
>  
> -static int do_end_io_bio(struct multipath *m, struct bio *clone,
> -  int error, struct dm_mpath_io *mpio)
> +static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone, int error)
>  {
> + struct multipath *m = ti->private;
> + struct dm_mpath_io *mpio = get_mpio_from_bio(clone);
> + struct pgpath *pgpath = mpio->pgpath;
>   unsigned long flags;
>  
> - if (!error)
> - return 0;   /* I/O complete */
> + BUG_ON(!mpio);

You dereferenced mpio already above.

Regards,
Martin

>  
> - if (noretry_error(error))
> - return error;
> + if (!error || noretry_error(error))
> + goto done;
>  
> - if (mpio->pgpath)
> - fail_path(mpio->pgpath);
> + if (pgpath)
> + fail_path(pgpath);
>  
>   if (atomic_read(&m->nr_valid_paths) == 0 &&
>   !test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) {
>   dm_report_EIO(m);
> - return -EIO;
> + error = -EIO;
> + goto done;
>   }
>  
>   /* Queue for the daemon to resubmit */
> @@ -1539,28 +1541,16 @@ static int do_end_io_bio(struct multipath *m, struct bio *clone,
>   if (!test_bit(MPATHF_QUEUE_IO, &m->flags))
>   queue_work(kmultipathd, &m->process_queued_bios);
>  
> - return DM_ENDIO_INCOMPLETE;
> -}
> -
> -static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone, int error)
> -{
> - struct multipath *m = ti->private;
> - struct dm_mpath_io *mpio = get_mpio_from_bio(clone);
> - struct pgpath *pgpath;
> - struct path_selector *ps;
> - int r;
> -
> - BUG_ON(!mpio);
> -
> - r = do_end_io_bio(m, clone, error, mpio);
> - pgpath = mpio->pgpath;
> + error = DM_ENDIO_INCOMPLETE;
> +done:
>   if (pgpath) {
> - ps = &pgpath->pg->ps;
> + struct path_selector *ps = &pgpath->pg->ps;
> +
>   if (ps->type->end_io)
>   ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes);
>   }
>  
> - return r;
> + return error;
>  }
>  
>  /*

-- 
Dr. Martin Wilck <mwi...@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



Re: runtime btrfsck

2017-05-10 Thread Martin Steigerwald
Stefan Priebe - Profihost AG - 10.05.17, 09:02:
> I'm now trying btrfs progs 4.10.2. Is anybody out there who can tell me
> something about the expected runtime or how to fix bad key ordering?

I had a similar issue which remained unresolved.

But I clearly saw that btrfs check was running in a loop, see thread:

[4.9] btrfs check --repair looping over file extent discount errors

So it would be interesting to see the exact output of btrfs check, maybe there 
is something like repeated numbers that also indicate a loop.

I was about to say that BTRFS is production ready before this issue happened. 
I still think that for a lot of setups it mostly is, as at least the "I get stuck 
on the CPU while searching for free space" issue seems to be gone since somewhere 
around the 4.5/4.6 kernels. I also think so regarding the absence of data loss. 
I was able to copy over all of the data I needed from the broken filesystem.

Yet, when it comes to btrfs check? It's still quite rudimentary if you ask me. 
So unless someone has a clever idea here and shares it with you, you may need 
to back up anything you can from this filesystem and then start over from 
scratch. In my past experience, something like xfs_repair surpasses btrfs 
check in the ability to actually fix a broken filesystem by a great margin.
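
(If it comes to that, a rough sketch of the salvage-then-recreate route -- the
device name is only an example:)

mkdir -p /mnt/recovery
btrfs restore -v /dev/sdX1 /mnt/recovery   # read-only salvage, does not write to the source
mkfs.btrfs -f /dev/sdX1                    # only after the copy has been verified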

Ciao,
-- 
Martin


Re: [4.9] btrfs check --repair looping over file extent discount errors

2017-04-22 Thread Martin Steigerwald
Martin Steigerwald - 22.04.17, 20:01:
> Chris Murphy - 22.04.17, 09:31:
> > Is the file system created with no-holes?
> 
> I have how to find out about it and while doing accidentally set that

I didn't find out how to find out about it and…

> feature on another filesystem (btrfstune only seems to be able to enable
> the feature, not show the current state of it).
> 
> But as there is no notice of the feature being set as standard in the manpage of
> mkfs.btrfs as of BTRFS tools 4.9.1, and as I didn't set it myself, my best
> bet is that the feature is not enabled on the filesystem.
> 
> Now I wonder… how to disable the feature on that other filesystem again.
-- 
Martin


Re: [4.9] btrfs check --repair looping over file extent discount errors

2017-04-22 Thread Martin Steigerwald
Hello Chris.

Chris Murphy - 22.04.17, 09:31:
> Is the file system created with no-holes?

I have how to find out about it and while doing accidentally set that feature 
on another filesystem (btrfstune only seems to be able to enable the feature, 
not show the current state of it).

But as there is no notice of the feature being set as standard in the manpage of 
mkfs.btrfs as of BTRFS tools 4.9.1, and as I didn't set it myself, my best bet 
is that the feature is not enabled on the filesystem.

Now I wonder… how to disable the feature on that other filesystem again.
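
(At least checking whether the flag is set seems possible via the superblock
dump -- the device name is only an example:)

btrfs inspect-internal dump-super /dev/sdX1 | grep -i incompat
# NO_HOLES listed in incompat_flags would mean the feature is enabled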

Thanks,


Re: [4.9] btrfs check --repair looping over file extent discount errors

2017-04-22 Thread Martin Steigerwald
Hello.

I am planning to copy the important data on the disk with the broken filesystem 
over to the disk with the good filesystem and then reformat the disk with the 
broken filesystem soon, probably in the course of the day… so in case you want 
any debug information before that, let me know ASAP.

Thanks,
Martin

Martin Steigerwald - 14.04.17, 21:35:
> Hello,
> 
> backup harddisk connected via eSATA. Hard kernel hang, mouse pointer
> freezing two times seemingly after finishing /home backup and creating new
> snapshot on source BTRFS SSD RAID 1 for / in order to backup it. I did
> scrubbed / and it appears to be okay, but I didn´t run btrfs check on it.
> Anyway deleting that subvolume works and I as I suspected an issue with the
> backup disk I started with that one.
> 
> I got
> 
> merkaba:~> btrfs --version
> btrfs-progs v4.9.1
> 
> merkaba:~> cat /proc/version
> Linux version 4.9.20-tp520-btrfstrim+ (martin@merkaba) (gcc version 6.3.0
> 20170321 (Debian 6.3.0-11) ) #6 SMP PREEMPT Mon Apr 3 11:42:17 CEST 2017
> 
> merkaba:~> btrfs fi sh feenwald
> Label: 'feenwald'  uuid: […]
> Total devices 1 FS bytes used 1.26TiB
> devid1 size 2.73TiB used 1.27TiB path /dev/sdc1
> 
> on Debian unstable on ThinkPad T520 connected via eSATA port on Minidock.
> 
> 
> I am now running btrfs check --repair on it after without --repair the
> command reported file extent discount errors and it appears to loop on the
> same file extent discount errors for ages. Any advice?
> 
> I do have another backup harddisk with BTRFS that worked fine today, so I do
> not need to recover that drive immediately. I may let it run for a little
> more time, but then will abort the repair process as I really think its
> looping just over and over and over the same issues again. At some time I
> may just copy all the stuff that is on that harddisk, but not on the other
> one over to the other one and mkfs.btrfs the filesystem again, but I´d
> rather like to know whats happening here.
> 
> Here is output:
> 
> merkaba:~> btrfs check --repair /dev/sdc1
> enabling repair mode
> Checking filesystem on /dev/sdc1
> [… UUID ommited …]
> checking extents
> Fixed 0 roots.
> checking free space cache
> cache and super generation don't match, space cache will be invalidated
> checking fs roots
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4227072
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4227072
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4227072
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4227072
> [… hours later …]
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4227072
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4227072
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:

[4.9] btrfs check --repair looping over file extent discount errors

2017-04-14 Thread Martin Steigerwald
Hello,

Backup harddisk connected via eSATA. Hard kernel hang, mouse pointer freezing, 
two times, seemingly after finishing the /home backup and creating a new snapshot 
on the source BTRFS SSD RAID 1 for / in order to back it up. I scrubbed / and it 
appears to be okay, but I didn't run btrfs check on it. Anyway, deleting that 
subvolume works, and as I suspected an issue with the backup disk I started 
with that one.

I got

merkaba:~> btrfs --version
btrfs-progs v4.9.1

merkaba:~> cat /proc/version
Linux version 4.9.20-tp520-btrfstrim+ (martin@merkaba) (gcc version 6.3.0 
20170321 (Debian 6.3.0-11) ) #6 SMP PREEMPT Mon Apr 3 11:42:17 CEST 2017

merkaba:~> btrfs fi sh feenwald
Label: 'feenwald'  uuid: […]
Total devices 1 FS bytes used 1.26TiB
devid1 size 2.73TiB used 1.27TiB path /dev/sdc1

on Debian unstable on ThinkPad T520 connected via eSATA port on Minidock.


I am now running btrfs check --repair on it, after the command without --repair 
reported file extent discount errors, and it appears to loop on the same file 
extent discount errors for ages. Any advice?

I do have another backup harddisk with BTRFS that worked fine today, so I do 
not need to recover that drive immediately. I may let it run for a little more 
time, but then I will abort the repair process, as I really think it's looping 
over and over and over the same issues again. At some point I may just copy 
all the stuff that is on that harddisk but not on the other one over to the 
other one and mkfs.btrfs the filesystem again, but I'd rather like to know 
what's happening here.

Here is output:

merkaba:~> btrfs check --repair /dev/sdc1
enabling repair mode
Checking filesystem on /dev/sdc1
[… UUID omitted …]
checking extents
Fixed 0 roots.
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4227072
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4227072
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4227072
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4227072
[… hours later …]
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4227072
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4227072
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4227072

This basically seems to go on like this forever.

Thanks,
-- 
Martin


Re: Do different btrfs volumes compete for CPU?

2017-04-06 Thread Martin
On 05/04/17 08:04, Marat Khalili wrote:
> On 04/04/17 20:36, Peter Grandi wrote:
>> SATA works for external use, eSATA works well, but what really
>> matters is the chipset of the adapter card.
> eSATA might be sound electrically, but mechanically it is awful. Try to
> run it for months in a crowded server room, and inevitably you'll get
> disconnections and data corruption. Tried different cables, brackets --
> same result. If you ever used eSATA connector, you'd feel it.

Been using eSATA here for multiple disk packs continuously connected for
a few years now for 48TB of data (not enough room in the host for the
disks).

Never suffered an eSATA disconnect.

Had the usual cooling fan fails and HDD fails due to old age.


All just a case of ensuring undisturbed clean cabling and a good UPS?...

(BTRFS spanning four disks per external pack has worked well also.)

Good luck,
Martin




Re: Root volume (ID 5) in deleting state

2017-02-14 Thread Martin Mlynář



It looks like you're right!

On a different machine:

# btrfs sub list / | grep -v lxc
ID 327 gen 1959587 top level 5 path mnt/reaver
ID 498 gen 593655 top level 5 path var/lib/machines

# btrfs sub list / -d | wc -l
0

Ok, apparently it's a regression in one of the latest versions then.
But, it seems quite harmless.

I'm glad my data are safe :)






# uname -a
Linux interceptor 4.9.6-1-ARCH #1 SMP PREEMPT Thu Jan 26 09:22:26 CET
2017 x86_64 GNU/Linux

# btrfs fi show  /
Label: none  uuid: 859dec5c-850c-4660-ad99-bc87456aa309
  Total devices 1 FS bytes used 132.89GiB
  devid1 size 200.00GiB used 200.00GiB path
/dev/mapper/vg0-btrfsroot

As a side note, all of your disk space is allocated (200GiB of 200GiB).

Even while there's still 70GiB of free space scattered around inside,
this might lead to out-of-space issues, depending on how badly
fragmented that free space is.

I have not noticed this at all!

# btrfs fi show /
Label: none  uuid: 859dec5c-850c-4660-ad99-bc87456aa309
 Total devices 1 FS bytes used 134.23GiB
 devid1 size 200.00GiB used 200.00GiB path /dev/mapper/vg0-btrfsroot

# btrfs fi df /
Data, single: total=195.96GiB, used=131.58GiB
System, single: total=3.00MiB, used=48.00KiB
Metadata, single: total=4.03GiB, used=2.64GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

After btrfs defrag there is no difference. btrfs fi show says still
200/200. I'll try to play with it.


[ ... ]

So, to get the numbers of total raw disk space allocation down, you need
to defragment free space (compact the data), not defrag used space.

You can even create pictures of space utilization in your btrfs
filesystem, which might help understanding what it looks like right now: \o/

https://github.com/knorrie/btrfs-heatmap/
I ran into your tool yesterday while googling around this - thanks, 
it's a really nice tool. Now a rebalance is running and it seems to work well.


Thank you for excellent responses and help!





Re: Root volume (ID 5) in deleting state

2017-02-13 Thread Martin Mlynář

On 13.2.2017 21:03, Hans van Kranenburg wrote:

On 02/13/2017 12:26 PM, Martin Mlynář wrote:

I've currently run into strange problem with BTRFS. I'm using it as my
daily driver as root FS. Nothing complicated, just few subvolumes and
incremental backups using btrbk.

Now I've noticed that my btrfs root volume (absolute top, ID 5) is in
"deleting" state. As I've done some testing and googling it seems that
this should not be possible.

[...]

# btrfs sub list -ad /mnt/btrfs_root/
ID 5 gen 257505 top level 0 path /DELETED

I have heard rumours that this is actually a bug in the output of sub
list itself.

What's the version of your btrfs-progs? (output of `btrfs version`)

Sorry, I've lost this part:

$ btrfs version
btrfs-progs v4.9




# mount | grep btr
/dev/mapper/vg0-btrfsroot on / type btrfs
(rw,noatime,nodatasum,nodatacow,ssd,discard,space_cache,subvolid=1339,subvol=/rootfs)

/dev/mapper/vg0-btrfsroot on /mnt/btrfs_root type btrfs
(rw,noatime,nodatasum,nodatacow,ssd,discard,space_cache,subvolid=5,subvol=/)

The rumour was that it had something to do with using space_cache=v2,
which this example does not confirm.

It looks like you're right!

On a different machine:

# btrfs sub list / | grep -v lxc
ID 327 gen 1959587 top level 5 path mnt/reaver
ID 498 gen 593655 top level 5 path var/lib/machines

# btrfs sub list / -d | wc -l
0

# btrfs version
btrfs-progs v4.8.2

# uname -a
Linux nxserver 4.8.6-1-ARCH #1 SMP PREEMPT Mon Oct 31 18:51:30 CET 2016 
x86_64 GNU/Linux


# mount | grep btrfs
/dev/vda1 on / type btrfs 
(rw,relatime,nodatasum,nodatacow,space_cache,subvolid=5,subvol=/)


Then I've upgraded this machine and:

# btrfs sub list / | grep -v lxc
ID 327 gen 1959587 top level 5 path mnt/reaver
ID 498 gen 593655 top level 5 path var/lib/machines

# btrfs sub list / -d | wc -l
1

# btrfs sub list / -d
ID 5 gen 2186037 top level 0 path DELETED<==

1

# btrfs version
btrfs-progs v4.9

# uname -a
Linux nxserver 4.9.8-1-ARCH #1 SMP PREEMPT Mon Feb 6 12:59:40 CET 2017 
x86_64 GNU/Linux


# mount | grep btrfs
/dev/vda1 on / type btrfs 
(rw,relatime,nodatasum,nodatacow,space_cache,subvolid=5,subvol=/)






# uname -a
Linux interceptor 4.9.6-1-ARCH #1 SMP PREEMPT Thu Jan 26 09:22:26 CET
2017 x86_64 GNU/Linux

# btrfs fi show  /
Label: none  uuid: 859dec5c-850c-4660-ad99-bc87456aa309
 Total devices 1 FS bytes used 132.89GiB
 devid1 size 200.00GiB used 200.00GiB path /dev/mapper/vg0-btrfsroot

As a side note, all of your disk space is allocated (200GiB of 200GiB).

Even while there's still 70GiB of free space scattered around inside,
this might lead to out-of-space issues, depending on how badly
fragmented that free space is.

I have not noticed this at all!

# btrfs fi show /
Label: none  uuid: 859dec5c-850c-4660-ad99-bc87456aa309
Total devices 1 FS bytes used 134.23GiB
devid1 size 200.00GiB used 200.00GiB path /dev/mapper/vg0-btrfsroot

# btrfs fi df /
Data, single: total=195.96GiB, used=131.58GiB
System, single: total=3.00MiB, used=48.00KiB
Metadata, single: total=4.03GiB, used=2.64GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

After btrfs defrag there is no difference. btrfs fi show says still 
200/200. I'll try to play with it.



--
Martin Mlynář


Root volume (ID 5) in deleting state

2017-02-13 Thread Martin Mlynář

Hello,


I've currently run into a strange problem with BTRFS. I'm using it as my 
daily driver as the root FS. Nothing complicated, just a few subvolumes and 
incremental backups using btrbk.


Now I've noticed that my btrfs root volume (absolute top, ID 5) is in the 
"deleting" state. From the testing and googling I've done, it seems that 
this should not be possible.


I've tried scrubbing and checking, but nothing changed. The volume is not 
actually being deleted; it just sits there in this state.


Is there anything I can do to fix this?

# btrfs sub list -a /mnt/btrfs_root/
ID 1339 gen 262150 top level 5 path rootfs
ID 1340 gen 262101 top level 5 path .btrbk
ID 1987 gen 262149 top level 5 path no_backup
ID 4206 gen 255869 top level 1340 path /.btrbk/rootfs.20170121T1829
ID 4272 gen 257460 top level 1340 path /.btrbk/rootfs.20170123T0933
ID 4468 gen 259194 top level 1340 path /.btrbk/rootfs.20170131T1132
ID 4474 gen 260911 top level 1340 path /.btrbk/rootfs.20170207T0927
ID 4476 gen 261712 top level 1340 path /.btrbk/rootfs.20170211T
ID 4477 gen 261970 top level 1340 path /.btrbk/rootfs.20170212T1331
ID 4478 gen 262102 top level 1340 path /.btrbk/rootfs.20170213T

# btrfs sub list -ad /mnt/btrfs_root/
ID 5 gen 257505 top level 0 path /DELETED

# mount | grep btr
/dev/mapper/vg0-btrfsroot on / type btrfs 
(rw,noatime,nodatasum,nodatacow,ssd,discard,space_cache,subvolid=1339,subvol=/rootfs)
/dev/mapper/vg0-btrfsroot on /mnt/btrfs_root type btrfs 
(rw,noatime,nodatasum,nodatacow,ssd,discard,space_cache,subvolid=5,subvol=/)


# uname -a
Linux interceptor 4.9.6-1-ARCH #1 SMP PREEMPT Thu Jan 26 09:22:26 CET 
2017 x86_64 GNU/Linux


# btrfs fi show  /
Label: none  uuid: 859dec5c-850c-4660-ad99-bc87456aa309
Total devices 1 FS bytes used 132.89GiB
devid1 size 200.00GiB used 200.00GiB path /dev/mapper/vg0-btrfsroot


Thank you for your time,


Best regards

--

Martin Mlynář



Re: BTRFS for OLTP Databases

2017-02-08 Thread Martin Raiber
On 08.02.2017 14:08 Austin S. Hemmelgarn wrote:
> On 2017-02-08 07:14, Martin Raiber wrote:
>> Hi,
>>
>> On 08.02.2017 03:11 Peter Zaitsev wrote:
>>> Out of curiosity, I see one problem here:
>>> If you're doing snapshots of the live database, each snapshot leaves
>>> the database files like killing the database in-flight. Like shutting
>>> the system down in the middle of writing data.
>>>
>>> This is because I think there's no API for user space to subscribe to
>>> events like a snapshot - unlike e.g. the VSS API (volume snapshot
>>> service) in Windows. You should put the database into frozen state to
>>> prepare it for a hotcopy before creating the snapshot, then ensure all
>>> data is flushed before continuing.
>>>
>>> I think I've read that btrfs snapshots do not guarantee single point in
>>> time snapshots - the snapshot may be smeared across a longer period of
>>> time while the kernel is still writing data. So parts of your writes
>>> may still end up in the snapshot after issuing the snapshot command,
>>> instead of in the working copy as expected.
>>>
>>> How is this going to be addressed? Is there some snapshot aware API to
>>> let user space subscribe to such events and do proper preparation? Is
>>> this planned? LVM could be a user of such an API, too. I think this
>>> could have nice enterprise-grade value for Linux.
>>>
>>> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But
>>> still, also this needs to be integrated with MySQL to properly work. I
>>> once (years ago) researched on this but gave up on my plans when I
>>> planned database backups for our web server infrastructure. We moved to
>>> creating SQL dumps instead, although there're binlogs which can be used
>>> to recover to a clean and stable transactional state after taking
>>> snapshots. But I simply didn't want to fiddle around with properly
>>> cleaning up binlogs which accumulate horribly much space usage over
>>> time. The cleanup process requires to create a cold copy or dump of the
>>> complete database from time to time, only then it's safe to remove all
>>> binlogs up to that point in time.
>>
>> little bit off topic, but I for one would be on board with such an
>> effort. It "just" needs coordination between the backup
>> software/snapshot tools, the backed up software and the various snapshot
>> providers. If you look at the Windows VSS API, this would be a
>> relatively large undertaking if all the corner cases are taken into
>> account, like e.g. a database having the database log on a separate
>> volume from the data, dependencies between different components etc.
>>
>> You'll know more about this, but databases usually fsync quite often in
>> their default configuration, so btrfs snapshots shouldn't be much behind
>> the properly snapshotted state, so I see the advantages more with
>> usability and taking care of corner cases automatically.
> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide
> reflinking to userspace, and therefore it's fully possible to
> implement this in userspace.  Having a version of the fsfreeze (the
> generic form of xfs_freeze) stuff that worked on individual sub-trees
> would be nice from a practical perspective, but implementing it would
> not be easy by any means, and would be essentially necessary for a
> VSS-like API.  In the meantime though, it is fully possible for the
> application software to implement this itself without needing anything
> more from the kernel.

VSS snapshots whole volumes, not individual files (so it is comparable to an
LVM snapshot). The sub-folder freeze would be useful in some situations, but
duplicating the files+extents might also take too long in a lot of
situations. You are correct that the kernel features are there and what is
missing is a user-space daemon, plus a protocol that facilitates/coordinates
the backups/snapshots.

Sending a FIFREEZE ioctl, taking a snapshot and then thawing it does not
really help in some situations, as e.g. MySQL InnoDB uses O_DIRECT and
manages its own buffer pool, which won't get the FIFREEZE and flush, but
as said, the default configuration is to flush/fsync on every commit.
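
(For reference, the manual coordination being discussed looks roughly like this
today, assuming /var/lib/mysql is itself a btrfs subvolume -- paths are only
examples:)

# 1) in one mysql session, quiesce and hold the lock:
#      FLUSH TABLES WITH READ LOCK;
# 2) from another shell, while that session stays open:
btrfs subvolume snapshot -r /var/lib/mysql /var/lib/mysql-snap
# 3) back in the first session:
#      UNLOCK TABLES;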







Re: BTRFS for OLTP Databases

2017-02-08 Thread Martin Raiber
Hi,

On 08.02.2017 03:11 Peter Zaitsev wrote:
> Out of curiosity, I see one problem here:
> If you're doing snapshots of the live database, each snapshot leaves
> the database files like killing the database in-flight. Like shutting
> the system down in the middle of writing data.
>
> This is because I think there's no API for user space to subscribe to
> events like a snapshot - unlike e.g. the VSS API (volume snapshot
> service) in Windows. You should put the database into frozen state to
> prepare it for a hotcopy before creating the snapshot, then ensure all
> data is flushed before continuing.
>
> I think I've read that btrfs snapshots do not guarantee single point in
> time snapshots - the snapshot may be smeared across a longer period of
> time while the kernel is still writing data. So parts of your writes
> may still end up in the snapshot after issuing the snapshot command,
> instead of in the working copy as expected.
>
> How is this going to be addressed? Is there some snapshot aware API to
> let user space subscribe to such events and do proper preparation? Is
> this planned? LVM could be a user of such an API, too. I think this
> could have nice enterprise-grade value for Linux.
>
> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But
> still, also this needs to be integrated with MySQL to properly work. I
> once (years ago) researched on this but gave up on my plans when I
> planned database backups for our web server infrastructure. We moved to
> creating SQL dumps instead, although there're binlogs which can be used
> to recover to a clean and stable transactional state after taking
> snapshots. But I simply didn't want to fiddle around with properly
> cleaning up binlogs which accumulate horribly much space usage over
> time. The cleanup process requires to create a cold copy or dump of the
> complete database from time to time, only then it's safe to remove all
> binlogs up to that point in time.

A little bit off topic, but I for one would be on board with such an
effort. It "just" needs coordination between the backup
software/snapshot tools, the backed up software and the various snapshot
providers. If you look at the Windows VSS API, this would be a
relatively large undertaking if all the corner cases are taken into
account, like e.g. a database having the database log on a separate
volume from the data, dependencies between different components etc.

You'll know more about this, but databases usually fsync quite often in
their default configuration, so btrfs snapshots shouldn't be much behind
the properly snapshotted state, so I see the advantages more with
usability and taking care of corner cases automatically.

Regards,
Martin Raiber





Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?

2017-01-03 Thread Martin Raiber
On 04.01.2017 00:43 Hans van Kranenburg wrote:
> On 01/04/2017 12:12 AM, Peter Becker wrote:
>> Good hint, this would be an option and i will try this.
>>
>> Regardless of this the curiosity has packed me and I will try to
>> figure out where the problem with the low transfer rate is.
>>
>> 2017-01-04 0:07 GMT+01:00 Hans van Kranenburg 
>> :
>>> On 01/03/2017 08:24 PM, Peter Becker wrote:
 All invocations are justified, but not relevant in (offline) backup
 and archive scenarios.

 For example you have multiple version of append-only log-files or
 append-only db-files (each more then 100GB in size), like this:

> Snapshot_01_01_2017
 -> file1.log .. 201 GB

> Snapshot_02_01_2017
 -> file1.log .. 205 GB

> Snapshot_03_01_2017
 -> file1.log .. 221 GB

 The first 201 GB would be every time the same.
 Files a copied at night from windows, linux or bsd systems and
 snapshoted after copy.
>>> XY problem?
>>>
>>> Why not use rsync --inplace in combination with btrfs snapshots? Even if
>>> the remote does not support rsync and you need to pull the full file
>>> first, you could again use rsync locally.
> please don't toppost
>
> Also, there is a rather huge difference in the two approaches, given the
> way how btrfs works internally.
>
> Say, I have a subvolume with thousands of directories and millions of
> files with random data in it, and I want to have a second deduped copy
> of it.
>
> Approach 1:
>
> Create a full copy of everything (compare: retrieving remote file again)
> (now 200% of data storage is used), and after that do deduplication, so
> that again only 100% of data storage is used.
>
> Approach 2:
>
> cp -av --reflink original/ copy/
>
> By doing this, you end up with the same as doing approach 1 if your
> deduper is the most ideal in the world (and the files are so random they
> don't contain duplicate blocks inside them).
>
> Approach 3:
>
> btrfs sub snap original copy
>
> W00t, that was fast, and the only thing that happened was writing a few
> 16kB metadata pages again. (1 for the toplevel tree page that got cloned
> into a new filesystem tree, and a few for the blocks one level lower to
> add backreferences to the new root).
>
> So:
>
> The big difference in the end result between approach 1,2 and otoh 3 is
> that while deduplicating your data, you're actually duplicating all your
> metadata at the same time.
>
> In your situation, if possible doing an rsync --inplace from the remote,
> so that only changed appended data gets stored, and then useing native
> btrfs snapshotting it would seem the most effective.
>
Or use UrBackup as backup software. It uses the snapshot-then-modify
approach with btrfs, plus you get file-level deduplication between
clients using reflinks.
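
(For the rsync --inplace plus snapshot approach quoted above, a rough sketch --
host and paths are only examples, and /backup/current is assumed to be a
subvolume:)

rsync -a --inplace --partial backupsource:/data/file1.log /backup/current/
btrfs subvolume snapshot -r /backup/current /backup/snapshots/$(date +%F)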






Re: Convert from RAID 5 to 10

2016-11-30 Thread Martin Steigerwald
Am Mittwoch, 30. November 2016, 12:09:23 CET schrieb Chris Murphy:
> On Wed, Nov 30, 2016 at 7:37 AM, Austin S. Hemmelgarn
> 
> <ahferro...@gmail.com> wrote:
> > The stability info could be improved, but _absolutely none_ of the things
> > mentioned as issues with raid1 are specific to raid1.  And in general, in
> > the context of a feature stability matrix, 'OK' generally means that there
> > are no significant issues with that specific feature, and since none of
> > the
> > issues outlined are specific to raid1, it does meet that description of
> > 'OK'.
> 
> Maybe the gotchas page needs a one or two liner for each profile's
> gotchas compared to what the profile leads the user into believing.
> The overriding gotcha with all Btrfs multiple device support is the
> lack of monitoring and notification other than kernel messages; and
> the raid10 actually being more like raid0+1 I think is certainly a
> gotcha, however 'man mkfs.btrfs' contains a grid that very clearly
> states raid10 can only safely lose 1 device.

Wow, that manpage is quite a resource.

The developers and documentation people have definitely improved the official
BTRFS documentation.

Thanks,
-- 
Martin


Re: Convert from RAID 5 to 10

2016-11-30 Thread Martin Steigerwald
On Wednesday, 30 November 2016, 16:49:59 CET, Wilson Meier wrote:
> On 30/11/16 at 15:37, Austin S. Hemmelgarn wrote:
> > On 2016-11-30 08:12, Wilson Meier wrote:
> >> On 30/11/16 at 11:41, Duncan wrote:
> >>> Wilson Meier posted on Wed, 30 Nov 2016 09:35:36 +0100 as excerpted:
> >>>> On 30/11/16 at 09:06, Martin Steigerwald wrote:
> >>>>> On Wednesday, 30 November 2016, 10:38:08 CET, Roman Mamedov wrote:
[…]
> >> It is really disappointing to not have this information in the wiki
> >> itself. This would have saved me, and I'm quite sure others too, a lot
> >> of time.
> >> Sorry for being a bit frustrated.
> 
> I'm not angry or something like that :) .
> I just would like to have the possibility to read such information about
> the storage I put my personal data (> 3 TB) on in its official wiki.

Anyone can get an account on the wiki and add notes there, so feel free.

You can even use footnotes or something like that. Maybe it would be good to
add a paragraph there noting that features are related to one another, so while
BTRFS RAID 1, for example, might be quite okay, it depends on features that are
still flaky.

I myself rely quite a lot on BTRFS RAID 1 with lzo compression and it seems
to work okay for me.
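
For reference, the relevant fstab line for such a setup can be as simple as
this (the UUID is a placeholder):

# two-device btrfs RAID 1, mounted with lzo compression
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /home  btrfs  defaults,compress=lzo  0  0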

-- 
Martin


Re: Convert from RAID 5 to 10

2016-11-30 Thread Martin Steigerwald
On Wednesday, 30 November 2016, 10:38:08 CET, Roman Mamedov wrote:
> On Wed, 30 Nov 2016 00:16:48 +0100
> 
> Wilson Meier <wilson.me...@gmail.com> wrote:
> > That said, btrfs shouldn't be used for anything other than raid1 as every other
> > raid level has serious problems or at least doesn't work as the expected
> > raid level (in terms of failure recovery).
> 
> RAID1 shouldn't be used either:
> 
> *) Read performance is not optimized: all metadata is always read from the
> first device unless it has failed, data reads are supposedly balanced
> between devices per PID of the process reading. Better implementations
> dispatch reads per request to devices that are currently idle.
> 
> *) Write performance is not optimized: during long full-bandwidth sequential
> writes it is common to see devices writing not in parallel, but with long
> periods of just one device writing, then another. (Admittedly, it has been
> some time since I tested that.)
> 
> *) A degraded RAID1 won't mount by default.
> 
> If this was the root filesystem, the machine won't boot.
> 
> To mount it, you need to add the "degraded" mount option.
> However you have exactly a single chance at that, you MUST restore the RAID
> to non-degraded state while it's mounted during that session, since it
> won't ever mount again in the r/w+degraded mode, and in r/o mode you can't
> perform any operations on the filesystem, including adding/removing
> devices.
> 
> *) It does not properly handle a device disappearing during operation.
> (There is a patchset to add that).
> 
> *) It does not properly handle said device returning (under a
> different /dev/sdX name, for bonus points).
> 
> Most of these also apply to all other RAID levels.

So the stability matrix would need to be updated not to recommend any kind of 
BTRFS RAID 1 at the moment?

Actually I faced a BTRFS RAID 1 going read-only after a first attempt at
mounting it "degraded" just a short time ago.

BTRFS still needs way more stability work it seems to me.
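
If I understand Roman correctly, the whole recovery has to happen within that
single writable degraded mount, roughly like this (device names and the missing
devid are just examples):

mount -o degraded /dev/sda2 /mnt/raid1        # the one rw degraded mount you get
btrfs replace start 2 /dev/sdc /mnt/raid1     # replace the missing devid 2 with a new disk
# alternatively: btrfs device add /dev/sdc /mnt/raid1 && btrfs device delete missing /mnt/raid1
btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt/raid1   # convert any single chunks back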

-- 
Martin


Re: Inconsistent free space with false ENOSPC

2016-11-23 Thread Martin Raiber
On 23.11.2016 07:09 Duncan wrote:
> Yes, you're in a *serious* metadata bind.
> Any time global reserve has anything above zero usage, it means the 
> filesystem is in dire straits, and well over half of your global reserve 
> is used, a state that is quite rare as btrfs really tries hard not to use 
> that space at all under normal conditions and under most conditions will 
> ENOSPC before using the reserve at all.
>
> And the global reserve comes from metadata but isn't accounted in 
> metadata usage, so your available metadata is actually negative by the 
> amount of global reserve used.
>
> Meanwhile, all available space is allocated to either data or metadata 
> chunks already -- no unallocated space left to allocate new metadata 
> chunks to take care of the problem (well, ~1 MiB unallocated, but that's 
> not enough to allocate a chunk, metadata chunks being nominally 256 MiB 
> in size and with metadata dup, a pair of metadata chunks must be 
> allocated together, so 512 MiB would be needed, and of course even if the 
> 1 MiB could be allocated, it'd be ~1/2 MiB worth of metadata due to 
> metadata-dup and you're 300+ MiB into global reserve, so it wouldn't even 
> come close to fixing the problem).
>
>
> Now normally, as mentioned in the ENOSPC discussion in the FAQ on the 
> wiki, temporarily adding (btrfs device add) another device of some GiB 
> (32 GiB should do reasonably well, 8 GiB may, a USB thumb drive of 
> suitable size can be used if necessary) and using the space it makes 
> available to do a balance (-dusage= incrementing from 0 to perhaps 30 to 
> 70 percent, higher numbers will take longer and may not work at first) in 
> order to combine partially used chunks and free enough space to then 
> remove (btrfs device remove) the temporarily added device.
>
> However, in your case the data usage is 488 of 508 GiB on a 512 GiB 
> device with space needed for several GiB of metadata as well, so while in 
> theory you could free up ~20 GiB of space that way and that should get 
> you out of the immediate bind, the filesystem will still be very close to 
> full, particularly after clearing out the global reserve usage, with 
> perhaps 16 GiB unallocated at ideal, ~97% used.  And as any veteran 
> sysadmin or filesystem expert will tell you, filesystems in general like 
> 10-20% free in order to be able to "breathe" or work most efficiently, 
> with btrfs being no exception, so while the above might get you out of 
> the immediate bind, it's unlikely to work for long.
>
> Which means once you're out of the immediate bind, you're still going to 
> need to free some space, one way or another, and that might not be as 
> simple as the words make it appear.

Yes, adding a temporary disk allowed me to fix it. Though it wanted to
write RAID1 metadata instead of DUP first, which further confused me.
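
For anyone hitting the same wall, the procedure Duncan describes boils down to
something like this (device name and usage limits are examples):

btrfs device add /dev/sdx /mnt/fs          # temporarily add a spare device (a USB stick works)
btrfs balance start -dusage=10 /mnt/fs     # compact nearly empty data chunks first
btrfs balance start -dusage=50 /mnt/fs     # then retry with a higher usage limit
btrfs device delete /dev/sdx /mnt/fs       # remove the temporary device again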

The file system is being written to by a program that watches disk usage and
deletes stuff/stops writing if too much is used. But it could not
anticipate the jump from 20GiB free to zero. I have now set
"metadata_ratio=8" to prevent that, and will lower it if it still
becomes a problem.
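
In case someone wants to do the same: metadata_ratio is an ordinary mount
option, and 8 is simply the value I picked.

# allocate one metadata chunk for every 8 data chunk allocations from now on
mount -o remount,metadata_ratio=8 /mnt/fs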

Perhaps it would be good to somehow show that "global reserve" belongs
to metadata and show in btrfs fi usage/df that metadata is full if
global reserve>=free metadata, so that future users are not as confused
by this situation as I was.

Regards,
Martin Raiber






Re: Inconsistent free space with false ENOSPC

2016-11-22 Thread Martin Raiber
On 22.11.2016 15:16 Martin Raiber wrote:
> ...
> Interestingly,
> after running "btrfs check --repair" "df" shows 0 free space (Used
> 516456408 Available 0), being inconsistent with the below other btrfs
> free space information.
>
> btrfs fi usage output:
>
> Overall:
> Device size: 512.00GiB
> Device allocated:512.00GiB
> Device unallocated:1.04MiB
> Device missing:  0.00B
> Used:492.03GiB
> Free (estimated): 19.59GiB  (min: 19.59GiB)
> Data ratio:   1.00
> Metadata ratio:   2.00
> Global reserve:  512.00MiB  (used: 326.20MiB)
>
> Data,single: Size:507.98GiB, Used:488.39GiB
>/dev/mapper/LUKS-CC-9a6043feb9d946269555a71ec0742c8b  507.98GiB
>
> Metadata,DUP: Size:2.00GiB, Used:1.82GiB
>/dev/mapper/LUKS-CC-9a6043feb9d946269555a71ec0742c8b4.00GiB
>
> System,DUP: Size:8.00MiB, Used:80.00KiB
>/dev/mapper/LUKS-CC-9a6043feb9d946269555a71ec0742c8b   16.00MiB
>
> Unallocated:
>/dev/mapper/LUKS-CC-9a6043feb9d946269555a71ec0742c8b1.04MiB
Looking at the code, it seems df shows zero if the available metadata
space is smaller than the used global reserve. So this file system might
be out of metadata space.





Inconsistent free space with false ENOSPC

2016-11-22 Thread Martin Raiber
Hi,

I'm having a file system which is currently broken because of ENOSPC issues.

It is a single-device file system with no compression and no quotas
enabled, but with some snapshots. It was created, and the initial ENOSPC/free
space inconsistency appeared, with 4.4.20 and 4.4.30 (both vanilla).
Currently I am on 4.9.0-rc6 and still getting ENOSPC. Interestingly,
after running "btrfs check --repair" "df" shows 0 free space (Used
516456408 Available 0), being inconsistent with the below other btrfs
free space information.

I have tried clearing the space cache and using space_cache=v2, and that
doesn't help. It also cannot balance anything anymore.
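
(By that I mean one-off mounts with the corresponding options, roughly as
follows, with the device path shortened here:

mount -o clear_cache /dev/mapper/LUKS-CC-... /mnt/fs      # rebuild the v1 free space cache
mount -o space_cache=v2 /dev/mapper/LUKS-CC-... /mnt/fs   # one-off mount to switch to the free space tree
)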

Please tell me if anyone is interested in a btrfs-image of the file system.

btrfs fi usage output:

Overall:
Device size: 512.00GiB
Device allocated:512.00GiB
Device unallocated:1.04MiB
Device missing:  0.00B
Used:492.03GiB
Free (estimated): 19.59GiB  (min: 19.59GiB)
Data ratio:   1.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 326.20MiB)

Data,single: Size:507.98GiB, Used:488.39GiB
   /dev/mapper/LUKS-CC-9a6043feb9d946269555a71ec0742c8b  507.98GiB

Metadata,DUP: Size:2.00GiB, Used:1.82GiB
   /dev/mapper/LUKS-CC-9a6043feb9d946269555a71ec0742c8b4.00GiB

System,DUP: Size:8.00MiB, Used:80.00KiB
   /dev/mapper/LUKS-CC-9a6043feb9d946269555a71ec0742c8b   16.00MiB

Unallocated:
   /dev/mapper/LUKS-CC-9a6043feb9d946269555a71ec0742c8b1.04MiB

btrfs-debugfs output:

block group offset 1103101952 len 1073741824 used 1029148672
chunk_objectid 256 flags 1 usage 0.96
block group offset 2176843776 len 1073741824 used 1066725376
chunk_objectid 256 flags 1 usage 0.99
block group offset 3250585600 len 1073741824 used 1070731264
chunk_objectid 256 flags 1 usage 1.00
block group offset 4324327424 len 1073741824 used 1066201088
chunk_objectid 256 flags 1 usage 0.99
block group offset 5398069248 len 1073741824 used 1072377856
chunk_objectid 256 flags 1 usage 1.00
block group offset 6471811072 len 1073741824 used 1070784512
chunk_objectid 256 flags 1 usage 1.00
block group offset 7545552896 len 1073741824 used 1069023232
chunk_objectid 256 flags 1 usage 1.00
block group offset 8619294720 len 1073741824 used 1058299904
chunk_objectid 256 flags 1 usage 0.99
block group offset 9693036544 len 1073741824 used 1069408256
chunk_objectid 256 flags 1 usage 1.00
block group offset 10766778368 len 1073741824 used 1070317568
chunk_objectid 256 flags 1 usage 1.00
block group offset 11840520192 len 1073741824 used 1068920832
chunk_objectid 256 flags 1 usage 1.00
block group offset 12914262016 len 1073741824 used 1066930176
chunk_objectid 256 flags 1 usage 0.99
block group offset 13988003840 len 1073741824 used 1072746496
chunk_objectid 256 flags 1 usage 1.00
block group offset 15061745664 len 1073741824 used 1073213440
chunk_objectid 256 flags 1 usage 1.00
block group offset 16135487488 len 1073741824 used 1068171264
chunk_objectid 256 flags 1 usage 0.99
block group offset 17209229312 len 1073741824 used 1071550464
chunk_objectid 256 flags 1 usage 1.00
block group offset 18282971136 len 1073741824 used 1073672192
chunk_objectid 256 flags 1 usage 1.00
block group offset 19356712960 len 1073741824 used 1073508352
chunk_objectid 256 flags 1 usage 1.00
block group offset 20430454784 len 1073741824 used 1073668096
chunk_objectid 256 flags 1 usage 1.00
block group offset 21504196608 len 1073741824 used 1073577984
chunk_objectid 256 flags 1 usage 1.00
block group offset 22577938432 len 1073741824 used 1073483776
chunk_objectid 256 flags 1 usage 1.00
block group offset 23651680256 len 1073741824 used 1072021504
chunk_objectid 256 flags 1 usage 1.00
block group offset 24725422080 len 1073741824 used 1073672192
chunk_objectid 256 flags 1 usage 1.00
block group offset 25799163904 len 1073741824 used 1073176576
chunk_objectid 256 flags 1 usage 1.00
block group offset 26872905728 len 1073741824 used 1073360896
chunk_objectid 256 flags 1 usage 1.00
block group offset 27946647552 len 1073741824 used 1072599040
chunk_objectid 256 flags 1 usage 1.00
block group offset 29020389376 len 1073741824 used 1073524736
chunk_objectid 256 flags 1 usage 1.00
block group offset 31167873024 len 1073741824 used 1073561600
chunk_objectid 256 flags 1 usage 1.00
block group offset 32241614848 len 1073741824 used 1072566272
chunk_objectid 256 flags 1 usage 1.00
block group offset 33315356672 len 1073741824 used 1073635328
chunk_objectid 256 flags 1 usage 1.00
block group offset 34389098496 len 1073741824 used 1073364992
chunk_objectid 256 flags 1 usage 1.00
block group offset 36536582144 len 1073741824 used 1073344512
chunk_objectid 256 flags 1 usage 1.00
block group offset 37610323968 len 1073741824 used 1073381376
chunk_objectid 256 flags 1 usage 1.00
block group offset 38684065792 len 1073741824 used 1067831296
chunk_objectid 256 flags 1 usage 0.99

Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-17 Thread Martin Steigerwald
On Thursday, 17 November 2016, 12:05:31 CET, Chris Murphy wrote:
> I think the wiki should be updated to reflect that raid1 and raid10
> are mostly OK. I think it's grossly misleading to consider either as
> green/OK when a single degraded read write mount creates single chunks
> that will then prevent a subsequent degraded read write mount. And
> also the lack of various notifications of device faultiness I think
> make it less than OK also. It's not in the "do not use" category but
> it should be in the middle ground status so users can make informed
> decisions.

I agree – the error reporting, I think, is indeed misleading. Feel free to
edit it.

Ciao,
-- 
Martin


Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-16 Thread Martin Steigerwald
On Wednesday, 16 November 2016, 07:57:08 CET, Austin S. Hemmelgarn wrote:
> On 2016-11-16 06:04, Martin Steigerwald wrote:
> > On Wednesday, 16 November 2016, 16:00:31 CET, Roman Mamedov wrote:
> >> On Wed, 16 Nov 2016 11:55:32 +0100
> >> 
> >> Martin Steigerwald <martin.steigerw...@teamix.de> wrote:
[…]
> > As there seems to be no force option to override the limitation and I
> > do not feel like compiling my own btrfs-tools right now, I will use rsync
> > instead.
> 
> In a case like this, I'd trust rsync more than send/receive.  The
> following rsync switches might also be of interest:
> -a: This turns on a bunch of things almost everyone wants when using
> rsync, similar to the same switch for cp, just with even more added in.
> -H: This recreates hardlinks on the receiving end.
> -S: This recreates sparse files.
> -A: This copies POSIX ACL's
> -X: This copies extended attributes (most of them at least, there are a
> few that can't be arbitrarily written to).
> Pre-creating the subvolumes by hand combined with using all of those
> will get you almost everything covered by send/receive except for
> sharing of extents and ctime.

I usually use rsync -aAHXSP already :).
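
Spelled out, a full copy with pre-created subvolumes looks roughly like this
(paths are examples):

# recreate the subvolume layout on the target by hand first
btrfs subvolume create /mnt/new/home
# then copy everything including ACLs, xattrs, hard links and sparse files
rsync -aAHXSP /mnt/old/home/ /mnt/new/home/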

I was able to rsync all relevant data off the disk, which is now being wiped
by the shred command.

Thank you,
-- 
Martin


Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-16 Thread Martin Steigerwald
On Wednesday, 16 November 2016, 11:55:32 CET, you wrote:
> So mounting work although for some reason scrubbing is aborted (I had this
> issue a long time ago on my laptop as well). After removing /var/lib/btrfs 
> scrub status file for the filesystem:
> 
> merkaba:~> btrfs scrub start /mnt/zeit
> scrub started on /mnt/zeit, fsid […] (pid=9054)
> merkaba:~> btrfs scrub status /mnt/zeit
> scrub status for […]
> scrub started at Wed Nov 16 11:52:56 2016 and was aborted after 
> 00:00:00
> total bytes scrubbed: 0.00B with 0 errors
> 
> Anyway, I will now just rsync off the files.
> 
> Interestingly enough btrfs restore complained about looping over certain
> files… lets see whether the rsync or btrfs send/receive proceeds through.

I have an idea on why scrubbing may not work:

The filesystem is mounted read-only, and on checksum errors on one disk, scrub
would try to repair them with the good copy from the other disk.

Yes, this is it:

merkaba:~>  btrfs scrub start -r /dev/satafp1/daten
scrub started on /dev/satafp1/daten, fsid […] (pid=9375)
merkaba:~>  btrfs scrub status /dev/satafp1/daten 
scrub status for […]
scrub started at Wed Nov 16 12:13:27 2016, running for 00:00:10
total bytes scrubbed: 45.53MiB with 0 errors

It would be helpful to receive a proper error message on this one.

Okay, it seems I learned quite a bit about BTRFS today.

Thanks,

-- 
Martin Steigerwald  | Trainer

teamix GmbH
Südwestpark 43
90449 Nürnberg

Tel.:  +49 911 30999 55 | Fax: +49 911 30999 99
mail: martin.steigerw...@teamix.de | web:  http://www.teamix.de | blog: 
http://blog.teamix.de

Amtsgericht Nürnberg, HRB 18320 | Geschäftsführer: Oliver Kügow, Richard Müller

teamix Support Hotline: +49 911 30999-112
 
 *** Please like us on Facebook: facebook.com/teamix ***



Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-16 Thread Martin Steigerwald
On Wednesday, 16 November 2016, 16:00:31 CET, Roman Mamedov wrote:
> On Wed, 16 Nov 2016 11:55:32 +0100
> 
> Martin Steigerwald <martin.steigerw...@teamix.de> wrote:
> > I do think that above kernel messages invite such a kind of interpretation
> > tough. I took the "BTRFS: open_ctree failed" message as indicative to some
> > structural issue with the filesystem.
> 
> For the reason as to why the writable mount didn't work, check "btrfs fi df"
> for the filesystem to see if you have any "single" profile chunks on it:
> quite likely you did already mount it "degraded,rw" in the past *once*,
> after which those "single" chunks get created, and consequently it won't
> mount r/w anymore (without lifting the restriction on the number of missing
> devices as proposed).

That exactly explains it. I very likely did a degraded mount without ro on 
this disk already.
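
For anyone else wondering whether they are in the same boat: the tell-tale sign
is "single" entries next to the RAID1 ones in btrfs fi df (the output below is
illustrative, not from my filesystem):

btrfs fi df /mnt/zeit
# Data, RAID1: total=100.00GiB, used=80.00GiB
# Data, single: total=1.00GiB, used=0.00B        <- left over from a past degraded,rw mount
# Metadata, RAID1: total=2.00GiB, used=1.20GiB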

Funnily enough this creates another complication:

merkaba:/mnt/zeit#1> btrfs send somesubvolume | btrfs receive /mnt/
someotherbtrfs
ERROR: subvolume /mnt/zeit/somesubvolume is not read-only

Yet:

merkaba:/mnt/zeit> btrfs property get somesubvolume
ro=false
merkaba:/mnt/zeit> btrfs property set somesubvolume ro true 
 
ERROR: failed to set flags for somesubvolume: Read-only file system

To me it seems the right logic would be to allow the send to proceed in case
the whole filesystem is read-only.
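
On a writable filesystem the usual way around this would be to send from a
read-only snapshot, which is of course exactly what I cannot create here:

btrfs subvolume snapshot -r somesubvolume somesubvolume-ro
btrfs send somesubvolume-ro | btrfs receive /mnt/someotherbtrfs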

As there seems to be no force option to override the limitation and I
do not feel like compiling my own btrfs-tools right now, I will use rsync
instead.

Thanks,

-- 
Martin Steigerwald  | Trainer

teamix GmbH
Südwestpark 43
90449 Nürnberg

Tel.:  +49 911 30999 55 | Fax: +49 911 30999 99
mail: martin.steigerw...@teamix.de | web:  http://www.teamix.de | blog: 
http://blog.teamix.de

Amtsgericht Nürnberg, HRB 18320 | Geschäftsführer: Oliver Kügow, Richard Müller

teamix Support Hotline: +49 911 30999-112
 
 *** Please like us on Facebook: facebook.com/teamix ***



Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-16 Thread Martin Steigerwald
On Wednesday, 16 November 2016, 15:43:36 CET, Roman Mamedov wrote:
> On Wed, 16 Nov 2016 11:25:00 +0100
> 
> Martin Steigerwald <martin.steigerw...@teamix.de> wrote:
> > merkaba:~> mount -o degraded,clear_cache /dev/satafp1/backup /mnt/zeit
> > mount: wrong filesystem type, bad option, the superblock
> > of /dev/mapper/satafp1-backup is damaged, missing
> > codepage, or another error
> > 
> >   Sometimes the system log provides valuable information –
> >   try  dmesg | tail  or similar
> > 
> > merkaba:~#32> dmesg | tail -6
> > [ 3080.120687] BTRFS info (device dm-13): allowing degraded mounts
> > [ 3080.120699] BTRFS info (device dm-13): force clearing of disk cache
> > [ 3080.120703] BTRFS info (device dm-13): disk space caching is
> > enabled
> > [ 3080.120706] BTRFS info (device dm-13): has skinny extents
> > [ 3080.150957] BTRFS warning (device dm-13): missing devices (1)
> > exceeds the limit (0), writeable mount is not allowed
> > [ 3080.195941] BTRFS: open_ctree failed
> 
> I have to wonder did you read the above message? What you need at this point
> is simply "-o degraded,ro". But I don't see that tried anywhere down the
> line.
> 
> See also (or try): https://patchwork.kernel.org/patch/9419189/

Actually I read that one, but I read more into it than what it was saying:

I read into it that BTRFS would automatically use a read only mount.


merkaba:~> mount -o degraded,ro /dev/satafp1/daten /mnt/zeit

actually really works. *Thank you*, Roman.


I do think that the above kernel messages invite that kind of interpretation,
though. I took the "BTRFS: open_ctree failed" message as indicative of some
structural issue with the filesystem.


So mounting works, although for some reason scrubbing is aborted (I had this
issue a long time ago on my laptop as well). After removing the /var/lib/btrfs
scrub status file for the filesystem:

merkaba:~> btrfs scrub start /mnt/zeit
scrub started on /mnt/zeit, fsid […] (pid=9054)
merkaba:~> btrfs scrub status /mnt/zeit
scrub status for […]
scrub started at Wed Nov 16 11:52:56 2016 and was aborted after 
00:00:00
total bytes scrubbed: 0.00B with 0 errors

Anyway, I will now just rsync off the files.

Interestingly enough, btrfs restore complained about looping over certain
files… let's see whether the rsync or btrfs send/receive goes through.

Ciao,

-- 
Martin Steigerwald  | Trainer

teamix GmbH
Südwestpark 43
90449 Nürnberg

Tel.:  +49 911 30999 55 | Fax: +49 911 30999 99
mail: martin.steigerw...@teamix.de | web:  http://www.teamix.de | blog: 
http://blog.teamix.de

Amtsgericht Nürnberg, HRB 18320 | Geschäftsführer: Oliver Kügow, Richard Müller

teamix Support Hotline: +49 911 30999-112
 
 *** Please like us on Facebook: facebook.com/teamix ***



degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2016-11-16 Thread Martin Steigerwald
ice dm-13): has skinny extents
[ 3080.150957] BTRFS warning (device dm-13): missing devices (1) exceeds 
the limit (0), writeable mount is not allowed
[ 3080.195941] BTRFS: open_ctree failed

merkaba:~> mount -o degraded,clear_cache,usebackuproot /dev/satafp1/backup 
/mnt/zeit
mount: wrong filesystem type, bad option, the superblock
of /dev/mapper/satafp1-backup is damaged, missing
codepage, or another error

  Sometimes the system log provides valuable information –
  try  dmesg | tail  or similar

merkaba:~> dmesg | tail -7
[ 3173.784713] BTRFS info (device dm-13): allowing degraded mounts
[ 3173.784728] BTRFS info (device dm-13): force clearing of disk cache
[ 3173.784737] BTRFS info (device dm-13): trying to use backup root at 
mount time
[ 3173.784742] BTRFS info (device dm-13): disk space caching is enabled
[ 3173.784746] BTRFS info (device dm-13): has skinny extents
[ 3173.816983] BTRFS warning (device dm-13): missing devices (1) exceeds 
the limit (0), writeable mount is not allowed
[ 3173.865199] BTRFS: open_ctree failed

I aborted repairing after this assert:

merkaba:~#130> btrfs check --repair /dev/satafp1/backup &| stdbuf -oL tee 
btrfs-check-repair-satafp1-backup.log
enabling repair mode
warning, device 2 is missing
Checking filesystem on /dev/satafp1/backup
UUID: 01cf0493-476f-42e8-8905-61ef205313db
checking extents
Unable to find block group for 0
extent-tree.c:289: find_search_start: Assertion `1` failed.
btrfs[0x43e418]
btrfs(btrfs_reserve_extent+0x5c9)[0x4425df]
btrfs(btrfs_alloc_free_block+0x63)[0x44297c]
btrfs(__btrfs_cow_block+0xfc)[0x436636]
btrfs(btrfs_cow_block+0x8b)[0x436bd8]
btrfs[0x43ad82]
btrfs(btrfs_commit_transaction+0xb8)[0x43c5dc]
btrfs[0x4268b4]
btrfs(cmd_check+0x)[0x427d6d]
btrfs(main+0x12f)[0x40a341]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7fb2e6bec2b1]
btrfs(_start+0x2a)[0x40a37a]

merkaba:~#130> btrfs --version
btrfs-progs v4.7.3

(Honestly I think asserts like this need to be gone from btrfs-tools for good)

About this I only found this unanswered mailing list post:

btrfs-convert: Unable to find block group for 0
Date: Fri, 24 Jun 2016 11:09:27 +0200
https://www.spinics.net/lists/linux-btrfs/msg56478.html


Out of curiosity I tried:

merkaba:~#1> btrfs rescue zero-log //dev/satafp1/daten
warning, device 2 is missing
Clearing log on //dev/satafp1/daten, previous log_root 0, level 0
Unable to find block group for 0
extent-tree.c:289: find_search_start: Assertion `1` failed.
btrfs[0x43e418]
btrfs(btrfs_reserve_extent+0x5c9)[0x4425df]
btrfs(btrfs_alloc_free_block+0x63)[0x44297c]
btrfs(__btrfs_cow_block+0xfc)[0x436636]
btrfs(btrfs_cow_block+0x8b)[0x436bd8]
btrfs[0x43ad82]
btrfs(btrfs_commit_transaction+0xb8)[0x43c5dc]
btrfs[0x42c0d4]
btrfs(main+0x12f)[0x40a341]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7fb2f16a82b1]
btrfs(_start+0x2a)[0x40a37a]

(I didn't expect much, as this is an issue that AFAIK does not happen
easily anymore, but I also thought it could not do much harm.)

Superblocks themselves seem to be sane:

merkaba:~#1> btrfs rescue super-recover //dev/satafp1/daten
All supers are valid, no need to recover

So "btrfs restore" it is:

merkaba:[…]> btrfs restore -mxs /dev/satafp1/daten daten-restore

This prints out a ton of:

Trying another mirror
Trying another mirror

But it actually works. Somewhat, I now just got

Trying another mirror
We seem to be looping a lot on 
daten-restore/[…]/virtualbox-4.1.18-dfsg/out/lib/vboxsoap.a, do you want to 
keep going on ? (y/N/a):

after about 35 GiB of data restored. I answered no to this one and now it is
at about 53 GiB already. I just got another one of these, but also not 
concerning a file I actually need.

Thanks,

-- 
Martin Steigerwald  | Trainer

teamix GmbH
Südwestpark 43
90449 Nürnberg

Tel.:  +49 911 30999 55 | Fax: +49 911 30999 99
mail: martin.steigerw...@teamix.de | web:  http://www.teamix.de | blog: 
http://blog.teamix.de

Amtsgericht Nürnberg, HRB 18320 | Geschäftsführer: Oliver Kügow, Richard Müller

teamix Support Hotline: +49 911 30999-112
 
 *** Please like us on Facebook: facebook.com/teamix ***



Re: csum failed during copy/compare

2016-10-21 Thread Martin Dev
SATA trace shows the device behaving correctly.
btrfs restore --ignore-errors /dev/sda2 /tmp/ will yield files that are
not verifiable by FIO, and that differ at the failing offset from the
original files on the internal drive they were copied from.

On Wed, Oct 19, 2016 at 3:39 PM, Martin Dev <mrturtle...@gmail.com> wrote:
> Fails on Antergos Linux 4.8.2-1-ARCH #1 SMP PREEMPT Mon Oct 17
> 08:11:46 CEST 2016 x86_64 GNU/Linux
>
> btrfs-progs v4.8.1
>
> On Mon, Oct 10, 2016 at 10:05 PM, Chris Murphy <li...@colorremedies.com> 
> wrote:
>> On Mon, Oct 10, 2016 at 12:42 PM, Roman Mamedov <r...@romanrm.net> wrote:
>>> On Mon, 10 Oct 2016 10:44:39 +0100
>>> Martin Dev <mrturtle...@gmail.com> wrote:
>>>
>>>> I work for system verification of SSDs and we've recently come up
>>>> against an issue with BTRFS on Ubuntu 16.04
>>>
>>>> This seems to be a recent change
>>>
>>> ...well, a change in what?
>>>
>>> If you really didn't change anything on your machines and the used process,
>>> there is no reason for anything to start breaking, other than obvious 
>>> hardware
>>> issues from age/etc (likely not what's happening here).
>>>
>>> So you most likely did change something yourself, and perhaps the change was
>>> upgrading OS version, kernel version(!!!), or versions of software in 
>>> general.
>>>
>>> As such, the first suggestion would be go through the recent software 
>>> updates
>>> history, maybe even restore an OS image you used three months ago (if
>>> available) and confirm that the problem doesn't occur there. After that 
>>> it's a
>>> process called bisecting, there are tools for that, but likely you don't 
>>> even
>>> need those yet, just carefully note when you got which upgrades, paying
>>> highest attention to the kernel version, and note at which point the
>>> corruptions start to occur.
>>
>>
>> There  have been various trim bugs, in Btrfs but also in the block
>> layer. And I don't remember all the different versions involved.  I'd
>> like to think 4.4.24 should behave the same as 4.8.1, so I would
>> retest with those two, using something without ubuntu specific
>> backports (i.e. something as close to the kernel.org trees of those
>> versions as possible). I have no idea what Ubuntu generic 4.4.0-21
>> translates into. Because of the 0, it makes me think it's literally
>> 4.4.0 with 21 sets of various backports, from some unknown time frame
>> without going and looking it up. If that's really 4.4.21, then it's
>> weirdly named, I don't know why any distro would do that.
>>
>> In any case I would compare 4.8.1 and 4.4.24 because those two should
>> work and if not it's a bug that needs to get fixed. Independently,
>> check the SSD firmware. There have been bugs there also.
>>
>> --
>> Chris Murphy


Re: csum failed during copy/compare

2016-10-19 Thread Martin Dev
Fails on Antergos Linux 4.8.2-1-ARCH #1 SMP PREEMPT Mon Oct 17
08:11:46 CEST 2016 x86_64 GNU/Linux

btrfs-progs v4.8.1

On Mon, Oct 10, 2016 at 10:05 PM, Chris Murphy <li...@colorremedies.com> wrote:
> On Mon, Oct 10, 2016 at 12:42 PM, Roman Mamedov <r...@romanrm.net> wrote:
>> On Mon, 10 Oct 2016 10:44:39 +0100
>> Martin Dev <mrturtle...@gmail.com> wrote:
>>
>>> I work for system verification of SSDs and we've recently come up
>>> against an issue with BTRFS on Ubuntu 16.04
>>
>>> This seems to be a recent change
>>
>> ...well, a change in what?
>>
>> If you really didn't change anything on your machines and the used process,
>> there is no reason for anything to start breaking, other than obvious 
>> hardware
>> issues from age/etc (likely not what's happening here).
>>
>> So you most likely did change something yourself, and perhaps the change was
>> upgrading OS version, kernel version(!!!), or versions of software in 
>> general.
>>
>> As such, the first suggestion would be go through the recent software updates
>> history, maybe even restore an OS image you used three months ago (if
>> available) and confirm that the problem doesn't occur there. After that it's 
>> a
>> process called bisecting, there are tools for that, but likely you don't even
>> need those yet, just carefully note when you got which upgrades, paying
>> highest attention to the kernel version, and note at which point the
>> corruptions start to occur.
>
>
> There  have been various trim bugs, in Btrfs but also in the block
> layer. And I don't remember all the different versions involved.  I'd
> like to think 4.4.24 should behave the same as 4.8.1, so I would
> retest with those two, using something without ubuntu specific
> backports (i.e. something as close to the kernel.org trees of those
> versions as possible). I have no idea what Ubuntu generic 4.4.0-21
> translates into. Because of the 0, it makes me think it's literally
> 4.4.0 with 21 sets of various backports, from some unknown time frame
> without going and looking it up. If that's really 4.4.21, then it's
> weirdly named, I don't know why any distro would do that.
>
> In any case I would compare 4.8.1 and 4.4.24 because those two should
> work and if not it's a bug that needs to get fixed. Independently,
> check the SSD firmware. There have been bugs there also.
>
> --
> Chris Murphy


Re: csum failed during copy/compare

2016-10-10 Thread Martin Dev
After some investigation this seems to follow the discard flag set in fstab.

Nine or so reproductions with discard on partition 2 fail;
move the discard flag to partition 1, and then partition 1 fails.

Re-running our tests with no discard options set in fstab
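
If the discard mount option turns out to be the trigger, an alternative we may
try is dropping inline discard and trimming periodically instead, e.g.:

/dev/sdb2   /mnt/g/   btrfs   defaults   0   0   # fstab entry without discard
fstrim -v /mnt/g                                 # run manually or from a daily/weekly timer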

On Mon, Oct 10, 2016 at 1:02 PM, Martin Dev <mrturtle...@gmail.com> wrote:
> Additional BTRFS scrub logs
>
> root@# btrfs scrub start -B /mnt/g/
> scrub done for 554b0043-052f-48d1-986f-5a6154496d89
> scrub started at Mon Oct 10 12:52:40 2016 and finished after 00:00:39
> total bytes scrubbed: 20.03GiB with 46304 errors
> error details: csum=46304
> corrected errors: 0, uncorrectable errors: 46304, unverified errors: 0
> ERROR: there are uncorrectable errors
>
> I've attached the dmesg log in case gmail wordwrap completely destroys 
> formatting
>
> dmesg :
> [10284.452766] BTRFS warning (device sdb2): checksum error at logical
> 143896739840 on dev /dev/sdb2, sector 42509568, root 5, inode 270,
> offset 10548412416, length 4096, links 1 (path: shutdown-4.bin)
> [10284.452775] BTRFS warning (device sdb2): checksum error at logical
> 143896608768 on dev /dev/sdb2, sector 42509312, root 5, inode 270,
> offset 10548281344, length 4096, links 1 (path: shutdown-4.bin)
> [10284.452784] BTRFS warning (device sdb2): checksum error at logical
> 143896346624 on dev /dev/sdb2, sector 42508800, root 5, inode 270,
> offset 10548019200, length 4096, links 1 (path: shutdown-4.bin)
> [10284.452792] BTRFS warning (device sdb2): checksum error at logical
> 143896215552 on dev /dev/sdb2, sector 42508544, root 5, inode 270,
> offset 10547888128, length 4096, links 1 (path: shutdown-4.bin)
> [10284.452800] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
> rd 0, flush 0, corrupt 1, gen 0
> [10284.452804] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
> rd 0, flush 0, corrupt 2, gen 0
> [10284.452810] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
> rd 0, flush 0, corrupt 3, gen 0
> [10284.452813] BTRFS error (device sdb2): unable to fixup (regular)
> error at logical 143896608768 on dev /dev/sdb2
> [10284.452815] BTRFS error (device sdb2): unable to fixup (regular)
> error at logical 143896346624 on dev /dev/sdb2
> [10284.452820] BTRFS error (device sdb2): unable to fixup (regular)
> error at logical 143896215552 on dev /dev/sdb2
> [10284.452840] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
> rd 0, flush 0, corrupt 4, gen 0
> [10284.452845] BTRFS error (device sdb2): unable to fixup (regular)
> error at logical 143896739840 on dev /dev/sdb2
> [10284.452906] BTRFS warning (device sdb2): checksum error at logical
> 143896084480 on dev /dev/sdb2, sector 42508288, root 5, inode 270,
> offset 10547757056, length 4096, links 1 (path: shutdown-4.bin)
> [10284.453082] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
> rd 0, flush 0, corrupt 5, gen 0
> [10284.453087] BTRFS error (device sdb2): unable to fixup (regular)
> error at logical 143896084480 on dev /dev/sdb2
> [10284.453107] BTRFS warning (device sdb2): checksum error at logical
> 143896743936 on dev /dev/sdb2, sector 42509576, root 5, inode 270,
> offset 10548416512, length 4096, links 1 (path: shutdown-4.bin)
> [10284.453113] BTRFS warning (device sdb2): checksum error at logical
> 143896612864 on dev /dev/sdb2, sector 42509320, root 5, inode 270,
> offset 10548285440, length 4096, links 1 (path: shutdown-4.bin)
> [10284.453122] BTRFS warning (device sdb2): checksum error at logical
> 143896350720 on dev /dev/sdb2, sector 42508808, root 5, inode 270,
> offset 10548023296, length 4096, links 1 (path: shutdown-4.bin)
> [10284.453126] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
> rd 0, flush 0, corrupt 6, gen 0
> [10284.453130] BTRFS error (device sdb2): unable to fixup (regular)
> error at logical 143896612864 on dev /dev/sdb2
> [10284.453134] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
> rd 0, flush 0, corrupt 7, gen 0
> [10284.453137] BTRFS error (device sdb2): unable to fixup (regular)
> error at logical 143896350720 on dev /dev/sdb2
> [10284.453153] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
> rd 0, flush 0, corrupt 8, gen 0
> [10284.453157] BTRFS error (device sdb2): unable to fixup (regular)
> error at logical 143896743936 on dev /dev/sdb2
> [10284.453190] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
> rd 0, flush 0, corrupt 9, gen 0
> [10284.453194] BTRFS warning (device sdb2): checksum error at logical
> 143896219648 on dev /dev/sdb2, sector 42508552, root 5, inode 270,
> offset 10547892224, length 4096, links 1 (path: shutdown-4.bin)
> [10284.453198] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
> rd 0, flush 0, corrupt 10, gen 0
> [10284.453201] BTRFS error (device sdb2): unable to fixu

Re: csum failed during copy/compare

2016-10-10 Thread Martin Dev
Additional BTRFS scrub logs

root@# btrfs scrub start -B /mnt/g/
scrub done for 554b0043-052f-48d1-986f-5a6154496d89
scrub started at Mon Oct 10 12:52:40 2016 and finished after 00:00:39
total bytes scrubbed: 20.03GiB with 46304 errors
error details: csum=46304
corrected errors: 0, uncorrectable errors: 46304, unverified errors: 0
ERROR: there are uncorrectable errors

I've attached the dmesg log in case gmail wordwrap completely destroys formatting.

dmesg :
[10284.452766] BTRFS warning (device sdb2): checksum error at logical
143896739840 on dev /dev/sdb2, sector 42509568, root 5, inode 270,
offset 10548412416, length 4096, links 1 (path: shutdown-4.bin)
[10284.452775] BTRFS warning (device sdb2): checksum error at logical
143896608768 on dev /dev/sdb2, sector 42509312, root 5, inode 270,
offset 10548281344, length 4096, links 1 (path: shutdown-4.bin)
[10284.452784] BTRFS warning (device sdb2): checksum error at logical
143896346624 on dev /dev/sdb2, sector 42508800, root 5, inode 270,
offset 10548019200, length 4096, links 1 (path: shutdown-4.bin)
[10284.452792] BTRFS warning (device sdb2): checksum error at logical
143896215552 on dev /dev/sdb2, sector 42508544, root 5, inode 270,
offset 10547888128, length 4096, links 1 (path: shutdown-4.bin)
[10284.452800] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
rd 0, flush 0, corrupt 1, gen 0
[10284.452804] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
rd 0, flush 0, corrupt 2, gen 0
[10284.452810] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
rd 0, flush 0, corrupt 3, gen 0
[10284.452813] BTRFS error (device sdb2): unable to fixup (regular)
error at logical 143896608768 on dev /dev/sdb2
[10284.452815] BTRFS error (device sdb2): unable to fixup (regular)
error at logical 143896346624 on dev /dev/sdb2
[10284.452820] BTRFS error (device sdb2): unable to fixup (regular)
error at logical 143896215552 on dev /dev/sdb2
[10284.452840] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
rd 0, flush 0, corrupt 4, gen 0
[10284.452845] BTRFS error (device sdb2): unable to fixup (regular)
error at logical 143896739840 on dev /dev/sdb2
[10284.452906] BTRFS warning (device sdb2): checksum error at logical
143896084480 on dev /dev/sdb2, sector 42508288, root 5, inode 270,
offset 10547757056, length 4096, links 1 (path: shutdown-4.bin)
[10284.453082] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
rd 0, flush 0, corrupt 5, gen 0
[10284.453087] BTRFS error (device sdb2): unable to fixup (regular)
error at logical 143896084480 on dev /dev/sdb2
[10284.453107] BTRFS warning (device sdb2): checksum error at logical
143896743936 on dev /dev/sdb2, sector 42509576, root 5, inode 270,
offset 10548416512, length 4096, links 1 (path: shutdown-4.bin)
[10284.453113] BTRFS warning (device sdb2): checksum error at logical
143896612864 on dev /dev/sdb2, sector 42509320, root 5, inode 270,
offset 10548285440, length 4096, links 1 (path: shutdown-4.bin)
[10284.453122] BTRFS warning (device sdb2): checksum error at logical
143896350720 on dev /dev/sdb2, sector 42508808, root 5, inode 270,
offset 10548023296, length 4096, links 1 (path: shutdown-4.bin)
[10284.453126] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
rd 0, flush 0, corrupt 6, gen 0
[10284.453130] BTRFS error (device sdb2): unable to fixup (regular)
error at logical 143896612864 on dev /dev/sdb2
[10284.453134] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
rd 0, flush 0, corrupt 7, gen 0
[10284.453137] BTRFS error (device sdb2): unable to fixup (regular)
error at logical 143896350720 on dev /dev/sdb2
[10284.453153] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
rd 0, flush 0, corrupt 8, gen 0
[10284.453157] BTRFS error (device sdb2): unable to fixup (regular)
error at logical 143896743936 on dev /dev/sdb2
[10284.453190] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
rd 0, flush 0, corrupt 9, gen 0
[10284.453194] BTRFS warning (device sdb2): checksum error at logical
143896219648 on dev /dev/sdb2, sector 42508552, root 5, inode 270,
offset 10547892224, length 4096, links 1 (path: shutdown-4.bin)
[10284.453198] BTRFS error (device sdb2): bdev /dev/sdb2 errs: wr 0,
rd 0, flush 0, corrupt 10, gen 0
[10284.453201] BTRFS error (device sdb2): unable to fixup (regular)
error at logical 143896219648 on dev /dev/sdb2
[10284.453207] BTRFS error (device sdb2): unable to fixup (regular)
error at logical 143896616960 on dev /dev/sdb2
[10284.453238] BTRFS warning (device sdb2): checksum error at logical
143896477696 on dev /dev/sdb2, sector 42509056, root 5, inode 270,
offset 10548150272, length 4096, links 1 (path: shutdown-4.bin)

It is important to note this happens across 2 machines and now 3 different
drives (all different-brand SSDs), on a Z170 chipset and a C612 chipset.

On Mon, Oct 10, 2016 at 10:44 AM, Martin Dev <mrturtle...@gmail.com> wrote:
> Hey everyone,
>
> I work for system verification of SSDs and we've recently come up
> against an issue wi

Fwd: csum failed during copy/compare

2016-10-10 Thread Martin Dev
Hey everyone,

I work for system verification of SSDs and we've recently come up
against an issue with BTRFS on Ubuntu 16.04. We have a framework which
follows the following steps:

Generate verifiable 10GB file with FIO on internal drive
Copy 10GB file to 2 target partitions on DUT (using "cp" command)
Sync
Verify copied files with FIO (using direct=1)
Perform a power event (restart, shutdown, suspend, or hibernate)
Verify both 10GB files
Repeat.

We keep the files from the first iteration on the target partitions,
but files from subsequent iterations are deleted after verification. Every
command is monitored for exit status and the framework will fail with an
error if anything exits non-zero.
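
A stripped-down sketch of the loop (simplified; the exact fio options in our
framework differ, and the file/job names here are made up):

fio --name=gen --filename=/mnt/f/source.bin --size=10G --rw=write --bs=128k \
    --ioengine=libaio --iodepth=32 --verify=crc32c        # generate verifiable file
cp /mnt/f/source.bin /mnt/g/shutdown-1.bin && sync        # copy to the btrfs target
fio --name=chk --filename=/mnt/g/shutdown-1.bin --size=10G --rw=read --bs=128k \
    --ioengine=libaio --iodepth=32 --verify=crc32c --direct=1   # verify the copy
systemctl reboot   # or shutdown / suspend / hibernate, then verify both files again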

We've found that during this process (between 2-9 iterations of
restarts or shutdowns) BTRFS will fail the pre-power verification of
the file with 100% reproduction rate out of 7 attempts. So this is a
"cp" command to copy from the internal to the btrfs partition, a sync,
then a verify with fio. This seems to be a recent change as the same
process has been used for the last 2 years including over the last
month with no issues.

Here are some more details:

Linux ht_stress_b6_20 4.4.0-21-generic #37-Ubuntu SMP Mon Apr 18
18:33:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

btrfs-progs v4.4

Label: none  uuid: 554b0043-052f-48d1-986f-5a6154496d89
Total devices 1 FS bytes used 20.03GiB
devid1 size 29.81GiB used 22.26GiB path /dev/sdb2

Data, single: total=22.00GiB, used=20.00GiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=264.00MiB, used=20.64MiB
GlobalReserve, single: total=16.00MiB, used=0.00B


df -hT
/dev/sdb1  ext4   30G   21G  7.7G  73% /mnt/f
/dev/sdb2  btrfs  30G   21G  9.6G  68% /mnt/g
/dev/sdb3  xfs30G   33M   30G   1% /mnt/h
/dev/sdb4  ext4   30G   44M   28G   1% /mnt/i

/etc/fstab
/dev/sdb2   /mnt/g/ btrfs
defaults,discard   0   0

root@: ls -l /mnt/g/
total 20971520
-rw-r--r-- 1 root root 10737418240 Oct 10 09:42 shutdown-1.bin
-rw-r--r-- 1 root root 10737418240 Oct 10 10:03 shutdown-4.bin


root@# cat fio-screenlog.log
test: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K,
ioengine=libaio, iodepth=32
fio-2.13-88-g32bc8-dirty
Starting 1 process
verify: bad magic header 101, wanted acca at file
/mnt/g/shutdown-4.bin offset 10547757056, length 131072
verify: bad magic header 101, wanted acca at file
/mnt/g/shutdown-4.bin offset 10547888128, length 131072
fio: pid=2288, err=84/file:io_u.c:1978, func=io_u_queued_complete,
error=Invalid or incomplete multibyte or wide character

In this situation, FIO is reading "101" instead of the correct data
from the shutdown-4.bin file. From the "ls -l" above, we can see the
shutdown-1.bin and shutdown-4.bin are the same size and we know the cp
command exited 0.

shutdown-1.bin (OK):
dd if=/mnt/g/shutdown-1.bin of=/tmp/header bs=512 count=1 skip=20601088
1+0 records in
1+0 records out
512 bytes copied, 0.00247518 s, 207 kB/s

Data:
000 acca 0001  0002 3d92 e664 178e 8421
010  74b2 0002  0016  2bfa 0009
020 0001 3a59 c82b 1b3a a7b8 ee74 881a 0747
030 94f7 09f2 79c9 04e3 d29e d6c2 ea3f 04b8
...

shutdown-4.bin (FAILS):
dd if=/mnt/g/shutdown-4.bin of=/tmp/header bs=512 count=1 skip=20601088
dd: error reading '/mnt/g/shutdown-4.bin': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.00230287 s, 0.0 kB/s

We can't even read the file at that offset correctly and have to go
back to 20601087 (one 512-byte sector earlier) before we can get valid data.
What's interesting is that over 2 different drives, 2 different machines,
and 7 reproductions, the data is always "101"; this might be a quirk
of FIO, as I would hope that btrfs would not return corrupt data to
the user.

Reproduction is straight forward, and takes around an hour. I have a
large amount of duplicate machines at my disposal if debug /
investigations need to be run. I can provide portions of our
automation logs, full dmesg (attached), and any other information that
needs to be gathered.
[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 4.4.0-21-generic (buildd@lgw01-21) (gcc version 5.3.1 20160413 (Ubuntu 5.3.1-14ubuntu2) ) #37-Ubuntu SMP Mon Apr 18 18:33:37 UTC 2016 (Ubuntu 4.4.0-21.37-generic 4.4.6)
[0.00] Command line: BOOT_IMAGE=/vmlinuz-4.4.0-21-generic root=UUID=eca58483-5695-4068-828b-6b0a48eeb620 ro quiet splash vt.handoff=7
[0.00] KERNEL supported cpus:
[0.00]   Intel GenuineIntel
[0.00]   AMD AuthenticAMD
[0.00]   Centaur CentaurHauls
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: xstate_offset[3]:  960, xstate_sizes[3]:   64
[0.00] x86/fpu: xstate_offset[4]: 1024, xstate_sizes[4]:   64
[0.00] x86/fpu: Supporting XSAVE feature 0x01: 'x87 floating 

Re: stability matrix

2016-09-15 Thread Martin Steigerwald
On Thursday, 15 September 2016, 07:54:26 CEST, Austin S. Hemmelgarn wrote:
> On 2016-09-15 05:49, Hans van Kranenburg wrote:
> > On 09/15/2016 04:14 AM, Christoph Anton Mitterer wrote:
[…]
> I specifically do not think we should worry about distro kernels though.
>   If someone is using a specific distro, that distro's documentation
> should cover what they support and what works and what doesn't.  Some
> (like Arch and to a lesser extent Gentoo) use almost upstream kernels,
> so there's very little point in tracking them.  Some (like Ubuntu and
> Debian) use almost upstream LTS kernels, so there's little point
> tracking them either.  Many others though (like CentOS, RHEL, and OEL)
> Use forked kernels that have so many back-ported patches that it's
> impossible to track up-date to up-date what the hell they've got.  A
> rather ridiculous expression regarding herding of cats comes to mind
> with respect to the last group.

Yep. I just read through the RHEL release notes for a RHEL 7 workshop I will
hold for a customer… and noted that newer RHEL 7 kernels for example have
device mapper from kernel 4.1 (while the kernel still says it's a 3.10 one),
XFS from kernel this.that, including the new incompat CRC disk format and the
need to also upgrade xfsprogs in lockstep, and this and that from kernel
this.that, and so on. Frankenstein comes to mind as an association, but I bet
RHEL kernel engineers know what they are doing.

-- 
Martin


Re: Is stability a joke?

2016-09-15 Thread Martin Steigerwald
On Thursday, 15 September 2016, 07:55:36 CEST, Kai Krakow wrote:
> On Mon, 12 Sep 2016 08:20:20 -0400,
> "Austin S. Hemmelgarn" <ahferro...@gmail.com> wrote:
> > On 2016-09-11 09:02, Hugo Mills wrote:
> > > On Sun, Sep 11, 2016 at 02:39:14PM +0200, Waxhead wrote:
> > >> Martin Steigerwald wrote:
> >  [...]
> >  [...]
> >  [...]
> >  [...]
> >  
> > >> That is exactly the same reason I don't edit the wiki myself. I
> > >> could of course get it started and hopefully someone will correct
> > >> what I write, but I feel that if I start this off I don't have deep
> > >> enough knowledge to do a proper start. Perhaps I will change my
> > >> mind about this.
> > >> 
> > >Given that nobody else has done it yet, what are the odds that
> > > 
> > > someone else will step up to do it now? I would say that you should
> > > at least try. Yes, you don't have as much knowledge as some others,
> > > but if you keep working at it, you'll gain that knowledge. Yes,
> > > you'll probably get it wrong to start with, but you probably won't
> > > get it *very* wrong. You'll probably get it horribly wrong at some
> > > point, but even the more knowledgable people you're deferring to
> > > didn't identify the problems with parity RAID until Zygo and Austin
> > > and Chris (and others) put in the work to pin down the exact
> > > issues.
> > 
> > FWIW, here's a list of what I personally consider stable (as in, I'm
> > willing to bet against reduced uptime to use this stuff on production
> > systems at work and personal systems at home):
> > 1. Single device mode, including DUP data profiles on single device
> > without mixed-bg.
> > 2. Multi-device raid0, raid1, and raid10 profiles with symmetrical
> > devices (all devices are the same size).
> > 3. Multi-device single profiles with asymmetrical devices.
> > 4. Small numbers (max double digit) of snapshots, taken at infrequent
> > intervals (no more than once an hour).  I use single snapshots
> > regularly to get stable images of the filesystem for backups, and I
> > keep hourly ones of my home directory for about 48 hours.
> > 5. Subvolumes used to isolate parts of a filesystem from snapshots.
> > I use this regularly to isolate areas of my filesystems from backups.
> > 6. Non-incremental send/receive (no clone source, no parent's, no
> > deduplication).  I use this regularly for cloning virtual machines.
> > 7. Checksumming and scrubs using any of the profiles I've listed
> > above. 8. Defragmentation, including autodefrag.
> > 9. All of the compat_features, including no-holes and skinny-metadata.
> > 
> > Things I consider stable enough that I'm willing to use them on my
> > personal systems but not systems at work:
> > 1. In-line data compression with compress=lzo.  I use this on my
> > laptop and home server system.  I've never had any issues with it
> > myself, but I know that other people have, and it does seem to make
> > other things more likely to have issues.
> > 2. Batch deduplication.  I only use this on the back-end filesystems
> > for my personal storage cluster, and only because I have multiple
> > copies as a result of GlusterFS on top of BTRFS.  I've not had any
> > significant issues with it, and I don't remember any reports of data
> > loss resulting from it, but it's something that people should not be
> > using if they don't understand all the implications.
> 
> I could at least add one "don't do it":
> 
> Don't use BFQ patches (it's an IO scheduler) if you're using btrfs.
> Some people like to use it especially for running VMs and desktops
> because it provides very good interactivity while maintaining very good
> throughput. But it completely destroyed my btrfs beyond repair at least
> twice, either while actually using a VM (in VirtualBox) or during high
> IO loads. I now stick to the deadline scheduler instead which provides
> very good interactivity for me, too, and the corruptions didn't occur
> again so far.
> 
> The story with BFQ has always been the same: System suddenly freezes
> during moderate to high IO until all processes stop working (no process
> shows D state, tho). Only hard reboot possible. After rebooting, access
> to some (unrelated) files may fail with "errno=-17 Object already
> exists" which cannot be repaired. If it affects files needed during
> boot, you are screwed because file system goes RO.

This could be a further row in the table. And well…

as for CFQ, Jens Axboe is currently working on bandwidth throttling patches
*exactly* for the reason of providing more interactivity and fairness to I/O
operations.

Right now, Completely Fair in CFQ is a *huge* exaggeration, at least while you 
have a dd bs=1M thing running.
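
For reference, checking and switching the scheduler is done per block device
(sda is an example):

cat /sys/block/sda/queue/scheduler               # e.g.: noop deadline [cfq]
echo deadline > /sys/block/sda/queue/scheduler   # takes effect immediately, lasts until reboot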

Thanks,
-- 
Martin


Re: Is stability a joke?

2016-09-15 Thread Martin Steigerwald
Hello Nicholas.

On Wednesday, 14 September 2016, 21:05:52 CEST, Nicholas D Steeves wrote:
> On Mon, Sep 12, 2016 at 08:20:20AM -0400, Austin S. Hemmelgarn wrote:
> > On 2016-09-11 09:02, Hugo Mills wrote:
[…]
> > As far as documentation though, we [BTRFS] really do need to get our act
> > together.  It really doesn't look good to have most of the best
> > documentation be in the distro's wikis instead of ours.  I'm not trying to
> > say the distros shouldn't be documenting BTRFS, but the point at which
> > Debian (for example) has better documentation of the upstream version of
> > BTRFS than the upstream project itself does, that starts to look bad.
> 
> I would have loved to have this feature-to-stability list when I
> started working on the Debian documentation!  I started it because I
> was saddened by number of horror story "adventures with btrfs"
> articles and posts I had read about, combined with the perspective of
> certain members within the Debian community that it was a toy fs.
> 
> Are my contributions to that wiki of a high enough quality that I
> can work on the upstream one?  Do you think the broader btrfs
> community is interested in citations and curated links to discussions?
> 
> eg: if a company wants to use btrfs, they check the status page, see a
> feature they want is still in the yellow zone of stabilisation, and
> then follow the links to familiarise themselves with past discussions.
> I imagine this would also help individuals or grad students more
> quickly familiarise themselves with the available literature before
> choosing a specific project.  If regular updates from SUSE, STRATO,
> Facebook, and Fujitsu are also publicly available the k.org wiki would
> be a wonderful place to syndicate them!

I definitely think the quality of your contributions is high enough, and 
others can also proofread and add their own experiences, so… by *all* means, 
go ahead *already*.

It won't all fit inside the table directly, I bet, *but* you can use footnotes 
or further explanations for features that need them, with a headline per 
feature below the table and a link to it from within the table.

Thank you!
-- 
Martin


Re: Is stability a joke? (wiki updated)

2016-09-13 Thread Martin Steigerwald
Am Dienstag, 13. September 2016, 07:28:38 CEST schrieb Austin S. Hemmelgarn:
> On 2016-09-12 16:44, Chris Murphy wrote:
> > On Mon, Sep 12, 2016 at 2:35 PM, Martin Steigerwald <mar...@lichtvoll.de> 
wrote:
> >> Am Montag, 12. September 2016, 23:21:09 CEST schrieb Pasi Kärkkäinen:
> >>> On Mon, Sep 12, 2016 at 09:57:17PM +0200, Martin Steigerwald wrote:
> >>>> Am Montag, 12. September 2016, 18:27:47 CEST schrieb David Sterba:
> >>>>> On Mon, Sep 12, 2016 at 04:27:14PM +0200, David Sterba wrote:
[…]
> >>>>> https://btrfs.wiki.kernel.org/index.php/Status
> >>>> 
> >>>> Great.
> >>>> 
> >>>> I made to minor adaption. I added a link to the Status page to my
> >>>> warning
> >>>> in before the kernel log by feature page. And I also mentioned that at
> >>>> the time the page was last updated the latest kernel version was 4.7.
> >>>> Yes, thats some extra work to update the kernel version, but I think
> >>>> its
> >>>> beneficial to explicitely mention the kernel version the page talks
> >>>> about. Everyone who updates the page can update the version within a
> >>>> second.
> >>> 
> >>> Hmm.. that will still leave people wondering "but I'm running Linux 4.4,
> >>> not 4.7, I wonder what the status of feature X is.."
> >>> 
> >>> Should we also add a column for kernel version, so we can add "feature X
> >>> is
> >>> known to be OK on Linux 3.18 and later"..  ? Or add those to "notes"
> >>> field,
> >>> where applicable?
> >> 
> >> That was my initial idea, and it may be better than a generic kernel
> >> version for all features. Even if we fill in 4.7 for any of the features
> >> that are known to work okay for the table.
> >> 
> >> For RAID 1 I am willing to say it works stable since kernel 3.14, as this
> >> was the kernel I used when I switched /home and / to Dual SSD RAID 1 on
> >> this ThinkPad T520.
> > 
> > Just to cut yourself some slack, you could skip 3.14 because it's EOL
> > now, and just go from 4.4.
> 
> That reminds me, we should probably make a point to make it clear that
> this is for the _upstream_ mainline kernel versions, not for versions
> from some arbitrary distro, and that people should check the distro's
> documentation for that info.

I'd do the following:

Really state the first kernel version known to work stably for each feature.

But before the table, state this:

1) Rather than the first kernel known to work stably for a feature, recommend 
using the latest upstream kernel, or alternatively the latest upstream LTS 
kernel for those users who want to play it a bit safer.

2) For stable distros such as SLES, RHEL, Ubuntu LTS and Debian Stable, 
recommend checking the distro documentation. Note that some distro kernels 
track upstream kernels quite closely, like the Debian backports kernel or the 
Ubuntu kernel backports PPA.

Thanks,
-- 
Martin


Re: Is stability a joke? (wiki updated)

2016-09-12 Thread Martin Steigerwald
Am Montag, 12. September 2016, 23:21:09 CEST schrieb Pasi Kärkkäinen:
> On Mon, Sep 12, 2016 at 09:57:17PM +0200, Martin Steigerwald wrote:
> > Am Montag, 12. September 2016, 18:27:47 CEST schrieb David Sterba:
> > > On Mon, Sep 12, 2016 at 04:27:14PM +0200, David Sterba wrote:
> > > > > I therefore would like to propose that some sort of feature /
> > > > > stability
> > > > > matrix for the latest kernel is added to the wiki preferably
> > > > > somewhere
> > > > > where it is easy to find. It would be nice to archive old matrix'es
> > > > > as
> > > > > well in case someone runs on a bit older kernel (we who use Debian
> > > > > tend
> > > > > to like older kernels). In my opinion it would make things bit
> > > > > easier
> > > > > and perhaps a bit less scary too. Remember if you get bitten badly
> > > > > once
> > > > > you tend to stay away from from it all just in case, if you on the
> > > > > other
> > > > > hand know what bites you can safely pet the fluffy end instead :)
> > > > 
> > > > Somebody has put that table on the wiki, so it's a good starting
> > > > point.
> > > > I'm not sure we can fit everything into one table, some combinations
> > > > do
> > > > not bring new information and we'd need n-dimensional matrix to get
> > > > the
> > > > whole picture.
> > > 
> > > https://btrfs.wiki.kernel.org/index.php/Status
> > 
> > Great.
> > 
> > I made to minor adaption. I added a link to the Status page to my warning
> > in before the kernel log by feature page. And I also mentioned that at
> > the time the page was last updated the latest kernel version was 4.7.
> > Yes, thats some extra work to update the kernel version, but I think its
> > beneficial to explicitely mention the kernel version the page talks
> > about. Everyone who updates the page can update the version within a
> > second.
> 
> Hmm.. that will still leave people wondering "but I'm running Linux 4.4, not
> 4.7, I wonder what the status of feature X is.."
> 
> Should we also add a column for kernel version, so we can add "feature X is
> known to be OK on Linux 3.18 and later"..  ? Or add those to "notes" field,
> where applicable?

That was my initial idea, and it may be better than a generic kernel version 
for all features, even if for the table we just fill in 4.7 for any of the 
features that are known to work okay.

For RAID 1 I am willing to say it has worked stably since kernel 3.14, as that 
was the kernel I used when I switched /home and / to dual SSD RAID 1 on this 
ThinkPad T520.
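
Just to illustrate that setup: creating such a two-device RAID 1 filesystem is 
a one-liner (a sketch only; the device names and mount point are placeholders 
for the two SSD partitions):

  # mirror both metadata and data across the two devices
  mkfs.btrfs -m raid1 -d raid1 /dev/sda2 /dev/sdb2
  # after mounting, verify that both Data and Metadata show the RAID1 profile
  btrfs filesystem df /mnt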


-- 
Martin


Re: Is stability a joke? (wiki updated)

2016-09-12 Thread Martin Steigerwald
Am Montag, 12. September 2016, 18:27:47 CEST schrieb David Sterba:
> On Mon, Sep 12, 2016 at 04:27:14PM +0200, David Sterba wrote:
> > > I therefore would like to propose that some sort of feature / stability
> > > matrix for the latest kernel is added to the wiki preferably somewhere
> > > where it is easy to find. It would be nice to archive old matrix'es as
> > > well in case someone runs on a bit older kernel (we who use Debian tend
> > > to like older kernels). In my opinion it would make things bit easier
> > > and perhaps a bit less scary too. Remember if you get bitten badly once
> > > you tend to stay away from from it all just in case, if you on the other
> > > hand know what bites you can safely pet the fluffy end instead :)
> > 
> > Somebody has put that table on the wiki, so it's a good starting point.
> > I'm not sure we can fit everything into one table, some combinations do
> > not bring new information and we'd need n-dimensional matrix to get the
> > whole picture.
> 
> https://btrfs.wiki.kernel.org/index.php/Status

Great.

I made two minor adaptations. I added a link to the Status page to my warning 
on the changelog-by-feature page. And I also mentioned that at the time the 
page was last updated, the latest kernel version was 4.7. Yes, that's some 
extra work to update the kernel version, but I think it's beneficial to 
explicitly mention the kernel version the page talks about. Everyone who 
updates the page can update the version within a second.

-- 
Martin


Re: Small fs

2016-09-11 Thread Martin Steigerwald
Am Sonntag, 11. September 2016, 19:46:32 CEST schrieb Hugo Mills:
> On Sun, Sep 11, 2016 at 09:13:28PM +0200, Martin Steigerwald wrote:
> > Am Sonntag, 11. September 2016, 16:44:23 CEST schrieb Duncan:
> > > * Metadata, and thus mixed-bg, defaults to DUP mode on a single-device
> > > filesystem (except on ssd where I actually still use it myself, and
> > > recommend it except for ssds that do firmware dedupe).  In mixed-mode
> > > this means two copies of data as well, which halves the usable space.
> > > 
> > > IOW, when using mixed-mode, which is recommended under a gig, and dup
> > > replication which is then the single-device default, effective usable
> > > space is **HALVED**, so 256 MiB btrfs size becomes 128 MiB usable. (!!)
> > 
> > I don´t get this part. That is just *metadata* being duplicated, not the
> > actual *data* inside the files. Or am I missing something here?
> 
>In mixed mode, there's no distinction: Data and metadata both use
> the same chunks. If those chunks are DUP, then both data and metadata
> are duplicated, and you get half the space available.

In German I'd say "autsch" to this; in English, according to pda.leo.org, 
"ouch".

Okay, I just erased using mixed mode as an idea from my mind altogether :).

Just like I think I will never use a BTRFS filesystem below 5 GiB. Well, with 
one exception: maybe on the eMMC flash of the new Omnia Turris router that I 
hope will arrive at my place soon.

-- 
Martin


compress=lzo safe to use? (was: Re: Trying to rescue my data :()

2016-09-11 Thread Martin Steigerwald
Am Sonntag, 26. Juni 2016, 13:13:04 CEST schrieb Steven Haigh:
> On 26/06/16 12:30, Duncan wrote:
> > Steven Haigh posted on Sun, 26 Jun 2016 02:39:23 +1000 as excerpted:
> >> In every case, it was a flurry of csum error messages, then instant
> >> death.
> > 
> > This is very possibly a known bug in btrfs, that occurs even in raid1
> > where a later scrub repairs all csum errors.  While in theory btrfs raid1
> > should simply pull from the mirrored copy if its first try fails checksum
> > (assuming the second one passes, of course), and it seems to do this just
> > fine if there's only an occasional csum error, if it gets too many at
> > once, it *does* unfortunately crash, despite the second copy being
> > available and being just fine as later demonstrated by the scrub fixing
> > the bad copy from the good one.
> > 
> > I'm used to dealing with that here any time I have a bad shutdown (and
> > I'm running live-git kde, which currently has a bug that triggers a
> > system crash if I let it idle and shut off the monitors, so I've been
> > getting crash shutdowns and having to deal with this unfortunately often,
> > recently).  Fortunately I keep my root, with all system executables, etc,
> > mounted read-only by default, so it's not affected and I can /almost/
> > boot normally after such a crash.  The problem is /var/log and /home
> > (which has some parts of /var that need to be writable symlinked into /
> > home/var, so / can stay read-only).  Something in the normal after-crash
> > boot triggers enough csum errors there that I often crash again.
> > 
> > So I have to boot to emergency mode and manually mount the filesystems in
> > question, so nothing's trying to access them until I run the scrub and
> > fix the csum errors.  Scrub itself doesn't trigger the crash, thankfully,
> > and once it has repaired all the csum errors due to partial writes on one
> > mirror that either were never made or were properly completed on the
> > other mirror, I can exit emergency mode and complete the normal boot (to
> > the multi-user default target).  As there's no more csum errors then
> > because scrub fixed them all, the boot doesn't crash due to too many such
> > errors, and I'm back in business.
> > 
> > 
> > Tho I believe at least the csum bug that affects me may only trigger if
> > compression is (or perhaps has been in the past) enabled.  Since I run
> > compress=lzo everywhere, that would certainly affect me.  It would also
> > explain why the bug has remained around for quite some time as well,
> > since presumably the devs don't run with compression on enough for this
> > to have become a personal itch they needed to scratch, thus its remaining
> > untraced and unfixed.
> > 
> > So if you weren't using the compress option, your bug is probably
> > different, but either way, the whole thing about too many csum errors at
> > once triggering a system crash sure does sound familiar, here.
> 
> Yes, I was running the compress=lzo option as well... Maybe here lays a
> common problem?

Hmm… I found this thread via a reference on the Debian wiki page on BTRFS¹.

I have used compress=lzo on BTRFS RAID 1 since April 2014 and have never hit 
an issue. Steven, your filesystem wasn't RAID 1 but RAID 5 or 6?

I just want to assess whether using compress=lzo might be dangerous in my 
setup. Actually, right now I'd like to keep using it, since I think at least 
one of the SSDs does not compress internally. And… well… /home and /, where I 
use it, are both quite full already.

[1] https://wiki.debian.org/Btrfs#WARNINGS
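
To double-check where compress=lzo is actually active, looking at the live 
mount options is enough (generic tooling, nothing BTRFS-specific):

  findmnt -t btrfs -o TARGET,OPTIONS
  # or simply
  grep btrfs /proc/mounts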

Thanks,
-- 
Martin


Re: Small fs

2016-09-11 Thread Martin Steigerwald
Am Sonntag, 11. September 2016, 21:56:07 CEST schrieb Imran Geriskovan:
> On 9/11/16, Duncan <1i5t5.dun...@cox.net> wrote:
> > Martin Steigerwald posted on Sun, 11 Sep 2016 17:32:44 +0200 as excerpted:
> >>> What is the smallest recommended fs size for btrfs?
> >>> Can we say size should be in multiples of 64MB?
> >> 
> >> Do you want to know the smalled *recommended* or the smallest *possible*
> >> size?
> 
> In fact both.
> I'm reconsidering my options for /boot

Well, my stance on /boot is still: Ext4. Done.

:)

It just does not bother me. It practically makes no difference at all. It has 
no visible effect on my user experience, and I never saw the need to snapshot 
/boot.

But another approach, in case you want to use BTRFS for /boot, is to use a 
subvolume. That's IMHO the SLES 12 default setup. They basically create 
subvolumes for /boot, /var, /var/lib/mysql – you name it. Big advantage: you 
have one big FS and do not need to plan space for partitions or LVs. 
Disadvantage: if it breaks, it breaks.

That said, I think at a new installation I may do this for /boot. Just put it 
inside a subvolume.
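
A rough sketch of what that could look like (the paths and the device name are 
placeholders, and the fstab line is only meant to show the idea):

  # mount the top-level subvolume (id 5) somewhere temporary
  mount -o subvolid=5 /dev/sda2 /mnt/top
  # create a subvolume that will hold /boot
  btrfs subvolume create /mnt/top/@boot
  # then mount it via fstab with the subvol option, e.g.:
  #   /dev/sda2  /boot  btrfs  subvol=@boot  0  0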

From my experience at work with customer systems, and even some systems I set 
up myself, I often do not use small partitions anymore. I did so for a CentOS 7 
training VM, just 2 GiB XFS for /var. Guess what happened? The last update was 
too long ago, so… yum tried to download a ton of packages and then complained 
it did not have enough space in /var. Luckily I used LVM, so I enlarged the 
partition LVM resides on, enlarged the PV and then enlarged /var. There may be 
valid reasons to split things up, and I am quite comfortable with splitting 
/boot out, because it's, well, easy enough to plan for. And it may make sense 
to split /var or /var/log out. But on BTRFS I would likely use subvolumes. The 
only thing I might separate would be /home, to make it easier to keep it 
around on a re-installation of the OS. That said, I have never reinstalled the 
Debian on this ThinkPad T520 since I initially installed it. And on previous 
laptops I even copied the Debian from the older laptop onto the newer one. 
With the T520 I reinstalled because I wanted to switch to 64-bit cleanly.

-- 
Martin


Re: Small fs

2016-09-11 Thread Martin Steigerwald
Am Sonntag, 11. September 2016, 16:44:23 CEST schrieb Duncan:
> * Metadata, and thus mixed-bg, defaults to DUP mode on a single-device 
> filesystem (except on ssd where I actually still use it myself, and 
> recommend it except for ssds that do firmware dedupe).  In mixed-mode 
> this means two copies of data as well, which halves the usable space.
> 
> IOW, when using mixed-mode, which is recommended under a gig, and dup 
> replication which is then the single-device default, effective usable 
> space is **HALVED**, so 256 MiB btrfs size becomes 128 MiB usable. (!!)

I don't get this part. That is just *metadata* being duplicated, not the 
actual *data* inside the files. Or am I missing something here?

-- 
Martin


Re: Small fs

2016-09-11 Thread Martin Steigerwald
Am Sonntag, 11. September 2016, 18:27:30 CEST schrieben Sie:
> What is the smallest recommended fs size for btrfs?
> 
> - There are mentions of 256MB around the net.
> - Gparted reserves minimum of 256MB for btrfs.
> 
> With an ordinary partition on a single disk,
> fs created with just "mkfs.btrfs /dev/sdxx":
> - 128MB works fine.
> - 127MB works but as if it is 64MB.
> 
> Can we say size should be in multiples of 64MB?

Do you want to know the smallest *recommended* or the smallest *possible* size?

I personally wouldn't go below one or two GiB or so with BTRFS. On small 
filesystems (I don't know the threshold right now) it uses a mixed 
metadata/data format. And I think using smaller BTRFS filesystems invites any 
leftover "filesystem is full while it isn't" issues.

Well there we go. Excerpt from mkfs.btrfs(8) manpage:

   -M|--mixed
   Normally the data and metadata block groups are isolated.
   The mixed mode will remove the isolation and store both
   types in the same block group type. This helps to utilize
   the free space regardless of the purpose and is suitable
   for small devices. The separate allocation of block groups
   leads to a situation where the space is reserved for the
   other block group type, is not available for allocation and
   can lead to ENOSPC state.

   The recommended size for the mixed mode is for filesystems
   less than 1GiB. The soft recommendation is to use it for
   filesystems smaller than 5GiB. The mixed mode may lead to
   degraded performance on larger filesystems, but is
   otherwise usable, even on multiple devices.

   The nodesize and sectorsize must be equal, and the block
   group types must match.

   Note
   versions up to 4.2.x forced the mixed mode for devices
   smaller than 1GiB. This has been removed in 4.3+ as it
   caused some usability issues.
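
So if someone really wants to go that small, the invocation would be along 
these lines (a sketch; /dev/sdX1 is a placeholder, and single instead of the 
DUP default avoids the halving discussed elsewhere in this thread):

  # mixed data/metadata block groups with matching single profiles,
  # so a 256 MiB device yields roughly 256 MiB of usable space
  mkfs.btrfs --mixed -m single -d single /dev/sdX1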

Thanks
-- 
Martin


Re: Is stability a joke?

2016-09-11 Thread Martin Steigerwald
Am Sonntag, 11. September 2016, 16:54:25 CEST schrieben Sie:
> Am Sonntag, 11. September 2016, 14:39:14 CEST schrieb Waxhead:
> > Martin Steigerwald wrote:
> > > Am Sonntag, 11. September 2016, 13:43:59 CEST schrieb Martin 
Steigerwald:
> > >>>>> The Nouveau graphics driver have a nice feature matrix on it's
> > >>>>> webpage
> > >>>>> and I think that BTRFS perhaps should consider doing something like
> > >>>>> that
> > >>>>> on it's official wiki as well
> > >>>> 
> > >>>> BTRFS also has a feature matrix. The links to it are in the "News"
> > >>>> section
> > >>>> however:
> > >>>> 
> > >>>> https://btrfs.wiki.kernel.org/index.php/Changelog#By_feature
> 
> […]
> 
> > > I mentioned this matrix as a good *starting* point. And I think it would
> > > be
> > > easy to extent it:
> > > 
> > > Just add another column called "Production ready". Then research / ask
> > > about production stability of each feature. The only challenge is: Who
> > > is
> > > authoritative on that? I´d certainly ask the developer of a feature, but
> > > I´d also consider user reports to some extent.
> > > 
> > > Maybe thats the real challenge.
> > > 
> > > If you wish, I´d go through each feature there and give my own
> > > estimation.
> > > But I think there are others who are deeper into this.
> > 
> > That is exactly the same reason I don't edit the wiki myself. I could of
> > course get it started and hopefully someone will correct what I write,
> > but I feel that if I start this off I don't have deep enough knowledge
> > to do a proper start. Perhaps I will change my mind about this.
> 
> Well one thing would be to start with the column and start filling the more
> easy stuff. And if its not known since what kernel version, but its known to
> be stable I suggest to conservatively just put the first kernel version
> into it where people think it is stable or in doubt even put 4.7 into it.
> It can still be reduced to lower kernel versions.
> 
> Well: I made a tiny start. I linked "Features by kernel version" more
> prominently on the main page, so it is easier to find and also added the
> following warning just above the table:
> 
> "WARNING: The "Version" row states at which version a feature has been
> merged into the mainline kernel. It does not tell anything about at which
> kernel version it is considered mature enough for production use."
> 
> Now I wonder: Would adding a "Production ready" column, stating the first
> known to be stable kernel version make sense in this table? What do you
> think? I can add the column and give some first rough, conservative
> estimations on a few features.
> 
> What do you think? Is this a good place?

It isn't as straightforward to add this column as I thought. If I add it after 
"Version", the following fields are not aligned anymore, even though they use 
some kind of identifier – but that identifier also doesn't match the row 
title. After reading about MediaWiki syntax I came to the conclusion that I 
need to add the new column in every data row as well and cannot just assign 
values to the rows and leave out what isn't known yet.

! Feature !! Version !! Description !! Notes
{{FeatureMerged
|name=scrub
|version=3.0
|text=Read all data and verify checksums, repair if possible.
}}

Thanks,
-- 
Martin


Re: Is stability a joke?

2016-09-11 Thread Martin Steigerwald
Am Sonntag, 11. September 2016, 13:02:21 CEST schrieb Hugo Mills:
> On Sun, Sep 11, 2016 at 02:39:14PM +0200, Waxhead wrote:
> > Martin Steigerwald wrote:
> > >Am Sonntag, 11. September 2016, 13:43:59 CEST schrieb Martin Steigerwald:
> > >>>>Thing is: This just seems to be when has a feature been implemented
> > >>>>matrix.
> > >>>>Not when it is considered to be stable. I think this could be done
> > >>>>with
> > >>>>colors or so. Like red for not supported, yellow for implemented and
> > >>>>green for production ready.
> > >>>
> > >>>Exactly, just like the Nouveau matrix. It clearly shows what you can
> > >>>expect from it.
> > >
> > >I mentioned this matrix as a good *starting* point. And I think it would
> > >be
> > >easy to extent it:
> > >
> > >Just add another column called "Production ready". Then research / ask
> > >about production stability of each feature. The only challenge is: Who
> > >is authoritative on that? I´d certainly ask the developer of a feature,
> > >but I´d also consider user reports to some extent.
> > >
> > >Maybe thats the real challenge.
> > >
> > >If you wish, I´d go through each feature there and give my own
> > >estimation. But I think there are others who are deeper into this.
> > 
> > That is exactly the same reason I don't edit the wiki myself. I
> > could of course get it started and hopefully someone will correct
> > what I write, but I feel that if I start this off I don't have deep
> > enough knowledge to do a proper start. Perhaps I will change my mind
> > about this.
> 
>Given that nobody else has done it yet, what are the odds that
> someone else will step up to do it now? I would say that you should at
> least try. Yes, you don't have as much knowledge as some others, but
> if you keep working at it, you'll gain that knowledge. Yes, you'll
> probably get it wrong to start with, but you probably won't get it
> *very* wrong. You'll probably get it horribly wrong at some point, but
> even the more knowledgable people you're deferring to didn't identify
> the problems with parity RAID until Zygo and Austin and Chris (and
> others) put in the work to pin down the exact issues.
> 
>So I'd strongly encourage you to set up and maintain the stability
> matrix yourself -- you have the motivation at least, and the knowledge
> will come with time and effort. Just keep reading the mailing list and
> IRC and bugzilla, and try to identify where you see lots of repeated
> problems, and where bugfixes in those areas happen.
> 
>So, go for it. You have a lot to offer the community.

Yep! Fully agreed.

-- 
Martin


Re: Is stability a joke?

2016-09-11 Thread Martin Steigerwald
Am Sonntag, 11. September 2016, 14:39:14 CEST schrieb Waxhead:
> Martin Steigerwald wrote:
> > Am Sonntag, 11. September 2016, 13:43:59 CEST schrieb Martin Steigerwald:
> >>>>> The Nouveau graphics driver have a nice feature matrix on it's webpage
> >>>>> and I think that BTRFS perhaps should consider doing something like
> >>>>> that
> >>>>> on it's official wiki as well
> >>>> 
> >>>> BTRFS also has a feature matrix. The links to it are in the "News"
> >>>> section
> >>>> however:
> >>>> 
> >>>> https://btrfs.wiki.kernel.org/index.php/Changelog#By_feature
[…]
> > I mentioned this matrix as a good *starting* point. And I think it would
> > be
> > easy to extent it:
> > 
> > Just add another column called "Production ready". Then research / ask
> > about production stability of each feature. The only challenge is: Who is
> > authoritative on that? I´d certainly ask the developer of a feature, but
> > I´d also consider user reports to some extent.
> > 
> > Maybe thats the real challenge.
> > 
> > If you wish, I´d go through each feature there and give my own estimation.
> > But I think there are others who are deeper into this.
> 
> That is exactly the same reason I don't edit the wiki myself. I could of
> course get it started and hopefully someone will correct what I write,
> but I feel that if I start this off I don't have deep enough knowledge
> to do a proper start. Perhaps I will change my mind about this.

Well, one thing would be to start with the column and begin filling in the 
easier stuff. And if a feature is known to be stable, but it is not known 
since which kernel version, I suggest conservatively putting in the first 
kernel version people think it is stable with, or, in doubt, even putting in 
4.7. It can still be lowered to earlier kernel versions later.

Well: I made a tiny start. I linked "Features by kernel version" more 
prominently on the main page, so it is easier to find, and also added the 
following warning just above the table:

"WARNING: The "Version" row states at which version a feature has been merged 
into the mainline kernel. It does not tell anything about at which kernel 
version it is considered mature enough for production use."

Now I wonder: would adding a "Production ready" column, stating the first 
kernel version known to be stable, make sense in this table? What do you 
think? I can add the column and give some first rough, conservative 
estimations for a few features.

What do you think? Is this a good place?

Thanks,
-- 
Martin


Re: Is stability a joke?

2016-09-11 Thread Martin Steigerwald
Am Sonntag, 11. September 2016, 14:30:51 CEST schrieb Waxhead:
> > I think what would be a good next step would be to ask developers / users
> > about feature stability and then update the wiki. If thats important to
> > you, I suggest you invest some energy in doing that. And ask for help.
> > This mailinglist is a good idea.
> > 
> > I already gave you my idea on what works for me.
> > 
> > There is just one thing I won´t go further even a single step: The
> > complaining path. As it leads to no desirable outcome.
> > 
> > Thanks,
> 
> My intention was not to be hostile and if my response sound a bit harsh 
> for you then by all means I do apologize for that.

Okay, maybe I read something into your mail that you didn't intend to put 
there. Sorry. Let us focus on the constructive way to move forward with this.

Thanks,
-- 
Martin


Re: Is stability a joke?

2016-09-11 Thread Martin Steigerwald
Am Sonntag, 11. September 2016, 13:43:59 CEST schrieb Martin Steigerwald:
> > >> The Nouveau graphics driver have a nice feature matrix on it's webpage
> > >> and I think that BTRFS perhaps should consider doing something like
> > >> that
> > >> on it's official wiki as well
> > > 
> > > BTRFS also has a feature matrix. The links to it are in the "News"
> > > section
> > > however:
> > > 
> > > https://btrfs.wiki.kernel.org/index.php/Changelog#By_feature
> > 
> > I disagree, this is not a feature / stability matrix. It is a clearly a
> > changelog by kernel version.
> 
> It is a *feature* matrix. I fully said its not about stability, but about 
> implementation – I just wrote this a sentence after this one. There is no
> need  whatsoever to further discuss this as I never claimed that it is a
> feature / stability matrix in the first place.
> 
> > > Thing is: This just seems to be when has a feature been implemented
> > > matrix.
> > > Not when it is considered to be stable. I think this could be done with
> > > colors or so. Like red for not supported, yellow for implemented and
> > > green for production ready.
> > 
> > Exactly, just like the Nouveau matrix. It clearly shows what you can
> > expect from it.

I mentioned this matrix as a good *starting* point. And I think it would be 
easy to extend it:

Just add another column called "Production ready". Then research / ask about 
the production stability of each feature. The only challenge is: who is 
authoritative on that? I'd certainly ask the developer of a feature, but I'd 
also consider user reports to some extent.

Maybe that's the real challenge.

If you wish, I'd go through each feature there and give my own estimation. But 
I think there are others who are deeper into this.

I do think, for example, that scrubbing and auto RAID repair are stable, 
except for RAID 5/6. Device statistics and RAID 0 and 1 I also consider to be 
stable. I think RAID 10 is stable as well, but as I do not run it, I don't 
know. For me skinny-metadata is also stable. So far even compress=lzo seems 
stable for me, though for others it may not be.

Since what kernel version? Now, there you go. I have no idea. All I know is 
that I started with BTRFS around kernel 2.6.38 or 2.6.39 on my laptop, but not 
as RAID 1 at that time.
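
Checking the device statistics, at least, is trivial (the mount point is just 
an example):

  # per-device error counters; ideally they all stay at 0
  btrfs device stats /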

See, the implementation time of a feature is much easier to assess. Maybe 
that's part of the reason why there is no stability matrix: maybe no one 
*exactly* knows *for sure*. How could you? So I would even put a footnote on 
that "Production ready" column explaining "Considered to be stable by 
developer and user opinions".

Of course it would additionally be good to read about experiences with 
corporate usage of BTRFS. I know at least Fujitsu, SUSE, Facebook and Oracle 
are using it, but I don't know in what configurations and with what 
experiences. One Oracle developer invests a lot of time in bringing BTRFS-like 
features to XFS, Red Hat still favors XFS over BTRFS, and even SLES defaults 
to XFS for /home and other non-/ filesystems. That also tells a story.

You can get some ideas from the SUSE release notes. Even if you do not want to 
use SUSE, they tell you something, and I bet they are one of the better 
sources of information regarding your question available at this time, because 
I believe SUSE developers invested some time in assessing the stability of 
features: they would carefully assess what they can support in enterprise 
environments. There is also someone from Fujitsu who shared experiences in a 
talk; I can search for the URL to the slides again.

I bet Chris Mason and other BTRFS developers at Facebook have some idea of 
what they use within Facebook as well. To what extent they are allowed to talk 
about it… I don't know. My personal impression is that as soon as Chris went 
to Facebook he became quite quiet. Maybe just due to being busy. Maybe due to 
Facebook being much more concerned about its own privacy than that of its 
users.

Thanks,
-- 
Martin


Re: Is stability a joke?

2016-09-11 Thread Martin Steigerwald
Am Sonntag, 11. September 2016, 13:21:30 CEST schrieb Zoiled:
> Martin Steigerwald wrote:
> > Am Sonntag, 11. September 2016, 10:55:21 CEST schrieb Waxhead:
> >> I have been following BTRFS for years and have recently been starting to
> >> use BTRFS more and more and as always BTRFS' stability is a hot topic.
> >> Some says that BTRFS is a dead end research project while others claim
> >> the opposite.
> > 
> > First off: On my systems BTRFS definately runs too stable for a research
> > project. Actually: I have zero issues with stability of BTRFS on *any* of
> > my systems at the moment and in the last half year.
> > 
> > The only issue I had till about half an year ago was BTRFS getting stuck
> > at
> > seeking free space on a highly fragmented RAID 1 + compress=lzo /home.
> > This
> > went away with either kernel 4.4 or 4.5.
> > 
> > Additionally I never ever lost even a single byte of data on my own BTRFS
> > filesystems. I had a checksum failure on one of the SSDs, but BTRFS RAID 1
> > repaired it.
> > 
> > 
> > Where do I use BTRFS?
> > 
> > 1) On this ThinkPad T520 with two SSDs. /home and / in RAID 1, another
> > data
> > volume as single. In case you can read german, search blog.teamix.de for
> > BTRFS.
> > 
> > 2) On my music box ThinkPad T42 for /home. I did not bother to change / so
> > far and may never to so for this laptop. It has a slow 2,5 inch harddisk.
> > 
> > 3) I used it on Workstation at work as well for a data volume in RAID 1.
> > But workstation is no more (not due to a filesystem failure).
> > 
> > 4) On a server VM for /home with Maildirs and Owncloud data. /var is still
> > on Ext4, but I want to migrate it as well. Whether I ever change /, I
> > don´t know.
> > 
> > 5) On another server VM, a backup VM which I currently use with
> > borgbackup.
> > With borgbackup I actually wouldn´t really need BTRFS, but well…
> > 
> > 6) On *all* of my externel eSATA based backup harddisks for snapshotting
> > older states of the backups.
> 
> In other words, you are one of those who claim the opposite :) I have
> also myself run btrfs for a "toy" filesystem since 2013 without any
> issues, but this is more or less irrelevant since some people have
> experienced data loss thanks to unstable features that are not clearly
> marked as such.
> And making a claim that you have not lost a single byte of data does not
> make sense, how did you test this? SHA256 against a backup? :)

Do you have any proof like that with *any* other filesystem on Linux?

No, my claim is a bit weaker: BTRFS' own scrubbing feature, and no I/O errors 
when rsyncing my data over to the backup drive – BTRFS checks checksums on 
read as well – and yes, I know BTRFS uses a weaker hashing algorithm, I think 
crc32c. Yet this is still more than what I can say about *any* other 
filesystem I have used so far. To my current knowledge neither XFS nor Ext4/3 
provide data checksumming. They do have metadata checksumming, and I found 
contradicting information on whether XFS may support data checksumming in the 
future, but up to now there is no *proof* *whatsoever* from the side of the 
filesystem that the data is what it was when I saved it initially. There may 
be bit errors rotting on any of your Ext4 and XFS filesystems without you even 
noticing for *years*. I think that's still unlikely, but it can happen; I saw 
this years ago after restoring a backup with bit errors from a hardware RAID 
controller.

Of course, I rely on the checksumming feature within BTRFS – which may have 
errors itself. But even that is more than with any other filesystem I had 
before.

And I do not scrub daily, especially not the backup disks, but for all scrubs 
up to now, no issues. So, granted, my claim was a bit bold. Right now I have 
no up-to-date scrubs, so all I can say is that I am not aware of any data loss 
up to the point in time when I last scrubbed my devices. Just redoing the 
scrubbing now on my laptop.
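
For reference, that is nothing more than the following (the mount point is 
just an example; the scrub runs in the background and its progress can be 
polled):

  btrfs scrub start /home
  btrfs scrub status /home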

> >> The Debian wiki for BTRFS (which is recent by the way) contains a bunch
> >> of warnings and recommendations and is for me a bit better than the
> >> official BTRFS wiki when it comes to how to decide what features to use.
> > 
> > Nice page. I wasn´t aware of this one.
> > 
> > If you use BTRFS with Debian, I suggest to usually use the recent backport
> > kernel, currently 4.6.
> > 
> > Hmmm, maybe I better remove that compress=lzo mount option. Never saw any
> > issue with it, tough. Will research what they say about it.
> 
> My point exactly: You did not know about this and hence the risk of your
> data being gnawed on.

Well I do follow B

  1   2   3   4   5   6   7   8   >