On Sat, 2 Apr 2016 19:17:55 +0200
Henk Slager <eye...@gmail.com> wrote:

> On Sat, Apr 2, 2016 at 11:00 AM, Kai Krakow <hurikha...@gmail.com>
> wrote:
> > On Fri, 1 Apr 2016 01:27:21 +0200
> > Henk Slager <eye...@gmail.com> wrote:
> >  
> >> It is not clear to me what 'Gentoo patch-set r1' is and does. So
> >> just boot a vanilla v4.5 kernel from kernel.org and see if you get
> >> csum errors in dmesg.  
> >
> > It is the gentoo patchset, I don't think anything there relates to
> > btrfs:
> > https://dev.gentoo.org/~mpagano/genpatches/trunk/4.5/
> >  
> >> Also, where does 'duplicate object' come from? dmesg ? then please
> >> post its surroundings, straight from dmesg.  
> >
> > It was in dmesg. I already posted it in the other thread and Qu took
> > note of it. Apparently, I didn't manage to capture anything else
> > than:
> >
> > btrfs_run_delayed_refs:2927: errno=-17 Object already exists
> >
> > It hit me unexpectedly. This was the first time btrfs went RO for
> > me. It was with kernel 4.4.5, I think.
> >
> > I suspect this is the outcome of unnoticed corruptions that sneaked
> > in earlier over some period of time. The system had no problems
> > until this incident, and only then I discovered the huge pile of
> > corruptions when I ran btrfsck.
> >
> > I'm also pretty convinced now that VirtualBox itself is not the
> > problem but only a victim of these corruptions; that's why it
> > primarily shows up in the VDI file.
> >
> > However, I now found csum errors in unrelated files (see other post
> > in this thread), even for files not touched in a long time.  
> 
> OK, this is some good further status and background. That there are
> more csum errors elsewhere is quite worrying, I would say. You said the
> HW is tested; are you sure there are no rare undetected failures, e.g.
> due to overclocking or just aging? It might be that spurious HW
> errors are only now starting to happen and are unrelated to the kernel
> upgrade from 4.4.x to 4.5.
> I once had a RAM module go bad; Windows 7 ran fine (at least no
> crashes), but when I booted Linux/btrfs, all kinds of strange
> btrfs errors started to appear, including csum errors.

I'll check the RAM for problems, though it would be the first time in
twenty years that a RAM module went bad without having had errors from
the beginning. Well, you never know. But I expect no errors, since bad
RAM usually causes all sorts of different, random problems, which I
don't have. My problems are very specific, which is atypical for RAM
errors.

The hardware is not overclocked, and every part was tested when installed.

> The other thing you could think about is the SSD cache partition. I
> don't remember if blocks from RAM to SSD get an extra CRC attached
> (independent of btrfs). But if data gets corrupted while in the SSD,
> you could get very nasty errors; how nasty depends a bit on the
> various bcache settings. It is not unthinkable that dirty changed data
> gets written to the hard disks. But at least btrfs (scrub) can detect
> that (the situation you are in now).

Well, the SSD could in fact soon become a problem. It's at 97% of its
lifetime according to SMART. I'm probably somewhere near 85TB (that's
the lifetime spec of the SSD) of written data within one year, thanks to
an unfortunate disk replacement (btrfs replace) run with btrfs
through bcache, and weekly scrubs (which do not just read, but also
write).

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       1
  5 Reallocate_NAND_Blk_Cnt 0x0033   100   100   000    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       8705
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       286
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   003   003   000    Old_age   Always       -       2913
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       112
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       1036
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   067   057   000    Old_age   Always       -       33 (Min/Max 20/43)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Used   0x0031   003   003   000    Pre-fail  Offline      -       97
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   000    Old_age   Always       -       42879382296
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       1495038460
248 Bckgnd_Program_Page_Cnt 0x0032   100   100   000    Old_age   Always       -       42326578695
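
For anyone who wants to turn the raw values of attributes 246, 247 and
248 into something more readable, here is a rough sketch. It rests on
two assumptions that are not confirmed anywhere in this thread: that
attribute 246 counts 512-byte sectors, and that the write amplification
can be estimated as (247 + 248) / 247, which is the formula commonly
cited for SSDs exposing these attribute names. The function names are
made up for illustration.

```python
# Rough interpretation of SMART raw values; all constants are assumptions.
SECTOR_BYTES = 512  # assumption: attribute 246 counts 512-byte sectors


def host_bytes_written(total_host_sector_write: int) -> int:
    """Bytes written by the host, derived from attribute 246."""
    return total_host_sector_write * SECTOR_BYTES


def write_amplification(host_pages: int, background_pages: int) -> float:
    """Estimate write amplification from attributes 247 and 248.

    Assumed formula: (host pages + background pages) / host pages.
    """
    if host_pages <= 0:
        raise ValueError("host page count must be positive")
    return (host_pages + background_pages) / host_pages


if __name__ == "__main__":
    # Raw values copied from the smartctl output above.
    written = host_bytes_written(42879382296)
    waf = write_amplification(1495038460, 42326578695)
    print(f"host writes: {written / 1e12:.2f} TB (decimal)")
    print(f"estimated write amplification: {waf:.1f}")
```

If the drive reports sectors of a different size, or counts pages
differently, the results shift accordingly, so treat the output as an
estimate only.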


> Maybe to further isolate just btrfs, you could temporarily rule out
> bcache by making sure the cache is clean and then increasing the
> start sectors of the second partitions on the hard disks by 16 (8 KiB)
> and then rebooting. Of course, after any write to the partitions,
> you'll have to recreate all of bcache.

Bcache recently received some patches for problems I never experienced.
At this point, I would not rule out bcache as the culprit either.
Though bcache itself has shown no problems here (I have one other system
where bcache broke down after those patches were applied, resulting in a
broken bcache b-tree).

> But maybe it is just due to bugs in older kernels that the fs has been
> silently corrupted and now kernel 4.5 cannot handle it anymore and any
> use of the fs increases corruption.

I'm pretty sure the problems sneaked in while running older kernels,
and the FS going RO was only the tip of the iceberg.

My last "error free" rsync backup is from mid March. At that time, I
probably had no csum errors in files with a recent modification time,
but since I only sync files in-place when their mod-time has changed, I
cannot rule out that csum errors were already there. My script only
takes snapshots of the backup scratch area when rsync was successful,
thus my last snapshot from mid March holds valid copies of the broken
files, while the scratch area has a current backup with some files
broken (due to in-place sync). [1]
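
The logic described above, reduced to a minimal sketch. This is not the
script from [1]; the function name, the paths and the injectable
`runner` parameter are made up for illustration, and the rsync options
are only an assumption about what such a backup might use.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command line and returns its exit status.
Runner = Callable[[Sequence[str]], int]


def run_command(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.call(list(cmd))


def backup_then_snapshot(source: str, scratch: str, snapshot: str,
                         runner: Runner = run_command) -> bool:
    """Sync into the scratch area; snapshot it only if rsync succeeded."""
    # In-place sync: changed files are overwritten inside the scratch area.
    rsync_cmd = ["rsync", "-a", "--inplace", source, scratch]
    if runner(rsync_cmd) != 0:
        # rsync failed (e.g. a read error on a broken file): no snapshot,
        # so the previous snapshot stays the last known-good state.
        return False
    # Read-only snapshot of the scratch subvolume.
    snap_cmd = ["btrfs", "subvolume", "snapshot", "-r", scratch, snapshot]
    return runner(snap_cmd) == 0
```

With this ordering, a failed rsync leaves the scratch area possibly
half-updated, while older snapshots remain untouched.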

According to previous inspections, that backup FS is in good shape -
the only btrfsck errors have been false alerts which have been fixed by
Qu (thanks BTW).

The interesting thing is:

As with the first file with csum errors (the VDI file), the second
file also has csum errors again after being recreated. It's a game data
file from Steam. I removed it (at which point the FS went RO, as
mentioned earlier in this thread). Now Steam has re-downloaded the file
to a temp directory, so it is obviously a completely new file (unless
Steam somehow magically recovered it from somewhere else). But this new
file has csum errors again. WTH? And Steam forces the FS RO when working
with this file.

So either the SSD (through bcache) or btrfs' compression code shows
bugs with very specific data patterns (I'm using compress=lzo), or the
other corruptions make btrfs destroy those new files because it
allocates space over and over again from affected areas of the disk. I
don't know how btrfs allocation works, but that may be an explanation
(wrt the backpointer errors).

BTW: Replacement SSD already ordered. At the current rate the old
one will reach 100% lifetime in about 4-6 weeks.

[1]: As a reference or if you're curious:
https://gist.github.com/kakra/5520370

-- 
Regards,
Kai

Replies to list-only preferred.
