Re: [f2fs-dev] [PATCH v4 2/2] f2fs: Check write pointer consistency of non-open zones

2019-12-08 Thread Shinichiro Kawasaki
On Dec 09, 2019 / 10:04, Chao Yu wrote:
> On 2019/12/2 17:40, Shin'ichiro Kawasaki wrote:
> > To catch f2fs bugs in write pointer handling code for zoned block
> > devices, check write pointers of non-open zones that current segments do
> > not point to. Do this check at mount time, after the fsync data recovery
> > and the current segments' write pointer consistency fix. When fsync data
> > recovery is disabled by a mount option, do the check only when there is
> > no fsync data.
> > 
> > Check two items, comparing write pointers with the valid block maps in SIT.
> > The first item is a check for zones with no valid blocks. When there are
> > no valid blocks in a zone, the write pointer should be at the start of
> > the zone. If not, the next write operation to the zone will cause an
> > unaligned write error. If the write pointer is not at the zone start,
> > reset it to the zone start.
> > 
> > The second item is a check of the write pointer position against the last
> > valid block in the zone. The last valid block position is not expected to
> > be beyond the write pointer. In such a case, report it as a bug. No fix
> > is required for such a zone, because the zone is not selected for the
> > next write operation until it gets discarded.
> > 
> > Signed-off-by: Shin'ichiro Kawasaki 
> > ---
> >  fs/f2fs/f2fs.h|   1 +
> >  fs/f2fs/segment.c | 126 ++
> >  fs/f2fs/super.c   |  11 
> >  3 files changed, 138 insertions(+)
> > 
> > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> > index 002c417b0a53..23a84d7f17b8 100644
> > --- a/fs/f2fs/f2fs.h
> > +++ b/fs/f2fs/f2fs.h
> > @@ -3156,6 +3156,7 @@ int f2fs_lookup_journal_in_cursum(struct f2fs_journal *journal, int type,
> >  			unsigned int val, int alloc);
> >  void f2fs_flush_sit_entries(struct f2fs_sb_info *sbi, struct cp_control *cpc);
> >  int f2fs_fix_curseg_write_pointer(struct f2fs_sb_info *sbi);
> > +int f2fs_check_write_pointer(struct f2fs_sb_info *sbi);
> >  int f2fs_build_segment_manager(struct f2fs_sb_info *sbi);
> >  void f2fs_destroy_segment_manager(struct f2fs_sb_info *sbi);
> >  int __init f2fs_create_segment_manager_caches(void);
> > diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
> > index 9b6c7ab67b93..48903b7a9d25 100644
> > --- a/fs/f2fs/segment.c
> > +++ b/fs/f2fs/segment.c
> > @@ -4370,6 +4370,90 @@ static int sanity_check_curseg(struct f2fs_sb_info *sbi)
> >  
> >  #ifdef CONFIG_BLK_DEV_ZONED
> >  
> > +static int check_zone_write_pointer(struct f2fs_sb_info *sbi,
> > +   struct f2fs_dev_info *fdev,
> > +   struct blk_zone *zone)
> > +{
> > +   unsigned int wp_segno, wp_blkoff, zone_secno, zone_segno, segno;
> > +   block_t zone_block, wp_block, last_valid_block;
> > +   unsigned int log_sectors_per_block = sbi->log_blocksize - SECTOR_SHIFT;
> > +   int i, s, b, ret;
> > +   struct seg_entry *se;
> > +
> > +   if (zone->type != BLK_ZONE_TYPE_SEQWRITE_REQ)
> > +   return 0;
> > +
> > +   wp_block = fdev->start_blk + (zone->wp >> log_sectors_per_block);
> > +   wp_segno = GET_SEGNO(sbi, wp_block);
> > +   wp_blkoff = wp_block - START_BLOCK(sbi, wp_segno);
> > +   zone_block = fdev->start_blk + (zone->start >> log_sectors_per_block);
> > +   zone_segno = GET_SEGNO(sbi, zone_block);
> > +   zone_secno = GET_SEC_FROM_SEG(sbi, zone_segno);
> > +
> > +   if (zone_segno >= MAIN_SEGS(sbi))
> > +   return 0;
> > +
> > +   /*
> > +* Skip check of zones cursegs point to, since
> > +* fix_curseg_write_pointer() checks them.
> > +*/
> > +   for (i = 0; i < NO_CHECK_TYPE; i++)
> > +   if (zone_secno == GET_SEC_FROM_SEG(sbi,
> > +  CURSEG_I(sbi, i)->segno))
> > +   return 0;
> > +
> > +   /*
> > +* Get last valid block of the zone.
> > +*/
> > +   last_valid_block = zone_block - 1;
> > +   for (s = sbi->segs_per_sec - 1; s >= 0; s--) {
> > +   segno = zone_segno + s;
> > +   se = get_seg_entry(sbi, segno);
> > +   for (b = sbi->blocks_per_seg - 1; b >= 0; b--)
> > +   if (f2fs_test_bit(b, se->cur_valid_map)) {
> > +   last_valid_block = START_BLOCK(sbi, segno) + b;
> > +   break;
> > +   }
> > +   if (last_valid_block >= zone_block)
> > +   break;
> > +   }
> > +
> > +   /*
> > +* If the last valid block is beyond the write pointer, report the
> > +* inconsistency. This inconsistency does not cause a write error,
> > +* because the zone will not be selected for a write operation until
> > +* it gets discarded. Just report it.
> > +*/
> > +   if (last_valid_block >= wp_block) {
> > +   f2fs_notice(sbi, "Valid block beyond write pointer: "
> > +   "valid block[0x%x,0x%x] wp[0x%x,0x%x]",
> > +   GET_SEGNO(sbi, last_valid_block),
> > GET_BLKOFF_FROM_SEG0(sbi, last_valid_block),
> > wp_segno, wp_blkoff);
> > return 0;
> > }

Re: [f2fs-dev] [PATCH v4 2/2] f2fs: Check write pointer consistency of non-open zones

2019-12-08 Thread Chao Yu
On 2019/12/2 17:40, Shin'ichiro Kawasaki wrote:
> To catch f2fs bugs in write pointer handling code for zoned block
> devices, check write pointers of non-open zones that current segments do
> not point to. Do this check at mount time, after the fsync data recovery
> and the current segments' write pointer consistency fix. When fsync data
> recovery is disabled by a mount option, do the check only when there is
> no fsync data.
> 
> Check two items, comparing write pointers with the valid block maps in SIT.
> The first item is a check for zones with no valid blocks. When there are
> no valid blocks in a zone, the write pointer should be at the start of
> the zone. If not, the next write operation to the zone will cause an
> unaligned write error. If the write pointer is not at the zone start,
> reset it to the zone start.
> 
> The second item is a check of the write pointer position against the last
> valid block in the zone. The last valid block position is not expected to
> be beyond the write pointer. In such a case, report it as a bug. No fix
> is required for such a zone, because the zone is not selected for the
> next write operation until it gets discarded.
> 
> Signed-off-by: Shin'ichiro Kawasaki 
> ---
>  fs/f2fs/f2fs.h|   1 +
>  fs/f2fs/segment.c | 126 ++
>  fs/f2fs/super.c   |  11 
>  3 files changed, 138 insertions(+)
> 
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index 002c417b0a53..23a84d7f17b8 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -3156,6 +3156,7 @@ int f2fs_lookup_journal_in_cursum(struct f2fs_journal *journal, int type,
>  			unsigned int val, int alloc);
>  void f2fs_flush_sit_entries(struct f2fs_sb_info *sbi, struct cp_control *cpc);
>  int f2fs_fix_curseg_write_pointer(struct f2fs_sb_info *sbi);
> +int f2fs_check_write_pointer(struct f2fs_sb_info *sbi);
>  int f2fs_build_segment_manager(struct f2fs_sb_info *sbi);
>  void f2fs_destroy_segment_manager(struct f2fs_sb_info *sbi);
>  int __init f2fs_create_segment_manager_caches(void);
> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
> index 9b6c7ab67b93..48903b7a9d25 100644
> --- a/fs/f2fs/segment.c
> +++ b/fs/f2fs/segment.c
> @@ -4370,6 +4370,90 @@ static int sanity_check_curseg(struct f2fs_sb_info *sbi)
>  
>  #ifdef CONFIG_BLK_DEV_ZONED
>  
> +static int check_zone_write_pointer(struct f2fs_sb_info *sbi,
> + struct f2fs_dev_info *fdev,
> + struct blk_zone *zone)
> +{
> + unsigned int wp_segno, wp_blkoff, zone_secno, zone_segno, segno;
> + block_t zone_block, wp_block, last_valid_block;
> + unsigned int log_sectors_per_block = sbi->log_blocksize - SECTOR_SHIFT;
> + int i, s, b, ret;
> + struct seg_entry *se;
> +
> + if (zone->type != BLK_ZONE_TYPE_SEQWRITE_REQ)
> + return 0;
> +
> + wp_block = fdev->start_blk + (zone->wp >> log_sectors_per_block);
> + wp_segno = GET_SEGNO(sbi, wp_block);
> + wp_blkoff = wp_block - START_BLOCK(sbi, wp_segno);
> + zone_block = fdev->start_blk + (zone->start >> log_sectors_per_block);
> + zone_segno = GET_SEGNO(sbi, zone_block);
> + zone_secno = GET_SEC_FROM_SEG(sbi, zone_segno);
> +
> + if (zone_segno >= MAIN_SEGS(sbi))
> + return 0;
> +
> + /*
> +  * Skip check of zones cursegs point to, since
> +  * fix_curseg_write_pointer() checks them.
> +  */
> + for (i = 0; i < NO_CHECK_TYPE; i++)
> + if (zone_secno == GET_SEC_FROM_SEG(sbi,
> +CURSEG_I(sbi, i)->segno))
> + return 0;
> +
> + /*
> +  * Get last valid block of the zone.
> +  */
> + last_valid_block = zone_block - 1;
> + for (s = sbi->segs_per_sec - 1; s >= 0; s--) {
> + segno = zone_segno + s;
> + se = get_seg_entry(sbi, segno);
> + for (b = sbi->blocks_per_seg - 1; b >= 0; b--)
> + if (f2fs_test_bit(b, se->cur_valid_map)) {
> + last_valid_block = START_BLOCK(sbi, segno) + b;
> + break;
> + }
> + if (last_valid_block >= zone_block)
> + break;
> + }
> +
> + /*
> +  * If the last valid block is beyond the write pointer, report the
> +  * inconsistency. This inconsistency does not cause a write error,
> +  * because the zone will not be selected for a write operation until
> +  * it gets discarded. Just report it.
> +  */
> + if (last_valid_block >= wp_block) {
> + f2fs_notice(sbi, "Valid block beyond write pointer: "
> + "valid block[0x%x,0x%x] wp[0x%x,0x%x]",
> + GET_SEGNO(sbi, last_valid_block),
> + GET_BLKOFF_FROM_SEG0(sbi, last_valid_block),
> + wp_segno, wp_blkoff);
> + return 0;
> + }
> 
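To make the sector-to-segment arithmetic in the quoted check_zone_write_pointer() concrete, here is a minimal userspace sketch with hypothetical geometry (4KiB blocks so log_blocksize = 12, 512-byte sectors so SECTOR_SHIFT = 9, and 512 blocks per segment; these constants are illustrative defaults, not values read from a real superblock):

```c
#include <assert.h>

/* Hypothetical geometry for illustration only; a real f2fs instance
 * reads these from the superblock. */
enum {
	SECTOR_SHIFT   = 9,    /* 512-byte sectors */
	LOG_BLOCKSIZE  = 12,   /* 4KiB filesystem blocks */
	BLOCKS_PER_SEG = 512,  /* f2fs default segment size (2MiB) */
};

/* Sectors -> filesystem blocks, as in the patch:
 * blkaddr = sector >> (log_blocksize - SECTOR_SHIFT) */
static unsigned long long sector_to_block(unsigned long long sector)
{
	return sector >> (LOG_BLOCKSIZE - SECTOR_SHIFT);
}

/* Block address -> segment number, mirroring GET_SEGNO() */
static unsigned int get_segno(unsigned long long blkaddr)
{
	return (unsigned int)(blkaddr / BLOCKS_PER_SEG);
}

/* Block address -> offset within its segment, mirroring
 * wp_block - START_BLOCK(sbi, wp_segno) */
static unsigned int get_blkoff(unsigned long long blkaddr)
{
	return (unsigned int)(blkaddr % BLOCKS_PER_SEG);
}
```

For a hypothetical zone starting at sector 524288 whose write pointer is at sector 524296, the write pointer maps to block offset 1 of segment 128, one block past the zone start; a zone with no valid blocks in that state is exactly the first-item case the patch resets.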

Re: [f2fs-dev] [PATCH v3] fs: introduce is_dot_dotdot helper for cleanup

2019-12-08 Thread John Stoffel
On Sun, Dec 08, 2019 at 04:38:04AM -0800, Matthew Wilcox wrote:
> On Sun, Dec 08, 2019 at 03:41:44AM +, Al Viro wrote:
> > On Sat, Dec 07, 2019 at 07:35:48PM +0800, Tiezhu Yang wrote:
> > > There exist many similar and duplicated code snippets to check "." and
> > > "..", so introduce an is_dot_dotdot helper to make the code cleaner.
> > 
> > Umm...  No objections, in principle, but... you try to say that name
> > (e.g. in a phone conversation) without stuttering ;-/
> > 
> > Any suggestions from native speakers?
> 
> I used "is_dot_or_dotdot" when discussing this patch with my wife verbally.

*thumbs up*  Both for the wife, and the name.  :-)
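For readers following along, a minimal userspace sketch of what such a helper might look like (the struct name and layout here are simplified stand-ins; the actual patch operates on the kernel's struct qstr):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's struct qstr. */
struct qstr_sketch {
	const char *name;
	unsigned int len;
};

/* A name is "." or ".." exactly when it is one or two bytes long and
 * every byte is '.'. This is the check the patch deduplicates. */
static bool is_dot_dotdot(const struct qstr_sketch *str)
{
	if (str->len == 1 && str->name[0] == '.')
		return true;
	if (str->len == 2 && str->name[0] == '.' && str->name[1] == '.')
		return true;
	return false;
}
```

Whatever name wins the pronunciation debate, callers would replace their open-coded length-and-byte comparisons with this single predicate.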

-- 


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] Potential data corruption?

2019-12-08 Thread Gao Xiang via Linux-f2fs-devel
Hi,

On Sun, Dec 08, 2019 at 09:15:55PM +0800, Hongwei Qin wrote:
> Hi,
> 
> On Sun, Dec 8, 2019 at 12:01 PM Chao Yu  wrote:
> >
> > Hello,
> >
> > On 2019-12-7 18:10, 红烧的威化饼 wrote:
> > > Hi F2FS experts,
> > > The following confuses me:
> > >
> > > A typical fsync() goes like this:
> > > 1) Issue data block IOs
> > > 2) Wait for completion
> > > 3) Issue chained node block IOs
> > > 4) Wait for completion
> > > 5) Issue flush command
> > >
> > > In order to preserve data consistency under sudden power failure, it 
> > > requires that the storage device persists data blocks prior to node 
> > > blocks.
> > > Otherwise, under sudden power failure, it's possible that the persisted 
> > > node block points to NULL data blocks.
> >
> > Firstly it doesn't break POSIX semantics, right? since fsync() didn't
> > return successfully before the sudden power-cut, we can not guarantee
> > that data is fully persisted in such a condition.
> >
> > However, what you want looks like atomic write semantics, which mostly
> > databases want to guarantee during a db file update.
> >
> > F2FS supports atomic_write via ioctl, which is used by SQLite officially;
> > I guess you can check its implementation detail.
> >
> > Thanks,
> >
> 
> Thanks for your kind reply.
> It's true that if we meet power failure before fsync() completes,
> POSIX doesn't require the FS to recover the file. However, consider the
> following situation:
> 
> 1) Data block IOs (Not persisted)
> 2) Node block IOs (All Persisted)
> 3) Power failure
> 
> Since the node blocks are all persisted before power failure, the node
> chain isn't broken. Note that this file's new data is not properly
> persisted before the crash. So the recovery process should be able to
> recognize this situation and avoid recovering this file. However, since
> the node chain is not broken, perhaps the recovery process will regard
> this file as recoverable?

In my own limited understanding, I'm afraid that seems true for this extreme
case. Without a proper FLUSH command, newer nodes could be recovered while
no newer data was persisted.

So if fsync() was not successful, the old data should be read back, but in
this case unexpected data (not A or A', possibly random data C) will be
considered valid since its node is ok.

It seems it should FLUSH data before the related node chain is written, or
introduce some data checksum.

If I am wrong, kindly correct me...

Thanks,
Gao Xiang





Re: [f2fs-dev] Potential data corruption?

2019-12-08 Thread Chao Yu

Hi,

On 2019-12-8 21:15, Hongwei Qin wrote:
> Hi,
> 
> On Sun, Dec 8, 2019 at 12:01 PM Chao Yu  wrote:
>>
>> Hello,
>>
>> On 2019-12-7 18:10, 红烧的威化饼 wrote:
>>> Hi F2FS experts,
>>> The following confuses me:
>>>
>>> A typical fsync() goes like this:
>>> 1) Issue data block IOs
>>> 2) Wait for completion
>>> 3) Issue chained node block IOs
>>> 4) Wait for completion
>>> 5) Issue flush command
>>>
>>> In order to preserve data consistency under sudden power failure, it
>>> requires that the storage device persists data blocks prior to node
>>> blocks. Otherwise, under sudden power failure, it's possible that the
>>> persisted node block points to NULL data blocks.
>>
>> Firstly it doesn't break POSIX semantics, right? since fsync() didn't
>> return successfully before the sudden power-cut, we can not guarantee
>> that data is fully persisted in such a condition.
>>
>> However, what you want looks like atomic write semantics, which mostly
>> databases want to guarantee during a db file update.
>>
>> F2FS supports atomic_write via ioctl, which is used by SQLite officially;
>> I guess you can check its implementation detail.
>>
>> Thanks,
> 
> Thanks for your kind reply.
> It's true that if we meet power failure before fsync() completes,
> POSIX doesn't require the FS to recover the file. However, consider the
> following situation:
> 
> 1) Data block IOs (Not persisted)
> 2) Node block IOs (All Persisted)
> 3) Power failure
> 
> Since the node blocks are all persisted before power failure, the node
> chain isn't broken. Note that this file's new data is not properly
> persisted before the crash. So the recovery process should be able to
> recognize this situation and avoid recovering this file. However, since
> the node chain is not broken, perhaps the recovery process will regard
> this file as recoverable?

So this is why atomic write submission tags PREFLUSH & FUA on the last node
bio to keep all data IO persisted before the node IO, and the recovery flag
is only tagged in the last node block of the node chain; if the last node
block is not persisted, none of the atomic write data will be recovered.
With this mechanism, we can guarantee atomic write semantics.

__write_node_page()
{
...
	if (atomic && !test_opt(sbi, NOBARRIER))
		fio.op_flags |= REQ_PREFLUSH | REQ_FUA;
...
}

f2fs_fsync_node_pages()
{
...
	if (!atomic || page == last_page) {
		set_fsync_mark(page, 1);
		if (IS_INODE(page)) {
			if (is_inode_flag_set(inode, FI_DIRTY_INODE))
				f2fs_update_inode(inode, page);
			set_dentry_mark(page,
					f2fs_need_dentry_mark(sbi, ino));
		}
		/* may be written by other thread */
		if (!PageDirty(page))
			set_page_dirty(page);
	}
...
}

> Thanks!
> 
>>> However, according to this study
>>> (https://www.usenix.org/conference/fast18/presentation/won), the
>>> persistence order of requests doesn't necessarily equal the request
>>> finish order (due to device volatile caches). This means that it's
>>> possible that the node blocks get persisted prior to data blocks.
>>>
>>> Does F2FS have other mechanisms to prevent such inconsistency? Or does
>>> it require the device to persist data without reordering?
>>>
>>> Thanks!
>>>
>>> Hongwei
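The ordering question in this thread, whether a device may make node blocks durable before data blocks, comes down to volatile write caches: request completion order says nothing about persistence order. A toy model (not kernel code; all names are invented for illustration) of why a PREFLUSH+FUA node write closes the window:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BLKS 16

/* Toy device with a volatile write cache: writes complete into the
 * cache in any order; only a flush makes cached writes durable. */
struct toy_dev {
	bool cached[MAX_BLKS];   /* completed, but possibly volatile */
	bool durable[MAX_BLKS];  /* survives power loss */
};

/* Plain write: completion does NOT imply persistence. */
static void toy_write(struct toy_dev *d, int blk)
{
	d->cached[blk] = true;
}

/* Cache flush: everything previously completed becomes durable. */
static void toy_flush(struct toy_dev *d)
{
	for (int i = 0; i < MAX_BLKS; i++)
		if (d->cached[i])
			d->durable[i] = true;
}

/* PREFLUSH+FUA write: first drain the cache (PREFLUSH), then write
 * this block straight to stable media (FUA). The node block can
 * therefore never be durable while its data blocks are not. */
static void toy_write_preflush_fua(struct toy_dev *d, int blk)
{
	toy_flush(d);
	d->cached[blk] = true;
	d->durable[blk] = true;
}
```

In this model, writing data with toy_write() and the last node block with toy_write_preflush_fua() guarantees durable(node) implies durable(data), which is the invariant the atomic-write path above relies on; a plain node write gives no such guarantee.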


Re: [f2fs-dev] Potential data corruption?

2019-12-08 Thread Hongwei Qin
Hi,

On Sun, Dec 8, 2019 at 12:01 PM Chao Yu  wrote:
>
> Hello,
>
> On 2019-12-7 18:10, 红烧的威化饼 wrote:
> > Hi F2FS experts,
> > The following confuses me:
> >
> > A typical fsync() goes like this:
> > 1) Issue data block IOs
> > 2) Wait for completion
> > 3) Issue chained node block IOs
> > 4) Wait for completion
> > 5) Issue flush command
> >
> > In order to preserve data consistency under sudden power failure, it 
> > requires that the storage device persists data blocks prior to node blocks.
> > Otherwise, under sudden power failure, it's possible that the persisted 
> > node block points to NULL data blocks.
>
> Firstly it doesn't break POSIX semantics, right? since fsync() didn't
> return successfully before the sudden power-cut, we can not guarantee
> that data is fully persisted in such a condition.
>
> However, what you want looks like atomic write semantics, which mostly
> databases want to guarantee during a db file update.
>
> F2FS supports atomic_write via ioctl, which is used by SQLite officially;
> I guess you can check its implementation detail.
>
> Thanks,
>

Thanks for your kind reply.
It's true that if we meet power failure before fsync() completes,
POSIX doesn't require the FS to recover the file. However, consider the
following situation:

1) Data block IOs (Not persisted)
2) Node block IOs (All Persisted)
3) Power failure

Since the node blocks are all persisted before power failure, the node
chain isn't broken. Note that this file's new data is not properly
persisted before the crash. So the recovery process should be able to
recognize this situation and avoid recovering this file. However, since
the node chain is not broken, perhaps the recovery process will regard
this file as recoverable?

Thanks!

> >
> > However, according to this study
> > (https://www.usenix.org/conference/fast18/presentation/won), the
> > persistence order of requests doesn't necessarily equal the request
> > finish order (due to device volatile caches). This means that it's
> > possible that the node blocks get persisted prior to data blocks.
> >
> > Does F2FS have other mechanisms to prevent such inconsistency? Or does
> > it require the device to persist data without reordering?
> >
> > Thanks!
> >
> > Hongwei




Re: [f2fs-dev] [PATCH v3] fs: introduce is_dot_dotdot helper for cleanup

2019-12-08 Thread Matthew Wilcox
On Sun, Dec 08, 2019 at 03:41:44AM +, Al Viro wrote:
> On Sat, Dec 07, 2019 at 07:35:48PM +0800, Tiezhu Yang wrote:
> > There exist many similar and duplicated code snippets to check "." and
> > "..", so introduce an is_dot_dotdot helper to make the code cleaner.
> 
> Umm...  No objections, in principle, but... you try to say that name
> (e.g. in a phone conversation) without stuttering ;-/
> 
> Any suggestions from native speakers?

I used "is_dot_or_dotdot" when discussing this patch with my wife verbally.

