Re: [PATCH v3] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-04-19 Thread Qu Wenruo



At 04/20/2017 10:08 AM, Liu Bo wrote:

On Thu, Apr 20, 2017 at 09:52:00AM +0800, Qu Wenruo wrote:



At 04/20/2017 09:25 AM, Liu Bo wrote:

On Fri, Apr 07, 2017 at 10:43:15AM +0800, Qu Wenruo wrote:

[BUG]
Cycle mount btrfs can cause fiemap to return different result.
Like:
   # mount /dev/vdb5 /mnt/btrfs
   # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
   # xfs_io -c "fiemap -v" /mnt/btrfs/file
   /mnt/test/file:
   EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
 0: [0..127]:25088..25215   128   0x1
   # umount /mnt/btrfs
   # mount /dev/vdb5 /mnt/btrfs
   # xfs_io -c "fiemap -v" /mnt/btrfs/file
   /mnt/test/file:
   EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
 0: [0..31]: 25088..2511932   0x0
 1: [32..63]:25120..2515132   0x0
 2: [64..95]:25152..2518332   0x0
 3: [96..127]:   25184..2521532   0x1
But after above fiemap, we get correct merged result if we call fiemap
again.
   # xfs_io -c "fiemap -v" /mnt/btrfs/file
   /mnt/test/file:
   EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
 0: [0..127]:25088..25215   128   0x1

[REASON]
Btrfs will try to merge extent map when inserting new extent map.

btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
 |- get_extent_skip_holes(start=0 len=64k)
 |  |- btrfs_get_extent_fiemap(start=0 len=64k)
 | |- btrfs_get_extent(start=0 len=64k)
 ||  Found on-disk (ino, EXTENT_DATA, 0)
 ||- add_extent_mapping()
 ||- Return (em->start=0, len=16k)
 |
 |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
 |
 |- get_extent_skip_holes(start=0 len=64k)
 |  |- btrfs_get_extent_fiemap(start=0 len=64k)
 | |- btrfs_get_extent(start=16k len=48k)
 ||  Found on-disk (ino, EXTENT_DATA, 16k)
 ||- add_extent_mapping()
 ||  |- try_merge_map()
 || Merge with previous em start=0 len=16k
 || resulting em start=0 len=32k
 ||- Return (em->start=0, len=32K)<< Merged result
 |- Stripe off the unrelated range (0~16K) of return em
 |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
^^^ Causing split fiemap extent.

And since in add_extent_mapping(), em is already merged, in next
fiemap() call, we will get merged result.

[FIX]
Here we introduce a new structure, fiemap_cache, which records previous
fiemap extent.

And will always try to merge current fiemap_cache result before calling
fiemap_fill_next_extent().
Only when we failed to merge current fiemap extent with cached one, we
will call fiemap_fill_next_extent() to submit cached one.

So by this method, we can merge all fiemap extents.



If I understand it correctly, what it's missing currently is 'merge', a
straightfoward idea is to make use of the 'merge' ability of btrfs_get_extent. >
Since btrfs_get_extent_fiemap is a wrapper of btrfs_get_extent, does it make
sense if we make btrfs_get_extent_fiemap iterate all extent maps within the
range passing to it or just one more extent map to check if btrfs_get_extent
could return a merged extent map before returning?


So your idea to to do the merging inside btrfs_get_extent(), instead of
merging it in extent_fiemap()?

In theory, they have the same effect for fiemap.
And that's my original idea.

But the problem is, btrfs_get_extent() is called almost everywhere, more
frequently than extent_fiemap(), the extra time used to merging extent map
may cause performance problem.

For the worst case, if a file is made up by 262144 4K extent (takes up 1G),
btrfs_get_extent() call on the file will iterate all these extents,
which will iterate more than 500 16K tree blocks.

If only fiemap takes such longer time, it's still acceptable. But if
btrfs_get_extent() takes such longer time, I think it will be a problem
then.



Not related to the patch, but the question is, does fiemap ioctl think
it is important to have unified output?


Not defined by any one nor documentation.

But wouldn't it be strange that fiemap returns different result even we 
had done nothing?


As XFS can do it well, why not Btrfs?
(BTW, it seems that XFS does it even better, merge ondisk extents at 
insert time, other than do tricks in fiemap routine)


Thanks,
Qu



The filefrag manual only says it's attempting to get extent
information.  And this inconsistent output of filefrag doesn't seem to
confusing users as the reported extent count is correct.

Reviewed-by: Liu Bo 

Thanks,

-liubo





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-04-19 Thread Qu Wenruo



At 04/20/2017 09:58 AM, Liu Bo wrote:

On Thu, Apr 20, 2017 at 09:52:00AM +0800, Qu Wenruo wrote:



At 04/20/2017 09:25 AM, Liu Bo wrote:

On Fri, Apr 07, 2017 at 10:43:15AM +0800, Qu Wenruo wrote:

[BUG]
Cycle mount btrfs can cause fiemap to return different result.
Like:
   # mount /dev/vdb5 /mnt/btrfs
   # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
   # xfs_io -c "fiemap -v" /mnt/btrfs/file
   /mnt/test/file:
   EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
 0: [0..127]:25088..25215   128   0x1
   # umount /mnt/btrfs
   # mount /dev/vdb5 /mnt/btrfs
   # xfs_io -c "fiemap -v" /mnt/btrfs/file
   /mnt/test/file:
   EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
 0: [0..31]: 25088..2511932   0x0
 1: [32..63]:25120..2515132   0x0
 2: [64..95]:25152..2518332   0x0
 3: [96..127]:   25184..2521532   0x1
But after above fiemap, we get correct merged result if we call fiemap
again.
   # xfs_io -c "fiemap -v" /mnt/btrfs/file
   /mnt/test/file:
   EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
 0: [0..127]:25088..25215   128   0x1

[REASON]
Btrfs will try to merge extent map when inserting new extent map.

btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
 |- get_extent_skip_holes(start=0 len=64k)
 |  |- btrfs_get_extent_fiemap(start=0 len=64k)
 | |- btrfs_get_extent(start=0 len=64k)
 ||  Found on-disk (ino, EXTENT_DATA, 0)
 ||- add_extent_mapping()
 ||- Return (em->start=0, len=16k)
 |
 |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
 |
 |- get_extent_skip_holes(start=0 len=64k)
 |  |- btrfs_get_extent_fiemap(start=0 len=64k)
 | |- btrfs_get_extent(start=16k len=48k)
 ||  Found on-disk (ino, EXTENT_DATA, 16k)
 ||- add_extent_mapping()
 ||  |- try_merge_map()
 || Merge with previous em start=0 len=16k
 || resulting em start=0 len=32k
 ||- Return (em->start=0, len=32K)<< Merged result
 |- Stripe off the unrelated range (0~16K) of return em
 |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
^^^ Causing split fiemap extent.

And since in add_extent_mapping(), em is already merged, in next
fiemap() call, we will get merged result.

[FIX]
Here we introduce a new structure, fiemap_cache, which records previous
fiemap extent.

And will always try to merge current fiemap_cache result before calling
fiemap_fill_next_extent().
Only when we failed to merge current fiemap extent with cached one, we
will call fiemap_fill_next_extent() to submit cached one.

So by this method, we can merge all fiemap extents.



If I understand it correctly, what it's missing currently is 'merge', a
straightfoward idea is to make use of the 'merge' ability of btrfs_get_extent. >
Since btrfs_get_extent_fiemap is a wrapper of btrfs_get_extent, does it make
sense if we make btrfs_get_extent_fiemap iterate all extent maps within the
range passing to it or just one more extent map to check if btrfs_get_extent
could return a merged extent map before returning?


So your idea to to do the merging inside btrfs_get_extent(), instead of
merging it in extent_fiemap()?



No, merge ems inside the wrapper, ie. btrfs_get_extent_fiemap().


Then llseek() with SEEK_HOLE/SEEK_DATA also get affected.

So limiting the extra time to merging extent maps in fiemap is still valid.

Thanks,
Qu


Thanks,

-liubo

In theory, they have the same effect for fiemap.
And that's my original idea.

But the problem is, btrfs_get_extent() is called almost everywhere, more
frequently than extent_fiemap(), the extra time used to merging extent map
may cause performance problem.

For the worst case, if a file is made up by 262144 4K extent (takes up 1G),
btrfs_get_extent() call on the file will iterate all these extents,
which will iterate more than 500 16K tree blocks.

If only fiemap takes such longer time, it's still acceptable. But if
btrfs_get_extent() takes such longer time, I think it will be a problem
then.

Thanks,
Qu




If no extent map could be merged, which is the worst case, the first
btrfs_get_extent_fiemap will read file metadata into memory from disk and the
following btrfs_get_extent_fiemap will read the rest of file metadata from
the in-memory extent map tree directly.

Thanks,

-liubo


It can also be done in fs/ioctl.c, however the problem is if
fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
extent.
So I choose to merge it in btrfs.

Signed-off-by: Qu Wenruo 
---
v2:
Since fiemap_extent_info has a limit for number of fiemap_extent, it's 
possible
that fiemap_fill_next_extent() return 1 halfway. Remove the WARN_ON() which 
can
cause kernel warning if we fiemap is called on large compressed file.
v3:
Rename finish_fiemap_extent() to 

Re: [PATCH v3] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-04-19 Thread Liu Bo
On Thu, Apr 20, 2017 at 09:52:00AM +0800, Qu Wenruo wrote:
> 
> 
> At 04/20/2017 09:25 AM, Liu Bo wrote:
> > On Fri, Apr 07, 2017 at 10:43:15AM +0800, Qu Wenruo wrote:
> > > [BUG]
> > > Cycle mount btrfs can cause fiemap to return different result.
> > > Like:
> > >   # mount /dev/vdb5 /mnt/btrfs
> > >   # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
> > >   # xfs_io -c "fiemap -v" /mnt/btrfs/file
> > >   /mnt/test/file:
> > >   EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
> > > 0: [0..127]:25088..25215   128   0x1
> > >   # umount /mnt/btrfs
> > >   # mount /dev/vdb5 /mnt/btrfs
> > >   # xfs_io -c "fiemap -v" /mnt/btrfs/file
> > >   /mnt/test/file:
> > >   EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
> > > 0: [0..31]: 25088..2511932   0x0
> > > 1: [32..63]:25120..2515132   0x0
> > > 2: [64..95]:25152..2518332   0x0
> > > 3: [96..127]:   25184..2521532   0x1
> > > But after above fiemap, we get correct merged result if we call fiemap
> > > again.
> > >   # xfs_io -c "fiemap -v" /mnt/btrfs/file
> > >   /mnt/test/file:
> > >   EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
> > > 0: [0..127]:25088..25215   128   0x1
> > > 
> > > [REASON]
> > > Btrfs will try to merge extent map when inserting new extent map.
> > > 
> > > btrfs_fiemap(start=0 len=(u64)-1)
> > > |- extent_fiemap(start=0 len=(u64)-1)
> > > |- get_extent_skip_holes(start=0 len=64k)
> > > |  |- btrfs_get_extent_fiemap(start=0 len=64k)
> > > | |- btrfs_get_extent(start=0 len=64k)
> > > ||  Found on-disk (ino, EXTENT_DATA, 0)
> > > ||- add_extent_mapping()
> > > ||- Return (em->start=0, len=16k)
> > > |
> > > |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
> > > |
> > > |- get_extent_skip_holes(start=0 len=64k)
> > > |  |- btrfs_get_extent_fiemap(start=0 len=64k)
> > > | |- btrfs_get_extent(start=16k len=48k)
> > > ||  Found on-disk (ino, EXTENT_DATA, 16k)
> > > ||- add_extent_mapping()
> > > ||  |- try_merge_map()
> > > || Merge with previous em start=0 len=16k
> > > || resulting em start=0 len=32k
> > > ||- Return (em->start=0, len=32K)<< Merged result
> > > |- Stripe off the unrelated range (0~16K) of return em
> > > |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
> > >^^^ Causing split fiemap extent.
> > > 
> > > And since in add_extent_mapping(), em is already merged, in next
> > > fiemap() call, we will get merged result.
> > > 
> > > [FIX]
> > > Here we introduce a new structure, fiemap_cache, which records previous
> > > fiemap extent.
> > > 
> > > And will always try to merge current fiemap_cache result before calling
> > > fiemap_fill_next_extent().
> > > Only when we failed to merge current fiemap extent with cached one, we
> > > will call fiemap_fill_next_extent() to submit cached one.
> > > 
> > > So by this method, we can merge all fiemap extents.
> > > 
> > 
> > If I understand it correctly, what it's missing currently is 'merge', a
> > straightfoward idea is to make use of the 'merge' ability of 
> > btrfs_get_extent. >
> > Since btrfs_get_extent_fiemap is a wrapper of btrfs_get_extent, does it make
> > sense if we make btrfs_get_extent_fiemap iterate all extent maps within the
> > range passing to it or just one more extent map to check if btrfs_get_extent
> > could return a merged extent map before returning?
> 
> So your idea to to do the merging inside btrfs_get_extent(), instead of
> merging it in extent_fiemap()?
> 
> In theory, they have the same effect for fiemap.
> And that's my original idea.
> 
> But the problem is, btrfs_get_extent() is called almost everywhere, more
> frequently than extent_fiemap(), the extra time used to merging extent map
> may cause performance problem.
> 
> For the worst case, if a file is made up by 262144 4K extent (takes up 1G),
> btrfs_get_extent() call on the file will iterate all these extents,
> which will iterate more than 500 16K tree blocks.
> 
> If only fiemap takes such longer time, it's still acceptable. But if
> btrfs_get_extent() takes such longer time, I think it will be a problem
> then.
>

Not related to the patch, but the question is, does fiemap ioctl think
it is important to have unified output?

The filefrag manual only says it's attempting to get extent
information.  And this inconsistent output of filefrag doesn't seem to
confusing users as the reported extent count is correct.

Reviewed-by: Liu Bo 

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-04-19 Thread Liu Bo
On Thu, Apr 20, 2017 at 09:52:00AM +0800, Qu Wenruo wrote:
> 
> 
> At 04/20/2017 09:25 AM, Liu Bo wrote:
> > On Fri, Apr 07, 2017 at 10:43:15AM +0800, Qu Wenruo wrote:
> > > [BUG]
> > > Cycle mount btrfs can cause fiemap to return different result.
> > > Like:
> > >   # mount /dev/vdb5 /mnt/btrfs
> > >   # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
> > >   # xfs_io -c "fiemap -v" /mnt/btrfs/file
> > >   /mnt/test/file:
> > >   EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
> > > 0: [0..127]:25088..25215   128   0x1
> > >   # umount /mnt/btrfs
> > >   # mount /dev/vdb5 /mnt/btrfs
> > >   # xfs_io -c "fiemap -v" /mnt/btrfs/file
> > >   /mnt/test/file:
> > >   EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
> > > 0: [0..31]: 25088..2511932   0x0
> > > 1: [32..63]:25120..2515132   0x0
> > > 2: [64..95]:25152..2518332   0x0
> > > 3: [96..127]:   25184..2521532   0x1
> > > But after above fiemap, we get correct merged result if we call fiemap
> > > again.
> > >   # xfs_io -c "fiemap -v" /mnt/btrfs/file
> > >   /mnt/test/file:
> > >   EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
> > > 0: [0..127]:25088..25215   128   0x1
> > > 
> > > [REASON]
> > > Btrfs will try to merge extent map when inserting new extent map.
> > > 
> > > btrfs_fiemap(start=0 len=(u64)-1)
> > > |- extent_fiemap(start=0 len=(u64)-1)
> > > |- get_extent_skip_holes(start=0 len=64k)
> > > |  |- btrfs_get_extent_fiemap(start=0 len=64k)
> > > | |- btrfs_get_extent(start=0 len=64k)
> > > ||  Found on-disk (ino, EXTENT_DATA, 0)
> > > ||- add_extent_mapping()
> > > ||- Return (em->start=0, len=16k)
> > > |
> > > |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
> > > |
> > > |- get_extent_skip_holes(start=0 len=64k)
> > > |  |- btrfs_get_extent_fiemap(start=0 len=64k)
> > > | |- btrfs_get_extent(start=16k len=48k)
> > > ||  Found on-disk (ino, EXTENT_DATA, 16k)
> > > ||- add_extent_mapping()
> > > ||  |- try_merge_map()
> > > || Merge with previous em start=0 len=16k
> > > || resulting em start=0 len=32k
> > > ||- Return (em->start=0, len=32K)<< Merged result
> > > |- Stripe off the unrelated range (0~16K) of return em
> > > |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
> > >^^^ Causing split fiemap extent.
> > > 
> > > And since in add_extent_mapping(), em is already merged, in next
> > > fiemap() call, we will get merged result.
> > > 
> > > [FIX]
> > > Here we introduce a new structure, fiemap_cache, which records previous
> > > fiemap extent.
> > > 
> > > And will always try to merge current fiemap_cache result before calling
> > > fiemap_fill_next_extent().
> > > Only when we failed to merge current fiemap extent with cached one, we
> > > will call fiemap_fill_next_extent() to submit cached one.
> > > 
> > > So by this method, we can merge all fiemap extents.
> > > 
> > 
> > If I understand it correctly, what it's missing currently is 'merge', a
> > straightfoward idea is to make use of the 'merge' ability of 
> > btrfs_get_extent. >
> > Since btrfs_get_extent_fiemap is a wrapper of btrfs_get_extent, does it make
> > sense if we make btrfs_get_extent_fiemap iterate all extent maps within the
> > range passing to it or just one more extent map to check if btrfs_get_extent
> > could return a merged extent map before returning?
> 
> So your idea to to do the merging inside btrfs_get_extent(), instead of
> merging it in extent_fiemap()?
>

No, merge ems inside the wrapper, ie. btrfs_get_extent_fiemap().

Thanks,

-liubo
> In theory, they have the same effect for fiemap.
> And that's my original idea.
> 
> But the problem is, btrfs_get_extent() is called almost everywhere, more
> frequently than extent_fiemap(), the extra time used to merging extent map
> may cause performance problem.
> 
> For the worst case, if a file is made up by 262144 4K extent (takes up 1G),
> btrfs_get_extent() call on the file will iterate all these extents,
> which will iterate more than 500 16K tree blocks.
> 
> If only fiemap takes such longer time, it's still acceptable. But if
> btrfs_get_extent() takes such longer time, I think it will be a problem
> then.
> 
> Thanks,
> Qu
> 
> 
> > 
> > If no extent map could be merged, which is the worst case, the first
> > btrfs_get_extent_fiemap will read file metadata into memory from disk and 
> > the
> > following btrfs_get_extent_fiemap will read the rest of file metadata from
> > the in-memory extent map tree directly.
> > 
> > Thanks,
> > 
> > -liubo
> > 
> > > It can also be done in fs/ioctl.c, however the problem is if
> > > fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
> > > extent.
> > > So I choose to merge it in btrfs.
> > > 
> > > Signed-off-by: Qu Wenruo 

Re: [PATCH v3] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-04-19 Thread Qu Wenruo



At 04/20/2017 09:25 AM, Liu Bo wrote:

On Fri, Apr 07, 2017 at 10:43:15AM +0800, Qu Wenruo wrote:

[BUG]
Cycle mount btrfs can cause fiemap to return different result.
Like:
  # mount /dev/vdb5 /mnt/btrfs
  # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
  # xfs_io -c "fiemap -v" /mnt/btrfs/file
  /mnt/test/file:
  EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
0: [0..127]:25088..25215   128   0x1
  # umount /mnt/btrfs
  # mount /dev/vdb5 /mnt/btrfs
  # xfs_io -c "fiemap -v" /mnt/btrfs/file
  /mnt/test/file:
  EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
0: [0..31]: 25088..2511932   0x0
1: [32..63]:25120..2515132   0x0
2: [64..95]:25152..2518332   0x0
3: [96..127]:   25184..2521532   0x1
But after above fiemap, we get correct merged result if we call fiemap
again.
  # xfs_io -c "fiemap -v" /mnt/btrfs/file
  /mnt/test/file:
  EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
0: [0..127]:25088..25215   128   0x1

[REASON]
Btrfs will try to merge extent map when inserting new extent map.

btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
|- get_extent_skip_holes(start=0 len=64k)
|  |- btrfs_get_extent_fiemap(start=0 len=64k)
| |- btrfs_get_extent(start=0 len=64k)
||  Found on-disk (ino, EXTENT_DATA, 0)
||- add_extent_mapping()
||- Return (em->start=0, len=16k)
|
|- fiemap_fill_next_extent(logic=0 phys=X len=16k)
|
|- get_extent_skip_holes(start=0 len=64k)
|  |- btrfs_get_extent_fiemap(start=0 len=64k)
| |- btrfs_get_extent(start=16k len=48k)
||  Found on-disk (ino, EXTENT_DATA, 16k)
||- add_extent_mapping()
||  |- try_merge_map()
|| Merge with previous em start=0 len=16k
|| resulting em start=0 len=32k
||- Return (em->start=0, len=32K)<< Merged result
|- Stripe off the unrelated range (0~16K) of return em
|- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
   ^^^ Causing split fiemap extent.

And since in add_extent_mapping(), em is already merged, in next
fiemap() call, we will get merged result.

[FIX]
Here we introduce a new structure, fiemap_cache, which records previous
fiemap extent.

And will always try to merge current fiemap_cache result before calling
fiemap_fill_next_extent().
Only when we failed to merge current fiemap extent with cached one, we
will call fiemap_fill_next_extent() to submit cached one.

So by this method, we can merge all fiemap extents.



If I understand it correctly, what it's missing currently is 'merge', a
straightfoward idea is to make use of the 'merge' ability of btrfs_get_extent. >
Since btrfs_get_extent_fiemap is a wrapper of btrfs_get_extent, does it make
sense if we make btrfs_get_extent_fiemap iterate all extent maps within the
range passing to it or just one more extent map to check if btrfs_get_extent
could return a merged extent map before returning?


So your idea to to do the merging inside btrfs_get_extent(), instead of 
merging it in extent_fiemap()?


In theory, they have the same effect for fiemap.
And that's my original idea.

But the problem is, btrfs_get_extent() is called almost everywhere, more 
frequently than extent_fiemap(), the extra time used to merging extent 
map may cause performance problem.


For the worst case, if a file is made up by 262144 4K extent (takes up 
1G), btrfs_get_extent() call on the file will iterate all these extents,

which will iterate more than 500 16K tree blocks.

If only fiemap takes such longer time, it's still acceptable. But if 
btrfs_get_extent() takes such longer time, I think it will be a problem 
then.


Thanks,
Qu




If no extent map could be merged, which is the worst case, the first
btrfs_get_extent_fiemap will read file metadata into memory from disk and the
following btrfs_get_extent_fiemap will read the rest of file metadata from
the in-memory extent map tree directly.

Thanks,

-liubo


It can also be done in fs/ioctl.c, however the problem is if
fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
extent.
So I choose to merge it in btrfs.

Signed-off-by: Qu Wenruo 
---
v2:
   Since fiemap_extent_info has a limit for number of fiemap_extent, it's 
possible
   that fiemap_fill_next_extent() return 1 halfway. Remove the WARN_ON() which 
can
   cause kernel warning if we fiemap is called on large compressed file.
v3:
   Rename finish_fiemap_extent() to check_fiemap_extent(), as in v3 we ensured
   submit_fiemap_extent() to submit fiemap cache, so it just acts as a
   sanity check.
   Remove BTRFS_MAX_EXTENT_SIZE limit in submit_fiemap_extent(), as
   extent map can be larger than BTRFS_MAX_EXTENT_SIZE.
   Don't do backward jump, suggested by David.
   Better sanity check and recoverable fix.

To David:
   What about adding a 

Re: [PATCH v3] btrfs: fiemap: Cache and merge fiemap extent before submit it to user

2017-04-19 Thread Liu Bo
On Fri, Apr 07, 2017 at 10:43:15AM +0800, Qu Wenruo wrote:
> [BUG]
> Cycle mount btrfs can cause fiemap to return different result.
> Like:
>  # mount /dev/vdb5 /mnt/btrfs
>  # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
>  # xfs_io -c "fiemap -v" /mnt/btrfs/file
>  /mnt/test/file:
>  EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
>0: [0..127]:25088..25215   128   0x1
>  # umount /mnt/btrfs
>  # mount /dev/vdb5 /mnt/btrfs
>  # xfs_io -c "fiemap -v" /mnt/btrfs/file
>  /mnt/test/file:
>  EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
>0: [0..31]: 25088..2511932   0x0
>1: [32..63]:25120..2515132   0x0
>2: [64..95]:25152..2518332   0x0
>3: [96..127]:   25184..2521532   0x1
> But after above fiemap, we get correct merged result if we call fiemap
> again.
>  # xfs_io -c "fiemap -v" /mnt/btrfs/file
>  /mnt/test/file:
>  EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
>0: [0..127]:25088..25215   128   0x1
> 
> [REASON]
> Btrfs will try to merge extent map when inserting new extent map.
> 
> btrfs_fiemap(start=0 len=(u64)-1)
> |- extent_fiemap(start=0 len=(u64)-1)
>|- get_extent_skip_holes(start=0 len=64k)
>|  |- btrfs_get_extent_fiemap(start=0 len=64k)
>| |- btrfs_get_extent(start=0 len=64k)
>||  Found on-disk (ino, EXTENT_DATA, 0)
>||- add_extent_mapping()
>||- Return (em->start=0, len=16k)
>|
>|- fiemap_fill_next_extent(logic=0 phys=X len=16k)
>|
>|- get_extent_skip_holes(start=0 len=64k)
>|  |- btrfs_get_extent_fiemap(start=0 len=64k)
>| |- btrfs_get_extent(start=16k len=48k)
>||  Found on-disk (ino, EXTENT_DATA, 16k)
>||- add_extent_mapping()
>||  |- try_merge_map()
>|| Merge with previous em start=0 len=16k
>|| resulting em start=0 len=32k
>||- Return (em->start=0, len=32K)<< Merged result
>|- Stripe off the unrelated range (0~16K) of return em
>|- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
>   ^^^ Causing split fiemap extent.
> 
> And since in add_extent_mapping(), em is already merged, in next
> fiemap() call, we will get merged result.
> 
> [FIX]
> Here we introduce a new structure, fiemap_cache, which records previous
> fiemap extent.
> 
> And will always try to merge current fiemap_cache result before calling
> fiemap_fill_next_extent().
> Only when we failed to merge current fiemap extent with cached one, we
> will call fiemap_fill_next_extent() to submit cached one.
> 
> So by this method, we can merge all fiemap extents.
>

If I understand it correctly, what it's missing currently is 'merge', a
straightfoward idea is to make use of the 'merge' ability of btrfs_get_extent.

Since btrfs_get_extent_fiemap is a wrapper of btrfs_get_extent, does it make
sense if we make btrfs_get_extent_fiemap iterate all extent maps within the
range passing to it or just one more extent map to check if btrfs_get_extent
could return a merged extent map before returning?

If no extent map could be merged, which is the worst case, the first
btrfs_get_extent_fiemap will read file metadata into memory from disk and the
following btrfs_get_extent_fiemap will read the rest of file metadata from
the in-memory extent map tree directly.

Thanks,

-liubo

> It can also be done in fs/ioctl.c, however the problem is if
> fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
> extent.
> So I choose to merge it in btrfs.
> 
> Signed-off-by: Qu Wenruo 
> ---
> v2:
>   Since fiemap_extent_info has a limit for number of fiemap_extent, it's 
> possible
>   that fiemap_fill_next_extent() return 1 halfway. Remove the WARN_ON() which 
> can
>   cause kernel warning if we fiemap is called on large compressed file.
> v3:
>   Rename finish_fiemap_extent() to check_fiemap_extent(), as in v3 we ensured
>   submit_fiemap_extent() to submit fiemap cache, so it just acts as a
>   sanity check.
>   Remove BTRFS_MAX_EXTENT_SIZE limit in submit_fiemap_extent(), as
>   extent map can be larger than BTRFS_MAX_EXTENT_SIZE.
>   Don't do backward jump, suggested by David.
>   Better sanity check and recoverable fix.
> 
> To David:
>   What about adding a btrfs_debug_warn(), which will only call WARN_ON(1) if
>   BTRFS_CONFIG_DEBUG is specified for recoverable bug?
> 
>   And modify ASSERT() to always WARN_ON() and exit error code?
> ---
>  fs/btrfs/extent_io.c | 124 
> ++-
>  1 file changed, 122 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 28e8192..c4cb65d 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4353,6 +4353,123 @@ static struct extent_map 
> *get_extent_skip_holes(struct inode *inode,
>   return NULL;
>  }
>  
> +/*
> + * To cache previous fiemap extent
> + *
> 

Re: [PATCH 0/9] Introduce btrfs-modify prog to make corruption easier

2017-04-19 Thread Qu Wenruo



At 04/20/2017 02:15 AM, David Sterba wrote:

On Mon, Apr 17, 2017 at 11:26:33AM +0800, Qu Wenruo wrote:

Introduce a new command, btrfs-modify, which is not part of 'btrfs', but
an independent command, bring minimal impact to existing btrfs commands.

Btrfs-modify is designed to provides better documentation than current
btrfs-corrupt-block with better subcommand division to reduce confusing
or conflicting options.

Btrfs-modify paired with offline-scrub patchset (not merged yet) could
provide a full suite for test case writers to do corruption and recovery
verification.


Starting a new tool makes sense, the btrfs-corrupt-block lacks the
subcommand hierarchy and it would be tedious to sew it in.

The commands that you add now still seem ad-hoc, adressing current
needs.


Yes, the "mirror" command is indeed for current corruption recovery test 
cases.


But the ability to add new command easily could make the tool more generic.


This is how the corrupt-block utility started as well. I'd like
see some proposal of more potential uses and some command gouping. For
example main group: set, get, map, delete, insert, show.


While I prefer to create command groups by their logical level.
For example, "mirror" as the lowest level.
And then "leaf", allowing us to modify specified leaf headers.

Then "item", allowing us to modify btrfs item and its stored structure.
Including modifying, inserting, removing.

The short objective is to allow us to modify one-item-one-structure 
structure, like INODE_ITEM or EXTENT_DATA, but not EXTENT_ITEM(inlined).


By this method, we can reducing the option combinations by limiting some 
options under certain commands.

Or the logic to check option validation will be a hell.

Thanks,
Qu


Then, define
for which objects the commands are applicable. We can start with the
raid56 testing usecase. You want to locate and modify a block baced on
the logical offset, so this should comprise of 'map' + 'set'. For
specific tasks we can add shortcuts or compound commands, not from the
start.



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH ping] btrfs: warn about RAID5/6 being experimental at mount time

2017-04-19 Thread Adam Borowski
Too many people come complaining about losing their data -- and indeed,
there's no warning outside a wiki and the mailing list tribal knowledge.
Message severity chosen for consistency with XFS -- "alert" makes dmesg
produce nice red background which should get the point across.

Signed-off-by: Adam Borowski 
---
Resent alone, the other patch in the original series (dropping the incompat
flag when no longer needed) is pointless on its own.

I intend to ask for inclusion of this one (or an equivalent) in 4.9, either
in Debian or via GregKH -- while for us kernels "that old" are history,
regular users expect stable releases to be free of known serious data loss
bugs.


 fs/btrfs/disk-io.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e54844767fe5..e7f91f70e149 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3083,6 +3083,14 @@ int open_ctree(struct super_block *sb,
btrfs_set_opt(fs_info->mount_opt, SSD);
}
 
+   if ((fs_info->avail_data_alloc_bits |
+fs_info->avail_metadata_alloc_bits |
+fs_info->avail_system_alloc_bits) &
+   BTRFS_BLOCK_GROUP_RAID56_MASK) {
+   btrfs_alert(fs_info,
+   "btrfs RAID5/6 is EXPERIMENTAL and has known data-loss bugs");
+   }
+
/*
 * Mount does not set all options immediately, we can do it now and do
 * not have to wait for transaction commit
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: send, fix file hole not being preserved due to inline extent

2017-04-19 Thread Liu Bo
On Thu, Apr 06, 2017 at 05:07:56PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Normally we don't have inline extents followed by regular extents, but
> there's currently at least one harmless case where this happens. For
> example, when the page size is 4Kb and compression is enabled:
> 
>   $ mkfs.btrfs -f /dev/sdb
>   $ mount -o compress /dev/sdb /mnt
>   $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar
>   $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar
> 
> In this case we get a compressed inline extent, representing 4Kb of
> data, followed by a hole extent and then a regular data extent. The
> inline extent was not expanded/converted to a regular extent exactly
> because it represents 4Kb of data. This does not cause any apparent
> problem (such as the issue solved by commit e1699d2d7bf6
> ("btrfs: add missing memset while reading compressed inline extents"))
> except trigger an unexpected case in the incremental send code path
> that makes us issue an operation to write a hole when it's not needed,
> resulting in more writes at the receiver and wasting space at the
> receiver.
> 
> So teach the incremental send code to deal with this particular case.
> 
> The issue can be currently triggered by running fstests btrfs/137 with
> compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137).
> 

Reviewed-by: Liu Bo 

Thanks,

-liubo
> Signed-off-by: Filipe Manana 
> ---
>  fs/btrfs/send.c | 23 +--
>  1 file changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 456c890..f66095a 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -5184,13 +5184,19 @@ static int is_extent_unchanged(struct send_ctx *sctx,
>   while (key.offset < ekey->offset + left_len) {
>   ei = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
>   right_type = btrfs_file_extent_type(eb, ei);
> - if (right_type != BTRFS_FILE_EXTENT_REG) {
> + if (right_type != BTRFS_FILE_EXTENT_REG &&
> + right_type != BTRFS_FILE_EXTENT_INLINE) {
>   ret = 0;
>   goto out;
>   }
>  
>   right_disknr = btrfs_file_extent_disk_bytenr(eb, ei);
> - right_len = btrfs_file_extent_num_bytes(eb, ei);
> + if (right_type == BTRFS_FILE_EXTENT_INLINE) {
> + right_len = btrfs_file_extent_inline_len(eb, slot, ei);
> + right_len = PAGE_ALIGN(right_len);
> + } else {
> + right_len = btrfs_file_extent_num_bytes(eb, ei);
> + }
>   right_offset = btrfs_file_extent_offset(eb, ei);
>   right_gen = btrfs_file_extent_generation(eb, ei);
>  
> @@ -5204,6 +5210,19 @@ static int is_extent_unchanged(struct send_ctx *sctx,
>   goto out;
>   }
>  
> + /*
> +  * We just wanted to see if when we have an inline extent, what
> +  * follows it is a regular extent (wanted to check the above
> +  * condition for inline extents too). This should normally not
> +  * happen but it's possible for example when we have an inline
> +  * compressed extent representing data with a size matching
> +  * the page size (currently the same as sector size).
> +  */
> + if (right_type == BTRFS_FILE_EXTENT_INLINE) {
> + ret = 0;
> + goto out;
> + }
> +
>   left_offset_fixed = left_offset;
>   if (key.offset < ekey->offset) {
>   /* Fix the right offset for 2a and 7. */
> -- 
> 2.7.0.rc3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/9] Introduce btrfs-modify prog to make corruption easier

2017-04-19 Thread David Sterba
On Mon, Apr 17, 2017 at 11:26:33AM +0800, Qu Wenruo wrote:
> Introduce a new command, btrfs-modify, which is not part of 'btrfs', but
> an independent command, bring minimal impact to existing btrfs commands.
> 
> Btrfs-modify is designed to provides better documentation than current
> btrfs-corrupt-block with better subcommand division to reduce confusing
> or conflicting options.
> 
> Btrfs-modify paired with offline-scrub patchset (not merged yet) could
> provide a full suite for test case writers to do corruption and recovery
> verification.

Starting a new tool makes sense, the btrfs-corrupt-block lacks the
subcommand hierarchy and it would be tedious to sew it in.

The commands that you add now still seem ad-hoc, adressing current
needs. This is how the corrupt-block utility started as well. I'd like
see some proposal of more potential uses and some command gouping. For
example main group: set, get, map, delete, insert, show. Then, define
for which objects the commands are applicable. We can start with the
raid56 testing usecase. You want to locate and modify a block baced on
the logical offset, so this should comprise of 'map' + 'set'. For
specific tasks we can add shortcuts or compound commands, not from the
start.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs/SSD

2017-04-19 Thread Chris Murphy
On Mon, Apr 17, 2017 at 4:55 PM, Hans van Kranenburg
 wrote:
> On 04/17/2017 09:22 PM, Imran Geriskovan wrote:
>> [...]
>>
>> Going over the thread following questions come to my mind:
>>
>> - What exactly does btrfs ssd option does relative to plain mode?
>
> There's quite an amount of information in the the very recent threads:
> - "About free space fragmentation, metadata write amplification and (no)ssd"
> - "BTRFS as a GlusterFS storage back-end, and what I've learned from
> using it as such."
> - "btrfs filesystem keeps allocating new chunks for no apparent reason"
> - ... and a few more
>
> I suspect there will be some "summary" mails at some point, but for now,
> I'd recommend crawling through these threads first.
>
> And now for your instant satisfaction, a short visual guide to the
> difference, which shows actual btrfs behaviour instead of our guesswork
> around it (taken from the second mail thread just mentioned):
>
> -o ssd:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4
>
> -o nossd:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

I'm uncertain from these if the option affects both metadata and data
writes, or just data. The latter makes some sense, if you think a
given data write event contains related files and thus increases the
chance when those files are deleted of having a mostly freed up erase
block. That way wear leveling is doing less work. For metadata writes
it makes less sense to me, and is inconsistent with what I've seen
from metadata chunk allocation. Pretty much anything means dozens or
more 16K nodes are being COWd. e.g. a 2KiB write to systemd journal,
even preallocated, means adding an EXTENT DATA item, one of maybe 200
per node, which means that whole node must be COWd, and whatever its
parent is must be written (ROOT ITEM I think) and then tree root, and
then super block. I see generally 30 16K nodes modified in about 4
minutes with average logging. Even if it's 1 change per 4 minutes, and
all 30 nodes get written to one 2MB block, and then that block isn't
ever written to again, the metadata chunk would be growing and I don't
see that. For weeks or months I see a 512MB metadata chunk and it
doesn't ever get bigger than this.

Anyway, I think ssd mount option still sounds plausibly useful. What
I'm skeptical of on SSD is defragmenting without compression, and also
nocow.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Reporting and monitoring storage events (blog)

2017-04-19 Thread Chris Murphy
http://www-rhstorage.rhcloud.com/blog/vpodzime/reporting-and-monitoring-storage-events

I think the most useful part of this would be standardized messaging.
For the exact same defect state on disk (data corruption), I get two
different formatted messages depending on whether it's found passively
by reading the file, or with a scrub.

(this is 2x disk raid 1)

read file:
[256914.773712] BTRFS warning (device dm-6): csum failed ino 257 off 0
csum 3734069121 expected csum 1334657141
[256914.774594] BTRFS warning (device dm-6): csum failed ino 257 off 0
csum 3734069121 expected csum 1334657141
[256914.775892] BTRFS info (device dm-6): read error corrected: ino
257 off 0 (dev /dev/mapper/VG-b1 sector 2155520)

scrub volume:


[257313.636610] BTRFS warning (device dm-6): checksum error at logical
1103626240 on dev /dev/mapper/VG-b1, sector 2155520, root 5, inode
257, offset 0, length 4096, links 1 (path:
openSUSE-Tumbleweed-NET-x86_64-Current.iso)
[257313.636865] BTRFS error (device dm-6): bdev /dev/mapper/VG-b1
errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[257313.637737] BTRFS error (device dm-6): fixed up error at logical
1103626240 on dev /dev/mapper/VG-b1


Reading means there's a warning, scrubbing means there's an error? So
even the log level is different for the same problem?

And then the ambiguous "read error corrected" vs "fixed up error" -
the second one is more clear that the fix is pushed to a device "fixed
error on device" rather than just an in memory correction. But still,
they're different messages for the same problem and the auto healing.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs-progs: send-dump: always print a space after path

2017-04-19 Thread David Sterba
On Tue, Apr 11, 2017 at 12:33:40PM -0400, Evan Danaher wrote:
> I was shocked to discover that 'btrfs receive --dump' doesn't print a
> space after long filenames, so it runs together into the metadata; for
> example:
> 
> truncate./20-00-03/this-name-is-32-characters-longsize=0
> 
> This is a trivial patch to add a single space unconditionally, so the
> result is the following:
> 
> truncate./20-00-03/this-name-is-32-characters-long size=0
> 
> I suppose this is technically a breaking change, but it seems unlikely
> to me that anyone would depend on the existing behavior given how
> unfriendly it is.
> 
> Signed-off-by: Evan Danaher 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs-progs: send-dump: always print a space after path

2017-04-19 Thread David Sterba
On Thu, Apr 13, 2017 at 10:38:02AM -0400, Noah Massey wrote:
> On Wed, Apr 12, 2017 at 7:05 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> > Evan Danaher posted on Tue, 11 Apr 2017 12:33:40 -0400 as excerpted:
> >
> >> I was shocked to discover that 'btrfs receive --dump' doesn't print a
> >> space after long filenames, so it runs together into the metadata; for
> >> example:
> >>
> >> truncate./20-00-03/this-name-is-32-characters-longsize=0
> >>
> >> This is a trivial patch to add a single space unconditionally, so the
> >> result is the following:
> >>
> >> truncate./20-00-03/this-name-is-32-characters-long size=0
> >>
> >> I suppose this is technically a breaking change, but it seems unlikely
> >> to me that anyone would depend on the existing behavior given how
> >> unfriendly it is.
> >>
> >> Signed-off-by: Evan Danaher 
> >> ---
> >
> > I'm not a dev so won't attempt to comment on the patch itself, but it's
> > worth noting that according to kernel patch submission guidelines (which
> > btrfs-progs use as well) on V2+ patch postings, there should be a short,
> > often one-line per version, summary of what changed between versions.
> > This helps both reviewers and would-be patch-using admins such as myself
> > understand how a patch is evolving, as well as for reviewers preventing
> > unnecessary work when re-reviewing a new version of a patch previously
> > reviewed in an earlier version.
> >
> > On patch series this summary is generally found in the 0/N post, while on
> > individual patches without a 0/N, it's normally found below the first ---
> > delimiter, so as to avoid including the patch history in the final merged
> > version comment.
> 
> To be specific, something like
> > ---
> >
> >v2: fixed an off-by-one error which caused padding to be 33 characters for 
> >short paths
> > ---
> 
> instead of the email tail you currently appended (for future reference).
> FWIW, I'm fine with you adding my 'Reviewed-by', but I don't think it
> carries much weight yet. :-)

It does, I'll add the tag, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: send-dump: always print a space after path

2017-04-19 Thread David Sterba
On Tue, Apr 11, 2017 at 09:49:02AM -0400, Evan Danaher wrote:
> Thanks for catching that!  I overthought this and managed to convince
> myself that it was correct as it stood.
> 
> Should I re-send the whole patch with that change?  This is my first
> attempt at contributing to a project managed in the Linux kernel style,
> so I'm not sure about the process; documentation that I've found around
> responding to feedback seems to be more about social issues than the
> mechanics of what to actually do...

Here's some howto
https://btrfs.wiki.kernel.org/index.php/Developer%27s_FAQ

Depends on what you take from the 'linux kernel style'. We stick to good
changelogs, signed-off-by lines and patches that do one thing at a time.
The interactions experience differs across the subsystems, the btrfs
community is friendly.

Patches from strangers are apprecated as they usually do the extra step
and also try to fix the problem they encounter. The patch does not need
to be perfect, small things I fix myself. If it's not obvious it's
better to do another iteration, eg. when the change needs to be
retested.

> On Tue, Apr 11, 2017 at 09:24:56AM -0400, Noah Massey wrote:
> > On Mon, Apr 10, 2017 at 8:09 PM, Evan Danaher  wrote:
> > > I was shocked to discover that 'btrfs receive --dump' doesn't print a
> > > space after long filenames, so it runs together into the metadata; for
> > > example:
> > >
> > > truncate./20-00-03/this-name-is-32-characters-longsize=0
> > >
> > > This is a trivial patch to add a single space unconditionally, so the
> > > result is the following:
> > >
> > > truncate./20-00-03/this-name-is-32-characters-long size=0
> > >
> > > I suppose this is technically a breaking change, but it seems unlikely
> > > to me that anyone would depend on the existing behavior given how
> > > unfriendly it is.

Agreed, the output needs to be fixed, backward compatibility is not an
issue here.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4] btrfs-progs: misc-tests: Superblock corruption and recovery using backup

2017-04-19 Thread David Sterba
On Sat, Apr 15, 2017 at 03:43:08PM +0530, Lakshmipathi.G wrote:
> Signed-off-by: Lakshmipathi.G 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: docs: Fix newlines for man btrfstune

2017-04-19 Thread David Sterba
On Sun, Apr 16, 2017 at 07:20:02PM +0200, Hans van Kranenburg wrote:
> A bunch of newlines were missing, which resulted in only -S and -r to
> show as option after xmlto is used to convert the documentation to a man
> page.
> 
> The rest of the options would end up being appended to the explanation
> of -r.
> 
> Signed-off-by: Hans van Kranenburg 

Applied, thanks.

> ---
> 
> By the way, I have no idea why -r was showing up in the first place, because
> it also had no newline before it.
> 
> I'm no asciidoc expert, but with consistent newlines, at least for me the
> resulting man page does contain all options nicely listed.

IIRC if the :: section is one paragraph, then the newline does not need
to be there, while if there are more glued together with a +, then the
newline before next :: must be added.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] btrfs-progs: dump-super: check array_size in print_sys_chunk_array

2017-04-19 Thread David Sterba
On Wed, Apr 05, 2017 at 04:59:14PM +0800, Lu Fengqi wrote:
> Without validation of array_size, the dump-super may lead to a bad
> memory access.
> 
> Signed-off-by: Lu Fengqi 
> ---
>  cmds-inspect-dump-super.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c
> index ee2c8e3a..48b5219c 100644
> --- a/cmds-inspect-dump-super.c
> +++ b/cmds-inspect-dump-super.c
> @@ -62,16 +62,23 @@ static void print_sys_chunk_array(struct 
> btrfs_super_block *sb)
>   struct btrfs_key key;
>   int item;
>  
> - buf = malloc(sizeof(*buf) + sizeof(*sb));
> + buf = malloc(sizeof(*buf) + BTRFS_SUPER_INFO_SIZE);

This seems to be unnecessary, the super block structure should contain
entier sys_array.

>   if (!buf) {
>   error("not enough memory");
> - goto out;
> + return;
>   }
> - write_extent_buffer(buf, sb, 0, sizeof(*sb));
> + write_extent_buffer(buf, sb, 0, BTRFS_SUPER_INFO_SIZE);
>   array_size = btrfs_super_sys_array_size(sb);
>  
>   array_ptr = sb->sys_chunk_array;
>   sb_array_offset = offsetof(struct btrfs_super_block, sys_chunk_array);
> +
> + if (array_size > BTRFS_SYSTEM_CHUNK_ARRAY_SIZE) {
> + error("sys_array_size %u shouldn't exceed %u bytes",
> + array_size, BTRFS_SYSTEM_CHUNK_ARRAY_SIZE);
> + goto out;
> + }
> +
>   cur_offset = 0;
>   item = 0;
>  
> @@ -124,8 +131,8 @@ static void print_sys_chunk_array(struct 
> btrfs_super_block *sb)
>   item++;
>   }
>  
> - free(buf);
>  out:
> + free(buf);
>   return;
>  
>  out_short_read:
> -- 
> 2.12.1
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


"No such entry" recurring error

2017-04-19 Thread Tadeáš Berkman
Hello,

i have btrfs on my root partition (openSUSE Tumbleweed) for a year and half, 
but recently i've ran into the following issue - after an attempt to restore a 
snapper snapshot the system stopped booting, claiming:

> BTRFS: error (device sda3) in __btrfs_free_extent:6944: errno=-2 no such entry
> BTRFS: error (device sda3) in btrfs_run_delayed_refs:2956: errno=-2 no such 
> entry

after some searching i ran "btrfs check --repair" on the partition. it fixed 
some errors and the system would start again. but after few days the error 
returned, same as before. so i decided to seek help of someone more 
knowledgeable than me.

system and partition info:
Linux Niddhog 4.10.8-1-default #1 SMP PREEMPT Fri Mar 31 17:16:00 UTC 2017 
(ea9dcd4) x86_64 x86_64 x86_64 GNU/Linux
btrfs-progs v4.9+20170104
Label: none uuid: 08fb0009-baab-44d4-9a3b-1131d9a17719
Total devices 1 FS bytes used 24.05GiB
devid 1 size 40.00GiB used 28.07GiB path /dev/sda3

Data, single: total=24.01GiB, used=22.62GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=2.00GiB, used=1.43GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

attaching check --repair output.
dmesg output: https://pastebin.com/DRrNbe4H

thank you,
Tadeáš Berkman

repair.log
Description: Binary data


Re: [PATCH 5/8] nowait aio: return on congested block device

2017-04-19 Thread Goldwyn Rodrigues


On 04/19/2017 01:45 AM, Christoph Hellwig wrote:
> 
>> +if (bio->bi_opf & REQ_NOWAIT) {
>> +if (!blk_queue_nowait(q)) {
>> +err = -EOPNOTSUPP;
>> +goto end_io;
>> +}
>> +if (!(bio->bi_opf & REQ_SYNC)) {
> 
> I don't understand this check at all..

It is to ensure at the block layer that NOWAIT comes only for DIRECT
calls only. I should probably change it to REQ_SYNC | REQ_IDLE.

> 
>> +if (unlikely(!blk_queue_dying(q) && (bio->bi_opf & 
>> REQ_NOWAIT)))
> 
> Please break lines after 80 characters.
> 
>> @@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct 
>> request_queue *q,
>>  if (likely(!data->hctx))
>>  data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
>>  
>> +if (bio && (bio->bi_opf & REQ_NOWAIT))
>> +data->flags |= BLK_MQ_REQ_NOWAIT;
> 
> Check the op flag again here.
> 
>> +++ b/block/blk-mq.c
>> @@ -1538,6 +1538,8 @@ static blk_qc_t blk_mq_make_request(struct 
>> request_queue *q, struct bio *bio)
>>  rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, );
>>  if (unlikely(!rq)) {
>>  __wbt_done(q->rq_wb, wb_acct);
>> +if (bio && (bio->bi_opf & REQ_NOWAIT))
>> +bio_wouldblock_error(bio);
> 
> bio iѕ dereferences unconditionally above, so you can do the same.
> 
>> @@ -1662,6 +1664,8 @@ static blk_qc_t blk_sq_make_request(struct 
>> request_queue *q, struct bio *bio)
>>  rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, );
>>  if (unlikely(!rq)) {
>>  __wbt_done(q->rq_wb, wb_acct);
>> +if (bio && (bio->bi_opf & REQ_NOWAIT))
>> +bio_wouldblock_error(bio);
> 
> Same here.  Although blk_sq_make_request is gone anyway in the current
> block tree..
> 
>> +/* Request queue supports BIO_NOWAIT */
>> +queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, q);
> 
> BIO_NOWAIT is gone.  And the comment would not be needed if the
> flag had a more descriptive name, e.g. QUEUE_FLAG_NOWAIT_SUPPORT.
> 
> And I think all request based drivers should set the flag implicitly
> as ->queuecommand can't sleep, and ->queue_rq only when it's always
> offloaded to a workqueue when the BLK_MQ_F_BLOCKING flag is set.
> 

Yes, Do we have a central point (like a probe() function call?) where
this can be done?


-- 
Goldwyn
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Tidy while loop in end_compressed_writeback

2017-04-19 Thread Sahil Kang

Hi,

This is my first patch so it's a minor code change.
I think removing the early continue from the loop makes the function a 
little easier to follow.

Please have a look and I'd appreciate any feedback.

Thanks,
Sahil


From 98afe83a570180e841fefe3fd48d450accc42ea3 Mon Sep 17 00:00:00 2001
From: Sahil Kang 
Date: Wed, 19 Apr 2017 04:47:00 -0700
Subject: [PATCH] Tidy while loop in end_compressed_writeback

Instead of continuing early in the loop when ret is 0,
we can decrement/increment nr_pages/index by 1 at the ending.
The for loop will not execute when ret is 0 anyway.

Signed-off-by: Sahil Kang 
---
 fs/btrfs/compression.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index c473c42..c653297 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -238,17 +238,14 @@ static noinline void 
end_compressed_writeback(struct inode *inode,

 ret = find_get_pages_contig(inode->i_mapping, index,
  min_t(unsigned long,
  nr_pages, ARRAY_SIZE(pages)), pages);
-if (ret == 0) {
-nr_pages -= 1;
-index += 1;
-continue;
-}
 for (i = 0; i < ret; i++) {
 if (cb->errors)
 SetPageError(pages[i]);
 end_page_writeback(pages[i]);
 page_cache_release(pages[i]);
 }
+
+ret = ret ? ret : 1; /* set ret to 1 if it's 0 */
 nr_pages -= ret;
 index += ret;
 }
--
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PULL] Btrfs, updates for 4.12

2017-04-19 Thread David Sterba
Hi,

this is the main part of my 4.12 pull, condensed changelog below. I might send
another pull with low-risk patches, mostly cleanups, but so far I'm done with
base testing now. We had a high-churn cycle last time, so this could be small
one and we can concentrate on testing & fixing the raid56 updates.

The qgroup patches have been in for-next but I haven't seen any new review for
the core part.

Updates:
* raid56:
  * fix mirror name in warning message after repair
  * scrub fixes: calculate parity correctly
  * scrub recheck and dev replace race fix
  * enabled auto-repair during read
  * fix warnings during recovery, due to races, bogus reports can appear
* switch to refcount_t where atomic_t was used for plain refcounting
* new and updated tracepoints
* split __btrfs_map_block, clean up
* minor qgroup fixes
* usual cleanups


The following changes since commit 4f7d029b9bf009fbee76bb10c0c4351a1870d2f3:

  Linux 4.11-rc7 (2017-04-16 13:00:18 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-chris-4.12

for you to fetch changes up to c2a9c7ab475bc3aaf06521a39ac65bc48c8cad4f:

  btrfs: check if the device is flush capable (2017-04-18 16:13:27 +0200)


Adam Borowski (1):
  btrfs: fix a bogus warning when converting only data or metadata

Anand Jain (3):
  btrfs: use q which is already obtained from bdev_get_queue
  btrfs: delete unused member nobarriers
  btrfs: check if the device is flush capable

Dan Carpenter (1):
  Btrfs: handle only applicable errors returned by btrfs_get_extent

David Sterba (12):
  btrfs: preallocate radix tree node for readahead
  btrfs: preallocate radix tree node for global readahead tree
  btrfs: remove redundant parameter from btree_readahead_hook
  btrfs: remove redundant parameter from reada_find_zone
  btrfs: remove redundant parameter from reada_start_machine_dev
  btrfs: remove local blocksize variable in reada_find_extent
  btrfs: remove unused qgroup members from btrfs_trans_handle
  btrfs: track exclusive filesystem operation in flags
  btrfs: sink GFP flags parameter to tree_mod_log_insert_move
  btrfs: sink GFP flags parameter to tree_mod_log_insert_root
  btrfs: drop redundant parameters from btrfs_map_sblock
  btrfs: use clear_page where appropriate

Deepa Dinamani (1):
  btrfs: Use ktime_get_real_ts for root ctime

Dmitry V. Levin (1):
  MAINTAINERS: add btrfs file entries for include directories

Edmund Nadolski (2):
  btrfs: provide enumeration for __merge_refs mode argument
  btrfs: replace hardcoded value with SEQ_LAST macro

Elena Reshetova (16):
  btrfs: convert btrfs_bio.refs from atomic_t to refcount_t
  btrfs: convert btrfs_transaction.use_count from atomic_t to refcount_t
  btrfs: convert extent_map.refs from atomic_t to refcount_t
  btrfs: convert btrfs_ordered_extent.refs from atomic_t to refcount_t
  btrfs: convert btrfs_caching_control.count from atomic_t to refcount_t
  btrfs: convert btrfs_delayed_ref_node.refs from atomic_t to refcount_t
  btrfs: convert btrfs_delayed_node.refs from atomic_t to refcount_t
  btrfs: convert btrfs_delayed_item.refs from atomic_t to refcount_t
  btrfs: convert btrfs_root.refs from atomic_t to refcount_t
  btrfs: convert extent_state.refs from atomic_t to refcount_t
  btrfs: convert compressed_bio.pending_bios from atomic_t to refcount_t
  btrfs: convert scrub_recover.refs from atomic_t to refcount_t
  btrfs: convert scrub_block.refs from atomic_t to refcount_t
  btrfs: convert scrub_parity.refs from atomic_t to refcount_t
  btrfs: convert scrub_ctx.refs from atomic_t to refcount_t
  btrfs: convert btrfs_raid_bio.refs from atomic_t to refcount_t

Goldwyn Rodrigues (2):
  btrfs: No need to check !(flags & MS_RDONLY) twice
  btrfs: qgroups: Retry after commit on getting EDQUOT

Hans van Kranenburg (1):
  Btrfs: consistent usage of types in balance_args

Liu Bo (15):
  Btrfs: remove ASSERT in btrfs_truncate_inode_items
  Btrfs: add file item tracepoints
  Btrfs: create a helper for getting chunk map
  Btrfs: separate DISCARD from __btrfs_map_block
  Btrfs: introduce a function to get extra mirror from replace
  Btrfs: handle operations for device replace separately
  Btrfs: do not add extra mirror when dev_replace target dev is not 
available
  Btrfs: helper for ops that requires full stripe
  Btrfs: convert BUG_ON to WARN_ON
  Btrfs: update comments in cache_save_setup
  Btrfs: set scrub page's io_error if failing to submit io
  Btrfs: fix wrong failed mirror_num of read-repair on raid56
  Btrfs: enable repair during read for raid56 profile
  Btrfs: update scrub_parity to use u64 stripe_len
  Btrfs: switch to 

Re: [PATCH 2/8] nowait aio: Introduce RWF_NOWAIT

2017-04-19 Thread Jan Kara
On Wed 19-04-17 05:30:24, Goldwyn Rodrigues wrote:
> 
> 
> On 04/19/2017 01:39 AM, Christoph Hellwig wrote:
> > 
> >> @@ -1593,6 +1593,11 @@ static int io_submit_one(struct kioctx *ctx, struct 
> >> iocb __user *user_iocb,
> >>}
> >>  
> >>req->common.ki_flags |= iocb_rw_flags(iocb->aio_rw_flags);
> >> +  if ((req->common.ki_flags & IOCB_NOWAIT) &&
> >> +  !(req->common.ki_flags & IOCB_DIRECT)) {
> >> +  ret = -EINVAL;
> >> +  goto out_put_req;
> >> +  }
> > 
> > Wrong indentation.  Also I think this should be EOPNOTSUPP here.
> > 
> 
> Do we plan to add support for nowait in buffered I/O in the future? It
> is just too complicated. EINVAL suits best in this case.

Well, it may be complicated in the current implementation but IMO it makes
sense to ask for such functionality (see e.g. patches for non-blocking
buffered reads that were flying around year or two ago). So I agree with
Christoph that EOPNOTSUPP is more logical.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/8] nowait aio: Introduce RWF_NOWAIT

2017-04-19 Thread Goldwyn Rodrigues


On 04/19/2017 01:39 AM, Christoph Hellwig wrote:
> 
>> @@ -1593,6 +1593,11 @@ static int io_submit_one(struct kioctx *ctx, struct 
>> iocb __user *user_iocb,
>>  }
>>  
>>  req->common.ki_flags |= iocb_rw_flags(iocb->aio_rw_flags);
>> +if ((req->common.ki_flags & IOCB_NOWAIT) &&
>> +!(req->common.ki_flags & IOCB_DIRECT)) {
>> +ret = -EINVAL;
>> +goto out_put_req;
>> +}
> 
> Wrong indentation.  Also I think this should be EOPNOTSUPP here.
> 

Do we plan to add support for nowait in buffered I/O in the future? It
is just too complicated. EINVAL suits best in this case.

-- 
Goldwyn
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs check --check-data-csum malfunctioning?

2017-04-19 Thread Henk Slager
> At 04/18/2017 08:41 PM, Werner Braun wrote:
>>
>> Hi,
>>
>> i have a WD WD40EZRX with strange beaviour off btrfs check vs. btrfs scrub
>>
>> running btrfs check --check-data-csum returns no errors on the disk
>>
>> running btrfs scrub on the disk finds tons of errors
>>
>> i could clear the disk and send it to anyone intrested in ;-)
>>
>>
> That's because --check-data-csum will only check the first copy of data if
> the first copy is good.

So is the conclusion that all the csum errors are in the metadata?
What is the profile of the fs? ( not dup for metadata I assume?)

I also have a WD40EZRX and the fs on it is also almost exclusively a
btrfs receive target and it has now for the second time csum (just 5 )
errors. Extended selftest at 16K hours shows no problem and I am not
fully sure if this is a magnetic media error case or something else.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/8] nowait-aio: Introduce IOMAP_NOWAIT

2017-04-19 Thread Jan Kara
On Fri 14-04-17 07:02:53, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps.
> This is used by XFS in the XFS patch.

Goldwyn, the patch is missing your Signed-off-by...

Honza

> ---
>  fs/iomap.c| 2 ++
>  include/linux/iomap.h | 1 +
>  2 files changed, 3 insertions(+)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 141c3cd55a8b..d1c81753d411 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -885,6 +885,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   } else {
>   dio->flags |= IOMAP_DIO_WRITE;
>   flags |= IOMAP_WRITE;
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + flags |= IOMAP_NOWAIT;
>   }
>  
>   if (mapping->nrpages) {
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 7291810067eb..53f6af89c625 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -51,6 +51,7 @@ struct iomap {
>  #define IOMAP_REPORT (1 << 2) /* report extent status, e.g. FIEMAP */
>  #define IOMAP_FAULT  (1 << 3) /* mapping for page fault */
>  #define IOMAP_DIRECT (1 << 4) /* direct I/O */
> +#define IOMAP_NOWAIT (1 << 5) /* Don't wait for writeback */
>  
>  struct iomap_ops {
>   /*
> -- 
> 2.12.0
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/8] nowait aio: xfs

2017-04-19 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/8] nowait aio: return on congested block device

2017-04-19 Thread Christoph Hellwig
On Fri, Apr 14, 2017 at 07:02:54AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> A new bio operation flag REQ_NOWAIT is introduced to identify bio's

s/bio/block/

> @@ -1232,6 +1232,11 @@ static struct request *get_request(struct 
> request_queue *q, unsigned int op,
>   if (!IS_ERR(rq))
>   return rq;
>  
> + if (bio && (bio->bi_opf & REQ_NOWAIT)) {
> + blk_put_rl(rl);
> + return ERR_PTR(-EAGAIN);
> + }

Please check the op argument instead of touching bio.

> + if (bio->bi_opf & REQ_NOWAIT) {
> + if (!blk_queue_nowait(q)) {
> + err = -EOPNOTSUPP;
> + goto end_io;
> + }
> + if (!(bio->bi_opf & REQ_SYNC)) {

I don't understand this check at all..

> + if (unlikely(!blk_queue_dying(q) && (bio->bi_opf & 
> REQ_NOWAIT)))

Please break lines after 80 characters.

> @@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct 
> request_queue *q,
>   if (likely(!data->hctx))
>   data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
>  
> + if (bio && (bio->bi_opf & REQ_NOWAIT))
> + data->flags |= BLK_MQ_REQ_NOWAIT;

Check the op flag again here.

> +++ b/block/blk-mq.c
> @@ -1538,6 +1538,8 @@ static blk_qc_t blk_mq_make_request(struct 
> request_queue *q, struct bio *bio)
>   rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, );
>   if (unlikely(!rq)) {
>   __wbt_done(q->rq_wb, wb_acct);
> + if (bio && (bio->bi_opf & REQ_NOWAIT))
> + bio_wouldblock_error(bio);

bio iѕ dereferences unconditionally above, so you can do the same.

> @@ -1662,6 +1664,8 @@ static blk_qc_t blk_sq_make_request(struct 
> request_queue *q, struct bio *bio)
>   rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, );
>   if (unlikely(!rq)) {
>   __wbt_done(q->rq_wb, wb_acct);
> + if (bio && (bio->bi_opf & REQ_NOWAIT))
> + bio_wouldblock_error(bio);

Same here.  Although blk_sq_make_request is gone anyway in the current
block tree..

> + /* Request queue supports BIO_NOWAIT */
> + queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, q);

BIO_NOWAIT is gone.  And the comment would not be needed if the
flag had a more descriptive name, e.g. QUEUE_FLAG_NOWAIT_SUPPORT.

And I think all request based drivers should set the flag implicitly
as ->queuecommand can't sleep, and ->queue_rq only when it's always
offloaded to a workqueue when the BLK_MQ_F_BLOCKING flag is set.

> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -480,8 +480,12 @@ static int dio_bio_complete(struct dio *dio, struct bio 
> *bio)
>   unsigned i;
>   int err;
>  
> - if (bio->bi_error)
> - dio->io_error = -EIO;
> + if (bio->bi_error) {
> + if (bio->bi_opf & REQ_NOWAIT)
> + dio->io_error = -EAGAIN;
> + else
> + dio->io_error = -EIO;
> + }

Huh?  Once REQ_NOWAIT is set all errors are -EAGAIN?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/8] nowait-aio: Introduce IOMAP_NOWAIT

2017-04-19 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/8] nowait aio: Introduce RWF_NOWAIT

2017-04-19 Thread Christoph Hellwig
>   }
>  
> -
>   /* prevent overflows */

Weird whitespace change.

> @@ -1593,6 +1593,11 @@ static int io_submit_one(struct kioctx *ctx, struct 
> iocb __user *user_iocb,
>   }
>  
>   req->common.ki_flags |= iocb_rw_flags(iocb->aio_rw_flags);
> + if ((req->common.ki_flags & IOCB_NOWAIT) &&
> + !(req->common.ki_flags & IOCB_DIRECT)) {
> + ret = -EINVAL;
> + goto out_put_req;
> + }

Wrong indentation.  Also I think this should be EOPNOTSUPP here.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/8] Use RWF_* flags for AIO operations

2017-04-19 Thread Christoph Hellwig
On Fri, Apr 14, 2017 at 07:02:50AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> RWF_* flags is used for preadv2/pwritev2 calls. Port to use
> it for aio operations as well. For this, aio_rw_flags is
> introduced in struct iocb (using aio_reserved1) which will
> carry these flags.
> 
> This is a precursor to the nowait AIO calls.
> 
> Note, the only place RWF_HIPRI comes in effect is dio_await_one().
> All the rest of the locations, aio code return -EIOCBQUEUED before the
> checks for RWF_HIPRI.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/aio.c | 10 +-
>  fs/read_write.c  |  7 +--
>  include/linux/fs.h   | 12 
>  include/uapi/linux/aio_abi.h |  2 +-
>  4 files changed, 23 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index f52d925ee259..b8a33f5beef5 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1541,11 +1541,17 @@ static int io_submit_one(struct kioctx *ctx, struct 
> iocb __user *user_iocb,
>   ssize_t ret;
>  
>   /* enforce forwards compatibility on users */
> - if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
> + if (unlikely(iocb->aio_reserved2)) {
>   pr_debug("EINVAL: reserve field set\n");
>   return -EINVAL;
>   }
>  
> + if (unlikely(iocb->aio_rw_flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))) 
> {
> + pr_debug("EINVAL: aio_rw_flags set with incompatible flags\n");
> + return -EINVAL;
> + }
> +

> + req->common.ki_flags |= iocb_rw_flags(iocb->aio_rw_flags);

The flag validity checking also needs to go into what's currently
iocb_rw_flags (which will need a new name and a return value).

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html