from:"Lukáš Czerner"

Re: [PATCH] scsi_debug: Fix off-by-one bug when unmapping region

2012-09-05 Thread Lukáš Czerner

On Fri, 17 Aug 2012, Martin K. Petersen wrote:

> Date: Fri, 17 Aug 2012 12:11:50 -0400
> From: Martin K. Petersen 
> To: Lukas Czerner 
> Cc: linux-s...@vger.kernel.org, jbottom...@parallels.com, ty...@mit.edu,
> pbonz...@redhat.com, linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] scsi_debug: Fix off-by-one bug when unmapping region
> 
> > "Lukas" == Lukas Czerner  writes:
> 
> Lukas> Currently it is possible to unmap one more block than user
> Lukas> requested to due to the off-by-one error in unmap_region(). This
> Lukas> is probably due to the fact that the end variable despite its
> Lukas> name actually points to the last block to unmap + 1. However in
> Lukas> the condition it is handled as the last block of the region to
> Lukas> unmap.
> 
> Acked-by: Martin K. Petersen 
> 

I am still not seeing it in anywhere, is this going in any time soon
?

Thanks!
-Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without CAP_SYS_RAWIO

2012-09-06 Thread Lukáš Czerner

On Thu, 6 Sep 2012, Paolo Bonzini wrote:

> Date: Thu, 06 Sep 2012 14:36:53 +0200
> From: Paolo Bonzini 
> To: Ric Wheeler 
> Cc: ax...@kernel.dk, Mike Snitzer ,
> Alan Cox ,
> Martin K. Petersen ,
> linux-kernel@vger.kernel.org, linux-s...@vger.kernel.org
> Subject: Re: [Ping^3] Re: [PATCH] sg_io: allow UNMAP and WRITE SAME without
> CAP_SYS_RAWIO
> 
> Il 06/09/2012 14:08, Ric Wheeler ha scritto:
> >> According to the standard, the translation layer can write a
> >> user-provided pattern to every sector in the disk.  It's an optional
> >> feature and libata doesn't do that, but it is still possible.
> > 
> > It is not possible today with our stack though, any patch that would
> > change that would also need to be vetted.
> 
> It is not possible with SATA disks, but native SCSI disks might well
> interpret FORMAT UNIT destructively.
> 
> >>> I don't see allowing anyone who can open the device to zero the data as
> >>> better though :)
> >> Note: anyone who can open it for writing!  And they can just as well
> >> issue WRITE, it just takes a little more effort than with WRITE SAME. :)
> >>   If you only have read access, you cannot issue WRITE or FORMAT UNIT,
> >> and with this patch you will not be able to issue WRITE SAME.
> > 
> > This just seems like an argument over whether or not capabilities make
> > sense. In general, anything as destructive as a single CDB that can kill
> > all of your data should be tightly controlled.
> 
> In practice, a single write to the first MB of the disk is just as
> destructive.  For that you do not even need a SCSI command.
> 
> > Pushing more code in the data path is not where we are going - we
> > routinely need to disable IO scheduling for example when driving IO to
> > high speed/low latency devices and are actively looking at how to tackle
> > other performance bottlenecks in the stack.
> 
> I am not talking about the regular data path, only of SG_IO.
> 
> > I don't see a strong reason that our existing scheme (root or
> > CAP_SYS_RAWIO access) prevents you from doing what you need to do.
> 
> Here are three:
> 
> - CAP_SYS_RAWIO partly bypasses DAC; you can issue destructive commands
> even if you only opened the disk for reading.  CAP_SYS_RAWIO also gives
> access to _really_ destructive commands (WRITE BUFFER and PERSISTENT
> RESERVE OUT for example).
> 
> - CAP_SYS_RAWIO lets you send SCSI commands to partitions, and they will
> gladly read/write the disk going outside the boundaries of the
> partition.  Changing this behavior was rejected upstream already.
> 
> - CAP_SYS_RAWIO also gives access to I/O ports, mmap at address 0, and
> too many other insecure things.
> 
> All the above mean that:
> 
> - any application using CAP_SYS_RAWIO would have to implement its own
> whitelisting, even if just to duplicate what is done in the kernel;
> 
> - exploiting a CAP_SYS_RAWIO process leads to root too easily, and it is
> not possible to give the capability to anything that will run in a
> hostile environment (in my case QEMU).

So at fist I did not think this is such a good idea however there
are several good points you've mentioned.

CAP_SYS_RAWIO is indeed too big hammer for this and it is not secure
to allow such application to possess such capability. Moreover
WRITE_SAME indeed is almost the SAME as WRITE :), only easier and
delegated to the storage itself.

UNMAP or WRITE_WAME w/ unamp bit is a little bit trickier but
thinking about it some more I do not see any real reason why the
user with write permission should not be able to use this. Yes, it
is not technically write and it has other consequences as well, but
none of it seems to be exploitable more than simple write.

Moreover looking at BLKDISCARD ioctl there is no such restrictions
(obviously) but neither is with TRIM ata command. So with sata SSD
you're allowed to use TRIM command if you have write permission to
that device. So I guess having this consistent is good idea and
considering the points above I think it is ok to allow WRITE_SAME
and UNMAP without CAP_SYS_RAWIO.

Thanks!
-Lukas

> 
> Paolo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/21] drop vmtruncate

2012-09-06 Thread Lukáš Czerner

On Fri, 31 Aug 2012, Marco Stornelli wrote:

> Date: Fri, 31 Aug 2012 15:50:20 +0200
> From: Marco Stornelli 
> To: Linux FS Devel , linux...@kvack.org
> Cc: Linux Kernel 
> Subject: [PATCH 00/21] drop vmtruncate
> 
> Hi all,
> 
> with this patch series I try to clean the vmtruncate code. The theory of
> operation:
> 
> old   new
> vmtruncate() =>   inode_newsize_ok+truncate_setsize+fs truncate
> 
> Where vmtruncate was used without any error check, the code now is:
> 
> if (inode_newsize_ok() == 0) {
>   truncate_setsize();
>   fs truncate();
> }
> 
> So, performance and semantic nothing change at all. I think that maybe in some
> point we can skip inode_newsize_ok (where the error check of vmtruncate wasn't
> used) but since there is a swap check in case of no-extension, maybe it's
> better to avoid regressions. After this clean, of course, each fs can clean in
> a deeply way.
> 
> With these patches even the inode truncate callback is deleted.
> 
> Any comments/feedback/bugs are welcome.

Could you explain the reason behind this change a little bit more ?
This does not make any sense to me since you're replacing
vmtruncate() which does basically 

if (inode_newsize_ok() == 0) {
truncate_setsize();
fs truncate();
}

as you mentioned above by exactly the same thing but doing it within
the file system. It does not seem like an improvement to me ... how
is this a clean up ?

Thanks!
-Lukas

> 
> Marco Stornelli (21):
>   ufs: drop vmtruncate
>   sysv: drop vmtruncate
>   reiserfs: drop vmtruncate
>   procfs: drop vmtruncate
>   omfs: drop vmtruncate
>   ocfs2: drop vmtruncate
>   adfs: drop vmtruncate
>   affs: drop vmtruncate
>   bfs: drop vmtruncate
>   hfs: drop vmtruncate
>   hpfs: drop vmtruncate
>   jfs: drop vmtruncate
>   hfsplus: drop vmtruncate
>   hostfs: drop vmtruncate
>   logfs: drop vmtruncate
>   minix: drop vmtruncate
>   ncpfs: drop vmtruncate
>   nilfs2: drop vmtruncate
>   ntfs: drop vmtruncate
>   vfs: drop vmtruncate
>   mm: drop vmtruncate
> 
>  fs/adfs/inode.c |5 +++--
>  fs/affs/file.c  |8 +---
>  fs/affs/inode.c |5 -
>  fs/bfs/file.c   |5 +++--
>  fs/hfs/inode.c  |   19 +--
>  fs/hfsplus/inode.c  |   19 +--
>  fs/hostfs/hostfs_kern.c |8 +---
>  fs/hpfs/file.c  |8 +---
>  fs/hpfs/inode.c |5 -
>  fs/jfs/file.c   |6 --
>  fs/jfs/inode.c  |   13 +
>  fs/libfs.c  |2 --
>  fs/logfs/readwrite.c|   10 --
>  fs/minix/file.c |6 --
>  fs/minix/inode.c|7 +--
>  fs/ncpfs/inode.c|4 +++-
>  fs/nilfs2/file.c|1 -
>  fs/nilfs2/inode.c   |   18 +-
>  fs/nilfs2/recovery.c|7 +--
>  fs/ntfs/file.c  |8 +---
>  fs/ntfs/inode.c |   11 +--
>  fs/ntfs/inode.h |4 
>  fs/ocfs2/file.c |3 ++-
>  fs/omfs/file.c  |   12 
>  fs/proc/base.c  |3 ++-
>  fs/proc/generic.c   |3 ++-
>  fs/proc/proc_sysctl.c   |3 ++-
>  fs/reiserfs/file.c  |3 +--
>  fs/reiserfs/inode.c |   15 +++
>  fs/reiserfs/reiserfs.h  |1 +
>  fs/sysv/file.c  |5 +++--
>  fs/sysv/itree.c |7 +--
>  fs/ufs/inode.c  |5 +++--
>  include/linux/fs.h  |1 -
>  include/linux/mm.h  |1 -
>  mm/truncate.c   |   23 ---
>  36 files changed, 164 insertions(+), 100 deletions(-)
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ext4fs error "ext4_mb_generate_buddy:741:group 16, 8160 clusters in bitmap, 4064 in gd" (with repro)

2012-08-15 Thread Lukáš Czerner

On Thu, 9 Aug 2012, Theodore Ts'o wrote:

> Date: Thu, 9 Aug 2012 13:06:40 -0400
> From: Theodore Ts'o 
> To: Lukas Czerner 
> Cc: Paolo Bonzini ,
> "Linux Kernel mailinlinux-e...@vger.kernel.orgg List"
> , linux-e...@vger.kernel.org
> Subject: Re: ext4fs error
> "ext4_mb_generate_buddy:741:group 16, 8160 clusters in bitmap, 4064 in gd"
>  (with repro)
> 
> On Thu, Aug 09, 2012 at 12:00:09PM +0200, Paolo Bonzini wrote:
> > Here is how to reproduce it.  It happens during fstrim.  I found other
> > occurrences of the error in the mailing list, but they were not related
> > to trim so they may be something different.
> > 
> > modprobe scsi_debug dev_size_mb=256 lbpws=1
> > dd if=/dev/zero of=/dev/sdb bs=1M  
> > fdisk /dev/sdb
> >  >> create a new partition accepting all defaults
> > fdisk -lu /dev/sdb|tail -1
> >  >> should show: /dev/sdb1 57  524285  262114+  83  Linux
> > 
> > mkfs.ext4 /dev/sdb1
> > mkdir test
> > mount /dev/sdb1 test
> > fstrim ./test
> 
> I can confirm that this accurately reproduces file system corruption
> using a 3.5 kernel.  It looks like some block allocation bitmap blocks
> is getting trimmed when it shouldn't have been.  Lukas, can you take a
> look at this?
> 
>   - Ted

Hi Ted,

sorry for the delay, I've just got back from my vacation. I'll take
a look at it.

Thanks!
-Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2] xfs: check for possible overflow in xfs_ioc_trim

2012-08-08 Thread Lukáš Czerner

On Wed, 1 Aug 2012, Tomas Racek wrote:

> Date: Wed,  1 Aug 2012 15:45:37 +0200
> From: Tomas Racek 
> To: linux-...@vger.kernel.org
> Cc: lczer...@redhat.com, Tomas Racek ,
> Ben Myers , Alex Elder ,
> "supporter:XFS FILESYSTEM" ,
> open list 
> Subject: [PATCH v2] xfs: check for possible overflow in xfs_ioc_trim
> 
> If range.start points behind the filesystem, return invalid value error.
> This fixes possible overflow in
> 
> start = BTOBB(range.start)
> 
> when range.start is nearly ULLONG_MAX.
> 
> Signed-off-by: Tomas Racek 

The fix looks good to me, but I have one comment bellow.

> ---
>  fs/xfs/xfs_discard.c |5 +++--
>  1 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
> index f9c3fe3..33d367f 100644
> --- a/fs/xfs/xfs_discard.c
> +++ b/fs/xfs/xfs_discard.c
> @@ -179,12 +179,13 @@ xfs_ioc_trim(
>* used by the fstrim application.  In the end it really doesn't
>* matter as trimming blocks is an advisory interface.
>*/
> + if (range.start >= XFS_FSB_TO_B(mp, mp->m_sb.sb_dblocks))
> + return -XFS_ERROR(EINVAL);
> +
>   start = BTOBB(range.start);
>   end = start + BTOBBT(range.len) - 1;
>   minlen = BTOBB(max_t(u64, granularity, range.minlen));

It seems that we have the same problem here with range.minlen. It
is highly unlikely and stupid from the user space to send minlen of
such size, but it is possible.

Thanks!
-Lukas

>  
> - if (XFS_BB_TO_FSB(mp, start) >= mp->m_sb.sb_dblocks)
> - return -XFS_ERROR(EINVAL);
>   if (end > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1)
>   end = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)- 1;
>  
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] ext3: fix return values on parse_options() failure

2012-10-08 Thread Lukáš Czerner

On Tue, 9 Oct 2012, Zhao Hongjiang wrote:

> Date: Tue, 09 Oct 2012 13:48:47 +0800
> From: Zhao Hongjiang 
> To: j...@suse.cz
> Cc: a...@linux-foundation.org, adilger.ker...@dilger.ca,
> linux-e...@vger.kernel.org, linux-kernel@vger.kernel.org,
> Eric W. Biederman , serge.hal...@canonical.com,
> contain...@lists.linux-foundation.org
> Subject: [PATCH] ext3: fix return values on parse_options() failure
> 
> From: Zhao Hongjiang 
> 
> parse_options() in ext3 should return 0 when parse the mount options fails.
> 

Ah, it really is a joy to read the parse_options() :). But this fix
makes sense, thanks for catching it.

Reviewed-by: Lukas Czerner 

-Lukas

> Signed-off-by: Zhao Hongjiang 
> ---
>  fs/ext3/super.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 17ae5c8..ebf8312 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -1001,7 +1001,7 @@ static int parse_options (char *options, struct 
> super_block *sb,
>   uid = make_kuid(current_user_ns(), option);
>   if (!uid_valid(uid)) {
>   ext3_msg(sb, KERN_ERR, "Invalid uid value %d", 
> option);
> - return -1;
> + return 0;
> 
>   }
>   sbi->s_resuid = uid;
> @@ -1012,7 +1012,7 @@ static int parse_options (char *options, struct 
> super_block *sb,
>   gid = make_kgid(current_user_ns(), option);
>   if (!gid_valid(gid)) {
>   ext3_msg(sb, KERN_ERR, "Invalid gid value %d", 
> option);
> - return -1;
> + return 0;
>   }
>   sbi->s_resgid = gid;
>   break;
> -- 1.7.1
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] ext2: fix return values on parse_options() failure

2012-10-08 Thread Lukáš Czerner

On Tue, 9 Oct 2012, Zhao Hongjiang wrote:

> Date: Tue, 09 Oct 2012 13:44:36 +0800
> From: Zhao Hongjiang 
> To: j...@suse.cz
> Cc: linux-e...@vger.kernel.org, linux-kernel@vger.kernel.org,
> Eric W. Biederman , serge.hal...@canonical.com,
> contain...@lists.linux-foundation.org
> Subject: [PATCH] ext2: fix return values on parse_options() failure
> 
> From: Zhao Hongjiang 
> 
> parse_options() in ext2 should return 0 when parse the mount options fails.
> 
> Signed-off-by: Zhao Hongjiang 

Reviewed-by: Lukas Czerner 

Thanks!
-Lukas

> ---
>  fs/ext2/super.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 6c205d0..fa04d02 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -469,7 +469,7 @@ static int parse_options(char *options, struct 
> super_block *sb)
>   uid = make_kuid(current_user_ns(), option);
>   if (!uid_valid(uid)) {
>   ext2_msg(sb, KERN_ERR, "Invalid uid value %d", 
> option);
> - return -1;
> + return 0;
> 
>   }
>   sbi->s_resuid = uid;
> @@ -480,7 +480,7 @@ static int parse_options(char *options, struct 
> super_block *sb)
>   gid = make_kgid(current_user_ns(), option);
>   if (!gid_valid(gid)) {
>   ext2_msg(sb, KERN_ERR, "Invalid gid value %d", 
> option);
> - return -1;
> + return 0;
>   }
>   sbi->s_resgid = gid;
>   break;
> -- 1.7.1
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 00/16] f2fs: introduce flash-friendly file system

2012-10-09 Thread Lukáš Czerner

On Mon, 8 Oct 2012, Jaegeuk Kim wrote:

> Date: Mon, 08 Oct 2012 19:52:03 +0900
> From: Jaegeuk Kim 
> To: 'Namjae Jeon' 
> Cc: 'Vyacheslav Dubeyko' ,
> 'Marco Stornelli' ,
> 'Jaegeuk Kim' ,
> 'Al Viro' , ty...@mit.edu,
> gre...@linuxfoundation.org, linux-kernel@vger.kernel.org,
> chur@samsung.com, cm224@samsung.com, jooyoung.hw...@samsung.com,
> linux-fsde...@vger.kernel.org
> Subject: RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
> 
> > -Original Message-
> > From: Namjae Jeon [mailto:linkinj...@gmail.com]
> > Sent: Monday, October 08, 2012 7:00 PM
> > To: Jaegeuk Kim
> > Cc: Vyacheslav Dubeyko; Marco Stornelli; Jaegeuk Kim; Al Viro; 
> > ty...@mit.edu;
> > gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; 
> > chur@samsung.com; cm224@samsung.com;
> > jooyoung.hw...@samsung.com; linux-fsde...@vger.kernel.org
> > Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > 
> > 2012/10/8, Jaegeuk Kim :
> > >> -Original Message-
> > >> From: Vyacheslav Dubeyko [mailto:sl...@dubeyko.com]
> > >> Sent: Sunday, October 07, 2012 9:09 PM
> > >> To: Jaegeuk Kim
> > >> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; ty...@mit.edu;
> > >> gre...@linuxfoundation.org; linux-
> > >> ker...@vger.kernel.org; chur@samsung.com; cm224@samsung.com;
> > >> jooyoung.hw...@samsung.com;
> > >> linux-fsde...@vger.kernel.org
> > >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > >>
> > >> Hi,
> > >>
> > >> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
> > >>
> > >> >> -Original Message-
> > >> >> From: Marco Stornelli [mailto:marco.storne...@gmail.com]
> > >> >> Sent: Sunday, October 07, 2012 4:10 PM
> > >> >> To: Jaegeuk Kim
> > >> >> Cc: Vyacheslav Dubeyko; jaegeuk@samsung.com; Al Viro;
> > >> >> ty...@mit.edu; gre...@linuxfoundation.org;
> > >> >> linux-kernel@vger.kernel.org; chur@samsung.com;
> > >> >> cm224@samsung.com;
> > >> jooyoung.hw...@samsung.com;
> > >> >> linux-fsde...@vger.kernel.org
> > >> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > >> >>
> > >> >> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
> > >> >>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
> > >>  Hi Jaegeuk,
> > >> >>>
> > >> >>> Hi.
> > >> >>> We know each other, right? :)
> > >> >>>
> > >> 
> > >> > From:  김재극 
> > >> > To:v...@zeniv.linux.org.uk, 'Theodore Ts'o' 
> > >> > ,
> > >> >> gre...@linuxfoundation.org, linux-kernel@vger.kernel.org,
> > >> >> chur@samsung.com,
> > >> cm224@samsung.com,
> > >> >> jaegeuk@samsung.com, jooyoung.hw...@samsung.com
> > >> > Subject:   [PATCH 00/16] f2fs: introduce flash-friendly 
> > >> > file system
> > >> > Date:  Fri, 05 Oct 2012 20:55:07 +0900
> > >> >
> > >> > This is a new patch set for the f2fs file system.
> > >> >
> > >> > What is F2FS?
> > >> > =
> > >> >
> > >> > NAND flash memory-based storage devices, such as SSD, eMMC, and SD
> > >> > cards, have
> > >> > been widely being used for ranging from mobile to server systems.
> > >> > Since they are
> > >> > known to have different characteristics from the conventional
> > >> > rotational disks,
> > >> > a file system, an upper layer to the storage device, should adapt 
> > >> > to
> > >> > the changes
> > >> > from the sketch.
> > >> >
> > >> > F2FS is a new file system carefully designed for the NAND flash
> > >> > memory-based storage
> > >> > devices. We chose a log structure file system approach, but we 
> > >> > tried
> > >> > to adapt it
> > >> > to the new form of storage. Also we remedy some known issues of the
> > >> > very old log
> > >> > structured file system, such as snowball effect of wandering tree
> > >> > and high cleaning
> > >> > overhead.
> > >> >
> > >> > Because a NAND-based storage device shows different characteristics
> > >> > according to
> > >> > its internal geometry or flash memory management scheme aka FTL, we
> > >> > add various
> > >> > parameters not only for configuring on-disk layout, but also for
> > >> > selecting allocation
> > >> > and cleaning algorithms.
> > >> >
> > >> 
> > >>  What about F2FS performance? Could you share benchmarking results of
> > >>  the new file system?
> > >> 
> > >>  It is very interesting the case of aged file system. How is GC's
> > >>  implementation efficient? Could
> > >> >> you share benchmarking results for the very aged file system state?
> > >> 
> > >> >>>
> > >> >>> Although I have benchmark results, currently I'd like to see the
> > >> >>> results
> > >> >>> measured by community as a black-box. As you know, the results are
> > >> >>> very
> > >> >>> dependent on the workloads and parameters, so I think it would be
> > >> >>> be

RE: [PATCH 00/16] f2fs: introduce flash-friendly file system

2012-10-09 Thread Lukáš Czerner

On Tue, 9 Oct 2012, Jaegeuk Kim wrote:

> Date: Tue, 09 Oct 2012 19:45:57 +0900
> From: Jaegeuk Kim 
> To: 'Lukáš Czerner' 
> Cc: 'Namjae Jeon' ,
> 'Vyacheslav Dubeyko' ,
> 'Marco Stornelli' ,
> 'Jaegeuk Kim' ,
> 'Al Viro' , ty...@mit.edu,
> gre...@linuxfoundation.org, linux-kernel@vger.kernel.org,
> chur@samsung.com, cm224@samsung.com, jooyoung.hw...@samsung.com,
> linux-fsde...@vger.kernel.org
> Subject: RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
> 
> > -Original Message-
> > From: linux-fsdevel-ow...@vger.kernel.org 
> > [mailto:linux-fsdevel-ow...@vger.kernel.org] On Behalf Of
> > Luka? Czerner
> > Sent: Tuesday, October 09, 2012 5:32 PM
> > To: Jaegeuk Kim
> > Cc: 'Namjae Jeon'; 'Vyacheslav Dubeyko'; 'Marco Stornelli'; 'Jaegeuk Kim'; 
> > 'Al Viro'; ty...@mit.edu;
> > gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; 
> > chur@samsung.com; cm224@samsung.com;
> > jooyoung.hw...@samsung.com; linux-fsde...@vger.kernel.org
> > Subject: RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > 
> > On Mon, 8 Oct 2012, Jaegeuk Kim wrote:
> > 
> > > Date: Mon, 08 Oct 2012 19:52:03 +0900
> > > From: Jaegeuk Kim 
> > > To: 'Namjae Jeon' 
> > > Cc: 'Vyacheslav Dubeyko' ,
> > > 'Marco Stornelli' ,
> > > 'Jaegeuk Kim' ,
> > > 'Al Viro' , ty...@mit.edu,
> > > gre...@linuxfoundation.org, linux-kernel@vger.kernel.org,
> > > chur@samsung.com, cm224@samsung.com, 
> > > jooyoung.hw...@samsung.com,
> > > linux-fsde...@vger.kernel.org
> > > Subject: RE: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > >
> > > > -Original Message-
> > > > From: Namjae Jeon [mailto:linkinj...@gmail.com]
> > > > Sent: Monday, October 08, 2012 7:00 PM
> > > > To: Jaegeuk Kim
> > > > Cc: Vyacheslav Dubeyko; Marco Stornelli; Jaegeuk Kim; Al Viro; 
> > > > ty...@mit.edu;
> > > > gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; 
> > > > chur@samsung.com;
> > cm224@samsung.com;
> > > > jooyoung.hw...@samsung.com; linux-fsde...@vger.kernel.org
> > > > Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > > >
> > > > 2012/10/8, Jaegeuk Kim :
> > > > >> -Original Message-
> > > > >> From: Vyacheslav Dubeyko [mailto:sl...@dubeyko.com]
> > > > >> Sent: Sunday, October 07, 2012 9:09 PM
> > > > >> To: Jaegeuk Kim
> > > > >> Cc: 'Marco Stornelli'; 'Jaegeuk Kim'; 'Al Viro'; ty...@mit.edu;
> > > > >> gre...@linuxfoundation.org; linux-
> > > > >> ker...@vger.kernel.org; chur@samsung.com; cm224@samsung.com;
> > > > >> jooyoung.hw...@samsung.com;
> > > > >> linux-fsde...@vger.kernel.org
> > > > >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file system
> > > > >>
> > > > >> Hi,
> > > > >>
> > > > >> On Oct 7, 2012, at 1:31 PM, Jaegeuk Kim wrote:
> > > > >>
> > > > >> >> -Original Message-
> > > > >> >> From: Marco Stornelli [mailto:marco.storne...@gmail.com]
> > > > >> >> Sent: Sunday, October 07, 2012 4:10 PM
> > > > >> >> To: Jaegeuk Kim
> > > > >> >> Cc: Vyacheslav Dubeyko; jaegeuk@samsung.com; Al Viro;
> > > > >> >> ty...@mit.edu; gre...@linuxfoundation.org;
> > > > >> >> linux-kernel@vger.kernel.org; chur@samsung.com;
> > > > >> >> cm224@samsung.com;
> > > > >> jooyoung.hw...@samsung.com;
> > > > >> >> linux-fsde...@vger.kernel.org
> > > > >> >> Subject: Re: [PATCH 00/16] f2fs: introduce flash-friendly file 
> > > > >> >> system
> > > > >> >>
> > > > >> >> Il 06/10/2012 22:06, Jaegeuk Kim ha scritto:
> > > > >> >>> 2012-10-06 (토), 17:54 +0400, Vyacheslav Dubeyko:
> > > > >> >>>> Hi Jaegeuk,
> > > > >> >>>
> > > > >> >>> Hi.
> > > >

RE: [PATCH 00/16] f2fs: introduce flash-friendly file system

2012-10-09 Thread Lukáš Czerner

On Tue, 9 Oct 2012, Jaegeuk Kim wrote:

> > > > > > >
> > > > > > > As you can see the f2fs kernel document patch, I think one of the 
> > > > > > > most
> > > > > > > important features is to align operating units between f2fs and 
> > > > > > > ftl.
> > > > > > > Specifically, f2fs has section and zone, which are cleaning unit 
> > > > > > > and basic
> > > > > > > allocation unit respectively.
> > > > > > > Through these configurable units in f2fs, I think f2fs is able to 
> > > > > > > reduce the
> > > > > > > unnecessary operations done by FTL.
> > > > > > > And, in order to avoid changing IO patterns by the block-layer, 
> > > > > > > f2fs merges
> > > > > > > itself some bios likewise ext4.
> > > > > > Hello.
> > > > > > The internal of eMMC and SSD is the blackbox from user side.
> > > > > > How does the normal user easily set operating units alignment(page
> > > > > > size and physical block size ?) between f2fs and ftl in storage 
> > > > > > device
> > > > > > ?
> > > > >
> > > > > I've known that some works have been tried to figure out the units by 
> > > > > profiling the storage, AKA
> > > > reverse engineering.
> > > > > In most cases, the simplest way is to measure the latencies of 
> > > > > consecutive writes and analyze
> > their
> > > > patterns.
> > > > > As you mentioned, in practical, users will not want to do this, so 
> > > > > maybe we need a tool to
> > profile
> > > > them to optimize f2fs.
> > > > > In the current state, I think profiling is an another issue, and 
> > > > > mkfs.f2fs had better include
> > this
> > > > work in the future.
> > > > > But, IMO, from the viewpoint of performance, default configuration is 
> > > > > quite enough now.
> > > > >
> > > > > ps) f2fs doesn't care about the flash page size, but considers 
> > > > > garbage collection unit.
> > > >
> > > > I am sorry but this reply makes me smile. How can you design a fs
> > > > relying on time attack heuristics to figure out what the proper
> > > > layout should be ? Or even endorse such heuristics to be used in
> > > > mkfs ? What we should be focusing on is to push vendors to actually
> > > > give us such information so we can properly propagate that
> > > > throughout the kernel - that's something everyone will benefit from.
> > > > After that the optimization can be done in every file system.
> > > >
> > >
> > > Frankly speaking, I agree that it would be the right direction eventually.
> > > But, as you know, it's very difficult for all flash vendors to promote 
> > > and standardize that.
> > > Because each vendors have different strategies to open their internal 
> > > information and also try
> > > to protect their secrets whatever they are.
> > >
> > > IMO, we don't need to wait them now.
> > > Instead, from the start, I suggest f2fs that uses those information to 
> > > the file system design.
> > > In addition, I suggest using heuristics right now as best efforts.
> > > Maybe in future, if vendors give something, f2fs would be more feasible.
> > > In the mean time, I strongly hope to validate and stabilize f2fs with 
> > > community.
> > 
> > Do not get me wrong, I do not think it is worth to wait for vendors
> > to come to their senses, but it is worth constantly reminding that
> > we *need* this kind of information and those heuristics are not
> > feasible in the long run anyway.
> > 
> > I believe that this conversation happened several times already, but
> > what about having independent public database of all the internal
> > information about hw from different vendors where users can add
> > information gathered by the time attack heuristic so other does not
> > have to run this again and again. I am not sure if Linaro or someone
> > else have something like that, someone can maybe post a link to that.
> > 
> 
> As I mentioned, I agree to push vendors to open those information all the 
> time.
> And, I absolutely didn't mean that it is worth to wait vendors.
> I meant, until opening those information by vendors, something like
> proposing f2fs or gathering heuristics are also needed simultaneously.
> 
> Anyway, it's very interesting to build a database gathering products' 
> information.
> May I access the database?

That's what I found:

https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey

-Lukas

> 
> Thanks,
> 
> > Eventually we can show this to the vendors to see that their
> > "secrets" are already public anyway and that everyones lives would be
> > easier if they just agree to provide it from the beginning.
> > 
> > >
> > > > Promoting time attack heuristics instead of pushing vendors to tell
> > > > us how their hardware should be used is a journey to hell and we've
> > > > been talking about this for a looong time now. And I imagine that
> > > > you especially have quite some persuasion power.
> > >
> > > I know. :)
> > > If there comes a chance, I want to try.
> > > Thanks,
> > 
> > That's very good to hear, thank you.
> > 
> > -Lukas
--
To unsubscribe from this

Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'

2012-10-10 Thread Lukáš Czerner

On Wed, 10 Oct 2012, Zhi Yong Wu wrote:

> Date: Wed, 10 Oct 2012 20:21:48 +0800
> From: Zhi Yong Wu 
> To: Lukáš Czerner 
> Cc: linux-fsde...@vger.kernel.org, linux-e...@vger.kernel.org,
> linux-bt...@vger.kernel.org, linux-kernel@vger.kernel.org,
> linux...@linux.vnet.ibm.com, v...@zeniv.linux.org.uk, da...@fromorbit.com,
> d...@jikos.cz, ty...@mit.edu, c...@us.ibm.com,
> Zhi Yong Wu 
> Subject: Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
> 
> On Wed, Oct 10, 2012 at 7:59 PM, Lukáš Czerner  wrote:
> > On Wed, 10 Oct 2012, zwu.ker...@gmail.com wrote:
> >
> >> Date: Wed, 10 Oct 2012 18:07:23 +0800
> >> From: zwu.ker...@gmail.com
> >> To: linux-fsde...@vger.kernel.org
> >> Cc: linux-e...@vger.kernel.org, linux-bt...@vger.kernel.org,
> >> linux-kernel@vger.kernel.org, linux...@linux.vnet.ibm.com,
> >> v...@zeniv.linux.org.uk, da...@fromorbit.com, d...@jikos.cz,
> >> ty...@mit.edu, c...@us.ibm.com, Zhi Yong Wu 
> >> Subject: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
> >>
> >> From: Zhi Yong Wu 
> >>
> >> Introduce one new mount option '-o hot_track',
> >> and add its parsing support.
> >>   Its usage looks like:
> >>mount -o hot_track
> >>mount -o nouser,hot_track
> >>mount -o nouser,hot_track,loop
> >>mount -o hot_track,nouser
> >
> > This patch should probably be at the end of the series.
> Can you let me know your reason? I think that it is not necessary to
> be at the end of the series.

Simply because you're adding the mount option which does not do
anything yet. Moreover you change the implementation of the hot track
as you go. You should enable this once it is ready to use, not the other
way around. So, please move this at the end of the patch set when
the feature is supposed to be ready to use.

Thanks!
-Lukas

> 
> >
> > -Lukas
> >
> >>
> >> Signed-off-by: Zhi Yong Wu 
> >> ---
> >>  fs/btrfs/ctree.h |1 +
> >>  fs/btrfs/super.c |7 ++-
> >>  2 files changed, 7 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> >> index 9821b67..094bec6 100644
> >> --- a/fs/btrfs/ctree.h
> >> +++ b/fs/btrfs/ctree.h
> >> @@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args {
> >>  #define BTRFS_MOUNT_CHECK_INTEGRITY  (1 << 20)
> >>  #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
> >>  #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 << 22)
> >> +#define BTRFS_MOUNT_HOT_TRACK(1 << 23)
> >>
> >>  #define btrfs_clear_opt(o, opt)  ((o) &= ~BTRFS_MOUNT_##opt)
> >>  #define btrfs_set_opt(o, opt)((o) |= BTRFS_MOUNT_##opt)
> >> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> >> index 83d6f9f..00be9e3 100644
> >> --- a/fs/btrfs/super.c
> >> +++ b/fs/btrfs/super.c
> >> @@ -41,6 +41,7 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >>  #include "compat.h"
> >>  #include "delayed-inode.h"
> >>  #include "ctree.h"
> >> @@ -303,7 +304,7 @@ enum {
> >>   Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard,
> >>   Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed,
> >>   Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
> >> - Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
> >> + Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track,
> >>   Opt_check_integrity, Opt_check_integrity_including_extent_data,
> >>   Opt_check_integrity_print_mask, Opt_fatal_errors,
> >>   Opt_err,
> >> @@ -342,6 +343,7 @@ static match_table_t tokens = {
> >>   {Opt_no_space_cache, "nospace_cache"},
> >>   {Opt_recovery, "recovery"},
> >>   {Opt_skip_balance, "skip_balance"},
> >> + {Opt_hot_track, "hot_track"},
> >>   {Opt_check_integrity, "check_int"},
> >>   {Opt_check_integrity_including_extent_data, "check_int_data"},
> >>   {Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
> >> @@ -553,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char 
> >> *options)
> >>   case Opt_skip_balance:
> >>   btrfs_set_opt(info->mount_opt, SKIP_BALANCE);
> >>   break;
> >> + case Opt_hot_track:
> >> + btrfs_set_opt(info->mount_opt, HOT_TRACK);
> >> + break;
> >>  #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
> >>   case Opt_check_integrity_including_extent_data:
> >>   printk(KERN_INFO "btrfs: enabling check integrity"
> >>
> 
> 
> 
>

Re: [PATCH] scsi_debug: Fix off-by-one bug when unmapping region

2012-10-03 Thread Lukáš Czerner

On Wed, 5 Sep 2012, Douglas Gilbert wrote:

> Date: Wed, 05 Sep 2012 15:41:13 -0400
> From: Douglas Gilbert 
> To: Martin K. Petersen 
> Cc: Lukas Czerner , linux-s...@vger.kernel.org,
> jbottom...@parallels.com, ty...@mit.edu, pbonz...@redhat.com,
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] scsi_debug: Fix off-by-one bug when unmapping region
> 
> On 12-08-17 12:11 PM, Martin K. Petersen wrote:
> > > > > > > "Lukas" == Lukas Czerner  writes:
> > 
> > Lukas> Currently it is possible to unmap one more block than user
> > Lukas> requested to due to the off-by-one error in unmap_region(). This
> > Lukas> is probably due to the fact that the end variable despite its
> > Lukas> name actually points to the last block to unmap + 1. However in
> > Lukas> the condition it is handled as the last block of the region to
> > Lukas> unmap.
> > 
> > Acked-by: Martin K. Petersen 
> 
> Acked-by: Douglas Gilbert 

James, could you please pick this one ?

Thanks!
-Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] loop: Limit the number of requests in the bio list

2012-10-03 Thread Lukáš Czerner

On Wed, 3 Oct 2012, Jeff Moyer wrote:

> Date: Wed, 03 Oct 2012 10:30:54 -0400
> From: Jeff Moyer 
> To: Dave Chinner 
> Cc: Lukáš Czerner , Jens Axboe ,
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] loop: Limit the number of requests in the bio list
> 
> Dave Chinner  writes:
> 
> > On Tue, Oct 02, 2012 at 10:52:05AM +0200, Lukáš Czerner wrote:
> >> On Mon, 1 Oct 2012, Jeff Moyer wrote:
> >> > Date: Mon, 01 Oct 2012 12:52:19 -0400
> >> > From: Jeff Moyer 
> >> > To: Lukas Czerner 
> >> > Cc: Jens Axboe , linux-kernel@vger.kernel.org,
> >> > Dave Chinner 
> >> > Subject: Re: [PATCH] loop: Limit the number of requests in the bio list
> >> > 
> >> > Lukas Czerner  writes:
> >> > 
> >> > > Currently there is not limitation of number of requests in the loop bio
> >> > > list. This can lead into some nasty situations when the caller spawns
> >> > > tons of bio requests taking huge amount of memory. This is even more
> >> > > obvious with discard where blkdev_issue_discard() will submit all bios
> >> > > for the range and wait for them to finish afterwards. On really big 
> >> > > loop
> >> > > devices this can lead to OOM situation as reported by Dave Chinner.
> >> > >
> >> > > With this patch we will wait in loop_make_request() if the number of
> >> > > bios in the loop bio list would exceed 'nr_requests' number of 
> >> > > requests.
> >> > > We'll wake up the process as we process the bios form the list.
> >> > 
> >> > I think you might want to do something similar to what is done for
> >> > request_queues by implementing a congestion on and off threshold.  As
> >> > Jens writes in this commit (predating the conversion to git):
> >> 
> >> Right, I've had the same idea. However my first proof-of-concept
> >> worked quite well without this and my simple performance testing did
> >> not show any regression.
> 
> Did you look at system time?
> 
> -Jeff

Hi, none of the times showed any significant difference, there was
not any pattern suggesting a problem. Also the system time is included
in the real time, so it would show anyway I guess.

-Lukas

Re: [PATCH v2] loop: Limit the number of requests in the bio list

2012-11-13 Thread Lukáš Czerner

On Fri, 9 Nov 2012, Jens Axboe wrote:

> Date: Fri, 09 Nov 2012 08:34:03 +0100
> From: Jens Axboe 
> To: Andrew Morton 
> Cc: Lukas Czerner , dchin...@redhat.com,
> jmo...@redhat.com, linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v2] loop: Limit the number of requests in the bio list
> 
> On 2012-11-08 20:14, Andrew Morton wrote:
> > On Tue, 16 Oct 2012 11:21:45 +0200
> > Lukas Czerner  wrote:
> > 
> >> Currently there is not limitation of number of requests in the loop bio
> >> list. This can lead into some nasty situations when the caller spawns
> >> tons of bio requests taking huge amount of memory. This is even more
> >> obvious with discard where blkdev_issue_discard() will submit all bios
> >> for the range and wait for them to finish afterwards. On really big loop
> >> devices and slow backing file system this can lead to OOM situation as
> >> reported by Dave Chinner.
> >>
> >> With this patch we will wait in loop_make_request() if the number of
> >> bios in the loop bio list would exceed 'nr_requests' number of requests.
> >> We'll wake up the process as we process the bios form the list. Some
> >> threshold hysteresis is in place to avoid high frequency oscillation.
> >>
> > 
> > What's happening with this?
> 
> Sorry I didn't reply to this yet. My initial thought is that we had
> something like this for loop back in the 2.4 days, and it was deadlock
> prone. Can't seem to remember all the details on that yet.
> 
> v2 is a nice improvement, though. With 1:1 bio and wakeups, you would
> get tons of context switches. The batched approach is much better.
> 
> Lukas, have you beaten on this with a file backed loop and heavy traffic
> on a file system on top?

Hi Jeff,

sorry for the delay. Yes, I've tested this with xfstests using loop
driver for both test and scratch disk.

I'll resend the updated version hopefully later today.

Thanks!
-Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2] loop: Limit the number of requests in the bio list

2012-11-13 Thread Lukáš Czerner

On Thu, 8 Nov 2012, Jeff Moyer wrote:

> Date: Thu, 08 Nov 2012 16:53:01 -0500
> From: Jeff Moyer 
> To: Lukas Czerner 
> Cc: ax...@kernel.dk, dchin...@redhat.com, linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v2] loop: Limit the number of requests in the bio list
> 
> Lukas Czerner  writes:
> 
> > +   if (lo->lo_bio_count >= lo->lo_queue->nr_requests) {
> > +   unsigned int nr;
> > +   spin_unlock_irq(&lo->lo_lock);
> > +   nr = lo->lo_queue->nr_requests - (lo->lo_queue->nr_requests/8);
> > +   wait_event_interruptible(lo->lo_req_wait,
> > +lo->lo_bio_count < nr);
> > +   spin_lock_irq(&lo->lo_lock);
> > +   }
> 
> So, blk_queue_make_request already initialized q->nr_congestion_on and
> q->nr_congestion_off.  Is there a reason you didn't simply use
> queue_congestion_on_threshold and queue_congestion_off_threshold?

The reason is that I did not knew about those :) Thanks for pointing
it out I'll take a look at it.

Thanks!
-Lukas

> 
> Cheers,
> Jeff
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2] loop: Limit the number of requests in the bio list

2012-11-13 Thread Lukáš Czerner

On Thu, 8 Nov 2012, Andrew Morton wrote:

> Date: Thu, 8 Nov 2012 11:14:18 -0800
> From: Andrew Morton 
> To: Lukas Czerner 
> Cc: ax...@kernel.dk, dchin...@redhat.com, jmo...@redhat.com,
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v2] loop: Limit the number of requests in the bio list
> 
> On Tue, 16 Oct 2012 11:21:45 +0200
> Lukas Czerner  wrote:
> 
> > Currently there is not limitation of number of requests in the loop bio
> > list. This can lead into some nasty situations when the caller spawns
> > tons of bio requests taking huge amount of memory. This is even more
> > obvious with discard where blkdev_issue_discard() will submit all bios
> > for the range and wait for them to finish afterwards. On really big loop
> > devices and slow backing file system this can lead to OOM situation as
> > reported by Dave Chinner.
> > 
> > With this patch we will wait in loop_make_request() if the number of
> > bios in the loop bio list would exceed 'nr_requests' number of requests.
> > We'll wake up the process as we process the bios form the list. Some
> > threshold hysteresis is in place to avoid high frequency oscillation.
> > 
> 
> What's happening with this?
> 
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -463,6 +463,7 @@ out:
> >   */
> >  static void loop_add_bio(struct loop_device *lo, struct bio *bio)
> >  {
> > +   lo->lo_bio_count++;
> > bio_list_add(&lo->lo_bio_list, bio);
> >  }
> >  
> > @@ -471,6 +472,7 @@ static void loop_add_bio(struct loop_device *lo, struct 
> > bio *bio)
> >   */
> >  static struct bio *loop_get_bio(struct loop_device *lo)
> >  {
> > +   lo->lo_bio_count--;
> > return bio_list_pop(&lo->lo_bio_list);
> >  }
> >  
> > @@ -489,6 +491,14 @@ static void loop_make_request(struct request_queue *q, 
> > struct bio *old_bio)
> > goto out;
> > if (unlikely(rw == WRITE && (lo->lo_flags & LO_FLAGS_READ_ONLY)))
> > goto out;
> > +   if (lo->lo_bio_count >= lo->lo_queue->nr_requests) {
> > +   unsigned int nr;
> > +   spin_unlock_irq(&lo->lo_lock);
> > +   nr = lo->lo_queue->nr_requests - (lo->lo_queue->nr_requests/8);
> > +   wait_event_interruptible(lo->lo_req_wait,
> > +lo->lo_bio_count < nr);
> > +   spin_lock_irq(&lo->lo_lock);
> > +   }
> 
> Two things.
> 
> a) wait_event_interruptible() will return immediately if a signal is
>pending (eg, someone hit ^C).  This is not the behaviour you want. 
>If the calling process is always a kernel thread then
>wait_event_interruptible() is OK and is the correct thing to use. 
>Otherwise, it will need to be an uninterruptible sleep.

Understood, I'll fix that.

> 
> b) Why is it safe to drop lo_lock here?  What data is that lock protecting?
> 

It is protecting the bio list, lo state, backing file so I think it
is perfectly safe to drop the lock there.

Thanks!
-Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3] loop: Limit the number of requests in the bio list

2012-11-14 Thread Lukáš Czerner

On Tue, 13 Nov 2012, Jens Axboe wrote:

> Date: Tue, 13 Nov 2012 09:42:58 -0700
> From: Jens Axboe 
> To: Lukas Czerner 
> Cc: linux-kernel@vger.kernel.org, linux-fsde...@vger.kernel.org,
> jmo...@redhat.com, a...@linux-foundation.org
> Subject: Re: [PATCH v3] loop: Limit the number of requests in the bio list
> 
> > @@ -489,6 +491,12 @@ static void loop_make_request(struct request_queue *q, 
> > struct bio *old_bio)
> > goto out;
> > if (unlikely(rw == WRITE && (lo->lo_flags & LO_FLAGS_READ_ONLY)))
> > goto out;
> > +   if (lo->lo_bio_count >= q->nr_congestion_on) {
> > +   spin_unlock_irq(&lo->lo_lock);
> > +   wait_event(lo->lo_req_wait, lo->lo_bio_count <
> > +  q->nr_congestion_off);
> > +   spin_lock_irq(&lo->lo_lock);
> > +   }
> 
> This makes me nervous. You are reading lo_bio_count outside the lock. If
> you race with the prepare_to_wait() and condition check in
> __wait_event(), then you will sleep forever.

Hi Jens,

I am sorry for being dense, but I do not see how this would be
possible. The only place we increase the lo_bio_count is after that
piece of code (possibly after the wait). Moreover every time we're
decreasing the lo_bio_count and it is smaller than nr_congestion_off
we will wake_up().

That's how wait_event/wake_up is supposed to be used, right ?

Thanks!
-Lukas

> 
> md has private helpers for this, seems it would be a good idea to move
> these into the regular wait includes and use them here too.
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3] loop: Limit the number of requests in the bio list

2012-11-15 Thread Lukáš Czerner

On Wed, 14 Nov 2012, Jens Axboe wrote:

> Date: Wed, 14 Nov 2012 08:21:41 -0700
> From: Jens Axboe 
> To: Lukáš Czerner 
> Cc: linux-kernel@vger.kernel.org, linux-fsde...@vger.kernel.org,
> jmo...@redhat.com, a...@linux-foundation.org
> Subject: Re: [PATCH v3] loop: Limit the number of requests in the bio list
> 
> On 2012-11-14 02:02, Lukáš Czerner wrote:
> > On Tue, 13 Nov 2012, Jens Axboe wrote:
> > 
> >> Date: Tue, 13 Nov 2012 09:42:58 -0700
> >> From: Jens Axboe 
> >> To: Lukas Czerner 
> >> Cc: linux-kernel@vger.kernel.org, linux-fsde...@vger.kernel.org,
> >> jmo...@redhat.com, a...@linux-foundation.org
> >> Subject: Re: [PATCH v3] loop: Limit the number of requests in the bio list
> >>
> >>> @@ -489,6 +491,12 @@ static void loop_make_request(struct request_queue 
> >>> *q, struct bio *old_bio)
> >>>   goto out;
> >>>   if (unlikely(rw == WRITE && (lo->lo_flags & LO_FLAGS_READ_ONLY)))
> >>>   goto out;
> >>> + if (lo->lo_bio_count >= q->nr_congestion_on) {
> >>> + spin_unlock_irq(&lo->lo_lock);
> >>> + wait_event(lo->lo_req_wait, lo->lo_bio_count <
> >>> +q->nr_congestion_off);
> >>> + spin_lock_irq(&lo->lo_lock);
> >>> + }
> >>
> >> This makes me nervous. You are reading lo_bio_count outside the lock. If
> >> you race with the prepare_to_wait() and condition check in
> >> __wait_event(), then you will sleep forever.
> > 
> > Hi Jens,
> > 
> > I am sorry for being dense, but I do not see how this would be
> > possible. The only place we increase the lo_bio_count is after that
> > piece of code (possibly after the wait). Moreover every time we're
> > decreasing the lo_bio_count and it is smaller than nr_congestion_off
> > we will wake_up().
> > 
> > That's how wait_event/wake_up is supposed to be used, right ?
> 
> It is, yes. But you are checking the condition without the lock, so you
> could be operating on a stale value. The point is, you have to safely
> check the condition _after prepare_to_wait() to be completely safe. And
> you do not. Either lo_bio_count needs to be atomic, or you need to use a
> variant of wait_event() that holds the appropriate lock before
> prepare_to_wait() and condition check, then dropping it for the sleep.
> 
> See wait_even_lock_irq() in drivers/md/md.h.

Ok I knew that much. So the only possibility to deadlock is when we
would process all the bios in the loop_thread() before the waiting
event would get to checking the condition after which we would read
the stale data where lo_bio_count is still < nr_congestion_off so we
get back to sleep, never to be woken up again. That sounds highly
unlikely. But fair enough, it make sense to make it absolutely bullet
proof.

I'll take a look at the wait_event_lock_irq.

Thanks!
-Lukas

Re: [PATCH 01/10] ext4: balloc: Fixed coding style issue

2012-10-17 Thread Lukáš Czerner

On Wed, 17 Oct 2012, Adil Mujeeb wrote:

> Date: Wed, 17 Oct 2012 00:42:56 +0530
> From: Adil Mujeeb 
> To: ty...@mit.edu, adilger.ker...@dilger.ca, linux-e...@vger.kernel.org,
> linux-kernel@vger.kernel.org
> Cc: Adil Mujeeb 
> Subject: [PATCH 01/10] ext4: balloc: Fixed coding style issue
> 
> Fixed checkpatch.pl reported ERRORs

Hi Adil,

let me ask you something. How useful do you think those changes are ?
Have you learned anything by creating those patches ?

Just to clarify why I am asking such weird questions. It's not one of
those sneer questions, I would really like to know.

Thanks!
-Lukas

> 
> Signed-off-by: Adil Mujeeb 
> ---
>  linux-3.7-rc1/fs/ext4/balloc.c |8 
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/linux-3.7-rc1/fs/ext4/balloc.c b/linux-3.7-rc1/fs/ext4/balloc.c
> index 1b50890..395418d 100644
> --- a/linux-3.7-rc1/fs/ext4/balloc.c
> +++ b/linux-3.7-rc1/fs/ext4/balloc.c
> @@ -246,7 +246,7 @@ unsigned ext4_free_clusters_after_init(struct super_block 
> *sb,
>   * @bh:  pointer to the buffer head to store the block
>   *   group descriptor
>   */
> -struct ext4_group_desc * ext4_get_group_desc(struct super_block *sb,
> +struct ext4_group_desc *ext4_get_group_desc(struct super_block *sb,
>ext4_group_t block_group,
>struct buffer_head **bh)
>  {
> @@ -700,7 +700,7 @@ static unsigned long ext4_bg_num_gdb_nometa(struct 
> super_block *sb,
>   if (!ext4_bg_has_super(sb, group))
>   return 0;
>  
> - if (EXT4_HAS_INCOMPAT_FEATURE(sb,EXT4_FEATURE_INCOMPAT_META_BG))
> + if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_META_BG))
>   return le32_to_cpu(EXT4_SB(sb)->s_es->s_first_meta_bg);
>   else
>   return EXT4_SB(sb)->s_gdb_count;
> @@ -721,11 +721,11 @@ unsigned long ext4_bg_num_gdb(struct super_block *sb, 
> ext4_group_t group)
>   le32_to_cpu(EXT4_SB(sb)->s_es->s_first_meta_bg);
>   unsigned long metagroup = group / EXT4_DESC_PER_BLOCK(sb);
>  
> - if (!EXT4_HAS_INCOMPAT_FEATURE(sb,EXT4_FEATURE_INCOMPAT_META_BG) ||
> + if (!EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_META_BG) ||
>   metagroup < first_meta_bg)
>   return ext4_bg_num_gdb_nometa(sb, group);
>  
> - return ext4_bg_num_gdb_meta(sb,group);
> + return ext4_bg_num_gdb_meta(sb, group);
>  
>  }
>  
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] loop: Limit the number of requests in the bio list

2012-10-02 Thread Lukáš Czerner

On Mon, 1 Oct 2012, Jeff Moyer wrote:

> Date: Mon, 01 Oct 2012 12:52:19 -0400
> From: Jeff Moyer 
> To: Lukas Czerner 
> Cc: Jens Axboe , linux-kernel@vger.kernel.org,
> Dave Chinner 
> Subject: Re: [PATCH] loop: Limit the number of requests in the bio list
> 
> Lukas Czerner  writes:
> 
> > Currently there is not limitation of number of requests in the loop bio
> > list. This can lead into some nasty situations when the caller spawns
> > tons of bio requests taking huge amount of memory. This is even more
> > obvious with discard where blkdev_issue_discard() will submit all bios
> > for the range and wait for them to finish afterwards. On really big loop
> > devices this can lead to OOM situation as reported by Dave Chinner.
> >
> > With this patch we will wait in loop_make_request() if the number of
> > bios in the loop bio list would exceed 'nr_requests' number of requests.
> > We'll wake up the process as we process the bios form the list.
> 
> I think you might want to do something similar to what is done for
> request_queues by implementing a congestion on and off threshold.  As
> Jens writes in this commit (predating the conversion to git):

Right, I've had the same idea. However my first proof-of-concept
worked quite well without this and my simple performance testing did
not show any regression.

I've basically done just fstrim, and blkdiscard on huge loop device
measuring time to finish and dd bs=4k throughput. None of those showed
any performance regression. I've chosen those for being quite simple
and supposedly issuing quite a lot of bios. Any better
recommendation to test this ?

Also I am still unable to reproduce the problem Dave originally
experienced and I was hoping that he can test whether this helps or
not.

Dave could you give it a try please ? By creating huge (500T, 1000T,
1500T) loop device on machine with 2GB memory I was not able to reproduce
that. Maybe it's that xfs punch hole implementation is so damn fast
:). Please let me know.

Thanks!
-Lukas

> 
> Author: Jens Axboe 
> Date:   Wed Nov 3 15:47:37 2004 -0800
> 
> [PATCH] queue congestion threshold hysteresis
> 
> We need to open the gap between congestion on/off a little bit, or
> we risk burning many cycles continually putting processes on a wait
> queue only to wake them up again immediately. This was observed with
> CFQ at least, which showed way excessive sys time.
> 
> Patch is from Arjan.
> 
> Signed-off-by: Jens Axboe 
> Signed-off-by: Linus Torvalds 
> 
> If you feel this isn't necessary, then I think you at least need to
> justify it with testing.  Perhaps Jens can shed some light on the exact
> workload that triggered the pathological behaviour.
> 
> Cheers,
> Jeff
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/2 v3] wait: add wait_event_lock_irq() interface

2012-11-30 Thread Lukáš Czerner

On Fri, 30 Nov 2012, Andrew Morton wrote:

> Date: Fri, 30 Nov 2012 02:48:42 -0800
> From: Andrew Morton 
> To: Lukas Czerner 
> Cc: linux-kernel@vger.kernel.org, linux-r...@vger.kernel.org, ax...@kernel.dk,
> jmo...@redhat.com, Neil Brown ,
> David Howells , Ingo Molnar ,
> Peter Zijlstra 
> Subject: Re: [PATCH 1/2 v3] wait: add wait_event_lock_irq() interface
> 
> On Fri, 30 Nov 2012 11:42:40 +0100 Lukas Czerner  wrote:
> 
> > New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit
> > moves the private wait_event_lock_irq() macro from MD to regular wait
> > includes, introduces new macro wait_event_lock_irq_cmd() instead of using
> > the old method with omitting cmd parameter which is ugly and makes a use
> > of new macros in the MD. It also introduces the _interruptible_ variant.
> > 
> > The use of new interface is when one have a special lock to protect data
> > structures used in the condition, or one also needs to invoke "cmd"
> > before putting it to sleep.
> > 
> > All new macros are expected to be called with the lock taken. The lock
> > is released before sleep and is reacquired afterwards. We will leave the
> > macro with the lock held.
> > 
> > Note to DM: IMO this should also fix theoretical race on waitqueue while
> > using simultaneously wait_event_lock_irq() and wait_event() because of
> > lack of locking around current state setting and wait queue removal.
> 
> Does this fix the sparse warning which Fengguang just sent us?

Which report from Fengguang do you have in mind ? I do not see any
on linux-kernel today.

/me going to see what spare reports

-Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/2 v3] wait: add wait_event_lock_irq() interface

2012-12-03 Thread Lukáš Czerner

On Fri, 30 Nov 2012, Andrew Morton wrote:

> Date: Fri, 30 Nov 2012 12:13:08 -0800
> From: Andrew Morton 
> To: Lukáš Czerner 
> Cc: linux-kernel@vger.kernel.org, linux-r...@vger.kernel.org, ax...@kernel.dk,
> jmo...@redhat.com, Neil Brown 
> Subject: Re: [PATCH 1/2 v3] wait: add wait_event_lock_irq() interface
> 
> On Fri, 30 Nov 2012 12:36:35 +0100 (CET)
> Lukáš Czerner  wrote:
> 
> > On Fri, 30 Nov 2012, Andrew Morton wrote:
> > 
> > > Date: Fri, 30 Nov 2012 02:48:42 -0800
> > > From: Andrew Morton 
> > > To: Lukas Czerner 
> > > Cc: linux-kernel@vger.kernel.org, linux-r...@vger.kernel.org, 
> > > ax...@kernel.dk,
> > > jmo...@redhat.com, Neil Brown ,
> > > David Howells , Ingo Molnar ,
> > > Peter Zijlstra 
> > > Subject: Re: [PATCH 1/2 v3] wait: add wait_event_lock_irq() interface
> > > 
> > > On Fri, 30 Nov 2012 11:42:40 +0100 Lukas Czerner  
> > > wrote:
> > > 
> > > > New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit
> > > > moves the private wait_event_lock_irq() macro from MD to regular wait
> > > > includes, introduces new macro wait_event_lock_irq_cmd() instead of 
> > > > using
> > > > the old method with omitting cmd parameter which is ugly and makes a use
> > > > of new macros in the MD. It also introduces the _interruptible_ variant.
> > > > 
> > > > The use of new interface is when one have a special lock to protect data
> > > > structures used in the condition, or one also needs to invoke "cmd"
> > > > before putting it to sleep.
> > > > 
> > > > All new macros are expected to be called with the lock taken. The lock
> > > > is released before sleep and is reacquired afterwards. We will leave the
> > > > macro with the lock held.
> > > > 
> > > > Note to DM: IMO this should also fix theoretical race on waitqueue while
> > > > using simultaneously wait_event_lock_irq() and wait_event() because of
> > > > lack of locking around current state setting and wait queue removal.
> > > 
> > > Does this fix the sparse warning which Fengguang just sent us?
> > 
> > Which report from Fengguang do you have in mind ? I do not see any
> > on linux-kernel today.
> > 
> > /me going to see what spare reports
> > 
> 
> On Fri, 30 Nov 2012 16:30:24 +0800
> kbuild test robot  wrote:
> 
> > tree:   git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git 
> > akpm
> > head:   cfb65dadcd079ad4547407a1584bc6b96bd48bb3
> > commit: 2b29cdb6f98c86a1da4ec5335d6247392b7c6551 [35/476] wait: add 
> > wait_event_lock_irq() interface
> > 
> > 
> > sparse warnings:
> > 
> > + drivers/block/drbd/drbd_int.h:2339:9: sparse: preprocessor token 
> > __wait_event_lock_irq redefined
> > include/linux/wait.h:554:9: this was the original definition
> > + drivers/block/drbd/drbd_int.h:2358:9: sparse: preprocessor token 
> > wait_event_lock_irq redefined
> > include/linux/wait.h:621:9: this was the original definition
> > drivers/block/drbd/drbd_interval.h:12:63: sparse: dubious one-bit signed 
> > bitfield
> > drivers/block/drbd/drbd_interval.h:13:22: sparse: dubious one-bit signed 
> > bitfield
> > drivers/block/drbd/drbd_int.h:903:39: sparse: attribute 'require_context': 
> > unknown attribute
> > drivers/block/drbd/drbd_int.h:1123:69: sparse: attribute 'require_context': 
> > unknown attribute
> > drivers/block/drbd/drbd_int.h:1124:70: sparse: attribute 'require_context': 
> > unknown attribute
> > drivers/block/drbd/drbd_int.h:1125:59: sparse: attribute 'require_context': 
> > unknown attribute
> > drivers/block/drbd/drbd_int.h:1126:63: sparse: attribute 'require_context': 
> > unknown attribute
> > drivers/block/drbd/drbd_int.h:1127:60: sparse: attribute 'require_context': 
> > unknown attribute
> > drivers/block/drbd/drbd_int.h:1128:71: sparse: attribute 'require_context': 
> > unknown attribute
> > drivers/block/drbd/drbd_int.h:1129:65: sparse: attribute 'require_context': 
> > unknown attribute
> > drivers/block/drbd/drbd_int.h:1130:66: sparse: attribute 'require_context': 
> > unknown attribute
> > drivers/block/drbd/drbd_int.h:1335:74: sparse: attribute 'require_context': 
> > unknown attribute
> > drivers/block/drbd/drbd_int.h:1336:50: sparse: attribute 'require_context': 
> > unknown attribute
> > drivers/block/drbd/drbd_int.h:1338:51: sparse: attribute 'require_context': 
> > unknown attribute
> > drivers/block/drbd/drbd_int.h:1339:58: sparse: attribute 'require_context': 
> > unknown attribute
> > drivers/block/drbd/drbd_int.h:1340:54: sparse: attribute 'require_context': 
> > unknown attribute
> > drivers/block/drbd/drbd_int.h:1341:62: sparse: attribute 'require_context': 
> > unknown attribute
> > drivers/block/drbd/drbd_int.h:1435:92: sparse: attribute 'require_context': 
> > unknown attribute
> 

I believe that Lars send patch for that already. So no,

'[PATCH 1/2 v3] wait: add wait_event_lock_irq() interface'

does not fix the issue, but another patch should.

http://www.spinics.net/lists/linux-next/msg23042.html

Thanks!
-Lukas

Re: [PATCH 1/2] wait: add wait_event_lock_irq() interface

2012-11-21 Thread Lukáš Czerner

On Tue, 20 Nov 2012, Andrew Morton wrote:

> Date: Tue, 20 Nov 2012 12:14:14 -0800
> From: Andrew Morton 
> To: Lukas Czerner 
> Cc: linux-kernel@vger.kernel.org, linux-r...@vger.kernel.org, ax...@kernel.dk,
> jmo...@redhat.com, Neil Brown ,
> David Howells , Ingo Molnar ,
> Peter Zijlstra 
> Subject: Re: [PATCH 1/2] wait: add wait_event_lock_irq() interface
> 
> On Tue, 20 Nov 2012 10:23:04 +0100
> Lukas Czerner  wrote:
> 
> > New wait_event_lock_irq{,cmd} macros added. This commit moves the
> > private wait_event_lock_irq() macro from MD to regular wait includes,
> > introduces new macro wait_event_lock_irq_cmd() instead of using the old
> > method with omitting cmd parameter which is ugly and makes a use of new
> > macros in the MD.
> > 
> > The use of new interface is when one have a special lock to protect data
> > structures used in the condition, or one also needs to invoke "cmd"
> > before putting it to sleep.
> > 
> > Both new macros are expected to be called with the lock taken. The lock
> > is released before sleep and reacquired afterwards. We will leave the
> > macro with the lock held.
> 
> Moving generic code out of md is a good thing.  It never should have
> been put there.  Bad md.
> 
> > ...
> >
> > +#define __wait_event_lock_irq(wq, condition, lock, cmd)\
> > +do {   
> > \
> > +   wait_queue_t __wait;\
> > +   init_waitqueue_entry(&__wait, current); \
> > +   \
> 
> The above two lines should be swapped - the blank line goes between
> end-of-locals and start-of-code.

Good point.

> 
> > +   add_wait_queue(&wq, &__wait);   \
> > +   for (;;) {  \
> > +   set_current_state(TASK_UNINTERRUPTIBLE);\
> > +   if (condition)  \
> > +   break;  \
> > +   spin_unlock_irq(&lock); \
> > +   cmd;\
> > +   schedule(); \
> > +   spin_lock_irq(&lock);   \
> > +   }   \
> > +   current->state = TASK_RUNNING;  \
> > +   remove_wait_queue(&wq, &__wait);\
> > +} while (0)
> 
> I'm scratching my head a bit over which situations this will be used
> in, particularly outside md.
> 
> Because calling schedule() immediately after calling `cmd' might be a
> problem for some callers.  Or at least, suboptimal.  If that evaluation
> of `cmd' results in `condition' becoming true then we don't *want* to
> call schedule().  Yes, `cmd' would have put this thread into
> TASK_RUNNING, but it was just a waste of cycles.
> 
> So I wonder if we should retest `condition' there.  Or, perhaps, test
> the return value of `cmd'.

Right, it might makes sense to swap the lines and let cmd to be run
after the schedule(). That way, if 'cmd' would result in 'condition'
becoming true, then we would catch it before going to schedule()
again.

> 
> Also, wait_event() uses prepare_to_wait().  It's a bit neater and more
> efficient because the wakeup removes the waiter from the waitqueue.  I
> wonder if we can use prepare_to_wait() here.

I suspect that it was a tradeoff between using the neat
prepare_to_wait() with waitqueue removal, but having to lock the
wait_queue_head_t->lock or using simply set_current_state() since
we already hold external 'lock'.

Whoever wrote the code originally thought that it was better this
way, however since the waitqueue lock should not actually be
contended at all here, I think it is ok to use prepare_to_wait().

> 
> Also, we will surely end up needing TASK_INTERRUPTIBLE versions of
> these macros, so you may as well design for that (or actually implement
> them) in version 1.

Sure, I can add the TASK_INTERRUPTIBLE version as well. The reason I
did not included that initially was that I did not want to introduce
unused code. But if you're ok with it, I'll add it.

> 
> > +/**
> > + * wait_event_lock_irq_cmd - sleep until a condition gets true. The
> > + *  condition is checked under the lock. This
> > + *  is expected to be called with the lock
> > + *  taken.
> > + * @wq: the waitqueue to wait on
> > + * @condition: a C expression for the event to wait for
> > + * @lock: a locked lock, which will be released before cmd and schedule()
> > + *   and reacquired afterwards.
> 
> @lock isn't just any old lock.  It must have type spinlock_t.  It's
> worth mentioning this here.  Thi

Re: ext4fs error "ext4_mb_generate_buddy:741:group 16, 8160 clusters in bitmap, 4064 in gd" (with repro)

2012-08-16 Thread Lukáš Czerner

On Wed, 15 Aug 2012, Lukáš Czerner wrote:

> Date: Wed, 15 Aug 2012 11:17:57 +0200 (CEST)
> From: Lukáš Czerner 
> To: Theodore Ts'o 
> Cc: Lukas Czerner , Paolo Bonzini ,
> "Linux Kernel mailinlinux-e...@vger.kernel.orgg List"
> , linux-e...@vger.kernel.org
> Subject: Re: ext4fs error
> "ext4_mb_generate_buddy:741:group 16, 8160 clusters in bitmap, 4064 in gd"
>  (with repro)
> 
> On Thu, 9 Aug 2012, Theodore Ts'o wrote:
> 
> > Date: Thu, 9 Aug 2012 13:06:40 -0400
> > From: Theodore Ts'o 
> > To: Lukas Czerner 
> > Cc: Paolo Bonzini ,
> > "Linux Kernel mailinlinux-e...@vger.kernel.orgg List"
> > , linux-e...@vger.kernel.org
> > Subject: Re: ext4fs error
> > "ext4_mb_generate_buddy:741:group 16, 8160 clusters in bitmap, 4064 in 
> > gd"
> >  (with repro)
> > 
> > On Thu, Aug 09, 2012 at 12:00:09PM +0200, Paolo Bonzini wrote:
> > > Here is how to reproduce it.  It happens during fstrim.  I found other
> > > occurrences of the error in the mailing list, but they were not related
> > > to trim so they may be something different.
> > > 
> > > modprobe scsi_debug dev_size_mb=256 lbpws=1
> > > dd if=/dev/zero of=/dev/sdb bs=1M  
> > > fdisk /dev/sdb
> > >  >> create a new partition accepting all defaults
> > > fdisk -lu /dev/sdb|tail -1
> > >  >> should show: /dev/sdb1 57  524285  262114+  83  Linux
> > > 
> > > mkfs.ext4 /dev/sdb1
> > > mkdir test
> > > mount /dev/sdb1 test
> > > fstrim ./test
> > 
> > I can confirm that this accurately reproduces file system corruption
> > using a 3.5 kernel.  It looks like some block allocation bitmap blocks
> > is getting trimmed when it shouldn't have been.  Lukas, can you take a
> > look at this?
> > 
> > - Ted
> 
> Hi Ted,
> 
> sorry for the delay, I've just got back from my vacation. I'll take
> a look at it.
> 
> Thanks!
> -Lukas

This does not seem like an ext4 problem. The code seems unable to
actually discard blocks which are allocated. Moreover I was not able
to reproduce the problem on the loop device with the same setting as
the reported scsi_debug device (1024 bs file system on the 256MB image
residing on the 1024 bs filesystem) 

After a little bit of tracing with the systemtap and blktrace ext4
does not seem to be doing anything wrong and yet we get part of the
block bitmap trimmed. This lead me to the scsi_debug driver itself
and indeed it seems that we have off-by-one bug there in the
unamp_region() which is causing the problem.

Here is the patch which fixes the problem for me, I'll resend the
proper patch in a bit.

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 182d5a5..f4cc413 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -2054,7 +2054,7 @@ static void unmap_region(sector_t lba, unsigned int len)
block = lba + alignment;
rem = do_div(block, granularity);
 
-   if (rem == 0 && lba + granularity <= end && block < map_size) {
+   if (rem == 0 && lba + granularity < end && block < map_size) {
clear_bit(block, map_storep);
if (scsi_debug_lbprz)
memset(fake_storep +


Thanks!
-Lukas

Re: [PATCH] ext4: add error handling when discard cmd is fail in FITRIM

2012-07-30 Thread Lukáš Czerner

On Sun, 29 Jul 2012, Namjae Jeon wrote:

> Date: Sun, 29 Jul 2012 07:31:54 -0400
> From: Namjae Jeon 
> To: ty...@mit.edu, sand...@redhat.com, lczer...@redhat.com,
> linux-e...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org, Namjae Jeon ,
> Amit Sahrawat 
> Subject: [PATCH] ext4: add error handling when discard cmd is fail in FITRIM
> 
> Although free extents is proper not trimmed(mmc driver return error code
> while sending trim command), currently FITRIM ioctl return success.
> Add exception routine to inform user error code.
> 
> #> ./fitrim_test
> end_request: I/O error, dev mmcblk0, sector 27232
> EXT4-fs warning (device mmcblk0): ext4_trim_all_free:4857:
> Discard command returned error -5
> #>

Well, by this change you're actually reverting commit

d9f34504e6952e909a6932c5b2d1857716606380
 ext4: ignore errors when issuing discards

which effectively reverts a30eec2a8.

Now I think that the way it is now is actually better than your
proposal for the reasons mentioned in the commit
d9f34504e6952e909a6932c5b2d1857716606380. However I think that the
discard errors should be logged nevertheless but not at the file
system level, but rather on block layer level if it is not done
already.

Thanks!
-Lukas

> 
> Signed-off-by: Namjae Jeon 
> Signed-off-by: Amit Sahrawat 
> Reviewed-by: Jan Kara 
> ---
>  fs/ext4/mballoc.c |   25 +
>  1 file changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 8eae947..07569b6 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4852,10 +4852,11 @@ error_return:
>   * one will allocate those blocks, mark it as used in buddy bitmap. This must
>   * be called with under the group lock.
>   */
> -static void ext4_trim_extent(struct super_block *sb, int start, int count,
> +static int ext4_trim_extent(struct super_block *sb, int start, int count,
>ext4_group_t group, struct ext4_buddy *e4b)
>  {
>   struct ext4_free_extent ex;
> + int err;
>  
>   trace_ext4_trim_extent(sb, group, start, count);
>  
> @@ -4871,9 +4872,10 @@ static void ext4_trim_extent(struct super_block *sb, 
> int start, int count,
>*/
>   mb_mark_used(e4b, &ex);
>   ext4_unlock_group(sb, group);
> - ext4_issue_discard(sb, group, start, count);
> + err = ext4_issue_discard(sb, group, start, count);
>   ext4_lock_group(sb, group);
>   mb_free_blocks(NULL, e4b, start, ex.fe_len);
> + return err;
>  }
>  
>  /**
> @@ -4902,7 +4904,7 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t 
> group,
>   void *bitmap;
>   ext4_grpblk_t next, count = 0, free_count = 0;
>   struct ext4_buddy e4b;
> - int ret;
> + int ret = 0;
>  
>   trace_ext4_trim_all_free(sb, group, start, max);
>  
> @@ -4929,15 +4931,22 @@ ext4_trim_all_free(struct super_block *sb, 
> ext4_group_t group,
>   next = mb_find_next_bit(bitmap, max + 1, start);
>  
>   if ((next - start) >= minblocks) {
> - ext4_trim_extent(sb, start,
> -  next - start, group, &e4b);
> - count += next - start;
> + ret = ext4_trim_extent(sb, start,
> + next - start, group, &e4b);
> + if (ret < 0) {
> + if (ret != -EOPNOTSUPP)
> + ext4_warning(sb,
> + "Discard command returned error 
> %d\n",
> + ret);
> + break;
> + } else
> + count += next - start;
>   }
>   free_count += next - start;
>   start = next + 1;
>  
>   if (fatal_signal_pending(current)) {
> - count = -ERESTARTSYS;
> + ret = -ERESTARTSYS;
>   break;
>   }
>  
> @@ -4960,7 +4969,7 @@ out:
>   ext4_debug("trimmed %d blocks in the group %d\n",
>   count, group);
>  
> - return count;
> + return (ret < 0) ? ret : count;
>  }
>  
>  /**
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2] ext4: fix hole punch failure when depth is greater than 0

2012-07-09 Thread Lukáš Czerner

On Thu, 5 Jul 2012, Ashish Sangwan wrote:

> Date: Thu, 5 Jul 2012 15:22:04 +0530
> From: Ashish Sangwan 
> To: Lukáš Czerner 
> Cc: sand...@redhat.com, Ted Tso , linux-kernel@vger.kernel.org,
> linux-e...@vger.kernel.org, Namjae Jeon 
> Subject: Re: [PATCH v2] ext4: fix hole punch failure when depth is greater
> than 0
> 
> On Wed, Jul 4, 2012 at 11:03 PM, Lukáš Czerner  wrote:
> > So I've finally has some time to look at the patch and reproduce the
> > problem. Thanks for noticing the problem, the patch seems good,
> > though I have one question. Is the p_block setting really necessary
> > ? I do not think so, but I might be missing something. Here is
> > updated patch I've tested, bellow.
> AFAICS, p_block setting is necessary. As mentioned in the patch description,
> whether to continue removing extents or not is decided by the return value
> of function ext4_ext_more_to_rm() which checks 2 conditions:
> a) if there are no more indexes to process.
> b) if the number of entries are decreased in the header of "depth -1".
> The p_block setting is important for condition b).
> 
> In function ext4_ext_more_to_rm, there is this second check:
> if (le16_to_cpu(path->p_hdr->eh_entries) == path->p_block)
> return 0;
> If the value of p_block would not be correct, the above mentioned
> condition could become
> true while there are still blocks left to be removed.

Ok, the code is not very clear, but now I can see it. p_block is
actually misused here to store the number of indexes in the current
node while diving into the tree. Then on the way up, we are checking
that to see if the eh_entries changed or not (which is indicating
that something has been freed deeper in the tree).

That said, it makes sense to set it before the loop itself because
we are actually skipping the path construction while diving into
the tree since patch is already initialized and we're starting
walking back from 'depth' up in this case. So the patch seems fine.
Thanks for catching it and fixing it.

You can add

Reviewed-by: Lukas Czerner 


> >
> > Note: there are some indent problems in your patch, like for example
> > this:
> >
> > +   path[k].p_block =
> > +   le16_to_cpu(path[k].p_hdr->eh_entries)+1;
> >
> >
> Before submitting the patch, I run checkpatch.pl with --strict option.
> It did'nt show any error or warning. Should I re-submit
> the patch with an extra tab before the second line? The call is yours.

checkpatch.pl does not catch everything. Just look at how wrapping
of long lines is done in the code, there are plenty of examples.

> 
> > Anyway, what do you think about the modification ?
> >
> Also there is 1 modification missing from your patch.
> ext4_ext_drop_refs(path);
> kfree(path);
> +   path = NULL;
> if (err == -EAGAIN)
> goto again;
> If path is not set to NULL, it will crash in xfstest #13. Ted has
> already reported it.

Right, I've probably used the old patch as an example.

Thanks!
-Lukas

> 
> Thanks,
> Ashish
> 
> > Thanks!
> > -Lukas
> >
> >
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index b3300eb..94bc1bd 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -2570,7 +2570,7 @@ static int ext4_ext_remove_space(struct inode *inode, 
> > ext4_lblk_t start,
> >  {
> > struct super_block *sb = inode->i_sb;
> > int depth = ext_depth(inode);
> > -   struct ext4_ext_path *path;
> > +   struct ext4_ext_path *path = NULL;
> > ext4_fsblk_t partial_cluster = 0;
> > handle_t *handle;
> > int i, err;
> > @@ -2606,8 +2606,12 @@ again:
> > }
> > depth = ext_depth(inode);
> > ex = path[depth].p_ext;
> > -   if (!ex)
> > +   if (!ex) {
> > +   ext4_ext_drop_refs(path);
> > +   kfree(path);
> > +   path = NULL;
> > goto cont;
> > +   }
> >
> > ee_block = le32_to_cpu(ex->ee_block);
> >
> > @@ -2637,8 +2641,6 @@ again:
> > if (err < 0)
> > goto out;
> > }
> > -   ext4_ext_drop_refs(path);
> > -   kfree(path);
> > }
> >  cont:
> >
> > @@ -2646,20 +2648,26 @@ cont:
> >  * We start scanning from right side, freeing all the blocks
> >

Re: [PATCH v2 10/18] mm: teach truncate_inode_pages_range() to handle non page aligned ranges

2013-02-08 Thread Lukáš Czerner

On Thu, 7 Feb 2013, Andrew Morton wrote:

> Date: Thu, 7 Feb 2013 15:40:42 -0800
> From: Andrew Morton 
> To: Lukas Czerner 
> Cc: linux...@kvack.org, linux-kernel@vger.kernel.org,
> linux-fsde...@vger.kernel.org, linux-e...@vger.kernel.org,
> Hugh Dickins 
> Subject: Re: [PATCH v2 10/18] mm: teach truncate_inode_pages_range() to handle
>  non page aligned ranges
> 
> On Tue,  5 Feb 2013 10:12:03 +0100
> Lukas Czerner  wrote:
> 
> > This commit changes truncate_inode_pages_range() so it can handle non
> > page aligned regions of the truncate. Currently we can hit BUG_ON when
> > the end of the range is not page aligned, but we can handle unaligned
> > start of the range.
> > 
> > Being able to handle non page aligned regions of the page can help file
> > system punch_hole implementations and save some work, because once we're
> > holding the page we might as well deal with it right away.
> > 
> > In previous commits we've changed ->invalidatepage() prototype to accept
> > 'length' argument to be able to specify range to invalidate. No we can
> > use that new ability in truncate_inode_pages_range().
> > 
> > ...
> >
> > +   /*
> > +* 'start' and 'end' always covers the range of pages to be fully
> > +* truncated. Partial pages are covered with 'partial_start' at the
> > +* start of the range and 'partial_end' at the end of the range.
> > +* Note that 'end' is exclusive while 'lend' is inclusive.
> > +*/
> 
> That helped ;)  So the bytes to be truncated are
> 
> (start*PAGE_SIZE + partial_start) -> (end*PAGE_SIZE + partial_end) - 1
> 
> yes?

The start of the range is not right, because 'start' and 'end'
covers pages to be _fully_ truncated. See the while cycle and 
then 'if (partial_start)' condition where we search for the
page (start - 1) and do_invalidate within that page.

So it should be like this:


(start*PAGE_SIZE - partial_start*(PAGE_SIZE - partial_start) ->
(end*PAGE_END + partial_end) - 1


assuming that you want the range to be inclusive on both sides.

-Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RESEND 3/10] ext4: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate

2014-02-18 Thread Lukáš Czerner

On Wed, 12 Feb 2014, Namjae Jeon wrote:

> Date: Wed, 12 Feb 2014 11:28:35 +0900
> From: Namjae Jeon 
> To: Lukáš Czerner 
> Cc: v...@zeniv.linux.org.uk, da...@fromorbit.com, b...@sgi.com, ty...@mit.edu,
> adilger.ker...@dilger.ca, j...@suse.cz, mtk.manpa...@gmail.com,
> linux-fsde...@vger.kernel.org, x...@oss.sgi.com,
> linux-e...@vger.kernel.org, linux-kernel@vger.kernel.org,
> Namjae Jeon ,
> Ashish Sangwan 
> Subject: Re: [PATCH RESEND 3/10] ext4: Add support FALLOC_FL_COLLAPSE_RANGE
> for fallocate
> 
> 2014-02-11 20:59 GMT+09:00, Lukáš Czerner :
> > On Sun, 2 Feb 2014, Namjae Jeon wrote:
> >
> >> Date: Sun,  2 Feb 2014 14:44:34 +0900
> >> From: Namjae Jeon 
> >> To: v...@zeniv.linux.org.uk, da...@fromorbit.com, b...@sgi.com,
> >> ty...@mit.edu,
> >> adilger.ker...@dilger.ca, j...@suse.cz, mtk.manpa...@gmail.com
> >> Cc: linux-fsde...@vger.kernel.org, x...@oss.sgi.com,
> >> linux-e...@vger.kernel.org, linux-kernel@vger.kernel.org,
> >> Namjae Jeon , Namjae Jeon
> >> ,
> >> Ashish Sangwan 
> >> Subject: [PATCH RESEND 3/10] ext4: Add support FALLOC_FL_COLLAPSE_RANGE
> >> for
> >> fallocate
> >>
> >> From: Namjae Jeon 
> >>
> >> Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
> Hi Lukas.
> >
> > Description is missing here, please provide FALLOC_FL_COLLAPSE_RANGE
> > description so people know what it's supposed to be doing.
> >
> > more comments bellow.
> Okay, I will udpate.
> >
> > Thanks!
> > -Lukas

Hi,

you may have noticed my patches for new FALLOC_FL_ZERO_RANGE
fallocate flag. This changes things around the same area as your
patches does so we should probably figure out which are going to
land in first and then rebase the other on top of that.

One concern I have is that I have not seen any tests provided to
verify the feature. But I just may have missed it. Do you have any
xfstests test or at least fsx and fsstress patches to provide
support for testing FALLOC_FL_COLLAPSE_RANGE ? Patches for
util_linux might also be handy.

Thanks!
-Lukas

Re: [PATCH v4 15/20] ext4: use ext4_zero_partial_blocks in punch_hole

2013-06-14 Thread Lukáš Czerner

On Thu, 13 Jun 2013, Theodore Ts'o wrote:

> Date: Thu, 13 Jun 2013 23:01:54 -0400
> From: Theodore Ts'o 
> To: Lukas Czerner 
> Cc: linux...@kvack.org, linux-kernel@vger.kernel.org,
> linux-fsde...@vger.kernel.org, linux-e...@vger.kernel.org,
> a...@linux-foundation.org, hu...@google.com
> Subject: Re: [PATCH v4 15/20] ext4: use ext4_zero_partial_blocks in punch_hole
> 
> On Tue, May 14, 2013 at 06:37:29PM +0200, Lukas Czerner wrote:
> > We're doing to get rid of ext4_discard_partial_page_buffers() since it is
> > duplicating some code and also partially duplicating work of
> > truncate_pagecache_range(), moreover the old implementation was much
> > clearer.
> > 
> > Now when the truncate_inode_pages_range() can handle truncating non page
> > aligned regions we can use this to invalidate and zero out block aligned
> > region of the punched out range and then use ext4_block_truncate_page()
> > to zero the unaligned blocks on the start and end of the range. This
> > will greatly simplify the punch hole code. Moreover after this commit we
> > can get rid of the ext4_discard_partial_page_buffers() completely.
> > 
> > We also introduce function ext4_prepare_punch_hole() to do come common
> > operations before we attempt to do the actual punch hole on
> > indirect or extent file which saves us some code duplication.
> > 
> > This has been tested on ppc64 with 1k block size with fsx and xfstests
> > without any problems.
> > 
> > Signed-off-by: Lukas Czerner 
> 
> Hi Lukas,
> 
> I've been seeing xfstests failures on test generic/300 in nojournal
> mode.
> 
> BEGIN TEST: Ext4 4k block w/ no journal Thu Jun 13 22:38:47 EDT 2013
> Device: /dev/vdb
> mk2fs options: -q -O ^has_journal
> mount options: -o block_validity,noload
> FSTYP -- ext4
> PLATFORM  -- Linux/i686 candygram 3.10.0-rc2-00477-g1e1cad7
> MKFS_OPTIONS  -- -q -O ^has_journal /dev/vdc
> MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity,noload /dev/vdc /vdc
> 
> generic/300   [20:42:18][  116.877278] fio (3320) used greatest stack 
> depth: 5580 bytes left
> [  116.967122] fio (3321) used greatest stack depth: 5560 bytes left
> [  117.573861] fio (3325) used greatest stack depth: 5504 bytes left
>  [20:44:01] [failed, exit status 1] - output mismatch (see 
> /root/xfstests/results/generic/300.out.bad)
> --- tests/generic/300.out  2013-06-04 22:42:55.0 -0400
> +++ /root/xfstests/results/generic/300.out.bad   2013-06-13 
> 20:44:01.30665 -0400
> @@ -2,3 +2,4 @@
>  
>  Run fio with random aio-dio pattern
>  
> +_check_generic_filesystem: filesystem on /dev/vdc is inconsistent (see 
> /root/xfstests/results/generic/300.full)
>  ...
>  (Run 'diff -u tests/generic/300.out 
> /root/xfstests/results/generic/300.out.bad' to see the entire diff)
> 
> It bisects down to this patch, and if I take the dev branch, and
> revert patches #15 through #19 in this series, the problem goes away.
> 
> Can you investigate and recommend a better fix?

Hi Ted,

I'll take a look at this. Thanks for letting me know.

-Lukas

> 
> Thanks,
> 
>   - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] ext4: pair trace_ext4_writepages & trace_ext4_writepages_result

2013-10-17 Thread Lukáš Czerner

On Thu, 17 Oct 2013, Ming Lei wrote:

> Date: Thu, 17 Oct 2013 20:28:46 +0800
> From: Ming Lei 
> To: linux-kernel@vger.kernel.org
> Cc: Jan Kara , Ming Lei ,
> Ted Tso , linux-e...@vger.kernel.org,
> "linux-fsde...@vger.kernel.org" 
> Subject: [PATCH] ext4: pair trace_ext4_writepages &
> trace_ext4_writepages_result
> 
> Pair the two trace events to make troubeshooting writepages
> easier, and it should be more convinient to write a simple script
> to parse the traces.
> 
> Cc: Ted Tso 
> Cc: linux-e...@vger.kernel.org
> Cc: "linux-fsde...@vger.kernel.org" 
> Cc: Jan Kara 
> Signed-off-by: Ming Lei 

It looks good.

Thanks!

Reviewed-by: Lukas Czerner 

> ---
>  fs/ext4/inode.c |   11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 32beaa4..0ad73d4 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2420,16 +2420,15 @@ static int ext4_writepages(struct address_space 
> *mapping,
>* because that could violate lock ordering on umount
>*/
>   if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> - return 0;
> + goto out_writepages;
>  
>   if (ext4_should_journal_data(inode)) {
>   struct blk_plug plug;
> - int ret;
>  
>   blk_start_plug(&plug);
>   ret = write_cache_pages(mapping, wbc, __writepage, mapping);
>   blk_finish_plug(&plug);
> - return ret;
> + goto out_writepages;
>   }
>  
>   /*
> @@ -2442,8 +2441,10 @@ static int ext4_writepages(struct address_space 
> *mapping,
>* *never* be called, so if that ever happens, we would want
>* the stack trace.
>*/
> - if (unlikely(sbi->s_mount_flags & EXT4_MF_FS_ABORTED))
> - return -EROFS;
> + if (unlikely(sbi->s_mount_flags & EXT4_MF_FS_ABORTED)) {
> + ret = -EROFS;
> + goto out_writepages;
> + }
>  
>   if (ext4_should_dioread_nolock(inode)) {
>   /*
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: bisected! (WAS Re: s390x: kernel BUG at fs/ext4/inode.c:1591!)

2013-04-03 Thread Lukáš Czerner

On Wed, 3 Apr 2013, CAI Qian wrote:

> Date: Wed, 3 Apr 2013 00:00:17 -0400 (EDT)
> From: CAI Qian 
> To: Dmitry Monakhov 
> Cc: Theodore Ts'o , LKML ,
> linux-s390 , Steve Best ,
> linux-e...@vger.kernel.org, Lukas Czerner ,
> sta...@vger.kernel.org
> Subject: Re: bisected! (WAS Re: s390x: kernel BUG at fs/ext4/inode.c:1591!)
> 
> 
> 
> - Original Message -
> > From: "Dmitry Monakhov" 
> > To: "CAI Qian" 
> > Cc: "Theodore Ts'o" , "LKML" , 
> > "linux-s390"
> > , "Steve Best" , 
> > linux-e...@vger.kernel.org
> > Sent: Tuesday, April 2, 2013 6:01:36 PM
> > Subject: Re: bisected! (WAS Re: s390x: kernel BUG at fs/ext4/inode.c:1591!)
> > 
> > On Tue, 2 Apr 2013 00:06:24 -0400 (EDT), CAI Qian  
> > wrote:
> > > Bisect indicated this is the culprit,
> > > 
> > > 0e401101db49959f5783f6ee9e676124b5a183ac
> > > ext4: fix memory leakage in mext_check_coverage
> > Strange...
> > It changes a bug in move_extent.c (e4defrag functionality)
> > ASAIU you just previously stopped your bisecting process here. Right?
> > Is this indeed a first bad commit?
> Hmm, bisect went wrong in the first place. Now double-confirmed this is
> the culprit,
> 
> 4f42f80a8f08d4c3f52c4267361241885d5dee3a
> ext4: use s_extent_max_zeroout_kb value as number of kb

With this commit we're zeroing parts of uninitialized extents when
converting uninitialized extents to initialized as we should. This is
unlikely to be real cause, though it probably uncover some another bug which
we could not notice before.

-Lukas

> 
> > > 
> > > This following with Dmitry's debug patch applied,
> > > 
> > > CAI Qian
> > > 
> > > Ý  101.408610¨ ES cache assertation failed for inode: 753 es_cached ex
> > > Ý56/5/744
> > > 81/20¨ != found ex Ý56/5/3396400/0¨ retval 0 flags 5
> > > Ý  209.858899¨ ES cache assertation failed for inode: 384 es_cached ex
> > > Ý57/7/332
> > > 82/20¨ != found ex Ý57/7/3396400/0¨ retval 0 flags 5
> > > Ý  209.860656¨ ES cache assertation failed for inode: 384 es_cached ex
> > > Ý25/1/332
> > > 50/20¨ != found ex Ý25/1/0/0¨ retval 0 flags 0
> > > Ý  209.893587¨ ES cache assertation failed for inode: 384 es_cached ex
> > > Ý22/1/332
> > > 47/20¨ != found ex Ý22/1/34838/1000¨ retval 1 flags 0
> > > Ý  209.913482¨ ES cache assertation failed for inode: 384 es_cached ex
> > > Ý27/1/329
> > > 40/20¨ != found ex Ý27/1/0/0¨ retval 0 flags 0
> > > Ý  209.919950¨ ES cache assertation failed for inode: 384 es_cached ex
> > > Ý59/5/338
> > > 48/20¨ != found ex Ý59/5/3396400/0¨ retval 0 flags 5
> > > Ý  209.931856¨ ES cache assertation failed for inode: 384 es_cached ex
> > > Ý7/1/3292
> > > 0/20¨ != found ex Ý7/1/35879/20¨ retval 1 flags 43
> > > Ý  209.969282¨ ES cache assertation failed for inode: 384 es_cached ex
> > > Ý35/1/361
> > > 97/20¨ != found ex Ý35/1/36197/1000¨ retval 1 flags 0
> > > Ý  209.969290¨ ES cache assertation failed for inode: 384 es_cached ex
> > > Ý48/1/362
> > > 10/20¨ != found ex Ý48/1/0/0¨ retval 0 flags 0
> > > Ý  209.980724¨ ES cache assertation failed for inode: 384 es_cached ex
> > > Ý13/4/334
> > > 89/20¨ != found ex Ý13/4/2161372/0¨ retval 0 flags 5
> > > Ý  209.980744¨ ES cache assertation failed for inode: 384 es_cached ex
> > > Ý61/3/335
> > > 37/20¨ != found ex Ý61/3/3396400/0¨ retval 0 flags 5
> > > Ý  209.983848¨ ES cache assertation failed for inode: 384 es_cached ex
> > > Ý44/2/335
> > > 20/20¨ != found ex Ý44/2/36216/20¨ retval 2 flags 43
> > > Ý  210.020041¨ ES cache assertation failed for inode: 384 es_cached ex
> > > Ý61/3/341
> > > 91/20¨ != found ex Ý61/3/3396400/0¨ retval 0 flags 5
> > > Ý  210.050100¨ ES cache assertation failed for inode: 384 es_cached ex
> > > Ý22/11/34
> > > 565/20¨ != found ex Ý22/11/3396400/0¨ retval 0 flags 5
> > > Ý  210.053271¨ ES cache assertation failed for inode: 384 es_cached ex
> > > Ý15/1/334
> > > 90/20¨ != found ex Ý15/1/33579/1000¨ retval 1 flags 1
> > It does not looks like bigendian issue, actually I cant find any logical
> > system in the log. The only thing I see is that es_cache is
> > horribly out of sync with extent_tree.
> > Please try this patch:
> I'll test that shortly.
> CAI Qian
> > 
> > 
> > [Text Documents:disable-es_lookup_extent.patch]
> > 
> > 
> > It just disable es_cache lookup feature should. This should helps to
> > determine whenever this is a es_cache issue or not.
> > > Ý  210.053275¨ mpage_da_submit_io failed block=33490 != b_blocknr=33579
> > > Ý  210.053277¨ ino:384 lbkl:15, b_state=0x1023, b_size=4096
> > > Ý  210.053320¨ Ý cut here ¨
> > > Ý  210.053323¨ kernel BUG at fs/ext4/inode.c:1639!
> > > Ý  210.053402¨ illegal operation: 0001 Ý#1¨ SMP
> > > Ý  210.053405¨ Modules linked in: nf_conntrack_netbios_ns
> > > nf_conntrack_broadcast
> > >  ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6
> > >  nf_defrag_ipv6 ipt
> > > able_nat nf_nat_ipv4 nf_nat iptable_mangle ipt_REJECT nf_conntrack_ipv4
> > > nf_defra
> > > g_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_f

Re: linux-next: build failure after merge of the ext4 tree

2013-04-04 Thread Lukáš Czerner

On Thu, 4 Apr 2013, Stephen Rothwell wrote:

> Date: Thu, 4 Apr 2013 10:43:19 +1100
> From: Stephen Rothwell 
> To: Theodore Ts'o 
> Cc: linux-n...@vger.kernel.org, linux-kernel@vger.kernel.org,
> Lukas Czerner 
> Subject: linux-next: build failure after merge of the ext4 tree
> 
> Hi Ted,
> 
> After merging the ext4 tree, today's linux-next build (powerpc
> ppc64_defconfig) failed like this:
> 
> fs/ext4/resize.c: In function 'ext4_resize_fs':
> fs/ext4/ext4.h:1799:21: sorry, unimplemented: inlining failed in call to 
> 'ext4_get_group_number': function body not available
> fs/ext4/resize.c:1882:10: sorry, unimplemented: called from here
> fs/ext4/ext4.h:1799:21: sorry, unimplemented: inlining failed in call to 
> 'ext4_get_group_number': function body not available
> fs/ext4/resize.c:275:9: sorry, unimplemented: called from here
> fs/ext4/ext4.h:1799:21: sorry, unimplemented: inlining failed in call to 
> 'ext4_get_group_number': function body not available
> fs/ext4/resize.c:287:9: sorry, unimplemented: called from here
> fs/ext4/ext4.h:1799:21: sorry, unimplemented: inlining failed in call to 
> 'ext4_get_group_number': function body not available
> fs/ext4/resize.c:299:9: sorry, unimplemented: called from here
> fs/ext4/mballoc.c: In function 'ext4_mb_release_context':
> fs/ext4/ext4.h:1799:21: sorry, unimplemented: inlining failed in call to 
> 'ext4_get_group_number': function body not available
> fs/ext4/mballoc.c:3347:6: sorry, unimplemented: called from here

I am afraid I do not understand why this is happening. The commit
simply replaces ext4_get_group_no_and_offset() with
ext4_get_group_number() which has the similar definition in the same
header file. Maybe someone knows what this is all about ?

> 
> Caused by commit dc351389caa5 ("ext4: introduce ext4_get_group_number()").
> 
> This build was done with gcc 4.6.3 if that matters.
> 
> I have used the ext4 tree from next-20130403 for today.
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: build failure after merge of the ext4 tree

2013-04-04 Thread Lukáš Czerner

On Thu, 4 Apr 2013, Theodore Ts'o wrote:

> Date: Thu, 4 Apr 2013 09:20:48 -0400
> From: Theodore Ts'o 
> To: Lukáš Czerner 
> Cc: Stephen Rothwell , linux-n...@vger.kernel.org,
> linux-kernel@vger.kernel.org
> Subject: Re: linux-next: build failure after merge of the ext4 tree
> 
> On Thu, Apr 04, 2013 at 03:18:32PM +0200, Lukáš Czerner wrote:
> > I am afraid I do not understand why this is happening. The commit
> > simply replaces ext4_get_group_no_and_offset() with
> > ext4_get_group_number() which has the similar definition in the same
> > header file. Maybe someone knows what this is all about ?
> 
> I've fixed this up already in the ext4 tree.  The problem is that you
> defined the function ext4_get_group_number() as "inline", but the
> function body was only in fs/ext4/balloc.c.  I have no idea why gcc
> wasn't complaining about this on x86, but the fix was to simply
> declare the function as "extern", not as "inline".
> 
>   - Ted
> 

Ok, that's why I have not seen it in ext4 tree :). I do not know why
I did it, but it was obviously wrong.

Thanks for fixing it.
-Lukas

Re: [PATCH v3 08/18] gfs2: use ->invalidatepage() length argument

2013-04-09 Thread Lukáš Czerner

On Tue, 9 Apr 2013, Bob Peterson wrote:

> Date: Tue, 9 Apr 2013 09:09:20 -0400 (EDT)
> From: Bob Peterson 
> To: Lukas Czerner 
> Cc: linux...@kvack.org, linux-kernel@vger.kernel.org,
> linux-fsde...@vger.kernel.org, linux-e...@vger.kernel.org,
> cluster-de...@redhat.com
> Subject: Re: [PATCH v3 08/18] gfs2: use ->invalidatepage() length argument
> 
> Hi,
> 
> - Original Message -
> | ->invalidatepage() aop now accepts range to invalidate so we can make
> | use of it in gfs2_invalidatepage().
> | 
> | Signed-off-by: Lukas Czerner 
> | Cc: cluster-de...@redhat.com
> | ---
> |  fs/gfs2/aops.c |9 +++--
> |  1 files changed, 7 insertions(+), 2 deletions(-)
> | 
> | diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> | index 37093ba..ea920bf 100644
> | --- a/fs/gfs2/aops.c
> | +++ b/fs/gfs2/aops.c
> | @@ -947,24 +947,29 @@ static void gfs2_invalidatepage(struct page *page,
> | unsigned int offset,
> | unsigned int length)
> |  {
> | struct gfs2_sbd *sdp = GFS2_SB(page->mapping->host);
> | +   unsigned int stop = offset + length;
> | +   int partial_page = (offset || length < PAGE_CACHE_SIZE);
> | struct buffer_head *bh, *head;
> | unsigned long pos = 0;
> |  
> | BUG_ON(!PageLocked(page));
> | -   if (offset == 0)
> | +   if (!partial_page)
> | ClearPageChecked(page);
> | if (!page_has_buffers(page))
> | goto out;
> |  
> | bh = head = page_buffers(page);
> | do {
> | +   if (pos + bh->b_size > stop)
> | +   return;
> | +
> 
> I always regret it when I review something this early in the morning,
> or post before the caffeine has taken full effect. But...
> Shouldn't the statement above be (1) ">= stop" and (2) goto out;?

Hi,

1: No, because "stop = offset + length" and we're comparing it with
   "pos + bh->b_size". If it would be ">=" then we would always miss
   the last buffer. Let's assume pos = 0, b_size = 1024 and we're
   invalidating offset = 0 and length = 1024 .. the rest you can try
   for yourself :)

2: No, because if this condition is true, we can never have full page
   invalidate. With full page invalidation we'll walk all the
   buffers within the page hence we would jump our of the cycle when
   "bh == head".

Thanks!
-Lukas

> 
> | if (offset <= pos)
> | gfs2_discard(sdp, bh);
> | pos += bh->b_size;
> | bh = bh->b_this_page;
> | } while (bh != head);
> |  out:
> | -   if (offset == 0)
> | +   if (!partial_page)
> | try_to_release_page(page, 0);
> |  }
> 
> Regards,
> 
> Bob Peterson
> Red Hat File Systems
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 09/18] reiserfs: use ->invalidatepage() length argument

2013-04-10 Thread Lukáš Czerner

On Tue, 9 Apr 2013, Jan Kara wrote:

> Date: Tue, 9 Apr 2013 15:27:56 +0200
> From: Jan Kara 
> To: Lukas Czerner 
> Cc: linux...@kvack.org, linux-kernel@vger.kernel.org,
> linux-fsde...@vger.kernel.org, linux-e...@vger.kernel.org,
> reiserfs-de...@vger.kernel.org
> Subject: Re: [PATCH v3 09/18] reiserfs: use ->invalidatepage() length argument
> 
> On Tue 09-04-13 11:14:18, Lukas Czerner wrote:
> > ->invalidatepage() aop now accepts range to invalidate so we can make
> > use of it in reiserfs_invalidatepage()
>   Hum, reiserfs is probably never going to support punch hole. So shouldn't
> we rather WARN and return without doing anything if stop !=
> PAGE_CACHE_SIZE?
> 
>   Honza

Hi,

I can not even think of the case when this would happen since it
could not happen before either. However in the case it happens the
code will do what's expected. So I do not have any strong preference
about this one, but I do not think it's necessary to WARN here. If
you still insist on the WARN, let me know and I'll resend the patch.

Thanks for the reviews!
-Lukas

> > 
> > Signed-off-by: Lukas Czerner 
> > Cc: reiserfs-de...@vger.kernel.org
> > ---
> >  fs/reiserfs/inode.c |9 +++--
> >  1 files changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
> > index 808e02e..e963164 100644
> > --- a/fs/reiserfs/inode.c
> > +++ b/fs/reiserfs/inode.c
> > @@ -2975,11 +2975,13 @@ static void reiserfs_invalidatepage(struct page 
> > *page, unsigned int offset,
> > struct buffer_head *head, *bh, *next;
> > struct inode *inode = page->mapping->host;
> > unsigned int curr_off = 0;
> > +   unsigned int stop = offset + length;
> > +   int partial_page = (offset || length < PAGE_CACHE_SIZE);
> > int ret = 1;
> >  
> > BUG_ON(!PageLocked(page));
> >  
> > -   if (offset == 0)
> > +   if (!partial_page)
> > ClearPageChecked(page);
> >  
> > if (!page_has_buffers(page))
> > @@ -2991,6 +2993,9 @@ static void reiserfs_invalidatepage(struct page 
> > *page, unsigned int offset,
> > unsigned int next_off = curr_off + bh->b_size;
> > next = bh->b_this_page;
> >  
> > +   if (next_off > stop)
> > +   goto out;
> > +
> > /*
> >  * is this block fully invalidated?
> >  */
> > @@ -3009,7 +3014,7 @@ static void reiserfs_invalidatepage(struct page 
> > *page, unsigned int offset,
> >  * The get_block cached value has been unconditionally invalidated,
> >  * so real IO is not possible anymore.
> >  */
> > -   if (!offset && ret) {
> > +   if (!partial_page && ret) {
> > ret = try_to_release_page(page, 0);
> > /* maybe should BUG_ON(!ret); - neilb */
> > }
> > -- 
> > 1.7.7.6
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v4 15/20] ext4: use ext4_zero_partial_blocks in punch_hole

2013-06-19 Thread Lukáš Czerner

On Thu, 13 Jun 2013, Theodore Ts'o wrote:

> Date: Thu, 13 Jun 2013 23:01:54 -0400
> From: Theodore Ts'o 
> To: Lukas Czerner 
> Cc: linux...@kvack.org, linux-kernel@vger.kernel.org,
> linux-fsde...@vger.kernel.org, linux-e...@vger.kernel.org,
> a...@linux-foundation.org, hu...@google.com
> Subject: Re: [PATCH v4 15/20] ext4: use ext4_zero_partial_blocks in punch_hole
> 
> On Tue, May 14, 2013 at 06:37:29PM +0200, Lukas Czerner wrote:
> > We're doing to get rid of ext4_discard_partial_page_buffers() since it is
> > duplicating some code and also partially duplicating work of
> > truncate_pagecache_range(), moreover the old implementation was much
> > clearer.
> > 
> > Now when the truncate_inode_pages_range() can handle truncating non page
> > aligned regions we can use this to invalidate and zero out block aligned
> > region of the punched out range and then use ext4_block_truncate_page()
> > to zero the unaligned blocks on the start and end of the range. This
> > will greatly simplify the punch hole code. Moreover after this commit we
> > can get rid of the ext4_discard_partial_page_buffers() completely.
> > 
> > We also introduce function ext4_prepare_punch_hole() to do come common
> > operations before we attempt to do the actual punch hole on
> > indirect or extent file which saves us some code duplication.
> > 
> > This has been tested on ppc64 with 1k block size with fsx and xfstests
> > without any problems.
> > 
> > Signed-off-by: Lukas Czerner 
> 
> Hi Lukas,
> 
> I've been seeing xfstests failures on test generic/300 in nojournal
> mode.
> 
> BEGIN TEST: Ext4 4k block w/ no journal Thu Jun 13 22:38:47 EDT 2013
> Device: /dev/vdb
> mk2fs options: -q -O ^has_journal
> mount options: -o block_validity,noload
> FSTYP -- ext4
> PLATFORM  -- Linux/i686 candygram 3.10.0-rc2-00477-g1e1cad7
> MKFS_OPTIONS  -- -q -O ^has_journal /dev/vdc
> MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity,noload /dev/vdc /vdc
> 
> generic/300   [20:42:18][  116.877278] fio (3320) used greatest stack 
> depth: 5580 bytes left
> [  116.967122] fio (3321) used greatest stack depth: 5560 bytes left
> [  117.573861] fio (3325) used greatest stack depth: 5504 bytes left
>  [20:44:01] [failed, exit status 1] - output mismatch (see 
> /root/xfstests/results/generic/300.out.bad)
> --- tests/generic/300.out  2013-06-04 22:42:55.0 -0400
> +++ /root/xfstests/results/generic/300.out.bad   2013-06-13 
> 20:44:01.30665 -0400
> @@ -2,3 +2,4 @@
>  
>  Run fio with random aio-dio pattern
>  
> +_check_generic_filesystem: filesystem on /dev/vdc is inconsistent (see 
> /root/xfstests/results/generic/300.full)
>  ...
>  (Run 'diff -u tests/generic/300.out 
> /root/xfstests/results/generic/300.out.bad' to see the entire diff)

I think I've got this. The problem actually is in
ext4_zero_partial_blocks() where we would attempt to zero out page
which has been previously released by truncate_pagecache_range().
This might happen when we're punching out just a single page because
in ext4_zero_partial_blocks() we do not check whether we're dealing
with the whole, or partial page. At the point we're going to zero it
out it might have been already released and reused by someone else.

This patch should fix this issue. And indeed with this applied I do
not see the problem anymore but I am still testing.

-Lukas


diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3acf353..ce9f926 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3698,33 +3698,36 @@ int ext4_zero_partial_blocks(handle_t *handle, struct 
inode *inode,
 {
struct super_block *sb = inode->i_sb;
struct address_space *mapping = inode->i_mapping;
-   unsigned partial = lstart & (sb->s_blocksize - 1);
+   unsigned partial_start, partial_end;
ext4_fsblk_t start, end;
loff_t byte_end = (lstart + length - 1);
int err = 0;
 
+   partial_start = lstart & (sb->s_blocksize - 1);
+   partial_end = byte_end & (sb->s_blocksize - 1);
+
start = lstart >> sb->s_blocksize_bits;
end = byte_end >> sb->s_blocksize_bits;
 
/* Handle partial zero within the single block */
-   if (start == end) {
+   if (start == end &&
+   (partial_start || (partial_end != sb->s_blocksize - 1))) {
err = ext4_block_zero_page_range(handle, mapping,
 lstart, length);
return err;
}
/* Handle partial zero out on the start of the range */
-   if (partial) {
+   if (partial_start) {
err = ext4_block_zero_page_range(handle, mapping,
 lstart, sb->s_blocksize);
if (err)
return err;
}
/* Handle partial zero out on the end of the range */
-   partial = byte_end & (sb->s_blocksize - 1);
-   if (partial != sb->s_blocksize - 1)
+

Re: [PATCH v3 17/18] ext4: make punch hole code path work with bigalloc

2013-04-24 Thread Lukáš Czerner

On Sat, 20 Apr 2013, Jan Kara wrote:

> Date: Sat, 20 Apr 2013 15:42:41 +0200
> From: Jan Kara 
> To: Lukas Czerner 
> Cc: linux...@kvack.org, linux-kernel@vger.kernel.org,
> linux-fsde...@vger.kernel.org, linux-e...@vger.kernel.org
> Subject: Re: [PATCH v3 17/18] ext4: make punch hole code path work with
> bigalloc
> 
> On Tue 09-04-13 11:14:26, Lukas Czerner wrote:
> > Currently punch hole is disabled in file systems with bigalloc
> > feature enabled. However the recent changes in punch hole patch should
> > make it easier to support punching holes on bigalloc enabled file
> > systems.
> > 
> > This commit changes partial_cluster handling in ext4_remove_blocks(),
> > ext4_ext_rm_leaf() and ext4_ext_remove_space(). Currently
> > partial_cluster is unsigned long long type and it makes sure that we
> > will free the partial cluster if all extents has been released from that
> > cluster. However it has been specifically designed only for truncate.
> > 
> > With punch hole we can be freeing just some extents in the cluster
> > leaving the rest untouched. So we have to make sure that we will notice
> > cluster which still has some extents. To do this I've changed
> > partial_cluster to be signed long long type. The only scenario where
> > this could be a problem is when cluster_size == block size, however in
> > that case there would not be any partial clusters so we're safe. For
> > bigger clusters the signed type is enough. Now we use the negative value
> > in partial_cluster to mark such cluster used, hence we know that we must
> > not free it even if all other extents has been freed from such cluster.
> > 
> > This scenario can be described in simple diagram:
> > 
> > |FFF...FF..FF.UUU|
> >  ^--^
> >   punch hole
> > 
> > . - free space
> > | - cluster boundary
> > F - freed extent
> > U - used extent
> > 
> > Also update respective tracepoints to use signed long long type for
> > partial_cluster.
>   The patch looks OK. You can add:
> Reviewed-by: Jan Kara 
> 
>   Just a minor nit - sometimes you use 'signed long long', sometimes 'long
> long int', sometimes just 'long long'. In kernel we tend to always use just
> 'long long' so it would be good to clean that up.

Sure, I can do that.

Thanks!
-Lukas

> 
>   Honza
> > 
> > Signed-off-by: Lukas Czerner 
> > ---
> >  fs/ext4/extents.c   |   69 
> > +++---
> >  include/trace/events/ext4.h |   25 ---
> >  2 files changed, 64 insertions(+), 30 deletions(-)
> > 
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index 9023b76..577c4f5 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -2362,7 +2362,7 @@ int ext4_ext_index_trans_blocks(struct inode *inode, 
> > int nrblocks, int chunk)
> >  
> >  static int ext4_remove_blocks(handle_t *handle, struct inode *inode,
> >   struct ext4_extent *ex,
> > - ext4_fsblk_t *partial_cluster,
> > + signed long long *partial_cluster,
> >   ext4_lblk_t from, ext4_lblk_t to)
> >  {
> > struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> > @@ -2391,7 +2391,8 @@ static int ext4_remove_blocks(handle_t *handle, 
> > struct inode *inode,
> >  * partial cluster here.
> >  */
> > pblk = ext4_ext_pblock(ex) + ee_len - 1;
> > -   if (*partial_cluster && (EXT4_B2C(sbi, pblk) != *partial_cluster)) {
> > +   if ((*partial_cluster > 0) &&
> > +   (EXT4_B2C(sbi, pblk) != *partial_cluster)) {
> > ext4_free_blocks(handle, inode, NULL,
> >  EXT4_C2B(sbi, *partial_cluster),
> >  sbi->s_cluster_ratio, flags);
> > @@ -2417,23 +2418,41 @@ static int ext4_remove_blocks(handle_t *handle, 
> > struct inode *inode,
> > && to == le32_to_cpu(ex->ee_block) + ee_len - 1) {
> > /* tail removal */
> > ext4_lblk_t num;
> > +   unsigned int unaligned;
> >  
> > num = le32_to_cpu(ex->ee_block) + ee_len - from;
> > pblk = ext4_ext_pblock(ex) + ee_len - num;
> > -   ext_debug("free last %u blocks starting %llu\n", num, pblk);
> > +   /*
> > +* Usually we want to free partial cluster at the end of the
> > +* extent, except for the situation when the cluster is still
> > +* used by any other extent (partial_cluster is negative).
> > +*/
> > +   if (*partial_cluster < 0 &&
> > +   -(*partial_cluster) == EXT4_B2C(sbi, pblk + num - 1))
> > +   flags |= EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER;
> > +
> > +   ext_debug("free last %u blocks starting %llu partial %lld\n",
> > + num, pblk, *partial_cluster);
> > ext4_free_blocks(handle, inode, NULL, pblk, num, flags);
> > /*
> >  * If the block range to be freed didn'

Re: [PATCH v3 17/18] ext4: make punch hole code path work with bigalloc

2013-04-24 Thread Lukáš Czerner

On Tue, 23 Apr 2013, Zheng Liu wrote:

> Date: Tue, 23 Apr 2013 17:19:28 +0800
> From: Zheng Liu 
> To: Jan Kara 
> Cc: Lukas Czerner , linux...@kvack.org,
> linux-kernel@vger.kernel.org, linux-fsde...@vger.kernel.org,
> linux-e...@vger.kernel.org
> Subject: Re: [PATCH v3 17/18] ext4: make punch hole code path work with
> bigalloc
> 
> On Sat, Apr 20, 2013 at 03:42:41PM +0200, Jan Kara wrote:
> > On Tue 09-04-13 11:14:26, Lukas Czerner wrote:
> > > Currently punch hole is disabled in file systems with bigalloc
> > > feature enabled. However the recent changes in punch hole patch should
> > > make it easier to support punching holes on bigalloc enabled file
> > > systems.
> > > 
> > > This commit changes partial_cluster handling in ext4_remove_blocks(),
> > > ext4_ext_rm_leaf() and ext4_ext_remove_space(). Currently
> > > partial_cluster is unsigned long long type and it makes sure that we
> > > will free the partial cluster if all extents has been released from that
> > > cluster. However it has been specifically designed only for truncate.
> > > 
> > > With punch hole we can be freeing just some extents in the cluster
> > > leaving the rest untouched. So we have to make sure that we will notice
> > > cluster which still has some extents. To do this I've changed
> > > partial_cluster to be signed long long type. The only scenario where
> > > this could be a problem is when cluster_size == block size, however in
> > > that case there would not be any partial clusters so we're safe. For
> > > bigger clusters the signed type is enough. Now we use the negative value
> > > in partial_cluster to mark such cluster used, hence we know that we must
> > > not free it even if all other extents has been freed from such cluster.
> > > 
> > > This scenario can be described in simple diagram:
> > > 
> > > |FFF...FF..FF.UUU|
> > >  ^--^
> > >   punch hole
> > > 
> > > . - free space
> > > | - cluster boundary
> > > F - freed extent
> > > U - used extent
> > > 
> > > Also update respective tracepoints to use signed long long type for
> > > partial_cluster.
> >   The patch looks OK. You can add:
> > Reviewed-by: Jan Kara 
> > 
> >   Just a minor nit - sometimes you use 'signed long long', sometimes 'long
> > long int', sometimes just 'long long'. In kernel we tend to always use just
> > 'long long' so it would be good to clean that up.
> 
> Another question is that in patch 01/18 invalidatepage signature is
> changed from
>   int (*invalidatepage) (struct page *, unsigned long);
> to
>   void (*invalidatepage) (struct page *, unsigned int, unsigned int);
> 
> The argument type is changed from 'unsigned long' to 'unsigned int'.  My
> question is why we need to change it.
> 
> Thanks,
> - Zheng
> 

Hi Zheng,

this was changed on Hugh Dickins request because it makes it clearer
that those args are indeed intended to be offsets within a page
(0..PAGE_CACHE_SIZE).

Even though PAGE_CACHE_SIZE can be defined as unsigned long, this is
only for convenience. Here is quote from Hugh:

  "
  They would be defined as unsigned long so that they can be used in
  masks like ~(PAGE_SIZE - 1), and behave as expected on addresses,
  without needing casts to be added all over.

  We do not (currently!) expect PAGE_SIZE or PAGE_CACHE_SIZE to grow
  beyond an unsigned int - but indeed they can be larger than what's
  held in an unsigned short (look no further than ia64 or ppc64).

  For more reassurance, see include/linux/highmem.h, which declares
  zero_user_segments() and others: unsigned int (well, unsigned with
  the int implicit) for offsets within a page.

  Hugh
  "

I should probably mention that in the description.

Thanks!
-Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v4 20/20] ext4: Allow punch hole with bigalloc enabled

2013-06-11 Thread Lukáš Czerner

On Fri, 31 May 2013, Theodore Ts'o wrote:

> Date: Fri, 31 May 2013 11:14:54 -0400
> From: Theodore Ts'o 
> To: Lukas Czerner 
> Cc: linux...@kvack.org, linux-kernel@vger.kernel.org,
> linux-fsde...@vger.kernel.org, linux-e...@vger.kernel.org,
> a...@linux-foundation.org, hu...@google.com
> Subject: Re: [PATCH v4 20/20] ext4: Allow punch hole with bigalloc enabled
> 
> On Tue, May 14, 2013 at 06:37:34PM +0200, Lukas Czerner wrote:
> > In commits 5f95d21fb6f2aaa52830e5b7fb405f6c71d3ab85 and
> > 30bc2ec9598a1b156ad75217f2e7d4560efdeeab we've reworked punch_hole
> > implementation and there is noting holding us back from using punch hole
> > on file system with bigalloc feature enabled.
> > 
> > This has been tested with fsx and xfstests.
> > 
> > Signed-off-by: Lukas Czerner 
> > Reviewed-by: Jan Kara 
> 
> This patch is causing a test failure with bigalloc enabled with the
> xfstests shared/298.
> 
> Since it's at the end of the invalidate page range tests, I'm going to
> drop this patch for now.  Could you take a look at this?
> 
> Thanks!!
> 
>   - Ted

Hi Ted,

I should have really noticed this earlier. This test (shared/298)
have nothing to do with bigalloc, nor punch hole. It tests file
system discard implementation.

The most likely reason it failed for you is that the tests does not
count with bigalloc feature. However it seems to be working for me
without any problems. Can you provide more information about the
problem you've seen, or at least your xfstest configuration so we
can see what went wrong and possibly fix the test ?

Tom can you take a look at this ? (Adding Tomas Racek to the CC)

So, since this failure is not really related to the patch itself,
can we re-include the patch (it might be already too late I guess).

Thanks!
-Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 10/18] mm: teach truncate_inode_pages_range() to handle non page aligned ranges

2013-02-04 Thread Lukáš Czerner

On Fri, 1 Feb 2013, Andrew Morton wrote:

> Date: Fri, 1 Feb 2013 15:15:02 -0800
> From: Andrew Morton 
> To: Lukas Czerner 
> Cc: linux...@kvack.org, linux-kernel@vger.kernel.org,
> linux-fsde...@vger.kernel.org, linux-e...@vger.kernel.org,
> x...@oss.sgi.com, Hugh Dickins 
> Subject: Re: [PATCH 10/18] mm: teach truncate_inode_pages_range() to handle
> non page aligned ranges
> 
> On Fri,  1 Feb 2013 11:43:36 +0100
> Lukas Czerner  wrote:
> 
> > This commit changes truncate_inode_pages_range() so it can handle non
> > page aligned regions of the truncate. Currently we can hit BUG_ON when
> > the end of the range is not page aligned, but we can handle unaligned
> > start of the range.
> > 
> > Being able to handle non page aligned regions of the page can help file
> > system punch_hole implementations and save some work, because once we're
> > holding the page we might as well deal with it right away.
> > 
> > In previous commits we've changed ->invalidatepage() prototype to accept
> > 'length' argument to be able to specify range to invalidate. No we can
> > use that new ability in truncate_inode_pages_range().
> 
> The change seems sensible.
> 
> > This was based on the code provided by Hugh Dickins
> 
> Despite this ;)
> 
> > changes to make use of do_invalidatepage_range().
> >
> > ...
> >
> >  void truncate_inode_pages_range(struct address_space *mapping,
> > loff_t lstart, loff_t lend)
> >  {
> > -   const pgoff_t start = (lstart + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
> > -   const unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
> > +   pgoff_t start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> > +   pgoff_t end = (lend + 1) >> PAGE_CACHE_SHIFT;
> > +   unsigned int partial_start = lstart & (PAGE_CACHE_SIZE - 1);
> > +   unsigned int partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
> > struct pagevec pvec;
> > pgoff_t index;
> > -   pgoff_t end;
> > int i;
> 
> This is starting to get pretty hairy.  Some of these "end" variables
> are inclusive and some are exclusive.

Yes, I agree that it's little bit confusing.

> 
> Can we improve things?  We can drop all this tiresome
> intialisation-at-declaration-site stuff and do:

Yes, I agree that this will make things cleaner.

> 
>   pgoff_t start;  /* inclusive */
>   pgoff_t end;/* exclusive */
>   unsigned int partial_start; /* inclusive */
>   unsigned int partial_end;   /* exclusive */
>   struct pagevec pvec;
>   pgoff_t index;
>   int i;
> 
>   start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>   end = (lend + 1) >> PAGE_CACHE_SHIFT;
>   partial_start = lstart & (PAGE_CACHE_SIZE - 1);
>   partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
> 
> And lo, I see that the "inclusive" thing only applies to incoming arg
> `lend'.  I seem to recall that being my handiwork and somehow I seem to
> not have documented the reason: it was so that we can pass
> lend=0x into truncate_inode_pages_range) to indicate "end of
> file".
> 
> Your code handles this in a rather nasty fashion.  It permits the above
> overflow to occur then later fixes it up with an explicit test for -1. 
> And it then sets `end' (which is a pgoff_t!) to -1.
> 
> I guess this works, but let's make it clearer, with something like:
> 
>   if (lend == -1) {
>   /*
>* Nice explanation goes here
>*/
>   end = -1;
>   } else {
>   end = (lend + 1) >> PAGE_CACHE_SHIFT;
>   }

Good point, this is better.

> 
> 
> > cleancache_invalidate_inode(mapping);
> > if (mapping->nrpages == 0)
> > return;
> >  
> > -   BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
> > -   end = (lend >> PAGE_CACHE_SHIFT);
> > +   if (lend == -1)
> > +   end = -1;   /* unsigned, so actually very big */
> >  
> > pagevec_init(&pvec, 0);
> > index = start;
> > -   while (index <= end && pagevec_lookup(&pvec, mapping, index,
> > -   min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> > +   while (index < end && pagevec_lookup(&pvec, mapping, index,
> > +   min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
> 
> Here, my brain burst.  You've effectively added 1 to (end - index).  Is
> that correct?

Not sure what do you mean by that. I have to admit that I've changed
the 'end' variable from previous inclusive to exclusive for two
reasons. First of all it makes more sense to me and second of all it
solves the pain where we're dealing with the partial truncation within
the first page.

> 
> > mem_cgroup_uncharge_start();
> > for (i = 0; i < pagevec_count(&pvec); i++) {
> > struct page *page = pvec.pages[i];
> >  
> > /* We rely upon deletion not changing page->index */
> > index = page->index;
> > -

Re: [PATCH 10/18] mm: teach truncate_inode_pages_range() to handle non page aligned ranges

2013-02-04 Thread Lukáš Czerner

On Mon, 4 Feb 2013, Andrew Morton wrote:

> Date: Mon, 4 Feb 2013 12:51:36 -0800
> From: Andrew Morton 
> To: Lukáš Czerner 
> Cc: linux...@kvack.org, linux-kernel@vger.kernel.org,
> linux-fsde...@vger.kernel.org, linux-e...@vger.kernel.org,
> x...@oss.sgi.com, Hugh Dickins 
> Subject: Re: [PATCH 10/18] mm: teach truncate_inode_pages_range() to handle
> non page aligned ranges
> 
> On Mon, 4 Feb 2013 15:51:19 +0100 (CET)
> Luk Czerner  wrote:
> 
> > I hope I explained myself well enough :). Are you ok with this king
> > of approach ? If so, I'll resend the patch set without the
> > initialisation-at-declaration.
> 
> uh, maybe.  Next time I'll apply the patch and look at the end result! 
> Try to make it as understandable and (hence) maintainable as possible,
> OK?

Agreed.

Thanks!
-Lukas

Re: [PATCH RESEND 3/10] ext4: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate

2014-02-11 Thread Lukáš Czerner

On Sun, 2 Feb 2014, Namjae Jeon wrote:

> Date: Sun,  2 Feb 2014 14:44:34 +0900
> From: Namjae Jeon 
> To: v...@zeniv.linux.org.uk, da...@fromorbit.com, b...@sgi.com, ty...@mit.edu,
> adilger.ker...@dilger.ca, j...@suse.cz, mtk.manpa...@gmail.com
> Cc: linux-fsde...@vger.kernel.org, x...@oss.sgi.com,
> linux-e...@vger.kernel.org, linux-kernel@vger.kernel.org,
> Namjae Jeon , Namjae Jeon ,
> Ashish Sangwan 
> Subject: [PATCH RESEND 3/10] ext4: Add support FALLOC_FL_COLLAPSE_RANGE for
> fallocate
> 
> From: Namjae Jeon 
> 
> Add support FALLOC_FL_COLLAPSE_RANGE for fallocate

Description is missing here, please provide FALLOC_FL_COLLAPSE_RANGE
description so people know what it's supposed to be doing.

more comments bellow.

Thanks!
-Lukas

> 
> Signed-off-by: Namjae Jeon 
> Signed-off-by: Ashish Sangwan 
> ---
>  fs/ext4/ext4.h  |3 +
>  fs/ext4/extents.c   |  297 
> ++-
>  fs/ext4/move_extent.c   |2 +-
>  include/trace/events/ext4.h |   25 
>  4 files changed, 325 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index e618503..5cc015a 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2745,6 +2745,7 @@ extern int ext4_find_delalloc_cluster(struct inode 
> *inode, ext4_lblk_t lblk);
>  extern int ext4_fiemap(struct inode *inode, struct fiemap_extent_info 
> *fieinfo,
>   __u64 start, __u64 len);
>  extern int ext4_ext_precache(struct inode *inode);
> +extern int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t 
> len);
>  
>  /* move_extent.c */
>  extern void ext4_double_down_write_data_sem(struct inode *first,
> @@ -2754,6 +2755,8 @@ extern void ext4_double_up_write_data_sem(struct inode 
> *orig_inode,
>  extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
>__u64 start_orig, __u64 start_donor,
>__u64 len, __u64 *moved_len);
> +extern int mext_next_extent(struct inode *inode, struct ext4_ext_path *path,
> + struct ext4_extent **extent);
>  
>  /* page-io.c */
>  extern int __init ext4_init_pageio(void);
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 267c9fb..12338c1 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4566,12 +4566,16 @@ long ext4_fallocate(struct file *file, int mode, 
> loff_t offset, loff_t len)
>   unsigned int credits, blkbits = inode->i_blkbits;
>  
>   /* Return error if mode is not supported */
> - if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
> +  FALLOC_FL_COLLAPSE_RANGE))
>   return -EOPNOTSUPP;
>  
>   if (mode & FALLOC_FL_PUNCH_HOLE)
>   return ext4_punch_hole(inode, offset, len);
>  
> + if (mode & FALLOC_FL_COLLAPSE_RANGE)
> + return ext4_collapse_range(inode, offset, len);
> +
>   ret = ext4_convert_inline_data(inode);
>   if (ret)
>   return ret;
> @@ -4870,3 +4874,294 @@ int ext4_fiemap(struct inode *inode, struct 
> fiemap_extent_info *fieinfo,
>   ext4_es_lru_add(inode);
>   return error;
>  }
> +
> +/*
> + * ext4_access_path:
> + * Function to access the path buffer for marking it dirty.
> + * It also checks if there are sufficient credits left in the journal handle
> + * to update path.
> + */
> +static int
> +ext4_access_path(handle_t *handle, struct inode *inode,
> + struct ext4_ext_path *path)
> +{
> + int credits, err;
> +
> + /*
> +  * Check if need to extend journal credits
> +  * 3 for leaf, sb, and inode plus 2 (bmap and group
> +  * descriptor) for each block group; assume two block
> +  * groups
> +  */
> + if (handle->h_buffer_credits < 7) {
> + credits = ext4_writepage_trans_blocks(inode);
> + err = ext4_ext_truncate_extend_restart(handle, inode, credits);
> + /* EAGAIN is success */
> + if (err && err != -EAGAIN)
> + return err;
> + }
> +
> + err = ext4_ext_get_access(handle, inode, path);
> + return err;
> +}
> +
> +/*
> + * ext4_ext_shift_path_extents:
> + * Shift the extents of a path structure lying between path[depth].p_ext
> + * and EXT_LAST_EXTENT(path[depth].p_hdr) downwards, by subtracting shift
> + * from starting block for each extent.
> + */
> +static int
> +ext4_ext_shift_path_extents(struct ext4_ext_path *path, ext4_lblk_t shift,
> + struct inode *inode, handle_t *handle,
> + ext4_lblk_t *start)
> +{
> + int depth, err = 0;
> + struct ext4_extent *ex_start, *ex_last;
> + bool update = 0;
> + depth = path->p_depth;
> +
> + while (depth >= 0) {
> + if (depth == path->p_depth) {
> + ex_start = path[depth].p_ext;
> + if (!ex_start)
> +

Re: [ 122/184] ext4: Fix max file size and logical block counting

2013-06-05 Thread Lukáš Czerner

On Tue, 4 Jun 2013, Willy Tarreau wrote:

> Date: Tue, 04 Jun 2013 19:23:32 +0200
> From: Willy Tarreau 
> To: linux-kernel@vger.kernel.org, sta...@vger.kernel.org
> Cc: Lukas Czerner , Theodore Tso ,
> Willy Tarreau 
> Subject: [ 122/184] ext4: Fix max file size and logical block counting
> 
> 2.6.32-longterm review patch.  If anyone has any objections, please let me 
> know.

Now it looks like we could get rid of this thing in e2fsprogs as
well. As well as remove the remaining bits from kernel.
What do you think Ted ?

-Lukas

> 
> --
>  of extent format file
> 
> From: Lukas Czerner 
> 
> commit f17722f917b2f21497deb6edc62fb1683daa08e6 upstream
> 
> Kazuya Mio reported that he was able to hit BUG_ON(next == lblock)
> in ext4_ext_put_gap_in_cache() while creating a sparse file in extent
> format and fill the tail of file up to its end. We will hit the BUG_ON
> when we write the last block (2^32-1) into the sparse file.
> 
> The root cause of the problem lies in the fact that we specifically set
> s_maxbytes so that block at s_maxbytes fit into on-disk extent format,
> which is 32 bit long. However, we are not storing start and end block
> number, but rather start block number and length in blocks. It means
> that in order to cover extent from 0 to EXT_MAX_BLOCK we need
> EXT_MAX_BLOCK+1 to fit into len (because we counting block 0 as well) -
> and it does not.
> 
> The only way to fix it without changing the meaning of the struct
> ext4_extent members is, as Kazuya Mio suggested, to lower s_maxbytes
> by one fs block so we can cover the whole extent we can get by the
> on-disk extent format.
> 
> Also in many places EXT_MAX_BLOCK is used as length instead of maximum
> logical block number as the name suggests, it is all a bit messy. So
> this commit renames it to EXT_MAX_BLOCKS and change its usage in some
> places to actually be maximum number of blocks in the extent.
> 
> The bug which this commit fixes can be reproduced as follows:
> 
>  dd if=/dev/zero of=/mnt/mp1/file bs= count=1 seek=$((2**32-2))
>  sync
>  dd if=/dev/zero of=/mnt/mp1/file bs= count=1 seek=$((2**32-1))
> 
> Reported-by: Kazuya Mio 
> Signed-off-by: Lukas Czerner 
> Signed-off-by: "Theodore Ts'o" 
> [dannf: Applied the backport from RHEL6 to Debian's 2.6.32]
> Signed-off-by: Willy Tarreau 
> ---
>  fs/ext4/ext4_extents.h |  7 +--
>  fs/ext4/extents.c  | 39 +++
>  fs/ext4/move_extent.c  | 10 +-
>  fs/ext4/super.c| 15 ---
>  4 files changed, 41 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h
> index bdb6ce7..24fa647 100644
> --- a/fs/ext4/ext4_extents.h
> +++ b/fs/ext4/ext4_extents.h
> @@ -137,8 +137,11 @@ typedef int (*ext_prepare_callback)(struct inode *, 
> struct ext4_ext_path *,
>  #define EXT_BREAK  1
>  #define EXT_REPEAT 2
>  
> -/* Maximum logical block in a file; ext4_extent's ee_block is __le32 */
> -#define EXT_MAX_BLOCK0x
> +/*
> + * Maximum number of logical blocks in a file; ext4_extent's ee_block is
> + * __le32.
> + */
> +#define EXT_MAX_BLOCKS   0x
>  
>  /*
>   * EXT_INIT_MAX_LEN is the maximum number of blocks we can have in an
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index b4402c8..f4b471d 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -1331,7 +1331,7 @@ got_index:
>  
>  /*
>   * ext4_ext_next_allocated_block:
> - * returns allocated block in subsequent extent or EXT_MAX_BLOCK.
> + * returns allocated block in subsequent extent or EXT_MAX_BLOCKS.
>   * NOTE: it considers block number from index entry as
>   * allocated block. Thus, index entries have to be consistent
>   * with leaves.
> @@ -1345,7 +1345,7 @@ ext4_ext_next_allocated_block(struct ext4_ext_path 
> *path)
>   depth = path->p_depth;
>  
>   if (depth == 0 && path->p_ext == NULL)
> - return EXT_MAX_BLOCK;
> + return EXT_MAX_BLOCKS;
>  
>   while (depth >= 0) {
>   if (depth == path->p_depth) {
> @@ -1362,12 +1362,12 @@ ext4_ext_next_allocated_block(struct ext4_ext_path 
> *path)
>   depth--;
>   }
>  
> - return EXT_MAX_BLOCK;
> + return EXT_MAX_BLOCKS;
>  }
>  
>  /*
>   * ext4_ext_next_leaf_block:
> - * returns first allocated block from next leaf or EXT_MAX_BLOCK
> + * returns first allocated block from next leaf or EXT_MAX_BLOCKS
>   */
>  static ext4_lblk_t ext4_ext_next_leaf_block(struct inode *inode,
>   struct ext4_ext_path *path)
> @@ -1379,7 +1379,7 @@ static ext4_lblk_t ext4_ext_next_leaf_block(struct 
> inode *inode,
>  
>   /* zero-tree has no leaf blocks at all */
>   if (depth == 0)
> - return EXT_MAX_BLOCK;
> + return EXT_MAX_BLOCKS;
>  
>   /* go to index block */
>   depth--;
> @@ -1392,7 +1392,7 @@ static ext4_lblk_t ext4_ext_next_leaf_block(struct 
> inode *inode,
>   de

Re: [PATCH v4 20/20] ext4: Allow punch hole with bigalloc enabled

2013-06-05 Thread Lukáš Czerner

On Fri, 31 May 2013, Theodore Ts'o wrote:

> Date: Fri, 31 May 2013 11:14:54 -0400
> From: Theodore Ts'o 
> To: Lukas Czerner 
> Cc: linux...@kvack.org, linux-kernel@vger.kernel.org,
> linux-fsde...@vger.kernel.org, linux-e...@vger.kernel.org,
> a...@linux-foundation.org, hu...@google.com
> Subject: Re: [PATCH v4 20/20] ext4: Allow punch hole with bigalloc enabled
> 
> On Tue, May 14, 2013 at 06:37:34PM +0200, Lukas Czerner wrote:
> > In commits 5f95d21fb6f2aaa52830e5b7fb405f6c71d3ab85 and
> > 30bc2ec9598a1b156ad75217f2e7d4560efdeeab we've reworked punch_hole
> > implementation and there is noting holding us back from using punch hole
> > on file system with bigalloc feature enabled.
> > 
> > This has been tested with fsx and xfstests.
> > 
> > Signed-off-by: Lukas Czerner 
> > Reviewed-by: Jan Kara 
> 
> This patch is causing a test failure with bigalloc enabled with the
> xfstests shared/298.
> 
> Since it's at the end of the invalidate page range tests, I'm going to
> drop this patch for now.  Could you take a look at this?

Hi Ted,

sorry for the delay, I've been on vacation last week so I am trying
to catch on the recent development :) I'll take a look at it
hopefully this week. Thanks for letting me know.

-Lukas

> 
> Thanks!!
> 
>   - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [ 122/184] ext4: Fix max file size and logical block counting

2013-06-05 Thread Lukáš Czerner

On Wed, 5 Jun 2013, Lukáš Czerner wrote:

> Date: Wed, 5 Jun 2013 11:26:55 +0200 (CEST)
> From: Lukáš Czerner 
> To: Willy Tarreau 
> Cc: linux-kernel@vger.kernel.org, sta...@vger.kernel.org,
> Theodore Tso 
> Subject: Re: [ 122/184] ext4: Fix max file size and logical block counting
> 
> On Tue, 4 Jun 2013, Willy Tarreau wrote:
> 
> > Date: Tue, 04 Jun 2013 19:23:32 +0200
> > From: Willy Tarreau 
> > To: linux-kernel@vger.kernel.org, sta...@vger.kernel.org
> > Cc: Lukas Czerner , Theodore Tso ,
> > Willy Tarreau 
> > Subject: [ 122/184] ext4: Fix max file size and logical block counting
> > 
> > 2.6.32-longterm review patch.  If anyone has any objections, please let me 
> > know.
> 
> Now it looks like we could get rid of this thing in e2fsprogs as
> well. As well as remove the remaining bits from kernel.
> What do you think Ted ?

Oops, this is not the commit I was looking for. Sorry for the noise!

-Lukas

> 
> -Lukas
> 
> > 
> > --
> >  of extent format file
> > 
> > From: Lukas Czerner 
> > 
> > commit f17722f917b2f21497deb6edc62fb1683daa08e6 upstream
> > 
> > Kazuya Mio reported that he was able to hit BUG_ON(next == lblock)
> > in ext4_ext_put_gap_in_cache() while creating a sparse file in extent
> > format and fill the tail of file up to its end. We will hit the BUG_ON
> > when we write the last block (2^32-1) into the sparse file.
> > 
> > The root cause of the problem lies in the fact that we specifically set
> > s_maxbytes so that block at s_maxbytes fit into on-disk extent format,
> > which is 32 bit long. However, we are not storing start and end block
> > number, but rather start block number and length in blocks. It means
> > that in order to cover extent from 0 to EXT_MAX_BLOCK we need
> > EXT_MAX_BLOCK+1 to fit into len (because we counting block 0 as well) -
> > and it does not.
> > 
> > The only way to fix it without changing the meaning of the struct
> > ext4_extent members is, as Kazuya Mio suggested, to lower s_maxbytes
> > by one fs block so we can cover the whole extent we can get by the
> > on-disk extent format.
> > 
> > Also in many places EXT_MAX_BLOCK is used as length instead of maximum
> > logical block number as the name suggests, it is all a bit messy. So
> > this commit renames it to EXT_MAX_BLOCKS and change its usage in some
> > places to actually be maximum number of blocks in the extent.
> > 
> > The bug which this commit fixes can be reproduced as follows:
> > 
> >  dd if=/dev/zero of=/mnt/mp1/file bs= count=1 seek=$((2**32-2))
> >  sync
> >  dd if=/dev/zero of=/mnt/mp1/file bs= count=1 seek=$((2**32-1))
> > 
> > Reported-by: Kazuya Mio 
> > Signed-off-by: Lukas Czerner 
> > Signed-off-by: "Theodore Ts'o" 
> > [dannf: Applied the backport from RHEL6 to Debian's 2.6.32]
> > Signed-off-by: Willy Tarreau 
> > ---
> >  fs/ext4/ext4_extents.h |  7 +--
> >  fs/ext4/extents.c  | 39 +++
> >  fs/ext4/move_extent.c  | 10 +-
> >  fs/ext4/super.c| 15 ---
> >  4 files changed, 41 insertions(+), 30 deletions(-)
> > 
> > diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h
> > index bdb6ce7..24fa647 100644
> > --- a/fs/ext4/ext4_extents.h
> > +++ b/fs/ext4/ext4_extents.h
> > @@ -137,8 +137,11 @@ typedef int (*ext_prepare_callback)(struct inode *, 
> > struct ext4_ext_path *,
> >  #define EXT_BREAK  1
> >  #define EXT_REPEAT 2
> >  
> > -/* Maximum logical block in a file; ext4_extent's ee_block is __le32 */
> > -#define EXT_MAX_BLOCK  0x
> > +/*
> > + * Maximum number of logical blocks in a file; ext4_extent's ee_block is
> > + * __le32.
> > + */
> > +#define EXT_MAX_BLOCKS 0x
> >  
> >  /*
> >   * EXT_INIT_MAX_LEN is the maximum number of blocks we can have in an
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index b4402c8..f4b471d 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -1331,7 +1331,7 @@ got_index:
> >  
> >  /*
> >   * ext4_ext_next_allocated_block:
> > - * returns allocated block in subsequent extent or EXT_MAX_BLOCK.
> > + * returns allocated block in subsequent extent or EXT_MAX_BLOCKS.
> >   * NOTE: it considers block number from index entry as
> >   * allocated block. Thus, index entries have to be consistent
> >   * with leaves.
> > @@ -1345,7 +1345,7 @@ ext4_ext_next_allocated_block(

Re: [PATCH v4 00/20] change invalidatepage prototype to accept length

2013-05-21 Thread Lukáš Czerner

On Tue, 14 May 2013, Lukas Czerner wrote:

> Date: Tue, 14 May 2013 18:37:14 +0200
> From: Lukas Czerner 
> To: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org, linux-fsde...@vger.kernel.org,
> linux-e...@vger.kernel.org, a...@linux-foundation.org, hu...@google.com,
> lczer...@redhat.com
> Subject: [PATCH v4 00/20] change invalidatepage prototype to accept length

Hi Ted,

you've mentioned that you'll carry the changed in the ext4 tree. Are
you going to take it for the next merge window ?

However I still need some review on the mm part of the series,
Andrew, Hugh, anyone ?

Thanks!
-Lukas

> 
> Hi,
> 
> This set of patches are aimed to allow truncate_inode_pages_range() handle
> ranges which are not aligned at the end of the page. Currently it will
> hit BUG_ON() when the end of the range is not aligned. Punch hole feature
> however can benefit from this ability saving file systems some work not
> forcing them to implement their own invalidate code to handle unaligned
> ranges.
> 
> In order for this to woke we need change ->invalidatepage() address space
> operation to to accept range to invalidate by adding 'length' argument in
> addition to 'offset'. This is different from my previous attempt to create
> new aop ->invalidatepage_range (http://lwn.net/Articles/514828/) which I
> reconsidered to be unnecessary.
> 
> It would be for the best if this series could go through ext4 branch since
> there are a lot of ext4 changes which are based on dev branch of ext4 
> (git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git)
> 
> For description purposes this patch set can be divided into following
> groups:
> 
> patch 0001:Change ->invalidatepage() prototype adding 'length' argument
>   and changing all the instances. In very simple cases file
>   system methods are completely adapted, otherwise only
>   prototype is changed and the rest will follow. This patch
>   also implement the 'range' invalidation in
>   block_invalidatepage().
> 
> patch 0002 - 0009:
>   Make the use of new 'length' argument in the file system
>   itself. Some file systems can take advantage of it trying
>   to invalidate only portion of the page if possible, some
>   can't, however none of the file systems currently attempt
>   to truncate non page aligned ranges.
> 
> 
> patch 0010:Teach truncate_inode_pages_range() to handle non page aligned
>   ranges.
> 
> patch 0011 - 0020:
>   Ext4 changes build on top of previous changes, simplifying
>   punch hole code. Not all changes are realated specifically
>   to invalidatepage() change, but all are related to the punch
>   hole feature.
> 
> Even though this patch set would mainly affect functionality of the file
> file systems implementing punch hole I've tested all the following file
> system using xfstests without noticing any bugs related to this change.
> 
> ext3, ext4, xfs, btrfs, gfs2 and reiserfs
> 
> I've also tested block size < page size on ext4 with xfstests and fsx.
> 
> 
> v3 -> v4: Some minor changes based on the reviews. Added two ext4 patches
> as suggested by Jan Kara.
> 
> Thanks!
> -Lukas
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] aio: fix potential leak in aio_run_iocb().

2014-05-02 Thread Lukáš Czerner

On Thu, 1 May 2014, Leon Yu wrote:

> Date: Thu,  1 May 2014 03:31:28 +
> From: Leon Yu 
> To: Benjamin LaHaise ,
> Alexander Viro 
> Cc: linux-...@kvack.org, linux-fsde...@vger.kernel.org,
> linux-kernel@vger.kernel.org, Leon Yu 
> Subject: [PATCH] aio: fix potential leak in aio_run_iocb().
> 
> iovec should be reclaimed whenever caller of rw_copy_check_uvector() returns,
> but it doesn't hold when failure happens right after aio_setup_vectored_rw().
> 
> Fix that in a such way to avoid hairy goto.


Hi,

this does not seem right, see bellow

> 
> Signed-off-by: Leon Yu 
> ---
>  fs/aio.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 12a3de0e..04cd768 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1299,10 +1299,8 @@ rw_common:
>   &iovec, compat)
>   : aio_setup_single_vector(req, rw, buf, &nr_segs,
> iovec);
> - if (ret)
> - return ret;
> -
> - ret = rw_verify_area(rw, file, &req->ki_pos, req->ki_nbytes);

here ret could be possibly set to a positive number.

> + if (!ret)
> + ret = rw_verify_area(rw, file, &req->ki_pos, 
> req->ki_nbytes);
>   if (ret < 0) {

but here we're checking for negative and bail out. However this
changes the way it worked before this patch and the iovec would not
be reclaimed anyway as you mentioned in the commit description.

Thanks!
-Lukas

>   if (iovec != &inline_vec)
>   kfree(iovec);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] ext4: Do not destroy ext4_groupinfo_caches if ext4_mb_init() fails

2014-05-13 Thread Lukáš Czerner

On Tue, 13 May 2014, Andrey Tsyvarev wrote:

> Date: Tue, 13 May 2014 14:17:25 +0400
> From: Andrey Tsyvarev 
> To: Lukáš Czerner 
> Cc: Theodore Ts'o , Andreas Dilger ,
> linux-e...@vger.kernel.org, linux-kernel@vger.kernel.org,
> Alexey Khoroshilov 
> Subject: Re: [PATCH] ext4: Do not destroy ext4_groupinfo_caches if
> ext4_mb_init() fails
> 
> 
> 12.05.2014 19:08, Lukáš Czerner пишет:
> > On Mon, 12 May 2014, Andrey Tsyvarev wrote:
> > 
> > > Date: Mon, 12 May 2014 12:23:59 +0400
> > > From: Andrey Tsyvarev
> > > To: Theodore Ts'o
> > > Cc: Andrey Tsyvarev,
> > >  Andreas Dilger,linux-e...@vger.kernel.org,
> > >  linux-kernel@vger.kernel.org, Alexey
> > > Khoroshilov
> > > Subject: [PATCH] ext4: Do not destroy ext4_groupinfo_caches if
> > > ext4_mb_init()
> > >  fails
> > > 
> > > Caches from 'ext4_groupinfo_caches' may be in use by other mounts, which
> > > have already existed.
> > > So, it is incorrect to destroy them when newly requested mount fails.
> > > 
> > > Found by Linux File System Verification project (linuxtesting.org).
> > Makes sense, thanks! Can you please share the test case which
> > triggered this ? It might be worth including in xfstests.
> 
> Actually it was triggered by xfstests themselves but run with fault
> simulation.
> The method of fault simulation is under development/evaluation now, we expect
> to publish a paper describing it in the near future.
> 
> BUG_ON() in get_groupinfo_cache() was firstly triggered by test generic/003,
> but actually it could be any other test, which uses a scratch device: xftests
> itself requires test device(TEST_DEV) mounted, so a fault simulated while
> mount scratch device causes the problem described.

It sounds interesting. I hope that you'll send the information out
to the fsdevel list when your paper is finished, It looks like it
might be quite useful.

Thanks!
-Lukas

> 
> 
> > Reviewed-by: Lukas Czerner
> > 
> > > Signed-off-by: Andrey Tsyvarev
> > > ---
> > >   fs/ext4/mballoc.c | 4 +---
> > >   1 file changed, 1 insertion(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > index 04a5c75..becea1d 100644
> > > --- a/fs/ext4/mballoc.c
> > > +++ b/fs/ext4/mballoc.c
> > > @@ -2607,7 +2607,7 @@ int ext4_mb_init(struct super_block *sb)
> > >   sbi->s_locality_groups = alloc_percpu(struct 
> > > ext4_locality_group);
> > >   if (sbi->s_locality_groups == NULL) {
> > >   ret = -ENOMEM;
> > > - goto out_free_groupinfo_slab;
> > > + goto out;
> > >   }
> > >   for_each_possible_cpu(i) {
> > >   struct ext4_locality_group *lg;
> > > @@ -2632,8 +2632,6 @@ int ext4_mb_init(struct super_block *sb)
> > >   out_free_locality_groups:
> > >   free_percpu(sbi->s_locality_groups);
> > >   sbi->s_locality_groups = NULL;
> > > -out_free_groupinfo_slab:
> > > - ext4_groupinfo_destroy_slabs();
> > >   out:
> > >   kfree(sbi->s_mb_offsets);
> > >   sbi->s_mb_offsets = NULL;
> > > 
> 
> --
> Andrey Tsyvarev
> Linux Verification Center, ISPRAS
> 
>

Re: [PATCH V2 2/2] fs: print a message when freezing/unfreezing filesystems

2014-05-14 Thread Lukáš Czerner

On Wed, 14 May 2014, Jan Kara wrote:

> Date: Wed, 14 May 2014 13:14:49 +0200
> From: Jan Kara 
> To: Mateusz Guzik 
> Cc: linux-kernel@vger.kernel.org, linux-fsde...@vger.kernel.org,
> Josef Bacik , Jan Kara ,
> Al Viro , Eric Sandeen ,
> Joe Perches 
> Subject: Re: [PATCH V2 2/2] fs: print a message when freezing/unfreezing
> filesystems
> 
> On Wed 14-05-14 00:04:43, Mateusz Guzik wrote:
> > This helps hang troubleshooting efforts when only dmesg is available.
> > 
> > While here remove code duplication with MS_RDONLY case and fix a
> > whitespace nit.
>   I'm somewhat undecided here I have to say. On one hand I don't like
> printing to kernel log when everything is fine and kernel is operating
> normally. On the other hand I've seen quite a few cases where people have
> shot themselves in the foot with filesystem freezing so having some trace
> of this in the log doesn't seem like a completely bad thing either. What do
> other people think?

I think that this message is definitely useful to have. As Mateusz
already mentioned I think that this is of the same importance as the
"file system mounted/remounted" message.

-Lukas

> 
>   Honza
> > Signed-off-by: Mateusz Guzik 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Josef Bacik 
> > Cc: Jan Kara 
> > Cc: Al Viro 
> > Cc: Eric Sandeen 
> > Cc: Joe Perches 
> > ---
> > since v1:
> > fix copy-pasto which found its way into the patch
> > restore curly brackets in MS_RDONLY case
> >  fs/super.c | 8 
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/super.c b/fs/super.c
> > index 017e10a..4dd7356 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -1291,9 +1291,7 @@ int freeze_super(struct super_block *sb)
> >  
> > if (sb->s_flags & MS_RDONLY) {
> > /* Nothing to do really... */
> > -   sb->s_writers.frozen = SB_FREEZE_COMPLETE;
> > -   up_write(&sb->s_umount);
> > -   return 0;
> > +   goto out;
> > }
> >  
> > /* From now on, no new normal writers can start */
> > @@ -1335,8 +1333,10 @@ int freeze_super(struct super_block *sb)
> >  * This is just for debugging purposes so that fs can warn if it
> >  * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
> >  */
> > +out:
> > sb->s_writers.frozen = SB_FREEZE_COMPLETE;
> > up_write(&sb->s_umount);
> > +   pr_info("VFS:Filesystem %s frozen\n", sb->s_id);
> > return 0;
> >  }
> >  EXPORT_SYMBOL(freeze_super);
> > @@ -1374,7 +1374,7 @@ out:
> > smp_wmb();
> > wake_up(&sb->s_writers.wait_unfrozen);
> > deactivate_locked_super(sb);
> > -
> > +   pr_info("VFS:Filesystem %s thawed\n", sb->s_id);
> > return 0;
> >  }
> >  EXPORT_SYMBOL(thaw_super);
> > -- 
> > 1.8.3.1
> > 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH V2 2/2] fs: print a message when freezing/unfreezing filesystems

2014-05-15 Thread Lukáš Czerner

On Wed, 14 May 2014, Eric Sandeen wrote:

> Date: Wed, 14 May 2014 17:40:22 -0500
> From: Eric Sandeen 
> Reply-To: sand...@redhat.com
> To: Dave Chinner , Jan Kara 
> Cc: Mateusz Guzik , linux-kernel@vger.kernel.org,
> linux-fsde...@vger.kernel.org, Josef Bacik ,
> Al Viro , Joe Perches 
> Subject: Re: [PATCH V2 2/2] fs: print a message when freezing/unfreezing
> filesystems
> 
> On 5/14/14, 5:37 PM, Dave Chinner wrote:
> > On Thu, May 15, 2014 at 08:00:52AM +1000, Dave Chinner wrote:
> >> On Wed, May 14, 2014 at 01:39:45PM +0200, Jan Kara wrote:
> >>> On Wed 14-05-14 13:26:21, Mateusz Guzik wrote:
>  On Wed, May 14, 2014 at 01:14:49PM +0200, Jan Kara wrote:
> > On Wed 14-05-14 00:04:43, Mateusz Guzik wrote:
> >> This helps hang troubleshooting efforts when only dmesg is available.
> >>
> >> While here remove code duplication with MS_RDONLY case and fix a
> >> whitespace nit.
> >   I'm somewhat undecided here I have to say. On one hand I don't like
> > printing to kernel log when everything is fine and kernel is operating
> > normally. On the other hand I've seen quite a few cases where people 
> > have
> > shot themselves in the foot with filesystem freezing so having some 
> > trace
> > of this in the log doesn't seem like a completely bad thing either. 
> > What do
> > other people think?
> >
> 
>  I would like to note that the kernel already prints messages when e.g.
>  filesystems get mounted.
> >>>   Yeah, that's a fair point.
> >>
> >> But filesystems choose to output that info, not the VFS. When you do
> >> a remount,ro there is no output in syslog, because filesystems don't
> >> need to dump any output - the state change is reflected in
> >> /proc/self/mounts. IMO frozen should state should be communicated
> >> the same way so that it is silent when it just works, and the state
> >> can easily be determined when something goes wrong.
> > 
> > Say, like this:
> > 
> > $ grep /mnt/test /proc/mounts
> > /dev/vda /mnt/test xfs rw,relatime,attr2,inode64,noquota 0 0
> > $ sudo xfs_freeze -f /mnt/test
> > $ grep /mnt/test /proc/mounts
> > /dev/vda /mnt/test xfs rw,frozen,relatime,attr2,inode64,noquota 0 0
> > $ sudo xfs_freeze -u /mnt/test
> > $ grep /mnt/test /proc/mounts
> > /dev/vda /mnt/test xfs rw,relatime,attr2,inode64,noquota 0 0
> > $
> 
> I'm not totally convinced that including a non-mount option in what
> has always (?) been a list of mount options is a great idea.

I do not like it either. Mixing this together with other mount
options does not seem like a great idea, however we really need a
way to report this and I guess we can not just change the
/proc/self/mounts, or /proc/self/mountinfo format.

So what about crating a new file /proc/self/frozen with the list of
frozen file systems in the same format what mounts, or mountinfo has
?

> 
> (Granted, some options there are defaults, and weren't actually specified
> as a mount option, but if they had been, they'd have been accepted).
> 
> Maybe add a "mount -o remount,frozen" handler ?  ;)

That's the neat way to work around that :), but I would prefer a new
procfs file rather than this.

-Lukas

> 
> -Eric
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] fs: print a message when freezing/unfreezing filesystems

2014-05-15 Thread Lukáš Czerner

On Thu, 15 May 2014, Jan Kara wrote:

> Date: Thu, 15 May 2014 12:01:57 +0200
> From: Jan Kara 
> To: Mateusz Guzik 
> Cc: Dave Chinner , linux-kernel@vger.kernel.org,
> linux-fsde...@vger.kernel.org, Josef Bacik ,
> Jan Kara , Al Viro ,
> Eric Sandeen 
> Subject: Re: [PATCH 2/2] fs: print a message when freezing/unfreezing
> filesystems
> 
> On Thu 15-05-14 11:42:37, Mateusz Guzik wrote:
> > On Thu, May 15, 2014 at 07:54:57AM +1000, Dave Chinner wrote:
> > > On Tue, May 13, 2014 at 08:31:02PM +0200, Mateusz Guzik wrote:
> > > > This helps hang troubleshooting efforts when only dmesg is available.
> > > 
> > > I really don't think that spamming dmesg every time a filesystem is
> > > frozen or thawed is a good idea. This happens a *lot* when systems
> > > are using snapshots, and for the most part nobody cares about
> > > freeze/thaw cycles because they almost always work just fine.
> > > 
> > 
> > I agree it may get noisy.
> > 
> > > I'd think that /proc/self/mounts would be a much better place to
> > > indicate that the fs is frozen. After all, that's where we tell
> > > people whether the filesystem is ro or rw, and frozen is just
> > > a temporary, non-invasive ro state...
> > > 
> > 
> > Except you can't inspect /proc/self/mounts when the only thing you got
> > is dmesg, so this does not really help my case.
> > 
> > That said, I'll try to come up with a different solution.
> > 
> > Poorly reported side-effects of frozen I/O are only a part of the real
> > problem which is hung task detector being able to typically report
> > backtraces of "victims" only. I came up with printks becuase these are
> > a cheap way and would help us out in a lot of cases.
>   I was tracking down a couple of times what the hell is freezing the
> filesystem (and not unfreezing it) and I agree with Mateusz it would be
> nice if we could tell after the fact who froze the fs. Maybe we could store
> that information in superblock and dump it during emergency thaw?

I like that idea and if we agree to create procfs file to list
frozen file system this might be one of the useful information to
report there.

-Lukas

> 
>   Honza
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] ext2: super: remove unnecessary goto statement

2014-05-12 Thread Lukáš Czerner

On Thu, 8 May 2014, Seunghun Lee wrote:

> Date: Thu,  8 May 2014 01:08:03 +0900
> From: Seunghun Lee 
> To: j...@suse.cz
> Cc: linux-e...@vger.kernel.org, linux-kernel@vger.kernel.org,
> Seunghun Lee 
> Subject: [PATCH] ext2: super: remove unnecessary goto statement
> 
> This patch removes unnecessary goto failed,
> and moves kfree to failed.
> 
> Signed-off-by: Seunghun Lee 
> ---
>  fs/ext2/super.c |   10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 3750031..7d20a50 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -777,17 +777,15 @@ static int ext2_fill_super(struct super_block *sb, void 
> *data, int silent)
>   __le32 features;
>   int err;
>  
> - err = -ENOMEM;

The problem here is actually that we return -EINVAL while we should
actually return -ENOMEM, which is fixed only partially by your
patch.

Also there are number other problems in error handling here mainly
because we're using err variable but not setting ret variable which
is the one that we return eventually.

Also ret is defined as long (not sure if that's needed) but the
function is supposed to return int.

So if you're changing the code around it would be good to actually
fix some problems as well :)

Thanks!
-Lukas

>   sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
>   if (!sbi)
> - goto failed;
> + return -ENOMEM;
>  
>   sbi->s_blockgroup_lock =
>   kzalloc(sizeof(struct blockgroup_lock), GFP_KERNEL);
> - if (!sbi->s_blockgroup_lock) {
> - kfree(sbi);
> + if (!sbi->s_blockgroup_lock)
>   goto failed;
> - }
> +
>   sb->s_fs_info = sbi;
>   sbi->s_sb_block = sb_block;
>  
> @@ -1138,8 +1136,8 @@ failed_mount:
>  failed_sbi:
>   sb->s_fs_info = NULL;
>   kfree(sbi->s_blockgroup_lock);
> - kfree(sbi);
>  failed:
> + kfree(sbi);
>   return ret;
>  }
>  
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] ext4: Do not destroy ext4_groupinfo_caches if ext4_mb_init() fails

2014-05-12 Thread Lukáš Czerner

On Mon, 12 May 2014, Andrey Tsyvarev wrote:

> Date: Mon, 12 May 2014 12:23:59 +0400
> From: Andrey Tsyvarev 
> To: Theodore Ts'o 
> Cc: Andrey Tsyvarev ,
> Andreas Dilger , linux-e...@vger.kernel.org,
> linux-kernel@vger.kernel.org, Alexey Khoroshilov 
> Subject: [PATCH] ext4: Do not destroy ext4_groupinfo_caches if ext4_mb_init()
> fails
> 
> Caches from 'ext4_groupinfo_caches' may be in use by other mounts, which have 
> already existed.
> So, it is incorrect to destroy them when newly requested mount fails.
> 
> Found by Linux File System Verification project (linuxtesting.org).

Makes sense, thanks! Can you please share the test case which
triggered this ? It might be worth including in xfstests.

Reviewed-by: Lukas Czerner 

> 
> Signed-off-by: Andrey Tsyvarev 
> ---
>  fs/ext4/mballoc.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 04a5c75..becea1d 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2607,7 +2607,7 @@ int ext4_mb_init(struct super_block *sb)
>   sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group);
>   if (sbi->s_locality_groups == NULL) {
>   ret = -ENOMEM;
> - goto out_free_groupinfo_slab;
> + goto out;
>   }
>   for_each_possible_cpu(i) {
>   struct ext4_locality_group *lg;
> @@ -2632,8 +2632,6 @@ int ext4_mb_init(struct super_block *sb)
>  out_free_locality_groups:
>   free_percpu(sbi->s_locality_groups);
>   sbi->s_locality_groups = NULL;
> -out_free_groupinfo_slab:
> - ext4_groupinfo_destroy_slabs();
>  out:
>   kfree(sbi->s_mb_offsets);
>   sbi->s_mb_offsets = NULL;
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 10/18] mm: teach truncate_inode_pages_range() to handle non page aligned ranges

2013-02-21 Thread Lukáš Czerner

On Fri, 8 Feb 2013, Lukáš Czerner wrote:

> Date: Fri, 8 Feb 2013 10:08:05 +0100 (CET)
> From: Lukáš Czerner 
> To: Andrew Morton 
> Cc: Lukas Czerner , linux...@kvack.org,
> linux-kernel@vger.kernel.org, linux-fsde...@vger.kernel.org,
> linux-e...@vger.kernel.org, Hugh Dickins 
> Subject: Re: [PATCH v2 10/18] mm: teach truncate_inode_pages_range() to handle
>  non page aligned ranges

..snip..

> > > + /*
> > > +  * 'start' and 'end' always covers the range of pages to be fully
> > > +  * truncated. Partial pages are covered with 'partial_start' at the
> > > +  * start of the range and 'partial_end' at the end of the range.
> > > +  * Note that 'end' is exclusive while 'lend' is inclusive.
> > > +  */
> > 
> > That helped ;)  So the bytes to be truncated are
> > 
> > (start*PAGE_SIZE + partial_start) -> (end*PAGE_SIZE + partial_end) - 1
> > 
> > yes?
> 
> The start of the range is not right, because 'start' and 'end'
> covers pages to be _fully_ truncated. See the while cycle and 
> then 'if (partial_start)' condition where we search for the
> page (start - 1) and do_invalidate within that page.
> 
> So it should be like this:
> 
> 
> (start*PAGE_SIZE - partial_start*(PAGE_SIZE - partial_start) ->
> (end*PAGE_END + partial_end) - 1
> 
> 
> assuming that you want the range to be inclusive on both sides.
> 
> -Lukas
> 

Hi Andrew,

what's the status of the patch set ? Do you have any more comments
or questions ? Can we get this in in this merge window ?

Thanks!
-Lukas

Re: [PATCH v2 10/18] mm: teach truncate_inode_pages_range() to handle non page aligned ranges

2013-02-22 Thread Lukáš Czerner

On Thu, 21 Feb 2013, Andrew Morton wrote:

> Date: Thu, 21 Feb 2013 13:49:04 -0800
> From: Andrew Morton 
> To: Lukáš Czerner 
> Cc: linux...@kvack.org, linux-kernel@vger.kernel.org,
> linux-fsde...@vger.kernel.org, linux-e...@vger.kernel.org,
> Hugh Dickins 
> Subject: Re: [PATCH v2 10/18] mm: teach truncate_inode_pages_range() to handle
>  non page aligned ranges
> 
> On Thu, 21 Feb 2013 09:33:56 +0100 (CET)
> Luk Czerner  wrote:
> 
> > what's the status of the patch set ?
> 
> Forgotten about :(
> 
> > Can we get this in in this merge window ?
> 
> Please do a full resend after 3.9-rc1 and let's take it up again.
> 

I'll do that. Thanks.

-Lukas

Re: [PATCH] tmpfs: ZERO_RANGE and COLLAPSE_RANGE not currently supported

2014-06-17 Thread Lukáš Czerner

On Sun, 15 Jun 2014, Hugh Dickins wrote:

> Date: Sun, 15 Jun 2014 18:28:00 -0700 (PDT)
> From: Hugh Dickins 
> To: Andrew Morton 
> Cc: Lukas Czerner ,
> Namjae Jeon , linux-kernel@vger.kernel.org
> Subject: [PATCH] tmpfs: ZERO_RANGE and COLLAPSE_RANGE not currently supported
> 
> I was well aware of FALLOC_FL_ZERO_RANGE and FALLOC_FL_COLLAPSE_RANGE
> support being added to fallocate(); but didn't realize until now that
> I had been too stupid to future-proof shmem_fallocate() against new
> additions.  -EOPNOTSUPP instead of going on to ordinary fallocation.

Looks good.

Reviewed-by: Lukas Czerner 

> 
> Signed-off-by: Hugh Dickins 
> Cc: sta...@vger.kernel.org # v3.15
> ---
> 
>  mm/shmem.c |3 +++
>  1 file changed, 3 insertions(+)
> 
> --- v3.15/mm/shmem.c  2014-06-08 11:19:54.0 -0700
> +++ linux/mm/shmem.c  2014-06-15 18:01:16.988996166 -0700
> @@ -1728,6 +1728,9 @@ static long shmem_fallocate(struct file
>   pgoff_t start, index, end;
>   int error;
>  
> + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> + return -EOPNOTSUPP;
> +
>   mutex_lock(&inode->i_mutex);
>  
>   if (mode & FALLOC_FL_PUNCH_HOLE) {
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 0/10] fs: Introduce FALLOC_FL_INSERT_RANGE for fallocate

2014-05-30 Thread Lukáš Czerner

On Thu, 8 May 2014, Namjae Jeon wrote:

> Date: Thu, 08 May 2014 19:23:19 +0900
> From: Namjae Jeon 
> To: Dave Chinner , Theodore Ts'o 
> Cc: linux-ext4 , x...@oss.sgi.com,
> linux-fsde...@vger.kernel.org, linux-kernel@vger.kernel.org,
> Ashish Sangwan 
> Subject: [PATCH v2 0/10] fs: Introduce FALLOC_FL_INSERT_RANGE for fallocate
> 
> In continuation of the work of making the process of non linear editing of
> media files faster, we introduce here the new flag FALLOC_FL_INSERT_RANGE
> for fallocate.
> 
> This flag will work opposite to the newly added FALLOC_FL_COLLAPSE_RANGE flag.
> As such, specifying FALLOC_FL_INSERT_RANGE flag will insert zeroed-out space
> in between the file within the range specified by offset and len. User can
> write new data in this space. e.g. ads.
> Like collapse range, currently we have the limitation that offset and len 
> should
> be block size aligned for both XFS and Ext4.
> 
> The semantics of the flag are :
> 1) It allocates new zeroed out on disk space of len bytes starting
>at offset byte without overwriting any existing data. All the data blocks
>from offset to EOF are shifted towards right to make space for inserting
>new blocks

Hi,

this sounds a little bit weird to me. I understand the reason for
this, but this is effectively two operations masking as one. We
shift the existing data and then we allocate unwritten extents for
the hole we've created.

What would make more sense to me is to implement only the first
operation - the shift. And then let the user to allocate unwritten
extents for the hole using simple fallocate.

The reason is that if you succeed the first part and then fail the
second due to ENOSPC or any other reason the file will end up in
undefined state unnecessarily. Yes in your current implementation
it seems that you'll always end up with the hole in the file and the
rest is properly shifted, but that may vary from file system to file
system. Some might choose to roll back the shift, some might not.

If FALLOC_FL_INSERT_RANGE (or any name you wish to choose) would
just simply shift the extents then you'll get rid of this and the
only thing that user needs to do (if he chooses to) is to use
fallocate for the hole created by the shift. If it fails, then
well, he can try again without any consequences. As a bonus you get
the possibility to leave the hole in the file which might be useful
as well.

With current behaviour this might get very confusing very quickly.

What do you and others think ?

Thanks!
-LUkas

> 2) It should be used exclusively. No other fallocate flag in combination.
> 3) Offset and length supplied to fallocate should be fs block size aligned
>in case of xfs and ext4.
> 4) Insert range does not work for the case when offset is overlapping/beyond
>i_size. If the user wants to allocate space at the end of file they are
>advised to use either ftruncate(2) or fallocate(2) with mode 0.
> 5) It increses the size of file by len bytes.
> 
> 
> Namjae Jeon (10):
>  fs: Add support FALLOC_FL_INSERT_RANGE for fallocate
>  xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate
>  ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate
>  xfsprogs: xfs_io: add finsert command for insert range via fallocate
>  xfstests: generic/027: Standard insert range tests
>  xfstests: generic/028: Delayed allocation insert range
>  xfstests: generic/029: Multi insert range tests
>  xfstests: generic/030: Delayed allocation multi insert
>  xfstests: fsstress: Add fallocate insert range operation
>  xfstests: fsx: Add fallocate insert range operation
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH v2 0/10] fs: Introduce FALLOC_FL_INSERT_RANGE for fallocate

2014-06-02 Thread Lukáš Czerner

On Sat, 31 May 2014, Namjae Jeon wrote:

> Date: Sat, 31 May 2014 16:40:29 +0900
> From: Namjae Jeon 
> To: 'Lukáš Czerner' 
> Cc: 'Dave Chinner' , 'Theodore Ts'o' ,
> 'linux-ext4' , x...@oss.sgi.com,
> linux-fsde...@vger.kernel.org, linux-kernel@vger.kernel.org,
> 'Ashish Sangwan' 
> Subject: RE: [PATCH v2 0/10] fs: Introduce FALLOC_FL_INSERT_RANGE for
> fallocate
> 
>  
> > > Date: Thu, 08 May 2014 19:23:19 +0900
> > > From: Namjae Jeon 
> > > To: Dave Chinner , Theodore Ts'o 
> > > Cc: linux-ext4 , x...@oss.sgi.com,
> > > linux-fsde...@vger.kernel.org, linux-kernel@vger.kernel.org,
> > > Ashish Sangwan 
> > > Subject: [PATCH v2 0/10] fs: Introduce FALLOC_FL_INSERT_RANGE for 
> > > fallocate
> > >
> > > In continuation of the work of making the process of non linear editing of
> > > media files faster, we introduce here the new flag FALLOC_FL_INSERT_RANGE
> > > for fallocate.
> > >
> > > This flag will work opposite to the newly added FALLOC_FL_COLLAPSE_RANGE 
> > > flag.
> > > As such, specifying FALLOC_FL_INSERT_RANGE flag will insert zeroed-out 
> > > space
> > > in between the file within the range specified by offset and len. User can
> > > write new data in this space. e.g. ads.
> > > Like collapse range, currently we have the limitation that offset and len 
> > > should
> > > be block size aligned for both XFS and Ext4.
> > >
> > > The semantics of the flag are :
> > > 1) It allocates new zeroed out on disk space of len bytes starting
> > >at offset byte without overwriting any existing data. All the data 
> > > blocks
> > >from offset to EOF are shifted towards right to make space for 
> > > inserting
> > >new blocks
> > 
> > Hi,
> > 
> > this sounds a little bit weird to me. I understand the reason for
> > this, but this is effectively two operations masking as one. We
> > shift the existing data and then we allocate unwritten extents for
> > the hole we've created.
> > 
> > What would make more sense to me is to implement only the first
> > operation - the shift. And then let the user to allocate unwritten
> > extents for the hole using simple fallocate.
> > 
> > The reason is that if you succeed the first part and then fail the
> > second due to ENOSPC or any other reason the file will end up in
> > undefined state unnecessarily. Yes in your current implementation
> > it seems that you'll always end up with the hole in the file and the
> > rest is properly shifted, but that may vary from file system to file
> > system. Some might choose to roll back the shift, some might not.
> > 
> > If FALLOC_FL_INSERT_RANGE (or any name you wish to choose) would
> > just simply shift the extents then you'll get rid of this and the
> > only thing that user needs to do (if he chooses to) is to use
> > fallocate for the hole created by the shift. If it fails, then
> > well, he can try again without any consequences. As a bonus you get
> > the possibility to leave the hole in the file which might be useful
> > as well.
> > 
> > With current behaviour this might get very confusing very quickly.
> > 
> > What do you and others think ?
> Hi Lukas.
> Insert range inherently means inserting a real range of space into
> the file to provide guaranteed space with user in the inserted area
> so that further writes should not fail with an -ENOSPC at least.
> If insert range cannot gurantees the above semantics, It should
> return error to user space.

So what will happen when there is not enough space when "inserting a
range" ? And how should user proceed from there ?

> 
> If insert range has been performed on a file, It will reserve space
> that write never fail in the inserted area,
> In case of full partition or small available size than the range
> user want, It seems odd just only left inserting a hole in the middle
> of file and return success to user when no one can really write to
> this hole.

There is a fallocate for allocation, so as I already said you can
shift the extents to make a hole in the file and then use fallocate
to allocate space for it and you'll get the same result. You are
basically doing that now as well, but when the allocation fails the
whole "insert range" ioct fails, however the extents are already
shifter and there is already a holi in the file so freeing some
space and running this ioctl again will not help you at all. While
if you fail a f

RE: [PATCH v2 0/10] fs: Introduce FALLOC_FL_INSERT_RANGE for fallocate

2014-06-02 Thread Lukáš Czerner

On Mon, 2 Jun 2014, Namjae Jeon wrote:

> Date: Mon, 02 Jun 2014 20:52:51 +0900
> From: Namjae Jeon 
> To: 'Lukáš Czerner' 
> Cc: 'Dave Chinner' , 'Theodore Ts'o' ,
> 'linux-ext4' , x...@oss.sgi.com,
> linux-fsde...@vger.kernel.org, linux-kernel@vger.kernel.org,
> 'Ashish Sangwan' 
> Subject: RE: [PATCH v2 0/10] fs: Introduce FALLOC_FL_INSERT_RANGE for
> fallocate
> 
> > 
> > > Date: Sat, 31 May 2014 16:40:29 +0900
> > > From: Namjae Jeon 
> > > To: 'Lukáš Czerner' 
> > > Cc: 'Dave Chinner' , 'Theodore Ts'o' ,
> > > 'linux-ext4' , x...@oss.sgi.com,
> > > linux-fsde...@vger.kernel.org, linux-kernel@vger.kernel.org,
> > > 'Ashish Sangwan' 
> > > Subject: RE: [PATCH v2 0/10] fs: Introduce FALLOC_FL_INSERT_RANGE for
> > > fallocate
> > >
> > >
> > > > > Date: Thu, 08 May 2014 19:23:19 +0900
> > > > > From: Namjae Jeon 
> > > > > To: Dave Chinner , Theodore Ts'o 
> > > > > Cc: linux-ext4 , x...@oss.sgi.com,
> > > > > linux-fsde...@vger.kernel.org, linux-kernel@vger.kernel.org,
> > > > > Ashish Sangwan 
> > > > > Subject: [PATCH v2 0/10] fs: Introduce FALLOC_FL_INSERT_RANGE for 
> > > > > fallocate
> > > > >
> > > > > In continuation of the work of making the process of non linear 
> > > > > editing of
> > > > > media files faster, we introduce here the new flag 
> > > > > FALLOC_FL_INSERT_RANGE
> > > > > for fallocate.
> > > > >
> > > > > This flag will work opposite to the newly added 
> > > > > FALLOC_FL_COLLAPSE_RANGE flag.
> > > > > As such, specifying FALLOC_FL_INSERT_RANGE flag will insert 
> > > > > zeroed-out space
> > > > > in between the file within the range specified by offset and len. 
> > > > > User can
> > > > > write new data in this space. e.g. ads.
> > > > > Like collapse range, currently we have the limitation that offset and 
> > > > > len should
> > > > > be block size aligned for both XFS and Ext4.
> > > > >
> > > > > The semantics of the flag are :
> > > > > 1) It allocates new zeroed out on disk space of len bytes starting
> > > > >at offset byte without overwriting any existing data. All the data 
> > > > > blocks
> > > > >from offset to EOF are shifted towards right to make space for 
> > > > > inserting
> > > > >new blocks
> > > >
> > > > Hi,
> > > >
> > > > this sounds a little bit weird to me. I understand the reason for
> > > > this, but this is effectively two operations masking as one. We
> > > > shift the existing data and then we allocate unwritten extents for
> > > > the hole we've created.
> > > >
> > > > What would make more sense to me is to implement only the first
> > > > operation - the shift. And then let the user to allocate unwritten
> > > > extents for the hole using simple fallocate.
> > > >
> > > > The reason is that if you succeed the first part and then fail the
> > > > second due to ENOSPC or any other reason the file will end up in
> > > > undefined state unnecessarily. Yes in your current implementation
> > > > it seems that you'll always end up with the hole in the file and the
> > > > rest is properly shifted, but that may vary from file system to file
> > > > system. Some might choose to roll back the shift, some might not.
> > > >
> > > > If FALLOC_FL_INSERT_RANGE (or any name you wish to choose) would
> > > > just simply shift the extents then you'll get rid of this and the
> > > > only thing that user needs to do (if he chooses to) is to use
> > > > fallocate for the hole created by the shift. If it fails, then
> > > > well, he can try again without any consequences. As a bonus you get
> > > > the possibility to leave the hole in the file which might be useful
> > > > as well.
> > > >
> > > > With current behaviour this might get very confusing very quickly.
> > > >
> > > > What do you and others think ?
> > > Hi Lukas.
> > > Insert range inherently means inserting a real range of space

Re: [PATCH] xfstests: f2fs support

2014-06-03 Thread Lukáš Czerner

On Mon, 2 Jun 2014, jaeg...@kernel.org wrote:

> Date: Mon,  2 Jun 2014 22:59:32 +0900
> From: jaeg...@kernel.org
> To: Dave Chinner 
> Cc: x...@oss.sgi.com, linux-kernel@vger.kernel.org,
> linux-fsde...@vger.kernel.org, linux-f2fs-de...@lists.sourceforget.net,
> Jaegeuk Kim 
> Subject: [PATCH] xfstests: f2fs support
> 
> From: Jaegeuk Kim 
> 
> This patch adds to support f2fs file system.

Hi,

Looks good.

Signed-off-by: Lukas Czerner 

Btw dry run option in fsck is quite useful here because when the
file system gets corrupted you do not want it to get fixed, but
rather investigate the corrupted file system to figure out what
happened. But of course it's not required :)

However after a quick look at you fsck code it seems that you do not
attempt to fix anything at all...

Also having mkfs to check for existing signatures on the device
before attempting to create the file system is a good thing to do
(at least xfs, btrfs extN are doing so) to avoid data loss by
mistake. You can use libblkid from util-linux and it's very simple
to use.

Thanks!
-Lukas

> 
> Signed-off-by: Jaegeuk Kim 
> ---
>  common/config | 7 +++
>  common/rc | 6 ++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/common/config b/common/config
> index 0dbf0b9..0607294 100644
> --- a/common/config
> +++ b/common/config
> @@ -210,6 +210,7 @@ case "$HOSTOS" in
>  export MKFS_XFS_PROG="`set_prog_path mkfs.xfs`"
>  export MKFS_UDF_PROG="`set_prog_path mkudffs`"
>  export MKFS_BTRFS_PROG="`set_btrfs_mkfs_prog_path_with_opts`"
> +export MKFS_F2FS_PROG="`set_prog_path mkfs.f2fs`"
>  export BTRFS_UTIL_PROG="`set_prog_path btrfs`"
>  export BTRFS_SHOW_SUPER_PROG="`set_prog_path btrfs-show-super`"
>  export XFS_FSR_PROG="`set_prog_path xfs_fsr`"
> @@ -241,6 +242,9 @@ _mount_opts()
>   # acls & xattrs aren't turned on by default on ext$FOO
>   export MOUNT_OPTIONS="-o acl,user_xattr $EXT_MOUNT_OPTIONS"
>   ;;
> + f2fs)
> + export MOUNT_OPTIONS="-o acl,user_xattr $F2FS_MOUNT_OPTIONS"
> + ;;
>   reiserfs)
>   # acls & xattrs aren't turned on by default on reiserfs
>   export MOUNT_OPTIONS="-o acl,user_xattr $REISERFS_MOUNT_OPTIONS"
> @@ -295,6 +299,9 @@ _fsck_opts()
>   reiserfs)
>   export FSCK_OPTIONS="--yes"
>   ;;
> + f2fs)
> + export FSCK_OPTIONS=""
> + ;;
>   *)
>   export FSCK_OPTIONS="-n"
>   ;;
> diff --git a/common/rc b/common/rc
> index f27ee53..fcdabfe 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -105,6 +105,9 @@ case "$FSTYP" in
>  btrfs)
>[ "$MKFS_BTRFS_PROG" = "" ] && _fatal "mkfs.btrfs not found"
>;;
> +f2fs)
> +  [ "$MKFS_F2FS_PROG" = "" ] && _fatal "mkfs.f2fs not found"
> +  ;;
>  nfs)
>;;
>  esac
> @@ -511,6 +514,9 @@ _scratch_mkfs()
>  tmpfs)
>   # do nothing for tmpfs
>   ;;
> +f2fs)
> +$MKFS_F2FS_PROG $MKFS_OPTIONS $* $SCRATCH_DEV > /dev/null
> + ;;
>  *)
>   yes | $MKFS_PROG -t $FSTYP -- $MKFS_OPTIONS $* $SCRATCH_DEV
>   ;;
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] xfstests: f2fs support

2014-06-03 Thread Lukáš Czerner

On Tue, 3 Jun 2014, Lukáš Czerner wrote:

> Date: Tue, 3 Jun 2014 13:32:30 +0200 (CEST)
> From: Lukáš Czerner 
> To: Jaegeuk Kim 
> Cc: Dave Chinner , x...@oss.sgi.com,
> linux-kernel@vger.kernel.org, linux-fsde...@vger.kernel.org,
> linux-f2fs-de...@lists.sourceforget.net
> Subject: Re: [PATCH] xfstests: f2fs support
> 
> On Mon, 2 Jun 2014, jaeg...@kernel.org wrote:
> 
> > Date: Mon,  2 Jun 2014 22:59:32 +0900
> > From: jaeg...@kernel.org
> > To: Dave Chinner 
> > Cc: x...@oss.sgi.com, linux-kernel@vger.kernel.org,
> > linux-fsde...@vger.kernel.org, linux-f2fs-de...@lists.sourceforget.net,
> > Jaegeuk Kim 
> > Subject: [PATCH] xfstests: f2fs support
> > 
> > From: Jaegeuk Kim 
> > 
> > This patch adds to support f2fs file system.
> 
> Hi,
> 
> Looks good.
> 
> Signed-off-by: Lukas Czerner 
> 

Oops, wrong macro. I meant to say

Reviewed-by: Lukas Czerner 


> Btw dry run option in fsck is quite useful here because when the
> file system gets corrupted you do not want it to get fixed, but
> rather investigate the corrupted file system to figure out what
> happened. But of course it's not required :)
> 
> However after a quick look at you fsck code it seems that you do not
> attempt to fix anything at all...
> 
> Also having mkfs to check for existing signatures on the device
> before attempting to create the file system is a good thing to do
> (at least xfs, btrfs extN are doing so) to avoid data loss by
> mistake. You can use libblkid from util-linux and it's very simple
> to use.
> 
> Thanks!
> -Lukas
> 
> > 
> > Signed-off-by: Jaegeuk Kim 
> > ---
> >  common/config | 7 +++
> >  common/rc | 6 ++
> >  2 files changed, 13 insertions(+)
> > 
> > diff --git a/common/config b/common/config
> > index 0dbf0b9..0607294 100644
> > --- a/common/config
> > +++ b/common/config
> > @@ -210,6 +210,7 @@ case "$HOSTOS" in
> >  export MKFS_XFS_PROG="`set_prog_path mkfs.xfs`"
> >  export MKFS_UDF_PROG="`set_prog_path mkudffs`"
> >  export MKFS_BTRFS_PROG="`set_btrfs_mkfs_prog_path_with_opts`"
> > +export MKFS_F2FS_PROG="`set_prog_path mkfs.f2fs`"
> >  export BTRFS_UTIL_PROG="`set_prog_path btrfs`"
> >  export BTRFS_SHOW_SUPER_PROG="`set_prog_path btrfs-show-super`"
> >  export XFS_FSR_PROG="`set_prog_path xfs_fsr`"
> > @@ -241,6 +242,9 @@ _mount_opts()
> > # acls & xattrs aren't turned on by default on ext$FOO
> > export MOUNT_OPTIONS="-o acl,user_xattr $EXT_MOUNT_OPTIONS"
> > ;;
> > +   f2fs)
> > +   export MOUNT_OPTIONS="-o acl,user_xattr $F2FS_MOUNT_OPTIONS"
> > +   ;;
> > reiserfs)
> > # acls & xattrs aren't turned on by default on reiserfs
> > export MOUNT_OPTIONS="-o acl,user_xattr $REISERFS_MOUNT_OPTIONS"
> > @@ -295,6 +299,9 @@ _fsck_opts()
> > reiserfs)
> > export FSCK_OPTIONS="--yes"
> > ;;
> > +   f2fs)
> > +   export FSCK_OPTIONS=""
> > +   ;;
> > *)
> > export FSCK_OPTIONS="-n"
> > ;;
> > diff --git a/common/rc b/common/rc
> > index f27ee53..fcdabfe 100644
> > --- a/common/rc
> > +++ b/common/rc
> > @@ -105,6 +105,9 @@ case "$FSTYP" in
> >  btrfs)
> >  [ "$MKFS_BTRFS_PROG" = "" ] && _fatal "mkfs.btrfs not found"
> >  ;;
> > +f2fs)
> > +[ "$MKFS_F2FS_PROG" = "" ] && _fatal "mkfs.f2fs not found"
> > +;;
> >  nfs)
> >  ;;
> >  esac
> > @@ -511,6 +514,9 @@ _scratch_mkfs()
> >  tmpfs)
> > # do nothing for tmpfs
> > ;;
> > +f2fs)
> > +$MKFS_F2FS_PROG $MKFS_OPTIONS $* $SCRATCH_DEV > /dev/null
> > +   ;;
> >  *)
> > yes | $MKFS_PROG -t $FSTYP -- $MKFS_OPTIONS $* $SCRATCH_DEV
> > ;;
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Re: [PATCH] sysctl: Add a feature to drop caches selectively

2014-06-26 Thread Lukáš Czerner

On Thu, 26 Jun 2014, Artem Bityutskiy wrote:

> Date: Thu, 26 Jun 2014 14:31:03 +0300
> From: Artem Bityutskiy 
> To: Bernd Schubert 
> Cc: Dave Chinner , Thomas Knauth ,
> David Rientjes ,
> Maksym Planeta ,
> Alexander Viro , linux-fsde...@vger.kernel.org,
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] sysctl: Add a feature to drop caches selectively
> 
> On Thu, 2014-06-26 at 12:36 +0200, Bernd Schubert wrote:
> > On 06/26/2014 08:13 AM, Artem Bityutskiy wrote:
> > > On Thu, 2014-06-26 at 11:06 +1000, Dave Chinner wrote:
> > >> Your particular use case can be handled by directing your benchmark
> > >> at a filesystem mount point and unmounting the filesystem in between
> > >> benchmark runs. There is no ned to adding kernel functionality for
> > >> somethign that can be so easily acheived by other means, especially
> > >> in benchmark environments where *everything* is tightly controlled.
> > >
> > > If I was a benchmark writer, I would not be willing running it as root
> > > to be able to mount/unmount, I would not be willing to require the
> > > customer creating special dedicated partitions for the benchmark,
> > > because this is too user-unfriendly. Or do I make incorrect assumptions?
> > 
> > But why a sysctl then? And also don't see a point for that at all, why 
> > can't the benchmark use posix_fadvise(POSIX_FADV_DONTNEED)?
> 
> The latter question was answered - people want a way to drop caches for
> a file. They need a method which guarantees that the caches are dropped.
> They do not need an advisory method which does not give any guarantees.
> 
> As for the first question - this was what I was also asking too, but
> without suggesting alternatives. I challenged the authors with the
> following:
> 
> 1. Why the interface would only allow the super user dropping the
> caches? How about allowing the file owner or, generally speaking, the
> person who is allowed to modify the file, drop the caches?
> 
> I alluded that this may be doable with an fd-based interface.
> 
> 2. What about symlinks? Can I have a choice whether I drop caches
> (struct inode, I suppose) for the symlink itself or for the destination
> file? Again, fd-based interface would probably naturally allow for this.
> 
> 3. What about leaving some room for future extensions? E.g., someone may
> want to drop only part of a file in the future, who knows. Can we invent
> an interface which would allow to be extended in the future, without
> breaking older software?
> 
> My intention was to encourage the submitter to take some time and come
> back with deeper analysis.
> 
> And finally, and most importantly, Dave stated that any per-file cache
> dropping interface is unlikely going to be accepted at all, because
> there is mount/unmount.
> 
> So far this is the mane concern the submitter should address.
> 
> But I just answered that what Dave suggested is probably not the nicest
> way to do this from the user-space perspective, because it requires
> superuser privileges, and probably a separate "benchmark-only"
> partition.

I think that Dave is right in that if it's just for a "benchmarking"
purposes, then there is no need for a new special interface for
dropping caches. There is mount/umount and drop_caches which should
be more than enough for any benchmark. And while it's true that
you'd likely need superuser privileges for mount/umount, the same is
true about drop_caches, isn't it ?

> 
> So if the authors want to sell this new interface (in whatever form) to
> the kernel community, they should start with providing a solid use-case,
> with some more details, explore alternatives and show how the
> alternatives do not work for them.

Yes please, let's see some solid use-case for this.

-Lukas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] sysctl: Add a feature to drop caches selectively

2014-06-27 Thread Lukáš Czerner

On Fri, 27 Jun 2014, Matthias Schniedermeyer wrote:

> Date: Fri, 27 Jun 2014 10:41:39 +0200
> From: Matthias Schniedermeyer 
> To: Luká? Czerner 
> Cc: Artem Bityutskiy ,
> Bernd Schubert ,
> Dave Chinner , Thomas Knauth ,
> David Rientjes ,
> Maksym Planeta ,
> Alexander Viro , linux-fsde...@vger.kernel.org,
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] sysctl: Add a feature to drop caches selectively
> 
> On 26.06.2014 13:57, Luká? Czerner wrote:
> 
> > > So if the authors want to sell this new interface (in whatever form) to
> > > the kernel community, they should start with providing a solid use-case,
> > > with some more details, explore alternatives and show how the
> > > alternatives do not work for them.
> > 
> > Yes please, let's see some solid use-case for this.
> 
> Personally i would want it to verify files after copying them:
> Especially while moving files:
> - Copy a file
> - 
> - Verify that it really is correct on stable storage
> - Remove original file

I assume you're using cp to copy a file, not your own program. In
that case can we make cp optionally use direct io ? It seems that it
would solve your problem in very elegant way.

-Lukas

> 
> Currently i choose either of the 3 ways:
> - drop_caches
> - umount/mount
> - Write more data than memory in machine (Which is only an 
> approximnation and you have to verify in the same order the files were 
> written, so it is likely that any cache was thrashed in the meantime)
> 
> But having a way to selectivly "destory" the cache of a file would make 
> this task easier.
> 
> 
> 
> 
>

Re: [RFC] Tux3 for review

2014-06-19 Thread Lukáš Czerner

On Thu, 19 Jun 2014, Pavel Machek wrote:

> Date: Thu, 19 Jun 2014 10:21:29 +0200
> From: Pavel Machek 
> To: James Bottomley 
> Cc: Daniel Phillips , Dave Chinner ,
> linux-kernel@vger.kernel.org, linux-fsde...@vger.kernel.org,
> Linus Torvalds ,
> Andrew Morton 
> Subject: Re: [RFC] Tux3 for review
> 
> On Mon 2014-06-16 08:25:54, James Bottomley wrote:
> > On Sun, 2014-06-15 at 14:41 -0700, Daniel Phillips wrote:
> > > On Friday, June 13, 2014 1:20:39 PM PDT, Pavel Machek wrote:
> > > > On Fri 2014-06-13 10:49:39, Daniel Phillips wrote:
> > >  to sign up for a ridiculous amount of largely thankless, but 
> > > perhaps fascinating work. Any volunteers?
> > 
> > The whole suggestion is a non starter: we can't stage core API changes.
> > Even if we worked out how to do that, the staging trees mostly don't get
> > the type of in-depth expert review that you need anyway.
> 
> Well.. most filesystems do not need any core API changes, right?
> 
> > The Cardinal concern has always been the viability page forking and its
> > impact on writeback  ... and since writeback is our most difficult an
> > performance sensitive area, the bar to changing it is high.
> 
> And in this particular case, Daniel was flamed for poor coding style, not
> for page forking. So staging/ would actually help him -- he could concentrate
> on core changes without being distracted by unimportant stuff.

Flamed ? really ? Dave pointed out some serious coding style problems.
Those should be very easy to fix.

Let me remind you some more important problems Dave brought up,
including page forking:

"
 The hacks around VFS and MM functionality need to have demonstrated
 methods for being removed. We're not going to merge that page
 forking stuff (like you were told at LSF 2013 more than a year ago:
 http://lwn.net/Articles/548091/) without rigorous design review and
 a demonstration of the solutions to all the hard corner cases it
 has. The current code doesn't solve them (e.g. direct IO doesn't
 work in tux3), and there's no clear patch set we can review that
 demonstrates how it is all supposed to work. i.e. you need to
 separate out all the page forking code into a separate patchset for
 review, independent of the tux3 code and applies to the core mm/
 code.
"

"
 Then there's all the writeback hacks. You've simply copy-n-pasted
 most of fs-writeback.c, including duplicating structures like struct
 wb_writeback_work and then hacked in crap (kallsyms lookups!) to be
 able to access core structures from kernel module context
 (tux3_setup_writeback(), I'm looking at you). This is completely
 unacceptible for a merge. Again, you need to separate out all the
 writeback changes you need into an independent patchset so that they
 can be reviewed independently of the tux3 code that uses it.
"

-Lukas

> 
> > When you presented page forking at LSF/MM in 2013, it didn't even stand
> > up to basic scrutiny before people found unresolved problems:
> > 
> > http://lwn.net/Articles/548091/
> > 
> > After lots of prodding, you finally coughed up a patch for discussion:
> > 
> > http://thread.gmane.org/gmane.linux.file-systems/85619
> > 
> > But then that petered out again.  I can't emphasise enough that
> > iterating these threads to a conclusion and reposting interface
> > suggestions is the way to proceed on this ... as far as I can tell from
> > the discussion, the reviewers were making helpful suggestions, even if
> > they didn't like the original interface you proposed.
> 
> This obviously needs to be solved, first...
>   Pavel
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: testing result of loop-aio patchset on ext3

2014-07-16 Thread Lukáš Czerner

On Wed, 16 Jul 2014, Rui Xiang wrote:

> Date: Wed, 16 Jul 2014 11:54:24 +0800
> From: Rui Xiang 
> To: Lukáš Czerner 
> Cc: Dave Kleikamp , linux-e...@vger.kernel.org,
> linux-fsde...@vger.kernel.org, linux-kernel@vger.kernel.org,
> Li Zefan 
> Subject: Re: testing result of loop-aio patchset on ext3
> 
> On 2014/7/14 17:51, Lukáš Czerner wrote:
> > On Mon, 14 Jul 2014, Rui Xiang wrote:
> > 
> >> Date: Mon, 14 Jul 2014 17:34:38 +0800
> >> From: Rui Xiang 
> >> To: Dave Kleikamp , linux-e...@vger.kernel.org
> >> Cc: linux-fsde...@vger.kernel.org, linux-kernel@vger.kernel.org,
> >> Li Zefan 
> >> Subject: testing result of loop-aio patchset on ext3
> >>
> >> Hi Dave,
> >>
> >> We export a container image file as a block device via loop device, but we
> >> found it's very easy that the container rootfs gets corrupted due to power
> >> loss.
> >>
> >> Your early version of loop-aio patchset said the patchset can make loop
> >> mounted filesystems recoverable(lkml.org/lkml/2012/3/30/317), but we found
> >> it doesn't help.
> >>
> >> Both the guest fs and host fs are ext3.
> >>
> >> The loop-aio patchset is from:
> >> git://github.com/kleikamp/linux-shaggy.git aio_loop
> >>
> >> Steps:
> >> 1. dd a 10G image, mkfs.ext3,
> >>   # dd if=/dev/zero of=./raw_image bs=1M count=1
> >>   # echo y | mkfs.ext3 raw_image
> >>
> >> 2. losetup a loop device, mount at ./test_dir
> >>   # losetup /dev/loop1 raw_image
> >>   # mount /dev/loop1 ./test_dir
> >>
> >> 3. copy fs_mark into test_dir and run
> >>   # ./fs_mark -d ./tmp/ -s 10240 -n 80
> >>
> >> 4. during runing fs_mark, make systerm reboot indirectly.
> >>   # echo b > /proc/sysrq-trigger
> >>
> >> After systerm booted up, sometimes fsck reported raw_image fs has been 
> >> damaged.
> >>
> >> # fsck.ext3 -n raw_image
> >> e2fsck 1.41.9 (22-Aug-2009)
> >> Warning: skipping journal recovery because doing a read-only filesystem 
> >> check.
> >> raw_image contains a file system with errors, check forced.
> >> Pass 1: Checking inodes, blocks, and sizes
> >> Pass 2: Checking directory structure
> >> Pass 3: Checking directory connectivity
> >> Pass 4: Checking reference counts
> >> Pass 5: Checking group summary information
> >> Free blocks count wrong (2481348, counted=2480577).
> >> Fix? no
> >> Free inodes count wrong (640837, counted=640835).
> >> Fix? no
> >> raw_image: ** WARNING: Filesystem still has errors **
> >> raw_image: 11/640848 files (0.0% non-contiguous), 78652/256 blocks
> > 
> > It's not damaged, this is expected result if you're using old
> > e2fsprogs which still treats this as an error.
> > 
> > It's not an error because we only update superblock summary at
> > unmount time so with unclean shutdown it's likely that it does not
> > match the reality, but e2fsck can and will easily fix that for you.
> > 
> > Please try e2fsprogs v1.42.3 or newer.
> > 
> 
> Hi Lukas,
> 
> I updated e2fsprogs to v1.42.3, and user the newer fsck.ext3 to check 
> raw_image.
> Exactly, the result seemed normal.

Now I can see that there are much more problems than before, that's
weird. Sorry for not making this clear, but for this kind of
reproducers please use the most recent e2fsprogs. Also , what is the
kernel version you're using in this test ?

Thanks!
-Lukas

> 
> Then, I continue my previous test. And after testing 35 times, "fsck -n" 
> reported image fs
> had been damaged, too.
> 
>  # fsck.ext3 -n image1
> e2fsck 1.42.3.wc1 (28-May-2012)
> Warning: skipping journal recovery because doing a read-only filesystem check.
> image1 has been mounted 36 times without being checked, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Inode 16407, i_size is 597447, should be 602112.  Fix? no
> Inode 16407, i_blocks is 1176, should be 1184.  Fix? no
> Inode 409941, i_blocks is 200208, should be 112.  Fix? no
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Block bitmap differences:  -1506836 -1506843 -(1506859--1506860) 
> -(1660941--1661964) -(1661966--1671167) -(1671688--1686473)
> Fix? no
> Free blocks count wrong for group #2 (31558, counted=31556).
> Fix? no
> Free blocks count w

Re: testing result of loop-aio patchset on ext3

2014-07-18 Thread Lukáš Czerner

On Wed, 16 Jul 2014, Rui Xiang wrote:

> Date: Wed, 16 Jul 2014 17:28:10 +0800
> From: Rui Xiang 
> To: Lukáš Czerner 
> Cc: Dave Kleikamp , linux-e...@vger.kernel.org,
> linux-fsde...@vger.kernel.org, linux-kernel@vger.kernel.org,
> Li Zefan 
> Subject: Re: testing result of loop-aio patchset on ext3
> 
> On 2014/7/16 15:58, Lukáš Czerner wrote:
> > On Wed, 16 Jul 2014, Rui Xiang wrote:
> > 
> >> Date: Wed, 16 Jul 2014 11:54:24 +0800
> >> From: Rui Xiang 
> >> To: Lukáš Czerner 
> >> Cc: Dave Kleikamp , linux-e...@vger.kernel.org,
> >> linux-fsde...@vger.kernel.org, linux-kernel@vger.kernel.org,
> >> Li Zefan 
> >> Subject: Re: testing result of loop-aio patchset on ext3
> >>
> >> On 2014/7/14 17:51, Lukáš Czerner wrote:
> >>> On Mon, 14 Jul 2014, Rui Xiang wrote:
> >>>
> >>>> Date: Mon, 14 Jul 2014 17:34:38 +0800
> >>>> From: Rui Xiang 
> >>>> To: Dave Kleikamp , linux-e...@vger.kernel.org
> >>>> Cc: linux-fsde...@vger.kernel.org, linux-kernel@vger.kernel.org,
> >>>> Li Zefan 
> >>>> Subject: testing result of loop-aio patchset on ext3
> >>>>
> >>>> Hi Dave,
> >>>>
> >>>> We export a container image file as a block device via loop device, but 
> >>>> we
> >>>> found it's very easy that the container rootfs gets corrupted due to 
> >>>> power
> >>>> loss.
> >>>>
> >>>> Your early version of loop-aio patchset said the patchset can make loop
> >>>> mounted filesystems recoverable(lkml.org/lkml/2012/3/30/317), but we 
> >>>> found
> >>>> it doesn't help.
> >>>>
> >>>> Both the guest fs and host fs are ext3.
> >>>>
> >>>> The loop-aio patchset is from:
> >>>> git://github.com/kleikamp/linux-shaggy.git aio_loop
> >>>>
> >>>> Steps:
> >>>> 1. dd a 10G image, mkfs.ext3,
> >>>>   # dd if=/dev/zero of=./raw_image bs=1M count=1
> >>>>   # echo y | mkfs.ext3 raw_image
> >>>>
> >>>> 2. losetup a loop device, mount at ./test_dir
> >>>>   # losetup /dev/loop1 raw_image
> >>>>   # mount /dev/loop1 ./test_dir
> >>>>
> >>>> 3. copy fs_mark into test_dir and run
> >>>>   # ./fs_mark -d ./tmp/ -s 10240 -n 80
> >>>>
> >>>> 4. during runing fs_mark, make systerm reboot indirectly.
> >>>>   # echo b > /proc/sysrq-trigger
> >>>>
> >>>> After systerm booted up, sometimes fsck reported raw_image fs has been 
> >>>> damaged.
> >>>>
> >>>> # fsck.ext3 -n raw_image
> >>>> e2fsck 1.41.9 (22-Aug-2009)
> >>>> Warning: skipping journal recovery because doing a read-only filesystem 
> >>>> check.
> >>>> raw_image contains a file system with errors, check forced.
> >>>> Pass 1: Checking inodes, blocks, and sizes
> >>>> Pass 2: Checking directory structure
> >>>> Pass 3: Checking directory connectivity
> >>>> Pass 4: Checking reference counts
> >>>> Pass 5: Checking group summary information
> >>>> Free blocks count wrong (2481348, counted=2480577).
> >>>> Fix? no
> >>>> Free inodes count wrong (640837, counted=640835).
> >>>> Fix? no
> >>>> raw_image: ** WARNING: Filesystem still has errors **
> >>>> raw_image: 11/640848 files (0.0% non-contiguous), 78652/256 blocks
> >>>
> >>> It's not damaged, this is expected result if you're using old
> >>> e2fsprogs which still treats this as an error.
> >>>
> >>> It's not an error because we only update superblock summary at
> >>> unmount time so with unclean shutdown it's likely that it does not
> >>> match the reality, but e2fsck can and will easily fix that for you.
> >>>
> >>> Please try e2fsprogs v1.42.3 or newer.
> >>>
> >>
> >> Hi Lukas,
> >>
> >> I updated e2fsprogs to v1.42.3, and user the newer fsck.ext3 to check 
> >> raw_image.
> >> Exactly, the result seemed normal.
> > 
> > Now I can see that there are much more problems than before, that's
> > weird. Sorry for not making this clear, but for this kind

Re: [PATCH] vfs: fix check for fallocate on active swapfile

2014-06-25 Thread Lukáš Czerner

On Tue, 24 Jun 2014, Eric Biggers wrote:

> Date: Tue, 24 Jun 2014 23:45:08 -0500
> From: Eric Biggers 
> To: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org, Eric Biggers 
> Subject: [PATCH] vfs: fix check for fallocate on active swapfile
> 
> Fix the broken check for calling sys_fallocate() on an active swapfile,
> introduced by commit 0790b31b69374ddadefe ("fs: disallow all fallocate
> operation on active swapfile").

Oops, good catch, Thanks!

Reviewed-by: Lukas Czerner 

> 
> Signed-off-by: Eric Biggers 
> ---
>  fs/open.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/open.c b/fs/open.c
> index 36662d0..d6fd3ac 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -263,11 +263,10 @@ int do_fallocate(struct file *file, int mode, loff_t 
> offset, loff_t len)
>   return -EPERM;
>  
>   /*
> -  * We can not allow to do any fallocate operation on an active
> -  * swapfile
> +  * We cannot allow any fallocate operation on an active swapfile
>*/
>   if (IS_SWAPFILE(inode))
> - ret = -ETXTBSY;
> + return -ETXTBSY;
>  
>   /*
>* Revalidate the write permissions, in case security policy has
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3/3] ext4: Add support IOC_MOV_DATA ioctl

2014-07-08 Thread Lukáš Czerner

On Tue, 8 Jul 2014, Namjae Jeon wrote:

> Date: Tue, 08 Jul 2014 21:00:02 +0900
> From: Namjae Jeon 
> To: Dave Chinner , Theodore Ts'o 
> Cc: linux-ext4 , linux-fsde...@vger.kernel.org,
> linux-kernel@vger.kernel.org, Lukáš Czerner ,
> Brian Foster , Christoph Hellwig ,
> Ashish Sangwan , x...@oss.sgi.com
> Subject: [PATCH 3/3] ext4: Add support IOC_MOV_DATA ioctl
> 
> This patch implements fs ioctl's IOC_MOV_DATA for Ext4.

Hmm isn't this basically what ext4_move_extents() does ? eg.
EXT4_IOC_MOVE_EXT ?

I guess that the intention here is to do the move, without actually
moving the data right ? But nevertheless maybe some code can be
shared with ext4_move_extents() ?

-Lukas

> 
> The semantics of this ioctl are:
> 1) Like collapse range, offsets and length should be file system block size
>aligned.
> 2) In the receiver file, atleast length size hole should be present at
>receiver_offset
> 3) It does not change file size of any of donor or receiver file.
> 4) It leaves a hole at the place from where blocks are moved out in donor 
> file.
> 5) Both (donor_offset + length) and (receiver_offset + length) should be 
> within
>size of donor file and receiver file respectively.
>Only unwritten extents resides beyond file size and it does not make sense
>to transfer unwritten extents, leave apart the security issues it may 
> raise.
> 6) If the range to be transfered from donor file contain any holes, they are
>replicated as it is in receiver file. It mean holes are preserved and
>the length of hole will be added to moved_len signifying that the hole 
> range
>is succesfully transfered.
> 
> Signed-off-by: Namjae Jeon 
> Signed-off-by: Ashish Sangwan 
> ---
>  fs/ext4/ext4.h|   2 +
>  fs/ext4/extents.c | 375 
> ++
>  fs/ext4/file.c|   1 +
>  3 files changed, 378 insertions(+)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 6386c5f..26478eb 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2725,6 +2725,8 @@ extern int ext4_fiemap(struct inode *inode, struct 
> fiemap_extent_info *fieinfo,
>  extern int ext4_ext_precache(struct inode *inode);
>  extern int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t 
> len);
>  extern int ext4_insert_range(struct file *file, loff_t offset, loff_t len);
> +extern int ext4_mov_data(struct inode *, struct inode *, loff_t, loff_t, 
> loff_t,
> +  loff_t *);
>  
>  /* move_extent.c */
>  extern void ext4_double_down_write_data_sem(struct inode *first,
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 0c2432e..511db03 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -5811,3 +5811,378 @@ out_mutex:
>   mutex_unlock(&inode->i_mutex);
>   return ret;
>  }
> +
> +/*
> + * If offset_lblk does not lie on the extent start boundary, split extent
> + */
> +int ext4_find_and_split_extent_at(struct inode *inode, ext4_lblk_t 
> offset_lblk)
> +{
> + struct ext4_ext_path *path;
> + handle_t *handle;
> + int credits, err = 0, split_flag, ex_len;
> + struct ext4_extent *ex;
> + int depth = ext_depth(inode);
> + ext4_lblk_t ex_start;
> +
> + path = ext4_ext_find_extent(inode, offset_lblk, NULL, 0);
> + if (IS_ERR(path))
> + return PTR_ERR(path);
> +
> + ex = path[depth].p_ext;
> + if (!ex)
> + goto free_path;
> + ex_start = le32_to_cpu(ex->ee_block);
> + ex_len = ext4_ext_get_actual_len(ex);
> +
> + if (offset_lblk > ex_start && offset_lblk < (ex_start + ex_len)) {
> + credits = ext4_writepage_trans_blocks(inode);
> + handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
> + if (IS_ERR(handle)) {
> + err = PTR_ERR(handle);
> + goto free_path;
> + }
> + if (ext4_ext_is_unwritten(ex))
> + split_flag = EXT4_EXT_MARK_UNWRIT1 |
> +  EXT4_EXT_MARK_UNWRIT2;
> + else
> + split_flag = 0;
> +
> + err = ext4_split_extent_at(handle, inode, path, offset_lblk,
> +split_flag, EXT4_EX_NOCACHE |
> +EXT4_GET_BLOCKS_PRE_IO);
> + ext4_journal_stop(handle);
> + }
> +
> +free_path:
> + ext4_ext_drop_refs(path);
> + kfree(path);
> + return err;
> +}
> +
> +/*
> + * Compute the size of hole in terms of filesystem blocks present at 
> offset_lblk
> + * until the next e

Re: [PATCH] mmc: Do not advertise secure discard if it is blacklisted

2014-07-14 Thread Lukáš Czerner

On Wed, 18 Jun 2014, Lukas Czerner wrote:

> Date: Wed, 18 Jun 2014 13:18:07 +0200
> From: Lukas Czerner 
> To: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org, ch...@printf.net,
> Lukas Czerner 
> Subject: [PATCH] mmc: Do not advertise secure discard if it is blacklisted
> 
> Currently when the device secure discard implementation is
> blacklisted (MMC_QUIRK_SEC_ERASE_TRIM_BROKEN quirk is set)
> instead of secure discard we're going to do normal discard,
> which is wrong.
> 
> When the secure discard is known to be broken we should just
> disallow it entirely and not advertise this functionality to
> the user. Fix it.
> 
> Also move mmc_fixup_device() in from of mmc_blk_alloc() so we
> can get quirks set before we attempt to set queue information.

Where should I send it to get noticed ? It's sitting here for a
month now.

Thanks!
-Lukas

> 
> Signed-off-by: Lukas Czerner 
> ---
>  drivers/mmc/card/block.c | 6 +++---
>  drivers/mmc/core/core.c  | 3 ++-
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/mmc/card/block.c b/drivers/mmc/card/block.c
> index 452782b..ede41f0 100644
> --- a/drivers/mmc/card/block.c
> +++ b/drivers/mmc/card/block.c
> @@ -2028,8 +2028,7 @@ static int mmc_blk_issue_rq(struct mmc_queue *mq, 
> struct request *req)
>   /* complete ongoing async transfer before issuing discard */
>   if (card->host->areq)
>   mmc_blk_issue_rw_rq(mq, NULL);
> - if (req->cmd_flags & REQ_SECURE &&
> - !(card->quirks & MMC_QUIRK_SEC_ERASE_TRIM_BROKEN))
> + if (req->cmd_flags & REQ_SECURE)
>   ret = mmc_blk_issue_secdiscard_rq(mq, req);
>   else
>   ret = mmc_blk_issue_discard_rq(mq, req);
> @@ -2432,6 +2431,8 @@ static int mmc_blk_probe(struct mmc_card *card)
>   if (!(card->csd.cmdclass & CCC_BLOCK_READ))
>   return -ENODEV;
>  
> + mmc_fixup_device(card, blk_fixups);
> +
>   md = mmc_blk_alloc(card);
>   if (IS_ERR(md))
>   return PTR_ERR(md);
> @@ -2446,7 +2447,6 @@ static int mmc_blk_probe(struct mmc_card *card)
>   goto out;
>  
>   mmc_set_drvdata(card, md);
> - mmc_fixup_device(card, blk_fixups);
>  
>   if (mmc_add_disk(md))
>   goto out;
> diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
> index 7dc0c85..d03a080 100644
> --- a/drivers/mmc/core/core.c
> +++ b/drivers/mmc/core/core.c
> @@ -2102,7 +2102,8 @@ EXPORT_SYMBOL(mmc_can_sanitize);
>  
>  int mmc_can_secure_erase_trim(struct mmc_card *card)
>  {
> - if (card->ext_csd.sec_feature_support & EXT_CSD_SEC_ER_EN)
> + if ((card->ext_csd.sec_feature_support & EXT_CSD_SEC_ER_EN) &&
> + !(card->quirks & MMC_QUIRK_SEC_ERASE_TRIM_BROKEN))
>   return 1;
>   return 0;
>  }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] vfs: Turn the unlinked inode counter into a percpu counter

2014-07-14 Thread Lukáš Czerner

On Fri, 11 Jul 2014, Andi Kleen wrote:

> Date: Fri, 11 Jul 2014 13:23:39 -0700
> From: Andi Kleen 
> To: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org, Andi Kleen ,
> Al Viro , Miklos Szeredi 
> Subject: [PATCH] vfs: Turn the unlinked inode counter into a percpu counter
> 
> From: Andi Kleen 
> 
> The unlinked open inodes super block counter that was added some time ago
> unfortunately becomes a very hot cache line in some delete heavy
> high IOPs or tmpfs workloads with enough cores, when files
> are frequently removed. With TSX it also causes plenty of aborts.
> 
> Turn the atomic into a percpu_counter. It's nearly perfectly
> suited for a percpu counter because it is only checked
> in very slow paths, like remount.
> 
> Original commit:
> 
> commit 7ada4db88634429f4da690ad1c4eb73c93085f0c
> Author: Miklos Szeredi 
> Date:   Mon Nov 21 12:11:32 2011 +0100
> 
> vfs: count unlinked inodes
> 
> Add a new counter to the superblock that keeps track of unlinked but
> not yet deleted inodes.
> 
> Cc: Al Viro 
> Cc: Miklos Szeredi 
> Signed-off-by: Andi Kleen 
> ---
>  fs/ext2/super.c|  2 +-
>  fs/inode.c | 14 ++
>  fs/namespace.c |  4 ++--
>  fs/super.c |  3 +++
>  include/linux/fs.h |  2 +-
>  5 files changed, 13 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 3750031..ac0ad9a 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -1218,7 +1218,7 @@ static int ext2_freeze(struct super_block *sb)
>* because we have unattached inodes and thus filesystem is not fully
>* consistent.
>*/
> - if (atomic_long_read(&sb->s_remove_count)) {
> + if (percpu_counter_sum(&sb->s_remove_counters)) {
>   ext2_sync_fs(sb, 1);
>   return 0;
>   }
> diff --git a/fs/inode.c b/fs/inode.c
> index f96d2a6..4820c35 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -234,10 +234,8 @@ void __destroy_inode(struct inode *inode)
>   BUG_ON(inode_has_buffers(inode));
>   security_inode_free(inode);
>   fsnotify_inode_delete(inode);
> - if (!inode->i_nlink) {
> - WARN_ON(atomic_long_read(&inode->i_sb->s_remove_count) == 0);

This is not mentioned in the description, but I guess it makes sense since
it'd be quite expensive.

-Lukas

> - atomic_long_dec(&inode->i_sb->s_remove_count);
> - }
> + if (!inode->i_nlink)
> + percpu_counter_dec(&inode->i_sb->s_remove_counters);
>  
>  #ifdef CONFIG_FS_POSIX_ACL
>   if (inode->i_acl && inode->i_acl != ACL_NOT_CACHED)
> @@ -281,7 +279,7 @@ void drop_nlink(struct inode *inode)
>   WARN_ON(inode->i_nlink == 0);
>   inode->__i_nlink--;
>   if (!inode->i_nlink)
> - atomic_long_inc(&inode->i_sb->s_remove_count);
> + percpu_counter_inc(&inode->i_sb->s_remove_counters);
>  }
>  EXPORT_SYMBOL(drop_nlink);
>  
> @@ -297,7 +295,7 @@ void clear_nlink(struct inode *inode)
>  {
>   if (inode->i_nlink) {
>   inode->__i_nlink = 0;
> - atomic_long_inc(&inode->i_sb->s_remove_count);
> + percpu_counter_inc(&inode->i_sb->s_remove_counters);
>   }
>  }
>  EXPORT_SYMBOL(clear_nlink);
> @@ -317,7 +315,7 @@ void set_nlink(struct inode *inode, unsigned int nlink)
>   } else {
>   /* Yes, some filesystems do change nlink from zero to one */
>   if (inode->i_nlink == 0)
> - atomic_long_dec(&inode->i_sb->s_remove_count);
> + percpu_counter_dec(&inode->i_sb->s_remove_counters);
>  
>   inode->__i_nlink = nlink;
>   }
> @@ -336,7 +334,7 @@ void inc_nlink(struct inode *inode)
>  {
>   if (unlikely(inode->i_nlink == 0)) {
>   WARN_ON(!(inode->i_state & I_LINKABLE));
> - atomic_long_dec(&inode->i_sb->s_remove_count);
> + percpu_counter_dec(&inode->i_sb->s_remove_counters);
>   }
>  
>   inode->__i_nlink++;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 182bc41..20d33d9 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -535,7 +535,7 @@ int sb_prepare_remount_readonly(struct super_block *sb)
>   int err = 0;
>  
>   /* Racy optimization.  Recheck the counter under MNT_WRITE_HOLD */
> - if (atomic_long_read(&sb->s_remove_count))
> + if (percpu_counter_sum(&sb->s_remove_counters) != 0)
>   return -EBUSY;
>  
>   lock_mount_hash();
> @@ -549,7 +549,7 @@ int sb_prepare_remount_readonly(struct super_block *sb)
>   }
>   }
>   }
> - if (!err && atomic_long_read(&sb->s_remove_count))
> + if (!err && percpu_counter_sum(&sb->s_remove_counters) != 0)
>   err = -EBUSY;
>  
>   if (!err) {
> diff --git a/fs/super.c b/fs/super.c
> index 48377f7..b7a8a45 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -142,6 +142,7 @@ static void destroy_super(struct super_block *s)
>   list_lru_destroy(&s->s

Re: testing result of loop-aio patchset on ext3

2014-07-14 Thread Lukáš Czerner

On Mon, 14 Jul 2014, Rui Xiang wrote:

> Date: Mon, 14 Jul 2014 17:34:38 +0800
> From: Rui Xiang 
> To: Dave Kleikamp , linux-e...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org, linux-kernel@vger.kernel.org,
> Li Zefan 
> Subject: testing result of loop-aio patchset on ext3
> 
> Hi Dave,
> 
> We export a container image file as a block device via loop device, but we
> found it's very easy that the container rootfs gets corrupted due to power
> loss.
> 
> Your early version of loop-aio patchset said the patchset can make loop
> mounted filesystems recoverable(lkml.org/lkml/2012/3/30/317), but we found
> it doesn't help.
> 
> Both the guest fs and host fs are ext3.
> 
> The loop-aio patchset is from:
> git://github.com/kleikamp/linux-shaggy.git aio_loop
> 
> Steps:
> 1. dd a 10G image, mkfs.ext3,
>   # dd if=/dev/zero of=./raw_image bs=1M count=1
>   # echo y | mkfs.ext3 raw_image
> 
> 2. losetup a loop device, mount at ./test_dir
>   # losetup /dev/loop1 raw_image
>   # mount /dev/loop1 ./test_dir
> 
> 3. copy fs_mark into test_dir and run
>   # ./fs_mark -d ./tmp/ -s 10240 -n 80
> 
> 4. during runing fs_mark, make systerm reboot indirectly.
>   # echo b > /proc/sysrq-trigger
> 
> After systerm booted up, sometimes fsck reported raw_image fs has been 
> damaged.
> 
> # fsck.ext3 -n raw_image
> e2fsck 1.41.9 (22-Aug-2009)
> Warning: skipping journal recovery because doing a read-only filesystem check.
> raw_image contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Free blocks count wrong (2481348, counted=2480577).
> Fix? no
> Free inodes count wrong (640837, counted=640835).
> Fix? no
> raw_image: ** WARNING: Filesystem still has errors **
> raw_image: 11/640848 files (0.0% non-contiguous), 78652/256 blocks

It's not damaged, this is expected result if you're using old
e2fsprogs which still treats this as an error.

It's not an error because we only update superblock summary at
unmount time so with unclean shutdown it's likely that it does not
match the reality, but e2fsck can and will easily fix that for you.

Please try e2fsprogs v1.42.3 or newer.

Thanks!
-Lukas

> 
> 
> With a specific script, I can almost 100% reproduce this issue.
> 
> And it seems the corruption can only happen when reboot happens at the
> time loop is calling vfs_fsync().
> 
> Do you have any idea why the loop-aio patchset doesn't help?
> 
> Thanks.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/1] jbd2: fix incorrect unlock on j_list_lock

2015-03-19 Thread Lukáš Czerner

On Wed, 18 Mar 2015, Taesoo Kim wrote:

> Date: Wed, 18 Mar 2015 13:39:31 -0400
> From: Taesoo Kim 
> To: Lukáš Czerner 
> Cc: Taesoo Kim , ty...@mit.edu,
> linux-e...@vger.kernel.org, linux-kernel@vger.kernel.org,
> chang...@gatech.edu, sanid...@gatech.edu, b...@gatech.edu,
> cson...@gatech.edu
> Subject: Re: [PATCH 1/1] jbd2: fix incorrect unlock on j_list_lock
> 
> > The patch looks good, thanks.
> 
> Thank you.
>  
> > Reviewed-by: Lukas Czerner 
> > 
> > Btw, were you able to reproduce the problem, or have you seen the
> > problem in the wild ? Or did you just spot it in the code ?
> 
> We are developing a static checker to spot inconsistent programming
> patterns; our first goal is to scan over existing filesystems and
> figure out how they are implemented differently (or similarly). We
> will report bugs in sequence as soon as our team confirm (just start
> sending patches to other fs).

And this was found but it ?

But anyway it sounds really interesting, do you have any more
information you can share about the project ? Project website,
description, or source code would be great :)

Thanks!
-Lukas

> 
> Thanks,
> Taesoo
> 
> > Thanks!
> > -Lukas
> > 
> > > 
> > > Signed-off-by: Taesoo Kim 
> > > ---
> > >  fs/jbd2/transaction.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> > > index 5f09370..edb7f59 100644
> > > --- a/fs/jbd2/transaction.c
> > > +++ b/fs/jbd2/transaction.c
> > > @@ -1091,6 +1091,7 @@ int jbd2_journal_get_create_access(handle_t 
> > > *handle, struct buffer_head *bh)
> > >   JBUFFER_TRACE(jh, "file as BJ_Reserved");
> > >   spin_lock(&journal->j_list_lock);
> > >   __jbd2_journal_file_buffer(jh, transaction, BJ_Reserved);
> > > + spin_unlock(&journal->j_list_lock);
> > >   } else if (jh->b_transaction == journal->j_committing_transaction) {
> > >   /* first access by this transaction */
> > >   jh->b_modified = 0;
> > > @@ -1098,8 +1099,8 @@ int jbd2_journal_get_create_access(handle_t 
> > > *handle, struct buffer_head *bh)
> > >   JBUFFER_TRACE(jh, "set next transaction");
> > >   spin_lock(&journal->j_list_lock);
> > >   jh->b_next_transaction = transaction;
> > > + spin_unlock(&journal->j_list_lock);
> > >   }
> > > - spin_unlock(&journal->j_list_lock);
> > >   jbd_unlock_bh_state(bh);
> > > 
> > >   /*
> > > --
> > > 2.3.3
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > > the body of a message to majord...@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
>

Re: [PATCH 1/1] jbd2: fix incorrect unlock on j_list_lock

2015-03-19 Thread Lukáš Czerner

On Thu, 19 Mar 2015, Taesoo Kim wrote:

> Date: Thu, 19 Mar 2015 08:02:43 -0400
> From: Taesoo Kim 
> To: Lukáš Czerner 
> Cc: Taesoo Kim , ty...@mit.edu,
> linux-e...@vger.kernel.org, linux-kernel@vger.kernel.org,
> chang...@gatech.edu, sanid...@gatech.edu, b...@gatech.edu,
> cson...@gatech.edu
> Subject: Re: [PATCH 1/1] jbd2: fix incorrect unlock on j_list_lock
> 
> On 03/19/15 at 10:48am, Lukáš Czerner wrote:
> > On Wed, 18 Mar 2015, Taesoo Kim wrote:
> > 
> > > Date: Wed, 18 Mar 2015 13:39:31 -0400
> > > From: Taesoo Kim 
> > > To: Lukáš Czerner 
> > > Cc: Taesoo Kim , ty...@mit.edu,
> > > linux-e...@vger.kernel.org, linux-kernel@vger.kernel.org,
> > > chang...@gatech.edu, sanid...@gatech.edu, b...@gatech.edu,
> > > cson...@gatech.edu
> > > Subject: Re: [PATCH 1/1] jbd2: fix incorrect unlock on j_list_lock
> > > 
> > > > The patch looks good, thanks.
> > > 
> > > Thank you.
> > >  
> > > > Reviewed-by: Lukas Czerner 
> > > > 
> > > > Btw, were you able to reproduce the problem, or have you seen the
> > > > problem in the wild ? Or did you just spot it in the code ?
> > > 
> > > We are developing a static checker to spot inconsistent programming
> > > patterns; our first goal is to scan over existing filesystems and
> > > figure out how they are implemented differently (or similarly). We
> > > will report bugs in sequence as soon as our team confirm (just start
> > > sending patches to other fs).
> > 
> > And this was found but it ?
> 
> Our current prototype.
>  
> > But anyway it sounds really interesting, do you have any more
> > information you can share about the project ? Project website,
> > description, or source code would be great :)
> 
> We need 1-2 months to wrap up. I would say, right after the deadline,
> we plan to make the code/result publicly available :)

Great! Looking forward to it.

Thanks!
-Lukas

Re: [PATCH] ext4: Remove useless condition in if statement.

2015-03-19 Thread Lukáš Czerner

On Tue, 17 Mar 2015, Wei Yuan wrote:

> Date: Tue, 17 Mar 2015 09:31:54 +0800
> From: Wei Yuan 
> To: ty...@mit.edu, adilger.ker...@dilger.ca
> Cc: linux-e...@vger.kernel.org, linux-kernel@vger.kernel.org
> Subject: [PATCH] ext4: Remove useless condition in if statement.
> 
> In this if statement, the previous condition is useless, the later one has 
> covered it.

Looks good, thanks!

Reviewed-by: Lukas Czerner 

> 
> Signed-off-by: Weiyuan 
> ---
>  fs/ext4/xattr.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 1e09fc7..f2ccad7 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -639,8 +639,7 @@ ext4_xattr_set_entry(struct ext4_xattr_info *i, struct 
> ext4_xattr_search *s)
>   free += EXT4_XATTR_LEN(name_len);
>   }
>   if (i->value) {
> - if (free < EXT4_XATTR_SIZE(i->value_len) ||
> - free < EXT4_XATTR_LEN(name_len) +
> + if (free < EXT4_XATTR_LEN(name_len) +
>  EXT4_XATTR_SIZE(i->value_len))
>   return -ENOSPC;
>   }
> --
> 2.1.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] shmem: Add eventfd notification on utlilization level

2015-03-10 Thread Lukáš Czerner

On Tue, 10 Mar 2015, Beata Michalska wrote:

> Date: Tue, 10 Mar 2015 16:25:12 +0100
> From: Beata Michalska 
> To: Jan Kara 
> Cc: Christoph Hellwig ,
> Kyungmin Park ,
> Krzysztof Kozlowski ,
> Hugh Dickins , linux-mm ,
> Linux Kernel Mailing List ,
> Alexander Viro ,
> Linux Filesystem Mailing List ,
> Bartlomiej Zolnierkiewicz ,
> Marek Szyprowski 
> Subject: Re: [RFC] shmem: Add eventfd notification on utlilization level
> 
> On 03/10/2015 03:22 PM, Jan Kara wrote:
> > On Tue 10-03-15 06:03:23, Christoph Hellwig wrote:
> >> On Tue, Mar 10, 2015 at 10:51:41AM +0900, Kyungmin Park wrote:
> >>> Any updates?
> >> Please just add disk quota support to tmpfs so thast the standard quota
> >> netlink notifications can be used.
> >   If I understand the problem at hand, they are really interested in
> > notification when running out of free space. Using quota for that doesn't
> > seem ideal since that tracks used space per user, not free space on fs as a
> > whole.
> >
> > But if I remember right there were discussions about ENOSPC notification
> > from filesystem for thin provisioning usecases. It would be good to make
> > this consistent with those but I'm not sure if it went anywhere.
> >
> > Honza
> 
> The ideal case here, would be to get the notification, despite the type
> of the actual filesystem, whenever the amount of free space drops below
> a certain level. Quota doesn't seem to be the right approach here.
> 
> BR
> Beata Michalska

A while back I was prototyping a netlink notification interface for
file systems, but it went nowhere.

https://lkml.org/lkml/2011/8/18/170

So maybe it's time get back to the drawing board and finish the
idea, since it seems to be some interest in this now.

-Lukas

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] ext4: Add pollable sysfs entry for block threshold events

2015-03-11 Thread Lukáš Czerner

On Wed, 11 Mar 2015, Beata Michalska wrote:

> Date: Wed, 11 Mar 2015 11:16:33 +0100
> From: Beata Michalska 
> To: ty...@mit.edu, adilger.ker...@dilger.ca
> Cc: linux-e...@vger.kernel.org, linux-kernel@vger.kernel.org,
> kyungmin.p...@samsung.com
> Subject: [RFC] ext4: Add pollable sysfs entry for block threshold events
> 
> Add support for pollable sysfs entry for logical blocks
> threshold, allowing the userspace to wait for
> the notification whenever the threshold is reached
> instead of periodically calling the statfs.
> This is supposed to work as a single-shot notifiaction
> to reduce the number of triggered events.

Hi,

I though you were advocating for a solution independent on the file
system. This is ext4 only solution, but I do not really have
anything against this.

However I do have couple of comments. First of all you should add
some documentation for the new sysfs file into
Documentation/filesystems/ext4.txt and describe how are you supposed
to use this.

Also I can see that you introduced ext4_mark_group_tainted() helper,
preferably this should go into a separate patch.

More comments below.

> 
> Signed-off-by: Beata Michalska 
> ---
>  fs/ext4/balloc.c  |   17 -
>  fs/ext4/ext4.h|   12 
>  fs/ext4/ialloc.c  |5 +
>  fs/ext4/inode.c   |2 +-
>  fs/ext4/mballoc.c |   14 --
>  fs/ext4/resize.c  |3 ++-
>  fs/ext4/super.c   |   52 ++--
>  7 files changed, 74 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index 83a6f49..bf4a669 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -193,10 +193,7 @@ static int ext4_init_block_bitmap(struct super_block *sb,
>* essentially implementing a per-group read-only flag. */
>   if (!ext4_group_desc_csum_verify(sb, block_group, gdp)) {
>   grp = ext4_get_group_info(sb, block_group);
> - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp))
> - percpu_counter_sub(&sbi->s_freeclusters_counter,
> -grp->bb_free);
> - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
> + ext4_mark_group_tainted(sbi, grp);
>   if (!EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) {
>   int count;
>   count = ext4_free_inodes_count(sb, gdp);
> @@ -252,7 +249,7 @@ unsigned ext4_free_clusters_after_init(struct super_block 
> *sb,
>  ext4_group_t block_group,
>  struct ext4_group_desc *gdp)
>  {
> - return num_clusters_in_group(sb, block_group) - 
> + return num_clusters_in_group(sb, block_group) -
>   ext4_num_overhead_clusters(sb, block_group, gdp);
>  }
>  
> @@ -379,20 +376,14 @@ static void ext4_validate_block_bitmap(struct 
> super_block *sb,
>   ext4_unlock_group(sb, block_group);
>   ext4_error(sb, "bg %u: block %llu: invalid block bitmap",
>  block_group, blk);
> - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp))
> - percpu_counter_sub(&sbi->s_freeclusters_counter,
> -grp->bb_free);
> - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
> + ext4_mark_group_tainted(sbi, grp);
>   return;
>   }
>   if (unlikely(!ext4_block_bitmap_csum_verify(sb, block_group,
>   desc, bh))) {
>   ext4_unlock_group(sb, block_group);
>   ext4_error(sb, "bg %u: bad block bitmap checksum", block_group);
> - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp))
> - percpu_counter_sub(&sbi->s_freeclusters_counter,
> -grp->bb_free);
> - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
> + ext4_mark_group_tainted(sbi, grp);
>   return;
>   }
>   set_buffer_verified(bh);
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index f63c3d5..ee911b7 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1309,6 +1309,7 @@ struct ext4_sb_info {
>   unsigned long s_sectors_written_start;
>   u64 s_kbytes_written;
>  
> + atomic64_t block_thres_event;
>   /* the size of zero-out chunk */
>   unsigned int s_extent_max_zeroout_kb;
>  
> @@ -2207,6 +2208,7 @@ extern int ext4_alloc_flex_bg_array(struct super_block 
> *sb,
>   ext4_group_t ngroup);
>  extern const char *ext4_decode_error(struct super_block *sb, int errno,
>char nbuf[16]);
> +extern void ext4_block_thres_notify(struct ext4_sb_info *sbi);
>  
>  extern __printf(4, 5)
>  void __ext4_error(struct super_block *, const char *, unsigned int,
> @@ -2535,6 +2537,16 @@ static inline spinlock_t *ext4_group_lock_ptr(struct 
> super_block *sb,
>   return

Re: [RFC] ext4: Add pollable sysfs entry for block threshold events

2015-03-11 Thread Lukáš Czerner

On Wed, 11 Mar 2015, Beata Michalska wrote:

> Date: Wed, 11 Mar 2015 17:45:52 +0100
> From: Beata Michalska 
> To: Lukáš Czerner 
> Cc: ty...@mit.edu, adilger.ker...@dilger.ca, linux-e...@vger.kernel.org,
> linux-kernel@vger.kernel.org, kyungmin.p...@samsung.com
> Subject: Re: [RFC] ext4: Add pollable sysfs entry for block threshold events
> 
> Hi,
> 
> On 03/11/2015 03:12 PM, Lukáš Czerner wrote:
> > On Wed, 11 Mar 2015, Beata Michalska wrote:
> > 
> >> Date: Wed, 11 Mar 2015 11:16:33 +0100
> >> From: Beata Michalska 
> >> To: ty...@mit.edu, adilger.ker...@dilger.ca
> >> Cc: linux-e...@vger.kernel.org, linux-kernel@vger.kernel.org,
> >> kyungmin.p...@samsung.com
> >> Subject: [RFC] ext4: Add pollable sysfs entry for block threshold events
> >>
> >> Add support for pollable sysfs entry for logical blocks
> >> threshold, allowing the userspace to wait for
> >> the notification whenever the threshold is reached
> >> instead of periodically calling the statfs.
> >> This is supposed to work as a single-shot notifiaction
> >> to reduce the number of triggered events.
> > 
> > Hi,
> > 
> > I though you were advocating for a solution independent on the file
> > system. This is ext4 only solution, but I do not really have
> > anything against this.
> > 
> 
> I definitely was/am, but again, that would be an ideal case.
> Until we work out some sensible solution, possibly based on your idea you
> have mentioned in another thread, I guess we have to stick to
> what we've got. The ext4 is within our interest, thus the changes proposed.

I agree, this change seems to be simple enough to serve as
workaround for ext4 at the moment.

...snip...

> >> @@ -2967,7 +2962,6 @@ ext4_mb_mark_diskspace_used(struct 
> >> ext4_allocation_context *ac,
> >>if (err)
> >>goto out_err;
> >>err = ext4_handle_dirty_metadata(handle, NULL, gdp_bh);
> >> -
> > 
> > No reason to change that.
> > 
> >>  out_err:
> >>brelse(bitmap_bh);
> >>return err;
> >> @@ -4525,8 +4519,8 @@ out:
> >>reserv_clstrs);
> >>}
> >>  
> >> +  ext4_block_thres_notify(sbi);
> > 
> > I wonder whether it would not be better to have this directly in
> > ext4_claim_free_clusters() ? Or maybe even better in
> > ext4_has_free_clusters() where we already have some of the counters
> > you need ?
> > 
> > This would avoid the overhead of calculating this again since especially
> > the percpu_counter might get quite expensive.
> > 
> 
> The idea was to call the notify once all the necessary arithmetic
> has been done, to get the most up-to date data. And to limit the
> number of calls to notify. In both cases: ext4_claim_free_clusters
> and ext4_has_free_clusters, smth might go wrong afterwards so the counters
> might get updated thus affecting the final outcome of ext4_block_thres_notify.

Right, we might get memory allocation, error quota enospc and
possibly more. However before we do, the space is actually already
reserved and with your approach someone might get ENOSPC (or at
least cross the threshold) without the notification.

And secondly, consider for example a delayed allocation which will
never be allocated because if was freed before we manage to do
that. Similar case, but possibly with a bigger window.

None of it actually matters too much since this is a threshold
notification and by the time you get the notification the reality
will be slightly different anyway. However I think that there is a
big benefit of having this in one place and avoiding gathering all
the calculations multiple times.

> >> --- a/fs/ext4/super.c
> >> +++ b/fs/ext4/super.c
> >> @@ -2558,10 +2558,56 @@ static ssize_t reserved_clusters_store(struct 
> >> ext4_attr *a,
> >>if (parse_strtoull(buf, -1ULL, &val))
> >>return -EINVAL;
> >>ret = ext4_reserve_clusters(sbi, val);
> >> -
> >> +  ext4_block_thres_notify(sbi);
> > 
> > I do not think you count in reserved clusters at the moment. But
> > it's definitely something you should count in.
> > 
> 
> Well, this is the difference between the free and available blocks.
> AFAIK, the reserved blocks just mark the difference between those two,
> and they are not being counted in as far as used blocks are being concerned,
> at least from the user-space perspective, though I might be missing smth here.

See ext4_statfs(). From user-space perspective those are accounted
towards

Re: [PATCH 3/5] WIP: fs: ext4: support unlinkat_s() for secure deletion of files

2015-02-03 Thread Lukáš Czerner

On Mon, 2 Feb 2015, Alexander Holler wrote:

> Date: Mon,  2 Feb 2015 18:05:11 +0100
> From: Alexander Holler 
> To: linux-fsde...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org, Alexander Holler 
> Subject: [PATCH 3/5] WIP: fs: ext4: support unlinkat_s() for secure deletion
> of files

Hi,

I am missing a description, where you'd describe what is this all
about, why and how.

> 
> Signed-off-by: Alexander Holler 
> ---
>  fs/ext4/ext4.h|  2 ++
>  fs/ext4/mballoc.c | 25 +++--
>  fs/ext4/super.c   | 12 
>  3 files changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index c55a1fa..e66507c 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1342,6 +1342,8 @@ struct ext4_sb_info {
>   struct ratelimit_state s_err_ratelimit_state;
>   struct ratelimit_state s_warning_ratelimit_state;
>   struct ratelimit_state s_msg_ratelimit_state;
> +
> + atomic_t secure_delete;   /* delete blocks securely? */
>  };
>  
>  static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index dbfe15c..f33416f 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2756,6 +2756,19 @@ static inline int ext4_issue_discard(struct 
> super_block *sb,
>   return sb_issue_discard(sb, discard_block, count, GFP_NOFS, 0);
>  }
>  
> +static inline int ext4_issue_zeroout(struct super_block *sb,
> + ext4_group_t block_group, ext4_grpblk_t cluster, int count)
> +{
> + ext4_fsblk_t discard_block;
> +
> + discard_block = (EXT4_C2B(EXT4_SB(sb), cluster) +
> +  ext4_group_first_block_no(sb, block_group));
> + count = EXT4_C2B(EXT4_SB(sb), count);
> + //trace_ext4_discard_blocks(sb,
> + //  (unsigned long long) discard_block, count);
> + return sb_issue_zeroout(sb, discard_block, count, GFP_NOFS);
> +}
> +
>  /*
>   * This function is called by the jbd2 layer once the commit has finished,
>   * so we know we can free the blocks that were released with that commit.
> @@ -2764,6 +2777,7 @@ static void ext4_free_data_callback(struct super_block 
> *sb,
>   struct ext4_journal_cb_entry *jce,
>   int rc)
>  {
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
>   struct ext4_free_data *entry = (struct ext4_free_data *)jce;
>   struct ext4_buddy e4b;
>   struct ext4_group_info *db;
> @@ -2772,6 +2786,11 @@ static void ext4_free_data_callback(struct super_block 
> *sb,
>   mb_debug(1, "gonna free %u blocks in group %u (0x%p):",
>entry->efd_count, entry->efd_group, entry);
>  
> +
> + // TODO:
> + // if (atomic_read(&sbi->secure_delete) && secure_trim_available)
> + //  use secure trim
> + // else

I am missing very important pieces like, what happens if we require
secure delete, but there is no secure trim available and we're still
on the ssd ?

What if the underlying storage is thinly provisioned ?

What if the underlying storage consist of hardware which does
support secure discard and one that does not ? The request crossing
the borders will fail.

What if the underlying hardware does support secure trim, but the
storage under the fs is in raid configuration, which brings  me to
the next question.

Discard/secure discard does have a granularity and alignment, so
what if the extent is smaller than a discard granularity, or it is
not aligned properly ? Such discard requests would be ignored.

>   if (test_opt(sb, DISCARD)) {
>   err = ext4_issue_discard(sb, entry->efd_group,
>entry->efd_start_cluster,
> @@ -2782,8 +2801,10 @@ static void ext4_free_data_callback(struct super_block 
> *sb,
>" with %d", entry->efd_group,
>entry->efd_start_cluster,
>entry->efd_count, err);
> - }
> -
> + } else if (atomic_read(&sbi->secure_delete))
> + ext4_issue_zeroout(sb, entry->efd_group,
> +  entry->efd_start_cluster,
> +  entry->efd_count);

Error handling is missing here. Also I am not sure that zeroing out
the blocks is really enough. Yes, I've seen the link you've posted,
but I am not convinced.

Did you consider metadata information for the file ? File name,
timestamps, size, data placement ? Is it something you want to
remove as well, or are you going to ignore it ? It can potentially
contain valuable information for the attacker as well. I am just
trying to understand the scope of this thing.

Moreover with inline data you might have the data in the inode
itself, which also means the it will be in the journal as well.

Also with data=journal the data will be in the journal.

With no journal this would not work at all, you have to make this
for nojournal case as well

Re: [PATCH 3/5] WIP: fs: ext4: support unlinkat_s() for secure deletion of files

2015-02-03 Thread Lukáš Czerner

On Tue, 3 Feb 2015, Alexander Holler wrote:

> Date: Tue, 03 Feb 2015 15:50:53 +0100
> From: Alexander Holler 
> To: Lukáš Czerner 
> Cc: linux-fsde...@vger.kernel.org, linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 3/5] WIP: fs: ext4: support unlinkat_s() for secure
> deletion of files
> 
> Am 03.02.2015 um 14:50 schrieb Lukáš Czerner:
> > On Mon, 2 Feb 2015, Alexander Holler wrote:
> > 
> > > Date: Mon,  2 Feb 2015 18:05:11 +0100
> > > From: Alexander Holler 
> > > To: linux-fsde...@vger.kernel.org
> > > Cc: linux-kernel@vger.kernel.org, Alexander Holler 
> > > Subject: [PATCH 3/5] WIP: fs: ext4: support unlinkat_s() for secure
> > > deletion
> > >  of files
> > 
> > Hi,
> > 
> > I am missing a description, where you'd describe what is this all
> > about, why and how.
> 
> Maybe you've missed the introduction, patch 0/5.
> 
> Sorry, but I'm not sponsored and the time I can spend is limited. Therefor,
> please, don't expect a paper in the kind european bureaucrazies are producing.

Please read Documentation/SubmittingPatches. What I am asking is a
description of this patch.

> 
> I'm already spending a lot of time trying to convince the developers here,
> that this a feature most people expect from any filesystem. And I've written
> these patches, for which now, even after I've marked them with all kind of
> "preliminary" terms, still get blamed.

So, you'd be much happier if we just ignored your patches ? I am not
sure you understand how this works. You spend time creating and
posting patches and at least two people (including me) spent time
reading and commenting on it, isn't that what you need ?

You have the attention to move this forward, so please take
advantage of this.

> 
> And, unfortunately, myths like that overwriting a block once on traditional
> magnetic platters isn't enough, don't help either.

Not sure about the myths, I've read the paper here

http://digital-forensics.sans.org/blog/2009/01/15/overwriting-hard-drive-data/

and it sounds good, but I have no idea about the methodology, the
details of it. Specifications of the HDD used. For example he
mention writing 512B, but on the newer (4k physical sector) drives at that
time it might have meant rewriting on 4k block twice just to write 1k. This
could also explain the difference between the 'old' and 'new' drives
(whatever that means).

So what I am saying is that I am not entirely convinced that what is
concluded in that paper is true.

> 
> 
> > I am missing very important pieces like, what happens if we require
> > secure delete, but there is no secure trim available and we're still
> > on the ssd ?
> 
> As written in 0/5, I don't know if trim (without secure) might be enough.

Well, then you should find out since you're the one writing the
patches. My point is that this case has to be taken care of in the
code.

> 
> > 
> > What if the underlying storage is thinly provisioned ?
> > 
> > What if the underlying storage consist of hardware which does
> > support secure discard and one that does not ? The request crossing
> > the borders will fail.
> > 
> > What if the underlying hardware does support secure trim, but the
> > storage under the fs is in raid configuration, which brings  me to
> > the next question.
> 
> That's all about how unlinkat_s will be documented. I would suggest to let
> unlinkat_s() fail if it is sure it can't delete stuff, but otherwise would
> write in the documentation that it might be useless in many cases like stacked
> filesystems, mixed raids and similiar constructs. Maybe the documentation for
> shred is something which could be used as an template.

Well, that's problematic. Documentation is not really enough, you
can't just expect every phone user to read unlinkat_s documentation.

Once you try to delete the file securely and it fails, you have a
problem. Because you might no longer have the references to the
blocks used by that file. That's why I think that error handling is
important here. You have to give a user the way out so he can use
other means of getting rid of the file.

> 
> > 
> > Discard/secure discard does have a granularity and alignment, so
> > what if the extent is smaller than a discard granularity, or it is
> > not aligned properly ? Such discard requests would be ignored.
> 
> You can throw in another dozen complications. That's just another way to say
> "never", to kill any further user expectations or requests and to hide the
> forest behind trees.

That's not what your code does.

> 
> I wonder how you

Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)

2015-02-04 Thread Lukáš Czerner

On Wed, 4 Feb 2015, Alexander Holler wrote:

> Date: Wed, 04 Feb 2015 11:19:01 +0100
> From: Alexander Holler 
> To: Al Viro 
> Cc: Theodore Ts'o , linux-fsde...@vger.kernel.org,
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
> 
> Am 04.02.2015 um 00:33 schrieb Al Viro:
> > On Tue, Feb 03, 2015 at 07:01:50PM +0100, Alexander Holler wrote:
> > 
> > > Yeah, as I've already admitted in the bug, I never should have use
> > > the word secure, because everyone nowadays seems to end up in panic
> > > when reading that word.
> > > 
> > > So, if I would be able to use sed on my mails, I would replace
> > > unlinkat_s() with unlinkat_w() (for wipe) or would say that _s does
> > > stand for 'shred' in the means of shred(1).
> > 
> > TBH, I suspect that the saner API would be something like
> > EXT2_IOC_[SG[ETFLAGS,
> > allowing to set and query that along with other flags (append-only, etc.).
> > 
> > Forget about unlink; first of all, whatever API you use should only _mark_
> > the inode as "zero freed blocks" (or trim, for that matter).  You can't
> > force freeing of an inode, so either you make sure that subsequent freeing
> > of inode, whenever it happens, will do that work, or your API is hopelessly
> > racy.  Moreover, when link has been removed it's too late to report that
> > fs has no way to e.g. trim those blocks, so you really want to have it done
> > _before_ the actual link removal.  And if the file contents is that
> > sensitive,
> > you'd better extend the same protection to all operations that free its
> > blocks, including truncate(), fallocate() hole-punching, whatever.  What's
> > more, if you divorce that from link removal, you probably don't want it as
> > in-core-only flag - have it stored in inode, if fs supports that.
> > 
> > Alternatively, you might want to represent it as xattr - as much as I hate
> > those, it might turn out to be the best fit in this case, if we end up
> > with several variants for freed blocks disposal.  Not sure...
> > 
> > But whichever way we represent that state, IMO
> > a) operation should be similar to chmod/chattr/setfattr - modifying
> > inode metadata.
> > b) it should affect _all_ operations freeing blocks of that file
> > from that point on
> > c) it should be able to fail, telling you that you can't do that for
> > this backing store.
> 
> My intention to use unlinkat() or unlinkat_s() was the following:
> 
> - It can be supported by most filesystems (see my fat patch)
> 
> - It doesn't really make any promises it can't, like deleting leftovers of an
> already modified file. That's where a much more complicated solution like the
> 's' attribute would appropriate. It just should try to wipe the current
> contents of a file.
> 
> The second reason was also the reason why I've crafted the subject of the RFC
> very carefully: "Offer a way for userspace to request real deletion of files".
> 
> I did that to avoid the nitpickers. It doesn't say how the request is or has
> to be handled. I was aware of all the problems which arise if one tries to
> fullfill what the 's' flag promises. The final result of trying to get a 100
> percent solution is just what we have now: nothing at all.
> 
> It wasn't the first time I've posted a patch to LKML, I know that maintainers
> like to request high towers from ordinary people and therefor very often nice
> dog houses were refused. There might be a legitimate reason to request a high
> tower from a big company, but that's something totally different.
> 
> And I refuse to try to understand why maintainers request high towers. ;)
> 
> And because hope never dies, I was again silly enough to post a simple patch.
> ;)

It really comes down to what your intentions and expectations are.
If you just wanted to have a discussion about this feature, you have
it and that's fine.

If your intention was to propose the actual solution, the actual
feature that would be merged to kernel, then your expectation should
have been exactly this, discussion, improvement propositions, cases
to take care of and so on.

For example the interface Al proposed is much better than the
solution in the patch. And he is right about the other block freeing
operations.

No one told you not to proceed with the implementation AFAIK. Some
are skeptical and that's fine, but you have all means to proof them
wrong.

The fact is that the current patches are useless for anything other
than proof-of-concept. Now you know more that needs to be done or
thought about, but if you're not willing to do the work, then please
stop complaining about "high towers". I am not a maintainer and I
thinks that the feedback you've got is entirely reasonable. Take it
as you will.

One more thing, can I ask you what were your expectations when
posting those patches ?

Regards,
-Lukas

> 
> Regards,
> 
> Alexander Hpller
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body

Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)

2015-02-04 Thread Lukáš Czerner

On Wed, 4 Feb 2015, Alexander Holler wrote:

> Date: Wed, 04 Feb 2015 14:21:12 +0100
> From: Alexander Holler 
> To: Michael Kerrisk 
> Cc: Lukáš Czerner , Al Viro ,
> Theodore Ts'o ,
> Linux-Fsdevel ,
> Linux Kernel ,
> Linux API 
> Subject: Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
> 
> Am 04.02.2015 um 14:06 schrieb Michael Kerrisk:
> > Alexander,
> > 
> > On Wed, Feb 4, 2015 at 1:22 PM, Alexander Holler 
> > wrote:
> > > Am 04.02.2015 um 13:07 schrieb Lukáš Czerner:
> > > 
> > > > The fact is that the current patches are useless for anything other
> > > > than proof-of-concept. Now you know more that needs to be done or
> > > 
> > > 
> > > That's wrong. The patches already work. If you delete a file which isn't
> > > in
> > > use by something else, the current contents will be wiped on traditional
> > > harddrives. I assume that already fulfills more than 50% of use cases of
> > > ordinary people.
> > 
> > You are getting various feedback from people, that you seem to be ignoring.
> 
> I'm happy for all the feedback. But it doesn't help me. I'm not going to spend
> the necessary time unpaid.

Right, you'd much rather have someone else to spend the time on your
request unpaid. That's understandable, but unreasonable. You want
it, implement it, or pay someone else to do it for you.

> .
> > Al Viro, in his curmedgeonly way, points out that the problems are
> > much deeper than you realize. He does not say so explicitly, but I
> > imagine his point is that he does not want to see the kernel cluttered
> > with "partial" solutions that will simply increase the maintenance
> > burden in the long term, and leave bugs to be fixed further down the
> > line. You seem not to be listening.
> 
> It doesn't help me nor anyone else. As Eric Sandeen made me aware through in
> bug, look at http://lwn.net/Articles/462437/ what already happened.

That's what people have been trying to tell you. It's not an easy
task and there are plenty of cases to think about. As you can see
IBM tasked their developer to do it, but they did not succeed. And
here you come with your simplistic patches crying about "high
towers. But you're the one apparently interested in this feature
and you've been warned that's it's not a simple task.

But if you really want it I really do encourage you to try. I'd be
happy to have a working and reliable secure delete feature but it's
not my priority at all.

-Lukas

> 
> > Lukáš points out to you that getting a feature like this into the
> > kernel is complex process. You seem unwilling to hear that, and still
> > just want your partial solution.
> 
> Wrong. I don't want my partial solution to be part of the official kernel. I
> don't care. I offered it for other users because I'm aware that has become
> almost impossible for normal people to get something into the kernel without
> spending an unbelievable amount of time most people can't afford to spend.
> 
> > I tell you that discussions of APIs should CC linux-api, which I am
> > now CCing into this thread, again, because, again, you're not
> > listening to feedback.
> 
> Please don't confuse "not listening" with "unable to fulfill Linux kernel
> maintainer requests".
> 
> Alexander Holler
> 
>

Re: [PATCH] ext3: Remove useless condition in if statement.

2015-03-23 Thread Lukáš Czerner

On Fri, 20 Mar 2015, Wei Yuan wrote:

> Date: Fri, 20 Mar 2015 11:09:10 +0800
> From: Wei Yuan 
> To: j...@suse.cz, a...@linux-foundation.org, adilger.ker...@dilger.ca
> Cc: linux-e...@vger.kernel.org, linux-kernel@vger.kernel.org,
> lczer...@redhat.com, lize...@huawei.com
> Subject: [PATCH] ext3: Remove useless condition in if statement.
> 
> In this if statement, the previous condition is useless, the later one has 
> covered it.

Nice that you've send this ext3 counterpart for ext4 patch. Thanks!

Reviewed-by: Lukas Czerner 

> 
> Signed-off-by: Weiyuan 
> ---
>  fs/ext3/xattr.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/ext3/xattr.c b/fs/ext3/xattr.c
> index c6874be..24215dc 100644
> --- a/fs/ext3/xattr.c
> +++ b/fs/ext3/xattr.c
> @@ -546,8 +546,7 @@ ext3_xattr_set_entry(struct ext3_xattr_info *i, struct 
> ext3_xattr_search *s)
>   free += EXT3_XATTR_LEN(name_len);
>   }
>   if (i->value) {
> - if (free < EXT3_XATTR_SIZE(i->value_len) ||
> - free < EXT3_XATTR_LEN(name_len) +
> + if (free < EXT3_XATTR_LEN(name_len) +
>  EXT3_XATTR_SIZE(i->value_len))
>   return -ENOSPC;
>   }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 0/2] ext4: Add pollable sysfs entry for block threshold events

2015-03-18 Thread Lukáš Czerner

On Wed, 18 Mar 2015, Beata Michalska wrote:

> Date: Wed, 18 Mar 2015 10:31:01 +0100
> From: Beata Michalska 
> To: Christoph Hellwig 
> Cc: lczer...@redhat.com, adilger.ker...@dilger.ca, ty...@mit.edu,
> linux-e...@vger.kernel.org, linux-kernel@vger.kernel.org,
> kyungmin.p...@samsung.com
> Subject: Re: [PATCH v2 0/2] ext4: Add pollable sysfs entry for block threshold
>  events
> 
> On 03/16/2015 02:28 PM, Christoph Hellwig wrote:
> > Between this and the recent tmpfs discussion I really think this
> > needs to be done in a generic way.  Given that the quota noticiations
> > already use netlink general space notifications seem like a very
> > sensible extension for them.
> > 
> 
> First of all, apologies for late response.
> 
> I do agree that having a generic solution is the perfect one
> though non-trivial. If I understood You correctly, You would
> like to have quota being extended to cover the case in subject;
> smth similar to xfs project quota and its pqnoenforce mount
> option, as this would require disabling enforcing both:
> soft and hard limits ?

As Jan already mentioned the problem with quota is that it really
is accounting for used space per user, which is not really ideal for
this scenario. I have not seen the tmpfs discussions so maybe
they have a solution ?

Also, implementing noenforce should should be relatively easy.

> 
> 
> We could also consider going back to already mentioned,
> in another thread, the netlink notification interface proposal,
> though this still leaves the problem on how to setup the thresholds
> (as it covers only the ENOSPC case).

As Ted mentioned in one of his replies, filesystem would have to
support it at some level. Probably setting a value within a
superblock and then check against this value.

The netlink notification interface would then serve only as a mean
to notify the user.

Thanks!
-Lukas

> 
> 
> BR
> Beata
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/1] jbd2: fix incorrect unlock on j_list_lock

2015-03-18 Thread Lukáš Czerner

On Tue, 17 Mar 2015, Taesoo Kim wrote:

> Date: Tue, 17 Mar 2015 22:08:38 -0400
> From: Taesoo Kim 
> To: ty...@mit.edu, linux-e...@vger.kernel.org, linux-kernel@vger.kernel.org
> Cc: tae...@gatech.edu, chang...@gatech.edu, sanid...@gatech.edu,
> b...@gatech.edu, cson...@gatech.edu, Taesoo Kim 
> Subject: [PATCH 1/1] jbd2: fix incorrect unlock on j_list_lock
> 
> When 'jh->b_transaction == transaction' (asserted by below)
> 
>   J_ASSERT_JH(jh, (jh->b_transaction == transaction || ...
> 
> 'journal->j_list_lock' will be incorrectly unlocked, since
> the the lock is aquired only at the end of if / else-if
> statements (missing the else case).

The patch looks good, thanks.

Reviewed-by: Lukas Czerner 

Btw, were you able to reproduce the problem, or have you seen the
problem in the wild ? Or did you just spot it in the code ?

Thanks!
-Lukas

> 
> Signed-off-by: Taesoo Kim 
> ---
>  fs/jbd2/transaction.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 5f09370..edb7f59 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1091,6 +1091,7 @@ int jbd2_journal_get_create_access(handle_t *handle, 
> struct buffer_head *bh)
>   JBUFFER_TRACE(jh, "file as BJ_Reserved");
>   spin_lock(&journal->j_list_lock);
>   __jbd2_journal_file_buffer(jh, transaction, BJ_Reserved);
> + spin_unlock(&journal->j_list_lock);
>   } else if (jh->b_transaction == journal->j_committing_transaction) {
>   /* first access by this transaction */
>   jh->b_modified = 0;
> @@ -1098,8 +1099,8 @@ int jbd2_journal_get_create_access(handle_t *handle, 
> struct buffer_head *bh)
>   JBUFFER_TRACE(jh, "set next transaction");
>   spin_lock(&journal->j_list_lock);
>   jh->b_next_transaction = transaction;
> + spin_unlock(&journal->j_list_lock);
>   }
> - spin_unlock(&journal->j_list_lock);
>   jbd_unlock_bh_state(bh);
> 
>   /*
> --
> 2.3.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)

2015-02-04 Thread Lukáš Czerner

On Wed, 4 Feb 2015, Alexander Holler wrote:

> Date: Wed, 04 Feb 2015 17:12:52 +0100
> From: Alexander Holler 
> To: Lukáš Czerner 
> Cc: Michael Kerrisk ,
> Al Viro , Theodore Ts'o ,
> Linux-Fsdevel ,
> Linux Kernel ,
> Linux API 
> Subject: Re: [PATCH 1/5] WIP: Add syscall unlinkat_s (currently x86* only)
> 
> Am 04.02.2015 um 15:52 schrieb Lukáš Czerner:
> > On Wed, 4 Feb 2015, Alexander Holler wrote:
> 
> >> I'm happy for all the feedback. But it doesn't help me. I'm not going to 
> >> spend
> >> the necessary time unpaid.
> > 
> > Right, you'd much rather have someone else to spend the time on your
> > request unpaid. That's understandable, but unreasonable. You want
> > it, implement it, or pay someone else to do it for you.
> 
> Maybe you should attach a big fat red warning to the kernels bugzilla
> that filing a bug means either to fix it yourself or pay somone to do that.
> 
> I've never demanded that someone else fixes it.
> 
> I've just explained a problem.
> 
> Unbelievable how someone could do such without paying someone else to
> fix it or by fixing it themself ...

It's not a bug, you're requesting a feature.

Re: [dm-devel] [PATCH] block: do not return -EOPNOTSUPP only when issue a discard request

2016-02-03 Thread Lukáš Czerner

On Wed, 3 Feb 2016, jiangyiwen wrote:

> Date: Wed, 3 Feb 2016 17:54:36 +0800
> From: jiangyiwen 
> To: ax...@kernel.dk, linux-kernel@vger.kernel.org
> Cc: dm-de...@redhat.com, martin.peter...@oracle.com,
>     Lukáš Czerner , xuejiu...@huawei.com,
> "Qijiang (Joseph, Euler)" 
> Subject: [dm-devel] [PATCH] block: do not return -EOPNOTSUPP only when issue a
>  discard request
> 
> commit 8af1954d172a("blkdev: Do not return -EOPNOTSUPP if discard
> is supported") only solve the situation of discard, because When
> applications issue a discard request to device, they can't expect
> deterministic behaviour. However, WRITE SAME should not ignore error
> with EOPNOTSUPP, because if applications issue WRITE SAME requests to
> device, it should return deterministic results to applications
> according to the T10 standard, or else it will cause an inconsistent
> state between upper layer and bottom layer.
> 
> So ignoring error with EOPNOTSUPP only in discard situation.
> 
> Signed-off-by: Yiwen Jiang 

Looks good.

Reviewed-by: Lukas Czerner 

Thanks!
-Lukas

> ---
>  block/blk-lib.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 9ebf653..e4f02f1 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -19,7 +19,9 @@ static void bio_batch_end_io(struct bio *bio)
>  {
>   struct bio_batch *bb = bio->bi_private;
> 
> - if (bio->bi_error && bio->bi_error != -EOPNOTSUPP)
> + /* ignore -EOPNOTSUPP only when issue a discard request */
> + if (bio->bi_error && (!(bio->bi_rw & REQ_DISCARD) ||
> + (bio->bi_error != -EOPNOTSUPP)))
>   bb->error = bio->bi_error;
>   if (atomic_dec_and_test(&bb->done))
>   complete(bb->wait);
>

Re: [RFC][PATCH] fs: Prevent syncing frozen file system

2015-07-10 Thread Lukáš Czerner

On Fri, 10 Jul 2015, Jan Kara wrote:

> Date: Fri, 10 Jul 2015 16:25:25 +0200
> From: Jan Kara 
> To: Dave Chinner 
> Cc: Lukas Czerner , v...@zeniv.linux.org.uk,
> bfie...@fieldses.org, linux-kernel@vger.kernel.org,
> linux-fsde...@vger.kernel.org
> Subject: Re: [RFC][PATCH] fs: Prevent syncing frozen file system
> 
> On Fri 10-07-15 09:40:12, Dave Chinner wrote:
> > On Thu, Jul 09, 2015 at 07:45:45PM +0200, Lukas Czerner wrote:
> > > Currently we can end up in a deadlock because of broken
> > > sb_start_write -> s_umount ordering.
> > > 
> > > The race goes like this:
> > > 
> > >  - write the file
> > >  - unlink the file - final_iput will not be calles as file is opened
> > >  - freeze the file system
> > >  - Now simultaneously close the file and call sync (or syncfs on that
> > >particular file system). Sync will get to wait_sb_inodes() where it 
> > > will
> > >grab the referece to the inode (__iget()) and later to call iput().
> > 
> > This problem goes away with the sync scalability patchset that josef
> > has been trying to get merged:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
> > superblock-scaling
> > 
> > That patchset removes the full sb inodes list walk in
> > wait_sb_inodes() and replaces it with a walk of inodes cleaned
> > during the sync, which will be an empty list in the case of sync
> > running on an empty filesystem. This commit does the work:
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/josef/btrfs-next.git/commit/?h=superblock-scaling&id=9bea30d5f4521db674203f365b1e0970588b2650
> > 
> >  > that there are now several outstanding issues that it fixes in one
> > go?>
> 
> Not sure where that got stuck - oh, maybe on Tejun's memcg writeback series
> which was clashing with it. Josef?
> 
> > >If we manage to close the file and drop the reference in between those
> > >calls sync will attempt to do a iput_final() because the inode is now
> > >unlinked and we're holding the last reference to it. This will
> > >however block on a frozen file system (ext4_delete_inode for
> > >example).
> > > 
> > > Note that I've not been able to reproduce the issue, I've only seen this
> > > happen once. However with some instrumentation (like msleep() in the
> > > wait_sb_inodes() it can be achieved.
> > > 
> > > Fix this by properly doing sb_start_write/sb_end_write to prevent us
> > > from fsfreeze.
> > > 
> > > Note that with this patch syncfs will block on the frozen file system
> > > which is probably ok, but sync will block if any file system happens to
> > > be frozen - not sure if that's a problem, but it's certainly different
> > > from what we've been used to.
> > 
> > sync should not block on frozen fileystems. By definition, a frozen
> > filesystem is a clean filesystem, and so sync should really just be
> > skipping over them.
> 
> Just for record I agree with Dave. Sync on frozen fs should just return.
> And freeze protection in iterate_supers() looks just wrong.

Sure, that's why it's rfc. Anyway with the change Dave mentioned the
deadlock should not be possible anymore. However anywhere where we
take s_umount before sb_start_write we could deadlock, so it might
be worth adding a warning into sb_start_write() maybe ?

-Lukas
> 
>   Honza
> 
> > > +++ b/fs/super.c
> > > @@ -514,10 +514,17 @@ void iterate_supers(void (*f)(struct super_block *, 
> > > void *), void *arg)
> > >   sb->s_count++;
> > >   spin_unlock(&sb_lock);
> > >  
> > > + /*
> > > +  * Whatever we're going to do to the file system we have to
> > > +  * make sure that we'll not end up blocking on frozen file
> > > +  * system.
> > > +  */
> > > + sb_start_write(sb);
> > >   down_read(&sb->s_umount);
> > >   if (sb->s_root && (sb->s_flags & MS_BORN))
> > >   f(sb, arg);
> > >   up_read(&sb->s_umount);
> > > + sb_end_write(sb);
> > >  
> > >   spin_lock(&sb_lock);
> > >   if (p)
> > 
> > That deadlocks sysrq-j (emergency thaw)...
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > da...@fromorbit.com
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

87 matches

Mail list logo