Re: [Ocfs2-devel] [PATCH V2] ocfs2: Take inode cluster lock before moving reflinked inode from orphan dir

2018-05-01 Thread Ashish Samant


On 05/01/2018 05:08 PM, Andrew Morton wrote:
> On Wed, 11 Apr 2018 12:31:47 -0700 Ashish Samant <ashish.sam...@oracle.com> 
> wrote:
>
>> While reflinking an inode, we create a new inode in orphan directory, then
>> take EX lock on it, reflink the original inode to orphan inode and release
>> EX lock. Once the lock is released another node could request it in EX mode
>> from ocfs2_recover_orphans() which causes downconvert of the lock, on this
>> node, to NL mode.
>>
>> Later we attempt to initialize security acl for the orphan inode and move
>> it to the reflink destination. However, while doing this we dont take EX
>> lock on the inode. This could potentially cause problems because we could
>> be starting transaction, accessing journal and modifying metadata of the
>> inode while holding NL lock and with another node holding EX lock on the
>> inode.
> Someone (eg, me) needs to decide which kernel version(s) need the
> patch.  And all I have is "this could potentially cause problems".
>
> Can you please be more specific about the end-user impact?  *Does* it
> cause problems?  Is the problem sufficiently urgent to warrant a merge
> into 4.17?  Into -stable kernels?

Hi Andrew,


What the code is doing currently is definitely wrong and should be fixed.
This could theoretically cause a dlm race, leaading to kernel panic, 
although the possibility of that
happening in the real world is very less. Hence the "potentially" in the 
commit message.
So, I'd say it should go into 4.17 but backporting to stable is probably 
not required.

Thanks,
Ashish
>
> Please always include all the end-user-impact info when fixing bugs.
>
> Thanks.
>
>> Fix this by taking orphan inode cluster lock in EX mode before
>> initializing security and moving orphan inode to reflink destination.
>> Use the __tracker variant while taking inode lock to avoid recursive
>> locking in the ocfs2_init_security_and_acl() call chain.


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH V2] ocfs2: Take inode cluster lock before moving reflinked inode from orphan dir

2018-04-11 Thread Ashish Samant
While reflinking an inode, we create a new inode in orphan directory, then
take EX lock on it, reflink the original inode to orphan inode and release
EX lock. Once the lock is released another node could request it in EX mode
from ocfs2_recover_orphans() which causes downconvert of the lock, on this
node, to NL mode.

Later we attempt to initialize security acl for the orphan inode and move
it to the reflink destination. However, while doing this we dont take EX
lock on the inode. This could potentially cause problems because we could
be starting transaction, accessing journal and modifying metadata of the
inode while holding NL lock and with another node holding EX lock on the
inode.

Fix this by taking orphan inode cluster lock in EX mode before
initializing security and moving orphan inode to reflink destination.
Use the __tracker variant while taking inode lock to avoid recursive
locking in the ocfs2_init_security_and_acl() call chain.

Signed-off-by: Ashish Samant <ashish.sam...@oracle.com>

V1->V2:
Modify commit message to better reflect the problem in upstream kernel.
---
 fs/ocfs2/refcounttree.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index ab156e3..1b1283f 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4250,10 +4250,11 @@ static int __ocfs2_reflink(struct dentry *old_dentry,
 static int ocfs2_reflink(struct dentry *old_dentry, struct inode *dir,
 struct dentry *new_dentry, bool preserve)
 {
-   int error;
+   int error, had_lock;
struct inode *inode = d_inode(old_dentry);
struct buffer_head *old_bh = NULL;
struct inode *new_orphan_inode = NULL;
+   struct ocfs2_lock_holder oh;
 
if (!ocfs2_refcount_tree(OCFS2_SB(inode->i_sb)))
return -EOPNOTSUPP;
@@ -4295,6 +4296,14 @@ static int ocfs2_reflink(struct dentry *old_dentry, 
struct inode *dir,
goto out;
}
 
+   had_lock = ocfs2_inode_lock_tracker(new_orphan_inode, NULL, 1,
+   );
+   if (had_lock < 0) {
+   error = had_lock;
+   mlog_errno(error);
+   goto out;
+   }
+
/* If the security isn't preserved, we need to re-initialize them. */
if (!preserve) {
error = ocfs2_init_security_and_acl(dir, new_orphan_inode,
@@ -4302,14 +4311,15 @@ static int ocfs2_reflink(struct dentry *old_dentry, 
struct inode *dir,
if (error)
mlog_errno(error);
}
-out:
if (!error) {
error = ocfs2_mv_orphaned_inode_to_new(dir, new_orphan_inode,
   new_dentry);
if (error)
mlog_errno(error);
}
+   ocfs2_inode_unlock_tracker(new_orphan_inode, 1, , had_lock);
 
+out:
if (new_orphan_inode) {
/*
 * We need to open_unlock the inode no matter whether we
-- 
1.9.1


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: Take inode cluster lock before moving reflinked inode from orphan dir

2018-04-10 Thread Ashish Samant


On 04/02/2018 02:17 AM, Joseph Qi wrote:
>
> On 18/3/31 01:42, Ashish Samant wrote:
>> While reflinking an inode, we create a new inode in orphan directory, then
>> take EX lock on it, reflink the original inode to orphan inode and release
>> EX lock. Once the lock is released another node might request it in PR mode
>> which causes downconvert of the lock to PR mode.
>>
>> Later we attempt to initialize security acl for the orphan inode and move
>> it to the reflink destination. However, while doing this we dont take EX
>> lock on the inode. So effectively, we are doing this and accessing the
>> journal for this inode while holding PR lock. While accessing the journal,
>> we make
>>
>> ci->ci_last_trans = journal->j_trans_id
>>
>> At this point, if there is another downconvert request on this inode from
>> another node (PR->NL), we will trip on the following condition in
>> ocfs2_ci_checkpointed()
>>
>> BUG_ON(lockres->l_level != DLM_LOCK_EX && !checkpointed);
>>
>> because we hold the lock in PR mode and journal->j_trans_id is not greater
>> than ci_last_trans for the inode.
>>
>> Fix this by taking orphan inode cluster lock in EX mode before
>> initializing security and moving orphan inode to reflink destination.
>> Use the __tracker variant while taking inode lock to avoid recursive
>> locking in the ocfs2_init_security_and_acl() call chain.
>>
>> Signed-off-by: Ashish Samant <ashish.sam...@oracle.com>
> Looks good.
> Reviewed-by: Joseph Qi <jiangqi...@gmail.com>
Hi Joseph,

Looks like mainline kernel has removed the inode lock call
in PR mode for orphan inode in ocfs2_recover_orphans()
and it directly takes the lock in EX mode. So, the patch commit log does 
not
reflect the exact problem for mainline kernel. However,
it is still possible that we might end up accessing the journal for the 
inode
while holding inode lock in NL mode (instead of PR). It might not cause
the same problem as described above, but it seems wrong to
modify inode metadata with NL lock held and could potentially cause similar
problems later. So, i'll submit v2 with modified commit message.

Thanks,
Ashish




___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] ocfs2: Take inode cluster lock before moving reflinked inode from orphan dir

2018-03-30 Thread Ashish Samant
While reflinking an inode, we create a new inode in orphan directory, then
take EX lock on it, reflink the original inode to orphan inode and release
EX lock. Once the lock is released another node might request it in PR mode
which causes downconvert of the lock to PR mode.

Later we attempt to initialize security acl for the orphan inode and move
it to the reflink destination. However, while doing this we dont take EX
lock on the inode. So effectively, we are doing this and accessing the
journal for this inode while holding PR lock. While accessing the journal,
we make

ci->ci_last_trans = journal->j_trans_id

At this point, if there is another downconvert request on this inode from
another node (PR->NL), we will trip on the following condition in
ocfs2_ci_checkpointed()

BUG_ON(lockres->l_level != DLM_LOCK_EX && !checkpointed);

because we hold the lock in PR mode and journal->j_trans_id is not greater
than ci_last_trans for the inode.

Fix this by taking orphan inode cluster lock in EX mode before
initializing security and moving orphan inode to reflink destination.
Use the __tracker variant while taking inode lock to avoid recursive
locking in the ocfs2_init_security_and_acl() call chain.

Signed-off-by: Ashish Samant <ashish.sam...@oracle.com>
---
 fs/ocfs2/refcounttree.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index ab156e3..1b1283f 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4250,10 +4250,11 @@ static int __ocfs2_reflink(struct dentry *old_dentry,
 static int ocfs2_reflink(struct dentry *old_dentry, struct inode *dir,
 struct dentry *new_dentry, bool preserve)
 {
-   int error;
+   int error, had_lock;
struct inode *inode = d_inode(old_dentry);
struct buffer_head *old_bh = NULL;
struct inode *new_orphan_inode = NULL;
+   struct ocfs2_lock_holder oh;
 
if (!ocfs2_refcount_tree(OCFS2_SB(inode->i_sb)))
return -EOPNOTSUPP;
@@ -4295,6 +4296,14 @@ static int ocfs2_reflink(struct dentry *old_dentry, 
struct inode *dir,
goto out;
}
 
+   had_lock = ocfs2_inode_lock_tracker(new_orphan_inode, NULL, 1,
+   );
+   if (had_lock < 0) {
+   error = had_lock;
+   mlog_errno(error);
+   goto out;
+   }
+
/* If the security isn't preserved, we need to re-initialize them. */
if (!preserve) {
error = ocfs2_init_security_and_acl(dir, new_orphan_inode,
@@ -4302,14 +4311,15 @@ static int ocfs2_reflink(struct dentry *old_dentry, 
struct inode *dir,
if (error)
mlog_errno(error);
}
-out:
if (!error) {
error = ocfs2_mv_orphaned_inode_to_new(dir, new_orphan_inode,
   new_dentry);
if (error)
mlog_errno(error);
}
+   ocfs2_inode_unlock_tracker(new_orphan_inode, 1, , had_lock);
 
+out:
if (new_orphan_inode) {
/*
 * We need to open_unlock the inode no matter whether we
-- 
1.9.1


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] fstrim corrupts ocfs2 filesystems(become ready-only) on SSD device which is managed by multipath

2017-10-27 Thread Ashish Samant
Hi Gang,

The following patch sent to the list should fix the issue.

https://patchwork.kernel.org/patch/10002583/

Thanks,
Ashish


On 10/27/2017 02:47 AM, Gang He wrote:
> Hello Guys,
>
> I got a bug from the customer, he said, fstrim command corrupted ocfs2 file 
> system on their SSD SAN, the file system became read-only and SSD LUN was 
> configured by multipath.
> After umount the file system, the customer ran fsck.ocfs2 on this file 
> system, then the file system can be mounted until the next fstrim happens.
> The error messages were likes,
> 2017-10-02T00:00:00.334141+02:00 rz-xen10 systemd[1]: Starting Discard unused 
> blocks...
> 2017-10-02T00:00:00.383805+02:00 rz-xen10 fstrim[36615]: fstrim: /xensan1: 
> FITRIM ioctl fehlgeschlagen: Das Dateisystem ist nur lesbar
> 2017-10-02T00:00:00.385233+02:00 rz-xen10 kernel: [1092967.091821] OCFS2: 
> ERROR (device dm-5): ocfs2_validate_gd_self: Group descriptor #8257536 has 
> bad signature  <<== here
> 2017-10-02T00:00:00.385251+02:00 rz-xen10 kernel: [1092967.091831] On-disk 
> corruption discovered. Please run fsck.ocfs2 once the filesystem is unmounted.
> 2017-10-02T00:00:00.385254+02:00 rz-xen10 kernel: [1092967.091836] 
> (fstrim,36615,5):ocfs2_trim_fs:7422 ERROR: status = -30
> 2017-10-02T00:00:00.385854+02:00 rz-xen10 systemd[1]: fstrim.service: Main 
> process exited, code=exited, status=32/n/a
> 2017-10-02T00:00:00.386756+02:00 rz-xen10 systemd[1]: Failed to start Discard 
> unused blocks.
> 2017-10-02T00:00:00.387236+02:00 rz-xen10 systemd[1]: fstrim.service: Unit 
> entered failed state.
> 2017-10-02T00:00:00.387601+02:00 rz-xen10 systemd[1]: fstrim.service: Failed 
> with result 'exit-code'.
>
> The similar bug looks like 
> https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.launchpad.net_ubuntu_-2Bsource_util-2Dlinux_-2Bbug_1681410=DwIFAg=RoP1YumCXCgaWHvlZYR8PQcxBKCX5YTpkKY057SbK10=f4ohdmGrYxZejY77yzx3eNgTHb1ZAfZytktjHqNVzc8=Jdo98IlzJDxBqiDEhsKfqxvEt4B6WpIbZ_woY7zmLFw=xp0bUwpDVIHZP9g4EboYYG_1gkenzWEt_O_5KZXyFg8=
>  .
> Then, I tried to reproduce this bug in local.
> Since I have not a SSD SAN, I found a PC server which has a SSD disk.
> I setup a two nodes ocfs2 cluster in VM on this PC server, attach this SSD 
> disk to each VM instance twice, then I can configure this SSD disk with 
> multipath tool,
> the configuration on each node likes,
> sle12sp3-nd1:/ # multipath -l
> INTEL_SSDSA2M040G2GC_CVGB0490002C040NGN dm-0 ATA,INTEL SSDSA2M040
> size=37G features='1 retain_attached_hw_handler' hwhandler='0' wp=rw
> |-+- policy='service-time 0' prio=0 status=active
> | `- 0:0:0:0 sda 8:0  active undef unknown
> `-+- policy='service-time 0' prio=0 status=enabled
>`- 0:0:0:1 sdb 8:16 active undef unknown
>
> Next, I do some fstrim command from each node simultaneously,
> I also do dd command to write data to the shared SSD disk during fstrim 
> commands.
> But, I can not reproduce this issue, all the things go well.
>
> Then, I'd like to ping the list, did who ever encounter this bug?  If yes, 
> please help to provide some information.
> I think there are three factors which are related to this bug, SSD device 
> type, multipath configuration and simultaneously fstrim.
>
> Thanks a lot.
> Gang
>
>
>
>
>
> ___
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] ocfs2: fstrim: Fix start offset of first cluster group during fstrim

2017-10-12 Thread Ashish Samant
The first cluster group descriptor is not stored at the start of the group
but at an offset from the start. We need to take this into account while
doing fstrim on the first cluster group. Otherwise we will wrongly start
fstrim a few blocks after the desired start block and the range can cross
over into the next cluster group and zero out the group descriptor there.
This can cause filesytem corruption that cannot be fixed by fsck.

Signed-off-by: Ashish Samant <ashish.sam...@oracle.com>
Cc: sta...@vger.kernel.org
---
 fs/ocfs2/alloc.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index a177eae..addd7c5 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -7304,13 +7304,24 @@ int ocfs2_truncate_inline(struct inode *inode, struct 
buffer_head *di_bh,
 
 static int ocfs2_trim_extent(struct super_block *sb,
 struct ocfs2_group_desc *gd,
-u32 start, u32 count)
+u64 group, u32 start, u32 count)
 {
u64 discard, bcount;
+   struct ocfs2_super *osb = OCFS2_SB(sb);
 
bcount = ocfs2_clusters_to_blocks(sb, count);
-   discard = le64_to_cpu(gd->bg_blkno) +
-   ocfs2_clusters_to_blocks(sb, start);
+   discard = ocfs2_clusters_to_blocks(sb, start);
+
+   /*
+* For the first cluster group, the gd->bg_blkno is not at the start
+* of the group, but at an offset from the start. If we add it while
+* calculating discard for first group, we will wrongly start fstrim a
+* few blocks after the desried start block and the range can cross
+* over into the next cluster group. So, add it only if this is not
+* the first cluster group.
+*/
+   if (group != osb->first_cluster_group_blkno)
+   discard += le64_to_cpu(gd->bg_blkno);
 
trace_ocfs2_trim_extent(sb, (unsigned long long)discard, bcount);
 
@@ -7318,7 +7329,7 @@ static int ocfs2_trim_extent(struct super_block *sb,
 }
 
 static int ocfs2_trim_group(struct super_block *sb,
-   struct ocfs2_group_desc *gd,
+   struct ocfs2_group_desc *gd, u64 group,
u32 start, u32 max, u32 minbits)
 {
int ret = 0, count = 0, next;
@@ -7337,7 +7348,7 @@ static int ocfs2_trim_group(struct super_block *sb,
next = ocfs2_find_next_bit(bitmap, max, start);
 
if ((next - start) >= minbits) {
-   ret = ocfs2_trim_extent(sb, gd,
+   ret = ocfs2_trim_extent(sb, gd, group,
start, next - start);
if (ret < 0) {
mlog_errno(ret);
@@ -7435,7 +7446,8 @@ int ocfs2_trim_fs(struct super_block *sb, struct 
fstrim_range *range)
}
 
gd = (struct ocfs2_group_desc *)gd_bh->b_data;
-   cnt = ocfs2_trim_group(sb, gd, first_bit, last_bit, minlen);
+   cnt = ocfs2_trim_group(sb, gd, group,
+  first_bit, last_bit, minlen);
brelse(gd_bh);
gd_bh = NULL;
if (cnt < 0) {
-- 
1.9.1


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] ocfs2: Fix double put of recount tree in ocfs2_lock_refcount_tree()

2016-09-15 Thread Ashish Samant
In ocfs2_lock_refcount_tree, if ocfs2_read_refcount_block() returns error,
we do ocfs2_refcount_tree_put twice (once in ocfs2_unlock_refcount_tree
and once outside it), thereby reducing the refcount of the refcount tree
twice, but we dont delete the tree in this case. This will make refcnt
of the tree = 0 and the ocfs2_refcount_tree_put will eventually call
ocfs2_mark_lockres_freeing, setting OCFS2_LOCK_FREEING for the
refcount_tree->rf_lockres.

The error returned by ocfs2_read_refcount_block is propagated all the way
back and for next iteration of write, ocfs2_lock_refcount_tree gets the
same tree back from ocfs2_get_refcount_tree because we havent deleted the
tree. Now we have the same tree, but OCFS2_LOCK_FREEING is set for
rf_lockres and eventually, when _ocfs2_lock_refcount_tree is called in
this iteration, BUG_ON( __ocfs2_cluster_lock:1395 ERROR: Cluster lock
called on freeing lockres T0386019775b08d! flags 0x81) is
triggerred.

Call stack:

(loop16,11155,0):ocfs2_lock_refcount_tree:482 ERROR: status = -5
(loop16,11155,0):ocfs2_refcount_cow_hunk:3497 ERROR: status = -5
(loop16,11155,0):ocfs2_refcount_cow:3560 ERROR: status = -5
(loop16,11155,0):ocfs2_prepare_inode_for_refcount:2111 ERROR: status = -5
(loop16,11155,0):ocfs2_prepare_inode_for_write:2190 ERROR: status = -5
(loop16,11155,0):ocfs2_file_write_iter:2331 ERROR: status = -5
(loop16,11155,0):__ocfs2_cluster_lock:1395 ERROR: bug expression:
lockres->l_flags & OCFS2_LOCK_FREEING

(loop16,11155,0):__ocfs2_cluster_lock:1395 ERROR: Cluster lock called on
freeing lockres T0386019775b08d! flags 0x81

[ cut here ]
kernel BUG at fs/ocfs2/dlmglue.c:1395!

invalid opcode:  [#1] SMP  CPU 0
Modules linked in: tun ocfs2 jbd2 xen_blkback xen_netback xen_gntdev ..
sd_mod crc_t10dif ext3 jbd mbcache

RIP: e030:[]  []
__ocfs2_cluster_lock+0x31c/0x740 [ocfs2]
RSP: e02b:88017c0138a0  EFLAGS: 00010086
Process loop16 (pid: 11155, threadinfo 88017c01, task
8801b5374300)
Stack:
 88017bd25880 0081 00017c013920 88017c013960
 001d 0001 88017bd258b4 
 880172006000 a07fa410 88017bd202b4 
Call Trace:
 [] ocfs2_refcount_lock+0xae/0x130 [ocfs2]
 [] ? __ocfs2_lock_refcount_tree+0x29/0xe0 [ocfs2]
 [] ? _raw_spin_lock+0xe/0x20
 [] __ocfs2_lock_refcount_tree+0x29/0xe0 [ocfs2]
 [] ocfs2_lock_refcount_tree+0xdd/0x320 [ocfs2]
 [] ocfs2_refcount_cow_hunk+0x1cb/0x440 [ocfs2]
 [] ocfs2_refcount_cow+0xa9/0x1d0 [ocfs2]
 [] ? ocfs2_prepare_inode_for_refcount+0x67/0x200 [ocfs2]
 [] ocfs2_prepare_inode_for_refcount+0x115/0x200 [ocfs2]
 [] ? ocfs2_inode_unlock+0xd4/0x140 [ocfs2]
 [] ocfs2_prepare_inode_for_write+0x33b/0x470 [ocfs2]
 [] ? ocfs2_rw_lock+0x80/0x190 [ocfs2]
 [] ocfs2_file_write_iter+0x220/0x8c0 [ocfs2]
 [] ? mempool_free_slab+0x17/0x20
 [] ? bio_free+0x61/0x70
 [] ? aio_kernel_free+0xe/0x10
 [] aio_write_iter+0x2e/0x30

Fix this by avoiding the second call to ocfs2_refcount_tree_put()

Signed-off-by: Ashish Samant <ashish.sam...@oracle.com>
---
 fs/ocfs2/refcounttree.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 92bbe93..a9d4102 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -478,7 +478,6 @@ again:
if (ret) {
mlog_errno(ret);
ocfs2_unlock_refcount_tree(osb, tree, rw);
-   ocfs2_refcount_tree_put(tree);
goto out;
}
 
-- 
1.9.1


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: Fix start offset to ocfs2_zero_range_for_truncate()

2016-09-14 Thread Ashish Samant


On 09/14/2016 03:43 PM, Andrew Morton wrote:
> On Thu, 11 Aug 2016 16:12:27 -0700 Ashish Samant <ashish.sam...@oracle.com> 
> wrote:
>
>> If we do fallocate with punch hole option on a reflink, with start offset
>> on a cluster boundary and end offset somewhere in another cluster, we
>> dont COW the first cluster starting at the start offset. But in this
>> case, we were wrongly passing this cluster to
>> ocfs2_zero_range_for_truncate() to zero out.
>>
>> Fix this by skipping this cluster in such a scenario.
> How serious is this bug?  It sounds like a data-corrupting error?  As
> such, this is a high priority fix and it should be backported into the
> -stable kernels?
>
>
> Please always include such info when fixing bugs.
Yes, it is quite serious, I should have cc'ed stable. Will do it going 
forward.

Thanks,
Ashish


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH v2] ocfs2: Fix start offset to ocfs2_zero_range_for_truncate()

2016-08-30 Thread Ashish Samant
Hi Eric,

I am able to reproduce this on 4.8.0-rc3 as well. Can you try again and 
issue a sync between fallocate and dd?

On 08/30/2016 12:38 AM, Eric Ren wrote:
> Hi,
>
> I'm on 4.8.0-rc3 kernel. Hope someone else can double-confirm this;-)
>
> On 08/30/2016 12:11 PM, Ashish Samant wrote:
>> Hmm, thats weird. I see this on 4.7 kernel without the patch:
>>
>> # xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile
>> wrote 10485760/10485760 bytes at offset 0
>> 10 MiB, 2560 ops; 0. sec (683.995 MiB/sec and 175102.5992 ops/sec)
>> # reflink -f 10MBfile reflnktest
>> # fallocate -p -o 0 -l 1048615 reflnktest
>> # dd if=10MBfile iflag=direct bs=1048576 count=1 | hexdump -C
>>   00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
>> ||
>> *
>> 1+0 records in
>> 1+0 records out
>> 1048576 bytes (1.0 MB) copied, 0.0321517 s, 32.6 MB/s
>> 0010
>>
>> and with patch
>> 
>> # dd if=10MBfile iflag=direct bs=1M count=1 | hexdump -C
>>   cd cd cd cd cd cd cd cd  cd cd cd cd cd cd cd cd 
>> ||
>
> I'm not familiar with this code.  So why is the output "cd ..."? 
> because we didn't write anything
> into "10MBfile". Is it a magic number when reading from a hole?
No, "cd" is what xfs_io wrote into the file. Those are the original 
contents of the file which are overwritten by 0 in the first cluster 
because of this bug.

Thanks,
Ashish
>
> Eric
>
>> *
>> 1+0 records in
>> 1+0 records out
>> 0010
>
>
>
>>
>> Thanks,
>> Ashish
>>
>>
>> On 08/29/2016 08:33 PM, Eric Ren wrote:
>>> Hello,
>>>
>>> On 08/30/2016 03:23 AM, Ashish Samant wrote:
>>>> Hi Eric,
>>>>
>>>> The easiest way to reproduce this is :
>>>>
>>>> 1. Create a random file of say 10 MB
>>>> xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile
>>>> 2. Reflink  it
>>>> reflink -f 10MBfile reflnktest
>>>> 3. Punch a hole at starting at cluster boundary  with range greater 
>>>> that 1MB. You can also use a range that will put the end offset in 
>>>> another extent.
>>>> fallocate -p -o 0 -l 1048615 reflnktest
>>>> 4. sync
>>>> 5. Check the  first cluster in the source file. (It will be zeroed 
>>>> out).
>>>>dd if=10MBfile iflag=direct bs= count=1 | hexdump -C
>>>
>>> Thanks! I have a try myself, but I'm not sure what is our expected 
>>> output and if the test result meet
>>> it:
>>>
>>> 1. After applying this patch:
>>> ocfs2dev1:/mnt/ocfs2 # rm 10MBfile reflnktest
>>> ocfs2dev1:/mnt/ocfs2 # xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile
>>> wrote 10485760/10485760 bytes at offset 0
>>> 10 MiB, 2560 ops; 0. sec (1.089 GiB/sec and 285427.5839 ops/sec)
>>> ocfs2dev1:/mnt/ocfs2 # reflink -f 10MBfile reflnktest
>>> ocfs2dev1:/mnt/ocfs2 # fallocate -p -o 0 -l 1048615 reflnktest
>>> ocfs2dev1:/mnt/ocfs2 # dd if=10MBfile iflag=direct bs=1048576 
>>> count=1 | hexdump -C
>>>   cd cd cd cd cd cd cd cd  cd cd cd cd cd cd cd cd 
>>> ||
>>> *
>>> 1+0 records in
>>> 1+0 records out
>>> 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0952464 s, 11.0 MB/s
>>> 0010
>>>
>>> 2. Before this patch:
>>> 
>>> ocfs2dev1:/mnt/ocfs2 # dd if=10MBfile iflag=direct bs=1048576 
>>> count=1 | hexdump -C
>>>   cd cd cd cd cd cd cd cd  cd cd cd cd cd cd cd cd 
>>> ||
>>> *
>>> 1+0 records in
>>> 1+0 records out
>>> 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0954648 s, 11.0 MB/s
>>> 0010
>>>
>>> 3. debugfs.ocfs2 -R stats /dev/sdb
>>> ...
>>> Block Size Bits: 12   Cluster Size Bits: 20
>>> ...
>>>
>>> Eric
>>>>
>>>> Thanks,
>>>> Ashish
>>>>
>>>> On 08/28/2016 10:39 PM, Eric Ren wrote:
>>>>> Hi,
>>>>>
>>>>> Thanks for this fix. I'd like to reproduce this issue locally and 
>>>>> test this patch,
>>>>> could you elaborate the detailed steps of reproduction?
>>>>>
>>>>> Thanks,
>>>>> Eric
>>>>>
>>>>> On 08/27/2016 07:04 AM, Ashish Samant wrote:
>>>>>> If we punch a hole on a reflink such that following conditions 
>>>>>&g

Re: [Ocfs2-devel] [PATCH v2] ocfs2: Fix start offset to ocfs2_zero_range_for_truncate()

2016-08-29 Thread Ashish Samant
Hmm, thats weird. I see this on 4.7 kernel without the patch:

# xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile
wrote 10485760/10485760 bytes at offset 0
10 MiB, 2560 ops; 0. sec (683.995 MiB/sec and 175102.5992 ops/sec)
# reflink -f 10MBfile reflnktest
# fallocate -p -o 0 -l 1048615 reflnktest
# dd if=10MBfile iflag=direct bs=1048576 count=1 | hexdump -C
  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
||
*
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0321517 s, 32.6 MB/s
0010

and with patch

# dd if=10MBfile iflag=direct bs=1M count=1 | hexdump -C
  cd cd cd cd cd cd cd cd  cd cd cd cd cd cd cd cd 
||
*
1+0 records in
1+0 records out
0010

Thanks,
Ashish


On 08/29/2016 08:33 PM, Eric Ren wrote:
> Hello,
>
> On 08/30/2016 03:23 AM, Ashish Samant wrote:
>> Hi Eric,
>>
>> The easiest way to reproduce this is :
>>
>> 1. Create a random file of say 10 MB
>> xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile
>> 2. Reflink  it
>> reflink -f 10MBfile reflnktest
>> 3. Punch a hole at starting at cluster boundary  with range greater 
>> that 1MB. You can also use a range that will put the end offset in 
>> another extent.
>> fallocate -p -o 0 -l 1048615 reflnktest
>> 4. sync
>> 5. Check the  first cluster in the source file. (It will be zeroed out).
>>dd if=10MBfile iflag=direct bs= count=1 | hexdump -C
>
> Thanks! I have a try myself, but I'm not sure what is our expected 
> output and if the test result meet
> it:
>
> 1. After applying this patch:
> ocfs2dev1:/mnt/ocfs2 # rm 10MBfile reflnktest
> ocfs2dev1:/mnt/ocfs2 # xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile
> wrote 10485760/10485760 bytes at offset 0
> 10 MiB, 2560 ops; 0. sec (1.089 GiB/sec and 285427.5839 ops/sec)
> ocfs2dev1:/mnt/ocfs2 # reflink -f 10MBfile reflnktest
> ocfs2dev1:/mnt/ocfs2 # fallocate -p -o 0 -l 1048615 reflnktest
> ocfs2dev1:/mnt/ocfs2 # dd if=10MBfile iflag=direct bs=1048576 count=1 
> | hexdump -C
>   cd cd cd cd cd cd cd cd  cd cd cd cd cd cd cd cd 
> ||
> *
> 1+0 records in
> 1+0 records out
> 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0952464 s, 11.0 MB/s
> 0010
>
> 2. Before this patch:
> 
> ocfs2dev1:/mnt/ocfs2 # dd if=10MBfile iflag=direct bs=1048576 count=1 
> | hexdump -C
>   cd cd cd cd cd cd cd cd  cd cd cd cd cd cd cd cd 
> ||
> *
> 1+0 records in
> 1+0 records out
> 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0954648 s, 11.0 MB/s
> 0010
>
> 3. debugfs.ocfs2 -R stats /dev/sdb
> ...
> Block Size Bits: 12   Cluster Size Bits: 20
> ...
>
> Eric
>>
>> Thanks,
>> Ashish
>>
>> On 08/28/2016 10:39 PM, Eric Ren wrote:
>>> Hi,
>>>
>>> Thanks for this fix. I'd like to reproduce this issue locally and 
>>> test this patch,
>>> could you elaborate the detailed steps of reproduction?
>>>
>>> Thanks,
>>> Eric
>>>
>>> On 08/27/2016 07:04 AM, Ashish Samant wrote:
>>>> If we punch a hole on a reflink such that following conditions are 
>>>> met:
>>>>
>>>> 1. start offset is on a cluster boundary
>>>> 2. end offset is not on a cluster boundary
>>>> 3. (end offset is somewhere in another extent) or
>>>> (hole range > MAX_CONTIG_BYTES(1MB)),
>>>>
>>>> we dont COW the first cluster starting at the start offset. But in 
>>>> this
>>>> case, we were wrongly passing this cluster to
>>>> ocfs2_zero_range_for_truncate() to zero out. This will modify the 
>>>> cluster
>>>> in place and zero it in the source too.
>>>>
>>>> Fix this by skipping this cluster in such a scenario.
>>>>
>>>> Reported-by: Saar Maoz <saar.m...@oracle.com>
>>>> Signed-off-by: Ashish Samant <ashish.sam...@oracle.com>
>>>> Reviewed-by: Srinivas Eeda <srinivas.e...@oracle.com>
>>>> ---
>>>> v1->v2:
>>>> -Changed the commit msg to include a better and generic description of
>>>>   the problem, for all cluster sizes.
>>>> -Added Reported-by and Reviewed-by tags.
>>>>  fs/ocfs2/file.c | 34 --
>>>>   1 file changed, 24 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
>>>> index 4e7b0dc..0b055bf 100644
>>>> --- a/fs/ocfs2/file.c
>>>> +++ b/fs/ocfs2/file.c
>>>> @@ -1506,7 +1506,8 @@ static int ocfs2_zero_partial_

[Ocfs2-devel] [PATCH] ocfs2: Fix start offset to ocfs2_zero_range_for_truncate()

2016-08-11 Thread Ashish Samant
If we do fallocate with punch hole option on a reflink, with start offset
on a cluster boundary and end offset somewhere in another cluster, we
dont COW the first cluster starting at the start offset. But in this
case, we were wrongly passing this cluster to
ocfs2_zero_range_for_truncate() to zero out.

Fix this by skipping this cluster in such a scenario.

Signed-off-by: Ashish Samant <ashish.sam...@oracle.com>
---
 fs/ocfs2/file.c | 34 --
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 4a6e130..ab305aa 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1522,7 +1522,8 @@ static int ocfs2_zero_partial_clusters(struct inode 
*inode,
   u64 start, u64 len)
 {
int ret = 0;
-   u64 tmpend, end = start + len;
+   u64 tmpend = 0;
+   u64 end = start + len;
struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
unsigned int csize = osb->s_clustersize;
handle_t *handle;
@@ -1554,18 +1555,31 @@ static int ocfs2_zero_partial_clusters(struct inode 
*inode,
}
 
/*
-* We want to get the byte offset of the end of the 1st cluster.
+* If start is on a cluster boundary and end is somewhere in another
+* cluster, we have not COWed the cluster starting at start, unless
+* end is also within the same cluster. So, in this case, we skip this
+* first call to ocfs2_zero_range_for_truncate() truncate and move on
+* to the next one.
 */
-   tmpend = (u64)osb->s_clustersize + (start & ~(osb->s_clustersize - 1));
-   if (tmpend > end)
-   tmpend = end;
+   if ((start & (csize - 1)) != 0) {
+   /*
+* We want to get the byte offset of the end of the 1st
+* cluster.
+*/
+   tmpend = (u64)osb->s_clustersize +
+   (start & ~(osb->s_clustersize - 1));
+   if (tmpend > end)
+   tmpend = end;
 
-   trace_ocfs2_zero_partial_clusters_range1((unsigned long long)start,
-(unsigned long long)tmpend);
+   trace_ocfs2_zero_partial_clusters_range1(
+   (unsigned long long)start,
+   (unsigned long long)tmpend);
 
-   ret = ocfs2_zero_range_for_truncate(inode, handle, start, tmpend);
-   if (ret)
-   mlog_errno(ret);
+   ret = ocfs2_zero_range_for_truncate(inode, handle, start,
+   tmpend);
+   if (ret)
+   mlog_errno(ret);
+   }
 
if (tmpend < end) {
/*
-- 
1.9.1


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel