Re: [Ocfs2-devel] OCFS2 cluster debian8 / debian9

2017-11-28 Thread BASSAGET Cédric
Hello,
I guess I did something wrong the first time.
I retried three times, and it worked three times. So I guess ocfs 1.6 and
1.8 are compatibles :)

Not it's time to set up a second ocfs2 cluster on my debian 9 server (ocfs
1.8), and I get this error when trying to mkfs.ocfs2 :

root@LAB-virtm6:~# mkfs.ocfs2 /dev/mapper/data_san_2
mkfs.ocfs2 1.8.4
On disk cluster (o2cb,ocfs2new,0) does not match the active cluster
(o2cb,ocfs2,0).
mkfs.ocfs2 will not be able to determine if this operation can be done
safely.
To skip this check, use --force or -F


The running cluster on this host is :
root@LAB-virtm6:~# o2cluster -r
o2cb,ocfs2,local

I'm trying to add an "ocfs2new" cluster :
root@LAB-virtm6:~# o2cb add-cluster ocfs2new
root@LAB-virtm6:~# o2cb add-node --ip 192.168.0.12 --port  --number 1
ovfs2new LAB-virtm6
root@LAB-virtm6:~# o2cb add-node --ip 192.168.0.13 --port  --number 2
ovfs2new LAB-virtm7

root@LAB-virtm6:~# o2cb list-clusters
ocfs2
ocfs2new

root@LAB-virtm6:~# o2cb cluster-status
Cluster 'ocfs2' is online

Even if I restart services or reboot, cluster 'ocfs2new' never goes online.

What am I doing wrong ?

2017-11-24 3:30 GMT+01:00 Eric Ren :

> Hi,
>
> On 11/23/2017 09:42 PM, BASSAGET Cédric wrote:
>
> Tried to mkfs.ocfs2 on the debian 9 side (ocfs 1.8) :
>
> root@LAB-virtm6:~# o2cluster -o /dev/mapper/data_san_2
> o2cb,ocfs2new,local
>
>
> Sorry, my fault. For mount failure, first thing is to check if your o2cb
> stack
> is running via o2cb init service `rco2cb status` IIRC. Then, it's probably
> a compatible issue: see the "FEATURE FLAGS" section in `man ocfs2`
> for more instructions.
>
>
> root@LAB-virtm6:~# mount /dev/mapper/data_san_2 /mnt/vol1_iscsi_san2
> *mount.ocfs2: Cluster name is invalid while trying to join the group*
>
>
> Quote from `man ocfs2`:
> """
> DETECTING FEATURE INCOMPATIBILITY
>
>   Say one tries to mount a volume with an incompatible
> feature. What happens then? How does
>   one detect the problem? How does one know the name of that
> incompatible feature?
>
>   To begin with, one should look for error messages in
> dmesg(8). Mount  failures  that  are
>   due to an incompatible feature will always result in an
> error message like the following:
>
>   ERROR: couldn't mount because of unsupported optional
> features (200).
>
>   Here  the  file  system is unable to mount the volume due to
> an unsupported optional fea-
>   ture. That means that that feature is an Incompat feature.
> By  referring  to  the  table
>   above,  one can then deduce that the user failed to mount a
> volume with the xattr feature
>   enabled. (The value in the error message is in hexadecimal.)
> """
>
> Please show your dmesg.
>
> Eric
>
>
> root@LAB-virtm6:~# cat /etc/ocfs2/cluster.conf
> node:
> ip_port = 
> ip_address = 192.168.0.11
> number = 1
> name = LAB-virtm5
> cluster = ocfs2new
> node:
> ip_port = 
> ip_address = 192.168.0.12
> number = 2
> name = LAB-virtm6
> cluster = ocfs2new
> node:
> ip_port = 
> ip_address = 192.168.0.13
> number = 3
> name = LAB-virtm7
> cluster = ocfs2new
> cluster:
> node_count = 5
> name = ocfs2new
>
>
>
>
> 2017-11-23 14:02 GMT+01:00 BASSAGET Cédric :
>
>> Hi Eric
>> on debian 9 (ocfs2 v1.8) :
>>
>> # o2cluster -o /dev/mapper/data_san_2
>> default
>>
>> on debian 8 (ocfs2 v1.6), I don't have the "o2cluster" tool
>>
>> :(
>>
>>
>>
>> 2017-11-23 13:41 GMT+01:00 Eric Ren :
>>
>>> Hi,
>>>
>>> On 11/23/2017 06:23 PM, BASSAGET Cédric wrote:
>>>
>>> hello,
>>> I'm trying to set-up an OCFS2 cluster between hosts running debian8 and
>>> debian9
>>>
>>> 2*debian 8 : ocfs2-tools 1.6.4-3
>>> 1*debian 9 : ocfs2-tools 1.8.4-4
>>>
>>> I created the FS on debian 8 node :
>>>  mkfs.ocfs2 -L "ocfs2_new" -N 5 /dev/mapper/data_san_2
>>>
>>> then mounted it without problem
>>> mount /dev/mapper/data_san_2 /mnt/vol1_iscsi_san2/
>>>
>>> I mounted it on second debian 8 host too, without problem.
>>>
>>> Trying to mount in on debian9 returns :
>>> mount.ocfs2: Cluster name is invalid while trying to join the group
>>>
>>> I saw in "man mkfs.ocfs2" that debian9 version has --cluster-stack and
>>> --cluster-name options.
>>>
>>> Is this option mandatory on ocfs2 1.8 ? That would say that ocfs2 1.6
>>> and 1.8 are not compatible ? Nothing is said about 1.8 on
>>> https://oss.oracle.com/projects/ocfs2/'re
>>> 
>>>
>>>
>>> Not sure if they're compatible. So can you try again with
>>> --cluster-stack and --cluster-name?
>>>
>>> # o2cluster -o /dev/sda1
>>> pcmk,cluster,none
>>>
>>> pcmk is the cluster-stack, cluster is the name.
>>>
>>> Usually, these two option is optional, the tools will detect the right
>>> cluster stack automatically.
>>>
>>> Eric
>>>
>>
>>
>
>
_

Re: [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!

2017-11-28 Thread alex chen
Hi John,

On 2017/11/28 22:34, John Lightsey wrote:
> On Fri, 2017-11-24 at 13:46 +0800, alex chen wrote:
>> We need to check the free number of the records in each loop to mark
>> extent written,
>> because the last extent block may be changed through many times
>> marking extent written
>> and the num_free_extents also be changed. In the worst case, the
>> num_free_extents may
>> become less than at the beginning of the loop. So we should not
>> estimate the free
>> number of the records at the beginning of the loop to mark extent
>> written.
>>
>> I'd appreciate it if you could test the following patch and feedback
>> the result.
> 
> I managed to reproduce the bug in a test environment using the
> following method. Some of the specific details here are definitely
> irrelevant:
> 
> - Setup a 20GB iscsi lun going to a spinning disk drive.
> 
> - Configure the OCFS cluster with three KVM VMs.
> 
> - Connect the iscsi lun to all three VMs.
> 
> - Format an OCFS2 partition on the iscsi lun with block size 1k and
> cluster size 4k.
> 
> - Mount the OCFS2 partition on one VM.
> 
> - Write out a 1GB file with a random pattern of 4k chunks. 4/5 of the
> 4k chunks are filled with nulls. 1/5 are filled with data.
> 
> - Run fallocate -d  to make sure the file is sparse.
> 
> - Copy the test file so that the next step can be run repeatedly with
> copies.
> 
> - Use directio to rewrite the copy of the file in 64k chunks of null
> bytes.
> 
> 
> In my test setup, the assertion failure happens on the next loop
> iteration after the number of free extents drops from 59 to 0. The call
> to ocfs2_split_extent() in ocfs2_change_extent_flag() is what actually
> reduces the number of free extents to 0. The count drops all at once in
> this case, not by 1 or 2 per loop iteration.
> 
> With your patch applied, it does handle this sudden reduction in the
> number of free extents, and it's able to entirely overwrite the 1GB
> file without any problems.

Thanks for your test.

> 
> Is it safe to bring up a few nodes in our production OCFS2 cluster with
> the patched 4.9 kernel while the remainder nodes are running a 3.16
> kernel?
>

IMO, it is best to ensure the kernel version of nodes in the cluster is 
consistent.

> The downtime required to switch our cluster forward to a 4.9 kernel and
> then back to a 3.16 kernel is hard to justify, but I can definitely
> test one or two nodes in our production environment if it will be a
> realistic test.
> 
I think this patch is only tested in one node because we lock the inode_lock
when we make the extent written.

Thanks,
Alex



___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!

2017-11-28 Thread John Lightsey
On Fri, 2017-11-24 at 13:46 +0800, alex chen wrote:
> We need to check the free number of the records in each loop to mark
> extent written,
> because the last extent block may be changed through many times
> marking extent written
> and the num_free_extents also be changed. In the worst case, the
> num_free_extents may
> become less than at the beginning of the loop. So we should not
> estimate the free
> number of the records at the beginning of the loop to mark extent
> written.
> 
> I'd appreciate it if you could test the following patch and feedback
> the result.

I managed to reproduce the bug in a test environment using the
following method. Some of the specific details here are definitely
irrelevant:

- Setup a 20GB iscsi lun going to a spinning disk drive.

- Configure the OCFS cluster with three KVM VMs.

- Connect the iscsi lun to all three VMs.

- Format an OCFS2 partition on the iscsi lun with block size 1k and
cluster size 4k.

- Mount the OCFS2 partition on one VM.

- Write out a 1GB file with a random pattern of 4k chunks. 4/5 of the
4k chunks are filled with nulls. 1/5 are filled with data.

- Run fallocate -d  to make sure the file is sparse.

- Copy the test file so that the next step can be run repeatedly with
copies.

- Use directio to rewrite the copy of the file in 64k chunks of null
bytes.


In my test setup, the assertion failure happens on the next loop
iteration after the number of free extents drops from 59 to 0. The call
to ocfs2_split_extent() in ocfs2_change_extent_flag() is what actually
reduces the number of free extents to 0. The count drops all at once in
this case, not by 1 or 2 per loop iteration.

With your patch applied, it does handle this sudden reduction in the
number of free extents, and it's able to entirely overwrite the 1GB
file without any problems.

Is it safe to bring up a few nodes in our production OCFS2 cluster with
the patched 4.9 kernel while the remainder nodes are running a 3.16
kernel?

The downtime required to switch our cluster forward to a 4.9 kernel and
then back to a 3.16 kernel is hard to justify, but I can definitely
test one or two nodes in our production environment if it will be a
realistic test.

signature.asc
Description: This is a digitally signed message part
___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [PATCH 2/3] ocfs2: add ocfs2_overwrite_io function

2017-11-28 Thread alex chen
Hi Gang,

On 2017/11/28 16:32, Gang He wrote:
> Hi Alex,
> 
> 

>> Hi Gang,
>>
>> On 2017/11/28 15:38, Gang He wrote:
>>> Hi Alex,
>>>
>>>
>>
 Hi Gang,

 On 2017/11/28 13:33, Gang He wrote:
> Hello Alex,
>
>

>> Hi Gang,
>>
>> On 2017/11/27 17:46, Gang He wrote:
>>> Add ocfs2_overwrite_io function, which is used to judge if
>>> overwrite allocated blocks, otherwise, the write will bring extra
>>> block allocation overhead.
>>>
>>> Signed-off-by: Gang He 
>>> ---
>>>  fs/ocfs2/extent_map.c | 67 
>> +++
>>>  fs/ocfs2/extent_map.h |  3 +++
>>>  2 files changed, 70 insertions(+)
>>>
>>> diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c
>>> index e4719e0..98bf325 100644
>>> --- a/fs/ocfs2/extent_map.c
>>> +++ b/fs/ocfs2/extent_map.c
>>> @@ -832,6 +832,73 @@ int ocfs2_fiemap(struct inode *inode, struct 
>> fiemap_extent_info *fieinfo,
>>> return ret;
>>>  }
>>>  
>>> +/* Is IO overwriting allocated blocks? */
>>> +int ocfs2_overwrite_io(struct inode *inode, u64 map_start, u64 map_len,
>>> +  int wait)
>>> +{
>>> +   int ret = 0, is_last;
>>> +   u32 mapping_end, cpos;
>>> +   struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>> +   struct buffer_head *di_bh = NULL;
>>> +   struct ocfs2_extent_rec rec;
>>> +
>>> +   if (wait)
>>> +   ret = ocfs2_inode_lock(inode, &di_bh, 0);
>>> +   else
>>> +   ret = ocfs2_try_inode_lock(inode, &di_bh, 0);
>>> +   if (ret)
>>> +   goto out;
>>> +
>>> +   if (wait)
>>> +   down_read(&OCFS2_I(inode)->ip_alloc_sem);
>>> +   else {
>>> +   if (!down_read_trylock(&OCFS2_I(inode)->ip_alloc_sem)) {
>>> +   ret = -EAGAIN;
>>> +   goto out_unlock1;
>>> +   }
>>> +   }
>>> +
>>> +   if ((OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) &&
>>> +  ((map_start + map_len) <= i_size_read(inode)))
>>> +   goto out_unlock2;
>>> +
>>> +   cpos = map_start >> osb->s_clustersize_bits;
>>> +   mapping_end = ocfs2_clusters_for_bytes(inode->i_sb,
>>> +  map_start + map_len);
>>> +   is_last = 0;
>>> +   while (cpos < mapping_end && !is_last) {
>>> +   ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos,
>>> +NULL, &rec, &is_last);
>>> +   if (ret) {
>>> +   mlog_errno(ret);
>>> +   goto out_unlock2;
>>> +   }
>>> +
>>> +   if (rec.e_blkno == 0ULL)
>>> +   break;
>> I think here the blocks is not overwrite, because the hold is found and 
>> the 
>> blocks
>> should be allocated.
> If the rec.e_blkno == NULL, this means there is a hole.
> The file hole means that these blocks are not allocated, it does not like 
 unwritten block.
> The unwritten blocks means that these blocks are allocated, but still 
> have 
 not been unwritten. 
>
 If we break the loop when we find the hold, out of this function we will 
 allocate the blocks in
 ocfs2_file_write_iter()->..->ocfs2_direct_IO->__blockdev_direct_IO->..->ocfs2_dio_wr_g
 et_block()
 ->ocfs2_write_begin_nolock. Does this violate the semantics of 
 'IOCB_NOWAIT';
>>> Yes, then we need to check if this is a overwrite before doing direct-io.
>>>
>>
>> I mean here we should return 0 instead of break and we should immediately 
>> return -EAGAIN
>> to upper apps, otherwise, some block allocation will be happen, which 
>> violates the
>> semantics of 'IOCB_NOWAIT'.
> Before we do a direct-io, I need to check if this is a overwrite allocated 
> blocks IO.
> If not, we will return  -EAGAIN in 'IOCB_NOWAIT' mode. this should not 
> trigger any block allocation.
> I am not sure if we understand your concern totally.
> 

Yes, your description is correct.
So we should return 0 instead of break when we find the hold in 
ocfs2_overwrite_io();

> Thanks
> Gang 
> 
>>
>> Thanks,
>> Alex
>>

 BTW, should we consider the down_write() and ocfs2_inode_lock() in 
 ocfs2_dio_wr_get_block() when
 the flag 'IOCB_NOWAIT' is set;
>>> I think that we should not consider that layer lock, otherwise, the code 
>> change will become more and more complex and big.
>>> I also refer to ext4 file system code change for this 
>> feature(728fbc0e10b7f3ce2ee043b32e3453fd5201c055), they did not do any 
>> change 
>> in that layer.
>>>
>>
>> OK.
>>
>>> Thanks
>>> Gang
>>>

>>> +
>>> +   if (rec.e_flags & OCFS2_E

Re: [Ocfs2-devel] [PATCH 2/3] ocfs2: add ocfs2_overwrite_io function

2017-11-28 Thread Joseph Qi


On 17/11/28 16:54, Gang He wrote:
> Hi Joseph,
> 
> 

> 
>>
>> On 17/11/28 15:24, Gang He wrote:
>>> Hello Joseph,
>>>
>>>
>>
>>>

 On 17/11/28 11:35, Gang He wrote:
> Hello Joseph,
>
>

>> Hi Gang,
>>
>> On 17/11/27 17:46, Gang He wrote:
>>> Add ocfs2_overwrite_io function, which is used to judge if
>>> overwrite allocated blocks, otherwise, the write will bring extra
>>> block allocation overhead.
>>>
>>> Signed-off-by: Gang He 
>>> ---
>>>  fs/ocfs2/extent_map.c | 67 
>> +++
>>>  fs/ocfs2/extent_map.h |  3 +++
>>>  2 files changed, 70 insertions(+)
>>>
>>> diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c
>>> index e4719e0..98bf325 100644
>>> --- a/fs/ocfs2/extent_map.c
>>> +++ b/fs/ocfs2/extent_map.c
>>> @@ -832,6 +832,73 @@ int ocfs2_fiemap(struct inode *inode, struct 
>> fiemap_extent_info *fieinfo,
>>> return ret;
>>>  }
>>>  
>>> +/* Is IO overwriting allocated blocks? */
>>> +int ocfs2_overwrite_io(struct inode *inode, u64 map_start, u64 map_len,
>>> +  int wait)
>>> +{
>>> +   int ret = 0, is_last;
>>> +   u32 mapping_end, cpos;
>>> +   struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>> +   struct buffer_head *di_bh = NULL;
>>> +   struct ocfs2_extent_rec rec;
>>> +
>>> +   if (wait)
>>> +   ret = ocfs2_inode_lock(inode, &di_bh, 0);
>>> +   else
>>> +   ret = ocfs2_try_inode_lock(inode, &di_bh, 0);
>>> +   if (ret)
>>> +   goto out;
>>> +
>>> +   if (wait)
>>> +   down_read(&OCFS2_I(inode)->ip_alloc_sem);
>>> +   else {
>>> +   if (!down_read_trylock(&OCFS2_I(inode)->ip_alloc_sem)) {
>>> +   ret = -EAGAIN;
>>> +   goto out_unlock1;
>>> +   }
>>> +   }
>>> +
>>> +   if ((OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) &&
>>> +  ((map_start + map_len) <= i_size_read(inode)))
>>> +   goto out_unlock2;
>>> +
>>> +   cpos = map_start >> osb->s_clustersize_bits;
>>> +   mapping_end = ocfs2_clusters_for_bytes(inode->i_sb,
>>> +  map_start + map_len);
>>> +   is_last = 0;
>>> +   while (cpos < mapping_end && !is_last) {
>>> +   ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos,
>>> +NULL, &rec, &is_last);
>>> +   if (ret) {
>>> +   mlog_errno(ret);
>>> +   goto out_unlock2;
>>> +   }
>>> +
>>> +   if (rec.e_blkno == 0ULL)
>>> +   break;
>>> +
>>> +   if (rec.e_flags & OCFS2_EXT_REFCOUNTED)
>>> +   break;
>>> +
>>> +   cpos = le32_to_cpu(rec.e_cpos) +
>>> +   le16_to_cpu(rec.e_leaf_clusters);
>>> +   }
>>> +
>>> +   if (cpos < mapping_end)
>>> +   ret = 1;
>>> +
>>> +out_unlock2:
>>> +   brelse(di_bh);
>>> +
>>> +   up_read(&OCFS2_I(inode)->ip_alloc_sem);
>>> +
>>> +out_unlock1:
>> Should brelse(di_bh) be here?
> If the code jumps to out_unlock1 directly, the di_bh pointer should be 
> NULL, 
>>
 it is not necessary to release.
>
 Umm... No, once going out here, we have already taken inode lock. So
 di_bh should be released.
>>> Sorry, you are right.
>>>

>>
>>> +   ocfs2_inode_unlock(inode, 0);
>>> +
>>> +out:
>>> +   return (ret ? 0 : 1);
>> I don't think EAGAIN and other error code can be handled the same. We
>> have to distinguish them.
> Ok, I think we can add one line log to report the error in case the error 
> is 
>>
 not EAGAIN. 
>
 My point is, there is no need to try again in several cases, e.g. EROFS
 returned by ocfs2_get_clusters_nocache.
>>> In this function ocfs2_overwrite_io() only can return True(1) or False(0), 
>> then I think we can only give a error print before return true/false.
>>> It is not necessary to return another value, but should let the user know 
>> any possible error message.
>>> This is because you just ignore the error and convert it to 0 or 1.
>> But in your next patch, if !ocfs2_overwrite_io(), it will return EGAIN
>> to upper layer and let it try again.
>> But in some cases, e.g. EROFS, trying again is meaningless. That's why
>> we can't simply return 0 or 1 here. Also we have to distinguish the
>> error code in the next patch.
> I think that we have to use the return value if we want to propagate the 
> errorno to the above.
> I will 

Re: [Ocfs2-devel] [PATCH 2/3] ocfs2: add ocfs2_overwrite_io function

2017-11-28 Thread Gang He
Hi Joseph,


>>> 

> 
> On 17/11/28 15:24, Gang He wrote:
>> Hello Joseph,
>> 
>> 
>
>> 
>>>
>>> On 17/11/28 11:35, Gang He wrote:
 Hello Joseph,


>>>
> Hi Gang,
>
> On 17/11/27 17:46, Gang He wrote:
>> Add ocfs2_overwrite_io function, which is used to judge if
>> overwrite allocated blocks, otherwise, the write will bring extra
>> block allocation overhead.
>>
>> Signed-off-by: Gang He 
>> ---
>>  fs/ocfs2/extent_map.c | 67 
> +++
>>  fs/ocfs2/extent_map.h |  3 +++
>>  2 files changed, 70 insertions(+)
>>
>> diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c
>> index e4719e0..98bf325 100644
>> --- a/fs/ocfs2/extent_map.c
>> +++ b/fs/ocfs2/extent_map.c
>> @@ -832,6 +832,73 @@ int ocfs2_fiemap(struct inode *inode, struct 
> fiemap_extent_info *fieinfo,
>>  return ret;
>>  }
>>  
>> +/* Is IO overwriting allocated blocks? */
>> +int ocfs2_overwrite_io(struct inode *inode, u64 map_start, u64 map_len,
>> +   int wait)
>> +{
>> +int ret = 0, is_last;
>> +u32 mapping_end, cpos;
>> +struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>> +struct buffer_head *di_bh = NULL;
>> +struct ocfs2_extent_rec rec;
>> +
>> +if (wait)
>> +ret = ocfs2_inode_lock(inode, &di_bh, 0);
>> +else
>> +ret = ocfs2_try_inode_lock(inode, &di_bh, 0);
>> +if (ret)
>> +goto out;
>> +
>> +if (wait)
>> +down_read(&OCFS2_I(inode)->ip_alloc_sem);
>> +else {
>> +if (!down_read_trylock(&OCFS2_I(inode)->ip_alloc_sem)) {
>> +ret = -EAGAIN;
>> +goto out_unlock1;
>> +}
>> +}
>> +
>> +if ((OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) &&
>> +   ((map_start + map_len) <= i_size_read(inode)))
>> +goto out_unlock2;
>> +
>> +cpos = map_start >> osb->s_clustersize_bits;
>> +mapping_end = ocfs2_clusters_for_bytes(inode->i_sb,
>> +   map_start + map_len);
>> +is_last = 0;
>> +while (cpos < mapping_end && !is_last) {
>> +ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos,
>> + NULL, &rec, &is_last);
>> +if (ret) {
>> +mlog_errno(ret);
>> +goto out_unlock2;
>> +}
>> +
>> +if (rec.e_blkno == 0ULL)
>> +break;
>> +
>> +if (rec.e_flags & OCFS2_EXT_REFCOUNTED)
>> +break;
>> +
>> +cpos = le32_to_cpu(rec.e_cpos) +
>> +le16_to_cpu(rec.e_leaf_clusters);
>> +}
>> +
>> +if (cpos < mapping_end)
>> +ret = 1;
>> +
>> +out_unlock2:
>> +brelse(di_bh);
>> +
>> +up_read(&OCFS2_I(inode)->ip_alloc_sem);
>> +
>> +out_unlock1:
> Should brelse(di_bh) be here?
 If the code jumps to out_unlock1 directly, the di_bh pointer should be 
 NULL, 
> 
>>> it is not necessary to release.

>>> Umm... No, once going out here, we have already taken inode lock. So
>>> di_bh should be released.
>> Sorry, you are right.
>> 
>>>
>
>> +ocfs2_inode_unlock(inode, 0);
>> +
>> +out:
>> +return (ret ? 0 : 1);
> I don't think EAGAIN and other error code can be handled the same. We
> have to distinguish them.
 Ok, I think we can add one line log to report the error in case the error 
 is 
> 
>>> not EAGAIN. 

>>> My point is, there is no need to try again in several cases, e.g. EROFS
>>> returned by ocfs2_get_clusters_nocache.
>> In this function ocfs2_overwrite_io() only can return True(1) or False(0), 
> then I think we can only give a error print before return true/false.
>> It is not necessary to return another value, but should let the user know 
> any possible error message.
>>This is because you just ignore the error and convert it to 0 or 1.
> But in your next patch, if !ocfs2_overwrite_io(), it will return EGAIN
> to upper layer and let it try again.
> But in some cases, e.g. EROFS, trying again is meaningless. That's why
> we can't simply return 0 or 1 here. Also we have to distinguish the
> error code in the next patch.
I think that we have to use the return value if we want to propagate the 
errorno to the above.
I will change the return value meanings of ocfs2_overwrite_io() function.
return 0 means this is a overwrite allocated block IO.
return

Re: [Ocfs2-devel] [PATCH 2/3] ocfs2: add ocfs2_overwrite_io function

2017-11-28 Thread Joseph Qi


On 17/11/28 15:24, Gang He wrote:
> Hello Joseph,
> 
> 

> 
>>
>> On 17/11/28 11:35, Gang He wrote:
>>> Hello Joseph,
>>>
>>>
>>
 Hi Gang,

 On 17/11/27 17:46, Gang He wrote:
> Add ocfs2_overwrite_io function, which is used to judge if
> overwrite allocated blocks, otherwise, the write will bring extra
> block allocation overhead.
>
> Signed-off-by: Gang He 
> ---
>  fs/ocfs2/extent_map.c | 67 
 +++
>  fs/ocfs2/extent_map.h |  3 +++
>  2 files changed, 70 insertions(+)
>
> diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c
> index e4719e0..98bf325 100644
> --- a/fs/ocfs2/extent_map.c
> +++ b/fs/ocfs2/extent_map.c
> @@ -832,6 +832,73 @@ int ocfs2_fiemap(struct inode *inode, struct 
 fiemap_extent_info *fieinfo,
>   return ret;
>  }
>  
> +/* Is IO overwriting allocated blocks? */
> +int ocfs2_overwrite_io(struct inode *inode, u64 map_start, u64 map_len,
> +int wait)
> +{
> + int ret = 0, is_last;
> + u32 mapping_end, cpos;
> + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> + struct buffer_head *di_bh = NULL;
> + struct ocfs2_extent_rec rec;
> +
> + if (wait)
> + ret = ocfs2_inode_lock(inode, &di_bh, 0);
> + else
> + ret = ocfs2_try_inode_lock(inode, &di_bh, 0);
> + if (ret)
> + goto out;
> +
> + if (wait)
> + down_read(&OCFS2_I(inode)->ip_alloc_sem);
> + else {
> + if (!down_read_trylock(&OCFS2_I(inode)->ip_alloc_sem)) {
> + ret = -EAGAIN;
> + goto out_unlock1;
> + }
> + }
> +
> + if ((OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) &&
> +((map_start + map_len) <= i_size_read(inode)))
> + goto out_unlock2;
> +
> + cpos = map_start >> osb->s_clustersize_bits;
> + mapping_end = ocfs2_clusters_for_bytes(inode->i_sb,
> +map_start + map_len);
> + is_last = 0;
> + while (cpos < mapping_end && !is_last) {
> + ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos,
> +  NULL, &rec, &is_last);
> + if (ret) {
> + mlog_errno(ret);
> + goto out_unlock2;
> + }
> +
> + if (rec.e_blkno == 0ULL)
> + break;
> +
> + if (rec.e_flags & OCFS2_EXT_REFCOUNTED)
> + break;
> +
> + cpos = le32_to_cpu(rec.e_cpos) +
> + le16_to_cpu(rec.e_leaf_clusters);
> + }
> +
> + if (cpos < mapping_end)
> + ret = 1;
> +
> +out_unlock2:
> + brelse(di_bh);
> +
> + up_read(&OCFS2_I(inode)->ip_alloc_sem);
> +
> +out_unlock1:
 Should brelse(di_bh) be here?
>>> If the code jumps to out_unlock1 directly, the di_bh pointer should be 
>>> NULL, 
>> it is not necessary to release.
>>>
>> Umm... No, once going out here, we have already taken inode lock. So
>> di_bh should be released.
> Sorry, you are right.
> 
>>

> + ocfs2_inode_unlock(inode, 0);
> +
> +out:
> + return (ret ? 0 : 1);
 I don't think EAGAIN and other error code can be handled the same. We
 have to distinguish them.
>>> Ok, I think we can add one line log to report the error in case the error 
>>> is 
>> not EAGAIN. 
>>>
>> My point is, there is no need to try again in several cases, e.g. EROFS
>> returned by ocfs2_get_clusters_nocache.
> In this function ocfs2_overwrite_io() only can return True(1) or False(0), 
> then I think we can only give a error print before return true/false.
> It is not necessary to return another value, but should let the user know any 
> possible error message.
>This is because you just ignore the error and convert it to 0 or 1.
But in your next patch, if !ocfs2_overwrite_io(), it will return EGAIN
to upper layer and let it try again.
But in some cases, e.g. EROFS, trying again is meaningless. That's why
we can't simply return 0 or 1 here. Also we have to distinguish the
error code in the next patch.

> Thanks
> Gang 
> 
>>

 Thanks,
 Joseph

> +}
> +
>  int ocfs2_seek_data_hole_offset(struct file *file, loff_t *offset, int 
 whence)
>  {
>   struct inode *inode = file->f_mapping->host;
> diff --git a/fs/ocfs2/extent_map.h b/fs/ocfs2/extent_map.h
> index 67ea57d..fd9e86a 100644
> --- a/fs/ocfs2/extent_map.h
> +++ b/fs/ocfs2/extent_map.h
> @@ -53,6 +53,9 @@ int ocfs2_extent_map_get_blocks(struct inode *inode, 
> u64 
 v_blkno, u64 *p_blkno,
>  int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>u64 map_start, u64 map_len);
>  
> +int ocfs2_overwrite_io(struct inode *inode, u64 map_start, u64 map_l

Re: [Ocfs2-devel] [PATCH 2/3] ocfs2: add ocfs2_overwrite_io function

2017-11-28 Thread Gang He
Hi Alex,


>>> 
> Hi Gang,
> 
> On 2017/11/28 15:38, Gang He wrote:
>> Hi Alex,
>> 
>> 
>
>>> Hi Gang,
>>>
>>> On 2017/11/28 13:33, Gang He wrote:
 Hello Alex,


>>>
> Hi Gang,
>
> On 2017/11/27 17:46, Gang He wrote:
>> Add ocfs2_overwrite_io function, which is used to judge if
>> overwrite allocated blocks, otherwise, the write will bring extra
>> block allocation overhead.
>>
>> Signed-off-by: Gang He 
>> ---
>>  fs/ocfs2/extent_map.c | 67 
> +++
>>  fs/ocfs2/extent_map.h |  3 +++
>>  2 files changed, 70 insertions(+)
>>
>> diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c
>> index e4719e0..98bf325 100644
>> --- a/fs/ocfs2/extent_map.c
>> +++ b/fs/ocfs2/extent_map.c
>> @@ -832,6 +832,73 @@ int ocfs2_fiemap(struct inode *inode, struct 
> fiemap_extent_info *fieinfo,
>>  return ret;
>>  }
>>  
>> +/* Is IO overwriting allocated blocks? */
>> +int ocfs2_overwrite_io(struct inode *inode, u64 map_start, u64 map_len,
>> +   int wait)
>> +{
>> +int ret = 0, is_last;
>> +u32 mapping_end, cpos;
>> +struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>> +struct buffer_head *di_bh = NULL;
>> +struct ocfs2_extent_rec rec;
>> +
>> +if (wait)
>> +ret = ocfs2_inode_lock(inode, &di_bh, 0);
>> +else
>> +ret = ocfs2_try_inode_lock(inode, &di_bh, 0);
>> +if (ret)
>> +goto out;
>> +
>> +if (wait)
>> +down_read(&OCFS2_I(inode)->ip_alloc_sem);
>> +else {
>> +if (!down_read_trylock(&OCFS2_I(inode)->ip_alloc_sem)) {
>> +ret = -EAGAIN;
>> +goto out_unlock1;
>> +}
>> +}
>> +
>> +if ((OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) &&
>> +   ((map_start + map_len) <= i_size_read(inode)))
>> +goto out_unlock2;
>> +
>> +cpos = map_start >> osb->s_clustersize_bits;
>> +mapping_end = ocfs2_clusters_for_bytes(inode->i_sb,
>> +   map_start + map_len);
>> +is_last = 0;
>> +while (cpos < mapping_end && !is_last) {
>> +ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos,
>> + NULL, &rec, &is_last);
>> +if (ret) {
>> +mlog_errno(ret);
>> +goto out_unlock2;
>> +}
>> +
>> +if (rec.e_blkno == 0ULL)
>> +break;
> I think here the blocks is not overwrite, because the hold is found and 
> the 
> blocks
> should be allocated.
 If the rec.e_blkno == NULL, this means there is a hole.
 The file hole means that these blocks are not allocated, it does not like 
>>> unwritten block.
 The unwritten blocks means that these blocks are allocated, but still have 
>>> not been unwritten. 

>>> If we break the loop when we find the hold, out of this function we will 
>>> allocate the blocks in
>>> ocfs2_file_write_iter()->..->ocfs2_direct_IO->__blockdev_direct_IO->..->ocfs2_dio_wr_g
>>> et_block()
>>> ->ocfs2_write_begin_nolock. Does this violate the semantics of 
>>> 'IOCB_NOWAIT';
>> Yes, then we need to check if this is a overwrite before doing direct-io.
>>
> 
> I mean here we should return 0 instead of break and we should immediately 
> return -EAGAIN
> to upper apps, otherwise, some block allocation will be happen, which 
> violates the
> semantics of 'IOCB_NOWAIT'.
Before we do a direct-io, I need to check if this is a overwrite allocated 
blocks IO.
If not, we will return  -EAGAIN in 'IOCB_NOWAIT' mode. this should not trigger 
any block allocation.
I am not sure if we understand your concern totally.

Thanks
Gang 

> 
> Thanks,
> Alex
> 
>>>
>>> BTW, should we consider the down_write() and ocfs2_inode_lock() in 
>>> ocfs2_dio_wr_get_block() when
>>> the flag 'IOCB_NOWAIT' is set;
>> I think that we should not consider that layer lock, otherwise, the code 
> change will become more and more complex and big.
>> I also refer to ext4 file system code change for this 
> feature(728fbc0e10b7f3ce2ee043b32e3453fd5201c055), they did not do any change 
> in that layer.
>> 
> 
> OK.
> 
>> Thanks
>> Gang
>> 
>>>
>> +
>> +if (rec.e_flags & OCFS2_EXT_REFCOUNTED)
>> +break;
>> +
>> +cpos = le32_to_cpu(rec.e_cpos) +
>> +le16_to_cpu(rec.e_leaf_clusters);
>> +}
>> +
>> +if (cpos < mapping_end)
>> +ret = 1

Re: [Ocfs2-devel] [PATCH 2/3] ocfs2: add ocfs2_overwrite_io function

2017-11-28 Thread alex chen
Hi Gang,

On 2017/11/28 15:38, Gang He wrote:
> Hi Alex,
> 
> 

>> Hi Gang,
>>
>> On 2017/11/28 13:33, Gang He wrote:
>>> Hello Alex,
>>>
>>>
>>
 Hi Gang,

 On 2017/11/27 17:46, Gang He wrote:
> Add ocfs2_overwrite_io function, which is used to judge if
> overwrite allocated blocks, otherwise, the write will bring extra
> block allocation overhead.
>
> Signed-off-by: Gang He 
> ---
>  fs/ocfs2/extent_map.c | 67 
 +++
>  fs/ocfs2/extent_map.h |  3 +++
>  2 files changed, 70 insertions(+)
>
> diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c
> index e4719e0..98bf325 100644
> --- a/fs/ocfs2/extent_map.c
> +++ b/fs/ocfs2/extent_map.c
> @@ -832,6 +832,73 @@ int ocfs2_fiemap(struct inode *inode, struct 
 fiemap_extent_info *fieinfo,
>   return ret;
>  }
>  
> +/* Is IO overwriting allocated blocks? */
> +int ocfs2_overwrite_io(struct inode *inode, u64 map_start, u64 map_len,
> +int wait)
> +{
> + int ret = 0, is_last;
> + u32 mapping_end, cpos;
> + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> + struct buffer_head *di_bh = NULL;
> + struct ocfs2_extent_rec rec;
> +
> + if (wait)
> + ret = ocfs2_inode_lock(inode, &di_bh, 0);
> + else
> + ret = ocfs2_try_inode_lock(inode, &di_bh, 0);
> + if (ret)
> + goto out;
> +
> + if (wait)
> + down_read(&OCFS2_I(inode)->ip_alloc_sem);
> + else {
> + if (!down_read_trylock(&OCFS2_I(inode)->ip_alloc_sem)) {
> + ret = -EAGAIN;
> + goto out_unlock1;
> + }
> + }
> +
> + if ((OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) &&
> +((map_start + map_len) <= i_size_read(inode)))
> + goto out_unlock2;
> +
> + cpos = map_start >> osb->s_clustersize_bits;
> + mapping_end = ocfs2_clusters_for_bytes(inode->i_sb,
> +map_start + map_len);
> + is_last = 0;
> + while (cpos < mapping_end && !is_last) {
> + ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos,
> +  NULL, &rec, &is_last);
> + if (ret) {
> + mlog_errno(ret);
> + goto out_unlock2;
> + }
> +
> + if (rec.e_blkno == 0ULL)
> + break;
 I think here the blocks is not overwrite, because the hold is found and 
 the 
 blocks
 should be allocated.
>>> If the rec.e_blkno == NULL, this means there is a hole.
>>> The file hole means that these blocks are not allocated, it does not like 
>> unwritten block.
>>> The unwritten blocks means that these blocks are allocated, but still have 
>> not been unwritten. 
>>>
>> If we break the loop when we find the hold, out of this function we will 
>> allocate the blocks in
>> ocfs2_file_write_iter()->..->ocfs2_direct_IO->__blockdev_direct_IO->..->ocfs2_dio_wr_g
>> et_block()
>> ->ocfs2_write_begin_nolock. Does this violate the semantics of 'IOCB_NOWAIT';
> Yes, then we need to check if this is a overwrite before doing direct-io.
>

I mean here we should return 0 instead of break and we should immediately 
return -EAGAIN
to upper apps, otherwise, some block allocation will be happen, which violates 
the
semantics of 'IOCB_NOWAIT'.

Thanks,
Alex

>>
>> BTW, should we consider the down_write() and ocfs2_inode_lock() in 
>> ocfs2_dio_wr_get_block() when
>> the flag 'IOCB_NOWAIT' is set;
> I think that we should not consider that layer lock, otherwise, the code 
> change will become more and more complex and big.
> I also refer to ext4 file system code change for this 
> feature(728fbc0e10b7f3ce2ee043b32e3453fd5201c055), they did not do any change 
> in that layer.
> 

OK.

> Thanks
> Gang
> 
>>
> +
> + if (rec.e_flags & OCFS2_EXT_REFCOUNTED)
> + break;
> +
> + cpos = le32_to_cpu(rec.e_cpos) +
> + le16_to_cpu(rec.e_leaf_clusters);
> + }
> +
> + if (cpos < mapping_end)
> + ret = 1;
> +
> +out_unlock2:

 I think the 'out_up_read' is more readable than the 'out_unlock2' .
>>> Ok, I will use more readable tag here.

> + brelse(di_bh);
> +
> + up_read(&OCFS2_I(inode)->ip_alloc_sem);
> +
> +out_unlock1:

 We should release buffer head here.

> + ocfs2_inode_unlock(inode, 0);
> +
> +out:
> + return (ret ? 0 : 1);
> +}
> +
>  int ocfs2_seek_data_hole_offset(struct file *file, loff_t *offset, int 
 whence)
>  {
>   struct inode *inode = file->f_mapping->host;
> diff --git a/fs/ocfs2/extent_map.h b/fs/ocfs2/extent_map.h
> index 67ea57d..fd9e86a 100644
> --- a/fs/ocfs2/extent_map.h
> +++ b/fs/ocfs2/exten