Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-17 Thread Boaz Harrosh

On 17/09/2020 15:48, Matthew Wilcox wrote:

On Thu, Sep 17, 2020 at 02:01:33PM +0200, Oleg Nesterov wrote:

<>


If we change bio_endio to invoke the ->bi_end_io callbacks in softirq
context instead of hardirq context, we can change the pagecache to take
BH-safe locks instead of IRQ-safe locks.  I believe the only reason the
lock needs to be IRQ-safe is for the benefit of paths like:



From my totally subjective experience on the filesystem side (as a user of
bio_endio), all the HW block drivers I have used, including NVMe, isci,
SATA... etc., end up calling bio_endio in softirq context. The big exception
to that is the vdX drivers under KVM. Which is very ironic to me.

I wish we could make all drivers be uniform in this regard.

But maybe I'm just speaking crap. It's only from my limited debugging
experience.


Thanks
Boaz


Re: [RFC PATCH] locking/percpu-rwsem: use this_cpu_{inc|dec}() for read_count

2020-09-17 Thread Boaz Harrosh

On 16/09/2020 15:32, Hou Tao wrote:
<>

However the performance degradation is huge under aarch64 (4 sockets, 24 cores
per socket): nearly 60% lost.

v4.19.111
no writer, reader cnt                                  |        24 |        48 |        72 |        96
the rate of down_read/up_read per second               | 166129572 | 166064100 | 165963448 | 165203565
the rate of down_read/up_read per second (patched)     |  63863506 |  63842132 |  63757267 |  63514920



I believe Peter Z's suggestion of an additional percpu_down_read_irqsafe()
API is perhaps the right way: let only the in-IRQ users pay the penalty.


Peter Z wrote:

My leading alternative was adding: percpu_down_read_irqsafe() /
percpu_up_read_irqsafe(), which use local_irq_save() instead of
preempt_disable().


Thanks
Boaz


Re: [PATCH 0/5] Enable per-file/directory DAX operations

2019-10-23 Thread Boaz Harrosh
On 22/10/2019 14:21, Boaz Harrosh wrote:
> On 20/10/2019 18:59, ira.we...@intel.com wrote:
<>
>> At LSF/MM we discussed the difficulties of switching the mode of a file with
>> active mappings / page cache. Rather than solve those races the decision was 
>> to
>> just limit mode flips to 0-length files.
>>
> 
> What I understand above is that only "writers" may turn the flag on, before
> writing any bytes, and it then persists. But as a very long time user of
> DAX, usually it is the writers that are least interesting. With lots of DAX
> technologies and emulations the write is slower and needs slow "flushing".
> 
> The more interesting part, and the performance gains, comes from DAX READs
> actually. Especially across VM guests. (IE. all VMs share host memory or
> pmem)
> 
> This feature as I understand it, where I need to know before I write
> whether I will want DAX or not, and then the write is DAX as well as reads
> after that, looks not very interesting to me as a user.
> 
> Just my $0.17
> Boaz
> 

I want to say one more thing about this patchset, please. I slept on it,
and I think it is a completely wrong approach with a completely wrong API.

Whether DAX or not is a matter of transport, and has nothing to do with data
content and persistency. It's like connecting a disk behind, say, a scsi bus,
and then taking the same DB and putting it behind a multi-path or an NFS
server. It is always the same data.
(Same mistake we did with ndctl, which we will cry about for generations)

For me the DAX or NO-DAX is an application thing and not a data thing.

The right API is perhaps an open O_FLAG where, if you are not the first
opener, the open returns EBUSY, and then the app can decide whether to open
without the flag, or to complain and fail. (Or apply open locks)

If you are a second opener when the file is already opened O_DAX, you are
silently running DAX. If you are a second open(O_DAX) when it is already
opened O_DAX, then of course you succeed.
(Also the directory inheritance thing becomes all moot)

Can you please explain the use case behind your model?

Thanks
Boaz


Re: [PATCH 1/5] fs/stat: Define DAX statx attribute

2019-10-22 Thread Boaz Harrosh
On 20/10/2019 18:59, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> In order for users to determine if a file is currently operating in DAX
> mode (effective DAX).  Define a statx attribute value and set that
> attribute if the effective DAX flag is set.
> 
> To go along with this we propose the following addition to the statx man
> page:
> 
> STATX_ATTR_DAX
> 
>   DAX (cpu direct access) is a file mode that attempts to minimize
>   software cache effects for both I/O and memory mappings of this
>   file.  It requires a capable device, a compatible filesystem
>   block size, and filesystem opt-in. It generally assumes all
>   accesses are via cpu load / store instructions which can
>   minimize overhead for small accesses, but adversely affect cpu
>   utilization for large transfers. File I/O is done directly
>   to/from user-space buffers. While the DAX property tends to
>   result in data being transferred synchronously it does not give
>   the guarantees of synchronous I/O that data and necessary
>   metadata are transferred. Memory mapped I/O may be performed
>   with direct mappings that bypass system memory buffering. Again
>   while memory-mapped I/O tends to result in data being
>   transferred synchronously it does not guarantee synchronous
>   metadata updates. A dax file may optionally support being mapped
>   with the MAP_SYNC flag which does allow cpu store operations to
>   be considered synchronous modulo cpu cache effects.
> 
> Signed-off-by: Ira Weiny 
> ---
>  fs/stat.c | 3 +++
>  include/uapi/linux/stat.h | 1 +
>  2 files changed, 4 insertions(+)
> 
> diff --git a/fs/stat.c b/fs/stat.c
> index c38e4c2e1221..59ca360c1ffb 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -77,6 +77,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat 
> *stat,
>   if (IS_AUTOMOUNT(inode))
>   stat->attributes |= STATX_ATTR_AUTOMOUNT;
>  
> + if (inode->i_flags & S_DAX)

Is there a reason not to use IS_DAX(inode) ?

> + stat->attributes |= STATX_ATTR_DAX;
> +
>   if (inode->i_op->getattr)
>   return inode->i_op->getattr(path, stat, request_mask,
>   query_flags);
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 7b35e98d3c58..5b0962121ef7 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -169,6 +169,7 @@ struct statx {
>  #define STATX_ATTR_ENCRYPTED 0x0800 /* [I] File requires key to 
> decrypt in fs */
>  
>  #define STATX_ATTR_AUTOMOUNT 0x1000 /* Dir: Automount trigger */
> +#define STATX_ATTR_DAX   0x2000 /* [I] File is DAX */
>  
>  
>  #endif /* _UAPI_LINUX_STAT_H */
> 



Re: [PATCH 0/5] Enable per-file/directory DAX operations

2019-10-22 Thread Boaz Harrosh
On 20/10/2019 18:59, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
> consumption due to their inability to detect whether the kernel will
> instantiate page cache for a file, and cases where a global dax enable via a
> mount option is too coarse.
> 
> The following patch series enables selecting the use of DAX on individual 
> files
> and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
> scheme the dax mount option can be omitted to allow the per-file property to
> take effect.
> 
> The insight at LSF/MM was to separate the per-mount or per-file "physical"
> capability switch from an "effective" attribute for the file.
> 
> At LSF/MM we discussed the difficulties of switching the mode of a file with
> active mappings / page cache. Rather than solve those races the decision was 
> to
> just limit mode flips to 0-length files.
> 

What I understand above is that only "writers" may turn the flag on, before
writing any bytes, and it then persists. But as a very long time user of DAX,
usually it is the writers that are least interesting. With lots of DAX
technologies and emulations the write is slower and needs slow "flushing".

The more interesting part, and the performance gains, comes from DAX READs
actually. Especially across VM guests. (IE. all VMs share host memory or pmem)

This feature as I understand it, where I need to know before I write whether
I will want DAX or not, and then the write is DAX as well as reads after
that, looks not very interesting to me as a user.

Just my $0.17
Boaz

> Finally, the physical DAX flag inheritance is maintained from previous work 
> on 
> XFS but should be added for other file systems for consistence.
> 
> 
> [1] https://lwn.net/Articles/787973/
> [2] https://lwn.net/Articles/787233/
> 
> To: linux-kernel@vger.kernel.org
> Cc: Alexander Viro 
> Cc: "Darrick J. Wong" 
> Cc: Dan Williams 
> Cc: Dave Chinner 
> Cc: Christoph Hellwig 
> Cc: "Theodore Y. Ts'o" 
> Cc: Jan Kara 
> Cc: linux-e...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> 
> Ira Weiny (5):
>   fs/stat: Define DAX statx attribute
>   fs/xfs: Isolate the physical DAX flag from effective
>   fs/xfs: Separate functionality of xfs_inode_supports_dax()
>   fs/xfs: Clean up DAX support check
>   fs/xfs: Allow toggle of physical DAX flag
> 
>  fs/stat.c |  3 +++
>  fs/xfs/xfs_ioctl.c| 32 ++--
>  fs/xfs/xfs_iops.c | 36 ++--
>  fs/xfs/xfs_iops.h |  2 ++
>  include/uapi/linux/stat.h |  1 +
>  5 files changed, 50 insertions(+), 24 deletions(-)
> 



Re: pagecache locking

2019-07-07 Thread Boaz Harrosh
On 06/07/2019 02:31, Dave Chinner wrote:

> 
> As long as the IO ranges to the same file *don't overlap*, it should
> be perfectly safe to take separate range locks (in read or write
> mode) on either side of the mmap_sem as non-overlapping range locks
> can be nested and will not self-deadlock.
> 
> The "recursive lock problem" still arises with DIO and page faults
> inside gup, but it only occurs when the user buffer range overlaps
> the DIO range to the same file. IOWs, the application is trying to
> do something that has an undefined result and is likely to result in
> data corruption. So, in that case I plan to have the gup page faults
> fail and the DIO return -EDEADLOCK to userspace
> 

This sounds very cool. I now understand. I hope you put all the tools
for this in generic places, so they will be easier to reuse.

One thing I will be very curious to see is how you teach lockdep
about the "range locks can be nested" thing. I know it's possible,
other places do it, but it's something I never understood.

> Cheers,
> Dave.

[ Ah, one more question if you have time:

  In one of the mails, and you also mentioned it before, you said that
  the rw_read_lock is not able to scale well on mammoth machines
  with tens of cores (maybe you said over 20).
  I wonder why that happens. Is it because of the atomic operations,
  or something in the lock algorithm? In my theoretical understanding,
  as long as there are no write-lock grabbers, why would the readers
  interfere with each other?
]

Thanks
Boaz


Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-04 Thread Boaz Harrosh
On 04/07/2019 16:58, Matthew Wilcox wrote:
> On Thu, Jul 04, 2019 at 04:00:00PM +0300, Boaz Harrosh wrote:
<>
>> Matthew you must be kidding an ilog2 in binary is zero clocks
>> (Return the highest bit or something like that)
> 
> You might want to actually check the documentation instead of just
> making shit up.
> 

Yes, you are right, I stand corrected. Must be smoking ;-)

> https://www.agner.org/optimize/instruction_tables.pdf
> 
> I think in this instance what we want is BSR (aka ffz) since the input is
> going to be one of 0, 1, 3, 7, 15 or 31 (and we want 0, 1, 2, 3, 4, 5 as
> results).
> 
<>
> The compiler doesn't know the range of 'sibs'.  Unless we do the
> profile-feedback thing.
> 

Would you please consider the use of get_order() macro from #include 

Just for the sake of understanding? (Or at least a comment)

Thanks
Boaz


Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-04 Thread Boaz Harrosh
On 04/07/2019 06:27, Matthew Wilcox wrote:
> On Wed, Jul 03, 2019 at 02:28:41PM -0700, Dan Williams wrote:
<>
>>> +#ifdef CONFIG_XARRAY_MULTI
>>> +   unsigned int sibs = xas->xa_sibs;
>>> +
>>> +   while (sibs) {
>>> +   order++;
>>> +   sibs /= 2;
>>> +   }
>>
>> Use ilog2() here?
> 
> Thought about it.  sibs is never going to be more than 31, so I don't
> know that it's worth eliminating 5 add/shift pairs in favour of whatever
> the ilog2 instruction is on a given CPU.  In practice, on x86, sibs is
> going to be either 0 (PTEs) or 7 (PMDs).  We could also avoid even having
> this function by passing PMD_ORDER or PTE_ORDER into get_unlocked_entry().
> 
> It's probably never going to be noticable in this scenario because it's
> the very last thing checked before we put ourselves on a waitqueue and
> go to sleep.
> 

Matthew, you must be kidding, an ilog2 in binary is zero clocks
(return the highest bit or something like that).

Anyway, it took me 5 minutes to understand what you are doing
here, and I only fully got it when Dan gave his comment. So please, for
the sake of stupid guys like me, could you make it ilog2() so as
to make it easier to understand?
(And please don't do the compiler's job. If on some arch the loop
 is the fastest, let the compiler decide?)

Thanks
Boaz


Re: [PATCH] mm: Support madvise_willneed override by Filesystems

2019-07-03 Thread Boaz Harrosh
On 03/07/2019 20:21, Jan Kara wrote:
> On Wed 03-07-19 04:04:57, Boaz Harrosh wrote:
>> On 19/06/2019 11:21, Jan Kara wrote:
>> <>
<>
>> Hi Jan
>>
>> Funny, I'm sitting on the same patch since LSF last. I need it too, for
>> other reasons. I have not seen it; have you pushed your patch yet?
>> (Mine is based on old v4.20)
> 
> Your patch is wrong due to lock ordering. You should not call vfs_fadvise()
> under mmap_sem. So we need to do a similar dance like madvise_remove(). I
> have to get to writing at least XFS fix so that the madvise change gets
> used and post the madvise patch with it... Sorry it takes me so long.
> 
>   Honza

Ah, sorry, I was not aware of this. Lockdep did not catch it on my setup,
because my setup does not have any locking conflicts with mmap_sem on the
WILL_NEED path.

But surely you are right, because the whole effort is to fix the locking
problems.

I will also try, in a day or two, to do as you suggest and look at
madvise_remove(), once I have a bit of time. Whoever gets to be less busy
first ...

Thank you for your help
Boaz


Re: [PATCH] filesystem-dax: Disable PMD support

2019-07-02 Thread Boaz Harrosh
On 03/07/2019 03:42, Dan Williams wrote:
> On Tue, Jul 2, 2019 at 5:23 PM Boaz Harrosh  wrote:
<>
> 
> Yes, but the trick is how to manage cases where someone waiting on one
> type needs to be woken up by an event on the other. 

Exactly I'm totally with you on this.

> So all I'm saying it lets live with more hash collisions until we can figure 
> out a race
> free way to better scale waitqueue usage.
> 

Yes, and let's actually do real measurements to see if this really hurts
needlessly. Maybe not so much.

Thanks
Boaz

<>


Re: pagecache locking

2019-07-02 Thread Boaz Harrosh
On 03/07/2019 04:07, Patrick Farrell wrote:
> Recursively read locking is generally unsafe, that’s why lockdep
> complains about it.  The common RW lock primitives are queued in
> their implementation, meaning this recursive read lock sequence:
> P1 - read (gets lock)
> P2 - write
> P1 - read
> 
> Results not in a successful read lock, but P1 blocking behind P2,
> which is blocked behind P1.  

> Readers are not allowed to jump past waiting writers.

OK thanks, that makes sense. I did not know about that last part. It's a kind
of lock fairness I did not know we have.

So I guess I'll keep my two locks then. The write_locker is the SLOW
path for me anyway, right?

[If we are already on the subject: do mutexes have the same lock fairness as
 above? Does the write_lock side of rw_sem have the same fairness? Something
 I never figured out]

Thanks
Boaz

> 
> - Patrick


[PATCH] mm: Support madvise_willneed override by Filesystems

2019-07-02 Thread Boaz Harrosh
On 19/06/2019 11:21, Jan Kara wrote:
<>
> Yes, I have patch to make madvise(MADV_WILLNEED) go through ->fadvise() as
> well. I'll post it soon since the rest of the series isn't really dependent
> on it.
> 
>   Honza
> 

Hi Jan

Funny, I'm sitting on the same patch since LSF last. I need it too, for other
reasons. I have not seen it; have you pushed your patch yet?
(Mine is based on old v4.20)

~
>From fddb38169e33d23060ddd444ba6f2319f76edc89 Mon Sep 17 00:00:00 2001
From: Boaz Harrosh 
Date: Thu, 16 May 2019 20:02:14 +0300
Subject: [PATCH] mm: Support madvise_willneed override by Filesystems

In the patchset:
[b833a3660394] ovl: add ovl_fadvise()
[3d8f7615319b] vfs: implement readahead(2) using POSIX_FADV_WILLNEED
[45cd0faae371] vfs: add the fadvise() file operation

Amir Goldstein introduced a way for filesystems to override fadvise.
Well, madvise_willneed is exactly like fadvise_willneed, except that it
always returns 0.

In this patch we call the FS vector if it exists.

NOTE: I called vfs_fadvise(..,POSIX_FADV_WILLNEED);
  (Which is my artistic preference)

I could also selectively call
if (file->f_op->fadvise)
return file->f_op->fadvise(..., POSIX_FADV_WILLNEED);
If we fear theoretical side effects. I don't mind either way.

CC: Amir Goldstein 
CC: Miklos Szeredi 
Signed-off-by: Boaz Harrosh 
---
 mm/madvise.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 6cb1ca93e290..6b84ddcaaaf2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -303,7 +304,8 @@ static long madvise_willneed(struct vm_area_struct *vma,
end = vma->vm_end;
end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
-   force_page_cache_readahead(file->f_mapping, file, start, end - start);
+   vfs_fadvise(file, start << PAGE_SHIFT, (end - start) << PAGE_SHIFT,
+   POSIX_FADV_WILLNEED);
return 0;
 }
 
-- 
2.20.1



Re: pagecache locking

2019-07-02 Thread Boaz Harrosh
On 20/06/2019 01:37, Dave Chinner wrote:
<>
> 
> I'd prefer it doesn't get lifted to the VFS because I'm planning on
> getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is
> likely to go away in the near term because a range lock can be
> taken on either side of the mmap_sem in the page fault path.
> 
<>
Sir Dave

Sorry if this was answered before. I am very curious, please. In the zufs
project I have an equivalent rw_MMAPLOCK that I _read_lock on page_faults.
(Reads & writes all take read-locks ...)
The only reason I have it is because of lockdep, actually.

Specifically for those xfstests that mmap a buffer and then direct_IO in/out
of that buffer from/to another file in the same FS, or the same file.
(For lockdep it's the same case.)
I would be perfectly happy to recursively _read_lock both from the top
of the page_fault, at the DIO path, and below, in the page_fault. I'm
_read_locking after all. But lockdep is hard to convince. So I stole the
xfs idea of having an rw_MMAPLOCK, and grab yet another _write_lock at
truncate/punch/clone time, when all mapping traversal needs to stop for
the destructive change to take place. (Allocations are done another way
and are race safe with traversal.)

How do you intend to address this problem with range-locks? I.e. recursively
taking the same "lock"? Because if not for the recursiveness and lockdep, I
would not need the extra lock-object per inode.

Thanks
Boaz


Re: [PATCH] filesystem-dax: Disable PMD support

2019-07-02 Thread Boaz Harrosh
On 02/07/2019 18:37, Dan Williams wrote:
<>
> 
> I'd be inclined to do the brute force fix of not trying to get fancy
> with separate PTE/PMD waitqueues and then follow on with a more clever
> performance enhancement later. Thoughts about that?
> 

Sir Dan

I do not understand how separate waitqueues are any performance enhancement.
The whole point of the waitqueues is that there are enough of them, and that
the hash function does a good randomization spread, so as to effectively grab
a single locker per waitqueue, unless the system is very contended and
waitqueues are shared. Which is good, because it means you effectively need
back pressure to the app.
(Because pmem IO is mostly CPU bound with no long term sleeps, I do not think
 you will ever get to that situation)

So the way I understand it, having twice as many waitqueues serving the two
types will be better performance overall than segregating the types, each
with half the number of queues.

(Regardless of the above problem of where the segregation is not race clean)

Thanks
Boaz


Re: remove exofs, the T10 OSD code and block/scsi bidi support V4

2019-02-04 Thread Boaz Harrosh
On 01/02/19 09:55, Christoph Hellwig wrote:
> The only real user of the T10 OSD protocol, the pNFS object layout
> driver never went to the point of having shipping products, and we
> removed it 1.5 years ago.  Exofs is just a simple example without
> real life users.
> 
> The code has been mostly unmaintained for years and is getting in the
> way of block / SCSI changes, 

For the 4th time: which?

> and does not even work properly currently,
> so I think it's finally time to drop it.
> 
> Quote from Boaz:
> 

Hello? anyone home?

Please resend without the quote below!!!
Because yes, I said that, but I have also said the opposite. And I have
clearly stated, in reaction to the last 3 resends, that I now oppose the
statement below, and that you should please remove it.

Do your own text please. Don't drag me into the mud.

> "As I said then. It is used in Universities for studies and experiments.
> Every once in a while. I get an email with questions and reports.
> 
> But yes feel free to remove the all thing!!
> 
> I guess I can put it up on github. In a public tree.
> 
> Just that I will need to forward port it myself, til now you guys
> been doing this for me ;-)"
> 
> Now the last time this caused a bit of a stir, but still no actual users,
> not even for SG_IO passthrough commands.  So here we go again, this time
> including removing everything in the scsi and block layer supporting it,
> and thus shrinking struct request.
> 

NAK-by: Boaz harrosh 
For what ever that means ...

Cheers
Boaz



Re: remove exofs, the T10 OSD code and block/scsi bidi support V3

2018-12-23 Thread Boaz Harrosh
On 20/12/18 09:26, Christoph Hellwig wrote:
> On Wed, Dec 19, 2018 at 09:01:53PM -0500, Douglas Gilbert wrote:
>>>   1) reduce the size of every kernel with block layer support, and
>>>  even more for every kernel with scsi support
>>
>> By proposing the removal of bidi support from the block layer, it isn't
>> just the SCSI subsystem that will be impacted. Those NVMe documents
>> that you referred me to earlier in the year, in the command tables
>> in 1.3c and earlier you have noticed the 2 bit direction field and
>> what 11b means? Even if there aren't any bidi NVMe commands *** yet,
>> the fact that NVMe's 64 byte command format has provision for 4
>> (not 2) independent data transfers (data + meta, for each direction).
>> Surely NVMe will sooner or later take advantage of those ... a
>> command like READ GATHERED comes to mind.
> 
> NVMe on the other hand does have support for separate read and write
> buffers as in the current SCSI bidi support, as it encodes the data
> transfers in that SQE.  So IFF NVMe does bidi commands it would have
> to use a single buffer for data in/out, 

There is no such thing as a "buffer": there is at first a bio, and after
virtual-to-iommu mapping, a scatter-gather list. All of these are currently
governed by a struct request.
request, bio, and sgl each have a single direction; all APIs expect a single
direction.

All BIDI did was to say: let's not change any API or structure, but just
use two of them at the same time.

Only the very high level user and the very low-level HW driver, like
iscsi, are any the wiser. The middleware was never touched.

In the view of a bidi target, like say an osd, the whole stream looks like a
single "buffer" on the wire, where some of it is read and some of it is
written to.

> which can be easily done

?? Did you try? It will take much more than an additional pointer, sir.

> in the block layer without the current bidi support that chains
> two struct request instances for data in and data out.
> 

That was the whole trick: not changing a single API or structure.
Just have two of the same thing, which we already know how to handle.

>>>   2) reduce the size of the critical struct request structure by
>>>  128 bits, thus reducing the memory used by every blk-mq driver
>>>  significantly, never mind the cache effects
>>
>> Hmm, one pointer (that is null in the non-bidi case) should be enough,
>> that's 64 or 32 bits.
> 
> Due to the way we use request chaining we need two fields at the
> moment.  ->special and ->next_rq.  

No! ->special has nothing to do with bidi. ->special is a field to be
used by LLDs only, and is not to be touched by the block layer, transports,
or high level users.

A request has the single ->next_rq for bidi. And it could be eliminated by
sharing space with the elevator info. Do you want a patch?

(So in effect it can take 0 bytes, and yes, a little bit of code)

> If we'd refactor the whole thing
> for the basically non-existent user we could indeed probably get it
> down to a single pointer. 
> 
>> While on the subject of bidi, the order of transfers: is the data-out
>> (to the target) always before the data-in or is it the target device
>> that decides (depending on the semantics of the command) who is first?
> 
> The way I read SAM data needs to be transferred to the device for
> processing first, then the processing occurs and then it is transferred
> out, so the order seems fixed.
> 

Not sure what the "SAM" above is. But in most of the BIDI commands I know,
osd and otherwise, the order is command specific, and many times it is
done in parallel.
Read some bits, then write some bits, rinse and repeat ...

(You see, in scsi the whole OUT buffer is part of the actual CDB, so in effect
 any READ is a BIDI. The novelty here is the variable sizes, and the SW stack
 memory targets for the different operations)

>>
>> Doug Gilbert
>>
>>  *** there could already be vendor specific bidi NVMe commands out
>> there (ditto for SCSI)
> 
> For NVMe they'd need to transfer data in and out in the same buffer
> to sort work, and even then only if we don't happen to be bounce
> buffering using swiotlb, or using a network transport.  Similarly for
> SCSI only iSCSI at the moment supports bidi CDBs, so we could have
> applications using vendor specific bidi commands on iSCSI, which
> is exactly what we're trying to find out, but it is a bit of a very
> niche use case.
> 

Again, bidi works NOW. I have still not seen the big gain of throwing it
out.

Jai Maa
Boaz



Re: remove exofs, the T10 OSD code and block/scsi bidi support V3

2018-12-23 Thread Boaz Harrosh
On 19/12/18 16:43, Christoph Hellwig wrote:
> On Mon, Nov 26, 2018 at 07:11:10PM +0200, Boaz Harrosh wrote:
>> On 11/11/18 15:32, Christoph Hellwig wrote:
>>> The only real user of the T10 OSD protocol, the pNFS object layout
>>> driver never went to the point of having shipping products, and we
>>> removed it 1.5 years ago.  Exofs is just a simple example without
>>> real life users.
>>>
>>
>> You have failed to say what is your motivation for this patchset? What
>> is it you are trying to fix/improve.
> 
> Drop basically unused support, which allows us to
> 
>  1) reduce the size of every kernel with block layer support, and
> even more for every kernel with scsi support

Do you have numbers? It's mainly code-segment, so I don't think you will see
any real-life measurable difference.

>  2) reduce the size of the critical struct request structure by
> 128 bits, thus reducing the memory used by every blk-mq driver
> significantly, never mind the cache effects

128 bits? I see the "struct request *next_rq;"
Is there another one?

It could share space with elv && flush;
Do you want a patch?

>  3) stop having the maintainance overhead for this code in the
> block layer, which has been rather painful at times
> 

I hear you man. Life is pain. But is it really such an overhead?
I mean, it is already implemented. What else is there to do?
Please, please show me. (Sorry for being slow)

Jai Maa
Boaz



Re: [PATCH 33/41] scsi: osd: osd_initiator: mark expected switch fall-throughs

2018-12-18 Thread Boaz Harrosh
On 28/11/18 06:32, Gustavo A. R. Silva wrote:
> In preparation to enabling -Wimplicit-fallthrough, mark switch cases
> where we are expecting to fall through.
> 
> Signed-off-by: Gustavo A. R. Silva 

ACK-by: Boaz Harrosh 

> ---
>  drivers/scsi/osd/osd_initiator.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/osd/osd_initiator.c 
> b/drivers/scsi/osd/osd_initiator.c
> index 60cf7c5eb880..cb26f26d5ec1 100644
> --- a/drivers/scsi/osd/osd_initiator.c
> +++ b/drivers/scsi/osd/osd_initiator.c
> @@ -1849,6 +1849,7 @@ int osd_req_decode_sense_full(struct osd_request *or,
>   32, 1, dump, sizeof(dump), true);
>   OSD_SENSE_PRINT2("response_integrity [%s]\n", dump);
>   }
> + /* fall through */
>   case osd_sense_attribute_identification:
>   {
>   struct osd_sense_attributes_data_descriptor
> @@ -1879,7 +1880,7 @@ int osd_req_decode_sense_full(struct osd_request *or,
>   attr_page, attr_id);
>   }
>   }
> - /*These are not legal for OSD*/
> + /* fall through - These are not legal for OSD */
>   case scsi_sense_field_replaceable_unit:
>   OSD_SENSE_PRINT2("scsi_sense_field_replaceable_unit\n");
>   break;
> 



Re: remove exofs, the T10 OSD code and block/scsi bidi support V3

2018-11-26 Thread Boaz Harrosh
On 11/11/18 15:32, Christoph Hellwig wrote:
> The only real user of the T10 OSD protocol, the pNFS object layout
> driver never went to the point of having shipping products, and we
> removed it 1.5 years ago.  Exofs is just a simple example without
> real life users.
> 

You have failed to say what your motivation for this patchset is. What
is it you are trying to fix/improve?

For something merely "not used much", I fail to see why the risk of this
removal is worth taking.

> The code has been mostly unmaintained for years and is getting in the
> way of block / SCSI changes, so I think it's finally time to drop it.
> 
> Quote from Boaz:
> 
> "As I said then. It is used in Universities for studies and experiments.
> Every once in a while. I get an email with questions and reports.
> 
> But yes feel free to remove the all thing!!
> 
> I guess I can put it up on github. In a public tree.
> 
> Just that I will need to forward port it myself, til now you guys
> been doing this for me ;-)"
> 

Yes, I wrote that for V1. But I wrote the *opposite* thing in a later mail,
which nullifies the statement above. So please remove this quote in future
submissions.

Here is what I wrote later as response of V2 of this set:



I think I'm changing my mind about this.

Because of two reasons:

One: I see 3 thousand bit-rots in the Kernel, and particularly SCSI drivers,
that are much older and more fat-and-ugliness consuming than the clean osd
stack. For example the whole ISA bus and ZONE_DMA stuff.

Two: I have offered many times, every time this came up, that if
anyone has a major (or even minor) change to the block and/or scsi layers
that he/she has done, and that now breaks the compilation/run time of
OSD or exofs:
  I'm personally willing to spend my weekends and fix it myself. Send me
  a URL of the tree with the work done, and I will send the patches needed
  to revitalize OSD/exofs as part of that change set.

I have never received any such requests to date.

So I would please like to protest against two of Christoph's statements above.

"The code has been mostly unmaintained for years and is getting in the
 way of block / SCSI changes"

1. What does "maintained" mean? I have, for all these years, been immediately
   responsive to any inquiries and patches to the code in question,
   and am listed as MAINTAINER of this code.
2. I have regularly, forever, every kernel release around the RC3-RC4
   time frame, compiled and run my almost automatic setup and made sure
   things still run as expected (no regressions).

So yes, the code has not seen any new features for years. But it is regularly
tested and proven to work on the latest kernel. So it fails the definition
of "bit rot".

Christoph you've been saying for so long "getting in the way of block/SCSI
changes". And every time and again this time please tell me, you never answered
before. What are these changes you want to make? can I please help?

Send me any tree where exofs/osd compilation is broken and I will personally
fix it in "ONE WEEK" time.

(If compilation is fine but you know runtime will break, its nice to have an
 heads up, but if not my automatic system will detect it anyway)

Lets say that if in the FUTURE a change-set is submitted that breaks OSD/EXOFS
compilation, and I failed to respond with a fix within "ONE WEEK". Then
this goes in as part of that change-set. And not with the argument of
"Not used, not maintained" - But as "Breaks compilation of the following 
changes"
I promise I will gladly ACK it then.

So for now. A personal NACK from me on the grounds that. You never told me
why / what this is breaking.

Thanks
Boaz



> Now the last time this caused a bit of a stir, but still no actual users,
> not even for SG_IO passthrough commands.  So here we go again, this time
> including removing everything in the scsi and block layer supporting it,
> and thus shrinking struct request.
> 

Again. T10-OSD or not. Bidi is currently actively used. By Linus rules
You are not allowed to remove it.

Two use paths:
1. Management CDBS of private vendors yes via iscsi. virt_io and usb-scsi
2. Target mode support of WRITE-RETURN-XOR, and COMPARE_AND_WRITE

---
You guys should do what you feel best. Even not answering my questions and
of course not agreeing with my advise, .i.e about breaking people's setups.

But please remove the wrong quote from me. Please quote my objection
of the matter. (pretty please because you may surly ignore that request as
well)

[I am not fighting about this at all. Please do what you need to do.
 Just want to set the record strait that's all]

Cheers :-)
Boaz



Re: remove exofs, the T10 OSD code and block/scsi bidi support V3

2018-11-26 Thread Boaz Harrosh
On 11/11/18 15:32, Christoph Hellwig wrote:
> The only real user of the T10 OSD protocol, the pNFS object layout
> driver never went to the point of having shipping products, and we
> removed it 1.5 years ago.  Exofs is just a simple example without
> real life users.
> 

You have failed to say what your motivation for this patchset is. What
is it you are trying to fix/improve?

For the sake of "not used much", I fail to see the risk taking of
this removal.

> The code has been mostly unmaintained for years and is getting in the
> way of block / SCSI changes, so I think it's finally time to drop it.
> 
> Quote from Boaz:
> 
> "As I said then. It is used in Universities for studies and experiments.
> Every once in a while. I get an email with questions and reports.
> 
> But yes feel free to remove the all thing!!
> 
> I guess I can put it up on github. In a public tree.
> 
> Just that I will need to forward port it myself, til now you guys
> been doing this for me ;-)"
> 

Yes, I wrote that for V1. But I wrote the *opposite* thing in a later mail,
which nullifies the statement above. So please remove this quote from future
submissions.

Here is what I wrote later as response of V2 of this set:



I think I'm changing my mind about this.

Because of two reasons:

One: I see three thousand bit-rots in the Kernel, and particularly in SCSI
drivers, that are much older and more fat-and-ugliness consuming than the
clean osd stack. For example, all the ISA bus and ZONE_DMA stuff.

Two: I have offered many times, every time this came up, that if
anyone has a major (or even minor) change to the block and/or scsi layers
that he/she has done, and that now breaks the compilation/run time of
OSD or exofs:
  I'm personally willing to spend my weekends and fix it myself. Send me
  a URL of the tree with the work done, and I will send the patches needed
  to revitalize OSD/exofs as part of that change set.

I have never received any such requests to date.

So I would please like to protest two of Christoph's statements above.

"The code has been mostly unmaintained for years and is getting in the
 way of block / SCSI changes"

1. What does "maintained" mean? I have for all these years been immediately
   responsive to any inquiries and patches to the code in question.
   And am listed as MAINTAINER of this code.
2. I have regularly, forever, every kernel release around the RC3-RC4
   time frame, compiled and run my almost automatic setup and made sure
   things still run as expected (no regressions).

So yes, the code has not seen any new features for years. But it is regularly
tested and proven to work on the latest kernel. So it fails the definition
of a "bit rot".

Christoph, you've been saying for so long "getting in the way of block/SCSI
changes". And every time, and again this time, please tell me; you never
answered before. What are these changes you want to make? Can I please help?

Send me any tree where exofs/osd compilation is broken and I will personally
fix it in "ONE WEEK" time.

(If compilation is fine but you know runtime will break, it's nice to have a
 heads up, but if not, my automatic system will detect it anyway)

Let's say that if in the FUTURE a change-set is submitted that breaks OSD/EXOFS
compilation, and I fail to respond with a fix within "ONE WEEK", then
this goes in as part of that change-set. And not with the argument of
"Not used, not maintained" - but as "Breaks compilation of the following
changes".
I promise I will gladly ACK it then.

So for now, a personal NACK from me, on the grounds that you never told me
why / what this is breaking.

Thanks
Boaz



> Now the last time this caused a bit of a stir, but still no actual users,
> not even for SG_IO passthrough commands.  So here we go again, this time
> including removing everything in the scsi and block layer supporting it,
> and thus shrinking struct request.
> 

Again: T10-OSD or not, bidi is currently actively used. By Linus' rules
you are not allowed to remove it.

Two use paths:
1. Management CDBs of private vendors, yes, via iscsi, virt_io and usb-scsi
2. Target mode support of WRITE-RETURN-XOR and COMPARE_AND_WRITE

---
You guys should do what you feel best. Even not answering my questions and,
of course, not agreeing with my advice, i.e. about breaking people's setups.

But please remove the wrong quote from me. Please quote my objection
to the matter. (Pretty please, because you may surely ignore that request as
well.)

[I am not fighting about this at all. Please do what you need to do.
 Just want to set the record straight, that's all]

Cheers :-)
Boaz



Re: remove exofs and the T10 OSD code V2

2018-10-31 Thread Boaz Harrosh
On 31/10/18 23:10, Douglas Gilbert wrote:
> On 2018-10-31 4:57 p.m., Boaz Harrosh wrote:
>> On 30/10/18 09:45, Christoph Hellwig wrote:
>>> On Mon, Oct 29, 2018 at 02:42:12PM -0600, Jens Axboe wrote:
>>>> LGTM, for both:
>>>
>>> I also have this one on top as requested by Martin.  The core block
>>> bidi support is unfortunately also used by bsg-lib, although it is
>>> not anywhere near as invasive.  But that is another argument for
>>> looking into moving bsg-lib away from using block queues..
>>>
>>
>> BUT this patch is very very wrong.
>>
>> Totally apart from T10-OSD and its use in the field. Support for scsi BIDI
>> commands is not exclusive to T10-OSD at all. Even the simple scsi-array
>> command-set has BIDI operations defined, for example the write-return-xor
>> and so on.
>>
>> Also some private administrative CDBs of some vendor devices uses SCSI-BIDI.
>> So this patch just broke some drivers. (User-mode apps use bsg pass through)
>>
>> Also you might (try hard and) remove all usage of scsi-bidi as an initiator
>> from the Linux Kernel. But what about target mode. As a target we have 
>> supported
>> on the wire bidi protocols like write-return-xor and others for a long time.
>> Are you willing to silently break all these setups in the field on the next 
>> update?
>> Are you so sure these are never used?
>>
>> PLEASE, I beg of you guys. Do not remove SCSI-BIDI. It is a cry of 
>> generations.
>>
>> And I think by the rules of Linus, as far as target mode. You are not allowed
>> to break users in this way.
> 
> Hi,
> I'm in the process of rebuilding the sg driver with the following goals:
> 
>- undo 20 years of road wear, some of which is caused by literally
>  hundreds of "laparoscopic" patches that gradually ruin a driver,
>  at least from the maintainer's viewpoint. Comments that once made
>  sense become cryptic or just nonsense; naming schemes that
>  obliterate variable names to the extent that a random name
>  generator would be easier to follow; and just plain broken code.
>  For example, why sort out the existing locking at a particular
>  level when you can add a new lock in a completely non-orthogonal
>  way? [Yes, I looking at you, google.] Anyway, my first cut at this
>  is out there (on the linux-scsi list, see: "[PATCH v3 0/8] sg:
>  major cleanup, remove max_queue limit"). Not much new there,
>  unless you look very closely
> 
>- the next step is to add to the sg driver async SCSI command
>  capability based on the sg_io_v4 structure previously only used
>  by the bsg driver and now removed from bsg. The main advantage
>  of the sg_io_v4 structure over previous pass-through interface
>  attempts is the support of SCSI bidi commands
> 
>- as part of this effort introduce two new ioctls: SG_IOSUBMIT and
>  SG_IORECEIVE to replace the write()/read() technique currently
>  in use (since Linux 1.0 in 1992). The write()/read() technique
>  seems to be responsible for various security folks losing clumps
>  of their hair. One advantage of ioctls, as Alan Cox pointed out,
>  is the ability to write to and read back from the kernel in a way
>  that is naturally synchronized. [Actually, those security folks
>  shouldn't look too closely at sg_read() in that respect.]
> 
> In discussions with several folks who are on the T10 committee, I
> wondered why there was no READ GATHERED command (there has been a
> WRITE SCATTERED for 2 years). The answer was lack of interest ***,
> plus the difficultly of implementation. You see, READ GATHERED needs
> to send a scatter gather list to the device and get the corresponding
> data back (as a linear array). And that requires either:
>a) bidi commands (scatter gather list in the data-out, corresponding
>   "read" data in the data-in), or
>b) long SCSI commands, up to around 256 bytes long, in which the
>   sgat list is the latter part of that command
> 
> And the T10 folks say neither of those options is well supported or
> is expensive. 

It is supported; the Linux scsi/osd driver is proof of that. And expensive
it is not. I have demonstrated the ability to saturate a 10G link over
a raid of devices from a single writer. In OSD everything is bidi.

> I'm guessing they are referring to Linux and Windows.
> I haven't argued much beyond that point, but it looks like a bit of
> a chicken and egg situation.
> 
> 
> Don't know too much about the T10 OSD stuff. But I can see that it
> uses both long SCSI commands and a lot of bidi. IMO i

Re: [RFC][PATCH v2 05/10] exofs: use common file type conversion

2018-10-24 Thread Boaz Harrosh
On 23/10/18 23:19, Phillip Potter wrote:
> Deduplicate the exofs file type conversion implementation.
> 
> Original patch by Amir Goldstein.
> 
> v2:
> - This version does not remove EXOFS_FT_x enum from fs/exofs/common.h,
>   as these values are now used in compile-time checks added by
>   Phillip Potter to make sure they remain the same as FT_x values
> 
> v1:
> - Initial implementation
> 
> Signed-off-by: Phillip Potter 

Yes thank you, totally

ACK-by: Boaz Harrosh 

> ---
>  fs/exofs/dir.c | 49 ++---
>  1 file changed, 18 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
> index f0138674c1ed..2e3161ca9014 100644
> --- a/fs/exofs/dir.c
> +++ b/fs/exofs/dir.c
> @@ -204,33 +204,10 @@ exofs_validate_entry(char *base, unsigned offset, 
> unsigned mask)
>   return (char *)p - base;
>  }
>  
> -static unsigned char exofs_filetype_table[EXOFS_FT_MAX] = {
> - [EXOFS_FT_UNKNOWN]  = DT_UNKNOWN,
> - [EXOFS_FT_REG_FILE] = DT_REG,
> - [EXOFS_FT_DIR]  = DT_DIR,
> - [EXOFS_FT_CHRDEV]   = DT_CHR,
> - [EXOFS_FT_BLKDEV]   = DT_BLK,
> - [EXOFS_FT_FIFO] = DT_FIFO,
> - [EXOFS_FT_SOCK] = DT_SOCK,
> - [EXOFS_FT_SYMLINK]  = DT_LNK,
> -};
> -
> -#define S_SHIFT 12
> -static unsigned char exofs_type_by_mode[S_IFMT >> S_SHIFT] = {
> - [S_IFREG >> S_SHIFT]= EXOFS_FT_REG_FILE,
> - [S_IFDIR >> S_SHIFT]= EXOFS_FT_DIR,
> - [S_IFCHR >> S_SHIFT]= EXOFS_FT_CHRDEV,
> - [S_IFBLK >> S_SHIFT]= EXOFS_FT_BLKDEV,
> - [S_IFIFO >> S_SHIFT]= EXOFS_FT_FIFO,
> - [S_IFSOCK >> S_SHIFT]   = EXOFS_FT_SOCK,
> - [S_IFLNK >> S_SHIFT]= EXOFS_FT_SYMLINK,
> -};
> -
>  static inline
>  void exofs_set_de_type(struct exofs_dir_entry *de, struct inode *inode)
>  {
> - umode_t mode = inode->i_mode;
> - de->file_type = exofs_type_by_mode[(mode & S_IFMT) >> S_SHIFT];
> + de->file_type = fs_umode_to_ftype(inode->i_mode);
>  }
>  
>  static int
> @@ -279,17 +256,27 @@ exofs_readdir(struct file *file, struct dir_context 
> *ctx)
>   exofs_put_page(page);
>   return -EIO;
>   }
> - if (de->inode_no) {
> - unsigned char t;
>  
> - if (de->file_type < EXOFS_FT_MAX)
> - t = exofs_filetype_table[de->file_type];
> - else
> - t = DT_UNKNOWN;
> + /*
> +  * compile-time asserts that generic FT_x types
> +  * still match EXOFS_FT_x types - no need to list
> +  * for other functions as well as build will
> +  * fail either way
> +  */
> + BUILD_BUG_ON(EXOFS_FT_UNKNOWN != FT_UNKNOWN);
> + BUILD_BUG_ON(EXOFS_FT_REG_FILE != FT_REG_FILE);
> + BUILD_BUG_ON(EXOFS_FT_DIR != FT_DIR);
> + BUILD_BUG_ON(EXOFS_FT_CHRDEV != FT_CHRDEV);
> + BUILD_BUG_ON(EXOFS_FT_BLKDEV != FT_BLKDEV);
> + BUILD_BUG_ON(EXOFS_FT_FIFO != FT_FIFO);
> + BUILD_BUG_ON(EXOFS_FT_SOCK != FT_SOCK);
> + BUILD_BUG_ON(EXOFS_FT_SYMLINK != FT_SYMLINK);
> + BUILD_BUG_ON(EXOFS_FT_MAX != FT_MAX);
>  
> + if (de->inode_no) {
>   if (!dir_emit(ctx, de->name, de->name_len,
>   le64_to_cpu(de->inode_no),
> - t)) {
> + fs_dtype(de->file_type))) {
>   exofs_put_page(page);
>   return 0;
>   }
> 



Re: [PATCH] fs/exofs: Remove ignored __weak attribute

2018-10-02 Thread Boaz Harrosh
On 30/09/18 23:51, Nathan Chancellor wrote:
> Clang warns that the __weak attribute is going to be ignored on
> g_attr_inode_data because it's not in the correct location (needs to be
> after the type).
> 
> In file included from fs/exofs/dir.c:35:
> In file included from fs/exofs/exofs.h:41:
> fs/exofs/common.h:186:21: warning: 'weak' attribute only applies to
> variables, functions, and classes [-Wignored-attributes]
> static const struct __weak osd_attr g_attr_inode_data = ATTR_DEF(
> ^
> 
> Turns out that GCC ignores the attribute too albeit silently because
> moving the attribute after either osd_attr or g_attr_inode_data like
> all other uses of __weak on variables in the kernel causes a build
> error on both GCC and Clang because static variables cannot be weak
> since weak definitions rely on not having internal linkage:
> 
> In file included from fs/exofs/namei.c:34:
> In file included from fs/exofs/exofs.h:41:
> fs/exofs/common.h:186:30: error: weak declaration cannot have internal
> linkage
> static const struct osd_attr __weak g_attr_inode_data = ATTR_DEF(
>  ^
> 
> Just remove the attribute because it hasn't been correct since the
> initial addition of this file in commit b14f8ab28449 ("exofs: Kbuild,
> Headers and osd utils").
> 
> Reported-by: Nick Desaulniers 
> Reviewed-by: Nick Desaulniers 

ACK-by: Boaz Harrosh 

Yes! thanks
Boaz

> Signed-off-by: Nathan Chancellor 
> ---
>  fs/exofs/common.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/exofs/common.h b/fs/exofs/common.h
> index 7d88ef566213..45da96a1528d 100644
> --- a/fs/exofs/common.h
> +++ b/fs/exofs/common.h
> @@ -183,7 +183,7 @@ struct exofs_fcb {
>  #define EXOFS_INO_ATTR_SIZE  sizeof(struct exofs_fcb)
>  
>  /* This is the Attribute the fcb is stored in */
> -static const struct __weak osd_attr g_attr_inode_data = ATTR_DEF(
> +static const struct osd_attr g_attr_inode_data = ATTR_DEF(
>   EXOFS_APAGE_FS_DATA,
>   EXOFS_ATTR_INODE_DATA,
>   EXOFS_INO_ATTR_SIZE);
> 



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-22 Thread Boaz Harrosh
On 18/05/18 17:14, Christopher Lameter wrote:
> On Tue, 15 May 2018, Boaz Harrosh wrote:
> 
>>> I don't think page tables work the way you think they work.
>>>
>>> +   err = vm_insert_pfn_prot(zt->vma, zt_addr, pfn, prot);
>>>
>>> That doesn't just insert it into the local CPU's page table.  Any CPU
>>> which directly accesses or even prefetches that address will also get
>>> the translation into its cache.
>>>
>>
>> Yes I know, but that is exactly the point of this flag. I know that this
>> address is only ever accessed from a single core. Because it is an mmap (vma)
>> of an O_TMPFILE-exclusive file created in a core-pinned thread and I allow
>> only that thread any kind of access to this vma. Both the filehandle and the
>> mmaped pointer are kept on the thread stack and have no access from outside.
>>
>> So the all point of this flag is the kernel driver telling mm that this
>> address is enforced to only be accessed from one core-pinned thread.
> 
> But there are no provisions for prohibiting accesses from other cores?
> 
> This means that a casual accidental write from a thread executing on
> another core can lead to arbitrary memory corruption because the cache
> flushing has been bypassed.
> 

No, this is not accurate. A "casual accidental write" will not do any harm.
Only a well concerted malicious server can exploit this. A different thread
on a different core will need to hit the exact time to read from the exact
pointer at the narrow window while the IO is going on, fault in a TLB at the
time of the valid mapping, then later, after the IO has ended and before any
of the threads were scheduled out, maliciously write. All the while the App
has freed its buffers and the buffer was used for something else.
Please bear in mind that this is only as root, in an /sbin/ executable signed
by the Kernel's key. I think that anyone who has gained such an access to the
system (i.e. compiled and installed an /sbin server) can just walk in the front
door.
He does not need to exploit this narrow random hole. Hell, he can easily just
modprobe a Kernel module.

And I do not understand. Everyone is motivated to say "no, it cannot be solved".
So let's start from the beginning.

How can we implement "private memory"?

You know how, in the fork days, we have APIs for "shared memory".

I.e.: all read/write memory defaults to private except specially set up
 "shared memory".
This is vs threads, where all memory regions are shared.

[Q] How can we implement a "private memory" region?
I.e. all read/write memory defaults to shared except specially set up
 "private memory".

Can this be done? How, please advise?

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 17:17, Peter Zijlstra wrote:
<>
>>
>> So I would love some mm guy to explain where are those bits collected?
> 
> Depends on the architecture, some architectures only ever set bits,
> some, like x86, clear bits again. You want to look at switch_mm().
> 
> Basically x86 clears the bit again when we switch away from the mm and
> have/will invalidate TLBs for it in doing so.
> 

Ha, OK I am starting to get a picture.

>> Which brings me to another question. How can I find from
>> within a thread Say at the file_operations->mmap() call that the thread
>> is indeed core-pinned. What mm_cpumask should I inspect?
> 
> is_percpu_thread().

Right thank you a lot Peter. this helps.

Boaz
> .
> 



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 17:18, Matthew Wilcox wrote:
> On Tue, May 15, 2018 at 05:10:57PM +0300, Boaz Harrosh wrote:
>> I'm not a lawyer either but I think I'm doing OK. Because I am doing exactly
>> like FUSE is doing. Only some 15 years later, with modern CPUs in mind. I do 
>> not
>> think I am doing anything new here, am I?
> 
> You should talk to a lawyer.  I'm not giving you legal advice.
> I'm telling you that I think what you're doing is unethical.
> .
> 

Not more unethical than what is already there. And I do not see how
this is unethical at all? I trust your opinion and would really want
to understand.

For example your not-in-c zero-copy Server. How is it unethical?
I have the same problem actually some important parts are not in C.

How is it unethical to want to make this run fast?

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 16:50, Matthew Wilcox wrote:
> On Tue, May 15, 2018 at 04:29:22PM +0300, Boaz Harrosh wrote:
>> On 15/05/18 15:03, Matthew Wilcox wrote:
>>> You're getting dangerously close to admitting that the entire point
>>> of this exercise is so that you can link non-GPL NetApp code into the
>>> kernel in clear violation of the GPL.
>>
>> It is not that at all. What I'm trying to do is enable a zero-copy,
>> synchronous, low latency, low overhead. highly parallel - a new modern
>> interface with application servers.
> 
> ... and fully buzzword compliant.
> 
>> You yourself had such a project that could easily be served out-of-the-box
>> with zufs, of a device that wanted to sit in user-mode.
> 
> For a very different reason.  I think the source code to that project
> is publically available; the problem is that it's not written in C.
> 

Exactly the point, sir. There are many reasons to sit in user-land; for
example, for me it is libraries that cannot be loaded into the Kernel.

>> Sometimes it is very convenient and needed for Servers to sit in
>> user-mode. And this interface allows that. And it is not always
>> a licensing thing. Though yes licensing is also an issue sometimes.
>> It is the reality we are living in.
>>
>> But please indulge me I am curious how the point of signing /sbin/
>> servers, made you think about GPL licensing issues?
>>
>> That said, is your point that as long as user-mode servers are slow
>> they are OK to be supported, but if they are as fast as the kernel
>> (as demonstrated, a zufs-based FS was faster than xfs-dax on the same pmem)
>> then it is a GPL violation?
> 
> No.  Read what Linus wrote:
> 
>NOTE! This copyright does *not* cover user programs that use kernel
>  services by normal system calls - this is merely considered normal use
>  of the kernel, and does *not* fall under the heading of "derived work".
> 
> What you're doing is far beyond that exception.  You're developing in
> concert a userspace and kernel component, and claiming that the GPL does
> not apply to the userspace component.  I'm not a lawyer, but you're on
> very thin ice.
> 

But I am not the first one here, am I? FUSE and other interfaces already do
exactly this, long before I did. Actually, any Kernel interface has some
user-mode component specifically written for it. And again, I am only legally
doing exactly as FUSE is doing, only much faster and, more importantly for me,
highly parallel on all cores. Because from my testing, the biggest problem of
FUSE for me is that it does not scale.

I'm not a lawyer either but I think I'm doing OK. Because I am doing exactly
like FUSE is doing. Only some 15 years later, with modern CPUs in mind. I do not
think I am doing anything new here, am I?

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 15:03, Matthew Wilcox wrote:
> On Tue, May 15, 2018 at 02:41:41PM +0300, Boaz Harrosh wrote:
>> That would be very hard. Because that program would:
>> - need to be root
>> - need to start and pretend it is zus Server with the all mount
>>   thread thing, register new filesystem, grab some pmem devices.
>> - Mount the said filesystem on said pmem. Create core-pinned ZT threads
>>   for all CPUs, start accepting IO.
>> - And only then it can start leaking the pointer and do bad things.
> 
> All of these things you've done for me by writing zus Server.  All I
> have to do now is compromise zus Server.
> 
>>   The bad things it can do to the application, not to the Kernel.
>>   And as a full filesystem it can do those bad things to the application
>>   through the front door directly not needing the mismatch tlb at all.
> 
> That's not true.  When I have a TLB entry that points to a page of kernel
> ram, I can do almost anything, depending on what the kernel decides to
> do with that ram next.  Maybe it's page cache again, in which case I can
> affect whatever application happens to get it allocated.  Maybe it's a
> kmalloc page next, in which case I can affect any part of the kernel.
> Maybe it's a page table, then I can affect any process.
> 
>> That said. It brings up a very important point that I wanted to talk about.
>> In this design the zuf(Kernel) and the zus(um Server) are part of the 
>> distribution.
>> I would like to have the zus module be signed by the distro's Kernel's key 
>> and
>> checked on loadtime. I know there is an effort by Redhat guys to try and 
>> sign all
>> /sbin/* servers and have Kernel check these. So this is not the first time 
>> people
>> have thought about that.
> 
> You're getting dangerously close to admitting that the entire point
> of this exercise is so that you can link non-GPL NetApp code into the
> kernel in clear violation of the GPL.
> 

It is not that at all. What I'm trying to do is enable a zero-copy,
synchronous, low-latency, low-overhead, highly parallel - a new, modern -
interface with application servers.

You yourself had such a project that could easily be served out-of-the-box
with zufs, of a device that wanted to sit in user-mode.

Sometimes it is very convenient and needed for Servers to sit in
user-mode. And this interface allows that. And it is not always
a licensing thing. Though yes licensing is also an issue sometimes.
It is the reality we are living in.

But please indulge me, I am curious: how did the point of signing /sbin/
servers make you think about GPL licensing issues?

That said, is your point that as long as user-mode servers are slow
they are OK to be supported, but if they are as fast as the kernel
(as demonstrated, a zufs-based FS was faster than xfs-dax on the same pmem)
then it is a GPL violation?

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 14:54, Boaz Harrosh wrote:
> On 15/05/18 03:44, Matthew Wilcox wrote:
>> On Mon, May 14, 2018 at 02:49:01PM -0700, Andrew Morton wrote:
>>> On Mon, 14 May 2018 20:28:01 +0300 Boaz Harrosh <bo...@netapp.com> wrote:
>>>> In this project we utilize a per-core server thread so everything
>>>> is kept local. If we use the regular zap_ptes() API All CPU's
>>>> are scheduled for the unmap, though in our case we know that we
>>>> have only used a single core. The regular zap_ptes adds a very big
>>>> latency on every operation and mostly kills the concurrency of the
>>>> over all system. Because it imposes a serialization between all cores
>>>
>>> I'd have thought that in this situation, only the local CPU's bit is
>>> set in the vma's mm_cpumask() and the remote invalidations are not
>>> performed.  Is that a misunderstanding, or is all that stuff not working
>>> correctly?
>>
>> I think you misunderstand Boaz's architecture.  He has one thread per CPU,
>> so every bit will be set in the mm's (not vma's) mm_cpumask.
>>
> 
> Hi Andrew, Matthew
> 
> Yes I have been trying to investigate and trace this for days.
> Please see the code below:
> 
>> #define flush_tlb_range(vma, start, end) \
>>  flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)
> 
> The mm_struct @mm below comes from here vma->vm_mm
> 
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index e055d1a..1d398a0 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -611,39 +611,40 @@ static unsigned long tlb_single_page_flush_ceiling 
>> __read_mostly = 33;
>>  void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>>  unsigned long end, unsigned long vmflag)
>>  {
>>  int cpu;
>>  
>>  struct flush_tlb_info info __aligned(SMP_CACHE_BYTES) = {
>>  .mm = mm,
>>  };
>>  
>>  cpu = get_cpu();
>>  
>>  /* This is also a barrier that synchronizes with switch_mm(). */
>>  info.new_tlb_gen = inc_mm_tlb_gen(mm);
>>  
>>  /* Should we flush just the requested range? */
>>  if ((end != TLB_FLUSH_ALL) &&
>>  !(vmflag & VM_HUGETLB) &&
>>  ((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) {
>>  info.start = start;
>>  info.end = end;
>>  } else {
>>  info.start = 0UL;
>>  info.end = TLB_FLUSH_ALL;
>>  }
>>  
>>  if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
>>  VM_WARN_ON(irqs_disabled());
>>  local_irq_disable();
>>  flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
>>  local_irq_enable();
>>  }
>>  
>> -if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>> +if (!(vmflag & VM_LOCAL_CPU) &&
>> +cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>>  flush_tlb_others(mm_cpumask(mm), &info);
>>  
> 
> I have been tracing the mm_cpumask(vma->vm_mm) at my driver at
> different points. At vma creation (file_operations->mmap()), 
> and before the call to insert_pfn (which calls here).
> 
> At the beginning I was wishful thinking that the mm_cpumask(vma->vm_mm)
> should have a single bit set just as the affinity of the thread on
> creation of that thread. But then I saw that at %80 of the times some
> other random bits are also set.
> 
> Yes Random. Always the thread affinity (single) bit was set but
> then zero, one, or two more bits were set as well. Never seen more than
> two though, which baffles me a lot.
> 
> If it was like Matthew said, i.e. the cpumask of the whole process,
> then I would expect all the bits to be set. Because I have a thread
> on each core. And also I would even expect that all vma->vm_mm
> or maybe mm_cpumask(vma->vm_mm) to point to the same global object.
> But it was not so. it was pointing to some thread unique object but
> still those phantom bits were set all over. (And I am almost sure
> same vma had those bits change over time)
> 
> So I would love some mm guy to explain where are those bits collected?
> But I do not think this is a Kernel bug because as Matthew showed.
> that vma above can and is allowed to leak memory addresses to other
> threads / cores in the same process. So it appears that the Kernel
> has some correct logic behind its madness.
> 
Hi Mark

So you see, 20% of the time the mm_cpumask(vma->vm_mm) has only a single
bit set, and a natural call to flush_tlb_range will then only invalidate
the local cpu. Are you familiar with this logic?

> Which brings me to another question. How can I find from
> within a thread Say at the file_operations->mmap() call that the thread
> is indeed core-pinned. What mm_cpumask should I inspect?
> 

Mark, Peter do you know how I can check the above?

Thanks
Boaz

>>  put_cpu();
>>  }
> 
> Thanks
> Boaz
> 



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 15:07, Mark Rutland wrote:
> On Tue, May 15, 2018 at 01:43:23PM +0300, Boaz Harrosh wrote:
>> On 15/05/18 03:41, Matthew Wilcox wrote:
>>> On Mon, May 14, 2018 at 10:37:38PM +0300, Boaz Harrosh wrote:
>>>> On 14/05/18 22:15, Matthew Wilcox wrote:
>>>>> On Mon, May 14, 2018 at 08:28:01PM +0300, Boaz Harrosh wrote:
>>>>>> On a call to mmap an mmap provider (like an FS) can put
>>>>>> this flag on vma->vm_flags.
>>>>>>
>>>>>> The VM_LOCAL_CPU flag tells the Kernel that the vma will be used
>>>>>> from a single-core only, and therefore invalidation (flush_tlb) of
>>>>>> PTE(s) need not be a wide CPU scheduling.
>>>>>
>>>>> I still don't get this.  You're opening the kernel up to being exploited
>>>>> by any application which can persuade it to set this flag on a VMA.
>>>>>
>>>>
>>>> No No this is not an application accessible flag this can only be set
>>>> by the mmap implementor at ->mmap() time (Say same as VM_VM_MIXEDMAP).
>>>>
>>>> Please see the zuf patches for usage (Again apologise for pushing before
>>>> a user)
>>>>
>>>> The mmap provider has all the facilities to know that this can not be
>>>> abused, not even by a trusted Server.
>>>
>>> I don't think page tables work the way you think they work.
>>>
>>> +   err = vm_insert_pfn_prot(zt->vma, zt_addr, pfn, prot);
>>>
>>> That doesn't just insert it into the local CPU's page table.  Any CPU
>>> which directly accesses or even prefetches that address will also get
>>> the translation into its cache.
>>>
>>
>> Yes I know, but that is exactly the point of this flag. I know that this
>> address is only ever accessed from a single core. Because it is an mmap (vma)
>> of an O_TMPFILE-exclusive file created in a core-pinned thread and I allow
>> only that thread any kind of access to this vma. Both the filehandle and the
>> mmaped pointer are kept on the thread stack and have no access from outside.
> 
> Even if (in the specific context of your application) software on other
> cores might not explicitly access this area, that does not prevent
> allocations into TLBs, and TLB maintenance *cannot* be elided.
> 
> Even assuming that software *never* explicitly accesses an address which
> it has not mapped is insufficient.
> 
> For example, imagine you have two threads, each pinned to a CPU, and
> some local_cpu_{mmap,munmap} which uses your new flag:
> 
>   CPU0CPU1
>   x = local_cpu_mmap(...);
>   do_things_with(x);
>   // speculatively allocates TLB
>   // entries for X.
> 
>   // only invalidates local TLBs
>   local_cpu_munmap(x);
> 
>   // TLB entries for X still live
>   
>   y = local_cpu_mmap(...);
> 
>   // if y == x, we can hit the

But this y == x is not possible. The x here is held throughout the
lifetime of the CPU0-pinned thread, and cannot be allocated later to
another thread.

In fact if that file holding the VMA closes we know the server
crashed and we cleanly close everything.
(Including properly zapping all maps)

>   // stale TLB entry, and access
>   // the wrong page
>   do_things_with(y);
> 

So even if the CPU prefetched that TLB entry, no one in the system will use
this address until a proper close, where everything is properly flushed.

> Consider that after we free x, the kernel could reuse the page for any
> purpose (e.g. kernel page tables), so this is a major risk.
> 

So yes, we never free x. We hold it for the entire duration of the ZT-thread
(ZThread is that core-pinned per-core thread we are using),
and each time we map some application buffers into that vma and do a
local-TLB invalidate when done.

When x is de-allocated, due to a close or a crash, it is all properly zapped.

> This flag simply is not safe, unless the *entire* mm is only ever
> accessed from a single CPU. In that case, we don't need the flag anyway,
> as the mm already has a cpumask.
> 

Did you please see the other part of the thread, and my answer to Andrew,
about why the vma->vm_mm cpumask is so weird? It is neither always all bits set
nor always a single bit set. It is very common (20% of the time) for
mm_cpumask(vma->vm_mm) to be a single bit set, exactly in an application where
I have a thread per core. Could you please look at that? (I'll ping you from
that email)

> Thanks,
> Mark.
> 

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 15:09, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 02:41:41PM +0300, Boaz Harrosh wrote:
>> On 15/05/18 14:11, Matthew Wilcox wrote:
> 
>>> You're still thinking about this from the wrong perspective.  If you
>>> were writing a program to attack this facility, how would you do it?
>>> It's not exactly hard to leak one pointer's worth of information.
>>>
>>
>> That would be very hard. Because that program would:
>> - need to be root
>> - need to start and pretend it is zus Server with the all mount
>>   thread thing, register new filesystem, grab some pmem devices.
>> - Mount the said filesystem on said pmem. Create core-pinned ZT threads
>>   for all CPUs, start accepting IO.
>> - And only then it can start leaking the pointer and do bad things.
>>   The bad things it can do to the application, not to the Kernel.
> 
> No I think you can do bad things to the kernel at that point. Consider
> it populating the TLBs on the 'wrong' CPU by 'inadvertenly' touching
> 'random' memory.
> 
> Then cause an invalidation and get the page re-used for kernel bits.
> 
> Then access that page through the 'stale' TLB entry we still have on the
> 'wrong' CPU and corrupt kernel data.
> 

Yes, a BAD filesystem server can do bad things, I agree. But a filesystem can
do very bad things in any case, through the front door, no? And we trust
it with our data. So there is some trust we already put in a filesystem, I think.

I will try to look at this deeper and see if I can actually enforce this policy.
Do you have any ideas? Can I force page faults on the other cores?

Thank you for looking
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 14:47, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 01:43:23PM +0300, Boaz Harrosh wrote:
>> Yes I know, but that is exactly the point of this flag. I know that this
>> address is only ever accessed from a single core. Because it is an mmap (vma)
>> of an O_TMPFILE-exclusive file created in a core-pinned thread and I allow
>> only that thread any kind of access to this vma. Both the filehandle and the
>> mmaped pointer are kept on the thread stack and have no access from outside.
>>
>> So the all point of this flag is the kernel driver telling mm that this
>> address is enforced to only be accessed from one core-pinned thread.
> 
> What happens when the userspace part -- there is one, right, how else do
> you get an mm to stick a vma in -- simply does a full address range
> probe scan?
> 
> Something like this really needs a far more detailed Changelog that
> explains how its to be used and how it is impossible to abuse. Esp. that
> latter is _very_ important.
> 

Thank you, yes. I will try to capture all of this thread in the commit message
and, as Christoph demanded, supply user code to demonstrate usage.

Thank you for looking
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 03:44, Matthew Wilcox wrote:
> On Mon, May 14, 2018 at 02:49:01PM -0700, Andrew Morton wrote:
>> On Mon, 14 May 2018 20:28:01 +0300 Boaz Harrosh <bo...@netapp.com> wrote:
>>> In this project we utilize a per-core server thread so everything
>>> is kept local. If we use the regular zap_ptes() API All CPU's
>>> are scheduled for the unmap, though in our case we know that we
>>> have only used a single core. The regular zap_ptes adds a very big
>>> latency on every operation and mostly kills the concurrency of the
>>> over all system. Because it imposes a serialization between all cores
>>
>> I'd have thought that in this situation, only the local CPU's bit is
>> set in the vma's mm_cpumask() and the remote invalidations are not
>> performed.  Is that a misunderstanding, or is all that stuff not working
>> correctly?
> 
> I think you misunderstand Boaz's architecture.  He has one thread per CPU,
> so every bit will be set in the mm's (not vma's) mm_cpumask.
> 

Hi Andrew, Matthew

Yes I have been trying to investigate and trace this for days.
Please see the code below:

> #define flush_tlb_range(vma, start, end)  \
>   flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)

The mm_struct @mm below comes from here vma->vm_mm

> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index e055d1a..1d398a0 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -611,39 +611,40 @@ static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>  void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>   unsigned long end, unsigned long vmflag)
>  {
>   int cpu;
>  
>   struct flush_tlb_info info __aligned(SMP_CACHE_BYTES) = {
>   .mm = mm,
>   };
>  
>   cpu = get_cpu();
>  
>   /* This is also a barrier that synchronizes with switch_mm(). */
>   info.new_tlb_gen = inc_mm_tlb_gen(mm);
>  
>   /* Should we flush just the requested range? */
>   if ((end != TLB_FLUSH_ALL) &&
>   !(vmflag & VM_HUGETLB) &&
>   ((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) {
>   info.start = start;
>   info.end = end;
>   } else {
>   info.start = 0UL;
>   info.end = TLB_FLUSH_ALL;
>   }
>  
>   if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
>   VM_WARN_ON(irqs_disabled());
>   local_irq_disable();
>   flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
>   local_irq_enable();
>   }
>  
> - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> + if (!(vmflag & VM_LOCAL_CPU) &&
> + cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>   flush_tlb_others(mm_cpumask(mm), &info);
>  

I have been tracing the mm_cpumask(vma->vm_mm) at my driver at
different points. At vma creation (file_operations->mmap()), 
and before the call to insert_pfn (which calls here).

At the beginning I was hoping that mm_cpumask(vma->vm_mm)
would have a single bit set, matching the affinity the thread was given at
creation. But then I saw that about 80% of the time some
other random bits are also set.

Yes, random. The thread-affinity (single) bit was always set, but
then zero, one, or two more bits were set as well. I have never seen more than
two though, which baffles me a lot.

If it was as Matthew said, i.e. the cpumask of the whole process,
then I would expect all the bits to be set, because I have a thread
on each core. I would also expect all vma->vm_mm,
or at least mm_cpumask(vma->vm_mm), to point to the same global object.
But it was not so: it was pointing to some thread-unique object, yet
still those phantom bits were set all over. (And I am almost sure the
same vma had those bits change over time.)

So I would love for some mm person to explain where those bits are collected.
But I do not think this is a kernel bug because, as Matthew showed,
the vma above can and is allowed to leak memory addresses to other
threads/cores in the same process. So it appears that the kernel
has some correct logic behind its madness.

Which brings me to another question: how can I find out from
within a thread, say at the file_operations->mmap() call, that the thread
is indeed core-pinned? What mm_cpumask should I inspect?

>   put_cpu();
>  }

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 14:11, Matthew Wilcox wrote:
> On Tue, May 15, 2018 at 01:43:23PM +0300, Boaz Harrosh wrote:
>> On 15/05/18 03:41, Matthew Wilcox wrote:
>>> On Mon, May 14, 2018 at 10:37:38PM +0300, Boaz Harrosh wrote:
>>>> On 14/05/18 22:15, Matthew Wilcox wrote:
>>>>> On Mon, May 14, 2018 at 08:28:01PM +0300, Boaz Harrosh wrote:
>>>>>> On a call to mmap an mmap provider (like an FS) can put
>>>>>> this flag on vma->vm_flags.
>>>>>>
>>>>>> The VM_LOCAL_CPU flag tells the Kernel that the vma will be used
>>>>>> from a single-core only, and therefore invalidation (flush_tlb) of
>>>>>> PTE(s) need not be a wide CPU scheduling.
>>>>>
>>>>> I still don't get this.  You're opening the kernel up to being exploited
>>>>> by any application which can persuade it to set this flag on a VMA.
>>>>>
>>>>
>>>> No No this is not an application accessible flag this can only be set
>>>> by the mmap implementor at ->mmap() time (Say same as VM_VM_MIXEDMAP).
>>>>
>>>> Please see the zuf patches for usage (Again apologise for pushing before
>>>> a user)
>>>>
>>>> The mmap provider has all the facilities to know that this can not be
>>>> abused, not even by a trusted Server.
>>>
>>> I don't think page tables work the way you think they work.
>>>
>>> +   err = vm_insert_pfn_prot(zt->vma, zt_addr, pfn, prot);
>>>
>>> That doesn't just insert it into the local CPU's page table.  Any CPU
>>> which directly accesses or even prefetches that address will also get
>>> the translation into its cache.
>>
>> Yes I know, but that is exactly the point of this flag. I know that this
>> address is only ever accessed from a single core. Because it is an mmap (vma)
>> of an O_TMPFILE-exclusive file created in a core-pinned thread and I allow
>> only that thread any kind of access to this vma. Both the filehandle and the
>> mmaped pointer are kept on the thread stack and have no access from outside.
>>
>> So the all point of this flag is the kernel driver telling mm that this
>> address is enforced to only be accessed from one core-pinned thread.
> 
> You're still thinking about this from the wrong perspective.  If you
> were writing a program to attack this facility, how would you do it?
> It's not exactly hard to leak one pointer's worth of information.
> 

That would be very hard, because that program would:
- need to be root
- need to start and pretend it is the zus server, with the whole mount-thread
  thing: register a new filesystem, grab some pmem devices.
- Mount the said filesystem on said pmem. Create core-pinned ZT threads
  for all CPUs, start accepting IO.
- And only then can it start leaking the pointer and doing bad things.
  The bad things it can do are to the application, not to the kernel.
  And as a full filesystem it can do those bad things to the application
  through the front door directly, not needing the mismatched TLB at all.

That said, it brings up a very important point that I wanted to talk about.
In this design the zuf (kernel) and the zus (user-mode server) are part of the
distribution. I would like the zus module to be signed by the distro's kernel
key and checked at load time. I know there is an effort by Red Hat folks to
sign all /sbin/* servers and have the kernel check these, so this is not the
first time people have thought about that.

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 10:08, Christoph Hellwig wrote:
> On Mon, May 14, 2018 at 09:26:13PM +0300, Boaz Harrosh wrote:
>> I am please pushing for this patch ahead of the push of ZUFS, because
>> this is the only patch we need from otherwise an STD Kernel.
>>
>> We are partnering with Distro(s) to push ZUFS out-of-tree to beta clients
>> to try and stabilize such a big project before final submission and
>> an ABI / on-disk freeze.
>>
> 
> Please stop this crap.  If you want any new kernel functionality send
> it together with a user.  Even more so for something as questionanble
> and hairy as this.
> 
> With a stance like this you disqualify yourself.
> 

OK, thank you, I see your point. I will try to push a user ASAP.

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-15 Thread Boaz Harrosh
On 15/05/18 03:41, Matthew Wilcox wrote:
> On Mon, May 14, 2018 at 10:37:38PM +0300, Boaz Harrosh wrote:
>> On 14/05/18 22:15, Matthew Wilcox wrote:
>>> On Mon, May 14, 2018 at 08:28:01PM +0300, Boaz Harrosh wrote:
>>>> On a call to mmap an mmap provider (like an FS) can put
>>>> this flag on vma->vm_flags.
>>>>
>>>> The VM_LOCAL_CPU flag tells the Kernel that the vma will be used
>>>> from a single-core only, and therefore invalidation (flush_tlb) of
>>>> PTE(s) need not be a wide CPU scheduling.
>>>
>>> I still don't get this.  You're opening the kernel up to being exploited
>>> by any application which can persuade it to set this flag on a VMA.
>>>
>>
>> No No this is not an application accessible flag this can only be set
>> by the mmap implementor at ->mmap() time (Say same as VM_VM_MIXEDMAP).
>>
>> Please see the zuf patches for usage (Again apologise for pushing before
>> a user)
>>
>> The mmap provider has all the facilities to know that this can not be
>> abused, not even by a trusted Server.
> 
> I don't think page tables work the way you think they work.
> 
> +   err = vm_insert_pfn_prot(zt->vma, zt_addr, pfn, prot);
> 
> That doesn't just insert it into the local CPU's page table.  Any CPU
> which directly accesses or even prefetches that address will also get
> the translation into its cache.
> 

Yes I know, but that is exactly the point of this flag. I know that this
address is only ever accessed from a single core, because it is an mmap (vma)
of an O_TMPFILE-exclusive file created in a core-pinned thread, and I allow
only that thread any kind of access to this vma. Both the file handle and the
mmapped pointer are kept on the thread stack and are not accessible from outside.

So the whole point of this flag is the kernel driver telling mm that this
address is enforced to only ever be accessed from one core-pinned thread.

Thanks
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-14 Thread Boaz Harrosh
On 14/05/18 22:15, Matthew Wilcox wrote:
> On Mon, May 14, 2018 at 08:28:01PM +0300, Boaz Harrosh wrote:
>> On a call to mmap an mmap provider (like an FS) can put
>> this flag on vma->vm_flags.
>>
>> The VM_LOCAL_CPU flag tells the Kernel that the vma will be used
>> from a single-core only, and therefore invalidation (flush_tlb) of
>> PTE(s) need not be a wide CPU scheduling.
> 
> I still don't get this.  You're opening the kernel up to being exploited
> by any application which can persuade it to set this flag on a VMA.
> 

No, no, this is not an application-accessible flag; it can only be set
by the mmap implementor at ->mmap() time (say, same as VM_MIXEDMAP).

Please see the zuf patches for usage (again, apologies for pushing before
a user).

The mmap provider has all the facilities to know that this cannot be
abused, not even by a trusted server.

>> NOTE: This vma (VM_LOCAL_CPU) is never used during a page_fault. It is
>> always used in a synchronous way from a thread pinned to a single core.
> 
> It's not a question of how your app is going to use this flag.  It's a
> question about how another app can abuse this flag (or how your app is
> going to be exploited to abuse this flag) to break into the kernel.
> 

If you look at the zuf user you will see that the faults all return
SIGBUS; these can never fault. The server has access to this mapping
only from a single thread pinned to a core.

Again, it is not an app-visible flag in any way.

Thanks for looking
Boaz



Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-14 Thread Boaz Harrosh
On 14/05/18 20:28, Boaz Harrosh wrote:
> 
> On a call to mmap an mmap provider (like an FS) can put
> this flag on vma->vm_flags.
> 
> The VM_LOCAL_CPU flag tells the Kernel that the vma will be used
> from a single-core only, and therefore invalidation (flush_tlb) of
> PTE(s) need not be a wide CPU scheduling.
> 
> The motivation of this flag is the ZUFS project where we want
> to optimally map user-application buffers into a user-mode-server
> execute the operation and efficiently unmap.
> 

I am pushing for this patch ahead of the push of ZUFS, because
this is the only patch we need on top of an otherwise standard kernel.

We are partnering with distro(s) to push ZUFS out-of-tree to beta clients,
to try to stabilize such a big project before final submission and
an ABI / on-disk freeze.

By itself this patch has zero risk and cannot break anything.

Thanks
Boaz

> In this project we utilize a per-core server thread so everything
> is kept local. If we use the regular zap_ptes() API All CPU's
> are scheduled for the unmap, though in our case we know that we
> have only used a single core. The regular zap_ptes adds a very big
> latency on every operation and mostly kills the concurrency of the
> over all system. Because it imposes a serialization between all cores
> 
> Some preliminary measurements on a 40-core machine:
> 
>                unpatched              patched
> Threads     Op/s    Lat [us]      Op/s    Lat [us]
> 
>  1        185391      4.9       200799      4.6
>  2        197993      9.6       314321      5.9
>  4        310597     12.1       565574      6.6
>  8        546702     13.8      1113138      6.6
> 12        641728     17.2      1598451      6.8
> 18        744750     22.2      1648689      7.8
> 24        790805     28.3      1702285      8
> 36        849763     38.9      1783346     13.4
> 48        792000     44.6      1741873     17.4
> We can clearly see that on an unpatched Kernel we do not scale
> and the threads are interfering with each other. This is because
> flush-tlb is scheduled on all (other) CPUs.
> 
> NOTE: This vma (VM_LOCAL_CPU) is never used during a page_fault. It is
> always used in a synchronous way from a thread pinned to a single core.
> 
> Signed-off-by: Boaz Harrosh <bo...@netapp.com>
> ---
>  arch/x86/mm/tlb.c  |  3 ++-
>  fs/proc/task_mmu.c |  3 +++
>  include/linux/mm.h |  3 +++
>  mm/memory.c        | 13 +++++++++++--
>  4 files changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index e055d1a..1d398a0 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -640,7 +640,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned 
> long start,
>   local_irq_enable();
>   }
>  
> - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> + if (!(vmflag & VM_LOCAL_CPU) &&
> + cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
>   flush_tlb_others(mm_cpumask(mm), &info);
>  
>   put_cpu();
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index c486ad4..305d6e4 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -680,6 +680,9 @@ static void show_smap_vma_flags(struct seq_file *m, 
> struct vm_area_struct *vma)
>   [ilog2(VM_PKEY_BIT2)]   = "",
>   [ilog2(VM_PKEY_BIT3)]   = "",
>  #endif
> +#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
> + [ilog2(VM_LOCAL_CPU)]   = "lc",
> +#endif
>   };
>   size_t i;
>  
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1ac1f06..3d14107 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -226,6 +226,9 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_HIGH_ARCH_2   BIT(VM_HIGH_ARCH_BIT_2)
>  #define VM_HIGH_ARCH_3   BIT(VM_HIGH_ARCH_BIT_3)
>  #define VM_HIGH_ARCH_4   BIT(VM_HIGH_ARCH_BIT_4)
> +#define VM_LOCAL_CPU BIT(37) /* FIXME: Needs to move from here */
> +#else /* ! CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
> +#define VM_LOCAL_CPU 0   /* FIXME: Needs to move from here */
>  #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>  
>  #if defined(CONFIG_X86)
> diff --git a/mm/memory.c b/mm/memory.c
> index 01f5464..6236f5e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1788,6 +1788,7 @@ static int insert_pfn(struct vm_area_struct *vma, 
> unsigned long addr,
>   int retval;
>   pte_t *pte, entry;
>   spinlock_t *ptl;
> + bool need_flush = false;
>  
>   retval = -ENOMEM;
>   pte = get_locked_pte(mm, addr, &ptl);
> @@ -1795,7 +1796,12 @@ static int insert_pfn(struct vm_area_struct *vma, 
> unsigned long addr,


[PATCH] mm: Add new vma flag VM_LOCAL_CPU

2018-05-14 Thread Boaz Harrosh

On a call to mmap an mmap provider (like an FS) can put
this flag on vma->vm_flags.

The VM_LOCAL_CPU flag tells the kernel that the vma will be used
from a single core only, and therefore invalidation (flush_tlb) of
its PTE(s) need not be scheduled across all CPUs.

The motivation of this flag is the ZUFS project where we want
to optimally map user-application buffers into a user-mode-server
execute the operation and efficiently unmap.

In this project we utilize a per-core server thread so everything
is kept local. If we use the regular zap_ptes() API, all CPUs
are scheduled for the unmap, though in our case we know that we
have only used a single core. The regular zap_ptes adds a very big
latency to every operation and mostly kills the concurrency of the
overall system, because it imposes serialization between all cores.

Some preliminary measurements on a 40-core machine:

            unpatched             patched
Threads    Op/s      Lat [us]    Op/s      Lat [us]

  1         185391    4.9          200799   4.6
  2         197993    9.6          314321   5.9
  4         310597   12.1          565574   6.6
  8         546702   13.8         1113138   6.6
 12         641728   17.2         1598451   6.8
 18         744750   22.2         1648689   7.8
 24         790805   28.3         1702285   8
 36         849763   38.9         1783346  13.4
 48         792000   44.6         1741873  17.4

We can clearly see that on an unpatched kernel we do not scale
and the threads interfere with each other. This is because
flush-tlb is scheduled on all (other) CPUs.

NOTE: This vma (VM_LOCAL_CPU) is never used during a page_fault. It is
always used in a synchronous way from a thread pinned to a single core.

Signed-off-by: Boaz Harrosh <bo...@netapp.com>
---
 arch/x86/mm/tlb.c  |  3 ++-
 fs/proc/task_mmu.c |  3 +++
 include/linux/mm.h |  3 +++
 mm/memory.c| 13 +++--
 4 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e055d1a..1d398a0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -640,7 +640,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long 
start,
local_irq_enable();
}
 
-   if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
+   if (!(vmflag & VM_LOCAL_CPU) &&
+   cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), &info);
 
put_cpu();
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c486ad4..305d6e4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -680,6 +680,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct 
vm_area_struct *vma)
[ilog2(VM_PKEY_BIT2)]   = "",
[ilog2(VM_PKEY_BIT3)]   = "",
 #endif
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+   [ilog2(VM_LOCAL_CPU)]   = "lc",
+#endif
};
size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ac1f06..3d14107 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -226,6 +226,9 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_LOCAL_CPU   BIT(37) /* FIXME: Needs to move from here */
+#else /* ! CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
+#define VM_LOCAL_CPU   0   /* FIXME: Needs to move from here */
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #if defined(CONFIG_X86)
diff --git a/mm/memory.c b/mm/memory.c
index 01f5464..6236f5e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1788,6 +1788,7 @@ static int insert_pfn(struct vm_area_struct *vma, 
unsigned long addr,
int retval;
pte_t *pte, entry;
spinlock_t *ptl;
+   bool need_flush = false;
 
retval = -ENOMEM;
pte = get_locked_pte(mm, addr, &ptl);
@@ -1795,7 +1796,12 @@ static int insert_pfn(struct vm_area_struct *vma, 
unsigned long addr,
goto out;
retval = -EBUSY;
if (!pte_none(*pte)) {
-   if (mkwrite) {
+   if ((vma->vm_flags & VM_LOCAL_CPU)) {
+   /* VM_LOCAL_CPU is set, A single CPU is allowed to not
+* go through zap_vma_ptes before changing a pte
+*/
+   need_flush = true;
+   } else if (mkwrite) {
/*
 * For read faults on private mappings the PFN passed
 * in may not match the PFN we have mapped if the
@@ -1807,8 +1813,9 @@ static int insert_pfn(struct vm_area_struct *vma, 
unsigned long addr,
goto out_unlock;
entry = *pte;
goto out_mkwrite;
-   } else
+   } else {
goto out_u

Re: [PATCH] scsi: libosd: Remove VLA usage

2018-05-13 Thread Boaz Harrosh
On 03/05/18 01:55, Kees Cook wrote:
> On the quest to remove all VLAs from the kernel[1] this rearranges the
> code to avoid a VLA warning under -Wvla (gcc doesn't recognize "const"
> variables as not triggering VLA creation). Additionally cleans up variable
> naming to avoid 80 character column limit.
> 
> [1] 
> https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qpxydaacu1rq...@mail.gmail.com
> 

ACK-BY: Boaz Harrosh <o...@electrozaur.com>

> Signed-off-by: Kees Cook <keesc...@chromium.org>
> ---
>  drivers/scsi/osd/osd_initiator.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/scsi/osd/osd_initiator.c 
> b/drivers/scsi/osd/osd_initiator.c
> index e18877177f1b..917a86a2ae8c 100644
> --- a/drivers/scsi/osd/osd_initiator.c
> +++ b/drivers/scsi/osd/osd_initiator.c
> @@ -1842,14 +1842,14 @@ int osd_req_decode_sense_full(struct osd_request *or,
>   case osd_sense_response_integrity_check:
>   {
>   struct osd_sense_response_integrity_check_descriptor
> - *osricd = cur_descriptor;
> - const unsigned len =
> -   sizeof(osricd->integrity_check_value);
> - char key_dump[len*4 + 2]; /* 2nibbles+space+ASCII */
> -
> - hex_dump_to_buffer(osricd->integrity_check_value, len,
> -32, 1, key_dump, sizeof(key_dump), true);
> - OSD_SENSE_PRINT2("response_integrity [%s]\n", key_dump);
> + *d = cur_descriptor;
> + /* 2nibbles+space+ASCII */
> + char dump[sizeof(d->integrity_check_value) * 4 + 2];
> +
> + hex_dump_to_buffer(d->integrity_check_value,
> + sizeof(d->integrity_check_value),
> + 32, 1, dump, sizeof(dump), true);
> + OSD_SENSE_PRINT2("response_integrity [%s]\n", dump);
>   }
>   case osd_sense_attribute_identification:
>   {
> 



Re: [PATCH 02/11 linux-next] exofs: use magic.h

2017-05-23 Thread Boaz Harrosh
On 05/21/2017 06:40 PM, Fabian Frederick wrote:
> Filesystems generally use SUPER_MAGIC values from magic.h
> instead of a local definition.
> 

It is fine by me to move to magic.h, but ...

> Signed-off-by: Fabian Frederick 
> ---
>  fs/exofs/common.h  | 6 +-
>  include/uapi/linux/magic.h | 1 +
>  2 files changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/exofs/common.h b/fs/exofs/common.h
> index 7d88ef5..fb988f9 100644
> --- a/fs/exofs/common.h
> +++ b/fs/exofs/common.h
> @@ -37,6 +37,7 @@
>  #define __EXOFS_COM_H__
>  
>  #include 
> +#include <linux/magic.h>
>  
>  #include 
>  #include 
> @@ -79,11 +80,6 @@ enum {
>  #define EXOFS_BLKSHIFT   12
>  #define EXOFS_BLKSIZE(1UL << EXOFS_BLKSHIFT)
>  
> -/****************************************************************
> - * superblock-related things
> - ****************************************************************/

Please do not remove this comment, it acts as a separator and is related to
everything up to the comment that says:
 * inode-related things

And not only to this definition

Thanks
Boaz

> -#define EXOFS_SUPER_MAGIC	0x5DF5
> -
>  /*
>   * The file system control block - stored in object EXOFS_SUPER_ID's data.
>   * This is where the in-memory superblock is stored on disk.
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index e3bb43b..1265355 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -26,6 +26,7 @@
>  #define EXT2_SUPER_MAGIC 0xEF53
>  #define EXT3_SUPER_MAGIC 0xEF53
>  #define EXT4_SUPER_MAGIC 0xEF53
> +#define EXOFS_SUPER_MAGIC	0x5DF5
>  #define F2FS_SUPER_MAGIC 0xF2F52010
>  #define FUTEXFS_SUPER_MAGIC  0xBAD1DEA
>  #define HOSTFS_SUPER_MAGIC   0x00c0ffee
> 



Re: [PATCH, RFC] MAINTAINERS: update OSD entries

2017-05-03 Thread Boaz Harrosh
On 05/02/2017 02:33 PM, Jeff Layton wrote:
> On Tue, 2017-05-02 at 09:57 +0200, Christoph Hellwig wrote:
>> The open-osd domain doesn't exist anymore, and mails to the list lead
>> to really annoying bounces that repeat every day.
>>
>> Also the primarydata address for Benny bounces, and while I have a new
>> one for him he doesn't seem to be maintaining the OSD code any more.
>>
>> Which begs the question: should we really leave the Supported status
>> in MAINTAINERS given that the code is barely maintained?
>>
>> Signed-off-by: Christoph Hellwig <h...@lst.de>
>> ---
>>  MAINTAINERS | 4 
>>  1 file changed, 4 deletions(-)
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 1bb06c5f7716..28dd83a1d9e2 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -9418,10 +9418,6 @@ F:drivers/net/wireless/intersil/orinoco/
>>  
>>  OSD LIBRARY and FILESYSTEM
>>  M:  Boaz Harrosh <o...@electrozaur.com>
>> -M:  Benny Halevy <bhal...@primarydata.com>
>> -L:  osd-...@open-osd.org
>> -W:  http://open-osd.org
>> -T:  git git://git.open-osd.org/open-osd.git
>>  S:  Maintained
>>  F:  drivers/scsi/osd/
>>  F:  include/scsi/osd_*
> 
> Hah, you beat me to it! I was going to spin up a patch for this today.
> 
> Acked-by: Jeff Layton <jlay...@redhat.com>

Acked-by: Boaz Harrosh <o...@electrozaur.com>

> 



Re: [PATCH] ore: fix spelling mistake: "Multples" -> "multiples"

2017-04-23 Thread Boaz Harrosh
On 04/22/2017 03:48 PM, Colin King wrote:
> From: Colin Ian King <colin.k...@canonical.com>
> 
> trivial fix to spelling mistake in ORE_ERR message and make word all
> lower case.
> 
> Signed-off-by: Colin Ian King <colin.k...@canonical.com>

Thanks
ACK-by: Boaz Harrosh <o...@electrozaur.com>

> ---
>  fs/exofs/ore.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
> index 8bb72807e70d..811522ae45e1 100644
> --- a/fs/exofs/ore.c
> +++ b/fs/exofs/ore.c
> @@ -68,7 +68,7 @@ int ore_verify_layout(unsigned total_comps, struct 
> ore_layout *layout)
>   }
>   if (0 != (layout->stripe_unit & ~PAGE_MASK)) {
>   ORE_ERR("Stripe Unit(0x%llx)"
> -   " must be Multples of PAGE_SIZE(0x%lx)\n",
> +   " must be multiples of PAGE_SIZE(0x%lx)\n",
> _LLU(layout->stripe_unit), PAGE_SIZE);
>   return -EINVAL;
>   }
> 



Re: [PATCH 3/4] nfs: remove the objlayout driver

2017-04-23 Thread Boaz Harrosh
On 04/21/2017 05:00 PM, Trond Myklebust wrote:
> Maintenance is not development. It’s about doing all the followup
> _after_ the feature is declared to be developed. That’s been missing
> for quite some time in the case of the OSD pNFS code, which is why
> I’m not even bothering to consider staging. If you are saying you are
> still maintaining exofs, and want to continue doing so, then great,
> but note that there is a file called BUGS in that directory that has
> been untouched since 2008, and that’s why I think staging is a good
> idea.
> 

No, the BUGS file is just stale. As you said, it was never updated. All
the bugs (1) in there have not existed for a long, long time.
I will send a patch to remove the file.

Yes, I maintain this fs. It has the complete feature list of what was
first intended. I keep running it every major kernel release to make
sure it actually runs, and there are no regressions.

If this FS stands in the way of any new development, please contact
me and I will help with the conversion of both exofs and the
osd-uld.

Thanks
Boaz



Re: [PATCH 3/4] nfs: remove the objlayout driver

2017-04-21 Thread Boaz Harrosh
On 04/20/2017 11:18 PM, Trond Myklebust wrote:
<>
> 
> OK. I'm applying this patch for the 4.12 merge window. 

That is understandable; this code has not been tested for a long while.

> If, as Boaz
> suggests, there is still an interest in exofs, then I suggest we put
> that to the test by moving it into the STAGING area, to see if someone
> will step up to maintain it.
> 

I do not understand what you mean by "someone will maintain it".
What is missing in exofs that you think needs development?

As I said, I regularly run and test this FS and update it as things advance.
There have been no new features for a while, but no regressions either.

Please explain what "will maintain it" means.

> Cheers
>   Trond
> 

Thanks
Boaz



Re: linux-next: manual merge of the scsi-mkp tree with the char-misc tree

2017-04-20 Thread Boaz Harrosh
On 04/07/2017 10:50 PM, Bart Van Assche wrote:
> On Fri, 2017-04-07 at 13:29 -0600, Logan Gunthorpe wrote:
>> On 07/04/17 09:49 AM, Bart Van Assche wrote:
>>> Sorry that I had not yet noticed Logan's patch series. Should my two
>>> patches that conflict with Logan's patch series be dropped and reworked
>>> after Logan's patches are upstream?
>>
>> Yeah, Greg took my patchset around a few maintainers relatively quickly.
>> This is the second conflict, so sorry about that. Looks like the easiest
>> thing would be to just base your change off of mine. It doesn't look too
>> difficult. If you can do it before my patch hits upstream, I'd
>> appreciate some testing and/or review as no one from the scsi side
>> responded and that particular patch was a bit more involved than I would
>> have liked.
> 
> Boaz, had you noticed Logan's osd patch? If not, can you have a look?
> 

I did look; I even sent an ACK on one of the early versions.
The merge breakage is more of a build issue, because I never
had get_device fail for me in my testing, so it is mostly
academic.

Yes they both look fine BTW
Boaz

> Thanks,
> Bart.
> 



Re: linux-next: manual merge of the scsi-mkp tree with the char-misc tree

2017-04-20 Thread Boaz Harrosh
On 04/07/2017 10:50 PM, Bart Van Assche wrote:
> On Fri, 2017-04-07 at 13:29 -0600, Logan Gunthorpe wrote:
>> On 07/04/17 09:49 AM, Bart Van Assche wrote:
>>> Sorry that I had not yet noticed Logan's patch series. Should my two
>>> patches that conflict with Logan's patch series be dropped and reworked
>>> after Logan's patches are upstream?
>>
>> Yeah, Greg took my patchset around a few maintainers relatively quickly.
>> This is the second conflict, so sorry about that. Looks like the easiest
>> thing would be to just base your change off of mine. It doesn't look too
>> difficult. If you can do it before my patch hits upstream, I'd
>> appreciate some testing and/or review as no one from the scsi side
>> responded and that particular patch was a bit more involved than I would
>> have liked.
> 
> Boaz, had you noticed Logan's osd patch? If not, can you have a look?
> 

I did look, I even sent an ACK on one of the early versions.
The merge breakage is more of a build issue because I never
had get_device fail for me in my testing so it is more
academic.

Yes they both look fine BTW
Boaz

> Thanks,
> Bart.
> 



Re: [PATCH 1/4] block: remove the osdblk driver

2017-04-19 Thread Boaz Harrosh
On 04/12/2017 07:01 PM, Christoph Hellwig wrote:
> This was just a proof of concept user for the SCSI OSD library, and
> never had any real users.
> 
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Yes please remove this driver

ACK-by Boaz Harrosh <o...@electrozaur.com>

> ---
>  drivers/block/Kconfig  |  16 --
>  drivers/block/Makefile |   1 -
>  drivers/block/osdblk.c | 693 
> -
>  3 files changed, 710 deletions(-)
>  delete mode 100644 drivers/block/osdblk.c
> 
> diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
> index a1c2e816128f..58e24c354933 100644
> --- a/drivers/block/Kconfig
> +++ b/drivers/block/Kconfig
> @@ -312,22 +312,6 @@ config BLK_DEV_SKD
>  
>   Use device /dev/skd$N amd /dev/skd$Np$M.
>  
> -config BLK_DEV_OSD
> - tristate "OSD object-as-blkdev support"
> - depends on SCSI_OSD_ULD
> - ---help---
> -   Saying Y or M here will allow the exporting of a single SCSI
> -   OSD (object-based storage) object as a Linux block device.
> -
> -   For example, if you create a 2G object on an OSD device,
> -   you can then use this module to present that 2G object as
> -   a Linux block device.
> -
> -   To compile this driver as a module, choose M here: the
> -   module will be called osdblk.
> -
> -   If unsure, say N.
> -
>  config BLK_DEV_SX8
>   tristate "Promise SATA SX8 support"
>   depends on PCI
> diff --git a/drivers/block/Makefile b/drivers/block/Makefile
> index b12c772bbeb3..42e8cee1cbc7 100644
> --- a/drivers/block/Makefile
> +++ b/drivers/block/Makefile
> @@ -22,7 +22,6 @@ obj-$(CONFIG_CDROM_PKTCDVD) += pktcdvd.o
>  obj-$(CONFIG_MG_DISK)+= mg_disk.o
>  obj-$(CONFIG_SUNVDC) += sunvdc.o
>  obj-$(CONFIG_BLK_DEV_SKD)+= skd.o
> -obj-$(CONFIG_BLK_DEV_OSD)+= osdblk.o
>  
>  obj-$(CONFIG_BLK_DEV_UMEM)   += umem.o
>  obj-$(CONFIG_BLK_DEV_NBD)+= nbd.o
> diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
> deleted file mode 100644
> index 8127b8201a01..
> --- a/drivers/block/osdblk.c
> +++ /dev/null
> @@ -1,693 +0,0 @@
> -
> -/*
> -   osdblk.c -- Export a single SCSI OSD object as a Linux block device
> -
> -
> -   Copyright 2009 Red Hat, Inc.
> -
> -   This program is free software; you can redistribute it and/or modify
> -   it under the terms of the GNU General Public License as published by
> -   the Free Software Foundation.
> -
> -   This program is distributed in the hope that it will be useful,
> -   but WITHOUT ANY WARRANTY; without even the implied warranty of
> -   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> -   GNU General Public License for more details.
> -
> -   You should have received a copy of the GNU General Public License
> -   along with this program; see the file COPYING.  If not, write to
> -   the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
> -
> -
> -   Instructions for use
> -   
> -
> -   1) Map a Linux block device to an existing OSD object.
> -
> -  In this example, we will use partition id 1234, object id 5678,
> -  OSD device /dev/osd1.
> -
> -  $ echo "1234 5678 /dev/osd1" > /sys/class/osdblk/add
> -
> -
> -   2) List all active blkdev<->object mappings.
> -
> -  In this example, we have performed step #1 twice, creating two blkdevs,
> -  mapped to two separate OSD objects.
> -
> -  $ cat /sys/class/osdblk/list
> -  0 174 1234 5678 /dev/osd1
> -  1 179 1994 897123 /dev/osd0
> -
> -  The columns, in order, are:
> -  - blkdev unique id
> -  - blkdev assigned major
> -  - OSD object partition id
> -  - OSD object id
> -  - OSD device
> -
> -
> -   3) Remove an active blkdev<->object mapping.
> -
> -  In this example, we remove the mapping with blkdev unique id 1.
> -
> -  $ echo 1 > /sys/class/osdblk/remove
> -
> -
> -   NOTE:  The actual creation and deletion of OSD objects is outside the 
> scope
> -   of this driver.
> -
> - */
> -
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -
> -#define DRV_NAME "osdblk"
> -#define PFX DRV_NAME ": "
> -
> -/* #define _OSDBLK_DEBUG */
> -#ifdef _OSDBLK_DEBUG
> -#define OSDBLK_DEBUG(fmt, a...) \
> - printk(KERN_NOTICE "osdblk @%s:%d: " fmt, __func__, __LINE__, ##a)
> -#else
> -#define OSDBLK_DEBUG(fmt, a...) \
> - do { if (0) printk(fmt, ##a); } while 


Re: RFC: drop the T10 OSD code and its users

2017-04-18 Thread Boaz Harrosh
On 04/12/2017 07:01 PM, Christoph Hellwig wrote:

Hi Sir Christoph

> The only real user of the T10 OSD protocol, the pNFS object layout
> driver never went to the point of having shipping products, and the
> other two users (osdblk and exofs) were simple example of it's usage.
> 

I understand why osdblk might be a pain; it was broken from day one, and
it should by all means go away ASAP.

But exofs should not be bothering anyone, and as far as I know it does
not use any special APIs except the osd_uld code, of course.

> The code has been mostly unmaintained for years and is getting in the
> way of block / SCSI changes, so I think it's finally time to drop it.
> 

Please tell me which changes you are talking about? I might be
able to help with the conversion. I guess you mean osd_uld and the upper SCSI
API. Just point me at a tree where osd_uld is broken, and perhaps with a little
guidance from you I can do a satisfactory conversion.

It is true that no new code went in for a long while, but from time to
time I still run a setup and test that the whole stack, iscsi-bidi and so
on, still works.

That said, yes, only a stand-alone exofs was tested for a long time; a full
pNFS setup is missing any supporting server. So yes, I admit that pnfs-obj is
getting very rotten, and is most probably broken on the pNFS side of things.
[Which I admit makes my little plea kind of moot ;-) ]

Every once in a while I get emails from students basing all kinds of
interesting experiments on top of exofs and object-based storage. So for the
sake of academia, and for the sake of true bidi-stack testing, might we want
to evaluate the upcoming cost, and the minimum set we are willing to keep?

Please advise?

thanks
Boaz

> These patches are against Jens' block for-next tree as that already
> has various modifications of the SCSI code.
> 



Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-30 Thread Boaz Harrosh
On 03/30/2017 09:35 PM, Jeff Layton wrote:
<>
> Yeah, I imagine we'd need a on-disk change for this unless there's
> something already present that we could use in place of a crash counter.
> 

Perhaps we can use s_mtime and/or s_wtime in some way; I'm not sure
what the parallel for those is in XFS.
s_mtime - time of last mount
s_wtime - time of last mount, umount, freeze, unfreeze, remount, ...
Of course you'll need a per-FS vector to access those.

But this will need some math foo to get the bits compacted correctly.
Just a thought.

Thanks
Boaz
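A minimal sketch of the bit-packing math hinted at above, assuming a hypothetical split of a 64-bit i_version into a 16-bit crash counter (derived from the low bits of s_wtime) and a 48-bit change counter. The names and widths here are purely illustrative, not any filesystem's real on-disk layout:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: 16 bits of crash counter (from the low bits of
 * s_wtime) in the top of i_version, 48 bits of change counter below it. */
#define CRASH_BITS   16
#define CHANGE_BITS  48
#define CHANGE_MASK  ((UINT64_C(1) << CHANGE_BITS) - 1)

static uint64_t pack_iversion(uint64_t s_wtime, uint64_t changes)
{
	uint64_t crash = s_wtime & ((UINT64_C(1) << CRASH_BITS) - 1);

	return (crash << CHANGE_BITS) | (changes & CHANGE_MASK);
}

static uint64_t unpack_changes(uint64_t iversion)
{
	return iversion & CHANGE_MASK;
}
```

With something like this, a remount after a crash perturbs s_wtime and therefore the top bits, so an observer comparing i_version values can distinguish a post-crash world even if the change counter itself was rolled back.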



Re: RFC: reject unknown open flags

2017-03-30 Thread Boaz Harrosh
On 03/30/2017 09:45 PM, Linus Torvalds wrote:
> On Thu, Mar 30, 2017 at 11:26 AM, Christoph Hellwig  wrote:
>>
>> That would be nice, but still won't work as we blindly copy f_flags
>> into F_GETFL, not even masking our internal FMODE_ bits.
> 
> Ok, *that* is just silly of us, and we could try to just fix, and even 
> backport.
> 
> There's no possible valid use I could see where that should break
> (famous last words - user code does some damn odd things at times).
> 
> Of course, that won't fix old kernels that are out there, but then
> neither would your original patch...
> 
> Side note: I think you *can* detect the O_ATOMIC support by using
> F_SETFL, because F_SETFL only allows you to change flags that we
> recognize. So somebody who really wants to *guarantee* that O_ATOMIC
> is there and honored even with old kernels could presumable do
> something like
> 
>fd = open(..); // *no* O_ATOMIC
>fcnt(fd, F_SETFL, O_ATOMIC);
>if (fcnt(fd, F_GETFL, NULL) & O_ATOMIC)
> // Yay! We actually got it
>else
> // I guess we need to fall back on old behavior
> 
> although I agree that that is ridiculously inconvenient and not a
> great thing, and it's worth trying to aim for some better model.
> 

Perhaps in that case it is time for an F_GETFL2 or F_GET_REAL_FL
that gives you the nice simple user code Linus wanted for new applications,
and solves forward and backward compatibility for both applications and
kernels?

Just my $0.017
Boaz

> Linus
> 
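Linus's probe sequence above can be turned into a small helper. A userspace sketch using the real fcntl() (the quote's `fcnt` is shorthand), with O_APPEND standing in for the then-hypothetical O_ATOMIC:

```c
#include <fcntl.h>
#include <unistd.h>

/* Probe whether the kernel honors a given status flag: set it with
 * F_SETFL, then read it back with F_GETFL.  F_SETFL silently drops
 * flags it does not recognize, so a read-back miss means the flag is
 * not supported and the caller must fall back to old behavior. */
static int flag_supported(int fd, int flag)
{
	int fl = fcntl(fd, F_GETFL);

	if (fl < 0)
		return 0;
	if (fcntl(fd, F_SETFL, fl | flag) < 0)
		return 0;
	return (fcntl(fd, F_GETFL) & flag) == flag;
}
```

A flag that F_SETFL silently ignores will not read back, which is exactly the fallback signal the quoted code relies on.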



Re: [PATCH] scsi: osd_uld: remove an unneeded NULL check

2017-03-23 Thread Boaz Harrosh
On 03/23/2017 12:41 PM, Dan Carpenter wrote:
> We don't call the remove() function unless probe() succeeds so "oud"
> can't be NULL here.  Plus, if it were NULL, we dereference it on the
> next line so it would crash anyway.
> 
> Signed-off-by: Dan Carpenter <dan.carpen...@oracle.com>
> 

Thanks sure!
ACK-by Boaz Harrosh <o...@electrozaur.com>

> diff --git a/drivers/scsi/osd/osd_uld.c b/drivers/scsi/osd/osd_uld.c
> index 4101c3178411..8b9941a5687a 100644
> --- a/drivers/scsi/osd/osd_uld.c
> +++ b/drivers/scsi/osd/osd_uld.c
> @@ -507,10 +507,9 @@ static int osd_remove(struct device *dev)
>   struct scsi_device *scsi_device = to_scsi_device(dev);
>   struct osd_uld_device *oud = dev_get_drvdata(dev);
>  
> - if (!oud || (oud->od.scsi_device != scsi_device)) {
> - OSD_ERR("Half cooked osd-device %p,%p || %p!=%p",
> - dev, oud, oud ? oud->od.scsi_device : NULL,
> - scsi_device);
> + if (oud->od.scsi_device != scsi_device) {
> + OSD_ERR("Half cooked osd-device %p, || %p!=%p",
> + dev, oud->od.scsi_device, scsi_device);
>   }
>  
>   cdev_device_del(&oud->cdev, &oud->class_dev);
> 
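The pattern Dan's patch removes, in miniature: a NULL test that cannot help, because the code dereferences the pointer regardless. The names below are hypothetical, not the driver's code; the comment marks where the dead check used to sit:

```c
#include <stdio.h>

struct dev { int id; };

/* A leading "if (!d || ...)" here would be dead code: probe() success
 * guarantees d is non-NULL, and the error message itself would have to
 * dereference d anyway.  Only the invariant that still matters stays. */
static int check_dev(struct dev *d, int expected_id)
{
	if (d->id != expected_id) {
		fprintf(stderr, "half-cooked device %d != %d\n",
			d->id, expected_id);
		return -1;
	}
	return 0;
}
```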



Re: [Lsf-pc] [LSF/MM TOPIC] do we really need PG_error at all?

2017-02-28 Thread Boaz Harrosh
On 02/28/2017 03:11 AM, Jeff Layton wrote:
<>
> 
> I'll probably have questions about the read side as well, but for now it
> looks like it's mostly used in an ad-hoc way to communicate errors
> across subsystems (block to fs layer, for instance).

If memory does not fail me, it used to be checked a long time ago in the
read-ahead case. In the buffered read case, the first page is read synchronously
and any error is returned to the caller, but then a read-ahead chunk is
read asynchronously while the original thread has already returned to the
application. So any errors are only recorded in the page bit, since otherwise
uptodate is off and the IO will be retransmitted. Then the move to read_iter
changed all that, I think.
But again, this is from 5-6 years ago, and maybe I didn't even understand it
very well.

> --
> Jeff Layton 
> 

I would like documentation of all this as well, please. Where are the
tests for this?

Thanks
Boaz



Re: [PATCH v1 45/54] exofs: convert to bio_for_each_segment_all_sp()

2017-01-03 Thread Boaz Harrosh
On 12/27/2016 06:04 PM, Ming Lei wrote:
> Signed-off-by: Ming Lei <tom.leim...@gmail.com>

Cool
ACK-by: Boaz Harrosh <o...@electrozaur.com>

> ---
>  fs/exofs/ore.c  | 3 ++-
>  fs/exofs/ore_raid.c | 3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
> index 8bb72807e70d..38a7d8bfdd4c 100644
> --- a/fs/exofs/ore.c
> +++ b/fs/exofs/ore.c
> @@ -406,8 +406,9 @@ static void _clear_bio(struct bio *bio)
>  {
>   struct bio_vec *bv;
>   unsigned i;
> + struct bvec_iter_all bia;
>  
> - bio_for_each_segment_all(bv, bio, i) {
> + bio_for_each_segment_all_sp(bv, bio, i, bia) {
>   unsigned this_count = bv->bv_len;
>  
>   if (likely(PAGE_SIZE == this_count))
> diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c
> index 27cbdb697649..37c0a9aa2ec2 100644
> --- a/fs/exofs/ore_raid.c
> +++ b/fs/exofs/ore_raid.c
> @@ -429,6 +429,7 @@ static void _mark_read4write_pages_uptodate(struct 
> ore_io_state *ios, int ret)
>  {
>   struct bio_vec *bv;
>   unsigned i, d;
> + struct bvec_iter_all bia;
>  
>   /* loop on all devices all pages */
>   for (d = 0; d < ios->numdevs; d++) {
> @@ -437,7 +438,7 @@ static void _mark_read4write_pages_uptodate(struct 
> ore_io_state *ios, int ret)
>   if (!bio)
>   continue;
>  
> - bio_for_each_segment_all(bv, bio, i) {
> + bio_for_each_segment_all_sp(bv, bio, i, bia) {
>   struct page *page = bv->bv_page;
>  
>   SetPageUptodate(page);
> 



Re: [PATCH v2 1/3] introduce memcpy_nocache()

2016-11-01 Thread Boaz Harrosh
On 10/28/2016 04:54 AM, Boylston, Brian wrote:
> Boaz Harrosh wrote on 2016-10-26:
>> On 10/26/2016 06:50 PM, Brian Boylston wrote:
>>> Introduce memcpy_nocache() as a memcpy() that avoids the processor cache
>>> if possible.  Without arch-specific support, this defaults to just
>>> memcpy().  For now, include arch-specific support for x86.
>>>
>>> Cc: Ross Zwisler 
>>> Cc: Thomas Gleixner 
>>> Cc: Ingo Molnar 
>>> Cc: "H. Peter Anvin" 
>>> Cc: 
>>> Cc: Al Viro 
>>> Cc: Dan Williams 
>>> Signed-off-by: Brian Boylston 
>>> Reviewed-by: Toshi Kani 
>>> Reported-by: Oliver Moreno 
>>> ---
>>>  arch/x86/include/asm/string_32.h |  3 +++
>>>  arch/x86/include/asm/string_64.h |  3 +++
>>>  arch/x86/lib/misc.c  | 12 
>>>  include/linux/string.h   | 15 +++
>>>  4 files changed, 33 insertions(+)
>>> diff --git a/arch/x86/include/asm/string_32.h 
>>> b/arch/x86/include/asm/string_32.h
>>> index 3d3e835..64f80c0 100644
>>> --- a/arch/x86/include/asm/string_32.h
>>> +++ b/arch/x86/include/asm/string_32.h
>>> @@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void 
>>> *from, size_t len)
>>>
>>>  #endif
>>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
>>> +
>>>  #define __HAVE_ARCH_MEMMOVE
>>>  void *memmove(void *dest, const void *src, size_t n);
>>> diff --git a/arch/x86/include/asm/string_64.h 
>>> b/arch/x86/include/asm/string_64.h
>>> index 90dbbd9..a8fdd55 100644
>>> --- a/arch/x86/include/asm/string_64.h
>>> +++ b/arch/x86/include/asm/string_64.h
>>> @@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t 
>>> len);
>>>  #define memcpy(dst, src, len) __inline_memcpy((dst), (src), (len))
>>>  #endif
>>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
>>> +
>>>  #define __HAVE_ARCH_MEMSET
>>>  void *memset(void *s, int c, size_t n);
>>>  void *__memset(void *s, int c, size_t n);
>>> diff --git a/arch/x86/lib/misc.c b/arch/x86/lib/misc.c
>>> index 76b373a..c993ab3 100644
>>> --- a/arch/x86/lib/misc.c
>>> +++ b/arch/x86/lib/misc.c
>>> @@ -1,3 +1,6 @@
>>> +#include 
>>> +#include 
>>> +
>>>  /*
>>>   * Count the digits of @val including a possible sign.
>>>   *
>>> @@ -19,3 +22,12 @@ int num_digits(int val)
>>> }
>>> return d;
>>>  }
>>> +
>>> +#ifdef __HAVE_ARCH_MEMCPY_NOCACHE
>>> +void *memcpy_nocache(void *dest, const void *src, size_t count)
>>> +{
>>> +   __copy_from_user_inatomic_nocache(dest, src, count);
>>> +   return dest;
>>> +}
>>> +EXPORT_SYMBOL(memcpy_nocache);
>>> +#endif
>>> diff --git a/include/linux/string.h b/include/linux/string.h
>>> index 26b6f6a..7f40c41 100644
>>> --- a/include/linux/string.h
>>> +++ b/include/linux/string.h
>>> @@ -102,6 +102,21 @@ extern void * memset(void *,int,__kernel_size_t);
>>>  #ifndef __HAVE_ARCH_MEMCPY
>>>  extern void * memcpy(void *,const void *,__kernel_size_t);
>>>  #endif
>>> +
>>> +#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
>>> +/**
>>> + * memcpy_nocache - Copy one area of memory to another, avoiding the
>>> + * processor cache if possible
>>> + * @dest: Where to copy to
>>> + * @src: Where to copy from
>>> + * @count: The size of the area.
>>> + */
>>> +static inline void *memcpy_nocache(void *dest, const void *src, size_t 
>>> count)
>>> +{
>>> +   return memcpy(dest, src, count);
>>> +}
>>
>> What about memcpy_to_pmem() in linux/pmem.h it already has all the arch 
>> switches.
>>
>> Feels bad to add yet just another arch switch over __copy_user_nocache
>>
>> Just feels like too many things that do the same thing. Sigh
> 
> I agree that this looks like a nicer path.
> 
> I had considered adjusting copy_from_iter_nocache() to use memcpy_to_pmem(),
> but lib/iov_iter.c doesn't currently #include linux/pmem.h.  Would it be
> acceptable to add it?  Also, I wasn't sure if memcpy_to_pmem() would always
> mean exactly "memcpy nocache".
> 

I think this is the way to go. In my opinion there is no reason not to
include linux/pmem.h in lib/iov_iter.c.
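
Either way, the generic fallback quoted above is behaviour-preserving: without
an arch override, memcpy_nocache() is exactly memcpy(), only the caching hint
is lost. A stand-alone user-space sketch of that fallback (the
__HAVE_ARCH_MEMCPY_NOCACHE guard mirrors the patch; this is a model, not
kernel code):

```c
#include <assert.h>
#include <string.h>

/* User-space model of the patch's generic fallback: with no
 * __HAVE_ARCH_MEMCPY_NOCACHE override, memcpy_nocache() is plain
 * memcpy() -- identical bytes land in dest, only the cache
 * behaviour differs on arches that provide their own version. */
#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
static inline void *memcpy_nocache(void *dest, const void *src, size_t count)
{
	return memcpy(dest, src, count);
}
#endif
```

If an arch does define the hook (as the x86 patch does via
__copy_from_user_inatomic_nocache), callers see the same interface with
non-temporal stores underneath.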

Re: [PATCH v2 3/3] x86: remove unneeded flush in arch_copy_from_iter_pmem()

2016-10-26 Thread Boaz Harrosh
On 10/26/2016 06:50 PM, Brian Boylston wrote:
> copy_from_iter_nocache() now uses nocache copies for all types of iovecs
> on x86, so the flush in arch_copy_from_iter_pmem() is no longer needed.
> 
> Cc: Ross Zwisler 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: 
> Cc: Al Viro 
> Cc: Dan Williams 
> Signed-off-by: Brian Boylston 
> Reviewed-by: Toshi Kani 
> Reported-by: Oliver Moreno 
> ---
>  arch/x86/include/asm/pmem.h | 19 +--
>  1 file changed, 1 insertion(+), 18 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
> index 643eba4..2fbf4ae 100644
> --- a/arch/x86/include/asm/pmem.h
> +++ b/arch/x86/include/asm/pmem.h
> @@ -72,15 +72,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t 
> size)
>   clwb(p);
>  }
>  
> -/*
> - * copy_from_iter_nocache() on x86 only uses non-temporal stores for iovec
> - * iterators, so for other types (bvec & kvec) we must do a cache write-back.
> - */
> -static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
> -{
> - return iter_is_iovec(i) == false;
> -}
> -
>  /**
>   * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
>   * @addr:PMEM destination address
> @@ -92,15 +83,7 @@ static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
>  static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
>   struct iov_iter *i)
>  {
> - size_t len;
> -
> - /* TODO: skip the write-back by always using non-temporal stores */
> - len = copy_from_iter_nocache(addr, bytes, i);
> -
> - if (__iter_needs_pmem_wb(i))
> - arch_wb_cache_pmem(addr, bytes);
> -
> - return len;
> + return copy_from_iter_nocache(addr, bytes, i);

I wish you would remove all this completely. I don't see the point of it anymore.

Thanks a lot for doing this. I have had this patch in my trees for ages and
was too lazy to fight it through. Yes, it is a big boost for many workloads.

Lots of gratitude
Boaz

>  }
>  
>  /**
> 
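
For context on what the removed flush cost: arch_wb_cache_pmem() walks the
destination one cacheline at a time and issues clwb on each. A user-space
model of that loop (64-byte cachelines assumed; a counter stands in for the
clwb instruction):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64	/* assumed x86 cacheline size */

/* Model of the clwb loop in arch_wb_cache_pmem(): visit every
 * cacheline that [addr, addr + size) touches. The counter stands in
 * for the clwb instruction the real code issues per line. */
static size_t wb_cache_model(uintptr_t addr, size_t size)
{
	uintptr_t start = addr & ~(uintptr_t)(CACHELINE - 1);
	uintptr_t p;
	size_t flushes = 0;

	for (p = start; p < addr + size; p += CACHELINE)
		flushes++;	/* clwb(p) in the kernel */
	return flushes;
}
```

With all iovec types now taking the non-temporal path, none of these flushes
are issued at all, which is where the workload boost comes from.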



Re: [PATCH v2 1/3] introduce memcpy_nocache()

2016-10-26 Thread Boaz Harrosh
On 10/26/2016 06:50 PM, Brian Boylston wrote:
> Introduce memcpy_nocache() as a memcpy() that avoids the processor cache
> if possible.  Without arch-specific support, this defaults to just
> memcpy().  For now, include arch-specific support for x86.
> 
> Cc: Ross Zwisler 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: 
> Cc: Al Viro 
> Cc: Dan Williams 
> Signed-off-by: Brian Boylston 
> Reviewed-by: Toshi Kani 
> Reported-by: Oliver Moreno 
> ---
>  arch/x86/include/asm/string_32.h |  3 +++
>  arch/x86/include/asm/string_64.h |  3 +++
>  arch/x86/lib/misc.c  | 12 
>  include/linux/string.h   | 15 +++
>  4 files changed, 33 insertions(+)
> 
> diff --git a/arch/x86/include/asm/string_32.h 
> b/arch/x86/include/asm/string_32.h
> index 3d3e835..64f80c0 100644
> --- a/arch/x86/include/asm/string_32.h
> +++ b/arch/x86/include/asm/string_32.h
> @@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void 
> *from, size_t len)
>  
>  #endif
>  
> +#define __HAVE_ARCH_MEMCPY_NOCACHE
> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
> +
>  #define __HAVE_ARCH_MEMMOVE
>  void *memmove(void *dest, const void *src, size_t n);
>  
> diff --git a/arch/x86/include/asm/string_64.h 
> b/arch/x86/include/asm/string_64.h
> index 90dbbd9..a8fdd55 100644
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t 
> len);
>  #define memcpy(dst, src, len) __inline_memcpy((dst), (src), (len))
>  #endif
>  
> +#define __HAVE_ARCH_MEMCPY_NOCACHE
> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
> +
>  #define __HAVE_ARCH_MEMSET
>  void *memset(void *s, int c, size_t n);
>  void *__memset(void *s, int c, size_t n);
> diff --git a/arch/x86/lib/misc.c b/arch/x86/lib/misc.c
> index 76b373a..c993ab3 100644
> --- a/arch/x86/lib/misc.c
> +++ b/arch/x86/lib/misc.c
> @@ -1,3 +1,6 @@
> +#include 
> +#include 
> +
>  /*
>   * Count the digits of @val including a possible sign.
>   *
> @@ -19,3 +22,12 @@ int num_digits(int val)
>   }
>   return d;
>  }
> +
> +#ifdef __HAVE_ARCH_MEMCPY_NOCACHE
> +void *memcpy_nocache(void *dest, const void *src, size_t count)
> +{
> + __copy_from_user_inatomic_nocache(dest, src, count);
> + return dest;
> +}
> +EXPORT_SYMBOL(memcpy_nocache);
> +#endif
> diff --git a/include/linux/string.h b/include/linux/string.h
> index 26b6f6a..7f40c41 100644
> --- a/include/linux/string.h
> +++ b/include/linux/string.h
> @@ -102,6 +102,21 @@ extern void * memset(void *,int,__kernel_size_t);
>  #ifndef __HAVE_ARCH_MEMCPY
>  extern void * memcpy(void *,const void *,__kernel_size_t);
>  #endif
> +
> +#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
> +/**
> + * memcpy_nocache - Copy one area of memory to another, avoiding the
> + * processor cache if possible
> + * @dest: Where to copy to
> + * @src: Where to copy from
> + * @count: The size of the area.
> + */
> +static inline void *memcpy_nocache(void *dest, const void *src, size_t count)
> +{
> + return memcpy(dest, src, count);
> +}

What about memcpy_to_pmem() in linux/pmem.h? It already has all the arch
switches.

It feels bad to add yet another arch switch over __copy_user_nocache.

Just feels like too many things that do the same thing. Sigh.

Boaz

> +#endif
> +
>  #ifndef __HAVE_ARCH_MEMMOVE
>  extern void * memmove(void *,const void *,__kernel_size_t);
>  #endif
> 
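
To make "avoids the processor cache" concrete: on x86 the kernel's
__copy_user_nocache uses non-temporal movnti/movntdq stores. A rough
user-space analogue with SSE2 intrinsics (the intrinsic names are real; the
function is a simplified sketch, not the kernel implementation):

```c
#include <assert.h>
#include <emmintrin.h>	/* SSE2: _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <string.h>

/* Sketch of a non-temporal ("nocache") copy: movntdq stores bypass
 * the cache hierarchy; sfence then drains the write-combining
 * buffers. Assumes 16-byte-aligned pointers and a count that is a
 * multiple of 16 -- the kernel's __copy_user_nocache also handles
 * the unaligned head/tail this sketch omits. */
static void *memcpy_nocache_sse2(void *dest, const void *src, size_t count)
{
	__m128i *d = dest;
	const __m128i *s = src;
	size_t i;

	for (i = 0; i < count / 16; i++)
		_mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
	_mm_sfence();
	return dest;
}
```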



Re: [PATCH 4/7] fs: make remaining filesystems use .rename2

2016-08-23 Thread Boaz Harrosh
On 08/23/2016 07:24 PM, Boaz Harrosh wrote:
> On 08/23/2016 05:05 PM, Miklos Szeredi wrote:
>> This is trivial to do:
>>
>>  - add flags argument to foo_rename()
>>  - check if flags is zero
>>  - assign foo_rename() to .rename2 instead of .rename
>>
>> This doesn't mean it's impossible to support RENAME_NOREPLACE for these
>> filesystems, but it is not trivial, like for local filesystems.
>> RENAME_NOREPLACE must guarantee atomicity (i.e. it shouldn't be possible
>> for a file to be created on one host while it is overwritten by rename on
>> another host).
>>
>> Filesystems converted:
>>
>> 9p, afs, ceph, coda, ecryptfs, exofs, kernfs, lustre, ncpfs, nfs, ocfs2,
>> orangefs.
>>
>> After this, we can get rid of the duplicate interfaces for rename.
>>
>> Signed-off-by: Miklos Szeredi <mszer...@redhat.com>
>> Cc: Eric Van Hensbergen <eri...@gmail.com>
>> Cc: David Howells <dhowe...@redhat.com>
>> Cc: Ilya Dryomov <idryo...@gmail.com>
>> Cc: Jan Harkes <jahar...@cs.cmu.edu>
>> Cc: Tyler Hicks <tyhi...@canonical.com>
>> Cc: Boaz Harrosh <o...@electrozaur.com>
> 
> Hi exofs is not a distributed file system in the nfs-client 
> sense. All meta-data operations happen on the single exofs mount.
> The distribution of an exofs cluster is done by an NFSD-like daemon
> that supports pNFS, and an std pNFS-client.
> So the code you see below is just the same as an ext4 FS with
> a raid of iscsi devices below it. (Even if then later this FS is
> exported by an NFSD server)
> 
> That said it is fine as is don't sweat over this unused FS so:
> ACK-by: Boaz Harrosh <o...@electrozaur.com>
> 
> <>
> 
>> diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
>> index 622a686bb08b..897280163f3c 100644
>> --- a/fs/exofs/namei.c
>> +++ b/fs/exofs/namei.c
>> @@ -227,7 +227,8 @@ static int exofs_rmdir(struct inode *dir, struct dentry 
>> *dentry)
>>  }
>>  
>>  static int exofs_rename(struct inode *old_dir, struct dentry *old_dentry,
>> -struct inode *new_dir, struct dentry *new_dentry)
>> +struct inode *new_dir, struct dentry *new_dentry,
>> +unsigned int flags)
>>  {
>>  struct inode *old_inode = d_inode(old_dentry);
>>  struct inode *new_inode = d_inode(new_dentry);
>> @@ -237,6 +238,9 @@ static int exofs_rename(struct inode *old_dir, struct 
>> dentry *old_dentry,
>>  struct exofs_dir_entry *old_de;
>>  int err = -ENOENT;
>>  
>> +if (flags)
>> +return -EINVAL;
>> +

+   if (flags & ~RENAME_NOREPLACE)
+   return -EINVAL;
+

And move to the other patch if you feel like it

>>  old_de = exofs_find_entry(old_dir, old_dentry, &old_page);
>>  if (!old_de)
>>  goto out;
>> @@ -310,7 +314,7 @@ const struct inode_operations exofs_dir_inode_operations 
>> = {
>>  .mkdir  = exofs_mkdir,
>>  .rmdir  = exofs_rmdir,
>>  .mknod  = exofs_mknod,
>> -.rename = exofs_rename,
>> +.rename2= exofs_rename,
>>  .setattr= exofs_setattr,
>>  };
>>  
> <>
> 
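
The two flag policies discussed in this thread are easy to state side by
side; a user-space model (flag values mirror include/uapi/linux/fs.h, the
functions are illustrative, not kernel code):

```c
#include <assert.h>
#include <errno.h>

#define RENAME_NOREPLACE (1 << 0)	/* values as in include/uapi/linux/fs.h */
#define RENAME_EXCHANGE  (1 << 1)
#define RENAME_WHITEOUT  (1 << 2)

/* The patch's conservative policy: reject every flag. */
static int rename_check_strict(unsigned int flags)
{
	return flags ? -EINVAL : 0;
}

/* The suggested relaxation: RENAME_NOREPLACE is fine for a local
 * filesystem like exofs, everything else is still rejected. */
static int rename_check_noreplace(unsigned int flags)
{
	return (flags & ~RENAME_NOREPLACE) ? -EINVAL : 0;
}
```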



Re: [PATCH 4/7] fs: make remaining filesystems use .rename2

2016-08-23 Thread Boaz Harrosh
On 08/23/2016 05:05 PM, Miklos Szeredi wrote:
> This is trivial to do:
> 
>  - add flags argument to foo_rename()
>  - check if flags is zero
>  - assign foo_rename() to .rename2 instead of .rename
> 
> This doesn't mean it's impossible to support RENAME_NOREPLACE for these
> filesystems, but it is not trivial, like for local filesystems.
> RENAME_NOREPLACE must guarantee atomicity (i.e. it shouldn't be possible
> for a file to be created on one host while it is overwritten by rename on
> another host).
> 
> Filesystems converted:
> 
> 9p, afs, ceph, coda, ecryptfs, exofs, kernfs, lustre, ncpfs, nfs, ocfs2,
> orangefs.
> 
> After this, we can get rid of the duplicate interfaces for rename.
> 
> Signed-off-by: Miklos Szeredi <mszer...@redhat.com>
> Cc: Eric Van Hensbergen <eri...@gmail.com>
> Cc: David Howells <dhowe...@redhat.com>
> Cc: Ilya Dryomov <idryo...@gmail.com>
> Cc: Jan Harkes <jahar...@cs.cmu.edu>
> Cc: Tyler Hicks <tyhi...@canonical.com>
> Cc: Boaz Harrosh <o...@electrozaur.com>

Hi, exofs is not a distributed file system in the nfs-client
sense. All meta-data operations happen on the single exofs mount.
The distribution of an exofs cluster is done by an NFSD-like daemon
that supports pNFS, and a standard pNFS client.
So the code you see below is just the same as an ext4 FS with
a RAID of iSCSI devices below it. (Even if this FS is later
exported by an NFSD server.)

That said, it is fine as is; don't sweat over this unused FS, so:
ACK-by: Boaz Harrosh <o...@electrozaur.com>

<>

> diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
> index 622a686bb08b..897280163f3c 100644
> --- a/fs/exofs/namei.c
> +++ b/fs/exofs/namei.c
> @@ -227,7 +227,8 @@ static int exofs_rmdir(struct inode *dir, struct dentry 
> *dentry)
>  }
>  
>  static int exofs_rename(struct inode *old_dir, struct dentry *old_dentry,
> - struct inode *new_dir, struct dentry *new_dentry)
> + struct inode *new_dir, struct dentry *new_dentry,
> + unsigned int flags)
>  {
>   struct inode *old_inode = d_inode(old_dentry);
>   struct inode *new_inode = d_inode(new_dentry);
> @@ -237,6 +238,9 @@ static int exofs_rename(struct inode *old_dir, struct 
> dentry *old_dentry,
>   struct exofs_dir_entry *old_de;
>   int err = -ENOENT;
>  
> + if (flags)
> + return -EINVAL;
> +
>   old_de = exofs_find_entry(old_dir, old_dentry, &old_page);
>   if (!old_de)
>   goto out;
> @@ -310,7 +314,7 @@ const struct inode_operations exofs_dir_inode_operations 
> = {
>   .mkdir  = exofs_mkdir,
>   .rmdir  = exofs_rmdir,
>   .mknod  = exofs_mknod,
> - .rename = exofs_rename,
> + .rename2= exofs_rename,
>   .setattr= exofs_setattr,
>  };
>  
<>



Re: DAX can not work on virtual nvdimm device

2016-08-21 Thread Boaz Harrosh
On 08/19/2016 09:30 PM, Ross Zwisler wrote:
> On Fri, Aug 19, 2016 at 07:59:29AM -0700, Dan Williams wrote:
>> On Fri, Aug 19, 2016 at 4:19 AM, Xiao Guangrong
>>  wrote:
>>>
>>> Hi Dan,
>>>
>>> Recently, Redhat reported that nvml test suite failed on QEMU/KVM,
>>> more detailed info please refer to:
>>>https://bugzilla.redhat.com/show_bug.cgi?id=1365721
>>>
>>> The reason for this bug is that the memory region created by mmap()
>>> on the dax-based file was gone so that the region can not be found
>>> in /proc/self/smaps during the runtime.
>>>
>>> This is a simple way to trigger this issue:
>>>mount -o dax /dev/pmem0 /mnt/pmem/
>>>vim /mnt/pmem/xxx
>>> then 'vim' is crashed due to segment fault.
>>>
>>> This bug can be reproduced on your tree, the top commit is
>>> 10d7902fa0e82b (dax: unmap/truncate on device shutdown), the kernel
>>> configure file is attached.
>>>
>>> Your thought or comment is highly appreciated.
>>
>> I'm going to be offline until Tuesday, but I will investigate when I'm
>> back.  In the meantime if Ross or Vishal had an opportunity to take a
>> look I wouldn't say "no" :).
> 
> I haven't been able to reproduce this vim segfault.  I'm using QEMU v2.6.0,
> and the kernel commit you mentioned, and your kernel config.
> 
> Here's my QEMU command line:
> 
> sudo ~/qemu/bin/qemu-system-x86_64 /var/lib/libvirt/images/alara.qcow2 \
> -machine pc,nvdimm -m 8G,maxmem=100G,slots=100  -object \
> memory-backend-file,id=mem1,share,mem-path=/dev/pmem0,size=8G -device \
> nvdimm,memdev=mem1,id=nv1 -smp 6 -machine pc,accel=kvm 
> 
> With this I'm able to mkfs the guest's /dev/pmem0, mount it with -o dax, and
> write a file with vim.
> 
> Can you reproduce your results with a pmem device created via a memmap kernel
> command line parameter in the guest?  You'll need to update your kernel
> config to enable CONFIG_X86_PMEM_LEGACY and CONFIG_X86_PMEM_LEGACY_DEVICE.

The last time I had crashes like this was when the memmap= or the KVM command-line
memory size and settings did not match the KVM-defined memory settings. Nothing
fails, but the pmemX device is configured over non-existent memory. All writes
just succeed, but all reads return ZEROs.

Please check that the size of your pmem matches the memory size of the VM as
defined in virsh edit.

Boaz
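
A quick way to catch the "writes succeed, reads return zeros" failure mode
from userspace is a write-then-readback over an mmap of a file on the suspect
mount. A sketch (here /tmp and the file name stand in for a file on the
-o dax mount; both are made up for illustration):

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write a pattern through a shared mapping and read it back.
 * Returns 0 when the backing store really holds the data; on a pmem
 * device mapped over non-existent memory the readback sees zeros. */
static int dax_smoke_test(const char *path)
{
	int fd = open(path, O_CREAT | O_RDWR, 0600);
	char *p;
	int ok;

	if (fd < 0 || ftruncate(fd, 4096) < 0)
		return -1;
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return -1;
	memset(p, 0x5a, 4096);
	ok = (p[0] == 0x5a && p[4095] == 0x5a);
	munmap(p, 4096);
	close(fd);
	unlink(path);
	return ok ? 0 : -1;
}
```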



Re: [PATCH] x86/mm: only allow memmap=XX!YY over existing RAM

2016-06-23 Thread Boaz Harrosh
On 06/20/2016 10:33 AM, Yigal Korman wrote:
> Before this patch, passing a range that is beyond the physical memory
> range will succeed, the user will see a /dev/pmem0 and will be able to
> access it. Reads will always return 0 and writes will be silently
> ignored.
> 
> I've gotten more than one bug report about mkfs.{xfs,ext4} or nvml
> failing that were eventually tracked down to be wrong values passed to
> memmap.
> 
> This patch prevents the above issue by instead of adding a new memory
> range, only update a RAM memory range with the PRAM type. This way,
> passing the wrong memmap will either not give you a pmem at all or give
> you a smaller one that actually has RAM behind it.
> 
> And if someone still needs to fake a pmem that doesn't have RAM behind
> it, they can simply do memmap=XX@YY,XX!YY.
> 

We have been running with this patch for a while in the lab and it does
solve the problem above with no ill effects, so:

Tested-by: Boaz Harrosh <b...@plexistor.com>

> Signed-off-by: Yigal Korman <yi...@plexistor.com>
> ---
>  arch/x86/kernel/e820.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 621b501..4bd4207 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -878,7 +878,7 @@ static int __init parse_memmap_one(char *p)
>   e820_add_region(start_at, mem_size, E820_RESERVED);
>   } else if (*p == '!') {
>   start_at = memparse(p+1, &p);
> - e820_add_region(start_at, mem_size, E820_PRAM);
> + e820_update_range(start_at, mem_size, E820_RAM, E820_PRAM);
>   } else
>   e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
>  
> 
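
The behavioural difference between e820_add_region() and e820_update_range()
here is just interval intersection: the update can only convert memory that is
already E820_RAM. A single-range user-space model (illustrative only; the real
e820 table handles many ranges):

```c
#include <assert.h>

/* One RAM range, [start, start + size). */
struct ram_range {
	unsigned long start;
	unsigned long size;
};

/* Model of e820_update_range(..., E820_RAM, E820_PRAM): return how
 * much of the requested [start, start + size) overlaps RAM and can
 * become PRAM. add_region() by contrast would accept the full size
 * unconditionally, even with nothing behind it. */
static unsigned long pram_from_ram(const struct ram_range *ram,
				   unsigned long start, unsigned long size)
{
	unsigned long end = start + size;
	unsigned long ram_end = ram->start + ram->size;
	unsigned long lo = start > ram->start ? start : ram->start;
	unsigned long hi = end < ram_end ? end : ram_end;

	return hi > lo ? hi - lo : 0;
}
```

Which is exactly why a wrong memmap= now yields no pmem (or a smaller one)
instead of a silently zero-filled device.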


