RE: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-16 Thread Huaisheng HS1 Ye
> From: Dan Williams [mailto:dan.j.willi...@intel.com]
> Sent: Wednesday, May 16, 2018 10:49 AM
> On Tue, May 15, 2018 at 7:05 PM, Huaisheng HS1 Ye  wrote:
> >> From: Matthew Wilcox [mailto:wi...@infradead.org]
> >> Sent: Wednesday, May 16, 2018 12:20 AM>
> >> > > > > Then there's the problem of reconnecting the page cache (which is
> >> > > > > pointed to by ephemeral data structures like inodes and dentries) 
> >> > > > > to
> >> > > > > the new inodes.
> >> > > > Yes, it is not easy.
> >> > >
> >> > > Right ... and until we have that ability, there's no point in this 
> >> > > patch.
> >> > We are focusing to realize this ability.
> >>
> >> But is it the right approach?  So far we have (I think) two parallel
> >> activities.  The first is for local storage, using DAX to store files
> >> directly on the pmem.  The second is a physical block cache for network
> >> filesystems (both NAS and SAN).  You seem to be wanting to supplant the
> >> second effort, but I think it's much harder to reconnect the logical cache
> >> (ie the page cache) than it is the physical cache (ie the block cache).
> >
> > Dear Matthew,
> >
> > Thanks for the correction about cache lines.
> > But I have a question about that: assuming an NVDIMM works in pmem mode, even
> > if we use it as a physical block cache, like dm-cache, isn't there still a
> > potential risk from this cache-line issue, because NVDIMMs are byte-addressable
> > storage?
> 
> No, there is no risk if the cache is designed properly. The pmem
> driver will not report that the I/O is complete until the entire
> payload of the data write has made it to persistent memory. The cache
> driver will not report that the write succeeded until the pmem driver
> completes the I/O. There is no risk from losing power while the pmem
> driver is operating because the cache will recover to its last
> acknowledged stable state, i.e. it will roll back / undo the
> incomplete write.
> 
> > If a system crash happens, the CPU may not have the opportunity to flush all
> > dirty data from its cache lines to the NVDIMM while copying the data pointed
> > to by bio_vec.bv_page.
> > I know there is btt, which guarantees sector atomicity in block mode, but in
> > pmem mode a crash will likely leave a mix of new and old data in one NVDIMM
> > page.
> > Correct me if anything is wrong.
> 
> dm-cache performs metadata management similar to the btt driver's
> to ensure safe forward progress of the cache state relative to power
> loss or a system crash.

Dear Dan,

Thanks for the introduction, I've learned a lot from your comments.
I suppose there need to be mechanisms that protect both the data and the
metadata in NVDIMMs against a system crash or power loss.
Not only the data but also the metadata itself must stay correct and consistent,
so that the kernel has a chance to recover the data to the target device after a
reboot, right?

> 
> > Another question: if we use NVDIMMs as a physical block cache for network
> > filesystems, does the industry have an existing implementation that bypasses
> > the page cache the way DAX does, that is, storing data to NVDIMMs directly
> > from userspace rather than copying it from kernel-space memory to the NVDIMMs?
> 
> Any caching solution with associated metadata requires coordination
> with the kernel, so it is not possible for the kernel to stay
> completely out of the way. Especially when we're talking about a cache
> in front of the network there is not much room for DAX to offer
> improved performance because we need the kernel to take over on all
> write-persist operations to update cache metadata.

Agree.

> So, I'm still struggling to see why dm-cache is not a suitable
> solution for this case. It seems suitable if it is updated to allow
> direct dma-access to the pmem cache pages from the backing device
> storage / networking driver.

Sincerely,
Huaisheng Ye



Re: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-15 Thread Dan Williams
On Tue, May 15, 2018 at 7:52 PM, Matthew Wilcox  wrote:
> On Wed, May 16, 2018 at 02:05:05AM +, Huaisheng HS1 Ye wrote:
>> > From: Matthew Wilcox [mailto:wi...@infradead.org]
>> > Sent: Wednesday, May 16, 2018 12:20 AM>
>> > > > > > Then there's the problem of reconnecting the page cache (which is
>> > > > > > pointed to by ephemeral data structures like inodes and dentries) 
>> > > > > > to
>> > > > > > the new inodes.
>> > > > > Yes, it is not easy.
>> > > >
>> > > > Right ... and until we have that ability, there's no point in this 
>> > > > patch.
>> > > We are focusing to realize this ability.
>> >
>> > But is it the right approach?  So far we have (I think) two parallel
>> > activities.  The first is for local storage, using DAX to store files
>> > directly on the pmem.  The second is a physical block cache for network
>> > filesystems (both NAS and SAN).  You seem to be wanting to supplant the
>> > second effort, but I think it's much harder to reconnect the logical cache
>> > (ie the page cache) than it is the physical cache (ie the block cache).
>>
>> Dear Matthew,
>>
>> Thanks for the correction about cache lines.
>> But I have a question about that: assuming an NVDIMM works in pmem mode, even
>> if we use it as a physical block cache, like dm-cache, isn't there still a
>> potential risk from this cache-line issue, because NVDIMMs are byte-addressable
>> storage?
>> If a system crash happens, the CPU may not have the opportunity to flush all
>> dirty data from its cache lines to the NVDIMM while copying the data pointed
>> to by bio_vec.bv_page.
>> I know there is btt, which guarantees sector atomicity in block mode, but in
>> pmem mode a crash will likely leave a mix of new and old data in one NVDIMM
>> page.
>> Correct me if anything is wrong.
>
> Right, we do have BTT.  I'm not sure how it's being used with the block
> cache ... but the principle is the same; write the new data to a new
> page and then update the metadata to point to the new page.
>
>> Another question: if we use NVDIMMs as a physical block cache for network
>> filesystems, does the industry have an existing implementation that bypasses
>> the page cache the way DAX does, that is, storing data to NVDIMMs directly
>> from userspace rather than copying it from kernel-space memory to the NVDIMMs?
>
> The important part about DAX is that the kernel gets entirely out of the
> way and userspace takes care of handling flushing and synchronisation.
> I'm not sure how that works with the block cache; for a network
> filesystem, the filesystem needs to be in charge of deciding when and
> how to write the buffered data back to the storage.
>
> Dan, Vishal, perhaps you could jump in here; I'm not really sure where
> this effort has got to.

Which effort? I think we're saying that there is no such thing as a
DAX-capable block cache, and it is not clear that one makes sense.

We can certainly teach existing block caches some optimizations in the
presence of pmem, and perhaps that is sufficient.


Re: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-15 Thread Matthew Wilcox
On Wed, May 16, 2018 at 02:05:05AM +, Huaisheng HS1 Ye wrote:
> > From: Matthew Wilcox [mailto:wi...@infradead.org]
> > Sent: Wednesday, May 16, 2018 12:20 AM> 
> > > > > > Then there's the problem of reconnecting the page cache (which is
> > > > > > pointed to by ephemeral data structures like inodes and dentries) to
> > > > > > the new inodes.
> > > > > Yes, it is not easy.
> > > >
> > > > Right ... and until we have that ability, there's no point in this 
> > > > patch.
> > > We are focusing to realize this ability.
> > 
> > But is it the right approach?  So far we have (I think) two parallel
> > activities.  The first is for local storage, using DAX to store files
> > directly on the pmem.  The second is a physical block cache for network
> > filesystems (both NAS and SAN).  You seem to be wanting to supplant the
> > second effort, but I think it's much harder to reconnect the logical cache
> > (ie the page cache) than it is the physical cache (ie the block cache).
> 
> Dear Matthew,
> 
> Thanks for the correction about cache lines.
> But I have a question about that: assuming an NVDIMM works in pmem mode, even
> if we use it as a physical block cache, like dm-cache, isn't there still a
> potential risk from this cache-line issue, because NVDIMMs are byte-addressable
> storage?
> If a system crash happens, the CPU may not have the opportunity to flush all
> dirty data from its cache lines to the NVDIMM while copying the data pointed
> to by bio_vec.bv_page.
> I know there is btt, which guarantees sector atomicity in block mode, but in
> pmem mode a crash will likely leave a mix of new and old data in one NVDIMM
> page.
> Correct me if anything is wrong.

Right, we do have BTT.  I'm not sure how it's being used with the block
cache ... but the principle is the same; write the new data to a new
page and then update the metadata to point to the new page.
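
For illustration, here is a minimal userspace-style sketch of that
write-new-block-then-swing-the-pointer idea, using PMDK's libpmem for the
flush/fence details. It is not the BTT or dm-cache code and the names are
made up; it only shows why a torn page cannot be observed.

/*
 * Illustrative sketch of BTT-style sector atomicity: write the new data
 * to a free block, make it durable, then atomically swing the map entry.
 */
#include <libpmem.h>
#include <stdint.h>

#define BLK_SIZE 4096

/*
 * 'map' and 'blocks' both live in persistent memory; map[lba] names the
 * physical block currently backing logical block 'lba'.
 */
void atomic_block_write(uint64_t *map, char *blocks, uint64_t lba,
                        uint64_t free_blk, const char *new_data)
{
        /* 1. Land the new data in an unused block and make it durable. */
        pmem_memcpy_persist(blocks + free_blk * BLK_SIZE, new_data, BLK_SIZE);

        /* 2. Swing the map entry with one 8-byte store, then persist it.
         *    A crash before this point leaves the old block visible; a
         *    crash after it leaves the new one -- never a mix. */
        map[lba] = free_blk;
        pmem_persist(&map[lba], sizeof(map[lba]));
}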

> Another question: if we use NVDIMMs as a physical block cache for network
> filesystems, does the industry have an existing implementation that bypasses
> the page cache the way DAX does, that is, storing data to NVDIMMs directly
> from userspace rather than copying it from kernel-space memory to the NVDIMMs?

The important part about DAX is that the kernel gets entirely out of the
way and userspace takes care of handling flushing and synchronisation.
I'm not sure how that works with the block cache; for a network
filesystem, the filesystem needs to be in charge of deciding when and
how to write the buffered data back to the storage.
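
For concreteness, a minimal userspace sketch of that DAX model: map a file on
a DAX-capable filesystem with MAP_SYNC, store into it directly, and make the
stores durable with cache flushes issued from userspace -- no page cache and
no write() path. The mount point is only a placeholder.

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <immintrin.h>          /* _mm_clflushopt, _mm_sfence */

#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif

int main(void)
{
        int fd = open("/mnt/pmem/file", O_RDWR);        /* placeholder path */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        memcpy(p, "hello", 6);  /* store straight into pmem, no page cache */
        _mm_clflushopt(p);      /* flush the dirtied cache line ...        */
        _mm_sfence();           /* ... and fence; the data is now durable  */

        munmap(p, 4096);
        close(fd);
        return 0;
}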

Dan, Vishal, perhaps you could jump in here; I'm not really sure where
this effort has got to.


Re: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-15 Thread Dan Williams
On Tue, May 15, 2018 at 7:05 PM, Huaisheng HS1 Ye  wrote:
>> From: Matthew Wilcox [mailto:wi...@infradead.org]
>> Sent: Wednesday, May 16, 2018 12:20 AM>
>> > > > > Then there's the problem of reconnecting the page cache (which is
>> > > > > pointed to by ephemeral data structures like inodes and dentries) to
>> > > > > the new inodes.
>> > > > Yes, it is not easy.
>> > >
>> > > Right ... and until we have that ability, there's no point in this patch.
>> > We are focusing to realize this ability.
>>
>> But is it the right approach?  So far we have (I think) two parallel
>> activities.  The first is for local storage, using DAX to store files
>> directly on the pmem.  The second is a physical block cache for network
>> filesystems (both NAS and SAN).  You seem to be wanting to supplant the
>> second effort, but I think it's much harder to reconnect the logical cache
>> (ie the page cache) than it is the physical cache (ie the block cache).
>
> Dear Matthew,
>
> Thanks for the correction about cache lines.
> But I have a question about that: assuming an NVDIMM works in pmem mode, even
> if we use it as a physical block cache, like dm-cache, isn't there still a
> potential risk from this cache-line issue, because NVDIMMs are byte-addressable
> storage?

No, there is no risk if the cache is designed properly. The pmem
driver will not report that the I/O is complete until the entire
payload of the data write has made it to persistent memory. The cache
driver will not report that the write succeeded until the pmem driver
completes the I/O. There is no risk from losing power while the pmem
driver is operating because the cache will recover to its last
acknowledged stable state, i.e. it will roll back / undo the
incomplete write.
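
Sketched in code, the contract looks roughly like this: the payload is made
durable first, then a small commit record, and only then is the write
acknowledged; on recovery, entries whose commit record was never persisted are
simply discarded, which is the roll-back described above. libpmem is used for
the flush details, and struct cache_entry / ack_bio are illustrative names, not
dm-cache internals.

#include <libpmem.h>
#include <stdint.h>

struct cache_entry {
        char     data[4096];
        uint64_t seq;           /* commit record: 0 = entry not valid */
};

static void cache_write(struct cache_entry *e, const void *buf,
                        uint64_t seq, void (*ack_bio)(void))
{
        /* 1. Land the whole payload in persistent memory. */
        pmem_memcpy_persist(e->data, buf, sizeof(e->data));

        /* 2. Only now publish it by persisting the commit record. */
        e->seq = seq;
        pmem_persist(&e->seq, sizeof(e->seq));

        /* 3. Only now tell the upper layer the write succeeded. */
        ack_bio();
}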

> If a system crash happens, the CPU may not have the opportunity to flush all
> dirty data from its cache lines to the NVDIMM while copying the data pointed
> to by bio_vec.bv_page.
> I know there is btt, which guarantees sector atomicity in block mode, but in
> pmem mode a crash will likely leave a mix of new and old data in one NVDIMM
> page.
> Correct me if anything is wrong.

dm-cache performs metadata management similar to the btt driver's
to ensure safe forward progress of the cache state relative to power
loss or a system crash.

> Another question: if we use NVDIMMs as a physical block cache for network
> filesystems, does the industry have an existing implementation that bypasses
> the page cache the way DAX does, that is, storing data to NVDIMMs directly
> from userspace rather than copying it from kernel-space memory to the NVDIMMs?

Any caching solution with associated metadata requires coordination
with the kernel, so it is not possible for the kernel to stay
completely out of the way. Especially when we're talking about a cache
in front of the network there is not much room for DAX to offer
improved performance because we need the kernel to take over on all
write-persist operations to update cache metadata.

So, I'm still struggling to see why dm-cache is not a suitable
solution for this case. It seems suitable if it is updated to allow
direct dma-access to the pmem cache pages from the backing device
storage / networking driver.


Re: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-15 Thread Matthew Wilcox
On Tue, May 15, 2018 at 04:07:28PM +, Huaisheng HS1 Ye wrote:
> > From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf 
> > Of Matthew
> > Wilcox
> > No.  In the current situation, the user knows that either the entire
> > page was written back from the pagecache or none of it was (at least
> > with a journalling filesystem).  With your proposal, we may have pages
> > splintered along cacheline boundaries, with a mix of old and new data.
> > This is completely unacceptable to most customers.
> 
> Dear Matthew,
> 
> Thanks for your great help, I really didn't consider this case.
> I want to make it a little bit clearer to me. So, correct me if anything 
> wrong.
> 
> Is that to say this mix of old and new data in one page, which only has 
> chance to happen when CPU failed to flush all dirty data from LLC to NVDIMM?
> But if an interrupt can be reported to CPU, and CPU successfully flush all 
> dirty data from cache lines to NVDIMM within interrupt response function, 
> this mix of old and new data can be avoided.

If you can keep the CPU and the memory (and all the busses between them)
alive for long enough after the power signal has been tripped, yes.
Talk to your hardware designers about what it will take to achieve this
:-) Be sure to ask about the number of retries which may be necessary
on the CPU interconnect to flush all data to an NV-DIMM attached to a
remote CPU.

> Current X86_64 uses N-way set associative cache, and every cache line has 64 
> bytes.
> For 4096 bytes page, one page shall be splintered to 64 (4096/64) lines. Is 
> it right?

That's correct.

> > > > Then there's the problem of reconnecting the page cache (which is
> > > > pointed to by ephemeral data structures like inodes and dentries) to
> > > > the new inodes.
> > > Yes, it is not easy.
> > 
> > Right ... and until we have that ability, there's no point in this patch.
> We are focusing to realize this ability.

But is it the right approach?  So far we have (I think) two parallel
activities.  The first is for local storage, using DAX to store files
directly on the pmem.  The second is a physical block cache for network
filesystems (both NAS and SAN).  You seem to be wanting to supplant the
second effort, but I think it's much harder to reconnect the logical cache
(ie the page cache) than it is the physical cache (ie the block cache).



RE: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-15 Thread Huaisheng HS1 Ye



> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On Behalf Of 
> Matthew
> Wilcox
> Sent: Friday, May 11, 2018 12:28 AM
> On Wed, May 09, 2018 at 04:47:54AM +, Huaisheng HS1 Ye wrote:
> > > On Tue, May 08, 2018 at 02:59:40AM +, Huaisheng HS1 Ye wrote:
> > > > Currently in our mind, an ideal use scenario is that, we put all page 
> > > > caches to
> > > > zone_nvm, without any doubt, page cache is an efficient and common cache
> > > > implement, but it has a disadvantage that all dirty data within it 
> > > > would has risk
> > > > to be missed by power failure or system crash. If we put all page 
> > > > caches to NVDIMMs,
> > > > all dirty data will be safe.
> > >
> > > That's a common misconception.  Some dirty data will still be in the
> > > CPU caches.  Are you planning on building servers which have enough
> > > capacitance to allow the CPU to flush all dirty data from LLC to NV-DIMM?
> > >
> > Sorry for not being clear.
> > For CPU caches if there is a power failure, NVDIMM has ADR to guarantee an 
> > interrupt
> will be reported to CPU, an interrupt response function should be responsible 
> to flush
> all dirty data to NVDIMM.
> > If there is a system crash, perhaps CPU couldn't have chance to execute 
> > this response.
> >
> > It is hard to make sure everything is safe, what we can do is just to save 
> > the dirty
> data which is already stored to Pagecache, but not in CPU cache.
> > Is this an improvement than current?
> 
> No.  In the current situation, the user knows that either the entire
> page was written back from the pagecache or none of it was (at least
> with a journalling filesystem).  With your proposal, we may have pages
> splintered along cacheline boundaries, with a mix of old and new data.
> This is completely unacceptable to most customers.

Dear Matthew,

Thanks for your great help, I really hadn't considered this case.
I want to make it a little clearer to myself, so correct me if anything is wrong.

Is it right to say that this mix of old and new data in one page can only happen
when the CPU fails to flush all dirty data from the LLC to the NVDIMM?
If an interrupt can be delivered to the CPU, and the CPU successfully flushes all
dirty data from its cache lines to the NVDIMM within the interrupt handler, this
mix of old and new data can be avoided.

Current x86_64 CPUs use an N-way set-associative cache, and every cache line is
64 bytes, so a 4096-byte page is split across 64 (4096/64) cache lines. Is that
right?
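
For reference, flushing one such page to persistence on x86-64 means issuing a
CLFLUSHOPT (or CLWB) per 64-byte line -- the 64 iterations from the 4096/64
arithmetic above -- followed by a store fence. A minimal userspace sketch,
leaving aside whether an ADR/interrupt path would ever have time to run this
over every dirty page:

#include <stddef.h>
#include <immintrin.h>          /* _mm_clflushopt, _mm_sfence */

#define PAGE_SIZE       4096
#define CACHELINE_SIZE  64

static void flush_page(const void *page)
{
        const char *p = page;

        /* 4096 / 64 = 64 cache lines per page */
        for (size_t off = 0; off < PAGE_SIZE; off += CACHELINE_SIZE)
                _mm_clflushopt(p + off);
        _mm_sfence();           /* order the flushes before continuing */
}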


> > > Then there's the problem of reconnecting the page cache (which is
> > > pointed to by ephemeral data structures like inodes and dentries) to
> > > the new inodes.
> > Yes, it is not easy.
> 
> Right ... and until we have that ability, there's no point in this patch.
We are focusing on realizing this ability.

Sincerely,
Huaisheng Ye




Re: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-10 Thread Matthew Wilcox
On Wed, May 09, 2018 at 04:47:54AM +, Huaisheng HS1 Ye wrote:
> > On Tue, May 08, 2018 at 02:59:40AM +, Huaisheng HS1 Ye wrote:
> > > Currently in our mind, an ideal use scenario is that, we put all page 
> > > caches to
> > > zone_nvm, without any doubt, page cache is an efficient and common cache
> > > implement, but it has a disadvantage that all dirty data within it would 
> > > has risk
> > > to be missed by power failure or system crash. If we put all page caches 
> > > to NVDIMMs,
> > > all dirty data will be safe.
> > 
> > That's a common misconception.  Some dirty data will still be in the
> > CPU caches.  Are you planning on building servers which have enough
> > capacitance to allow the CPU to flush all dirty data from LLC to NV-DIMM?
> > 
> Sorry for not being clear.
> For CPU caches if there is a power failure, NVDIMM has ADR to guarantee an 
> interrupt will be reported to CPU, an interrupt response function should be 
> responsible to flush all dirty data to NVDIMM.
> If there is a system crash, perhaps CPU couldn't have chance to execute this 
> response.
> 
> It is hard to make sure everything is safe, what we can do is just to save 
> the dirty data which is already stored to Pagecache, but not in CPU cache.
> Is this an improvement than current?

No.  In the current situation, the user knows that either the entire
page was written back from the pagecache or none of it was (at least
with a journalling filesystem).  With your proposal, we may have pages
splintered along cacheline boundaries, with a mix of old and new data.
This is completely unacceptable to most customers.

> > Then there's the problem of reconnecting the page cache (which is
> > pointed to by ephemeral data structures like inodes and dentries) to
> > the new inodes.
> Yes, it is not easy.

Right ... and until we have that ability, there's no point in this patch.


RE: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-08 Thread Huaisheng HS1 Ye

> 
> On Tue, May 08, 2018 at 02:59:40AM +, Huaisheng HS1 Ye wrote:
> > Currently in our mind, an ideal use scenario is that, we put all page 
> > caches to
> > zone_nvm, without any doubt, page cache is an efficient and common cache
> > implement, but it has a disadvantage that all dirty data within it would 
> > has risk
> > to be missed by power failure or system crash. If we put all page caches to 
> > NVDIMMs,
> > all dirty data will be safe.
> 
> That's a common misconception.  Some dirty data will still be in the
> CPU caches.  Are you planning on building servers which have enough
> capacitance to allow the CPU to flush all dirty data from LLC to NV-DIMM?
> 
Sorry for not being clear.
For the CPU caches, if there is a power failure, NVDIMM platforms have ADR to
guarantee that an interrupt is delivered to the CPU, and an interrupt handler
should be responsible for flushing all dirty data to the NVDIMM.
If there is a system crash, the CPU may not get the chance to execute this
handler.

It is hard to make sure everything is safe; what we can do is save the dirty
data that has already been written to the page cache but is not yet in the CPU
cache.
Is this an improvement over the current situation?

> Then there's the problem of reconnecting the page cache (which is
> pointed to by ephemeral data structures like inodes and dentries) to
> the new inodes.
Yes, it is not easy.

> 
> And then you have to convince customers that what you're doing is safe
> enough for them to trust it ;-)
Sure. 

Sincerely,
Huaisheng Ye


Re: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Dan Williams
On Mon, May 7, 2018 at 7:59 PM, Huaisheng HS1 Ye  wrote:
>>
>>Dan Williams  writes:
>>
>>> On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox 
>>wrote:
 On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
> Traditionally, NVDIMMs are treated by mm(memory management)
>>subsystem as
> DEVICE zone, which is a virtual zone and both its start and end of pfn
> are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel
>>uses
> corresponding drivers, which locate at \drivers\nvdimm\ and
> \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
> memory hot plug implementation.

 You probably want to let linux-nvdimm know about this patch set.
 Adding to the cc.
>>>
>>> Yes, thanks for that!
>>>
 Also, I only received patch 0 and 4.  What happened
 to 1-3,5 and 6?

> With current kernel, many mm’s classical features like the buddy
> system, swap mechanism and page cache couldn’t be supported to
>>NVDIMM.
> What we are doing is to expand kernel mm’s capacity to make it to
>>handle
> NVDIMM like DRAM. Furthermore we make mm could treat DRAM and
>>NVDIMM
> separately, that means mm can only put the critical pages to NVDIMM
>>
>>Please define "critical pages."
>>
> zone, here we created a new zone type as NVM zone. That is to say for
> traditional(or normal) pages which would be stored at DRAM scope like
> Normal, DMA32 and DMA zones. But for the critical pages, which we hope
> them could be recovered from power fail or system crash, we make them
> to be persistent by storing them to NVM zone.
>>
>>[...]
>>
>>> I think adding yet one more mm-zone is the wrong direction. Instead,
>>> what we have been considering is a mechanism to allow a device-dax
>>> instance to be given back to the kernel as a distinct numa node
>>> managed by the VM. It seems it's time to dust off those patches.
>>
>>What's the use case?  The above patch description seems to indicate an
>>intent to recover contents after a power loss.  Without seeing the whole
>>series, I'm not sure how that's accomplished in a safe or meaningful
>>way.
>>
>>Huaisheng, could you provide a bit more background?
>>
>
> Currently in our mind, an ideal use scenario is that, we put all page caches 
> to
> zone_nvm, without any doubt, page cache is an efficient and common cache
> implement, but it has a disadvantage that all dirty data within it would has 
> risk
> to be missed by power failure or system crash. If we put all page caches to 
> NVDIMMs,
> all dirty data will be safe.
>
> And the most important is that, Page cache is different from dm-cache or 
> B-cache.
> Page cache exists at mm. So, it has much more performance than other Write
> caches, which locate at storage level.

Can you be more specific? I think the only fundamental performance
difference between page cache and a block caching driver is that page
cache pages can be DMA'ed directly to lower level storage. However, I
believe that problem is solvable, i.e. we can teach dm-cache to
perform the equivalent of in-kernel direct-I/O when transferring data
between the cache and the backing storage when the cache is comprised
of persistent memory.

>
> At present we have realized NVM zone to be supported by two sockets(NUMA)
> product based on Lenovo Purley platform, and we can expand NVM flag into
> Page Cache allocation interface, so all Page Caches of system had been stored
> to NVDIMM safely.
>
> Now we are focusing how to recover data from Page cache after power on. That 
> is,
> The dirty pages could be safe and the time cost of cache training would be 
> saved a lot.
> Because many pages have already been stored to ZONE_NVM before power failure.

I don't see how ZONE_NVM fits into a persistent page cache solution.
All of the mm structures to maintain the page cache are built to be
volatile. Once you build the infrastructure to persist and restore the
state of the page cache it is no longer the traditional page cache.
I.e. it will become something much closer to dm-cache or a filesystem.

One nascent idea from Dave Chinner is to teach xfs how to be a block
server for an upper level filesystem. His aim is sub-volume and
snapshot support, but I wonder if caching could be adapted into that
model?

In any event I think persisting and restoring cache state needs to be
designed before deciding if changes to the mm are needed.


Re: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Matthew Wilcox
On Tue, May 08, 2018 at 02:59:40AM +, Huaisheng HS1 Ye wrote:
> Currently in our mind, an ideal use scenario is that, we put all page caches 
> to
> zone_nvm, without any doubt, page cache is an efficient and common cache
> implement, but it has a disadvantage that all dirty data within it would has 
> risk
> to be missed by power failure or system crash. If we put all page caches to 
> NVDIMMs,
> all dirty data will be safe. 

That's a common misconception.  Some dirty data will still be in the
CPU caches.  Are you planning on building servers which have enough
capacitance to allow the CPU to flush all dirty data from LLC to NV-DIMM?

Then there's the problem of reconnecting the page cache (which is
pointed to by ephemeral data structures like inodes and dentries) to
the new inodes.

And then you have to convince customers that what you're doing is safe
enough for them to trust it ;-)



RE: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Huaisheng HS1 Ye
>
>Dan Williams  writes:
>
>> On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox 
>wrote:
>>> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
 Traditionally, NVDIMMs are treated by mm(memory management)
>subsystem as
 DEVICE zone, which is a virtual zone and both its start and end of pfn
 are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel
>uses
 corresponding drivers, which locate at \drivers\nvdimm\ and
 \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
 memory hot plug implementation.
>>>
>>> You probably want to let linux-nvdimm know about this patch set.
>>> Adding to the cc.
>>
>> Yes, thanks for that!
>>
>>> Also, I only received patch 0 and 4.  What happened
>>> to 1-3,5 and 6?
>>>
 With current kernel, many mm’s classical features like the buddy
 system, swap mechanism and page cache couldn’t be supported to
>NVDIMM.
 What we are doing is to expand kernel mm’s capacity to make it to
>handle
 NVDIMM like DRAM. Furthermore we make mm could treat DRAM and
>NVDIMM
 separately, that means mm can only put the critical pages to NVDIMM
>
>Please define "critical pages."
>
 zone, here we created a new zone type as NVM zone. That is to say for
 traditional(or normal) pages which would be stored at DRAM scope like
 Normal, DMA32 and DMA zones. But for the critical pages, which we hope
 them could be recovered from power fail or system crash, we make them
 to be persistent by storing them to NVM zone.
>
>[...]
>
>> I think adding yet one more mm-zone is the wrong direction. Instead,
>> what we have been considering is a mechanism to allow a device-dax
>> instance to be given back to the kernel as a distinct numa node
>> managed by the VM. It seems it's time to dust off those patches.
>
>What's the use case?  The above patch description seems to indicate an
>intent to recover contents after a power loss.  Without seeing the whole
>series, I'm not sure how that's accomplished in a safe or meaningful
>way.
>
>Huaisheng, could you provide a bit more background?
>

Currently, our ideal use scenario is to put all page caches into zone_nvm.
Without any doubt, the page cache is an efficient and common caching
implementation, but it has the disadvantage that all dirty data within it is at
risk of being lost on a power failure or system crash. If we put all page caches
into NVDIMMs, all dirty data will be safe.

And most importantly, the page cache is different from dm-cache or bcache.
The page cache lives in the mm layer, so it performs much better than other
write caches, which sit at the storage level.

At present we have implemented the NVM zone on a two-socket (NUMA) product based
on the Lenovo Purley platform, and we can extend an NVM flag into the page cache
allocation interface (roughly as sketched below), so that all of the system's
page cache is stored in NVDIMM safely.
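
Only patches 0 and 4 of the series made it to the list, so as an illustration
rather than the actual code: routing page cache allocations to such a zone
could look roughly like the sketch below, where __GFP_NVM and its plumbing into
ZONE_NVM are assumed names based on the cover letter, not confirmed interfaces.

/* Hypothetical sketch, not taken from the posted series: steer page cache
 * allocations into ZONE_NVM via an assumed __GFP_NVM flag. */
static inline struct page *page_cache_alloc_nvm(struct address_space *mapping)
{
        gfp_t gfp = mapping_gfp_mask(mapping) | __GFP_NVM;  /* assumed flag */

        /* __page_cache_alloc() honours the gfp zone bits, so the page would
         * come from the persistent NVM zone instead of DRAM. */
        return __page_cache_alloc(gfp);
}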

Now we are focusing on how to recover data from the page cache after power-on.
That is, the dirty pages would be safe and much of the cache warm-up cost would
be saved, because many pages would already have been stored in ZONE_NVM before
the power failure.

Thanks,
Huaisheng Ye



RE: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Huaisheng HS1 Ye

> 
> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
> > Traditionally, NVDIMMs are treated by mm(memory management)
> subsystem as
> > DEVICE zone, which is a virtual zone and both its start and end of pfn
> > are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel
> uses
> > corresponding drivers, which locate at \drivers\nvdimm\ and
> > \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
> > memory hot plug implementation.
> 
> You probably want to let linux-nvdimm know about this patch set.
> Adding to the cc.  Also, I only received patch 0 and 4.  What happened
> to 1-3,5 and 6?

Sorry, something may have gone wrong with my git send-email, but my mailbox
received all of them.
Anyway, I will send them again and CC linux-nvdimm.

Thanks
Huaisheng