Re: [PATCH 0/3] iopmem : A block device for PCIe memory
[ adding Ashok and David for potential iommu comments ]

On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates wrote:
> This patch follows from an RFC we did earlier this year [1]. This
> patchset applies cleanly to v4.9-rc1.
>
> Updates since RFC
> -----------------
>   Rebased.
>   Included the iopmem driver in the submission.
>
> History
> -------
>
> There have been several attempts to upstream patchsets that enable
> DMAs between PCIe peers. These include Peer-Direct [2] and DMA-Buf
> style patches [3]. None have been successful to date. Haggai Eran
> gives a nice overview of the prior art in this space in his cover
> letter [3].
>
> Motivation and Use Cases
> ------------------------
>
> PCIe IO devices are getting faster. It is not uncommon now to find PCIe
> network and storage devices that can generate and consume several GB/s.
> Almost always these devices have either a high-performance DMA engine, a
> number of exposed PCIe BARs, or both.
>
> Until this patch, any high-performance transfer of information between
> two PCIe devices has required the use of a staging buffer in system
> memory. With this patch the bandwidth to system memory is not compromised
> when high-throughput transfers occur between PCIe devices. This means
> that more system memory bandwidth is available to the CPU cores for data
> processing and manipulation. In addition, in systems where the two PCIe
> devices reside behind a PCIe switch the datapath avoids the CPU
> entirely.

I agree with the motivation and the need for a solution, but I have
some questions about this implementation.

> Consumers
> ---------
>
> We provide a PCIe device driver in an accompanying patch that can be
> used to map any PCIe BAR into a DAX-capable block device. For
> non-persistent BARs this simply serves as an alternative to using
> system memory bounce buffers. For persistent BARs this can serve as an
> additional storage device in the system.

Why block devices?
I wonder if iopmem was initially designed back when we were considering
enabling DAX for raw block devices. However, that support has since been
ripped out / abandoned. You currently need a filesystem on top of a
block device to get DAX operation.

Putting xfs or ext4 on top of a PCIe memory-mapped range seems awkward
if all you want is a way to map the BAR for another PCIe device in the
topology. If you're only using the block device as an entry point to
create dax mappings then a device-dax (drivers/dax/) character device
might be a better fit.

> Testing and Performance
> -----------------------
>
> We have done a moderate amount of testing of this patch on a QEMU
> environment and on real hardware. On real hardware we have observed
> peer-to-peer writes of up to 4GB/s and reads of up to 1.2GB/s. In
> both cases these numbers are limitations of our consumer hardware. In
> addition, we have observed that the CPU DRAM bandwidth is not impacted
> when using IOPMEM, which is not the case when a traditional path
> through system memory is taken.
>
> For more information on the testing and performance results see the
> GitHub site [4].
>
> Known Issues
> ------------
>
> 1. Address Translation. Suggestions have been made that in certain
> architectures and topologies the dma_addr_t passed to the DMA master
> in a peer-2-peer transfer will not correctly route to the IO memory
> intended. However in our testing to date we have not seen this to be
> an issue, even in systems with IOMMUs and PCIe switches. It is our
> understanding that an IOMMU only maps system memory and would not
> interfere with device memory regions. (It certainly has no opportunity
> to do so if the transfer gets routed through a switch).

There may still be platforms where peer-to-peer cycles are routed up
through the root bridge and then back down to the target device, but we
can address that when / if it happens.
I wonder if we could (ab)use a software-defined 'pasid' as the requester
id for a peer-to-peer mapping that needs address translation.

> 2. Memory Segment Spacing. This patch has the same limitations that
> ZONE_DEVICE does in that memory regions must be spaced at least
> SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
> BARs can be placed closer together than this. Thus ZONE_DEVICE would not
> be usable on neighboring BARs. For our purposes, this is not an issue as
> we'd only be looking at enabling a single BAR in a given PCIe device.
> More exotic use cases may have problems with this.

I'm working on patches for 4.10 to allow mixing multiple
devm_memremap_pages() allocations within the same physical section.
Hopefully this won't be a problem going forward.

> 3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
> peer there is potential for coherency issues and for writes to occur out
> of order. This is something that users of this feature need to be
> cognizant of. Though really, this isn't much different than the
> existing situation
Re: [PATCH 20/20] dax: Clear dirty entry tags on cache flush
On Tue, Sep 27, 2016 at 06:08:24PM +0200, Jan Kara wrote:
> Currently we never clear dirty tags in DAX mappings and thus address
> ranges to flush accumulate. Now that we have locking of radix tree
> entries, we have all the locking necessary to reliably clear the radix
> tree dirty tag when flushing caches for corresponding address range.
> Similarly to page_mkclean() we also have to write-protect pages to get a
> page fault when the page is next written to so that we can mark the
> entry dirty again.
>
> Signed-off-by: Jan Kara

Looks great.

Reviewed-by: Ross Zwisler
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
[PATCH 1/3] memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.
From: Logan Gunthorpe

We build on recent work that adds memory regions owned by a device driver
(ZONE_DEVICE) [1] and that adds struct page support for these new regions
of memory [2].

1. Add an extra flags argument to devm_memremap_pages to take in a
   MEMREMAP_XX argument. We update the existing calls to this function
   to reflect the change.

2. For completeness, we add MEMREMAP_WT support to memremap; however we
   have no actual need for this functionality.

3. We add the static functions add_zone_device_pages and
   remove_zone_device_pages. These are similar to arch_add_memory except
   they don't create the memory mapping. We don't believe these need to
   be made arch specific, but are open to other opinions.

4. devm_memremap_pages and devm_memremap_pages_release are updated to
   treat IO memory slightly differently. For IO memory we use a
   combination of the appropriate io_remap function and the zone_device
   pages functions created above. A flags variable and kaddr pointer are
   added to struct page_map to facilitate this for the release function.
   We also set up the page attribute tables for the mapped region
   correctly based on the desired mapping.
[1] https://lists.01.org/pipermail/linux-nvdimm/2015-August/001810.html
[2] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002387.html

Signed-off-by: Stephen Bates
Signed-off-by: Logan Gunthorpe
---
 drivers/dax/pmem.c                |  4 +-
 drivers/nvdimm/pmem.c             |  4 +-
 include/linux/memremap.h          |  5 ++-
 kernel/memremap.c                 | 80 +--
 tools/testing/nvdimm/test/iomap.c |  3 +-
 5 files changed, 86 insertions(+), 10 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 9630d88..58ac456 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -14,6 +14,7 @@
 #include
 #include
 #include
+#include
 #include "../nvdimm/pfn.h"
 #include "../nvdimm/nd.h"
 #include "dax.h"
@@ -108,7 +109,8 @@ static int dax_pmem_probe(struct device *dev)
 	if (rc)
 		return rc;
 
-	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap);
+	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap,
+			ARCH_MEMREMAP_PMEM);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 42b3a82..97032a1 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -278,7 +278,7 @@ static int pmem_attach_disk(struct device *dev,
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
 		addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
-				altmap);
+				altmap, ARCH_MEMREMAP_PMEM);
 		pfn_sb = nd_pfn->pfn_sb;
 		pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
 		pmem->pfn_pad = resource_size(res) - resource_size(&pfn_res);
@@ -287,7 +287,7 @@ static int pmem_attach_disk(struct device *dev,
 		res->start += pmem->data_offset;
 	} else if (pmem_should_map_pages(dev)) {
 		addr = devm_memremap_pages(dev, &pmem->res,
-				&q->q_usage_counter, NULL);
+				&q->q_usage_counter, NULL, ARCH_MEMREMAP_PMEM);
 		pmem->pfn_flags |= PFN_MAP;
 	} else
 		addr = devm_memremap(dev, pmem->phys_addr,
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9341619..fc99283 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -51,12 +51,13 @@ struct dev_pagemap {
 
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct percpu_ref *ref, struct vmem_altmap *altmap);
+		struct percpu_ref *ref, struct vmem_altmap *altmap,
+		unsigned long flags);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
-		struct vmem_altmap *altmap)
+		struct vmem_altmap *altmap, unsigned long flags)
 {
 	/*
 	 * Fail attempts to call devm_memremap_pages() without
diff --git a/kernel/memremap.c b/kernel/memremap.c
index b501e39..d5f462c 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -175,13 +175,41 @@ static RADIX_TREE(pgmap_radix, GFP_KERNEL);
 #define SECTION_MASK		~((1UL << PA_SECTION_SHIFT) - 1)
 #define SECTION_SIZE		(1UL << PA_SECTION_SHIFT)
 
+enum {
+	PAGEMAP_IO_MEM = 1 << 0,
+};
+
 struct page_map {
 	struct resource res;
 	struct percpu_ref *ref;
 	struct dev_pagemap pgmap;
 	struct vmem_altmap altmap;
+	void *kaddr;
+	int flags;
 };
 
+static int add_zone_device_pages(int nid, u64 start, u64 size)
+{
+	struct pglist_data *pgdat = NODE_DATA(nid);
[PATCH 3/3] iopmem : Add documentation for iopmem driver
Add documentation for the iopmem PCIe device driver.

Signed-off-by: Stephen Bates
Signed-off-by: Logan Gunthorpe
---
 Documentation/blockdev/00-INDEX   |  2 ++
 Documentation/blockdev/iopmem.txt | 62 +++
 2 files changed, 64 insertions(+)
 create mode 100644 Documentation/blockdev/iopmem.txt

diff --git a/Documentation/blockdev/00-INDEX b/Documentation/blockdev/00-INDEX
index c08df56..913e500 100644
--- a/Documentation/blockdev/00-INDEX
+++ b/Documentation/blockdev/00-INDEX
@@ -8,6 +8,8 @@ cpqarray.txt
 	- info on using Compaq's SMART2 Intelligent Disk Array Controllers.
 floppy.txt
 	- notes and driver options for the floppy disk driver.
+iopmem.txt
+	- info on the iopmem block driver.
 mflash.txt
 	- info on mGine m(g)flash driver for linux.
 nbd.txt
diff --git a/Documentation/blockdev/iopmem.txt b/Documentation/blockdev/iopmem.txt
new file mode 100644
index 000..ba805b8
--- /dev/null
+++ b/Documentation/blockdev/iopmem.txt
@@ -0,0 +1,62 @@
+IOPMEM Block Driver
+===================
+
+Logan Gunthorpe and Stephen Bates - October 2016
+
+Introduction
+------------
+
+The iopmem module creates a DAX-capable block device from a BAR on a PCIe
+device. iopmem borrows heavily from the pmem driver although it utilizes IO
+memory rather than system memory as its backing store.
+
+Usage
+-----
+
+To include the iopmem module in your kernel please set CONFIG_BLK_DEV_IOPMEM
+to either y or m. A block device will be created for each PCIe attached device
+that matches the vendor and device ID as specified in the module. Currently an
+unallocated PMC PCIe ID is used as the default. Alternatively this driver can
+be bound to any arbitrary PCIe function using the sysfs bind entry.
+
+The main purpose for an iopmem block device is expected to be for peer-2-peer
+PCIe transfers. We DO NOT RECOMMEND accessing an iopmem device using the local
+CPU unless you are doing one of the three following things:
+
+1. Creating a DAX-capable filesystem on the iopmem device.
+2. Creating some files on the DAX-capable filesystem.
+3. Interrogating the files on said filesystem to obtain pointers that can be
+   passed to other PCIe devices for p2p DMA operations.
+
+Issues
+------
+
+1. Address Translation. Suggestions have been made that in certain
+architectures and topologies the dma_addr_t passed to the DMA master
+in a peer-2-peer transfer will not correctly route to the IO memory
+intended. However in our testing to date we have not seen this to be
+an issue, even in systems with IOMMUs and PCIe switches. It is our
+understanding that an IOMMU only maps system memory and would not
+interfere with device memory regions. (It certainly has no opportunity
+to do so if the transfer gets routed through a switch).
+
+2. Memory Segment Spacing. This patch has the same limitations that
+ZONE_DEVICE does in that memory regions must be spaced at least
+SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
+BARs can be placed closer together than this. Thus ZONE_DEVICE would not
+be usable on neighboring BARs. For our purposes, this is not an issue as
+we'd only be looking at enabling a single BAR in a given PCIe device.
+More exotic use cases may have problems with this.
+
+3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
+peer there is potential for coherency issues and for writes to occur out
+of order. This is something that users of this feature need to be
+cognizant of and may necessitate the use of CONFIG_EXPERT. Though really,
+this isn't much different than the existing situation with RDMA: if
+userspace sets up an MR for remote use, they need to be careful about
+using that memory region themselves.
+
+4. Architecture. Currently this patch is applicable only to x86
+architectures. The same is true for much of the code pertaining to
+PMEM and ZONE_DEVICE. It is hoped that the work will be extended to other
+ARCH over time.
-- 
2.1.4
Re: [PATCH 17/20] mm: Export follow_pte()
On Tue, Sep 27, 2016 at 06:08:21PM +0200, Jan Kara wrote:
> DAX will need to implement its own version of page_check_address(). To
> avoid duplicating page table walking code, export follow_pte() which
> does what we need.
>
> Signed-off-by: Jan Kara

Reviewed-by: Ross Zwisler
Re: [PATCH 16/20] mm: Provide helper for finishing mkwrite faults
On Tue, Sep 27, 2016 at 06:08:20PM +0200, Jan Kara wrote:
> Provide a helper function for finishing write faults due to PTE being
> read-only. The helper will be used by DAX to avoid the need of
> complicating generic MM code with DAX locking specifics.
>
> Signed-off-by: Jan Kara
> ---
>  include/linux/mm.h |  1 +
>  mm/memory.c        | 65 +++---
>  2 files changed, 39 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1055f2ece80d..e5a014be8932 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -617,6 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>  int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
>  		struct page *page);
>  int finish_fault(struct vm_fault *vmf);
> +int finish_mkwrite_fault(struct vm_fault *vmf);
>  #endif
>
>  /*
> diff --git a/mm/memory.c b/mm/memory.c
> index f49e736d6a36..8c8cb7f2133e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2266,6 +2266,36 @@ oom:
>  	return VM_FAULT_OOM;
>  }
>
> +/**
> + * finish_mkrite_fault - finish page fault making PTE writeable once the page

finish_mkwrite_fault

> @@ -2315,26 +2335,17 @@ static int wp_page_shared(struct vm_fault *vmf)
>  		put_page(vmf->page);
>  		return tmp;
>  	}
> -	/*
> -	 * Since we dropped the lock we need to revalidate
> -	 * the PTE as someone else may have changed it.  If
> -	 * they did, we just return, as we can count on the
> -	 * MMU to tell us if they didn't also make it writable.
> -	 */
> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> -			vmf->address, &vmf->ptl);
> -	if (!pte_same(*vmf->pte, vmf->orig_pte)) {
> +	tmp = finish_mkwrite_fault(vmf);
> +	if (unlikely(!tmp || (tmp &
> +			      (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {

The 'tmp' return from finish_mkwrite_fault() can only be 0 or
VM_FAULT_WRITE.
I think this test should just be

	if (unlikely(!tmp)) {

With that and the small spelling fix:

Reviewed-by: Ross Zwisler
Re: [PATCH 15/20] mm: Move part of wp_page_reuse() into the single call site
On Tue, Sep 27, 2016 at 06:08:19PM +0200, Jan Kara wrote:
> wp_page_reuse() handles write shared faults which is needed only in
> wp_page_shared(). Move the handling only into that location to make
> wp_page_reuse() simpler and avoid a strange situation when we sometimes
> pass in a locked page, sometimes unlocked etc.
>
> Signed-off-by: Jan Kara

Reviewed-by: Ross Zwisler
Re: [PATCH 12/20] mm: Factor out common parts of write fault handling
On Tue, Oct 18, 2016 at 12:50:00PM +0200, Jan Kara wrote:
> On Mon 17-10-16 16:08:51, Ross Zwisler wrote:
> > On Tue, Sep 27, 2016 at 06:08:16PM +0200, Jan Kara wrote:
> > > Currently we duplicate handling of shared write faults in
> > > wp_page_reuse() and do_shared_fault(). Factor them out into a common
> > > function.
> > >
> > > Signed-off-by: Jan Kara
> > > ---
> > >  mm/memory.c | 78 +
> > >  1 file changed, 37 insertions(+), 41 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 63d9c1a54caf..0643b3b5a12a 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -2063,6 +2063,41 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
> > >  }
> > >
> > >  /*
> > > + * Handle dirtying of a page in shared file mapping on a write fault.
> > > + *
> > > + * The function expects the page to be locked and unlocks it.
> > > + */
> > > +static void fault_dirty_shared_page(struct vm_area_struct *vma,
> > > +				    struct page *page)
> > > +{
> > > +	struct address_space *mapping;
> > > +	bool dirtied;
> > > +	bool page_mkwrite = vma->vm_ops->page_mkwrite;
> >
> > I think you may need to pass in a 'page_mkwrite' parameter if you don't
> > want to change behavior. Just checking to see if
> > vma->vm_ops->page_mkwrite is non-NULL works fine for this path:
> >
> > do_shared_fault()
> >   fault_dirty_shared_page()
> >
> > and for
> >
> > wp_page_shared()
> >   wp_page_reuse()
> >     fault_dirty_shared_page()
> >
> > But for these paths:
> >
> > wp_pfn_shared()
> >   wp_page_reuse()
> >     fault_dirty_shared_page()
> >
> > and
> >
> > do_wp_page()
> >   wp_page_reuse()
> >     fault_dirty_shared_page()
> >
> > we unconditionally pass 0 for the 'page_mkwrite' parameter, even though
> > from the logic in wp_pfn_shared() especially you can see that
> > vma->vm_ops->pfn_mkwrite() must be defined some of the time.
> The trick which makes this work is that for fault_dirty_shared_page() to
> be called at all, you have to set the 'dirty_shared' argument to
> wp_page_reuse() and that does not happen from the wp_pfn_shared() and
> do_wp_page() paths. So things work as they should. If you look somewhat
> later into the series, the patch "mm: Move part of wp_page_reuse() into
> the single call site" cleans this up to make things more obvious.
>
> 								Honza

Ah, cool, that makes sense. You can add:

Reviewed-by: Ross Zwisler
Re: [PATCH 11/20] mm: Remove unnecessary vma->vm_ops check
On Mon 17-10-16 13:40:41, Ross Zwisler wrote:
> On Tue, Sep 27, 2016 at 06:08:15PM +0200, Jan Kara wrote:
> > We don't check whether vma->vm_ops is NULL in do_shared_fault() so
> > there's hardly any point in checking it in wp_page_shared() which gets
> > called only for shared file mappings as well.
> >
> > Signed-off-by: Jan Kara
> > ---
> >  mm/memory.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index a4522e8999b2..63d9c1a54caf 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2301,7 +2301,7 @@ static int wp_page_shared(struct vm_fault *vmf, struct page *old_page)
> >
> >  	get_page(old_page);
> >
> > -	if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
> > +	if (vma->vm_ops->page_mkwrite) {
> >  		int tmp;
> >
> >  		pte_unmap_unlock(vmf->pte, vmf->ptl);
> > --
> > 2.6.6
>
> Does this apply equally to the check in wp_pfn_shared()? Both
> wp_page_shared() and wp_pfn_shared() are called for shared file mappings
> via do_wp_page().

Yes, it does apply there as well. Added to the commit. There are actually
more places with these checks which don't seem necessary but I didn't want
to do more cleanups than I need... But at least these two come logically
together.

								Honza
-- 
Jan Kara
SUSE Labs, CR
Re: [PATCH 07/20] mm: Add orig_pte field into vm_fault
On Mon 17-10-16 10:45:12, Ross Zwisler wrote:
> On Tue, Sep 27, 2016 at 06:08:11PM +0200, Jan Kara wrote:
> > Add orig_pte field to vm_fault structure to allow ->page_mkwrite
> > handlers to fully handle the fault. This also allows us to save some
> > passing of extra arguments around.
> >
> > Signed-off-by: Jan Kara
> > ---
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index f88b2d3810a7..66bc77f2d1d2 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -890,11 +890,12 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> >  	vmf.pte = pte_offset_map(pmd, address);
> >  	for (; vmf.address < address + HPAGE_PMD_NR*PAGE_SIZE;
> >  			vmf.pte++, vmf.address += PAGE_SIZE) {
> > -		pteval = *vmf.pte;
> > +		vmf.orig_pte = *vmf.pte;
> > +		pteval = vmf.orig_pte;
> >  		if (!is_swap_pte(pteval))
> >  			continue;
>
> 'pteval' is now only used once. It's probably cleaner to just remove it
> and use vmf.orig_pte for the is_swap_pte() check.

Yes, fixed.

> > @@ -3484,8 +3484,7 @@ static int handle_pte_fault(struct vm_fault *vmf)
> >  	 * So now it's safe to run pte_offset_map().
> >  	 */
> >  	vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> > -
> > -	entry = *vmf->pte;
> > +	vmf->orig_pte = *vmf->pte;
> >
> >  	/*
> >  	 * some architectures can have larger ptes than wordsize,
> > @@ -3496,6 +3495,7 @@ static int handle_pte_fault(struct vm_fault *vmf)
> >  	 * ptl lock held. So here a barrier will do.
> >  	 */
> >  	barrier();
> > +	entry = vmf->orig_pte;
>
> This set of 'entry' is now on the other side of the barrier(). I'll admit
> that I don't fully grok the need for the barrier. Does it apply only to
> the setting of vmf->pte and vmf->orig_pte, or does 'entry' also matter
> because it too is of type pte_t, and thus could be bigger than the
> architecture's word size?
>
> My guess is that 'entry' matters, too, and should remain before the
> barrier() call. If not, can you help me understand why?
Sure, actually the comment just above the barrier() explains it: we care
about sampling *vmf->pte only once - so we want the value stored in
'entry' (vmf->orig_pte after the patch) to be used and to avoid compiler
optimizations leading to refetching the value at *vmf->pte. The way I've
written the code achieves this. Actually, I've moved the 'entry'
assignment even further down where it makes more sense with the new code
layout.

								Honza
-- 
Jan Kara
SUSE Labs, CR
Re: [PATCH 0/20 v3] dax: Clear dirty bits after flushing caches
On Mon 17-10-16 12:59:55, Ross Zwisler wrote:
> On Mon, Oct 17, 2016 at 10:47:32AM +0200, Jan Kara wrote:
> > This week I plan to rebase both series on top of rc1 + your THP
> > patches so that we can move on with merging the stuff.
>
> Yea... so how are we going to coordinate merging of these series for the
> v4.10 merge window? My series mostly changes DAX, but it also changes
> XFS, ext2 and ext4. I think the plan right now is to have Dave Chinner
> take it through his XFS tree.
>
> Your first series is mostly mm changes with some DAX sprinkled in, and
> your second series touches dax, mm and all 3 DAX filesystems.
>
> What is the best way to handle all this? Have it go through one central
> tree (-MM?), even though the changes touch code that exists outside of
> that tree's normal domain (like the FS code)? Have my series go through
> the XFS tree and yours through -MM, and give Linus a merge resolution
> patch? Something else?

For your changes to go through the XFS tree is IMO fine (changes outside
of XFS & DAX are easy). Let me do the rebase first and then discuss how
to merge my patches after that...

								Honza
-- 
Jan Kara
SUSE Labs, CR