Re: [PATCH 0/3] iopmem : A block device for PCIe memory

2016-10-18 Thread Dan Williams
[ adding Ashok and David for potential iommu comments ]

On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates  wrote:
> This patch follows from an RFC we did earlier this year [1]. This
> patchset applies cleanly to v4.9-rc1.
>
> Updates since RFC
> -----------------
>   Rebased.
>   Included the iopmem driver in the submission.
>
> History
> -------
>
> There have been several attempts to upstream patchsets that enable
> DMAs between PCIe peers. These include Peer-Direct [2] and DMA-Buf
> style patches [3]. None have been successful to date. Haggai Eran
> gives a nice overview of the prior art in this space in his cover
> letter [3].
>
> Motivation and Use Cases
> ------------------------
>
> PCIe IO devices are getting faster. It is not uncommon now to find PCIe
> network and storage devices that can generate and consume several GB/s.
> Almost always these devices have either a high performance DMA engine, a
> number of exposed PCIe BARs or both.
>
> Until this patch, any high-performance transfer of information between
> two PCIe devices has required the use of a staging buffer in system
> memory. With this patch the bandwidth to system memory is not compromised
> when high-throughput transfers occur between PCIe devices. This means
> that more system memory bandwidth is available to the CPU cores for data
> processing and manipulation. In addition, in systems where the two PCIe
> devices reside behind a PCIe switch the datapath avoids the CPU
> entirely.

I agree with the motivation and the need for a solution, but I have
some questions about this implementation.

>
> Consumers
> ---------
>
> We provide a PCIe device driver in an accompanying patch that can be
> used to map any PCIe BAR into a DAX capable block device. For
> non-persistent BARs this simply serves as an alternative to using
> system memory bounce buffers. For persistent BARs this can serve as an
> additional storage device in the system.

Why block devices?  I wonder if iopmem was initially designed back
when we were considering enabling DAX for raw block devices.  However,
that support has since been ripped out / abandoned.  You currently
need a filesystem on top of a block-device to get DAX operation.
Putting xfs or ext4 on top of a PCI-E memory-mapped range seems
awkward if all you want is a way to map the BAR for another PCI-E
device in the topology.

If you're only using the block-device as an entry-point to create
dax-mappings then a device-dax (drivers/dax/) character-device might
be a better fit.

>
> Testing and Performance
> -----------------------
>
> We have done a moderate amount of testing of this patch in a QEMU
> environment and on real hardware. On real hardware we have observed
> peer-to-peer writes of up to 4GB/s and reads of up to 1.2 GB/s. In
> both cases these numbers are limitations of our consumer hardware. In
> addition, we have observed that the CPU DRAM bandwidth is not impacted
> when using IOPMEM which is not the case when a traditional path
> through system memory is taken.
>
> For more information on the testing and performance results see the
> GitHub site [4].
>
> Known Issues
> ------------
>
> 1. Address Translation. Suggestions have been made that in certain
> architectures and topologies the dma_addr_t passed to the DMA master
> in a peer-2-peer transfer will not correctly route to the IO memory
> intended. However in our testing to date we have not seen this to be
> an issue, even in systems with IOMMUs and PCIe switches. It is our
> understanding that an IOMMU only maps system memory and would not
> interfere with device memory regions. (It certainly has no opportunity
> to do so if the transfer gets routed through a switch).
>

There may still be platforms where peer-to-peer cycles are routed up
through the root bridge and then back down to the target device, but we
can address that when / if it happens.  I wonder if we could (ab)use a
software-defined 'pasid' as the requester id for a peer-to-peer
mapping that needs address translation.

> 2. Memory Segment Spacing. This patch has the same limitations that
> ZONE_DEVICE does in that memory regions must be spaced at least
> SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
> BARs can be placed closer together than this. Thus ZONE_DEVICE would not
> be usable on neighboring BARs. For our purposes, this is not an issue as
> we'd only be looking at enabling a single BAR in a given PCIe device.
> More exotic use cases may have problems with this.

I'm working on patches for 4.10 to allow mixing multiple
devm_memremap_pages() allocations within the same physical section.
Hopefully this won't be a problem going forward.

> 3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
> peer there is potential for coherency issues and for writes to occur out
> of order. This is something that users of this feature need to be
> cognizant of. Though really, this isn't much different than the
> existing situation with RDMA: if userspace sets up an MR for remote
> use, they need to be careful about using that memory region themselves.


Re: [PATCH 20/20] dax: Clear dirty entry tags on cache flush

2016-10-18 Thread Ross Zwisler
On Tue, Sep 27, 2016 at 06:08:24PM +0200, Jan Kara wrote:
> Currently we never clear dirty tags in DAX mappings and thus address
> ranges to flush accumulate. Now that we have locking of radix tree
> entries, we have all the locking necessary to reliably clear the radix
> tree dirty tag when flushing caches for corresponding address range.
> Similarly to page_mkclean() we also have to write-protect pages to get a
> page fault when the page is next written to so that we can mark the
> entry dirty again.
> 
> Signed-off-by: Jan Kara 

Looks great. 

Reviewed-by: Ross Zwisler 
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 1/3] memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

2016-10-18 Thread Stephen Bates
From: Logan Gunthorpe 

We build on recent work that adds memory regions owned by a device
driver (ZONE_DEVICE) [1] and on the work adding struct page support for
these new regions of memory [2].

1. Add an extra flags argument to devm_memremap_pages to take a
MEMREMAP_XX argument. We update the existing calls to this function to
reflect the change.

2. For completeness, we add MEMREMAP_WT support to memremap;
however, we have no actual need for this functionality.

3. We add the static functions add_zone_device_pages and
remove_zone_device_pages. These are similar to arch_add_memory except
they don't create the memory mapping. We don't believe these need to be
made arch-specific, but are open to other opinions.

4. devm_memremap_pages and devm_memremap_pages_release are updated to
treat IO memory slightly differently. For IO memory we use a combination
of the appropriate io_remap function and the zone_device page functions
created above. A flags variable and kaddr pointer are added to struct
page_map to facilitate this for the release function. We also set up
the page attribute tables for the mapped region correctly based on the
desired mapping.

[1] https://lists.01.org/pipermail/linux-nvdimm/2015-August/001810.html
[2] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002387.html

Signed-off-by: Stephen Bates 
Signed-off-by: Logan Gunthorpe 
---
 drivers/dax/pmem.c|  4 +-
 drivers/nvdimm/pmem.c |  4 +-
 include/linux/memremap.h  |  5 ++-
 kernel/memremap.c | 80 +--
 tools/testing/nvdimm/test/iomap.c |  3 +-
 5 files changed, 86 insertions(+), 10 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 9630d88..58ac456 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "../nvdimm/pfn.h"
 #include "../nvdimm/nd.h"
 #include "dax.h"
@@ -108,7 +109,8 @@ static int dax_pmem_probe(struct device *dev)
if (rc)
return rc;

-   addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap);
+   addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap,
+   ARCH_MEMREMAP_PMEM);
if (IS_ERR(addr))
return PTR_ERR(addr);

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 42b3a82..97032a1 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -278,7 +278,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->pfn_flags = PFN_DEV;
if (is_nd_pfn(dev)) {
addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
-   altmap);
+   altmap, ARCH_MEMREMAP_PMEM);
pfn_sb = nd_pfn->pfn_sb;
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
pmem->pfn_pad = resource_size(res) - resource_size(&pfn_res);
@@ -287,7 +287,7 @@ static int pmem_attach_disk(struct device *dev,
res->start += pmem->data_offset;
} else if (pmem_should_map_pages(dev)) {
addr = devm_memremap_pages(dev, &nsio->res,
-   &q->q_usage_counter, NULL);
+   &q->q_usage_counter, NULL, ARCH_MEMREMAP_PMEM);
pmem->pfn_flags |= PFN_MAP;
} else
addr = devm_memremap(dev, pmem->phys_addr,
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9341619..fc99283 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -51,12 +51,13 @@ struct dev_pagemap {

 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-   struct percpu_ref *ref, struct vmem_altmap *altmap);
+   struct percpu_ref *ref, struct vmem_altmap *altmap,
+   unsigned long flags);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
struct resource *res, struct percpu_ref *ref,
-   struct vmem_altmap *altmap)
+   struct vmem_altmap *altmap, unsigned long flags)
 {
/*
 * Fail attempts to call devm_memremap_pages() without
diff --git a/kernel/memremap.c b/kernel/memremap.c
index b501e39..d5f462c 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -175,13 +175,41 @@ static RADIX_TREE(pgmap_radix, GFP_KERNEL);
 #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)

+enum {
+   PAGEMAP_IO_MEM = 1 << 0,
+};
+
 struct page_map {
struct resource res;
struct percpu_ref *ref;
struct dev_pagemap pgmap;
struct vmem_altmap altmap;
+   void *kaddr;
+   int flags;
 };

+static int add_zone_device_pages(int nid, u64 start, u64 size)
+{
+   struct pglist_data *pgdat = NODE_DATA(nid);

[PATCH 3/3] iopmem : Add documentation for iopmem driver

2016-10-18 Thread Stephen Bates
Add documentation for the iopmem PCIe device driver.

Signed-off-by: Stephen Bates 
Signed-off-by: Logan Gunthorpe 
---
 Documentation/blockdev/00-INDEX   |  2 ++
 Documentation/blockdev/iopmem.txt | 62 +++
 2 files changed, 64 insertions(+)
 create mode 100644 Documentation/blockdev/iopmem.txt

diff --git a/Documentation/blockdev/00-INDEX b/Documentation/blockdev/00-INDEX
index c08df56..913e500 100644
--- a/Documentation/blockdev/00-INDEX
+++ b/Documentation/blockdev/00-INDEX
@@ -8,6 +8,8 @@ cpqarray.txt
- info on using Compaq's SMART2 Intelligent Disk Array Controllers.
 floppy.txt
- notes and driver options for the floppy disk driver.
+iopmem.txt
+   - info on the iopmem block driver.
 mflash.txt
- info on mGine m(g)flash driver for linux.
 nbd.txt
diff --git a/Documentation/blockdev/iopmem.txt 
b/Documentation/blockdev/iopmem.txt
new file mode 100644
index 000..ba805b8
--- /dev/null
+++ b/Documentation/blockdev/iopmem.txt
@@ -0,0 +1,62 @@
+IOPMEM Block Driver
+===================
+
+Logan Gunthorpe and Stephen Bates - October 2016
+
+Introduction
+------------
+
+The iopmem module creates a DAX capable block device from a BAR on a PCIe
+device. iopmem borrows heavily from the pmem driver, although it utilizes IO
+memory rather than system memory as its backing store.
+
+Usage
+-----
+
+To include the iopmem module in your kernel, set CONFIG_BLK_DEV_IOPMEM
+to either y or m. A block device will be created for each PCIe attached device
+that matches the vendor and device ID as specified in the module. Currently an
+unallocated PMC PCIe ID is used as the default. Alternatively, this driver can
+be bound to any arbitrary PCIe function using the sysfs bind entry.
+
+The main purpose of an iopmem block device is expected to be peer-2-peer
+PCIe transfers. We DO NOT RECOMMEND accessing an iopmem device using the local
+CPU unless you are doing one of the three following things:
+
+1. Creating a DAX capable filesystem on the iopmem device.
+2. Creating some files on the DAX capable filesystem.
+3. Interrogating the files on said filesystem to obtain pointers that can be
+   passed to other PCIe devices for p2p DMA operations.
+
+Issues
+------
+
+1. Address Translation. Suggestions have been made that in certain
+architectures and topologies the dma_addr_t passed to the DMA master
+in a peer-2-peer transfer will not correctly route to the IO memory
+intended. However in our testing to date we have not seen this to be
+an issue, even in systems with IOMMUs and PCIe switches. It is our
+understanding that an IOMMU only maps system memory and would not
+interfere with device memory regions. (It certainly has no opportunity
+to do so if the transfer gets routed through a switch).
+
+2. Memory Segment Spacing. This patch has the same limitations that
+ZONE_DEVICE does in that memory regions must be spaced at least
+SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
+BARs can be placed closer together than this. Thus ZONE_DEVICE would not
+be usable on neighboring BARs. For our purposes, this is not an issue as
+we'd only be looking at enabling a single BAR in a given PCIe device.
+More exotic use cases may have problems with this.
+
+3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
+peer there is potential for coherency issues and for writes to occur out
+of order. This is something that users of this feature need to be
+cognizant of and may necessitate the use of CONFIG_EXPERT. Though really,
+this isn't much different than the existing situation with RDMA: if
+userspace sets up an MR for remote use, they need to be careful about
+using that memory region themselves.
+
+4. Architecture. Currently this patch is applicable only to x86
+architectures. The same is true for much of the code pertaining to
+PMEM and ZONE_DEVICE. It is hoped that the work will be extended to other
+ARCH over time.
--
2.1.4


Re: [PATCH 17/20] mm: Export follow_pte()

2016-10-18 Thread Ross Zwisler
On Tue, Sep 27, 2016 at 06:08:21PM +0200, Jan Kara wrote:
> DAX will need to implement its own version of page_check_address(). To
> avoid duplicating page table walking code, export follow_pte() which
> does what we need.
> 
> Signed-off-by: Jan Kara 

Reviewed-by: Ross Zwisler 


Re: [PATCH 16/20] mm: Provide helper for finishing mkwrite faults

2016-10-18 Thread Ross Zwisler
On Tue, Sep 27, 2016 at 06:08:20PM +0200, Jan Kara wrote:
> Provide a helper function for finishing write faults due to PTE being
> read-only. The helper will be used by DAX to avoid the need of
> complicating generic MM code with DAX locking specifics.
> 
> Signed-off-by: Jan Kara 
> ---
>  include/linux/mm.h |  1 +
>  mm/memory.c| 65 
> +++---
>  2 files changed, 39 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1055f2ece80d..e5a014be8932 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -617,6 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct 
> vm_area_struct *vma)
>  int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
>   struct page *page);
>  int finish_fault(struct vm_fault *vmf);
> +int finish_mkwrite_fault(struct vm_fault *vmf);
>  #endif
>  
>  /*
> diff --git a/mm/memory.c b/mm/memory.c
> index f49e736d6a36..8c8cb7f2133e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2266,6 +2266,36 @@ oom:
>   return VM_FAULT_OOM;
>  }
>  
> +/**
> + * finish_mkrite_fault - finish page fault making PTE writeable once the page
  finish_mkwrite_fault

> @@ -2315,26 +2335,17 @@ static int wp_page_shared(struct vm_fault *vmf)
>   put_page(vmf->page);
>   return tmp;
>   }
> - /*
> -  * Since we dropped the lock we need to revalidate
> -  * the PTE as someone else may have changed it.  If
> -  * they did, we just return, as we can count on the
> -  * MMU to tell us if they didn't also make it writable.
> -  */
> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> - vmf->address, &vmf->ptl);
> - if (!pte_same(*vmf->pte, vmf->orig_pte)) {
> + tmp = finish_mkwrite_fault(vmf);
> + if (unlikely(!tmp || (tmp &
> +   (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {

The 'tmp' return from finish_mkwrite_fault() can only be 0 or VM_FAULT_WRITE.
I think this test should just be 

if (unlikely(!tmp)) {

With that and the small spelling fix:

Reviewed-by: Ross Zwisler 


Re: [PATCH 15/20] mm: Move part of wp_page_reuse() into the single call site

2016-10-18 Thread Ross Zwisler
On Tue, Sep 27, 2016 at 06:08:19PM +0200, Jan Kara wrote:
> wp_page_reuse() handles write shared faults which is needed only in
> wp_page_shared(). Move the handling only into that location to make
> wp_page_reuse() simpler and avoid a strange situation when we sometimes
> pass in locked page, sometimes unlocked etc.
> 
> Signed-off-by: Jan Kara 

Reviewed-by: Ross Zwisler 


Re: [PATCH 12/20] mm: Factor out common parts of write fault handling

2016-10-18 Thread Ross Zwisler
On Tue, Oct 18, 2016 at 12:50:00PM +0200, Jan Kara wrote:
> On Mon 17-10-16 16:08:51, Ross Zwisler wrote:
> > On Tue, Sep 27, 2016 at 06:08:16PM +0200, Jan Kara wrote:
> > > Currently we duplicate handling of shared write faults in
> > > wp_page_reuse() and do_shared_fault(). Factor them out into a common
> > > function.
> > > 
> > > Signed-off-by: Jan Kara 
> > > ---
> > >  mm/memory.c | 78 
> > > +
> > >  1 file changed, 37 insertions(+), 41 deletions(-)
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 63d9c1a54caf..0643b3b5a12a 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -2063,6 +2063,41 @@ static int do_page_mkwrite(struct vm_area_struct 
> > > *vma, struct page *page,
> > >  }
> > >  
> > >  /*
> > > + * Handle dirtying of a page in shared file mapping on a write fault.
> > > + *
> > > + * The function expects the page to be locked and unlocks it.
> > > + */
> > > +static void fault_dirty_shared_page(struct vm_area_struct *vma,
> > > + struct page *page)
> > > +{
> > > + struct address_space *mapping;
> > > + bool dirtied;
> > > + bool page_mkwrite = vma->vm_ops->page_mkwrite;
> > 
> > I think you may need to pass in a 'page_mkwrite' parameter if you don't want
> > to change behavior.  Just checking to see of vma->vm_ops->page_mkwrite is
> > non-NULL works fine for this path:
> > 
> > do_shared_fault()
> > fault_dirty_shared_page()
> > 
> > and for
> > 
> > wp_page_shared()
> > wp_page_reuse()
> > fault_dirty_shared_page()
> > 
> > But for these paths:
> > 
> > wp_pfn_shared()
> > wp_page_reuse()
> > fault_dirty_shared_page()
> > 
> > and
> > 
> > do_wp_page()
> > wp_page_reuse()
> > fault_dirty_shared_page()
> > 
> > we unconditionally pass 0 for the 'page_mkwrite' parameter, even though from
> > the logic in wp_pfn_shared() especially you can see that
> > vma->vm_ops->pfn_mkwrite() must be defined some of the time.
> 
> The trick which makes this work is that for fault_dirty_shared_page() to be
> called at all, you have to set 'dirty_shared' argument to wp_page_reuse()
> and that does not happen from wp_pfn_shared() and do_wp_page() paths. So
> things work as they should. If you look somewhat later into the series,
> the patch "mm: Move part of wp_page_reuse() into the single call site"
> cleans this up to make things more obvious.
> 
>   Honza

Ah, cool, that makes sense.

You can add:

Reviewed-by: Ross Zwisler 


Re: [PATCH 11/20] mm: Remove unnecessary vma->vm_ops check

2016-10-18 Thread Jan Kara
On Mon 17-10-16 13:40:41, Ross Zwisler wrote:
> On Tue, Sep 27, 2016 at 06:08:15PM +0200, Jan Kara wrote:
> > We don't check whether vma->vm_ops is NULL in do_shared_fault() so
> > there's hardly any point in checking it in wp_page_shared() which gets
> > called only for shared file mappings as well.
> > 
> > Signed-off-by: Jan Kara 
> > ---
> >  mm/memory.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index a4522e8999b2..63d9c1a54caf 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2301,7 +2301,7 @@ static int wp_page_shared(struct vm_fault *vmf, 
> > struct page *old_page)
> >  
> > get_page(old_page);
> >  
> > -   if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
> > +   if (vma->vm_ops->page_mkwrite) {
> > int tmp;
> >  
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > -- 
> > 2.6.6
> 
> Does this apply equally to the check in wp_pfn_shared()?  Both
> wp_page_shared() and wp_pfn_shared() are called for shared file mappings via
> do_wp_page().

Yes, it does apply there as well. Added to the commit. There are actually
more places with these checks which don't seem necessary but I didn't want
to do more cleanups than I need... But at least these two come logically
together.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 07/20] mm: Add orig_pte field into vm_fault

2016-10-18 Thread Jan Kara
On Mon 17-10-16 10:45:12, Ross Zwisler wrote:
> On Tue, Sep 27, 2016 at 06:08:11PM +0200, Jan Kara wrote:
> > Add orig_pte field to vm_fault structure to allow ->page_mkwrite
> > handlers to fully handle the fault. This also allows us to save some
> > passing of extra arguments around.
> > 
> > Signed-off-by: Jan Kara 
> > ---
> 
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index f88b2d3810a7..66bc77f2d1d2 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -890,11 +890,12 @@ static bool __collapse_huge_page_swapin(struct 
> > mm_struct *mm,
> > vmf.pte = pte_offset_map(pmd, address);
> > for (; vmf.address < address + HPAGE_PMD_NR*PAGE_SIZE;
> > vmf.pte++, vmf.address += PAGE_SIZE) {
> > -   pteval = *vmf.pte;
> > +   vmf.orig_pte = *vmf.pte;
> > +   pteval = vmf.orig_pte;
> > if (!is_swap_pte(pteval))
> > continue;
> 
> 'pteval' is now only used once.  It's probably cleaner to just remove it and
> use vmf.orig_pte for the is_swap_pte() check.

Yes, fixed.

> > @@ -3484,8 +3484,7 @@ static int handle_pte_fault(struct vm_fault *vmf)
> >  * So now it's safe to run pte_offset_map().
> >  */
> > vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> > -
> > -   entry = *vmf->pte;
> > +   vmf->orig_pte = *vmf->pte;
> >  
> > /*
> >  * some architectures can have larger ptes than wordsize,
> > @@ -3496,6 +3495,7 @@ static int handle_pte_fault(struct vm_fault *vmf)
> >  * ptl lock held. So here a barrier will do.
> >  */
> > barrier();
> > +   entry = vmf->orig_pte;
> 
> This set of 'entry' is now on the other side of the barrier().  I'll admit
> that I don't fully grok the need for the barrier. Does it apply to only the
> setting of vmf->pte and vmf->orig_pte, or does 'entry' also matter because it
> too is of type pte_t, and thus could be bigger than the architecture's word
> size?
> 
> My guess is that 'entry' matters, too, and should remain before the barrier()
> call.  If not, can you help me understand why?

Sure, actually the comment just above the barrier() explains it: We care
about sampling the *vmf->pte value only once - so we want the value stored in
'entry' (vmf->orig_pte after the patch) to be used and avoid compiler
optimizations leading to refetching the value at *vmf->pte. The way I've
written the code achieves this. Actually, I've moved the 'entry' assignment
even further down where it makes more sense with the new code layout.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/20 v3] dax: Clear dirty bits after flushing caches

2016-10-18 Thread Jan Kara
On Mon 17-10-16 12:59:55, Ross Zwisler wrote:
> On Mon, Oct 17, 2016 at 10:47:32AM +0200, Jan Kara wrote:
> 
> > This week I plan to rebase both series on top of rc1 + your THP patches so
> > that we can move on with merging the stuff.
> 
> Yea...so how are we going to coordinate merging of these series for the v4.10
> merge window?  My series mostly changes DAX, but it also changes XFS, ext2 and
> ext4.  I think the plan right now is to have Dave Chinner take it through his
> XFS tree.
> 
> Your first series is mostly mm changes with some DAX sprinkled in, and your
> second series touches dax, mm and all 3 DAX filesystems.  
> 
> What is the best way to handle all this?  Have it go through one central tree
> (-MM?), even though the changes touch code that exists outside of that trees
> normal domain (like the FS code)?  Have my series go through the XFS tree and
> yours through -MM, and give Linus a merge resolution patch?  Something else?

For your changes to go through XFS tree is IMO fine (changes outside of XFS
& DAX are easy). Let me do the rebase first and then discuss how to merge
my patches after that...

Honza
-- 
Jan Kara 
SUSE Labs, CR