Re: [PATCH 0/3] iopmem : A block device for PCIe memory

2016-10-18 Thread Dan Williams
[ adding Ashok and David for potential iommu comments ]

On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates  wrote:
> This patch follows from an RFC we did earlier this year [1]. This
> patchset applies cleanly to v4.9-rc1.
>
> Updates since RFC
> -----------------
>   Rebased.
>   Included the iopmem driver in the submission.
>
> History
> -------
>
> There have been several attempts to upstream patchsets that enable
> DMAs between PCIe peers. These include Peer-Direct [2] and DMA-Buf
> style patches [3]. None have been successful to date. Haggai Eran
> gives a nice overview of the prior art in this space in his cover
> letter [3].
>
> Motivation and Use Cases
> ------------------------
>
> PCIe IO devices are getting faster. It is not uncommon now to find PCIe
> network and storage devices that can generate and consume several GB/s.
> Almost always these devices have either a high performance DMA engine, a
> number of exposed PCIe BARs or both.
>
> Until this patch, any high-performance transfer of information between
> two PCIe devices has required the use of a staging buffer in system
> memory. With this patch the bandwidth to system memory is not compromised
> when high-throughput transfers occur between PCIe devices. This means
> that more system memory bandwidth is available to the CPU cores for data
> processing and manipulation. In addition, in systems where the two PCIe
> devices reside behind a PCIe switch the datapath avoids the CPU
> entirely.

I agree with the motivation and the need for a solution, but I have
some questions about this implementation.

>
> Consumers
> ---------
>
> We provide a PCIe device driver in an accompanying patch that can be
> used to map any PCIe BAR into a DAX capable block device. For
> non-persistent BARs this simply serves as an alternative to using
> system memory bounce buffers. For persistent BARs this can serve as an
> additional storage device in the system.

Why block devices?  I wonder if iopmem was initially designed back
when we were considering enabling DAX for raw block devices.  However,
that support has since been ripped out / abandoned.  You currently
need a filesystem on top of a block-device to get DAX operation.
Putting xfs or ext4 on top of a PCI-E memory-mapped range seems
awkward if all you want is a way to map the BAR for another PCI-E
device in the topology.

If you're only using the block-device as an entry-point to create
dax-mappings then a device-dax (drivers/dax/) character-device might
be a better fit.

>
> Testing and Performance
> -----------------------
>
> We have done a moderate amount of testing of this patch in a QEMU
> environment and on real hardware. On real hardware we have observed
> peer-to-peer writes of up to 4GB/s and reads of up to 1.2 GB/s. In
> both cases these numbers are limitations of our consumer hardware. In
> addition, we have observed that the CPU DRAM bandwidth is not impacted
> when using IOPMEM which is not the case when a traditional path
> through system memory is taken.
>
> For more information on the testing and performance results see the
> GitHub site [4].
>
> Known Issues
> ------------
>
> 1. Address Translation. Suggestions have been made that in certain
> architectures and topologies the dma_addr_t passed to the DMA master
> in a peer-2-peer transfer will not correctly route to the IO memory
> intended. However in our testing to date we have not seen this to be
> an issue, even in systems with IOMMUs and PCIe switches. It is our
> understanding that an IOMMU only maps system memory and would not
> interfere with device memory regions. (It certainly has no opportunity
> to do so if the transfer gets routed through a switch).
>

There may still be platforms where peer-to-peer cycles are routed up
through the root bridge and then back down to the target device, but we
can address that when / if it happens.  I wonder if we could (ab)use a
software-defined 'pasid' as the requester id for a peer-to-peer
mapping that needs address translation.

> 2. Memory Segment Spacing. This patch has the same limitations that
> ZONE_DEVICE does in that memory regions must be spaced at least
> SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
> BARs can be placed closer together than this. Thus ZONE_DEVICE would not
> be usable on neighboring BARs. For our purposes, this is not an issue as
> we'd only be looking at enabling a single BAR in a given PCIe device.
> More exotic use cases may have problems with this.

I'm working on patches for 4.10 to allow mixing multiple
devm_memremap_pages() allocations within the same physical section.
Hopefully this won't be a problem going forward.

> 3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
> peer there is potential for coherency issues and for writes to occur out
> of order. This is something that users of this feature need to be
> cognizant of. Though really, this isn't much different than the
> existing situation with RDMA: if userspace sets up an MR for remote
> use, they need to be careful about using that memory region themselves.


Re: [PATCH 20/20] dax: Clear dirty entry tags on cache flush

2016-10-18 Thread Ross Zwisler
On Tue, Sep 27, 2016 at 06:08:24PM +0200, Jan Kara wrote:
> Currently we never clear dirty tags in DAX mappings and thus address
> ranges to flush accumulate. Now that we have locking of radix tree
> entries, we have all the locking necessary to reliably clear the radix
> tree dirty tag when flushing caches for corresponding address range.
> Similarly to page_mkclean() we also have to write-protect pages to get a
> page fault when the page is next written to so that we can mark the
> entry dirty again.
> 
> Signed-off-by: Jan Kara 

Looks great. 

Reviewed-by: Ross Zwisler 
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 1/3] memremap.c : Add support for ZONE_DEVICE IO memory with struct pages.

2016-10-18 Thread Stephen Bates
From: Logan Gunthorpe 

We build on recent work that adds memory regions owned by a device
driver (ZONE_DEVICE) [1] and on the work adding struct page support for
these new regions of memory [2].

1. Add an extra flags argument to devm_memremap_pages to take a
MEMREMAP_XX argument. We update the existing calls to this function to
reflect the change.

2. For completeness, we add MEMREMAP_WT support to memremap;
however, we have no actual need for this functionality.

3. We add the static functions add_zone_device_pages and
remove_zone_device_pages. These are similar to arch_add_memory except
they don't create the memory mapping. We don't believe these need to be
made arch-specific, but are open to other opinions.

4. devm_memremap_pages and devm_memremap_pages_release are updated to
treat IO memory slightly differently. For IO memory we use a combination
of the appropriate io_remap function and the zone_device page functions
created above. A flags variable and kaddr pointer are added to struct
page_map to facilitate this for the release function. We also set up
the page attribute tables for the mapped region correctly based on the
desired mapping.

[1] https://lists.01.org/pipermail/linux-nvdimm/2015-August/001810.html
[2] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002387.html

Signed-off-by: Stephen Bates 
Signed-off-by: Logan Gunthorpe 
---
 drivers/dax/pmem.c|  4 +-
 drivers/nvdimm/pmem.c |  4 +-
 include/linux/memremap.h  |  5 ++-
 kernel/memremap.c | 80 +--
 tools/testing/nvdimm/test/iomap.c |  3 +-
 5 files changed, 86 insertions(+), 10 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 9630d88..58ac456 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "../nvdimm/pfn.h"
 #include "../nvdimm/nd.h"
 #include "dax.h"
@@ -108,7 +109,8 @@ static int dax_pmem_probe(struct device *dev)
if (rc)
return rc;

-   addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap);
+   addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap,
+   ARCH_MEMREMAP_PMEM);
if (IS_ERR(addr))
return PTR_ERR(addr);

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 42b3a82..97032a1 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -278,7 +278,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->pfn_flags = PFN_DEV;
if (is_nd_pfn(dev)) {
addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
-   altmap);
+   altmap, ARCH_MEMREMAP_PMEM);
pfn_sb = nd_pfn->pfn_sb;
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
pmem->pfn_pad = resource_size(res) - resource_size(&pfn_res);
@@ -287,7 +287,7 @@ static int pmem_attach_disk(struct device *dev,
res->start += pmem->data_offset;
} else if (pmem_should_map_pages(dev)) {
addr = devm_memremap_pages(dev, &nsio->res,
-   &q->q_usage_counter, NULL);
+   &q->q_usage_counter, NULL, ARCH_MEMREMAP_PMEM);
pmem->pfn_flags |= PFN_MAP;
} else
addr = devm_memremap(dev, pmem->phys_addr,
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9341619..fc99283 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -51,12 +51,13 @@ struct dev_pagemap {

 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-   struct percpu_ref *ref, struct vmem_altmap *altmap);
+   struct percpu_ref *ref, struct vmem_altmap *altmap,
+   unsigned long flags);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
struct resource *res, struct percpu_ref *ref,
-   struct vmem_altmap *altmap)
+   struct vmem_altmap *altmap, unsigned long flags)
 {
/*
 * Fail attempts to call devm_memremap_pages() without
diff --git a/kernel/memremap.c b/kernel/memremap.c
index b501e39..d5f462c 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -175,13 +175,41 @@ static RADIX_TREE(pgmap_radix, GFP_KERNEL);
 #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)

+enum {
+   PAGEMAP_IO_MEM = 1 << 0,
+};
+
 struct page_map {
struct resource res;
struct percpu_ref *ref;
struct dev_pagemap pgmap;
struct vmem_altmap altmap;
+   void *kaddr;
+   int flags;
 };

+static int add_zone_device_pages(int nid, u64 start, u64 size)
+{
+   struct pglist_data *pgdat = NODE_DATA(nid);

[PATCH 3/3] iopmem : Add documentation for iopmem driver

2016-10-18 Thread Stephen Bates
Add documentation for the iopmem PCIe device driver.

Signed-off-by: Stephen Bates 
Signed-off-by: Logan Gunthorpe 
---
 Documentation/blockdev/00-INDEX   |  2 ++
 Documentation/blockdev/iopmem.txt | 62 +++
 2 files changed, 64 insertions(+)
 create mode 100644 Documentation/blockdev/iopmem.txt

diff --git a/Documentation/blockdev/00-INDEX b/Documentation/blockdev/00-INDEX
index c08df56..913e500 100644
--- a/Documentation/blockdev/00-INDEX
+++ b/Documentation/blockdev/00-INDEX
@@ -8,6 +8,8 @@ cpqarray.txt
- info on using Compaq's SMART2 Intelligent Disk Array Controllers.
 floppy.txt
- notes and driver options for the floppy disk driver.
+iopmem.txt
+   - info on the iopmem block driver.
 mflash.txt
- info on mGine m(g)flash driver for linux.
 nbd.txt
diff --git a/Documentation/blockdev/iopmem.txt 
b/Documentation/blockdev/iopmem.txt
new file mode 100644
index 000..ba805b8
--- /dev/null
+++ b/Documentation/blockdev/iopmem.txt
@@ -0,0 +1,62 @@
+IOPMEM Block Driver
+===================
+
+Logan Gunthorpe and Stephen Bates - October 2016
+
+Introduction
+------------
+
+The iopmem module creates a DAX capable block device from a BAR on a PCIe
+device. iopmem borrows heavily from the pmem driver, although it utilizes IO
+memory rather than system memory as its backing store.
+
+Usage
+-----
+
+To include the iopmem module in your kernel, set CONFIG_BLK_DEV_IOPMEM
+to either y or m. A block device will be created for each PCIe attached device
+that matches the vendor and device ID as specified in the module. Currently an
+unallocated PMC PCIe ID is used as the default. Alternatively, this driver can
+be bound to any arbitrary PCIe function using the sysfs bind entry.
+
+The main purpose of an iopmem block device is expected to be peer-2-peer
+PCIe transfers. We DO NOT RECOMMEND accessing an iopmem device using the local
+CPU unless you are doing one of the three following things:
+
+1. Creating a DAX capable filesystem on the iopmem device.
+2. Creating some files on the DAX capable filesystem.
+3. Interrogating the files on said filesystem to obtain pointers that can be
+   passed to other PCIe devices for p2p DMA operations.
+
+Issues
+------
+
+1. Address Translation. Suggestions have been made that in certain
+architectures and topologies the dma_addr_t passed to the DMA master
+in a peer-2-peer transfer will not correctly route to the IO memory
+intended. However in our testing to date we have not seen this to be
+an issue, even in systems with IOMMUs and PCIe switches. It is our
+understanding that an IOMMU only maps system memory and would not
+interfere with device memory regions. (It certainly has no opportunity
+to do so if the transfer gets routed through a switch).
+
+2. Memory Segment Spacing. This patch has the same limitations that
+ZONE_DEVICE does in that memory regions must be spaced at least
+SECTION_SIZE bytes apart. On x86 this is 128MB and there are cases where
+BARs can be placed closer together than this. Thus ZONE_DEVICE would not
+be usable on neighboring BARs. For our purposes, this is not an issue as
+we'd only be looking at enabling a single BAR in a given PCIe device.
+More exotic use cases may have problems with this.
+
+3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
+peer there is potential for coherency issues and for writes to occur out
+of order. This is something that users of this feature need to be
+cognizant of and may necessitate the use of CONFIG_EXPERT. Though really,
+this isn't much different than the existing situation with RDMA: if
+userspace sets up an MR for remote use, they need to be careful about
+using that memory region themselves.
+
+4. Architecture. Currently this patch is applicable only to x86
+architectures. The same is true for much of the code pertaining to
+PMEM and ZONE_DEVICE. It is hoped that the work will be extended to other
+ARCH over time.
--
2.1.4


Re: [PATCH 17/20] mm: Export follow_pte()

2016-10-18 Thread Ross Zwisler
On Tue, Sep 27, 2016 at 06:08:21PM +0200, Jan Kara wrote:
> DAX will need to implement its own version of page_check_address(). To
> avoid duplicating page table walking code, export follow_pte() which
> does what we need.
> 
> Signed-off-by: Jan Kara 

Reviewed-by: Ross Zwisler 


Re: [PATCH 16/20] mm: Provide helper for finishing mkwrite faults

2016-10-18 Thread Ross Zwisler
On Tue, Sep 27, 2016 at 06:08:20PM +0200, Jan Kara wrote:
> Provide a helper function for finishing write faults due to PTE being
> read-only. The helper will be used by DAX to avoid the need of
> complicating generic MM code with DAX locking specifics.
> 
> Signed-off-by: Jan Kara 
> ---
>  include/linux/mm.h |  1 +
>  mm/memory.c| 65 
> +++---
>  2 files changed, 39 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1055f2ece80d..e5a014be8932 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -617,6 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct 
> vm_area_struct *vma)
>  int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
>   struct page *page);
>  int finish_fault(struct vm_fault *vmf);
> +int finish_mkwrite_fault(struct vm_fault *vmf);
>  #endif
>  
>  /*
> diff --git a/mm/memory.c b/mm/memory.c
> index f49e736d6a36..8c8cb7f2133e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2266,6 +2266,36 @@ oom:
>   return VM_FAULT_OOM;
>  }
>  
> +/**
> + * finish_mkrite_fault - finish page fault making PTE writeable once the page
  finish_mkwrite_fault

> @@ -2315,26 +2335,17 @@ static int wp_page_shared(struct vm_fault *vmf)
>   put_page(vmf->page);
>   return tmp;
>   }
> - /*
> -  * Since we dropped the lock we need to revalidate
> -  * the PTE as someone else may have changed it.  If
> -  * they did, we just return, as we can count on the
> -  * MMU to tell us if they didn't also make it writable.
> -  */
> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> - vmf->address, &vmf->ptl);
> - if (!pte_same(*vmf->pte, vmf->orig_pte)) {
> + tmp = finish_mkwrite_fault(vmf);
> + if (unlikely(!tmp || (tmp &
> +   (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {

The 'tmp' return from finish_mkwrite_fault() can only be 0 or VM_FAULT_WRITE.
I think this test should just be 

if (unlikely(!tmp)) {

With that and the small spelling fix:

Reviewed-by: Ross Zwisler 


Re: [PATCH 15/20] mm: Move part of wp_page_reuse() into the single call site

2016-10-18 Thread Ross Zwisler
On Tue, Sep 27, 2016 at 06:08:19PM +0200, Jan Kara wrote:
> wp_page_reuse() handles write shared faults which is needed only in
> wp_page_shared(). Move the handling only into that location to make
> wp_page_reuse() simpler and avoid a strange situation when we sometimes
> pass in locked page, sometimes unlocked etc.
> 
> Signed-off-by: Jan Kara 

Reviewed-by: Ross Zwisler 


Re: [PATCH 12/20] mm: Factor out common parts of write fault handling

2016-10-18 Thread Ross Zwisler
On Tue, Oct 18, 2016 at 12:50:00PM +0200, Jan Kara wrote:
> On Mon 17-10-16 16:08:51, Ross Zwisler wrote:
> > On Tue, Sep 27, 2016 at 06:08:16PM +0200, Jan Kara wrote:
> > > Currently we duplicate handling of shared write faults in
> > > wp_page_reuse() and do_shared_fault(). Factor them out into a common
> > > function.
> > > 
> > > Signed-off-by: Jan Kara 
> > > ---
> > >  mm/memory.c | 78 
> > > +
> > >  1 file changed, 37 insertions(+), 41 deletions(-)
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 63d9c1a54caf..0643b3b5a12a 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -2063,6 +2063,41 @@ static int do_page_mkwrite(struct vm_area_struct 
> > > *vma, struct page *page,
> > >  }
> > >  
> > >  /*
> > > + * Handle dirtying of a page in shared file mapping on a write fault.
> > > + *
> > > + * The function expects the page to be locked and unlocks it.
> > > + */
> > > +static void fault_dirty_shared_page(struct vm_area_struct *vma,
> > > + struct page *page)
> > > +{
> > > + struct address_space *mapping;
> > > + bool dirtied;
> > > + bool page_mkwrite = vma->vm_ops->page_mkwrite;
> > 
> > I think you may need to pass in a 'page_mkwrite' parameter if you don't want
> > to change behavior.  Just checking to see of vma->vm_ops->page_mkwrite is
> > non-NULL works fine for this path:
> > 
> > do_shared_fault()
> > fault_dirty_shared_page()
> > 
> > and for
> > 
> > wp_page_shared()
> > wp_page_reuse()
> > fault_dirty_shared_page()
> > 
> > But for these paths:
> > 
> > wp_pfn_shared()
> > wp_page_reuse()
> > fault_dirty_shared_page()
> > 
> > and
> > 
> > do_wp_page()
> > wp_page_reuse()
> > fault_dirty_shared_page()
> > 
> > we unconditionally pass 0 for the 'page_mkwrite' parameter, even though from
> > the logic in wp_pfn_shared() especially you can see that
> > vma->vm_ops->pfn_mkwrite() must be defined some of the time.
> 
> The trick which makes this work is that for fault_dirty_shared_page() to be
> called at all, you have to set 'dirty_shared' argument to wp_page_reuse()
> and that does not happen from wp_pfn_shared() and do_wp_page() paths. So
> things work as they should. If you look somewhat later into the series,
> the patch "mm: Move part of wp_page_reuse() into the single call site"
> cleans this up to make things more obvious.
> 
>   Honza

Ah, cool, that makes sense.

You can add:

Reviewed-by: Ross Zwisler 


Re: [PATCH 11/20] mm: Remove unnecessary vma->vm_ops check

2016-10-18 Thread Jan Kara
On Mon 17-10-16 13:40:41, Ross Zwisler wrote:
> On Tue, Sep 27, 2016 at 06:08:15PM +0200, Jan Kara wrote:
> > We don't check whether vma->vm_ops is NULL in do_shared_fault() so
> > there's hardly any point in checking it in wp_page_shared() which gets
> > called only for shared file mappings as well.
> > 
> > Signed-off-by: Jan Kara 
> > ---
> >  mm/memory.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index a4522e8999b2..63d9c1a54caf 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2301,7 +2301,7 @@ static int wp_page_shared(struct vm_fault *vmf, 
> > struct page *old_page)
> >  
> > get_page(old_page);
> >  
> > -   if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
> > +   if (vma->vm_ops->page_mkwrite) {
> > int tmp;
> >  
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > -- 
> > 2.6.6
> 
> Does this apply equally to the check in wp_pfn_shared()?  Both
> wp_page_shared() and wp_pfn_shared() are called for shared file mappings via
> do_wp_page().

Yes, it does apply there as well. Added to the commit. There are actually
more places with these checks which don't seem necessary but I didn't want
to do more cleanups than I need... But at least these two come logically
together.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 07/20] mm: Add orig_pte field into vm_fault

2016-10-18 Thread Jan Kara
On Mon 17-10-16 10:45:12, Ross Zwisler wrote:
> On Tue, Sep 27, 2016 at 06:08:11PM +0200, Jan Kara wrote:
> > Add orig_pte field to vm_fault structure to allow ->page_mkwrite
> > handlers to fully handle the fault. This also allows us to save some
> > passing of extra arguments around.
> > 
> > Signed-off-by: Jan Kara 
> > ---
> 
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index f88b2d3810a7..66bc77f2d1d2 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -890,11 +890,12 @@ static bool __collapse_huge_page_swapin(struct 
> > mm_struct *mm,
> > vmf.pte = pte_offset_map(pmd, address);
> > for (; vmf.address < address + HPAGE_PMD_NR*PAGE_SIZE;
> > vmf.pte++, vmf.address += PAGE_SIZE) {
> > -   pteval = *vmf.pte;
> > +   vmf.orig_pte = *vmf.pte;
> > +   pteval = vmf.orig_pte;
> > if (!is_swap_pte(pteval))
> > continue;
> 
> 'pteval' is now only used once.  It's probably cleaner to just remove it and
> use vmf.orig_pte for the is_swap_pte() check.

Yes, fixed.

> > @@ -3484,8 +3484,7 @@ static int handle_pte_fault(struct vm_fault *vmf)
> >  * So now it's safe to run pte_offset_map().
> >  */
> > vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> > -
> > -   entry = *vmf->pte;
> > +   vmf->orig_pte = *vmf->pte;
> >  
> > /*
> >  * some architectures can have larger ptes than wordsize,
> > @@ -3496,6 +3495,7 @@ static int handle_pte_fault(struct vm_fault *vmf)
> >  * ptl lock held. So here a barrier will do.
> >  */
> > barrier();
> > +   entry = vmf->orig_pte;
> 
> This set of 'entry' is now on the other side of the barrier().  I'll admit
> that I don't fully grok the need for the barrier. Does it apply to only the
> setting of vmf->pte and vmf->orig_pte, or does 'entry' also matter because it
> too is of type pte_t, and thus could be bigger than the architecture's word
> size?
> 
> My guess is that 'entry' matters, too, and should remain before the barrier()
> call.  If not, can you help me understand why?

Sure, actually the comment just above the barrier() explains it: We care
about sampling the *vmf->pte value only once - so we want the value stored in
'entry' (vmf->orig_pte after the patch) to be used and avoid compiler
optimizations leading to refetching the value at *vmf->pte. The way I've
written the code achieves this. Actually, I've moved the 'entry' assignment
even further down where it makes more sense with the new code layout.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/20 v3] dax: Clear dirty bits after flushing caches

2016-10-18 Thread Jan Kara
On Mon 17-10-16 12:59:55, Ross Zwisler wrote:
> On Mon, Oct 17, 2016 at 10:47:32AM +0200, Jan Kara wrote:
> 
> > This week I plan to rebase both series on top of rc1 + your THP patches so
> > that we can move on with merging the stuff.
> 
> Yea...so how are we going to coordinate merging of these series for the v4.10
> merge window?  My series mostly changes DAX, but it also changes XFS, ext2 and
> ext4.  I think the plan right now is to have Dave Chinner take it through his
> XFS tree.
> 
> Your first series is mostly mm changes with some DAX sprinkled in, and your
> second series touches dax, mm and all 3 DAX filesystems.  
> 
> What is the best way to handle all this?  Have it go through one central tree
> (-MM?), even though the changes touch code that exists outside of that trees
> normal domain (like the FS code)?  Have my series go through the XFS tree and
> yours through -MM, and give Linus a merge resolution patch?  Something else?

For your changes to go through XFS tree is IMO fine (changes outside of XFS
& DAX are easy). Let me do the rebase first and then discuss how to merge
my patches after that...

Honza
-- 
Jan Kara 
SUSE Labs, CR