Re: [PATCH v2 3/3] Btrfs: heuristic add byte core set calculation

2017-07-28 Thread kbuild test robot
Hi Timofey,

[auto build test ERROR on next-20170724]
[cannot apply to btrfs/next v4.13-rc2 v4.13-rc1 v4.12 v4.13-rc2]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Timofey-Titovets/Btrfs-populate-heuristic-with-detection-logic/20170729-061208
config: i386-randconfig-n0-201730 (attached as .config)
compiler: gcc-4.8 (Debian 4.8.4-1) 4.8.4
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   fs/btrfs/compression.o: In function `btrfs_compress_heuristic':
>> compression.c:(.text+0x2208): undefined reference to `__udivdi3'
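
For context on the error above: on 32-bit targets such as i386, gcc lowers a
plain `/` on 64-bit operands into a call to the libgcc helper __udivdi3, which
the kernel does not link against. Kernel code normally uses the helpers in
<linux/math64.h> (div_u64(), div64_u64()) instead. Which expression in
btrfs_compress_heuristic triggers the call is not shown in this report. The
following is only a userspace sketch, with a hypothetical function name, of a
64-bit division that contains no `/` at all:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: divide a 64-bit value
 * using only shifts, compares and subtraction, so there is no 64-bit
 * '/' for the compiler to turn into a __udivdi3 call.
 */
static uint64_t div64_shift_subtract(uint64_t dividend, uint64_t divisor)
{
	uint64_t quotient = 0;
	uint64_t remainder = 0;
	int bit;

	assert(divisor != 0);

	/* Restoring long division, most significant bit first. */
	for (bit = 63; bit >= 0; bit--) {
		remainder = (remainder << 1) | ((dividend >> bit) & 1);
		if (remainder >= divisor) {
			remainder -= divisor;
			quotient |= (uint64_t)1 << bit;
		}
	}
	return quotient;
}
```

In the kernel itself the usual fix is to call div_u64() or div64_u64() rather
than to open-code the loop; the sketch only illustrates why the plain operator
is the problem.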

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation




Re: [PATCH] btrfs: preserve i_mode if __btrfs_set_acl() fails

2017-07-28 Thread Ernesto A . Fernández
On Sat, Jul 29, 2017 at 12:48:04AM +0000, Josef Bacik wrote:
> On Fri, Jul 28, 2017 at 09:26:29PM -0300, Ernesto A. Fernández wrote:
> > +   ret = __btrfs_set_acl(trans, inode, acl, type);
> > +   if (ret)
> > +   goto out;
> > +
> > +   inode->i_mode = mode;
> > +   inode_inc_iversion(inode);
> > +   inode->i_ctime = current_time(inode);
> > +   set_bit(BTRFS_INODE_COPY_EVERYTHING, &BTRFS_I(inode)->runtime_flags);
> 
> This only needs to be set if we actually set the xattr.  I'd fix setxattr to
> call it every time it's called.

I had not thought of that, thank you.

If I'm understanding this correctly the issue would be only when setting
a NULL default acl on an inode that is not a directory. In that case I
probably shouldn't be calling btrfs_update_inode either, but I can't move
that back to setxattr.

Perhaps __btrfs_set_acl could return an error in that case, like -ENOTDIR,
and then we can set ret back to 0 before returning from btrfs_set_acl.

> > +   ret = btrfs_update_inode(trans, root, inode);
> > +   BUG_ON(ret);
> 
> No BUG_ON, return the error.

The call to BUG_ON was already there before my patch, only inside the 
__btrfs_setxattr function. Since I didn't know the reason I thought it
was best not to change it. I'll do as you say in the next version.
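
A minimal userspace sketch of that change, assuming hypothetical model_*
stand-ins for the transaction and inode-update calls (none of these names
exist in btrfs): the error from the update is carried to the common exit and
returned, and the transaction is still closed on the way out.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical counter standing in for real transaction state. */
static int open_transactions;

static void model_start_transaction(void)
{
	open_transactions++;
}

static void model_end_transaction(void)
{
	open_transactions--;
}

/* Hypothetical stand-in for btrfs_update_inode(): returns what it is told to. */
static int model_update_inode(int result)
{
	return result;
}

static int model_finish_set_acl(int update_result)
{
	int ret;

	model_start_transaction();
	ret = model_update_inode(update_result);
	/* No BUG_ON(ret): fall through to the common exit and report it. */
	model_end_transaction();
	return ret;
}
```

The caller then sees -EIO (or whatever the update returned) instead of the
machine going down, and no transaction is left open.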

Thank you for your review.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: write corruption due to bio cloning on raid5/6

2017-07-28 Thread Janos Toth F.
The read-only scrub finished without errors/hangs (with kernel
4.12.3). So, I guess the hangs were caused by:
1: other bug in 4.13-RC1
2: crazy-random SATA/disk-controller issue
3: interference between various btrfs tools [*]
4: something in the background did DIO write with 4.13-RC1 (but all
affected content was eventually overwritten/deleted between the scrub
attempts)

[*] I expected scrub to finish in ~5 rather than ~40 hours (and didn't
expect interference issues), so I didn't disable the scheduled
maintenance script which deletes old files, recursively defrags the
whole fs and runs a balance with usage=33 filters. I guess either of
those (especially balance) could potentially cause scrub to hang.

On Thu, Jul 27, 2017 at 10:44 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Janos Toth F. posted on Thu, 27 Jul 2017 16:14:47 +0200 as excerpted:
>
>> * This is off-topic but raid5 scrub is painful. The disks run at
>> constant ~100% utilization while performing at ~1/5 of their sequential
>> read speeds. And despite explicitly asking idle IO priority when
>> launching scrub, the filesystem becomes unbearably slow (while scrub
>> takes a day or so to finish ... or get to the point where it hung the
>> last time around, close to the end).
>
> That's because basically all the userspace scrub command does is make the
> appropriate kernel calls to have it do the real scrub.  So priority-
> idling the userspace scrub doesn't do what it does on normal userspace
> jobs that do much of the work themselves.
>
> The problem is that idle-prioritizing the kernel threads actually doing
> the work could risk a deadlock due to lock inversion, since they're
> kernel threads and aren't designed with the idea of people messing with
> their priority in mind.
>
> Meanwhile, that's yet another reason btrfs raid56 mode isn't recommended
> at this time.  Try btrfs raid1 or raid10 mode instead, or possibly btrfs
> raid1, single or raid0 mode on top of a pair of mdraid5s or similar.  Tho
> parity-raid mode in general (that is, not btrfs-specific) is known for
> being slow in various cases, with raid10 normally being the best
> performing closest alternative.  (Tho in the btrfs-specific case, btrfs
> raid1 on top of a pair of mdraid/dmraid/whatever raid0s, is the normally
> recommended higher performance reasonably low danger alternative.)

If this applies to all RAID flavors then I consider the built-in help
and the manual pages of scrub misleading (if it's RAID56-only, the
manual should still mention how RAID56 is an exception).

Also, a resumed scrub seems to skip a lot of data. It picks up where
it left off but then prematurely reports a job well done. I remember
noticing similar behavior with balance cancel/resume on RAID5 a few
years ago (it went on for a few more chunks but left the rest alone
and reported completion; I am not sure whether that's fixed now or
whether the two have a common root cause).


Re: [PATCH] btrfs: preserve i_mode if __btrfs_set_acl() fails

2017-07-28 Thread Josef Bacik
On Fri, Jul 28, 2017 at 09:26:29PM -0300, Ernesto A. Fernández wrote:
> When changing a file's acl mask, btrfs_set_acl() will first set the
> group bits of i_mode to the value of the mask, and only then set the
> actual extended attribute representing the new acl.
> 
> If the second part fails (due to lack of space, for example) and the
> file had no acl attribute to begin with, the system will from now on
> assume that the mask permission bits are actual group permission bits,
> potentially granting access to the wrong users.
> 
> Prevent this by starting the journal transaction before calling
> __btrfs_set_acl and only changing the inode mode after it returns
> successfully.
> 
> Signed-off-by: Ernesto A. Fernández 
> ---
> This issue is covered by generic/449 in xfstests. Several filesystems
> are affected; some of them have already applied patches:
>   - fe26569 ext2: preserve i_mode if ext2_set_acl() fails
>   - f070e5a jfs: preserve i_mode if __jfs_set_acl() fails
>   - fcea8ae reiserfs: preserve i_mode if __reiserfs_set_acl() fails
> 
>  fs/btrfs/acl.c | 29 ++---
>  1 file changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
> index 8d8370d..d041526 100644
> --- a/fs/btrfs/acl.c
> +++ b/fs/btrfs/acl.c
> @@ -27,6 +27,7 @@
>  #include "ctree.h"
>  #include "btrfs_inode.h"
>  #include "xattr.h"
> +#include "transaction.h"
>  
>  struct posix_acl *btrfs_get_acl(struct inode *inode, int type)
>  {
> @@ -113,14 +114,36 @@ static int __btrfs_set_acl(struct btrfs_trans_handle *trans,
>  
>  int btrfs_set_acl(struct inode *inode, struct posix_acl *acl, int type)
>  {
> + struct btrfs_root *root = BTRFS_I(inode)->root;
> + struct btrfs_trans_handle *trans;
>   int ret;
> + umode_t mode = inode->i_mode;
> +
> + if (btrfs_root_readonly(root))
> + return -EROFS;
> +
> + trans = btrfs_start_transaction(root, 2);
> + if (IS_ERR(trans))
> + return PTR_ERR(trans);
>  
>   if (type == ACL_TYPE_ACCESS && acl) {
> - ret = posix_acl_update_mode(inode, &inode->i_mode, &acl);
> + ret = posix_acl_update_mode(inode, &mode, &acl);
>   if (ret)
> - return ret;
> + goto out;
>   }
> - return __btrfs_set_acl(NULL, inode, acl, type);
> + ret = __btrfs_set_acl(trans, inode, acl, type);
> + if (ret)
> + goto out;
> +
> + inode->i_mode = mode;
> + inode_inc_iversion(inode);
> + inode->i_ctime = current_time(inode);
> + set_bit(BTRFS_INODE_COPY_EVERYTHING, &BTRFS_I(inode)->runtime_flags);

This only needs to be set if we actually set the xattr.  I'd fix setxattr to
call it every time it's called.

> + ret = btrfs_update_inode(trans, root, inode);
> + BUG_ON(ret);

No BUG_ON, return the error.  Thanks,

Josef


[PATCH] btrfs: preserve i_mode if __btrfs_set_acl() fails

2017-07-28 Thread Ernesto A . Fernández
When changing a file's acl mask, btrfs_set_acl() will first set the
group bits of i_mode to the value of the mask, and only then set the
actual extended attribute representing the new acl.

If the second part fails (due to lack of space, for example) and the
file had no acl attribute to begin with, the system will from now on
assume that the mask permission bits are actual group permission bits,
potentially granting access to the wrong users.

Prevent this by starting the journal transaction before calling
__btrfs_set_acl and only changing the inode mode after it returns
successfully.

Signed-off-by: Ernesto A. Fernández 
---
This issue is covered by generic/449 in xfstests. Several filesystems
are affected; some of them have already applied patches:
  - fe26569 ext2: preserve i_mode if ext2_set_acl() fails
  - f070e5a jfs: preserve i_mode if __jfs_set_acl() fails
  - fcea8ae reiserfs: preserve i_mode if __reiserfs_set_acl() fails

 fs/btrfs/acl.c | 29 ++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
index 8d8370d..d041526 100644
--- a/fs/btrfs/acl.c
+++ b/fs/btrfs/acl.c
@@ -27,6 +27,7 @@
 #include "ctree.h"
 #include "btrfs_inode.h"
 #include "xattr.h"
+#include "transaction.h"
 
 struct posix_acl *btrfs_get_acl(struct inode *inode, int type)
 {
@@ -113,14 +114,36 @@ static int __btrfs_set_acl(struct btrfs_trans_handle *trans,
 
 int btrfs_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 {
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_trans_handle *trans;
 	int ret;
+	umode_t mode = inode->i_mode;
+
+	if (btrfs_root_readonly(root))
+		return -EROFS;
+
+	trans = btrfs_start_transaction(root, 2);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
 
 	if (type == ACL_TYPE_ACCESS && acl) {
-		ret = posix_acl_update_mode(inode, &inode->i_mode, &acl);
+		ret = posix_acl_update_mode(inode, &mode, &acl);
 		if (ret)
-			return ret;
+			goto out;
 	}
-	return __btrfs_set_acl(NULL, inode, acl, type);
+	ret = __btrfs_set_acl(trans, inode, acl, type);
+	if (ret)
+		goto out;
+
+	inode->i_mode = mode;
+	inode_inc_iversion(inode);
+	inode->i_ctime = current_time(inode);
+	set_bit(BTRFS_INODE_COPY_EVERYTHING, &BTRFS_I(inode)->runtime_flags);
+	ret = btrfs_update_inode(trans, root, inode);
+	BUG_ON(ret);
+out:
+	btrfs_end_transaction(trans);
+	return ret;
 }
 
 /*
-- 
2.1.4
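
The ordering the commit message describes (compute the new mode into a local,
attempt the xattr update, publish the mode only on success) can be modelled in
userspace roughly as follows. This is a sketch under stated assumptions:
struct model_inode and the model_* functions are hypothetical stand-ins, not
btrfs code, and the failure is injected through a parameter.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the parts of struct inode the patch touches. */
struct model_inode {
	unsigned int i_mode;
	int has_acl_xattr;
};

/* Hypothetical stand-in for __btrfs_set_acl(): fails on demand. */
static int model_set_acl_xattr(struct model_inode *inode, int fail_with)
{
	if (fail_with)
		return fail_with;
	inode->has_acl_xattr = 1;
	return 0;
}

/*
 * Keep the new mode in a local, try the xattr update, and only store
 * the mode in the inode once that update has succeeded.
 */
static int model_set_acl(struct model_inode *inode, unsigned int new_mode,
			 int fail_with)
{
	unsigned int mode = new_mode;	/* inode->i_mode is left alone for now */
	int ret;

	ret = model_set_acl_xattr(inode, fail_with);
	if (ret)
		return ret;		/* failure: i_mode was never modified */

	inode->i_mode = mode;
	return 0;
}
```

With an injected -ENOSPC the inode keeps its old mode and has no ACL xattr,
which is the property generic/449 checks for.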



Re: Btrfs incremental send | receive fails with Error: File not found

2017-07-28 Thread A L


On 7/28/2017 9:32 PM, Hermann Schwärzler wrote:
> Hi
>
> for me it looks like those snapshots are not read-only. But as far as
> I know for using send they have to be.

They are read-only.
# btrfs property get userData.20170727T1222/
ro=true

> At least
> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup#Initial_Bootstrapping
> states "We will need to create a read-only snapshot ,,,"
>
> I am using send/receive (with read-only snapshots) on a regular basis
> and never had a problem like yours.

I have no good explanation. There are no problems reported on the
filesystems with Btrfs scrub or Btrfs check. Did you also replace files
with same name between snapshots?

> What are the commands you use to create your snapshots?

I used to do it in an hourly cron job like this.
# btrfs subvolume snapshot -r /mnt/storagePool/volume/userData/ /mnt/storagePool/snapshots/userData.`date +%Y.%m.%d-%H.%M.%S`

Now I use btrbk, but the command is the same and the problem is the same.

The problem I see seems similar to the issue fixed in
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f59627810e18d4435051d982b5d05cab18c6e653
but that commit should already be in kernel-4.13_rc2

> Greetings
> Hermann
>
> On 07/28/2017 07:26 PM, A L wrote:
>> I often hit the following error when doing incremental btrfs
>> send-receive:
>>
>> Btrfs incremental send | receive fails with Error: File not found
>>
>> Sometimes I can do two-three incremental snapshots, but then the same
>> error (different file) happens again. It seems that the files were
>> changed or replaced between snapshots, which is causing the problems for
>> send-receive. I have tried to delete all snapshots and started over but
>> the problem comes back, so I think it must be a bug.
>>
>> The source volume is:   /mnt/storagePool (with RAID1 profile)
>> with subvolume:   volume/userData
>> Backup disk is:   /media/usb-backup (external USB disk)
>>
>> [...]



Re: [PATCH v3 12/13] btrfs: allow backref search checks for shared extents

2017-07-28 Thread Liu Bo
On Wed, Jul 12, 2017 at 04:20:10PM -0600, Edmund Nadolski wrote:
> When called with a struct share_check, find_parent_nodes()
> will detect a shared extent and immediately return with
> BACKREF_FOUND_SHARED.
> 

Reviewed-by: Liu Bo 

Thanks,

-liubo
> Signed-off-by: Edmund Nadolski 
> Signed-off-by: Jeff Mahoney 
> ---
>  fs/btrfs/backref.c | 164 
> +
>  1 file changed, 115 insertions(+), 49 deletions(-)
> 
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index c1882e5..35ac0bd 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -135,6 +135,25 @@ struct preftrees {
>   struct preftree indirect_missing_keys;
>  };
>  
> +/*
> + * Checks for a shared extent during backref search.
> + *
> + * The share_count tracks prelim_refs (direct and indirect) having a
> + * ref->count >0:
> + *  - incremented when a ref->count transitions to >0
> + *  - decremented when a ref->count transitions to <1
> + */
> +struct share_check {
> + u64 root_objectid;
> + u64 inum;
> + int share_count;
> +};
> +
> +static inline int extent_is_shared(struct share_check *sc)
> +{
> + return (sc && sc->share_count > 1) ? BACKREF_FOUND_SHARED : 0;
> +}
> +
>  static struct kmem_cache *btrfs_prelim_ref_cache;
>  
>  int __init btrfs_prelim_ref_init(void)
> @@ -195,14 +214,26 @@ static int prelim_ref_compare(struct prelim_ref *ref1,
>   return 0;
>  }
>  
> +void update_share_count(struct share_check *sc, int oldcount, int newcount)
> +{
> + if ((!sc) || (oldcount == 0 && newcount < 1))
> + return;
> +
> + if (oldcount > 0 && newcount < 1)
> + sc->share_count--;
> + else if (oldcount < 1 && newcount > 0)
> + sc->share_count++;
> +}
> +
>  /*
>   * Add @newref to the @root rbtree, merging identical refs.
>   *
> - * Callers should assumed that newref has been freed after calling.
> + * Callers should assume that newref has been freed after calling.
>   */
>  static void prelim_ref_insert(const struct btrfs_fs_info *fs_info,
> struct preftree *preftree,
> -   struct prelim_ref *newref)
> +   struct prelim_ref *newref,
> +   struct share_check *sc)
>  {
>   struct rb_root *root;
>   struct rb_node **p;
> @@ -234,12 +265,20 @@ static void prelim_ref_insert(const struct 
> btrfs_fs_info *fs_info,
>   eie->next = newref->inode_list;
>   trace_btrfs_prelim_ref_merge(fs_info, ref, newref,
>preftree->count);
> + /*
> +  * A delayed ref can have newref->count < 0.
> +  * The ref->count is updated to follow any
> +  * BTRFS_[ADD|DROP]_DELAYED_REF actions.
> +  */
> + update_share_count(sc, ref->count,
> +ref->count + newref->count);
>   ref->count += newref->count;
>   free_pref(newref);
>   return;
>   }
>   }
>  
> + update_share_count(sc, 0, newref->count);
>   preftree->count++;
>   trace_btrfs_prelim_ref_insert(fs_info, newref, NULL, preftree->count);
>   rb_link_node(&newref->rbnode, parent, p);
> @@ -303,7 +342,8 @@ static void prelim_release(struct preftree *preftree)
>  static int add_prelim_ref(const struct btrfs_fs_info *fs_info,
> struct preftree *preftree, u64 root_id,
> const struct btrfs_key *key, int level, u64 parent,
> -   u64 wanted_disk_byte, int count, gfp_t gfp_mask)
> +   u64 wanted_disk_byte, int count,
> +   struct share_check *sc, gfp_t gfp_mask)
>  {
>   struct prelim_ref *ref;
>  
> @@ -348,31 +388,32 @@ static int add_prelim_ref(const struct btrfs_fs_info 
> *fs_info,
>   ref->count = count;
>   ref->parent = parent;
>   ref->wanted_disk_byte = wanted_disk_byte;
> - prelim_ref_insert(fs_info, preftree, ref);
> -
> - return 0;
> + prelim_ref_insert(fs_info, preftree, ref, sc);
> + return extent_is_shared(sc);
>  }
>  
>  /* direct refs use root == 0, key == NULL */
>  static int add_direct_ref(const struct btrfs_fs_info *fs_info,
> struct preftrees *preftrees, int level, u64 parent,
> -   u64 wanted_disk_byte, int count, gfp_t gfp_mask)
> +   u64 wanted_disk_byte, int count,
> +   struct share_check *sc, gfp_t gfp_mask)
>  {
>   return add_prelim_ref(fs_info, &preftrees->direct, 0, NULL, level,
> -   parent, wanted_disk_byte, count, gfp_mask);
> +   parent, wanted_disk_byte, count, sc, gfp_mask);
>  }
>  
>  /* indirect 
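
The counting rule from the share_check comment in the quoted patch can be
exercised in userspace. The sketch below copies the logic of
update_share_count() and extent_is_shared() as quoted; the *_model names and
the model_merge() wrapper are hypothetical and exist only for this
illustration.

```c
#include <assert.h>
#include <stddef.h>

#define BACKREF_FOUND_SHARED 6	/* value as defined in backref.c */

struct share_check_model {
	int share_count;	/* number of refs whose count is > 0 */
};

static void model_update_share_count(struct share_check_model *sc,
				     int oldcount, int newcount)
{
	if (!sc || (oldcount == 0 && newcount < 1))
		return;

	if (oldcount > 0 && newcount < 1)
		sc->share_count--;
	else if (oldcount < 1 && newcount > 0)
		sc->share_count++;
}

static int model_extent_is_shared(const struct share_check_model *sc)
{
	return (sc && sc->share_count > 1) ? BACKREF_FOUND_SHARED : 0;
}

/* Fold a (possibly negative) delta into one ref's count, then re-check. */
static int model_merge(struct share_check_model *sc, int *ref_count, int delta)
{
	model_update_share_count(sc, *ref_count, *ref_count + delta);
	*ref_count += delta;
	return model_extent_is_shared(sc);
}
```

Two distinct refs with a positive count are enough to report the extent as
shared; raising the count of a single ref is not, and a delayed drop that
brings one ref back to zero clears the condition again.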

Re: Btrfs incremental send | receive fails with Error: File not found

2017-07-28 Thread Hermann Schwärzler

Hi

for me it looks like those snapshots are not read-only. But as far as I 
know for using send they have to be.


At least
https://btrfs.wiki.kernel.org/index.php/Incremental_Backup#Initial_Bootstrapping
states "We will need to create a read-only snapshot ,,,"

I am using send/receive (with read-only snapshots) on a regular basis 
and never had a problem like yours.


What are the commands you use to create your snapshots?

Greetings
Hermann

On 07/28/2017 07:26 PM, A L wrote:
> I often hit the following error when doing incremental btrfs send-receive:
> Btrfs incremental send | receive fails with Error: File not found
>
> Sometimes I can do two-three incremental snapshots, but then the same
> error (different file) happens again. It seems that the files were
> changed or replaced between snapshots, which is causing the problems for
> send-receive. I have tried to delete all snapshots and started over but
> the problem comes back, so I think it must be a bug.
>
> The source volume is:   /mnt/storagePool (with RAID1 profile)
> with subvolume:   volume/userData
> Backup disk is:   /media/usb-backup (external USB disk)
>
> [...]



Re: [PATCH v3 08/13] btrfs: convert prelimary reference tracking to use rbtrees

2017-07-28 Thread Liu Bo
On Wed, Jul 12, 2017 at 04:20:06PM -0600, Edmund Nadolski wrote:
> It's been known for a while that the use of multiple lists
> that are periodically merged was an algorithmic problem within
> btrfs.  There are several workloads that don't complete in any
> reasonable amount of time (e.g. btrfs/130) and others that cause
> soft lockups.
> 
> The solution is to use a set of rbtrees that do insertion merging
> for both indirect and direct refs, with the former converting
> refs into the latter.  The result is a btrfs/130 workload that
> used to take several hours now takes about half of that. This
> runtime still isn't acceptable and a future patch will address that
> by moving the rbtrees higher in the stack so the lookups can be
> shared across multiple calls to find_parent_nodes.
>


Reviewed-by: Liu Bo 

Thanks,

-liubo
> Signed-off-by: Edmund Nadolski 
> Signed-off-by: Jeff Mahoney 
> ---
>  fs/btrfs/backref.c | 441 
> ++---
>  1 file changed, 284 insertions(+), 157 deletions(-)
> 
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 6cac5ab..1edb107 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -26,11 +26,6 @@
>  #include "delayed-ref.h"
>  #include "locking.h"
>  
> -enum merge_mode {
> - MERGE_IDENTICAL_KEYS = 1,
> - MERGE_IDENTICAL_PARENTS,
> -};
> -
>  /* Just an arbitrary number so we can be sure this happened */
>  #define BACKREF_FOUND_SHARED 6
>  
> @@ -129,7 +124,7 @@ static int find_extent_in_eb(const struct extent_buffer 
> *eb,
>   * this structure records all encountered refs on the way up to the root
>   */
>  struct prelim_ref {
> - struct list_head list;
> + struct rb_node rbnode;
>   u64 root_id;
>   struct btrfs_key key_for_search;
>   int level;
> @@ -139,6 +134,18 @@ struct prelim_ref {
>   u64 wanted_disk_byte;
>  };
>  
> +struct preftree {
> + struct rb_root root;
> +};
> +
> +#define PREFTREE_INIT { .root = RB_ROOT }
> +
> +struct preftrees {
> + struct preftree direct;    /* BTRFS_SHARED_[DATA|BLOCK]_REF_KEY */
> + struct preftree indirect;  /* BTRFS_[TREE_BLOCK|EXTENT_DATA]_REF_KEY */
> + struct preftree indirect_missing_keys;
> +};
> +
>  static struct kmem_cache *btrfs_prelim_ref_cache;
>  
>  int __init btrfs_prelim_ref_init(void)
> @@ -158,6 +165,108 @@ void btrfs_prelim_ref_exit(void)
>   kmem_cache_destroy(btrfs_prelim_ref_cache);
>  }
>  
> +static void free_pref(struct prelim_ref *ref)
> +{
> + kmem_cache_free(btrfs_prelim_ref_cache, ref);
> +}
> +
> +/*
> + * Return 0 when both refs are for the same block (and can be merged).
> + * A -1 return indicates ref1 is a 'lower' block than ref2, while 1
> + * indicates a 'higher' block.
> + */
> +static int prelim_ref_compare(struct prelim_ref *ref1,
> +   struct prelim_ref *ref2)
> +{
> + if (ref1->level < ref2->level)
> + return -1;
> + if (ref1->level > ref2->level)
> + return 1;
> + if (ref1->root_id < ref2->root_id)
> + return -1;
> + if (ref1->root_id > ref2->root_id)
> + return 1;
> + if (ref1->key_for_search.type < ref2->key_for_search.type)
> + return -1;
> + if (ref1->key_for_search.type > ref2->key_for_search.type)
> + return 1;
> + if (ref1->key_for_search.objectid < ref2->key_for_search.objectid)
> + return -1;
> + if (ref1->key_for_search.objectid > ref2->key_for_search.objectid)
> + return 1;
> + if (ref1->key_for_search.offset < ref2->key_for_search.offset)
> + return -1;
> + if (ref1->key_for_search.offset > ref2->key_for_search.offset)
> + return 1;
> + if (ref1->parent < ref2->parent)
> + return -1;
> + if (ref1->parent > ref2->parent)
> + return 1;
> +
> + return 0;
> +}
> +
> +/*
> + * Add @newref to the @root rbtree, merging identical refs.
> + *
> + * Callers should assumed that newref has been freed after calling.
> + */
> +static void prelim_ref_insert(struct preftree *preftree,
> +   struct prelim_ref *newref)
> +{
> + struct rb_root *root;
> + struct rb_node **p;
> + struct rb_node *parent = NULL;
> + struct prelim_ref *ref;
> + int result;
> +
> + root = &preftree->root;
> + p = &root->rb_node;
> +
> + while (*p) {
> + parent = *p;
> + ref = rb_entry(parent, struct prelim_ref, rbnode);
> + result = prelim_ref_compare(ref, newref);
> + if (result < 0) {
> + p = &(*p)->rb_left;
> + } else if (result > 0) {
> + p = &(*p)->rb_right;
> + } else {
> + /* Identical refs, merge them and free @newref */
> + struct extent_inode_elem *eie = ref->inode_list;
> +
> + while (eie && eie->next)

Re: Btrfs + compression = slow performance and high cpu usage

2017-07-28 Thread Peter Grandi
In addition to my previous "it does not happen here" comment, if
someone is reading this thread, there are some other interesting
details:

> When the compression is turned off, I am able to get the
> maximum 500-600 mb/s write speed on this disk (raid array)
> with minimal cpu usage.

No details on whether it is a parity RAID or not.

> btrfs device usage /mnt/arh-backup1/
> /dev/sda, ID: 2
>Device size:21.83TiB
>Device slack:  0.00B
>Data,single: 9.29TiB
>Metadata,single:46.00GiB
>System,single:  32.00MiB
>Unallocated:12.49TiB

That's exactly 24TB of "Device size", of which around 45% are
used, and the string "backup" may suggest that the content is
backups, which may indicate a very fragmented freespace.
Of course compression does not help with that, in my freshly
created Btrfs volume I get as expected:

  soft#  umount /mnt/sde3
  soft#  mount -t btrfs -o commit=10 /dev/sde3 /mnt/sde3

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sde3/testfile bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 103.747 s, 101 MB/s
  0.00user 11.56system 1:44.86elapsed 11%CPU (0avgtext+0avgdata 3072maxresident)k
  20480672inputs+20498272outputs (1major+349minor)pagefaults 0swaps

  soft#  filefrag /mnt/sde3/testfile
  /mnt/sde3/testfile: 11 extents found

versus:

  soft#  umount /mnt/sde3
  soft#  mount -t btrfs -o commit=10,compress=lzo,compress-force /dev/sde3 /mnt/sde3

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sde3/testfile bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 109.051 s, 96.2 MB/s
  0.02user 13.03system 1:49.49elapsed 11%CPU (0avgtext+0avgdata 3068maxresident)k
  20494784inputs+20492320outputs (1major+347minor)pagefaults 0swaps

  soft#  filefrag /mnt/sde3/testfile
  /mnt/sde3/testfile: 49287 extents found

Most the latter extents are mercifully rather contiguous, their
size is just limited by the compression code, here is an extract
from 'filefrag -v' from around the middle:

  24757:  1321888.. 1321919:   11339579..  11339610: 32:   11339594:
  24758:  1321920.. 1321951:   11339597..  11339628: 32:   11339611:
  24759:  1321952.. 1321983:   11339615..  11339646: 32:   11339629:
  24760:  1321984.. 1322015:   11339632..  11339663: 32:   11339647:
  24761:  1322016.. 1322047:   11339649..  11339680: 32:   11339664:
  24762:  1322048.. 1322079:   11339667..  11339698: 32:   11339681:
  24763:  1322080.. 1322111:   11339686..  11339717: 32:   11339699:
  24764:  1322112.. 1322143:   11339703..  11339734: 32:   11339718:
  24765:  1322144.. 1322175:   11339720..  11339751: 32:   11339735:
  24766:  1322176.. 1322207:   11339737..  11339768: 32:   11339752:
  24767:  1322208.. 1322239:   11339754..  11339785: 32:   11339769:
  24768:  1322240.. 1322271:   11339771..  11339802: 32:   11339786:
  24769:  1322272.. 1322303:   11339789..  11339820: 32:   11339803:

But again this is on a fresh empty Btrfs volume.
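
As a rough worked example of what the extract shows: every extent covers 32
logical blocks, while consecutive physical start blocks are only 17 to 19
blocks apart. Assuming the extents are stored back to back (an assumption, the
extract does not prove it), that gap approximates the compressed on-disk size
per extent. A small sketch of the arithmetic, with hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/* Physical start block of each extent in the filefrag -v extract above. */
static const unsigned long physical_start[] = {
	11339579, 11339597, 11339615, 11339632, 11339649, 11339667, 11339686,
	11339703, 11339720, 11339737, 11339754, 11339771, 11339789,
};

#define LOGICAL_BLOCKS_PER_EXTENT 32UL	/* the length column of the extract */

/*
 * Average gap between consecutive physical start blocks, scaled by
 * 1000 to stay in integer arithmetic.
 */
static unsigned long blocks_per_extent_x1000(const unsigned long *start,
					     size_t n)
{
	assert(n >= 2);
	return (start[n - 1] - start[0]) * 1000UL / (n - 1);
}

/* On-disk blocks per logical block, in parts per thousand. */
static unsigned long ratio_per_mille(const unsigned long *start, size_t n)
{
	return blocks_per_extent_x1000(start, n) / LOGICAL_BLOCKS_PER_EXTENT;
}
```

For the 13 rows above this gives 17.5 blocks on disk per 32-block extent,
about 55% of the uncompressed size. If the block size is 4 KiB (again an
assumption, the extract does not show it), each extent covers 128 KiB of file
data.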


Re: Btrfs + compression = slow performance and high cpu usage

2017-07-28 Thread Hugo Mills
On Fri, Jul 28, 2017 at 06:20:14PM +0000, William Muriithi wrote:
> Hi Roman,
> 
> > autodefrag
> 
> This sure sounded like a good thing to enable? on paper? right?...
> 
> The moment you see anything remotely weird about btrfs, this is the first 
> thing you have to disable and retest without. Oh wait, the first would be 
> qgroups, this one is second.
> 
> What's the problem with autodefrag?  I am also using it, so you caught my 
> attention when you implied that it shouldn't be used.  According to the docs, it 
> seems like one of the very mature features of the filesystem.  See below for 
> the doc I am referring to:
> 
> https://btrfs.wiki.kernel.org/index.php/Status
> 
> I am using it as I assumed it could prevent the filesystem from becoming too 
> fragmented long term, but I never thought there was a price to pay for using it.

   It introduces additional I/O on writes, as it modifies a small area
surrounding any write or cluster of writes.

   I'm not aware of it causing massive slowdowns in the way that
qgroups do in some situations.

   If your system is already marginal in terms of being able to
support the I/O required, then turning on autodefrag will make things
worse (but you may be heading for _much_ worse performance in the
future as the FS becomes more fragmented -- depending on your write
patterns and use case).

   Hugo.

-- 
Hugo Mills | Great oxymorons of the world, no. 6:
hugo@... carfax.org.uk | Mature Student
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH 2/2] btrfs: increase ctx->pos for delayed dir index

2017-07-28 Thread Liu Bo
On Mon, Jul 24, 2017 at 03:14:26PM -0400, jo...@toxicpanda.com wrote:
> From: Josef Bacik 
> 
> Our dir_context->pos is supposed to hold the next position we're
> supposed to look.  If we successfully insert a delayed dir index we
> could end up with a duplicate entry because we don't increase ctx->pos
> after doing the dir_emit.
>

Looks good.

Reviewed-by: Liu Bo 

Thanks,

-liubo
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/delayed-inode.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index 8ae409b..19e4ad2 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -1727,6 +1727,7 @@ int btrfs_readdir_delayed_dir_index(struct dir_context *ctx,
>  
>  		if (over)
>  			return 1;
> +		ctx->pos++;
>  	}
>  	return 0;
>  }
> -- 
> 2.7.4
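
The effect described in the commit message can be illustrated with a toy
model of a resumable directory listing. This is a sketch under simplifying
assumptions, not the btrfs code: entries are keyed by an increasing offset,
each call emits entries whose offset is at least `pos`, and the function
names are invented for the example.

```python
# Toy model of a resumable directory listing with a position cursor.
# Assumption: a listing call emits every entry whose offset is >= pos,
# as far as the caller's buffer (`room` entries) allows.

def readdir(entries, pos, room, advance_after_emit):
    """Emit up to `room` entries starting at `pos`; return (emitted, new_pos)."""
    emitted = []
    for offset, name in entries:
        if offset < pos:
            continue  # already returned by an earlier call
        pos = offset
        if len(emitted) == room:
            return emitted, pos  # buffer full: resume at this entry next time
        emitted.append(name)
        if advance_after_emit:
            pos += 1  # the fix: point past the entry that was just emitted
    return emitted, pos


def list_all(entries, room, advance_after_emit, max_calls=10):
    """Call readdir repeatedly, like a getdents loop, and collect the names."""
    names, pos = [], 0
    for _ in range(max_calls):
        chunk, pos = readdir(entries, pos, room, advance_after_emit)
        if not chunk:
            break
        names.extend(chunk)
    return names


if __name__ == "__main__":
    entries = [(2, "a"), (3, "b"), (4, "c")]
    print("with pos++   :", list_all(entries, 2, True))
    print("without pos++:", list_all(entries, 2, False))
```

Without the increment, the cursor still points at the last emitted entry
when the next call starts, so that entry is returned again.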
> 


Re: [PATCH 1/2][v2] btrfs: fix readdir deadlock with pagefault

2017-07-28 Thread Liu Bo
On Mon, Jul 24, 2017 at 03:14:25PM -0400, jo...@toxicpanda.com wrote:
> From: Josef Bacik 
> 
> Readdir does dir_emit while under the btree lock.  dir_emit can trigger
> the page fault which means we can deadlock.  Fix this by allocating a
> buffer on opening a directory and copying the readdir into this buffer
> and doing dir_emit from outside of the tree lock.
> 
> Signed-off-by: Josef Bacik 
> ---
> v1->v2:
> - use kzalloc instead of alloc_page().
> - make struct btrfs_file_private so you can still start a userspace trans on a
>   directory.
> 
>  fs/btrfs/ctree.h |   5 +++
>  fs/btrfs/file.c  |   9 -
>  fs/btrfs/inode.c | 107 +--
>  fs/btrfs/ioctl.c |  19 ++
>  4 files changed, 107 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 5ee9f10..33e942b 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1264,6 +1264,11 @@ struct btrfs_root {
>   atomic64_t qgroup_meta_rsv;
>  };
>  
> +struct btrfs_file_private {
> + struct btrfs_trans_handle *trans;
> + void *filldir_buf;
> +};
> +
>  static inline u32 btrfs_inode_sectorsize(const struct inode *inode)
>  {
>   return btrfs_sb(inode->i_sb)->sectorsize;
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0f102a1..1897c3b 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1973,8 +1973,15 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
>  
>  int btrfs_release_file(struct inode *inode, struct file *filp)
>  {
> -	if (filp->private_data)
> +	struct btrfs_file_private *private = filp->private_data;
> +
> +	if (private && private->trans)
>  		btrfs_ioctl_trans_end(filp);
> +	if (private && private->filldir_buf)
> +		kfree(private->filldir_buf);
> +	kfree(private);
> +	filp->private_data = NULL;
> +
>  	/*
>  	 * ordered_data_close is set by settattr when we are about to truncate
>  	 * a file from a non-zero size to a zero size.  This tries to
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9a4413a..bbdbeea 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5877,25 +5877,73 @@ unsigned char btrfs_filetype_table[] = {
>   DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
>  };
>  
> +/*
> + * All this infrastructure exists because dir_emit can fault, and we are holding
> + * the tree lock when doing readdir.  For now just allocate a buffer and copy
> + * our information into that, and then dir_emit from the buffer.  This is
> + * similar to what NFS does, only we don't keep the buffer around in pagecache
> + * because I'm afraid I'll fuck that up.  Long term we need to make filldir do
> + * copy_to_user_inatomic so we don't have to worry about page faulting under the
> + * tree lock.
> + */
> +static int btrfs_opendir(struct inode *inode, struct file *file)
> +{
> +	struct btrfs_file_private *private;
> +
> +	private = kzalloc(sizeof(struct btrfs_file_private), GFP_KERNEL);
> +	if (!private)
> +		return -ENOMEM;
> +	private->filldir_buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!private->filldir_buf) {
> +		kfree(private);
> +		return -ENOMEM;
> +	}
> +	file->private_data = private;
> +	return 0;
> +}
> +
> +struct dir_entry {
> +	u64 ino;
> +	u64 offset;
> +	unsigned type;
> +	int name_len;
> +};
> +
> +static int btrfs_filldir(void *addr, int entries, struct dir_context *ctx)
> +{
> +	while (entries--) {
> +		struct dir_entry *entry = addr;
> +		char *name = (char *)(entry + 1);
> +		ctx->pos = entry->offset;
> +		if (!dir_emit(ctx, name, entry->name_len, entry->ino,
> +			      entry->type))
> +			return 1;
> +		addr += sizeof(struct dir_entry) + entry->name_len;
> +		ctx->pos++;
> +	}
> +	return 0;
> +}
> +
>  static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
>  {
>   struct inode *inode = file_inode(file);
>   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   struct btrfs_root *root = BTRFS_I(inode)->root;
> + struct btrfs_file_private *private = file->private_data;
>   struct btrfs_dir_item *di;
>   struct btrfs_key key;
>   struct btrfs_key found_key;
>   struct btrfs_path *path;
> + void *addr;
>   struct list_head ins_list;
>   struct list_head del_list;
>   int ret;
>   struct extent_buffer *leaf;
>   int slot;
> - unsigned char d_type;
> - int over = 0;
> - char tmp_name[32];
>   char *name_ptr;
>   int name_len;
> + int entries = 0;
> + int total_len = 0;
>   bool put = false;
>   struct btrfs_key location;
>  
> @@ -5906,12 +5954,14 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
>   if (!path)

RE: Btrfs + compression = slow performance and high cpu usage

2017-07-28 Thread William Muriithi
Hi Roman,

> autodefrag

This sure sounded like a good thing to enable? on paper? right?...

The moment you see anything remotely weird about btrfs, this is the first thing 
you have to disable and retest without. Oh wait, the first would be qgroups, 
this one is second.

What's the problem with autodefrag?  I am also using it, so you caught my 
attention when you implied that it shouldn't be used.  According to the docs, it 
seems like one of the very mature features of the filesystem.  See below for the 
doc I am referring to:

https://btrfs.wiki.kernel.org/index.php/Status

I am using it as I assumed it could prevent the filesystem from becoming too 
fragmented long term, but I never thought there was a price to pay for using it.

Regards,
William



Re: Btrfs + compression = slow performance and high cpu usage

2017-07-28 Thread Peter Grandi
> I am stuck with a problem of btrfs slow performance when using
> compression. [ ... ]

That to me looks like an issue with speed, not performance, and
in particular with PEBCAK issues.

As to high CPU usage, when you find a way to do both compression
and checksumming without using much CPU time, please send patches
urgently :-).

In your case the increase in CPU time is bizarre. I have the
Ubuntu 4.4 "lts-xenial" kernel and what you report does not
happen here (with a few little changes):

  soft#  grep 'model name' /proc/cpuinfo | sort -u
  model name  : AMD FX(tm)-6100 Six-Core Processor
  soft#  cpufreq-info | grep 'current CPU frequency'
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).

  soft#  lsscsi | grep 'sd[ae]'
  [0:0:0:0]    disk    ATA      HFS256G32MNB-220 3L00  /dev/sda
  [5:0:0:0]    disk    ATA      ST2000DM001-1CH1 CC44  /dev/sde

  soft#  mkfs.btrfs -f /dev/sde3
  [ ... ]
  soft#  mount -t btrfs -o discard,autodefrag,compress=lzo,compress-force,commit=10 /dev/sde3 /mnt/sde3

  soft#  df /dev/sda6 /mnt/sde3
  Filesystem     1M-blocks  Used Available Use% Mounted on
  /dev/sda6          90048 76046     14003  85% /
  /dev/sde3         237568    19    235501   1% /mnt/sde3

The above is useful context information that was "amazingly"
omitted from your report.

In dmesg I see (note the "force zlib compression"):

  [327730.917285] BTRFS info (device sde3): turning on discard
  [327730.917294] BTRFS info (device sde3): enabling auto defrag
  [327730.917300] BTRFS info (device sde3): setting 8 feature flag
  [327730.917304] BTRFS info (device sde3): force zlib compression
  [327730.917313] BTRFS info (device sde3): disk space caching is enabled
  [327730.917315] BTRFS: has skinny extents
  [327730.917317] BTRFS: flagging fs with big metadata feature
  [327730.920740] BTRFS: creating UUID tree

and the result is:

  soft#  pv -tpreb /dev/sda6 | time dd iflag=fullblock of=/mnt/sde3/testfile bs=1M count=10000 oflag=direct
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 112.845 s, 92.9 MB/s
  0.05user 9.93system 1:53.20elapsed 8%CPU (0avgtext+0avgdata 3016maxresident)k
  120inputs+20496000outputs (1major+346minor)pagefaults 0swaps
  9.77GB 0:01:53 [88.3MB/s] [==>                                 ] 11%

  soft#  btrfs fi df /mnt/sde3/
  Data, single: total=10.01GiB, used=9.77GiB
  System, DUP: total=8.00MiB, used=16.00KiB
  Metadata, DUP: total=1.00GiB, used=11.66MiB
  GlobalReserve, single: total=16.00MiB, used=0.00B

As it was running system CPU time was under 20% of one CPU:

  top - 18:57:29 up 3 days, 19:27,  4 users,  load average: 5.44, 2.82, 1.45
  Tasks: 325 total,   1 running, 324 sleeping,   0 stopped,   0 zombie
  %Cpu0  :  0.0 us,  2.3 sy,  0.0 ni, 91.3 id,  6.3 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu1  :  0.0 us,  1.3 sy,  0.0 ni, 78.5 id, 20.2 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu2  :  0.3 us,  5.8 sy,  0.0 ni, 81.0 id, 12.5 wa,  0.0 hi,  0.3 si,  0.0 st
  %Cpu3  :  0.3 us,  3.4 sy,  0.0 ni, 91.9 id,  4.4 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu4  :  0.3 us, 10.6 sy,  0.0 ni, 55.4 id, 33.7 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu5  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  KiB Mem:   8120660 total,  5162236 used,  2958424 free,  4440100 buffers
  KiB Swap:        0 total,        0 used,        0 free.   351848 cached Mem

    PID  PPID USER      PR  NI    VIRT    RES   DATA  %CPU %MEM     TIME+ TTY      COMMAND
  21047 21046 root      20   0    8872   2616   1364  12.9  0.0   0:02.31 pts/3    dd iflag=fullblo+
  21045  3535 root      20   0    7928   1948    460  12.3  0.0   0:00.72 pts/3    pv -tpreb /dev/s+
  21019     2 root      20   0       0      0      0   1.3  0.0   0:42.88 ?        [kworker/u16:1]

Of course "oflag=direct" is a rather "optimistic" option in this
context, so I tried again with something more sensible:

  soft#  pv -tpreb /dev/sda6 | time dd iflag=fullblock of=/mnt/sde3/testfile bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 110.523 s, 94.9 MB/s
  0.03user 8.94system 1:50.71elapsed 8%CPU (0avgtext+0avgdata 3024maxresident)k
  136inputs+20499648outputs (1major+348minor)pagefaults 0swaps
  9.77GB 0:01:50 [90.3MB/s] [==>                                 ] 11%

  soft#  btrfs fi df /mnt/sde3/
  Data, single: total=7.01GiB, used=6.35GiB
  System, DUP: total=8.00MiB, used=16.00KiB
  Metadata, DUP: total=1.00GiB, used=15.81MiB
  GlobalReserve, 

Re: Btrfs + compression = slow performance and high cpu usage

2017-07-28 Thread Roman Mamedov
On Fri, 28 Jul 2017 17:40:50 +0100 (BST)
"Konstantin V. Gavrilenko"  wrote:

> Hello list, 
> 
> I am stuck with a problem of btrfs slow performance when using compression.
> 
> When the compress-force=lzo mount flag is enabled, the performance drops to 
> 30-40 MB/s and one of the btrfs processes utilises 100% CPU time.
> mount options: btrfs relatime,discard,autodefrag,compress=lzo,compress-force,space_cache=v2,commit=10

It does not work like that, you need to set compress-force=lzo (and remove
compress=).

With your setup I believe you currently use compress-force[=zlib](default),
overriding compress=lzo, since it's later in the options order.
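
The "later option wins" behaviour described above can be modelled in a few
lines. This is a simplified sketch, not the kernel's option parser: the
function name is invented, and the only rule it encodes is the one stated in
this reply, namely that a bare compress or compress-force selects the
default algorithm (zlib).

```python
# Simplified model of "later option wins" for the btrfs compression mount
# options. Assumption: a bare compress / compress-force selects zlib, as
# described in the reply above; everything else about mounting is ignored.

DEFAULT_ALGO = "zlib"


def parse_compress_opts(options):
    """Return (algorithm, force_seen) after reading options left to right."""
    algo, force_seen = None, False
    for opt in options.split(","):
        key, _, value = opt.partition("=")
        if key in ("compress", "compress-force"):
            algo = value or DEFAULT_ALGO  # a later option replaces the algorithm
            if key == "compress-force":
                force_seen = True
    return algo, force_seen


if __name__ == "__main__":
    reported = ("relatime,discard,autodefrag,compress=lzo,"
                "compress-force,space_cache=v2,commit=10")
    print(parse_compress_opts(reported))
    print(parse_compress_opts("relatime,discard,compress-force=lzo"))
```

Under this model the reported option string ends up forcing zlib, while
compress-force=lzo on its own gives forced lzo.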

Secondly,

> autodefrag

This sure sounded like a good thing to enable? on paper? right?...

The moment you see anything remotely weird about btrfs, this is the first
thing you have to disable and retest without. Oh wait, the first would be
qgroups, this one is second.

Finally, what is the reasoning behind "commit=10", and did you check with the
default value of 30?

-- 
With respect,
Roman


Btrfs incremental send | receive fails with Error: File not found

2017-07-28 Thread A L

I often hit the following error when doing incremental btrfs send-receive:
Btrfs incremental send | receive fails with Error: File not found

Sometimes I can do two-three incremental snapshots, but then the same 
error (different file) happens again. It seems that the files were 
changed or replaced between snapshots, which is causing the problems for 
send-receive. I have tried to delete all snapshots and started over but 
the problem comes back, so I think it must be a bug.


The source volume is:   /mnt/storagePool (with RAID1 profile)
with subvolume:   volume/userData
Backup disk is:   /media/usb-backup (external USB disk)


# cat /proc/version
Linux version 4.13.0-rc2 (root@e350) (gcc version 6.3.0 (Gentoo 6.3.0 
p1.0)) #2 SMP PREEMPT Fri Jul 28 14:25:15 CEST 2017


# btrfs version
btrfs-progs v4.11.1

# btrfs fi show:
Label: 'Backup'  uuid: f021a039-87d6-4498-a0f5-6bbba3dfb1f1
    Total devices 1 FS bytes used 362.85GiB
    devid    1 size 931.51GiB used 367.06GiB path /dev/sdf1

Label: 'pool'  uuid: ea4f1d6d-c2c5-4247-a903-15b36ee276a7
    Total devices 2 FS bytes used 362.33GiB
    devid    1 size 927.51GiB used 367.03GiB path /dev/sdc2
    devid    2 size 927.51GiB used 367.03GiB path /dev/sdd2


(backup) /media/usb-backup/volumes/userData # btrfs sub list .
ID 258 gen 30 top level 5 path scripts
ID 1622 gen 3227 top level 5 path volumes/userData/userData.20170727T1222
ID 1999 gen 3251 top level 5 path volumes/userData/userData.20170727T2102

(source) /mnt/storagePool/snapshots # btrfs sub list .
ID 262 gen 118703 top level 5 path volume/userData
ID 1928 gen 118105 top level 5 path snapshots/userData.20170727T1222
ID 1930 gen 118151 top level 5 path snapshots/userData.20170727T2102
ID 1932 gen 118167 top level 5 path snapshots/userData.20170727T2300
ID 1936 gen 118390 top level 5 path snapshots/userData.20170728T0100
ID 1939 gen 118502 top level 5 path snapshots/userData.20170728T0200
ID 1955 gen 118667 top level 5 path snapshots/userData.20170728T1300
ID 1960 gen 118695 top level 5 path snapshots/userData.20170728T1700
ID 1962 gen 118699 top level 5 path snapshots/userData.20170728T1800


# btrfs subvolume list -p -a -c -g -u -q -R -t /mnt/storagePool/snapshots
ID    gen     cgen    parent  top level  parent_uuid                           received_uuid                         uuid                                  path
--    ---     ----    ------  ---------  -----------                           -------------                         ----                                  ----
260   118702  24      5       5          -                                     6e20167e-8d72-cc42-b486-10c6a5516ca7  dd86162c-4df2-d646-a65f-77768adc132d  volume/mail
262   118703  39      5       5          -                                     8464242d-0e81-e84e-ba93-78b1c8f00fc9  94c256cb-970e-e349-a660-ff4d9291c829  volume/userData
506   118691  333     5       5          -                                     d0c6ff24-1766-b049-abe9-80396795448f  c759b1cc-106e-134a-8cef-f1da1bc5e169  volume/storageTemp
1469  78671   78671   5       5          -                                     -                                     8a94524e-a956-c14b-bb8d-d453627f27d5  volume/mysql
1928  118105  118105  5       5          94c256cb-970e-e349-a660-ff4d9291c829  8464242d-0e81-e84e-ba93-78b1c8f00fc9  7aed8444-34a7-c54d-ae06-e0e80ead3c18  snapshots/userData.20170727T1222
1930  118151  118151  5       5          94c256cb-970e-e349-a660-ff4d9291c829  8464242d-0e81-e84e-ba93-78b1c8f00fc9  20b4fab3-f75c-4445-914a-23465e09626c  snapshots/userData.20170727T2102
1932  118167  118167  5       5          94c256cb-970e-e349-a660-ff4d9291c829  8464242d-0e81-e84e-ba93-78b1c8f00fc9  2b0069dc-5d71-df49-9c32-d5e0f17c09e9  snapshots/userData.20170727T2300
1936  118390  118390  5       5          94c256cb-970e-e349-a660-ff4d9291c829  8464242d-0e81-e84e-ba93-78b1c8f00fc9  8aa3ea70-b703-b740-8012-373be0616720  snapshots/userData.20170728T0100
1939  118502  118502  5       5          94c256cb-970e-e349-a660-ff4d9291c829  8464242d-0e81-e84e-ba93-78b1c8f00fc9  ad84276f-a481-d04a-ad26-301dd79b158f  snapshots/userData.20170728T0200
1955  118667  118667  5       5          94c256cb-970e-e349-a660-ff4d9291c829  8464242d-0e81-e84e-ba93-78b1c8f00fc9  605cf43c-5e01-9d4e-ad22-77488f0d3e90  snapshots/userData.20170728T1300
1960  118695  118695  5       5          94c256cb-970e-e349-a660-ff4d9291c829  8464242d-0e81-e84e-ba93-78b1c8f00fc9  31c72ce0-5765-b042-a073-8c4296e111ec  snapshots/userData.20170728T1700
1962  118699  118699  5       5          94c256cb-970e-e349-a660-ff4d9291c829  8464242d-0e81-e84e-ba93-78b1c8f00fc9  feadb1df-867b-7245-86d0-5472cd3c899b  snapshots/userData.20170728T1800



# btrfs subvolume list -p -a -c -g -u -q -R -t /media/usb-backup/volumes/userData
ID    gen     cgen    parent  top level  parent_uuid                           received_uuid                         uuid                                  path
--    ---     ----    ------  ---------  -----------                           -------------                         ----                                  ----
258   30      9       5       5          -                                     -                                     95dafde0-677c-7542-9d18-9bbfdbf7c9b3  scripts
1622  3227    2532    5       5          -                                     8464242d-0e81-e84e-ba93-78b1c8f00fc9  cfe52e52-b7dd-7e48-8616-43286f5a11e0  volumes/userData/userData.20170727T1222
1999  3251    3224    5       5

Btrfs + compression = slow performance and high cpu usage

2017-07-28 Thread Konstantin V. Gavrilenko
Hello list, 

I am stuck with a problem of btrfs slow performance when using compression.

When the compress-force=lzo mount flag is enabled, the performance drops to 
30-40 MB/s and one of the btrfs processes utilises 100% CPU time.
mount options: btrfs relatime,discard,autodefrag,compress=lzo,compress-force,space_cache=v2,commit=10

The command I am using to test the write throughput is:

# pv -tpreb /dev/sdb | dd of=./testfile bs=1M oflag=direct

# top -d 1 
top - 15:49:13 up  1:52,  2 users,  load average: 5.28, 2.32, 1.39
Tasks: 320 total,   2 running, 318 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  2.0 sy,  0.0 ni, 77.0 id, 21.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  1.0 sy,  0.0 ni, 90.0 id,  9.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us,  1.0 sy,  0.0 ni, 72.0 id, 27.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,100.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.0 us,  1.0 sy,  0.0 ni, 57.0 id, 42.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.0 us,  0.0 sy,  0.0 ni, 96.0 id,  4.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  0.0 us,  0.0 sy,  0.0 ni, 94.0 id,  6.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  0.0 us,  1.0 sy,  0.0 ni, 95.1 id,  3.9 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  1.0 us,  2.0 sy,  0.0 ni, 24.0 id, 73.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  0.0 us,  0.0 sy,  0.0 ni, 81.8 id, 18.2 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  1.0 us,  0.0 sy,  0.0 ni, 98.0 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  0.0 us,  2.0 sy,  0.0 ni, 83.3 id, 14.7 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32934136 total, 10137496 free,   602244 used, 22194396 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 30525664 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
37017 root      20   0       0      0      0 R 100.0  0.0   0:32.42 kworker/u49:8
36732 root      20   0       0      0      0 D   4.0  0.0   0:02.40 btrfs-transacti
40105 root      20   0    8388   3040   2000 D   4.0  0.0   0:02.88 dd

The kworker process that causes the high CPU usage is most likely searching 
for free space.

# echo l > /proc/sysrq-trigger

# dmesg -T
[Fri Jul 28 15:57:51 2017] CPU: 1 PID: 36430 Comm: kworker/u49:2 Not tainted 4.10.0-28-generic #32~16.04.2-Ubuntu
[Fri Jul 28 15:57:51 2017] Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1b 11/16/2012
[Fri Jul 28 15:57:51 2017] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
[Fri Jul 28 15:57:51 2017] task: 9ddce6206a40 task.stack: aa9121f6c000
[Fri Jul 28 15:57:51 2017] RIP: 0010:rb_next+0x1e/0x40
[Fri Jul 28 15:57:51 2017] RSP: 0018:aa9121f6fb40 EFLAGS: 0282
[Fri Jul 28 15:57:51 2017] RAX: 9dddc34df1b0 RBX: 0001 RCX: 1000
[Fri Jul 28 15:57:51 2017] RDX: 9dddc34df708 RSI: 9ddccaf470a4 RDI: 9dddc34df2d0
[Fri Jul 28 15:57:51 2017] RBP: aa9121f6fb40 R08: 0001 R09: 3000
[Fri Jul 28 15:57:51 2017] R10:  R11: 0002 R12: 9ddccaf47080
[Fri Jul 28 15:57:51 2017] R13: 1000 R14: aa9121f6fc50 R15: 9dddc34df2d0
[Fri Jul 28 15:57:51 2017] FS:  () GS:9ddcefa4() knlGS:
[Fri Jul 28 15:57:51 2017] CS:  0010 DS:  ES:  CR0: 80050033
[Fri Jul 28 15:57:51 2017] Call Trace:
[Fri Jul 28 15:57:51 2017]  btrfs_find_space_for_alloc+0xde/0x270 [btrfs]
[Fri Jul 28 15:57:51 2017]  find_free_extent.isra.68+0x3c6/0x1040 [btrfs]
[Fri Jul 28 15:57:51 2017]  btrfs_reserve_extent+0xab/0x210 [btrfs]
[Fri Jul 28 15:57:51 2017]  submit_compressed_extents+0x154/0x580 [btrfs]
[Fri Jul 28 15:57:51 2017]  ? submit_compressed_extents+0x580/0x580 [btrfs]
[Fri Jul 28 15:57:51 2017]  async_cow_submit+0x82/0x90 [btrfs]
[Fri Jul 28 15:57:51 2017]  btrfs_scrubparity_helper+0x1fe/0x300 [btrfs]
[Fri Jul 28 15:57:51 2017]  btrfs_delalloc_helper+0xe/0x10 [btrfs]
[Fri Jul 28 15:57:51 2017]  process_one_work+0x16b/0x4a0
[Fri Jul 28 15:57:51 2017]  worker_thread+0x4b/0x500
[Fri Jul 28 15:57:51 2017]  kthread+0x109/0x140




When compression is turned off, I am able to get the maximum 500-600 MB/s 
write speed on this disk (RAID array) with minimal CPU usage.

mount options: relatime,discard,autodefrag,space_cache=v2,commit=10

# iostat -m 1 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.08    0.00    7.74   10.77    0.00   81.40

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda2376.00 0.00   

[GIT PULL] Btrfs fixes for 4.13-rc3

2017-07-28 Thread David Sterba
Hi,

please pull the following btrfs fixes. They're addressing problems reported by
users, and there's one more regression fix. Thanks.

The next pull request will be sent by Chris, I'm heading off to vacation.


The following changes since commit c3cfb656307583ddfea45375c10183737593c195:

  Btrfs: fix unexpected return value of bio_readpage_error (2017-07-14 20:42:37 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-4.13-part3

for you to fetch changes up to 0e4324a4c36b3eb5cd1f71cbbc38d888f919ebfc:

  btrfs: round down size diff when shrinking/growing device (2017-07-24 16:05:00 +0200)


Filipe Manana (1):
  Btrfs: fix dir item validation when replaying xattr deletes

Jeff Mahoney (1):
  btrfs: fix lockup in find_free_extent with read-only block groups

Nikolay Borisov (1):
  btrfs: round down size diff when shrinking/growing device

Omar Sandoval (1):
  Btrfs: fix early ENOSPC due to delalloc

 fs/btrfs/extent-tree.c | 11 +--
 fs/btrfs/tree-log.c|  3 +--
 fs/btrfs/volumes.c |  4 ++--
 3 files changed, 8 insertions(+), 10 deletions(-)


Re: [PATCH 1/2][v2] btrfs: fix readdir deadlock with pagefault

2017-07-28 Thread David Sterba
On Mon, Jul 24, 2017 at 03:14:25PM -0400, jo...@toxicpanda.com wrote:
> From: Josef Bacik 
> 
> Readdir does dir_emit while under the btree lock.  dir_emit can trigger
> the page fault which means we can deadlock.  Fix this by allocating a
> buffer on opening a directory and copying the readdir into this buffer
> and doing dir_emit from outside of the tree lock.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: David Sterba 
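
The buffering scheme from the commit message (collect entries into a private
buffer while the lock is held, emit them afterwards) can be sketched in
userspace. The header fields follow struct dir_entry from the patch (ino,
offset, type, name_len), but the exact field sizes, the 4096-byte buffer and
the function names are assumptions made for this illustration.

```python
# Userspace sketch of "pack entries into a buffer, then emit from the buffer".
# Assumptions: header is u64 ino, u64 offset, unsigned type, int name_len,
# each header is followed directly by the name bytes, and 4096 stands in
# for PAGE_SIZE.

import struct

HEADER = struct.Struct("<QQIi")
BUF_SIZE = 4096


def pack_entries(entries, buf_size=BUF_SIZE):
    """Pack (ino, offset, type, name) tuples until the buffer would overflow.

    Returns (buffer, number_of_entries_packed)."""
    buf = bytearray()
    count = 0
    for ino, offset, dtype, name in entries:
        raw = name.encode()
        if len(buf) + HEADER.size + len(raw) > buf_size:
            break  # a real caller would emit what it has and refill
        buf += HEADER.pack(ino, offset, dtype, len(raw)) + raw
        count += 1
    return bytes(buf), count


def emit_entries(buf, count):
    """Walk the packed buffer, yielding (name, ino, type, next_pos)."""
    addr = 0
    for _ in range(count):
        ino, offset, dtype, name_len = HEADER.unpack_from(buf, addr)
        name_start = addr + HEADER.size
        name = buf[name_start:name_start + name_len].decode()
        # next_pos mirrors "ctx->pos = entry->offset; ...; ctx->pos++"
        yield name, ino, dtype, offset + 1
        addr = name_start + name_len


if __name__ == "__main__":
    buf, n = pack_entries([(256, 2, 4, "dir"), (257, 3, 8, "file.txt")])
    for entry in emit_entries(buf, n):
        print(entry)
```

The point of the split is that the emit step touches only the private
buffer, so it can run after the lock has been dropped.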


Re: [PATCH v3] Btrfs: add skeleton code for compression heuristic

2017-07-28 Thread Anand Jain



On 28/07/2017 00:36, David Sterba wrote:

On Mon, Jul 24, 2017 at 11:40:17PM +0800, Anand Jain wrote:



Eg. files that are already compressed would increase the cpu consumption
with compress-force, while they'd be hopefully detected as
incompressible with 'compress' and clever heuristics. So the NOCOMPRESS
bit would better reflect the status of the file.


 I thought 'compress' in above, is the compress option. Ah you mean
 to say compression algo .. got it. Right compress-force for
 incompressible-data is very expensive.

 And it's also true that the compress option for incompressible data is
 not at all expensive and its only one time.


   current NOCOMPRESS is based on trial and error method and is more
   accurate than heuristic also loss of cpu power is only one time ?




Currently, force-compress beats everything, so even a file with
NOCOMPRESS will be compressed, all new writes will be passed to the
compression and stored uncompressed eventually.


 It makes sense to me when you replace NOCOMPRESS with
 incompressible-data in the above statement. As in my understanding..

 You will never have a file with NOCOMPRESS flag if compress-force
 option is used.



Each time the
compression code will run and fail, so it's not one time.

Although you can say it's more 'accurate', it's also more expensive.


  yes. Expensive only in compress-force.


   Maybe the only opportunity that heuristic can facilitate is at the
   logic to monitor and reset the NOCOMPRESS, as of now there is no
   such a logic.


The heuristic can be made adaptive, and examine data even for NOCOMPRESS
files, but that's a few steps ahead of where we are now.
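
The trade-off discussed in this thread (trial compression is accurate but
pays the full cost every time, a heuristic is cheap but only a guess) can be
sketched as follows. This is not the heuristic from the patch set: zlib
stands in for the filesystem's compressor, and the sampling step and the
threshold of 64 distinct byte values are arbitrary choices for the example.

```python
# Two ways to decide whether a block is worth compressing.
# Assumptions: zlib stands in for the filesystem's compressor, and the
# byte-set threshold (64 distinct values) is an arbitrary illustrative choice.

import random
import zlib


def trial_compressible(data):
    """Trial and error: compress the data and see whether it shrank."""
    return len(zlib.compress(data)) < len(data)


def heuristic_compressible(data, sample_step=16, max_distinct=64):
    """Cheap guess: few distinct byte values in a sparse sample suggests
    compressible data; many distinct values suggests it is not."""
    sample = data[::sample_step]
    return len(set(sample)) <= max_distinct


if __name__ == "__main__":
    rng = random.Random(0)
    text_like = b"btrfs compression heuristic " * 150
    random_like = bytes(rng.getrandbits(8) for _ in range(4096))
    for label, block in (("text-like", text_like), ("random", random_like)):
        print(label, trial_compressible(block), heuristic_compressible(block))
```

The heuristic looks at only a fraction of the bytes, which is why it is
cheaper and also why it can be wrong in cases where the trial would not be.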


  Nice.

Thanks, Anand



Re: [PATCH v3 00/13] use rbtrees for preliminary backrefs

2017-07-28 Thread David Sterba
On Wed, Jul 12, 2017 at 04:20:05PM -0600, Edmund Nadolski wrote:
> This patch series attempts to improve the performance of backref
> searches by changing the prelim_refs implementation to use
> rbtrees instead of lists.  This also aims to reduce the soft
> lockup occurrences that can result when a backref search consumes
> too much cpu time.
> 
> Test runs of btrfs/130 show an improvement in the overall
> run time of the test (shown below in seconds) as a function of
> the number of extents:
> 
> nr_extents:    256    512    640    1024    2048
>            +------+------+------+-------+-------
>  unpatched:     20    186    375    2204   40419
>    patched:     12     93    203    1060   22007
> 
> (Note, the current default value for nr_extents in btrfs/130 is
> 4096, which takes a very long time to complete.)
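
To see why replacing a list with an ordered tree helps, one can count key
comparisons for the two lookup strategies. The sketch below is only an
analogy: a sorted Python list with binary search stands in for an rbtree
(the standard library has none), the integer keys are hypothetical, and only
comparisons are counted, not the cost of shifting list elements.

```python
# Counting key comparisons when inserting references and merging duplicates:
# unsorted list with linear scan versus sorted keys with binary search.
# Assumption: integer keys stand in for whatever identifies a reference.

def insert_list(refs, key):
    """Linear scan; bump the count if found, append otherwise.

    Returns the number of keys examined."""
    steps = 0
    for ref in refs:
        steps += 1
        if ref[0] == key:
            ref[1] += 1
            return steps
    refs.append([key, 1])
    return steps


def insert_sorted(keys, counts, key):
    """Binary search in a sorted key list; merge or insert in order.

    Returns the number of keys examined."""
    lo, hi, steps = 0, len(keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(keys) and keys[lo] == key:
        counts[key] += 1
    else:
        keys.insert(lo, key)
        counts[key] = 1
    return steps


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        refs, keys, counts = [], [], {}
        linear = sum(insert_list(refs, k) for k in range(n))
        ordered = sum(insert_sorted(keys, counts, k) for k in range(n))
        print(f"n={n:5d}  list={linear:9d}  sorted={ordered:7d}")
```

The list grows quadratically in comparisons while the ordered structure
grows roughly as n log n, which matches the direction of the timings quoted
above without claiming to reproduce them.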
> 
> Changes for v3:
> 
> Patch 08/13:
>  - Update changelog and comments for third rbtree.
>  - Fixed issue in resolve_indirect_refs() which prevented
>module load when sanity checking was enabled.
> 
> Patch 10/13:
>  - Fix TP_printk_btrfs format string per coding standards.
> 
> Changes for v2:
> 
> Patch 06/13:
>  - Added changelog description.
> 
> Patch 07/13:
>  - Updated changelog description.
>  - Removed 'TODO' comment.
> 
> Patch 08/13:
>  - Added code for proper iteration of missing keys. This adds
>a third rbtree (.indirect_missing_keys in struct preftrees)
>plus the requisite code in add_prelim_ref(), add_missing_keys(),
>resolve_indirect_refs(), and find_parent_nodes().
>  - Rename release_pref() to free_pref().
>  - Replace WARN() with BUG_ON().
>  - Remove 'TODO' comments and the unused 'merge_mode' enum.
> 
> The other patches have no functional changes. Some have diff
> context changes due to the above modifications.
> 
> Edmund Nadolski (6):
>   btrfs: btrfs_check_shared should manage its own transaction
>   btrfs: remove ref_tree implementation from backref.c
>   btrfs: convert prelimary reference tracking to use rbtrees
>   btrfs: add cond_resched() calls when resolving backrefs
>   btrfs: allow backref search checks for shared extents
>   btrfs: clean up extraneous computations in add_delayed_refs
> 
> Jeff Mahoney (7):
>   btrfs: struct-funcs, constify readers
>   btrfs: constify tracepoint arguments
>   btrfs: backref, constify some arguments
>   btrfs: backref, add unode_aux_to_inode_list helper
>   btrfs: backref, cleanup __ namespace abuse
>   btrfs: add a node counter to each of the rbtrees
>   btrfs: backref, add tracepoints for prelim_ref insertion and merging

FYI, the whole patchset is now queued for 4.14. It's been in for-next
for a long time and I haven't seen any problems related to it.


[PATCH] Btrfs: fix assertion failure during fsync in no-holes mode

2017-07-28 Thread fdmanana
From: Filipe Manana 

When logging an inode in full mode that has an inline compressed extent
that represents a range with a size matching the sector size (currently
the same as the page size), has a trailing hole and the no-holes feature
is enabled, we end up failing an assertion leading to a trace like the
following:

[141812.031528] assertion failed: len == i_size, file: fs/btrfs/tree-log.c, 
line: 4453
[141812.033069] [ cut here ]
[141812.034330] kernel BUG at fs/btrfs/ctree.h:3452!
[141812.035137] invalid opcode:  [#1] PREEMPT SMP
[141812.035932] Modules linked in: btrfs dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_flakey dm_mod dax ppdev evdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 tpm_tis psmouse crypto_simd parport_pc sg pcspkr tpm_tis_core cryptd parport serio_raw glue_helper tpm i2c_piix4 i2c_core button sunrpc loop autofs4 ext4 crc16 jbd2 mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod ata_generic virtio_scsi ata_piix floppy crc32c_intel libata scsi_mod virtio_pci virtio_ring e1000 virtio [last unloaded: btrfs]
[141812.036790] CPU: 3 PID: 845 Comm: fdm-stress Tainted: GB   W   4.12.3-btrfs-next-52+ #1
[141812.036790] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[141812.036790] task: 8801e6694180 task.stack: c90009004000
[141812.036790] RIP: 0010:assfail.constprop.18+0x1c/0x1e [btrfs]
[141812.036790] RSP: 0018:c90009007bc0 EFLAGS: 00010282
[141812.036790] RAX: 0046 RBX: 88017512c008 RCX: 0001
[141812.036790] RDX: 88023fd95201 RSI: 8182264c RDI:
[141812.036790] RBP: c90009007bc0 R08: 0001 R09: 0001
[141812.036790] R10: 1000 R11: 82f5a0c9 R12: 88014e5947e8
[141812.036790] R13: 000b4000 R14: 8801b234d008 R15:
[141812.036790] FS:  7fdba6ffd700() GS:88023fd8() knlGS:
[141812.036790] CS:  0010 DS:  ES:  CR0: 80050033
[141812.036790] CR2: 7fdb9c10 CR3: 00016efa2000 CR4: 001406e0
[141812.036790] Call Trace:
[141812.036790]  btrfs_log_inode+0x9f0/0xd3d [btrfs]
[141812.036790]  ? __mutex_lock+0x120/0x3ce
[141812.036790]  btrfs_log_inode_parent+0x224/0x685 [btrfs]
[141812.036790]  ? lock_acquire+0x16b/0x1af
[141812.036790]  btrfs_log_dentry_safe+0x60/0x7b [btrfs]
[141812.036790]  btrfs_sync_file+0x32e/0x3f8 [btrfs]
[141812.036790]  vfs_fsync_range+0x8a/0x9d
[141812.036790]  vfs_fsync+0x1c/0x1e
[141812.036790]  do_fsync+0x31/0x4a
[141812.036790]  SyS_fdatasync+0x13/0x17
[141812.036790]  entry_SYSCALL_64_fastpath+0x18/0xad
[141812.036790] RIP: 0033:0x7fdbac41a47d
[141812.036790] RSP: 002b:7fdba6ffce30 EFLAGS: 0293 ORIG_RAX: 004b
[141812.036790] RAX: ffda RBX: 81092c9f RCX: 7fdbac41a47d
[141812.036790] RDX: 004cf0160a40 RSI:  RDI: 0006
[141812.036790] RBP: c90009007f98 R08:  R09: 0010
[141812.036790] R10: 02e8 R11: 0293 R12: 8110cd90
[141812.036790] R13: c90009007f78 R14:  R15:
[141812.036790]  ? time_hardirqs_off+0x9/0x14
[141812.036790]  ? trace_hardirqs_off_caller+0x1f/0xa3
[141812.036790] Code: c7 d6 61 6b a0 48 89 e5 e8 ba ef a8 e0 0f 0b 55 89 f1 48 c7 c2 6d 65 6b a0 48 89 fe 48 c7 c7 81 65 6b a0 48 89 e5 e8 9c ef a8 e0 <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89
[141812.036790] RIP: assfail.constprop.18+0x1c/0x1e [btrfs] RSP: c90009007bc0
[141812.084448] ---[ end trace 44e472684c7a32cc ]---

This happens because the code that logs a trailing hole when the no-holes
feature is enabled did not consider that a compressed inline extent can
represent a range with a size matching the sector size. In that case,
expanding the inode's i_size through a truncate operation does not pad
the page that represents the inline extent with zeroes, and therefore
the inline extent remains after the truncation.

Fix this by adapting the assertion to accept inline extents representing
data with a sector size length if, and only if, the inline extents are
compressed.
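
To make the adapted condition concrete, here is a minimal userspace sketch
of the check. It is not the hunk from this patch: the struct and the
function name trailing_hole_inline_ok are made up for illustration, and
only the logic (accept len == i_size, or a compressed inline extent whose
length equals the sector size) follows the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified stand-in for the properties the check looks
 * at. This is NOT a kernel data structure.
 */
struct inline_extent_info {
	uint64_t len;     /* length of the data the inline extent represents */
	bool compressed;  /* true if the inline extent is compressed */
};

/*
 * Returns true if an inline extent of this shape should be accepted when
 * logging a trailing hole for an inode whose size is i_size.
 */
bool trailing_hole_inline_ok(const struct inline_extent_info *ext,
			     uint64_t i_size, uint64_t sectorsize)
{
	assert(ext != NULL);

	/* Old condition: the inline extent covers the whole file. */
	if (ext->len == i_size)
		return true;

	/* Relaxed condition: compressed inline extent spanning one sector. */
	return ext->compressed && ext->len == sectorsize;
}
```

With the reproducer below (4K written, then truncated to 32K), the extent
has len 4096 and i_size is 32768, so only the second branch can accept it,
and only when the extent is compressed.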

A sample and trivial reproducer (for systems with a 4K page size) for this
issue:

  mkfs.btrfs -O no-holes -f /dev/sdc
  mount -o compress /dev/sdc /mnt
  xfs_io -f -c "pwrite -S 0xab 0 4K" /mnt/foobar
  sync
  xfs_io -c "truncate 32K" /mnt/foobar
  xfs_io -c "fsync" /mnt/foobar

Signed-off-by: Filipe Manana 
---
 fs/btrfs/tree-log.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 3a11ae63676e..c02654cf4c8b 100644
--- a/fs/btrfs/tree-log.c
+++ 

Btrfs progs release 4.12

2017-07-28 Thread David Sterba
Hi,

btrfs-progs version 4.12 has been released. Although it's a major version
number update, there are no major changes besides the usual bugfixes and
enhancements.

Per user request, the tarball now contains the generated manual pages, as the
build dependencies for documentation are not lightweight. If you configure with
--disable-documentation, the generated *.gz are not touched and need to be
manually copied to the destination path ($prefix/share/man/man[58]).

Changes:
  * subvol show: new options --rootid, --uuid to show subvol by the given spec
  * convert: progress report fixes, found by tsan
  * image: progress report fixes, found by tsan
  * fix infinite looping in find-root, or when looking for free extents
  * other:
    * code refactoring
    * docs updates
    * build: ThreadSanitizer support
    * tests: stricter checks for mounted filesystem

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

Adam Buchbinder (7):
  btrfs-progs: convert: Fix data race when reporting progress
  btrfs-progs: image: Fix data races when reporting progress
  btrfs-progs: image: fix typos in messages
  btrfs-progs: tests: Fix missing internal deps in check and misc tests
  btrfs-progs: Tighten integer types in print-tree
  btrfs-progs: build: Enable ThreadSanitizer, using D=tsan
  btrfs-progs: tests: Use '-t btrfs' mount option in tests

Anand Jain (2):
  btrfs-progs: subvol show: fix the path use full_path as provided by the root info
  btrfs-progs: subvol show: add support to search subvolume by rootid or uuid

David Sterba (8):
  btrfs-progs: docs: document conventions
  btrfs-progs: docs: move deprecated mount option to own section
  btrfs-progs: docs: enhance documentation of 'btrfs device ready'
  btrfs-progs: docs: adjust wording for subvol delete
  btrfs-progs: tests: enhance API to request type of the converted filesystem
  btrfs-progs: tests: use separate helper for mounting convert filesystems
  btrfs-progs: docs: update wording for compression mount options
  btrfs-progs: update CHANGES for v4.12

Justin Maggard (1):
  btrfs-progs: Fix an infinite loop in btrfs_next_bg

Liu Bo (1):
  Btrfs-progs: fix infinite loop in find_free_extent

Philipp Hahn (1):
  btrfs-progs: Fix slot >= nritems

Qu Wenruo (61):
  btrfs-progs: Cleanup open-coded btrfs_chunk_item_size
  btrfs-progs: Remove deprecated leafsize usage
  btrfs-progs: Introduce sectorsize nodesize and stripesize members for btrfs_fs_info
  btrfs-progs: Refactor block sizes users in disk-io.c
  btrfs-progs: Refactor block sizes users in btrfs-corrupt-block.c
  btrfs-progs: Refactor block sizes users in ctree.c and ctree.h
  btrfs-progs: Refactor block sizes users in btrfs-map-logical.c
  btrfs-progs: Refactor block sizes users in chunk-recover.c
  btrfs-progs: Refactor block sizes users in backref.c
  btrfs-progs: Refactor block sizes users in cmds-restore.c
  btrfs-progs: Refactor nodesize user in extent_io.c
  btrfs-progs: Refactor nodesize users in image/main.c
  btrfs-progs: Refactor block sizes users in cmds-check.c
  btrfs-progs: Refactor nodesize user in btrfstune.c
  btrfs-progs: Refactor nodesize users in utils.c
  btrfs-progs: Refactor block sizes users in extent-tree.c
  btrfs-progs: Refactor nodesize user in print-tree.c
  btrfs-progs: Refactor nodesize users in qgroup-verify.c
  btrfs-progs: Refactor nodesize users in cmds-inspect-tree-stats.c
  btrfs-progs: Refactor sectorsize users in mkfs/main.c
  btrfs-progs: Refactor sectorsizes users in file-item.c
  btrfs-progs: Refactor sectorsize users in free-space-cache.c
  btrfs-progs: Refactor sectorsize users in file.c
  btrfs-progs: Refactor sectorsize users in volumes.c
  btrfs-progs: Refactor sectorsize users in free-space-tree.c
  btrfs-progs: Refactor sectorsize in convert/source-fs.c
  btrfs-progs: Refactor sectorsize users in convert/main.c
  btrfs-progs: Refactor sectorsize users in convert/source-ext2.c
  btrfs-progs: Refactor sectorsize users in cmds-inspect-dump-tree.c
  btrfs-progs: Remove block size members in btrfs_root
  btrfs-progs: Refactor btrfs_root paramters in btrfs-corrupt-block.c
  btrfs-progs: Refactor read_tree_block to get rid of btrfs_root
  btrfs-progs: Refactor read_node_slot function to get rid of btrfs_root parameter
  btrfs-progs: raid56: Introduce raid56 header for later recovery usage
  btrfs-progs: raid56: Introduce tables for RAID6 recovery
  btrfs-progs: raid56: Allow raid6 to recover 2 data stripes
  btrfs-progs: raid56: Allow raid6 to recover data and P
  btrfs-progs: Introduce wrapper to recover raid56 data
  btrfs-progs: Enhance chunk item validation check
  btrfs-progs: check: Reuse btrfs_check_chunk_valid in 

Re: [PATCH] btrfs-progs: eliminate bogus IOC_DEV_INFO call

2017-07-28 Thread Henk Slager
On Thu, Jul 27, 2017 at 9:24 PM, Hans van Kranenburg wrote:
> Device ID numbers always start at 1, not at 0. The first IOC_DEV_INFO
> call does not make sense, since it will always return ENODEV.

When a btrfs replace is ongoing, there is a device with ID 0.
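
A rough userspace sketch of why the starting id matters. Nothing here is
btrfs-progs code: count_devices, probe_fn and probe_bitmask are
hypothetical names, and the probe only mimics an ioctl that returns
-ENODEV for ids that do not exist.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical probe: returns 0 if devid exists, -ENODEV if it does not. */
typedef int (*probe_fn)(uint64_t devid, void *ctx);

/*
 * Count the devices found by probing every id in first_id..max_id
 * (inclusive). Missing ids are skipped; any other error is returned.
 */
int count_devices(uint64_t first_id, uint64_t max_id, probe_fn probe, void *ctx)
{
	int ndevs = 0;

	assert(probe != NULL);
	if (first_id > max_id)
		return 0;

	for (uint64_t id = first_id; ; id++) {
		int ret = probe(id, ctx);

		if (ret == 0)
			ndevs++;
		else if (ret != -ENODEV)
			return ret;

		/* Compare before incrementing so max_id == UINT64_MAX is safe. */
		if (id == max_id)
			break;
	}
	return ndevs;
}

/* Example probe backed by a bitmask of present ids (ids 0..63 only). */
int probe_bitmask(uint64_t devid, void *ctx)
{
	const uint64_t *present = ctx;

	if (devid < 64 && (*present & (UINT64_C(1) << devid)))
		return 0;
	return -ENODEV;
}
```

With ids 0, 1 and 2 present (a replace in progress on a two-device
filesystem), starting the scan at 0 finds three devices, while starting
at 1 finds only two and silently misses the replace target.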

> ioctl(3, BTRFS_IOC_DEV_INFO, {devid=0}) = -1 ENODEV (No such device)
>
> Signed-off-by: Hans van Kranenburg 
> ---
>  cmds-fi-usage.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/cmds-fi-usage.c b/cmds-fi-usage.c
> index 101a0c4..52c4c62 100644
> --- a/cmds-fi-usage.c
> +++ b/cmds-fi-usage.c
> @@ -535,7 +535,7 @@ static int load_device_info(int fd, struct device_info **device_info_ptr,
>                 return 1;
>         }
>
> -       for (i = 0, ndevs = 0 ; i <= fi_args.max_id ; i++) {
> +       for (i = 1, ndevs = 0 ; i <= fi_args.max_id ; i++) {
>                 if (ndevs >= fi_args.num_devices) {
>                         error("unexpected number of devices: %d >= %llu", ndevs,
>                                 (unsigned long long)fi_args.num_devices);
> --
> 2.11.0
>


[PATCH] btrfs: Remove redundant setting of uuid in btrfs_block_header.

2017-07-28 Thread Nikolay Borisov
btrfs_alloc_dev_extent currently unconditionally sets the uuid in the leaf block
header the function is working with. This is unnecessary since this operation
is performed by the core btree handling code (splitting a node, allocating a new
btree block etc). So let's remove it.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/volumes.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5a1913956f20..84501e9d486c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1611,8 +1611,6 @@ static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
                                        BTRFS_FIRST_CHUNK_TREE_OBJECTID);
        btrfs_set_dev_extent_chunk_offset(leaf, extent, chunk_offset);
 
-       write_extent_buffer_chunk_tree_uuid(leaf, fs_info->chunk_tree_uuid);
-
        btrfs_set_dev_extent_length(leaf, extent, num_bytes);
        btrfs_mark_buffer_dirty(leaf);
 out:
-- 
2.7.4



Re: [PATCH 2/2] btrfs: Simplify btrfs_alloc_dev_extent

2017-07-28 Thread Filipe Manana
On Fri, Jul 28, 2017 at 6:59 AM, Nikolay Borisov wrote:
>
>
> On 27.07.2017 20:57, Filipe Manana wrote:
>> On Thu, Jul 27, 2017 at 6:36 PM, Nikolay Borisov wrote:
>>> Currently btrfs_alloc_dev_extent essentially open codes btrfs_insert_item.
>>> So let's remove the superfluous code, leaving only the important bits,
>>> namely initialising the device extent and just calling btrfs_insert_item.
>>> So first add definition for the stack-based set/get function. And then
>>> use them.
>>> Additionally, remove the code which sets the uuid of the block header, since
>>> this is something which is already handled by the core btree code.
>>
>> Quite honestly, I don't see the value of this change at all.
>> It doesn't make things simpler or more readable.
>> We have many, really many places using btrfs_insert_empty_item()
>> instead of calling btrfs_insert_item(), are you planning on sending a
>> patch to do the replacement for each of them? What's the point?
>
> I beg to differ. Some of the code in btrfs is a mess, it's working
> but it's messy. There is constant violation of abstractions (as is the
> case in this function, heck the uuid setting of the block header
> function doesn't even belong here).

The uuid setting is a different thing (and that's fine to go away),
unrelated to using insert_empty_item() vs insert_item(), which is what
I was referring to in my previous reply.

> All of this hampers reading
> comprehension of the code for newcomers. You are experienced in the code
> and likely this doesn't apply to you but since I'm someone relatively
> new to the code this has been my experience. And I believe any effort to
> actually simplify the code, make it more coherent and succinct is a win
> long-term.

Well, this hasn't prevented me, or others who started
contributing to btrfs after I did, from being able to understand the
code and make useful changes (otherwise this kind of patch would have
landed a long time ago). This kind of change won't save anyone time
understanding the code.

Plus, if I want to nitpick a bit more, this change to use
btrfs_insert_item() is worse from a performance/efficiency point of
view, as it requires an additional memory allocation/free (the device
extent).

>
> I will wait for other feedback, if people feel patches like that are
> just bikeshedding then I will refrain from such cleanups in the future.
>
>>
>> Plus you are introducing now a memory leak. See below.
>
> Will fix it.
>
>>
>>>
>>> Signed-off-by: Nikolay Borisov 
>>> ---
>>>  fs/btrfs/ctree.h   |  8 ++++++++
>>>  fs/btrfs/volumes.c | 34 ++++++++++++----------------------
>>>  2 files changed, 20 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index cd9497bcdb1e..567fbf186257 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -1740,6 +1740,14 @@ BTRFS_SETGET_FUNCS(dev_extent_chunk_objectid, struct btrfs_dev_extent,
>>>  BTRFS_SETGET_FUNCS(dev_extent_chunk_offset, struct btrfs_dev_extent,
>>>                    chunk_offset, 64);
>>>  BTRFS_SETGET_FUNCS(dev_extent_length, struct btrfs_dev_extent, length, 64);
>>> +BTRFS_SETGET_STACK_FUNCS(stack_dev_extent_chunk_tree, struct btrfs_dev_extent,
>>> +                        chunk_tree, 64);
>>> +BTRFS_SETGET_STACK_FUNCS(stack_dev_extent_chunk_objectid,
>>> +                        struct btrfs_dev_extent, chunk_objectid, 64);
>>> +BTRFS_SETGET_STACK_FUNCS(stack_dev_extent_chunk_offset, struct btrfs_dev_extent,
>>> +                        chunk_offset, 64);
>>> +BTRFS_SETGET_STACK_FUNCS(stack_dev_extent_length, struct btrfs_dev_extent,
>>> +                        length, 64);
>>>
>>>  static inline unsigned long btrfs_dev_extent_chunk_tree_uuid(struct btrfs_dev_extent *dev)
>>>  {
>>>  {
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index 5a1913956f20..94e98261dbd0 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -1581,42 +1581,32 @@ static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
>>>                                   u64 chunk_offset, u64 start, u64 num_bytes)
>>>  {
>>> int ret;
>>> -   struct btrfs_path *path;
>>> -   struct btrfs_fs_info *fs_info = device->fs_info;
>>> -   struct btrfs_root *root = fs_info->dev_root;
>>> +   struct btrfs_root *root = device->fs_info->dev_root;
>>> struct btrfs_dev_extent *extent;
>>> -   struct extent_buffer *leaf;
>>> struct btrfs_key key;
>>>
>>> WARN_ON(!device->in_fs_metadata);
>>> WARN_ON(device->is_tgtdev_for_dev_replace);
>>> -   path = btrfs_alloc_path();
>>> -   if (!path)
>>> +
>>> +   extent = kzalloc(sizeof(*extent), GFP_NOFS);
>>> +   if (!extent)
>>> return -ENOMEM;
>>>
>>> key.objectid = device->devid;
>>> key.offset = start;
>>> key.type = BTRFS_DEV_EXTENT_KEY;
>>> -   

[PATCH v3] btrfs: Do not use data_alloc_cluster in ssd mode

2017-07-28 Thread Hans van Kranenburg
This patch provides a band aid to improve the 'out of the box'
behaviour of btrfs for disks that are detected as being an ssd.  In a
general purpose mixed workload scenario, the current ssd mode causes
overallocation of available raw disk space for data, while leaving
behind increasing amounts of unused fragmented free space. This
situation leads to early ENOSPC problems which are harming user
experience and adoption of btrfs as a general purpose filesystem.

This patch modifies the data extent allocation behaviour of the ssd mode
to make it behave identical to nossd mode.  The metadata behaviour and
additional ssd_spread option stay untouched so far.

Recommendations for future development are to reconsider the current
oversimplified nossd / ssd distinction and the broken detection
mechanism based on the rotational attribute in sysfs and provide
experienced users with a more flexible way to choose allocator behaviour
for data and metadata, optimized for certain use cases, while keeping
sane 'out of the box' default settings.  The internals of the current
btrfs code have more potential than what currently gets exposed to the
user to choose from.

The SSD story...

In the first year of btrfs development, around early 2008, btrfs
gained a mount option which enables specific functionality for
filesystems on solid state devices. The first occurrence of this
functionality is in commit e18e4809, labeled "Add mount -o ssd, which
includes optimizations for seek free storage".

The effect on allocating free space for doing (data) writes is to
'cluster' writes together, writing them out in contiguous space, as
opposed to a 'tetris' way of putting all separate writes into any free
space fragment that fits (which is what the -o nossd behaviour does).

A somewhat simplified explanation of what happens is that when, for
example, the 'cluster' size is set to 2MiB and we do some writes, the
data allocator will search for a free space block that is 2MiB big and
put the writes in there. The ssd mode itself might allow a 2MiB cluster
to be composed of multiple free space extents with some existing data in
between, while the additional ssd_spread mount option removes this
possibility and requires fully free space.
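
As a toy illustration of the difference (this is not the btrfs allocator;
struct free_extent, alloc_tetris and alloc_clustered are invented for this
sketch), consider a list of free space fragments and a small write. The
clustered variant below models only the strict case where one fragment
must be at least as big as the cluster, which is closest to what
ssd_spread demands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One fragment of free space: hypothetical, simplified representation. */
struct free_extent {
	uint64_t start;
	uint64_t len;
};

/* Index of the first fragment with len >= min_len, or -1 if none fits. */
int find_first_fit(const struct free_extent *free_list, size_t n,
		   uint64_t min_len)
{
	assert(free_list != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (free_list[i].len >= min_len)
			return (int)i;
	}
	return -1;
}

/* 'tetris' (nossd-like): any fragment that fits the write will do. */
int alloc_tetris(const struct free_extent *free_list, size_t n,
		 uint64_t write_len)
{
	return find_first_fit(free_list, n, write_len);
}

/*
 * 'cluster' (strict variant): only a fragment at least cluster_len long
 * is acceptable, no matter how small the write is. A result of -1 here
 * stands for "give up on existing space and allocate a new chunk".
 */
int alloc_clustered(const struct free_extent *free_list, size_t n,
		    uint64_t write_len, uint64_t cluster_len)
{
	uint64_t need = write_len > cluster_len ? write_len : cluster_len;

	return find_first_fit(free_list, n, need);
}
```

With free fragments of 64KiB and 512KiB and a 2MiB cluster size, a 32KiB
write fits the first fragment in tetris mode, while the clustered variant
finds nothing and would ask for a new chunk, leaving both fragments behind.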

The idea behind this is (commit 536ac8ae): "The [...] clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
block. So, effectively this means applying a "locality based algorithm"
and trying to outsmart the actual ssd.

Since then, various changes have been made to the involved code, but the
basic idea is still present, and gets activated whenever the ssd mount
option is active. This also happens by default, when the rotational flag
as seen at /sys/block/<device>/queue/rotational is set to 0.
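
For reference, a small userspace sketch of reading that flag. The helper
names parse_rotational and read_rotational are made up here, and the
sketch assumes the sysfs file holds a single '0' or '1' followed by a
newline; it says nothing about how the kernel itself consults the flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Parse the content of a queue/rotational file.
 * Returns 0 (non-rotational), 1 (rotational) or -1 if the text is not
 * a single '0' or '1' optionally followed by a newline.
 */
int parse_rotational(const char *text)
{
	assert(text != NULL);

	if (text[0] != '0' && text[0] != '1')
		return -1;
	if (text[1] != '\0' && text[1] != '\n')
		return -1;
	return text[0] - '0';
}

/*
 * Read and parse the flag from the given path, for example
 * /sys/block/sda/queue/rotational. Returns -1 if the file cannot be read.
 */
int read_rotational(const char *path)
{
	char buf[8];
	char *line;
	FILE *f;

	assert(path != NULL);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	line = fgets(buf, sizeof(buf), f);
	fclose(f);
	return line != NULL ? parse_rotational(buf) : -1;
}
```

A result of 0 is the case in which the ssd mount option gets enabled by
default, as described above.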

However, there are a number of problems with this approach.

First, what the optimization is trying to do is outsmart the ssd by
assuming there is a relation between the physical address space of the
block device as seen by btrfs and the actual physical storage of the
ssd, and then adjusting data placement. However, since the introduction
of the Flash Translation Layer (FTL) which is a part of the internal
controller of an ssd, these attempts are futile. The use of good quality
FTL in consumer ssd products might have been limited in 2008, but this
situation has changed drastically soon after that time. Today, even the
flash memory in your automatic cat feeding machine or your grandma's
wheelchair has a full featured one.

Second, the behaviour as described above results in the filesystem being
filled up with badly fragmented free space extents because of relatively
small pieces of space that are freed up by deletes, but not selected
again as part of a 'cluster'. Since the algorithm prefers allocating a
new chunk over going back to tetris mode, the end result is a filesystem
in which all raw space is allocated, but which is composed of
underutilized chunks with a 'shotgun blast' pattern of fragmented free
space. Usually, the next problematic thing that happens is the
filesystem wanting to allocate new space for metadata, which causes the
filesystem to fail in spectacular ways.

Third, the default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with spreading out writes over
the full address space and ignoring freed up space leads to worst case
behaviour in providing information to the ssd itself, since it will
never learn that all the free space left behind is actually free.  There
are two ways to let an ssd know previously written data does not have to
be preserved, which are sending explicit signals using discard or
fstrim, or by simply overwriting the space with new data.  The worst
case behaviour is the btrfs ssd_spread mount option in combination with
not having discard enabled. It has a side effect of minimizing the reuse
of free space previously written in.