Re: [RFC v2 05/10] vfs: introduce one hash table

2012-09-27 Thread Zhi Yong Wu
On Thu, Sep 27, 2012 at 11:43 AM, Dave Chinner da...@fromorbit.com wrote:
 On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com

   Add a hash table structure that contains many
 hash lists and is used to efficiently look up
 the data temperature of a file or its ranges.
   In each hash list of the hash table, a hash
 node keeps track of temperature info.

 So, let me see if I've got the relationship straight:

 - sb->s_hot_info.hot_inode_tree indexes hot_inode_items, one per inode

 - hot_inode_item contains access frequency data for that inode

 - hot_inode_item holds a heat hash node to index the access
   frequency data for that inode

 - hot_inode_item.hot_range_tree indexes hot_range_items for that inode

 - hot_range_item contains access frequency data for that range

 - hot_range_item holds a heat hash node to index the access
   frequency data for that range

 - sb->s_hot_info.heat_inode_hl indexes per-inode heat hash nodes

 - sb->s_hot_info.heat_range_hl indexes per-range heat hash nodes
Correct.

 How about some ascii art? :) Just looking at the hot inode item case
 (the range item case is the same pattern, though), we have:


 heat_inode_hl                hot_inode_tree
   |                                 |
   |                                 V
   |                 +--------hot_inode_item---------+
   +---+             |        frequency data         |
       |             V               ^               V
       | ...<--hot_inode_item-->...  |   ...<--hot_inode_item-->
       |       frequency data        |         frequency data
       |             ^               |               ^
       |             |               |               |
       |             |               |               |
       +------>hot_hash_node-->hot_hash_node-->hot_hash_node-->
Great, can we put them in hot_tracking.txt in Documentation?


 There's no actual data stored in the hot_hash_node, just a pointer
 back to the frequency data, a hlist_node and a pointer to the
 hashlist head. IOWs, I agree with Ram that this does not need to
 exist and just embedding a hlist_node inside the hot_inode_item is
 all that is needed. i.e:

 heat_inode_hl                hot_inode_tree
   |                                 |
   |                                 V
   |                 +--------hot_inode_item-----------+
   |                 |        frequency data           |
   +---+             |          hlist_node             |
       |             V              ^   |              V
       | ...<--hot_inode_item-->... |   |  ...<--hot_inode_item-->
       |       frequency data       |   |        frequency data
       +-------->hlist_node---------+   +--------->hlist_node--->.

 There's no need for separate allocations, initialisations, locks and
 reference counting - all that is already in the hot_inode_item. The
 items have the same lifecycle limitations - a hot_hash_node must be
 torn down before the frequency data it points to is freed. Finally,
 there's no difference in how you move it between lists.
How will you know if one hot_inode_item should be moved between lists
when its freq data is changed?

 Indeed, calling it a hash is wrong - there's no hashing at all
 - it's keeping an array of lists where each entry corresponds to a
 specific temperature. It is a *heat map*, not a hash list. i.e.
 inode_heat_map, not heat_inode_hl. HEAT_MAP_SIZE, not HASH_SIZE.
OK.

 As it is, there aren't any users of the heat maps that are generated
 in this patch set - it's not even exported to userspace or to
 debugfs, so I'm not sure how it will be used yet. How are these heat
 maps going to be used by filesystems, Zhi?
In hot_hash_calc_temperature(), you can see that the freq data of one
hot_inode or hot_range is distilled into one temperature value, and
the item is then inserted into the heat map based on that temperature.
When the file corresponding to the inode or range gets hotter or colder,
its location in the heat map is changed based on its new temperature
in hot_hash_update_hash_table().

And the user will retrieve the freq data and temperature info via
debugfs or ioctl interfaces.

 Cheers,

 Dave.
 --
 Dave Chinner
 da...@fromorbit.com



-- 
Regards,

Zhi Yong Wu
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v2 06/10] vfs: enable hot data tracking

2012-09-27 Thread Zhi Yong Wu
On Thu, Sep 27, 2012 at 11:54 AM, Dave Chinner da...@fromorbit.com wrote:
 On Sun, Sep 23, 2012 at 08:56:31PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com

   Miscellaneous features that implement hot data tracking
 and generally make the hot data functions a bit more friendly.

 Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
 ---
  fs/direct-io.c   |   10 ++
  include/linux/hot_tracking.h |   11 +++
  mm/filemap.c |8 
  mm/page-writeback.c  |   21 +
  mm/readahead.c   |9 +
  5 files changed, 59 insertions(+), 0 deletions(-)

 diff --git a/fs/direct-io.c b/fs/direct-io.c
 index f86c720..3773f44 100644
 --- a/fs/direct-io.c
 +++ b/fs/direct-io.c
 @@ -37,6 +37,7 @@
  #include <linux/uio.h>
  #include <linux/atomic.h>
  #include <linux/prefetch.h>
 +#include "hot_tracking.h"

  /*
   * How many user pages to map in one call to get_user_pages().  This determines
 @@ -1297,6 +1298,15 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
  	prefetch(bdev->bd_queue);
  	prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);

 +	/* Hot data tracking */
 +	if (TRACK_THIS_INODE(iocb->ki_filp->f_mapping->host) &&
 +			iov_length(iov, nr_segs) > 0) {
 +		hot_rb_update_freqs(iocb->ki_filp->f_mapping->host,
 +				(u64)offset,
 +				(u64)iov_length(iov, nr_segs),
 +				rw & WRITE);
 +	}

 That's a bit messy. I'd prefer a static inline function that hides
 all this. e.g.
What do you think of moving the condition into hot_inode_udate_freqs()
instead of adding another new function?

 track_hot_inode_ranges(inode, offset, length, rw)
 {
 	if (inode->i_sb->s_flags & MS_HOT_TRACKING)
 		hot_inode_freq_update(inode, offset, length, rw);
 }

 diff --git a/mm/page-writeback.c b/mm/page-writeback.c
 index 5ad5ce2..552c861 100644
 --- a/mm/page-writeback.c
 +++ b/mm/page-writeback.c
 @@ -35,6 +35,7 @@
  #include <linux/buffer_head.h> /* __set_page_dirty_buffers */
  #include <linux/pagevec.h>
  #include <linux/timer.h>
 +#include <linux/hot_tracking.h>
  #include <trace/events/writeback.h>

  /*
 @@ -1895,13 +1896,33 @@ EXPORT_SYMBOL(generic_writepages);
  int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
  {
  	int ret;
 +	pgoff_t start = 0;
 +	u64 prev_count = 0, count = 0;

  	if (wbc->nr_to_write <= 0)
  		return 0;
 +
 +	/* Hot data tracking */
 +	if (TRACK_THIS_INODE(mapping->host) &&
 +			wbc->range_cyclic) {
 +		start = mapping->writeback_index << PAGE_CACHE_SHIFT;
 +		prev_count = (u64)wbc->nr_to_write;
 +	}

 Why only wbc->range_cyclic? This won't record things like
 synchronous writes or fsync-triggered writes, which are far more
 likely to be to hot ranges in a file...
Sorry, I don't understand what wbc->range_cyclic means. OK, I will fix
it in the next version.


 Cheers,

 Dave.
 --
 Dave Chinner
 da...@fromorbit.com



-- 
Regards,

Zhi Yong Wu


Re: [RFC v2 07/10] vfs: fork one kthread to update data temperature

2012-09-27 Thread Zhi Yong Wu
On Thu, Sep 27, 2012 at 12:03 PM, Dave Chinner da...@fromorbit.com wrote:
 On Sun, Sep 23, 2012 at 08:56:32PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com

   Fork and run one kernel kthread to calculate
 that temperature based on some metrics kept
 in custom frequency data structs, and store
 the info in the hash table.

 No new kthreads, please. Use a per-superblock workqueue and a struct
 delayed_work to run periodic work on each superblock.
If no new kthread is created, which kthread will work on these
delayed_work tasks?


 That will also remove all the nasty, nasty
 !hot_track_temperature_update_kthread checks from the code, too.

 Also, I'd separate the work that the workqueue does from the patch
 that introduces the work queue. That way there is only one new thing
 to comment on in the patch. Further, I'd separate the aging code
 from the code that updates the temperature map into its own patch
 as well.

 Finally, you're going to need a shrinker to control the amount of
 memory that is used in tracking hot regions - if we are throwing
 inodes out of memory due to memory pressure, we most definitely are
 going to need to reduce the amount of memory the tracking code is
 using, even if it means losing useful information (i.e. the shrinker
 accelerates the aging process).
Great, I agree with you.

 Given the above, and the other comments earlier in the series,
 there's not a lot of point in me spending time commenting on the
 code in detail here as it will change significantly as a result of
 all the earlier comments.
OK, I will complete the code changes based on all your earlier comments ASAP.

 Cheers,

 Dave.
 --
 Dave Chinner
 da...@fromorbit.com



-- 
Regards,

Zhi Yong Wu


Re: [RFC v2 05/10] vfs: introduce one hash table

2012-09-27 Thread Dave Chinner
On Thu, Sep 27, 2012 at 02:23:16PM +0800, Zhi Yong Wu wrote:
 On Thu, Sep 27, 2012 at 11:43 AM, Dave Chinner da...@fromorbit.com wrote:
  On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.ker...@gmail.com wrote:
  From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
Adds a hash table structure which contains
  a lot of hash list and is used to efficiently
  look up the data temperature of a file or its
  ranges.
In each hash list of hash table, the hash node
  will keep track of temperature info.
 
  So, let me see if I've got the relationship straight:
 
  - sb->s_hot_info.hot_inode_tree indexes hot_inode_items, one per inode
 
  - hot_inode_item contains access frequency data for that inode
 
  - hot_inode_item holds a heat hash node to index the access
frequency data for that inode
 
  - hot_inode_item.hot_range_tree indexes hot_range_items for that inode
 
  - hot_range_item contains access frequency data for that range
 
  - hot_range_item holds a heat hash node to index the access
frequency data for that range
 
  - sb->s_hot_info.heat_inode_hl indexes per-inode heat hash nodes
 
  - sb->s_hot_info.heat_range_hl indexes per-range heat hash nodes
 Correct.
 
  How about some ascii art? :) Just looking at the hot inode item case
  (the range item case is the same pattern, though), we have:
 
 
  heat_inode_hl                hot_inode_tree
    |                                 |
    |                                 V
    |                 +--------hot_inode_item---------+
    +---+             |        frequency data         |
        |             V               ^               V
        | ...<--hot_inode_item-->...  |   ...<--hot_inode_item-->
        |       frequency data        |         frequency data
        |             ^               |               ^
        |             |               |               |
        |             |               |               |
        +------>hot_hash_node-->hot_hash_node-->hot_hash_node-->
 Great, can we put them in hot_tracking.txt in Documentation?
 
 
  There's no actual data stored in the hot_hash_node, just a pointer
  back to the frequency data, a hlist_node and a pointer to the
  hashlist head. IOWs, I agree with Ram that this does not need to
  exist and just embedding a hlist_node inside the hot_inode_item is
  all that is needed. i.e:
 
  heat_inode_hl                hot_inode_tree
    |                                 |
    |                                 V
    |                 +--------hot_inode_item-----------+
    |                 |        frequency data           |
    +---+             |          hlist_node             |
        |             V              ^   |              V
        | ...<--hot_inode_item-->... |   |  ...<--hot_inode_item-->
        |       frequency data       |   |        frequency data
        +-------->hlist_node---------+   +--------->hlist_node--->.
 
  There's no need for separate allocations, initialisations, locks and
  reference counting - all that is already in the hot_inode_item. The
  items have the same lifecycle limitations - a hot_hash_node must be
  torn down before the frequency data it points to is freed. Finally,
  there's no difference in how you move it between lists.
 How will you know if one hot_inode_item should be moved between lists
 when its freq data is changed?

Record the current temperature in the frequency data, and if it
changes, change the list it is on.
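
As a rough userspace sketch of that idea (not the patch's code): the item
carries its own list linkage and remembers the temperature it is currently
filed under, so an update only relinks the item when the temperature really
changes. The names struct hot_item, struct heat_node and heat_map_update()
are hypothetical, the bucket count is an assumption, and the hand-rolled
list merely stands in for the kernel's hlist.

```c
#include <assert.h>
#include <stddef.h>

#define HEAT_MAP_SIZE 256	/* assumed: one bucket per temperature value */

/* Minimal list node, a userspace stand-in for the kernel's hlist_node. */
struct heat_node {
	struct heat_node *next, **pprev;
};

struct heat_bucket {
	struct heat_node *first;
};

/* Hypothetical item: frequency data and list linkage live in one object. */
struct hot_item {
	unsigned int nr_reads, nr_writes;	/* stand-in frequency data */
	unsigned char temp;			/* temperature it is filed under */
	struct heat_node node;			/* embedded, no extra allocation */
};

static void heat_node_del(struct heat_node *n)
{
	if (!n->pprev)
		return;			/* not on any list yet */
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	n->next = NULL;
	n->pprev = NULL;
}

static void heat_node_add(struct heat_bucket *b, struct heat_node *n)
{
	n->next = b->first;
	if (b->first)
		b->first->pprev = &n->next;
	b->first = n;
	n->pprev = &b->first;
}

/* File the item under new_temp; returns 1 if it was (re)linked, else 0. */
static int heat_map_update(struct heat_bucket *map, struct hot_item *item,
			   unsigned char new_temp)
{
	assert(map && item);
	if (item->node.pprev && item->temp == new_temp)
		return 0;		/* same bucket, nothing to do */
	heat_node_del(&item->node);
	item->temp = new_temp;
	heat_node_add(&map[new_temp], &item->node);
	return 1;
}
```

Because the node is embedded, tearing down the item and unlinking it from
the heat map are the same lifecycle event, which is the point made above
about not needing a separate hot_hash_node.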

  Indeed, calling it a hash is wrong - there's no hashing at all
  - it's keeping an array of lists where each entry corresponds to a
  specific temperature. It is a *heat map*, not a hash list. i.e.
  inode_heat_map, not heat_inode_hl. HEAT_MAP_SIZE, not HASH_SIZE.
 OK.
 
  As it is, there aren't any users of the heat maps that are generated
  in this patch set - it's not even exported to userspace or to
  debugfs, so I'm not sure how it will be used yet. How are these heat
  maps going to be used by filesystems, Zhi?
 In hot_hash_calc_temperature(), you can see that one hot_inode or
 hot_range's freq data will be distilled into one temperature value,
 then it will be inserted to the heat map based on its temperature.
 When the file corresponding to the inode or range got hotter or cold,
 its location will be changed in the heat map based on its new
 temperature in hot_hash_update_hash_table().

Yes, but a hot_inode_item or hot_range_item can only have one
location in the heat map, right? So it doesn't need an external
structure pointing to the frequency data to track this.

 And the user will retrieve those freq data and temperature info via
 debugfs or ioctl interfaces.

Right - but that data is only extracted after an initial
hot_inode_tree lookup - The heat map itself is never directly used
for lookups. If it's not used for lookups based on temperature, why
is it needed?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC v2 06/10] vfs: enable hot data tracking

2012-09-27 Thread Dave Chinner
On Thu, Sep 27, 2012 at 02:28:12PM +0800, Zhi Yong Wu wrote:
 On Thu, Sep 27, 2012 at 11:54 AM, Dave Chinner da...@fromorbit.com wrote:
  On Sun, Sep 23, 2012 at 08:56:31PM +0800, zwu.ker...@gmail.com wrote:
  From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
Miscellaneous features that implement hot data tracking
  and generally make the hot data functions a bit more friendly.
 
  Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
  ---
   fs/direct-io.c   |   10 ++
   include/linux/hot_tracking.h |   11 +++
   mm/filemap.c |8 
   mm/page-writeback.c  |   21 +
   mm/readahead.c   |9 +
   5 files changed, 59 insertions(+), 0 deletions(-)
 
  diff --git a/fs/direct-io.c b/fs/direct-io.c
  index f86c720..3773f44 100644
  --- a/fs/direct-io.c
  +++ b/fs/direct-io.c
  @@ -37,6 +37,7 @@
   #include <linux/uio.h>
   #include <linux/atomic.h>
   #include <linux/prefetch.h>
  +#include "hot_tracking.h"
 
   /*
    * How many user pages to map in one call to get_user_pages().  This determines
  @@ -1297,6 +1298,15 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
   	prefetch(bdev->bd_queue);
   	prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);
 
  +	/* Hot data tracking */
  +	if (TRACK_THIS_INODE(iocb->ki_filp->f_mapping->host) &&
  +			iov_length(iov, nr_segs) > 0) {
  +		hot_rb_update_freqs(iocb->ki_filp->f_mapping->host,
  +				(u64)offset,
  +				(u64)iov_length(iov, nr_segs),
  +				rw & WRITE);
  +	}
 
  That's a bit messy. I'd prefer a static inline function that hides
  all this. e.g.
 Do you think of moving the condition into hot_inode_udate_freqs(), not
 adding another new function?

Moving it into hot_inode_udate_freqs() will add function call
overhead even when tracking is not enabled. A static inline function
will result in no extra overhead other than the if statement.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC v2 07/10] vfs: fork one kthread to update data temperature

2012-09-27 Thread Dave Chinner
On Thu, Sep 27, 2012 at 02:54:22PM +0800, Zhi Yong Wu wrote:
 On Thu, Sep 27, 2012 at 12:03 PM, Dave Chinner da...@fromorbit.com wrote:
  On Sun, Sep 23, 2012 at 08:56:32PM +0800, zwu.ker...@gmail.com wrote:
  From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
Fork and run one kernel kthread to calculate
  that temperature based on some metrics kept
  in custom frequency data structs, and store
  the info in the hash table.
 
  No new kthreads, please. Use a per-superblock workqueue and a struct
  delayed_work to run periodic work on each superblock.
 If no new kthread is created, which kthread will work on these
 delayed_work tasks?

One of the kworker threads that service the workqueue
infrastructure.
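
To illustrate the model, here is a small userspace simulation (not kernel
code, and not the kernel workqueue API): a work item embedded in a
per-superblock structure is queued with a delay, a shared worker runs it
when it is due, and the work function re-arms itself to get periodic
behaviour. All names and the interval are hypothetical; in the kernel the
corresponding pieces would be a struct delayed_work set up with
INIT_DELAYED_WORK() and queued with queue_delayed_work().

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORK 8
#define UPDATE_INTERVAL 5	/* ticks between runs, arbitrary for the demo */

struct delayed_work {
	void (*fn)(struct delayed_work *work);
	unsigned long due;	/* tick at which the work should run */
	int pending;
};

static struct delayed_work *pending_work[MAX_WORK];
static unsigned long now;	/* simulated clock */

/* Queue work to run after delay ticks; returns 0 if already pending or full. */
static int queue_delayed(struct delayed_work *work, unsigned long delay)
{
	size_t i;

	if (work->pending)
		return 0;
	for (i = 0; i < MAX_WORK; i++) {
		if (!pending_work[i]) {
			work->due = now + delay;
			work->pending = 1;
			pending_work[i] = work;
			return 1;
		}
	}
	return 0;
}

/* One pass of the shared worker: advance the clock, run whatever is due. */
static void worker_tick(void)
{
	size_t i;

	now++;
	for (i = 0; i < MAX_WORK; i++) {
		struct delayed_work *work = pending_work[i];

		if (work && work->due <= now) {
			pending_work[i] = NULL;
			work->pending = 0;
			assert(work->fn != NULL);
			work->fn(work);
		}
	}
}

/* Hypothetical per-superblock state with the work item embedded in it. */
struct hot_info {
	struct delayed_work update_work;
	unsigned long nr_runs;
};

static void hot_update_fn(struct delayed_work *work)
{
	struct hot_info *info = (struct hot_info *)
		((char *)work - offsetof(struct hot_info, update_work));

	info->nr_runs++;			/* temperature update would go here */
	queue_delayed(work, UPDATE_INTERVAL);	/* re-arm for the next period */
}
```

No dedicated thread belongs to the tracking code in this model; the worker
is shared, which also removes the need for per-thread existence checks.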

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'

2012-09-27 Thread Dave Chinner
On Thu, Sep 27, 2012 at 01:25:34PM +0800, Zhi Yong Wu wrote:
 On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner da...@fromorbit.com wrote:
  On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote:
  From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
Introduce one new mount option '-o hottrack',
  and add its parsing support.
Its usage looks like:
 mount -o hottrack
 mount -o nouser,hottrack
 mount -o nouser,hottrack,loop
 mount -o hottrack,nouser
 
  I think that this option parsing should be done by the filesystem,
  even though the tracking functionality is in the VFS. That way only
  the filesystems that can use the tracking information will turn it
  on, rather than being able to turn it on for everything regardless
  of whether it is useful or not.
 
  Along those lines, just using a normal superblock flag to indicate
  it is active (e.g. MS_HOT_INODE_TRACKING in sb->s_flags) means you
  don't need to allocate the sb->s_hot_info structure just to be able
 If we don't allocate one sb->s_hot_info, where will those hash list
 heads and btree roots be located?

I wrote that thinking (mistakenly) that s_hot_info was dynamically
allocated rather than being embedded in the struct super_block.

Indeed, if the mount option is held in s_flags, then it could be
dynamically allocated, but I don't think that's really necessary...
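
The suggestion quoted above, that the filesystem parse the option itself
and record the result as a superblock flag, could look roughly like the
following userspace sketch. parse_hot_option() and the bit value of
MS_HOT_TRACKING are assumptions for illustration; a real filesystem would
use its own option-parsing tables.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MS_HOT_TRACKING (1UL << 0)	/* hypothetical flag bit */

/*
 * Scan a comma-separated mount option string and set MS_HOT_TRACKING in
 * flags when "hottrack" appears as a whole option.
 */
static unsigned long parse_hot_option(const char *opts, unsigned long flags)
{
	static const char want[] = "hottrack";
	const char *p = opts;

	assert(opts != NULL);
	while (*p) {
		const char *end = strchr(p, ',');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		if (len == sizeof(want) - 1 && memcmp(p, want, len) == 0)
			flags |= MS_HOT_TRACKING;
		if (!end)
			break;
		p = end + 1;
	}
	return flags;
}
```

Only filesystems that call such a parser would ever set the flag, so the
VFS hooks stay inert everywhere else.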

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC v2 05/10] vfs: introduce one hash table

2012-09-27 Thread Zhi Yong Wu
On Thu, Sep 27, 2012 at 2:57 PM, Dave Chinner da...@fromorbit.com wrote:
 On Thu, Sep 27, 2012 at 02:23:16PM +0800, Zhi Yong Wu wrote:
 On Thu, Sep 27, 2012 at 11:43 AM, Dave Chinner da...@fromorbit.com wrote:
  On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.ker...@gmail.com wrote:
  From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
Adds a hash table structure which contains
  a lot of hash list and is used to efficiently
  look up the data temperature of a file or its
  ranges.
In each hash list of hash table, the hash node
  will keep track of temperature info.
 
  So, let me see if I've got the relationship straight:
 
  - sb->s_hot_info.hot_inode_tree indexes hot_inode_items, one per inode
 
  - hot_inode_item contains access frequency data for that inode
 
  - hot_inode_item holds a heat hash node to index the access
frequency data for that inode
 
  - hot_inode_item.hot_range_tree indexes hot_range_items for that inode
 
  - hot_range_item contains access frequency data for that range
 
  - hot_range_item holds a heat hash node to index the access
frequency data for that range
 
  - sb->s_hot_info.heat_inode_hl indexes per-inode heat hash nodes
 
  - sb->s_hot_info.heat_range_hl indexes per-range heat hash nodes
 Correct.
 
  How about some ascii art? :) Just looking at the hot inode item case
  (the range item case is the same pattern, though), we have:
 
 
  heat_inode_hl                hot_inode_tree
    |                                 |
    |                                 V
    |                 +--------hot_inode_item---------+
    +---+             |        frequency data         |
        |             V               ^               V
        | ...<--hot_inode_item-->...  |   ...<--hot_inode_item-->
        |       frequency data        |         frequency data
        |             ^               |               ^
        |             |               |               |
        |             |               |               |
        +------>hot_hash_node-->hot_hash_node-->hot_hash_node-->
 Great, can we put them in hot_tracking.txt in Documentation?
 
 
  There's no actual data stored in the hot_hash_node, just a pointer
  back to the frequency data, a hlist_node and a pointer to the
  hashlist head. IOWs, I agree with Ram that this does not need to
  exist and just embedding a hlist_node inside the hot_inode_item is
  all that is needed. i.e:
 
  heat_inode_hl                hot_inode_tree
    |                                 |
    |                                 V
    |                 +--------hot_inode_item-----------+
    |                 |        frequency data           |
    +---+             |          hlist_node             |
        |             V              ^   |              V
        | ...<--hot_inode_item-->... |   |  ...<--hot_inode_item-->
        |       frequency data       |   |        frequency data
        +-------->hlist_node---------+   +--------->hlist_node--->.
 
  There's no need for separate allocations, initialisations, locks and
  reference counting - all that is already in the hot_inode_item. The
  items have the same lifecycle limitations - a hot_hash_node must be
  torn down before the frequency data it points to is freed. Finally,
  there's no difference in how you move it between lists.
 How will you know if one hot_inode_item should be moved between lists
 when its freq data is changed?

 Record the current temperature in the frequency data, and if it
I know how to do it, thanks.
 changes, change the list it is on.

  Indeed, calling it a hash is wrong - there's no hashing at all
  - it's keeping an array of lists where each entry corresponds to a
  specific temperature. It is a *heat map*, not a hash list. i.e.
  inode_heat_map, not heat_inode_hl. HEAT_MAP_SIZE, not HASH_SIZE.
 OK.
 
  As it is, there aren't any users of the heat maps that are generated
  in this patch set - it's not even exported to userspace or to
  debugfs, so I'm not sure how it will be used yet. How are these heat
  maps going to be used by filesystems, Zhi?
 In hot_hash_calc_temperature(), you can see that one hot_inode or
 hot_range's freq data will be distilled into one temperature value,
 then it will be inserted to the heat map based on its temperature.
 When the file corresponding to the inode or range got hotter or cold,
 its location will be changed in the heat map based on its new
 temperature in hot_hash_update_hash_table().

 Yes, but a hot_inode_item or hot_range_item can only have one
 location in the heat map, right? So it doesn't need external
Yes.
 structure to point to the frequency data to track this
OK.

 And the user will retrieve those freq data and temperature info via
 debugfs or ioctl interfaces.

 Right - but that data is only extracted after an initial
 hot_inode_tree lookup - The heat map itself is never directly used
 for lookups. If it's not used for lookups based on temperature, why
 is it needed?
Do you mean we don't need hot_inode_tree? After the hook functions
collect the freq data for an inode, they store that raw info in
hot_inode_tree. One private kthread will iterate 

Re: [RFC v2 06/10] vfs: enable hot data tracking

2012-09-27 Thread Zhi Yong Wu
On Thu, Sep 27, 2012 at 2:59 PM, Dave Chinner da...@fromorbit.com wrote:
 On Thu, Sep 27, 2012 at 02:28:12PM +0800, Zhi Yong Wu wrote:
 On Thu, Sep 27, 2012 at 11:54 AM, Dave Chinner da...@fromorbit.com wrote:
  On Sun, Sep 23, 2012 at 08:56:31PM +0800, zwu.ker...@gmail.com wrote:
  From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
Miscellaneous features that implement hot data tracking
  and generally make the hot data functions a bit more friendly.
 
  Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
  ---
   fs/direct-io.c   |   10 ++
   include/linux/hot_tracking.h |   11 +++
   mm/filemap.c |8 
   mm/page-writeback.c  |   21 +
   mm/readahead.c   |9 +
   5 files changed, 59 insertions(+), 0 deletions(-)
 
  diff --git a/fs/direct-io.c b/fs/direct-io.c
  index f86c720..3773f44 100644
  --- a/fs/direct-io.c
  +++ b/fs/direct-io.c
  @@ -37,6 +37,7 @@
   #include <linux/uio.h>
   #include <linux/atomic.h>
   #include <linux/prefetch.h>
  +#include "hot_tracking.h"
 
   /*
    * How many user pages to map in one call to get_user_pages().  This determines
  @@ -1297,6 +1298,15 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
   	prefetch(bdev->bd_queue);
   	prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);
 
  +	/* Hot data tracking */
  +	if (TRACK_THIS_INODE(iocb->ki_filp->f_mapping->host) &&
  +			iov_length(iov, nr_segs) > 0) {
  +		hot_rb_update_freqs(iocb->ki_filp->f_mapping->host,
  +				(u64)offset,
  +				(u64)iov_length(iov, nr_segs),
  +				rw & WRITE);
  +	}
 
  That's a bit messy. I'd prefer a static inline function that hides
  all this. e.g.
 Do you think of moving the condition into hot_inode_udate_freqs(), not
 adding another new function?

 Moving it into hot_inode_udate_freqs() will add a function call
 overhead even when tracking is not enabled. a static inline function
Can we not directly define hot_inode_udate_freqs to be a static inline?:)

 will just result in no extra overhead other than the if
 statement

 Cheers,

 Dave.
 --
 Dave Chinner
 da...@fromorbit.com



-- 
Regards,

Zhi Yong Wu


Re: [RFC v2 07/10] vfs: fork one kthread to update data temperature

2012-09-27 Thread Zhi Yong Wu
On Thu, Sep 27, 2012 at 3:01 PM, Dave Chinner da...@fromorbit.com wrote:
 On Thu, Sep 27, 2012 at 02:54:22PM +0800, Zhi Yong Wu wrote:
 On Thu, Sep 27, 2012 at 12:03 PM, Dave Chinner da...@fromorbit.com wrote:
  On Sun, Sep 23, 2012 at 08:56:32PM +0800, zwu.ker...@gmail.com wrote:
  From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
Fork and run one kernel kthread to calculate
  that temperature based on some metrics kept
  in custom frequency data structs, and store
  the info in the hash table.
 
  No new kthreads, please. Use a per-superblock workqueue and a struct
  delayed_work to run periodic work on each superblock.
 If no new kthread is created, which kthread will work on these
 delayed_work tasks?

 One of the kworker threads that service the workqueue
 infrastructure.
Got it, thanks

 Cheers,

 Dave.
 --
 Dave Chinner
 da...@fromorbit.com



-- 
Regards,

Zhi Yong Wu


Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'

2012-09-27 Thread Zhi Yong Wu
On Thu, Sep 27, 2012 at 3:05 PM, Dave Chinner da...@fromorbit.com wrote:
 On Thu, Sep 27, 2012 at 01:25:34PM +0800, Zhi Yong Wu wrote:
 On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner da...@fromorbit.com wrote:
  On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote:
  From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
Introduce one new mount option '-o hottrack',
  and add its parsing support.
Its usage looks like:
 mount -o hottrack
 mount -o nouser,hottrack
 mount -o nouser,hottrack,loop
 mount -o hottrack,nouser
 
  I think that this option parsing should be done by the filesystem,
  even though the tracking functionality is in the VFS. That way only
  the filesystems that can use the tracking information will turn it
  on, rather than being able to turn it on for everything regardless
  of whether it is useful or not.
 
  Along those lines, just using a normal superblock flag to indicate
  it is active (e.g. MS_HOT_INODE_TRACKING in sb->s_flags) means you
  don't need to allocate the sb->s_hot_info structure just to be able
 If we don't allocate one sb->s_hot_info, where will those hash list
 heads and btree roots be located?

 I wrote that thinking (mistakenly) that s_hot_info was dynamically
 allocated rather than being embedded in the struct super_block.

 Indeed, if the mount option is held in s_flags, then it could be
 dynamically allocated, but I don't think that's really necessary...
Ah, you prefer allocating it. OK, let me try. Thanks for your explanation.


 Cheers,

 Dave.
 --
 Dave Chinner
 da...@fromorbit.com



-- 
Regards,

Zhi Yong Wu


Re: typo in inode.c

2012-09-27 Thread David Sterba
On Thu, Sep 27, 2012 at 05:04:02AM +0800, ching wrote:
 On 09/26/2012 11:23 PM, David Sterba wrote:
  On Wed, Sep 26, 2012 at 07:48:47PM +0800, ching wrote:
  There is a typo (?) in inode.c (git)
  What's the top commit and what git tree?
 
  This has been fixed in 3.6-rc4 via
  287082b0bd10060e9c6b32ed9605174ddf2f672a
 
 This mistake is in
 
 http://git.kernel.org/?p=linux/kernel/git/mason/linux-btrfs.git;a=summary

I see, this is because the fix was merged into Linus' tree via the
trivial tree. If unsure, you may want to check btrfs-next first.

david


Re: [PATCH v4 0/4] btrfs: extended inode refs

2012-09-27 Thread David Sterba
On Tue, Sep 25, 2012 at 04:04:46PM -0400, Chris Mason wrote:
 @@ -889,16 +899,23 @@ static inline int __add_inode_ref(struct btrfs_trans_handle *trans,
  		while (cur_offset < item_size) {
  			extref = (struct btrfs_inode_extref *)base + cur_offset;
  
 -			victim_name_len = btrfs_inode_extref_name_len(eb, extref);
 -			victim_name = kmalloc(namelen, GFP_NOFS);
 -			leaf = path->nodes[0];
 -			read_extent_buffer(eb, name, (unsigned long)&extref->name, namelen);
 +			victim_name_len = btrfs_inode_extref_name_len(leaf, extref);
 +
 +			if (btrfs_inode_extref_parent(leaf, extref) != parent_objectid)
 +				goto next;
 +
 +			victim_name = kmalloc(victim_name_len, GFP_NOFS);

unchecked kmalloc

 +			read_extent_buffer(leaf, victim_name, (unsigned long)&extref->name,
 +					   victim_name_len);
  
  			search_key.objectid = inode_objectid;
  			search_key.type = BTRFS_INODE_EXTREF_KEY;
  			search_key.offset = btrfs_extref_hash(parent_objectid,
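
The review comment above points out that the kmalloc() result is used
without a NULL check. As a userspace analogue of the usual pattern, with
malloc() standing in for kmalloc() and a hypothetical helper copy_name(),
the allocation is checked and -ENOMEM is returned before the buffer is
touched:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy len bytes of src into a freshly allocated, NUL-terminated buffer.
 * Returns 0 and stores the buffer in *out, or -ENOMEM if allocation fails.
 */
static int copy_name(const char *src, size_t len, char **out)
{
	char *name;

	assert(src != NULL && out != NULL);
	name = malloc(len + 1);
	if (!name)
		return -ENOMEM;	/* bail out before touching the buffer */
	memcpy(name, src, len);
	name[len] = '\0';
	*out = name;
	return 0;
}
```

In the hunk above, the equivalent would be an error return right after
the kmalloc() of victim_name, before read_extent_buffer() writes into it.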


[PATCH] Btrfs: improve the noflush reservation

2012-09-27 Thread Miao Xie
In some places (such as evicting an inode), we cannot flush the reserved
space of delalloc, but flushing the delayed directory index and delayed
inode is OK. Currently we don't try to flush those things and just return
when there is not enough space to be reserved. This patch fixes this problem.

We define 3 types of flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we are in a transaction, we should not flush anything, or a deadlock
would happen, so use NO_FLUSH. If flushing the reserved space of delalloc
would cause a deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is
used, and we flush everything.
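
The selection rule described above can be sketched as a small decision
function. This is only an illustration: pick_flush() and its two boolean
inputs are hypothetical, and the enum names are shortened so they are not
mistaken for the patch's own identifiers.

```c
#include <assert.h>
#include <stdbool.h>

enum reserve_flush {
	RESERVE_NO_FLUSH,	/* inside a transaction: flush nothing */
	RESERVE_FLUSH_LIMIT,	/* flushing delalloc could deadlock */
	RESERVE_FLUSH_ALL,	/* everything may be flushed */
};

/* Choose the flush mode from the two conditions named in the description. */
static enum reserve_flush pick_flush(bool in_transaction,
				     bool delalloc_flush_may_deadlock)
{
	if (in_transaction)
		return RESERVE_NO_FLUSH;
	if (delalloc_flush_may_deadlock)
		return RESERVE_FLUSH_LIMIT;
	return RESERVE_FLUSH_ALL;
}
```

In the patch itself each caller passes the mode explicitly, so the choice
is made per call site rather than computed at runtime.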

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
This is based on btrfs-next tree.
---
 fs/btrfs/ctree.h |   26 +-
 fs/btrfs/delayed-inode.c |6 ++-
 fs/btrfs/extent-tree.c   |   85 ++---
 fs/btrfs/inode-map.c |5 ++-
 fs/btrfs/inode.c |8 +++--
 fs/btrfs/relocation.c|   12 --
 fs/btrfs/transaction.c   |   30 +++-
 fs/btrfs/transaction.h   |2 +-
 8 files changed, 85 insertions(+), 89 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index dbb461f..cb59e9b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2870,6 +2870,18 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 u64 btrfs_reduce_alloc_profile(struct btrfs_root *root, u64 flags);
 u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data);
 void btrfs_clear_space_info_full(struct btrfs_fs_info *info);
+
+enum btrfs_reserve_flush_enum {
+   /* If we are in the transaction, we can't flush anything.*/
+   BTRFS_RESERVE_NO_FLUSH,
+   /*
+* Flushing delalloc may cause deadlock somewhere, in this
+* case, use FLUSH LIMIT
+*/
+   BTRFS_RESERVE_FLUSH_LIMIT,
+   BTRFS_RESERVE_FLUSH_ALL,
+};
+
 int btrfs_check_data_free_space(struct inode *inode, u64 bytes);
 void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
@@ -2889,19 +2901,13 @@ struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct 
btrfs_root *root,
 void btrfs_free_block_rsv(struct btrfs_root *root,
  struct btrfs_block_rsv *rsv);
 int btrfs_block_rsv_add(struct btrfs_root *root,
-   struct btrfs_block_rsv *block_rsv,
-   u64 num_bytes);
-int btrfs_block_rsv_add_noflush(struct btrfs_root *root,
-   struct btrfs_block_rsv *block_rsv,
-   u64 num_bytes);
+   struct btrfs_block_rsv *block_rsv, u64 num_bytes,
+   enum btrfs_reserve_flush_enum flush);
 int btrfs_block_rsv_check(struct btrfs_root *root,
  struct btrfs_block_rsv *block_rsv, int min_factor);
 int btrfs_block_rsv_refill(struct btrfs_root *root,
- struct btrfs_block_rsv *block_rsv,
- u64 min_reserved);
-int btrfs_block_rsv_refill_noflush(struct btrfs_root *root,
-  struct btrfs_block_rsv *block_rsv,
-  u64 min_reserved);
+  struct btrfs_block_rsv *block_rsv, u64 min_reserved,
+  enum btrfs_reserve_flush_enum flush);
 int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src_rsv,
struct btrfs_block_rsv *dst_rsv,
u64 num_bytes);
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index eb768c4..2e2eddb 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -651,7 +651,8 @@ static int btrfs_delayed_inode_reserve_metadata(
 */
if (!src_rsv || (!trans->bytes_reserved &&
 src_rsv->type != BTRFS_BLOCK_RSV_DELALLOC)) {
-   ret = btrfs_block_rsv_add_noflush(root, dst_rsv, num_bytes);
+   ret = btrfs_block_rsv_add(root, dst_rsv, num_bytes,
+ BTRFS_RESERVE_NO_FLUSH);
/*
 * Since we're under a transaction reserve_metadata_bytes could
 * try to commit the transaction which will make it return
@@ -686,7 +687,8 @@ static int btrfs_delayed_inode_reserve_metadata(
 * reserve something strictly for us.  If not be a pain and try
 * to steal from the delalloc block rsv.
 */
-   ret = btrfs_block_rsv_add_noflush(root, dst_rsv, num_bytes);
+   ret = btrfs_block_rsv_add(root, dst_rsv, num_bytes,
+ BTRFS_RESERVE_NO_FLUSH);
if (!ret)
goto out;
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8a01087..73b0255 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3851,24 +3851,31 @@ static int flush_space(struct btrfs_root *root,
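
The rule from the commit message for choosing among the three flush types can be sketched as below. The enum is the one the patch adds; pick_flush_mode() and flush_mode_name() are hypothetical helpers for illustration only, and the two boolean inputs are assumptions about how a caller might describe its context.

```c
#include <assert.h>
#include <stdbool.h>

/* Enum values as introduced by the patch. */
enum btrfs_reserve_flush_enum {
	BTRFS_RESERVE_NO_FLUSH,
	BTRFS_RESERVE_FLUSH_LIMIT,
	BTRFS_RESERVE_FLUSH_ALL,
};

/*
 * Hypothetical helper (not in the patch): map the two conditions named in
 * the commit message to a flush mode.
 */
enum btrfs_reserve_flush_enum pick_flush_mode(bool in_transaction,
					      bool delalloc_flush_may_deadlock)
{
	/* Inside a transaction any flushing could deadlock. */
	if (in_transaction)
		return BTRFS_RESERVE_NO_FLUSH;
	/* Delalloc flushing is unsafe, but delayed items may still be flushed. */
	if (delalloc_flush_may_deadlock)
		return BTRFS_RESERVE_FLUSH_LIMIT;
	return BTRFS_RESERVE_FLUSH_ALL;
}

/* Hypothetical helper for debugging output. */
const char *flush_mode_name(enum btrfs_reserve_flush_enum mode)
{
	switch (mode) {
	case BTRFS_RESERVE_NO_FLUSH:
		return "NO_FLUSH";
	case BTRFS_RESERVE_FLUSH_LIMIT:
		return "FLUSH_LIMIT";
	case BTRFS_RESERVE_FLUSH_ALL:
		return "FLUSH_ALL";
	}
	assert(!"unknown flush mode");
	return "?";
}
```

For example, an inode eviction path outside a transaction would, under these assumptions, call pick_flush_mode(false, true) and get BTRFS_RESERVE_FLUSH_LIMIT.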
  

Re: [PATCH 1/2] btrfs-progs: limit the max value of leafsize and nodesize

2012-09-27 Thread David Sterba
On Wed, Sep 26, 2012 at 03:52:07PM +0800, Robin Dong wrote:
 Using mkfs.btrfs like:
   mkfs.btrfs -l 131072 /dev/sda
 
 will return no error, but after mount it, the dmesg will report:
   BTRFS: couldn't mount because metadata blocksize (131072) was too large
 
 The user tools should use BTRFS_MAX_METADATA_BLOCKSIZE to limit leaf and node 
 size.

Good catch.

 @@ -1291,11 +1291,13 @@ int main(int ac, char **av)
   }
   }
   sectorsize = max(sectorsize, (u32)getpagesize());
 - if (leafsize < sectorsize || (leafsize & (sectorsize - 1))) {
 + if (leafsize < sectorsize || leafsize > BTRFS_MAX_METADATA_BLOCKSIZE ||
 + (leafsize & (sectorsize - 1))) {

Could you please separate the BTRFS_MAX_METADATA_BLOCKSIZE check and add
appropriate error message that actually informs the user what kind of
error happened?

   fprintf(stderr, "Illegal leafsize %u\n", leafsize);
   exit(1);
   }
 - if (nodesize < sectorsize || (nodesize & (sectorsize - 1))) {
 + if (nodesize < sectorsize || nodesize > BTRFS_MAX_METADATA_BLOCKSIZE ||
 + (nodesize & (sectorsize - 1))) {

(same here)

   fprintf(stderr, "Illegal nodesize %u\n", nodesize);
   exit(1);
   }

Thanks!
david
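
One way the separated check requested above might look, as a rough sketch only. blocksize_error() is a hypothetical helper, the message strings are invented for illustration, and the value 65536 for BTRFS_MAX_METADATA_BLOCKSIZE is an assumption (the thread does not quote it).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value; the thread does not quote the actual limit. */
#define BTRFS_MAX_METADATA_BLOCKSIZE 65536

/*
 * Hypothetical helper: return NULL if size is acceptable, otherwise a
 * message naming the specific rule that was violated.
 */
const char *blocksize_error(unsigned int size, unsigned int sectorsize)
{
	/* The mask test below needs a non-zero power-of-two sector size. */
	assert(sectorsize != 0 && (sectorsize & (sectorsize - 1)) == 0);

	if (size > BTRFS_MAX_METADATA_BLOCKSIZE)
		return "larger than BTRFS_MAX_METADATA_BLOCKSIZE";
	if (size < sectorsize)
		return "smaller than the sector size";
	if (size & (sectorsize - 1))
		return "not a multiple of the sector size";
	return NULL;
}
```

main() could then print something like "Illegal leafsize 131072 (larger than BTRFS_MAX_METADATA_BLOCKSIZE)" and exit, using the same helper for nodesize.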



[PATCH] Btrfs: fix wrong calculation of the available space when reserving the space

2012-09-27 Thread Miao Xie
According to the comment, we can overcommit up to 1/2 of the total disk
space when we are able to flush, and only up to 1/8 otherwise. But the code
did the reverse. Fix it.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
This is based on btrfs-next tree.
---
 fs/btrfs/extent-tree.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a010234..8a01087 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3962,9 +3962,9 @@ again:
 * 1/2 of the space.
 */
if (flush)
-   avail >>= 3;
-   else
avail >>= 1;
+   else
+   avail >>= 3;
 spin_unlock(&root->fs_info->free_chunk_lock);
 
if (used + num_bytes < space_info->total_bytes + avail) {
-- 
1.6.5.2
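
The arithmetic behind the fix, as a standalone sketch. A right shift by 1 halves the available space and a right shift by 3 divides it by 8, so after the patch the flushing case gets the 1/2 allowance. overcommit_allowance() and can_overcommit() are hypothetical userspace helpers; the real code works on fields of the space info and takes a spinlock, which this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* How far past total_bytes a reservation may go, per the fixed logic. */
uint64_t overcommit_allowance(uint64_t avail, bool flush)
{
	/* Can flush: up to 1/2 of the space. Cannot flush: only 1/8. */
	return flush ? avail >> 1 : avail >> 3;
}

/* Mirrors the comparison at the end of the hunk. */
bool can_overcommit(uint64_t used, uint64_t num_bytes, uint64_t total_bytes,
		    uint64_t avail, bool flush)
{
	/* The sketch does not handle overflow of used + num_bytes. */
	assert(used <= UINT64_MAX - num_bytes);

	return used + num_bytes <
	       total_bytes + overcommit_allowance(avail, flush);
}
```

With avail = 800, the allowance is 400 when flushing is possible and 100 when it is not, so a request that pushes usage 150 past total_bytes succeeds only in the flushing case.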


Re: [PATCH] Btrfs: improve the noflush reservation

2012-09-27 Thread Miao Xie
On Thu, 27 Sep 2012 16:45:51 +0800, Miao Xie wrote:
 In some places(such as: evicting inode), we just can not flush the reserved
 space of delalloc, flushing the delayed directory index and delayed inode
 is OK, but we don't try to flush those things and just go back when there is
 no enough space to be reserved. This patch fixes this problem.
 
 We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and 
 FLUSH_ALL.
 If we can in the transaction, we should not flush anything, or the deadlock
 would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
 would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
 and we will flush all things.
 
 Signed-off-by: Miao Xie mi...@cn.fujitsu.com
 ---
 This is based on btrfs-next tree.

Sorry, I forgot to say that this patch applies on top of:
[PATCH] Btrfs: fix wrong calculation of the available space when reserving the space

Both this patch and the one above are based on the btrfs-next tree.

Thanks
Miao

BTRF - Storage Usage

2012-09-27 Thread Sébastien Maury

Hi,

I've installed a new server using btrfs for my root partition (/).

It uses snapper for snapshots management and all seems to work pretty fine.

My problem is that I can't work out how much REAL free space remains on my
partition.


Different commands give me different results, and I don't know how to
interpret them correctly:

poivron:~ # btrfs filesystem df /
Data: total=4.01GB, used=2.16GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=3.00GB, used=429.16MB
Metadata: total=8.00MB, used=0.00

poivron:~ #  df -hP /
Filesystem  Size  Used Avail Use% Mounted on
/dev/sda3   132G  3.0G  124G   3% /

poivron:~ # btrfs filesystem show /dev/sda3
Label: none  uuid: 9e68b667-f9f9-490f-9da1-ae4e91558212
Total devices 1 FS bytes used 2.58GB
	devid    1 size 131.64GB used 10.04GB path /dev/sda3

Btrfs v0.19+

poivron:~ # du -sh /.snapshots
40G /.snapshots

===

Please help me understand and interpret this information so that I can tell,
as accurately as possible, how much space really remains and what is using
the space.


Also, I don't really understand the output of the command "btrfs
filesystem df /": what exactly are "Data", "System DUP", "System
total", "Metadata DUP" and "Metadata total"?


==

Here is some additional information:
poivron:~ # uname -a
Linux poivron 3.0.26-0.7-default #1 SMP Tue Apr 17 10:27:57 UTC 2012  
(3829766) x86_64 x86_64 x86_64 GNU/Linux


poivron:~ # snapper list-configs
Config | Subvolume
-------+----------
root   | /

poivron:~ # cat /etc/snapper/configs/root

# subvolume to snapshot
SUBVOLUME=/

# filesystem type
FSTYPE=btrfs


# run daily number cleanup
NUMBER_CLEANUP=yes

# limit for number cleanup
NUMBER_MIN_AGE=1800
NUMBER_LIMIT=100


# create hourly snapshots
TIMELINE_CREATE=yes

# cleanup hourly snapshots after some time
TIMELINE_CLEANUP=yes

# limits for timeline cleanup
TIMELINE_MIN_AGE=1800
TIMELINE_LIMIT_HOURLY=10
TIMELINE_LIMIT_DAILY=10
TIMELINE_LIMIT_MONTHLY=10
TIMELINE_LIMIT_YEARLY=10


# cleanup empty pre-post-pairs
EMPTY_PRE_POST_CLEANUP=yes

# limits for empty pre-post-pair cleanup
EMPTY_PRE_POST_MIN_AGE=1800


Regards,

Sébastien MAURY
Operations Manager, Montpellier site
DBA Team
___
INSERM - DSI - Pôle Infrastructures

Délégation régionale Languedoc Roussillon
60, rue de Navacelles
34394 Montpellier Cedex 5

Mobile: 06 31 51 42 18
Landline: 04 67 63 61 43
Fax: 04 67 63 70 25
Email: sebastien.ma...@inserm.fr
___





[PATCH V4 1/2] Btrfs: cleanup duplicated division functions

2012-09-27 Thread Miao Xie
div_factor() and div_factor_fine() were each implemented twice, and the two
functions are very similar. Clean up the duplicated implementations: drop
the original div_factor() and rename div_factor_fine() to div_factor(). The
factor of the new div_factor() is therefore out of 100, not out of 10.

div_factor() moves into a separate file named math.h because it is a common
math function that may be used by any part of btrfs.

Because these functions are mostly used on hot paths, and the parameters are
known to be correct in most cases, we don't add complex parameter checks to
them. Elsewhere, though, we must check that the parameters are valid. So
besides the code cleanup, this patch also adds a check for the usage value
of the balance filters; so far that is the only place that needs a check to
make sure the parameters of div_factor() are valid. In addition, an old
kernel may have stored an invalid usage value, so we must correct it.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v3 -> v4:
- deal with the wrong usage that was input on the old kernel

Changelog v2 -> v3:
- drop the original div_factor and rename div_factor_fine to div_factor
- drop the check of the factor

Changelog v1 -> v2:
- add missing check
---
 fs/btrfs/extent-tree.c |   29 +
 fs/btrfs/ioctl.c   |   21 ++
 fs/btrfs/math.h|   35 ++
 fs/btrfs/relocation.c  |2 +-
 fs/btrfs/transaction.c |2 +-
 fs/btrfs/volumes.c |   55 ++-
 6 files changed, 94 insertions(+), 50 deletions(-)
 create mode 100644 fs/btrfs/math.h

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a010234..bcb9ced 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -33,6 +33,7 @@
 #include "volumes.h"
 #include "locking.h"
 #include "free-space-cache.h"
+#include "math.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -648,24 +649,6 @@ void btrfs_clear_space_info_full(struct btrfs_fs_info 
*info)
rcu_read_unlock();
 }
 
-static u64 div_factor(u64 num, int factor)
-{
-   if (factor == 10)
-   return num;
-   num *= factor;
-   do_div(num, 10);
-   return num;
-}
-
-static u64 div_factor_fine(u64 num, int factor)
-{
-   if (factor == 100)
-   return num;
-   num *= factor;
-   do_div(num, 100);
-   return num;
-}
-
 u64 btrfs_find_block_group(struct btrfs_root *root,
   u64 search_start, u64 search_hint, int owner)
 {
@@ -674,7 +657,7 @@ u64 btrfs_find_block_group(struct btrfs_root *root,
u64 last = max(search_hint, search_start);
u64 group_start = 0;
int full_search = 0;
-   int factor = 9;
+   int factor = 90;
int wrapped = 0;
 again:
while (1) {
@@ -708,7 +691,7 @@ again:
if (!full_search && factor < 10) {
last = search_start;
full_search = 1;
-   factor = 10;
+   factor = 100;
goto again;
}
 found:
@@ -3513,7 +3496,7 @@ static int should_alloc_chunk(struct btrfs_root *root,
if (force == CHUNK_ALLOC_LIMITED) {
thresh = btrfs_super_total_bytes(root->fs_info->super_copy);
thresh = max_t(u64, 64 * 1024 * 1024,
-  div_factor_fine(thresh, 1));
+  div_factor(thresh, 1));
 
if (num_bytes - num_allocated < thresh)
return 1;
@@ -3521,12 +3504,12 @@ static int should_alloc_chunk(struct btrfs_root *root,
thresh = btrfs_super_total_bytes(root->fs_info->super_copy);
 
/* 256MB or 2% of the FS */
-   thresh = max_t(u64, 256 * 1024 * 1024, div_factor_fine(thresh, 2));
+   thresh = max_t(u64, 256 * 1024 * 1024, div_factor(thresh, 2));
/* system chunks need a much small threshold */
if (sinfo->flags & BTRFS_BLOCK_GROUP_SYSTEM)
thresh = 32 * 1024 * 1024;
 
-   if (num_bytes > thresh && sinfo->bytes_used < div_factor(num_bytes, 8))
+   if (num_bytes > thresh && sinfo->bytes_used < div_factor(num_bytes, 80))
return 0;
return 1;
 }
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9384a2a..121339c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3297,6 +3297,23 @@ void update_ioctl_balance_args(struct btrfs_fs_info 
*fs_info, int lock,
}
 }
 
+static int btrfs_check_balance_args(struct btrfs_ioctl_balance_args *bargs)
+{
+   if ((bargs->data.flags & BTRFS_BALANCE_ARGS_USAGE) &&
+   (bargs->data.usage < 0 || bargs->data.usage > 100))
+   return -EINVAL;
+
+   if ((bargs->meta.flags & BTRFS_BALANCE_ARGS_USAGE) &&
+   (bargs->meta.usage < 0 || bargs->meta.usage > 100))
+   return -EINVAL;
+
+   if ((bargs->sys.flags & BTRFS_BALANCE_ARGS_USAGE) &&
+   (bargs->sys.usage < 0 || bargs->sys.usage > 100))
+   return -EINVAL;
+
+   return 0;
+}
+
 static long 
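
To make the change of scale concrete, here is a userspace sketch of the two helpers side by side. old_div_factor() follows the removed tenths-based function and div_factor() follows the surviving percent-based one; plain 64-bit division stands in for the kernel's do_div(), and the assert is a debugging aid added for this sketch only (the patch deliberately keeps checks out of the helper).

```c
#include <assert.h>
#include <stdint.h>

/* The removed helper: factor is in tenths (0..10). */
uint64_t old_div_factor(uint64_t num, int factor)
{
	if (factor == 10)
		return num;
	num *= (uint64_t)factor;
	return num / 10;
}

/* The surviving helper: factor is a percentage (0..100). */
uint64_t div_factor(uint64_t num, int factor)
{
	/* Sketch-only sanity check; callers must validate user input. */
	assert(factor >= 0 && factor <= 100);

	if (factor == 100)
		return num;
	num *= (uint64_t)factor;
	return num / 100;
}
```

This is why the patch rewrites callers such as "factor = 9" to "factor = 90" and div_factor(num_bytes, 8) to div_factor(num_bytes, 80): the result is unchanged only if every old factor is multiplied by 10.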

Re: [PATCH V3 1/2] Btrfs: cleanup duplicated division functions

2012-09-27 Thread Miao Xie
Sorry for the late reply.

On Mon, 24 Sep 2012 18:47:42 +0200, David Sterba wrote:
 This is the most straightforward transformation I can think of.  It
 doesn't result in an unnecessary BUG_ON, keeps churn to a minimum and

 agree with you.

 doesn't change the style of the balance ioctl.  (If I were to check
 every filter argument that way, btrfs_balance_ioctl() would be very long
 and complicated.)

 I think the check in btrfs_balance_ioctl() is necessary, the reason is above.
 
 btrfs_balance_ioctl does not seem as the right place, it does the
 processing related to the state of balance (resume/cancel etc). Look at
 btrfs_balance() itself, it does lot more sanity checks of the parameters

I think we should not put the check in btrfs_balance(). Once the arguments
have passed the check at input time they stay valid, but if the check lives
in btrfs_balance() it will be repeated every time we resume the balance,
which is unnecessary.

 We can put the extra checks into helpers (and not only this
 one) if clarity and readability of the function becomes a concern.

Agreed. I will put this check into a helper in the next version of this patch.
I will also send a separate patch that moves the existing checks out of
btrfs_balance() into that helper once this patch has been accepted.

Thanks
Miao


Re: BTRF - Storage Usage

2012-09-27 Thread Hugo Mills
On Thu, Sep 27, 2012 at 12:44:27PM +0200, Sébastien Maury wrote:
 I've installed a new server using btrfs for my root partition (/).
 
 It uses snapper for snapshots management and all seems to work pretty fine.
 
 My problem is to be able to know the remaining REAL free space in my  
 partition.

   This is in the FAQ: 
https://btrfs.wiki.kernel.org/index.php/FAQ#Why_are_there_so_many_ways_to_check_the_amount_of_free_space.3F

   Short answer: you can't know in general.

   Longer answer -- see below.

 Using different commands, i have different results, and i don't know  
 how to interpret them correctly :

 poivron:~ # btrfs filesystem show /dev/sda3
 Label: none  uuid: 9e68b667-f9f9-490f-9da1-ae4e91558212
  Total devices 1 FS bytes used 2.58GB
  devid1 size 131.64GB used 10.04GB path /dev/sda3

   You have 131.64 GiB of raw storage in your filesystem. Of that,
10.04 GiB is currently allocated for use by the FS (and it will take
more as it needs it).

 poivron:~ # btrfs filesystem df /
 Data: total=4.01GB, used=2.16GB

   4.01 GiB of the 10.04 GiB allocation is assigned for use by data,
and 2.16 GiB of that allocation actually contains data.

 System, DUP: total=8.00MB, used=4.00KB

   16 MiB (=2*8.00 MiB) of the 10.04 GiB allocation is assigned for
use as two copies of the system data. There is 4 KiB of system data
actually used.

 System: total=4.00MB, used=0.00
 Metadata, DUP: total=3.00GB, used=429.16MB

   6 GiB (=2*3.00 GiB) of your 10.04 GiB allocation is assigned for
use as metadata, with two copies (DUP) being kept. 429.16 MiB of the
3.00 GiB is currently in use.

 Metadata: total=8.00MB, used=0.00

 poivron:~ #  df -hP /
 Filesystem  Size  Used Avail Use% Mounted on
 /dev/sda3   132G  3.0G  124G   3% /

   Plain old df can't handle the truth, so this is at best only a hint
at what's actually happening. When Avail reaches zero, your FS is
probably full. Other than that, you can't necessarily say very much.
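
   The numbers above can be cross-checked with a little arithmetic. The sketch below is only an illustration: the struct and function names are made up for this example, and the figures are copied from the quoted output. Counting every DUP "total" twice gives about 10.04 GiB, which lines up with "used 10.04GB" from btrfs filesystem show, and summing each "used" once gives about 2.58 GiB, which lines up with "FS bytes used 2.58GB".

```c
#include <assert.h>
#include <stddef.h>

/* One line of "btrfs filesystem df" output, with sizes converted to MiB. */
struct df_line {
	const char *name;
	double total_mib;
	double used_mib;
	int copies;		/* 2 for DUP, 1 otherwise */
};

/* Values copied from the output quoted in this thread. */
const struct df_line df_lines[] = {
	{ "Data",          4.01 * 1024.0, 2.16 * 1024.0, 1 },
	{ "System, DUP",   8.0,           4.0 / 1024.0,  2 },
	{ "System",        4.0,           0.0,           1 },
	{ "Metadata, DUP", 3.00 * 1024.0, 429.16,        2 },
	{ "Metadata",      8.0,           0.0,           1 },
};
const size_t df_line_count = sizeof(df_lines) / sizeof(df_lines[0]);

/* Raw disk space allocated to chunks: every DUP total counts twice. */
double raw_allocated_gib(const struct df_line *lines, size_t n)
{
	double mib = 0.0;

	assert(lines != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		mib += lines[i].total_mib * lines[i].copies;
	return mib / 1024.0;
}

/* Bytes stored, counting each piece of data or metadata once. */
double logical_used_gib(const struct df_line *lines, size_t n)
{
	double mib = 0.0;

	assert(lines != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		mib += lines[i].used_mib;
	return mib / 1024.0;
}
```

   Everything outside the 10.04 GiB of allocated chunks is still unallocated raw space, which the FS will hand out to data or metadata as needed.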

 ===
 
 Please help me understand and interpret those information to know the  
 most accurately as possible what is my real remaining space, and what  
 space is used by what.
 
 Although, i don't really understand the output of the command btrfs  
 filesystem df / : what are exactly Data, System DUP, System  
 total, Metadata DUP and Metadata total ?

   This should all be covered in the glossary on the website:
https://btrfs.wiki.kernel.org/index.php/Glossary

   Data is the contents of your files. Metadata is all the other stuff
that the FS needs in order to store your files -- directory
structures, permissions, locations of the file data, that kind of
thing. System is a particular bit of the metadata (the chunk tree)
which governs an internal physical/virtual mapping, and which needs to
be read before anything else can make any kind of sense.

   DUP is a bit like RAID-1: anything stored in a DUP chunk is
actually written to two different places on the disk, and can help
recovery in the case of physical disk corruption (e.g. bad blocks,
head crash).

 ==
 
 Here are some complementary informations :
 poivron:~ # uname -a
 Linux poivron 3.0.26-0.7-default #1 SMP Tue Apr 17 10:27:57 UTC 2012  
 (3829766) x86_64 x86_64 x86_64 GNU/Linux

   You [probably(*)] need to upgrade your kernel as soon as possible.
btrfs code moves very fast, and 3.0 has significant bugs in it. You
should be running the latest released kernel -- right now, that's 3.5,
or 3.6-rc7. Next week, it will probably change to 3.6 when Linus makes
the next release. Most distributions have a repository somewhere which
will give you access to new kernels without too much trouble.

   Hugo.

(*) Some of the enterprise distributions do have backported btrfs
fixes in their apparently older kernels.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   ---   __(_'  Squeak!   ---   




Re: BTRF - Storage Usage

2012-09-27 Thread Sébastien Maury

Hi,

Thanks for the quick reply; it clarifies a lot of things.
I had read the articles you mentioned, but I must admit that your
explanations based on my examples make things even clearer.


Also, if I understand correctly, the size of the snapshots isn't included
in the output of the btrfs filesystem show command?
So is using, for example, du -sh /.snapshots a correct way to determine
the disk usage of my snapshots?


I will ask the people in my company who maintain our distributions to
provide us with a more recent kernel.


PS: I use the SLES 11 SP2 distribution.

Regards,

Sébastien MAURY
Operations Manager, Montpellier site
DBA Team
___
INSERM - DSI - Pôle Infrastructures

Délégation régionale Languedoc Roussillon
60, rue de Navacelles
34394 Montpellier Cedex 5

Mobile: 06 31 51 42 18
Landline: 04 67 63 61 43
Fax: 04 67 63 70 25
Email: sebastien.ma...@inserm.fr
___



Re: [PATCH] Btrfs: improve the noflush reservation

2012-09-27 Thread Miao Xie
Please ignore this patch. My btrfs-next tree is out of date, and this patch
conflicts with Josef's patch
  [PATCH] Btrfs: run delayed refs first when out of space
I will rework this patch as soon as possible.

Thanks
Miao


Re: BTRF - Storage Usage

2012-09-27 Thread Hugo Mills
On Thu, Sep 27, 2012 at 01:25:58PM +0200, Sébastien Maury wrote:
 Hi,
 
 Thanks for the quick reply, this clarifies a lot of things for me.
 I had read the articles you mentioned, but I must admit that your
 explanations based on my examples make things even clearer.
 
 Also, if I understand things properly, snapshot sizes aren't included
 in the btrfs filesystem show command output?
 So is using, for example, du -sh /.snapshots the correct way to
 determine the disk usage of my snapshots?

   Disk usage of a snapshot has two different answers:

1) The total size of the files listed in the snapshot, which you can
   get from du.

2) The amount of space that would be freed up by deleting the
   snapshot, which isn't currently available, but probably will be
   soon. (The additional bookkeeping code was part of the qgroups
   patches, which are in 3.6).

 I will ask the people at my company in charge of maintaining
 distributions to provide us with a more recent kernel.
 
 PS: I use the SLES 11 SP2 distribution.

   OK, that one's actually one of the few that does keep proper
backports: 
https://btrfs.wiki.kernel.org/index.php/Getting_started#Distro_support

   That said, I don't know how good they are at keeping up -- probably
pretty good, but other people here may be able to answer that better.

   Hugo.

 Hugo Mills h...@carfax.org.uk a écrit :
 
  On Thu, Sep 27, 2012 at 12:44:27PM +0200, Sébastien Maury wrote:
  I've installed a new server using btrfs for my root partition (/).
 
  It uses snapper for snapshot management and everything seems to work fine.
 
  My problem is working out the REAL remaining free space on my
  partition.
 
 This is in the FAQ:   
  https://btrfs.wiki.kernel.org/index.php/FAQ#Why_are_there_so_many_ways_to_check_the_amount_of_free_space.3F
 
 Short answer: you can't know in general.
 
 Longer answer -- see below.
 
  Different commands give me different results, and I don't know
  how to interpret them correctly:
 
  poivron:~ # btrfs filesystem show /dev/sda3
  Label: none  uuid: 9e68b667-f9f9-490f-9da1-ae4e91558212
   Total devices 1 FS bytes used 2.58GB
   devid    1 size 131.64GB used 10.04GB path /dev/sda3
 
 You have 131.64 GiB of raw storage in your filesystem. Of that,
  10.04 GiB is currently allocated for use by the FS (and it will take
  more as it needs it).
 
  poivron:~ # btrfs filesystem df /
  Data: total=4.01GB, used=2.16GB
 
 4.01 GiB of the 10.04 GiB allocation is assigned for use by data,
  and 2.16 GiB of that allocation actually contains data.
 
  System, DUP: total=8.00MB, used=4.00KB
 
 16 MiB (=2*8.00 MiB) of the 10.04 GiB allocation is assigned for
  use as two copies of the system data. There is 4 KiB of system data
  actually used.
 
  System: total=4.00MB, used=0.00
  Metadata, DUP: total=3.00GB, used=429.16MB
 
 6 GiB (=2*3.00 GiB) of your 10.04 GiB allocation is assigned for
  use as metadata, with two copies (DUP) being kept. 429.16 MiB of the
  3.00 GiB is currently in use.
 
  Metadata: total=8.00MB, used=0.00
 
  poivron:~ #  df -hP /
  Filesystem  Size  Used Avail Use% Mounted on
  /dev/sda3   132G  3.0G  124G   3% /
 
 Plain old df can't handle the truth, so this is at best only a hint
  at what's actually happening. When Avail reaches zero, your FS is
  probably full. Other than that, you can't necessarily say very much.
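
As a rough cross-check of the figures above, the per-type totals reported by btrfs filesystem df can be added up, counting each DUP total twice, and compared with the "used 10.04GB" figure from btrfs filesystem show. The sketch below does only that arithmetic. The struct and function names are illustrative assumptions and are not part of btrfs-progs.

```c
#include <stddef.h>

/* One line of "btrfs filesystem df" output (illustrative type, not a btrfs-progs one). */
struct chunk_line {
    const char *type;   /* "Data", "System" or "Metadata" */
    int copies;         /* 1 for single, 2 for DUP */
    double total_mib;   /* the "total=" value converted to MiB */
};

/* Raw space allocated on the device: every DUP total is stored twice. */
double raw_allocated_mib(const struct chunk_line *lines, size_t count)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < count; i++)
        sum += lines[i].copies * lines[i].total_mib;
    return sum;
}
```

Feeding in the five lines from this thread (4.01 GiB data, 8 MiB system DUP, 4 MiB system, 3.00 GiB metadata DUP, 8 MiB metadata) gives roughly 10278 MiB, which is about 10.04 GiB and agrees with the allocation reported by btrfs filesystem show.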
 
  ===
 
  Please help me understand and interpret this information so that I can
  determine as accurately as possible how much space I really have left,
  and what space is used by what.
 
  Also, I don't really understand the output of the command btrfs
  filesystem df /: what exactly are Data, System DUP, System
  total, Metadata DUP and Metadata total?
 
 This should all be covered in the glossary on the website:
  https://btrfs.wiki.kernel.org/index.php/Glossary
 
 Data is the contents of your files. Metadata is all the other stuff
  that the FS needs in order to store your files -- directory
  structures, permissions, locations of the file data, that kind of
  thing. System is a particular bit of the metadata (the chunk tree)
  which governs an internal physical/virtual mapping, and which needs to
  be read before anything else can make any kind of sense.
 
 DUP is a bit like RAID-1: anything stored in a DUP chunk is
  actually written to two different places on the disk, and can help
  recovery in the case of physical disk corruption (e.g. bad blocks,
  head crash).
 
  ==
 
  Here is some additional information:
  poivron:~ # uname -a
  Linux poivron 3.0.26-0.7-default #1 SMP Tue Apr 17 10:27:57 UTC 2012
  (3829766) x86_64 x86_64 x86_64 GNU/Linux
 
 You [probably(*)] need to upgrade your kernel as soon as possible.
  btrfs code moves very fast, and 3.0 has significant bugs in it. You
  should be running the latest released kernel -- right now, that's 3.5,
  or 3.6-rc7. Next week, it will probably 

Re: BTRF - Storage Usage

2012-09-27 Thread Sébastien Maury

Hi,

Thanks a lot for your time and answers.

Things are pretty clear to me now.

I'm monitoring my systems using Nagios, and I was annoyed by the
disk usage monitoring.

Thanks to your answers, I should be able to develop a rather accurate script.
Or so I hope :)

Regards,
Sebastien.

Hugo Mills h...@carfax.org.uk a écrit :


On Thu, Sep 27, 2012 at 01:25:58PM +0200, Sébastien Maury wrote:

Hi,

Thanks for the quick reply, this clarifies a lot of things for me.
I had read the articles you mentioned, but I must admit that your
explanations based on my examples make things even clearer.

Also, if I understand things properly, snapshot sizes aren't included
in the btrfs filesystem show command output?
So is using, for example, du -sh /.snapshots the correct way to
determine the disk usage of my snapshots?


   Disk usage of a snapshot has two different answers:

1) The total size of the files listed in the snapshot, which you can
   get from du.

2) The amount of space that would be freed up by deleting the
   snapshot, which isn't currently available, but probably will be
   soon. (The additional bookkeeping code was part of the qgroups
   patches, which are in 3.6).


I will ask the people at my company in charge of maintaining
distributions to provide us with a more recent kernel.

PS: I use the SLES 11 SP2 distribution.


   OK, that one's actually one of the few that does keep proper
backports:   
https://btrfs.wiki.kernel.org/index.php/Getting_started#Distro_support


   That said, I don't know how good they are at keeping up -- probably
pretty good, but other people here may be able to answer that better.

   Hugo.


Hugo Mills h...@carfax.org.uk a écrit :

 On Thu, Sep 27, 2012 at 12:44:27PM +0200, Sébastien Maury wrote:
 I've installed a new server using btrfs for my root partition (/).

 It uses snapper for snapshot management and everything seems to work fine.

 My problem is working out the REAL remaining free space on my
 partition.

This is in the FAQ:
   
https://btrfs.wiki.kernel.org/index.php/FAQ#Why_are_there_so_many_ways_to_check_the_amount_of_free_space.3F


Short answer: you can't know in general.

Longer answer -- see below.

 Different commands give me different results, and I don't know
 how to interpret them correctly:

 poivron:~ # btrfs filesystem show /dev/sda3
 Label: none  uuid: 9e68b667-f9f9-490f-9da1-ae4e91558212
  Total devices 1 FS bytes used 2.58GB
  devid    1 size 131.64GB used 10.04GB path /dev/sda3

You have 131.64 GiB of raw storage in your filesystem. Of that,
 10.04 GiB is currently allocated for use by the FS (and it will take
 more as it needs it).

 poivron:~ # btrfs filesystem df /
 Data: total=4.01GB, used=2.16GB

4.01 GiB of the 10.04 GiB allocation is assigned for use by data,
 and 2.16 GiB of that allocation actually contains data.

 System, DUP: total=8.00MB, used=4.00KB

16 MiB (=2*8.00 MiB) of the 10.04 GiB allocation is assigned for
 use as two copies of the system data. There is 4 KiB of system data
 actually used.

 System: total=4.00MB, used=0.00
 Metadata, DUP: total=3.00GB, used=429.16MB

6 GiB (=2*3.00 GiB) of your 10.04 GiB allocation is assigned for
 use as metadata, with two copies (DUP) being kept. 429.16 MiB of the
 3.00 GiB is currently in use.

 Metadata: total=8.00MB, used=0.00

 poivron:~ #  df -hP /
 Filesystem  Size  Used Avail Use% Mounted on
 /dev/sda3   132G  3.0G  124G   3% /

Plain old df can't handle the truth, so this is at best only a hint
 at what's actually happening. When Avail reaches zero, your FS is
 probably full. Other than that, you can't necessarily say very much.

 ===

 Please help me understand and interpret this information so that I can
 determine as accurately as possible how much space I really have left,
 and what space is used by what.

 Also, I don't really understand the output of the command btrfs
 filesystem df /: what exactly are Data, System DUP, System
 total, Metadata DUP and Metadata total?

This should all be covered in the glossary on the website:
 https://btrfs.wiki.kernel.org/index.php/Glossary

Data is the contents of your files. Metadata is all the other stuff
 that the FS needs in order to store your files -- directory
 structures, permissions, locations of the file data, that kind of
 thing. System is a particular bit of the metadata (the chunk tree)
 which governs an internal physical/virtual mapping, and which needs to
 be read before anything else can make any kind of sense.

DUP is a bit like RAID-1: anything stored in a DUP chunk is
 actually written to two different places on the disk, and can help
 recovery in the case of physical disk corruption (e.g. bad blocks,
 head crash).

 ==

 Here is some additional information:
 poivron:~ # uname -a
 Linux poivron 3.0.26-0.7-default #1 SMP Tue Apr 17 10:27:57 UTC 2012
 (3829766) x86_64 x86_64 x86_64 GNU/Linux

You [probably(*)] 

Re: [PATCH] Btrfs: fix wrong calculation of the available space when reserving the space

2012-09-27 Thread Miao Xie
Please ignore this patch, it is not based on the new btrfs-next tree.
I'll send the right one as soon as possible.

Thanks
Miao

On Thu, 27 Sep 2012 17:09:15 +0800, Miao Xie wrote:
 According to the comment, flushers may overcommit up to 1/2 of the total
 disk space and non-flushers only up to 1/8, but the code does the
 reverse. Fix it.
 
 Signed-off-by: Miao Xie mi...@cn.fujitsu.com
 ---
 This is based on btrfs-next tree.
 ---
  fs/btrfs/extent-tree.c |4 ++--
  1 files changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index a010234..8a01087 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -3962,9 +3962,9 @@ again:
          * 1/2 of the space.
          */
         if (flush)
 -               avail >>= 3;
 -       else
                 avail >>= 1;
 +       else
 +               avail >>= 3;
         spin_unlock(&root->fs_info->free_chunk_lock);
  
         if (used + num_bytes < space_info->total_bytes + avail) {
 


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix wrong calculation of the available space when reserving the space

2012-09-27 Thread Josef Bacik
On Thu, Sep 27, 2012 at 03:09:15AM -0600, Miao Xie wrote:
 According to the comment, flushers may overcommit up to 1/2 of the total
 disk space and non-flushers only up to 1/8, but the code does the
 reverse. Fix it.
 

Sorry the comment is wrong, I was actually just looking at this the other day
:).  Basically we want non-flushers to be able to overcommit more to give those
guys more of a chance of being able to make an allocation, but we want flushers
to not be able to overcommit too much since they are allowed to make more
headroom, so the logic is right, the comment is wrong.  Thanks,

Josef
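
To make the behaviour described above concrete, here is a simplified user-space sketch of the check from the patch under discussion: non-flushers may overcommit up to 1/2 of the free chunk space, flushers only up to 1/8. The function name and the flat parameter list are assumptions made for illustration; the kernel code reads these values from its own structures under a lock.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel function: decide whether a
 * reservation of num_bytes may overcommit beyond total_bytes.
 */
bool can_overcommit(uint64_t used, uint64_t num_bytes,
                    uint64_t total_bytes, uint64_t free_chunk_space,
                    bool flush)
{
    uint64_t avail = free_chunk_space;

    if (flush)
        avail >>= 3;    /* flushers can make headroom: allow only 1/8 */
    else
        avail >>= 1;    /* non-flushers cannot: allow up to 1/2 */

    return used + num_bytes < total_bytes + avail;
}
```

With the same inputs, a non-flushing caller can succeed where a flushing caller is refused, which is the asymmetry the existing code implements and the comment gets backwards.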


Re: [PATCH V4 1/2] Btrfs: cleanup duplicated division functions

2012-09-27 Thread Ilya Dryomov
Hi Miao,

You haven't addressed any of my concerns with v3.

On Thu, Sep 27, 2012 at 06:19:58PM +0800, Miao Xie wrote:

(snipped)

 the parameters are right. So besides the code cleanup, this patch also
 adds a check for the usage of the space balance; so far it is the only
 place where we need a check to make sure the parameters of div_factor
 are right. Besides that, an old kernel may hold a wrong usage value, so
 we must rectify it.

Cleaning up/unifying duplicated functions and changing the existing
logic are two very different things.  If you, in the course of writing
this patch, became unhappy with the way balancing ioctl deals with
invalid input, please send a separate patch.

Before your patch, volumes.c had its own copy of div_factor_fine():

static u64 div_factor_fine(u64 num, int factor)
{
        if (factor <= 0)
                return 0;
        if (factor >= 100)
                return num;

        num *= factor;
        do_div(num, 100);
        return num;
}

which was called from chunk_usage_filter() on unvalidated user input.
As far as the cleanup part of your patch goes, you've dropped
factor <= 0 / factor >= 100 logic, merged volumes.c's copy with
extent-tree.c's copy and renamed div_factor_fine() to div_factor().  To
make chunk_usage_filter() happy again, it's enough to move the dropped
logic directly to the call site:

static int chunk_usage_filter(struct btrfs_fs_info *fs_info, u64 chunk_offset,
                              struct btrfs_balance_args *bargs)
{
        ...

-       user_thresh = div_factor_fine(cache->key.offset, bargs->usage);
+       if (bargs->usage == 0)
+               user_thresh = 0;
+       else if (bargs->usage >= 100)
+               user_thresh = cache->key.offset;
+       else
+               user_thresh = div_factor(cache->key.offset, bargs->usage);

        ...
}

So I would suggest you drop all hunks related to changing the way
balancing ioctl works and make the above change to chunk_usage_filter()
instead.  Once again, if you are unhappy with usage filter argument
handling, send a separate patch.

Thanks,

Ilya
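
The suggestion above can be tried outside the kernel with a small sketch. It assumes a user-space stand-in for the merged div_factor() that uses plain 64-bit division instead of do_div(), and a hypothetical usage_threshold() helper that plays the role of the call site in chunk_usage_filter(). Neither name is taken from the patch.

```c
#include <stdint.h>

/* User-space stand-in for the merged helper: num * factor / 100, no range checks. */
uint64_t div_factor(uint64_t num, int factor)
{
    if (factor == 100)
        return num;
    num *= factor;
    return num / 100;
}

/*
 * Hypothetical call-site helper: the clamping that the old volumes.c
 * copy did internally is applied here, before div_factor() is called.
 */
uint64_t usage_threshold(uint64_t chunk_size, uint64_t usage)
{
    if (usage == 0)
        return 0;
    if (usage >= 100)
        return chunk_size;
    return div_factor(chunk_size, (int)usage);
}
```

Keeping the clamp at the one caller that takes unvalidated user input leaves the shared helper free of checks on the hot path.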


Re: btrfs send/receive review by vfs folks

2012-09-27 Thread Alex Lyakas
Hi Jan,
I hope to get my proposal working soon; expect some code from me to
look at then.

Thanks!
Alex.


On Mon, Sep 24, 2012 at 11:27 AM, Jan Schmidt list.bt...@jan-o-sch.net wrote:
 Hi Alex,

 On Mon, September 24, 2012 at 11:13 (+0200), Alex Lyakas wrote:
 Hi,

 write_buf:
 Used to write the stream to a user space supplied pipe. Please note
 the ERESTARTSYS comment there, I need some help here as I don't know
 how to handle that correctly. If I ignore the return value, it loops
 forever. If I bail out to user space, it reenters the ioctl and starts
 from the beginning (which is really bad). I have two possible
 solutions in my mind.
 1. Store some kind of state in the ioctl arguments so that we can
 continue where we stopped when the ioctl reenters. This would however
 complicate the code a lot.
 2. Spawn a thread when the ioctl is called and leave the ioctl
 immediately. I don't know if ERESTARTSYS can happen in vfs_xxx calls
 if they happen from a non syscall thread.

 I am hitting the ERESTARTSYS issue as well. The easiest way to reproduce
 it is to stop the user process in gdb.
 As Alexander mentioned, restarting the ioctl from the beginning is
 really bad, because some commands were already sent to the pipe, and
 possibly consumed by the user mode (dump_thread). Also the command, on
 which vfs_write() hit ERESTARTSYS, might not have been pushed fully to
 the pipe. So if the ioctl() restarts, it starts filling the pipe with
 duplicate commands, and at least one command in the pipe might be
 corrupted. So the receive part cannot process such stream successfully
 (usually it hits crc error).

 In addition to what Alexander suggested, I have a third suggestion,
 but I would like to know whether the community believes this issue is
 worth fixing.

 It's a must-fix in my opinion. As you mentioned, it's easy to hit. Second,
 code like this doesn't look like it should be in mainline at all:

         /* TODO handle that correctly */
         /*if (ret == -ERESTARTSYS) {
                 continue;
         }*/

 I'm looking forward to your proposal, preferably in form of a patch :-)

 -Jan
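
The underlying requirement, resuming a write where it stopped rather than starting over, has a familiar user-space analogue. The sketch below is only that analogue and not a fix for the in-kernel ERESTARTSYS case: it assumes a plain POSIX write() on a pipe or file descriptor, and write_all() is a hypothetical helper name.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write the whole buffer, keeping track of how much has already gone out
 * so that a short write or an interrupted call never resends data.
 * Returns 0 on success, -1 on error with errno set by write().
 */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);

        if (n < 0) {
            if (errno == EINTR)
                continue;   /* this call wrote nothing: retry it */
            return -1;
        }
        done += (size_t)n;  /* short write: continue after the last byte sent */
    }
    return 0;
}
```

The point of the sketch is the done offset: the stream stays consistent because progress is remembered across retries, which is what restarting the ioctl from the beginning fails to do.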


Re: [PATCH v4 0/4] btrfs: extended inode refs

2012-09-27 Thread Mark Fasheh
On Tue, Sep 25, 2012 at 04:04:46PM -0400, Chris Mason wrote:
 On Mon, Aug 20, 2012 at 02:29:17PM -0600, Mark Fasheh wrote:
  
  Testing wise, the basic namespace operations work well (link, unlink, etc).
  The rest has gotten less debugging (and I really don't have a great way of
  testing the code in tree-log.c) Attached to this e-mail are btrfs-progs
  patches which make testing of the changes possible.
 
 Hi Mark,
 
 I hit a few problems testing this, so I have the patch below that I plan
 on folding into your commits (to keep bisect from crashing in tree log).
 
 Just let me know if this is a problem, or if you see any bugs in there.
 I'm still doing a last round of checks on it, but I wanted to send along
 early for comments.
 
 The biggest change in here is to always check the ref_objectid when
 returning a backref.  Hash collisions mean we may return a ref for a
 completely different parent id otherwise.  I think I caught all the
 places missing that logic, but please double check me.

Ahh yes of course. I missed that in a couple key areas. Thanks for fixing
it.
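
The collision problem described above can be shown with a toy lookup. Everything in this sketch is hypothetical: the struct layout, the field names and find_ext_ref() are invented for illustration and do not mirror the on-disk extended ref format. It shows only the principle that a hash match alone is not enough and the parent id (and name) must be compared as well.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy back reference: indexed by a hash, but identified by parent and name. */
struct ext_ref {
    uint64_t hash;              /* lookup key, may collide */
    uint64_t parent_objectid;   /* directory that holds the link */
    const char *name;           /* link name inside that directory */
};

/* Return the ref for (parent_objectid, name), skipping hash collisions. */
const struct ext_ref *find_ext_ref(const struct ext_ref *refs, size_t count,
                                   uint64_t hash, uint64_t parent_objectid,
                                   const char *name)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (refs[i].hash != hash)
            continue;
        if (refs[i].parent_objectid != parent_objectid)
            continue;   /* same hash, different parent: a collision */
        if (strcmp(refs[i].name, name) != 0)
            continue;
        return &refs[i];
    }
    return NULL;
}
```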


 Other than that I went through and fixed up bugs in
 tree-log.c.  __add_inode_ref had a bunch of cut and paste errors, and you
 carefully preserved a huge use-after-free bug in the original
 add_inode_ref.

Cool, everything in there looks good to me. Thanks again Chris!
--Mark

--
Mark Fasheh


[RFC] btrfs fi df output [Was Re: BTRF - Storage Usage]

2012-09-27 Thread Goffredo Baroncelli

On 09/27/2012 12:44 PM, Sébastien Maury wrote:

Hi,

I've installed a new server using btrfs for my root partition (/).

It uses snapper for snapshot management and everything seems to work fine.

My problem is working out the REAL remaining free space on my
partition.

Different commands give me different results, and I don't know how
to interpret them correctly:
poivron:~ # btrfs filesystem df /
Data: total=4.01GB, used=2.16GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=3.00GB, used=429.16MB
Metadata: total=8.00MB, used=0.00


Indeed, the output of btrfs filesystem df / is not very friendly.
What about changing it as shown below?


$ btrfs filesystem disk-free /
Summary:
  Total:                    135.00GB
  Allocated:                 10.51GB
  Unallocated:              124.49GB
  Free_(Estimated)           86.56GB
  Average_disk_efficiency:      62 %

Details:
  Chunk-type  Mode    Allocated      Used      Free
  ----------  ------  ---------  --------  --------
  Data        Single     4.01GB    2.16GB    1.87GB
  System      DUP       16.00MB    4.00KB    7.99MB
  System      Single     4.00MB      0.00    4.00MB
  Metadata    DUP        6.00GB  429.16MB    2.57GB
  Metadata    Single     8.00MB      0.00    8.00MB



Where Free_(Estimated) and Average_disk_efficiency are computed as:
  Average_disk_efficiency = ratio of average disk usage =
    (sum(ChunkUsed)+sum(ChunkFree))/sum(ChunkAllocated)

  Free_(Estimated) = Average_disk_efficiency *
    Unallocated+sum(ChunkFree)
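
For readers who want to experiment with the two formulas, here is a minimal C sketch of them. The struct, the function names and the guard against an empty table are assumptions of this sketch and are not taken from the disk_free branch; the inputs just need to share one unit.

```c
#include <stddef.h>

/* One row of the proposed "Details" table, all values in the same unit. */
struct chunk_row {
    double allocated;   /* raw space taken on disk (DUP counted twice) */
    double used;
    double free_space;
};

/* (sum(ChunkUsed) + sum(ChunkFree)) / sum(ChunkAllocated) */
double average_disk_efficiency(const struct chunk_row *rows, size_t count)
{
    double usable = 0.0, allocated = 0.0;
    size_t i;

    for (i = 0; i < count; i++) {
        usable += rows[i].used + rows[i].free_space;
        allocated += rows[i].allocated;
    }
    if (allocated == 0.0)
        return 0.0;     /* guard added for this sketch: nothing allocated yet */
    return usable / allocated;
}

/* Average_disk_efficiency * Unallocated + sum(ChunkFree) */
double estimated_free(const struct chunk_row *rows, size_t count,
                      double unallocated)
{
    double free_sum = 0.0;
    size_t i;

    for (i = 0; i < count; i++)
        free_sum += rows[i].free_space;
    return average_disk_efficiency(rows, count) * unallocated + free_sum;
}
```

The estimate scales the unallocated space by the efficiency seen so far, so a filesystem that keeps most of its chunks as DUP would report less estimated free space than one that stores everything once.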

I am open to suggestions about the terms: Used vs Allocated and Free vs 
Available, or a better description of Average disk efficiency



BR
G.Baroncelli

P.S. The source can be found at

http://cassiopea.homelinux.net/git/btrfs-progs-unstable.git

branch
disk_free


[PATCH] Btrfs: cache extent state when writing out dirty metadata pages

2012-09-27 Thread Josef Bacik
Every time we write out dirty pages we search for an offset in the tree,
convert the bits in the state, and then when we wait we search for the
offset again and clear the bits.  So for every dirty range in the io tree we
are doing 4 rb searches, which is suboptimal.  With this patch we are only
doing 2 searches for every cycle (modulo weird things happening).  Thanks,
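
The idea of caching a lookup result so that the follow-up operation can skip its own search can be sketched in user space. This is an analogy only: it uses a sorted array with binary search instead of the rb-tree, and struct range, struct range_index and both functions are hypothetical names that do not appear in the patch.

```c
#include <stddef.h>
#include <stdint.h>

struct range {
    uint64_t start;
    uint64_t end;               /* inclusive */
};

struct range_index {
    const struct range *ranges; /* sorted, non-overlapping */
    size_t count;
    unsigned long searches;     /* number of full searches performed */
};

/* Full lookup: binary search for the range containing offset. */
const struct range *search_range(struct range_index *idx, uint64_t offset)
{
    size_t lo = 0, hi = idx->count;

    idx->searches++;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct range *r = &idx->ranges[mid];

        if (offset < r->start)
            hi = mid;
        else if (offset > r->end)
            lo = mid + 1;
        else
            return r;
    }
    return NULL;
}

/* Reuse *cached when it still covers offset; otherwise search and refresh it. */
const struct range *lookup_cached(struct range_index *idx, uint64_t offset,
                                  const struct range **cached)
{
    const struct range *r;

    if (cached && *cached &&
        (*cached)->start <= offset && offset <= (*cached)->end)
        return *cached;

    r = search_range(idx, offset);
    if (cached)
        *cached = r;
    return r;
}
```

A caller that passes the same cached pointer to two consecutive operations on one range pays for one search instead of two, which is the saving the patch is after.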

Signed-off-by: Josef Bacik jba...@fusionio.com
---
 fs/btrfs/disk-io.c  |4 ++--
 fs/btrfs/extent-tree.c  |5 +++--
 fs/btrfs/extent_io.c|   43 +--
 fs/btrfs/extent_io.h|6 --
 fs/btrfs/free-space-cache.c |2 +-
 fs/btrfs/relocation.c   |2 +-
 fs/btrfs/transaction.c  |   14 +-
 fs/btrfs/tree-log.c |3 ++-
 8 files changed, 63 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c47a3ae..032cce2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3569,7 +3569,7 @@ static int btrfs_destroy_marked_extents(struct btrfs_root *root,
 
        while (1) {
                ret = find_first_extent_bit(dirty_pages, start, &start, &end,
-                                           mark);
+                                           mark, NULL);
                if (ret)
                        break;
 
@@ -3624,7 +3624,7 @@ static int btrfs_destroy_pinned_extent(struct btrfs_root *root,
 again:
        while (1) {
                ret = find_first_extent_bit(unpin, 0, &start, &end,
-                                           EXTENT_DIRTY);
+                                           EXTENT_DIRTY, NULL);
                if (ret)
                        break;
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index efb044e..65941d7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -313,7 +313,8 @@ static u64 add_new_free_space(struct btrfs_block_group_cache *block_group,
        while (start < end) {
                ret = find_first_extent_bit(info->pinned_extents, start,
                                            &extent_start, &extent_end,
-                                           EXTENT_DIRTY | EXTENT_UPTODATE);
+                                           EXTENT_DIRTY | EXTENT_UPTODATE,
+                                           NULL);
                if (ret)
                        break;
 
@@ -5028,7 +5029,7 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
 
        while (1) {
                ret = find_first_extent_bit(unpin, 0, &start, &end,
-                                           EXTENT_DIRTY);
+                                           EXTENT_DIRTY, NULL);
                if (ret)
                        break;
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 8f0f03b..1038f85 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -937,6 +937,7 @@ int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, int bits,
  * @end:               the end offset in bytes (inclusive)
  * @bits:              the bits to set in this range
  * @clear_bits:        the bits to clear in this range
+ * @cached_state:      state that we're going to cache
  * @mask:              the allocation mask
  *
  * This will go through and set bits for the given range.  If any states exist
@@ -946,7 +947,8 @@ int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, int bits,
  * boundary bits like LOCK.
  */
 int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-                      int bits, int clear_bits, gfp_t mask)
+                      int bits, int clear_bits,
+                      struct extent_state **cached_state, gfp_t mask)
 {
        struct extent_state *state;
        struct extent_state *prealloc = NULL;
@@ -963,6 +965,15 @@ again:
        }
 
        spin_lock(&tree->lock);
+       if (cached_state && *cached_state) {
+               state = *cached_state;
+               if (state->start <= start && state->end > start &&
+                   state->tree) {
+                       node = &state->rb_node;
+                       goto hit_next;
+               }
+       }
+
        /*
         * this search will find all the extents that end after
         * our range starts.
@@ -993,6 +1004,7 @@ hit_next:
         */
        if (state->start == start && state->end <= end) {
                set_state_bits(tree, state, &bits);
+               cache_state(state, cached_state);
                state = clear_state_bit(tree, state, &clear_bits, 0);
                if (last_end == (u64)-1)
                        goto out;
@@ -1033,6 +1045,7 @@ hit_next:
                        goto out;
                if (state->end <= end) {
                        set_state_bits(tree, state, &bits);
+                       cache_state(state, cached_state);
                        state = clear_state_bit(tree, state, &clear_bits, 0);
                        if (last_end == (u64)-1)
                                goto out;
@@ -1071,6 +1084,7 @@ 

[PATCH] btrfs-convert: show progress

2012-09-27 Thread Alfredo Esteban
Hello,

I'm sending a patch to show progress of btrfs-convert command. I put a
progress bar in the only heavy process: the btrfs metadata creation
(due to CRC calculation):

 ./btrfs-convert /dev/loop1
Creating btrfs metadata [] 100%
Creating ext2fs image file... [DONE]
Cleaning up system chunk... [DONE]
Conversion complete.

I just used \r. I think it is a simple but effective approach
without ncurses or other dependencies.

Suggestions are welcome.

Alfredo


convert-progress-bar.patch
Description: Binary data


Re: [PATCH] btrfs-convert: show progress

2012-09-27 Thread cwillu
On Thu, Sep 27, 2012 at 6:02 PM, Alfredo Esteban aedelato...@gmail.com wrote:
 Hello,

 I'm sending a patch to show progress of btrfs-convert command. I put a
 progress bar in the only heavy process: the btrfs metadata creation
 (due to CRC calculation):

Please include patches inline in the email, not as an attachment.

 ./btrfs-convert /dev/loop1
 Creating btrfs metadata [] 100%
 Creating ext2fs image file... [DONE]
 Cleaning up system chunk... [DONE]
 Conversion complete.

 I just used \r. I think it is a simple but effective approach
 without ncurses or other dependencies.

There should probably be some way to disable the progress bar (ideally
defaulting to an isatty() check) so that log files don't capture hundreds
if not thousands of lines of [  ].
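
A minimal sketch of that suggestion, assuming the bar is drawn on stdout with \r: format_progress() and show_progress() are hypothetical helpers, not code from the attached patch, and the 20-column width is an arbitrary choice. The only library calls relied on are isatty(), snprintf(), printf(), putchar() and fflush().

```c
#include <stdio.h>
#include <unistd.h>

/* Render e.g. "[#####     ]  50%" into buf; returns what snprintf() returns. */
int format_progress(char *buf, size_t len, unsigned int percent,
                    unsigned int width)
{
    char bar[64];
    unsigned int i, filled;

    if (percent > 100)
        percent = 100;
    if (width > sizeof(bar) - 1)
        width = sizeof(bar) - 1;
    filled = width * percent / 100;
    for (i = 0; i < width; i++)
        bar[i] = i < filled ? '#' : ' ';
    bar[width] = '\0';
    return snprintf(buf, len, "[%s] %3u%%", bar, percent);
}

/* Redraw the bar in place, but only when stdout is a terminal. */
void show_progress(const char *label, unsigned int percent)
{
    char line[96];

    if (!isatty(STDOUT_FILENO))
        return;         /* redirected to a file or pipe: stay quiet */
    format_progress(line, sizeof(line), percent, 20);
    printf("\r%s %s", label, line);
    if (percent >= 100)
        putchar('\n');
    fflush(stdout);
}
```

When the output is redirected to a log file, show_progress() prints nothing, so the log is left with the ordinary status lines only.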


[PATCH V5 1/2] Btrfs: cleanup duplicated division functions

2012-09-27 Thread Miao Xie
div_factor{_fine} has been implemented twice, and the two functions are
very similar, so clean up the duplicated implementation: drop the original
div_factor() and rename div_factor_fine() to div_factor(). The factor of
the new div_factor() is therefore out of 100, not 10.

I also move div_factor into an independent file named math.h because it is a
common math function that may be used by every component of btrfs.

Because these functions are mostly used on the hot path, and we are sure
the parameters are right in most cases, we don't add complex checks
for the parameters. But in the other places, we must check and make sure
the parameters are right. So besides the code cleanup, this patch also
adds a check for the usage of the space balance; so far it is the only
place where we need a check to make sure the parameters of div_factor are right.
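
The change of scale is the part most likely to trip up callers, so here is a small user-space sketch of it. The two function names are made up to tell the old and new scales apart, and plain division stands in for do_div(); the factor values are the ones this patch touches (9 becomes 90, 8 becomes 80).

```c
#include <stdint.h>

/* Old helper: factor is in tenths, so 9 means 90%. */
uint64_t div_factor_tenths(uint64_t num, int factor)
{
    if (factor == 10)
        return num;
    num *= factor;
    return num / 10;
}

/* Merged helper: factor is a percentage, so 90 means 90%. */
uint64_t div_factor_percent(uint64_t num, int factor)
{
    if (factor == 100)
        return num;
    num *= factor;
    return num / 100;
}
```

Every caller of the old tenths-based helper has to multiply its factor by ten when it switches to the merged one; a caller left at 9 would silently ask for 9% instead of 90%.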

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v4 - v5:
- drop the check in the space balance, and make the churn to a minimum

Changelog v3 - v4:
- deal with the wrong usage that was input on the old kernel

Changelog v2 - v3:
- drop the original div_factor and rename div_factor_fine to div_factor
- drop the check of the factor

Changelog v1 - v2:
- add missing check
---
 fs/btrfs/extent-tree.c |   29 ++---
 fs/btrfs/math.h|   35 +++
 fs/btrfs/relocation.c  |2 +-
 fs/btrfs/transaction.c |2 +-
 fs/btrfs/volumes.c |   35 ++-
 5 files changed, 53 insertions(+), 50 deletions(-)
 create mode 100644 fs/btrfs/math.h

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a010234..bcb9ced 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -33,6 +33,7 @@
 #include "volumes.h"
 #include "locking.h"
 #include "free-space-cache.h"
+#include "math.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -648,24 +649,6 @@ void btrfs_clear_space_info_full(struct btrfs_fs_info *info)
        rcu_read_unlock();
 }
 
-static u64 div_factor(u64 num, int factor)
-{
-       if (factor == 10)
-               return num;
-       num *= factor;
-       do_div(num, 10);
-       return num;
-}
-
-static u64 div_factor_fine(u64 num, int factor)
-{
-       if (factor == 100)
-               return num;
-       num *= factor;
-       do_div(num, 100);
-       return num;
-}
-
 u64 btrfs_find_block_group(struct btrfs_root *root,
                           u64 search_start, u64 search_hint, int owner)
 {
@@ -674,7 +657,7 @@ u64 btrfs_find_block_group(struct btrfs_root *root,
        u64 last = max(search_hint, search_start);
        u64 group_start = 0;
        int full_search = 0;
-       int factor = 9;
+       int factor = 90;
        int wrapped = 0;
 again:
        while (1) {
@@ -708,7 +691,7 @@ again:
        if (!full_search && factor < 10) {
                last = search_start;
                full_search = 1;
-               factor = 10;
+               factor = 100;
                goto again;
        }
 found:
@@ -3513,7 +3496,7 @@ static int should_alloc_chunk(struct btrfs_root *root,
        if (force == CHUNK_ALLOC_LIMITED) {
                thresh = btrfs_super_total_bytes(root->fs_info->super_copy);
                thresh = max_t(u64, 64 * 1024 * 1024,
-                              div_factor_fine(thresh, 1));
+                              div_factor(thresh, 1));
 
                if (num_bytes - num_allocated < thresh)
                        return 1;
@@ -3521,12 +3504,12 @@ static int should_alloc_chunk(struct btrfs_root *root,
        thresh = btrfs_super_total_bytes(root->fs_info->super_copy);
 
        /* 256MB or 2% of the FS */
-       thresh = max_t(u64, 256 * 1024 * 1024, div_factor_fine(thresh, 2));
+       thresh = max_t(u64, 256 * 1024 * 1024, div_factor(thresh, 2));
        /* system chunks need a much small threshold */
        if (sinfo->flags & BTRFS_BLOCK_GROUP_SYSTEM)
                thresh = 32 * 1024 * 1024;
 
-       if (num_bytes > thresh && sinfo->bytes_used < div_factor(num_bytes, 8))
+       if (num_bytes > thresh && sinfo->bytes_used < div_factor(num_bytes, 80))
                return 0;
        return 1;
 }
diff --git a/fs/btrfs/math.h b/fs/btrfs/math.h
new file mode 100644
index 000..4fef49f
--- /dev/null
+++ b/fs/btrfs/math.h
@@ -0,0 +1,35 @@
+
+/*
+ * Copyright (C) 2012 Fujitsu.  All rights reserved.
+ * Written by Miao Xie mi...@cn.fujitsu.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 

Re: [PATCH V4 1/2] Btrfs: cleanup duplicated division functions

2012-09-27 Thread Miao Xie
On Thu, 27 Sep 2012 19:56:24 +0300, Ilya Dryomov wrote:
 the parameters are right. So besides the code cleanup, this patch also
 adds a check for the usage of the space balance; so far it is the only
 place where we need a check to make sure the parameters of div_factor
 are right. Besides that, an old kernel may hold a wrong usage value, so
 we must rectify it.
 
 Cleaning up/unifying duplicated functions and changing the existing
 logic are two very different things.  If you, in the course of writing
 this patch, became unhappy with the way balancing ioctl deals with
 invalid input, please send a separate patch.
 
 Before your patch, volumes.c had its own copy of div_factor_fine():
 
 static u64 div_factor_fine(u64 num, int factor)
 {
         if (factor <= 0)
                 return 0;
         if (factor >= 100)
                 return num;
 
         num *= factor;
         do_div(num, 100);
         return num;
 }
 
 which was called from chunk_usage_filter() on unvalidated user input.
 As far as the cleanup part of your patch goes, you've dropped
 factor <= 0 / factor >= 100 logic, merged volumes.c's copy with
 extent-tree.c's copy and renamed div_factor_fine() to div_factor().  To
 make chunk_usage_filter() happy again, it's enough to move the dropped
 logic directly to the call site:
 
 static int chunk_usage_filter(struct btrfs_fs_info *fs_info, u64 chunk_offset,
                               struct btrfs_balance_args *bargs)
 {
         ...
 
 -       user_thresh = div_factor_fine(cache->key.offset, bargs->usage);
 +       if (bargs->usage == 0)
 +               user_thresh = 0;
 +       else if (bargs->usage >= 100)
 +               user_thresh = cache->key.offset;
 +       else
 +               user_thresh = div_factor(cache->key.offset, bargs->usage);
 
         ...
 }
 
 So I would suggest you drop all hunks related to changing the way
 balancing ioctl works and make the above change to chunk_usage_filter()
 instead.  Once again, if you are unhappy with usage filter argument
 handling, send a separate patch.

Fine.
(I forgot the rule that one patch should do just one thing.)

Thanks
Miao


[PATCH v2 1/2] btrfs-progs: limit the max value of leafsize and nodesize

2012-09-27 Thread Robin Dong
From: Robin Dong san...@taobao.com

Using mkfs.btrfs like:

mkfs.btrfs -l 131072 /dev/sda

will return no error, but after mounting it, dmesg will report:

BTRFS: couldn't mount because metadata blocksize (131072) was too large

The leafsize and nodesize are equal at present, so we use a single function,
check_leaf_or_node_size, to keep both the leaf and node size no larger than
BTRFS_MAX_METADATA_BLOCKSIZE.

Signed-off-by: Robin Dong san...@taobao.com
Reviewed-by: David Sterba d...@jikos.cz
---
 ctree.h |    6 ++++++
 mkfs.c  |   29 +++++++++++++++++++++++------
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/ctree.h b/ctree.h
index 7f55229..75c1e0a 100644
--- a/ctree.h
+++ b/ctree.h
@@ -111,6 +111,12 @@ struct btrfs_trans_handle;
 #define BTRFS_DEV_ITEMS_OBJECTID 1ULL
 
 /*
+ * the max metadata block size.  This limit is somewhat artificial,
+ * but the memmove costs go through the roof for larger blocks.
+ */
+#define BTRFS_MAX_METADATA_BLOCKSIZE 65536
+
+/*
  * we can actually store much bigger names, but lets not confuse the rest
  * of linux
  */
diff --git a/mkfs.c b/mkfs.c
index dff5eb8..8420482 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1201,6 +1201,27 @@ static int zero_output_file(int out_fd, u64 size, u32 sectorsize)
 	return ret;
 }
 
+static int check_leaf_or_node_size(u32 size, u32 sectorsize)
+{
+	if (size < sectorsize) {
+		fprintf(stderr,
+			"Illegal leafsize (or nodesize) %u (smaller than %u)\n",
+			size, sectorsize);
+		return -1;
+	} else if (size > BTRFS_MAX_METADATA_BLOCKSIZE) {
+		fprintf(stderr,
+			"Illegal leafsize (or nodesize) %u (larger than %u)\n",
+			size, BTRFS_MAX_METADATA_BLOCKSIZE);
+		return -1;
+	} else if (size & (sectorsize - 1)) {
+		fprintf(stderr,
+			"Illegal leafsize (or nodesize) %u (not align to %u)\n",
+			size, sectorsize);
+		return -1;
+	}
+	return 0;
+}
+
 int main(int ac, char **av)
 {
 	char *file;
@@ -1291,14 +1312,10 @@ int main(int ac, char **av)
 		}
 	}
 	sectorsize = max(sectorsize, (u32)getpagesize());
-	if (leafsize < sectorsize || (leafsize & (sectorsize - 1))) {
-		fprintf(stderr, "Illegal leafsize %u\n", leafsize);
+	if (check_leaf_or_node_size(leafsize, sectorsize))
 		exit(1);
-	}
-	if (nodesize < sectorsize || (nodesize & (sectorsize - 1))) {
-		fprintf(stderr, "Illegal nodesize %u\n", nodesize);
+	if (check_leaf_or_node_size(nodesize, sectorsize))
 		exit(1);
-	}
 	ac = ac - optind;
 	if (ac == 0)
 		print_usage();
-- 
1.7.3.2
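
The three checks in this patch can be exercised outside mkfs.c. The following is a minimal sketch under stated assumptions: classify_block_size() and the size_check enum are hypothetical names, and MAX_METADATA_BLOCKSIZE simply copies the 65536 value that the patch adds to ctree.h. Note that the mask test only works as an alignment check when sectorsize is a power of two.

```c
#include <stdint.h>

/* Copies the value of BTRFS_MAX_METADATA_BLOCKSIZE from the patch. */
#define MAX_METADATA_BLOCKSIZE 65536u

/* Hypothetical result codes; the patch itself prints a message and returns -1. */
enum size_check {
	SIZE_OK,
	SIZE_TOO_SMALL,
	SIZE_TOO_LARGE,
	SIZE_UNALIGNED
};

/* Same order of checks as check_leaf_or_node_size() in the patch. */
static enum size_check classify_block_size(uint32_t size, uint32_t sectorsize)
{
	if (size < sectorsize)
		return SIZE_TOO_SMALL;
	if (size > MAX_METADATA_BLOCKSIZE)
		return SIZE_TOO_LARGE;
	/* For a power-of-two sectorsize, any set low bit means misalignment. */
	if (size & (sectorsize - 1))
		return SIZE_UNALIGNED;
	return SIZE_OK;
}
```

With a 4096-byte sectorsize, the 131072 value from the commit message is classified as too large, 6144 as unaligned, and 16384 as acceptable.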



[PATCH v2 2/2] btrfs-progs: limit the min value of total_bytes

2012-09-27 Thread Robin Dong
From: Robin Dong san...@taobao.com

Using mkfs.btrfs like:

mkfs.btrfs -b 1048576 /dev/sda

will report an error:

mkfs.btrfs: volumes.c:796: btrfs_alloc_chunk: Assertion `!(ret)' failed.
Aborted

because the length of dev_extent is 4MB.

But if we run mkfs.btrfs with 8MB total bytes, the newly mounted btrfs
filesystem cannot hold even one empty file. So 12MB is a good minimum value
for block_count.

Signed-off-by: Robin Dong san...@taobao.com
---
 mkfs.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/mkfs.c b/mkfs.c
index 8420482..496faa8 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1345,7 +1345,11 @@ int main(int ac, char **av)
 				       dev_block_count, mixed, nodiscard);
 	if (block_count == 0)
 		block_count = dev_block_count;
-	else if (block_count > dev_block_count) {
+	else if (block_count < 3 * BTRFS_MKFS_SYSTEM_GROUP_SIZE) {
+		fprintf(stderr, "Illegal total number of bytes %u\n",
+			block_count);
+		exit(1);
+	} else if (block_count > dev_block_count) {
 		fprintf(stderr, "%s is smaller than requested size\n", file);
 		exit(1);
 	}
-- 
1.7.3.2
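
The 12MB minimum follows from the arithmetic 3 * 4MB. Here is a minimal sketch of the resulting decision chain, assuming BTRFS_MKFS_SYSTEM_GROUP_SIZE is 4MB (consistent with the dev_extent length quoted in the commit message, but not shown in the patch); SYSTEM_GROUP_SIZE, MIN_TOTAL_BYTES and effective_total_bytes() are hypothetical names.

```c
#include <stdint.h>

/* Assumed value of BTRFS_MKFS_SYSTEM_GROUP_SIZE: 4MB. */
#define SYSTEM_GROUP_SIZE (4ULL * 1024 * 1024)
/* Minimum introduced by the patch: 3 * 4MB = 12MB = 12582912 bytes. */
#define MIN_TOTAL_BYTES (3 * SYSTEM_GROUP_SIZE)

/*
 * Hypothetical helper mirroring the if/else chain in main():
 * returns the size mkfs would use, or 0 where mkfs would exit(1).
 */
static uint64_t effective_total_bytes(uint64_t requested, uint64_t dev_size)
{
	if (requested == 0)
		return dev_size;	/* no -b given: use the whole device */
	if (requested < MIN_TOTAL_BYTES)
		return 0;		/* below 12MB: rejected */
	if (requested > dev_size)
		return 0;		/* device smaller than requested size */
	return requested;
}
```

Under these assumptions the 1048576-byte request from the commit message and an 8MB request are both rejected, while a request of exactly 12MB is accepted.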



Re: [RFC] btrfs fi df output [Was Re: BTRF - Storage Usage]

2012-09-27 Thread Roman Mamedov
On Thu, 27 Sep 2012 23:02:35 +0200
Goffredo Baroncelli kreij...@libero.it wrote:

 Sorry for the space error:
 Below a more correct example
 
 $ btrfs filesystem disk-free /
 Summary:
 Total:                    135.00GB
 Allocated:                 10.51GB
 Unallocated:              124.49GB
 Free_(Estimated)           86.56GB
 Average_disk_efficiency:   62 %

How do you estimate Free here? Sorry, I didn't check the source code in git,
but nothing in the Details below leads me to believe that this FS is doomed
to usefully utilize only ~86GB of the partition, and not more.

Are you ready to answer the flood of questions from people asking why their
disk is only 62% efficient, and how to tune it to 100%? :-)

Why use underscores instead of spaces?


 
 Details:
 Chunk-type  Mode    Allocated  Used      Free
 ----------  ------  ---------  --------  ------
 Data        Single     4.01GB    2.16GB  1.87GB
 System      DUP       16.00MB    4.00KB  7.99MB
 System      Single     4.00MB      0.00  4.00MB
 Metadata    DUP        6.00GB  429.16MB  2.57GB
 Metadata    Single     8.00MB      0.00  8.00MB

-- 
With respect,
Roman

~~~
Stallman had a printer,
with code he could not see.
So he began to tinker,
and set the software free.

