Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Avi Kivity

On 03/06/2017 08:27 PM, Jens Axboe wrote:

On 03/06/2017 11:17 AM, Avi Kivity wrote:


On 03/06/2017 07:06 PM, Jens Axboe wrote:

On 03/06/2017 09:59 AM, Avi Kivity wrote:

On 03/06/2017 06:08 PM, Jens Axboe wrote:

On 03/06/2017 08:59 AM, Avi Kivity wrote:

On 03/06/2017 05:38 PM, Jens Axboe wrote:

On 03/06/2017 08:29 AM, Avi Kivity wrote:

On 03/06/2017 05:19 PM, Jens Axboe wrote:

On 03/06/2017 01:25 AM, Jan Kara wrote:

On Sun 05-03-17 16:56:21, Avi Kivity wrote:

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel to the best of its ability to complete
and if it returns -EAGAIN, can defer it to another thread.


Is it not possible to push the iocb to a workqueue?  This will allow
existing userspace to work with the new functionality, unchanged. Any
userspace implementation would have to do the same thing, so it's not like
we're saving anything by pushing it there.

That is not easy because until IO is fully submitted, you need some parts
of the context of the process which submits the IO (e.g. memory mappings,
but possibly also other credentials). So you would need to somehow transfer
this information to the workqueue.

Outside of technical challenges, the API also needs to return EAGAIN or
start blocking at some point. We can't expose a direct connection to
queue work like that, and let any user potentially create millions of
pending work items (and IOs).

You wouldn't expect more concurrent events than the maxevents parameter
that was supplied to io_setup syscall; it should have reserved any
resources needed.

Doesn't matter what limit you apply, my point still stands - at some
point you have to return EAGAIN, or block. Returning EAGAIN without
the caller having flagged support for that change of behavior would
be problematic.

Doesn't it already return EAGAIN (or some other error) if you exceed
maxevents?

It's a setup thing. We check these limits when someone creates an IO
context, and carve out the specified entries from our global pool. Then
we free those "resources" when the io context is freed.

Right now I can setup an IO context with 1000 entries on it, yet that
number has NO bearing on when io_submit() would potentially block or
return EAGAIN.

We can have a huge gap on the intent signaled by io context setup, and
the reality imposed by what actually happens on the IO submission side.

Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return
EAGAIN?

Just tested it, and maxevents is not respected for this:

io_setup(1, [0x7fc64537f000])   = 0
io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000,
nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
{pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread,
fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3,
buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000,
nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
{pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10

which is unexpected, to me.
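
(For reference, the observation is easy to reproduce with a small libaio
program along these lines -- a minimal sketch, error handling omitted;
/tmp/testfile and the O_DIRECT open are placeholders, build with gcc -laio:)

#define _GNU_SOURCE		/* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cbs[10], *cbp[10];
	void *buf;
	int fd, i, ret;

	fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
	posix_memalign(&buf, 4096, 4096);

	ret = io_setup(1, &ctx);		/* ask for room for ONE event */
	printf("io_setup = %d\n", ret);

	for (i = 0; i < 10; i++) {
		io_prep_pread(&cbs[i], fd, buf, 4096, 0);
		cbp[i] = &cbs[i];
	}
	ret = io_submit(ctx, 10, cbp);		/* returns 10 here, not -EAGAIN */
	printf("io_submit = %d\n", ret);

	io_destroy(ctx);
	return 0;
}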

ioctx_alloc()
{
  [...]

  /*
   * We keep track of the number of available ringbuffer slots, to prevent
   * overflow (reqs_available), and we also use percpu counters for this.
   *
   * So since up to half the slots might be on other cpu's percpu counters
   * and unavailable, double nr_events so userspace sees what they
   * expected: additionally, we move req_batch slots to/from percpu
   * counters at a time, so make sure that isn't 0:
   */
  nr_events = max(nr_events, num_possible_cpus() * 4);
  nr_events *= 2;
}

On a 4-lcore desktop:

io_setup(1, [0x7fc210041000])   = 0
io_submit(0x7fc210041000, 1, [big array]) = 126
io_submit(0x7fc210041000, 1, [big array]) = -1 EAGAIN (Resource
temporarily unavailable)
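
(The 126 can be accounted for from the snippet above -- my arithmetic, not
spelled out in the thread, assuming 4 KiB pages, a 32-byte struct aio_ring
header, 32-byte io_events, and the ring rounding done in aio_setup_ring():)

	nr_events = max(1, num_possible_cpus() * 4);	/* e.g. 4 CPUs -> 16 */
	nr_events *= 2;					/* -> 32 */
	/* aio_setup_ring() then rounds the ring up to whole pages:
	 *   (4096 - 32) / 32 = 127 usable ring slots,
	 * and reqs_available starts at 127 - 1 = 126 -- the number above. */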

so, the user should already expect EAGAIN from io_submit() due to
resource limits.  I'm sure the check could be tightened so that if we do
have to use a workqueue, we respect the user's limit rather than some
inflated number.

This is why I previously said that the 1000 requests you potentially
ask for when setting up your IO context have NOTHING to do with when you
will run into EAGAIN. Yes, returning EAGAIN if the app exceeds the
limit that it itself has set is existing behavior and it certainly makes
sense. And it's an easily handled condition, since the app can just
back off and wait/reap completion events.
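
(A minimal sketch of that back-off, using libaio; names and buffer sizes are
illustrative only:)

#include <libaio.h>
#include <errno.h>

/* Submit as much as possible; on EAGAIN, reap some completions and retry. */
static int submit_all(io_context_t ctx, struct iocb **cbs, long nr)
{
	long done = 0;

	while (done < nr) {
		int ret = io_submit(ctx, nr - done, cbs + done);

		if (ret > 0) {
			done += ret;
		} else if (ret == -EAGAIN) {
			struct io_event events[64];

			/* block until at least one request completes */
			io_getevents(ctx, 1, 64, events, NULL);
		} else {
			return ret;	/* hard error */
		}
	}
	return 0;
}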


Every time I used aio, I considered maxevents to be the maximum number 
of in-flight requests for that queue, and observed this limit 
religiously.  It's possible others don't.

Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Avi Kivity



On 03/06/2017 07:06 PM, Jens Axboe wrote:

On 03/06/2017 09:59 AM, Avi Kivity wrote:


On 03/06/2017 06:08 PM, Jens Axboe wrote:

On 03/06/2017 08:59 AM, Avi Kivity wrote:

On 03/06/2017 05:38 PM, Jens Axboe wrote:

On 03/06/2017 08:29 AM, Avi Kivity wrote:

On 03/06/2017 05:19 PM, Jens Axboe wrote:

On 03/06/2017 01:25 AM, Jan Kara wrote:

On Sun 05-03-17 16:56:21, Avi Kivity wrote:

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel to the best of its ability to complete
and if it returns -EAGAIN, can defer it to another thread.


Is it not possible to push the iocb to a workqueue?  This will allow
existing userspace to work with the new functionality, unchanged. Any
userspace implementation would have to do the same thing, so it's not like
we're saving anything by pushing it there.

That is not easy because until IO is fully submitted, you need some parts
of the context of the process which submits the IO (e.g. memory mappings,
but possibly also other credentials). So you would need to somehow transfer
this information to the workqueue.

Outside of technical challenges, the API also needs to return EAGAIN or
start blocking at some point. We can't expose a direct connection to
queue work like that, and let any user potentially create millions of
pending work items (and IOs).

You wouldn't expect more concurrent events than the maxevents parameter
that was supplied to io_setup syscall; it should have reserved any
resources needed.

Doesn't matter what limit you apply, my point still stands - at some
point you have to return EAGAIN, or block. Returning EAGAIN without
the caller having flagged support for that change of behavior would
be problematic.

Doesn't it already return EAGAIN (or some other error) if you exceed
maxevents?

It's a setup thing. We check these limits when someone creates an IO
context, and carve out the specified entries from our global pool. Then
we free those "resources" when the io context is freed.

Right now I can setup an IO context with 1000 entries on it, yet that
number has NO bearing on when io_submit() would potentially block or
return EAGAIN.

We can have a huge gap on the intent signaled by io context setup, and
the reality imposed by what actually happens on the IO submission side.

Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return
EAGAIN?

Just tested it, and maxevents is not respected for this:

io_setup(1, [0x7fc64537f000])   = 0
io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000,
nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
{pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread,
fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3,
buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000,
nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
{pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10

which is unexpected, to me.

ioctx_alloc()
{
 [...]

 /*
  * We keep track of the number of available ringbuffer slots, to prevent
  * overflow (reqs_available), and we also use percpu counters for this.
  *
  * So since up to half the slots might be on other cpu's percpu counters
  * and unavailable, double nr_events so userspace sees what they
  * expected: additionally, we move req_batch slots to/from percpu
  * counters at a time, so make sure that isn't 0:
  */
 nr_events = max(nr_events, num_possible_cpus() * 4);
 nr_events *= 2;
}


On a 4-lcore desktop:

io_setup(1, [0x7fc210041000])   = 0
io_submit(0x7fc210041000, 1, [big array]) = 126
io_submit(0x7fc210041000, 1, [big array]) = -1 EAGAIN (Resource 
temporarily unavailable)


so, the user should already expect EAGAIN from io_submit() due to 
resource limits.  I'm sure the check could be tightened so that if we do 
have to use a workqueue, we respect the user's limit rather than some 
inflated number.




Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Avi Kivity



On 03/06/2017 06:08 PM, Jens Axboe wrote:

On 03/06/2017 08:59 AM, Avi Kivity wrote:

On 03/06/2017 05:38 PM, Jens Axboe wrote:

On 03/06/2017 08:29 AM, Avi Kivity wrote:

On 03/06/2017 05:19 PM, Jens Axboe wrote:

On 03/06/2017 01:25 AM, Jan Kara wrote:

On Sun 05-03-17 16:56:21, Avi Kivity wrote:

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel to the best of its ability to complete
and if it returns -EAGAIN, can defer it to another thread.


Is it not possible to push the iocb to a workqueue?  This will allow
existing userspace to work with the new functionality, unchanged. Any
userspace implementation would have to do the same thing, so it's not like
we're saving anything by pushing it there.

That is not easy because until IO is fully submitted, you need some parts
of the context of the process which submits the IO (e.g. memory mappings,
but possibly also other credentials). So you would need to somehow transfer
this information to the workqueue.

Outside of technical challenges, the API also needs to return EAGAIN or
start blocking at some point. We can't expose a direct connection to
queue work like that, and let any user potentially create millions of
pending work items (and IOs).

You wouldn't expect more concurrent events than the maxevents parameter
that was supplied to io_setup syscall; it should have reserved any
resources needed.

Doesn't matter what limit you apply, my point still stands - at some
point you have to return EAGAIN, or block. Returning EAGAIN without
the caller having flagged support for that change of behavior would
be problematic.

Doesn't it already return EAGAIN (or some other error) if you exceed
maxevents?

It's a setup thing. We check these limits when someone creates an IO
context, and carve out the specified entries from our global pool. Then
we free those "resources" when the io context is freed.

Right now I can setup an IO context with 1000 entries on it, yet that
number has NO bearing on when io_submit() would potentially block or
return EAGAIN.

We can have a huge gap on the intent signaled by io context setup, and
the reality imposed by what actually happens on the IO submission side.


Isn't that a bug?  Shouldn't that 1001st incomplete io_submit() return 
EAGAIN?


Just tested it, and maxevents is not respected for this:

io_setup(1, [0x7fc64537f000])   = 0
io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000, 
nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, 
offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, 
{pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, 
fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, 
buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, 
nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, 
offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, 
{pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10


which is unexpected, to me.





Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Avi Kivity

On 03/06/2017 05:38 PM, Jens Axboe wrote:

On 03/06/2017 08:29 AM, Avi Kivity wrote:


On 03/06/2017 05:19 PM, Jens Axboe wrote:

On 03/06/2017 01:25 AM, Jan Kara wrote:

On Sun 05-03-17 16:56:21, Avi Kivity wrote:

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel to the best of its ability to complete
and if it returns -EAGAIN, can defer it to another thread.


Is it not possible to push the iocb to a workqueue?  This will allow
existing userspace to work with the new functionality, unchanged. Any
userspace implementation would have to do the same thing, so it's not like
we're saving anything by pushing it there.

That is not easy because until IO is fully submitted, you need some parts
of the context of the process which submits the IO (e.g. memory mappings,
but possibly also other credentials). So you would need to somehow transfer
this information to the workqueue.

Outside of technical challenges, the API also needs to return EAGAIN or
start blocking at some point. We can't expose a direct connection to
queue work like that, and let any user potentially create millions of
pending work items (and IOs).

You wouldn't expect more concurrent events than the maxevents parameter
that was supplied to io_setup syscall; it should have reserved any
resources needed.

Doesn't matter what limit you apply, my point still stands - at some
point you have to return EAGAIN, or block. Returning EAGAIN without
the caller having flagged support for that change of behavior would
be problematic.


Doesn't it already return EAGAIN (or some other error) if you exceed 
maxevents?



And for this to really work, aio would need some serious help in
how it applies limits. It looks like a hot mess.


For sure.  I think it would be a shame to create more user-facing 
complexity.




Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Avi Kivity



On 03/06/2017 05:19 PM, Jens Axboe wrote:

On 03/06/2017 01:25 AM, Jan Kara wrote:

On Sun 05-03-17 16:56:21, Avi Kivity wrote:

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel to the best of its ability to complete
and if it returns -EAGAIN, can defer it to another thread.


Is it not possible to push the iocb to a workqueue?  This will allow
existing userspace to work with the new functionality, unchanged. Any
userspace implementation would have to do the same thing, so it's not like
we're saving anything by pushing it there.

That is not easy because until IO is fully submitted, you need some parts
of the context of the process which submits the IO (e.g. memory mappings,
but possibly also other credentials). So you would need to somehow transfer
this information to the workqueue.

Outside of technical challenges, the API also needs to return EAGAIN or
start blocking at some point. We can't expose a direct connection to
queue work like that, and let any user potentially create millions of
pending work items (and IOs).


You wouldn't expect more concurrent events than the maxevents parameter 
that was supplied to io_setup syscall; it should have reserved any 
resources needed.



  That's why the current API is safe, even
though it does suck that it blocks seemingly randomly for users.




Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-06 Thread Avi Kivity

On 03/06/2017 10:25 AM, Jan Kara wrote:

On Sun 05-03-17 16:56:21, Avi Kivity wrote:

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel to the best of its ability to complete
and if it returns -EAGAIN, can defer it to another thread.


Is it not possible to push the iocb to a workqueue?  This will allow
existing userspace to work with the new functionality, unchanged. Any
userspace implementation would have to do the same thing, so it's not like
we're saving anything by pushing it there.

That is not easy because until IO is fully submitted, you need some parts
of the context of the process which submits the IO (e.g. memory mappings,
but possibly also other credentials). So you would need to somehow transfer
this information to the workqueue.




It's at least possible to pass the mm_struct to the workqueue, and I 
imagine other process attributes.  But I appreciate the difficulty.
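
(A hypothetical kernel-side sketch of that: a work item carrying the
submitter's mm and adopting it with use_mm()/unuse_mm(). The credentials and
other context Jan mentions are deliberately ignored here, which is exactly
the hard part:)

struct deferred_iocb {
	struct work_struct	work;
	struct mm_struct	*mm;	/* reference taken at submit time */
	struct kiocb		*iocb;
};

static void deferred_iocb_fn(struct work_struct *work)
{
	struct deferred_iocb *d = container_of(work, struct deferred_iocb, work);

	use_mm(d->mm);			/* adopt the submitter's address space */
	/* ... issue the I/O described by d->iocb ... */
	unuse_mm(d->mm);
	mmput(d->mm);
	kfree(d);
}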


It would be quite annoying to have to keep a large number of worker 
threads active, just in case aio is not working.  Modern NVMes have 
fairly deep queues, and at the worst case, you'll need one thread for 
each I/O to keep everything busy.
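
(Illustrative arithmetic only, not from the thread:)

	/* e.g. 8 submission queues x 1024 entries each ~ 8192 commands in
	 * flight; a thread-per-blocked-I/O fallback needs on that order of
	 * sleeping threads just to keep a single device saturated. */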



Re: [PATCH 0/8 v2] Non-blocking AIO

2017-03-05 Thread Avi Kivity



On 03/01/2017 01:36 AM, Goldwyn Rodrigues wrote:

This series adds nonblocking feature to asynchronous I/O writes.
io_submit() can be delayed for a number of reasons:
  - Block allocation for files
  - Data writebacks for direct I/O
  - Sleeping because of waiting to acquire i_rwsem
  - Congested block device


We've been hit by a few of these so this change is very welcome.



The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel to the best of its ability to complete
and if it returns -EAGAIN, can defer it to another thread.



Is it not possible to push the iocb to a workqueue?  This will allow 
existing userspace to work with the new functionality, unchanged. Any 
userspace implementation would have to do the same thing, so it's not 
like we're saving anything by pushing it there.



In order to enable this, IOCB_FLAG_NOWAIT is introduced in
uapi/linux/aio_abi.h which translates to IOCB_NOWAIT for struct iocb,
BIO_NOWAIT for bio and IOMAP_NOWAIT for iomap.
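
(On the userspace side the intended pattern would presumably look something
like this -- a sketch against the raw aio ABI; IOCB_FLAG_NOWAIT is the flag
proposed by this series, and defer_to_worker_thread() is only a placeholder
for the application's own fallback:)

	struct iocb cb = {
		.aio_fildes     = fd,
		.aio_lio_opcode = IOCB_CMD_PWRITE,
		.aio_buf        = (__u64)(unsigned long)buf,
		.aio_nbytes     = len,
		.aio_offset     = off,
		.aio_flags      = IOCB_FLAG_NOWAIT,	/* new in this series */
	};
	struct iocb *cbp = &cb;

	if (syscall(__NR_io_submit, ctx, 1, &cbp) < 0 && errno == EAGAIN)
		defer_to_worker_thread(&cb);		/* placeholder */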

This feature is provided for direct I/O of asynchronous I/O only. I have
tested it against xfs, ext4, and btrfs.

Changes since v1:
  + Forwardported from 4.9.10
  + changed name from _NONBLOCKING to *_NOWAIT
  + filemap_range_has_page call moved closer to (just before) the call to
filemap_write_and_wait_range().
  + BIO_NOWAIT limited to get_request()
  + XFS fixes
- included reflink
- use of xfs_ilock_nowait() instead of a XFS_IOLOCK_NONBLOCKING flag
- Translate the flag through IOMAP_NOWAIT (iomap) to check for
  block allocation for the file.
  + ext4 coding style




Re: Problems with nodatacow/nodatasum

2012-05-13 Thread Avi Kivity
On Sat, Apr 21, 2012 at 3:15 AM, Chris Mason  wrote:
>>
>> Are there plans to allow per-subvolume nodatasum/nodatacow?
>
> It can be set on a per file basis, let me push out a commit to btrfs
> progs with ioctls to set it.

Did this not happen, or am I barking up the wrong btrfs-progs.git tree?
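
(For what it's worth, a per-file no-COW flag did end up reachable through the
generic inode-flags ioctl, i.e. what chattr +C does; a minimal sketch,
assuming FS_NOCOW_FL from <linux/fs.h>, and note the flag only takes effect
on a still-empty file. Whether this is the btrfs-progs change Chris meant,
I can't say:)

#include <sys/ioctl.h>
#include <linux/fs.h>
#include <fcntl.h>
#include <unistd.h>

/* Mark a newly created, still-empty file nodatacow. */
int make_nocow(const char *path)
{
	int fd = open(path, O_CREAT | O_RDWR, 0644);
	int flags = 0, ret;

	if (fd < 0)
		return -1;
	ioctl(fd, FS_IOC_GETFLAGS, &flags);
	flags |= FS_NOCOW_FL;			/* same bit chattr +C sets */
	ret = ioctl(fd, FS_IOC_SETFLAGS, &flags);
	close(fd);
	return ret;
}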


Re: Problems with nodatacow/nodatasum

2012-04-20 Thread Avi Kivity
On Fri, Apr 20, 2012 at 12:12 PM, David Sterba  wrote:
> On Fri, Apr 20, 2012 at 11:19:39AM +0300, Avi Kivity wrote:
>>   /dev/mapper/luks-blah /                       btrfs
>> subvol=/rootvol        1 1
>>   /dev/mapper/luks-blah /var/lib/libvirt/images     btrfs
>> nodatasum,nodatacow,subvol=/images.libvirt        1 2
>
> what does /proc/mounts say about the applied options? do you see
> nodatasum and nodatacow there? afaik most (if not all) mount options
> affect the whole filesystem including any subvolume mounts.  having
> per-subvol options is possible, just not implemented.
>

Nothing, the options don't show up there.

Are there plans to allow per-subvolume nodatasum/nodatacow?


Problems with nodatacow/nodatasum

2012-04-20 Thread Avi Kivity
I have a btrfs filesystem mounted in two locations as two subvolumes:


  /dev/mapper/luks-blah /                       btrfs
subvol=/rootvol        1 1
  /dev/mapper/luks-blah /var/lib/libvirt/images     btrfs
nodatasum,nodatacow,subvol=/images.libvirt        1 2

However, a file under the second mount is getting seriously
fragmented.  It started out with a few dozen extents (reasonable for a
file of several gigabytes); now it's at 11000 and counting, after an application
started pounding on it with a bit of threaded O_DIRECT random I/O.

3.3.1-5.fc16.x86_64

Any hints?


Re: btrfs oops (autodefrag related?)

2012-03-13 Thread Avi Kivity
On 03/13/2012 02:04 AM, Chris Mason wrote:
> On Mon, Mar 12, 2012 at 09:32:54PM +0200, Avi Kivity wrote:
> > Because I'm such a btrfs fanboi I'm running btrfs on my /, all past
> > experience notwithstanding.  In an attempt to recover some performance,
> > I enabled autodefrag, and got this in return:
>
> Hi Avi,
>
> This one was fixed in the 3.3 series.  You can pull from my for-linus
> repo for a commit against 3.2.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus
>
> The individual fix is here:
>
> http://git.kernel.org/?p=linux/kernel/git/mason/linux-btrfs.git;a=commit;h=87826df0ec36fc28884b4ddbb3f3af41c4c2008f
>
>

Thanks.  Suggest queueing it for -stable.

-- 
error compiling committee.c: too many arguments to function



btrfs oops (autodefrag related?)

2012-03-12 Thread Avi Kivity
Because I'm such a btrfs fanboi I'm running btrfs on my /, all past
experience notwithstanding.  In an attempt to recover some performance,
I enabled autodefrag, and got this in return:

[567304.937620] [ cut here ]
[567304.938525] kernel BUG at fs/btrfs/inode.c:1588!
[567304.938525] invalid opcode:  [#1] SMP
[567304.938525] CPU 0
[567304.938525] Modules linked in: vfat fat usb_storage binfmt_misc
tcp_lp ppdev parport_pc lp parport fuse ebtable_nat ebtables be2iscsi
ipt_MASQUERADE iscsi_boot_sysfs iptable_nat nf_nat bnx2i xt_CHECKSUM
cnic iptable_mangle bridge stp llc uio cxgb4i cxgb4 cxgb3i libcxgbi
cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr
iscsi_tcp libiscsi_tcp libiscsi lockd rfcomm scsi_transport_iscsi bnep
nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_conntrack_ipv6
nf_defrag_ipv6 xt_state ip6table_filter ip6_tables nf_conntrack
snd_hda_codec_hdmi snd_hda_codec_conexant btusb bluetooth snd_hda_intel
snd_hda_codec uvcvideo snd_hwdep snd_seq arc4 snd_seq_device snd_pcm
videodev media thinkpad_acpi iwlwifi i2c_i801 snd_timer mac80211 tpm_tis
tpm tpm_bios v4l2_compat_ioctl32 e1000e cfg80211 iTCO_wdt snd
snd_page_alloc soundcore iTCO_vendor_support microcode joydev rfkill
vhost_net macvtap macvlan tun virtio_net kvm_intel kvm sunrpc uinput xts
gf128mul dm_crypt btrfs zlib_deflate libcrc32c sdhci_pci sdhci mmc_core
wmi i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded:
scsi_wait_scan]
[567304.938525]
[567304.938525] Pid: 533, comm: btrfs-fixup-1 Not tainted
3.2.6-3.fc16.x86_64 #1 LENOVO 4174BH4/4174BH4
[567304.938525] RIP: 0010:[]  []
btrfs_writepage_fixup_worker+0x152/0x160 [btrfs]
[567304.948748] RSP: 0018:88011276fde0  EFLAGS: 00010246
[567304.948748] RAX:  RBX: ea000416a000 RCX:
0664a000
[567304.948748] RDX: 8800452cbde8 RSI:  RDI:
8800452cbde8
[567304.948748] RBP: 88011276fe30 R08: 88011e21a780 R09:
88011276fd98
[567304.948748] R10: 0001 R11: 0010 R12:
0664a000
[567304.948748] R13: 88002ecc0190 R14:  R15:
0664afff
[567304.948748] FS:  () GS:88011e20()
knlGS:
[567304.948748] CS:  0010 DS:  ES:  CR0: 8005003b
[567304.948748] CR2: 07ff00ac2000 CR3: bbd92000 CR4:
000426e0
[567304.948748] DR0:  DR1:  DR2:

[567304.948748] DR3:  DR6: 0ff0 DR7:
0400
[567304.948748] Process btrfs-fixup-1 (pid: 533, threadinfo
88011276e000, task 8801125e5c80)
[567304.948748] Stack:
[567304.948748]  880114cb4ea8 88002ecc0030 
88009b09b8c0
[567304.948748]   880116896360 880114cb4ed0
8801168963b0
[567304.948748]  880116896378 88011276fe98 88011276fee0
a0163647
[567304.948748] Call Trace:
[567304.948748]  [] worker_loop+0x147/0x530 [btrfs]
[567304.948748]  [] ? btrfs_queue_worker+0x2e0/0x2e0
[btrfs]
[567304.948748]  [] kthread+0x8c/0xa0
[567304.948748]  [] kernel_thread_helper+0x4/0x10
[567304.948748]  [] ? kthread_worker_fn+0x190/0x190
[567304.948748]  [] ? gs_change+0x13/0x13
[567304.948748] Code: 00 48 8b 7d b8 48 8d 4d c8 41 b8 50 00 00 00 4c 89
fa 4c 89 e6 e8 3f 8f 01 00 eb b3 48 89 df e8 c5 bb fd e0 0f 1f 44 00 00
eb 92 <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57 41
[567304.948748] RIP  []
btrfs_writepage_fixup_worker+0x152/0x160 [btrfs]
[567304.948748]  RSP 
[567305.036430] ---[ end trace 642b0cfbec5885d3 ]---

The thread that died was doing some O_DIRECT I/O.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: Suddenly, a dead filesystem

2011-10-04 Thread Avi Kivity
On Tue, Oct 4, 2011 at 10:01 PM, Josef Bacik  wrote:
> On 10/04/2011 03:50 PM, Avi Kivity wrote:
>>> Yup I would love to investigate this, what kind of snapshot?
>>
>> LVM snapshot.
>>
>>> Did you dd
>>> the drive or soemthing?  Let me know where to pull it down.
>>
>> Um, it's my /home.  It's not leaving home.  But if you have test
>> kernels or btrfscks you want to throw at it, I'm happy to help.
>
> Hah well you can run btrfs-image -c <0-9> -t  on your
> device it will create a image of the broken fs for me to look at and
> doesn't copy any of the data and zero's out any inline extents, the only
> thing I'll potentially would be able to see is the file names.  If
> that's acceptable just upload it somewhere for me to pull down.  If not
> I'll put together a debug patch for you to try out.  Thanks,
>

Debug patches only, please.


Re: Suddenly, a dead filesystem

2011-10-04 Thread Avi Kivity
> Yup I would love to investigate this, what kind of snapshot?

LVM snapshot.

> Did you dd
> the drive or soemthing?  Let me know where to pull it down.

Um, it's my /home.  It's not leaving home.  But if you have test
kernels or btrfscks you want to throw at it, I'm happy to help.


Re: Suddenly, a dead filesystem

2011-10-04 Thread Avi Kivity
On Tue, Oct 4, 2011 at 4:39 PM, Avi Kivity  wrote:
> On Mon, Oct 3, 2011 at 8:07 PM, Avi Kivity  wrote:
>>>>
>>>> Thats -EIO, is there any messages before the bug?  Thanks,
>>>>
>>>
>>> Not that I recall.  I'll check again when I'm near the poor thing again.
>>>
>>
>> Confirmed - that's the first relevant message.
>>
>> As to -EIO, I dd'ed the entire volume and no errors were found.
>>
>
> btw, is there a new version of fsck.btrfs I can try?  The one I have
> segfaults immediately.
>

Meanwhile, I used btrfs-zero-log and was able to remount.  I have a
snapshot of the dead filesystem, if someone wants to investigate.


Re: Suddenly, a dead filesystem

2011-10-04 Thread Avi Kivity
On Mon, Oct 3, 2011 at 8:07 PM, Avi Kivity  wrote:
>>>
>>> Thats -EIO, is there any messages before the bug?  Thanks,
>>>
>>
>> Not that I recall.  I'll check again when I'm near the poor thing again.
>>
>
> Confirmed - that's the first relevant message.
>
> As to -EIO, I dd'ed the entire volume and no errors were found.
>

btw, is there a new version of fsck.btrfs I can try?  The one I have
segfaults immediately.


Re: Suddenly, a dead filesystem

2011-10-03 Thread Avi Kivity
>>
>> Thats -EIO, is there any messages before the bug?  Thanks,
>>
>
> Not that I recall.  I'll check again when I'm near the poor thing again.
>

Confirmed - that's the first relevant message.

As to -EIO, I dd'ed the entire volume and no errors were found.


Re: Suddenly, a dead filesystem

2011-10-03 Thread Avi Kivity
On Mon, Oct 3, 2011 at 4:34 PM, Josef Bacik  wrote:
> On 09/30/2011 05:40 PM, Avi Kivity wrote:
>> On Wed, Aug 24, 2011 at 12:08 AM, Avi Kivity  wrote:
>>>> This is fixed upstream, I've sent the patch to -stable so hopefully it
>>>> will show up in fedora soon, but in the meantime can you try Linus's
>>>> tree and verify that it fixes it?  Thanks,
>>>
>>> Thanks, it appears to be fixed (at least in a VM).  Will try native soon.
>>>
>>
>> It's back.  3.1-rc8:
>>
>>
>
> Thats -EIO, is there any messages before the bug?  Thanks,
>

Not that I recall.  I'll check again when I'm near the poor thing again.


Re: Suddenly, a dead filesystem

2011-09-30 Thread Avi Kivity
On Wed, Aug 24, 2011 at 12:08 AM, Avi Kivity  wrote:
>> This is fixed upstream, I've sent the patch to -stable so hopefully it
>> will show up in fedora soon, but in the meantime can you try Linus's
>> tree and verify that it fixes it?  Thanks,
>
> Thanks, it appears to be fixed (at least in a VM).  Will try native soon.
>

It's back.  3.1-rc8:


[   24.431935] kernel BUG at fs/btrfs/tree-log.c:1690!
[   24.431935] invalid opcode:  [#1] SMP
[   24.431935] CPU 0
[   24.431935] Modules linked in:
[   24.431935]
[   24.431935] Pid: 1, comm: swapper Not tainted 3.1.0-rc8+ #4 Bochs Bochs
[   24.431935] RIP: 0010:[]  []
replay_one_buffer+0x24c/0x330
[   24.431935] RSP: 0018:8800078d7810  EFLAGS: 00010282
[   24.431935] RAX: fffb RBX: 0002 RCX: fffb921b
[   24.431935] RDX: ff24 RSI: 0001 RDI: 0286
[   24.431935] RBP: 8800078d78b0 R08: 880006cabf18 R09: 9018
[   24.431935] R10: 0001 R11:  R12: 0097
[   24.431935] R13: 8800078d79b8 R14: 8800073755a0 R15: 000c
[   24.431935] FS:  () GS:880007c0()
knlGS:
[   24.431935] CS:  0010 DS:  ES:  CR0: 8005003b
[   24.431935] CR2:  CR3: 01a05000 CR4: 06f0
[   24.431935] DR0:  DR1:  DR2: 
[   24.431935] DR3:  DR6: 0ff0 DR7: 0400
[   24.431935] Process swapper (pid: 1, threadinfo 8800078d6000,
task 8800078d8000)
[   24.431935] Stack:
[   24.431935]  8800078d785e 880007372090 8800078d78a0
8126b38d
[   24.431935]  880007388000 880006d0b000 880007372120
880006d0b400
[   24.431935]  9d48 00051000 060c
0500
[   24.431935] Call Trace:
[   24.431935]  [] ? alloc_extent_buffer+0x7d/0x420
[   24.431935]  [] walk_down_log_tree+0x1ea/0x3b0
[   24.431935]  [] walk_log_tree+0xbd/0x1d0
[   24.431935]  [] btrfs_recover_log_trees+0x211/0x300
[   24.431935]  [] ? fixup_inode_link_counts+0x150/0x150
[   24.431935]  [] open_ctree+0x13f4/0x17a0
[   24.431935]  [] ? disk_name+0xba/0xc0
[   24.431935]  [] btrfs_mount+0x426/0x5d0
[   24.431935]  [] mount_fs+0x20/0xd0
[   24.431935]  [] vfs_kern_mount+0x6a/0xc0
[   24.431935]  [] do_kern_mount+0x54/0x110
[   24.431935]  [] do_mount+0x53a/0x840
[   24.431935]  [] ? memdup_user+0x4b/0x90
[   24.431935]  [] ? strndup_user+0x5b/0x80
[   24.431935]  [] sys_mount+0x98/0xf0
[   24.431935]  [] mount_block_root+0xe4/0x28a
[   24.431935]  [] ? sys_mknodat+0xc3/0x1f0
[   24.431935]  [] mount_root+0x53/0x57
[   24.431935]  [] prepare_namespace+0x16d/0x1a6
[   24.431935]  [] kernel_init+0x153/0x158
[   24.431935]  [] ? schedule_tail+0x27/0xb0
[   24.431935]  [] kernel_thread_helper+0x4/0x10
[   24.431935]  [] ? start_kernel+0x3bb/0x3bb
[   24.431935]  [] ? gs_change+0x13/0x13
[   24.431935] Code: 8b 4d 90 48 8b 55 88 48 8b 75 98 41 89 d9 48 89
04 24 4d 89 f0 e8 e5 ea ff ff 83 f8 fe 0f 84 a2 fe ff ff 85 c0 0f 84
9a fe ff ff <0f> 0b 66 90 49 8b 7d 20 48 8b 55 90 4c 8d 4d ae 48 8b 75
98 41
[   24.431935] RIP  [] replay_one_buffer+0x24c/0x330
[   24.431935]  RSP 
[   24.573567] ---[ end trace f1fff9aaced9257d ]---
[   24.575888] Kernel panic - not syncing: Attempted to kill init!
[   24.578587] Pid: 1, comm: swapper Tainted: G  D 3.1.0-rc8+ #4
[   24.581490] Call Trace:
[   24.582643]  [] panic+0x91/0x1a7
[   24.584839]  [] do_exit+0x760/0x830
[   24.587148]  [] ? kmsg_dump+0x4a/0xe0
[   24.589520]  [] oops_end+0xab/0xf0
[   24.591794]  [] die+0x58/0x90
[   24.593879]  [] do_trap+0xc4/0x170
[   24.596162]  [] do_invalid_op+0x95/0xb0
[   24.598672]  [] ? replay_one_buffer+0x24c/0x330
[   24.601534]  [] ? iput+0x103/0x200
[   24.603767]  [] ? add_inode_ref+0x500/0x520
[   24.606333]  [] invalid_op+0x1b/0x20
[   24.608647]  [] ? replay_one_buffer+0x24c/0x330
[   24.611366]  [] ? alloc_extent_buffer+0x7d/0x420
[   24.614119]  [] walk_down_log_tree+0x1ea/0x3b0
[   24.616829]  [] walk_log_tree+0xbd/0x1d0
[   24.619168]  [] btrfs_recover_log_trees+0x211/0x300
[   24.621825]  [] ? fixup_inode_link_counts+0x150/0x150
[   24.624669]  [] open_ctree+0x13f4/0x17a0
[   24.627186]  [] ? disk_name+0xba/0xc0
[   24.629735]  [] btrfs_mount+0x426/0x5d0
[   24.632180]  [] mount_fs+0x20/0xd0
[   24.634475]  [] vfs_kern_mount+0x6a/0xc0
[   24.636989]  [] do_kern_mount+0x54/0x110
[   24.639528]  [] do_mount+0x53a/0x840
[   24.641886]  [] ? memdup_user+0x4b/0x90
[   24.644360]  [] ? strndup_user+0x5b/0x80
[   24.646854]  [] sys_mount+0x98/0xf0
[   24.649200]  [] mount_block_root+0xe4/0x28a
[   24.651793]  [] ? sys_mknodat+0xc3/0x1f0
[   24.654305]  [] mount_root+0x53/0x57
[   24.656610]  [] prepare_namespace+0x16d/0x1a6
[   24.659370]  [] kernel_init+0x153/0x158
[   24.661878]  [] ? schedule_tail+0x27/0xb0

Re: Suddenly, a dead filesystem

2011-08-23 Thread Avi Kivity
On Tue, Aug 23, 2011 at 5:27 PM, Hugo Mills  wrote:
>
>   Could you give a bit more information about what happened in the
> crash immediately prior to the FS being unmountable,

I was talking on the phone (HTC Desire running Android 2.2).

> and how your
> storage is configured?

/dev/sda2 -> physical volume -> logical volume -> luks encryption -> btrfs


Re: Suddenly, a dead filesystem

2011-08-23 Thread Avi Kivity
> This is fixed upstream, I've sent the patch to -stable so hopefully it
> will show up in fedora soon, but in the meantime can you try Linus's
> tree and verify that it fixes it?  Thanks,

Thanks, it appears to be fixed (at least in a VM).  Will try native soon.


Suddenly, a dead filesystem

2011-08-23 Thread Avi Kivity
I've been using btrfs for a while as my /home (converted from ext4;
encrypted lvm) when it died on me.  Mounting it crashes immediately,
here's a log:

[    6.455721] [ cut here ]
[    6.456117] kernel BUG at fs/btrfs/inode.c:4586!
[    6.456117] invalid opcode:  [#1] SMP
[    6.456117] CPU 0
[    6.456117] Modules linked in: btrfs zlib_deflate libcrc32c
[    6.456117]
[    6.456117] Pid: 243, comm: mount Not tainted
2.6.40.3-0.fc15.x86_64 #1 Bochs Bochs
[    6.456117] RIP: 0010:[]  []
btrfs_add_link+0x123/0x17c [btrfs]
[    6.456117] RSP: 0018:880007ac9838  EFLAGS: 00010282
[    6.456117] RAX: ffef RBX: 880003ec3938 RCX: 0ed7
[    6.456117] RDX: 0ed6 RSI: 60fff8e013b0 RDI: ea0da3b0
[    6.456117] RBP: 880007ac98a8 R08: a00123af R09: 0b23
[    6.456117] R10: 0b23 R11: 0002 R12: 880003ec3548
[    6.456117] R13: 880007863000 R14: 000d R15: 880007798500
[    6.456117] FS:  7effe7241820() GS:880006e0()
knlGS:
[    6.456117] CS:  0010 DS:  ES:  CR0: 80050033
[    6.456117] CR2: 7effe62cc580 CR3: 07b7c000 CR4: 06f0
[    6.456117] DR0:  DR1:  DR2: 
[    6.456117] DR3:  DR6: 0ff0 DR7: 0400
[    6.456117] Process mount (pid: 243, threadinfo 880007ac8000,
task 880002d4)
[    6.456117] Stack:
[    6.456117]  8801 02cb 880007ac9878
880003e5d000
[    6.456117]   3fff880002c5f000 01002894

[    6.456117]  1000 880003e5a120 880003ec3548
880007ac99c7
[    6.456117] Call Trace:
[    6.456117]  [] add_inode_ref+0x2e6/0x37c [btrfs]
[    6.456117]  [] ? read_extent_buffer+0xc3/0xe3 [btrfs]
[    6.456117]  [] replay_one_buffer+0x197/0x212 [btrfs]
[    6.456117]  [] walk_down_log_tree+0x15a/0x2c1 [btrfs]
[    6.456117]  [] walk_log_tree+0x7f/0x19e [btrfs]
[    6.456117]  [] ? radix_tree_lookup+0xb/0xd
[    6.456117]  [] btrfs_recover_log_trees+0x28b/0x298 [btrfs]
[    6.456117]  [] ? replay_one_dir_item+0xbd/0xbd [btrfs]
[    6.456117]  [] open_ctree+0x10f1/0x13ff [btrfs]
[    6.456117]  [] btrfs_mount+0x233/0x496 [btrfs]
[    6.456117]  [] ? pcpu_next_pop+0x3d/0x4a
[    6.456117]  [] ? pcpu_alloc+0x7f7/0x833
[    6.456117]  [] mount_fs+0x69/0x155
[    6.456117]  [] ? __alloc_percpu+0x10/0x12
[    6.456117]  [] vfs_kern_mount+0x63/0x9d
[    6.456117]  [] do_kern_mount+0x4d/0xdf
[    6.456117]  [] do_mount+0x63c/0x69f
[    6.456117]  [] ? memdup_user+0x55/0x7d
[    6.456117]  [] ? strndup_user+0x3b/0x51
[    6.456117]  [] sys_mount+0x88/0xc2
[    6.456117]  [] system_call_fastpath+0x16/0x1b
[    6.456117] Code: 89 f1 4c 89 fa 4c 89 ee 48 89 44 24 08 41 8b 04
24 66 c1 e8 0c 83 e0 0f 0f b6 80 78 eb 06 a0 89 04 24 e8 8c d5 fe ff
85 c0 74 02 <0f> 0b 45 01 f6 4d 63 f6 4c 03 b3 d0 00 00 00 4c 89 b3 d0
00 00
[    6.456117] RIP  [] btrfs_add_link+0x123/0x17c [btrfs]
[    6.456117]  RSP 
[    6.592232] ---[ end trace 44b5956456a7dc01 ]---

Tried btrfsck, immediate segfault.

Both kernel and btrfsprogs are stock Fedora 15.  I still have the
logical volume and would like to recover it.  Its fairly easy to try
out things in a virtual machine, so if you have a patch you want me to
try out, I'm here.


Re: [PATCH 4/6] Ext4: fail if we try to use hole punch

2010-11-16 Thread Avi Kivity

On 11/16/2010 02:50 PM, Josef Bacik wrote:

On Tue, Nov 16, 2010 at 02:25:35PM +0200, Avi Kivity wrote:
>  On 11/15/2010 07:05 PM, Josef Bacik wrote:
>>  Ext4 doesn't have the ability to punch holes yet, so make sure we return
>>  EOPNOTSUPP if we try to use hole punching through fallocate.  This support 
can
>>  be added later.  Thanks,
>>
>
>  Instead of teaching filesystems to fail if they don't support the
>  capability, why don't supporting filesystems say so, allowing the fail
>  code to be in common code?
>

There is no simple way to test if a filesystem supports hole punching or not so
the check has to be done per fs.  Thanks,


Could put a flag word in superblock_operations.  Filesystems which 
support punching (or other features) can enable it there.


Or even have its own callback.
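
(Purely to illustrate the suggestion -- the field and flag names below are
made up, nothing like this exists in the VFS as-is -- the common fallocate
path could then do something like:)

/* Hypothetical: filesystems advertise hole punching once, in their s_op. */
static const struct super_operations xfs_super_operations = {
	/* ... */
	.s_features = FS_FEATURE_PUNCH_HOLE,	/* made-up field and flag */
};

static long fallocate_common(struct file *file, int mode, loff_t offset, loff_t len)
{
	if ((mode & FALLOC_FL_PUNCH_HOLE) &&
	    !(file->f_path.dentry->d_inode->i_sb->s_op->s_features & FS_FEATURE_PUNCH_HOLE))
		return -EOPNOTSUPP;
	/* ... */
	return 0;
}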

--
error compiling committee.c: too many arguments to function



Re: [PATCH 4/6] Ext4: fail if we try to use hole punch

2010-11-16 Thread Avi Kivity

On 11/15/2010 07:05 PM, Josef Bacik wrote:

Ext4 doesn't have the ability to punch holes yet, so make sure we return
EOPNOTSUPP if we try to use hole punching through fallocate.  This support can
be added later.  Thanks,



Instead of teaching filesystems to fail if they don't support the 
capability, why don't supporting filesystems say so, allowing the fail 
code to be in common code?


--
error compiling committee.c: too many arguments to function



Re: Poor performance with qemu

2010-04-08 Thread Avi Kivity

On 04/08/2010 06:34 PM, Christoph Hellwig wrote:

On Thu, Apr 08, 2010 at 06:28:54PM +0300, Avi Kivity wrote:
   

When it updates qcow2 metadata or when the guest issues a barrier.  It's
relatively new.  I have a patch that introduces cache=volatile somewhere.
 

qcow2 does not issue any fsyncs by itself, it only passes through the
guest's ones.  The only other places issuing fsyncs are committing a COW
image back to the base image, and migration.
   


Shouldn't it do that then?  What's the point of fsyncing guest data if 
qcow2 metadata is volatile?


--
error compiling committee.c: too many arguments to function



Re: Poor performance with qemu

2010-04-08 Thread Avi Kivity

On 04/08/2010 06:26 PM, Chris Mason wrote:



Once the O_DIRECT read patch is in, you can switch to that, or tell qemu
to use a writeback cache instead.
   

Even with writeback qemu will issue a lot of fsyncs.
 

Oh, I didn't see that when I was testing, when does it fsync?
   


When it updates qcow2 metadata or when the guest issues a barrier.  It's 
relatively new.  I have a patch that introduces cache=volatile somewhere.


--
error compiling committee.c: too many arguments to function



Re: Poor performance with qemu

2010-04-08 Thread Avi Kivity

On 03/30/2010 03:56 PM, Chris Mason wrote:

On Sun, Mar 28, 2010 at 05:18:03PM +0200, Diego Calleja wrote:
   

Hi, I'm using KVM, and the virtual disk (a 20 GB file using the "raw"
qemu format according to virt-manager and, of course, placed on a btrfs
filesystem, running the latest mainline git) is awfully slow, no matter
what OS is running inside the VM. The PCBSD installer says it's copying
data at a 40-50 KB/s rate. Is someone using KVM and having better numbers
than me? How can I help to debug this workload?
 

The problem is that qemu uses O_SYNC by default, which makes btrfs do
log commits for every write.
   


Problem is, btrfs takes the 50 KB/s guest rate and inflates it to 
something much larger (megabytes/sec).  Are there plans to reduce the 
amount of O_SYNC overhead writes?


I saw this too, but with 2.6.31 or 2.6.32 IIRC.


Once the O_DIRECT read patch is in, you can switch to that, or tell qemu
to use a writeback cache instead.
   


Even with writeback qemu will issue a lot of fsyncs.

--
error compiling committee.c: too many arguments to function



Re: [PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-14 Thread Avi Kivity

Nick Piggin wrote:

(no they're not, Nick's ticket locks still spin on a shared cacheline
IIRC -- the MCS locks mentioned could fix this)



It reminds me. I wrote a basic variation of MCS spinlocks a while back, and
converted the dcache lock to use it, which showed large dbench improvements on
a big machine (of course for different reasons than the dbench improvements
in this thread).

http://lkml.org/lkml/2008/8/28/24

Each "lock" object is sane in size because a given set of spin-local queues may
only be used once per lock stack. But any spinlocks within a mutex acquisition
will always be at the bottom of such a stack anyway, by definition.

If you can use any code or concept for your code, that would be great.
  


Does it make sense to replace 'nest' with a per-cpu counter that's 
incremented on each lock?  I guess you'd have to search for the value of 
nest on unlock, but it would be a very short search (typically length 1, 2 
if lock sorting is used to avoid deadlocks).


I think you'd need to make the lock store the actual node pointer, not 
the cpu number, since the values of nest would be different on each cpu.


That would allow you to replace spinlocks with mcs_locks wholesale.

--
error compiling committee.c: too many arguments to function



Re: [PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-12 Thread Avi Kivity

Peter Zijlstra wrote:

Spinlocks can use 'pure' MCS locks.
  


How about this, then.  In mutex_lock(), keep wait_lock locked and only 
release it when scheduling out.  Waiter spinning naturally follows.  If 
spinlocks are cache friendly (are they today?) we inherit that.  If 
there is no contention on the mutex, then we don't need to reacquire the 
wait_lock on mutex_unlock() (not that the atomic op is that expensive 
these days).


--
error compiling committee.c: too many arguments to function



Re: [PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-12 Thread Avi Kivity

Peter Zijlstra wrote:

On Mon, 2009-01-12 at 18:13 +0200, Avi Kivity wrote:

  
One thing that worries me here is that the spinners will spin on a 
memory location in struct mutex, which means that the cacheline holding 
the mutex (which is likely to be under write activity from the owner) 
will be continuously shared by the spinners, slowing the owner down when 
it needs to unshare it.  One way out of this is to spin on a location in 
struct mutex_waiter, and have the mutex owner touch it when it schedules 
out.



Yeah, that is what pure MCS locks do -- however I don't think its a
feasible strategy for this spin/sleep hybrid.
  


Bummer.


So:
- each task_struct has an array of currently owned mutexes, appended to 
by mutex_lock()



That's not going to fly I think. Lockdep does this but it's very
expensive and has some issues. We're currently at 48 max owners, and
still some code paths manage to exceed that.
  


Might make it per-cpu instead, and set a bit in the mutex when 
scheduling out so we know not to remove it from the list on unlock.



- mutex waiters spin on mutex_waiter.wait, which they initialize to zero
- when switching out of a task, walk the mutex list, and for each mutex, 
bump each waiter's wait variable, and clear the owner array



Which is O(n).
  


It may be better than O(n) cpus banging on the mutex for the lock 
duration.  Of course we should avoid walking the part of the list where 
non-spinning owners wait (or maybe have a separate list for spinners).


- when unlocking a mutex, bump the nearest waiter's wait variable, and 
remove from the owner array


Something similar might be done to spinlocks to reduce cacheline 
contention from spinners and the owner.



Spinlocks can use 'pure' MCS locks.
  


I'll read up on those, thanks.

--
error compiling committee.c: too many arguments to function



Re: [PATCH -v8][RFC] mutex: implement adaptive spinning

2009-01-12 Thread Avi Kivity

Peter Zijlstra wrote:

Subject: mutex: implement adaptive spinning
From: Peter Zijlstra 
Date: Mon Jan 12 14:01:47 CET 2009

Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.

This concept got ported to mainline from the -rt tree, where it was originally
implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.

Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
gave a 345% boost for VFS scalability on my testbox:

 # ./test-mutex-shm V 16 10 | grep "^avg ops"
 avg ops/sec:   296604

 # ./test-mutex-shm V 16 10 | grep "^avg ops"
 avg ops/sec:   85870

The key criteria for the busy wait is that the lock owner has to be running on
a (different) cpu. The idea is that as long as the owner is running, there is a
fair chance it'll release the lock soon, and thus we'll be better off spinning
instead of blocking/scheduling.

Since regular mutexes (as opposed to rtmutexes) do not atomically track the
owner, we add the owner in a non-atomic fashion and deal with the races in
the slowpath.

Furthermore, to ease the testing of the performance impact of this new code,
there is means to disable this behaviour runtime (without having to reboot
the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
by issuing the following command:

 # echo NO_OWNER_SPIN > /debug/sched_features

This command re-enables spinning again (this is also the default):

 # echo OWNER_SPIN > /debug/sched_features
  


One thing that worries me here is that the spinners will spin on a 
memory location in struct mutex, which means that the cacheline holding 
the mutex (which is likely to be under write activity from the owner) 
will be continuously shared by the spinners, slowing the owner down when 
it needs to unshare it.  One way out of this is to spin on a location in 
struct mutex_waiter, and have the mutex owner touch it when it schedules 
out.


So:
- each task_struct has an array of currently owned mutexes, appended to 
by mutex_lock()

- mutex waiters spin on mutex_waiter.wait, which they initialize to zero
- when switching out of a task, walk the mutex list, and for each mutex, 
bump each waiter's wait variable, and clear the owner array
- when unlocking a mutex, bump the nearest waiter's wait variable, and 
remove from the owner array


Something similar might be done to spinlocks to reduce cacheline 
contention from spinners and the owner.
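
(To make the cacheline point concrete, here is a minimal userspace sketch of
the per-waiter-spin idea in the spirit of an MCS lock, using C11 atomics --
an illustration of the general technique, not the adaptive-mutex scheme
described above:)

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
	struct mcs_node *_Atomic next;
	atomic_bool locked;		/* each waiter spins only on its own flag */
};

struct mcs_lock {
	struct mcs_node *_Atomic tail;	/* initialise to NULL */
};

void mcs_acquire(struct mcs_lock *lock, struct mcs_node *self)
{
	struct mcs_node *prev;

	atomic_store(&self->next, NULL);
	atomic_store(&self->locked, true);
	prev = atomic_exchange(&lock->tail, self);
	if (prev) {
		atomic_store(&prev->next, self);
		while (atomic_load(&self->locked))
			;		/* local spin: no shared cacheline */
	}
}

void mcs_release(struct mcs_lock *lock, struct mcs_node *self)
{
	struct mcs_node *next = atomic_load(&self->next);

	if (!next) {
		struct mcs_node *expected = self;

		if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
			return;		/* no waiters */
		while (!(next = atomic_load(&self->next)))
			;		/* a waiter is mid-enqueue; wait for the link */
	}
	atomic_store(&next->locked, false);	/* touch only the next waiter's line */
}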


--
error compiling committee.c: too many arguments to function



Re: Compile error of latest hotfix release of btrfs

2008-11-03 Thread Avi Kivity

yanhai zhu wrote:

sorry,

--

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9b37ce6..04a0e58 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2539,7 +2539,11 @@ int extent_readpages(struct extent_io_tree *tree,
/* open coding of lru_cache_add, also not exported */
page_cache_get(page);
if (!pagevec_add(&pvec, page))
+   #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,28)
+   __pagevec_lru_add_file(&pvec);
+   #else
__pagevec_lru_add(&pvec);
+   #endif
  


Suggest sticking in some compat header:

+#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,28)
+# define  __pagevec_lru_add_file __pagevec_lru_add
+#endif


To reduce impact on code that is intended to go to mainline.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: Some very basic questions

2008-10-22 Thread Avi Kivity

jim owens wrote:

For most SATA drives, disabling the write-back cache seems to take a high
toll on write throughput.  :-(


I measured this yesterday.  This is true for pure write workloads; 
for mixed read/write workloads the throughput decrease is negligible.


Different tests on different hardware
give different results at different times!



True.  But data loss is forever!



I got flamed for this on another list, but let's disable the write 
cache and live with the performance drop.


We don't get to decide this, customers do.


We get to pick the defaults.


As they say in the raid forum... fast, cheap, good - pick any 2


We can upgrade slow to fast, but !good gets upgraded to another fs.


P.S. no flames because we chose no-battery == disable-write-cache


ACK!

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: Some very basic questions

2008-10-22 Thread Avi Kivity

Tejun Heo wrote:

For most SATA drives, disabling the write-back cache seems to take a high
toll on write throughput.  :-(

  


I measured this yesterday.  This is true for pure write workloads; for 
mixed read/write workloads the throughput decrease is negligible.



As long as the error status is sticky, it doesn't have to hold on to
the data, it's not gonna be able to write it anyway.  The drive has to
hold onto the failure information only.  Yeah, but fully agreed on
that it's most likely dependent on the specific firmware.  There isn't
any requirement on how to handle write back failure in the ATA spec.
It wouldn't be too surprising if there are some drives which happily
report the old data after silent write failure followed by flush and
power loss at the right timing.


I got flamed for this on another list, but let's disable the write cache 
and live with the performance drop.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: Some very basic questions

2008-10-22 Thread Avi Kivity

Ric Wheeler wrote:
For any given set of disks, you "just" need to do the math to compute 
the utilized capacity, the expected rate of drive failure, the rebuild 
time and then see whether you can recover from your first failure 
before a 2nd disk dies.




Spare disks have the advantage of a fully linear access pattern 
(ignoring normal working load).  Spare capacity has the advantage of 
utilizing all devices (if you have a hundred-disk fs, all surviving 
disks participate in the rebuild; whereas with spare disks you are 
limited to the surviving raidset members).


Spare capacity also has the advantage that you don't need to rebuild 
free space.
In practice, this is not an academic question since drives do 
occasionally fail in batches (and drives from the same batch get 
stuffed into the same system).


This seems to be orthogonal to the sparing method used; and in both 
cases the answer is to tolerate dual failures.  File-based redundancy 
has the advantage here of allowing triple mirroring for metadata and 
frequently written files, and double parity raid for large files.


I suspect that what will be used in mission critical deployments will 
be more conservative than what is used in less critical path systems.


That's true, unfortunately.  But with time people will trust the newer, 
more efficient methods.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: Some very basic questions

2008-10-22 Thread Avi Kivity

Chris Mason wrote:

One problem with the spare
capacity model is the general trend where drives from the same batch
that get hammered on in the same way tend to die at the same time.  Some
shops will sleep better knowing there's a hot spare and that's fine by
me.
  


How does hot sparing help?  All your disks die except the spare.

Of course, I've no objection to disk sparing as an additional option; I 
just feel that capacity sparing is superior.


--
error compiling committee.c: too many arguments to function



Re: Some very basic questions

2008-10-22 Thread Avi Kivity

Ric Wheeler wrote:
I think that the btrfs plan is still to push more complicated RAID 
schemes off to MD (RAID6, etc) so this is an issue even with a JBOD. 
It will be interesting to map out the possible ways to use built in 
mirroring, etc vs the external RAID and actually measure the utilized 
capacity and performance (online & during rebuilds).


That's leaving a lot of performance and features on the table, IMO.  We 
definitely want to have metadata and small files using mirroring 
(perhaps even three copies for some metadata).  Use RAID[56] for large 
files.  Perhaps even start files at RAID1, and have the scrubber convert 
them to RAID[56] when it notices they are only ever read.  Keep 
temporary or unimportant files at RAID0.  Play games with asymmetric 
setups (small fast disks + large slow disks). etc etc etc.


Delegating things to MD throws out a lot of metadata so these things 
become impossible.


--
error compiling committee.c: too many arguments to function



Re: Some very basic questions

2008-10-22 Thread Avi Kivity

Ric Wheeler wrote:


Well, btrfs is not about duplicating how most storage works today.  
Spare capacity has significant advantages over spare disks, such as the 
ability to mix disk sizes and RAID levels, and better performance.


Sure, there are advantages that go in favour of one approach or the 
other. But btrfs is also about being able to use common hardware 
configurations without having to reinvent where we can avoid it (if we 
have a working RAID or enough drives to do RAID5 with spares or RAID6, 
we want to be able to delegate that off to something else if we can).


Well, if you have an existing RAID (or have lots of $$$ to buy a new 
one), you needn't tell Btrfs about it.  Just be sure not to enable Btrfs 
data redundancy, or you'll have redundant redundancy, which is expensive.


What Btrfs enables with its multiple device capabilities is to assemble 
a JBOD into a filesystem-level data redundancy system, which is cheaper, 
more flexible (per-file data redundancy levels), and faster (no need for 
RMW, since you're always COWing).


The major difficulty with the spare capacity model is that your 
recovery is not as simple and well understood as RAID rebuilds. 


That's Chris's problem. :-)

If you assume that whole drives fail under btrfs mirroring, you are 
not really doing anything more than simple RAID, or do I misunderstand 
your suggestion?


I do assume that whole drives fail, but RAIDing and rebuilding are done 
at the file level.  So one extent on a failed disk might be part of a 
mirrored file, while another extent can be part of a 14-member RAID6 extent.


A rebuild would iterate over all disk extents (making use of the backref 
tree), determine which file contains that extent, and rebuild that 
extent using spare storage on other disks.


I don't see the point about head seeking. In RAID, you also have the 
same layout so you minimize head movement (just move more heads per IO 
in parallel).


Suppose you have 5 disks with 1 spare.  Suppose you are reading from a 
full fs.  On a disk-level RAID, all disks are full.  So you have 5 
spindles seeking over 100% of the disk surface.  With spare capacity, 
you have 6 disks which are 5/6 full (retaining the same utilization as 
old-school RAID).  So you have 6 spindles, each with a seek range that 
is 5/6 of a whole disk, so more seek heads _and_ faster individual seeks.
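
To make the arithmetic concrete, here is a quick back-of-the-envelope sketch (illustration only, nothing btrfs-specific; the 5+1 figures are just the example above):

#include <stdio.h>

int main(void)
{
	int n = 5;		/* data disks in the spare-disk model */
	double data = n;	/* total data, measured in disk-fulls */

	/* Spare disk: n spindles hold data, each one 100% full. */
	printf("spare disk:     %d spindles, seek span %.0f%% of a platter\n",
	       n, 100.0 * data / n);

	/* Spare capacity: all n+1 spindles hold data, each one n/(n+1) full. */
	printf("spare capacity: %d spindles, seek span %.0f%% of a platter\n",
	       n + 1, 100.0 * data / (n + 1));
	return 0;
}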


--
error compiling committee.c: too many arguments to function



Re: Some very basic questions

2008-10-22 Thread Avi Kivity

Ric Wheeler wrote:
You want to have spare capacity, enough for one or two (or fifteen) 
drives' worth of data.  When a drive goes bad, you rebuild into the 
spare capacity you have.


That is a different model (and one that makes sense, we used that in 
Centera for object level protection schemes). It is a nice model as 
well, but not how most storage works today.


Well, btrfs is not about duplicating how most storage works today.  
Spare capacity has significant advantages over spare disks, such as the 
ability to mix disk sizes and RAID levels, and better performance.




When you replace the drive, the filesystem moves data into the new 
drive to take advantage of the new spindle.




When you buy a storage solution (hardware or software), the key here 
is "utilized capacity." If you have an enclosure that can host say 
12-15 drives in a 2U enclosure, people normally leave one drive as 
spare.  RAID6 is another way to do this. You can do a 4+2 and 4+2 with 
66% utilized capacity in RAID 6 or possibly a RAID5 scheme using like 
5+1 and 4+1 with one global spare (75% utilized capacity).


That gives you the chance to rebuild your RAID group without 
having to physically visit the data center. You can also do fancy 
stuff with the spare (like migrate as many blocks as possible before 
the RAID rebuild to that spare), which reduces your exposure to the 2nd 
drive failure and speeds up your rebuild time.


In the end, whether you use a block based RAID solution or an object 
based solution, you just need to figure out how to balance your 
utilized capacity against performance and data integrity needs.


In both models (spare disk and spare capacity) the storage utilization 
is the same, or nearly so.  But with spare capacity you get better 
performance since you have more spindles seeking for your data, and 
since less of the disk surface is occupied by data, making your seeks 
shorter.


--
error compiling committee.c: too many arguments to function



Re: Some very basic questions

2008-10-22 Thread Avi Kivity

Chris Mason wrote:
You want to have spare capacity, enough for one or two (or fifteen) 
drives' worth of data.  When a drive goes bad, you rebuild into the 
spare capacity you have.





You want spare capacity that does not degrade your raid levels if you
move the data onto it.  In some configs, this will be a hot spare, in
others it'll just be free space.
  


What kind of configuration would prefer a spare disk to spare capacity?  
RAID6 with a small number of disks?


--
error compiling committee.c: too many arguments to function



Re: Some very basic questions

2008-10-22 Thread Avi Kivity
Ric Wheeler wrote: 
One key is not to replace the drives too early - you often can recover 
significant amounts of data from a drive that is on its last legs. 
This can be useful even in RAID rebuilds since with today's enormous 
drive capacities, you might hit a latent error during the rebuild on 
one of the presumed healthy drives.


Of course, if you don't have a spare drive in your configuration, this 
is not practical...


Why would you have a spare drive?  That's a wasted spindle.

You want to have spare capacity, enough for one or two (or fifteen) 
drives' worth of data.  When a drive goes bad, you rebuild into the 
spare capacity you have.


When you replace the drive, the filesystem moves data into the new drive 
to take advantage of the new spindle.


--
error compiling committee.c: too many arguments to function



Re: Some very basic questions

2008-10-22 Thread Avi Kivity

jim owens wrote:

Avi Kivity wrote:

jim owens wrote:


Remember that the device bandwidth is the limiter so even
when each host has a dedicated path to the device (as in
dual port SAS or FC), that 2nd host cuts the throughput by
more than 1/2 with uncoordinated seeks and transfers.


That's only a problem if there is a single shared device.  Since 
btrfs supports multiple devices, each host could own a device set and 
access from other hosts would be through the owner.  You would need 
RDMA to get reasonable performance and some kind of dual-porting to 
get high availability.  Each host could control the allocation tree 
for its devices.


No.  Every device including a monster $$$ array has the problem.

As I said before, unless the application is partitioned
there is always data host2 needs from host1's disk and that
slows down host1.


The CPU load should not be significant if you have RDMA.  Or are you 
talking about the seek load?  Since host1's load should be distributed 
over all devices in the system, overall seek capacity increases as you 
add more nodes.




If host2 seldom needs any host1 data, then you are describing
a configuration that can be done easily by each host having a
separate filesystem for the device it owns by default.  Each
host nfs mounts the other host's data and if host1 fails, host2
can direct mount host1-fs from the shared array.



Separate namespaces are uninteresting to me.  That's just pushing the 
problem back to the user.



Even with multiple disks under the same filesystem as separate
allocated storage there is still the problem of shared namespace
metadata that slows down both hosts.  If you don't need shared
namespaces then you absolutely don't want a cluster fs.



If you separate the allocation metadata to the storage owning node, and 
the file metadata to the actively using node, the slowdown should be low 
in most cases.  Problems begin when all nodes access the same file, but 
that's relatively rare.  Even then, when the file size does not change 
and when the data is preallocated, it's possible to achieve acceptable 
overhead.



A cluster fs is useful, but the cost can be high so using
it for a single-host fs is not a good idea.


Development costs, yes.  But I don't see why the runtime overhead can't 
disappear when running on a single host.  Sort of like running an SMP 
kernel on a uniprocessor (I agree the fs problem is much bigger).


--
error compiling committee.c: too many arguments to function



Re: Some very basic questions

2008-10-22 Thread Avi Kivity

Ric Wheeler wrote:
Scrubbing is key for many scenarios since errors can "grow" even in 
places where previous IO has been completed without flagging an error.


Some neat tricks are:

   (1) use block level scrubbing to detect any media errors. If you 
can map that sector level error into a file system object (meta data, 
file data or unallocated space), tools can recover (fsck, get another 
copy of the file or just ignore it!). There is a special command 
called "READ_VERIFY" that can be used to validate the sectors without 
actually moving data from the target to the host, so you can scrub 
without consuming page cache, etc.




This has the disadvantage of not catching errors that were introduced 
while writing; the very errors that btrfs checksums can catch.


   (2) sign and validate the object at the file level, say by 
validating a digital signature. This can catch high level errors (say 
the app messed up).


Btrfs extent-level checksums can be used for this.  This is just below 
the application level, but good enough IMO.


Note that this scrubbing needs to be carefully tuned to not interfere 
with the foreground workload, using something like IO nice or the 
other IO controllers being kicked about might help :-)


Right.  Further, reading the disk by logical block order will help 
reduce seeks.  Btrfs's back references, if cached properly, will help 
with this as well.


--
error compiling committee.c: too many arguments to function



Re: Some very basic questions

2008-10-22 Thread Avi Kivity

Stephan von Krawczynski wrote:



   - filesystem autodetects, isolates, and (possibly) repairs errors
   - online "scan, check, repair filesystem" tool initiated by admin
   - Reliability so high that they never run that check-and-fix tool



That is _wrong_ (to a certain extent). You _want to run_ diagnostic tools to
make sure that there is no problem. And you don't want some software (not even
HAL) to repair errors without prior admin knowledge/permission.


I think there's a place for a scrubber to continuously verify filesystem 
data and metadata, at low io priority, and correct any correctable 
errors.  The admin can read the error correction report at their 
leisure, and then take any action that's outside the filesystem's 
capabilities (like ordering and installing new disks).


--
error compiling committee.c: too many arguments to function



Re: Some very basic questions

2008-10-22 Thread Avi Kivity

jim owens wrote:


Remember that the device bandwidth is the limiter so even
when each host has a dedicated path to the device (as in
dual port SAS or FC), that 2nd host cuts the throughput by
more than 1/2 with uncoordinated seeks and transfers.


That's only a problem if there is a single shared device.  Since btrfs 
supports multiple devices, each host could own a device set and access 
from other hosts would be through the owner.  You would need RDMA to get 
reasonable performance and some kind of dual-porting to get high 
availability.  Each host could control the allocation tree for its devices.


Of course, this doesn't solve the other problems with parallel mounts.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: Data-deduplication?

2008-10-15 Thread Avi Kivity

Andi Kleen wrote:

  


There are some patches to do this in QEMU's cow format for KVM. That's
user level only.
 
  
And thus, doesn't work for sharing between different images, especially 
at runtime. 



It would work if the images are all based once on a reference image, wouldn't it?
  


Yes and no.  It's difficult to do it at runtime, and it allows one qemu 
to access another guest's data (for read-only).


Also, it's almost impossible to do at runtime.


--
error compiling committee.c: too many arguments to function



Re: Data-deduplication?

2008-10-15 Thread Avi Kivity

Andi Kleen wrote:

Ray Van Dolson <[EMAIL PROTECTED]> writes:

  

I recall there being a thread here a number of months back regarding
data-deduplication support for btrfs.

Did anyone end up picking that up and giving a go at it?  Block level
data dedup would be *awesome* in a Linux filesystem.  It does wonders
for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
have this feature yet (although I've read discussions on them looking
to add it).



There are some patches to do this in QEMU's cow format for KVM. That's
user level only.
  


And thus, doesn't work for sharing between different images, especially 
at runtime. I'd really, really [any number of reallies], really like to 
see btrfs deduplication.


--
error compiling committee.c: too many arguments to function



Re: packing structures and numbers

2008-10-04 Thread Avi Kivity

Andi Kleen wrote:

On those archs that take faults on unaligned accesses it's unlikely to
be in the noise.  But we could (and should) stick a get_unaligned() in
the accessor functions.



Normally the compiler on such architectures generates special load/store code
for known unaligned types that does not actually fault, but just uses multiple
instructions. That is slower than a normal memory access, but still much
faster than an exception.

You only really get the full exception fault penalty when the compiler
cannot figure out at compile time that a given variable is unaligned.
But with packed it normally assumes that (I think).


Sounds reasonable.  In which case the unaligned access issue I raised is 
a red herring.  So using uleb128 or not is down to whether the improved 
packing efficiency is worth the increased complexity; it seems unlikely 
that it is.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: packing structures and numbers

2008-10-04 Thread Avi Kivity

Zach Brown wrote:

Avi Kivity wrote:
  

I've been reading btrfs's on-disk format, and two things caught my eye

- attribute((packed)) structures everywhere, often with misaligned
fields.  This conserves space, but can be harmful to in-memory
performance on some archs.



How harmful?  Do you have any profiles that can even pick this out of
the noise?
  


On those archs that take faults on unaligned accesses it's unlikely to 
be in the noise.  But we could (and should) stick a get_unaligned() in 
the accessor functions.
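
Roughly like this (a sketch assuming the usual kernel unaligned-access helpers; the accessor name just mirrors the existing naming style):

#include <linux/types.h>
#include <asm/byteorder.h>
#include <asm/unaligned.h>

struct btrfs_disk_key {
	__le64 objectid;
	u8 type;
	__le64 offset;
} __attribute__ ((__packed__));

/* The accessor hides the potentially misaligned on-disk load. */
static inline u64 btrfs_disk_key_objectid(const struct btrfs_disk_key *key)
{
	return le64_to_cpu(get_unaligned(&key->objectid));
}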


uleb128 encoding is orthogonal to this issue, however.  It's only about 
getting better density.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: packing structures and numbers

2008-10-04 Thread Avi Kivity

Daniel Phillips wrote:

On Friday 03 October 2008 17:28, Chris Mason wrote:
  

On Fri, Oct 03, 2008 at 05:22:51PM -0700, Daniel Phillips wrote:


...Are we sure that attribute ((packed)) works the same on all
arches?
  

As long as you use types with strictly defined size, yes.



Just to be a language lawyer about that... int does not have a
strictly defined size, yet we define uint32_t as a typedef of int
on 32-bit arches, do we not?
  


What else would you define it to?
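
For illustration (my own sketch, with hypothetical names): with fixed-width types the packed layout is identical on every arch, and you can even assert that at build time:

#include <stdint.h>

struct example_disk_item {
	uint64_t objectid;
	uint8_t  type;
	uint64_t offset;
} __attribute__ ((__packed__));

/* Fails to compile if any compiler/arch disagrees about the layout. */
typedef char example_disk_item_size_check
	[sizeof(struct example_disk_item) == 17 ? 1 : -1];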

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: packing structures and numbers

2008-10-03 Thread Avi Kivity

Chris Mason wrote:

On Fri, 2008-10-03 at 14:42 +0300, Avi Kivity wrote:
  

I've been reading btrfs's on-disk format, and two things caught my eye

- attribute((packed)) structures everywhere, often with misaligned 
fields.  This conserves space, but can be harmful to in-memory 
performance on some archs.



packed is important to make sure that a given field takes exactly the
same amount of space everywhere, regardless of compiler optimization or
arch.
  


Yes, of course.

- le64's everywhere.   This scales nicely, but wastes space.  My home 
directory is unlikely to have more than 4G objects or 4GB extents (let 
alone >2 devices).


I think the two issues can be improved by separating the on-disk format 
and the in-memory structure, and by using uleb128 as the on-disk format 
for numbers.  uleb128 is a variable-length format that encodes 7 bits of 
a number in each byte, using the eighth bit as a stop bit.





This couldn't be used everywhere, as the array of item headers and keys
needs to be a fixed size for the current bin_search code.  The items can be
variable sized, but in general they don't have as many le64s.

  


You'd decode the keys and headers before searching.  This of course 
negates the idea behind a binary search, unless you cache the decoded nodes.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



packing structures and numbers

2008-10-03 Thread Avi Kivity

I've been reading btrfs's on-disk format, and two things caught my eye

- attribute((packed)) structures everywhere, often with misaligned 
fields.  This conserves space, but can be harmful to in-memory 
performance on some archs.
- le64's everywhere.   This scales nicely, but wastes space.  My home 
directory is unlikely to have more than 4G objects or 4GB extents (let 
alone >2 devices).


I think the two issues can be improved by separating the on-disk format 
and the in-memory structure, and by using uleb128 as the on-disk format 
for numbers.  uleb128 is a variable-length format that encodes 7 bits of 
a number in each byte, using the eighth bit as a stop bit.


So, for example

struct btrfs_disk_key {
   __le64 objectid;
   u8 type;
   __le64 offset;
} __attribute__ ((__packed__));

With 1M objectids, and 1T offsets, this reduces in size from 17 bytes to 
10 bytes.  Most other structures show similar gains.  We can also have 
more than 256 types if the need arises.
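
For concreteness, here is a user-space sketch of the encoding (illustration only, not proposed code); with this scheme any objectid below 2^21 fits in 3 bytes and any offset below 2^42 in 6, which is where the 10-byte figure comes from:

#include <stdint.h>
#include <stddef.h>

/* Encode v as ULEB128: 7 payload bits per byte, high bit set on every
 * byte except the last.  Returns the number of bytes written. */
static size_t uleb128_encode(uint64_t v, uint8_t *out)
{
	size_t n = 0;

	do {
		uint8_t byte = v & 0x7f;

		v >>= 7;
		if (v)
			byte |= 0x80;	/* more bytes follow */
		out[n++] = byte;
	} while (v);
	return n;
}

/* Decode a ULEB128 value, returning the number of bytes consumed. */
static size_t uleb128_decode(const uint8_t *in, uint64_t *v)
{
	uint64_t result = 0;
	unsigned int shift = 0;
	size_t n = 0;
	uint8_t byte;

	do {
		byte = in[n++];
		result |= (uint64_t)(byte & 0x7f) << shift;
		shift += 7;
	} while (byte & 0x80);
	*v = result;
	return n;
}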


There are, of course, disadvantages to switching to uleb128:

- need to write encode and decode functions, which is tedious.  This can 
be automated a la xdr.

- increased cpu utilization for decoding and encoding
- can no longer know the size of the in-memory structures in advance
- it's just wonderful to rewrite the entire disk format so close to 
freezing it


The advantages, IMO, outweigh the disadvantages:

- better packing reduces tree depth and therefore seekage, the most 
important cost on rotating media

- the disk format is infinitely growable
- in-memory format is more efficient for archs which prefer aligned accesses

I'm not volunteering to do this, but please consider this proposal.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: oops in btrfs_reserve_extent (v0.16)

2008-09-29 Thread Avi Kivity

Josef Bacik wrote:

Could you try with the unstable git tree?  I just rewrote a bunch of this stuff
and would like to know if the same issue is present there.  I wouldn't recommend
using unstable for your home directory yet though; that code still may eat children.
Thanks,
  


Will try that on a virtual machine -- unfortunately I am too far, 
bandwidth-wise, from an F9 installer.


Meanwhile I rsync'ed my /home without online resizing, using v0.16.  
Filesystem and myself both happy.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



oops in btrfs_reserve_extent (v0.16)

2008-09-28 Thread Avi Kivity
Perhaps a bit optimistically I started moving my home directory to 
btrfs, using v0.16.  Midway through the rsync I noticed the 10GB lvm 
volume would be too small for a filesystem with no -ENOSPC support, so I 
extended the volume and resized the filesystem while rsync was running.  
The punishment for this arrived almost immediately:


new size for /dev/mapper/vg0-home.btrfs is 21474836480
Unable to find block group for 8619294720
[ cut here ]
WARNING: at /home/avi/btrfs/btrfs-0.16/extent-tree.c:300 
find_search_start+0x1d9/0x27e [btrfs]() (Not tainted)

Modules linked in: btrfs ...
Pid: 31165, comm: pdflush Not tainted 2.6.26.3-29.fc9.x86_64 #1

Call Trace:
[] warn_on_slowpath+0x60/0xa3
[] ? printk+0x67/0x6a
[] ? :btrfs:get_state_private+0x6b/0x79
[] ? :btrfs:btrfs_lookup_block_group+0x3b/0x73
[] ? :btrfs:find_first_extent_bit_state+0x23/0x5e
[] :btrfs:find_search_start+0x1d9/0x27e
[] ? :btrfs:get_state_private+0x6b/0x79
[] :btrfs:find_free_extent+0x334/0x5ff
[] :btrfs:__btrfs_reserve_extent+0x1a9/0x23a
[] :btrfs:btrfs_reserve_extent+0x5e/0x79
[] :btrfs:cow_file_range+0x11d/0x23e
[] :btrfs:run_delalloc_range+0x2cc/0x2e2
[] ? :btrfs:find_lock_delalloc_range+0x205/0x218
[] ? getnstimeofday+0x3a/0x96
[] :btrfs:__extent_writepage+0x169/0x5b6
[] ? find_get_pages_tag+0x3d/0x95
[] write_cache_pages+0x1c5/0x314
[] ? :btrfs:__extent_writepage+0x0/0x5b6
[] :btrfs:extent_writepages+0x32/0x52
[] ? :btrfs:btrfs_get_extent+0x0/0x74f
[] :btrfs:btrfs_writepages+0x23/0x25
[] do_writepages+0x28/0x38
[] __writeback_single_inode+0x16d/0x2cc
[] ? :dm_mod:dm_any_congested+0x43/0x52
[] sync_sb_inodes+0x20b/0x2d0
[] writeback_inodes+0xa8/0x100
[] wb_kupdate+0xa3/0x119
[] pdflush+0x148/0x1f7
[] ? wb_kupdate+0x0/0x119
[] ? pdflush+0x0/0x1f7
[] kthread+0x49/0x76
[] child_rip+0xa/0x12
[] ? kthread+0x0/0x76
[] ? child_rip+0x0/0x12

and

allocation failed flags 1
[ cut here ]
kernel BUG at /home/avi/btrfs/btrfs-0.16/extent-tree.c:2111!
invalid opcode:  [1] SMP
CPU 0
Modules linked in: btrfs ...
Pid: 31165, comm: pdflush Tainted: GW 2.6.26.3-29.fc9.x86_64 #1
RIP: 0010:[]  [] 
:btrfs:__btrfs_reserve_extent+0x1ff/0x23a

RSP: 0018:81005f095860  EFLAGS: 00010286
RAX: 001d RBX: 0001 RCX: 81005e487e60
RDX: 0001 RSI: 81005f0956d0 RDI: 0246
RBP: 81005f0958d0 R08: 813d2db0 R09: 810001019798
R10: 4bd3179bd155 R11:  R12: 8100256ea000
R13: 81003722f000 R14: 1000 R15: 81003e1926c8
FS:  () GS:81417000() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 0031c032e830 CR3: 7a40a000 CR4: 26e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process pdflush (pid: 31165, threadinfo 81005f094000, task 
810018b2da80)

Stack:   81005f095990  
0001 0246  
1000  81003722f000 
Call Trace:
[] :btrfs:btrfs_reserve_extent+0x5e/0x79
[] :btrfs:cow_file_range+0x11d/0x23e
[] :btrfs:run_delalloc_range+0x2cc/0x2e2
[] ? :btrfs:find_lock_delalloc_range+0x205/0x218
[] ? getnstimeofday+0x3a/0x96
[] :btrfs:__extent_writepage+0x169/0x5b6
[] ? find_get_pages_tag+0x3d/0x95
[] write_cache_pages+0x1c5/0x314
[] ? :btrfs:__extent_writepage+0x0/0x5b6
[] :btrfs:extent_writepages+0x32/0x52
[] ? :btrfs:btrfs_get_extent+0x0/0x74f
[] :btrfs:btrfs_writepages+0x23/0x25
[] do_writepages+0x28/0x38
[] __writeback_single_inode+0x16d/0x2cc
[] ? :dm_mod:dm_any_congested+0x43/0x52
[] sync_sb_inodes+0x20b/0x2d0
[] writeback_inodes+0xa8/0x100
[] wb_kupdate+0xa3/0x119
[] pdflush+0x148/0x1f7
[] ? wb_kupdate+0x0/0x119
[] ? pdflush+0x0/0x1f7
[] kthread+0x49/0x76
[] child_rip+0xa/0x12
[] ? kthread+0x0/0x76
[] ? child_rip+0x0/0x12

Code: 85 08 01 00 00 4c 89 f2 48 8b 70 20 e8 0d f5 ff ff e9 91 fe ff ff 
85 c0 74 15 48 89 de 48 c7 c7 f1 bc 51 a0 31 c0 e8 03 40 da e0 <0f> 0b 
eb fe 48 8b 45 18 49 8b bd 08 01 00 00 b9 50 00 00 00 48

RIP  [] :btrfs:__btrfs_reserve_extent+0x1ff/0x23a

The 'Unable to find block group' message perhaps indicates the filesystem 
was resized beyond the blockdev's limits?  Or is online resizing broken?


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
