Re: ceph and efficient access of distributed resources

2013-04-16 Thread Mark Kampe

The client does a 12MB read, which (because of the striping)
gets broken into 3 separate 4MB reads, each of which is sent,
in parallel, to 3 distinct OSDs.  The only bottleneck
in such an operation is the client's NIC.
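
For those who want the arithmetic, here is a toy sketch of that
striping math (plain 4MB-wide striping; it ignores RBD's configurable
stripe_unit/stripe_count, and the function is mine, not a librados API):

    STRIPE = 4 * 2**20   # the 4MB object width discussed here

    def stripe_extents(offset, length):
        """Map a logical byte range to (object_index, offset, length) extents."""
        extents = []
        while length > 0:
            idx, off = divmod(offset, STRIPE)
            n = min(length, STRIPE - off)
            extents.append((idx, off, n))
            offset += n
            length -= n
        return extents

    # a 12MB read at offset 0 becomes three full-object reads, each of
    # which the client can issue in parallel to a different OSD:
    print(stripe_extents(0, 12 * 2**20))
    # [(0, 0, 4194304), (1, 0, 4194304), (2, 0, 4194304)]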

On 04/16/2013 01:06 PM, Gandalf Corvotempesta wrote:

2013/4/16 Mark Kampe :

RADOS is the underlying storage cluster, but the access methods (block,
object, and file) stripe their data across many RADOS objects, which
CRUSH very effectively distributes across all of the servers.  A 100MB
read or write turns into dozens of parallel operations to servers all
over the cluster.


Let me try to explain.
AFAIK Ceph will split data into chunks of 4MB each, so a single
12MB file will be stored as 3 different chunks across multiple OSDs
and then replicated many times (based on the value of the replica count).

Let's assume a 12MB file and a 3x replica.
RADOS will create 3x3 = 9 chunks for the same file, stored on 9 OSDs.

When reading, AFAIK replicas are not used, so all reads are done from the
"master copy".
But are these 3 chunks read in parallel from multiple OSDs, or are all
read requests done through a single OSD?  In the first case we will have
3x bandwidth for read operations directed at a file with at least 3
chunks; in the latter we have a big bottleneck.



Re: ceph and efficient access of distributed resources

2013-04-16 Thread Mark Kampe

On 04/16/13 00:20, Gandalf Corvotempesta wrote:

2013/4/16 Mark Kampe :

The entire web is richly festooned with cache servers whose
sole raison d'etre is to solve precisely this problem.  They
are so good at it that back-bone providers often find it more
cash-efficient to buy more cache servers than to lay more
fiber.  Cache servers don't merely save disk I/O, they catch
these requests before they reach the server (or even the
backbone).


Mine was just an example; there are many other cases where a frontend
cache is not possible.
I think that ceph should spread reads across the whole cluster by
default (like a big RAID-1), to achieve a bandwidth improvement.


At my previous distributed storage start-up (Parascale) we had the
ability to distribute reads across copies for load distribution
purposes and everybody we talked to said "who cares!".  Why?

   For hot-spot situations (as in your original example)
   higher level caching is far more effective than random
   traffic distribution.

   For lower level (e.g. coincidental) reuse, sending all the
   requests to a single server will usually perform better.
   Network I/O is much faster than disk I/O, and a single
   recipient will have N * the cache hit rate that N servers
   would have.
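
   A toy simulation of that second point (assumed parameters,
   unbounded per-server caches, each object requested exactly
   twice); the ratio between the two printed hit rates is the
   N-times factor claimed above:

    import random

    N_SERVERS = 4
    N_OBJECTS = 50_000

    # coincidental reuse: each object is requested exactly twice
    requests = list(range(N_OBJECTS)) * 2
    random.shuffle(requests)

    def hit_rate(route):
        cached = [set() for _ in range(N_SERVERS)]
        hits = 0
        for obj in requests:
            server = route(obj)
            if obj in cached[server]:
                hits += 1
            else:
                cached[server].add(obj)
        return hits / len(requests)

    # all requests for an object go to one server: the repeat always hits
    print(hit_rate(lambda obj: obj % N_SERVERS))              # ~0.50
    # each request goes to a random copy: the repeat hits 1 time in N
    print(hit_rate(lambda obj: random.randrange(N_SERVERS)))  # ~0.125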


What happens in the case of a big file (for example, 100MB) with multiple
chunks?  Is ceph smart enough to read multiple chunks from multiple
servers simultaneously, or will the whole file be served by just one OSD?


RADOS is the underlying storage cluster, but the access methods (block,
object, and file) stripe their data across many RADOS objects, which
CRUSH very effectively distributes across all of the servers.  A 100MB
read or write turns into dozens of parallel operations to servers all
over the cluster.



Re: ceph and efficient access of distributed resources

2013-04-15 Thread Mark Kampe

If I correctly understand the discussion, you are correct
that I/O could be saved by doing this ... were it not for
the fact that the I/O in question is already being saved much
more effectively by someone else.

The entire web is richly festooned with cache servers whose
sole raison d'etre is to solve precisely this problem.  They
are so good at it that back-bone providers often find it more
cash-efficient to buy more cache servers than to lay more
fiber.  Cache servers don't merely save disk I/O, they catch
these requests before they reach the server (or even the
backbone).


On 04/15/2013 01:06 PM, Gandalf Corvotempesta wrote:



Currently reads always come from the primary OSD in the placement group
rather than a secondary even if the secondary is closer to the client.



In this way, only one OSD will be involved in reading an object; this
will result in a bottleneck if multiple clients need to access the same
file.

For example, a 3KB CSS file served by a webserver to 400 users will be
read from just one OSD.  400 users directed to 1 OSD while (in the case
of replica 3) the other 2 OSDs are available?



Re: Comments on Ceph.com's blog article 'Ceph's New Monitor Changes'

2013-03-12 Thread Mark Kampe

It seems to me that the surviving OSDs still remember all of
the osdmap and pgmap history back to "last epoch started"
for all of their PGs.  Isn't this enough to enable reconstruction
of all of the pgmaps and osdmaps required to find any copy of
a currently stored object?

My history has given me biases, but I prefer reconstruction over
snapshots because:

 (a) it enables recovery from more catastrophic incidents
 (e.g. a bug has corrupted all of the monitor stores
 or a fire has reduced all monitor nodes to slag)

 (b) it is less likely to result in inconsistencies involving
 object updates after the last snapshot

 (c) the ability to reconstruct is a superset of the ability
 to audit, so we get consistency audits for free



It tends to be a common source of discomfort among potential Ceph
users that if their mons ever become unrecoverable, it's almost
impossible to recover their data (compare to GlusterFS, where you can
always pull data out of Gluster bricks unharmed, at least as long as
you don't use striping volumes). With a file backed mon store, I had
hoped that eventually this might tie into btrfs snapshots such that
you would have been able to roll back to a known good configuration
in an emergency. With the switch to leveldb, I no longer foresee that
ever happening. Mind sharing your thoughts on that?



Geographic DR for RGW

2013-02-26 Thread Mark Kampe

A few weeks ago, Yehuda Sadeh sent out a proposal for adding
support for asynchronous remote site replication to the RADOS
Gateway.  We have done some preliminary planning and are
now starting the implementation work.

At the 100,000' level (from which height all water looks
drinkable and all mountains climbable) the work can be
divided into:

   1. a bunch of changes within the gateway to create
  regions, add new attributes to buckets and objects,
  log new information and implement/expose some new
  operations

   2. new RESTful APIs to exploit and manage the new
  behaviors, and associated unit test suites.

   3. free-standing data and metadata synchronization
  agents that learn of and propagate changes

   4. management agents to monitor, control, and report
  on this activity.

   5. white-box test suites to stress change detection,
  reconciliation, propagation, and replay.

We feel that we pretty much have to do (1) (lots of changes
to a great deal of complicated code).  Category (2) is
new code, and (in principle) decoupled from the internals
of the gateway, but it has many tendrils.  C++ developers
with some familiarity with the Gateway could definitely help
here ... but it is questionable whether or not it makes
sense to try to bring new people up to speed for a project
that will only last a few months.

In the near term, the most modular pieces with the lowest
activation energy are (3), and in two months there may be
enough working to enable work on (4) and (5).  The
synchronization and management agents are free-standing
processes based on RESTful APIs, and so can be implemented
in pretty much anything (Python, Java, C++, Ruby, ...).
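
Because the synchronization agents are free-standing REST clients,
their skeleton is easy to picture.  What follows is a shape sketch
only: every endpoint, parameter, and field name in it is a
placeholder, since the real log/sync APIs are exactly what item (2)
will define.

    import time

    import requests   # free-standing agent: any HTTP client will do

    MASTER = "http://rgw-master.example.com"   # placeholder endpoints
    SLAVE = "http://rgw-slave.example.com"

    marker = None   # position reached in the master's change log

    while True:
        # learn of changes: poll the master's (hypothetical) log resource
        resp = requests.get(MASTER + "/admin/log",
                            params={"type": "metadata", "marker": marker})
        resp.raise_for_status()
        entries = resp.json()
        for entry in entries:
            # propagate: replay each change against the remote site
            requests.put(SLAVE + "/admin/replica",
                         json=entry).raise_for_status()
            marker = entry["id"]
        if not entries:
            time.sleep(30)   # idle; a real agent would long-poll instead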

If there are other people who are able to help make this
happen, we would love to invite your participation.  This
is an opportunity to:
   * accelerate the development of a strategic feature
   * help to shape some major new functionality
   * get very familiar with the Gateway code
   * play with eventual consistency, asynchronous pull replication
   * be one of the kool kids
   * earn Karma and improve the world through Open Source contribution

thank you,

   mark.ka...@inktank.com
   VP, Engineering



Re: some performance issue

2013-02-04 Thread Mark Kampe

Writes are intrinsically more expensive (in both the file
system and hardware) but it is not uncommon for individual
small random writes to substantially outperform reads even
with O_DIRECT.

If the I/O is not massively parallel, reads are going to be
processed one at a time (e.g. ~6ms seek, ~4ms latency, and
27us transfer).  Writes, however, are commonly accepted by
the drive and then queued, enabling the drive to choose among
the competing requests to significantly (e.g. 2-3x) reduce
both average seek time and rotational latency.
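
Plugging in the numbers above (my back-of-envelope, using the
seek/latency figures from the preceding paragraph):

    seek_ms, rot_ms, xfer_ms = 6.0, 4.0, 0.027   # figures from above

    read_iops = 1000 / (seek_ms + rot_ms + xfer_ms)
    print("one-at-a-time random reads: %3.0f IOPS" % read_iops)   # ~100

    # queued writes: the drive reorders requests, cutting seek time
    # and rotational latency by, say, 2-3x
    for factor in (2, 3):
        write_ms = (seek_ms + rot_ms) / factor + xfer_ms
        print("queued random writes (%dx): %3.0f IOPS"
              % (factor, 1000 / write_ms))
    # -> roughly 200-300 IOPS, i.e. the 2-3x write advantage above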

If the I/O is being buffered, the performance advantages for
random writes can be even greater (due to a deeper request
queue and potential request aggregation).  Isolated random
reads (with few cache hits) get a much smaller performance
boost (if any) from buffered I/O.

With massively parallel requests, however, the write
advantage should evaporate.

On 02/04/2013 09:15 AM, sheng qiu wrote:

Hi Xiaoxi,

thanks for your reply.

On Mon, Feb 4, 2013 at 10:52 AM, Chen, Xiaoxi  wrote:

I doubt your data is correct, even the ext4 data.  Did you use O_DIRECT when
doing the test?  It's unusual to have 2X the random write IOPS of random read.



  I did not use O_DIRECT, so the page cache is used during the test.
One reason I guess random write is better than random read: since
the I/O request size is 4KB, each write request that misses in the
page cache just allocates a new page and writes the complete 4KB of
dirty data there (no partial writes, so no need to fetch the
missed data from the OSDs), while a read request has to wait until
the data is fetched from the OSDs.



Re: on disk encryption

2013-01-31 Thread Mark Kampe

Correct.

I wasn't actually involved in this (or any other real) work,
but as I recall the only real trick is how much key management
you want:

  Do we want to be able to recover the key if a good disk
  is rescued from a destroyed server and added to a new
  server?

  Do we want to ensure that the keys are not persisted on
  the server, so that an entire server can be decommissioned
  without having to worry about the data being recovered
  by somebody who knows where to look?

If you are willing to keep the key on the server and lose
the data when the server fails, this is trivial.  If you
are unwilling to keep the key on the server, or if you need
the disk to remain readable after the server is lost, we
need some third party (like the monitors) to maintain the
keys.

We thought these might be important, so we were looking
at how to get the monitors to keep track of the encryption
keys.

On 01/31/2013 03:42 PM, Marcus Sorensen wrote:

Yes, anyone could do this now by setting up the OSDs on top of
dm-crypted disks, correct? This would just automate the process, and
manage keys for us?




Re: geo replication

2013-01-09 Thread Mark Kampe

Right now, your only option is synchronous replication, which
happens at the speed of the slowest OSD ... so unless your
WAN links are fast and fat, it comes at a non-negligible
performance penalty.

We will soon be sending out a proposal for an asynchronous
replication mechanism with eventual consistency for the
RADOS Gateway ... but that is a somewhat simpler problem
(immutable objects, good change lists, and a WAN friendly
protocol).

Asynchronous RADOS replication is definitely on our list,
but more complex and farther out.

On 01/09/2013 01:19 PM, Gandalf Corvotempesta wrote:

probably this was already asked before, but I'm unable to find any answer.
Is it possible to replicate a cluster geographically?

GlusterFS does this with rsync (I think called automatically on every
file write); does ceph do something similar?

I don't think that using multiple geographically distributed OSDs with
10-15ms of latency would be good



Re: Are there significant performance enhancements in 0.56.x to be expected soon or planned in the near future?

2013-01-09 Thread Mark Kampe

Performance work is always ongoing, but I am not aware of any
significant imminent enhancements.  We are just wrapping up an
investigation of the effects of various file system and I/O
options on different types of traffic, and the next major area
of focus will be RADOS Block Device and VMs over RBD.  This is
pretty far away from Hadoop and probably won't yield much fruit
until March.

There are a few people working on Hadoop integration, and I
have not been closely following their activities, but I do
not believe that any major performance work will be forthcoming
in the next few weeks.

On 01/09/2013 04:51 AM, Lachfeld, Jutta wrote:

Hi all,

In expectation of better performance, we are just switching from CEPH version 
0.48 to 0.56.1
for comparisons between Hadoop with HDFS and Hadoop with CEPH FS.

We are now wondering whether there are currently any development activities
concerning further significant performance enhancements,
or whether further significant performance enhancements are already planned for 
the near future.

I would now be loath to start benchmarking with 0.56.1 and then, a month or so 
later, detect that there have been significant performance enhancements in CEPH 
in the meantime.



Re: RBD fio Performance concerns

2012-11-22 Thread Mark Kampe

Sequential is faster than random on a disk, but we are not
doing I/O to a disk, but to a distributed storage cluster:

  small random operations are striped over multiple objects and
  servers, and so can proceed in parallel and take advantage of
  more nodes and disks.  This parallelism can overcome the added
  latencies of network I/O to yield very good throughput.

  small sequential read and write operations are serialized on
  a single server, NIC, and drive.  This serialization eliminates
  parallelism, and the network and other queuing delays are no
  longer compensated for.

This striping is a good idea for the small random I/O that is
typical of the way Linux systems talk to their disks.  But for
other I/O patterns, it is not optimal.

On 11/21/2012 01:47 PM, Sébastien Han wrote:

Hi Mark,

Well the most concerning thing is that I have 2 Ceph clusters and both
of them show better rand than seq...
I don't have enough background to argue with your assumptions, but I
could try to shrink my test platform to a single OSD and see how it
performs. We'll keep in touch on that one.

But it seems that Alexandre and I have the same results (more rand
than seq); he has (at least) one cluster and I have 2. Thus I start to
think that's not an isolated issue.

Is it different for you? Do you usually get more seq IOPS from an RBD
than rand?




Re: RBD fio Performance concerns

2012-11-19 Thread Mark Kampe

Recall:
   1. RBD volumes are striped (4M wide) across RADOS objects
   2. distinct writes to a single RADOS object are serialized

Your sequential 4K writes are direct, depth=256, so there are
(at all times) 256 writes queued to the same object.  All of
your writes are waiting through a very long line, which is adding
horrendous latency.

If you want to do sequential I/O, you should do it buffered
(so that the writes can be aggregated) or with a 4M block size
(very efficient and avoiding object serialization).
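
You can sanity-check that "long line" against the fio output below
with Little's law (a quick check; the 183 IOPS figure is taken from
the seq-write group in that output):

    # Little's law: average latency = queue depth / throughput
    depth = 256   # the fio job's iodepth
    iops = 183    # measured seq-write IOPS, from the output below
    print("expected avg completion latency: %.0f ms" % (1000 * depth / iops))
    # -> ~1399 ms, matching fio's reported clat average of ~1384 ms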

We do direct writes for benchmarking, not because it is a reasonable
way to do I/O, but because it bypasses the buffer cache and enables
us to directly measure cluster I/O throughput (which is what we are
trying to optimize).  Applications should usually do buffered I/O,
to get the (very significant) benefits of caching and write aggregation.


That's correct for some of the benchmarks. However, even with 4K for
seq, I still get fewer IOPS. See my last fio run below:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99  iops] [eta 02m:59s]
seq-read: (groupid=0, jobs=1): err= 0: pid=15096
   read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
 slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
 clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
  lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
 bw (KB/s) : min=0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06
   cpu  : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
  issued r/w/d: total=200473/0/0, short=0/0/0

  lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10%
rand-read: (groupid=1, jobs=1): err= 0: pid=16846
   read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec
 slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87
 clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40
  lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21
 bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, 
stdev=648.62
   cpu  : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
  issued r/w/d: total=1632349/0/0, short=0/0/0

  lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01%
seq-write: (groupid=2, jobs=1): err= 0: pid=18653
   write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec
 slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97
 clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19
  lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17
 bw (KB/s) : min=7, max= 2165, per=104.03%, avg=764.65, stdev=353.97
   cpu  : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
  issued r/w/d: total=0/11171/0, short=0/0/0

  lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60%
  lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20%
rand-write: (groupid=3, jobs=1): err= 0: pid=20446
   write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec
 slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37
 clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27
  lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84
 bw (KB/s) : min=0, max= 7728, per=31.44%, avg=1078.21, stdev=2000.45
   cpu  : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
  issued r/w/d: total=0/52147/0, short=0/0/0

  lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10%
  lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33%

Run status group 0 (all jobs):
READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s,
mint=60053msec, maxt=60053msec

Run status group 1 (all jobs):
READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s,

Re: RBD fio Performance concerns

2012-11-16 Thread Mark Kampe

On 11/15/2012 12:23 PM, Sébastien Han wrote:


First of all, I would like to thank you for this well explained,
structured and clear answer. I guess I got better IOPS thanks to the 10K disks.


10K RPM would bring your per-drive throughput (for 4K random writes)
up to 142 IOPS and your aggregate cluster throughput up to 1700.
This would predict a corresponding RADOSbench throughput somewhere
above 425 (how much better depending on write aggregation and cylinder 
affinity).  Your RADOSbench 708 now seems even more reasonable.



To be really honest I wasn't so concerned about the RADOS benchmarks
but more about the RBD fio benchmarks and the amount of IOPS that comes
out of it, which I found a bit too low.


Sticking with 4K random writes, it looks to me like you were running
fio with libaio (which means direct, no buffer cache).  Because it
is direct, every I/O operation is really happening and the best
sustained throughput you should expect from this cluster is
the aggregate raw fio 4K write throughput (1700 IOPS) divided
by two copies = 850 random 4K writes per second.  If I read the
output correctly you got 763 or about 90% of back-of-envelope.

BUT, there are some footnotes (there always are with performance)

If you had been doing buffered I/O you would have seen a lot more
(up front) benefit from page caching ... but you wouldn't have been
measuring real (and hence sustainable) I/O throughput ... which is
ultimately limited by the heads on those twelve disk drives, where
all of those writes ultimately wind up.  It is easy to be fast
if you aren't really doing the writes :-)

I would have expected write aggregation and cylinder affinity to
have eliminated some seeks and improved rotational latency resulting
in better than theoretical random write throughput.  Against those
expectations 763/850 IOPS is not so impressive.  But, it looks to
me like you were running fio in a 1G file with 100 parallel requests.
The default RBD stripe width is 4M.  This means that those 100
parallel requests were being spread across 256 (1G/4M) objects.
People in the know tell me that writes to a single object are
serialized, which means that many of those (potentially) parallel
writes were to the same object, and hence serialized.  This would
increase the average request time for the colliding operations,
and reduce the aggregate throughput correspondingly.  Use a
bigger file (or a narrower stripe) and this will get better.
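
A quick estimate of how often those collisions happen (my
arithmetic, assuming the 100 requests land uniformly on the
256 objects):

    n_requests, n_objects = 100, 256   # 1G file / 4M default stripe width

    # probability that a given request has its object to itself
    p_alone = (1 - 1 / n_objects) ** (n_requests - 1)
    print("requests sharing an object: %.0f%%" % (100 * (1 - p_alone)))
    # -> ~32%: roughly a third of the parallel writes serialize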

Thus, getting 763 random 4K write IOPs out of those 12 drives
still sounds about right to me.



On 15 nov. 2012, at 19:43, Mark Kampe  wrote:


Dear Sebastien,

Ross Turn forwarded me your e-mail.  You sent a great deal
of information, but it was not immediately obvious to me
what your specific concern was.

You have 4 servers, 3 OSDs per, 2 copy, and you measured a
radosbench (4K object creation) throughput of 2.9MB/s
(or 708 IOPS).  I infer that you were disappointed by
this number, but it looks right to me.

Assuming typical 7200 RPM drives, I would guess that each
of them would deliver a sustained direct 4K random write
performance in the general neighborhood of:
4ms seek (short seeks with write-settle-downs)
4ms latency (1/2 rotation)
0ms write (4K/144MB/s ~ 30us)
-----
8ms, or about 125 IOPS

Your twelve drives should therefore have a sustainable
aggregate direct 4K random write throughput of 1500 IOPS.

Each 4K object create involves four writes (two copies,
each getting one data write and one data update).  Thus
I would expect a (crude) 4K create rate of 375 IOPS (1500/4).
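
Spelling that arithmetic out (same assumed drive parameters):

    seek_ms, rot_ms, xfer_ms = 4.0, 4.0, 0.03  # 7200 RPM, 4K at ~144MB/s

    per_drive = 1000 / (seek_ms + rot_ms + xfer_ms)   # ~125 IOPS
    cluster_raw = 12 * per_drive                      # ~1500 IOPS
    writes_per_create = 2 * 2   # two copies, each: data write + update
    print("expected 4K creates/s: %.0f" % (cluster_raw / writes_per_create))
    # -> ~375; the measured 708 is above this, thanks to aggregation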

You are getting almost twice the expected raw IOPS ...
and we should expect that a large number of parallel
operations would realize some write/seek aggregation
benefits ... so these numbers look right to me.

Is this the number you were concerned about, or have I
misunderstood?



Re: Using asphyxiate with Doxygen and Java?

2012-10-27 Thread Mark Kampe

TV is of the opinion that Asphyxiate was the right
direction to move in, and that the sloth problems
are solvable, but would require work.

On 10/26/2012 6:26 PM, Noah Watkins wrote:

I stumbled upon Breathe, and then asphyxiate. The doxygenfile
directive in the latter doesn't seem to like what Doxygen produces from
parsing JavaDoc markup, although I've read that the Doxygen output
should be compliant. Here is the error:

AssertionError: cannot handle compounddef kind=class

Before going any further and I wanted to ping the list to see if
anyone thinks it would be a good/bad idea to look into this. It'd be
nice to have the Java documentation in Sphinx seamlessly. Any chance
Breathe has gotten faster over time?




Re: Guidelines for Calculating IOPS?

2012-10-19 Thread Mark Kampe

Replication should have no effect on read throughput/IOPS.

The client does a single write to the primary, and the
primary then handles re-replication to the secondary
copies.  As such the client does not pay (in terms of
CPU or NIC bandwidth) for the replication.  Per-client
throughput limitations should be largely independent of
the replication.

However, the replication does generate additional network
and I/O activity between the OSDs.  This means that the
available aggregate throughput (of the entire cluster)
is effectively cut in half when you move from one-copy to two.

I am confused by your math:

   You say 385MB/s and 5250 IOPS (x8k)
   5250 IOPS * 8192 = 43MB/s

Do you mean that some of your clients are generating
a lot of small block writes (at up to 5250 IPS) and
that others of your clients are doing larger writes
(with an aggregate throughput of 385MB/s)?

For RADOS throughput:
   385MB/s is a fairly small number
   5250 buffered sequential IOPS is a very small number
   5250 random IOPS is not a particularly large number,
   but will require several servers

My guess is that the IOPS may drive the number of
servers, and the drives per server will be the
capacity divided by the number of required servers.

So how many IOPS can you get per server?

You are using RBD, and depending on the particulars
of your stack, there may be a great deal of buffering
and caching on the client side that can make the
RADOS traffic much more efficient than the tributary
client requests.  Thus, I would suggest that you
probably want to actually benchmark the application
in question to measure the client-experienced throughput.


On 10/19/12 07:47, Mike Dawson wrote:

All,

I am investigating the use of Ceph for a video surveillance project with
the following minimum block storage requirements:

385 Mbps of constant write bandwidth
100TB storage requirement
5250 IOPS (size of ~8 KB)

I believe 2 replicas would be acceptable. We intend to use large
capacity (2 or 3TB) SATA 7200rpm 3.5" drives, if the IOPS work out
properly.

Is there a method / formula to estimate IOPS for RBD? Specifically I
would like to understand:

- How does replica count affect read/write IOPS?

- I'm trying to understand best practice for when to optimize server
count, drives per server, and drive capacity as it relates to IOPS. Is
there a point of diminishing I/O performance using server chassis with
lots of drive slots, like the 36-drive Supermicro SC847a?



Re: Client Location

2012-10-09 Thread Mark Kampe

I'm not a real engineer, so please forgive me if I misunderstand,
but can't you create a separate rule for each data center (choosing
first a local copy, and then remote copies), which should ensure
that the primary is always local?  Each data center would then
use a different pool, associated with the appropriate location-
sensitive rule.

Does this approach get you the desired locality preference?
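
For what it is worth, a sketch of what dc1's rule might look like in
the CRUSH map (untested; the bucket names dc1/dc2 are placeholders,
and dc2's pool would use the mirror-image rule).  The first copy
chosen becomes the primary, so taking it from the local data center
first gives each site a local primary:

    rule dc1_local_primary {
            ruleset 1
            type replicated
            min_size 2
            max_size 3
            step take dc1
            step chooseleaf firstn 1 type host
            step emit
            step take dc2
            step chooseleaf firstn -1 type host
            step emit
    }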



Re: Newbie questions

2012-10-06 Thread Mark Kampe

In your original question, I assumed you were talking about the
partitioning of a single cluster.  Now you are talking about
Geographic Disaster Recovery: the replication of data across
multiple (relatively) independent clusters.  This is not yet
supported, but it is most definitely on the road-map.

On 10/6/2012 5:08 PM, Adam Nielsen wrote:

The problem you are describing is called split-brain.  Ceph has an odd
number of monitors and quorum is required before objects can be served.
The partition with the smaller number of monitors will wait harmlessly
until connectivity is reestablished.


Ah right, that makes sense.  Is this set in stone or can it be
configured? I'm just thinking that in this scenario it could be
beneficial to allow read-only access from the partition with the smaller
number of monitors, if there are also clients that can only see those
hosts.  (For example, a business with two sites, and the link between
them goes down, so client PCs can only see their site-local servers.)




Re: Newbie questions

2012-10-06 Thread Mark Kampe

The problem you are describing is called split-brain.  Ceph has an odd number 
of monitors and quorum is required before objects can be served.  The partition 
with the smaller number of monitors will wait harmlessly until connectivity is 
reestablished.

Adam Nielsen  wrote:

>Thanks both for your answers - very informative.  I think I will set up a test 
>Ceph system on my home servers to try it out.
>
>I have one more question:
>
>Ceph seems to handle failed nodes well enough, but what about failed network 
>links?  Say you have a few systems in two locations, connected by a single 
>link.  If the link fails, you will have two isolated networks, each of which 
>will think the other has failed and presumably will try to go on as best it 
>can.  What happens when the link comes back up again?  What if the same file 
>was modified by both isolated clusters when the link was down?  What version 
>will end up back in the cluster?
>
>Thanks again,
>Adam.


Re: Release/branch naming; input requested

2012-05-21 Thread Mark Kampe



On 05/18/12 09:32, Sage Weil wrote:


I think we can limit the relative branches to:

  master = integration, unstable, tip, bleeding edge (same as now)
  [next] = next upcoming release (same as now)
  current = most recent release
  stable = most recent stable release


We have already signed one contract that obligates us
to years of support, and once customers go into production
they will be loath to move to each new stable release as
it comes out.  Thus, I fear that we will be maintaining
multiple stable releases ... but once a stable release ceases
to be the newest, the bar on what has to be back-ported to
it can be significantly raised, lowering the cost.


I like Yehuda's suggestion of cephalopods, or other interesting sea
creatures.  As he points out, though,

   http://www.thecephalopodpage.org/taxa.php

suggests that there may not be enough good choices that are strictly
cephalopods.  Although it might be ok?
...


I don't have an opinion about themes, but I do suggest that they
should be memorable and easily pronounced ... and taxonomic
family names do not always have those characteristics.


3. What do we do with version numbers? With a 2-3 week iteration,
we'll end up with something like 0.41.x, 0.56.x for Folsom integration
(less than a year from now), and 0.57, 0.58 etc for "latest".


We can keep those, they are completely orthogonal. These are exactly
what they are: dev cycle numbers. I'm not too afraid of big numbers
there, as they become uninteresting once you have the other naming
scheme. They have the nice property of monotonically increasing which
is useful internally.


Given that most releases will only get a subset of what is in builds,
I too think that builds should be orthogonal to releases.



Re: Logging braindump

2012-03-22 Thread Mark Kampe

On 03/22/12 09:38, Colin McCabe wrote:

On Mon, Mar 19, 2012 at 1:53 PM, Tommi Virtanen wrote:

[mmap'ed buffer discussion]


I always thought mmap'ed circular buffers were an elegant approach for
getting data that survived a process crash, but not paying the
overhead of write(2) and read(2).  The main problem is that you need
special tools to read the circular buffer files off of the disk.  As
Sage commented, that is probably undesirable for many users.


(a) I actually favor not simply mmaping the circular buffer,
but having a program that pulls the data out of memory
and writes it to disk (ala Varnish).  In addition to doing
huge writes (greatly reducing the write overhead), it can
filter what it processes, so that we have extensive logging
for the last few seconds, and more manageable logs on disk
extending farther back in time (modulo log rotation).

(b) The most interesting logs are probably the ones in coredumps
(that didn't make it out to disk) for which we want a
crawler/extractor anyway.  It probably isn't very hard to
make the program that extracts logs from memory also be
able to pick the pockets of dead bodies (put a big self-
identifying header on the front of each buffer).

Note also that having the ability to extract the logs from
a coredump pretty much eliminates any motivations to flush
log entries out to disk promptly/expensively.  If the process
exits cleanly, we'll get the logs.  If the process produces
a coredump, we'll still get the logs.

(c) I have always loved text logs that I can directly view.
Their immediate and effortless accessibility encourages
their use, which encourages work in optimizing their content
(lots of the stuff you need, and little else).

But binary logs are less than half the size (cheaper to
take and keep twice as much info), and a program that
formats them can take arguments about which records/fields
you want and how you want them formatted ... and getting
the output the way you want it (whether for browsing or
subsequent reprocessing) is a huge win.  You quickly get used to
running the log processing command, and the benefits are worth it.

(d) If somebody really wants text logs for archival, it is completely
trivial to run the output of the log-extractor through the
formatter before writing it to disk ... so the in memory
format need not be tied to the on-disk format.  The rotation
code won't care.


An mmap'ed buffer, even a lockless one, is a simple beast.  Do you
really need a whole library just for that?  Maybe I'm just
old-fashioned.


IMHO, surprisingly few things involving large numbers of performance
critical threads turn out to be simple :-)  For example:

If we are logging a lot, buffer management has the potential
to become a bottleneck ... so we need to be able to allocate
a record of the required size from the circular buffer
with atomic instructions (at least in non-wrap situations;
a sketch of this follows the examples below).

But if records are allocated and then filled, we have to
consider how to handle the case where the filling is
delayed, and the reader catches up with an incomplete
log record (e.g. skip it, wait how long, ???).

And while we hope this will never happen, we have to deal
with what happens when the writer catches up with the
reader, or worse, an incomplete log block ... where we might
have to determine whether or not the owner is deceased (making
it safe to break his record lock) ... or should we simply take
down the service at that point (on the assumption that something
has gone very wrong).

If we are going to use multiple buffers, we may have to
do a transaction dance (last guy in has to close this
buffer to new writes, start a new one, and somebody has
to wait for pending additions to complete, queue this
one for delivery or perhaps even flush it to disk if we
don't have some other thread/process doing this).
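
The reservation scheme from the first example, sketched in Python
(the lock stands in for what would be a single atomic fetch-and-add
in C/C++; wrap handling and the reader side are omitted):

    import struct
    import threading

    BUF_SIZE = 1 << 20
    buf = bytearray(BUF_SIZE)
    head = 0
    head_lock = threading.Lock()  # in C/C++: atomic fetch-and-add, no lock

    HDR = struct.Struct("<IHH")   # magic, total record length, complete flag
    MAGIC = 0x0CE91060            # arbitrary self-identifying marker

    def reserve(payload_len):
        """Claim space for one record; the caller fills it in afterwards."""
        global head
        total = HDR.size + payload_len
        with head_lock:           # the only synchronized step
            off = head
            head += total
        assert off + total <= BUF_SIZE, "wrap handling omitted in this sketch"
        buf[off:off + HDR.size] = HDR.pack(MAGIC, total, 0)  # 0 = not filled
        return off

    def commit(off, payload):
        """Fill the record and mark it complete (readers skip flag == 0)."""
        buf[off + HDR.size:off + HDR.size + len(payload)] = payload
        buf[off:off + HDR.size] = HDR.pack(MAGIC, HDR.size + len(payload), 1)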



Re: ceph article for ;login:

2012-03-14 Thread Mark Kampe

Title
Ceph: a next generation, open source storage platform - Part I

Target size, scope, depth:
2K-3K words
CRUSH, RADOS, RGW, RBD (leave CephFS for part II)
non-programmers literate in OS/storage concepts

Audience:
Engineers, admins, and technically inclined Linux community
members who may or may not have heard of Ceph before, but might
find it interesting to learn more about the domain, technology,
and product.

Goals:
Generate positive buzz,

leading more potential adopters (or even community
 members) to investigate further

contributing to our image as smart people, great
technology, and a high-road endeavor

Strategy:
primary focus is technical
educate readers about the challenges of peta-byte scale storage
educate readers about effective approaches to addressing them
showcase RADOS as an example of such solutions
inspire people with the value and idealism of open source

Anti-Strategies:
do not pitch the company/business (let technology do that)
do not dis existing products (we're above that)

Suggested Table of Contents and word-budget

1. introduction and challenges of petabyte scale storage (200)
2. solutions to these challenges (300-400)
3. CRUSH and RADOS architecture (500 + pictures)
4. how we address those challenges (400+pictures)
5. examples of products built on top of RADOS
   RBD concept, layering, functionality, value (200+picture)
   RGW concept, layering, functionality, value (200+picture)
6. why open source is the right thing here (200)
7. state of the project ... 50% technology, 25% community, 25% company (200)
8. how to get involved (100)




Re: efficient removal of old objects

2012-02-01 Thread Mark Kampe

On 01/31/12 17:02, Tommi Virtanen wrote:


To make my point even clearer: point me to another data store that has
that idiom.


(a) Automatic expiration and deletion is, and has long been, a
standard feature of archival systems ... and our RADOS
clouds are much larger than most archival systems.

(b) I have no competent opinions on the short term solution to this
particular problem, but in the longer term I do not believe
that garbage collection can or should be entrusted to clients.
Clients are ephemeral and cannot be depended on to remember,
a few years (or even hours) from now, that there were some
files they were supposed to delete.

IMHO, object store intelligence is not merely about back-ground
replication and migration, but about "being able to take
responsibility for the life cycle of the data they hold".
The amount of data we store will quickly grow beyond the
ability of external agents to manage it, and lifecycle
automation will become increasingly critical.


Re: Problem while reading the paper about CRUSH

2012-01-31 Thread Mark Kampe
De-clustering means ensuring that objects that all have one copy on a single
volume have their other copies spread all over the cloud.  This enables
many-to-many recovery.

The weights bias selection, e.g. so we can discourage placement on a device 
that is more full.

A "metric" is a unit or means of measuring an interesting quantity.
---mark---

Re: towards a user-mode diagnostic log mechanism

2012-01-06 Thread Mark Kampe

On 01/05/12 20:09, Colin McCabe wrote:


Getting the system time is a surprisingly expensive operation, and
this poses a problem for logging system designers.  You can use the
rdtsc CPU instruction, but unfortunately on some architectures CPU
frequency scaling makes it very inaccurate.  Even when it is accurate,
it's not synchronized between multiple CPUs.

Another option is to do without time for most messages and just have a
periodic timestamp that gets injected every so often -- perhaps every
second, for example.


I agree it needs to be cheap ... but my experience with
debugging problems in this sort of system suggests that
we need the finest grained timestamps we can get
(on every single message).

Even though the clocks on different nodes are not that
closely synchronized, computing the relative offsets
from initial transactions isn't hard ... and then it
becomes possible to construct a total ordering of
events with accurate timings.
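
One standard way to compute such an offset from a single
request/response pair (NTP-style, nothing Ceph-specific):

    def clock_offset(t1, t2, t3, t4):
        """Offset of B's clock relative to A's, from one round trip:
        A sends at t1 and receives the reply at t4 (A's clock);
        B receives at t2 and replies at t3 (B's clock)."""
        return ((t2 - t1) + (t3 - t4)) / 2

    print(clock_offset(100.000, 100.730, 100.731, 100.010))  # ~0.7255s skew

B's log stamps can then be shifted by that offset before merging with
A's, yielding the total ordering with accurate relative times.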


Pantheios and log4cpp are two potential candidates.  I don't know that
much about either, unfortunately.


Good suggestions.  I am also looking at varnish (suggested by Wido
den Hollander) which does logging in a shared memory segment from
which external processes can save it (or not).  What was (for me)
a new idea here is the clean decoupling of in-memory event capture
from on-disk persistence and log rotation.  After I thought about
it for a few minutes, I concluded it had many nice consequences.


Honestly, I feel like logging is something that you will end up
customizing pretty extensively to suit the needs of your application.
But perhaps it's worth checking out what these libraries provide--
especially in terms of efficiency.


I agree that the data captured is going to be something we hone
based on experience.  I am (for now) most concerned with the
mechanism ... because I don't want to start making big investments
in instrumentation until we have a good mechanism, around which
we can fix the APIs that instrumentation will use.

I'll try to review the suggested candidates and describe the
mechanisms and advantages of each in another two weeks.

Thank you very much for the feedback.


towards a user-mode diagnostic log mechanism

2011-12-19 Thread Mark Kampe

I'd like to keep this ball moving ... as I believe that the
limitations of our current logging mechanisms are already
making support difficult, and that is about to become worse.

As a first step, I'd just like to get opinions on the general
requirements we are trying to satisfy, and decisions we have
to make along the way.

Comments?

I Requirements

  A. Primary Requirements (must have)
 1. information captured
a. standard: time, sub-system, level, proc/thread
b. additional: operation and parameters
c. extensible for new operations
 2. efficiency
a. run time overhead < 1%
   (I believe this requires delayed flush circular buffering)
b. persistent space O(Gigabytes per node-year)
 3. configurability
a. capture level per sub-system
 4. persistence
a. flushed out on process shut-down
b. recoverable from user-mode core-dumps
 5. presentation
a. output can be processed w/grep,less,...

  B. Secondary Requirements (nice to have)
 1. ease of use
a. compatible with/convertible from existing calls
b. run-time definition of new event records
 2. configurability
a. size/rotation rules per sub-system
b. separate in-memory/on-disk capture levels

II Decisions to be made

   A. Capture Circumstances
  1. some subset of procedure calls
 (I'm opposed to this, but it is an option)
  2. explicit event logging calls

   B. Capture Format (one possible record layout is sketched after this outline)
  1. ASCII text
  2. per-event binary format
  3. binary header + ASCII text

   C. Synchronization
  1. per-process vs per-thread buffers

   D. Flushing
  1. last writer flushes vs dedicated thread
  2. single- vs double-bufferred output

   E. Available open source candidates
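
To make II.B.3 concrete, here is one possible record encoding (an
illustration only, not a decided format):

    import struct
    import time

    # fixed binary header: magic, usec timestamp, sub-system, level,
    # thread, payload length ... followed by an ASCII/blob payload
    REC = struct.Struct("<IQHBBH")
    MAGIC = 0x0CE91060

    def encode(subsys, level, thread, payload):
        ts_usec = time.monotonic_ns() // 1000
        return REC.pack(MAGIC, ts_usec, subsys, level, thread,
                        len(payload)) + payload

    def decode(buf, off=0):
        magic, ts, subsys, level, thread, n = REC.unpack_from(buf, off)
        assert magic == MAGIC
        start = off + REC.size
        return ts, subsys, level, thread, bytes(buf[start:start + n])

    rec = encode(3, 10, 7, b"osd op received")
    print(len(rec))   # 18-byte header + payload, vs ~100 bytes of text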



moving towards release criteria

2011-11-28 Thread Mark Kampe

At present we are running automated nightlies to catch
problems that slip past developers or only show up on
long runs, and filing bugs when they fail ... but the
decision of whether or not we are ready to push out a
new release is not yet criterion-based.  We have to
start moving towards official release criteria.

I suggest that our initial release criteria should fall
into four categories:

  (1) functional validation

100% passage of designated validation suites,
with a formal process for managing the functional
assertions to be tested (or designating specific
assertions to be compliance-optional)

  (2) regression tests

100% passage of designated regression suites,
with a formal process for designating which
bugs do and do not require the creation of
new regression test cases.

  (3) performance

individual and aggregate throughput measurements,
and key-event timings will be made with controlled
loads on specified hardware configuration, and
compared against performance targets, and a formal
process for defining the target metrics and requirements.

  (4) reliability and robustness

a specified number of hours of (client perceived) error
free operation under continuous load (with specified
levels and characteristics), in the face of specified
error injections ... and a formal process for defining
the times, load characteristics, error injections, and
acceptable performance.

Does this seem like the right general form for our release criteria? 
What changes would you suggest?


Once we agree on the general form of our release criteria, the next
steps are:

  (a) put some stakes in the ground for the initial requirements
  (knowing that they will evolve in scope, specificity, and
  rigour)

  (b) propose some processes for the review, approval, and
  evolution of those standards, and the communication of
  the (current) requirements and results to the community.

  (c) set a date for the first release to be subject to these
  criteria

comments?


Re: The costs of logging and not logging

2011-11-21 Thread Mark Kampe

I'm a big believer in asynchronous flushes of an in-memory
ring-buffer.  For user-mode processes a core-dump grave-robber
can reliably pull out all of the un-flushed entries
... and the same process will also work for the vast majority
of all kernel crashes.

String logging is popular because:
  1. It is trivial to do (in the instrumented code)
  2. It is trivial to read (in the recovered logs)
  3. It is easily usable by grep/perl/etc type tools

But binary logging:
  1. is much (e.g. 4-10x) smaller (especially for standard header info)
  2. is much faster to take (no formatting, less data to copy)
  3. is much faster to process (no re-parsing)
  4. is smaller to store on disk and faster to ship for diagnosis

  and a log dumper can quickly produce output that
  is identical to what the strings would have been

So I also prefer binary logs ... even though they require
the importation of additional classes.  But ...

 (a) the log classes must be kept upwards compatible so
     that old logs can be read by newer tools.

 (b) the binary records should glow-in-the-dark, so that
 they can be recovered even from corrupted ring-buffers
 and blocks whose meta-data has been lost.
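
For (b), "glow in the dark" might mean a fixed magic marker on every
record, so that a grave-robber can walk even a damaged buffer.  A
sketch (the 8-byte header layout here is an assumption, not an agreed
format):

    import struct

    HDR = struct.Struct("<IHH")   # magic, total record length, complete flag
    MAGIC = struct.pack("<I", 0x0CE91060)

    def rob_graves(image):
        """Scan a raw ring-buffer (or a slice of a core dump) for records."""
        i = image.find(MAGIC)
        while i != -1 and i + HDR.size <= len(image):
            _, total, complete = HDR.unpack_from(image, i)
            if complete and HDR.size <= total <= len(image) - i:
                yield image[i + HDR.size:i + total]   # the record payload
                i = image.find(MAGIC, i + total)
            else:
                i = image.find(MAGIC, i + 1)   # torn record: keep scanning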


I see two main issues with the slowness of the current logs:

  - all of the string rendering in the operator<<()'s is slow.  things like
prefixing every line with a dump of the pg state is great for debugging,
but makes the overhead very high.  we could scale all of that back, but
it'd be a big project.
  - the logging always goes to a file, synchronously.  we could write to a
ring buffer and either write it out only on crash, or (at the very least)
write it async.



I wonder, though, if something different might work.  gcc lets you
arbitrarily instrument function calls with -finstrument-functions.
Something that logs function calls and arguments to an in-memory ring
buffer and dumps that out on crash could potentially have a low overhead
(if we restrict it to certain code) and would give us lots of insight into
what happend leading up to the crash.


This does have the advantage of being automatic ... but it is much more
information, perhaps without much more value.  My experience with
logging is that you don't have to capture very much information,
and that in fact we often go back to weed out no-longer-interesting
information.  Not only does too much information take cycles and
space, but it also makes it hard to find "the good stuff".  I think
that human architects can make very good decisions about what
information should be captured, and when.


The costs of logging and not logging

2011-11-21 Thread Mark Kampe

The bugs we most dread are situations that only happen rarely,
and are only detected long after the damage has been done.
Given the business we are in, we will face many of them.
We apparently have such bugs open at this very moment.

In most cases, the primary debugging tools one has are
audit and diagnostic logs ... which WE do not have because
they are too expensive (because they are synchronously
written with C++ streams) to leave enabled all the time.

I think it is a mistake to think of audit and diagnostic
logs as a tool to be turned on when we have a problem to
debug.  There should be a basic level of logging that is
always enabled (so we will have data after the first
instance of the bug) ... which can be cranked up from
verbose to bombastic when we find a problem that won't
yield to more moderate interrogation:

 (a) after the problem happens is too late to
 start collecting data.

 (b) these logs are gold mines of information for
 a myriad of purposes we cannot yet even imagine.

This can only be done if the logging mechanism is
sufficiently inexpensive that we are not afraid to
use it:
low execution overhead from the logging operations
reasonable memory costs for buffering
small enough on disk that we can keep them for months

Not having such a mechanism is (if I correctly
understand) already hurting us for internal debugging,
and will quickly cripple us when we have customer
(i.e. people who cannot diagnose problems for
themselves) problems to debug.

There are many tricks to make logging cheap, and the
sizes acceptable.  There are probably a dozen open-source
implementations that already do what we need, and if they
don't something basic can be built in a two-digit number
of hours.  The real cost is not in the mechanism but in
adapting existing code to use it.  This cost can be
mitigated by making the changes opportunistically ...
one component at a time, as dictated by need/fear.

But we cannot make that change-over until we have a
mechanism.  Because the greatest cost is not the
mechanism, but the change-over, we should give more
than passing thought to what mechanism to choose ...
so that the decision we make remains a good one for
the next few years.

This may be something that we need to do sooner,
rather than later.

regards,
   ---mark---