Re: Crash and strange things on MDS

2013-02-11 Thread Gregory Farnum
On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf  wrote:
> References:
> [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
> [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
> 1: /usr/bin/ceph-mds() [0x817e82]
> 2: (()+0xf140) [0x7f9091d30140]
> 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
> 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
> 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
> 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
> 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
> 8: (Server::kill_session(Session*)+0x137) [0x549c67]
> 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
> 10: (MDS::tick()+0x338) [0x4da928]
> 11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
> 12: (SafeTimerThread::entry()+0xd) [0x782bad]
> 13: (()+0x7ddf) [0x7f9091d28ddf]
> 14: (clone()+0x6d) [0x7f90909cc24d]

This in particular is quite odd. Do you have any logging from when
that happened? (Oftentimes the log can have a bunch of debugging
information from shortly before the crash.)

On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf  wrote:
> Furthermore, I observe another strange thing more or less related to the
> storms.
>
> During a rsync command to write ~20G of data on Ceph and during (and
> after) the storm, one OSD sends a lot of data to the active MDS
> (400Mbps peak each 6 seconds). After a quick check, I found that when I
> stop osd.23, osd.14 stops its peaks.

This is consistent with Sam's suggestion that the MDS is thrashing its
cache and repeatedly fetching directory objects from the OSDs. How large
are the directories you're using? If they're a significant fraction of
your cache size, it might be worth enabling the (sadly less stable)
directory fragmentation options, which split each large directory into smaller
fragments that can be read and written to disk independently.
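If you do go down that route, it's a small MDS config change — something like
this (option name from memory for this release, so double-check it with
"ceph-mds --show-config | grep frag" before relying on it):

  [mds]
      mds bal frag = true

plus an MDS restart; the fragment split/merge thresholds are tunable as well.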
-Greg


Re: Unable to mount cephfs - can't read superblock

2013-02-11 Thread Gregory Farnum
On Sat, Feb 9, 2013 at 2:13 PM, Adam Nielsen  wrote:
 $ ceph -s
 health HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean
 monmap e1: 1 mons at {0=192.168.0.6:6789/0}, election epoch 0,
 quorum 0 0
 osdmap e3: 1 osds: 1 up, 1 in
 pgmap v119: 192 pgs: 192 active+degraded; 0 bytes data, 10204 MB
 used, 2740 GB / 2750 GB avail
 mdsmap e1: 0/0/1 up
>>>
>> In any case, this output indicates that your MDS isn't actually running, 
>> Adam, or at least isn't connected. Check and see if the process is still 
>> going?
>> You should also have minimal logging by default in /var/log/ceph/; you 
>> might find some output there that could be useful.
>
> The MDS appears to be running:
>
> $ ps -A | grep ceph
> 12903 ?00:00:17 ceph-mon
> 12966 ?00:00:10 ceph-mds
> 13047 ?00:00:31 ceph-osd
>
> And I found some logs in /var/log/ceph:
>
> $ cat /var/log/ceph/ceph-mds.0.log
> 2013-02-10 07:57:16.505842 b4aa3b70  0 mds.-1.0 ms_handle_connect on 
> 192.168.0.6:6789/0
>
> So it appears the mds is running.  Wireshark shows some traffic going between 
> hosts when the mount request comes through, but then the responses stop and 
> the client eventually gives up and the mount fails.
>
>>> You better add a second OSD or just do a mkcephfs again with a second
>>> OSD in the configuration.
>
> I just tried this and it fixed the unclean pgs issue, but I still can't mount 
> a cephfs filesystem:
>
> $ ceph -s
>health HEALTH_OK
>monmap e1: 1 mons at {0=192.168.0.6:6789/0}, election epoch 0, quorum 0 0
>osdmap e5: 2 osds: 2 up, 2 in
> pgmap v107: 384 pgs: 384 active+clean; 0 bytes data, 40423 MB used, 5461 
> GB / 5501 GB avail
>mdsmap e1: 0/0/1 up
>
> remote$ mount -t ceph 192.168.0.6:6789:/ /mnt/ceph/
> mount: 192.168.0.6:6789:/: can't read superblock
>
> Running the mds daemon in debug mode says this:
>
> ...
> 2013-02-10 08:07:03.550977 b2a83b70 10 mds.-1.0 MDS::ms_get_authorizer 
> type=mon
> 2013-02-10 08:07:03.551840 b4a87b70  0 mds.-1.0 ms_handle_connect on 
> 192.168.0.6:6789/0
> 2013-02-10 08:07:03.555307 b738c710 10 mds.-1.0 beacon_send up:boot seq 1 
> (currently up:boot)
> 2013-02-10 08:07:03.555629 b738c710 10 mds.-1.0 create_logger
> 2013-02-10 08:07:03.564138 b4a87b70  5 mds.-1.0 handle_mds_map epoch 1 from 
> mon.0
> 2013-02-10 08:07:03.564348 b4a87b70 10 mds.-1.0  my compat 
> compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
> ranges,3=default file layouts on dirs,4=dir inode in separate object}
> 2013-02-10 08:07:03.564454 b4a87b70 10 mds.-1.0  mdsmap compat 
> compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
> ranges,3=default file layouts on dirs,4=dir inode in separate object}
> 2013-02-10 08:07:03.564547 b4a87b70 10 mds.-1.-1 map says i am 
> 192.168.0.6:6800/16077 mds.-1.-1 state down:dne
> 2013-02-10 08:07:03.564654 b4a87b70 10 mds.-1.-1 not in map yet
> 2013-02-10 08:07:07.67 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 2 
> (currently down:dne)
> 2013-02-10 08:07:11.555858 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 3 
> (currently down:dne)
> 2013-02-10 08:07:15.556123 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 4 
> (currently down:dne)
> 2013-02-10 08:07:19.556411 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 5 
> (currently down:dne)
> 2013-02-10 08:07:23.556654 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 6 
> (currently down:dne)
> 2013-02-10 08:07:27.556931 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 7 
> (currently down:dne)
> 2013-02-10 08:07:31.557189 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 8 
> (currently down:dne)
> ...

How bizarre. That indicates the MDS is running and is requesting to
become active, but the monitor for some reason isn't letting it in.
Can you restart your monitor with logging on as well (--debug_mon 20
on the end of the command line, or "debug mon = 20" in the config) and
then try again?
The other possibility is that maybe your MDS doesn't have the right
access permissions; does "ceph auth list" include an MDS, and does it
have any permissions associated?
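Concretely, that's something along the lines of (your mon is named "0" per the
monmap):

  ceph-mon -i 0 --debug_mon 20      # or put "debug mon = 20" under [mon]
  ceph auth list

and then watch the monitor log while the MDS beacons come in.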
-Greg


Re: preferred OSD

2013-02-11 Thread Gregory Farnum
On Fri, Feb 8, 2013 at 4:45 PM, Sage Weil  wrote:
> Hi Marcus-
>
> On Fri, 8 Feb 2013, Marcus Sorensen wrote:
>> I know people have been discussing on and off about providing a
>> "preferred OSD" for things like multi-datacenter, or even within a
>> datacenter, choosing an OSD that would avoid traversing uplinks.  Has
>> there been any discussion on how to do this? I seem to remember people
>> saying things like 'the crush map doesn't work that way at the
>> moment'. Presumably, when a client needs to access an object, it looks
>> up where the object should be stored via the crush map, which returns
>> all OSDs that could be read from.

Exactly.

>> I was thinking this morning that you
>> could potentially leave the crush map out of it, by setting a location
>> for each OSD in the ceph.conf, and an /etc/ceph/location file for the
>> client.  Then use the absolute value of the difference to determine
>> preferred OSD. So, if OSD0 was location=1, and OSD1 was location=3,
>> and client 1 was location=2, then it would do the normal thing, but if
>> client 1 was location=1.3, then it would prefer OSD0 for reads.
>> Perhaps that's overly simplistic and wouldn't scale to meet everyone's
>> requirements, but you could do multiple locations and sprinkle clients
>> in between them all in various ways.  Or perhaps the location is a
>> matrix, so you could literally map it out on a grid with a set of
>> coordinates. What ideas are being discussed around how to implement
>> this?
>
> We can do something like this for reads today, where we pick a read
> replica based on the closest IP or some other metric/mask.  We generally
> don't enable this because it leads to non-optimal cache behavior, but it
> could in principle be enabled via a config option for certain clusters
> (and in fact some of that code is already in place).

Just to be specific — there are currently flags which will let the
client read from the local host if it can figure that out; those
aren't heavily tested, but they do work when we turn them on. Other
metrics of "closeness" aren't implemented yet, though.
In general, CRUSH locations seem like a good measure of closeness that
the client could rely on, rather than a separate "location" value, but
it does restrict the usefulness if you've configured multiple CRUSH
root nodes. I think it would need to support a tree of some kind
though, rather than just a linear value.
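(To illustrate what I mean by a tree, it's just what a decompiled CRUSH map
already expresses — made-up names, sketch only:

  host node-a {
          id -2
          alg straw
          hash 0
          item osd.0 weight 1.000
  }
  rack rack-1 {
          id -3
          alg straw
          hash 0
          item node-a weight 1.000
  }

so "closeness" falls out as how far up the hierarchy you have to go to find a
common bucket.)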
-Greg


Re: rest mgmt api

2013-02-11 Thread Gregory Farnum
On Wed, Feb 6, 2013 at 12:14 PM, Sage Weil  wrote:
> On Wed, 6 Feb 2013, Dimitri Maziuk wrote:
>> On 02/06/2013 01:34 PM, Sage Weil wrote:
>>
>> > I think the one caveat here is that having a single registry for commands
>> > in the monitor means that commands can come in two flavors: vector
>> > (cli) and URL (presumably in json form).  But a single command
>> > dispatch/registry framework will make that distinction pretty simple...
>>
>> Any reason you can't have your CLI json-encode the commands (or,
>> conversely, your cgi/wsgi/php/servlet URL handler decode them into
>> vector) before passing them on to the monitor?
>
> We can, but they won't necessarily look the same, because it is unlikely
> we can make a sane 1:1 translation of the CLI to REST that makes sense,
> and it would be nice to avoid baking knowledge about the individual
> commands into the client side.

I disagree and am with Joao on this one — the monitor parsing is
ridiculous as it stands right now, and we should be trying to get rid
of the manual string parsing. The monitors should be parsing JSON
commands that are sent by the client; it makes validation and the
logic control flow a lot easier. We're going to want some level of
intelligence in the clients so that they can tailor themselves to the
appropriate UI conventions, and having two different parsing paths in
the monitors is just asking for trouble: they will get out of sync and
have different kinds of parsing errors.

What we could do is have the monitors speak JSON only, and then give
the clients a minimal intelligence so that the CLI could (for
instance) prettify the options for commands it knows about, but still
allow pass-through for access to newer commands it hasn't yet heard
of.
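(Purely as an illustration of shape, not a settled format: the CLI might send
the monitor something like

  { "prefix": "osd set", "key": "noout" }

and the monitor would validate the fields against its single command registry,
while the "ceph osd set noout" spelling stays a client-side nicety.)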
-Greg


Re: OSD down

2013-02-10 Thread Gregory Farnum
The OSD daemon is getting back EIO when it tries to do a read. Sounds like your 
disk is going bad. 
-Greg

PS: This question is a good fit for the new ceph-users list. :)
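If you want to confirm it from the OS side before swapping the drive, something
like the following (with the device actually backing osd.12 substituted in)
usually shows the underlying errors:

  dmesg | grep -i 'i/o error'
  smartctl -a /dev/sdX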


On Sunday, February 10, 2013 at 9:45 AM, Olivier Bonvalet wrote:

> Hi,
> 
> I have an OSD which often stopped (ceph 0.56.2), with that in logs :
> 
> 446 stamp 2013-02-10 18:37:27.559777) v2  47+0+0 (4068038983 0 0)
> 0x11e028c0 con 0x573d6e0
> -3> 2013-02-10 18:37:27.561618 7f1c765d5700 1 --
> 192.168.42.1:0/5824 <== osd.31 192.168.42.3:6811/23050 129 
> osd_ping(ping_reply e13446 stamp 2013-02-10 18:37:27.559777) v2  47
> +0+0 (4068038983 0 0) 0x73be380 con 0x573d420
> -2> 2013-02-10 18:37:27.562674 7f1c765d5700 1 --
> 192.168.42.1:0/5824 <== osd.1 192.168.42.2:6803/7458 129 
> osd_ping(ping_reply e13446 stamp 2013-02-10 18:37:27.559777) v2  47
> +0+0 (4068038983 0 0) 0x6bd8a80 con 0x573dc60
> -1> 2013-02-10 18:37:28.217626 7f1c805e9700 5 osd.12 13444 tick
> 0> 2013-02-10 18:37:28.552692 7f1c725cd700 -1 os/FileStore.cc: In
> function 'virtual int FileStore::read(coll_t, const hobject_t&,
> uint64_t, size_t, ceph::bufferlist&)' thread 7f1c725cd700 time
> 2013-02-10 18:37:28.537715
> os/FileStore.cc: 2732: FAILED assert(!m_filestore_fail_eio || got != -5)
> 
> ceph version ()
> 1: (FileStore::read(coll_t, hobject_t const&, unsigned long, unsigned
> long, ceph::buffer::list&)+0x462) [0x725f92]
> 2: (PG::_scan_list(ScrubMap&, std::vector std::allocator >&, bool)+0x371) [0x685da1]
> 3: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t,
> bool)+0x29b) [0x6866bb]
> 4: (PG::replica_scrub(MOSDRepScrub*)+0x8e9) [0x6952b9]
> 5: (OSD::RepScrubWQ::_process(MOSDRepScrub*)+0xc2) [0x6410a2]
> 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x80f9e9]
> 7: (ThreadPool::WorkThread::entry()+0x10) [0x8121f0]
> 8: (()+0x68ca) [0x7f1c852f48ca]
> 9: (clone()+0x6d) [0x7f1c83e23b6d]
> NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> 
> --- logging levels ---
> 0/ 5 none
> 0/ 1 lockdep
> 0/ 1 context
> 1/ 1 crush
> 1/ 5 mds
> 1/ 5 mds_balancer
> 1/ 5 mds_locker
> 1/ 5 mds_log
> 1/ 5 mds_log_expire
> 1/ 5 mds_migrator
> 0/ 1 buffer
> 0/ 1 timer
> 0/ 1 filer
> 0/ 1 striper
> 0/ 1 objecter
> 0/ 5 rados
> 0/ 5 rbd
> 0/ 5 journaler
> 0/ 5 objectcacher
> 0/ 5 client
> 0/ 5 osd
> 0/ 5 optracker
> 0/ 5 objclass
> 1/ 3 filestore
> 1/ 3 journal
> 0/ 5 ms
> 1/ 5 mon
> 0/10 monc
> 0/ 5 paxos
> 0/ 5 tp
> 1/ 5 auth
> 1/ 5 crypto
> 1/ 1 finisher
> 1/ 5 heartbeatmap
> 1/ 5 perfcounter
> 1/ 5 rgw
> 1/ 5 hadoop
> 1/ 5 javaclient
> 1/ 5 asok
> 1/ 1 throttle
> -2/-2 (syslog threshold)
> -1/-1 (stderr threshold)
> max_recent 10
> max_new 1000
> log_file /var/log/ceph/osd.12.log
> --- end dump of recent events ---
> 2013-02-10 18:37:29.236649 7f1c725cd700 -1 *** Caught signal (Aborted)
> **
> in thread 7f1c725cd700
> 
> ceph version ()
> 1: /usr/bin/ceph-osd() [0x7a0db9]
> 2: (()+0xeff0) [0x7f1c852fcff0]
> 3: (gsignal()+0x35) [0x7f1c83d861b5]
> 4: (abort()+0x180) [0x7f1c83d88fc0]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f1c8461adc5]
> 6: (()+0xcb166) [0x7f1c84619166]
> 7: (()+0xcb193) [0x7f1c84619193]
> 8: (()+0xcb28e) [0x7f1c8461928e]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x7c9) [0x8f3fc9]
> 10: (FileStore::read(coll_t, hobject_t const&, unsigned long, unsigned
> long, ceph::buffer::list&)+0x462) [0x725f92]
> 11: (PG::_scan_list(ScrubMap&, std::vector std::allocator >&, bool)+0x371) [0x685da1]
> 12: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t,
> bool)+0x29b) [0x6866bb]
> 13: (PG::replica_scrub(MOSDRepScrub*)+0x8e9) [0x6952b9]
> 14: (OSD::RepScrubWQ::_process(MOSDRepScrub*)+0xc2) [0x6410a2]
> 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x80f9e9]
> 16: (ThreadPool::WorkThread::entry()+0x10) [0x8121f0]
> 17: (()+0x68ca) [0x7f1c852f48ca]
> 18: (clone()+0x6d) [0x7f1c83e23b6d]
> NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> 
> --- begin dump of recent events ---
> -1> 2013-02-10 18:37:29.217778 7f1c805e9700 5 osd.12 13444 tick
> 0> 2013-02-10 18:37:29.236649 7f1c725cd700 -1 *** Caught signal
> (Aborted) **
> in thread 7f1c725cd700
> 
> ceph version ()
> 1: /usr/bin/ceph-osd() [0x7a0db9]
> 2: (()+0xeff0) [0x7f1c852fcff0]
> 3: (gsignal()+0x35) [0x7f1c83d861b5]
> 4: (abort()+0x180) [0x7f1c83d88fc0]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f1c8461adc5]
> 6: (()+0xcb166) [0x7f1c84619166]
> 7: (()+0xcb193) [0x7f1c84619193]
> 8: (()+0xcb28e) [0x7f1c8461928e]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x7c9) [0x8f3fc9]
> 10: (FileStore::read(coll_t, hobject_t const&, unsigned long, unsigned
> long, ceph::buffer::list&)+0x462) [0x725f92]
> 11: (PG::_scan_list(ScrubMap&, std::vector std::allocator >&, bool)+0x371) [0x685da1]
> 12: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t,
> bool)+0x29

Re: Possible filesystem corruption or something else?

2013-02-09 Thread Gregory Farnum
On Saturday, February 9, 2013 at 6:23 AM, John Axel Eriksson wrote:
> Three times now, twice on one osd, once on another we've had the osd
> crash. Restarting it wouldn't help - it would crash with the same
> error. The only way I found to get it up again was to reformat both
> the journal disk and the disk ceph is using for storage... basically
> recreating the osd.
> This has got me thinking it's some sort of filesystem corruption going
> on but I can't be sure.
>  
> Thing is, the first two times this happened on 0.48.3 (argonaut) and
> this last time it happened on 0.56.2 - I upgraded hoping this issue
> was fixed.
>  
> There is another possibility than ceph itself - we're using btrfs on
> the ceph disks. We're using it because in general we haven't seen any
> problems. We've been running ceph on these for six months without
> issue. We also really need the compression btrfs can do (we're saving
> vast amounts of space this way because of the nature of the data we're
> storing).
>  
> Kernel is, and has been 3.6.2-030602-generic for a long time now, I
> think we started out on 3.5.x but pretty quickly went to 3.6.2. The
> disks are formatted like so:
> mkfs.btrfs -l 32k -n 32k /dev/xvdf
>  
> Otherwise the nodes are running on Ubuntu 12.04.1 LTS. This is all
> running on EC2. Thanks for any help I can get!
>  
> I know it may not be verbose enough but this is the log I got from
> this last crash:

This log indicates the problem is a corruption in the integrated leveldb 
database. And you mention using btrfs compression, so I point you to 
http://tracker.ceph.com/issues/2563. :( I don't know anything more than that; 
maybe somebody else on the team knows more…Sam?
-Greg

  
>  
> 2013-02-09 13:18:08.685989 7f3f92949780 1 journal _open
> /mnt/osd.2.journal fd 7: 1048576000 bytes, block size 4096 bytes,
> directio = 1, aio = 0
> 2013-02-09 13:18:08.693418 7f3f92949780 0
> filestore(/var/lib/ceph/osd/ceph-2) mkjournal created journal on
> /mnt/osd.2.journal
> 2013-02-09 13:18:08.693481 7f3f92949780 -1 created new journal
> /mnt/osd.2.journal for object store /var/lib/ceph/osd/ceph-2
> 2013-02-09 13:18:21.926143 7f09b972d780 0
> filestore(/var/lib/ceph/osd/ceph-2) mount FIEMAP ioctl is supported
> and appears to work
> 2013-02-09 13:18:21.926214 7f09b972d780 0
> filestore(/var/lib/ceph/osd/ceph-2) mount FIEMAP ioctl is disabled via
> 'filestore fiemap' config option
> 2013-02-09 13:18:21.926704 7f09b972d780 0
> filestore(/var/lib/ceph/osd/ceph-2) mount detected btrfs
> 2013-02-09 13:18:21.926881 7f09b972d780 0
> filestore(/var/lib/ceph/osd/ceph-2) mount btrfs CLONE_RANGE ioctl is
> supported
> 2013-02-09 13:18:21.996613 7f09b972d780 0
> filestore(/var/lib/ceph/osd/ceph-2) mount btrfs SNAP_CREATE is
> supported
> 2013-02-09 13:18:21.998330 7f09b972d780 0
> filestore(/var/lib/ceph/osd/ceph-2) mount btrfs SNAP_DESTROY is
> supported
> 2013-02-09 13:18:21.999840 7f09b972d780 0
> filestore(/var/lib/ceph/osd/ceph-2) mount btrfs START_SYNC is
> supported (transid 549552)
> 2013-02-09 13:18:22.032267 7f09b972d780 0
> filestore(/var/lib/ceph/osd/ceph-2) mount btrfs WAIT_SYNC is supported
> 2013-02-09 13:18:22.045994 7f09b972d780 0
> filestore(/var/lib/ceph/osd/ceph-2) mount btrfs SNAP_CREATE_V2 is
> supported
> 2013-02-09 13:18:22.104523 7f09b972d780 0
> filestore(/var/lib/ceph/osd/ceph-2) mount syncfs(2) syscall fully
> supported (by glibc and kernel)
> 2013-02-09 13:18:22.104811 7f09b972d780 0
> filestore(/var/lib/ceph/osd/ceph-2) mount found snaps
> <4282852,4282856>
> 2013-02-09 13:18:22.323175 7f09b972d780 0
> filestore(/var/lib/ceph/osd/ceph-2) mount: enabling PARALLEL journal
> mode: btrfs, SNAP_CREATE_V2 detected and 'filestore btrfs snap' mode
> is enabled
> 2013-02-09 13:18:23.041769 7f09b4dc7700 -1 *** Caught signal (Aborted) **
> in thread 7f09b4dc7700
>  
> ceph version 0.56.2 (586538e22afba85c59beda49789ec42024e7a061)
> 1: /usr/bin/ceph-osd() [0x7828da]
> 2: (()+0xfcb0) [0x7f09b8bc8cb0]
> 3: (gsignal()+0x35) [0x7f09b7587425]
> 4: (abort()+0x17b) [0x7f09b758ab8b]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f09b7ed969d]
> 6: (()+0xb5846) [0x7f09b7ed7846]
> 7: (()+0xb5873) [0x7f09b7ed7873]
> 8: (()+0xb596e) [0x7f09b7ed796e]
> 9: (std::__throw_length_error(char const*)+0x57) [0x7f09b7e84907]
> 10: (()+0x9eaa2) [0x7f09b7ec0aa2]
> 11: (char* std::string::_S_construct(char const*, char
> const*, std::allocator const&, std::forward_iterator_tag)+0x35)
> [0x7f09b7ec2495]
> 12: (std::basic_string,
> std::allocator >::basic_string(char const*, unsigned long,
> std::allocator const&)+0x1d) [0x7f09b7ec261d]
> 13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
> leveldb::Slice const&) const+0x47) [0x769137]
> 14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice
> const&)+0x92) [0x777b62]
> 15: 
> (leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x482)
> [0x7639a2]
> 16: (leveldb::DBImpl::BackgroundCompaction()+0x2b0) [0x7641a0]
> 17: (leveld

Re: Unable to mount cephfs - can't read superblock

2013-02-09 Thread Gregory Farnum
On Saturday, February 9, 2013 at 3:09 AM, Wido den Hollander wrote:
> On 02/09/2013 12:06 PM, Adam Nielsen wrote:
> > Thanks for your quick reply!
> > 
> > > Could you show the output of "ceph -s"
> > 
> > $ ceph -s
> > health HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean
> > monmap e1: 1 mons at {0=192.168.0.6:6789/0}, election epoch 0,
> > quorum 0 0
> > osdmap e3: 1 osds: 1 up, 1 in
> > pgmap v119: 192 pgs: 192 active+degraded; 0 bytes data, 10204 MB
> > used, 2740 GB / 2750 GB avail
> > mdsmap e1: 0/0/1 up
> 
> 
> 
> Ah, I see you only have one OSD, where the default replication level is 
> 2. Also, pools don't work by default if only one replica is left.

Actually that shouldn't be a problem with a size 2 pool. They'll work as long 
as at least half of their assigned replica count is available. :) (There was a 
brief period in development releases where that wasn't the case, but it didn't 
last very long!)

In any case, this output indicates that your MDS isn't actually running, Adam, 
or at least isn't connected. Check and see if the process is still going?
You should also have minimal logging by default in /var/log/ceph/; you 
might find some output there that could be useful.
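(A quick check is something like

  ps aux | grep ceph-mds
  ceph mds stat

— the first tells you whether the daemon is alive, the second what the monitors
think of it.)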
-Greg

 
> 
> You better add a second OSD or just do a mkcephfs again with a second 
> OSD in the configuration.
> 
> Just a reminder, it's also in the docs, but CephFS is still in beta, so 
> expect weird things to happen. It can however not hurt to play with it!
> 
> P.S.: There's also a new and shiny ceph-users list since two days ago, 
> might want to subscribe there.
> 
> Wido
> 
> > > Also, which version of Ceph are you using under which OS?
> > 
> > The latest stable Debian release from ceph.com (bobtail AFAIK).
> > 
> > Thanks,
> > Adam.
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> 
> -- 
> Wido den Hollander
> 42on B.V.
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html





Quantal gitbuilder is broken

2013-02-08 Thread Gregory Farnum
I'm not sure who's responsible for this, but I see everything is red
on our i386 Quantal gitbuilder. Probably just needs
libboost-program-options installed, based on the error I'm seeing in
one of my branches? Although there are some warnings I'm not too used
to seeing in there as well.
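(On Quantal that should just be an "apt-get install libboost-program-options-dev"
on the builder, assuming the failure really is the missing Boost library I think
it is.)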
-Greg


Re: ceph mkfs failed

2013-02-08 Thread Gregory Farnum
On Fri, Feb 8, 2013 at 1:18 PM, sheng qiu  wrote:
> ok, i have figured out it.

That looks like a LevelDB issue given the backtrace (and the OSD isn't
responding because it crashed). If you figured out why LevelDB
crashed, it'd be good to know so that other people can reference this
if they see something similar. :)
-Greg


> On Fri, Feb 8, 2013 at 2:57 PM, sheng qiu  wrote:
>> ok, this is tested using ext3/ext4 on a normal SSD as OSD.
>>
>> ceph -s shows:
>> health HEALTH_WARN 384 pgs degraded; 384 pgs stuck unclean; recovery
>> 22/44 degraded (50.000%)
>>monmap e1: 1 mons at {0=165.91.215.237:6789/0}, election epoch 2, quorum 
>> 0 0
>>osdmap e3: 1 osds: 1 up, 1 in
>> pgmap v10: 384 pgs: 384 active+degraded; 26716 bytes data, 1184 MB
>> used, 55857 MB / 60093 MB avail; 22/44 degraded (50.000%)
>>mdsmap e4: 1/1/1 up {0=0=up:active}
>>
>> dmesg shows:
>> [  212.758376] libceph: client4106 fsid f60af615-67cb-4245-91cb-22752821f3e6
>> [  212.759869] libceph: mon0 165.91.215.237:6789 session established
>> [  338.292461] libceph: osd0 165.91.215.237:6801 socket closed (con state 
>> OPEN)
>> [  338.292483] libceph: osd0 165.91.215.237:6801 socket error on write
>> [  339.161231] libceph: osd0 165.91.215.237:6801 socket error on write
>> [  340.159003] libceph: osd0 165.91.215.237:6801 socket error on write
>> [  342.158514] libceph: osd0 165.91.215.237:6801 socket error on write
>> [  346.149549] libceph: osd0 165.91.215.237:6801 socket error on write
>>
>> osd.0.log shows:
>> 2013-02-08 14:52:51.649726 7f82780f6700  0 -- 165.91.215.237:6801/7135
 165.91.215.237:0/3238315774 pipe(0x2d61240 sd=803 :6801 pgs=0 cs=0
>> l=0).accept peer addr is really 165.91.215.237:0/3238315774 (socket is
>> 165.91.215.237:57270/0)
>> 2013-02-08 14:53:26.103770 7f8283c10700 -1 *** Caught signal
>> (Segmentation fault) **
>>  in thread 7f8283c10700
>>
>>  ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
>>  1: ./ceph-osd() [0x78648a]
>>  2: (()+0x10060) [0x7f828cb0e060]
>>  3: (fwrite()+0x34) [0x7f828aea3ec4]
>>  4: (leveldb::log::Writer::EmitPhysicalRecord(leveldb::log::RecordType,
>> char const*, unsigned long)+0x11f) [0x76d93f]
>>  5: (leveldb::log::Writer::AddRecord(leveldb::Slice const&)+0x74) [0x76dae4]
>>  6: (leveldb::DBImpl::Write(leveldb::WriteOptions const&,
>> leveldb::WriteBatch*)+0x160) [0x763050]
>>  7: 
>> (LevelDBStore::submit_transaction(std::tr1::shared_ptr)+0x2a)
>> [0x74ec1a]
>>  8: (DBObjectMap::remove_xattrs(hobject_t const&,
>> std::set,
>> std::allocator > const&, SequencerPosition const*)+0x16a)
>> [0x746fca]
>>  9: (FileStore::_setattrs(coll_t, hobject_t const&,
>> std::map,
>> std::allocator > >&,
>> SequencerPosition const&)+0xe7f) [0x719aff]
>>  10: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned
>> long, int)+0x3cba) [0x71e7da]
>>  11: (FileStore::do_transactions(std::list> std::allocator >&, unsigned long)+0x4c)
>> [0x72152c]
>>  12: (FileStore::_do_op(FileStore::OpSequencer*)+0x1b1) [0x6f1331]
>>  13: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4bc) [0x827dec]
>>  14: (ThreadPool::WorkThread::entry()+0x10) [0x829cb0]
>>  15: (()+0x7efc) [0x7f828cb05efc]
>>  16: (clone()+0x6d) [0x7f828af1cf8d]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is
>> needed to interpret this.
>>
>>
>> any suggestions?
>>
>> Thanks,
>> Sheng
>>
>> On Fri, Feb 8, 2013 at 11:53 AM, sheng qiu  wrote:
>>> Hi,
>>>
>>> i think it's not related with my local FS. i build a ext4 on a ramdisk
>>> and used it as OSD.
>>> when i run the iozone or fio on the mounted client point, it  shows
>>> the same info as before:
>>>
>>> 2013-02-08 11:45:06.803915 7f28ec7c4700  0 -- 165.91.215.237:6801/7101
> 165.91.215.237:0/1990103183 pipe(0x2ded240 sd=803 :6801 pgs=0 cs=0
>>> l=0).accept peer addr is really 165.91.215.237:0/1990103183 (socket is
>>> 165.91.215.237:60553/0)
>>> 2013-02-08 11:45:06.879009 7f28f7add700 -1 *** Caught signal
>>> (Segmentation fault) **
>>>  in thread 7f28f7add700
>>>
>>> the ceph -s shows, also the same as using my own local FS:
>>>
>>>   health HEALTH_WARN 384 pgs degraded; 384 pgs stuck unclean; recovery
>>> 21/42 degraded (50.000%)
>>>monmap e1: 1 mons at {0=165.91.215.237:6789/0}, election epoch 2, quorum 
>>> 0 0
>>>osdmap e3: 1 osds: 1 up, 1 in
>>> pgmap v7: 384 pgs: 384 active+degraded; 21003 bytes data, 276 MB
>>> used, 3484 MB / 3961 MB avail; 21/42 degraded (50.000%)
>>>mdsmap e4: 1/1/1 up {0=0=up:active}
>>>
>>> dmesg shows:
>>>
>>> [  656.799209] libceph: client4099 fsid da0fe76d-8506-4bf8-8b49-172fd8bc6d1f
>>> [  656.800657] libceph: mon0 165.91.215.237:6789 session established
>>> [  683.789954] libceph: osd0 165.91.215.237:6801 socket closed (con state 
>>> OPEN)
>>> [  683.790007] libceph: osd0 165.91.215.237:6801 socket error on write
>>> [  684.909095] libceph: osd0 165.91.215.237:6801 socket error on write
>>> [  685.903425] libceph: osd0 165.91.215.237:6801 socket error on write
>>> [  687.903937] libceph: osd0

Re: ceph mkfs failed

2013-02-07 Thread Gregory Farnum
On Thu, Feb 7, 2013 at 12:42 PM, sheng qiu  wrote:
> Hi Dan,
>
> thanks for your reply.
>
> after some code tracking, i found it failed at this point :
> in file leveldb/db/db_impl.cc  --> NewDB()
>
> log::Writer log(file);
> std::string record;
> new_db.EncodeTo(&record);
> s = log.AddRecord(record);
> if (s.ok()) {
>   fprintf(test, "NewDB: 2\n");
>   s = file->Close();
> }else
>   fprintf(test, "NewDB: 2.5\n");
>
> the log.AddRecord return s which is not ok().
>
> can you provide some hint why it fails?  i am reading the AddRecord()
> function now.

LevelDB is a generic library which we don't develop. My understanding
is that it's expected to work on any POSIX-compliant filesystem, but
you can check out the docs
(http://leveldb.googlecode.com/svn/trunk/doc/index.html) or source
code for more info.
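One thing that would speed this up: the returned Status carries the error text,
so in your instrumented copy you could print it instead of just branching on it —
something like

  if (!s.ok())
    fprintf(stderr, "AddRecord failed: %s\n", s.ToString().c_str());

which usually points straight at the failing syscall or the corrupt record.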
-Greg


Re: Increase number of pg in running system

2013-02-05 Thread Gregory Farnum
On Tuesday, February 5, 2013 at 5:49 PM, Sage Weil wrote:
> On Tue, 5 Feb 2013, Mandell Degerness wrote:
> > I would like very much to specify pg_num and pgp_num for the default
> > pools, but they are defaulting to 64 (no OSDs are defined in the
> > config file). I have tried using the options indicated by Artem, but
> > they didn't seem to have any effect on the data and rbd pools which
> > are created by default. Is there something I am missing?
>  
>  
>  
>  
> Ah, I see. Specifying this is awkward. In [mon] or [global],
>  
> osd pg bits = N
> osd pgp bits = N
>  
> where N is the number of bits to shift 1 to the left. So for 1024
> PGs, you'd do 10. (What it's actually doing is MAX(1, num_osds) << N.
> The default N is 6, so you're probably seeing 64 PGs per pool by default.)

I see the confusion though — the osd_pool_default_pg_num option is only used 
for pools which you create through the monitor after the system is already 
running.
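So, roughly: for the pools created at mkcephfs time,

  [global]
      osd pg bits = 10
      osd pgp bits = 10

and for pools you create later, either set osd_pool_default_pg_num or pass the
counts explicitly, e.g.

  ceph osd pool create mypool 1024 1024

(names and numbers here are just examples).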



On Tuesday, February 5, 2013 at 7:22 PM, Chen, Xiaoxi wrote:

> But can we change the pg_num of a pool when the pool contains data? If yes, 
> how to do this?  
>  
We advise against that right now; the relevant code isn't well-enough tested.


Re: Ceph Development with Eclipse

2013-02-05 Thread Gregory Farnum
I haven't done the initial setup in several years, but as I recall,
once Ceph was built it was a simple matter of doing "New Makefile
Project with Existing Code" in Eclipse. Make sure you've got the C++
version of Eclipse.
Other than that, I'm afraid you'll need to go through the Eclipse
support systems. :)
-Greg

On Mon, Feb 4, 2013 at 9:18 PM, charles L  wrote:
>
> hi Greg ...
>
> I waited all day for your reply ...guess u had much to do today...
>
> I can see that no one else can help from the community ..they seems not to be 
> familiar with my scenario...and I see you are the only one that understand my 
> tout...
>
> Pls kindly find time to respond with the steps i need to take ...thanks a 
> million...
>
>
> Regards,
>
> Charles.
>
> 
>> From: charlesboy...@hotmail.com
>> To: g...@inktank.com
>> CC: ceph-devel@vger.kernel.org; post.the.re...@gmail.com
>> Subject: RE: Ceph Development with Eclipse
>> Date: Mon, 4 Feb 2013 14:01:25 +0100
>>
>>
>> Hi Greg,
>>
>> I guess setting Ceph up outside of Eclipse include doing the following on 
>> any linux system:
>>
>> 1. To prepare the source tree after it has been git cloned,
>> $ git submodule update --init
>>
>> 2. To build the server daemons, and FUSE client, execute the following:
>> $ ./autogen.sh
>> $ ./configure
>> $ make
>> 3. Creating a C/C++ project on eclipse and importing the Makefile into 
>> Eclipse ...to put the ceph project under Eclipse control...
>>
>> BUT its not working!!! Makefile not visible in Eclipse and wont import... I 
>> guess i have missed some steps in setting it up outside Eclipse or the 
>> process of importing ...
>>
>>
>> ... I will appreciate your guidance and guidance from the community too...
>>
>>
>> Regards,
>>
>> Charles.
>>
>>
>>
>>
>> > Date: Sat, 2 Feb 2013 18:16:08 -0800
>> > From: g...@inktank.com
>> > To: charlesboy...@hotmail.com
>> > CC: ceph-devel@vger.kernel.org; post.the.re...@gmail.com
>> > Subject: Re: Ceph Development with Eclipse
>> >
>> > I actually still do this. Set it up Ceph outside of Eclipse initially, 
>> > then import it as a project with an existing Makefile. It should pick up 
>> > on everything it needs to well enough. :)
>> > -Greg
>> >
>> >
>> > On Saturday, February 2, 2013 at 9:40 AM, charles L wrote:
>> >
>> > >
>> > >
>> > > Hi
>> > >
>> > > I am a beginner at c++ and eclipse. I need some startup help to develop 
>> > > ceph with eclipse. If you could provide your config file on eclipse, it 
>> > > will be a great starting point and very appreciated.
>> > >
>> > > Regards,
>> > >
>> > > Charles.
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > the body of a message to majord...@vger.kernel.org
>> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>


Re: Throttle::wait use case clarification

2013-02-05 Thread Gregory Farnum
Merged, thanks a bunch for the contributions. :)
-Greg

On Tue, Feb 5, 2013 at 11:10 AM, Loic Dachary  wrote:
> Hi,
>
> Here is a new pull request containing the patch and the associated unit 
> tests, all together. Thanks a lot for reviewing them :-)
>
> https://github.com/ceph/ceph/pull/39
>
> Cheers
>
> On 02/05/2013 01:22 AM, Gregory Farnum wrote:
>> Loic,
>> Sorry for the delay in getting back to you about these patches. :( I
>> finally got some time to look over them, and in general it's all good!
>> I do have some comments, though.
>>
>> On Mon, Jan 21, 2013 at 5:44 AM, Loic Dachary  wrote:
>>>> Looking through the history of that test (in _reset_max), I think it's an 
>>>> accident and we actually want to be waking up the front if the maximum 
>>>> increases (or possibly in all cases, in case the front is a very large 
>>>> request we're going to let through anyway). Want to submit a patch? :)
>>> :-) Here it is. "make check" does not complain. I've not run teuthology + 
>>> qa-suite though. I figured out how to run teuthology but did not yet try 
>>> qa-suite.
>>>
>>> http://marc.info/?l=ceph-devel&m=135877502606311&w=4
>>
>> This patch to reverse the conditional is obviously fine.
>>
>>>> The other possibility I was trying to investigate is that it had something 
>>>> to do with handling get() requests larger than the max correctly, but I 
>>>> can't find any evidence of that one...
>>> I've run the Throttle unit tests after uncommenting
>>> https://github.com/ceph/ceph/pull/34/files#L3R269
>>> and commenting out
>>> https://github.com/ceph/ceph/pull/34/files#L3R266
>>> and it passes.
>>
>> Regarding these unit tests, I have a few questions which I left on
>> Github. Can you address them and then give a single pull request which
>> includes both the Throttle fix and the tests? :)
>> Thanks!
>> -Greg
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>


Re: Throttle::wait use case clarification

2013-02-04 Thread Gregory Farnum
Loic,
Sorry for the delay in getting back to you about these patches. :( I
finally got some time to look over them, and in general it's all good!
I do have some comments, though.

On Mon, Jan 21, 2013 at 5:44 AM, Loic Dachary  wrote:
>> Looking through the history of that test (in _reset_max), I think it's an 
>> accident and we actually want to be waking up the front if the maximum 
>> increases (or possibly in all cases, in case the front is a very large 
>> request we're going to let through anyway). Want to submit a patch? :)
> :-) Here it is. "make check" does not complain. I've not run teuthology + 
> qa-suite though. I figured out how to run teuthology but did not yet try 
> qa-suite.
>
> http://marc.info/?l=ceph-devel&m=135877502606311&w=4

This patch to reverse the conditional is obviously fine.

>> The other possibility I was trying to investigate is that it had something 
>> to do with handling get() requests larger than the max correctly, but I 
>> can't find any evidence of that one...
> I've run the Throttle unit tests after uncommenting
> https://github.com/ceph/ceph/pull/34/files#L3R269
> and commenting out
> https://github.com/ceph/ceph/pull/34/files#L3R266
> and it passes.

Regarding these unit tests, I have a few questions which I left on
Github. Can you address them and then give a single pull request which
includes both the Throttle fix and the tests? :)
Thanks!
-Greg


Re: [0.48.3] OSD memory leak when scrubbing

2013-02-04 Thread Gregory Farnum
Set your /proc/sys/kernel/core_pattern file. :) http://linux.die.net/man/5/core
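For example (any existing directory with enough free space will do; %e and %p
expand to the executable name and pid):

  mkdir -p /var/crash
  echo '/var/crash/core.%e.%p' > /proc/sys/kernel/core_pattern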
-Greg

On Mon, Feb 4, 2013 at 1:08 PM, Sébastien Han  wrote:
> ok I finally managed to get something on my test cluster,
> unfortunately, the dump goes to /
>
> any idea to change the destination path?
>
> My production / won't be big enough...
>
> --
> Regards,
> Sébastien Han.
>
>
> On Mon, Feb 4, 2013 at 10:03 PM, Dan Mick  wrote:
>> ...and/or do you have the corepath set interestingly, or one of the
>> core-trapping mechanisms turned on?
>>
>>
>> On 02/04/2013 11:29 AM, Sage Weil wrote:
>>>
>>> On Mon, 4 Feb 2013, Sébastien Han wrote:

 Hum just tried several times on my test cluster and I can't get any
 core dump. Does Ceph commit suicide or something? Is it expected
 behavior?
>>>
>>>
>>> SIGSEGV should trigger the usual path that dumps a stack trace and then
>>> dumps core.  Was your ulimit -c set before the daemon was started?
>>>
>>> sage
>>>
>>>
>>>
 --
 Regards,
 Sébastien Han.


 On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han 
 wrote:
>
> Hi Loïc,
>
> Thanks for bringing our discussion on the ML. I'll check that tomorrow
> :-).
>
> Cheer
> --
> Regards,
> Sébastien Han.
>
>
> On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han 
> wrote:
>>
>> Hi Loïc,
>>
>> Thanks for bringing our discussion on the ML. I'll check that tomorrow
>> :-).
>>
>> Cheers
>>
>> --
>> Regards,
>> Sébastien Han.
>>
>>
>> On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary  wrote:
>>>
>>>
>>> Hi,
>>>
>>> As discussed during FOSDEM, the script you wrote to kill the OSD when
>>> it
>>> grows too much could be amended to core dump instead of just being
>>> killed &
>>> restarted. The binary + core could probably be used to figure out
>>> where the
>>> leak is.
>>>
>>> You should make sure the OSD current working directory is in a file
>>> system
>>> with enough free disk space to accommodate the dump and set
>>>
>>> ulimit -c unlimited
>>>
>>> before running it ( your system default is probably ulimit -c 0 which
>>> inhibits core dumps ). When you detect that OSD grows too much kill it
>>> with
>>>
>>> kill -SEGV $pid
>>>
>>> and upload the core found in the working directory, together with the
>>> binary in a public place. If the osd binary is compiled with -g but
>>> without
>>> changing the -O settings, you should have a larger binary file but no
>>> negative impact on performances. Forensics analysis will be made a lot
>>> easier with the debugging symbols.
>>>
>>> My 2cts
>>>
>>> On 01/31/2013 08:57 PM, Sage Weil wrote:

 On Thu, 31 Jan 2013, Sylvain Munaut wrote:
>
> Hi,
>
> I disabled scrubbing using
>
>> ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
>> ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'
>
>
> and the leak seems to be gone.
>
> See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD
> memory
> for the 12 osd processes over the last 3.5 days.
> Memory was rising every 24h. I did the change yesterday around 13h00
> and OSDs stopped growing. OSD memory even seems to go down slowly by
> small blocks.
>
> Of course I assume disabling scrubbing is not a long term solution
> and
> I should re-enable it ... (how do I do that btw ? what were the
> default values for those parameters)


 It depends on the exact commit you're on.  You can see the defaults
 if
 you
 do

   ceph-osd --show-config | grep osd_scrub

 Thanks for testing this... I have a few other ideas to try to
 reproduce.

 sage
 --
 To unsubscribe from this list: send the line "unsubscribe ceph-devel"
 in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>


>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Paxos and long-lasting deleted data

2013-02-03 Thread Gregory Farnum
On Sunday, February 3, 2013 at 11:45 AM, Andrey Korolyov wrote:
> Just an update: this data stayed after pool deletion, so there is
> probably a way to delete the garbage bytes on a live pool without doing any
> harm (hope so), since it can be separated from the actual pool data
> placement, in theory.


What? You mean you deleted the pool and the data in use by the cluster didn't 
drop? If that's the case, check and see if it's still at the same level — pool 
deletes are asynchronous and throttled to prevent impacting client operations 
too much.
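(Re-running "rados df" or "ceph -s" every few minutes should show the usage
draining back down if the deletion is actually making progress.)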
-Greg



Re: Ceph Development with Eclipse

2013-02-02 Thread Gregory Farnum
I actually still do this. Set it up Ceph outside of Eclipse initially, then 
import it as a project with an existing Makefile. It should pick up on 
everything it needs to well enough. :) 
-Greg


On Saturday, February 2, 2013 at 9:40 AM, charles L wrote:

> 
> 
> Hi 
> 
> I am a beginner at c++ and eclipse. I need some startup help to develop ceph 
> with eclipse. If you could provide your config file on eclipse, it will be a 
> great starting point and very appreciated.
> 
> Regards,
> 
> Charles.


Re: Maintenance mode

2013-01-31 Thread Gregory Farnum
Try "ceph osd set noout" beforehand and then "ceph osd unset noout". That will 
prevent any OSDs from getting removed from the mapping, so no data will be 
rebalanced. I don't think there's a way to prevent OSDs from getting zapped on 
an individual basis, though.  
This is described briefly in the docs at 
http://ceph.com/docs/master/rados/operations/troubleshooting-osd/?highlight=noout,
 though it could probably be a bit more clear.
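So the whole rack1 window would look roughly like this (assuming the sysvinit
scripts; adjust to however you manage your daemons):

  ceph osd set noout
  service ceph stop osd        # on each host in rack1
  ... do the maintenance ...
  service ceph start osd       # on each host in rack1
  ceph osd unset noout

The cluster reports degraded while the rack is down, but nothing gets remapped.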
-Greg


On Thursday, January 31, 2013 at 11:40 PM, Alexis GÜNST HORN wrote:

> Hello to all,
>  
> Here is my setup :
>  
> - 2 racks
> - osd1 .. osd6 in rack1
> - osd7 .. osd12 in rack2
> - replica = 2
> - CRUSH map set to put replicas accross racks
>  
> My question :
> Let's imagine that one day, I need to unplug one of the racks (let's
> say, rack1). No problem because an other copy of my objects will be in
> the other rack. But, if i do it, Ceph will start to rebalance data
> accross OSDs.
>  
> So, is there a way to put nodes in "Maintenance mode", in order to put
> Ceph in "degraded" mode, but avoiding any remaping.
>  
> The idea is to have a command like :
>  
> $ ceph osd set maintenance=on osd.1
> $ ceph osd set maintenance=on osd.2
> $ ceph osd set maintenance=on osd.3
> $ ceph osd set maintenance=on osd.4
> $ ceph osd set maintenance=on osd.5
> $ ceph osd set maintenance=on osd.6
>  
> So Ceph knows that 6 osds are down, goes into degraded mode, but
> without remapping data.
> Then, once maintenance finished, i'll only have to do the opposite :
>  
> $ ceph osd set maintenance=off osd.1
> $ ceph osd set maintenance=off osd.2
> $ ceph osd set maintenance=off osd.3
> $ ceph osd set maintenance=off osd.4
> $ ceph osd set maintenance=off osd.5
> $ ceph osd set maintenance=off osd.6
>  
> What do you think ?
> I know that there is a way in Ceph doc to do so with reweight, but
> it's a little bit complex...
>  
> What do you think ?
> Thanks,
>  
> Alexis
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html





Re: Paxos and long-lasting deleted data

2013-01-31 Thread Gregory Farnum
On Thu, Jan 31, 2013 at 10:50 AM, Andrey Korolyov  wrote:
> http://xdel.ru/downloads/ceph-log/rados-out.txt.gz
>
>
> On Thu, Jan 31, 2013 at 10:31 PM, Gregory Farnum  wrote:
>> Can you pastebin the output of "rados -p rbd ls"?


Well, that sure is a lot of rbd objects. Looks like a tool mismatch or
a bug in whatever version you were using. Can you describe how you got
into this state, what versions of the servers and client tools you
used, etc?
-Greg


Re: Paxos and long-lasting deleted data

2013-01-31 Thread Gregory Farnum
Can you pastebin the output of "rados -p rbd ls"?

On Thu, Jan 31, 2013 at 10:17 AM, Andrey Korolyov  wrote:
> Hi,
>
> Please take a look, this data remains for days and seems not to be
> deleted in future too:
>
> pool name   category KB  objects   clones
>degraded  unfound   rdrd KB   wr
> wr KB
> data-  000
>0   0000
> 0
> install -   15736833 38560
>0   0   163   464648
> 60970390
> metadata-  000
>0   0000
> 0
> prod-rack0  -  364027905888950
>0   0   320   267626
> 689034186
> rbd -4194305 10270
>0   04111269
> 25165828
>   total used  690091436893778
>   total avail18335469376
>   total space25236383744
>
> for pool in $(rados lspools) ; do rbd ls -l $pool ; done | grep -v
> SIZE | awk '{ sum += $2} END { print sum }'
> rbd: pool data doesn't contain rbd images
> rbd: pool metadata doesn't contain rbd images
> 526360
>
> I have same thing before, but not so contrast as there. Cluster was
> put on moderate failure test, dropping one or two osds at once under
> I/O pressure with replication factor three.
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] fix build (ceph.spec)

2013-01-30 Thread Gregory Farnum
On Wednesday, January 30, 2013 at 10:00 AM, Danny Al-Gaaf wrote:
> This set fixes some issues in the spec file. 
> 
> I'm not sure what the reason for #35e5d74e5c5786bc91df5dc10b5c08c77305df4e
> was. But I would revert it and fix the underlaying issues instead.


That is a pretty obtuse commit message, but it was actually because while 
rbd-fuse is ready to be in-tree, we don't think it's ready for people to be 
using; there are some undocumented gotchas involved with it and not a lot of 
testing. :)

I'll let others comment on the Java library moves and other packaging changes.
-Greg



Re: HEALTH_ERR 18624 pgs stuck inactive; 18624 pgs stuck unclean; no osds

2013-01-30 Thread Gregory Farnum
I believe this is because you specified "hostname" rather than "host" for the 
OSDs in your ceph.conf. "hostname" isn't a config option that anything in Ceph 
recognizes. :) 
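In other words, each OSD section wants to look like

  [osd.0]
      host = testserver109
      devs = /dev/sda1

and so on, after which mkcephfs will actually create and push out the OSDs.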
-Greg


On Wednesday, January 30, 2013 at 8:12 AM, femi anjorin wrote:

> Hi,
> 
> Can anyone help with this?
> 
> I am running a cluster of 6 servers. Each with 16 hard drives. I
> mounted all the hard drives on the recommended mount point
> /var/lib/ceph/osd/ceph-n . look like this:
> /dev/sda1 on /var/lib/ceph/osd/ceph-0
> /dev/sdb1 on /var/lib/ceph/osd/ceph-1
> /dev/sdc1 on /var/lib/ceph/osd/ceph-2
> /dev/sdd1 on /var/lib/ceph/osd/ceph-3
> /dev/sde1 on /var/lib/ceph/osd/ceph-4
> /dev/sdf1 on /var/lib/ceph/osd/ceph-5
> /dev/sdg1 on /var/lib/ceph/osd/ceph-6
> /dev/sdh1 on /var/lib/ceph/osd/ceph-7
> /dev/sdi1 on /var/lib/ceph/osd/ceph-8
> /dev/sdj1 on /var/lib/ceph/osd/ceph-9
> /dev/sdk1 on /var/lib/ceph/osd/ceph-10
> /dev/sdl1 on /var/lib/ceph/osd/ceph-11
> /dev/sdm1 on /var/lib/ceph/osd/ceph-12
> /dev/sdn1 on /var/lib/ceph/osd/ceph-13
> /dev/sdo1 on /var/lib/ceph/osd/ceph-14
> /dev/sdp1 on /var/lib/ceph/osd/ceph-15
> 
> 
> Below is a summarized copy of my ceph.conf file. Since i have 16
> drive on each server ...so i did a configuration of osd.0 - osd.95.
> While I did configuration of 3 monitors and 1 mds server.
> :
> [global]
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> debug ms = 1
> [osd]
> osd journal size = 1
> filestore xattr use omap = true
> 
> [osd.0]
> hostname = testserver109
> devs = /dev/sda1
> [osd.1]
> hostname = testserver109
> devs = /dev/sdb1
> .
> .
> .
> [osd.16]
> hostname = testserver110
> devs = /dev/sda1
> .
> .
> [osd.95]
> hostname = testserver114
> devs = /dev/sdp1
> 
> [mon]
> mon data = /var/lib/ceph/mon/$cluster-$id
> 
> [mon.a]
> host = testserver109
> mon addr = 172.16.1.9:6789
> 
> [mon.b]
> host = testserver110
> mon addr = 172.16.1.10:6789
> 
> [mon.c]
> host = testserver111
> mon addr = 172.16.1.11:6789
> [mds.a]
> host = testserver025
> 
> [mon]
> debug mon = 20
> debug paxos = 20
> debug auth = 20
> 
> [osd]
> debug osd = 20
> debug filestore = 20
> debug journal = 20
> debug monc = 20
> 
> [mds]
> debug mds = 20
> debug mds balancer = 20
> debug mds log = 20
> debug mds migrator = 20
> :
> 
> Steps:
> 1. I did mkcephfs -a -c /etc/ceph/ceph.conf -k ceph.keyring
> temp dir is /tmp/mkcephfs.G5cBEIaS1o
> preparing monmap in /tmp/mkcephfs.G5cBEIaS1o/monmap
> /usr/bin/monmaptool --create --clobber --add a 172.16.1.9:6789 --add b
> 172.16.1.10:6789 --add c 172.16.1.11:6789 --print
> /tmp/mkcephfs.G5cBEIaS1o/monmap
> /usr/bin/monmaptool: monmap file /tmp/mkcephfs.G5cBEIaS1o/monmap
> /usr/bin/monmaptool: generated fsid 3dd34cbf-e228-4ced-850c-68cde0a7d8b5
> epoch 0
> fsid 3dd34cbf-e228-4ced-850c-68cde0a7d8b5
> last_changed 2013-01-30 12:38:14.564735
> created 2013-01-30 12:38:14.564735
> 0: 172.16.1.9:6789/0 mon.a
> 1: 172.16.1.10:6789/0 mon.b
> 2: 172.16.1.11:6789/0 mon.c
> /usr/bin/monmaptool: writing epoch 0 to
> /tmp/mkcephfs.G5cBEIaS1o/monmap (3 monitors)
> === mds.a ===
> creating private key for mds.a keyring /var/lib/ceph/mds/ceph-a/keyring
> creating /var/lib/ceph/mds/ceph-a/keyring
> Building generic osdmap from /tmp/mkcephfs.G5cBEIaS1o/conf
> /usr/bin/osdmaptool: osdmap file '/tmp/mkcephfs.G5cBEIaS1o/osdmap'
> /usr/bin/osdmaptool: writing epoch 1 to /tmp/mkcephfs.G5cBEIaS1o/osdmap
> Generating admin key at /tmp/mkcephfs.G5cBEIaS1o/keyring.admin
> creating /tmp/mkcephfs.G5cBEIaS1o/keyring.admin
> Building initial monitor keyring
> added entity mds.a auth auth(auid = 18446744073709551615
> key=AQAnBglRaGP7MxAANo/xsy5P9NxMzCZGmHQDCw== with 0 caps)
> === mon.a ===
> pushing everything to testserver109
> /usr/bin/ceph-mon: created monfs at /var/lib/ceph/mon/ceph-a for mon.a
> === mon.b ===
> pushing everything to testserver110
> /usr/bin/ceph-mon: created monfs at /var/lib/ceph/mon/ceph-b for mon.b
> === mon.c ===
> pushing everything to testserver111
> /usr/bin/ceph-mon: created monfs at /var/lib/ceph/mon/ceph-c for mon.c
> placing client.admin keyring in ceph.keyring
> 
> ---
> Apparently the monitor and mds got created and the ceph.keyring was
> created BUT the OSDs were not created.
> 
> 
> 2. I copied to ceph.keyring to all node
> 3. I did a "service ceph -a start" command (on all node)
> 4. I did a "ceph health" (on the node where i used the mkcephfs)
> 
> 2013-01-30 13:12:18.822022 7f80ea476760 1 -- :/0 messenger.start
> 2013-01-30 13

Re: Geo-replication with RADOS GW

2013-01-28 Thread Gregory Farnum
On Monday, January 28, 2013 at 9:54 AM, Ben Rowland wrote:
> Hi,
>  
> I'm considering using Ceph to create a cluster across several data
> centres, with the strict requirement that writes should go to both
> DCs. This seems possible by specifying rules in the CRUSH map, with
> an understood latency hit resulting from purely synchronous writes.
>  
> The part I'm unsure about is how the RADOS GW fits into this picture.
> For high availability (and to improve best-case latency on reads),
> we'd want to run a gateway in each data centre. However, the first
> paragraph of the following post suggests this is not possible:
>  
> http://article.gmane.org/gmane.comp.file-systems.ceph.devel/12238
>  
> Is there a hard restriction on how many radosgw instances can run
> across the cluster, or is the point of the above post more about a
> performance hit?

It's talking about the performance hit. Most people can't afford
data-center-level connectivity between two different buildings. ;) If you did
have a Ceph cluster split across two DCs (with the bandwidth to support it),
this would work fine. There aren't any strict limits on the number of gateways
you can stick on a cluster, just the scaling costs associated with cache
invalidation notifications.
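(For the write-to-both-DCs requirement, the CRUSH rule Ben describes is the
usual pattern — a sketch with made-up bucket names, assuming a "datacenter"
type exists in your hierarchy:

  rule replicate_across_dcs {
          ruleset 1
          type replicated
          min_size 2
          max_size 2
          step take default
          step chooseleaf firstn 0 type datacenter
          step emit
  }

so each replica of a two-copy pool lands in a different datacenter.)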

  
> It seems to me it should be possible to run more
> than one radosgw, particularly if each instance communicates with a
> local OSD which can proxy reads/writes to the primary (which may or
> may not be DC-local).

They aren't going to do this, though — each gateway will communicate with the 
primaries directly.
-Greg



Re: lagging peering wq

2013-01-25 Thread Gregory Farnum
On Friday, January 25, 2013 at 9:50 AM, Sage Weil wrote:
> Faidon/paravoid's cluster has a bunch of OSDs that are up, but the pg 
> queries indicate they are tens of thousands of epochs behind:
> 
> "history": { "epoch_created": 14,
> "last_epoch_started": 88174,
> "last_epoch_clean": 88174,
> "last_epoch_split": 0,
> "same_up_since": 88172,
> "same_interval_since": 88172,
> "same_primary_since": 88172,
> 
> (where the current map epoch is 102000 or thereabouts).
> 
> I think just restarting all OSDs at once will get him caught up (esp with 
> a 'ceph osd set noup' block until they are done processing maps), but I 
> wonder if we may want an additional check that if any PG falls more than X 
> epochs behind the OSD marks it self down and catches up before coming 
> in...
> 
> What do you think?

Sam's explained to me why this "shouldn't" happen (since events for each PG get 
queued on every map update), so it sounds like it would be better to prevent 
the mess (e.g., add some basic fairness to the PG work queue dispatchers in 
order to prevent any PG from falling so far behind), rather than trying to 
clean the mess up.
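
For reference, the stopgap Sage describes would look something like the 
following (a sketch, assuming you can restart all the OSD daemons at once; 
the flag just keeps the restarting OSDs from being marked up until you clear 
it again once they have caught up on maps):

  ceph osd set noup
  service ceph -a restart    # or restart just the ceph-osd daemons by hand
  # wait for the OSDs to finish processing the map backlog, then:
  ceph osd unset noup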
-Greg



Re: ceph write path

2013-01-24 Thread Gregory Farnum
On Thursday, January 24, 2013 at 6:41 PM, sheng qiu wrote:
> Hi,
> 
> i am trying to understand the ceph codes on client side.
> for write path, if it's aio_write, the ceph_write_begin() allocate
> pages in page cache to buffer the written data, however i did not see
> it allocated any space on the remote OSDs (for local fs such as ext2,
> the get_block() did this),
> i suppose it's done later when invoke kernel flushing process to write
> back the dirty pages.
> 
> i checked the ceph_writepages_start(), here it seems organize the
> dirty data and prepare the requests to send to the OSDs. For new
> allocated written data, how it maps to the OSDs and where it is done?
> is it done in ceph_osdc_new_request()?
> 
> If the transfer unit is not limited to sizes of obj, i supposed that
> ceph needed to packed several pieces of data (smaller than one obj
> size) together so that there won't be internal fragmentation for an
> object. who does this job and which part of source codes/files are
> related with this?
> 
> I really want to get a deep understanding of the code, so I raised
> these questions. If my understanding is not correct, please point it
> out; I would appreciate it.
> 
There seems to be a bit of a fundamental misunderstanding here. The Ceph 
storage system is built on top of an object store (RADOS), and so when the 
clients are doing writes they just tell the object storage daemon (OSD) to 
write the named object. The daemons are responsible for doing disk allocation 
and layout stuff themselves (and in fact they handle most of that by sticking 
the objects in a perfectly ordinary Linux filesystem).
The client maps the data to the correct OSDs via the CRUSH algorithm; it's a 
calculation based on the object name that anybody in the system can perform; 
there's no lookup or anything. 
It doesn't do any packing of different pieces of data into one object or 
anything like that.
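
If you want to see that calculation in action, the ceph tool can print where 
any object name would be placed without touching the object at all (a quick 
sketch; "data" and "myobject" are just example names):

  ceph osd map data myobject

That prints the placement group the name hashes to and the up/acting set of 
OSDs CRUSH selects for it, which is the same computation the client performs 
locally.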

I'd recommend checking out some of the academic papers available at 
http://ceph.com/resources/publications/ for more background information about 
the key algorithms and design choices.
-Greg



Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

2013-01-24 Thread Gregory Farnum
What's the physical layout of your networking? This additional log may prove 
helpful as well, but I really need a bit more context in evaluating the 
messages I see from the first one. :) 
-Greg


On Thursday, January 24, 2013 at 9:24 AM, Isaac Otsiabah wrote:

> 
> 
> Gregory, I tried to send the attached debug output several times and 
> the mail server rejected them all, probably because of the file size, so I cut 
> the log file size down and it is attached. You will see the 
> reconnection failures in the error message line below. The ceph version 
> is 0.56
> 
> 
> it appears to be a timing issue because with the flag (debug ms=1) turned on, 
> the system ran slower and became harder to fail.
> I
> ran it several times and finally got it to fail on (osd.0) using 
> default crush map. The attached tar file contains log files for all 
> components on g8ct plus the ceph.conf. By the way, the log file contain only 
> the last 1384 lines where the error occurs.
> 
> 
> I started with a 1-node cluster on host g8ct (osd.0, osd.1, osd.2) and then 
> added host g13ct (osd.3, osd.4, osd.5)
> 
> 
> id weight type name up/down reweight
> -1 6 root default
> -3 6 rack unknownrack
> -2 3 host g8ct
> 0 1 osd.0 down 1
> 1 1 osd.1 up 1
> 2 1 osd.2 up 1
> -4 3 host g13ct
> 3 1 osd.3 up 1
> 4 1 osd.4 up 1
> 5 1 osd.5 up 1
> 
> 
> 
> The error messages are in ceph.log and ceph-osd.0.log:
> 
> ceph.log:2013-01-08
> 05:41:38.080470 osd.0 192.168.0.124:6801/25571 3 : [ERR] map e15 had 
> wrong cluster addr (192.168.0.124:6802/25571 != my 
> 192.168.1.124:6802/25571)
> ceph-osd.0.log:2013-01-08 05:41:38.080458 7f06757fa710 0 log [ERR] : map e15 
> had wrong cluster addr 
> (192.168.0.124:6802/25571 != my 192.168.1.124:6802/25571)
> 
> 
> 
> [root@g8ct ceph]# ceph -v
> ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
> 
> 
> Isaac
> 
> 
> - Original Message -
> From: Gregory Farnum (g...@inktank.com)
> To: Isaac Otsiabah (zmoo...@yahoo.com)
> Cc: ceph-devel@vger.kernel.org
> Sent: Monday, January 7, 2013 1:27 PM
> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host 
> to my cluster
> 
> On Monday, January 7, 2013 at 1:00 PM, Isaac Otsiabah wrote:
> 
> 
> When i add a new host (with osd's) to my existing cluster, 1 or 2 
> previous osd(s) goes down for about 2 minutes and then they come back 
> up. 
> > 
> > 
> > [root@h1ct ~]# ceph osd tree
> > 
> > # id weight type name up/down reweight
> > -1 
> > 3 root default
> > -3 3 rack unknownrack
> > -2 3 host h1
> > 0 1 osd.0 up 1
> > 1 1 osd.1 up 1
> > 2 
> > 1 osd.2 up 1
> 
> 
> For example, after adding host h2 (with 3 new osd) to the above cluster
> and running the "ceph osd tree" command, i see this: 
> > 
> > 
> > [root@h1 ~]# ceph osd tree
> > 
> > # id weight type name up/down reweight
> > -1 6 root default
> > -3 
> > 6 rack unknownrack
> > -2 3 host h1
> > 0 1 osd.0 up 1
> > 1 1 osd.1 down 1
> > 2 
> > 1 osd.2 up 1
> > -4 3 host h2
> > 3 1 osd.3 up 1
> > 4 1 osd.4 up 
> > 1
> > 5 1 osd.5 up 1
> 
> 
> The down osd always comes back up after 2 minutes or less and I see the 
> following error message in the respective osd log file: 
> > 2013-01-07 04:40:17.613028 7fec7f092760 1 journal _open 
> > /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size 
> > 4096 bytes, directio = 1, aio = 0
> > 2013-01-07 04:40:17.613122 
> > 7fec7f092760 1 journal _open /ceph_journal/journals/journal_2 fd 26: 
> > 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
> > 2013-01-07
> > 04:42:10.006533 7fec746f7710 0 -- 192.168.0.124:6808/19449 >> 
> > 192.168.1.123:6800/18287 pipe(0x7fec2e10 sd=31 :6808 pgs=0 cs=0 
> > l=0).accept connect_seq 0 vs existing 0 state connecting
> > 2013-01-07 
> > 04:45:29.834341 7fec743f4710 0 -- 192.168.1.124:6808/19449 >> 
> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45438 pgs=7 cs=1 
> > l=0).fault, initiating reconnect
> > 2013-01-07 04:45:29.835748 
> > 7fec743f4710 0 -- 192.168.1.124:6808/19449 >> 
> > 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45439 pgs=15 cs=3 
> > l=0).fault, initiating reconnect
> > 2013-01-07 04:45:30.835219 7fec743f4710 0 -- 
> > 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 
> > pipe(0x7fec5402f320 sd=28 :45894 pgs=482 cs=903 l=0).fault, init

Re: some questions about ceph

2013-01-23 Thread Gregory Farnum
On Wednesday, January 23, 2013 at 3:35 PM, Yue Li wrote:
> Hi,
>  
> i have some questions about ceph.
>  
> ceph provide a POSIX client for users.
> for aio-read/write, it still use page cache on client side (seems to
> me). How long will the page cache expire (in case the data on server
> side has changed)?

The kernel client does this automatically; ceph-fuse currently doesn't do page 
cache invalidation (so, yes, you can get stale data), but fixing this is in our 
queue and should be coming pretty soon: http://tracker.newdream.net/issues/2215
  
> if we miss the page cache, we need to fetch data from server side for
> read accesses, what's the minimum transfer unit between client and
> OSDs?

There is no hard limit, although there will be a practical minimum based on the 
read ahead and prefetch settings you specify.
  
> for write accesses, will the client batch the write request data into
> units of obj size then transferring to OSDs?

It will try to write out what it can, but no — if you aren't doing any syncs 
yourself, then the client will write out dirty data according to an LRU (in 
ceph-fuse) or the regular page cache eviction algorithms (for the kernel), 
aggregating the dirty data it has available.
  
> generally what's the minimum transfer unit between client and OSDs?

No minimum.
  
>  
> How to ensure the consistency for multi-write from clients on the same
> piece of data or parallel read and write on the same data?

If you have multiple clients accessing the same piece of data and at least one 
is a writer, they will go into a synchronous mode and data access is 
coordinated and ordered by the MDS.
-Greg



Re: Using a Data Pool

2013-01-23 Thread Gregory Farnum
On Wednesday, January 23, 2013 at 5:01 AM, Paul Sherriffs wrote:
> Hello All;
>  
> I have been trying to associate a directory to a data pool (both called 
> 'Media') according to a previous thread on this list. It all works except the 
> last line:
>  
>  
> ceph osd pool create Media 500 500
>  
> ceph mds add_data_pool 3
>  
> > added data pool 3 to mdsmap
>  
> mkdir /mnt/ceph/Media
>  
> cephfs /mnt/ceph/Media set_layout -p 3
>  
> > Segmentation fault
cephfs is not a super-friendly tool right now — sorry! :(
I believe you will find it works correctly if you specify all the layout 
parameters, not just one of them.
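Something along these lines should work; the flag names here are from memory, 
so treat them as assumptions and double-check against "cephfs --help" (the 
values shown are the usual defaults, plus pool 3 from the add_data_pool step 
above):

  cephfs /mnt/ceph/Media set_layout -p 3 -u 4194304 -c 1 -s 4194304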
-Greg



Re: Questions about journals, performance and disk utilization.

2013-01-22 Thread Gregory Farnum
On Tuesday, January 22, 2013 at 1:57 PM, Mark Nelson wrote:
> On 01/22/2013 03:50 PM, Stefan Priebe wrote:
> > Hi,
> > Am 22.01.2013 22:26, schrieb Jeff Mitchell:
> > > Mark Nelson wrote:
> > > > It may (or may not) help to use a power-of-2 number of PGs. It's
> > > > generally a good idea to do this anyway, so if you haven't set up your
> > > > production cluster yet, you may want to play around with this. Basically
> > > > just take whatever number you were planning on using and round it up (or
> > > > down slightly). IE if you were going to use 7,000 PGs, round up to 8192.
> > > 
> > > 
> > > 
> > > As I was asking about earlier on IRC, I'm in a situation where the docs
> > > did not mention this in the section about calculating PGs so I have a
> > > non-power-of-2 -- and since there are some production things running on
> > > that pool I can't currently change it.
> > 
> > 
> > 
> > Oh same thing here - did i miss the doc or can someone point me the
> > location.
> > 
> > Is there a chance to change the number of PGs for a pool?
> > 
> > Greets,
> > Stefan
> 
> 
> 
> Honestly I don't know if it will actually have a significant effect. 
> ceph_stable_mod will map things optimally when pg_num is a power of 2, 
> but that's only part of how things work. It may not matter very much 
> with high PG counts.

IIRC, having a non-power of 2 count means that the extra PGs (above the 
lower-bounding power of 2) will be twice the size of the other PGs. For 
reasonable PG counts this should not cause any problems.
-Greg



Re: ssh passwords

2013-01-22 Thread Gregory Farnum
On Tuesday, January 22, 2013 at 10:24 AM, Gandalf Corvotempesta wrote:
> Hi all,
> i'm trying my very first ceph installation following the 5-minutes quickstart:
> http://ceph.com/docs/master/start/quick-start/#install-debian-ubuntu
> 
> just a question: why ceph is asking me for SSH password? Is ceph
> trying to connect to itself via SSH?

If you're using mkcephfs to set it up, or asking /etc/init.d/ceph to start up 
daemons on each node, it uses ssh to go in and do that. I believe the 
Quick-start guide is using both of those. :) 
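
If the password prompts get annoying, the usual fix is key-based ssh from the 
machine where you run mkcephfs to every node in the cluster. A minimal sketch 
(user and host names are examples):

  ssh-keygen -t rsa          # accept the defaults
  ssh-copy-id root@node1
  ssh-copy-id root@node2

After that, mkcephfs and the init script can log in without prompting.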
-Greg



Re: Throttle::wait use case clarification

2013-01-22 Thread Gregory Farnum
On Monday, January 21, 2013 at 5:44 AM, Loic Dachary wrote:
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> On 01/21/2013 12:02 AM, Gregory Farnum wrote:
> > On Sunday, January 20, 2013 at 5:39 AM, Loic Dachary wrote:
> > > Hi,
> > > 
> > > While working on unit tests for Throttle.{cc,h} I tried to figure out a 
> > > use case related to the Throttle::wait method but couldn't
> > > 
> > > https://github.com/ceph/ceph/pull/34/files#L3R258
> > > 
> > > Although it was not a blocker and I managed to reach 100% coverage 
> > > anyway, it got me curious and I would very much appreciate pointers to 
> > > understand the rationale.
> > > 
> > > wait() can be called to set a new maximum before waiting for all pending 
> > > threads to get get what they asked for. Since the maximum has changed, 
> > > wait() wakes up the first thread : the conditions under which it decided 
> > > to go to sleep have changed and the conclusion may be different.
> > > 
> > > However, it only does so when the new maximum is less than current one. 
> > > For instance
> > > 
> > > A) decision does not change
> > > 
> > > max = 10, current 9
> > > thread 1 tries to get 5 but only 1 is available, it goes to sleep
> > > wait(8)
> > > max = 8, current 9
> > > wakes up thread 1
> > > thread 1 tries to get 5 but current is already beyond the maximum, it 
> > > goes to sleep
> > > 
> > > B) decision changes
> > > 
> > > max = 10, current 1
> > > thread 1 tries to get 10 but only 9 is available, it goes to sleep
> > > wait(9)
> > > max = 9, current 1
> > > wakes up thread 1
> > > thread 1 tries to get 10 which is above the maximum : it succeeds because 
> > > current is below the new maximum
> > > 
> > > It will not wake up a thread if the maximum increases, for instance:
> > > 
> > > max = 10, current 9
> > > thread 1 tries to get 5 but only 1 is available, it goes to sleep
> > > wait(20)
> > > max = 20, current 9
> > > does *not* wake up thread 1
> > > keeps waiting until another thread put(N) with N >= 0 although there now 
> > > is 11 available and it would allow it to get 5 out of it
> > > 
> > > Why is it not desirable for thread 1 to wake up in this case ? When 
> > > debugging a real world situation, I think it would show as a thread 
> > > blocked although the throttle it is waiting on has enough to satisfy its 
> > > request. What am I missing ?
> > > 
> > > Cheers
> > > 
> > > 
> > > Attachments:
> > > - loic.vcf
> > 
> > 
> > 
> > 
> > Looking through the history of that test (in _reset_max), I think it's an 
> > accident and we actually want to be waking up the front if the maximum 
> > increases (or possibly in all cases, in case the front is a very large 
> > request we're going to let through anyway). Want to submit a patch? :)
> :-) Here it is. "make check" does not complain. I've not run teuthology + 
> qa-suite though. I figured out how to run teuthology but did not yet try 
> qa-suite.
> 
> http://marc.info/?l=ceph-devel&m=135877502606311&w=4
> 
> > 
> > The other possibility I was trying to investigate is that it had something 
> > to do with handling get() requests larger than the max correctly, but I 
> > can't find any evidence of that one...
> I've run the Throttle unit tests after uncommenting
> https://github.com/ceph/ceph/pull/34/files#L3R269
> and commenting out
> https://github.com/ceph/ceph/pull/34/files#L3R266
> and it passes.
> 
> I'm not sure if I should have posted the proposed Throttle unit test to the 
> list instead of proposing it as a pull request
> https://github.com/ceph/ceph/pull/34
> 
> What is best ?
Pull requests are good; you just sent it in on a weekend and we've all got a 
queue before we evaluate code pulls. :)
Thanks!
-Greg



Re: handling fs errors

2013-01-22 Thread Gregory Farnum
On Tuesday, January 22, 2013 at 5:12 AM, Wido den Hollander wrote:
> On 01/22/2013 07:12 AM, Yehuda Sadeh wrote:
> > On Mon, Jan 21, 2013 at 10:05 PM, Sage Weil (s...@inktank.com) wrote:
> > > We observed an interesting situation over the weekend. The XFS volume
> > > ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4
> > > minutes. After 3 minutes (180s), ceph-osd gave up waiting and committed
> > > suicide. XFS seemed to unwedge itself a bit after that, as the daemon was
> > > able to restart and continue.
> > > 
> > > The problem is that during that 180s the OSD was claiming to be alive but
> > > not able to do any IO. That heartbeat check is meant as a sanity check
> > > against a wedged kernel, but waiting so long meant that the ceph-osd
> > > wasn't failed by the cluster quickly enough and client IO stalled.
> > > 
> > > We could simply change that timeout to something close to the heartbeat
> > > interval (currently default is 20s). That will make ceph-osd much more
> > > sensitive to fs stalls that may be transient (high load, whatever).
> > > 
> > > Another option would be to make the osd heartbeat replies conditional on
> > > whether the internal heartbeat is healthy. Then the heartbeat warnings
> > > could start at 10-20s, ping replies would pause, but the suicide could
> > > still be 180s out. If the stall is short-lived, pings will continue, the
> > > osd will mark itself back up (if it was marked down) and continue.
> > > 
> > > Having written that out, the last option sounds like the obvious choice.
> > > Any other thoughts?
> > 
> > 
> > 
> > Another option would be to have the osd reply to the ping with some
> > health description.
> 
> 
> 
> Looking to the future with more monitoring that might be a good idea.
> 
> If an OSD simply stops sending heartbeats if the internal conditions 
> aren't met you don't know what's going on.
> 
> If the heartbeat would have metadata which tells: "I'm here, but not in 
> such a good shape" that could be reported back to the monitors.


I think we want to move towards more comprehensive pinging like this, but it's 
not something to do in haste. Pausing pings when the internal threads are 
disappearing sounds like a good simple step to make the reporting better match 
reality.
-Greg 



Re: Consistently reading/writing rados objects via command line

2013-01-21 Thread Gregory Farnum
On Monday, January 21, 2013 at 5:01 PM, Nick Bartos wrote:
> I would like to store some objects in rados, and retrieve them in a
> consistent manor. In my initial tests, if I do a 'rados -p foo put
> test /tmp/test', while it is uploading I can do a 'rados -p foo get
> test /tmp/blah' on another machine, and it will download a partially
> written file without returning an error code, so the downloader cannot
> tell the file is corrupt/incomplete.
>  
> My question is, how do I read/write objects in rados via the command
> line in such a way where the downloader does not get a corrupt or
> incomplete file? It's fine if it just returns an error on the client
> and I can try again, I just need to be notified on error.
>  
You must be writing large-ish objects? By default the rados tool will upload 
objects 4MB at a time and you're trying to download mid-way through the full 
object upload. You can add a "--block-size 20971520" to upload 20MB in a single 
operation, but make sure you don't exceed the "osd max write size" (90MB by 
default).
This is all client-side stuff, though — from the RADOS object store's 
perspective, the file is complete after each 4MB write. If you want something 
more sophisticated (like handling larger objects) you'll need to do at least 
some minimal tooling of your own, e.g. by setting an object xattr before 
starting and after finishing the file change, then checking for that presence 
when reading (and locking on reads or doing a check when the read completes). 
You can do that with the "setxattr", "rmxattr", and "getxattr" options.
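One way to put those commands together (a sketch; the object and attribute 
names are just examples, and you should check the exit status of each step):

  # writer
  rados -p foo put test /tmp/test
  rados -p foo setxattr test complete 1    # only mark it once the put returns

  # reader
  if rados -p foo getxattr test complete >/dev/null 2>&1; then
      rados -p foo get test /tmp/blah
  else
      echo "upload not finished yet, try again later"
  fi

Note this only tells the reader that an upload finished at some point; if you 
need to guard against a writer replacing the object mid-read you would still 
want some locking on top.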
-Greg



Re: questions on networks and hardware

2013-01-20 Thread Gregory Farnum
On Sunday, January 20, 2013 at 2:43 PM, Gandalf Corvotempesta wrote:
> 2013/1/20 Gregory Farnum (g...@inktank.com):
> > This is a bit embarrassing, but if you're actually using two networks and 
> > the cluster network fails but the client network stays up, things behave 
> > pretty badly (the OSDs will keep insisting it's failed, while the monitor 
> > will insist it's still working). I believe there's a branch working on this 
> > problem, but I haven't been involved with it.
> > It's not necessary to have split networks though, no.
>  
>  
>  
> Ok.
>  
> > Does that answer your question?
>  
> Absolutely, but usually cluster network is faster than client network
> and being forced to use two cluster network is very very expensive.

I'm not quite sure what you mean… the "cluster network" and "public 
network" options are really just intended as conveniences for people with multiple NICs 
on their box. There's nothing preventing you from running everything on the 
same network… (or, more specifically, from running different speed grades to different 
boxes while keeping them all on the same network).
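
For reference, splitting (or not splitting) the traffic is controlled by two 
options in ceph.conf; a sketch with made-up subnets:

  [global]
      public network = 192.168.0.0/24
      cluster network = 192.168.1.0/24

Leave out "cluster network" (or point both at the same subnet) and everything 
simply runs over one network.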



Re: Throttle::wait use case clarification

2013-01-20 Thread Gregory Farnum
On Sunday, January 20, 2013 at 5:39 AM, Loic Dachary wrote:
> Hi,
> 
> While working on unit tests for Throttle.{cc,h} I tried to figure out a use 
> case related to the Throttle::wait method but couldn't
> 
> https://github.com/ceph/ceph/pull/34/files#L3R258
> 
> Although it was not a blocker and I managed to reach 100% coverage anyway, it 
> got me curious and I would very much appreciate pointers to understand the 
> rationale.
> 
> wait() can be called to set a new maximum before waiting for all pending 
> threads to get get what they asked for. Since the maximum has changed, wait() 
> wakes up the first thread : the conditions under which it decided to go to 
> sleep have changed and the conclusion may be different.
> 
> However, it only does so when the new maximum is less than current one. For 
> instance
> 
> A) decision does not change
> 
> max = 10, current 9
> thread 1 tries to get 5 but only 1 is available, it goes to sleep
> wait(8)
> max = 8, current 9
> wakes up thread 1
> thread 1 tries to get 5 but current is already beyond the maximum, it goes to 
> sleep
> 
> B) decision changes
> 
> max = 10, current 1
> thread 1 tries to get 10 but only 9 is available, it goes to sleep
> wait(9)
> max = 9, current 1
> wakes up thread 1
> thread 1 tries to get 10 which is above the maximum : it succeeds because 
> current is below the new maximum
> 
> It will not wake up a thread if the maximum increases, for instance:
> 
> max = 10, current 9
> thread 1 tries to get 5 but only 1 is available, it goes to sleep
> wait(20)
> max = 20, current 9
> does *not* wake up thread 1
> keeps waiting until another thread put(N) with N >= 0 although there now is 
> 11 available and it would allow it to get 5 out of it
> 
> Why is it not desirable for thread 1 to wake up in this case ? When debugging 
> a real world situation, I think it would show as a thread blocked although 
> the throttle it is waiting on has enough to satisfy its request. What am I 
> missing ?
> 
> Cheers 
> 
> 
> Attachments: 
> - loic.vcf
> 


Looking through the history of that test (in _reset_max), I think it's an 
accident and we actually want to be waking up the front if the maximum 
increases (or possibly in all cases, in case the front is a very large request 
we're going to let through anyway). Want to submit a patch? :)
The other possibility I was trying to investigate is that it had something to 
do with handling get() requests larger than the max correctly, but I can't find 
any evidence of that one...
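
To make the semantics concrete, here is a stripped-down illustration. It is 
not the real Throttle class (which keeps a FIFO of per-waiter conditions so 
wakeups stay ordered); it is just enough C++ to show why waiters should be 
woken to re-check after a max change in either direction:

  #include <condition_variable>
  #include <cstdint>
  #include <mutex>

  // Illustration only -- not the actual ceph Throttle implementation.
  class ToyThrottle {
    std::mutex m;
    std::condition_variable cond;
    uint64_t max;
    uint64_t current = 0;

    bool fits(uint64_t count) const {
      // Requests larger than max are let through once usage drops below max;
      // everything else waits until it fits under max.
      return count > max ? current < max : current + count <= max;
    }

  public:
    explicit ToyThrottle(uint64_t m_) : max(m_) {}

    void get(uint64_t count) {
      std::unique_lock<std::mutex> l(m);
      cond.wait(l, [&] { return fits(count); });
      current += count;
    }

    void put(uint64_t count) {
      std::lock_guard<std::mutex> l(m);
      current -= count;
      cond.notify_all();    // let sleepers re-check
    }

    void reset_max(uint64_t new_max) {
      std::lock_guard<std::mutex> l(m);
      max = new_max;
      cond.notify_all();    // wake on increases as well as decreases
    }
  };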
-Greg



Re: ceph replication and data redundancy

2013-01-20 Thread Gregory Farnum
(Sorry for the blank email just now, my client got a little eager!)  

Apart from the things that Wido has mentioned, you say you've set up 4 nodes 
and each one has a monitor on it. That's why you can't do anything when you 
bring down two nodes — the monitor cluster requires a strict majority in order 
to continue operating, which is why we recommend odd numbers. If you set up a 
different node as a monitor (simulating one in a different data center) and 
then bring down two nodes, things should keep working.
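
With four monitors the arithmetic works against you: losing two leaves two of 
four, which is short of the three needed for a strict majority. A quick way to 
see which monitors actually form the quorum (output format varies by version):

  ceph quorum_status
  ceph mon stat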
-Greg


On Sunday, January 20, 2013 at 9:29 AM, Wido den Hollander wrote:

> Hi,
>  
> On 01/17/2013 10:55 AM, Ulysse 31 wrote:
> > Hi all,
> >  
> > I'm not sure if it's the good mailing, if not, sorry for that, tell me
> > the appropriate one, i'll go for it.
> > Here is my actual project :
> > The company i work for has several buildings, each of them are linked
> > with gigabit trunk links allowing us to have multiple machines over
> > the same lan on different buildings.
> > We need to archive some data (over 5 to 10 TB), but we want that data
> > present in each building, and, in case of the loss of a building
> > (catastrophe scenario), we still have the data.
> > Rather than using simple storage machines synced by rsync, we thought of
> > re-using older desktop machines we have in stock, and making a
> > clustered fs on them:
> > In fact, speed is clearly not the goal of this data storage, we would
> > just store old projects on it sometimes, and will access it in rare
> > cases. the most important is to keep that data archived somewhere.
>  
>  
>  
> Ok, keep that in mind. All writes to RADOS are synchronous, so if you  
> experience high latency or some congestion on your network Ceph will  
> become slow.
>  
> > I was interrested by ceph in the way that we can declare, using the
> > crush-map, a hierarchical maner to place replicated data.
> > So for a test, i build a sample cluster composed of 4 nodes, installed
> > under debian squeeze and actual bobtail stable version of ceph.
> > On my sample i wanted to simulate 2 "per buildings" nodes, each nodes
> > has a 2Tb disk and has mon/osd/mds (i know it is not optimized, but
> > that just a sample), osd uses xfs on /dev/sda3, and made a crush map
> > like :
> > ---
> > # begin crush map
> >  
> > # devices
> > device 0 osd.0
> > device 1 osd.1
> > device 2 osd.2
> > device 3 osd.3
> >  
> > # types
> > type 0 osd
> > type 1 host
> > type 2 rack
> > type 3 row
> > type 4 room
> > type 5 datacenter
> > type 6 root
> >  
> > # buckets
> > host server-0 {
> > id -2 # do not change unnecessarily
> > # weight 1.000
> > alg straw
> > hash 0 # rjenkins1
> > item osd.0 weight 1.000
> > }
> > host server-1 {
> > id -5 # do not change unnecessarily
> > # weight 1.000
> > alg straw
> > hash 0 # rjenkins1
> > item osd.1 weight 1.000
> > }
> > host server-2 {
> > id -6 # do not change unnecessarily
> > # weight 1.000
> > alg straw
> > hash 0 # rjenkins1
> > item osd.2 weight 1.000
> > }
> > host server-3 {
> > id -7 # do not change unnecessarily
> > # weight 1.000
> > alg straw
> > hash 0 # rjenkins1
> > item osd.3 weight 1.000
> > }
> > rack bat0 {
> > id -3 # do not change unnecessarily
> > # weight 3.000
> > alg straw
> > hash 0 # rjenkins1
> > item server-0 weight 1.000
> > item server-1 weight 1.000
> > }
> > rack bat1 {
> > id -4 # do not change unnecessarily
> > # weight 3.000
> > alg straw
> > hash 0 # rjenkins1
> > item server-2 weight 1.000
> > item server-3 weight 1.000
> > }
> > root root {
> > id -1 # do not change unnecessarily
> > # weight 3.000
> > alg straw
> > hash 0 # rjenkins1
> > item bat0 weight 3.000
> > item bat1 weight 3.000
> > }
> >  
> > # rules
> > rule data {
> > ruleset 0
> > type replicated
> > min_size 1
> > max_size 10
> > step take root
> > step chooseleaf firstn 0 type rack
> > step emit
> > }
> > rule metadata {
> > ruleset 1
> > type replicated
> > min_size 1
> > max_size 10
> > step take root
> > step chooseleaf firstn 0 type rack
> > step emit
> > }
> > rule rbd {
> > ruleset 2
> > type replicated
> > min_size 1
> > max_size 10
> > step take root
> > step chooseleaf firstn 0 type rack
> > step emit
> > }
> > # end crush map
> > ---
> >  
> > Using this crush-map, coupled with a default pool data size 2
> > (replication 2), allowed me to be sure to have duplicate of all data
> > on both "sample building" bat0 and bat1.
> > Then I mounted on a client using ceph-fuse using : ceph-fuse -m
> > server-2:6789 /mnt/mycephfs (server-2 located on bat1), everything
> > works fine has expected, can write/read data, from one or more
> > clients, no probs on that.
>  
>  
>  
> Just to repeat. CephFS is still in development and can be buggy sometimes.
>  
> Also, if you do this, make sure you have an Active/Standby MDS setup  
> where each building has an MDS.
>  
> > Then I begin stress tests, i simulate the lost of one node, no problem
> > on that, still can access to the cluster data.
> > Finally i simulate the lost of a buildi

Re: ceph replication and data redundancy

2013-01-20 Thread Gregory Farnum
On Sunday, January 20, 2013 at 9:29 AM, Wido den Hollander wrote:
> Hi,
> 
> On 01/17/2013 10:55 AM, Ulysse 31 wrote:
> > Hi all,
> > 
> > I'm not sure if it's the good mailing, if not, sorry for that, tell me
> > the appropriate one, i'll go for it.
> > Here is my actual project :
> > The company i work for has several buildings, each of them are linked
> > with gigabit trunk links allowing us to have multiple machines over
> > the same lan on different buildings.
> > We need to archive some data (over 5 to 10 TB), but we want that data
> > present in each building, and, in case of the loss of a building
> > (catastrophe scenario), we still have the data.
> > Rather than using simple storage machines synced by rsync, we thought of
> > re-using older desktop machines we have in stock, and making a
> > clustered fs on them:
> > In fact, speed is clearly not the goal of this data storage, we would
> > just store old projects on it sometimes, and will access it in rare
> > cases. the most important is to keep that data archived somewhere.
> 
> 
> 
> Ok, keep that in mind. All writes to RADOS are synchronous, so if you 
> experience high latency or some congestion on your network Ceph will 
> become slow.
> 
> > I was interrested by ceph in the way that we can declare, using the
> > crush-map, a hierarchical maner to place replicated data.
> > So for a test, i build a sample cluster composed of 4 nodes, installed
> > under debian squeeze and actual bobtail stable version of ceph.
> > On my sample i wanted to simulate 2 "per buildings" nodes, each nodes
> > has a 2Tb disk and has mon/osd/mds (i know it is not optimized, but
> > that just a sample), osd uses xfs on /dev/sda3, and made a crush map
> > like :
> > ---
> > # begin crush map
> > 
> > # devices
> > device 0 osd.0
> > device 1 osd.1
> > device 2 osd.2
> > device 3 osd.3
> > 
> > # types
> > type 0 osd
> > type 1 host
> > type 2 rack
> > type 3 row
> > type 4 room
> > type 5 datacenter
> > type 6 root
> > 
> > # buckets
> > host server-0 {
> > id -2 # do not change unnecessarily
> > # weight 1.000
> > alg straw
> > hash 0 # rjenkins1
> > item osd.0 weight 1.000
> > }
> > host server-1 {
> > id -5 # do not change unnecessarily
> > # weight 1.000
> > alg straw
> > hash 0 # rjenkins1
> > item osd.1 weight 1.000
> > }
> > host server-2 {
> > id -6 # do not change unnecessarily
> > # weight 1.000
> > alg straw
> > hash 0 # rjenkins1
> > item osd.2 weight 1.000
> > }
> > host server-3 {
> > id -7 # do not change unnecessarily
> > # weight 1.000
> > alg straw
> > hash 0 # rjenkins1
> > item osd.3 weight 1.000
> > }
> > rack bat0 {
> > id -3 # do not change unnecessarily
> > # weight 3.000
> > alg straw
> > hash 0 # rjenkins1
> > item server-0 weight 1.000
> > item server-1 weight 1.000
> > }
> > rack bat1 {
> > id -4 # do not change unnecessarily
> > # weight 3.000
> > alg straw
> > hash 0 # rjenkins1
> > item server-2 weight 1.000
> > item server-3 weight 1.000
> > }
> > root root {
> > id -1 # do not change unnecessarily
> > # weight 3.000
> > alg straw
> > hash 0 # rjenkins1
> > item bat0 weight 3.000
> > item bat1 weight 3.000
> > }
> > 
> > # rules
> > rule data {
> > ruleset 0
> > type replicated
> > min_size 1
> > max_size 10
> > step take root
> > step chooseleaf firstn 0 type rack
> > step emit
> > }
> > rule metadata {
> > ruleset 1
> > type replicated
> > min_size 1
> > max_size 10
> > step take root
> > step chooseleaf firstn 0 type rack
> > step emit
> > }
> > rule rbd {
> > ruleset 2
> > type replicated
> > min_size 1
> > max_size 10
> > step take root
> > step chooseleaf firstn 0 type rack
> > step emit
> > }
> > # end crush map
> > ---
> > 
> > Using this crush-map, coupled with a default pool data size 2
> > (replication 2), allowed me to be sure to have duplicate of all data
> > on both "sample building" bat0 and bat1.
> > Then I mounted on a client using ceph-fuse using : ceph-fuse -m
> > server-2:6789 /mnt/mycephfs (server-2 located on bat1), everything
> > works fine has expected, can write/read data, from one or more
> > clients, no probs on that.
> 
> 
> 
> Just to repeat. CephFS is still in development and can be buggy sometimes.
> 
> Also, if you do this, make sure you have an Active/Standby MDS setup 
> where each building has an MDS.
> 
> > Then I begin stress tests, i simulate the lost of one node, no problem
> > on that, still can access to the cluster data.
> > Finally i simulate the lost of a building (bat0), bringing down
> > server-0 and server-1. the results was an hang on the cluster, no more
> > access to any data ... ceph -s on the active nodes hanging with :
> > 
> > 2013-01-17 09:14:18.327911 7f4e5ca70700 0 -- xxx.xxx.xxx.52:0/16543
> > > > xxx.xxx.xxx.51:6789/0 pipe(0x2c9d490 sd=3 :0 pgs=0 cs=0 l=1).fault
> > > 
> > 
> > 
> > 
> > I start search the net and might have found the answer, the problem
> > came from the fact that my rules uses "step chooseleaf firstn 0 type
> > rack", which, allows me in fact to have data replica

Re: questions on networks and hardware

2013-01-20 Thread Gregory Farnum
On Sunday, January 20, 2013 at 12:30 PM, Gandalf Corvotempesta wrote:
> 2013/1/19 John Nielsen (li...@jnielsen.net):
> > I'm planning a Ceph deployment which will include:
> > 10Gbit/s public/client network
> > 10Gbit/s cluster network
> 
> 
> 
> I'm still trying to know if a redundant cluster networks is needed or not.
> Is ceph able to manage a cluster network failure from on OSD?
> 
> What I would like to know if ceph will monitor OSD from the public
> side (that should be redundant, because client will use it to connect
> to the cluster) or from the cluster side.
> 
This is a bit embarrassing, but if you're actually using two networks and the 
cluster network fails but the client network stays up, things behave pretty 
badly (the OSDs will keep insisting it's failed, while the monitor will insist 
it's still working). I believe there's a branch working on this problem, but I 
haven't been involved with it.
It's not necessary to have split networks though, no.

Does that answer your question?
-Greg



Re: osd max write size

2013-01-20 Thread Gregory Farnum
On Sunday, January 20, 2013 at 11:06 AM, Stefan Priebe wrote:
> Hi,
>  
> what is the purpose or idea behind this setting?
>  
Couple different things:
1) The OSDs can't accept writes which won't fit inside their journal, and if 
you have a small journal you could conceivably attempt to write something which 
wouldn't fit. This doesn't happen so much any more, but it did sometimes in the 
past.
1b) The MDS uses this setting to limit the size of some of its operations 
(mostly directory updates), which can (rarely) grow quite large.
2) The OSDs don't have any other strict limits on how large an op you send 
around, but performance becomes a problem if they grow too large. This setting 
caps the operation size it allows.

It defaults to 90MB and you can change it if you want to, but I'm not sure why 
you would…
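If you did have a reason to change it, it is just a ceph.conf option under 
[osd], with the value in megabytes. A sketch:

  [osd]
      osd max write size = 90    # the default, in MB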
-Greg



Re: max useful journal size

2013-01-18 Thread Gregory Farnum
On Fri, Jan 18, 2013 at 2:20 PM, Travis Rhoden  wrote:
> Hey folks,
>
> The Ceph docs give the following recommendation on sizing your journal:
>
> osd journal size = {2 * (expected throughput * filestore min sync interval)}
>
> The default value of min sync interval is .01.  If you use throughput
> of a mediocre 7200RPM drive of 100MB/sec, this comes to 2 MB.  That
> seems like the lower bound to have the journal do anything at all.

Ah. This should refer to the max sync interval, not the min!

> My question is what is the upper bound?  There's clearly a limit to
> how big make, such that it just becomes wasted space.  The reason I
> want to know is that since I will be journals on SSDs, with each
> journal being a dedicated partition, there is a benefit to not making
> the partition bigger than it needs to be.  All that unpartitioned
> space can be used by the SSD firmware for wear-leveling and other
> things (so long as it remains unpartitioned).
>
> Would the following calc be appopriate?
>
> Assume an SSD write speed of 400MB/sec.  Default max sync interval is 5.
>
> 2 * (400 MB/sec * 5sec) = 4 GB.
>
> So is it appropriate to assume that if I can't write to an SSD faster
> than 400 MB/sec, and I keep the default sync interval values, a
> journal greater than 4GB is just a waste?
>
> I had been using 10GB journals...  seems like overkill.
>
> Or put another way, if I want to use 10GB journals, I should bump the
> max sync interval to 12.5.

It can of course grow as large as you let it, and I would leave some
extra room as a margin. The main consideration is that the journal
doesn't like getting too far ahead of the filestore, and that's what
the above calculation uses to set size.
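
Taking the numbers from this thread as an example (an SSD journal at roughly 
400 MB/s and the default 5 second max sync interval), the sizing works out to 
2 * 400 MB/s * 5 s = 4000 MB, so something like this in ceph.conf, with a bit 
of headroom:

  [osd]
      osd journal size = 5000              # in MB
      filestore max sync interval = 5      # the default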
-Greg


Re: HOWTO: teuthology and code coverage

2013-01-17 Thread Gregory Farnum
It's great to see people outside of Inktank starting to get into using
teuthology. Thanks for the write-up!
-Greg

On Wed, Jan 16, 2013 at 6:01 AM, Loic Dachary  wrote:
> Hi,
>
> I'm happy to report that running teuthology to get a lcov code coverage 
> report worked for me.
>
> http://dachary.org/wp-uploads/2013/01/teuthology/total/mon/Monitor.cc.gcov.html
>
> It took me a while to figure out the logic (thanks Josh for the help :-). I 
> wrote a HOWTO explaining the steps in detail. It should be straightforward to 
> run on an OpenStack tenant, using virtual machines instead of bare metal.
>
> http://dachary.org/?p=1788
>
> Cheers
>


Re: mds: first stab at lookup-by-ino problem/soln description

2013-01-17 Thread Gregory Farnum
We discussed this internally at Inktank today and came up with a
solution that we're happy about. The basic idea is the same, but it
differs from the one Sage originally described in several ways of
varying importance.

First, we determined how to make it work with multiple data pools.
Simply enough, we require every lookup to include both the inode
number and the pool that its data lives in. This works for most
things, the exception being some narrow cases involving files seen by
NFS before having their layout changed, or which are left alone after
creation for a long time (dropping out of MDS cache) and then getting
their layout changed. To deal with these cases, we add "sentinel
objects". By default, we write a backtrace when the file is first
written to, and to the correct location. If for some reason the file
is not written to before the inode drops out of MDS cache, then the
MDS creates an empty object, named and located the same as the first
inode object would be, but containing no data except for the
backtrace. If the next access to the file is a write, great! We don't
need to do anything else.
If users instead change the layout of the file and then write data to
it, the MDS changes the sentinel object to be marked as such and to
point to the pool the data is actually located in. Then lookups from
handles (or hard links, etc) which encode the original pool encounter
this sentinel object and are re-directed to the new pool (and internal
Ceph objects can, as an optimization, be updated to point to the new
correct pool).
Our thought was that this would only need to happen for files which
drop out of cache, although it occurs to me as I write this that we
don't have a good way of knowing if files were seen by NFS, so we
might need to do it for any file which gets its data pool changed
following create. :/ Alternatives are letting NFS get ESTALE if they
run into this narrow race (bummer), or creating them and then cleaning
them up if we can figure out that all the clients which were alive at
create time have shut down (complicated, not necessarily possible).
Thoughts?

In order to deal with the narrow race in cross-MDS renames which Sage
described "ghost entries" for, we've instead decided on a simpler
solution: if an MDS which is expected to be authoritative for an inode
can't find it, the MDS broadcasts a request for that inode to every
other MDS in the cluster. Presumably the destination MDS still has the
inode in cache and can respond in the affirmative for the search; if
it can't then that means the backtrace on the file data has to have
been updated and so a re-query will find the auth MDS. This is much
simpler, will perform at least as efficiently for small clusters, and
isn't likely to be a problem people encounter in practice.

We added a brief future optimization, which is that hard links should
contain a full (lazily-updated) backtrace for faster lookups in the
common case.



One implication of this entire scheme which we didn't discuss
explicitly is that it requires the MDS to be able to access all the
data pools, which it presently doesn't. This might have implications
on Ceph's usability for certain regulated industries, and although we
have some thoughts on how to make that unnecessary (make clients
responsible for updating the backtraces, etc) we won't be implementing
that initially. Is this going to cause problems for any current users?



Regarding the use of a distributed hash table: we really didn't want
to add that kind of extra complexity and load to the system. Our
current proposal is a way of reducing the work involved in tracking
inodes; a distributed hash table would be going the other direction
and extending the AnchorTable to be scalable. Not implausible but not
what we're after, either.

Please let us know if you have any input!
-Greg

On Tue, Jan 15, 2013 at 3:35 PM, Sage Weil  wrote:
> One of the first things we need to fix in the MDS is how we support
> lookup-by-ino.  It's important for fsck, NFS reexport, and (insofar as
> there are limitations to the current anchor table design) hard links and
> snapshots.
>
> Below is a description of the problem and a rough sketch of my proposed
> solution.  This is the first time I thought about the lookup algorithm in
> any detail, so I've probably missed something, and the 'ghost entries' bit
> is what came to mind on the plane.  Hopefully we can think of something a
> bit lighter weight.
>
> Anyway, poke holes if anything isn't clear, if you have any better ideas,
> or if it's time to refine further.  This is just a starting point for the
> conversation.
>
>
> The problem
> ---
>
> The MDS stores all fs metadata (files, inodes) in a hierarchy,
> allowing it to distribute responsibility among ceph-mds daemons by
> partitioning the namespace hierarchically.  This is also a huge win
> for inode prefetching: loading the directory gets you both the names
> and the inodes in a single IO.

Re: mds: first stab at lookup-by-ino problem/soln description

2013-01-16 Thread Gregory Farnum
On Wed, Jan 16, 2013 at 5:17 PM, Sage Weil  wrote:
> On Wed, 16 Jan 2013, Gregory Farnum wrote:
>> I'm not familiar with the interfaces at work there. Do we have a free
>> 32 bits we can steal in order to do that stuffing? (I *think* it would
>> go in the NFS filehandle structure rather than the ino, right?)
>
> Right, there is at least 8 more bytes in a standard fh (16 bytes iirc) to
> stuff whatever we want into.
>
>> We would need to also store that information in order to eventually
>> replace the anchor table, but of course that's much easier to deal
>> with. If we can just do it this way, that still leaves handling files
>> which don't have any data written yet — under our current system,
>> users can apply a data layout to any inode which has not had data
>> written to it yet. Unfortunately that gets hard to deal with if a user
>> touches a bunch of files and then comes back to place them the next
>> day. :/ I suppose un-touched files could have the special property
>> that their lookup data is stored in the metadata pool and it gets
>> moved as soon as they have data — in the typical case files are
>> written right away and so this wouldn't be any more writes, just a bit
>> more logic.
>
> We can also change the semantics, here.  It could be that you have to
> specify the file's layout on create, and can't after it was created.
> Otherwise you get the directory/subtree's layout.  We could store the pool
> with the remote dentry link, for instance, and we could stick it in the
> fh.  So the <ino, pool> pair is really the "locator" that you would need.
>
> That could work...

I'm less a fan of forcing users to specify file layouts on create
since there aren't any standard interfaces which would let them do
that, so a lot of use cases would be restricted to directory-level
layout changes. Granted that covers the big ones, but we do have a
non-zero number of users who have learned our previous semantics,
right?


Re: mds: first stab at lookup-by-ino problem/soln description

2013-01-16 Thread Gregory Farnum
On Wed, Jan 16, 2013 at 3:54 PM, Sam Lang  wrote:
>
> On Wed, Jan 16, 2013 at 3:52 PM, Gregory Farnum  wrote:
>>
>> My biggest concern with this was how it worked on cluster with
>> multiple data pools, and Sage's initial response was to either
>> 1) create an object for each inode that lives in the metadata pool,
>> and holds the backtraces (rather than putting them as attributes on
>> the first object in the file), or
>> 2) use a more sophisticated data structure, perhaps built on Eleanor's
>> b-tree project from last summer
>> (http://ceph.com/community/summer-adventures-with-ceph-building-a-b-tree/)
>>
>> I had thought that we could just query each data pool for the object,
>> but Sage points out that 100-pool clusters aren't exactly unreasonable
>> and that would take quite a lot of query time. And having the
>> backtraces in the data pools significantly complicates things with our
>> rules about setting layouts on new files.
>>
>> So this is going to need some kind of revision, please suggest
>> alternatives!
>
>
> Correct me if I'm wrong, but this seems like its only an issue in the NFS
> reexport case, as fsck can walk through the data objects in each pool (in
> parallel?) and verify back/forward consistency, so we won't have to guess
> which pool an ino is in.
>
> Given that, if we could stuff the pool id in the ino for the file returned
> through the client interfaces, then we wouldn't have to guess.
>
> -sam

I'm not familiar with the interfaces at work there. Do we have a free
32 bits we can steal in order to do that stuffing? (I *think* it would
go in the NFS filehandle structure rather than the ino, right?)
We would need to also store that information in order to eventually
replace the anchor table, but of course that's much easier to deal
with. If we can just do it this way, that still leaves handling files
which don't have any data written yet — under our current system,
users can apply a data layout to any inode which has not had data
written to it yet. Unfortunately that gets hard to deal with if a user
touches a bunch of files and then comes back to place them the next
day. :/ I suppose un-touched files could have the special property
that their lookup data is stored in the metadata pool and it gets
moved as soon as they have data — in the typical case files are
written right away and so this wouldn't be any more writes, just a bit
more logic.


Re: mds: first stab at lookup-by-ino problem/soln description

2013-01-16 Thread Gregory Farnum
My biggest concern with this was how it worked on cluster with
multiple data pools, and Sage's initial response was to either
1) create an object for each inode that lives in the metadata pool,
and holds the backtraces (rather than putting them as attributes on
the first object in the file), or
2) use a more sophisticated data structure, perhaps built on Eleanor's
b-tree project from last summer
(http://ceph.com/community/summer-adventures-with-ceph-building-a-b-tree/)

I had thought that we could just query each data pool for the object,
but Sage points out that 100-pool clusters aren't exactly unreasonable
and that would take quite a lot of query time. And having the
backtraces in the data pools significantly complicates things with our
rules about setting layouts on new files.

So this is going to need some kind of revision, please suggest alternatives!
-Greg

On Tue, Jan 15, 2013 at 3:35 PM, Sage Weil  wrote:
> One of the first things we need to fix in the MDS is how we support
> lookup-by-ino.  It's important for fsck, NFS reexport, and (insofar as
> there are limitations to the current anchor table design) hard links and
> snapshots.
>
> Below is a description of the problem and a rough sketch of my proposed
> solution.  This is the first time I thought about the lookup algorithm in
> any detail, so I've probably missed something, and the 'ghost entries' bit
> is what came to mind on the plane.  Hopefully we can think of something a
> bit lighter weight.
>
> Anyway, poke holes if anything isn't clear, if you have any better ideas,
> or if it's time to refine further.  This is just a starting point for the
> conversation.
>
>
> The problem
> ---
>
> The MDS stores all fs metadata (files, inodes) in a hierarchy,
> allowing it to distribute responsibility among ceph-mds daemons by
> partitioning the namespace hierarchically.  This is also a huge win
> for inode prefetching: loading the directory gets you both the names
> and the inodes in a single IO.
>
> One consequence of this is that we do not have a flat "inode table"
> that let's us look up files by inode number.  We *can* find
> directories by ino simply because they are stored in an object named
> after the ino.  However, we can't populate the cache this way because
> the metadata in cache must be fully attached to the root to avoid
> various forms of MDS anarchy.
>
> Lookup-by-ino is currently needed for hard links.  The first link to a
> file is deemed the "primary" link, and that is where the inode is
> stored.  Any additional links are internally "remote" links, and
> reference the inode by ino.  However, there are other uses for
> lookup-by-ino, including NFS reexport and fsck.
>
> Anchor table
> 
>
> The anchor table is currently used to locate inodes that have hard
> links.  Inodes in the anchor table are said to be "anchored," and can
> be found by ino alone with no knowledge of their path.  Normally, only
> inodes that have hard links need to be anchored.  There are a few
> other cases, but they are not relevant here.
>
> The anchor table is a flat table of records like:
>
>  ino -> (parent ino, hash(name), refcount)
>
> All parent ino's referenced in the table also have records.  The
> refcount includes both other records listing a given ino as parent and
> the anchor itself (i.e., the inode).  To anchor an inode, we insert
> records for the ino and all ancestors (if they are not already present).
>
> An anchor removal means decrementing the ino record.  Once a refcount
> hits 0 it can be removed, and the parent ino's refcount can be
> decremented.
>
> A directory rename involves changing the parent ino value for an
> existing record, populating the new ancestors into the table (as
> needed), and decrementing the old parent's refcount.
>
> This all works great if there are a small number of anchors, but does
> not scale.  The entire table is managed by a single MDS, and is
> currently kept in memory.  We do not want to anchor every inode in the
> system or this is impractical.
>
> But, be want lookup-by-ino for NFS reexport, and something
> similar/related for fsck.
>
>
> Current lookup by ino procedure
> ---
>
> ::
>
>  lookup_ino(ino)
>send message mds.N -> mds.0
>  "anchor lookup $ino"
>get reply message mds.0 -> mds.N
>  reply contains record for $ino and all ancestors (an "anchor trace")
>parent = depest ancestor in trace that we have in our cache
>while parent != ino
>  child = parent.lookup(hash(name))
>  if not found
>restart from the top
>  parent = child
>
>
> Directory backpointers
> --
>
> There is partial infrastructure for supporting fsck that is already maintained
> for directories.  Each directory object (the first object for the directory,
> if there are multiple fragments) has an attr that provides a snapshot of
> ancestors called a "backtrace"::
>
>  struct inode_backtrace_t {
>inodeno_t ino;   /

Re: Grid data placement

2013-01-15 Thread Gregory Farnum
On Tue, Jan 15, 2013 at 11:00 AM, Dimitri Maziuk  wrote:
> On 01/15/2013 12:36 PM, Gregory Farnum wrote:
>> On Tue, Jan 15, 2013 at 10:33 AM, Dimitri Maziuk  
>> wrote:
>
>>> At the start of the batch #cores-in-the-cluster processes try to mmap
>>> the same 2GB and start reading it from SEEK_SET at the same time. I
>>> won't know until I try but I suspect it won't like that.
>>
>> Well, it'll be #servers-in-cluster serving up 4MB chunks out of cache.
>> It's possible you could overwhelm their networking but my bet is
>> they'll just get spread out slightly on the first block and then not
>> contend in the future.
>
> In the future the application spreads out the reads as well: running
> instances go through the data at different speed, and when one's
> finished, the next one starts on the same core & it mmap's the first
> chunk again.
>
>> Just as long as you're thinking of it as a test system that would make
>> us very happy. :)
>
> Well, IRL this is throw-away data generated at the start of a batch, and
> we're good if one batch a month runs to completion. So if it doesn't
> crash all the time every time, that actually should be good enough for
> me. However, not all of the nodes have spare disk slots, so I couldn't
> do a full-scale deployment anyway, not without rebuilding half the nodes.

In that case you are my favorite kind of user and you should install
and try it out right away! :D
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Grid data placement

2013-01-15 Thread Gregory Farnum
On Tue, Jan 15, 2013 at 10:33 AM, Dimitri Maziuk  wrote:
> On 01/15/2013 12:16 PM, Gregory Farnum wrote:
>
>> There's a "read from replicas" operation flag that allows reading data
>> off the local node, although I don't think there's a way to turn it on
>> in the standard filesystem clients right now. It wouldn't be hard for
>> somebody to add. I'm not sure you actually need it though; Ceph unlike
>> NFS distributes the data over all the OSDs in the cluster so you could
>> scale the number of suppliers as you scale the number of consumers.
>> You would definitely not need to run an MDS on each node; just one
>> should be fine.
>
> At the start of the batch #cores-in-the-cluster processes try to mmap
> the same 2GB and start reading it from SEEK_SET at the same time. I
> won't know until I try but I suspect it won't like that.

Well, it'll be #servers-in-cluster serving up 4MB chunks out of cache.
It's possible you could overwhelm their networking but my bet is
they'll just get spread out slightly on the first block and then not
contend in the future.

>> Rather more importantly however, Inktank still doesn't consider the
>> CephFS filesystem to be production-ready, so you will want to tread
>> with care in checking it out.
>
> I'm assuming Inktank will consider it production-ready at some point;
> wouldn't it be great if I had a tested working setup by then. ;-)

Just as long as you're thinking of it as a test system that would make
us very happy. :)
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Grid data placement

2013-01-15 Thread Gregory Farnum
On Tue, Jan 15, 2013 at 9:38 AM, Dimitri Maziuk  wrote:
> Hi everyone,
>
> quick question: can I get ceph to replicate a bunch of files to every
> host in compute cluster and then have those hosts read those files from
> local disk?
>
> TFM looks like a custom crush map should get the files to [osd on] every
> host, but I'm not clear on the read step: do I need an mds on every host
> and mount the fs off localhost's mds?
>
> (We've $APP running on the cluster, normally one instance/cpu core, that
> mmap's (read only) ~30GB of binary files. I/O over NFS kills the cluster
> even with a few hosts. Currently the files are rsync'ed to every host at
> the start of the batch; that'll only scale to a few dozen hosts at best.)

There's a "read from replicas" operation flag that allows reading data
off the local node, although I don't think there's a way to turn it on
in the standard filesystem clients right now. It wouldn't be hard for
somebody to add. I'm not sure you actually need it though; Ceph unlike
NFS distributes the data over all the OSDs in the cluster so you could
scale the number of suppliers as you scale the number of consumers.
You would definitely not need to run an MDS on each node; just one
should be fine.

Rather more importantly however, Inktank still doesn't consider the
CephFS filesystem to be production-ready, so you will want to tread
with care in checking it out. :)
-Greg
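For what it's worth, the placement side of what Dimitri describes (one replica on every host) is expressed in CRUSH roughly like the rule below, shown in the decompiled crushtool text format; the rule name, ruleset number and pool name are made-up examples:

    rule replicate_to_every_host {
            ruleset 4
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type host
            step emit
    }

    # point a pool at the rule and ask for as many copies as there are hosts,
    # e.g. with 8 hosts:
    ceph osd pool set mydata crush_ruleset 4
    ceph osd pool set mydata size 8

Even then, reads normally go to each object's primary OSD rather than the local copy, which is why the read-from-replicas flag mentioned above would still need to be exposed in the client.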
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] configure.ac: check for org.junit.rules.ExternalResource

2013-01-15 Thread Gregory Farnum
On Tue, Jan 15, 2013 at 6:55 AM, Noah Watkins  wrote:
> On Tue, Jan 15, 2013 at 1:32 AM, Danny Al-Gaaf  
> wrote:
>> Am 15.01.2013 10:04, schrieb James Page:
>>> On 12/01/13 16:36, Noah Watkins wrote:
 On Thu, Jan 10, 2013 at 9:13 PM, Gary Lowell
  wrote:
>>
>> I would also prefer to not add another huge build dependency to ceph,
>> especially since it's e.g. not supported by SLES11 and since ceph
>> currently builds fine (even with these small warnings from autotools).
>
> Ahh, I had in my head a separate repository for Java bindings managed
> by maven (or ant). Either way, I have no strong opinion -- we only
> have one junit dependency :)

I don't believe Ceph currently requires any Java to build, and it's
going to remain that way if I have anything to say about it. ;) Hadoop
bindings can be packaged in about a billion different ways that don't
require core Ceph to depend on a Java toolchain.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD nodes with >=8 spinners, SSD-backed journals, and their performance impact

2013-01-14 Thread Gregory Farnum
On Mon, Jan 14, 2013 at 6:09 AM, Florian Haas  wrote:
> Hi Mark,
>
> thanks for the comments.
>
> On Mon, Jan 14, 2013 at 2:46 PM, Mark Nelson  wrote:
>> Hi Florian,
>>
>> Couple of comments:
>>
>> "OSDs use a write-ahead mode for local operations: a write hits the journal
>> first, and from there is then being copied into the backing filestore."
>>
>> It's probably important to mention that this is true by default only for
>> non-btrfs file systems.  See:
>>
>> http://ceph.com/wiki/OSD_journal
>
> I am well aware of that, but I've yet to find a customer (or user)
> that's actually willing to entrust a production cluster with several
> hundred terabytes of data to btrfs. :) Besides, the whole post is
> about whether or not to use dedicated SSD block devices for OSD
> journals, and if you're tossing everything into btrfs you've already
> made the decision to use in-filestore journals.

That is absolutely not the case. btrfs works just fine with an
external journal on SSD or whatever else; what made you think
otherwise?
-Greg
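For readers wondering what the dedicated-journal setup looks like in practice, it is just the "osd journal" option pointing at the SSD; a rough ceph.conf fragment (hostnames and device paths are examples):

    [osd]
            ; journal size in MB; relevant when the journal is a plain file
            osd journal size = 10240

    [osd.0]
            host = node1
            ; filestore (btrfs, xfs, ...) on the data disk,
            ; journal on a dedicated SSD partition
            osd journal = /dev/sdb5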
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph version 0.56.1, data loss on power failure

2013-01-11 Thread Gregory Farnum
On Fri, Jan 11, 2013 at 3:07 AM, Marcin Szukala
 wrote:
> 2013/1/10 Gregory Farnum :
>> On Thu, Jan 10, 2013 at 8:56 AM, Marcin Szukala
>>  wrote:
>>> Hi,
>>>
>>> Scenario is correct but the last line. I can mount the image, but the
>>> data that was written to the image before power failure is lost.
>>>
>>> Currently the ceph cluster is not healthy, but I don't think it's
>>> related because I had this issue before the cluster itself had issues
>>> (about that I will write in different post not to mix topics).
>>
>> This sounds like one of two possibilities:
>> 1) You aren't actually committing data to RADOS very often and so when
>> the power fails you lose several minutes of writes. How much data are
>> you losing, how's it generated, and is whatever you're doing running
>> any kind of fsync or sync? And what filesystem are you using?
>> 2) Your cluster is actually not accepting writes and so RBD never
>> manages to do a write but you aren't doing much and so you don't
>> notice. What's the output of ceph -s?
>> -Greg
>
> Hi,
>
> Today I have created new ceph cluster from scratch.
> root@ceph-1:~# ceph -s
>health HEALTH_OK
>monmap e1: 3 mons at
> {a=10.3.82.102:6789/0,b=10.3.82.103:6789/0,d=10.3.82.105:6789/0},
> election epoch 4, quorum 0,1,2 a,b,d
>osdmap e65: 56 osds: 56 up, 56 in
> pgmap v3892: 13744 pgs: 13744 active+clean; 73060 MB data, 147 GB
> used, 51983 GB / 52131 GB avail
>mdsmap e1: 0/0/1 up
>
> The issue persists.
> I'm losing all of the data on the image.

So you mean you mount the image, format it with 5 XFS filesystems as
below, run it for a while, and then the power on the system fails.
Then you turn the system back on, attach the image, and it has no
filesystems on it at all? Or the filesystems remain and can be mounted
but they have no data?
-Greg

> On the mounted image I have 5 logical volumes.
>
> root@compute-9:~# mount
> (snip)
> /dev/mapper/compute--9-nova on /var/lib/nova type xfs (rw)
> /dev/mapper/compute--9-tmp on /tmp type xfs (rw)
> /dev/mapper/compute--9-libvirt on /etc/libvirt type xfs (rw)
> /dev/mapper/compute--9-log on /var/log type xfs (rw)
> /dev/mapper/compute--9-openvswitch on /var/lib/openvswitch type xfs (rw)
>
> So I have directories with little to none data writes and with a lot
> of writes (logs). No fsync or sync. Filesystem is xfs.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD memory leaks?

2013-01-11 Thread Gregory Farnum
On Fri, Jan 11, 2013 at 6:57 AM, Sébastien Han  wrote:
>> Is osd.1 using the heap profiler as well? Keep in mind that active use
>> of the memory profiler will itself cause memory usage to increase —
>> this sounds a bit like that to me since it's staying stable at a large
>> but finite portion of total memory.
>
> Well, the memory consumption was already high before the profiler was
> started. So yes with the memory profiler enabled an OSD might consume
> more memory but this doesn't cause the memory leaks.

My concern is that maybe you saw a leak but when you restarted with
the memory profiling you lost whatever conditions caused it.

> Any ideas? Nothing to say about my scrubbing theory?
I like it, but Sam indicates that without some heap dumps which
capture the actual leak then scrub is too large to effectively code
review for leaks. :(
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Question about configuration

2013-01-10 Thread Gregory Farnum
On Thu, Jan 10, 2013 at 4:51 PM, Yasuhiro Ohara  wrote:
>
> Hi, Greg,
>
> When I went through the Ceph document, I could find the description
> about /etc/init.d only, so it is still the easiest for me.
> Is there documentation on other (upstart?) system or do I need to
> learn those system ? Or just letting me know how to install the
> resource file (for Ceph in upstart) might work for me.

There isn't really any documentation right now and if you started off
with sysvinit it's probably easiest to continue that way. It will work
with that system too; it's just that if you run "sudo service ceph -a
start" then it's going to go and turn on all the daemons listed in its
local ceph.conf.
-Greg

>
> Thanks.
>
> regards,
> Yasu
>
> From: Gregory Farnum 
> Subject: Re: Question about configuration
> Date: Thu, 10 Jan 2013 16:43:59 -0800
> Message-ID: 
> 
>
>> On Thu, Jan 10, 2013 at 4:39 PM, Yasuhiro Ohara  wrote:
>>>
>>> Hi,
>>>
>>> What will happen when constructing a cluster of 10 hosts,
>>> but the hosts are gradually removed from the cluster
>>> one by one (in each step waiting Ceph status to become healthy),
>>> and reaches eventually to, say, 3 hosts ?
>>>
>>> In other words, is there any problem with having 10 osd configuration
>>> in the ceph.conf, but actually only 3 is up (the 7 are down and out) ?
>>
>> If you're not using the /etc/init.d ceph script to start up everything
>> with the -a option, this will work just fine.
>>
>>>
>>> I assume that if the size of the replication is 3, we can turn off
>>> 2 osds at each time, and Ceph can recover itself to the healthy state.
>>> Is it the case ?
>>
>> Yeah, that should work fine. You might consider just marking OSDs
>> "out" two at a time and not actually killing them until the cluster
>> has become quiescent again, though — that way they can participate as
>> a source for recovery.
>> -Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Question about configuration

2013-01-10 Thread Gregory Farnum
On Thu, Jan 10, 2013 at 4:39 PM, Yasuhiro Ohara  wrote:
>
> Hi,
>
> What will happen when constructing a cluster of 10 hosts,
> but the hosts are gradually removed from the cluster
> one by one (in each step waiting Ceph status to become healthy),
> and reaches eventually to, say, 3 hosts ?
>
> In other words, is there any problem with having 10 osd configuration
> in the ceph.conf, but actually only 3 is up (the 7 are down and out) ?

If you're not using the /etc/init.d ceph script to start up everything
with the -a option, this will work just fine.

>
> I assume that if the size of the replication is 3, we can turn off
> 2 osds at each time, and Ceph can recover itself to the healthy state.
> Is it the case ?

Yeah, that should work fine. You might consider just marking OSDs
"out" two at a time and not actually killing them until the cluster
has become quiescent again, though — that way they can participate as
a source for recovery.
-Greg
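As a concrete sketch of that sequence (osd ids and the init system are examples; adjust to your setup):

    # mark two OSDs out so their data drains off while the daemons keep
    # running and can act as recovery sources
    ceph osd out 8
    ceph osd out 9
    # wait until "ceph -s" reports all PGs active+clean again, then stop them
    /etc/init.d/ceph stop osd.8
    /etc/init.d/ceph stop osd.9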
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph version 0.56.1, data loss on power failure

2013-01-10 Thread Gregory Farnum
On Thu, Jan 10, 2013 at 8:56 AM, Marcin Szukala
 wrote:
> Hi,
>
> Scenario is correct but the last line. I can mount the image, but the
> data that was written to the image before power failure is lost.
>
> Currently the ceph cluster is not healthy, but I don't think it's
> related because I had this issue before the cluster itself had issues
> (about that I will write in different post not to mix topics).

This sounds like one of two possibilities:
1) You aren't actually committing data to RADOS very often and so when
the power fails you lose several minutes of writes. How much data are
you losing, how's it generated, and is whatever you're doing running
any kind of fsync or sync? And what filesystem are you using?
2) Your cluster is actually not accepting writes and so RBD never
manages to do a write but you aren't doing much and so you don't
notice. What's the output of ceph -s?
-Greg
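To illustrate possibility (1): anything that only reaches the guest's page cache is simply gone on power loss; only data that has been flushed (fsync/sync, or O_SYNC/O_DIRECT writes) will have made it down to RBD. A minimal sketch of a durable write, plain POSIX and nothing Ceph-specific (the path is just an example):

    // durable_write.cc: write a buffer and make sure it has actually reached
    // the backing device (here an RBD-backed filesystem) before claiming success.
    // Illustrative only; error handling kept minimal.
    #include <cerrno>
    #include <fcntl.h>
    #include <unistd.h>

    static int durable_write(const char* path, const void* buf, size_t len) {
      int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
      if (fd < 0)
        return -errno;
      const char* p = static_cast<const char*>(buf);
      size_t left = len;
      while (left > 0) {
        ssize_t r = write(fd, p, left);
        if (r < 0) { int e = errno; close(fd); return -e; }
        p += r;
        left -= static_cast<size_t>(r);
      }
      // Without this, the data may live only in the page cache and is lost
      // if the host loses power before the kernel writes it back.
      if (fsync(fd) < 0) { int e = errno; close(fd); return -e; }
      return close(fd) < 0 ? -errno : 0;
    }

    int main() {
      const char msg[] = "committed, not just cached\n";
      return durable_write("/var/log/testfile", msg, sizeof(msg) - 1) == 0 ? 0 : 1;
    }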
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: geo replication

2013-01-10 Thread Gregory Farnum
On Wed, Jan 9, 2013 at 1:33 PM, Gandalf Corvotempesta
 wrote:
> 2013/1/9 Mark Kampe :
>> Asynchronous RADOS replication is definitely on our list,
>> but more complex and farther out.
>
> Do you have any ETA?
> 1 month?  6 months ? 1 year?

No, but definitely closer to 1 year than either of the other options
at this point.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD memory leaks?

2013-01-10 Thread Gregory Farnum
On Wed, Jan 9, 2013 at 10:09 AM, Sylvain Munaut
 wrote:
> Just fyi, I also have growing memory on OSD, and I have the same logs:
>
> "libceph: osd4 172.20.11.32:6801 socket closed" in the RBD clients

That message is not an error; it just happens if the RBD client
doesn't talk to that OSD for a while. I believe its volume has been
turned down quite a lot in the latest kernels/our git tree.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD memory leaks?

2013-01-10 Thread Gregory Farnum
On Wed, Jan 9, 2013 at 8:10 AM, Dave Spano  wrote:
> Yes, I'm using argonaut.
>
> I've got 38 heap files from yesterday. Currently, the OSD in question is 
> using 91.2% of memory according to top, and staying there. I initially 
> thought it would go until the OOM killer started killing processes, but I 
> don't see anything funny in the system logs that indicate that.
>
> On the other hand, the ceph-osd process on osd.1 is using far less memory.

Is osd.1 using the heap profiler as well? Keep in mind that active use
of the memory profiler will itself cause memory usage to increase —
this sounds a bit like that to me since it's staying stable at a large
but finite portion of total memory.
-Greg
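For reference, the heap profiler being discussed is tcmalloc's, driven through the admin interface, roughly as below (osd id is an example, and the exact command spelling has varied a little between releases):

    ceph tell osd.0 heap start_profiler   # begin writing heap profiles
    ceph tell osd.0 heap dump             # write out a profile right now
    ceph tell osd.0 heap stats            # summary of current allocations
    ceph tell osd.0 heap stop_profiler
    ceph tell osd.0 heap release          # hand freed memory back to the OS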
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue

2013-01-10 Thread Gregory Farnum
On Wed, Jan 9, 2013 at 8:00 AM, Noah Watkins  wrote:
> Hi Jutta,
>
> On Wed, Jan 9, 2013 at 7:11 AM, Lachfeld, Jutta
>  wrote:
>>
>> the current content of the web page 
>> http://ceph.com/docs/master/cephfs/hadoop shows a configuration parameter 
>> ceph.object.size.
>> Is it the CEPH equivalent  to the "HDFS block size" parameter which I have 
>> been looking for?
>
> Yes. By specifying ceph.object.size, Hadoop will use a default
> Ceph file layout with stripe unit = object size, and stripe count = 1.
> This is effectively the same meaning as dfs.block.size for HDFS.
>
>> Does the parameter ceph.object.size apply to version 0.56.1?
>
> The Ceph/Hadoop file system plugin is being developed here:
>
>   git://github.com/ceph/hadoop-common cephfs/branch-1.0
>
> There is an old version of the Hadoop plugin in the Ceph tree which
> will be removed shortly. Regarding the versions, development is taking
> place in cephfs/branch-1.0 and in ceph.git master. We don't yet have a
> system in place for dealing with compatibility across versions because
> the code is in heavy development.

If you are using the old version in the Ceph tree, you should be
setting fs.ceph.blockSize rather than ceph.object.size. :)


>> I would be interested in setting this parameter to values higher than 64MB, 
>> e.g. 256MB or 512MB similar to the values I have used for HDFS for 
>> increasing the performance of the TeraSort benchmark. Would these values be 
>> allowed and would they at all make sense for the mechanisms used in CEPH?
>
> I can't think of any reason why a large size would cause concern, but
> maybe someone else can chime in?

Yep, totally fine.
-Greg
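In configuration terms that would be something like the fragment below in the Hadoop site config (256 MB shown; use fs.ceph.blockSize instead if you are on the old in-tree plugin, per the note above):

    <property>
      <name>ceph.object.size</name>
      <!-- 256 MB; plays the role dfs.block.size plays for HDFS -->
      <value>268435456</value>
    </property>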
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Crushmap Design Question

2013-01-10 Thread Gregory Farnum
On Tue, Jan 8, 2013 at 12:20 PM, Moore, Shawn M  wrote:
> I have been testing ceph for a little over a month now.  Our design goal is 
> to have 3 datacenters in different buildings all tied together over 10GbE.  
> Currently there are 10 servers each serving 1 osd in 2 of the datacenters.  
> In the third is one large server with 16 SAS disks serving 8 osds.  
> Eventually we will add one more identical large server into the third 
> datacenter.  I have told ceph to keep 3 copies and tried to do the crushmap 
> in such a way that as long as a majority of mon's can stay up, we could run 
> off of one datacenter's worth of osds.   So in my testing, it doesn't work 
> out quite this way...
>
> Everything is currently ceph version 0.56.1 
> (e4a541624df62ef353e754391cbbb707f54b16f7)
>
> I will put hopefully relevant files at the end of this email.
>
> When all 28 osds are up, I get:
> 2013-01-08 13:56:07.435914 mon.0 [INF] pgmap v2712076: 7104 pgs: 7104 
> active+clean; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail
>
> When I fail a datacenter (including 1 of 3 mon's) I eventually get:
> 2013-01-08 13:58:54.020477 mon.0 [INF] pgmap v2712139: 7104 pgs: 7104 
> active+degraded; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail; 
> 16362/49086 degraded (33.333%)
>
> At this point everything is still ok.  But when I fail the 2nd datacenter 
> (still leaving 2 out of 3 mons running) I get:
> 2013-01-08 14:01:25.600056 mon.0 [INF] pgmap v2712189: 7104 pgs: 7104 
> incomplete; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail
>
> Most VM's quit working and "rbd ls" works, but not a single line from "rados 
> -p rbd ls" works and the command hangs.  Now after a while (you can see from 
> timestamps) I end up at and stays this way:
> 2013-01-08 14:40:54.030370 mon.0 [INF] pgmap v2713794: 7104 pgs: 213 active, 
> 117 active+remapped, 3660 incomplete, 3108 active+degraded+remapped, 6 
> remapped+incomplete; 60264 MB data, 65701 MB used, 4604 GB / 4768 GB avail; 
> 7696/49086 degraded (15.679%)

This took me a bit to work out as well, but you've run afoul of a new
post-argonaut feature intended to prevent people from writing with
insufficient durability. Pools now have a "min size" and PGs in that
pool won't go active if they don't have that many OSDs to write on.
The clue here is the "incomplete" state. You can change it with "ceph
osd pool set foo min_size 1", where "foo" is the name of the pool
whose min_size you wish to change (and this command sets the min size
to 1, obviously). The default for new pools is controlled by the "osd
pool default min size" config value (which you should put in the
global section). By default it'll be half of your default pool size.

So in your case your pools have a default size of 3, and the min size
is 2 (3/2 = 1.5, rounded up), and the OSDs are refusing to go active
because of the dramatically reduced redundancy. You can set the min
size down though and they will go active.
-Greg
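In command form, assuming the affected pool is named "rbd":

    # inspect the current size / min_size of each pool
    ceph osd dump | grep ^pool
    # allow PGs in that pool to go active with a single surviving replica
    ceph osd pool set rbd min_size 1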
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What is the acceptable attachment file size on the mail server?

2013-01-10 Thread Gregory Farnum
On Thu, Jan 10, 2013 at 10:45 AM, Isaac Otsiabah  wrote:
>
> What is the acceptable attachment file size? Because I have been trying to
> post a problem with an attachment greater than 1.5MB and it seems to get
> lost.

That wouldn't be surprising; Sage suggests the FAQ
(http://www.tux.org/lkml/) has an answer somewhere although I couldn't
find it in a quick check.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

2013-01-10 Thread Gregory Farnum
On Tue, Jan 8, 2013 at 1:31 PM, Isaac Otsiabah  wrote:
>
>
> Hi Greg, it appears to be a timing issue because with the flag (debug ms=1) 
> turned on, the system ran slower and became harder to fail. I ran it several 
> times and finally got it to fail on (osd.0) using default crush map. The 
> attached tar file contains log files  for all components on g8ct plus the 
> ceph.conf.
>
> I started with a 1-node cluster on host g8ct (osd.0, osd.1, osd.2)  and then 
> added host g13ct (osd.3, osd.4, osd.5)
>
>
>
>  id   weight  type name   up/down reweight
> -1  6   root default
> -3  6   rack unknownrack
> -2  3   host g8ct
> 0   1   osd.0   down1
> 1   1   osd.1   up  1
> 2   1   osd.2   up  1
> -4  3   host g13ct
> 3   1   osd.3   up  1
> 4   1   osd.4   up  1
> 5   1   osd.5   up  1
>
>
>
> The error messages are in ceph.log and ceph-osd.0.log:
>
> ceph.log:2013-01-08 05:41:38.080470 osd.0 192.168.0.124:6801/25571 3 : [ERR] 
> map e15 had wrong cluster addr (192.168.0.124:6802/25571 != my 
> 192.168.1.124:6802/25571)
> ceph-osd.0.log:2013-01-08 05:41:38.080458 7f06757fa710  0 log [ERR] : map e15 
> had wrong cluster addr (192.168.0.124:6802/25571 != my 
> 192.168.1.124:6802/25571)

Thanks. I had a brief look through these logs on Tuesday and want to
spend more time with them because they have some odd stuff in them. It
*looks* like the OSD is starting out using a single IP for both the
public and cluster networks and then switching over at some point,
which is...odd.
Knowing more details about how your network is actually set up would
be very helpful.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: recoverying from 95% full osd

2013-01-08 Thread Gregory Farnum
On Tuesday, January 8, 2013 at 10:52 PM, Sage Weil wrote:
> On Wed, 9 Jan 2013, Roman Hlynovskiy wrote:
> > Thanks a lot Greg,
> > 
> > that was the black magic command I was looking for )
> > 
> > I deleted some obsolete data and reached those figures:
> > 
> > chef@cephgw:~$ ./clu.sh exec "df -kh"|grep osd
> > /dev/mapper/vg00-osd 252G 153G 100G 61% /var/lib/ceph/osd/ceph-0
> > /dev/mapper/vg00-osd 252G 180G 73G 72% /var/lib/ceph/osd/ceph-1
> > /dev/mapper/vg00-osd 252G 213G 40G 85% /var/lib/ceph/osd/ceph-2
> > 
> > which in comparison to previous one:
> > 
> > /dev/mapper/vg00-osd 252G 173G 80G 69% /var/lib/ceph/osd/ceph-0
> > /dev/mapper/vg00-osd 252G 203G 50G 81% /var/lib/ceph/osd/ceph-1
> > /dev/mapper/vg00-osd 252G 240G 13G 96% /var/lib/ceph/osd/ceph-2
> > 
> > show that 20gig were removed from osd-1, 23gig from osd-2 and 27gig from 
> > osd-3.
> > So, cleaned up space also has some disproportion.
> > 
> > at the same time:
> > chef@cephgw:~$ ceph osd tree
> > 
> > # id weight type name up/down reweight
> > -1 3 pool default
> > -3 3 rack unknownrack
> > -2 1 host ceph-node01
> > 0 1 osd.0 up 1
> > -4 1 host ceph-node02
> > 1 1 osd.1 up 1
> > -5 1 host ceph-node03
> > 2 1 osd.2 up 1
> > 
> > 
> > all osd weights are the same. I guess there is no automatic way to
> > balance storage usage for my case and I have to play with osd weights
> > using 'ceph osd reweight-by-utilization xx' until storage is used more
> > or less equally and when get the weights back to 1?
> 
> 
> 
> How many pgs do you have? ('ceph osd dump | grep ^pool').

I believe this is it. 384 PGs, but three pools of which only one (or maybe a 
second one, sort of) is in use. Automatically setting the right PG counts is 
coming some day, but until then being able to set up pools of the right size is 
a big gotcha. :(
Depending on how mutable the data is, recreate with larger PG counts on the 
pools in use. Otherwise we can do something more detailed.
-Greg
 
> 
> You might also adjust the crush tunables, see
> 
> http://ceph.com/docs/master/rados/operations/crush-map/?highlight=tunable#tunables
> 
> sage
> 
> > 
> > 
> > 
> > 2013/1/8 Gregory Farnum :
> > > On Tue, Jan 8, 2013 at 2:42 AM, Roman Hlynovskiy
> > >  wrote:
> > > > Hello,
> > > > 
> > > > I am running ceph v0.56 and at the moment trying to recover ceph which
> > > > got completely stuck after 1 osd got filled by 95%. Looks like the
> > > > distribution algorithm is not perfect since all 3 OSD's I use are
> > > > 256Gb each, however one of them got filled faster than others:
> > > > 
> > > > osd-1:
> > > > Filesystem Size Used Avail Use% Mounted on
> > > > /dev/mapper/vg00-osd 252G 173G 80G 69% /var/lib/ceph/osd/ceph-0
> > > > 
> > > > osd-2:
> > > > Filesystem Size Used Avail Use% Mounted on
> > > > /dev/mapper/vg00-osd 252G 203G 50G 81% /var/lib/ceph/osd/ceph-1
> > > > 
> > > > osd-3:
> > > > Filesystem Size Used Avail Use% Mounted on
> > > > /dev/mapper/vg00-osd 252G 240G 13G 96% /var/lib/ceph/osd/ceph-2
> > > > 
> > > > 
> > > > by the moment mds is showing the following behaviour:
> > > > 2013-01-08 16:25:47.006354 b4a73b70 0 mds.0.objecter FULL, paused
> > > > modify 0x9ba63c0 tid 23448
> > > > 2013-01-08 16:26:47.005211 b4a73b70 0 mds.0.objecter FULL, paused
> > > > modify 0xca86c30 tid 23449
> > > > 
> > > > so, it does not respond to any mount requests
> > > > 
> > > > I've played around with all types of commands like:
> > > > ceph mon tell \* injectargs '--mon-osd-full-ratio 98'
> > > > ceph mon tell \* injectargs '--mon-osd-full-ratio 0.98'
> > > > 
> > > > and
> > > > 
> > > > 'mon osd full ratio = 0.98' in mon configuration for each mon
> > > > 
> > > > however
> > > > 
> > > > chef@ceph-node03:/var/log/ceph$ ceph health detail
> > > > HEALTH_ERR 1 full osd(s)
> > > > osd.2 is full at 95%
> > > > 
> > > > mds still believes 95% is the threshold, so no responses to mount 
> > > > requests.
> > > > 
> > > > chef@ceph-node03:/var/log/ceph$ rados -p data bench 10 wr

Re: Is Ceph recovery able to handle massive crash

2013-01-08 Thread Gregory Farnum
On Tue, Jan 8, 2013 at 11:44 AM, Denis Fondras  wrote:
> Hello,
>
>
>> What error message do you get when you try and turn it on? If the
>> daemon is crashing, what is the backtrace?
>
>
> The daemon is crashing. Here is the full log if you want to take a look :
> http://vps.ledeuns.net/ceph-osd.0.log.gz
>
> The RBD rebuild script helped to get the data back. I will now try to
> rebuild a Ceph cluster and do some more tests.
>
> Denis

It looks like it's taking approximately forever for writes to complete
to disk; it's shutting down because threads are going off to write and
not coming back. If you set "osd op thread timeout = 60" (or 120) it
might manage to churn through, but I'd look into why the writes are
taking so long — bad disk, fragmented btrfs filesystem, or something
else.
-Greg
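If you do want to try the timeout bump while investigating, it goes in the [osd] section of ceph.conf (120 seconds shown, matching the suggestion above):

    [osd]
            osd op thread timeout = 120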
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Adjusting replicas on argonaut

2013-01-08 Thread Gregory Farnum
Yep! The "step chooseleaf firstn 0 type host" means "choose n nodes of
type host, and select a leaf under each one of them", where n is the
pool size. You only have two hosts so it can't do more than 2 with
that rule type.
You could do "step chooseleaf firstn 0 type device", but that won't
guarantee a segregation across hosts, unfortunately. CRUSH isn't great
at dealing with situations where you want your number of copies to be
equal to or greater than your total failure domain counts. You can
make it work if you're willing to hardcode some stuff but it's not
real pleasant.
-Greg
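Concretely, that variant changes only the chooseleaf step of the metadata rule quoted below, e.g. something like this (use whatever your map calls its leaf type, "device" or "osd", and note the caveat above that nothing then forces the copies onto different hosts):

    rule metadata {
            ruleset 1
            type replicated
            min_size 2
            max_size 10
            step take default
            step chooseleaf firstn 0 type device
            step emit
    }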

On Tue, Jan 8, 2013 at 2:28 PM, Bryan Stillwell
 wrote:
> That would make sense.  Here's what the metadata rule looks like:
>
> rule metadata {
> ruleset 1
> type replicated
> min_size 2
> max_size 10
> step take default
> step chooseleaf firstn 0 type host
> step emit
> }
>
> On Tue, Jan 8, 2013 at 3:23 PM, Gregory Farnum  wrote:
>> What are your CRUSH rules? Depending on how you set this cluster up,
>> it might not be placing more than one replica in a single host, and
>> you've only got two hosts so it couldn't satisfy your request for 3
>> copies.
>> -Greg
>>
>> On Tue, Jan 8, 2013 at 2:11 PM, Bryan Stillwell
>>  wrote:
>>> I tried increasing the number of metadata replicas from 2 to 3 on my
>>> test cluster with the following command:
>>>
>>> ceph osd pool set metadata size 3
>>>
>>>
>>> Afterwards it appears that all the metadata placement groups switch to
>>> a degraded state and doesn't seem to be attempting to recover:
>>>
>>> 2013-01-08 14:49:37.352735 mon.0 [INF] pgmap v156393: 1920 pgs: 1280
>>> active+clean, 640 active+degraded; 903 GB data, 1820 GB used, 2829 GB
>>> / 4650 GB avail; 1255/486359 degraded (0.258%)
>>>
>>>
>>> Does anything need to be done after increasing the number of replicas?
>>>
>>> Here's what the OSD tree looks like:
>>>
>>> root@a1:~# ceph osd tree
>>> dumped osdmap tree epoch 1303
>>> # id   weight  type name   up/down reweight
>>> -1  4.99557 pool default
>>> -3  4.99557 rack unknownrack
>>> -2  2.49779 host b1
>>> 0   0.499557osd.0   up  1
>>> 1   0.499557osd.1   up  1
>>> 2   0.499557osd.2   up  1
>>> 3   0.499557osd.3   up  1
>>> 4   0.499557osd.4   up  1
>>> -4  2.49779 host b2
>>> 5   0.499557osd.5   up  1
>>> 6   0.499557osd.6   up  1
>>> 7   0.499557osd.7   up  1
>>> 8   0.499557osd.8   up  1
>>> 9   0.499557osd.9   up  1
>>>
>>>
>>> Thanks,
>>> Bryan
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
>
>
> Bryan Stillwell
> SYSTEM ADMINISTRATOR
>
> E: bstillw...@photobucket.com
> O: 303.228.5109
> M: 970.310.6085
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Adjusting replicas on argonaut

2013-01-08 Thread Gregory Farnum
What are your CRUSH rules? Depending on how you set this cluster up,
it might not be placing more than one replica in a single host, and
you've only got two hosts so it couldn't satisfy your request for 3
copies.
-Greg

On Tue, Jan 8, 2013 at 2:11 PM, Bryan Stillwell
 wrote:
> I tried increasing the number of metadata replicas from 2 to 3 on my
> test cluster with the following command:
>
> ceph osd pool set metadata size 3
>
>
> Afterwards it appears that all the metadata placement groups switch to
> a degraded state and doesn't seem to be attempting to recover:
>
> 2013-01-08 14:49:37.352735 mon.0 [INF] pgmap v156393: 1920 pgs: 1280
> active+clean, 640 active+degraded; 903 GB data, 1820 GB used, 2829 GB
> / 4650 GB avail; 1255/486359 degraded (0.258%)
>
>
> Does anything need to be done after increasing the number of replicas?
>
> Here's what the OSD tree looks like:
>
> root@a1:~# ceph osd tree
> dumped osdmap tree epoch 1303
> # id   weight  type name   up/down reweight
> -1  4.99557 pool default
> -3  4.99557 rack unknownrack
> -2  2.49779 host b1
> 0   0.499557osd.0   up  1
> 1   0.499557osd.1   up  1
> 2   0.499557osd.2   up  1
> 3   0.499557osd.3   up  1
> 4   0.499557osd.4   up  1
> -4  2.49779 host b2
> 5   0.499557osd.5   up  1
> 6   0.499557osd.6   up  1
> 7   0.499557osd.7   up  1
> 8   0.499557osd.8   up  1
> 9   0.499557osd.9   up  1
>
>
> Thanks,
> Bryan
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Rados gateway init timeout with cache

2013-01-08 Thread Gregory Farnum
On Tue, Jan 8, 2013 at 1:11 PM, Yann ROBIN  wrote:
> We lost data in notify and gc. What bothers me is that the rados gateway can
> start if we deactivate the cache.
> I think the availability of the cache objects shouldn't take down the rados
> gateway. The option should be more of an "I want the cache if available".

But what if some of the instances can access the objects and others
can't? Then you've got daemons caching data and the others aren't
notifying them. This pretty much needs to be a manual switch, as far
as I can imagine it working. Unless somebody else has ideas on
improving it?
-Greg


>
> -Original Message-
> From: Gregory Farnum [mailto:g...@inktank.com]
> Sent: Tuesday, January 8, 2013 18:03
> To: Yann ROBIN
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: Rados gateway init timeout with cache
>
> To clarify, you lost the data on half of your OSDs? And it sounds like they 
> weren't in separate CRUSH failure domains?
>
> Given that, yep, you've lost some data. :(
>
> On Tue, Jan 8, 2013 at 5:41 AM, Yann ROBIN  wrote:
>> Notify and gc objects were unfound; we marked them as lost and now the
>> rados gateway starts.
>> But this means that if some notify objects are not fully available, the
>> radosgateway stops responding.
>
> Yes, that's the case. I'm not sure there's a way around it that makes much 
> sense and satisfies the necessary guarantees, though.
> -Greg
>
>> -Original Message-
>> From: ceph-devel-ow...@vger.kernel.org
>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Yann ROBIN
>> Sent: mardi 8 janvier 2013 12:13
>> To: ceph-devel@vger.kernel.org
>> Subject: Rados gateway init timeout with cache
>>
>> Hi,
>>
>> We recently experienced an issue with the backplane of our server, resulting
>> in losing half of our osds.
>> During that period the rados gateway failed initializing (timeout).
>> We found that the gateway was hanging in the init_watch function.
>>
>> We recreate our OSDs and we still have this issue, but pg are not all in an 
>> active+clean state :
>>health HEALTH_WARN 1 pgs degraded; 1 pgs recovering; 2 pgs recovery_wait; 
>> 3 pgs stuck unclean; recovery 7/10140464 degraded (0.000%); 3/5070232 
>> unfound (0.000%); noout flag(s) set
>>monmap e2: 3 mons at 
>> {ceph-mon-1=172.20.1.13:6789/0,ceph-mon-2=172.20.2.13:6789/0,ceph-mon-3=172.17.9.20:6789/0},
>>  election epoch 256, quorum 0,1,2 ceph-mon-1,ceph-mon-2,ceph-mon-3
>>osdmap e4439: 6 osds: 6 up, 6 in
>> pgmap v2531184: 11024 pgs: 11019 active+clean, 2 active+recovery_wait, 1 
>> active+recovering+degraded+remapped, 2 active+clean+scrubbing+deep; 1291 GB 
>> data, 2612 GB used, 19645 GB / 22257 GB avail; 7/10140464 degraded (0.000%); 
>> 3/5070232 unfound (0.000%)
>>mdsmap e1: 0/0/1 up
>>
>> Should we open an ticket for this init issue with rados gateway ?
>> Version is 0.56.1 upgraded from 0.55.
>>
>> --
>> Yann ROBIN
>> YouScribe
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majord...@vger.kernel.org More majordomo
>> info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majord...@vger.kernel.org More majordomo
>> info at  http://vger.kernel.org/majordomo-info.html
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: "hit suicide timeout" message after upgrade to 0.56

2013-01-08 Thread Gregory Farnum
I'm confused. Isn't the HeartbeatMap all about local thread
heartbeating (so, not pings with other OSDs)? I would assume the
upgrade and restart just caused a bunch of work and the CPUs got
overloaded.
-Greg

On Thu, Jan 3, 2013 at 8:52 AM, Sage Weil  wrote:
> Hi Wido,
>
> On Thu, 3 Jan 2013, Wido den Hollander wrote:
>> Hi,
>>
>> I updated my 10 node 40 OSD cluster from 0.48 to 0.56 yesterday evening and
>> found out this morning that I had 23 OSDs still up and in.
>>
>> Investigating some logs I found these messages:
>
> This sounds quite a bit like #3714.  You might give wip-3714 a try...
>
> sage
>
>
>>
>> *
>> -8> 2013-01-02 21:13:40.528936 7f9eb177a700  1 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
>> -7> 2013-01-02 21:13:40.528985 7f9eb177a700  1 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f9ea375e700' had timed out after 30
>> -6> 2013-01-02 21:13:41.311088 7f9eaff77700 10 monclient:
>> _send_mon_message to mon.pri at [2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
>> -5> 2013-01-02 21:13:45.047220 7f9e92282700  0 --
>> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:0/2882 >>
>> [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6805/2373 pipe(0x9d7ad80 sd=135 :0
>> pgs=0 cs=0 l=1).fault
>> -4> 2013-01-02 21:13:45.049225 7f9e962c2700  0 --
>> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6801/2882 >>
>> [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6804/2373 pipe(0x9d99000 sd=104 :44363
>> pgs=99 cs=1 l=0).fault with nothing to send, going to standby
>> -3> 2013-01-02 21:13:45.529075 7f9eb177a700  1 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
>> -2> 2013-01-02 21:13:45.529115 7f9eb177a700  1 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f9ea2f5d700' had suicide timed out after 300
>> -1> 2013-01-02 21:13:45.531952 7f9eb177a700 -1 common/HeartbeatMap.cc: In
>> function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const
>> char*, time_t)' thread 7f9eb177a700 time 2013-01-02 21:13:45.529176
>> common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
>>
>>  ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
>>  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
>> long)+0x107) [0x796877]
>>  2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x797207]
>>  3: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x797453]
>>  4: (CephContextServiceThread::entry()+0x55) [0x8338d5]
>>  5: (()+0x7e9a) [0x7f9eb4571e9a]
>>  6: (clone()+0x6d) [0x7f9eb2ff5cbd]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
>> interpret this.
>>
>>  0> 2013-01-02 21:13:46.314478 7f9eaff77700 10 monclient:
>> _send_mon_message to mon.pri at [2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
>> *
>>
>> Reading these messages I'm trying to figure out why those messages came 
>> along.
>>
>> Am I understanding this correctly that the heartbeat updates didn't come 
>> along
>> in time and the OSDs committed suicide?
>>
>> I read the code in common/HeartbeatMap.cc and it seems like that.
>>
>> During the restart of the cluster the Atom CPUs were very busy, so could it be
>> that the CPUs were just too busy and the OSDs weren't responding to heartbeats
>> in time?
>>
>> In total 16 of the 17 crashed OSDs are down with these log messages.
>>
>> I'm now starting the 16 crashed OSDs one by one and that seems to go just
>> fine.
>>
>> I've set "osd recovery max active = 1" to prevent overloading the CPUs too much
>> since I know Atoms are not that powerful. I'm just still trying to get it all
>> working on them :)
>>
>> Am I right this is probably a lack of CPU power during the heavy recovery
>> which causes them to not respond to heartbeat updates in time?
>>
>> Wido
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: what could go wrong with two clusters on the same network?

2013-01-08 Thread Gregory Farnum
On Mon, Dec 31, 2012 at 10:27 AM, Wido den Hollander  wrote:
> Just make sure you use cephx (enabled by default in 0.55) so that you don't
> accidentally connect to the wrong cluster.

Use of cephx will provide an additional layer of protection for the
clients, but the OSDs and monitors (the only ones with disk state) do
record which cluster they belong to on first start-up, so if they end
up pointed to the wrong cluster they will refuse to connect. Just fyi!
:)
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: recoverying from 95% full osd

2013-01-08 Thread Gregory Farnum
On Tue, Jan 8, 2013 at 2:42 AM, Roman Hlynovskiy
 wrote:
> Hello,
>
> I am running ceph v0.56 and at the moment trying to recover ceph which
> got completely stuck after 1 osd got filled by 95%. Looks like the
> distribution algorithm is not perfect since all 3 OSD's I use are
> 256Gb each, however one of them got filled faster than others:
>
> osd-1:
> FilesystemSize  Used Avail Use% Mounted on
> /dev/mapper/vg00-osd  252G  173G   80G  69% /var/lib/ceph/osd/ceph-0
>
> osd-2:
> FilesystemSize  Used Avail Use% Mounted on
> /dev/mapper/vg00-osd  252G  203G   50G  81% /var/lib/ceph/osd/ceph-1
>
> osd-3:
> FilesystemSize  Used Avail Use% Mounted on
> /dev/mapper/vg00-osd  252G  240G   13G  96% /var/lib/ceph/osd/ceph-2
>
>
> by the moment mds is showing the following behaviour:
> 2013-01-08 16:25:47.006354 b4a73b70  0 mds.0.objecter  FULL, paused
> modify 0x9ba63c0 tid 23448
> 2013-01-08 16:26:47.005211 b4a73b70  0 mds.0.objecter  FULL, paused
> modify 0xca86c30 tid 23449
>
> so, it does not respond to any mount requests
>
> I've played around with all types of commands like:
> ceph mon tell \* injectargs '--mon-osd-full-ratio 98'
> ceph mon tell \* injectargs '--mon-osd-full-ratio 0.98'
>
> and
>
> 'mon osd full ratio = 0.98' in mon configuration for each mon
>
> however
>
> chef@ceph-node03:/var/log/ceph$ ceph health detail
> HEALTH_ERR 1 full osd(s)
> osd.2 is full at 95%
>
> mds still believes 95% is the threshold, so no responses to mount requests.
>
> chef@ceph-node03:/var/log/ceph$ rados -p data bench 10 write
>  Maintaining 16 concurrent writes of 4194304 bytes for at least 10 seconds.
>  Object prefix: benchmark_data_ceph-node03_3903
> 2013-01-08 16:33:02.363206 b6be3710  0 client.9958.objecter  FULL,
> paused modify 0xa467ff0 tid 1
> 2013-01-08 16:33:02.363618 b6be3710  0 client.9958.objecter  FULL,
> paused modify 0xa468780 tid 2
> 2013-01-08 16:33:02.363741 b6be3710  0 client.9958.objecter  FULL,
> paused modify 0xa468f88 tid 3
> 2013-01-08 16:33:02.364056 b6be3710  0 client.9958.objecter  FULL,
> paused modify 0xa469348 tid 4
> 2013-01-08 16:33:02.364171 b6be3710  0 client.9958.objecter  FULL,
> paused modify 0xa469708 tid 5
> 2013-01-08 16:33:02.365024 b6be3710  0 client.9958.objecter  FULL,
> paused modify 0xa469ac8 tid 6
> 2013-01-08 16:33:02.365187 b6be3710  0 client.9958.objecter  FULL,
> paused modify 0xa46a2d0 tid 7
> 2013-01-08 16:33:02.365296 b6be3710  0 client.9958.objecter  FULL,
> paused modify 0xa46a690 tid 8
> 2013-01-08 16:33:02.365402 b6be3710  0 client.9958.objecter  FULL,
> paused modify 0xa46aa50 tid 9
> 2013-01-08 16:33:02.365508 b6be3710  0 client.9958.objecter  FULL,
> paused modify 0xa46ae10 tid 10
> 2013-01-08 16:33:02.365635 b6be3710  0 client.9958.objecter  FULL,
> paused modify 0xa46b1d0 tid 11
> 2013-01-08 16:33:02.365742 b6be3710  0 client.9958.objecter  FULL,
> paused modify 0xa46b590 tid 12
> 2013-01-08 16:33:02.365868 b6be3710  0 client.9958.objecter  FULL,
> paused modify 0xa46b950 tid 13
> 2013-01-08 16:33:02.365975 b6be3710  0 client.9958.objecter  FULL,
> paused modify 0xa46bd10 tid 14
> 2013-01-08 16:33:02.366096 b6be3710  0 client.9958.objecter  FULL,
> paused modify 0xa46c0d0 tid 15
> 2013-01-08 16:33:02.366203 b6be3710  0 client.9958.objecter  FULL,
> paused modify 0xa46c490 tid 16
>sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>    0      16        16         0         0         0         -         0
>    1      16        16         0         0         0         -         0
>    2      16        16         0         0         0         -         0
>
> rados doesn't work.
>
> chef@ceph-node03:/var/log/ceph$ ceph osd reweight-by-utilization
> no change: average_util: 0.812678, overload_util: 0.975214. overloaded
> osds: (none)
>
> this one also.
>
>
> is there any chance to recover ceph?

"ceph pg set_full_ratio 0.98"

However, as Mark mentioned, you want to figure out why one OSD is so
much fuller than the others first. Even in a small cluster I don't
think you should be able to see that kind of variance. Simply setting
the full ratio to 98% and then continuing to run could cause bigger
problems if that OSD continues to get a disproportionate share of the
writes and fills up its disk.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD Crashed when runing "rbd list"

2013-01-08 Thread Gregory Farnum
On Tue, Jan 8, 2013 at 7:51 AM, Chen, Xiaoxi  wrote:
> Hi List,
>   Every time I run "rbd list" after creating a lot of rbd volumes (more
> than 100s), certain OSDs die: osd.65 dies first and then osd.35
> (osd.65 is the fifth disk on the sixth host).
>   Is it a bug in 0.55? My ceph version is 0.55-1 with a 3.7 kernel.
> I would like to upgrade to 0.56-1 but there is no package for the 3.7
> kernel (raring).
>
>Log of osd.35 attached. Key messages are below:
>
> 1 -- 192.101.11.203:6843/19960 mark_down 192.101.11.206:6861/3735 -- 
> 0x7f331867a000
>-38> 2013-01-08 23:37:37.751473 7f3302fc0700 -1 ./messages/MOSDOp.h: In 
> function 'bool MOSDOp::check_rmw(int)' thread 7f3302fc0700 time 2013-01-08 
> 23:37:37.748254
> ./messages/MOSDOp.h: 57: FAILED assert(rmw_flags)
>
>  ceph version 0.55.1 (8e25c8d984f9258644389a18997ec6bdef8e056b)
>  1: (()+0x22f765) [0x7f3310831765]
>  2: (MOSDOpReply::claim_op_out_data(std::vector 
> >&)+0) [0x7f3310897850]
>  3: (OSD::handle_op(std::tr1::shared_ptr)+0x441) [0x7f33108f19c1]
>  4: (OSD::dispatch_op(std::tr1::shared_ptr)+0x83) [0x7f33108fd8c3]
>  5: (OSD::do_waiters()+0x104) [0x7f33108fdc64]
>  6: (OSD::ms_dispatch(Message*)+0x317) [0x7f33109027e7]
>  7: (DispatchQueue::entry()+0x353) [0x7f3310b6b743]
>  8: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f3310ac7dad]
>  9: (()+0x7f9f) [0x7f330ffc5f9f]
>  10: (clone()+0x6d) [0x7f330e2800cd]
>
>Thanks for the help.

Sounds like you've got a v0.56 binary talking to v0.55 daemons. An
upgrade to v0.56.1 should fix it. See
http://tracker.newdream.net/issues/3715
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is Ceph recovery able to handle massive crash

2013-01-08 Thread Gregory Farnum
On Tue, Jan 8, 2013 at 12:44 AM, Denis Fondras  wrote:
>> What's wrong with your primary OSD?
>
>
> I don't know what's really wrong. The disk seems fine.

What error message do you get when you try and turn it on? If the
daemon is crashing, what is the backtrace?
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Windows port

2013-01-08 Thread Gregory Farnum
On Mon, Jan 7, 2013 at 9:36 PM, Cesar Mello  wrote:
> Hi,
>
> I have been playing with ceph and reading the docs/thesis the last
> couple of nights just to learn something during my vacation. I was not
> expecting to find such an awesome and state of the art project.
> Congratulations for the great work!
>
> Please I would like to know if a Windows port is imagined for the
> future or if that is a dead-end. By Windows port I mean an abstraction
> layer for hardware/sockets/threading/etc and building with Visual C++
> 2012 Express. And then have this state of the art object storage
> cluster running on Windows nodes too.

This is not super-likely (although it's not impossible either).
Inktank is a long way from doing the necessary development for this —
we don't have any Windows developers on staff. External contributors
who are interested in doing the work themselves would certainly get
some support in doing so, but I can't even begin to estimate the size
of the project that would be required. Would a simple abstraction
layer be able to provide the right interface with anything approaching
acceptable performance? I really don't know.


On Tue, Jan 8, 2013 at 6:00 AM, Dino Yancey  wrote:
> Hi,
>
> I am also curious if a Windows port, specifically the client-side, is
> on the roadmap.

This is somewhat more likely than porting the servers, but again we at
Inktank don't currently have the expertise necessary. It'd be a lot
easier if there were a FUSE for Windows; that's how we'll be getting
an OS X client.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Rados gateway init timeout with cache

2013-01-08 Thread Gregory Farnum
To clarify, you lost the data on half of your OSDs? And it sounds like
they weren't in separate CRUSH failure domains?

Given that, yep, you've lost some data. :(

On Tue, Jan 8, 2013 at 5:41 AM, Yann ROBIN  wrote:
> Notify and gc objects were unfound; we marked them as lost and now the rados
> gateway starts.
> But this means that if some notify objects are not fully available, the
> radosgateway stops responding.

Yes, that's the case. I'm not sure there's a way around it that makes
much sense and satisfies the necessary guarantees, though.
-Greg

> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Yann ROBIN
> Sent: mardi 8 janvier 2013 12:13
> To: ceph-devel@vger.kernel.org
> Subject: Rados gateway init timeout with cache
>
> Hi,
>
> We recently experienced an issue with the backplane of our server, resulting
> in losing half of our osds.
> During that period the rados gateway failed initializing (timeout).
> We found that the gateway was hanging in the init_watch function.
>
> We recreate our OSDs and we still have this issue, but pg are not all in an 
> active+clean state :
>health HEALTH_WARN 1 pgs degraded; 1 pgs recovering; 2 pgs recovery_wait; 
> 3 pgs stuck unclean; recovery 7/10140464 degraded (0.000%); 3/5070232 unfound 
> (0.000%); noout flag(s) set
>monmap e2: 3 mons at 
> {ceph-mon-1=172.20.1.13:6789/0,ceph-mon-2=172.20.2.13:6789/0,ceph-mon-3=172.17.9.20:6789/0},
>  election epoch 256, quorum 0,1,2 ceph-mon-1,ceph-mon-2,ceph-mon-3
>osdmap e4439: 6 osds: 6 up, 6 in
> pgmap v2531184: 11024 pgs: 11019 active+clean, 2 active+recovery_wait, 1 
> active+recovering+degraded+remapped, 2 active+clean+scrubbing+deep; 1291 GB 
> data, 2612 GB used, 19645 GB / 22257 GB avail; 7/10140464 degraded (0.000%); 
> 3/5070232 unfound (0.000%)
>mdsmap e1: 0/0/1 up
>
> Should we open an ticket for this init issue with rados gateway ?
> Version is 0.56.1 upgraded from 0.55.
>
> --
> Yann ROBIN
> YouScribe
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is Ceph recovery able to handle massive crash

2013-01-07 Thread Gregory Farnum
On Monday, January 7, 2013 at 9:25 AM, Denis Fondras wrote:
> Hello all,
> 
> > I'm using Ceph 0.55.1 on a Debian Wheezy (1 mon, 1 mds et 3 osd over
> > btrfs) and every once in a while, an OSD process crashes (almost never
> > the same osd crashes).
> > This time I had 2 osd crash in a row and so I only had one replicate. I
> > could bring the 2 crashed osd up and it started to recover.
> > Unfortunately, the "source" osd crashed while recovering and now I have
> > a some lost PGs.
> > 
> > If I happen to bring the primary OSD up again, can I imagine the lost PG
> > will be recovered too ?
> 
> 
> 
> Ok, so it seems I can't bring back to life my primary OSD :-(
> 
> ---8<---
> health HEALTH_WARN 72 pgs incomplete; 72 pgs stuck inactive; 72 pgs 
> stuck unclean
> monmap e1: 1 mons at {a=192.168.0.132:6789/0}, election epoch 1, quorum 0 a
> osdmap e1130: 3 osds: 2 up, 2 in
> pgmap v1567492: 624 pgs: 552 active+clean, 72 incomplete; 1633 GB 
> data, 4766 GB used, 3297 GB / 8383 GB avail
> mdsmap e127: 1/1/1 up {0=a=up:active}
> 
> 2013-01-07 18:11:10.852673 mon.0 [INF] pgmap v1567492: 624 pgs: 552 
> active+clean, 72 incomplete; 1633 GB data, 4766 GB used, 3297 GB / 8383 
> GB avail
> ---8<---
> 
> When I "rbd list", I can see all my images.
> When I do "rbd map", I can map only a few of them, and when I mount the 
> devices, none of them mounts (the mount process hangs and I cannot even ^C 
> the process).
> 
> Is there something I can try ?

What's wrong with your primary OSD? In general they shouldn't really be 
crashing that frequently and if you've got a new bug we'd like to diagnose and 
fix it.

If that can't be done (or it's a hardware failure or something), you can mark 
the OSD lost, but that might lose data and then you will be sad.
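If it does come to that, the command is along these lines (the id is an
example, and the scary flag is there for a reason):

$ ceph osd lost 2 --yes-i-really-mean-it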
-Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

2013-01-07 Thread Gregory Farnum
On Monday, January 7, 2013 at 1:00 PM, Isaac Otsiabah wrote:
> 
> 
> When I add a new host (with OSDs) to my existing cluster, 1 or 2 of the existing 
> OSDs go down for about 2 minutes and then come back up. 
> 
> 
> [root@h1ct ~]# ceph osd tree
> 
> # id weight type name up/down reweight
> -1 3 root default
> -3 3 rack unknownrack
> -2 3 host h1
> 0 1 osd.0 up 1
> 1 1 osd.1 up 1
> 2 1 osd.2 up 1
> 
> 
> For example, after adding host h2 (with 3 new OSDs) to the above cluster and 
> running the "ceph osd tree" command, I see this: 
> 
> 
> [root@h1 ~]# ceph osd tree
> 
> # id weight type name up/down reweight
> -1 6 root default
> -3 6 rack unknownrack
> -2 3 host h1
> 0 1 osd.0 up 1
> 1 1 osd.1 down 1
> 2 1 osd.2 up 1
> -4 3 host h2
> 3 1 osd.3 up 1
> 4 1 osd.4 up 1
> 5 1 osd.5 up 1
> 
> 
> The down OSDs always come back up after 2 minutes or less, and I see the 
> following error messages in the respective OSD log file: 
> 2013-01-07 04:40:17.613028 7fec7f092760 1 journal _open /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
> 2013-01-07 04:40:17.613122 7fec7f092760 1 journal _open /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
> 2013-01-07 04:42:10.006533 7fec746f7710 0 -- 192.168.0.124:6808/19449 >> 192.168.1.123:6800/18287 pipe(0x7fec2e10 sd=31 :6808 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state connecting
> 2013-01-07 04:45:29.834341 7fec743f4710 0 -- 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45438 pgs=7 cs=1 l=0).fault, initiating reconnect
> 2013-01-07 04:45:29.835748 7fec743f4710 0 -- 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45439 pgs=15 cs=3 l=0).fault, initiating reconnect
> 2013-01-07 04:45:30.835219 7fec743f4710 0 -- 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45894 pgs=482 cs=903 l=0).fault, initiating reconnect
> 2013-01-07 04:45:30.837318 7fec743f4710 0 -- 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45895 pgs=483 cs=905 l=0).fault, initiating reconnect
> 2013-01-07 04:45:30.851984 7fec637fe710 0 log [ERR] : map e27 had wrong cluster addr (192.168.0.124:6808/19449 != my 192.168.1.124:6808/19449)
> 
> Also, this happens only when the cluster IP address and the public IP 
> address are different, for example:
> 
> 
> 
> [osd.0]
> host = g8ct
> public address = 192.168.0.124
> cluster address = 192.168.1.124
> btrfs devs = /dev/sdb
> 
> 
> 
> 
> but does not happen when they are the same. Any idea what may be the issue?
> 
This isn't familiar to me at first glance. What version of Ceph are you using?

If this is easy to reproduce, can you pastebin your ceph.conf and then add 
"debug ms = 1" to your global config and gather up the logs from each daemon?
-Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is Ceph recovery able to handle massive crash

2013-01-05 Thread Gregory Farnum
On Saturday, January 5, 2013 at 4:19 AM, Denis Fondras wrote:
> Hello all,
> 
> I'm using Ceph 0.55.1 on a Debian Wheezy (1 mon, 1 mds and 3 OSDs over 
> btrfs) and every once in a while, an OSD process crashes (almost never 
> the same OSD).
> This time I had 2 OSDs crash in a row and so I only had one replica left. I 
> could bring the 2 crashed OSDs up and they started to recover. 
> Unfortunately, the "source" OSD crashed while recovering and now I have 
> some lost PGs.
> 
> If I happen to bring the primary OSD up again, can I imagine the lost PG 
> will be recovered too ?


Yes, it will recover just fine. Ceph is strictly consistent and so you won't 
lose any data unless you lose the disks.
-Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Any idea about doing deduplication in ceph?

2013-01-04 Thread Gregory Farnum
On Wed, Dec 26, 2012 at 6:16 PM, lollipop  wrote:
> These days I am wondering about doing offline deduplication in Ceph.
> My idea is:
> First, in the ceph-client, I try to get the locations of the chunks in a file.
> The information includes
> how many chunks the file has and on which OSD each chunk (object group) has been
> stored.
> Then the ceph-client communicates with that OSD and asks it to
> return the chunk hash.
> After that, we compare the returned hash with the already stored hash table;
> if the chunk is a duplicate, we change the file metadata.
> Can it work?
> Can you give some ideas? Thank you


Any off-line deduplication support in Ceph is going to have the
important parts be in the OSD code, not in the clients. :) We've
discussed dedup a little bit internally and on the mailing list; you
can do an archive search if you're interested in what the current
thoughts are. :)
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph stability

2013-01-04 Thread Gregory Farnum
On Fri, Dec 21, 2012 at 2:07 AM, Amon Ott  wrote:
> Am 20.12.2012 15:31, schrieb Mark Nelson:
>> On 12/20/2012 01:08 AM, Roman Hlynovskiy wrote:
>>> Hello Mark,
>>>
>>> for multi-mds solutions do you refer to multi-active arch or 1 active
>>> and many standby arch?
>>
>> That's a good question!  I know we don't really recommend multi-active
>> right now for production use.  Not sure what our current recommendations
>> are for multi-standby.  As far as I know it's considered to be more
>> stable.  I'm sure Greg or Sage can chime in with a more accurate
>> assessment.
>
> We have been testing a lot with multi-standby, because a single MDS does
> not make a lot of sense in a cluster. Maybe the clue is to have only one
> standby, making SPOF a DPOF?

The number of standbys should not have any impact on stability — they
are read-only until the active MDS gets declared down, at which point
one of them becomes active. I suppose having a standby could reduce
stability if your active MDS is overloaded enough to miss check-ins
without actually going down — without a standby it would eventually
recover, but with a standby a transition happens. That's the only
difference, though.
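For what it's worth, a one-active-plus-standbys setup doesn't need any special
configuration: just define and start several MDS daemons and leave the active
count at its default of 1 (you can pin it explicitly with "ceph mds
set_max_mds 1"). A minimal sketch, with made-up hostnames:

[mds.a]
        host = node1
[mds.b]
        host = node2
[mds.c]
        host = node3

Whichever daemon registers first becomes active; the others sit as standbys
until it fails.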
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: which Linux kernel version corresponds to 0.48argonaut?

2013-01-04 Thread Gregory Farnum
I think they might be different just as a consequence of being updated
less recently; that's where all of the lines whose origin I recognize
differ (not certain about the calc_parents stuff though). Sage can
confirm.
The specific issue you encountered previously was of course because
you changed the layout algorithm and the client needs to be able to
process that layout itself.
-Greg

On Thu, Dec 20, 2012 at 5:37 PM, Xing Lin  wrote:
> This may be useful for other Ceph newbies just like me.
>
> I have ported my changes to 0.48argonaut to related Ceph files included in
> Linux, though files with the same name are not exactly the same. Then I
> recompiled and installed the kernel. After that, everything seems to be
> working again now: Ceph is working with my new simple replica placement
> algorithm. :)
> So, it seems that Ceph files included in the Linux kernel are supposed to be
> different from those in 0.48argonaut. Presumably, the Linux kernel contains
> the client-side implementation while 0.48argonaut contains the server-side
> implementation. It would be appreciated if someone can confirm it. Thank
> you!
>
> Xing
>
>
> On 12/20/2012 11:54 AM, Xing Lin wrote:
>>
>> Hi,
>>
>> I was trying to add a simple replica placement algorithm in Ceph. This
>> algorithm simply returns r_th item in a bucket for the r_th replica. I have
>> made that change in Ceph source code (including files such as crush.h,
>> crush.c, mapper.c, ...) and I can run Ceph monitor and osd daemons. However,
>> I am not able to map rbd block devices at client machines. 'rbd map image0'
>> reported "input/output error" and 'dmesg' at the client machine showed
>> message like "libceph: handle_map corrupt msg". I believe that is because I
>> have not ported my changes to Ceph client side programs and it does not
>> recognize the new placement algorithm. I probably need to recompile the rbd
>> block device driver. When I was trying to replace Ceph related files in
>> Linux with my own version, I noticed that files in Linux-3.2.16 are
>> different from these included in Ceph source code. For example, the
>> following is the diff of crush.h in Linux-3.2.16 and 0.48argonaut. So, my
>> question is that is there any version of Linux that contains the exact Ceph
>> files as included in 0.48argonaut? Thanks.
>>
>> ---
>>  $ diff -uNrp ceph-0.48argonaut/src/crush/crush.h
>> linux-3.2.16/include/linux/crush/crush.h
>> --- ceph-0.48argonaut/src/crush/crush.h2012-06-26 11:56:36.0
>> -0600
>> +++ linux-3.2.16/include/linux/crush/crush.h2012-04-22
>> 16:31:32.0 -0600
>> @@ -1,12 +1,7 @@
>>  #ifndef CEPH_CRUSH_CRUSH_H
>>  #define CEPH_CRUSH_CRUSH_H
>>
>> -#if defined(__linux__)
>>  #include 
>> -#elif defined(__FreeBSD__)
>> -#include 
>> -#include "include/inttypes.h"
>> -#endif
>>
>>  /*
>>   * CRUSH is a pseudo-random data distribution algorithm that
>> @@ -156,24 +151,25 @@ struct crush_map {
>>  struct crush_bucket **buckets;
>>  struct crush_rule **rules;
>>
>> +/*
>> + * Parent pointers to identify the parent bucket a device or
>> + * bucket in the hierarchy.  If an item appears more than
>> + * once, this is the _last_ time it appeared (where buckets
>> + * are processed in bucket id order, from -1 on down to
>> + * -max_buckets.
>> + */
>> +__u32 *bucket_parents;
>> +__u32 *device_parents;
>> +
>>  __s32 max_buckets;
>>  __u32 max_rules;
>>  __s32 max_devices;
>> -
>> -/* choose local retries before re-descent */
>> -__u32 choose_local_tries;
>> -/* choose local attempts using a fallback permutation before
>> - * re-descent */
>> -__u32 choose_local_fallback_tries;
>> -/* choose attempts before giving up */
>> -__u32 choose_total_tries;
>> -
>> -__u32 *choose_tries;
>>  };
>>
>>
>>  /* crush.c */
>> -extern int crush_get_bucket_item_weight(const struct crush_bucket *b, int
>> pos);
>> +extern int crush_get_bucket_item_weight(struct crush_bucket *b, int pos);
>> +extern void crush_calc_parents(struct crush_map *map);
>>  extern void crush_destroy_bucket_uniform(struct crush_bucket_uniform *b);
>>  extern void crush_destroy_bucket_list(struct crush_bucket_list *b);
>>  extern void crush_destroy_bucket_tree(struct crush_bucket_tree *b);
>> @@ -181,9 +177,4 @@ extern void crush_destroy_bucket_straw(s
>>  extern void crush_destroy_bucket(struct crush_bucket *b);
>>  extern void crush_destroy(struct crush_map *map);
>>
>> -static inline int crush_calc_tree_node(int i)
>> -{
>> -return ((i+1) << 1)-1;
>> -}
>> -
>>  #endif
>>
>> 
>> Xing
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue

2013-01-04 Thread Gregory Farnum
Sorry for the delay; I've been out on vacation...

On Fri, Dec 14, 2012 at 6:09 AM, Lachfeld, Jutta
 wrote:
> I do not have the full output of "ceph pg dump" for that specific TeraSort 
> run, but here is a typical output after automatically preparing CEPH for a 
> benchmark run
>  (removed almost all lines in the long pg_stat table hoping that you do not 
> need them):

Actually those were exactly what I was after; they include output on
the total PG size and the number of objects so we can check on average
size. :) If you'd like to do it yourself, look at some of the PGs
which correspond to your data pool (the PG ids are all of the form
0.123a, and the number before the decimal point is the pool ID; by
default you'll be looking for 0).
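If you want a quick average, something like the following works; the column
numbers are from memory (I'm assuming objects is the second column and bytes
the sixth), so check them against the header line of your pg dump output first:

$ ceph pg dump | grep '^0\.' | awk '{o += $2; b += $6} END {if (o) print b/o/1048576, "MB per object"}'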


On Fri, Dec 14, 2012 at 6:53 AM, Mark Nelson  wrote:
> The large block size may be an issue (at least with some of our default
> tunable settings).  You might want to try 4 or 16MB and see if it's any
> better or worse.

Unless you've got a specific reason to think this is busted, I am
pretty confident it's not a problem. :)


Jutta, do you have any finer-grained numbers than total run time
(specifically, how much time is spent on data generation versus the
read-and-sort for each FS)? HDFS doesn't do any journaling like Ceph
does and the fact that the Ceph journal is in-memory might not be
helping much since it's so small compared to the amount of data being
written.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] fix build and packaging issues

2013-01-04 Thread Gregory Farnum
Thanks!

Gary, can you pull these into a branch and do some before-and-after
package comparisons on our systems (for the different distros in
gitbuilder) and then merge into master?
-Greg

On Fri, Jan 4, 2013 at 9:51 AM, Danny Al-Gaaf  wrote:
> This set of patches contains fixes for some build and packaging issues.
>
> Danny Al-Gaaf (6):
>   src/java/Makefile.am: fix default java dir
>   ceph.spec.in: fix handling of java files
>   ceph.spec.in: rename libcephfs-java package to cephfs-java
>   ceph.spec.in: fix libcephfs-jni package name
>   configure.ac: remove AC_PROG_RANLIB
>   configure.ac: change junit4 handling
>
>  ceph.spec.in | 21 +
>  configure.ac |  8 +---
>  src/java/Makefile.am |  6 +++---
>  3 files changed, 17 insertions(+), 18 deletions(-)
>
> --
> 1.8.0.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel rbd format=2

2013-01-03 Thread Gregory Farnum
Alex has been doing a lot of work to support this lately, but I don't
think he's sketched out the actual implementation timeline yet.
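In the meantime, format 2 images are usable through librbd (e.g. from QEMU);
they just can't be mapped with the kernel client yet. If I'm remembering the
flags right, creating and layering one looks roughly like:

$ rbd create mypool/parent --size 10240 --format 2
$ rbd snap create mypool/parent@base
$ rbd snap protect mypool/parent@base
$ rbd clone mypool/parent@base mypool/child

(Pool and image names above are just placeholders.)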
-Greg

On Mon, Dec 17, 2012 at 4:10 PM, Chris Dunlop  wrote:
> Hi,
>
> Format 2 images (and attendant layering support) are not yet
> supported by the kernel rbd client, according to:
>
> http://ceph.com/docs/master/rbd/rbd-snapshot/#layering
>
> When might this support be available?
>
> Cheers,
>
> Chris
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph and Software Defined Storage (SDS)

2012-12-15 Thread Gregory Farnum
http://en.wikipedia.org/wiki/Software_defined_storage

Basically, with Ceph you've got a bunch of storage boxes providing raw disk. 
They're aggregated together via the software daemons they run rather than by a 
RAID controller or some hardware box that re-exports them, and the 
administrator can change the "shape" of that storage via simple commands — 
adding new boxes, removing old ones, or rearranging things so that type foo data 
lives on those 20 machines and type bar lives on these other 10 machines even 
though it all used to live on all 30 machines, all with just a couple of 
commands.
It's storage, defined by software. ;)
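To make that concrete, the kind of commands I mean look roughly like the
following; the exact syntax varies a bit between releases, so treat this as a
sketch:

$ ceph osd crush set 30 osd.30 1.0 pool=default rack=unknownrack host=newbox   # pull a new box's disk into placement
$ ceph osd out 12                                  # start draining an old disk
$ ceph osd pool set foo crush_ruleset 3            # steer pool 'foo' onto whatever subset of hosts rule 3 covers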
-Greg


On Saturday, December 15, 2012 at 5:36 AM, Itamar Landsman wrote:

> Hi,
>  
> There is some talk going around about software defined storage.
> Nevertheless I could not find a concrete definition for it.
>  
> Can someone comment about where a definition can be found and how can
> ceph be a part of it?
>  
> Thanks
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org 
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue

2012-12-13 Thread Gregory Farnum
On Thu, Dec 13, 2012 at 12:23 PM, Cameron Bahar  wrote:
> Is the chunk size tunable in a Ceph cluster? I don't mean dynamically, but even 
> statically configurable when a cluster is first installed?

Yeah. You can set chunk size on a per-file basis; you just can't
change it once the file has any data written to it.
In the context of Hadoop the question is just if the bindings are
configured correctly to do so automatically.
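If you want to force a particular size by hand, the cephfs tool can stamp a
layout onto a directory from a kernel mount, and files created under it
afterwards inherit it. Something like this should give 64MB objects (flag
names from memory; run cephfs with no arguments to check the exact spelling):

$ cephfs /mnt/ceph/hadoop-data set_layout -u 67108864 -c 1 -s 67108864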
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue

2012-12-13 Thread Gregory Farnum
On Thu, Dec 13, 2012 at 9:27 AM, Sage Weil  wrote:
> Hi Jutta,
>
> On Thu, 13 Dec 2012, Lachfeld, Jutta wrote:
>> Hi all,
>>
>> I am currently doing some comparisons between CEPH FS and HDFS as a file 
>> system for Hadoop using Hadoop's integrated benchmark TeraSort. This 
>> benchmark first generates the specified amount of data in the file system 
>> used by Hadoop, e.g. 1TB of data, and then sorts the data via the MapReduce 
>> framework of Hadoop, sending the sorted output again to the file system used 
>> by Hadoop.  The benchmark measures the elapsed time of a sort run.
>>
>> I am wondering about my best result achieved with CEPH FS in comparison to 
>> the ones achieved with HDFS. With CEPH, the runtime of the benchmark is 
>> somewhat longer, the factor is about 1.2 when comparing with an HDFS run 
>> using the default HDFS block size of 64MB. When comparing with an HDFS run 
>> using an HDFS block size of 512MB the factor is even 1.5.
>>
>> Could you please take a look at the configuration, perhaps some key factor 
>> already catches your eye, e.g. CEPH version.
>>
>> OS: SLES 11 SP2
>>
>> CEPH:
>> OSDs are distributed over several machines.
>> There is 1 MON and 1 MDS process on yet another machine.
>>
>> Replication of the data pool is set to 1.
>> Underlying file systems for data are btrfs.
>> Mount options  are only "rw,noatime".
>> For each CEPH OSD, we use a RAM disk of 256MB for the journal.
>> Package ceph has version 0.48-13.1, package ceph-fuse has version 0.48-13.1.
>>
>> HDFS:
>> HDFS is distributed over the same machines.
>> HDFS name node on yet another machine.
>>
>> Replication level is set to 1.
>> HDFS block size is set to  64MB or even 512MB.
>
> I suspect that this is part of it.  The default ceph block size is only
> 4MB.  Especially since the differential increases with larger blocks.
> I'm not sure if the setting of block sizes is properly wired up; it
> depends on what version of the hadoop bindings you are using.  Noah would
> know more.
>
> You can adjust the default block/object size for the fs with the cephfs
> utility from a kernel mount.  There isn't yet a convenient way to do this
> via ceph-fuse.

If Jutta is using the *old* ones I last worked on in 2009, then this
is already wired up for 64MB blocks. A "ceph pg dump" would let us get
a rough estimate of the block sizes in use.

"ceph -s" would also be useful to check that everything is set up reasonably.

Other than that, it would be fair to describe these bindings as
little-used — minimal performance tests indicated rough parity back in
2009, but those were only a couple minutes long and on very small
clusters, so 1.2x might be normal. Noah and Joe are working on new
bindings now, and those will be tuned and accompany some backend
changes if necessary. They might also have a better eye for typical
results.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph osd create XX

2012-12-12 Thread Gregory Farnum
On Wed, Dec 12, 2012 at 2:00 PM, Stefan Priebe  wrote:
> Hi Greg,
>
> thanks for explanation. I'm using current next branch.
>
> I'm using:
> host1:
> osd 11 .. 14
> host2:
> osd 21 .. 24
> host3:
> osd 31 .. 34
> host4:
> osd 41 .. 44
> host5:
> osd 51 .. 54
>
> Right now i want to add host6. But i still don't know even with your
> explanation how to add osd 61-64 to osdmap.

Yeah, it's not going to let you allocate non-sequential IDs at this
point, sorry. We are slowly divorcing the ID and the name, but it's
not done yet...
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph osd create XX

2012-12-12 Thread Gregory Farnum
On Wed, Dec 12, 2012 at 1:43 PM, Stefan Priebe  wrote:
> Hi Greg,
>
> i don't get it. I was using this doc:
> http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
>
> There is written that i have to use the osd-num for ceph osd create. Which
> UUID is now meant?

Ah, you're right — I don't think that doc has been correct since
before argonaut, but prior to ~v0.55 it wouldn't complain on non-UUID
values.

The basic "ceph osd create" call simply allocates a new OSD ID and
bumps the max allowed, and returns the new ID back to the caller. If
you also specify the UUID, then it will check and see if that UUID
already exists and simply return the allocated ID if it does
(otherwise, create a new one as before). In argonaut, there's a bug
that will silently ignore any non-UUID extra arguments, which is what
was happening with those IDs.
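In other words, something like this is all you need; it prints the ID it
allocated, and that's what you use for the daemon's name and data directory:

$ ceph osd create $(uuidgen)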

John, can you fix this in the docs please? :)
-Greg

(All this is assuming I've got my commit dates right; if not there
might be a bit of variation about when which version took effect.)


> I've already added osd.61,62,63 and 64 to ceph.conf
>
> Greets,
> Stefan
> Am 12.12.2012 22:41, schrieb Gregory Farnum:
>>
>> Yeah; 61 is not a valid UUID and you can't specify anything else on that
>> line.
>>
>> On Wed, Dec 12, 2012 at 1:38 PM, Stefan Priebe 
>> wrote:
>>>
>>> HI Greg,
>>>
>>> sorry just a copy & paste error.
>>>
>>> [cloud1-ceph1: ~]# ceph osd create 61
>>> (22) Invalid argument
>>>
>>>
>>>> Read those two lines again. Very slowly. :)
>>>>
>>>> The correct syntax is
>> ceph osd create <uuid>
>>>>
>>>> The uuid is optional, but you don't specify IDs; it gives you an ID
>>>> back.
>>>
>>>
>>>
>>> Stefan
>>
>> --
>>
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph osd create XX

2012-12-12 Thread Gregory Farnum
Yeah; 61 is not a valid UUID and you can't specify anything else on that line.

On Wed, Dec 12, 2012 at 1:38 PM, Stefan Priebe  wrote:
> HI Greg,
>
> sorry just a copy & paste error.
>
> [cloud1-ceph1: ~]# ceph osd create 61
> (22) Invalid argument
>
>
>> Read those two lines again. Very slowly. :)
>>
>> The correct syntax is
>> ceph osd create <uuid>
>>
>> The uuid is optional, but you don't specify IDs; it gives you an ID back.
>
>
> Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ceph osd create XX

2012-12-12 Thread Gregory Farnum
On Wed, Dec 12, 2012 at 1:31 PM, Stefan Priebe  wrote:
> Hello List,
>
> ceph osd create $NUM
>
> does not seem to work anymore ;-(
>
> # ceph osd createosd. 62
> unknown command createosd

Read those two lines again. Very slowly. :)

The correct syntax is
ceph osd create <uuid>

The uuid is optional, but you don't specify IDs; it gives you an ID back.
-Greg

>
> Crushmap is already changed and imported ceph.conf is altered and reloaded.
>
> Greets
> Stefan
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: on disk encryption

2012-12-10 Thread Gregory Farnum
On Monday, December 10, 2012 at 1:17 AM, James Page wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>  
> On 19/09/12 02:53, Dustin Kirkland wrote:
> > > > Looking forward, another option might be to implement
> > > > encryption inside btrfs (placeholder fields are there in the
> > > > disk format, introduced along with the compression code way
> > > > back when). This would let ceph-osd handle more of the key
> > > > handling internally and do something like, say, only encrypt
> > > > the current/ and snap_*/ subdirectories.
> > > >  
> > > > Other ideas? Thoughts?
> > > >  
> > > > sage
> > I love the idea of btrfs supporting encryption natively much like
> > it does compression. It may be some time before that happens, so
> > in the meantime, I'd love to see Ceph support dm-crypt and/or
> > eCryptfs beneath.
>  
>  
>  
> Has this discussion progressed into any sort of implementation yet?
> It sounds like this is going to be a key feature for users who want
> top-to-bottom encryption of data right down to the block level.


Peter is working on this now — I'll let him discuss the details. :)
-Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-commit] [ceph/ceph] e6a154: osx: compile on OSX

2012-12-09 Thread Gregory Farnum
Oooh, very nice! Do you have a list of the dependencies that you actually 
needed to install?

Apart from breaking this up into smaller patches, we'll also want to reformat 
some of it. Rather than sticking an #if APPLE on top of every spin lock, we 
should have utility functions that do this for us. ;)

Also, we should be able to find libatomic_ops for OS X (its parent project 
works under OS X), and we can use that to construct a spin lock if we think 
it'll be useful. I'm not too sure how effective its mutexes are at spinlock-y 
workloads.
-Greg


On Sunday, December 9, 2012 at 9:41 AM, GitHub wrote:

> Branch: refs/heads/wip-osx
> Home: https://github.com/ceph/ceph
> Commit: e6a1544d42737b1aacf12210c0818200bb6d29aa
> https://github.com/ceph/ceph/commit/e6a1544d42737b1aacf12210c0818200bb6d29aa
> Author: Noah Watkins 
> Date: 2012-12-09 (Sun, 09 Dec 2012)
> 
> Changed paths:
> M autogen.sh
> M configure.ac
> A m4/ax_c_var_func.m4
> A m4/ax_cxx_static_cast.m4
> M src/Makefile.am
> M src/client/Client.cc
> M src/client/fuse_ll.cc
> M src/client/ioctl.h
> M src/common/OutputDataSocket.cc
> M src/common/admin_socket.cc
> M src/common/blkdev.cc
> M src/common/ceph_context.cc
> M src/common/ceph_context.h
> M src/common/code_environment.cc
> A src/common/cpipe.c
> A src/common/cpipe.h
> M src/common/ipaddr.cc
> M src/common/lockdep.cc
> R src/common/pipe.c
> R src/common/pipe.h
> M src/common/sctp_crc32.c
> M src/common/sync_filesystem.h
> M src/common/xattr.c
> M src/crush/crush.h
> M src/crush/hash.h
> M src/include/assert.h
> M src/include/atomic.h
> M src/include/buffer.h
> M src/include/byteorder.h
> M src/include/cephfs/libcephfs.h
> M src/include/compat.h
> M src/include/inttypes.h
> M src/include/linux_fiemap.h
> M src/include/msgr.h
> M src/include/types.h
> M src/leveldb
> M src/log/Log.cc
> M src/log/Log.h
> M src/mon/LogMonitor.cc
> M src/msg/Pipe.cc
> M src/msg/SimpleMessenger.cc
> M src/msg/SimpleMessenger.h
> M src/os/FileJournal.cc
> M src/os/FileStore.cc
> M src/osd/OSD.cc
> M src/rados.cc
> M src/test/system/cross_process_sem.cc
> M src/test/system/systest_runnable.cc
> M src/tools/ceph.cc
> M src/tools/common.cc
> 
> Log Message:
> ---
> osx: compile on OSX
> 
> This patch allows the full tree to build on OSX, but currently there are
> a lot of segfaults, inconsistent uses of DARWIN/__APPLE__ defines, and
> some of the semantic changes are likely wrong. We'll need to split this
> up into a much longer patch series, and can probably start with the
> minimal change set needed to make fuse work.
> 
> Use homebrew installed in its default location (/usr/local/) to install
> dependencies, and configure with --without-libatomic-ops --with-libaio.
> 
> ___
> Ceph-commit mailing list
> ceph-com...@lists.ceph.newdream.net 
> http://lists.ceph.newdream.net/listinfo.cgi/ceph-commit-ceph.newdream.net



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Files lost after mds rebuild

2012-12-07 Thread Gregory Farnum
On Wed, Nov 21, 2012 at 11:23 PM, Drunkard Zhang  wrote:
>
> 2012/11/22 Gregory Farnum :
> > On Tue, Nov 20, 2012 at 8:28 PM, Drunkard Zhang  
> > wrote:
> >> 2012/11/21 Gregory Farnum :
> >>> No, absolutely not. There is no relationship between different RADOS
> >>> pools. If you've been using the cephfs tool to place some filesystem
> >>> data in different pools then your configuration is a little more
> >>> complicated (have you done that?), but deleting one pool is never
> >>> going to remove data from the others.
> >>> -Greg
> >>>
> >> I think that should count as a bug. Here's what I did:
> >> I created one directory 'audit' in running ceph filesystem, and put
> >> some data into the directory (about 100GB) before these commands:
> >> ceph osd pool create audit
> >> ceph mds add_data_pool 4
> >> cephfs /mnt/temp/audit/ set_layout -p 4
> >>
> >> log3 ~ # ceph osd dump | grep audit
> >> pool 4 'audit' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num
> >> 8 pgp_num 8 last_change 1558 owner 0
> >>
> >> At this time, all data in audit was still usable. After 'ceph osd pool
> >> delete data', the disk space was reclaimed (I forgot to test whether the data
> >> was still usable); only 200MB used, according to 'ceph -s'. So here's what I'm
> >> thinking: data stored before the pool was created won't follow the pool,
> >> it still follows the default pool 'data'. Is this a bug, or intended
> >> behavior?
> >
> > Oh, I see. Data is not moved when you set directory layouts; it only
> > impacts files created after that point. This is intended behavior —
> > Ceph would need to copy the data around anyway in order to make it
> > follow the pool. There's no sense in hiding that from the user,
> > especially given the complexity involved in doing so safely —
> > especially when there are many use cases where you want the files in
> > different pools.
> > -Greg
>
> Got it, but how can I know which pools a file lives in? Are there any 
> commands?

You can get this information with the cephfs program if you're using
the kernel client. There's not yet a way to get it out of ceph-fuse,
although we will be implementing it as virtual xattrs in the
not-too-distant future.
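For example, from a kernel mount (the path is hypothetical):

$ cephfs /mnt/ceph/audit/somefile show_layout

That prints the file's layout, including the data pool ID, which you can match
up against the pool list from "ceph osd dump".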


> About the relationship between data and pools: I thought objects were hooked to
> a pool, and when the pool changed they would just be unhooked and hooked to
> another; it seems I was wrong.

Indeed that's incorrect. Pools are a logical namespace; when you
delete the pool you are also deleting everything else in it. Doing
otherwise is totally infeasible with Ceph since they also represent
placement policies.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 0.55 init script Issue?

2012-12-05 Thread Gregory Farnum
On Wed, Dec 5, 2012 at 12:17 PM, James Page  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> On 05/12/12 19:41, Dan Mick wrote:
>> The story as best I know it is that we're trying to transition to
>> and use upstart where possible, but that the upstart config does
>> not (yet?) try to do what the init.d config did.  That is, it
>> doesn't support options to the one script, but rather separates
>> daemons into separate services, and does not reach out to remote
>> machines to start daemons, etc.
>>
>> The intent is that init.d/ceph is left for non-Upstart distros,
>> AFAICT.
>>
>> Tv had some design notes here:
>>
>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg09314.html
>>
>>  We need better documentation/rationale here at least.
>
> Maybe it might be better if the ceph init script and the ceph upstart
> configuration did not namespace clash; how about shifting the name of
> the ceph upstart configuration to ceph-all?

Yeah, this or something very similar is definitely the correct
solution. Sage recently added the "ceph" upstart job, and we didn't
put it through sufficient verification prior to release in order to
notice this issue. Users who aren't using upstart (I expect that's all
of them) should just delete the job after running the package install.
We'll certainly sort this out prior to the next release; I'm not sure
if we want to roll a v0.55.1 right away or not.
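If memory serves, the job the package drops is /etc/init/ceph.conf, so removal
is just the following; do double-check the path on your own system before
deleting anything:

$ sudo rm /etc/init/ceph.conf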
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cephfs losing files and corrupting others

2012-12-04 Thread Gregory Farnum
On Tue, Dec 4, 2012 at 1:57 PM, Gregory Farnum  wrote:
> On Sun, Nov 25, 2012 at 12:45 PM, Nathan Howell
>  wrote:
>> So when trawling through the filesystem doing checksum validation
>> these popped up on the files that are filled with null bytes:
>> https://gist.github.com/186ad4c5df816d44f909
>>
>> Is there any way to fsck today? Looks like feature #86
>> http://tracker.newdream.net/issues/86 isn't implemented yet.
>
> Yeah, unfortunately there isn't — fsck is one of those things that we
> want to do as we prepare CephFS for production use, but we're only now
> starting to move back in that direction.
>
> The error printouts you're seeing indicate that...actually, I don't
> know what they mean in this context. Hrm. In any case, Zheng Yan
> contributed some patches that could impact a number of these issues,
> but I still don't see how the NULL bytes could enter into it from our
> end.

Oooh, actually, Zheng's patches are definitely related to this issue.
If you can try the "next" branch, that might resolve it going forward
(it won't repair current damage, though).
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: osd recovery extremely slow with current master

2012-12-04 Thread Gregory Farnum
Yeah, I checked with Sam and you probably want to increase the "osd
max backfill" option to make it go faster in the future — this option
limits the number of PGs an OSD will be sending or receiving a
backfill for. The default is currently set to 5, and should probably
be much higher.
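Something like this in ceph.conf (then restart the OSDs) should do it; I
believe the option is actually spelled with a trailing "s", so double-check it
against config_opts.h for your version:

[osd]
        osd max backfills = 10

You may also be able to inject it at runtime with something along the lines of
"ceph osd tell \* injectargs '--osd-max-backfills 10'".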
-Greg

On Fri, Nov 23, 2012 at 1:16 AM, Stefan Priebe - Profihost AG
 wrote:
> This is with the current next branch from today.
>
> It just prints this line:
> 2012-11-23 10:15:29.927754 mon.0 [INF] pgmap v89614: 7632 pgs: 5956
> active+clean, 446 active+remapped+wait_backfill, 540
> active+degraded+wait_backfill, 690 active+degraded+remapped+wait_backfill; 0
> bytes data, 2827 MB used, 4461 GB / 4464 GB avail; 1/3 degraded (33.333%)
>
> And there is no I/O or CPU load on any machine. And it takes hours to
> recover with 0 bytes of data (I deleted all images before trying this again).
>
> Greets,
> Stefan
>
> Am 20.11.2012 00:21, schrieb Gregory Farnum:
>
>> Which version was this on? There was some fairly significant work to
>> recovery done to introduce a reservation scheme and some other stuff
>> that might need some different defaults.
>> -Greg
>>
>> On Tue, Nov 13, 2012 at 12:33 PM, Stefan Priebe 
>> wrote:
>>>
>>> Hi list,
>>>
>>> osd recovery seems to be really slow with current master.
>>>
>>> I see only 1-8 active+recovering out of 1200. Even there's no load on
>>> ceph
>>> cluster.
>>>
>>> Greets,
>>> Stefan
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Debian/Ubuntu packages for ceph-deploy

2012-12-04 Thread Gregory Farnum
On Thu, Nov 22, 2012 at 4:29 PM, Martin Gerhard Loschwitz
 wrote:
> Hi folks,
>
> I figured it might be a cool thing to have packages of ceph-deploy for
> Debian and Ubuntu 12.04; I took the time and created them (along with
> packages of python-pushy, which ceph-deploy needs but which was not
> present in the Debian archive and thus in the Ubuntu archive either).
>
> They are available from http://people.debian.org/~madkiss/ceph-deploy/
>
> I did upload python-pushy to the official Debian unstable repository
> already, but I didn't do so just yet with ceph-deploy. Also, I don't
> want to step on somebody's toes - if there were secret plans to start
> ceph-deploy packaging anyway, I'm more than happy to hand over what
> I have got to the responsible person.
>
> Any feedback is highly appreciated (esp. with regards to the question
> if it's already okay to upload ceph-deploy).

We're certainly planning to have ceph-deploy packages once we're
prepared to support them — but if a community member wants to do it
for us now...well, we like having less to do. ;) You're not going to
step on anybody's toes, and if we need a transfer of
stewardship later, I imagine that can be arranged?
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cephfs losing files and corrupting others

2012-12-04 Thread Gregory Farnum
On Sun, Nov 25, 2012 at 12:45 PM, Nathan Howell
 wrote:
> So when trawling through the filesystem doing checksum validation
> these popped up on the files that are filled with null bytes:
> https://gist.github.com/186ad4c5df816d44f909
>
> Is there any way to fsck today? Looks like feature #86
> http://tracker.newdream.net/issues/86 isn't implemented yet.

Yeah, unfortunately there isn't — fsck is one of those things that we
want to do as we prepare CephFS for production use, but we're only now
starting to move back in that direction.

The error printouts you're seeing indicate that...actually, I don't
know what they mean in this context. Hrm. In any case, Zheng Yan
contributed some patches that could impact a number of these issues,
but I still don't see how the NULL bytes could enter into it from our
end. If you can afford the disk space required to turn on "debug osd =
10" on the OSDs, and "debug mds = 10" on the MDS, that might give us a
clue about what's going on, if we manage to grab the logs that overlap
with the bad event (or at least the detection of it). You'll certainly
want to enable log rotation, though — that will generate some very
large logs.
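Concretely, that's just the following in ceph.conf on the relevant hosts (plus
the logrotate configuration the packages ship, or your own):

[osd]
        debug osd = 10
[mds]
        debug mds = 10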

Sorry for the slow turnaround time on this, our attention is being
pulled in a lot of directions besides CephFS and this is going to be a
hard one.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Removed directory is back in the Ceph FS

2012-12-04 Thread Gregory Farnum
Can you try and reproduce this again using v0.55? There are a number
of (community!) patches in there that could have fixed this issue.
-Greg

On Tue, Nov 20, 2012 at 2:14 AM, Franck Marchand  wrote:
> Hi Gregory !
>
> Thanks for taking a look at this.
>
> I mounted the ceph fs from multiple clients. I did a rm -rf myfolder
> from a client.
> I use the ceph-fs-common (0.48-2argonaut) to mount my ceph fs.
>
> Franck
>
> 2012/11/20 Gregory Farnum :
>> On Tue, Nov 13, 2012 at 3:23 AM, Franck Marchand  
>> wrote:
>>> Hi,
>>>
>>> I have a weird problem. I removed a folder using a mounted fs partition. I
>>> did it and it worked well.
>>
>> What client are you using? How did you delete it? (rm -rf, etc?) Are
>> you using multiple clients or one, and did you check it on a different
>> client?
>>
>>> I checked later to see if I had all my folders in ceph fs ... : the
>>> folder I removed was back and I can't remove it ! Here is the error
>>> message I got :
>>>
>>> rm -rf 2012-11-10/
>>> rm cannot remove `2012-11-10': Directory not empty
>>>
>>> This folder is empty ...
>>> So has anybody had the same problem? Am I doing something wrong?
>>
>> This sounds like a known but undiagnosed problem with the MDS
>> "rstats". The part where your client reported success is a new
>> wrinkle, though.
>> -Greg
>>
>>
>>>
>>> Thx
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Very bad behavior when

2012-12-04 Thread Gregory Farnum
On Tue, Dec 4, 2012 at 12:46 PM, Sylvain Munaut
 wrote:
> Hi,
>
>> Sorry to let this drop for so long, but is this something you've seen
>> happen before/again or otherwise reproduced? I'm not entirely sure how
>> to best test for it (other than just jerking the time around), and
>> while I can come up with scenarios where the OSD leaks memory, I've
>> got nothing for how that happens to the monitors. We've also fixed a
>> number of leaks recently that could account for part of the problem.
>
> It happened very reliably at each attempt to restart the OSD and
> stopped right when I fixed the clock.
> Just take a working cluster, take an OSD out, let it rebalance, set
> the clock of one of the OSDs 50 min too fast, and restart the OSD.
>
> I had it occur twice with the same clock sync problems (once in a
> test cluster with just 2 OSDs IIRC and once in the prod cluster).
>
> I don't get it anymore because I patched the underlying problem that
> was causing the clock to jump forward 50 min.
>
> If you can't reproduce it locally, I can try to reproduce it again on
> the test cluster tomorrow.
>
> My best guess was that somehow the messages had a timestamp and it
> refused to process messages too far in the future and maybe just
> queued them while waiting (but 50 min worth of messages is a lot of
> memory). But that's really a wild guess :p

No, there's no mechanism for anything like that. I suspect it's a bug
with trying to obtain not-yet-existent cephx keys, but unfortunately I
don't think anybody has the bandwidth to deal with it right now. I've
created a bug, feel free to update if there's anything else important:
http://tracker.newdream.net/issues/3569
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

