Re: Client can't reboot when rbd volume is mounted.

2013-02-11 Thread Roman Alekseev

On 11.02.2013 17:52, Sage Weil wrote:

On Mon, 11 Feb 2013, Roman Alekseev wrote:

On 11.02.2013 09:36, Sage Weil wrote:

On Mon, 11 Feb 2013, Roman Alekseev wrote:

Hi,

When I try to reboot a client server without manually unmounting the rbd
volume, its services stop working but the server doesn't finish rebooting,
and the KVM console shows the following log:

[235618.0202207] libceph: connect 192.168.0.19:6789 error -101

That is

#define ENETUNREACH 101 /* Network is unreachable */

Note that that (or any other) socket error is not necessarily fatal; the
kernel client will retry and eventually connect to that or another OSD
to complete the IO.  Are you observing that the RBD image hangs or
something?

You can peek at in-flight IO (and other state) with

   cat /sys/kernel/debug/ceph/*/osdc

unmount/unmap should not be necessary in any case unless there is a bug.
We backported a bunch of stuff to 3.6.6, so 3.6.10 ought to be okay.  You
might try a newer 3.6.x kernel too; I forget if there was a second batch
of fixes..

sage

Hi Sage,


#define ENETUNREACH 101 /* Network is unreachable */

The reason for this error is that networking stops working after the
server reset request is issued.


Are you observing that the RBD image hangs or something?

The RBD works properly; it is just mapped and mounted on the client server.

# /dev/rbd1  99G  616M   93G   1% /home/test

I think I'm confused about what you mean by 'server'.  Do you mean the
host that rbd is mapped on, or the host(s) where the ceph-osd's are
running?

By 'the RBD works properly' do you mean the client where it is mapped?  In
which case, what exactly is the problem?
I mean the host that rbd is mapped on. This host doesn't want to restart
while the rbd volume is still mounted :)
In order to get the server to restart, we need to umount the rbd volume
manually before issuing the "reboot" command.



The "/sys/kernel/debug" folder is empty, how to put 'ceph/*/osdc' content into
it?

'mount -t debugfs none /sys/kernel/debug' and it will appear (along with
other fun stuff)...

sage



I've updated the kernel to version 3.7.4 but the problem still persists.
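
Until the shutdown ordering is sorted out, that manual workaround can be
scripted. A rough Python sketch -- it assumes debugfs is mounted as Sage
describes above and that a plain umount + "rbd unmap" is enough for this
particular setup:

#!/usr/bin/env python
# Sketch only: mirror the manual workaround -- check for in-flight OSD
# requests, then unmount and unmap any mapped rbd devices before "reboot".
# The debugfs path and the umount/"rbd unmap" sequence are assumptions
# about this particular setup.
import glob
import subprocess

def pending_osd_io():
    # Print any in-flight OSD requests (debugfs must be mounted).
    for osdc in glob.glob('/sys/kernel/debug/ceph/*/osdc'):
        with open(osdc) as f:
            data = f.read().strip()
        if data:
            print('%s:\n%s' % (osdc, data))

def unmount_and_unmap():
    # Find mounted /dev/rbd* devices via /proc/mounts, then detach them.
    with open('/proc/mounts') as f:
        mounts = [line.split()[0:2] for line in f
                  if line.startswith('/dev/rbd')]
    for dev, mountpoint in mounts:
        subprocess.check_call(['umount', mountpoint])
        subprocess.check_call(['rbd', 'unmap', dev])

if __name__ == '__main__':
    pending_osd_io()
    unmount_and_unmap()

Hooking something like this into the shutdown sequence before networking
goes down should let "reboot" complete without intervention.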

Thanks

--
Kind regards,

R. Alekseev





Thanks.

--
Kind regards,

R. Alekseev



slow requests, hunting for new mon

2013-02-11 Thread Chris Dunlop
Hi,

What are likely causes for "slow requests" and "monclient: hunting for new
mon" messages? E.g.:

2013-02-12 16:27:07.318943 7f9c0bc16700  0 monclient: hunting for new mon
...
2013-02-12 16:27:45.892314 7f9c13c26700  0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for > 30.383883 secs
2013-02-12 16:27:45.892323 7f9c13c26700  0 log [WRN] : slow request 30.383883 seconds old, received at 2013-02-12 16:27:15.508374: osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.0120 [write 921600~4096] 2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892328 7f9c13c26700  0 log [WRN] : slow request 30.383782 seconds old, received at 2013-02-12 16:27:15.508475: osd_op(client.9821.0:122243 rb.0.209f.74b0dc51.0120 [write 987136~4096] 2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892334 7f9c13c26700  0 log [WRN] : slow request 30.383720 seconds old, received at 2013-02-12 16:27:15.508537: osd_op(client.9821.0:122244 rb.0.209f.74b0dc51.0120 [write 1036288~8192] 2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892338 7f9c13c26700  0 log [WRN] : slow request 30.383684 seconds old, received at 2013-02-12 16:27:15.508573: osd_op(client.9821.0:122245 rb.0.209f.74b0dc51.0122 [write 1454080~4096] 2.fff29a9a) v4 currently no flag points reached
2013-02-12 16:27:45.892341 7f9c13c26700  0 log [WRN] : slow request 30.328986 seconds old, received at 2013-02-12 16:27:15.563271: osd_op(client.9821.0:122246 rb.0.209f.74b0dc51.0122 [write 1482752~4096] 2.fff29a9a) v4 currently no flag points reached

I have a ceph 0.56.2 system with 3 boxes: two boxes have both mon and a
single osd, and the 3rd box has just a mon - see ceph.conf below. The boxes
are running an eclectic mix of self-compiled kernels: b2 is linux-3.4.6, b4
is linux-3.7.3 and b5 is linux-3.6.10.

On b5 / osd.1 the 'hunting' message appears in the osd log regularly, e.g.
190 times yesterday. The message doesn't appear at all on b4 / osd.0.

Both osd logs show the 'slow requests' messages. Generally these come in
waves, with 30-50 of the associated individual 'slow request' messages
coming in just after the initial 'slow requests' message. Each box saw
around 30 waves yesterday, with no obvious time correlation between the two.

The osd disks are generally cruising along at around 400-800 KB/s, with
occasional spikes up to 1.2-2 MB/s, with a mostly write load.

The gigabit network interfaces (2 per box for public vs cluster) are
also cruising, with the busiest interface at less than 5% utilisation.

CPU utilisation is likewise small, with 90% or more idle and less than 3%
wait for io. There's plenty of free memory, 19 GB on b4 and 6 GB on b5.
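
One way to narrow this down might be to snapshot what the OSDs are actually
stuck on when a wave starts. A rough Python sketch, assuming the default
admin socket paths and that this build exposes the dump_ops_in_flight admin
socket command (the field names used below are also an assumption):

#!/usr/bin/env python
# Sketch: dump in-flight ops from every local OSD admin socket so a
# "slow requests" wave can be correlated with what the OSD is working on.
# The socket paths, the dump_ops_in_flight command and the output field
# names are assumptions about this particular build.
import glob
import json
import subprocess

def dump_in_flight():
    for sock in sorted(glob.glob('/var/run/ceph/ceph-osd.*.asok')):
        out = subprocess.check_output(
            ['ceph', '--admin-daemon', sock, 'dump_ops_in_flight'])
        ops = json.loads(out)
        print('%s: %s ops in flight' % (sock, ops.get('num_ops', '?')))
        for op in ops.get('ops', []):
            print('  %s (age %s)' % (op.get('description'), op.get('age')))

if __name__ == '__main__':
    dump_in_flight()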

Cheers,

Chris


ceph.conf

[global]
auth supported = cephx
[mon]
[mon.b2]
host = b2
mon addr = 10.200.63.130:6789
[mon.b4]
host = b4
mon addr = 10.200.63.132:6789
[mon.b5]
host = b5
mon addr = 10.200.63.133:6789
[osd]
osd journal size = 1000
public network = 10.200.63.0/24
cluster network = 192.168.254.0/24
[osd.0]
host = b4
public addr = 10.200.63.132
cluster addr = 192.168.254.132
[osd.1]
host = b5
public addr = 10.200.63.133
cluster addr = 192.168.254.133



Re: .gitignore issues

2013-02-11 Thread Josh Durgin

On 02/11/2013 06:28 PM, David Zafman wrote:


After updating to latest master I have the following files listed by git status:


These are mostly renamed binaries. If you run 'make clean' on the 
version before the name changes 
(133295ed001a950e3296f4e88a916ab2405be0cc) they'll be removed.

If you're sure you have nothing you want to save that's not
in a commit, you can always 'git clean -fdx'.

src/ceph.conf and src/keyring are generated by vstart.sh, and
I forgot to add them to .gitignore again earlier. There was
also a typo in ceph-filestore-dump - it was not renamed.
These are fixed now.

Josh


$ git status
# On branch master
# Untracked files:
#   (use "git add ..." to include in what will be committed)
#
#   src/bench_log
#   src/ceph-filestore-dump
#   src/ceph.conf
#   src/dupstore
#   src/keyring
#   src/kvstorebench
#   src/multi_stress_watch
#   src/omapbench
#   src/psim
#   src/radosacl
#   src/scratchtool
#   src/scratchtoolpp
#   src/smalliobench
#   src/smalliobenchdumb
#   src/smalliobenchfs
#   src/smalliobenchrbd
#   src/streamtest
#   src/testcrypto
#   src/testkeys
#   src/testrados
#   src/testrados_delete_pools_parallel
#   src/testrados_list_parallel
#   src/testrados_open_pools_parallel
#   src/testrados_watch_notify
#   src/testsignal_handlers
#   src/testtimers
#   src/tpbench
#   src/xattr_bench
nothing added to commit but untracked files present (use "git add" to track)

David Zafman
Senior Developer
david.zaf...@inktank.com







.gitignore issues

2013-02-11 Thread David Zafman

After updating to latest master I have the following files listed by git status:

$ git status
# On branch master
# Untracked files:
#   (use "git add ..." to include in what will be committed)
#
#   src/bench_log
#   src/ceph-filestore-dump
#   src/ceph.conf
#   src/dupstore
#   src/keyring
#   src/kvstorebench
#   src/multi_stress_watch
#   src/omapbench
#   src/psim
#   src/radosacl
#   src/scratchtool
#   src/scratchtoolpp
#   src/smalliobench
#   src/smalliobenchdumb
#   src/smalliobenchfs
#   src/smalliobenchrbd
#   src/streamtest
#   src/testcrypto
#   src/testkeys
#   src/testrados
#   src/testrados_delete_pools_parallel
#   src/testrados_list_parallel
#   src/testrados_open_pools_parallel
#   src/testrados_watch_notify
#   src/testsignal_handlers
#   src/testtimers
#   src/tpbench
#   src/xattr_bench
nothing added to commit but untracked files present (use "git add" to track)

David Zafman
Senior Developer
david.zaf...@inktank.com





Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

2013-02-11 Thread Isaac Otsiabah


Yes, there were osd daemons running on the same node that the monitor was 
running on.  If that is the case then I will run a test case with the 
monitor running on a different node where no osd is running and see what 
happens. Thank you. 

Isaac


From: Gregory Farnum 
To: Isaac Otsiabah  
Cc: "ceph-devel@vger.kernel.org"  
Sent: Monday, February 11, 2013 12:29 PM
Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to 
my cluster

Isaac,
I'm sorry I haven't been able to wrangle any time to look into this
more yet, but Sage pointed out in a related thread that there might be
some buggy handling of things like this if the OSD and the monitor are
located on the same host. Am I correct in assuming that with your
small cluster, all your OSDs are co-located with a monitor daemon?
-Greg

On Mon, Jan 28, 2013 at 12:17 PM, Isaac Otsiabah  wrote:
>
>
> Gregory, I recreated the osd down problem again this morning on two nodes
> (g13ct, g14ct). First, I created a 1-node cluster on g13ct (with osd.0, 1, 2)
> and then added host g14ct (osd.3, 4, 5). osd.1 went down for about a minute
> and a half after osd.3, 4, 5 were added. I have included the routing
> table of each node at the time osd.1 went down. ceph.conf and ceph-osd.1.log
> files are attached. The crush map was default. Also, it could be a timing
> issue because it does not always fail when using the default crush map; it
> takes several trials before you see it. Thank you.
>
>
> [root@g13ct ~]# netstat -r
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
> default         133.164.98.250 0.0.0.0         UG        0 0          0 eth2
> 133.164.98.0    *               255.255.255.0   U         0 0          0 eth2
> link-local      *               255.255.0.0     U         0 0          0 eth3
> link-local      *               255.255.0.0     U         0 0          0 eth0
> link-local      *               255.255.0.0     U         0 0          0 eth2
> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth0
> 192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
> 192.168.1.0     *               255.255.255.0   U         0 0          0 eth0
> [root@g13ct ~]# ceph osd tree
>
> # id    weight  type name       up/down reweight
> -1      6       root default
> -3      6               rack unknownrack
> -2      3                       host g13ct
> 0       1                               osd.0   up      1
> 1       1                               osd.1   down    1
> 2       1                               osd.2   up      1
> -4      3                       host g14ct
> 3       1                               osd.3   up      1
> 4       1                               osd.4   up      1
> 5       1                               osd.5   up      1
>
>
>
> [root@g14ct ~]# ceph osd tree
>
> # id    weight  type name       up/down reweight
> -1      6       root default
> -3      6               rack unknownrack
> -2      3                       host g13ct
> 0       1                               osd.0   up      1
> 1       1                               osd.1   down    1
> 2       1                               osd.2   up      1
> -4      3                       host g14ct
> 3       1                               osd.3   up      1
> 4       1                               osd.4   up      1
> 5       1                               osd.5   up      1
>
> [root@g14ct ~]# netstat -r
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
> default         133.164.98.250 0.0.0.0         UG        0 0          0 eth0
> 133.164.98.0    *               255.255.255.0   U         0 0          0 eth0
> link-local      *               255.255.0.0     U         0 0          0 eth3
> link-local      *               255.255.0.0     U         0 0          0 eth5
> link-local      *               255.255.0.0     U         0 0          0 eth0
> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth5
> 192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
> 192.168.1.0     *               255.255.255.0   U         0 0          0 eth5
> [root@g14ct ~]# ceph osd tree
>
> # id    weight  type name       up/down reweight
> -1      6       root default
> -3      6               rack unknownrack
> -2      3                       host g13ct
> 0       1                               osd.0   up      1
> 1       1                               osd.1   down    1
> 2       1                               osd.2   up      1
> -4      3                       host g14ct
> 3       1                               osd.3   up      1
> 4       1                               osd.4   up     

Re: File exists not handled in 0.48argonaut1

2013-02-11 Thread Samuel Just
The actual problem appears to be a corrupted log file.  You should
rename the directory
/mnt/osd97/current/corrupt_log_2013-02-08_18:50_2.fa8 out of the way.
Then restart the osd with debug osd = 20, debug filestore = 20, and
debug ms = 1 in the [osd] section of the ceph.conf.
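
For reference, that corresponds to a ceph.conf fragment along these lines:

  [osd]
  debug osd = 20
  debug filestore = 20
  debug ms = 1
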
-Sam

On Mon, Feb 11, 2013 at 2:21 PM, Mandell Degerness
 wrote:
> Since the attachment didn't work, apparently, here is a link to the log:
>
> http://dl.dropbox.com/u/766198/error17.log.gz
>
> On Mon, Feb 11, 2013 at 1:42 PM, Samuel Just  wrote:
>> I don't see the more complete log.
>> -Sam
>>
>> On Mon, Feb 11, 2013 at 11:12 AM, Mandell Degerness
>>  wrote:
>>> Anyone have any thoughts on this???  It looks like I may have to wipe
>>> out the OSDs affected and rebuild them, but I'm afraid that may result
>>> in data loss because of the old OSD first crush map in place :(.
>>>
>>> On Fri, Feb 8, 2013 at 1:36 PM, Mandell Degerness
>>>  wrote:
 We ran into an error which appears very much like a bug fixed in 0.44.

 This cluster is running version:

 ceph version 0.48.1argonaut 
 (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)

 The error line is:

 Feb  8 18:50:07 192.168.8.14 ceph-osd: 2013-02-08 18:50:07.545682
 7f40f9f08700  0 filestore(/mnt/osd97)  error (17) File exists not
 handled on operation 20 (11279344.0.0, or op 0, counting from 0)

 A more complete log is attached.

 First question: is this a known bug fixed in more recent versions?

 Second question: is there any hope of recovery?
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rest mgmt api

2013-02-11 Thread Sage Weil
On Mon, 11 Feb 2013, Gregory Farnum wrote:
> [...]
> ...but my instinct is to want one canonical code path in the monitors,
> not two. Two allows for discrepancies in what each method allows to
> [...]

Yeah, I'm convinced.

Just chatted with Dan and Josh a bit about this.  Josh had the interesting 
idea that the specification of what commands are supported could be 
requested from the monitor in some canonical form (say, a blob of JSON), 
and then enforced at the client.  That would be translated into an 
argparse config for the CLI, and a simple matching/validation table for 
the REST endpoint.
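
To make that concrete, here is a small sketch of the client half in Python;
the spec format below is invented purely for illustration (the real blob
from the monitor would presumably look different):

#!/usr/bin/env python
# Sketch: build an argparse-based CLI from a command specification that
# could, in principle, be fetched from the monitor.  The spec format here
# is invented purely for illustration.
import argparse

COMMAND_SPEC = [
    {"prefix": "osd down", "args": [{"name": "id", "type": "int"}]},
    {"prefix": "pool create",
     "args": [{"name": "name", "type": "str"},
              {"name": "pg_num", "type": "int"}]},
]

TYPES = {"int": int, "str": str}

def build_parser(spec):
    parser = argparse.ArgumentParser(prog='ceph')
    sub = parser.add_subparsers(dest='prefix')
    for cmd in spec:
        p = sub.add_parser(cmd['prefix'].replace(' ', '-'))
        for arg in cmd['args']:
            p.add_argument(arg['name'], type=TYPES[arg['type']])
    return parser

if __name__ == '__main__':
    args = build_parser(COMMAND_SPEC).parse_args(['osd-down', '123'])
    print(args)   # -> Namespace with prefix='osd-down', id=123

The same spec could drive a matching/validation table on the REST side, so
both front ends stay in sync with whatever the monitor advertises.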

That might be worth the complexity to get the best of both worlds... but 
first Dan is looking at whether Python's argparse will do everything we 
want for the CLI end of things.

In the meantime, the first set of tasks still stand: move the ceph tool 
cruft into MonClient and Objecter and out of tool/common.cc for starters.

sage


Re: Crash and strange things on MDS

2013-02-11 Thread Kevin Decherf
On Mon, Feb 11, 2013 at 02:47:13PM -0800, Gregory Farnum wrote:
> On Mon, Feb 11, 2013 at 2:24 PM, Kevin Decherf  wrote:
> > On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
> > Yes, there is a dump of 100,000 events for this backtrace in the linked
> > archive (I need 7 hours to upload it).
> 
> Can you just pastebin the last couple hundred lines? I'm mostly
> interested if there's anything from the function which actually caused
> the assert/segfault. Also, the log should compress well and get much
> smaller!

Sent in pm.

And yes, I have a good compression rate but...

   % ls -lh 
   total 38G
   -rw-r--r-- 1 kdecherf kdecherf 3.3G Feb 11 18:36 cc-ceph-log.tar.gz
   -rw--- 1 kdecherf kdecherf  66M Feb  4 17:57 ceph.log
   -rw-r--r-- 1 kdecherf kdecherf 3.5G Feb  4 14:44 ceph-mds.b.log
   -rw-r--r-- 1 kdecherf kdecherf  31G Feb  5 15:55 ceph-mds.c.log
   -rw-r--r-- 1 kdecherf kdecherf  27M Feb 11 19:46 ceph-osd.14.log

;-)

> > The distribution is heterogeneous: we have a folder of ~17G for 300k
> > objects, another of ~2G for 150k objects and a lot of smaller directories.
> 
> Sorry, you mean 300,000 files in the single folder?
> If so, that's definitely why it's behaving so badly — your folder is
> larger than your maximum cache size settings, and so if you run an
> "ls" or anything the MDS will read the whole thing off disk, then
> instantly drop most of the folder from its cache. Then re-read again
> for the next request to list contents, etc etc.

The biggest top-level folder contains 300k files but split into
several subfolders (a subfolder does not contain more than 10,000 files
at its level).

> > Are you talking about the mds bal frag and mds bal split * settings?
> > Do you have any advice about the value to use?
> If you set "mds bal frag = true" in your config, it will split up
> those very large directories into smaller fragments and behave a lot
> better. This isn't quite as stable (thus the default to "off"), so if
> you have the memory to just really up your cache size I'd start with
> that and see if it makes your problems better. But if it doesn't,
> directory fragmentation does work reasonably well and it's something
> we'd be interested in bug reports for. :)

I will try it, thanks!

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


Re: Unable to mount cephfs - can't read superblock

2013-02-11 Thread Ross David Turk

On Feb 9, 2013, at 3:25 AM, Adam Nielsen  wrote:

> I will use that list as soon as it appears on GMane, since I find their NNTP 
> interface a lot easier than managing a bunch of mailing list subscriptions! 
> Maybe someone with more authority than myself can add it?
> 
>  http://gmane.org/subscribe.php

Agree - requested it last week, will follow up when it's added.

Cheers,
Ross



Re: Crash and strange things on MDS

2013-02-11 Thread Gregory Farnum
On Mon, Feb 11, 2013 at 2:24 PM, Kevin Decherf  wrote:
> On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
>> On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf  wrote:
>> > References:
>> > [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
>> > [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
>> > 1: /usr/bin/ceph-mds() [0x817e82]
>> > 2: (()+0xf140) [0x7f9091d30140]
>> > 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
>> > 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
>> > 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
>> > 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
>> > 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
>> > 8: (Server::kill_session(Session*)+0x137) [0x549c67]
>> > 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
>> > 10: (MDS::tick()+0x338) [0x4da928]
>> > 11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
>> > 12: (SafeTimerThread::entry()+0xd) [0x782bad]
>> > 13: (()+0x7ddf) [0x7f9091d28ddf]
>> > 14: (clone()+0x6d) [0x7f90909cc24d]
>>
>> This in particular is quite odd. Do you have any logging from when
>> that happened? (Oftentimes the log can have a bunch of debugging
>> information from shortly before the crash.)
>
> Yes, there is a dump of 100,000 events for this backtrace in the linked
> archive (I need 7 hours to upload it).

Can you just pastebin the last couple hundred lines? I'm mostly
interested if there's anything from the function which actually caused
the assert/segfault. Also, the log should compress well and get much
smaller!

>> On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf  wrote:
>> > Furthermore, I observe another strange thing more or less related to the
>> > storms.
>> >
>> > During a rsync command to write ~20G of data on Ceph and during (and
>> > after) the storm, one OSD sends a lot of data to the active MDS
>> > (400Mbps peak each 6 seconds). After a quick check, I found that when I
>> > stop osd.23, osd.14 stops its peaks.
>>
>> This is consistent with Sam's suggestion that MDS is thrashing its
>> cache, and is grabbing a directory object off of the OSDs. How large
>> are the directories you're using? If they're a significant fraction of
>> your cache size, it might be worth enabling the (sadly less stable)
>> directory fragmentation options, which will split them up into smaller
>> fragments that can be independently read and written to disk.
>
> The distribution is heterogeneous: we have a folder of ~17G for 300k
> objects, another of ~2G for 150k objects and a lot of smaller directories.

Sorry, you mean 300,000 files in the single folder?
If so, that's definitely why it's behaving so badly — your folder is
larger than your maximum cache size settings, and so if you run an
"ls" or anything the MDS will read the whole thing off disk, then
instantly drop most of the folder from its cache. Then re-read again
for the next request to list contents, etc etc.

> Are you talking about the mds bal frag and mds bal split * settings?
> Do you have any advice about the value to use?
If you set "mds bal frag = true" in your config, it will split up
those very large directories into smaller fragments and behave a lot
better. This isn't quite as stable (thus the default to "off"), so if
you have the memory to just really up your cache size I'd start with
that and see if it makes your problems better. But if it doesn't,
directory fragmentation does work reasonably well and it's something
we'd be interested in bug reports for. :)
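
For reference, the corresponding ceph.conf fragment would look something
like this (the cache size value is only an example; size it to the memory
you actually have on the MDS host):

  [mds]
  mds bal frag = true
  mds cache size = 300000
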
-Greg


Re: rest mgmt api

2013-02-11 Thread Gregory Farnum
On Mon, Feb 11, 2013 at 2:00 PM, Sage Weil  wrote:
> On Mon, 11 Feb 2013, Gregory Farnum wrote:
>> On Wed, Feb 6, 2013 at 12:14 PM, Sage Weil  wrote:
>> > On Wed, 6 Feb 2013, Dimitri Maziuk wrote:
>> >> On 02/06/2013 01:34 PM, Sage Weil wrote:
>> >>
>> >> > I think the one caveat here is that having a single registry for 
>> >> > commands
>> >> > in the monitor means that commands can come in two flavors: 
>> >> > vector
>> >> > (cli) and URL (presumably in json form).  But a single command
>> >> > dispatch/registry framework will make that distinction pretty simple...
>> >>
>> >> Any reason you can't have your CLI json-encode the commands (or,
>> >> conversely, your cgi/wsgi/php/servlet URL handler decode them into
>> >> vector) before passing them on to the monitor?
>> >
>> > We can, but they won't necessarily look the same, because it is unlikely
>> > we can make a sane 1:1 translation of the CLI to REST that makes sense,
>> > and it would be nice to avoid baking knowledge about the individual
>> > commands into the client side.
>>
>> I disagree and am with Joao on this one; the monitor parsing is
>> ridiculous as it stands right now, and we should be trying to get rid
>> of the manual string parsing. The monitors should be parsing JSON
>> commands that are sent by the client; it makes validation and the
>
> No argument that the current parsing code is bad...
>
>> logic control flow a lot easier. We're going to want some level of
>> intelligence in the clients so that they can tailor themselves to the
>> appropriate UI conventions, and having two different parsing paths in
>
> What do you mean by tailor to UI conventions?

Implementing and/or allowing positional versus named parameters, to
toss off one suggestion. Obviously the CLI will want to allow input
data in a format different than an API, but a port to a different
platform might prefer named parameters instead of positional ones, or
whatever.
Basically I'm agreeing that we as users want to be able to input data
differently and have it mean the same thing ;) 

>> the monitors is just asking for trouble: they will get out of sync and
>> have different kinds of parsing errors.
>>
>> What we could do is have the monitors speak JSON only, and then give
>> the clients a minimal intelligence so that the CLI could (for
>> instance) prettify the options for commands it knows about, but still
>> allow pass-through for access to newer commands it hasn't yet heard
>> of.
>
> That doesn't really help; it means the mon still has to understand the
> CLI grammar.
>
> What we are talking about is the difference between:
>
> [ 'osd', 'down', '123' ]
>
> and
>
> {
>   URI: '/osd/down',
>   OSD-Id: 123
> }
>
> or however we generically translate the HTTP request into JSON.  Once we
> normalize the code, calling it "parsing" is probably misleading.  The top
> (CLI) fragment will match against a rule like:
>
>  [ STR("osd"), STR("down"), POSINT ]
>
> or however we encode the syntax, while the below would match against
>
>  { .prefix = "/osd/down",
>.fields = [ "OSD-Id": POSINT ]
>  }
>
> ..or something.  I'm making this syntax up, but you get the idea: there
> would be a strict format for the request and generic code that validates
> it and passes the resulting arguments/matches into a function like
>
>  int do_command_osd_down(int n);
>
> regardless of which type of input pattern it matched.

...but my instinct is to want one canonical code path in the monitors,
not two. Two allows for discrepancies in what each method allows to
come in that we're not going to have if they all come in to the
monitor in a single form. So I say that the canonicalization should
happen client-side, and the enforcement should happen server-side (and
probably client-side as well, but that's just for courtesy).
You've suggested that we want the monitors to do the parsing so that
old clients will work, but given that new commands in the monitors
often require new capabilities in the clients, having it be slightly
more awkward to send new commands to new monitors from old clients
doesn't seem like such a big deal to me — if somebody's running
monitor version .64 and client ceph tool version .60 and wants to use
a new thing, I don't feel bad about making them give the CLI a command
which completely specifies what the JSON looks like, instead of using
the pretty wrapping they'd get if they upgraded their client.

Having a canonicalized format also means that when we return errors
they can be a lot more useful, since the monitor can specify what
fields it received and which ones were bad, instead of just outputting
a string from whichever line of code actually broke. Consider an
incoming command whose canonical form is

[ 'crush', 'add', '123', '1.0' ]

And the parsing code runs through that and it fails and the string
going back says "error: does not specify weight!". But the user looks
and says "yes I did, it's 1.0!"

Versus if the error came back as
"Received command: ['area': 'crush'

Re: rest mgmt api

2013-02-11 Thread Dimitri Maziuk
On 02/11/2013 04:00 PM, Sage Weil wrote:
> On Mon, 11 Feb 2013, Gregory Farnum wrote:
...

> That doesn't really help; it means the mon still has to understand the 
> CLI grammar.
> 
> What we are talking about is the difference between:
> 
> [ 'osd', 'down', '123' ]
> 
> and
> 
> {
>   URI: '/osd/down',
>   OSD-Id: 123
> }
> 
> or however we generically translate the HTTP request into JSON.

I think the setup we have in mind is where the MON reads something like
{"who":"osd", "which":"123", "what":"down", "when":"now"} from a socket
(pipe, whatever),

the CLI reads "osd down 123 now" from the prompt and pushes {"who":"osd",
"which":"123", "what":"down", "when":"now"} into that socket,

the webapp gets whatever: "/osd/down/123/now" or
"?who=osd&command=down&id=123&when=now" from whoever impersonates the
browser and pipes {"who":"osd", "which":"123", "what":"down",
"when":"now"} into that same socket,

and all three of them are three completely separate applications that
don't try to do what they don't need to.

> FWIW you could pass the CLI command as JSON, but that's no different than 
> encoding vector; it's still a different way to describing the same 
> command.

The devil is of course in the details: in (e.g.) python, json.loads() parses
the string and gives you the map you could plug into a lookup table or
something to get right to the function call. My c++ is way rusty, I've
no idea what's available in boost &co -- if you have to roll your own
json parser then you indeed don't care how that vector is encoded.
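
For illustration, a rough Python sketch of that lookup-table dispatch (the
handler and the message fields are invented, matching the example above):

#!/usr/bin/env python
# Sketch of the lookup-table dispatch: decode one JSON request and route
# it to a handler.  The handler and the message fields are invented.
import json

def osd_down(which, when='now'):
    print('marking osd.%s down (%s)' % (which, when))

HANDLERS = {
    ('osd', 'down'): osd_down,
}

def handle(raw):
    msg = json.loads(raw)
    handler = HANDLERS[(msg['who'], msg['what'])]
    handler(msg['which'], msg.get('when', 'now'))

if __name__ == '__main__':
    handle('{"who": "osd", "which": "123", "what": "down", "when": "now"}')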

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [PATCH] fs: encode_fh: return FILEID_INVALID if invalid fid_type

2013-02-11 Thread Namjae Jeon
2013/2/12, Dave Chinner :
> On Mon, Feb 11, 2013 at 05:25:58PM +0900, Namjae Jeon wrote:
>> From: Namjae Jeon 
>>
>> This patch is a follow-up to the patch below:
>>
>> [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
>> commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63
> 
>> diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
>> index a836118..3391800 100644
>> --- a/fs/xfs/xfs_export.c
>> +++ b/fs/xfs/xfs_export.c
>> @@ -48,7 +48,7 @@ static int xfs_fileid_length(int fileid_type)
>>  case FILEID_INO32_GEN_PARENT | XFS_FILEID_TYPE_64FLAG:
>>  return 6;
>>  }
>> -return 255; /* invalid */
>> +return FILEID_INVALID; /* invalid */
>>  }
>
> I think you can drop the "/* invalid */" comment from there now as
> it is redundant with this change.
Okay, Thanks for review :-)
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
>


Re: Crash and strange things on MDS

2013-02-11 Thread Kevin Decherf
On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
> On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf  wrote:
> > References:
> > [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
> > [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
> > 1: /usr/bin/ceph-mds() [0x817e82]
> > 2: (()+0xf140) [0x7f9091d30140]
> > 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
> > 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
> > 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
> > 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
> > 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
> > 8: (Server::kill_session(Session*)+0x137) [0x549c67]
> > 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
> > 10: (MDS::tick()+0x338) [0x4da928]
> > 11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
> > 12: (SafeTimerThread::entry()+0xd) [0x782bad]
> > 13: (()+0x7ddf) [0x7f9091d28ddf]
> > 14: (clone()+0x6d) [0x7f90909cc24d]
> 
> This in particular is quite odd. Do you have any logging from when
> that happened? (Oftentimes the log can have a bunch of debugging
> information from shortly before the crash.)

Yes, there is a dump of 100,000 events for this backtrace in the linked
archive (I need 7 hours to upload it).

> 
> On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf  wrote:
> > Furthermore, I observe another strange thing more or less related to the
> > storms.
> >
> > During a rsync command to write ~20G of data on Ceph and during (and
> > after) the storm, one OSD sends a lot of data to the active MDS
> > (400Mbps peak each 6 seconds). After a quick check, I found that when I
> > stop osd.23, osd.14 stops its peaks.
> 
> This is consistent with Sam's suggestion that MDS is thrashing its
> cache, and is grabbing a directory object off of the OSDs. How large
> are the directories you're using? If they're a significant fraction of
> your cache size, it might be worth enabling the (sadly less stable)
> directory fragmentation options, which will split them up into smaller
> fragments that can be independently read and written to disk.

The distribution is heterogeneous: we have a folder of ~17G for 300k
objects, another of ~2G for 150k objects and a lot of smaller directories.
Are you talking about the mds bal frag and mds bal split * settings?
Do you have any advice about the value to use?

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


Re: File exists not handled in 0.48argonaut1

2013-02-11 Thread Mandell Degerness
Since the attachment didn't work, apparently, here is a link to the log:

http://dl.dropbox.com/u/766198/error17.log.gz

On Mon, Feb 11, 2013 at 1:42 PM, Samuel Just  wrote:
> I don't see the more complete log.
> -Sam
>
> On Mon, Feb 11, 2013 at 11:12 AM, Mandell Degerness
>  wrote:
>> Anyone have any thoughts on this???  It looks like I may have to wipe
>> out the OSDs affected and rebuild them, but I'm afraid that may result
>> in data loss because of the old OSD first crush map in place :(.
>>
>> On Fri, Feb 8, 2013 at 1:36 PM, Mandell Degerness
>>  wrote:
>>> We ran into an error which appears very much like a bug fixed in 0.44.
>>>
>>> This cluster is running version:
>>>
>>> ceph version 0.48.1argonaut 
>>> (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
>>>
>>> The error line is:
>>>
>>> Feb  8 18:50:07 192.168.8.14 ceph-osd: 2013-02-08 18:50:07.545682
>>> 7f40f9f08700  0 filestore(/mnt/osd97)  error (17) File exists not
>>> handled on operation 20 (11279344.0.0, or op 0, counting from 0)
>>>
>>> A more complete log is attached.
>>>
>>> First question: is this a known bug fixed in more recent versions?
>>>
>>> Second question: is there any hope of recovery?
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: chain_fsetxattr extra chunk removal

2013-02-11 Thread Loic Dachary
Hi,

I amended the unit tests ( https://github.com/ceph/ceph/pull/40/files ) to 
cover the code below. A review would be much appreciated :-)

Cheers

On 02/11/2013 09:08 PM, Loic Dachary wrote:
> 
> 
> On 02/11/2013 06:13 AM, Yehuda Sadeh wrote:
>> On Thu, Feb 7, 2013 at 12:59 PM, Loic Dachary  wrote:
>>> Hi,
>>>
>>> While writing unit tests for chain_xattr.cc I tried to understand how to 
>>> create the conditions to trigger this part of the chain_fsetxattr function:
>>>
>>>   /* if we're exactly at a chunk size, remove the next one (if wasn't 
>>> removed
>>>  before) */
>>>   if (ret >= 0 && chunk_size == CHAIN_XATTR_MAX_BLOCK_LEN) {
>>> get_raw_xattr_name(name, i, raw_name, sizeof(raw_name));
>>> int r = sys_fremovexattr(fd, raw_name);
>>> if (r < 0 && r != -ENODATA)
>>>   ret = r;
>>>   }
>>>
>>> I suspect this cleans up extra empty attributes created as a side effect of 
>>> a previous version of the function. Or I just don't understand the case it 
>>> addresses.
>>>
>>> I'd very much appreciate a hint :-)
>>>
>>
>> Well, the code has changed a bit, but originally when a chain was
>> overwritten we didn't bother to remove the xattrs tail. When we read
>> the chain we stop either when we got a short xattr, or when the next
>> xattr in the chain didn't exist.  So when writing an xattr that was
>> perfectly aligned with the block len we had to remove the next xattr
>> in order to make sure that readers will not over-read. I'm not too sure
>> whether that's still the case, Sam might have a better idea.
>> In any case, it might be a good idea to test the case where we have a
>> big xattr that spans across multiple blocks (e.g., > 3) and being
>> overwritten by a short xattr. Probably also need to test it with
>> different combinations of aligned and non-aligned block sizes.
> 
> I understand now and I'll modify the pull request 
> https://github.com/ceph/ceph/pull/40 accordingly.
> 
> Thanks :-)
> 
>>
>> Thanks,
>> Yehuda
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
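
To make the over-read case concrete, here is a toy Python model of the
chain semantics described in the quoted exchange above -- illustrative
only, not the actual chain_xattr code:

#!/usr/bin/env python
# Toy model of chained xattrs, to show why a write whose last chunk is
# exactly block-sized has to remove the following chunk: readers stop at
# a short or missing chunk, so a stale next chunk would otherwise be
# appended to the value.  This is not the chain_xattr implementation.
BLOCK = 4   # stand-in for CHAIN_XATTR_MAX_BLOCK_LEN

def write_chain(store, name, value):
    chunks = [value[i:i + BLOCK] for i in range(0, len(value), BLOCK)] or ['']
    for i, chunk in enumerate(chunks):
        store['%s@%d' % (name, i)] = chunk
    # Stale tail chunks are normally harmless because the reader stops at
    # the first short chunk -- unless the last chunk written is exactly
    # BLOCK long, in which case the next (stale) chunk must be removed.
    if len(chunks[-1]) == BLOCK:
        store.pop('%s@%d' % (name, len(chunks)), None)

def read_chain(store, name):
    value, i = '', 0
    while True:
        chunk = store.get('%s@%d' % (name, i))
        if chunk is None:
            return value
        value += chunk
        if len(chunk) < BLOCK:    # a short chunk terminates the chain
            return value
        i += 1

store = {}
write_chain(store, 'user.x', 'aaaabbbbcc')   # chunks: 'aaaa', 'bbbb', 'cc'
write_chain(store, 'user.x', 'ddddeeee')     # ends exactly on a block boundary
assert read_chain(store, 'user.x') == 'ddddeeee'   # would pick up 'cc' otherwise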





Re: rest mgmt api

2013-02-11 Thread Sage Weil
On Mon, 11 Feb 2013, Gregory Farnum wrote:
> On Wed, Feb 6, 2013 at 12:14 PM, Sage Weil  wrote:
> > On Wed, 6 Feb 2013, Dimitri Maziuk wrote:
> >> On 02/06/2013 01:34 PM, Sage Weil wrote:
> >>
> >> > I think the one caveat here is that having a single registry for commands
> >> > in the monitor means that commands can come in two flavors: 
> >> > vector
> >> > (cli) and URL (presumably in json form).  But a single command
> >> > dispatch/registry framework will make that distinction pretty simple...
> >>
> >> Any reason you can't have your CLI json-encode the commands (or,
> >> conversely, your cgi/wsgi/php/servlet URL handler decode them into
> >> vector) before passing them on to the monitor?
> >
> > We can, but they won't necessarily look the same, because it is unlikely
> > we can make a sane 1:1 translation of the CLI to REST that makes sense,
> > and it would be nice to avoid baking knowledge about the individual
> > commands into the client side.
> 
> I disagree and am with Joao on this one; the monitor parsing is
> ridiculous as it stands right now, and we should be trying to get rid
> of the manual string parsing. The monitors should be parsing JSON
> commands that are sent by the client; it makes validation and the

No argument that the current parsing code is bad...

> logic control flow a lot easier. We're going to want some level of
> intelligence in the clients so that they can tailor themselves to the
> appropriate UI conventions, and having two different parsing paths in

What do you mean by tailor to UI conventions?

> the monitors is just asking for trouble: they will get out of sync and
> have different kinds of parsing errors.
> 
> What we could do is have the monitors speak JSON only, and then give
> the clients a minimal intelligence so that the CLI could (for
> instance) prettify the options for commands it knows about, but still
> allow pass-through for access to newer commands it hasn't yet heard
> of.

That doesn't really help; it means the mon still has to understand the 
CLI grammar.

What we are talking about is the difference between:

[ 'osd', 'down', '123' ]

and

{
  URI: '/osd/down',
  OSD-Id: 123
}

or however we generically translate the HTTP request into JSON.  Once we 
normalize the code, calling it "parsing" is probably misleading.  The top 
(CLI) fragment will match against a rule like:

 [ STR("osd"), STR("down"), POSINT ]

or however we encode the syntax, while the below would match against

 { .prefix = "/osd/down",
   .fields = [ "OSD-Id": POSINT ]
 }

..or something.  I'm making this syntax up, but you get the idea: there 
would be a strict format for the request and generic code that validates 
it and passes the resulting arguments/matches into a function like

 int do_command_osd_down(int n);

regardless of which type of input pattern it matched.
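
For what it's worth, a rough Python sketch of that generic matching layer,
with the rule and request shapes invented to mirror the examples above:

#!/usr/bin/env python
# Sketch: one declared signature, two front ends.  A CLI word vector and a
# JSON/REST-style request are validated against the same rule and end up
# in the same handler.  All names and shapes here are illustrative only.
def POSINT(v):
    n = int(v)
    if n < 0:
        raise ValueError('expected a non-negative integer, got %r' % (v,))
    return n

def do_command_osd_down(n):
    print('osd.%d marked down' % n)

RULE = {
    'cli':     ['osd', 'down', POSINT],              # [ STR("osd"), STR("down"), POSINT ]
    'rest':    ('/osd/down', [('OSD-Id', POSINT)]),  # prefix + typed fields
    'handler': do_command_osd_down,
}

def match_cli(rule, argv):
    pattern = rule['cli']
    if len(argv) != len(pattern):
        return None
    args = []
    for pat, word in zip(pattern, argv):
        if callable(pat):
            args.append(pat(word))        # typed field: validate/convert
        elif pat != word:
            return None                   # literal word didn't match
    return args

def match_rest(rule, request):
    prefix, fields = rule['rest']
    if request.get('URI') != prefix:
        return None
    return [conv(request[name]) for name, conv in fields]

args = match_cli(RULE, ['osd', 'down', '123'])
if args is None:
    args = match_rest(RULE, {'URI': '/osd/down', 'OSD-Id': 123})
RULE['handler'](*args)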

Obviously we'll need 100% testing coverage for both the RESTful and CLI 
variants, whether we do the above or whether the CLI is translating one 
into the other via duplicated knowledge of the command set.

FWIW you could pass the CLI command as JSON, but that's no different than 
encoding vector; it's still a different way to describing the same 
command.


If the parsing code is wrapped in a single library that validates typed 
fields or positional arguments/flags, I don't think this is going to turn 
into anything remotely like the same wild-west horror that the current 
code represents.  And if we were building this from scratch with no 
legacy, I'd argue that the same model is still pretty good... unless we 
recast the entire CLI in terms of a generic URI+field model that 
matches the REST API perfectly.

Now.. if that is the route want to go, that is another choice.  We could:

 - redesign a fresh CLI with commands like

   ceph /osd/123 mark=down
   ceph /pool/foo create pg_num=123

 - make this a programmatic transformation to/from a REST request, like

   /osd/123?command=mark&status=down
   /pool/foo?command=create&pg_num=123

   (or whatever the REST requests are "supposed" to look like)

 - hard-code a client-side mapping for legacy commands only
 - only add new commands in the new syntax

That means retraining users and only adding new commands in the new model 
of things.  And dreaming up said model...

sage


Re: File exists not handled in 0.48argonaut1

2013-02-11 Thread Samuel Just
I don't see the more complete log.
-Sam

On Mon, Feb 11, 2013 at 11:12 AM, Mandell Degerness
 wrote:
> Anyone have any thoughts on this???  It looks like I may have to wipe
> out the OSDs affected and rebuild them, but I'm afraid that may result
> in data loss because of the old OSD first crush map in place :(.
>
> On Fri, Feb 8, 2013 at 1:36 PM, Mandell Degerness
>  wrote:
>> We ran into an error which appears very much like a bug fixed in 0.44.
>>
>> This cluster is running version:
>>
>> ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
>>
>> The error line is:
>>
>> Feb  8 18:50:07 192.168.8.14 ceph-osd: 2013-02-08 18:50:07.545682
>> 7f40f9f08700  0 filestore(/mnt/osd97)  error (17) File exists not
>> handled on operation 20 (11279344.0.0, or op 0, counting from 0)
>>
>> A more complete log is attached.
>>
>> First question: is this a known bug fixed in more recent versions?
>>
>> Second question: is there any hope of recovery?
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fs: encode_fh: return FILEID_INVALID if invalid fid_type

2013-02-11 Thread Dave Chinner
On Mon, Feb 11, 2013 at 05:25:58PM +0900, Namjae Jeon wrote:
> From: Namjae Jeon 
> 
> This patch is a follow-up to the patch below:
> 
> [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
> commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63 

> diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c
> index a836118..3391800 100644
> --- a/fs/xfs/xfs_export.c
> +++ b/fs/xfs/xfs_export.c
> @@ -48,7 +48,7 @@ static int xfs_fileid_length(int fileid_type)
>   case FILEID_INO32_GEN_PARENT | XFS_FILEID_TYPE_64FLAG:
>   return 6;
>   }
> - return 255; /* invalid */
> + return FILEID_INVALID; /* invalid */
>  }

I think you can drop the "/* invalid */" comment from there now as
it is redundant with this change.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: OSD Weights

2013-02-11 Thread Gregory Farnum
On Mon, Feb 11, 2013 at 12:43 PM, Holcombe, Christopher
 wrote:
> Hi Everyone,
>
> I just wanted to confirm my thoughts on the ceph osd weightings.  My 
> understanding is they are a statistical distribution number.  My current 
> setup has 3TB hard drives and they all have the default weight of 1.  I was 
> thinking that if I mixed in 4TB hard drives in the future it would only put 
> 3TB of data on them.  I thought if I changed the weight to 3 for the 3TB hard 
> drives and 4 for the 4TB hard drives it would correctly use the larger 
> storage disks.  Is that correct?

Yep, looks good.
-Greg
PS: This is a good question for the new ceph-users list.
(http://ceph.com/community/introducing-ceph-users/)
:)


OSD Weights

2013-02-11 Thread Holcombe, Christopher
Hi Everyone,

I just wanted to confirm my thoughts on the ceph osd weightings.  My 
understanding is they are a statistical distribution number.  My current setup 
has 3TB hard drives and they all have the default weight of 1.  I was thinking 
that if I mixed in 4TB hard drives in the future it would only put 3TB of data 
on them.  I thought if I changed the weight to 3 for the 3TB hard drives and 4 
for the 4TB hard drives it would correctly use the larger storage disks.  Is 
that correct?

Thanks,
Chris



NOTICE: This e-mail and any attachments is intended only for use by the 
addressee(s) named herein and may contain legally privileged, proprietary or 
confidential information. If you are not the intended recipient of this e-mail, 
you are hereby notified that any dissemination, distribution or copying of this 
email, and any attachments thereto, is strictly prohibited. If you receive this 
email in error please immediately notify me via reply email or at (800) 
927-9800 and permanently delete the original copy and any copy of any e-mail, 
and any printout.


Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

2013-02-11 Thread Gregory Farnum
Isaac,
I'm sorry I haven't been able to wrangle any time to look into this
more yet, but Sage pointed out in a related thread that there might be
some buggy handling of things like this if the OSD and the monitor are
located on the same host. Am I correct in assuming that with your
small cluster, all your OSDs are co-located with a monitor daemon?
-Greg

On Mon, Jan 28, 2013 at 12:17 PM, Isaac Otsiabah  wrote:
>
>
> Gregory, I recreated the osd down problem again this morning on two nodes
> (g13ct, g14ct). First, I created a 1-node cluster on g13ct (with osd.0, 1, 2)
> and then added host g14ct (osd.3, 4, 5). osd.1 went down for about a minute
> and a half after osd.3, 4, 5 were added. I have included the routing
> table of each node at the time osd.1 went down. ceph.conf and ceph-osd.1.log
> files are attached. The crush map was default. Also, it could be a timing
> issue because it does not always fail when using the default crush map; it
> takes several trials before you see it. Thank you.
>
>
> [root@g13ct ~]# netstat -r
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
> default         133.164.98.250  0.0.0.0         UG        0 0          0 eth2
> 133.164.98.0    *               255.255.255.0   U         0 0          0 eth2
> link-local      *               255.255.0.0     U         0 0          0 eth3
> link-local      *               255.255.0.0     U         0 0          0 eth0
> link-local      *               255.255.0.0     U         0 0          0 eth2
> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth0
> 192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
> 192.168.1.0     *               255.255.255.0   U         0 0          0 eth0
> [root@g13ct ~]# ceph osd tree
>
> # id    weight  type name       up/down reweight
> -1      6       root default
> -3      6               rack unknownrack
> -2      3                       host g13ct
> 0       1                               osd.0   up      1
> 1       1                               osd.1   down    1
> 2       1                               osd.2   up      1
> -4      3                       host g14ct
> 3       1                               osd.3   up      1
> 4       1                               osd.4   up      1
> 5       1                               osd.5   up      1
>
>
>
> [root@g14ct ~]# ceph osd tree
>
> # id    weight  type name       up/down reweight
> -1      6       root default
> -3      6               rack unknownrack
> -2      3                       host g13ct
> 0       1                               osd.0   up      1
> 1       1                               osd.1   down    1
> 2       1                               osd.2   up      1
> -4      3                       host g14ct
> 3       1                               osd.3   up      1
> 4       1                               osd.4   up      1
> 5       1                               osd.5   up      1
>
> [root@g14ct ~]# netstat -r
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
> default         133.164.98.250  0.0.0.0         UG        0 0          0 eth0
> 133.164.98.0    *               255.255.255.0   U         0 0          0 eth0
> link-local      *               255.255.0.0     U         0 0          0 eth3
> link-local      *               255.255.0.0     U         0 0          0 eth5
> link-local      *               255.255.0.0     U         0 0          0 eth0
> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth3
> 192.0.0.0       *               255.0.0.0       U         0 0          0 eth5
> 192.168.0.0     *               255.255.255.0   U         0 0          0 eth3
> 192.168.1.0     *               255.255.255.0   U         0 0          0 eth5
> [root@g14ct ~]# ceph osd tree
>
> # id    weight  type name       up/down reweight
> -1      6       root default
> -3      6               rack unknownrack
> -2      3                       host g13ct
> 0       1                               osd.0   up      1
> 1       1                               osd.1   down    1
> 2       1                               osd.2   up      1
> -4      3                       host g14ct
> 3       1                               osd.3   up      1
> 4       1                               osd.4   up      1
> 5       1                               osd.5   up      1
>
>
>
>
>
> Isaac
>
>
>
>
>
>
>
>
>
>
> - Original Message -
> From: Isaac Otsiabah 
> To: Gregory Farnum 
> Cc: "ceph-devel@vger.kernel.org" 
> Sent: Friday, January 25, 2013 9:51 AM
> Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host 
> to my cluster
>
>
>
> Gregory, the network physical layout is simple, the two networks are
> separate. the 192.168.0 and the 192.168.1 are not subnets within a

Re: Crash and strange things on MDS

2013-02-11 Thread Gregory Farnum
On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf  wrote:
> References:
> [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
> [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
> 1: /usr/bin/ceph-mds() [0x817e82]
> 2: (()+0xf140) [0x7f9091d30140]
> 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
> 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
> 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
> 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
> 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
> 8: (Server::kill_session(Session*)+0x137) [0x549c67]
> 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
> 10: (MDS::tick()+0x338) [0x4da928]
> 11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
> 12: (SafeTimerThread::entry()+0xd) [0x782bad]
> 13: (()+0x7ddf) [0x7f9091d28ddf]
> 14: (clone()+0x6d) [0x7f90909cc24d]

This in particular is quite odd. Do you have any logging from when
that happened? (Oftentimes the log can have a bunch of debugging
information from shortly before the crash.)

On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf  wrote:
> Furthermore, I observe another strange thing more or less related to the
> storms.
>
> During a rsync command to write ~20G of data on Ceph and during (and
> after) the storm, one OSD sends a lot of data to the active MDS
> (400Mbps peak each 6 seconds). After a quick check, I found that when I
> stop osd.23, osd.14 stops its peaks.

This is consistent with Sam's suggestion that MDS is thrashing its
cache, and is grabbing a directory object off of the OSDs. How large
are the directories you're using? If they're a significant fraction of
your cache size, it might be worth enabling the (sadly less stable)
directory fragmentation options, which will split them up into smaller
fragments that can be independently read and written to disk.
-Greg


Re: Unable to mount cephfs - can't read superblock

2013-02-11 Thread Gregory Farnum
On Sat, Feb 9, 2013 at 2:13 PM, Adam Nielsen  wrote:
 $ ceph -s
 health HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean
 monmap e1: 1 mons at {0=192.168.0.6:6789/0}, election epoch 0,
 quorum 0 0
 osdmap e3: 1 osds: 1 up, 1 in
 pgmap v119: 192 pgs: 192 active+degraded; 0 bytes data, 10204 MB
 used, 2740 GB / 2750 GB avail
 mdsmap e1: 0/0/1 up
>>>
>> In any case, this output indicates that your MDS isn't actually running, 
>> Adam, or at least isn't connected. Check and see if the process is still 
>> going?
>> You should also have minimal logging by default in /var/lib/ceph/mds*; you 
>> might find some output there that could be useful.
>
> The MDS appears to be running:
>
> $ ps -A | grep ceph
> 12903 ?00:00:17 ceph-mon
> 12966 ?00:00:10 ceph-mds
> 13047 ?00:00:31 ceph-osd
>
> And I found some logs in /var/log/ceph:
>
> $ cat /var/log/ceph/ceph-mds.0.log
> 2013-02-10 07:57:16.505842 b4aa3b70  0 mds.-1.0 ms_handle_connect on 
> 192.168.0.6:6789/0
>
> So it appears the mds is running.  Wireshark shows some traffic going between 
> hosts when the mount request comes through, but then the responses stop and 
> the client eventually gives up and the mount fails.
>
>>> You better add a second OSD or just do a mkcephfs again with a second
>>> OSD in the configuration.
>
> I just tried this and it fixed the unclean pgs issue, but I still can't mount 
> a cephfs filesystem:
>
> $ ceph -s
>health HEALTH_OK
>monmap e1: 1 mons at {0=192.168.0.6:6789/0}, election epoch 0, quorum 0 0
>osdmap e5: 2 osds: 2 up, 2 in
> pgmap v107: 384 pgs: 384 active+clean; 0 bytes data, 40423 MB used, 5461 
> GB / 5501 GB avail
>mdsmap e1: 0/0/1 up
>
> remote$ mount -t ceph 192.168.0.6:6789:/ /mnt/ceph/
> mount: 192.168.0.6:6789:/: can't read superblock
>
> Running the mds daemon in debug mode says this:
>
> ...
> 2013-02-10 08:07:03.550977 b2a83b70 10 mds.-1.0 MDS::ms_get_authorizer 
> type=mon
> 2013-02-10 08:07:03.551840 b4a87b70  0 mds.-1.0 ms_handle_connect on 
> 192.168.0.6:6789/0
> 2013-02-10 08:07:03.555307 b738c710 10 mds.-1.0 beacon_send up:boot seq 1 
> (currently up:boot)
> 2013-02-10 08:07:03.555629 b738c710 10 mds.-1.0 create_logger
> 2013-02-10 08:07:03.564138 b4a87b70  5 mds.-1.0 handle_mds_map epoch 1 from 
> mon.0
> 2013-02-10 08:07:03.564348 b4a87b70 10 mds.-1.0  my compat 
> compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
> ranges,3=default file layouts on dirs,4=dir inode in separate object}
> 2013-02-10 08:07:03.564454 b4a87b70 10 mds.-1.0  mdsmap compat 
> compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
> ranges,3=default file layouts on dirs,4=dir inode in separate object}
> 2013-02-10 08:07:03.564547 b4a87b70 10 mds.-1.-1 map says i am 
> 192.168.0.6:6800/16077 mds.-1.-1 state down:dne
> 2013-02-10 08:07:03.564654 b4a87b70 10 mds.-1.-1 not in map yet
> 2013-02-10 08:07:07.67 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 2 
> (currently down:dne)
> 2013-02-10 08:07:11.555858 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 3 
> (currently down:dne)
> 2013-02-10 08:07:15.556123 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 4 
> (currently down:dne)
> 2013-02-10 08:07:19.556411 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 5 
> (currently down:dne)
> 2013-02-10 08:07:23.556654 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 6 
> (currently down:dne)
> 2013-02-10 08:07:27.556931 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 7 
> (currently down:dne)
> 2013-02-10 08:07:31.557189 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 8 
> (currently down:dne)
> ...

How bizarre. That indicates the MDS is running and is requesting to
become active, but the monitor for some reason isn't letting it in.
Can you restart your monitor with logging on as well (--debug_mon 20
on the end of the command line, or "debug mon = 20" in the config) and
then try again?
The other possibility is that maybe your MDS doesn't have the right
access permissions; does "ceph auth list" include an MDS, and does it
have any permissions associated?
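For example (a sketch only; the key and caps shown are what a typical
mkcephfs-generated setup looks like, not output from your cluster):

    # in ceph.conf on the monitor host, then restart ceph-mon
    [mon]
        debug mon = 20

    $ ceph auth list
    ...
    mds.0
            key: AQD...
            caps: [mds] allow
            caps: [mon] allow rwx
            caps: [osd] allow *

If the mds entry is missing, or has no caps at all, that would fit the
second theory.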
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: preferred OSD

2013-02-11 Thread Gregory Farnum
On Fri, Feb 8, 2013 at 4:45 PM, Sage Weil  wrote:
> Hi Marcus-
>
> On Fri, 8 Feb 2013, Marcus Sorensen wrote:
>> I know people have been disscussing on and off about providing a
>> "preferred OSD" for things like multi-datacenter, or even within a
>> datacenter, choosing an OSD that would avoid traversing uplinks.  Has
>> there been any discussion on how to do this? I seem to remember people
>> saying things like 'the crush map doesn't work that way at the
>> moment'. Presumably, when a client needs to access an object, it looks
>> up where the object should be stored via the crush map, which returns
>> all OSDs that could be read from.

Exactly.

>> I was thinking this morning that you
>> could potentially leave the crush map out of it, by setting a location
>> for each OSD in the ceph.conf, and an /etc/ceph/location file for the
>> client.  Then use the absolute value of the difference to determine
>> preferred OSD. So, if OSD0 was location=1, and OSD1 was location=3,
>> and client 1 was location=2, then it would do the normal thing, but if
>> client 1 was location=1.3, then it would prefer OSD0 for reads.
>> Perhaps that's overly simplistic and wouldn't scale to meet everyone's
>> requirements, but you could do multiple locations and sprinkle clients
>> in between them all in various ways.  Or perhaps the location is a
>> matrix, so you could literally map it out on a grid with a set of
>> coordinates. What ideas are being discussed around how to implement
>> this?
>
> We can do something like this for reads today, where we pick a read
> replica based on the closest IP or some other metric/mask.  We generally
> don't enable this because it leads to non-optimal cache behavior, but it
> could in principle be enabled via a config option for certain clusters
> (and in fact some of that code is already in place).

Just to be specific — there are currently flags which will let the
client read from local-host if it can figure that out, and those
aren't heavily-tested but do work when we turn them on. Other metrics
of "close" don't appear yet, though.
In general, CRUSH locations seem like a good measure of closeness that
the client could rely on, rather than a separate "location" value, but
it does restrict the usefulness if you've configured multiple CRUSH
root nodes. I think it would need to support a tree of some kind
though, rather than just a linear value.
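To illustrate, a decompiled CRUSH map already expresses that tree (names
here are invented and the syntax is abbreviated):

    host node-a {
            id -2
            alg straw
            hash 0
            item osd.0 weight 1.000
    }
    rack rack-1 {
            id -3
            alg straw
            hash 0
            item node-a weight 1.000
    }
    root default {
            id -1
            alg straw
            hash 0
            item rack-1 weight 1.000
    }

A client that knows it lives under rack-1 could prefer replicas on OSDs
under rack-1 before anything under a different rack or root.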
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: chain_fsetxattr extra chunk removal

2013-02-11 Thread Loic Dachary


On 02/11/2013 06:13 AM, Yehuda Sadeh wrote:
> On Thu, Feb 7, 2013 at 12:59 PM, Loic Dachary  wrote:
>> Hi,
>>
>> While writing unit tests for chain_xattr.cc I tried to understand how to 
>> create the conditions to trigger this part of the chain_fsetxattr function:
>>
>>   /* if we're exactly at a chunk size, remove the next one (if wasn't removed
>>  before) */
>>   if (ret >= 0 && chunk_size == CHAIN_XATTR_MAX_BLOCK_LEN) {
>> get_raw_xattr_name(name, i, raw_name, sizeof(raw_name));
>> int r = sys_fremovexattr(fd, raw_name);
>> if (r < 0 && r != -ENODATA)
>>   ret = r;
>>   }
>>
>> I suspect this cleans up extra empty attributes created as a side effect of 
>> a previous version of the function. Or I just don't understand the case it 
>> addresses.
>>
>> I'd very much appreciate a hint :-)
>>
> 
> Well, the code has changed a bit, but originally when a chain was
> overwritten we didn't bother to remove the xattrs tail. When we read
> the chain we stop either when we got a short xattr, or when the next
> xattr in the chain didn't exist.  So when writing an xattr that was
> perfectly aligned with the block len we had to remove the next xattr
> in order to make sure that readers will not over-read. I'm not too sure
> whether that's still the case; Sam might have a better idea.
> In any case, it might be a good idea to test the case where we have a
> big xattr that spans multiple blocks (e.g., > 3) and is being
> overwritten by a short xattr. We probably also need to test it with
> different combinations of aligned and non-aligned block sizes.

I understand now and I'll modify the pull request 
https://github.com/ceph/ceph/pull/40 accordingly.

Thanks :-)
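For anyone else following along, the layout being discussed looks roughly
like this (an illustration only; I am assuming the raw names produced by
get_raw_xattr_name are <name>, <name>@1, <name>@2, ... -- to be verified
against chain_xattr.cc):

    # a value longer than one chunk is spread over several raw xattrs
    $ getfattr -d -m '.' ./object_file
    user.big="...exactly CHAIN_XATTR_MAX_BLOCK_LEN bytes..."
    user.big@1="...the remainder..."

    # if user.big is later overwritten with a value that is exactly one
    # chunk long and user.big@1 is not removed, a reader sees a full
    # first chunk, keeps walking the chain, and picks up the stale tail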

> 
> Thanks,
> Yehuda
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: rest mgmt api

2013-02-11 Thread Gregory Farnum
On Wed, Feb 6, 2013 at 12:14 PM, Sage Weil  wrote:
> On Wed, 6 Feb 2013, Dimitri Maziuk wrote:
>> On 02/06/2013 01:34 PM, Sage Weil wrote:
>>
>> > I think the one caveat here is that having a single registry for commands
>> > in the monitor means that commands can come in two flavors: vector<string>
>> > (cli) and URL (presumably in json form).  But a single command
>> > dispatch/registry framework will make that distinction pretty simple...
>>
>> Any reason you can't have your CLI json-encode the commands (or,
>> conversely, your cgi/wsgi/php/servlet URL handler decode them into
>> vector) before passing them on to the monitor?
>
> We can, but they won't necessarily look the same, because it is unlikely
> we can make a sane 1:1 translation of the CLI to REST that makes sense,
> and it would be nice to avoid baking knowledge about the individual
> commands into the client side.

I disagree and am with Joao on this one — the monitor parsing is
ridiculous as it stands right now, and we should be trying to get rid
of the manual string parsing. The monitors should be parsing JSON
commands that are sent by the client; it makes validation and the
logic control flow a lot easier. We're going to want some level of
intelligence in the clients so that they can tailor themselves to the
appropriate UI conventions, and having two different parsing paths in
the monitors is just asking for trouble: they will get out of sync and
have different kinds of parsing errors.

What we could do is have the monitors speak JSON only, and then give
the clients a minimal intelligence so that the CLI could (for
instance) prettify the options for commands it knows about, but still
allow pass-through for access to newer commands it hasn't yet heard
of.
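To make the split concrete with a made-up encoding (this is not an
existing wire format, just an illustration):

    # what the user types at the CLI
    $ ceph osd pool create mypool 128

    # what a JSON-only monitor might actually be sent
    { "prefix": "osd pool create", "pool": "mypool", "pg_num": 128 }

A CLI that has never heard of the command could still pass the raw words
through, e.g. { "args": ["osd", "pool", "create", "mypool", "128"] }.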
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: File exists not handled in 0.48argonaut1

2013-02-11 Thread Mandell Degerness
Anyone have any thoughts on this???  It looks like I may have to wipe
out the affected OSDs and rebuild them, but I'm afraid that may result
in data loss because of the old OSD first crush map in place :(.

On Fri, Feb 8, 2013 at 1:36 PM, Mandell Degerness
 wrote:
> We ran into an error which appears very much like a bug fixed in 0.44.
>
> This cluster is running version:
>
> ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
>
> The error line is:
>
> Feb  8 18:50:07 192.168.8.14 ceph-osd: 2013-02-08 18:50:07.545682
> 7f40f9f08700  0 filestore(/mnt/osd97)  error (17) File exists not
> handled on operation 20 (11279344.0.0, or op 0, counting from 0)
>
> A more complete log is attached.
>
> First question: is this a known bug fixed in more recent versions?
>
> Second question: is there any hope of recovery?
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Crash and strange things on MDS

2013-02-11 Thread Kevin Decherf
On Mon, Feb 11, 2013 at 11:00:15AM -0600, Sam Lang wrote:
> Hi Kevin, sorry for the delayed response.
> This looks like the mds cache is thrashing quite a bit, and with
> multiple MDSs the tree partitioning is causing those estale messages.
> In your case, you should probably run with just a single active mds (I
> assume all three MDSs are active, but ceph -s will tell you for sure),
> and the others as standby.  I don't think you'll be able to do that
> without starting over though.

Hi Sam,

I know that MDS clustering is a bit buggy so I have only one active MDS
on this cluster.

Here is the output of ceph -s:

   ~ # ceph -s
  health HEALTH_OK
  monmap e1: 3 mons at {a=x:6789/0,b=y:6789/0,c=z:6789/0}, election epoch 
48, quorum 0,1,2 a,b,c
  osdmap e79: 27 osds: 27 up, 27 in
   pgmap v895343: 5376 pgs: 5376 active+clean; 18987 MB data, 103 GB used, 
21918 GB / 23201 GB avail
  mdsmap e73: 1/1/1 up {0=b=up:active}, 2 up:standby


> Also, you might want to increase the size of the mds cache if you have
> enough memory on that machine.  mds cache size defaults to 100k, you
> might increase it to 300k and see if you get the same problems.

I have 24GB of memory for each MDS; I will try to increase this value.
Thanks for the advice.
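For reference, that would be something like the following in ceph.conf,
followed by restarting the MDS (option name taken from Sam's mail; the
value is a count of cached inodes):

    [mds]
        mds cache size = 300000    # default 100000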

> Do you have debug logging enabled when you see this crash?  Can you
> compress that mds log and post it somewhere or email it to me?

Yes, I have 34GB of raw logs (for this issue) but I have no debug log
of the beginning of the storm itself. I will upload a compressed
archive.


Furthermore, I observe another strange thing more or less related to the
storms.

During an rsync command writing ~20G of data to Ceph, and during (and
after) the storm, one OSD sends a lot of data to the active MDS
(400Mbps peaks every 6 seconds). After a quick check, I found that when I
stop osd.23, osd.14 stops its peaks.

I will forward a copy of the debug enabled log of osd14.

The only significant difference between osd.23 and the others is its hb_in
list, where osd.14 is missing (but I think it's unrelated).

   ~ # ceph pg dump
   osdstat  kbused   kbavail  kb hb in hb out
   0  4016228  851255948   901042464   
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   1  4108748  851163428   901042464   
[0,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20,21,22,23,24,25,26]  []
   2  4276584  850995592   901042464   
[0,1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   3  3997368  851274808   901042464   
[0,1,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] []
   4  4358212  850913964   901042464   
[0,1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   5  4039112  851233064   901042464   
[0,1,2,3,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   6  3971568  851300608   901042464   
[0,1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   7  3942556  851329620   901042464   
[0,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   8  4275584  850996592   901042464   
[0,1,2,3,4,5,6,7,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   9  4279308  850992868   901042464   
[0,1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   10 3728136  851544040   901042464   
[0,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]   []
   11 3934096  851338080   901042464   
[0,1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]   []
   12 3991600  851280576   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,23,24,25,26]   []
   13 4211228  851060948   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23,24,25,26]   []
   14 4169476  851102700   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26]   []
   15 4385584  850886592   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,16,17,18,19,20,21,22,23,24,25,26]   []
   16 3761176  851511000   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17,18,19,20,21,22,23,24,25,26]   []
   17 3646096  851626080   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20,21,22,23,24,25,26]   []
   18 4119448  851152728   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,19,20,21,22,23,24,25,26]   []
   19 4592992  850679184   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,20,21,22,23,24,25,26]   []
   20 3740840  851531336   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26]   []
   21 4363552  850908624   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,22,23,24,25,26]   []
   22 3831420  851440756   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,23,24,25,26]   []
   23 3681648  851590528   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,24,25,26]   []
   24 3946192  851325984   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,25,26]   []

Re: Crash and strange things on MDS

2013-02-11 Thread Sam Lang
On Mon, Feb 11, 2013 at 7:05 AM, Kevin Decherf  wrote:
> On Mon, Feb 04, 2013 at 07:01:54PM +0100, Kevin Decherf wrote:
>> Hey everyone,
>>
>> It's my first post here to expose a potential issue I found today using
>> Ceph 0.56.1.
>>
>> The cluster configuration is, briefly: 27 osd of ~900GB and 3 MON/MDS.
>> All nodes are running Exherbo (source-based distribution) with Ceph
>> 0.56.1 and Linux 3.7.0. We are only using CephFS on this cluster which
>> is mounted on ~60 clients (increasing each day). Objects are replicated
>> three times and the cluster handles only 7GB of data atm for 350k
>> objects.
>>
>> In certain conditions (I don't know them atm), some clients hang,
>> generate CPU overloads (kworker) and are unable to perform any IO on
>> Ceph. The active MDS has ~20Mbps in/out during the issue (less than
>> 2Mbps in normal activity). I don't know if it's directly linked but we
>> also observe a lot of missing files at the same time.
>>
>> The problem is similar to this one [1].
>>
>> A restart of the client or the MDS was enough before today, but we found
>> a new behavior: the active MDS consumes a lot of CPU during 3 to 5 hours
>> with ~25% clients hanging.
>>
>> In logs I found a segfault with this backtrace [2] and 100,000 dumped
>> events during the first hang. We observed another hang which produces
>> lot of these events (in debug mode):
>>- "mds.0.server FAIL on ESTALE but attempting recovery"
>>- "mds.0.server reply_request -116 (Stale NFS file handle)
>>   client_request(client.10991:1031 getattr As #104bab0
>>   RETRY=132)"

Hi Kevin, sorry for the delayed response.
This looks like the mds cache is thrashing quite a bit, and with
multiple MDSs the tree partitioning is causing those estale messages.
In your case, you should probably run with just a single active mds (I
assume all three MDSs are active, but ceph -s will tell you for sure),
and the others as standby.  I don't think you'll be able to do that
without starting over though.

Also, you might want to increase the size of the mds cache if you have
enough memory on that machine.  mds cache size defaults to 100k, you
might increase it to 300k and see if you get the same problems.

>>
>> We have no profiling tools available on these nodes, and I don't know
>> what I should search in the 35 GB log file.
>>
>> Note: the segmentation fault occured only once but the problem was
>> observed four times on this cluster.
>>
>> Any help may be appreciated.
>>
>> References:
>> [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
>> [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
>> 1: /usr/bin/ceph-mds() [0x817e82]
>> 2: (()+0xf140) [0x7f9091d30140]
>> 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
>> 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
>> 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
>> 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
>> 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
>> 8: (Server::kill_session(Session*)+0x137) [0x549c67]
>> 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
>> 10: (MDS::tick()+0x338) [0x4da928]
>> 11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
>> 12: (SafeTimerThread::entry()+0xd) [0x782bad]
>> 13: (()+0x7ddf) [0x7f9091d28ddf]
>> 14: (clone()+0x6d) [0x7f90909cc24d]

>
> I found a possible cause/way to reproduce this issue.
> We now have ~90 clients for 18GB / 650k objects, and the storm occurs
> when we execute an "intensive IO" command (tar of the whole pool / rsync
> in one folder) on one of our clients (the only one which uses ceph-fuse;
> I don't know whether the issue is limited to it).

Do you have debug logging enabled when you see this crash?  Can you
compress that mds log and post it somewhere or email it to me?

Thanks,
-sam

>
> Any idea?
>
> Cheers,
> --
> Kevin Decherf - @Kdecherf
> GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
> http://kdecherf.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] snapshot, clone and mount a VM-Image

2013-02-11 Thread Sage Weil
On Mon, 11 Feb 2013, Wolfgang Hennerbichler wrote:
> 
> 
> On 02/11/2013 03:02 PM, Wido den Hollander wrote:
> 
> > You are looking at a way to "extract" the snapshot, correct?
> 
> No.
> 
> > Why would
> > you want to mount it and backup the files?
> 
> because then I can do things like incremental backups. There will be a
> ceph cluster at an ISP soon, which hosts various services on various VMs,
> and it is important that the mail spool, for example, is backed up
> efficiently, because it's huge and the number of files is also high.

Note that an alternative way to approach incremental backups is at the 
block device level.  We plan to implement an incremental backup function 
for the relative change between two snapshots (or a snapshot and the 
head).  It's O(n) in the size of the device rather than the number of files, but 
should be more efficient for all but the most sparse of images.  The 
implementation should be simple; the challenge is mostly around the 
incremental file format, probably.

That doesn't help you now, but would be a relatively self-contained piece 
of functionality for someone to contribute to RBD.  This isn't a top 
priority yet, so it will be a while before the inktank devs can get to it.
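To sketch what that could look like from the command line once it exists
(the export-diff name and syntax below are invented for illustration; no
such subcommand ships today):

    rbd snap create rbd/mailspool@2013-02-10
    # ... a day of changes ...
    rbd snap create rbd/mailspool@2013-02-11
    # write out only the blocks that changed between the two snapshots
    rbd export-diff --from-snap 2013-02-10 rbd/mailspool@2013-02-11 \
        mailspool-0210-0211.diff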

sage


> 
> > Couldn't you better handle this in the Virtual Machine itself?
> 
> not really. open, changing files, a lot of virtual machines that one
> needs to take care of, and so on.
> 
> > If you want to backup the virtual machines to an extern location you
> > could use either "rbd" or "qemu-img" to get the snapshot out of the Ceph
> > cluster:
> > 
> > $ rbd export --snap <snapname> <imagename> <destination-file>
> > 
> > Or use qemu-img
> > 
> > $ qemu-img convert -f raw -O qcow2 -s <snapname> rbd:rbd/<imagename> \
> > <imagename>.qcow2
> > 
> > You then get files which you can backup externally.
> > 
> > Would that work?
> 
> sure, but this is a very inefficient way of backing things up, because
> one would back up at the block level. I want to back up at the filesystem level.
> 
> > Wido
> > 
> >> thanks a lot for you answers
> >> Wolfgang
> >>
> > 
> > 
> 
> 
> -- 
> DI (FH) Wolfgang Hennerbichler
> Software Development
> Unit Advanced Computing Technologies
> RISC Software GmbH
> A company of the Johannes Kepler University Linz
> 
> IT-Center
> Softwarepark 35
> 4232 Hagenberg
> Austria
> 
> Phone: +43 7236 3343 245
> Fax: +43 7236 3343 250
> wolfgang.hennerbich...@risc-software.at
> http://www.risc-software.at
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IPv6 address confusion in OSDs

2013-02-11 Thread Sage Weil
On Mon, 11 Feb 2013, Simon Leinen wrote:
> Sage Weil writes:
> > On Mon, 11 Feb 2013, Simon Leinen wrote:
> >> We run a ten-node 64-OSD Ceph cluster and use IPv6 where possible.
> 
> I should have mentioned that this is under Ubuntu 12.10 with version
> 0.56.1-1quantal of the ceph packages.  Sorry about the omission.
> 
> >> Today I noticed this error message from an OSD just after I restarted
> >> it (in an attempt to resolve an issue with some "stuck" pgs that
> >> included that OSD):
> >> 
> >> 2013-02-11 09:24:57.232811 osd.35 [ERR] map e768 had wrong cluster addr 
> >> ([2001:620:0:6::106]:6822/1990 != my 
> >> [fe80::67d:7bff:fef1:78b%vlan301]:6822/1990)
> >> 
> >> These two addresses belong to the same interface:
> >> 
> >> root@h1:~# ip -6 addr list dev vlan301
> >> 7: vlan301@bond0:  mtu 1500 
> >> inet6 2001:620:0:6::106/64 scope global 
> >> valid_lft forever preferred_lft forever
> >> inet6 fe80::67d:7bff:fef1:78b/64 scope link 
> >> valid_lft forever preferred_lft forever
> >> 
> >> 2001:620:... is the global-scope address, and this is how OSDs are
> >> addressed in our ceph.conf.  fe80:... is the link-local address that
> >> every IPv6 interface has.  Shouldn't these be treated as equivalent?
> 
> > Is this OSD by chance sharing a host with one of the monitors?
> 
> Yes, indeed! We have five monitors, i.e. every other server runs a
> ceph-mon in addition to the 4-9 ceph-osd processes each server has.
> This (h1) is one of the servers that has both.
> 
> > The 'my address' value is learned by looking at the socket we connect to 
> > the monitor with...
> 
> Thanks for the hint! I'll look at the code and try to understand
> what's happening and how this could be avoided.
> 
> The cluster seems to have recovered from this particular error by
> itself. 

That makes sense if the trigger here is that it randomly chose to connect to 
the local monitor first and learned the address that way.  Adding 
'debug ms = 20' to your ceph.conf may give a hint; look for a 'learned 
my addr' message (or something similar) right at startup time.
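Concretely, something like this (log path assumed to be the default
/var/log/ceph location):

    [osd.35]
        debug ms = 20

    # restart the osd, then see which addresses the messenger picked up
    grep -i learned /var/log/ceph/ceph-osd.35.log | head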

> But in general, when I reboot servers, there's often some pgs
> that remain stuck, and I have to restart some OSDs until ceph -w shows
> everything as "active+clean".

Note that 'ceph osd down NN' may give similar results to restarting the 
daemon.

> (Our network setup is somewhat complex, with IPv6 over VLANs over
> "bonded" 10GEs redundantly connected to a pair of Brocade switches
> running VLAG (something like multi-chassis Etherchannel).  So it's
> possible that there are some connectivity issues hiding somewhere.)

Let us know what you find!
sage


> -- 
> Simon.
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/15] kv_flat_btree_async.cc: use vector instead of VLA's

2013-02-11 Thread Sage Weil
On Mon, 11 Feb 2013, Danny Al-Gaaf wrote:
> Am 10.02.2013 06:57, schrieb Sage Weil:
> > On Thu, 7 Feb 2013, Danny Al-Gaaf wrote:
> >> Fix "variable length array of non-POD element type" errors caused by
> >> using librados::ObjectWriteOperation VLAs. (-Wvla)
> >>
> >> Signed-off-by: Danny Al-Gaaf 
> >> ---
> >>  src/key_value_store/kv_flat_btree_async.cc | 14 +++---
> >>  1 file changed, 7 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/src/key_value_store/kv_flat_btree_async.cc 
> >> b/src/key_value_store/kv_flat_btree_async.cc
> >> index 96c6cb0..4342e70 100644
> >> --- a/src/key_value_store/kv_flat_btree_async.cc
> >> +++ b/src/key_value_store/kv_flat_btree_async.cc
> >> @@ -1119,9 +1119,9 @@ int KvFlatBtreeAsync::cleanup(const index_data 
> >> &idata, const int &errno) {
> >>  //all changes were created except for updating the index and possibly
> >>  //deleting the objects. roll forward.
> >>  vector<pair<pair<int, string>, librados::ObjectWriteOperation*> > ops;
> >> -librados::ObjectWriteOperation owos[idata.to_delete.size() + 1];
> >> +vector<librados::ObjectWriteOperation*> owos(idata.to_delete.size() + 1);
> > 
> > I haven't read much of the surrounding code, but from what is included 
> > here I don't think this is equivalent... these are just null pointers 
> > initially, and so
> > 
> >>  for (int i = 0; i <= (int)idata.to_delete.size(); ++i) {
> >> -  ops.push_back(make_pair(pair<int, string>(0, ""), &owos[i]));
> >> +  ops.push_back(make_pair(pair<int, string>(0, ""), owos[i]));
> > 
> > this doesn't do anything useful... owos[i] may as well be NULL.  Why not 
> > make it
> > 
> > vector<librados::ObjectWriteOperation> owos(...)
> > 
> > ?
> 
> Because this would lead to a linker error:
> 
> kv_flat_btree_async.o: In function `void
> std::__uninitialized_fill_n<false>::__uninit_fill_n<librados::ObjectWriteOperation*,
> unsigned long, librados::ObjectWriteOperation>(librados::ObjectWriteOperation*,
> unsigned long, librados::ObjectWriteOperation const&)':
> /usr/bin/../lib64/gcc/x86_64-suse-linux/4.7/../../../../include/c++/4.7/bits/stl_uninitialized.h:188:
> undefined reference to
> `librados::ObjectOperation::ObjectOperation(librados::ObjectOperation
> const&)'
> /usr/bin/../lib64/gcc/x86_64-suse-linux/4.7/../../../../include/c++/4.7/bits/stl_uninitialized.h:188:
> undefined reference to
> `librados::ObjectOperation::ObjectOperation(librados::ObjectOperation
> const&)'
> 
> 
> Because in src/include/rados/librados.hpp
> librados::ObjectOperation::ObjectOperation(librados::ObjectOperation
> const&) is defined, but not implemented in librados.cc.
> 
> Not sure if removing ObjectOperation(librados::ObjectOperation const&)
> is the way to go here.

Oh, I see... yeah, we shouldn't remove that.  Probably we should 
restructure the code to use a list<>, which doesn't require a copy 
constructor or assignment operator.

Note that this particular code shouldn't hold up the rest of the patches, 
since it's not being used by anything (yet!).

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IPv6 address confusion in OSDs

2013-02-11 Thread Simon Leinen
Sage Weil writes:
> On Mon, 11 Feb 2013, Simon Leinen wrote:
>> We run a ten-node 64-OSD Ceph cluster and use IPv6 where possible.

I should have mentioned that this is under Ubuntu 12.10 with version
0.56.1-1quantal of the ceph packages.  Sorry about the omission.

>> Today I noticed this error message from an OSD just after I restarted
>> it (in an attempt to resolve an issue with some "stuck" pgs that
>> included that OSD):
>> 
>> 2013-02-11 09:24:57.232811 osd.35 [ERR] map e768 had wrong cluster addr 
>> ([2001:620:0:6::106]:6822/1990 != my 
>> [fe80::67d:7bff:fef1:78b%vlan301]:6822/1990)
>> 
>> These two addresses belong to the same interface:
>> 
>> root@h1:~# ip -6 addr list dev vlan301
>> 7: vlan301@bond0:  mtu 1500 
>> inet6 2001:620:0:6::106/64 scope global 
>> valid_lft forever preferred_lft forever
>> inet6 fe80::67d:7bff:fef1:78b/64 scope link 
>> valid_lft forever preferred_lft forever
>> 
>> 2001:620:... is the global-scope address, and this is how OSDs are
>> addressed in our ceph.conf.  fe80:... is the link-local address that
>> every IPv6 interface has.  Shouldn't these be treated as equivalent?

> Is this OSD by chance sharing a host with one of the monitors?

Yes, indeed! We have five monitors, i.e. every other server runs a
ceph-mon in addition to the 4-9 ceph-osd processes each server has.
This (h1) is one of the servers that has both.

> The 'my address' value is learned by looking at the socket we connect to 
> the monitor with...

Thanks for the hint! I'll look at the code and try to understand
what's happening and how this could be avoided.

The cluster seems to have recovered from this particular error by
itself.  But in general, when I reboot servers, there's often some pgs
that remain stuck, and I have to restart some OSDs until ceph -w shows
everything as "active+clean".

(Our network setup is somewhat complex, with IPv6 over VLANs over
"bonded" 10GEs redundantly connected to a pair of Brocade switches
running VLAG (something like multi-chassis Etherchannel).  So it's
possible that there are some connectivity issues hiding somewhere.)
-- 
Simon.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Client can't reboot when rbd volume is mounted.

2013-02-11 Thread Sage Weil
On Mon, 11 Feb 2013, Roman Alekseev wrote:
> On 11.02.2013 09:36, Sage Weil wrote:
> > On Mon, 11 Feb 2013, Roman Alekseev wrote:
> > > Hi,
> > > 
> > > When I try to reboot a client server  without unmounting of rbd volume
> > > manually
> > > its services stop working but server doesn't reboot completely and show
> > > the
> > > following logs in KVM console:
> > > 
> > > [235618.0202207] libceph: connect 192.168.0.19:6789 error -101
> > That is
> > 
> > #define ENETUNREACH 101 /* Network is unreachable */
> > 
> > Note that that (or any other) socket error is not necessarily fatal; the
> > kernel client will retry and eventually connect to that or another OSD
> > to complete the IO.  Are you observing that the RBD image hangs or
> > something?
> > 
> > You can peek at in-flight IO (and other state) with
> > 
> >   cat /sys/kernel/debug/ceph/*/osdc
> > 
> > unmount/unmap should not be necessary in any case unless there is a bug.
> > We backported a bunch of stuff to 3.6.6, so 3.6.10 ought to be okay.  You
> > might try a newer 3.6.x kernel too; I forget if there was a second batch
> > of fixes..
> > 
> > sage
> 
> Hi Sage,
> 
> > #define ENETUNREACH 101 /* Network is unreachable */
> 
> The reason for this error is that networking stops working after the
> server reset request is performed.
> 
> > Are you observing that the RBD image hangs or something?
> 
> the RBD works properly. It is just mapped and mounted on the client server.
> 
> # /dev/rbd1  99G  616M   93G   1% /home/test

I think I'm confused about what you mean by 'server'.  Do you mean the 
host that rbd is mapped on, or the host(s) where the ceph-osd's are 
running?

By 'the RBD works properly' do you mean the client where it is mapped?  In 
which case, what exactly is the problem?

> The "/sys/kernel/debug" folder is empty, how to put 'ceph/*/osdc' content into
> it?

'mount -t debugfs none /sys/kernel/debug' and it will appear (along with 
other fun stuff)...
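That is, on the client (a sketch; adjust paths to taste):

    mount -t debugfs none /sys/kernel/debug
    cat /sys/kernel/debug/ceph/*/osdc      # in-flight osd requests, if any

    # optionally make the mount permanent
    echo 'debugfs  /sys/kernel/debug  debugfs  defaults  0 0' >> /etc/fstab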

sage


> 
> I've updated the kernel to version 3.7.4 but the problem still persists.
> 
> Thanks
> 
> -- 
> Kind regards,
> 
> R. Alekseev
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IPv6 address confusion in OSDs

2013-02-11 Thread Sage Weil
On Mon, 11 Feb 2013, Simon Leinen wrote:
> We run a ten-node 64-OSD Ceph cluster and use IPv6 where possible.
> 
> Today I noticed this error message from an OSD just after I restarted
> it (in an attempt to resolve an issue with some "stuck" pgs that
> included that OSD):
> 
> 2013-02-11 09:24:57.232811 osd.35 [ERR] map e768 had wrong cluster addr 
> ([2001:620:0:6::106]:6822/1990 != my 
> [fe80::67d:7bff:fef1:78b%vlan301]:6822/1990)
> 
> These two addresses belong to the same interface:
> 
> root@h1:~# ip -6 addr list dev vlan301
> 7: vlan301@bond0:  mtu 1500 
> inet6 2001:620:0:6::106/64 scope global 
>valid_lft forever preferred_lft forever
> inet6 fe80::67d:7bff:fef1:78b/64 scope link 
>valid_lft forever preferred_lft forever
> 
> 2001:620:... is the global-scope address, and this is how OSDs are
> addressed in our ceph.conf.  fe80:... is the link-local address that
> every IPv6 interface has.  Shouldn't these be treated as equivalent?

Is this OSD by chance sharing a host with one of the monitors?

The 'my address' value is learned by looking at the socket we connect to 
the monitor with...
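One way to keep the learned address from ending up link-local is to pin
the daemon addresses, or to restrict them to the global-scope prefix (a
sketch; please verify the exact option spelling against your version):

    [osd.35]
        public addr  = 2001:620:0:6::106
        cluster addr = 2001:620:0:6::106

    # or globally, by network
    [global]
        public network  = 2001:620:0:6::/64
        cluster network = 2001:620:0:6::/64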

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fs: encode_fh: return FILEID_INVALID if invalid fid_type

2013-02-11 Thread Sage Weil
Acked-by: Sage Weil 

On Mon, 11 Feb 2013, Namjae Jeon wrote:

> From: Namjae Jeon 
> 
> This patch is a follow up on below patch:
> 
> [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
> commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63 
> 
> Signed-off-by: Namjae Jeon 
> Signed-off-by: Vivek Trivedi 
> Acked-by: Steven Whitehouse 
> ---
>  fs/btrfs/export.c   |4 ++--
>  fs/ceph/export.c|4 ++--
>  fs/fuse/inode.c |2 +-
>  fs/gfs2/export.c|4 ++--
>  fs/isofs/export.c   |4 ++--
>  fs/nilfs2/namei.c   |4 ++--
>  fs/ocfs2/export.c   |4 ++--
>  fs/reiserfs/inode.c |4 ++--
>  fs/udf/namei.c  |4 ++--
>  fs/xfs/xfs_export.c |4 ++--
>  mm/cleancache.c |2 +-
>  mm/shmem.c  |2 +-
>  12 files changed, 21 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
> index 614f34a..81ee29e 100644
> --- a/fs/btrfs/export.c
> +++ b/fs/btrfs/export.c
> @@ -22,10 +22,10 @@ static int btrfs_encode_fh(struct inode *inode, u32 *fh, 
> int *max_len,
>  
>   if (parent && (len < BTRFS_FID_SIZE_CONNECTABLE)) {
>   *max_len = BTRFS_FID_SIZE_CONNECTABLE;
> - return 255;
> + return FILEID_INVALID;
>   } else if (len < BTRFS_FID_SIZE_NON_CONNECTABLE) {
>   *max_len = BTRFS_FID_SIZE_NON_CONNECTABLE;
> - return 255;
> + return FILEID_INVALID;
>   }
>  
>   len  = BTRFS_FID_SIZE_NON_CONNECTABLE;
> diff --git a/fs/ceph/export.c b/fs/ceph/export.c
> index ca3ab3f..16796be 100644
> --- a/fs/ceph/export.c
> +++ b/fs/ceph/export.c
> @@ -81,7 +81,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, 
> int *max_len,
>   if (parent_inode) {
>   /* nfsd wants connectable */
>   *max_len = connected_handle_length;
> - type = 255;
> + type = FILEID_INVALID;
>   } else {
>   dout("encode_fh %p\n", dentry);
>   fh->ino = ceph_ino(inode);
> @@ -90,7 +90,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, 
> int *max_len,
>   }
>   } else {
>   *max_len = handle_length;
> - type = 255;
> + type = FILEID_INVALID;
>   }
>   if (dentry)
>   dput(dentry);
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 9876a87..973e8f0 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -679,7 +679,7 @@ static int fuse_encode_fh(struct inode *inode, u32 *fh, 
> int *max_len,
>  
>   if (*max_len < len) {
>   *max_len = len;
> - return  255;
> + return  FILEID_INVALID;
>   }
>  
>   nodeid = get_fuse_inode(inode)->nodeid;
> diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c
> index 4767774..9973df4 100644
> --- a/fs/gfs2/export.c
> +++ b/fs/gfs2/export.c
> @@ -37,10 +37,10 @@ static int gfs2_encode_fh(struct inode *inode, __u32 *p, 
> int *len,
>  
>   if (parent && (*len < GFS2_LARGE_FH_SIZE)) {
>   *len = GFS2_LARGE_FH_SIZE;
> - return 255;
> + return FILEID_INVALID;
>   } else if (*len < GFS2_SMALL_FH_SIZE) {
>   *len = GFS2_SMALL_FH_SIZE;
> - return 255;
> + return FILEID_INVALID;
>   }
>  
>   fh[0] = cpu_to_be32(ip->i_no_formal_ino >> 32);
> diff --git a/fs/isofs/export.c b/fs/isofs/export.c
> index 2b4f235..12088d8 100644
> --- a/fs/isofs/export.c
> +++ b/fs/isofs/export.c
> @@ -125,10 +125,10 @@ isofs_export_encode_fh(struct inode *inode,
>*/
>   if (parent && (len < 5)) {
>   *max_len = 5;
> - return 255;
> + return FILEID_INVALID;
>   } else if (len < 3) {
>   *max_len = 3;
> - return 255;
> + return FILEID_INVALID;
>   }
>  
>   len = 3;
> diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
> index 1d0c0b8..9de78f0 100644
> --- a/fs/nilfs2/namei.c
> +++ b/fs/nilfs2/namei.c
> @@ -517,11 +517,11 @@ static int nilfs_encode_fh(struct inode *inode, __u32 
> *fh, int *lenp,
>  
>   if (parent && *lenp < NILFS_FID_SIZE_CONNECTABLE) {
>   *lenp = NILFS_FID_SIZE_CONNECTABLE;
> - return 255;
> + return FILEID_INVALID;
>   }
>   if (*lenp < NILFS_FID_SIZE_NON_CONNECTABLE) {
>   *lenp = NILFS_FID_SIZE_NON_CONNECTABLE;
> - return 255;
> + return FILEID_INVALID;
>   }
>  
>   fid->cno = root->cno;
> diff --git a/fs/ocfs2/export.c b/fs/ocfs2/export.c
> index 322216a..2965116 100644
> --- a/fs/ocfs2/export.c
> +++ b/fs/ocfs2/export.c
> @@ -195,11 +195,11 @@ static int ocfs2_encode_fh(struct inode *inode, u32 
> *fh_in, int *max_len,
>  
>   if (parent && (len < 6)) {
>   *max_len = 6;
> - type = 255;
> + type = FILEID_INVALID;
>   goto bail;
>   } else i

Re: [PATCH 01/15] kv_flat_btree_async.cc: use vector instead of VLA's

2013-02-11 Thread Danny Al-Gaaf
Am 10.02.2013 06:57, schrieb Sage Weil:
> On Thu, 7 Feb 2013, Danny Al-Gaaf wrote:
>> Fix "variable length array of non-POD element type" errors caused by
>> using librados::ObjectWriteOperation VLAs. (-Wvla)
>>
>> Signed-off-by: Danny Al-Gaaf 
>> ---
>>  src/key_value_store/kv_flat_btree_async.cc | 14 +++---
>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/src/key_value_store/kv_flat_btree_async.cc 
>> b/src/key_value_store/kv_flat_btree_async.cc
>> index 96c6cb0..4342e70 100644
>> --- a/src/key_value_store/kv_flat_btree_async.cc
>> +++ b/src/key_value_store/kv_flat_btree_async.cc
>> @@ -1119,9 +1119,9 @@ int KvFlatBtreeAsync::cleanup(const index_data &idata, 
>> const int &errno) {
>>  //all changes were created except for updating the index and possibly
>>  //deleting the objects. roll forward.
>>  vector<pair<pair<int, string>, librados::ObjectWriteOperation*> > ops;
>> -librados::ObjectWriteOperation owos[idata.to_delete.size() + 1];
>> +vector<librados::ObjectWriteOperation*> owos(idata.to_delete.size() + 1);
> 
> I haven't read much of the surrounding code, but from what is included 
> here I don't think this is equivalent... these are just null pointers 
> initially, and so
> 
>>  for (int i = 0; i <= (int)idata.to_delete.size(); ++i) {
>> -  ops.push_back(make_pair(pair<int, string>(0, ""), &owos[i]));
>> +  ops.push_back(make_pair(pair<int, string>(0, ""), owos[i]));
> 
> this doesn't do anything useful... owos[i] may as well be NULL.  Why not 
> make it
> 
> vector<librados::ObjectWriteOperation> owos(...)
> 
> ?

Because this would lead to a linker error:

kv_flat_btree_async.o: In function `void
std::__uninitialized_fill_n<false>::__uninit_fill_n<librados::ObjectWriteOperation*,
unsigned long, librados::ObjectWriteOperation>(librados::ObjectWriteOperation*,
unsigned long, librados::ObjectWriteOperation const&)':
/usr/bin/../lib64/gcc/x86_64-suse-linux/4.7/../../../../include/c++/4.7/bits/stl_uninitialized.h:188:
undefined reference to
`librados::ObjectOperation::ObjectOperation(librados::ObjectOperation
const&)'
/usr/bin/../lib64/gcc/x86_64-suse-linux/4.7/../../../../include/c++/4.7/bits/stl_uninitialized.h:188:
undefined reference to
`librados::ObjectOperation::ObjectOperation(librados::ObjectOperation
const&)'


Because in src/include/rados/librados.hpp
librados::ObjectOperation::ObjectOperation(librados::ObjectOperation
const&) is defined, but not implemented in librados.cc.

Not sure if removing ObjectOperation(librados::ObjectOperation const&)
is the way to go here.

Danny
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Crash and strange things on MDS

2013-02-11 Thread Kevin Decherf
On Mon, Feb 04, 2013 at 07:01:54PM +0100, Kevin Decherf wrote:
> Hey everyone,
> 
> It's my first post here to expose a potential issue I found today using
> Ceph 0.56.1.
> 
> The cluster configuration is, briefly: 27 osd of ~900GB and 3 MON/MDS.
> All nodes are running Exherbo (source-based distribution) with Ceph
> 0.56.1 and Linux 3.7.0. We are only using CephFS on this cluster which
> is mounted on ~60 clients (increasing each day). Objects are replicated
> three times and the cluster handles only 7GB of data atm for 350k
> objects.
> 
> In certain conditions (I don't know them atm), some clients hang,
> generate CPU overloads (kworker) and are unable to perform any IO on
> Ceph. The active MDS has ~20Mbps in/out during the issue (less than
> 2Mbps in normal activity). I don't know if it's directly linked but we
> also observe a lot of missing files at the same time.
> 
> The problem is similar to this one [1].
> 
> A restart of the client or the MDS was enough before today, but we found
> a new behavior: the active MDS consumes a lot of CPU during 3 to 5 hours
> with ~25% clients hanging.
> 
> In logs I found a segfault with this backtrace [2] and 100,000 dumped
> events during the first hang. We observed another hang which produces
> lot of these events (in debug mode):
>- "mds.0.server FAIL on ESTALE but attempting recovery"
>- "mds.0.server reply_request -116 (Stale NFS file handle)
>   client_request(client.10991:1031 getattr As #104bab0
>   RETRY=132)"
> 
> We have no profiling tools available on these nodes, and I don't know
> what I should search in the 35 GB log file.
> 
> Note: the segmentation fault occured only once but the problem was
> observed four times on this cluster.
> 
> Any help may be appreciated.
> 
> References:
> [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
> [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
> 1: /usr/bin/ceph-mds() [0x817e82]
> 2: (()+0xf140) [0x7f9091d30140]
> 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
> 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
> 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
> 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
> 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
> 8: (Server::kill_session(Session*)+0x137) [0x549c67]
> 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
> 10: (MDS::tick()+0x338) [0x4da928]
> 11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
> 12: (SafeTimerThread::entry()+0xd) [0x782bad]
> 13: (()+0x7ddf) [0x7f9091d28ddf]
> 14: (clone()+0x6d) [0x7f90909cc24d]

I found a possible cause/way to reproduce this issue.
We now have ~90 clients for 18GB / 650k objects, and the storm occurs
when we execute an "intensive IO" command (tar of the whole pool / rsync
in one folder) on one of our clients (the only one which uses ceph-fuse;
I don't know whether the issue is limited to it).

Any idea?

Cheers,
-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Client can't reboot when rbd volume is mounted.

2013-02-11 Thread Roman Alekseev

On 11.02.2013 09:36, Sage Weil wrote:

On Mon, 11 Feb 2013, Roman Alekseev wrote:

Hi,

When I try to reboot a client server  without unmounting of rbd volume
manually
its services stop working but server doesn't reboot completely and show the
following logs in KVM console:

[235618.0202207] libceph: connect 192.168.0.19:6789 error -101

That is

#define ENETUNREACH 101 /* Network is unreachable */

Note that that (or any other) socket error is not necessarily fatal; the
kernel client will retry and eventually connect to that or another OSD
to complete the IO.  Are you observing that the RBD image hangs or
something?

You can peek at in-flight IO (and other state) with

  cat /sys/kernel/debug/ceph/*/osdc

unmount/unmap should not be necessary in any case unless there is a bug.
We backported a bunch of stuff to 3.6.6, so 3.6.10 ought to be okay.  You
might try a newer 3.6.x kernel too; I forget if there was a second batch
of fixes..

sage


Hi Sage,

> #define ENETUNREACH 101 /* Network is unreachable */

The reason for this error is that networking stops working after
the server reset request is performed.


> Are you observing that the RBD image hangs or something?

the RBD works properly. It is just mapped and mounted on the client server.

# /dev/rbd1  99G  616M   93G   1% /home/test

The "/sys/kernel/debug" folder is empty, how to put 'ceph/*/osdc' 
content into it?


I've updated the kernel to version 3.7.4 but the problem still persists.

Thanks

--
Kind regards,

R. Alekseev

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


IPv6 address confusion in OSDs

2013-02-11 Thread Simon Leinen
We run a ten-node 64-OSD Ceph cluster and use IPv6 where possible.

Today I noticed this error message from an OSD just after I restarted
it (in an attempt to resolve an issue with some "stuck" pgs that
included that OSD):

2013-02-11 09:24:57.232811 osd.35 [ERR] map e768 had wrong cluster addr 
([2001:620:0:6::106]:6822/1990 != my 
[fe80::67d:7bff:fef1:78b%vlan301]:6822/1990)

These two addresses belong to the same interface:

root@h1:~# ip -6 addr list dev vlan301
7: vlan301@bond0:  mtu 1500 
inet6 2001:620:0:6::106/64 scope global 
   valid_lft forever preferred_lft forever
inet6 fe80::67d:7bff:fef1:78b/64 scope link 
   valid_lft forever preferred_lft forever

2001:620:... is the global-scope address, and this is how OSDs are
addressed in our ceph.conf.  fe80:... is the link-local address that
every IPv6 interface has.  Shouldn't these be treated as equivalent?
-- 
Simon.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] fs: encode_fh: return FILEID_INVALID if invalid fid_type

2013-02-11 Thread Namjae Jeon
From: Namjae Jeon 

This patch is a follow up on below patch:

[PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63 

Signed-off-by: Namjae Jeon 
Signed-off-by: Vivek Trivedi 
Acked-by: Steven Whitehouse 
---
 fs/btrfs/export.c   |4 ++--
 fs/ceph/export.c|4 ++--
 fs/fuse/inode.c |2 +-
 fs/gfs2/export.c|4 ++--
 fs/isofs/export.c   |4 ++--
 fs/nilfs2/namei.c   |4 ++--
 fs/ocfs2/export.c   |4 ++--
 fs/reiserfs/inode.c |4 ++--
 fs/udf/namei.c  |4 ++--
 fs/xfs/xfs_export.c |4 ++--
 mm/cleancache.c |2 +-
 mm/shmem.c  |2 +-
 12 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
index 614f34a..81ee29e 100644
--- a/fs/btrfs/export.c
+++ b/fs/btrfs/export.c
@@ -22,10 +22,10 @@ static int btrfs_encode_fh(struct inode *inode, u32 *fh, 
int *max_len,
 
if (parent && (len < BTRFS_FID_SIZE_CONNECTABLE)) {
*max_len = BTRFS_FID_SIZE_CONNECTABLE;
-   return 255;
+   return FILEID_INVALID;
} else if (len < BTRFS_FID_SIZE_NON_CONNECTABLE) {
*max_len = BTRFS_FID_SIZE_NON_CONNECTABLE;
-   return 255;
+   return FILEID_INVALID;
}
 
len  = BTRFS_FID_SIZE_NON_CONNECTABLE;
diff --git a/fs/ceph/export.c b/fs/ceph/export.c
index ca3ab3f..16796be 100644
--- a/fs/ceph/export.c
+++ b/fs/ceph/export.c
@@ -81,7 +81,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, 
int *max_len,
if (parent_inode) {
/* nfsd wants connectable */
*max_len = connected_handle_length;
-   type = 255;
+   type = FILEID_INVALID;
} else {
dout("encode_fh %p\n", dentry);
fh->ino = ceph_ino(inode);
@@ -90,7 +90,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, 
int *max_len,
}
} else {
*max_len = handle_length;
-   type = 255;
+   type = FILEID_INVALID;
}
if (dentry)
dput(dentry);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 9876a87..973e8f0 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -679,7 +679,7 @@ static int fuse_encode_fh(struct inode *inode, u32 *fh, int 
*max_len,
 
if (*max_len < len) {
*max_len = len;
-   return  255;
+   return  FILEID_INVALID;
}
 
nodeid = get_fuse_inode(inode)->nodeid;
diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c
index 4767774..9973df4 100644
--- a/fs/gfs2/export.c
+++ b/fs/gfs2/export.c
@@ -37,10 +37,10 @@ static int gfs2_encode_fh(struct inode *inode, __u32 *p, 
int *len,
 
if (parent && (*len < GFS2_LARGE_FH_SIZE)) {
*len = GFS2_LARGE_FH_SIZE;
-   return 255;
+   return FILEID_INVALID;
} else if (*len < GFS2_SMALL_FH_SIZE) {
*len = GFS2_SMALL_FH_SIZE;
-   return 255;
+   return FILEID_INVALID;
}
 
fh[0] = cpu_to_be32(ip->i_no_formal_ino >> 32);
diff --git a/fs/isofs/export.c b/fs/isofs/export.c
index 2b4f235..12088d8 100644
--- a/fs/isofs/export.c
+++ b/fs/isofs/export.c
@@ -125,10 +125,10 @@ isofs_export_encode_fh(struct inode *inode,
 */
if (parent && (len < 5)) {
*max_len = 5;
-   return 255;
+   return FILEID_INVALID;
} else if (len < 3) {
*max_len = 3;
-   return 255;
+   return FILEID_INVALID;
}
 
len = 3;
diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index 1d0c0b8..9de78f0 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -517,11 +517,11 @@ static int nilfs_encode_fh(struct inode *inode, __u32 
*fh, int *lenp,
 
if (parent && *lenp < NILFS_FID_SIZE_CONNECTABLE) {
*lenp = NILFS_FID_SIZE_CONNECTABLE;
-   return 255;
+   return FILEID_INVALID;
}
if (*lenp < NILFS_FID_SIZE_NON_CONNECTABLE) {
*lenp = NILFS_FID_SIZE_NON_CONNECTABLE;
-   return 255;
+   return FILEID_INVALID;
}
 
fid->cno = root->cno;
diff --git a/fs/ocfs2/export.c b/fs/ocfs2/export.c
index 322216a..2965116 100644
--- a/fs/ocfs2/export.c
+++ b/fs/ocfs2/export.c
@@ -195,11 +195,11 @@ static int ocfs2_encode_fh(struct inode *inode, u32 
*fh_in, int *max_len,
 
if (parent && (len < 6)) {
*max_len = 6;
-   type = 255;
+   type = FILEID_INVALID;
goto bail;
} else if (len < 3) {
*max_len = 3;
-   type = 255;
+   type = FILEID_INVALID;
goto bail;
}
 
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index 30195bc.