Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-26 Thread Samuel Just
I just pushed a patch to wip-dumpling-log-assert (based on current
dumpling head).  I had disabled most of the code in PGLog::check() but
left an (I thought) innocuous assert.  It seems that with (at least)
g++ 4.6.3, stl list::size() is linear in the size of the list, so that
assert actually traverses the pg log on each operation.  The patch in
wip-dumpling-log-assert should disable that assert as well by default.
 Let me know if it helps.
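
The effect is easy to see in a standalone sketch (illustrative only, not the
actual PGLog code): with pre-C++11 libstdc++ (as shipped with g++ 4.6),
std::list::size() is implemented as std::distance(begin(), end()), so it
visits every node.

  #include <cassert>
  #include <list>

  // Hypothetical stand-in for appending an entry to a pg log:
  void add_log_entry(std::list<int>& pg_log, int entry) {
      pg_log.push_back(entry);          // O(1)
      assert(pg_log.size() < 100000);   // O(n): walks the whole list
  }                                     // => O(n^2) over n operations

With a few thousand log entries, the "cheap" assert ends up dominating the
cost of the operation it was guarding.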

It should be built within an hour of this email.
-Sam

On Mon, Aug 26, 2013 at 10:46 PM, Matthew Anderson
 wrote:
> Hi Guys,
>
> I'm having the same problem as Oliver with 0.67.2. CPU usage is around
> double that of the 0.61.8 OSD's in the same cluster which appears to
> be causing the performance decrease.
>
> I did a perf comparison (not sure if I did it right but it seems ok).
> Both hosts are the same spec running Ubuntu 12.04.1 (3.2 kernel),
> journal and osd data is on an SSD, OSD's are in the same pool with the
> same weight and the perf tests were run at the same time on a
> realworld load consisting of RBD traffic only.
>
> Dumpling -
>
> Events: 332K cycles
>  17.93%  ceph-osd  libc-2.15.so   [.] 0x15d523
>  17.03%  ceph-osd  ceph-osd   [.] 0x5c2897
>   4.66%  ceph-osd  ceph-osd   [.]
> leveldb::InternalKeyComparator::Compare(leveldb::Slice const&, level
>   3.46%  ceph-osd  ceph-osd   [.] leveldb::Block::Iter::Next()
>   2.70%  ceph-osd  libstdc++.so.6.0.16[.]
> std::string::_M_mutate(unsigned long, unsigned long, unsigned long)
>   2.60%  ceph-osd  ceph-osd   [.] PGLog::check()
>   2.57%  ceph-osd  [kernel.kallsyms]  [k] __ticket_spin_lock
>   2.49%  ceph-osd  ceph-osd   [.] ceph_crc32c_le_intel
>   1.93%  ceph-osd  libsnappy.so.1.1.2 [.]
> snappy::RawUncompress(snappy::Source*, char*)
>   1.53%  ceph-osd  libstdc++.so.6.0.16[.] std::string::append(char
> const*, unsigned long)
>   1.47%  ceph-osd  libtcmalloc.so.0.1.0   [.] operator new(unsigned long)
>   1.33%  ceph-osd  [kernel.kallsyms]  [k] copy_user_generic_string
>   0.98%  ceph-osd  libtcmalloc.so.0.1.0   [.] operator delete(void*)
>   0.90%  ceph-osd  libstdc++.so.6.0.16[.] std::string::assign(char
> const*, unsigned long)
>   0.75%  ceph-osd  libstdc++.so.6.0.16[.]
> std::string::_M_replace_safe(unsigned long, unsigned long, char cons
>   0.58%  ceph-osd  [kernel.kallsyms]  [k] wait_sb_inodes
>   0.55%  ceph-osd  ceph-osd   [.]
> leveldb::Block::Iter::Valid() const
>   0.51%  ceph-osd  libtcmalloc.so.0.1.0   [.]
> tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::
>   0.50%  ceph-osd  libtcmalloc.so.0.1.0   [.]
> tcmalloc::CentralFreeList::FetchFromSpans()
>   0.47%  ceph-osd  libstdc++.so.6.0.16[.] 0x9ebc8
>   0.46%  ceph-osd  libc-2.15.so   [.] vfprintf
>   0.45%  ceph-osd  [kernel.kallsyms]  [k] find_busiest_group
>   0.45%  ceph-osd  libstdc++.so.6.0.16[.]
> std::string::resize(unsigned long, char)
>   0.43%  ceph-osd  libpthread-2.15.so [.] pthread_mutex_unlock
>   0.41%  ceph-osd  [kernel.kallsyms]  [k] iput_final
>   0.40%  ceph-osd  ceph-osd   [.]
> leveldb::Block::Iter::Seek(leveldb::Slice const&)
>   0.39%  ceph-osd  libc-2.15.so   [.] _IO_vfscanf
>   0.39%  ceph-osd  ceph-osd   [.] leveldb::Block::Iter::key() 
> const
>   0.39%  ceph-osd  libtcmalloc.so.0.1.0   [.]
> tcmalloc::CentralFreeList::ReleaseToSpans(void*)
>   0.37%  ceph-osd  libstdc++.so.6.0.16    [.] std::basic_ostream<char, std::char_traits<char> >& std::__ostream_in
>
>
> Cuttlefish -
>
> Events: 160K cycles
>   7.53%  ceph-osd  [kernel.kallsyms]  [k] __ticket_spin_lock
>   6.26%  ceph-osd  libc-2.15.so   [.] 0x89115
>   3.06%  ceph-osd  ceph-osd   [.] ceph_crc32c_le
>   2.66%  ceph-osd  libtcmalloc.so.0.1.0   [.] operator new(unsigned long)
>   2.46%  ceph-osd  [kernel.kallsyms]  [k] find_busiest_group
>   1.80%  ceph-osd  libtcmalloc.so.0.1.0   [.] operator delete(void*)
>   1.42%  ceph-osd  [kernel.kallsyms]  [k] try_to_wake_up
>   1.27%  ceph-osd  ceph-osd   [.] 0x531fb6
>   1.21%  ceph-osd  libstdc++.so.6.0.16[.] 0x9ebc8
>   1.14%  ceph-osd  [kernel.kallsyms]  [k] wait_sb_inodes
>   1.02%  ceph-osd  libc-2.15.so   [.] _IO_vfscanf
>   1.01%  ceph-osd  [kernel.kallsyms]  [k] update_shares
>   0.98%  ceph-osd  [kernel.kallsyms]  [k] filemap_fdatawait_range
>   0.90%  ceph-osd  libstdc++.so.6.0.16    [.] std::basic_ostream<char, std::char_traits<char> >& std
>   0.89%  ceph-osd  [kernel.kallsyms]  [k] iput_final
>   0.79%  ceph-osd  libstdc++.so.6.0.16    [.] std::basic_string<char, std::char_traits<char>, std::a
>   0.79%  ceph-osd  [kernel.kallsyms]  [k] copy_user_generic_string
>   0.78%  ceph-osd  libc-2.15.so   [.] vfprintf
>   0.70%  ceph-osd  libtcmalloc.so.0.1.0   [.]
> tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc:
>   0.69%  ceph-osd  [kernel.kallsyms]  [k] __d_look

Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-26 Thread Matthew Anderson
Hi Guys,

I'm having the same problem as Oliver with 0.67.2. CPU usage is around
double that of the 0.61.8 OSD's in the same cluster, which appears to
be causing the performance decrease.

I did a perf comparison (not sure if I did it right but it seems ok).
Both hosts are the same spec running Ubuntu 12.04.1 (3.2 kernel),
journal and osd data is on an SSD, OSD's are in the same pool with the
same weight and the perf tests were run at the same time on a
realworld load consisting of RBD traffic only.
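
For reference, output like the below typically comes from something along
these lines (exact options may differ):

  sudo perf record -a -g -- sleep 60       # sample all CPUs for 60 seconds
  sudo perf report --sort comm,dso,symbol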

Dumpling -

Events: 332K cycles
 17.93%  ceph-osd  libc-2.15.so   [.] 0x15d523
 17.03%  ceph-osd  ceph-osd   [.] 0x5c2897
  4.66%  ceph-osd  ceph-osd   [.]
leveldb::InternalKeyComparator::Compare(leveldb::Slice const&, level
  3.46%  ceph-osd  ceph-osd   [.] leveldb::Block::Iter::Next()
  2.70%  ceph-osd  libstdc++.so.6.0.16[.]
std::string::_M_mutate(unsigned long, unsigned long, unsigned long)
  2.60%  ceph-osd  ceph-osd   [.] PGLog::check()
  2.57%  ceph-osd  [kernel.kallsyms]  [k] __ticket_spin_lock
  2.49%  ceph-osd  ceph-osd   [.] ceph_crc32c_le_intel
  1.93%  ceph-osd  libsnappy.so.1.1.2 [.]
snappy::RawUncompress(snappy::Source*, char*)
  1.53%  ceph-osd  libstdc++.so.6.0.16[.] std::string::append(char
const*, unsigned long)
  1.47%  ceph-osd  libtcmalloc.so.0.1.0   [.] operator new(unsigned long)
  1.33%  ceph-osd  [kernel.kallsyms]  [k] copy_user_generic_string
  0.98%  ceph-osd  libtcmalloc.so.0.1.0   [.] operator delete(void*)
  0.90%  ceph-osd  libstdc++.so.6.0.16[.] std::string::assign(char
const*, unsigned long)
  0.75%  ceph-osd  libstdc++.so.6.0.16[.]
std::string::_M_replace_safe(unsigned long, unsigned long, char cons
  0.58%  ceph-osd  [kernel.kallsyms]  [k] wait_sb_inodes
  0.55%  ceph-osd  ceph-osd   [.]
leveldb::Block::Iter::Valid() const
  0.51%  ceph-osd  libtcmalloc.so.0.1.0   [.]
tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::
  0.50%  ceph-osd  libtcmalloc.so.0.1.0   [.]
tcmalloc::CentralFreeList::FetchFromSpans()
  0.47%  ceph-osd  libstdc++.so.6.0.16[.] 0x9ebc8
  0.46%  ceph-osd  libc-2.15.so   [.] vfprintf
  0.45%  ceph-osd  [kernel.kallsyms]  [k] find_busiest_group
  0.45%  ceph-osd  libstdc++.so.6.0.16[.]
std::string::resize(unsigned long, char)
  0.43%  ceph-osd  libpthread-2.15.so [.] pthread_mutex_unlock
  0.41%  ceph-osd  [kernel.kallsyms]  [k] iput_final
  0.40%  ceph-osd  ceph-osd   [.]
leveldb::Block::Iter::Seek(leveldb::Slice const&)
  0.39%  ceph-osd  libc-2.15.so   [.] _IO_vfscanf
  0.39%  ceph-osd  ceph-osd   [.] leveldb::Block::Iter::key() const
  0.39%  ceph-osd  libtcmalloc.so.0.1.0   [.]
tcmalloc::CentralFreeList::ReleaseToSpans(void*)
  0.37%  ceph-osd  libstdc++.so.6.0.16    [.] std::basic_ostream<char, std::char_traits<char> >& std::__ostream_in


Cuttlefish -

Events: 160K cycles
  7.53%  ceph-osd  [kernel.kallsyms]  [k] __ticket_spin_lock
  6.26%  ceph-osd  libc-2.15.so   [.] 0x89115
  3.06%  ceph-osd  ceph-osd   [.] ceph_crc32c_le
  2.66%  ceph-osd  libtcmalloc.so.0.1.0   [.] operator new(unsigned long)
  2.46%  ceph-osd  [kernel.kallsyms]  [k] find_busiest_group
  1.80%  ceph-osd  libtcmalloc.so.0.1.0   [.] operator delete(void*)
  1.42%  ceph-osd  [kernel.kallsyms]  [k] try_to_wake_up
  1.27%  ceph-osd  ceph-osd   [.] 0x531fb6
  1.21%  ceph-osd  libstdc++.so.6.0.16[.] 0x9ebc8
  1.14%  ceph-osd  [kernel.kallsyms]  [k] wait_sb_inodes
  1.02%  ceph-osd  libc-2.15.so   [.] _IO_vfscanf
  1.01%  ceph-osd  [kernel.kallsyms]  [k] update_shares
  0.98%  ceph-osd  [kernel.kallsyms]  [k] filemap_fdatawait_range
  0.90%  ceph-osd  libstdc++.so.6.0.16    [.] std::basic_ostream<char, std::char_traits<char> >& std
  0.89%  ceph-osd  [kernel.kallsyms]  [k] iput_final
  0.79%  ceph-osd  libstdc++.so.6.0.16    [.] std::basic_string<char, std::char_traits<char>, std::a
  0.79%  ceph-osd  [kernel.kallsyms]  [k] copy_user_generic_string
  0.78%  ceph-osd  libc-2.15.so   [.] vfprintf
  0.70%  ceph-osd  libtcmalloc.so.0.1.0   [.]
tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc:
  0.69%  ceph-osd  [kernel.kallsyms]  [k] __d_lookup_rcu
  0.69%  ceph-osd  libtcmalloc.so.0.1.0   [.]
tcmalloc::CentralFreeList::FetchFromSpans()
  0.66%  ceph-osd  [kernel.kallsyms]  [k] igrab
  0.63%  ceph-osd  [kernel.kallsyms]  [k] update_cfs_load
  0.63%  ceph-osd  [kernel.kallsyms]  [k] link_path_walk

If you'd like some more tests run, just let me know - more than happy to help.

Thanks
-Matt
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sequential placement

2013-08-26 Thread Chen, Xiaoxi
The "random" may come from ceph trunks. For RBD, Ceph trunk the image to 
4M(default) objects, for Rados bench , it already 4M objects if you didn't set 
the parameters. So from XFS's view, there are lots of 4M files, in default, 
with ag!=1 (allocation group, specified during mkfs, default seems to be 32 or 
more), the files will be spread across the allocation groups, which results 
some random pattern as you can see from blktrace.

AG=1 may works for single client senarios, but should not be that useful for a 
multi-tenant environment since the access pattern is a mixture of all tenant, 
shoud be random enough. One thing you may try is set the 
/sys/block/{disk}/queue/readahead_kb= 1024 or 2048, that should be helpful for 
sequential read performance.
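
For example (sdb is only an example device name):

  echo 2048 > /sys/block/sdb/queue/read_ahead_kb

and the allocation group count is chosen at mkfs time:

  mkfs.xfs -f -d agcount=1 /dev/sdb1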

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum
Sent: Tuesday, August 27, 2013 5:25 AM
To: Samuel Just
Cc: ceph-users@lists.ceph.com; daniel pol
Subject: Re: [ceph-users] Sequential placement

In addition to that, Ceph uses full data journaling - if you have two journals 
on the OS drive then you'll be limited to what that OS drive can provide, 
divided by two (if you have two-copy happening).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Aug 26, 2013 at 2:09 PM, Samuel Just  wrote:
> I think rados bench is actually creating new objects with each IO.
> Can you paste in the command you used?
> -Sam
>
> On Tue, Aug 20, 2013 at 7:28 AM, daniel pol  wrote:
>> Hi !
>>
>> Ceph newbie here with a placement question. I'm trying to get a 
>> simple Ceph setup to run well with big sequential reads (>256k packets).
>> This is for learning/benchmarking purpose only and the setup I'm 
>> working with has a single server with 2 data drives, one OSD on each, 
>> journals on the OS drive, no replication, dumpling release.
>> When running rados bench or using a rbd block device the performance 
>> is only 35%-50% of what the underlying XFS filesystem can do and when 
>> I look at the IO trace I see random IO going to the physical disk, 
>> while the IO at ceph layer is sequential. Haven't tested CephFS yet 
>> but expect similar results there.
>> Looking for advice on how to configure Ceph to generate sequential 
>> read IO pattern to the underlying physical disk.
>>
>> Have a nice day,
>> Dani
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] paxos is_readable spam on monitors?

2013-08-26 Thread Joao Eduardo Luis

On 08/16/2013 10:40 PM, Jeppesen, Nelson wrote:

Hello Ceph-users,

Running dumpling (upgraded yesterday), and several hours after the upgrade the 
following type of message repeated over and over in logs. Started about 8 hours 
ago.

1 mon.1@0(leader).paxos(paxos active c 6005920..6006535) is_readable 
now=2013-08-16 14:35:53.351282 lease_expire=2013-08-16 14:35:58.245728 has v0 
lc 6006535

Only monitor settings I have in ceph.conf besides host and host_addr:

[mon]
 mon cluster log to syslog = false
 mon cluster log file = none

Can I safely ignore them? Thanks.


Hello Nelson,

This message has been output at this level for as far as I can remember.  
At least in cuttlefish it ought to be at this level already.


However, we did increase the default paxos debug level to 1 last July 
(commit:f7d1902cc), in order to be more verbose about things that could 
lead to elections.  Prior to this patch, the default level was 0 (pretty 
much no debug messages at all); now we do output a selected few, this 
message being amongst them.


If you do not wish to see them, set 'debug paxos = 0' in your config file.
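
For example:

[mon]
 debug paxos = 0

and restart the monitors, or inject it at runtime with something like
`ceph tell mon.a injectargs '--debug-paxos 0'`.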

Besides taking up log space, those messages are completely harmless :-)

  -Joao

--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Optimal configuration to validate Ceph

2013-08-26 Thread Samuel Just
Up to the point that you saturate the network, sure.  Note that rados
bench defaults to 16 writes at a time, so I would not expect a single
rados bench client with 16 concurrent writes to show linear scaling
past 16 osds (perhaps 32 if you have replication enabled).  For larger
numbers of osds, you'll need more concurrent writes (either more
clients, or a larger number of outstanding writes on each client).
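
For example (pool name is only an example):

  rados bench -p testpool 60 write        # defaults: 16 concurrent 4MB writes
  rados bench -p testpool 60 write -t 64  # 64 outstanding writes instead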
-Sam

On Mon, Aug 26, 2013 at 3:51 PM, Sushma R  wrote:
> Thanks for the response.
> Yes, we intend to use rbd and radosgw eventually.
> However, for evaluation we are using rados bench and we are getting
> performance of ~50 MB/sec with a single OSD (with SSDs). We added more OSDs
> to the same server and the performance scales linearly.
> Can we assume that the performance with multiple OSDs on a "single" server
> (without saturating CPU utilization) would be the best compared to the
> multiple OSDs on multiple servers, since there are no network latencies
> involved?
>
>
>
> On Mon, Aug 26, 2013 at 1:47 PM, Samuel Just  wrote:
>>
>> If you create a pool with size 1 (no replication), (2) should be
>> somewhere around 3x the speed of (1) assuming the client workload has
>> enough parallelism and is well distributed over objects (so a random
>> rbd workload with a large queue depth rather than a small sequential
>> workload with a small queue depth).  If you have 3x replication on
>> (2), but 1x on (1), you should expect (2) to be pretty close to (1),
>> perhaps a bit slower due to replication latency.
>>
>> The details will actually depend a lot on the workload.  Do you intend
>> to use rbd?
>> -Sam
>>
>> On Fri, Aug 23, 2013 at 2:41 PM, Sushma R  wrote:
>> > Hi,
>> >
>> > I understand that Ceph is a scalable distributed storage architecture.
>> > However, I'd like to understand if performance on single node cluster is
>> > better or worse than a 3 node cluster.
>> > Let's say I have the following 2 setups:
>> > 1. Single node cluster with one OSD.
>> > 2. Three node cluster with one OSD on each node.
>> >
>> > Would the performance of Setup 2 be approximately (3x) of Setup 1? (OR)
>> > Would Setup 2 perform better than (3x) Setup 1, because of more
>> > parallelism?
>> > (OR)
>> > Would Setup 2 perform worse than (3x) Setup 1, because of replication,
>> > etc.
>> >
>> > In other words, I'm trying to understand do we definitely need more than
>> > three nodes to validate the benchmark results or a single/two node
>> > should
>> > give an idea of a larger scale?
>> >
>> > Thanks,
>> > Sushma
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-26 Thread Oliver Daudey
Hey Samuel,

Nope, "PGLog::undirty()" doesn't use as much CPU as before, but I found
it curious that it still showed up, as I thought you disabled it.  As
long as you can reproduce the abnormal leveldb CPU-usage.  Let me know
if I can help with anything.


   Regards,

 Oliver

On ma, 2013-08-26 at 14:23 -0700, Samuel Just wrote:
> Saw your logs.  I thought it might be enabling
> filestore_xattr_use_omap, but it isn't.  PGLog::undirty() doesn't seem
> to be using very much cpu.
> -Sam
> 
> On Mon, Aug 26, 2013 at 2:04 PM, Oliver Daudey  wrote:
> > Hey Samuel,
> >
> > I have been trying to get it reproduced on my test-cluster and seem to
> > have found a way.  Try: `rbd bench-write test --io-threads 80
> > --io-pattern=rand'.  On my test-cluster, this closely replicates what I
> > see during profiling on my production-cluster, including the extra
> > CPU-usage by leveldb, which doesn't show up on Cuttlefish.  It's very
> > curious that "PGLog::undirty()" is also still showing up near the top,
> > even in 0.67.2.
> >
> > I'll send you the logs by private mail.
> >
> >
> >Regards,
> >
> >   Oliver
> >
> > On ma, 2013-08-26 at 13:35 -0700, Samuel Just wrote:
> >> Can you attach a log from the startup of one of the dumpling osds on
> >> your production machine (no need for logging, just need some of the
> >> information dumped on every boot)?
> >>
> >> libleveldb is leveldb.  We've used leveldb for a few things since
> >> bobtail.  If anything, the load on leveldb should be lighter in
> >> dumpling, I would think...  I'll have to try to reproduce it locally.
> >> I'll keep you posted.
> >> -Sam
> >>
> >> On Sat, Aug 24, 2013 at 10:11 AM, Oliver Daudey  wrote:
> >> > Hey Samuel,
> >> >
> >> > Unfortunately, disabling "wbthrottle" made almost no difference on my
> >> > production-cluster.  OSD-load was still much higher on Dumpling.
> >> >
> >> > I've mentioned this several times already, but when profiling with `perf
> >> > top' on my production-cluster, any time I'm running a Dumpling-OSD,
> >> > several "libleveldb"-related entries come up near the top, that don't
> >> > show up when running the Cuttlefish-OSD at all.  Let's concentrate on
> >> > that for a moment, as it's a clearly visible difference on my
> >> > production-cluster, which shows the actual problem.
> >> >
> >> > Dumpling OSDs:
> >> >  17.23%  [kernel] [k] intel_idle
> >> >   6.35%  [kernel] [k] find_busiest_group
> >> >   4.36%  kvm  [.] 0x2cdbb0
> >> >   3.38%  libleveldb.so.1.9[.] 0x22821
> >> >   2.40%  libc-2.11.3.so   [.] memcmp
> >> >   2.04%  ceph-osd [.] ceph_crc32c_le_intel
> >> >   1.90%  [kernel] [k] _raw_spin_lock
> >> >   1.87%  [kernel] [k] copy_user_generic_string
> >> >   1.35%  [kernel] [k]
> >> > default_send_IPI_mask_sequence_phys
> >> >   1.34%  [kernel] [k] __hrtimer_start_range_ns
> >> >   1.14%  libc-2.11.3.so   [.] memcpy
> >> >   1.03%  [kernel] [k] hrtimer_interrupt
> >> >   1.01%  [kernel] [k] do_select
> >> >   1.00%  [kernel] [k] __schedule
> >> >   0.99%  [kernel] [k] _raw_spin_unlock_irqrestore
> >> >   0.97%  [kernel] [k] cpumask_next_and
> >> >   0.97%  [kernel] [k] find_next_bit
> >> >   0.96%  libleveldb.so.1.9[.]
> >> > leveldb::InternalKeyComparator::Compar
> >> >   0.91%  [kernel] [k] _raw_spin_lock_irqsave
> >> >   0.91%  [kernel] [k] fget_light
> >> >   0.89%  [kernel] [k] clockevents_program_event
> >> >   0.79%  [kernel] [k] sync_inodes_sb
> >> >   0.78%  libleveldb.so.1.9[.] leveldb::Block::Iter::Next()
> >> >   0.75%  [kernel] [k] apic_timer_interrupt
> >> >   0.70%  [kernel] [k] native_write_cr0
> >> >   0.60%  [kvm_intel]  [k] vmx_vcpu_run
> >> >   0.58%  [kernel] [k] load_balance
> >> >   0.57%  [kernel] [k] rcu_needs_cpu
> >> >   0.56%  ceph-osd [.] PGLog::undirty()
> >> >   0.51%  libpthread-2.11.3.so [.] pthread_mutex_lock
> >> >   0.50%  [vdso]   [.] 0x7fff6dbff6ce
> >> >
> >> > Same load, but with Cuttlefish-OSDs:
> >> >  19.23%  [kernel] [k] intel_idle
> >> >   6.43%  [kernel] [k] find_busiest_group
> >> >   5.25%  kvm  [.] 0x152a75
> >> >   2.70%  ceph-osd [.] ceph_crc32c_le
> >> >   2.44%  [kernel] [k] _raw_spin_lock
> >> >   1.95%  [kernel] [k] copy_user_generic_string
> >> >   1.53%  [kernel] [k]
> >> > default_send_IPI_mask_sequence_phys
> >> >   1.28%  [kernel]   

Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-26 Thread Samuel Just
Saw your logs.  I thought it might be enabling
filestore_xattr_use_omap, but it isn't.  PGLog::undirty() doesn't seem
to be using very much cpu.
-Sam

On Mon, Aug 26, 2013 at 2:04 PM, Oliver Daudey  wrote:
> Hey Samuel,
>
> I have been trying to get it reproduced on my test-cluster and seem to
> have found a way.  Try: `rbd bench-write test --io-threads 80
> --io-pattern=rand'.  On my test-cluster, this closely replicates what I
> see during profiling on my production-cluster, including the extra
> CPU-usage by leveldb, which doesn't show up on Cuttlefish.  It's very
> curious that "PGLog::undirty()" is also still showing up near the top,
> even in 0.67.2.
>
> I'll send you the logs by private mail.
>
>
>Regards,
>
>   Oliver
>
> On ma, 2013-08-26 at 13:35 -0700, Samuel Just wrote:
>> Can you attach a log from the startup of one of the dumpling osds on
>> your production machine (no need for logging, just need some of the
>> information dumped on every boot)?
>>
>> libleveldb is leveldb.  We've used leveldb for a few things since
>> bobtail.  If anything, the load on leveldb should be lighter in
>> dumpling, I would think...  I'll have to try to reproduce it locally.
>> I'll keep you posted.
>> -Sam
>>
>> On Sat, Aug 24, 2013 at 10:11 AM, Oliver Daudey  wrote:
>> > Hey Samuel,
>> >
>> > Unfortunately, disabling "wbthrottle" made almost no difference on my
>> > production-cluster.  OSD-load was still much higher on Dumpling.
>> >
>> > I've mentioned this several times already, but when profiling with `perf
>> > top' on my production-cluster, any time I'm running a Dumpling-OSD,
>> > several "libleveldb"-related entries come up near the top, that don't
>> > show up when running the Cuttlefish-OSD at all.  Let's concentrate on
>> > that for a moment, as it's a clearly visible difference on my
>> > production-cluster, which shows the actual problem.
>> >
>> > Dumpling OSDs:
>> >  17.23%  [kernel] [k] intel_idle
>> >   6.35%  [kernel] [k] find_busiest_group
>> >   4.36%  kvm  [.] 0x2cdbb0
>> >   3.38%  libleveldb.so.1.9[.] 0x22821
>> >   2.40%  libc-2.11.3.so   [.] memcmp
>> >   2.04%  ceph-osd [.] ceph_crc32c_le_intel
>> >   1.90%  [kernel] [k] _raw_spin_lock
>> >   1.87%  [kernel] [k] copy_user_generic_string
>> >   1.35%  [kernel] [k]
>> > default_send_IPI_mask_sequence_phys
>> >   1.34%  [kernel] [k] __hrtimer_start_range_ns
>> >   1.14%  libc-2.11.3.so   [.] memcpy
>> >   1.03%  [kernel] [k] hrtimer_interrupt
>> >   1.01%  [kernel] [k] do_select
>> >   1.00%  [kernel] [k] __schedule
>> >   0.99%  [kernel] [k] _raw_spin_unlock_irqrestore
>> >   0.97%  [kernel] [k] cpumask_next_and
>> >   0.97%  [kernel] [k] find_next_bit
>> >   0.96%  libleveldb.so.1.9[.]
>> > leveldb::InternalKeyComparator::Compar
>> >   0.91%  [kernel] [k] _raw_spin_lock_irqsave
>> >   0.91%  [kernel] [k] fget_light
>> >   0.89%  [kernel] [k] clockevents_program_event
>> >   0.79%  [kernel] [k] sync_inodes_sb
>> >   0.78%  libleveldb.so.1.9[.] leveldb::Block::Iter::Next()
>> >   0.75%  [kernel] [k] apic_timer_interrupt
>> >   0.70%  [kernel] [k] native_write_cr0
>> >   0.60%  [kvm_intel]  [k] vmx_vcpu_run
>> >   0.58%  [kernel] [k] load_balance
>> >   0.57%  [kernel] [k] rcu_needs_cpu
>> >   0.56%  ceph-osd [.] PGLog::undirty()
>> >   0.51%  libpthread-2.11.3.so [.] pthread_mutex_lock
>> >   0.50%  [vdso]   [.] 0x7fff6dbff6ce
>> >
>> > Same load, but with Cuttlefish-OSDs:
>> >  19.23%  [kernel] [k] intel_idle
>> >   6.43%  [kernel] [k] find_busiest_group
>> >   5.25%  kvm  [.] 0x152a75
>> >   2.70%  ceph-osd [.] ceph_crc32c_le
>> >   2.44%  [kernel] [k] _raw_spin_lock
>> >   1.95%  [kernel] [k] copy_user_generic_string
>> >   1.53%  [kernel] [k]
>> > default_send_IPI_mask_sequence_phys
>> >   1.28%  [kernel] [k] __hrtimer_start_range_ns
>> >   1.21%  [kernel] [k] do_select
>> >   1.19%  [kernel] [k] hrtimer_interrupt
>> >   1.19%  [kernel] [k] _raw_spin_unlock_irqrestore
>> >   1.16%  [kernel] [k] fget_light
>> >   1.12%  [kernel] [k] cpumask_next_and
>> >   1.11%  [kernel] [k] clockevents_program_event
>> >   1.08%  [kernel] [k] __schedule
>> >   1.08%  [kernel] 

Re: [ceph-users] cuttlefish operatiing a cluster(start ceph all) failed

2013-08-26 Thread Samuel Just
Usually you need to run the initctl commands on the node the process is
running on in order to control that process.
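
For example, on the OSD node itself, something like:

  sudo initctl list | grep ceph
  sudo start ceph-osd id=0    # likewise "stop" and "restart"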
-Sam

On Fri, Aug 16, 2013 at 12:28 AM, maoqi1982  wrote:
> Hi list:
>
> After I deployed a cuttlefish (0.61.7) cluster on three nodes (OS: Ubuntu
> 12.04) - one ceph-deploy node, one monitor node and one OSD node - no other
> daemons were found on the monitor with "sudo initctl list | grep ceph". As the
> output below shows, I can only find the monitor daemon process.
>
> ceph-osd-all stop/waiting
>
> ceph-mds-all-starter stop/waiting
>
> ceph-mds-all stop/waiting
>
> ceph-osd-all-starter stop/waiting
>
> ceph-all start/running
>
> ceph-mon-all start/running
>
> ceph-mon-all-starter stop/waiting
>
> ceph-mon (ceph/ceph-mon21) start/running, process 9236
>
> ceph-create-keys stop/waiting
>
> ceph-osd stop/waiting
>
> ceph-mds stop/waiting
>
>  All the commands above that operate on ceph-mon21 seem to work, and the
> status they return seems to be OK, but in fact it does not work. For example,
> I start “ceph-all” and the status turns into “start/running”, but on the
> OSD node the OSD process is still not running.
>
> In fact, I cannot control (start/stop) the other daemons in the cluster
> except the local daemon (mon). Would you please help me find the problem?
>
> Thanks very much !
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sequential placement

2013-08-26 Thread Gregory Farnum
In addition to that, Ceph uses full data journaling — if you have two
journals on the OS drive then you'll be limited to what that OS drive
can provide, divided by two (if you have two-copy happening).
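
For example, if the OS drive sustains ~100MB/s of writes, two-copy
replication means every client byte is journaled twice on that drive, so
client throughput tops out around 50MB/s no matter how fast the data disks
are.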
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Aug 26, 2013 at 2:09 PM, Samuel Just  wrote:
> I think rados bench is actually creating new objects with each IO.
> Can you paste in the command you used?
> -Sam
>
> On Tue, Aug 20, 2013 at 7:28 AM, daniel pol  wrote:
>> Hi !
>>
>> Ceph newbie here with a placement question. I'm trying to get a simple Ceph
>> setup to run well with big sequential reads (>256k packets).
>> This is for learning/benchmarking purpose only and the setup I'm working
>> with has a single server with 2 data drives, one OSD on each, journals on
>> the OS drive, no replication, dumpling release.
>> When running rados bench or using a rbd block device the performance is only
>> 35%-50% of what the underlying XFS filesystem can do and when I look at the
>> IO trace I see random IO going to the physical disk, while the IO at ceph
>> layer is sequential. Haven't tested CephFS yet but expect similar results
>> there.
>> Looking for advice on how to configure Ceph to generate sequential read IO
>> pattern to the underlying physical disk.
>>
>> Have a nice day,
>> Dani
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl)) (ceph 0.61.7)

2013-08-26 Thread Samuel Just
I just backported that one and 2 related patches to cuttlefish head.
-Sam

On Mon, Aug 26, 2013 at 2:14 PM, Stefan Priebe  wrote:
> I had the same problem and backported
> 6951d2345a5d837c3b14103bd4d8f5ee4407c937 to ceph.
>
> It fixes it for me.
>
> Greets,
> Stefan
>
> Am 26.08.2013 23:10, schrieb Samuel Just:
>
>> This is the same osd, and it hasn't been working in the meantime?  Can
>> your cluster operate without that osd?
>> -Sam
>>
>> On Mon, Aug 19, 2013 at 2:05 PM, Olivier Bonvalet 
>> wrote:
>>>
>>> Le lundi 19 août 2013 à 12:27 +0200, Olivier Bonvalet a écrit :

 Hi,

 I have an OSD which crash every time I try to start it (see logs below).
 Is it a known problem ? And is there a way to fix it ?

 root! taman:/var/log/ceph# grep -v ' pipe' osd.65.log
 2013-08-19 11:07:48.478558 7f6fe367a780  0 ceph version 0.61.7
 (8f010aff684e820ecc837c25ac77c7a05d7191ff), process ceph-osd, pid 19327
 2013-08-19 11:07:48.516363 7f6fe367a780  0
 filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is supported and
 appears to work
 2013-08-19 11:07:48.516380 7f6fe367a780  0
 filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is disabled via
 'filestore fiemap' config option
 2013-08-19 11:07:48.516514 7f6fe367a780  0
 filestore(/var/lib/ceph/osd/ceph-65) mount did NOT detect btrfs
 2013-08-19 11:07:48.517087 7f6fe367a780  0
 filestore(/var/lib/ceph/osd/ceph-65) mount syscall(SYS_syncfs, fd) fully
 supported
 2013-08-19 11:07:48.517389 7f6fe367a780  0
 filestore(/var/lib/ceph/osd/ceph-65) mount found snaps <>
 2013-08-19 11:07:49.199483 7f6fe367a780  0
 filestore(/var/lib/ceph/osd/ceph-65) mount: enabling WRITEAHEAD journal
 mode: btrfs not detected
 2013-08-19 11:07:52.191336 7f6fe367a780  1 journal _open /dev/sdk4 fd
 18: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
 2013-08-19 11:07:52.196020 7f6fe367a780  1 journal _open /dev/sdk4 fd
 18: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
 2013-08-19 11:07:52.196920 7f6fe367a780  1 journal close /dev/sdk4
 2013-08-19 11:07:52.199908 7f6fe367a780  0
 filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is supported and
 appears to work
 2013-08-19 11:07:52.199916 7f6fe367a780  0
 filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is disabled via
 'filestore fiemap' config option
 2013-08-19 11:07:52.200058 7f6fe367a780  0
 filestore(/var/lib/ceph/osd/ceph-65) mount did NOT detect btrfs
 2013-08-19 11:07:52.200886 7f6fe367a780  0
 filestore(/var/lib/ceph/osd/ceph-65) mount syscall(SYS_syncfs, fd) fully
 supported
 2013-08-19 11:07:52.200919 7f6fe367a780  0
 filestore(/var/lib/ceph/osd/ceph-65) mount found snaps <>
 2013-08-19 11:07:52.215850 7f6fe367a780  0
 filestore(/var/lib/ceph/osd/ceph-65) mount: enabling WRITEAHEAD journal
 mode: btrfs not detected
 2013-08-19 11:07:52.219819 7f6fe367a780  1 journal _open /dev/sdk4 fd
 26: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
 2013-08-19 11:07:52.227420 7f6fe367a780  1 journal _open /dev/sdk4 fd
 26: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
 2013-08-19 11:07:52.500342 7f6fe367a780  0 osd.65 144201 crush map has
 features 262144, adjusting msgr requires for clients
 2013-08-19 11:07:52.500353 7f6fe367a780  0 osd.65 144201 crush map has
 features 262144, adjusting msgr requires for osds
 2013-08-19 11:08:13.581709 7f6fbdcb5700 -1 osd/OSD.cc: In function
 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f6fbdcb5700 time 
 2013-08-19
 11:08:13.579519
 osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl))

   ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
   1: (OSDService::get_map(unsigned int)+0x44b) [0x6f5b9b]
   2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
 PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>,
 std::less<boost::intrusive_ptr<PG> >,
 std::allocator<boost::intrusive_ptr<PG> > >*)+0x3c8) [0x6f8f48]
   3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> >
 const&, ThreadPool::TPHandle&)+0x31f) [0x6f975f]
   4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> >
 const&, ThreadPool::TPHandle&)+0x14) [0x7391d4]
   5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x8f8e3a]
   6: (ThreadPool::WorkThread::entry()+0x10) [0x8fa0e0]
   7: (()+0x6b50) [0x7f6fe3070b50]
   8: (clone()+0x6d) [0x7f6fe15cba7d]
   NOTE: a copy of the executable, or `objdump -rdS <executable>` is
 needed to interpret this.

 full logs here : http://pastebin.com/RphNyLU0


>>>
>>> Hi,
>>>
>>> still same problem with Ceph 0.61.8 :
>>>
>>> 2013-08-19 23:01:54.369609 7fdd667a4780  0 osd.65 144279 crush map has
>>> features 262144, adjusting msgr requires for osds
>>> 2013-08-19 23:01:58.315115 7fdd405de700 -1 osd/OSD.cc: In function
>>> 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fdd405de700 time 2013-0

Re: [ceph-users] osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl)) (ceph 0.61.7)

2013-08-26 Thread Stefan Priebe
I had the same problem and backported 
6951d2345a5d837c3b14103bd4d8f5ee4407c937 to ceph.


It fixes it for me.
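
In case anyone wants to do the same, cherry-picking that commit onto a
cuttlefish checkout is roughly:

  git cherry-pick 6951d2345a5d837c3b14103bd4d8f5ee4407c937

(possibly with a conflict or two to resolve by hand).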

Greets,
Stefan

Am 26.08.2013 23:10, schrieb Samuel Just:

This is the same osd, and it hasn't been working in the meantime?  Can
your cluster operate without that osd?
-Sam

On Mon, Aug 19, 2013 at 2:05 PM, Olivier Bonvalet  wrote:

Le lundi 19 août 2013 à 12:27 +0200, Olivier Bonvalet a écrit :

Hi,

I have an OSD which crash every time I try to start it (see logs below).
Is it a known problem ? And is there a way to fix it ?

root! taman:/var/log/ceph# grep -v ' pipe' osd.65.log
2013-08-19 11:07:48.478558 7f6fe367a780  0 ceph version 0.61.7 
(8f010aff684e820ecc837c25ac77c7a05d7191ff), process ceph-osd, pid 19327
2013-08-19 11:07:48.516363 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount FIEMAP ioctl is supported and appears to work
2013-08-19 11:07:48.516380 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-08-19 11:07:48.516514 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount did NOT detect btrfs
2013-08-19 11:07:48.517087 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount syscall(SYS_syncfs, fd) fully supported
2013-08-19 11:07:48.517389 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) mount 
found snaps <>
2013-08-19 11:07:49.199483 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-08-19 11:07:52.191336 7f6fe367a780  1 journal _open /dev/sdk4 fd 18: 
53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.196020 7f6fe367a780  1 journal _open /dev/sdk4 fd 18: 
53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.196920 7f6fe367a780  1 journal close /dev/sdk4
2013-08-19 11:07:52.199908 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount FIEMAP ioctl is supported and appears to work
2013-08-19 11:07:52.199916 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-08-19 11:07:52.200058 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount did NOT detect btrfs
2013-08-19 11:07:52.200886 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount syscall(SYS_syncfs, fd) fully supported
2013-08-19 11:07:52.200919 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) mount 
found snaps <>
2013-08-19 11:07:52.215850 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-08-19 11:07:52.219819 7f6fe367a780  1 journal _open /dev/sdk4 fd 26: 
53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.227420 7f6fe367a780  1 journal _open /dev/sdk4 fd 26: 
53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.500342 7f6fe367a780  0 osd.65 144201 crush map has features 
262144, adjusting msgr requires for clients
2013-08-19 11:07:52.500353 7f6fe367a780  0 osd.65 144201 crush map has features 
262144, adjusting msgr requires for osds
2013-08-19 11:08:13.581709 7f6fbdcb5700 -1 osd/OSD.cc: In function 'OSDMapRef 
OSDService::get_map(epoch_t)' thread 7f6fbdcb5700 time 2013-08-19 
11:08:13.579519
osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl))

  ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
  1: (OSDService::get_map(unsigned int)+0x44b) [0x6f5b9b]
  2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, 
std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, 
std::allocator<boost::intrusive_ptr<PG> > >*)+0x3c8) [0x6f8f48]
  3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, 
ThreadPool::TPHandle&)+0x31f) [0x6f975f]
  4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, 
ThreadPool::TPHandle&)+0x14) [0x7391d4]
  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x8f8e3a]
  6: (ThreadPool::WorkThread::entry()+0x10) [0x8fa0e0]
  7: (()+0x6b50) [0x7f6fe3070b50]
  8: (clone()+0x6d) [0x7f6fe15cba7d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.

full logs here : http://pastebin.com/RphNyLU0




Hi,

still same problem with Ceph 0.61.8 :

2013-08-19 23:01:54.369609 7fdd667a4780  0 osd.65 144279 crush map has features 
262144, adjusting msgr requires for osds
2013-08-19 23:01:58.315115 7fdd405de700 -1 osd/OSD.cc: In function 'OSDMapRef 
OSDService::get_map(epoch_t)' thread 7fdd405de700 time 2013-08-19 
23:01:58.313955
osd/OSD.cc: 4847: FAILED assert(_get_map_bl(epoch, bl))

  ceph version 0.61.8 (a6fdcca3bddbc9f177e4e2bf0d9cdd85006b028b)
  1: (OSDService::get_map(unsigned int)+0x44b) [0x6f736b]
  2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, 
std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, 
std::allocator<boost::intrusive_ptr<PG> > >*)+0x3c8) [0x6fa708]
  3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, 
ThreadPool::TPHandle&)+0x31f) [0x6faf1f]
  4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, 
ThreadPool::TPHandle&)+0x14) [0x73a9b4]
  5: (ThreadPool::worker(Threa

Re: [ceph-users] radosgw subusers permission problem

2013-08-26 Thread Yehuda Sadeh
On Fri, Aug 23, 2013 at 5:31 AM, Mihály Árva-Tóth
 wrote:
> Hello,
>
> I have an user with 3 subuser:
>
> { "user_id": "johndoe",
>   "display_name": "John Doe",
>   "email": "",
>   "suspended": 0,
>   "max_buckets": 1000,
>   "auid": 0,
>   "subusers": [
> { "id": "johndoe:readonly",
>   "permissions": "read"},
> { "id": "johndoe:swift",
>   "permissions": "full-control"},
> { "id": "johndoe:wo",
>   "permissions": "write"}],
>   "keys": [
> { "user": "johndoe",
>   "access_key": "xxx",
>   "secret_key": "xxx}],
>   "swift_keys": [
> { "user": "johndoe:readonly",
>   "secret_key": "abcde"},
> { "user": "johndoe:swift",
>   "secret_key": "fghij"},
> { "user": "johndoe:wo",
>   "secret_key": "klmno"}],
>   "caps": []}
>
> If I understand correct johndoe:readonly subuser has no privileges to create
> container or upload object. But I can do:
>
> swift -V 1.0 -A http://localhost/auth -U johndoe:readonly -K abcde post
> testcontainer
> swift -V 1.0 -A http://localhost/auth -U johndoe:readonly -K abcde upload
> testcontainer testfile.100
> swift -V 1.0 -A http://localhost/auth -U johndoe:readonly -K abcde stat
> testcontainer sparse.100
>Account: v1
>  Container: testcontainer
> Object: sparse.100
>   Content Type: binary/octet-stream
> Content Length: 5242880
>  Last Modified: Fri, 23 Aug 2013 12:25:57 GMT
>   ETag: 5f363e0e58a95f06cbe9bbc662c5dfb6
> Meta Mtime: 1372251959.01
> ...
>
>
> On the other side, the johndoe:wo user (who has write permission only) should not be
> able to list containers and objects. But I can do it:
>
> swift -V 1.0 -A http://localhost/auth -U johndoe:wo -K klmno list
> testcontainer
> sparse.100
>
> Is there anything that I misunderstood?
>

Hi,

  thank you for the report. I opened tracker issue #6126.

Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl)) (ceph 0.61.7)

2013-08-26 Thread Samuel Just
This is the same osd, and it hasn't been working in the meantime?  Can
your cluster operate without that osd?
-Sam

On Mon, Aug 19, 2013 at 2:05 PM, Olivier Bonvalet  wrote:
> Le lundi 19 août 2013 à 12:27 +0200, Olivier Bonvalet a écrit :
>> Hi,
>>
>> I have an OSD which crash every time I try to start it (see logs below).
>> Is it a known problem ? And is there a way to fix it ?
>>
>> root! taman:/var/log/ceph# grep -v ' pipe' osd.65.log
>> 2013-08-19 11:07:48.478558 7f6fe367a780  0 ceph version 0.61.7 
>> (8f010aff684e820ecc837c25ac77c7a05d7191ff), process ceph-osd, pid 19327
>> 2013-08-19 11:07:48.516363 7f6fe367a780  0 
>> filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is supported and 
>> appears to work
>> 2013-08-19 11:07:48.516380 7f6fe367a780  0 
>> filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is disabled via 
>> 'filestore fiemap' config option
>> 2013-08-19 11:07:48.516514 7f6fe367a780  0 
>> filestore(/var/lib/ceph/osd/ceph-65) mount did NOT detect btrfs
>> 2013-08-19 11:07:48.517087 7f6fe367a780  0 
>> filestore(/var/lib/ceph/osd/ceph-65) mount syscall(SYS_syncfs, fd) fully 
>> supported
>> 2013-08-19 11:07:48.517389 7f6fe367a780  0 
>> filestore(/var/lib/ceph/osd/ceph-65) mount found snaps <>
>> 2013-08-19 11:07:49.199483 7f6fe367a780  0 
>> filestore(/var/lib/ceph/osd/ceph-65) mount: enabling WRITEAHEAD journal 
>> mode: btrfs not detected
>> 2013-08-19 11:07:52.191336 7f6fe367a780  1 journal _open /dev/sdk4 fd 18: 
>> 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
>> 2013-08-19 11:07:52.196020 7f6fe367a780  1 journal _open /dev/sdk4 fd 18: 
>> 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
>> 2013-08-19 11:07:52.196920 7f6fe367a780  1 journal close /dev/sdk4
>> 2013-08-19 11:07:52.199908 7f6fe367a780  0 
>> filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is supported and 
>> appears to work
>> 2013-08-19 11:07:52.199916 7f6fe367a780  0 
>> filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is disabled via 
>> 'filestore fiemap' config option
>> 2013-08-19 11:07:52.200058 7f6fe367a780  0 
>> filestore(/var/lib/ceph/osd/ceph-65) mount did NOT detect btrfs
>> 2013-08-19 11:07:52.200886 7f6fe367a780  0 
>> filestore(/var/lib/ceph/osd/ceph-65) mount syscall(SYS_syncfs, fd) fully 
>> supported
>> 2013-08-19 11:07:52.200919 7f6fe367a780  0 
>> filestore(/var/lib/ceph/osd/ceph-65) mount found snaps <>
>> 2013-08-19 11:07:52.215850 7f6fe367a780  0 
>> filestore(/var/lib/ceph/osd/ceph-65) mount: enabling WRITEAHEAD journal 
>> mode: btrfs not detected
>> 2013-08-19 11:07:52.219819 7f6fe367a780  1 journal _open /dev/sdk4 fd 26: 
>> 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
>> 2013-08-19 11:07:52.227420 7f6fe367a780  1 journal _open /dev/sdk4 fd 26: 
>> 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
>> 2013-08-19 11:07:52.500342 7f6fe367a780  0 osd.65 144201 crush map has 
>> features 262144, adjusting msgr requires for clients
>> 2013-08-19 11:07:52.500353 7f6fe367a780  0 osd.65 144201 crush map has 
>> features 262144, adjusting msgr requires for osds
>> 2013-08-19 11:08:13.581709 7f6fbdcb5700 -1 osd/OSD.cc: In function 
>> 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f6fbdcb5700 time 2013-08-19 
>> 11:08:13.579519
>> osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl))
>>
>>  ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
>>  1: (OSDService::get_map(unsigned int)+0x44b) [0x6f5b9b]
>>  2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
>> PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, 
>> std::less<boost::intrusive_ptr<PG> >, 
>> std::allocator<boost::intrusive_ptr<PG> > >*)+0x3c8) [0x6f8f48]
>>  3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > 
>> const&, ThreadPool::TPHandle&)+0x31f) [0x6f975f]
>>  4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, 
>> ThreadPool::TPHandle&)+0x14) [0x7391d4]
>>  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x8f8e3a]
>>  6: (ThreadPool::WorkThread::entry()+0x10) [0x8fa0e0]
>>  7: (()+0x6b50) [0x7f6fe3070b50]
>>  8: (clone()+0x6d) [0x7f6fe15cba7d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
>> interpret this.
>>
>> full logs here : http://pastebin.com/RphNyLU0
>>
>>
>
> Hi,
>
> still same problem with Ceph 0.61.8 :
>
> 2013-08-19 23:01:54.369609 7fdd667a4780  0 osd.65 144279 crush map has 
> features 262144, adjusting msgr requires for osds
> 2013-08-19 23:01:58.315115 7fdd405de700 -1 osd/OSD.cc: In function 'OSDMapRef 
> OSDService::get_map(epoch_t)' thread 7fdd405de700 time 2013-08-19 
> 23:01:58.313955
> osd/OSD.cc: 4847: FAILED assert(_get_map_bl(epoch, bl))
>
>  ceph version 0.61.8 (a6fdcca3bddbc9f177e4e2bf0d9cdd85006b028b)
>  1: (OSDService::get_map(unsigned int)+0x44b) [0x6f736b]
>  2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
> PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, 
> std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> 
> > >*)+0x3c8) [0x6fa708]
>  3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, 
> ThreadPool::TPHandle&)+0x31f) [0x6faf1f]
>  4: (OSD::PeeringWQ::_process(st

Re: [ceph-users] Sequential placement

2013-08-26 Thread Samuel Just
I think rados bench is actually creating new objects with each IO.
Can you paste in the command you used?
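(A typical invocation, for reference, is something like
`rados bench -p <pool> 60 write -t 16 -b 4194304`; if it is indeed creating
a brand-new 4MB object per IO, the on-disk layout won't end up sequential.)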
-Sam

On Tue, Aug 20, 2013 at 7:28 AM, daniel pol  wrote:
> Hi !
>
> Ceph newbie here with a placement question. I'm trying to get a simple Ceph
> setup to run well with big sequential reads (>256k packets).
> This is for learning/benchmarking purpose only and the setup I'm working
> with has a single server with 2 data drives, one OSD on each, journals on
> the OS drive, no replication, dumpling release.
> When running rados bench or using a rbd block device the performance is only
> 35%-50% of what the underlying XFS filesystem can do and when I look at the
> IO trace I see random IO going to the physical disk, while the IO at ceph
> layer is sequential. Haven't tested CephFS yet but expect similar results
> there.
> Looking for advice on how to configure Ceph to generate sequential read IO
> pattern to the underlying physical disk.
>
> Have a nice day,
> Dani
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon listens on wrong interface

2013-08-26 Thread Alfredo Deza
On Mon, Aug 26, 2013 at 3:41 PM, Fuchs, Andreas (SwissTXT) <
andreas.fu...@swisstxt.ch> wrote:

> Hi Sage
>
> Many thanks for your answer. The Cluster is now up and running and
> "talking" on the right interfaces.
>
> Regards
> Andi
>
> -Original Message-
> From: Sage Weil [mailto:s...@inktank.com]
> Sent: Montag, 26. August 2013 18:20
> To: Fuchs, Andreas (SwissTXT)
> Cc: ceph-us...@ceph.com
> Subject: RE: [ceph-users] ceph-mon listens on wrong interface
>
> On Mon, 26 Aug 2013, Fuchs, Andreas (SwissTXT) wrote:
> > Hi Sage
> >
> > Thanks for your answer
> >
> > I had ceph.conf already adjusted
> >   mon_hosts has the list of public ip's of the mon servers
> >
> > but ceph-mon is listening on eth0 instead of the ip listed in
> > mon_hosts
> >
> > also entering [mon.ceph-ceph01] sections with host= and mon_addr=
> > entries did not change this
> >
> > do I have to redeploy the installation, so far I just pushed the new
> config and restarted the services?
>
> You will need to redeploy.  ceph-deploy purge and ceph-deploy purgedata to
> reset your nodes.
>
> > Btw.
> > ceph-deploy new ceph01:10.100.214.x ... wont't work as it requires a
> > name not an ip, but in my case ceph01 resolves to the correct ip
>

Andreas, I confirmed this was the case and it is indeed a bug, so I created
a ticket and I just got it
merged and fixed in the master branch. Ticket link:
http://tracker.ceph.com/issues/6124

There should be a bug-fix release later this week including this change.
Thanks for reporting it!

>
> Interesting; we should make ceph-deploy take an IP there too.
>
> sage
>
> >
> > regards
> > Andi
> >
> > -Original Message-
> > From: Sage Weil [mailto:s...@inktank.com]
> > Sent: Freitag, 23. August 2013 17:28
> > To: Fuchs, Andreas (SwissTXT)
> > Cc: ceph-us...@ceph.com
> > Subject: Re: [ceph-users] ceph-mon listens on wrong interface
> >
> > Hi Andreas,
> >
> > On Fri, 23 Aug 2013, Fuchs, Andreas (SwissTXT) wrote:
> > > Hi, we built a ceph cluster with the folling network setup
> > >
> > > eth0 is on a management network (access for admins and monitoring
> > > tools)
> > > eth1 is ceph sync
> > > eth2 is ceph public
> > >
> > > deployed by ceph-deploy I have the following config
> > >
> > > [global]
> > > fsid = 18c6b4db-b936-43a2-ba68-d750036036cc
> > > mon_initial_members = ceph01, ceph02, ceph03 mon_host =
> > > 10.100.214.11,10.100.214.12,10.100.214.13
> > > auth_supported = cephx
> > > osd_journal_size = 5000
> > > filestore_xattr_use_omap = true
> > > public_network = 10.100.214.0/24
> > > cluster_network = 10.100.213.0/24
> > >
> > > the problem is now that ceph-mon is listening on eth0
> > >
> > > netstat -lpn | grep 6789
> > > tcp    0      0 10.100.220.111:6789     0.0.0.0:*     LISTEN      1609/ceph-mon
> > >
> > > where it should listen on eth2, i.e. 10.100.214.x
> > >
> > > how can I achieve this?
> >
> > I assume you used ceph-deploy here?  The problem is that when you do
> >
> >  ceph-deploy new ceph01 ceph02 ceph03
> >
> > it is using the ceph01 etc as both the hostname to identify the
> > instance
> > (good) and looking it up via DNS to resolve the IP for the mon_host
> > list (bad, in your case).  Try
> >
> >  ceph-deploy new ceph01:10.100.214.x ...
> >
> > or
> >
> >  ceph-deploy new ceph01:ceph01.myothernetwork.foo.com ...
> >
> > Or, just manually edit the ceph.conf after the 'ceph-deploy new ...'
> > command to get how you want it.
> >
> > sage
> >
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] lvm for a quick ceph lab cluster test

2013-08-26 Thread Samuel Just
Seems reasonable to me.  I'm not sure I've heard anything about using
LVM under ceph.  Let us know how it goes!
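
Something along these lines should work for a quick throwaway test (device
names and sizes are examples):

  pvcreate /dev/sdb
  vgcreate ceph-vg /dev/sdb
  lvcreate -L 100G -n osd0 ceph-vg
  mkfs.xfs /dev/ceph-vg/osd0
  mount /dev/ceph-vg/osd0 /var/lib/ceph/osd/ceph-0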
-Sam

On Wed, Aug 21, 2013 at 5:18 PM, Liu, Larry  wrote:
> Hi guys,
>
> I'm a newbie in ceph. Wondering if I can use 2~3 LVM volumes on each server,
> 2 servers total, to run some quick ceph clustering tests.
>
> Thanks!
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage, File Systems and Data Scrubbing

2013-08-26 Thread Samuel Just
ceph-osd builds a transactional interface on top of the usual posix
operations so that we can do things like atomically perform an object
write and update the osd metadata.  The current implementation
requires our own journal and some metadata ordering (which is provided
by the backing filesystem's own journal) to implement our own atomic
operations.  It's true that in some cases you might be able to get
away with having the client replay the operation (which we do anyway
for other reasons), but that wouldn't be enough to ensure consistency
of the filesystem's own internal structures.  It also wouldn't be
enough to ensure that the OSD's internal structure remain consistent
in the case of a crash.  Also, if the client is unavailable to do the
replay, you'd have a problem.

In summary, it's actually really hard to detect partial/corrupted
writes after a crash without journaling of some form.
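
The shape of the write path, as a minimal sketch (illustrative pseudocode
with hypothetical types, not the actual FileStore/FileJournal code):

  // Hypothetical minimal types, just to make the idea concrete:
  struct Transaction { /* list of ops */ };
  struct Journal {
    void append_and_flush(const Transaction&) { /* write + fsync */ }
    void trim(const Transaction&)             { /* advance trim pointer */ }
  };
  struct FileStore {
    void apply(const Transaction&)            { /* posix ops on backing fs */ }
  };

  void apply_transaction(Journal& j, FileStore& fs, Transaction& t) {
    j.append_and_flush(t);  // 1. durably log the whole transaction first
    fs.apply(t);            // 2. then apply it to the backing filesystem
    j.trim(t);              // 3. drop the entry once the fs has committed
  }
  // After a crash, replaying any untrimmed entries yields a consistent
  // object store, regardless of partially-applied writes.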
-Sam

On Wed, Aug 21, 2013 at 6:03 PM, Mike Lowe  wrote:
> Let me make a simpler case: to get ACID (https://en.wikipedia.org/wiki/ACID),
> which covers all the properties you want in a filesystem or a database, you need a
> journal.  You need a journaled filesystem to make the object store's file
> operations safe.  You need a journal in ceph to make sure the object
> operations are safe.  Flipped bits are a separate problem that may be aided
> by journaling but the primary objective of a journal is to make guarantees
> about concurrent operations and interrupted operations.  There isn't a
> person on this list who hasn't had an osd die; without a journal, starting
> that osd up again and getting it usable would be impractical.
>
> On Aug 21, 2013, at 8:00 PM, Johannes Klarenbeek
>  wrote:
>
>
>
>
> I think you are missing the distinction between metadata journaling and data
> journaling.  In most cases a journaling filesystem is one that journal's
> it's own metadata but your data is on its own.  Consider the case where you
> have a replication level of two, the osd filesystems have journaling
> disabled and you append a block to a file (which is an object in terms of
> ceph) but only one commits the change in file size to disk.  Later you scrub
> and discover a discrepancy in object sizes, with a replication level of 2
> there is no way to authoritatively say which one is correct just based on
> what's in ceph.  This is a similar scenario to a btrfs bug that caused me to
> lose data with ceph.  Journaling your metadata is the absolute minimum level
> of assurance you need to make a transactional system like ceph work.
>
> Hey Mike J
>
> I get your point. However, isn’t it then possible to authoritatively say
> which one is the correct one in case of 3 OSD’s?
> Or is the replication level a configuration setting that tells the cluster
> that the object needs to be replicated 3 times?
> In both cases, data scrubbing chooses the majority among the identically
> replicated objects in order to know which one is authoritative.
>
> But I also believe (!) that each object has a checksum and each PG too so
> that it should be easy to find the corrupted object on any of the OSD’s.
> How else would scrubbing find corrupted sectors? Especially when I think
> about 2TB SATA disks being hit by cosmic-rays that flip a bit somewhere.
> It happens more often with big cheap TB disks, but that doesn’t mean the
> corrupted sector is a bad sector (as in, not usable anymore). Journaling is not
> going to help anyone with this.
> Therefore I believe (again) that the data scrubber must have a mechanism to
> detect these types of corruptions even in a 2 OSD setup by means of
> checksums (or better, with a hashed checksum id).
>
> Also, aren’t there 2 types of transactions; one for writing and one for
> replicating?
>
> On Aug 21, 2013, at 4:23 PM, Johannes Klarenbeek
>  wrote:
>
>
>
> Dear ceph-users,
>
> I read a lot of documentation today about ceph architecture and linux file
> system benchmarks in particular, and I could not help noticing something that
> I would like to clear up for myself. Take into account that it has been a while since
> I actually touched linux, but I did some programming on php2b12 and apache
> back in the days so I’m not a complete newbie. The real question is below if
> you do not like reading the rest ;)
>
> What I have come to understand about file systems for OSD’s is that in
> theory btrfs is the file system of choice. However, due to its young age
> it’s not considered stable yet. Therefore EXT4 but preferably XFS is used in
> most cases. It seems that most people choose this system because of its
> journaling feature and XFS for its additional attribute storage which has a
> 64kb limit which should be sufficient for most operations.
>
> But when you look at file system benchmarks btrfs is really, really slow.
> Then comes XFS, then EXT4, but EXT2 really dwarfs all other throughput
> results. On journaling systems (like XFS, EXT4 and btrfs) disabling
> journaling actually helps throughput as well. Sometimes more than 2 times
> for write 

Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-26 Thread Oliver Daudey
Hey Samuel,

I have been trying to get it reproduced on my test-cluster and seem to
have found a way.  Try: `rbd bench-write test --io-threads 80
--io-pattern=rand'.  On my test-cluster, this closely replicates what I
see during profiling on my production-cluster, including the extra
CPU-usage by leveldb, which doesn't show up on Cuttlefish.  It's very
curious that "PGLog::undirty()" is also still showing up near the top,
even in 0.67.2.

I'll send you the logs by private mail.


   Regards,

  Oliver

On ma, 2013-08-26 at 13:35 -0700, Samuel Just wrote:
> Can you attach a log from the startup of one of the dumpling osds on
> your production machine (no need for logging, just need some of the
> information dumped on every boot)?
> 
> libleveldb is leveldb.  We've used leveldb for a few things since
> bobtail.  If anything, the load on leveldb should be lighter in
> dumpling, I would think...  I'll have to try to reproduce it locally.
> I'll keep you posted.
> -Sam
> 
> On Sat, Aug 24, 2013 at 10:11 AM, Oliver Daudey  wrote:
> > Hey Samuel,
> >
> > Unfortunately, disabling "wbthrottle" made almost no difference on my
> > production-cluster.  OSD-load was still much higher on Dumpling.
> >
> > I've mentioned this several times already, but when profiling with `perf
> > top' on my production-cluster, any time I'm running a Dumpling-OSD,
> > several "libleveldb"-related entries come up near the top, that don't
> > show up when running the Cuttlefish-OSD at all.  Let's concentrate on
> > that for a moment, as it's a clearly visible difference on my
> > production-cluster, which shows the actual problem.
> >
> > Dumpling OSDs:
> >  17.23%  [kernel] [k] intel_idle
> >   6.35%  [kernel] [k] find_busiest_group
> >   4.36%  kvm  [.] 0x2cdbb0
> >   3.38%  libleveldb.so.1.9[.] 0x22821
> >   2.40%  libc-2.11.3.so   [.] memcmp
> >   2.04%  ceph-osd [.] ceph_crc32c_le_intel
> >   1.90%  [kernel] [k] _raw_spin_lock
> >   1.87%  [kernel] [k] copy_user_generic_string
> >   1.35%  [kernel] [k]
> > default_send_IPI_mask_sequence_phys
> >   1.34%  [kernel] [k] __hrtimer_start_range_ns
> >   1.14%  libc-2.11.3.so   [.] memcpy
> >   1.03%  [kernel] [k] hrtimer_interrupt
> >   1.01%  [kernel] [k] do_select
> >   1.00%  [kernel] [k] __schedule
> >   0.99%  [kernel] [k] _raw_spin_unlock_irqrestore
> >   0.97%  [kernel] [k] cpumask_next_and
> >   0.97%  [kernel] [k] find_next_bit
> >   0.96%  libleveldb.so.1.9[.]
> > leveldb::InternalKeyComparator::Compar
> >   0.91%  [kernel] [k] _raw_spin_lock_irqsave
> >   0.91%  [kernel] [k] fget_light
> >   0.89%  [kernel] [k] clockevents_program_event
> >   0.79%  [kernel] [k] sync_inodes_sb
> >   0.78%  libleveldb.so.1.9[.] leveldb::Block::Iter::Next()
> >   0.75%  [kernel] [k] apic_timer_interrupt
> >   0.70%  [kernel] [k] native_write_cr0
> >   0.60%  [kvm_intel]  [k] vmx_vcpu_run
> >   0.58%  [kernel] [k] load_balance
> >   0.57%  [kernel] [k] rcu_needs_cpu
> >   0.56%  ceph-osd [.] PGLog::undirty()
> >   0.51%  libpthread-2.11.3.so [.] pthread_mutex_lock
> >   0.50%  [vdso]   [.] 0x7fff6dbff6ce
> >
> > Same load, but with Cuttlefish-OSDs:
> >  19.23%  [kernel] [k] intel_idle
> >   6.43%  [kernel] [k] find_busiest_group
> >   5.25%  kvm  [.] 0x152a75
> >   2.70%  ceph-osd [.] ceph_crc32c_le
> >   2.44%  [kernel] [k] _raw_spin_lock
> >   1.95%  [kernel] [k] copy_user_generic_string
> >   1.53%  [kernel] [k]
> > default_send_IPI_mask_sequence_phys
> >   1.28%  [kernel] [k] __hrtimer_start_range_ns
> >   1.21%  [kernel] [k] do_select
> >   1.19%  [kernel] [k] hrtimer_interrupt
> >   1.19%  [kernel] [k] _raw_spin_unlock_irqrestore
> >   1.16%  [kernel] [k] fget_light
> >   1.12%  [kernel] [k] cpumask_next_and
> >   1.11%  [kernel] [k] clockevents_program_event
> >   1.08%  [kernel] [k] __schedule
> >   1.08%  [kernel] [k] find_next_bit
> >   0.99%  [kernel] [k] _raw_spin_lock_irqsave
> >   0.90%  [kernel] [k] native_write_cr0
> >   0.83%  [kernel] [k] native_write_msr_safe
> >   0.82%  [kernel] [k] apic_timer_interrupt
> >   0.70%  libc

Re: [ceph-users] locking rbd device

2013-08-26 Thread Josh Durgin

On 08/26/2013 01:49 PM, Josh Durgin wrote:

On 08/26/2013 12:03 AM, Wolfgang Hennerbichler wrote:

hi list,

I realize there's a command called "rbd lock" to lock an image. Can
libvirt use this to prevent virtual machines from being started
simultaneously on different virtualisation containers?

wogri


Yes - that's the reason for the lock command's existence. You have
to be careful with things like live migration though, which will have
the device open in two places while migration completes. If libvirt
could use rbd locks like it uses its sanlock plugin, it would be
able to deal with this correctly.


To be clear, libvirt doesn't use rbd locking at all right now, but it
could probably be patched to do so without too much effort.
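
For reference, the CLI side of this looks roughly as follows (a sketch, with a
hypothetical image name, lock id and locker):

  rbd lock add myimage mylock
  rbd lock list myimage
  rbd lock remove myimage mylock client.4123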

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bucket count limit

2013-08-26 Thread Samuel Just
As I understand it, that should actually help avoid bucket contention
and thereby increase performance.

Yehuda, anything to add?
-Sam

On Thu, Aug 22, 2013 at 7:08 AM, Mostowiec Dominik
 wrote:
> Hi,
>
> I am thinking about sharding s3 buckets in our CEPH cluster, creating
> bucket-per-XX (256 buckets) or even bucket-per-XXX (4096 buckets), where
> XX/XXX are hex characters taken from the md5 of the object URL.
>
> Could this be a problem? (performance, or some limits)
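>
> As a minimal sketch of that mapping (hypothetical key; the first two hex
> characters of the md5 select one of 256 buckets):
>
>   objkey="some/object/key"
>   bucket="shard-$(printf '%s' "$objkey" | md5sum | cut -c1-2)"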
>
>
>
> --
>
> Regards
>
> Dominik
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] locking rbd device

2013-08-26 Thread Josh Durgin

On 08/26/2013 12:03 AM, Wolfgang Hennerbichler wrote:

hi list,

I realize there's a command called "rbd lock" to lock an image. Can libvirt use 
this to prevent virtual machines from being started simultaneously on different 
virtualisation containers?

wogri


Yes - that's the reason for the lock command's existence. You have
to be careful with things like live migration though, which will have
the device open in two places while migration completes. If libvirt
could use rbd locks like it uses its sanlock plugin, it would be
able to deal with this correctly.

Josh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Optimal configuration to validate Ceph

2013-08-26 Thread Samuel Just
If you create a pool with size 1 (no replication), (2) should be
somewhere around 3x the speed of (1) assuming the client workload has
enough parallelism and is well distributed over objects (so a random
rbd workload with a large queue depth rather than a small sequential
workload with a small queue depth).  If you have 3x replication on
(2), but 1x on (1), you should expect (2) to be pretty close to (1),
perhaps a bit slower due to replication latency.
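
For reference, a no-replication test pool along those lines could be set up
like this (a sketch; the pool name, pg count and thread count are arbitrary):

  ceph osd pool create bench 128 128
  ceph osd pool set bench size 1
  rados bench -p bench 60 write -t 64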

The details will actually depend a lot on the workload.  Do you intend
to use rbd?
-Sam

On Fri, Aug 23, 2013 at 2:41 PM, Sushma R  wrote:
> Hi,
>
> I understand that Ceph is a scalable distributed storage architecture.
> However, I'd like to understand if performance on a single node cluster is
> better or worse than on a 3 node cluster.
> Let's say I have the following 2 setups:
> 1. Single node cluster with one OSD.
> 2. Three node cluster with one OSD on each node.
>
> Would the performance of Setup 2 be approximately (3x) of Setup 1? (OR)
> Would Setup 2 perform better than (3x) Setup 1, because of more parallelism?
> (OR)
> Would Setup 2 perform worse than (3x) Setup 1, because of replication, etc.
>
> In other words, I'm trying to understand whether we definitely need more than
> three nodes to validate the benchmark results, or whether a single/two node
> setup should give an idea of a larger scale.
>
> Thanks,
> Sushma
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-26 Thread Samuel Just
Can you attach a log from the startup of one of the dumpling osds on
your production machine (no need for logging, just need some of the
information dumped on every boot)?

libleveldb is leveldb.  We've used leveldb for a few things since
bobtail.  If anything, the load on leveldb should be lighter in
dumpling, I would think...  I'll have to try to reproduce it locally.
I'll keep you posted.
-Sam

On Sat, Aug 24, 2013 at 10:11 AM, Oliver Daudey  wrote:
> Hey Samuel,
>
> Unfortunately, disabling "wbthrottle" made almost no difference on my
> production-cluster.  OSD-load was still much higher on Dumpling.
>
> I've mentioned this several times already, but when profiling with `perf
> top' on my production-cluster, any time I'm running a Dumpling-OSD,
> several "libleveldb"-related entries come up near the top, that don't
> show up when running the Cuttlefish-OSD at all.  Let's concentrate on
> that for a moment, as it's a clearly visible difference on my
> production-cluster, which shows the actual problem.
>
> Dumpling OSDs:
>  17.23%  [kernel] [k] intel_idle
>   6.35%  [kernel] [k] find_busiest_group
>   4.36%  kvm  [.] 0x2cdbb0
>   3.38%  libleveldb.so.1.9[.] 0x22821
>   2.40%  libc-2.11.3.so   [.] memcmp
>   2.04%  ceph-osd [.] ceph_crc32c_le_intel
>   1.90%  [kernel] [k] _raw_spin_lock
>   1.87%  [kernel] [k] copy_user_generic_string
>   1.35%  [kernel] [k]
> default_send_IPI_mask_sequence_phys
>   1.34%  [kernel] [k] __hrtimer_start_range_ns
>   1.14%  libc-2.11.3.so   [.] memcpy
>   1.03%  [kernel] [k] hrtimer_interrupt
>   1.01%  [kernel] [k] do_select
>   1.00%  [kernel] [k] __schedule
>   0.99%  [kernel] [k] _raw_spin_unlock_irqrestore
>   0.97%  [kernel] [k] cpumask_next_and
>   0.97%  [kernel] [k] find_next_bit
>   0.96%  libleveldb.so.1.9[.]
> leveldb::InternalKeyComparator::Compar
>   0.91%  [kernel] [k] _raw_spin_lock_irqsave
>   0.91%  [kernel] [k] fget_light
>   0.89%  [kernel] [k] clockevents_program_event
>   0.79%  [kernel] [k] sync_inodes_sb
>   0.78%  libleveldb.so.1.9[.] leveldb::Block::Iter::Next()
>   0.75%  [kernel] [k] apic_timer_interrupt
>   0.70%  [kernel] [k] native_write_cr0
>   0.60%  [kvm_intel]  [k] vmx_vcpu_run
>   0.58%  [kernel] [k] load_balance
>   0.57%  [kernel] [k] rcu_needs_cpu
>   0.56%  ceph-osd [.] PGLog::undirty()
>   0.51%  libpthread-2.11.3.so [.] pthread_mutex_lock
>   0.50%  [vdso]   [.] 0x7fff6dbff6ce
>
> Same load, but with Cuttlefish-OSDs:
>  19.23%  [kernel] [k] intel_idle
>   6.43%  [kernel] [k] find_busiest_group
>   5.25%  kvm  [.] 0x152a75
>   2.70%  ceph-osd [.] ceph_crc32c_le
>   2.44%  [kernel] [k] _raw_spin_lock
>   1.95%  [kernel] [k] copy_user_generic_string
>   1.53%  [kernel] [k]
> default_send_IPI_mask_sequence_phys
>   1.28%  [kernel] [k] __hrtimer_start_range_ns
>   1.21%  [kernel] [k] do_select
>   1.19%  [kernel] [k] hrtimer_interrupt
>   1.19%  [kernel] [k] _raw_spin_unlock_irqrestore
>   1.16%  [kernel] [k] fget_light
>   1.12%  [kernel] [k] cpumask_next_and
>   1.11%  [kernel] [k] clockevents_program_event
>   1.08%  [kernel] [k] __schedule
>   1.08%  [kernel] [k] find_next_bit
>   0.99%  [kernel] [k] _raw_spin_lock_irqsave
>   0.90%  [kernel] [k] native_write_cr0
>   0.83%  [kernel] [k] native_write_msr_safe
>   0.82%  [kernel] [k] apic_timer_interrupt
>   0.70%  libc-2.11.3.so   [.] memcpy
>   0.68%  [kernel] [k] sync_inodes_sb
>   0.63%  [kernel] [k] tg_load_down
>   0.63%  [kernel] [k] load_balance
>   0.61%  libpthread-2.11.3.so [.] pthread_mutex_lock
>   0.58%  [kernel] [k] rcu_needs_cpu
>   0.57%  [kernel] [k] fput
>   0.56%  libc-2.11.3.so   [.] 0x7fb29
>   0.54%  [vdso]   [.] 0x7fff2afb873a
>   0.50%  [kernel] [k] iput
>   0.50%  [kernel] [k] reschedule_interrupt
>
> It seems to me like "libleveldb" is accounting for significant extra
> CPU-loading on Dumpling.  Another interesting fa

Re: [ceph-users] Hardware recommendations

2013-08-26 Thread Shain Miley
Martin,

Thank you very much for sharing your insight on hardware options.  This will be 
very useful for us going forward.

Shain

Shain Miley | Manager of Systems and Infrastructure, Digital Media | 
smi...@npr.org | 202.513.3649

From: Martin B Nielsen [mar...@unity3d.com]
Sent: Monday, August 26, 2013 1:13 PM
To: Shain Miley
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Hardware recommendations

Hi Shain,

Those R515 seem to mimic our servers (2U supermicro w. 12x 3.5" bays and 2x 
2.5" in the rear for OS).

Since we need a mix of SSD & platter we have 8x 4TB drives and 4x 500GB SSD + 
2x 250GB SSD for OS in each node (2x 8-port LSI 2308 in IT-mode)

We've partitioned 10GB from each of the 4x 500GB SSDs to use as journals for 
4x 4TB drives, and each of the OS disks holds 2x journals for the remaining 4 
platter disks.
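
For reference, pairing a data disk with a pre-made SSD journal partition via
ceph-deploy looks roughly like this (a sketch, with hypothetical device names):

  ceph-deploy osd prepare node1:/dev/sdb:/dev/sdg1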

We tested a lot of ways to place these journals, and this layout seemed to fit 
best into our setup (pure VM block storage - 3x replica).

Everything connected via 10GbE (1 network for cluster, 1 for public) and 3 
standalone monitor servers.

For storage nodes we use E5-2620/32gb ram, and monitor nodes E3-1260L/16gb ram 
- we've tested with both 1 and 2 nodes going down and starting to redistribute 
data, and they seem to cope more than fine.

Overall I find these nodes as a good compromise between capacity, price and 
performance - we looked into getting 2U servers with 8x 3.5" bays and getting more 
of them, but ultimately went with this.

We also have some boxes from coraid (SR & SRX with and without 
flashcache/etherflash) so we've been able to do some direct comparison and so 
far ceph is looking good - especially price-storage ratio.

At any rate, back to your mail, I think the most important factor is looking at 
all the pieces and making sure you're not being [hard] bottlenecked somewhere - 
we found 24gb ram to be a little on the low side when all 12 disks started to 
redistribute, but 32 is fine. Also not having journals on SSD before writing to 
platter really hurt a lot when we tested - this can prob. be mitigated somewhat 
with better raid controllers. CPU-wise the E5 2620 hardly breaks a sweat even 
when having to do just a little with a node going down.

Good luck with your HW-adventure :).

Cheers,
Martin


On Mon, Aug 26, 2013 at 3:56 PM, Shain Miley <smi...@npr.org> wrote:
Good morning,

I am in the process of deciding what hardware we are going to purchase for our 
new ceph based storage cluster.

I have been informed that I must submit my purchase needs by the end of this 
week in order to meet our FY13 budget requirements  (which does not leave me 
much time).

We are planning to build multiple clusters (one primarily for radosgw at 
location 1; the other for vm block storage at location 2).

We will be building our radosgw storage out first, so this is the main focus of 
this email thread.

I have read all the docs and the white papers, etc on hardware suggestions 
...and we have an existing relationship with Dell, so I have been planning on 
buying a bunch of Dell R515's with 4TB drives and using 10GigE networking for 
this radosgw setup (although this will be primary used for radosgw purposes...I 
will be testing running a limited number of vm's on this infrastructure  as 
well...in order to see what kind of performance we can achieve).

I am just wondering if anyone else has any quick thoughts on these hardware 
choices, or any alternative suggestions that I might look at as I seek to 
finalize our purchasing this week.

Thanks in advance,

Shain

Sent from my iPhone
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon listens on wrong interface

2013-08-26 Thread Fuchs, Andreas (SwissTXT)
Hi Sage

Many thanks for your answer. The Cluster is now up and running and "talking" on 
the right interfaces.

Regards
Andi

-Original Message-
From: Sage Weil [mailto:s...@inktank.com] 
Sent: Montag, 26. August 2013 18:20
To: Fuchs, Andreas (SwissTXT)
Cc: ceph-us...@ceph.com
Subject: RE: [ceph-users] ceph-mon listens on wrong interface

On Mon, 26 Aug 2013, Fuchs, Andreas (SwissTXT) wrote:
> Hi Sage
> 
> Thanks for your answer
> 
> I had ceph.conf already adjusted
>   mon_hosts has the list of public ip's of the mon servers
> 
> but ceph-mon is listening on eth0 instead of the ip listed in 
> mon_hosts
> 
> also entering [mon.ceph-ceph01] sections with host= and mon_addr= 
> entries did not change this
> 
> do I have to redeploy the installation, so far I just pushed the new config 
> and restarted the services?

You will need to redeploy.  ceph-deploy purge and ceph-deploy purgedata to 
reset your nodes.

> Btw.
> ceph-deploy new ceph01:10.100.214.x ... won't work as it requires a 
> name not an ip, but in my case ceph01 resolves to the correct ip

Interesting; we should make ceph-deploy take an IP there too.

sage

> 
> regards
> Andi
> 
> -Original Message-
> From: Sage Weil [mailto:s...@inktank.com]
> Sent: Freitag, 23. August 2013 17:28
> To: Fuchs, Andreas (SwissTXT)
> Cc: ceph-us...@ceph.com
> Subject: Re: [ceph-users] ceph-mon listens on wrong interface
> 
> Hi Andreas,
> 
> On Fri, 23 Aug 2013, Fuchs, Andreas (SwissTXT) wrote:
> > Hi, we built a ceph cluster with the folling network setup
> > 
> > eth0 is on a management network (access for admins and monitoring
> > tools)
> > eth1 is ceph sync
> > eth2 is ceph public
> > 
> > deployed by ceph-deploy I have the following config
> > 
> > [global]
> > fsid = 18c6b4db-b936-43a2-ba68-d750036036cc
> > mon_initial_members = ceph01, ceph02, ceph03 mon_host =
> > 10.100.214.11,10.100.214.12,10.100.214.13
> > auth_supported = cephx
> > osd_journal_size = 5000
> > filestore_xattr_use_omap = true
> > public_network = 10.100.214.0/24
> > cluster_network = 10.100.213.0/24
> > 
> > the problem is now that ceph-mon is listening on eth0
> > 
> > netstat -lpn | grep 6789
> > tcp0  0 10.100.220.111:6789 0.0.0.0:*   LISTEN  
> > 1609/ceph-mon
> > 
> > where it should listen on eth2 (10.100.214.x)
> > 
> > how can I achieve this?
> 
> I assume you used ceph-deploy here?  The problem is that when you do
> 
>  ceph-deploy new ceph01 ceph02 ceph03
> 
> it is using the ceph01 etc as both the hostname to identify the 
> instance
> (good) and looking it up via DNS to resolve the IP for the mon_host 
> list (bad, in your case).  Try
> 
>  ceph-deploy new ceph01:10.100.214.x ...
> 
> or
> 
>  ceph-deploy new ceph01:ceph01.myothernetwork.foo.com ...
> 
> Or, just manually edit the ceph.conf after the 'ceph-deploy new ...' 
> command to get how you want it.
> 
> sage
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1 particular ceph-mon never joins on 0.67.2

2013-08-26 Thread Travis Rhoden
Cool.  So far I have tried:

start on (local-filesystems and net-device-up IFACE=eth0)
start on (local-filesystems and net-device-up IFACE=eth0 and net-device-up
IFACE=eth1)

About to try:
start on (local-filesystems and net-device-up IFACE=eth0 and net-device-up
IFACE=eth1 and started network-services)

The "local-filesystems" + network device is billed as an alternative to
runlevel if you need to do something *after* networking...

No luck so far.  I'll keep trying things out.
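
Another candidate, assuming the DHCP hostname is in place once all interfaces
marked auto in /etc/network/interfaces are up (the point where
"static-network-up" should fire) -- an untested sketch:

start on (local-filesystems and static-network-up)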


On Mon, Aug 26, 2013 at 2:31 PM, Sage Weil  wrote:

> On Mon, 26 Aug 2013, Travis Rhoden wrote:
> > Hi Sage,
> >
> > Thanks for the response.  I noticed that as well, and suspected
> > hostname/DHCP/DNS shenanigans.  What's weird is that all nodes are
> > identically configured.  I also have monitors running on n0 and n12, and
> > they come up fine, every time.
> >
> > Here's the mon_host line from ceph.conf:
> >
> > mon_initial_members = n0, n12, n24
> > mon_host = 10.0.1.0,10.0.1.12,10.0.1.24
> >
> > just to test /etc/hosts and name resolution...
> >
> > root@n24:~# getent hosts n24
> > 10.0.1.24   n24
> > root@n24:~# hostname -s
> > n24
> >
> > The only loopback device in /etc/hosts is "127.0.0.1   localhost", so
> > that should be fine.
> >
> > Upon rebooting this node, I've had the monitor come up okay once, maybe
> out
> > of 12 tries.  So it appears to be some kind of race...  No clue what is
> > going on.  If I stop and start the monitor (or restart), it doesn't
> appear
> > to change anything.
> >
> > However, on the topic of races, I'm having one other more pressing issue.
> > Each OSD host is having its hostname assigned via DHCP.  Until that
> > assignment is made (during init), the hostname is "localhost", and then
> it
> > switches over to "n", for some node number.  The issue I am seeing is
> > that there is a race between this hostname assignment and the Ceph
> Upstart
> > scripts, such that sometimes ceph-osd starts while the hostname is still
> > 'localhost'.  This then causes the osd location to change in the
> crushmap,
> > which is going to be a very bad thing.  =)  When rebooting all my nodes
> at
> > once (there are several dozen), about 50% move from being under n to
> > localhost.  Restarting all the ceph-osd jobs moves them back (because the
> > hostname is defined).
> >
> > I'm wondering what kind of delay, or additional "start-on" logic I can
> add
> > to the upstart script to work around this.
>
> Hmm, this is beyond my upstart-fu, unfortunately.  This has come up
> before, actually.  Previously we would wait for any interface to come up
> and then start, but that broke with multi-nic machines, and I ended up
> just making things start in runlevel [2345].
>
> James, do you know what should be done to make the job wait for *all*
> network interfaces to be up?  Is that even the right solution here?
>
> sage
>
>
> >
> >
> > On Fri, Aug 23, 2013 at 4:47 PM, Sage Weil  wrote:
> >   Hi Travis,
> >
> >   On Fri, 23 Aug 2013, Travis Rhoden wrote:
> >   > Hey folks,
> >   >
> >   > I've just done a brand new install of 0.67.2 on a cluster of
> >   Calxeda nodes.
> >   >
> >   > I have one particular monitor that never joins the quorum
> >   when I restart
> >   > the node.  Looks to  me like it has something to do with the
> >   "create-keys"
> >   > task, which never seems to finish:
> >   >
> >   > root  1240 1  4 13:03 ?00:00:02
> >   /usr/bin/ceph-mon
> >   > --cluster=ceph -i n24 -f
> >   > root  1244 1  0 13:03 ?00:00:00
> >   /usr/bin/python
> >   > /usr/sbin/ceph-create-keys --cluster=ceph -i n24
> >   >
> >   > I don't see that task on my other monitors.  Additionally,
> >   that task is
> >   > periodically querying the monitor status:
> >   >
> >   > root  1240 1  2 13:03 ?00:00:02
> >   /usr/bin/ceph-mon
> >   > --cluster=ceph -i n24 -f
> >   > root  1244 1  0 13:03 ?00:00:00
> >   /usr/bin/python
> >   > /usr/sbin/ceph-create-keys --cluster=ceph -i n24
> >   > root  1982  1244 15 13:04 ?00:00:00
> >   /usr/bin/python
> >   > /usr/bin/ceph --cluster=ceph
> >   --admin-daemon=/var/run/ceph/ceph-mon.n24.asok
> >   > mon_status
> >   >
> >   > Checking that status myself, I see:
> >   >
> >   > # ceph --cluster=ceph
> >   --admin-daemon=/var/run/ceph/ceph-mon.n24.asok
> >   > mon_status
> >   > { "name": "n24",
> >   >   "rank": 2,
> >   >   "state": "probing",
> >   >   "election_epoch": 0,
> >   >   "quorum": [],
> >   >   "outside_quorum": [
> >   > "n24"],
> >   >   "extra_probe_peers": [],
> >   >   "sync_provider": [],
> >   >   "monmap": { "epoch": 2,
> >   >   "fsid": "f0b0d4ec-1ac3-4b24-9eab-c19760ce4682",
> >   >   "modified": "2013-08-23 12:55:34.374650",
> >   >   "created": 

Re: [ceph-users] 1 particular ceph-mon never joins on 0.67.2

2013-08-26 Thread Sage Weil
On Mon, 26 Aug 2013, Travis Rhoden wrote:
> Hi Sage,
> 
> Thanks for the response.  I noticed that as well, and suspected
> hostname/DHCP/DNS shenanigans.  What's weird is that all nodes are
> identically configured.  I also have monitors running on n0 and n12, and
> they come up fine, every time.
> 
> Here's the mon_host line from ceph.conf:
> 
> mon_initial_members = n0, n12, n24
> mon_host = 10.0.1.0,10.0.1.12,10.0.1.24
> 
> just to test /etc/hosts and name resolution...
> 
> root@n24:~# getent hosts n24
> 10.0.1.24   n24
> root@n24:~# hostname -s
> n24
> 
> The only loopback device in /etc/hosts is "127.0.0.1   localhost", so
> that should be fine. 
> 
> Upon rebooting this node, I've had the monitor come up okay once, maybe out
> of 12 tries.  So it appears to be some kind of race...  No clue what is
> going on.  If I stop and start the monitor (or restart), it doesn't appear
> to change anything.
> 
> However, on the topic of races, I'm having one other more pressing issue. 
> Each OSD host is having its hostname assigned via DHCP.  Until that
> assignment is made (during init), the hostname is "localhost", and then it
> switches over to "n", for some node number.  The issue I am seeing is
> that there is a race between this hostname assignment and the Ceph Upstart
> scripts, such that sometimes ceph-osd starts while the hostname is still
> 'localhost'.  This then causes the osd location to change in the crushmap,
> which is going to be a very bad thing.  =)  When rebooting all my nodes at
> once (there are several dozen), about 50% move from being under n to
> localhost.  Restarting all the ceph-osd jobs moves them back (because the
> hostname is defined).
> 
> I'm wondering what kind of delay, or additional "start-on" logic I can add
> to the upstart script to work around this.

Hmm, this is beyond my upstart-fu, unfortunately.  This has come up 
before, actually.  Previously we would wait for any interface to come up 
and then start, but that broke with multi-nic machines, and I ended up 
just making things start in runlevel [2345].
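
In Upstart terms that is roughly (a sketch from memory, not the shipped job
verbatim):

start on runlevel [2345]
stop on runlevel [!2345]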

James, do you know what should be done to make the job wait for *all*
network interfaces to be up?  Is that even the right solution here?

sage


> 
> 
> On Fri, Aug 23, 2013 at 4:47 PM, Sage Weil  wrote:
>   Hi Travis,
> 
>   On Fri, 23 Aug 2013, Travis Rhoden wrote:
>   > Hey folks,
>   >
>   > I've just done a brand new install of 0.67.2 on a cluster of
>   Calxeda nodes.
>   >
>   > I have one particular monitor that number joins the quorum
>   when I restart
>   > the node.  Looks to  me like it has something to do with the
>   "create-keys"
>   > task, which never seems to finish:
>   >
>   > root      1240     1  4 13:03 ?        00:00:02
>   /usr/bin/ceph-mon
>   > --cluster=ceph -i n24 -f
>   > root      1244     1  0 13:03 ?        00:00:00
>   /usr/bin/python
>   > /usr/sbin/ceph-create-keys --cluster=ceph -i n24
>   >
>   > I don't see that task on my other monitors.  Additionally,
>   that task is
>   > periodically querying the monitor status:
>   >
>   > root      1240     1  2 13:03 ?        00:00:02
>   /usr/bin/ceph-mon
>   > --cluster=ceph -i n24 -f
>   > root      1244     1  0 13:03 ?        00:00:00
>   /usr/bin/python
>   > /usr/sbin/ceph-create-keys --cluster=ceph -i n24
>   > root      1982  1244 15 13:04 ?        00:00:00
>   /usr/bin/python
>   > /usr/bin/ceph --cluster=ceph
>   --admin-daemon=/var/run/ceph/ceph-mon.n24.asok
>   > mon_status
>   >
>   > Checking that status myself, I see:
>   >
>   > # ceph --cluster=ceph
>   --admin-daemon=/var/run/ceph/ceph-mon.n24.asok
>   > mon_status
>   > { "name": "n24",
>   >   "rank": 2,
>   >   "state": "probing",
>   >   "election_epoch": 0,
>   >   "quorum": [],
>   >   "outside_quorum": [
>   >         "n24"],
>   >   "extra_probe_peers": [],
>   >   "sync_provider": [],
>   >   "monmap": { "epoch": 2,
>   >       "fsid": "f0b0d4ec-1ac3-4b24-9eab-c19760ce4682",
>   >       "modified": "2013-08-23 12:55:34.374650",
>   >       "created": "0.00",
>   >       "mons": [
>   >             { "rank": 0,
>   >               "name": "n0",
>   >               "addr": "10.0.1.0:6789\/0"},
>   >             { "rank": 1,
>   >               "name": "n12",
>   >               "addr": "10.0.1.12:6789\/0"},
>   >             { "rank": 2,
>   >               "name": "n24",
>   >               "addr": "0.0.0.0:6810\/0"}]}}
>                         
> 
> This is the problem.  I can't remember exactly what causes this,
> though.
> Can you verify the host in ceph.conf mon_host line matches the ip that is
> configured on the machine, and that the /etc/hosts on the machine doesn't
> have a loopback address on it.
> 
> Thanks!
> sage
> 

Re: [ceph-users] 1 particular ceph-mon never joins on 0.67.2

2013-08-26 Thread Travis Rhoden
Hi Sage,

Thanks for the response.  I noticed that as well, and suspected
hostname/DHCP/DNS shenanigans.  What's weird is that all nodes are
identically configured.  I also have monitors running on n0 and n12, and
they come up fine, every time.

Here's the mon_host line from ceph.conf:

mon_initial_members = n0, n12, n24
mon_host = 10.0.1.0,10.0.1.12,10.0.1.24

just to test /etc/hosts and name resolution...

root@n24:~# getent hosts n24
10.0.1.24   n24
root@n24:~# hostname -s
n24

The only loopback device in /etc/hosts is "127.0.0.1   localhost", so
that should be fine.

Upon rebooting this node, I've had the monitor come up okay once, maybe out
of 12 tries.  So it appears to be some kind of race...  No clue what is
going on.  If I stop and start the monitor (or restart), it doesn't appear
to change anything.

However, on the topic of races, I'm having one other more pressing issue.
Each OSD host is having its hostname assigned via DHCP.  Until that
assignment is made (during init), the hostname is "localhost", and then it
switches over to "n", for some node number.  The issue I am seeing is
that there is a race between this hostname assignment and the Ceph Upstart
scripts, such that sometimes ceph-osd starts while the hostname is still
'localhost'.  This then causes the osd location to change in the crushmap,
which is going to be a very bad thing.  =)  When rebooting all my nodes at
once (there are several dozen), about 50% move from being under n to
localhost.  Restarting all the ceph-osd jobs moves them back (because the
hostname is defined).

I'm wondering what kind of delay, or additional "start-on" logic I can add
to the upstart script to work around this.


On Fri, Aug 23, 2013 at 4:47 PM, Sage Weil  wrote:

> Hi Travis,
>
> On Fri, 23 Aug 2013, Travis Rhoden wrote:
> > Hey folks,
> >
> > I've just done a brand new install of 0.67.2 on a cluster of Calxeda
> nodes.
> >
> > I have one particular monitor that never joins the quorum when I restart
> > the node.  Looks to  me like it has something to do with the
> "create-keys"
> > task, which never seems to finish:
> >
> > root  1240 1  4 13:03 ?00:00:02 /usr/bin/ceph-mon
> > --cluster=ceph -i n24 -f
> > root  1244 1  0 13:03 ?00:00:00 /usr/bin/python
> > /usr/sbin/ceph-create-keys --cluster=ceph -i n24
> >
> > I don't see that task on my other monitors.  Additionally, that task is
> > periodically querying the monitor status:
> >
> > root  1240 1  2 13:03 ?00:00:02 /usr/bin/ceph-mon
> > --cluster=ceph -i n24 -f
> > root  1244 1  0 13:03 ?00:00:00 /usr/bin/python
> > /usr/sbin/ceph-create-keys --cluster=ceph -i n24
> > root  1982  1244 15 13:04 ?00:00:00 /usr/bin/python
> > /usr/bin/ceph --cluster=ceph
> --admin-daemon=/var/run/ceph/ceph-mon.n24.asok
> > mon_status
> >
> > Checking that status myself, I see:
> >
> > # ceph --cluster=ceph --admin-daemon=/var/run/ceph/ceph-mon.n24.asok
> > mon_status
> > { "name": "n24",
> >   "rank": 2,
> >   "state": "probing",
> >   "election_epoch": 0,
> >   "quorum": [],
> >   "outside_quorum": [
> > "n24"],
> >   "extra_probe_peers": [],
> >   "sync_provider": [],
> >   "monmap": { "epoch": 2,
> >   "fsid": "f0b0d4ec-1ac3-4b24-9eab-c19760ce4682",
> >   "modified": "2013-08-23 12:55:34.374650",
> >   "created": "0.00",
> >   "mons": [
> > { "rank": 0,
> >   "name": "n0",
> >   "addr": "10.0.1.0:6789\/0"},
> > { "rank": 1,
> >   "name": "n12",
> >   "addr": "10.0.1.12:6789\/0"},
> > { "rank": 2,
> >   "name": "n24",
> >   "addr": "0.0.0.0:6810\/0"}]}}
> 
>
> This is the problem.  I can't remember exactly what causes this, though.
> Can you verify the host in ceph.conf mon_host line matches the ip that is
> configured on the machine, and that the /etc/hosts on the machine doesn't
> have a loopback address on it.
>
> Thanks!
> sage
>
>
>
>
> >
> > Any ideas what is going on here?  I don't see anything useful in
> > /var/log/ceph/ceph-mon.n24.log
> >
> >  Thanks,
> >
> >  - Travis
> >
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd in centos6.4

2013-08-26 Thread raj kumar
Thank you so much. This is helpful not only for me, but for all beginners.

Raj


On Fri, Aug 23, 2013 at 5:31 PM, Kasper Dieter  wrote:

> Once the cluster is created on Ceph server nodes with MONs and OSDs on it
> you have to copy the config + auth info to the clients:
>
> #--- on server node, e.g.:
> scp /etc/ceph/ceph.conf client-1:/etc/ceph
> scp /etc/ceph/keyring.bin   client-1:/etc/ceph
> scp /etc/ceph/ceph.conf client-2:/etc/ceph
> scp /etc/ceph/keyring.bin   client-2:/etc/ceph
>
> #--- on client node(s):
> modprobe -v rbd
> modprobe -v ceph   # only, if you want to run CephFS
> rados lspools
> rbd create -c /etc/ceph/ceph.conf --size 1024000 --pool rbd rbd-64k  --order 16 --keyring /etc/ceph/keyring.bin
> rbd create -c /etc/ceph/ceph.conf --size 1024000 --pool rbd rbd-128k --order 17 --keyring /etc/ceph/keyring.bin
> rbd create -c /etc/ceph/ceph.conf --size 1024000 --pool rbd rbd-256k --order 18 --keyring /etc/ceph/keyring.bin
> rbd create -c /etc/ceph/ceph.conf --size 1024000 --pool rbd rbd-4m   --order 22 --keyring /etc/ceph/keyring.bin
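> # note: --order N gives 2^N-byte objects, so --order 16 = 64KiB objects,
> # --order 17 = 128KiB, --order 18 = 256KiB and --order 22 = 4MiB (the default)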
> rbd map rbd-64k
> rbd map rbd-128k
> rbd map rbd-256k
> rbd map rbd-4m
> rbd showmapped
>
> id pool image    snap device
> 5  rbd  rbd-64k  -    /dev/rbd5
> 6  rbd  rbd-128k -    /dev/rbd6
> 7  rbd  rbd-256k -    /dev/rbd7
> 8  rbd  rbd-4m   -    /dev/rbd8
>
>
> Now, your application can directly access the Rados Block Devices /dev/rbdX
>
> Regards,
> -Dieter
>
>
>
> On Fri, Aug 23, 2013 at 01:31:05PM +0200, raj kumar wrote:
> >Thank you Sir. I appreciate your help on this.
> >I upgraded the kernel to 3.4.53-8.
> >For second point, I want to give a client(which is not kvm) a block
> >storage. So without iscsi how the client will access the ceph cluster
> and
> >allocated block device.  and can you please let me know the flow to
> >provision the block storage. creating rbd image and map in one of the
> mon
> >host is right?  the ceph doc is not very clear on this.
> >Regards
> >Raj
> >
> >On Fri, Aug 23, 2013 at 4:03 PM, Kasper Dieter
> ><dieter.kas...@ts.fujitsu.com> wrote:
> >
> >  On Thu, Aug 22, 2013 at 03:32:35PM +0200, raj kumar wrote:
> >  >ceph cluster is running fine in centos6.4.
> >  >Now I would like to export the block device to client using rbd.
> >  >my question is,
> >  >1. I used to modprobe rbd in one of the monitor host. But I got error,
> >  >   FATAL: Module rbd not found
> >  >   I could not find rbd module. How can i do this?
> >
> >  # cat /etc/centos-release
> >  CentOS release 6.4 (Final)
> >
> >  # updatedb
> >  # locate rbd.ko
> >  /lib/modules/3.8.13/kernel/drivers/block/rbd.ko
> >
> >  # locate virtio_blk.ko
> >  /lib/modules/2.6.32-358.14.1.el6.x86_64/kernel/drivers/block/virtio_blk.ko
> >  /lib/modules/2.6.32-358.el6.x86_64/kernel/drivers/block/virtio_blk.ko
> >  /lib/modules/3.8.13/kernel/drivers/block/virtio_blk.ko
> >
> >  Well, the standard CentOS-6.4 kernel does not include 'rbd.ko'.
> >  For some reason the 'Enterprise distros' (RHEL, SLES) disabled the Ceph
> >  kernel components by default, although CephFS (= ceph.ko) has been in the
> >  upstream kernel since 2.6.34, and the block device (= rbd.ko) since 2.6.37.
> >
> >  We built our own kernel 3.8.13 (a good mixture of recent & mature) and
> >  put it into CentOS-6.4.
> >  >2. Once the rbd is created. Do we need to create iscsi target in one of a
> >  >monitor host and present the lun to client. If so what if the monitor host
> >  >goes down. so what is the best practice to provide a lun to clients.
> >  >thanks
> >  This depends on your Client.
> >  Using
> >"RADOS - Block-Layer - RBD-Driver - iSCSI-TGT // iSCSI-INI - Client"
> >  is a waste of stack overhead.
> >  If the client is kvm-qemu you can use
> >"RADOS // librbd - kvm-qemu"
> >  or
> >"RADOS // Block-Layer - RBD-Driver - Client"
> >
> >  The "//" symbolizes the border between Server-nodes and Client-nodes.
> >
> >  -Dieter
> >
> >  >Raj
> >
> >  > ___
> >  > ceph-users mailing list
> >  > ceph-users@lists.ceph.com
> >  > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hardware recommendations

2013-08-26 Thread Martin B Nielsen
Hi Shain,

Those R515 seem to mimic our servers (2U supermicro w. 12x 3.5" bays and 2x
2.5" in the rear for OS).

Since we need a mix of SSD & platter we have 8x 4TB drives and 4x 500GB SSD
+ 2x 250GB SSD for OS in each node (2x 8-port LSI 2308 in IT-mode)

We've partitioned 10GB from each of the 4x 500GB SSDs to use as journals for
4x 4TB drives, and each of the OS disks holds 2x journals for the
remaining 4 platter disks.

We tested a lot of ways to place these journals, and this layout seemed to fit best
into our setup (pure VM block storage - 3x replica).

Everything connected via 10GbE (1 network for cluster, 1 for public) and 3
standalone monitor servers.

For storage nodes we use E5-2620/32gb ram, and monitor nodes E3-1260L/16gb
ram - we've tested with both 1 and 2 nodes going down and starting to
redistribute data, and they seem to cope more than fine.

Overall I find these nodes as a good compromise between capacity, price and
performance - we looked into getting 2U servers with 8x 3.5" bays and getting
more of them, but ultimately went with this.

We also have some boxes from coraid (SR & SRX with and without
flashcache/etherflash) so we've been able to do some direct comparison and
so far ceph is looking good - especially price-storage ratio.

At any rate, back to your mail, I think the most important factor is
looking at all the pieces and making sure you're not being [hard]
bottlenecked somewhere - we found 24gb ram to be a little on the low side
when all 12 disks started to redistribute, but 32 is fine. Also not having
journals on SSD before writing to platter really hurt a lot when we tested
- this can prob. be mitigated somewhat with better raid controllers.
CPU-wise the E5 2620 hardly breaks a sweat even when having to do just a
little with a node going down.

Good luck with your HW-adventure :).

Cheers,
Martin


On Mon, Aug 26, 2013 at 3:56 PM, Shain Miley  wrote:

> Good morning,
>
> I am in the process of deciding what hardware we are going to purchase for
> our new ceph based storage cluster.
>
> I have been informed that I must submit my purchase needs by the end of
> this week in order to meet our FY13 budget requirements  (which does not
> leave me much time).
>
> We are planning to build multiple clusters (one primarily for radosgw at
> location 1; the other for vm block storage at location 2).
>
> We will be building our radosgw storage out first, so this is the main
> focus of this email thread.
>
> I have read all the docs and the white papers, etc on hardware suggestions
> ...and we have an existing relationship with Dell, so I have been planning
> on buying a bunch of Dell R515's with 4TB drives and using 10GigE
> networking for this radosgw setup (although this will be primarily used for
> radosgw purposes...I will be testing running a limited number of vm's on
> this infrastructure  as well...in order to see what kind of performance we
> can achieve).
>
> I am just wondering if anyone else has any quick thoughts on these
> hardware choices, or any alternative suggestions that I might look at as I
> seek to finalize our purchasing this week.
>
> Thanks in advance,
>
> Shain
>
> Sent from my iPhone
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librados pthread_create failure

2013-08-26 Thread Greg Poirier
Gregs are awesome, apparently. Thanks for the confirmation.

I know that threads are light-weight, it's just the first time I've ever
run into something that uses them... so liberally. ^_^


On Mon, Aug 26, 2013 at 10:07 AM, Gregory Farnum  wrote:

> On Mon, Aug 26, 2013 at 9:24 AM, Greg Poirier 
> wrote:
> > So, in doing some testing last week, I believe I managed to exhaust the
> > number of threads available to nova-compute. After some
> > investigation, I found the pthread_create failure and increased nproc for
> > our Nova user to, what I considered, a ridiculous 120,000 threads after
> > reading that librados will require a thread per osd, plus a few for
> > overhead, per VM on our compute nodes.
> >
> > This made me wonder: how many threads could Ceph possibly need on one of
> our
> > compute nodes.
> >
> > 32 cores * an overcommit ratio of 16, assuming each one is booted from a
> > Ceph volume, * 300 (approximate number of disks in our soon-to-go-live
> Ceph
> > cluster) = 153,600 threads.
> >
> > So this is where I started to put the truck in reverse. Am I right? What
> > about when we triple the size of our Ceph cluster? I could easily see a
> > future where we have easily 1,000 disks, if not many, many more in our
> > cluster. How do people scale this? Do you RAID to increase the density of
> > your Ceph cluster? I can only imagine that this will also drastically
> > increase the amount of resources required on my data nodes as well.
> >
> > So... suggestions? Reading?
>
> Your math looks right to me. So far though it hasn't caused anybody
> any trouble — Linux threads are much cheaper than people imagine when
> they're inactive. At some point we will certainly need to reduce the
> thread counts of our messenger (using epoll on a bunch of sockets
> instead of 2 threads -> 1 socket), but it hasn't happened yet.
> In terms of things you can do if this does become a problem, the most
> prominent is probably to (sigh) partition your cluster into pods on a
> per-rack basis or something. This is actually not as bad as it sounds
> since your network design probably would prefer not to send all writes
> through your core router, so if you create a pool for each rack and do
> something like this rack, next rack, next row for your replication you
> get better network traffic patterns.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librados pthread_create failure

2013-08-26 Thread Gregory Farnum
On Mon, Aug 26, 2013 at 9:24 AM, Greg Poirier  wrote:
> So, in doing some testing last week, I believe I managed to exhaust the
> number of threads available to nova-compute. After some
> investigation, I found the pthread_create failure and increased nproc for
> our Nova user to, what I considered, a ridiculous 120,000 threads after
> reading that librados will require a thread per osd, plus a few for
> overhead, per VM on our compute nodes.
>
> This made me wonder: how many threads could Ceph possibly need on one of our
> compute nodes.
>
> 32 cores * an overcommit ratio of 16, assuming each one is booted from a
> Ceph volume, * 300 (approximate number of disks in our soon-to-go-live Ceph
> cluster) = 153,600 threads.
>
> So this is where I started to put the truck in reverse. Am I right? What
> about when we triple the size of our Ceph cluster? I could easily see a
> future where we have easily 1,000 disks, if not many, many more in our
> cluster. How do people scale this? Do you RAID to increase the density of
> your Ceph cluster? I can only imagine that this will also drastically
> increase the amount of resources required on my data nodes as well.
>
> So... suggestions? Reading?

Your math looks right to me. So far though it hasn't caused anybody
any trouble — Linux threads are much cheaper than people imagine when
they're inactive. At some point we will certainly need to reduce the
thread counts of our messenger (using epoll on a bunch of sockets
instead of 2 threads -> 1 socket), but it hasn't happened yet.
In terms of things you can do if this does become a problem, the most
prominent is probably to (sigh) partition your cluster into pods on a
per-rack basis or something. This is actually not as bad as it sounds
since your network design probably would prefer not to send all writes
through your core router, so if you create a pool for each rack and do
something like this rack, next rack, next row for your replication you
get better network traffic patterns.
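
For illustration, a per-rack rule in the crushmap might look something like
this (a sketch; the rack bucket name and ruleset number are hypothetical):

rule rack1-local {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take rack1
        step chooseleaf firstn 0 type host
        step emit
}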
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon listens on wrong interface

2013-08-26 Thread Sage Weil
On Mon, 26 Aug 2013, Fuchs, Andreas (SwissTXT) wrote:
> Hi Sage
> 
> Thanks for your answer
> 
> I had ceph.conf already adjusted
>   mon_hosts has the list of public ip's of the mon servers
> 
> but ceph-mon is listening on eth0 instead of the ip listed in mon_hosts
> 
> also entering [mon.ceph-ceph01] sections with host= and mon_addr= entries did 
> not change this
> 
> do I have to redeploy the installation, so far I just pushed the new config 
> and restarted the services?

You will need to redeploy.  ceph-deploy purge and ceph-deploy purgedata to 
reset your nodes.

> Btw.
> ceph-deploy new ceph01:10.100.214.x ... won't work as it requires a 
> name not an ip, but in my case ceph01 resolves to the correct ip

Interesting; we should make ceph-deploy take an IP there too.

sage

> 
> regards
> Andi
> 
> -Original Message-
> From: Sage Weil [mailto:s...@inktank.com] 
> Sent: Freitag, 23. August 2013 17:28
> To: Fuchs, Andreas (SwissTXT)
> Cc: ceph-us...@ceph.com
> Subject: Re: [ceph-users] ceph-mon listens on wrong interface
> 
> Hi Andreas,
> 
> On Fri, 23 Aug 2013, Fuchs, Andreas (SwissTXT) wrote:
> > Hi, we built a ceph cluster with the folling network setup
> > 
> > eth0 is on a management network (access for admins and monitoring 
> > tools)
> > eth1 is ceph sync
> > eth2 is ceph public
> > 
> > deployed by ceph-deploy I have the following config
> > 
> > [global]
> > fsid = 18c6b4db-b936-43a2-ba68-d750036036cc
> > mon_initial_members = ceph01, ceph02, ceph03 mon_host = 
> > 10.100.214.11,10.100.214.12,10.100.214.13
> > auth_supported = cephx
> > osd_journal_size = 5000
> > filestore_xattr_use_omap = true
> > public_network = 10.100.214.0/24
> > cluster_network = 10.100.213.0/24
> > 
> > the problem is now that ceph-mon is listening on eth0
> > 
> > netstat -lpn | grep 6789
> > tcp0  0 10.100.220.111:6789 0.0.0.0:*   LISTEN  
> > 1609/ceph-mon
> > 
> > where it should listen on eth2 (10.100.214.x)
> > 
> > how can I achieve this?
> 
> I assume you used ceph-deploy here?  The problem is that when you do
> 
>  ceph-deploy new ceph01 ceph02 ceph03
> 
> it is using the ceph01 etc as both the hostname to identify the instance
> (good) and looking it up via DNS to resolve the IP for the mon_host list 
> (bad, in your case).  Try
> 
>  ceph-deploy new ceph01:10.100.214.x ...
> 
> or
> 
>  ceph-deploy new ceph01:ceph01.myothernetwork.foo.com ...
> 
> Or, just manually edit the ceph.conf after the 'ceph-deploy new ...' 
> command to get how you want it.
> 
> sage
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] librados pthread_create failure

2013-08-26 Thread Greg Poirier
So, in doing some testing last week, I believe I managed to exhaust the
number of threads available to nova-compute. After some
investigation, I found the pthread_create failure and increased nproc for
our Nova user to, what I considered, a ridiculous 120,000 threads after
reading that librados will require a thread per osd, plus a few for
overhead, per VM on our compute nodes.

This made me wonder: how many threads could Ceph possibly need on one of
our compute nodes.

32 cores * an overcommit ratio of 16, assuming each one is booted from a
Ceph volume, * 300 (approximate number of disks in our soon-to-go-live Ceph
cluster) = 153,600 threads.

So this is where I started to put the truck in reverse. Am I right? What
about when we triple the size of our Ceph cluster? I could easily see a
future where we have easily 1,000 disks, if not many, many more in our
cluster. How do people scale this? Do you RAID to increase the density of
your Ceph cluster? I can only imagine that this will also drastically
increase the amount of resources required on my data nodes as well.

So... suggestions? Reading?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some help needed with ceph deployment

2013-08-26 Thread Alfredo Deza
On Mon, Aug 26, 2013 at 10:45 AM, Johannes Klarenbeek <
johannes.klarenb...@rigo.nl> wrote:

>  Hello ceph-users,
>
>
> I’m trying to set up a linux cluster but it takes me a little longer than
> I hoped for. There are some things that I do not quite understand yet.
> Hopefully some of you can help me out.
>
> 1)  When using ceph-deploy, a ceph.conf file is created in the
> current directory and in the /etc/ceph directory. Which one is ceph-deploy
> using and which one should I edit?
>
The usual workflow is that ceph-deploy will use the one in the current dir
to overwrite the remote one (or the one in /etc/ceph/) but will warn if
this is the case and error out specifying it needs the `--overwrite-conf` flag
to continue.
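
For example, pushing the local ceph.conf out over the remote copies would look
like this (hypothetical hostname):

  ceph-deploy --overwrite-conf config push cephnode1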

>
> 2)  I have 6 OSD running per machine. I disk zapped them with
> ceph-deploy disk zap. I prepared/activated them with a separate journal on
> an ssd card and they are all running.
>
> a)  Ceph-deploy disk list doesn’t show me what file system is in
> use (or ‘Linux Filesystem’ as it mentions, is a filesystem in its own
> right). Neither does it show you what partition or path it is using for its
> journal.
>
That ceph-deploy command calls `ceph-disk list` in the remote host which in
turn will not (as far as I could see) tell you what exact file system is in
use.

> 
>
> b)  Running parted doesn’t show me what file system is in use
> either (except of course that it is a ‘Linux Filesystem’)… I believe parted
> should do the trick to show me these settings??
>
How are you calling parted? with what flags? Usually something like: `sudo
parted /dev/{device} print` would print some output. Can you show what you
are getting back?

> 
>
> c)   When a GPT partition is corrupt (missing msdos prelude for
> example) ceph-deploy disk zap doesn’t work. But after 4 times repeating the
> command, it works, showing up as ‘Linux Filesystem’.
>
So it didn't work 4 times and then it did? This does sound unexpected.

> 
>
> d)  How can you set the file system that ceph-deploy disk zap is
> formatting the ceph data disk with? I like to zap a disk with XFS for
> example.
>
This is currently not supported but could be added as a feature.

> 
>
>
> 3)  Is there a way to set the data partition for ceph-mon with
> ceph-deploy or should I do it manually in ceph.conf? How do I format that
> partition (what file system should I use)
>
>
> 4)  When running ceph status the following message is what I get:
> root@cephnode1:/root#ceph status
> cluster:----
> health: HEALTH_WARN 37 pgs degraded; 192 pgs stuck unclean
> monmap e1: 1 mons at {cephnode1=172.16.1.2:6789/0}, election epoch 1,
> quorum 0 cephnode1
> osdmap e38: 6 osds: 6 up, 6 in
> pgmap v65: 192 pgs: 155 active+remapped, 37 active+degraded; 0 bytes
> data, 213 MB used, 11172GB / 11172GB avail
> mdsmap e1: 0/0/1 up
>
>
> a)  How do I get rid of the HEALTH_WARN message? Can I run some
> tool that initiates a repair?
>
> b)  I did not put any data in it yet, but it already uses a
> whopping 213 MB, why?
>
>
> 5)  Last but not least, my config file looks like this
> root@cephnode1:/root#cat /etc/ceph/ceph.conf
> [global]
>
> fsid = ----
> mon_initial_members = cephnode1
> mon_host = 172.16.1.2
> auth_supported = cephx
>
> osd_journal_size = 1024
> filestore_xattr_use_omap = true
>
>
> This is really strange, since the documentation states that the minimum
> requirement for a config with my configuration should at least have the
> [mon.1] and [osd.1] [osd.2] [osd.3] [osd.4] [osd.5] [osd.6] directives. I
> have set up separate journaling for my OSD’s but it doesn’t show in my conf
> file. Also the journaling partitions are 2GB big and not 1024MB (if that is
> what it means then).
>
>
> I can really use your help, since I’m stuck for the moment.
>
>
> Regards,
>
> Johannes
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some help needed with ceph deployment

2013-08-26 Thread Johannes Klarenbeek

On Mon, Aug 26, 2013 at 10:45 AM, Johannes Klarenbeek 
<johannes.klarenb...@rigo.nl> wrote:
Hello ceph-users,

I'm trying to set up a linux cluster but it takes me a little longer than I 
hoped for. There are some things that I do not quite understand yet. Hopefully 
some of you can help me out.


1)  When using ceph-deploy, a ceph.conf file is created in the current 
directory and in the /etc/ceph directory. Which one is ceph-deploy using and 
which one should I edit?
The usual workflow is that ceph-deploy will use the one in the current dir to 
overwrite the remote one (or the one in /etc/ceph/) but will warn if this is 
the case and error out specifying it needs the `--overwrite-conf` flag
to continue.

Aha, so the current dir is leading. And if I make changes to that file, it is 
not overwritten by ceph-deploy?


2)  I have 6 OSD running per machine. I disk zapped them with ceph-deploy 
disk zap. I prepared/activated them with a separate journal on an ssd card and 
they are all running.


a)  Ceph-deploy disk list doesn't show me what file system is in use (or 
'Linux Filesystem' as it mentions, is a filesystem in its own right). Neither 
does it show you what partition or path it is using for its journal.
That ceph-deploy command calls `ceph-disk list` in the remote host which in 
turn will not (as far as I could see) tell you what exact file system is in use.

b)  Running parted doesn't show me what file system is in use either 
(except of course that it is a 'Linux Filesystem')... I believe parted should 
do the trick to show me these settings??
How are you calling parted? with what flags? Usually something like: `sudo 
parted /dev/{device} print` would print some output. Can you show what you are 
getting back?

I started parted and then used the print command... hmm but it now actually 
returns something else...my bad. This is what it returns however (and it's using 
xfs)
root@cephnode1:/root#parted /dev/sdd print
Model: ATA WDC WD2000FYYZ-0 (scsi)
Disk /dev/sdd: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name       Flags
 1      1049kB  2000GB  2000GB  xfs          ceph data

c)   When a GPT partition is corrupt (missing msdos prelude for example) 
ceph-deploy disk zap doesn't work. But after 4 times repeating the command, it 
works, showing up as 'Linux Filesystem'.
So it didn't work 4 times and then it did? This does sound unexpected.

It is. But I have to say, I was fooling around a little with dd to wipe the 
disk clean.

d)  How can you set the file system that ceph-deploy disk zap is formatting 
the ceph data disk with? I like to zap a disk with XFS for example.
This is currently not supported but could be added as a feature.

Seems like an important feature. How does ceph-deploy determine what file system 
the disk is zapped with?


3)  Is there a way to set the data partition for ceph-mon with ceph-deploy 
or should I do it manually in ceph.conf? How do I format that partition (what 
file system should I use)

 !This is however something I still need to do!

4)  When running ceph status, the following is what I get:
root@cephnode1:/root# ceph status
   cluster: ----
   health: HEALTH_WARN 37 pgs degraded; 192 pgs stuck unclean
   monmap e1: 1 mons at {cephnode1=172.16.1.2:6789/0}, election epoch 1, 
quorum 0 cephnode1
   osdmap e38: 6 osds: 6 up, 6 in
   pgmap v65: 192 pgs: 155 active+remapped, 37 active+degraded; 0 bytes data, 
213 MB used, 11172GB / 11172GB avail
   mdsmap e1: 0/0/1 up



a)  How do I get rid of the HEALTH_WARN message? Can I run some tool that 
initiates a repair?

b)  I have not put any data in it yet, but it already uses a whopping 213 
MB. Why?


5)  Last but not least, my config file looks like this:
root@cephnode1:/root# cat /etc/ceph/ceph.conf
[global]

fsid = ----
mon_initial_members = cephnode1
mon_host = 172.16.1.2
auth_supported = cephx

osd_journal_size = 1024
filestore_xattr_use_omap = true



This is really strange, since the documentation states that the minimum 
requirement for a config like mine should at least have the [mon.1] and 
[osd.1] [osd.2] [osd.3] [osd.4] [osd.5] [osd.6] directives. I have set up 
separate journals for my OSDs, but they do not show up in my conf file. Also, 
the journal partitions are 2GB in size, not 1024MB (if that is what the 
osd_journal_size value means).

I could really use your help, since I'm stuck at the moment.

Regards,
Johannes



Re: [ceph-users] ceph-mon listens on wrong interface

2013-08-26 Thread Fuchs, Andreas (SwissTXT)
Hi Sage

Thanks for your answer

I had already adjusted ceph.conf:
  mon_host has the list of public IPs of the mon servers

but ceph-mon is listening on eth0 instead of the IP listed in mon_host.

Adding [mon.ceph-ceph01] sections with host= and mon_addr= entries did not 
change this either.

Do I have to redeploy the installation? So far I have just pushed the new 
config and restarted the services.

Btw.
ceph-deploy new ceph01:10.100.214.x ... won't work as it requires a name, not 
an IP, but in my case ceph01 resolves to the correct IP.

regards
Andi

-Original Message-
From: Sage Weil [mailto:s...@inktank.com] 
Sent: Freitag, 23. August 2013 17:28
To: Fuchs, Andreas (SwissTXT)
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] ceph-mon listens on wrong interface

Hi Andreas,

On Fri, 23 Aug 2013, Fuchs, Andreas (SwissTXT) wrote:
> Hi, we built a ceph cluster with the following network setup
> 
> eth0 is on a management network (access for admins and monitoring 
> tools)
> eth1 is ceph sync
> eth2 is ceph public
> 
> deployed by ceph-deploy I have the following config
> 
> [global]
> fsid = 18c6b4db-b936-43a2-ba68-d750036036cc
> mon_initial_members = ceph01, ceph02, ceph03
> mon_host = 10.100.214.11,10.100.214.12,10.100.214.13
> auth_supported = cephx
> osd_journal_size = 5000
> filestore_xattr_use_omap = true
> public_network = 10.100.214.0/24
> cluster_network = 10.100.213.0/24
> 
> the problem is now that ceph-mon is listening on eth0
> 
> netstat -lpn | grep 6789
> tcp        0      0 10.100.220.111:6789     0.0.0.0:*       LISTEN      1609/ceph-mon
>
> where it should listen on eth2 (10.100.214.x)
> 
> how can I achieve this?

I assume you used ceph-deploy here?  The problem is that when you do

 ceph-deploy new ceph01 ceph02 ceph03

it is using the ceph01 etc as both the hostname to identify the instance
(good) and looking it up via DNS to resolve the IP for the mon_host list (bad, 
in your case).  Try

 ceph-deploy new ceph01:10.100.214.x ...

or

 ceph-deploy new ceph01:ceph01.myothernetwork.foo.com ...

Or, just manually edit the ceph.conf after the 'ceph-deploy new ...' 
command to get how you want it.
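
For reference, the manually edited result might look roughly like this (a 
sketch reusing the addresses from this thread; the per-mon section is the 
shape Andreas was attempting, with the mon id assumed to match the hostname):

   [global]
   mon_initial_members = ceph01, ceph02, ceph03
   mon_host = 10.100.214.11,10.100.214.12,10.100.214.13
   public_network = 10.100.214.0/24
   cluster_network = 10.100.213.0/24

   [mon.ceph01]
   host = ceph01
   mon addr = 10.100.214.11:6789

One caveat worth noting: a monitor's address is recorded in the monmap when 
the monitor is first created, so editing ceph.conf and restarting an existing 
monitor will not move it to another interface. The monitor has to be 
re-created with the desired address (or the monmap edited with monmaptool) 
for the change to take effect, which would explain why pushing a new config 
and restarting changed nothing.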

sage



[ceph-users] Hardware recommendations

2013-08-26 Thread Shain Miley
Good morning,

I am in the process of deciding what hardware we are going to purchase for our 
new ceph-based storage cluster.

I have been informed that I must submit my purchase request by the end of this 
week in order to meet our FY13 budget requirements (which does not leave me 
much time).

We are planning to build multiple clusters (one primarily for radosgw at 
location 1; the other for vm block storage at location 2).

We will be building our radosgw storage out first, so this is the main focus of 
this email thread.

I have read all the docs and the white papers, etc., on hardware suggestions, 
and we have an existing relationship with Dell, so I have been planning on 
buying a bunch of Dell R515s with 4TB drives and using 10GigE networking for 
this radosgw setup. Although it will be used primarily for radosgw purposes, I 
will also be testing a limited number of VMs on this infrastructure, to see 
what kind of performance we can achieve.

I am just wondering if anyone else has any quick thoughts on these hardware 
choices, or any alternative suggestions that I might look at as I seek to 
finalize our purchasing this week.

Thanks in advance,

Shain

Sent from my iPhone


Re: [ceph-users] How to migrate from a "missing auth" monitor files to a regular one?

2013-08-26 Thread Yu Changyuan
On Sun, Aug 25, 2013 at 10:27 PM, Joao Eduardo Luis wrote:

> On 08/25/2013 12:36 PM, Yu Changyuan wrote:
>
>> Today, when I restart ceph service, the problem I asked on mail-list
>> before happened
>> again (http://article.gmane.org/gmane.comp.file-systems.ceph.user/2995),
>> ceph-mon refuse to start and report below error:
>>
>> 2013-08-25 18:24:52.465600 7fb50a496780 -1 mon/AuthMonitor.cc: In
>> function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread
>> 7fb50a496780 time 2013-08-25 18:24:52.453920
>> mon/AuthMonitor.cc: 152: FAILED assert(ret == 0)
>>
>>   ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
>>   1: (AuthMonitor::update_from_paxos(bool*)+0x1fee) [0x57742e]
>>   2: (PaxosService::refresh(bool*)+0x18d) [0x4f630d]
>>   3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x496477]
>>   4: (Monitor::init_paxos()+0xf5) [0x496635]
>>   5: (Monitor::preinit()+0x6bc) [0x4ad1dc]
>>   6: (main()+0x1bec) [0x48ac8c]
>>   7: (__libc_start_main()+0xed) [0x7fb5084c660d]
>>   8: ceph-mon() [0x48dab9]
>>
>> Then, I switch to 'wip-mon-skip-auth-cuttlefish' branch, ceph-mon
>> complain some "missing auth inc" (from 1 to 500), and continue running,
>> then everything is ok again.
>>
>> But when I stop this patched ceph-mon, and try to start regular
>> unpatched ceph-mon, above error happened again. As I mentioned, the
>> ceph-mon files last time I use is not the final one that 'missing auth',
>> but the files 2 days before ceph-mon fail, which actually ceph-mon start
>> ok but ceph-osd refuse to work.
>>
>> So, I want to know how to make these ceph-mon files that only work with
>> patched ceph-mon to work again with regular unpatched
>> ceph-mon.
>>
>>
> Changyuan,
>
> Would you mind sending us your monitor store?  If you have other monitors,
> especially if this doesn't happen on them, the other monitors' stores would
> also be insightful.

OK, I have sent the monitor's store to you.

> Furthermore, what's your cluster history?  At what version was it first
> deployed, and what versions have you upgraded it to until reaching 0.61.7?
>
This is the full history of my cluster:
1. The cluster was first deployed on version 0.61.1.
2. When ceph-mon refused to start after a reboot, I upgraded directly to
0.61.7, and made the cluster work again with the patched ceph-mon and the
monitor's store from 2 days before ceph-mon stopped working.
3. Then I stopped and restarted the cluster with the regular ceph-mon (and it
worked).
4. Three days ago I restarted the cluster and found ceph-mon would not start
again, so I tried the patched ceph-mon and it worked, but this time I did not
restart the cluster with a regular ceph-mon.
5. Then, yesterday, I tried to add another monitor (mon.b); after mon.b joined
the cluster, the unpatched ceph-mon running on the new host threw the same
exception from "AuthMonitor::update_from_paxos" and stopped.
6. I had to stop the cluster, manually remove the never-starting mon.b (I
don't have a patched version on the new host), and get the cluster running
with a single mon.a on the patched ceph-mon again.

>   -Joao
>
> --
> Joao Eduardo Luis
> Software Engineer | http://inktank.com | http://ceph.com



-- 
Best regards,
Changyuan


Re: [ceph-users] How to migrate from a "missing auth" monitor files to a regular one?

2013-08-26 Thread Yu Changyuan
Thank you. After applying 'ceph auth add mon.a' 25 times, the unpatched
version works.

Here are the detailed steps:
1. Stop the cluster (mon, osd and mds), and back up the current
/var/lib/ceph/mon/ceph-a dir.
2. Start the patched ceph-mon and ceph-osd (I am not sure whether ceph-osd is
necessary or not).
3. Run 'ceph auth add mon.a' 25 times.
4. Stop ceph-mon and ceph-osd, and run the unpatched ceph-mon with the command
'ceph-mon -i a -f'; it works.
5. Stop ceph-mon, and back up the current, now-working /var/lib/ceph/mon/ceph-a
dir.
6. Revert to the /var/lib/ceph/mon/ceph-a dir saved in step 1 and run the
unpatched ceph-mon again, to confirm that ceph-mon does not start with that
version of the files (it throws errors).
7. Switch back to the dir saved in step 5.
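
Put together, the workaround looks roughly like this (a sketch; it assumes a
single monitor mon.a, sysvinit-style service control, and a hypothetical path
for the patched binary):

   # 1. stop the cluster and back up the mon store
   service ceph -a stop
   cp -a /var/lib/ceph/mon/ceph-a /var/lib/ceph/mon/ceph-a.bak

   # 2./3. with the patched ceph-mon (and osds) running, push 25 auth
   # transactions to roll the auth db past the missing increments
   /opt/ceph-patched/bin/ceph-mon -i a
   for i in $(seq 1 25); do ceph auth add mon.a; done

   # 4. stop the patched daemons, then confirm the stock binary starts
   ceph-mon -i a -f

The point of the loop is simply to generate enough new auth transactions that
the on-disk auth versions move past the missing range, matching Sage's
suggestion below.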



On Mon, Aug 26, 2013 at 12:16 AM, Sage Weil  wrote:

> On Sun, 25 Aug 2013, Yu Changyuan wrote:
> > Today, when I restart ceph service, the problem I asked on mail-list
> before
> > happened
> > again(http://article.gmane.org/gmane.comp.file-systems.ceph.user/2995),
> > ceph-mon refuse to start and report below error:
> >
> > 2013-08-25 18:24:52.465600 7fb50a496780 -1 mon/AuthMonitor.cc: In
> function
> > 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7fb50a496780
> > time 2013-08-25 18:24:52.453920
> > mon/AuthMonitor.cc: 152: FAILED assert(ret == 0)
> >
> >  ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
> >  1: (AuthMonitor::update_from_paxos(bool*)+0x1fee) [0x57742e]
> >  2: (PaxosService::refresh(bool*)+0x18d) [0x4f630d]
> >  3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x496477]
> >  4: (Monitor::init_paxos()+0xf5) [0x496635]
> >  5: (Monitor::preinit()+0x6bc) [0x4ad1dc]
> >  6: (main()+0x1bec) [0x48ac8c]
> >  7: (__libc_start_main()+0xed) [0x7fb5084c660d]
> >  8: ceph-mon() [0x48dab9]
> >
> > Then, I switch to ''wip-mon-skip-auth-cuttlefish" branch, ceph-mon
> complain
> > some "missing auth inc"(from 1 to 500), and continue running, then
> > everything is ok again.
> >
> > But when I stop this patched ceph-mon, and try to start regular unpatched
> > ceph-mon, above error happened again. As I mentioned, the ceph-mon files
> > last time I use is not the final one that 'missing auth', but the files 2
> > days before ceph-mon fail, which actually ceph-mon start ok but ceph-osd
> > refuse to work.
> >
> > So, I want to know how to make these ceph-mon files that only work with
> > patched ceph-mon to work again with regular unpatched ceph-mon.
>
> Without seeing logs and knowing exactly what is going on, my first guess
> is that running several 'ceph auth add' or 'ceph auth import' commands
> that make modifications to the auth db 25 times will get you past the
> gap.  After that, the mon should start with the unpatched version.
>
> If that doesn't fix it, can you generate a log with 'debug ms = 1' 'debug
> paxos = 20' 'debug mon = 20' and share that?
>
> Thanks-
> sage




-- 
Best regards,
Changyuan


[ceph-users] Problems with keyrings during deployment

2013-08-26 Thread Francesc Alted
Hi,

I am a newcomer to Ceph.  After having a look at the docs (BTW, it is nice
to see its concepts being implemented), I am trying to do some tests,
mainly to check the Python APIs for accessing the RADOS and RBD components.
I am following this quick guide:

But after adding a monitor (ceph-deploy mon create ceph-server), I see that
the subdirectories bootstrap-mds and bootstrap-osd (in /var/lib/ceph) do
not contain keyrings.  I have tried to create the monitor again (as
suggested in the docs), but the keyrings still do not appear there:

$ ceph-deploy gatherkeys ceph-server
[ceph_deploy.gatherkeys][DEBUG ] Checking ceph-server for
/etc/ceph/ceph.client.admin.keyring
[ceph_deploy.gatherkeys][WARNIN] Unable to find
/etc/ceph/ceph.client.admin.keyring on ['ceph-server']
[ceph_deploy.gatherkeys][DEBUG ] Have ceph.mon.keyring
[ceph_deploy.gatherkeys][DEBUG ] Checking ceph-server for
/var/lib/ceph/bootstrap-osd/ceph.keyring
[ceph_deploy.gatherkeys][WARNIN] Unable to find
/var/lib/ceph/bootstrap-osd/ceph.keyring on ['ceph-server']
[ceph_deploy.gatherkeys][DEBUG ] Checking ceph-server for
/var/lib/ceph/bootstrap-mds/ceph.keyring
[ceph_deploy.gatherkeys][WARNIN] Unable to find
/var/lib/ceph/bootstrap-mds/ceph.keyring on ['ceph-server']

My admin node (the machine from where I issue the ceph commands) is an
openSUSE 12.3 box where I compiled the ceph-0.67.1 tarball.  The server node is
a Debian Precise 64-bit VM (using vagrant w/ VirtualBox), and the Ceph
installation seems to have gone well, as per the logs:

[ceph-server][INFO  ] Running command: ceph --version
[ceph-server][INFO  ] ceph version 0.67.2
(eb4380dd036a0b644c6283869911d615ed729ac8)

Any hints on what is going on there?  Thanks!

-- 
Francesc Alted


Re: [ceph-users] The whole cluster hangs when changing MTU to 9216

2013-08-26 Thread James Harper
> 
> Centos 6.4
> Ceph Cuttlefish 0.61.7, or 0.61.8.
> 
> I changed the MTU to 9216(or 9000), then restarted all the cluster nodes.
> The whole cluster hung, with messages in the mon log as below:

Does tcpdump report any tcp or ip checksum errors? (tcpdump -v -s0 -i 
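
For reference, a full invocation along those lines might be (a sketch; the
interface name and the port filter are assumptions, not part of the original,
truncated command):

   # -v makes tcpdump verify and print TCP/IP checksums; -s0 captures
   # whole frames, which matters with jumbo MTUs
   tcpdump -v -s0 -i eth0 tcp port 6789

Bad frames show up in the verbose output with a note like 'cksum ...
(incorrect)'.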


[ceph-users] The whole cluster hangs when changing MTU to 9216

2013-08-26 Thread Da Chun Ng
Centos 6.4
Ceph Cuttlefish 0.61.7, or 0.61.8.

I changed the MTU to 9216 (or 9000), then restarted all the cluster nodes. The 
whole cluster hung, with messages in the mon log as below:

4048 2013-08-26 15:52:43.028554 7fd83f131700  1 mon.ceph0@0(electing).elector(15) init, last seen epoch 15
4049 2013-08-26 15:52:46.431842 7fd83f131700  1 mon.ceph0@0(electing) e1 discarding message auth(proto 0 30 bytes epoch 1) v1 and sending client elsewhere
4050 2013-08-26 15:52:46.431886 7fd83f131700  1 mon.ceph0@0(electing) e1 discarding message auth(proto 0 30 bytes epoch 1) v1 and sending client elsewhere
4051 2013-08-26 15:52:46.431899 7fd83f131700  1 mon.ceph0@0(electing) e1 discarding message auth(proto 0 26 bytes epoch 1) v1 and sending client elsewhere
4052 2013-08-26 15:52:46.431911 7fd83f131700  1 mon.ceph0@0(electing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
4053 2013-08-26 15:52:46.431923 7fd83f131700  1 mon.ceph0@0(electing) e1 discarding message auth(proto 0 30 bytes epoch 1) v1 and sending client elsewhere
4054 2013-08-26 15:52:46.431937 7fd83f131700  1 mon.ceph0@0(electing) e1 discarding message auth(proto 0 26 bytes epoch 0) v1 and sending client elsewhere
4055 2013-08-26 15:52:46.431948 7fd83f131700  1 mon.ceph0@0(electing) e1 discarding message auth(proto 0 26 bytes epoch 1) v1 and sending client elsewhere
4056 2013-08-26 15:52:48.028808 7fd83f131700  1 mon.ceph0@0(electing).elector(15) init, last seen epoch 15
4057 2013-08-26 15:52:51.432073 7fd83f131700  1 mon.ceph0@0(electing) e1 discarding message auth(proto 0 26 bytes epoch 0) v1 and sending client elsewhere
4058 2013-08-26 15:52:51.432116 7fd83f131700  1 mon.ceph0@0(electing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
4059 2013-08-26 15:52:51.432129 7fd83f131700  1 mon.ceph0@0(electing) e1 discarding message auth(proto 0 26 bytes epoch 1) v1 and sending client elsewhere
4060 2013-08-26 15:52:51.432147 7fd83f131700  1 mon.ceph0@0(electing) e1 discarding message auth(proto 0 27 bytes epoch 1) v1 and sending client elsewhere
4061 2013-08-26 15:52:53.029037 7fd83f131700  1 mon.ceph0@0(electing).elector(15) init, last seen epoch 15


[ceph-users] locking rbd device

2013-08-26 Thread Wolfgang Hennerbichler
hi list, 

I realize there's a command called "rbd lock" to lock an image. Can libvirt use 
this to prevent virtual machines from being started simultaneously on different 
virtualisation containers? 

wogri
-- 
http://www.wogri.at
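
For reference, the CLI side of that locking looks roughly like this (a
sketch; pool, image and lock id are made up):

   # take an advisory lock on an image before starting the VM
   rbd lock add rbd/vm-disk-1 my-host-lock

   # list current lockers (prints the locker id, e.g. client.4123)
   rbd lock list rbd/vm-disk-1

   # release the lock, using the locker id reported by 'rbd lock list'
   rbd lock remove rbd/vm-disk-1 my-host-lock client.4123

Note that rbd locks are advisory: a client that never checks them can still
open the image, so the virtualisation layer would have to take and honour the
lock itself rather than relying on RBD to enforce it.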
