Re: rbd map command hangs for 15 minutes during system start up

2012-11-22 Thread Nick Bartos
Here are the ceph log messages (including the libceph kernel debug
stuff you asked for) from a node boot with the rbd command hung for a
couple of minutes:

https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
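(For anyone reproducing this, a minimal sketch of collecting the same libceph debug output, assuming debugfs is mounted at /sys/kernel/debug; the image name and log path are made up:)

$ echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
$ rbd map rbd/test-image                  # hypothetical pool/image
$ dmesg > /tmp/libceph-debug.txt          # the debug lines land in the kernel log
$ echo 'module libceph -p' > /sys/kernel/debug/dynamic_debug/control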

On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos  wrote:
> It's very easy to reproduce now with my automated install script, the
> most I've seen it succeed with that patch is 2 in a row, and hanging
> on the 3rd, although it hangs on most builds.  So it shouldn't take
> much to get it to do it again.  I'll try and get to that tomorrow,
> when I'm a bit more rested and my brain is working better.
>
> Yes, during this the OSDs are probably all syncing up.  All the osd and
> mon daemons have started by the time the rbd commands are run, though.
>
> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil  wrote:
>> On Wed, 21 Nov 2012, Nick Bartos wrote:
>>> FYI the build which included all 3.5 backports except patch #50 is
>>> still going strong after 21 builds.
>>
>> Okay, that one at least makes some sense.  I've opened
>>
>> http://tracker.newdream.net/issues/3519
>>
>> How easy is this to reproduce?  If it is something you can trigger with
>> debugging enabled ('echo module libceph +p >
>> /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
>>
>> I'm guessing that during this startup time the OSDs are still in the
>> process of starting?
>>
>> Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
>> thrashing OSDs could hit this.
>>
>> Thanks!
>> sage
>>
>>
>>>
>>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos  wrote:
>>> > With 8 successful installs already done, I'm reasonably confident that
>>> > it's patch #50.  I'm making another build which applies all patches
>>> > from the 3.5 backport branch, excluding that specific one.  I'll let
>>> > you know if that turns up any unexpected failures.
>>> >
>>> > What will the potential fallout be for removing that specific patch?
>>> >
>>> >
>>> > On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos  wrote:
>>> >> It's really looking like it's the
>>> >> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>> >> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>>> >>  So far I have gone through 4 successful installs with no hang with
>>> >> only 1-49 applied.  I'm still leaving my test run to make sure it's
>>> >> not a fluke, but since previously it hangs within the first couple of
>>> >> builds, it really looks like this is where the problem originated.
>>> >>
>>> >> 1-libceph_eliminate_connection_state_DEAD.patch
>>> >> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>>> >> 3-libceph_rename_socket_callbacks.patch
>>> >> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>>> >> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>>> >> 6-libceph_start_separating_connection_flags_from_state.patch
>>> >> 7-libceph_start_tracking_connection_socket_state.patch
>>> >> 8-libceph_provide_osd_number_when_creating_osd.patch
>>> >> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>>> >> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>>> >> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>>> >> 12-libceph_init_monitor_connection_when_opening.patch
>>> >> 13-libceph_fully_initialize_connection_in_con_init.patch
>>> >> 14-libceph_tweak_ceph_alloc_msg.patch
>>> >> 15-libceph_have_messages_point_to_their_connection.patch
>>> >> 16-libceph_have_messages_take_a_connection_reference.patch
>>> >> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>>> >> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>>> >> 19-libceph_fix_overflow_in___decode_pool_names.patch
>>> >> 20-libceph_fix_overflow_in_osdmap_decode.patch
>>> >> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>>> >> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>>> >> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>>> >> 24-libceph_use_con_get_put_methods.patch
>>> >> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>>> >> 26-libceph_encapsulate_out_message_data_setup.patch
>>> >> 27-libceph_encapsulate_advancing_msg_page.patch
>>> >> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>>> >> 29-libceph_move_init_bio__functions_up.patch
>>> >> 30-libceph_move_init_of_bio_iter.patch
>>> >> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>>> >> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>>> >> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>>> >> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>>> >> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>>> >> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>>> >> 37-libceph_clear_NEGOTIATING_when_done.patch
>>> >> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>>> >> 39-libceph_separate_banner_and_connect_writes.patch
>>> >> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>>> >> 41-libceph_small_c

Re: RBD fio Performance concerns

2012-11-22 Thread Sébastien Han
Hum sorry, you're right. Forget about what I said :)


On Thu, Nov 22, 2012 at 4:54 PM, Stefan Priebe - Profihost AG
 wrote:
> I thought the client would then write to the 2nd; is this wrong?
>
> Stefan
>
> Am 22.11.2012 um 16:49 schrieb Sébastien Han :
>
 But who cares? it's also on the 2nd node. or even on the 3rd if you have
 replicas 3.
>>
>> Yes but you could also suffer a crash while writing the first replica.
>> If the journal is in tmpfs, there is nothing to replay.
>>
>>
>>
>> On Thu, Nov 22, 2012 at 4:35 PM, Alexandre DERUMIER  
>> wrote:
>>>
> But who cares? it's also on the 2nd node. or even on the 3rd if you have
> replicas 3.


Re: RBD fio Performance concerns

2012-11-22 Thread Alexandre DERUMIER
>>We need something like tmpfs - running in local memory but support dio. 

Maybe with a ramdisk, /dev/ram0?

We can format it with a standard filesystem (ext3, ext4, ...), so maybe dio works
with it?
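(An untested sketch of how one could check that, using the brd ramdisk module; the size and mount point are arbitrary:)

$ modprobe brd rd_nr=1 rd_size=1048576      # one ~1GB ram block device
$ mkfs.ext4 /dev/ram0
$ mount /dev/ram0 /mnt/ram-journal
$ dd if=/dev/zero of=/mnt/ram-journal/dio-test bs=4k count=1 oflag=direct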



- Mail original - 

De: "Stefan Priebe - Profihost AG"  
À: "Sébastien Han"  
Cc: "Mark Nelson" , "Alexandre DERUMIER" 
, "ceph-devel" , "Mark Kampe" 
 
Envoyé: Jeudi 22 Novembre 2012 14:29:03 
Objet: Re: RBD fio Performance concerns 

Am 22.11.2012 14:22, schrieb Sébastien Han: 
> And RAMDISK devices are too expensive. 
> 
> It would make sense in your infra, but yes they are really expensive. 

We need something like tmpfs - running in local memory but support dio. 

Stefan 


Re: OSD daemon changes port no

2012-11-22 Thread Sage Weil
On Thu, 22 Nov 2012, hemant surale wrote:
> Sir,
> 
> Thanks for the direction. Here I was using the "mount.ceph  monaddr:ip:/
> /home/hemant/mntpoint" cmd. Is it possible to achieve the same effect
> with "mount.ceph" as what you suggested with "cephfs" (cephfs
> /mnt/ceph/foo --pool )?
> 
> But I see that cephfs is able to set which osd to use and the object
> size. So can you throw more light on this?

Mount can mount/bind a subdirectory into your namespace, but it doesn't 
change the layout policies used by the MDS; that's what cephfs does.

sage

> 
> 
> Thanks & Regards,
> Hemant Surale.
> 
> On Wed, Nov 21, 2012 at 8:59 PM, Sage Weil  wrote:
> > On Wed, 21 Nov 2012, hemant surale wrote:
> >> > Oh I see.  Generally speaking, the only way to guarantee separation is to
> >> > put them in different pools and distribute the pools across different 
> >> > sets
> >> > of OSDs.
> >>
> >> yeah that was correct approach but i found problem doing so from
> >> abstract level i.e. when I put file inside mounted dir
> >> "/home/hemant/cephfs " ( mounted using "mount.ceph" cmd ) . At that
> >> time anyways ceph is going to use default pool data to store files (
> >> here files were striped into different objects and then sent to
> >> appropriate osd ) .
> >>So how to tell ceph to use different pools in this case ?
> >>
> >> Goal : separate read and write operations , where read will be done
> >> from one group of OSD and write is done to other group of OSD.
> >
> > First create the other pool,
> >
> >  ceph osd pool create 
> >
> > and then adjust the CRUSH rule to distributed to a different set of OSDs
> > for that pool.
> >
> > To allow cephfs use it,
> >
> >  ceph mds add_data_pool 
> >
> > and then:
> >
> >  cephfs /mnt/ceph/foo --pool 
> >
> > will set the policy on the directory such that new files beneath that
> > point will be stored in a different pool.
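(A concrete sketch of the same steps, with a hypothetical pool named 'fastpool' and a made-up pool id of 5, using the set_layout spelling that appears elsewhere in this digest:)

$ ceph osd pool create fastpool
$ ceph osd dump | grep fastpool        # note the pool id, say 5
$ ceph mds add_data_pool 5
$ cephfs /mnt/ceph/foo set_layout -p 5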
> >
> > Hope that helps!
> > sage
> >
> >
> >>
> >>
> >>
> >>
> >> -
> >> Hemant Surale.
> >>
> >>
> >> On Wed, Nov 21, 2012 at 12:33 PM, Sage Weil  wrote:
> >> > On Wed, 21 Nov 2012, hemant surale wrote:
> >> >> Its a little confusing question I believe .
> >> >>
> >> >> Actually there are two files X & Y.  When I am reading X from its
> >> >> primary .I want to make sure simultaneous writing of Y should go to
> >> >> any other OSD except primary OSD for X (from where my current read is
> >> >> getting served ) .
> >> >
> >> > Oh I see.  Generally speaking, the only way to guarantee separation is to
> >> > put them in different pools and distribute the pools across different 
> >> > sets
> >> > of OSDs.  Otherwise, it's all (pseudo)random and you never know.  
> >> > Usually,
> >> > they will be different, particularly as the cluster size increases, but
> >> > sometimes they will be the same.
> >> >
> >> > sage
> >> >
> >> >
> >> >>
> >> >>
> >> >> -
> >> >> Hemant Surale.
> >> >>
> >> >> On Wed, Nov 21, 2012 at 11:50 AM, Sage Weil  wrote:
> >> >> > On Wed, 21 Nov 2012, hemant surale wrote:
> >> >> >> >>and one more thing how can it be possible to read from one osd 
> >> >> >> >> and
> >> >> >> >> then simultaneous write to direct on other osd with less/no 
> >> >> >> >> traffic?
> >> >> >> >
> >> >> >> > I'm not sure I understand the question...
> >> >> >>
> >> >> >> Scenario :
> >> >> >>I have written file X.txt on some osd which is primary for 
> >> >> >> file
> >> >> >> X.txt ( direct write operation using rados cmd) .
> >> >> >>Now while read on file X.txt is in progress, Can I make sure
> >> >> >> the simultaneous write request must be directed to other osd using
> >> >> >> crushmaps/other way?
> >> >> >
> >> >> > Nope.  The object location is based on the name.  Reads and writes go 
> >> >> > to
> >> >> > the same location so that a single OSD can serialize requests.  That 
> >> >> > means,
> >> >> > for example, that a read that follows a write returns the just-written
> >> >> > data.
> >> >> >
> >> >> > sage
> >> >> >
> >> >> >
> >> >> >> Goal of task :
> >> >> >>Trying to avoid read - write clashes as much as possible to
> >> >> >> achieve faster operations (I/O) . Although CRUSH selects osd for data
> >> >> >> placement based on pseudo random function.  is it possible ?
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> -
> >> >> >> Hemant Surale.
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Tue, Nov 20, 2012 at 10:15 PM, Sage Weil  wrote:
> >> >> >> > On Tue, 20 Nov 2012, hemant surale wrote:
> >> >> >> >> Hi Community,
> >> >> >> >>I have question about port number used by ceph-osd daemon . I
> >> >> >> >> observed traffic (inter -osd communication while data ingest 
> >> >> >> >> happened)
> >> >> >> >> on port 6802 and then after some time when I ingested second file
> >> >> >> >> after some delay port no 6804 was used . Is there any specific 
> >> >> >> >> reason
> >> >> >> >> to change port no here?
> >> >> >> >
> >> >> >> > The ports are dynamic.  Daemons bind to a random (6800-6900) port 
> >> >> >> > on
> >> >

Re: Files lost after mds rebuild

2012-11-22 Thread Drunkard Zhang
2012/11/22 Gregory Farnum :
> On Tue, Nov 20, 2012 at 8:28 PM, Drunkard Zhang  wrote:
>> 2012/11/21 Gregory Farnum :
>>> No, absolutely not. There is no relationship between different RADOS
>>> pools. If you've been using the cephfs tool to place some filesystem
>>> data in different pools then your configuration is a little more
>>> complicated (have you done that?), but deleting one pool is never
>>> going to remove data from the others.
>>> -Greg
>>>
>> I think that should be a bug. Here's the story I did:
>> I created one directory 'audit' in running ceph filesystem, and put
>> some data into the directory (about 100GB) before these commands:
>> ceph osd pool create audit
>> ceph mds add_data_pool 4
>> cephfs /mnt/temp/audit/ set_layout -p 4
>>
>> log3 ~ # ceph osd dump | grep audit
>> pool 4 'audit' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num
>> 8 pgp_num 8 last_change 1558 owner 0
>>
>> At this time, all data in audit was still usable. After 'ceph osd pool
>> delete data', the disk space was recycled (I forgot to test whether the data
>> was still usable); only 200MB used, according to 'ceph -s'. So here's what I'm
>> thinking: the data stored before the pool was created won't follow the pool,
>> it still follows the default pool 'data'. Is this a bug, or intended
>> behavior?
>
> Oh, I see. Data is not moved when you set directory layouts; it only
> impacts files created after that point. This is intended behavior —
> Ceph would need to copy the data around anyway in order to make it
> follow the pool. There's no sense in hiding that from the user,
> especially given the complexity involved in doing so safely —
> especially when there are many use cases where you want the files in
> different pools.
> -Greg

Got you, but how can I know which pool a file lives in? Are there any commands?

About the relationship between data and pools: I thought that objects were hooked to
a pool, and when the pool changed they would just be unhooked and hooked to another;
it seems I was wrong.


Re: RBD Backup

2012-11-22 Thread Wido den Hollander



On 11/22/2012 06:57 PM, Stefan Priebe - Profihost AG wrote:

Hi,

Am 21.11.2012 14:47, schrieb Wido den Hollander:

The snapshot isn't consistent since it has no way of telling the VM to
flush its buffers.

To make it consistent you have to run "sync" (In the VM) just prior to
creating the snapshot.


Mhm, but between executing sync and creating the snap there is again time for
data to be written.



True. That is always a problem with snapshots. I always regard data 
written to disk in the last 30 seconds as being in the "danger zone".


When you use libvirt and QCOW2 as a backing store for your virtual 
machine you can also snapshot with libvirt. It will not only snapshot 
the disk, but it will also store the memory contents from the virtual 
machine so you have a consistent state of the virtual machine.


This has a drawback however, since when you give the VM 16GB of memory, 
you have to store 16GB of data.


Right now this doesn't work yet with RBD, but there is a feature request 
in the tracker. I can't seem to find it right now.


What you could do is:

$ ssh root@virtual-machine "sync"
$ rbd snap create vm-disk@snap1
$ rbd export --snap snap1 vm-disk /mnt/backup/vm-disk_snap1.img

This way you have a pretty consistent snapshot.

Wido


rbd export --snap BACKUP image1 /mnt/backup/image1.img
losetup /mnt/backup/image1.img

kpartx -a /dev/loop0

Now you will have the partitions from the RBD image available in
/dev/mapper/loop0pX

Works fine!

Greets,
Stefan



Re: RBD fio Performance concerns

2012-11-22 Thread Stefan Priebe - Profihost AG
I thought the client would then write to the 2nd; is this wrong?

Stefan

Am 22.11.2012 um 16:49 schrieb Sébastien Han :

>>> But who cares? it's also on the 2nd node. or even on the 3rd if you have
>>> replicas 3.
> 
> Yes but you could also suffer a crash while writing the first replica.
> If the journal is in tmpfs, there is nothing to replay.
> 
> 
> 
> On Thu, Nov 22, 2012 at 4:35 PM, Alexandre DERUMIER  
> wrote:
>> 
 But who cares? it's also on the 2nd node. or even on the 3rd if you have
 replicas 3.


Re: RBD fio Performance concerns

2012-11-22 Thread Stefan Priebe - Profihost AG

Am 22.11.2012 13:50, schrieb Sébastien Han:

journal is running on tmpfs to me but that changes nothing.


I don't think it works then. According to the doc: Enables using
libaio for asynchronous writes to the journal. Requires journal dio
set to true.
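(For reference, the two options in question as they would sit in ceph.conf; a sketch based on that doc snippet, not a tested recommendation:)

[osd]
    journal dio = true
    journal aio = true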


Ah, might be, but as the SSDs are pretty fast I don't know which device to 
use as a journal except tmpfs.


And RAMDISK devices are too expensive.

Greets,
Stefan


On Thu, Nov 22, 2012 at 12:48 PM, Stefan Priebe - Profihost AG
 wrote:

Am 22.11.2012 11:49, schrieb Sébastien Han:


@Alexandre: cool!

@ Stefan: Full SSD cluster and 10G switches?


Yes



Couple of weeks ago I saw
that you use journal aio, did you notice performance improvement with it?


journal is running on tmpfs to me but that changes nothing.

Stefan



Re: RBD fio Performance concerns

2012-11-22 Thread Stefan Priebe - Profihost AG

Am 22.11.2012 15:37, schrieb Mark Nelson:

I don't think we recommend tmpfs at all for anything other than playing
around. :)


I discussed this with somebody from Inktank. Had to search the 
mailing list. It might be OK if you're working with enough replicas and a UPS.


I see no other option while working with SSDs - the only option would be 
to be able to deactivate the journal at all. But ceph does not support this.


Stefan


On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote:

Hi,

can someone from inktank comment this? Might be using /dev/ram0 with an
fs on it be better than tmpfs as we can use dio?

Greets,
Stefan


- Mail original -

De: "Stefan Priebe - Profihost AG" 
À: "Sébastien Han" 
Cc: "Mark Nelson" , "Alexandre DERUMIER"
, "ceph-devel" ,
"Mark Kampe" 
Envoyé: Jeudi 22 Novembre 2012 14:29:03
Objet: Re: RBD fio Performance concerns

Am 22.11.2012 14:22, schrieb Sébastien Han:

And RAMDISK devices are too expensive.

It would make sense in your infra, but yes they are really expensive.


We need something like tmpfs - running in local memory but support dio.

Stefan







Re: RBD fio Performance concerns

2012-11-22 Thread Mark Nelson

On 11/22/2012 04:49 AM, Sébastien Han wrote:

@Alexandre: cool!

@ Stefan: Full SSD cluster and 10G switches? Couple of weeks ago I saw
that you use journal aio, did you notice performance improvement with it?

@Mark Kampe

 > If I read the above correctly, your random operations are 4K and your
 > sequential operations are 4M.


As you recommend. (see below what you previously said):

 > If you want to do sequential I/O, you should do it buffered
 > (so that the writes can be aggregated) or with a 4M block size
 > (very efficient and avoiding object serialization).



 > The block-size difference makes the random and sequential
 > results incomparable.


Ok let's do it again then (short output that fits on a screen), with a
single OSD and 4K blocks:

seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=42
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=42
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=42
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=42

seq-read: (groupid=0, jobs=1): err= 0: pid=7542
   read : io=912716KB, bw=15210KB/s, iops=3802 , runt= 60009msec
rand-read: (groupid=1, jobs=1): err= 0: pid=7546
   read : io=980504KB, bw=16339KB/s, iops=4084 , runt= 60009msec
seq-write: (groupid=2, jobs=1): err= 0: pid=7547
   write: io=54216KB, bw=922718 B/s, iops=225 , runt= 60167msec
rand-write: (groupid=3, jobs=1): err= 0: pid=7557
   write: io=66116KB, bw=1098.5KB/s, iops=274 , runt= 60192msec

Sequential and random operations are getting closer to each other, but
random operations remain higher.
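(The four jobs above correspond to fio invocations roughly like these; the RBD device path and the --direct flag are assumptions:)

$ fio --name=seq-read   --rw=read      --bs=4k --ioengine=libaio --iodepth=42 --direct=1 --runtime=60 --filename=/dev/rbd0
$ fio --name=rand-read  --rw=randread  --bs=4k --ioengine=libaio --iodepth=42 --direct=1 --runtime=60 --filename=/dev/rbd0
$ fio --name=seq-write  --rw=write     --bs=4k --ioengine=libaio --iodepth=42 --direct=1 --runtime=60 --filename=/dev/rbd0
$ fio --name=rand-write --rw=randwrite --bs=4k --ioengine=libaio --iodepth=42 --direct=1 --runtime=60 --filename=/dev/rbd0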

 > the more data you send me, the longer it takes me to find
 > time to review your results.  If you send me a message that
 > fits on a single screen, I will try to answer it immediately.


I just don't want to miss any information that you may find useful.

@ Mark Nelson

See below the new blkparse trace with 4K block for all operations:

[Inline image: blkparse/seekwatcher trace with 4K blocks for all operations]


Thanks for doing this!  Unfortunately it's only showing writes so we 
don't know what read behavior looks like in these graphs.  That might be 
important.  Also, do you know approximately how the tests line up with 
timestamps on the seekwatcher results?  Seekwatcher only seems to be 
going for 228 seconds, but the 4 tests should be lasting 240+ seconds?


If I just break this down into 60s chunks with seq-read, rand-read, 
seq-write, rand-write in that order, my wildly speculative guess is that 
there's a chunk of time missing at the beginning where the read tests 
are happening that extend out to about second ~80-85 in the graph. 
After that it's the write tests from ~85 out to ~205.  After that, I 
guess that there are no more incoming writes, but existing writes are 
being flushed and that with no incoming writes we see a bump in 
performance (possibly due to a reduction in lock contention? What 
version of ceph is this again?)


A couple of thoughts:

- You may want to pause between the tests for a while (or even reformat 
between every test!)


- That spike in performance at the end is interesting.  I'd really like 
to know if that happened during the rand-write test or after the test 
completed (once the data hit the journal the writes would have been 
acknowledged letting the test end while the data was still being flushed 
out to disk).


- If my interpretation is right, it looks like the typical seq-write 
throughput is slightly higher according to blktrace but with regular 
dips, while the random write performance is typically lower but with no 
dips (and maybe a big increase at the end).  In both cases we have a 
very high number of seeks! Do you have WB cache on your controller? 
These are 10K RPM drives?


- Read behavior would be really useful!

- You can pretty clearly see the different AGs in XFS doing their thing. 
 I wonder if 1 AG would be better here.  On the other hand, there's a 
pretty long thread that discusses IOPs heavy workload on XFS here:


http://xfs.9218.n7.nabble.com/XFS-Abysmal-write-performance-because-of-excessive-seeking-allocation-groups-to-blame-td15501.html

- It would be very interesting to try this test on EXT4 or BTRFS and see 
if the results are the same.  I forget, did someone already do this?


Mark


Thanks again everyone, for your help.

Cheers!

On Thu, Nov 22, 2012 at 11:19 AM, Stefan Priebe - Profihost AG
mailto:s.pri...@profihost.ag>> wrote:
 >
 >
 > Same to me:
 > rand 4k: 23.000 iops
 > seq 4k: 13.000 iops
 >
 > Even in writeback mode where normally seq 4k should be merged into
bigger requests.
 >
 > Stefan
 >
 > Am 21.11.2012 17:34, schrieb Mark Nelson:
 >
 >> Responding to my own message. :)
 >>
 >> Talked to Sage a bit offline about this.  I think there are two opposing
 >> forces:
 >>
 >> On one hand, random IO may be spreading reads/writes out across more
 >> OSDs than sequential IO that presumably would be hitting a single OSD
 >> more regularly.
 >>
 >> On the other hand, you'd expect that sequential writ

Re: 'zombie snapshot' problem

2012-11-22 Thread Andrey Korolyov
On Thu, Nov 22, 2012 at 2:05 AM, Josh Durgin  wrote:
> On 11/21/2012 04:50 AM, Andrey Korolyov wrote:
>>
>> Hi,
>>
>> Somehow I have managed to produce an unkillable snapshot, which does not
>> allow removing itself or its parent image:
>>
>> $ rbd snap purge dev-rack0/vm2
>> Removing all snapshots: 100% complete...done.
>
>
> I see one bug with 'snap purge' ignoring the return code when removing
> snaps. I just fixed this in the next branch. It's probably getting the
> same error as 'rbd snap rm' below.
>
> Could you post the output of:
>
> rbd snap purge dev-rack0/vm2 --debug-ms 1 --debug-rbd 20
>
>
>> $ rbd rm dev-rack0/vm2
>> 2012-11-21 16:31:24.184626 7f7e0d172780 -1 librbd: image has snapshots
>> - not removing
>> Removing image: 0% complete...failed.
>> rbd: image has snapshots - these must be deleted with 'rbd snap purge'
>> before the image can be removed.
>> $ rbd snap ls dev-rack0/vm2
>> SNAPID NAME   SIZE
>> 188 vm2.snap-yxf 16384 MB
>> $ rbd info dev-rack0/vm2
>> rbd image 'vm2':
>>  size 16384 MB in 4096 objects
>>  order 22 (4096 KB objects)
>>  block_name_prefix: rbd_data.1fa164c960874
>>  format: 2
>>  features: layering
>> $ rbd snap rm --snap vm2.snap-yxf dev-rack0/vm2
>> rbd: failed to remove snapshot: (2) No such file or directory
>> $ rbd snap create --snap vm2.snap-yxf dev-rack0/vm2
>> rbd: failed to create snapshot: (17) File exists
>> $ rbd snap rollback --snap vm2.snap-yxf dev-rack0/vm2
>> Rolling back to snapshot: 100% complete...done.
>> $ rbd snap protect --snap vm2.snap-yxf dev-rack0/vm2
>> $ rbd snap unprotect --snap vm2.snap-yxf dev-rack0/vm2
>>
>>
>> Meanwhile, ``rbd ls -l dev-rack0''  segfaulting with an attached log.
>> Is there any reliable way to kill problematic snap?
>
>
> From this log it looks like vm2 used to be a clone, and the snapshot
> vm2.snap-yxf was taken before it was flattened. Later, the parent of
> vm2.snap-yxf was deleted. Is this correct?

I have attached the log you asked for; hope it will be useful.
Here are two possible flows: a snapshot created before flatten, and one created during flatten:

Completely linear flow:

$ rbd cp install/debian7 dev-rack0/testimg
Image copy: 100% complete...done.
$ rbd snap create --snap test1 dev-rack0/testimg
$ rbd snap clone --snap test1 dev-rack0/testimg dev-rack0/testimg2
rbd: error parsing command 'clone'
$ rbd snap protect --snap test1 dev-rack0/testimg
$ rbd clone --snap test1 dev-rack0/testimg dev-rack0/testimg2
$ rbd snap create --snap test2 dev-rack0/testimg2
$ rbd flatten dev-rack0/testimg2
Image flatten: 100% complete...done.
$ rbd snap unprotect --snap test1 dev-rack0/testimg
2012-11-22 15:11:03.446892 7ff9fb7c1780 -1 librbd: snap_unprotect:
can't unprotect; at least 1 child(ren) in pool dev-rack0
rbd: unprotecting snap failed: (16) Device or resource busy
$ rbd snap purge dev-rack0/testimg2
Removing all snapshots: 100% complete...done.
$ rbd snap ls dev-rack0/testimg2
$ rbd snap unprotect --snap test1 dev-rack0/testimg


snapshot created over image with ``flatten'' in progress:

$ rbd snap create --snap test3 dev-rack0/testimg
$ rbd snap protect --snap test3 dev-rack0/testimg
$ rbd clone --snap test3 dev-rack0/testimg dev-rack0/testimg3
rbd $ rbd flatten dev-rack0/testimg3
[here was executed rbd snap create --snap test43 dev-rack0/testimg3]
Image flatten: 100% complete...done.
$ rbd snap unprotect --snap test3 dev-rack0/testimg
$ rbd snap ls dev-rack0/testimg3
SNAPID NAME SIZE
   323 test43 640 MB
$ rbd snap purge dev-rack0/testimg3
Removing all snapshots: 100% complete...done.
$ rbd snap ls dev-rack0/testimg3
SNAPID NAME SIZE
   323 test43 640 MB
$ rbd snap rm --snap test43 dev-rack0/testimg3
rbd: failed to remove snapshot: (2) No such file or directory

Hooray, problem found! Now I'll avoid this by treating the flatten state as
exclusive on the image.
ceph version 0.54 (commit:60b84b095b1009a305d4d6a5b16f88571cbd3150)

>
> It was a bug in 0.53 that protected snapshots could be deleted.
>
> Josh


[Attachment: snap.txt.gz (GNU Zip compressed data)]


Re: RBD fio Performance concerns

2012-11-22 Thread Stefan Priebe - Profihost AG
Otherwise you would have the same problem with disk crashes.

Am 22.11.2012 um 16:55 schrieb Sébastien Han :

> Hum sorry, you're right. Forget about what I said :)
> 
> 
> On Thu, Nov 22, 2012 at 4:54 PM, Stefan Priebe - Profihost AG
>  wrote:
>> I thought the client would then write to the 2nd; is this wrong?
>> 
>> Stefan
>> 
>> Am 22.11.2012 um 16:49 schrieb Sébastien Han :
>> 
> But who cares? it's also on the 2nd node. or even on the 3rd if you have
> replicas 3.
>>> 
>>> Yes but you could also suffer a crash while writing the first replica.
>>> If the journal is in tmpfs, there is nothing to replay.
>>> 
>>> 
>>> 
>>> On Thu, Nov 22, 2012 at 4:35 PM, Alexandre DERUMIER  
>>> wrote:
 
>> But who cares? it's also on the 2nd node. or even on the 3rd if you have
>> replicas 3.


Question about simulation with crushtool

2012-11-22 Thread Nam Dang
Dear all,

I am trying to do some small experiments with crushtool by simulating
different CRUSH variants.
However, I am encountering some problems with crushtool due to its lack of
documentation.

What is the command to simulate the placement in a 32-device
bucket system (only 1 bucket)? And how do I show the placement of the
data through the simulation?
What I've figured out so far is:

> crushtool -t --min_x 3 --build --num_osds 32 root uniform 0

The output that I get is:

crushtool successfully built or modified map.  Use '-o ' to write it out.
rule 0 (data2012-11-23 00:13:21.676296 7fa31f813780  0 layer 1  root
bucket type uniform  0
), x = 0..1023, numrep = 2..2
2012-11-23 00:13:21.676313 7fa31f813780  0 lower_items
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
2012-11-23 00:13:21.676320 7fa31f813780  0 lower_weights
[65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536,65536]
2012-11-23 00:13:21.676374 7fa31f813780  0   item 0 weight 65536
2012-11-23 00:13:21.676375 7fa31f813780  0   item 1 weight 65536
2012-11-23 00:13:21.676376 7fa31f813780  0   item 2 weight 65536
2012-11-23 00:13:21.676376 7fa31f813780  0   item 3 weight 65536
2012-11-23 00:13:21.676377 7fa31f813780  0   item 4 weight 65536
2012-11-23 00:13:21.676378 7fa31f813780  0   item 5 weight 65536
2012-11-23 00:13:21.676378 7fa31f813780  0   item 6 weight 65536
2012-11-23 00:13:21.676379 7fa31f813780  0   item 7 weight 65536
2012-11-23 00:13:21.676380 7fa31f813780  0   item 8 weight 65536
2012-11-23 00:13:21.676380 7fa31f813780  0   item 9 weight 65536
2012-11-23 00:13:21.676381 7fa31f813780  0   item 10 weight 65536
2012-11-23 00:13:21.676382 7fa31f813780  0   item 11 weight 65536
rule 0 (data) num_rep 2 result size == 02012-11-23 00:13:21.676382
7fa31f813780  0   item 12 weight 65536
:   2012-11-23 00:13:21.676404 7fa31f813780  0   item 13 weight 65536
1024/1024
2012-11-23 00:13:21.676407 7fa31f813780  0   item 14 weight 65536
2012-11-23 00:13:21.676407 7fa31f813780  0   item 15 weight 65536
2012-11-23 00:13:21.676408 7fa31f813780  0   item 16 weight 65536
2012-11-23 00:13:21.676409 7fa31f813780  0   item 17 weight 65536
2012-11-23 00:13:21.676412 7fa31f813780  0   item 18 weight 65536
2012-11-23 00:13:21.676413 7fa31f813780  0   item 19 weight 65536
2012-11-23 00:13:21.676414 7fa31f813780  0   item 20 weight 65536
2012-11-23 00:13:21.676414 7fa31f813780  0   item 21 weight 65536
2012-11-23 00:13:21.676415 7fa31f813780  0   item 22 weight 65536
2012-11-23 00:13:21.676416 7fa31f813780  0   item 23 weight 65536
2012-11-23 00:13:21.676417 7fa31f813780  0   item 24 weight 65536
2012-11-23 00:13:21.676417 7fa31f813780  0   item 25 weight 65536
2012-11-23 00:13:21.676671 7fa31f813780  0   item 26 weight 65536
2012-11-23 00:13:21.676673 7fa31f813780  0   item 27 weight 65536
2012-11-23 00:13:21.676674 7fa31f813780  0   item 28 weight 65536
2012-11-23 00:13:21.676675 7fa31f813780  0   item 29 weight 65536
2012-11-23 00:13:21.676676 7fa31f813780  0   item 30 weight 65536
2012-11-23 00:13:21.676678 7fa31f813780  0   item 31 weight 65536
2012-11-23 00:13:21.676731 7fa31f813780  0  in bucket -1 'root' size
32 weight 2097152


I can't seem to add parameters like
"--show_utilization_all"; crushtool keeps complaining that
"layers must be specified with 3-tuples of (name, buckettype, size)".
I have no idea why.
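(An untested guess: building the map into a file first and then running the test flags against that file in a second invocation may sidestep the complaint; the layer tuples and output path are only an example, and flag spellings may differ by version:)

$ crushtool --build --num_osds 32 host uniform 4 root straw 0 -o /tmp/test.crush
$ crushtool -i /tmp/test.crush -t --show_utilization_all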

When I modified the source code to force crushtool to print out the
placement of all the data items (for 32 devices it seems crushtool
allocates only 32 items), the list entries are just blank. Basically
it appears that CRUSH fails to find an appropriate location for ALL the
data (the devices all have the same weight).

I hope someone here can tell me more about the usage of crushtool for
simulations like this. Thank you.

Best regards,
Nam Dang

Email: n...@de.cs.titech.ac.jp
HP: (+81) 080-4465-1587
Yokota Lab, Dept. of Computer Science
Tokyo Institute of Technology
Tokyo, Japan


Re: RBD fio Performance concerns

2012-11-22 Thread Mark Nelson
I don't think we recommend tmpfs at all for anything other than playing 
around. :)


On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote:

Hi,

can someone from inktank comment this? Might be using /dev/ram0 with an
fs on it be better than tmpfs as we can use dio?

Greets,
Stefan


- Mail original -

De: "Stefan Priebe - Profihost AG" 
À: "Sébastien Han" 
Cc: "Mark Nelson" , "Alexandre DERUMIER"
, "ceph-devel" ,
"Mark Kampe" 
Envoyé: Jeudi 22 Novembre 2012 14:29:03
Objet: Re: RBD fio Performance concerns

Am 22.11.2012 14:22, schrieb Sébastien Han:

And RAMDISK devices are too expensive.

It would make sense in your infra, but yes they are really expensive.


We need something like tmpfs - running in local memory but support dio.

Stefan




--
Mark Nelson
Performance Engineer
Inktank


Re: rbd map command hangs for 15 minutes during system start up

2012-11-22 Thread Nick Bartos
It's very easy to reproduce now with my automated install script, the
most I've seen it succeed with that patch is 2 in a row, and hanging
on the 3rd, although it hangs on most builds.  So it shouldn't take
much to get it to do it again.  I'll try and get to that tomorrow,
when I'm a bit more rested and my brain is working better.

Yes, during this the OSDs are probably all syncing up.  All the osd and
mon daemons have started by the time the rbd commands are run, though.

On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil  wrote:
> On Wed, 21 Nov 2012, Nick Bartos wrote:
>> FYI the build which included all 3.5 backports except patch #50 is
>> still going strong after 21 builds.
>
> Okay, that one at least makes some sense.  I've opened
>
> http://tracker.newdream.net/issues/3519
>
> How easy is this to reproduce?  If it is something you can trigger with
> debugging enabled ('echo module libceph +p >
> /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
>
> I'm guessing that during this startup time the OSDs are still in the
> process of starting?
>
> Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
> thrashing OSDs could hit this.
>
> Thanks!
> sage
>
>
>>
>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos  wrote:
>> > With 8 successful installs already done, I'm reasonably confident that
>> > it's patch #50.  I'm making another build which applies all patches
>> > from the 3.5 backport branch, excluding that specific one.  I'll let
>> > you know if that turns up any unexpected failures.
>> >
>> > What will the potential fallout be for removing that specific patch?
>> >
>> >
>> > On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos  wrote:
>> >> It's really looking like it's the
>> >> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>> >> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>> >>  So far I have gone through 4 successful installs with no hang with
>> >> only 1-49 applied.  I'm still leaving my test run to make sure it's
>> >> not a fluke, but since previously it hangs within the first couple of
>> >> builds, it really looks like this is where the problem originated.
>> >>
>> >> 1-libceph_eliminate_connection_state_DEAD.patch
>> >> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>> >> 3-libceph_rename_socket_callbacks.patch
>> >> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>> >> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>> >> 6-libceph_start_separating_connection_flags_from_state.patch
>> >> 7-libceph_start_tracking_connection_socket_state.patch
>> >> 8-libceph_provide_osd_number_when_creating_osd.patch
>> >> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>> >> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>> >> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>> >> 12-libceph_init_monitor_connection_when_opening.patch
>> >> 13-libceph_fully_initialize_connection_in_con_init.patch
>> >> 14-libceph_tweak_ceph_alloc_msg.patch
>> >> 15-libceph_have_messages_point_to_their_connection.patch
>> >> 16-libceph_have_messages_take_a_connection_reference.patch
>> >> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>> >> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>> >> 19-libceph_fix_overflow_in___decode_pool_names.patch
>> >> 20-libceph_fix_overflow_in_osdmap_decode.patch
>> >> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>> >> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>> >> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>> >> 24-libceph_use_con_get_put_methods.patch
>> >> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>> >> 26-libceph_encapsulate_out_message_data_setup.patch
>> >> 27-libceph_encapsulate_advancing_msg_page.patch
>> >> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>> >> 29-libceph_move_init_bio__functions_up.patch
>> >> 30-libceph_move_init_of_bio_iter.patch
>> >> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>> >> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>> >> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>> >> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>> >> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>> >> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>> >> 37-libceph_clear_NEGOTIATING_when_done.patch
>> >> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>> >> 39-libceph_separate_banner_and_connect_writes.patch
>> >> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>> >> 41-libceph_small_changes_to_messenger.c.patch
>> >> 42-libceph_add_some_fine_ASCII_art.patch
>> >> 43-libceph_set_peer_name_on_con_open_not_init.patch
>> >> 44-libceph_initialize_mon_client_con_only_once.patch
>> >> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>> >> 46-libceph_initialize_msgpool_message_types.patch
>> >> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>> >> 48-l

Re: RBD fio Performance concerns

2012-11-22 Thread Stefan Priebe - Profihost AG

Am 22.11.2012 14:22, schrieb Sébastien Han:

And RAMDISK devices are too expensive.

It would make sense in your infra, but yes they are really expensive.


We need something like tmpfs - running in local memory but support dio.

Stefan


[PATCH] rbd block driver: fix race between aio completion and aio cancel

2012-11-22 Thread Stefan Priebe
This one fixes a race, which qemu also had in the iscsi block driver,
between cancellation and io completion.

qemu_rbd_aio_cancel was not synchronously waiting for the end of
the command.

To achieve this it introduces a new status flag which uses
-EINPROGRESS.

Signed-off-by: Stefan Priebe 
---
 block/rbd.c |   23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 0384c6c..783c3d7 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -77,6 +77,7 @@ typedef struct RBDAIOCB {
 int error;
 struct BDRVRBDState *s;
 int cancelled;
+int status;
 } RBDAIOCB;
 
 typedef struct RADOSCB {
@@ -376,12 +377,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
 RBDAIOCB *acb = rcb->acb;
 int64_t r;
 
-if (acb->cancelled) {
-qemu_vfree(acb->bounce);
-qemu_aio_release(acb);
-goto done;
-}
-
 r = rcb->ret;
 
 if (acb->cmd == RBD_AIO_WRITE ||
@@ -406,10 +401,11 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
 acb->ret = r;
 }
 }
+acb->status = 0;
+
 /* Note that acb->bh can be NULL in case where the aio was cancelled */
 acb->bh = qemu_bh_new(rbd_aio_bh_cb, acb);
 qemu_bh_schedule(acb->bh);
-done:
 g_free(rcb);
 }
 
@@ -574,6 +570,12 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb)
 {
 RBDAIOCB *acb = (RBDAIOCB *) blockacb;
 acb->cancelled = 1;
+
+while (acb->status == -EINPROGRESS) {
+qemu_aio_wait();
+}
+
+qemu_aio_release(acb);
 }
 
 static AIOPool rbd_aio_pool = {
@@ -646,7 +648,8 @@ static void rbd_aio_bh_cb(void *opaque)
 qemu_bh_delete(acb->bh);
 acb->bh = NULL;
 
-qemu_aio_release(acb);
+if (!acb->cancelled)
+qemu_aio_release(acb);
 }
 
 static int rbd_aio_discard_wrapper(rbd_image_t image,
@@ -691,6 +694,7 @@ static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs,
 acb->s = s;
 acb->cancelled = 0;
 acb->bh = NULL;
+acb->status = -EINPROGRESS;
 
 if (cmd == RBD_AIO_WRITE) {
 qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size);
@@ -737,7 +741,8 @@ static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs,
 failed:
 g_free(rcb);
 s->qemu_aio_count--;
-qemu_aio_release(acb);
+if (!acb->cancelled)
+qemu_aio_release(acb);
 return NULL;
 }
 
-- 
1.7.10.4



Re: RBD fio Performance concerns

2012-11-22 Thread Stefan Priebe - Profihost AG

Am 22.11.2012 15:52, schrieb Alexandre DERUMIER:

I discussed this with somebody frmo inktank. Had to search the
mailinglist. It might be OK if you're working with enough replicas and UPS.

I see no other option while working with SSDs - the only Option would be
to be able to deaktivate the journal at all. But ceph does not support this.


Do you have a big difference with putting 1 journal by osd on each ssd drive ?


Not tested.


another alternative can be (but indeed costly):

- stec zeus ram ssd drive, around 2000$ for 8G (I have benched it around 10 
iops ;)
- ddrdrive (http://www.ddrdrive.com/) (around 20iops, don't know the price)
- fusionio card (iodrive2, 360GB, around 3000€ , but they are 160GB model, 
maybe half the price)


All too expensive.


- maybe ocz talos, around 600€ for OCZ Talos 2 R 200 Go (don't have benched 
them, but spec say around 35000iops random)
Not usable, as each OSD can do 35,000 random IOP/s in my case and I have 8 
of them in each node...


Stefan


- Mail original -

De: "Stefan Priebe - Profihost AG" 
À: "Mark Nelson" 
Cc: "Alexandre DERUMIER" , "ceph-devel" , "Mark 
Kampe" , "Sébastien Han" 
Envoyé: Jeudi 22 Novembre 2012 15:42:14
Objet: Re: RBD fio Performance concerns

Am 22.11.2012 15:37, schrieb Mark Nelson:

I don't think we recommend tmpfs at all for anything other than playing
around. :)


I discussed this with somebody frmo inktank. Had to search the
mailinglist. It might be OK if you're working with enough replicas and UPS.

I see no other option while working with SSDs - the only Option would be
to be able to deaktivate the journal at all. But ceph does not support this.

Stefan


On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote:

Hi,

can someone from inktank comment this? Might be using /dev/ram0 with an
fs on it be better than tmpfs as we can use dio?

Greets,
Stefan


- Mail original -

De: "Stefan Priebe - Profihost AG" 
À: "Sébastien Han" 
Cc: "Mark Nelson" , "Alexandre DERUMIER"
, "ceph-devel" ,
"Mark Kampe" 
Envoyé: Jeudi 22 Novembre 2012 14:29:03
Objet: Re: RBD fio Performance concerns

Am 22.11.2012 14:22, schrieb Sébastien Han:

And RAMDISK devices are too expensive.

It would make sense in your infra, but yes they are really expensive.


We need something like tmpfs - running in local memory but support dio.

Stefan







Re: RBD fio Performance concerns

2012-11-22 Thread Stefan Priebe - Profihost AG

Am 22.11.2012 15:46, schrieb Mark Nelson:

I haven't played a whole lot with SSD only OSDs yet (other than noting
last summer that iop performance wasn't as high as I wanted it).  Is a
second partition on the SSD for the journal not an option for you?


Haven't tested that. But does this make sense? I mean data goes to the disk 
journal - the same disk then has to copy the data from partition A to partition B.


Why is this an advantage?

Stefan


Mark

On 11/22/2012 08:42 AM, Stefan Priebe - Profihost AG wrote:

Am 22.11.2012 15:37, schrieb Mark Nelson:

I don't think we recommend tmpfs at all for anything other than playing
around. :)


I discussed this with somebody frmo inktank. Had to search the
mailinglist. It might be OK if you're working with enough replicas and
UPS.

I see no other option while working with SSDs - the only Option would be
to be able to deaktivate the journal at all. But ceph does not support
this.

Stefan


On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote:

Hi,

can someone from inktank comment this? Might be using /dev/ram0 with an
fs on it be better than tmpfs as we can use dio?

Greets,
Stefan


- Mail original -

De: "Stefan Priebe - Profihost AG" 
À: "Sébastien Han" 
Cc: "Mark Nelson" , "Alexandre DERUMIER"
, "ceph-devel" ,
"Mark Kampe" 
Envoyé: Jeudi 22 Novembre 2012 14:29:03
Objet: Re: RBD fio Performance concerns

Am 22.11.2012 14:22, schrieb Sébastien Han:

And RAMDISK devices are too expensive.

It would make sense in your infra, but yes they are really expensive.


We need something like tmpfs - running in local memory but support
dio.

Stefan










Re: RBD fio Performance concerns

2012-11-22 Thread Stefan Priebe - Profihost AG

Am 22.11.2012 11:49, schrieb Sébastien Han:

@Alexandre: cool!

@ Stefan: Full SSD cluster and 10G switches?

Yes


Couple of weeks ago I saw
that you use journal aio, did you notice performance improvement with it?

journal is running on tmpfs to me but that changes nothing.

Stefan


Re: RBD fio Performance concerns

2012-11-22 Thread Alexandre DERUMIER
>>I discussed this with somebody frmo inktank. Had to search the
>>mailinglist. It might be OK if you're working with enough replicas and UPS.
>>
>>I see no other option while working with SSDs - the only Option would be
>>to be able to deaktivate the journal at all. But ceph does not support this.

Do you have a big difference with putting 1 journal by osd on each ssd drive ?


another alternative can be (but indeed costly):

- stec zeus ram ssd drive, around 2000$ for 8G (I have benched it around 10 
iops ;)
- ddrdrive (http://www.ddrdrive.com/) (around 20iops, don't know the price)
- fusionio card (iodrive2, 360GB, around 3000€ , but they are 160GB model, 
maybe half the price)
- maybe ocz talos, around 600€ for OCZ Talos 2 R 200 Go (don't have benched 
them, but spec say around 35000iops random)





- Mail original - 

De: "Stefan Priebe - Profihost AG"  
À: "Mark Nelson"  
Cc: "Alexandre DERUMIER" , "ceph-devel" 
, "Mark Kampe" , "Sébastien 
Han"  
Envoyé: Jeudi 22 Novembre 2012 15:42:14 
Objet: Re: RBD fio Performance concerns 

Am 22.11.2012 15:37, schrieb Mark Nelson: 
> I don't think we recommend tmpfs at all for anything other than playing 
> around. :) 

I discussed this with somebody frmo inktank. Had to search the 
mailinglist. It might be OK if you're working with enough replicas and UPS. 

I see no other option while working with SSDs - the only Option would be 
to be able to deaktivate the journal at all. But ceph does not support this. 

Stefan 

> On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote: 
>> Hi, 
>> 
>> can someone from inktank comment this? Might be using /dev/ram0 with an 
>> fs on it be better than tmpfs as we can use dio? 
>> 
>> Greets, 
>> Stefan 
>> 
>>> - Mail original - 
>>> 
>>> De: "Stefan Priebe - Profihost AG"  
>>> À: "Sébastien Han"  
>>> Cc: "Mark Nelson" , "Alexandre DERUMIER" 
>>> , "ceph-devel" , 
>>> "Mark Kampe"  
>>> Envoyé: Jeudi 22 Novembre 2012 14:29:03 
>>> Objet: Re: RBD fio Performance concerns 
>>> 
>>> Am 22.11.2012 14:22, schrieb Sébastien Han: 
 And RAMDISK devices are too expensive. 
 
 It would make sense in your infra, but yes they are really expensive. 
>>> 
>>> We need something like tmpfs - running in local memory but support dio. 
>>> 
>>> Stefan 
>>> 
> 
> 


Re: [Qemu-devel] [PATCH] overflow of int ret: use ssize_t for ret

2012-11-22 Thread Stefan Priebe - Profihost AG

Hi Andreas,

Thanks for your comment. Do I have to resend this patch?

--
Greets,
Stefan

Am 22.11.2012 17:40, schrieb Andreas Färber:

Am 22.11.2012 10:07, schrieb Stefan Priebe:

When acb->cmd is WRITE or DISCARD block/rbd stores rcb->size into acb->ret

Look here:
if (acb->cmd == RBD_AIO_WRITE ||
 acb->cmd == RBD_AIO_DISCARD) {
 if (r<  0) {
 acb->ret = r;
 acb->error = 1;
 } else if (!acb->error) {
 acb->ret = rcb->size;
 }

Right now acb->ret is just an int and we might get an overflow if size is too 
big.
For discards, rcb->size holds the size of the discard - this might be several TB if 
you
discard a whole device.

The steps to reproduce are:
run mkfs.xfs -f on a whole device bigger than int in bytes; mkfs.xfs sends a discard. 
It is important that you use scsi-hd and set discard_granularity=512. Otherwise 
rbd disables discard support.
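(A sketch of the qemu command-line fragment matching those reproduction steps; the pool/image name and device ids are assumptions:)

 -drive file=rbd:rbd/test,format=raw,if=none,id=drive0,cache=writeback \
 -device virtio-scsi-pci,id=scsi0 \
 -device scsi-hd,drive=drive0,bus=scsi0.0,discard_granularity=512

and then, inside the guest:

 mkfs.xfs -f /dev/sda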


Whatever type you decide to use, please add an identifying topic such as
"block/rbd:" in the subject (int ret is very generic!), and this patch
is missing a Signed-off-by.

Regards,
Andreas





tiering of storage pools in ceph in general

2012-11-22 Thread Jimmy Tang
Hi All,

Is it possible at this point in time to set up some form of tiering of storage 
pools in ceph by modifying the crush map? For example, I want to have my most 
recently used data on a small set of nodes that have SSDs and over time 
migrate data from the SSDs to some bulk spinning disks using an LRU policy?

Regards,
Jimmy Tang

--
Senior Software Engineer, Digital Repository of Ireland (DRI)
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | jt...@tchpc.tcd.ie



Re: OSD daemon changes port no

2012-11-22 Thread hemant surale
Sir,

Thanks for the direction. Here I was using the "mount.ceph  monaddr:ip:/
/home/hemant/mntpoint" cmd. Is it possible to achieve the same effect
with "mount.ceph" as what you suggested with "cephfs" (cephfs
/mnt/ceph/foo --pool )?

But I see that cephfs is able to set which osd to use and the object
size. So can you throw more light on this?


Thanks & Regards,
Hemant Surale.

On Wed, Nov 21, 2012 at 8:59 PM, Sage Weil  wrote:
> On Wed, 21 Nov 2012, hemant surale wrote:
>> > Oh I see.  Generally speaking, the only way to guarantee separation is to
>> > put them in different pools and distribute the pools across different sets
>> > of OSDs.
>>
>> yeah that was correct approach but i found problem doing so from
>> abstract level i.e. when I put file inside mounted dir
>> "/home/hemant/cephfs " ( mounted using "mount.ceph" cmd ) . At that
>> time anyways ceph is going to use default pool data to store files (
>> here files were striped into different objects and then sent to
>> appropriate osd ) .
>>So how to tell ceph to use different pools in this case ?
>>
>> Goal : separate read and write operations , where read will be done
>> from one group of OSD and write is done to other group of OSD.
>
> First create the other pool,
>
>  ceph osd pool create 
>
> and then adjust the CRUSH rule to distributed to a different set of OSDs
> for that pool.
>
> To allow cephfs use it,
>
>  ceph mds add_data_pool 
>
> and then:
>
>  cephfs /mnt/ceph/foo --pool 
>
> will set the policy on the directory such that new files beneath that
> point will be stored in a different pool.
>
> Hope that helps!
> sage
>
>
>>
>>
>>
>>
>> -
>> Hemant Surale.
>>
>>
>> On Wed, Nov 21, 2012 at 12:33 PM, Sage Weil  wrote:
>> > On Wed, 21 Nov 2012, hemant surale wrote:
>> >> Its a little confusing question I believe .
>> >>
>> >> Actually there are two files X & Y.  When I am reading X from its
>> >> primary .I want to make sure simultaneous writing of Y should go to
>> >> any other OSD except primary OSD for X (from where my current read is
>> >> getting served ) .
>> >
>> > Oh I see.  Generally speaking, the only way to guarantee separation is to
>> > put them in different pools and distribute the pools across different sets
>> > of OSDs.  Otherwise, it's all (pseudo)random and you never know.  Usually,
>> > they will be different, particularly as the cluster size increases, but
>> > sometimes they will be the same.
>> >
>> > sage
>> >
>> >
>> >>
>> >>
>> >> -
>> >> Hemant Surale.
>> >>
>> >> On Wed, Nov 21, 2012 at 11:50 AM, Sage Weil  wrote:
>> >> > On Wed, 21 Nov 2012, hemant surale wrote:
>> >> >> >>and one more thing how can it be possible to read from one osd 
>> >> >> >> and
>> >> >> >> then simultaneous write to direct on other osd with less/no traffic?
>> >> >> >
>> >> >> > I'm not sure I understand the question...
>> >> >>
>> >> >> Scenario :
>> >> >>I have written file X.txt on some osd which is primary for file
>> >> >> X.txt ( direct write operation using rados cmd) .
>> >> >>Now while read on file X.txt is in progress, Can I make sure
>> >> >> the simultaneous write request must be directed to other osd using
>> >> >> crushmaps/other way?
>> >> >
>> >> > Nope.  The object location is based on the name.  Reads and writes go to
>> >> > the same location so that a single OSD can serialize requests.  That 
>> >> > means,
>> >> > for example, that a read that follows a write returns the just-written
>> >> > data.
>> >> >
>> >> > sage
>> >> >
>> >> >
>> >> >> Goal of task :
>> >> >>Trying to avoid read - write clashes as much as possible to
>> >> >> achieve faster operations (I/O) . Although CRUSH selects osd for data
>> >> >> placement based on pseudo random function.  is it possible ?
>> >> >>
>> >> >>
>> >> >>
>> >> >> -
>> >> >> Hemant Surale.
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Tue, Nov 20, 2012 at 10:15 PM, Sage Weil  wrote:
>> >> >> > On Tue, 20 Nov 2012, hemant surale wrote:
>> >> >> >> Hi Community,
>> >> >> >>I have question about port number used by ceph-osd daemon . I
>> >> >> >> observed traffic (inter -osd communication while data ingest 
>> >> >> >> happened)
>> >> >> >> on port 6802 and then after some time when I ingested second file
>> >> >> >> after some delay port no 6804 was used . Is there any specific 
>> >> >> >> reason
>> >> >> >> to change port no here?
>> >> >> >
>> >> >> > The ports are dynamic.  Daemons bind to a random (6800-6900) port on
>> >> >> > startup and communicate on that.  They discover each other via the
>> >> >> > addresses published in the osdmap when the daemon starts.
>> >> >> >
>> >> >> >>and one more thing how can it be possible to read from one osd 
>> >> >> >> and
>> >> >> >> then simultaneous write to direct on other osd with less/no traffic?
>> >> >> >
>> >> >> > I'm not sure I understand the question...
>> >> >> >
>> >> >> > sage
>> >> >> --
>> >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>> >> >> in
>> >> >>

Hangup during scrubbing - possible solutions

2012-11-22 Thread Andrey Korolyov
Hi,

In the recent versions Ceph introduces some unexpected behavior for
the permanent connections (VM or kernel clients) - after crash
recovery, I/O will hang on the next planned scrub on the following
scenario:

- launch a bunch of clients doing non-intensive writes,
- lose one or more osd, mark them down, wait for recovery completion,
- do a slow scrub, e.g. scrubbing one osd per 5m, inside bash script,
or wait for ceph to do the same,
- observe a raising number of pgs stuck in the active+clean+scrubbing
state (they took a master role from ones which was on killed osd and
almost surely they are being written in time of crash),
- some time later, clients will hang hardly and ceph log introduce
stuck(old) I/O requests.

The only one way to return clients back without losing their I/O state
is per-osd restart, which also will help to get rid of
active+clean+scrubbing pgs.

First of all, I`ll be happy to help to solve this problem by providing
logs. Second question is not directly related to this problem, but I
have thought on for a long time - is there a planned features to
control scrub process more precisely, e.g. pg scrub rate or scheduled
scrub, instead of current set of timeouts which of course not very
predictable on when to run?

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD fio Performance concerns

2012-11-22 Thread Stefan Priebe - Profihost AG


Same to me:
rand 4k: 23.000 iops
seq 4k: 13.000 iops

Even in writeback mode where normally seq 4k should be merged into 
bigger requests.


Stefan

Am 21.11.2012 17:34, schrieb Mark Nelson:

Responding to my own message. :)

Talked to Sage a bit offline about this.  I think there are two opposing
forces:

On one hand, random IO may be spreading reads/writes out across more
OSDs than sequential IO that presumably would be hitting a single OSD
more regularly.

On the other hand, you'd expect that sequential writes would be getting
coalesced either at the RBD layer or on the OSD, and that the
drive/controller/filesystem underneath the OSD would be doing some kind
of readahead or prefetching.

On the third hand, maybe coalescing/prefetching is in fact happening but
we are IOP limited by some per-osd limitation.

It could be interesting to do the test with a single OSD and see what
happens.

Mark

On 11/21/2012 09:52 AM, Mark Nelson wrote:

Hi Guys,

I'm late to this thread but thought I'd chime in.  Crazy that you are
getting higher performance with random reads/writes vs sequential!  It
would be interesting to see what kind of throughput smalliobench reports
(should be packaged in bobtail) and also see if this behavior happens
with cephfs.  It's still too early in the morning for me right now to
come up with a reasonable explanation for what's going on.  It might be
worth running blktrace and seekwatcher to see what the io patterns on
the underlying disk look like in each case.  Maybe something unexpected
is going on.

Mark

On 11/19/2012 02:57 PM, Sébastien Han wrote:

Which iodepth did you use for those benchs?



I really don't understand why I can't get more rand read iops with 4K
block ...


Me neither, hope to get some clarification from the Inktank guys. It
doesn't make any sense to me...
--
Bien cordialement.
Sébastien HAN.


On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER
 wrote:

@Alexandre: is it the same for you? or do you always get more IOPS
with seq?


rand read 4K : 6000 iops
seq read 4K : 3500 iops
seq read 4M : 31iops (1gigabit client bandwith limit)

rand write 4k: 6000iops  (tmpfs journal)
seq write 4k: 1600iops
seq write 4M : 31iops (1gigabit client bandwith limit)


I really don't understand why I can't get more rand read iops with 4K
block ...

I try with high end cpu for client, it doesn't change nothing.
But test cluster use  old 8 cores E5420  @ 2.50GHZ (But cpu is around
15% on cluster during read bench)


- Mail original -

De: "Sébastien Han" 
À: "Mark Kampe" 
Cc: "Alexandre DERUMIER" , "ceph-devel"

Envoyé: Lundi 19 Novembre 2012 19:03:40
Objet: Re: RBD fio Performance concerns

@Sage, thanks for the info :)
@Mark:


If you want to do sequential I/O, you should do it buffered
(so that the writes can be aggregated) or with a 4M block size
(very efficient and avoiding object serialization).


The original benchmark has been performed with 4M block size. And as
you can see I still get more IOPS with rand than seq... I just tried
with 4M without direct I/O, still the same. I can print fio results if
it's needed.


We do direct writes for benchmarking, not because it is a reasonable
way to do I/O, but because it bypasses the buffer cache and enables
us to directly measure cluster I/O throughput (which is what we are
trying to optimize). Applications should usually do buffered I/O,
to get the (very significant) benefits of caching and write
aggregation.


I know why I use direct I/O. It's synthetic benchmarks, it's far away
from a real life scenario and how common applications works. I just
try to see the maximum I/O throughput that I can get from my RBD. All
my applications use buffered I/O.

@Alexandre: is it the same for you? or do you always get more IOPS
with seq?

Thanks to all of you..


On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe 
wrote:

Recall:
1. RBD volumes are striped (4M wide) across RADOS objects
2. distinct writes to a single RADOS object are serialized

Your sequential 4K writes are direct, depth=256, so there are
(at all times) 256 writes queued to the same object. All of
your writes are waiting through a very long line, which is adding
horrendous latency.

If you want to do sequential I/O, you should do it buffered
(so that the writes can be aggregated) or with a 4M block size
(very efficient and avoiding object serialization).

We do direct writes for benchmarking, not because it is a reasonable
way to do I/O, but because it bypasses the buffer cache and enables
us to directly measure cluster I/O throughput (which is what we are
trying to optimize). Applications should usually do buffered I/O,
to get the (very significant) benefits of caching and write
aggregation.



That's correct for some of the benchmarks. However even with 4K for
seq, I still get less IOPS. See below my last fio:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio,
iodepth=256
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio,
iodepth=256
s

[PATCH V2] mds: fix CDir::_commit_partial() bug

2012-11-22 Thread Yan, Zheng
From: "Yan, Zheng" 

When a null dentry is encountered, CDir::_commit_partial() adds
a OSD_TMAP_RM command to delete the dentry. But if the dentry is
new, the osd will not find the dentry when handling the command
and the tmap update operation will fail totally.

This patch also makes sure dentries are properly marked as new
when preparing new dentries and exporting dentries.

Signed-off-by: Yan, Zheng 
---
 src/mds/CDentry.h  |  2 ++
 src/mds/CDir.cc| 11 ---
 src/mds/CDir.h |  2 +-
 src/mds/MDCache.cc |  9 ++---
 src/mds/Server.cc  |  3 +++
 5 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/src/mds/CDentry.h b/src/mds/CDentry.h
index 480e562..5755c55 100644
--- a/src/mds/CDentry.h
+++ b/src/mds/CDentry.h
@@ -347,6 +347,8 @@ public:
 // twiddle
 state = 0;
 state_set(CDentry::STATE_AUTH);
+if (nstate & STATE_NEW)
+  mark_new();
 if (nstate & STATE_DIRTY)
   _mark_dirty(ls);
 if (!replica_map.empty())
diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
index c5220ed..411d864 100644
--- a/src/mds/CDir.cc
+++ b/src/mds/CDir.cc
@@ -1696,7 +1696,7 @@ class C_Dir_Committed : public Context {
 public:
   C_Dir_Committed(CDir *d, version_t v, version_t lrv) : dir(d), version(v), 
last_renamed_version(lrv) { }
   void finish(int r) {
-dir->_committed(version, last_renamed_version);
+dir->_committed(version, last_renamed_version, r);
   }
 };
 
@@ -1802,6 +1802,10 @@ CDir::map_t::iterator 
CDir::_commit_partial(ObjectOperation& m,
   continue;  // skip clean dentries
 
 if (dn->get_linkage()->is_null()) {
+  if (dn->is_new()) {
+   dn->mark_clean();
+   continue;
+  }
   dout(10) << " rm " << dn->name << " " << *dn << dendl;
   finalbl.append(CEPH_OSD_TMAP_RM);
   dn->key().encode(finalbl);
@@ -1997,10 +2001,11 @@ void CDir::_commit(version_t want)
  *
  * @param v version i just committed
  */
-void CDir::_committed(version_t v, version_t lrv)
+void CDir::_committed(version_t v, version_t lrv, int ret)
 {
-  dout(10) << "_committed v " << v << " (last renamed " << lrv << ") on " << 
*this << dendl;
+  dout(10) << "_committed ret " << ret << " v " << v << " (last renamed " << 
lrv << ") on " << *this << dendl;
   assert(is_auth());
+  assert(ret == 0);
 
   bool stray = inode->is_stray();
 
diff --git a/src/mds/CDir.h b/src/mds/CDir.h
index 418..274e38b 100644
--- a/src/mds/CDir.h
+++ b/src/mds/CDir.h
@@ -487,7 +487,7 @@ private:
unsigned max_write_size=-1,
map_t::iterator last_committed_dn=map_t::iterator());
   void _encode_dentry(CDentry *dn, bufferlist& bl, const set *snaps);
-  void _committed(version_t v, version_t last_renamed_version);
+  void _committed(version_t v, version_t last_renamed_version, int ret);
   void wait_for_commit(Context *c, version_t v=0);
 
   // -- dirtyness --
diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index f8b1c8f..e69a49f 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -657,12 +657,15 @@ CDentry *MDCache::get_or_create_stray_dentry(CInode *in)
   CDir *straydir = strayi->get_dirfrag(fg);
   assert(straydir);
   CDentry *straydn = straydir->lookup(straydname);
-  if (!straydn) {
+
+  if (!straydn)
 straydn = straydir->add_null_dentry(straydname);
-straydn->mark_new();
-  } else 
+  else
 assert(straydn->get_projected_linkage()->is_null());
 
+  if (!straydn->is_dirty())
+straydn->mark_new();
+
   return straydn;
 }
 
diff --git a/src/mds/Server.cc b/src/mds/Server.cc
index ec0d5d5..228fede 100644
--- a/src/mds/Server.cc
+++ b/src/mds/Server.cc
@@ -1685,6 +1685,9 @@ CDentry* Server::prepare_null_dentry(MDRequest *mdr, CDir 
*dir, const string& dn
   }
 }
 
+if (!dn->is_dirty())
+  dn->mark_new();
+
 return dn;
   }
 
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD fio Performance concerns

2012-11-22 Thread Alexandre DERUMIER
>>but it seems that Alexandre and I have the same results (more rand 
>>than seq), he has (at least) one cluster and I have 2. Thus I start to 
>>think that's not an isolated issue. 

Hi, I have bought new servers with more powerfull cpus to made a new 3 nodes 
cluster to compare.
I'll redo tests in 1 or 2 week. 
I hope performance will improve.

I'll keep you in touch !

Alexandre


- Mail original - 

De: "Sébastien Han"  
À: "Mark Nelson"  
Cc: "Alexandre DERUMIER" , "ceph-devel" 
, "Mark Kampe"  
Envoyé: Mercredi 21 Novembre 2012 22:47:08 
Objet: Re: RBD fio Performance concerns 

Hi Mark, 

Well the most concerning thing is that I have 2 Ceph clusters and both 
of them show better rand than seq... 
I don't have enough background to argue on your assomptions but I 
could try to skrink my test platform to a single OSD and how it 
performs. We keep in touch on that one. 

But it seems that Alexandre and I have the same results (more rand 
than seq), he has (at least) one cluster and I have 2. Thus I start to 
think that's not an isolated issue. 

Is it different for you? Do you usually get more seq IOPS from an RBD 
thant rand? 


On Wed, Nov 21, 2012 at 5:34 PM, Mark Nelson  wrote: 
> Responding to my own message. :) 
> 
> Talked to Sage a bit offline about this. I think there are two opposing 
> forces: 
> 
> On one hand, random IO may be spreading reads/writes out across more OSDs 
> than sequential IO that presumably would be hitting a single OSD more 
> regularly. 
> 
> On the other hand, you'd expect that sequential writes would be getting 
> coalesced either at the RBD layer or on the OSD, and that the 
> drive/controller/filesystem underneath the OSD would be doing some kind of 
> readahead or prefetching. 
> 
> On the third hand, maybe coalescing/prefetching is in fact happening but we 
> are IOP limited by some per-osd limitation. 
> 
> It could be interesting to do the test with a single OSD and see what 
> happens. 
> 
> Mark 
> 
> 
> On 11/21/2012 09:52 AM, Mark Nelson wrote: 
>> 
>> Hi Guys, 
>> 
>> I'm late to this thread but thought I'd chime in. Crazy that you are 
>> getting higher performance with random reads/writes vs sequential! It 
>> would be interesting to see what kind of throughput smalliobench reports 
>> (should be packaged in bobtail) and also see if this behavior happens 
>> with cephfs. It's still too early in the morning for me right now to 
>> come up with a reasonable explanation for what's going on. It might be 
>> worth running blktrace and seekwatcher to see what the io patterns on 
>> the underlying disk look like in each case. Maybe something unexpected 
>> is going on. 
>> 
>> Mark 
>> 
>> On 11/19/2012 02:57 PM, Sébastien Han wrote: 
>>> 
>>> Which iodepth did you use for those benchs? 
>>> 
>>> 
 I really don't understand why I can't get more rand read iops with 4K 
 block ... 
>>> 
>>> 
>>> Me neither, hope to get some clarification from the Inktank guys. It 
>>> doesn't make any sense to me... 
>>> -- 
>>> Bien cordialement. 
>>> Sébastien HAN. 
>>> 
>>> 
>>> On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER 
>>>  wrote: 
>> 
>> @Alexandre: is it the same for you? or do you always get more IOPS 
>> with seq? 
 
 
 rand read 4K : 6000 iops 
 seq read 4K : 3500 iops 
 seq read 4M : 31iops (1gigabit client bandwith limit) 
 
 rand write 4k: 6000iops (tmpfs journal) 
 seq write 4k: 1600iops 
 seq write 4M : 31iops (1gigabit client bandwith limit) 
 
 
 I really don't understand why I can't get more rand read iops with 4K 
 block ... 
 
 I try with high end cpu for client, it doesn't change nothing. 
 But test cluster use old 8 cores E5420 @ 2.50GHZ (But cpu is around 
 15% on cluster during read bench) 
 
 
 - Mail original - 
 
 De: "Sébastien Han"  
 À: "Mark Kampe"  
 Cc: "Alexandre DERUMIER" , "ceph-devel" 
  
 Envoyé: Lundi 19 Novembre 2012 19:03:40 
 Objet: Re: RBD fio Performance concerns 
 
 @Sage, thanks for the info :) 
 @Mark: 
 
> If you want to do sequential I/O, you should do it buffered 
> (so that the writes can be aggregated) or with a 4M block size 
> (very efficient and avoiding object serialization). 
 
 
 The original benchmark has been performed with 4M block size. And as 
 you can see I still get more IOPS with rand than seq... I just tried 
 with 4M without direct I/O, still the same. I can print fio results if 
 it's needed. 
 
> We do direct writes for benchmarking, not because it is a reasonable 
> way to do I/O, but because it bypasses the buffer cache and enables 
> us to directly measure cluster I/O throughput (which is what we are 
> trying to optimize). Applications should usually do buffered I/O, 
> to get the (very significant) benefits of caching and write 
> aggregation. 
 
 
 I know why I use d

Fwd: does still not recommended place rbd device on nodes, where osd daemon located?

2012-11-22 Thread ruslan usifov
Hello

Thank for your attention, and  sorry for my bad english!

In my draft architecture, i want use same hardware for osd and rbd
devices. In other words, i have 5 nodes this 5TB software raid on each
Disk space. I want build on this nodes, ceph cluster. All 5 nodes will
be run OSD and, on the same 5 node i will start 3 mons for QUORUM.
Also on the same 5 nodes i will start cluster stack (pacemaker +
corocync) with follow configuration

node ceph-precie-64-01
node ceph-precie-64-02
node ceph-precie-64-03
node ceph-precie-64-04
node ceph-precie-64-05
primitive samba_fs ocf:heartbeat:Filesystem \
params device="-U cb4d3dda-92e9-4bd8-9fbc-
2940c096e8ec" directory="/mnt" fstype="ext4"
primitive samba_rbd ocf:ceph:rbd \
params name="samba"
group samba samba_rbd samba_fs
property $id="cib-bootstrap-options" \
dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
cluster-infrastructure="openais" \
expected-quorum-votes="3" \
stonith-enabled="false" \
no-quorum-policy="stop" \
last-lrm-refresh="1352806660"


So rbd block device can be fault tolerant. In my case, use separate
machine for rbd not appropriated :-( (use multiple machines only for
rbd fault tollerant is too much cost )

2012/11/22 Dan Mick :
> Still not certain I'm understanding *just* what you mean, but I'll point out
> that you can set up a cluster with rbd images, mount them from a separate
> non-virtualized host with kernel rbd, and expand those images and take
> advantage of the newly-available space on the separate host, just as though
> you were expanding a RAID device.  Maybe that fits your use case, Ruslan?
>
>
> On 11/21/2012 12:05 PM, ruslan usifov wrote:
>>
>> Yes i mean exactly this. it's a great pity :-( Maybe present some ceph
>> equivalent that solve my problem?
>>
>> 2012/11/21 Gregory Farnum :
>>>
>>> On Wed, Nov 21, 2012 at 4:33 AM, ruslan usifov 
>>> wrote:

 So, not possible use ceph as scalable block device without
 visualization?
>>>
>>>
>>> I'm not sure I understand, but if you're trying to take a bunch of
>>> compute nodes and glue their disks together, no, that's not a
>>> supported use case at this time. There are a number of deadlock issues
>>> caused by this sort of loopback; it's the same reason you shouldn't
>>> mount NFS on the server host.
>>> We may in the future manage to release an rbd-fuse client that you can
>>> use to do this with a little less pain, but it's not ready at this
>>> point.
>>> -Greg
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] overflow of int ret: use ssize_t for ret

2012-11-22 Thread Andreas Färber
Am 22.11.2012 10:07, schrieb Stefan Priebe:
> When acb->cmd is WRITE or DISCARD block/rbd stores rcb->size into acb->ret
> 
> Look here:
>if (acb->cmd == RBD_AIO_WRITE ||
> acb->cmd == RBD_AIO_DISCARD) {
> if (r < 0) {
> acb->ret = r;
> acb->error = 1;
> } else if (!acb->error) {
> acb->ret = rcb->size;
> }
> 
> right now acb->ret is just an int and we might get an overflow if size is too 
> big.
> For discards rcb->size holds the size of the discard - this might be some TB 
> if you
> discard a whole device.
> 
> The steps to reproduce are:
> mkfs.xfs -f a whole device bigger than int in bytes. mkfs.xfs sends a 
> discard. Important is that you use scsi-hd and set discard_granularity=512. 
> Otherwise rbd disabled discard support.

Whatever type you decide to use, please add an identifying topic such as
"block/rbd:" in the subject (int ret is very generic!), and this patch
is missing a Signed-off-by.

Regards,
Andreas


-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD fio Performance concerns

2012-11-22 Thread Stefan Priebe - Profihost AG

Am 22.11.2012 16:26, schrieb Alexandre DERUMIER:

Haven't tested that. But does this makes sense? I mean data goes to Disk
journal - same disk then has to copy the Data from part A to part B.

Why is this an advantage?


Well, if you are cpu limited, I don't think you can use all 8*35000iops by node.
So, maybe a benchmark can tell us if the difference is really big.

Using tmpfs and ups can be ok, but if you have a kernel panic or hardware 
problem, you'll lost your journal.


But who cares? it's also on the 2nd node. or even on the 3rd if you have 
replicas 3.


Stefan



- Mail original -

De: "Stefan Priebe - Profihost AG" 
À: "Mark Nelson" 
Cc: "Alexandre DERUMIER" , "ceph-devel" , "Mark 
Kampe" , "Sébastien Han" 
Envoyé: Jeudi 22 Novembre 2012 16:01:56
Objet: Re: RBD fio Performance concerns

Am 22.11.2012 15:46, schrieb Mark Nelson:

I haven't played a whole lot with SSD only OSDs yet (other than noting
last summer that iop performance wasn't as high as I wanted it). Is a
second partition on the SSD for the journal not an option for you?


Haven't tested that. But does this makes sense? I mean data goes to Disk
journal - same disk then has to copy the Data from part A to part B.

Why is this an advantage?

Stefan


Mark

On 11/22/2012 08:42 AM, Stefan Priebe - Profihost AG wrote:

Am 22.11.2012 15:37, schrieb Mark Nelson:

I don't think we recommend tmpfs at all for anything other than playing
around. :)


I discussed this with somebody frmo inktank. Had to search the
mailinglist. It might be OK if you're working with enough replicas and
UPS.

I see no other option while working with SSDs - the only Option would be
to be able to deaktivate the journal at all. But ceph does not support
this.

Stefan


On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote:

Hi,

can someone from inktank comment this? Might be using /dev/ram0 with an
fs on it be better than tmpfs as we can use dio?

Greets,
Stefan


- Mail original -

De: "Stefan Priebe - Profihost AG" 
À: "Sébastien Han" 
Cc: "Mark Nelson" , "Alexandre DERUMIER"
, "ceph-devel" ,
"Mark Kampe" 
Envoyé: Jeudi 22 Novembre 2012 14:29:03
Objet: Re: RBD fio Performance concerns

Am 22.11.2012 14:22, schrieb Sébastien Han:

And RAMDISK devices are too expensive.

It would make sense in your infra, but yes they are really expensive.


We need something like tmpfs - running in local memory but support
dio.

Stefan








--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-22 Thread Stefan Priebe - Profihost AG

Am 21.11.2012 23:32, schrieb Peter Maydell:

On 21 November 2012 17:03, Stefan Weil  wrote:

Why do you use int64_t instead of off_t?
If the value is related to file sizes, off_t would be a good choice.


Looking at the librbd API (which is what the size and ret
values come from), it uses size_t and ssize_t for these.
So I think probably ssize_t is the right type for ret
(and size) in our structs here.


This sounds reasonable but does ssize_t support negative values? For 
error values.


Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mds: fix CDir::_commit_partial() bug

2012-11-22 Thread Sage Weil
On Thu, 22 Nov 2012, Yan, Zheng wrote:
> From: "Yan, Zheng" 
> 
> When a null dentry is encountered, CDir::_commit_partial() adds
> a OSD_TMAP_RM command to delete the dentry. But if the dentry is
> new, the osd will not find the dentry when handling the command
> and the tmap update operation will fail totally.
> 
> Signed-off-by: Yan, Zheng 

This could explain all manner of corruptions and problems we've seen!  
Great catch.

sage


> ---
>  src/mds/CDir.cc | 17 +
>  src/mds/CDir.h  |  2 +-
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
> index c5220ed..4896f01 100644
> --- a/src/mds/CDir.cc
> +++ b/src/mds/CDir.cc
> @@ -1696,7 +1696,7 @@ class C_Dir_Committed : public Context {
>  public:
>C_Dir_Committed(CDir *d, version_t v, version_t lrv) : dir(d), version(v), 
> last_renamed_version(lrv) { }
>void finish(int r) {
> -dir->_committed(version, last_renamed_version);
> +dir->_committed(version, last_renamed_version, r);
>}
>  };
>  
> @@ -1801,14 +1801,14 @@ CDir::map_t::iterator 
> CDir::_commit_partial(ObjectOperation& m,
>  if (!dn->is_dirty())
>continue;  // skip clean dentries
>  
> -if (dn->get_linkage()->is_null()) {
> -  dout(10) << " rm " << dn->name << " " << *dn << dendl;
> -  finalbl.append(CEPH_OSD_TMAP_RM);
> -  dn->key().encode(finalbl);
> -} else {
> +if (!dn->get_linkage()->is_null()) {
>dout(10) << " set " << dn->name << " " << *dn << dendl;
>finalbl.append(CEPH_OSD_TMAP_SET);
>_encode_dentry(dn, finalbl, snaps);
> +} else if (!dn->is_new()) {
> +  dout(10) << " rm " << dn->name << " " << *dn << dendl;
> +  finalbl.append(CEPH_OSD_TMAP_RM);
> +  dn->key().encode(finalbl);
>  }
>}
>  
> @@ -1997,10 +1997,11 @@ void CDir::_commit(version_t want)
>   *
>   * @param v version i just committed
>   */
> -void CDir::_committed(version_t v, version_t lrv)
> +void CDir::_committed(version_t v, version_t lrv, int ret)
>  {
> -  dout(10) << "_committed v " << v << " (last renamed " << lrv << ") on " << 
> *this << dendl;
> +  dout(10) << "_committed ret " << ret << " v " << v << " (last renamed " << 
> lrv << ") on " << *this << dendl;
>assert(is_auth());
> +  assert(ret == 0);
>  
>bool stray = inode->is_stray();
>  
> diff --git a/src/mds/CDir.h b/src/mds/CDir.h
> index 418..274e38b 100644
> --- a/src/mds/CDir.h
> +++ b/src/mds/CDir.h
> @@ -487,7 +487,7 @@ private:
> unsigned max_write_size=-1,
> map_t::iterator last_committed_dn=map_t::iterator());
>void _encode_dentry(CDentry *dn, bufferlist& bl, const set 
> *snaps);
> -  void _committed(version_t v, version_t last_renamed_version);
> +  void _committed(version_t v, version_t last_renamed_version, int ret);
>void wait_for_commit(Context *c, version_t v=0);
>  
>// -- dirtyness --
> -- 
> 1.7.11.7
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rbd map command hangs for 15 minutes during system start up

2012-11-22 Thread Sage Weil
On Wed, 21 Nov 2012, Nick Bartos wrote:
> FYI the build which included all 3.5 backports except patch #50 is
> still going strong after 21 builds.

Okay, that one at least makes some sense.  I've opened

http://tracker.newdream.net/issues/3519

How easy is this to reproduce?  If it is something you can trigger with 
debugging enabled ('echo module libceph +p > 
/sys/kernel/debug/dynamic_debug/control') that would help tremendously.

I'm guessing that during this startup time the OSDs are still in the 
process of starting?

Alex, I bet that a test that does a lot of map/unmap stuff in a loop while 
thrashing OSDs could hit this.

Thanks!
sage


> 
> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos  wrote:
> > With 8 successful installs already done, I'm reasonably confident that
> > it's patch #50.  I'm making another build which applies all patches
> > from the 3.5 backport branch, excluding that specific one.  I'll let
> > you know if that turns up any unexpected failures.
> >
> > What will the potential fall out be for removing that specific patch?
> >
> >
> > On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos  wrote:
> >> It's really looking like it's the
> >> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
> >> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
> >>  So far I have gone through 4 successful installs with no hang with
> >> only 1-49 applied.  I'm still leaving my test run to make sure it's
> >> not a fluke, but since previously it hangs within the first couple of
> >> builds, it really looks like this is where the problem originated.
> >>
> >> 1-libceph_eliminate_connection_state_DEAD.patch
> >> 2-libceph_kill_bad_proto_ceph_connection_op.patch
> >> 3-libceph_rename_socket_callbacks.patch
> >> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
> >> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
> >> 6-libceph_start_separating_connection_flags_from_state.patch
> >> 7-libceph_start_tracking_connection_socket_state.patch
> >> 8-libceph_provide_osd_number_when_creating_osd.patch
> >> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
> >> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
> >> 11-libceph_drop_connection_refcounting_for_mon_client.patch
> >> 12-libceph_init_monitor_connection_when_opening.patch
> >> 13-libceph_fully_initialize_connection_in_con_init.patch
> >> 14-libceph_tweak_ceph_alloc_msg.patch
> >> 15-libceph_have_messages_point_to_their_connection.patch
> >> 16-libceph_have_messages_take_a_connection_reference.patch
> >> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
> >> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
> >> 19-libceph_fix_overflow_in___decode_pool_names.patch
> >> 20-libceph_fix_overflow_in_osdmap_decode.patch
> >> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
> >> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
> >> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
> >> 24-libceph_use_con_get_put_methods.patch
> >> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
> >> 26-libceph_encapsulate_out_message_data_setup.patch
> >> 27-libceph_encapsulate_advancing_msg_page.patch
> >> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
> >> 29-libceph_move_init_bio__functions_up.patch
> >> 30-libceph_move_init_of_bio_iter.patch
> >> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
> >> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
> >> 33-libceph_don_t_change_socket_state_on_sock_event.patch
> >> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
> >> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
> >> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
> >> 37-libceph_clear_NEGOTIATING_when_done.patch
> >> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
> >> 39-libceph_separate_banner_and_connect_writes.patch
> >> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
> >> 41-libceph_small_changes_to_messenger.c.patch
> >> 42-libceph_add_some_fine_ASCII_art.patch
> >> 43-libceph_set_peer_name_on_con_open_not_init.patch
> >> 44-libceph_initialize_mon_client_con_only_once.patch
> >> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
> >> 46-libceph_initialize_msgpool_message_types.patch
> >> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
> >> 48-libceph_report_socket_read_write_error_message.patch
> >> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
> >> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
> >>
> >>
> >> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil  wrote:
> >>> Thanks for hunting this down.  I'm very curious what the culprit is...
> >>>
> >>> sage
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] overflow of int ret: use ssize_t for ret

2012-11-22 Thread Stefan Weil

Am 22.11.2012 20:09, schrieb Stefan Priebe - Profihost AG:

Hi Andreas,

thanks for your comment. Do i have to resend this patch?

--
Greets,
Stefan




Hi Stefan,

I'm afraid yes, you'll have to resend the patch.

Signed-off-by is a must, see http://wiki.qemu.org/Contribute/SubmitAPatch

When you resend the patch, you can fix the minor issues (subject)as well.

Regards

StefanW.


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-22 Thread Peter Maydell
On 22 November 2012 08:23, Stefan Priebe - Profihost AG
 wrote:
> Am 21.11.2012 23:32, schrieb Peter Maydell:
>> Looking at the librbd API (which is what the size and ret
>> values come from), it uses size_t and ssize_t for these.
>> So I think probably ssize_t is the right type for ret
>> (and size) in our structs here.
>
>
> This sounds reasonable but does ssize_t support negative values? For error
> values.

Yes, the first 's' in ssize_t means 'signed' and is the
difference between it and size_t.

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-22 Thread Peter Maydell
On 21 November 2012 17:03, Stefan Weil  wrote:
> Why do you use int64_t instead of off_t?
> If the value is related to file sizes, off_t would be a good choice.

Looking at the librbd API (which is what the size and ret
values come from), it uses size_t and ssize_t for these.
So I think probably ssize_t is the right type for ret
(and size) in our structs here.

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to create snapshots

2012-11-22 Thread Stefan Priebe - Profihost AG

Hi,

Am 21.11.2012 15:29, schrieb Wido den Hollander:

Use:

$ rbd -p kvmpool1 snap create --image vm-113-disk-1 BACKUP

"rbd -h" also tells:

,  are [pool/]name[@snap], or you may specify
individual pieces of names with -p/--pool, --image, and/or --snap.

Never tried it, but you might be able to use:

$ rbd -p kvmpool1 snap create vm-113-disk-1@BACKUP


This does not work but:
rbd snap create kvmpool1/vm-113-disk-1@BACKUP

works fine. Thanks!

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD fio Performance concerns

2012-11-22 Thread Stefan Priebe - Profihost AG

Hi,

can someone from inktank comment this? Might be using /dev/ram0 with an 
fs on it be better than tmpfs as we can use dio?


Greets,
Stefan


- Mail original -

De: "Stefan Priebe - Profihost AG" 
À: "Sébastien Han" 
Cc: "Mark Nelson" , "Alexandre DERUMIER" , "ceph-devel" 
, "Mark Kampe" 
Envoyé: Jeudi 22 Novembre 2012 14:29:03
Objet: Re: RBD fio Performance concerns

Am 22.11.2012 14:22, schrieb Sébastien Han:

And RAMDISK devices are too expensive.

It would make sense in your infra, but yes they are really expensive.


We need something like tmpfs - running in local memory but support dio.

Stefan


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD Backup

2012-11-22 Thread Stefan Priebe - Profihost AG

Hi,

Am 21.11.2012 14:47, schrieb Wido den Hollander:

The snapshot isn't consistent since it has no way of telling the VM to
flush it's buffers.

To make it consistent you have to run "sync" (In the VM) just prior to
creating the snapshot.


Mhm but between executing sync and executing snap is again time to store 
data.



rbd export --snap BACKUP image1 /mnt/backup/image1.img
losetup /mnt/backup/image1.img

kpartx -a /dev/loop0

Now you will have the partitions from the RBD image available in
/dev/mapper/loop0pX

Works fine!

Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD fio Performance concerns

2012-11-22 Thread Mark Kampe

Sequential is faster than random on a disk, but we are not
doing I/O to a disk, but a distributed storage cluster:

  small random operations are striped over multiple objects and
  servers, and so can proceed in parallel and take advantage of
  more nodes and disks.  This parallelism can overcome the added
  latencies of network I/O to yield very good throughput.

  small sequential read and write operations are serialized on
  a single server, NIC, and drive.  This serialization eliminates
  parallelism, and the network and other queuing delays are no
  longer compensated for.

This striping is a good idea for the small random I/O that is
typical of the way Linux systems talk to their disks.  But for
other I/O patterns, it is not optimal.

On 11/21/2012 01:47 PM, Sébastien Han wrote:

Hi Mark,

Well the most concerning thing is that I have 2 Ceph clusters and both
of them show better rand than seq...
I don't have enough background to argue on your assomptions but I
could try to skrink my test platform to a single OSD and how it
performs. We keep in touch on that one.

But it seems that Alexandre and I have the same results (more rand
than seq), he has (at least) one cluster and I have 2. Thus I start to
think that's not an isolated issue.

Is it different for you? Do you usually get more seq IOPS from an RBD
thant rand?


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


incremental rbd export / sparse files?

2012-11-22 Thread Stefan Priebe - Profihost AG

Hello list,

right now a rbd export exports exactly the size of the disk even if 
there is KNOWN free space. Is this inteded to change?


Might it be possible to export just differences between snapshots and 
merge them later?


Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rbd map command hangs for 15 minutes during system start up

2012-11-22 Thread Nick Bartos
FYI the build which included all 3.5 backports except patch #50 is
still going strong after 21 builds.

On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos  wrote:
> With 8 successful installs already done, I'm reasonably confident that
> it's patch #50.  I'm making another build which applies all patches
> from the 3.5 backport branch, excluding that specific one.  I'll let
> you know if that turns up any unexpected failures.
>
> What will the potential fall out be for removing that specific patch?
>
>
> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos  wrote:
>> It's really looking like it's the
>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>>  So far I have gone through 4 successful installs with no hang with
>> only 1-49 applied.  I'm still leaving my test run to make sure it's
>> not a fluke, but since previously it hangs within the first couple of
>> builds, it really looks like this is where the problem originated.
>>
>> 1-libceph_eliminate_connection_state_DEAD.patch
>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>> 3-libceph_rename_socket_callbacks.patch
>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>> 6-libceph_start_separating_connection_flags_from_state.patch
>> 7-libceph_start_tracking_connection_socket_state.patch
>> 8-libceph_provide_osd_number_when_creating_osd.patch
>> 9-libceph_set_CLOSED_state_bit_in_con_init.patch
>> 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
>> 11-libceph_drop_connection_refcounting_for_mon_client.patch
>> 12-libceph_init_monitor_connection_when_opening.patch
>> 13-libceph_fully_initialize_connection_in_con_init.patch
>> 14-libceph_tweak_ceph_alloc_msg.patch
>> 15-libceph_have_messages_point_to_their_connection.patch
>> 16-libceph_have_messages_take_a_connection_reference.patch
>> 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
>> 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
>> 19-libceph_fix_overflow_in___decode_pool_names.patch
>> 20-libceph_fix_overflow_in_osdmap_decode.patch
>> 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
>> 22-libceph_transition_socket_state_prior_to_actual_connect.patch
>> 23-libceph_fix_NULL_dereference_in_reset_connection.patch
>> 24-libceph_use_con_get_put_methods.patch
>> 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
>> 26-libceph_encapsulate_out_message_data_setup.patch
>> 27-libceph_encapsulate_advancing_msg_page.patch
>> 28-libceph_don_t_mark_footer_complete_before_it_is.patch
>> 29-libceph_move_init_bio__functions_up.patch
>> 30-libceph_move_init_of_bio_iter.patch
>> 31-libceph_don_t_use_bio_iter_as_a_flag.patch
>> 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
>> 33-libceph_don_t_change_socket_state_on_sock_event.patch
>> 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
>> 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
>> 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
>> 37-libceph_clear_NEGOTIATING_when_done.patch
>> 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
>> 39-libceph_separate_banner_and_connect_writes.patch
>> 40-libceph_distinguish_two_phases_of_connect_sequence.patch
>> 41-libceph_small_changes_to_messenger.c.patch
>> 42-libceph_add_some_fine_ASCII_art.patch
>> 43-libceph_set_peer_name_on_con_open_not_init.patch
>> 44-libceph_initialize_mon_client_con_only_once.patch
>> 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
>> 46-libceph_initialize_msgpool_message_types.patch
>> 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
>> 48-libceph_report_socket_read_write_error_message.patch
>> 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
>> 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
>>
>>
>> On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil  wrote:
>>> Thanks for hunting this down.  I'm very curious what the culprit is...
>>>
>>> sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] overflow of int ret: use ssize_t for ret

2012-11-22 Thread Stefan Priebe
When acb->cmd is WRITE or DISCARD block/rbd stores rcb->size into acb->ret

Look here:
   if (acb->cmd == RBD_AIO_WRITE ||
acb->cmd == RBD_AIO_DISCARD) {
if (r < 0) {
acb->ret = r;
acb->error = 1;
} else if (!acb->error) {
acb->ret = rcb->size;
}

right now acb->ret is just an int and we might get an overflow if size is too 
big.
For discards rcb->size holds the size of the discard - this might be some TB if 
you
discard a whole device.

The steps to reproduce are:
mkfs.xfs -f a whole device bigger than int in bytes. mkfs.xfs sends a discard. 
Important is that you use scsi-hd and set discard_granularity=512. Otherwise 
rbd disabled discard support.
---
 block/rbd.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 5a0f79f..0384c6c 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -69,7 +69,7 @@ typedef enum {
 typedef struct RBDAIOCB {
 BlockDriverAIOCB common;
 QEMUBH *bh;
-int ret;
+ssize_t ret;
 QEMUIOVector *qiov;
 char *bounce;
 RBDAIOCmd cmd;
@@ -86,7 +86,7 @@ typedef struct RADOSCB {
 int done;
 int64_t size;
 char *buf;
-int ret;
+ssize_t ret;
 } RADOSCB;
 
 #define RBD_FD_READ 0
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-22 Thread Stefan Priebe - Profihost AG

Hello,

i send a new patch using ssize_t. (Subject [PATCH] overflow of int ret: 
use ssize_t for ret)


Stefan

Am 22.11.2012 09:40, schrieb Peter Maydell:

On 22 November 2012 08:23, Stefan Priebe - Profihost AG
 wrote:

Am 21.11.2012 23:32, schrieb Peter Maydell:

Looking at the librbd API (which is what the size and ret
values come from), it uses size_t and ssize_t for these.
So I think probably ssize_t is the right type for ret
(and size) in our structs here.



This sounds reasonable but does ssize_t support negative values? For error
values.


Yes, the first 's' in ssize_t means 'signed' and is the
difference between it and size_t.

-- PMM


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-22 Thread Stefan Weil
   Am 21.11.2012 21:53, schrieb Stefan   Priebe -
Profihost AG:
 Not sure about off_t. What is min and max size?  Stefan

 off_t is a signed value which is used in function lseek to
 address any byte of a seekable file.
 The range is typically 64 bit (like int64_t), but may be smaller if
the host only supports 2 GB files.
 Stefan


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 'zombie snapshot' problem

2012-11-22 Thread Josh Durgin

On 11/21/2012 04:50 AM, Andrey Korolyov wrote:

Hi,

Somehow I have managed to produce unkillable snapshot, which does not
allow to remove itself or parent image:

$ rbd snap purge dev-rack0/vm2
Removing all snapshots: 100% complete...done.


I see one bug with 'snap purge' ignoring the return code when removing
snaps. I just fixed this in the next branch. It's probably getting the
same error as 'rbd snap rm' below.

Could you post the output of:

rbd snap purge dev-rack0/vm2 --debug-ms 1 --debug-rbd 20


$ rbd rm dev-rack0/vm2
2012-11-21 16:31:24.184626 7f7e0d172780 -1 librbd: image has snapshots
- not removing
Removing image: 0% complete...failed.
rbd: image has snapshots - these must be deleted with 'rbd snap purge'
before the image can be removed.
$ rbd snap ls dev-rack0/vm2
SNAPID NAME   SIZE
188 vm2.snap-yxf 16384 MB
$ rbd info dev-rack0/vm2
rbd image 'vm2':
 size 16384 MB in 4096 objects
 order 22 (4096 KB objects)
 block_name_prefix: rbd_data.1fa164c960874
 format: 2
 features: layering
$ rbd snap rm --snap vm2.snap-yxf dev-rack0/vm2
rbd: failed to remove snapshot: (2) No such file or directory
$ rbd snap create --snap vm2.snap-yxf dev-rack0/vm2
rbd: failed to create snapshot: (17) File exists
$ rbd snap rollback --snap vm2.snap-yxf dev-rack0/vm2
Rolling back to snapshot: 100% complete...done.
$ rbd snap protect --snap vm2.snap-yxf dev-rack0/vm2
$ rbd snap unprotect --snap vm2.snap-yxf dev-rack0/vm2


Meanwhile, ``rbd ls -l dev-rack0''  segfaulting with an attached log.
Is there any reliable way to kill problematic snap?


From this log it looks like vm2 used to be a clone, and the snapshot
vm2.snap-yxf was taken before it was flattened. Later, the parent of
vm2.snap-yxf was deleted. Is this correct?

It was a bug in 0.53 that protected snapshots could be deleted.

Josh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[TRIVIAL PATCH] bdi_register: Add __printf verification, fix arg mismatch

2012-11-22 Thread Joe Perches
__printf is useful to verify format and arguments.

Signed-off-by: Joe Perches 
---
 fs/ceph/super.c |2 +-
 include/linux/backing-dev.h |1 +
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 2eb43f2..e7dbb5c 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -849,7 +849,7 @@ static int ceph_register_bdi(struct super_block *sb,
fsc->backing_dev_info.ra_pages =
default_backing_dev_info.ra_pages;
 
-   err = bdi_register(&fsc->backing_dev_info, NULL, "ceph-%d",
+   err = bdi_register(&fsc->backing_dev_info, NULL, "ceph-%ld",
   atomic_long_inc_return(&bdi_seq));
if (!err)
sb->s_bdi = &fsc->backing_dev_info;
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 2a9a9ab..12731a1 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -114,6 +114,7 @@ struct backing_dev_info {
 int bdi_init(struct backing_dev_info *bdi);
 void bdi_destroy(struct backing_dev_info *bdi);
 
+__printf(3, 4)
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, ...);
 int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Hangup during scrubbing - possible solutions

2012-11-22 Thread Sage Weil
On Thu, 22 Nov 2012, Andrey Korolyov wrote:
> Hi,
> 
> In the recent versions Ceph introduces some unexpected behavior for
> the permanent connections (VM or kernel clients) - after crash
> recovery, I/O will hang on the next planned scrub on the following
> scenario:
> 
> - launch a bunch of clients doing non-intensive writes,
> - lose one or more osd, mark them down, wait for recovery completion,
> - do a slow scrub, e.g. scrubbing one osd per 5m, inside bash script,
> or wait for ceph to do the same,
> - observe a raising number of pgs stuck in the active+clean+scrubbing
> state (they took a master role from ones which was on killed osd and
> almost surely they are being written in time of crash),
> - some time later, clients will hang hardly and ceph log introduce
> stuck(old) I/O requests.
> 
> The only one way to return clients back without losing their I/O state
> is per-osd restart, which also will help to get rid of
> active+clean+scrubbing pgs.
> 
> First of all, I`ll be happy to help to solve this problem by providing
> logs.

If you can reproduce this behavior with 'debug osd = 20' and 'debug ms = 
1' logging on the OSD, that would be wonderful!

> Second question is not directly related to this problem, but I
> have thought on for a long time - is there a planned features to
> control scrub process more precisely, e.g. pg scrub rate or scheduled
> scrub, instead of current set of timeouts which of course not very
> predictable on when to run?

Not yet.  I would be interested in hearing what kind of control/config 
options/whatever you (and others) would like to see!

Thanks-
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] mds: fix CDir::_commit_partial() bug

2012-11-22 Thread Yan, Zheng
From: "Yan, Zheng" 

When a null dentry is encountered, CDir::_commit_partial() adds
a OSD_TMAP_RM command to delete the dentry. But if the dentry is
new, the osd will not find the dentry when handling the command
and the tmap update operation will fail totally.

Signed-off-by: Yan, Zheng 
---
 src/mds/CDir.cc | 17 +
 src/mds/CDir.h  |  2 +-
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/src/mds/CDir.cc b/src/mds/CDir.cc
index c5220ed..4896f01 100644
--- a/src/mds/CDir.cc
+++ b/src/mds/CDir.cc
@@ -1696,7 +1696,7 @@ class C_Dir_Committed : public Context {
 public:
   C_Dir_Committed(CDir *d, version_t v, version_t lrv) : dir(d), version(v), 
last_renamed_version(lrv) { }
   void finish(int r) {
-dir->_committed(version, last_renamed_version);
+dir->_committed(version, last_renamed_version, r);
   }
 };
 
@@ -1801,14 +1801,14 @@ CDir::map_t::iterator 
CDir::_commit_partial(ObjectOperation& m,
 if (!dn->is_dirty())
   continue;  // skip clean dentries
 
-if (dn->get_linkage()->is_null()) {
-  dout(10) << " rm " << dn->name << " " << *dn << dendl;
-  finalbl.append(CEPH_OSD_TMAP_RM);
-  dn->key().encode(finalbl);
-} else {
+if (!dn->get_linkage()->is_null()) {
   dout(10) << " set " << dn->name << " " << *dn << dendl;
   finalbl.append(CEPH_OSD_TMAP_SET);
   _encode_dentry(dn, finalbl, snaps);
+} else if (!dn->is_new()) {
+  dout(10) << " rm " << dn->name << " " << *dn << dendl;
+  finalbl.append(CEPH_OSD_TMAP_RM);
+  dn->key().encode(finalbl);
 }
   }
 
@@ -1997,10 +1997,11 @@ void CDir::_commit(version_t want)
  *
  * @param v version i just committed
  */
-void CDir::_committed(version_t v, version_t lrv)
+void CDir::_committed(version_t v, version_t lrv, int ret)
 {
-  dout(10) << "_committed v " << v << " (last renamed " << lrv << ") on " << 
*this << dendl;
+  dout(10) << "_committed ret " << ret << " v " << v << " (last renamed " << 
lrv << ") on " << *this << dendl;
   assert(is_auth());
+  assert(ret == 0);
 
   bool stray = inode->is_stray();
 
diff --git a/src/mds/CDir.h b/src/mds/CDir.h
index 418..274e38b 100644
--- a/src/mds/CDir.h
+++ b/src/mds/CDir.h
@@ -487,7 +487,7 @@ private:
unsigned max_write_size=-1,
map_t::iterator last_committed_dn=map_t::iterator());
   void _encode_dentry(CDentry *dn, bufferlist& bl, const set *snaps);
-  void _committed(version_t v, version_t last_renamed_version);
+  void _committed(version_t v, version_t last_renamed_version, int ret);
   void wait_for_commit(Context *c, version_t v=0);
 
   // -- dirtyness --
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] overflow of int ret: use ssize_t for ret

2012-11-22 Thread Stefan Priebe - Profihost AG


Signed-off-by: Stefan Priebe 

Am 22.11.2012 10:07, schrieb Stefan Priebe:

When acb->cmd is WRITE or DISCARD block/rbd stores rcb->size into acb->ret

Look here:
if (acb->cmd == RBD_AIO_WRITE ||
 acb->cmd == RBD_AIO_DISCARD) {
 if (r<  0) {
 acb->ret = r;
 acb->error = 1;
 } else if (!acb->error) {
 acb->ret = rcb->size;
 }

right now acb->ret is just an int and we might get an overflow if size is too 
big.
For discards rcb->size holds the size of the discard - this might be some TB if 
you
discard a whole device.

The steps to reproduce are:
mkfs.xfs -f a whole device bigger than int in bytes. mkfs.xfs sends a discard. 
Important is that you use scsi-hd and set discard_granularity=512. Otherwise 
rbd disabled discard support.
---
  block/rbd.c |4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 5a0f79f..0384c6c 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -69,7 +69,7 @@ typedef enum {
  typedef struct RBDAIOCB {
  BlockDriverAIOCB common;
  QEMUBH *bh;
-int ret;
+ssize_t ret;
  QEMUIOVector *qiov;
  char *bounce;
  RBDAIOCmd cmd;
@@ -86,7 +86,7 @@ typedef struct RADOSCB {
  int done;
  int64_t size;
  char *buf;
-int ret;
+ssize_t ret;
  } RADOSCB;

  #define RBD_FD_READ 0

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD fio Performance concerns

2012-11-22 Thread Stefan Priebe - Profihost AG
In my test it was just recovering some replicas not the whole osd.

Am 22.11.2012 um 16:35 schrieb Alexandre DERUMIER :

>>> But who cares? it's also on the 2nd node. or even on the 3rd if you have 
>>> replicas 3.
> Yes, but rebuilding a dead node use cpu and ios. (but it should be benched 
> too, to see the impact on the production)
> 
> 
> 
> - Mail original - 
> 
> De: "Stefan Priebe - Profihost AG"  
> À: "Alexandre DERUMIER"  
> Cc: "ceph-devel" , "Mark Kampe" 
> , "Sébastien Han" , "Mark 
> Nelson"  
> Envoyé: Jeudi 22 Novembre 2012 16:28:57 
> Objet: Re: RBD fio Performance concerns 
> 
> Am 22.11.2012 16:26, schrieb Alexandre DERUMIER: 
 Haven't tested that. But does this makes sense? I mean data goes to Disk 
 journal - same disk then has to copy the Data from part A to part B. 
 
 Why is this an advantage?
>> 
>> Well, if you are cpu limited, I don't think you can use all 8*35000iops by 
>> node. 
>> So, maybe a benchmark can tell us if the difference is really big. 
>> 
>> Using tmpfs and ups can be ok, but if you have a kernel panic or hardware 
>> problem, you'll lost your journal.
> 
> But who cares? it's also on the 2nd node. or even on the 3rd if you have 
> replicas 3. 
> 
> Stefan 
> 
> 
>> - Mail original - 
>> 
>> De: "Stefan Priebe - Profihost AG"  
>> À: "Mark Nelson"  
>> Cc: "Alexandre DERUMIER" , "ceph-devel" 
>> , "Mark Kampe" , 
>> "Sébastien Han"  
>> Envoyé: Jeudi 22 Novembre 2012 16:01:56 
>> Objet: Re: RBD fio Performance concerns 
>> 
>> Am 22.11.2012 15:46, schrieb Mark Nelson: 
>>> I haven't played a whole lot with SSD only OSDs yet (other than noting 
>>> last summer that iop performance wasn't as high as I wanted it). Is a 
>>> second partition on the SSD for the journal not an option for you?
>> 
>> Haven't tested that. But does this makes sense? I mean data goes to Disk 
>> journal - same disk then has to copy the Data from part A to part B. 
>> 
>> Why is this an advantage? 
>> 
>> Stefan 
>> 
>>> Mark 
>>> 
>>> On 11/22/2012 08:42 AM, Stefan Priebe - Profihost AG wrote: 
 Am 22.11.2012 15:37, schrieb Mark Nelson: 
> I don't think we recommend tmpfs at all for anything other than playing 
> around. :)
 
 I discussed this with somebody frmo inktank. Had to search the 
 mailinglist. It might be OK if you're working with enough replicas and 
 UPS. 
 
 I see no other option while working with SSDs - the only Option would be 
 to be able to deaktivate the journal at all. But ceph does not support 
 this. 
 
 Stefan 
 
> On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote: 
>> Hi, 
>> 
>> can someone from inktank comment this? Might be using /dev/ram0 with an 
>> fs on it be better than tmpfs as we can use dio? 
>> 
>> Greets, 
>> Stefan 
>> 
>>> - Mail original - 
>>> 
>>> De: "Stefan Priebe - Profihost AG"  
>>> À: "Sébastien Han"  
>>> Cc: "Mark Nelson" , "Alexandre DERUMIER" 
>>> , "ceph-devel" , 
>>> "Mark Kampe"  
>>> Envoyé: Jeudi 22 Novembre 2012 14:29:03 
>>> Objet: Re: RBD fio Performance concerns 
>>> 
>>> Am 22.11.2012 14:22, schrieb Sébastien Han: 
 And RAMDISK devices are too expensive. 
 
 It would make sense in your infra, but yes they are really expensive.
>>> 
>>> We need something like tmpfs - running in local memory but support 
>>> dio. 
>>> 
>>> Stefan
>>> 
>>> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD fio Performance concerns

2012-11-22 Thread Sébastien Han
Hi Mark,

Well, the most concerning thing is that I have 2 Ceph clusters and both
of them show better random than sequential performance...
I don't have enough background to argue with your assumptions, but I
could try to shrink my test platform to a single OSD and see how it
performs. We'll keep in touch on that one.

But it seems that Alexandre and I have the same results (more random
than sequential); he has (at least) one cluster and I have 2. So I'm
starting to think this is not an isolated issue.

Is it different for you? Do you usually get more sequential IOPS from an
RBD than random?
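
For reference, the kind of fio job behind the numbers in this thread looks
roughly like the following; this is only a sketch, and the RBD device path,
runtime and queue depth are assumptions rather than the exact settings used:

  [global]
  ioengine=libaio
  direct=1
  bs=4k
  iodepth=256
  runtime=60
  filename=/dev/rbd0

  [seq-read]
  rw=read
  stonewall

  [rand-read]
  rw=randread
  stonewall

Swapping bs=4k for bs=4m, or rw=read/randread for rw=write/randwrite, gives the
other cases discussed here.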


On Wed, Nov 21, 2012 at 5:34 PM, Mark Nelson  wrote:
> Responding to my own message. :)
>
> Talked to Sage a bit offline about this.  I think there are two opposing
> forces:
>
> On one hand, random IO may be spreading reads/writes out across more OSDs
> than sequential IO that presumably would be hitting a single OSD more
> regularly.
>
> On the other hand, you'd expect that sequential writes would be getting
> coalesced either at the RBD layer or on the OSD, and that the
> drive/controller/filesystem underneath the OSD would be doing some kind of
> readahead or prefetching.
>
> On the third hand, maybe coalescing/prefetching is in fact happening but we
> are IOP limited by some per-osd limitation.
>
> It could be interesting to do the test with a single OSD and see what
> happens.
>
> Mark
>
>
> On 11/21/2012 09:52 AM, Mark Nelson wrote:
>>
>> Hi Guys,
>>
>> I'm late to this thread but thought I'd chime in.  Crazy that you are
>> getting higher performance with random reads/writes vs sequential!  It
>> would be interesting to see what kind of throughput smalliobench reports
>> (should be packaged in bobtail) and also see if this behavior happens
>> with cephfs.  It's still too early in the morning for me right now to
>> come up with a reasonable explanation for what's going on.  It might be
>> worth running blktrace and seekwatcher to see what the io patterns on
>> the underlying disk look like in each case.  Maybe something unexpected
>> is going on.
>>
>> Mark
>>
>> On 11/19/2012 02:57 PM, Sébastien Han wrote:
>>>
>>> Which iodepth did you use for those benchs?
>>>
>>>
 I really don't understand why I can't get more rand read iops with 4K
 block ...
>>>
>>>
>>> Me neither, hope to get some clarification from the Inktank guys. It
>>> doesn't make any sense to me...
>>> --
>>> Bien cordialement.
>>> Sébastien HAN.
>>>
>>>
>>> On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER
>>>  wrote:
>>
>> @Alexandre: is it the same for you? or do you always get more IOPS
>> with seq?


 rand read 4K : 6000 iops
 seq read 4K : 3500 iops
 seq read 4M : 31iops (1gigabit client bandwith limit)

 rand write 4k: 6000iops  (tmpfs journal)
 seq write 4k: 1600iops
 seq write 4M : 31iops (1gigabit client bandwith limit)


 I really don't understand why I can't get more rand read iops with 4K
 block ...

 I try with high end cpu for client, it doesn't change nothing.
 But test cluster use  old 8 cores E5420  @ 2.50GHZ (But cpu is around
 15% on cluster during read bench)


 - Mail original -

 De: "Sébastien Han" 
 À: "Mark Kampe" 
 Cc: "Alexandre DERUMIER" , "ceph-devel"
 
 Envoyé: Lundi 19 Novembre 2012 19:03:40
 Objet: Re: RBD fio Performance concerns

 @Sage, thanks for the info :)
 @Mark:

> If you want to do sequential I/O, you should do it buffered
> (so that the writes can be aggregated) or with a 4M block size
> (very efficient and avoiding object serialization).


 The original benchmark has been performed with 4M block size. And as
 you can see I still get more IOPS with rand than seq... I just tried
 with 4M without direct I/O, still the same. I can print fio results if
 it's needed.

> We do direct writes for benchmarking, not because it is a reasonable
> way to do I/O, but because it bypasses the buffer cache and enables
> us to directly measure cluster I/O throughput (which is what we are
> trying to optimize). Applications should usually do buffered I/O,
> to get the (very significant) benefits of caching and write
> aggregation.


 I know why I use direct I/O. It's synthetic benchmarks, it's far away
 from a real life scenario and how common applications works. I just
 try to see the maximum I/O throughput that I can get from my RBD. All
 my applications use buffered I/O.

 @Alexandre: is it the same for you? or do you always get more IOPS
 with seq?

 Thanks to all of you..


 On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe 
 wrote:
>
> Recall:
> 1. RBD volumes are striped (4M wide) across RADOS objects
> 2. distinct writes to a single RADOS object are serialized
>
> Your sequential 4K writes are direct, depth=256, so there are
> (at all times) 256 writes queued to the same ob

Problem with SGID and new inode

2012-11-22 Thread Giorgos Kappes
Hi,

I was looking at the source code of the ceph MDS and in particular at
the function
CInode* Server::prepare_new_inode(...) in the mds/Server.cc file which
creates a new inode.
At lines 1739-1747 the code checks if the parent directory has the
set-group-ID bit set. If
this bit is set and the new inode refers to a directory then the new
inode should also have
the set-group-ID bit set. However, as I understand it, at line 1744 the
set-group-ID bit is set
on the local variable [mode |= S_ISGID] and not on the inode.
Shouldn't this line be
[in->inode.mode |= S_ISGID;]?

To illustrate the above problem I tried to create a new directory
inside a directory that has
the set-group-ID bit set:

root@client-admin:/mnt/admin# mkdir mydir
root@client-admin:/mnt/admin# chmod +s mydir
root@client-admin:/mnt/admin# ls -l
total 1
drwsr-sr-x 1 root root  0 Nov 22  2012 mydir
-rw-r--r-- 1 root root 13 Oct 19 08:05 myfile.txt
root@client-admin:/mnt/admin# cd mydir
root@client-admin:/mnt/admin/mydir# mkdir newdir
root@client-admin:/mnt/admin/mydir# ls -l
total 1
drwxr-xr-x 1 root root 0 Nov 22  2012 newdir

Finally, I would like to note that I am using Ceph 0.48.2 but the
above problem also seems
to exist in the v0.54 development release.

Best regards,
Giorgos Kappes

---
Giorgos Kappes
Website: http://www.cs.uoi.gr/~gkappes
email: geok...@gmail.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Very bad behavior when

2012-11-22 Thread Sylvain Munaut
Hi,

I know that ceph has time-synced servers as a requirement, but I
think a sane failure mode, like a message in the logs instead of
uncontrollably growing memory usage, would be a good idea.

I had the NTP process die on me tonight on an OSD (for an unknown
reason so far ...); the clock went 3000s out of sync, and the OSD memory
just kept growing, as did the master mon's memory
(which has the nice effect of the master mon being OOM-killed,
then one of the backups takes the master role, grows as well and
gets killed, and so on and so forth until there is no quorum anymore).

Cheers,

 Sylvain
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD fio Performance concerns

2012-11-22 Thread Mark Nelson
I haven't played a whole lot with SSD only OSDs yet (other than noting 
last summer that iop performance wasn't as high as I wanted it).  Is a 
second partition on the SSD for the journal not an option for you?


Mark

On 11/22/2012 08:42 AM, Stefan Priebe - Profihost AG wrote:

Am 22.11.2012 15:37, schrieb Mark Nelson:

I don't think we recommend tmpfs at all for anything other than playing
around. :)


I discussed this with somebody from Inktank. Had to search the
mailing list. It might be OK if you're working with enough replicas and a UPS.

I see no other option while working with SSDs - the only option would be
to be able to deactivate the journal entirely. But ceph does not support
this.

Stefan


On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote:

Hi,

can someone from Inktank comment on this? Might using /dev/ram0 with an
fs on it be better than tmpfs, since we could then use dio?

Greets,
Stefan


- Mail original -

De: "Stefan Priebe - Profihost AG" 
À: "Sébastien Han" 
Cc: "Mark Nelson" , "Alexandre DERUMIER"
, "ceph-devel" ,
"Mark Kampe" 
Envoyé: Jeudi 22 Novembre 2012 14:29:03
Objet: Re: RBD fio Performance concerns

Am 22.11.2012 14:22, schrieb Sébastien Han:

And RAMDISK devices are too expensive.

It would make sense in your infra, but yes they are really expensive.


We need something like tmpfs - running in local memory but support dio.

Stefan







--
Mark Nelson
Performance Engineer
Inktank
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD Backup

2012-11-22 Thread Josh Durgin

On 11/22/2012 05:13 AM, Wido den Hollander wrote:



On 11/22/2012 06:57 PM, Stefan Priebe - Profihost AG wrote:

Hi,

Am 21.11.2012 14:47, schrieb Wido den Hollander:

The snapshot isn't consistent since it has no way of telling the VM to
flush its buffers.

To make it consistent you have to run "sync" (In the VM) just prior to
creating the snapshot.


Mhm, but between executing sync and creating the snapshot there is again
time for new data to be written.



True. That is always a problem with snapshots. I always regard data
written to disk in the last 30 seconds as being in the "danger zone".

When you use libvirt and QCOW2 as a backing store for your virtual
machine you can also snapshot with libvirt. It will not only snapshot
the disk, but it will also store the memory contents from the virtual
machine so you have a consistent state of the virtual machine.

This has a drawback however, since when you give the VM 16GB of memory,
you have to store 16GB of data.

Right now this doesn't work yet with RBD, but there is a feature request
in the tracker. I can't seem to find it right now.

What you could do is:

$ ssh root@virtual-machine "sync"
$ rbd snap create vm-disk@snap1
$ rbd export --snap snap1 vm-disk /mnt/backup/vm-disk_snap1.img

This way you have a pretty consistent snapshot.


You can get an entirely consistent snapshot using xfs_freeze to
stop I/O to the fs until you thaw it. It's done at the vfs level
these days, so it works on all filesystems.

Josh
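
Combined with the sequence above, that would look something like this (a sketch
with hypothetical image and mount point names; note that freezing the root
filesystem itself can block, as reported later in this thread, so freezing a
dedicated data mount is safer):

  ssh root@virtual-machine "xfs_freeze -f /data"   # freeze writes to the fs
  rbd snap create vm-disk@snap1
  ssh root@virtual-machine "xfs_freeze -u /data"   # thaw as soon as the snap exists
  rbd export --snap snap1 vm-disk /mnt/backup/vm-disk_snap1.img

(fsfreeze from util-linux does the same thing and also works on non-XFS
filesystems.)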


rbd export --snap BACKUP image1 /mnt/backup/image1.img
losetup /mnt/backup/image1.img

kpartx -a /dev/loop0

Now you will have the partitions from the RBD image available in
/dev/mapper/loop0pX

Works fine!

Greets,
Stefan


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD Backup

2012-11-22 Thread Stefan Priebe - Profihost AG

Hi Josh,

On 22.11.2012 22:08, Josh Durgin wrote:


This way you have a pretty consistent snapshot.


You can get an entirely consistent snapshot using xfs_freeze to
stop I/O to the fs until you thaw it. It's done at the vfs level
these days, so it works on all filesystems.


Great, we even use XFS ;-) but when I do
xfs_freeze -f /

it just hangs and I can't do anything until I reset the whole VM.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD fio Performance concerns

2012-11-22 Thread Alexandre DERUMIER
>>But who cares? it's also on the 2nd node. or even on the 3rd if you have 
>>replicas 3. 
Yes, but rebuilding a dead node uses CPU and I/O (but it should be benchmarked too, 
to see the impact on production).



- Mail original - 

De: "Stefan Priebe - Profihost AG"  
À: "Alexandre DERUMIER"  
Cc: "ceph-devel" , "Mark Kampe" 
, "Sébastien Han" , "Mark 
Nelson"  
Envoyé: Jeudi 22 Novembre 2012 16:28:57 
Objet: Re: RBD fio Performance concerns 

Am 22.11.2012 16:26, schrieb Alexandre DERUMIER: 
>>> Haven't tested that. But does this makes sense? I mean data goes to Disk 
>>> journal - same disk then has to copy the Data from part A to part B. 
>>> 
>>> Why is this an advantage? 
> 
> Well, if you are cpu limited, I don't think you can use all 8*35000iops by 
> node. 
> So, maybe a benchmark can tell us if the difference is really big. 
> 
> Using tmpfs and ups can be ok, but if you have a kernel panic or hardware 
> problem, you'll lost your journal. 

But who cares? it's also on the 2nd node. or even on the 3rd if you have 
replicas 3. 

Stefan 


> - Mail original - 
> 
> De: "Stefan Priebe - Profihost AG"  
> À: "Mark Nelson"  
> Cc: "Alexandre DERUMIER" , "ceph-devel" 
> , "Mark Kampe" , 
> "Sébastien Han"  
> Envoyé: Jeudi 22 Novembre 2012 16:01:56 
> Objet: Re: RBD fio Performance concerns 
> 
> Am 22.11.2012 15:46, schrieb Mark Nelson: 
>> I haven't played a whole lot with SSD only OSDs yet (other than noting 
>> last summer that iop performance wasn't as high as I wanted it). Is a 
>> second partition on the SSD for the journal not an option for you? 
> 
> Haven't tested that. But does this makes sense? I mean data goes to Disk 
> journal - same disk then has to copy the Data from part A to part B. 
> 
> Why is this an advantage? 
> 
> Stefan 
> 
>> Mark 
>> 
>> On 11/22/2012 08:42 AM, Stefan Priebe - Profihost AG wrote: 
>>> Am 22.11.2012 15:37, schrieb Mark Nelson: 
 I don't think we recommend tmpfs at all for anything other than playing 
 around. :) 
>>> 
>>> I discussed this with somebody frmo inktank. Had to search the 
>>> mailinglist. It might be OK if you're working with enough replicas and 
>>> UPS. 
>>> 
>>> I see no other option while working with SSDs - the only Option would be 
>>> to be able to deaktivate the journal at all. But ceph does not support 
>>> this. 
>>> 
>>> Stefan 
>>> 
 On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote: 
> Hi, 
> 
> can someone from inktank comment this? Might be using /dev/ram0 with an 
> fs on it be better than tmpfs as we can use dio? 
> 
> Greets, 
> Stefan 
> 
>> - Mail original - 
>> 
>> De: "Stefan Priebe - Profihost AG"  
>> À: "Sébastien Han"  
>> Cc: "Mark Nelson" , "Alexandre DERUMIER" 
>> , "ceph-devel" , 
>> "Mark Kampe"  
>> Envoyé: Jeudi 22 Novembre 2012 14:29:03 
>> Objet: Re: RBD fio Performance concerns 
>> 
>> Am 22.11.2012 14:22, schrieb Sébastien Han: 
>>> And RAMDISK devices are too expensive. 
>>> 
>>> It would make sense in your infra, but yes they are really expensive. 
>> 
>> We need something like tmpfs - running in local memory but support 
>> dio. 
>> 
>> Stefan 
>> 
 
 
>> 
>> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem with SGID and new inode

2012-11-22 Thread Sage Weil
On Thu, 22 Nov 2012, Giorgos Kappes wrote:
> ??,
> 
> I was looking at the source code of the ceph MDS and in particular at
> the function
> CInode* Server::prepare_new_inode(...) in the mds/Server.cc file which
> creates a new inode.
> At lines 1739-1747 the code checks if the parent directory has the
> set-group-ID bit set. If
> this bit is set and the new inode refers to a directory then the new
> inode should also have
> the set-group-ID bit set. However, as I understand, at line 1744 the
> set-group-ID bit is set
> at the local variable [mode |= S_ISGID] and not on the inode.
> Shouldn't this line be
> [in->inode.mode |= S_ISGID;]?
> 
> To illustrate the above problem I tried to create a new directory
> inside a directory that has
> the set-group-ID bit set:
> 
> root@client-admin:/mnt/admin# mkdir mydir
> root@client-admin:/mnt/admin# chmod +s mydir
> root@client-admin:/mnt/admin# ls -l
> total 1
> drwsr-sr-x 1 root root  0 Nov 22  2012 mydir
> -rw-r--r-- 1 root root 13 Oct 19 08:05 myfile.txt
> root@client-admin:/mnt/admin# cd mydir
> root@client-admin:/mnt/admin/mydir# mkdir newdir
> root@client-admin:/mnt/admin/mydir# ls -l
> total 1
> drwxr-xr-x 1 root root 0 Nov 22  2012 newdir
> 
> Finally, I would like to note that I am using Ceph 0.48.2 but the
> above problem also seems
> to exist in the v0.54 development release.

Thanks, this is indeed a bug.  I pushed a fix to the next branch, commit 
1c715a11f70788d987c16aa67ce2b6f32c04a673.

Thanks!
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: incremental rbd export / sparse files?

2012-11-22 Thread Sage Weil
On Thu, 22 Nov 2012, Stefan Priebe - Profihost AG wrote:
> Hello list,
> 
> right now a rbd export exports exactly the size of the disk even if there is
> KNOWN free space. Is this intended to change?
> 
> Might it be possible to export just differences between snapshots and merge
> them later?

We were just talking about this the other day.

Step 1 is to create a mechanism to output a list of block ranges that 
have/have not changed between snapshots.

Step 2 is to export the incremental changes.  The hangup there is figuring 
out a generic and portable file format to represent those incremental 
changes; we'd rather not invent something ourselves that is ceph-specific.
Suggestions welcome!

sage
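
Purely as a strawman (this is not an existing ceph format, just an illustration
of what step 2 needs), the incremental stream could be a small header naming the
image plus the "from" and "to" snapshots, followed by a sequence of extent
records, each followed by the changed data:

  /* hypothetical record layout for an incremental export stream */
  struct diff_extent {
      uint64_t offset;   /* byte offset within the image */
      uint64_t length;   /* extent length in bytes */
      uint8_t  exists;   /* 1 = 'length' bytes of data follow,
                            0 = the extent was discarded/zeroed */
  };

Nothing in such a layout would be ceph-specific, so the same stream could in
principle be produced from or applied to other block devices as well.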

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD fio Performance concerns

2012-11-22 Thread Alexandre DERUMIER
>>Haven't tested that. But does this makes sense? I mean data goes to Disk 
>>journal - same disk then has to copy the Data from part A to part B. 
>>
>>Why is this an advantage? 

Well, if you are CPU limited, I don't think you can use all 8*35000 iops per node.
So maybe a benchmark can tell us if the difference is really big.

Using tmpfs and a UPS can be OK, but if you have a kernel panic or a hardware 
problem, you'll lose your journal. 



- Mail original - 

De: "Stefan Priebe - Profihost AG"  
À: "Mark Nelson"  
Cc: "Alexandre DERUMIER" , "ceph-devel" 
, "Mark Kampe" , "Sébastien 
Han"  
Envoyé: Jeudi 22 Novembre 2012 16:01:56 
Objet: Re: RBD fio Performance concerns 

Am 22.11.2012 15:46, schrieb Mark Nelson: 
> I haven't played a whole lot with SSD only OSDs yet (other than noting 
> last summer that iop performance wasn't as high as I wanted it). Is a 
> second partition on the SSD for the journal not an option for you? 

Haven't tested that. But does this makes sense? I mean data goes to Disk 
journal - same disk then has to copy the Data from part A to part B. 

Why is this an advantage? 

Stefan 

> Mark 
> 
> On 11/22/2012 08:42 AM, Stefan Priebe - Profihost AG wrote: 
>> Am 22.11.2012 15:37, schrieb Mark Nelson: 
>>> I don't think we recommend tmpfs at all for anything other than playing 
>>> around. :) 
>> 
>> I discussed this with somebody frmo inktank. Had to search the 
>> mailinglist. It might be OK if you're working with enough replicas and 
>> UPS. 
>> 
>> I see no other option while working with SSDs - the only Option would be 
>> to be able to deaktivate the journal at all. But ceph does not support 
>> this. 
>> 
>> Stefan 
>> 
>>> On 11/22/2012 08:22 AM, Stefan Priebe - Profihost AG wrote: 
 Hi, 
 
 can someone from inktank comment this? Might be using /dev/ram0 with an 
 fs on it be better than tmpfs as we can use dio? 
 
 Greets, 
 Stefan 
 
> - Mail original - 
> 
> De: "Stefan Priebe - Profihost AG"  
> À: "Sébastien Han"  
> Cc: "Mark Nelson" , "Alexandre DERUMIER" 
> , "ceph-devel" , 
> "Mark Kampe"  
> Envoyé: Jeudi 22 Novembre 2012 14:29:03 
> Objet: Re: RBD fio Performance concerns 
> 
> Am 22.11.2012 14:22, schrieb Sébastien Han: 
>> And RAMDISK devices are too expensive. 
>> 
>> It would make sense in your infra, but yes they are really expensive. 
> 
> We need something like tmpfs - running in local memory but support 
> dio. 
> 
> Stefan 
> 
>>> 
>>> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD fio Performance concerns

2012-11-22 Thread Sébastien Han
>>But who cares? it's also on the 2nd node. or even on the 3rd if you have
>>replicas 3.

Yes, but you could also suffer a crash while writing the first replica.
If the journal is in tmpfs, there is nothing to replay.



On Thu, Nov 22, 2012 at 4:35 PM, Alexandre DERUMIER  wrote:
>
> >>But who cares? it's also on the 2nd node. or even on the 3rd if you have
> >>replicas 3.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD fio Performance concerns

2012-11-22 Thread Sébastien Han
> journal is running on tmpfs to me but that changes nothing.

I don't think it works then. According to the docs: "Enables using
libaio for asynchronous writes to the journal. Requires journal dio
set to true."
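
In ceph.conf terms, the combination being discussed is roughly the following (a
sketch; the journal path is an assumption):

  [osd]
      ; a raw partition or file that supports O_DIRECT
      osd journal = /dev/ssd/journal-$id
      ; journal aio requires journal dio
      journal dio = true
      journal aio = true

Since tmpfs does not support O_DIRECT, journal dio - and therefore journal aio -
cannot take effect with a tmpfs-backed journal.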


On Thu, Nov 22, 2012 at 12:48 PM, Stefan Priebe - Profihost AG
 wrote:
> Am 22.11.2012 11:49, schrieb Sébastien Han:
>
>> @Alexandre: cool!
>>
>> @ Stefan: Full SSD cluster and 10G switches?
>
> Yes
>
>
>> Couple of weeks ago I saw
>> that you use journal aio, did you notice performance improvement with it?
>
> journal is running on tmpfs to me but that changes nothing.
>
> Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Debian/Ubuntu packages for ceph-deploy

2012-11-22 Thread Martin Gerhard Loschwitz
Hi folks,

I figured it might be a cool thing to have packages of ceph-deploy for
Debian and Ubuntu 12.04; I took the time and created them (along with
packages of python-pushy, which ceph-deploy needs but which was not
present in the Debian archive, and thus not in the Ubuntu archive either).

They are available from http://people.debian.org/~madkiss/ceph-deploy/

I did upload python-pushy to the official Debian unstable repository
already, but I didn't do so just yet with ceph-deploy. Also, I don't
want to step on somebody's toes - if there were secret plans to start
ceph-deploy packaging anyway, I'm more than happy to hand over what
I have got to the responsible person.

Any feedback is highly appreciated (especially with regard to the question
of whether it's already okay to upload ceph-deploy).

Best regards
Martin G. Loschwitz

-- 
Martin Gerhard Loschwitz
Chief Brand Officer, Principal Consultant
hastexo Professional Services

CONFIDENTIALITY NOTICE: This e-mail and/or the accompanying documents
are privileged and confidential under applicable law. The person who
receives this message and who is not the addressee, one of his employees
or an agent entitled to hand it over to the addressee, is informed that
he may not use, disclose or reproduce the contents thereof. Should you
have received this e-mail (or any copy thereof) in error, please let us
know by telephone or e-mail without delay and delete the message from
your system. Thank you.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [Discussion] Enhancement for CRUSH rules

2012-11-22 Thread Chen, Xiaoxi
Hi list,
 I am thinking about the possibility of adding some primitives to CRUSH to meet 
the following user stories:

A. "Same host", "Same rack"
To balance availability and performance, one may want a rule like: 3 
replicas, with replica 1 and replica 2 in the same rack while replica 3 resides in 
another rack. This is common because a typical datacenter deployment usually has 
much less uplink bandwidth than backbone bandwidth.

More aggressive users may even want "same host", since the most common failure is 
a disk failure, and several disks (which also means several OSDs) reside in the 
same physical machine. If we can place replicas 1 & 2 on the same host and 
replica 3 somewhere else, it will not only reduce replication traffic but 
also save a lot of time and bandwidth when a disk failure happens and a recovery 
takes place. (A rough sketch of such a rule follows below.)

B. "Local"
 Although we cannot mount RBD volumes on a host where an OSD is running, 
QEMU can be used. This scenario is really common in cloud computing: we have a 
large number of compute nodes, so just plug in some disks and reuse the 
machines for the Ceph cluster. To reduce network traffic and latency, it would be 
nice to have a few placement groups - maybe 3 PGs - per compute node, and define 
rules like: the primary copy of the PG should (if possible) reside on localhost, 
while the second replica goes somewhere else.

By doing this, a significant amount of network bandwidth and an RTT can 
be saved. What's more, since reads always go to the primary, reads would benefit a 
lot from such a mechanism.

It looks to me that A is simpler, but B seems much more complex. Hoping for input.
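
For what it's worth, a rough, untested sketch of what a rule for story A might
look like in the existing CRUSH rule language (bucket and rule names are made
up) is:

  rule two_replicas_in_one_rack {
          ruleset 1
          type replicated
          min_size 3
          max_size 3
          step take default
          step choose firstn 2 type rack
          step chooseleaf firstn 2 type host
          step emit
  }

With 3 replicas this should give two OSDs on different hosts in the first rack
and one OSD in a second rack; a "same host" variant would choose hosts first and
then OSDs within them. Story B is harder, since CRUSH placement deliberately
does not know where the client is.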

  
 Xiaoxi
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: incremental rbd export / sparse files?

2012-11-22 Thread Dietmar Maurer
> Step 2 is to export the incremental changes.  The hangup there is figuring out
> a generic and portable file format to represent those incremental changes;
> we'd rather not invent something ourselves that is ceph-specific.
> Suggestions welcome!

AFAIK, both 'zfs' and 'btrfs' already have such format. 
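
For reference, those interfaces look roughly like this (pool, subvolume and
snapshot names are made up):

  zfs send -i tank/vol@snap1 tank/vol@snap2  > vol.incr.zfs
  btrfs send -p /snaps/snap1 /snaps/snap2    > vol.incr.btrfs

Both produce a self-contained stream that can later be applied with 'zfs
receive' / 'btrfs receive', which is essentially the kind of format step 2
above is looking for.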

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cephfs losing files and corrupting others

2012-11-22 Thread Nathan Howell
I upgraded to 0.54 and now there are some hints in the logs. The
directories referenced in the log entries are now missing:

2012-11-23 07:28:04.802864 mds.0 [ERR] loaded dup inode 100662f
[2,head] v3851654 at /xxx/20120203, but inode 100662f.head
v3853093 already exists at ~mds0/stray7/100662f
2012-11-23 07:28:04.802889 mds.0 [ERR] loaded dup inode 1003a4b
[2,head] v431518 at /xxx/20120206, but inode 1003a4b.head v3853192
already exists at ~mds0/stray8/1003a4b
2012-11-23 07:28:04.802909 mds.0 [ERR] loaded dup inode 100149e
[2,head] v431522 at /xxx/20120207, but inode 100149e.head v3853206
already exists at ~mds0/stray8/100149e
2012-11-23 07:28:04.802927 mds.0 [ERR] loaded dup inode 1000a5f
[2,head] v431526 at /xxx/20120208, but inode 1000a5f.head v3853208
already exists at ~mds0/stray8/1000a5f

Any ideas?

On Thu, Nov 15, 2012 at 11:00 AM, Nathan Howell
 wrote:
> Yes, successfully written files were disappearing. We switched to ceph-fuse
> and haven't seen any files truncated since. Older files (written months ago)
> are still having their entire contents replaced with NULL bytes, seemingly at
> random. I can't yet say for sure this has happened since switching over to
> fuse... but we think it has.
>
> I'm going to test all of the archives over the next few days and restore
> them from S3, so we should be back in a known-good state after that. In the
> event more files end up corrupted, is there any logging that I can enable
> that would help track down the problem?
>
> thanks,
> -n
>
>
> On Sat, Nov 3, 2012 at 9:54 AM, Gregory Farnum  wrote:
>>
>> On Fri, Nov 2, 2012 at 12:30 AM, Nathan Howell
>>  wrote:
>> > On Thu, Nov 1, 2012 at 3:32 PM, Sam Lang  wrote:
>> >> Do the writes succeed?  I.e. the programs creating the files don't get
>> >> errors back?  Are you seeing any problems with the ceph mds or osd
>> >> processes
>> >> crashing?  Can you describe your I/O workload during these bulk loads?
>> >> How
>> >> many files, how much data, multiple clients writing, etc.
>> >>
>> >> As far as I know, there haven't been any fixes to 0.48.2 to resolve
>> >> problems
>> >> like yours.  You might try the ceph fuse client to see if you get the
>> >> same
>> >> behavior.  If not, then at least we have narrowed down the problem to
>> >> the
>> >> ceph kernel client.
>> >
>> > Yes, the writes succeed. Wednesday's failure looked like this:
>> >
>> > 1) rsync 100-200mb tarball directly into ceph from a remote site
>> > 2) untar ~500 files from tarball in ceph into a new directory in ceph
>> > 3) wait for a while
>> > 4) the .tar file and some log files disappeared but the untarred files
>> > were fine
>>
>> Just to be clear, you copied a tarball into Ceph and untarred all in
>> Ceph, and the extracted contents were fine but the tarball
>> disappeared? So this looks like a case of successfully-written files
>> disappearing?
>> Did you at any point check the tarball from a machine other than the
>> initial client that copied it in?
>>
>> This truncation sounds like maybe Yan's fix will deal with it. But if
>> you've also seen files with the proper size but be empty or corrupted,
>> that sounds like an OSD bug. Sam, are you aware of any btrfs issues
>> that could cause this?
>>
>> Nathan, you've also seen parts of the filesystem hierarchy get lost?
>> That's rather more concerning; under what circumstances have you seen
>> that?
>> -Greg
>>
>> > Total filesystem size is:
>> >
>> > pgmap v2221244: 960 pgs: 960 active+clean; 2418 GB data, 7293 GB used,
>> > 6151 GB / 13972 GB avail
>> >
>> > Generally our load looks like:
>> >
>> > Constant trickle of 1-2mb files from 3 machines, about 1GB per day
>> > total. No file is written to by more than 1 machine, but the files go
>> > into shared directories.
>> >
>> > Grid jobs are running constantly and are doing sequential reads from
>> > the filesystem. Compute nodes have the filesystem mounted read-only.
>> > They're primarily located at a remote site (~40ms away) and tend to
>> > average 1-2 megabits/sec.
>> >
>> > Nightly data jobs load in ~10GB from a few remote sites in to <10
>> > large files. These are split up into about 1000 smaller files but the
>> > originals are also kept. All of this is done on one machine. The
>> > journals and osd drives are write saturated while this is going on.
>> >
>> >
>> > On Thu, Nov 1, 2012 at 4:02 PM, Gregory Farnum  wrote:
>> >> Are you using hard links, by any chance?
>> >
>> > No, we are using a handfull of soft links though.
>> >
>> >
>> >> Do you have one or many MDS systems?
>> >
>> > ceph mds stat says: e686: 1/1/1 up {0=xxx=up:active}, 2 up:standby
>> >
>> >
>> >> What filesystem are you using on your OSDs?
>> >
>> > btrfs
>> >
>> >
>> > thanks,
>> > -n
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html