Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster

2013-07-29 Thread peter

On 2013-07-25 17:46, Sage Weil wrote:

On Thu, 25 Jul 2013, pe...@2force.nl wrote:
We did not upgrade from bobtail to cuttlefish and are still seeing this
issue. I posted this on the ceph-users mailinglist and I missed this
thread (sorry!) so I didn't know.

That's interesting; a bobtail upgraded cluster was the only way I was able
to reproduce it, but I'm also working with relatively short-lived clusters
in a test environment so there may very well be a possibility I missed.
Can you summarize what the lineage of your cluster is?  (What version was
it installed with, and when was it upgraded and to what versions?)

Either way, I also have an osd crashing after upgrading to 0.61.6. As said
on the other list, I'm more than happy to share log files etc with you
guys.

Will take a look.

Thanks!
sage


Hi Sage,

Did you happen to find out what is causing the osd crash? I'm not sure 
what the best way is to recover from this.


Thanks,

Peter






Thanks,

Peter

This is fixed in the cuttlefish branch as of earlier this afternoon.  I've
spent most of the day expanding the automated test suite to include
upgrade combinations to trigger this and *finally* figured out that this
particular problem seems to surface on clusters that upgraded from
bobtail -> cuttlefish but not clusters created on cuttlefish.

If you've run into this issue, please use the cuttlefish branch build for
now.  We will have a release out in the next day or so that includes this
and a few other pending fixes.

I'm sorry we missed this one!  The upgrade test matrix I've been working
on today should catch this type of issue in the future.



Thanks!
sage



Read ahead affect Ceph read performance much

2013-07-29 Thread Li Wang
We performed an Iozone read test on a 32-node HPC cluster. Regarding the
hardware of each node: the CPU is very powerful, as is the network, with a
bandwidth > 1.5 GB/s, and there is 64GB of memory; the IO is relatively
slow, with a locally measured 'dd' throughput of around 70MB/s. We
configured a Ceph cluster with 24 OSDs on 24 nodes, one mds, and one to
four clients, one client per node. The performance is as follows:


Iozone sequential read throughput (MB/s)
Number of clients 1  2 4
Default resize180.0954   324.4836   591.5851
Resize: 256MB 645.3347   1022.998   1267.631

The complete iozone command line for one client is (on each client node,
only one thread is started):

iozone -t 1 -+m /tmp/iozone.nodelist.50305030 -s 64G -r 4M -i 0 -+n -w \
  -c -e -b /tmp/iozone.nodelist.50305030.output

For two clients it is:

iozone -t 2 -+m /tmp/iozone.nodelist.50305030 -s 32G -r 4M -i 0 -+n -w \
  -c -e -b /tmp/iozone.nodelist.50305030.output


As the data shows, a larger readahead window can result in a 300% speedup!
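
The post does not say where the readahead window was adjusted; purely as a
hedged illustration, a 256MB client-side readahead could be set either on a
mapped RBD device via the block-layer tunable, or for the kernel CephFS
client via the rasize mount option (device names and mount point below are
examples only):

blockdev --setra 524288 /dev/rbd0                    # 524288 * 512 B = 256 MB
echo 262144 > /sys/block/rbd0/queue/read_ahead_kb    # same thing in KB

mount -t ceph mon1:6789:/ /mnt/ceph -o rasize=268435456   # CephFS kernel client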

Besides, since the backend of Ceph is not a traditional hard disk, it is
beneficial to detect and prefetch stride-read patterns. To demonstrate this,
we tested stride reads with the program below. As we know, the generic
readahead algorithm of the Linux kernel does not detect stride-read patterns,
so we use posix_fadvise() to force prefetching manually.

The record size is 4MB. The result is even more surprising:

Stride read throughput (MB/s)
Number of records prefetched  0  1  4  16  64  128
Throughput  42.82  100.74 217.41  497.73  854.48  950.18

As the data shows, with a readahead size of 128*4MB, the speedup over no
readahead could be more than 2000% (950/42)!

The core logic of the test program is below,

stride = 17;
recordsize = 4 * 1024 * 1024;                 /* 4MB records */
for (;;) {
  /* hint the kernel to prefetch the next 'count' records of the stride */
  for (i = 0; i < count; ++i) {
    long long start = pos + (i + 1) * stride * recordsize;
    printf("PRE READ %lld %lld\n", start, start + block);
    posix_fadvise(fd, start, block, POSIX_FADV_WILLNEED);
  }
  len = read(fd, buf, block);
  total += len;
  printf("READ %lld %lld\n", pos, pos + len);
  pos += len;
  /* skip forward to the next record of the stride */
  lseek(fd, (stride - 1) * block, SEEK_CUR);
  pos += (stride - 1) * block;
}
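
For reference, a self-contained version of the same experiment might look
like the sketch below. This is not the original test program; the file name,
stride, record size and prefetch depth are illustrative.

/* stride_read.c -- stride read with manual prefetch via posix_fadvise().
 * Sketch only; build with: cc -O2 -o stride_read stride_read.c
 */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "testfile";   /* placeholder */
    const long long block = 4LL * 1024 * 1024;   /* 4MB record size */
    const int stride = 17;                       /* read every 17th record */
    const int count = 16;                        /* records to prefetch ahead */
    char *buf = malloc(block);
    long long pos = 0, total = 0;
    ssize_t len;
    int fd = open(path, O_RDONLY);

    if (fd < 0 || !buf) {
        perror("setup");
        return 1;
    }
    for (;;) {
        /* hint the kernel about the next 'count' records in the stride */
        for (int i = 0; i < count; ++i) {
            long long start = pos + (long long)(i + 1) * stride * block;
            posix_fadvise(fd, start, block, POSIX_FADV_WILLNEED);
        }
        len = read(fd, buf, block);
        if (len <= 0)
            break;                               /* EOF or error */
        total += len;
        /* jump to the next record of the stride */
        pos += len + (long long)(stride - 1) * block;
        lseek(fd, (long long)(stride - 1) * block, SEEK_CUR);
    }
    printf("read %lld bytes\n", total);
    close(fd);
    free(buf);
    return 0;
}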

Given the above results and some more, we plan to submit a blueprint to
discuss prefetching optimization in Ceph.


Cheers,
Li Wang






Re: ObjectContext & PGRegistry API

2013-07-29 Thread Loic Dachary
Hi Sam,

Sorry to bother you with this again. Would you have time to quickly review
this proposal? I'm sure you'll have comments that will require work on my part ;-)

Cheers

On 22/07/2013 22:33, Loic Dachary wrote:
 Hi Sam,
 
 Here is the proposed ObjectContext & PGRegistry API:
 
 https://github.com/dachary/ceph/blob/wip-5487/src/osd/PGRegistry.h
 
 which is part of the following commit
 
 https://github.com/dachary/ceph/commit/60958095585a1f8392d8a967767f7620089d547d
 
 It's a first draft and I assume your comments will require significant
 work on my part. I'd rather do it as soon as possible, while my short
 term memory is still fresh ;-)
 
 Cheers
 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.





Re: Read ahead affect Ceph read performance much

2013-07-29 Thread Andrey Korolyov
Wow, very glad to hear that. I tried with the regular FS tunable and
there was almost no effect on the regular test, so I thought that
reads cannot be improved at all in this direction.

On Mon, Jul 29, 2013 at 2:24 PM, Li Wang liw...@ubuntukylin.com wrote:
 We performed Iozone read test on a 32-node HPC server. Regarding the
 hardware of each node, the CPU is very powerful, so does the network, with a
 bandwidth > 1.5 GB/s. 64GB memory, the IO is relatively slow, the throughput
 measured by ‘dd’ locally is around 70MB/s. We configured a Ceph cluster with
 24 OSDs on 24 nodes, one mds, one to four clients, one client per node. The
 performance is as follows,

 Iozone sequential read throughput (MB/s)
 Number of clients 1  2 4
 Default resize180.0954   324.4836   591.5851
 Resize: 256MB 645.3347   1022.998   1267.631

 The complete iozone parameter for one client is,
 iozone -t 1 -+m /tmp/iozone.nodelist.50305030 -s 64G -r 4M -i 0 -+n -w -c -e
 -b /tmp/iozone.nodelist.50305030.output, on each client node, only one
 thread is started.

 for two clients, it is,
 iozone -t 2 -+m /tmp/iozone.nodelist.50305030 -s 32G -r 4M -i 0 -+n -w -c -e
 -b /tmp/iozone.nodelist.50305030.output

 As the data shown, a larger read ahead window could result in 300% speedup!

 Besides, Since the backend of Ceph is not the traditional hard disk, it is
 beneficial to capture the stride read prefetching. To prove this, we tested
 the stride read with the following program, as we know, the generic read
 ahead algorithm of Linux kernel will not capture stride-read prefetch, so we
 use fadvise() to manually force pretching.
 the record size is 4MB. The result is even more surprising,

 Stride read throughput (MB/s)
 Number of records prefetched  0  1  4  16  64  128
 Throughput  42.82  100.74 217.41  497.73  854.48  950.18

 As the data shown, with a read ahead size of 128*4MB, the speedup over
 without read ahead could be up to 950/42 > 2000%!

 The core logic of the test program is below,

 stride = 17
 recordsize = 4MB
 for (;;) {
   for (i = 0; i < count; ++i) {
 long long start = pos + (i + 1) * stride * recordsize;
 printf("PRE READ %lld %lld\n", start, start + block);
 posix_fadvise(fd, start, block, POSIX_FADV_WILLNEED);
   }
   len = read(fd, buf, block);
   total += len;
 printf("READ %lld %lld\n", pos, (pos + len));
   pos += len;
   lseek(fd, (stride - 1) * block, SEEK_CUR);
   pos += (stride - 1) * block;
 }

 Given the above results and some more, We plan to submit a blue print to
 discuss the prefetching optimization of Ceph.

 Cheers,
 Li Wang






Re: Read ahead affect Ceph read performance much

2013-07-29 Thread Mark Nelson

On 07/29/2013 05:24 AM, Li Wang wrote:

We performed Iozone read test on a 32-node HPC server. Regarding the
hardware of each node, the CPU is very powerful, so does the network,
with a bandwidth > 1.5 GB/s. 64GB memory, the IO is relatively slow, the
throughput measured by ‘dd’ locally is around 70MB/s. We configured a
Ceph cluster with 24 OSDs on 24 nodes, one mds, one to four clients, one
client per node. The performance is as follows,

 Iozone sequential read throughput (MB/s)
Number of clients 1  2 4
Default resize180.0954   324.4836   591.5851
Resize: 256MB 645.3347   1022.998   1267.631

The complete iozone parameter for one client is,
iozone -t 1 -+m /tmp/iozone.nodelist.50305030 -s 64G -r 4M -i 0 -+n -w
-c -e -b /tmp/iozone.nodelist.50305030.output, on each client node, only
one thread is started.

for two clients, it is,
iozone -t 2 -+m /tmp/iozone.nodelist.50305030 -s 32G -r 4M -i 0 -+n -w
-c -e -b /tmp/iozone.nodelist.50305030.output

As the data shown, a larger read ahead window could result in 300%
speedup!


Very interesting!  I've done some similar tests and saw somewhat different
results (in some cases I actually saw improvement with lower readahead!).
I suspect that this may be very hardware dependent.  Were you using RBD or
CephFS?  In either case, was it the kernel client or userland (i.e.
QEMU/KVM or FUSE)?  Also, where did you adjust readahead?  Was this on the
client volume or under the OSDs?


I've got to prepare for the talk later this week, but I will try to get 
my readahead test results out soon as well.




Besides, Since the backend of Ceph is not the traditional hard disk, it
is beneficial to capture the stride read prefetching. To prove this, we
tested the stride read with the following program, as we know, the
generic read ahead algorithm of Linux kernel will not capture
stride-read prefetch, so we use fadvise() to manually force pretching.
the record size is 4MB. The result is even more surprising,

 Stride read throughput (MB/s)
Number of records prefetched  0  1  4  16  64  128
Throughput  42.82  100.74 217.41  497.73  854.48  950.18

As the data shown, with a read ahead size of 128*4MB, the speedup over
without read ahead could be up to 950/42 > 2000%!

The core logic of the test program is below,

stride = 17
recordsize = 4MB
for (;;) {
   for (i = 0; i < count; ++i) {
 long long start = pos + (i + 1) * stride * recordsize;
 printf("PRE READ %lld %lld\n", start, start + block);
 posix_fadvise(fd, start, block, POSIX_FADV_WILLNEED);
   }
   len = read(fd, buf, block);
   total += len;
 printf("READ %lld %lld\n", pos, (pos + len));
   pos += len;
   lseek(fd, (stride - 1) * block, SEEK_CUR);
   pos += (stride - 1) * block;
}

Given the above results and some more, We plan to submit a blue print to
discuss the prefetching optimization of Ceph.


Cool!



Cheers,
Li Wang






ceph branch status

2013-07-29 Thread ceph branch robot
-- All Branches --

Dan Mick dan.m...@inktank.com
2012-12-18 12:27:36 -0800   wip-rbd-striping
2013-07-16 23:00:06 -0700   wip-5634
2013-07-18 16:34:23 -0700   wip-daemon

David Zafman david.zaf...@inktank.com
2013-01-28 20:26:34 -0800   wip-wireshark-zafman
2013-03-22 18:14:10 -0700   wip-snap-test-fix
2013-07-19 19:47:30 -0700   wip-5624

Gary Lowell gary.low...@inktank.com
2013-07-08 15:45:00 -0700   last

Gary Lowell glow...@inktank.com
2013-01-28 22:49:45 -0800   wip-3930
2013-02-05 19:29:11 -0800   wip.cppchecker
2013-02-10 22:21:52 -0800   wip-3955
2013-02-26 19:28:48 -0800   wip-system-leveldb
2013-03-01 18:55:35 -0800   wip-da-spec-1
2013-03-19 11:28:15 -0700   wip-3921
2013-04-11 23:00:05 -0700   wip-init-radosgw
2013-04-17 23:30:11 -0700   wip-4725
2013-04-21 22:06:37 -0700   wip-4752
2013-04-22 14:11:37 -0700   wip-4632
2013-05-31 11:20:40 -0700   wip-doc-prereq
2013-06-06 22:31:54 -0700   wip-build-doc
2013-07-03 17:00:31 -0700   wip-5496

Greg Farnum g...@inktank.com
2013-02-13 14:46:38 -0800   wip-mds-snap-fix
2013-02-22 19:57:53 -0800   wip-4248-snapid-journaling
2013-05-01 17:06:27 -0700   wip-optracker-4354
2013-06-26 16:28:22 -0700   wip-rgw-geo-replica-log
2013-07-19 15:13:07 -0700   wip-rgw-versionchecks

James Page james.p...@ubuntu.com
2013-02-27 22:50:38 +   wip-debhelper-8

Joao Eduardo Luis joao.l...@inktank.com
2013-04-18 00:01:24 +0100   wip-4521-tool
2013-04-22 15:14:28 +0100   wip-4748
2013-04-24 16:42:11 +0100   wip-4521
2013-04-30 18:45:22 +0100   wip-mon-compact-dbg
2013-05-21 01:46:13 +0100   wip-monstoretool-foo
2013-05-31 16:26:02 +0100   wip-mon-cache-first-last-committed
2013-05-31 21:00:28 +0100   wip-mon-trim-b
2013-07-20 04:30:59 +0100   wip-mon-caps-test

Joe Buck jbb...@gmail.com
2013-05-02 16:32:33 -0700   wip-buck-add-terasort
2013-07-01 12:33:57 -0700   wip-rgw-geo-buck

John Wilkins john.wilk...@inktank.com
2012-12-21 15:14:37 -0800   wip-mon-docs

Josh Durgin josh.dur...@inktank.com
2013-03-01 14:45:23 -0800   wip-rbd-workunit-debug
2013-04-29 14:32:00 -0700   wip-rbd-close-image

Noah Watkins noahwatk...@gmail.com
2013-01-05 11:58:38 -0800   wip-localized-read-tests
2013-04-22 15:23:09 -0700   wip-cls-lua
2013-07-21 12:01:01 -0700   wip-osx-upstream
2013-07-21 22:05:32 -0700   fallocate-error-handling

Roald van Loon roaldvanl...@gmail.com
2012-12-24 22:26:56 +   wip-dout

Sage Weil s...@inktank.com
2012-07-14 17:40:21 -0700   wip-osd-redirect
2012-11-30 13:47:27 -0800   wip-osd-readhole
2012-12-07 14:38:46 -0800   wip-osd-alloc
2013-01-29 13:46:02 -0800   wip-readdir
2013-02-11 07:05:15 -0800   wip-sim-journal-clone
2013-04-18 13:51:36 -0700   argonaut
2013-06-02 21:21:09 -0700   wip-fuse-bobtail
2013-06-04 22:43:04 -0700   wip-osd-push
2013-06-18 17:00:00 -0700   wip-mon-refs
2013-06-21 17:59:58 -0700   wip-rgw-vstart
2013-06-24 21:23:55 -0700   bobtail
2013-06-25 13:16:45 -0700   wip-5401
2013-06-28 12:54:08 -0700   wip-mds-snap
2013-06-30 20:41:55 -0700   wip-5453
2013-07-01 17:48:09 -0700   wip-5021
2013-07-06 09:22:29 -0700   wip-mds-lazyio-cuttlefish
2013-07-06 13:00:51 -0700   wip-mds-lazyio-cuttlefish-minimal
2013-07-10 11:03:55 -0700   wip-mon-sync
2013-07-12 08:50:24 -0700   wip-libcephfs
2013-07-18 16:59:03 -0700   wip-refs
2013-07-18 18:12:16 -0700   cuttlefish
2013-07-19 21:13:09 -0700   wip-5692
2013-07-19 22:32:23 -0700   wip-mon-caps
2013-07-20 08:49:48 -0700   wip-5624-b
2013-07-20 09:02:40 -0700   wip-5695
2013-07-21 08:59:51 -0700   wip-paxos
2013-07-21 17:16:10 -0700   wip-5672
2013-07-21 19:58:12 -0700   wip-before
2013-07-21 22:03:19 -0700   wip-cuttlefish-osdmap

Sam Lang sam.l...@inktank.com
2012-11-27 15:01:58 -0600   wip-mtime-incr

Samuel Just sam.j...@inktank.com
2013-06-06 11:51:04 -0700   wip_bench_num
2013-06-06 13:08:51 -0700   wip_5238_cuttlefish
2013-06-17 14:50:53 -0700   wip-log-rewrite-sam
2013-06-19 14:54:13 -0700   wip_cuttlefish_compact_on_startup
2013-06-19 19:46:06 -0700   wip_observer
2013-07-19 14:51:43 -0700   wip-cuttlefish-next

Yehuda Sadeh yeh...@inktank.com
2012-11-16 11:09:34 -0800   wip-mongoose
2012-12-07 13:40:12 -0800   wip-rgw-dr
2012-12-10 13:29:37 -0800   wip-multipart-size
2012-12-13 18:09:37 -0800   wip-2169
2013-02-12 09:40:12 -0800   wip-json-decode
2013-02-22 15:04:37 -0800   wip-4247
2013-02-22 16:19:37 

[PATCH] mds: remove waiting lock before merging with neighbours

2013-07-29 Thread David Disseldorp
CephFS currently deadlocks under CTDB's ping_pong POSIX locking test
when run concurrently on multiple nodes.
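
For reference, the reproducer is CTDB's ping_pong run concurrently from two
or more CephFS clients against the same file; the path and lock count below
are an example only:

    ping_pong /mnt/cephfs/ping_pong.dat 3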
The deadlock is caused by failed removal of a waiting_locks entry when
the waiting lock is merged with an existing lock, e.g.:

Initial MDS state (two clients, same file):
held_locks -- start: 0, length: 1, client: 4116, pid: 7899, type: 2
  start: 2, length: 1, client: 4110, pid: 40767, type: 2
waiting_locks -- start: 1, length: 1, client: 4116, pid: 7899, type: 2

Waiting lock entry 4116@1:1 fires:
handle_client_file_setlock: start: 1, length: 1,
client: 4116, pid: 7899, type: 2

MDS state after lock is obtained:
held_locks -- start: 0, length: 2, client: 4116, pid: 7899, type: 2
  start: 2, length: 1, client: 4110, pid: 40767, type: 2
waiting_locks -- start: 1, length: 1, client: 4116, pid: 7899, type: 2

Note that the waiting 4116@1:1 lock entry is merged with the existing
4116@0:1 held lock to become a 4116@0:2 held lock. However, the now
handled 4116@1:1 waiting_locks entry remains.

When handling a lock request, the MDS calls adjust_locks() to merge
the new lock with available neighbours. If the new lock is merged,
then the waiting_locks entry is not located in the subsequent
remove_waiting() call.
This fix ensures that the waiting_locks entry is removed prior to
modification during merge.

Signed-off-by: David Disseldorp dd...@suse.de
---
 src/mds/flock.cc | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/mds/flock.cc b/src/mds/flock.cc
index e83c5ee..5e329af 100644
--- a/src/mds/flock.cc
+++ b/src/mds/flock.cc
@@ -75,12 +75,14 @@ bool ceph_lock_state_t::add_lock(ceph_filelock& new_lock,
       } else {
         //yay, we can insert a shared lock
         dout(15) << "inserting shared lock" << dendl;
+        remove_waiting(new_lock);
         adjust_locks(self_overlapping_locks, new_lock, neighbor_locks);
         held_locks.insert(pair<uint64_t, ceph_filelock>(new_lock.start, new_lock));
         ret = true;
       }
     }
   } else { //no overlapping locks except our own
+    remove_waiting(new_lock);
     adjust_locks(self_overlapping_locks, new_lock, neighbor_locks);
     dout(15) << "no conflicts, inserting " << new_lock << dendl;
     held_locks.insert(pair<uint64_t, ceph_filelock>
@@ -89,7 +91,6 @@ bool ceph_lock_state_t::add_lock(ceph_filelock& new_lock,
   }
   if (ret) {
     ++client_held_lock_counts[(client_t)new_lock.client];
-    remove_waiting(new_lock);
   }
   else if (wait_on_fail && !replay)
     ++client_waiting_lock_counts[(client_t)new_lock.client];
@@ -306,7 +307,7 @@ void ceph_lock_state_t::adjust_locks(list<multimap<uint64_t, ceph_filelock>::ite
     old_lock = &(*iter)->second;
     old_lock_client = old_lock->client;
     dout(15) << "lock to coalesce: " << *old_lock << dendl;
-    /* because if it's a neibhoring lock there can't be any self-overlapping
+    /* because if it's a neighboring lock there can't be any self-overlapping
        locks that covered it */
     if (old_lock->type == new_lock.type) { //merge them
       if (0 == new_lock.length) {
-- 
1.8.1.4



mds.0 crashed with 0.61.7

2013-07-29 Thread Andreas Friedrich
Hello,

my Ceph test cluster runs fine with 0.61.4.

I have removed all data and have set up a new cluster with 0.61.7 using
the same configuration (see ceph.conf).

After
  mkcephfs -c /etc/ceph/ceph.conf -a
  /etc/init.d/ceph -a start
the mds.0 crashed:

    -1> 2013-07-29 17:02:57.626886 7fba2a8cd700  1 -- 10.0.0.231:6800/806 <== 
osd.121 10.0.0.231:6834/5350 1 ==== osd_op_reply(4 mds_snaptable [read 0~0] ack 
= -2 (No such file or directory)) v4 ==== 112+0+0 (2505332647 0 0) 0x13b7a30 
con 0x7fba20010200
     0> 2013-07-29 17:02:57.627838 7fba2a8cd700 -1 mds/MDSTable.cc: In function 
'void MDSTable::load_2(int, ceph::bufferlist&, Context*)' thread 7fba2a8cd700 
time 2013-07-29 17:02:57.626907
mds/MDSTable.cc: 150: FAILED assert(0)

 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
 1: (MDSTable::load_2(int, ceph::buffer::list, Context*)+0x4cf) [0x6e398f]
 2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe1e) [0x73c16e]
 3: (MDS::handle_core_message(Message*)+0x93f) [0x4db2ff]
 4: (MDS::_dispatch(Message*)+0x2f) [0x4db3df]
 5: (MDS::ms_dispatch(Message*)+0x1a3) [0x4dd163]
 6: (DispatchQueue::entry()+0x399) [0x7ddd69]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7d343d]
 8: (()+0x77b6) [0x7fba2f51e7b6]
 9: (clone()+0x6d) [0x7fba2e15dd6d]
 ...

At this point I have no rbd, no cephfs, no ceph-fuse configured.

  /etc/init.d/ceph -a stop
  /etc/init.d/ceph -a start

doesn't help.

Any help would be appreciated.

Andreas Friedrich
--
FUJITSU
Fujitsu Technology Solutions GmbH
Heinz-Nixdorf-Ring 1, 33106 Paderborn, Germany
Tel: +49 (5251) 525-1512
Fax: +49 (5251) 525-321512
Email: andreas.friedr...@ts.fujitsu.com
Web: ts.fujitsu.com
Company details: de.ts.fujitsu.com/imprint
--
[global]
#debug ms = 20
debug ITX = 0
debug monc = 0
debug rados = 0
#
# enable secure authentication
# auth supported = cephx
# keyring = /etc/ceph/keyring.client
#
# -- or -- disable secure authentication
# auth supported = none

# auth cluster required = cephx
# auth service required = cephx
# auth client required = cephx

auth cluster required = none
auth service required = none
auth client required = none

# allow ourselves to open a lot of files
max open files = 131072

# set log file
# log file = /ceph-log/log/$name.log
# log_to_syslog = true# uncomment this line to log to syslog

# set up pid files
pid file = /var/run/ceph/$name.pid

# If you want to run an IPv6 cluster, set this to true. Dual-stack isn't possible.
#ms bind ipv6 = true
public network = 10.0.0.0/24
cluster network = 10.0.0.0/24

# environment for startup with rsockets
# environment = LD_PRELOAD=/usr/lib64/libsdp.so.1
# environment = LD_PRELOAD=/usr/local/lib/rsocket/librspreload.so.1.0.0 
LD_LIBRARY_PATH=/usr/local/lib/rsocket:\\\$LD_LIBRARY_PATH

### [client.radosgw.ceph]
### 
### host = ceph
### # auto start = yes
### log file = /var/log/ceph/$name.log
### keyring = /etc/ceph/keyring.radosgw.ceph
### rgw socket path = /var/run/radosgw.sock
### # debug rgw = 20
### # debug ms = 1

[mon]
#mon data = /var/lib/ceph/mon/$cluster-$id
mon data = /data/mon$id
# debug ms = 0 ; see message traffic
# debug mon = 5   ; monitor 
# debug paxos = 5 ; monitor replication
# debug auth = 5  ; authentication code
# keyring = /etc/ceph/keyring.$name
debug optracker = 0
mon debug dump transactions = false

[mon.0]
host = cibst1
mon addr = 10.0.0.231:6789

[mon.1]
host = cibst2
mon addr = 10.0.0.232:6789

[mon.3]
host = cibst3
mon addr = 10.0.0.233:6789

[mds]
# debug mds = 1
# keyring = /etc/ceph/keyring.$name
debug optracker = 0

[mds.0]
host = cibst1

[mds.1]
host = cibst2

[osd]

# journal dio = false
# journal aio = true
#osd data = /var/lib/ceph/osd/$cluster-$id
osd data = /data/$name
# osd journal = /journals/$name/journal
# osd journal = 
osd journal size = 5120
#osd journal size = 1024

filestore max sync interval = 30
filestore min sync interval = 29
filestore flusher = false
filestore queue max ops = 1

debug optracker = 0

# keyring = /etc/ceph/keyring.$name
# debug osd = 20
# debug osd = 0 ; waiters
# debug ms = 10 ; message traffic
# debug filestore = 20 ; local object storage
debug journal = 0   ; local journaling
# debug monc = 5  ; monitor interaction, startup
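
For reference, the extra logging requested in the follow-up below would be
added to the [mds] section of this file before re-running mkcephfs, e.g.:

[mds]
debug mds = 20
debug ms = 1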


Re: mds.0 crashed with 0.61.7

2013-07-29 Thread Andreas Bluemle
Hi Sage,

as this crash had been around for a while already: do you
know whether this had happened in ceph version 0.61.4 as well?


Best Regards

Andreas Bluemle


On Mon, 29 Jul 2013 08:47:00 -0700 (PDT)
Sage Weil s...@inktank.com wrote:

 Hi Andreas,
 
 Can you reproduce this (from mkcephfs onward) with debug mds = 20 and 
 debug ms = 1?  I've seen this crash several times but never been able
 to get to the bottom of it.
 
 Thanks!
 sage
 
 On Mon, 29 Jul 2013, Andreas Friedrich wrote:
 
  Hello,
  
  my Ceph test cluster runs fine with 0.61.4.
  
  I have removed all data and have setup a new cluster with 0.61.7
  using the same configuration (see ceph.conf).
  
  After
mkcephfs -c /etc/ceph/ceph.conf -a
/etc/init.d/ceph -a start
  the mds.0 crashed:
  
  -1 2013-07-29 17:02:57.626886 7fba2a8cd700  1 --
  10.0.0.231:6800/806 == osd.121 10.0.0.231:6834/5350 1 
  osd_op_reply(4 mds_snaptable [read 0~0] ack = -2 (No such file or
  directory)) v4  112+0+0 (2505332647 0 0) 0x13b7a30 con
  0x7fba20010200
   0 2013-07-29 17:02:57.627838 7fba2a8cd700 -1 mds/MDSTable.cc:
   0 In function 'void MDSTable::load_2(int, ceph::bufferlist,
   0 Context*)' thread 7fba2a8cd700 time 2013-07-29
   0 17:02:57.626907
  mds/MDSTable.cc: 150: FAILED assert(0)
  
   ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
   1: (MDSTable::load_2(int, ceph::buffer::list, Context*)+0x4cf)
  [0x6e398f] 2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe1e)
  [0x73c16e] 3: (MDS::handle_core_message(Message*)+0x93f) [0x4db2ff]
   4: (MDS::_dispatch(Message*)+0x2f) [0x4db3df]
   5: (MDS::ms_dispatch(Message*)+0x1a3) [0x4dd163]
   6: (DispatchQueue::entry()+0x399) [0x7ddd69]
   7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7d343d]
   8: (()+0x77b6) [0x7fba2f51e7b6]
   9: (clone()+0x6d) [0x7fba2e15dd6d]
   ...
  
  At this point I have no rbd, no cephfs, no ceph-fuse configured.
  
/etc/init.d/ceph -a stop
/etc/init.d/ceph -a start
  
  doesn't help.
  
  Any help would be appreciated.
  
  Andreas Friedrich
  --
  FUJITSU
  Fujitsu Technology Solutions GmbH
  Heinz-Nixdorf-Ring 1, 33106 Paderborn, Germany
  Tel: +49 (5251) 525-1512
  Fax: +49 (5251) 525-321512
  Email: andreas.friedr...@ts.fujitsu.com
  Web: ts.fujitsu.com
  Company details: de.ts.fujitsu.com/imprint
  --
  
 
 



-- 
Andreas Bluemle mailto:andreas.blue...@itxperts.de
Heinrich Boell Strasse 88   Phone: (+49) 89 4317582
D-81829 Muenchen (Germany)  Mobil: (+49) 177 522 0151


Re: mds.0 crashed with 0.61.7

2013-07-29 Thread Sage Weil
On Mon, 29 Jul 2013, Andreas Bluemle wrote:
 Hi Sage,
 
 as this crash had been around for a while already: do you
 know whether this had happened in ceph version 0.61.4 as well?

Pretty sure, yeah. 

sage

 
 
 Best Regards
 
 Andreas Bluemle
 
 
 On Mon, 29 Jul 2013 08:47:00 -0700 (PDT)
 Sage Weil s...@inktank.com wrote:
 
  Hi Andreas,
  
  Can you reproduce this (from mkcephfs onward) with debug mds = 20 and 
  debug ms = 1?  I've seen this crash several times but never been able
  to get to the bottom of it.
  
  Thanks!
  sage
  
  On Mon, 29 Jul 2013, Andreas Friedrich wrote:
  
   Hello,
   
   my Ceph test cluster runs fine with 0.61.4.
   
   I have removed all data and have setup a new cluster with 0.61.7
   using the same configuration (see ceph.conf).
   
   After
 mkcephfs -c /etc/ceph/ceph.conf -a
 /etc/init.d/ceph -a start
   the mds.0 crashed:
   
   -1 2013-07-29 17:02:57.626886 7fba2a8cd700  1 --
   10.0.0.231:6800/806 == osd.121 10.0.0.231:6834/5350 1 
   osd_op_reply(4 mds_snaptable [read 0~0] ack = -2 (No such file or
   directory)) v4  112+0+0 (2505332647 0 0) 0x13b7a30 con
   0x7fba20010200
0 2013-07-29 17:02:57.627838 7fba2a8cd700 -1 mds/MDSTable.cc:
0 In function 'void MDSTable::load_2(int, ceph::bufferlist,
0 Context*)' thread 7fba2a8cd700 time 2013-07-29
0 17:02:57.626907
   mds/MDSTable.cc: 150: FAILED assert(0)
   
ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
1: (MDSTable::load_2(int, ceph::buffer::list, Context*)+0x4cf)
   [0x6e398f] 2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe1e)
   [0x73c16e] 3: (MDS::handle_core_message(Message*)+0x93f) [0x4db2ff]
4: (MDS::_dispatch(Message*)+0x2f) [0x4db3df]
5: (MDS::ms_dispatch(Message*)+0x1a3) [0x4dd163]
6: (DispatchQueue::entry()+0x399) [0x7ddd69]
7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7d343d]
8: (()+0x77b6) [0x7fba2f51e7b6]
9: (clone()+0x6d) [0x7fba2e15dd6d]
...
   
   At this point I have no rbd, no cephfs, no ceph-fuse configured.
   
 /etc/init.d/ceph -a stop
 /etc/init.d/ceph -a start
   
   doesn't help.
   
   Any help would be appreciated.
   
   Andreas Friedrich
   --
   FUJITSU
   Fujitsu Technology Solutions GmbH
   Heinz-Nixdorf-Ring 1, 33106 Paderborn, Germany
   Tel: +49 (5251) 525-1512
   Fax: +49 (5251) 525-321512
   Email: andreas.friedr...@ts.fujitsu.com
   Web: ts.fujitsu.com
   Company details: de.ts.fujitsu.com/imprint
   --
   
  
  
 
 
 
 -- 
 Andreas Bluemle mailto:andreas.blue...@itxperts.de
 Heinrich Boell Strasse 88   Phone: (+49) 89 4317582
 D-81829 Muenchen (Germany)  Mobil: (+49) 177 522 0151
 
 


Re: Anyone in NYC next week?

2013-07-29 Thread Milosz Tanski
Just signed up, looking forward to it.

On Thu, Jul 25, 2013 at 5:18 PM, Travis Rhoden trho...@gmail.com wrote:
 I'm already signed up.  Looking forward to it!

  - Travis

 On Thu, Jul 25, 2013 at 12:19 AM, Sage Weil s...@inktank.com wrote:
 I'm going to be in NYC next week at our first Ceph Day of the summer.  If
 you're in town and want to hear more about what we're doing, you should
 join us!

 http://www.inktank.com/CEPHdays/

 sage


blueprint: object redirects

2013-07-29 Thread Sage Weil
I have a draft blueprint up for supporting object redirects, a 
basic building block that will be used for tiering in RADOS.  The basic 
idea is that an object may have symlink-like semantics indicating that it 
is stored in another pool... maybe something slower, or erasure-encoded, or 
whatever.  There will be basic librados functions to get redirect 
metadata, safely/atomically demote objects to another pool (turn them into 
a redirect), and promote objects back to the main pool.  Flags will let 
you control whether promotion happens automatically on write or possibly 
read.
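
Purely as an illustration of the kind of calls described above (the names
and signatures here are hypothetical, not part of the proposal or of the
existing librados API), the client-facing surface might look roughly like:

/* Hypothetical sketch only. */
/* Fetch redirect metadata for an object: target pool and target object. */
int rados_redirect_stat(rados_ioctx_t io, const char *oid,
                        char *target_pool, size_t pool_len,
                        char *target_oid, size_t oid_len);

/* Atomically demote an object: copy it to the target pool and replace the
 * original with a redirect entry pointing at the copy. */
int rados_redirect_demote(rados_ioctx_t io, const char *oid,
                          const char *target_pool, int flags);

/* Promote a redirected object back into the primary pool and remove the
 * redirect; flags could request automatic promotion on write (or read). */
int rados_redirect_promote(rados_ioctx_t io, const char *oid, int flags);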

http://wiki.ceph.com/01Planning/02Blueprints/Emperor/osd:_object_redirects

I'm not particularly happy with the complexity surrounding the tombstone 
state.  Hopefully we can come up with a simple way to make the client-side 
safely drive object deletion.

If you're interested in discussing this at CDS, please add your name to 
the blueprint so we can include you in the hangout!


krbd live resize

2013-07-29 Thread Loic Dachary
Hi,

This works:

lvcreate --name tmp --size 10G all
  Logical volume tmp created
mkfs.ext4 /dev/all/tmp
mount /dev/all/tmp /mnt
blockdev --getsize64 /dev/all/tmp
10737418240
lvextend -L+1G /dev/all/tmp
  Extending logical volume tmp to 11,00 GiB
  Logical volume tmp successfully resized
blockdev --getsize64 /dev/all/tmp
11811160064
resize2fs /dev/all/tmp 
resize2fs 1.41.12 (17-May-2010)
Filesystem at /dev/all/tmp is mounted on /mnt; on-line resizing required
old desc_blocks = 1, new_desc_blocks = 1
Performing an on-line resize of /dev/all/tmp to 2883584 (4k) blocks.
The filesystem on /dev/all/tmp is now 2883584 blocks long.

This does not work:

rbd create --size 10240 tmp
rbd info tmp
rbd image 'tmp':
size 10240 MB in 2560 objects
order 22 (4096 KB objects)
block_name_prefix: rb.0.12dd.238e1f29
format: 1
rbd map tmp
mkfs.ext4 /dev/rbd1
mount /dev/rbd1 /mnt
blockdev --getsize64 /dev/rbd1
10737418240
rbd resize --size 2 tmp
blockdev --getsize64 /dev/rbd1
10737418240
resize2fs /dev/rbd1 
resize2fs 1.42 (29-Nov-2011)
The filesystem is already 2621440 blocks long.  Nothing to do!

It does work after umounting:

umount /mnt
blockdev --getsize64 /dev/rbd1
fsck -f /dev/rbd1
resize2fs /dev/rbd1 
resize2fs 1.42 (29-Nov-2011)
Resizing the filesystem on /dev/rbd1 to 512 (4k) blocks.
The filesystem on /dev/rbd1 is now 512 blocks long.

I assume there should be something in krbd to allow for the same behavior as
with LVM, but I don't know enough about the kernel to be more specific. Maybe
something similar to ioctl BLKRRPART?
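
For reference, this is roughly what such a userspace nudge looks like for
partition tables today; a sketch only, and krbd would need an equivalent
for the device size itself (the device path is just an example):

/* resize_probe.c -- query the kernel's current idea of a block device's
 * size (what blockdev --getsize64 reports) and issue the BLKRRPART ioctl
 * mentioned above to re-read the partition table.  Illustrative sketch.
 */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/rbd1";
    unsigned long long size = 0;
    int fd = open(dev, O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ioctl(fd, BLKGETSIZE64, &size) == 0)
        printf("%s: %llu bytes\n", dev, size);
    if (ioctl(fd, BLKRRPART) != 0)   /* ask the kernel to re-read partitions */
        perror("BLKRRPART");
    close(fd);
    return 0;
}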

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.





blueprint: rgw multi-region disaster recovery, second phase

2013-07-29 Thread Yehuda Sadeh
I've created a blueprint for the second phase of the multiregion / DR project:

http://wiki.ceph.com/index.php?title=01Planning/02Blueprints/Emperor/RGW_Multi-region_%2F%2F_Disaster_Recovery_(phase_2)

While a huge amount of work was done for Dumpling, there's still some
work that needs to be done (mainly in the area of the disaster
recovery). If you're interested in discussing this at CDS, please add
yourself as an interested party to the blueprint.


Yehuda


blueprint: RADOS Object Temperature Monitoring

2013-07-29 Thread Samuel Just
I've created a blueprint for a RADOS level mechanism for discovering
cold objects.

http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/RADOS_Object_Temperature_Monitoring

Such a mechanism will be crucial to future tiering implementations.
If you are interested in discussing this at CDS, please add yourself
as an interested party to the blueprint!
-Sam


blueprint: rgw quota

2013-07-29 Thread Yehuda Sadeh
I created a blueprint for rgw bucket quotas. The document itself is
mainly a placeholder and a reference to the older bucket quota that we
prepared for Dumpling. If you're interested in discussing this at CDS,
please add yourself as an interested party to the blueprint.

http://wiki.ceph.com/01Planning/02Blueprints/Emperor/RGW_Bucket_Level_Quota

Yehuda


Fwd: [ceph-users] Small fix for ceph.spec

2013-07-29 Thread Patrick McGarry
-- Forwarded message --
From: Erik Logtenberg e...@logtenberg.eu
Date: Mon, Jul 29, 2013 at 7:07 PM
Subject: [ceph-users] Small fix for ceph.spec
To: ceph-us...@lists.ceph.com


Hi,

The spec file used for building rpm's misses a build time dependency on
snappy-devel. Please see attached patch to fix.

Kind regards,

Erik.

--- ceph.spec-orig	2013-07-30 00:24:54.70500 +0200
+++ ceph.spec	2013-07-30 00:25:34.19900 +0200
@@ -42,6 +42,7 @@
 BuildRequires:  libxml2-devel
 BuildRequires:  libuuid-devel
 BuildRequires:  leveldb-devel > 1.2
+BuildRequires:  snappy-devel
 
 #
 # specific


blueprint: librgw

2013-07-29 Thread Yehuda Sadeh
I created another blueprint for defining and creating a library for
rgw. This is also just a placeholder and a pointer at an older
blueprint.

http://wiki.ceph.com/01Planning/02Blueprints/Emperor/librgw

If you wish to discuss this at CDS, please add yourself to the blueprint.

Yehuda


blueprint: rgw bucket scalability

2013-07-29 Thread Yehuda Sadeh
I created a new blueprint that discusses rgw bucket scalability:

http://wiki.ceph.com/01Planning/02Blueprints/Emperor/rgw:_bucket_index_scalability

As was brought up on the mailing list recently, bucket index may serve
as a contention point. There are a few suggestions in how to solve /
mitigate the issue, and we'd like to discuss these at CDS. If you want
to participate in the discussion, please add yourself to the blueprint
as an interested party.

Yehuda


blueprint: rgw multitenancy

2013-07-29 Thread Yehuda Sadeh
I created a new blueprint that discusses rgw multitenancy. The rgw
multitenancy defines a level of hierarchy on top of users and their
data which provides the ability to separate the users into different
organizational entities.

http://wiki.ceph.com/01Planning/02Blueprints/Emperor/rgw:_multitenancy

As with the other blueprints, if you wish to participate please add
yourself as an interested party to the blueprint.

Yehuda


Re: Re: question about striped_read

2013-07-29 Thread majianpeng
On Mon, Jul 29, 2013 at 11:00 AM, majianpeng majianp...@gmail.com wrote:

 [snip]
 I don't think the later was_short can handle the hole case. For the hole 
 case,
 we should try reading next strip object instead of return. how about
 below patch.
 
 Hi Yan,
 i uesed this demo to test hole case.
 dd if=/dev/urandom bs=4096 count=2 of=file_with_holes
 dd if=/dev/urandom bs=4096 seek=7 count=2 of=file_with_holes

 dd if=file_with_holes of=/dev/null bs=16k count=1 iflag=direct
 Using the dynamic_debug in striped_read,  the message are:
 [ 8743.663499] ceph:   file.c:350  : striped_read 0~16384 (read 0) got 16384
 [ 8743.663502] ceph:   file.c:390  : striped_read returns 16384
 From the messages, we can see it can't hit the short-read.
 For the ceph-file-hole, how does the ceph handle?
 Or am i missing something?

the default strip size is 4M, all data are written to the first object
in your test case.
could you try something like below.

dd if=/dev/urandom bs=1M count=2 of=file_with_holes
dd if=/dev/urandom bs=1M count=2 seek=4 of=file_with_holes conv=notrunc
 dd if=file_with_holes bs=8M > /dev/null


From the above test, I think your patch is right.
Although the original code can work, it calls striped_read multiple times.
As you said, for a short read within a stripe it doesn't make sense to
return rather than read the next stripe.
But can you add some comments for this?
There are two reasons for a short read: EOF or hitting a hole.
And for the hit-a-hole case there are several different sub-cases; I'm not
sure about those.

Thanks!
Jianpeng Ma
Regards
Yan, Zheng




 Thanks!
 Jianpeng Ma

 Regards
 Yan, Zheng
 ---
 diff --git a/fs/ceph/file.c b/fs/ceph/file.c
 index 271a346..6ca2921 100644
 --- a/fs/ceph/file.c
 +++ b/fs/ceph/file.c
 @@ -350,16 +350,17 @@ more:
              ret, hit_stripe ? " HITSTRIPE" : "", was_short ? " SHORT" : "");
 
         if (ret > 0) {
 -               int didpages = (page_align + ret) >> PAGE_CACHE_SHIFT;
 +               int didpages = (page_align + this_len) >> PAGE_CACHE_SHIFT;
 
 -               if (read < pos - off) {
 -                       dout(" zero gap %llu to %llu\n", off + read, pos);
 -                       ceph_zero_page_vector_range(page_align + read,
 -                                                   pos - off - read, pages);
 +               if (was_short) {
 +                       dout(" zero gap %llu to %llu\n",
 +                            pos + ret, pos + this_len);
 +                       ceph_zero_page_vector_range(page_align + ret,
 +                                                   this_len - ret, page_pos);
                 }
 -               pos += ret;
 +               pos += this_len;
                 read = pos - off;
 -               left -= ret;
 +               left -= this_len;
                 page_pos += didpages;
                 pages_left -= didpages;
Thanks!
Jianpeng Ma

Re: Re: question about striped_read

2013-07-29 Thread Yan, Zheng
On Tue, Jul 30, 2013 at 10:08 AM, majianpeng majianp...@gmail.com wrote:
On Mon, Jul 29, 2013 at 11:00 AM, majianpeng majianp...@gmail.com wrote:

 [snip]
 I don't think the later was_short can handle the hole case. For the hole 
 case,
 we should try reading next strip object instead of return. how about
 below patch.
 
 Hi Yan,
 i uesed this demo to test hole case.
 dd if=/dev/urandom bs=4096 count=2 of=file_with_holes
 dd if=/dev/urandom bs=4096 seek=7 count=2 of=file_with_holes

 dd if=file_with_holes of=/dev/null bs=16k count=1 iflag=direct
 Using the dynamic_debug in striped_read,  the message are:
 [ 8743.663499] ceph:   file.c:350  : striped_read 0~16384 (read 0) 
 got 16384
 [ 8743.663502] ceph:   file.c:390  : striped_read returns 16384
 From the messages, we can see it can't hit the short-read.
 For the ceph-file-hole, how does the ceph handle?
 Or am i missing something?

the default strip size is 4M, all data are written to the first object
in your test case.
could you try something like below.

dd if=/dev/urandom bs=1M count=2 of=file_with_holes
dd if=/dev/urandom bs=1M count=2 seek=4 of=file_with_holes conv=notrunc
 dd if=file_with_holes bs=8M > /dev/null


 From above test, i think your patch is right.
 Although, the original code can work but it  call multi striped_read.

For test case
---
dd if=/dev/urandom bs=1M count=2 of=file_with_holes
dd if=/dev/urandom bs=1M count=2 seek=4 of=file_with_holes conv=notrunc
dd if=file_with_holes bs=8M iflag=direct > /dev/null

I got
---
ceph:  striped_read 0~8388608 (read 0) got 2097152 HITSTRIPE SHORT
ceph:  striped_read 2097152~6291456 (read 2097152) got 0 HITSTRIPE SHORT
ceph:  zero tail 4194304
ceph:  striped_read returns 6291456
ceph:  sync_read result 6291456
ceph:  aio_read 88000fb22f98 1193e8c.fffe dropping cap refs on Fcr = 6291456

The original code zeros data in the range 2M~6M, which is obviously incorrect.

 As your said for stripe short-read,it doesn't make sense to return rather 
 than reading next stripe.
 But can you add some comments for this?
 The short-read reasongs are two:EOF or hit-hole.
 But for hit-hole there are some differents case. For that i don't know.


For hit-hole, there is only one case: the strip object's size is smaller
than 4M. When reading a strip object, if the returned data is less than we
expected, we need to check whether the following strip objects have data.

I think neither the original code nor my patch handles the case below properly.

| object 0 |  hole  |  hole |  object 3 |
dd if=testfile iflag=direct bs=16M > /dev/null
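
A file with that layout could be created along these lines, assuming the
default 4M object size (the file name is just an example):

dd if=/dev/urandom of=testfile bs=4M count=1
dd if=/dev/urandom of=testfile bs=4M count=1 seek=3 conv=notrunc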

Could you write a patch, do some tests and submit it.

Regards
Yan, Zheng