Re: [ceph-users] trouble starting second monitor
[celtic][DEBUG ] create the mon path if it does not exist

    mkdir /var/lib/ceph/mon/

2014-12-01 4:32 GMT+03:00 K Richard Pixley <r...@noir.com>:

> What does this mean, please? --rich
>
>     ceph@adriatic:~/my-cluster$ ceph status
>         cluster 1023db58-982f-4b78-b507-481233747b13
>          health HEALTH_OK
>          monmap e1: 1 mons at {black=192.168.1.77:6789/0}, election epoch 2, quorum 0 black
>          mdsmap e7: 1/1/1 up {0=adriatic=up:active}, 3 up:standby
>          osdmap e17: 4 osds: 4 up, 4 in
>           pgmap v48: 192 pgs, 3 pools, 1884 bytes data, 20 objects
>                 29134 MB used, 113 GB / 149 GB avail
>                      192 active+clean
>
>     ceph@adriatic:~/my-cluster$ ceph-deploy mon create celtic
>     [ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf
>     [ceph_deploy.cli][INFO ] Invoked (1.5.20): /usr/bin/ceph-deploy mon create celtic
>     [ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts celtic
>     [ceph_deploy.mon][DEBUG ] detecting platform for host celtic ...
>     [celtic][DEBUG ] connection detected need for sudo
>     [celtic][DEBUG ] connected to host: celtic
>     [celtic][DEBUG ] detect platform information from remote host
>     [celtic][DEBUG ] detect machine type
>     [ceph_deploy.mon][INFO ] distro info: Ubuntu 14.04 trusty
>     [celtic][DEBUG ] determining if provided host has same hostname in remote
>     [celtic][DEBUG ] get remote short hostname
>     [celtic][DEBUG ] deploying mon to celtic
>     [celtic][DEBUG ] get remote short hostname
>     [celtic][DEBUG ] remote hostname: celtic
>     [celtic][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
>     [celtic][DEBUG ] create the mon path if it does not exist
>     [celtic][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-celtic/done
>     [celtic][DEBUG ] create a done file to avoid re-doing the mon deployment
>     [celtic][DEBUG ] create the init path if it does not exist
>     [celtic][DEBUG ] locating the `service` executable...
>     [celtic][INFO ] Running command: sudo initctl emit ceph-mon cluster=ceph id=celtic
>     [celtic][INFO ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.celtic.asok mon_status
>     [celtic][ERROR ] admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
>     [celtic][WARNIN] monitor: mon.celtic, might not be running yet
>     [celtic][INFO ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.celtic.asok mon_status
>     [celtic][ERROR ] admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
>     [celtic][WARNIN] celtic is not defined in `mon initial members`
>     [celtic][WARNIN] monitor celtic does not exist in monmap
>     [celtic][WARNIN] neither `public_addr` nor `public_network` keys are defined for monitors
>     [celtic][WARNIN] monitors may not be able to form quorum

--
Regards,
Фасихов Ирек Нургаязович
Mob.: +79229045757
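The four WARNIN lines at the end point at the actual problem: celtic is not in `mon initial members` and no public network is defined, so the new mon cannot join the existing monmap. A minimal sketch of the usual fix follows; the 192.168.1.0/24 subnet is an assumption inferred from the mon address above, so adjust names and network to your cluster:

    # ceph.conf on the admin node (subnet/hostnames assumed)
    [global]
    mon_initial_members = black, celtic
    public_network = 192.168.1.0/24

    # push the updated config, then retry
    ceph-deploy --overwrite-conf config push black celtic
    ceph-deploy mon create celtic

On an already-running cluster, `ceph-deploy mon add celtic` may be the more appropriate subcommand than `mon create`.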
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Mon, Dec 1, 2014 at 12:30 AM, Andrei Mikhailovsky <and...@arhont.com> wrote:

> Ilya, further to your email I have switched back to the 3.18 kernel that
> you've sent and I got similar looking dmesg output as I had on the 3.17
> kernel. Please find it attached for your reference.
>
> As before, this is the command I've run on the client:
>
>     time dd if=/dev/zero of=4G00 bs=4M count=5K oflag=direct
>     time dd if=/dev/zero of=4G11 bs=4M count=5K oflag=direct
>     time dd if=/dev/zero of=4G22 bs=4M count=5K oflag=direct
>     time dd if=/dev/zero of=4G33 bs=4M count=5K oflag=direct
>     time dd if=/dev/zero of=4G44 bs=4M count=5K oflag=direct
>     time dd if=/dev/zero of=4G55 bs=4M count=5K oflag=direct
>     time dd if=/dev/zero of=4G66 bs=4M count=5K oflag=direct
>     time dd if=/dev/zero of=4G77 bs=4M count=5K oflag=direct

Can you run that command again on the 3.18 kernel, to completion, and paste:

- the entire dmesg
- the time results for each dd

Compare those to your results with four dds (or any other number which
doesn't trigger page allocation failures).

Thanks,

    Ilya
Re: [ceph-users] Fastest way to shrink/rewrite rbd image ?
> I think if you enable TRIM support on your RBD, then run fstrim on your
> filesystems inside the guest (assuming ext4 / XFS guest filesystem), Ceph
> should reclaim the trimmed space.

Yes, it's working fine. (You need to use virtio-scsi and enable the discard
option.)

----- Original Message -----
From: Daniel Swarbrick <daniel.swarbr...@profitbricks.com>
To: ceph-users@lists.ceph.com
Sent: Friday, 28 November 2014 17:16:14
Subject: Re: [ceph-users] Fastest way to shrink/rewrite rbd image ?

Take a look at
http://ceph.com/docs/master/rbd/qemu-rbd/#enabling-discard-trim

I think if you enable TRIM support on your RBD, then run fstrim on your
filesystems inside the guest (assuming ext4 / XFS guest filesystem), Ceph
should reclaim the trimmed space.

On 28/11/14 17:05, Christoph Adomeit wrote:
> Hi,
>
> I would like to shrink a thin provisioned rbd image which has grown to
> maximum. 90% of the data in the image is deleted data which is still
> hidden in the image and marked as deleted.
>
> So I think I can fill the whole image with zeroes and then qemu-img
> convert it, so the newly created image should be only 10% of the maximum
> size. I will do something like:
>
>     qemu-img convert -O raw rbd:pool/origimage rbd:pool/smallimage
>     rbd rename origimage origimage-saved
>     rbd rename smallimage origimage
>
> Would this be the best and fastest way, or are there other ways to do this?
>
> Thanks
>   Christoph
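For reference, a minimal sketch of the virtio-scsi + discard setup described above, as a libvirt domain XML fragment; the pool/image names are placeholders, and discard='unmap' requires a reasonably recent QEMU/libvirt:

    <controller type='scsi' model='virtio-scsi'/>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' discard='unmap'/>
      <source protocol='rbd' name='pool/vmdisk'/>
      <target dev='sda' bus='scsi'/>
    </disk>

Inside the guest, running `fstrim -v /` on each mounted filesystem then releases the deleted blocks back to Ceph.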
[ceph-users] Removing Snapshots Killing Cluster Performance
Hi!

We take regular (nightly) snapshots of our Rados Gateway pools for backup
purposes. This allows us - with some manual pokery - to restore clients'
documents should they delete them accidentally.

The cluster is a 4 server setup with 12x4TB spinning disks each, totaling
about 175TB. We are running Firefly.

We have now completed our first month of snapshots and want to remove the
oldest ones. Unfortunately, doing so practically kills everything else that
is using the cluster, because performance drops to almost zero while the
OSDs work their disks at 100% (as per iostat). It seems this is the same
phenomenon I asked about some time ago when we were deleting whole pools. I
could not find any way to throttle the background deletion activity (the
command returns almost immediately).

Here is a graph of the I/O operations waiting (colored by device) while
deleting a few snapshots. Each of the blocks in the graph shows one snapshot
being removed. The big one in the middle was a snapshot of the .rgw.buckets
pool. It took about 15 minutes, during which basically nothing relying on
the cluster was working due to immense slowdowns. This included users
getting kicked off their SSH sessions due to timeouts.

https://public.centerdevice.de/8c95f1c2-a7c3-457f-83b6-834688e0d048

While this is a big issue in itself for us, we would at least like to
estimate how long the process will take per snapshot / per pool. I assume
the time needed is a function of the number of objects that were modified
between two snapshots. We tried to get an idea of at least how many objects
were added/removed in total by running `rados df` with a snapshot specified
as a parameter, but it seems we still always get the current values:

    $ sudo rados -p .rgw df --snap backup-20141109
    selected snap 13 'backup-20141109'
    pool name    category    KB        objects
    .rgw         -           276165    1368545

    $ sudo rados -p .rgw df --snap backup-20141124
    selected snap 28 'backup-20141124'
    pool name    category    KB        objects
    .rgw         -           276165    1368546

    $ sudo rados -p .rgw df
    pool name    category    KB        objects
    .rgw         -           276165    1368547

So there are a few questions:

1) Is there any way to control how much such an operation will tax the
   cluster (we would be happy to have it run longer, if that meant not
   utilizing all disks fully during that time)?
2) Is there a way to get a decent approximation of how much work deleting a
   specific snapshot will entail (in terms of objects, time, whatever)?
3) Would SSD journals help here? Or any other hardware configuration change
   for that matter?
4) Any other recommendations?

We definitely need to remove the data - not because of a lack of space (at
least not at the moment), but because when customers delete stuff / cancel
accounts, we are obliged to remove their data at least after a reasonable
amount of time.

Cheers,
Daniel
Re: [ceph-users] Removing Snapshots Killing Cluster Performance
Hi,

Which version of Ceph are you using? This could be related:
http://tracker.ceph.com/issues/9487

See "ReplicatedPG: don't move on to the next snap immediately"; basically,
the OSD is getting into a tight loop trimming the snapshot objects. The fix
above breaks out of that loop more frequently, and then you can use the
"osd snap trim sleep" option to throttle it further. I'm not sure the fix
above will be sufficient if you have many objects to remove per snapshot.

That commit is only in Giant at the moment. The backport to Dumpling is in
the dumpling branch but not yet in a release, and the Firefly backport is
still pending.

Cheers, Dan

On 01 Dec 2014, at 10:51, Daniel Schneller
<daniel.schnel...@centerdevice.com> wrote:
> [...]
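For the archive: once a release carrying the fix is running, the throttle Dan mentions can be applied roughly like this (the 0.1s value is an illustrative starting point, not a tested recommendation):

    # inject at runtime on all OSDs
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1'

    # or persist it in ceph.conf
    [osd]
    osd snap trim sleep = 0.1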
Re: [ceph-users] Compile from source with Kinetic support
I'm sorry, but the compilation still fails after including the cpp-client
headers:

      CXX      os/libos_la-KeyValueDB.lo
    os/KeyValueDB.cc: In static member function 'static KeyValueDB* KeyValueDB::create(CephContext*, const string&, const string&)':
    os/KeyValueDB.cc:18:16: error: expected type-specifier before 'KineticStore'
         return new KineticStore(cct);
                    ^
    os/KeyValueDB.cc:18:16: error: expected ';' before 'KineticStore'
    os/KeyValueDB.cc:18:32: error: 'KineticStore' was not declared in this scope
         return new KineticStore(cct);
                                    ^
    os/KeyValueDB.cc: In static member function 'static int KeyValueDB::test_init(const string&, const string&)':
    os/KeyValueDB.cc:36:12: error: 'KineticStore' has not been declared
         return KineticStore::_test_init(g_ceph_context);
                ^
      CXX      os/libos_la-KeyValueStore.lo
    make[3]: *** [os/libos_la-KeyValueDB.lo] Error 1
    make[3]: *** Waiting for unfinished jobs
    In file included from os/KeyValueStore.cc:53:0:
    os/KineticStore.h:13:29: fatal error: kinetic/kinetic.h: No such file or directory
     #include <kinetic/kinetic.h>
                                 ^
    compilation terminated.
    make[3]: *** [os/libos_la-KeyValueStore.lo] Error 1

--
Julien

On 11/28/2014 08:54 PM, Nigel Williams wrote:
> On Sat, Nov 29, 2014 at 5:19 AM, Julien Lutran <julien.lut...@ovh.net> wrote:
>> Where can I find this kinetic devel package ?
>
> I guess you want this (C++ kinetic client)? It has kinetic.h at least.
> https://github.com/Seagate/kinetic-cpp-client
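For what it's worth, a build sequence along the following lines should make the header visible to the Ceph tree; this is a sketch that assumes the kinetic-cpp-client installs system-wide and that the configure switch matches the WITH_KINETIC guard in the source - verify both against your checkout:

    git clone https://github.com/Seagate/kinetic-cpp-client
    cd kinetic-cpp-client
    cmake . && make && sudo make install   # should provide kinetic/kinetic.h
    cd ../ceph
    ./autogen.sh && ./configure --with-kinetic && make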
[ceph-users] LevelDB support status is still experimental on Giant?
Hi guys,

I'm interested in using a key/value store as the backend of a Ceph OSD.
When Firefly was released, LevelDB support was mentioned as experimental;
is the status the same in the Giant release?

Regards,

Satoru Funai
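For context, enabling the experimental backend looked roughly like the following in the Firefly/Giant era; treat the option names as assumptions and check the release notes for your exact version:

    # ceph.conf -- experimental, not for data you care about
    [osd]
    osd objectstore = keyvaluestore-dev   # instead of the default filestore
    keyvaluestore backend = leveldb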
Re: [ceph-users] Giant + nfs over cephfs hang tasks
Ilya,

I will try doing that once again tonight, as this is a production cluster,
and when the dds trigger that dmesg error the cluster's I/O becomes very
bad and I have to reboot the server to get things back on track. Most of my
vms start having 70-90% iowait until that server is rebooted.

I've actually checked what you've asked the last time I ran the test. When
I do 4 dds concurrently, nothing appears in the dmesg output - no messages
at all. The kern.log file that I sent last time is what I got about a
minute after I started 8 dds. I've pasted the full output. The 8 dds did
actually complete, but it took a rather long time. I was getting about
6MB/s per dd process, compared to around 70MB/s per dd process when 4 dds
were running.

Do you still want me to run this, or is the information I've provided
enough?

Cheers

Andrei

----- Original Message -----
From: Ilya Dryomov <ilya.dryo...@inktank.com>
To: Andrei Mikhailovsky <and...@arhont.com>
Cc: ceph-users <ceph-users@lists.ceph.com>, Gregory Farnum <g...@gregs42.com>
Sent: Monday, 1 December, 2014 8:22:08 AM
Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks

> [...]
Re: [ceph-users] large reads become 512 kbyte reads on qemu-kvm rbd
On Mon, Dec 1, 2014 at 1:09 PM, Dan Van Der Ster
<daniel.vanders...@cern.ch> wrote:
> Hi Ilya,
>
> On 28 Nov 2014, at 17:56, Ilya Dryomov <ilya.dryo...@inktank.com> wrote:
>> On Fri, Nov 28, 2014 at 5:46 PM, Dan Van Der Ster
>> <daniel.vanders...@cern.ch> wrote:
>>> Hi Andrei,
>>> Yes, I'm testing from within the guest. Here is an example. First, I do
>>> 2MB reads when max_sectors_kb=512, and we see the reads are split into 4
>>> (fio sees 25 iops, though iostat reports 100 smaller iops):
>>>
>>>     # echo 512 > /sys/block/vdb/queue/max_sectors_kb   # this is the default
>>>     # fio --readonly --name /dev/vdb --rw=read --size=1G --ioengine=libaio --direct=1 --runtime=10s --blocksize=2m
>>>     /dev/vdb: (g=0): rw=read, bs=2M-2M/2M-2M/2M-2M, ioengine=libaio, iodepth=1
>>>     fio-2.0.13
>>>     Starting 1 process
>>>     Jobs: 1 (f=1): [R] [100.0% done] [51200K/0K/0K /s] [25 /0 /0 iops] [eta 00m:00s]
>>>
>>> Meanwhile iostat is reporting 100 iops of average size 1024 sectors
>>> (i.e. 512kB):
>>>
>>>     Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
>>>     vdb        0.00    0.00 100.00  0.00  50.00   0.00   1024.00      3.02  30.25  10.00 100.00
>>>
>>> Now increase max_sectors_kb to 4MB, and the IOs are no longer split:
>>>
>>>     # echo 4096 > /sys/block/vdb/queue/max_sectors_kb
>>>     # fio --readonly --name /dev/vdb --rw=read --size=1G --ioengine=libaio --direct=1 --runtime=10s --blocksize=2m
>>>     /dev/vdb: (g=0): rw=read, bs=2M-2M/2M-2M/2M-2M, ioengine=libaio, iodepth=1
>>>     fio-2.0.13
>>>     Starting 1 process
>>>     Jobs: 1 (f=1): [R] [100.0% done] [200.0M/0K/0K /s] [100 /0 /0 iops] [eta 00m:00s]
>>>
>>> iostat reports 100 iops, 4096 sectors each read (i.e. 2MB):
>>>
>>>     Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
>>>     vdb      300.00    0.00 100.00  0.00 200.00   0.00   4096.00      0.99   9.94   9.94  99.40
>>
>> We set the hard request size limit to rbd object size (4M typically):
>>
>>     blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
>
> Are you referring to librbd or krbd? My observations are limited to
> librbd at the moment. (I didn't try this on krbd.)

Yes, I was referring to krbd. But it looks like that patch from Christoph
will change this for qemu+librbd as well - an artificial soft limit imposed
by the VM kernel will disappear. CC'ing Josh.

>> but the block layer then sets the soft limit for fs requests to 512K:
>>
>>     BLK_DEF_MAX_SECTORS = 1024,
>>     limits->max_sectors = min_t(unsigned int, max_hw_sectors, BLK_DEF_MAX_SECTORS);
>>
>> which you are supposed to change on a per-device basis via sysfs. We
>> could probably raise the soft limit to rbd object size by default as
>> well - I don't see any harm in that.
>
> Indeed, this patch which was being targeted for 3.19:
> https://lkml.org/lkml/2014/9/6/123

Oh good, I was just about to send a patch for krbd.

Thanks,

    Ilya
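Until such a default lands, the sysfs knob has to be reapplied after every boot; one way to persist it is a udev rule, sketched here for the vdb example above (rule file name and device match are assumptions):

    # /etc/udev/rules.d/99-rbd-max-sectors.rules
    ACTION=="add|change", KERNEL=="vd[a-z]", ATTR{queue/max_sectors_kb}="4096"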
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Mon, Dec 1, 2014 at 1:39 PM, Andrei Mikhailovsky <and...@arhont.com> wrote:
> Ilya,
>
> I will try doing that once again tonight, as this is a production cluster,
> and when the dds trigger that dmesg error the cluster's I/O becomes very
> bad and I have to reboot the server to get things back on track. Most of
> my vms start having 70-90% iowait until that server is rebooted.

That's easily explained - those splats in dmesg indicate a case of severe
memory pressure.

> [...]
>
> Do you still want me to run this, or is the information I've provided
> enough?

No, no need if it's a production cluster.

Thanks,

    Ilya
Re: [ceph-users] Giant + nfs over cephfs hang tasks
Ilya,

I see. My server has 24GB of RAM + 3GB of swap. While running the tests,
I've noticed that the server had 14GB of RAM shown as cached and only 2MB
used from the swap. Not sure if this is helpful to your debugging.

Andrei

--
Andrei Mikhailovsky
Director, Arhont Information Security

----- Original Message -----
From: Ilya Dryomov <ilya.dryo...@inktank.com>
To: Andrei Mikhailovsky <and...@arhont.com>
Cc: ceph-users <ceph-users@lists.ceph.com>, Gregory Farnum <g...@gregs42.com>
Sent: Monday, 1 December, 2014 11:06:37 AM
Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks

> [...]
Re: [ceph-users] Fastest way to shrink/rewrite rbd image ?
On 01/12/14 10:22, Alexandre DERUMIER wrote:
> Yes, it's working fine. (You need to use virtio-scsi and enable the
> discard option.)

Does it work with virtio-blk if you attach the RBD as a LUN? Supposedly,
SCSI pass-through works in this mode, e.g.:

    <disk type='block' device='lun'>
      <target dev='vda' bus='virtio'/>
      ...
    </disk>

However, it seems that virtio-scsi is slowly becoming preferred over
virtio-blk. Are there any disadvantages to using virtio-scsi now? Does it
support live migration?
Re: [ceph-users] Removing Snapshots Killing Cluster Performance
On 2014-12-01 10:03:35 +0000, Dan Van Der Ster said:

> Which version of Ceph are you using? This could be related:
> http://tracker.ceph.com/issues/9487

Firefly. I had seen this ticket earlier (when deleting a whole pool) and
hoped the backport of the fix would be available some time soon. I must
admit I did not look this up before posting, because I had forgotten about
it.

> See "ReplicatedPG: don't move on to the next snap immediately"; basically,
> the OSD is getting into a tight loop trimming the snapshot objects. The
> fix above breaks out of that loop more frequently, and then you can use
> the "osd snap trim sleep" option to throttle it further. I'm not sure the
> fix above will be sufficient if you have many objects to remove per
> snapshot.

Just so I get this right: with the fix alone you are not sure it would be
nice enough, so adjusting the snap trim sleep option in addition might be
needed? I assume the loop that will be broken up with 9487 does not take
the sleep time into account?

> That commit is only in Giant at the moment. The backport to Dumpling is
> in the dumpling branch but not yet in a release, and the Firefly backport
> is still pending.

Holding my breath :)

Any thoughts on the other items I had in the original post?

> 2) Is there a way to get a decent approximation of how much work deleting
>    a specific snapshot will entail (in terms of objects, time, whatever)?
> 3) Would SSD journals help here? Or any other hardware configuration
>    change for that matter?

Thanks!
Daniel
Re: [ceph-users] Ceph Degraded
Hi Andrei!

I had a similar setting with replicated size 2 and min_size also 2.
Changing that didn't change the status of the cluster. I've also tried to
remove the pools and recreate them, without success. Removing and re-adding
the OSDs also didn't have any influence! Therefore, and since I didn't have
any data at all, I performed a forced recreate on all PGs, and after that
things went back to normal.

Thanks for your reply!

Best,

George

On Sat, 29 Nov 2014 11:39:51 +0000 (GMT), Andrei Mikhailovsky wrote:
> I think I had a similar issue recently when I added a new pool. All pgs
> that corresponded to the new pool were shown as degraded/unclean. After
> doing a bit of testing I realized that my issue was down to this:
>
>     replicated size 2
>     min_size 2
>
> The replicated size and min size were the same. In my case, I've got 2
> osd servers with a total replica count of 2. The min size should be set
> to 1, so that the cluster would still work with at least one replica
> available. After I changed min_size to 1, the cluster sorted itself out.
> Try doing this for your pools.
>
> Andrei
>
> ----- FROM: Georgios Dimitrakakis
> TO: ceph-users@lists.ceph.com
> SENT: Saturday, 29 November, 2014 11:13:05 AM
> SUBJECT: [ceph-users] Ceph Degraded
>
>> Hi all!
>>
>> I am setting up a new cluster with 10 OSDs and the state is degraded!
>>
>>     # ceph health
>>     HEALTH_WARN 940 pgs degraded; 1536 pgs stuck unclean
>>
>> There are only the default pools:
>>
>>     # ceph osd lspools
>>     0 data,1 metadata,2 rbd,
>>
>> with each one having 512 pg_num and 512 pgp_num:
>>
>>     # ceph osd dump | grep replic
>>     pool 0 'data' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 286 flags hashpspool crash_replay_interval 45 stripe_width 0
>>     pool 1 'metadata' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 287 flags hashpspool stripe_width 0
>>     pool 2 'rbd' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 288 flags hashpspool stripe_width 0
>>
>> No data yet, so is there something I can do to repair it as it is?
>>
>> Best regards,
>>
>> George
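For later readers, the two remedies discussed in this thread look roughly like this (pool names taken from the quoted output above; note that force_create_pg discards whatever data a PG had, so it is only sensible on an empty cluster or for PGs whose data is already lost):

    # relax min_size on the affected pools
    ceph osd pool set data min_size 1
    ceph osd pool set metadata min_size 1
    ceph osd pool set rbd min_size 1

    # last resort on an empty cluster: force-recreate a stuck PG
    ceph health detail | grep stuck    # find the PG IDs
    ceph pg force_create_pg 0.1a       # placeholder PG ID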
Re: [ceph-users] Fastest way to shrink/rewrite rbd image ?
> Does it work with virtio-blk if you attach the RBD as a LUN?

virtio-blk doesn't support discard and trimming.

> Supposedly, SCSI pass-through works in this mode, e.g.

SCSI pass-through works only with virtio-scsi, not virtio-blk.

> However, it seems that virtio-scsi is slowly becoming preferred over
> virtio-blk. Are there any disadvantages to using virtio-scsi now?

It's a little bit slower sometimes (but it can be faster than virtio-blk
with multiqueue and iSCSI passthrough). With librbd, I see a small slowdown
vs virtio-blk (maybe 20% slower).

> Does it support live migration?

Yes, of course.

----- Original Message -----
From: Daniel Swarbrick <daniel.swarbr...@profitbricks.com>
To: ceph-users@lists.ceph.com
Sent: Monday, 1 December 2014 13:32:15
Subject: Re: [ceph-users] Fastest way to shrink/rewrite rbd image ?

> [...]
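To complement the libvirt fragment earlier in this digest, the equivalent raw QEMU invocation is sketched below; image and device IDs are placeholders, and discard=unmap needs QEMU 1.5 or newer as far as I recall:

    qemu-system-x86_64 ... \
      -device virtio-scsi-pci,id=scsi0 \
      -drive file=rbd:pool/vmdisk,format=raw,if=none,id=drive0,discard=unmap \
      -device scsi-hd,bus=scsi0.0,drive=drive0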
Re: [ceph-users] LevelDB support status is still experimental on Giant?
Yeah, it's mainly used in test environments.

On Mon, Dec 1, 2014 at 6:29 PM, Satoru Funai <satoru.fu...@gmail.com> wrote:
> [...]

--
Best Regards,

Wheat
Re: [ceph-users] LevelDB support status is still experimental on Giant?
We have tested it for a while; basically it seems kind of stable but shows
terribly bad performance. This is not the fault of Ceph but of LevelDB -
or, more generally, of all K-V storage with an LSM design (RocksDB, etc.).
The LSM tree structure naturally introduces very large write amplification,
10X to 20X, when you have tens of GB of data per OSD. So you always see
very bad sequential write performance (~200MB/s for a 12-SSD setup). We can
share more details in the performance meeting.

To this end, a key-value backend with LevelDB is not usable for RBD, but
it may be workable (not tested) in LOSF cases (tons of small objects stored
via rados, where a K-V backend can prevent the FS metadata from becoming
the bottleneck).

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Haomai Wang
Sent: Monday, December 1, 2014 9:48 PM
To: Satoru Funai
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] LevelDB support status is still experimental on Giant?

> [...]
[ceph-users] do I have to use sudo for CEPH install
Hi.

Do I have to install sudo in Debian Wheezy to deploy Ceph successfully? I
don't normally use sudo.

Thank you,

Jiri
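For reference: when run as a non-root user, ceph-deploy invokes sudo on the remote nodes (the "connection detected need for sudo" lines elsewhere in this digest show exactly that), so in practice you want it installed. A minimal passwordless setup sketch, assuming the deploy user is called "ceph":

    apt-get install sudo
    echo "ceph ALL = (root) NOPASSWD:ALL" > /etc/sudoers.d/ceph
    chmod 0440 /etc/sudoers.d/ceph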
Re: [ceph-users] LevelDB support status is still experimental on Giant?
Exactly. I'm just looking forward to a better DB backend suitable for
KeyValueStore - maybe a traditional B-tree design. I originally thought
Kinetic would be a good backend, but it doesn't support range queries :-(

On Mon, Dec 1, 2014 at 10:04 PM, Chen, Xiaoxi <xiaoxi.c...@intel.com> wrote:
> [...]

--
Best Regards,

Wheat
Re: [ceph-users] Compile from source with Kinetic support
Hmm, src/os/KeyValueDB.cc lacks these lines:

    #ifdef WITH_KINETIC
    #include "KineticStore.h"
    #endif

On Mon, Dec 1, 2014 at 6:14 PM, Julien Lutran <julien.lut...@ovh.net> wrote:
> [...]

--
Best Regards,

Wheat
Re: [ceph-users] LevelDB support status is still experimental on Giant?
Range query is not that important on today's SSDs; you can see very high
random read IOPS in SSD specs, and they get higher day by day. The key
problem here is trying to match one query (get/put) exactly to one SSD IO
(read/write), eliminating the read/write amplification. We kind of believe
OpenNvmKV may be the right approach.

Back to the context of Ceph: can we find a use case for today's key-value
backend? We would like to learn from the community what the workload
pattern is if you want a K-V backed Ceph - or is it just to have a try? I
think before we get a suitable DB backend, we had better optimize the
key-value backend code to support a specific kind of load.

From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Monday, December 1, 2014 10:14 PM
To: Chen, Xiaoxi
Cc: Satoru Funai; ceph-us...@ceph.com
Subject: Re: [ceph-users] LevelDB support status is still experimental on Giant?

> [...]
Re: [ceph-users] Removing Snapshots Killing Cluster Performance
Thanks for your input. We will see what we can find out with the logs and
how to proceed from there.
Re: [ceph-users] To clarify requirements for Monitors
Thank you, Paulo. Metadata = MDS, so the metadata server should have CPU
power.

--Roman

On 14-11-28 05:34 PM, Paulo Almeida wrote:
> On Fri, 2014-11-28 at 16:37 -0500, Roman Naumenko wrote:
>> And if I understand correctly, monitors are the access points to the
>> cluster, so they should provide enough aggregated network output for
>> all connected clients, based on the number of OSDs in the cluster?
>
> I'm not sure what you mean by "access points to the cluster", but the
> monitors only provide the cluster map to the client, which then
> communicates directly with the OSDs. Quoting the documentation[1]:
>
> "Ceph eliminates the centralized gateway to enable clients to interact
> with Ceph OSD Daemons directly. (...) Before Ceph Clients can read or
> write data, they must contact a Ceph Monitor to obtain the most recent
> copy of the cluster map."
>
> [1] http://ceph.com/docs/master/architecture/
>
> Cheers,
> Paulo
[ceph-users] Rsync mirror for repository?
Is there a place I can download the entire repository for Giant? I'm really
just looking for an rsync server that presents all the files here:
http://download.ceph.com/ceph/giant/centos6.5/

I know that eu.ceph.com runs one, but I'm not sure how up to date that is
(because of http://eu.ceph.com/rpm-giant/el6/x86_64/ - it has two versions
in that directory). Ceph is fairly critical to us, so we don't want to rely
on an external mirror (we've had issues with other software where the files
on the external mirror suddenly became broken). For now, I downloaded it
via `wget -r`, but this really isn't ideal.

I already tried:

    $ rsync rsync://download.ceph.com
    rsync: failed to connect to download.ceph.com: Connection refused (111)

    $ rsync rsync://ceph.com --contimeout=2
    rsync error: timeout waiting for daemon connection (code 35) at socket.c(279) [receiver=3.0.6]
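One hedged workaround while download.ceph.com refuses rsync connections: mirror from eu.ceph.com, which does run an rsync daemon; the module path below is a guess, so list the available modules first:

    # list the modules the eu mirror exposes
    rsync rsync://eu.ceph.com/

    # then pull the tree you need (module/path assumed)
    rsync -avP rsync://eu.ceph.com/ceph/rpm-giant/el6/ ./mirror/rpm-giant/el6/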
[ceph-users] Problems with pgs incomplete
Hi all,

I have a Ceph cluster + rgw. Now I have a problem with one of the OSDs;
it's down now. I check ceph status and see this information:

    [root@node-1 ceph-0]# ceph -s
        cluster fc8c3ecc-ccb8-4065-876c-dc9fc992d62d
         health HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean
         monmap e1: 3 mons at {a=10.29.226.39:6789/0,b=10.29.226.29:6789/0,c=10.29.226.40:6789/0}, election epoch 294, quorum 0,1,2 b,a,c
         osdmap e418: 6 osds: 5 up, 5 in
          pgmap v23588: 312 pgs, 16 pools, 141 kB data, 594 objects
                5241 MB used, 494 GB / 499 GB avail
                     308 active+clean
                       4 incomplete

Why am I having 4 pgs incomplete in pool .rgw.buckets if I have replicated
size 2 and min_size 2?

My osd tree:

    [root@node-1 ceph-0]# ceph osd tree
    # id  weight  type name                 up/down  reweight
    -1    4       root croc
    -2    4         region ru
    -4    3           datacenter vol-5
    -5    1             host node-1
    0     1               osd.0             down     0
    -6    1             host node-2
    1     1               osd.1             up       1
    -7    1             host node-3
    2     1               osd.2             up       1
    -3    1           datacenter comp
    -8    1             host node-4
    3     1               osd.3             up       1
    -9    1             host node-5
    4     1               osd.4             up       1
    -10   1             host node-6
    5     1               osd.5             up       1

Additional information:

    [root@node-1 ceph-0]# ceph health detail
    HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean
    pg 13.6 is stuck inactive for 1547.665758, current state incomplete, last acting [1,3]
    pg 13.4 is stuck inactive for 1547.652111, current state incomplete, last acting [1,2]
    pg 13.5 is stuck inactive for 4502.009928, current state incomplete, last acting [1,3]
    pg 13.2 is stuck inactive for 4501.979770, current state incomplete, last acting [1,3]
    pg 13.6 is stuck unclean for 4501.969914, current state incomplete, last acting [1,3]
    pg 13.4 is stuck unclean for 4502.001114, current state incomplete, last acting [1,2]
    pg 13.5 is stuck unclean for 4502.009942, current state incomplete, last acting [1,3]
    pg 13.2 is stuck unclean for 4501.979784, current state incomplete, last acting [1,3]
    pg 13.2 is incomplete, acting [1,3] (reducing pool .rgw.buckets min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 13.6 is incomplete, acting [1,3] (reducing pool .rgw.buckets min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 13.4 is incomplete, acting [1,2] (reducing pool .rgw.buckets min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 13.5 is incomplete, acting [1,3] (reducing pool .rgw.buckets min_size from 2 may help; search ceph.com/docs for 'incomplete')

    [root@node-1 ceph-0]# ceph osd dump | grep 'pool'
    pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
    pool 1 '.rgw.root' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 34 owner 18446744073709551615 flags hashpspool stripe_width 0
    pool 2 '.rgw.control' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 36 owner 18446744073709551615 flags hashpspool stripe_width 0
    pool 3 '.rgw' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 38 owner 18446744073709551615 flags hashpspool stripe_width 0
    pool 4 '.rgw.gc' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 39 flags hashpspool stripe_width 0
    pool 5 '.users.uid' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 40 owner 18446744073709551615 flags hashpspool stripe_width 0
    pool 6 '.log' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 42 owner 18446744073709551615 flags hashpspool stripe_width 0
    pool 7 '.users' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 44 flags hashpspool stripe_width 0
    pool 8 '.users.swift' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 46 flags hashpspool stripe_width 0
    pool 9 '.usage' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 48 flags hashpspool stripe_width 0
    pool 10 'test' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 136 pgp_num 136 last_change 68 flags hashpspool stripe_width 0
    pool 11 '.rgw.buckets.index' replicated size 3 min_size 2 crush_ruleset 0
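When diagnosing incomplete PGs like these, querying one of the PGs from the health output usually shows what is blocking peering; a short sketch using an ID from above:

    ceph pg 13.2 query          # inspect the "recovery_state" section
    ceph pg dump_stuck inactive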
Re: [ceph-users] Problems with pgs incomplete
Hi!

I had a very similar issue a few days ago. For me it wasn't too much of a
problem since the cluster was new, without data, and I could force-recreate
the PGs. I really hope that in your case it won't be necessary to do the
same thing.

As a first step, try to reduce min_size from 2 to 1 as suggested for the
.rgw.buckets pool and see if this can bring your cluster back to health.

Regards,

George

On Mon, 01 Dec 2014 17:09:31 +0300, Butkeev Stas wrote:
> [...]
Re: [ceph-users] Problems with pgs incomplete
On Mon, Dec 01, 2014 at 05:09:31PM +0300, Butkeev Stas wrote:
> [...]
Re: [ceph-users] Revisiting MDS memory footprint
I meant to chime in earlier here, but then the weekend happened; comments
inline.

On Sun, Nov 30, 2014 at 7:20 PM, Wido den Hollander <w...@42on.com> wrote:
> Why would you want all CephFS metadata in memory? With any filesystem
> that will be a problem.

The latency associated with a cache miss (a RADOS OMAP dirfrag read) is
fairly high, so the goal when sizing will be to allow the MDSs to keep a
very large proportion of the metadata in RAM. In a local FS, the filesystem
metadata in RAM is relatively small, and the speed to disk is relatively
high. In CephFS, that is reversed: we want to compensate for the cache miss
latency by having lots of RAM in the MDS and a big cache.

Hot-standby MDSs are another manifestation of the expected large cache: we
expect these caches to be big, to the point where refilling from the
backing store on a failure would be annoyingly slow, and it's worth keeping
that hot standby cache.

Also, remember that because we embed inodes in dentries, when we load a
directory fragment we are also loading all the inodes in that directory
fragment - if you have only one file open, but it has an ancestor with lots
of files, then you'll have more files in cache than you might have
expected.

> We do however need a good rule of thumb of how much memory is used for
> each inode.

Yes - and ideally some practical measurements too :-)

One important point that I don't think anyone has mentioned so far: the
memory consumption per inode depends on how many clients have capabilities
on the inode. So if many clients hold a read capability on a file, more
memory will be used MDS-side for that file. If designing a benchmark for
this, the client count and the level of overlap in the client workloads
would be an important dimension.

The number of *open* files on clients strongly affects the ability of the
MDS to trim its cache, since the MDS pins in cache any inode which is in
use by a client. We recently added health checks so that the MDS can
complain about clients that are failing to respond to requests to trim
their caches, and the way we test this is to have a client obstinately keep
some number of files open.

We also allocate memory for pending metadata updates (so-called 'projected
inodes') while they are in the journal, so the memory usage will depend on
the journal size and the number of writes in flight.

It would be really useful to come up with a test script that monitors MDS
memory consumption as a function of the number of files in cache, the
number of files opened by clients, and the number of clients opening the
same files. I feel a 3D chart plot coming on :-)

Cheers,

John
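A rough starting point for such a script is sketched below; the mount point, batch sizes, and the assumption of a single local ceph-mds process are all placeholders:

    #!/bin/bash
    # Grow the file count in batches and record MDS RSS after each batch.
    MNT=/mnt/cephfs/memtest
    mkdir -p "$MNT"
    for batch in $(seq 1 100); do
        for i in $(seq 1 1000); do
            touch "$MNT/f_${batch}_${i}"
        done
        # resident set size of the (assumed single) local ceph-mds, in kB
        rss_kb=$(ps -o rss= -C ceph-mds | head -1)
        echo "$((batch * 1000)) files: ${rss_kb} kB RSS" >> mds_mem.log
    done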
Re: [ceph-users] Problems with pgs incomplete
On 01/12/2014 15:09, Butkeev Stas wrote: pg 13.2 is incomplete, acting [1,3] (reducing pool .rgw.buckets min_size from 2 may help; search ceph.com/docs for 'incomplete')

The answer is in the logs: your .rgw.buckets pool is using min_size = 2, so it doesn't have enough valid pg replicas to start recovering. IIRC from past messages on this list, you must have at least min_size replicas available to recover from a failed OSD, as Ceph doesn't try to use the available data to recover if min_size isn't met. I may be wrong here (I'm surprised you only have 4 incomplete pgs; I'd expect ~1/3rd of your pgs to be incomplete given your ceph osd tree output), but reducing min_size to 1 should be harmless and should unfreeze the recovery process. Best regards, Lionel Bouton
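For anyone following along, the advice above translates to something like this (pool name taken from the health output):

ceph osd pool set .rgw.buckets min_size 1
# ...wait for recovery to finish, then restore the original value:
ceph osd pool set .rgw.buckets min_size 2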
Re: [ceph-users] Problems with pgs incomplete
On 01/12/2014 17:08, Lionel Bouton wrote: I may be wrong here (I'm surprised you only have 4 incomplete pgs; I'd expect ~1/3rd of your pgs to be incomplete given your ceph osd tree output), but reducing min_size to 1 should be harmless and should unfreeze the recovery process.

Ignore this part: I wasn't paying enough attention to the osd tree output and mixed up the osd/host levels. Others have pointed out that you have size = 3 for some pools. In this case you might have lost an OSD before a previous recovery finished, which would explain your current state (in this case my earlier advice still applies). Best regards, Lionel
Re: [ceph-users] Revisiting MDS memory footprint
On Fri, Nov 28, 2014 at 1:48 PM, Florian Haas flor...@hastexo.com wrote: Out of curiosity: would it matter at all whether or not a significant fraction of the files in CephFS were hard links? Clearly the only thing that differs in metadata between individual hard-linked files is the file name, but I wonder if the Ceph MDS actually takes this into consideration. In other words, I'm not sure whether the MDS simply adds another pointer to the same set of metadata, or whether that set of metadata is actually duplicated in MDS memory. I am guessing the latter, but it would be nice to be sure.

When we load a hard link dentry (in CDir::_omap_fetched), if we already have the inode in cache then we just refer to that copy -- we never have two of the same inode (CInode object) in memory. If we don't have the inode in cache, then the inode isn't loaded until someone tries to traverse the dentry (i.e. touch the file in any way), at which point we go to fetch the backtrace from the RADOS object for that file. So hard links may incur less memory overhead when loading a directory fragment, but you will take an I/O hammering when dereferencing them if the linked inode is not already in cache, as each individual hard link has to be followed via a separate RADOS object. In general I would be very cautious about workloads that do a lot of reads of cold hard linked files, e.g. if benchmarking this case for backups then you should try to create the hard links, let the files fall out of cache, then observe the performance of a restore where many hard links are being dereferenced via backtraces. I'm mostly reading this from the code rather than from memory, so I'm sure Greg or Sage will jump in if I'm getting any of these cases wrong. Cheers, John
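A rough sketch of the benchmark John describes, with made-up paths, counts and monitor address; the point is simply to create the links, go cold, then time a traversal that dereferences them:

# create hard links to a large set of existing files
for i in $(seq 1 100000); do
    ln /mnt/cephfs/data/file$i /mnt/cephfs/backup/file$i
done
# let the files fall out of cache, e.g. remount the client
# (and restart the MDS so its cache is cold too)
umount /mnt/cephfs
mount -t ceph mon1:6789:/ /mnt/cephfs
# time a 'restore' pass that follows every hard link via its backtrace
time tar cf /dev/null /mnt/cephfs/backup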
Re: [ceph-users] Problems with pgs incomplete
Thank you Lionel. Indeed, I had forgotten about the size/min_size requirement. I have set min_size to 1 and my cluster is UP now. I have deleted the crashed OSD and have set size to 3 and min_size to 2. --- With regards, Stanislav

01.12.2014, 19:15, Lionel Bouton lionel-subscript...@bouton.name: On 01/12/2014 17:08, Lionel Bouton wrote: I may be wrong here (I'm surprised you only have 4 incomplete pgs; I'd expect ~1/3rd of your pgs to be incomplete given your ceph osd tree output), but reducing min_size to 1 should be harmless and should unfreeze the recovery process. Ignore this part: I wasn't paying enough attention to the osd tree output and mixed up the osd/host levels. Others have pointed out that you have size = 3 for some pools. In this case you might have lost an OSD before a previous recovery finished, which would explain your current state (in this case my earlier advice still applies). Best regards, Lionel
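For the archives, the sequence Stas describes corresponds roughly to the following (OSD id and pool name taken from earlier in the thread):

ceph osd pool set .rgw.buckets min_size 1
# once recovery completes, remove the dead OSD...
ceph osd out 0
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0
# ...and restore the stronger replication settings
ceph osd pool set .rgw.buckets size 3
ceph osd pool set .rgw.buckets min_size 2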
[ceph-users] How to see which crush tunables are active in a ceph-cluster?
Hi all, http://ceph.com/docs/master/rados/operations/crush-map/#crush-tunables describes how to set the tunables to legacy, argonaut, bobtail, firefly or optimal. But how can I see which profile is active in a Ceph cluster? With ceph osd getcrushmap I don't get much info (only tunable choose_local_tries 0, tunable choose_local_fallback_tries 0, tunable choose_total_tries 50). Udo
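Two ways that should answer this; the first subcommand may be absent on older releases, and on recent ones its JSON output also names the matching profile:

ceph osd crush show-tunables
# or decompile the map; the tunable lines appear at the top of the text file
ceph osd getcrushmap -o /tmp/crushmap
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
head /tmp/crushmap.txt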
Re: [ceph-users] Compile from source with Kinetic support
On 11/28/14 7:04 AM, Haomai Wang wrote: Yeah, the ceph source repo doesn't contain the Kinetic header files and library source; you need to install the kinetic devel package separately.

Hi Haomai, I'm wondering if we need AC_CHECK_HEADER([kinetic/kinetic.h], ...) in configure.ac to double-check when the user specifies --with-kinetic? It might help to avoid some user confusion if we can have ./configure bail out early instead of continuing all the way through the build. Something like this? (completely untested)

--- a/configure.ac
+++ b/configure.ac
@@ -557,7 +557,13 @@ AC_ARG_WITH([kinetic],
 #AS_IF([test "x$with_kinetic" = "xyes"],
 #      [PKG_CHECK_MODULES([KINETIC], [kinetic_client], [], [true])])
 AS_IF([test "x$with_kinetic" = "xyes"],
-      [AC_DEFINE([HAVE_KINETIC], [1], [Defined if you have kinetic enabled])])
+      [AC_CHECK_HEADER([kinetic/kinetic.h],
+        [AC_DEFINE(
+          [HAVE_KINETIC], [1], [Defined if you have kinetic enabled])],
+        [AC_MSG_FAILURE(
+          [Can't find kinetic headers; please install them])
+      )]
+])
 AM_CONDITIONAL(WITH_KINETIC, [ test "$with_kinetic" = "yes" ])
Re: [ceph-users] Compile from source with Kinetic support
Sorry, it didn't change anything:

root@host:~/sources/ceph# head -12 src/os/KeyValueDB.cc
// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-
// vim: ts=8 sw=2 smarttab
#include "KeyValueDB.h"
#include "LevelDBStore.h"
#ifdef HAVE_LIBROCKSDB
#include "RocksDBStore.h"
#endif
#ifdef WITH_KINETIC
#include "KineticStore.h"
#endif

root@host:~/sources/ceph# make
[...]
  CXX      os/libos_la-KeyValueDB.lo
os/KeyValueDB.cc: In static member function 'static KeyValueDB* KeyValueDB::create(CephContext*, const string&, const string&)':
os/KeyValueDB.cc:21:16: error: expected type-specifier before 'KineticStore'
     return new KineticStore(cct);
                ^
os/KeyValueDB.cc:21:16: error: expected ';' before 'KineticStore'
os/KeyValueDB.cc:21:32: error: 'KineticStore' was not declared in this scope
     return new KineticStore(cct);
                                ^
os/KeyValueDB.cc: In static member function 'static int KeyValueDB::test_init(const string&, const string&)':
os/KeyValueDB.cc:39:12: error: 'KineticStore' has not been declared
     return KineticStore::_test_init(g_ceph_context);
            ^
make[3]: *** [os/libos_la-KeyValueDB.lo] Error 1

On 12/01/2014 03:22 PM, Haomai Wang wrote:
#ifdef WITH_KINETIC
#include "KineticStore.h"
#endif
Re: [ceph-users] Compile from source with Kinetic support
Sorry, it's a typo: s/WITH_KINETIC/HAVE_KINETIC/ :-)

On Tue, Dec 2, 2014 at 12:51 AM, Julien Lutran julien.lut...@ovh.net wrote: Sorry, it didn't change anything:

#ifdef WITH_KINETIC
#include "KineticStore.h"
#endif

[...]

-- Best Regards, Wheat
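Applying that substitution, the guard in src/os/KeyValueDB.cc should test the macro that configure actually defines:

#ifdef HAVE_KINETIC
#include "KineticStore.h"
#endif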
Re: [ceph-users] Compile from source with Kinetic support
On Tue, Dec 2, 2014 at 12:38 AM, Ken Dreyer kdre...@redhat.com wrote: [...]

Yeah, it's better. Could anyone help add this? You can close https://github.com/ceph/ceph/pull/3046 and create a PR. I don't have a std-c++11 env to test it at all :-( -- Best Regards, Wheat
Re: [ceph-users] Optimal or recommended threads values
I'm still using the default values, mostly because I haven't had time to test.

On Thu, Nov 27, 2014 at 2:44 AM, Andrei Mikhailovsky and...@arhont.com wrote: Hi Craig, Are you keeping the filestore, disk and op threads at their default values, or did you also change them? Cheers

Tuning these values depends on a lot more than just the SSDs and HDDs. Which kernel and IO scheduler are you using? Does your HBA do write caching? It also depends on what your goals are. Tuning for a RadosGW cluster is different than for an RBD cluster. The short answer is that you are the only person that can tell you what your optimal values are. As always, the best benchmark is production load. In my small cluster (5 nodes, 44 osds), I'm optimizing to minimize latency during recovery. When the cluster is healthy, bandwidth and latency are more than adequate for my needs. Even with journals on SSDs, I've found that reducing the number of operations and threads has reduced my average latency. I use injectargs to try out new values while I monitor cluster latency. I monitor latency while the cluster is healthy and recovering. If a change is deemed better, only then will I persist the change to ceph.conf. This gives me a fallback: any change that causes massive problems can be undone with a restart or reboot. So far, the configs that I've written to ceph.conf are:

[global]
mon osd down out interval = 900
mon osd min down reporters = 9
mon osd min down reports = 12
osd pool default flag hashpspool = true

[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1

I have it on my list to investigate filestore max sync interval. And now that I've pasted that, I need to revisit the min down reports/reporters. I have some nodes with 10 OSDs, and I don't want any one node to be able to mark the rest of the cluster as down (it happened once).

On Sat, Nov 22, 2014 at 6:24 AM, Andrei Mikhailovsky and...@arhont.com wrote: Hello guys, Could someone comment on the optimal or recommended values for the various thread settings in ceph.conf? At the moment I have the following settings:

filestore_op_threads = 8
osd_disk_threads = 8
osd_op_threads = 8
filestore_merge_threshold = 40
filestore_split_multiple = 8

Are these reasonable for a small cluster made of 7.2K SAS disks with SSD journals at a ratio of 4:1? What are the settings that other people are using? Thanks Andrei
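A minimal sketch of the injectargs workflow Craig describes (values are examples; the change is live and does not persist across restarts):

# try lower recovery settings on all OSDs without touching ceph.conf
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
# watch latency while healthy and while recovering; only if the change
# helps, write the same values into ceph.conf so they survive a restart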
Re: [ceph-users] ceph-fs-common ceph-mds on ARM Raspberry Debian 7.6
Hi Paulo, Thanks a lot. I've just added the backports line below into /etc/apt/sources.list:

deb http://ftp.debian.org/debian/ wheezy-backports main

and ran:

apt-get update

But ceph-deploy still threw the errors. So I installed the packages manually (to take them from wheezy-backports):

apt-get -t wheezy-backports install ceph ceph-mds ceph-common ceph-fs-common gdisk

And ceph-deploy is now OK:

root@socrate:~/cluster# ceph-deploy install socrate.flox-arts.in
…
[socrate.flox-arts.in][DEBUG ] ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)

Thanks Florent Monthel

On 1 Dec 2014, at 00:03, Paulo Almeida palme...@igc.gulbenkian.pt wrote: Hi, You should be able to use the wheezy-backports repository, which has ceph 0.80.7. Cheers, Paulo

On Sun, 2014-11-30 at 19:31 +0100, Florent MONTHEL wrote: Hi, I'm trying to deploy CEPH (with ceph-deploy) on Raspberry Debian 7.6 and I get the below error on the ceph-deploy install command:

[socrate.flox-arts.in][INFO ] Running command: env DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get -q -o Dpkg::Options::=--force-confnew --no-install-recommends --assume-yes install -- ceph ceph-mds ceph-common ceph-fs-common gdisk
[socrate.flox-arts.in][DEBUG ] Reading package lists...
[socrate.flox-arts.in][DEBUG ] Building dependency tree...
[socrate.flox-arts.in][DEBUG ] Reading state information...
[socrate.flox-arts.in][WARNIN] E: Unable to locate package ceph-mds
[socrate.flox-arts.in][WARNIN] E: Unable to locate package ceph-fs-common
[socrate.flox-arts.in][ERROR ] RuntimeError: command returned non-zero exit status: 100
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: env DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get -q -o Dpkg::Options::=--force-confnew --no-install-recommends --assume-yes install -- ceph ceph-mds ceph-common ceph-fs-common gdisk

Do you know how I can get these 2 packages on this platform? Thanks Florent Monthel
Re: [ceph-users] do I have to use sudo for CEPH install
You have to be a root user, either via login, su or sudo. So no, you don't have to use sudo - just log on as root.

On 2 December 2014 at 00:05, Jiri Kanicky ji...@ganomi.com wrote: Hi. Do I have to install sudo in Debian Wheezy to deploy CEPH successfully? I don't normally use sudo. Thank you Jiri

-- Lindsay
Re: [ceph-users] Deleting buckets and objects fails to reduce reported cluster usage
On Sat, Nov 29, 2014 at 2:26 PM, Ben b@benjackson.email wrote: On 29/11/14 11:40, Yehuda Sadeh wrote: On Fri, Nov 28, 2014 at 1:38 PM, Ben b@benjackson.email wrote: On 29/11/14 01:50, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 9:22 PM, Ben b@benjackson.email wrote: On 2014-11-28 15:42, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 2:15 PM, b b@benjackson.email wrote: On 2014-11-27 11:36, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:49 PM, b b@benjackson.email wrote: On 2014-11-27 10:21, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:09 PM, b b@benjackson.email wrote: On 2014-11-27 09:38, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 2:32 PM, b b@benjackson.email wrote:

I've been deleting a bucket which originally had 60TB of data in it; with our cluster doing only 1 replication, the total usage was 120TB. I've been deleting the objects slowly using S3 Browser, and I can see the bucket usage is now down to around 2.5TB, or 5TB with duplication, but the usage in the cluster has not changed. I've looked at garbage collection (radosgw-admin gc list --include-all) and it just reports square brackets []. I've run radosgw-admin temp remove --date=2014-11-20, and it doesn't appear to have any effect. Is there a way to check where this space is being consumed? Running 'ceph df', the USED space in the buckets pool is not showing any of the 57TB that should have been freed up from the deletion so far. Running 'radosgw-admin bucket stats | jshon | grep size_kb_actual' and adding up all the buckets' usage shows that the space has been freed from the buckets, but the cluster is all sorts of messed up. ANY IDEAS? What can I look at?

Can you run 'radosgw-admin gc list --include-all'? Yehuda

I've done it before, and it just returns square brackets [] (see below): radosgw-admin gc list --include-all []

Do you know which of the rados pools has all that extra data? Try to list that pool's objects and verify that there are no surprises there (e.g., use 'rados -p <pool> ls'). Yehuda

I'm just running that command now, and it's taking some time. There is a large number of objects. Once it has finished, what should I be looking for?

I assume the pool in question is the one that holds your objects' data? You should be looking for objects that are not expected to exist anymore, and objects of buckets that don't exist anymore. The problem here is to identify these. I suggest starting by looking at all the existing buckets, composing a list of the bucket prefixes for the existing buckets, and then looking for objects that have different prefixes. Yehuda

Any ideas? I've found the prefix; the number of objects in the pool that match that prefix is in the 21 millions. The actual 'radosgw-admin bucket stats' command reports the bucket as only having 1.2 million.

Well, the objects you're seeing are raw objects, and since rgw stripes the data, it is expected to have more raw objects than objects in the bucket. Still, it seems that you have far too many of these. You can try to check whether there are pending multipart uploads that were never completed, using the S3 API. At the moment there's no easy way to figure out which raw objects are not supposed to exist. The process would be like this:
1. rados ls -p <data pool> (keep the list sorted)
2. list objects in the bucket
3. for each object in (2), do: radosgw-admin object stat --bucket=<bucket> --object=<object> --rgw-cache-enabled=false (disabling the cache so that it goes quicker)
4. look at the result of (3), and generate a list of all the parts
5. sort the result of (4) and compare it to (1)
Note that if you're running firefly or later, the raw objects are not specified explicitly in the output of the command you run at (3), so you might need a different procedure, e.g., find out the random string that the raw objects use, remove it from the list generated in (1), etc. That's basically it. I'll be interested to figure out what happened, why the garbage collection didn't work correctly. You could try verifying that it's working by:
- create an object (let's say ~10MB in size)
- radosgw-admin object stat --bucket=<bucket> --object=<object> (keep this info, see below)
- remove the object
- run radosgw-admin gc list --include-all and verify that the raw parts are listed there
- wait a few hours, repeat the last step, and see that the parts don't appear there anymore
- run rados -p <pool> ls, and check whether the raw objects still exist
Yehuda

Not sure where to go from here, and our cluster is slowly filling up while not clearing any space. I did the last section: I'll be interested to figure out what happened, why the garbage collection didn't work correctly. You could try verifying that it's working by: - create an object (let's say ~10MB in size). - radosgw-admin object stat --bucket=<bucket>
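A rough, untested sketch of steps (1)-(5), with placeholder pool/bucket names and an assumed JSON layout for the bucket listing (adjust the field extraction to whatever radosgw-admin actually prints on your version):

# (1) all raw objects in the data pool, sorted
rados ls -p .rgw.buckets | sort > raw_objects.txt
# (2) all logical objects in the bucket
radosgw-admin bucket list --bucket=mybucket > bucket_listing.json
grep '"name"' bucket_listing.json | cut -d'"' -f4 > bucket_objects.txt
# (3) stat every object with the cache disabled
while read -r obj; do
    radosgw-admin object stat --bucket=mybucket --object="$obj" --rgw-cache-enabled=false
done < bucket_objects.txt > stat_dump.txt
# (4) extract the raw part names from stat_dump.txt into expected_raw.txt, then
# (5) anything in the pool that no logical object claims is an orphan candidate
sort expected_raw.txt > expected_sorted.txt
comm -23 raw_objects.txt expected_sorted.txt > orphan_candidates.txt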
Re: [ceph-users] Radosgw agent only syncing metadata
On 25/11/14 12:40, Mark Kirkwood wrote: On 25/11/14 11:58, Yehuda Sadeh wrote: On Mon, Nov 24, 2014 at 2:43 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: On 22/11/14 10:54, Yehuda Sadeh wrote: On Thu, Nov 20, 2014 at 6:52 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:

Fri Nov 21 02:13:31 2014 x-amz-copy-source:bucketbig/_multipart_big.dat.2/fjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta /bucketbig/__multipart_big.dat.2%2Ffjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta
2014-11-21 15:13:31.914925 7fb5e3f87700 15 generated auth header: AWS us-west key:tk7RgBQMD92je2Nz1m2D/GV+VNM=
2014-11-21 15:13:31.914964 7fb5e3f87700 20 sending request to http://ceph2:80/bucketbig/__multipart_big.dat.2%2Ffjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta?rgwx-uid=us-west&rgwx-region=us&rgwx-prepend-metadata=us
2014-11-21 15:13:31.920510 7fb5e3f87700 10 receive_http_header
2014-11-21 15:13:31.920525 7fb5e3f87700 10 received header:HTTP/1.1 411 Length Required

It looks like you're running the wrong fastcgi module. Yehuda

Thanks Yehuda - so what would be the right fastcgi? Do you mean http://gitbuilder.ceph.com/libapache-mod-fastcgi-deb-precise-x86_64-basic/ref/master/ ?

This one should work, yeah.

Looks like that was the issue:

$ rados df | grep bucket
.us-east.rgw.buckets          -  93740  24  0  0  0  349  3746  216  93740
.us-east.rgw.buckets.index    -      0   1  0  0  0   24    25   27      0
.us-west.rgw.buckets          -  93740  24  0  0  0    0     0  215  93740
.us-west.rgw.buckets.index    -      0   1  0  0  0   19    18   19      0

Now I reinstalled the Ceph-patched apache2 and fastcgi module (not sure if apache2 was needed as well):

$ cat /etc/apt/sources.list.d/ceph.list
...
deb http://gitbuilder.ceph.com/libapache-mod-fastcgi-deb-precise-x86_64-basic/ref/master/ precise main
deb http://gitbuilder.ceph.com/apache2-deb-precise-x86_64-basic/ref/master/ precise main

Now that I've got that working I'll look at getting a more complex setup.

Just for the record, using these apache and fastcgi modules seems to be the story - I've managed to run through the more complicated examples:
- zones in different ceph clusters
- zones in different regions
... and get replication working (on Ubuntu 12.04 and 14.04 with Ceph 0.87). Thanks for your help. I have some further questions that I'll ask in a new thread (as they are not really about 'how to make it work'). regards Mark
Re: [ceph-users] Deleting buckets and objects fails to reduce reported cluster usage
On Mon, Dec 1, 2014 at 3:20 PM, Ben b@benjackson.email wrote: On 2014-12-02 09:25, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 2:10 PM, Ben b@benjackson.email wrote: On 2014-12-02 08:39, Yehuda Sadeh wrote: On Sat, Nov 29, 2014 at 2:26 PM, Ben b@benjackson.email wrote: On 29/11/14 11:40, Yehuda Sadeh wrote: On Fri, Nov 28, 2014 at 1:38 PM, Ben b@benjackson.email wrote: On 29/11/14 01:50, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 9:22 PM, Ben b@benjackson.email wrote: On 2014-11-28 15:42, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 2:15 PM, b b@benjackson.email wrote: On 2014-11-27 11:36, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:49 PM, b b@benjackson.email wrote: On 2014-11-27 10:21, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:09 PM, b b@benjackson.email wrote: On 2014-11-27 09:38, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 2:32 PM, b b@benjackson.email wrote: I've been deleting a bucket which originally had 60TB of data in it, with our cluster doing only 1 replication, the total usage was 120TB. I've been deleting the objects slowly using S3 browser, and I can see the bucket usage is now down to around 2.5TB or 5TB with duplication, but the usage in the cluster has not changed. I've looked at garbage collection (radosgw-admin gc list --include all) and it just reports square brackets [] I've run radosgw-admin temp remove --date=2014-11-20, and it doesn't appear to have any effect. Is there a way to check where this space is being consumed? Running 'ceph df' the USED space in the buckets pool is not showing any of the 57TB that should have been freed up from the deletion so far. Running 'radosgw-admin bucket stats | jshon | grep size_kb_actual' and adding up all the buckets usage, this shows that the space has been freed from the bucket, but the cluster is all sorts of messed up. ANY IDEAS? What can I look at? Can you run 'radosgw-admin gc list --include-all'? Yehuda I've done it before, and it just returns square brackets [] (see below) radosgw-admin gc list --include-all [] Do you know which of the rados pools have all that extra data? Try to list that pool's objects, verify that there are no surprises there (e.g., use 'rados -p pool ls'). Yehuda I'm just running that command now, and its taking some time. There is a large number of objects. Once it has finished, what should I be looking for? I assume the pool in question is the one that holds your objects data? You should be looking for objects that are not expected to exist anymore, and objects of buckets that don't exist anymore. The problem here is to identify these. I suggest starting by looking at all the existing buckets, compose a list of all the bucket prefixes for the existing buckets, and try to look whether there are objects that have different prefixes. Yehuda Any ideas? I've found the prefix, the number of objects in the pool that match that prefix numbers in the 21 millions The actual 'radosgw-admin bucket stats' command reports it as only having 1.2 million. Well, the objects you're seeing are raw objects, and since rgw stripes the data, it is expected to have more raw objects than objects in the bucket. Still, it seems that you have much too many of these. You can try to check whether there are pending multipart uploads that were never completed using the S3 api. At the moment there's no easy way to figure out which raw objects are not supposed to exist. The process would be like this: 1. rados ls -p data pool keep the list sorted 2. list objects in the bucket 3. 
for each object in (2), do: radosgw-admin object stat --bucket=bucket --object=object --rgw-cache-enabled=false (disabling the cache so that it goes quicker) 4. look at the result of (3), and generate a list of all the parts. 5. sort result of (4), compare it to (1) Note that if you're running firefly or later, the raw objects are not specified explicitly in the command you run at (3), so you might need a different procedure, e.g., find out the raw objects random string that is being used, remove it from the list generated in 1, etc.) That's basically it. I'll be interested to figure out what happened, why the garbage collection didn't work correctly. You could try verifying that it's working by: - create an object (let's say ~10MB in size). - radosgw-admin object stat --bucket=bucket --object=object (keep this info, see - remove the object - run radosgw-admin gc list --include-all and verify that the raw parts are listed there - wait a few hours, repeat last step, see that the parts don't appear there anymore - run rados -p pool ls, check to see if the raw objects still exist Yehuda Not sure where to go from here, and our cluster is slowly filling up while not clearing any space. I
Re: [ceph-users] Deleting buckets and objects fails to reduce reported cluster usage
On 2014-12-02 11:21, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 3:47 PM, Ben b@benjackson.email wrote: On 2014-12-02 10:40, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 3:20 PM, Ben b@benjackson.email wrote: On 2014-12-02 09:25, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 2:10 PM, Ben b@benjackson.email wrote: On 2014-12-02 08:39, Yehuda Sadeh wrote: On Sat, Nov 29, 2014 at 2:26 PM, Ben b@benjackson.email wrote: On 29/11/14 11:40, Yehuda Sadeh wrote: On Fri, Nov 28, 2014 at 1:38 PM, Ben b@benjackson.email wrote: On 29/11/14 01:50, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 9:22 PM, Ben b@benjackson.email wrote: On 2014-11-28 15:42, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 2:15 PM, b b@benjackson.email wrote: On 2014-11-27 11:36, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:49 PM, b b@benjackson.email wrote: On 2014-11-27 10:21, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:09 PM, b b@benjackson.email wrote: On 2014-11-27 09:38, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 2:32 PM, b b@benjackson.email wrote: I've been deleting a bucket which originally had 60TB of data in it, with our cluster doing only 1 replication, the total usage was 120TB. I've been deleting the objects slowly using S3 browser, and I can see the bucket usage is now down to around 2.5TB or 5TB with duplication, but the usage in the cluster has not changed. I've looked at garbage collection (radosgw-admin gc list --include all) and it just reports square brackets [] I've run radosgw-admin temp remove --date=2014-11-20, and it doesn't appear to have any effect. Is there a way to check where this space is being consumed? Running 'ceph df' the USED space in the buckets pool is not showing any of the 57TB that should have been freed up from the deletion so far. Running 'radosgw-admin bucket stats | jshon | grep size_kb_actual' and adding up all the buckets usage, this shows that the space has been freed from the bucket, but the cluster is all sorts of messed up. ANY IDEAS? What can I look at? Can you run 'radosgw-admin gc list --include-all'? Yehuda I've done it before, and it just returns square brackets [] (see below) radosgw-admin gc list --include-all [] Do you know which of the rados pools have all that extra data? Try to list that pool's objects, verify that there are no surprises there (e.g., use 'rados -p pool ls'). Yehuda I'm just running that command now, and its taking some time. There is a large number of objects. Once it has finished, what should I be looking for? I assume the pool in question is the one that holds your objects data? You should be looking for objects that are not expected to exist anymore, and objects of buckets that don't exist anymore. The problem here is to identify these. I suggest starting by looking at all the existing buckets, compose a list of all the bucket prefixes for the existing buckets, and try to look whether there are objects that have different prefixes. Yehuda Any ideas? I've found the prefix, the number of objects in the pool that match that prefix numbers in the 21 millions The actual 'radosgw-admin bucket stats' command reports it as only having 1.2 million. Well, the objects you're seeing are raw objects, and since rgw stripes the data, it is expected to have more raw objects than objects in the bucket. Still, it seems that you have much too many of these. You can try to check whether there are pending multipart uploads that were never completed using the S3 api. At the moment there's no easy way to figure out which raw objects are not supposed to exist. 
The process would be like this: 1. rados ls -p data pool keep the list sorted 2. list objects in the bucket 3. for each object in (2), do: radosgw-admin object stat --bucket=bucket --object=object --rgw-cache-enabled=false (disabling the cache so that it goes quicker) 4. look at the result of (3), and generate a list of all the parts. 5. sort result of (4), compare it to (1) Note that if you're running firefly or later, the raw objects are not specified explicitly in the command you run at (3), so you might need a different procedure, e.g., find out the raw objects random string that is being used, remove it from the list generated in 1, etc.) That's basically it. I'll be interested to figure out what happened, why the garbage collection didn't work correctly. You could try verifying that it's working by: - create an object (let's say ~10MB in size). - radosgw-admin object stat --bucket=bucket --object=object (keep this info, see - remove the object - run radosgw-admin gc list --include-all and verify that the raw parts are listed there - wait a few hours, repeat last step, see that the parts don't appear there anymore - run rados -p pool ls, check to see if the raw objects still exist
Re: [ceph-users] Deleting buckets and objects fails to reduce reported cluster usage
On Mon, Dec 1, 2014 at 4:23 PM, Ben b@benjackson.email wrote: On 2014-12-02 11:21, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 3:47 PM, Ben b@benjackson.email wrote: On 2014-12-02 10:40, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 3:20 PM, Ben b@benjackson.email wrote: On 2014-12-02 09:25, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 2:10 PM, Ben b@benjackson.email wrote: On 2014-12-02 08:39, Yehuda Sadeh wrote: On Sat, Nov 29, 2014 at 2:26 PM, Ben b@benjackson.email wrote: On 29/11/14 11:40, Yehuda Sadeh wrote: On Fri, Nov 28, 2014 at 1:38 PM, Ben b@benjackson.email wrote: On 29/11/14 01:50, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 9:22 PM, Ben b@benjackson.email wrote: On 2014-11-28 15:42, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 2:15 PM, b b@benjackson.email wrote: On 2014-11-27 11:36, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:49 PM, b b@benjackson.email wrote: On 2014-11-27 10:21, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:09 PM, b b@benjackson.email wrote: On 2014-11-27 09:38, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 2:32 PM, b b@benjackson.email wrote: I've been deleting a bucket which originally had 60TB of data in it, with our cluster doing only 1 replication, the total usage was 120TB. I've been deleting the objects slowly using S3 browser, and I can see the bucket usage is now down to around 2.5TB or 5TB with duplication, but the usage in the cluster has not changed. I've looked at garbage collection (radosgw-admin gc list --include all) and it just reports square brackets [] I've run radosgw-admin temp remove --date=2014-11-20, and it doesn't appear to have any effect. Is there a way to check where this space is being consumed? Running 'ceph df' the USED space in the buckets pool is not showing any of the 57TB that should have been freed up from the deletion so far. Running 'radosgw-admin bucket stats | jshon | grep size_kb_actual' and adding up all the buckets usage, this shows that the space has been freed from the bucket, but the cluster is all sorts of messed up. ANY IDEAS? What can I look at? Can you run 'radosgw-admin gc list --include-all'? Yehuda I've done it before, and it just returns square brackets [] (see below) radosgw-admin gc list --include-all [] Do you know which of the rados pools have all that extra data? Try to list that pool's objects, verify that there are no surprises there (e.g., use 'rados -p pool ls'). Yehuda I'm just running that command now, and its taking some time. There is a large number of objects. Once it has finished, what should I be looking for? I assume the pool in question is the one that holds your objects data? You should be looking for objects that are not expected to exist anymore, and objects of buckets that don't exist anymore. The problem here is to identify these. I suggest starting by looking at all the existing buckets, compose a list of all the bucket prefixes for the existing buckets, and try to look whether there are objects that have different prefixes. Yehuda Any ideas? I've found the prefix, the number of objects in the pool that match that prefix numbers in the 21 millions The actual 'radosgw-admin bucket stats' command reports it as only having 1.2 million. Well, the objects you're seeing are raw objects, and since rgw stripes the data, it is expected to have more raw objects than objects in the bucket. Still, it seems that you have much too many of these. You can try to check whether there are pending multipart uploads that were never completed using the S3 api. 
At the moment there's no easy way to figure out which raw objects are not supposed to exist. The process would be like this: 1. rados ls -p data pool keep the list sorted 2. list objects in the bucket 3. for each object in (2), do: radosgw-admin object stat --bucket=bucket --object=object --rgw-cache-enabled=false (disabling the cache so that it goes quicker) 4. look at the result of (3), and generate a list of all the parts. 5. sort result of (4), compare it to (1) Note that if you're running firefly or later, the raw objects are not specified explicitly in the command you run at (3), so you might need a different procedure, e.g., find out the raw objects random string that is being used, remove it from the list generated in 1, etc.) That's basically it. I'll be interested to figure out what happened, why the garbage collection didn't work correctly. You could try verifying that it's working by: - create an object (let's say ~10MB in size). - radosgw-admin object stat --bucket=bucket --object=object (keep this info, see - remove the object - run radosgw-admin gc list --include-all and verify that the raw parts are
Re: [ceph-users] Deleting buckets and objects fails to reduce reported cluster usage
On 2014-12-02 11:25, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 4:23 PM, Ben b@benjackson.email wrote: On 2014-12-02 11:21, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 3:47 PM, Ben b@benjackson.email wrote: On 2014-12-02 10:40, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 3:20 PM, Ben b@benjackson.email wrote: On 2014-12-02 09:25, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 2:10 PM, Ben b@benjackson.email wrote: On 2014-12-02 08:39, Yehuda Sadeh wrote: On Sat, Nov 29, 2014 at 2:26 PM, Ben b@benjackson.email wrote: On 29/11/14 11:40, Yehuda Sadeh wrote: On Fri, Nov 28, 2014 at 1:38 PM, Ben b@benjackson.email wrote: On 29/11/14 01:50, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 9:22 PM, Ben b@benjackson.email wrote: On 2014-11-28 15:42, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 2:15 PM, b b@benjackson.email wrote: On 2014-11-27 11:36, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:49 PM, b b@benjackson.email wrote: On 2014-11-27 10:21, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:09 PM, b b@benjackson.email wrote: On 2014-11-27 09:38, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 2:32 PM, b b@benjackson.email wrote: I've been deleting a bucket which originally had 60TB of data in it, with our cluster doing only 1 replication, the total usage was 120TB. I've been deleting the objects slowly using S3 browser, and I can see the bucket usage is now down to around 2.5TB or 5TB with duplication, but the usage in the cluster has not changed. I've looked at garbage collection (radosgw-admin gc list --include all) and it just reports square brackets [] I've run radosgw-admin temp remove --date=2014-11-20, and it doesn't appear to have any effect. Is there a way to check where this space is being consumed? Running 'ceph df' the USED space in the buckets pool is not showing any of the 57TB that should have been freed up from the deletion so far. Running 'radosgw-admin bucket stats | jshon | grep size_kb_actual' and adding up all the buckets usage, this shows that the space has been freed from the bucket, but the cluster is all sorts of messed up. ANY IDEAS? What can I look at? Can you run 'radosgw-admin gc list --include-all'? Yehuda I've done it before, and it just returns square brackets [] (see below) radosgw-admin gc list --include-all [] Do you know which of the rados pools have all that extra data? Try to list that pool's objects, verify that there are no surprises there (e.g., use 'rados -p pool ls'). Yehuda I'm just running that command now, and its taking some time. There is a large number of objects. Once it has finished, what should I be looking for? I assume the pool in question is the one that holds your objects data? You should be looking for objects that are not expected to exist anymore, and objects of buckets that don't exist anymore. The problem here is to identify these. I suggest starting by looking at all the existing buckets, compose a list of all the bucket prefixes for the existing buckets, and try to look whether there are objects that have different prefixes. Yehuda Any ideas? I've found the prefix, the number of objects in the pool that match that prefix numbers in the 21 millions The actual 'radosgw-admin bucket stats' command reports it as only having 1.2 million. Well, the objects you're seeing are raw objects, and since rgw stripes the data, it is expected to have more raw objects than objects in the bucket. Still, it seems that you have much too many of these. You can try to check whether there are pending multipart uploads that were never completed using the S3 api. 
At the moment there's no easy way to figure out which raw objects are not supposed to exist. The process would be like this: 1. rados ls -p data pool keep the list sorted 2. list objects in the bucket 3. for each object in (2), do: radosgw-admin object stat --bucket=bucket --object=object --rgw-cache-enabled=false (disabling the cache so that it goes quicker) 4. look at the result of (3), and generate a list of all the parts. 5. sort result of (4), compare it to (1) Note that if you're running firefly or later, the raw objects are not specified explicitly in the command you run at (3), so you might need a different procedure, e.g., find out the raw objects random string that is being used, remove it from the list generated in 1, etc.) That's basically it. I'll be interested to figure out what happened, why the garbage collection didn't work correctly. You could try verifying that it's working by: - create an object (let's say ~10MB in size). - radosgw-admin object stat --bucket=bucket --object=object (keep this info, see - remove the object - run radosgw-admin gc list --include-all and verify that the raw parts are listed there - wait a few hours,
[ceph-users] Incomplete PGs
Hi all, I have a problem with some incomplete pgs. Here's the backstory: I had a pool that I had accidentally left with a size of 2. On one of the osd nodes, the system hdd started to fail and I attempted to rescue it by sacrificing one of my osd nodes. That went ok and I was able to bring the node back up minus the one osd. Now I have 11 incomplete pgs. I believe these are mostly from the pool that only had size 2, but I can't tell for sure.

I found another thread on here that talked about using ceph_objectstore_tool to add or remove pg data to get out of an incomplete state. Let's start with the one pg I've been playing with; this is a loose description of where I've been. First I saw that it had the missing osd in "down_osds_we_would_probe" when I queried it, and some reading around told me to recreate the missing osd, so I did that. It (obviously) didn't have the missing data, but it took the pg from down+incomplete to just incomplete. Then I tried pg_force_create and that didn't seem to make a difference.

Some more googling then brought me to ceph_objectstore_tool, and I started to take a closer look at the results from pg query. I noticed that the list of probing osds gets longer and longer, till the end of the query has something like:

    probing_osds: [0, 3, 4, 16, 23, 26, 35, 41, 44, 51, 56],

So I took a look at those osds and noticed that some of them have data in the directory for the troublesome pg and others don't. So I tried picking the one with the *most* data and used ceph_objectstore_tool to export the pg. It was 6G, so a fair amount of data is still there. I then imported it (after removing) into all the others in that list. Unfortunately, it is still incomplete. I'm not sure what my next step should be here. Here's some other stuff from the query on it:

    info: {
      pgid: 0.63b,
      last_update: 50495'8246,
      last_complete: 50495'8246,
      log_tail: 20346'5245,
      last_user_version: 8246,
      last_backfill: MAX,
      purged_snaps: [],
      history: {
        epoch_created: 1,
        last_epoch_started: 51102,
        last_epoch_clean: 50495,
        last_epoch_split: 0,
        same_up_since: 68312,
        same_interval_since: 68312,
        same_primary_since: 68190,
        last_scrub: 28158'8240,
        last_scrub_stamp: 2014-11-18 17:08:49.368486,
        last_deep_scrub: 28158'8240,
        last_deep_scrub_stamp: 2014-11-18 17:08:49.368486,
        last_clean_scrub_stamp: 2014-11-18 17:08:49.368486},
      stats: {
        version: 50495'8246,
        reported_seq: 84279,
        reported_epoch: 69394,
        state: down+incomplete,
        last_fresh: 2014-12-01 23:23:07.355308,
        last_change: 2014-12-01 21:28:52.771807,
        last_active: 2014-11-24 13:37:09.784417,
        last_clean: 2014-11-22 21:59:49.821836,
        last_became_active: 0.00,
        last_unstale: 2014-12-01 23:23:07.355308,
        last_undegraded: 2014-12-01 23:23:07.355308,
        last_fullsized: 2014-12-01 23:23:07.355308,
        mapping_epoch: 68285,
        log_start: 20346'5245,
        ondisk_log_start: 20346'5245,
        created: 1,
        last_epoch_clean: 50495,
        parent: 0.0,
        parent_split_bits: 0,
        last_scrub: 28158'8240,
        last_scrub_stamp: 2014-11-18 17:08:49.368486,
        last_deep_scrub: 28158'8240,
        last_deep_scrub_stamp: 2014-11-18 17:08:49.368486,
        last_clean_scrub_stamp: 2014-11-18 17:08:49.368486,
        log_size: 3001,
        ondisk_log_size: 3001,

Also, in the peering section, all the peers now have the same last_update, which makes me think it should just pick up and take off.

There is another thing I'm having problems with, and I'm not sure if it's related or not.
I set a crush map manually, as I have a mix of ssd and platter osds, and it seems to work when I set it; the cluster starts rebalancing, etc. But if I do a restart ceph-all on all my nodes, the crush map seems to revert to the one I didn't set. I don't know if it's being blocked from taking by these incomplete pgs, or if I'm missing a step to get it to "stick". It makes me think that when I'm stopping and starting these osds to use ceph_objectstore_tool on them, they may be getting out of sync with the cluster. Any insights would be greatly appreciated, Aaron
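For readers following along, the export/import sequence described above looks roughly like this. This is only a sketch: the pg id 0.63b and osd ids are taken from the post, the data/journal paths assume default locations, and the flags should be checked against the ceph_objectstore_tool build actually in use:

    # stop the OSD holding the most complete copy of the pg, then export it
    sudo service ceph stop osd.35
    sudo ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-35 \
        --journal-path /var/lib/ceph/osd/ceph-35/journal \
        --op export --pgid 0.63b --file /tmp/0.63b.export

    # on each other OSD in probing_osds: remove the stale copy, then import
    sudo service ceph stop osd.41
    sudo ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-41 \
        --journal-path /var/lib/ceph/osd/ceph-41/journal \
        --op remove --pgid 0.63b
    sudo ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-41 \
        --journal-path /var/lib/ceph/osd/ceph-41/journal \
        --op import --file /tmp/0.63b.export
    sudo service ceph start osd.41

On the crush map reverting, one thing worth checking is the 'osd crush update on start' option: by default each osd re-declares its crush location when it starts, which can silently overwrite a hand-edited map. If that is what is happening here, setting it in ceph.conf should make the hand-set map stick:

    [osd]
    osd crush update on start = false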
Re: [ceph-users] Client forward compatibility
Dan van der Ster (Thu, Nov 20, 2014): Hi all, what is the compatibility/incompatibility of dumpling clients talking to firefly and giant clusters?

Gregory Farnum (Mon, Nov 24): We sadly don't have a good matrix about this yet, but in general you should assume that anything which changed the way the data is physically placed on the cluster will prevent them from communicating; if you don't enable those features, then they should remain compatible.

Dan: It would be good to have such a compat matrix, as I was confused, probably others are confused, and if I'm not wrong, even you are confused ... see below. In particular, I know that tunables=firefly will prevent dumpling clients from talking to a firefly cluster, but how about the existence or not of erasure pools?

Greg: As you mention, updating the tunables will prevent old clients from accessing them (although that shouldn't be the case in future, now that they're all set by the crush map for later interpretation). Erasure pools are a special case (precisely because people had issues with them), and you should be able to communicate with a cluster that has EC pools while using old clients.

Dan: That's what we'd hoped, but alas we get the same error mentioned here: http://tracker.ceph.com/issues/8178 In our case (0.67.11 clients talking to the latest firefly gitbuilder build) we get: protocol feature mismatch, my 407 peer 417 missing 10. By adding an EC pool, we lose connectivity for dumpling clients to even the replicated pools. The good news is that when we remove the EC pool, the 10 feature bit is removed, so dumpling clients can connect again. But nevertheless, it leaves open the possibility of accidentally breaking the users' access.

Greg: Yep. Sorry, apparently we tried to do this and didn't quite make it all the way. :/ We discussed last week trying to build and maintain a forward compatibility matrix, but haven't done it yet. There's one floating around somewhere in the docs for the kernel client, but a userspace one just hasn't been anything people have asked for previously, so we never thought of it. Meanwhile, I'm sure it's not the most pleasant way to do things, but if you go over the upgrade notes for each major release, they should include the possible break points. -Greg
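For anyone wanting to check the tunables side of this before letting old clients connect, the cluster's current profile can be inspected, and if need be dialed back, from an admin node (commands exist as of firefly; output fields vary by release):

    # show the crush tunables the cluster is advertising
    ceph osd crush show-tunables

    # revert to pre-bobtail behaviour so older clients can connect
    ceph osd crush tunables legacy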
Re: [ceph-users] Giant + nfs over cephfs hang tasks
Andrei Mikhailovsky (Sun, Nov 30, 2014): Greg, thanks for your comment. Could you please share what OS, kernel, and nfs/cephfs settings you've used to achieve the stability you mention? Also, what kind of tests have you run to check that?

Gregory Farnum: We're just doing it on our testing cluster with the teuthology/ceph-qa-suite stuff in https://github.com/ceph/ceph-qa-suite/tree/master/suites/knfs/basic So that'll be running our ceph-client kernel, which I believe is usually a recent rc release with the new Ceph changes on top, with knfs exporting a kcephfs mount, and then running each of the tasks named in the tasks folder on top of a client of that knfs export. -Greg
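For anyone reproducing this arrangement by hand, knfs over a kernel CephFS mount amounts to roughly the following on the NFS server (a sketch; the monitor address, secret file, and fsid are placeholders):

    # kernel CephFS mount that will be exported
    mount -t ceph mon1.example.com:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret

    # add to /etc/exports -- an explicit fsid is needed because cephfs
    # has no block device for knfs to derive one from:
    #   /mnt/cephfs  *(rw,no_subtree_check,fsid=100)

    exportfs -ra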
Re: [ceph-users] Revisiting MDS memory footprint
John Spray (Mon, Dec 1, 2014): I meant to chime in earlier here, but then the weekend happened; comments inline.

Wido den Hollander (Sun, Nov 30): Why would you want all CephFS metadata in memory? With any filesystem that will be a problem.

John: The latency associated with a cache miss (a RADOS OMAP dirfrag read) is fairly high, so the goal when sizing will be to allow the MDSs to keep a very large proportion of the metadata in RAM. In a local FS, the filesystem metadata in RAM is relatively small and the speed to disk is relatively high. In CephFS, that is reversed: we want to compensate for the cache-miss latency by having lots of RAM in the MDS and a big cache. Hot-standby MDSs are another manifestation of the expected large cache: we expect these caches to be big, to the point where refilling from the backing store on a failure would be annoyingly slow, and it's worth keeping that hot standby cache.

Gregory Farnum: I actually don't think the cache misses should be *dramatically* more expensive than local FS misses. They'll be larger since it's remote, and a leveldb lookup is a bit slower than hitting the right spot on disk, but everything's nicely streamed in and such, so it's not too bad. But I'm also making this up as much as you are the rest of it, which looks good to me. :) The one thing I'd also bring up is to be a bit more explicit about CephFS in-memory inode size having nothing to do with that of a local FS. We don't need to keep track of things like block locations, but we do keep track of file capabilities (leases) and a whole bunch of other state, like the scrubbing/fsck status of the inode (coming soon!), the clean/dirty status in a lot more detail than the kernel does, any old versions of the inode that have been snapshotted, etc. Once upon a time Sage did have some numbers indicating that a cached dentry took about 1KB, but things change in both directions pretty frequently, and memory use will likely be something we look at around the time we're wondering if we should declare CephFS ready for community use in production previews. -Greg

John: Also, remember that because we embed inodes in dentries, when we load a directory fragment we are also loading all the inodes in that directory fragment -- if you have only one file open, but it has an ancestor with lots of files, then you'll have more files in cache than you might have expected.

Wido: We do, however, need a good rule of thumb for how much memory is used per inode.

John: Yes -- and ideally some practical measurements too :-) One important point that I don't think anyone has mentioned so far: the memory consumption per inode depends on how many clients have capabilities on the inode. So if many clients hold a read capability on a file, more memory will be used MDS-side for that file. If designing a benchmark for this, the client count and the level of overlap in the client workloads would be an important dimension. The number of *open* files on clients strongly affects the ability of the MDS to trim its cache, since the MDS pins in cache any inode which is in use by a client. We recently added health checks so that the MDS can complain about clients that are failing to respond to requests to trim their caches, and the way we test this is to have a client obstinately keep some number of files open.
We also allocate memory for pending metadata updates (so-called 'projected inodes') while they are in the journal, so the memory usage will also depend on the journal size and the number of writes in flight. It would be really useful to come up with a test script that monitors MDS memory consumption as a function of the number of files in cache, the number of files opened by clients, and the number of clients opening the same files. I feel a 3D chart plot coming on :-) Cheers, John
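As a starting point for the kind of script John describes, something like the following could sample MDS memory against cache counters. A sketch only: the daemon id "a" is a placeholder, and the perf-counter names (mds/inodes, mds/caps) are assumptions that should be checked against 'ceph daemon mds.<id> perf dump' on the version in use:

    #!/bin/bash
    # sample MDS RSS alongside cached-inode and capability counts every 10s
    MDS=a     # placeholder daemon id
    while sleep 10; do
        PID=$(pgrep -f ceph-mds | head -1)
        RSS_KB=$(awk '/VmRSS/ {print $2}' "/proc/$PID/status")
        INODES=$(ceph daemon "mds.$MDS" perf dump | jshon -e mds -e inodes)
        CAPS=$(ceph daemon "mds.$MDS" perf dump | jshon -e mds -e caps)
        echo "$(date +%s) rss_kb=$RSS_KB inodes=$INODES caps=$CAPS"
    done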
Re: [ceph-users] trouble starting second monitor
Hm. Already exists. And now I'm completely confused.

Ok, so I'm trying to start over. I've ceph-deploy purge'd all my machines a few times, with ceph-deploy purgedata intermixed. I've manually removed all the files I could see that were generated, except my osd directories, which I apparently can't remove:

    ceph@adriatic:~$ sudo rm -rf osd
    rm: cannot remove '...': Operation not permitted
    rm: cannot remove '...': Operation not permitted
    rm: cannot remove '...': Operation not permitted

What's up with that, and how do I get rid of it in order to start over?

--rich

On 12/1/14 00:01, Irek Fasikhov wrote: [...]
Re: [ceph-users] Deleting buckets and objects fails to reduce reported cluster usage
...

Ben: How can I tell if the shard has an object in it from the logs?

Yehuda Sadeh: Search for a different sequence (e.g., search for rgw.gc_remove).

Ben: 0 results in the logs for rgw.gc_remove.

Yehuda: Well, something is modifying the gc log. Do you happen to have more than one radosgw running on the same cluster?

Ben: We have 2 radosgw servers, obj01 and obj02.

Yehuda: Are both of them pointing at the same zone?

Ben: Yes, they are load balanced.

Yehuda: Well, the gc log shows entries and then it doesn't, so something clears these up. Try reproducing again with logs on, and see if you see new entries in the rgw logs. If you don't see these, maybe try turning on 'debug ms = 1' on your osds (ceph tell osd.* injectargs '--debug_ms 1') and look in your osd logs for such messages. These might give you some hint as to their origin. Also, could it be that you ran 'radosgw-admin gc process' instead of waiting for the gc cycle to complete? Yehuda
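To make the reproduce-with-logging step concrete, the cycle Yehuda asks for could look like this (a sketch; s3cmd stands in for whatever S3 client is handy, and the bucket/object names are placeholders):

    # turn up messenger logging on the OSDs
    ceph tell osd.\* injectargs '--debug_ms 1'

    # create, stat, and delete a ~10MB probe object
    dd if=/dev/zero of=./10mb.bin bs=1M count=10
    s3cmd put ./10mb.bin s3://testbucket/gc-probe
    radosgw-admin object stat --bucket=testbucket --object=gc-probe > /tmp/gc-probe.json
    s3cmd del s3://testbucket/gc-probe

    # the raw parts should now be queued for gc...
    radosgw-admin gc list --include-all

    # ...and after the gc cycle has run (hours, by default), gone again
    radosgw-admin gc list --include-all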
Re: [ceph-users] Deleting buckets and objects fails to reduce reported cluster usage
On 2014-12-02 15:03, Yehuda Sadeh wrote: [...]

I did another test, this time with a 600MB file. I uploaded it, then deleted the file and did a gc list --include-all. It displayed around 143 _shadow_ files. I let GC process by itself (I did not force it), and I checked the pool afterwards by running 'rados ls -p .rgw.buckets | grep <gc-listed shadow file>'; they no longer exist.

I've added the debug ms to the OSDs, and I'll do another test with the 600MB file.

Also worth noting: I have started clearing out files from the .rgw.buckets pool that are from a bucket which has been deleted and is no longer visible, by running 'rados -p .rgw.buckets rm' over all the _shadow_ files contained in that bucket's prefix, default.4804.14__shadow_. Is this alright to do, or is there a better way to clear out these files?
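For reference, the manual cleanup Ben describes boils down to something like this (a sketch using the prefix from his post; make very sure the bucket really is gone before removing its raw objects):

    PREFIX=default.4804.14__shadow_
    rados ls -p .rgw.buckets | grep "^$PREFIX" | while read -r OBJ; do
        rados -p .rgw.buckets rm "$OBJ"
    done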
Re: [ceph-users] LevelDB support status is still experimental on Giant?
Xiaoxi Chen: Compared to Filestore on SSD (we run LevelDB on top of SSD). The usage pattern is RBD sequential write (64K * QD8) and random write (4K * QD8); read seems on par. I would suspect a KV backend on HDD will be even worse compared to Filestore on HDD.

From: Satoru Funai [mailto:satoru.fu...@gmail.com]
Sent: Tuesday, December 2, 2014 1:27 PM
To: Chen, Xiaoxi
Cc: ceph-us...@ceph.com; Haomai Wang
Subject: Re: [ceph-users] LevelDB support status is still experimental on Giant?

Satoru Funai: Hi Xiaoxi, thanks for the very useful information. Can you share more details? The "terrible bad performance" is compared against what, and with what kind of usage pattern? I'm just interested in a key/value backend for better cost/performance without expensive HW such as ssd/fusion-io. Regards, Satoru Funai

From: Xiaoxi Chen xiaoxi.c...@intel.com
To: Haomai Wang haomaiw...@gmail.com
Cc: Satoru Funai satoru.fu...@gmail.com, ceph-us...@ceph.com
Sent: Monday, December 1, 2014, 11:26:56 PM
Subject: RE: [ceph-users] LevelDB support status is still experimental on Giant?

Xiaoxi: Range query is not that important on today's SSDs; you can see very high random read IOPS in SSD specs, getting higher day by day. The key problem here is trying to match one query (get/put) exactly to one SSD IO (read/write), eliminating the read/write amplification. We kind of believe OpenNvmKV may be the right approach. Back to the context of Ceph: can we find a use case for today's key-value backends? We would like to learn from the community what the workload pattern is if you want a K-V backed Ceph, or is it just to have a try? I think before we get a suitable DB backend, we had better optimize the key-value backend code to support a specific kind of load.

From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Monday, December 1, 2014 10:14 PM
To: Chen, Xiaoxi
Cc: Satoru Funai; ceph-us...@ceph.com
Subject: Re: [ceph-users] LevelDB support status is still experimental on Giant?

Haomai Wang: Exactly. I'm just looking forward to a better DB backend suitable for KeyValueStore; maybe a traditional B-tree design. Kinetic I originally thought was a good backend, but it doesn't support range query :-(

On Mon, Dec 1, 2014 at 10:04 PM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:

Xiaoxi: We have tested it for a while; basically it seems kind of stable, but it shows terribly bad performance. This is not the fault of Ceph but of LevelDB, or, more generally, of all K-V storage with an LSM design (RocksDB, etc.): the LSM tree structure naturally introduces very large write amplification, 10X to 20X, once you have tens of GB of data per OSD. So you always see very bad sequential write performance (~200MB/s for a 12-SSD setup); we can share more details at the performance meeting. To this end, a key-value backend with LevelDB is not usable for RBD usage, but it may be workable (not tested) in LOSF cases (tons of small objects stored via rados, where a K-V backend can prevent the FS metadata from becoming the bottleneck).

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Haomai Wang
Sent: Monday, December 1, 2014 9:48 PM
To: Satoru Funai
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] LevelDB support status is still experimental on Giant?

Haomai: Yeah, mainly used by test env.

On Mon, Dec 1, 2014 at 6:29 PM, Satoru Funai satoru.fu...@gmail.com wrote:

Satoru: Hi guys, I'm interested in using a key/value store as a backend for the Ceph OSD. When firefly was released, LevelDB support was mentioned as experimental; is it the same status on the Giant release? Regards, Satoru Funai

-- Best Regards, Wheat
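For anyone who still wants to experiment despite the caveats above, the key/value backend in this era was selected per-OSD in ceph.conf. A sketch only: the exact backend name has varied between releases (firefly shipped it as keyvaluestore-dev), so check the release notes for the version in use:

    [osd]
    # experimental key/value object store instead of the default filestore
    osd objectstore = keyvaluestore-dev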