Re: [ceph-users] capacity planning - iops

2016-09-19 Thread Jan Schermer
Are you talking about global IOPS or per-VM/per-RBD device?
And at what queue depth?
It all comes down to latency. I'm not sure what the numbers are on recent 
versions of Ceph and on modern OSes, but I doubt it will be <1ms for the OSD 
daemon alone. That caps you at roughly 1000 real synchronous IOPS at queue 
depth 1. With higher queue depths (or with more RBD devices in parallel) you 
can reach higher numbers, but you need to know what your application needs.
For SATA drives, you need to add their latency on top of this, and it only 
scales when the writes are distributed across all the drives (so if you hammer 
a 4k region it will still hit the same drives even with a higher queue depth, 
which may or may not increase throughput, or may even make it worse...)
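To put a rough number on it (everything below is an assumption to illustrate 
the arithmetic, not a measurement - measure your own latencies with fio):

osd_lat_ms=0.5   # assumed OSD daemon overhead per write
sata_lat_ms=8    # assumed 7.2k SATA write latency
awk -v o="$osd_lat_ms" -v d="$sata_lat_ms" \
    'BEGIN { printf "~%d sync write IOPS at queue depth 1\n", 1000/(o+d) }'

Higher queue depths multiply that only as long as the writes actually land on 
different OSDs/drives.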

Jan


> On 19 Sep 2016, at 16:23, Matteo Dacrema  wrote:
> 
> Hi All,
> 
> I’m trying to estimate how many iops ( 4k direct random write )  my ceph 
> cluster should deliver.
> I’ve Journal on SSDs and SATA 7.2k drives for OSD.
> 
> The question is: does the journal on SSD increase the maximum number of write 
> iops, or do I need to consider only the IOPS provided by the SATA drives 
> divided by the replica count?
> 
> Regards
> M.
> 
> 
> 
> 


Re: [ceph-users] BlueStore write amplification

2016-08-23 Thread Jan Schermer
Is that 400MB/s across all nodes or on each node? If it's the total across all 
nodes then 10:1 is not that surprising.
What was the block size in your fio benchmark?
We had much higher amplification on our cluster with snapshots and stuff...
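For reference, the amplification factor is just the ratio of the two figures 
from your mail (a sketch; only the 40 and 400 come from your numbers):

fio_bw=40      # MB/s the client (fio) wrote
iostat_bw=400  # MB/s the SSDs absorbed - sum it over all data SSDs, not one device
awk -v c="$fio_bw" -v d="$iostat_bw" 'BEGIN { printf "write amplification ~%.1fx\n", d/c }'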

Jan

> On 23 Aug 2016, at 08:38, Zhiyuan Wang  wrote:
> 
> Hi 
> I have tested bluestore on SSD, and I found that the BW from fio is about 
> 40MB/s, but the write BW from iostat on the SSD is about 400MB/s, nearly ten 
> times as much.
> Could someone help to explain this? 
> Thanks a lot.
>  
> Below are my configuration file:
> [global]
> fsid = 31e77e3c-447c-4745-a91a-58bda80a868c
> enable experimental unrecoverable data corrupting features = bluestore rocksdb
> osd objectstore = bluestore
>  
> bluestore default buffered read = true
> bluestore_min_alloc_size=4096
> osd pool default size = 1
>  
> osd pg bits = 8
> osd pgp bits = 8
> auth supported = none
> log to syslog = false
> filestore xattr use omap = true
> auth cluster required = none
> auth service required = none
> auth client required = none
>  
> public network = 192.168.200.233/24
> cluster network = 192.168.100.233/24
>  
> mon initial members = node3
> mon host = 192.168.200.233
> mon data = /etc/ceph/mon.node3
>   
> filestore merge threshold = 40
> filestore split multiple = 8
> osd op threads = 8
>  
> debug_bluefs = "0/0"
> debug_bluestore = "0/0"
> debug_bdev = "0/0" 
> debug_lockdep = "0/0" 
> debug_context = "0/0"  
> debug_crush = "0/0"
> debug_mds = "0/0"
> debug_mds_balancer = "0/0"
> debug_mds_locker = "0/0"
> debug_mds_log = "0/0"
> debug_mds_log_expire = "0/0"
> debug_mds_migrator = "0/0"
> debug_buffer = "0/0"
> debug_timer = "0/0"
> debug_filer = "0/0"
> debug_objecter = "0/0"
> debug_rados = "0/0"
> debug_rbd = "0/0"
> debug_journaler = "0/0"
> debug_objectcacher = "0/0"
> debug_client = "0/0"
> debug_osd = "0/0"
> debug_optracker = "0/0"
> debug_objclass = "0/0"
> debug_filestore = "0/0"
> debug_journal = "0/0"
> debug_ms = "0/0"
> debug_mon = "0/0"
> debug_monc = "0/0"
> debug_paxos = "0/0"
> debug_tp = "0/0"
> debug_auth = "0/0"
> debug_finisher = "0/0"
> debug_heartbeatmap = "0/0"
> debug_perfcounter = "0/0"
> debug_rgw = "0/0"
> debug_hadoop = "0/0"
> debug_asok = "0/0"
> debug_throttle = "0/0"
>  
> [osd.0]
> host = node3
> osd data = /etc/ceph/osd-device-0-data
> bluestore block path = /dev/disk/by-partlabel/osd-device-0-block
> bluestore block db path = /dev/disk/by-partlabel/osd-device-0-db
> bluestore block wal path = /dev/disk/by-partlabel/osd-device-0-wal
>  
> [osd.1]
> host = node3
> osd data = /etc/ceph/osd-device-1-data
> bluestore block path = /dev/disk/by-partlabel/osd-device-1-block
> bluestore block db path = /dev/disk/by-partlabel/osd-device-1-db
> bluestore block wal path = /dev/disk/by-partlabel/osd-device-1-wal
> [osd.2]
> host = node3
> osd data = /etc/ceph/osd-device-2-data
> bluestore block path = /dev/disk/by-partlabel/osd-device-2-block
> bluestore block db path = /dev/disk/by-partlabel/osd-device-2-db
> bluestore block wal path = /dev/disk/by-partlabel/osd-device-2-wal
>  
>  
> [osd.3]
> host = node3
> osd data = /etc/ceph/osd-device-3-data
> bluestore block path = /dev/disk/by-partlabel/osd-device-3-block
> bluestore block db path = /dev/disk/by-partlabel/osd-device-3-db
> bluestore block wal path = /dev/disk/by-partlabel/osd-device-3-wal
>  


Re: [ceph-users] Include mon restart in logrotate?

2016-08-11 Thread Jan Schermer
I had to make a cronjob to trigger compact on the MONs as well.
Ancient version, though.
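Something along these lines (a sketch - the mon id and schedule are 
placeholders, and as Wido notes below, recent monitors should compact on their 
own):

# /etc/cron.d/ceph-mon-compact
0 3 * * 0   root   ceph tell mon.$(hostname -s) compact >/dev/null 2>&1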

Jan

> On 11 Aug 2016, at 10:09, Wido den Hollander  wrote:
> 
> 
>> On 11 August 2016 at 9:56, Eugen Block wrote:
>> 
>> 
>> Hi list,
>> 
>> we have a working cluster based on Hammer with 4 nodes, 19 OSDs and 3 MONs.
>> Now after a couple of weeks we noticed that we're running out of disk  
>> space on one of the nodes in /var.
>> Similar to [1] there are two large LOG files in  
>> /var/lib/ceph/mon/ceph-d/store.db/ and I already figured they are  
>> managed when the respective MON is restarted. But the MONs are not  
>> restarted regularly so the log files can grow for months and fill up  
>> the file system.
>> 
> 
> Warning! These are not your regular log files. They are binary logs of 
> LevelDB which are mandatory for the MONs to work!
> 
>> I was thinking about adding another file in /etc/logrotate.d/ and  
>> trigger a monitor restart once a week. But I'm not sure if it's  
>> recommended to restart all MONs at the same time, which could happen  
>> if someone started logrotate manually.
>> So my question is, how do you guys manage that and how is it supposed  
>> to be handled? I'd really appreciate any insights!
>> 
> You shouldn't have to worry about that. The MONs should compact and rotate 
> those logs themselves.
> 
> They compact their store on start, so that works for you, but they should do 
> this while running.
> 
> What version of Ceph are you running exactly?
> 
> What is the output of ceph -s? MONs usually only compact when the cluster is 
> healthy.
> 
> Wido
> 
>> Regards,
>> Eugen
>> 
>> [1]  
>> http://ceph-users.ceph.narkive.com/PBL3kuhq/large-log-like-files-on-monitor
>> 
>> -- 
>> Eugen Block voice   : +49-40-559 51 75
>> NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
>> Postfach 61 03 15
>> D-22423 Hamburg e-mail  : ebl...@nde.ag
>> 
>> Vorsitzende des Aufsichtsrates: Angelika Mozdzen
>>   Sitz und Registergericht: Hamburg, HRB 90934
>>   Vorstand: Jens-U. Mozdzen
>>USt-IdNr. DE 814 013 983
>> 


Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Jan Schermer
Christian, can you post your values for Power_Loss_Cap_Test on the drive which 
is failing?
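Something like this should pull out the relevant attributes (the device name is 
a placeholder):

smartctl -A /dev/sdX | grep -Ei 'power_loss|reallocated|available_reservd'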

Thanks
Jan

> On 03 Aug 2016, at 13:33, Christian Balzer <ch...@gol.com> wrote:
> 
> 
> Hello,
> 
> yeah, I was particularly interested in the Power_Loss_Cap_Test bit, as it
> seemed to be such an odd thing to fail (given that it's not a single capacitor).
> 
> As for your Reallocated_Sector_Ct, that's really odd and definitely a RMA
> worthy issue. 
> 
> For the record, Intel SSDs use (typically 24) sectors when doing firmware
> upgrades, so this is a totally healthy 3610. ^o^
> ---
>  5 Reallocated_Sector_Ct   0x0032   099   099   000    Old_age   Always       -       24
> ---
> 
> Christian
> 
> On Wed, 3 Aug 2016 13:12:53 +0200 Daniel Swarbrick wrote:
> 
>> Right, I actually updated to smartmontools 6.5+svn4324, which now
>> properly supports this drive model. Some of the smart attr names have
>> changed, and make more sense now (and there are no more "Unknowns"):
>> 
>> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>>   5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    944
>>   9 Power_On_Hours          -O--CK   100   100   000    -    1067
>>  12 Power_Cycle_Count       -O--CK   100   100   000    -    7
>> 170 Available_Reservd_Space PO--CK   085   085   010    -    0
>> 171 Program_Fail_Count      -O--CK   100   100   000    -    0
>> 172 Erase_Fail_Count        -O--CK   100   100   000    -    68
>> 174 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
>> 175 Power_Loss_Cap_Test     PO--CK   100   100   010    -    6510 (4 4307)
>> 183 SATA_Downshift_Count    -O--CK   100   100   000    -    0
>> 184 End-to-End_Error        PO--CK   100   100   090    -    0
>> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
>> 190 Temperature_Case        -O---K   070   065   000    -    30 (Min/Max 25/35)
>> 192 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
>> 194 Temperature_Internal    -O---K   100   100   000    -    30
>> 197 Current_Pending_Sector  -O--C-   100   100   000    -    1100
>> 199 CRC_Error_Count         -OSRCK   100   100   000    -    0
>> 225 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
>> 226 Workld_Media_Wear_Indic -O--CK   100   100   000    -    20
>> 227 Workld_Host_Reads_Perc  -O--CK   100   100   000    -    82
>> 228 Workload_Minutes        -O--CK   100   100   000    -    64012
>> 232 Available_Reservd_Space PO--CK   084   084   010    -    0
>> 233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
>> 234 Thermal_Throttle        -O--CK   100   100   000    -    0/0
>> 241 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
>> 242 Host_Reads_32MiB        -O--CK   100   100   000    -    92945
>> 243 NAND_Writes_32MiB       -O--CK   100   100   000    -    95289
>> 
>> Reallocated_Sector_Ct is still increasing, but Available_Reservd_Space
>> seems to be holding steady.
>> 
>> AFAIK, we've only had one other S3610 fail, and it seemed to be a sudden
>> death. The drive simply disappeared from the controller one day, and
>> could no longer be detected.
>> 
>> On 03/08/16 12:15, Jan Schermer wrote:
>>> Make sure you are reading the right attribute and interpreting it right.
>>> update-smart-drivedb sometimes makes wonders :)
>>> 
>>> I wonder what isdct tool would say the drive's life expectancy is with this 
>>> workload? Are you really writing ~600TB/month??
>>> 
>>> Jan
>>> 
>> 
>> 
> 
> 
> -- 
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com Global OnLine Japan/Rakuten Communications
> http://www.gol.com/


Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Jan Schermer
I'm a fool, I miscalculated the writes by a factor of 1000 of course :-)
600GB/month is not much for S36xx at all, must be some sort of defect then...

Jan


> On 03 Aug 2016, at 12:15, Jan Schermer <j...@schermer.cz> wrote:
> 
> Make sure you are reading the right attribute and interpreting it right.
> update-smart-drivedb sometimes makes wonders :)
> 
> I wonder what isdct tool would say the drive's life expectancy is with this 
> workload? Are you really writing ~600TB/month??
> 
> Jan
> 
> 
>> On 03 Aug 2016, at 12:06, Maxime Guyot <maxime.gu...@elits.com> wrote:
>> 
>> Hi,
>> 
>> I haven’t had problems with Power_Loss_Cap_Test so far. 
>> 
>> Regarding Reallocated_Sector_Ct (SMART ID: 5/05h), you can check the 
>> “Available Reserved Space” (SMART ID: 232/E8h), the data sheet 
>> (http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3610-spec.pdf)
>>  reads:
>> "This attribute reports the number of reserve blocks
>> 
>>  remaining. The normalized value 
>> begins at 100 (64h),
>> which corresponds to 100 percent availability of the
>> reserved space. The threshold value for this attribute is
>> 10 percent availability."
>> 
>> According to the SMART data you copied, it should be about 84% of the over 
>> provisioning left? Since the drive is pretty young, it might be some form of 
>> defect?
>> I have a number of S3610 with ~150 DW, all SMART counters are their initial 
>> values (except for the temperature).
>> 
>> Cheers,
>> Maxime
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On 03/08/16 11:12, "ceph-users on behalf of Daniel Swarbrick" 
>> <ceph-users-boun...@lists.ceph.com on behalf of 
>> daniel.swarbr...@profitbricks.com> wrote:
>> 
>>> Hi Christian,
>>> 
>>> Intel drives are good, but apparently not infallible. I'm watching a DC
>>> S3610 480GB die from reallocated sectors.
>>> 
>>> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>>>   5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    756
>>>   9 Power_On_Hours          -O--CK   100   100   000    -    1065
>>>  12 Power_Cycle_Count       -O--CK   100   100   000    -    7
>>> 175 Program_Fail_Count_Chip PO--CK   100   100   010    -    17454078318
>>> 183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
>>> 184 End-to-End_Error        PO--CK   100   100   090    -    0
>>> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
>>> 190 Airflow_Temperature_Cel -O---K   070   065   000    -    30 (Min/Max 25/35)
>>> 192 Power-Off_Retract_Count -O--CK   100   100   000    -    6
>>> 194 Temperature_Celsius     -O---K   100   100   000    -    30
>>> 197 Current_Pending_Sector  -O--C-   100   100   000    -    1288
>>> 199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -    0
>>> 228 Power-off_Retract_Count -O--CK   100   100   000    -    63889
>>> 232 Available_Reservd_Space PO--CK   084   084   010    -    0
>>> 233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
>>> 241 Total_LBAs_Written      -O--CK   100   100   000    -    20131
>>> 242 Total_LBAs_Read         -O--CK   100   100   000    -    92945
>>> 
>>> The Reallocated_Sector_Ct is increasing about once a minute. I'm not
>>> sure how many reserved sectors the drive has, i.e., how soon before it
>>> starts throwing write IO errors.
>>> 
>>> It's a very young drive, with only 1065 hours on the clock, and has not
>>> even done two full drive-writes:
>>> 
>>> Device Statistics (GP Log 0x04)
>>> Page Offset Size Value  Description
>>> 1  =  ==  == General Statistics (rev 2) ==
>>> 1  0x008  4    7           Lifetime Power-On Resets
>>> 1  0x018  6    1319318736  Logical Sectors Written
>>> 1  0x020  6    137121729   Number of Write Commands
>>> 1  0x028  6    6091245600  Logical Sectors Read
>>> 1  0x030  6    115252407   Number of Read Commands
>>> 
>>> Fortunately this drive is not used as a Ceph journal. It's in a mdraid
>>> RAID5 array :-|
>>> 
>>> Cheers,
>>> Daniel
>>> 
>>> On 03/08/16 07:45, Christian Balzer wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> not a Ceph specific issue, but this is probably the largest sample size of
>>>> SSD users I'm familiar with. ^o^

Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Jan Schermer
Make sure you are reading the right attribute and interpreting it right.
update-smart-drivedb sometimes makes wonders :)

I wonder what isdct tool would say the drive's life expectancy is with this 
workload? Are you really writing ~600TB/month??

Jan


> On 03 Aug 2016, at 12:06, Maxime Guyot  wrote:
> 
> Hi,
> 
> I haven’t had problems with Power_Loss_Cap_Test so far. 
> 
> Regarding Reallocated_Sector_Ct (SMART ID: 5/05h), you can check the 
> “Available Reserved Space” (SMART ID: 232/E8h), the data sheet 
> (http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3610-spec.pdf)
>  reads:
> "This attribute reports the number of reserve blocks
> 
>   remaining. The normalized value 
> begins at 100 (64h),
> which corresponds to 100 percent availability of the
> reserved space. The threshold value for this attribute is
> 10 percent availability."
> 
> According to the SMART data you copied, it should be about 84% of the over 
> provisioning left? Since the drive is pretty young, it might be some form of 
> defect?
> I have a number of S3610 with ~150 DW, all SMART counters are their initial 
> values (except for the temperature).
> 
> Cheers,
> Maxime
> 
> 
> 
> 
> 
> 
> 
> 
> On 03/08/16 11:12, "ceph-users on behalf of Daniel Swarbrick" 
> <ceph-users-boun...@lists.ceph.com on behalf of 
> daniel.swarbr...@profitbricks.com> wrote:
> 
>> Hi Christian,
>> 
>> Intel drives are good, but apparently not infallible. I'm watching a DC
>> S3610 480GB die from reallocated sectors.
>> 
>> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>>   5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    756
>>   9 Power_On_Hours          -O--CK   100   100   000    -    1065
>>  12 Power_Cycle_Count       -O--CK   100   100   000    -    7
>> 175 Program_Fail_Count_Chip PO--CK   100   100   010    -    17454078318
>> 183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
>> 184 End-to-End_Error        PO--CK   100   100   090    -    0
>> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
>> 190 Airflow_Temperature_Cel -O---K   070   065   000    -    30 (Min/Max 25/35)
>> 192 Power-Off_Retract_Count -O--CK   100   100   000    -    6
>> 194 Temperature_Celsius     -O---K   100   100   000    -    30
>> 197 Current_Pending_Sector  -O--C-   100   100   000    -    1288
>> 199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -    0
>> 228 Power-off_Retract_Count -O--CK   100   100   000    -    63889
>> 232 Available_Reservd_Space PO--CK   084   084   010    -    0
>> 233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
>> 241 Total_LBAs_Written      -O--CK   100   100   000    -    20131
>> 242 Total_LBAs_Read         -O--CK   100   100   000    -    92945
>> 
>> The Reallocated_Sector_Ct is increasing about once a minute. I'm not
>> sure how many reserved sectors the drive has, i.e., how soon before it
>> starts throwing write IO errors.
>> 
>> It's a very young drive, with only 1065 hours on the clock, and has not
>> even done two full drive-writes:
>> 
>> Device Statistics (GP Log 0x04)
>> Page Offset Size Value  Description
>> 1  =  ==  == General Statistics (rev 2) ==
>> 1  0x008  4    7           Lifetime Power-On Resets
>> 1  0x018  6    1319318736  Logical Sectors Written
>> 1  0x020  6    137121729   Number of Write Commands
>> 1  0x028  6    6091245600  Logical Sectors Read
>> 1  0x030  6    115252407   Number of Read Commands
>> 
>> Fortunately this drive is not used as a Ceph journal. It's in a mdraid
>> RAID5 array :-|
>> 
>> Cheers,
>> Daniel
>> 
>> On 03/08/16 07:45, Christian Balzer wrote:
>>> 
>>> Hello,
>>> 
>>> not a Ceph specific issue, but this is probably the largest sample size of
>>> SSD users I'm familiar with. ^o^
>>> 
>>> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
>>> religious experience.
>>> 
>>> It turns out that the SMART check plugin I run to mostly get an early
>>> wearout warning detected a "Power_Loss_Cap_Test" failure in one of the
>>> 200GB DC S3700 used for journals.
>>> 
>>> While SMART is of the opinion that this drive is failing and will explode
>>> spectacularly any moment that particular failure is of little worries to
>>> me, never mind that I'll eventually replace this unit.
>>> 
>>> What brings me here is that this is the first time in over 3 years that an
>>> Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
>>> this particular failure has been seen by others.
>>> 
>>> That of course entails people actually monitoring for these things. ^o^
>>> 
>>> Thanks,
>>> 
>>> Christian
>>> 
>> 
>> 

Re: [ceph-users] ceph + vmware

2016-07-20 Thread Jan Schermer

> On 20 Jul 2016, at 18:38, Mike Christie  wrote:
> 
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>> 
>> Hi Mike,
>> 
>> Thanks for the update on the RHCS iSCSI target.
>> 
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
> 
> No HA support for sure. We are looking into non HA support though.
> 
>> 
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>> 
>> So we're currently running :
>> 
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>> but in a failover manner so that each ESXi always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>> 
>> Do you see anthing risky regarding this configuration ?
> 
> If you use a application that uses scsi persistent reservations then you
> could run into troubles, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
> 
> Depending on the how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node looses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
> 

With iSCSI you can't really do hot failover unless you only use synchronous IO
(with any of the open-source target software available).
Flushing the buffers doesn't really help because you don't know which in-flight 
IO completed before the outage and which didn't. You could end up with only part 
of the "transaction" written to persistent storage.

If you only use synchronous IO all the way from the client to the persistent 
storage shared between the iSCSI targets, then all should be fine; otherwise 
YMMV - some people run it like that without realizing the dangers and have never 
had a problem, so the risk may be strictly theoretical, and it all depends on 
how often you need to fail over and what data you are storing - corrupting a few 
images on a gallery site could be fine, but corrupting a large database 
tablespace is no fun at all.

Some (non open-source) solutions exist - Solaris supposedly does this in some(?) 
way, maybe an iSCSI guru can chime in and tell us what magic they do - but I 
don't think it's possible without client support (you essentially have to do 
something like transactions and replay the last transaction on failover). Maybe 
something can be enabled in the protocol to make the iSCSI IO synchronous, or at 
least to make it wait for some sort of ACK from the server (which would require 
some sort of cache mirroring between the targets) without making it synchronous 
all the way.

The one time I had to use it I resorted to simply mirroring it via mdraid on 
the client side over two targets sharing the same DAS; this worked fine during 
testing but never went to production in the end.

Jan

> 
>> 
>> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
>> clients ?
> 
> I can't say, because I have not used stgt with rbd bs-type support enough.


Re: [ceph-users] Physical maintainance

2016-07-13 Thread Jan Schermer
Looks good.
You can start several OSDs at a time as long as you have enough CPU and you're 
not saturating your drives or controllers.
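In shell terms the whole procedure would be roughly this (a sketch - the OSD ids 
are made up, and systemctl stands in for whatever init system your release 
actually uses):

ceph osd set noout
# stop the OSDs on the node one by one (or in small batches if CPU allows)
for id in 12 13 14; do systemctl stop ceph-osd@$id; sleep 30; done
# ... power down, do the maintenance, boot the node, reseat the drives ...
for id in 12 13 14; do systemctl start ceph-osd@$id; sleep 60; done
ceph osd unset noout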

Jan

> On 13 Jul 2016, at 15:09, Wido den Hollander <w...@42on.com> wrote:
> 
> 
>> On 13 July 2016 at 14:47, Kees Meijs <k...@nefos.nl> wrote:
>> 
>> 
>> Thanks!
>> 
>> So to sum up, I'd best:
>> 
>>  * set the noout flag
>>  * stop the OSDs one by one
>>  * shut down the physical node
>>  * yank the OSD drives to prevent ceph-disk(8) from automatically
>>    activating at boot time
>>  * do my maintainance
>>  * start the physical node
>>  * reseat and activate the OSD drives one by one
>>  * unset the noout flag
>> 
> 
> That should do it indeed. Take your time between the OSDs and that should 
> limit the 'downtime' for clients.
> 
> Wido
> 
>> On 13-07-16 14:39, Jan Schermer wrote:
>>> If you stop the OSDs cleanly then that should cause no disruption to 
>>> clients.
>>> Starting the OSD back up is another story, expect slow request for a while 
>>> there and unless you have lots of very fast CPUs on the OSD node, start 
>>> them one-by-one and not all at once.
>> 


Re: [ceph-users] Physical maintainance

2016-07-13 Thread Jan Schermer
If you stop the OSDs cleanly then that should cause no disruption to clients.
Starting the OSDs back up is another story: expect slow requests for a while 
there, and unless you have lots of very fast CPUs on the OSD node, start them 
one by one and not all at once.


Jan


> On 13 Jul 2016, at 14:37, Wido den Hollander  wrote:
> 
> 
>> On 13 July 2016 at 14:31, Kees Meijs wrote:
>> 
>> 
>> Hi Cephers,
>> 
>> There's some physical maintainance I need to perform on an OSD node.
>> Very likely the maintainance is going to take a while since it involves
>> replacing components, so I would like to be well prepared.
>> 
>> Unfortunately it is no option to add another OSD node or rebalance at
>> this time, so I'm planning to operate in degraded state during the
>> maintainance.
>> 
>> If at all possible, I would to shut down the OSD node cleanly and
>> prevent slow (or even blocking) requests on Ceph clients.
>> 
>> Just setting the noout flag and shutting down the OSDs on the given node
>> is not enough as it seems. In fact clients do not act that well in this
>> case. Connections time out and for a while I/O seems to stall.
>> 
> 
> noout doesn't do anything with the clients, it just tells the cluster not to 
> mark any OSD as out after they go down.
> 
> If you want to do this slowly, take the OSDs down one by one and wait for the 
> PGs to become active+X again.
> 
> When you start, do the same again, start them one by one.
> 
> You will always have a short moment where the PGs are inactive.
> 
>> Any thoughts on this, anyone? For example, is it a sensible idea and are
>> writes still possible? Let's assume there are OSDs on to the
>> to-be-maintained host which are primary for sure.
>> 
>> Thanks in advance!
>> 
>> Cheers,
>> Kees
>> 
>> 


Re: [ceph-users] ceph + vmware

2016-07-08 Thread Jan Schermer
There is no Ceph plugin for VMware (and I think you need at least an Enterprise 
license for storage plugins, much $$$).
The "VMware" way to do this without the plugin would be to have a VM running on 
every host serving RBD devices over iSCSI to the other VMs (the way their 
storage appliances work - maybe you could even re-use them somehow? I haven't 
used VMware in a while, so I'm not sure if one can log in to the appliance and 
customize it...).
Nevertheless I think it's ugly and messy, and it is going to be even slower than 
Ceph by itself.

But you can always just use the RBD client (kernel or userspace) in the VMs 
themselves; VMware has pretty fast networking, so the overhead wouldn't be that 
large.

Jan


> On 08 Jul 2016, at 21:22, Oliver Dzombic  wrote:
> 
> Hi,
> 
> does anyone have experience how to connect vmware with ceph smart ?
> 
> iSCSI multipath does not really worked well.
> NFS could be, but i think thats just too much layers in between to have
> some useable performance.
> 
> Systems like ScaleIO have developed a vmware addon to talk with it.
> 
> Is there something similar out there for ceph ?
> 
> What are you using ?
> 
> Thank you !
> 
> -- 
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 


Re: [ceph-users] Disk failures

2016-06-14 Thread Jan Schermer
Hi,
bit rot is not "bit rot" per se - nothing is rotting on the drive platter. It 
occurs during reads (mostly, anyway), and it's random.
You can happily read a block and get the correct data, then read it again and 
get garbage, then get correct data again.
This could be caused by a worn-out cell on an SSD, but firmware looks for that 
and rewrites the cell if the signal is attenuated too much.
On spinners there are no cells to refresh, so rewriting doesn't help there either.

You can't really "look for" bit rot for the reasons above; strong 
checksumming/hash verification during reads is the only solution.
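By checksumming I mean something along these lines at the application level (a 
sketch only - the paths are placeholders; filesystems like ZFS/BTRFS do the same 
thing for you transparently, per block):

sha256sum /data/objects/blob.bin > /data/objects/blob.bin.sha256   # at write time
sha256sum -c /data/objects/blob.bin.sha256 \
    || echo "checksum mismatch - restore this object from a replica/backup"   # on every read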

And trust me, bit rot is a very real thing and very dangerous as well - do you 
think companies like Seagate or WD would publish bit error rates if bit rot 
weren't real? I'd buy a drive rated at 1 error in 10^999 bits over one rated at 
1 in 10^14, wouldn't everyone?
And it is especially dangerous when something like Ceph handles much larger 
blocks of data than the client does.
While the client (or an app) has some knowledge of the data _and_ hopefully 
throws an error if it read garbage, Ceph will (if for example snapshots
are used and FIEMAP is off) actually have to read the whole object (say 4MiB) 
and write it elsewhere, without any knowledge whether what it read (and wrote) 
made any sense to the app.
This way corruption might spread silently into your backups if you don't 
validate the data somehow (or dump it from a database for example, where it's 
likely to get detected).

Btw just because you think you haven't seen it doesn't mean you haven't seen it 
- never seen artefacting in movies? Just a random bug in the decoder, is it? The 
VoD guys would tell you otherwise...

For things like databases this is somewhat less impactful - bit rot doesn't 
"flip a bit" but affects larger blocks of data (like one sector), so databases 
usually catch this during reads and return an error instead of handing garbage 
back to the client.

Jan



> On 09 Jun 2016, at 09:16, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> On Thu, 9 Jun 2016 08:43:23 +0200 Gandalf Corvotempesta wrote:
> 
>> Il 09 giu 2016 02:09, "Christian Balzer"  ha scritto:
>>> Ceph currently doesn't do any (relevant) checksumming at all, so if a
>>> PRIMARY PG suffers from bit-rot this will be undetected until the next
>>> deep-scrub.
>>> 
>>> This is one of the longest and gravest outstanding issues with Ceph and
>>> supposed to be addressed with bluestore (which currently doesn't have
>>> checksum verified reads either).
>> 
>> So if bit rot happens on primary PG, ceph is spreading the currupted data
>> across the cluster?
> No.
> 
> You will want to re-read the Ceph docs and the countless posts here about
> replication within Ceph works.
> http://docs.ceph.com/docs/hammer/architecture/#smart-daemons-enable-hyperscale
> 
> A client write goes to the primary OSD/PG and will not be ACK'ed to the
> client until is has reached all replica OSDs.
> This happens while the data is in-flight (in RAM), it's not read from the
> journal or filestore.
> 
>> What would be sent to the replica,  the original data or the saved one?
>> 
>> When bit rot happens I'll have 1 corrupted object and 2 good.
>> how do you manage this between deep scrubs?  Which data would be used by
>> ceph? I think that a bitrot on a huge VM block device could lead to a
>> mess like the whole device corrupted
>> VM affected by bitrot would be able to stay up and running?
>> And bitrot on a qcow2 file?
>> 
> Bitrot is a bit hyped, I haven't seen any on the Ceph clusters I run nor
> on other systems here where I (can) actually check for it.
> 
> As to how it would affect things, that very much depends.
> 
> If it's something like a busy directory inode that gets corrupted, the data
> in question will be in RAM (SLAB) and the next update  will correct things.
> 
> If it's a logfile, you're likely to never notice until deep-scrub detects
> it eventually.
> 
> This isn't a  Ceph specific question, on all systems that aren't backed
> by something like ZFS or BTRFS you're potentially vulnerable to this.
> 
> Of course if you're that worried, you could always run BTRFS of ZFS inside
> your VM and notice immediately when something goes wrong.
> I personally wouldn't though, due to the performance penalties involved
> (CoW).
> 
> 
>> Let me try to explain: when writing to primary PG i have to write bit "1"
>> Due to a bit rot, I'm saving "0".
>> Would ceph read the wrote bit and spread that across the cluster (so it
>> will spread "0") or spread the in memory value "1" ?
>> 
>> What if the journal fails during a read or a write? 
> Again, you may want to get a deeper understanding of Ceph.
> The journal isn't involved in reads.
> 
>> Ceph is able to
>> recover by removing that journal from the affected osd (and still
>> running at lower speed) or should i use a raid1 on ssds used by journal ?
>> 
> Neither, a journal failure is lethal for the OSD involved and unless you
> have LOTS of money RAID1 SSDs are a waste.
> 
> If you use DC level 

Re: [ceph-users] Ubuntu Trusty: kernel 3.13 vs kernel 4.2

2016-06-14 Thread Jan Schermer
One storage setup has exhibited extremely poor performance in my lab on the 4.2 
kernel (mdraid1+lvm+nfs); others run fine.
No problems with xenial so far. If I had to choose an LTS enablement kernel for 
trusty I'd choose the xenial one.
(Btw I think the newest trusty point release already ships the 4.2 HWE stack by 
default - not sure if 3.13 is still supported? I usually just upgrade.)
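On trusty that is just (assuming you want the xenial 4.4 HWE stack):

sudo apt-get install linux-generic-lts-xenial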


Jan

> On 14 Jun 2016, at 09:45, magicb...@hotmail.com wrote:
> 
> Hi list,
> 
> is there any opinion/recommendation regarding the ubuntu trusty available 
> kernels and Ceph(hammer, xfs)?
> Does kernel 4.2 worth installing from Ceph(hammer, xfs) perspective?
> 
> Thanks :)


Re: [ceph-users] CephFS: slow writes over NFS when fs is mounted with kernel driver but fast with Fuse

2016-06-03 Thread Jan Schermer
It should be noted that using "async" with NFS _will_ corrupt your data if 
anything happens.
It's ok-ish for something like an image library, but it's most certainly not OK 
for VM drives, databases, or if you write any kind of binary blobs that you 
can't recreate.

If ceph-fuse is fast (you are testing that on the NFS client side, right?) then 
it must be completely ignoring the sync flag the NFS server requests when doing 
IO. I'd call that a serious bug unless it's documented somewhere...
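To make that concrete, the difference is a single word in /etc/exports (the path 
and network below are placeholders):

/srv/nfs/share  192.168.3.0/24(rw,sync,no_subtree_check)    # safe: server commits before replying
#/srv/nfs/share 192.168.3.0/24(rw,async,no_subtree_check)   # fast, but buffered writes are lost if the server dies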

Jan


> On 03 Jun 2016, at 06:03, Yan, Zheng  wrote:
> 
> On Mon, May 30, 2016 at 10:29 PM, David  wrote:
>> Hi All
>> 
>> I'm having an issue with slow writes over NFS (v3) when cephfs is mounted
>> with the kernel driver. Writing a single 4K file from the NFS client is
>> taking 3 - 4 seconds, however a 4K write (with sync) into the same folder on
>> the server is fast as you would expect. When mounted with ceph-fuse, I don't
>> get this issue on the NFS client.
>> 
>> Test environment is a small cluster with a single MON and single MDS, all
>> running 10.2.1, CephFS metadata is an ssd pool, data is on spinners. The NFS
>> server is CentOS 7, I've tested with the current shipped kernel (3.10),
>> ELrepo 4.4 and ELrepo 4.6.
>> 
>> More info:
>> 
>> With the kernel driver, I mount the filesystem with "-o name=admin,secret"
>> 
>> I've exported a folder with the following options:
>> 
>> *(rw,root_squash,sync,wdelay,no_subtree_check,fsid=1244,sec=1)
>> 
>> I then mount the folder on a CentOS 6 client with the following options (all
>> default):
>> 
>> rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.3.231,mountvers=3,mountport=597,mountproto=udp,local_lock=none
>> 
>> A small 4k write is taking 3 - 4 secs:
>> 
>> # time dd if=/dev/zero of=testfile bs=4k count=1
>> 1+0 records in
>> 1+0 records out
>> 4096 bytes (4.1 kB) copied, 3.59678 s, 1.1 kB/s
>> 
>> real0m3.624s
>> user0m0.000s
>> sys 0m0.001s
>> 
>> But a sync write on the sever directly into the same folder is fast (this is
>> with the kernel driver):
>> 
>> # time dd if=/dev/zero of=testfile2 bs=4k count=1 conv=fdatasync
>> 1+0 records in
>> 1+0 records out
>> 4096 bytes (4.1 kB) copied, 0.0121925 s, 336 kB/s
> 
> 
> Your nfs export has sync option. 'dd if=/dev/zero of=testfile bs=4k
> count=1' on nfs client is equivalent to 'dd if=/dev/zero of=testfile
> bs=4k count=1 conv=fsync' on cephfs. The reason that sync metadata
> operation takes 3~4 seconds is that the MDS flushes its journal every
> 5 seconds.  Adding async option to nfs export can avoid this delay.
> 
>> 
>> real0m0.015s
>> user0m0.000s
>> sys 0m0.002s
>> 
>> If I mount cephfs with Fuse instead of the kernel, the NFS client write is
>> fast:
>> 
>> dd if=/dev/zero of=fuse01 bs=4k count=1
>> 1+0 records in
>> 1+0 records out
>> 4096 bytes (4.1 kB) copied, 0.026078 s, 157 kB/s
>> 
> 
> In this case, ceph-fuse sends an extra request (getattr request on
> directory) to MDS. The request causes MDS to flush its journal.
> Whether or not client sends the extra request depends on what
> capabilities it has.  What capabilities client has, in turn, depend on
> how many clients are accessing the directory. In my test, nfs on
> ceph-fuse is not always fast.
> 
> Yan, Zheng
> 
> 
>> Does anyone know what's going on here?
> 
> 
> 
>> 
>> Thanks
>> 
>> 


Re: [ceph-users] ceph-mon not starting on boot with systemd and Ubuntu 16.04

2016-05-13 Thread Jan Schermer
Can you check that the dependencies have started? Anything about those in the 
logs?

network-online.target local-fs.target ceph-create-keys@%i.service
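Something like this should show whether those units came up and what the mon 
unit saw at boot (the instance name "charlie" is taken from your mail):

systemctl status network-online.target local-fs.target ceph-create-keys@charlie
systemctl list-dependencies ceph-mon@charlie
journalctl -b -u ceph-mon@charlie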

Jan


> On 13 May 2016, at 14:30, Sage Weil <s...@newdream.net> wrote:
> 
> This is starting to sound like a xenial systemd issue to me.  Maybe poke 
> the canonical folks?
> 
> You might edit the unit file and make it touch something in /tmp instead 
> of starting Ceph just to rule out ceph...
> 
> sage
> 
> 
> On Fri, 13 May 2016, Wido den Hollander wrote:
> 
>> No luck either. After a reboot only the Ceph OSD starts, but the monitor not.
>> 
>> I have checked:
>> - service is enabled
>> - tried to re-enable the service
>> - check the MON logs to see if it was started, it wasn't
>> - systemd log to see if it wants to start the MON, it doesn't
>> 
>> My systemd-foo isn't that good either, so I don't know what is happening 
>> here.
>> 
>> Wido
>> 
>>> On 12 May 2016 at 15:31, Jan Schermer <j...@schermer.cz> wrote:
>>> 
>>> 
>>> Btw try replacing
>>> 
>>> WantedBy=ceph-mon.target
>>> 
>>> With: WantedBy=default.target
>>> then systemctl daemon-reload.
>>> 
>>> See if that does the trick
>>> 
>>> I only messed with systemctl to have my own services start, I still hope it 
>>> goes away eventually... :P
>>> 
>>> Jan
>>> 
>>>> On 12 May 2016, at 15:01, Wido den Hollander <w...@42on.com> wrote:
>>>> 
>>>> 
>>>> To also answer Sage's question: No, this is a fresh Jewel install in a few 
>>>> test VMs. This system was not upgraded.
>>>> 
>>>> It was installed 2 hours ago.
>>>> 
>>>>> On 12 May 2016 at 14:51, Jan Schermer <j...@schermer.cz> wrote:
>>>>> 
>>>>> 
>>>>> Can you post the contents of ceph-mon@.service file?
>>>>> 
>>>> 
>>>> Yes, here you go:
>>>> 
>>>> root@charlie:~# cat /lib/systemd/system/ceph-mon@.service 
>>>> [Unit]
>>>> Description=Ceph cluster monitor daemon
>>>> 
>>>> # According to:
>>>> #   http://www.freedesktop.org/wiki/Software/systemd/NetworkTarget
>>>> # these can be removed once ceph-mon will dynamically change network
>>>> # configuration.
>>>> After=network-online.target local-fs.target ceph-create-keys@%i.service
>>>> Wants=network-online.target local-fs.target ceph-create-keys@%i.service
>>>> 
>>>> PartOf=ceph-mon.target
>>>> 
>>>> [Service]
>>>> LimitNOFILE=1048576
>>>> LimitNPROC=1048576
>>>> EnvironmentFile=-/etc/default/ceph
>>>> Environment=CLUSTER=ceph
>>>> ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i --setuser ceph 
>>>> --setgroup ceph
>>>> ExecReload=/bin/kill -HUP $MAINPID
>>>> PrivateDevices=yes
>>>> ProtectHome=true
>>>> ProtectSystem=full
>>>> PrivateTmp=true
>>>> TasksMax=infinity
>>>> Restart=on-failure
>>>> StartLimitInterval=30min
>>>> StartLimitBurst=3
>>>> 
>>>> [Install]
>>>> WantedBy=ceph-mon.target
>>>> root@charlie:~#
>>>> 
>>>>> what does
>>>>> systemctl is-enabled ceph-mon@charlie
>>>>> say?
>>>>> 
>>>> 
>>>> root@charlie:~# systemctl is-enabled ceph-mon@charlie
>>>> enabled
>>>> root@charlie:~#
>>>> 
>>>>> However, this looks like it was just started at a bad moment and died - 
>>>>> nothing in logs?
>>>>> 
>>>> 
>>>> No, I checked the ceph-mon logs in /var/log/ceph. No sign of it even 
>>>> trying to start after boot. In /var/log/syslog there also is not a trace 
>>>> of ceph-mon.
>>>> 
>>>> Only the OSD starts.
>>>> 
>>>> Wido
>>>> 
>>>>> Jan
>>>>> 
>>>>> 
>>>>>> On 12 May 2016, at 14:44, Sage Weil <s...@newdream.net> wrote:
>>>>>> 
>>>>>> On Thu, 12 May 2016, Wido den Hollander wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I am setting up a Jewel cluster in VMs with Ubuntu 16.04.
>>>>>>> 
>>>>>>> ceph v

Re: [ceph-users] ceph-mon not starting on boot with systemd and Ubuntu 16.04

2016-05-12 Thread Jan Schermer
Btw try replacing

WantedBy=ceph-mon.target

With: WantedBy=default.target
then systemctl daemon-reload.

See if that does the trick
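i.e. change the [Install] section of the unit and re-enable it so the symlink is 
recreated against the new target (sketch):

# in /lib/systemd/system/ceph-mon@.service:
#   [Install]
#   WantedBy=default.target
systemctl daemon-reload
systemctl reenable ceph-mon@charlie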

I only messed with systemctl to have my own services start, I still hope it 
goes away eventually... :P

Jan

> On 12 May 2016, at 15:01, Wido den Hollander <w...@42on.com> wrote:
> 
> 
> To also answer Sage's question: No, this is a fresh Jewel install in a few 
> test VMs. This system was not upgraded.
> 
> It was installed 2 hours ago.
> 
>> On 12 May 2016 at 14:51, Jan Schermer <j...@schermer.cz> wrote:
>> 
>> 
>> Can you post the contents of ceph-mon@.service file?
>> 
> 
> Yes, here you go:
> 
> root@charlie:~# cat /lib/systemd/system/ceph-mon@.service 
> [Unit]
> Description=Ceph cluster monitor daemon
> 
> # According to:
> #   http://www.freedesktop.org/wiki/Software/systemd/NetworkTarget
> # these can be removed once ceph-mon will dynamically change network
> # configuration.
> After=network-online.target local-fs.target ceph-create-keys@%i.service
> Wants=network-online.target local-fs.target ceph-create-keys@%i.service
> 
> PartOf=ceph-mon.target
> 
> [Service]
> LimitNOFILE=1048576
> LimitNPROC=1048576
> EnvironmentFile=-/etc/default/ceph
> Environment=CLUSTER=ceph
> ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i --setuser ceph 
> --setgroup ceph
> ExecReload=/bin/kill -HUP $MAINPID
> PrivateDevices=yes
> ProtectHome=true
> ProtectSystem=full
> PrivateTmp=true
> TasksMax=infinity
> Restart=on-failure
> StartLimitInterval=30min
> StartLimitBurst=3
> 
> [Install]
> WantedBy=ceph-mon.target
> root@charlie:~#
> 
>> what does
>> systemctl is-enabled ceph-mon@charlie
>> say?
>> 
> 
> root@charlie:~# systemctl is-enabled ceph-mon@charlie
> enabled
> root@charlie:~#
> 
>> However, this looks like it was just started at a bad moment and died - 
>> nothing in logs?
>> 
> 
> No, I checked the ceph-mon logs in /var/log/ceph. No sign of it even trying 
> to start after boot. In /var/log/syslog there also is not a trace of ceph-mon.
> 
> Only the OSD starts.
> 
> Wido
> 
>> Jan
>> 
>> 
>>> On 12 May 2016, at 14:44, Sage Weil <s...@newdream.net> wrote:
>>> 
>>> On Thu, 12 May 2016, Wido den Hollander wrote:
>>>> Hi,
>>>> 
>>>> I am setting up a Jewel cluster in VMs with Ubuntu 16.04.
>>>> 
>>>> ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
>>>> 
>>>> After a reboot the Ceph Monitors don't start and I have to do so manually.
>>>> 
>>>> Three machines, alpha, bravo and charlie all have the same problem.
>>>> 
>>>> root@charlie:~# systemctl status ceph-mon@charlie
>>>> ● ceph-mon@charlie.service - Ceph cluster monitor daemon
>>>>  Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor 
>>>> preset: enabled)
>>>>  Active: inactive (dead)
>>>> root@charlie:~#
>>>> 
>>>> I can start it and it works
>>> 
>>> Hmm.. my systemd-fu is weak, but if it's enabled it seems like it shoud 
>>> come up.
>>> 
>>> Was this an upgraded package?  What if you do 'systemctl reenable 
>>> ceph-mon@charlie'?
>>> 
>>> sage
>>> 
>>> 
>>> 
>>>> 
>>>> root@charlie:~# systemctl start ceph-mon@charlie
>>>> root@charlie:~# systemctl status ceph-mon@charlie
>>>> ● ceph-mon@charlie.service - Ceph cluster monitor daemon
>>>>  Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor 
>>>> preset: enabled)
>>>>  Active: active (running) since Thu 2016-05-12 16:08:56 CEST; 1s ago
>>>> Main PID: 1368 (ceph-mon)
>>>> 
>>>> I tried removing the /var/log/ceph/ceph-mon.charlie.log file and reboot to 
>>>> see if the mon was actually invoked, but it wasn't.
>>>> 
>>>> ceph.target has been started and so is the OSD on the machine. It is just 
>>>> the monitor which hasn't been started.
>>>> 
>>>> In the syslog I see:
>>>> 
>>>> May 12 16:11:19 charlie systemd[1]: Starting Ceph object storage daemon...
>>>> May 12 16:11:19 charlie systemd[1]: Starting LSB: Start Ceph distributed 
>>>> file system daemons at boot time...
>>>> May 12 16:11:19 charlie systemd[1]: Started LSB: Start Ceph distributed 
>>>> file system daemons at boot time.
>>>> May 12 16:11:20 charlie systemd[1]: Started Ceph object storage daemon.
>>>> May 12 16:11:20 charlie systemd[1]: Started Ceph disk activation: 
>>>> /dev/sdb2.
>>>> May 12 16:11:21 charlie systemd[1]: Started Ceph object storage daemon.
>>>> May 12 16:11:21 charlie systemd[1]: Started Ceph disk activation: 
>>>> /dev/sdb1.
>>>> 
>>>> Am I missing something or is this a bug?
>>>> 
>>>> Wido


Re: [ceph-users] ceph-mon not starting on boot with systemd and Ubuntu 16.04

2016-05-12 Thread Jan Schermer
So systemctl is-enabled ceph-mon.target says "enabled" as well?

I think it should start then, or at least try

Jan

> On 12 May 2016, at 15:14, Wido den Hollander <w...@42on.com> wrote:
> 
> 
>> On 12 May 2016 at 15:12, Jan Schermer <j...@schermer.cz> wrote:
>> 
>> 
>> What about systemctl is-enabled ceph-mon.target?
>> 
> 
> Just tried that, no luck either. There is simply no trace of the monitors 
> trying to start on boot.
> 
>> Jan
>> 
>> 
>> 
>>> On 12 May 2016, at 15:01, Wido den Hollander <w...@42on.com> wrote:
>>> 
>>> 
>>> To also answer Sage's question: No, this is a fresh Jewel install in a few 
>>> test VMs. This system was not upgraded.
>>> 
>>> It was installed 2 hours ago.
>>> 
>>>> On 12 May 2016 at 14:51, Jan Schermer <j...@schermer.cz> wrote:
>>>> 
>>>> 
>>>> Can you post the contents of ceph-mon@.service file?
>>>> 
>>> 
>>> Yes, here you go:
>>> 
>>> root@charlie:~# cat /lib/systemd/system/ceph-mon@.service 
>>> [Unit]
>>> Description=Ceph cluster monitor daemon
>>> 
>>> # According to:
>>> #   http://www.freedesktop.org/wiki/Software/systemd/NetworkTarget
>>> # these can be removed once ceph-mon will dynamically change network
>>> # configuration.
>>> After=network-online.target local-fs.target ceph-create-keys@%i.service
>>> Wants=network-online.target local-fs.target ceph-create-keys@%i.service
>>> 
>>> PartOf=ceph-mon.target
>>> 
>>> [Service]
>>> LimitNOFILE=1048576
>>> LimitNPROC=1048576
>>> EnvironmentFile=-/etc/default/ceph
>>> Environment=CLUSTER=ceph
>>> ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i --setuser ceph 
>>> --setgroup ceph
>>> ExecReload=/bin/kill -HUP $MAINPID
>>> PrivateDevices=yes
>>> ProtectHome=true
>>> ProtectSystem=full
>>> PrivateTmp=true
>>> TasksMax=infinity
>>> Restart=on-failure
>>> StartLimitInterval=30min
>>> StartLimitBurst=3
>>> 
>>> [Install]
>>> WantedBy=ceph-mon.target
>>> root@charlie:~#
>>> 
>>>> what does
>>>> systemctl is-enabled ceph-mon@charlie
>>>> say?
>>>> 
>>> 
>>> root@charlie:~# systemctl is-enabled ceph-mon@charlie
>>> enabled
>>> root@charlie:~#
>>> 
>>>> However, this looks like it was just started at a bad moment and died - 
>>>> nothing in logs?
>>>> 
>>> 
>>> No, I checked the ceph-mon logs in /var/log/ceph. No sign of it even trying 
>>> to start after boot. In /var/log/syslog there also is not a trace of 
>>> ceph-mon.
>>> 
>>> Only the OSD starts.
>>> 
>>> Wido
>>> 
>>>> Jan
>>>> 
>>>> 
>>>>> On 12 May 2016, at 14:44, Sage Weil <s...@newdream.net> wrote:
>>>>> 
>>>>> On Thu, 12 May 2016, Wido den Hollander wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I am setting up a Jewel cluster in VMs with Ubuntu 16.04.
>>>>>> 
>>>>>> ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
>>>>>> 
>>>>>> After a reboot the Ceph Monitors don't start and I have to do so 
>>>>>> manually.
>>>>>> 
>>>>>> Three machines, alpha, bravo and charlie all have the same problem.
>>>>>> 
>>>>>> root@charlie:~# systemctl status ceph-mon@charlie
>>>>>> ● ceph-mon@charlie.service - Ceph cluster monitor daemon
>>>>>> Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor 
>>>>>> preset: enabled)
>>>>>> Active: inactive (dead)
>>>>>> root@charlie:~#
>>>>>> 
>>>>>> I can start it and it works
>>>>> 
>>>>> Hmm.. my systemd-fu is weak, but if it's enabled it seems like it shoud 
>>>>> come up.
>>>>> 
>>>>> Was this an upgraded package?  What if you do 'systemctl reenable 
>>>>> ceph-mon@charlie'?
>>>>> 
>>>>> sage
>>>>> 
>>>>> 
>>>>> 
>>>>>> 
>>>>>> root@charlie:~# systemctl start ceph-mon@charlie

Re: [ceph-users] ceph-mon not starting on boot with systemd and Ubuntu 16.04

2016-05-12 Thread Jan Schermer
What about systemctl is-enabled ceph-mon.target?

Jan



> On 12 May 2016, at 15:01, Wido den Hollander <w...@42on.com> wrote:
> 
> 
> To also answer Sage's question: No, this is a fresh Jewel install in a few 
> test VMs. This system was not upgraded.
> 
> It was installed 2 hours ago.
> 
>> On 12 May 2016 at 14:51, Jan Schermer <j...@schermer.cz> wrote:
>> 
>> 
>> Can you post the contents of ceph-mon@.service file?
>> 
> 
> Yes, here you go:
> 
> root@charlie:~# cat /lib/systemd/system/ceph-mon@.service 
> [Unit]
> Description=Ceph cluster monitor daemon
> 
> # According to:
> #   http://www.freedesktop.org/wiki/Software/systemd/NetworkTarget
> # these can be removed once ceph-mon will dynamically change network
> # configuration.
> After=network-online.target local-fs.target ceph-create-keys@%i.service
> Wants=network-online.target local-fs.target ceph-create-keys@%i.service
> 
> PartOf=ceph-mon.target
> 
> [Service]
> LimitNOFILE=1048576
> LimitNPROC=1048576
> EnvironmentFile=-/etc/default/ceph
> Environment=CLUSTER=ceph
> ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i --setuser ceph 
> --setgroup ceph
> ExecReload=/bin/kill -HUP $MAINPID
> PrivateDevices=yes
> ProtectHome=true
> ProtectSystem=full
> PrivateTmp=true
> TasksMax=infinity
> Restart=on-failure
> StartLimitInterval=30min
> StartLimitBurst=3
> 
> [Install]
> WantedBy=ceph-mon.target
> root@charlie:~#
> 
>> what does
>> systemctl is-enabled ceph-mon@charlie
>> say?
>> 
> 
> root@charlie:~# systemctl is-enabled ceph-mon@charlie
> enabled
> root@charlie:~#
> 
>> However, this looks like it was just started at a bad moment and died - 
>> nothing in logs?
>> 
> 
> No, I checked the ceph-mon logs in /var/log/ceph. No sign of it even trying 
> to start after boot. In /var/log/syslog there also is not a trace of ceph-mon.
> 
> Only the OSD starts.
> 
> Wido
> 
>> Jan
>> 
>> 
>>> On 12 May 2016, at 14:44, Sage Weil <s...@newdream.net> wrote:
>>> 
>>> On Thu, 12 May 2016, Wido den Hollander wrote:
>>>> Hi,
>>>> 
>>>> I am setting up a Jewel cluster in VMs with Ubuntu 16.04.
>>>> 
>>>> ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
>>>> 
>>>> After a reboot the Ceph Monitors don't start and I have to do so manually.
>>>> 
>>>> Three machines, alpha, bravo and charlie all have the same problem.
>>>> 
>>>> root@charlie:~# systemctl status ceph-mon@charlie
>>>> ● ceph-mon@charlie.service - Ceph cluster monitor daemon
>>>>  Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor 
>>>> preset: enabled)
>>>>  Active: inactive (dead)
>>>> root@charlie:~#
>>>> 
>>>> I can start it and it works
>>> 
>>> Hmm.. my systemd-fu is weak, but if it's enabled it seems like it shoud 
>>> come up.
>>> 
>>> Was this an upgraded package?  What if you do 'systemctl reenable 
>>> ceph-mon@charlie'?
>>> 
>>> sage
>>> 
>>> 
>>> 
>>>> 
>>>> root@charlie:~# systemctl start ceph-mon@charlie
>>>> root@charlie:~# systemctl status ceph-mon@charlie
>>>> ● ceph-mon@charlie.service - Ceph cluster monitor daemon
>>>>  Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor 
>>>> preset: enabled)
>>>>  Active: active (running) since Thu 2016-05-12 16:08:56 CEST; 1s ago
>>>> Main PID: 1368 (ceph-mon)
>>>> 
>>>> I tried removing the /var/log/ceph/ceph-mon.charlie.log file and reboot to 
>>>> see if the mon was actually invoked, but it wasn't.
>>>> 
>>>> ceph.target has been started and so is the OSD on the machine. It is just 
>>>> the monitor which hasn't been started.
>>>> 
>>>> In the syslog I see:
>>>> 
>>>> May 12 16:11:19 charlie systemd[1]: Starting Ceph object storage daemon...
>>>> May 12 16:11:19 charlie systemd[1]: Starting LSB: Start Ceph distributed 
>>>> file system daemons at boot time...
>>>> May 12 16:11:19 charlie systemd[1]: Started LSB: Start Ceph distributed 
>>>> file system daemons at boot time.
>>>> May 12 16:11:20 charlie systemd[1]: Started Ceph object storage daemon.
>>>> May 12 16:11:20 charlie systemd[1]: Started Ceph disk activation: 
>>>> /dev/sdb2.
>>>> May 12 16:11:21 charlie systemd[1]: Started Ceph object storage daemon.
>>>> May 12 16:11:21 charlie systemd[1]: Started Ceph disk activation: 
>>>> /dev/sdb1.
>>>> 
>>>> Am I missing something or is this a bug?
>>>> 
>>>> Wido
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon not starting on boot with systemd and Ubuntu 16.04

2016-05-12 Thread Jan Schermer
Can you post the contents of the ceph-mon@.service file?

what does
systemctl is-enabled ceph-mon@charlie
say?

However, this looks like it was just started at a bad moment and died - nothing 
in logs?
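
A quick way to confirm whether systemd even tried to start the unit after boot
(unit name taken from your output; the commands are just a suggestion):

journalctl -b -u ceph-mon@charlie
systemctl list-dependencies --reverse ceph-mon@charlie

If journalctl shows nothing at all for the unit since boot, it was never
started, which points at ordering/enablement rather than a crash.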

Jan


> On 12 May 2016, at 14:44, Sage Weil  wrote:
> 
> On Thu, 12 May 2016, Wido den Hollander wrote:
>> Hi,
>> 
>> I am setting up a Jewel cluster in VMs with Ubuntu 16.04.
>> 
>> ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
>> 
>> After a reboot the Ceph Monitors don't start and I have to do so manually.
>> 
>> Three machines, alpha, bravo and charlie all have the same problem.
>> 
>> root@charlie:~# systemctl status ceph-mon@charlie
>> ● ceph-mon@charlie.service - Ceph cluster monitor daemon
>>   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor 
>> preset: enabled)
>>   Active: inactive (dead)
>> root@charlie:~#
>> 
>> I can start it and it works
> 
> Hmm.. my systemd-fu is weak, but if it's enabled it seems like it shoud 
> come up.
> 
> Was this an upgraded package?  What if you do 'systemctl reenable 
> ceph-mon@charlie'?
> 
> sage
> 
> 
> 
>> 
>> root@charlie:~# systemctl start ceph-mon@charlie
>> root@charlie:~# systemctl status ceph-mon@charlie
>> ● ceph-mon@charlie.service - Ceph cluster monitor daemon
>>   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor 
>> preset: enabled)
>>   Active: active (running) since Thu 2016-05-12 16:08:56 CEST; 1s ago
>> Main PID: 1368 (ceph-mon)
>> 
>> I tried removing the /var/log/ceph/ceph-mon.charlie.log file and reboot to 
>> see if the mon was actually invoked, but it wasn't.
>> 
>> ceph.target has been started and so is the OSD on the machine. It is just 
>> the monitor which hasn't been started.
>> 
>> In the syslog I see:
>> 
>> May 12 16:11:19 charlie systemd[1]: Starting Ceph object storage daemon...
>> May 12 16:11:19 charlie systemd[1]: Starting LSB: Start Ceph distributed 
>> file system daemons at boot time...
>> May 12 16:11:19 charlie systemd[1]: Started LSB: Start Ceph distributed file 
>> system daemons at boot time.
>> May 12 16:11:20 charlie systemd[1]: Started Ceph object storage daemon.
>> May 12 16:11:20 charlie systemd[1]: Started Ceph disk activation: /dev/sdb2.
>> May 12 16:11:21 charlie systemd[1]: Started Ceph object storage daemon.
>> May 12 16:11:21 charlie systemd[1]: Started Ceph disk activation: /dev/sdb1.
>> 
>> Am I missing something or is this a bug?
>> 
>> Wido
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ubuntu or CentOS for my first lab. Please recommend. Thanks

2016-05-05 Thread Jan Schermer
This is always a topic that starts a flamewar.
my POV:

Ubuntu  + generally newer versions of software, packages are closer to vanilla versions
        + more community packages
        + several versions of HWE (kernels) to choose from over the lifetime of the distro
        - not much support from vendors (e.g. for firmware upgrades, BIOS, binary packages)

CentOS  + more "stable" versions
        + more enterprisey (unchanging) landscape, with better compatibility
        + generally compatible with RHEL, which means binaries and support are usually provided by vendors
        - frankenpackages of ancient versions patched ad nauseam with backported features
        - documentation lacking on "specialities" that are not present in vanilla versions (the kernel is the worst offender)

My experience is that Ubuntu is much faster overall and can be better "googled"
or subverted to your needs; LTS upgrades seldom break, though I have seen it
happen.
CentOS is more suitable for running software like SAP or application servers 
like JBoss if you need support. I've never seen breakage during upgrades, but 
those upgrades mostly aren't even worth it :)

Usually this choice comes down to organisational preference; CentOS will be much
easier to use in an environment heavy with vendors and certifications...

Jan


> On 05 May 2016, at 14:09, Michael Ferguson  wrote:
> 
>  
>  
> Michael E. Ferguson, 
> “First, your place, and then, the world’s”
> “Good work ain’t cheap, and cheap work ain’t good”
> PHONE: 305-333-2185 | FAX: 305-533-1582 | fergu...@eastsidemiami.com 
> 
>  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] On-going Bluestore Performance Testing Results

2016-04-22 Thread Jan Schermer
Having correlated graphs of CPU and block device usage would be helpful.

To my cynical eye this looks like a clear regression in CPU usage, which was 
always bottlenecking pure-SSD OSDs, and now got worse.
The gains are from doing less IO on IO-saturated HDDs.

The 70% regression in 16-32K random writes is the most troubling; that's
coincidentally the average IO size for a DB2, and the biggest bottleneck to its
performance I've seen (other databases will be similar).
It's great 

Btw readahead is not dependent on the filesystem (it's a mechanism in the IO
scheduler), so it should be present even on a block device, I think?
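
If anyone wants to check, readahead is a per-device setting and easy to
inspect/tweak - a rough sketch, the device name is just an example:

blockdev --getra /dev/sdb
cat /sys/block/sdb/queue/read_ahead_kb
blockdev --setra 4096 /dev/sdb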

Jan
 
 
> On 22 Apr 2016, at 17:35, Mark Nelson  wrote:
> 
> Hi Guys,
> 
> Now that folks are starting to dig into bluestore with the Jewel release, I 
> wanted to share some of our on-going performance test data. These are from 
> 10.1.0, so almost, but not quite, Jewel.  Generally bluestore is looking very 
> good on HDDs, but there are a couple of strange things to watch out for, 
> especially with NVMe devices.  Mainly:
> 
> 1) in HDD+NVMe configurations performance increases dramatically when 
> replacing the stock CentOS7 kernel with Kernel 4.5.1.
> 
> 2) In NVMe only configurations performance is often lower at middle-sized 
> IOs.  Kernel 4.5.1 doesn't really help here.  In fact it seems to amplify 
> both the cases where bluestore is faster and where it is slower.
> 
> 3) Medium sized sequential reads are where bluestore consistently tends to be 
> slower than filestore.  It's not clear yet if this is simply due to Bluestore 
> not doing read ahead at the OSD (ie being entirely dependent on client read 
> ahead) or something else as well.
> 
> I wanted to post this so other folks have some ideas of what to look for as 
> they do their own bluestore testing.  This data is shown as percentage 
> differences vs filestore, but I can also release the raw throughput values if 
> people are interested in those as well.
> 
> https://drive.google.com/file/d/0B2gTBZrkrnpZOTVQNkV0M2tIWkk/view?usp=sharing
> 
> Thanks!
> Mark
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Jan Schermer
I apologise, I probably should have dialed down a bit.
I'd like to personally apologise to Sage, for being so patient with my ranting.

To be clear: We are so lucky to have Ceph. It was something we sorely needed 
and for the right price (free).
It's was a dream come true to cloud providers - and it still is.

However, working with it in production, spending much time getting to know how 
ceph works, what it does, and also seeing how and where it fails prompted my 
interest in where it's going, because big public clouds are one thing, 
traditional SMB/Small enterprise needs are another and that's where I feel it 
fails hard. So I tried prodding here on ML, watched performance talks (which, 
frankly, reinforced my confirmation bias) and hoped to see some hint of it 
getting better. That for me equals simpler, faster, not reinventing the wheel. I
truly don't see that and it makes me sad.

You are talking about the big picture - Ceph for storing anything, new 
architecture - and it sounds cool. Given enough money and time it can 
materialise, I won't elaborate on that. I just hope you don't forget about the 
measly RBD users like me (I'd guesstimate a silent 90%+ majority, but no idea, 
hopefully the product manager has a better one) who are frustrated from the 
current design. I'd like to think I represent those users who used to solve HA 
with DRBD 10 years ago, who had to battle NFS shares with rsync and inotify 
scripts, who were the only people on-call every morning at 3AM when logrotate 
killed their IO, all while having to work with rotting hardware and no budget. 
We are still out there and there's nothing for us - RBD is not as fast, simple 
or reliable as DRBD, filesystem is not as simple nor as fast as rsync, 
scrubbing still wakes us at 3AM...

I'd very much like Ceph to be my storage system of choice in the future again, 
which is why I am so vocal with my opinions, and maybe truly selfish with my 
needs. I have not yet been convinced of the bright future, and -  being the 
sceptical^Wcynical monster I turned into - I expect everything which makes my 
spidey sense tingle to fail, as it usually does. But that's called confirmation 
bias, which can make my whole point moot I guess :)

Jan 




> On 12 Apr 2016, at 23:08, Nick Fisk <n...@fisk.me.uk> wrote:
> 
> Jan,
> 
> I would like to echo Sage's response here. It seems you only want a subset
> of what Ceph offers, whereas RADOS is designed to offer a whole lot more,
> which requires a lot more intelligence at the lower levels.
> 
> I must say I have found your attitude to both Sage and the Ceph project as a
> whole over the last few emails quite disrespectful. I spend a lot of my time
> trying to sell the benefits of open source, which centre on the openness of
> the idea/code and not around the fact that you can get it for free. One of
> the things that I like about open source is the constructive, albeit
> sometimes abrupt, constructive criticism that results in a better product.
> Simply shouting Ceph is slow and it's because dev's don't understand
> filesystems is not constructive.
> 
> I've just come back from an expo at ExCel London where many providers are
> passionately talking about Ceph. There seems to be a lot of big money
> sloshing about for something that is inherently "wrong"
> 
> Sage and the core Ceph team seem like  very clever people to me and I trust
> that over the years of development, that if they have decided that standard
> FS's are not the ideal backing store for Ceph, that this is probably correct
> decision. However I am also aware that the human condition "Can't see the
> wood for the trees" is everywhere and I'm sure if you have any clever
> insights into filesystem behaviour, the Ceph Dev team would be more than
> open to suggestions.
> 
> Personally I wish I could contribute more to the project as I feel that I
> (any my company) get more from Ceph than we put in, but it strikes a nerve
> when there is such negative criticism for what effectively is a free
> product.
> 
> Yes, I also suffer from the problem of slow sync writes, but the benefit of
> being able to shift 1U servers around a Rack/DC compared to a SAS tethered
> 4U jbod somewhat outweighs that as well as several other advantages. A new
> cluster that we are deploying has several hardware choices which go a long
> way to improve this performance as well. Coupled with the coming Bluestore,
> the future looks bright.
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Sage Weil
>> Sent: 12 April 2016 21:48
>> To: Jan Schermer <j...@schermer.cz>
>> Cc: ceph-devel <ceph-de...@vger.kernel.org>; ceph-users <ceph-us...@ceph.com>; ceph-maintain...@ceph.com
>> Subject: Re: [ceph-users] Deprecat

Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Jan Schermer
Still the answer to most of your points from me is "but who needs that?"
Who needs to have exactly the same data in two separate objects (replicas)? 
Ceph needs it because of "consistency"? But the app (VM filesystem) is fine with
whatever version because the flush didn't happen (if it did the contents would 
be the same).

You say "Ceph needs", but I say "the guest VM needs" - there's the problem.

> On 12 Apr 2016, at 21:58, Sage Weil <s...@newdream.net> wrote:
> 
> Okay, I'll bite.
> 
> On Tue, 12 Apr 2016, Jan Schermer wrote:
>>> Local kernel file systems maintain their own internal consistency, but 
>>> they only provide what consistency promises the POSIX interface 
>>> does--which is almost nothing.
>> 
>> ... which is exactly what everyone expects
>> ... which is everything any app needs
>> 
>>> That's why every complicated data 
>>> structure (e.g., database) stored on a file system ever includes it's own 
>>> journal.
>> ... see?
> 
> They do this because POSIX doesn't give them what they want.  They 
> implement a *second* journal on top.  The result is that you get the 
> overhead from both--the fs journal keeping its data structures consistent, 
> the database keeping its consistent.  If you're not careful, that means 
> the db has to do something like file write, fsync, db journal append, 
> fsync.
It's more like
transaction log write, flush
data write
That's simply because most filesystems don't journal data, but some do.


> And both fsyncs turn into a *fs* journal io and flush.  (Smart 
> databases often avoid most of the fs overhead by putting everything in a 
> single large file, but at that point the file system isn't actually doing 
> anything except passing IO to the block layer).
> 
> There is nothing wrong with POSIX file systems.  They have the unenviable 
> task of catering to a huge variety of workloads and applications, but are 
> truly optimal for very few.  And that's fine.  If you want a local file 
> system, you should use ext4 or XFS, not Ceph.
> 
> But it turns ceph-osd isn't a generic application--it has a pretty 
> specific workload pattern, and POSIX doesn't give us the interfaces we 
> want (mainly, atomic transactions or ordered object/file enumeration).

The workload (with RBD) is inevitably expecting POSIX. Who needs more than 
that? To me that indicates unnecessary guarantees.

> 
>>> We coudl "wing it" and hope for 
>>> the best, then do an expensive crawl and rsync of data on recovery, but we 
>>> chose very early on not to do that.  If you want a system that "just" 
>>> layers over an existing filesystem, try you can try Gluster (although note 
>>> that they have a different sort of pain with the ordering of xattr 
>>> updates, and are moving toward a model that looks more like Ceph's backend 
>>> in their next version).
>> 
>> True, which is why we dismissed it.
> 
> ...and yet it does exactly what you asked for:

I was implying it suffers the same flaws. In any case it wasn't really fast and 
it seemed overly complex.
To be fair, it was a while ago when I tried it.
Can't talk about consistency - I don't think I ever used it in production as 
more than a PoC.

> 
>>>> IMO, If Ceph was moving in the right direction [...] Ceph would 
>>>> simply distribute our IO around with CRUSH.
> 
> You want ceph to "just use a file system."  That's what gluster does--it 
> just layers the distributed namespace right on top of a local namespace.  
> If you didn't care about correctness or data safety, it would be 
> beautiful, and just as fast as the local file system (modulo network).  
> But if you want your data safe, you immediatley realize that local POSIX 
> file systems don't get you want you need: the atomic update of two files 
> on different servers so that you can keep your replicas in sync.  Gluster 
> originally took the minimal path to accomplish this: a "simple" 
> prepare/write/commit, using xattrs as transaction markers.  We took a 
> heavyweight approach to support arbitrary transactions.  And both of us 
> have independently concluded that the local fs is the wrong tool for the 
> job.
> 
>>> Offloading stuff to the file system doesn't save you CPU--it just makes 
>>> someone else responsible.  What does save you CPU is avoiding the 
>>> complexity you don't need (i.e., half of what the kernel file system is 
>>> doing, and everything we have to do to work around an ill-suited 
>>> interface) and instead implement exactly the set of features that we need 
>>> to get the job done.
>> 
>> In theory

Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Jan Schermer

> On 12 Apr 2016, at 20:00, Sage Weil <s...@newdream.net> wrote:
> 
> On Tue, 12 Apr 2016, Jan Schermer wrote:
>> I'd like to raise these points, then
>> 
>> 1) some people (like me) will never ever use XFS if they have a choice
>> given no choice, we will not use something that depends on XFS
>> 
>> 2) choice is always good
> 
> Okay!
> 
>> 3) doesn't majority of Ceph users only care about RBD?
> 
> Probably that's true now.  We shouldn't recommend something that prevents 
> them from adding RGW to an existing cluster in the future, though.
> 
>> (Angry rant coming)
>> Even our last performance testing of Ceph (Infernalis) showed abysmal 
>> performance. The most damning sign is the consumption of CPU time at 
>> an unprecedented rate. Was it faster than Dumpling? Slightly, but it ate 
>> more CPU also, so in effect it was not really "faster".
>> 
>> It would make *some* sense to only support ZFS or BTRFS because you can 
>> offload things like clones/snapshots and consistency to the filesystem - 
>> which would make the architecture much simpler and everything much 
>> faster. Instead you insist on XFS and reimplement everything in 
>> software. I always dismissed this because CPU time was usually cheap, 
>> but in practice it simply doesn't work. You duplicate things that 
>> filesystems had solved for years now (namely crash consistency - though 
>> we have seen that fail as well), instead of letting them do their work 
>> and stripping the IO path to the bare necessity and letting someone 
>> smarter and faster handle that.
>> 
>> IMO, If Ceph was moving in the right direction there would be no 
>> "supported filesystem" debate, instead we'd be free to choose whatever 
>> is there that provides the guarantees we need from filesystem (which is 
>> usually every filesystem in the kernel) and Ceph would simply distribute 
>> our IO around with CRUSH.
>> 
>> Right now CRUSH (and in effect what it allows us to do with data) is 
>> _the_ reason people use Ceph, as there simply wasn't much else to use 
>> for distributed storage. This isn't true anymore and the alternatives 
>> are orders of magnitude faster and smaller.
> 
> This touched on pretty much every reason why we are ditching file 
> systems entirely and moving toward BlueStore.

Nooo!

> 
> Local kernel file systems maintain their own internal consistency, but 
> they only provide what consistency promises the POSIX interface 
> does--which is almost nothing.

... which is exactly what everyone expects
... which is everything any app needs

>  That's why every complicated data 
> structure (e.g., database) stored on a file system ever includes it's own 
> journal.
... see?


>  In our case, what POSIX provides isn't enough.  We can't even 
> update a file and it's xattr atomically, let alone the much more 
> complicated transitions we need to do.
... have you thought that maybe xattrs weren't meant to be abused this way? 
Filesystems usually aren't designed to be performant key=value stores.
btw at least i_version should be atomic?

And I still feel (ironically) that you don't understand what journals and 
commits/flushes are for if you make this argument...

Btw I think at least i_version xattr could be atomic.


>  We coudl "wing it" and hope for 
> the best, then do an expensive crawl and rsync of data on recovery, but we 
> chose very early on not to do that.  If you want a system that "just" 
> layers over an existing filesystem, try you can try Gluster (although note 
> that they have a different sort of pain with the ordering of xattr 
> updates, and are moving toward a model that looks more like Ceph's backend 
> in their next version).

True, which is why we dismissed it.

> 
> Offloading stuff to the file system doesn't save you CPU--it just makes 
> someone else responsible.  What does save you CPU is avoiding the 
> complexity you don't need (i.e., half of what the kernel file system is 
> doing, and everything we have to do to work around an ill-suited 
> interface) and instead implement exactly the set of features that we need 
> to get the job done.

In theory you are right.
In practice in-kernel filesystems are fast, and fuse filesystems are slow.
Ceph is like that - slow. And you want to be fast by writing more code :)

> 
> FileStore is slow, mostly because of the above, but also because it is an 
> old and not-very-enlightened design.  BlueStore is roughly 2x faster in 
> early testing.
... which is still literally orders of magnitude slower than a filesystem.
I dug into bluestore and how you want to implement it, and from what I 
un

Re: [ceph-users] Deprecating ext4 support

2016-04-12 Thread Jan Schermer
I'd like to raise these points, then

1) some people (like me) will never ever use XFS if they have a choice
given no choice, we will not use something that depends on XFS

2) choice is always good

3) doesn't majority of Ceph users only care about RBD?

(Angry rant coming)
Even our last performance testing of Ceph (Infernalis) showed abysmal 
performance. The most damning sign is the consumption of CPU time at 
an unprecedented rate. Was it faster than Dumpling? Slightly, but it ate more CPU 
also, so in effect it was not really "faster".

It would make *some* sense to only support ZFS or BTRFS because you can offload 
things like clones/snapshots and consistency to the filesystem - which would 
make the architecture much simpler and everything much faster.
Instead you insist on XFS and reimplement everything in software. I always 
dismissed this because CPU time was usually cheap, but in practice it simply 
doesn't work.
You duplicate things that filesystems had solved for years now (namely crash 
consistency - though we have seen that fail as well), instead of letting them 
do their work and stripping the IO path to the bare necessity and letting 
someone smarter and faster handle that.

IMO, If Ceph was moving in the right direction there would be no "supported 
filesystem" debate, instead we'd be free to choose whatever is there that 
provides the guarantees we need from filesystem (which is usually every 
filesystem in the kernel) and Ceph would simply distribute our IO around with 
CRUSH.

Right now CRUSH (and in effect what it allows us to do with data) is _the_ 
reason people use Ceph, as there simply wasn't much else to use for distributed 
storage. This isn't true anymore and the alternatives are orders of magnitude 
faster and smaller.

Jan

P.S. If anybody needs a way out I think I found it, with no need to trust a 
higher power :P


> On 11 Apr 2016, at 23:44, Sage Weil  wrote:
> 
> On Mon, 11 Apr 2016, Sage Weil wrote:
>> Hi,
>> 
>> ext4 has never been recommended, but we did test it.  After Jewel is out, 
>> we would like explicitly recommend *against* ext4 and stop testing it.
> 
> I should clarify that this is a proposal and solicitation of feedback--we 
> haven't made any decisions yet.  Now is the time to weigh in.
> 
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Jan Schermer
RIP Ceph.


> On 11 Apr 2016, at 23:42, Allen Samuels  wrote:
> 
> RIP ext4.
> 
> 
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions 
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samu...@sandisk.com
> 
> 
>> -Original Message-
>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>> ow...@vger.kernel.org] On Behalf Of Sage Weil
>> Sent: Monday, April 11, 2016 2:40 PM
>> To: ceph-de...@vger.kernel.org; ceph-us...@ceph.com; ceph-
>> maintain...@ceph.com; ceph-annou...@ceph.com
>> Subject: Deprecating ext4 support
>> 
>> Hi,
>> 
>> ext4 has never been recommended, but we did test it.  After Jewel is out,
>> we would like explicitly recommend *against* ext4 and stop testing it.
>> 
>> Why:
>> 
>> Recently we discovered an issue with the long object name handling that is
>> not fixable without rewriting a significant chunk of FileStore's filename
>> handling.  (There is a limit in the amount of xattr data ext4 can store in 
>> the
>> inode, which causes problems in LFNIndex.)
>> 
>> We *could* invest a ton of time rewriting this to fix, but it only affects 
>> ext4,
>> which we never recommended, and we plan to deprecate FileStore once
>> BlueStore is stable anyway, so it seems like a waste of time that would be
>> better spent elsewhere.
>> 
>> Also, by dropping ext4 test coverage in ceph-qa-suite, we can significantly
>> improve time/coverage for FileStore on XFS and on BlueStore.
>> 
>> The long file name handling is problematic anytime someone is storing rados
>> objects with long names.  The primary user that does this is RGW, which
>> means any RGW cluster using ext4 should recreate their OSDs to use XFS.
>> Other librados users could be affected too, though, like users with very long
>> rbd image names (e.g., > 100 characters), or custom librados users.
>> 
>> How:
>> 
>> To make this change as visible as possible, the plan is to make ceph-osd
>> refuse to start if the backend is unable to support the configured max
>> object name (osd_max_object_name_len).  The OSD will complain that ext4
>> cannot store such an object and refuse to start.  A user who is only using
>> RBD might decide they don't need long file names to work and can adjust
>> the osd_max_object_name_len setting to something small (say, 64) and run
>> successfully.  They would be taking a risk, though, because we would like
>> to stop testing on ext4.
>> 
>> Is this reasonable?  If there significant ext4 users that are unwilling to
>> recreate their OSDs, now would be the time to speak up.
>> 
>> Thanks!
>> sage
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-fuse huge performance gap between different block sizes

2016-03-25 Thread Jan Schermer
FYI when I performed testing on our cluster I saw the same thing.
fio randwrite 4k test over a large volume was a lot faster with a larger RBD
object size (8 MB was marginally better than the default 4 MB). It makes no sense
to me unless there is a huge overhead with an increasing number of objects. Or
maybe there is some sort of alignment problem that causes small objects to overlap
with the actual workload. (In my cluster some objects are mysteriously sized as
4MiB-4KiB).
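
If anyone wants to reproduce this, the object size is fixed at image creation
time - a minimal sketch, pool/image names and sizes are placeholders:

rbd create --size 102400 --order 22 rbd/test-4m    # 2^22 = 4MB objects (default)
rbd create --size 102400 --order 23 rbd/test-8m    # 2^23 = 8MB objects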

Jan

> On 25. 3. 2016, at 10:17, Zhang Qiang  wrote:
> 
> Hi Christian, Thanks for your reply, here're the test specs:
> >>>
> [global]
> ioengine=libaio
> runtime=90
> direct=1
> group_reporting
> iodepth=16
> ramp_time=5
> size=1G
> 
> [seq_w_4k_20]
> bs=4k
> filename=seq_w_4k_20
> rw=write
> numjobs=20
> 
> [seq_w_1m_20]
> bs=1m
> filename=seq_w_1m_20
> rw=write
> numjobs=20
> 
> 
> Test results: 4k -  aggrb=13245KB/s, 1m - aggrb=1102.6MB/s
> 
> Mount options:  ceph-fuse /ceph -m 10.3.138.36:6789
> 
> Ceph configurations:
> 
> filestore_xattr_use_omap = true
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> osd journal size = 128
> osd pool default size = 2
> osd pool default min size = 1
> osd pool default pg num = 512
> osd pool default pgp num = 512
> osd crush chooseleaf type = 1
> 
> 
> Other configurations are all default.
> 
> Status:
>  health HEALTH_OK
>  monmap e5: 5 mons at 
> {1=10.3.138.37:6789/0,2=10.3.138.39:6789/0,3=10.3.138.40:6789/0,4=10.3.138.59:6789/0,GGZ-YG-S0311-PLATFORM-138=10.3.138.36:6789/0}
> election epoch 28, quorum 0,1,2,3,4 
> GGZ-YG-S0311-PLATFORM-138,1,2,3,4
>  mdsmap e55: 1/1/1 up {0=1=up:active}
>  osdmap e1290: 20 osds: 20 up, 20 in
>   pgmap v7180: 1000 pgs, 2 pools, 14925 MB data, 3851 objects
> 37827 MB used, 20837 GB / 21991 GB avail
> 1000 active+clean
> 
>> On Fri, 25 Mar 2016 at 16:44 Christian Balzer  wrote:
>> 
>> Hello,
>> 
>> On Fri, 25 Mar 2016 08:11:27 + Zhang Qiang wrote:
>> 
>> > Hi all,
>> >
>> > According to fio,
>> Exact fio command please.
>> 
>> >with 4k block size, the sequence write performance of
>> > my ceph-fuse mount
>> 
>> Exact mount options, ceph config (RBD cache) please.
>> 
>> >is just about 20+ M/s, only 200 Mb of 1 Gb full
>> > duplex NIC outgoing bandwidth was used for maximum. But for 1M block
>> > size the performance could achieve as high as 1000 M/s, approaching the
>> > limit of the NIC bandwidth. Why the performance stats differs so mush
>> > for different block sizes?
>> That's exactly why.
>> You can see that with local attached storage as well, many small requests
>> are slower than large (essential sequential) writes.
>> Network attached storage in general (latency) and thus Ceph as well (plus
>> code overhead) amplify that.
>> 
>> >Can I configure ceph-fuse mount's block size
>> > for maximum performance?
>> >
>> Very little to do with that if you're using sync writes (thus the fio
>> command line pleasE), if not RBD cache could/should help.
>> 
>> Christian
>> 
>> > Basic information about the cluster: 20 OSDs on separate PCIe hard disks
>> > distributed across 2 servers, each with write performance about 300 M/s;
>> > 5 MONs; 1 MDS. Ceph version 0.94.6
>> > (e832001feaf8c176593e0325c8298e3f16dfb403).
>> >
>> > Thanks :)
>> 
>> 
>> --
>> Christian BalzerNetwork/Systems Engineer
>> ch...@gol.com   Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] xfs: v4 or v5?

2016-03-25 Thread Jan Schermer
V5 is supposedly stable, but that only means it will be just as bad as any 
other XFS.

I recommend avoiding XFS whenever possible. Ext4 works perfectly and I never 
lost any data with it, even when it got corrupted, while XFS still likes to eat 
the data when something goes wrong (and it will, like when you hit bit rot or a 
data cable fails).
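
(For reference, the v4/v5 distinction asked about below is just a mkfs-time
format flag - a minimal sketch, the device name is a placeholder:

mkfs.xfs -m crc=1,finobt=1 /dev/sdX1    # v5 on-disk format
mkfs.xfs -m crc=0 /dev/sdX1             # old v4 format

This obviously doesn't change my recommendation above.)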

Jan

> On 25. 3. 2016, at 11:44, Dzianis Kahanovich  wrote:
> 
> Before adding/replacing new OSDs:
> 
> What version of xfs is preferred by ceph developers/testers now?
> A while ago I moved everything to v5 (crc=1,finobt=1); it works, except for
> "logbsize=256k,logbufs=8" on 4.4. Now I see v5 is the default mode (xfsprogs &
> kernel 4.5 at least).
> 
> I'm in doubt: should I make the new OSDs old-style v4 + logbsize=256k,logbufs=8
> (and drop the v5 crc workloads), which improves linear performance (rm/ls
> operations excluded), or make them the now-default v5 for the other benefits?
> 
> Are xfs v4 still mainstream for ceph?
> 
> PS I use the rather fresh "unstable" Gentoo ~amd64, so I don't know the
> reality of normal distros...
> 
> -- 
> WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] DONTNEED fadvise flag

2016-03-23 Thread Jan Schermer
So the OSDs pass this through to the filestore so it doesn't pollute the cache?
That would be... surprising.

Jan


> On 23 Mar 2016, at 18:28, Gregory Farnum  wrote:
> 
> On Mon, Mar 21, 2016 at 6:02 AM, Yan, Zheng  wrote:
>> 
>>> On Mar 21, 2016, at 18:17, Kenneth Waegeman  
>>> wrote:
>>> 
>>> Thanks! As we are using the kernel client of EL7, does someone knows if 
>>> that client supports it?
>>> 
>> 
>> fadvise DONTNEED is supported by kernel memory management subsystem. Fadvise 
>> DONTNEED works for all filesystems (including cephfs kernel client) that use 
>> page cache.
> 
> Does the kernel pass it through to the OSDs? They take advisory flags
> for this kind of thing now as well, which RBD uses for exactly this
> (in block form, of course).
> 
> In other news, I created a tracker ticket for ceph-fuse to support it.
> http://tracker.ceph.com/issues/15252
> -Greg
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance with encrypted OSDs

2016-03-20 Thread Jan Schermer
Compared to ceph-osd overhead, the dm-crypt overhead will be completely 
negligible for most scenarios.
One exception could be sequential reads with a slow CPU (one without AES-NI), but 
I don't expect more than a few percent difference even then.

Btw a nicer solution is to use SED (self-encrypting) drives. Spindles are no problem (Hitachi makes 
them for example), SSDs are trickier - DC-class Intels don't support it for 
example.
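
If you want a quick idea of what the crypto layer alone can do on your CPUs
before committing to dm-crypt, this is enough (numbers will vary by kernel and
CPU):

cryptsetup benchmark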

Jan


> On 20 Mar 2016, at 17:53, Daniel Delin  wrote:
> 
> Hi,
> 
> I´m looking into running a Ceph cluster with the OSDs encrypted with 
> dm-crypt, both
> spinning disks and cache-tier SSDs and I wonder if there are any solid data 
> on the 
> possible performance penalty this will incur, both bandwidth and latency. 
> Done some googling, but can´t find that much.
> 
> The CPUs involved will have hardware AES-NI support, 4 spinning disks and 2 
> cache tier
> SSDs / OSD node.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ssd only storage and ceph

2016-03-19 Thread Jan Schermer



> On 17 Mar 2016, at 17:28, Erik Schwalbe  wrote:
> 
> Hi,
> 
> at the moment I do some tests with SSD's and ceph.
> My Question is, how to mount an SSD OSD? With or without discard option?

I recommend running without discard but running the "fstrim" command every now and 
then (depends on how fast your SSD is - some SSDs hang for quite a while when 
fstrim is run on them, test it)
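
Something along these lines is usually enough - the paths and schedule are just
an example, adjust to your OSD mountpoints:

# /etc/cron.weekly/fstrim-osds
#!/bin/sh
for dir in /var/lib/ceph/osd/ceph-*; do
    fstrim -v "$dir"
done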

> Where should I do the fstrim, when I mount the OSD without discard? On the 
> ceph storage node? Inside the vm, running on rbd?
> 

discard on the SSD itself makes garbage collection easier - that might make the 
SSD faster and it can last longer (how much faster and how much longer depends on the 
SSD, generally if you use DC-class SSDs you won't notice anything)
discard in the VM (assuming everything supports it) makes thin-provisioning 
more effective, but you (IMO) need virtio-scsi for that. I have no real-life 
experience whether Ceph actually frees the unneeded space even if you make it 
work...



> What is the best practice there.
> 
> Thanks for your answers.
> 
> Regards,
> Erik
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] xfs corruption

2016-03-07 Thread Jan Schermer
This functionality is common on RAID controllers in combination with 
HCL-certified drives.

This usually means that you can't rely on it working unless you stick to the 
exact combination that's certified, which is impossible in practice.
For example LSI controllers do this if you get the right SSDs, but the right 
SSD also needs to have the "right" firmware, which is usually very old and you 
won't get that version anywhere... or maybe only if you shell out 100% premium 
on the price by buying it directly with server hardware, and even then the 
replacement drives you'll get might be a different revision after some time and 
you'll need to explain to the vendor that you very specifically bought that 
combination because of HCL.
Good luck getting support from a server vendor to help you when you keep old 
buggy firmware on your drives in order for the HBA to work correctly :-) 
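
A quick sanity check on whether TRIM even survives the controller (device name
is a placeholder; behind some RAID firmwares the drive won't advertise it at
all):

hdparm -I /dev/sdX | grep -i trim
lsblk -D /dev/sdX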

The good news is that modern drives don't really need TRIM. It's better to 
concentrate on the higher layers where it's much more useful for thin 
provisioning and oversubscribing of the disk space, but the drives themselves 
don't gain much.

There's one scenario when it is useful and that's when you deliberately get a 
lower-grade drive (TBW/DWPD) and need it to survive for longer than the rated 
amount of data. Under-provisioning is quite useful then and you need to either 
TRIM or secure-erase the drive when you prepare it. But unless you're on a 
tight budget I'd say it's not worth it and you should just get the proper 
drive...

Jan



> On 07 Mar 2016, at 09:21, Ric Wheeler  wrote:
> 
> 
> Unfortunately, you will have to follow up with the hardware RAID card vendors 
> to see what commands their firmware handles.
> 
> Good luck!
> 
> Ric
> 
> 
> On 03/07/2016 01:37 PM, Ferhat Ozkasgarli wrote:
>> I am always forgetting this reply all things.
>> /
>> /
>> /RAID5 and RAID10 (or other raid levels) are a property of the block 
>> devices. XFS, ext4, etc can pass down those commands to the firmware on the 
>> card and it is up to the firmware to propagate the command on to the backend 
>> drives./
>> 
>> You mean I can get a hardware raid card that can pass discard and trim 
>> commend to disks with raid 10 array?
>> 
>> Can you please suggest me such a raid card?
>> 
>> Because we are in a verge of deciding on hardware raid or software raid to 
>> use. Because our OpenStack cluster uses full SSD storage (local raid 10) and 
>> my manager want to utilize hardware raid with SSD disks.
>> 
>> 
>> 
>> On Mon, Mar 7, 2016 at 10:04 AM, Ric Wheeler > > wrote:
>> 
>>You are right that some cards might not send those commands on to the
>>backend storage, but spinning disks don't usually implement either trim or
>>discard (SSD's do though).
>> 
>>XFS, ext4, etc can pass down those commands to the firmware on the card
>>and it is up to the firmware to propagate the command on to the backend
>>drives.
>> 
>>The file system layer itself does track allocation internally in its
>>layer, so you will benefit from being able to reuse those blocks after a
>>trim command (even without a raid card of any kind).
>> 
>>Regards,
>> 
>>Ric
>> 
>> 
>>On 03/07/2016 12:58 PM, Ferhat Ozkasgarli wrote:
>> 
>>Rick; you mean Raid 0 environment right?
>> 
>>If you use raid 5 or raid 10 or some other more complex raid
>>configuration most of the physical disks' abilities vanishes. (trim,
>>discard etc..)
>> 
>>Only handful of hardware raid cards able to pass trim and discard
>>commands to physical disks if the raid configuration is raid 0 or 
>> raid 1.
>> 
>>On Mon, Mar 7, 2016 at 9:21 AM, Ric Wheeler > >>> wrote:
>> 
>> 
>> 
>>It is perfectly reasonable and common to use hardware RAID cards 
>> in
>>writeback mode under XFS (and under Ceph) if you configure them
>>properly.
>> 
>>The key thing is that for writeback cache enabled, you need to
>>make sure
>>that the S-ATA drives' write cache itself is disabled. Also make
>>sure that
>>your file system is mounted with "barrier" enabled.
>> 
>>To check the backend write cache state on drives, you often need
>>to use
>>RAID card specific tools to query and set them.
>> 
>>Regards,
>> 
>>Ric
>> 
>> 
>> 
>> 
>>On 02/27/2016 07:20 AM, fangchen sun wrote:
>> 
>> 
>>Thank you for your response!
>> 
>>All my hosts have raid cards. Some raid cards are in
>>pass-throughput
>>mode, and the others are in write-back mode. I will set all
>>raid cards
>>pass-throughput mode and observe for a period of time.

Re: [ceph-users] Ceph RBD latencies

2016-03-03 Thread Jan Schermer
I think the latency comes from journal flushing

Try tuning

filestore min sync interval = .1
filestore max sync interval = 5

and also
/proc/sys/vm/dirty_bytes (I suggest 512MB)
/proc/sys/vm/dirty_background_bytes (I suggest 256MB)

See if that helps
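
For reference, a minimal way to apply the above (the byte values are just the
512MB/256MB suggestions spelled out):

# ceph.conf on the OSD nodes
[osd]
filestore min sync interval = .1
filestore max sync interval = 5

# VM dirty limits
sysctl -w vm.dirty_bytes=536870912
sysctl -w vm.dirty_background_bytes=268435456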

It would be useful to see the job you are running to know exactly what it does.
I'm afraid your latency is not really that bad; it will scale horizontally
(with the number of clients) rather than vertically (higher IOPS for single
blocking writes), and there's not much that can be done about that.


> On 03 Mar 2016, at 14:33, RDS  wrote:
> 
> A couple of suggestions:
> 1)   # of pgs per OSD should be 100-200
> 2)  When dealing with SSD or Flash, performance of these devices hinge on how 
> you partition them and how you tune linux:
>   a)   if using partitions, did you align the partitions on a 4k 
> boundary? I start at sector 2048 using either fdisk or sfdisk

On an SSD you should align at an 8MB boundary (usually the erase block is quite
large, though it doesn't matter that much), and the write block size is
actually something like 128k.
Sector 2048 aligns at 1MB, which is completely fine.
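
For example, this creates a partition starting at 1MiB (sector 2048) and lets
parted worry about the alignment - the device name is a placeholder:

parted -s -a optimal /dev/sdX mklabel gpt mkpart primary 1MiB 100%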

>   b)   There are quite a few Linux settings that benefit SSD/Flash and 
> they are: Deadline io scheduler only when using the deadline associated 
> settings, up  QDepth to 512 or 1024, set rq_affinity=2 if OS allows it, 
> setting read ahead if doing majority of reads, and other

Those don't matter that much; higher queue depths mean more throughput but at
the expense of latency. The defaults are usually fine.

> 3)   mount options:  noatime, delaylog,inode64,noquota, etc…

defaults work fine (noatime is a relic, relatime is what filesystems use by 
default nowadays)

> 
> I have written some papers/blogs on this subject if you are interested in 
> seeing them.
> Rick
>> On Mar 3, 2016, at 2:41 AM, Adrian Saul  
>> wrote:
>> 
>> Hi Ceph-users,
>> 
>> TL;DR - I can't seem to pin down why an unloaded system with flash based OSD 
>> journals has higher than desired write latencies for RBD devices.  Any ideas?
>> 
>> 
>> I am developing a storage system based on Ceph and an SCST+pacemaker 
>> cluster.   Our initial testing showed promising results even with mixed 
>> available hardware and we proceeded to order a more designed platform for 
>> developing into production.   The hardware is:
>> 
>> 2x 1RU servers as "frontends" (SCST+pacemaker - ceph mons and clients using 
>> RBD - they present iSCSI to other systems).
>> 3x 2RU OSD SSD servers (24 bay 2.5" SSD) - currently with 4 2TB Samsung Evo 
>> SSDs each
>> 3x 4RU OSD SATA servers (36 bay) - currently with 6 8TB Seagate each
>> 
>> As part of the research and planning we opted to put a pair of Intel 
>> PC3700DC 400G NVME cards in each OSD server.  These are configured mirrored 
>> and setup as the journals for the OSD disks, the aim being to improve write 
>> latencies.  All the machines have 128G RAM and dual E5-2630v3 CPUs, and use 
>> 4 aggregated 10G NICs back to a common pair of switches.   All machines are 
>> running Centos 7, with the frontends using the 4.4.1 elrepo-ml kernel to get 
>> a later RBD kernel module.
>> 
>> On the ceph side each disk in the OSD servers are setup as an individual 
>> OSD, with a 12G journal created on the flash mirror.   I setup the SSD 
>> servers into one root, and the SATA servers into another and created pools 
>> using hosts as fault boundaries, with the pools set for 2 copies.   I 
>> created the pools with the pg_num and pgp_num set to 32x the number of OSDs 
>> in the pool.   On the frontends we create RBD devices and present them as 
>> iSCSI LUNs using SCST to clients - in this test case a Solaris host.
>> 
>> The problem I have is that even with a lightly loaded system the service 
>> times for the LUNs for writes is just not getting down to where we want it, 
>> and they are not very stable - with 5 LUNs doing around 200 32K IOPS 
>> consistently the service times sit at around 3-4ms, but regularly (every 
>> 20-30 seconds) up to above 12-15ms which puts the average at 6ms over 5 
>> minutes.  I fully expected we would have some latencies due to the 
>> distributed and networked nature of Ceph, but in this instance I just cannot 
>> find where these latencies are coming from, especially with the SSD based 
>> pool and having flash based journaling.
>> 
>> - The RBD devices show relatively low service times, but high queue times.  
>> These are in line with what Solaris sees so I don't think SCST/iSCSI is 
>> adding much latency.
>> - The journals are reporting 0.02ms service times, and seem to cope fine 
>> with any bursts
>> - The SSDs do show similar latency variations with writes - bursting up to 
>> 12ms or more whenever there is high write workloads.
>> - I have tried applying what tuning I can to the SSD block devices (noop 
>> scheduler etc) - no difference
>> - I have removed any sort of smarts around IO grouping 

Re: [ceph-users] blocked i/o on rbd device

2016-03-02 Thread Jan Schermer
Are you exporting (or mounting) the NFS as async or sync?

How much memory does the server have?
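
A quick way to check both ends (what the server exports and what the clients
actually negotiated):

exportfs -v             # on the NFS server, shows sync/async per export
grep nfs /proc/mounts   # on a client, shows the effective mount options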

Jan


> On 02 Mar 2016, at 12:54, Shinobu Kinjo  wrote:
> 
> Ilya,
> 
>> We've recently fixed two major long-standing bugs in this area.
> 
> If you could elaborate more, it would be reasonable for the community.
> Is there any pointer?
> 
> Cheers,
> Shinobu 
> 
> - Original Message -
> From: "Ilya Dryomov" 
> To: "Randy Orr" 
> Cc: "ceph-users" 
> Sent: Wednesday, March 2, 2016 8:40:42 PM
> Subject: Re: [ceph-users] blocked i/o on rbd device
> 
> On Tue, Mar 1, 2016 at 10:57 PM, Randy Orr  wrote:
>> Hello,
>> 
>> I am running the following:
>> 
>> ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>> ubuntu 14.04 with kernel 3.19.0-49-generic #55~14.04.1-Ubuntu SMP
>> 
>> For this use case I am mapping and mounting an rbd using the kernel client
>> and exporting the ext4 filesystem via NFS to a number of clients.
>> 
>> Once or twice a week we've seen disk io "stuck" or "blocked" on the rbd
>> device. When this happens iostat shows avgqu-sz at a constant number with
>> utilization at 100%. All i/o operations via NFS blocks, though I am able to
>> traverse the filesystem locally on the nfs server and read/write data. If I
>> wait long enough the device will eventually recover and avgqu-sz goes to
>> zero.
>> 
>> The only issue I could find that was similar to this is:
>> http://tracker.ceph.com/issues/8818 - However, I am not seeing the error
>> messages described and I am running a more recent version of the kernel that
>> should contain the fix from that issue. So, I assume this is likely a
>> different problem.
>> 
>> The ceph cluster reports as healthy the entire time, all pgs up and in,
>> there was no scrubbing going on, no osd failures or anything like that.
>> 
>> I ran echo t > /proc/sysrq-trigger and the output is here:
>> https://gist.github.com/anonymous/89c305443080149e9f45
>> 
>> Any ideas on what could be going on here? Any additional information I can
>> provide?
> 
> Hi Randy,
> 
> We've recently fixed two major long-standing bugs in this area.
> Currently, the only kernel that has fixes for both is 4.5-rc6, but
> backports are on their way - both patches will be 4.4.4.  I'll make
> sure those patches are queued for the ubuntu 3.19 kernel as well, but
> it'll take some time for them to land.
> 
> Could you try either 4.5-rc6 or 4.4.4 after it comes out?  It's likely
> that your problem is fixed.
> 
> Thanks,
> 
>Ilya
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Old CEPH (0.87) cluster degradation - putting OSDs down one by one

2016-02-27 Thread Jan Schermer
Anything in dmesg/kern.log at the time this happened?

 0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal (Aborted) 
**

I think your filesystem was somehow corrupted.

And regarding this: 2. Physical HDD replaced and NOT added to CEPH - here we 
had strange kernel crash just after HDD connected to the controller.
What are the drives connected to? We have had problems with Intel SATA/SAS 
driver. You can do a hotplug of a drive but if you remove one and put in 
another the kernel crashes (it only happens if some time passes between those 
two actions, makes it very nasty).
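
To check the corruption theory, something like this on the node with the
crashing OSDs is a quick first step (stop the OSD and unmount first; the device
path is a placeholder, and this assumes XFS):

dmesg -T | grep -iE 'xfs|i/o error'
xfs_repair -n /dev/sdX1    # -n = check only, does not modify anything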

Jan



> On 27 Feb 2016, at 00:14, maxxik  wrote:
> 
> Hi Cephers
> 
> At the moment we are trying to recover our CEPH cluster (0.87), which is 
> behaving very oddly.
> 
> What have been done :
> 
> 1. OSD drive failure happened - CEPH put OSD down and  out.
> 2. Physical HDD replaced  and NOT added to CEPH - here we had strange kernel 
> crash just after HDD connected to the controller.
> 3. Physical host rebooted.
> 4. CEPH started restoration and putting OSD's down one by one (actually I can 
> see osd process crash in logs).
> 
> ceph.conf is in attachment.
> 
> 
> OSD failure :
> 
> -4> 2016-02-26 23:20:47.906443 7f942b4b6700  5 -- op tracker -- seq: 
> 471061, time: 2016-02-26 23:20:47.906404, even
> t: header_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 
> 45e69877/rb.0.25e43.6b8b4567.2c3b/head//13)
> -3> 2016-02-26 23:20:47.906451 7f942b4b6700  5 -- op tracker -- seq: 
> 471061, time: 2016-02-26 23:20:47.906406, even
> t: throttled, op: pg_backfill(progress 13.77 e 183964/183964 lb 
> 45e69877/rb.0.25e43.6b8b4567.2c3b/head//13)
> -2> 2016-02-26 23:20:47.906456 7f942b4b6700  5 -- op tracker -- seq: 
> 471061, time: 2016-02-26 23:20:47.906421, even
> t: all_read, op: pg_backfill(progress 13.77 e 183964/183964 lb 
> 45e69877/rb.0.25e43.6b8b4567.2c3b/head//13)
> -1> 2016-02-26 23:20:47.906462 7f942b4b6700  5 -- op tracker -- seq: 
> 471061, time: 0.00, event: dispatched, op:
>  pg_backfill(progress 13.77 e 183964/183964 lb 
> 45e69877/rb.0.25e43.6b8b4567.2c3b/head//13)
>  0> 2016-02-26 23:20:47.931236 7f9434e0f700 -1 *** Caught signal 
> (Aborted) **
>  in thread 7f9434e0f700
> 
>  ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
>  1: /usr/bin/ceph-osd() [0x9e2015]
>  2: (()+0xfcb0) [0x7f945459fcb0]
>  3: (gsignal()+0x35) [0x7f94533d30d5]
>  4: (abort()+0x17b) [0x7f94533d683b]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f9453d2569d]
>  6: (()+0xb5846) [0x7f9453d23846]
>  7: (()+0xb5873) [0x7f9453d23873]
>  8: (()+0xb596e) [0x7f9453d2396e]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x259) [0xacb979]
>  10: (SnapSet::get_clone_bytes(snapid_t) const+0x15f) [0x732c0f]
>  11: (ReplicatedPG::_scrub(ScrubMap&)+0x10c4) [0x7f5e54]
>  12: (PG::scrub_compare_maps()+0xcb6) [0x7876e6]
>  13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1c3) [0x7880b3]
>  14: (PG::scrub(ThreadPool::TPHandle&)+0x33d) [0x789abd]
>  15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x67ccf3]
>  16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xabb3ce]
>  17: (ThreadPool::WorkThread::entry()+0x10) [0xabe160]
>  18: (()+0x7e9a) [0x7f9454597e9a]
>  19: (clone()+0x6d) [0x7f94534912ed]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
> 
> --- logging levels ---
>0/ 5 none
>0/ 1 lockdep
>0/ 1 context
>1/ 1 crush
>1/ 5 mds
>1/ 5 mds_balancer
>1/ 5 mds_locker
>1/ 5 mds_log
>1/ 5 mds_log_expire
>1/ 5 mds_migrator
>0/ 1 buffer
>0/ 1 timer
>0/ 1 filer
>0/ 1 striper
>0/ 1 objecter
>0/ 5 rados
>0/ 5 rbd
>0/ 5 rbd_replay
>0/ 5 journaler
>0/ 5 objectcacher
>0/ 5 client
>0/ 5 osd
>0/ 5 optracker
>0/ 5 objclass
>1/ 3 filestore
>1/ 3 keyvaluestore
>1/ 3 journal
>0/ 5 ms
>1/ 5 mon
>0/10 monc
>1/ 5 paxos
>0/ 5 tp
>1/ 5 auth
>1/ 5 crypto
>1/ 1 finisher
>1/ 5 heartbeatmap
>1/ 5 perfcounter
>1/ 5 rgw
>1/10 civetweb
>1/ 5 javaclient
>1/ 5 asok
>1/ 1 throttle
>0/ 0 refs
>   -1/-1 (syslog threshold)
>   -1/-1 (stderr threshold)
>   max_recent 1
>   max_new 1000
>   log_file /var/log/ceph/ceph-osd.27.log
> 
> 
> Current OSD tree:
> 
> 
> # idweight  type name   up/down reweight
> -10 2   root ssdtree
> -8  1   host ibstorage01-ssd1
> 9   1   osd.9   up  1
> -9  1   host ibstorage02-ssd1
> 10  1   osd.10  up  1
> -1  22.99   root default
> -7  22.99   room cdsqv1
> -3  22.99   rack gopc-rack01
> -2  8   host ibstorage01-sas1
> 0   1   osd.0   down0
> 1   1 

Re: [ceph-users] Observations with a SSD based pool under Hammer

2016-02-26 Thread Jan Schermer
The RBD backend might be even worse, depending on how large a dataset you try. One 
4KB block can end up creating a 4MB object, and depending on how well 
hole-punching and fallocate works on your system you could in theory end up 
with a >1000 amplification if you always hit a different 4MB chunk (but that's 
not realistic).
Is that right?
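
For anyone wanting to reproduce this with the rbd backend Josh mentions below,
a minimal job file looks roughly like this (requires fio built with rbd
support; pool/image names are placeholders and the image must already exist):

[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio-test
rw=randwrite
bs=4k
runtime=60
time_based

[rbd-4k-randwrite]
iodepth=32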

Jan

> On 26 Feb 2016, at 22:05, Josh Durgin  wrote:
> 
> On 02/24/2016 07:10 PM, Christian Balzer wrote:
>> 10 second rados bench with 4KB blocks, 219MB written in total.
>> nand-writes per SSD:41*32MB=1312MB.
>> 10496MB total written to all SSDs.
>> Amplification:48!!!
>> 
>> Le ouch.
>> In my use case with rbd cache on all VMs I expect writes to be rather
>> large for the most part and not like this extreme example.
>> But as I wrote the last time I did this kind of testing, this is an area
>> where caveat emptor most definitely applies when planning and buying SSDs.
>> And where the Ceph code could probably do with some attention.
> 
> In this case it's likely rados bench using tiny objects that's
> causing the massive overhead. rados bench is doing each write to a new
> object, which ends up in a new file beneath the osd, with its own
> xattrs too. For 4k writes, that's a ton of overhead.
> 
> fio with the rbd backend will give you a more realistic picture.
> In jewel there will be --max-objects and --object-size options for
> rados bench to get closer to an rbd-like workload as well.
> 
> Josh
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Guest sync write iops so poor.

2016-02-26 Thread Jan Schermer
O_DIRECT is _not_ a flag for synchronous blocking IO.
O_DIRECT only hints to the kernel that it need not cache/buffer the data.
The kernel is actually free to buffer and cache it and it does buffer it.
It also does _not_ flush O_DIRECT writes to disk but it makes best effort to 
send it to the drives ASAP (where it can sit in cache).
Finishing an O_DIRECT request doesn't guarantee it is on disk at all.

In effect, you can issue parallel O_DIRECT requests and they will scale with
queue depth, but the ordering is not guaranteed and neither is it crash safe.
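
You can see the difference with two quick fio runs - a rough sketch against the
/dev/rbd0 device mentioned below (careful, this writes to the device; numbers
will obviously differ):

# parallel O_DIRECT writes - scale with iodepth, no durability guarantee
fio --name=direct --filename=/dev/rbd0 --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=64 --runtime=30 --time_based

# one flush per write - roughly what a committing database does
fio --name=synced --filename=/dev/rbd0 --rw=randwrite --bs=4k --fsync=1 --ioengine=libaio --iodepth=1 --runtime=30 --time_based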


btw "innodb_flush_log_at_trx_commit = 5" does not do what you think it does. 
Its only valid values are:
0 - flush only periodically, not crash consistent (most data should be there 
somewhere but it does require a lengthy manual recovery)
1 - flush after every transaction (not every write as you illustrated), ACID 
compliant
2 - flush periodically, database *should* be crash consistent but you can lose 
some transactions

no other value does anything:

mysql> show global variables like "innodb_flush_log_at_trx_commit";
++---+
| Variable_name  | Value |
++---+
| innodb_flush_log_at_trx_commit | 2 |
++---+
1 row in set (0.00 sec)

mysql> set global innodb_flush_log_at_trx_commit = 1;
Query OK, 0 rows affected (0.00 sec)

mysql> show global variables like "innodb_flush_log_at_trx_commit";
++---+
| Variable_name  | Value |
++---+
| innodb_flush_log_at_trx_commit | 1 |
++---+
1 row in set (0.00 sec)

mysql> set global innodb_flush_log_at_trx_commit = 5;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> show global variables like "innodb_flush_log_at_trx_commit";
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 2     |
+--------------------------------+-------+
1 row in set (0.00 sec)

On Ceph, you either need to live with a max of ~200 (serializable) 
transactions/sec, settle for innodb_flush_log_at_trx_commit = 2 and lose the 
tail of transactions, or you can put the InnoDB log files on a separate device 
(DRBD across several nodes, a physical SSD...) which will survive a crash.
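
For the last option, a minimal my.cnf sketch (the directory is just an example 
mount point for whatever local SSD/DRBD device you dedicate to the redo logs):

[mysqld]
# keep full ACID semantics, but take the redo-log fsync latency off the RBD device
innodb_flush_log_at_trx_commit = 1
innodb_log_group_home_dir      = /var/lib/mysql-redo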


Jan

> On 26 Feb 2016, at 10:49, Huan Zhang  wrote:
> 
> fio /dev/rbd0 sync=1 has no problem.
> I can't find the 'sync cache' code in the Linux rbd block driver or the radosgw API. 
> It seems sync cache is just a concept of librbd (for the rbd cache). 
> Just my concern.
> 
> 2016-02-26 17:30 GMT+08:00 Huan Zhang  >:
> Hi Nick,
> DB's IO pattern depends on config, mysql for example.
> innodb_flush_log_at_trx_commit = 1, mysql will sync after each transaction, 
> like:
> write
> sync
> wirte
> sync
> ...
> 
> innodb_flush_log_at_trx_commit = 5,
> write
> write
> write
> write
> write
> sync
> 
> innodb_flush_log_at_trx_commit = 0,
> write
> write
> ...
> one second later.
> sync.
> 
> 
> may not very accurate, but more or less.
> 
> We tested mysql tps with innodb_flush_log_at_trx_commit = 1 and got very poor 
> performance, even though we can reach very high O_DIRECT randwrite iops with fio.
> 
> 
> 
> 
> 2016-02-26 16:59 GMT+08:00 Nick Fisk  >:
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
> > ] On Behalf Of
> > Huan Zhang
> > Sent: 26 February 2016 06:50
> > To: Jason Dillaman >
> > Cc: josh durgin >; 
> > Nick Fisk >;
> > ceph-users >
> > Subject: Re: [ceph-users] Guest sync write iops so poor.
> >
> > rbd engine with fsync=1 seems stuck.
> > Jobs: 1 (f=1): [w(1)] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
> > 1244d:10h:39m:18s]
> >
> > But fio using /dev/rbd0 sync=1 direct=1 ioengine=libaio iodepth=64 gets very
> > high iops, ~35K, similar to direct write.
> >
> > I'm confused with that result, IMHO, ceph could just ignore the sync cache
> > command since it always use sync write to journal, right?
> 
> Even if the data is not sync'd to the data storage part of the OSD, the data 
> still has to be written to the journal and this is where the performance 
> limit lies.
> 
> The very nature of SDS means that you are never going to achieve the same 
> latency as you do to a local disk as even if the software side introduced no 
> extra latency, just the network latency will severely limit your sync 
> performance.
> 
> Do you know the IO pattern the DB's generate? I know you can switch most DB's 
> to flush with O_DIRECT instead of 

Re: [ceph-users] List of SSDs

2016-02-25 Thread Jan Schermer

> On 25 Feb 2016, at 22:41, Shinobu Kinjo <ski...@redhat.com> wrote:
> 
>> Just beware of HBA compatibility, even in passthrough mode some crappy 
>> firmwares can try and be smart about what you can do (LSI-Avago, I'm looking 
>> your way for crippling TRIM, seriously WTH).
> 
> This is very good to know.
> Can anybody elaborate on this a bit more?
> 

To some degree, it's been a while since I investigated this.
For TRIM/discard to work, you need to have
1) a working TRIM/discard command on the drive
2) the scsi/libata layer (?) somehow detects how many blocks can be discarded at 
once, what the block size is, etc.
those properties are found in /sys/block/xxx/queue/discard_*

3) a filesystem that supports discard (it looks at those discard_* properties 
to determine when/what to discard).
4) there are also flags (hdparm -I shows them) for what happens after trim - either 
the data is zeroed or random data is returned (it is possible to TRIM a sector 
and then read the original data - the drive doesn't actually need to erase anything, 
it simply marks that sector as unused in a bitmap and GC does its magic when it 
feels like it, if ever)
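
A quick way to check what you actually ended up with (sdX is a placeholder, 
output varies with kernel and firmware):

cat /sys/block/sdX/queue/discard_granularity   # 0 usually means discard is disabled
cat /sys/block/sdX/queue/discard_max_bytes     # 0 usually means discard is disabled
hdparm -I /dev/sdX | grep -i trim              # shows TRIM support and the DRAT/DZAT flags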

RAID controllers need to have some degree of control over this, because they 
need to be able to compare the drive contents when scrubbing (the same probably 
somehow applies to mdraid) either by maintaining some bitmap of used blocks or 
by trusting the drives to be deterministic. If you discard a sector on a HW 
RAID, both drives need to start returning the same data or scrubbing will fail. 
Some drives guarantee that and some don't.
You either have DRAT - Deterministic Read After Trim (this only guarantees that 
the data doesn't change between reads, but it can be random)
or you have DZAT - Deterministic read Zero After Trim (subsequent reads only 
return NULLs)
or you can have none of the above (which is no big deal, except for RAID).

Even though I don't use LSI HBAs in IR (RAID) mode, the firmware doesn't like 
that my drives don't have DZAT/DRAT (or rather didn't, this doesn't apply to 
the Intels I have now) and crippled the discard_* parameters to try and 
disallow the use of TRIM. And it mostly works because the filesystem doesn't 
have the discard_* parameters it needs for discard to work...
... BUT it doesn't cripple the TRIM command itself, so running hdparm 
--trim-sector-ranges still works (lol), and I suppose if those discard_* 
parameters were made read/write (actually I found a patch back then that does 
exactly that) we could re-enable trim in spite of the firmware nonsense. But 
with modern SSDs it's mostly pointless anyway and LSI sucks, so who cares 
:-)

*
Sorry if I mixed some layers, maybe it's not filesystem that calls discard but 
another layer in kernel, also not sure how exactly discard_* values are 
detected and when etc., but in essence it works like that.

Jan



> Rgds,
> Shinobu
> 
> - Original Message -
> From: "Jan Schermer" <j...@schermer.cz>
> To: "Nick Fisk" <n...@fisk.me.uk>
> Cc: "Robert LeBlanc" <rob...@leblancnet.us>, "Shinobu Kinjo" 
> <ski...@redhat.com>, ceph-users@lists.ceph.com
> Sent: Thursday, February 25, 2016 11:10:41 PM
> Subject: Re: [ceph-users] List of SSDs
> 
> We are very happy with S3610s in our cluster.
> We had to flash a new firmware because of latency spikes (NCQ-related), but 
> had zero problems after that...
> Just beware of HBA compatibility, even in passthrough mode some crappy 
> firmwares can try and be smart about what you can do (LSI-Avago, I'm looking 
> your way for crippling TRIM, seriously WTH).
> 
> Jan
> 
> 
>> On 25 Feb 2016, at 14:48, Nick Fisk <n...@fisk.me.uk> wrote:
>> 
>> There’s two factors really
>> 
>> 1.   Suitability for use in ceph
>> 2.   Number of people using them
>> 
>> For #1, there are a number of people using various different drives, so lots 
>> of options. The blog articled linked is a good place to start.
>> 
>> For #2 and I think this is quite important. Lots of people use the S3xx’s 
>> intel drives. This means any problems you face will likely have a lot of 
>> input from other people. Also you are less likely to face surprises, as most 
>> usage cases have already been covered. 
>> 
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
>> <mailto:ceph-users-boun...@lists.ceph.com>] On Behalf Of Robert LeBlanc
>> Sent: 25 February 2016 05:56
>> To: Shinobu Kinjo <ski...@redhat.com <mailto:ski...@redhat.com>>
>> Cc: ceph-users <ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>>
>> Subject: Re: [ceph-users] List of SSDs
>> 
>> We are moving to the Intel S3610, from our testing it is a good balance 
>> between pri

Re: [ceph-users] List of SSDs

2016-02-25 Thread Jan Schermer
We are very happy with S3610s in our cluster.
We had to flash a new firmware because of latency spikes (NCQ-related), but had 
zero problems after that...
Just beware of HBA compatibility, even in passthrough mode some crappy 
firmwares can try and be smart about what you can do (LSI-Avago, I'm looking 
your way for crippling TRIM, seriously WTH).

Jan


> On 25 Feb 2016, at 14:48, Nick Fisk  wrote:
> 
> There’s two factors really
>  
> 1.   Suitability for use in ceph
> 2.   Number of people using them
>  
> For #1, there are a number of people using various different drives, so lots 
> of options. The blog articled linked is a good place to start.
>  
> For #2 and I think this is quite important. Lots of people use the S3xx’s 
> intel drives. This means any problems you face will likely have a lot of 
> input from other people. Also you are less likely to face surprises, as most 
> usage cases have already been covered. 
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
> ] On Behalf Of Robert LeBlanc
> Sent: 25 February 2016 05:56
> To: Shinobu Kinjo >
> Cc: ceph-users >
> Subject: Re: [ceph-users] List of SSDs
>  
> We are moving to the Intel S3610, from our testing it is a good balance 
> between price, performance and longevity. But as with all things, do your 
> testing ahead of time. This will be our third model of SSDs for our cluster. 
> The S3500s didn't have enough life and performance tapers off as it gets 
> full. The Micron M600s looked good with the Sebastian journal tests, but once 
> in use for a while go downhill pretty bad. We also tested Micron M500dc 
> drives and they were on par with the S3610s and are more expensive and are 
> closer to EoL. The S3700s didn't have quite the same performance as the 
> S3610s, but they will last forever and are very stable in terms of 
> performance and have the best power loss protection. 
> 
> Short answer is test them for yourself to make sure they will work. You are 
> pretty safe with the Intel S3xxx drives. The Micron M500dc is also pretty 
> safe based on my experience. It had also been mentioned that someone has had 
> good experience with a Samsung DC Pro (has to have both DC and Pro in the 
> name), but we weren't able to get any quick enough to test so I can't vouch 
> for them. 
> 
> Sent from a mobile device, please excuse any typos.
> 
> On Feb 24, 2016 6:37 PM, "Shinobu Kinjo"  > wrote:
> Hello,
> 
> There has been a bunch of discussion about using SSD.
> Does anyone have any list of SSDs describing which SSD is highly recommended, 
> which SSD is not.
> 
> Rgds,
> Shinobu
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
>  ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Guest sync write iops so poor.

2016-02-25 Thread Jan Schermer

> On 25 Feb 2016, at 14:39, Nick Fisk  wrote:
> 
> 
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Huan Zhang
>> Sent: 25 February 2016 11:11
>> To: josh.dur...@inktank.com
>> Cc: ceph-users 
>> Subject: [ceph-users] Guest sync write iops so poor.
>> 
>> Hi,
>>   We test sync iops with fio sync=1 for database workloads in VM,
>> the backend is librbd and ceph (all SSD setup).
>>   The result is sad to me. we only get ~400 IOPS sync randwrite with
>> iodepth=1
>> to iodepth=32.
>>   But test in physical machine with fio ioengine=rbd sync=1, we can reache
>> ~35K IOPS.
>> seems the qemu rbd is the bottleneck.
>>   qemu version is 2.1.2 with rbd_aio_flush patched.
>>rbd cache is off, qemu cache=none.
>> 
>> So what's wrong with it? Is that normal? Could you give me some help?
> 
> Yes, this is normal at QD=1. As the write needs to be acknowledged by both 
> replica OSD's across a network connection the round trip latency severely 
> limits you as compared to travelling along a 30cm sata cable.
> 
> The two biggest contributors to latency is the network and the speed at which 
> the CPU can process the ceph code.  To improve performance look at these two 
> areas first. Easy win is to disable debug logging in ceph.
> 
> However this number should scale as you increase the QD, so something is not 
> right if you are seeing the same performance at QD=1 as QD=32.

Are you sure? Unless something (io elevator) coalesces the writes then they 
should be serialized and blocking, QD doesn't necessarily help there. Either 
way, you're benchmarking the elevator and not RBD if you reach higher IOPS with 
QD>1, IMO.

35K IOPS with ioengine=rbd sounds like the "sync=1" option doesn't actually 
work. Or it's not touching the same object (but I wonder whether write ordering 
is preserved at that rate?).

400 IOPS is sadly the same figure I can reach on a raw device... testing with a 
filesystem you can easily reach <200 IOPS (because of journal, metadata... but 
again, then you're benchmarking filesystem journal and IO elevator efficiency, 
not RBD itself).

Jan


> 
>> Thanks very much.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs mmap performance?

2016-02-19 Thread Jan Schermer
I don't think there's any point in MMAP-ing a virtual file.
And I'd be surprised if there weren't any bugs or performance issues...

Jan

> On 19 Feb 2016, at 14:38, Dzianis Kahanovich  wrote:
> 
> I have content for apache 2.4 in cephfs, trying to be scalable, with "EnableMMAP 
> On".
> Some environments are known to be unfriendly to MMAP for SMP scalability (more
> locks). Are there any cephfs-specific recommendations about apache's EnableMMAP 
> setting?
> 
> -- 
> WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to properly deal with NEAR FULL OSD

2016-02-17 Thread Jan Schermer
It would be helpful to see your crush map (there are also some tunables that help 
with this issue, available if you're not running ancient versions).
However, distribution uniformity isn't that great really.
It helps to increase the number of PGs, but beware that there's no turning back.

Other than that, play with reweights (and possibly crush weights) regularly - 
that's what we do...
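
For reference, the usual commands (just a sketch - the tunables change triggers 
data movement, so do it in a quiet window and mind old kernel clients):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt   # human-readable crush map you can post/inspect
ceph osd crush tunables optimal             # switch to the newer tunables profile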

Jan


> On 18 Feb 2016, at 01:11, Vlad Blando  wrote:
> 
> Hi, this has been bugging me for some time now: the distribution of data on the 
> OSDs is not balanced, so some OSDs are near full. I did ceph osd 
> reweight-by-utilization but it's not helping much.
> 
> 
> [root@controller-node ~]# ceph osd tree
> # id    weight  type name               up/down reweight
> -1      98.28   root default
> -2      32.76           host ceph-node-1
> 0       3.64                    osd.0   up      1
> 1       3.64                    osd.1   up      1
> 2       3.64                    osd.2   up      1
> 3       3.64                    osd.3   up      1
> 4       3.64                    osd.4   up      1
> 5       3.64                    osd.5   up      1
> 6       3.64                    osd.6   up      1
> 7       3.64                    osd.7   up      1
> 8       3.64                    osd.8   up      1
> -3      32.76           host ceph-node-2
> 9       3.64                    osd.9   up      1
> 10      3.64                    osd.10  up      1
> 11      3.64                    osd.11  up      1
> 12      3.64                    osd.12  up      1
> 13      3.64                    osd.13  up      1
> 14      3.64                    osd.14  up      1
> 15      3.64                    osd.15  up      1
> 16      3.64                    osd.16  up      1
> 17      3.64                    osd.17  up      1
> -4      32.76           host ceph-node-3
> 18      3.64                    osd.18  up      1
> 19      3.64                    osd.19  up      1
> 20      3.64                    osd.20  up      1
> 21      3.64                    osd.21  up      1
> 22      3.64                    osd.22  up      1
> 23      3.64                    osd.23  up      1
> 24      3.64                    osd.24  up      1
> 25      3.64                    osd.25  up      1
> 26      3.64                    osd.26  up      1
> [root@controller-node ~]#
> 
> 
> [root@controller-node ~]# /opt/df-osd.sh
> ceph-node-1
> ===
> /dev/sdb1  3.7T  2.0T  1.7T  54% /var/lib/ceph/osd/ceph-0
> /dev/sdc1  3.7T  2.7T  1.1T  72% /var/lib/ceph/osd/ceph-1
> /dev/sdd1  3.7T  3.3T  431G  89% /var/lib/ceph/osd/ceph-2
> /dev/sde1  3.7T  2.8T  879G  77% /var/lib/ceph/osd/ceph-3
> /dev/sdf1  3.7T  3.3T  379G  90% /var/lib/ceph/osd/ceph-4
> /dev/sdg1  3.7T  2.9T  762G  80% /var/lib/ceph/osd/ceph-5
> /dev/sdh1  3.7T  3.0T  733G  81% /var/lib/ceph/osd/ceph-6
> /dev/sdi1  3.7T  3.4T  284G  93% /var/lib/ceph/osd/ceph-7
> /dev/sdj1  3.7T  3.4T  342G  91% /var/lib/ceph/osd/ceph-8
> ===
> ceph-node-2
> ===
> /dev/sdb1  3.7T  3.1T  622G  84% /var/lib/ceph/osd/ceph-9
> /dev/sdc1  3.7T  2.7T  1.1T  72% /var/lib/ceph/osd/ceph-10
> /dev/sdd1  3.7T  3.1T  557G  86% /var/lib/ceph/osd/ceph-11
> /dev/sde1  3.7T  3.3T  392G  90% /var/lib/ceph/osd/ceph-12
> /dev/sdf1  3.7T  2.6T  1.1T  72% /var/lib/ceph/osd/ceph-13
> /dev/sdg1  3.7T  2.8T  879G  77% /var/lib/ceph/osd/ceph-14
> /dev/sdh1  3.7T  2.7T  984G  74% /var/lib/ceph/osd/ceph-15
> /dev/sdi1  3.7T  3.2T  463G  88% /var/lib/ceph/osd/ceph-16
> /dev/sdj1  3.7T  3.1T  594G  85% /var/lib/ceph/osd/ceph-17
> ===
> ceph-node-3
> ===
> /dev/sdb1  3.7T  2.8T  910G  76% /var/lib/ceph/osd/ceph-18
> /dev/sdc1  3.7T  2.7T 1012G  73% /var/lib/ceph/osd/ceph-19
> /dev/sdd1  3.7T  3.2T  537G  86% /var/lib/ceph/osd/ceph-20
> /dev/sde1  3.7T  3.2T  465G  88% /var/lib/ceph/osd/ceph-21
> /dev/sdf1  3.7T  3.0T  663G  83% /var/lib/ceph/osd/ceph-22
> /dev/sdg1  3.7T  3.4T  248G  94% /var/lib/ceph/osd/ceph-23
> /dev/sdh1  3.7T  2.8T  928G  76% /var/lib/ceph/osd/ceph-24
> /dev/sdi1  3.7T  2.9T  802G  79% /var/lib/ceph/osd/ceph-25
> /dev/sdj1  3.7T  2.7T  1.1T  73% /var/lib/ceph/osd/ceph-26
> ===
> [root@controller-node ~]#
> 
> 
> [root@controller-node ~]# ceph health detail
> HEALTH_ERR 2 pgs inconsistent; 10 near full osd(s); 2 scrub 

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Jan Schermer
Hmm, it's possible there aren't any safeguards against filling the whole drive 
when increasing PGs; actually I think ceph only cares about free space when 
backfilling, which is not what happened (at least directly) in your case.
However, having a completely full OSD filesystem is not going to end well - 
better trash the OSD if it crashes because of it.
Be aware that whenever ceph starts backfilling it temporarily needs more space, 
and sometimes it shuffles more data than you'd expect. What can happen is that 
while OSD1 is trying to get rid of data, it simultaneously gets filled with 
data from another OSD (because crush-magic happens) and if that eats the last 
bits of space it's going to go FUBAR. 

You can set "nobackfill" on the cluster, that will prevent ceph from shuffling 
anything temporarily (set that before you restart the OSDs), but I wonder if 
it's too late - that 20MB free in the df output scares me. 
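
For completeness, the flags I mean (sketch only - unset them again once you've 
made room):

ceph osd set nobackfill    # stop backfill traffic while you free up space
ceph osd set norecover     # optionally stop recovery traffic as well
# ...free space / restart the OSDs / reweight...
ceph osd unset norecover
ceph osd unset nobackfill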

The safest way would probably be to trash osd.5 and osd.4 in your case, create 
two new OSDs in their place and backfill them again (with lower reweight). It's 
up to you whether you can afford the IO it will cause.

Which OSDs actually crashed? 4 and 5? Too late to save them methinks...

Jan



> On 17 Feb 2016, at 23:06, Lukáš Kubín <lukas.ku...@gmail.com> wrote:
> 
> You're right, the "full" osd was still up and in until I increased the pg 
> values of one of the pools. The redistribution has not completed yet and 
> perhaps that's what is still filling the drive. With this info - do you think 
> I'm still safe to follow the steps suggested in previous post?
> 
> Thanks!
> 
> Lukas
> 
> On Wed, Feb 17, 2016 at 10:29 PM Jan Schermer <j...@schermer.cz 
> <mailto:j...@schermer.cz>> wrote:
> Something must be on those 2 OSDs that ate all that space - ceph by default 
> doesn't allow OSD to get completely full (filesystem-wise) and from what 
> you've shown those filesystems are really really full.
> OSDs don't usually go down when "full" (95%) .. or do they? I don't think 
> so... so the reason they stopped is likely a completely full filesystem. You 
> have to move something out of the way, restart those OSDs with lower reweight 
> and hopefully everything will be good.
> 
> Jan
> 
> 
>> On 17 Feb 2016, at 22:22, Lukáš Kubín <lukas.ku...@gmail.com 
>> <mailto:lukas.ku...@gmail.com>> wrote:
>> 
>> Ahoj Jan, thanks for the quick hint!
>> 
>> Those 2 OSDs are currently full and down. How should I handle that? Is it ok 
>> that I delete some pg directories again and start the OSD daemons, on both 
>> drives in parallel. Then set the weights as recommended ?
>> 
>> What effect should I expect then - will the cluster attempt to move some pgs 
>> out of these drives to different local OSDs? I'm asking because when I've 
>> attempted to delete pg dirs and restart OSD for the first time, the OSD get 
>> full again very fast.
>> 
>> Thank you.
>> 
>> Lukas
>> 
>> 
>> 
>> On Wed, Feb 17, 2016 at 9:48 PM Jan Schermer <j...@schermer.cz 
>> <mailto:j...@schermer.cz>> wrote:
>> Ahoj ;-)
>> 
>> You can reweight them temporarily, that shifts the data from the full drives.
>> 
>> ceph osd reweight osd.XX YY
>> (XX = the number of full OSD, YY is "weight" which default to 1)
>> 
>> This is different from "crush reweight" which defaults to drive size in TB.
>> 
>> Beware that reweighting will (afaik) only shuffle the data to other local 
>> drives, so you should reweight both the full drives at the same time and 
>> only by little bit at a time (0.95 is a good starting point).
>> 
>> Jan
>> 
>>  
>> 
>>> On 17 Feb 2016, at 21:43, Lukáš Kubín <lukas.ku...@gmail.com 
>>> <mailto:lukas.ku...@gmail.com>> wrote:
>>> 
>> 
>>> Hi,
>>> I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2 
>>> pools, each of size=2. Today, one of our OSDs got full, another 2 near 
>>> full. Cluster turned into ERR state. I have noticed uneven space 
>>> distribution among OSD drives between 70 and 100 perce. I have realized 
>>> there's a low amount of pgs in those 2 pools (128 each) and increased one 
>>> of them to 512, expecting a magic to happen and redistribute the space 
>>> evenly. 
>>> 
>>> Well, something happened - another OSD became full during the 
>>> redistribution and cluster stopped both OSDs and marked them down. After 
>>> some hours the remaining drives partially rebalanced and cluster get to 
>>> WARN state. 
>>> 
>>> I'v

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Jan Schermer
Something must be on those 2 OSDs that ate all that space - ceph by default 
doesn't allow OSD to get completely full (filesystem-wise) and from what you've 
shown those filesystems are really really full.
OSDs don't usually go down when "full" (95%) .. or do they? I don't think so... 
so the reason they stopped is likely a completely full filesystem. You have to 
move something out of the way, restart those OSDs with lower reweight and 
hopefully everything will be good.

Jan


> On 17 Feb 2016, at 22:22, Lukáš Kubín <lukas.ku...@gmail.com> wrote:
> 
> Ahoj Jan, thanks for the quick hint!
> 
> Those 2 OSDs are currently full and down. How should I handle that? Is it ok 
> that I delete some pg directories again and start the OSD daemons, on both 
> drives in parallel. Then set the weights as recommended ?
> 
> What effect should I expect then - will the cluster attempt to move some pgs 
> out of these drives to different local OSDs? I'm asking because when I've 
> attempted to delete pg dirs and restart OSD for the first time, the OSD get 
> full again very fast.
> 
> Thank you.
> 
> Lukas
> 
> 
> 
> On Wed, Feb 17, 2016 at 9:48 PM Jan Schermer <j...@schermer.cz 
> <mailto:j...@schermer.cz>> wrote:
> Ahoj ;-)
> 
> You can reweight them temporarily, that shifts the data from the full drives.
> 
> ceph osd reweight osd.XX YY
> (XX = the number of full OSD, YY is "weight" which default to 1)
> 
> This is different from "crush reweight" which defaults to drive size in TB.
> 
> Beware that reweighting will (afaik) only shuffle the data to other local 
> drives, so you should reweight both the full drives at the same time and only 
> by little bit at a time (0.95 is a good starting point).
> 
> Jan
> 
>  
> 
>> On 17 Feb 2016, at 21:43, Lukáš Kubín <lukas.ku...@gmail.com 
>> <mailto:lukas.ku...@gmail.com>> wrote:
>> 
> 
>> Hi,
>> I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2 
>> pools, each of size=2. Today, one of our OSDs got full, another 2 near full. 
>> Cluster turned into ERR state. I have noticed uneven space distribution 
>> among OSD drives between 70 and 100 perce. I have realized there's a low 
>> amount of pgs in those 2 pools (128 each) and increased one of them to 512, 
>> expecting a magic to happen and redistribute the space evenly. 
>> 
>> Well, something happened - another OSD became full during the redistribution 
>> and cluster stopped both OSDs and marked them down. After some hours the 
>> remaining drives partially rebalanced and cluster get to WARN state. 
>> 
>> I've deleted 3 placement group directories from one of the full OSD's 
>> filesystem which allowed me to start it up again. Soon, however this drive 
>> became full again.
>> 
>> So now, there are 2 of 12 OSDs down, cluster is in WARN and I have no drives 
>> to add. 
>> 
>> Is there a way how to get out of this situation without adding OSDs? I will 
>> attempt to release some space, just waiting for colleague to identify RBD 
>> volumes (openstack images and volumes) which can be deleted.
>> 
>> Thank you.
>> 
>> Lukas
>> 
>> 
>> This is my cluster state now:
>> 
>> [root@compute1 ~]# ceph -w
>> cluster d35174e9-4d17-4b5e-80f2-02440e0980d5
>>  health HEALTH_WARN
>> 10 pgs backfill_toofull
>> 114 pgs degraded
>> 114 pgs stuck degraded
>> 147 pgs stuck unclean
>> 114 pgs stuck undersized
>> 114 pgs undersized
>> 1 requests are blocked > 32 sec
>> recovery 56923/640724 objects degraded (8.884%)
>> recovery 29122/640724 objects misplaced (4.545%)
>> 3 near full osd(s)
>>  monmap e3: 3 mons at 
>> {compute1=10.255.242.14:6789/0,compute2=10.255.242.15:6789/0,compute3=10.255.242.16:6789/0
>>  
>> <http://10.255.242.14:6789/0,compute2=10.255.242.15:6789/0,compute3=10.255.242.16:6789/0>}
>> election epoch 128, quorum 0,1,2 compute1,compute2,compute3
>>  osdmap e1073: 12 osds: 10 up, 10 in; 39 remapped pgs
>>   pgmap v21609066: 640 pgs, 2 pools, 2390 GB data, 309 kobjects
>> 4365 GB used, 890 GB / 5256 GB avail
>> 56923/640724 objects degraded (8.884%)
>> 29122/640724 objects misplaced (4.545%)
>>  493 active+clean
>>  108 active+undersized+degraded
>>   29 active+remapped
>>6 active+undersized+degraded+remapped+backfill_t

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Jan Schermer
Ahoj ;-)

You can reweight them temporarily, that shifts the data from the full drives.

ceph osd reweight osd.XX YY
(XX = the number of the full OSD, YY is the "weight", which defaults to 1)

This is different from "crush reweight" which defaults to drive size in TB.

Beware that reweighting will (afaik) only shuffle the data to other local 
drives, so you should reweight both of the full drives at the same time and only 
by a little bit at a time (0.95 is a good starting point).
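
For example (going by the df output quoted below, where osd.4 and osd.5 are the 
full ones - adjust the IDs to whatever is actually full on your side):

ceph osd reweight osd.4 0.95
ceph osd reweight osd.5 0.95
ceph -w    # watch where the data moves before lowering the weights further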

Jan

 
> On 17 Feb 2016, at 21:43, Lukáš Kubín  wrote:
> 
> Hi,
> I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2 
> pools, each of size=2. Today, one of our OSDs got full, another 2 near full. 
> Cluster turned into ERR state. I have noticed uneven space distribution among 
> OSD drives between 70 and 100 percent. I have realized there's a low amount of 
> pgs in those 2 pools (128 each) and increased one of them to 512, expecting a 
> magic to happen and redistribute the space evenly. 
> 
> Well, something happened - another OSD became full during the redistribution 
> and cluster stopped both OSDs and marked them down. After some hours the 
> remaining drives partially rebalanced and cluster get to WARN state. 
> 
> I've deleted 3 placement group directories from one of the full OSD's 
> filesystem which allowed me to start it up again. Soon, however this drive 
> became full again.
> 
> So now, there are 2 of 12 OSDs down, cluster is in WARN and I have no drives 
> to add. 
> 
> Is there a way how to get out of this situation without adding OSDs? I will 
> attempt to release some space, just waiting for colleague to identify RBD 
> volumes (openstack images and volumes) which can be deleted.
> 
> Thank you.
> 
> Lukas
> 
> 
> This is my cluster state now:
> 
> [root@compute1 ~]# ceph -w
> cluster d35174e9-4d17-4b5e-80f2-02440e0980d5
>  health HEALTH_WARN
> 10 pgs backfill_toofull
> 114 pgs degraded
> 114 pgs stuck degraded
> 147 pgs stuck unclean
> 114 pgs stuck undersized
> 114 pgs undersized
> 1 requests are blocked > 32 sec
> recovery 56923/640724 objects degraded (8.884%)
> recovery 29122/640724 objects misplaced (4.545%)
> 3 near full osd(s)
>  monmap e3: 3 mons at 
> {compute1=10.255.242.14:6789/0,compute2=10.255.242.15:6789/0,compute3=10.255.242.16:6789/0
>  
> }
> election epoch 128, quorum 0,1,2 compute1,compute2,compute3
>  osdmap e1073: 12 osds: 10 up, 10 in; 39 remapped pgs
>   pgmap v21609066: 640 pgs, 2 pools, 2390 GB data, 309 kobjects
> 4365 GB used, 890 GB / 5256 GB avail
> 56923/640724 objects degraded (8.884%)
> 29122/640724 objects misplaced (4.545%)
>  493 active+clean
>  108 active+undersized+degraded
>   29 active+remapped
>6 active+undersized+degraded+remapped+backfill_toofull
>4 active+remapped+backfill_toofull
> 
> [root@ceph1 ~]# df|grep osd
> /dev/sdg1   580496384 500066812  80429572  87% 
> /var/lib/ceph/osd/ceph-3
> /dev/sdf1   580496384 502131428  78364956  87% 
> /var/lib/ceph/osd/ceph-2
> /dev/sde1   580496384 506927100  73569284  88% 
> /var/lib/ceph/osd/ceph-0
> /dev/sdb1   287550208 287550188        20 100% 
> /var/lib/ceph/osd/ceph-5
> /dev/sdd1   580496384 580496364        20 100% 
> /var/lib/ceph/osd/ceph-4
> /dev/sdc1   580496384 478675672 101820712  83% 
> /var/lib/ceph/osd/ceph-1
> 
> [root@ceph2 ~]# df|grep osd
> /dev/sdf1   580496384 448689872 131806512  78% 
> /var/lib/ceph/osd/ceph-7
> /dev/sdb1   287550208 227054336  60495872  79% 
> /var/lib/ceph/osd/ceph-11
> /dev/sdd1   580496384 464175196 116321188  80% 
> /var/lib/ceph/osd/ceph-10
> /dev/sdc1   580496384 489451300  91045084  85% 
> /var/lib/ceph/osd/ceph-6
> /dev/sdg1   580496384 470559020 109937364  82% 
> /var/lib/ceph/osd/ceph-9
> /dev/sde1   580496384 490289388  90206996  85% 
> /var/lib/ceph/osd/ceph-8
> 
> [root@ceph2 ~]# ceph df
> GLOBAL:
> SIZE  AVAIL RAW USED %RAW USED
> 5256G  890G4365G 83.06
> POOLS:
> NAME   ID USED  %USED MAX AVAIL OBJECTS
> glance 6  1714G 32.61  385G  219579
> cinder 7   676G 12.86  385G   97488
> 
> [root@ceph2 ~]# ceph osd pool get glance pg_num
> pg_num: 512
> [root@ceph2 ~]# ceph osd pool get cinder pg_num
> pg_num: 128
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users 

Re: [ceph-users] Dell Ceph Hardware recommendations

2016-02-10 Thread Jan Schermer
Dell finally sells a controller with true JBOD mode? The last I checked they 
only had "JBOD-via-RAID0" as a recommended solution (doh, numerous problems) 
and true JBOD was only offered for special use cases like hadoop storage.
One can obviously reflash the controller to another mode, but that's not really 
"supported".

R730xd would make a nice converged box if only it wasn't so complicated in 
every way (feature-wise, number of firmwares, OMSA to make it supportable...). 
Less is more in this case. Good for VMware, though...


Jan

> On 10 Feb 2016, at 11:24, Yann Dupont  wrote:
> 
> 
> 
> Le 10/02/16 03:55, Matt Taylor a écrit :
>> We are using Dell R730XD's with 2 x Internal SAS in Raid 1 for OS. 24 x 
>> 400GB SSD.
>> 
>> PERC H730P Mini is being used with non-RAID passthrough for the SSD's.
>> 
>> CPU and RAM specs aren't really needed to be known as you can do whatever 
>> you want, however I would recommend minimum of 2 x quad's and at least 48GB 
>> of RAM.
>> 
>> NICS are 4 x 10G (2 x 10 bonded for cluster, 2 x bonded for public). 
>> Naturally, you have the 4 x 1G on-board too.
>> 
>> Performance is very good for us.
> 
> Quite the same here too, with a mix of R730xd for capacity tier (Pure 
> mechanical 8 Gb drives) and 2xR630 for SSD tier.
> 
> 1 problem to be aware of :
> 
> -> We have "write intensive" Dell-branded SanDisk  OEM (LT0400MO), and there 
> is a problem with stock firmware : When used with H730P in raid0 mode, it 
> works like a charm. When put in HBA mode, as soon as you start to write data, 
> you have almost immediately SCSI write errors with the disk marked faulty, and 
> then it's a brick :/  . Firmware D413 fixed the problem ("faulty" disks were 
> changed)
> 
> So before doing extensive tests, if drives are of this model, make sure you 
> have latest firmware :)
> 
> Cheers,
> Yann
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-02-01 Thread Jan Schermer
Hi,
unfortunately I'm not a dev, so it's gonna be someone else ripping the journal 
out and trying.
But I came to understand that getting rid of journal is not that easy of a task.

To me, more important would be if the devs understood what I'm trying to say :-)
because without that, any new development will still try to accommodate whatever the
original design provided, and in this way it will be inherently "flawed".

I never used RADOS directly (the closest I came was trying RadosGW to replace a 
Swift cluster),
so it's possible someone built a custom app on top of it that uses some 
more powerful
features that RBD/S3 don't need. Is there such a project? Is someone actually 
building on top of
librados, or is it in reality only tied to RBD/S3 as we know it?

If the journal is really only for crash consistency in case of abrupt OSD failure, 
then I can tell you right now
RBD doesn't need it. The tricky part comes when you have to compare different 
data on OSDs after
a crash - from a filesystem perspective anything goes, but we need to stick to 
one "version" of the data
(or leave it marked as unused in a bitmap if one is to be used; who cares what 
data was actually in there).
But no need to reiterate I guess, there are more scenarios I haven't thought of.

I'd like to see Ceph competitive with vSAN, ScaleIO or even some concoction that 
I can brew using
NBD/DM/ZFS/whatever, and it's pretty obvious to me something isn't right in the 
design - at least
for the "most important" workload, which is RBD.

Jan


> On 29 Jan 2016, at 18:05, Robert LeBlanc <rob...@leblancnet.us> wrote:
> 
> Signed PGP part
> Jan,
> 
> I know that Sage has worked through a lot of this and spent a lot of
> time on it, so I'm somewhat inclined to say that if he says it needs
> to be there, then it needs to be there. I, however, have been known to
> stare at the tress so much that I miss the forest and I understand
> some of the points that you bring up about the data consistency and
> recovery from the client prospective. One thing that might be helpful
> is for you (or someone else) to get in the code and disable the
> journal pieces (not sure how difficult this would be) and test it
> against your theories. It seems like you have some deep and sincere
> interests in seeing Ceph be successful. If you theory holds up, then
> presenting the data and results will help others understand and be
> more interested in it. It took me a few months of this kind of work
> with the WeightedPriorityQueue, and I think the developers are now
> understanding the limitations of the PrioritizedQueue and how
> WeightedPriorityQueue can overcome them with the battery of tests I've
> done with a proof of concept. Theory and actual results can be
> different, but results are generally more difficult to argue.
> 
> Some of the decision about the journal may be based on RADOS and not
> RBD. For instance, the decision may have been made that if a RADOS
> write has been given to the cluster, it is to be assumed that the
> write is durable without waiting for an ACK. I can't see why an
> S3/RADOS client can't wait for an ACK from the web server/OSD, but I
> haven't gotten into that area yet. That is something else to keep in
> mind.
> 
> Lionel,
> 
> I don't think the journal is used for anything more than crash
> consistency of the OSD. I don't believe the journal is used a playback
> instrument for bringing other OSDs into sync. An osd that is out of
> sync will write it's updates to it's journal to speed up the process,
> but that is the extent. The OSD providing the update has to read the
> updates to send from disk/page cache. My understanding that the
> journal is "never" read from, only when the OSD process crashes.
> 
> I'm happy to be corrected if I've misstated anything.
> 
> Robert LeBlanc
> 
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Fri, Jan 29, 2016 at 9:27 AM, Lionel Bouton
> <lionel-subscript...@bouton.name> wrote:
> > Le 29/01/2016 16:25, Jan Schermer a écrit :
> >
> > [...]
> >
> > But if I understand correctly, there is indeed a log of the recent
> > modifications in the filestore which is used when a PG is recovering
> > because another OSD is lagging behind (not when Ceph reports a full
> > backfill where I suppose all objects' versions of a PG are compared).
> >
> > That list of transactions becomes useful only when OSD crashes and comes
> > back up - it needs to catch up somehow and this is one of the options. But
> > do you really need the "content" of those transactions which is what the
> > journal does?
> > If you have no such list then you need to either rely o

Re: [ceph-users] SSD Journal

2016-01-29 Thread Jan Schermer
>> inline

> On 29 Jan 2016, at 05:03, Somnath Roy <somnath@sandisk.com> wrote:
> 
> <  
> From: Jan Schermer [mailto:j...@schermer.cz <mailto:j...@schermer.cz>] 
> Sent: Thursday, January 28, 2016 3:51 PM
> To: Somnath Roy
> Cc: Tyler Bishop; ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> Subject: Re: SSD Journal
>  
> Thanks for a great walkthrough explanation.
> I am not really going to (and capable) of commenting on everything but.. see 
> below
>  
> On 28 Jan 2016, at 23:35, Somnath Roy <somnath@sandisk.com 
> <mailto:somnath@sandisk.com>> wrote:
>  
> Hi,
> Ceph needs to maintain a journal in case of filestore as underlying 
> filesystem like XFS *doesn’t have* any transactional semantics. Ceph has to 
> do a transactional write with data and metadata in the write path. It does in 
> the following way.
>  
> "Ceph has to do a transactional write with data and metadata in the write 
> path"
> Why? Isn't that only to provide that to itself?
> 
> [Somnath] Yes, that is for Ceph..That’s 2 setattrs (for rbd) + PGLog/Info..

And why does Ceph need that? Aren't we going in circles here? No client needs 
those transactions so there's no point in needing those transactions in Ceph.

>  
> 1. It creates a transaction object having multiple metadata operations and 
> the actual payload write.
>  
> 2. It is passed to Objectstore layer.
>  
> 3. Objectstore can complete the transaction in sync or async (Filestore) way.
>  
> Depending on whether the write was flushed or not? How is that decided?
> [Somnath] It depends on how ObjectStore backend is written..Not 
> dynamic..Filestore implemented in async way , I think BlueStore is written in 
> sync way (?)..
> 
>  
> 4.  Filestore dumps the entire Transaction object to the journal. It is a 
> circular buffer and written to the disk sequentially with O_DIRECT | O_DSYNC 
> way.
>  
> Just FYI, O_DIRECT doesn't really guarantee "no buffering", its purpose is 
> just to avoid needless caching.
> It should behave the way you want on Linux, but you must not rely on it since 
> this guarantee is not portable.
> 
> [Somnath] O_DIRECT alone is not guaranteed but With O_DSYNC it is guaranteed 
> to be reaching the disk..It may still be there in Disk cache , but, this is 
> taken care by disks..

O_DSYNC is the same as calling fdatasync() after writes. This only flushes the 
data, not the metadata. So if your "transactions" need those (and I think they 
do) then you don't get the expected consistency. In practice it could flush 
effectively everything.

>  
> 5. Once journal write is successful , write is acknowledged to the client. 
> Read for this data is not allowed yet as it is still not been written to the 
> actual location in the filesystem.
>  
> Now you are providing a guarantee for something nobody really needs. There is 
> no guarantee with traditional filesystems of not returning dirty unwritten 
> data. The guarantees are on writes, not reads. It might be easier to do it 
> this way if you plan for some sort of concurrent access to the same data from 
> multiple readers (that don't share the cache) - but is that really the case 
> here if it's still the same OSD that serves the data?
> Do the journals absorb only the unbuffered IO or all IO?
>  
> And what happens currently if I need to read the written data rightaway? When 
> do I get it then?
> 
> [Somnath] Well, this is debatable, but currently reads are blocked till 
> entire Tx execution is completed (not after doing syncfs)..Journal absorbs 
> all the IO..

So a database doing checkpoint read/modify/write is going to suffer greatly? 
That might explain a few more things I've seen.
But it's not needed anyway; in fact things like databases are very likely to 
write to the same place over and over again and you should in fact accommodate 
them by caching.


>  
> 6. The actual execution of the transaction is done in parallel for the 
> filesystem that can do check pointing like BTRFS. For the filesystem like 
> XFS/ext4 the journal is write ahead i.e Tx object will be written to journal 
> first and then the Tx execution will happen.
>  
> 7. Tx execution is done in parallel by the filestore worker threads. The 
> payload write is a buffered write and a sync thread within filestore is 
> periodically calling ‘syncfs’ to persist data/metadata to the actual location.
>  
> 8. Before each ‘syncfs’ call it determines the seq number till it is 
> persisted and trim the transaction objects from journal upto that point. This 
> will make room for more writes in the journal. If journal is full, write will 
> be stuck.
>  
> 9. If OSD is crashed after writ

Re: [ceph-users] SSD Journal

2016-01-29 Thread Jan Schermer

> On 29 Jan 2016, at 16:00, Lionel Bouton <lionel-subscript...@bouton.name> 
> wrote:
> 
> Le 29/01/2016 01:12, Jan Schermer a écrit :
>> [...]
>>> Second I'm not familiar with Ceph internals but OSDs must make sure that 
>>> their PGs are synced so I was under the impression that the OSD content for 
>>> a PG on the filesystem should always be guaranteed to be on all the other 
>>> active OSDs *or* their journals (so you wouldn't apply journal content 
>>> unless the other journals have already committed the same content). If you 
>>> remove the journals there's no intermediate on-disk "buffer" that can be 
>>> used to guarantee such a thing: one OSD will always have data that won't be 
>>> guaranteed to be on disk on the others. As I understand this you could say 
>>> that this is some form of 2-phase commit.
>> You can simply commit the data (to the filestore), and it would be in fact 
>> faster.
>> Client gets the write acknowledged when all the OSDs have the data - that 
>> doesn't change in this scenario. If one OSD gets ahead of the others and 
>> commits something the other OSDs do not before the whole cluster goes down 
>> then it doesn't hurt anything - you didn't acknowledge so the client has to 
>> replay if it cares, _NOT_ the OSDs.
>> The problem still exists, just gets shifted elsewhere. But the client (guest 
>> filesystem) already handles this.
> 
> Hum, if one OSD gets ahead of the others there must be a way for the
> OSDs to resynchronize themselves. I assume that on resync for each PG
> OSDs probably compare something very much like a tx_id.

Why? Yes, it makes sense when you scrub them to have the same data, but the 
client doesn't care. If it were a hard drive the situation is the same - maybe 
the data was written, maybe it was not. You have no way of knowing and you 
don't care - the filesystem (or even any sane database) handles this by design.
It's your choice whether to replay the tx or rollback because the client 
doesn't care either way - that block that you write (or don't) is either 
unallocated or containing any of the 2 versions of the data at that point. 
You clearly don't want to give the client 2 differnt versions of the data, so 
something like data=journal should be used and the data compared when OSD comes 
back up... still nothing that required "ceph journal" though.

> 
> What I was expecting is that in the case of a small backlog the journal
> - containing the last modifications by design - was used during recovery
> to fetch all the recent transaction contents. It seemed efficient to me:
> especially on rotating media fetching data from the journal would avoid
> long seeks. The first alternative I can think of is maintaining a
> separate log of the recently modified objects in the filestore without
> the actual content of the modification. Then you can fetch the objects
> from the filestore as needed but this probably seeks all over the place.
> In the case of multiple PGs lagging behind on other OSDs, reading the
> local journal would be even better as you have even more chances of
> ordering reads to avoid seeks on the journal and much more seeks would
> happen on the filestore.
> 
> But if I understand correctly, there is indeed a log of the recent
> modifications in the filestore which is used when a PG is recovering
> because another OSD is lagging behind (not when Ceph reports a full
> backfill where I suppose all objects' versions of a PG are compared).

That list of transactions becomes useful only when OSD crashes and comes back 
up - it needs to catch up somehow and this is one of the options. But do you 
really need the "content" of those transactions which is what the journal does?
If you have no such list then you need to either rely on things like mtime of 
the object, or simply compare the hash of the objects (scrub). In the meantime 
you simply have to run from the other copies or stick to one copy of the data. 
But even if you stick to the "wrong" version it does no harm as long as you 
don't arbitrarily change that copy because the client didn't know what data 
ended on drive and must be (and is) prepared to use whatever you have.


> 
> Lionel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-01-28 Thread Jan Schermer
You can't run Ceph OSD without a journal. The journal is always there.
If you don't have a journal partition then there's a "journal" file on the OSD 
filesystem that does the same thing. If it's a partition then this file turns 
into a symlink.
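
You can see which case you have just by looking at the OSD data directory 
(illustrative output only, the exact paths will differ):

ls -l /var/lib/ceph/osd/ceph-0/journal
# co-located journal: a plain file inside the data directory
# separate journal: a symlink pointing at the journal partition, e.g.
#   journal -> /dev/disk/by-partuuid/<uuid>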

You will always be better off with a journal on a separate partition because of 
the way the writeback cache in linux works (someone correct me if I'm wrong).
The journal needs to flush to disk quite often, and linux is not always able to 
flush only the journal data. You can't defer metadata flushing forever, and 
doing fsync() makes all the dirty data flush as well. ext2/3/4 also flushes 
data to the filesystem periodically (every 5s, is it, I think?), which will make the 
latency of the journal go through the roof momentarily.
(I'll leave researching how exactly XFS does it to those who care about that 
"filesystem'o'thing").

P.S. I feel very strongly that this whole concept is broken fundamentally. We 
already have a journal for the filesystem which is time proven, well behaved 
and above all fast. Instead there's this reinvented wheel which supposedly does 
it better in userspace while not really avoiding the filesystem journal either. 
It would maybe make sense if the OSD stored the data on a block device 
directly, avoiding the filesystem altogether. But it would still do the same 
bloody thing and (no disrespect) ext4 does this better than Ceph ever will.


> On 28 Jan 2016, at 20:01, Tyler Bishop  wrote:
> 
> This is an interesting topic that i've been waiting for.
> 
> Right now we run the journal as a partition on the data disk.  I've built 
> drives without journals and the write performance seems okay but random io 
> performance is poor in comparison to what it should be.
> 
>  
>  
> Tyler Bishop
> Chief Technical Officer
> 513-299-7108 x10
> tyler.bis...@beyondhosting.net
> If you are not the intended recipient of this transmission you are notified 
> that disclosing, copying, distributing or taking any action in reliance on 
> the contents of this information is strictly prohibited.
>  
> 
> From: "Bill WONG" 
> To: "ceph-users" 
> Sent: Thursday, January 28, 2016 1:36:01 PM
> Subject: [ceph-users] SSD Journal
> 
> Hi,
> i have tested with SSD Journal with SATA, it works perfectly.. now, i am 
> testing with full SSD ceph cluster, now with full SSD ceph cluster, do i 
> still need to have SSD as journal disk? 
> 
> [ assumed i do not have PCIe SSD Flash which is better performance than 
> normal SSD disk]
> 
> please give some ideas on full ssd ceph cluster ... thank you!
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-01-28 Thread Jan Schermer
has to 
> write and thus getting a more stable performance out.
>  
> 3. Main reason that we can’t allow journal to go further ahead as the Tx 
> object will not be deleted till the Tx executes. More behind the Tx execution 
> , more memory growth will happen. Presently, Tx object is deleted 
> asynchronously (and thus taking more time)and we changed it to delete it from 
> the filestore worker thread itself.
>  
> 4. The sync thread is optimized to do a fast sync. The extra last commit seq 
> file is not maintained any more for *the write ahead journal* as this 
> information can be found in journal header.
>  
> Here is the related pull requests..
>  
>  
> https://github.com/ceph/ceph/pull/7271 
> <https://github.com/ceph/ceph/pull/7271>
> https://github.com/ceph/ceph/pull/7303 
> <https://github.com/ceph/ceph/pull/7303>
> https://github.com/ceph/ceph/pull/7278 
> <https://github.com/ceph/ceph/pull/7278>
> https://github.com/ceph/ceph/pull/6743 
> <https://github.com/ceph/ceph/pull/6743>
>  
> Regarding bypassing filesystem and accessing block device directly, yes, that 
> should be more clean/simple and efficient solution. With Sage’s Bluestore, 
> Ceph is moving towards that very fast !!!


This all makes sense, but it's unnecessary.
All you need to do is mirror the IO the client does on the filesystem serving 
the objects. That's it. The filesystem journal already provides all the 
guarantees you need. For example you don't need "no read from cache" guarantee 
because you don't get it anywhere else (so what's the use of that?). You don't 
need atomic multi-IO transactions because they are not implemented anywhere but 
at the *guest* filesystem level, which already has to work with hard drives 
that have no such concept. Even if Ceph put itself in the role of such a smart 
virtual drive that can handle multi-IO atomic transactions then currently there 
are no consumers of those capabilities. 

What do we all really need RBD to do? Emulate a physical hard drive of course. 
And it simply does not need to do any better, that's wasted effort. 
Sure it would be very nice if you could offload all the trickiness of ACID onto 
the hardware, but you can't (yet), and at this point nobody really needs that - 
filesystems are already doing the hard work in a proven way.
Unless you bring something new to the table which makes use of all that then 
you only need to bench yourself to the physical hardware. And sadly Ceph is 
nowhere close to a single SSD performance even when running on a beefy cluster 
while the benefits it supposedly provides are for what? 

Just make sure that the same IO that the guest sends gets to the filesystem on 
the OSD. (Ok, fair enough it's not _that_ simple, but not much more complicated 
either - you still need to persist data on all the objects since the last flush 
(which btw in any real world cluster means just checking as there was likely an 
fsync already somewhere from other clients))
Bam. You're done. You just mirrored what a hard drive does, because you 
mirrored that to a filesystem that mirrors that to a hard drive... No need for 
journals on top of filesystems with journals with data on filesystems with 
journals... My databases are not that fond of the multi-ms committing limbo 
while data falls down through those dream layers :P

I really don't know how to explain that more. I bet if you ask on LKML, someone 
like Theodore Ts'o would say "you're doing completely superfluous work" in more 
technical terms.

Jan

>  
> Thanks & Regards
> Somnath
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
> <mailto:ceph-users-boun...@lists.ceph.com>] On Behalf Of Tyler Bishop
> Sent: Thursday, January 28, 2016 1:35 PM
> To: Jan Schermer
> Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] SSD Journal
>  
> What approach did sandisk take with this for jewel?
>  
>  
>  
> Tyler Bishop
> Chief Technical Officer
> 513-299-7108 x10
> tyler.bis...@beyondhosting.net <mailto:tyler.bis...@beyondhosting.net>
> If you are not the intended recipient of this transmission you are notified 
> that disclosing, copying, distributing or taking any action in reliance on 
> the contents of this information is strictly prohibited.
>  
>  
> From: "Jan Schermer" <j...@schermer.cz <mailto:j...@schermer.cz>>
> To: "Tyler Bishop" <tyler.bis...@beyondhosting.net 
> <mailto:tyler.bis...@beyondhosting.net>>
> Cc: "Bill WONG" <wongahsh...@gmail.com <mailto:wongahsh...@gmail.com>>, 
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> Sent: Thursday, January 28, 2016 4:32:54 PM
> Subject: Re: [ceph-users] SSD Journal
>  
> You can't run 

Re: [ceph-users] SSD Journal

2016-01-28 Thread Jan Schermer

> On 28 Jan 2016, at 23:19, Lionel Bouton <lionel-subscript...@bouton.name> 
> wrote:
> 
> Le 28/01/2016 22:32, Jan Schermer a écrit :
>> P.S. I feel very strongly that this whole concept is broken fundamentaly. We 
>> already have a journal for the filesystem which is time proven, well behaved 
>> and above all fast. Instead there's this reinvented wheel which supposedly 
>> does it better in userspace while not really avoiding the filesystem journal 
>> either. It would maybe make sense if OSD was storing the data on a block 
>> device directly, avoiding the filesystem altogether. But it would still do 
>> the same bloody thing and (no disrespect) ext4 does this better than Ceph 
>> ever will.
>> 
> 
> Hum I've seen this discussed previously but I'm not sure the fs journal could 
> be used as a Ceph journal.
> 
> First BTRFS doesn't have a journal per se, so you would not be able to use 
> xfs or ext4 journal on another device with journal=data setup to make write 
> bursts/random writes fast. And I won't go back to XFS or test ext4... I've 
> detected too much silent corruption by hardware with BTRFS to trust our data 
> to any filesystem not using CRC on reads (and in our particular case the 
> compression and speed are additional bonuses).

ZFS takes care of all those concerns... Most people are quite happy with 
ext2/3/4, oblivious to the fact they are losing bits here and there... and the 
world still spins the same :-)
I personally believe the task of not corrupting data doesn't belong in the 
filesystem layer but rather should be handled by the RAID array, mdraid, RBD... ZFS 
does it because it handles those tasks too.

> 
> Second I'm not familiar with Ceph internals but OSDs must make sure that 
> their PGs are synced so I was under the impression that the OSD content for a 
> PG on the filesystem should always be guaranteed to be on all the other 
> active OSDs *or* their journals (so you wouldn't apply journal content unless 
> the other journals have already committed the same content). If you remove 
> the journals there's no intermediate on-disk "buffer" that can be used to 
> guarantee such a thing: one OSD will always have data that won't be 
> guaranteed to be on disk on the others. As I understand this you could say 
> that this is some form of 2-phase commit.

You can simply commit the data (to the filestore), and it would be in fact 
faster.
Client gets the write acknowledged when all the OSDs have the data - that 
doesn't change in this scenario. If one OSD gets ahead of the others and 
commits something the other OSDs do not before the whole cluster goes down then 
it doesn't hurt anything - you didn't acknowledge so the client has to replay 
if it cares, _NOT_ the OSDs.
The problem still exists, just gets shifted elsewhere. But the client (guest 
filesystem) already handles this.

> 
> I may be mistaken: there are structures in the filestore that *may* take on 
> this role but I'm not sure what their exact use is : the _TEMP dirs, 
> the omap and meta dirs. My guess is that they serve other purposes: it would 
> make sense to use the journals for this because the data is already there and 
> the commit/apply coherency barriers seem both trivial and efficient to use.
> 
> That's not to say that the journals are the only way to maintain the needed 
> coherency, just that they might be used to do so because once they are here, 
> this is a trivial extension of their use.
> 

In the context of the cloud, more and more people realize that clinging to things 
like "durability" and "consistency" is out of fashion. I think the future will 
take a different turn... I can't say I agree with that, though, I'm usually the 
one fixing those screw ups afterwards.


> Lionel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph rdb question about possibilities

2016-01-28 Thread Jan Schermer
Yum repo doesn't really sound like something "mission critical" (in my 
experience). You need to wait for the repo to update anyway before you can use 
the package, so not something realtime either.
CephFS is overkill for this.

I would either simply rsync the repo between the three machines via cron (and 
set DNS to point to all three IPs they have). If you need something "more 
realtime" then you can use for example incron (rough sketches below).
Or you can push new packages to all three and refresh them (assuming you have 
some tools that do that already).
Or if you want to use Ceph then you can create one rbd image that only gets 
mounted on one of the hosts, and if that goes down you remount it elsewhere (by 
hand, or via pacemaker, cron script...). I don't think Ceph makes sense if that's 
going to be the only use, though...
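
Rough sketches of both (hostnames, pool and image names are all made up, adjust to your setup):

    # cron variant: node2 and node3 pull the repo from node1 every 5 minutes
    */5 * * * * rsync -a --delete node1:/srv/yumrepo/ /srv/yumrepo/

    # rbd variant: one image, mounted on exactly one host at a time
    rbd create repo/yumrepo --size 102400
    rbd map repo/yumrepo              # shows up as /dev/rbd0
    mkfs.xfs /dev/rbd0 && mount /dev/rbd0 /srv/yumrepo
    # failover = umount + "rbd unmap" on the old host, map + mount on the new one
    # (never mounted on two hosts at once with a local filesystem like xfs/ext4)

Pacemaker can automate exactly that umount/unmap/map/mount sequence if you want it unattended.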

Maybe you could also push the packages to RadosGW, but I'm not familiar with it 
that much, not sure how you build a repo that points there. This would make 
sense but I have no idea if it's simple to do.

Jan


> On 28 Jan 2016, at 13:03, Sándor Szombat <szombat.san...@gmail.com> wrote:
> 
> Hello,
> 
> yes, I misunderstood some things. Thanks for your help!
> 
> So the situation is this: we have a yum repo with rpm packages. We want to 
> store these rpms in Ceph. But we have three main nodes that are able to 
> install the other nodes, so we have to share these rpm packages between the 
> three hosts. I checked CephFS and it would be the best solution for us, but it 
> is in beta and beta products aren't allowed for us. This is why I'm trying to 
> find another, Ceph-based solution. 
> 
> 
> 
> 2016-01-28 12:46 GMT+01:00 Jan Schermer <j...@schermer.cz 
> <mailto:j...@schermer.cz>>:
> This is somewhat confusing.
> 
> CephFS is a shared filesystem - you mount that on N hosts and they can access 
> the data simultaneously.
> RBD is a block device, this block device can be accessed from more than 1 
> host, BUT you need to use a cluster aware filesystem (such as GFS2, OCFS).
> 
> Both CephFS and RBD use RADOS as a backend, which is responsible for data 
> placement, high-availability and so on.
> 
> If you explain your scenario more we could suggest some options - do you 
> really need to have the data accessible on more servers, or is a (short) 
> outage acceptable when one server goes down? What type of data do you need to 
> share and how will the data be accessed?
> 
> Jan
> 
> > On 28 Jan 2016, at 11:06, Sándor Szombat <szombat.san...@gmail.com 
> > <mailto:szombat.san...@gmail.com>> wrote:
> >
> > Hello all!
> >
> > I checked Ceph FS, but it is in beta now, unfortunately. I started to look 
> > into rbd. Is it possible to create an image in a pool, mount it as a 
> > block device (for example /dev/rbd0), format it, and mount it 
> > on 2 hosts? I tried this, and it works, but after mounting /dev/rbd0 
> > on the two hosts and putting files into these mounted 
> > folders, the contents don't refresh automatically between hosts.
> > So the main question: would this be a possible solution?
> > (The task: we have 3 main nodes that can install the other nodes with 
> > ansible, and we want to store our rpms in Ceph if possible. This is 
> > necessary because of high availability.)
> >
> > Thanks for your help!
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph rdb question about possibilities

2016-01-28 Thread Jan Schermer
This is somewhat confusing.

CephFS is a shared filesystem - you mount that on N hosts and they can access 
the data simultaneously.
RBD is a block device, this block device can be accessed from more than 1 host, 
BUT you need to use a cluster aware filesystem (such as GFS2, OCFS).

Both CephFS and RBD use RADOS as a backend, which is responsible for data 
placement, high-availability and so on.

If you explain your scenario more we could suggest some options - do you really 
need to have the data accessible on more servers, or is a (short) outage 
acceptable when one server goes down? What type of data do you need to share 
and how will the data be accessed?

Jan

> On 28 Jan 2016, at 11:06, Sándor Szombat  wrote:
> 
> Hello all! 
> 
> I checked Ceph FS, but it is in beta now, unfortunately. I started to look into 
> rbd. Is it possible to create an image in a pool, mount it as a block 
> device (for example /dev/rbd0), format it, and mount it on 2 
> hosts? I tried this, and it works, but after mounting /dev/rbd0 on 
> the two hosts and putting files into these mounted folders, the contents don't 
> refresh automatically between hosts. 
> So the main question: would this be a possible solution?
> (The task: we have 3 main nodes that can install the other nodes with ansible, 
> and we want to store our rpms in Ceph if possible. This is necessary 
> because of high availability.)
> 
> Thanks for your help!
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD behavior, in case of its journal disk (either HDD or SSD) failure

2016-01-25 Thread Jan Schermer
OSD stops.
And you pretty much lose all data on the OSD if you lose its journal.
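
If the journal device is still healthy and you just want to move or replace it, the usual sequence is roughly this (OSD id and init syntax are only examples, depends on your distro):

    service ceph stop osd.12
    ceph-osd -i 12 --flush-journal
    # repoint the journal symlink in /var/lib/ceph/osd/ceph-12/ to the new device
    ceph-osd -i 12 --mkjournal
    service ceph start osd.12

That only works while the old journal is still readable - once it's dead, the OSD goes with it, as above.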

Jan

> On 25 Jan 2016, at 14:04, M Ranga Swami Reddy  wrote:
> 
> Hello,
> 
> If a journal disk fails (with crash or power failure, etc), what
> happens on OSD operations?
> 
> PS: Assume that journal and OSD is on a separate drive.
> 
> Thanks
> Swami
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Jan Schermer
The OSD is able to use more than one core to do the work, so increasing the 
number of cores will increase throughput.
However, if you care about latency then that is always tied to speed=frequency.

If the question was "should I get 40GHz in 8 cores or in 16 cores" then the 
answer will always be "in 8 cores".
However, higher-frequency CPUs are much pricier than lower-clocked ones with more 
cores,
so you will get a higher "throughput" for less $ if you scale the cores and not 
the frequency.

If you need to run more OSDs on one host than the number of cores this gets a 
bit tricky, because of NUMA and
the Linux scheduler, which you should tune. If the number of OSDs is small enough I 
would always prefer the faster (frequency)
CPU over a slower one. 

Jan


> On 20 Jan 2016, at 13:01, Tomasz Kuzemko  wrote:
> 
> Hi,
> my team did some benchmarks in the past to answer this question. I don't
> have results at hand, but conclusion was that it depends on how many
> disks/OSDs you have in a single host: above 9 there was more benefit
> from more cores than GHz (6-core 3.5GHz vs 10-core 2.4GHz AFAIR).
> 
> --
> Tomasz Kuzemko
> tomasz.kuze...@corp.ovh.com
> 
> On 20.01.2016 10:01, Götz Reinicke - IT Koordinator wrote:
>> Hi folks,
>> 
>> we plan to use more ssd OSDs in our first cluster layout instead of SAS
>> osds. (more IO is needed than space)
>> 
>> short question: What would influence the performance more? more Cores or
>> more GHz/Core.
>> 
>> Or is it as always: Depends on the total of OSDs/nodes/repl-level/etc ... :)
>> 
>> If needed, I can give some more detailed information on the layout.
>> 
>>  Thanks for feedback. Götz
>> -- 
>> Götz Reinicke
>> IT-Koordinator
>> 
>> Tel. +49 7141 969 82420
>> E-Mail goetz.reini...@filmakademie.de
>> 
>> Filmakademie Baden-Württemberg GmbH
>> Akademiehof 10
>> 71638 Ludwigsburg
>> www.filmakademie.de
>> 
>> Eintragung Amtsgericht Stuttgart HRB 205016
>> 
>> Vorsitzender des Aufsichtsrats: Jürgen Walter MdL
>> Staatssekretär im Ministerium für Wissenschaft,
>> Forschung und Kunst Baden-Württemberg
>> 
>> Geschäftsführer: Prof. Thomas Schadt
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Jan Schermer
This is very true, but do you actually exclusively pin the cores to the OSD 
daemons so they don't interfere?
I don't think many people do that, it wouldn't work with more than a handful of 
OSDs.
The OSD might typically only need <100% of one core, but during startup or some 
reshuffling it's beneficial
to allow it to get more (>400%), and that will interfere with whatever else was 
pinned there...
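
If you do want to try it, a rough sketch of one way (cpuset cgroup, the core/node numbers and OSD pid are just placeholders):

    mkdir /sys/fs/cgroup/cpuset/osd12
    echo 0-7 > /sys/fs/cgroup/cpuset/osd12/cpuset.cpus   # cores on NUMA node 0
    echo 0 > /sys/fs/cgroup/cpuset/osd12/cpuset.mems
    echo <osd pid> > /sys/fs/cgroup/cpuset/osd12/tasks

or simply "taskset -cp 0-7 <osd pid>" once the OSD has settled down after boot.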

Jan

> On 20 Jan 2016, at 13:07, Oliver Dzombic  wrote:
> 
> Hi,
> 
> Cores > Frequency
> 
> If you think about recovery / scrubbing tasks its better when a cpu core
> can be assigned to do this.
> 
> Compared to a situation where the same cpu core needs to recovery/scrub
> and still deliver the productive content at the same time.
> 
> The more you can create a situation where an osd has its "own" cpu core,
> the better it is. Modern CPU's are anyway so fast, that even SSDs can't
> run the CPU's to their limit.
> 
> -- 
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 20.01.2016 um 10:01 schrieb Götz Reinicke - IT Koordinator:
>> Hi folks,
>> 
>> we plan to use more ssd OSDs in our first cluster layout instead of SAS
>> osds. (more IO is needed than space)
>> 
>> short question: What would influence the performance more? more Cores or
>> more GHz/Core.
>> 
>> Or is it as always: Depends on the total of OSDs/nodes/repl-level/etc ... :)
>> 
>> If needed, I can give some more detailed information on the layout.
>> 
>>  Thanks for feedback. Götz
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Jan Schermer
I'm using Ceph with all SSDs, I doubt you have to worry about speed that
much with HDD (it will be abysmal either way).
With SSDs you need to start worrying about processor caches and memory
colocation in NUMA systems; the Linux scheduler is not really that smart right now.
Yes, the process will get its own core, but it might be a different core every
time it spins up, this increases latencies considerably if you start hammering
the OSDs on the same host.

But as always, YMMV ;-)

Jan


> On 20 Jan 2016, at 13:28, Oliver Dzombic <i...@ip-interactive.de> wrote:
> 
> Hi Jan,
> 
> actually the linux kernel does this automatically anyway ( sending new
> processes to "empty/low used" cores ).
> 
> A single scrubbing/recovery or what ever process wont take more than
> 100% CPU ( one core ) because technically this processes are not able to
> run multi thread.
> 
> Of course, if you configure your ceph to have ( up to ) 8 backfill
> processes, then 8 processes will start, which can utilize ( up to ) 8
> CPU cores.
> 
> But still, the single process wont be able to use more than one cpu core.
> 
> ---
> 
> In a situation where you have 2x E5-2620v3 for example, you have 2x 6
> Cores x 2 HT Units = 24 Threads ( vCores ).
> 
> So if you use inside such a system 24 OSD's every OSD will have (
> mathematically ) its "own" CPU Core automatically.
> 
> Such a combination will perform better compared if you are using 1x E5
> CPU with a much higher frequency ( but still the same amout of cores ).
> 
> This kind of CPU's are so fast, that the physical HDD ( no matter if
> SAS/SSD/ATA ) will not be able to overload the cpu ( no matter which cpu
> you use of this kind ).
> 
> Its like if you are playing games. If the game is running smooth, it
> does not matter if its running on a 4 GHz machine on 40% utilization or
> on a 2 GHz machine with 80% utilization. Is running smooth, it can not
> do better :-)
> 
> So if your data is coming as fast as the HDD can physical deliver it,
> its not important if the cpu runs with 2, 3, 4, 200 Ghz. Its already the
> max of what the HDD can deliver.
> 
> So as long as the HDD's dont get faster, the CPU's does not need to be
> faster.
> 
> The Ceph storage is usually just delivering data, not running a
> commercial webserver/what ever beside that.
> 
> So if you are deciding what CPU you have to choose, you only have to
> think about how fast your HDD devices are. So that the CPU does not
> become the bottleneck.
> 
> And the more cores you have, the lower is the chance, that different
> requests will block each other.
> 
> 
> 
> So all in all, Core > Frequency, always. ( As long as you use fast/up to
> date CPUs ). If you are using old cpu's, of course you have to make sure
> that the performance of the cpu ( which does by the way not only depend
> on the frequency ) is sufficient that its not breaking the HDD data output.
> 
> 
> 
> -- 
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 20.01.2016 um 13:10 schrieb Jan Schermer:
>> This is very true, but do you actually exclusively pin the cores to the OSD 
>> daemons so they don't interfere?
>> I don't think may people do that, it wouldn't work with more than a handful 
>> of OSDs.
>> The OSD might typicaly only need <100% of one core, but during startup or 
>> some reshuffling it's beneficial
>> to allow it to get more (>400%), and that will interfere with whatever else 
>> was pinned there...
>> 
>> Jan
>> 
>>> On 20 Jan 2016, at 13:07, Oliver Dzombic <i...@ip-interactive.de> wrote:
>>> 
>>> Hi,
>>> 
>>> Cores > Frequency
>>> 
>>> If you think about recovery / scrubbing tasks its better when a cpu core
>>> can be assigned to do this.
>>> 
>>> Compared to a situation where the same cpu core needs to recovery/scrub
>>> and still deliver the productive content at the same time.
>>> 
>>> The more you can create a situation where an osd has its "own" cpu core,
>>> the better it is. Modern CPU's are anyway so fast, that even SSDs can't
>>> run the CPU's to their limit.
>>> 
>>> -- 
>>> Mit freundlichen Gruessen / Best regards
>>> 
>>> Oliver Dzombic
>>> IP-Interactive
>>> 
>&

Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Jan Schermer
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 20.01.2016 um 14:14 schrieb Wade Holler:
>> Great commentary.
>> 
>> While it is fundamentally true that higher clock speed equals lower
>> latency, in my practical experience we are more often interested in
>> latency at the concurrency profile of the applications.
>> 
>> So in this regard I favor more cores when I have to choose, such that we
>> can support more concurrent operations at a queue depth of 0.
>> 
>> Cheers
>> Wade
>> On Wed, Jan 20, 2016 at 7:58 AM Jan Schermer <j...@schermer.cz
>> <mailto:j...@schermer.cz>> wrote:
>> 
>>I'm using Ceph with all SSDs, I doubt you have to worry about speed that
>>much with HDD (it will be abysmal either way).
>>With SSDs you need to start worrying about processor caches and memory
>>colocation in NUMA systems, linux scheduler is not really that smart
>>right now.
>>Yes, the process will get its own core, but it might be a different
>>core every
>>time it spins up, this increases latencies considerably if you start
>>hammering
>>the OSDs on the same host.
>> 
>>But as always, YMMV ;-)
>> 
>>Jan
>> 
>> 
>>> On 20 Jan 2016, at 13:28, Oliver Dzombic <i...@ip-interactive.de
>><mailto:i...@ip-interactive.de>> wrote:
>>> 
>>> Hi Jan,
>>> 
>>> actually the linux kernel does this automatically anyway ( sending new
>>> processes to "empty/low used" cores ).
>>> 
>>> A single scrubbing/recovery or what ever process wont take more than
>>> 100% CPU ( one core ) because technically this processes are not
>>able to
>>> run multi thread.
>>> 
>>> Of course, if you configure your ceph to have ( up to ) 8 backfill
>>> processes, then 8 processes will start, which can utilize ( up to ) 8
>>> CPU cores.
>>> 
>>> But still, the single process wont be able to use more than one
>>cpu core.
>>> 
>>> ---
>>> 
>>> In a situation where you have 2x E5-2620v3 for example, you have 2x 6
>>> Cores x 2 HT Units = 24 Threads ( vCores ).
>>> 
>>> So if you use inside such a system 24 OSD's every OSD will have (
>>> mathematically ) its "own" CPU Core automatically.
>>> 
>>> Such a combination will perform better compared if you are using 1x E5
>>> CPU with a much higher frequency ( but still the same amout of
>>cores ).
>>> 
>>> This kind of CPU's are so fast, that the physical HDD ( no matter if
>>> SAS/SSD/ATA ) will not be able to overload the cpu ( no matter
>>which cpu
>>> you use of this kind ).
>>> 
>>> Its like if you are playing games. If the game is running smooth, it
>>> does not matter if its running on a 4 GHz machine on 40%
>>utilization or
>>> on a 2 GHz machine with 80% utilization. Is running smooth, it can not
>>> do better :-)
>>> 
>>> So if your data is coming as fast as the HDD can physical deliver it,
>>> its not important if the cpu runs with 2, 3, 4, 200 Ghz. Its
>>already the
>>> max of what the HDD can deliver.
>>> 
>>> So as long as the HDD's dont get faster, the CPU's does not need to be
>>> faster.
>>> 
>>> The Ceph storage is usually just delivering data, not running a
>>> commercial webserver/what ever beside that.
>>> 
>>> So if you are deciding what CPU you have to choose, you only have to
>>> think about how fast your HDD devices are. So that the CPU does not
>>> become the bottleneck.
>>> 
>>> And the more cores you have, the lower is the chance, that different
>>> requests will block each other.
>>> 
>>> 
>>> 
>>> So all in all, Core > Frequency, always. ( As long as you use
>>fast/up to
>>> date CPUs ). If you are using old cpu's, of course you have to
>>make sure
>>> that the performance of the cpu ( which does by the way not only
>>depend
>>> on the frequency ) is sufficient that its not breaking the HDD
>>data output.
>>> 
>>> 
>>> 
>>> --
>>> Mit freundlichen Gruessen / Best regards
>>> 
>>> Oliver Dzombic
>>> IP-Interactive
>>> 
>>> mailto:i...@ip-interactive.de <mailto:i...@ip-interactive.de>

Re: [ceph-users] bad sectors on rbd device?

2016-01-06 Thread Jan Schermer
I think you are running out of memory(?), or at least of the memory for the 
type of allocation krbd tries to use.
I'm not going to decode all the logs but you can try increasing min_free_kbytes 
as the first step. I assume this is amd64, so there's no HIGHMEM trouble (I 
don't remember how to solve those).
It can happen either due to the system being under memory pressure (from device 
drivers and other in-kernel allocations) or if it is too slow to satisfy the 
allocation request in time (if it's a VM for example). It can also be caused by a 
bug in the rbd client of course...
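
The min_free_kbytes bump is just something like this (the value is only an example, scale it to your RAM - a few hundred MB is usually plenty):

    sysctl -w vm.min_free_kbytes=262144
    # and put it in /etc/sysctl.conf if it helps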

Newer kernel almost always helps with vm troubles like this :-)

Jan


> On 05 Jan 2016, at 14:55, Philipp Schwaha  wrote:
> 
> Hi List,
> 
> I have an issue with an rbd device. I have an rbd device on which I
> created a file system. When I copy files to the file system I get issues
> about failing to write to a sector to sectors on the rbd block device.
> I see the following in the log file:
> 
> [88931.224311] rbd: rbd0: write 8 at 202e777000 result -12
> [88931.224317] blk_update_request: I/O error, dev rbd0, sector 269958072
> [88931.224542] rbd: rbd0: write 8 at 202e6f7000 result -12
> [88931.225908] rbd: rbd0: write 8 at 202e677000 result -12
> [88931.226198] rbd: rbd0: write 8 at 202e7f7000 result -12
> [88931.227501] rbd: rbd0: write 8 at 202e877000 result -12
> [88931.247151] rbd: rbd0: write 8 at 202eff7000 result -12
> [88931.247827] rbd: rbd0: write 8 at 202f077000 result -12
> 
> Looking further I found the following:
> 
> [88931.181608] warn_alloc_failed: 119 callbacks suppressed
> [88931.181616] kworker/2:13: page allocation failure: order:1, mode:0x204020
> [88931.181621] CPU: 2 PID: 7300 Comm: kworker/2:13 Tainted: G W 4.3.3-ge
> [88931.181636] Workqueue: rbd rbd_queue_workfn [rbd]
> [88931.181641] 88013c483ae0 813656c3 00204020
> 8114c438
> [88931.181645]  88017fff9b00 
> 
> [88931.181648]  0f12 00244220
> 
> [88931.181652] Call Trace:
> [88931.181665] [] ? dump_stack+0x40/0x5d
> [88931.181670] [] ? warn_alloc_failed+0xd8/0x130
> [88931.181673] [] ? __alloc_pages_nodemask+0x2b3/0x9e0
> [88931.181679] [] ? kmem_getpages+0x5d/0x100
> [88931.181683] [] ? fallback_alloc+0x141/0x1f0
> [88931.181686] [] ? kmem_cache_alloc+0x1e3/0x450
> [88931.181696] [] ? ceph_osdc_alloc_request+0x51/0x250
> [libceph]
> [88931.181700] [] ?
> rbd_osd_req_create.isra.25+0x51/0x1a0 [rbd]
> [88931.181704] [] ? rbd_img_request_fill+0x228/0x850 [rbd]
> [88931.181708] [] ? rbd_queue_workfn+0x2b9/0x3b0 [rbd]
> [88931.181713] [] ? process_one_work+0x14c/0x3b0
> [88931.181717] [] ? worker_thread+0x4d/0x440
> [88931.181720] [] ? rescuer_thread+0x2e0/0x2e0
> [88931.181724] [] ? kthread+0xbd/0xe0
> [88931.181727] [] ? kthread_park+0x50/0x50
> [88931.181731] [] ? ret_from_fork+0x3f/0x70
> [88931.181734] [] ? kthread_park+0x50/0x50
> [88931.181736] Mem-Info:
> [88931.181745] active_anon:57146 inactive_anon:65771 isolated_anon:0
> [88931.181745] active_file:405123 inactive_file:397563 isolated_file:0
> [88931.181745] unevictable:0 dirty:192 writeback:16100 unstable:0
> [88931.181745] slab_reclaimable:28501 slab_unreclaimable:8143
> [88931.181745] mapped:14501 shmem:24976 pagetables:1962 bounce:0
> [88931.181745] free:8824 free_pcp:816 free_cma:0
> [88931.181750] Node 0 DMA free:15436kB min:28kB low:32kB high:40kB
> active_anon:4kB in
> B inactive_file:28kB unevictable:0kB isolated(anon):0kB
> isolated(file):0kB present:15
> kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:48kB
> slab_unreclaima
> tables:4kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
> free_cma:0kB writeback_
> eclaimable? no
> [88931.181758] lowmem_reserve[]: 0 1873 3856 3856
> [88931.181762] Node 0 DMA32 free:13720kB min:3800kB low:4748kB
> high:5700kB active_ano
> B active_file:806264kB inactive_file:776948kB unevictable:0kB
> isolated(anon):0kB isol
> B managed:1921632kB mlocked:0kB dirty:104kB writeback:9384kB
> mapped:35712kB shmem:514
> lab_unreclaimable:12532kB kernel_stack:2224kB pagetables:3464kB
> unstable:0kB bounce:0
> 0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [88931.181769] lowmem_reserve[]: 0 0 1982 1982
> [88931.181773] Node 0 Normal free:6140kB min:4024kB low:5028kB
> high:6036kB active_ano
> kB active_file:814224kB inactive_file:813276kB unevictable:0kB
> isolated(anon):0kB iso
> kB managed:2030320kB mlocked:0kB dirty:664kB writeback:55016kB
> mapped:22292kB shmem:4
> slab_unreclaimable:19964kB kernel_stack:2352kB pagetables:4380kB
> unstable:0kB bounce
> 100kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [88931.181780] lowmem_reserve[]: 0 0 0 0
> [88931.181784] Node 0 DMA: 11*4kB (UEM) 8*8kB (EM) 8*16kB (UEM) 3*32kB
> (UE) 2*64kB (U
> 12kB (EM) 3*1024kB (UEM) 1*2048kB (E) 2*4096kB (M) = 15436kB
> [88931.181801] 

Re: [ceph-users] My OSDs are down and not coming UP

2015-12-29 Thread Jan Schermer
Just try putting something like the following in ceph.conf:

[global]
mon_host =::2:1612::50 ::2:1612::30
mon_initial_members = node-1 node-2

Also, I just noticed you have two MONs? It should always be an odd number - two 
can still form a quorum, but losing either one means no quorum at all.

Jan



> On 30 Dec 2015, at 00:15, Ing. Martin Samek <samek...@fel.cvut.cz> wrote:
> 
> I'm deploying ceph cluster manually following different guides. I didn't use 
> ceph-deploy yet.
> 
> MS: 
> 
> Dne 30.12.2015 v 00:13 Somnath Roy napsal(a):
>> It should be monitor host names..If you are deploying with ceph-deploy it 
>> should be added in the conf file automatically..How are you creating your 
>> cluster ?
>> Did you change conf file after installing ?
>>  
>> From: Ing. Martin Samek [mailto:samek...@fel.cvut.cz 
>> <mailto:samek...@fel.cvut.cz>] 
>> Sent: Tuesday, December 29, 2015 3:09 PM
>> To: Jan Schermer
>> Cc: Somnath Roy; ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] My OSDs are down and not coming UP
>>  
>> Hi,
>> 
>> No, never. It is my first attempt, first ceph cluster i try ever run.
>> 
>> I'm not sure if "mon initial members" should contain mon server ids or 
>> hostnames?
>> 
>> MS:
>> 
>> Dne 30.12.2015 v 00:04 Jan Schermer napsal(a):
>> Has the cluster ever worked? 
>>  
>> Are you sure that "mon initial members = 0" is correct? How do the OSDs know 
>> where to look for MONs?
>>  
>> Jan
>>  
>>  
>> On 29 Dec 2015, at 21:41, Ing. Martin Samek <samek...@fel.cvut.cz 
>> <mailto:samek...@fel.cvut.cz>> wrote:
>>  
>> Hi,
>> 
>> network is OK, all nodes are in one VLAN, in one switch, in one rack.
>> 
>> 
>> tracepath6 node2
>>  1?: [LOCALHOST]0.030ms pmtu 1500
>>  1:  node2 0.634ms reached
>>  1:  node2 0.296ms reached
>>  Resume: pmtu 1500 hops 1 back 64 
>> tracepath6 node3
>>  1?: [LOCALHOST]0.022ms pmtu 1500
>>  1:  node3 0.643ms reached
>>  1:  node3 1.065ms reached
>>  Resume: pmtu 1500 hops 1 back 64 
>> 
>> There is no firewall installed or configured.
>> 
>> Martin
>> 
>>  
>>  
>>  
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] My OSDs are down and not coming UP

2015-12-29 Thread Jan Schermer
Has the cluster ever worked?

Are you sure that "mon initial members = 0" is correct? How do the OSDs know 
where to look for MONs?

Jan


> On 29 Dec 2015, at 21:41, Ing. Martin Samek  > wrote:
> 
> Hi,
> 
> network is OK, all nodes are in one VLAN, in one switch, in one rack.
> 
> tracepath6 node2
>  1?: [LOCALHOST]0.030ms pmtu 1500
>  1:  node2 0.634ms reached
>  1:  node2 0.296ms reached
>  Resume: pmtu 1500 hops 1 back 64 
> tracepath6 node3
>  1?: [LOCALHOST]0.022ms pmtu 1500
>  1:  node3 0.643ms reached
>  1:  node3 1.065ms reached
>  Resume: pmtu 1500 hops 1 back 64 
> 
> There is no firewall installed or configured.
> 
> Martin
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ubuntu 14.04 or centos 7

2015-12-29 Thread Jan Schermer
If you need to get Ceph up as part of some enterprise project, where 
you can't touch it after it gets in production without PM approval and a Change 
Request then CentOS is what fits in that box. You are unlikely to break 
anything during upgrades and you get new hardware support and (not too many) 
new features in new point releases. Ubuntu LTS is similiar but less 
"customized" - some components get a version bump whereas on CentOS they 
backport patches mostly.
If, however, you plan to keep it up-to-date, try new hardware in the cluster, 
try new features and want to tune ceph/kernel/hardware often, then Ubuntu 
(non-LTS) is the way to go.
YMMV, but when investigating an issue on CentOS (or RHEL) you don't really know 
what you are running as the version numbers are misleading. Their kernel stack 
is the worst offender of all, full of backported features and fixes and 
internal hacks that simply were't there in the supposed "version". Try tuning 
their VM settings and you'll see their 2.6 kernel is neither 2.6 nor 3.x nor 
4.x, but a mix of all - I hope at least with RHEL their technical support would 
know what to do but... we all know how support works, right? :)
Generally, a newer OS will run more happily than an old one, but it is more 
unpredictable.

I'd much rather upgrade the whole stack (OS+Ceph) and concentrate on testing 
the upgrades, than to rely on vendor to test everything for me and stay on 
ancient versions, but that really depends on what your cluster will be used 
for. 

Jan

> On 29 Dec 2015, at 09:38, Gerard Braad  wrote:
> 
> Hi,
> 
> On Tue, Dec 29, 2015 at 3:58 PM, min fang  wrote:
>> centos 7 and ubuntu 14.04.  Which
>> one is better? thanks.
> 
> this is a question that often feels like flamebait, and too general.
> But probably the best answer is: it depends on personal preference,
> skill-level related to maintenance, use-case, etc. What might be of
> interest is the suggested kernel, but this depends on your use-case.
> 
> regards,
> 
> 
> Gerard
> 
> 
> --
> Gerard Braad — 吉拉德
>   F/OSS & IT Consultant in Beijing
>   http://gbraad.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] sync writes - expected performance?

2015-12-14 Thread Jan Schermer
Even with 10G ethernet, the bottleneck is not the network, nor the drives 
(assuming they are datacenter-class). The bottleneck is the software.
The only way to improve that is to either increase CPU speed (more GHz per 
core) or to simplify the datapath IO has to take before it is considered 
durable.
Stuff like RDMA will help only if there is zero-copy between the (RBD) client 
and the drive, or if the write is acknowledged when in the remote buffers of 
replicas (but it still has to come from client directly or RDMA becomes a bit 
pointless, IMHO).

Databases do sync writes for a reason, O_DIRECT doesn't actually make strong 
guarantees on ordering or buffering, though in practice the race condition is 
negligible.

Your 600 IOPS are pretty good actually.
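
If you want to see the two cases side by side inside the guest, something like this (device name is a placeholder - and it's destructive, use a scratch disk):

    # O_DIRECT only - can still be absorbed by caches along the way (e.g. rbd cache)
    fio --name=direct --filename=/dev/vdb --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=30 --direct=1
    # O_DIRECT + O_SYNC - every write has to be durable on all replicas before it returns
    fio --name=dsync --filename=/dev/vdb --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=30 --direct=1 --sync=1

The gap between the two is basically the round-trip latency we're talking about here.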

Jan


> On 14 Dec 2015, at 22:58, Warren Wang - ISD  wrote:
> 
> Whoops, I misread Nikola's original email, sorry!
> 
> If all your SSDs are performing at that level for sync IO, then I
> agree that it's down to other things, like network latency and PG locking.
> Sequential 4K writes with 1 thread and 1 qd is probably the worst
> performance you'll see. Is there a router between your VM and the Ceph
> cluster, or one between Ceph nodes for the cluster network?
> 
> Are you using dsync at the VM level to simulate what a database or other
> app would do? If you can switch to directIO, you'll likely get far better
> performance. 
> 
> Warren Wang
> 
> 
> 
> 
> On 12/14/15, 12:03 PM, "Mark Nelson"  wrote:
> 
>> 
>> 
>> On 12/14/2015 04:49 AM, Nikola Ciprich wrote:
>>> Hello,
>>> 
>>> i'm doing some measuring on test (3 nodes) cluster and see strange
>>> performance
>>> drop for sync writes..
>>> 
>>> I'm using SSD for both journalling and OSD. It should be suitable for
>>> journal, giving about 16.1KIOPS (67MB/s) for sync IO.
>>> 
>>> (measured using fio --filename=/dev/xxx --direct=1 --sync=1 --rw=write
>>> --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
>>> --group_reporting --name=journal-test)
>>> 
>>> On top of this cluster, I have running KVM guest (using qemu librbd
>>> backend).
>>> Overall performance seems to be quite good, but the problem is when I
>>> try
>>> to measure sync IO performance inside the guest.. I'm getting only
>>> about 600IOPS,
>>> which I think is quite poor.
>>> 
>>> The problem is, I don't see any bottlenect, OSD daemons don't seem to
>>> be hanging on
>>> IO, neither hogging CPU, qemu process is also not somehow too much
>>> loaded..
>>> 
>>> I'm using hammer 0.94.5 on top of centos 6 (4.1 kernel), all debugging
>>> disabled,
>>> 
>>> my question is, what results I can expect for synchronous writes? I
>>> understand
>>> there will always be some performance drop, but 600IOPS on top of
>>> storage which
>>> can give as much as 16K IOPS seems to little..
>> 
>> So basically what this comes down to is latency.  Since you get 16K IOPS
>> for O_DSYNC writes on the SSD, there's a good chance that it has a
>> super-capacitor on board and can basically acknowledge a write as
>> complete as soon as it hits the on-board cache rather than when it's
>> written to flash.  Figure that for 16K O_DSYNC IOPs means that each IO
>> is completing in around 0.06ms on average.  That's very fast!  At 600
>> IOPs for O_DSYNC writes on your guest, you're looking at about 1.6ms per
>> IO on average.
>> 
>> So how do we account for the difference?  Let's start out by looking at
>> a quick example of network latency (This is between two random machines
>> in one of our labs at Red Hat):
>> 
>>> 64 bytes from gqas008: icmp_seq=1 ttl=64 time=0.583 ms
>>> 64 bytes from gqas008: icmp_seq=2 ttl=64 time=0.219 ms
>>> 64 bytes from gqas008: icmp_seq=3 ttl=64 time=0.224 ms
>>> 64 bytes from gqas008: icmp_seq=4 ttl=64 time=0.200 ms
>>> 64 bytes from gqas008: icmp_seq=5 ttl=64 time=0.196 ms
>> 
>> now consider that when you do a write in ceph, you write to the primary
>> OSD which then writes out to the replica OSDs.  Every replica IO has to
>> complete before the primary will send the acknowledgment to the client
>> (ie you have to add the latency of the worst of the replica writes!).
>> In your case, the network latency alone is likely dramatically
>> increasing IO latency vs raw SSD O_DSYNC writes.  Now add in the time to
>> process crush mappings, look up directory and inode metadata on the
>> filesystem where objects are stored (assuming it's not cached), and
>> other processing time, and the 1.6ms latency for the guest writes starts
>> to make sense.
>> 
>> Can we improve things?  Likely yes.  There's various areas in the code
>> where we can trim latency away, implement alternate OSD backends, and
>> potentially use alternate network technology like RDMA to reduce network
>> latency.  The thing to remember is that when you are talking about
>> O_DSYNC writes, even very small increases in latency can have dramatic
>> effects on performance.  Every fraction of a millisecond has huge
>> 

Re: [ceph-users] write speed , leave a little to be desired?

2015-12-11 Thread Jan Schermer
The drive will actually be writing ~500MB/s in this case, if the journal is on 
the same drive.
All writes go to the journal first and then to the filestore, so the SSD sees 
roughly double the client throughput - 200-300MB/s at the client is actually 
a sane figure.

Jan


> On 11 Dec 2015, at 13:55, Zoltan Arnold Nagy  
> wrote:
> 
> It’s very unfortunate that you guys are using the EVO drives. As we’ve 
> discussed numerous times on the ML, they are not very suitable for this task.
> I think that 200-300MB/s is actually not bad (without knowing anything about 
> the hardware setup, as you didn’t give details…) coming from those drives, 
> but expect to replace them soon.
> 
>> On 11 Dec 2015, at 13:44, Florian Rommel  
>> wrote:
>> 
>> Hi, we are just testing our new ceph cluster and to optimise our spinning 
>> disks we created an erasure coded pool and a SSD cache pool.
>> 
>> We modified the crush map to make an ssd pool, as each server contains 1 ssd 
>> drive and 5 spinning drives.
>> 
>> Stress testing the cluster in terms of read performance is very nice pushing 
>> a little bit over 1.2GB/s
>> however the write speed is pushing 200-300MB/s.
>> 
>> All the SSDs are SAMSUNG 500GB EVO 850 PROs and can push 500MB/s write speed, 
>> as tested with hdparm and dd.
>> 
>> What can we tweak that the write speed increases as well over the network?
>> 
>> We run everything over 10Ge
>> 
>> The cache mode is set to write-back
>> 
>> Any help would be greatly appreciated.
>> 
>> Thank you and best regards
>> //Florian
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Client io blocked when removing snapshot

2015-12-10 Thread Jan Schermer
Removing a snapshot means looking for every *potential* object the snapshot can 
have, and this takes a very long time (6TB snapshot will consist of 1.5M 
objects (in one replica) assuming the default 4MB object size). The same 
applies to large thin volumes (don't try creating and then dropping a 1 EiB 
volume, even if you only have 1GB of physical space :)).
Doing this is simply expensive and might saturate your OSDs. If you don't have 
enough RAM to cache the structure then all the "is there a file under 
/var/lib/ceph/..." lookups will go to disk and that can hurt a lot.
I don't think there's any priority to this (is there?), so it competes with 
everything else.

I'm not sure how snapshots are exactly coded in Ceph, but in a COW filesystem 
you simply don't dereference blocks of the parent of the  snapshot when doing 
writes to it and that's cheap, but Ceph stores "blocks" in files with 
computable names and has no pointers to them that could be modified,  so by 
creating a snapshot you hurt the performance a lot (you need to create a copy 
of the 4MB object into the snapshot(s) when you dirty a byte in there). Though 
I remember reading that the logic is actually reversed and it is the snapshot 
that gets the original blocks(??)...
Anyway if you are removing a snapshot at the same time as writing to the parent 
there could potentially be a problem in what gets done first. Is Ceph smart 
enough to not care about snapshots that are getting deleted? I have no idea but 
I think it must be, because we use snapshots a lot and haven't had any 
issues with it.
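
One knob that might help (hedging here - check whether your release actually has it before relying on it): there is an OSD option that inserts a small pause between individual snap trim operations so client IO gets a look-in, e.g.

    [osd]
    osd snap trim sleep = 0.05

in ceph.conf. It obviously makes the snapshot removal itself take longer.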

Jan

> On 10 Dec 2015, at 07:52, Wukongming  wrote:
> 
> Hi, All
> 
> I used a rbd command to create a 6TB-size image, And then created a snapshot 
> of this image. After that, I kept writing something like modifying files so 
> the snapshots would be cloned one by one.
> At this time, I did the fellow 2 ops simultaneously.
> 
> 1. keep client io to this image.
> 2. excute a rbd snap rm command to delete snapshot.
> 
> Finally ,I found client io blocked for quite a long time. I used SATA disk to 
> test, and felt that ceph makes it a priority to remove snapshot.
> Also we use iostat tool to help watch the disk state, and it runs in full 
> workload.
> 
> So, should we have a priority to deal with client io instead of removing 
> snapshot?
> -
> wukongming ID: 12019
> Tel:0571-86760239
> Dept:2014 UIS2 ONEStor
> 
> -
> This e-mail and its attachments contain confidential information from H3C, 
> which is
> intended only for the person or entity whose address is listed above. Any use 
> of the
> information contained herein in any way (including, but not limited to, total 
> or partial
> disclosure, reproduction, or dissemination) by persons other than the intended
> recipient(s) is prohibited. If you receive this e-mail in error, please 
> notify the sender
> by phone or email immediately and delete it!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Client io blocked when removing snapshot

2015-12-10 Thread Jan Schermer

> On 10 Dec 2015, at 15:14, Sage Weil <s...@newdream.net> wrote:
> 
> On Thu, 10 Dec 2015, Jan Schermer wrote:
>> Removing snapshot means looking for every *potential* object the snapshot 
>> can have, and this takes a very long time (6TB snapshot will consist of 1.5M 
>> objects (in one replica) assuming the default 4MB object size). The same 
>> applies to large thin volumes (don't try creating and then dropping a 1 EiB 
>> volume, even if you only have 1GB of physical space :)).
>> Doing this is simply expensive and might saturate your OSDs. If you don't 
>> have enough RAM to cache the structure then all the "is there a file 
>> /var/lib/ceph/" will go to disk and that can hurt a lot.
>> I don't think there's any priority to this (is there?), so it competes with 
>> everything else.
>> 
>> I'm not sure how snapshots are exactly coded in Ceph, but in a COW 
>> filesystem you simply don't dereference blocks of the parent of the  
>> snapshot when doing writes to it and that's cheap, but Ceph stores "blocks" 
>> in files with computable names and has no pointers to them that could be 
>> modified,  so by creating a snapshot you hurt the performance a lot (you 
>> need to create a copy of the 4MB object into the snapshot(s) when you dirty 
>> a byte in there). Though I remember reading that the logic is actually 
>> reversed and it is the snapshot that gets the original blocks(??)...
>> Anyway if you are removing a snapshot at the same time as writing to the 
>> parent there could potentially be a problem in what gets done first. Is 
>> Ceph smart enough to not care about snapshots that are getting deleted? I 
>> have no idea but I think it must be, because we use snapshots a lot and 
>> haven't had any issues with it.
> 
> It's not quite so bad... the OSD maintains a map (in leveldb) of the 
> objects that are referenced by a snapshot, so the amount of work is 
> proportional to the number of objects that were cloned for that snapshot.
> 


Nice. I saw a blueprint somewhere earlier this year, so that's a pretty new 
thing (Hammer or Infernalis?)
And is it a map (with pointers to objects) or just a bitmap of the overlay?

Jan

> There is certainly room for improvement in terms of the impact on client 
> IO, though.  :)
> 
> sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph] Feature Ceph Geo-replication

2015-12-10 Thread Jan Schermer
If you don't need synchronous replication then asynchronous is the way to go, 
but Ceph doesn't offer that natively. (not for RBD anyway, not sure how radosgw 
could be set up).

200km will add at least 1ms of latency network-wise, 2ms RTT, for TCP it will 
be more.
For sync replication (which ceph offers natively) this is high, and you also 
need 0% packet loss. Unless you have your own fibre, I recommend against it.

I suggest you replicate data at the application level (databases) or use incremental 
replication of filesystems (ZFS). You can also use Ceph snapshots and replicate 
those.
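
The snapshot route is roughly this, assuming your version has rbd export-diff/import-diff (image and snapshot names are made up, no error handling):

    rbd snap create rbd/myimage@rep-2015-12-10
    rbd export-diff --from-snap rep-2015-12-09 rbd/myimage@rep-2015-12-10 - \
        | ssh backup-site rbd import-diff - rbd/myimage

Run it from cron and the RPO is simply the snapshot interval - and the link latency stops mattering.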

Before you decide on any design you should know your RPO/RTO/RCO requirements 
to decide what is sufficient.

Jan


> On 10 Dec 2015, at 17:17, Andrea Annoè  wrote:
> 
> Hi to all
> Someone has news about Geo-replication?
>  
> I have find this really nice article by Sebastien 
> http://www.sebastien-han.fr/blog/2013/01/28/ceph-geo-replication-sort-of/ 
>  
> but it’s 3 years ago…
> My question is about configuration (and limitations: TTL, distance, flapping 
> network considerations, etc…) for creating this geo-replication.
>  
> If my two sites are about 200km apart… what do you suggest?
> Synchronous or asynchronous replication? 
>  
> Thanks in advance to all for your opinion.
>  
> Best Regards Andrea.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Blocked requests after "osd in"

2015-12-09 Thread Jan Schermer
Are you seeing "peering" PGs when the blocked requests are happening? That's 
what we see regularly when starting OSDs.

I'm not sure this can be solved completely (and whether there are major 
improvements in newer Ceph versions), but it can be sped up by
1) making sure you have free (and not dirtied or fragmented) memory on the node 
where you are starting the OSD
- that means dropping caches before starting the OSD if you have lots 
of "free" RAM that is used for VFS cache
2) starting the OSDs one by one instead of booting several of them
3) if you pin the OSDs to CPUs/cores, do that after the OSD is in - I found it 
to be best to pin the OSD to a cgroup limited to one NUMA node and then limit 
it to a subset of cores after it has run a bit. OSD tends to use hundreds of % 
of CPU when booting
4) you could possibly prewarm cache for the OSD in /var/lib/ceph/osd...
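
For 1), dropping the VFS cache is just:

    sync && echo 3 > /proc/sys/vm/drop_caches

done right before starting the OSD (it fills up again quickly, so the timing matters).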

It's unclear to me whether MONs influence this somehow (the peering stage) but 
I have observed that their CPU usage and IO also spike when OSDs are started, so 
make sure they are not under load.

Jan


> On 09 Dec 2015, at 11:03, Christian Kauhaus  wrote:
> 
> Hi,
> 
> I'm getting blocked requests (>30s) every time when an OSD is set to "in" in
> our clusters. Once this has happened, backfills run smoothly.
> 
> I have currently no idea where to start debugging. Has anyone a hint what to
> examine first in order to narrow this issue?
> 
> TIA
> 
> Christian
> 
> -- 
> Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0
> Flying Circus Internet Operations GmbH · http://flyingcircus.io
> Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
> HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: number of PGs for metadata pool

2015-12-09 Thread Jan Schermer
The number of PGs doesn't affect the number of replicas, so don't worry about it.
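
If you still want a number, the usual back-of-the-envelope rule (just a rule of thumb, not gospel) is about 100 PGs per OSD divided by the replica count, summed over all pools. For 16 OSDs and size 3 that's 16*100/3 ≈ 533, so something like 512 PGs in total - the bulk for the data pool and a small power of two (64 or 128) for metadata.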

Jan

> On 09 Dec 2015, at 13:03, Mykola Dvornik  wrote:
> 
> Hi guys,
> 
> I am creating a 4-node/16OSD/32TB CephFS from scratch. 
> 
> According to the ceph documentation the metadata pool should have small 
> amount of PGs since it contains some negligible amount of data compared to 
> data pool. This makes me feel it might not be safe.
> 
> So I was wondering how to chose the number of PGs per metadata pool to 
> maintain its performance and reliability?
> 
> Regards,
> 
> Mykola
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Jan Schermer

> On 08 Dec 2015, at 08:57, Benedikt Fraunhofer <fraunho...@traced.net> wrote:
> 
> Hi Jan,
> 
>> Doesn't look near the limit currently (but I suppose you rebooted it in the 
>> meantime?).
> 
> the box this numbers came from has an uptime of 13 days
> so it's one of the boxes that did survive yesterdays half-cluster-wide-reboot.
> 

So this box had no issues? Keep an eye on the number of threads, but maybe 
others will have a better idea, this is just where I'd start. I have seen close 
to a million threads from OSDs on my boxes, not sure what the numbers are now.

>> Did iostat say anything about the drives? (btw dm-1 and dm-6 are what? Is 
>> that your data drives?) - were they overloaded really?
> 
> no they didn't have any load and or iops.
> Basically the whole box had nothing to do.
> 
> If I understand the load correctly, this just reports threads
> that are ready and willing to work but - in this case -
> don't get any data to work with.

Different unixes calculate this differently :-) By itself "load" is meaningless.
It should be something like an average number of processes that want to run at 
any given time but can't (because they are waiting for whatever they need - 
disks, CPU, blocking sockets...).

Jan


> 
> Thx
> 
> Benedikt
> 
> 
> 2015-12-08 8:44 GMT+01:00 Jan Schermer <j...@schermer.cz>:
>> 
>> Jan
>> 
>> 
>>> On 08 Dec 2015, at 08:41, Benedikt Fraunhofer <fraunho...@traced.net> wrote:
>>> 
>>> Hi Jan,
>>> 
>>> we had 65k for pid_max, which made
>>> kernel.threads-max = 1030520.
>>> or
>>> kernel.threads-max = 256832
>>> (looks like it depends on the number of cpus?)
>>> 
>>> currently we've
>>> 
>>> root@ceph1-store209:~# sysctl -a | grep -e thread -e pid
>>> kernel.cad_pid = 1
>>> kernel.core_uses_pid = 0
>>> kernel.ns_last_pid = 60298
>>> kernel.pid_max = 65535
>>> kernel.threads-max = 256832
>>> vm.nr_pdflush_threads = 0
>>> root@ceph1-store209:~# ps axH |wc -l
>>> 17548
>>> 
>>> we'll see how it behaves once puppet has come by and adjusted it.
>>> 
>>> Thx!
>>> 
>>> Benedikt
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph snapshost

2015-12-08 Thread Jan Schermer
You don't really *have* to stop I/O.
In fact, I recommend you don't unless you have to.

The reason why this is recommended is to minimize the risk of data loss, because 
the snapshot will be in a very similar state as if you suddenly lost power to 
the server. Obviously this only matters if you need the very latest state of the 
data in the snapshot (and we're talking about a rollback of several seconds, 
typically). For example, if you append some data to a file and take a snapshot 
right away, the data will likely not be there (yet).

The reason why I recommend you don't do that is because this exposes problems 
with data consistency in the guest (applications/developers doing something 
stupid...) which is a good thing! If you suddenly lose power to your production 
database, you don't want to have to restore from backup. In an ACID compliant 
database all the data should simply be there no matter how harsh the shutdown 
was.

So unless you deliberately run your guests with disabled barriers/flushes*, or 
you need the absolute latest data, don't bother quiescing IO.
* in which case there's no guarantee even with fsfreeze
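
And if you do decide to quiesce, it's nothing more than this (mountpoint and image names are made up):

    fsfreeze -f /mnt/data                 # inside the guest
    rbd snap create rbd/vm-disk@snap1     # from wherever you manage rbd
    fsfreeze -u /mnt/data                 # don't forget this one

Forget the unfreeze and the guest will look very dead.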

Jan

> On 09 Dec 2015, at 03:59, Yan, Zheng  wrote:
> 
> On Wed, Dec 9, 2015 at 12:10 AM, Dan Nica  
> wrote:
>> Hi guys,
>> 
>> 
>> 
>> So from documentation I must stop the I/O before taking rbd snapshots, how
>> do I do that or what does that mean ? do I have to unmount
>> 
> 
> see fsfreeze(8) command
> 
> 
>> the rbd image ?
>> 
>> 
>> 
>> --
>> 
>> Dan
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] after loss of journal, osd fails to start with failed assert OSDMapRef OSDService::get_map(epoch_t) ret != null

2015-12-07 Thread Jan Schermer
The rule of thumb is that the data on the OSD is gone if the related journal is 
gone.
A journal doesn't just "vanish", though, so you should investigate further...

This log is from the new empty journal, right?
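
Before trashing it I'd at least check where the journal symlink points and whether that device/partition still exists:

    ls -l /var/lib/ceph/osd/ceph-328/journal
    # ...and check that whatever it points to is still present (e.g. in /dev/disk/by-partuuid/)

If the symlink points at something that no longer exists, or at a different partition than before the crash, that would explain the "vanished" journal.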

Jan

> On 08 Dec 2015, at 08:08, Benedikt Fraunhofer  wrote:
> 
> Hello List,
> 
> after some crash of a box, the journal vanished. Creating a new one
> with --mkjournal results in the osd beeing unable to start.
> Does anyone want to dissect this any further or should I just trash
> the osd and recreate it?
> 
> Thx in advance
>  Benedikt
> 
> 2015-12-01 07:46:31.505255 7fadb7f1e900  0 ceph version 0.94.5
> (9764da52395923e0b32908d83a9f7304401fee43), process ceph-osd, pid 5486
> 2015-12-01 07:46:31.628585 7fadb7f1e900  0
> filestore(/var/lib/ceph/osd/ceph-328) backend xfs (magic 0x58465342)
> 2015-12-01 07:46:31.662972 7fadb7f1e900  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-328) detect_features:
> FIEMAP ioctl is supported and appears to work
> 2015-12-01 07:46:31.662984 7fadb7f1e900  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-328) detect_features:
> FIEMAP ioctl is disabled via 'filestore fiemap' config option
> 2015-12-01 07:46:31.674999 7fadb7f1e900  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-328) detect_features:
> syncfs(2) syscall fully supported (by glibc and kernel)
> 2015-12-01 07:46:31.675071 7fadb7f1e900  0
> xfsfilestorebackend(/var/lib/ceph/osd/ceph-328) detect_feature:
> extsize is supported and kernel 3.19.0-33-generic >= 3.5
> 2015-12-01 07:46:31.806490 7fadb7f1e900  0
> filestore(/var/lib/ceph/osd/ceph-328) mount: enabling WRITEAHEAD
> journal mode: checkpoint is not enabled
> 2015-12-01 07:46:35.598698 7fadb7f1e900  1 journal _open
> /var/lib/ceph/osd/ceph-328/journal fd 19: 9663676416 bytes, block size
> 4096 bytes, directio = 1, aio = 1
> 2015-12-01 07:46:35.600956 7fadb7f1e900  1 journal _open
> /var/lib/ceph/osd/ceph-328/journal fd 19: 9663676416 bytes, block size
> 4096 bytes, directio = 1, aio = 1
> 2015-12-01 07:46:35.619860 7fadb7f1e900  0 
> cls/hello/cls_hello.cc:271: loading cls_hello
> 2015-12-01 07:46:35.682532 7fadb7f1e900 -1 osd/OSD.h: In function
> 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fadb7f1e900 time
> 2015-12-01 07:46:35.681204
> osd/OSD.h: 716: FAILED assert(ret)
> 
> ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x8b) [0xbc60eb]
> 2: (OSDService::get_map(unsigned int)+0x3f) [0x70ad5f]
> 3: (OSD::init()+0x6ad) [0x6c5e0d]
> 4: (main()+0x2860) [0x6527e0]
> 5: (__libc_start_main()+0xf5) [0x7fadb505bec5]
> 6: /usr/bin/ceph-osd() [0x66b887]
> NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> 
> --- begin dump of recent events ---
>   -62> 2015-12-01 07:46:31.503728 7fadb7f1e900  5 asok(0x5402000)
> register_command perfcounters_dump hook 0x53a2050
>   -61> 2015-12-01 07:46:31.503759 7fadb7f1e900  5 asok(0x5402000)
> register_command 1 hook 0x53a2050
>   -60> 2015-12-01 07:46:31.503764 7fadb7f1e900  5 asok(0x5402000)
> register_command perf dump hook 0x53a2050
>   -59> 2015-12-01 07:46:31.503768 7fadb7f1e900  5 asok(0x5402000)
> register_command perfcounters_schema hook 0x53a2050
>   -58> 2015-12-01 07:46:31.503772 7fadb7f1e900  5 asok(0x5402000)
> register_command 2 hook 0x53a2050
>   -57> 2015-12-01 07:46:31.503775 7fadb7f1e900  5 asok(0x5402000)
> register_command perf schema hook 0x53a2050
>   -56> 2015-12-01 07:46:31.503786 7fadb7f1e900  5 asok(0x5402000)
> register_command perf reset hook 0x53a2050
>   -55> 2015-12-01 07:46:31.503790 7fadb7f1e900  5 asok(0x5402000)
> register_command config show hook 0x53a2050
>   -54> 2015-12-01 07:46:31.503792 7fadb7f1e900  5 asok(0x5402000)
> register_command config set hook 0x53a2050
>   -53> 2015-12-01 07:46:31.503797 7fadb7f1e900  5 asok(0x5402000)
> register_command config get hook 0x53a2050
>   -52> 2015-12-01 07:46:31.503799 7fadb7f1e900  5 asok(0x5402000)
> register_command config diff hook 0x53a2050
>   -51> 2015-12-01 07:46:31.503802 7fadb7f1e900  5 asok(0x5402000)
> register_command log flush hook 0x53a2050
>   -50> 2015-12-01 07:46:31.503804 7fadb7f1e900  5 asok(0x5402000)
> register_command log dump hook 0x53a2050
>   -49> 2015-12-01 07:46:31.503807 7fadb7f1e900  5 asok(0x5402000)
> register_command log reopen hook 0x53a2050
>   -48> 2015-12-01 07:46:31.505255 7fadb7f1e900  0 ceph version 0.94.5
> (9764da52395923e0b32908d83a9f7304401fee43), process ceph-osd, pid 5486
>   -47> 2015-12-01 07:46:31.619430 7fadb7f1e900  1 -- 10.9.246.104:0/0
> learned my addr 10.9.246.104:0/0
>   -46> 2015-12-01 07:46:31.619439 7fadb7f1e900  1
> accepter.accepter.bind my_inst.addr is 10.9.246.104:6821/5486
> need_addr=0
>   -45> 2015-12-01 07:46:31.619457 7fadb7f1e900  1
> accepter.accepter.bind my_inst.addr is 0.0.0.0:6824/5486 need_addr=1
>   -44> 2015-12-01 07:46:31.619473 7fadb7f1e900  1
> accepter.accepter.bind my_inst.addr is 0.0.0.0:6825/5486 

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Jan Schermer
And how many pids do you have currently?
This should do it, I think:
# ps axH |wc -l

Jan

> On 08 Dec 2015, at 08:26, Benedikt Fraunhofer <fraunho...@traced.net> wrote:
> 
> Hi Jan,
> 
> we initially had to bump it once we had more than 12 osds
> per box. But it'll change that to the values you provided.
> 
> Thx!
> 
> Benedikt
> 
> 2015-12-08 8:15 GMT+01:00 Jan Schermer <j...@schermer.cz>:
>> What is the setting of sysctl kernel.pid_max?
>> You really need to have this:
>> kernel.pid_max = 4194304
>> (I think it also sets this as well: kernel.threads-max = 4194304)
>> 
>> I think you are running out of process IDs.
>> 
>> Jan
>> 
>>> On 08 Dec 2015, at 08:10, Benedikt Fraunhofer <fraunho...@traced.net> wrote:
>>> 
>>> Hello Cephers,
>>> 
>>> lately, our ceph-cluster started to show some weird behavior:
>>> 
>>> the osd boxes show a load of 5000-15000 before the osds get marked down.
>>> Usually the box is fully usable, even "apt-get dist-upgrade" runs smoothly,
>>> you can read and write to any disk, only things you can't do are strace the 
>>> osd
>>> processes, sync or reboot.
>>> 
>>> we only find some logs about the "xfsaild = XFS Access Item List Daemon"
>>> as hung_task warnings.
>>> 
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016108]
>>> [] ? kthread_create_on_node+0x1c0/0x1c0
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016112] INFO: task
>>> xfsaild/dm-1:1445 blocked for more than 120 seconds.
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016329]   Tainted:
>>> G C 3.19.0-39-generic #44~14.04.1-Ubuntu
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016558] "echo 0 >
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016802] xfsaild/dm-1
>>> D 8807faa03af8 0  1445  2 0x
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016805]
>>> 8807faa03af8 8808098989d0 00013e80 8807faa03fd8
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016808]
>>> 00013e80 88080bb775c0 8808098989d0 88011381b2a8
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016812]
>>> 8807faa03c50 7fff 8807faa03c48 8808098989d0
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016815] Call Trace:
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016819]
>>> [] schedule+0x29/0x70
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016823]
>>> [] schedule_timeout+0x20c/0x280
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016826]
>>> [] ? sched_clock_cpu+0x85/0xc0
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016830]
>>> [] ? try_to_wake_up+0x1f1/0x340
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016834]
>>> [] wait_for_completion+0xa4/0x170
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016836]
>>> [] ? wake_up_state+0x20/0x20
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016840]
>>> [] flush_work+0xed/0x1c0
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016846]
>>> [] ? destroy_worker+0x90/0x90
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016870]
>>> [] xlog_cil_force_lsn+0x7e/0x1f0 [xfs]
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016873]
>>> [] ? lock_timer_base.isra.36+0x2b/0x50
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016878]
>>> [] ? try_to_del_timer_sync+0x4f/0x70
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016901]
>>> [] _xfs_log_force+0x60/0x270 [xfs]
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016904]
>>> [] ? internal_add_timer+0x80/0x80
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016926]
>>> [] xfs_log_force+0x2a/0x90 [xfs]
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016948]
>>> [] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016970]
>>> [] xfsaild+0x140/0x5a0 [xfs]
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016992]
>>> [] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016996]
>>> [] kthread+0xd2/0xf0
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.017000]
>>> [] ? kthread_create_on_node+0x1c0/0x1c0
>>> Dec  7 15:36:32 ceph1-store204 kernel: [152066.017005]
>>> [] ret_from_fo

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Jan Schermer
What is the setting of sysctl kernel.pid_max?
You really need to have this:
kernel.pid_max = 4194304
(I think it also sets this as well: kernel.threads-max = 4194304)

I think you are running out of process IDs.
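For reference, a quick way to see how close you are, and to make the higher
limits persistent (the file name and values are only an example, adjust for
your distro):

ps axH | wc -l
sysctl kernel.pid_max kernel.threads-max

cat > /etc/sysctl.d/90-ceph-pid-max.conf <<'EOF'
kernel.pid_max = 4194304
kernel.threads-max = 4194304
EOF
sysctl --system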

Jan

> On 08 Dec 2015, at 08:10, Benedikt Fraunhofer  wrote:
> 
> Hello Cephers,
> 
> lately, our ceph-cluster started to show some weird behavior:
> 
> the osd boxes show a load of 5000-15000 before the osds get marked down.
> Usually the box is fully usable, even "apt-get dist-upgrade" runs smoothly,
> you can read and write to any disk, only things you can't do are strace the 
> osd
> processes, sync or reboot.
> 
> we only find some logs about the "xfsaild = XFS Access Item List Daemon"
> as hung_task warnings.
> 
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016108]
> [] ? kthread_create_on_node+0x1c0/0x1c0
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016112] INFO: task
> xfsaild/dm-1:1445 blocked for more than 120 seconds.
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016329]   Tainted:
> G C 3.19.0-39-generic #44~14.04.1-Ubuntu
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016558] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016802] xfsaild/dm-1
> D 8807faa03af8 0  1445  2 0x
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016805]
> 8807faa03af8 8808098989d0 00013e80 8807faa03fd8
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016808]
> 00013e80 88080bb775c0 8808098989d0 88011381b2a8
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016812]
> 8807faa03c50 7fff 8807faa03c48 8808098989d0
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016815] Call Trace:
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016819]
> [] schedule+0x29/0x70
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016823]
> [] schedule_timeout+0x20c/0x280
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016826]
> [] ? sched_clock_cpu+0x85/0xc0
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016830]
> [] ? try_to_wake_up+0x1f1/0x340
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016834]
> [] wait_for_completion+0xa4/0x170
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016836]
> [] ? wake_up_state+0x20/0x20
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016840]
> [] flush_work+0xed/0x1c0
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016846]
> [] ? destroy_worker+0x90/0x90
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016870]
> [] xlog_cil_force_lsn+0x7e/0x1f0 [xfs]
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016873]
> [] ? lock_timer_base.isra.36+0x2b/0x50
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016878]
> [] ? try_to_del_timer_sync+0x4f/0x70
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016901]
> [] _xfs_log_force+0x60/0x270 [xfs]
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016904]
> [] ? internal_add_timer+0x80/0x80
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016926]
> [] xfs_log_force+0x2a/0x90 [xfs]
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016948]
> [] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016970]
> [] xfsaild+0x140/0x5a0 [xfs]
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016992]
> [] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.016996]
> [] kthread+0xd2/0xf0
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.017000]
> [] ? kthread_create_on_node+0x1c0/0x1c0
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.017005]
> [] ret_from_fork+0x58/0x90
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.017009]
> [] ? kthread_create_on_node+0x1c0/0x1c0
> Dec  7 15:36:32 ceph1-store204 kernel: [152066.017013] INFO: task
> xfsaild/dm-6:1616 blocked for more than 120 seconds.
> 
> kswapd is also reported as hung, but we don't have swap on the osds.
> 
> It looks like either all ceph-osd-threads are reporting in as willing to work,
> or it's the xfs-maintenance-process itself like described in [1,2]
> 
> Usually if we aint fast enough setting no{out,scrub,deep-scrub} this
> has an avalanche
> effect where we usually end up ipmi-power-cycling half of the cluster
> because all the osd-nodes
> are busy doing nothing (according to iostat or top, exept the load).
> 
> Is this a known bug for kernel 3.19.0-39 (ubuntu 14.04 with the vivid kernel)?
> Do the xfs-tweaks described here
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg25295.html
> (i know this is for a pull request modifying the write-paths)
> look decent or worth a try?
> 
> Currently we're running with "back to defaults" and less load
> (desperate try with the filestore settings, didnt change anything)
> ceph.conf-osd section:
> 
> [osd]
>  filestore max sync interval = 15
>  filestore min sync interval = 1
>  osd max backfills = 1
>  osd recovery op priority = 1
> 
> 
> as a baffled try to get it to 

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Jan Schermer
Doesn't look near the limit currently (but I suppose you rebooted it in the 
meantime?).
Did iostat say anything about the drives? (btw, what are dm-1 and dm-6? Are
those your data drives?) - were they really overloaded?

Jan


> On 08 Dec 2015, at 08:41, Benedikt Fraunhofer  wrote:
> 
> Hi Jan,
> 
> we had 65k for pid_max, which made
> kernel.threads-max = 1030520.
> or
> kernel.threads-max = 256832
> (looks like it depends on the number of cpus?)
> 
> currently we've
> 
> root@ceph1-store209:~# sysctl -a | grep -e thread -e pid
> kernel.cad_pid = 1
> kernel.core_uses_pid = 0
> kernel.ns_last_pid = 60298
> kernel.pid_max = 65535
> kernel.threads-max = 256832
> vm.nr_pdflush_threads = 0
> root@ceph1-store209:~# ps axH |wc -l
> 17548
> 
> we'll see how it behaves once puppet has come by and adjusted it.
> 
> Thx!
> 
> Benedikt

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cluster performance analysis

2015-12-04 Thread Jan Schermer

> On 04 Dec 2015, at 14:31, Adrien Gillard  wrote:
> 
> After some more tests :
> 
>  - The pool being used as cache pool has no impact on performance, I get the 
> same results with a "dedicated" replicated pool.
>  - You are right Jan, on raw devices I get better performance on a volume if 
> I fill it first, or at least if I write a zone that already has been allocated
>  - The same seem to apply when the test is run on the mounted filesystem.
> 

Yeah. The first (raw device) is because the objects on the OSDs get "thick" in
the process.
The second (filesystem) is because of both the OSD objects getting thick and
the guest filesystem getting thick.
Preallocating the space can speed things up considerably (like 100x).
Unfortunately I haven't found a way to convince fallocate() to thick-provision
files.
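A crude substitute from inside the guest is to actually write the blocks
instead (path and size are only an example, and it obviously costs a full
write):

dd if=/dev/zero of=/mnt/data/prealloc.bin bs=1M count=10240 conv=fsync

fallocate -l only reserves extents in the guest filesystem without writing
them, so the underlying RBD objects stay thin and you pay the allocation
penalty later anyway.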

Jan

> 
> 
> 
> 
> On Thu, Dec 3, 2015 at 2:49 PM, Adrien Gillard  > wrote:
> I did some more tests : 
> 
> fio on a raw RBD volume (4K, numjob=32, QD=1) gives me around 3000 IOPS
> 
> I also tuned xfs mount options on client (I realized I didn't do that 
> already) and with 
> "largeio,inode64,swalloc,logbufs=8,logbsize=256k,attr2,auto,nodev,noatime,nodiratime"
>  I get better performance :
> 
> 4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0: pid=26793: Thu Dec  3 
> 10:45:55 2015
>   write: io=1685.3MB, bw=5720.1KB/s, iops=1430, runt=301652msec
> slat (usec): min=5, max=1620, avg=41.61, stdev=25.82
> clat (msec): min=1, max=4141, avg=14.61, stdev=112.55
>  lat (msec): min=1, max=4141, avg=14.65, stdev=112.55
> clat percentiles (msec):
>  |  1.00th=[3],  5.00th=[4], 10.00th=[4], 20.00th=[4],
>  | 30.00th=[4], 40.00th=[5], 50.00th=[5], 60.00th=[5],
>  | 70.00th=[5], 80.00th=[6], 90.00th=[7], 95.00th=[7],
>  | 99.00th=[  227], 99.50th=[  717], 99.90th=[ 1844], 99.95th=[ 2245],
>  | 99.99th=[ 3097]
> 
> So, more than 50% improvement but it actually varies quite a lot between 
> tests (sometimes I get a bit more than 1000). If I run the test fo 30 minutes 
> it drops to 900 IOPS.
> 
> As you suggested I also filled a volume with zeros (dd if=/dev/zero 
> of=/dev/rbd1 bs=1M) and then ran fio on the raw device, I didn't see a lot of 
> improvement.
> 
> If I run fio test directly on block devices I seem to saturate the spinners, 
> [1] is a graph of IO load on one of the OSD host. 
> [2] is the same OSD graph but when the test is done on a device mounted and 
> formatted with XFS on the client. 
> If I get half of the IOPS on the XFS volume because of the journal, shouldn't 
> I get the same amount of IOPS on the backend ? 
> [3] shows what happen if I run the test for 30 minutes.
> 
> During the fio tests on the raw device, load average on the OSD servers 
> increases up to 13/14 and I get a bit of iowait (I guess because the OSD are 
> busy)
> During the fio tests on the raw device, load average on the OSD servers peaks 
> at the beginning and decreases to 5/6, but goes trough the roof on the client.
> Scheduler is deadline for all the drives, I didn't try to change it yet.
> 
> What I don't understand, even with your explanations, are the rados results. 
> From what I understand it performs at the RADOS level and thus should not be 
> impacted by client filesystem.
> Given the results above I guess you are right and this has to do with the 
> client filesystem.
> 
> The cluster will be used for backups, write IO size during backups is around 
> 150/200K (I guess mostly sequential) and I am looking for the highest 
> bandwith and parallelization.
> 
> @Nick, I will try to create a new stand alone replicated pool.
> 
> 
> [1] http://postimg.org/image/qvtvdq1n1/ 
> [2] http://postimg.org/image/nhf6lzwgl/ 
> [3] http://postimg.org/image/h7l0obw7h/ 
> 
> On Thu, Dec 3, 2015 at 1:30 PM, Nick Fisk  > wrote:
> Couple of things to check
> 
> 1.  Can you create just a normal non cached pool and test performance to 
> rule out any funnies going on there.
> 
> 2.  Can you also run something like iostat during the benchmarks and see 
> if it looks like all your disks are getting saturated.
> 
> 
> 
> _
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
> ] On Behalf Of Adrien Gillard
> Sent: 02 December 2015 21:33
> To: ceph-us...@ceph.com 
> Subject: [ceph-users] New cluster performance analysis
> 
> 
> Hi everyone,
> 
>  
> I am currently testing our new cluster and I would like some feedback on the 
> numbers I am getting.
> 
>  
> For the hardware :
> 
> 7 x OSD : 2 x Intel 2640v3 (8x2.6GHz), 64B RAM, 2x10Gbits LACP for public 
> net., 2x10Gbits LACP 

Re: [ceph-users] How long will the logs be kept?

2015-12-03 Thread Jan Schermer
You can set up logrotate however you want - not sure what the default is for 
your distro.
Usually logrotate doesn't touch files that are smaller than some size even if 
they are old. It will also not delete logs for OSDs that no longer exist. 

Ceph itself has nothing to do with log rotation, logrotate does the work. Ceph 
packages likely contain default logrotate rules for the logs but you can edit 
them to your liking.
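If you want to see why a particular file wasn't rotated, a dry run usually
explains it (this assumes the rules live in /etc/logrotate.d/ceph, which is
where the packages normally drop them):

cat /etc/logrotate.d/ceph
logrotate -d /etc/logrotate.d/ceph   # dry run, prints per file what would happen and why
logrotate -f /etc/logrotate.d/ceph   # force a rotation now to test the rules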

Jan

> On 03 Dec 2015, at 09:38, Wukongming  wrote:
> 
> Yes, I can find ceph of rotate configure file in the directory of 
> /etc/logrotate.d. 
> Also, I find sth. Weird.
> 
> drwxr-xr-x  2 root root   4.0K Dec  3 14:54 ./
> drwxrwxr-x 19 root syslog 4.0K Dec  3 13:33 ../
> -rw---  1 root root  0 Dec  2 06:25 ceph.audit.log
> -rw---  1 root root85K Nov 25 09:17 ceph.audit.log.1.gz
> -rw---  1 root root   228K Dec  3 16:00 ceph.log
> -rw---  1 root root28K Dec  3 06:23 ceph.log.1.gz
> -rw---  1 root root   374K Dec  2 06:22 ceph.log.2.gz
> -rw-r--r--  1 root root   4.3M Dec  3 16:01 ceph-mon.wkm01.log
> -rw-r--r--  1 root root   561K Dec  3 06:25 ceph-mon.wkm01.log.1.gz
> -rw-r--r--  1 root root   2.2M Dec  2 06:25 ceph-mon.wkm01.log.2.gz
> -rw-r--r--  1 root root  0 Dec  2 06:25 ceph-osd.0.log
> -rw-r--r--  1 root root992 Dec  1 09:09 ceph-osd.0.log.1.gz
> -rw-r--r--  1 root root19K Dec  3 10:51 ceph-osd.2.log
> -rw-r--r--  1 root root   2.3K Dec  2 10:50 ceph-osd.2.log.1.gz
> -rw-r--r--  1 root root27K Dec  1 10:31 ceph-osd.2.log.2.gz
> -rw-r--r--  1 root root13K Dec  3 10:23 ceph-osd.5.log
> -rw-r--r--  1 root root   1.6K Dec  2 09:57 ceph-osd.5.log.1.gz
> -rw-r--r--  1 root root22K Dec  1 09:51 ceph-osd.5.log.2.gz
> -rw-r--r--  1 root root19K Dec  3 10:51 ceph-osd.8.log
> -rw-r--r--  1 root root18K Dec  2 10:50 ceph-osd.8.log.1
> -rw-r--r--  1 root root   261K Dec  1 13:54 ceph-osd.8.log.2
> 
> I deployed ceph cluster on Nov 21, from that day to Dec.1, I mean the 
> continue 10 days' logs were compressed into one file, it is not what I want.
> Does any OP affect log compressing?
> 
> Thanks!
>Kongming Wu
> -
> wukongming ID: 12019
> Tel:0571-86760239
> Dept:2014 UIS2 ONEStor
> 
> -邮件原件-
> 发件人: huang jun [mailto:hjwsm1...@gmail.com] 
> 发送时间: 2015年12月3日 13:19
> 收件人: wukongming 12019 (RD)
> 抄送: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
> 主题: Re: How long will the logs be kept?
> 
> it will rotate every week by default, you can see the logrotate file 
> /etc/ceph/logrotate.d/ceph
> 
> 2015-12-03 12:37 GMT+08:00 Wukongming :
>> Hi ,All
>>Is there anyone who knows How long or how many days will the logs.gz 
>> (mon/osd/mds)be kept, maybe before flushed?
>> 
>> -
>> wukongming ID: 12019
>> Tel:0571-86760239
>> Dept:2014 UIS2 OneStor
>> 
>> --
>> ---
>> 本邮件及其附件含有杭州华三通信技术有限公司的保密信息,仅限于发送给上面地址中列出
>> 的个人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、
>> 或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本
>> 邮件!
>> This e-mail and its attachments contain confidential information from 
>> H3C, which is intended only for the person or entity whose address is 
>> listed above. Any use of the information contained herein in any way 
>> (including, but not limited to, total or partial disclosure, 
>> reproduction, or dissemination) by persons other than the intended
>> recipient(s) is prohibited. If you receive this e-mail in error, 
>> please notify the sender by phone or email immediately and delete it!
> 
> 
> 
> --
> thanks
> huangjun
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-osd@.service does not mount OSD data disk

2015-12-03 Thread Jan Schermer
echo add >/sys/block/sdX/sdXY/uevent

The easiest way to make it mount automagically
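Or, to poke all block devices at once and let the packaged udev rules /
ceph-disk machinery do the rest (assuming those rules are installed):

udevadm trigger --subsystem-match=block --action=add
# or, if your ceph version ships it:
ceph-disk activate-all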

Jan

> On 03 Dec 2015, at 20:31, Timofey Titovets  wrote:
> 
> Lol, it's opensource guys
> https://github.com/ceph/ceph/tree/master/systemd
> ceph-disk@
> 
> 2015-12-03 21:59 GMT+03:00 Florent B :
>> "ceph" service does mount :
>> 
>> systemctl status ceph -l
>> ● ceph.service - LSB: Start Ceph distributed file system daemons at boot
>> time
>>   Loaded: loaded (/etc/init.d/ceph)
>>   Active: active (exited) since Thu 2015-12-03 17:48:52 CET; 2h 9min ago
>>  Process: 931 ExecStart=/etc/init.d/ceph start (code=exited,
>> status=0/SUCCESS)
>> 
>> Dec 03 17:48:47 test3 ceph[931]: Running as unit run-1218.service.
>> Dec 03 17:48:47 test3 ceph[931]: Starting ceph-create-keys on test3...
>> Dec 03 17:48:47 test3 ceph[931]: === mds.1 ===
>> Dec 03 17:48:47 test3 ceph[931]: Starting Ceph mds.1 on test3...
>> Dec 03 17:48:47 test3 ceph[931]: Running as unit run-1318.service.
>> Dec 03 17:48:47 test3 ceph[931]: === osd.2 ===
>> Dec 03 17:48:47 test3 ceph[931]: Mounting xfs on
>> test3:/var/lib/ceph/osd/ceph-2
>> Dec 03 17:48:52 test3 ceph[931]: create-or-move updated item name
>> 'osd.2' weight 0.8447 at location {host=test3,root=default} to crush map
>> Dec 03 17:48:52 test3 ceph[931]: Starting Ceph osd.2 on test3...
>> Dec 03 17:48:52 test3 ceph[931]: Running as unit run-1580.service.
>> 
>> 
>> I don't see any udev rule related to Ceph on my servers...
>> 
>> 
>> On 12/03/2015 07:56 PM, Adrien Gillard wrote:
>>> I think OSDs are automatically mounted at boot via udev rules and that
>>> the ceph service does not handle the mounting part.
>>> 
>> 
> 
> 
> 
> -- 
> Have a nice day,
> Timofey.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to mount a bootable VM image file?

2015-12-02 Thread Jan Schermer
There's a pretty cool thing called libguestfs, and a tool called guestfish

http://libguestfs.org 

I've never used it (just stumbled on it recently) but it should do exactly what 
you need :-) And it supports RBD.
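An untested sketch of what that could look like here (pool/image name and auth
are placeholders - take the real ones from the qemu process args you already
found):

rbd map volumes/<image-name> --id volumes --keyring /etc/ceph/ceph.client.volumes.keyring
guestfish --rw -i -a /dev/rbd0
><fs> edit /etc/the-file-you-broke

The -i flag makes guestfish inspect the image and mount its filesystems for
you. Newer libguestfs apparently also accepts rbd:// URIs directly with -a,
which would avoid the kernel client (and its feature limitations) entirely.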

Jan


> On 02 Dec 2015, at 18:07, Gregory Farnum  wrote:
> 
> On Wednesday, December 2, 2015, Judd Maltin  > wrote:
> I'm using OpenStack to create VMs.  They're KVM VMs, and I can see all the 
> authentication information I need on the process tree.  I want to mount this 
> bootable image on the hypervizor node to access its filesystem and fix a file 
> I messed up in /etc/ so I can get the VM to boot.
> 
> [root@ceph mnt]# mount -t ceph 
> 192.168.170.53:6789:/volumes/d02ef718-bb44-4316-9e93-5979396921da_disk 
> /mnt/image -o 'name=volumes,secret=AQDG7fBVqH3/LxAA8pQ0IF5LKQzAPYKTv8SvfQ=='
> mount: 
> 192.168.170.53:6789:/volumes/d02ef718-bb44-4316-9e93-5979396921da_disk: can't 
> read superblock
> 
> How can I find and use the partition inside this raw, bootable file image?
> 
> You've probably created it using features that the kernel client you have 
> installed doesn't understand. You'd need to either use a newer kernel or 
> (more likely) just hook it up to a VM with QEMU.
> -Greg
> 
> 
>  
> Thanks folks,
> -judd
> 
> -- 
> Judd Maltin
> T: 917-882-1270
> Of Life immense in passion, pulse, and power,
> Cheerful—for freest action form’d, under the laws divine,
> The Modern Man I sing. -Walt Whitman
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cluster performance analysis

2015-12-02 Thread Jan Schermer
> Let's take IOPS, assuming the spinners can do 50 (4k) synced sustained IOPS 
> (I hope they can do more ^^), we should be around 50x84/3 = 1400 IOPS, which 
> is far from rados bench (538) and fio (847). And surprisingly fio numbers are 
> greater than rados.
> 

I think the missing factor here is filesystem journal overhead - that would 
explain the strange numbers you are seeing and the low performance in rados 
bench - every filesystem metadata operation has to do at least one 1 (synced) 
OP to the journal and that's not only file creation but also file growth (or 
filling the holes). And that's on the OSD as well as on the client filesystem 
side(!).


To do a proper benchmark, fill the mounted RBD filesystem completely with data
first and then try again with fio on a preallocated file (and don't enable
discard if that's supported).
Better yet, run fio on the block device itself, but write it over with dd
if=/dev/zero first.
I think you'll get a bit different numbers then.
Of course whether that's representative of what your usage pattern might be is 
another story.
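To be concrete about the test itself (the device name is a placeholder and it
will of course destroy whatever is on it):

dd if=/dev/zero of=/dev/rbd0 bs=4M oflag=direct
fio --name=prewritten --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --numjobs=32 --iodepth=1 \
    --runtime=300 --time_based --group_reporting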

Can you tell us what workload should be running on this and what the 
expectations were?
Can you see something maxed out while the benchmark is running? (CPU or drives?) 
Have you tried switching schedulers on the drives?

Jan

> On 02 Dec 2015, at 22:33, Adrien Gillard  wrote:
> 
> Hi everyone,
> 
>  
> I am currently testing our new cluster and I would like some feedback on the 
> numbers I am getting.
> 
>  
> For the hardware :
> 
> 7 x OSD : 2 x Intel 2640v3 (8x2.6GHz), 64B RAM, 2x10Gbits LACP for public 
> net., 2x10Gbits LACP for cluster net., MTU 9000
> 
> 1 x MON : 2 x Intel 2630L (6x2GHz), 32GB RAM and Intel DC SSD, 2x10Gbits LACP 
> for public net., MTU 9000
> 
> 2 x MON : VMs (8 cores, 8GB RAM), backed by SSD
> 
>  
> Journals are 20GB partitions on SSD
> 
>  
> The system is CentOS 7.1 with stock kernel (3.10.0-229.20.1.el7.x86_64). No 
> particular system optimizations.
> 
>  
> Ceph is Infernalis from Ceph repository  : ceph version 9.2.0 
> (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
> 
>  
> [cephadm@cph-adm-01  ~/scripts]$ ceph -s
> 
> cluster 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
> 
>  health HEALTH_OK
> 
>  monmap e1: 3 mons at 
> {clb-cph-frpar1-mon-02=x.x.x.2:6789/0,clb-cph-frpar2-mon-01=x.x.x.1:6789/0,clb-cph-frpar2-mon-03=x.x.x.3:6789/0}
> 
> election epoch 62, quorum 0,1,2 
> clb-cph-frpar2-mon-01,clb-cph-frpar1-mon-02,clb-cph-frpar2-mon-03
> 
>  osdmap e844: 84 osds: 84 up, 84 in
> 
> flags sortbitwise
> 
>   pgmap v111655: 3136 pgs, 3 pools, 3166 GB data, 19220 kobjects
> 
> 8308 GB used, 297 TB / 305 TB avail
> 
> 3136 active+clean
> 
>  
> My ceph.conf :
> 
>  
> [global]
> 
> fsid = 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
> 
> mon_initial_members = clb-cph-frpar2-mon-01, clb-cph-frpar1-mon-02, 
> clb-cph-frpar2-mon-03
> 
> mon_host = x.x.x.1,x.x.x.2,x.x.x.3
> 
> auth_cluster_required = cephx
> 
> auth_service_required = cephx
> 
> auth_client_required = cephx
> 
> filestore_xattr_use_omap = true
> 
> public network = 10.25.25.0/24 
> cluster network = 10.25.26.0/24 
> debug_lockdep = 0/0
> 
> debug_context = 0/0
> 
> debug_crush = 0/0
> 
> debug_buffer = 0/0
> 
> debug_timer = 0/0
> 
> debug_filer = 0/0
> 
> debug_objecter = 0/0
> 
> debug_rados = 0/0
> 
> debug_rbd = 0/0
> 
> debug_journaler = 0/0
> 
> debug_objectcatcher = 0/0
> 
> debug_client = 0/0
> 
> debug_osd = 0/0
> 
> debug_optracker = 0/0
> 
> debug_objclass = 0/0
> 
> debug_filestore = 0/0
> 
> debug_journal = 0/0
> 
> debug_ms = 0/0
> 
> debug_monc = 0/0
> 
> debug_tp = 0/0
> 
> debug_auth = 0/0
> 
> debug_finisher = 0/0
> 
> debug_heartbeatmap = 0/0
> 
> debug_perfcounter = 0/0
> 
> debug_asok = 0/0
> 
> debug_throttle = 0/0
> 
> debug_mon = 0/0
> 
> debug_paxos = 0/0
> 
> debug_rgw = 0/0
> 
>  
> [osd]
> 
> osd journal size = 0
> 
> osd mount options xfs = "rw,noatime,inode64,logbufs=8,logbsize=256k"
> 
> filestore min sync interval = 5
> 
> filestore max sync interval = 15
> 
> filestore queue max ops = 2048
> 
> filestore queue max bytes = 1048576000
> 
> filestore queue committing max ops = 4096
> 
> filestore queue committing max bytes = 1048576000
> 
> filestore op thread = 32
> 
> filestore journal writeahead = true
> 
> filestore merge threshold = 40
> 
> filestore split multiple = 8
> 
>  
> journal max write bytes = 1048576000
> 
> journal max write entries = 4096
> 
> journal queue max ops = 8092
> 
> journal queue max bytes = 1048576000
> 
>  
> osd max write size = 512
> 
> osd op threads = 16
> 
> osd disk threads = 2
> 
> osd op num threads per shard = 3
> 
> osd op num shards = 10
> 
> osd map cache size = 1024
> 
> osd max backfills = 1
> 
> osd recovery max active = 2
> 
>  
> I have set up 2 pools : one for cache with 3x replication in front of an EC 
> pool. At the moment I am only 

Re: [ceph-users] Removing OSD - double rebalance?

2015-12-02 Thread Jan Schermer
1) If you have the original drive still working and just want to replace it,
you can just "dd" it over to the new drive and then extend the partition if the
new one is larger; this avoids the double backfilling in this case.
2) If the old drive is dead, you should "out" it and add the new drive at the
same time.

If you reweight the drive then you shuffle all data on it to the rest of the 
drives on that host (with default crush at least), so you need to have free 
space to do that safely.
Also, Ceph is not smart enough to backfill the data only to the new drive
locally (even though it could), and the "hashing" algorithm doesn't really
guarantee that no other data moves when you switch drives like that.

TL;DR - if you can, deal with the additional load
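If you do go the "reweight first" route mentioned earlier in the thread, the
sequence looks roughly like this (the osd id is a placeholder; wait for
backfill to finish after the reweight):

ceph osd crush reweight osd.12 0      # drain it, single rebalance
ceph osd out 12
/etc/init.d/ceph stop osd.12          # or however you stop the daemon on your distro
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12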

Jan

> On 02 Dec 2015, at 11:59, Andy Allan  wrote:
> 
> On 30 November 2015 at 09:34, Burkhard Linke
>  wrote:
>> On 11/30/2015 10:08 AM, Carsten Schmitt wrote:
> 
>>> But after entering the last command, the cluster starts rebalancing again.
>>> 
>>> And that I don't understand: Shouldn't be one rebalancing process enough
>>> or am I missing something?
>> 
>> Removing the OSD changes the weight for the host, thus a second rebalance is
>> necessary.
>> 
>> The best practice to remove an OSD involves changing the crush weight to 0.0
>> as first step.
> 
> I found this out the hard way too. It's unfortunate that the
> documentation is, in my mind, not helpful on the order of commands to
> run.
> 
> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
> 
> Is there any good reason why the documentation recommends this
> double-rebalance approach? Or conversely, any reason not to change the
> documentation so that rebalances only happen once?
> 
> Thanks,
> Andy
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Modification Time of RBD Images

2015-11-26 Thread Jan Schermer
Find the block in which the filesystem on your RBD image stores its journal,
find the object hosting that block in rados, and use its mtime :-)
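The brute-force version of what Greg describes below would be something like
this (pool/image are placeholders; listing a big pool takes a while):

prefix=$(rbd info mypool/myimage | awk '/block_name_prefix/ {print $2}')
rados -p mypool ls | grep "^${prefix}" | while read obj; do
    rados -p mypool stat "$obj"       # prints the object's mtime and size
done

Then just take the newest mtime from the output.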

Jan


> On 26 Nov 2015, at 18:49, Gregory Farnum  wrote:
> 
> I don't think anything tracks this explicitly for RBD, but each RADOS object 
> does maintain an mtime you can check via the rados tool. You could write a 
> script to iterate through all the objects in the image and find the most 
> recent mtime (although a custom librados binary will be faster if you want to 
> do this frequently).
> -Greg
> 
> On Thursday, November 26, 2015, Christoph Adomeit 
> > wrote:
> Hi there,
> 
> I am using Ceph-Hammer and I am wondering about the following:
> 
> What is the recommended way to find out when an rbd-Image was last modified ?
> 
> Thanks
>   Christoph
> 
> --
> Christoph Adomeit
> GATWORKS GmbH
> Reststrauch 191
> 41199 Moenchengladbach
> Sitz: Moenchengladbach
> Amtsgericht Moenchengladbach, HRB 6303
> Geschaeftsfuehrer:
> Christoph Adomeit, Hans Wilhelm Terstappen
> 
> christoph.adom...@gatworks.de  Internetloesungen vom 
> Feinsten
> Fon. +49 2166 9149-32  Fax. +49 2166 9149-10
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Verified and tested SAS/SATA SSD for Ceph

2015-11-24 Thread Jan Schermer
Intel DC series (S3610 for journals, S3510 might be OK for data).
Samsung DC PRO series (if you can get them).

There are other drives that might be suitable but I strongly suggest you avoid 
those that aren't tested by others - it's a PITA to deal with the problems poor 
SSDs cause.
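The usual quick sanity check before committing to a model is a single-threaded
direct+dsync write test, which is roughly what the journal workload looks like
(the device is a placeholder and the test overwrites it, so use a scratch
disk):

dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync

Good journal SSDs sustain thousands of these per second; consumer drives often
collapse to a few hundred or worse.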

Jan

> On 24 Nov 2015, at 15:23, Mike Almateia  wrote:
> 
> Hello.
> 
> Does someone have a list of verified/tested SSD drives for Ceph?
> I'm thinking about the Ultrastar SSD1600MM SAS SSD for our all-flash Ceph cluster. 
> Does anybody use it in production?
> 
> -- 
> Mike, runs.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH over SW-RAID

2015-11-23 Thread Jan Schermer
So I assume we _are_ talking about bit-rot?

> On 23 Nov 2015, at 18:37, Jose Tavares  wrote:
> 
> Yes, but with SW-RAID, when we have a block that was read and does not
> match its checksum, the device falls out of the array, and the data is read
> again from the other devices in the array.

That's not true. SW-RAID reads data from one drive only. Comparison of the data 
on different drives only happens when a check is executed, and that's doesn't 
help with bit-rot one bit :-) (the same goes for various SANs and arrays, but 
those usually employ additional CRC for data so their BER is orders of 
magnitude higher.)

> The problem is that in SW-RAID1
> we don't have the badblocks isolated. The disks can be sincronized again as
> the write operation is not tested. The problem (device falling out of the
> array) will happen again if we try to read any other data written over the
> bad block.

Not true either. Bit-rot happens not (only) when the data gets written wrong, 
but when it is read. If you read one block long enough you will get wrong data 
once every $BER_bits. Rewriting the data doesn't help.
(It's a bit different with some SSDs that don't refresh blocks so 
rewriting/refreshing them might help).

> 
> My new question regarding Ceph is if it isolates this bad sectors where it
> found bad data when scrubbing? or there will be always a replica of
> something over a known bad block..?
> 
> I also saw that Ceph use same metrics when capturing data from disks. When
> the disk is resetting or have problems, its metrics are going to be bad and
> the cluster will rank bad this osd. But I didn't saw any way of sending
> alerts or anything like that. SW-RAID has its mdadm monitor that alerts
> when things go bad. Should I have to be looking for ceph logs all the time
> to see when things go bad?

You should graph every drive and look for anomalies. Ceph only detects a 
problem when the drive is already very unusable (the ceph-osd process itself 
blocks for tens of seconds typically).
CEPH is not really good when it comes to latency SLAs, no matter how much you 
try, but that's usually sufficient.
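As for scrubbing: when a (deep-)scrub catches a mismatch the PG gets flagged
inconsistent and you repair it by hand, roughly (the pg id is a placeholder):

ceph health detail | grep inconsistent
ceph pg repair 3.45

Keep in mind that with replicated pools the repair basically re-copies from the
primary, so it doesn't by itself tell you which replica was the rotten one. The
bad sector itself is the drive's problem - it usually gets remapped when it is
rewritten.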

> 
> Thanks.
> Jose Tavares
> 
> On Mon, Nov 23, 2015 at 3:19 PM, Robert LeBlanc 
> wrote:
> 
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>> 
>> Most people run their clusters with no RAID for the data disks (some
>> will run RAID for the journals, but we don't). We use the scrub
>> mechanism to find data inconsistency and we use three copies to do
>> RAID over host/racks, etc. Unless you have a specific need, it is best
>> to forgo the Linux SW RAID or even HW RAIDs too with Ceph.
>> - 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> 
>> 
>> On Mon, Nov 23, 2015 at 10:09 AM, Jose Tavares  wrote:
>>> Hi guys ...
>>> 
>>> Is there any advantage in running CEPH over a Linux SW-RAID to avoid data
>>> corruption due to disk bad blocks?
>>> 
>>> Can we just rely on the scrubbing feature of CEPH? Can we live without an
>>> underlying layer that avoids hardware problems to be passed to CEPH?
>>> 
>>> I have a setup where I put one OSD per node and I have a 2 disk raid-1
>>> setup. Is it a good option or it would be better if I had 2 OSDs, one in
>>> each disk? If I had one OSD per disk, I would have to increase the
>> number os
>>> replicas to guarantee enough replicas if one node goes down.
>>> 
>>> Thanks a lot.
>>> Jose Tavares
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>> 
>> -BEGIN PGP SIGNATURE-
>> Version: Mailvelope v1.2.3
>> Comment: https://www.mailvelope.com
>> 
>> wsFcBAEBCAAQBQJWU0qBCRDmVDuy+mK58QAAczAP/RducnXBNyeESCwUP/RC
>> 3ELmoZxMO2ymrcQoutUVXfPTZk7f9pINUux4NRnglbVDxHasmNBHFKV3uWTS
>> OBmaVuC99cwG/ekhmNaW9qmQIZiP8byijoDln26eqarhhuMECgbYxZhLtB9M
>> A1W5gpKEvCBvYcjW9V/rwb0+V678Eo1IVlezwJ1TP3pxvRWpDsg1dIhOBit8
>> PznnPTMS46RGFrFirTg1AfvmipSI3rhLFdR2g7xHrQs9UHdmC0OQ/Jcjnln+
>> L0LNni7ht1lK80J9Mk4Q/nt7HfWCxJrg497Q+R0m+ab3qFJWBUGwofjbEnut
>> JroMLph0sxAzmDSst8a15pzTYaIqMqKkGfGeHgiaNzePwELAY2AKwgx2AIlf
>> iYJCtyiXRHnfQfQEi1TflWFuEaaAhKCPqRO7Duf6a+rEsSkvViaZ9Mtm1bSX
>> KnLLSz8ZtXI4wTWbImXbpdhuGgHvKsEGWlU+YDuCil9i+PedM67us1Y6TAsT
>> UWvCd8P385psITLI37Ly+YDHphjyeyYljCPGuom1e+/J3flElS/BgWUGUibB
>> rA3QUNUIPWKO6F37JEDja13BShTE9I17Y3EpSgGGG3jnTt93/E4dEvR6mC/F
>> qPPjs7EMvc99Xi7rTqtpm58JLGXWh3rMgjITJTwfLhGtCHgSvvrsRjmGB9Xa
>> anPK
>> =XQGP
>> -END PGP SIGNATURE-
>> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH over SW-RAID

2015-11-23 Thread Jan Schermer
SW-RAID doesn't help with bit-rot if that's what you're afraid of.
If you are afraid of bit-rot you need to use a fully checksumming filesystem like 
ZFS.
Ceph doesn't help there either when using replicas - not sure how strong error 
detection+correction is in EC-type pools.

The only thing I can suggest (apart from using ZFS) is getting drives that have 
a higher BER rating so bit-rot isn't as likely to occur.

Jan

> On 23 Nov 2015, at 18:09, Jose Tavares  wrote:
> 
> Hi guys ...
> 
> Is there any advantage in running CEPH over a Linux SW-RAID to avoid data 
> corruption due to disk bad blocks?
> 
> Can we just rely on the scrubbing feature of CEPH? Can we live without an 
> underlying layer that avoids hardware problems to be passed to CEPH?
> 
> I have a setup where I put one OSD per node and I have a 2 disk raid-1 setup. 
> Is it a good option or it would be better if I had 2 OSDs, one in each disk? 
> If I had one OSD per disk, I would have to increase the number os replicas to 
> guarantee enough replicas if one node goes down.
> 
> Thanks a lot.
> Jose Tavares
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what's the benefit if I deploy more ceph-mon node?

2015-11-19 Thread Jan Schermer
There's no added benefit - it just adds resiliency.
On the other hand - more monitors mean a greater likelihood that one of them
will break; when that happens there will be a brief interruption to some (not
only management) operations. 
If you decide to reduce the number of MONs then that is a PITA as it means 
recreating for example libvirt domains (mon addresses are pretty much hardcoded 
in there).
Just keep it at 3 unless you have more than 3 datacenters.

Jan

> On 19 Nov 2015, at 09:35, 席智勇  wrote:
> 
> 
> hi all:
> 
>As the title, if I deploy more than three ceph-mon node, I can 
> tolerate more monitor node failture, what I wana know is,  is there any other 
> benefit, for example, better for IOPS or latency? On the other hand, what the 
> disadvantage if it has?
> 
>  best regards~
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD - 'attempt to access beyond end of device'

2015-11-12 Thread Jan Schermer
xfs_growfs "autodetects" the block device size. You can force re-read of the 
block device to refresh this info but might not do anything at all.

There are situations when block device size will not reflect reality - for 
example you can't (or at least couldn't) resize partition that is in use 
(mounted, mapped, used in LVM...) without serious hacks, and ioctls on this 
partition will return the old size until you reboot.
The block device can also simply lie (like if you triggered a bug that made the 
rbd device visually larger).
Device-mapper devices have their own issues.

The only advice I can give is to never, ever shrink LUNs or block devices and 
to avoid partitions if you can. I usually set up a fairly large OS drive (with 
oversized partitions to be safe, assuming you have thin-provisioning it wastes 
no real space) and a separate data volume without any partitioning. This also 
works-around possible alignment issues
Growing is always safe, shrinking destroys data. I am very surprised that "rbd 
resize" doesn't require something like "--i-really-really-know-what-i-am-doing 
--please-eatmydata" parameter to shrink the image (or does it ask for 
confirmation when shrinking at least? I can't try it now). Making a typo == 
instawipe?

My bet would still be that the original image was larger and you shrunk it by 
mistake. The kernel client most probably never gets the capacity change 
notification and you end up creating filesystem that points outside the device. 
(not sure if mkfs.xfs actually tries seeking over the full sector range). This 
is the most plausible explanation I can think of, but anything is possible. I 
have other ideas if you want to investigate but I'd take it off-list...

Jan

P.S. Your image is not 2TB but rather 2000 GiB ;-)
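P.P.S. For the record, the grow-only path is simple (names/sizes are
placeholders; with the kernel client, if the mapped device doesn't pick up the
new size by itself, unmap/remap it before growing the fs):

rbd resize --size 4194304 mypool/myimage   # size is in MB, so this is 4 TiB
blockdev --getsize64 /dev/rbdX             # confirm the kernel sees the new size
xfs_growfs /mnt/myimage                    # grows the fs to the device size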



> On 12 Nov 2015, at 22:10, Bogdan SOLGA <bogdan.so...@gmail.com> wrote:
> 
> Unfortunately I can no longer execute those commands for that rbd5, as I had 
> to delete it; I couldn't 'resurrect' it, at least not in a decent time.
> 
> Here is the output for another image, which is 2TB big:
> 
> ceph-admin@ceph-client-01:~$ sudo blockdev --getsz --getss --getbsz /dev/rbd1
> 4194304000
> 512
> 512
> 
> ceph-admin@ceph-client-01:~$ xfs_info /dev/rbd1
> meta-data=/dev/rbd2  isize=256agcount=8127, agsize=64512 blks
>  =   sectsz=512   attr=2
> data =   bsize=4096   blocks=524288000, imaxpct=25
>  =   sunit=1024   swidth=1024 blks
> naming   =version 2  bsize=4096   ascii-ci=0
> log  =internal   bsize=4096   blocks=2560, version=2
>  =   sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none   extsz=4096   blocks=0, rtextents=0
> 
> 
> I know rbd can also shrink the image, but I'm sure I haven't shrunk it. What 
> I have tried, accidentally, was to resize the image to the same size it 
> previously had, and that operation has failed, after trying for some time. 
> Hmm... I think the failed resize was the culprit for it's malfunctioning, 
> then.
> 
> Any (additional) advices on how to prevent this type of issues, in the 
> future? Should the resizing and the xfs_growfs be executed with some 
> parameters, for a better configuration of the image and / or filesystem?
> 
> Thank you very much for your help!
> 
> Regards,
> Bogdan
> 
> 
> On Thu, Nov 12, 2015 at 11:00 PM, Jan Schermer <j...@schermer.cz 
> <mailto:j...@schermer.cz>> wrote:
> Can you post the output of:
> 
> blockdev --getsz --getss --getbsz /dev/rbd5
> and
> xfs_info /dev/rbd5
> 
> rbd resize can actually (?) shrink the image as well - is it possible that 
> the device was actually larger and you shrunk it?
> 
> Jan
> 
>> On 12 Nov 2015, at 21:46, Bogdan SOLGA <bogdan.so...@gmail.com 
>> <mailto:bogdan.so...@gmail.com>> wrote:
>> 
>> By running rbd resize <http://docs.ceph.com/docs/master/rbd/rados-rbd-cmds/> 
>> and then 'xfs_growfs -d' on the filesystem.
>> 
>> Is there a better way to resize an RBD image and the filesystem?
>> 
>> On Thu, Nov 12, 2015 at 10:35 PM, Jan Schermer <j...@schermer.cz 
>> <mailto:j...@schermer.cz>> wrote:
>> 
>>> On 12 Nov 2015, at 20:49, Bogdan SOLGA <bogdan.so...@gmail.com 
>>> <mailto:bogdan.so...@gmail.com>> wrote:
>>> 
>>> Hello Jan!
>>> 
>>> Thank you for your advices, first of all!
>>> 
>>> The filesystem was created using mkfs.xfs, after creating the RBD block 
>>> device and mapping it on the Ceph client. I haven't specified any 
>>> parameters when I created the filesystem, I just ran mkfs.xfs on

Re: [ceph-users] RBD - 'attempt to access beyond end of device'

2015-11-12 Thread Jan Schermer

> On 12 Nov 2015, at 20:49, Bogdan SOLGA <bogdan.so...@gmail.com> wrote:
> 
> Hello Jan!
> 
> Thank you for your advices, first of all!
> 
> The filesystem was created using mkfs.xfs, after creating the RBD block 
> device and mapping it on the Ceph client. I haven't specified any parameters 
> when I created the filesystem, I just ran mkfs.xfs on the image name.
> 
> As you mentioned the filesystem thinking the block device should be larger 
> than it is - I have initially created that image as a 2GB image, and then 
> resized it to be much bigger. Could this be the issue?

Sounds more than likely :-) How exactly did you grow it?

Jan

> 
> There are several RBD images mounted on one Ceph client, but only one of them 
> had issues. I have made a clone, and I will try running fsck on it.
> 
> Fortunately it's not important data, it's just testing data. If I won't 
> succeed repairing it I will trash and re-create it, of course.
> 
> Thank you, once again!
> 
> 
> 
> On Thu, Nov 12, 2015 at 9:28 PM, Jan Schermer <j...@schermer.cz 
> <mailto:j...@schermer.cz>> wrote:
> How did you create filesystems and/or partitions on this RBD block device?
> The obvious causes would be
> 1) you partitioned it and the partition on which you ran mkfs points or 
> pointed during mkfs outside the block device size (happens if you for example 
> automate this and confuse sectors x cylinders, or if you copied the partition 
> table with dd or from some image)
> or
> 2) mkfs created the filesystem with pointers outside of the block device for 
> some other reason (bug?)
> or
> 3) this RBD device is a snapshot that got corrupted (or wasn't snapshotted in 
> crash-consistent state and you got "lucky") and some reference points to a 
> non-sensical block number (fsck could fix this, but I wouldn't trust the data 
> integrity anymore)
> 
> Basically the filesystem thinks the block device should be larger than it is 
> and tries to reach beyond.
> 
> Is this just one machine or RBD image or is there more?
> 
> I'd first create a snapshot and then try running fsck on it, it should 
> hopefully tell you if there's a problem in setup or a corruption.
> 
> If it's not important data and it's just one instance of this problem then 
> I'd just trash and recreate it.
> 
> Jan
> 
>> On 12 Nov 2015, at 20:14, Bogdan SOLGA <bogdan.so...@gmail.com 
>> <mailto:bogdan.so...@gmail.com>> wrote:
>> 
>> Hello everyone!
>> 
>> We have a recently installed Ceph cluster (v 0.94.5, Ubuntu 14.04), and 
>> today I noticed a lot of 'attempt to access beyond end of device' messages 
>> in the /var/log/syslog file. They are related to a mounted RBD image, and 
>> have the following format:
>> 
>> Nov 12 21:06:44 ceph-client-01 kernel: [438507.952532] attempt to access 
>> beyond end of device
>> Nov 12 21:06:44 ceph-client-01 kernel: [438507.952534] rbd5: rw=33, 
>> want=6193176, limit=4194304
>> 
>> After restarting that Ceph client, I see a lot of 'metadata I/O error' 
>> messages in the boot log:
>> 
>> XFS (rbd5): metadata I/O error: block 0x46e001 ("xfs_buf_iodone_callbacks") 
>> error 5 numblks 1
>> 
>> Any idea on why these messages are shown? The health of the cluster shows as 
>> OK, and I can access that block device without (apparent) issues...
>> 
>> Thank you!
>> 
>> Regards,
>> Bogdan
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD - 'attempt to access beyond end of device'

2015-11-12 Thread Jan Schermer
Can you post the output of:

blockdev --getsz --getss --getbsz /dev/rbd5
and
xfs_info /dev/rbd5

rbd resize can actually (?) shrink the image as well - is it possible that the 
device was actually larger and you shrunk it?

Jan

> On 12 Nov 2015, at 21:46, Bogdan SOLGA <bogdan.so...@gmail.com> wrote:
> 
> By running rbd resize <http://docs.ceph.com/docs/master/rbd/rados-rbd-cmds/> 
> and then 'xfs_growfs -d' on the filesystem.
> 
> Is there a better way to resize an RBD image and the filesystem?
> 
> On Thu, Nov 12, 2015 at 10:35 PM, Jan Schermer <j...@schermer.cz 
> <mailto:j...@schermer.cz>> wrote:
> 
>> On 12 Nov 2015, at 20:49, Bogdan SOLGA <bogdan.so...@gmail.com 
>> <mailto:bogdan.so...@gmail.com>> wrote:
>> 
>> Hello Jan!
>> 
>> Thank you for your advices, first of all!
>> 
>> The filesystem was created using mkfs.xfs, after creating the RBD block 
>> device and mapping it on the Ceph client. I haven't specified any parameters 
>> when I created the filesystem, I just ran mkfs.xfs on the image name.
>> 
>> As you mentioned the filesystem thinking the block device should be larger 
>> than it is - I have initially created that image as a 2GB image, and then 
>> resized it to be much bigger. Could this be the issue?
> 
> Sounds more than likely :-) How exactly did you grow it?
> 
> Jan
> 
>> 
>> There are several RBD images mounted on one Ceph client, but only one of 
>> them had issues. I have made a clone, and I will try running fsck on it.
>> 
>> Fortunately it's not important data, it's just testing data. If I won't 
>> succeed repairing it I will trash and re-create it, of course.
>> 
>> Thank you, once again!
>> 
>> 
>> 
>> On Thu, Nov 12, 2015 at 9:28 PM, Jan Schermer <j...@schermer.cz 
>> <mailto:j...@schermer.cz>> wrote:
>> How did you create filesystems and/or partitions on this RBD block device?
>> The obvious causes would be
>> 1) you partitioned it and the partition on which you ran mkfs points or 
>> pointed during mkfs outside the block device size (happens if you for 
>> example automate this and confuse sectors x cylinders, or if you copied the 
>> partition table with dd or from some image)
>> or
>> 2) mkfs created the filesystem with pointers outside of the block device for 
>> some other reason (bug?)
>> or
>> 3) this RBD device is a snapshot that got corrupted (or wasn't snapshotted 
>> in crash-consistent state and you got "lucky") and some reference points to 
>> a non-sensical block number (fsck could fix this, but I wouldn't trust the 
>> data integrity anymore)
>> 
>> Basically the filesystem thinks the block device should be larger than it is 
>> and tries to reach beyond.
>> 
>> Is this just one machine or RBD image or is there more?
>> 
>> I'd first create a snapshot and then try running fsck on it, it should 
>> hopefully tell you if there's a problem in setup or a corruption.
>> 
>> If it's not important data and it's just one instance of this problem then 
>> I'd just trash and recreate it.
>> 
>> Jan
>> 
>>> On 12 Nov 2015, at 20:14, Bogdan SOLGA <bogdan.so...@gmail.com 
>>> <mailto:bogdan.so...@gmail.com>> wrote:
>>> 
>>> Hello everyone!
>>> 
>>> We have a recently installed Ceph cluster (v 0.94.5, Ubuntu 14.04), and 
>>> today I noticed a lot of 'attempt to access beyond end of device' messages 
>>> in the /var/log/syslog file. They are related to a mounted RBD image, and 
>>> have the following format:
>>> 
>>> Nov 12 21:06:44 ceph-client-01 kernel: [438507.952532] attempt to access 
>>> beyond end of device
>>> Nov 12 21:06:44 ceph-client-01 kernel: [438507.952534] rbd5: rw=33, 
>>> want=6193176, limit=4194304
>>> 
>>> After restarting that Ceph client, I see a lot of 'metadata I/O error' 
>>> messages in the boot log:
>>> 
>>> XFS (rbd5): metadata I/O error: block 0x46e001 ("xfs_buf_iodone_callbacks") 
>>> error 5 numblks 1
>>> 
>>> Any idea on why these messages are shown? The health of the cluster shows 
>>> as OK, and I can access that block device without (apparent) issues...
>>> 
>>> Thank you!
>>> 
>>> Regards,
>>> Bogdan
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
>> 
>> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD - 'attempt to access beyond end of device'

2015-11-12 Thread Jan Schermer
Apologies, it seems that shrinking the device requires the --allow-shrink
parameter.

> On 12 Nov 2015, at 22:49, Jan Schermer <j...@schermer.cz> wrote:
> 
> xfs_growfs "autodetects" the block device size. You can force re-read of the 
> block device to refresh this info but might not do anything at all.
> 
> There are situations when block device size will not reflect reality - for 
> example you can't (or at least couldn't) resize partition that is in use 
> (mounted, mapped, used in LVM...) without serious hacks, and ioctls on this 
> partition will return the old size until you reboot.
> The block device can also simply lie (like if you triggered a bug that made 
> the rbd device visually larger).
> Device-mapper devices have their own issues.
> 
> The only advice I can give is to never, ever shrink LUNs or block devices and 
> to avoid partitions if you can. I usually set up a fairly large OS drive 
> (with oversized partitions to be safe, assuming you have thin-provisioning it 
> wastes no real space) and a separate data volume without any partitioning. 
> This also works-around possible alignment issues
> Growing is always safe, shrinking destroys data. I am very surprised that 
> "rbd resize" doesn't require something like 
> "--i-really-really-know-what-i-am-doing --please-eatmydata" parameter to 
> shrink the image (or does it ask for confirmation when shrinking at least? I 
> can't try it now). Making a typo == instawipe?
> 
> My bet would still be that the original image was larger and you shrunk it by 
> mistake. The kernel client most probably never gets the capacity change 
> notification and you end up creating filesystem that points outside the 
> device. (not sure if mkfs.xfs actually tries seeking over the full sector 
> range). This is the most plausible explanation I can think of, but anything 
> is possible. I have other ideas if you want to investigate but I'd take it 
> off-list...
> 
> Jan
> 
> P.S. Your image is not 2TB but rather 2000 GiB ;-)
> 
> 
> 
>> On 12 Nov 2015, at 22:10, Bogdan SOLGA <bogdan.so...@gmail.com 
>> <mailto:bogdan.so...@gmail.com>> wrote:
>> 
>> Unfortunately I can no longer execute those commands for that rbd5, as I had 
>> to delete it; I couldn't 'resurrect' it, at least not in a decent time.
>> 
>> Here is the output for another image, which is 2TB big:
>> 
>> ceph-admin@ceph-client-01:~$ sudo blockdev --getsz --getss --getbsz /dev/rbd1
>> 4194304000
>> 512
>> 512
>> 
>> ceph-admin@ceph-client-01:~$ xfs_info /dev/rbd1
>> meta-data=/dev/rbd2  isize=256agcount=8127, agsize=64512 blks
>>  =   sectsz=512   attr=2
>> data =   bsize=4096   blocks=524288000, imaxpct=25
>>  =   sunit=1024   swidth=1024 blks
>> naming   =version 2  bsize=4096   ascii-ci=0
>> log  =internal   bsize=4096   blocks=2560, version=2
>>  =   sectsz=512   sunit=8 blks, lazy-count=1
>> realtime =none   extsz=4096   blocks=0, rtextents=0
>> 
>> 
>> I know rbd can also shrink the image, but I'm sure I haven't shrunk it. What 
>> I have tried, accidentally, was to resize the image to the same size it 
>> previously had, and that operation has failed, after trying for some time. 
>> Hmm... I think the failed resize was the culprit for it's malfunctioning, 
>> then.
>> 
>> Any (additional) advices on how to prevent this type of issues, in the 
>> future? Should the resizing and the xfs_growfs be executed with some 
>> parameters, for a better configuration of the image and / or filesystem?
>> 
>> Thank you very much for your help!
>> 
>> Regards,
>> Bogdan
>> 
>> 
>> On Thu, Nov 12, 2015 at 11:00 PM, Jan Schermer <j...@schermer.cz 
>> <mailto:j...@schermer.cz>> wrote:
>> Can you post the output of:
>> 
>> blockdev --getsz --getss --getbsz /dev/rbd5
>> and
>> xfs_info /dev/rbd5
>> 
>> rbd resize can actually (?) shrink the image as well - is it possible that 
>> the device was actually larger and you shrunk it?
>> 
>> Jan
>> 
>>> On 12 Nov 2015, at 21:46, Bogdan SOLGA <bogdan.so...@gmail.com 
>>> <mailto:bogdan.so...@gmail.com>> wrote:
>>> 
>>> By running rbd resize 
>>> <http://docs.ceph.com/docs/master/rbd/rados-rbd-cmds/> and then 'xfs_growfs 
>>> -d' on the filesystem.
>>> 
>>> Is there a better way to resize an RBD image and the filesystem?

Re: [ceph-users] RBD - 'attempt to access beyond end of device'

2015-11-12 Thread Jan Schermer
How did you create filesystems and/or partitions on this RBD block device?
The obvious causes would be
1) you partitioned it and the partition on which you ran mkfs points or pointed 
during mkfs outside the block device size (happens if you for example automate 
this and confuse sectors x cylinders, or if you copied the partition table with 
dd or from some image)
or
2) mkfs created the filesystem with pointers outside of the block device for 
some other reason (bug?)
or
3) this RBD device is a snapshot that got corrupted (or wasn't snapshotted in 
crash-consistent state and you got "lucky") and some reference points to a 
non-sensical block number (fsck could fix this, but I wouldn't trust the data 
integrity anymore)

Basically the filesystem thinks the block device should be larger than it is 
and tries to reach beyond.
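
A quick way to check for that mismatch (the device name below is the one from the logs; the arithmetic is the point, not the exact commands):

blockdev --getsz /dev/rbd5   # device size in 512-byte sectors
xfs_info /dev/rbd5           # filesystem size = "blocks" x "bsize" from the data line
# if blocks * bsize is larger than sectors * 512, the filesystem thinks it extends past the device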

Is this just one machine or RBD image or is there more?

I'd first create a snapshot and then try running fsck on it, it should 
hopefully tell you if there's a problem in setup or a corruption.
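
Something along these lines, assuming an XFS filesystem on a format 2 image (pool/image names are made up; xfs_repair -n only checks, it doesn't modify anything):

rbd snap create rbd/myimage@fscktest
rbd snap protect rbd/myimage@fscktest
rbd clone rbd/myimage@fscktest rbd/myimage-fscktest
rbd map rbd/myimage-fscktest    # prints the new /dev/rbdX device
xfs_repair -n /dev/rbdX         # read-only check of the clone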

If it's not important data and it's just one instance of this problem then I'd 
just trash and recreate it.

Jan

> On 12 Nov 2015, at 20:14, Bogdan SOLGA  wrote:
> 
> Hello everyone!
> 
> We have a recently installed Ceph cluster (v 0.94.5, Ubuntu 14.04), and today 
> I noticed a lot of 'attempt to access beyond end of device' messages in the 
> /var/log/syslog file. They are related to a mounted RBD image, and have the 
> following format:
> 
> Nov 12 21:06:44 ceph-client-01 kernel: [438507.952532] attempt to access 
> beyond end of device
> Nov 12 21:06:44 ceph-client-01 kernel: [438507.952534] rbd5: rw=33, 
> want=6193176, limit=4194304
> 
> After restarting that Ceph client, I see a lot of 'metadata I/O error' 
> messages in the boot log:
> 
> XFS (rbd5): metadata I/O error: block 0x46e001 ("xfs_buf_iodone_callbacks") 
> error 5 numblks 1
> 
> Any idea on why these messages are shown? The health of the cluster shows as 
> OK, and I can access that block device without (apparent) issues...
> 
> Thank you!
> 
> Regards,
> Bogdan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Chown in Parallel

2015-11-10 Thread Jan Schermer
I would just disable barriers and enable them afterwards(+sync), should be a 
breeze then.
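
Roughly like this per OSD, assuming the OSD is already stopped and you know its data partition (device and OSD id below are made up, untested):

umount /var/lib/ceph/osd/ceph-0
mount -o nobarrier /dev/sdc1 /var/lib/ceph/osd/ceph-0
chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
sync
umount /var/lib/ceph/osd/ceph-0
mount /dev/sdc1 /var/lib/ceph/osd/ceph-0   # back to the default mount options (barriers on)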

Jan

> On 10 Nov 2015, at 12:58, Nick Fisk  wrote:
> 
> I’m currently upgrading to Infernalis and the chown stage is taking a long 
> time on my OSD nodes. I’ve come up with this little one-liner to run the 
> chowns in parallel
>  
> find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -print | xargs -P12 -n1 chown 
> -R ceph:ceph
>  
> NOTE: You still need to make sure the other directories in the /var/lib/ceph 
> folder are updated separately, but this should speed up the process for 
> machines with a larger number of disks.
>  
> Nick
> 
>  ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Chown in Parallel

2015-11-10 Thread Jan Schermer
Interesting. I have all the inodes in cache on my nodes so I expect the 
bottleneck to be filesystem metadata -> journal writes. Unless something else 
is going on in here ;-)

Jan

> On 10 Nov 2015, at 13:19, Nick Fisk <n...@fisk.me.uk> wrote:
> 
> I’m looking at iostat and most of the IO is read, so I think it would still 
> take a while if it was still single threaded
>  
> Device:  rrqm/s  wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda        0.00    0.50    0.00    5.50     0.00    22.25     8.09     0.00    0.00    0.00    0.00   0.00   0.00
> sdb        0.00    0.50    0.00    5.50     0.00    22.25     8.09     0.00    0.00    0.00    0.00   0.00   0.00
> sdc        0.00  356.00  498.50    3.00  1994.00  1436.00    13.68     1.24    2.48    2.38   18.00   1.94  97.20
> sdd        0.50    0.00  324.50    0.00  1484.00     0.00     9.15     0.97    2.98    2.98    0.00   2.98  96.80
> sde        0.00    0.00  300.50    0.00  1588.00     0.00    10.57     0.98    3.25    3.25    0.00   3.25  97.80
> sdf        0.00   13.00  197.00   95.50  1086.00  1200.00    15.63   121.41  685.70    4.98 2089.91   3.42 100.00
> md1        0.00    0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md0        0.00    0.00    0.00    5.50     0.00    22.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdg        0.00    0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdm        0.00    0.00  262.00    0.00  1430.00     0.00    10.92     0.99    3.78    3.78    0.00   3.76  98.60
> sdi        0.00  113.00  141.00  337.00   764.00  3340.00    17.17    98.93  191.24    3.65  269.73   2.06  98.40
> sdk        1.00   42.50  378.50   74.50  2004.00   692.00    11.90   145.21  278.94    2.68 1682.44   2.21 100.00
> sdn        0.00    0.00  250.50    0.00  1346.00     0.00    10.75     0.97    3.90    3.90    0.00   3.88  97.20
> sdj        0.00   67.50   94.00  287.50   466.00  2952.00    17.92   144.55  589.07    5.43  779.90   2.62 100.00
> sdh        0.00   85.50  158.00  176.00   852.00  2120.00    17.80   144.49  500.04    5.05  944.40   2.99 100.00
> sdl        0.00    0.00  173.00    9.50   956.00   300.00    13.76     2.85   15.64    5.73  196.00   5.41  98.80
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
> <mailto:ceph-users-boun...@lists.ceph.com>] On Behalf Of Jan Schermer
> Sent: 10 November 2015 12:07
> To: Nick Fisk <n...@fisk.me.uk <mailto:n...@fisk.me.uk>>
> Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Chown in Parallel
>  
> I would just disable barriers and enable them afterwards(+sync), should be a 
> breeze then.
>  
> Jan
>  
> On 10 Nov 2015, at 12:58, Nick Fisk <n...@fisk.me.uk 
> <mailto:n...@fisk.me.uk>> wrote:
>  
> I’m currently upgrading to Infernalis and the chown stage is taking a long 
> time on my OSD nodes. I’ve come up with this little one-liner to run the 
> chowns in parallel
>  
> find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -print | xargs -P12 -n1 chown 
> -R ceph:ceph
>  
> NOTE: You still need to make sure the other directories in the /var/lib/ceph 
> folder are updated separately, but this should speed up the process for 
> machines with a larger number of disks.
>  
> Nick
> 
>  ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> <http://xo4t.mj.am/link/xo4t/qknh5q8/1/MyrIHaeADwpTQ6E9Py2DNg/aHR0cDovL2xpc3RzLmNlcGguY29tL2xpc3RpbmZvLmNnaS9jZXBoLXVzZXJzLWNlcGguY29t>
>  
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] data size less than 4 mb

2015-11-02 Thread Jan Schermer
Can those hints be disabled somehow? I was battling XFS preallocation the other 
day, and the mount option didn't make any difference - maybe because those 
hints have precedence (which could mean they aren't working as they should), 
maybe not.

In particular, when you fallocate a file, some number of blocks will be 
reserved without actually allocating the blocks. When you then dirty a block 
with write and flush, metadata needs to be written (in journal, synchronously) 
<- this is slow with all drives, and extremely slow with sh*tty drives (doing 
benchmark on such a file will yield just 100 write IOPs, but when you allocate 
the file previously with dd if=/dev/zero it will have 6000 IOPs!) - and there 
doesn't seem to be a way to disable it in XFS. Not sure if hints should help or 
if they are actually causing the problem (I am not clear on whether they 
preallocate metadata blocks or just block count). Ext4 does the same thing.
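
A minimal way to reproduce the comparison (paths are made up, the numbers will obviously vary with the drive):

fallocate -l 1G /mnt/xfs/prealloc.img                                    # blocks reserved but never written
dd if=/dev/zero of=/mnt/xfs/written.img bs=1M count=1024 oflag=direct    # fully written out first

fio --name=t1 --filename=/mnt/xfs/prealloc.img --rw=randwrite --bs=4k --direct=1 --sync=1 --iodepth=1 --runtime=30 --time_based
fio --name=t2 --filename=/mnt/xfs/written.img --rw=randwrite --bs=4k --direct=1 --sync=1 --iodepth=1 --runtime=30 --time_based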

Might be worth looking into?

Jan


> On 31 Oct 2015, at 19:36, Gregory Farnum  wrote:
> 
> On Friday, October 30, 2015, mad Engineer  > wrote:
> I am learning Ceph block storage and read that each object size is 4 MB. I am 
> not clear about the concepts of object storage yet: what will happen if the 
> actual size of data written to a block is less than 4 MB, let's say 1 MB? Will 
> it still create an object of 4 MB size and keep the rest of the space free and 
> unusable?
> 
> No, it will only take up as much space as you write (plus some metadata). 
> Although I think RBD passes down io hints suggesting the object's final size 
> will be 4MB so that the underlying storage (eg xfs) can prevent fragmentation.
> -Greg
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] data size less than 4 mb

2015-11-02 Thread Jan Schermer

> On 02 Nov 2015, at 11:59, Wido den Hollander <w...@42on.com> wrote:
> 
> 
> 
> On 02-11-15 11:56, Jan Schermer wrote:
>> Can those hints be disabled somehow? I was battling XFS preallocation
>> the other day, and the mount option didn't make any difference - maybe
>> because those hints have precedence (which could mean they aren't
>> working as they should), maybe not.
>> 
> 
> This config option?
> 
> OPTION(rbd_enable_alloc_hint, OPT_BOOL, true) // when writing a object,
> it will issue a hint to osd backend to indicate the expected size object
> need
> 
> Found in src/common/config_opts.h
> 
> Wido
> 

Thanks, but can this option be set for a whole OSD by default?

Jan

>> In particular, when you fallocate a file, some number of blocks will be
>> reserved without actually allocating the blocks. When you then dirty a
>> block with write and flush, metadata needs to be written (in journal,
>> synchronously) <- this is slow with all drives, and extremely slow with
>> sh*tty drives (doing benchmark on such a file will yield just 100 write
>> IOPs, but when you allocate the file previously with dd if=/dev/zero it
>> will have 6000 IOPs!) - and there doesn't seem to be a way to disable it
>> in XFS. Not sure if hints should help or if they are actually causing
>> the problem (I am not clear on whether they preallocate metadata blocks
>> or just block count). Ext4 does the same thing.
>> 
>> Might be worth looking into?
>> 
>> Jan
>> 
>> 
>>> On 31 Oct 2015, at 19:36, Gregory Farnum <gfar...@redhat.com
>>> <mailto:gfar...@redhat.com>> wrote:
>>> 
>>> On Friday, October 30, 2015, mad Engineer <themadengin...@gmail.com
>>> <mailto:themadengin...@gmail.com>> wrote:
>>> 
>>>    I am learning Ceph block storage and read that each object size is
>>>    4 MB. I am not clear about the concepts of object storage yet: what
>>>    will happen if the actual size of data written to a block is less
>>>    than 4 MB, let's say 1 MB? Will it still create an object of 4 MB
>>>    size and keep the rest of the space free and unusable?
>>> 
>>> 
>>> No, it will only take up as much space as you write (plus some
>>> metadata). Although I think RBD passes down io hints suggesting the
>>> object's final size will be 4MB so that the underlying storage (eg
>>> xfs) can prevent fragmentation.
>>> -Greg
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding the number of TCP connections between clients and OSDs

2015-10-26 Thread Jan Schermer
If we're talking about RBD clients (qemu) then the number also grows with 
number of volumes attached to the client. With a single volume it was <1000. It 
grows when there's heavy IO happening in the guest.
I had to bump up the open file limits to several thousand (8000 was it?) to 
accommodate a client with 10 volumes in our cluster. We just scaled the number of 
OSDs down, so hopefully I can get a graph of that.
But I just guesstimated what it could become, and that's not necessarily what 
the theoretical limit is. Very bad things happen when you reach that threshold. 
It could also depend on the guest settings (like queue depth), and how much it 
seeks over the drive (how many different PGs it hits), but knowing the upper 
bound is most critical.
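
The knobs I'd look at for raising that limit - the exact names depend on the distro and libvirt version, so treat this as a sketch:

# /etc/libvirt/qemu.conf - applies to qemu processes spawned by libvirt (restart libvirtd afterwards)
max_files = 32768

# or system-wide via /etc/security/limits.conf for qemu started outside libvirt:
# *  soft  nofile  32768
# *  hard  nofile  32768

# verify on a running qemu process:
grep 'open files' /proc/<qemu pid>/limits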

Jan

> On 26 Oct 2015, at 21:32, Rick Balsano  wrote:
> 
> We've run into issues with the number of open TCP connections from a single 
> client to the OSDs in our Ceph cluster.
> 
> We can (& have) increased the open file limit to work around this, but we're 
> looking to understand what determines the number of open connections 
> maintained between a client and a particular OSD. Our naive assumption was 1 
> open TCP connection per OSD or per port made available by the Ceph node. 
> There are many more than this, presumably to allow parallel connections, 
> because we see 1-4 connections from each client per open port on a Ceph node.
> 
> Here is some background on our cluster:
> * still running Firefly 0.80.8
> * 414 OSDs, 35 nodes, one massive pool
> * clients are KVM processes, accessing Ceph RBD images using virtio
> * total number of open TCP connections from one client to all nodes between 
> 500-1000 
> 
> Is there any way to either know or cap the maximum number of connections we 
> should expect?
> 
> I can provide more info as required. I've done some searches and found 
> references to "huge number of TCP connections" but nothing concrete to tell 
> me how to predict how that scales.
> 
> Thanks,
> Rick
> -- 
> Rick Balsano
> Senior Software Engineer
> Opower 
> 
> O +1 571 384 1210
> We're Hiring! See jobs here .
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow ssd journal

2015-10-23 Thread Jan Schermer
The drive you have is not suitable at all for a journal. Horrible, actually.

"test with fio (qd=32,128,256, bs=4k) show very good performance of SSD disk 
(10-30k write io)."

This is not realistic. Try something like this instead (synchronous writes at queue 
depth 1; point --filename at a scratch file or an unused partition on the SSD, never 
at anything holding data):

fio --name=journal-test --filename=<scratch file or spare SSD partition> --rw=write --bs=4k --size=1G --sync=1 --fsync=1 --direct=1 --iodepth=1 --ioengine=libaio --runtime=60 --time_based

Jan

On 23 Oct 2015, at 16:31, K K  wrote:

Hello.

Some strange things happen with my ceph installation after I moved the journal 
to an SSD disk.

OS: Ubuntu 15.04 with ceph version 0.94.2-0ubuntu0.15.04.1
server: dell r510 with PERC H700 Integrated 512MB RAID cache
my cluster have:
1 monitor node
2 OSD nodes with 6 OSD daemons at each server (3Tb HDD SATA 7200 rpm disks XFS 
system). 
network: 1Gbit to hypervisor and 1 Gbit among all ceph nodes
ceph.conf:
[global]
public network = 10.12.0.0/16
cluster network = 192.168.133.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
filestore xattr use omap = true
filestore max sync interval = 10
filestore min sync interval = 1
filestore queue max ops = 500
#filestore queue max bytes = 16 MiB
#filestore queue committing max ops = 4096
#filestore queue committing max bytes = 16 MiB
filestore op threads = 20
filestore flusher = false
filestore journal parallel = false
filestore journal writeahead = true
#filestore fsync flushes journal data = true
journal dio = true
journal aio = true
osd pool default size = 2 # Write an object n times.
osd pool default min size = 1 # Allow writing n copy in a degraded state.
osd pool default pg num = 333
osd pool default pgp num = 333
osd crush chooseleaf type = 1

[client]
rbd cache = true
rbd cache size = 102400
rbd cache max dirty = 12800

[osd]
osd journal size = 5200
#osd journal = /dev/disk/by-partlabel/journal-$id

Without the SSD as a journal I have ~112MB/sec throughput.

After I added a 64GB ADATA SSD for the journal disk and created 6 raw 
partitions, I get very slow bandwidth with rados bench:

Total time run: 302.350730
Total writes made: 1146
Write size: 4194304
Bandwidth (MB/sec): 15.161

Stddev Bandwidth: 11.5658
Max bandwidth (MB/sec): 52
Min bandwidth (MB/sec): 0
Average Latency: 4.21521
Stddev Latency: 1.25742
Max latency: 8.32535
Min latency: 0.277449
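
(Output in that shape usually comes from something along the lines of "rados bench -p rbd 300 write" - the pool name here is just an example.)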

iostat shows only a few write IOs (no more than 200):


Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s     wkB/s avgrq-sz avgqu-sz    await r_await  w_await  svctm  %util
sdh        0.00    0.00   0.00    8.00    0.00   1024.00   256.00   129.48  2120.50    0.00  2120.50 124.50  99.60
sdh        0.00    0.00   0.00  124.00    0.00  14744.00   237.81   148.44  1723.81    0.00  1723.81   8.10 100.40
sdh        0.00    0.00   0.00  114.00    0.00  13508.00   236.98   144.27  1394.91    0.00  1394.91   8.77 100.00
sdh        0.00    0.00   0.00  122.00    0.00  13964.00   228.92   122.99  1439.74    0.00  1439.74   8.20 100.00
sdh        0.00    0.00   0.00  161.00    0.00  19640.00   243.98   154.98  1251.16    0.00  1251.16   6.21 100.00
sdh        0.00    0.00   0.00   11.00    0.00   1408.00   256.00   152.68   717.09    0.00   717.09  90.91 100.00
sdh        0.00    0.00   0.00  154.00    0.00  18696.00   242.81   142.09  1278.65    0.00  1278.65   6.49 100.00

Tests with fio (qd=32,128,256, bs=4k) show very good performance of the SSD disk 
(10-30k write IOPS).

Can anybody help me? Has anyone faced a similar problem?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [urgent] KVM issues after upgrade to 0.94.4

2015-10-21 Thread Jan Schermer
If I'm reading it correctly his cmdline says cache=none for the rbd device, so 
there should be no writeback caching:

file=rbd:Primary-ubuntu-1/c3f90fb4-c1a6-4e99-a2c0-64ae4517412e:id=admin:key=AQDiDbJR2GqPABAAWCcsUQ+UQwK8z9c6LWrizw==:auth_supported=cephx\;none:mon_host=ceph-mon.csprdc.arhont.com\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=none

If that's actually overridden by a ceph.conf setting then that is another bug I 
guess :-)

Jan
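
For reference, the write-through workaround Jason mentions below would presumably look like this in ceph.conf on the client side (untested):

[client]
rbd cache = true
rbd cache max dirty = 0   # no dirty data is ever held, i.e. effectively write-through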



> On 21 Oct 2015, at 19:46, Jason Dillaman  wrote:
> 
> There is an edge case with cloned image writeback caching that occurs after 
> an attempt to read a non-existent clone RADOS object, followed by a write to 
> said object, followed by another read.  This second read will cause the 
> cached write to be flushed to the OSD while the appropriate locks are not 
> being held.  This issue is being tracked via an upstream tracker ticket [1].
> 
> This issue affects librbd clients using v0.94.4 and v9.x.  Disabling the 
> cache or switching to write-through caching (rbd_cache_max_dirty = 0) should 
> avoid the issue until it is fixed in the next Ceph release.
> 
> [1] http://tracker.ceph.com/issues/13559
> 
> -- 
> 
> Jason Dillaman 
> 
> 
> - Original Message - 
> 
>> From: "Andrei Mikhailovsky" 
>> To: ceph-us...@ceph.com
>> Sent: Wednesday, October 21, 2015 8:17:39 AM
>> Subject: [ceph-users] [urgent] KVM issues after upgrade to 0.94.4
> 
>> Hello guys,
> 
>> I've upgraded to the latest Hammer release and I've just noticed a massive
>> issue after the upgrade (((
> 
>> I am using ceph for virtual machine rbd storage over cloudstack. I am having
>> issues with starting virtual routers. The libvirt error message is:
> 
>> cat r-1407-VM.log
>> 2015-10-21 11:04:59.262+: starting up
>> LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
>> QEMU_AUDIO_DRV=none /usr/bin/kvm-spice -name r-1407-VM -S -machine
>> pc-i440fx-trusty,accel=kvm,usb=off -m 256 -realtime mlock=off -smp
>> 1,sockets=1,cores=1,threads=1 -uuid 815d2860-cc7f-475d-bf63-02814c720fe4
>> -no-user-config -nodefaults -chardev
>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/r-1407-VM.monitor,server,nowait
>> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown
>> -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device
>> virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive
>> file=rbd:Primary-ubuntu-1/c3f90fb4-c1a6-4e99-a2c0-64ae4517412e:id=admin:key=AQDiDbJR2GqPABAAWCcsUQ+UQwK8z9c6LWrizw==:auth_supported=cephx\;none:mon_host=ceph-mon.csprdc.arhont.com\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=none
>> -device
>> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2
>> -drive
>> file=/usr/share/cloudstack-common/vms/systemvm.iso,if=none,id=drive-ide0-1-0,readonly=on,format=raw,cache=none
>> -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1
>> -netdev tap,fd=54,id=hostnet0,vhost=on,vhostfd=55 -device
>> virtio-net-pci,netdev=hostnet0,id=net0,mac=02:00:2e:f7:00:18,bus=pci.0,addr=0x3,rombar=0,romfile=
>> -netdev tap,fd=56,id=hostnet1,vhost=on,vhostfd=57 -device
>> virtio-net-pci,netdev=hostnet1,id=net1,mac=0e:00:a9:fe:01:42,bus=pci.0,addr=0x4,rombar=0,romfile=
>> -netdev tap,fd=58,id=hostnet2,vhost=on,vhostfd=59 -device
>> virtio-net-pci,netdev=hostnet2,id=net2,mac=06:0c:b6:00:02:13,bus=pci.0,addr=0x5,rombar=0,romfile=
>> -chardev pty,id=charserial0 -device
>> isa-serial,chardev=charserial0,id=serial0 -chardev
>> socket,id=charchannel0,path=/var/lib/libvirt/qemu/r-1407-VM.agent,server,nowait
>> -device
>> virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=r-1407-VM.vport
>> -device usb-tablet,id=input0 -vnc 192.168.169.2:10,password -device
>> cirrus-vga,id=video0,bus=pci.0,addr=0x2
>> Domain id=42 is tainted: high-privileges
>> libust[20136/20136]: Warning: HOME environment variable not set. Disabling
>> LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)
>> char device redirected to /dev/pts/13 (label charserial0)
>> librbd/LibrbdWriteback.cc: In function 'virtual ceph_tid_t
>> librbd::LibrbdWriteback::write(const object_t&, const object_locator_t&,
>> uint64_t, uint64_t, const SnapContext&, const bufferlist&, utime_t,
>> uint64_t, __u32, Context*)' thread 7ffa6b7fe700 time 2015-10-21
>> 12:05:07.901876
>> librbd/LibrbdWriteback.cc: 160: FAILED assert(m_ictx->owner_lock.is_locked())
>> ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
>> 1: (()+0x17258b) [0x7ffa92ef758b]
>> 2: (()+0xa9573) [0x7ffa92e2e573]
>> 3: (()+0x3a90ca) [0x7ffa9312e0ca]
>> 4: (()+0x3b583d) [0x7ffa9313a83d]
>> 5: (()+0x7212c) [0x7ffa92df712c]
>> 6: (()+0x9590f) [0x7ffa92e1a90f]
>> 7: (()+0x969a3) [0x7ffa92e1b9a3]
>> 8: (()+0x4782a) [0x7ffa92dcc82a]
>> 9: (()+0x56599) [0x7ffa92ddb599]
>> 10: (()+0x7284e) [0x7ffa92df784e]
>> 11: (()+0x162b7e) [0x7ffa92ee7b7e]
>> 

Re: [ceph-users] Ceph OSDs with bcache experience

2015-10-21 Thread Jan Schermer

> On 21 Oct 2015, at 09:11, Wido den Hollander  wrote:
> 
> On 10/20/2015 09:45 PM, Martin Millnert wrote:
>> The thing that worries me with your next-gen design (actually your current 
>> design as well) is SSD wear. If you use an Intel SSD at 10 DWPD, that's 12TB/day 
>> per 64TB total. I guess it is use-case dependent, and perhaps a 1:4 write/read 
>> ratio is quite high in terms of writes as-is.
>> You're also throughput-limiting yourself to the PCIe bandwidth of the NVMe device 
>> (regardless of NVRAM/SSD). Compared to a traditional interface, that may be ok 
>> of course in relative terms. NVRAM vs SSD here is simply a choice between 
>> wear (NVRAM as journal minimum) and cache hit probability (size).  
>> Interesting thought experiment anyway for me, thanks for sharing Wido.
>> /M
> 
> We are looking at the PC 3600DC 1.2TB, according to the specs from
> Intel: 10.95PBW
> 
> Like I mentioned in my reply to Mark, we are still running on 1Gbit and
> heading towards 10Gbit.
> 
> Bandwidth isn't really an issue in our cluster. During peak moments we
> average about 30k IOps through the cluster, but the TOTAL client I/O is
> just 1Gbit Read and Write. Sometimes a bit higher, but mainly small I/O.
> 
> Bandwidth-wise there is no need for 10Gbit, but we are doing it for the
> lower latency and thus more IOps.
> 
> Currently our S3700 SSDs are peaking at 50% utilization according to iostat.
> 
> After 2 years of operation the lowest Media_Wearout_Indicator we see is
> 33. On Intel SSDs this starts at 100 and counts down to 0. 0 indicating
> that the SSD is worn out.
> 
> So in 24 months we have worn through 67% of the SSD. A quick calculation
> tells me we still have 12 months left on that SSD before it dies.

Could you maybe run isdct and compare what it says about expected lifetime? I 
think isdct will report a much longer lifetime than you expect.

For comparison one of my drives (S3610, 1.2TB) - this drive has 3 DWPD rating 
(~6.5PB written)

241 Total_LBAs_Written  0x0032   100   100   000Old_age   Always   
-   1487714 <-- units of 32MB, that translates to ~47TB
233 Media_Wearout_Indicator 0x0032   100   100   000Old_age   Always   
-   0 (maybe my smartdb needs updating, but this is what it says)
9 Power_On_Hours  0x0032   100   100   000Old_age   Always   -  
 1008

If I extrapolate this blindly I would expect the SSD to reach its TBW of 6.5PB 
in about 15 years.
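
Back-of-the-envelope from the SMART numbers above: ~47TB written in ~1008 power-on hours; 6.5PB / 47TB is roughly 138x that, and 138 x 1008 h is about 139,000 hours, i.e. ~15.9 years.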

But isdct says:
EnduranceAnalyzer: 46.02 Years

If I reverse it and calculate the endurance based on SMART values, that would 
give expected lifetime writes of over 18PB (which is not impossible at all), but 
isdct is a bit smarter and looks at what the current use pattern is. It's 
clearly not only about discarding the initial bursts when the drive was filled 
during backfilling, because that's not that much, and all my S3610 drives indicate 
a similar endurance of 40 years (+-10).

I'd trust isdct over extrapolated SMART values - I think the SSD will actually 
switch to a different calculation scheme when it reaches a certain point in its life 
(when all reserve blocks are used, or when the first cells start to die...) which 
is why there's a discrepancy.

Jan


> 
> But this is the lowest, other SSDs which were taken into production at
> the same moment are ranging between 36 and 61.
> 
> Also, when buying the 1.2TB SSD we'll probably allocate only 1TB of the
> SSD and leave 200GB of cells spare so the Wear-Leveling inside the SSD
> has some spare cells.
> 
> Wido
> 
>> 
>>  Original message 
>> From: Wido den Hollander  
>> Date: 20/10/2015  16:00  (GMT+01:00) 
>> To: ceph-users  
>> Subject: [ceph-users] Ceph OSDs with bcache experience 
>> 
>> Hi,
>> 
>> In the "newstore direction" thread on ceph-devel I wrote that I'm using
>> bcache in production and Mark Nelson asked me to share some details.
>> 
>> Bcache is running in two clusters now that I manage, but I'll keep this
>> information to one of them (the one at PCextreme behind CloudStack).
>> 
>> In this cluster has been running for over 2 years now:
>> 
>> epoch 284353
>> fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
>> created 2013-09-23 11:06:11.819520
>> modified 2015-10-20 15:27:48.734213
>> 
>> The system consists out of 39 hosts:
>> 
>> 2U SuperMicro chassis:
>> * 80GB Intel SSD for OS
>> * 240GB Intel S3700 SSD for Journaling + Bcache
>> * 6x 3TB disk
>> 
>> This isn't the newest hardware. The next batch of hardware will be more
>> disks per chassis, but this is it for now.
>> 
>> All systems were installed with Ubuntu 12.04, but they are all running
>> 14.04 now with bcache.
>> 
>> The Intel S3700 SSD is partitioned with a GPT label:
>> - 5GB Journal for each OSD
>> - 200GB Partition for bcache
>> 
>> root@ceph11:~# df -h|grep osd
>> /dev/bcache0   2.8T  1.1T  1.8T  38% /var/lib/ceph/osd/ceph-60
>> /dev/bcache1   2.8T  1.2T  1.7T  41% /var/lib/ceph/osd/ceph-61
>> /dev/bcache2   2.8T  930G  1.9T  

Re: [ceph-users] Help with Bug #12738: scrub bogus results when missing a clone

2015-10-21 Thread Jan Schermer
We just had to look into a similar problem (missing clone objects, extraneous 
clone objects, wrong sizes on a few objects...)

You should do something like this:

1) find all OSDs hosting the PG
ceph pg map 8.e82
2) find the directory with the object on the OSDs
should be something like /var/lib/ceph/osd/ceph-XX/current/8.e82_head/

3) look in this directory for files named like what you see in logs 
(rb.0.bfcb12.238e1f29.002acd39) 
there are _head_ objects that contain the original data, and then objects named 
with the snapshot id instead (_1fc8ce82_ instead of _head_)

4) compare what files are there on the OSDs
5a) you are lucky and one of the OSDs has them - in that case you could either 
copy them to the others (don't forget xattrs! - there's a rough sketch after this 
list) or rebuild them via backfills from the good OSD
5b) you are not that lucky and the files are not there - I'm not that sure what 
to do then
You could in theory just copy the _head_ object contents to the missing objects 
and then drop the image.
Or you could maybe just delete the _head_ objects (since you don't need that 
image anymore), but I don't know whether there's some info stored (in leveldb, 
or somewhere else) about the rbd image or if all the info is in the objects 
themselves.
I think others here will help you more in that case. 
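
For 5a, a rough sketch of copying one object file together with its xattrs (OSD ids, hostnames and file names are made up; stop the destination OSD first and mind the hashed DIR_* subdirectories on both sides):

# on the node with the good copy:
cd /var/lib/ceph/osd/ceph-12/current/8.e82_head/<DIR_...>/
getfattr -d -m '.*' --absolute-names <object file> > /tmp/obj.xattrs
scp -p <object file> /tmp/obj.xattrs othernode:/tmp/

# on the node with the broken copy, with its OSD stopped:
cp -p /tmp/<object file> /var/lib/ceph/osd/ceph-34/current/8.e82_head/<DIR_...>/
setfattr --restore=/tmp/obj.xattrs   # note: the paths inside the dump must match the destination
# start the OSD again and re-run a deep-scrub / "ceph pg repair 8.e82" to verify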

I'm not sure if there's an option to "delete rbd image, ignore missing files, 
call it a day" - that one would be handy for situations like this.

Jan



> On 21 Oct 2015, at 09:01, Chris Taylor  wrote:
> 
> Is there some way to manually correct this error while this bug is still 
> needing review? I have one PG that is stuck inconsistent with the same error. 
> I already created a new RBD image and migrated the data to it. The original 
> RBD image was "rb.0.ac3386.238e1f29". The new image is "rb.0.bfcb12.238e1f29".
> 
>  
> 2015-10-20 19:18:07.686783 7f50e4c1d700 0 log_channel(cluster) log [INF] : 
> 8.e82 repair starts
> 2015-10-20 19:18:40.300721 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> repair 8.e82 1fc8ce82/rb.0.ac3386.238e1f29.0008776e/snapdir//8 missing 
> clones
> 2015-10-20 19:18:40.301094 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> repair 8.e82 9cc8ce82/rb.0.bfcb12.238e1f29.002acd39/head//8 expected 
> clone 1fc8ce82/rb.0.ac3386.238e1f29.0008776e/44//8
> 2015-10-20 19:18:40.301124 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> repair 8.e82 fb78ce82/rb.0.bfcb12.238e1f29.000e69a3/head//8 expected 
> clone 9cc8ce82/rb.0.bfcb12.238e1f29.002acd39/44//8
> 2015-10-20 19:18:40.301140 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> repair 8.e82 8038ce82/rb.0.bfcb12.238e1f29.002b7781/head//8 expected 
> clone fb78ce82/rb.0.bfcb12.238e1f29.000e69a3/44//8
> 2015-10-20 19:18:40.301155 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> repair 8.e82 c8b7ce82/rb.0.bfcb12.238e1f29.00059252/head//8 expected 
> clone 8038ce82/rb.0.bfcb12.238e1f29.002b7781/44//8
> 2015-10-20 19:18:40.301170 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> repair 8.e82 9d26ce82/rb.0.bfcb12.238e1f29.000cd86d/head//8 expected 
> clone c8b7ce82/rb.0.bfcb12.238e1f29.00059252/44//8
> 2015-10-20 19:18:40.301185 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> repair 8.e82 c006ce82/rb.0.bfcb12.238e1f29.000c53d6/head//8 expected 
> clone 9d26ce82/rb.0.bfcb12.238e1f29.000cd86d/44//8
> 2015-10-20 19:18:40.301200 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> repair 8.e82 3434ce82/rb.0.bfcb12.238e1f29.002cb957/head//8 expected 
> clone c006ce82/rb.0.bfcb12.238e1f29.000c53d6/44//8
> 2015-10-20 19:18:47.724047 7f50e4c1d700 -1 log_channel(cluster) log [ERR] : 
> 8.e82 repair 8 errors, 0 fixed
> 
>  
> Thanks,
> 
> Chris
> 
>  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

