>> inline

> On 29 Jan 2016, at 05:03, Somnath Roy <somnath....@sandisk.com> wrote:
> 
> <<inline
>  
> From: Jan Schermer [mailto:j...@schermer.cz]
> Sent: Thursday, January 28, 2016 3:51 PM
> To: Somnath Roy
> Cc: Tyler Bishop; ceph-users@lists.ceph.com
> Subject: Re: SSD Journal
>  
> Thanks for a great walkthrough explanation.
> I am not really going to comment on everything (nor am I capable of it), but.. 
> see below
>  
> On 28 Jan 2016, at 23:35, Somnath Roy <somnath....@sandisk.com> wrote:
>  
> Hi,
> Ceph needs to maintain a journal in the case of filestore because an underlying 
> filesystem like XFS *doesn't have* any transactional semantics. Ceph has to 
> do a transactional write of data and metadata in the write path. It does so in 
> the following way.
>  
> "Ceph has to do a transactional write with data and metadata in the write 
> path"
> Why? Isn't that only to provide that to itself?
> 
> [Somnath] Yes, that is for Ceph..That’s 2 setattrs (for rbd) + PGLog/Info..
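> 
> To make that concrete, here is a rough sketch of what such a compound transaction 
> amounts to (illustrative only -- the type and method names below are invented, not 
> the actual Ceph ObjectStore API): the payload write and the attribute/PG log updates 
> are bundled into one unit that must land atomically.
> 
>     #include <cstdint>
>     #include <string>
>     #include <utility>
>     #include <vector>
> 
>     // Hypothetical stand-in for a filestore transaction: one payload write
>     // plus the metadata updates that must commit together with it.
>     struct Op {
>       enum class Type { Write, SetAttr, OmapSetKeys };
>       Type type;
>       std::string object;
>       uint64_t offset;
>       std::vector<uint8_t> data;   // payload, attr value or omap value
>       std::string key;             // attr name or omap key
>     };
> 
>     struct Transaction {
>       std::vector<Op> ops;
>       void write(const std::string& o, uint64_t off, std::vector<uint8_t> d) {
>         ops.push_back({Op::Type::Write, o, off, std::move(d), {}});
>       }
>       void setattr(const std::string& o, const std::string& k, std::vector<uint8_t> v) {
>         ops.push_back({Op::Type::SetAttr, o, 0, std::move(v), k});
>       }
>       void omap_setkeys(const std::string& o, const std::string& k, std::vector<uint8_t> v) {
>         ops.push_back({Op::Type::OmapSetKeys, o, 0, std::move(v), k});
>       }
>     };
> 
>     // An RBD partial overwrite becomes: data write + 2 setattrs + PG log/info.
>     Transaction make_rbd_write_tx(const std::string& obj, uint64_t off,
>                                   std::vector<uint8_t> payload) {
>       Transaction t;
>       t.write(obj, off, std::move(payload));
>       t.setattr(obj, "object_info", {});             // placeholder attr names/values
>       t.setattr(obj, "snapset", {});
>       t.omap_setkeys("pg_meta", "pg_log_entry", {});  // PG log/info update
>       return t;
>     }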

And why does Ceph need that? Aren't we going in circles here? No client needs 
those transactions, so there's no point in Ceph needing them either.

>  
> 1. It creates a transaction object containing multiple metadata operations and 
> the actual payload write.
>  
> 2. It is passed to the ObjectStore layer.
>  
> 3. The ObjectStore backend can complete the transaction in a sync or async (FileStore) way.
>  
> Depending on whether the write was flushed or not? How is that decided?
> [Somnath] It depends on how the ObjectStore backend is written.. not 
> dynamic.. FileStore is implemented in an async way; I think BlueStore is written in 
> a sync way (?)..
> 
>  
> 4. Filestore dumps the entire Transaction object to the journal. The journal is a 
> circular buffer and is written to the disk sequentially, opened with O_DIRECT | 
> O_DSYNC.
>  
> Just FYI, O_DIRECT doesn't really guarantee "no buffering"; its purpose is 
> just to avoid needless caching.
> It should behave the way you want on Linux, but you must not rely on it since 
> this guarantee is not portable.
> 
> [Somnath] O_DIRECT alone doesn't guarantee it, but with O_DSYNC the write is 
> guaranteed to reach the disk.. It may still be in the disk cache, but that is 
> taken care of by the disk..

O_DSYNC is the same as calling fdatasync() after each write. This only flushes the 
data (plus the metadata needed to retrieve it), not the rest of the metadata. So if 
your "transactions" need that metadata (and I think they do) then you don't get the 
expected consistency. In practice it could flush effectively everything.
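
To illustrate the distinction (a sketch of the POSIX semantics, not Ceph code): O_DSYNC 
behaves like an implicit fdatasync() after every write, while fsync() is the call that 
also flushes the remaining inode metadata.

    #include <fcntl.h>
    #include <unistd.h>

    // These two are roughly equivalent:
    //   1) write() on an fd opened with O_DSYNC
    //   2) write() followed by fdatasync()
    // Neither promises that all inode metadata is on disk.
    void write_durably(int fd, const void* buf, size_t len) {
      write(fd, buf, len);   // data queued in the page cache
      fdatasync(fd);         // data (plus metadata needed to read it back) flushed
      // fsync(fd);          // would additionally flush the remaining inode metadata
    }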

>  
> 5. Once the journal write is successful, the write is acknowledged to the client. 
> Reads of this data are not allowed yet, as it has still not been written to the 
> actual location in the filesystem.
>  
> Now you are providing a guarantee for something nobody really needs. There is 
> no guarantee with traditional filesystems of not returning dirty unwritten 
> data. The guarantees are on writes, not reads. It might be easier to do it 
> this way if you plan for some sort of concurrent access to the same data from 
> multiple readers (that don't share the cache) - but is that really the case 
> here if it's still the same OSD that serves the data?
> Do the journals absorb only the unbuffered IO, or all IO?
>  
> And what happens currently if I need to read the written data right away? When 
> do I get it then?
> 
> [Somnath] Well, this is debatable, but currently reads are blocked until the 
> entire Tx execution is completed (not until after syncfs).. The journal absorbs 
> all the IO..

So a database doing checkpoint read/modify/write is going to suffer greatly? 
That might explain a few more things I've seen.
But it's not needed anyway; in fact things like databases are very likely to 
write to the same place over and over again, and you should in fact accommodate 
them by caching.


>  
> 6. The actual execution of the transaction is done in parallel for filesystems 
> that can do checkpointing, like BTRFS. For filesystems like XFS/ext4 the journal 
> is write-ahead, i.e. the Tx object will be written to the journal first and then 
> the Tx execution will happen.
>  
> 7. Tx execution is done in parallel by the filestore worker threads. The 
> payload write is a buffered write, and a sync thread within filestore 
> periodically calls 'syncfs' to persist data/metadata to the actual location.
>  
> 8. Before each 'syncfs' call it determines the seq number up to which data is 
> persisted and trims the transaction objects from the journal up to that point. This 
> makes room for more writes in the journal. If the journal is full, writes will 
> be stuck.
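> 
> A rough sketch of that sync-and-trim loop (illustrative only, names invented): syncfs() 
> persists everything applied so far, and the journal can then be trimmed up to the 
> highest Tx seq that had already been applied before the call.
> 
>     // Build with g++; syncfs() is Linux-specific (_GNU_SOURCE).
>     #include <unistd.h>
>     #include <atomic>
>     #include <chrono>
>     #include <cstdint>
>     #include <thread>
> 
>     std::atomic<uint64_t> applied_seq{0};   // highest Tx applied to the backend fs
>     uint64_t committed_seq = 0;             // highest Tx known to be durable
> 
>     // Placeholder: a real journal would advance the circular buffer's start
>     // pointer past every entry with seq <= s, freeing that space.
>     void journal_trim_to(uint64_t s) { (void)s; }
> 
>     void sync_thread(int osd_fs_fd) {
>       for (;;) {
>         std::this_thread::sleep_for(std::chrono::seconds(5));  // sync interval (assumed)
>         uint64_t seq = applied_seq.load();  // snapshot BEFORE syncing
>         syncfs(osd_fs_fd);                  // persist buffered data + metadata
>         committed_seq = seq;                // everything up to seq is now on disk
>         journal_trim_to(seq);               // entries <= seq are no longer needed
>       }
>     }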
>  
> 9. If the OSD crashes after the write is acknowledged, the Tx will be replayed from 
> the last successful backend commit seq number (maintained in a file, updated after 
> 'syncfs').
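> 
> The replay on startup is then conceptually just this (a sketch with hypothetical 
> helpers, declarations only): read the last committed seq and re-apply every journal 
> entry with a higher seq.
> 
>     #include <cstdint>
> 
>     struct Transaction;
>     // Hypothetical journal iteration helpers, declarations only.
>     struct JournalEntry {
>       uint64_t seq;
>       Transaction* tx;
>       bool valid() const { return tx != nullptr; }
>     };
>     JournalEntry read_first_entry();
>     JournalEntry read_next_entry();
>     void apply_transaction(Transaction* tx);
> 
>     // Re-apply everything newer than the last commit marker written after 'syncfs'.
>     void replay_journal(uint64_t committed_seq) {
>       for (JournalEntry e = read_first_entry(); e.valid(); e = read_next_entry()) {
>         if (e.seq <= committed_seq)
>           continue;                // already persisted before the crash
>         apply_transaction(e.tx);   // replay; application must be idempotent
>       }
>     }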
>  
>  
> You can just completely rip at least steps 6-9 out and mirror what the client sends 
> to the filesystem, with the same effect (and without a journal). Who cares how 
> the filesystem implements it then; everybody can choose the filesystem that 
> matches the workload (e.g. the one they already use on a physical volume they 
> are migrating from).
> It's a sensible solution to a non-existent problem...
>  
> [Somnath] Maybe, but different clients have different requirements; I guess you can't 
> design the OSD based on what a client will do.. One has to make every effort to keep 
> the OSD crash consistent IMO..

Exactly. EXT4 and XFS might have different behaviour with applications, so if 
your database runs on XFS you make the OSD filesystem XFS as well and it will 
mimic this behaviour. It shouldn't matter what filesystem you use unless the 
app does something stupid.

> Probably it would be better if filestore gave the user a choice of whether to use the 
> journal or not based on the client's need…. If the client can live without being 
> consistent, so be it..

My point is that you don't lose any consistency. You will always be crash 
consistent unless you acknowledge something which wasn't written yet. You don't 
need OSD transactions to do that, you don't need multi-IO transactions and you 
don't need any 2-phase commit. Filesystems don't make any use of them.

> 
> 
> So, as you can see, it's not a flaw but a necessity to have a journal for 
> filestore in the case of an rbd workload, as it can do partial overwrites. It is not 
> needed for full-object writes, and that's the reason Sage came up with the new 
> store, which will not be doing double writes for an object workload.
> The keyvaluestore backend also doesn't have any journal, as it relies on a 
> backend like leveldb/rocksdb for that.
>  
> Regarding Jan's point about a block vs file journal, IMO the only advantage of the 
> journal being a block device is that filestore can do aio writes to it.
>  
> You also don't have the filesystem journal. You can simply divide the whole 
> block device into 4MB blocks and use them.
> But my point was that you are getting even closer to reimplementing a 
> filesystem in userspace, which is just nonsense.
> 
> [Somnath] Ceph tries to do some coalescing internally, but consider that the 
> filesystem does coalescing for the journal file writes as well, and it is smart 
> about it..

Filesystems already do that... schedulers already do that. Can Ceph do that 
better (faster, safer)? Sorry, but no way...
Keep in mind that those mechanisms aren't really disabled anyway so you 
duplicate whatever it is they do...

> Other than this transaction logic, it is piggybacking on the filesystem, 
> so I don't think filestore is anywhere near re-implementing a 
> filesystem.. IMO the burden on BlueStore is much greater than on filestore, as it 
> has to implement a lot of stuff now that we were safely relying on the filesystem 
> for so far…
>  
> Now, here is what SanDisk changed..
>  
> 1. In the write path Filestore has to do some throttling, as the journal can't go 
> much further ahead than the actual backend write (Tx execution). We have introduced 
> dynamic throttling based on the journal fill rate and a % increase over the config 
> option filestore_queue_max_bytes. This config option keeps track of the 
> outstanding backend write bytes.
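> 
> Conceptually the throttle looks something like this (a sketch of the idea, not the 
> actual patch; every name other than filestore_queue_max_bytes is invented): the 
> allowed backlog of journaled-but-not-yet-executed bytes starts at 
> filestore_queue_max_bytes and may expand by a configured percentage as the journal 
> fills.
> 
>     #include <cstdint>
> 
>     struct ThrottleCfg {
>       uint64_t queue_max_bytes;    // filestore_queue_max_bytes
>       double   max_expansion;      // e.g. 0.5 = allow up to +50% on top of the base limit
>     };
> 
>     // Allowed outstanding (journaled but not yet executed) bytes,
>     // scaled by how full the journal currently is.
>     uint64_t current_limit(const ThrottleCfg& cfg,
>                            uint64_t journal_used, uint64_t journal_size) {
>       double fill = double(journal_used) / double(journal_size);   // 0..1
>       return uint64_t(cfg.queue_max_bytes * (1.0 + cfg.max_expansion * fill));
>     }
> 
>     bool should_throttle(const ThrottleCfg& cfg, uint64_t outstanding_bytes,
>                          uint64_t journal_used, uint64_t journal_size) {
>       return outstanding_bytes > current_limit(cfg, journal_used, journal_size);
>     }
> 
> The exact policy is in the pull requests linked below; the point is only that the 
> limit is no longer a single static number.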
>  
> 2. Instead of a buffered write we have introduced an O_DSYNC write during 
> transaction execution, as it reduces the amount of data syncfs has to 
> write and thus gives a more stable performance.
>  
> 3. The main reason we can't allow the journal to go much further ahead is that the Tx 
> object will not be deleted until the Tx executes. The further behind the Tx execution 
> falls, the more memory growth will happen. Presently, the Tx object is deleted 
> asynchronously (and thus takes more time), and we changed it to be deleted from 
> the filestore worker thread itself.
>  
> 4. The sync thread is optimized to do a fast sync. The extra last-commit-seq 
> file is not maintained any more for *the write-ahead journal*, as this 
> information can be found in the journal header.
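> 
> I.e. something along these lines (layout invented for illustration; the real 
> FileJournal header differs): the committed-up-to seq lives in the journal header 
> block itself, so no separate commit-seq file has to be written and synced.
> 
>     #include <cstdint>
> 
>     // Illustrative write-ahead journal header, kept in block 0 of the journal.
>     struct JournalHeader {
>       uint64_t magic;
>       uint64_t journal_size;
>       uint64_t block_size;
>       uint64_t start_pos;          // offset of the oldest live entry
>       uint64_t committed_up_to;    // last seq known applied to the backend
>     };
>     // Updating committed_up_to is just one aligned O_DSYNC rewrite of block 0,
>     // replacing the extra "last commit seq" file.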
>  
> Here are the related pull requests..
>  
>  
> https://github.com/ceph/ceph/pull/7271
> https://github.com/ceph/ceph/pull/7303
> https://github.com/ceph/ceph/pull/7278
> https://github.com/ceph/ceph/pull/6743
>  
> Regarding bypassing the filesystem and accessing the block device directly: yes, that 
> should be a cleaner, simpler and more efficient solution. With Sage's BlueStore, 
> Ceph is moving towards that very fast!!!
>  
>  
> This all makes sense, but it's unnecessary.
> All you need to do is mirror the IO the client does on the filesystem serving 
> the objects. That's it. The filesystem journal already provides all the 
> guarantees you need. For example you don't need the "no read from cache" 
> guarantee because you don't get it anywhere else (so what's the use of 
> that?). You don't need atomic multi-IO transactions because they are not 
> implemented anywhere but at the *guest* filesystem level, which already has 
> to work with hard drives that have no such concept. Even if Ceph put itself 
> in the role of such a smart virtual drive that can handle multi-IO atomic 
> transactions, there are currently no consumers of those capabilities. 
>  
> What do we all really need RBD to do? Emulate a physical hard drive of 
> course. And it simply does not need to do any better, that's wasted effort. 
> Sure it would be very nice if you could offload all the trickiness of ACID 
> onto the hardware, but you can't (yet), and at this point nobody really needs 
> that - filesystems are already doing the hard work in a proven way.
> Unless you bring something new to the table which makes use of all that, you 
> only need to benchmark yourself against the physical hardware. And sadly Ceph is 
> nowhere close to single-SSD performance even when running on a beefy 
> cluster - and for what benefits, supposedly? 
>  
> Just make sure that the same IO that the guest sends gets to the filesystem 
> on the OSD. (Ok, fair enough, it's not _that_ simple, but not much more 
> complicated either - you still need to persist data for all the objects written 
> since the last flush (which btw in any real-world cluster mostly means just 
> checking, as there was likely an fsync already somewhere from other clients).)
> Bam. You're done. You just mirrored what a hard drive does, because you 
> mirrored that to a filesystem that mirrors that to a hard drive... No need for 
> journals on top of filesystems with journals, with data on filesystems with 
> journals... My databases are not that fond of the multi-ms committing limbo 
> while data falls down through those dream layers :P
>  
> I really don't know how to explain that more. I bet if you ask on LKML, 
> someone like Theodore Ts'o would say "you're doing completely superfluous 
> work" in more technical terms.
>  
> Jan
> 
> 
>  
> Thanks & Regards
> Somnath
>  
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tyler Bishop
> Sent: Thursday, January 28, 2016 1:35 PM
> To: Jan Schermer
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] SSD Journal
>  
> What approach did SanDisk take with this for Jewel?
>  
>  
>  
> Tyler Bishop
> Chief Technical Officer
> 513-299-7108 x10
> tyler.bis...@beyondhosting.net
> If you are not the intended recipient of this transmission you are notified 
> that disclosing, copying, distributing or taking any action in reliance on 
> the contents of this information is strictly prohibited.
>  
>  
> From: "Jan Schermer" <j...@schermer.cz <mailto:j...@schermer.cz>>
> To: "Tyler Bishop" <tyler.bis...@beyondhosting.net 
> <mailto:tyler.bis...@beyondhosting.net>>
> Cc: "Bill WONG" <wongahsh...@gmail.com <mailto:wongahsh...@gmail.com>>, 
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> Sent: Thursday, January 28, 2016 4:32:54 PM
> Subject: Re: [ceph-users] SSD Journal
>  
> You can't run a Ceph OSD without a journal. The journal is always there.
> If you don't have a journal partition then there's a "journal" file on the 
> OSD filesystem that does the same thing. If it's a partition then this file 
> turns into a symlink.
>  
> You will always be better off with a journal on a separate partition because 
> of the way the writeback cache in Linux works (someone correct me if I'm wrong).
> The journal needs to flush to disk quite often, and Linux is not always able 
> to flush only the journal data. You can't defer metadata flushing forever, and 
> doing fsync() makes all the dirty data flush as well. ext2/3/4 also 
> flushes data to the filesystem periodically (every 5s, is it?), which will 
> make the latency of the journal go through the roof momentarily.
> (I'll leave researching how exactly XFS does it to those who care about that 
> "filesystem'o'thing").
>  
> P.S. I feel very strongly that this whole concept is fundamentally broken. We 
> already have a journal for the filesystem which is time-proven, well behaved 
> and above all fast. Instead there's this reinvented wheel which supposedly 
> does it better in userspace while not really avoiding the filesystem journal 
> either. It would maybe make sense if the OSD were storing the data on a block 
> device directly, avoiding the filesystem altogether. But it would still do 
> the same bloody thing, and (no disrespect) ext4 does this better than Ceph 
> ever will.
>  
>  
> On 28 Jan 2016, at 20:01, Tyler Bishop <tyler.bis...@beyondhosting.net> wrote:
>  
> This is an interesting topic that i've been waiting for.
>  
> Right now we run the journal as a partition on the data disk.  I've built 
> drives without journals and the write performance seems okay, but random IO 
> performance is poor in comparison to what it should be.
>  
>  
>  
> Tyler Bishop
> Chief Technical Officer
> 513-299-7108 x10
> tyler.bis...@beyondhosting.net
> If you are not the intended recipient of this transmission you are notified 
> that disclosing, copying, distributing or taking any action in reliance on 
> the contents of this information is strictly prohibited.
>  
>  
> From: "Bill WONG" <wongahsh...@gmail.com <mailto:wongahsh...@gmail.com>>
> To: "ceph-users" <ceph-users@lists.ceph.com 
> <mailto:ceph-users@lists.ceph.com>>
> Sent: Thursday, January 28, 2016 1:36:01 PM
> Subject: [ceph-users] SSD Journal
>  
> Hi,
> I have tested an SSD journal with SATA drives and it works perfectly.. Now I am 
> testing a full-SSD Ceph cluster; with a full-SSD cluster, do I still need to have 
> a separate SSD as the journal disk? 
>  
> [assume I do not have PCIe SSD flash, which has better performance than a 
> normal SSD disk]
>  
> Please give some ideas on a full-SSD Ceph cluster... thank you!
> 
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
