Re: Is BlueFS an alternative of BlueStore?

2016-01-07 Thread peng.hse

Hi Sage,

Thanks for your quick response. Javen and I, both former ZFS developers,
are currently focusing on how to leverage some of the ZFS ideas to
improve Ceph backend performance in userspace.

Based on your encouraging reply, we have come up with two schemes for
our future work:

1. Scheme one: use an entirely new FS to replace rocksdb+bluefs. The FS
   itself handles the mapping of oid -> fs-object (a kind of ZFS dnode)
   and the corresponding attrs used by Ceph. There are the implementation
   challenges you mentioned about in-order enumeration of objects during
   backfill, scrub, etc. (we faced the same situation in ZFS, where the
   ZAP features helped us a lot). Still, from a performance and
   architecture point of view, it looks clearer and cleaner. Would you
   suggest we give it a try?

2. Scheme two: as you suggested at the end of your reply, for now we
   would just implement a simple version of the FS, which leverages
   libzpool ideas, to plug in underneath rocksdb as your bluefs does.

We would greatly appreciate your insightful reply.

Thanks





--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is BlueFS an alternative of BlueStore?

2016-01-07 Thread Javen Wu

Thanks Sage for your reply.

I am not sure I understand the challenges you mentioned about
backfill/scrub.

I will investigate the code and let you know if we can overcome the
challenge by simple means.
Our rough ideas for ZFSStore are:
1. Encapsulate the dnode object as an onode and add onode attributes.
2. Use a ZAP object as a collection. (ZFS directories use ZAP objects.)
3. Enumerating the entries in a ZAP object lists the objects in a
   collection.
4. Create a new metaslab class to store the Ceph journal.
5. Align the Ceph journal with ZFS transactions.
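
To make points 1-3 concrete, here is a minimal sketch in Python. It is
only an illustration: the names Onode, Collection, dnode_id and
list_objects are hypothetical and belong to neither libzpool nor Ceph.
The sketch sorts the names when listing, purely as a stand-in for
whatever ordering mechanism a real store would need; it makes no claim
about the order in which a real ZAP cursor returns entries.

```python
from dataclasses import dataclass, field


@dataclass
class Onode:
    """Hypothetical onode: a dnode id plus the attributes Ceph needs."""
    dnode_id: int
    attrs: dict = field(default_factory=dict)


@dataclass
class Collection:
    """Hypothetical collection backed by a name -> dnode id table,
    standing in for a ZAP object."""
    entries: dict = field(default_factory=dict)

    def add(self, name: str, onode: Onode) -> None:
        self.entries[name] = onode.dnode_id

    def list_objects(self) -> list:
        # Sorting is the sketch's stand-in for in-order enumeration.
        return sorted(self.entries)


coll = Collection()
coll.add("obj_b", Onode(dnode_id=8, attrs={"size": 4096}))
coll.add("obj_a", Onode(dnode_id=7))
listing = coll.list_objects()
```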

Actually, we have talked about the possibility of building rocksdb::Env
on top of the zfs libraries. It would have to align the ZIL (ZFS intent
log) and the RocksDB WAL. Otherwise, there is still the same problem as
with XFS and RocksDB.

ZFS is a tree-style, log-structured-like file system: once a leaf block
is updated, the modification is propagated from the leaf to the root of
the tree. To batch writes and reduce the number of disk writes, ZFS
persists modifications to disk in 5-second transactions. Only when an
fsync/sync write arrives in the middle of the 5 seconds does ZFS persist
the journal to the ZIL.
As I remember, RocksDB does a sync after adding a log record. So if we
cannot align the ZIL and the WAL, the log write would be written to the
ZIL first, then the ZIL would be applied to the log file, and finally
RocksDB would update the sst file. It's almost the same problem as with
XFS, if my understanding is correct.
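
The double write described above can be illustrated with a toy count in
Python. This is a sketch under stated assumptions, not a model of real
ZFS: the function count_device_writes and its parameters are
hypothetical, the store is assumed to commit dirty data once per
5-second transaction group, and every sync write is assumed to cost one
extra intent-log write.

```python
def count_device_writes(events, txg_interval=5.0):
    """events: iterable of (time_in_seconds, is_sync) log appends.

    Returns (intent_log_writes, txg_commits) for a toy store that
    commits dirty data once per transaction group and additionally
    persists every sync write to an intent log right away.
    """
    intent_log_writes = 0
    dirty_txgs = set()
    for t, is_sync in events:
        dirty_txgs.add(int(t // txg_interval))  # data lands in this txg
        if is_sync:
            intent_log_writes += 1              # extra write, ahead of the txg
    return intent_log_writes, len(dirty_txgs)


# Ten WAL appends, one per second, each followed by a sync.
synced = count_device_writes([(float(s), True) for s in range(10)])
# The same appends without sync only ride along with the txg commits.
buffered = count_device_writes([(float(s), False) for s in range(10)])
```

In this toy model the synced case pays ten intent-log writes on top of
the two transaction-group commits, while the buffered case pays only the
two commits.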

In my mind, aligning the ZIL and the WAL needs more modifications in
RocksDB.

Thanks
Javen



Re: Is BlueFS an alternative of BlueStore?

2016-01-07 Thread Sage Weil
On Thu, 7 Jan 2016, Javen Wu wrote:
> Hi Sage,
> 
> Sorry to bother you. I am not sure if it is appropriate to email you
> directly, but I cannot find any useful information on the Internet to
> resolve my confusion. I hope you can help me.
> 
> I happened to hear that you are going to start BlueFS to eliminate the
> redundancy between the XFS journal and the RocksDB WAL. I am a little
> confused. Is BlueFS only meant to host RocksDB for BlueStore, or is it
> an alternative to BlueStore?
> 
> I am a newcomer to Ceph, and I am not sure my understanding of
> BlueStore is correct. BlueStore in my mind is as below.
> 
>              BlueStore
>              =========
>    RocksDB
> +-----------+      +-----------+
> |   onode   |      |           |
> |    WAL    |      |           |
> |   omap    |      |           |
> +-----------+      |   bdev    |
> |           |      |           |
> |    XFS    |      |           |
> |           |      |           |
> +-----------+      +-----------+

This is the picture before BlueFS enters the picture.

> I am curious: if BlueFS is able to host RocksDB, it is actually already
> a "filesystem" which has to maintain blockmap-like metadata on its own,
> WITHOUT the help of RocksDB.

Right.  BlueFS is a really simple "file system" that is *just* complicated 
enough to implement the rocksdb::Env interface, which is what rocksdb 
needs to store its log and sst files.  The after picture looks like

 +--------------------+
 |     bluestore      |
 +----------+         |
 | rocksdb  |         |
 +----------+         |
 |  bluefs  |         |
 +----------+---------+
 |    block device    |
 +--------------------+

> The reason we care about the intention and the design target of BlueFS
> is that I had a discussion with my partner Peng.Hse about an idea to
> introduce a new ObjectStore using the ZFS library. I know Ceph already
> supports ZFS as a FileStore backend, but we had a different, immature
> idea: use libzpool to implement a new ObjectStore for Ceph entirely in
> userspace, without the SPL and ZOL kernel modules, so that we can align
> the Ceph transaction with the ZFS transaction and avoid the double
> write for the Ceph journal.
> The ZFS core, libzpool (DMU, metaslab, etc.), offers a dnode object
> store and is platform (kernel/user) independent. Another benefit of the
> idea is that we can extend our metadata without bothering any DBStore.
> 
> Frankly, we are not sure yet whether our idea is realistic, but when I
> heard of BlueFS, I thought we needed to know the BlueFS design goal.

I think it makes a lot of sense, but there are a few challenges.  One 
reason we use rocksdb (or a similar kv store) is that we need in-order 
enumeration of objects in order to do collection listing (needed for 
backfill, scrub, and omap).  You'll need something similar on top of zfs.  
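
As an illustration of what in-order enumeration buys, here is a minimal
Python sketch of a sorted key-value store with a resumable listing
cursor. It is a toy under stated assumptions, not rocksdb or Ceph code:
the class OrderedKV, the method list_collection, and the
"collection/object" key layout are all hypothetical.

```python
import bisect


class OrderedKV:
    """Toy sorted key-value store (hypothetical stand-in for a kv store)."""

    def __init__(self):
        self._keys = []   # kept in sorted order
        self._vals = {}

    def put(self, key, value):
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def list_collection(self, prefix, after=None, limit=2):
        """Return up to `limit` keys under `prefix`, in sorted order,
        starting strictly after the key `after` (if given)."""
        if after is None:
            i = bisect.bisect_left(self._keys, prefix)
        else:
            i = bisect.bisect_right(self._keys, after)
        out = []
        while i < len(self._keys) and len(out) < limit:
            key = self._keys[i]
            if not key.startswith(prefix):
                break  # left the collection's key range
            out.append(key)
            i += 1
        return out


kv = OrderedKV()
for name in ["pg1/obj_c", "pg2/obj_x", "pg1/obj_a", "pg1/obj_b"]:
    kv.put(name, b"")

first = kv.list_collection("pg1/")                   # first batch
rest = kv.list_collection("pg1/", after=first[-1])   # resume after it
```

Because the keys are ordered, a listing can stop after any batch and
resume from the last key returned, which is the property a long-running
pass over a collection relies on.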

I suspect the simplest path would be to also implement the rocksdb::Env 
interface on top of the zfs libraries.  See BlueRocksEnv.{cc,h} to see the 
interface that has to be implemented...
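
To give a feel for how small such a "file system" can be, here is a toy
in-memory sketch in Python. It is an assumption-laden illustration, not
the rocksdb::Env API and not BlueFS: the class ToyEnv and its method
names are hypothetical, chosen only to show the kind of operations
(append, positional read, rename, delete, list) a store that writes log
and sst files relies on.

```python
class ToyEnv:
    """Hypothetical in-memory file layer; names are illustrative only."""

    def __init__(self):
        self._files = {}  # file name -> bytearray of contents

    def append(self, name, data):
        # Files are only ever extended at the end, never rewritten in place.
        self._files.setdefault(name, bytearray()).extend(data)

    def read(self, name, offset, length):
        return bytes(self._files[name][offset:offset + length])

    def rename(self, src, dst):
        self._files[dst] = self._files.pop(src)

    def delete(self, name):
        del self._files[name]

    def list_files(self):
        return sorted(self._files)


env = ToyEnv()
env.append("000001.log", b"put k1=v1")          # write-ahead log record
env.append("000002.sst.tmp", b"sorted table")   # table written under a temp name
env.rename("000002.sst.tmp", "000002.sst")      # then published by rename
env.delete("000001.log")                        # log no longer needed
remaining = env.list_files()
head = env.read("000002.sst", 0, 6)
```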

sage


Is BlueFS an alternative of BlueStore?

2016-01-06 Thread Javen Wu

Hi Sage,

Sorry to bother you. I am not sure if it is appropriate to email you
directly, but I cannot find any useful information on the Internet to
resolve my confusion. I hope you can help me.

I happened to hear that you are going to start BlueFS to eliminate the
redundancy between the XFS journal and the RocksDB WAL. I am a little
confused. Is BlueFS only meant to host RocksDB for BlueStore, or is it
an alternative to BlueStore?

I am a newcomer to Ceph, and I am not sure my understanding of
BlueStore is correct. BlueStore in my mind is as below.

             BlueStore
             =========
   RocksDB
+-----------+      +-----------+
|   onode   |      |           |
|    WAL    |      |           |
|   omap    |      |           |
+-----------+      |   bdev    |
|           |      |           |
|    XFS    |      |           |
|           |      |           |
+-----------+      +-----------+

I am curious: if BlueFS is able to host RocksDB, it is actually already
a "filesystem" which has to maintain blockmap-like metadata on its own,
WITHOUT the help of RocksDB. Once BlueFS is introduced into the picture,
why is RocksDB still needed? So I guess BlueFS is an alternative to
BlueStore, and it is a new ObjectStore that does not leverage RocksDB.

Is my understanding correct?

The reason we care about the intention and the design target of BlueFS
is that I had a discussion with my partner Peng.Hse about an idea to
introduce a new ObjectStore using the ZFS library. I know Ceph already
supports ZFS as a FileStore backend, but we had a different, immature
idea: use libzpool to implement a new ObjectStore for Ceph entirely in
userspace, without the SPL and ZOL kernel modules, so that we can align
the Ceph transaction with the ZFS transaction and avoid the double
write for the Ceph journal.
The ZFS core, libzpool (DMU, metaslab, etc.), offers a dnode object
store and is platform (kernel/user) independent. Another benefit of the
idea is that we can extend our metadata without bothering any DBStore.

Frankly, we are not sure yet whether our idea is realistic, but when I
heard of BlueFS, I thought we needed to know the BlueFS design goal.

Thanks
Javen