RE: Notes from a discussion of a design to allow EC overwrites

2015-11-13 Thread Allen Samuels
This scheme fundamentally relies on the temporary objects "gracefully" 
transitioning into being portions of full-up long-term durable objects.

This means that if the allocation size for a temporary object significantly 
mismatches the size of the mutation (partial stripe write) you're creating a 
problem that's proportional to the mismatch size. 

So either NewStore is able to efficiently allocate small chunks
OR
you have some kind of background cleanup process that reclaims that space 
(i.e., a "straightener").

The right choices depend on having a shared notion of the operational profile 
that you're trying to optimize for.

The fundamental question becomes, are you going to optimize for small-block 
random writes?

In my experience this is a key use-case in virtually every customer's 
evaluation scenario. I believe we MUST make this case reasonably efficient.

It seems to me that the lowest-complexity "fix" for the problem is to teach 
NewStore to have two different allocation sizes (big and small :)). 
Naturally the allocator becomes more complex. Worst case, you're now left with 
the garbage collection problem, which I suspect could be punted to a subsequent 
release (i.e., I'm out of large blocks, but there's plenty of fragmented 
available space -- this can happen, but it's a pretty pathological case that 
becomes rarer and rarer as you scale out).
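
A minimal sketch of the two-allocation-size idea, with made-up names and example 
unit sizes (4 KB / 512 KB), not anything NewStore actually implements:

#include <cstdint>
#include <cstdio>

// Hypothetical dual-granularity allocator: small mutations come from a pool
// of 4 KB units, large writes from 512 KB units.  Sizes are illustrative.
class DualSizeAllocator {
  static constexpr uint64_t SMALL_UNIT = 4ull << 10;    // 4 KB
  static constexpr uint64_t LARGE_UNIT = 512ull << 10;  // 512 KB
  uint64_t small_free, large_free;                      // free units per pool

public:
  DualSizeAllocator(uint64_t s, uint64_t l) : small_free(s), large_free(l) {}

  // Returns the unit size used, or 0 if the request can't be satisfied
  // without defragmentation (the "straightener" / GC case above).
  uint64_t allocate(uint64_t len) {
    if (len <= SMALL_UNIT && small_free) { --small_free; return SMALL_UNIT; }
    uint64_t need = (len + LARGE_UNIT - 1) / LARGE_UNIT;
    if (need <= large_free) { large_free -= need; return LARGE_UNIT; }
    // Out of large blocks while small fragments may still be plentiful --
    // exactly the pathological case described above.
    return 0;
  }
};

int main() {
  DualSizeAllocator alloc(1024, 16);
  std::printf("4K write -> unit %llu\n", (unsigned long long)alloc.allocate(4096));
  std::printf("1M write -> unit %llu\n", (unsigned long long)alloc.allocate(1 << 20));
}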

Allen Samuels
Software Architect, Emerging Storage Solutions 

2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: Samuel Just [mailto:sj...@redhat.com] 
Sent: Friday, November 13, 2015 7:39 AM
To: Sage Weil <sw...@redhat.com>
Cc: ceph-devel@vger.kernel.org; Allen Samuels <allen.samu...@sandisk.com>; 
Durgin, Josh <jdur...@redhat.com>; Farnum, Gregory <gfar...@redhat.com>
Subject: Re: Notes from a discussion of a design to allow EC overwrites

Lazily persisting the intermediate entries would certainly also work, but 
there's an argument that it needlessly adds to the write transaction.

Actually, we probably want to avoid having small writes be full stripe writes 
-- with an 8+3 code the difference between modifying a single stripelet and 
modifying the full stripe is 4 writes vs 11 writes.
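
To spell out the arithmetic behind those numbers (nothing beyond what's stated 
above): a single-stripelet update touches that one data shard plus the m parity 
shards, while a full-stripe write touches all k data shards plus the m parity 
shards.

#include <cstdio>

// Shard writes for a k+m erasure code (k data shards, m parity shards).
int partial_write_shards(int /*k*/, int m) { return 1 + m; }  // one stripelet + parity
int full_stripe_shards(int k, int m)       { return k + m; }  // every shard

int main() {
  int k = 8, m = 3;  // the 8+3 code from the example above
  std::printf("single stripelet: %d writes, full stripe: %d writes\n",
              partial_write_shards(k, m), full_stripe_shards(k, m));  // 4 vs 11
}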

It means that during peering, any log we can find (particularly the shortest 
one) from the most recent active interval isn't an upper bound on writes 
committed to the client (the (actingset.size() - M - 1)th one is?) -- we'd have 
to think carefully about the implications of that.
-Sam

On Fri, Nov 13, 2015 at 5:35 AM, Sage Weil <sw...@redhat.com> wrote:
> On Thu, 12 Nov 2015, Samuel Just wrote:
>> I was present for a discussion about allowing EC overwrites and 
>> thought it would be good to summarize it for the list:
>>
>> Commit Protocol:
>> 1) client sends write to primary
>> 2) primary reads in partial stripes needed for partial stripe 
>> overwrites from replicas
>> 3) primary sends prepares to participating replicas and queues its 
>> own prepare locally
>> 4) once all prepares are complete, primary sends a commit to the 
>> client
>> 5) primary sends applies to all participating replicas
>>
>> When we get the prepare, we write out a temp object with the data to 
>> be written.  On apply, we use an objectstore primitive to atomically 
>> move those extents into the actual object.  The log entry contains 
>> the name/id for the temp object so it can be applied on apply or removed on 
>> rollback.
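
(A rough sketch of the prepare/apply/rollback flow described above, with 
hypothetical names throughout; in particular, move_extents stands in for the 
proposed "atomically move those extents" objectstore primitive, which is not an 
existing call:)

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are real Ceph/ObjectStore calls.
struct Extent { uint64_t off, len; };

static void write_temp_object(const std::string& oid, size_t bytes) {
  std::printf("prepare: wrote %zu bytes into temp %s\n", bytes, oid.c_str());
}
static void move_extents(const std::string& from, const std::string& to,
                         const std::vector<Extent>& ex) {
  // stands in for the proposed "atomically move extents" primitive
  std::printf("apply: spliced %zu extents from %s into %s\n",
              ex.size(), from.c_str(), to.c_str());
}
static void remove_temp_object(const std::string& oid) {
  std::printf("rollback: removed temp %s\n", oid.c_str());
}

int main() {
  // Replica view of the commit protocol: prepare (step 3), then either
  // apply (step 5) or rollback if peering later decides against the entry.
  std::vector<Extent> ex = {{4096, 8192}};
  write_temp_object("temp.1234", 8192);
  move_extents("temp.1234", "obj.A", ex);
  remove_temp_object("temp.5678");  // rollback path for some other aborted entry
}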
>
> Currently we assume that temp objects are/can be cleared out on restart.
> This will need to change.  And we'll need to be careful that they get 
> cleaned out when peering completes (and the rollforward/rollback 
> decision is made).
>
> If the stripes are small, then the objectstore primitive may not 
> actually be that efficient.  I'd suggest also hinting that the temp 
> object will be swapped later, so that the backend can, if it's small, 
> store it in a cheap temporary location in the expectation that it will get 
> rewritten later.
> (In particular, the newstore allocation chunk is currently targeting 
> 512kb, and this will only be efficient with narrow stripes, so it'll 
> just get double-written.  We'll want to keep the temp value in the kv 
> store [log, hopefully] and not bother to allocate disk and rewrite 
> it.)
>
>> Each log entry contains a list of the shard ids modified.  During 
>> peering, we use the same protocol for choosing the authoritative log 
>> for the existing EC pool, except that we first take the longest 
>> candidate log and use it to extend shorter logs until they hit an entry they 
>> should have witnessed, but didn't.
>>
>> Implicit in the above scheme is the fact that if an ob

RE: Question about how rebuild works.

2015-11-06 Thread Allen Samuels
So the current algorithm optimizes for minimum period of cluster degradation at 
the expense of degrading MTTDL.

So in the 3x replication case, the MTTR (two-failure data) is somewhere between 
1x and 2x the MTTR of a single failure -- depending on the phase alignment of 
the first and second rebuilds. 

The average case would be 1.5x, and MTTDL is inversely related to this; i.e., 
this behavior cuts the MTTDL in half.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Gregory Farnum
Sent: Friday, November 06, 2015 8:53 AM
To: Samuel Just <sj...@redhat.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Question about how rebuild works.

Yeah, I'm more concerned about individual object durability. This seems like a 
good way (in ongoing flapping or whatever) for objects at the tail end of a PG 
to never get properly replicated even as we expend lots of IO repeatedly 
recovering earlier objects which are better-replicated. :/ Perhaps min_size et 
al make this a moot point, but...I don't think so. Haven't worked it all the 
way through.
-Greg

On Fri, Nov 6, 2015 at 8:48 AM, Samuel Just <sj...@redhat.com> wrote:
> Nope, it's worse: there could be an arbitrary mix of backfilled and 
> unbackfilled portions on any particular incomplete osd.  We'd need a 
> backfilled_regions field with a type like map<hobject_t, hobject_t> 
> mapping backfilled regions begin->end.  It's pretty tedious, but 
> doable provided that we bound how large the mapping gets.  I'm 
> skeptical about how large an effect this would actually have on 
> overall durability (how frequent is this case?).  Once Allen does the 
> math, we'll have a better idea :) -Sam
>
> On Fri, Nov 6, 2015 at 8:43 AM, Gregory Farnum <gfar...@redhat.com> wrote:
>> Argh, I guess I was wrong. Sorry for the misinformation, all! :(
>>
>> If we were to try and do this, Sam, do you have any idea how much it 
>> would take? Presumably we'd have to add a backfill_begin marker to 
>> bookend with last_backfill_started, and then everywhere we send over 
>> object ops we'd have to compare against both of those values. But I'm 
>> not sure how many sites that's likely to be, what other kinds of 
>> paths rely on last_backfill_started, or if I'm missing something.
>> -Greg
>>
>> On Fri, Nov 6, 2015 at 8:30 AM, Samuel Just <sj...@redhat.com> wrote:
>>> What it actually does is rebuild 3 until it catches up with 2 and 
>>> then it rebuilds them in parallel (to minimize reads).  Optimally, 
>>> we'd start 3 from where 2 left off and then circle back, but we'd 
>>> have to complicate the metadata we use to track backfill.
>>> -Sam
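
A sketch of the backfilled_regions idea Sam describes above, using a plain 
uint64_t as a stand-in for hobject_t so the example is self-contained (the real 
type and its ordering are more involved):

#include <cstdint>
#include <cstdio>
#include <map>

// begin -> end for disjoint backfilled ranges in object order;
// uint64_t stands in for hobject_t here.
using BackfilledRegions = std::map<uint64_t, uint64_t>;

// An object counts as backfilled on this shard only if it falls inside one
// of the recorded intervals.
bool is_backfilled(const BackfilledRegions& r, uint64_t obj) {
  auto it = r.upper_bound(obj);   // first interval starting after obj
  if (it == r.begin()) return false;
  --it;                           // candidate interval that could contain obj
  return obj < it->second;        // intervals are [begin, end)
}

int main() {
  BackfilledRegions r = {{0, 100}, {250, 300}};  // two disjoint intervals
  std::printf("%d %d %d\n",
              is_backfilled(r, 50), is_backfilled(r, 120), is_backfilled(r, 260));
}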


RE: newstore direction

2015-10-22 Thread Allen Samuels
How would this kind of split affect small transactions? Will each split be 
separately transactionally consistent or is there some kind of meta-transaction 
that synchronizes each of the splits?


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
Sent: Friday, October 23, 2015 8:42 AM
To: James (Fei) Liu-SSI <james@ssi.samsung.com>
Cc: Sage Weil <sw...@redhat.com>; Ric Wheeler <rwhee...@redhat.com>; Orit 
Wasserman <owass...@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

Since the changes which moved the pg log and the pg info into the pg object 
space, I think it's now the case that any transaction submitted to the 
objectstore updates a disjoint range of objects determined by the sequencer.  
It might be easier to exploit that parallelism if we control allocation and 
allocation related metadata.  We could split the store into N pieces which 
partition the pg space (one additional one for the meta sequencer?) with one 
rocksdb instance for each.
Space could then be parcelled out in large pieces (small frequency of global 
allocation decisions) and managed more finely within each partition.  The main 
challenge would be avoiding internal fragmentation of those, but at least 
defragmentation can be managed on a per-partition basis.  Such parallelism is 
probably necessary to exploit the full throughput of some ssds.
-Sam
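
A sketch of that partitioning idea: route each PG's transactions to one of N 
independent slices (each of which would own its own rocksdb instance and 
allocator), with a separate slice for the meta sequencer. The names and the 
hash-based routing are illustrative only.

#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

// One slice of the store; in the real design this would wrap its own
// rocksdb instance plus a per-partition allocator.
struct Partition {
  std::string name;
  void submit(const std::string& txn_desc) {
    std::printf("[%s] %s\n", name.c_str(), txn_desc.c_str());
  }
};

struct PartitionedStore {
  std::vector<Partition> parts;   // N pg partitions + 1 for the meta sequencer
  explicit PartitionedStore(size_t n) {
    for (size_t i = 0; i < n; ++i)
      parts.push_back({"pg-part-" + std::to_string(i)});
    parts.push_back({"meta"});
  }
  // Transactions from one sequencer touch a disjoint set of objects, so
  // routing by pg id keeps the partitions independent of each other.
  Partition& for_pg(uint64_t pgid) {
    return parts[std::hash<uint64_t>{}(pgid) % (parts.size() - 1)];
  }
  Partition& meta() { return parts.back(); }
};

int main() {
  PartitionedStore store(4);
  store.for_pg(17).submit("write obj A");
  store.for_pg(42).submit("write obj B");
  store.meta().submit("osdmap update");
}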

On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI 
<james@ssi.samsung.com> wrote:
> Hi Sage and other fellow cephers,
>   I truly share the pain with you all about filesystems while I am working 
> on the objectstore to improve performance. As mentioned, there is nothing 
> wrong with filesystems; it's just that Ceph, as one use case, needs more support 
> than filesystems will provide in the near future, for whatever reasons.
>
>There are so many techniques popping up which can help to improve the 
> performance of the OSD.  A user-space driver (DPDK from Intel) is one of them. It 
> not only gives you the storage allocator, it also gives you thread 
> scheduling support, CPU affinity, NUMA friendliness, and polling, which might 
> fundamentally change the performance of the objectstore.  It should not be hard 
> to improve CPU utilization 3x~5x, achieve higher IOPS, etc.
> I totally agree that the goal of filestore is to give enough support for 
> filesystems with either the 1, 1b, or 2 solutions. In my humble opinion, the new 
> design goal of the objectstore should focus on giving the best performance for 
> the OSD with new techniques. These two goals are not going to conflict with each 
> other.  They are just for different purposes, to make Ceph not only more 
> stable but also better.
>
>   Scylla, mentioned by Orit, is a good example.
>
>   Thanks all.
>
>   Regards,
>   James
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Thursday, October 22, 2015 5:50 AM
> To: Ric Wheeler
> Cc: Orit Wasserman; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On Wed, 21 Oct 2015, Ric Wheeler wrote:
>> You will have to trust me on this as the Red Hat person who spoke to
>> pretty much all of our key customers about local file systems and
>> storage - customers all have migrated over to using normal file systems 
>> under Oracle/DB2.
>> Typically, they use XFS or ext4.  I don't know of any non-standard
>> file systems and only have seen one account running on a raw block
>> store in 8 years
>> :)
>>
>> If you have a pre-allocated file and write using O_DIRECT, your IO
>> path is identical in terms of IO's sent to the device.
>>
>> If we are causing additional IO's, then we really need to spend some
>> time talking to the local file system gurus about this in detail.  I
>> can help with that conversation.
>
> If the file is truly preallocated (that is, prewritten with zeros...
> fallocate doesn't help here because the extents are marked unwritten),
> then
> sure: there is very little change in the data path.
>
> But at that point, what is the point?  This only works if you have one (or a 
> few) huge files and the user space app already has all the complexity of a 
> filesystem-like thing (with its own internal journal, allocators, garbage 
> collection, etc.).  Do they just do this to ease administrative tasks like 
> backup?
>
>
> This is the fundamental tradeoff:
>
> 1) We have a file per object.  We fsync like crazy and the fact that the

RE: newstore direction

2015-10-21 Thread Allen Samuels
I am pushing internally to open-source ZetaScale. Recent events may or may not 
affect that trajectory -- stay tuned.

Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com] 
Sent: Wednesday, October 21, 2015 10:45 PM
To: Allen Samuels <allen.samu...@sandisk.com>; Ric Wheeler 
<rwhee...@redhat.com>; Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/21/2015 05:06 AM, Allen Samuels wrote:
> I agree that moving newStore to raw block is going to be a significant 
> development effort. But the current scheme of using a KV store combined with 
> a normal file system is always going to be problematic (FileStore or 
> NewStore). This is caused by the transactional requirements of the 
> ObjectStore interface, essentially you need to make transactionally 
> consistent updates to two indexes, one of which doesn't understand 
> transactions (File Systems) and can never be tightly-connected to the other 
> one.
>
> You'll always be able to make this "loosely coupled" approach work, but it 
> will never be optimal. The real question is whether the performance 
> difference of a suboptimal implementation is something that you can live with 
> compared to the longer gestation period of the more optimal implementation. 
> Clearly, Sage believes that the performance difference is significant or he 
> wouldn't have kicked off this discussion in the first place.
>
> While I think we can all agree that writing a full-up KV and raw-block 
> ObjectStore is a significant amount of work, I will offer the case that the 
> "loosely coupled" scheme may not have as much time-to-market advantage as it 
> appears to have. One example: NewStore performance is limited due to bugs in 
> XFS that won't be fixed in the field for quite some time (it'll take at least 
> a couple of years before a patched version of XFS will be widely deployed in 
> customer environments).
>
> Another example: Sage has just had to substantially rework the journaling 
> code of rocksDB.
>
> In short, as you can tell, I'm full throated in favor of going down the 
> optimal route.
>
> Internally at Sandisk, we have a KV store that is optimized for flash (it's 
> called ZetaScale). We have extended it with a raw block allocator just as 
> Sage is now proposing to do. Our internal performance measurements show a 
> significant advantage over the current NewStore. That performance advantage 
> stems primarily from two things:

Has there been any discussion regarding opensourcing zetascale?

>
> (1) ZetaScale uses a B+-tree internally rather than an LSM tree 
> (levelDB/RocksDB). LSM trees experience exponential increase in write 
> amplification (cost of an insert) as the amount of data under management 
> increases. B+tree write-amplification is nearly constant independent of the 
> size of data under management. As the KV database gets larger (since NewStore 
> is effectively moving the per-file inode into the kv database -- don't forget 
> the checksums that Sage wants to add :)) this performance delta swamps all 
> others.
> (2) Having a KV and a file-system causes a double lookup. This costs CPU time 
> and disk accesses to page in data structure indexes, and metadata efficiency 
> decreases.
>
> You can't avoid (2) as long as you're using a file system.
>
> Yes an LSM tree performs better on HDD than does a B-tree, which is a good 
> argument for keeping the KV module pluggable.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
> Sent: Tuesday, October 20, 2015 11:32 AM
> To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On 10/19/2015 03:49 PM, Sage Weil wrote:
>> The current design is based on two simple ideas:
>>
>>1) a key/value interface is better way to manage all of our 
>> internal metadata (object metadata, attrs, layout, collection 
>> membership, write-ahead logging, overlay data, etc.)
>>
>>2) a file system is well suited for storage object data (as files).
>>
>> So far 1 is working out well, but I'm questioning the wisdom of #2.  
>> A few
>> things:
>>
>>- We currently write the data to the file, fsync, then commit the 
>> kv transaction.  That's

RE: newstore direction

2015-10-21 Thread Allen Samuels
One of the biggest changes that flash is making in the storage world is that 
the way basic trade-offs in storage management software architecture are being 
affected. In the HDD world CPU time per IOP was relatively inconsequential, 
i.e., it had little effect on overall performance which was limited by the 
physics of the hard drive. Flash is now inverting that situation. When you look 
at the performance levels being delivered in the latest generation of NVMe SSDs 
you rapidly see that the storage itself is generally no longer the bottleneck 
(speaking about BW, not latency of course) but rather it's the system sitting 
in front of the storage that is the bottleneck. Generally it's the CPU cost of 
an IOP.

When Sandisk first started working with Ceph (Dumpling), the design of librados 
and the OSD led to a situation where the CPU cost of an IOP was dominated by 
context switches and network socket handling. Over time, much of that has been 
addressed: the socket handling code has been re-written (more than once!) and some 
of the internal queueing in the OSD (and the associated context switches) has 
been eliminated. As the CPU costs have dropped, performance on flash has 
improved accordingly.

Because we didn't want to completely re-write the OSD (time-to-market and 
stability drove that decision), we didn't move it from the current "thread per 
IOP" model into a truly asynchronous "thread per CPU core" model that 
essentially eliminates context switches in the IO path. But a fully optimized 
OSD would go down that path (at least part-way). I believe it's been proposed 
in the past. Perhaps a hybrid "fast-path" style could get most of the benefits 
while preserving much of the legacy code.

I believe this trend toward thread-per-core software development will also tend 
to support the "do it in user-space" trend. That's because most of the kernel 
and file-system interface is architected around the blocking "thread-per-IOP" 
model and is unlikely to change in the future.
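
A toy illustration of the contrast being drawn, not the OSD's actual threading: 
instead of a thread per in-flight IOP blocking on each operation, each core runs 
one loop that drains its own queue (a real reactor would also pin the thread to 
its core and use lock-free queues; a mutex is used here only for brevity).

#include <atomic>
#include <chrono>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// One reactor per core: it polls its own work queue, so no per-IOP context
// switch is required once the request has been handed to the right core.
struct Reactor {
  std::mutex m;
  std::deque<std::function<void()>> q;
  std::atomic<bool> stop{false};

  void post(std::function<void()> fn) {
    std::lock_guard<std::mutex> g(m);
    q.push_back(std::move(fn));
  }
  void run() {                      // would be pinned to a dedicated core
    while (!stop) {
      std::function<void()> fn;
      {
        std::lock_guard<std::mutex> g(m);
        if (!q.empty()) { fn = std::move(q.front()); q.pop_front(); }
      }
      if (fn) fn();                 // handle one IOP without blocking/yielding
    }
  }
};

int main() {
  Reactor r;
  std::thread t([&] { r.run(); });
  for (int i = 0; i < 3; ++i)
    r.post([i] { std::printf("iop %d handled on the reactor thread\n", i); });
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  r.stop = true;
  t.join();
}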


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Martin Millnert [mailto:mar...@millnert.se]
Sent: Thursday, October 22, 2015 6:20 AM
To: Mark Nelson <mnel...@redhat.com>
Cc: Ric Wheeler <rwhee...@redhat.com>; Allen Samuels 
<allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>; 
ceph-devel@vger.kernel.org
Subject: Re: newstore direction

Adding 2c

On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> My thought is that there is some inflection point where the userland
> kvstore/block approach is going to be less work, for everyone I think,
> than trying to quickly discover, understand, fix, and push upstream
> patches that sometimes only really benefit us.  I don't know if we've
> truly hit that point, but it's tough for me to find flaws with
> Sage's argument.

Regarding the userland / kernel land aspect of the topic, there are further 
aspects AFAIK not yet addressed in the thread:
In the networking world, there's been development on memory mapped (multiple 
approaches exist) userland networking, which for packet management has the 
benefit of - for very, very specific applications of networking code - avoiding 
e.g. per-packet context switches etc, and streamlining processor cache 
management performance. People have gone as far as removing CPU cores from CPU 
scheduler to completely dedicate them to the networking task at hand (cache 
optimizations). There are various latency/throughput (bulking) optimizations 
applicable, but at the end of the day, it's about keeping the CPU bus busy with 
"revenue" bus traffic.

Granted, storage IO operations may be much heavier in cycle counts for context 
switches to ever appear as a problem in themselves, certainly for slower SSDs 
and HDDs. However, when going for truly high performance IO, *every* hurdle in 
the data path counts toward the total latency.
(And really, high-performance random IO characteristics approach the 
per-packet handling characteristics of networking.)  Now, I'm not really 
suggesting memory-mapping a storage device to user space, not at all, but 
having better control over the data path for a very specific use case reduces 
dependency on code that works as well as possible for the general case, and 
allows for very purpose-built code to address a narrow set of requirements. 
("Ceph storage cluster backend" isn't a typical FS use case.) It also decouples 
dependencies on users, i.e., 
waiting for the next distro release before being able to take up the benefits 
of improvements to the storage code.

A random google came up with related data on where "doing something way 
different" /can/ have significant benefits:
http://phunq.net/pipermail/tux3/2015-April/002147.ht

RE: newstore direction

2015-10-21 Thread Allen Samuels
Fixing the bug doesn't take a long time. Getting it deployed is where the delay 
is. Many companies standardize on a particular release of a particular distro. 
Getting them to switch to a new release -- even a "bug fix" point release -- is 
a major undertaking that often is a complete roadblock. Just my experience. 
YMMV. 


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Ric Wheeler [mailto:rwhee...@redhat.com] 
Sent: Wednesday, October 21, 2015 8:24 PM
To: Allen Samuels <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>; 
ceph-devel@vger.kernel.org
Subject: Re: newstore direction



On 10/21/2015 06:06 AM, Allen Samuels wrote:
> I agree that moving newStore to raw block is going to be a significant 
> development effort. But the current scheme of using a KV store combined with 
> a normal file system is always going to be problematic (FileStore or 
> NewStore). This is caused by the transactional requirements of the 
> ObjectStore interface, essentially you need to make transactionally 
> consistent updates to two indexes, one of which doesn't understand 
> transactions (File Systems) and can never be tightly-connected to the other 
> one.
>
> You'll always be able to make this "loosely coupled" approach work, but it 
> will never be optimal. The real question is whether the performance 
> difference of a suboptimal implementation is something that you can live with 
> compared to the longer gestation period of the more optimal implementation. 
> Clearly, Sage believes that the performance difference is significant or he 
> wouldn't have kicked off this discussion in the first place.

I think that we need to work with the existing stack - measure and do some 
collaborative analysis - before we throw out decades of work.  Very hard to 
understand why the local file system is a barrier for performance in this case 
when it is not an issue in existing enterprise applications.

We need some deep analysis with some local file system experts thrown in to 
validate the concerns.

>
> While I think we can all agree that writing a full-up KV and raw-block 
> ObjectStore is a significant amount of work, I will offer the case that the 
> "loosely coupled" scheme may not have as much time-to-market advantage as it 
> appears to have. One example: NewStore performance is limited due to bugs in 
> XFS that won't be fixed in the field for quite some time (it'll take at least 
> a couple of years before a patched version of XFS will be widely deployed in 
> customer environments).

Not clear what bugs you are thinking of or why you think fixing bugs will take 
a long time to hit the field in XFS. Red Hat has most of the XFS developers on 
staff and we actively backport fixes and ship them, other distros do as well.

Never seen a "bug" take a couple of years to hit users.

Regards,

Ric

>
> Another example: Sage has just had to substantially rework the journaling 
> code of rocksDB.
>
> In short, as you can tell, I'm full throated in favor of going down the 
> optimal route.
>
> Internally at Sandisk, we have a KV store that is optimized for flash (it's 
> called ZetaScale). We have extended it with a raw block allocator just as 
> Sage is now proposing to do. Our internal performance measurements show a 
> significant advantage over the current NewStore. That performance advantage 
> stems primarily from two things:
>
> (1) ZetaScale uses a B+-tree internally rather than an LSM tree 
> (levelDB/RocksDB). LSM trees experience exponential increase in write 
> amplification (cost of an insert) as the amount of data under management 
> increases. B+tree write-amplification is nearly constant independent of the 
> size of data under management. As the KV database gets larger (since NewStore 
> is effectively moving the per-file inode into the kv database -- don't forget 
> the checksums that Sage wants to add :)) this performance delta swamps all 
> others.
> (2) Having a KV and a file-system causes a double lookup. This costs CPU time 
> and disk accesses to page in data structure indexes, and metadata efficiency 
> decreases.
>
> You can't avoid (2) as long as you're using a file system.
>
> Yes an LSM tree performs better on HDD than does a B-tree, which is a good 
> argument for keeping the KV module pluggable.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org]

RE: newstore direction

2015-10-21 Thread Allen Samuels
Actually, range queries are an important part of the performance story, and 
random read speed doesn't really solve the problem.

When you're doing a scrub, you need to enumerate the objects in a specific 
order on multiple nodes -- so that they can compare the contents of their 
stores in order to determine if data cleaning needs to take place.

If you don't have in-order enumeration in your basic data structure (which 
NVMKV doesn't have) then you're forced to sort the directory before you can 
respond to an enumeration. That sort will either consume huge amounts of IOPS 
OR huge amounts of DRAM. Regardless of the choice, you'll see a significant 
degradation of performance while the scrub is ongoing -- which is one of the 
biggest problems with clustered systems (expensive and extensive maintenance 
operations).
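
A small illustration of why in-order enumeration matters here: if every OSD can 
list its objects in the same key order, comparing two shards during scrub is a 
streaming merge; without ordered listing you must first sort (or buffer) one 
side, which is where the IOPS/DRAM cost comes from. Object names below are 
arbitrary examples.

#include <cstdio>
#include <string>
#include <vector>

// Streaming comparison of two already-sorted object listings, as a scrub
// between two stores could do; mismatches are objects present on only one side.
std::vector<std::string> diff_sorted(const std::vector<std::string>& a,
                                     const std::vector<std::string>& b) {
  std::vector<std::string> out;
  size_t i = 0, j = 0;
  while (i < a.size() || j < b.size()) {
    if (j == b.size() || (i < a.size() && a[i] < b[j])) out.push_back(a[i++]);
    else if (i == a.size() || b[j] < a[i]) out.push_back(b[j++]);
    else { ++i; ++j; }              // same object present on both sides
  }
  return out;
}

int main() {
  std::vector<std::string> osd0 = {"obj.001", "obj.003", "obj.007"};
  std::vector<std::string> osd1 = {"obj.001", "obj.004", "obj.007"};
  for (const auto& o : diff_sorted(osd0, osd1))
    std::printf("mismatch: %s\n", o.c_str());
}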


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Chen, Xiaoxi [mailto:xiaoxi.c...@intel.com]
Sent: Thursday, October 22, 2015 1:10 AM
To: Mark Nelson <mnel...@redhat.com>; Allen Samuels 
<allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>
Cc: James (Fei) Liu-SSI <james@ssi.samsung.com>; Somnath Roy 
<somnath@sandisk.com>; ceph-devel@vger.kernel.org
Subject: RE: newstore direction

We did evaluate whether NVMKV could be implemented on non-FusionIO SSDs, i.e., 
re-inventing an NVMKV; the final conclusion was that it's not hard with 
persistent memory (which will be available soon).  But yeah, NVMKV will not work 
if no PM is present -- persisting the hashing table to SSD is not practicable.

Range queries seem not to be a very big issue, as the random read performance of 
today's SSDs is more than enough; I mean, even if we break all sequential access 
into random (typically 70-80K IOPS, which is ~300MB/s), the performance is still 
good enough.

Anyway, I think for the high-IOPS case it's hard for the consumer to play 
well on SSDs from different vendors; it would be better to leave it to the SSD 
vendor, something like OpenStack Cinder's structure: a vendor has the 
responsibility to maintain their driver for Ceph and take care of the performance.

> -Original Message-
> From: Mark Nelson [mailto:mnel...@redhat.com]
> Sent: Wednesday, October 21, 2015 9:36 PM
> To: Allen Samuels; Sage Weil; Chen, Xiaoxi
> Cc: James (Fei) Liu-SSI; Somnath Roy; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> Thanks Allen!  The devil is always in the details.  Know of anything
> else that looks promising?
>
> Mark
>
> On 10/21/2015 05:06 AM, Allen Samuels wrote:
> > I doubt that NVMKV will be useful for two reasons:
> >
> > (1) It relies on the unique sparse-mapping addressing capabilities
> > of the FusionIO VSL interface, it won't run on standard SSDs
> > (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no
> range operations on keys). This is pretty much required for deep scrubbing.
> >
> >
> > Allen Samuels
> > Software Architect, Fellow, Systems and Software Solutions
> >
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
> >
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
> > Sent: Tuesday, October 20, 2015 6:20 AM
> > To: Sage Weil <sw...@redhat.com>; Chen, Xiaoxi
> > <xiaoxi.c...@intel.com>
> > Cc: James (Fei) Liu-SSI <james@ssi.samsung.com>; Somnath Roy
> > <somnath@sandisk.com>; ceph-devel@vger.kernel.org
> > Subject: Re: newstore direction
> >
> > On 10/20/2015 07:30 AM, Sage Weil wrote:
> >> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
> >>> +1, nowadays K-V DB care more about very small key-value pairs, say
> >>> +say
> >>> several bytes to a few KB, but in SSD case we only care about 4KB
> >>> or 8KB. In this way, NVMKV is a good design and seems some of the
> >>> SSD vendor are also trying to build this kind of interface, we had
> >>> a NVM-L library but still under development.
> >>
> >> Do you have an NVMKV link?  I see a paper and a stale github repo..
> >> not sure if I'm looking at the right thing.
> >>
> >> My concern with using a key/value interface for the object data is
> >> that you end up with lots of key/value pairs (e.g., $inode_$offset
> >> =
> >> $4kb_of_data) that is pretty inefficient to store and (depending on
> >> the
> >> implementation) tends to break alignment.  I don't think these
> >> interfaces are targetted toward block-

RE: newstore direction

2015-10-21 Thread Allen Samuels
I agree. My only point was that you still have to factor this time into the 
argument that by continuing to put NewStore on top of a file system you'll get 
to a stable system much sooner than the longer development path of doing your 
own raw storage allocator. IMO, once you factor that into the equation the "on 
top of an FS" path doesn't look like such a clear winner.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Ric Wheeler [mailto:rwhee...@redhat.com]
Sent: Thursday, October 22, 2015 10:17 AM
To: Allen Samuels <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>; 
ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/21/2015 08:53 PM, Allen Samuels wrote:
> Fixing the bug doesn't take a long time. Getting it deployed is where the 
> delay is. Many companies standardize on a particular release of a particular 
> distro. Getting them to switch to a new release -- even a "bug fix" point 
> release -- is a major undertaking that often is a complete roadblock. Just my 
> experience. YMMV.
>

Customers do control the pace at which they upgrade their machines, but we put out 
fixes at a very regular pace.  A lot of customers will get fixes without having 
to qualify a full new release (i.e., fixes that come out between major and minor 
releases are easy).

If someone is deploying a critical server for storage, then it falls back on 
the storage software team to help guide them and encourage them to update when 
needed (and no promises of success, but people move if the win is big. If it is 
not, they can wait).

ric






RE: newstore direction

2015-10-21 Thread Allen Samuels
I agree that moving newStore to raw block is going to be a significant 
development effort. But the current scheme of using a KV store combined with a 
normal file system is always going to be problematic (FileStore or NewStore). 
This is caused by the transactional requirements of the ObjectStore interface, 
essentially you need to make transactionally consistent updates to two indexes, 
one of which doesn't understand transactions (File Systems) and can never be 
tightly-connected to the other one.

You'll always be able to make this "loosely coupled" approach work, but it will 
never be optimal. The real question is whether the performance difference of a 
suboptimal implementation is something that you can live with compared to the 
longer gestation period of the more optimal implementation. Clearly, Sage 
believes that the performance difference is significant or he wouldn't have 
kicked off this discussion in the first place.

While I think we can all agree that writing a full-up KV and raw-block 
ObjectStore is a significant amount of work, I will offer the case that the 
"loosely coupled" scheme may not have as much time-to-market advantage as it 
appears to have. One example: NewStore performance is limited due to bugs in 
XFS that won't be fixed in the field for quite some time (it'll take at least a 
couple of years before a patched version of XFS will be widely deployed in 
customer environments).

Another example: Sage has just had to substantially rework the journaling code 
of rocksDB.

In short, as you can tell, I'm full throated in favor of going down the optimal 
route.

Internally at Sandisk, we have a KV store that is optimized for flash (it's 
called ZetaScale). We have extended it with a raw block allocator just as Sage 
is now proposing to do. Our internal performance measurements show a 
significant advantage over the current NewStore. That performance advantage 
stems primarily from two things:

(1) ZetaScale uses a B+-tree internally rather than an LSM tree 
(levelDB/RocksDB). LSM trees experience exponential increase in write 
amplification (cost of an insert) as the amount of data under management 
increases. B+tree write-amplification is nearly constant independent of the 
size of data under management. As the KV database gets larger (since NewStore 
is effectively moving the per-file inode into the kv database -- don't forget 
the checksums that Sage wants to add :)) this performance delta swamps all others.
(2) Having a KV and a file-system causes a double lookup. This costs CPU time 
and disk accesses to page in data structure indexes, and metadata efficiency 
decreases.

You can't avoid (2) as long as you're using a file system.

Yes an LSM tree performs better on HDD than does a B-tree, which is a good 
argument for keeping the KV module pluggable.
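
One common back-of-envelope model of the write-amplification point in (1) above 
(an illustration under assumed parameters, not a measurement from this thread): 
leveled compaction rewrites each byte roughly the size ratio times per level, 
and the level count grows with the amount of data under management, whereas a 
B+tree's in-place page rewrite cost stays roughly flat.

#include <cmath>
#include <cstdio>

// Rough leveled-LSM model: WAL write + memtable flush + (fanout rewrites per
// level) * (number of levels).  Memtable size and fanout are assumed values.
double lsm_write_amp(double data_gb, double memtable_gb = 0.25, double fanout = 10) {
  double levels = std::max(1.0, std::ceil(std::log(data_gb / memtable_gb) /
                                          std::log(fanout)));
  return 1 /*WAL*/ + 1 /*flush*/ + fanout * levels;
}

int main() {
  for (double gb : {10.0, 100.0, 1000.0, 10000.0})
    std::printf("%6.0f GB under management -> LSM write amp ~%5.1f (B+tree ~constant)\n",
                gb, lsm_write_amp(gb));
}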


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
Sent: Tuesday, October 20, 2015 11:32 AM
To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/19/2015 03:49 PM, Sage Weil wrote:
> The current design is based on two simple ideas:
>
>   1) a key/value interface is a better way to manage all of our internal
> metadata (object metadata, attrs, layout, collection membership,
> write-ahead logging, overlay data, etc.)
>
>   2) a file system is well suited for storing object data (as files).
>
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A
> few
> things:
>
>   - We currently write the data to the file, fsync, then commit the kv
> transaction.  That's at least 3 IOs: one for the data, one for the fs
> journal, one for the kv txn to commit (at least once my rocksdb
> changes land... the kv commit is currently 2-3).  So two people are
> managing metadata, here: the fs managing the file metadata (with its
> own
> journal) and the kv backend (with its journal).

If all of the fsync()'s fall into the same backing file system, are you sure 
that each fsync() takes the same time? Depending on the local FS implementation 
of course, but the order of issuing those fsync()'s can effectively make some 
of them no-ops.

>
>   - On read we have to open files by name, which means traversing the
> fs namespace.  Newstore tries to keep it as flat and simple as
> possible, but at a minimum it is a couple btree lookups.  We'd love to
> use open by handle (which would reduce this to 1 btree traversal), but
> running the daemon as ceph and not root makes that hard...

This seems like a pretty low hurdle to overcome.
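
For reference, the "open by handle" path being discussed is the Linux 
name_to_handle_at(2)/open_by_handle_at(2) pair; the catch is that 
open_by_handle_at() requires CAP_DAC_READ_SEARCH, which is exactly why running 
the daemon as ceph rather than root makes it awkward. A minimal sketch:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>      // name_to_handle_at, open_by_handle_at, MAX_HANDLE_SZ
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
  const char* path = argc > 1 ? argv[1] : "/etc/hostname";

  // Resolve the name once and keep the opaque handle (one namespace walk).
  struct file_handle* fh =
      (struct file_handle*)std::malloc(sizeof(*fh) + MAX_HANDLE_SZ);
  fh->handle_bytes = MAX_HANDLE_SZ;
  int mount_id = 0;
  if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {
    std::perror("name_to_handle_at");
    return 1;
  }

  // Later opens skip the namespace walk entirely -- but this call needs
  // CAP_DAC_READ_SEARCH, i.e. it fails for an unprivileged user.
  int mount_fd = open("/", O_RDONLY | O_DIRECTORY);  // fd on the containing mount
  int fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
  if (fd < 0)
    std::perror("open_by_handle_at (EPERM without CAP_DAC_READ_SEARCH)");
  else
    close(fd);
  std::free(fh);
  return 0;
}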

>
>   - ...and file systems insist on updating mtime on writes, even when
> it is a overwrite with no

RE: newstore direction

2015-10-21 Thread Allen Samuels
I doubt that NVMKV will be useful for two reasons:

(1) It relies on the unique sparse-mapping addressing capabilities of the 
FusionIO VSL interface, it won't run on standard SSDs
(2) NVMKV doesn't provide any form of in-order enumeration (i.e., no range 
operations on keys). This is pretty much required for deep scrubbing.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, October 20, 2015 6:20 AM
To: Sage Weil <sw...@redhat.com>; Chen, Xiaoxi <xiaoxi.c...@intel.com>
Cc: James (Fei) Liu-SSI <james@ssi.samsung.com>; Somnath Roy 
<somnath@sandisk.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/20/2015 07:30 AM, Sage Weil wrote:
> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
>> +1, nowadays K-V DB care more about very small key-value pairs, say
>> several bytes to a few KB, but in SSD case we only care about 4KB or
>> 8KB. In this way, NVMKV is a good design and seems some of the SSD
>> vendor are also trying to build this kind of interface, we had a
>> NVM-L library but still under development.
>
> Do you have an NVMKV link?  I see a paper and a stale github repo..
> not sure if I'm looking at the right thing.
>
> My concern with using a key/value interface for the object data is
> that you end up with lots of key/value pairs (e.g., $inode_$offset =
> $4kb_of_data) that is pretty inefficient to store and (depending on
> the
> implementation) tends to break alignment.  I don't think these
> interfaces are targetted toward block-sized/aligned payloads.  Storing
> just the metadata (block allocation map) w/ the kv api and storing the
> data directly on a block/page interface makes more sense to me.
>
> sage

I get the feeling that some of the folks that were involved with nvmkv at 
Fusion IO have left.  Nisha Talagala is now out at Parallel Systems for 
instance.  http://pmem.io might be a better bet, though I haven't looked 
closely at it.

Mark

>
>
>>> -Original Message-
>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>>> ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
>>> Sent: Tuesday, October 20, 2015 6:21 AM
>>> To: Sage Weil; Somnath Roy
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: newstore direction
>>>
>>> Hi Sage and Somnath,
>>> In my humble opinion, there is another, more aggressive solution
>>> than a raw-block-device-based key-value store as the backend for the
>>> objectstore: a new key-value SSD device with transaction support would
>>> be ideal to solve these issues.
>>> First of all, it is a raw SSD device. Secondly, it provides a key-value
>>> interface directly from the SSD. Thirdly, it can provide transaction
>>> support; consistency will be guaranteed by the hardware device. It
>>> pretty much satisfies all of the objectstore's needs without any extra
>>> overhead, since there is not any extra layer between the device and the
>>> objectstore.
>>> Either way, I strongly support having Ceph own the data format
>>> instead of relying on a filesystem.
>>>
>>>Regards,
>>>James
>>>
>>> -Original Message-
>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>>> ow...@vger.kernel.org] On Behalf Of Sage Weil
>>> Sent: Monday, October 19, 2015 1:55 PM
>>> To: Somnath Roy
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: newstore direction
>>>
>>> On Mon, 19 Oct 2015, Somnath Roy wrote:
>>>> Sage,
>>>> I fully support that.  If we want to saturate SSDs , we need to get
>>>> rid of this filesystem overhead (which I am in process of measuring).
>>>> Also, it will be good if we can eliminate the dependency on the k/v
>>>> dbs (for storing allocators and all). The reason is the unknown
>>>> write amps they cause.
>>>
>>> My hope is to keep behind the KeyValueDB interface (and/or change
>>> it as
>>> appropriate) so that other backends can be easily swapped in (e.g. a
>>> btree-based one for high-end flash).
>>>
>>> sage
>>>
>>>
>>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>>
>>>> -Original Message-
>>>> From: ceph-devel-ow...@vger.kernel.org
>>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behal

RE: loadable objectstore

2015-09-14 Thread Allen Samuels
Yes, I'm referring to the C++ vtable.

Allen Samuels
Software Architect, Emerging Storage Solutions 

2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: James (Fei) Liu-SSI [mailto:james@ssi.samsung.com] 
Sent: Monday, September 14, 2015 9:48 AM
To: Allen Samuels <allen.samu...@sandisk.com>; Varada Kari 
<varada.k...@sandisk.com>; Sage Weil <s...@newdream.net>; Matt W. Benjamin 
<m...@cohortfs.com>; Loic Dachary <l...@dachary.org>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: loadable objectstore

Hi Allen,
I am not exactly sure what the vtable is. Is the vtable the same as the vtable 
from the C++ object model?  IMHO, the procedure linkage table is used to 
redirect position-independent function calls to the absolute location of a function 
based on the ELF format spec[1]. The performance hit for a shared library might 
be negligible. There are some very old articles about performance tests 
comparing shared vs static libs[2].  I am not following the latest 
compiler/linker technologies any more. Would be great to know of any new updates.

[1]http://www.skyfree.org/linux/references/ELF_Format.pdf
[2]https://gcc.gnu.org/ml/gcc/2004-06/msg01956.html 

Regards,
James

-----Original Message-
From: Allen Samuels [mailto:allen.samu...@sandisk.com]
Sent: Saturday, September 12, 2015 1:35 PM
To: Varada Kari; James (Fei) Liu-SSI; Sage Weil; Matt W. Benjamin; Loic Dachary
Cc: ceph-devel
Subject: RE: loadable objectstore

Performance impact after initialization will be zero. All of the call sequences 
are done as vtable dynamic dispatches on the global ObjectStore instance. For this 
type of call sequence it doesn't matter whether the implementation is dynamically 
or statically linked; the calls are the same (a simple indirection through the 
vtbl, which is loaded from a known constant offset in the object).


Allen Samuels
Chief Software Architect, Emerging Storage Solutions 

951 SanDisk Drive, Milpitas, CA 95035
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Varada Kari
Sent: Friday, September 11, 2015 9:34 PM
To: James (Fei) Liu-SSI <james@ssi.samsung.com>; Sage Weil 
<s...@newdream.net>; Matt W. Benjamin <m...@cohortfs.com>; Loic Dachary 
<l...@dachary.org>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: loadable objectstore

Hi James,

Please find the responses inline.

varada

> -Original Message-
> From: James (Fei) Liu-SSI [mailto:james@ssi.samsung.com]
> Sent: Saturday, September 12, 2015 12:13 AM
> To: Varada Kari <varada.k...@sandisk.com>; Sage Weil 
> <s...@newdream.net>; Matt W. Benjamin <m...@cohortfs.com>; Loic 
> Dachary <l...@dachary.org>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: RE: loadable objectstore
>
> Hi Varada,
>   Got a chance to go through the code. Great job. It is much cleaner . 
> Several
> questions:
>   1. What do you think about the performance impact with the new
> implementation? Such as dynamic library vs static link?
[Varada Kari] Haven't measured the performance yet, but there will be some hit 
due to static vs dynamic. But that shouldn't be a major degradation, but I will 
hold on till we have some perf runs to figure that out.
>   2. Could any vendor just provide an objectstore-interface-compliant
> dynamic binary library for their own storage engine with the new factory
> framework?
[Varada Kari] That was one of the design motives for this change. Yes, any 
backend adhering to the object store interfaces can integrate with the osd. All 
they need to do is provide a factory interface and the required version and init 
functionality, in addition to all the required object store interfaces.
>
>   Regards,
>   James
>
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- 
> ow...@vger.kernel.org] On Behalf Of Varada Kari
> Sent: Friday, September 11, 2015 3:28 AM
> To: Sage Weil; Matt W. Benjamin; Loic Dachary
> Cc: ceph-devel
> Subject: RE: loadable objectstore
>
> Hi Sage/ Matt,
>
> I have submitted the pull request based on wip-plugin branch for the 
> object store factory implementation at https://github.com/ceph/ceph/pull/5884 
> .
> Haven't rebased to the master yet. Working on rebase and including new 
> store in the factory implementation.  Please have a look and let me 
> know your comments. Will submit a rebased PR soon with new store integration.
>
> Thanks,
> Varada
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- 
> ow...@vger.kernel.org] On Behalf Of Varada Kari
> Sent: Friday, July 03, 2015 7:31 PM
> To: Sage

RE: loadable objectstore

2015-09-12 Thread Allen Samuels
Performance impact after initialization will be zero. All of the call sequences 
are done as vtable dynamic dispatches on the global ObjectStore instance. For this 
type of call sequence it doesn't matter whether the implementation is dynamically 
or statically linked; the calls are the same (a simple indirection through the 
vtbl, which is loaded from a known constant offset in the object).
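
A generic sketch of the pattern under discussion (dlopen a backend .so, look up 
a C factory symbol, then make every subsequent call through the abstract 
interface's vtable); the Backend type, library path, and symbol name here are 
made up for illustration, not Ceph's actual factory interface. Link with -ldl.

#include <cstdio>
#include <dlfcn.h>

// Minimal stand-in for the abstract interface; calls go through the vtable
// regardless of whether the implementation was linked statically or loaded
// at runtime with dlopen.
struct Backend {
  virtual ~Backend() = default;
  virtual int mount() = 0;
};

using factory_fn = Backend* (*)();

Backend* load_backend(const char* so_path, const char* symbol) {
  void* handle = dlopen(so_path, RTLD_NOW | RTLD_LOCAL);   // one-time cost
  if (!handle) { std::fprintf(stderr, "%s\n", dlerror()); return nullptr; }
  auto make = reinterpret_cast<factory_fn>(dlsym(handle, symbol));
  if (!make) { std::fprintf(stderr, "%s\n", dlerror()); return nullptr; }
  return make();   // every later call is an ordinary virtual dispatch
}

int main() {
  // Hypothetical plugin and symbol names.
  if (Backend* b = load_backend("./libexample_store.so", "make_backend"))
    std::printf("mount() -> %d\n", b->mount());
}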


Allen Samuels
Chief Software Architect, Emerging Storage Solutions 

951 SanDisk Drive, Milpitas, CA 95035
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Varada Kari
Sent: Friday, September 11, 2015 9:34 PM
To: James (Fei) Liu-SSI <james@ssi.samsung.com>; Sage Weil 
<s...@newdream.net>; Matt W. Benjamin <m...@cohortfs.com>; Loic Dachary 
<l...@dachary.org>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: loadable objectstore

Hi James,

Please find the responses inline.

varada

> -Original Message-
> From: James (Fei) Liu-SSI [mailto:james@ssi.samsung.com]
> Sent: Saturday, September 12, 2015 12:13 AM
> To: Varada Kari <varada.k...@sandisk.com>; Sage Weil
> <s...@newdream.net>; Matt W. Benjamin <m...@cohortfs.com>; Loic
> Dachary <l...@dachary.org>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: RE: loadable objectstore
>
> Hi Varada,
>   Got a chance to go through the code. Great job. It is much cleaner . Several
> questions:
>   1. What do you think about the performance impact with the new
> implementation? Such as dynamic library vs static link?
[Varada Kari] Haven't measured the performance yet, but there will be some hit 
due to static vs dynamic. But that shouldn't be a major degradation, but I will 
hold on till we have some perf runs to figure that out.
>   2. Could any vendor just provide an objectstore-interface-compliant dynamic
> binary library for their own storage engine with the new factory framework?
[Varada Kari] That was one of the design motives for this change. Yes, any 
backend adhering to the object store interfaces can integrate with the osd. All 
they need to do is provide a factory interface and the required version and init 
functionality, in addition to all the required object store interfaces.
>
>   Regards,
>   James
>
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Varada Kari
> Sent: Friday, September 11, 2015 3:28 AM
> To: Sage Weil; Matt W. Benjamin; Loic Dachary
> Cc: ceph-devel
> Subject: RE: loadable objectstore
>
> Hi Sage/ Matt,
>
> I have submitted the pull request based on wip-plugin branch for the object
> store factory implementation at https://github.com/ceph/ceph/pull/5884 .
> Haven't rebased to the master yet. Working on rebase and including new
> store in the factory implementation.  Please have a look and let me know
> your comments. Will submit a rebased PR soon with new store integration.
>
> Thanks,
> Varada
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Varada Kari
> Sent: Friday, July 03, 2015 7:31 PM
> To: Sage Weil <s...@newdream.net>; Adam Crume
> <adamcr...@gmail.com>
> Cc: Loic Dachary <l...@dachary.org>; ceph-devel <ceph-devel@vger.kernel.org>; Matt W. Benjamin <m...@cohortfs.com>
> Subject: RE: loadable objectstore
>
> Hi All,
>
> Not able to make much progress after making common as a shared object
> along with object store.
> Compilation of the test binaries are failing with 
> "./.libs/libceph_filestore.so:
> undefined reference to `tracepoint_dlopen'".
>
>   CXXLDceph_streamtest
> ./.libs/libceph_filestore.so: undefined reference to `tracepoint_dlopen'
> collect2: error: ld returned 1 exit status
> make[3]: *** [ceph_streamtest] Error 1
>
> But libfilestore.so is linked with lttng-ust.
>
> src/.libs$ ldd libceph_filestore.so
> libceph_keyvaluestore.so.1 => /home/varada/obs-factory/plugin-
> work/src/.libs/libceph_keyvaluestore.so.1 (0x7f5e50f5)
> libceph_os.so.1 => /home/varada/obs-factory/plugin-
> work/src/.libs/libceph_os.so.1 (0x7f5e4f93a000)
> libcommon.so.1 => /home/varada/ obs-factory/plugin-
> work/src/.libs/libcommon.so.1 (0x7f5e4b5df000)
> liblttng-ust.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust.so.0
> (0x7f5e4b179000)
> liblttng-ust-tracepoint.so.0 => 
> /usr/lib/x86_64-linux-gnu/liblttng-ust-
> tracepoint.so.0 (0x7f5e4a021000)
> liburcu-bp.so.1 => /usr/lib/liburcu-bp.so.1 (0x7f5e49e1a000)
> liburcu-cds.so.1 => /usr/lib/lib

RE: Inline dedup/compression

2015-08-20 Thread Allen Samuels
I was referring strictly to compression. Dedupe is a whole 'nother issue.

I agree that dedupe on a per-OSD basis isn't interesting. It needs to be done 
at the pool level (or higher). 


Allen Samuels
Software Architect, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: Chaitanya Huilgol 
Sent: Thursday, August 20, 2015 9:43 PM
To: Allen Samuels; Haomai Wang
Cc: James (Fei) Liu-SSI; ceph-devel
Subject: RE: Inline dedup/compression

Hi,

The original idea of dedupe was to make it cluster-wide. If we go with a 
filestore or keyvalue-store based dedupe/compression then isn't it localized to 
the OSD? W.r.t. the Ceph architecture of object distribution, won't the probability 
of objects with the same/similar data landing on the same OSD be pretty low? 

Regards,
Chaitanya

-Original Message-
From: Allen Samuels
Sent: Friday, August 21, 2015 9:07 AM
To: Haomai Wang
Cc: Chaitanya Huilgol; James (Fei) Liu-SSI; ceph-devel
Subject: RE: Inline dedup/compression

XFS shouldn't have any trouble with the holes scheme. I don't know BTRFS as 
well, but I doubt it's significantly different.

If we assume that the logical address space of a file is broken up into fixed 
sized chunks on fixed size boundaries (presumably a power of 2) then the 
implementation is quite straightforward.

Picking the chunk size will be a key issue for performance. Unfortunately, 
there are competing desires.

For best space utilization, you'll want the chunk size to be large, because on 
average you'll lose 1/2 of a file system sector/block for each chunk of 
compressed data.

For best R/W performance, you'll want the chunk size to be small, because 
logically the file I/O size is equal to a chunk, i.e., on a write you might 
have to read the corresponding chunk, decompress it, insert the new data and 
recompress it. This gets super duper ugly on FileStore because you can't afford 
to crash during the re-write update and risk a partially updated chunk (this 
will give you garbage when you decompress it). This means that you'll have to 
log the entire chunk even if you're only re-writing a small portion of it. 
Hence the desire to make the chunksize small. I'm not as familiar with 
NewStore, but I don't think it's fundamentally much better. Basically any form 
of sub-chunk write-operation stinks in performance. Sub-chunk read operations 
aren't too bad unless the chunk size is ridiculously large. 

For best compression ratios, you'll want the chunk size to be at least equal to 
the history size if not 2 or 3 times larger (64K history size when using zlib, 
snappy is 32K or 64K for the latest version)

The partial-block write problem doesn't exist for RGW objects, and its objects 
are probably already compressed -- meaning that you'll want to be able to convey 
the compression parameters to RADOS so that the backend knows what to do.

I would add a per-file attribute that encodes the compression parameters:  
compression algorithm (zlib, snappy, ...) and chunksize. That would also 
provide backward compatibility and allow per-object compression diversity.

Then you'd want to add verbiage to the individual access schemes to 
allow/disallow compression. For file systems you'd want that on a per-directory 
basis or perhaps even better a set of regular expressions.


Allen Samuels
Software Architect, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Thursday, August 20, 2015 8:01 PM
To: Allen Samuels
Cc: Chaitanya Huilgol; James (Fei) Liu-SSI; ceph-devel
Subject: Re: Inline dedup/compression

sorry, should be this
blog(http://mysqlserverteam.com/innodb-transparent-page-compression/)

On Fri, Aug 21, 2015 at 10:51 AM, Haomai Wang haomaiw...@gmail.com wrote:
 I found a
 blog(http://mysqlserverteam.com/innodb-transparent-pageio-compression/
 ) about mysql innodb transparent compression. It's surprised that 
 innodb will do it at low level(just like filestore in ceph) and rely 
 it on filesystem file hole feature. I'm very suspect about the 
 performance afeter storing lot's of *small* hole files on fs. If 
 reliable, it would be easy that filestore/newstore impl alike feature.

 On Fri, Jul 3, 2015 at 1:13 PM, Allen Samuels allen.samu...@sandisk.com 
 wrote:
 For non-overwriting relatively large objects, this scheme works fine. 
 Unfortunately the real use-case for deduplication is block storage with 
 virtualized infrastructure (eliminating duplicate operating system files and 
 applications, etc.) and in order for this to provide good deduplication, 
 you'll need a block size that's equal or smaller than the cluster-size of 
 the file system mounted on the block device. Meaning that your storage is 
 now dominated by small chunks (probably 8K-ish) rather than

RE: Inline dedup/compression

2015-08-20 Thread Allen Samuels
XFS shouldn't have any trouble with the holes scheme. I don't know BTRFS as 
well, but I doubt it's significantly different.

If we assume that the logical address space of a file is broken up into fixed 
sized chunks on fixed size boundaries (presumably a power of 2) then the 
implementation is quite straightforward.

Picking the chunk size will be a key issue for performance. Unfortunately, 
there are competing desires.

For best space utilization, you'll want the chunk size to be large, because on 
average you'll lose 1/2 of a file system sector/block for each chunk of 
compressed data.

For best R/W performance, you'll want the chunk size to be small, because 
logically the file I/O size is equal to a chunk, i.e., on a write you might 
have to read the corresponding chunk, decompress it, insert the new data and 
recompress it. This gets super duper ugly on FileStore because you can't afford 
to crash during the re-write update and risk a partially updated chunk (this 
will give you garbage when you decompress it). This means that you'll have to 
log the entire chunk even if you're only re-writing a small portion of it. 
Hence the desire to make the chunksize small. I'm not as familiar with 
NewStore, but I don't think it's fundamentally much better. Basically any form 
of sub-chunk write-operation stinks in performance. Sub-chunk read operations 
aren't too bad unless the chunk size is ridiculously large. 
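To make the sub-chunk write penalty concrete, here is a rough sketch (not Ceph code; the chunk size, the in-memory chunk_store map, and the no-op compress/decompress stand-ins are all assumptions) of the read-modify-write cycle a compressed chunk forces on a small overwrite:

// Sketch: cost of a small write landing inside a compressed chunk.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

using buffer = std::string;

static const uint64_t CHUNK_SIZE = 64 * 1024;        // fixed, power-of-2 chunk (assumption)
std::map<uint64_t, buffer> chunk_store;              // chunk index -> compressed bytes

buffer compress_chunk(const buffer& in)   { return in; }   // stand-ins for zlib/snappy
buffer decompress_chunk(const buffer& in) { return in; }

void object_write(uint64_t off, const buffer& data) {
  uint64_t end = off + data.size();
  for (uint64_t c = off / CHUNK_SIZE; c * CHUNK_SIZE < end; ++c) {
    uint64_t cbeg = c * CHUNK_SIZE, cend = cbeg + CHUNK_SIZE;
    // 1. read and decompress the whole chunk, even for a tiny overwrite
    buffer plain;
    auto it = chunk_store.find(c);
    if (it != chunk_store.end()) plain = decompress_chunk(it->second);
    plain.resize(CHUNK_SIZE, '\0');
    // 2. patch only the overlapping byte range
    uint64_t s = std::max(off, cbeg), e = std::min(end, cend);
    plain.replace(s - cbeg, e - s, data, s - off, e - s);
    // 3. recompress and rewrite (and, on FileStore, log) the whole chunk;
    //    a crash between 2 and 3 must never leave a half-updated chunk behind
    chunk_store[c] = compress_chunk(plain);
  }
}

int main() {
  object_write(0, buffer(256 * 1024, 'A'));   // 256K object = 4 chunks
  object_write(100000, buffer(4096, 'B'));    // a 4K overwrite still costs a 64K RMW
  std::cout << "chunks stored: " << chunk_store.size() << "\n";
}

Every write smaller than CHUNK_SIZE pays for a full-chunk read, decompress, recompress, and rewrite, which is exactly the tension with wanting large chunks for space and compression-ratio reasons.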

For best compression ratios, you'll want the chunk size to be at least equal to 
the history size, if not 2 or 3 times larger (64K history size when using zlib; 
snappy is 32K or 64K for the latest version).

The partial-block write problem doesn't exist for RGW objects, and its objects 
are probably already compressed. Meaning that you'll want to be able to convey 
the compression parameters to RADOS so that the backend knows what to do.

I would add a per-file attribute that encodes the compression parameters:  
compression algorithm (zlib, snappy, ...) and chunksize. That would also 
provide backward compatibility and allow per-object compression diversity.
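A minimal sketch of what such a per-object attribute could carry; the field names and the flat host-endian encoding here are invented for illustration, not an existing RADOS attribute format:

// Sketch: a per-object compression attribute, stored e.g. as an xattr.
#include <cstdint>
#include <cstring>
#include <string>

enum class comp_alg : uint8_t { none = 0, zlib = 1, snappy = 2 };

struct comp_params {
  comp_alg alg;         // which compressor wrote the chunks
  uint32_t chunk_size;  // fixed chunk size/boundary (power of 2)

  std::string encode() const {                 // host-endian for brevity
    std::string out(5, '\0');
    out[0] = static_cast<char>(alg);
    std::memcpy(&out[1], &chunk_size, sizeof(chunk_size));
    return out;
  }
  static comp_params decode(const std::string& in) {
    comp_params p{comp_alg::none, 0};
    if (in.size() >= 5) {
      p.alg = static_cast<comp_alg>(in[0]);
      std::memcpy(&p.chunk_size, &in[1], sizeof(p.chunk_size));
    }
    return p;
  }
};

int main() {
  comp_params p{comp_alg::zlib, 64 * 1024};
  std::string xattr = p.encode();              // would be set as, say, "user.compression"
  comp_params q = comp_params::decode(xattr);
  return q.chunk_size == p.chunk_size ? 0 : 1;
}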

Then you'd want to add verbiage to the individual access schemes to 
allow/disallow compression. For file systems you'd want that on a per-directory 
basis or perhaps even better a set of regular expressions.


Allen Samuels
Software Architect, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: Haomai Wang [mailto:haomaiw...@gmail.com] 
Sent: Thursday, August 20, 2015 8:01 PM
To: Allen Samuels
Cc: Chaitanya Huilgol; James (Fei) Liu-SSI; ceph-devel
Subject: Re: Inline dedup/compression

sorry, should be this
blog(http://mysqlserverteam.com/innodb-transparent-page-compression/)

On Fri, Aug 21, 2015 at 10:51 AM, Haomai Wang haomaiw...@gmail.com wrote:
 I found a 
 blog(http://mysqlserverteam.com/innodb-transparent-pageio-compression/
 ) about MySQL InnoDB transparent compression. It's surprising that 
 InnoDB does it at a low level (just like filestore in ceph) and relies 
 on the filesystem's file-hole feature. I'm very suspicious about the 
 performance after storing lots of *small* hole files on the fs. If it's 
 reliable, it would be easy for filestore/newstore to implement a similar feature.

 On Fri, Jul 3, 2015 at 1:13 PM, Allen Samuels allen.samu...@sandisk.com 
 wrote:
 For non-overwriting relatively large objects, this scheme works fine. 
 Unfortunately the real use-case for deduplication is block storage with 
 virtualized infrastructure (eliminating duplicate operating system files and 
 applications, etc.) and in order for this to provide good deduplication, 
 you'll need a block size that's equal or smaller than the cluster-size of 
 the file system mounted on the block device. Meaning that your storage is 
 now dominated by small chunks (probably 8K-ish) rather than the relatively 
 large 4M stripes that are used today (this will also kill EC since small 
 objects are replicated rather than ECed). This will have a massive impact on 
 backend storage I/O as the basic data/metadata ratio is completely skewed 
 (both for static storage and dynamic I/O count).


 Allen Samuels
 Software Architect, Emerging Storage Solutions

 2880 Junction Avenue, Milpitas, CA 95134
 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com


 -Original Message-
 From: Chaitanya Huilgol
 Sent: Thursday, July 02, 2015 3:50 AM
 To: James (Fei) Liu-SSI; Allen Samuels; Haomai Wang
 Cc: ceph-devel
 Subject: RE: Inline dedup/compression

 Hi James et al.,

 Here is an example for clarity:
 1. Client writes object object.abcd
 2. Based on the crush rules, say OSD.a is the primary OSD which receives the write
 3. OSD.a performs segmenting/fingerprinting which can be static or dynamic and
 generates a list of segments; object.abcd is now represented by a manifest object
 with the list of segment hash and len:
 [Header]  [Seg1_sha, len

RE: Ceph Hackathon: More Memory Allocator Testing

2015-08-19 Thread Allen Samuels
It was a surprising result that the memory allocator is making such a large 
difference in performance. All of the recent work in fiddling with TCmalloc's 
and Jemalloc's various knobs and switches has been excellent, a great example of 
group collaboration. But I think it's only a partial optimization of the 
underlying problem. The real take-away from this activity is that the code base 
is doing a LOT of memory allocation/deallocation, which is consuming substantial 
CPU time -- regardless of how much we optimize the memory allocator, you can't 
get away from the fact that it macroscopically MATTERS. The better long-term 
solution is to reduce reliance on the general-purpose memory allocator and to 
implement strategies that are more specific to our usage model. 

What really needs to happen initially is to instrument the 
allocation/deallocation. Most likely we'll find that 80+% of the work is coming 
from just a few object classes, and it will be easy to create custom allocation 
strategies for those usages. This will lead to even higher performance that's 
much less sensitive to easy-to-misconfigure environmental factors, and the 
entire tcmalloc/jemalloc "oops, it uses more memory" discussion will go away.
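A rough sketch of the kind of class-specific strategy meant here: one hot, fixed-size type recycles its memory through a private free list instead of going back to the general-purpose allocator every time. The class is hypothetical and the pool is deliberately not thread-safe; a real one would be per-thread or locked.

// Sketch: take one hot, fixed-size class off the general-purpose allocator
// by recycling its memory through a per-class free list.
#include <cstddef>
#include <iostream>
#include <new>
#include <vector>

struct hot_op {                      // stand-in for a frequently allocated type
  char payload[128];

  static void* operator new(std::size_t sz) {
    if (!free_list.empty()) {        // reuse a previously freed, same-sized slot
      void* p = free_list.back();
      free_list.pop_back();
      return p;
    }
    return ::operator new(sz);       // fall back to the global allocator
  }
  static void operator delete(void* p) {
    free_list.push_back(p);          // keep the slot for the next allocation
  }
  static std::vector<void*> free_list;   // not thread-safe; sketch only
};
std::vector<void*> hot_op::free_list;

int main() {
  // after the first round, new/delete no longer touch tcmalloc/jemalloc at all
  for (int round = 0; round < 3; ++round) {
    std::vector<hot_op*> live;
    for (int i = 0; i < 1000; ++i) live.push_back(new hot_op);
    for (auto* p : live) delete p;
  }
  std::cout << "pooled slots: " << hot_op::free_list.size() << "\n";
  return 0;
}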


Allen Samuels
Software Architect, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Wednesday, August 19, 2015 10:30 AM
To: Alexandre DERUMIER
Cc: Mark Nelson; ceph-devel
Subject: RE: Ceph Hackathon: More Memory Allocator Testing

Yes, it should be 1 per OSD...
There is no doubt that TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is relative to the 
number of threads running..
But, I don't know if number of threads is a factor for jemalloc..

Thanks & Regards
Somnath

-Original Message-
From: Alexandre DERUMIER [mailto:aderum...@odiso.com]
Sent: Wednesday, August 19, 2015 9:55 AM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: Re: Ceph Hackathon: More Memory Allocator Testing

 I think that tcmalloc have a fixed size 
(TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES), and share it between all process. 

I think it is per tcmalloc instance loaded, so at least num_osds * 
num_tcmalloc_instance * TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES in a box. 

What is num_tcmalloc_instance? I think 1 osd process uses a defined 
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES size?

I'm saying that because I have exactly the same bug, client side, with librbd 
+ tcmalloc + qemu + iothreads.
When I define too many iothreads, I hit the bug directly (can 
reproduce 100%).
It's as if the thread_cache size is divided by the number of threads?






-Original Message-
From: Somnath Roy somnath@sandisk.com
To: aderumier aderum...@odiso.com, Mark Nelson mnel...@redhat.com
Cc: ceph-devel ceph-devel@vger.kernel.org
Sent: Wednesday, August 19, 2015 18:27:30
Subject: RE: Ceph Hackathon: More Memory Allocator Testing

 I think that tcmalloc have a fixed size 
(TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES), and share it between all process. 

I think it is per tcmalloc instance loaded , so, at least with num_osds * 
num_tcmalloc_instance * TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES in a box. 

Also, I think there is no point of increasing osd_op_threads as it is not in IO 
path anymore..Mark is using default 5:2 for shard:thread per shard.. 

But, yes, it could be related to number of threads OSDs are using, need to 
understand how jemalloc works..Also, there may be some tuning to reduce memory 
usage (?). 

Thanks & Regards
Somnath 

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Alexandre DERUMIER
Sent: Wednesday, August 19, 2015 9:06 AM
To: Mark Nelson
Cc: ceph-devel
Subject: Re: Ceph Hackathon: More Memory Allocator Testing 

I was listening to today's meeting, 

and it seems that the blocker to making jemalloc the default 

is that it uses more memory per osd (around 300MB?), and some guys could have 
boxes with 60 disks. 


I just wonder if the memory increase is related to the 
osd_op_num_shards/osd_op_threads values? 

It seems that at the hackathon, the bench was done on super big cpu boxes 
(36 cores/72 threads), http://ceph.com/hackathon/2015-08-ceph-hammer-full-ssd.pptx, 
with osd_op_threads = 32. 

I think that tcmalloc has a fixed size 
(TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES), and shares it between all processes. 

Maybe jemalloc allocates memory per thread. 



(I think guys with 60-disk boxes don't use ssd, so low iops per osd, and they 
don't need a lot of threads per osd) 



-Original Message-
From: aderumier aderum...@odiso.com
To: Mark Nelson mnel...@redhat.com
Cc: ceph-devel ceph-devel@vger.kernel.org
Sent: Wednesday, August 19, 2015 16:01:28
Subject: Re: Ceph Hackathon: More Memory Allocator Testing 

Thanks Marc, 

Results are matching exactly what I

RE: The design of the eviction improvement

2015-07-22 Thread Allen Samuels
I'm very concerned about designing around the assumption that objects are ~1MB 
in size. That's probably a good assumption for block and HDFS dominated 
systems, but likely a very poor assumption about many object and file dominated 
systems.

If I understand the proposals that have been discussed, each of them assumes an 
in-memory data structure with an entry per object (the exact size of the entry 
varies with the different proposals).

Under that assumption, I have another concern, which is the lack of graceful 
degradation as the object counts grow and the in-memory data structures get 
larger. Everything seems fine until just a few more objects get added, and then 
the system starts to page and performance drops dramatically (likely) to the 
point where Linux will start killing OSDs.

What's really needed is some kind of way to extend the lists into storage in a 
way that doesn't cause a zillion I/O operations.

I have some vague idea that some data structure like the LSM mechanism ought to 
be able to accomplish what we want. Some amount of the data structure (the most 
likely to be used) is held in DRAM [and backed to storage for restart] and the 
least likely to be used is flushed to storage with some mechanism that allows 
batched updates.

Allen Samuels
Software Architect, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Wednesday, July 22, 2015 5:57 AM
To: Wang, Zhiqiang
Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: RE: The design of the eviction improvement

On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
  The part that worries me now is the speed with which we can load and
  manage such a list.  Assuming it is several hundred MB, it'll take a
  while to load that into memory and set up all the pointers (assuming
  a conventional linked list structure).  Maybe tens of seconds...

 I'm thinking of maintaining the lists at the PG level. That's to say,
 we have an active/inactive list for every PG. We can load the lists in
 parallel during rebooting. Also, the ~100 MB lists are split among
 different OSD nodes. Perhaps it does not need such long time to load
 them?

 
  I wonder if instead we should construct some sort of flat model
  where we load slabs of contiguous memory, 10's of MB each, and have
  the next/previous pointers be a (slab,position) pair.  That way we
  can load it into memory in big chunks, quickly, and be able to
  operate on it (adjust links) immediately.
 
  Another thought: currently we use the hobject_t hash only instead of
  the full object name.  We could continue to do the same, or we could
  do a hash pair (hobject_t hash + a different hash of the rest of the
  object) to keep the representation compact.  With a model like the
  above, that could get the object representation down to 2 u32's.  A
  link could be a slab + position (2 more u32's), and if we have prev
  + next that'd be just 6x4=24 bytes per object.

 Looks like for an object, the head and the snapshot version have the
 same hobject hash. Thus we have to use the hash pair instead of just
 the hobject hash. But I still have two questions if we use the hash
 pair to represent an object.

 1) Does the hash pair uniquely identify an object? That's to say, is
 it possible for two objects to have the same hash pair?

With two hashes, collisions would be rare but could happen.

 2) We need a way to get the full object name from the hash pair, so
 that we know what objects to evict. But seems like we don't have a
 good way to do this?

Ah, yeah--I'm a little stuck in the current hitset view of things.  I think we 
can either embed the full ghobject_t (which means we lose the fixed-size 
property, and the per-object overhead goes way up.. probably from ~24 bytes to 
more like 80 or 100).  Or, we can enumerate objects starting at the (hobject_t) 
hash position to find the object.  That's somewhat inefficient for FileStore 
(it'll list a directory of a hundred or so objects, probably, and iterate over 
them to find the right one), but for NewStore it will be quite fast (NewStore 
has all objects sorted into keys in rocksdb, so we just start listing at the 
right offset).  Usually we'll get the object right off, unless there are 
hobject_t hash collisions (already reasonably rare since it's a 2^32 space for 
the pool).

Given that, I would lean toward the 2-hash fixed-sized records (of these 2 
options)...
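For reference, a sketch of the fixed-size record and (slab, position) links being discussed, 6 x 4 = 24 bytes per object; the names are illustrative only:

// Sketch: the 24-byte per-object record and (slab, position) links described above.
#include <cstdint>
#include <vector>

struct lru_link {
  uint32_t slab;       // which mapped slab the neighbour lives in
  uint32_t pos;        // index within that slab
};

struct lru_record {
  uint32_t hobject_hash;   // existing hobject_t hash (2^32 space per pool)
  uint32_t name_hash;      // second hash over the rest of the object name
  lru_link prev, next;     // doubly-linked list without raw pointers
};                         // 6 * 4 = 24 bytes

static_assert(sizeof(lru_record) == 24, "record should stay fixed-size");

// Slabs are just big contiguous arrays, so they can be loaded in large chunks.
using slab = std::vector<lru_record>;

int main() {
  std::vector<slab> slabs(1);
  slabs[0].resize(2);
  slabs[0][0].next = {0, 1};   // link record 0 -> record 1 inside slab 0
  slabs[0][1].prev = {0, 0};
  return 0;
}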

sage


RE: The design of the eviction improvement

2015-07-22 Thread Allen Samuels
Don't we need to double-index the data structure?

We need it indexed by atime for the purposes of eviction, but we need it 
indexed by object name for the purposes of updating the list upon a usage.




Allen Samuels
Software Architect, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Wednesday, July 22, 2015 11:51 AM
To: Allen Samuels
Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: RE: The design of the eviction improvement

On Wed, 22 Jul 2015, Allen Samuels wrote:
 I'm very concerned about designing around the assumption that objects 
 are ~1MB in size. That's probably a good assumption for block and HDFS 
 dominated systems, but likely a very poor assumption about many object 
 and file dominated systems.
 
 If I understand the proposals that have been discussed, each of them 
 assumes in in-memory data structure with an entry per object (the 
 exact size of the entry varies with the different proposals).
 
 Under that assumption, I have another concern which is the lack of 
 graceful degradation as the object counts grow and the in-memory data 
 structures get larger. Everything seems fine until just a few objects 
 get added then the system starts to page and performance drops 
 dramatically (likely) to the point where Linux will start killing OSDs.
 
 What's really needed is some kind of way to extend the lists into 
 storage in way that's doesn't cause a zillion I/O operations.
 
 I have some vague idea that some data structure like the LSM mechanism 
 ought to be able to accomplish what we want. Some amount of the data 
 structure (the most likely to be used) is held in DRAM [and backed to 
 storage for restart] and the least likely to be used is flushed to 
 storage with some mechanism that allows batched updates.

How about this:

The basic mapping we want is object - atime.

We keep a simple LRU of the top N objects in memory with the object-atime 
values.  When an object is accessed, it is moved or added to the top of the 
list.

Periodically, or when the LRU size reaches N * (1.x), we flush:

 - write the top N items to a compact object that can be quickly loaded
 - write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb in a 
simple object - atime fashion

When the agent runs, we just walk across that key range of the db the same way 
we currently enumerate objects.  For each record we use either the stored atime 
or the value in the in-memory LRU (it'll need to be dual-indexed by both a list 
and a hash map), whichever is newer.  We can use the same histogram estimation 
approach we do now to determine if the object in question is below the 
flush/evict threshold.

The LSM does the work of sorting/compacting the atime info, while we avoid 
touching it at all for the hottest objects to keep the amount of work it has to 
do in check.

sage

 
 Allen Samuels
 Software Architect, Systems and Software Solutions
 
 2880 Junction Avenue, San Jose, CA 95134
 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Wednesday, July 22, 2015 5:57 AM
 To: Wang, Zhiqiang
 Cc: sj...@redhat.com; ceph-devel@vger.kernel.org
 Subject: RE: The design of the eviction improvement
 
 On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
   The part that worries me now is the speed with which we can load 
   and manage such a list.  Assuming it is several hundred MB, it'll 
   take a while to load that into memory and set up all the pointers 
   (assuming a conventional linked list structure).  Maybe tens of seconds...
 
  I'm thinking of maintaining the lists at the PG level. That's to 
  say, we have an active/inactive list for every PG. We can load the 
  lists in parallel during rebooting. Also, the ~100 MB lists are 
  split among different OSD nodes. Perhaps it does not need such long 
  time to load them?
 
  
   I wonder if instead we should construct some sort of flat model 
   where we load slabs of contiguous memory, 10's of MB each, and 
   have the next/previous pointers be a (slab,position) pair.  That 
   way we can load it into memory in big chunks, quickly, and be able 
   to operate on it (adjust links) immediately.
  
   Another thought: currently we use the hobject_t hash only instead 
   of the full object name.  We could continue to do the same, or we 
   could do a hash pair (hobject_t hash + a different hash of the 
   rest of the
   object) to keep the representation compact.  With a model lke the 
   above, that could get the object representation down to 2 u32's.  
   A link could be a slab + position (2 more u32's), and if we have 
   prev
   + next that'd be just 6x4=24 bytes

RE: The design of the eviction improvement

2015-07-22 Thread Allen Samuels
Yes the cost of the insertions with the current scheme is probably prohibitive. 
Wouldn't it approach the same amount of time as just having atime turned on in 
the file system? 

My concern about the memory is mostly that we ensure whatever algorithm is 
selected degrades gracefully when you get high counts of small objects. I agree 
that paying $ for RAM that translates into actual performance isn't really a 
problem. It really boils down to your workload and access pattern.


Allen Samuels
Software Architect, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: Sage Weil [mailto:sw...@redhat.com] 
Sent: Wednesday, July 22, 2015 2:53 PM
To: Allen Samuels
Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: RE: The design of the eviction improvement

On Wed, 22 Jul 2015, Allen Samuels wrote:
 Don't we need to double-index the data structure?
 
 We need it indexed by atime for the purposes of eviction, but we need 
 it indexed by object name for the purposes of updating the list upon a 
 usage.

If you use the same approach the agent uses now (iterate over items, evict/trim 
anything in bottom end of observed age distribution) you can get away without 
the double-index.  Iterating over the LSM should be quite cheap.  I'd be more 
worried about the cost of the insertions.

I'm also not sure the simplistic approach below can be generalized to something 
like 2Q (and certainly not something like MQ).  Maybe...

On the other hand, I'm not sure it is the end of the world if at the end of the 
day the memory requirements for a cache-tier OSD are higher and inversely 
proportional to the object size.  We can make the OSD flush/evict more 
aggressively if the memory utilization (due to a high object count) gets out of 
hand as a safety mechanism.  Paying a few extra $$ for RAM isn't the end of the 
world I'm guessing when the performance payoff is significant...

sage


  
 
 
 
 Allen Samuels
 Software Architect, Systems and Software Solutions
 
 2880 Junction Avenue, San Jose, CA 95134
 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
 
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Wednesday, July 22, 2015 11:51 AM
 To: Allen Samuels
 Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org
 Subject: RE: The design of the eviction improvement
 
 On Wed, 22 Jul 2015, Allen Samuels wrote:
  I'm very concerned about designing around the assumption that 
  objects are ~1MB in size. That's probably a good assumption for 
  block and HDFS dominated systems, but likely a very poor assumption 
  about many object and file dominated systems.
  
  If I understand the proposals that have been discussed, each of them 
  assumes in in-memory data structure with an entry per object (the 
  exact size of the entry varies with the different proposals).
  
  Under that assumption, I have another concern which is the lack of 
  graceful degradation as the object counts grow and the in-memory 
  data structures get larger. Everything seems fine until just a few 
  objects get added then the system starts to page and performance 
  drops dramatically (likely) to the point where Linux will start killing 
  OSDs.
  
  What's really needed is some kind of way to extend the lists into 
  storage in way that's doesn't cause a zillion I/O operations.
  
  I have some vague idea that some data structure like the LSM 
  mechanism ought to be able to accomplish what we want. Some amount 
  of the data structure (the most likely to be used) is held in DRAM 
  [and backed to storage for restart] and the least likely to be used 
  is flushed to storage with some mechanism that allows batched updates.
 
 How about this:
 
 The basic mapping we want is object - atime.
 
 We keep a simple LRU of the top N objects in memory with the object-atime 
 values.  When an object is accessed, it is moved or added to the top of the 
 list.
 
 Periodically, or when the LRU size reaches N * (1.x), we flush:
 
  - write the top N items to a compact object that can be quickly 
 loaded
  - write our records for the oldest items (N .. N*1.x) to 
 leveldb/rocksdb in a simple object - atime fashion
 
 When the agent runs, we just walk across that key range of the db the same 
 way we currently enumerate objects.  For each record we use either the stored 
 atime or the value in the in-memory LRU (it'll need to be dual-indexed by 
 both a list and a hash map), whichever is newer.  We can use the same 
 histogram estimation approach we do now to determine if the object in 
 question is below the flush/evict threshold.
 
 The LSM does the work of sorting/compacting the atime info, while we avoid 
 touching it at all for the hottest objects to keep the amount of work it has 
 to do in check.
 
 sage

RE: The design of the eviction improvement

2015-07-20 Thread Allen Samuels
This seems much better than the current mechanism. Do you have an estimate of 
the memory consumption of the two lists? (In terms of bytes/object?)


Allen Samuels
Software Architect, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wang, Zhiqiang
Sent: Monday, July 20, 2015 1:47 AM
To: Sage Weil; sj...@redhat.com; ceph-devel@vger.kernel.org
Subject: The design of the eviction improvement

Hi all,

This is a follow-up of one of the CDS session at 
http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction.
 We discussed the drawbacks of the current eviction algorithm and several ways 
to improve it. Seems like an LRU variant is the right way to go. I came up 
with some design points after the CDS, and want to discuss them with you. It is 
an approximate 2Q algorithm, combining some benefits of the clock algorithm, 
similar to what the linux kernel does for the page cache.

# Design points:

## LRU lists
- Maintain LRU lists at the PG level.
The SharedLRU and SimpleLRU implementations in the current code have a max_size, 
which limits the max number of elements in the list. This mostly looks like an 
MRU, though the names imply they are LRUs. Since the object size may vary in a 
PG, it's not possible to calculate the total number of objects which the cache 
tier can hold ahead of time. We need a new LRU implementation with no limit on 
the size.
- Two lists for each PG: active and inactive.
Objects are first put into the inactive list when they are accessed, and moved 
between these two lists based on some criteria.
Object flag: active, referenced, unevictable, dirty.
- When an object is accessed:
1) If it's not in both of the lists, it's put on the top of the inactive list
2) If it's in the inactive list, and the referenced flag is not set, the 
referenced flag is set, and it's moved to the top of the inactive list.
3) If it's in the inactive list, and the referenced flag is set, the referenced 
flag is cleared, and it's removed from the inactive list, and put on top of the 
active list.
4) If it's in the active list, and the referenced flag is not set, the 
referenced flag is set, and it's moved to the top of the active list.
5) If it's in the active list, and the referenced flag is set, it's moved to 
the top of the active list.
- When selecting objects to evict:
1) Objects at the bottom of the inactive list are selected to evict. They are 
removed from the inactive list.
2) If the number of the objects in the inactive list becomes low, some of the 
objects at the bottom of the active list are moved to the inactive list. For 
those objects which have the referenced flag set, they are given one more 
chance in the active list. They are moved to the top of the active list with 
the referenced flag cleared. For those objects which don't have the referenced 
flag set, they are moved to the inactive list, with the referenced flag set. So 
that they can be quickly promoted to the active list when necessary.

## Combine flush with eviction
- When evicting an object, if it's dirty, it's flushed first. After flushing, 
it's evicted. If not dirty, it's evicted directly.
- This means that we won't have separate activities and won't set different 
ratios for flush and evict. Is there a need to do so?
- Number of objects to evict at a time. 'evict_effort' acts as the priority, 
which is used to calculate the number of objects to evict.

## LRU lists Snapshotting
- The two lists are snapshotted and persisted periodically.
- Only one copy needs to be saved. The old copy is removed when persisting the 
lists. The saved lists are used to restore the LRU lists when OSD reboots.

Any comments/feedbacks are welcomed.


RE: Inline dedup/compression

2015-07-02 Thread Allen Samuels
For non-overwriting relatively large objects, this scheme works fine. 
Unfortunately the real use-case for deduplication is block storage with 
virtualized infrastructure (eliminating duplicate operating system files and 
applications, etc.) and in order for this to provide good deduplication, you'll 
need a block size that's equal to or smaller than the cluster size of the file 
system mounted on the block device. Meaning that your storage is now dominated 
by small chunks (probably 8K-ish) rather than the relatively large 4M stripes 
that are used today (this will also kill EC since small objects are replicated 
rather than ECed). This will have a massive impact on backend storage I/O as 
the basic data/metadata ratio is completely skewed (both for static storage and 
dynamic I/O count).
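To put rough numbers on that skew (the 8K segment size and 32-byte hash below are assumed purely for illustration, not measurements):

    4 MB object / 8 KB segments            = 512 segments per object
    512 segments x (32 B hash + ~8 B len) ~= 20 KB of manifest metadata per 4 MB of data
    512 small backend objects (each with its own metadata and I/O) vs. 1 object today

So the backend object count goes up by roughly 500x, and every client write fans out 
into many small writes, which is the data/metadata skew described above.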


Allen Samuels
Software Architect, Emerging Storage Solutions 

2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: Chaitanya Huilgol 
Sent: Thursday, July 02, 2015 3:50 AM
To: James (Fei) Liu-SSI; Allen Samuels; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Hi James et al.,

Here is an example for clarity:
1. Client writes object object.abcd
2. Based on the crush rules, say OSD.a is the primary OSD which receives the write
3. OSD.a performs segmenting/fingerprinting, which can be static or dynamic, and
   generates a list of segments; object.abcd is now represented by a manifest
   object with the list of segment hash and len:
   [Header]
   [Seg1_sha, len]
   [Seg2_sha, len]
   [Seg3_sha, len]
   ...
4. OSD.a writes each segment as a new object in the cluster with object name
   <reserved_dedupe_prefix><sha>
5. The dedupe object write is treated differently from regular object writes: if
   the object is present then a reference count is incremented and the object is
   not overwritten - this forms the basis of the dedupe logic. Multiple objects
   with one or more of the same constituent segments start sharing the segment
   objects.
6. Once all the segments are successfully written, the object 'object.abcd' is
   now just a stub object with the segment manifest as described above, and it
   goes through a regular object write sequence 

Partial writes on objects will be complicated,
- Partially affected segments will have to be read and segmentation logic has 
to be run from first to last affected segment boundaries
-  New segments will be written
- Old overwritten segments have to be deleted
- Write merged manifest of the object 

All this will need protection of the PG lock, Also additional journaling 
mechanism will be needed to  recover from cases where the osd goes down before 
writing all the segments. 

Since this is quite a lot of processing, a better use case for this dedupe 
mechanism would be in the data tiering model with object redirects.
The manifest object fits quite well into the object-redirects scheme of things; the 
idea is that, when an object is moved out of the base tier, you have an option 
to create a dedupe stub object and write individual segments into the cold 
backend tier with a rados plugin. 

Remaining responses inline.

Regards,
Chaitanya

-Original Message-
From: James (Fei) Liu-SSI [mailto:james@ssi.samsung.com]
Sent: Wednesday, July 01, 2015 4:00 AM
To: Chaitanya Huilgol; Allen Samuels; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Hi Chaitanya,
   Very interesting thoughts. I am not sure whether I get all of them or not. 
Here are several questions for the solution you provided; they might be a little bit 
detailed.

Regards,
James

- Dedupe is set as a pool property
Write:
- Write arrives at the primary OSD/pg
[James] Does the OSD/PG mean PG Backend over here? 
[Chaitanya] I mean the Primary OSD and the PG which get selected by the crush - 
not the specific OSD component

- Data is segmented (rabin/static) and secure hash computed [James] Which 
component in OSD are you going to do the data segment and hash computation?
[Chaitanya] If partial writes are not supported then this could be done before 
acquiring the PG lock, else we need the protection of the PG lock.  Probably in 
the do_request() path?

- A manifest is created with the offset/len/hash for all the segments [James] 
The manifest is going to be part of xattr of object? Where are you going to 
save manifest?
[Chaitanya] The manifest is a stub object with the constituent segments list 

- OSD/pg sends rados write with a special name __known__prefix<secure hash> 
for all segments [James] What do you mean by a rados write?  Where do all the 
segments with the secure hash signature get written to?
[Chaitanya] All segments are unique objects with the above mentioned naming 
scheme, they get written back into the cluster as a regular client rados object 
write

- PG receiving dedup write will:
1. check for object presence and create object if not present
2. If object is already present, then a reference count

RE: Inline dedup/compression

2015-06-30 Thread Allen Samuels
This covers the read and write; what about the delete? One of the major issues 
with dedupe, whether global or local, is addressing the inherent ref-counting 
associated with sharing of pieces of storage.
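A sketch of the ref-count lifecycle in question, with an in-memory map standing in for the cluster; making the check-and-increment and decrement-and-delete atomic per segment object is exactly the hard part and is simply assumed here:

// Sketch: segment reference counting for dedupe write/delete.
// A real implementation needs these updates to be atomic per segment object
// (e.g. under the lock of the PG that owns the segment), which is the hard part.
#include <iostream>
#include <map>
#include <string>
#include <vector>

std::map<std::string, int> seg_refs;                 // segment name -> refcount

void write_segment(const std::string& name) {
  auto it = seg_refs.find(name);
  if (it == seg_refs.end())
    seg_refs[name] = 1;                              // first writer creates the object
  else
    ++it->second;                                    // duplicate data: bump refcount only
}

void delete_manifest(const std::vector<std::string>& manifest) {
  for (const auto& name : manifest) {
    auto it = seg_refs.find(name);
    if (it == seg_refs.end()) continue;
    if (--it->second == 0)                           // last reference gone:
      seg_refs.erase(it);                            // reclaim the segment object
  }
}

int main() {
  std::vector<std::string> image_a = {"seg.aa", "seg.bb"};
  std::vector<std::string> image_b = {"seg.bb", "seg.cc"};   // shares seg.bb
  for (auto& s : image_a) write_segment(s);
  for (auto& s : image_b) write_segment(s);
  delete_manifest(image_a);                          // seg.bb survives with refcount 1
  std::cout << "segments left: " << seg_refs.size() << "\n"; // 2: seg.bb, seg.cc
}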

Allen Samuels
Software Architect, Emerging Storage Solutions 

2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Chaitanya Huilgol
Sent: Monday, June 29, 2015 11:20 PM
To: James (Fei) Liu-SSI; Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Below is an alternative idea, at a very high level, for dedup with ceph 
without the need for a centralized hash index:

- Dedupe is set as a pool property
Write:
- Write arrives at the primary OSD/pg
- Data is segmented (rabin/static) and secure hash computed
- A manifest is created with the offset/len/hash for all the segments
- OSD/pg sends rados write with a special name __known__prefix<secure hash> 
for all segments
- PG receiving dedup write will:
1. check for object presence and create object if not present
2. If object is already present, then a reference count is incremented 
(check and increment needs to be atomic)
- Response is received by original primary PG for all segments
- Primary PG writes the manifest to local and replicas or EC members
- Response sent to client

Read:
- Read received at primary PG
- Reads manifest object
- sends reads for each segment object __known_prefix<secure hash>
- coalesces all the responses to build the required data
- Responds to client
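A sketch of that read path (manifest lookup, one read per segment, coalesce in order); the two maps stand in for rados reads and the names are illustrative:

// Sketch: rebuilding object data from a dedupe manifest on read.
#include <iostream>
#include <map>
#include <string>
#include <vector>

std::map<std::string, std::vector<std::string>> manifests;  // object -> segment names
std::map<std::string, std::string> segments;                // segment name -> bytes

std::string read_object(const std::string& obj) {
  std::string out;
  for (const auto& name : manifests.at(obj))   // one rados read per segment
    out += segments.at(name);                  // coalesce responses in order
  return out;
}

int main() {
  segments["__known__prefix.1111"] = "hello ";
  segments["__known__prefix.2222"] = "world";
  manifests["object.abcd"] = {"__known__prefix.1111", "__known__prefix.2222"};
  std::cout << read_object("object.abcd") << "\n";   // "hello world"
}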


Pros:
No need for a centralized hash index, so it is in line with Ceph's no-bottleneck philosophy

Cons:
Some PGs may get overloaded due to frequently occurring segment patterns 
Latency and increased traffic on the network

Regards,
Chaitanya

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
Sent: Tuesday, June 30, 2015 2:25 AM
To: Haomai Wang
Cc: ceph-devel
Subject: RE: Inline dedup/compression

Hi Haomai,
  Thanks for moving the idea forward. Regarding the compression: however, 
if we do compression at the client level, it is not global. And the compression 
is only applied to the local client, am I right?  I think there are pros and 
cons in the two solutions, and we can get into more detail on each solution.
  I really like your idea for dedupe on the OSD side, by the way. Let me think 
more about it.

 Regards,
 James

-Original Message-
From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Friday, June 26, 2015 8:55 PM
To: James (Fei) Liu-SSI
Cc: ceph-devel
Subject: Re: Inline dedup/compression

On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI 
james@ssi.samsung.com wrote:
 Hi Haomai,
   Thanks for your response as always. I agree compression is a comparably 
 easier task, but still very challenging in terms of implementation no matter 
 where we implement it. The client side, like RBD, RGW or CephFS, or the PG, 
 should be a somewhat better place to implement it in terms of efficiency 
 and cost reduction, before the data is duplicated to other OSDs. There are two 
 reasons:
 1. Keeping the data consistent among OSDs in one PG
 2. Saving computing resources

 IMHO, the compression should be accomplished before the replication comes 
 into play at the pool level. However, we can also have a second level of 
 compression in the local objectstore.  In terms of the unit size of compression, 
 it really depends on the workload and in which layer we implement it.

 About inline deduplication, it will dramatically increase the complexities if 
 we bring in the replication and Erasure Coding for consideration.

 However, before we talk about implementation, it would be great if we can 
 understand the pros and cons of implementing inline dedupe/compression. We all 
 understand the benefits of dedupe/compression. However, the side effect is a 
 performance hit and the need for more computing resources. It would be great if we 
 can understand the problems from 30,000 feet, for the whole picture of 
 Ceph. Please correct me if I am wrong.

Actually we may have some tricks to reduce the performance hit of things like 
compression. As Joe mentioned, we can compress slave pg data to avoid a performance 
hit, but it may increase the complexity of recovery and pg remap things. Another, 
more detailed, implementation option: if we begin to compress data in the messenger, 
the osd thread and pg thread won't access the data for a normal client op, so maybe 
we can make compression run in parallel with pg processing. The journal thread 
would get the compressed data at the end.

The effectiveness of compression is also a concern; doing compression in rados may 
not get the best compression result. If we can do compression in libcephfs, librbd 
and radosgw and keep rados unaware of compression, it may be simpler and we can 
get file/block/object level compression. Wouldn't that be better?

About

RE: Regarding key/value interface

2014-09-12 Thread Allen Samuels
Another thing we're looking into is compression. The intersection of 
compression and object striping (fracturing) is interesting. Is the striping 
variable on a per-object basis? 

Allen Samuels
Chief Software Architect, Emerging Storage Solutions 

951 SanDisk Drive, Milpitas, CA 95035
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Thursday, September 11, 2014 6:55 PM
To: Somnath Roy
Cc: Haomai Wang (haomaiw...@gmail.com); ceph-us...@lists.ceph.com; 
ceph-devel@vger.kernel.org
Subject: RE: Regarding key/value interface

On Fri, 12 Sep 2014, Somnath Roy wrote:
 Make perfect sense Sage..
 
 Regarding striping of filedata, You are saying KeyValue interface will do the 
 following for me?
 
 1. Say in case of rbd image of order 4 MB, a write request coming to 
 Key/Value interface, it will  chunk the object (say full 4MB) in smaller 
 sizes (configurable ?) and stripe it as multiple key/value pair ?
 
 2. Also, while reading it will take care of accumulating and send it back.

Precisely.

A smarter thing we might want to make it do in the future would be to take a 4 
KB write create a new key that logically overwrites part of the larger, say, 
1MB key, and apply it on read.  And maybe give up and rewrite the entire 1MB 
stripe after too many small overwrites have accumulated.  
Something along those lines to reduce the cost of small IOs to large objects.
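A sketch of that idea: stripe keys, small overlay records applied on read, and a rewrite of the whole stripe once too many overlays accumulate. The key naming, the overlay threshold, and keeping overlays in a side map (rather than as their own keys) are all simplifications for illustration:

// Sketch: store object data as 1 MB stripe keys; absorb small overwrites as
// overlays, apply them on read, and rewrite the stripe when too many pile up.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

static const uint64_t STRIPE = 1 << 20;           // 1 MB value per stripe key
static const size_t   MAX_OVERLAYS = 16;          // "too many small overwrites"

std::map<std::string, std::string> kv;            // the backing key/value store

struct overlay { uint64_t off; std::string data; };
std::map<std::string, std::vector<overlay>> overlays;   // stripe key -> pending patches

std::string stripe_key(const std::string& obj, uint64_t idx) {
  return obj + ".stripe." + std::to_string(idx);
}

void small_write(const std::string& obj, uint64_t off, const std::string& data) {
  // assumes the write does not cross a stripe boundary (sketch only)
  std::string key = stripe_key(obj, off / STRIPE);
  overlays[key].push_back({off % STRIPE, data});  // cheap: no 1 MB read-modify-write
  if (overlays[key].size() > MAX_OVERLAYS) {      // compaction: fold overlays back in
    std::string whole = kv[key];
    whole.resize(STRIPE, '\0');
    for (auto& o : overlays[key])
      whole.replace(o.off, o.data.size(), o.data);
    kv[key] = whole;
    overlays[key].clear();
  }
}

std::string read_stripe(const std::string& obj, uint64_t idx) {
  std::string key = stripe_key(obj, idx);
  std::string whole = kv[key];
  whole.resize(STRIPE, '\0');
  for (auto& o : overlays[key])                   // apply pending overlays on read
    whole.replace(o.off, o.data.size(), o.data);
  return whole;
}

int main() {
  kv[stripe_key("rb.0.1", 0)] = std::string(STRIPE, 'A');
  small_write("rb.0.1", 4096, std::string(4096, 'B'));
  std::cout << read_stripe("rb.0.1", 0)[4096] << "\n";   // 'B'
}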

sage



  
 Thanks & Regards
 Somnath
 
 
 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Thursday, September 11, 2014 6:31 PM
 To: Somnath Roy
 Cc: Haomai Wang (haomaiw...@gmail.com); ceph-us...@lists.ceph.com; 
 ceph-devel@vger.kernel.org
 Subject: Re: Regarding key/value interface
 
 Hi Somnath,
 
 On Fri, 12 Sep 2014, Somnath Roy wrote:
 
  Hi Sage/Haomai,
 
  If I have a key/value backend that supports transactions and range 
  queries (and I don't need any explicit caching etc.) and I want to 
  replace filestore (and leveldb omap) with it, which interface do you 
  recommend I derive from: ObjectStore directly, or KeyValueDB?
 
  I have already integrated this backend by deriving from ObjectStore 
  interfaces earlier (pre keyvalueinterface days) but not tested 
  thoroughly enough to see what functionality is broken (Basic 
  functionalities of RGW/RBD are working fine).
 
  Basically, I want to know what are the advantages (and 
  disadvantages) of deriving it from the new key/value interfaces ?
 
  Also, what state is it in ? Is it feature complete and supporting 
  all the ObjectStore interfaces like clone and all ?
 
 Everything is supported, I think, except perhaps some IO hints that don't make 
 sense in a k/v context.  The big things that you get by using KeyValueStore 
 and plugging into the lower-level interface are:
 
  - striping of file data across keys
  - efficient clone
  - a zillion smaller methods that aren't conceptually difficult to implement 
 but are tedious to do.
 
 The other nice thing about reusing this code is that you can use a leveldb or 
 rocksdb backend as a reference for testing or performance or whatever.
 
 The main thing that will be a challenge going forward, I predict, is making 
 storage of the object byte payload in key/value pairs efficient.  I think 
 KeyValuestore is doing some simple striping, but it will suffer for small 
 overwrites (like 512-byte or 4k writes from an RBD).  There are probably some 
 pretty simple heuristics and tricks that can be done to mitigate the most 
 common patterns, but there is no simple solution since the backends generally 
 don't support partial value updates (I assume yours doesn't either?).  But, 
 any work done here will benefit the other backends too so that would be a 
 win..
 
 sage
 
 
 


RE: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose

2014-06-05 Thread Allen Samuels
You talk about resetting the object map on a restart after a crash -- I assume 
you mean rebuilding; how long will this take?


---
The true mystery of the world is the visible, not the invisible.
 Oscar Wilde (1854 - 1900)

Allen Samuels
Chief Software Architect, Emerging Storage Solutions

951 SanDisk Drive, Milpitas, CA 95035
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Haomai Wang
Sent: Thursday, June 05, 2014 12:43 AM
To: Wido den Hollander
Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org
Subject: Re: [Feature]Proposal for adding a new flag named shared to support 
performance and statistic purpose

On Thu, Jun 5, 2014 at 3:25 PM, Wido den Hollander w...@42on.com wrote:
 On 06/05/2014 09:01 AM, Haomai Wang wrote:

 Hi,
 Previously I sent a mail about the difficulty of rbd snapshot size
 statistics. The main solution is using an object map to store the changes.
 The problem is that we can't handle concurrent modification by multiple clients.

 The lack of an object map (like the pointer map in qcow2) causes many problems
 in librbd, such as clone depth: a deep clone depth will cause
 remarkable latency. Usually each clone layer roughly doubles the
 latency.

 I am considering a tradeoff between multi-client support and
 single-client support for librbd. In practice, most of the
 volumes/images are used by VMs, and usually only one client will
 access/modify an image. We shouldn't make shared images possible
 at the cost of making most use cases bad. So we can add a new flag called
 shared when creating an image. If shared is false, librbd will
 maintain an object map for each image. The object map is considered
 durable; each image_close call will store the map into rados. If the
 client crashes and fails to dump the object map, the next client to
 open the image will treat the object map as out of date and reset the
 object map.
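A sketch of the object-map lifecycle being proposed (a bitmap per image plus a clean-close marker, rebuilt when the previous writer crashed); the names and the rebuild-cost arithmetic are illustrative, not librbd code:

// Sketch: per-image object map with a clean-shutdown marker.
// If the previous client crashed before persisting the map, the next open
// falls back to a full rebuild (checking which backing objects exist).
#include <iostream>
#include <vector>

struct object_map {
  std::vector<bool> exists;     // one bit per RADOS object backing the image
  bool clean = true;            // persisted flag: was the map saved on close?
};

object_map stored;              // stand-in for the map object saved in rados

object_map image_open(size_t num_objects) {
  if (!stored.clean || stored.exists.size() != num_objects) {
    // previous writer crashed (or image resized): rebuild from scratch.
    // cost is one existence check per backing object, i.e. O(image size / order).
    object_map m;
    m.exists.assign(num_objects, false);   // would be filled by listing/statting objects
    stored = m;
  }
  stored.clean = false;         // mark in-use; cleared again by image_close()
  return stored;
}

void image_close(const object_map& m) {
  stored = m;
  stored.clean = true;          // persisted map is valid for the next open
}

int main() {
  size_t objs = (1ull << 40) / (4 << 20);       // e.g. 1 TB image, 4 MB order = 262144 objects
  object_map m = image_open(objs);
  m.exists[0] = true;                           // first write creates object 0
  image_close(m);
  std::cout << "map bits: " << stored.exists.size() << "\n";
}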


 Why not flush out the object map every X period? Assume a client runs
 for weeks or months and you would keep that map in memory all the time
 since the image is never closed.

Yes, a periodic job is also a good alternative




 We can easily find the advantage of this feature:
 1. Avoid clone performance problem
 2. Make snapshot statistic possible
 3. Improve librbd operation performance including read, copy-on-write
 operation.

 What do you think above? More feedbacks are appreciate!



 --
 Wido den Hollander
 42on B.V.

 Phone: +31 (0)20 700 9902
 Skype: contact42on



--
Best Regards,

Wheat



RE: RBD thoughts

2014-05-07 Thread Allen Samuels
Ok, now I think I understand. Essentially, you have a write-ahead log + lazy 
application of the log to the backend + code that correctly deals with the RAW 
hazard (same as Cassandra, FileStore, LevelDB, etc.). Correct?

So every block write is done three times, once for the replication journal, 
once in the FileStore journal and once in the target file system. Correct?

Also, if I understand the architecture, you'll be moving the data over the 
network at least one more time (* # of replicas). Correct?

This seems VERY expensive in system resources, though I agree it's a simpler 
implementation task.

---
Never put off until tomorrow what you can do the day after tomorrow.
 Mark Twain

Allen Samuels
Chief Software Architect, Emerging Storage Solutions

951 SanDisk Drive, Milpitas, CA 95035
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: Sage Weil [mailto:s...@inktank.com]
Sent: Wednesday, May 07, 2014 9:24 AM
To: Allen Samuels
Cc: ceph-devel@vger.kernel.org
Subject: RE: RBD thoughts

On Wed, 7 May 2014, Allen Samuels wrote:
 Sage wrote:
  Allen wrote:
   I was looking over the CDS for Giant and was paying particular
   attention to the rbd journaling stuff. Asynchronous
   geo-replications for block devices is really a key for enterprise
    deployment and this is the foundational element of that. It's an
   area that we are keenly interested in and would be willing to
    devote development resources toward. It wasn't clear from the
   recording whether this was just musings or would actually be
   development for Giant, but when you get your head above water
    w.r.t. the acquisition I'd like to investigate how we (Sandisk) could
   help turn this into a real project. IMO, this is MUCH more important than 
   CephFS stuff for penetrating enterprises.
  
   The blueprint suggests the creation of an additional journal for
   the block device and that this journal would track metadata
   changes and potentially record overwritten data (without the
    overwritten data you can only sync to snapshots -- which will be
   reasonable functionality for some use-cases). It seems to me that
    this probably doesn't work too well. Wouldn't it be the case that
   you really want to commit to the journal AND to the block device
    atomically? That's really problematic with the current RADOS
   design as the separate journal would be in a separate PG from the
   target block and likely on a separate OSD. Now you have all sorts of 
   cases of crashes/updates where the journal and the target block are out 
   of sync.
 
  The idea is to make it a write-ahead journal, which avoids any need
  for atomicity.  The writes are streamed to the journal, and applied
  to the rbd image proper only after they commit there.  Since block
   operations are effectively idempotent (you can replay the journal
  from any point and the end result is always the same) the recovery
  case is pretty simple.

 Who is responsible for the block device part of the commit? If it's
 the RBD code rather than the OSD, then I think there's a dangerous
 failure case where the journal commits and then the client crashes and
 the journal-based replication system ends up replicating the last
 (un-performed) write operation. If it's the OSDs that are responsible,
 then this is not an issue.

The idea is to use the usual set of write-ahead journaling tricks: we write 
first to the journal, then to the device, and lazily update a pointer 
indicating which journal events have been applied.  After a crash, the new 
client will reapply anything in the journal after that point to ensure the 
device is in sync.

While the device is in active use, we'd need to track which writes have not yet 
been applied to the device so we can delay a read following a recent write 
until it is applied.  (This should be very rare, given that the file system 
sitting on top of the device is generally doing all sorts of caching.)

This only works, of course, for use-cases where there is a single active writer 
for the device.  That means it's usable for local file systems like
ext3/4 and xfs, but not for something like ocfs2.
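A sketch of those write-ahead tricks (append to the journal first, apply lazily, persist an applied-up-to pointer, and force the apply before reads that could hit unapplied extents); the containers below are stand-ins, not the proposed rbd journal format:

// Sketch: write-ahead journaling with a lazily advanced "applied up to"
// pointer and read-blocking on not-yet-applied extents.
#include <cstdint>
#include <deque>
#include <iostream>
#include <map>
#include <string>

struct journal_event { uint64_t off; std::string data; };

std::deque<journal_event> journal;        // stand-in for the journal objects
uint64_t applied_upto = 0;                // persisted pointer: events [0, applied_upto) are on the image
std::map<uint64_t, char> image;           // stand-in for the rbd image (byte -> value)

void client_write(uint64_t off, const std::string& data) {
  journal.push_back({off, data});         // 1. commit to the journal first (ack the client here)
}

void apply_some(size_t n) {               // 2. lazily apply committed events to the image
  while (n-- && applied_upto < journal.size()) {
    const auto& ev = journal[applied_upto];
    for (size_t i = 0; i < ev.data.size(); ++i)
      image[ev.off + i] = ev.data[i];
    ++applied_upto;                       // 3. advance (and persist) the applied pointer
  }
}

char read_byte(uint64_t off) {
  // a read of an extent with unapplied journal events must wait for the apply;
  // here we just force it (rare in practice: the guest fs caches recent writes)
  apply_some(journal.size() - applied_upto);
  auto it = image.find(off);
  return it == image.end() ? 0 : it->second;
}

void recover_after_crash() {
  // replay is idempotent: reapplying from applied_upto always converges
  apply_some(journal.size() - applied_upto);
}

int main() {
  client_write(0, "hello");
  client_write(3, "LO!");
  std::cout << read_byte(3) << "\n";      // 'L' once the events are applied
  recover_after_crash();
}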

  Similarly, I don't think the snapshot limitation is there; you can
  simply note the journal offset, then copy the image (in a racy way),
  and then replay the journal from that position to capture the recent
  updates.

 w.r.t. snapshots and non-old-data-preserving journaling mode, How will
 you deal with the race between reading the head of the journal and
 reading the data referenced by that head of the journal that could be
 over-written by a write operation before you can actually read it?

Oh, I think I'm using different terminology.  I'm assuming that the journal 
includes the *new* data (ala data=journal mode for ext*).  We talked a bit at 
CDS about an optional separate journal with overwritten data so that you could 
'rewind' activity

RE: RBD thoughts

2014-05-07 Thread Allen Samuels
The extra network move that I was referring to would be local, i.e., from the 
node containing the write-ahead journal to the nodes containing the destination 
objects. I wasn't counting any geo-replication, that would be yet another 
network move.


---
Now I know what a statesman is; he's a dead politician. We need more statesmen. 
 Bob Edwards 

Allen Samuels
Chief Software Architect, Emerging Storage Solutions 

951 SanDisk Drive, Milpitas, CA 95035
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com


-Original Message-
From: Sage Weil [mailto:s...@inktank.com]
Sent: Wednesday, May 07, 2014 12:33 PM
To: Allen Samuels
Cc: ceph-devel@vger.kernel.org
Subject: RE: RBD thoughts

On Wed, 7 May 2014, Allen Samuels wrote:
 Ok, now I think I understand. Essentially, you have a write-ahead log
 + lazy application of the log to the backend + code that correctly
 deals with the RAW hazard (same as Cassandra, FileStore, LevelDB, etc.).
 Correct?

Right.

 So every block write is done three times, once for the replication 
 journal, once in the FileStore journal and once in the target file 
 system. Correct?

More than that, actually.  With the FileStore backend, every write is done 2x.  
The rbd journal would be on top of rados objects, so that's 2*2.  
But that cost goes away with an improved backend that doesn't need a journal 
(like the kv backend or f2fs).
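Spelling that accounting out for one client write with 3x replication (illustrative only; it just multiplies the 2x FileStore cost by the journal-on-rados layering described above):

    rbd journal append : 1 rados write -> 3 OSDs -> 3 x 2 (FileStore journal + fs) = 6 device writes
    image object write  : 1 rados write -> 3 OSDs -> 3 x 2                          = 6 device writes
    total               : 12 device writes plus two client->primary->replica network fan-outs,
                          before any cross-cluster (geo) mirroring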

 Also, if I understand the architecture, you'll be moving the data over 
 the network at least one more time (* # of replicas). Correct?

Right; this would be mirrored in the target cluster, probably in another data 
center.

 This seems VERY expensive in system resources, though I agree it's a 
 simpler implementation task.

It's certainly not free. :) 

sage


 
 ---
 Never put off until tomorrow what you can do the day after tomorrow.
  Mark Twain
 
 Allen Samuels
 Chief Software Architect, Emerging Storage Solutions
 
 951 SanDisk Drive, Milpitas, CA 95035
 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
 
 
 -Original Message-
 From: Sage Weil [mailto:s...@inktank.com]
 Sent: Wednesday, May 07, 2014 9:24 AM
 To: Allen Samuels
 Cc: ceph-devel@vger.kernel.org
 Subject: RE: RBD thoughts
 
 On Wed, 7 May 2014, Allen Samuels wrote:
  Sage wrote:
   Allen wrote:
I was looking over the CDS for Giant and was paying particular 
attention to the rbd journaling stuff. Asynchronous 
geo-replications for block devices is really a key for 
enterprise deployment and this is the foundational element of 
that. It's an area that we are keenly interested in and would be 
willing to devote development resources toward. It wasn't clear 
from the recording whether this was just musings or would 
actually be development for Giant, but when you get your head 
above water w.r.t. the acquisition I'd like to investigate how we 
(Sandisk) could help turn this into a real project. IMO, this is MUCH 
more important than CephFS stuff for penetrating enterprises.
   
The blueprint suggests the creation of an additional journal for 
the block device and that this journal would track metadata 
changes and potentially record overwritten data (without the 
overwritten data you can only sync to snapshots -- which will be 
reasonable functionality for some use-cases). It seems to me 
that this probably doesn't work too well. Wouldn't it be the 
case that you really want to commit to the journal AND to the 
block device atomically? That's really problematic with the 
current RADOS design as the separate journal would be in a 
separate PG from the target block and likely on a separate OSD. Now you 
have all sorts of cases of crashes/updates where the journal and the 
target block are out of sync.
  
   The idea is to make it a write-ahead journal, which avoids any 
   need for atomicity.  The writes are streamed to the journal, and 
   applied to the rbd image proper only after they commit there.
    Since block operations are effectively idempotent (you can replay 
   the journal from any point and the end result is always the same) 
   the recovery case is pretty simple.
 
  Who is responsible for the block device part of the commit? If it's 
  the RBD code rather than the OSD, then I think there's a dangerous 
  failure case where the journal commits and then the client crashes 
  and the journal-based replication system ends up replicating the 
  last
  (un-performed) write operation. If it's the OSDs that are 
  responsible, then this is not an issue.
 
 The idea is to use the usual set of write-ahead journaling tricks: we write 
 first to the journal, then to the device, and lazily update a pointer 
 indicating which journal events have been applied.  After a crash, the new 
 client will reapply anything in the journal after that point