Re: fecpp C++ forward error correction library

2013-04-29 Thread Christopher LILJENSTOLPE
Supposedly, on 2013-Apr-29, at 07.23 PDT(-0700), someone claiming to be Loic 
Dachary scribed:

> On 04/29/2013 04:06 PM, Jimmy Tang wrote:
>>
>> On 22 Apr 2013, at 14:19, Loic Dachary wrote:
>>
>>> Hi Christopher,
>>>
>>> Jack Lloyd is the author of fecpp ( http://www.randombit.net/code/fecpp/ ) 
>>> and he tells me someone sent him a new SIMD approach a few weeks ago. I'm 
>>> not sure what SIMD means yet, but I'll figure it out ;-). I tend to favor 
>>> fecpp because it is more self contained and may be easier to embed than 
>>> https://pypi.python.org/pypi/zfec
>>>
>>> Cheers
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>
>>
>> Having played with zfec (mostly the underlying fec.[c,h] files), it's well 
>> written and self-contained as I remember it; but you will need to write a 
>> bunch of wrapper functions to chunk up data and pad out blocks, and of 
>> course turn it into a library, before you can use the underlying zfec 
>> implementation for ceph.  If my memory is right, zfec's C code requires a 
>> C90 or C99 compiler to build it; I'm not sure if you care about that.
>
> Hi Jimmy,
>
> fecpp and zfec compile and run using g++ 4.7.  The requirements for 
> http://code.google.com/p/holostor/ may be different but I suspect it also 
> works.

Greetings - just looked at holostor - some concerns: it looks like it caps out 
(per the comments) at 17:21.  Since I know of at least one case where I needed 
to do something like 10:18, that might already violate holostor's constraint 
(n-m > 4), or at least push right up against its m boundary.

Christopher


>
> Cheers
>
>>
>> Regards,
>> Jimmy Tang
>>
>> --
>> Senior Software Engineer, Digital Repository of Ireland (DRI)
>> High Performance & Research Computing, IS Services
>> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
>> http://www.tchpc.tcd.ie/ | jt...@tchpc.tcd.ie
>> Tel: +353-1-896-3847
>>
>
> -- 
> Loïc Dachary, Artisan Logiciel Libre


--
李柯睿
Check my PGP key here: http://www.asgaard.org/cdl/cdl.asc
Current vCard here: http://www.asgaard.org/cdl/cdl.vcf



Re: fecpp C++ forward error correction library

2013-04-29 Thread Christopher LILJENSTOLPE
Supposedly, on 2013-Apr-22, at 06.19 PDT(-0700), someone claiming to be Loic 
Dachary scribed:

> Hi Christopher,
>
> Jack Lloyd is the author of fecpp ( http://www.randombit.net/code/fecpp/ ) 
> and he tells me someone sent him a new SIMD approach a few weeks ago. I'm not 
> sure what SIMD means yet, but I'll figure it out ;-). I tend to favor fecpp 
> because it is more self contained and may be easier to embed than 
> https://pypi.python.org/pypi/zfec

I'll defer, provided that we don't incur a performance penalty (which is one 
of the negatives of sharding).  As for SIMD: Single Instruction, Multiple 
Data.  Basically, if you think of a tiled processor (like a GPU), you set all 
the tiles to execute the same instruction and stream different data to each 
tile.  That's as opposed to MIMD, where each tile runs a different instruction 
on different data.  You don't need parallel-processing CPUs to make use of 
SIMD, but you can get things going screamingly fast if you have them…
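
To make that concrete in FEC terms: the inner loop of an erasure coder spends 
its time doing the same byte-wise arithmetic over long buffers, which is 
exactly the SIMD-friendly shape.  A minimal sketch (my illustration, not the 
patch Jack was sent) using SSE2 intrinsics to XOR one data block into a parity 
block 16 bytes per instruction:

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// XOR one data block into a parity block, 16 bytes per instruction:
// one instruction, sixteen lanes of data -- SIMD in miniature.
void xor_into_parity(unsigned char* parity, const unsigned char* data,
                     size_t len)
{
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i p = _mm_loadu_si128(reinterpret_cast<const __m128i*>(parity + i));
        __m128i d = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(parity + i),
                         _mm_xor_si128(p, d));
    }
    for (; i < len; ++i)   // scalar tail for the last len % 16 bytes
        parity[i] ^= data[i];
}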

Christopher

>
> Cheers
>
> -- 
> Loïc Dachary, Artisan Logiciel Libre


--
李柯睿
Check my PGP key here: http://www.asgaard.org/cdl/cdl.asc
Current vCard here: http://www.asgaard.org/cdl/cdl.vcf



Re: erasure coding (sorry)

2013-04-23 Thread Christopher LILJENSTOLPE
Supposedly, on 2013-Apr-22, at 08.09 PDT(-0700), someone claiming to be Sage 
Weil scribed:

> On Mon, 22 Apr 2013, Christopher LILJENSTOLPE wrote:
>> Supposedly, on 2013-Apr-22, at 01.10 PDT(-0700), someone claiming to be Loic 
>> Dachary scribed:
>>
>>> Hi Christopher,
>>>
>>> You wrote "A modified client/library could be used to store objects that 
>>> should be sharded, vs "standard" ceph treatment.  In this model, each shard 
>>> would be written to a seperate PG, and each PG would we stored on exactly 
>>> one OSD.  " but there is no way for a client to enforce the fact that two 
>>> objects are stored in separate PG.
>>
>> Poorly worded.  The idea is that each shard becomes a separate object, and 
>> the encoder/sharder would use CRUSH to identify the OSDs to hold the shards. 
>> However, the OSDs would treat each shard as an n=1 replication and just 
>> store it locally.
>>
>> Actually, looking at this this morning, this is harder than the preferred 
>> alternative (i.e. grafting an encode/decode into the (e)OSD); it was meant 
>> to cover the alternative approaches.  I didn't like this one, but it now 
>> appears to be more difficult, and non-deterministic in its placement.
>>
>> One question on CRUSH (it's been too long since I read the paper): if x is 
>> the same for two objects, and using n=3 returns R={OSD18,OSD45,OSD97}, and 
>> an object matching x is handed to OSD45 with n=1, would OSD45 store it, or 
>> would it forward it to OSD18 to store?  If it would forward, this idea is 
>> DOA.  Also, if x is held invariant but n changes, does the same R set get 
>> returned (truncated to n members)?
>
> It would go to osd18, the first item in the sequence that CRUSH generates.

That's what I thought - then it is a non-starter

>
> As Loic observes, not having control of placement from above the librados
> level makes this more or less a non-starter.  The only thing that might
> work at that layer is to set up ~5 or more pools, each with a distinct set
> of OSDs, and put each shard/fragment in a different pool.  I don't think
> that is a particularly good approach.

Pretty much a kludge - I would agree

>
> If we are going to do parity encoding (and I think we should!), I think we
> should fully integrate it into the OSD.
>
> The simplest approach:
>
> - we create a new PG type for 'parity' or 'erasure' or whatever (type
> fields already exist)
> - those PGs use the parity ('INDEP') crush mode so that placement is
> intelligent
> - all reads and writes go to the 'primary'
> - the primary does the shard encoding and distributes the write pieces to
> the other replicas
> - same for reads

Yup - that's basically what I was trying to outline for the single-tier model.  
I called them eOSDs.
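
To make sure I'm picturing the same flow, here's a rough sketch of the 
eOSD/primary write path as I had it in mind -- all names hypothetical 
(encode_stripe() stands in for the real fecpp/zfec call, send_to_replica() for 
the usual OSD messaging), and the ".shard.<i>" suffix is just one possible 
answer to the internal-naming challenge below:

#include <cstddef>
#include <string>
#include <vector>

using Shard = std::vector<unsigned char>;

// Placeholder encoder: k data slices plus m zeroed "parity" shards.  The
// real version would fill the parity shards via fecpp/zfec GF arithmetic.
std::vector<Shard> encode_stripe(const std::vector<unsigned char>& obj,
                                 unsigned k, unsigned m)
{
    size_t block = (obj.size() + k - 1) / k;  // last slice implicitly zero-padded
    std::vector<Shard> out(k + m, Shard(block, 0));
    for (unsigned i = 0; i < k; ++i)
        for (size_t j = 0; j < block && i * block + j < obj.size(); ++j)
            out[i][j] = obj[i * block + j];
    return out;
}

void send_to_replica(int /*osd*/, const std::string& /*name*/,
                     const Shard& /*s*/) {}   // stub for the PG messaging

// The primary encodes the object and fans one shard out to each member
// of the PG's (INDEP-mode) CRUSH mapping.
void primary_write(const std::string& oid,
                   const std::vector<unsigned char>& obj,
                   const std::vector<int>& pg_osds,   // from CRUSH
                   unsigned k, unsigned m)
{
    if (pg_osds.empty()) return;
    std::vector<Shard> shards = encode_stripe(obj, k, m);
    for (size_t i = 0; i < shards.size(); ++i)
        send_to_replica(pg_osds[i % pg_osds.size()],
                        oid + ".shard." + std::to_string(i), shards[i]);
}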
>
> There will be a pile of patches to move code around between PG and
> ReplicatedPG, which will be annoying, but hopefully not too painful.  The
> class structure and data types were set up with this in mind long ago.
>
> Several key challenges:
>
> - come up with a scheme for internal naming to keep shards distinct
> - safely rewriting a stripe when there is a partial overwrite.  probably
> want to write new stripes to distinct new objects (cloning old data as
> needed) and clean up the old ones once enough copies are present.
> - recovery logic

Been giving this some thought - I'll try and get them into the blueprint.  Is 
the blueprint, as it is, reasonable to include in the design summit, knowing 
that it will continue to evolve?
>
> sage
>
Christopher

>
>>
>>  Thx
>>  Christopher
>>
>>
>>
>>>
>>> Am I missing something ?
>>>
>>> On 04/22/2013 09:23 AM, Christopher LILJENSTOLPE wrote:
>>>> Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be 
>>>> Plaetinck, Dieter scribed:
>>>>
>>>>> On Thu, 18 Apr 2013 16:09:52 -0500
>>>>> Mark Nelson  wrote:
>>>>
>>>>>>
>>>>>
>>>>> @Bryan: I did come across cleversafe.  all the articles around it seemed 
>>>>> promising,
>>>>> but unfortunately it seems everything related to the cleversafe open 
>>>>> source project
>>>>> somehow vanished from the internet.  (e.g. http://www.cleversafe.org/) 
>>>>> quite weird...
>>>>>
>>>>> @Sage: interesting. I thought it would be relatively simple if one 
>>>>> assumes
>>>>>

Re: erasure coding (sorry)

2013-04-22 Thread Christopher LILJENSTOLPE
Supposedly, on 2013-Apr-22, at 01.10 PDT(-0700), someone claiming to be Loic 
Dachary scribed:

> Hi Christopher,
>
> You wrote "A modified client/library could be used to store objects that 
> should be sharded, vs "standard" ceph treatment.  In this model, each shard 
> would be written to a seperate PG, and each PG would we stored on exactly one 
> OSD.  " but there is no way for a client to enforce the fact that two objects 
> are stored in separate PG.

Poorly worded.  The idea is that each shard becomes a separate object, and the 
encoder/sharder would use CRUSH to identify the OSDs to hold the shards.  
However, the OSDs would treat each shard as an n=1 replication and just store 
it locally.

Actually, looking at this this morning, this is harder than the preferred 
alternative (i.e. grafting an encode/decode into the (e)OSD); it was meant to 
cover the alternative approaches.  I didn't like this one, but it now appears 
to be more difficult, and non-deterministic in its placement.

One question on CRUSH (it's been too long since I read the paper): if x is the 
same for two objects, and using n=3 returns R={OSD18,OSD45,OSD97}, and an 
object matching x is handed to OSD45 with n=1, would OSD45 store it, or would 
it forward it to OSD18 to store?  If it would forward, this idea is DOA.  
Also, if x is held invariant but n changes, does the same R set get returned 
(truncated to n members)?
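
For what it's worth, my mental model (which may be wrong -- hence the 
question) is that CRUSH's selection for a given x is a deterministic ordering 
that n merely truncates, something like this rendezvous-hashing toy 
(emphatically not the real CRUSH code; the scoring here is made up):

#include <algorithm>
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

// Toy stand-in for CRUSH: rank OSDs by a hash of (x, osd) and take the
// top n.  The ranking depends only on x, so shrinking n just truncates
// the same ordered list -- n=1 yields the head of the n=3 result.
std::vector<int> toy_select(uint64_t x, std::vector<int> osds, size_t n)
{
    auto score = [x](int osd) {
        return std::hash<uint64_t>{}(x * 1000003u + static_cast<uint64_t>(osd));
    };
    std::sort(osds.begin(), osds.end(),
              [&](int a, int b) { return score(a) > score(b); });
    osds.resize(std::min(n, osds.size()));
    return osds;
}

int main()
{
    std::vector<int> osds = {3, 7, 18, 45, 97};
    for (int o : toy_select(42, osds, 3)) std::cout << o << ' ';
    std::cout << '\n';
    for (int o : toy_select(42, osds, 1)) std::cout << o << ' ';  // same first osd
    std::cout << '\n';
}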

Thx
Christopher



>
> Am I missing something ?
>
> On 04/22/2013 09:23 AM, Christopher LILJENSTOLPE wrote:
>> Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be 
>> Plaetinck, Dieter scribed:
>>
>>> On Thu, 18 Apr 2013 16:09:52 -0500
>>> Mark Nelson  wrote:
>>
>>>>
>>>
>>> @Bryan: I did come across cleversafe.  all the articles around it seemed 
>>> promising,
>>> but unfortunately it seems everything related to the cleversafe open source 
>>> project
>>> somehow vanished from the internet.  (e.g. http://www.cleversafe.org/) 
>>> quite weird...
>>>
>>> @Sage: interesting. I thought it would be relatively simple if one 
>>> assumes
>>> the restriction of immutable files.  I'm not familiar with those ceph 
>>> specifics you're mentioning.
>>> When building an erasure codes-based system, maybe there's ways to reuse 
>>> existing ceph
>>> code and/or allow some integration with replication based objects, without 
>>> aiming for full integration or
>>> full support of the rados api, based on some tradeoffs.
>>>
>>> @Josh, that sounds like an interesting approach.  Too bad that page doesn't 
>>> contain any information yet :)
>>
>> Greetings - it does now - see what you all think…
>>
>>  Christopher
>>
>>>
>>> Dieter
>>
>>
>> --
>> 李柯睿
>> Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
>> Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
>> Check my calendar availability: https://tungle.me/cdl
>
> -- 
> Loïc Dachary, Artisan Logiciel Libre


--
李柯睿
Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
Check my calendar availability: https://tungle.me/cdl



Re: erasure coding (sorry)

2013-04-22 Thread Christopher LILJENSTOLPE
Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be 
Plaetinck, Dieter scribed:

> On Thu, 18 Apr 2013 16:09:52 -0500
> Mark Nelson  wrote:

>>
>
> @Bryan: I did come across cleversafe.  all the articles around it seemed 
> promising,
> but unfortunately it seems everything related to the cleversafe open source 
> project
> somehow vanished from the internet.  (e.g. http://www.cleversafe.org/) quite 
> weird...
>
> @Sage: interesting. I thought it would be relatively simple if one 
> assumes
> the restriction of immutable files.  I'm not familiar with those ceph 
> specifics you're mentioning.
> When building an erasure codes-based system, maybe there's ways to reuse 
> existing ceph
> code and/or allow some integration with replication based objects, without 
> aiming for full integration or
> full support of the rados api, based on some tradeoffs.
>
> @Josh, that sounds like an interesting approach.  Too bad that page doesn't 
> contain any information yet :)

Greetings - it does now - see what you all think…

Christopher

>
> Dieter


--
李柯睿
Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
Check my calendar availability: https://tungle.me/cdl



Re: erasure coding (sorry)

2013-04-18 Thread Christopher LILJENSTOLPE
Supposedly, on 2013-Apr-18, at 14.26 PDT(-0700), someone claiming to be Sage 
Weil scribed:

> On Thu, 18 Apr 2013, Noah Watkins wrote:
>> On Apr 18, 2013, at 2:08 PM, Josh Durgin  wrote:
>>
>>> I talked to some folks interested in doing a more limited form of this
>>> yesterday. They started a blueprint [1]. One of their ideas was to have
>>> erasure coding done by a separate process (or thread perhaps). It would
>>> use erasure coding on an object and then use librados to store the
>>> erasure-encoded pieces in a separate pool, and finally leave a marker in
>>> place of the original object in the first pool.
>>
>> This sounds at a high-level similar to work out of Microsoft:
>>
>> https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf
>>
>> The basic idea is to replicate first, then erasure code in the background.
>
> FWIW, I think a useful (and generic) concept to add to rados would be a
> redirect symlink sort of thing that says "oh, this object is over there in
> that other pool", such that client requests will be transparently
> redirected or proxied.  This will enable generic tiering type operations,
> and probably simplify/enable migration without a lot of additional
> complexity on the client side.

More to come, but I'm starting to think of a union mount with a fuse 
"re-directing" overlay.  The quick idea:

On the "hot" pool, the OSDs would write to the host FS as usual.  However, 
that FS is actually a lightweight fuse (at least for a prototype) filesystem 
that passes almost everything straight down to the real file system.  As the 
OSD hits a capacity HWM (high watermark), a watcher (an asynchronous process) 
starts "evicting" objects from the OSD.  It does that by using a modified ceph 
client that calls zfec and uses CRUSH to place the resulting shards in the 
"cool" pool.  Once those are committed, it replaces the object in the "hot" 
OSD with a special token.  This is repeated until an LWM (low watermark) is 
reached.  When the OSD gets a read request for that object and the fuse shim 
sees the token, it knows to do a modified client fetch from the "cool" pool 
instead.  It returns the resulting object to the original requester and 
(potentially) stores the object back in the "hot" OSD (if you want cache-like 
performance), replacing the token.  If the HWM is breached again, some other 
object may in turn be evicted.
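
To pin the mechanics down, a hedged sketch of one watcher pass -- every name 
in it (object_size, encode_and_place_cool, replace_with_token, the watermark 
fields) is hypothetical, standing in for pieces that don't exist yet:

#include <cstddef>
#include <deque>
#include <string>

size_t object_size(const std::string&)           { return 4 << 20; }  // stub
void   encode_and_place_cool(const std::string&) {}  // zfec + CRUSH into "cool"
void   replace_with_token(const std::string&)    {}  // leave redirect token behind

struct HotOsd {
    size_t used_bytes;
    size_t hwm;   // high watermark: eviction kicks in above this
    size_t lwm;   // low watermark: eviction stops below this
    std::deque<std::string> cold_candidates;   // coldest objects first (e.g. LRU)
};

void watcher_pass(HotOsd& osd)
{
    if (osd.used_bytes <= osd.hwm)
        return;                                 // nothing to do until the HWM
    while (osd.used_bytes > osd.lwm && !osd.cold_candidates.empty()) {
        std::string oid = osd.cold_candidates.front();
        osd.cold_candidates.pop_front();
        size_t sz = object_size(oid);
        encode_and_place_cool(oid);   // commit shards to the cool pool first...
        replace_with_token(oid);      // ...then swap the local copy for the token
        osd.used_bytes -= sz;
    }
}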

We would also need to modify the repair mechanism for the deep scrub in the 
"cool" pool to account for the repair being a reconstitution of an invalid 
shard, rather than a copy (as there is only one copy of a given shard).

I'll get a bit more of a write-up today, hopefully, in the wiki.

Christopher

>
> sage


--
李柯睿
Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
Check my calendar availability: https://tungle.me/cdl



Re: erasure coding (sorry)

2013-04-18 Thread Christopher LILJENSTOLPE
Supposedly, on 2013-Apr-18, at 14.24 PDT(-0700), someone claiming to be Noah 
Watkins scribed:

> On Apr 18, 2013, at 2:08 PM, Josh Durgin  wrote:
>
>> I talked to some folks interested in doing a more limited form of this
>> yesterday. They started a blueprint [1]. One of their ideas was to have
>> erasure coding done by a separate process (or thread perhaps). It would
>> use erasure coding on an object and then use librados to store the
>> erasure-encoded pieces in a separate pool, and finally leave a marker in
>> place of the original object in the first pool.
>
> This sounds at a high-level similar to work out of Microsoft:

I've looked at that, and this would be somewhat similar (not identical, but it 
would borrow some ideas).

Christopher

>
> https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf
>
> The basic idea is to replicate first, then erasure code in the background.
>
> - Noah


--
李柯睿
Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
Check my calendar availability: https://tungle.me/cdl



Re: erasure coding (sorry)

2013-04-18 Thread Christopher LILJENSTOLPE
Supposedly, on 2013-Apr-18, at 14.31 PDT(-0700), someone claiming to be 
Plaetinck, Dieter scribed:

> On Thu, 18 Apr 2013 16:09:52 -0500
> Mark Nelson  wrote:
>
>> On 04/18/2013 04:08 PM, Josh Durgin wrote:
>>> On 04/18/2013 01:47 PM, Sage Weil wrote:
 On Thu, 18 Apr 2013, Plaetinck, Dieter wrote:
> sorry to bring this up again, googling revealed some people don't
> like the subject [anymore].
>
> but I'm working on a new +- 3PB cluster for storage of immutable files.
> and it would be either all cold data, or mostly cold. 150MB avg
> filesize, max size 5GB (for now)
> For this use case, my impression is erasure coding would make a lot
> of sense
> (though I'm not sure about the computational overhead on storing and
> loading objects..? outbound traffic would peak at 6 Gbps, but I can
> make it way less and still keep a large cluster, by taking away the
> small set of hot files.
> inbound traffic would be minimal)
>
> I know that the answer a while ago was "no plans to implement erasure
> coding", has this changed?
> if not, is anyone aware of a similar system that does support it? I
> found QFS but that's meant for batch processing, has a single
> 'namenode' etc.

 We would love to do it, but it is not a priority at the moment (things
 like multi-site replication are in much higher demand).  That of course
 doesn't prevent someone outside of Inktank from working on it :)

 The main caveat is that it will be complicated.  For an initial
 implementation, the full breadth of the rados API probably wouldn't be
 supported for erasure/parity encoded pools (things like rados classes and
 the omap key/value api get tricky when you start talking about parity).
 But for many (or even most) use cases, objects are just bytes, and those
 restrictions are just fine.
>>>
>>> I talked to some folks interested in doing a more limited form of this
>>> yesterday. They started a blueprint [1]. One of their ideas was to have
>>> erasure coding done by a separate process (or thread perhaps). It would
>>> use erasure coding on an object and then use librados to store the
>>> erasure-encoded pieces in a separate pool, and finally leave a marker in
>>> place of the original object in the first pool.
>>>
>>> When the osd detected this marker, it would proxy the request to the
>>> erasure coding thread/process which would service the request on the
>>> second pool for reads, and potentially make writes move the data back to
>>> the first pool in a tiering sort of scenario.
>>>
>>> I might have misremembered some details, but I think it's an
>>> interesting way to get many of the benefits of erasure coding with a
>>> relatively small amount of work compared to a fully native osd solution.
>>>
>>> Josh
>>
>> Neat. :)
>>
>
> @Bryan: I did come across cleversafe.  all the articles around it seemed 
> promising,
> but unfortunately it seems everything related to the cleversafe open source 
> project
> somehow vanished from the internet.  (e.g. http://www.cleversafe.org/) quite 
> weird...

Yea - in a previous incarnation I looked at cleversafe to do something similar 
a few years ago.  It is odd that the cleversafe.org stuff did disappear.  
However, tahoe-lafs also does encoding, and their package (zfec) [1] may be 
leverageable.
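
To give a flavor of the wrapper work involved: zfec's C core wants fixed-size 
blocks, so the caller has to pad and slice.  A hedged sketch (the 
fec_new/fec_encode/fec_free signatures are from my memory of zfec's fec.h, so 
treat them as assumptions and check the real header):

extern "C" {
#include "fec.h"   // zfec's C core; assumed API: fec_new/fec_encode/fec_free
}
#include <algorithm>
#include <cassert>
#include <cstring>
#include <vector>

// Pad the input to a multiple of k, slice it into k primary blocks, then
// ask zfec for the m-k secondary (parity) blocks.  The caller must also
// record the original length so the padding can be stripped on decode.
std::vector<std::vector<unsigned char>>
encode_padded(const unsigned char* data, size_t len,
              unsigned short k, unsigned short m)   // m total, any k recover
{
    assert(k >= 1 && k <= m);
    fec_t* code = fec_new(k, m);
    size_t block = (len + k - 1) / k;
    std::vector<std::vector<unsigned char>> blocks(
        m, std::vector<unsigned char>(block, 0));
    for (unsigned i = 0; i < k; ++i) {       // primary blocks are the data itself
        size_t off = static_cast<size_t>(i) * block;
        if (off < len)
            std::memcpy(blocks[i].data(), data + off, std::min(block, len - off));
    }
    std::vector<const unsigned char*> src(k);
    std::vector<unsigned char*> fecs(m - k);
    std::vector<unsigned> nums(m - k);
    for (unsigned i = 0; i < k; ++i) src[i] = blocks[i].data();
    for (unsigned i = 0; i + k < m; ++i) {   // secondary block ids run k..m-1
        fecs[i] = blocks[k + i].data();
        nums[i] = k + i;
    }
    fec_encode(code, src.data(), fecs.data(), nums.data(), m - k, block);
    fec_free(code);
    return blocks;
}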

>
> @Sage: interesting. I thought it would be relatively simple if one 
> assumes
> the restriction of immutable files.  I'm not familiar with those ceph 
> specifics you're mentioning.
> When building an erasure codes-based system, maybe there's ways to reuse 
> existing ceph
> code and/or allow some integration with replication based objects, without 
> aiming for full integration or
> full support of the rados api, based on some tradeoffs.

I think this might sit UNDER the rados api.  I would certainly want to leverage 
CRUSH to place the shards, however (great tool, no reason to re-invent the 
wheel).
>
> @Josh, that sounds like an interesting approach.  Too bad that page doesn't 
> contain any information yet :)

Give me time :) - openstack has kept me a bit busy…  May also be a factor of 
"design at keyboard" :)

>
> Dieter

Christopher


[1] https://tahoe-lafs.org/trac/zfec

--
李柯睿
Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
Check my calendar availability: https://tungle.me/cdl



Re: erasure coding (sorry)

2013-04-18 Thread Christopher LILJENSTOLPE
Supposedly, on 2013-Apr-18, at 14.08 PDT(-0700), someone claiming to be Josh 
Durgin scribed:

> On 04/18/2013 01:47 PM, Sage Weil wrote:
>> On Thu, 18 Apr 2013, Plaetinck, Dieter wrote:
>>> sorry to bring this up again, googling revealed some people don't like the 
>>> subject [anymore].
>>>
>>> but I'm working on a new +- 3PB cluster for storage of immutable files.
>>> and it would be either all cold data, or mostly cold. 150MB avg filesize, 
>>> max size 5GB (for now)
>>> For this use case, my impression is erasure coding would make a lot of sense
>>> (though I'm not sure about the computational overhead on storing and 
>>> loading objects..? outbound traffic would peak at 6 Gbps, but I can make it 
>>> way less and still keep a large cluster, by taking away the small set of 
>>> hot files.
>>> inbound traffic would be minimal)
>>>
>>> I know that the answer a while ago was "no plans to implement erasure 
>>> coding", has this changed?
>>> if not, is anyone aware of a similar system that does support it? I found 
>>> QFS but that's meant for batch processing, has a single 'namenode' etc.
>>
>> We would love to do it, but it is not a priority at the moment (things
>> like multi-site replication are in much higher demand).  That of course
>> doesn't prevent someone outside of Inktank from working on it :)
>>
>> The main caveat is that it will be complicated.  For an initial
>> implementation, the full breadth of the rados API probably wouldn't be
>> supported for erasure/parity encoded pools (things like rados classes and
>> the omap key/value api get tricky when you start talking about parity).
>> But for many (or even most) use cases, objects are just bytes, and those
>> restrictions are just fine.
>
> I talked to some folks interested in doing a more limited form of this
> yesterday. They started a blueprint [1]. One of their ideas was to have
> erasure coding done by a separate process (or thread perhaps). It would
> use erasure coding on an object and then use librados to store the
> erasure-encoded pieces in a separate pool, and finally leave a marker in
> place of the original object in the first pool.
>
> When the osd detected this marker, it would proxy the request to the
> erasure coding thread/process which would service the request on the
> second pool for reads, and potentially make writes move the data back to
> the first pool in a tiering sort of scenario.
>
> I might have misremembered some details, but I think it's an
> interesting way to get many of the benefits of erasure coding with a 
> relatively small amount of work compared to a fully native osd solution.

Greetings,

I'm one of those individuals :)  Our thinking is evolving on this, and 
I think we can keep most of the work out of the main machinery of ceph, and 
simply require a modified client that runs the "proxy" function on the "hot" 
pool OSDs.  I'm even wondering if it could be prototyped in fuse.  I will be 
writing this up in the next day or two in the blueprint below.  Josh has the 
idea basically correct.

>
> Josh

Christopher

>
> [1] 
> http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend


--
李柯睿
Check my PGP key here: https://www.asgaard.org/~cdl/cdl.asc
Current vCard here: https://www.asgaard.org/~cdl/cdl.vcf
Check my calendar availability: https://tungle.me/cdl
