Re: FileSystem API over distributedlog logs

Gerrit Sundaram Wed, 09 Nov 2016 02:48:20 -0800

Sijie, thank you for your comments and suggestions. I will start a
separate thread for discussing the metadata operation primitives.


BTW, I didn't find any code in kestrel that is related to distributedlog :(
Can you kindly point me to the files?

- Gerrit

On Wed, Nov 2, 2016 at 10:35 AM, Sijie Guo <sij...@twitter.com> wrote:

>
>
> On Wed, Nov 2, 2016 at 3:14 AM, Gerrit Sundaram <gerritsunda...@gmail.com>
> wrote:
>
>> FYI - I tried to use the AppendOnlyStreamWriter and
>> AppendOnlyStreamReader to demonstrate the idea:
>> https://github.com/apache/incubator-distributedlog/pulls/43 Let me know
>> if this is a good direction to pursue.
>>
>> - Gerrit
>>
>> On Wed, Nov 2, 2016 at 2:21 AM, Gerrit Sundaram <gerritsunda...@gmail.com>
>> wrote:
>>
>>> Hi distributedlog folks,
>>>
>>> I am new to this community. I am wondering whether anyone has tried to
>>> build a file system over replicated logs. There are a lot of similarities
>>> between a filesystem file and a replicated log. You can use files to build
>>> a replicated log or use replicated logs to build a filesystem.
>>>
>>> I took a look at the code repo and found there are two files,
>>> 'AppendOnlyStreamReader' and 'AppendOnlyStreamWriter'. They seem to
>>> implement a file-I/O-related API. Did you guys attempt to provide a
>>> filesystem API over distributedlog?
>>>
>>
> Ah, those two classes were designed for filesystem-like I/O operations. We
> used them for substituting the local-file-based journal in kestrel
> <https://github.com/twitter-archive/kestrel>.
>

>
>>
>>> I am wondering if it is possible to build a filesystem over
>>> distributedlog. Would this be an interesting topic to this project and the
>>> community? I have two reasons for that:
>>> - I can leverage the good stuff, like parallel replication and low
>>> latency, for better performance.
>>>
>>> - DL uses ZooKeeper for metadata storage. ZooKeeper has a pretty nice
>>> filesystem-like interface. So it would be a nice fit.
>>>
>>
> This sounds interesting. I don't think there are any major blockers for DL
> exposing a filesystem-like API, as indeed we already did that for kestrel.
> You might need to spend time on refining the metadata operations, like list
> files, get file status and such.
>
> Re "better performance" - for data I/O, it should be just fine for
> workloads like writes, tailing reads and catch-up reads (scans). I am not
> sure about random reads, as we didn't really pay attention to this at
> Twitter (although Salesforce used bookkeeper as the storage for also
> serving random reads, so it should probably work just fine).  I am not
> certain about metadata operations - we did create/open/delete log streams
> frequently for some of our use cases, but probably still less frequently
> than a filesystem would. We have a plan to make the stream primitive very
> lightweight, so we can support a huge number of streams. We probably can
> work together on improving the metadata part.
>
> I took a look at your pull request. I liked your layout - putting it in a
> contrib module to incubate this idea. We definitely welcome any
> contributions that make DL easy to use. Feel free to start a proposal
> discussion
> <https://cwiki.apache.org/confluence/display/DL/Project+Proposals>. I
> believe there will be a lot of corner cases to discuss.
>

>
>
>>
>>> - Gerrit
>>>
>>>
>>>
>>>
>>>
>>
>
