Re: FileSystem API over distributedlog logs

Sijie Guo Fri, 11 Nov 2016 12:11:21 -0800

/cc Leigh

I don't think we pushed the DL-related code to kestrel, as I think kestrel
has been on the deprecation path internally at Twitter. But it might be
worth pushing the code change just for reference. Leigh, what's your
opinion?


- Sijie

On Wed, Nov 9, 2016 at 2:48 AM, Gerrit Sundaram <[email protected]>
wrote:

> Sijie, thank you for your comments and suggestions. I will start a
> separate thread for discussing the metadata operation primitives.
>
> BTW, I didn't find any code in kestrel that is related to distributedlog
> :( Can you kindly point me to the files?
>
> - Gerrit
>
>
> On Wed, Nov 2, 2016 at 10:35 AM, Sijie Guo <[email protected]> wrote:
>
>>
>>
>> On Wed, Nov 2, 2016 at 3:14 AM, Gerrit Sundaram <[email protected]
>> > wrote:
>>
>>> FYI - I tried to use the AppendOnlyStreamWriter and
>>> AppendOnlyStreamReader to demonstrate the idea :
>>> https://github.com/apache/incubator-distributedlog/pulls/43 Let me know
>>> if this is a good direction to go after.
>>>
>>> - Gerrit
>>>
>>> On Wed, Nov 2, 2016 at 2:21 AM, Gerrit Sundaram <
>>> [email protected]> wrote:
>>>
>>>> Hi distributedlog folks,
>>>>
>>>> I am new to this community. I am wondering whether anyone has tried to
>>>> build a filesystem over replicated logs. There are a lot of similarities
>>>> between a filesystem file and a replicated log. You can use files to build
>>>> a replicated log or use replicated logs to build a filesystem.
>>>>
>>>> I took a look at the code repo and found two files,
>>>> 'AppendOnlyStreamReader' and 'AppendOnlyStreamWriter'. They seem to
>>>> implement a file-I/O-related API. Did you guys attempt to provide a filesystem
>>>> API over distributedlog?
>>>>
>>>
>> Ah, those two classes were designed for filesystem-like I/O operations.
>> We used them to replace the local-file-based journal in kestrel
>> <https://github.com/twitter-archive/kestrel>.
>>
>
>>
>>>
>>>> I am wondering if it is possible to build a filesystem over
>>>> distributedlog. Would this be an interesting topic for this project and the
>>>> community? I have two reasons for that:
>>>> - I can leverage the good stuff, like parallel replication and low latency,
>>>> for better performance.
>>>> - DL uses ZooKeeper for metadata storage. ZooKeeper has a pretty nice
>>>> filesystem-like interface, so it would be a nice fit.
>>>>
>>>
>> This sounds interesting. I don't think there are any major blockers for
>> DL exposing a filesystem-like API, as indeed we already did that for
>> kestrel. You might need to spend time on refining the metadata operations,
>> like list files, get file status and such.
>>
>> Re "better performance" - for data I/O, it should be just fine for
>> workloads like writes, tailing reads and caught-up reads (scans). I am not
>> sure about random reads, as we didn't really pay attention to this at
>> Twitter (although Salesforce used bookkeeper as storage that also
>> serves random reads, so it should probably work just fine). I am not certain
>> about metadata operations - we did create/open/delete log streams
>> frequently for some of our use cases, but that still might be less frequent
>> compared to a filesystem. We have a plan to make the stream primitive very
>> lightweight, so we can support a huge number of streams. We probably can work
>> together on improving the metadata part.
>>
>> I took a look at your pull request. I liked your layout - putting it in a
>> contrib module to incubate this idea. We definitely welcome any
>> contributions that make DL easy to use. Feel free to start a proposal
>> discussion
>> <https://cwiki.apache.org/confluence/display/DL/Project+Proposals>. I
>> believe there will be a lot of corner cases to discuss.
>>
>
>>
>>
>>>
>>>> - Gerrit
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
