On Wed, Jul 24, 2019 at 6:21 AM Rakesh Kumar <rakeshku...@lyft.com> wrote:
>
> Thanks Robert,
>
>  I stumbled on the JIRA that you created some time ago:
> https://jira.apache.org/jira/browse/BEAM-5428
>
> You also marked the places in the code where changes are required:
> https://github.com/apache/beam/blob/7688bcfc8ebb4bedf26c5c3b3fe0e13c0ec2aa6d/sdks/python/apache_beam/runners/worker/bundle_processor.py#L291
> https://github.com/apache/beam/blob/7688bcfc8ebb4bedf26c5c3b3fe0e13c0ec2aa6d/sdks/python/apache_beam/runners/worker/bundle_processor.py#L349
> https://github.com/apache/beam/blob/7688bcfc8ebb4bedf26c5c3b3fe0e13c0ec2aa6d/sdks/python/apache_beam/runners/worker/bundle_processor.py#L465
>
> I am willing to help implement this. Let me know how I can get started.

As far as I'm aware, no one is actively working on it right now.
Please feel free to assign the JIRA entry to yourself, and I'll be happy
to answer any questions you might have if (well, probably when) these
pointers turn out to be insufficient.
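
For orientation, here is a rough, hypothetical sketch (not actual Beam
code; every name below is made up) of the kind of cross-bundle cache
BEAM-5428 and the design doc describe: state reads are served from a
process-level cache that is only trusted while the runner-supplied
cache token remains valid, and local writes or clears invalidate the
affected entries.

class CrossBundleStateCache(object):
  """Illustrative only: caches state reads across bundles, keyed by a
  runner-issued cache token."""

  def __init__(self):
    self._token = None
    self._entries = {}  # state_key -> cached value

  def start_bundle(self, token):
    # A missing or changed token invalidates everything cached so far.
    if token != self._token:
      self._entries = {}
      self._token = token

  def get(self, state_key, fetch_fn):
    # Serve from the cache when allowed; otherwise fall back to the
    # network read (fetch_fn) and remember the result.
    if self._token is None:
      return fetch_fn()  # the runner did not enable caching
    if state_key not in self._entries:
      self._entries[state_key] = fetch_fn()
    return self._entries[state_key]

  def invalidate(self, state_key):
    # Called on append/clear so stale reads are never served.
    self._entries.pop(state_key, None)

The actual wiring would live around the state handling in
bundle_processor.py that the links above point to.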

> On Tue, Jul 23, 2019 at 3:47 AM Robert Bradshaw <rober...@google.com> wrote:
>>
>> This is documented at
>> https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.7ghoih5aig5m
>> . Note that it requires participation of both the runner and the SDK
>> (though there are no correctness issues if one or the other side does
>> not understand the protocol; caching simply won't be used).
>>
>> I don't think it's been implemented anywhere, but could be very
>> beneficial for performance.
>>
>> On Wed, Jul 17, 2019 at 6:00 PM Rakesh Kumar <rakeshku...@lyft.com> wrote:
>> >
>> > I checked the Python SDK [1] and it has a similar implementation to the Java SDK.
>> >
>> > I agree with Thomas. With a high-volume event stream and a larger
>> > cluster size, the network calls can become a bottleneck.
>> >
>> > @Robert
>> > I am interested in seeing the proposal. Can you send me a link to it?
>> >
>> > [1]: 
>> > https://github.com/apache/beam/blob/db59a3df665e094f0af17fe4d9df05fe420f3c16/sdks/python/apache_beam/transforms/userstate.py#L295
>> >
>> >
>> > On Tue, Jul 16, 2019 at 9:43 AM Thomas Weise <t...@apache.org> wrote:
>> >>
>> >> Thanks for the pointer. For streaming, it will be important to support 
>> >> caching across bundles. It appears that even the Java SDK doesn't support 
>> >> that yet?
>> >>
>> >> https://github.com/apache/beam/blob/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L221
>> >>
>> >> Regarding clear/append: It would be nice if both could occur within a
>> >> single Fn API roundtrip when the state is persisted.
>> >>
>> >> Thanks,
>> >> Thomas
>> >>
>> >>
>> >>
>> >> On Tue, Jul 16, 2019 at 6:58 AM Lukasz Cwik <lc...@google.com> wrote:
>> >>>
>> >>> User state is built on top of read, append, and clear, rather than on a
>> >>> read-and-write paradigm, in order to allow for blind appends.
>> >>>
>> >>> The optimization you speak of can be done completely inside the SDK,
>> >>> without requiring any additional protocol, as long as you clear the
>> >>> state first and then append all of your new data. The Beam Java SDK does
>> >>> this for all runners when executed portably [1]. You could port the same
>> >>> logic to the Beam Python SDK as well.
>> >>>
>> >>> 1: 
>> >>> https://github.com/apache/beam/blob/41478d00d34598e56471d99d0845ac16efa5b8ef/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state/BagUserState.java#L84
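
To make that concrete, here is a minimal Python sketch of the
clear-then-append pattern, loosely modeled on the Java BagUserState
linked above. The names and the "underlying" state interface are
hypothetical, not the actual SDK API:

class CachedBagState(object):
  # Hypothetical sketch of SDK-side bag state with blind appends.
  # 'underlying' stands in for whatever issues the actual Fn API
  # state requests.

  def __init__(self, underlying):
    self._underlying = underlying
    self._pending_clear = False
    self._pending_appends = []

  def add(self, value):
    # Blind append: no read required, just buffer locally.
    self._pending_appends.append(value)

  def clear(self):
    # Locally cleared state needs no network read afterwards.
    self._pending_clear = True
    self._pending_appends = []

  def read(self):
    if self._pending_clear:
      # Everything persisted before the clear is irrelevant.
      return list(self._pending_appends)
    # Otherwise the persisted values still matter, so fetch them and
    # concatenate the locally buffered appends.
    return list(self._underlying.read()) + self._pending_appends

  def commit(self):
    # Flush at persist time: clear first, then append the new data,
    # so the two can be sent back to back (ideally in one roundtrip).
    if self._pending_clear:
      self._underlying.clear()
    for value in self._pending_appends:
      self._underlying.append(value)
    self._pending_clear = False
    self._pending_appends = []

Since reads after a local clear never need the persisted data, and the
clear plus appends are only flushed at commit time, this also lines up
with Thomas's point about doing both in a single roundtrip.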
>> >>>
>> >>> On Tue, Jul 16, 2019 at 5:54 AM Robert Bradshaw <rober...@google.com> 
>> >>> wrote:
>> >>>>
>> >>>> Python workers also have a per-bundle SDK-side cache. A protocol has
>> >>>> been proposed, but hasn't yet been implemented in any SDKs or runners.
>> >>>>
>> >>>> On Tue, Jul 16, 2019 at 6:02 AM Reuven Lax <re...@google.com> wrote:
>> >>>> >
>> >>>> > It's runner-dependent. Some runners (e.g. the Dataflow runner) do
>> >>>> > have such a cache, though I think it currently has a cap for large
>> >>>> > bags.
>> >>>> >
>> >>>> > Reuven
>> >>>> >
>> >>>> > On Mon, Jul 15, 2019 at 8:48 PM Rakesh Kumar <rakeshku...@lyft.com> 
>> >>>> > wrote:
>> >>>> >>
>> >>>> >> Hi,
>> >>>> >>
>> >>>> >> I have been using the Python SDK for my application, and I also use
>> >>>> >> BagState in production. I was wondering whether the state logic has
>> >>>> >> any write-through cache implemented. If every read and write request
>> >>>> >> goes over the network, it comes with a performance cost. We could
>> >>>> >> avoid the network call for read operations if we had a write-through
>> >>>> >> cache.
>> >>>> >> I have looked superficially into the implementation and didn't see
>> >>>> >> any caching.
>> >>>> >>
>> >>>> >> Is it possible to have such a cache? Would it cause any issues if we
>> >>>> >> added a caching layer?
>> >>>> >>
