[
https://issues.apache.org/jira/browse/SAMZA-424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165000#comment-14165000
]
Martin Kleppmann commented on SAMZA-424:
----------------------------------------
I must admit I'm confused — I've not quite managed to follow this discussion,
and even in the latest design doc I'm not sure I understand exactly what is
being proposed. Is the idea to separate the caching concern from the storage
engine concern, so that the same caching layer can be reused with different
storage engines?
If so, would it be possible to make the cache a shared class that storage
engine implementations can use (e.g. if you're implementing a Voldemort-backed
storage engine, you don't have to reinvent the cache), but which is not exposed
in the job configuration? In that case it would be the storage engine's
responsibility to instantiate the cache and pass through any parameters from
the configuration.
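If it helps, here's roughly what I have in mind. All class names below (SharedLruCache, VoldemortBackedStore) are invented for illustration — the point is just that the cache is an ordinary shared class the engine instantiates itself, passing through parameters it read from the configuration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A plain shared class, not something wired up through job config.
class SharedLruCache<K, V> {
    private final Map<K, V> entries;

    SharedLruCache(final int maxSize) {
        // An access-ordered LinkedHashMap gives LRU eviction for free.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize;
            }
        };
    }

    V get(K key) { return entries.get(key); }
    void put(K key, V value) { entries.put(key, value); }
    int size() { return entries.size(); }
}

// The storage engine owns the cache; nothing in the job config
// knows the cache exists.
class VoldemortBackedStore {
    private final SharedLruCache<String, String> cache;

    VoldemortBackedStore(int cacheSize) {
        this.cache = new SharedLruCache<>(cacheSize);
    }

    String get(String key) {
        String cached = cache.get(key);
        if (cached != null) return cached;
        String value = fetchFromBackend(key);
        cache.put(key, value);
        return value;
    }

    private String fetchFromBackend(String key) {
        return "value-for-" + key; // stand-in for a remote Voldemort lookup
    }
}
```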
If we definitely want a composing approach (where the storage engine knows
nothing about the fact that it's being wrapped by a caching layer), then I
agree with what has been said: composition in config can get ugly, but using a
builder pattern inside a StreamTask initializer doesn't really make sense,
especially when shared stores are involved. Composition through code seems fine
to me if that code runs during job submission or job startup (so the builder
code is just a nicer way of specifying a configuration).
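To make concrete what I mean by "the builder code is just a nicer way of specifying a configuration" — a sketch, with entirely made-up names (nothing here is an existing Samza API), where the builder does nothing but emit config entries:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical builder: each call just records a config entry,
// so running it at job submission is equivalent to writing the
// properties by hand.
class StoreConfigBuilder {
    private final Map<String, String> config = new HashMap<>();
    private final String name;

    StoreConfigBuilder(String name) { this.name = name; }

    StoreConfigBuilder factory(String factoryClass) {
        config.put("stores." + name + ".factory", factoryClass);
        return this;
    }

    StoreConfigBuilder cached(int maxEntries) {
        config.put("stores." + name + ".cache.size", String.valueOf(maxEntries));
        return this;
    }

    Map<String, String> build() { return config; }
}
```

So a job could say `new StoreConfigBuilder("member-cache").factory("org.example.CacheStoreFactory").cached(10000).build()` and the result is an ordinary config map, not a live object graph inside a StreamTask.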
[~cpsoman]: In the _2 proposal, you define two stores (page-key-counts and
member-cache) but I don't understand how they are being composed. Is there
supposed to be something in the configuration that tells the cache to wrap the
other store?
There is some elegance to composing layers, but it only really works if the
things you're composing all have the same API, e.g. KeyValueStore. I can't
quite see how you could wrap, say, a Lucene storage engine with a
key-value-oriented cache. On the other hand, a storage engine that exposes a
full-text search API may be able to benefit from a caching layer, but it would
be that storage engine's responsibility to use the cache appropriately.
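For the case where the APIs do line up, the composition I'm imagining looks something like this. The KeyValueStore interface here is a simplified stand-in for Samza's, and CachingKeyValueStore is hypothetical — the point is that wrapper and engine only know each other through the shared interface:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for Samza's key-value API.
interface KeyValueStore<K, V> {
    V get(K key);
    void put(K key, V value);
}

class InMemoryStore<K, V> implements KeyValueStore<K, V> {
    private final Map<K, V> map = new HashMap<>();
    public V get(K key) { return map.get(key); }
    public void put(K key, V value) { map.put(key, value); }
}

// The wrapper caches reads and writes through to the underlying
// store; the underlying engine is unaware it is wrapped.
class CachingKeyValueStore<K, V> implements KeyValueStore<K, V> {
    private final KeyValueStore<K, V> underlying;
    private final Map<K, V> cache = new HashMap<>();

    CachingKeyValueStore(KeyValueStore<K, V> underlying) {
        this.underlying = underlying;
    }

    public V get(K key) {
        V v = cache.get(key);
        if (v == null) {
            v = underlying.get(key);
            if (v != null) cache.put(key, v);
        }
        return v;
    }

    public void put(K key, V value) {
        cache.put(key, value);     // keep the cache coherent on writes
        underlying.put(key, value);
    }
}
```

This only typechecks because both layers speak KeyValueStore; there's no analogous wrapper you could write around a full-text search API without inventing a new interface for it.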
Apologies if I've totally misunderstood and none of this makes any sense. I'm
just a bit confused about this whole discussion.
> Add a Cache state API to the Samza container
> --------------------------------------------
>
> Key: SAMZA-424
> URL: https://issues.apache.org/jira/browse/SAMZA-424
> Project: Samza
> Issue Type: New Feature
> Components: container
> Reporter: Chinmay Soman
> Assignee: Chinmay Soman
> Attachments: SAMZA-424-Cache-API_0.pdf, SAMZA-424-Cache-API_1.md,
> SAMZA-424-Cache-API_2.md, SAMZA-424-Cache-API_2.pdf, samza-424-cache-api_1.pdf
>
>
> There are cases when the user code needs access to a 'cache' which can be
> used to store custom data. This cache is different from the KeyValue store in
> the following ways:
> * At the very least, needs to support LRU (Least Recently Used) and TTL
> (Time To Live) eviction strategies
> * May not support all() and range() operations (since this wreaks havoc with
> the eviction operation)
> * Needs to exist at a per-task or per-container level.
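A minimal illustration of the TTL half of the requirement above — the TtlCache class and its explicit clock parameter are invented for illustration, not a proposed API:

```java
import java.util.HashMap;
import java.util.Map;

// Entries older than ttlMs are treated as evicted on read.
// Time is passed in explicitly to keep the sketch deterministic.
class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long insertedAtMs;
        Entry(V value, long insertedAtMs) {
            this.value = value;
            this.insertedAtMs = insertedAtMs;
        }
    }

    private final Map<K, Entry<V>> map = new HashMap<>();
    private final long ttlMs;

    TtlCache(long ttlMs) { this.ttlMs = ttlMs; }

    void put(K key, V value, long nowMs) {
        map.put(key, new Entry<>(value, nowMs));
    }

    V get(K key, long nowMs) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        if (nowMs - e.insertedAtMs > ttlMs) {
            map.remove(key); // lazily evict expired entries
            return null;
        }
        return e.value;
    }
}
```

Note that supporting all() or range() on such a cache would mean deciding what to do with expired-but-not-yet-evicted entries mid-iteration, which is presumably the havoc the description refers to.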
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)