Splitting on slabs should allow you to split at a finer granularity
than per server, since each server itself maintains this information.
If you take a look at the memcached protocol, you can see that
lru_crawler supports a metadump command which will enumerate all the
keys for a given set of slabs or for all slabs.
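
For example, a metadump returns one line of metadata per key and ends
with END (the field values below are illustrative; keys come back
URL-encoded):

    lru_crawler metadump all
    key=user%3A1001 exp=-1 la=1499600000 cas=12 fetch=no cls=1 size=68
    key=user%3A1002 exp=1499700000 la=1499600123 cas=40 fetch=yes cls=1 size=74
    END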

For the consistency part, you can get a snapshot-like effect
(snapshot-like since it's per server and not across the server farm)
by combining the "watch mutations evictions" command on one connection
with "lru_crawler metadump all" on another connection to the same
memcached server. By first connecting a watcher and then performing
the dump, you create two logical streams of data that can be joined to
get a snapshot per server. If the amount of data/mutations/evictions
is small, you can perform all of this within a DoFn; otherwise you can
treat them as two different outputs which you join, performing the
same logical operation to rebuild the snapshot on a per-key basis.
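
As a rough per-server sketch using raw sockets (the command strings
come from protocol.txt; the class, the line parsing, and the replay
step are hypothetical):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    public class ServerSnapshot {
      public static Map<String, String> snapshot(String host, int port)
          throws IOException {
        try (Socket watcher = new Socket(host, port);
             Socket dumper = new Socket(host, port)) {
          // 1. Start the watcher first so no mutation/eviction is missed.
          send(watcher, "watch mutations evictions\r\n");
          // 2. Enumerate every key currently cached on this server.
          send(dumper, "lru_crawler metadump all\r\n");

          Map<String, String> dump = new HashMap<>();
          BufferedReader in = reader(dumper);
          for (String line = in.readLine();
              line != null && !line.equals("END");
              line = in.readLine()) {
            // each metadump line starts with key=<urlencoded key>
            dump.put(line.split(" ")[0].substring("key=".length()), line);
          }
          // 3. Drain the watcher stream (buffered while the dump ran)
          //    and apply each mutation/eviction on top of the dumped
          //    entries to rebuild the per-key snapshot (omitted here).
          return dump;
        }
      }

      private static void send(Socket s, String cmd) throws IOException {
        s.getOutputStream().write(cmd.getBytes(StandardCharsets.US_ASCII));
        s.getOutputStream().flush();
      }

      private static BufferedReader reader(Socket s) throws IOException {
        return new BufferedReader(new InputStreamReader(
            s.getInputStream(), StandardCharsets.US_ASCII));
      }
    }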

Interestingly, the "watch mutations" command would allow one to build
a streaming memcache IO that surfaces all changes as they occur.

memcached protocol:
https://github.com/memcached/memcached/blob/master/doc/protocol.txt

On Mon, Jul 10, 2017 at 2:41 AM, Ismaël Mejía <ieme...@gmail.com> wrote:

> Hello,
>
> Thanks Lukasz for bringing up some of these subjects. I have briefly
> discussed this with the people working on it; they are the same team
> who did HCatalogIO (Hive).
>
> We analyzed the different libraries that allow developing this
> integration from Java and decided that the most complete
> implementation was spymemcached. One thing I really didn’t like about
> its API is that there is no abstraction for Mutation (like in
> Bigtable/HBase), just a dedicated method per operation, so to make
> things easier we agreed to focus first on read/write.
>
> @Lukasz for the enumeration part, I am not sure I follow. We had just
> discussed a naive approach of splitting by server, given that
> Memcached is not a cluster but a server farm (meaning every server
> stands on its own); we thought this would be the easiest way to
> partition. Is there any technical issue that prevents this (creating
> a BoundedSource and just reading from each server)? Or would
> partitioning by slabs give us a better optimization? (Notice I am far
> from an expert on Memcached.)
>
> For the consistency part I assumed it would be inconsistent when
> reading, because I didn’t know how to do the snapshot; but if you can
> give us more details on how to do this, and why it is worth the
> effort (vs. the cost of the snapshot), this would be something
> interesting to integrate.
>
> Thanks,
> Ismaël
>
>
> On Sun, Jul 9, 2017 at 7:39 PM, Lukasz Cwik <lc...@google.com.invalid>
> wrote:
> > For the source:
> > Do you plan to support enumerating all the keys via cachedump /
> > lru_crawler metadump / ...?
> > If there is an option which doesn't require enumerating the keys,
> > how will splitting be done (no splitting / splitting on slab ids /
> > ...)?
> > Can the cache be read while it's still being modified (will a
> > snapshot effectively be made using a watcher, or is it expected that
> > the cache will be read-only or inconsistent when reading)?
> >
> > Also, as a usability point, all PTransforms are meant to be applied to
> > PCollections and not vice versa.
> > e.g.
> > PCollection<byte[]> keys = ...;
> > keys.apply(MemCacheIO.withConfig());
> >
> > This makes it so that people can write:
> > PCollection<...> output =
> > input.apply(ptransform1).apply(ptransform2).apply(...);
> > It also makes it so that a PTransform can be applied to multiple
> > PCollections.
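> >
> > As an illustration, a read transform following that convention could
> > look roughly like this (MemCacheIO and FetchFn are hypothetical
> > sketches; only the PTransform/PCollection shapes come from Beam):
> >
> > import org.apache.beam.sdk.transforms.DoFn;
> > import org.apache.beam.sdk.transforms.PTransform;
> > import org.apache.beam.sdk.transforms.ParDo;
> > import org.apache.beam.sdk.values.KV;
> > import org.apache.beam.sdk.values.PCollection;
> >
> > public class MemCacheIO {
> >   public static Read read() { return new Read(); }
> >
> >   public static class Read extends
> >       PTransform<PCollection<byte[]>, PCollection<KV<byte[], byte[]>>> {
> >     @Override
> >     public PCollection<KV<byte[], byte[]>> expand(PCollection<byte[]> keys) {
> >       return keys.apply(ParDo.of(new FetchFn()));
> >     }
> >   }
> >
> >   // placeholder DoFn; a real one would issue the memcache gets
> >   private static class FetchFn extends DoFn<byte[], KV<byte[], byte[]>> {
> >     @ProcessElement
> >     public void process(ProcessContext c) {
> >       c.output(KV.of(c.element(), new byte[0]));
> >     }
> >   }
> > }
> >
> > // usage, applied left to right as described above:
> > // PCollection<KV<byte[], byte[]>> values = keys.apply(MemCacheIO.read());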
> >
> > If you haven't already, I would also suggest that you take a look
> > at the Pipeline I/O guide:
> > https://beam.apache.org/documentation/io/io-toc/
> > It talks about various usability points and how to write a good I/O
> > connector.
> >
> >
> > On Sat, Jul 8, 2017 at 9:31 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> >> Hi,
> >>
> >> Great job!
> >>
> >> I'm looking forward to the PR reviews.
> >>
> >> Regards
> >> JB
> >>
> >>
> >> On 07/08/2017 09:50 AM, Madhusudan Borkar wrote:
> >>
> >>> Hi,
> >>> We are proposing to build a connector for memcache first and then
> >>> use it for Couchbase. The memcache connector will be built as an
> >>> IOTransform, and then it can be used for other memcache
> >>> implementations including Couchbase.
> >>>
> >>> 1. As Source
> >>>
> >>>     input will be a key (String / byte[]), output will be a
> >>>     KV<key, value>
> >>>
> >>>     where key - String / byte[]
> >>>
> >>>     value - String / byte[]
> >>>
> >>> Spymemcached supports a multi-get operation which takes a bunch of
> >>> keys and retrieves the associated values; the input
> >>> PCollection<key> can be bundled into multiple batches and each
> >>> batch can be submitted via the multi-get operation, as sketched
> >>> after the read example below.
> >>>
> >>> PCollection<KV<byte[], byte[]>> values =
> >>>     MemCacheIO
> >>>     .withConfig()
> >>>     .read()
> >>>     .withKey(PCollection<byte[]>);
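> >>>
> >>> Each batch could then be fetched with spymemcached's multi-get; a
> >>> minimal sketch (the host address is a placeholder):
> >>>
> >>> import java.io.IOException;
> >>> import java.net.InetSocketAddress;
> >>> import java.util.Arrays;
> >>> import java.util.Map;
> >>> import net.spy.memcached.MemcachedClient;
> >>>
> >>> public class MultiGetSketch {
> >>>   public static void main(String[] args) throws IOException {
> >>>     MemcachedClient client =
> >>>         new MemcachedClient(new InetSocketAddress("memcached-host", 11211));
> >>>     // one round trip for the whole batch instead of one get per key
> >>>     Map<String, Object> hits = client.getBulk(Arrays.asList("k1", "k2"));
> >>>     hits.forEach((k, v) -> System.out.println(k + " -> " + v));
> >>>     client.shutdown();
> >>>   }
> >>> }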
> >>>
> >>>
> >>> 2. As Sink
> >>>
> >>>     input will be a KV<key, value>, output will be none or probably a
> >>> boolean indicating the outcome of the operation
> >>>
> >>> // write
> >>>
> >>>     MemCacheIO
> >>>     .withConfig()
> >>>     .write()
> >>>     .withEntries(PCollection<KV<byte[],byte[]>>);
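> >>>
> >>> On the write side, spymemcached's set returns a future Boolean,
> >>> which could back the boolean outcome mentioned above (a minimal
> >>> sketch; the host address is a placeholder):
> >>>
> >>> import java.net.InetSocketAddress;
> >>> import net.spy.memcached.MemcachedClient;
> >>>
> >>> public class SetSketch {
> >>>   public static void main(String[] args) throws Exception {
> >>>     MemcachedClient client =
> >>>         new MemcachedClient(new InetSocketAddress("memcached-host", 11211));
> >>>     // set(key, expirationSeconds, value); get() waits for the result
> >>>     boolean stored = client.set("key", 0, "value").get();
> >>>     System.out.println("stored: " + stored);
> >>>     client.shutdown();
> >>>   }
> >>> }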
> >>>
> >>>
> >>> Implementation plan
> >>>
> >>> 1. Develop the Memcache connector with 'set' and 'add' operations
> >>>
> >>> 2. Then develop other operations
> >>>
> >>> 3. Use Memcache connector for Couchbase
> >>>
> >>>
> >>> Thanks @Ismael for the help
> >>>
> >>> Please, let us know your views.
> >>>
> >>> Madhu Borkar
> >>>
> >>>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >>
>
