Besides data size, I think data refresh is the big barrier, especially
for streaming jobs. In most cases the lookup data set is updated periodically
while the streaming job is running. I like the idea of SeekableIO; it can still
be integrated with ExternalKvStore as a lower-level API.
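
A minimal sketch of handling refresh in userland, assuming a hypothetical
ExternalKvStore interface (neither ExternalKvStore nor RefreshingLookup below
is an existing Beam API, and the TTL policy is only one possible choice): a
wrapper caches each lookup and re-reads the key once the cached value is older
than a configured TTL.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical low-level lookup API; not an existing Beam interface. */
interface ExternalKvStore<K, V> {
  V get(K key);
}

/** Caches lookups and re-reads a key once its cached entry is at least ttlMillis old. */
class RefreshingLookup<K, V> {
  private static final class Entry<V> {
    final V value;
    final long loadedAt;

    Entry(V value, long loadedAt) {
      this.value = value;
      this.loadedAt = loadedAt;
    }
  }

  private final ExternalKvStore<K, V> store;
  private final long ttlMillis;
  private final LongSupplier clock; // injected so the refresh logic can be tested
  private final Map<K, Entry<V>> cache = new HashMap<>();

  RefreshingLookup(ExternalKvStore<K, V> store, long ttlMillis, LongSupplier clock) {
    this.store = store;
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  V lookup(K key) {
    long now = clock.getAsLong();
    Entry<V> entry = cache.get(key);
    // Go back to the external store on a miss or when the entry has expired.
    if (entry == null || now - entry.loadedAt >= ttlMillis) {
      entry = new Entry<>(store.get(key), now);
      cache.put(key, entry);
    }
    return entry.value;
  }
}
```

In a pipeline, something like lookup() would be called per element from a
userland DoFn, with System::currentTimeMillis as the clock. The class is not
thread-safe as written.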

On Mon, Aug 28, 2017 at 7:32 PM, JingsongLee <lzljs3620...@aliyun.com>
wrote:

> Yes, the runner can hold the entire side input in the right way. But that
> would be wasteful when the amount of data is large.
> Best, Jingsong Lee
>
> ------------------------------------------------------------------
> From: Lukasz Cwik <lc...@google.com.INVALID>
> Time: 2017 Aug 25 (Fri) 23:26
> To: dev <dev@beam.apache.org>
> Cc: JingsongLee <jingsongl...@gmail.com>
> Subject: Re: [PROPOSAL] External Join with KV Stores
> Jingsong, what do you mean by the batch data is too large?
>
> To my knowledge, nothing requires an SDK/runner to hold the entire side
> input in memory. Lists, maps, iterables, ... can all be broken up into
> smaller segments which can be loaded, cached and discarded separately.
>
> On Thu, Aug 24, 2017 at 5:10 PM, Mingmin Xu <mingm...@gmail.com> wrote:
>
> > I want to bring up this thread again, as we're looking for a similar
> > feature in SQL. Please point me to it if something already exists; I
> > couldn't find any JIRA task.
> >
> > Currently the streaming+batch and batch+batch joins are implemented with
> > side inputs. That is not a one-size-fits-all approach: as Jingsong
> > mentioned, the batch data may be too large, and it may change
> > periodically. A userland PTransform sounds like a more straightforward
> > option, as it doesn't require support at the runner level.
> >
> > Mingmin
> >
> > On Mon, Jul 17, 2017 at 8:56 PM, JingsongLee <lzljs3620...@aliyun.com>
> > wrote:
> >
> > > Sorry for taking so long to reply.
> > > Hi Aljoscha, I think the Async I/O operator and Batch are the same, and
> > > Async is a better interface. All IO-related operations may be more
> > > appropriate for asynchronous use. As you said, at the beginning this
> > > would need no special support from the runners.
> > > I really like Luke's idea: let the user see a SeekableRead + side input
> > > interface, and have the runner layer optimize it into direct access to
> > > the external store. This requires a suitable SeekableRead interface and
> > > more efficient compiler optimization.
> > > Kenn's idea is exciting. If we can have an interface similar to
> > > FileSystem (maybe like SeekableRead) that abstracts and unifies an
> > > interface for multiple KV stores, we can let users see only the Beam
> > > concept rather than the specific KV store.
> > > Best, Jingsong Lee
> > > ------------------------------------------------------------------
> > > From: Kenneth Knowles <k...@google.com.INVALID>
> > > Time: 2017 Jul 7 (Fri) 11:43
> > > To: dev <dev@beam.apache.org>
> > > Cc: JingsongLee <lzljs3620...@aliyun.com>
> > > Subject: Re: [PROPOSAL] External Join with KV Stores
> > > In the streams/tables way of talking, side inputs are tables. External KV
> > > stores are basically also [globally windowed] tables. Both are
> > > time-varying.
> > >
> > > I think it makes perfect sense to access an external KV store in userland
> > > directly rather than listen to its changelog and reproduce the same table
> > > as a multimap side input. I'm sure many users are already doing this. I'm
> > > sure users will always do this. Providing a common interface (simpler than
> > > Filesystem) and helpful transform(s) in an extension module seems nice.
> > > Does it require any support in the core SDK?
> > >
> > > If I understand, Luke & Robert, you favor adding metadata to Read/SDF so
> > > that a user _does_ write it as a changelog listener that is observed as a
> > > multimap side input, and each runner optimizes it if they can to just
> > > directly access the KV store? A runner is free to use any kind of storage
> > > they like to materialize a side input anyhow, so this is surely possible,
> > > but it is a "sufficiently smart compiler" issue. As for semantics, I'm not
> > > worried about availability - it is globally windowed and always available.
> > > But I think this requires retractions to be correctly equivalent to direct
> > > access.
> > >
> > > I think we can have a userland PTransform in much less time than a model
> > > concept, so I favor it.
> > >
> > > Kenn
> > >
> > >
> >
> >
> > --
> > ----
> > Mingmin
> >
>
>


-- 
----
Mingmin
