Besides data size, I think data refresh is the BIG barrier, especially for streaming jobs. In most cases the lookup data set is updated periodically while the streaming job is running. I like the idea of SeekableIO; it can still be integrated with ExternalKvStore as a lower-level API.
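To make the layering concrete, here is a rough sketch of how a seekable read could sit underneath a KV-store wrapper that re-reads entries periodically. Everything in it is an assumption for illustration only: `SeekableReader`, the `ExternalKvStore` constructor, `lookup`, the TTL parameter and the injected clock are hypothetical names, not an existing Beam API, and a real version would also need to handle thread safety and cache eviction.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.LongSupplier;

/** Hypothetical low-level API: a point lookup against an external store. */
interface SeekableReader<K, V> {
    Optional<V> seek(K key);
}

/**
 * Hypothetical higher-level wrapper: caches lookups and re-reads an entry
 * once it is older than ttlMillis, so periodic updates to the lookup data
 * become visible to a long-running streaming job. Not thread-safe.
 */
final class ExternalKvStore<K, V> {
    /** A cached value together with the time it was loaded. */
    private static final class Entry<V> {
        final Optional<V> value;
        final long loadedAtMillis;

        Entry(Optional<V> value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final SeekableReader<K, V> reader;
    private final long ttlMillis;
    private final LongSupplier clock; // injected so the refresh logic is testable
    private final Map<K, Entry<V>> cache = new HashMap<>();

    ExternalKvStore(SeekableReader<K, V> reader, long ttlMillis, LongSupplier clock) {
        this.reader = reader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached value, reloading it from the reader if missing or stale. */
    Optional<V> lookup(K key) {
        long now = clock.getAsLong();
        Entry<V> cached = cache.get(key);
        if (cached == null || now - cached.loadedAtMillis >= ttlMillis) {
            cached = new Entry<>(reader.seek(key), now);
            cache.put(key, cached);
        }
        return cached.value;
    }
}
```

A userland DoFn could hold one such store per instance and call `lookup` for each element, passing `System::currentTimeMillis` as the clock. The TTL trades freshness of the lookup data against load on the external store.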
On Mon, Aug 28, 2017 at 7:32 PM, JingsongLee <lzljs3620...@aliyun.com> wrote:

> Yes, the runner can hold the entire side input in the right way. But it
> will be some waste, in the case of large amounts of data.
> Best, Jingsong Lee
>
> ------------------------------------------------------------------
> From: Lukasz Cwik <lc...@google.com.INVALID>
> Time: 2017 Aug 25 (Fri) 23:26
> To: dev <dev@beam.apache.org>
> Cc: JingsongLee <jingsongl...@gmail.com>
> Subject: Re: [PROPOSAL] External Join with KV Stores
>
> Jinsong, what do you mean by the batch data is too large?
>
> To my knowledge, nothing requires an SDK/runner to hold the entire side
> input in memory. Lists, maps, iterables, ... can all be broken up into
> smaller segments which can be loaded, cached and discarded separately.
>
> On Thu, Aug 24, 2017 at 5:10 PM, Mingmin Xu <mingm...@gmail.com> wrote:
>
> > wanna bring up this thread as we're looking for similar feature in SQL.
> > --Please point me if something is there, I don't find any JIRA task.
> >
> > Now the streaming+batch/batch+batch join is implemented with sideInput.
> > It's not a one-fit-all rule as Jingsong mentioned, the batch data may be
> > too large, and it would be changed periodically. A userland PTransform
> > sounds a more straight-forward option, as it doesn't require support in
> > runner level.
> >
> > Mingmin
> >
> > On Mon, Jul 17, 2017 at 8:56 PM, JingsongLee <lzljs3620...@aliyun.com>
> > wrote:
> >
> > > Sorry for so long to reply.
> > > Hi, Aljoscha, I think Async I/O operator and Batch the same, and
> > > Async is a better interface. All IO-related operations may be more
> > > appropriate for asynchronous use. Just like you said, the beginning
> > > is like no any special support by the Runners.
> > > I really like Luke's idea, let the user see a SeekableRead +
> > > Sideinput interface, and in the runner layer will optimize it to the
> > > direct access to external store. This requires a suitable
> > > SeekableRead interface and more efficient compiler optimization.
> > > Kenn's idea is exciting. If we can have an interface similar to
> > > FileSystem (Maybe like SeekableRead), abstract and unify a interface
> > > for multiple of KV stores, we can let users to see only the concept
> > > of Beam rather than the specific KVStore.
> > > Best, Jingsong Lee
> > >
> > > ------------------------------------------------------------------
> > > From: Kenneth Knowles <k...@google.com.INVALID>
> > > Time: 2017 Jul 7 (Fri) 11:43
> > > To: dev <dev@beam.apache.org>
> > > Cc: JingsongLee <lzljs3620...@aliyun.com>
> > > Subject: Re: [PROPOSAL] External Join with KV Stores
> > >
> > > In the streams/tables way of talking, side inputs are tables.
> > > External KV stores are basically also [globally windowed] tables.
> > > Both are time-varying.
> > >
> > > I think it makes perfect sense to access an external KV store in
> > > userland directly rather than listen to its changelog and reproduce
> > > the same table as a multimap side input. I'm sure many users are
> > > already doing this. I'm sure users will always do this. Providing a
> > > common interface (simpler than Filesystem) and helpful transform(s)
> > > in an extension module seems nice. Does it require any support in
> > > the core SDK?
> > >
> > > If I understand, Luke & Robert, you favor adding metadata to
> > > Read/SDF so that a user _does_ write it as a changelog listener that
> > > is observed as a multimap side input, and each runner optimizes it
> > > if they can to just directly access the KV store? A runner is free
> > > to use any kind of storage they like to materialize a side input
> > > anyhow, so this is surely possible, but it is a "sufficiently smart
> > > compiler" issue. As for semantics, I'm not worried about
> > > availability - it is globally windowed and always available. But I
> > > think this requires retractions to be correctly equivalent to direct
> > > access.
> > >
> > > I think we can have a userland PTransform in much less time than a
> > > model concept, so I favor it.
> > >
> > > Kenn
> >
> > --
> > ----
> > Mingmin

--
----
Mingmin