[Discussion] A tweak to existing large iterable protocol?

Ruoyun Huang Mon, 20 May 2019 17:10:26 -0700

Hi, Folks,

We propose to make a tweak to existing fnapi Large Iterable (result from
GBK) protocol. Would like to see what everyone thinks.

*To clarify a few terms used:*

*[large iterable]* A list of elements that are too expensive to hold them
all in memory; To store a single element is relatively cheap.

*[Large item iterable]* Same as above, except for even one single element
is too expensive to keep in memory.

[*Page*] A subset of elements from an iterable, up to a size limit.

[*Subsequent pages*] All pages except for the first one.

*Status quo: *Currently large iterable is implemented by using a lazy
decoding mechanism (related community discussions [1]), and is now only
implemented/available in python. To summarize how it works: The first page
is transmitted over data channel, and all the subsequent pages are
transmitted over state channel. For each subsequent page, we store on
Runner side: a token, and a mapping from token to corresponding iterator.
In the meanwhile, SDK side always holds the first page in memory [until
completely done].

There is no token created/sent for starting position of iterable.

*Proposed Tweak:* Create a token for starting position of EACH iterable,
and send it over to SDK. If SDK needs to re-iterate the large iterable, SDK
may request all the pages (including the first page) via state channel.

*Benefit:* SDK side no longer holds first page in memory. Thus better
memory efficiency. In addition to that, it makes easier to handle
large-item-iterable in the future.

*Cons:* [Any other downside?]

More data communication (one extra page) is needed.

*Collateral impact:* None. The only difference this proposal makes, is to
add *one* token (i.e. starting position of large-iterable) into existing
protocol. This token can be just redundant, with the existing behavior
remains unchanged. Once this information is in place, SDKs can make their
own choices whether to store first page in memory or not.

*Why now?* Large iterable is available only in Python, but it is coming to
other SDKs soon. It would probably be easier to add this extra information
early, thus less modification needed everywhere later on.

[1]:
https://lists.apache.org/thread.html/70cac361b659516933c505b513d43986c25c13da59eabfd28457f1f2@%3Cdev.beam.apache.org%3E

[Discussion] A tweak to existing large iterable protocol?

Reply via email to