Re: Beam Tuple

Stephen Sisk Tue, 13 Dec 2016 11:07:01 -0800

I don't have enough info to comment on whether Tuples are the right answer
- but the user problem here is real.

There's a fundamental question I had as a new Beam user: "how do I get my
data from one ParDo to the next?" This is *really key* - without it, basic
pipelines are not possible, so there should be something very simple for
users. Advanced users (aka, people reading this list) know enough to decide
the exact correct solution to their problem, but beginning users learning
Beam just want to know how to do this seemingly simple task. If the answer
is "here, read lots of documentation about coders", we're giving users an
intimidating first experience that will likely block them from creating
their first pipeline.

Having *something* that's a simple answer would be helpful. What I've seen
in the docs doesn't make it clear. The Beam docs don't talk about it at all
(yet!), and the old Dataflow docs, from what I can see, force the user to
make several jumps of understanding and read docs in different areas.

For AutoValue - do we have clear guidance/code labs/examples showing users
how to use AutoValue and which coder to use with it? There's a real
trade-off there: AutoValue requires users to learn several concepts, whereas
Tuples, it sounds like, are something most folks doing data processing would
already know from other tools.

Like I said - I'm not speaking up for or against Tuples, but Beam should
have an answer. If we did have a built-in Tuple, I would think it would be
good for it to have a robust coder already in the coder registry.
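
To make that concrete, here is a minimal sketch in plain Java of what an
untagged tuple with a ready-made encoding could look like. `Tuple2` and
`TupleRoundTrip` are hypothetical names, not Beam classes, and the round
trip uses Java serialization only as a stand-in for a coder registered in
the coder registry, so the sketch runs without the Beam SDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Objects;

/** Hypothetical untagged 2-tuple (an assumption for illustration, not a Beam API). */
final class Tuple2<A extends Serializable, B extends Serializable> implements Serializable {
  private static final long serialVersionUID = 1L;

  private final A first;
  private final B second;

  private Tuple2(A first, B second) {
    this.first = first;
    this.second = second;
  }

  static <A extends Serializable, B extends Serializable> Tuple2<A, B> of(A first, B second) {
    return new Tuple2<>(first, second);
  }

  A first() {
    return first;
  }

  B second() {
    return second;
  }

  // Value semantics, so a decoded copy compares equal to the original.
  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof Tuple2)) {
      return false;
    }
    Tuple2<?, ?> other = (Tuple2<?, ?>) o;
    return Objects.equals(first, other.first) && Objects.equals(second, other.second);
  }

  @Override
  public int hashCode() {
    return Objects.hash(first, second);
  }

  @Override
  public String toString() {
    return "(" + first + ", " + second + ")";
  }
}

/** Stand-in for a coder: turns a value into bytes and back via Java serialization. */
final class TupleRoundTrip {
  private TupleRoundTrip() {}

  static byte[] encode(Serializable value) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(value);
    }
    return bytes.toByteArray();
  }

  @SuppressWarnings("unchecked")
  static <T> T decode(byte[] data) throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
      return (T) in.readObject();
    }
  }
}
```

With something like this, a new user would write `Tuple2.of("user-1", 42)`
in one step and read `first()` / `second()` in the next, without having to
pick or write a coder first. Whether Java serialization would be robust
enough as the default encoding is exactly the kind of tradeoff I can't
judge.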

Robert - can you speak to what exactly the Tuple tradeoffs are, and why it
wouldn't be appropriate for Beam to at least push users towards one? I'd
like to hear more about that.

S

On Tue, Dec 13, 2016 at 10:03 AM Robert Bradshaw
<[email protected]> wrote:

> On Tue, Dec 13, 2016 at 9:02 AM, Jean-Baptiste Onofré <[email protected]>
> wrote:
> > Hi Robert,
> >
> > Agreed, but which one would the user use? Create their own?
>
> Whichever suits their needs best, which could include his or her own.
>
> > Today, I think Beam is heavily flexible in terms of data format (which is
> > great), but the trade-off is that end users have to write a lot of
> > boilerplate code (just to convert from one type to another).
> >
> > So, basically, the purpose of a Beam Tuple is to have something provided
> > out of the box: if the user wants to use another tuple, that's fine.
> > Generally speaking, the discussion about data format extensions is about
> > simplifying the way users manipulate popular data formats.
>
> If I understand correctly, the proposal is to pick (or write) a Tuple
> API and bless it by shipping it with the SDK along with Beam-specific
> helper code. It'd be helpful to see concretely how large a savings
> this would be to a user, and whether that's worth the cost.
>
> > On 12/13/2016 05:56 PM, Robert Bradshaw wrote:
> >>
> >> The Java language isn't very amenable to Tuple APIs as there are several
> >> (mutually exclusive?) tradeoffs that must be made, each with their pros
> >> and cons. What advantage is there of Beam providing its own tuple API vs.
> >> letting users pick whatever tuple library they want and using that with
> >> Beam?
> >>
> >> (I suppose we're already using and encouraging AutoValue which covers a
> >> lot of tuple cases.)
> >>
> >> On Tue, Dec 13, 2016 at 8:20 AM, Aparup Banerjee (apbanerj) <
> >> [email protected]> wrote:
> >>
> >>> We have created one. An untagged Tuple. Will be happy to contribute it
> >>> to the community
> >>>
> >>> Aparup
> >>>
> >>>> On Dec 13, 2016, at 5:11 AM, Amit <[email protected]> wrote:
> >>>>
> >>>> I'll add that I know of Beam's PTuple, but my question is about much
> >>>> simpler Tuples, untagged.
> >>>>
> >>>> On Tue, Dec 13, 2016 at 1:56 PM Jean-Baptiste Onofré <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi Amit,
> >>>>>
> >>>>> as discussed together, I think a Tuple abstraction would be good in
> >>>>> the SDK (more than in the data format extension).
> >>>>>
> >>>>> Regards
> >>>>> JB
> >>>>>
> >>>>>> On 12/13/2016 11:06 AM, Amit Sela wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I was wondering why Beam doesn't have tuples as part of the SDK?
> >>>>>> To the best of my knowledge all currently supported (OSS) runners
> >>>>>> (Spark, Flink, Apex) provide a Tuple abstraction, and I was wondering
> >>>>>> if Beam should too?
> >>>>>>
> >>>>>> Consider KV for example; it is a special ("*keyed*" by the first
> >>>>>> field) implementation of Tuple2.
> >>>>>> While KV's importance is far more than being a Tuple2, I'm wondering
> >>>>>> if the SDK would benefit from proper TupleX support?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Amit
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Jean-Baptiste Onofré
> >>>>> [email protected]
> >>>>> http://blog.nanthrax.net
> >>>>> Talend - http://www.talend.com
> >>>>>
> >>>
> >>
> >
> > --
> > Jean-Baptiste Onofré
> > [email protected]
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
>
