The standard VARINT coder is used for all sorts of integer values (e.g. the
output of the CountElements transform), but the vast majority of them are
likely significantly less than a full 64 bits. In Python, declaring an
element type to be int will use this. On the other hand, using a VarInt
format seems
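The size win described above can be seen in a minimal base-128 varint sketch. This is plain Python for illustration, not Beam's actual VarIntCoder (negative values and zig-zag encoding are ignored here):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint (LEB128-style sketch)."""
    out = bytearray()
    while True:
        byte = n & 0x7F  # low 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

# Small counts stay small; a full 64-bit value costs 10 bytes.
assert len(encode_varint(5)) == 1
assert len(encode_varint(300)) == 2
assert len(encode_varint(2**63)) == 10
```

So a typical CountElements output of a few hundred fits in one or two bytes instead of a fixed eight.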
> that it's still not supported in the Python SDK Harness. Is there any plan
> on that?
>
> Robert Bradshaw wrote on Tue, Jul 30, 2019 at 12:33 PM:
>
>> On Tue, Jul 30, 2019 at 11:52 AM jincheng sun
>> wrote:
>>
>>>
>>>>> Is it possible to add
>> On Wed, Jul 31, 2019 at 3:16 AM Pablo Estrada
>> wrote:
>> > >>>
>> > >>> +1
>> > >>>
>> > >>> I installed from source, and ran unit tests for Python in 2.7, 3.5,
>> 3.6.
>> > >>>
On Tue, Jul 30, 2019 at 11:52 AM jincheng sun
wrote:
>
>>> Is it possible to add an interface such as `isSelfContained()` to the
>>> `Coder`? This interface indicates
>>> whether the serialized bytes are self contained. If it returns true,
>>> then there is no need to add a prefixing length.
>>>
I checked all the artifact signatures and ran a couple test pipelines with
the wheels (Py2 and Py3) and everything looked good to me, so +1.
On Mon, Jul 29, 2019 at 8:29 PM Valentyn Tymofieiev
wrote:
> I have checked Python 3 batch and streaming quickstarts on Dataflow runner
> using .zip and wh
On Mon, Jul 29, 2019 at 4:14 PM jincheng sun
wrote:
> Hi Robert,
>
> Thanks for your detailed comments; I have added a few pointers inline.
>
> Best,
> Jincheng
>
> Robert Bradshaw wrote on Mon, Jul 29, 2019 at 12:35 PM:
>
>> On Sun, Jul 28, 2019 at 6:51 AM jincheng su
so curious about the standard whether a coder can be put into
> StandardCoders.
> For example, I noticed that FLOAT is not among StandardCoders, while DOUBLE
> is among it.
StandardCoders is supposed to be some sort of lowest common
denominator, but there's no hard and fast criteria. Fo
>>
>> Reshuffle.viaRandomKey()
>>
>> ParDo.of(ReadFileRangesFn(createSource) :: DoFn<KV<String, OffsetRange>, T>) where
>>
>> createSource :: String -> FileBasedSource
>>
>> createSource = AvroSource
>>
>>
>> AvroIO.read without getHi
like
System.identityHashCode(this) in the body of a DoFn might be
sufficient.
> On Thu, Jul 25, 2019 at 9:54 PM Robert Bradshaw wrote:
>>
>> Though it's not obvious in the name, Stateful ParDos can only be
>> applied to keyed PCollections, similar to GroupByKey. (You c
Though it's not obvious in the name, Stateful ParDos can only be
applied to keyed PCollections, similar to GroupByKey. (You could,
however, assign every element to the same key and then apply a
Stateful DoFn, though in that case all elements would get processed on
the same worker.)
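The workaround described above (assign every element the same key) can be sketched outside Beam. This is a plain-Python illustration, assuming hash partitioning by key, which is how runners typically shard a keyed PCollection:

```python
def partition_by_key(elements, num_workers):
    """Route each (key, value) pair to a worker by hashing the key,
    mimicking how a runner shards a keyed PCollection."""
    workers = [[] for _ in range(num_workers)]
    for key, value in elements:
        workers[hash(key) % num_workers].append((key, value))
    return workers

# Distinct keys spread across workers...
spread = partition_by_key([(k, k) for k in range(100)], 4)
assert all(len(w) > 0 for w in spread)

# ...but a single shared key funnels everything to one worker.
single = partition_by_key([("same", v) for v in range(100)], 4)
assert sorted(len(w) for w in single) == [0, 0, 0, 100]
```

This is why the single-key trick works but serializes all processing onto one worker.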
On Thu, Jul 25,
l (where the bulk of
>>> temp file handling logic lives). Might be hard to decouple without either modifying
>>> existing code or creating new transforms, unless we re-write most of
>>> FileBasedSink from scratch.
>>>
>>> Let me know if I'm on the wron
JIRA entry and I'll be happy
to answer any questions you might have if (well, probably when) these
pointers are insufficient.
> On Tue, Jul 23, 2019 at 3:47 AM Robert Bradshaw wrote:
>>
>> This is documented at
>> https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOH
On Thu, Jul 25, 2019 at 5:31 AM Thomas Weise wrote:
>
> Hi Jincheng,
>
> It is very exciting to see this follow-up, that you have done your research
> on the current state and that there is the intention to join forces on the
> portability effort!
>
> I have added a few pointers inline.
>
> Seve
From the portability perspective,
https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto
and the associated services for executing pipelines is about as "core"
as it gets, and eventually I'd like to see all runners being portable
(even if they have an option
On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath
wrote:
>
> On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver wrote:
>>
>> I agree with David that at least clearer log statements should be added.
>>
>> Udi, that's an interesting idea, but I imagine the sheer number of existing
>> flags (including m
>>> then append all your new data. The Beam Java SDK does this for all
>>> runners when executed portably[1]. You could port the same logic to the
>>> Beam Python SDK as well.
>>>
>>> 1:
>>> https://github.com/apache/beam/blob/41478d00d34598e56471d99d0845ac16efa5b8ef/
I think having a single, default, auto-created temporary bucket per
project for use in GCP (when running on Dataflow, or running elsewhere
but using GCS such as for this BQ load files example), though not
ideal, is the best user experience. If we don't want to be
automatically creating such things
On Mon, Jul 22, 2019 at 7:39 PM Eugene Kirpichov wrote:
>
> On Mon, Jul 22, 2019 at 7:49 AM Robert Bradshaw wrote:
>>
>> On Mon, Jul 22, 2019 at 4:04 PM Neville Li wrote:
>> >
>> > Thanks Robert. Agree with the FileIO point. I'll look in
based on FileBasedSource, which we'd like to avoid, but there's no
alternative. An SDF, if exposed, would likely be overkill and
cumbersome to call (given the reflection machinery involved in
invoking DoFns).
> I'll file separate PRs for core changes needed for discussion. WDYT?
So
This was due to a bad release artifact push. This has now been fixed upstream.
On Mon, Jul 22, 2019 at 11:00 AM Robert Bradshaw wrote:
>
> Looks like https://sourceforge.net/p/docutils/bugs/365/
>
> On Sun, Jul 21, 2019 at 11:56 PM Tanay Tummalapalli
> wrote:
>>
>
Looks like https://sourceforge.net/p/docutils/bugs/365/
On Sun, Jul 21, 2019 at 11:56 PM Tanay Tummalapalli
wrote:
> Hi everyone,
>
> The Python PreCommit from the Jenkins job "beam_PreCommit_Python_Cron" is
> failing[1]. The task :sdks:python:docs is failing with this traceback:
>
> Traceback (
On Fri, Jul 19, 2019 at 5:16 PM Neville Li wrote:
>
> Forking this thread to discuss action items regarding the change. We can keep
> technical discussion in the original thread.
>
> Background: our SMB POC showed promising performance & cost saving
> improvements and we'd like to adopt it for p
values.
I'm not quite following here. Suppose one processes element a, m, and
z. Then one decides to split the bundle, but there's not a "range" we
can pick for the "other" as this bundle already spans the whole range.
But maybe I'm just off in the weeds here.
>
>>> Note on reader block/offset/split requirement
>>>>
>>>> Because of the merge sort, we can't split or offset seek a bucket file.
>>>> Because without persisting the offset index of a key group somewhere, we
>>>> ca
>>> which may not have the same key distribution. It might be possible to
>>> binary search for matching keys but that's extra complication. IMO the
>>> reader work distribution is better solved by better bucket/shard strategy
>>> in upstream writer
X" - commit hash not changed after
> squash & force-push
> b00 "fixup: Address review comments." - commit hash has changed after
> squash, but these commits were never reviewed,
>
> 5. Author requests another review iteration (PTAL). Since PR
Congratulations!
On Wed, Jul 17, 2019, 12:56 PM Katarzyna Kucharczyk
wrote:
> Congratulations! :)
>
> On Wed, Jul 17, 2019 at 12:46 PM Michał Walenia <
> michal.wale...@polidea.com> wrote:
>
>> Congratulations, Robert! :)
>>
>> On Wed, Jul 17, 2019 at 12:45 PM Łukasz Gajowy
>> wrote:
>>
>>> Con
Python workers also have a per-bundle SDK-side cache. A protocol has
been proposed, but hasn't yet been implemented in any SDKs or runners.
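The proposed per-bundle cache can be sketched roughly as follows. This is a hypothetical plain-Python illustration of the idea (read state from the runner at most once per bundle, drop it at bundle boundaries), not the actual proposed protocol:

```python
class BundleCache:
    """Sketch of a per-bundle state cache: reads hit the runner once
    per bundle; the cache is dropped when the bundle finishes."""

    def __init__(self, fetch):
        self._fetch = fetch  # fetch(key) -> value, stands in for a runner call
        self._cache = {}
        self.fetches = 0  # count of round trips to the "runner"

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._fetch(key)
            self.fetches += 1
        return self._cache[key]

    def finish_bundle(self):
        self._cache.clear()

cache = BundleCache(fetch=lambda k: k.upper())
assert [cache.get("a"), cache.get("a"), cache.get("b")] == ["A", "A", "B"]
assert cache.fetches == 2  # second read of "a" was served from the cache
cache.finish_bundle()
cache.get("a")
assert cache.fetches == 3  # new bundle, state re-fetched
```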
On Tue, Jul 16, 2019 at 6:02 AM Reuven Lax wrote:
>
> It's runner dependent. Some runners (e.g. the Dataflow runner) do have such a
> cache, though I think
>>> This requires a lot of copy-pasted Avro boilerplate.
>>> - For compatibility, we can delegate to the new classes from the old ones
>>> and remove them in the next breaking release.
>>>
>>> Re: WriteFiles logic, I'm not sure about generalizing it, but what about
(since they
>> operate on a WritableByteChannel/ReadableByteChannel, which is the level of
>> granularity we need) but the Writers, at least, seem to be mostly
>> private-access. Do you foresee them being made public at any point?
>>
>> - Claire
>>
>> On M
>>>> This is for the common pattern of few core data producer pipelines and
>>>> many downstream consumer pipelines. It's not intended to replace
>>>> shuffle/join within a single pipeline. On the producer side, by
>>>> pre-grouping/sorting data
On Mon, Jul 15, 2019 at 5:42 AM Chamikara Jayalath wrote:
>
> On Sat, Jul 13, 2019 at 7:41 PM Chad Dombrova wrote:
>>
>> Hi Chamikara,
why not make this part of the pipeline options? does it really need to
vary from transform to transform?
>>>
>>> It's possible for the same pipel
on cloudpickle vs dill in Beam, I'll bring it to
> the mailing list.
>
> On Wed, May 15, 2019 at 5:25 AM Robert Bradshaw wrote:
>>
>> (2) seems reasonable.
>>
>> On Tue, May 14, 2019 at 3:15 AM Udi Meiri wrote:
>> >
>> > It seems like pick
On Wed, Jul 10, 2019 at 5:06 AM Kenneth Knowles wrote:
>
> My opinion: what is important is that we have a policy for what goes into the
> master commit history. This is very simple IMO: each commit should clearly do
> something that it states, and a commit should do just one thing.
Exactly how
On Thu, Jun 27, 2019 at 1:52 AM Rui Wang wrote:
>>
>>
>> AFAIK all streaming runners today practically do provide these panes in
>> order;
>
> Does it refer to "the stage immediately after GBK itself processes fired
> panes in order" in streaming runners? Could you share more information?
>
>
>
There is no promise that panes will arrive in order (especially the
further you get "downstream"). Though they may be approximately so,
it's dangerous to assume that. You can inspect the sequential index in
PaneInfo to determine whether a pane is older than other panes you
have seen.
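A sketch of using the pane's sequential index rather than arrival order. This is plain Python outside Beam; `index` stands in for the value carried by PaneInfo, and the tuple shape is assumed for illustration:

```python
def latest_panes(panes):
    """Keep only the newest pane seen per key, using the pane's
    sequential index rather than arrival order (panes may arrive
    out of order, especially downstream)."""
    newest = {}
    for key, index, value in panes:
        if key not in newest or index > newest[key][0]:
            newest[key] = (index, value)
    return {k: v for k, (i, v) in newest.items()}

# Pane index 2 arrives before the stale index 1; index 1 is ignored.
result = latest_panes([("k", 0, "a"), ("k", 2, "c"), ("k", 1, "b")])
assert result == {"k": "c"}
```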
On Wed, Jun 2
ction and should use it accordingly in subsequent
> transforms.
>
> Thanks,
> Cham
>
> On Wed, Jun 26, 2019 at 9:57 AM Chamikara Jayalath
> wrote:
>>
>>
>>
>> On Wed, Jun 26, 2019 at 5:46 AM Robert Bradshaw wrote:
>>>
>>> Good q
>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail
>>>
>>>
>>> On Tue, Jun 18, 2019 at 12:03 AM Valentyn Tymofieiev
>>> wrote:
>>>>
>>>> I like the update Ismaël referenced [1], I think we should
bundle). However this would be a valid "lazy" implementation.
>
> On Wed, Jun 26, 2019 at 2:29 PM Robert Bradshaw wrote:
>>
>> Note also that a "lazy" SDK implementation would be to simply return
>> all the timers (as if they were new timers) to runner
Good question.
I'm not sure what could be done with (5) if it contains no deferred
objects (e.g. there's nothing to wait on).
There is also (6) return PCollection. The
advantage of (2) is that one can migrate to (1) or (6) without
changing the public API, while giving something to wait on without
f
> work needed to implement Robert's suggestion.
>
> Reuven
>
> On Wed, Jun 26, 2019 at 12:28 PM Robert Bradshaw wrote:
>>
>> Another option, that is nice from an API perspective but places a
>> burden on SDK implementers (and possibly runners), is to
ould have to
> update the runner while executing it. It is another possible option, but
> seems to have issues of its own.
>
> On 6/26/19 12:28 PM, Robert Bradshaw wrote:
> > Another option, that is nice from an API perspective but places a
> > burden on SDK implement
Another option, that is nice from an API perspective but places a
burden on SDK implementers (and possibly runners), is to maintain the
ordering of timers by requiring timers to be fired in order, and if
any timers are set to fire them immediately before processing later
timers. In other words, if
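The ordering rule sketched above (the snippet is truncated here) can be illustrated with a heap. Plain Python, not any SDK's actual timer machinery; the tuple and callback shapes are assumptions for illustration:

```python
import heapq

def fire_timers(timers, callback):
    """Fire timers in timestamp order; a timer set during processing
    that falls before later pending timers fires first (sketch)."""
    heap = list(timers)  # (timestamp, tag) pairs
    heapq.heapify(heap)
    fired = []
    while heap:
        ts, tag = heapq.heappop(heap)
        fired.append(tag)
        # callback may set new timers; push them so ordering is maintained
        for new_timer in callback(ts, tag):
            heapq.heappush(heap, new_timer)
    return fired

# Timer "a" at t=1 sets a new timer at t=2, which fires before "b" at t=3.
order = fire_timers([(1, "a"), (3, "b")],
                    lambda ts, tag: [(2, "a2")] if tag == "a" else [])
assert order == ["a", "a2", "b"]
```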
Thanks. This is a great write-up. +1 to an official tweet.
On Wed, Jun 26, 2019, 6:11 AM Reza Rokni wrote:
> Thank you for putting this together!
>
> On Wed, 26 Jun 2019 at 01:23, Ahmet Altay wrote:
>
>> Thank you for writing and sharing this. I enjoyed reading it :) I think
>> it is worth shar
On Tue, Jun 25, 2019 at 2:26 PM Ismaël Mejía wrote:
>
> I stumbled recently into BEAM-3644 [1]. This issue mentions that
> Python direct runner saw a great performance gain because of relying
> on portability’s FnApiRunner. This seems to me a bit counter-intuitive
> considering the extra overhead o
On Fri, Jun 21, 2019 at 1:02 PM Thomas Weise wrote:
>
> From what I understand, spreadsheets (not docs) provide the functionality
> that we need: https://support.google.com/docs/answer/91588
>
> Interested PMC members can subscribe and react to changes in the spreadsheet.
>
> Lazy consensus requi
This is unfortunate, as the reviewers pulldown at least gives some
reasonable suggestions for new users (which is why I suggested it).
But, yes, R: @ is the convention.
On Fri, Jun 21, 2019 at 5:56 PM Lukasz Cwik wrote:
>
> Only a few people have permission to update the 'Reviewers' section and I
Welcome! You're in.
On Thu, Jun 20, 2019 at 11:05 AM Jonas Grabber wrote:
>
> Hello,
>
> I'd like to start contributing to Apache Beam. Could you add me as a
> Contributor in ASF Jira?
> My Jira user ID is: "Jonas Grabber".
>
>
> Best regards,
> Jonas
Welcome! I added you to the contributors group.
On Thu, Jun 20, 2019 at 11:03 AM Matt Helm wrote:
>
> Hi Beam community,
>
> I'm Matt Helm, a Data Engineer at Shopify. I'm based in Vancouver, Canada. As
> part of Beam Summit I'd like to start taking issues from Jira. My
> username is
inline.
>
> On Fri, Jun 14, 2019 at 2:20 AM Robert Bradshaw wrote:
>>
>> On Thu, Jun 13, 2019 at 8:42 PM Reuven Lax wrote:
>> >
>> > Spoke to Brian about his proposal. It is essentially this:
>> >
>> > We create PortableSchemaCoder, with a well-k
track that and will try some experiments.
>
Thanks.
>
> Jan
>
> [1] https://issues.apache.org/jira/browse/BEAM-7574
>
> On 6/14/19 1:42 PM, Robert Bradshaw wrote:
> > On Fri, Jun 14, 2019 at 1:02 PM Jan Lukavský wrote:
> >> > Interesting. However, there we should ne
one hour and a periodicity of one
minute. If we have one datum per hour, better to not explode. If we
have one datum per minute, it's a wash. If we have one datum per
second, much better to explode.
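The arithmetic behind these tradeoffs, as a small sketch (assumed figures from the thread: one-hour windows, one-minute sliding period):

```python
from fractions import Fraction

def windows_per_element(size_s, period_s):
    """How many sliding windows each element lands in (size / period)."""
    assert size_s % period_s == 0
    return size_s // period_s

def elements_per_window(rate_per_s, size_s):
    """Average element count inside one window."""
    return rate_per_s * size_s

SIZE, PERIOD = 3600, 60  # one-hour windows sliding every minute
assert windows_per_element(SIZE, PERIOD) == 60

# One datum per hour: ~1 element per window, but exploding copies that
# single element into 60 windows -- better not to explode.
assert elements_per_window(Fraction(1, 3600), SIZE) == 1
# One datum per second: 3600 elements per window; the 60x explosion is
# small relative to recomputing each of the overlapping windows.
assert elements_per_window(1, SIZE) == 3600
```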
> On 6/14/19 12:19 PM, Robert Bradshaw wrote:
> > On Fri, Jun 14, 2019 at 12:10 PM J
On Fri, Jun 14, 2019 at 12:10 PM Jan Lukavský wrote:
>
> Hi Robert,
>
> thanks for the discussion. I will create a JIRA with summary of this.
> Some comments inline.
>
> Jan
>
> On 6/14/19 10:49 AM, Robert Bradshaw wrote:
> > On Thu, Jun 13, 2019 at 8:43 PM Jan L
l issues
that are still not decided. (I suppose we should be open to throwing
whole PRs away in that case.) There are certainly pieces that we'll
know that we need (like the ability to serialize a row consistently in
all languages) we can get in immediately.
> Reuven
>
> On Thu, Jun
On Thu, Jun 13, 2019 at 8:43 PM Jan Lukavský wrote:
>
> On 6/13/19 6:10 PM, Robert Bradshaw wrote:
>
> On Thu, Jun 13, 2019 at 5:28 PM Jan Lukavský wrote:
>>
>> On 6/13/19 4:31 PM, Robert Bradshaw wrote:
>>
>> The comment fails to take into account the asymm
On Thu, Jun 13, 2019 at 5:28 PM Jan Lukavský wrote:
> On 6/13/19 4:31 PM, Robert Bradshaw wrote:
>
> The comment fails to take into account the asymmetry between calling
> addInput vs. mergeAccumulators. It also focuses a lot on the asymptotic
> behavior, when the most common beh
mbineByKeyAndWindow
>
> Jan
> On 6/13/19 3:51 PM, Robert Bradshaw wrote:
>
> I think the problem is that it never leverages the (traditionally much
> cheaper) CombineFn.addInput(old_accumulator, new_value). Instead, it always
> calls CombineFn.mergeAccumulators(old_accumula
I think the problem is that it never leverages the (traditionally much
cheaper) CombineFn.addInput(old_accumulator, new_value). Instead, it always
calls CombineFn.mergeAccumulators(old_accumulator,
CombineFn.addInput(CombineFn.createAccumulator(), new_value)). It should be
feasible to fix this whil
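The asymmetry can be illustrated outside Beam with a mean combiner. The method names mirror Beam's CombineFn interface, but this is a plain-Python sketch, not the SDK classes:

```python
class MeanFn:
    """A CombineFn-style mean, sketched outside Beam for illustration."""
    def create_accumulator(self):
        return (0, 0)  # (sum, count)
    def add_input(self, acc, value):
        return (acc[0] + value, acc[1] + 1)
    def merge_accumulators(self, a, b):
        return (a[0] + b[0], a[1] + b[1])
    def extract_output(self, acc):
        return acc[0] / acc[1]

fn = MeanFn()
values = range(1, 101)

# Cheap path: fold each value into one accumulator via add_input.
acc = fn.create_accumulator()
for v in values:
    acc = fn.add_input(acc, v)

# Wasteful path (what the reported behavior does): wrap every value in
# a fresh single-element accumulator, then merge it in.
acc2 = fn.create_accumulator()
for v in values:
    acc2 = fn.merge_accumulators(
        acc2, fn.add_input(fn.create_accumulator(), v))

# Same answer, but the second path allocates an accumulator per element
# and pays the (traditionally pricier) merge on every input.
assert fn.extract_output(acc) == fn.extract_output(acc2) == 50.5
```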
k data channel
transfer vs ???, possibly with parameters like "preferred batch size") or
standardizing on one physical byte representation for all communication
over the boundary.)
>
>
>>
>> Can we also just have both of these, with different URNs?
>>
order to analyze it. Schemas don't change this, logical types don't change
>> this. However in practice schemas do make it much easier to introspect
>> data, and to analyze data. If the tradeoff is that some user might assume
>> they understand more than they do about d
On Tue, Jun 11, 2019 at 8:04 PM Kenneth Knowles wrote:
>
> I believe the schema registry is a transient construction-time concept. I
> don't think there's any need for a concept of a registry in the portable
> representation.
>
> I'd rather urn:beam:schema:logicaltype:javasdk not be used whenever
We have a pull request to add Pypi dependency compatibility badges to
our readme: https://github.com/apache/beam/pull/8791 This looks
generally useful, though highlights how out of date we still are in
some areas. Any thoughts on this? Given the high visibility, I wanted
to get some consensus (even
ed to extend this to beyond 2019. 2 releases
> (~ 3 months) after solid python 3 support will very likely put the last
> python 2 supporting release to last quarter of 2019 already.
>
> On Fri, Jun 7, 2019 at 2:15 AM Robert Bradshaw
> wrote:
>
>> I don't think the seco
On Mon, Jun 10, 2019 at 2:14 AM Robert Bradshaw
> wrote:
>
>> On Sat, Jun 8, 2019 at 9:25 PM Kenneth Knowles wrote:
>>
>>> On Fri, Jun 7, 2019 at 4:35 AM Robert Burke wrote:
>>>
>>>> Wouldn't SDK specific types always be under the "co
On Sat, Jun 8, 2019 at 9:25 PM Kenneth Knowles wrote:
> On Fri, Jun 7, 2019 at 4:35 AM Robert Burke wrote:
>
>> Wouldn't SDK specific types always be under the "coders" component
>> instead of the logical type listing?
>>
>> Offhand, having a separate normalized listing of logical schema types i
;>
>> In addition to the statement, keeping a target release and date(if possible)
>> or timeline to drop support would also help users to decide when they need
>> to work on migrating to Python 3.
>>
>> Regards,
>> - TT
>>
>> [1] https://python
Until Python 3 support for Beam is officially out of beta and
recommended, I don't think we can tell people to stop using Python 2.
Given that 2020 is just over 6 months away, that seems a short
transition time, so I would guess we'll have to continue supporting
Python 2 sometime into 2020.
A quic
One issue with the fully expanded version is that it's so large it's
hard to read.
I think it would be useful to make the ~ entries (at least) clickable
or with hover tool tips. It would be nice to be able to expand columns
individually as well.
On Tue, Jun 4, 2019 at 7:20 AM Melissa Pashniak wr
+1
I validated the artifacts and Python 3.
On Sat, Jun 1, 2019 at 7:45 PM Ankur Goenka wrote:
>
> Thanks Ahmet and Luke for validation.
>
> If no one has objections then I am planning to move ahead without Gearpump
> validation as it seems to be broken from past multiple releases.
>
> Reminder:
, that ordering of sources is relevant only for
>> (partitioned!) streaming sources and generally always reduces to
>> sequence metadata (e.g. offsets).
>>
>> Jan
>>
>> On 5/28/19 11:43 AM, Robert Bradshaw wrote:
>> > Huge +1 to all Kenn said.
>> >
ys reduces to
> sequence metadata (e.g. offsets).
Offsets within a file, unordered between files seems exactly analogous
with offsets within a partition, unordered between partitions, right?
> On 5/28/19 11:43 AM, Robert Bradshaw wrote:
> > Huge +1 to all Kenn said.
> >
> >
are:
> > >> - The product is clearly marked as beta with a big warning.
> > >> - It looks like mostly a single person project. For the same reason
I also strongly prefer not using a fork for a specific setting. A fork will
only have fewer people looking at it.
> > >>
I'm not quite following what these sizes are needed for--aren't the
benchmarks already tuned to be specific, known sizes? I agree that
this can be expensive; especially for benchmarking purposes a 5x
overhead means you're benchmarking the sizing code, not the pipeline
itself.
Beam computes estimat
On Fri, May 24, 2019 at 6:57 PM Kenneth Knowles wrote:
>
> On Fri, May 24, 2019 at 9:51 AM Kenneth Knowles wrote:
>>
>> On Fri, May 24, 2019 at 8:14 AM Reuven Lax wrote:
>>>
>>> Some great comments!
>>>
>>> Aljoscha: absolutely this would have to be implemented by runners to be
>>> efficient. W
>>>
>>> On Mon, May 27, 2019 at 3:35 PM Maximilian Michels wrote:
>>> >
>>> > +1
>>> >
>>> > On 27.05.19 14:04, Robert Bradshaw wrote:
>>> > > Sounds like everyone's onboard with the plan. Any chance we could
ed to
>>> match the logical ordering. Not only might you have several elements with
>>> the same timestamp, but in reality time skew across backend servers can
>>> cause the events to have timestamps in reverse order of the actual
>>> causality order.
I'm generally in favor of autoformatters, though I haven't looked at
how well this particular one works. We might have to go with
https://github.com/desbma/black-2spaces given
https://github.com/python/black/issues/378 .
On Mon, May 27, 2019 at 10:43 PM Pablo Estrada wrote:
>
> This looks pretty
I also favor explicit opt-in, especially when you're mixing mature and
new components.
A differently-named, but still published, artifact seems preferable
IMHO to long lived branches. I don't have a handle on how problematic
this would be in practice (e.g. how would a user know to update the
name).
+1
And, on a pragmatic note, it'd be good to share the port with the
artifact server as well, in which case the job server could say "serve
artifacts to me" without having to worry about any intervening port
forwarding, etc. that sits between it and the sdk.
> On 27.05.19 13:3
Sounds like everyone's onboard with the plan. Any chance we could
publish these for the upcoming 2.13 release?
On Wed, Feb 6, 2019 at 6:29 PM Łukasz Gajowy wrote:
>
> +1 to have a registry for images accessible to anyone. For snapshot images, I
> agree that gcr + apache-beam-testing project seem
ironment? That would solve the problem of updating the expansion
> > service, although it adds additional complexity for bringing up the
> > environment.
> >
> >
> > Which environment would be used to perform the expansion? I think
> > thi
On Fri, May 24, 2019 at 5:32 AM Reuven Lax wrote:
>
> On Thu, May 23, 2019 at 1:53 PM Ahmet Altay wrote:
>>
>>
>>
>> On Thu, May 23, 2019 at 1:38 PM Lukasz Cwik wrote:
>>>
>>>
>>>
>>> On Thu, May 23, 2019 at 11:37 AM Rui Wang wrote:
>
> A few obvious problems with this code:
> 1.
t outputs on
>> >> checkpoint in sink). That implies that if you don't have sink that is
>> >> able to commit outputs atomically on checkpoint, the pipeline
>> >> execution should be deterministic upon retries, otherwise shadow
>> >> writes
Thanks for writing this up.
I think the justification for adding this to the model needs to be
that it is useful (you have this covered, though some examples would
be nice) and that it's something that can't easily be done by users
themselves (specifically, though it can be (relatively) cheaply do
n which to perform the expansion (though this would
probably not be offered by most, let alone all, expansion services).
> On 23.05.19 11:31, Robert Bradshaw wrote:
> > On Wed, May 22, 2019 at 6:17 PM Maximilian Michels wrote:
> >
> > Hi,
On Thu, May 23, 2019 at 11:07 AM Maximilian Michels wrote:
> My motivation was to get rid of the Docker dependency for the Python VR
> tests. Similarly to how we use Python's LOOPBACK environment for
> executing all non-cross-language tests, I wanted to use Java's EMBEDDED
> environment to run th
On Wed, May 22, 2019 at 6:17 PM Maximilian Michels wrote:
> Hi,
>
> Robert and me were discussing on the subject of user-specified
> environments for external transforms [1]. We couldn't decide whether
> users should have direct control over the environment when they use an
> external transform i
ng, but I have trouble seeing how an SDK would take
advantage of this (what would trigger the discard of the first page? What
complexity would that incur vs. the marginal benefit (if any)?). Mostly
this feels like a solution in search of a problem. Is there a problem that
you're trying to solv
ser themselves, should we elevate this to a property of
> (Stateful?)DoFns that the runner can provide? I think a compelling
> argument can be made here that we should.
>
> +1
>
> Jan
>
> On 5/21/19 11:07 AM, Robert Bradshaw wrote:
> > On Mon, May 20, 2019 at 5:24 PM
>> StatefulParDo in the first place.
>>>
>>>>
>>>> > Pipelines that fail in the "worst case" batch scenario are likely to
>>>> degrade poorly (possibly catastrophically) when the watermark falls
>>>> behind in streaming mode
The primary con I see with this is that the runner must know ahead of
time, when it starts encoding the iterable, whether or not to treat it
as a large one (to also cache the part it's encoding to the data
channel, or at least some kind of pointer to it). With the current
protocol it can be lazy, a
> I see the first two options as quite equally good, although the latter one
> is probably more time consuming to implement. But it would bring
> additional feature to streaming case as well.
>
> Thanks for any thoughts.
>
> Jan
>
> On 5/20/19 12:41 PM, Robert Bradshaw wrote:
I created https://issues.apache.org/jira/browse/BEAM-7367
On Mon, May 20, 2019 at 3:11 PM Michael Luckey wrote:
>
> This is most likely caused by Merge of
> https://issues.apache.org/jira/browse/BEAM-7349, which was done lately.
>
> Best,
>
> michel
>
> On Mon, May 20, 2019 at 2:49 PM Charith El
I have no idea about this failure, but it sounds like you've done due
diligence looking into it at this point and it makes sense to ask some
reviewers to take a look at your code which can happen in parallel to
figuring out the root cause of this Kafka issue before it finally gets
submitted.
On Mo
On Fri, May 17, 2019 at 4:48 PM Jan Lukavský wrote:
>
> Hi Reuven,
>
> > How so? AFAIK stateful DoFns work just fine in batch runners.
>
> Stateful ParDo works in batch only as far as the logic inside the state works for
> absolutely unbounded out-of-orderness of elements. That basically
> (practica
On Wed, May 15, 2019 at 8:43 PM Allie Chen wrote:
> Thanks all for your reply. I will try each of them and see how it goes.
>
> The experiment I am working now is similar to
> https://stackoverflow.com/questions/48886943/early-results-from-groupbykey-transform,
> which tries to get early results
On Wed, May 15, 2019 at 8:51 PM Kenneth Knowles wrote:
>
> On Wed, May 15, 2019 at 3:05 AM Robert Bradshaw wrote:
>>
>> Isn't there an API for concisely computing new fields from old ones?
>> Perhaps these expressions could contain references to metadata value
?
>
>
>
> On Wed, May 15, 2019, 6:51 AM Robert Bradshaw wrote:
>>
>> This does bring up an interesting question though. Are runners
>> violating (the intent of) the spec if they simply abandon/kill workers
>> rather than gracefully bringing them down (e.g. so th
This does bring up an interesting question though. Are runners
violating (the intent of) the spec if they simply abandon/kill workers
rather than gracefully bringing them down (e.g. so that these
callbacks can be invoked)?
On Tue, May 7, 2019 at 3:55 PM Michael Luckey wrote:
>
> Thanks Kenn and R
On Wed, May 15, 2019 at 1:00 PM Robert Bradshaw wrote:
>>
>> Unfortunately the "write" portion of the reshuffle cannot be
>> parallelized more than the source that it's reading from. In my
>> experience, generally the read is the bottleneck in this case, but