You can literally return a Python tuple of outputs from a composite
transform as well. (Dicts with PCollections as values are also
supported, if you want things to be named rather than referenced by
index.)
On Fri, Oct 25, 2019 at 4:06 PM Ahmet Altay wrote:
>
> Is DoOutputsTuple what you are
oking at the PR, here
> is the verbiage I added about urgency:
>
> P0/Blocker: "A P0 issue is more urgent than simply blocking the next release"
> P1/Critical: "Most critical bugs should block release"
> P2/Major: "No special urgency is associated"
> ...
I think we'll still need approach (2) for when the pipeline finishes
and a runner is tearing down workers.
On Fri, Oct 25, 2019 at 10:36 AM Maximilian Michels wrote:
>
> Hi Jincheng,
>
> Thanks for bringing this up and capturing the ideas in the doc.
>
> Intuitively, I would have also considered
It looks like fn_api_runner_test.py is quite expensive, taking 10-15+
minutes on each version of Python. This test consists of a base class
that is basically a validates runner suite, and is then run in several
configurations, many more of which (including some expensive ones)
have been added
We cut a release every 6 weeks, according to schedule, making it easy
to plan for, and the release manager typically sends out a warning
email to remind everyone. I don't think it makes sense to do that for
every ticket. Blockers should be reserved for things we really
shouldn't release without.
Thanks for trying this out. Yes, this is definitely something that
should be supported (and tested).
On Mon, Oct 21, 2019 at 3:40 PM Igor Durovic wrote:
>
> Hi everyone,
>
> The interactive beam example using the DirectRunner fails after execution of
> the last cell. The recursion limit is
I just merged https://github.com/apache/beam/pull/9845 which should
resolve the issue.
On Mon, Oct 21, 2019 at 12:58 PM Chad Dombrova wrote:
>
> thanks!
>
> On Mon, Oct 21, 2019 at 12:47 PM Kyle Weaver wrote:
>>
>> This issue is being tracked at
>>
iding the creation of these bundles, but maybe the test
> should be modified so that it adheres to the model [1].
>
> Jan
>
> [1] https://github.com/apache/beam/pull/9846
>
> On 10/21/19 6:00 PM, Robert Bradshaw wrote:
> > Yes, the model allows them.
> >
> > It also t
Yes, the model allows them.
It also takes less work to avoid them in general (e.g. imagine one
reshuffles N elements to M > N workers. A priori, one would "start" a
bundle and then try to read all data destined for that
worker--postponing this until one knows that the set of data for this
worker
itAtTimestamp: SDK Native Object -> Timestamp
On Fri, May 10, 2019 at 1:33 PM Robert Bradshaw wrote:
> On Thu, May 9, 2019 at 9:32 AM Kenneth Knowles wrote:
>
> > From: Robert Bradshaw
> > Date: Wed, May 8, 2019 at 3:00 PM
> > To: dev
> >
> >> From: Ke
Sounds nice. Is there a design doc (or, perhaps, you could just give an
example of what this would look like in this thread)?
On Wed, Oct 16, 2019 at 5:51 PM Chad Dombrova wrote:
> Hi all,
> One of our goals for the portability framework is to be able to assign
> different environments to
Very excited to see this! I've added some comments to the doc.
On Tue, Oct 15, 2019 at 3:43 PM Pablo Estrada wrote:
> I've just been informed that access wasn't open. I've since opened access
> to it.
> Thanks
> -P.
>
> On Tue, Oct 15, 2019 at 2:10 PM Pablo Estrada wrote:
>
>> Hello all,
>> I
Very nice to see, thanks for sharing.
On Fri, Oct 11, 2019 at 5:44 AM Maximilian Michels wrote:
>
> Glad to see that we have fixed the recent flakes. Let's keep up the good
> work :)
>
> -Max
>
> On 10.10.19 23:37, Kenneth Knowles wrote:
> > All the cells in the pull request template are green
Can we use a lower default timeout to mitigate this issue in the short
term (I'd imagine one second or possibly smaller would be sufficient
for our use), and get a fix upstream in the long term?
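The short-term mitigation above can be sketched in plain Python: wrap the blocking call and bound how long we wait for it (the one-second default is the hypothetical value suggested above, not an existing flag):

```python
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

DEFAULT_TIMEOUT_SECS = 1.0  # hypothetical: the "one second or smaller" default above

def call_with_timeout(fn, timeout=DEFAULT_TIMEOUT_SECS):
    """Run fn on a worker thread and give up waiting after `timeout` seconds.

    Caveat: a timed-out call keeps running in the background; Python threads
    cannot be force-killed, so the executor's shutdown still waits for it.
    """
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(fn)
        return future.result(timeout=timeout)

print(call_with_timeout(lambda: "done"))
```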
On Fri, Oct 11, 2019 at 9:38 AM Luke Cwik wrote:
>
> I'm looking for a thread pool that re-uses
Looks like an issue with the protobuf library. Do you know what
version of protobuf you're using? (E.g. by running pip freeze.)
I don't have Catalina to test this on, but it'd be useful if you could
winnow this down to the import that fails.
On Thu, Oct 10, 2019 at 8:15 AM Kamil Wasilewski
On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot wrote:
>
> Hi guys,
>
> You probably know that for several months there has been ongoing work
> developing a new Spark runner based on the Spark Structured Streaming
> framework. This work is located in a feature branch here:
>
on test coverage could be increased. But we wrote this test
> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_exercise_metrics_pipeline.py>
> .
>
>
> On Wed, Oct 9, 2019 at 10:51 AM Luke Cwik wrote:
>
>> One way w
Probably worth mentioning Slack and StackOverflow as well.
On Wed, Oct 9, 2019 at 3:59 PM María Cruz wrote:
>
> Hi all,
> sorry for multiple messages. I realized after sending the first email that a
> new thread with a different subject was probably more efficient.
>
> I created a communication
Yes, Dataflow still uses the old API, for both counters and for its
progress/autoscaling mechanisms. We'd need to convert that over as
well (which is on the TODO list but lower than finishing up support
for portability in general).
On Mon, Oct 7, 2019 at 9:56 AM Robert Burke wrote:
>
> The Go
OK, this appears to have been a weird config issue on my system
(though the error certainly could have been better). As BEAM-8303 has
a workaround and all else is looking good, I don't think that's worth
another RC.
+1 (binding) to this release.
On Fri, Oct 4, 2019 at 10:56 AM Robert Bradshaw
t;>>>
>>>>> > Oh, and one more thing, I think it'd make sense for Apache Beam to
>>>>> > sign https://python3statement.org/. The promise is that we'd
>>>>> > discontinue Python 2 support *in* 2020, which is not committing us to
>>>>>
The artifact signatures and contents all look good to me. I've also
verified that the wheels work for the direct runner. However, I'm having an
issue with trying to run on dataflow with Python 3.6:
python -m apache_beam.examples.wordcount --input
gs://clouddfe-robertwb/chicago_taxi_data/eval/data.csv
Please add robertwb0 as well.
On Wed, Oct 2, 2019 at 9:09 AM Ahmet Altay wrote:
>
>
>
> On Tue, Oct 1, 2019 at 8:44 PM Pablo Estrada wrote:
>>
>> When she set up the repo, Hannah requested PMC members to ask for
>> privileges, so I did.
>> The set of admins currently is just Hannah and myself
For this specific use case, I would suggest this be done via PTransform URNs.
E.g. one could have a GroupByKeyOneShot whose implementation is
input
    .apply(GroupByKey.of())
    .apply(kv -> KV.of(kv.key(), kv.iterator()))
A runner would be free to recognize and optimize this in the graph (based
The correct link is https://python3statement.org/
On Tue, Oct 1, 2019 at 10:14 AM Mark Liu wrote:
>
> +1
>
> btw, the link (http://python3stament.org) you provided is broken.
>
> On Tue, Oct 1, 2019 at 9:44 AM Udi Meiri wrote:
>>
>> +1
>>
>> On Tue, Oct 1, 2019 at 3:22 AM Łukasz Gajowy wrote:
+1
On Mon, Sep 30, 2019 at 5:35 PM David Cavazos wrote:
>
> +1
>
> On Mon, Sep 30, 2019 at 5:27 PM Ahmet Altay wrote:
>>
>> +1
>>
>> On Mon, Sep 30, 2019 at 5:22 PM Valentyn Tymofieiev
>> wrote:
>>>
>>> Hi everyone,
>>>
>>> Please vote whether to sign a pledge on behalf of Apache Beam to
.yaml
>>>
>>>
>>> On Fri, Sep 27, 2019 at 5:17 PM Chad Dombrova wrote:
>>>>
>>>> Are there any dissenting votes to making a BooleanCoder a standard
>>>> (portable) coder?
>>>>
>>>> I'm happy to make a PR to imple
I had with +Robert Bradshaw a
> while ago: We both agreed all of the coders listed in BEAM-7996 should be
> implemented in Python, but didn't come to a conclusion on whether or not they
> should actually be _standard_ coders, versus just being implicitly standard
> as part of row
user to check what is being sent.
>>>> >
>>>> > One more heavy-weight option is to also allow the user to configure and
>>>> > persist what information they are OK with sharing.
>>>> >
>>>> > --Mikhail
>>>> >
>>>
arton.com/static/files/trace/profile.html
> This information also appears within the build scans that are sent to Gradle.
>
> Integrating with either of these sources of information would allow us to
> figure out whether its new tasks or old tasks taking longer.
>
> On Tue, Sep
Does anyone know how to gather stats on where the time is being spent?
Several times the idea of consolidating many of the (expensive)
validates runner integration tests into a single pipeline, and then
running things individually only if that fails, has come up. I think
that'd be a big win if
On Mon, Sep 23, 2019 at 3:08 PM Brian Hulette wrote:
>
> Would people actually click on that link though? I think Kyle has a point
> that in practice users would only find and click on that link when they're
> having some kind of issue, especially if the link has "feedback" in it.
I think the
the Runner adds the Impulse override. That
> way also the Python SDK would not have to have separate code paths for
> Reads.
Or, rather, that the Runner adds the non-Impulse override (in Java and Python).
> On 19.09.19 11:46, Robert Bradshaw wrote:
> > On Thu, Sep 19, 2019 at 11:22 AM Maxi
Oh, and one more thing, I think it'd make sense for Apache Beam to
sign https://python3statement.org/. The promise is that we'd
discontinue Python 2 support *in* 2020, which is not committing us to
January if we're not ready. Worth a vote?
On Thu, Sep 19, 2019 at 3:58 PM Robert Bradshaw wrote
Exactly how long we support Python 2 depends on our users. Other than
those that speak up (such as yourself, thanks!), it's hard to get a
handle on how many need Python 2 and for how long. (Should we send out
a survey? Maybe after some experience with 2.16?)
On the one hand, the whole ecosystem
> > > >
> > > > Kyle Weaver | Software Engineer | github.com/ibzib
> > <http://github.com/ibzib>
> > > <http://github.com/ibzib>
> > > > <http://github.com/ibzib> | kcwea...@google.com
>
On Tue, Sep 17, 2019 at 1:43 PM Thomas Weise wrote:
>
> +1 for making --experiments=beam_fn_api default.
>
> Can the Dataflow runner driver just remove the setting if it is not
> compatible?
The tricky bit would be undoing the differences in graph construction
due to this flag flip. But I would
Thanks for bringing this up again. My thoughts on the open questions below.
On Mon, Sep 16, 2019 at 11:51 AM Chad Dombrova wrote:
> That commit solves 2 problems:
>
> Adds the pubsub Java deps so that they’re available in our portable pipeline
> Makes the coder for the PubsubIO message-holder
>>>>> non-docker environment, as Docker adds some operational complexity that
>>>>> isn't really needed to run a word count example. For example, Yu's
>>>>> pipeline
>>>>> errored here because the expected Docker container wasn't built before
>>>
I would also suggest SO as the best alternative, especially due to its
indexability and searchability. If discussion is needed, the users
list (my preference) or slack can be good options, and ideally the
resolution is brought back to SO.
On Fri, Sep 6, 2019 at 1:10 PM Udi Meiri wrote:
>
> I
:sdks:java:testing:expansion-service could be useful to publish for
testing as well.
On Fri, Aug 30, 2019 at 3:13 PM Lukasz Cwik wrote:
>
> Google internally relies on being able to get the POM files generated for:
> :sdks:java:testing:nexmark
> :sdks:java:testing:test-utils
>
> Generating the
Just to clarify, the repeated list of cache tokens in the process
bundle request is used to validate reading *and* stored when writing?
In that sense, should they just be called version identifiers or
something like that?
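As an illustration of the "version identifier" reading (a purely hypothetical sketch, not the actual Fn API cache protocol): a token is stored alongside each write, and a read is only honored while the caller presents the same token.

```python
class VersionedCache:
    """Entries are only valid while their stored token matches the current one."""
    def __init__(self):
        self._data = {}  # key -> (token, value)

    def put(self, key, value, token):
        # The token is persisted with the write, as in the question above.
        self._data[key] = (token, value)

    def get(self, key, token):
        entry = self._data.get(key)
        if entry is None or entry[0] != token:
            return None  # missing, or stale: the token acts as a version id
        return entry[1]

cache = VersionedCache()
cache.put("state", 42, token="v1")
print(cache.get("state", token="v1"))  # 42
print(cache.get("state", token="v2"))  # None: a new token invalidates old entries
```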
On Tue, Aug 27, 2019 at 11:33 AM Maximilian Michels wrote:
>
> Thanks.
On Sun, Aug 18, 2019 at 7:30 PM Rakesh Kumar wrote:
>
> not to completely hijack Max's question but a tangential question regarding
> LRU cache.
>
> What is the preferred python library for LRU cache?
> I noticed that cachetools [1] is used as one of the dependencies for GCP [2].
>
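For function-level memoization, Python's standard library already provides `functools.lru_cache` with bounded, least-recently-used eviction (cachetools adds standalone cache objects and TTL variants on top of this idea). A minimal example:

```python
from functools import lru_cache

@lru_cache(maxsize=128)  # bounded cache with least-recently-used eviction
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))  # 832040, computed in linear time thanks to the cache
info = fib.cache_info()
print(info.hits, info.misses)  # repeated subproblems are served from the cache
```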
Hi,
Please join me and the rest of the Beam PMC in welcoming a new
committer: Valentyn Tymofieiev
Valentyn has made numerous contributions to Beam over the last several
years (including 100+ pull requests), most recently pushing through
the effort to make Beam compatible with Python 3. He is
On Fri, Aug 23, 2019 at 4:25 PM Ning Kang wrote:
> On Aug 23, 2019, at 3:09 PM, Robert Bradshaw wrote:
>
> Cool, sounds like we're getting closer to the same page. Some more replies
> below.
>
> On Fri, Aug 23, 2019 at 1:47 PM Ning Kang wrote:
>
>> Thanks for the
I suggest re-writing the test to avoid save_main_session.
On Fri, Aug 23, 2019 at 11:57 AM Udi Meiri wrote:
> Hi,
> I'm trying to get pytest with the xdist plugin to run Beam tests. The
> issue is with save_main_session and a dependency of pytest-xdist called
> execnet, which triggers this
nstruction never spans
multiple cells (though its implementation might via function calls) so one
never has out-of-date transforms dangling off the pipeline object.
> This has the downsides of recreating the PCollection objects which are
> being used as handles (though perhaps they could be re-iden
On Wed, Aug 21, 2019 at 3:33 PM GMAIL wrote:
> Thanks for the input, Robert!
>
> On Aug 21, 2019, at 11:49 AM, Robert Bradshaw wrote:
>
> On Wed, Aug 14, 2019 at 11:29 AM Ning Kang wrote:
>
>> Ahmet, thanks for forwarding!
>>
>>
>>> My main con
the design
>>>> overview.
>>>>
>>>> If you have any questions, please feel free to contact me through this
>>>> email address!
>>>>
>>>> Thanks!
>>>>
>>>> Regards,
>>>> Ning.
>>>>
>>>>
The original timestamps are probably being assigned in the
watchForNewFiles transform, which is also setting the watermark:
https://github.com/apache/beam/blob/release-2.15.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L668
Until https://issues.apache.org/jira/browse/BEAM-644
On Mon, Aug 19, 2019 at 5:44 PM Ahmet Altay wrote:
>
>
>
> On Mon, Aug 19, 2019 at 9:56 AM Brian Hulette wrote:
>>
>>
>>
>> On Fri, Aug 16, 2019 at 5:17 PM Chad Dombrova wrote:
>> Agreed on float since it seems to trivially map to a double, but I’m
>> torn on int still. While I
On Fri, Aug 9, 2019 at 12:48 PM Michał Walenia
wrote:
> From what I understand, the Java 8 -> 11 testing isn't in essence similar
> to py2 -> py3 checks.
>
True. Python 3 is in many ways a new language, and much less (and more
subtly) backwards compatible. You also can't "link" Python 3 code
> "blocked" call and then issue all the requests together.
>
>
> On Thu, Aug 8, 2019 at 9:42 AM Robert Bradshaw wrote:
>>
>> On Tue, Aug 6, 2019 at 12:07 AM Thomas Weise wrote:
>> >
>> > That would add a synchronization point that force
>>>> simpler implementation, since we'd no longer have to work within the
>>>> constraints of the existing job server infrastructure. The only downside I
>>>> can think of is the additional cost of implementing/maintaining jar
>>>> creation code in
If we do provide a configuration value for this, I would give it a
fairly large default and re-use the flag for all RPCs of a similar
nature, rather than tweaking it for this particular service only.
On Fri, Aug 9, 2019 at 2:58 AM Ahmet Altay wrote:
> Default plus a flag to override sounds reasonable.
Could you clarify what you mean by "inconsistent" and "incorrect"? Are
elements missing/duplicated, or just batched differently?
On Fri, Aug 9, 2019 at 2:18 AM rahul patwari wrote:
>
> I only ran in Direct runner. I will run in other runners and let you know the
> results.
> I am not setting
On Wed, Aug 7, 2019 at 5:59 PM Thomas Weise wrote:
>
>> > * The pipeline construction code itself may need access to cluster
>> > resources. In such cases the jar file cannot be created offline.
>>
>> Could you elaborate?
>
>
> The entry point is arbitrary code written by the user, not limited
On Wed, Aug 7, 2019 at 11:12 PM Brian Hulette wrote:
>
> Thanks for all the suggestions, I've added responses inline.
>
> On Wed, Aug 7, 2019 at 12:52 PM Chad Dombrova wrote:
>>
>> There’s a lot of ground to cover here, so I’m going to pull from a few
>> different responses.
>>
>>
Thanks for the note. Are there any associated documents worth sharing as
well? More below.
On Wed, Aug 7, 2019 at 9:39 PM Ning Kang wrote:
> To whom it may concern,
>
> This is Ning from Google. We are currently making efforts to leverage an
> interactive runner under python beam sdk.
>
> There is
ownCoders.addLengthPrefixedCoder). However, only a few
>>> > coders are defined in StandardCoders. It means that for most coders, a
>>> > length will be added to the serialized bytes, which I think is not
>>> > necessary. My suggestion is maybe we can add s
I think the question here is whether PipelineRunner::run is allowed to
be blocking. If it is, then the futures make sense (but there's no way
to properly cancel it). I'm OK with not being able to return metrics
on cancel in this case, or the case the pipeline didn't even start up
yet. Otherwise,
On Wed, Aug 7, 2019 at 6:20 AM Thomas Weise wrote:
>
> Hi Kyle,
>
> [document doesn't have comments enabled currently]
>
> As noted, worker deployment is an open question. I believe pipeline
> submission and worker execution need to be considered together for a complete
> deployment story. The
On Sun, Aug 4, 2019 at 12:03 AM Chad Dombrova wrote:
>
> Hi,
>
> This looks like a great feature.
>
> Is there a plan to eventually support custom field types?
>
> I assume adding support for dataclasses in python 3.7+ should be trivial to
> do in a follow up PR. Do you see any complications
Lots of improvements all around. Thank you for pushing this through, Anton!
On Fri, Aug 2, 2019 at 1:37 AM Chad Dombrova wrote:
>
> Nice work all round! I love the release blog format with the highlights and
> links to issues.
>
> -chad
>
>
> On Thu, Aug 1, 2019 at 4:23 PM Anton Kedin wrote:
Congratulations!
On Thu, Aug 1, 2019 at 9:59 AM Jan Lukavský wrote:
> Thanks everyone!
>
> Looking forward to working with this great community! :-)
>
> Cheers,
>
> Jan
> On 8/1/19 12:18 AM, Rui Wang wrote:
>
> Congratulations!
>
>
> -Rui
>
> On Wed, Jul 31, 2019 at 10:51 AM Robin Qiu wrote:
The standard VARINT coder is used for all sorts of integer values (e.g. the
output of the CountElements transform), but the vast majority of them are
likely significantly less than a full 64 bits. In Python, declaring an
element type to be int will use this. On the other hand, using a VarInt
seems
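A sketch of base-128 varint encoding for non-negative integers (how the coder handles negatives is out of scope here) shows why small values are cheap: each byte carries 7 payload bits plus a continuation bit, so values under 128 need only one byte.

```python
def encode_varint(n):
    """Base-128 varint: 7 payload bits per byte, MSB set = more bytes follow."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # continuation bit: more payload follows
        else:
            out.append(b)
            return bytes(out)

def decode_varint(data):
    n = shift = 0
    for b in data:
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            return n
        shift += 7

print(encode_varint(300))  # b'\xac\x02': two bytes instead of a full 8
```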
> that it's still not supported in the Python SDK Harness. Is there any plan
> on that?
>
> Robert Bradshaw wrote on Tue, Jul 30, 2019 at 12:33 PM:
>
>> On Tue, Jul 30, 2019 at 11:52 AM jincheng sun
>> wrote:
>>
>>>
>>>>> Is it possible to a
Jul 31, 2019 at 3:16 AM Pablo Estrada
>> wrote:
>> > >>>
>> > >>> +1
>> > >>>
>> > >>> I installed from source, and ran unit tests for Python in 2.7, 3.5,
>> 3.6.
>> > >>>
>> >
On Tue, Jul 30, 2019 at 11:52 AM jincheng sun
wrote:
>
>>> Is it possible to add an interface such as `isSelfContained()` to the
>>> `Coder`? This interface indicates
>>> whether the serialized bytes are self contained. If it returns true,
>>> then there is no need to add a prefixing length.
>>>
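The need for length-prefixing can be sketched in plain Python: when serialized bytes are not self-delimiting, the reader has no way to split a concatenated stream back into elements unless each element is framed with its length. (Beam's length-prefix coder uses a varint; a fixed 4-byte prefix is used here only to keep the sketch simple.)

```python
import struct

def write_length_prefixed(payloads):
    """Frame arbitrary byte strings so a reader can split them again."""
    buf = bytearray()
    for p in payloads:
        buf += struct.pack(">I", len(p))  # 4-byte big-endian length prefix
        buf += p
    return bytes(buf)

def read_length_prefixed(buf):
    out, i = [], 0
    while i < len(buf):
        (n,) = struct.unpack_from(">I", buf, i)
        out.append(buf[i + 4 : i + 4 + n])
        i += 4 + n
    return out

frames = [b"abc", b"", b"hello"]
assert read_length_prefixed(write_length_prefixed(frames)) == frames
```

A "self-contained" coder, by contrast, can find its own end in the stream, so the prefix (and its per-element overhead) is unnecessary.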
I checked all the artifact signatures and ran a couple test pipelines with
the wheels (Py2 and Py3) and everything looked good to me, so +1.
On Mon, Jul 29, 2019 at 8:29 PM Valentyn Tymofieiev
wrote:
> I have checked Python 3 batch and streaming quickstarts on Dataflow runner
> using .zip and
On Mon, Jul 29, 2019 at 4:14 PM jincheng sun
wrote:
> Hi Robert,
>
> Thanks for your detail comments, I would have added a few pointers inline.
>
> Best,
> Jincheng
>
> Robert Bradshaw wrote on Mon, Jul 29, 2019 at 12:35 PM:
>
>> On Sun, Jul 28, 2019 at 6:51 AM jincheng su
, I noticed that FLOAT is not among StandardCoders, while DOUBLE
> is among it.
StandardCoders is supposed to be some sort of lowest common
denominator, but there's no hard-and-fast criterion. For this example,
some languages (e.g. Python) don't have the notion of FLOAT, and using
a FLOAT coder
y()
>>
>> ParDo.of(ReadFileRangesFn(createSource) :: DoFn> OffsetRange>, T>) where
>>
>> createSource :: String -> FileBasedSource
>>
>> createSource = AvroSource
>>
>>
>> AvroIO.read without getHintMatchedManyFiles() :: PTransform> P
m.identityHashCode(this) in the body of a DoFn might be
sufficient.
> On Thu, Jul 25, 2019 at 9:54 PM Robert Bradshaw wrote:
>>
>> Though it's not obvious in the name, Stateful ParDos can only be
>> applied to keyed PCollections, similar to GroupByKey. (You could,
>>
Though it's not obvious in the name, Stateful ParDos can only be
applied to keyed PCollections, similar to GroupByKey. (You could,
however, assign every element to the same key and then apply a
Stateful DoFn, though in that case all elements would get processed on
the same worker.)
On Thu, Jul
f
>>> temp file handing logic lives). Might be hard to decouple either modifying
>>> existing code or creating new transforms, unless if we re-write most of
>>> FileBasedSink from scratch.
>>>
>>> Let me know if I'm on the wrong track.
>>>
>>>
ntry and I'll be happy
to answer any questions you might have if (well probably when) these
pointers are insufficient.
> On Tue, Jul 23, 2019 at 3:47 AM Robert Bradshaw wrote:
>>
>> This is documented at
>> https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdz
On Thu, Jul 25, 2019 at 5:31 AM Thomas Weise wrote:
>
> Hi Jincheng,
>
> It is very exciting to see this follow-up, that you have done your research
> on the current state and that there is the intention to join forces on the
> portability effort!
>
> I have added a few pointers inline.
>
>
From the portability perspective,
https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto
and the associated services for executing pipelines is about as "core"
as it gets, and eventually I'd like to see all runners being portable
(even if they have an
On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath
wrote:
>
> On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver wrote:
>>
>> I agree with David that at least clearer log statements should be added.
>>
>> Udi, that's an interesting idea, but I imagine the sheer number of existing
>> flags (including
new data. The Beam Java SDK does this for all
>>> runners when executed portably[1]. You could port the same logic to the
>>> Beam Python SDK as well.
>>>
>>> 1:
>>> https://github.com/apache/beam/blob/41478d00d34598e56471d99d0845ac16efa5b8ef/sdks/java/harness/sr
I think having a single, default, auto-created temporary bucket per
project for use in GCP (when running on Dataflow, or running elsewhere
but using GCS such as for this BQ load files example), though not
ideal, is the best user experience. If we don't want to be
automatically creating such things
On Mon, Jul 22, 2019 at 7:39 PM Eugene Kirpichov wrote:
>
> On Mon, Jul 22, 2019 at 7:49 AM Robert Bradshaw wrote:
>>
>> On Mon, Jul 22, 2019 at 4:04 PM Neville Li wrote:
>> >
>> > Thanks Robert. Agree with the FileIO point. I'll look into it
urce, which we'd like to avoid, but there's no
alternative. An SDF, if exposed, would likely be overkill and
cumbersome to call (given the reflection machinery involved in
invoking DoFns).
> I'll file separate PRs for core changes needed for discussion. WDYT?
Sounds good.
> On Mon, Jul 22, 20
This was due to a bad release artifact push. This has now been fixed upstream.
On Mon, Jul 22, 2019 at 11:00 AM Robert Bradshaw wrote:
>
> Looks like https://sourceforge.net/p/docutils/bugs/365/
>
> On Sun, Jul 21, 2019 at 11:56 PM Tanay Tummalapalli
> wrote:
Looks like https://sourceforge.net/p/docutils/bugs/365/
On Sun, Jul 21, 2019 at 11:56 PM Tanay Tummalapalli
wrote:
> Hi everyone,
>
> The Python PreCommit from the Jenkins job "beam_PreCommit_Python_Cron" is
> failing[1]. The task :sdks:python:docs is failing with this traceback:
>
> Traceback
On Fri, Jul 19, 2019 at 5:16 PM Neville Li wrote:
>
> Forking this thread to discuss action items regarding the change. We can keep
> technical discussion in the original thread.
>
> Background: our SMB POC showed promising performance & cost saving
> improvements and we'd like to adopt it for
t quite following here. Suppose one processes element a, m, and
z. Then one decides to split the bundle, but there's not a "range" we
can pick for the "other" as this bundle already spans the whole range.
But maybe I'm just off in the weeds here.
> On Wed, Jul 17, 2019 at 6:
>>>> Because of the merge sort, we can't split or offset seek a bucket file.
>>>> Because without persisting the offset index of a key group somewhere, we
>>>> can't efficiently skip to a key group without exhausting the previous
>>>> ones
possible to
>>> binary search for matching keys but that's extra complication. IMO the
>>> reader work distribution is better solved by better bucket/shard strategy
>>> in upstream writer.
>>>
>>> References
>>>
>>> ReadMatche
sh not changed after
> squash & force-push
> b00 "fixup: Address review comments." - commit hash has changed after
> squash, but these commits were never reviewed,
>
> 5. Author requests another review iteration (PTAL). Since PR still has
> a00, pr
Congratulations!
On Wed, Jul 17, 2019, 12:56 PM Katarzyna Kucharczyk
wrote:
> Congratulations! :)
>
> On Wed, Jul 17, 2019 at 12:46 PM Michał Walenia <
> michal.wale...@polidea.com> wrote:
>
>> Congratulations, Robert! :)
>>
>> On Wed, Jul 17, 2019 at 12:45 PM Łukasz Gajowy
>> wrote:
>>
>>>
Python workers also have a per-bundle SDK-side cache. A protocol has
been proposed, but hasn't yet been implemented in any SDKs or runners.
On Tue, Jul 16, 2019 at 6:02 AM Reuven Lax wrote:
>
> It's runner dependent. Some runners (e.g. the Dataflow runner) do have such a
> cache, though I think
f copy-pasted Avro boilerplate.
>>> - For compatibility, we can delegate to the new classes from the old ones
>>> and remove them in the next breaking release.
>>>
>>> Re: WriteFiles logic, I'm not sure about generalizing it, but what about
>>> s
teChannel/ReadableByteChannel, which is the level of
>> granularity we need) but the Writers, at least, seem to be mostly
>> private-access. Do you foresee them being made public at any point?
>>
>> - Claire
>>
>> On Mon, Jul 15, 2019 at 9:31 AM Robert Bradshaw
oducer pipelines and
>>>> many downstream consumer pipelines. It's not intended to replace
>>>> shuffle/join within a single pipeline. On the producer side, by
>>>> pre-grouping/sorting data and writing to bucket/shard output files, the
>>>> consum
On Mon, Jul 15, 2019 at 5:42 AM Chamikara Jayalath wrote:
>
> On Sat, Jul 13, 2019 at 7:41 PM Chad Dombrova wrote:
>>
>> Hi Chamikara,
Why not make this part of the pipeline options? Does it really need to
vary from transform to transform?
>>>
>>> It's possible for the same
ation on cloudpickle vs dill in Beam, I'll bring it to
> the mailing list.
>
> On Wed, May 15, 2019 at 5:25 AM Robert Bradshaw wrote:
>>
>> (2) seems reasonable.
>>
>> On Tue, May 14, 2019 at 3:15 AM Udi Meiri wrote:
>> >
>> > It seems like pickling
On Wed, Jul 10, 2019 at 5:06 AM Kenneth Knowles wrote:
>
> My opinion: what is important is that we have a policy for what goes into the
> master commit history. This is very simple IMO: each commit should clearly do
> something that it states, and a commit should do just one thing.
Exactly
On Thu, Jun 27, 2019 at 1:52 AM Rui Wang wrote:
>>
>>
>> AFAIK all streaming runners today practically do provide these panes in
>> order;
>
> Does it refer to "the stage immediately after GBK itself processes fired
> panes in order" in streaming runners? Could you share more information?
>
>