That's great Reza, thanks! I'm still getting to grips with Beam and
Dataflow so apologies for all the questions. I have a few more if that's ok:
1. When the article says "the schema would be mutated", does this mean the
BigQuery schema?
2. Also, when the known good BigQuery schema is retrieved, an
Etienne, I’ve just discovered that the code I used for my tests overrides the
command-line arguments, and while I thought I was testing with the SparkRunner
and FlinkRunner, in fact every test used DirectRunner, which explains why I was
seeing the committed values. So there’s no need for a tick
Hi Joe,
That part of the blog should have been written a bit more cleanly... I blame the
writer ;-) So while that solution worked, it was inefficient; this is
discussed in the next paragraph. But essentially, checking the validity of
the schema every time is not efficient, especially as they are normally
Hi again, Etienne! You didn’t mention whether Spark is reporting committed
values now, but you also didn’t mention opening a ticket concerning the Spark
output. Am I right in inferring that Spark does in fact report committed
values?
Thanks!
-Phil
On 2018/11/30 13:57:43, Etienne Chauchot wr
Hi, Etienne! Thanks for the response. Yes, I ran the test with both the Flink
and Spark runners, and both showed committed and attempted values.
I didn’t actually use MetricsPusher for these tests. I have questions about
MetricsPusher, but I’ll put those in another post.
-Phil
On 2018/11/
Thanks Reza, that's really helpful!
I have a few questions:
"He used a GroupByKey function on the JSON type and then a manual check on
the JSON schema against the known good BigQuery schema. If there was a
difference, the schema would mutate and the updates would be pushed
through."
If the diffe
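The check described in the quoted passage — comparing an incoming JSON record against the known good BigQuery schema and mutating the schema when new fields appear — can be pictured with a small pure-Python sketch. All names here (`find_new_fields`, `known_schema`, the field names) are hypothetical illustrations, not the actual code from the blog's pipeline:

```python
# Hypothetical sketch of the manual schema check: compare the keys of an
# incoming JSON record against the known good BigQuery schema and collect
# any new fields that would require a schema mutation.

def find_new_fields(record: dict, known_schema: set) -> set:
    """Return fields present in the record but absent from the known good
    schema; a non-empty result means the schema would need to be mutated."""
    return set(record) - known_schema

known_schema = {"id", "name", "timestamp"}
record = {"id": 1, "name": "x", "timestamp": "2018-11-29", "new_col": "y"}

new_fields = find_new_fields(record, known_schema)
if new_fields:
    # In a real pipeline, this is where the BigQuery schema update
    # (a widened schema pushed via the tables API) would happen.
    print(sorted(new_fields))  # → ['new_col']
```

As the next message notes, doing this diff for every element is the inefficient part; the schema rarely changes, so the check is mostly wasted work.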
Hi Phil,
Thanks for using MetricsPusher and Beam in general!
- MetricsHttpSink works that way: it filters out committed metrics from the
JSON output when committed metrics are not
supported. I checked, and the Flink runner still does not support committed
metrics. So there should be no committed metri
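The filtering behaviour described above can be pictured with a small pure-Python sketch. The JSON shape and the function name `strip_committed` are hypothetical simplifications, not Beam's actual MetricsHttpSink output format:

```python
# Hypothetical sketch: when a runner does not support committed metrics,
# drop the 'committed' value from each metric before emitting the JSON,
# leaving only the attempted values.

def strip_committed(metrics: list, committed_supported: bool) -> list:
    if committed_supported:
        return metrics
    return [{k: v for k, v in m.items() if k != "committed"}
            for m in metrics]

metrics = [{"name": "elements", "attempted": 42, "committed": 42}]

# On a runner that does not support committed metrics (e.g. Flink, per the
# message above), only the attempted values survive in the output.
print(strip_committed(metrics, committed_supported=False))
# → [{'name': 'elements', 'attempted': 42}]
```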
I'd also like to put the perspective out there that composite
transforms are like subroutines; their inner complexity should not
concern the end user and is probably the wrong thing to optimize for
(assuming there are no other costs, e.g. performance, and of course
we shouldn't have unnecessary co
I just wanted to thank you again. I split my project into the Beam core
stuff and my plugin. This got rid of a number of circular dependency
issues and lib conflicts.
I also gave the Dataflow PipelineOptions the list of files to stage.
That has made things work, and much more quickly than I anticipated
Hi Akshay,
I think you're bringing up a very important point. Simplicity with
minimal complexity is something that we strive for. In the case of the
Write transform, the complexity was mainly added due to historical
reasons which Kenneth mentioned.
It is worth noting that some runners don't even
Hi Joe,
You may find some of the info in this blog of interest; it's based on
streaming pipelines but contains useful ideas.
https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix
Cheers
Reza
On Thu, 29 Nov 2018 at 06:53, Joe Cullen wrote