Re: Inferring Csv Schemas

2018-11-30 Thread Joe Cullen
That's great Reza, thanks! I'm still getting to grips with Beam and Dataflow so apologies for all the questions. I have a few more if that's ok: 1. When the article says "the schema would be mutated", does this mean the BigQuery schema? 2. Also, when the known good BigQuery schema is retrieved, an

Re: Beam Metrics questions

2018-11-30 Thread Phil Franklin
Etienne, I’ve just discovered that the code I used for my tests overrides the command-line arguments, and while I thought I was testing with the SparkRunner and FlinkRunner, in fact every test used DirectRunner, which explains why I was seeing the committed values. So there’s no need for a tick

Re: Inferring Csv Schemas

2018-11-30 Thread Reza Rokni
Hi Joe, That part of the blog should have been written a bit cleaner.. I blame the writer ;-) So while that solution worked it was inefficient, this is discussed in the next paragraph.. But essentially checking the validity of the schema every time is not efficient, especially as they are normally

Re: Beam Metrics questions

2018-11-30 Thread Phil Franklin
Hi again, Etienne! You didn’t mention whether spark is reporting committed values now, but you also didn’t mention opening a ticket concerning the spark output. Am I right in inferring that spark does in fact report committed values? Thanks! -Phil On 2018/11/30 13:57:43, Etienne Chauchot wr

Re: Beam Metrics questions

2018-11-30 Thread Phil Franklin
Hi, Etienne! Thanks for the response. Yes, I ran the test with both the flink and spark runners, and both showed committed and attempted values. I didn’t actually use MetricsPusher for these tests. I have questions about MetricsPusher, but I’ll put those in another post. -Phil On 2018/11/

Re: Inferring Csv Schemas

2018-11-30 Thread Joe Cullen
Thanks Reza, that's really helpful! I have a few questions: "He used a GroupByKey function on the JSON type and then a manual check on the JSON schema against the known good BigQuery schema. If there was a difference, the schema would mutate and the updates would be pushed through." If the diffe

Re: Beam Metrics questions

2018-11-30 Thread Etienne Chauchot
Hi Phil, Thanks for using MetricsPusher and Beam in general ! - MetricsHttpSink works that way: it filters out committed metrics from the json output when committed metrics are not supported. I checked, Flink runner still does not support committed metrics. So there should be no committed metri

Re: Why beam pipeline ends up creating gigantic DAG for a simple word count program !!

2018-11-30 Thread Robert Bradshaw
I'd also like to put the perspective out there that composite transforms are like subroutines; their inner complexity should not concern the end user and probably is the wrong thing to optimize for (assuming there are not other costs, e.g. performance, and of course we shouldn't have unnecessary co

Re:

2018-11-30 Thread Matt Casters
I just wanted to thank you again. I split up my project in a beam core stuff and my plugin. This got rid of a number of circular dependency issues and lib conflicts. I also gave the Dataflow PipelineOptions the list of files to stage. That has made things work and much quicker than I anticipated

Re: Why beam pipeline ends up creating gigantic DAG for a simple word count program !!

2018-11-30 Thread Maximilian Michels
Hi Akshay, I think you're bringing up a very important point. Simplicity with minimal complexity is something that we strive for. In the case of the Write transform, the complexity was mainly added due to historical reasons which Kenneth mentioned. It is to note that some Runners don't even

Re: Inferring Csv Schemas

2018-11-30 Thread Reza Ardeshir Rokni
Hi Joe, You may find some of the info in this blog of interest, its based on streaming pipelines but useful ideas. https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix Cheers Reza On Thu, 29 Nov 2018 at 06:53, Joe Cullen wrote