Latency in advancing Spark Watermarks

2020-03-19 Thread rahul patwari
Hi, *Usage Info*: We are using Beam 2.16.0 and Spark 2.4.2, running Spark on Kubernetes. We are using the Spark Streaming (legacy) runner with the Beam Java SDK. The pipeline has been run with the default configuration, i.e. the defaults for SparkPipelineOptions. *Issue*: When a Beam Pipeline

Re: detach subset of columns from PCollection ;do some operations; reattach transformed columns

2020-03-19 Thread Luke Cwik
What is your data source? Can you add a row identifier or use some combination of columns as a unique key? On Thu, Mar 19, 2020 at 7:20 AM Aniruddh Sharma wrote: > Hi > > Need some advice on how to implement the following use case. > > I read a dataset which is 1+ TB in size; this has 1000+ columns.
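The split-and-rejoin approach implied by this reply depends on every row carrying a stable key. A minimal sketch in plain Java (not Beam-specific; the column names and the key-building helper are illustrative assumptions) of deriving a composite key from a combination of columns, which could then serve as the join key for a CoGroupByKey-style reattachment in Beam:

```java
import java.util.*;

// Hypothetical sketch: derive a stable row key from a combination of
// columns so that detached PII columns can later be rejoined with the
// rest of the row. Column names are illustrative, not from the thread.
public class RowKeyExample {
    // Build a composite key from the values of the chosen key columns,
    // in a fixed column order, separated by "|".
    static String compositeKey(Map<String, String> row, List<String> keyColumns) {
        StringJoiner joiner = new StringJoiner("|");
        for (String col : keyColumns) {
            joiner.add(row.getOrDefault(col, ""));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>();
        row.put("account_id", "42");
        row.put("region", "eu");
        row.put("ssn", "secret");
        String key = compositeKey(row, Arrays.asList("account_id", "region"));
        System.out.println(key); // prints 42|eu
    }
}
```

This only works as a join key if the chosen column combination is actually unique per row, which is exactly what the reply is asking the original poster to confirm.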

Re: Splittable DoFn and Dataflow, "conflicting bucketing functions"

2020-03-19 Thread Luke Cwik
That doesn't sound like it should be an issue and sounds like a bug in Dataflow. If you're willing to share a minimal pipeline that reproduces this error, I can get an issue opened up internally and assigned. On Thu, Mar 19, 2020 at 2:09 PM Kjetil Halvorsen < kjetil.halvor...@cognite.com> wrote: >

Re: Splittable DoFn and Dataflow, "conflicting bucketing functions"

2020-03-19 Thread Kjetil Halvorsen
Thank you for the tip about "--dataflowJobFile". I wasn't aware of it, and it was a great help in interpreting the error message from Dataflow. I found the error/bug in an upstream DoFn (executed before the SDF) with a side input. Both the main input to the DoFn and the side input were bounded
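For readers following along: the flag is `--dataflowJobFile` (a DataflowPipelineOptions option), which writes the translated Dataflow job specification to a local JSON file so you can inspect the step the error message refers to. A hedged sketch of how it might be passed; the main class, project, and region values are placeholders, not from this thread:

```shell
# Dump the translated Dataflow job spec for inspection.
# com.example.MyPipeline and the project/region values are illustrative.
mvn compile exec:java \
  -Dexec.mainClass=com.example.MyPipeline \
  -Dexec.args="--runner=DataflowRunner \
               --project=my-project \
               --region=europe-west1 \
               --dataflowJobFile=/tmp/dataflow-job.json"

# Then examine /tmp/dataflow-job.json to match the failing step
# named in the Dataflow error message.
```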

detach subset of columns from PCollection ;do some operations; reattach transformed columns

2020-03-19 Thread Aniruddh Sharma
Hi Need some advice on how to implement the following use case. I read a dataset which is 1+ TB in size; this has 1000+ columns. Only 3 of these 1000+ columns contain PII, and I need to call the Google DLP API for them. I want to select only those 3 columns out of the 1000+ and
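The use case described here (detach the PII columns, transform them, reattach) can be sketched in plain Java. This is only a conceptual, in-memory stand-in: the column names are made up, `mask` stands in for the actual Google DLP call, and in a real Beam pipeline the detach/reattach would happen across PCollections keyed by a row identifier rather than within one map.

```java
import java.util.*;

public class PiiColumns {
    // Illustrative PII column names; the real dataset has 3 of 1000+.
    static final Set<String> PII = Set.of("name", "email", "ssn");

    // Keep only the given columns from a row.
    static Map<String, String> select(Map<String, String> row, Set<String> cols) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : row.entrySet()) {
            if (cols.contains(e.getKey())) out.put(e.getKey(), e.getValue());
        }
        return out;
    }

    // Stand-in for the DLP de-identification call.
    static String mask(String value) {
        return "*".repeat(value.length());
    }

    // Detach PII columns, transform them, and reattach to the rest of the row.
    static Map<String, String> deidentifyRow(Map<String, String> row) {
        Map<String, String> pii = select(row, PII);
        Map<String, String> rest = new HashMap<>(row);
        rest.keySet().removeAll(PII);
        pii.replaceAll((col, value) -> mask(value));
        rest.putAll(pii); // reattach the transformed columns
        return rest;
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>(Map.of("id", "7", "name", "Ada"));
        System.out.println(deidentifyRow(row).get("name")); // prints ***
    }
}
```

At 1+ TB the point of the detach step is to send only the 3 PII columns to the external API, which is why the reply in this thread asks for a unique key to rejoin on afterward.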