Hi,
*Usage Info*:
We are using Beam 2.16.0 and Spark 2.4.2.
We are running Spark on Kubernetes.
We are using the Spark Streaming (legacy) runner with the Beam Java SDK.
The pipeline has been run with the default configuration, i.e. the default
settings for SparkPipelineOptions.
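For context, the options setup is roughly the following (a minimal sketch;
nothing here deviates from the SparkPipelineOptions defaults):

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DefaultSparkOptionsPipeline {
  public static void main(String[] args) {
    // All SparkPipelineOptions left at their defaults, as described above.
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    Pipeline pipeline = Pipeline.create(options);
    // ... transforms elided ...
    pipeline.run().waitUntilFinish();
  }
}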
*Issue*:
When a Beam Pipeline
What is your data source?
Can you add a row identifier or use some combination of columns as a unique
key?
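Something like the following rough, untested sketch (MyRecord, getRowId,
ExtractPiiFn, and DlpFn are all made-up placeholders): key the full records
by the row id, send only the PII columns through the DLP step, then rejoin
on the key with CoGroupByKey.

import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

// Key every wide record by its unique row id (placeholder accessor).
PCollection<KV<String, MyRecord>> keyedFull =
    records.apply("KeyByRowId",
        MapElements.via(new SimpleFunction<MyRecord, KV<String, MyRecord>>() {
          @Override
          public KV<String, MyRecord> apply(MyRecord r) {
            return KV.of(r.getRowId(), r);
          }
        }));

// Only the 3 PII columns go through the DLP call, keyed the same way.
PCollection<KV<String, String>> keyedDlpResults =
    keyedFull
        .apply("ExtractPiiColumns", ParDo.of(new ExtractPiiFn()))
        .apply("CallDlp", ParDo.of(new DlpFn()));

// Rejoin the DLP output with the untouched wide records on the row id.
TupleTag<MyRecord> fullTag = new TupleTag<MyRecord>() {};
TupleTag<String> dlpTag = new TupleTag<String>() {};
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(fullTag, keyedFull)
        .and(dlpTag, keyedDlpResults)
        .apply(CoGroupByKey.create());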
On Thu, Mar 19, 2020 at 7:20 AM Aniruddh Sharma wrote:
> Hi
>
> Need some advice on how to implement the following use case.
>
> I read a dataset which is 1+ TB in size; it has 1000+ columns.
That doesn't sound like it should be an issue; it sounds like a bug in
Dataflow.
If you're willing to share a minimal pipeline that reproduces this error, I
can get an issue opened up internally and assigned.
On Thu, Mar 19, 2020 at 2:09 PM Kjetil Halvorsen <kjetil.halvor...@cognite.com> wrote:
>
Thank you for the tip about the "--dataflowJobFile" option. I wasn't aware of it,
and it was of great help to interpret the error message from Dataflow.
I found the error/bug in an upstream DoFn (executed before the SDF) with a
side input. Both the main input to the DoFn and the side input were bounded.
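In case it helps anyone else following along: "--dataflowJobFile" can be
passed like any other pipeline option, and the runner then writes the
translated job specification to that path. A minimal sketch (the project id
and file path below are placeholders):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DumpJobSpec {
  public static void main(String[] args) {
    // Ask the Dataflow runner to write the translated job spec to a local
    // JSON file so submission errors can be matched against the job graph.
    String[] flags = {
        "--runner=DataflowRunner",
        "--project=my-project",             // placeholder
        "--dataflowJobFile=/tmp/job.json"   // example output path
    };
    PipelineOptions options = PipelineOptionsFactory.fromArgs(flags).create();
    Pipeline pipeline = Pipeline.create(options);
    // ... build and run as usual ...
  }
}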
Hi
Need some advice on how to implement the following use case.
I read a dataset which is 1+ TB in size; it has 1000+ columns.
Only 3 of these 1000+ columns contain PII information, and I need to call
the Google DLP API.
I want to select only 3 columns out of these 1000+ columns and
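A rough sketch of what the DLP step for those 3 columns could look like with
the Google Cloud DLP Java client, assuming de-identification is the desired
operation. PiiColumns, asText(), the project id, and the template name are
all placeholders, and per-bundle batching of requests is left out:

import com.google.cloud.dlp.v2.DlpServiceClient;
import com.google.privacy.dlp.v2.ContentItem;
import com.google.privacy.dlp.v2.DeidentifyContentRequest;
import com.google.privacy.dlp.v2.DeidentifyContentResponse;
import com.google.privacy.dlp.v2.ProjectName;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Sends only the 3 selected PII columns (as text) to the DLP API,
// keyed by row id so results can be rejoined with the full records.
class DlpFn extends DoFn<KV<String, PiiColumns>, KV<String, String>> {
  private transient DlpServiceClient dlp;

  @Setup
  public void setup() throws Exception {
    dlp = DlpServiceClient.create();  // one client per DoFn instance
  }

  @ProcessElement
  public void process(ProcessContext c) {
    PiiColumns pii = c.element().getValue();
    DeidentifyContentRequest request = DeidentifyContentRequest.newBuilder()
        .setParent(ProjectName.of("my-project").toString())  // placeholder
        .setDeidentifyTemplateName(
            "projects/my-project/deidentifyTemplates/my-template")  // placeholder
        .setItem(ContentItem.newBuilder().setValue(pii.asText()).build())
        .build();
    DeidentifyContentResponse response = dlp.deidentifyContent(request);
    c.output(KV.of(c.element().getKey(), response.getItem().getValue()));
  }

  @Teardown
  public void teardown() throws Exception {
    if (dlp != null) {
      dlp.close();
    }
  }
}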