Hi,
We have a few simple Dataflow streaming jobs running. The requirement is to build an HA/DR solution.
a) Is it a good idea to spin up multiple Dataflow jobs in different regions listening to the same 'shared' Pub/Sub subscription?
b) If not, can you please share some best practices?
Thanks
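For context: a Pub/Sub subscription delivers each message to only one subscriber, so two jobs on a shared subscription would split the traffic rather than mirror it. A common pattern for redundancy is one subscription per regional job on the same topic. A minimal sketch (project and subscription names are hypothetical):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RegionalReader {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        // Each regional job reads its own subscription on the shared topic,
        // so every job sees the full stream instead of splitting it.
        p.apply("ReadRegional", PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/jobs-us-central1"));
        p.run();
      }
    }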
In my Dataflow job, I read messages from Pub/Sub.
In the job GUI, I see the data watermark's value shown as => Max watermark
a) What is the meaning of 'Max watermark'?
b) Does it impact in any way how GroupBy is evaluated?
c) Is there a way to explicitly set the value of the watermark to 'Max watermark'?
Thanks
Ani
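Regarding (b): GroupByKey on an unbounded source emits a window's groups once the watermark passes the end of that window, so the watermark directly controls when grouped results appear. A hedged sketch, assuming `keyed` is a PCollection<KV<String, String>>:

    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // The data watermark is the runner's estimate of event-time progress.
    // This GroupByKey emits each one-minute window's groups only after the
    // watermark passes the end of that window.
    PCollection<KV<String, Iterable<String>>> grouped =
        keyed
            .apply(Window.<KV<String, String>>into(
                FixedWindows.of(Duration.standardMinutes(1))))
            .apply(GroupByKey.create());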
Spark has two levels of processing:
a) across different workers;
b) within the same executor, where multiple cores can work on different partitions.
I know that in Apache Beam with Dataflow as the runner, partitioning is abstracted. But does Dataflow use multiple cores to process different partitions at the same time?
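Dataflow parallelizes both across workers and across threads/cores within each worker; the degree of parallelism is controlled through pipeline options rather than explicit partitioning. A hedged sketch of the relevant knobs (the values are illustrative):

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setMaxNumWorkers(10);                  // parallelism across workers
    options.setWorkerMachineType("n1-standard-4"); // cores available per worker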
On 2018/09/13 22:30:07, Lukasz Cwik wrote:
> You can even change windowing strategies between group bys with Window.into.
>
> On Thu, Sep 13, 2018 at 3:29 PM Lukasz Cwik wrote:
>
> > Multiple group bys are supported.
> >
> > On Thu, Sep 13, 2018 at 2:36 PM asharma...@gmail.com
> > wrote:
>
Hi,
From the documentation, GroupBy is applied on a key-and-window basis.
If my source is Pub/Sub (unbounded), does Beam support applying multiple GroupBy transformations, with all of them executing in a single window? Or is only one GroupBy operation supported for unbounded sources?
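To illustrate the answer above (multiple group-bys, with Window.into changing the strategy between them), a minimal sketch assuming `keyed` is a PCollection<KV<String, String>>:

    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // First GroupByKey in one-minute windows...
    PCollection<KV<String, Iterable<String>>> perMinute =
        keyed
            .apply(Window.<KV<String, String>>into(
                FixedWindows.of(Duration.standardMinutes(1))))
            .apply(GroupByKey.create());

    // ...then re-window and group again in five-minute windows.
    PCollection<KV<String, Iterable<Iterable<String>>>> perFiveMinutes =
        perMinute
            .apply(Window.<KV<String, Iterable<String>>>into(
                FixedWindows.of(Duration.standardMinutes(5))))
            .apply(GroupByKey.create());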
On 2018/09/06 05:31:45, Jean-Baptiste Onofré wrote:
> Hi,
>
> AFAIU you are using the Dataflow runner, right?
>
> Regards
> JB
>
> On 06/09/2018 00:03, asharma...@gmail.com wrote:
> > Hi
> >
> > I am doing some processing and creating a local list inside a ParDo function.
> > Total data size is
Hi,
I am doing some processing and creating a local list inside a ParDo function. The total data size is 2 GB, which is processed inside a single instance. I am running this code on a highmem-64 machine, and it gives this error:
Shutting down JVM after 8 consecutive periods of measured GC thrashing. Memory...
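One hedged workaround, assuming the local list is only there to collect transformed records: emit each record as soon as it is produced instead of buffering gigabytes inside the DoFn, and let the runner manage memory:

    import org.apache.beam.sdk.transforms.DoFn;

    // Emits each transformed record immediately rather than accumulating
    // ~2 GB in a local List (the transformation shown is a placeholder).
    class EmitEagerlyFn extends DoFn<String, String> {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(c.element().toUpperCase());
      }
    }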
Excerpt from the documentation on autoscaling in streaming mode:
"Currently, PubsubIO is the only source that supports autoscaling on streaming pipelines. All SDK-provided sinks are supported. In this Beta release, autoscaling works smoothest when reading from Cloud Pub/Sub subscriptions tied to topics published with..."
On 2018/08/29 00:37:30, Lukasz Cwik wrote:
> It seems like you specified gs:/ and not gs://
>
> Typo?
>
> On Mon, Aug 27, 2018 at 2:02 PM Sameer Abhyankar
> wrote:
>
> > See this thread to see if it is related to the way the executable jar is
> > being created:
> >
> > https://lists.apache
Hi,
I am creating a Dataflow job from a configuration file, and I have hard-coded the GCS staging location in it. I compile an executable jar for my pipeline, copy the jar to a Cloud Shell environment, and execute it. But my hard-coded staging location is not picked up...
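Note the reply above about gs:/ vs gs://. One hedged way to make the staging location independent of how the jar is launched is to set it programmatically (the bucket name is hypothetical):

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    // These setters run after fromArgs, so they override any
    // --stagingLocation flag passed on the command line.
    options.setStagingLocation("gs://my-bucket/staging");
    options.setTempLocation("gs://my-bucket/temp");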
On 2018/08/24 13:58:22, asharma...@gmail.com wrote:
> Hi
>
> Is there any example available about how to invoke Beam pipeline (in java)
> from Cloud function.
>
> Thanks
> Aniruddh
>
To add more details: the request is not to invoke a Dataflow template, but the jar itself.
Hi,
Is there any example available of how to invoke a Beam pipeline (in Java) from a Cloud Function?
Thanks
Aniruddh
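Not a Cloud Functions example, but for reference: any JVM process that can reach the Dataflow API can submit the job by running the pipeline's main method, since run() with the DataflowRunner submits the job and returns. A minimal sketch:

    import org.apache.beam.runners.dataflow.DataflowRunner;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class LaunchPipeline {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        Pipeline p = Pipeline.create(options);
        // ... build transforms here ...
        p.run(); // submits the job to Dataflow and returns
      }
    }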
On 2018/08/21 16:20:13, Lukasz Cwik wrote:
> I would agree with Eugene. A simple application that does this is probably
> what you're looking for.
>
> There are ways to make this work with parallel processing systems, but it's
> quite a hassle and only worthwhile if your computation is very expensive.
Hi,
I have to process a big file and call several ParDos to do some transformations. The records in the file don't have any unique key.
Let's say the file 'testfile' has 1 million records.
After processing, I want to generate only one output file, the same as my input 'testfile', and I also have a requirement...
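For the single-output-file part, a hedged sketch assuming `records` is a PCollection<String>: withoutSharding() forces exactly one shard, at the cost of serializing the final write.

    import org.apache.beam.sdk.io.TextIO;

    // Produces exactly one output file instead of one file per shard.
    // This limits write parallelism, so it can be slow for large outputs.
    records.apply(TextIO.write()
        .to("gs://my-bucket/output/testfile")
        .withoutSharding());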
What is the best suggested way to read a file encrypted with a customer key that is itself wrapped in KMS?
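A hedged sketch of the usual envelope-decryption pattern, assuming the file was AES-GCM encrypted with a data key that was itself wrapped by Cloud KMS; the key resource name, IV handling, and file layout are all assumptions:

    import com.google.cloud.kms.v1.KeyManagementServiceClient;
    import com.google.protobuf.ByteString;
    import javax.crypto.Cipher;
    import javax.crypto.spec.GCMParameterSpec;
    import javax.crypto.spec.SecretKeySpec;

    public class EnvelopeDecrypt {
      static byte[] decryptFile(byte[] wrappedKey, byte[] iv, byte[] encrypted)
          throws Exception {
        try (KeyManagementServiceClient kms = KeyManagementServiceClient.create()) {
          // 1. Unwrap the data key with Cloud KMS (key name is hypothetical).
          byte[] dataKey = kms.decrypt(
                  "projects/my-project/locations/global/keyRings/my-ring/cryptoKeys/my-key",
                  ByteString.copyFrom(wrappedKey))
              .getPlaintext().toByteArray();
          // 2. Decrypt the file contents with the unwrapped AES key.
          Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
          cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(dataKey, "AES"),
              new GCMParameterSpec(128, iv));
          return cipher.doFinal(encrypted);
        }
      }
    }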