Re: Keeping keys in a state for a very very long time (keys expiry unknown)

2020-04-06 Thread Mohil Khare
Sure thing. I would love to contribute.

Thanks
Mohil



On Mon, Apr 6, 2020 at 6:17 PM Reza Ardeshir Rokni 
wrote:

> Great! BTW, if you get the time and want to contribute back to Beam, there
> is a nice section for recording cool patterns:
>
> https://beam.apache.org/documentation/patterns/overview/
>
> This would make a great one!
>
> On Tue, 7 Apr 2020 at 09:12, Mohil Khare  wrote:
>
>> No ... that's a valid answer. Since I want a long window size per key,
>> and since we can't use state with session windows, I am using a sliding
>> window of, let's say, 72 hours that starts every hour.
>>
>> Thanks a lot Reza for your input.
>>
>> Regards
>> Mohil
>>
>> On Mon, Apr 6, 2020 at 6:09 PM Reza Ardeshir Rokni 
>> wrote:
>>
>>> It depends on the use case. Global state comes with the technical debt
>>> of having to do your own GC, but gives you more control. You could
>>> implement the pattern above with a long FixedWindow as well, which will
>>> take care of the GC within the window bound.
>>>
>>> Sorry, it's not a yes/no answer :-)
>>>
>>> On Tue, 7 Apr 2020 at 09:03, Mohil Khare  wrote:
>>>
 Thanks a lot Reza for your quick response. Yeah saving the data in an
 external system after timer expiry makes sense.
 So do you suggest using a global window for maintaining state ?

 Thanks and regards
 Mohil

 On Mon, Apr 6, 2020 at 5:37 PM Reza Ardeshir Rokni 
 wrote:

> Are you able to make use of the following pattern?
>
> Store the StateA metadata until there has been no activity for Duration X
> (you can use a Timer to check this), then expire the value but store it in
> an external system. If you get a record that needs this value after expiry,
> call out to the external system and store the value again under the
> StateA-metadata key.
>
> Cheers
>
> On Tue, 7 Apr 2020 at 08:03, Mohil Khare  wrote:
>
>> Hello all,
>> We are attempting to implement a use case where Beam (Java SDK) reads
>> two kinds of records from a data stream such as Kafka:
>>
>> 1. Records of type A containing a key and corresponding metadata.
>> 2. Records of type B containing the same key, but no metadata. Beam
>> then needs to fill in the metadata for records of type B by doing a lookup
>> for metadata using keys received in records of type A.
>>
>> The idea is to save the metadata, or rather the state, for keys received
>> in records of type A and then do a lookup when records of type B are
>> received. I have implemented this using the "@State" construct of Beam.
>> However, my problem is that we don't know when keys should expire. I don't
>> think keeping a global window would be a good idea, as there could be many
>> keys (maybe millions over a period of time) to be saved in state.
>>
>> What is the best way to achieve this? I was reading about RedisIO, but
>> found that it is still in the experimental stage. Can someone please
>> recommend a solution to achieve this?
>>
>> Thanks and regards
>> Mohil
>>
>>
>>
>>
>>
>>


Re: Keeping keys in a state for a very very long time (keys expiry unknown)

2020-04-06 Thread Reza Ardeshir Rokni
Great! BTW, if you get the time and want to contribute back to Beam, there
is a nice section for recording cool patterns:

https://beam.apache.org/documentation/patterns/overview/

This would make a great one!

On Tue, 7 Apr 2020 at 09:12, Mohil Khare  wrote:

> No ... that's a valid answer. Since I want a long window size per key,
> and since we can't use state with session windows, I am using a sliding
> window of, let's say, 72 hours that starts every hour.
>
> Thanks a lot Reza for your input.
>
> Regards
> Mohil
>
> On Mon, Apr 6, 2020 at 6:09 PM Reza Ardeshir Rokni 
> wrote:
>
>> It depends on the use case. Global state comes with the technical debt of
>> having to do your own GC, but gives you more control. You could implement
>> the pattern above with a long FixedWindow as well, which will take care of
>> the GC within the window bound.
>>
>> Sorry, it's not a yes/no answer :-)
>>
>> On Tue, 7 Apr 2020 at 09:03, Mohil Khare  wrote:
>>
>>> Thanks a lot Reza for your quick response. Yeah saving the data in an
>>> external system after timer expiry makes sense.
>>> So do you suggest using a global window for maintaining state ?
>>>
>>> Thanks and regards
>>> Mohil
>>>
>>> On Mon, Apr 6, 2020 at 5:37 PM Reza Ardeshir Rokni 
>>> wrote:
>>>
 Are you able to make use of the following pattern?

 Store the StateA metadata until there has been no activity for Duration X
 (you can use a Timer to check this), then expire the value but store it in
 an external system. If you get a record that needs this value after expiry,
 call out to the external system and store the value again under the
 StateA-metadata key.

 Cheers

 On Tue, 7 Apr 2020 at 08:03, Mohil Khare  wrote:

> Hello all,
> We are attempting to implement a use case where Beam (Java SDK) reads
> two kinds of records from a data stream such as Kafka:
>
> 1. Records of type A containing a key and corresponding metadata.
> 2. Records of type B containing the same key, but no metadata. Beam
> then needs to fill in the metadata for records of type B by doing a lookup
> for metadata using keys received in records of type A.
>
> The idea is to save the metadata, or rather the state, for keys received
> in records of type A and then do a lookup when records of type B are
> received. I have implemented this using the "@State" construct of Beam.
> However, my problem is that we don't know when keys should expire. I don't
> think keeping a global window would be a good idea, as there could be many
> keys (maybe millions over a period of time) to be saved in state.
>
> What is the best way to achieve this? I was reading about RedisIO, but
> found that it is still in the experimental stage. Can someone please
> recommend a solution to achieve this?
>
> Thanks and regards
> Mohil
>
>
>
>
>
>


Re: Keeping keys in a state for a very very long time (keys expiry unknown)

2020-04-06 Thread Mohil Khare
No ... that's a valid answer. Since I want a long window size per key, and
since we can't use state with session windows, I am using a sliding window
of, let's say, 72 hours that starts every hour.
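For reference, that windowing looks roughly like this (an editorial sketch;
"keyed" is an assumed PCollection<KV<String, String>>, and the name of the
transform is just a placeholder):

import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// 72-hour sliding windows, with a new window starting every hour.
PCollection<KV<String, String>> windowed =
    keyed.apply(
        "SlidingWindow72h",
        Window.<KV<String, String>>into(
            SlidingWindows.of(Duration.standardHours(72))
                .every(Duration.standardHours(1))));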

Thanks a lot Reza for your input.

Regards
Mohil

On Mon, Apr 6, 2020 at 6:09 PM Reza Ardeshir Rokni 
wrote:

> It depends on the use case. Global state comes with the technical debt of
> having to do your own GC, but gives you more control. You could implement
> the pattern above with a long FixedWindow as well, which will take care of
> the GC within the window bound.
>
> Sorry, it's not a yes/no answer :-)
>
> On Tue, 7 Apr 2020 at 09:03, Mohil Khare  wrote:
>
>> Thanks a lot Reza for your quick response. Yeah saving the data in an
>> external system after timer expiry makes sense.
>> So do you suggest using a global window for maintaining state ?
>>
>> Thanks and regards
>> Mohil
>>
>> On Mon, Apr 6, 2020 at 5:37 PM Reza Ardeshir Rokni 
>> wrote:
>>
>>> Are you able to make use of the following pattern?
>>>
>>> Store the StateA metadata until there has been no activity for Duration X
>>> (you can use a Timer to check this), then expire the value but store it in
>>> an external system. If you get a record that needs this value after expiry,
>>> call out to the external system and store the value again under the
>>> StateA-metadata key.
>>>
>>> Cheers
>>>
>>> On Tue, 7 Apr 2020 at 08:03, Mohil Khare  wrote:
>>>
 Hello all,
 We are attempting to implement a use case where Beam (Java SDK) reads
 two kinds of records from a data stream such as Kafka:

 1. Records of type A containing a key and corresponding metadata.
 2. Records of type B containing the same key, but no metadata. Beam
 then needs to fill in the metadata for records of type B by doing a lookup
 for metadata using keys received in records of type A.

 The idea is to save the metadata, or rather the state, for keys received
 in records of type A and then do a lookup when records of type B are
 received. I have implemented this using the "@State" construct of Beam.
 However, my problem is that we don't know when keys should expire. I don't
 think keeping a global window would be a good idea, as there could be many
 keys (maybe millions over a period of time) to be saved in state.

 What is the best way to achieve this? I was reading about RedisIO, but
 found that it is still in the experimental stage. Can someone please
 recommend a solution to achieve this?

 Thanks and regards
 Mohil








Re: Keeping keys in a state for a very very long time (keys expiry unknown)

2020-04-06 Thread Reza Ardeshir Rokni
It depends on the use case. Global state comes with the technical debt of
having to do your own GC, but gives you more control. You could implement
the pattern above with a long FixedWindow as well, which will take care of
the GC within the window bound.
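For example, a long fixed window along these lines lets the runner clean up
the per-key state when the window expires (editorial sketch; "keyed" is an
assumed PCollection<KV<String, String>>):

import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// State used inside this window is garbage-collected when the 72-hour window expires.
PCollection<KV<String, String>> windowed =
    keyed.apply(
        "FixedWindow72h",
        Window.<KV<String, String>>into(FixedWindows.of(Duration.standardHours(72))));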

Sorry, it's not a yes/no answer :-)

On Tue, 7 Apr 2020 at 09:03, Mohil Khare  wrote:

> Thanks a lot Reza for your quick response. Yeah saving the data in an
> external system after timer expiry makes sense.
> So do you suggest using a global window for maintaining state ?
>
> Thanks and regards
> Mohil
>
> On Mon, Apr 6, 2020 at 5:37 PM Reza Ardeshir Rokni 
> wrote:
>
>> Are you able to make use of the following pattern?
>>
>> Store the StateA metadata until there has been no activity for Duration X
>> (you can use a Timer to check this), then expire the value but store it in
>> an external system. If you get a record that needs this value after expiry,
>> call out to the external system and store the value again under the
>> StateA-metadata key.
>>
>> Cheers
>>
>> On Tue, 7 Apr 2020 at 08:03, Mohil Khare  wrote:
>>
>>> Hello all,
>>> We are attempting to implement a use case where Beam (Java SDK) reads two
>>> kinds of records from a data stream such as Kafka:
>>>
>>> 1. Records of type A containing a key and corresponding metadata.
>>> 2. Records of type B containing the same key, but no metadata. Beam then
>>> needs to fill in the metadata for records of type B by doing a lookup for
>>> metadata using keys received in records of type A.
>>>
>>> The idea is to save the metadata, or rather the state, for keys received in
>>> records of type A and then do a lookup when records of type B are received.
>>> I have implemented this using the "@State" construct of Beam. However, my
>>> problem is that we don't know when keys should expire. I don't think
>>> keeping a global window would be a good idea, as there could be many keys
>>> (maybe millions over a period of time) to be saved in state.
>>>
>>> What is the best way to achieve this? I was reading about RedisIO, but
>>> found that it is still in the experimental stage. Can someone please
>>> recommend a solution to achieve this?
>>>
>>> Thanks and regards
>>> Mohil
>>>
>>>
>>>
>>>
>>>
>>>


Re: Keeping keys in a state for a very very long time (keys expiry unknown)

2020-04-06 Thread Mohil Khare
Thanks a lot Reza for your quick response. Yeah, saving the data in an
external system after timer expiry makes sense.
So do you suggest using a global window for maintaining state?

Thanks and regards
Mohil

On Mon, Apr 6, 2020 at 5:37 PM Reza Ardeshir Rokni 
wrote:

> Are you able to make use of the following pattern?
>
> Store the StateA metadata until there has been no activity for Duration X
> (you can use a Timer to check this), then expire the value but store it in
> an external system. If you get a record that needs this value after expiry,
> call out to the external system and store the value again under the
> StateA-metadata key.
>
> Cheers
>
> On Tue, 7 Apr 2020 at 08:03, Mohil Khare  wrote:
>
>> Hello all,
>> We are attempting to implement a use case where Beam (Java SDK) reads two
>> kinds of records from a data stream such as Kafka:
>>
>> 1. Records of type A containing a key and corresponding metadata.
>> 2. Records of type B containing the same key, but no metadata. Beam then
>> needs to fill in the metadata for records of type B by doing a lookup for
>> metadata using keys received in records of type A.
>>
>> The idea is to save the metadata, or rather the state, for keys received in
>> records of type A and then do a lookup when records of type B are received.
>> I have implemented this using the "@State" construct of Beam. However, my
>> problem is that we don't know when keys should expire. I don't think
>> keeping a global window would be a good idea, as there could be many keys
>> (maybe millions over a period of time) to be saved in state.
>>
>> What is the best way to achieve this? I was reading about RedisIO, but
>> found that it is still in the experimental stage. Can someone please
>> recommend a solution to achieve this?
>>
>> Thanks and regards
>> Mohil
>>
>>
>>
>>
>>
>>


Re: Keeping keys in a state for a very very long time (keys expiry unknown)

2020-04-06 Thread Reza Ardeshir Rokni
Are you able to make use of the following pattern?

Store the StateA metadata until there has been no activity for Duration X
(you can use a Timer to check this), then expire the value but store it in an
external system. If you get a record that needs this value after expiry, call
out to the external system and store the value again under the StateA-metadata
key.
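A minimal sketch of the expiry/re-hydration part of this pattern with a
stateful DoFn (editorial illustration; ExternalStore is a placeholder for
whatever external system you use, the 72-hour inactivity duration is
arbitrary, and the initial write of the metadata from a type-A record is
omitted):

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

// Keeps per-key metadata in state, expires it after a period of inactivity,
// and falls back to the external system when an expired key shows up again.
class MetadataCacheFn extends DoFn<KV<String, String>, KV<String, String>> {

  private static final Duration INACTIVITY = Duration.standardHours(72);

  // Placeholder for whatever external system is used (e.g. a key/value store).
  static class ExternalStore {
    static String lookup(String key) { return null; /* TODO: real read */ }
    static void save(String key, String value) { /* TODO: real write */ }
  }

  @StateId("key")
  private final StateSpec<ValueState<String>> keySpec =
      StateSpecs.value(StringUtf8Coder.of());

  @StateId("metadata")
  private final StateSpec<ValueState<String>> metadataSpec =
      StateSpecs.value(StringUtf8Coder.of());

  @TimerId("expiry")
  private final TimerSpec expirySpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(
      ProcessContext c,
      @StateId("key") ValueState<String> keyState,
      @StateId("metadata") ValueState<String> metadata,
      @TimerId("expiry") Timer expiry) {
    String key = c.element().getKey();
    keyState.write(key);
    String cached = metadata.read();
    if (cached == null) {
      // Expired (or never seen) locally: fall back to the external system.
      cached = ExternalStore.lookup(key);
      if (cached != null) {
        metadata.write(cached);
      }
    }
    // Push the expiry out again every time this key sees activity.
    expiry.offset(INACTIVITY).setRelative();
    if (cached != null) {
      c.output(KV.of(key, cached));
    }
  }

  @OnTimer("expiry")
  public void onExpiry(
      @StateId("key") ValueState<String> keyState,
      @StateId("metadata") ValueState<String> metadata) {
    // No activity for INACTIVITY: persist the value externally, then free the state.
    String key = keyState.read();
    String value = metadata.read();
    if (key != null && value != null) {
      ExternalStore.save(key, value);
    }
    metadata.clear();
    keyState.clear();
  }
}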

Cheers

On Tue, 7 Apr 2020 at 08:03, Mohil Khare  wrote:

> Hello all,
> We are attempting to implement a use case where Beam (Java SDK) reads two
> kinds of records from a data stream such as Kafka:
>
> 1. Records of type A containing a key and corresponding metadata.
> 2. Records of type B containing the same key, but no metadata. Beam then
> needs to fill in the metadata for records of type B by doing a lookup for
> metadata using keys received in records of type A.
>
> The idea is to save the metadata, or rather the state, for keys received in
> records of type A and then do a lookup when records of type B are received.
> I have implemented this using the "@State" construct of Beam. However, my
> problem is that we don't know when keys should expire. I don't think
> keeping a global window would be a good idea, as there could be many keys
> (maybe millions over a period of time) to be saved in state.
>
> What is the best way to achieve this? I was reading about RedisIO, but
> found that it is still in the experimental stage. Can someone please
> recommend a solution to achieve this?
>
> Thanks and regards
> Mohil
>
>
>
>
>
>


Re: Using Self signed root ca for https connection in elasticsearchIO

2020-04-06 Thread Mohil Khare
Any update on this? Shall I open a JIRA for this support?

Thanks and regards
Mohil

On Sun, Mar 22, 2020 at 9:36 PM Mohil Khare  wrote:

> Hi,
> This is Mohil from Prosimo, a small Bay Area based stealth-mode startup.
> We use Beam (version 2.19) with Google Dataflow in our analytics
> pipeline, with Kafka and Pub/Sub as sources and GCS, BigQuery and
> Elasticsearch as our sinks.
>
> We want to use our private self-signed root CA for TLS connections between
> our internal services, viz. Kafka, Elasticsearch, Beam, etc. We are able to
> set up a secure TLS connection between Beam and Kafka by using the
> self-signed root certificate in keystore.jks and truststore.jks and
> transferring them to the worker VMs running KafkaIO, via KafkaIO's read and
> withConsumerFactoryFn().
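For what it's worth, the KafkaIO side of that looks roughly like the sketch
below (editorial illustration, not the exact code from this thread; the
broker, topic, paths and passwords are placeholders, and the JKS files are
assumed to already be present on the workers):

import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

// A consumer factory lets us point the Kafka client at the self-signed trust material.
KafkaIO.Read<byte[], byte[]> read =
    KafkaIO.<byte[], byte[]>read()
        .withBootstrapServers("broker-1:9093")
        .withTopic("records")
        .withKeyDeserializer(ByteArrayDeserializer.class)
        .withValueDeserializer(ByteArrayDeserializer.class)
        .withConsumerFactoryFn(
            (Map<String, Object> config) -> {
              Map<String, Object> ssl = new HashMap<>(config);
              ssl.put("security.protocol", "SSL");
              ssl.put("ssl.truststore.location", "/tmp/truststore.jks");
              ssl.put("ssl.truststore.password", "changeit");
              ssl.put("ssl.keystore.location", "/tmp/keystore.jks");
              ssl.put("ssl.keystore.password", "changeit");
              Consumer<byte[], byte[]> consumer = new KafkaConsumer<>(ssl);
              return consumer;
            });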
>
> We want to do a similar thing with ElasticsearchIO, where we want to update
> the worker VM's truststore with our self-signed root certificate so that
> when ElasticsearchIO connects using HTTPS, it can connect successfully
> without an SSL handshake failure. Currently we couldn't find any way to do
> so with ElasticsearchIO. We tried various possible workarounds, such as:
>
> 1. Trying a JvmInitializer to initialize the JVM with the truststore using
> System.setProperty for javax.net.ssl.trustStore (see the sketch below).
> 2. Transferring our jar to GCP's App Engine, where we start the jar with
> -Djavax.net.ssl.trustStore and then trigger the Beam job from there.
> 3. Setting the ElasticsearchIO flag withTrustSelfSigned to true (I don't
> think it will work because, looking at the source code, it appears to
> depend on keystorePath).
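The JvmInitializer attempt in (1) looked roughly like this (editorial sketch;
the path and password are placeholders, and registration of the initializer
as a service via @AutoService is an assumption):

import com.google.auto.service.AutoService;
import org.apache.beam.sdk.harness.JvmInitializer;

// Runs when the SDK harness JVM starts, before user code, and points the
// default JVM trust store at the self-signed bundle staged on the worker.
@AutoService(JvmInitializer.class)
public class TrustStoreInitializer implements JvmInitializer {
  @Override
  public void onStartup() {
    System.setProperty("javax.net.ssl.trustStore", "/tmp/truststore.jks");
    System.setProperty("javax.net.ssl.trustStorePassword", "changeit");
  }
}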
>
> But nothing worked. When we logged in to the worker VMs, it looked like our
> trustStore never made it to the worker VM. All ElasticsearchIO connections
> failed with the following exception:
>
> sun.security.validator.ValidatorException: PKIX path building failed:
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find
> valid certification path to requested target
> sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387)
>
>
> Right now, to unblock ourselves, we have added a proxy with a Let's Encrypt
> root CA between Beam and the Elasticsearch cluster, and Beam's
> ElasticsearchIO connects successfully to the proxy using the Let's Encrypt
> root certificate. We don't want to use a Let's Encrypt root certificate for
> internal services, as it expires every three months. Is there a way, just
> like with KafkaIO, to use a self-signed root certificate with
> ElasticsearchIO? Or is there a way to update the Java cacerts on the worker
> VMs where the Beam job is running?
>
> Looking forward to some suggestions soon.
>
> Thanks and Regards
> Mohil Khare
>


Keeping keys in a state for a very very long time (keys expiry unknown)

2020-04-06 Thread Mohil Khare
Hello all,
We are attempting to implement a use case where Beam (Java SDK) reads two
kinds of records from a data stream such as Kafka:

1. Records of type A containing a key and corresponding metadata.
2. Records of type B containing the same key, but no metadata. Beam then
needs to fill in the metadata for records of type B by doing a lookup for
metadata using keys received in records of type A.

The idea is to save the metadata, or rather the state, for keys received in
records of type A and then do a lookup when records of type B are received.
I have implemented this using the "@State" construct of Beam. However, my
problem is that we don't know when keys should expire. I don't think
keeping a global window would be a good idea, as there could be many keys
(maybe millions over a period of time) to be saved in state.
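For context, a stripped-down sketch of that state-based lookup (editorial
illustration; it assumes both record types have been flattened into one keyed
stream where a type-A element carries its metadata as the value and a type-B
element carries an empty string):

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Type-A elements write their metadata into per-key state; type-B elements
// (empty value) read it back and are emitted together with that metadata.
class EnrichWithMetadataFn extends DoFn<KV<String, String>, KV<String, String>> {

  @StateId("metadata")
  private final StateSpec<ValueState<String>> metadataSpec =
      StateSpecs.value(StringUtf8Coder.of());

  @ProcessElement
  public void process(
      ProcessContext c, @StateId("metadata") ValueState<String> metadata) {
    String key = c.element().getKey();
    String value = c.element().getValue();
    if (!value.isEmpty()) {
      // Type A: remember the metadata for this key.
      metadata.write(value);
    } else {
      // Type B: enrich from state, if the metadata has been seen already.
      String cached = metadata.read();
      if (cached != null) {
        c.output(KV.of(key, cached));
      }
      // else: metadata not seen yet (or expired) - handle as appropriate.
    }
  }
}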

What is the best way to achieve this? I was reading about RedisIO, but
found that it is still in the experimental stage. Can someone please
recommend a solution to achieve this?

Thanks and regards
Mohil


Re: Scheduling dataflow pipelines

2020-04-06 Thread Marco Mistroni
Thanks Andre. I have skipped Pub/Sub in favor of a simple Cloud Function
invoked directly from Cloud Scheduler.
Thanks for the assist.


On Mon, Apr 6, 2020, 5:52 PM André Rocha Silva <
a.si...@portaltelemedicina.com.br> wrote:

> Marco
>
> If I were to give a step-by-step, I'd go:
> 1) test the template on Dataflow
> 2) test the Cloud Function
> 3) call the Cloud Function from a Pub/Sub topic
> 4) send a message to Pub/Sub from Cloud Scheduler
>
> take a look at this tutorial about Cloud Scheduler:
> https://www.youtube.com/watch?v=WUPEUjvSBW8
>
> I think Cloud Composer is way too expensive if you only want to call the
> template twice a day, for example.
>
> kind regards
>
> On Mon, Apr 6, 2020 at 11:45 AM Marco Mistroni 
> wrote:
>
>> Thanks will give it a go
>>
>> On Mon, Apr 6, 2020, 3:39 PM Soliman ElSaber 
>> wrote:
>>
>>> We are using Composer (Airflow) to schedule and run the Dataflow jobs...
>>> Using the Python SDK, with small changes to the Composer (Airflow)
>>> DataFlowPythonOperator, to force it to use Python 3...
>>> It is working fine and creating a new Dataflow job every 30 minutes...
>>>
>>> On Mon, Apr 6, 2020 at 10:33 PM Marco Mistroni 
>>> wrote:
>>>
 Right... thanks Andre. So presumably the flow of action will be:
 - create a Dataflow template
 - create a Cloud Function that invokes it
 - create a Cloud Scheduler job that invokes the function?

 Kind regards

 On Mon, Apr 6, 2020, 2:14 PM André Rocha Silva <
 a.si...@portaltelemedicina.com.br> wrote:

> Marco
>
> If you are already using GCP, I suggest you use the cloud scheduler.
> It is like a cron job completely serverless.
>
> If you need some extra help, let me know.
>
> On Mon, Apr 6, 2020 at 4:38 AM deepak kumar  wrote:
>
>> We have used Composer (Airflow) successfully to schedule Dataflow
>> jobs.
>> Please let me know if you would need details around it.
>>
>> Thanks
>> Deepak
>>
>> On Sun, Apr 5, 2020 at 7:56 PM Joshua B. Harrison <
>> josh.harri...@gmail.com> wrote:
>>
>>> Hi Marco,
>>>
>>> I've ended up using a VM running Luigi to schedule jobs. I use the
>>> Dataflow Python API to execute stored templates.
>>>
>>> I can give you more details if you’re interested.
>>>
>>> Best,
>>> Joshua
>>>
>>> On Sun, Apr 5, 2020 at 5:02 AM Marco Mistroni 
>>> wrote:
>>>
 Hi all,
 Sorry for the partially OT question, but has anyone been successful in
 scheduling a Dataflow job on GCP?
 I have tried the Cloud Function approach (following a few examples on
 the web) but it didn't work out for me - the Cloud Function kept giving
 me an INVALID ARGUMENT error, which I could not debug.

 So I was wondering if anyone has been successful and can provide
 me with an example.

 kind regards
  Marco

 --
>>> Joshua Harrison |  Software Engineer |  joshharri...@gmail.com
>>>  |  404-433-0242
>>>
>>
>
> --
>
>*ANDRÉ ROCHA SILVA*
>   * DATA ENGINEER*
>   (48) 3181-0611
>
>    /andre-rocha-silva/
> 
> 
> 
> 
>
>
>>>
>>> --
>>> Soliman ElSaber
>>> Data Engineer
>>> www.mindvalley.com
>>>
>>
>
> --
>
>*ANDRÉ ROCHA SILVA*
>   * DATA ENGINEER*
>   (48) 3181-0611
>
>    /andre-rocha-silva/
> 
> 
> 
> 
>
>


Re: Scheduling dataflow pipelines

2020-04-06 Thread Joshua B. Harrison
I agree with André. However, if you want to do anything more complex,
you’ll need synchronization. This is why we went with Luigi. Composer is
pretty heavyweight. Running Luigi on a shared VM costs us about $0.30 a
day and allows much more control over how we schedule and execute tasks.

On Mon, Apr 6, 2020 at 10:52 AM André Rocha Silva <
a.si...@portaltelemedicina.com.br> wrote:

> Marco
>
> If I were to give a step-by-step, I'd go:
> 1) test the template on Dataflow
> 2) test the Cloud Function
> 3) call the Cloud Function from a Pub/Sub topic
> 4) send a message to Pub/Sub from Cloud Scheduler
>
> take a look at this tutorial about Cloud Scheduler:
> https://www.youtube.com/watch?v=WUPEUjvSBW8
>
> I think Cloud Composer is way too expensive if you only want to call the
> template twice a day, for example.
>
> kind regards
>
> On Mon, Apr 6, 2020 at 11:45 AM Marco Mistroni 
> wrote:
>
>> Thanks will give it a go
>>
>> On Mon, Apr 6, 2020, 3:39 PM Soliman ElSaber 
>> wrote:
>>
>>> We are using Composer (Airflow) to schedule and run the Dataflow jobs...
>>> Using the Python SDK, with small changes to the Composer (Airflow)
>>> DataFlowPythonOperator, to force it to use Python 3...
>>> It is working fine and creating a new Dataflow job every 30 minutes...
>>>
>>> On Mon, Apr 6, 2020 at 10:33 PM Marco Mistroni 
>>> wrote:
>>>
 Right... thanks Andre. So presumably the flow of action will be:
 - create a Dataflow template
 - create a Cloud Function that invokes it
 - create a Cloud Scheduler job that invokes the function?

 Kind regards

 On Mon, Apr 6, 2020, 2:14 PM André Rocha Silva <
 a.si...@portaltelemedicina.com.br> wrote:

> Marco
>
> If you are already using GCP, I suggest you use the cloud scheduler.
> It is like a cron job completely serverless.
>
> If you need some extra help, let me know.
>
> On Mon, Apr 6, 2020 at 4:38 AM deepak kumar  wrote:
>
>> We have used Composer (Airflow) successfully to schedule Dataflow
>> jobs.
>> Please let me know if you would need details around it.
>>
>> Thanks
>> Deepak
>>
>> On Sun, Apr 5, 2020 at 7:56 PM Joshua B. Harrison <
>> josh.harri...@gmail.com> wrote:
>>
>>> Hi Marco,
>>>
>>> I've ended up using a VM running Luigi to schedule jobs. I use the
>>> Dataflow Python API to execute stored templates.
>>>
>>> I can give you more details if you’re interested.
>>>
>>> Best,
>>> Joshua
>>>
>>> On Sun, Apr 5, 2020 at 5:02 AM Marco Mistroni 
>>> wrote:
>>>
 Hi all,
 Sorry for the partially OT question, but has anyone been successful in
 scheduling a Dataflow job on GCP?
 I have tried the Cloud Function approach (following a few examples on
 the web) but it didn't work out for me - the Cloud Function kept giving
 me an INVALID ARGUMENT error, which I could not debug.

 So I was wondering if anyone has been successful and can provide
 me with an example.

 kind regards
  Marco

 --
>>> Joshua Harrison |  Software Engineer |  joshharri...@gmail.com
>>>  |  404-433-0242
>>>
>>
>
> --
>
>*ANDRÉ ROCHA SILVA*
>   * DATA ENGINEER*
>   (48) 3181-0611
>
>    /andre-rocha-silva/
> 
> 
> 
> 
>
>
>>>
>>> --
>>> Soliman ElSaber
>>> Data Engineer
>>> www.mindvalley.com
>>>
>>
>
> --
>
>*ANDRÉ ROCHA SILVA*
>   * DATA ENGINEER*
>   (48) 3181-0611
>
>    /andre-rocha-silva/
> 
> 
> 
> 
>
> --
Joshua Harrison |  Software Engineer |  joshharri...@gmail.com
 |  404-433-0242


Re: Scheduling dataflow pipelines

2020-04-06 Thread André Rocha Silva
Marco

If I were to give a step-by-step, I'd go:
1) test the template on Dataflow
2) test the Cloud Function
3) call the Cloud Function from a Pub/Sub topic
4) send a message to Pub/Sub from Cloud Scheduler

take a look at this tutorial about Cloud Scheduler:
https://www.youtube.com/watch?v=WUPEUjvSBW8

I think Cloud Composer is way too expensive if you only want to call the
template twice a day, for example.

kind regards

On Mon, Apr 6, 2020 at 11:45 AM Marco Mistroni  wrote:

> Thanks will give it a go
>
> On Mon, Apr 6, 2020, 3:39 PM Soliman ElSaber 
> wrote:
>
>> We are using Composer (Airflow) to schedule and run the Dataflow jobs...
>> Using the Python SDK, with small changes to the Composer (Airflow)
>> DataFlowPythonOperator, to force it to use Python 3...
>> It is working fine and creating a new Dataflow job every 30 minutes...
>>
>> On Mon, Apr 6, 2020 at 10:33 PM Marco Mistroni 
>> wrote:
>>
>>> Right... thanks Andre. So presumably the flow of action will be:
>>> - create a Dataflow template
>>> - create a Cloud Function that invokes it
>>> - create a Cloud Scheduler job that invokes the function?
>>>
>>> Kind regards
>>>
>>> On Mon, Apr 6, 2020, 2:14 PM André Rocha Silva <
>>> a.si...@portaltelemedicina.com.br> wrote:
>>>
 Marco

 If you are already using GCP, I suggest you use the cloud scheduler. It
 is like a cron job completely serverless.

 If you need some extra help, let me know.

 On Mon, Apr 6, 2020 at 4:38 AM deepak kumar  wrote:

> We have used Composer (Airflow) successfully to schedule Dataflow jobs.
> Please let me know if you would need details around it.
>
> Thanks
> Deepak
>
> On Sun, Apr 5, 2020 at 7:56 PM Joshua B. Harrison <
> josh.harri...@gmail.com> wrote:
>
>> Hi Marco,
>>
>> I've ended up using a VM running Luigi to schedule jobs. I use the
>> Dataflow Python API to execute stored templates.
>>
>> I can give you more details if you’re interested.
>>
>> Best,
>> Joshua
>>
>> On Sun, Apr 5, 2020 at 5:02 AM Marco Mistroni 
>> wrote:
>>
>>> Hi all,
>>> Sorry for the partially OT question, but has anyone been successful in
>>> scheduling a Dataflow job on GCP?
>>> I have tried the Cloud Function approach (following a few examples on
>>> the web) but it didn't work out for me - the Cloud Function kept giving
>>> me an INVALID ARGUMENT error, which I could not debug.
>>>
>>> So I was wondering if anyone has been successful and can provide me
>>> with an example.
>>>
>>> kind regards
>>>  Marco
>>>
>>> --
>> Joshua Harrison |  Software Engineer |  joshharri...@gmail.com
>>  |  404-433-0242
>>
>

 --

*ANDRÉ ROCHA SILVA*
   * DATA ENGINEER*
   (48) 3181-0611

    /andre-rocha-silva/
 
 
 
 


>>
>> --
>> Soliman ElSaber
>> Data Engineer
>> www.mindvalley.com
>>
>

-- 

   *ANDRÉ ROCHA SILVA*
  * DATA ENGINEER*
  (48) 3181-0611

   /andre-rocha-silva/






Re: Scheduling dataflow pipelines

2020-04-06 Thread Marco Mistroni
Thanks will give it a go

On Mon, Apr 6, 2020, 3:39 PM Soliman ElSaber  wrote:

> We are using Composer (Airflow) to schedule and run the Dataflow jobs...
> Using the Python SDK, with small changes to the Composer (Airflow)
> DataFlowPythonOperator, to force it to use Python 3...
> It is working fine and creating a new Dataflow job every 30 minutes...
>
> On Mon, Apr 6, 2020 at 10:33 PM Marco Mistroni 
> wrote:
>
>> Right... thanks Andre. So presumably the flow of action will be:
>> - create a Dataflow template
>> - create a Cloud Function that invokes it
>> - create a Cloud Scheduler job that invokes the function?
>>
>> Kind regards
>>
>> On Mon, Apr 6, 2020, 2:14 PM André Rocha Silva <
>> a.si...@portaltelemedicina.com.br> wrote:
>>
>>> Marco
>>>
>>> If you are already using GCP, I suggest you use the cloud scheduler. It
>>> is like a cron job completely serverless.
>>>
>>> If you need some extra help, let me know.
>>>
>>> On Mon, Apr 6, 2020 at 4:38 AM deepak kumar  wrote:
>>>
 We have used Composer (Airflow) successfully to schedule Dataflow jobs.
 Please let me know if you would need details around it.

 Thanks
 Deepak

 On Sun, Apr 5, 2020 at 7:56 PM Joshua B. Harrison <
 josh.harri...@gmail.com> wrote:

> Hi Marco,
>
> I've ended up using a VM running Luigi to schedule jobs. I use the
> Dataflow Python API to execute stored templates.
>
> I can give you more details if you’re interested.
>
> Best,
> Joshua
>
> On Sun, Apr 5, 2020 at 5:02 AM Marco Mistroni 
> wrote:
>
>> Hi all,
>> Sorry for the partially OT question, but has anyone been successful in
>> scheduling a Dataflow job on GCP?
>> I have tried the Cloud Function approach (following a few examples on the
>> web) but it didn't work out for me - the Cloud Function kept giving me an
>> INVALID ARGUMENT error, which I could not debug.
>>
>> So I was wondering if anyone has been successful and can provide me
>> with an example.
>>
>> kind regards
>>  Marco
>>
>> --
> Joshua Harrison |  Software Engineer |  joshharri...@gmail.com
>  |  404-433-0242
>

>>>
>>> --
>>>
>>>*ANDRÉ ROCHA SILVA*
>>>   * DATA ENGINEER*
>>>   (48) 3181-0611
>>>
>>>    /andre-rocha-silva/
>>> 
>>> 
>>> 
>>> 
>>>
>>>
>
> --
> Soliman ElSaber
> Data Engineer
> www.mindvalley.com
>


Re: Scheduling dataflow pipelines

2020-04-06 Thread Soliman ElSaber
We are using Composer (Airflow) to schedule and run the Dataflow jobs...
Using the Python SDK, with small changes to the Composer (Airflow)
DataFlowPythonOperator, to force it to use Python 3...
It is working fine and creating a new Dataflow job every 30 minutes...

On Mon, Apr 6, 2020 at 10:33 PM Marco Mistroni  wrote:

> Right... thanks Andre. So presumably the flow of action will be:
> - create a Dataflow template
> - create a Cloud Function that invokes it
> - create a Cloud Scheduler job that invokes the function?
>
> Kind regards
>
> On Mon, Apr 6, 2020, 2:14 PM André Rocha Silva <
> a.si...@portaltelemedicina.com.br> wrote:
>
>> Marco
>>
>> If you are already using GCP, I suggest you use the cloud scheduler. It
>> is like a cron job completely serverless.
>>
>> If you need some extra help, let me know.
>>
>> On Mon, Apr 6, 2020 at 4:38 AM deepak kumar  wrote:
>>
>>> We have used Composer (Airflow) successfully to schedule Dataflow jobs.
>>> Please let me know if you would need details around it.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Sun, Apr 5, 2020 at 7:56 PM Joshua B. Harrison <
>>> josh.harri...@gmail.com> wrote:
>>>
 Hi Marco,

 I've ended up using a VM running Luigi to schedule jobs. I use the
 Dataflow Python API to execute stored templates.

 I can give you more details if you’re interested.

 Best,
 Joshua

 On Sun, Apr 5, 2020 at 5:02 AM Marco Mistroni 
 wrote:

> Hi all,
> Sorry for the partially OT question, but has anyone been successful in
> scheduling a Dataflow job on GCP?
> I have tried the Cloud Function approach (following a few examples on the
> web) but it didn't work out for me - the Cloud Function kept giving me an
> INVALID ARGUMENT error, which I could not debug.
>
> So I was wondering if anyone has been successful and can provide me
> with an example.
>
> kind regards
>  Marco
>
> --
 Joshua Harrison |  Software Engineer |  joshharri...@gmail.com
  |  404-433-0242

>>>
>>
>> --
>>
>>*ANDRÉ ROCHA SILVA*
>>   * DATA ENGINEER*
>>   (48) 3181-0611
>>
>>    /andre-rocha-silva/
>> 
>> 
>> 
>> 
>>
>>

-- 
Soliman ElSaber
Data Engineer
www.mindvalley.com


Re: Scheduling dataflow pipelines

2020-04-06 Thread Marco Mistroni
Right... thanks Andre. So presumably the flow of action will be:
- create a Dataflow template
- create a Cloud Function that invokes it
- create a Cloud Scheduler job that invokes the function?

Kind regards

On Mon, Apr 6, 2020, 2:14 PM André Rocha Silva <
a.si...@portaltelemedicina.com.br> wrote:

> Marco
>
> If you are already using GCP, I suggest you use the cloud scheduler. It is
> like a cron job completely serverless.
>
> If you need some extra help, let me know.
>
> On Mon, Apr 6, 2020 at 4:38 AM deepak kumar  wrote:
>
>> We have used Composer (Airflow) successfully to schedule Dataflow jobs.
>> Please let me know if you would need details around it.
>>
>> Thanks
>> Deepak
>>
>> On Sun, Apr 5, 2020 at 7:56 PM Joshua B. Harrison <
>> josh.harri...@gmail.com> wrote:
>>
>>> Hi Marco,
>>>
>>> I've ended up using a VM running Luigi to schedule jobs. I use the
>>> Dataflow Python API to execute stored templates.
>>>
>>> I can give you more details if you’re interested.
>>>
>>> Best,
>>> Joshua
>>>
>>> On Sun, Apr 5, 2020 at 5:02 AM Marco Mistroni 
>>> wrote:
>>>
 Hi all,
 Sorry for the partially OT question, but has anyone been successful in
 scheduling a Dataflow job on GCP?
 I have tried the Cloud Function approach (following a few examples on the
 web) but it didn't work out for me - the Cloud Function kept giving me an
 INVALID ARGUMENT error, which I could not debug.

 So I was wondering if anyone has been successful and can provide me with
 an example.

 kind regards
  Marco

 --
>>> Joshua Harrison |  Software Engineer |  joshharri...@gmail.com
>>>  |  404-433-0242
>>>
>>
>
> --
>
>*ANDRÉ ROCHA SILVA*
>   * DATA ENGINEER*
>   (48) 3181-0611
>
>    /andre-rocha-silva/
> 
> 
> 
> 
>
>


Re: Scheduling dataflow pipelines

2020-04-06 Thread André Rocha Silva
Marco

If you are already using GCP, I suggest you use the cloud scheduler. It is
like a cron job completely serverless.

If you need some extra help, let me know.

On Mon, Apr 6, 2020 at 4:38 AM deepak kumar  wrote:

> We have used Composer (Airflow) successfully to schedule Dataflow jobs.
> Please let me know if you would need details around it.
>
> Thanks
> Deepak
>
> On Sun, Apr 5, 2020 at 7:56 PM Joshua B. Harrison 
> wrote:
>
>> Hi Marco,
>>
>> I've ended up using a VM running Luigi to schedule jobs. I use the
>> Dataflow Python API to execute stored templates.
>>
>> I can give you more details if you’re interested.
>>
>> Best,
>> Joshua
>>
>> On Sun, Apr 5, 2020 at 5:02 AM Marco Mistroni 
>> wrote:
>>
>>> Hi all,
>>> Sorry for the partially OT question, but has anyone been successful in
>>> scheduling a Dataflow job on GCP?
>>> I have tried the Cloud Function approach (following a few examples on the
>>> web) but it didn't work out for me - the Cloud Function kept giving me an
>>> INVALID ARGUMENT error, which I could not debug.
>>>
>>> So I was wondering if anyone has been successful and can provide me with
>>> an example.
>>>
>>> kind regards
>>>  Marco
>>>
>>> --
>> Joshua Harrison |  Software Engineer |  joshharri...@gmail.com
>>  |  404-433-0242
>>
>

-- 

   *ANDRÉ ROCHA SILVA*
  * DATA ENGINEER*
  (48) 3181-0611

   /andre-rocha-silva/






Re: Apache Dataflow Template (Python)

2020-04-06 Thread André Rocha Silva
Hey!

Could you make it work? You can take a look at this post; it is a single-file
template that is easy to create a template from:
https://towardsdatascience.com/my-first-etl-job-google-cloud-dataflow-1fd773afa955

If you want, we can schedule a Google Hangout and I'll help you, step by step.
It is the least I can do after having had so much help from the community :)

On Sat, Apr 4, 2020 at 4:52 PM Marco Mistroni  wrote:

> Hey
> sure... it's a crap script :).. just an ordinary Dataflow script
>
>
> https://github.com/mmistroni/GCP_Experiments/tree/master/dataflow/edgar_flow
>
>
> What I meant to say, for your template question, is for you to write a
> basic script which runs on Beam... something as simple as this
>
>
> https://github.com/mmistroni/GCP_Experiments/blob/master/dataflow/beam_test.py
>
> and then you can create a template out of it by just running this
>
> python -m edgar_main  --runner=dataflow --project=datascience-projets
> --template_location=gs://mm_dataflow_bucket/templates/edgar_dataflow_template
> --temp_location=gs://mm_dataflow_bucket/temp
> --staging_location=gs://mm_dataflow_bucket/staging
>
> That will create a template 'edgar_dataflow_template' which you can use in
> the GCP Dataflow console to create your job.
>
> HTH, I'm sort of a noob to Beam, having started writing code just over a
> month ago. Feel free to ping me if you get stuck
>
> kind regards
>  Marco
>
>
>
>
>
>
>
>
>
>
>
>
> On Sat, Apr 4, 2020 at 6:01 PM Xander Song  wrote:
>
>> Hi Marco,
>>
>> Thanks for your response. Would you mind sending the edgar_main script so
>> I can take a look?
>>
>> On Sat, Apr 4, 2020 at 2:25 AM Marco Mistroni 
>> wrote:
>>
>>> Hey
>>> As far as I know, you can generate a Dataflow template out of your Beam
>>> code by specifying an option on the command line.
>>> I am running this command, and once the template is generated I kick off a
>>> Dataflow job via the console by pointing at it:
>>>
>>> python -m edgar_main --runner=dataflow --project=datascience-projets
>>> --template_location=gs://
>>>
>>> HTH
>>>
>>>
>>> On Sat, Apr 4, 2020, 9:52 AM Xander Song  wrote:
>>>
 I am attempting to write a custom Dataflow Template using the Apache
 Beam Python SDK, but am finding the documentation difficult to follow. Does
 anyone have a minimal working example of how to write and deploy such a
 template?

 Thanks in advance.

>>>

-- 

   *ANDRÉ ROCHA SILVA*
  * DATA ENGINEER*
  (48) 3181-0611

   /andre-rocha-silva/






Re: Scheduling dataflow pipelines

2020-04-06 Thread deepak kumar
We have used Composer (Airflow) successfully to schedule Dataflow jobs.
Please let me know if you would need details around it.

Thanks
Deepak

On Sun, Apr 5, 2020 at 7:56 PM Joshua B. Harrison 
wrote:

> Hi Marco,
>
> I've ended up using a VM running Luigi to schedule jobs. I use the
> Dataflow Python API to execute stored templates.
>
> I can give you more details if you’re interested.
>
> Best,
> Joshua
>
> On Sun, Apr 5, 2020 at 5:02 AM Marco Mistroni  wrote:
>
>> Hi all,
>> Sorry for the partially OT question, but has anyone been successful in
>> scheduling a Dataflow job on GCP?
>> I have tried the Cloud Function approach (following a few examples on the
>> web) but it didn't work out for me - the Cloud Function kept giving me an
>> INVALID ARGUMENT error, which I could not debug.
>>
>> So I was wondering if anyone has been successful and can provide me with
>> an example.
>>
>> kind regards
>>  Marco
>>
>> --
> Joshua Harrison |  Software Engineer |  joshharri...@gmail.com
>  |  404-433-0242
>