Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-25 Thread Andres Angel
Awesome Pablo thanks so much!!!

AU

On Thu, Jul 25, 2019 at 7:48 PM Pablo Estrada  wrote:

> Thanks for those who tuned in : ) - I feel like I might have spent too
> long fiddling with Python code, and not long enough doing setup, testing,
> etc. I will try to do another one where I just test / setup the environment
> / lint checks etc.
>
> Here are links for:
> Setting up the Python environment: https://youtu.be/xpIpEO4PUDo?t=334
> Quickly setting up the Java environment:
> https://youtu.be/xpIpEO4PUDo?t=3659
>
> Doing a Pull Request: https://youtu.be/xpIpEO4PUDo?t=3770
>
> On Thu, Jul 25, 2019 at 4:39 PM sridhar inuog 
> wrote:
>
>> Thanks, Pablo for organizing this session. I found it useful.
>>
>> On Thu, Jul 25, 2019 at 4:56 PM Pablo Estrada  wrote:
>>
>>> The link is here: https://www.youtube.com/watch?v=xpIpEO4PUDo
>>> This is still happening.
>>>
>>> On Thu, Jul 25, 2019 at 2:55 PM Innocent Djiofack 
>>> wrote:
>>>
 Did I miss the link or this was postponed?

 On Tue, Jul 23, 2019 at 3:05 PM Austin Bennett <
 whatwouldausti...@gmail.com> wrote:

> Pablo,
>
> Assigned https://issues.apache.org/jira/browse/BEAM-7607 to you, to
> make it even more likely that it is still around on the 25th :-)
>
> Cheers,
> Austin
>
> On Tue, Jul 23, 2019 at 11:24 AM Pablo Estrada 
> wrote:
>
>> Hi all,
>> I've just realized that
>> https://issues.apache.org/jira/browse/BEAM-7607 is a single-line
>> change - and we'd spend 40 minutes chitchatting, so I'll also be working 
>> on
>> https://jira.apache.org/jira/browse/BEAM-7803, which is a Python
>> issue (also for the BigQuery sink!).
>> Thanks!
>> -P.
>>
>> On Sat, Jul 20, 2019 at 2:05 PM Pablo Estrada 
>> wrote:
>>
>>> Hello all,
>>>
>>> This will be streamed on youtube on this link:
>>> https://www.youtube.com/watch?v=xpIpEO4PUDo
>>>
>>> I think there will be a live chat, so I will hopefully be available
>>> to answer questions. To be honest, my workflow is not super efficient,
>>> but... oh well, hopefully it will be at least somewhat helpful to 
>>> others : )
>>> Best
>>> -P.
>>>
>>> On Thu, Jul 18, 2019 at 12:59 AM Tim Sell  wrote:
>>>
 +1, I'd love to see this as a recording. Will you stick it up on
 youtube afterwards?

 On Thu, Jul 18, 2019 at 4:00 AM sridhar inuog <
 sridharin...@gmail.com> wrote:

> Thanks, Pablo! Looking forward to it! Hopefully, it will be
> recorded as well.
>
> On Wed, Jul 17, 2019 at 2:50 PM Pablo Estrada 
> wrote:
>
>> Yes! So I will be working on a small feature request for Java's
>> BigQueryIO: https://issues.apache.org/jira/browse/BEAM-7607
>>
>> Maybe I'll do something for Python next month. : )
>> Best
>> -P.
>>
>> On Wed, Jul 17, 2019 at 12:32 PM Rakesh Kumar <
>> rakeshku...@lyft.com> wrote:
>>
>>> +1, I really appreciate this initiative. It would be really
>>> helpful for newbies like me.
>>>
>>> Is it possible to list out what are the things that you are
>>> planning to cover?
>>>
>>>
>>>
>>>
>>> On Tue, Jul 16, 2019 at 11:19 AM Yichi Zhang 
>>> wrote:
>>>
 Thanks for organizing this Pablo, it'll be very helpful!

 On Tue, Jul 16, 2019 at 10:57 AM Pablo Estrada <
 pabl...@google.com> wrote:

> Hello all,
> I'll be having a session where I live-fix a Beam bug for 1
> hour next week. Everyone is invited.
>
> It will be on July 25, between 3:30pm and 4:30pm PST.
> Hopefully I will finish a full change in that time frame, but 
> we'll see.
>
> I have not yet decided if I will do this via hangouts, or via
> a youtube livestream. In any case, I will share the link here in 
> the next
> few days.
>
> I will most likely work on the Java SDK (I have a little
> feature request in mind).
>
> Thanks!
> -P.
>


 --

 *DJIOFACK INNOCENT*
 *"Be better than the day before!" -*
 *+1 404 751 8024*

>>>


Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-25 Thread Pablo Estrada
Thanks to those who tuned in : ) - I feel like I might have spent too long
fiddling with Python code, and not long enough doing setup, testing, etc. I
will try to do another one where I just test / set up the environment / run
lint checks, etc.

Here are links for:
Setting up the Python environment: https://youtu.be/xpIpEO4PUDo?t=334
Quickly setting up the Java environment: https://youtu.be/xpIpEO4PUDo?t=3659

Doing a Pull Request: https://youtu.be/xpIpEO4PUDo?t=3770



Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-25 Thread sridhar inuog
Thanks, Pablo, for organizing this session. I found it useful.



Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-25 Thread Pablo Estrada
The link is here: https://www.youtube.com/watch?v=xpIpEO4PUDo
This is still happening.



Re: Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread Robert Bradshaw
On Thu, Jul 25, 2019 at 6:34 PM rahul patwari
 wrote:
>
> So, If an RPC call has to be performed for a batch of Rows(PCollection), 
> instead of each Row, the recommended way is to batch the Rows in 
> startBundle() of 
> DoFn(https://stackoverflow.com/questions/49094781/yield-results-in-finish-bundle-from-a-custom-dofn/49101711#49101711)?

Yes.

> I thought Stateful and Timely Processing could be helpful here.

The upside is that you can persist state across bundles (which is
especially helpful when bundles are small, e.g. for streaming
pipelines). The downside is that you can't persist state across keys
(and it also enforces a shuffle to colocate the data by key).

If you get to choose your keys, you would want to have about as many
keys as you have concurrent bundles (or some small multiple, to ensure
they're not lumpily distributed). Keying by something like
System.identityHashCode(this) in the body of a DoFn might be
sufficient.
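Robert's two suggestions above - buffer elements within a bundle, and spread elements over a bounded set of shard keys so a stateful step keeps some parallelism - can be sketched in plain Python. This is only an illustration of the idea, not Beam API; all names here (assign_shard_key, batch_by_key) are made up for the sketch.

```python
import collections

def assign_shard_key(element, num_shards):
    """Spread elements across a bounded number of keys, so a stateful
    step downstream keeps roughly num_shards-way parallelism."""
    return hash(element) % num_shards

def batch_by_key(elements, num_shards, max_batch):
    """Buffer elements per shard key and emit (key, batch) pairs,
    flushing a batch whenever it reaches max_batch elements and
    flushing any leftovers at the end (like finishBundle would)."""
    batches = []
    buffers = collections.defaultdict(list)
    for e in elements:
        k = assign_shard_key(e, num_shards)
        buffers[k].append(e)
        if len(buffers[k]) >= max_batch:
            batches.append((k, buffers.pop(k)))
    for k, buf in buffers.items():
        batches.append((k, buf))  # final flush of partial batches
    return batches
```

With num_shards=4, every batch is processed under one of only four keys, so the per-key state stays small while an RPC can be issued once per batch instead of once per element.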



Re: applying keyed state on top of stream from co-groupByKey output

2019-07-25 Thread Kiran Hurakadli
Thank you

>> --
Regards,
*Kiran M Hurakadli.*


Re: applying keyed state on top of stream from co-groupByKey output

2019-07-25 Thread Kiran Hurakadli
Thank you for your response

Regards,
*Kiran M Hurakadli.*


Re: How to merge two streams and perform stateful operations on merged stream using Apache Beam

2019-07-25 Thread Kiran Hurakadli
Yeah! This is the same question. Thank you for your detailed explanation.

Regards,
*Kiran M Hurakadli.*


Re: applying keyed state on top of stream from co-groupByKey output

2019-07-25 Thread Rui Wang
I also added an alternative solution to your example on the SO question.


-Rui



Re: [python SDK] Returning Pub/Sub message_id and timestamp

2019-07-25 Thread Ahmet Altay
Your analysis and suggestion look correct to me. Thank you for reporting
this. I filed https://issues.apache.org/jira/browse/BEAM-7819 to track this.

On Thu, Jul 25, 2019 at 4:51 AM Matthew Darwin <
matthew.dar...@carfinance247.co.uk> wrote:

> Thanks Ahmet. Looking at the source code, with my untrained python eyes, I
> think if the intention is to include the message id and the publish time in
> the attributes attribute of the PubSubMessage type, then the protobuf
> mapping is missing something:-
>
>
> @staticmethod
> def _from_proto_str(proto_msg):
>   """Construct from serialized form of ``PubsubMessage``.
>
>   Args:
>     proto_msg: String containing a serialized protobuf of type
>       https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage
>
>   Returns:
>     A new PubsubMessage object.
>   """
>   msg = pubsub.types.pubsub_pb2.PubsubMessage()
>   msg.ParseFromString(proto_msg)
>   # Convert ScalarMapContainer to dict.
>   attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
>   return PubsubMessage(msg.data, attributes)
>
> The protobuf definition is here:-
>
>
> https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage
>
> and so it looks as if the message_id and publish_time are not being parsed
> as they are separate from the attributes. Perhaps the PubsubMessage class
> needs expanding to include these as attributes, or they would need adding
> to the dictionary for attributes. This would only need doing for the
> _from_proto_str as obviously they would not need to be populated when
> transmitting a message to PubSub.
>
> My python is not great, I'm assuming the latter option would need to look
> something like this?
>
> attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
> attributes.update({'message_id': msg.message_id, 'publish_time':
> msg.publish_time})
> return PubsubMessage(msg.data, attributes)
>
> On Wed, 2019-07-24 at 10:01 -0700, Ahmet Altay wrote:
>
> No, I do not see any issues with the code you posted on stackoverflow. It
> might very well be an issue with Beam.
>
> +Udi Meiri  is the author of this code, might have an
> idea.
>
> On Wed, Jul 24, 2019 at 9:47 AM Matthew Darwin <
> matthew.dar...@carfinance247.co.uk> wrote:
>
> Thanks Ahmet, I'm aware of the documentation and what should be happening,
> however, as you can see in the example I posted on stackoverflow, this does
> not appear to be the case, hence my question here! If there are any glaring
> issues with my code, I'd be grateful if you could point them out.
>
> On Wed, 2019-07-24 at 09:16 -0700, Ahmet Altay wrote:
>
> When with_attributes is set to True, the elements will be of type
> PubsubMessage [1]. I could not find a test/example for this, but
> documentation suggests [2], PubsubMessage will have an attributes map
> including the system provided values. One of those keys will be message_id
> [3].
>
> [1]
> https://github.com/apache/beam/blob/a5ed104aeac91781492523808fa4d3df6f2e55f7/sdks/python/apache_beam/io/gcp/pubsub.py#L156
> [2]
> https://github.com/apache/beam/blob/a5ed104aeac91781492523808fa4d3df6f2e55f7/sdks/python/apache_beam/io/gcp/pubsub.py#L94
> [3]
> https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage
>
> On Wed, Jul 24, 2019 at 6:23 AM Matthew Darwin <
> matthew.dar...@carfinance247.co.uk> wrote:
>
> Thank you both for you replies; apologies for the delay in responding, I
> was away from my machine over the weekend.
>
> I'm still a bit confused over how the message id from PubSub is exposed.
>
> From the documentation, I was expecting the PubSub message id to be in the
> attributes property of the message; but when running this into BigQuery as
> a string, the only properties are those that I pass in to my PubSub message
> as attributes. None of the examples you've linked use the with_attributes
> flag, unfortunately.
>
> I've posted the code and output onto stackoverflow:-
> https://stackoverflow.com/questions/57183876/how-do-you-access-the-message-id-from-google-pub-sub-using-apache-beam
>
> Secondly, is the 2.14 release available for preview?
>
>
> Release candidate is not out yet. This will happen soon (~this week) and
> you can watch dev@ for this. If it makes sense for your case, you can
> also build yourself from head and test it out.
>
>
>
> Kind regards,
>
> Matthew
>
>
> On Fri, 2019-07-19 at 13:22 -0700, Valentyn Tymofieiev wrote:
>
> Also, see
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/game/leader_board.py
>  which
> involves both PubSub and Bigquery IOs.
>
> On Fri, Jul 19, 2019 at 12:31 PM Pablo Estrada  wrote:
>
> Beam 2.14.0 will

Re: Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread rahul patwari
Yes. But GroupIntoBatches works on KV elements, and we are working with a
PCollection of Rows throughout our pipeline.
We can convert each Row to a KV, but we only have a few keys and a Bounded
PCollection. With Global windows and only a few keys, the parallelism of a
Stateful ParDo (per key, per window) is limited to the number of keys.

On Thu, Jul 25, 2019 at 10:08 PM Reuven Lax  wrote:

> Have you looked at the GroupIntoBatches transform?


Re: Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread rahul patwari
So, if an RPC call has to be performed for a batch of Rows (a
PCollection of Rows) instead of for each Row, the recommended way is to
buffer the Rows starting in startBundle() of a DoFn (
https://stackoverflow.com/questions/49094781/yield-results-in-finish-bundle-from-a-custom-dofn/49101711#49101711)?
I thought Stateful and Timely Processing could be helpful here.



Re: Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread Robert Bradshaw
Though it's not obvious in the name, Stateful ParDos can only be
applied to keyed PCollections, similar to GroupByKey. (You could,
however, assign every element to the same key and then apply a
Stateful DoFn, though in that case all elements would get processed on
the same worker.)



Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread rahul patwari
Hi,

https://beam.apache.org/blog/2017/02/13/stateful-processing.html  gives an
example of assigning an arbitrary-but-consistent index to each element on a
per key-and-window basis.

If a Stateful ParDo is applied on a Non-Keyed PCollection, say a
PCollection with Fixed Windows, is the state maintained per window, with
every element in the window assigned a consistent index?
Does this mean every element belonging to the window will be processed in a
single DoFn instance, which otherwise could have been done in multiple
parallel instances, limiting performance?
Similarly, how does a Stateful ParDo behave on a Bounded Non-Keyed PCollection?

Thanks,
Rahul
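As a plain-Python analogy (not Beam itself), per key-and-window state can be modeled as one counter cell per (key, window) pair. Every element sharing a key and window sees the same cell, which is also why a non-keyed (single-key) collection would collapse onto one cell per window. All names below are illustrative.

```python
import collections

def fixed_window(timestamp, size=60):
    # Map a timestamp to the start of its fixed window of `size` seconds.
    return timestamp - timestamp % size

def assign_indices(elements):
    """Assign an arbitrary-but-consistent index per (key, window) pair,
    mimicking the stateful-indexing example from the blog post.
    `elements` is a list of (key, timestamp, value) triples."""
    next_index = collections.defaultdict(int)  # one state cell per (key, window)
    out = []
    for key, timestamp, value in elements:
        cell = (key, fixed_window(timestamp))
        out.append((key, cell[1], next_index[cell], value))
        next_index[cell] += 1
    return out
```

Note how a new window for the same key restarts the index at 0, while a different key gets its own independent counter.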


Re: How to merge two streams and perform stateful operations on merged stream using Apache Beam

2019-07-25 Thread Kenneth Knowles
Is this the same question as your other email about your StackOverflow
question? If it is, then please see my answer on StackOverflow. If it is
not, can you explain a little more?

Kenn

On Wed, Jul 24, 2019 at 10:48 PM Kiran Hurakadli 
wrote:

> I have 2 Kafka streams. I want to merge them by some key, and on top of the
> merged stream I want to perform a stateful operation so that I can sum up
> counts from both streams.
>
> This is what I tried, but it didn't work:
>
> PCollection stream1 = .. read from kafka
>
> PCollection stream2 = .. read from kafka
>
> PCollection wordCount1 = stream1.apply(...)
>
> PCollection wordCount2 = stream2.apply(...)
>
> PCollection merged = merge wordCount1 and wordCount2 using
> CoGroupByKey
>
> PCollection finalStream = merged.apply(...)
>
> then apply state to finalStream.
>
> Please suggest a solution.
>
> --
> Regards,
> *Kiran M Hurakadli.*
>
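The merge-then-sum the question describes can be modeled in plain Python, as a rough stand-in for CoGroupByKey followed by a summing step. This is only a sketch of the semantics, not Beam code; the function names are made up.

```python
import collections

def co_group_by_key(counts1, counts2):
    """Group two lists of (word, count) pairs by word, like CoGroupByKey:
    each word maps to a pair of lists, one list per input stream."""
    grouped = collections.defaultdict(lambda: ([], []))
    for word, count in counts1:
        grouped[word][0].append(count)
    for word, count in counts2:
        grouped[word][1].append(count)
    return grouped

def merged_word_counts(counts1, counts2):
    # Sum the counts contributed by both streams for every word.
    return {word: sum(c1) + sum(c2)
            for word, (c1, c2) in co_group_by_key(counts1, counts2).items()}
```

A word that appears in only one stream still shows up in the result, with an empty list contributing 0 from the other stream.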


Re: applying keyed state on top of stream from co-groupByKey output

2019-07-25 Thread Kenneth Knowles
Thanks for the very detailed question! I have written up an answer and I
suggest we continue discussion there.

Kenn

On Tue, Jul 23, 2019 at 9:11 PM Kiran Hurakadli 
wrote:

>
> Hi All,
> I am trying to merge  2 data streams using coGroupByKey and applying
> stateful
> ParDo. Input to the cogroup fun is (word,count) , Full problem explained on
> stack over flow
>
> https://stackoverflow.com/questions/57162131/applying-keyed-state-on-top-of-stream-from-co-group-stream
>
> Please help me
> --
> Regards,
> *Kiran M Hurakadli.*
>


Re: [python SDK] Returning Pub/Sub message_id and timestamp

2019-07-25 Thread Matthew Darwin
Thanks Ahmet. Looking at the source code, with my untrained python eyes, I 
think if the intention is to include the message id and the publish time in the 
attributes attribute of the PubSubMessage type, then the protobuf mapping is 
missing something:-


@staticmethod
def _from_proto_str(proto_msg):
  """Construct from serialized form of ``PubsubMessage``.

  Args:
    proto_msg: String containing a serialized protobuf of type
      https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage

  Returns:
    A new PubsubMessage object.
  """
  msg = pubsub.types.pubsub_pb2.PubsubMessage()
  msg.ParseFromString(proto_msg)
  # Convert ScalarMapContainer to dict.
  attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
  return PubsubMessage(msg.data, attributes)

The protobuf definition is here:-

https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage

and so it looks as if the message_id and publish_time are not being parsed, as
they are separate from the attributes. Perhaps the PubsubMessage class needs
expanding to include these as attributes, or they would need adding to the
dictionary of attributes. This would only need doing for _from_proto_str, as
obviously they would not need to be populated when transmitting a message to
PubSub.

My python is not great, I'm assuming the latter option would need to look 
something like this?

attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
attributes.update({'message_id': msg.message_id,
                   'publish_time': msg.publish_time})
return PubsubMessage(msg.data, attributes)
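[Editor's note: the proposed change can be exercised in isolation with a stand-in object in place of the real protobuf class. The field names (`attributes`, `message_id`, `publish_time`) follow the Pub/Sub message definition linked above; everything else here is hypothetical.]

```python
class FakeProtoMessage:
    """Stand-in for a parsed google.pubsub.v1.PubsubMessage protobuf."""
    def __init__(self, data, attributes, message_id, publish_time):
        self.data = data
        self.attributes = attributes
        self.message_id = message_id
        self.publish_time = publish_time

def attributes_with_system_fields(msg):
    # Copy the user-supplied attributes, then fold in the
    # system-provided message_id and publish_time, as proposed.
    attributes = dict((key, msg.attributes[key]) for key in msg.attributes)
    attributes.update({'message_id': msg.message_id,
                       'publish_time': msg.publish_time})
    return attributes

msg = FakeProtoMessage(b'payload', {'source': 'test'},
                       '12345', '2019-07-25T00:00:00Z')
attrs = attributes_with_system_fields(msg)
# attrs == {'source': 'test', 'message_id': '12345',
#           'publish_time': '2019-07-25T00:00:00Z'}
```

One caveat with this approach: a user attribute literally named 'message_id' or 'publish_time' would be silently overwritten, which may argue for exposing the fields as properties on PubsubMessage instead.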

On Wed, 2019-07-24 at 10:01 -0700, Ahmet Altay wrote:
This message originated from outside your organization

No, I do not see any issues with the code you posted on stackoverflow. It might 
very well be an issue with Beam.

+Udi Meiri, the author of this code, might have an
idea.

On Wed, Jul 24, 2019 at 9:47 AM Matthew Darwin wrote:
Thanks Ahmet, I'm aware of the documentation and what should be happening, 
however, as you can see in the example I posted on stackoverflow, this does not 
appear to be the case, hence my question here! If there are any glaring issues 
with my code, I'd be grateful if you could point them out.

On Wed, 2019-07-24 at 09:16 -0700, Ahmet Altay wrote:

When with_attributes is set to True, the elements will be of type PubsubMessage 
[1]. I could not find a test/example for this, but documentation suggests [2], 
PubsubMessage will have an attributes map including the system provided values. 
One of those keys will be message_id [3].

[1] 
https://github.com/apache/beam/blob/a5ed104aeac91781492523808fa4d3df6f2e55f7/sdks/python/apache_beam/io/gcp/pubsub.py#L156
[2] 
https://github.com/apache/beam/blob/a5ed104aeac91781492523808fa4d3df6f2e55f7/sdks/python/apache_beam/io/gcp/pubsub.py#L94
[3] 
https://cloud.google.com/pubsub/docs/reference/rpc/google.pubsub.v1#google.pubsub.v1.PubsubMessage

On Wed, Jul 24, 2019 at 6:23 AM Matthew Darwin wrote:
Thank you both for your replies; apologies for the delay in responding, I was
away from my machine over the weekend.

I'm still a bit confused over how the message id from PubSub is exposed.

From the documentation, I was expecting the PubSub message id to be in the 
attributes property of the message; but when running this into BigQuery as a 
string, the only properties are those that I pass in to my PubSub message as 
attributes. None of the examples you've linked use the with_attributes flag, 
unfortunately.

I've posted the code and output onto stackoverflow:- 
https://stackoverflow.com/questions/57183876/how-do-you-access-the-message-id-from-google-pub-sub-using-apache-beam

Secondly, is the 2.14 release available for preview?


Release candidate is not out yet. This will happen soon (~this week) and you 
can watch dev@ for this. If it makes sense for your case, you can also build 
yourself from head and test it out.


Kind regards,

Matthew


On Fri, 2019-07-19 at 13:22 -0700, Valentyn Tymofieiev wrote:

Also, see
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/game/leader_board.py
which involves both PubSub and BigQuery IOs.

On Fri, Jul 19, 2019 at 12:31 PM Pablo Estrada wrote:
Beam 2.14.0 will include support for writing files in the fileio module (the 
support will include GCS, local files, HDFS). It will also support streaming. 
The transform is still marked as experimental, and is likely to receive 
improvements - but you can check it out for your pipelines, and see if it helps 
you : )
Best
-P.

On Fri, Jul 19, 2019 at 12:24 PM Valentyn Tymofieie