Re: 2019 Beam Events

2018-12-04 Thread Matthias Baetens
Great stuff, Gris! Looking forward to what 2019 will bring!

The Beam meetup in London will have a new get-together early next year as
well :-)
https://www.meetup.com/London-Apache-Beam-Meetup/


On Tue, 4 Dec 2018 at 23:50 Austin Bennett 
wrote:

> Already got that process kicked off with the NY and LA meetups; now that
> SF is about to be inaugurated, the goal will be to get these moving as well.
>
> For anyone that is in (or goes to) those areas:
> https://www.meetup.com/New-York-Apache-Beam/
> https://www.meetup.com/Los-Angeles-Apache-Beam/
>
> Please reach out to get involved!
>
>
>
> On Tue, Dec 4, 2018 at 3:13 PM Griselda Cuevas  wrote:
>
>> +1 to Pablo's suggestion: if there's interest in founding a Meetup group
>> in a particular city, let's create the Meetup page and start getting
>> sign-ups. Joana will be reaching out with a comprehensive list of how to get
>> started, and we're hoping to compile a high-level calendar of
>> launches/announcements to feed into your meetup.
>>
>> G
>>
>> On Tue, 4 Dec 2018 at 12:04, Daniel Salerno 
>> wrote:
>>
>>> =)
>>> What good news!
>>> Okay, I'll set up the group and try to get people interested.
>>> Thank you
>>>
>>>
>>> Em ter, 4 de dez de 2018 às 17:19, Pablo Estrada 
>>> escreveu:
>>>
 FWIW, for some of these places that have interest (e.g. Brazil,
 Israel), it's possible to create a group in meetup.com, and start
 gauging interest, and looking for organizers.
 Once a group of people with interest exists, it's easier to get
 interest / sponsorship to bring speakers.
 So if you are willing to create the group in meetup, Daniel, we can
 monitor it and try to plan something as it grows : )
 Best
 -P.

 On Tue, Dec 4, 2018 at 10:55 AM Daniel Salerno 
 wrote:

>
> It's a shame that there are no events in Brazil ...
>
> =(
>
> Em ter, 4 de dez de 2018 às 13:12, OrielResearch Eila Arich-Landkof <
> e...@orielresearch.org> escreveu:
>
>> agree 👍
>>
>> On Tue, Dec 4, 2018 at 5:41 AM Chaim Turkel  wrote:
>>
>>> Israel would be nice to have one
>>> chaim
>>> On Tue, Dec 4, 2018 at 12:33 AM Griselda Cuevas 
>>> wrote:
>>> >
>>> > Hi Beam Community,
>>> >
>>> > I started curating industry conferences, meetups and events that
>>> are relevant for Beam, this initial list I came up with. I'd love your 
>>> help
>>> adding others that I might have overlooked. Once we're satisfied with 
>>> the
>>> list, let's re-share so we can coordinate proposal submissions, 
>>> attendance
>>> and community meetups there.
>>> >
>>> >
>>> > Cheers,
>>> >
>>> > G
>>> >
>>> >
>>> >
>>>
>>> --
>>>
>>>
>>>
>>
>>
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/
>>
>>
>> --


Re: How to kick off a Beam pipeline with PortableRunner in Java?

2018-12-04 Thread Ruoyun Huang
This should work, but maybe try adding a section like this to your pom.xml
file:



<profiles>
  <profile>
    <id>portable-runner</id>
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <dependencies>
      <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-reference-java</artifactId>
        <version>${beam.version}</version>
        <scope>runtime</scope>
      </dependency>
    </dependencies>
  </profile>
</profiles>



  




Note that the PortableRunner still has major features in development. It can
currently only run sample pipelines without file output, though there is
active work on it.
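
For completeness, a minimal sketch of the launching side (the class name and options wiring are illustrative assumptions, not something from this thread, and it presumes the PortablePipelineOptions interface from the core SDK plus the profile above so that beam-runners-reference-java is on the runtime classpath):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.PortablePipelineOptions;

public class PortableRunnerExample {
  public static void main(String[] args) {
    // Expects e.g. --runner=PortableRunner --jobEndpoint=localhost:8099 on the command line.
    PortablePipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(PortablePipelineOptions.class);
    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline here ...
    pipeline.run().waitUntilFinish();
  }
}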





On Tue, Dec 4, 2018 at 2:21 PM Sai Inampudi  wrote:

> Thanks for the help Ankur and Ruoyun, appreciate it. I went through the
> wiki and I am still facing the same issue as before (where it complains
> about the following:
> java.lang.IllegalArgumentException: Unknown 'runner' specified
> 'PortableRunner', supported pipeline runners [DirectRunner, FlinkRunner,
> SparkRunner, TestFlinkRunner, TestSparkRunner])
>
> I am probably making a naive mistake but I am not sure where; below is
> everything I have done thus far:
> * Confirmed I have met the prerequisites in the wiki
>  * Docker is installed and working without sudo
>  * gradle target for the java container works
>
> * Successfully started the JobServer (for now, I didn't bother with the
> flink job server and instead just kicked off the reference job server) by
> running the following
> * ./gradlew :beam-runners-reference-job-server:run
> * Confirmed that the JobService started at 8099 port
> * Started ReferenceRunnerJobService at localhost:8099
><-> 98% EXECUTING [22m 43s]
>> :beam-runners-reference-job-server:run
>> IDLE
>
> * My code itself is being managed via maven, so I made sure to pull in the
> latest 2.10.0-SNAPSHOT dependencies as per
> https://repository.apache.org/content/repositories/snapshots/
>
> * The pipeline is kicked off with the following options:
>  * --runner=PortableRunner --jobEndpoint=localhost:8099
> --streaming=true
>  * Running the pipeline results in the following error
>* java.lang.IllegalArgumentException: Unknown 'runner'
> specified 'PortableRunner', supported pipeline runners [DirectRunner,
> FlinkRunner, SparkRunner, TestFlinkRunner, TestSparkRunner]
>
>
>
> I am sure I am missing something basic but I was hoping I could get ideas
> on what it might be?
>
>

-- 

Ruoyun  Huang


Re: 2019 Beam Events

2018-12-04 Thread Austin Bennett
Already got that process kicked off with the NY and LA meetups; now that
SF is about to be inaugurated, the goal will be to get these moving as well.

For anyone that is in (or goes to) those areas:
https://www.meetup.com/New-York-Apache-Beam/
https://www.meetup.com/Los-Angeles-Apache-Beam/

Please reach out to get involved!



On Tue, Dec 4, 2018 at 3:13 PM Griselda Cuevas  wrote:

> +1 to Pablo's suggestion: if there's interest in founding a Meetup group
> in a particular city, let's create the Meetup page and start getting
> sign-ups. Joana will be reaching out with a comprehensive list of how to get
> started, and we're hoping to compile a high-level calendar of
> launches/announcements to feed into your meetup.
>
> G
>
> On Tue, 4 Dec 2018 at 12:04, Daniel Salerno  wrote:
>
>> =)
>> What good news!
>> Okay, I'll set up the group and try to get people interested.
>> Thank you
>>
>>
>> Em ter, 4 de dez de 2018 às 17:19, Pablo Estrada 
>> escreveu:
>>
>>> FWIW, for some of these places that have interest (e.g. Brazil, Israel),
>>> it's possible to create a group in meetup.com, and start gauging
>>> interest, and looking for organizers.
>>> Once a group of people with interest exists, it's easier to get interest
>>> / sponsorship to bring speakers.
>>> So if you are willing to create the group in meetup, Daniel, we can
>>> monitor it and try to plan something as it grows : )
>>> Best
>>> -P.
>>>
>>> On Tue, Dec 4, 2018 at 10:55 AM Daniel Salerno 
>>> wrote:
>>>

 It's a shame that there are no events in Brazil ...

 =(

 Em ter, 4 de dez de 2018 às 13:12, OrielResearch Eila Arich-Landkof <
 e...@orielresearch.org> escreveu:

> agree 👍
>
> On Tue, Dec 4, 2018 at 5:41 AM Chaim Turkel  wrote:
>
>> Israel would be nice to have one
>> chaim
>> On Tue, Dec 4, 2018 at 12:33 AM Griselda Cuevas 
>> wrote:
>> >
>> > Hi Beam Community,
>> >
>> > I started curating industry conferences, meetups and events that
>> are relevant for Beam, this initial list I came up with. I'd love your 
>> help
>> adding others that I might have overlooked. Once we're satisfied with the
>> list, let's re-share so we can coordinate proposal submissions, 
>> attendance
>> and community meetups there.
>> >
>> >
>> > Cheers,
>> >
>> > G
>> >
>> >
>> >
>>
>> --
>>
>>
>>
>
>
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
>
>
>


Re: 2019 Beam Events

2018-12-04 Thread Griselda Cuevas
+1 to Pablo's suggestion: if there's interest in founding a Meetup group
in a particular city, let's create the Meetup page and start getting
sign-ups. Joana will be reaching out with a comprehensive list of how to get
started, and we're hoping to compile a high-level calendar of
launches/announcements to feed into your meetup.

G

On Tue, 4 Dec 2018 at 12:04, Daniel Salerno  wrote:

> =)
> What good news!
> Okay, I'll set up the group and try to get people interested.
> Thank you
>
>
> Em ter, 4 de dez de 2018 às 17:19, Pablo Estrada 
> escreveu:
>
>> FWIW, for some of these places that have interest (e.g. Brazil, Israel),
>> it's possible to create a group in meetup.com, and start gauging
>> interest, and looking for organizers.
>> Once a group of people with interest exists, it's easier to get interest
>> / sponsorship to bring speakers.
>> So if you are willing to create the group in meetup, Daniel, we can
>> monitor it and try to plan something as it grows : )
>> Best
>> -P.
>>
>> On Tue, Dec 4, 2018 at 10:55 AM Daniel Salerno 
>> wrote:
>>
>>>
>>> It's a shame that there are no events in Brazil ...
>>>
>>> =(
>>>
>>> Em ter, 4 de dez de 2018 às 13:12, OrielResearch Eila Arich-Landkof <
>>> e...@orielresearch.org> escreveu:
>>>
 agree 👍

 On Tue, Dec 4, 2018 at 5:41 AM Chaim Turkel  wrote:

> Israel would be nice to have one
> chaim
> On Tue, Dec 4, 2018 at 12:33 AM Griselda Cuevas 
> wrote:
> >
> > Hi Beam Community,
> >
> > I started curating industry conferences, meetups and events that are
> relevant for Beam, this initial list I came up with. I'd love your help
> adding others that I might have overlooked. Once we're satisfied with the
> list, let's re-share so we can coordinate proposal submissions, attendance
> and community meetups there.
> >
> >
> > Cheers,
> >
> > G
> >
> >
> >
>
> --
>
>
>


 --
 Eila
 www.orielresearch.org
 https://www.meetup.com/Deep-Learning-In-Production/





Re: Join PCollection Data with HBase Large Data - Suggestion Requested

2018-12-04 Thread Chandan Biswas
Thanks Lukasz for the quick reply.

On Tue, Dec 4, 2018 at 4:20 PM Lukasz Cwik  wrote:

> Is HBase only updated by the output within your pipeline, or can an
> external system also update the HBase data? If not, then querying HBase
> within processElement is your best bet, since you're effectively trying to do
> a sparse lookup with slowly changing data.
>
>
>
> On Tue, Dec 4, 2018 at 11:59 AM Chandan Biswas 
> wrote:
>
>> Also I forgot to mention that keys will not be repeating frequently in a
>> window.
>>
>> Thanks,
>> Chandan
>>
>> On Tue, Dec 4, 2018 at 1:49 PM Chandan Biswas 
>> wrote:
>>
>>> Thanks Lukasz and Steve for replying quickly. Sorry for not be clear
>>> enough. But my use case is something like Steve mentioned. So when I am
>>> reading the data from stream, I need to figure out that the data is coming
>>> from stream is not duplicate for the key. So I need to check the all the
>>> historical data for that key stored in Hbase and the table size is like
>>> 1TB. The output of the processing is stored in the same Hbase table. Please
>>> let me know if you need more context.
>>>
>>> Thanks,
>>> Chandan
>>>
>>> On Tue, Dec 4, 2018 at 11:32 AM Steve Niemitz 
>>> wrote:
>>>
 interesting to know that the state scales so well!

 On Tue, Dec 4, 2018 at 12:21 PM Lukasz Cwik  wrote:

> Your correct in saying that StatefulDoFn is pointless if you only see
> every key+window once. The users description wasn't exactly clear but it
> seemed to me they were reading from a stream and wanted to store all old
> values that they had seen implying they see keys more then once. The user
> would need to ensure that the windowing strategy triggers more then once
> for my suggestion to be useful (e.g. global window with after element 
> count
> trigger) but without further details my suggestion is a guess.
>
> Also. the implementation for state storage is Runner dependent but I
> am aware of users storing very large amounts (>> 1 TiB) within state on
> Dataflow and in general scales very well with the number of keys and
> windows.
>
> On Tue, Dec 4, 2018 at 8:01 AM Steve Niemitz 
> wrote:
>
>> We have a similar use case, except with BigtableIO instead of HBase.
>>
>> We ended up building a custom transform that was basically
>> PCollection[ByteString] -> PCollection[BigtableRow], and would fetch rows
>> from Bigtable based on the input, however it's tricky to get right 
>> because
>> of batching, etc.
>>
>> I'm curious how a StatefulDoFn would help here, it seems like it's
>> more of just a cache than an actual join (and in my use-case we're never
>> reading a key more than once so a cache wouldn't help here anyways).  
>> Also
>> I'd be interested to see how the state storage performs with "large"
>> amounts of state.  We're reading ~1 TB of data from Bigtable in a run, 
>> and
>> it doesn't seem reasonable to store that all in a DoFn's state.
>>
>>
>>
>> On Tue, Dec 4, 2018 at 1:23 AM Lukasz Cwik  wrote:
>>
>>> What about a StatefulDoFn where you append the value(s) in a bag
>>> state as you see them?
>>>
>>> If you need to seed the state information, you could do a one time
>>> lookup in processElement for each key to HBase if the key hasn't yet 
>>> been
>>> seen (storing the fact that you loaded the data in a boolean) but
>>> afterwards you would rely on reading the value(s) from the bag state.
>>>
>>> processElement(...) {
>>>   Value newValue = ...
>>>   Iterable values;
>>>   if (!hasSeenKeyBooleanValueState.read()) {
>>> values = ... load from HBase ...
>>> valuesBagState.append(values);
>>> hasSeenKeyBooleanValueState.set(true);
>>>   } else {
>>> values = valuesBagState.read();
>>>   }
>>>   ... perform processing using values ...
>>>
>>>valuesBagState.append(newValue);
>>> }
>>>
>>> This blog post[1] has a good example.
>>>
>>> 1: https://beam.apache.org/blog/2017/02/13/stateful-processing.html
>>>
>>> On Mon, Dec 3, 2018 at 12:48 PM Chandan Biswas <
>>> pela.chan...@gmail.com> wrote:
>>>
 Hello All,
 I have a use case where I have PCollection> data
 coming from Kafka source. When processing each record (KV) I
 need all old values for that Key stored in a hbase table. The naive
 approach is to do HBase lookup in the DoFn.processElement. I considered
 sideinput but it' not going to work because of large dataset. Any
 suggestion?

 Thanks,
 Chandan

>>>
>>>
>>> --
>>> Thanks,
>>> *Chandan Biswas*
>>>
>>
>>
>> --
>> Thanks,
>> *Chandan Biswas*
>>
>

-- 
Thanks,
*Chandan Biswas*


Re: How to kick off a Beam pipeline with PortableRunner in Java?

2018-12-04 Thread Sai Inampudi
Thanks for the help Ankur and Ruoyun, appreciate it. I went through the wiki 
and I am still facing the same issue as before (where it complains about the 
following:
java.lang.IllegalArgumentException: Unknown 'runner' specified 
'PortableRunner', supported pipeline runners [DirectRunner, FlinkRunner, 
SparkRunner, TestFlinkRunner, TestSparkRunner])

I am probably making a naive mistake but I am not sure where; below is
everything I have done thus far:
* Confirmed I have met the prerequisites in the wiki
 * Docker is installed and working without sudo
 * gradle target for the java container works

* Successfully started the JobServer (for now, I didn't bother with the flink 
job server and instead just kicked off the reference job server) by running the 
following
* ./gradlew :beam-runners-reference-job-server:run
* Confirmed that the JobService started on port 8099
* Started ReferenceRunnerJobService at localhost:8099
   <-> 98% EXECUTING [22m 43s]
   > :beam-runners-reference-job-server:run
   > IDLE

* My code itself is being managed via maven, so I made sure to pull in the 
latest 2.10.0-SNAPSHOT dependencies as per 
https://repository.apache.org/content/repositories/snapshots/

* The pipeline is kicked off with the following options:
 * --runner=PortableRunner --jobEndpoint=localhost:8099 --streaming=true
 * Running the pipeline results in the following error
   * java.lang.IllegalArgumentException: Unknown 'runner' specified 
'PortableRunner', supported pipeline runners [DirectRunner, FlinkRunner, 
SparkRunner, TestFlinkRunner, TestSparkRunner]



I am sure I am missing something basic but I was hoping I could get ideas on 
what it might be?



Re: Join PCollection Data with HBase Large Data - Suggestion Requested

2018-12-04 Thread Lukasz Cwik
Is HBase only updated by the output within your pipeline, or can an external
system also update the HBase data? If not, then querying HBase within
processElement is your best bet, since you're effectively trying to do a
sparse lookup with slowly changing data.
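
A rough sketch of that lookup-in-processElement pattern (the HBase configuration, table name, and dedup logic are placeholders, not details from this thread):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

class LookupHistoryFn extends DoFn<KV<String, String>, KV<String, String>> {
  private transient Connection connection;
  private transient Table table;

  @Setup
  public void setup() throws Exception {
    Configuration conf = HBaseConfiguration.create();           // placeholder config
    connection = ConnectionFactory.createConnection(conf);
    table = connection.getTable(TableName.valueOf("history"));  // placeholder table name
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    Result history = table.get(new Get(Bytes.toBytes(c.element().getKey())));
    // ... compare the incoming value against the stored history and only emit non-duplicates ...
    c.output(c.element());
  }

  @Teardown
  public void teardown() throws Exception {
    if (table != null) { table.close(); }
    if (connection != null) { connection.close(); }
  }
}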



On Tue, Dec 4, 2018 at 11:59 AM Chandan Biswas 
wrote:

> Also I forgot to mention that keys will not be repeating frequently in a
> window.
>
> Thanks,
> Chandan
>
> On Tue, Dec 4, 2018 at 1:49 PM Chandan Biswas 
> wrote:
>
>> Thanks Lukasz and Steve for replying quickly. Sorry for not be clear
>> enough. But my use case is something like Steve mentioned. So when I am
>> reading the data from stream, I need to figure out that the data is coming
>> from stream is not duplicate for the key. So I need to check the all the
>> historical data for that key stored in Hbase and the table size is like
>> 1TB. The output of the processing is stored in the same Hbase table. Please
>> let me know if you need more context.
>>
>> Thanks,
>> Chandan
>>
>> On Tue, Dec 4, 2018 at 11:32 AM Steve Niemitz 
>> wrote:
>>
>>> interesting to know that the state scales so well!
>>>
>>> On Tue, Dec 4, 2018 at 12:21 PM Lukasz Cwik  wrote:
>>>
 Your correct in saying that StatefulDoFn is pointless if you only see
 every key+window once. The users description wasn't exactly clear but it
 seemed to me they were reading from a stream and wanted to store all old
 values that they had seen implying they see keys more then once. The user
 would need to ensure that the windowing strategy triggers more then once
 for my suggestion to be useful (e.g. global window with after element count
 trigger) but without further details my suggestion is a guess.

 Also. the implementation for state storage is Runner dependent but I am
 aware of users storing very large amounts (>> 1 TiB) within state on
 Dataflow and in general scales very well with the number of keys and
 windows.

 On Tue, Dec 4, 2018 at 8:01 AM Steve Niemitz 
 wrote:

> We have a similar use case, except with BigtableIO instead of HBase.
>
> We ended up building a custom transform that was basically
> PCollection[ByteString] -> PCollection[BigtableRow], and would fetch rows
> from Bigtable based on the input, however it's tricky to get right because
> of batching, etc.
>
> I'm curious how a StatefulDoFn would help here, it seems like it's
> more of just a cache than an actual join (and in my use-case we're never
> reading a key more than once so a cache wouldn't help here anyways).  Also
> I'd be interested to see how the state storage performs with "large"
> amounts of state.  We're reading ~1 TB of data from Bigtable in a run, and
> it doesn't seem reasonable to store that all in a DoFn's state.
>
>
>
> On Tue, Dec 4, 2018 at 1:23 AM Lukasz Cwik  wrote:
>
>> What about a StatefulDoFn where you append the value(s) in a bag
>> state as you see them?
>>
>> If you need to seed the state information, you could do a one time
>> lookup in processElement for each key to HBase if the key hasn't yet been
>> seen (storing the fact that you loaded the data in a boolean) but
>> afterwards you would rely on reading the value(s) from the bag state.
>>
>> processElement(...) {
>>   Value newValue = ...
>>   Iterable values;
>>   if (!hasSeenKeyBooleanValueState.read()) {
>> values = ... load from HBase ...
>> valuesBagState.append(values);
>> hasSeenKeyBooleanValueState.set(true);
>>   } else {
>> values = valuesBagState.read();
>>   }
>>   ... perform processing using values ...
>>
>>valuesBagState.append(newValue);
>> }
>>
>> This blog post[1] has a good example.
>>
>> 1: https://beam.apache.org/blog/2017/02/13/stateful-processing.html
>>
>> On Mon, Dec 3, 2018 at 12:48 PM Chandan Biswas <
>> pela.chan...@gmail.com> wrote:
>>
>>> Hello All,
>>> I have a use case where I have PCollection> data
>>> coming from Kafka source. When processing each record (KV) I
>>> need all old values for that Key stored in a hbase table. The naive
>>> approach is to do HBase lookup in the DoFn.processElement. I considered
>>> sideinput but it' not going to work because of large dataset. Any
>>> suggestion?
>>>
>>> Thanks,
>>> Chandan
>>>
>>
>>
>> --
>> Thanks,
>> *Chandan Biswas*
>>
>
>
> --
> Thanks,
> *Chandan Biswas*
>


Re: 2019 Beam Events

2018-12-04 Thread Daniel Salerno
=)
What good news!
Okay, I'll set up the group and try to get people interested.
Thank you


On Tue, Dec 4, 2018 at 17:19, Pablo Estrada 
wrote:

> FWIW, for some of these places that have interest (e.g. Brazil, Israel),
> it's possible to create a group in meetup.com, and start gauging
> interest, and looking for organizers.
> Once a group of people with interest exists, it's easier to get interest /
> sponsorship to bring speakers.
> So if you are willing to create the group in meetup, Daniel, we can
> monitor it and try to plan something as it grows : )
> Best
> -P.
>
> On Tue, Dec 4, 2018 at 10:55 AM Daniel Salerno 
> wrote:
>
>>
>> It's a shame that there are no events in Brazil ...
>>
>> =(
>>
>> Em ter, 4 de dez de 2018 às 13:12, OrielResearch Eila Arich-Landkof <
>> e...@orielresearch.org> escreveu:
>>
>>> agree 👍
>>>
>>> On Tue, Dec 4, 2018 at 5:41 AM Chaim Turkel  wrote:
>>>
 Israel would be nice to have one
 chaim
 On Tue, Dec 4, 2018 at 12:33 AM Griselda Cuevas 
 wrote:
 >
 > Hi Beam Community,
 >
 > I started curating industry conferences, meetups and events that are
 relevant for Beam, this initial list I came up with. I'd love your help
 adding others that I might have overlooked. Once we're satisfied with the
 list, let's re-share so we can coordinate proposal submissions, attendance
 and community meetups there.
 >
 >
 > Cheers,
 >
 > G
 >
 >
 >

 --



>>>
>>>
>>> --
>>> Eila
>>> www.orielresearch.org
>>> https://www.meetup.com/Deep-Learning-In-Production/
>>>
>>>
>>>


Re: Join PCollection Data with HBase Large Data - Suggestion Requested

2018-12-04 Thread Chandan Biswas
Also I forgot to mention that keys will not be repeating frequently in a
window.

Thanks,
Chandan

On Tue, Dec 4, 2018 at 1:49 PM Chandan Biswas 
wrote:

> Thanks Lukasz and Steve for replying quickly. Sorry for not be clear
> enough. But my use case is something like Steve mentioned. So when I am
> reading the data from stream, I need to figure out that the data is coming
> from stream is not duplicate for the key. So I need to check the all the
> historical data for that key stored in Hbase and the table size is like
> 1TB. The output of the processing is stored in the same Hbase table. Please
> let me know if you need more context.
>
> Thanks,
> Chandan
>
> On Tue, Dec 4, 2018 at 11:32 AM Steve Niemitz 
> wrote:
>
>> interesting to know that the state scales so well!
>>
>> On Tue, Dec 4, 2018 at 12:21 PM Lukasz Cwik  wrote:
>>
>>> Your correct in saying that StatefulDoFn is pointless if you only see
>>> every key+window once. The users description wasn't exactly clear but it
>>> seemed to me they were reading from a stream and wanted to store all old
>>> values that they had seen implying they see keys more then once. The user
>>> would need to ensure that the windowing strategy triggers more then once
>>> for my suggestion to be useful (e.g. global window with after element count
>>> trigger) but without further details my suggestion is a guess.
>>>
>>> Also. the implementation for state storage is Runner dependent but I am
>>> aware of users storing very large amounts (>> 1 TiB) within state on
>>> Dataflow and in general scales very well with the number of keys and
>>> windows.
>>>
>>> On Tue, Dec 4, 2018 at 8:01 AM Steve Niemitz 
>>> wrote:
>>>
 We have a similar use case, except with BigtableIO instead of HBase.

 We ended up building a custom transform that was basically
 PCollection[ByteString] -> PCollection[BigtableRow], and would fetch rows
 from Bigtable based on the input, however it's tricky to get right because
 of batching, etc.

 I'm curious how a StatefulDoFn would help here, it seems like it's more
 of just a cache than an actual join (and in my use-case we're never reading
 a key more than once so a cache wouldn't help here anyways).  Also I'd be
 interested to see how the state storage performs with "large" amounts of
 state.  We're reading ~1 TB of data from Bigtable in a run, and it doesn't
 seem reasonable to store that all in a DoFn's state.



 On Tue, Dec 4, 2018 at 1:23 AM Lukasz Cwik  wrote:

> What about a StatefulDoFn where you append the value(s) in a bag state
> as you see them?
>
> If you need to seed the state information, you could do a one time
> lookup in processElement for each key to HBase if the key hasn't yet been
> seen (storing the fact that you loaded the data in a boolean) but
> afterwards you would rely on reading the value(s) from the bag state.
>
> processElement(...) {
>   Value newValue = ...
>   Iterable values;
>   if (!hasSeenKeyBooleanValueState.read()) {
> values = ... load from HBase ...
> valuesBagState.append(values);
> hasSeenKeyBooleanValueState.set(true);
>   } else {
> values = valuesBagState.read();
>   }
>   ... perform processing using values ...
>
>valuesBagState.append(newValue);
> }
>
> This blog post[1] has a good example.
>
> 1: https://beam.apache.org/blog/2017/02/13/stateful-processing.html
>
> On Mon, Dec 3, 2018 at 12:48 PM Chandan Biswas 
> wrote:
>
>> Hello All,
>> I have a use case where I have PCollection> data coming
>> from Kafka source. When processing each record (KV) I need all
>> old values for that Key stored in a hbase table. The naive approach is to
>> do HBase lookup in the DoFn.processElement. I considered sideinput but 
>> it'
>> not going to work because of large dataset. Any suggestion?
>>
>> Thanks,
>> Chandan
>>
>
>
> --
> Thanks,
> *Chandan Biswas*
>


-- 
Thanks,
*Chandan Biswas*


Re: Join PCollection Data with HBase Large Data - Suggestion Requested

2018-12-04 Thread Chandan Biswas
Thanks Lukasz and Steve for replying quickly. Sorry for not being clear
enough. My use case is something like Steve mentioned: when I am reading
the data from the stream, I need to figure out whether the data coming from
the stream is a duplicate for the key. So I need to check all the
historical data for that key stored in HBase, and the table size is around
1 TB. The output of the processing is stored in the same HBase table. Please
let me know if you need more context.

Thanks,
Chandan

On Tue, Dec 4, 2018 at 11:32 AM Steve Niemitz  wrote:

> interesting to know that the state scales so well!
>
> On Tue, Dec 4, 2018 at 12:21 PM Lukasz Cwik  wrote:
>
>> Your correct in saying that StatefulDoFn is pointless if you only see
>> every key+window once. The users description wasn't exactly clear but it
>> seemed to me they were reading from a stream and wanted to store all old
>> values that they had seen implying they see keys more then once. The user
>> would need to ensure that the windowing strategy triggers more then once
>> for my suggestion to be useful (e.g. global window with after element count
>> trigger) but without further details my suggestion is a guess.
>>
>> Also. the implementation for state storage is Runner dependent but I am
>> aware of users storing very large amounts (>> 1 TiB) within state on
>> Dataflow and in general scales very well with the number of keys and
>> windows.
>>
>> On Tue, Dec 4, 2018 at 8:01 AM Steve Niemitz  wrote:
>>
>>> We have a similar use case, except with BigtableIO instead of HBase.
>>>
>>> We ended up building a custom transform that was basically
>>> PCollection[ByteString] -> PCollection[BigtableRow], and would fetch rows
>>> from Bigtable based on the input, however it's tricky to get right because
>>> of batching, etc.
>>>
>>> I'm curious how a StatefulDoFn would help here, it seems like it's more
>>> of just a cache than an actual join (and in my use-case we're never reading
>>> a key more than once so a cache wouldn't help here anyways).  Also I'd be
>>> interested to see how the state storage performs with "large" amounts of
>>> state.  We're reading ~1 TB of data from Bigtable in a run, and it doesn't
>>> seem reasonable to store that all in a DoFn's state.
>>>
>>>
>>>
>>> On Tue, Dec 4, 2018 at 1:23 AM Lukasz Cwik  wrote:
>>>
 What about a StatefulDoFn where you append the value(s) in a bag state
 as you see them?

 If you need to seed the state information, you could do a one time
 lookup in processElement for each key to HBase if the key hasn't yet been
 seen (storing the fact that you loaded the data in a boolean) but
 afterwards you would rely on reading the value(s) from the bag state.

 processElement(...) {
   Value newValue = ...
   Iterable values;
   if (!hasSeenKeyBooleanValueState.read()) {
 values = ... load from HBase ...
 valuesBagState.append(values);
 hasSeenKeyBooleanValueState.set(true);
   } else {
 values = valuesBagState.read();
   }
   ... perform processing using values ...

valuesBagState.append(newValue);
 }

 This blog post[1] has a good example.

 1: https://beam.apache.org/blog/2017/02/13/stateful-processing.html

 On Mon, Dec 3, 2018 at 12:48 PM Chandan Biswas 
 wrote:

> Hello All,
> I have a use case where I have PCollection> data coming
> from Kafka source. When processing each record (KV) I need all
> old values for that Key stored in a hbase table. The naive approach is to
> do HBase lookup in the DoFn.processElement. I considered sideinput but it'
> not going to work because of large dataset. Any suggestion?
>
> Thanks,
> Chandan
>


-- 
Thanks,
*Chandan Biswas*


Re: Dynamic Naming of file using KV in IO

2018-12-04 Thread Chamikara Jayalath
Not sure if I fully understood your question, but would it be possible to
send the PCollection that you read from XmlIO through a ParDo to generate
the PCollection<KV<String, XmlDTO>> that you need?

If this doesn't work, you can also try reading XML files directly from a
ParDo and generating the PCollection that you need.
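
A minimal sketch of that first suggestion (XmlDTO and its getKey() accessor are hypothetical placeholders for however the destination name is derived):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

class KeyByName {
  static PCollection<KV<String, XmlDTO>> keyByName(PCollection<XmlDTO> xmlRecords) {
    return xmlRecords.apply("KeyByName", ParDo.of(new DoFn<XmlDTO, KV<String, XmlDTO>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        XmlDTO dto = c.element();
        c.output(KV.of(dto.getKey(), dto));  // getKey() is a placeholder
      }
    }));
  }
}

The downstream write could then group or partition by that key to drive the file naming.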

Thanks,
Cham


On Mon, Dec 3, 2018 at 11:49 AM Vinay Patil  wrote:

> Hi,
>
> I need some help regarding dynamic naming for XML with a KV PCollection.
>
> PCollection<KV<String, XmlDTO>> xmlCollection = ….
>
> I am not able to use XmlIO for this PCollection.
> XmlDTO is actually the DTO being marshalled, and String is the key.
>
> I tried using KV but XmlIO needs a Class type, KV.getClass does not work…
> I need to get the Key for distribution and XmlDTO does not have it.
>
> Any suggestions?
>
> Regards,
> Vinay Patil
>


Re: 2019 Beam Events

2018-12-04 Thread Pablo Estrada
FWIW, for some of these places that have interest (e.g. Brazil, Israel),
it's possible to create a group in meetup.com, and start gauging interest,
and looking for organizers.
Once a group of people with interest exists, it's easier to get interest /
sponsorship to bring speakers.
So if you are willing to create the group in meetup, Daniel, we can monitor
it and try to plan something as it grows : )
Best
-P.

On Tue, Dec 4, 2018 at 10:55 AM Daniel Salerno 
wrote:

>
> It's a shame that there are no events in Brazil ...
>
> =(
>
> Em ter, 4 de dez de 2018 às 13:12, OrielResearch Eila Arich-Landkof <
> e...@orielresearch.org> escreveu:
>
>> agree 👍
>>
>> On Tue, Dec 4, 2018 at 5:41 AM Chaim Turkel  wrote:
>>
>>> Israel would be nice to have one
>>> chaim
>>> On Tue, Dec 4, 2018 at 12:33 AM Griselda Cuevas  wrote:
>>> >
>>> > Hi Beam Community,
>>> >
>>> > I started curating industry conferences, meetups and events that are
>>> relevant for Beam, this initial list I came up with. I'd love your help
>>> adding others that I might have overlooked. Once we're satisfied with the
>>> list, let's re-share so we can coordinate proposal submissions, attendance
>>> and community meetups there.
>>> >
>>> >
>>> > Cheers,
>>> >
>>> > G
>>> >
>>> >
>>> >
>>>
>>> --
>>>
>>>
>>>
>>
>>
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/
>>
>>
>>


Re: 2019 Beam Events

2018-12-04 Thread Daniel Salerno
It's a shame that there are no events in Brazil ...

=(

On Tue, Dec 4, 2018 at 13:12, OrielResearch Eila Arich-Landkof <
e...@orielresearch.org> wrote:

> agree 👍
>
> On Tue, Dec 4, 2018 at 5:41 AM Chaim Turkel  wrote:
>
>> Israel would be nice to have one
>> chaim
>> On Tue, Dec 4, 2018 at 12:33 AM Griselda Cuevas  wrote:
>> >
>> > Hi Beam Community,
>> >
>> > I started curating industry conferences, meetups and events that are
>> relevant for Beam, this initial list I came up with. I'd love your help
>> adding others that I might have overlooked. Once we're satisfied with the
>> list, let's re-share so we can coordinate proposal submissions, attendance
>> and community meetups there.
>> >
>> >
>> > Cheers,
>> >
>> > G
>> >
>> >
>> >
>>
>> --
>>
>>
>>
>
>
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
>
>
>


Beam Metrics using FlinkRunner

2018-12-04 Thread Phil Franklin
I’m having difficulty accessing Beam metrics when using FlinkRunner in 
streaming mode. I don’t get any metrics from MetricsPusher, though the same 
setup delivered metrics from SparkRunner.  Probably for the same reason that 
MetricsPusher doesn’t work, I also don’t get any output when I call an instance 
of MetricsHttpSink directly.  The problem seems to be that Flink never returns 
from pipeline.run(), an issue that others have referred to as FlinkRunner 
hanging.  

Is there a solution for getting metrics in this case that I’m missing?

Thanks!
-Phil
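
For reference, the runner-agnostic way to pull metrics once a PipelineResult handle exists looks roughly like the sketch below; it presupposes that pipeline.run() actually returns, which is exactly the part that hangs here:

import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.MetricsFilter;

class MetricsDump {
  static void dumpCounters(PipelineResult result) {
    MetricQueryResults metrics =
        result.metrics().queryMetrics(MetricsFilter.builder().build());
    for (MetricResult<Long> counter : metrics.getCounters()) {
      // getAttempted() is always available; getCommitted() may be unsupported on some runners.
      System.out.println(counter.getName() + " = " + counter.getAttempted());
    }
  }
}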

Re: bean elasticsearch connector for dataflow

2018-12-04 Thread Adeel Ahmad
Hello,

Thanks - just seen this: updated support for ES 6.3.2...

https://beam.apache.org/blog/2018/10/29/beam-2.8.0.html

Adeel


On Tue, 4 Dec 2018 at 17:43, Tim  wrote:

> Beam 2.8.0 brought in support for ES 6.3.x
>
> I’m not sure if that works against a 6.5.x server but I could imagine it
> does.
>
> Tim,
> Sent from my iPhone
>
> On 4 Dec 2018, at 18:28, Adeel Ahmad  wrote:
>
> Hello,
>
> I am trying to use gcp dataflow for indexing data from pubsub into
> elasticsearch.
> Does dataflow (which uses beam) now support elasticsearch 6.5.x or does it
> still only support 5.6.x?
>
> --
> Thanks,
>
> Adeel
>
>
>
>

-- 
Thanks,

Adeel Ahmad
m: (+44) 7721724715
e: aahmad1...@gmail.com


Re: bean elasticsearch connector for dataflow

2018-12-04 Thread Tim
Beam 2.8.0 brought in support for ES 6.3.x

I’m not sure if that works against a 6.5.x server but I could imagine it does.

Tim,
Sent from my iPhone
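
For reference, a minimal write sketch with the built-in connector (the host, index, and type names are placeholders):

import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
import org.apache.beam.sdk.values.PCollection;

class EsWriteExample {
  static void writeToEs(PCollection<String> jsonDocuments) {
    // Each element is expected to already be a JSON document string.
    jsonDocuments.apply(
        ElasticsearchIO.write()
            .withConnectionConfiguration(
                ElasticsearchIO.ConnectionConfiguration.create(
                    new String[] {"http://es-host:9200"}, "my-index", "my-type")));
  }
}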

> On 4 Dec 2018, at 18:28, Adeel Ahmad  wrote:
> 
> Hello,
> 
> I am trying to use gcp dataflow for indexing data from pubsub into 
> elasticsearch.
> Does dataflow (which uses beam) now support elasticsearch 6.5.x or does it 
> still only support 5.6.x?
> 
> -- 
> Thanks,
> 
> Adeel
> 
> 
> 


Re: Join PCollection Data with HBase Large Data - Suggestion Requested

2018-12-04 Thread Steve Niemitz
Interesting to know that the state scales so well!

On Tue, Dec 4, 2018 at 12:21 PM Lukasz Cwik  wrote:

> Your correct in saying that StatefulDoFn is pointless if you only see
> every key+window once. The users description wasn't exactly clear but it
> seemed to me they were reading from a stream and wanted to store all old
> values that they had seen implying they see keys more then once. The user
> would need to ensure that the windowing strategy triggers more then once
> for my suggestion to be useful (e.g. global window with after element count
> trigger) but without further details my suggestion is a guess.
>
> Also. the implementation for state storage is Runner dependent but I am
> aware of users storing very large amounts (>> 1 TiB) within state on
> Dataflow and in general scales very well with the number of keys and
> windows.
>
> On Tue, Dec 4, 2018 at 8:01 AM Steve Niemitz  wrote:
>
>> We have a similar use case, except with BigtableIO instead of HBase.
>>
>> We ended up building a custom transform that was basically
>> PCollection[ByteString] -> PCollection[BigtableRow], and would fetch rows
>> from Bigtable based on the input, however it's tricky to get right because
>> of batching, etc.
>>
>> I'm curious how a StatefulDoFn would help here, it seems like it's more
>> of just a cache than an actual join (and in my use-case we're never reading
>> a key more than once so a cache wouldn't help here anyways).  Also I'd be
>> interested to see how the state storage performs with "large" amounts of
>> state.  We're reading ~1 TB of data from Bigtable in a run, and it doesn't
>> seem reasonable to store that all in a DoFn's state.
>>
>>
>>
>> On Tue, Dec 4, 2018 at 1:23 AM Lukasz Cwik  wrote:
>>
>>> What about a StatefulDoFn where you append the value(s) in a bag state
>>> as you see them?
>>>
>>> If you need to seed the state information, you could do a one time
>>> lookup in processElement for each key to HBase if the key hasn't yet been
>>> seen (storing the fact that you loaded the data in a boolean) but
>>> afterwards you would rely on reading the value(s) from the bag state.
>>>
>>> processElement(...) {
>>>   Value newValue = ...
>>>   Iterable values;
>>>   if (!hasSeenKeyBooleanValueState.read()) {
>>> values = ... load from HBase ...
>>> valuesBagState.append(values);
>>> hasSeenKeyBooleanValueState.set(true);
>>>   } else {
>>> values = valuesBagState.read();
>>>   }
>>>   ... perform processing using values ...
>>>
>>>valuesBagState.append(newValue);
>>> }
>>>
>>> This blog post[1] has a good example.
>>>
>>> 1: https://beam.apache.org/blog/2017/02/13/stateful-processing.html
>>>
>>> On Mon, Dec 3, 2018 at 12:48 PM Chandan Biswas 
>>> wrote:
>>>
 Hello All,
 I have a use case where I have PCollection> data coming
 from Kafka source. When processing each record (KV) I need all
 old values for that Key stored in a hbase table. The naive approach is to
 do HBase lookup in the DoFn.processElement. I considered sideinput but it'
 not going to work because of large dataset. Any suggestion?

 Thanks,
 Chandan

>>>


bean elasticsearch connector for dataflow

2018-12-04 Thread Adeel Ahmad
Hello,

I am trying to use gcp dataflow for indexing data from pubsub into
elasticsearch.
Does dataflow (which uses beam) now support elasticsearch 6.5.x or does it
still only support 5.6.x?

-- 
Thanks,

Adeel


Re: Join PCollection Data with HBase Large Data - Suggestion Requested

2018-12-04 Thread Lukasz Cwik
You're correct in saying that a StatefulDoFn is pointless if you only see every
key+window once. The user's description wasn't exactly clear, but it seemed
to me they were reading from a stream and wanted to store all old values
that they had seen, implying they see keys more than once. The user would
need to ensure that the windowing strategy triggers more than once for my
suggestion to be useful (e.g. a global window with an after-element-count
trigger), but without further details my suggestion is a guess.

Also, the implementation for state storage is runner-dependent, but I am
aware of users storing very large amounts (>> 1 TiB) within state on
Dataflow, and it in general scales very well with the number of keys and
windows.
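
A minimal sketch of that bag-state pattern (the key/value types and the HBase seeding call are placeholders, not the user's actual schema):

import org.apache.beam.sdk.coders.BooleanCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class AccumulateHistoryFn extends DoFn<KV<String, String>, KV<String, String>> {

  @StateId("seeded")
  private final StateSpec<ValueState<Boolean>> seededSpec = StateSpecs.value(BooleanCoder.of());

  @StateId("history")
  private final StateSpec<BagState<String>> historySpec = StateSpecs.bag(StringUtf8Coder.of());

  @ProcessElement
  public void processElement(
      ProcessContext c,
      @StateId("seeded") ValueState<Boolean> seeded,
      @StateId("history") BagState<String> history) {
    if (!Boolean.TRUE.equals(seeded.read())) {
      // One-time seed from HBase for this key; loadFromHBase is a placeholder.
      for (String old : loadFromHBase(c.element().getKey())) {
        history.add(old);
      }
      seeded.write(true);
    }
    Iterable<String> values = history.read();
    // ... perform processing using values (e.g. drop the element if it is a duplicate) ...
    history.add(c.element().getValue());
    c.output(c.element());
  }

  private Iterable<String> loadFromHBase(String key) {
    // Placeholder for the actual HBase Get/Scan.
    return java.util.Collections.emptyList();
  }
}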

On Tue, Dec 4, 2018 at 8:01 AM Steve Niemitz  wrote:

> We have a similar use case, except with BigtableIO instead of HBase.
>
> We ended up building a custom transform that was basically
> PCollection[ByteString] -> PCollection[BigtableRow], and would fetch rows
> from Bigtable based on the input, however it's tricky to get right because
> of batching, etc.
>
> I'm curious how a StatefulDoFn would help here, it seems like it's more of
> just a cache than an actual join (and in my use-case we're never reading a
> key more than once so a cache wouldn't help here anyways).  Also I'd be
> interested to see how the state storage performs with "large" amounts of
> state.  We're reading ~1 TB of data from Bigtable in a run, and it doesn't
> seem reasonable to store that all in a DoFn's state.
>
>
>
> On Tue, Dec 4, 2018 at 1:23 AM Lukasz Cwik  wrote:
>
>> What about a StatefulDoFn where you append the value(s) in a bag state as
>> you see them?
>>
>> If you need to seed the state information, you could do a one time lookup
>> in processElement for each key to HBase if the key hasn't yet been seen
>> (storing the fact that you loaded the data in a boolean) but afterwards you
>> would rely on reading the value(s) from the bag state.
>>
>> processElement(...) {
>>   Value newValue = ...
>>   Iterable values;
>>   if (!hasSeenKeyBooleanValueState.read()) {
>> values = ... load from HBase ...
>> valuesBagState.append(values);
>> hasSeenKeyBooleanValueState.set(true);
>>   } else {
>> values = valuesBagState.read();
>>   }
>>   ... perform processing using values ...
>>
>>valuesBagState.append(newValue);
>> }
>>
>> This blog post[1] has a good example.
>>
>> 1: https://beam.apache.org/blog/2017/02/13/stateful-processing.html
>>
>> On Mon, Dec 3, 2018 at 12:48 PM Chandan Biswas 
>> wrote:
>>
>>> Hello All,
>>> I have a use case where I have PCollection> data coming
>>> from Kafka source. When processing each record (KV) I need all
>>> old values for that Key stored in a hbase table. The naive approach is to
>>> do HBase lookup in the DoFn.processElement. I considered sideinput but it'
>>> not going to work because of large dataset. Any suggestion?
>>>
>>> Thanks,
>>> Chandan
>>>
>>


Re: Generic Type PTransform

2018-12-04 Thread Lukasz Cwik
There are various strategies that you can use depending on what you know
with the worst case being that you have to ask the person using the
PTransform to give you a K and V coder or a concrete type descriptor for K
and V which would allow you to get the coder from the coder registry.

The Apache Beam SDK has used the following strategies to solve this problem
internally (on a case by case basis):
* Get the coder from the PCollection that is supplied as input, since it can
encode the custom element type and hence some part of it
knows how to encode the K and V types
* If you know that K and V will always be from a class of types (Java
Serializable, Proto message subclass, ...), then you can use a generic
coder for that class of types (SerializableCoder, ProtoCoder, ...)
* Ask the user to explicitly provide a coder (see getDefaultOutputCoder,
getAccumulatorCoder).
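
A rough sketch of the last strategy above (asking the caller for explicit coders; the class and method names here are illustrative, not from the original code):

import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.IterableCoder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class PlatformGroupByKey<K, V>
    extends PTransform<PCollection<KV<K, V>>, PCollection<KV<K, Iterable<V>>>> {

  private final Coder<K> keyCoder;     // supplied by the caller
  private final Coder<V> valueCoder;   // supplied by the caller

  public PlatformGroupByKey(Coder<K> keyCoder, Coder<V> valueCoder) {
    this.keyCoder = keyCoder;
    this.valueCoder = valueCoder;
  }

  @Override
  public PCollection<KV<K, Iterable<V>>> expand(PCollection<KV<K, V>> input) {
    return input
        .setCoder(KvCoder.of(keyCoder, valueCoder))  // pin the coder instead of relying on inference
        .apply(GroupByKey.<K, V>create())
        .setCoder(KvCoder.of(keyCoder, IterableCoder.of(valueCoder)));
  }
}

The caller could then pass, for example, SerializableCoder.of(MyKey.class) and SerializableCoder.of(MyValue.class) for hypothetical serializable user types when constructing the transform.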


On Tue, Dec 4, 2018 at 2:07 AM Eran Twili 
wrote:

> Thanks Lukasz,
>
>
>
> Please tell me, how can I set a coder on the *PCollection* created after
> the "MapToKV" apply?
>
> I mean, all I know is that it will be a *PCollection<KV<K, V>>*, and I
> don't know what will be the actual runtime types of K and V.
>
> So, what coder should I set? Can you please give a code example of how to
> do that?
>
>
>
> Really appriciate your help,
>
> Eran
>
>
>
> *From:* Lukasz Cwik [mailto:lc...@google.com]
> *Sent:* Monday, December 03, 2018 7:10 PM
> *To:* user@beam.apache.org
> *Subject:* Re: Generic Type PTransform
>
>
>
> Apache Beam attempts to propagate coders through by looking at any typing
> information available but because Java has a lot of type erasure and there
> are many scenarios where these coders can't be propagated forward from a
> previous transform.
>
>
>
> Take the following two examples (note that there are many subtle
> variations that can give different results):
>
> List<String> erasedType = new ArrayList<String>();  // type is lost
>
> List<String> keptType = new ArrayList<String>() {};  // type is kept because of
> anonymous inner class being declared
>
> In the first the type is erased and in the second the type information is
> available. I would suggest
>
>
>
> In your case we can't infer what K and what V are because after the code
> compiles the types have been erased hence the error message. To immediately
> fix the problem, you'll want to set the coder on the PCollection created
> after you apply the "MapToKV" transform (you might need to do it on the
> "MapToSimpleImmutableEntry" transform as well).
>
>
>
> If you want to get into the details, take a look at they CoderRegistry[1]
> as it contains the type inference / propagation code.
>
>
>
> Finally, this not an uncommon problem that users face and it seems as
> though the error message that is given doesn't make sense so feel free to
> propose changes in the error messages to help others such as yourself.
>
>
>
> 1:
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/CoderRegistry.java
>
>
>
> On Sun, Dec 2, 2018 at 10:54 PM Matt Casters 
> wrote:
>
> There are probably smarter people than me on this list but since I
> recently been through a similar thought exercise...
>
>
>
> For the generic use in Kettle I have a PCollection<KettleRow> going
> through the pipeline.
> KettleRow is just an Object[] wrapper for which I can implement a Coder.
>
> The "group by" that I implemented does the following: split
> PCollection<KettleRow> into PCollection<KV<KettleRow, KettleRow>>.
> Then it applies the standard GroupByKey.create(), giving us
> PCollection<KV<KettleRow, Iterable<KettleRow>>>.
> This means that we can simply aggregate all the elements in
> Iterable<KettleRow> to aggregate a group.
>
> Well, at least that works for me. The code is open so look at it over here:
>
> https://github.com/mattcasters/kettle-beam-core/blob/master/src/main/java/org/kettle/beam/core/transform/GroupByTransform.java
>
> Like you I had trouble with the Coder for my KettleRows so I hacked up
> this to make it work:
>
>
> https://github.com/mattcasters/kettle-beam-core/blob/master/src/main/java/org/kettle/beam/core/coder/KettleRowCoder.java
>
>
>
> It's set on the pipeline:
> pipeline.getCoderRegistry().registerCoderForClass( KettleRow.class, new
> KettleRowCoder());
>
>
> Good luck!
>
> Matt
>
>
>
> Op zo 2 dec. 2018 om 20:57 schreef Eran Twili  >:
>
> Hi,
>
>
>
> We are considering using Beam in our software.
>
> We wish to create a service for a user which will operate Beam for him,
> and obviously the user code doesn't have Beam API visibility.
>
> For that we need to generify some Beam API.
>
> So the user supply functions and we embed them in a generic *PTransform*
> and run them in a Beam pipeline.
>
> We have some difficulties to understand how can we provide the user with
> option to perform *GroupByKey* operation.
>
> The problem is that *GroupByKey* takes *KV* and our *PCollections* holds
> only user datatypes which should not be Beam datatypes.
>
> So we thought about having this *PTransform*:
>
> public class PlatformGroupByKey extends
> PTransform>>,
> PCollection {
> @Overri

Re: No Translator Found issue

2018-12-04 Thread Vinay Patil
Thank you Juan, adding the following worked for me:
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>

Regards,
Vinay Patil


On Mon, Dec 3, 2018 at 11:21 PM Juan Carlos Garcia 
wrote:

> Hi Vinay,
>
> When generating your Fatjar make sure you are merging the service files
> (META-INF/services) of your dependencies.
>
> Apache Beam relies heavily on the Java service locator to discover /
> register its components.
>
> JC
>
>
> Am Di., 4. Dez. 2018, 03:18 hat Vinay Patil 
> geschrieben:
>
>> Hi,
>>
>> I am using Beam 2.8.0 version. When I submit pipeline to Flink 1.5.2
>> cluster, I am getting the following exception:
>>
>> Caused by: java.lang.IllegalStateException: No translator known for
>> org.apache.beam.sdk.io.Read$Bounded
>>
>> Can you please let me know what could be the problem?
>>
>> Regards,
>> Vinay Patil
>>
>


Re: Join PCollection Data with HBase Large Data - Suggestion Requested

2018-12-04 Thread Steve Niemitz
We have a similar use case, except with BigtableIO instead of HBase.

We ended up building a custom transform that was basically
PCollection[ByteString] -> PCollection[BigtableRow], and would fetch rows
from Bigtable based on the input, however it's tricky to get right because
of batching, etc.

I'm curious how a StatefulDoFn would help here, it seems like it's more of
just a cache than an actual join (and in my use-case we're never reading a
key more than once so a cache wouldn't help here anyways).  Also I'd be
interested to see how the state storage performs with "large" amounts of
state.  We're reading ~1 TB of data from Bigtable in a run, and it doesn't
seem reasonable to store that all in a DoFn's state.



On Tue, Dec 4, 2018 at 1:23 AM Lukasz Cwik  wrote:

> What about a StatefulDoFn where you append the value(s) in a bag state as
> you see them?
>
> If you need to seed the state information, you could do a one time lookup
> in processElement for each key to HBase if the key hasn't yet been seen
> (storing the fact that you loaded the data in a boolean) but afterwards you
> would rely on reading the value(s) from the bag state.
>
> processElement(...) {
>   Value newValue = ...
>   Iterable values;
>   if (!hasSeenKeyBooleanValueState.read()) {
> values = ... load from HBase ...
> valuesBagState.append(values);
> hasSeenKeyBooleanValueState.set(true);
>   } else {
> values = valuesBagState.read();
>   }
>   ... perform processing using values ...
>
>valuesBagState.append(newValue);
> }
>
> This blog post[1] has a good example.
>
> 1: https://beam.apache.org/blog/2017/02/13/stateful-processing.html
>
> On Mon, Dec 3, 2018 at 12:48 PM Chandan Biswas 
> wrote:
>
>> Hello All,
>> I have a use case where I have PCollection> data coming
>> from Kafka source. When processing each record (KV) I need all
>> old values for that Key stored in a hbase table. The naive approach is to
>> do HBase lookup in the DoFn.processElement. I considered sideinput but it'
>> not going to work because of large dataset. Any suggestion?
>>
>> Thanks,
>> Chandan
>>
>


Re: 2019 Beam Events

2018-12-04 Thread OrielResearch Eila Arich-Landkof
agree 👍

On Tue, Dec 4, 2018 at 5:41 AM Chaim Turkel  wrote:

> Israel would be nice to have one
> chaim
> On Tue, Dec 4, 2018 at 12:33 AM Griselda Cuevas  wrote:
> >
> > Hi Beam Community,
> >
> > I started curating industry conferences, meetups and events that are
> relevant for Beam, this initial list I came up with. I'd love your help
> adding others that I might have overlooked. Once we're satisfied with the
> list, let's re-share so we can coordinate proposal submissions, attendance
> and community meetups there.
> >
> >
> > Cheers,
> >
> > G
> >
> >
> >
>
> --
>
>
>


-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/


Re: Latin America Community

2018-12-04 Thread Eryx
Hi Leonardo,

I'm Héctor Eryx from Guadalajara, México. I'm currently using Beam for
personal projects, plus giving local communities some training/mentoring on
how to use it.

Also, I'm in touch with some friends at IBM Mexico who are using Beam to
run data storage events analysis.

We are few, but we are strong 💪, hehe.

Kind regards,
Héctor Eryx Paredes Camacho

On Tue, Dec 4, 2018, 6:20 a.m., Leonardo Miguel <
leonardo.mig...@arquivei.com.br> wrote:

> Hi guys,
>
> Just want to check if there is someone using Beam and/or Scio at this side
> of the globe.
> I'd like to know also if there is any event near or some related community.
> If you are using Beam and/or Scio, please let me know.
>
> Let me start first:
> I'm located at Sao Carlos, Sao Paulo, Brazil.
> We use Beam and Scio running on Google Dataflow to serve data products
> (streaming and batch) over fiscal documents.
>
> Thanks!
>
>
> --
> []s
>
> Leonardo Alves Miguel
> Data Engineer
> (16) 3509-5515 | www.arquivei.com.br
> 
> 
> 
>


Re: 2019 Beam Events

2018-12-04 Thread Dan
The next Pentaho London meetup has a presentation on using Kettle with Beam:

https://www.meetup.com/Pentaho-London-User-Group/events/256773962/

Thanks,
Dan

On Tue, 4 Dec 2018 at 13:20, Maximilian Michels  wrote:

> Thanks for sharing, Gris! This list will likely never be complete, as
> there are endless conferences :)
>
> Nevertheless, it's a great idea to coordinate the attendance for the
> major ones.
>
> Cheers,
> Max
>
> On 03.12.18 23:33, Griselda Cuevas wrote:
> > Hi Beam Community,
> >
> > I started curating industry conferences, meetups and events that are
> > relevant for Beam, this initial list I came up with
> > <
> https://docs.google.com/spreadsheets/d/1CloF63FOKSPM6YIuu8eExjhX6xrIiOp5j4zPbSg3Apo/edit#gid=0>.
>
> > *I'd love your help adding others that I might have overlooked.* Once
> > we're satisfied with the list, let's re-share so we can coordinate
> > proposal submissions, attendance and community meetups there.
> >
> >
> > Cheers,
> >
> > G
> >
> >
> >
>


Re: 2019 Beam Events

2018-12-04 Thread Maximilian Michels
Thanks for sharing, Gris! This list will likely never be complete, as 
there are endless conferences :)


Nevertheless, it's a great idea to coordinate the attendance for the 
major ones.


Cheers,
Max

On 03.12.18 23:33, Griselda Cuevas wrote:

Hi Beam Community,

I started curating industry conferences, meetups and events that are 
relevant for Beam, this initial list I came up with
<https://docs.google.com/spreadsheets/d/1CloF63FOKSPM6YIuu8eExjhX6xrIiOp5j4zPbSg3Apo/edit#gid=0>.
*I'd love your help adding others that I might have overlooked.* Once 
we're satisfied with the list, let's re-share so we can coordinate 
proposal submissions, attendance and community meetups there.



Cheers,

G





Re: Graceful shutdown of long-running Beam pipeline on Flink

2018-12-04 Thread Maximilian Michels

Thank you for sharing these, Lukasz!

Great question, Wayne!

As for pipeline shutdown, Flink users typically take a snapshot and then 
cancel the pipeline with Flink tools.


The Beam tooling needs to be improved to support cancelling as well. If 
snapshotting is enabled, the Beam job could also be restored from a 
snapshot instead of explicitly taking a savepoint.


Related issue for cancelling: 
https://issues.apache.org/jira/browse/BEAM-593 I think we should address 
this soon for the next release.


Thanks,
Max


On 03.12.18 17:53, Lukasz Cwik wrote:
There are proposals for pipeline drain[1] and also for snapshot and
update[2] for Apache Beam. We would love contributions in this space.


1: 
https://docs.google.com/document/d/1NExwHlj-2q2WUGhSO4jTu8XGhDPmm3cllSN8IMmWci8
2: 
https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY


On Mon, Dec 3, 2018 at 7:05 AM Wayne Collins > wrote:


Hi JC,

Thanks for the quick response!
I had hoped for an in-pipeline solution for runner portability but
it is nice to know we're not the only ones stepping outside to
interact with runner management. :-)

Wayne


On 2018-12-03 01:23, Juan Carlos Garcia wrote:

Hi Wayne,

We have the same setup and we do daily updates to our pipeline.

The way we do it is using the flink tool via Jenkins.

Basically our deployment job does the following (roughly the commands
sketched below):

1. Detect whether the pipeline is running (it matches via job name).

2. If found, do a flink cancel with a savepoint (we use HDFS for
checkpoints / savepoints) under a given directory.

3. Use the flink run command for the new job and specify the
savepoint from step 2.
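
Roughly, those steps correspond to flink CLI calls along these lines. This is
only a sketch: the savepoint directory, job/class/jar names and the runner flag
are placeholders, and exact flags can differ between Flink versions.

# 1. find the id of the running job (match on the job name)
flink list -r

# 2. cancel it while writing a savepoint to a chosen directory
flink cancel -s hdfs:///savepoints/my-beam-pipeline <jobId>

# 3. submit the new build, restoring from the savepoint taken in step 2
flink run -s hdfs:///savepoints/my-beam-pipeline/savepoint-xxxx \
    -c com.example.MyBeamPipeline my-pipeline-bundled.jar --runner=FlinkRunner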

I don't think there is any support to achieve the same from within
the pipeline. You need to do this externally as explained above.

Best regards,
JC


On Mon, Dec 3, 2018 at 00:46, Wayne Collins <wayn...@dades.ca> wrote:

Hi all,
We have a number of Beam pipelines processing unbounded
streams sourced from Kafka on the Flink runner and are very
happy with both the platform and performance!

The problem is with shutting down the pipelines: for version
upgrades, system maintenance, load management, etc., it would
be nice to be able to gracefully shut these down under
software control, but we haven't been able to find a way to do so.
We're in good shape on checkpointing and then cleanly
recovering, but shutdowns are all destructive to Flink or the
Flink TaskManager.

Methods tried:

1) Calling cancel on FlinkRunnerResult returned from
pipeline.run()
This would be our preferred method but p.run() doesn't return
until termination and even if it did, the runner code simply
throws:
"throw new UnsupportedOperationException("FlinkRunnerResult
does not support cancel.");"
so this doesn't appear to be a near-term option.

2) Inject a "termination" message into the pipeline via Kafka
This does get through, but calling exit() from a stage in the
pipeline also terminates the Flink TaskManager.

3) Inject a "sleep" message, then manually restart the cluster
This is our current method: we pause the data at the source,
flood all branches of the pipeline with a "we're going down"
msg so the stages can do a bit of housekeeping, then hard-stop
the entire environment and re-launch with the new version.

Is there a "Best Practice" method for gracefully terminating
an unbounded pipeline from within the pipeline or from the
mainline that launches it?

Thanks!
Wayne

-- 
Wayne Collins

dades.ca Inc.
wayn...@dades.ca
cell: 416-898-5137



-- 
Wayne Collins

dades.ca Inc.
wayn...@dades.ca
cell: 416-898-5137



Latin America Community

2018-12-04 Thread Leonardo Miguel
Hi guys,

Just want to check if there is someone using Beam and/or Scio on this side
of the globe.
I'd also like to know if there are any events nearby or any related community.
If you are using Beam and/or Scio, please let me know.

Let me start first:
I'm located in Sao Carlos, Sao Paulo, Brazil.
We use Beam and Scio running on Google Dataflow to serve data products
(streaming and batch) over fiscal documents.

Thanks!

-- 
[]s

Leonardo Alves Miguel
Data Engineer
(16) 3509-5515 | www.arquivei.com.br




RE: Generic Type PTransform

2018-12-04 Thread Eran Twili
Thanks Lukasz,

Please tell me, how can I set a coder on the PCollection created after the 
"MapToKV" apply?
I mean, all I know is that it will be a PCollection<KV<K, V>>, and I don't know 
what the actual runtime types of K and V will be.
So, what coder should I set? Can you please give a code example of how to do 
that?

Really appreciate your help,
Eran

From: Lukasz Cwik [mailto:lc...@google.com]
Sent: Monday, December 03, 2018 7:10 PM
To: user@beam.apache.org
Subject: Re: Generic Type PTransform

Apache Beam attempts to propagate coders by looking at any typing information 
that is available, but Java erases most generic type information, so there are 
many scenarios where coders can't be propagated forward from a previous 
transform.

Take the following two examples (note that there are many subtle variations 
that can give different results):

List<String> erasedType = new ArrayList<String>();   // type is lost
List<String> keptType = new ArrayList<String>() {};  // type is kept because an
                                                     // anonymous inner class is declared

In the first the type is erased and in the second the type information is 
available.

In your case we can't infer what K and V are because the types have been erased 
by the time the code compiles, hence the error message. To immediately fix the 
problem, you'll want to set the coder on the PCollection created after you 
apply the "MapToKV" transform (you might need to do it on the 
"MapToSimpleImmutableEntry" transform as well).

If you want to get into the details, take a look at the CoderRegistry[1] as it 
contains the type inference / propagation code.
Finally, this is not an uncommon problem for users to face, and it seems as though 
the error message that is given doesn't make sense, so feel free to propose 
changes to the error messages to help others such as yourself.

1: 
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/CoderRegistry.java

On Sun, Dec 2, 2018 at 10:54 PM Matt Casters <mattcast...@gmail.com> wrote:
There are probably smarter people than me on this list, but since I've recently 
been through a similar thought exercise...

For the generic use in Kettle I have a PCollection<KettleRow> going through the 
pipeline.
KettleRow is just an Object[] wrapper for which I can implement a Coder.

The "group by" that I implemented does the following:
- Split PCollection<KettleRow> into PCollection<KV<KettleRow, KettleRow>>
- Then it applies the standard GroupByKey.create(), giving us 
PCollection<KV<KettleRow, Iterable<KettleRow>>>
- This means that we can simply aggregate all the elements in Iterable<KettleRow> 
to aggregate a group.

Well, at least that works for me. The code is open so look at it over here:
https://github.com/mattcasters/kettle-beam-core/blob/master/src/main/java/org/kettle/beam/core/transform/GroupByTransform.java

Like you I had trouble with the Coder for my KettleRows so I hacked up this to 
make it work:
https://github.com/mattcasters/kettle-beam-core/blob/master/src/main/java/org/kettle/beam/core/coder/KettleRowCoder.java

It's set on the pipeline:
pipeline.getCoderRegistry().registerCoderForClass( KettleRow.class, new 
KettleRowCoder());
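
For anyone who hasn't written one before, such a coder mostly boils down to 
encode/decode. The real KettleRowCoder lives in the repo linked above; the 
snippet below is only a generic sketch with a made-up MyRow type and a naive 
all-strings encoding, just to show the shape Beam expects:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.beam.sdk.coders.CustomCoder;

// Hypothetical element type standing in for KettleRow: a simple Object[] wrapper.
class MyRow {
  final Object[] values;

  MyRow(Object[] values) {
    this.values = values;
  }
}

// Minimal coder sketch: writes the field count followed by each field as a UTF string.
// (A real coder would preserve the field types instead of flattening them to strings.)
public class MyRowCoder extends CustomCoder<MyRow> {
  @Override
  public void encode(MyRow row, OutputStream outStream) throws IOException {
    DataOutputStream out = new DataOutputStream(outStream);
    out.writeInt(row.values.length);
    for (Object value : row.values) {
      out.writeUTF(String.valueOf(value));
    }
    out.flush();
  }

  @Override
  public MyRow decode(InputStream inStream) throws IOException {
    DataInputStream in = new DataInputStream(inStream);
    int size = in.readInt();
    Object[] values = new Object[size];
    for (int i = 0; i < size; i++) {
      values[i] = in.readUTF();
    }
    return new MyRow(values);
  }

  @Override
  public void verifyDeterministic() {
    // The string encoding above is deterministic for a given row, so grouping on
    // these elements as keys is allowed.
  }
}

Registering it with registerCoderForClass, as above, then lets Beam pick it up 
for any PCollection of that element type.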

Good luck!
Matt

On Sun, Dec 2, 2018 at 20:57, Eran Twili <eran.tw...@niceactimize.com> wrote:
Hi,

We are considering using Beam in our software.
We wish to create a service for a user which will operate Beam for him; the 
user code should not have any visibility of the Beam API.
For that we need to generify some of the Beam API.
So the user supplies functions and we embed them in a generic PTransform and run 
them in a Beam pipeline.
We are having some difficulty understanding how we can give the user the option 
to perform a GroupByKey operation.
The problem is that GroupByKey takes KVs, and our PCollections hold only user 
datatypes, which should not be Beam datatypes.
So we thought about having this PTransform:
public class PlatformGroupByKey<K, V> extends
    PTransform<PCollection<CustomType<SimpleImmutableEntry<K, V>>>,
        PCollection<CustomType<SimpleImmutableEntry<K, Iterable<V>>>>> {

  @Override
  public PCollection<CustomType<SimpleImmutableEntry<K, Iterable<V>>>>
      expand(PCollection<CustomType<SimpleImmutableEntry<K, V>>> input) {

    return input
        .apply("MapToKV",
            MapElements.via(
                new SimpleFunction<CustomType<SimpleImmutableEntry<K, V>>, KV<K, V>>() {
                  @Override
                  public KV<K, V> apply(CustomType<SimpleImmutableEntry<K, V>> kv) {
                    return KV.of(kv.field.getKey(), kv.field.getValue());
                  }
                }))
        .apply("GroupByKey",
            GroupByKey.create())
        .apply("MapToSimpleImmutableEntry",
            MapElements.via(
                new SimpleFunction<KV<K, Iterable<V>>,
                    CustomType<SimpleImmutableEntry<K, Iterable<V>>>>() {
                  @Override
                  public CustomType<SimpleImmutableEntry<K, Iterable<V>>> apply(
                      KV<K, Iterable<V>> kv) {
                    return new CustomType<>(
                        new SimpleImmutableEntry<>(kv.getKey(), kv.getValue()));
                  }
                }));
  }
}
In this transform we get a PCollection of our key-value type (Java's 
SimpleImmutableEntry),
convert it to KV,
perform the GroupByKey,
and re-convert it back to SimpleImmutableEntry.

But we get this error in