[Events] Apache Beam Summit 2018 in California on 3/14

2018-02-23 Thread Griselda Cuevas
Hi Everyone,

I know that some folks from the community are attending a few events in
California in March, one of them is the Google Cloud Community Conference
on March 15th.

Taking advantage of these events happening in California, some of us have
been putting together the first Apache Beam Community Summit on March 14th
[1]. The location will be the Wingtip Club in San Francisco, CA.

The objective of this summit is to open a space for the members of this
community to spend some time working face-to-face, focusing especially on
2018 plans.

I'd like to invite any Apache Beam developer and/or user interested in
joining to register using this form [2] by 3/10.

I realize the invitation comes on short notice, so apologies for that. We
needed to take advantage of the opportunity of having folks traveling to
California. We'll make sure to report all important outcomes back here to
the mailing list.

If you can't join but would like to organize a second version this year
(maybe in Europe or another location), let me know. I'll be happy to help
coordinate this as well.

Have a great weekend!
G

[1] https://docs.google.com/document/d/1fY-O2IdN0IrWYvPhxs6EvgwSaaLAd74aRws2_mogkNQ/edit#heading=h.mls7w0yyv8wc

[2] https://docs.google.com/forms/d/1kZKakoxesFJvNa37RjHRr9WWK8ojQgUBBARqjQkNBVg/edit?usp=drive_web


Re: Partitioning a stream randomly and writing to files with TextIO

2018-02-23 Thread Lukasz Cwik
There shouldn't be any swapping or memory concerns if you're using Dataflow
(unless each element is very large, on the order of GiBs). Dataflow processes
small segments of the files in parallel and writes those results out before
processing more, so the entire PCollection is never required to be in memory
at a given time.
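One way to picture the bounded-memory write pattern Lukasz describes is to stream elements to a file one at a time, never materializing the whole collection. The plain-Java sketch below is only an analogy for that shape, not Dataflow's actual implementation; the class name, file name, and row contents are made up for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class StreamingWriteDemo {
    // Append n rows to 'out' one element at a time. Memory use stays
    // roughly constant no matter how many rows flow through, because
    // each row is written and discarded before the next is produced.
    static long writeRows(Path out, int n) throws IOException {
        long written = 0;
        try (BufferedWriter writer = Files.newBufferedWriter(out)) {
            for (int i = 0; i < n; i++) {
                writer.write("row-" + i);  // stand-in for one element
                writer.newLine();
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("partition-80pct", ".txt");
        System.out.println("rows written: " + writeRows(out, 1_000_000));
        try (Stream<String> lines = Files.lines(out)) {
            System.out.println("rows on disk: " + lines.count());
        }
        Files.delete(out);
    }
}
```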

On Fri, Feb 23, 2018 at 12:22 AM, Carlos Alonso 
wrote:

> Hi Lukasz, could you please elaborate a bit more on the 2nd part?
> What's important to know, from the developer's perspective, about Dataflow's
> memory management? How big can partitions grow? And what are the
> performance considerations? It sounds as if the workers will "swap"
> to disk if partitions are very big, right?
>
> Thanks!
>
> On Fri, Feb 23, 2018 at 2:27 AM Lukasz Cwik  wrote:
>
>> 1) Creating a PartitionFn is the right way to go. I would suggest using
>> something which gives you stable output, so you can replay your
>> pipeline; this would be useful for tests as well. Using something like the
>> object's hashcode and dividing the hash space into 80%/10%/10% segments
>> could work; just make sure that, if you go with hashcode, the hash
>> function distributes elements well.
>>
>> 2) This is runner dependent, but most runners don't require storing
>> everything in memory. For example, if you were using Dataflow, you would
>> only need to store a couple of elements in memory, not the entire
>> PCollection.
>>
>> On Thu, Feb 22, 2018 at 11:38 AM, Josh  wrote:
>>
>>> Hi all,
>>>
>>> I want to read a large dataset using BigQueryIO, and then randomly
>>> partition the rows into three chunks, where one partition has 80% of the
>>> data and there are two other partitions with 10% and 10%. I then want to
>>> write the three partitions to three files in GCS.
>>>
>>> I have a couple of quick questions:
>>> (1) What would be the best way to do this random partitioning with Beam?
>>> I think I can just use a PartitionFn which uses Math.random to determine
>>> which of the three partitions an element should go to, but not sure if
>>> there is a better approach.
>>>
>>> (2) I would then take the resulting PCollectionList and use TextIO to
>>> write each partition to a GCS file. For this, would I need all data for the
>>> largest partition to fit into the memory of a single worker?
>>>
>>> Thanks for any advice,
>>>
>>> Josh
>>>
>>
>>
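For readers who want to try the quoted suggestion above, here is a minimal plain-Java sketch of a deterministic 80%/10%/10% split based on an element's hash code. The class name, the bit-mixing step, and the modulo-100 bucketing are illustrative choices of mine, not Beam's API; in a real pipeline this logic would live inside a PartitionFn handed to Partition.of(3, fn).

```java
import java.util.HashMap;
import java.util.Map;

public class HashPartitionDemo {
    // Map an element to one of three buckets (0 = 80%, 1 = 10%, 2 = 10%)
    // using its hash code, so repeated pipeline runs send the same element
    // to the same bucket.
    static int bucketFor(Object element) {
        int h = element.hashCode();
        // Spread the hash bits before bucketing, in case hashCode()
        // clusters values (e.g. for sequential integers).
        h ^= (h >>> 16);
        int slot = Math.floorMod(h, 100);  // always in 0..99
        if (slot < 80) return 0;           // 80% partition
        if (slot < 90) return 1;           // first 10% partition
        return 2;                          // second 10% partition
    }

    public static void main(String[] args) {
        // Sanity-check the split on 100k synthetic "rows".
        Map<Integer, Integer> counts = new HashMap<>();
        int n = 100_000;
        for (int i = 0; i < n; i++) {
            counts.merge(bucketFor("row-" + i), 1, Integer::sum);
        }
        for (int b = 0; b < 3; b++) {
            System.out.printf("bucket %d: %.1f%%%n",
                    b, 100.0 * counts.getOrDefault(b, 0) / n);
        }
    }
}
```

Because the bucket depends only on the element itself, replaying the pipeline (or running it in a test) reproduces exactly the same three partitions, unlike a Math.random-based split.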


Processing genomics data from BigQuery prior to training a model

2018-02-23 Thread OrielResearch Eila Arich-Landkof
Hi all,

I am looking for a good reference for processing data prior to training a
model using Apache Beam.

*Phase 1:*
30K+ columns of features, partitioned across BigQuery tables (10K columns
each), and 100K+ rows.

*Phase 2:*
More columns and more rows.

Any reference is highly appreciated.

Thank you,
Eila

-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/


Re: Presentations about Apache Beam

2018-02-23 Thread Griselda Cuevas
Ciao Davide,

Just checking in: would you like to share the presentations with the community?

Have a great weekend!

>
> On Wed, Feb 14, 2018, 10:26 PM Davide Russo 
> wrote:
>
>> Dear Beam Community,
>>
>>
>>
>> My name is David and I'm a student at the University of Molise, Italy. I'm
>> studying "Security of Software Systems", and during this degree I'm attending
>> the course "Complex Software Architectures and Styles"; for the exam I
>> have to analyse a complex Apache system. I selected Apache Beam
>> for many reasons. I'm writing to you because for this project I produced 3
>> presentations about Beam that I would like to submit to you, to be added to
>> the Beam website as presentation material. They derive in part from the
>> existing presentations, so if you think these presentations are good and
>> well done, it would be cool if you added them to the Beam website. If you
>> are interested, let me know and I'll send you my work as soon as possible.
>>
>>
>>
>> Best Regards.
>>
>> DR
>>
>


Re: Partitioning a stream randomly and writing to files with TextIO

2018-02-23 Thread Carlos Alonso
Hi Lukasz, could you please elaborate a bit more on the 2nd part?
What's important to know, from the developer's perspective, about Dataflow's
memory management? How big can partitions grow? And what are the
performance considerations? It sounds as if the workers will "swap"
to disk if partitions are very big, right?

Thanks!

On Fri, Feb 23, 2018 at 2:27 AM Lukasz Cwik  wrote:

> 1) Creating a PartitionFn is the right way to go. I would suggest using
> something which gives you stable output, so you can replay your
> pipeline; this would be useful for tests as well. Using something like the
> object's hashcode and dividing the hash space into 80%/10%/10% segments
> could work; just make sure that, if you go with hashcode, the hash
> function distributes elements well.
>
> 2) This is runner dependent, but most runners don't require storing
> everything in memory. For example, if you were using Dataflow, you would
> only need to store a couple of elements in memory, not the entire
> PCollection.
>
> On Thu, Feb 22, 2018 at 11:38 AM, Josh  wrote:
>
>> Hi all,
>>
>> I want to read a large dataset using BigQueryIO, and then randomly
>> partition the rows into three chunks, where one partition has 80% of the
>> data and there are two other partitions with 10% and 10%. I then want to
>> write the three partitions to three files in GCS.
>>
>> I have a couple of quick questions:
>> (1) What would be the best way to do this random partitioning with Beam?
>> I think I can just use a PartitionFn which uses Math.random to determine
>> which of the three partitions an element should go to, but not sure if
>> there is a better approach.
>>
>> (2) I would then take the resulting PCollectionList and use TextIO to
>> write each partition to a GCS file. For this, would I need all data for the
>> largest partition to fit into the memory of a single worker?
>>
>> Thanks for any advice,
>>
>> Josh
>>
>
>


Re: Partitioning a stream randomly and writing to files with TextIO

2018-02-23 Thread Josh
I see, thanks Lukasz - I will try setting that up. Good shout on using
hashcode / ensuring the pipeline is deterministic!

On 23 Feb 2018 01:27, "Lukasz Cwik"  wrote:

> 1) Creating a PartitionFn is the right way to go. I would suggest using
> something which gives you stable output, so you can replay your
> pipeline; this would be useful for tests as well. Using something like the
> object's hashcode and dividing the hash space into 80%/10%/10% segments
> could work; just make sure that, if you go with hashcode, the hash
> function distributes elements well.
>
> 2) This is runner dependent, but most runners don't require storing
> everything in memory. For example, if you were using Dataflow, you would
> only need to store a couple of elements in memory, not the entire
> PCollection.
>
> On Thu, Feb 22, 2018 at 11:38 AM, Josh  wrote:
>
>> Hi all,
>>
>> I want to read a large dataset using BigQueryIO, and then randomly
>> partition the rows into three chunks, where one partition has 80% of the
>> data and there are two other partitions with 10% and 10%. I then want to
>> write the three partitions to three files in GCS.
>>
>> I have a couple of quick questions:
>> (1) What would be the best way to do this random partitioning with Beam?
>> I think I can just use a PartitionFn which uses Math.random to determine
>> which of the three partitions an element should go to, but not sure if
>> there is a better approach.
>>
>> (2) I would then take the resulting PCollectionList and use TextIO to
>> write each partition to a GCS file. For this, would I need all data for the
>> largest partition to fit into the memory of a single worker?
>>
>> Thanks for any advice,
>>
>> Josh
>>
>
>