Re: Regarding Beam SlackChannel

2018-03-09 Thread OrielResearch Eila Arich-Landkof
thanks!!


On Fri, Mar 9, 2018 at 1:58 PM, Lukasz Cwik  wrote:

> Invite sent, welcome.


-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/


Re: Scio 0.5.0 released

2018-03-09 Thread Neville Li
On Fri, Mar 9, 2018 at 4:31 PM 'Eugene Kirpichov' via Scio Users <
scio-us...@googlegroups.com> wrote:

> Hi!
>
> On Fri, Mar 9, 2018 at 1:22 PM Rafal Wojdyla  wrote:
>
>> Hi all,
>>
>> We have just released Scio 0.5.0. This is a major/breaking release - make
>> sure to read the breaking changes section below.
>>
>> https://github.com/spotify/scio/releases/tag/v0.5.0
>>
>>- Add AsyncLookupDoFn #1012
>>
> This looks like a very useful general-purpose tool; Beam users have asked
> for something like this many times, and the implementation looks very
> high-quality as well. Any interest in backporting it into Beam?
>

Sounds like a good idea. We'll look into it. It might take a while, though,
since our tests are in Scala.
https://github.com/spotify/scio/issues/1071



Re: Advice on parallelizing network calls in DoFn

2018-03-09 Thread Romain Manni-Bucau
@Kenn: why not prefer to make Beam reactive? That would allow it to scale
much further without heavy multithreading synchronization. Elegant and
efficient :). Beam 3?


On Mar 9, 2018 at 22:49, "Kenneth Knowles"  wrote:

> I will start with the "exciting futuristic" answer, which is that we
> envision the new DoFn to be able to provide an automatic ExecutorService
> parameter that you can use as you wish. [...]


Re: Advice on parallelizing network calls in DoFn

2018-03-09 Thread Kenneth Knowles
I will start with the "exciting futuristic" answer, which is that we
envision the new DoFn to be able to provide an automatic ExecutorService
parameter that you can use as you wish.

new DoFn<>() {
  @ProcessElement
  public void process(ProcessContext ctx, ExecutorService executorService) {
      ... launch some futures, put them in instance vars ...
  }

  @FinishBundle
  public void finish(...) {
      ... block on futures, output results if appropriate ...
  }
}

This way, the Java SDK harness can use its overarching knowledge of what is
going on in a computation to, for example, share a thread pool between
different bits. This was one reason to delete IntraBundleParallelization -
it didn't allow the runner and user code to properly manage how many things
were going on concurrently. Mostly, the runner should own parallelizing to
max out cores; what user code needs is asynchrony hooks that can interact
with that. However, this feature is not thoroughly considered. It's TBD
how much the harness itself manages blocking on outstanding requests versus
it being your responsibility in FinishBundle, etc.

I haven't explored rolling your own here, but if you are willing to do the
knob tuning to get the threading acceptable for your particular use case,
it might work. Perhaps someone else can weigh in.
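For the rolling-your-own route, here is a minimal sketch of the
launch-then-drain pattern described above, using only java.util.concurrent
and no Beam APIs (the class name, the lookup stand-in, and the pool size of
4 are illustrative assumptions, not anything from the SDK):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class BundleAsyncSketch {
  // Hypothetical stand-in for a blocking network call.
  static String lookup(String key) {
    return "value-for-" + key;
  }

  // Mirrors the DoFn shape: the submit loop plays @ProcessElement (launch a
  // future per element), the drain loop plays @FinishBundle (block on them).
  static List<String> runBundle(List<String> keys) {
    ExecutorService executor = Executors.newFixedThreadPool(4);
    List<Future<String>> futures = new ArrayList<>();
    for (String key : keys) {
      futures.add(executor.submit(() -> lookup(key)));
    }
    List<String> results = new ArrayList<>();
    try {
      for (Future<String> f : futures) {
        results.add(f.get(30, TimeUnit.SECONDS));
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      executor.shutdown();
    }
    return results;
  }

  public static void main(String[] args) {
    System.out.println(runBundle(List.of("a", "b", "c")));
  }
}
```

In a real DoFn the futures list would be an instance variable filled in
@ProcessElement and drained in @FinishBundle, so a bundle cannot complete
before its outstanding requests do.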

Kenn

On Fri, Mar 9, 2018 at 1:38 PM Josh Ferge 
wrote:

> [...]


Advice on parallelizing network calls in DoFn

2018-03-09 Thread Josh Ferge
Hello all:

Our team has a pipeline that makes external network calls. These pipelines
are currently very slow, and our hypothesis is that they are slow because
we are not threading our network calls. The GitHub issue below provides
some discussion around this:

https://github.com/apache/beam/pull/957

In Beam 1.0, there was IntraBundleParallelization, which helped with this.
However, it was removed because it didn't comply with a few Beam
paradigms.

Questions going forward:

What is advised for jobs that make blocking network calls? It seems
bundling the elements into groups of size X prior to passing them to the
DoFn, and managing the threading within the function, might work. Thoughts?
Are these types of jobs even suitable for Beam?
Are there any plans to develop features that help with this?

Thanks


Re: Scio 0.5.0 released

2018-03-09 Thread Eugene Kirpichov
Hi!

On Fri, Mar 9, 2018 at 1:22 PM Rafal Wojdyla  wrote:

> Hi all,
>
> We have just released Scio 0.5.0. This is a major/breaking release - make
> sure to read the breaking changes section below.
>
> https://github.com/spotify/scio/releases/tag/v0.5.0
>
>- Add AsyncLookupDoFn #1012
>
This looks like a very useful general-purpose tool; Beam users have asked
for something like this many times, and the implementation looks very
high-quality as well. Any interest in backporting it into Beam?




Scio 0.5.0 released

2018-03-09 Thread Rafal Wojdyla
Hi all,

We have just released Scio 0.5.0. This is a major/breaking release - make
sure to read the breaking changes section below.

- rav

https://github.com/spotify/scio/releases/tag/v0.5.0

*"In ictu"*

Breaking changes

   - BigQueryIO in JobTest#output now requires a type parameter. Explicit
   .map(T.toTableRow) of test data is no longer needed.
   - Typed AvroIO now accepts case classes instead of Avro records in
   JobTest. Explicit .map(T.toGenericRecord) of test data is no longer
   needed. See this change for more.
   - Package com.spotify.scio.extra.transforms is moved from scio-extra to
   scio-core, under com.spotify.scio.transforms.

See this section for more details.

Features

   - Support reading BigQuery as Avro #964, #992
   - BigQuery client now supports load and export #1060
   - Add TFRecordSpec support for Featran #1002
   - Add AsyncLookupDoFn #1012
   - Bump sparkey to 2.2.1, protobuf-generic to 0.2.4 #1028
   - Added ser/de support for Joda DateTime #1038
   - Password is now optional for JDBC connections #1040
   - Add job cancellation option to ScioResult#waitUntilDone #1056 #1058
   #1062 #1066

Bug fixes

   - Fix transform name in joins #1034 #1035
   - Add applyKvTransform #1020 #1032
   - Add helpers to initialize counters #1026 #1027
   - Fix SCollectionMatchers serialization #1001
   - Check runner version #1008 #1009
   - Log exception in AsyncLookupDoFn only if cache put fails #1039
   - ProjectId nonEmpty string check in BigQueryClient #1045
   - Fix SCollection#withSlidingWindows #1054


Re: Regarding Beam SlackChannel

2018-03-09 Thread Lukasz Cwik
Invite sent, welcome.

On Thu, Mar 8, 2018 at 7:08 PM, OrielResearch Eila Arich-Landkof <
e...@orielresearch.org> wrote:

> Hi Lukasz,
>
> Could you please add me as well
> Thanks,
> Eila
>
> On Thu, Mar 8, 2018 at 2:56 PM, Lukasz Cwik  wrote:
>
>> Invite sent, welcome.
>>
>> On Thu, Mar 8, 2018 at 11:50 AM, Chang Liu  wrote:
>>
>>> Hello
>>>
>>> Can someone please add me to the Beam slackchannel?
>>>
>>> Thanks.
>>>
>>>
>>> Best regards/祝好,
>>>
>>> Chang Liu 刘畅
>>>
>>>
>>>
>>
>
>
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
>


Re: BigQueryIO and NumFileShards

2018-03-09 Thread Jose Ignacio Honrado
Thanks a lot for the explanation, Eugene. I will try low values.
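For what it's worth, the sizing rule Eugene describes below (total write
throughput divided by per-file GCS throughput) can be sketched as a quick
back-of-the-envelope helper. The 30 MB/s per-file figure is an assumption
based on his "a few dozen MB/s" estimate, and estimateShards is an
illustrative helper, not a Beam API:

```java
public class ShardEstimate {
  // Rule of thumb from the thread: shards ~= total write throughput divided
  // by per-file GCS throughput (assumed ~30 MB/s here), with a floor of 1.
  static int estimateShards(double totalMBps, double perFileMBps) {
    return Math.max(1, (int) Math.ceil(totalMBps / perFileMBps));
  }

  public static void main(String[] args) {
    // ~1000 events/s at ~1 KB each is about 1 MB/s: a single shard suffices.
    System.out.println(estimateShards(1.0, 30.0));
    // At ~5 GB/s, roughly 5000 MB/s / 30 MB/s rounds up to 167 shards.
    System.out.println(estimateShards(5000.0, 30.0));
  }
}
```

At the 300-1000 events/s in the original question, anything around 1 KB per
event lands comfortably on a single shard, matching Eugene's suggestion that
a value of 1 should do well.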

On Fri, Mar 9, 2018 at 7:03 AM, Eugene Kirpichov 
wrote:

> It's unfortunate that we have this parameter at all - we discussed various
> ways to get rid of it with +Reuven Lax; ideally we'd be computing it
> automatically. In your case the throughput is quite modest, and even a
> value of 1 should do well.
>
> Basically, in this codepath we write the data to files in parallel, and
> every $triggeringFrequency we flush the files to a BigQuery load job. How
> many files to write in parallel depends on the throughput. The fewer, the
> better, but the write throughput to a single file is limited. You can
> assume that write throughput to GCS is a few dozen MB/s per file; I assume
> 1000 events/s fits under that, depending on the event size.
>
> Actually with that in mind, we should probably just set the value to
> something like 10 or 100 which will be enough for most needs (up to about 5
> GB/s) but keep it configurable for people who need more, and eventually
> figure out a way to autoscale it.
>
> On Thu, Mar 8, 2018 at 1:50 AM Jose Ignacio Honrado 
> wrote:
>
>> Hi,
>>
>> I am using BigQueryIO from Apache Beam 2.3.0 and Scio 0.4.7 to load data
>> into BQ from Dataflow using load jobs (Write.Method.FILE_LOADS). Here is
>> the code:
>>
>> val timePartitioning = new TimePartitioning()
>>   .setField("partition_day")
>>   .setType("DAY")
>>
>> BigQueryIO.write[Event]
>>   .to("some-table")
>>   .withCreateDisposition(Write.CreateDisposition.CREATE_IF_NEEDED)
>>   .withWriteDisposition(Write.WriteDisposition.WRITE_APPEND)
>>   .withMethod(Write.Method.FILE_LOADS)
>>   .withFormatFunction((input: Event) => BigQueryType[Event].toTableRow(input))
>>   .withSchema(BigQueryType[Event].schema)
>>   .withTriggeringFrequency(Duration.standardMinutes(15))
>>   .withNumFileShards(XXX)
>>   .withTimePartitioning(timePartitioning)
>>
>> My question is about "numFileShards", which is a mandatory parameter
>> when using a "triggeringFrequency". I have been trying to find
>> information and reading the source code to understand what it does, but
>> I couldn't find anything relevant.
>>
>> Considering there is going to be a throughput of 300-1000 events per
>> second, what would be the recommended value?
>>
>> Thanks!
>>
>