Re: Question about unbounded in-memory PCollection

2019-05-07 Thread Rui Wang
Does TestStream.java [1] satisfy your need?



-Rui

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestStream.java

On Tue, May 7, 2019 at 2:47 PM Chengzhi Zhao 
wrote:

> Hi Beam Team,
>
> I am new here and have recently been studying the programming guide. I
> have a question about in-memory data:
> https://beam.apache.org/documentation/programming-guide/#creating-a-pcollection
>
> Is there a way to create an unbounded PCollection from an in-memory
> collection? I want to test an unbounded PCollection locally and don't know
> the easiest way to get one. Please let me know if I am doing something
> wrong or whether I should use a file system instead.
>
> Thanks in advance!
>
> Best,
> Chengzhi
>


Re: Question about unbounded in-memory PCollection

2019-05-07 Thread Chengzhi Zhao
Hi Beam Team,

I am new here and have recently been studying the programming guide. I
have a question about in-memory data:
https://beam.apache.org/documentation/programming-guide/#creating-a-pcollection

Is there a way to create an unbounded PCollection from an in-memory
collection? I want to test an unbounded PCollection locally and don't know
the easiest way to get one. Please let me know if I am doing something
wrong or whether I should use a file system instead.

Thanks in advance!

Best,
Chengzhi



Re: Streaming inserts BQ with Java SDK Beam

2019-05-07 Thread Andres Angel
Pablo, thanks so much. I will explore this method for STREAMING_INSERTS
then; this answer is worth a lot to us :)

thanks.

On Tue, May 7, 2019 at 1:05 PM Pablo Estrada  wrote:

> Hi Andres!
> You can definitely do streaming inserts using the Java SDK. This is
> available with BigQueryIO.write(). Specifically, you can use the
> `withMethod`[1] call to specify whether you want batch loads or streaming
> inserts. If you specify streaming inserts, Beam should insert rows as they
> come in bundles.
> Hope that helps
> -P.
>
> [1]
> https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.html#withMethod-org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method-
>
> On Tue, May 7, 2019 at 9:58 AM Andres Angel <
> ingenieroandresan...@gmail.com> wrote:
>
>> Hello everyone,
>>
>> I need to use BigQuery inserts within my Beam pipeline. I know the
>> built-in IO options offer `BigQueryIO`, but this inserts into BQ in a
>> batch fashion, creating a BQ load job underneath. I instead need to
>> trigger streaming inserts into BQ, and from reviewing the Java SDK
>> documentation it seemed like this was not possible.
>>
>> On the other hand, with the Python SDK I found GitHub documentation code
>> where they are using a method *InsertAll*, which is apparently what I
>> need. If this is official, I would like to know if there is a simple way
>> to trigger streaming inserts into BQ using the Java SDK.
>>
>> thanks so much for your feedback
>> AU
>>
>


Re: Streaming inserts BQ with Java SDK Beam

2019-05-07 Thread Alex Van Boxel
I think you really need a peculiar reason to force streaming inserts in a
batch job. Note that you will quickly hit the quota limit in batch mode:
"Maximum rows per second: 100,000 rows per second, per project", since a
batch load can process a lot more information in a shorter time.

I know you can force batch mode in a streaming job; I don't know about the
other way around.

_/
_/ Alex Van Boxel


On Tue, May 7, 2019 at 6:58 PM Andres Angel 
wrote:

> Hello everyone,
>
> I need to use BigQuery inserts within my Beam pipeline. I know the
> built-in IO options offer `BigQueryIO`, but this inserts into BQ in a
> batch fashion, creating a BQ load job underneath. I instead need to
> trigger streaming inserts into BQ, and from reviewing the Java SDK
> documentation it seemed like this was not possible.
>
> On the other hand, with the Python SDK I found GitHub documentation code
> where they are using a method *InsertAll*, which is apparently what I
> need. If this is official, I would like to know if there is a simple way
> to trigger streaming inserts into BQ using the Java SDK.
>
> thanks so much for your feedback
> AU
>


Re: Is it safe to cache the value of a singleton view (with a global window) in a DoFn?

2019-05-07 Thread Lukasz Cwik
Keep your code simple and rely on the runner caching the value locally, so
it should be very cheap to access. If you hit a performance issue because a
runner lacks caching, it would be good to hear about it so we can file a
JIRA.

On Mon, May 6, 2019 at 4:24 PM Kenneth Knowles  wrote:

> A singleton view in the global window with no triggering does have just a
> single immutable value. (It really ought to have an updated value in the
> presence of triggers, but I believe instead you will receive a crash. I
> haven't tested.)
>
> In general, a side input yields one value per window. Dataflow in batch
> will already do what you describe, but it works with all windows. Dataflow
> in streaming has some caching, but if you see a problem, that is
> interesting information.
>
> Kenn
>
> On Sat, May 4, 2019 at 9:19 AM Steve Niemitz  wrote:
>
>> I have a singleton view in a global window that is read from a DoFn. I'm
>> curious whether it's "correct" to cache the value from the view, or if I
>> need to read it every time.
>>
>> As a (simplified) example, if I were to generate the view as such:
>>
>> // Seed the pipeline with a single (null) Void element, map it to the
>> // current time in millis, and materialize that as a singleton view.
>> input.getPipeline
>>   .apply(Create.of(Collections.singleton[Void](null)))
>>   .apply(MapElements.via(new SimpleFunction[Void, JLong]() {
>>     override def apply(input: Void): JLong = {
>>       // Evaluated when the bundle executes, not at graph construction.
>>       Instant.now().getMillis
>>     }
>>   })).apply(View.asSingleton[JLong]())
>>
>> and then read it from a DoFn (using context.sideInput), is it guaranteed
>> that:
>> - every instance of the DoFn will read the same value?
>> - the value will never change?
>>
>> If so, it seems like it'd be safe to cache the value inside the DoFn. It
>> seems like this would be the case, but I've also seen cases in Dataflow
>> where the UI indicates that the MapElements step above produced more than
>> one element, so I'm curious what people have to say.
>>
>> Thanks!
>>
>


Re: Streaming inserts BQ with Java SDK Beam

2019-05-07 Thread Pablo Estrada
Hi Andres!
You can definitely do streaming inserts using the Java SDK. This is
available with BigQueryIO.write(). Specifically, you can use the
`withMethod`[1] call to specify whether you want batch loads or streaming
inserts. If you specify streaming inserts, Beam should insert rows as they
come in bundles.
Hope that helps
-P.

[1]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.html#withMethod-org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method-

On Tue, May 7, 2019 at 9:58 AM Andres Angel 
wrote:

> Hello everyone,
>
> I need to use BigQuery inserts within my Beam pipeline. I know the
> built-in IO options offer `BigQueryIO`, but this inserts into BQ in a
> batch fashion, creating a BQ load job underneath. I instead need to
> trigger streaming inserts into BQ, and from reviewing the Java SDK
> documentation it seemed like this was not possible.
>
> On the other hand, with the Python SDK I found GitHub documentation code
> where they are using a method *InsertAll*, which is apparently what I
> need. If this is official, I would like to know if there is a simple way
> to trigger streaming inserts into BQ using the Java SDK.
>
> thanks so much for your feedback
> AU
>


Streaming inserts BQ with Java SDK Beam

2019-05-07 Thread Andres Angel
Hello everyone,

I need to use BigQuery inserts within my Beam pipeline. I know the
built-in IO options offer `BigQueryIO`, but this inserts into BQ in a
batch fashion, creating a BQ load job underneath. I instead need to
trigger streaming inserts into BQ, and from reviewing the Java SDK
documentation it seemed like this was not possible.

On the other hand, with the Python SDK I found GitHub documentation code
where they are using a method *InsertAll*, which is apparently what I
need. If this is official, I would like to know if there is a simple way
to trigger streaming inserts into BQ using the Java SDK.

thanks so much for your feedback
AU


Re: Apache BEAM on Flink in production

2019-05-07 Thread Austin Bennett
On the Beam YouTube channel:
https://www.youtube.com/channel/UChNnb_YO_7B0HlW6FhAXZZQ you can see two
talks from people at Lyft; they use Beam on Flink.

Other users can also chime in as to how they are running.

I would also suggest coming to BeamSummit.org in Berlin in June and/or
sharing experiences at ApacheCon in September, where we will have two
tracks on each of two days focused on Beam:
https://www.apachecon.com/acna19/index.html

On Tue, May 7, 2019 at 6:52 AM  wrote:

> Hi all,
>
>
>
> We currently run Apache Flink based data load processes (fairly simple
> streaming ETL jobs) and are looking at converting to Apache BEAM to give
> more flexibility on the runner.
>
>
>
> Is anyone aware of any organisations running Apache BEAM on Flink in
> production. Does anyone have any case studies they would be able to share?
>
>
>
> Many thanks,
>
>
>
> Steve
>