Batch loads in streaming pipeline - withNumFileShards

2017-11-15 Thread Arpan Jain
Hi,

I am trying to use Method.FILE_LOADS for loading data into BQ in my
streaming pipeline using the RC3 release of 2.2.0. It looks like
withNumFileShards also needs to be set to use this. A couple of
questions regarding this:
questions regarding this:

* Any guidelines on what a good value for this is? FWIW my pipeline is
going to write >100K messages/second to around 500 tables
using DynamicDestinations, and I am trying to set a triggering frequency
of 30 minutes or 1 hour (see the configuration sketch below).

* Can we throw a better error message when it is not set? Currently the
pipeline fails with "Exception in thread "main"
java.lang.IllegalArgumentException" and no accompanying message.
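
For context, a minimal sketch of the FILE_LOADS configuration under
discussion, assuming Beam 2.2.0's BigQueryIO. The table spec, schema,
input source, shard count, and rate are illustrative placeholders only
(a real pipeline would use DynamicDestinations and a Kafka source):

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.joda.time.Duration;

public class FileLoadsSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Unbounded stand-in for the real Kafka source.
    p.apply("Source", GenerateSequence.from(0)
            .withRate(1000, Duration.standardSeconds(1)))
     .apply("ToRow", MapElements
            .into(TypeDescriptor.of(TableRow.class))
            .via((Long i) -> new TableRow().set("id", i)))
     .apply("Write", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")  // DynamicDestinations would go here
            .withSchema(new TableSchema().setFields(Collections.singletonList(
                    new TableFieldSchema().setName("id").setType("INTEGER"))))
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // Both of these must be set for FILE_LOADS on an unbounded input;
            // 30 minutes and 100 shards are placeholders, not recommendations.
            .withTriggeringFrequency(Duration.standardMinutes(30))
            .withNumFileShards(100));

    p.run();
  }
}

Omitting withNumFileShards while withTriggeringFrequency is set is what
produces the bare IllegalArgumentException described above.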


Re: Batch loading for streaming pipelines

2017-03-13 Thread Arpan Jain
Thanks for the reply! We are using these pipelines to read structured log
lines from Kafka and store them in BigQuery.

.withMaxNumRecords() and .withMaxReadTime() aren't that useful for us
because they do not remember how much they have read in previous runs.


On Mon, Mar 13, 2017 at 9:42 PM, Kenneth Knowles <k...@google.com.invalid>
wrote:

> This seems like a good topic for user@ so I've moved it there (dev@ to
> BCC).
>
> You can get a bounded PCollection from KafkaIO via either of
> .withMaxNumRecords() or .withMaxReadTime().
>
> Whether or not that will meet your use case would depend on more details of
> what you are computing. Periodic batch jobs are harder to get right. In
> particular, the time you stop reading and the end of a window (esp.
> sessions) are not likely to coincide, so you'll need to deal with that.
>
> Kenn
>
> On Mon, Mar 13, 2017 at 6:09 PM, Arpan Jain <jainar...@gmail.com> wrote:
>
> > Hi,
> >
> > We run multiple streaming pipelines using Cloud Dataflow that read from
> > Kafka and write to BigQuery. We don't mind a few hours' delay and are
> > thinking of avoiding the costs associated with streaming data into
> > BigQuery. Is there already support (or a future plan) for such a
> > scenario? If not, I guess I will implement one of the following
> > options:
> > * A BoundedSource implementation for Kafka so that we can run this in
> > batch mode.
> > * The streaming job writes to GCS and then a BQ load job writes to
> > BigQuery.
> >
> > Thanks!
> >
>
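
For reference, a minimal sketch of the bounded KafkaIO read that Kenn
suggests; the broker address, topic, deserializers, and record cap are
all hypothetical:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class BoundedKafkaSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("broker-1:9092")        // hypothetical brokers
        .withTopic("structured-logs")                 // hypothetical topic
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        // Either cap makes the source bounded, so the job runs as a batch
        // pipeline and terminates. Neither cap remembers offsets across
        // runs, which is the limitation Arpan notes above.
        .withMaxNumRecords(10_000_000L)
        // .withMaxReadTime(Duration.standardHours(1))  // alternative cap
        .withoutMetadata());                          // PCollection<KV<String, String>>

    // Downstream: parse log lines and write to GCS or BigQuery.
    p.run();
  }
}

As Kenn notes, the point at which such a job stops reading will generally
not coincide with window boundaries (especially sessions), so aggregations
near the cutoff need separate handling.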