using FileIO to read a single input file

2021-12-13 Thread Randal Moore
I have some side inputs that I would like to add to my pipeline. Some of them are based on a file pattern, so I found that I can collect the contents of those files using a pattern like the following: val genotypes = p.apply(FileIO.`match`.filepattern(opts.getGenotypesFilePattern())) .ap
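The truncated snippet (which looks like Scala, given the backticked `match`) can be sketched in the Java SDK roughly as below. This is a hedged sketch, not the poster's actual code: the bucket pattern is hypothetical, and `TextIO.readFiles()` assumes a Beam version of 2.2 or later.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class GenotypeSideInput {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Match the file pattern, then read every matched file line by line.
    PCollection<String> genotypes =
        p.apply(FileIO.match().filepattern("gs://example-bucket/genotypes-*")) // pattern is illustrative
         .apply(FileIO.readMatches())
         .apply(TextIO.readFiles());

    // Materialize the lines as a side input for use in later ParDos.
    PCollectionView<Iterable<String>> genotypesView =
        genotypes.apply(View.asIterable());

    p.run().waitUntilFinish();
  }
}
```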

Re: Strange 'gzip error' running Beam on Dataflow

2018-10-12 Thread Randal Moore
ype: text/csv) to > the output files and, as consequence, GZIP files were automatically > decompressed when downloading them (as explained in the previous link). > > Best, > > > El vie., 12 oct. 2018 23:40, Randal Moore escribió: > >> Using Beam Java SDK 2.6. >> &g
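The reply points at GCS decompressive transcoding: objects stored with `Content-Encoding: gzip` plus a content type like `text/csv` are transparently decompressed on download, which confuses readers expecting raw gzip bytes. A hedged sketch (object names are hypothetical) of inspecting and clearing that metadata with the `google-cloud-storage` Java client:

```java
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class InspectTranscoding {
  public static void main(String[] args) {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    Blob blob = storage.get(BlobId.of("example-bucket", "output/part-00000.gz")); // names are hypothetical

    // Decompressive transcoding kicks in when contentEncoding is gzip;
    // check the metadata before pointing a pipeline at the files.
    System.out.println("contentType: " + blob.getContentType());
    System.out.println("contentEncoding: " + blob.getContentEncoding());

    // Clearing contentEncoding makes GCS serve the object as opaque gzip
    // bytes, so readers see the raw .gz file instead of decompressed text.
    blob.toBuilder().setContentEncoding(null).build().update();
  }
}
```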

Strange 'gzip error' running Beam on Dataflow

2018-10-12 Thread Randal Moore
Using Beam Java SDK 2.6. I have a batch pipeline that has run successfully in its current form several times. Suddenly I am getting strange errors complaining about the format of the input. As far as I know, the pipeline didn't change at all since the last successful run. The error: java.util.zip.ZipEx

Controlling namespace when writing to DataStore

2018-08-29 Thread Randal Moore
I cannot find a way to control which namespace I'm writing to when saving to DataStore from a Beam/DataFlow job. I am using org.apache.beam.sdk.io.gcp.datastore.DatastoreV1 I find the ability to control the namespace with the reader but not the writer. Am I missing something obvious?
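One answer to this question (stated here as my understanding, not a quote from the thread): `DatastoreV1.Write` consumes `com.google.datastore.v1.Entity` protos, and the namespace travels on each entity's key partition ID rather than on the writer itself. A sketch of building such an entity (kind, name, and property are illustrative):

```java
import com.google.datastore.v1.Entity;
import com.google.datastore.v1.Key;
import com.google.datastore.v1.PartitionId;
import static com.google.datastore.v1.client.DatastoreHelper.makeValue;

public class NamespacedEntity {
  // Build an entity whose key carries the namespace; DatastoreIO.v1().write()
  // then stores it in that namespace without any writer-level setting.
  static Entity makeEntity(String projectId, String namespace, String kind, String name) {
    Key key = Key.newBuilder()
        .setPartitionId(PartitionId.newBuilder()
            .setProjectId(projectId)
            .setNamespaceId(namespace))   // namespace lives on the key, not the writer
        .addPath(Key.PathElement.newBuilder().setKind(kind).setName(name))
        .build();
    return Entity.newBuilder()
        .setKey(key)
        .putProperties("example", makeValue("value").build())
        .build();
  }
}
```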

Re: Status type mismatch between different API calls

2018-02-02 Thread Randal Moore
ed client. > > MonitoringUtil.toState() converts that string to the set of enums you're > familiar with: > > https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/util/MonitoringUtil.java#L222 > > On Fri, Feb 2, 2018 at 8:40 AM,
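Putting the reply's pointer together, the raw state string from the Dataflow API model can be mapped back to the SDK enum. A hedged sketch (the job ID is a placeholder, and class/method names are my reading of the linked runner code):

```java
import com.google.api.services.dataflow.model.Job;
import org.apache.beam.runners.dataflow.DataflowClient;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.util.MonitoringUtil;
import org.apache.beam.sdk.PipelineResult.State;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class JobStateLookup {
  public static void main(String[] args) throws Exception {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

    // DataflowClient returns the raw API model, whose state is a string...
    Job job = DataflowClient.create(options).getJob("my-job-id"); // placeholder ID
    String raw = job.getCurrentState();           // e.g. "JOB_STATE_RUNNING"

    // ...and MonitoringUtil.toState maps it to the familiar enumerated type.
    State state = MonitoringUtil.toState(raw);
    System.out.println(raw + " -> " + state);
  }
}
```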

Status type mismatch between different API calls

2018-02-02 Thread Randal Moore
I'm using dataflow. Found what seems to me to be "usage problem" with the available APIs. When I submit a job to the dataflow runner, I get back a DataflowPipelineJob or its superclass, PipelineResult which provides me the status of the job - as an enumerated type. But if I use a DataflowClient to

Re: [VOTE] [DISCUSSION] Remove support for Java 7

2017-10-17 Thread Randal Moore
+1 On Tue, Oct 17, 2017 at 5:21 PM Raghu Angadi wrote: > +1. > > On Tue, Oct 17, 2017 at 2:11 PM, David McNeill > wrote: > >> The final version of Beam that supports Java 7 should be clearly stated >> in the docs, so those stuck on old production infrastructure for other java >> app dependencie

Strange errors running on DataFlow

2017-08-03 Thread Randal Moore
I have a batch pipeline that runs well with small inputs but fails with a larger dataset. Looking at stackdriver I find a fair number of the following: Request failed with code 400, will NOT retry: https://dataflow.googleapis.com/v1b3/projects/cgs-nonprod/locations/us-central1/jobs/2017-08-03_13_0

Re: API to query the state of a running dataflow job?

2017-07-10 Thread Randal Moore
master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L441 > [3] https://issues.apache.org/jira/secure/CreateIssue!default.jspa > > On Sun, Jul 9, 2017 at 2:54 PM, Randal Moore wrote: > >> Is this part of the Beam API or

API to query the state of a running dataflow job?

2017-07-09 Thread Randal Moore
Is this part of the Beam API or something I should look at the google docs for help? Assume a job is running in dataflow - how can an interested third-party app query the status if it knows the job-id? rdm
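For a third-party app that only has the job ID, one approach (an assumption on my part, not confirmed in this thread) is to skip the Beam SDK entirely and call the Dataflow v1b3 REST API via the generated Java client. Project, region, and job ID below are placeholders:

```java
import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.Job;

public class ThirdPartyStatusCheck {
  public static void main(String[] args) throws Exception {
    // Build the low-level Dataflow client with application-default credentials.
    Dataflow dataflow = new Dataflow.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            JacksonFactory.getDefaultInstance(),
            GoogleCredential.getApplicationDefault())
        .setApplicationName("status-checker")
        .build();

    // Fetch the job by ID and print its current state string.
    Job job = dataflow.projects().locations().jobs()
        .get("my-project", "us-central1", "my-job-id") // all three are placeholders
        .execute();
    System.out.println(job.getCurrentState());
  }
}
```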

Where to hide credentials when submitting a beam pipeline

2017-07-07 Thread Randal Moore
Maybe this is more of a question for DataFlow - but I'm submitting a pipeline that needs to access a rest service running in a GKE kubernetes instance. I need to pass in creds. I started with pipeline-options which work but all options get exposed on the DataFlow web pages. Is there a way to pas
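One common workaround (my suggestion, not from the thread) is to pass only a non-secret *reference* through pipeline options and resolve the actual credential on the worker in `@Setup`, so the secret never appears in the Dataflow UI. Both `fetchSecretFromStore` and `callRestService` below are hypothetical helpers:

```java
import org.apache.beam.sdk.transforms.DoFn;

public class AuthenticatedCallFn extends DoFn<String, String> {
  private final String secretRef;        // e.g. a GCS path or secret name; safe to show in the UI
  private transient String credential;   // resolved on the worker, never in PipelineOptions

  public AuthenticatedCallFn(String secretRef) { this.secretRef = secretRef; }

  @Setup
  public void setup() {
    // Hypothetical helper: might read an IAM-restricted GCS object or
    // call a secret-management service. Runs once per DoFn instance.
    credential = fetchSecretFromStore(secretRef);
  }

  @ProcessElement
  public void process(ProcessContext c) {
    c.output(callRestService(c.element(), credential)); // hypothetical REST call
  }

  private static String fetchSecretFromStore(String ref) { return "..."; }
  private static String callRestService(String in, String cred) { return in; }
}
```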

Re: Providing HTTP client to DoFn

2017-07-07 Thread Randal Moore
. Keeping Y small will improve caching, larger > Y helps with hot keys. > > On Fri, Jul 7, 2017 at 8:26 AM, Randal Moore wrote: > >> Sorry for being confusing - I am still grasping at the correct semantics >> to use to refer to some of the things. I think that made a mess of

Re: Providing HTTP client to DoFn

2017-07-07 Thread Randal Moore
e index but it would be 100 times larger? > ** A map based side input which has values which are 4 bytes vs 400 bytes > isn't going to change much in lookup cost > > > > On Wed, Jul 5, 2017 at 6:22 PM, Randal Moore wrote: > >> Based on my understanding so far, I

Re: Providing HTTP client to DoFn

2017-07-05 Thread Randal Moore
can contain methods marked with @Setup/@Teardown which only get invoked >> once per DoFn instance (which is relatively infrequently) and you could >> store an instance per DoFn instead of a singleton if the REST library was >> not thread safe. >> >> On Wed, Jul 5, 2
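The advice in this thread (a transient field initialized once per `DoFn` instance via `@Setup`/`@Teardown`) can be demonstrated with plain JDK serialization, no Beam required. `RestClient` here is a stand-in for a non-serializable client such as one holding background threads:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Stand-in for a non-serializable REST client (threads, sockets, etc.).
class RestClient {
  String get(String url) { return "response-from-" + url; }
}

// Mirrors the Beam pattern: the client is transient so the fn serializes,
// and is created lazily (Beam would call setup() from a @Setup method).
class MyFn implements Serializable {
  private transient RestClient client;

  void setup() { client = new RestClient(); }   // @Setup in Beam
  void teardown() { client = null; }            // @Teardown in Beam

  String process(String element) {
    if (client == null) setup();                // defensive lazy init
    return client.get(element);
  }
}

public class TransientClientDemo {
  // Serialize and deserialize the fn, as a runner would when shipping it.
  static MyFn roundTrip(MyFn fn) throws Exception {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    ObjectOutputStream oos = new ObjectOutputStream(bos);
    oos.writeObject(fn);
    oos.flush();
    return (MyFn) new ObjectInputStream(
        new ByteArrayInputStream(bos.toByteArray())).readObject();
  }

  public static void main(String[] args) throws Exception {
    MyFn fn = roundTrip(new MyFn());   // survives serialization despite the client
    System.out.println(fn.process("svc/data"));  // prints response-from-svc/data
  }
}
```

Serialization succeeds because the transient field is skipped; each deserialized copy builds its own client on first use, which is exactly why `@Setup` is the right home for non-serializable state.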

Providing HTTP client to DoFn

2017-07-05 Thread Randal Moore
I have a step in my beam pipeline that needs some data from a rest service. The data acquired from the rest service is dependent on the context of the data being processed and relatively large. The rest client I am using isn't serializable - nor is it likely possible to make it so (background threa

Is this a valid usecase for Apache Beam and Google dataflow

2017-06-20 Thread Randal Moore
Just starting looking at Beam this week as a candidate for executing some fairly CPU intensive work. I am curious if the stream-oriented features of Beam are a match for my usecase. My user will submit a large number of computations to the system (as a "job"). Those computations can be expressed