S3 FileSystem attempt

2017-09-15 Thread Jacob Marble
Good afternoon,
I have started work on BEAM-2500, "Add support for S3 as a Apache Beam FileSystem". The work-in-progress is at https://github.com/Kochava/beam-s3
Feel free to critique anything there; I hope to submit a PR eventually. The code is heavily inspired by the GCS FileSystem implementation, a

Re: S3 FileSystem attempt

2017-09-15 Thread Jacob Marble
For those who enjoy a good laugh, here's the fix: https://github.com/Kochava/beam-s3/commit/e5d995640e046639b001250d8cf423e1179cc2aa
On Fri, Sep 15, 2017 at 3:33 PM, Jacob Marble wrote:
> Good afternoon,
>
> I have started work on BEAM-2500, "Add support for S3 as a Apach

setup and teardown *once*

2017-09-27 Thread Jacob Marble
I have been thinking about a Redshift reader/writer, basically to wrap UNLOAD and COPY in a PTransform. For example, steps to UNLOAD into a PCollection:
1) JDBC to Redshift - UNLOAD TO 's3://bucket/tmp-prefix'
2) S3 to PCollection - work i

Re: setup and teardown *once*

2017-09-27 Thread Jacob Marble
has succeeded). You need to ensure that the
> operation is idempotent.
>
> Reuven
>
> On Wed, Sep 27, 2017 at 8:51 AM, Jacob Marble wrote:
>
> > I have been thinking on a Redshift reader/writer, basically to wrap UNLOAD
> > and COPY in a PTransform. For example, steps

Re: setup and teardown *once*

2017-09-27 Thread Jacob Marble
PCollection containing the filenames. You could then attach a Void
> key (using WithKeys), GBK the filenames together and delete in the next
> step.
>
> Reuven
>
> On Wed, Sep 27, 2017 at 9:04 AM, Jacob Marble wrote:
>
> > Thanks, Reuven, that makes sense for step 1. A

Re: setup and teardown *once*

2017-09-27 Thread Jacob Marble
thoughts here?
On Wed, Sep 27, 2017 at 9:25 AM, Jacob Marble wrote:
> Reuven, I think I found an example of the pattern you describe in
> JdbcIO.Read.expand(). Thanks for this.
>
> On Wed, Sep 27, 2017 at 9:13 AM, Reuven Lax wrote:
>
>> Create is essentially a BoundedSou

Re: setup and teardown *once*

2017-09-28 Thread Jacob Marble
run against a snapshot.
>
> On Wed, Sep 27, 2017 at 5:40 PM Jacob Marble wrote:
>
> > After playing with this for a day, I can't figure out how to make step 2
> > start *after* step 1 completes.
> >
> > The natural way to accomplish step 2 is TextIO.Read,

Re: setup and teardown *once*

2017-09-28 Thread Jacob Marble
The implementation so far: https://github.com/Kochava/beam-s3/tree/redshift/src/main/java/com/kochava/beam/redshift
On Wed, Sep 27, 2017 at 8:51 AM, Jacob Marble wrote:
> I have been thinking on a Redshift reader/writer, basically to wrap UNLOAD
> and COPY in a PTransform. For example,

spark-submit forces jackson 2.4.4

2017-10-02 Thread Jacob Marble
My Beam pipeline runs fine with DirectRunner and DataflowRunner, but fails with SparkRunner. The stack trace follows this message.
The exception indicates that com.fasterxml.jackson.databind.ObjectMapper.enable doesn't exist. ObjectMapper.enable() didn't exist until Jackson 2.5.
`mvn dependency:
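When a method appears to be missing at runtime but is present at build time, one way to narrow it down is to print which jar a class was actually loaded from inside the running JVM. The following is a minimal sketch using only the Java standard library; the class name `WhichJar` and the helper `locationOf` are hypothetical names introduced for illustration and are not part of Beam, Spark, or Jackson.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper (not part of Beam, Spark, or Jackson).
public final class WhichJar {

    // Returns the location (jar or directory) a class was loaded from,
    // or a placeholder string when that cannot be determined.
    static String locationOf(String className) {
        try {
            // 'false' avoids running the class's static initializers.
            Class<?> c = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null ? "(bootstrap or unknown)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the exception; pass another name as an argument.
        String name = args.length > 0 ? args[0] : "com.fasterxml.jackson.databind.ObjectMapper";
        System.out.println(name + " -> " + locationOf(name));
    }
}
```

Calling `locationOf` from inside the pipeline's own main class, under the same launcher that fails, would show whether the class comes from the application jar or from a jar supplied by the runtime.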

Re: spark-submit forces jackson 2.4.4

2017-10-02 Thread Jacob Marble
packages. You can also create a shaded jar.
>
> I have a similar issue in the spark 2 runner that I worked around by
> aligning the dependencies.
>
> Regards
> JB
>
> On Oct 2, 2017, 20:04, at 20:04, Jacob Marble wrote:
> >My Beam pipeline runs fine with DirectRunner an

Re: spark-submit forces jackson 2.4.4

2017-10-02 Thread Jacob Marble
> 2017-10-02 20:13 GMT+02:00 Jacob Marble :
>
> > Yes, I'm using spark-submit, and I'm giving it a shaded jar.
> >
> > What do you mean "aligning the dependencies"?
>

Re: spark-submit forces jackson 2.4.4

2017-10-02 Thread Jacob Marble
A-INF/*.SF META-INF/*.DSA META-INF/*.RSA
Jacob
On Mon, Oct 2, 2017 at 11:17 AM, Jacob Marble wrote:
> There is a lot of chat

Re: spark-submit forces jackson 2.4.4

2017-10-02 Thread Jacob Marble
ck META-INF/maven in the shaded jar or
> maybe share your mvn output in verbose mode (-X) and a dependency:tree
>
> On Oct 3, 2017, 02:16, "Jacob Marble" wrote:
>
> > I gave up on running a Spark pipeline locally, tried AWS EMR/Spark instead.
> > Now this:

Re: spark-submit forces jackson 2.4.4

2017-10-05 Thread Jacob Marble
ml.jackson* **
Jacob
On Mon, Oct 2, 2017 at 10:35 PM, Jacob Marble wrote:
> Romain-
>
> I have been using dependency:tree to check myself. Also,
> META-INF/maven/com.

Re: [Proposal] Apache Beam Swag Store

2017-11-05 Thread Jacob Marble
I think this is a great idea, ready to order mine. :)
Jacob
On Sat, Oct 28, 2017 at 11:19 AM, Jean-Baptiste Onofré wrote:
> It sounds good. Please let us know trademark update.
>
> Thanks
> Regards
> JB
>
> On Oct 28, 2017, 20:15, at 20:15, Griselda Cuevas wrote:
> >Thanks for the feedback a

Re: Configuring file-based transforms with different options

2018-03-09 Thread Jacob Marble
Yes, I agree with all of this.
Jacob
On Thu, Mar 8, 2018 at 9:52 PM, Robert Bradshaw wrote:
> On Thu, Mar 8, 2018 at 9:38 PM Eugene Kirpichov wrote:
>
>> I think it may have been an API design mistake to put the S3 region into
>> PipelineOptions.
>>
>
> +1, IMHO it's generally a mistake to p

Re: Configuring file-based transforms with different options

2018-03-09 Thread Jacob Marble
I think when I wrote the S3 code, I couldn't see how to set storage class per-bucket, so I put it in a flag. It's easy to imagine a use case where storage class differs per filespec, not only per bucket.
Jacob
On Fri, Mar 9, 2018 at 9:51 AM, Jacob Marble wrote:
> Yes, I agree wit

Dealing with AWS Regions

2018-03-13 Thread Jacob Marble
Starting a new thread just for dealing with AWS regions better, in the context of S3 and Redshift.
S3FileSystem.amazonS3 build could be refactored to select the region based on [1]:
1. the flag value region
2. the EC2 region, if found in the environment (running in an EC2 VM)
3. the default region (us-east-1)
For act
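The three-step fallback order listed in this message can be sketched in plain Java as follows. This is only an illustration of the precedence, not the S3FileSystem code: the class `RegionResolution` and the method `resolveRegion` are hypothetical names, and the EC2 lookup is represented by an `Optional` passed in by the caller rather than by any real AWS SDK call.

```java
import java.util.Optional;

// Hypothetical sketch of the proposed precedence; not the actual S3FileSystem code.
public final class RegionResolution {

    // Step 3 of the list: the default region named in the message.
    static final String DEFAULT_REGION = "us-east-1";

    // flagRegion: value of the region flag, or null/blank if unset (step 1).
    // ec2Region: region detected from the EC2 environment, if any (step 2).
    //            How it is detected is left to the caller (assumption).
    static String resolveRegion(String flagRegion, Optional<String> ec2Region) {
        if (flagRegion != null && !flagRegion.trim().isEmpty()) {
            return flagRegion.trim();
        }
        return ec2Region
                .map(String::trim)
                .filter(r -> !r.isEmpty())
                .orElse(DEFAULT_REGION);
    }

    public static void main(String[] args) {
        System.out.println(resolveRegion("eu-west-1", Optional.of("us-west-2"))); // eu-west-1
        System.out.println(resolveRegion(null, Optional.of("us-west-2")));        // us-west-2
        System.out.println(resolveRegion("", Optional.empty()));                  // us-east-1
    }
}
```

Keeping the EC2 lookup outside the function makes the precedence easy to test without network access or AWS credentials.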