Re: json source for a pipeline

2018-02-13 Thread Jean-Baptiste Onofré
Hi Anant, did you take a look at the Jackson extension? https://github.com/apache/beam/tree/master/sdks/java/extensions/jackson Maybe it does what you want (converting JSON to objects). Regards JB On 02/14/2018 03:50 AM, Anant Chaudhary wrote: > Hello Beam Devs, > > We are starting to explore

Re: json source for a pipeline

2018-02-13 Thread Eugene Kirpichov
Hi, You can use the general-purpose FileIO. It was designed to support pretty much anything not explicitly supported by the IOs for concrete file formats bundled with Beam, e.g. TextIO and AvroIO. For example: p.apply(FileIO.match().filepattern("...")).apply(FileIO.readMatches()) will give you a

json source for a pipeline

2018-02-13 Thread Anant Chaudhary
Hello Beam Devs, We are starting to explore Apache Beam and Google Cloud Dataflow. It seems like it can fit some of our data processing use cases pretty well. Some of my colleagues have worked with Apache Spark in the past; however, the promise of not having to manage the servers has us inclining

Re: Plan for a Parquet new release and writing Parquet file with outputstream

2018-02-13 Thread Jean-Baptiste Onofré
Hi Ryan, Thanks for the update. Ideally for Beam, it would be great to have the AvroParquetReader and AvroParquetWriter use the InputFile/OutputFile interfaces. It would allow me to directly leverage Beam FileIO. Do you have a rough date for the Parquet release with that? Thanks Regards JB

Re: Plan for a Parquet new release and writing Parquet file with outputstream

2018-02-13 Thread Ryan Blue
Jean-Baptiste, We're planning a release that will include the new OutputFile class, which I think you should be able to use. Is there anything you'd change to make this work more easily with Beam? rb On Tue, Feb 13, 2018 at 12:31 PM, Jean-Baptiste Onofré wrote: > Hi guys, >

Jenkins build is back to stable : beam_SeedJob #1072

2018-02-13 Thread Apache Jenkins Server
See

Jenkins build is still unstable: beam_SeedJob #1071

2018-02-13 Thread Apache Jenkins Server
See

Build failed in Jenkins: beam_PostRelease_NightlySnapshot #46

2018-02-13 Thread Apache Jenkins Server
See -- GitHub pull request #4665 of commit 8446108ea935f5ff7c4cc68da7b5fe5e540f6b5f, no merge conflicts. [EnvInject] - Loading node environment variables. Building

Jenkins build is unstable: beam_SeedJob #1070

2018-02-13 Thread Apache Jenkins Server
See

Re: Add Errorprone to build process?

2018-02-13 Thread Ismaël Mejía
The approach so far is optional (behind a profile). In any case, contributions are more than welcome to fix the issues error-prone detects once this is merged. On Tue, Feb 13, 2018 at 10:03 PM, Eugene Kirpichov wrote: > Oh! I had no idea this was already in progress -

Build failed in Jenkins: beam_SeedJob #1069

2018-02-13 Thread Apache Jenkins Server
See -- GitHub pull request #4677 of commit 2e3024534da3a568cd62c554af7a845f832150a9, no merge conflicts. Setting status of 2e3024534da3a568cd62c554af7a845f832150a9 to PENDING with url

Re: Plan for a Parquet new release and writing Parquet file with outputstream

2018-02-13 Thread Eugene Kirpichov
Thanks for raising this, JB! To clarify for people on Parquet mailing list who are not familiar with Beam: Beam supports multiple filesystems (currently: local, HDFS, Google Cloud, S3) via a pluggable interface (that among other things can give you a Channel for reading/writing the given path),

Re: Add Errorprone to build process?

2018-02-13 Thread Eugene Kirpichov
Oh! I had no idea this was already in progress - thanks Kenn! On Tue, Feb 13, 2018, 12:23 PM Ismaël Mejía wrote: > Kenn submitted a PR for this yesterday. So I assume it is taken. > https://github.com/apache/beam/pull/4667 > > On Tue, Feb 13, 2018 at 8:49 PM, Eugene Kirpichov

Jenkins build is back to normal : beam_SeedJob #1068

2018-02-13 Thread Apache Jenkins Server
See

Build failed in Jenkins: beam_SeedJob #1067

2018-02-13 Thread Apache Jenkins Server
See -- GitHub pull request #4677 of commit 998f3898fbff8c1ad4c4cabf7b5657ad35b793e6, no merge conflicts. Setting status of 998f3898fbff8c1ad4c4cabf7b5657ad35b793e6 to PENDING with url

Plan for a Parquet new release and writing Parquet file with outputstream

2018-02-13 Thread Jean-Baptiste Onofré
Hi guys, I'm working on the Apache Beam ParquetIO: https://github.com/apache/beam/pull/1851 In Beam, thanks to FileIO, we support several filesystems (HDFS, S3, ...). While I was able to implement the read part using AvroParquetReader on top of Beam FileIO, I'm struggling with the write part.

Re: Add Errorprone to build process?

2018-02-13 Thread Ismaël Mejía
Kenn submitted a PR for this yesterday. So I assume it is taken. https://github.com/apache/beam/pull/4667 On Tue, Feb 13, 2018 at 8:49 PM, Eugene Kirpichov wrote: > Filed https://issues.apache.org/jira/browse/BEAM-3697 > > This seems to be a good task for a new contributor,

Re: A 15x speed-up in local Python DirectRunner execution

2018-02-13 Thread Charles Chen
This is now checked into master. You can use it by setting --runner=SwitchingDirectRunner. Please let us know if you run into any issues. On Thu, Feb 8, 2018 at 10:30 AM Romain Manni-Bucau wrote: > Very interesting! Sounds like a sane way for beam future and I'm very >

Jenkins build is back to normal : beam_SeedJob #1066

2018-02-13 Thread Apache Jenkins Server
See

Add Errorprone to build process?

2018-02-13 Thread Eugene Kirpichov
Filed https://issues.apache.org/jira/browse/BEAM-3697 This seems to be a good task for a new contributor, easy and potentially with high payoff (uncovering bugs), and a good way to become cursorily familiar with diverse parts of the codebase by fixing the bugs it finds. Any takers? Description:

Build failed in Jenkins: beam_SeedJob #1065

2018-02-13 Thread Apache Jenkins Server
See -- GitHub pull request #4677 of commit 8feb13112432e224cc19d7819bb9c282f6ccbe0b, no merge conflicts. Setting status of 8feb13112432e224cc19d7819bb9c282f6ccbe0b to PENDING with url

Jenkins build is back to normal : beam_SeedJob #1064

2018-02-13 Thread Apache Jenkins Server
See

Build failed in Jenkins: beam_SeedJob #1063

2018-02-13 Thread Apache Jenkins Server
See -- GitHub pull request #4677 of commit 8feb13112432e224cc19d7819bb9c282f6ccbe0b, no merge conflicts. Setting status of 8feb13112432e224cc19d7819bb9c282f6ccbe0b to PENDING with url

Re: PipelineOptions fromSystemProps?

2018-02-13 Thread Romain Manni-Bucau
OK, will try a PR after the classloader one then. Thanks a lot. On Feb 13, 2018, 19:14, "Lukasz Cwik" wrote: > The one in Dataflow 1.x was one system property that contained all the > JSON so it wasn't exactly what you were looking for. > > On Tue, Feb 13, 2018 at 9:51 AM,

Re: PipelineOptions fromSystemProps?

2018-02-13 Thread Lukasz Cwik
The one in Dataflow 1.x was one system property that contained all the JSON so it wasn't exactly what you were looking for. On Tue, Feb 13, 2018 at 9:51 AM, Romain Manni-Bucau wrote: > I like your proposal Kenneth. Perfectly fits my use case and deployment > one as well -

Re: PipelineOptions fromSystemProps?

2018-02-13 Thread Lukasz Cwik
PipelineOptionsFactory.fromSystemProperties did exist in Dataflow 1.x and was dropped for the reason that Ken mentioned. On Tue, Feb 13, 2018 at 9:46 AM, Kenneth Knowles wrote: > Pipeline options are not global - they are a property of a single job. The > TestPipeline reads

classloader fixes for pipeline options

2018-02-13 Thread Romain Manni-Bucau
Hi guys, just wanted to send a heads-up: I'm trying to enhance how the pipeline options use classloaders, to support an execution where sdks-java-core is not in the same classloader as the runner. I sent a functional PR on that aspect at https://github.com/apache/beam/pull/4674 -

Re: PipelineOptions fromSystemProps?

2018-02-13 Thread Kenneth Knowles
Pipeline options are not global - they are a property of a single job. The TestPipeline reads them from a very particular system property because it is a special testing rule. If you want a generic way to build pipeline options from a set of system properties, it should be from an explicit

Re: PipelineOptions fromSystemProps?

2018-02-13 Thread Romain Manni-Bucau
Makes sense. Do we want a beam.foo.bar -> --foo-bar conversion too? Romain Manni-Bucau

Re: PipelineOptions fromSystemProps?

2018-02-13 Thread Eugene Kirpichov
Neutral about this one: haven't seen a case where this was needed, but don't see anything wrong with it either. One thing I'd recommend if you go through with it, extract from system properties under "beam." rather than all of them, to avoid clashes. On Tue, Feb 13, 2018, 7:53 AM Jean-Baptiste
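To make the suggestion in this thread concrete, here is a minimal sketch of collecting only the system properties under the "beam." prefix and turning them into command-line style arguments. It is not an existing Beam API: the class name BeamSystemProps, the method toArgs, and the dot-to-dash mapping (beam.foo.bar -> --foo-bar, as floated earlier in the thread) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical helper, not part of Beam: illustrates the "beam." prefix idea only.
public final class BeamSystemProps {
    // Only properties under this prefix are considered, to avoid clashes
    // with unrelated JVM or application properties.
    private static final String PREFIX = "beam.";

    private BeamSystemProps() {}

    /** Converts beam.-prefixed properties into --name=value arguments. */
    public static List<String> toArgs(Properties props) {
        List<String> args = new ArrayList<>();
        // Sort the names so the resulting argument order is deterministic.
        for (String name : new TreeSet<>(props.stringPropertyNames())) {
            if (!name.startsWith(PREFIX) || name.length() == PREFIX.length()) {
                continue; // skip unrelated properties and the bare "beam." key
            }
            // Assumed mapping: strip the prefix and turn remaining dots into dashes.
            String option = name.substring(PREFIX.length()).replace('.', '-');
            args.add("--" + option + "=" + props.getProperty(name));
        }
        return args;
    }

    public static void main(String[] argv) {
        // Prints whatever beam.* properties were passed with -D on the command line.
        System.out.println(toArgs(System.getProperties()));
    }
}
```

The resulting list could then be handed to PipelineOptionsFactory.fromArgs(...) explicitly, which keeps the options tied to a single job rather than making them global. Whether dashed names would actually bind to a given options interface is left open here.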

Re: Proposal: build Python wheel distributions for Apache Beam releases

2018-02-13 Thread Robert Bradshaw
On Tue, Feb 13, 2018 at 8:31 AM, Nima Mousavi wrote: > Related question: > > How can we tell if the docker image of our binary contains the cython > optimized beam or the slower codepath? > The image was built on Google cloud (using gcloud container builds submit). There

Re: [VOTE] Release 2.3.0, release candidate #3

2018-02-13 Thread Jean-Baptiste Onofré
+1 (binding) Tested the Spark runner (with wordcount example and beam samples) Tested the performance of the direct runner I just updated the spreadsheet. Regards JB On 02/11/2018 06:33 AM, Jean-Baptiste Onofré wrote: > Hi everyone, > > Please review and vote on the release candidate #3 for

Re: [VOTE] Release 2.3.0, release candidate #3

2018-02-13 Thread Jean-Baptiste Onofré
Gentle reminder about the vote: we have only +1 (non-binding) votes for now. Regards JB On 02/11/2018 06:33 AM, Jean-Baptiste Onofré wrote: > Hi everyone, > > Please review and vote on the release candidate #3 for the version 2.3.0, as > follows: > > [ ] +1, Approve the release > [ ] -1, Do not

Re: [VOTE] Release 2.3.0, release candidate #3

2018-02-13 Thread Jean-Baptiste Onofré
Hi guys, as discussed, I created a staging repository extension containing the hadoop-input-format artifact: https://repository.apache.org/content/repositories/orgapachebeam-1029/ Regards JB On 02/11/2018 06:33 AM, Jean-Baptiste Onofré wrote: > Hi everyone, > > Please review and vote on the

Re: Proposal: build Python wheel distributions for Apache Beam releases

2018-02-13 Thread Nima Mousavi
Related question: How can we tell if the docker image of our binary contains the cython optimized beam or the slower codepath? The image was built on Google cloud (using *gcloud container builds submit* ). On Mon, Feb 12, 2018 at 9:32 PM, Ahmet Altay wrote: > +1 to wheels.

Re: PipelineOptions fromSystemProps?

2018-02-13 Thread Jean-Baptiste Onofré
Hi Romain, it sounds interesting to me, and doesn't break anything, so +1 from my side. Regards JB On 02/13/2018 03:42 PM, Romain Manni-Bucau wrote: > Hi guys, > > there are hacks in beam testing code to read the args from a system property > but > I wonder if we shouldn't add a

PipelineOptions fromSystemProps?

2018-02-13 Thread Romain Manni-Bucau
Hi guys, there are hacks in the Beam testing code to read the args from a system property, but I wonder if we shouldn't add a PipelineOptionsFactory.fromSystemProperties(). It would iterate over the system properties and treat every --xxx=foo as a potential argument that it tries to bind. Rationale behind that

Build failed in Jenkins: beam_PostRelease_NightlySnapshot #44

2018-02-13 Thread Apache Jenkins Server
See Changes: [robertwb] [BEAM-3074] Serialize DoFns by portable id in Dataflow runner. [lcwik] [BEAM-3629] Send the windowing strategy and whether its a merging window [iemejia] Fix warning on

Jenkins build became unstable: beam_Release_NightlySnapshot #684

2018-02-13 Thread Apache Jenkins Server
See