Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Sajeevan Achuthan
Thanks, Cham.
On 16 March 2018 at 23:28, Chamikara Jayalath wrote:
> Actually, I could assign it to you.
>
> On Fri, Mar 16, 2018 at 4:27 PM Chamikara Jayalath wrote:
>> Of course. Feel free to add a comment to JIRA and send out a pull request

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Chamikara Jayalath
Actually, I could assign it to you.
On Fri, Mar 16, 2018 at 4:27 PM Chamikara Jayalath wrote:
> Of course. Feel free to add a comment to JIRA and send out a pull request for this.
> Can one of the JIRA admins assign this to Sajeevan?
>
> Thanks,
> Cham
>
> On Fri, Mar

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Chamikara Jayalath
Of course. Feel free to add a comment to JIRA and send out a pull request for this.
Can one of the JIRA admins assign this to Sajeevan?
Thanks,
Cham
On Fri, Mar 16, 2018 at 4:22 PM Sajeevan Achuthan <achuthan.sajee...@gmail.com> wrote:
> Hi Guys,
>
> Can I take a look at this issue? If you

Re: Design specs for portable Combine

2018-03-16 Thread Daniel Oliveira
Since I made some updates to the doc, I feel like this is a good time to add a summary (I didn't know I needed to do that when I originally sent it out).
Structure and Lifting of Combines (In Apache Beam Portability)
This doc covers how Combines will be modeled in the Runner API and Fn API, as

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Sajeevan Achuthan
Hi Guys,
Can I take a look at this issue? If you agree, my Jira id is eachsaj.
Thanks,
Saj
On 16 March 2018 at 22:13, Chamikara Jayalath wrote:
> Created https://issues.apache.org/jira/browse/BEAM-3867.
>
> Thanks,
> Cham
>
> On Fri, Mar 16, 2018 at 3:00 PM Eugene

Re: (java) stream & beam?

2018-03-16 Thread Jean-Baptiste Onofré
Big +1
Regards
JB
On 16 March 2018 at 15:59, Reuven Lax wrote:
> BTW while it's true that raw GBK can't be fluent (due to the constraint on element type), once we have schema support we can introduce groupByField, and that can be fluent.
>
> On Wed, Mar 14, 2018 at

Re: (java) stream & beam?

2018-03-16 Thread Reuven Lax
BTW while it's true that raw GBK can't be fluent (due to the constraint on element type), once we have schema support we can introduce groupByField, and that can be fluent.
On Wed, Mar 14, 2018 at 11:50 PM Robert Bradshaw wrote:
> On Wed, Mar 14, 2018 at 11:04 PM Romain

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Chamikara Jayalath
Created https://issues.apache.org/jira/browse/BEAM-3867.
Thanks,
Cham
On Fri, Mar 16, 2018 at 3:00 PM Eugene Kirpichov wrote:
> Reading can not be parallelized, but processing can be - so there is value in having our file-based sources automatically decompress .tar and

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Eugene Kirpichov
Reading cannot be parallelized, but processing can be - so there is value in having our file-based sources automatically decompress .tar and .tar.gz. (Also, I suspect that many people use Beam even for cases with a modest amount of data, that don't have or need parallelism, just for the sake of

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Jean-Baptiste Onofré
Gzip is supported by TextIO. However, you are right: tar is not yet supported. It is similar in the way it deals with entries. Could you please create a Jira about that?
Thanks
Regards
JB
On 16 March 2018 at 14:50, Chamikara Jayalath wrote:
> FWIW, if you have

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Chamikara Jayalath
FWIW, if you have a concatenated gzip file [1], TextIO and other file-based sources should be able to read that. But we don't support tar files. Is it possible to perform tar extraction before running the pipeline? This step probably cannot be parallelized. So not much value in performing within the
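
To make the "concatenated gzip file" case concrete, here is a minimal sketch using only java.util.zip, not Beam itself. It assumes the file is simply several gzip members appended back to back, as `cat a.gz b.gz` would produce. The class and method names (ConcatGzipSketch, gzip, concat, readLines) are hypothetical and chosen for illustration only.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; not part of Beam.
class ConcatGzipSketch {
    // Compress one string into a standalone gzip member.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    // Append two gzip members back to back, like `cat a.gz b.gz > both.gz`.
    static byte[] concat(byte[] a, byte[] b) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(a, 0, a.length);
        out.write(b, 0, b.length);
        return out.toByteArray();
    }

    // Decompress the whole stream and split it into lines.
    static List<String> readLines(byte[] gzipped) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        byte[] both = concat(gzip("first file, line 1\n"), gzip("second file, line 1\n"));
        System.out.println(readLines(both));
    }
}
```

The decompressed stream is plain text from both members, which is why a line-oriented reader has no trouble with it. A tar archive is different because it interleaves binary headers with the file contents.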

Re: Routines intermittently not being executed on Apache Beam code

2018-03-16 Thread Lukasz Cwik
I asked the same thing on the Stack Overflow question. Also, adding u...@beam.apache.org
On Fri, Mar 16, 2018 at 2:03 PM Reuven Lax wrote:
> Can you explain what you mean? Are you saying that you call waitUntilFinish(), then execute some other code, and you think some

Re: Routines intermittently not being executed on Apache Beam code

2018-03-16 Thread Reuven Lax
Can you explain what you mean? Are you saying that you call waitUntilFinish(), then execute some other code, and you think some of that other code is not being executed?
On Fri, Mar 16, 2018 at 1:46 PM Lucas Arruda wrote:
> I have an Apache Beam pipeline written in Java.

Routines intermittently not being executed on Apache Beam code

2018-03-16 Thread Lucas Arruda
I have an Apache Beam pipeline written in Java. My problem is that some routines are not being executed on all instances of that pipeline. Those routines are as simple as logging messages or deleting a file in GCS. They are all set to run after the following code: More at

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Sajeevan Achuthan
Eugene - Yes, you are correct. I tried with a text file and the Beam wordcount example. The TextIO reader reads some illegal characters, as seen below.
here’s: 1
addiction: 1
new: 1
we: 1
mood: 1
an: 1
incredible: 1
swings,: 1
known: 1
choices.: 1

Re: Using the Go Beam SDK

2018-03-16 Thread Henning Rohde
Hi Philip,
Thanks for expressing interest in the Go SDK! The documentation is indeed still incomplete (BEAM-3826) and the main design document is probably the best starting point right now: https://s.apache.org/beam-go-sdk-design-rfc
It also contains links to some of the better

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Eugene Kirpichov
The code behaves as I expected, and the output is corrupt. Beam unzipped the .gz, but then interpreted the .tar as a text file, and split the .tar file by \n. E.g. the first file of the output starts with lines:
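
The failure mode described here can be reproduced without Beam. The sketch below is an illustration under stated assumptions, not Beam's implementation: it builds a small .tar.gz in memory with java.util.zip, then compares a naive "gunzip and split by \n" read with an entry-aware read that parses the 512-byte tar headers. All names (TarGzSketch, tarEntry, tarGz, naiveLines, readEntries) are hypothetical, the header writer covers only the basic ustar fields for short ASCII names, and `readAllBytes` assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; not part of Beam.
class TarGzSketch {
    static final int BLOCK = 512; // tar headers and data are padded to 512-byte blocks

    static void put(byte[] dst, int offset, String s) {
        byte[] src = s.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(src, 0, dst, offset, src.length);
    }

    // Build one tar entry: a 512-byte header, then the data padded to a block boundary.
    static byte[] tarEntry(String name, String text) {
        byte[] data = text.getBytes(StandardCharsets.US_ASCII);
        byte[] header = new byte[BLOCK];
        put(header, 0, name);                                  // file name
        put(header, 100, "0000644");                           // mode
        put(header, 108, "0000000");                           // uid
        put(header, 116, "0000000");                           // gid
        put(header, 124, String.format("%011o", data.length)); // size in octal
        put(header, 136, "00000000000");                       // mtime
        put(header, 148, "        ");                          // checksum counted as 8 spaces
        header[156] = '0';                                     // type flag: regular file
        put(header, 257, "ustar");                             // magic
        put(header, 263, "00");                                // version
        int sum = 0;
        for (byte b : header) sum += b & 0xff;
        put(header, 148, String.format("%06o", sum));          // 6 octal digits, NUL, space
        header[154] = 0;
        int padded = (data.length + BLOCK - 1) / BLOCK * BLOCK;
        byte[] entry = new byte[BLOCK + padded];
        System.arraycopy(header, 0, entry, 0, BLOCK);
        System.arraycopy(data, 0, entry, BLOCK, data.length);
        return entry;
    }

    static byte[] tarGz(Map<String, String> files) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            for (Map.Entry<String, String> f : files.entrySet()) {
                gz.write(tarEntry(f.getKey(), f.getValue()));
            }
            gz.write(new byte[2 * BLOCK]); // two zero blocks end the archive
        }
        return out.toByteArray();
    }

    // What a text reader effectively does: gunzip, then split the raw tar bytes by \n.
    static String[] naiveLines(byte[] tarGz) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(tarGz))) {
            return new String(in.readAllBytes(), StandardCharsets.ISO_8859_1).split("\n");
        }
    }

    static String field(byte[] header, int offset, int length) {
        String raw = new String(header, offset, length, StandardCharsets.US_ASCII);
        int nul = raw.indexOf('\0');
        return (nul >= 0 ? raw.substring(0, nul) : raw).trim();
    }

    // Entry-aware read: parse each header, then take exactly `size` bytes of content.
    static Map<String, String> readEntries(byte[] tarGz) throws IOException {
        Map<String, String> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(tarGz)))) {
            byte[] header = new byte[BLOCK];
            while (true) {
                in.readFully(header);
                if (header[0] == 0) break; // empty name: end-of-archive block
                String name = field(header, 0, 100);
                int size = Integer.parseInt(field(header, 124, 12), 8);
                byte[] data = new byte[size];
                in.readFully(data);
                in.readFully(new byte[(BLOCK - size % BLOCK) % BLOCK]); // skip padding
                files.put(name, new String(data, StandardCharsets.US_ASCII));
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.txt", "hello\nworld\n");
        files.put("b.txt", "more text\n");
        byte[] archive = tarGz(files);
        System.out.println("naive first line length: " + naiveLines(archive)[0].length());
        System.out.println("entries: " + readEntries(archive));
    }
}
```

In the naive read, the first "line" is the whole 512-byte header of a.txt glued onto the file's first real line, which matches the kind of corruption reported in this thread. The entry-aware read returns each file's content unchanged.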

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Sajeevan Achuthan
Eugene, I ran the code and it works fine. I am very confident in this case. I appreciate you guys for the great work. The code is supposed to show that Beam TextIO can read double-compressed files and write the output without any processing, so I ignored the processing steps. I agree with you the

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Eugene Kirpichov
Sajeevan - I'm quite confident that TextIO can handle .gz, but cannot properly handle .tar. Did you run this code? Did your test .tar.gz file contain multiple files? Did you obtain the expected output, identical to the input except for order of lines? (Also, the ParDo in this code doesn't do

Re: Looking for I/O transform to untar a tar.gz

2018-03-16 Thread Sajeevan Achuthan
Hi Guys,
TextIO can handle double-compressed files of type tar.gz. See the test code.
PipelineOptions optios = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(optios);
p.apply("ReadLines", TextIO.read().from("/dataset.tar.gz"))