The code behaves as I expected, and the output is corrupt. Beam unzipped the .gz, but then interpreted the .tar as a text file and split the .tar file by \n. E.g. the first file of the output starts with lines:

A20171012.1145+0200-1200+0200_epg10-1_node.xml/0000755000175000017500000000000013252764467016513 5ustar eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data0000644000175000017500000000360513252764467017353 0ustar eachsajeachsaj<?xml version="1.0" encoding="UTF-8"?>
which are clearly not the expected input.

On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <achuthan.sajee...@gmail.com> wrote:

> Eugene, I ran the code and it works fine. I am very confident in this
> case. I appreciate you guys for the great work.
>
> The code is supposed to show that Beam TextIO can read the double-compressed
> files and write the output without any processing, so I ignored the processing
> steps. I agree with you that further processing is not easy in this case.
>
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.TextIO;
> import org.apache.beam.sdk.options.PipelineOptions;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.transforms.DoFn;
> import org.apache.beam.sdk.transforms.ParDo;
>
> public class ReadCompressedTextFile {
>
>     public static void main(String[] args) {
>         PipelineOptions optios =
>             PipelineOptionsFactory.fromArgs(args).withValidation().create();
>         Pipeline p = Pipeline.create(optios);
>
>         p.apply("ReadLines",
>                 TextIO.read().from("./dataset.tar.gz"))
>             .apply(ParDo.of(new DoFn<String, String>() {
>                 @ProcessElement
>                 public void processElement(ProcessContext c) {
>                     // Just write all the content to "/tmp/filout/outputfile"
>                     c.output(c.element());
>                 }
>             }))
>             .apply(TextIO.write().to("/tmp/filout/outputfile"));
>
>         p.run().waitUntilFinish();
>     }
> }
>
> The full code, data file & output contents are attached.
>
> thanks
> Saj
>
> On 16 March 2018 at 16:56, Eugene Kirpichov <kirpic...@google.com> wrote:
>
>> Sajeevan - I'm quite confident that TextIO can handle .gz, but cannot
>> properly handle .tar. Did you run this code? Did your test .tar.gz file
>> contain multiple files? Did you obtain the expected output, identical to
>> the input except for order of lines?
>> (also, the ParDo in this code doesn't do anything - it outputs its input
>> - so it can be removed)
>>
>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
>> achuthan.sajee...@gmail.com> wrote:
>>
>>> Hi Guys,
>>>
>>> TextIO can handle tar.gz-type double-compressed files. See the
>>> test code.
>>>
>>> PipelineOptions optios =
>>>     PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>> Pipeline p = Pipeline.create(optios);
>>>
>>> p.apply("ReadLines", TextIO.read().from("/dataset.tar.gz"))
>>>     .apply(ParDo.of(new DoFn<String, String>() {
>>>         @ProcessElement
>>>         public void processElement(ProcessContext c) {
>>>             c.output(c.element());
>>>         }
>>>     }))
>>>     .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>
>>> p.run().waitUntilFinish();
>>>
>>> Thanks
>>> /Saj
>>>
>>> On 16 March 2018 at 04:29, Pablo Estrada <pabl...@google.com> wrote:
>>>
>>>> Hi!
>>>> Quick questions:
>>>> - which sdk are you using?
>>>> - is this batch or streaming?
>>>>
>>>> As JB mentioned, TextIO is able to work with compressed files that
>>>> contain text. Nothing currently handles the double decompression that I
>>>> believe you're looking for.
>>>> TextIO for Java is also able to "watch" a directory for new files. If
>>>> you're able to (outside of your pipeline) decompress your first zip file
>>>> into a directory that your pipeline is watching, you may be able to use
>>>> that as a workaround. Does that sound like a good thing?
>>>> Finally, if you want to implement a transform that does all your logic,
>>>> well then that sounds like SplittableDoFn material; and in that case,
>>>> someone that knows SDF better can give you guidance (or clarify if my
>>>> suggestions are not correct).
>>>> Best
>>>> -P.
>>>>
>>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> TextIO supports compressed files. Do you want to read the files as text?
>>>>>
>>>>> Can you detail the use case a bit?
>>>>>
>>>>> Thanks
>>>>> Regards
>>>>> JB
>>>>> On 15 March 2018, at 18:28, Shirish Jamthe <sjam...@google.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> My input is a tar.gz or .zip file which contains thousands of tar.gz
>>>>>> files and other files.
>>>>>> I would like to extract the tar.gz files from the tar.
>>>>>>
>>>>>> Is there a transform that can do that? I couldn't find one.
>>>>>> If not, is it in the works? Any pointers on where to start work on it?
>>>>>>
>>>>>> thanks
>>>>>>
>>>>> --
>>>> Got feedback? go/pabloem-feedback
>>>> <https://goto.google.com/pabloem-feedback>
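
To make the failure mode described at the top of this thread concrete, here is a minimal sketch in plain Java (JDK only, Java 9+ for readAllBytes). It is not Beam code and not how TextIO is implemented. The class TarGzSketch and all of its methods are hypothetical names made up for illustration, and the tiny archive it builds only mimics the real dataset. naiveLines does what a line reader effectively does after gunzipping (split the raw tar bytes on \n), while readEntries walks the 512-byte tar headers. The reader assumes plain ustar entries with names under 100 bytes and does not handle pax or GNU long-name extensions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical demo class (not part of Beam): gunzip-and-split versus real tar parsing. */
public class TarGzSketch {
    private static final int BLOCK = 512; // tar headers and padding use 512-byte blocks

    /** Gunzips only, then splits on '\n', so tar headers end up glued to file contents. */
    public static String[] naiveLines(byte[] tarGz) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(tarGz))) {
            return new String(in.readAllBytes(), StandardCharsets.ISO_8859_1).split("\n");
        }
    }

    /** Minimal ustar reader: maps each regular file's name to its bytes. */
    public static Map<String, byte[]> readEntries(byte[] tarGz) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(tarGz)))) {
            byte[] header = new byte[BLOCK];
            while (true) {
                try {
                    in.readFully(header);
                } catch (EOFException e) {
                    break; // archive ended without an end-of-archive marker
                }
                if (header[0] == 0) {
                    break; // a zeroed header block marks the end of the archive
                }
                String name = field(header, 0, 100);
                // The size field is octal text; the "0" prefix makes an empty field parse as 0.
                int size = (int) Long.parseLong("0" + field(header, 124, 12).trim(), 8);
                byte[] data = new byte[size];
                in.readFully(data);
                in.readFully(new byte[(BLOCK - size % BLOCK) % BLOCK]); // consume padding
                byte type = header[156];
                if (type == '0' || type == 0) { // regular file; '5' would be a directory
                    entries.put(name, data);
                }
            }
        }
        return entries;
    }

    /** Builds a tiny in-memory .tar.gz (one directory, one XML file) for the demo. */
    public static byte[] sampleTarGz() throws IOException {
        byte[] xml = "<?xml version=\"1.0\"?>\n<node/>\n".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(header("node.xml/", 0, '5'));
            out.write(header("node.xml/data", xml.length, '0'));
            out.write(xml);
            out.write(new byte[(BLOCK - xml.length % BLOCK) % BLOCK]);
            out.write(new byte[2 * BLOCK]); // end-of-archive marker
        }
        return bytes.toByteArray();
    }

    private static byte[] header(String name, int size, char type) {
        byte[] h = new byte[BLOCK];
        put(h, 0, name);
        put(h, 100, "0000644");
        put(h, 124, String.format("%011o", size));
        put(h, 148, "        "); // checksum field counts as spaces while summing
        h[156] = (byte) type;
        put(h, 257, "ustar");
        put(h, 263, "00");
        int sum = 0;
        for (byte b : h) {
            sum += b & 0xff;
        }
        put(h, 148, String.format("%06o", sum));
        h[154] = 0;
        return h;
    }

    private static void put(byte[] h, int offset, String s) {
        byte[] b = s.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(b, 0, h, offset, b.length);
    }

    private static String field(byte[] header, int offset, int length) {
        int end = offset;
        while (end < offset + length && header[end] != 0) {
            end++;
        }
        return new String(header, offset, end - offset, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        byte[] tarGz = sampleTarGz();
        System.out.println("Naive first line starts with: " + naiveLines(tarGz)[0].substring(0, 9));
        for (Map.Entry<String, byte[]> e : readEntries(tarGz).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().length + " bytes");
        }
    }
}
```

In the naive view, the first "line" begins with the directory entry's tar header and ends with the XML declaration of the first file, which matches the shape of the corrupt output quoted at the top. Inside a pipeline, this kind of parsing would presumably live in a transform that receives whole files rather than lines; that wiring is an assumption and is not shown here.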