To clarify: I think natively supporting .tar and .tar.gz would be quite useful. I'm just saying that currently we don't.
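For anyone who needs this before native support exists, here is a minimal sketch of pulling the regular files out of a .tar.gz using only the JDK. It assumes plain ustar-style headers (name at offset 0, octal size at offset 124, type flag at offset 156) and short entry names. `TarGzSketch`, `readTarGz`, `tarEntry` and `gzip` are hypothetical names made up for illustration, not Beam APIs. Long-name extensions, the ustar prefix field, checksums and very large entries are not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; none of these names exist in Beam.
final class TarGzSketch {
    private static final int BLOCK = 512; // tar stores headers and data in 512-byte blocks

    // Returns entry name -> contents for the regular files in a gzip-compressed tar stream.
    static Map<String, byte[]> readTarGz(InputStream compressed) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new GZIPInputStream(compressed))) {
            byte[] header = new byte[BLOCK];
            while (true) {
                try {
                    in.readFully(header);
                } catch (EOFException e) {
                    break; // stream ended without an end-of-archive block
                }
                if (header[0] == 0) {
                    break; // an empty name marks the zero-filled end-of-archive block
                }
                String name = field(header, 0, 100);
                String sizeText = field(header, 124, 12).trim();
                int size = sizeText.isEmpty() ? 0 : Integer.parseInt(sizeText, 8); // octal
                byte type = header[156];
                byte[] data = new byte[size];
                in.readFully(data);
                in.readFully(new byte[padding(size)]); // skip padding up to the block boundary
                if (type == '0' || type == 0) { // regular file; directories etc. are skipped
                    files.put(name, data);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    // Reads a NUL-terminated ASCII field from a header block.
    private static String field(byte[] block, int offset, int maxLen) {
        int len = 0;
        while (len < maxLen && block[offset + len] != 0) {
            len++;
        }
        return new String(block, offset, len, StandardCharsets.US_ASCII);
    }

    private static int padding(int size) {
        return (BLOCK - size % BLOCK) % BLOCK;
    }

    // Builds one header block plus padded data. Enough for the demo, not a full tar
    // writer: the checksum field is left blank, so real tar tools would reject it.
    static byte[] tarEntry(String name, byte[] content) {
        byte[] out = new byte[BLOCK + content.length + padding(content.length)];
        byte[] nameBytes = name.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(nameBytes, 0, out, 0, nameBytes.length);
        byte[] size = String.format("%011o", content.length).getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(size, 0, out, 124, size.length);
        out[156] = '0'; // type flag for a regular file
        System.arraycopy(content, 0, out, BLOCK, content.length);
        return out;
    }

    // Concatenates the chunks and gzip-compresses the result.
    static byte[] gzip(byte[]... chunks) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            for (byte[] chunk : chunks) {
                gz.write(chunk);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] archive = gzip(
            tarEntry("a.xml", "<a/>\n".getBytes(StandardCharsets.UTF_8)),
            tarEntry("b.xml", "<b/>\n".getBytes(StandardCharsets.UTF_8)),
            new byte[2 * BLOCK]); // two zero blocks end the archive
        readTarGz(new ByteArrayInputStream(archive))
            .forEach((name, data) -> System.out.println(name + ": " + data.length + " bytes"));
    }
}
```

In a pipeline, logic like this would have to run in a DoFn over whole files (for example after FileIO.match() and FileIO.readMatches()) rather than through TextIO, and a maintained tar reader such as the one in Apache Commons Compress would be a sturdier choice than hand-parsing headers.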
On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov <kirpic...@google.com> wrote:
> The code behaves as I expected, and the output is corrupt.
> Beam unzipped the .gz, but then interpreted the .tar as a text file, and
> split the .tar file by \n.
> E.g. the first file of the output starts with lines:
> A20171012.1145+0200-1200+0200_epg10-1_node.xml/0000755000175000017500000000000013252764467016513
> 5ustar
> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data0000644000175000017500000000360513252764467017353
> 0ustar eachsajeachsaj<?xml version="1.0" encoding="UTF-8"?>
>
> which are clearly not the expected input.
>
> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <achuthan.sajee...@gmail.com> wrote:
>
>> Eugene, I ran the code and it works fine. I am very confident in this
>> case. I appreciate you guys for the great work.
>>
>> The code is supposed to show that Beam TextIO can read the double compressed
>> files and write output without any processing, so I ignored the processing
>> steps. I agree with you that further processing is not easy in this case.
>>
>> import org.apache.beam.sdk.Pipeline;
>> import org.apache.beam.sdk.io.TextIO;
>> import org.apache.beam.sdk.options.PipelineOptions;
>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>> import org.apache.beam.sdk.transforms.DoFn;
>> import org.apache.beam.sdk.transforms.ParDo;
>>
>> public class ReadCompressedTextFile {
>>
>>   public static void main(String[] args) {
>>     PipelineOptions options =
>>         PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>     Pipeline p = Pipeline.create(options);
>>
>>     p.apply("ReadLines", TextIO.read().from("./dataset.tar.gz"))
>>         .apply(ParDo.of(new DoFn<String, String>() {
>>           @ProcessElement
>>           public void processElement(ProcessContext c) {
>>             // Just write all the content to "/tmp/filout/outputfile"
>>             c.output(c.element());
>>           }
>>         }))
>>         .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>
>>     p.run().waitUntilFinish();
>>   }
>> }
>>
>> The full code, data file & output contents are attached.
>>
>> thanks
>> Saj
>>
>> On 16 March 2018 at 16:56, Eugene Kirpichov <kirpic...@google.com> wrote:
>>
>>> Sajeevan - I'm quite confident that TextIO can handle .gz, but cannot
>>> properly handle .tar. Did you run this code? Did your test .tar.gz file
>>> contain multiple files? Did you obtain the expected output, identical to
>>> the input except for order of lines?
>>> (also, the ParDo in this code doesn't do anything - it outputs its input
>>> - so it can be removed)
>>>
>>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <achuthan.sajee...@gmail.com> wrote:
>>>
>>>> Hi Guys,
>>>>
>>>> TextIO can handle tar.gz type double compressed files. See the
>>>> test code.
>>>>
>>>> PipelineOptions options =
>>>>     PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>> Pipeline p = Pipeline.create(options);
>>>>
>>>> p.apply("ReadLines", TextIO.read().from("/dataset.tar.gz"))
>>>>     .apply(ParDo.of(new DoFn<String, String>() {
>>>>       @ProcessElement
>>>>       public void processElement(ProcessContext c) {
>>>>         c.output(c.element());
>>>>       }
>>>>     }))
>>>>     .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>
>>>> p.run().waitUntilFinish();
>>>>
>>>> Thanks
>>>> /Saj
>>>>
>>>> On 16 March 2018 at 04:29, Pablo Estrada <pabl...@google.com> wrote:
>>>>
>>>>> Hi!
>>>>> Quick questions:
>>>>> - which sdk are you using?
>>>>> - is this batch or streaming?
>>>>>
>>>>> As JB mentioned, TextIO is able to work with compressed files that
>>>>> contain text. Nothing currently handles the double decompression that I
>>>>> believe you're looking for.
>>>>> TextIO for Java is also able to "watch" a directory for new files. If
>>>>> you're able to (outside of your pipeline) decompress your first zip file
>>>>> into a directory that your pipeline is watching, you may be able to use
>>>>> that as a workaround. Does that sound like a good thing?
>>>>> Finally, if you want to implement a transform that does all your
>>>>> logic, well then that sounds like SplittableDoFn material; and in that
>>>>> case, someone who knows SDF better can give you guidance (or clarify if
>>>>> my suggestions are not correct).
>>>>> Best
>>>>> -P.
>>>>>
>>>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> TextIO supports compressed files. Do you want to read the files as text?
>>>>>>
>>>>>> Can you describe the use case in a bit more detail?
>>>>>>
>>>>>> Thanks
>>>>>> Regards
>>>>>> JB
>>>>>> On 15 March 2018, at 18:28, Shirish Jamthe <sjam...@google.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> My input is a tar.gz or .zip file which contains thousands of tar.gz
>>>>>>> files and other files.
>>>>>>> I would like to extract the tar.gz files from the tar.
>>>>>>>
>>>>>>> Is there a transform that can do that? I couldn't find one.
>>>>>>> If not, is it in the works? Any pointers to start work on it?
>>>>>>>
>>>>>>> thanks
>>>>>>>
>>>>>> --
>>>>> Got feedback? go/pabloem-feedback
>>>>> <https://goto.google.com/pabloem-feedback>