FWIW, if you have a concatenated gzip file [1], TextIO and other file-based sources should be able to read it. But we don't support tar files. Is it possible to perform the tar extraction before running the pipeline? That step probably cannot be parallelized, so there is not much value in performing it within the pipeline anyway (other than easy access to various file systems).

- Cham

[1] https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files
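For reference, a minimal sketch of the pre-extraction step suggested above, assuming Apache Commons Compress is on the classpath; the class name and the paths are illustrative only:

import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

/** Unpacks a .tar.gz into a local directory before the Beam pipeline runs. */
public class TarGzExtractor {

  public static void extract(Path tarGz, Path outputDir) throws Exception {
    try (InputStream in = new BufferedInputStream(Files.newInputStream(tarGz));
        TarArchiveInputStream tar = new TarArchiveInputStream(new GZIPInputStream(in))) {
      TarArchiveEntry entry;
      while ((entry = tar.getNextTarEntry()) != null) {
        Path target = outputDir.resolve(entry.getName()).normalize();
        if (entry.isDirectory()) {
          Files.createDirectories(target);
        } else {
          Files.createDirectories(target.getParent());
          // The tar stream is positioned at the current entry, so this copies
          // exactly that entry's contents and stops at its end.
          Files.copy(tar, target);
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Illustrative paths.
    extract(Paths.get("./dataset.tar.gz"), Paths.get("/tmp/extracted"));
  }
}

The extracted files can then be read with a plain filepattern, e.g. TextIO.read().from("/tmp/extracted/*").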
On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan <[email protected]> wrote:

> Eugene - Yes, you are correct. I tried with a text file & the Beam wordcount
> example. The TextIO reader reads some illegal characters, as seen below.
>
> here’s: 1
> addiction: 1
> new: 1
> we: 1
> mood: 1
> an: 1
> incredible: 1
> swings,: 1
> known: 1
> choices.: 1
> ^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re:
> 1
> already: 2
> today: 1
> the: 3
> generation: 1
> wordcount-00002
>
> thanks
> Saj
>
> On 16 March 2018 at 17:45, Eugene Kirpichov <[email protected]> wrote:
>
>> To clarify: I think natively supporting .tar and .tar.gz would be quite
>> useful. I'm just saying that currently we don't.
>>
>> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov <[email protected]>
>> wrote:
>>
>>> The code behaves as I expected, and the output is corrupt.
>>> Beam unzipped the .gz, but then interpreted the .tar as a text file and
>>> split the .tar file by \n.
>>> E.g. the first file of the output starts with the lines:
>>> A20171012.1145+0200-1200+0200_epg10-1_node.xml/0000755000175000017500000000000013252764467016513
>>> 5ustar
>>> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data0000644000175000017500000000360513252764467017353
>>> 0ustar eachsajeachsaj<?xml version="1.0" encoding="UTF-8"?>
>>>
>>> which are clearly not the expected input.
>>>
>>> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <[email protected]> wrote:
>>>
>>>> Eugene, I ran the code and it works fine. I am very confident in this
>>>> case. I appreciate you guys for the great work.
>>>>
>>>> The code is supposed to show that Beam TextIO can read the double-compressed
>>>> file and write the output without any processing, so I ignored the
>>>> processing steps. I agree with you that further processing is not easy
>>>> in this case.
>>>>
>>>> import org.apache.beam.sdk.Pipeline;
>>>> import org.apache.beam.sdk.io.TextIO;
>>>> import org.apache.beam.sdk.options.PipelineOptions;
>>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>>> import org.apache.beam.sdk.transforms.DoFn;
>>>> import org.apache.beam.sdk.transforms.ParDo;
>>>>
>>>> public class ReadCompressedTextFile {
>>>>
>>>>   public static void main(String[] args) {
>>>>     PipelineOptions options =
>>>>         PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>     Pipeline p = Pipeline.create(options);
>>>>
>>>>     p.apply("ReadLines", TextIO.read().from("./dataset.tar.gz"))
>>>>         .apply(ParDo.of(new DoFn<String, String>() {
>>>>           @ProcessElement
>>>>           public void processElement(ProcessContext c) {
>>>>             // Just pass all content through to "/tmp/filout/outputfile".
>>>>             c.output(c.element());
>>>>           }
>>>>         }))
>>>>         .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>
>>>>     p.run().waitUntilFinish();
>>>>   }
>>>> }
>>>>
>>>> The full code, data file & output contents are attached.
>>>>
>>>> thanks
>>>> Saj
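If the tar layer really must be handled inside the pipeline, the untarring would have to be done by hand, since TextIO only strips the gzip layer. A rough sketch of that idea, assuming FileIO.readMatches() (whose default AUTO compression is expected to strip the outer .gz based on the file extension) plus Apache Commons Compress for the tar entries; as noted above, a single archive is still read by one worker, so this does not parallelize within an archive:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public class ReadTarGzLines {

  public static void main(String[] args) {
    Pipeline p =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply(FileIO.match().filepattern("./dataset.tar.gz"))
        // Assumption: the default AUTO compression decompresses the .gz layer on open().
        .apply(FileIO.readMatches())
        .apply("UntarAndReadLines", ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) throws Exception {
            FileIO.ReadableFile file = c.element();
            try (TarArchiveInputStream tar =
                new TarArchiveInputStream(Channels.newInputStream(file.open()))) {
              TarArchiveEntry entry;
              while ((entry = tar.getNextTarEntry()) != null) {
                if (entry.isDirectory()) {
                  continue;
                }
                // Read the current archived file's text content line by line;
                // the tar stream reports end-of-stream at each entry boundary.
                BufferedReader reader =
                    new BufferedReader(new InputStreamReader(tar, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                  c.output(line);
                }
              }
            }
          }
        }))
        .apply(TextIO.write().to("/tmp/filout/outputfile"));

    p.run().waitUntilFinish();
  }
}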
>>>> On 16 March 2018 at 16:56, Eugene Kirpichov <[email protected]> wrote:
>>>>
>>>>> Sajeevan - I'm quite confident that TextIO can handle .gz, but cannot
>>>>> properly handle .tar. Did you run this code? Did your test .tar.gz file
>>>>> contain multiple files? Did you obtain the expected output, identical to
>>>>> the input except for the order of lines?
>>>>> (Also, the ParDo in this code doesn't do anything - it outputs its
>>>>> input - so it can be removed.)
>>>>>
>>>>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <[email protected]> wrote:
>>>>>
>>>>>> Hi Guys,
>>>>>>
>>>>>> TextIO can handle tar.gz-type double-compressed files. See the test code:
>>>>>>
>>>>>> PipelineOptions options =
>>>>>>     PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>>> Pipeline p = Pipeline.create(options);
>>>>>>
>>>>>> p.apply("ReadLines", TextIO.read().from("/dataset.tar.gz"))
>>>>>>     .apply(ParDo.of(new DoFn<String, String>() {
>>>>>>       @ProcessElement
>>>>>>       public void processElement(ProcessContext c) {
>>>>>>         c.output(c.element());
>>>>>>       }
>>>>>>     }))
>>>>>>     .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>>>
>>>>>> p.run().waitUntilFinish();
>>>>>>
>>>>>> Thanks
>>>>>> /Saj
>>>>>>
>>>>>> On 16 March 2018 at 04:29, Pablo Estrada <[email protected]> wrote:
>>>>>>
>>>>>>> Hi!
>>>>>>> Quick questions:
>>>>>>> - which SDK are you using?
>>>>>>> - is this batch or streaming?
>>>>>>>
>>>>>>> As JB mentioned, TextIO is able to work with compressed files that
>>>>>>> contain text. Nothing currently handles the double decompression that I
>>>>>>> believe you're looking for.
>>>>>>> TextIO for Java is also able to "watch" a directory for new files. If
>>>>>>> you're able to (outside of your pipeline) decompress your first zip file
>>>>>>> into a directory that your pipeline is watching, you may be able to use
>>>>>>> that as a workaround. Does that sound like a good option?
>>>>>>> Finally, if you want to implement a transform that does all your
>>>>>>> logic, then that sounds like SplittableDoFn material; in that case,
>>>>>>> someone who knows SDF better can give you guidance (or clarify if my
>>>>>>> suggestions are not correct).
>>>>>>> Best
>>>>>>> -P.
>>>>>>>
>>>>>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> TextIO supports compressed files. Do you want to read the files as text?
>>>>>>>>
>>>>>>>> Can you detail the use case a bit?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Regards
>>>>>>>> JB
>>>>>>>>
>>>>>>>> On 15 March 2018, at 18:28, Shirish Jamthe <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> My input is a tar.gz or .zip file which contains thousands of
>>>>>>>>> tar.gz files and other files.
>>>>>>>>> I would like to extract the tar.gz files from the tar.
>>>>>>>>>
>>>>>>>>> Is there a transform that can do that? I couldn't find one.
>>>>>>>>> If not, is it in the works? Any pointers to start work on it?
>>>>>>>>>
>>>>>>>>> thanks
>>>>>>>>>
>>>>>>> --
>>>>>>> Got feedback? go/pabloem-feedback
>>>>>>> <https://goto.google.com/pabloem-feedback>
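A minimal sketch of the "watch a directory" workaround Pablo describes, assuming the archives are decompressed outside the pipeline into /tmp/extracted (an illustrative path); the poll interval and termination condition are arbitrary, and the windowed write is only there because watchForNewFiles produces an unbounded PCollection:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class WatchExtractedFiles {

  public static void main(String[] args) {
    Pipeline p =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply("ReadLines",
            TextIO.read()
                .from("/tmp/extracted/*")
                .watchForNewFiles(
                    // Poll the directory every 30 seconds.
                    Duration.standardSeconds(30),
                    // Stop watching once no new files have appeared for an hour.
                    Watch.Growth.afterTimeSinceNewOutput(Duration.standardHours(1))))
        // The watched read is unbounded, so window before writing.
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(TextIO.write()
            .to("/tmp/filout/outputfile")
            .withWindowedWrites()
            .withNumShards(1));

    p.run().waitUntilFinish();
  }
}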
