Actually, I could assign it to you.

On Fri, Mar 16, 2018 at 4:27 PM Chamikara Jayalath <[email protected]> wrote:
> Of course. Feel free to add a comment to JIRA and send out a pull request
> for this.
> Can one of the JIRA admins assign this to Sajeevan?
>
> Thanks,
> Cham
>
> On Fri, Mar 16, 2018 at 4:22 PM Sajeevan Achuthan <[email protected]> wrote:
>
>> Hi Guys,
>>
>> Can I take a look at this issue? If you agree, my Jira id is eachsaj.
>>
>> thanks
>> Saj
>>
>> On 16 March 2018 at 22:13, Chamikara Jayalath <[email protected]> wrote:
>>
>>> Created https://issues.apache.org/jira/browse/BEAM-3867.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Fri, Mar 16, 2018 at 3:00 PM Eugene Kirpichov <[email protected]> wrote:
>>>
>>>> Reading cannot be parallelized, but processing can be - so there is value
>>>> in having our file-based sources automatically decompress .tar and .tar.gz.
>>>> (Also, I suspect that many people use Beam even for cases with a modest
>>>> amount of data that don't have or need parallelism, just for the sake of
>>>> the convenience of Beam's APIs and IOs.)
>>>>
>>>> On Fri, Mar 16, 2018 at 2:50 PM Chamikara Jayalath <[email protected]> wrote:
>>>>
>>>>> FWIW, if you have a concat gzip file [1], TextIO and other file-based
>>>>> sources should be able to read that. But we don't support tar files. Is it
>>>>> possible to perform the tar extraction before running the pipeline? That
>>>>> step probably cannot be parallelized, so there is not much value in
>>>>> performing it within the pipeline anyway (other than easy access to the
>>>>> various file systems).
>>>>>
>>>>> - Cham
>>>>>
>>>>> [1]
>>>>> https://stackoverflow.com/questions/8005114/fast-concatenation-of-multiple-gzip-files
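For concreteness, here is a rough sketch of the kind of in-pipeline tar handling discussed above: match the archives with FileIO, open each one in a DoFn, and walk the entries with Apache Commons Compress. The class and DoFn names and the Commons Compress dependency are assumptions made for illustration (the file paths are simply the ones used in this thread); this is not the BEAM-3867 implementation, and extracting a single archive is still sequential - only the per-archive and downstream processing is parallelized.

import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.commons.compress.utils.IOUtils;

public class ReadTarGzEntries {

  /** Opens each matched .tar.gz and emits the text content of every regular entry. */
  static class ExtractTarGzFn extends DoFn<FileIO.ReadableFile, String> {
    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
      FileIO.ReadableFile file = c.element();
      try (InputStream raw = Channels.newInputStream(file.open());
          TarArchiveInputStream tar =
              new TarArchiveInputStream(new GzipCompressorInputStream(raw))) {
        TarArchiveEntry entry;
        while ((entry = tar.getNextTarEntry()) != null) {
          if (!entry.isFile()) {
            continue; // skip directories and other non-file entries
          }
          // One element per archive entry; a real transform would probably
          // emit individual lines or (entry name, contents) pairs instead.
          c.output(new String(IOUtils.toByteArray(tar), StandardCharsets.UTF_8));
        }
      }
    }
  }

  public static void main(String[] args) {
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    p.apply(FileIO.match().filepattern("./dataset.tar.gz"))
        // Keep the bytes as-is; the DoFn handles both the gzip and the tar layer.
        .apply(FileIO.readMatches().withCompression(Compression.UNCOMPRESSED))
        .apply("ExtractTarGz", ParDo.of(new ExtractTarGzFn()))
        .apply(TextIO.write().to("/tmp/filout/outputfile"));

    p.run().waitUntilFinish();
  }
}

The same DoFn body would also work as a pre-processing step outside Beam; what the pipeline version buys is Beam's FileSystems support and parallelism across archives.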
>>>>>
>>>>> On Fri, Mar 16, 2018 at 12:26 PM Sajeevan Achuthan <[email protected]> wrote:
>>>>>
>>>>>> Eugene - Yes, you are correct. I tried with a text file & the Beam
>>>>>> wordcount example. The TextIO reader reads some illegal characters, as
>>>>>> seen below.
>>>>>>
>>>>>> here’s: 1
>>>>>> addiction: 1
>>>>>> new: 1
>>>>>> we: 1
>>>>>> mood: 1
>>>>>> an: 1
>>>>>> incredible: 1
>>>>>> swings,: 1
>>>>>> known: 1
>>>>>> choices.: 1
>>>>>> ^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@eachsaj^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@They’re: 1
>>>>>> already: 2
>>>>>> today: 1
>>>>>> the: 3
>>>>>> generation: 1
>>>>>> wordcount-00002
>>>>>>
>>>>>> thanks
>>>>>> Saj
>>>>>>
>>>>>> On 16 March 2018 at 17:45, Eugene Kirpichov <[email protected]> wrote:
>>>>>>
>>>>>>> To clarify: I think natively supporting .tar and .tar.gz would be quite
>>>>>>> useful. I'm just saying that currently we don't.
>>>>>>>
>>>>>>> On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov <[email protected]> wrote:
>>>>>>>
>>>>>>>> The code behaves as I expected, and the output is corrupt.
>>>>>>>> Beam unzipped the .gz, but then interpreted the .tar as a text file and
>>>>>>>> split the .tar file by \n.
>>>>>>>> E.g. the first file of the output starts with the lines:
>>>>>>>>
>>>>>>>> A20171012.1145+0200-1200+0200_epg10-1_node.xml/0000755000175000017500000000000013252764467016513
>>>>>>>> 5ustar
>>>>>>>> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data0000644000175000017500000000360513252764467017353
>>>>>>>> 0ustar eachsajeachsaj<?xml version="1.0" encoding="UTF-8"?>
>>>>>>>>
>>>>>>>> which are clearly not the expected input.
>>>>>>>>
>>>>>>>> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Eugene, I ran the code and it works fine. I am very confident in this
>>>>>>>>> case. I appreciate you guys for the great work.
>>>>>>>>>
>>>>>>>>> The code is supposed to show that Beam TextIO can read the double
>>>>>>>>> compressed files and write the output without any processing, so I
>>>>>>>>> ignored the processing steps. I agree with you that further processing
>>>>>>>>> is not easy in this case.
>>>>>>>>>
>>>>>>>>> import org.apache.beam.sdk.Pipeline;
>>>>>>>>> import org.apache.beam.sdk.io.TextIO;
>>>>>>>>> import org.apache.beam.sdk.options.PipelineOptions;
>>>>>>>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>>>>>>>> import org.apache.beam.sdk.transforms.DoFn;
>>>>>>>>> import org.apache.beam.sdk.transforms.ParDo;
>>>>>>>>>
>>>>>>>>> public class ReadCompressedTextFile {
>>>>>>>>>
>>>>>>>>>   public static void main(String[] args) {
>>>>>>>>>     PipelineOptions options =
>>>>>>>>>         PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>>>>>>     Pipeline p = Pipeline.create(options);
>>>>>>>>>
>>>>>>>>>     p.apply("ReadLines", TextIO.read().from("./dataset.tar.gz"))
>>>>>>>>>         .apply(ParDo.of(new DoFn<String, String>() {
>>>>>>>>>           @ProcessElement
>>>>>>>>>           public void processElement(ProcessContext c) {
>>>>>>>>>             // Just pass every element through; all content is written
>>>>>>>>>             // to "/tmp/filout/outputfile".
>>>>>>>>>             c.output(c.element());
>>>>>>>>>           }
>>>>>>>>>         }))
>>>>>>>>>         .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>>>>>>
>>>>>>>>>     p.run().waitUntilFinish();
>>>>>>>>>   }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> The full code, data file & output contents are attached.
>>>>>>>>>
>>>>>>>>> thanks
>>>>>>>>> Saj
>>>>>>>>>
>>>>>>>>> On 16 March 2018 at 16:56, Eugene Kirpichov <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Sajeevan - I'm quite confident that TextIO can handle .gz, but it
>>>>>>>>>> cannot properly handle .tar. Did you run this code? Did your test
>>>>>>>>>> .tar.gz file contain multiple files? Did you obtain the expected
>>>>>>>>>> output, identical to the input except for the order of lines?
>>>>>>>>>> (Also, the ParDo in this code doesn't do anything - it outputs its
>>>>>>>>>> input - so it can be removed.)
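A quick way to reproduce what Eugene describes without a pipeline: strip only the gzip layer and split on newlines, which is roughly what TextIO ends up doing with a .tar.gz, and the tar entry headers come out as "lines" like the ones pasted above. A standalone, JDK-only sketch (the file name is simply the one used in this thread):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

/** Prints the first few "lines" of a .tar.gz when only the gzip layer is removed. */
public class PeekTarGzAsText {
  public static void main(String[] args) throws IOException {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream("./dataset.tar.gz")),
        StandardCharsets.UTF_8))) {
      String line;
      int printed = 0;
      // What comes out are tar entry headers and padding mixed with file
      // contents, i.e. the same kind of corrupt lines shown above.
      while ((line = reader.readLine()) != null && printed < 5) {
        System.out.println(line);
        printed++;
      }
    }
  }
}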
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Guys,
>>>>>>>>>>>
>>>>>>>>>>> TextIO can handle tar.gz-type double compressed files. See the test
>>>>>>>>>>> code:
>>>>>>>>>>>
>>>>>>>>>>> PipelineOptions options =
>>>>>>>>>>>     PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>>>>>>>> Pipeline p = Pipeline.create(options);
>>>>>>>>>>>
>>>>>>>>>>> p.apply("ReadLines", TextIO.read().from("/dataset.tar.gz"))
>>>>>>>>>>>     .apply(ParDo.of(new DoFn<String, String>() {
>>>>>>>>>>>       @ProcessElement
>>>>>>>>>>>       public void processElement(ProcessContext c) {
>>>>>>>>>>>         c.output(c.element());
>>>>>>>>>>>       }
>>>>>>>>>>>     }))
>>>>>>>>>>>     .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>>>>>>>>
>>>>>>>>>>> p.run().waitUntilFinish();
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> /Saj
>>>>>>>>>>>
>>>>>>>>>>> On 16 March 2018 at 04:29, Pablo Estrada <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi!
>>>>>>>>>>>> Quick questions:
>>>>>>>>>>>> - which SDK are you using?
>>>>>>>>>>>> - is this batch or streaming?
>>>>>>>>>>>>
>>>>>>>>>>>> As JB mentioned, TextIO is able to work with compressed files that
>>>>>>>>>>>> contain text. Nothing currently handles the double decompression
>>>>>>>>>>>> that I believe you're looking for.
>>>>>>>>>>>> TextIO for Java is also able to "watch" a directory for new files.
>>>>>>>>>>>> If you're able to (outside of your pipeline) decompress your first
>>>>>>>>>>>> zip file into a directory that your pipeline is watching, you may
>>>>>>>>>>>> be able to use that as a workaround. Does that sound like a good
>>>>>>>>>>>> option?
>>>>>>>>>>>> Finally, if you want to implement a transform that does all your
>>>>>>>>>>>> logic, then that sounds like SplittableDoFn material; in that case,
>>>>>>>>>>>> someone who knows SDF better can give you guidance (or clarify if
>>>>>>>>>>>> my suggestions are not correct).
>>>>>>>>>>>> Best
>>>>>>>>>>>> -P.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>
>>>>>>>>>>>>> TextIO supports compressed files. Do you want to read the files as
>>>>>>>>>>>>> text?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you detail the use case a bit?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> JB
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 15 Mar 2018, at 18:28, Shirish Jamthe <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My input is a tar.gz or .zip file which contains thousands of
>>>>>>>>>>>>>> tar.gz files and other files.
>>>>>>>>>>>>>> I would like to extract the tar.gz files from the tar.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there a transform that can do that? I couldn't find one.
>>>>>>>>>>>>>> If not, is it in the works? Any pointers to start work on it?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Got feedback? go/pabloem-feedback
>>>>>>>>>>>> <https://goto.google.com/pabloem-feedback>
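As for the directory-watching workaround Pablo describes above, a minimal sketch of what it can look like with TextIO.read().watchForNewFiles(...): an external process untars the archives into a directory, and the pipeline picks up each new file as it appears. The directory, file pattern, poll interval, and termination condition below are made up for illustration; note that the watched read produces an unbounded PCollection, so the write needs windowing and windowed writes.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class WatchExtractedFiles {
  public static void main(String[] args) {
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines",
            TextIO.read()
                .from("/tmp/extracted/*")
                .watchForNewFiles(
                    // Poll the directory for new files every 30 seconds...
                    Duration.standardSeconds(30),
                    // ...and stop watching once no new file has appeared for an hour.
                    Watch.Growth.afterTimeSinceNewOutput(Duration.standardHours(1))))
        // The watched read is unbounded, so window it before writing.
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(TextIO.write()
            .to("/tmp/filout/outputfile")
            .withWindowedWrites()
            .withNumShards(1));

    p.run().waitUntilFinish();
  }
}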
