To clarify: I think natively supporting .tar and .tar.gz would be quite
useful. I'm just saying that currently we don't.
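
For context on the corrupt output quoted below: a .tar file is a sequence of
512-byte blocks, and each entry starts with a binary header block holding the
entry name, mode, size, a type flag and the "ustar" magic. Reading the
un-gzipped bytes as text therefore mixes those headers into the lines. The
following is only a minimal sketch of what un-tarring involves, written against
the plain JDK and not against any Beam API. The class name TarGzSketch and the
method readEntries are hypothetical. It assumes plain ustar headers (no GNU
long-name or pax extensions, no ustar prefix field), entries smaller than
2 GiB, and an archive that fits in memory.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper, not part of Beam: un-gzips and un-tars a stream in memory. */
public class TarGzSketch {
    // Tar headers are 512 bytes, and entry data is padded to a multiple of 512.
    private static final int BLOCK = 512;

    /** Returns entry name -> contents for every regular file in a .tar.gz stream. */
    public static Map<String, byte[]> readEntries(InputStream gzipped) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        DataInputStream in = new DataInputStream(new GZIPInputStream(gzipped));
        byte[] header = new byte[BLOCK];
        while (true) {
            try {
                in.readFully(header);
            } catch (EOFException e) {
                break; // stream ended without the usual zero-block trailer
            }
            if (isAllZero(header)) {
                break; // an all-zero block marks the end of the archive
            }
            String name = field(header, 0, 100);
            // The size field is an octal number stored as ASCII text.
            int size = Integer.parseInt(field(header, 124, 12).trim(), 8);
            byte typeFlag = header[156];

            byte[] data = new byte[size];
            in.readFully(data);
            // Consume the padding that rounds the data up to a full block.
            int padding = (BLOCK - size % BLOCK) % BLOCK;
            in.readFully(new byte[padding]);

            // '0' (or NUL in very old archives) means regular file; '5' is a directory.
            if (typeFlag == '0' || typeFlag == 0) {
                entries.put(name, data);
            }
        }
        return entries;
    }

    /** Reads a NUL-terminated text field from a header block. */
    private static String field(byte[] block, int offset, int length) {
        int end = offset;
        while (end < offset + length && block[end] != 0) {
            end++;
        }
        return new String(block, offset, end - offset, StandardCharsets.UTF_8);
    }

    private static boolean isAllZero(byte[] block) {
        for (byte b : block) {
            if (b != 0) {
                return false;
            }
        }
        return true;
    }
}
```

Calling TarGzSketch.readEntries(new FileInputStream("dataset.tar.gz")) would
return one byte array per regular file, which could then be decoded line by
line. A real pipeline would more likely rely on an existing tar library than on
hand-rolled header parsing like this.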

On Fri, Mar 16, 2018 at 10:44 AM Eugene Kirpichov <kirpic...@google.com>
wrote:

> The code behaves as I expected, and the output is corrupt.
> Beam unzipped the .gz, but then interpreted the .tar as a text file, and
> split the .tar file by \n.
> E.g. the first file of the output starts with lines:
> A20171012.1145+0200-1200+0200_epg10-1_node.xml/0000755000175000017500000000000013252764467016513
> 5ustar
> eachsajeachsajA20171012.1145+0200-1200+0200_epg10-1_node.xml/data0000644000175000017500000000360513252764467017353
> 0ustar  eachsajeachsaj<?xml version="1.0" encoding="UTF-8"?>
>
> which are clearly not the expected input.
>
> On Fri, Mar 16, 2018 at 10:39 AM Sajeevan Achuthan <
> achuthan.sajee...@gmail.com> wrote:
>
>> Eugene, I ran the code and it works fine.  I am very confident in this
>> case. I appreciate you guys for the great work.
>>
>> The code is supposed to show that Beam TextIO can read double-compressed
>> files and write the output without any processing, so I left out the
>> processing steps. I agree with you that further processing is not easy in
>> this case.
>>
>>
>> import org.apache.beam.sdk.Pipeline;
>> import org.apache.beam.sdk.io.TextIO;
>> import org.apache.beam.sdk.options.PipelineOptions;
>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
>> import org.apache.beam.sdk.transforms.DoFn;
>> import org.apache.beam.sdk.transforms.ParDo;
>>
>> public class ReadCompressedTextFile {
>>
>>   public static void main(String[] args) {
>>     PipelineOptions options =
>>         PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>     Pipeline p = Pipeline.create(options);
>>
>>     p.apply("ReadLines", TextIO.read().from("./dataset.tar.gz"))
>>         .apply(ParDo.of(new DoFn<String, String>() {
>>           @ProcessElement
>>           public void processElement(ProcessContext c) {
>>             // Pass each line through unchanged; all the content is
>>             // simply written to "/tmp/filout/outputfile".
>>             c.output(c.element());
>>           }
>>         }))
>>         .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>
>>     p.run().waitUntilFinish();
>>   }
>> }
>>
>> The full code, data file & output contents are attached.
>>
>> thanks
>> Saj
>>
>>
>>
>>
>>
>> On 16 March 2018 at 16:56, Eugene Kirpichov <kirpic...@google.com> wrote:
>>
>>> Sajeevan - I'm quite confident that TextIO can handle .gz, but cannot
>>> properly handle .tar. Did you run this code? Did your test .tar.gz file
>>> contain multiple files? Did you obtain the expected output, identical to
>>> the input except for order of lines?
>>> (also, the ParDo in this code doesn't do anything - it outputs its input
>>> - so it can be removed)
>>>
>>> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
>>> achuthan.sajee...@gmail.com> wrote:
>>>
>>>> Hi Guys,
>>>>
>>>> TextIO can handle tar.gz-type double-compressed files. See the test
>>>> code below.
>>>>
>>>>     PipelineOptions options =
>>>>         PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>     Pipeline p = Pipeline.create(options);
>>>>
>>>>     p.apply("ReadLines", TextIO.read().from("/dataset.tar.gz"))
>>>>         .apply(ParDo.of(new DoFn<String, String>() {
>>>>           @ProcessElement
>>>>           public void processElement(ProcessContext c) {
>>>>             c.output(c.element());
>>>>           }
>>>>         }))
>>>>         .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>>>
>>>>     p.run().waitUntilFinish();
>>>>
>>>> Thanks
>>>> /Saj
>>>>
>>>> On 16 March 2018 at 04:29, Pablo Estrada <pabl...@google.com> wrote:
>>>>
>>>>> Hi!
>>>>> Quick questions:
>>>>> - which sdk are you using?
>>>>> - is this batch or streaming?
>>>>>
>>>>> As JB mentioned, TextIO is able to work with compressed files that
>>>>> contain text. Nothing currently handles the double decompression that I
>>>>> believe you're looking for.
>>>>> TextIO for Java is also able to "watch" a directory for new files. If
>>>>> you're able to (outside of your pipeline) decompress your first zip file
>>>>> into a directory that your pipeline is watching, you may be able to use
>>>>> that as a workaround. Does that sound like a good thing?
>>>>> Finally, if you want to implement a transform that does all your
>>>>> logic, well then that sounds like SplittableDoFn material; and in that
>>>>> case, someone that knows SDF better can give you guidance (or clarify if
>>>>> my suggestions are not correct).
>>>>> Best
>>>>> -P.
>>>>>
>>>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> TextIO supports compressed files. Do you want to read the files as text?
>>>>>>
>>>>>> Can you describe the use case in a bit more detail?
>>>>>>
>>>>>> Thanks
>>>>>> Regards
>>>>>> JB
>>>>>> On 15 March 2018, at 18:28, Shirish Jamthe <sjam...@google.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> My input is a tar.gz or .zip file which contains thousands of tar.gz
>>>>>>> files and other files.
>>>>>>> I would like to extract the tar.gz files from the tar.
>>>>>>>
>>>>>>> Is there a transform that can do that? I couldn't find one.
>>>>>>> If not, is it in the works? Any pointers on where to start working on it?
>>>>>>>
>>>>>>> thanks
>>>>>>>
>>>>>> --
>>>>> Got feedback? go/pabloem-feedback
>>>>> <https://goto.google.com/pabloem-feedback>
>>>>>
>>>>
>>>>
>>
