Eugene, I ran the code and it works fine, so I am confident in this
case. Thank you all for the great work.

The code is supposed to show that Beam TextIO can read double-compressed
files and write the output without any processing, so I left out the
processing steps. I agree with you that further processing is not easy in
this case.


import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class ReadCompressedTextFile {

    public static void main(String[] args) {
        PipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from("./dataset.tar.gz"))
            // Identity step: pass every line through unchanged.
            .apply(ParDo.of(new DoFn<String, String>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    c.output(c.element());
                }
            }))
            // Write all of the content under the prefix "/tmp/filout/outputfile".
            .apply(TextIO.write().to("/tmp/filout/outputfile"));

        p.run().waitUntilFinish();
    }
}

The full code, data file & output contents are attached.
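
For anyone checking Eugene's question about whether the output matches the
input: gunzipping a .tar.gz yields the raw tar stream, which interleaves a
512-byte header block with each file's content. The sketch below is a
standard-library-only way to list the entries of a .tar.gz so the output can
be compared against the real file contents. It is not part of Beam; the names
TarGzEntries and readEntries are hypothetical, and it assumes plain
ustar-style headers (no long-name or base-256 size extensions, no checksum
verification).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not a Beam API: a minimal tar.gz entry reader.
class TarGzEntries {

    private static final int BLOCK = 512; // tar works in 512-byte blocks

    /** Gunzips the stream and returns regular-file entries by name, in archive order. */
    static Map<String, byte[]> readEntries(InputStream gzipped) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (InputStream tar = new GZIPInputStream(gzipped)) {
            byte[] header = new byte[BLOCK];
            // An all-zero block marks the end of the archive.
            while (readFully(tar, header) == BLOCK && header[0] != 0) {
                String name = field(header, 0, 100);             // file name
                String sizeField = field(header, 124, 12).trim(); // size, octal text
                int size = sizeField.isEmpty() ? 0 : Integer.parseInt(sizeField, 8);
                byte type = header[156];                          // '0' or NUL = regular file

                byte[] body = new byte[size];
                if (readFully(tar, body) != size) {
                    throw new IOException("Truncated entry: " + name);
                }
                // Content is padded with zeros up to the next block boundary.
                int padding = (BLOCK - size % BLOCK) % BLOCK;
                if (readFully(tar, new byte[padding]) != padding) {
                    throw new IOException("Truncated padding after: " + name);
                }
                if (type == '0' || type == 0) {
                    entries.put(name, body);
                }
            }
        }
        return entries;
    }

    /** Reads a NUL-terminated ASCII field from the header. */
    private static String field(byte[] header, int offset, int length) {
        int end = offset;
        while (end < offset + length && header[end] != 0) {
            end++;
        }
        return new String(header, offset, end - offset, StandardCharsets.US_ASCII);
    }

    /** Fills the buffer unless the stream ends first; returns the bytes read. */
    private static int readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TarGzEntries ./dataset.tar.gz
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            readEntries(in).forEach((name, body) ->
                System.out.println(name + " (" + body.length + " bytes)"));
        }
    }
}
```

If the pipeline output contains lines that do not appear in any entry listed
this way (for example, fragments of header blocks), that would suggest the
tar layer was passed through as text rather than unpacked.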

thanks
Saj





On 16 March 2018 at 16:56, Eugene Kirpichov <kirpic...@google.com> wrote:

> Sajeevan - I'm quite confident that TextIO can handle .gz, but cannot
> properly handle .tar. Did you run this code? Did your test .tar.gz file
> contain multiple files? Did you obtain the expected output, identical to
> the input except for order of lines?
> (also, the ParDo in this code doesn't do anything - it outputs its input -
> so it can be removed)
>
> On Fri, Mar 16, 2018 at 9:06 AM Sajeevan Achuthan <
> achuthan.sajee...@gmail.com> wrote:
>
>> Hi Guys,
>>
>> TextIO can handle tar.gz-type double-compressed files. See the test
>> code below.
>>
>>     PipelineOptions options =
>>         PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>     Pipeline p = Pipeline.create(options);
>>
>>     p.apply("ReadLines", TextIO.read().from("/dataset.tar.gz"))
>>                       .apply(ParDo.of(new DoFn<String, String>(){
>>     @ProcessElement
>>     public void processElement(ProcessContext c) {
>>     c.output(c.element());
>>     }
>>
>>     }))
>>
>>    .apply(TextIO.write().to("/tmp/filout/outputfile"));
>>
>>     p.run().waitUntilFinish();
>>
>> Thanks
>> /Saj
>>
>> On 16 March 2018 at 04:29, Pablo Estrada <pabl...@google.com> wrote:
>>
>>> Hi!
>>> Quick questions:
>>> - which sdk are you using?
>>> - is this batch or streaming?
>>>
>>> As JB mentioned, TextIO is able to work with compressed files that
>>> contain text. Nothing currently handles the double decompression that I
>>> believe you're looking for.
>>> TextIO for Java is also able to "watch" a directory for new files. If
>>> you're able to (outside of your pipeline) decompress your first zip file
>>> into a directory that your pipeline is watching, you may be able to use
>>> that as a workaround. Does that sound like a good thing?
>>> Finally, if you want to implement a transform that does all your logic,
>>> well then that sounds like SplittableDoFn material; and in that case,
>>> someone that knows SDF better can give you guidance (or clarify if my
>>> suggestions are not correct).
>>> Best
>>> -P.
>>>
>>> On Thu, Mar 15, 2018, 8:09 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> TextIO supports compressed files. Do you want to read the files as text?
>>>>
>>>> Can you describe the use case in a bit more detail?
>>>>
>>>> Thanks
>>>> Regards
>>>> JB
>>>> On 15 March 2018 at 18:28, Shirish Jamthe <sjam...@google.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> My input is a tar.gz or .zip file which contains thousands of tar.gz
>>>>> files and other files.
>>>>> I would like to extract the tar.gz files from the tar.
>>>>>
>>>>> Is there a transform that can do that? I couldn't find one.
>>>>> If not, is it in the works? Any pointers to start work on it?
>>>>>
>>>>> thanks
>>>>>
>>>> --
>>> Got feedback? go/pabloem-feedback
>>> <https://goto.google.com/pabloem-feedback>
>>>
>>
>>

Attachment: dataset.tar.gz
Description: application/gzip

Attachment: output.tar
Description: Unix tar archive
