Hi,
This is not the expected behavior. Each parallel instance should read only
one file; the files should not be read multiple times by the different
parallel instances.
How did you check / find out that each node is reading all the data?
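As background on why one-file-per-subtask is what should happen: each gzipped file is non-splittable, so it becomes exactly one input split, and Flink hands each split to exactly one parallel subtask. The sketch below models that distribution in plain Java; it is only an illustration with a simple round-robin policy (Flink's actual split assignment is lazy and locality-aware, and the class and method names here are made up, not Flink internals):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SplitAssignmentSketch {
    // Model: 20 .gz files -> 20 non-splittable splits, distributed over
    // `parallelism` subtasks. Each split goes to exactly one subtask,
    // so no file is read twice. (Illustrative round-robin policy only.)
    static Map<Integer, List<String>> assign(List<String> files, int parallelism) {
        Map<Integer, List<String>> bySubtask = new HashMap<>();
        for (int i = 0; i < files.size(); i++) {
            bySubtask.computeIfAbsent(i % parallelism, k -> new ArrayList<>())
                     .add(files.get(i));
        }
        return bySubtask;
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            files.add("part-" + i + ".gz"); // hypothetical file names
        }
        // With 20 files and parallelism 20, each subtask gets exactly one file.
        Map<Integer, List<String>> assignment = assign(files, 20);
        assignment.forEach((subtask, fs) ->
            System.out.println("subtask " + subtask + " reads " + fs));
    }
}
```

With 20 files and parallelism 20, every subtask ends up with exactly one file and no file appears in two subtasks' lists, which is the behavior you should be seeing.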

Regards,
Robert

On Tue, Nov 22, 2016 at 7:42 PM, Alex Reid <alex.james.r...@gmail.com>
wrote:

> Hi, I've been playing around with using apache flink to process some data,
> and I'm starting out using the batch DataSet API.
>
> To start, I read in some data from files in an S3 folder:
>
> DataSet<String> records = env.readTextFile("s3://my-s3-bucket/some-folder/");
>
>
> Within the folder, there are 20 gzipped files, and I have 20 parallel tasks
> running (parallelism 20). It looks like each task is reading ALL the files
> (the whole folder), but what I really want is for each task to read one file
> and process only the data from that file.
>
> Is this expected behavior? Am I supposed to be doing something different here
> to get the result I want?
>
> Thanks.
>
>
