Re: Using S3 as a streaming File source

2020-09-02 Thread orionemail
OK, thanks for the warning on the cost point. I will check the cost calculations.

The bucket already has SNS enabled for another solution to this problem, but 
I'm trying to use the minimal number of different software components at this 
stage of the pipeline. My preferred approach would have been for the devices to send 
this data directly to a Kinesis/Kafka stream, but that is not an option at this time.

Thanks for the assistance.

Sent with [ProtonMail](https://protonmail.com) Secure Email.

‐‐‐ Original Message ‐‐‐
On Tuesday, 1 September 2020 17:53, Ayush Verma  wrote:

> Word of caution. Streaming from S3 is really cost prohibitive as the only way 
> to detect new files is to continuously spam the S3 List API.
>
> On Tue, Sep 1, 2020 at 4:50 PM Jörn Franke  wrote:
>
>> Why don’t you get an S3 notification on SQS and do the actions from there?
>>
>> You will probably need to write the content of the files to a NoSQL 
>> database.
>>
>> Alternatively, send the S3 notification to Kafka and read it into Flink from there.
>>
>> https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
>>
>>> On 01.09.2020 at 16:46, orionemail wrote:
>>
>>> 
>>> Hi,
>>>
>>> I have an S3 bucket that is continuously written to by millions of devices. 
>>> These upload small compressed archives.
>>>
>>> What I want to do is treat the tar gzipped (.tgz) files as a streaming 
>>> source and process each archive. The archive contains three files that each 
>>> might need to be processed.
>>>
>>> I see that
>>>
>>> env.readFile(f, bucket, FileProcessingMode.PROCESS_CONTINUOUSLY, 1L).print();
>>>
>>> might do what I need, but I am unsure how best to implement 'f' - the 
>>> FileInputFormat. Is there a similar example for me to reference?
>>>
>>> Or is this idea not workable with this method? I need to ensure exactly 
>>> once, and also trigger removal of the files after processing.
>>>
>>> Thanks,
>>>
>>> Sent with [ProtonMail](https://protonmail.com) Secure Email.

Re: Using S3 as a streaming File source

2020-09-01 Thread Ayush Verma
Word of caution. Streaming from S3 is really cost prohibitive as the only
way to detect new files is to continuously spam the S3 List API.
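
As a rough back-of-the-envelope, assuming on the order of $0.005 per 1,000 LIST
requests (us-east-1 pricing at the time): polling a prefix once per second is about
86,400 LIST calls a day, roughly $0.40/day, and a prefix holding millions of objects
needs thousands of paginated LIST calls (1,000 keys each) per scan, so a full
re-listing every interval multiplies that accordingly.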

On Tue, Sep 1, 2020 at 4:50 PM Jörn Franke  wrote:

> Why don’t you get an S3 notification on SQS and do the actions from there?
>
> You will probably need to write the content of the files to a NoSQL
> database.
>
> Alternatively, send the S3 notification to Kafka and read it into Flink from there.
>
>
> https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
>
>
> On 01.09.2020 at 16:46, orionemail wrote:
>
> 
> Hi,
>
> I have an S3 bucket that is continuously written to by millions of
> devices.  These upload small compressed archives.
>
> What I want to do is treat the tar gzipped (.tgz) files as a streaming
> source and process each archive.  The archive contains three files that
> each might need to be processed.
>
> I see that
>
> env.readFile(f, bucket, FileProcessingMode.PROCESS_CONTINUOUSLY, 1L).print();
>
> might do what I need, but I am unsure how best to implement 'f' - the
> FileInputFormat. Is there a similar example for me to reference?
>
> Or is this idea not workable with this method? I need to ensure exactly
> once, and also trigger removal of the files after processing.
>
> Thanks,
>
>
> Sent with ProtonMail Secure Email.
>
>


Re: Using S3 as a streaming File source

2020-09-01 Thread Jörn Franke
Why don’t you get an S3 notification on SQS and do the actions from there?

You will probably need to write the content of the files to a NoSQL database.

Alternatively, send the S3 notification to Kafka and read it into Flink from there.


https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
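
For reference, a minimal sketch of the Kafka route, assuming the bucket's event
notifications end up in a topic named "s3-notifications" (the topic name, broker
address, and consumer group below are placeholders, not something from this thread):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class S3NotificationJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder broker and group id; the S3 event notifications are assumed
        // to be forwarded into this topic (e.g. via SNS or a small bridge).
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092");
        props.setProperty("group.id", "s3-notification-reader");

        // Each record is the raw S3 event notification JSON; a downstream map/process
        // function would parse out bucket/key and fetch + unpack the .tgz object.
        DataStream<String> notifications = env.addSource(
                new FlinkKafkaConsumer<>("s3-notifications", new SimpleStringSchema(), props));

        notifications.print();
        env.execute("s3-notification-pipeline");
    }
}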


> On 01.09.2020 at 16:46, orionemail wrote:
> 
> 
> Hi,
> 
> I have an S3 bucket that is continuously written to by millions of devices.
> These upload small compressed archives.
> 
> What I want to do is treat the tar gzipped (.tgz) files as a streaming source 
> and process each archive.  The archive contains three files that each might 
> need to be processed.
> 
> I see that 
> env.readFile(f, bucket, FileProcessingMode.PROCESS_CONTINUOUSLY, 
> 1L).print();
> might do what I need, but I am unsure how best to implement 'f' - the 
> FileInputFormat. Is there a similar example for me to reference?
> 
> Or is this idea not workable with this method? I need to ensure exactly once, 
> and also trigger removal of the files after processing.
> 
> Thanks,
> 
> 
> Sent with ProtonMail Secure Email.
> 


Using S3 as a streaming File source

2020-09-01 Thread orionemail
Hi,

I have an S3 bucket that is continuously written to by millions of devices. 
These upload small compressed archives.

What I want to do is treat the tar gzipped (.tgz) files as a streaming source 
and process each archive. The archive contains three files that each might need 
to be processed.

I see that

env.readFile(f, bucket, FileProcessingMode.PROCESS_CONTINUOUSLY, 1L).print();

might do what I need, but I am unsure how best to implement 'f' - the 
FileInputFormat. Is there a similar example for me to reference?

Or is this idea not workable with this method? I need to ensure exactly once, 
and also trigger removal of the files after processing.
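
For reference, a minimal sketch of how the 'f' parameter plugs into readFile, using
Flink's built-in TextInputFormat purely as a stand-in (the bucket path is a
placeholder; reading the .tgz archives themselves would need a custom FileInputFormat
that opens each object and iterates over the tar entries):

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class S3FileMonitorJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder bucket path; requires an S3 filesystem plugin on the classpath.
        String bucket = "s3://my-bucket/incoming/";

        // TextInputFormat stands in for 'f' just to show how the pieces fit together.
        TextInputFormat f = new TextInputFormat(new Path(bucket));

        // Re-scan the path every 1 ms (the 1L from the snippet above) and emit the
        // contents of newly discovered files.
        env.readFile(f, bucket, FileProcessingMode.PROCESS_CONTINUOUSLY, 1L).print();

        env.execute("s3-file-monitor");
    }
}

Note that PROCESS_CONTINUOUSLY re-processes a file in full if it is modified, and the
monitoring source does not delete files it has read, so the removal step would have
to happen outside the source (for example via an S3 lifecycle rule or a separate
cleanup step keyed off the processed file names).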

Thanks,

Sent with [ProtonMail](https://protonmail.com) Secure Email.

Re: Streaming file source?

2017-01-20 Thread Niels Basjes
Thanks!

This sounds really close to what I had in mind.
I'll use this first and see how far I get.

Niels

On Fri, Jan 20, 2017 at 11:27 AM, Stephan Ewen  wrote:

> Hi Niels!
>
> There is the Continuous File Monitoring Source, used via
>
> StreamExecutionEnvironment.readFile(FileInputFormat
> inputFormat, String filePath, FileProcessingMode watchType, long interval);
>
> This can be used either to continuously ingest from files or to read files
> once.
>
> Kostas can probably comment more about whether and how you can make the
> file order deterministic.
>
> Stephan
>
>
> On Fri, Jan 20, 2017 at 11:20 AM, Niels Basjes  wrote:
>
>> Hi,
>>
>> For testing and optimizing a streaming application I want to have a "100%
>> accurate repeatable" substitute for a Kafka source.
>> I was thinking of creating a streaming source class that simply reads the
>> records from a (static unchanging) set of files.
>> Each file would then produce the data which (in the live situation) would come
>> from a single Kafka partition.
>>
>> I hate reinventing the wheel, so I'm wondering: has something like this
>> already been built by someone?
>> If so, where can I find it?
>>
>> --
>> Best regards / Met vriendelijke groeten,
>>
>> Niels Basjes
>>
>
>


-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Streaming file source?

2017-01-20 Thread Stephan Ewen
Hi Niels!

There is the Continuous File Monitoring Source, used via

StreamExecutionEnvironment.readFile(FileInputFormat
inputFormat, String filePath, FileProcessingMode watchType, long interval);

This can be used either to continuously ingest from files or to read files
once.
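
A minimal sketch of the read-once variant for the Kafka-substitute use case, assuming
the static test files sit under a local directory (the path is a placeholder, and
TextInputFormat just stands in for whatever record format the real Kafka topics
carried):

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class FileReplayJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder directory holding the static test files (one file per simulated partition).
        String testData = "file:///tmp/kafka-replay/";
        TextInputFormat format = new TextInputFormat(new Path(testData));

        // PROCESS_ONCE scans the path a single time, emits every record and finishes,
        // giving a repeatable bounded stream in place of the Kafka source.
        // The interval argument is ignored in this mode.
        env.readFile(format, testData, FileProcessingMode.PROCESS_ONCE, -1L).print();

        env.execute("file-replay-test");
    }
}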

Kostas can probably comment more about whether and how you can make the
file order deterministic.

Stephan


On Fri, Jan 20, 2017 at 11:20 AM, Niels Basjes  wrote:

> Hi,
>
> For testing and optimizing a streaming application I want to have a "100%
> accurate repeatable" substitute for a Kafka source.
> I was thinking of creating a streaming source class that simply reads the
> records from a (static unchanging) set of files.
> Each file would then produce the data which (in the live situation) would come
> from a single Kafka partition.
>
> I hate reinventing the wheel, so I'm wondering: has something like this
> already been built by someone?
> If so, where can I find it?
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>


Streaming file source?

2017-01-20 Thread Niels Basjes
Hi,

For testing and optimizing a streaming application I want to have a "100%
accurate repeatable" substitute for a Kafka source.
I was thinking of creating a streaming source class that simply reads the
records from a (static unchanging) set of files.
Each file would then produce the data which (in the live situation) would come
from a single Kafka partition.

I hate reinventing the wheel, so I'm wondering: has something like this
already been built by someone?
If so, where can I find it?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes