Re: S3 checkpointing in AWS in Frankfurt

2016-11-24 Thread Stephan Ewen
We have been looking for a while for some way to decouple the S3 filesystem
support from Hadoop.

Does anyone know a good S3 connector library that works independently of
Hadoop and EMRFS?

Best,
Stephan




Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Greg Hogan
EMRFS looks to *add* cost (and consistency).

Storing an object to S3 costs "$0.005 per 1,000 requests", so $0.432/day at
1 Hz. Is the number of checkpoint files simply parallelism * number of
operators? That could add up quickly.
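
A rough back-of-the-envelope sketch of that arithmetic, assuming (purely for
illustration) that every checkpoint writes one S3 object per stateful
operator per parallel subtask plus one metadata object:

// Back-of-the-envelope estimate of the daily S3 PUT cost of checkpointing.
// The per-checkpoint object count below is an assumption for illustration,
// not a statement about how Flink actually lays out checkpoint files.
public class CheckpointCostEstimate {

    public static void main(String[] args) {
        double putCostPer1000 = 0.005;      // USD per 1,000 PUT requests
        double checkpointIntervalSec = 1.0; // checkpointing at 1 Hz
        int parallelism = 4;                // example value
        int statefulOperators = 3;          // example value

        double checkpointsPerDay = 86_400 / checkpointIntervalSec;
        double objectsPerCheckpoint = parallelism * statefulOperators + 1;
        double putsPerDay = checkpointsPerDay * objectsPerCheckpoint;
        double dailyCost = putsPerDay / 1_000 * putCostPer1000;

        // A single object per checkpoint at 1 Hz is already ~$0.432/day;
        // parallelism * operators multiplies that figure.
        System.out.printf("PUTs/day: %.0f, estimated cost: $%.2f/day%n",
                putsPerDay, dailyCost);
    }
}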

Is the recommendation to run HDFS on EBS?



Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Jonathan Share
We're not running on EMR (running Flink as a standalone cluster on
Kubernetes on EC2). I assume that it's not possible to use EMRFS if not
running on Amazon's EMR images.




Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Jonathan Share
Hi Scott,

Thanks for the suggestion; it sounds like you and I think alike, and moving
over to HDFS sounds to me like the simplest solution.

There is no requirement to use S3; it's just that another team member is
generally sceptical, fearing that adding HDFS will introduce a new class of
maintenance problems to our stack, and the project has a general goal of
using managed services as much as possible, so we wanted to try to make it
work.

Regards,
Jonathan




Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Jonathan Share
Hi Greg,

Standard storage class; everything is on defaults and we've not done anything
special with the bucket.

CloudWatch only appears to give me total billing for S3 in general; I don't
see a breakdown unless that's something I can configure somewhere.

Regards,
Jonathan




Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Foster, Craig
I would suggest using EMRFS anyway, which is the way to access the S3 file 
system from EMR (using the same s3:// prefixes).  That said, you will run into 
the same shading issues in our build until the next release—which is coming up 
relatively shortly.





Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Robert Metzger
Hi Jonathan,

Have you tried using Amazon's latest EMR Hadoop distribution? Maybe they've
fixed the issue there for older Hadoop releases?



Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Scott Kidder
Hi Jonathan,

You might be better off creating a small Hadoop HDFS cluster just for the
purpose of storing Flink checkpoint & savepoint data. Like you, I tried
using S3 to persist Flink state, but encountered AWS SDK issues and felt
like I was going down an ill-advised path. I then created a small 3-node
HDFS cluster in the same region as my Flink hosts but distributed across 3
AZs. The checkpointing is very fast and, most importantly, just works.
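
A minimal sketch of what pointing Flink at such an HDFS cluster can look
like; the NameNode address and checkpoint path are placeholders, and this
assumes the filesystem state backend API as of Flink 1.2:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HdfsCheckpointingJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Enable checkpointing (interval in milliseconds).
        env.enableCheckpointing(10_000);

        // Placeholder NameNode address and path -- adjust to your cluster.
        env.setStateBackend(
                new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));

        // Trivial pipeline so the sketch runs end to end.
        env.fromElements(1, 2, 3).print();

        env.execute("Job checkpointing to HDFS");
    }
}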

Is there a firm requirement to use S3, or could you use HDFS instead?

Best,

--Scott Kidder



Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Greg Hogan
Hi Jonathan,

Which S3 storage class are you using? Do you have a breakdown of the S3
costs as storage / API calls / early deletes / data transfer?

Greg



S3 checkpointing in AWS in Frankfurt

2016-11-22 Thread Jonathan Share
Hi,

I'm interested in hearing if anyone else has experience with using Amazon
S3 as a state backend in the Frankfurt region. For political reasons we've
been asked to keep all European data in Amazon's Frankfurt region. This
causes a problem because the S3 endpoint in Frankfurt requires AWS Signature
Version 4 ("This new Region supports only Signature Version 4" [1]), and
that doesn't appear to work with the Hadoop version that Flink is built
against [2].

After some hacking we have managed to create a Docker image with a build of
Flink 1.2 master, copying over jar files from the Hadoop 3.0.0-alpha1
package, and this appears to work for the most part. However, we still
suffer from some classpath problems (conflicts between the AWS APIs used in
Hadoop and those we want to use in our streams for interacting with Kinesis)
and the whole thing feels a little fragile. Has anyone else tried this? Is
there a simpler solution?

As a follow-up question, we saw that with checkpointing set to 1 second on
three relatively simple streams, our S3 costs were higher than the EC2 costs
for our entire infrastructure. This seems slightly disproportionate. For now
we have reduced the checkpointing interval to 10 seconds, which has greatly
improved the cost projections graphed via Amazon CloudWatch, but I'm
interested in hearing other people's experience with this. Is that the kind
of billing level we can expect, or is this a symptom of a misconfiguration?
Is this a setup others are using? As we are using Kinesis as the source for
all streams, I don't see a huge risk with larger checkpoint intervals, and
our sinks are designed to mostly tolerate duplicates (some improvements can
be made).
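
For illustration, a minimal sketch of how such a 10-second interval can be
configured; the bucket path and the minimum-pause setting are assumptions
for the example, not our actual configuration:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointIntervalExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 10 seconds instead of every second.
        env.enableCheckpointing(10_000);

        // Optionally leave breathing room so a slow checkpoint cannot
        // immediately trigger the next one.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5_000);

        // Placeholder bucket/path; the S3 filesystem support itself is the
        // problematic part discussed in this thread.
        env.setStateBackend(
                new FsStateBackend("s3://my-bucket/flink/checkpoints"));

        // Trivial pipeline so the sketch runs end to end.
        env.fromElements(1, 2, 3).print();

        env.execute("Checkpoint interval example");
    }
}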

Thanks in advance
Jonathan


[1] https://aws.amazon.com/blogs/aws/aws-region-germany/
[2] https://issues.apache.org/jira/browse/HADOOP-13324