We have been looking for a while for some way to decouple the S3 filesystem support from Hadoop.
Does anyone know a good S3 connector library that works independently of
Hadoop and EMRFS?

Best,
Stephan

On Wed, Nov 23, 2016 at 7:57 PM, Greg Hogan <c...@greghogan.com> wrote:

> EMRFS looks to *add* cost (and consistency).
>
> Storing an object to S3 costs "$0.005 per 1,000 requests", so $0.432/day
> at 1 Hz. Is the number of checkpoint files simply parallelism * number of
> operators? That could add up quickly.
>
> Is the recommendation to run HDFS on EBS?
>
> On Wed, Nov 23, 2016 at 12:40 PM, Jonathan Share <jon.sh...@gmail.com>
> wrote:
>
>> Hi Greg,
>>
>> Standard storage class, everything is on defaults, we've not done
>> anything special with the bucket.
>>
>> CloudWatch only appears to give me total billing for S3 in general; I
>> don't see a breakdown unless that's something I can configure somewhere.
>>
>> Regards,
>> Jonathan
>>
>>
>> On 23 November 2016 at 16:29, Greg Hogan <c...@greghogan.com> wrote:
>>
>>> Hi Jonathan,
>>>
>>> Which S3 storage class are you using? Do you have a breakdown of the S3
>>> costs as storage / API calls / early deletes / data transfer?
>>>
>>> Greg
>>>
>>> On Wed, Nov 23, 2016 at 2:52 AM, Jonathan Share <jon.sh...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm interested in hearing if anyone else has experience with using
>>>> Amazon S3 as a state backend in the Frankfurt region. For political
>>>> reasons we've been asked to keep all European data in Amazon's
>>>> Frankfurt region. This causes a problem, as the S3 endpoint in
>>>> Frankfurt requires the use of AWS Signature Version 4 ("This new
>>>> Region supports only Signature Version 4" [1]) and this doesn't appear
>>>> to work with the Hadoop version that Flink is built against [2].
>>>>
>>>> After some hacking we have managed to create a Docker image with a
>>>> build of Flink 1.2 master, copying over jar files from the Hadoop
>>>> 3.0.0-alpha1 package. This appears to work for the most part, but we
>>>> still suffer from some classpath problems (conflicts between the AWS
>>>> API used in Hadoop and the one we want to use in our streams for
>>>> interacting with Kinesis) and the whole thing feels a little fragile.
>>>> Has anyone else tried this? Is there a simpler solution?
>>>>
>>>> As a follow-up question, we saw that with checkpointing on three
>>>> relatively simple streams set to 1 second, our S3 costs were higher
>>>> than the EC2 costs for our entire infrastructure. This seems slightly
>>>> disproportionate. For now we have reduced the checkpointing interval
>>>> to 10 seconds and that has greatly improved the cost projections
>>>> graphed via Amazon CloudWatch, but I'm interested in hearing other
>>>> people's experience with this. Is that the kind of billing level we
>>>> can expect or is this a symptom of a misconfiguration? Is this a setup
>>>> others are using? As we are using Kinesis as the source for all
>>>> streams I don't see a huge risk with larger checkpoint intervals, and
>>>> our sinks are designed to mostly tolerate duplicates (some
>>>> improvements can be made).
>>>>
>>>> Thanks in advance
>>>> Jonathan
>>>>
>>>>
>>>> [1] https://aws.amazon.com/blogs/aws/aws-region-germany/
>>>> [2] https://issues.apache.org/jira/browse/HADOOP-13324
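
For reference, the request-cost arithmetic in Greg's reply can be reproduced with a rough sketch like the one below. The price of $0.005 per 1,000 requests is the figure quoted in the thread and may not match current S3 pricing. The function name `daily_put_cost` and the `files_per_checkpoint` parameter are hypothetical, and the example value of 12 files per checkpoint is only an illustration of Greg's open question (parallelism * number of operators), not a confirmed description of how Flink writes checkpoints. Storage, GET/LIST/DELETE requests and data transfer are ignored.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_put_cost(interval_s, files_per_checkpoint=1, price_per_1000=0.005):
    """Estimate the daily S3 PUT request cost in USD for periodic checkpoints.

    interval_s           -- checkpoint interval in seconds
    files_per_checkpoint -- objects written per checkpoint (an assumption)
    price_per_1000       -- USD per 1,000 PUT requests (figure quoted in the thread)
    """
    checkpoints_per_day = SECONDS_PER_DAY / interval_s
    requests_per_day = checkpoints_per_day * files_per_checkpoint
    return requests_per_day * price_per_1000 / 1000


if __name__ == "__main__":
    # Compare the 1 s and 10 s intervals mentioned in the thread,
    # for one file per checkpoint and for a hypothetical 12 files.
    for interval_s in (1, 10):
        for files in (1, 12):
            cost = daily_put_cost(interval_s, files_per_checkpoint=files)
            print(f"interval={interval_s:>2}s files={files:>2} -> ${cost:.4f}/day")
```

With one object per second this gives the $0.432/day from Greg's message. Under this model the request cost scales linearly with both the number of objects per checkpoint and the checkpoint frequency, so moving from a 1 second to a 10 second interval cuts it by a factor of ten.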