Re: Ramifications of minimizing use of .tmp directories / renames in HBase when using S3 as backing store

2015-09-10 Thread Andrew Purtell
> Consistency issues: Since S3 has read after write consistency for new
> objects

Eventually. The problem is reads for the new objects may fail for some arbitrary time first, with 404 or 500 responses as I mentioned before.

> Appends: An append in S3 can be modeled as a read / copy /
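
To make that failure mode concrete, here is a minimal client-side sketch in Java, assuming the AWS SDK for Java v1; the retry limit and backoff are illustrative and are not anything HBase itself does. A reader of a key that was just written may keep seeing 404 or 500 responses for a while before the object becomes visible:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.AmazonS3Exception;
    import com.amazonaws.services.s3.model.S3Object;

    public class EventualRead {
        // Read a freshly written key, retrying on the transient 404/500s
        // described above. The attempt limit and backoff are arbitrary.
        static S3Object readWithRetry(AmazonS3 s3, String bucket, String key)
                throws InterruptedException {
            int attempts = 0;
            while (true) {
                try {
                    return s3.getObject(bucket, key);   // may fail until the object is visible
                } catch (AmazonS3Exception e) {
                    int status = e.getStatusCode();
                    if ((status == 404 || status >= 500) && ++attempts < 10) {
                        Thread.sleep(200L * attempts);  // crude linear backoff
                        continue;
                    }
                    throw e;                            // give up after repeated failures
                }
            }
        }
    }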

Re: Ramifications of minimizing use of .tmp directories / renames in HBase when using S3 as backing store

2015-09-09 Thread Matteo Bertozzi
HBase relies on .tmp directories to do some sort of "atomic" file creation and to avoid problems like half-written data when it crashes. There is a JIRA open to solve that problem in one of the next major releases: https://issues.apache.org/jira/browse/HBASE-14090 There is a document in it, if you
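
For context, this is roughly the pattern Matteo is describing, sketched against the Hadoop FileSystem API; the method and path names are illustrative, not HBase's actual code. On HDFS the final rename is an atomic metadata operation, which is what makes the .tmp step worthwhile; on S3-backed filesystems a rename is instead a copy plus delete, which is neither atomic nor cheap.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TmpThenRename {
        // Write into a .tmp location first, then rename into place. A crash
        // mid-write leaves only garbage under .tmp, never a half-written
        // file at the final path.
        static void writeAtomically(FileSystem fs, Path tmpDir, Path finalPath, byte[] data)
                throws IOException {
            Path tmp = new Path(tmpDir, finalPath.getName());
            try (FSDataOutputStream out = fs.create(tmp, true)) {
                out.write(data);
            }
            if (!fs.rename(tmp, finalPath)) {
                throw new IOException("rename failed: " + tmp + " -> " + finalPath);
            }
        }
    }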

Re: Ramifications of minimizing use of .tmp directories / renames in HBase when using S3 as backing store

2015-09-09 Thread Anthony Nguyen
Thanks Matteo. If I understand correctly, one example of how the .tmp directories help prevent issues is as follows: if HBase were to crash during a compaction, cleanup is much easier because these .tmp directories are cleared out at startup, right? On Wed, Sep 9, 2015 at 7:31 PM, Matteo Bertozzi
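
A rough sketch of the cleanup Anthony is describing, again against the Hadoop FileSystem API with illustrative names: anything still sitting under .tmp at startup is, by construction, a leftover from an interrupted write, so it can simply be deleted.

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TmpCleanup {
        // Delete leftovers from writes (e.g. a compaction) that were cut
        // short by a crash. Nothing under .tmp was ever visible to readers.
        static void clearTmp(FileSystem fs, Path regionDir) throws IOException {
            Path tmp = new Path(regionDir, ".tmp");
            if (fs.exists(tmp)) {
                fs.delete(tmp, true);   // recursive delete of partial files
            }
        }
    }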

Ramifications of minimizing use of .tmp directories / renames in HBase when using S3 as backing store

2015-09-09 Thread Anthony Nguyen
Hi all, I'm investigating the use of S3 as a backing store for HBase. Would there be any major issues with modifying HBase in such a way that, when an S3 location is set for the rootdir, writes to .tmp are removed or minimized and data is written directly to the final destination? The reason I'd
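
One way the switch Anthony proposes could be keyed is off the scheme of the configured rootdir; the helper below is hypothetical and only illustrates the idea. On an s3/s3n/s3a filesystem a rename is a copy plus delete rather than a metadata operation, so the .tmp-then-rename step loses most of its value there.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class RootDirCheck {
        // Hypothetical predicate: skip the .tmp indirection when the rootdir
        // points at an S3-backed filesystem.
        static boolean skipTmpDir(Configuration conf) {
            Path root = new Path(conf.get("hbase.rootdir"));
            String scheme = root.toUri().getScheme();
            return scheme != null && scheme.startsWith("s3");
        }
    }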

Re: Ramifications of minimizing use of .tmp directories / renames in HBase when using S3 as backing store

2015-09-09 Thread Andrew Purtell
Using S3 as a backing store for HBase cannot work. This has been attempted in the past (although not by me, so this isn't firsthand knowledge). One basic problem is that HBase expects to be able to read what it has written immediately after the write completes. For example, opening a store file
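
The expectation Andrew describes looks roughly like the sketch below (illustrative Hadoop FileSystem code, not HBase internals): the file is opened for reading immediately after the writer closes it, which is always safe on HDFS but can throw FileNotFoundException for a while on an eventually consistent store.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteThenOpen {
        // Write a store file and immediately open it again, e.g. to read its
        // trailer before serving it. HBase assumes this read never fails.
        static void writeThenOpen(FileSystem fs, Path storeFile, byte[] bytes) throws IOException {
            try (FSDataOutputStream out = fs.create(storeFile, true)) {
                out.write(bytes);
            }
            try (FSDataInputStream in = fs.open(storeFile)) {
                in.read();   // expected to succeed right away
            }
        }
    }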

Re: Ramifications of minimizing use of .tmp directories / renames in HBase when using S3 as backing store

2015-09-09 Thread iain wright
Hi Anthony, Just curious, you mention your access pattern is mostly reads. Is it random reads, M/R jobs over a portion of the dataset, or other? Best, -- Iain Wright

Re: Ramifications of minimizing use of .tmp directories / renames in HBase when using S3 as backing store

2015-09-09 Thread iain wright
I see - If it was only M/R, Hive, or a similar reporting/analytics workload without the low-latency gets/read requirement, I was going to suggest writing to S3 directly and using Spark + HiveContext. Cheers, -- Iain Wright
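
For completeness, here is a small sketch of the alternative Iain mentions, in Spark 1.x Java API terms; the bucket path and table name are placeholders. The idea is to bypass HBase entirely and query files sitting in S3 through a HiveContext.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.hive.HiveContext;

    public class S3Reporting {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("s3-reporting");
            JavaSparkContext sc = new JavaSparkContext(conf);
            HiveContext hive = new HiveContext(sc.sc());

            // Read Parquet data straight out of S3 and run an ad-hoc query.
            DataFrame events = hive.read().parquet("s3a://example-bucket/events/");
            events.registerTempTable("events");
            hive.sql("SELECT dt, count(*) AS n FROM events GROUP BY dt").show();

            sc.stop();
        }
    }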

Re: Ramifications of minimizing use of .tmp directories / renames in HBase when using S3 as backing store

2015-09-09 Thread Anthony Nguyen
Hi Iain, Random reads for now, with further MR work w/ Hive possibly down the line. Thanks, -t

On Wed, Sep 9, 2015 at 9:31 PM, iain wright wrote:
> Hi Anthony,
>
> Just curious, you mention your access pattern is mostly reads. Is it random
> reads, M/R jobs over a portion