Thanks! I understand what you mean; however, I still have a little confusion. Does it mean there are unused blocks sitting around? For example:
HFile1 with 3 blocks spread across 3 nodes:
  Node A: (b1), b2, b3
  Node B: b1, (b2), b3
  Node C: b1, b2, (b3)

HFile2 with 3 blocks spread across 3 nodes:
  Node A: (b1), b2, b3
  Node B: b1, (b2), b3
  Node C: b1, b2, (b3)

I have 2 questions:

1) When compaction occurs on Node A, would it also include b2 and b3, which are actually redundant copies? My guess is yes.

2) Now compaction occurs and creates HFile3, which, as you said, is replicated. But what happens to HFile1 and HFile2? I am assuming they get deleted.

Thanks for everyone's patience!

On Thu, Jul 7, 2011 at 2:43 PM, Buttler, David <buttl...@llnl.gov> wrote:
> The nice part of using HDFS as the file system is that the replication is
> taken care of by the file system. So, when the compaction finishes, that
> means the replication has already taken place.
>
> -----Original Message-----
> From: Mohit Anchlia [mailto:mohitanch...@gmail.com]
> Sent: Thursday, July 07, 2011 2:02 PM
> To: user@hbase.apache.org; Andrew Purtell
> Subject: Re: Hbase performance with HDFS
>
> Thanks Andrew. Really helpful. I think I have one more question right
> now :) Underneath, HDFS replicates blocks 3 times by default. Not sure how it
> relates to HFiles and compactions. When compaction occurs, is it also
> happening on the replica blocks on other nodes? If not, then how does
> it work when one node fails?
>
> On Thu, Jul 7, 2011 at 1:53 PM, Andrew Purtell <apurt...@apache.org> wrote:
>>> You mentioned about compactions, when do those occur and what triggers
>>> them?
>>
>> Compactions are triggered by an algorithm that monitors the number of flush
>> files in a store and the size of them, and is configurable in several
>> dimensions.
>>
>>> Does it cause additional space usage when that happens
>>
>> Yes.
>>
>>> if it
>>> does, it would mean you always need to have much more disk than you
>>> really need.
>>
>> Not all regions are compacted at once. Each region by default is constrained
>> to 256 MB. Not all regions will hold the full amount of data. The result is
>> not a perfect copy (doubling) if some data has been deleted or is
>> associated with TTLs that have expired. The merge-sorted result is moved
>> into place and the old files are deleted as soon as the compaction
>> completes. So how much more is "much more"? You can't write to any kind of
>> data store on a (nearly) full volume anyway, no matter HBase/HDFS, or MySQL,
>> or...
>>
>>> Since HDFS is mostly write once how are updates/deletes handled?
>>
>> Not mostly, only write once.
>>
>> From the BigTable paper, section 5.3: "A valid read operation is executed on
>> a merged view of the sequence of SSTables and the memtable. Since the
>> SSTables and the memtable are lexicographically sorted data structures, the
>> merged view can be formed efficiently." What this means is that all the store
>> files and the memstore serve effectively as change logs sorted in reverse
>> chronological order.
>>
>> Deletes are just another write, but one that writes tombstones "covering"
>> data with older timestamps.
>>
>> When serving queries, HBase searches store files back in time until it finds
>> data at the coordinates requested or a tombstone.
>>
>> The process of compaction not only merge-sorts a bunch of accumulated store
>> files (from flushes) into fewer store files (or one) for read efficiency, it
>> also performs housekeeping, dropping data "covered" by the delete
>> tombstones. Incidentally, this is also how TTLs are supported: expired values
>> are dropped as well.
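
To make this concrete, here is a minimal sketch against the 0.90-era HBase client API, with made-up names (table "mytable", family "cf", column "col"): an update and a delete are both just new writes, the delete being a tombstone that covers older cells, and a read merges the memstore and store files so the covered data stays invisible until compaction physically drops it.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TombstoneSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // hypothetical table

        // An "update" is just another write with a newer timestamp; nothing
        // in the existing store files is modified in place.
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("v2"));
        table.put(put);

        // A delete is also a write: a tombstone covering the older cells.
        Delete delete = new Delete(Bytes.toBytes("row1"));
        table.delete(delete);

        // A read merges the memstore and the store files in reverse
        // chronological order; the tombstone is found first, so row1 appears
        // empty even though older versions still sit in the old HFiles.
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println("row1 empty after delete? " + result.isEmpty());

        table.close();
      }
    }

The older cells and the tombstone itself only disappear from disk when a major compaction rewrites the store files, which is the housekeeping described above.
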
>>
>> Best regards,
>>
>> - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
>>
>>>________________________________
>>>From: Mohit Anchlia <mohitanch...@gmail.com>
>>>To: Andrew Purtell <apurt...@apache.org>
>>>Cc: "user@hbase.apache.org" <user@hbase.apache.org>
>>>Sent: Thursday, July 7, 2011 12:30 PM
>>>Subject: Re: Hbase performance with HDFS
>>>
>>>Thanks, that helps! Just a few more questions:
>>>
>>>You mentioned compactions; when do those occur and what triggers
>>>them? Does it cause additional space usage when that happens? If it
>>>does, it would mean you always need to have much more disk than you
>>>really need.
>>>
>>>Since HDFS is mostly write once, how are updates/deletes handled?
>>>
>>>Is HBase also suitable for blobs?
>>>
>>>On Thu, Jul 7, 2011 at 12:11 PM, Andrew Purtell <apurt...@apache.org> wrote:
>>>> Some thoughts off the top of my head. Lars' architecture material
>>>> might/should cover this too. Pretty sure his book will.
>>>>
>>>> Regarding reads:
>>>>
>>>> One does not have to read a whole HDFS block. You can request arbitrary byte
>>>> ranges within the block, via positioned reads. (It is true also that HDFS can
>>>> be improved for better random reading performance in ways not necessarily
>>>> yet committed to trunk, or especially to a 0.20.x branch with append support
>>>> for HBase. See https://issues.apache.org/jira/browse/HDFS-1323)
>>>>
>>>> HBase holds indexes to store files in HDFS in memory. We also open all store
>>>> files at the HDFS layer and stash those references. Additionally, users can
>>>> specify the use of bloom filters to improve query-time performance through
>>>> wholesale skipping of HFile reads if they are known not to contain data that
>>>> satisfies the query. Bloom filters are held in memory as well.
>>>>
>>>> So with indexes resident in memory, when handling Gets we know the byte
>>>> ranges within HDFS block(s) that contain the data of interest. With
>>>> positioned reads we retrieve only those bytes from a DataNode. With optional
>>>> bloom filters we avoid whole HFiles entirely.
>>>>
>>>> Regarding writes:
>>>>
>>>> I think you should consult the Bigtable paper again if you are still asking
>>>> about the write path. The database is log structured. Writes are accumulated
>>>> in memory, and flushed all at once. Later the flush files are compacted as
>>>> needed, because, as you point out, GFS and HDFS are optimized for streaming
>>>> sequential reads and writes.
>>>>
>>>> Best regards,
>>>>
>>>> - Andy
>>>>
>>>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>>>> (via Tom White)
>>>>
>>>> ________________________________
>>>> From: Mohit Anchlia <mohitanch...@gmail.com>
>>>> To: user@hbase.apache.org; Andrew Purtell <apurt...@apache.org>
>>>> Sent: Thursday, July 7, 2011 11:53 AM
>>>> Subject: Re: Hbase performance with HDFS
>>>>
>>>> I have looked at Bigtable and its SSTables, etc. But my question is
>>>> directly related to how it's used with HDFS. HDFS recommends large
>>>> files, bigger blocks, write once, and many sequential reads. But
>>>> accessing small rows and writing small rows is more random and
>>>> different from the inherent design of HDFS. How do these two go together
>>>> and still provide good performance?
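
To make the positioned-read point above concrete, here is a minimal sketch against the plain HDFS client API. The path and offsets are made up; in HBase the block offset would come from the HFile index held in memory.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PositionedReadSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical store file path; in HBase this would be an HFile
        // somewhere under the /hbase root directory.
        Path storeFile = new Path("/hbase/mytable/1028785192/cf/2093847261874");

        FSDataInputStream in = fs.open(storeFile);
        byte[] buf = new byte[64 * 1024];   // roughly one HFile block
        long offset = 1048576L;             // known from the in-memory index

        // Positioned read: only these bytes travel from the DataNode,
        // regardless of the (e.g.) 64 MB HDFS block that contains them.
        in.readFully(offset, buf, 0, buf.length);
        in.close();
      }
    }

So a random read of a small row does not pay the cost of a full HDFS block; it pays roughly the cost of one HFile block plus the seek.
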
>>>>
>>>> On Thu, Jul 7, 2011 at 11:22 AM, Andrew Purtell <apurt...@apache.org>
>>>> wrote:
>>>>> Hi Mohit,
>>>>>
>>>>> Start here: http://labs.google.com/papers/bigtable.html
>>>>>
>>>>> Best regards,
>>>>>
>>>>> - Andy
>>>>>
>>>>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>>>>> (via Tom White)
>>>>>
>>>>>
>>>>>>________________________________
>>>>>>From: Mohit Anchlia <mohitanch...@gmail.com>
>>>>>>To: user@hbase.apache.org
>>>>>>Sent: Thursday, July 7, 2011 11:12 AM
>>>>>>Subject: Hbase performance with HDFS
>>>>>>
>>>>>>I've been trying to understand how HBase can provide good performance
>>>>>>on top of HDFS, when the purpose of HDFS is sequential access to large
>>>>>>blocks, which is inherently different from HBase, where access is more
>>>>>>random and row sizes might be very small.
>>>>>>
>>>>>>I am reading this, but it doesn't answer my question. It does say that
>>>>>>the HFile block size is different, but how it really works with HDFS is
>>>>>>what I am trying to understand.
>>>>>>
>>>>>>http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
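
On the HFile block size mentioned in the original question: it is a per-column-family setting (64 KB by default) and is independent of the HDFS block size. Below is a minimal sketch with the 0.90-era admin API and made-up names (table "mytable", family "cf"); the exact config key for the HDFS block size ("dfs.block.size" here) depends on the Hadoop version.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class BlockSizeSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // HFile block: the unit HBase indexes, reads, and caches.
        HTableDescriptor table = new HTableDescriptor("mytable");
        HColumnDescriptor cf = new HColumnDescriptor("cf");
        cf.setBlocksize(64 * 1024);   // 64 KB, the default
        table.addFamily(cf);
        new HBaseAdmin(conf).createTable(table);

        // HDFS block: the unit of replication and placement, much larger.
        long hdfsBlock = conf.getLong("dfs.block.size", 64L * 1024 * 1024);
        System.out.println("HDFS block size: " + hdfsBlock);
      }
    }

The large HDFS block is what gets replicated and streamed; the small HFile blocks inside it are what the in-memory index and positioned reads let HBase pick out individually.
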