Re: Error about rs block seek
Hi, all. Before the exception stack, there is an error log:

2013-05-13 00:00:14,491 ERROR org.apache.hadoop.hbase.io.hfile.HFileReaderV2: Current pos = 32651; currKeyLen = 45; currValLen = 80; block limit = 32775; HFile name = 1f96183d55144c058fa2a05fe5c0b814; currBlock currBlockOffset = 33550830

The operation is the scanner's next. Current pos + currKeyLen + currValLen > block limit: 32651 + 45 + 80 = 32776 > 32775. In my table config the blocksize is set to 32768, and after I changed the blocksize from 64k (the default value) to 32k, many of these error logs appeared. I use 0.94.3 - can someone tell me what influence the blocksize setting has? Thanks.

2013/5/13 ramkrishna vasudevan ramkrishna.s.vasude...@gmail.com
Your TTL is negative here: 'TTL => '-1''. Any reason for it to be negative? This could be a possible reason. Not sure..
Regards
Ram

On Mon, May 13, 2013 at 7:20 AM, Bing Jiang jiangbinglo...@gmail.com wrote:
Hi, Ted. No data block encoding. Our table config is below:
User Table Description CrawlInfo http://10.100.12.33:8003/table.jsp?name=CrawlInfo
{NAME => 'CrawlInfo', DEFERRED_LOG_FLUSH => 'true', MAX_FILESIZE => '34359738368', FAMILIES => [{NAME => 'CrawlStats', BLOOMFILTER => 'ROWCOL', CACHE_INDEX_ON_WRITE => 'true', TTL => '-1', CACHE_DATA_ON_WRITE => 'true', CACHE_BLOOMS_ON_WRITE => 'true', VERSIONS => '1', BLOCKSIZE => '32768'}]}

2013/5/13 Bing Jiang jiangbinglo...@gmail.com
Hi, JM. Our jdk version is 1.6.0_38.

2013/5/13 Jean-Marc Spaggiari jean-m...@spaggiari.org
Hi Bing, which JDK are you using? Thanks, JM

2013/5/12 Bing Jiang jiangbinglo...@gmail.com
Yes, we use hbase-0.94.3, and we changed block.size from 64k to 32k.

2013/5/13 Ted Yu yuzhih...@gmail.com
Can you tell us the version of hbase you are using? Did this problem happen recently? Thanks

On May 12, 2013, at 6:25 PM, Bing Jiang jiangbinglo...@gmail.com wrote:
Hi, all.
In our hbase cluster, there are many logs like below:

2013-05-13 00:00:04,161 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: java.lang.IllegalArgumentException
    at java.nio.Buffer.position(Buffer.java:216)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.blockSeek(HFileReaderV2.java:882)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.loadBlockAndSeekToKey(HFileReaderV2.java:753)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:487)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:501)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:226)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:145)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:131)
    at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:2073)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:3412)
    at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1642)
    at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1634)
    at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1610)
    at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4230)
    at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4204)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2025)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3461)
    at sun.reflect.GeneratedMethodAccessor30.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)

and Table config:

Can anyone tell me how I can find the reason for this?

--
Bing Jiang
weibo: http://weibo.com/jiangbinglover
BLOG: http://blog.sina.com.cn/jiangbinglover
National Research Center for Intelligent Computing Systems
Institute of Computing Technology, Graduate University of Chinese Academy of Science
Block size of HBase files
Hi,

I have the dfs.block.size value set to 1 GB in my cluster configuration. I have around 250 GB of data stored in hbase over this cluster. But when I check the number of blocks, it doesn't correspond to the block size value I set. From what I understand, I should only have ~250 blocks. But instead, when I did a fsck on /hbase/table-name, I got the following:

Status: HEALTHY
 Total size:    265727504820 B
 Total dirs:    1682
 Total files:   1459
 Total blocks (validated):      1459 (avg. block size 182129886 B)
 Minimally replicated blocks:   1459 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          5
 Number of racks:               1

Are there any other configuration parameters that need to be set?

-- Regards, Praveen Bysani http://www.praveenbysani.com
Re: Block size of HBase files
On Sun, May 12, 2013 at 11:40 PM, Praveen Bysani praveen.ii...@gmail.com wrote:
I have the dfs.block.size value set to 1 GB in my cluster configuration.

Just out of curiosity - why do you have it set at 1 GB?

Are there any other configuration parameters that need to be set?

What is your HFile size set to? The HFiles that get persisted would be bounded by that number. Thereafter each HFile would be split into blocks, the size of which you configure using the dfs.block.size configuration parameter.
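As a back-of-the-envelope check of that point, the fsck numbers quoted above already explain the block count: each persisted HFile is far smaller than the 1 GB dfs.block.size, so every file occupies exactly one HDFS block. A small sketch in plain Java, using only the numbers from the fsck output:

public class BlockCountCheck {
  public static void main(String[] args) {
    long totalBytes = 265727504820L;          // "Total size" from fsck
    long files = 1459;                        // "Total files" from fsck
    long dfsBlockSize = 1024L * 1024 * 1024;  // 1 GB dfs.block.size
    long avgFileSize = totalBytes / files;    // ~182,129,886 B, matching fsck's "avg. block size"
    long blocksPerFile = (avgFileSize + dfsBlockSize - 1) / dfsBlockSize;  // = 1
    System.out.println("avg HFile size = " + avgFileSize + " B, blocks per file = " + blocksPerFile);
    // files * blocksPerFile = 1459 blocks, which is exactly what fsck reports
  }
}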
RE: How to implement this check put and then update something logic?
Well, this did come from a graph domain. However, I think this could be a common problem whenever you need to update something according to the original value and a simple checkAndPut on a single value won't work.

Another example: if you want to implement something like UPDATE, you want to know whether this is a new value inserted or an update to an old value. That isn't easy now: you need to checkAndPut on null; if the column is not null, you get the value and checkAndPut on that value, since you want to make sure the column is still there. If that fails, you loop back to the null check. So I think a little bit of enhancement on the current HBASE atomic operations could greatly improve the usability for similar problems. Or maybe there is already a solution for this type of issue?

Maybe this problem is more in the graph domain? I know that there are projects aimed at representing graphs at large scale better. I'm saying this since you have one ID referencing another ID (using target ID).

On May 10, 2013, at 11:47 AM, Liu, Raymond raymond@intel.com wrote:
Thanks. It seems there is no other better solution? We really need a GetAndPut atomic op here ...

You can do this by looping over a checkAndPut operation until it succeeds.
-Mike

On Thu, May 9, 2013 at 8:52 PM, Liu, Raymond raymond@intel.com wrote:
Any suggestion?

Hi,

Say I have four fields for one record: id, status, targetid, and count. Status is on or off, targetid could reference another id, and count records the number of targetids with on status for the same id. A record could be added, deleted, or updated to change its status. I could put count in another table, or in the same table; it doesn't matter, as long as it works.

My question is how I can ensure the correctness of the count field when multiple clients update the table concurrently.

The closest thing I can think of is checkAndPut, but I will need two steps to find out the change of count, since checkAndPut etc. can only test a single value and only with an EQUAL comparator; thus I can only check for null first, then for on or off. When things change during these two steps, I need to retry from the first step until it succeeds. This could be bad when a lot of concurrent operations are going on.

And then I need to update count by checkAndIncrement. Though, if the above problem could be solved, the order of -1/+1 might not be important for the final result, at some intermediate time it might not reflect the real count at that time.

I know this kind of transaction is not the target of HBASE and the app should take care of it. Then, what's the best practice on this? Any quick, simple solution for my problem? A client RowLock could solve this issue, but it seems to me that it is not safe, not recommended, and deprecated?

Btw, would it be possible or practical to implement something like PutAndGet, which puts in the new row and returns the old row back to the client? That would help a lot for my case.

Best Regards,
Raymond Liu
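For what it's worth, here is a minimal sketch of the "loop over checkAndPut" workaround Mike describes, written against the 0.94 client API. The table and column names and the helper method are made up for illustration; under heavy contention the loop can retry many times, which is exactly the cost Raymond is worried about.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class GetAndPutSketch {
  // Atomically replaces the cell value and returns the previous value (null if it did not exist).
  static byte[] getAndPut(HTable table, byte[] row, byte[] cf, byte[] qual, byte[] newValue) throws Exception {
    while (true) {
      byte[] old = table.get(new Get(row)).getValue(cf, qual);   // step 1: read the current value
      Put put = new Put(row);
      put.add(cf, qual, newValue);
      // step 2: only succeeds if the cell still holds exactly the value read in step 1
      // (a null 'old' means "put only if the cell does not exist yet")
      if (table.checkAndPut(row, cf, qual, old, put)) {
        return old;                                              // null => fresh insert, non-null => update
      }
      // another client changed the cell between step 1 and step 2; retry
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "graph");                    // hypothetical table
    byte[] prev = getAndPut(table, Bytes.toBytes("id-1"), Bytes.toBytes("f"),
        Bytes.toBytes("status"), Bytes.toBytes("on"));
    System.out.println(prev == null ? "inserted" : "updated from " + Bytes.toString(prev));
    table.close();
  }
}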
Re: Error about rs block seek
Current pos = 32651; currKeyLen = 45; currValLen = 80; block limit = 32775

This means that after the current position we need at least 45 + 80 + 4 (key length stored as 4 bytes) + 4 (value length, 4 bytes) more bytes, so the block limit should have been at least 32784. If the memstoreTS is also written with this KV, some more bytes still. Do you use HBase-handled checksums?

-Anoop-
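For reference, a minimal sketch in plain Java (not the actual HFileReaderV2 code) of the bounds check Anoop's arithmetic describes, using the numbers from the error log:

public class BlockSeekBoundsCheck {
  public static void main(String[] args) {
    int pos = 32651, currKeyLen = 45, currValLen = 80, blockLimit = 32775;  // from the ERROR line
    // each cell in a data block is laid out as [4-byte key length][4-byte value length][key][value],
    // so the block buffer's limit must reach at least:
    int required = pos + 4 + 4 + currKeyLen + currValLen;   // = 32784
    System.out.println("required >= " + required + ", but block limit = " + blockLimit);
    // 32784 > 32775: the buffer ends before the cell does, which is what triggers the
    // IllegalArgumentException thrown by java.nio.Buffer.position() in the stack trace
  }
}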
Re: Error about rs block seek
Hi, Anoop. I do not handle or change the hbase checksum. So I want to know: if I set the block size at the time of creating the table, can that cause trouble?
Re: Error about rs block seek
So I want to know: if I set the block size at the time of creating the table, can that cause trouble?

It should not. We have tested with different block sizes, from the default 64K down to 8K, for testing purposes and have not come across issues like this. Does it happen only on this data, or does the issue come every time you create a new table with a 32K block size, do some writes and then read?

-Anoop-
Re: Error about rs block seek
Is it possible to reproduce this with a simple test case based on your usecase and data? You can share it so that we can really debug the actual problem.

Regards
Ram
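Along the lines Ram suggests, a bare-bones reproduction harness could look like the sketch below (0.94 client API). The table name, row count and value size are made up; the 32 KB block size and the roughly 45-byte key / 80-byte value shape follow the failing table, but there is no guarantee this actually reproduces the problem.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BlockSeekRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("BlockSeekRepro");   // hypothetical test table
    HColumnDescriptor cf = new HColumnDescriptor("CrawlStats");
    cf.setBlocksize(32 * 1024);   // 32768, same as the failing table
    cf.setMaxVersions(1);
    desc.addFamily(cf);
    admin.createTable(desc);

    HTable table = new HTable(conf, "BlockSeekRepro");
    byte[] family = Bytes.toBytes("CrawlStats");
    for (int i = 0; i < 100000; i++) {
      Put p = new Put(Bytes.toBytes(String.format("row-%08d-pad-to-a-longer-key", i)));
      p.add(family, Bytes.toBytes("q"), new byte[80]);   // ~80-byte values, like the error log
      table.put(p);
    }
    table.flushCommits();
    admin.flush("BlockSeekRepro");   // force an HFile so reads go through HFileReaderV2

    for (int i = 0; i < 100000; i++) {
      table.get(new Get(Bytes.toBytes(String.format("row-%08d-pad-to-a-longer-key", i))));
    }
    table.close();
    admin.close();
  }
}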
Re: Block size of HBase files
Hi, I wanted to minimize the number of map reduce tasks generated while processing a job, hence configured it to a larger value. I don't think I have configured the HFile size in the cluster. I use Cloudera Manager to manage my cluster, and the only configuration I can relate to is hfile.block.cache.size, which is set to 0.25. How do I change the HFile size?

-- Regards, Praveen Bysani http://www.praveenbysani.com
Re: Block size of HBase files
Praveen,

How many regions are there in your table, and how many CFs? Under /hbase/table-name there are many files and dirs you will be able to see: there is a .tableinfo file, every region has a .regionInfo file, and then under each CF sit the data files (HFiles).

Your total data is 250GB. If the block size were 1GB and you had only one file of 250GB, then what you are looking for would make sense. But that is not how HBase stores data. HFiles are created per CF per region. Also, as data comes in (writes), by default after 128MB HBase flushes it as a file into HDFS, so it makes a file in HDFS with 1 block (in your case). Later these smaller files get merged into bigger ones (compaction). At the time you checked, had some major compactions run? A major compaction merges all files under a CF within a region into one HFile. So if you have 100 regions and 2 CFs for the table, after major compaction you will have 200 HFiles. (Remember, under /hbase/table-name you will also see some other files besides the HFiles.) The #files and avg block size in your fsck output explain why you have that many blocks.

The HFile size Amandeep was referring to is the max size for an HFile (and thus for a region). If you keep writing data to a region and the data size crosses this max size, HBase splits that region into 2.

Can you try checking the file count and block count after running a major compaction?

What MR job are you trying to run with HBase? Also, why would you run MR directly on the HFiles? When you run an MR job over HBase (like a Scan using MR), it is not the #files or #blocks that decides the #mappers; it is based on the #regions in the table.

-Anoop-
Re: Block size of HBase files
You can change the HFile size through the hbase.hregion.max.filesize parameter.
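If you want to override it per table rather than cluster-wide, a hedged sketch with the 0.94 client API follows; the table name is made up, and the table is disabled around the change to keep things simple. Note that raising the limit only affects future splits; the regions that already exist stay as they are.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class RaiseMaxFileSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    byte[] name = Bytes.toBytes("mytable");                    // hypothetical table name
    HTableDescriptor desc = admin.getTableDescriptor(name);
    desc.setMaxFileSize(10L * 1024 * 1024 * 1024);             // 10 GB per region before it splits
    admin.disableTable(name);
    admin.modifyTable(name, desc);
    admin.enableTable(name);
    admin.close();
  }
}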
Re: Block size of HBase files
Hi, thanks for the details. No, I haven't run any compaction, and I have no idea if there is one going on in the background. I executed a major_compact on that table and I now have 731 regions (each about ~350 MB !!). I checked the configuration in CM, and the value for hbase.hregion.max.filesize is 1 GB too !!!

I am not trying to access HFiles in my MR job; in fact I am just using a PIG script which handles this. This number (731) is close to my number of map tasks, which makes sense. But how can I decrease this? Shouldn't the size of each region be 1 GB with that configuration value?

-- Regards, Praveen Bysani http://www.praveenbysani.com
Re: Block size of HBase files
now have 731 regions (each about ~350 MB !!). I checked the configuration in CM, and the value for hbase.hregion.max.filesize is 1 GB too !!!

Did you specify the splits at the time of table creation? How did you create the table?

-Anoop-
Re: Block size of HBase files
I mean, when you created the table (using the client, I guess), did you specify anything like splitKeys or [start, end, no#regions]?

-Anoop-

On Mon, May 13, 2013 at 5:49 PM, Praveen Bysani praveen.ii...@gmail.com wrote:
We insert data using the java hbase client (org.apache.hadoop.hbase.client.*). However, we are not providing any details in the configuration object except for the zookeeper quorum and port number. Should we specify this explicitly at that stage?
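To make the question concrete, the two forms Anoop mentions look roughly like this with the 0.94 HBaseAdmin API; the table name, family name and keys are made up for illustration. Pre-splitting at creation time is the usual way to control the region count (and hence the number of mappers) up front.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitCreate {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("cf"));

    // form 1: explicit splitKeys -> 4 regions with these boundaries
    byte[][] splitKeys = {
        Bytes.toBytes("row25000"), Bytes.toBytes("row50000"), Bytes.toBytes("row75000") };
    admin.createTable(desc, splitKeys);

    // form 2: [start, end, no#regions] -> 10 regions evenly spread between the two keys
    // HTableDescriptor desc2 = new HTableDescriptor("mytable2");
    // desc2.addFamily(new HColumnDescriptor("cf"));
    // admin.createTable(desc2, Bytes.toBytes("row00000"), Bytes.toBytes("row99999"), 10);

    admin.close();
  }
}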
Re: Export / Import and table splits
Hi Jeremy,

Thanks for sharing this. I will take a look at it, and also most probably give a try to the snapshot option.

JM

2013/5/7 Jeremy Carroll phobos...@gmail.com
https://github.com/phobos182/hadoop-hbase-tools/blob/master/hbase/copy_table.rb
I wrote a quick script to do it with mechanize + ruby. I have a new tool which I'm polishing up that does the same thing in Python, but using the HBase REST interface to get the data.

On Tue, May 7, 2013 at 3:23 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
Hi,

When we are doing an export, we are only exporting the data. Then when we are importing that back, we need to make sure the table is pre-split correctly, else we might hotspot some servers. If you simply export then import without pre-splitting at all, you will most probably bring some servers down because they will be overwhelmed with splits and compactions.

Do we have any tool to pre-split a table the same way another table is already pre-split? Something like:

duplicate 'source_table', 'target_table'

which would create a new table called 'target_table' with exactly the same parameters as 'source_table' and the same region boundaries? If we don't have one, would it be useful to have? Or even something like:

create 'target_table', 'f1', {SPLITS_MODEL => 'source_table'}

JM
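Until such a shell command exists, something close to JM's "duplicate" can be scripted against the 0.94 client: read the source table's region start keys and schema, then create the target with the same split points. A hedged sketch, with the table names assumed and error handling omitted:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class DuplicateSplits {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // copy the column family definitions from the source table
    HTableDescriptor source = admin.getTableDescriptor(Bytes.toBytes("source_table"));
    HTableDescriptor target = new HTableDescriptor("target_table");
    for (HColumnDescriptor cf : source.getFamilies()) {
      target.addFamily(new HColumnDescriptor(cf));
    }

    // region boundaries of the source: drop the first (empty) start key, use the rest as split keys
    HTable sourceTable = new HTable(conf, "source_table");
    byte[][] startKeys = sourceTable.getStartKeys();
    byte[][] splitKeys = Arrays.copyOfRange(startKeys, 1, startKeys.length);
    sourceTable.close();

    if (splitKeys.length > 0) {
      admin.createTable(target, splitKeys);
    } else {
      admin.createTable(target);   // source had a single region
    }
    admin.close();
  }
}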
Re: Export / Import and table splits
I'll go with the snapshots, since you can avoid all the I/O of the import/export. But the consistency model is different, and you don't have the start/end time option... you should delete the rows before tstart and after tend after the clone.

Matteo
Re: Export / Import and table splits
The cluster is stopped anyway, so there are no consistency concerns, which means snapshots might be the best option. No need to delete anything after. The goal is really to export the data locally, take the cluster down, get a new cluster, put the data back and reload the table... the 2 clusters can't be up at the same time...
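For that workflow, the snapshot route would look roughly like the sketch below. This assumes a 0.94.6+ release with snapshots enabled (hbase.snapshot.enabled); the names are made up, and the ExportSnapshot step is the bundled MR tool run from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SnapshotMigration {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // 1) on the old cluster: take a point-in-time snapshot; no data is copied, splits are preserved
    admin.snapshot("source_table_snap", "source_table");

    // 2) ship the snapshot files to the new cluster, e.g. with the bundled MR tool:
    //    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    //        -snapshot source_table_snap -copy-to hdfs://new-cluster:8020/hbase

    // 3) on the new cluster: recreate the table with the same schema, splits and data
    admin.cloneSnapshot("source_table_snap", "source_table");

    admin.close();
  }
}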
Re: Block size of HBase files
Hi Anoop,

No, we didn't specify any such thing while creating and writing into the table.

-- Regards, Praveen Bysani http://www.praveenbysani.com