RegionSize returning in MB - change to bytes?
Hi All,

There is a new optimization in Spark (SPARK-34809) where ignoreEmptySplits filters out all regions whose size is 0. They use a Hadoop library call, getSize(), in TableInputFormat.

Drilling down, this returns bytes, but the value is converted from megabytes, so any region under 1 MB comes back as 0 bytes and is treated as empty.

I did a quick PR that I thought would help:
https://github.com/apache/hbase/pull/3737
But it turns out it's not as easy as requesting the size in bytes instead of MB from the Size class, as we set it in MB to begin with in RegionMetricsBuilder
-> setStoreFileSize(new Size(regionLoadPB.getStorefileSizeMB(), Size.Unit.MEGABYTE))

I did some testing: after inserting a few kilobytes of data, calling list_regions does in fact give back size 0.

My question is: is it okay to store the region size in bytes instead? I'm mainly asking for backward compatibility reasons.

Regards,
Norbert
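The truncation described above can be reproduced without HBase. The following is a minimal sketch, assuming the reported megabyte value is a whole number obtained by integer division (the message only says that sizes under 1 MB come back as 0). The class and method names are made up for illustration and are not HBase APIs.

```java
// Hypothetical stand-in for the MB -> bytes round trip; not HBase code.
final class RegionSizeTruncation {
    static final long BYTES_PER_MB = 1024L * 1024L;

    /** Assumed reporting step: only whole megabytes survive (integer division truncates). */
    static long toWholeMegabytes(long sizeInBytes) {
        return sizeInBytes / BYTES_PER_MB;
    }

    /** What a caller asking for bytes would see after the size has passed through megabytes. */
    static long bytesViaMegabytes(long sizeInBytes) {
        return toWholeMegabytes(sizeInBytes) * BYTES_PER_MB;
    }

    public static void main(String[] args) {
        long[] samples = {4L * 1024L, BYTES_PER_MB - 1, BYTES_PER_MB, 3 * BYTES_PER_MB + 512};
        for (long actual : samples) {
            System.out.println(actual + " bytes stored -> "
                + bytesViaMegabytes(actual) + " bytes reported");
        }
    }
}
```

Under these assumptions, every size below 1 MB is reported as 0 bytes, and larger sizes lose their sub-megabyte remainder.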
[jira] [Resolved] (HBASE-24833) Bootstrap should not delete the META table directory if it's not partial
[ https://issues.apache.org/jira/browse/HBASE-24833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tak-Lon (Stephen) Wu resolved HBASE-24833.
------------------------------------------
    Resolution: Fixed

> Bootstrap should not delete the META table directory if it's not partial
> ------------------------------------------------------------------------
>
>                 Key: HBASE-24833
>                 URL: https://issues.apache.org/jira/browse/HBASE-24833
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 3.0.0-alpha-1, 2.3.0, 2.3.1, 2.3.3
>            Reporter: Tak-Lon (Stephen) Wu
>            Assignee: Tak-Lon (Stephen) Wu
>            Priority: Major
>             Fix For: 2.5.0
>
>
> This issue was discussed in
> [PR#2113|https://github.com/apache/hbase/pull/2113] as part of HBASE-24286,
> and it is a dependency that must be resolved before we can solve HBASE-24286.
> The change came in with
> [HBASE-24471|https://github.com/apache/hbase/commit/4d5efec76718032a1e55024fd5133409e4be3cb8#diff-21659161b1393e6632730dcbea205fd8R70-R89],
> which introduced the notion of a partial meta: meta is `partial` when
> InitMetaProcedure did not succeed and INIT_META_ASSIGN_META was not completed.
> {code:java}
> private static void writeFsLayout(Path rootDir, Configuration conf) throws IOException {
>   LOG.info("BOOTSTRAP: creating hbase:meta region");
>   FileSystem fs = rootDir.getFileSystem(conf);
>   Path tableDir = CommonFSUtils.getTableDir(rootDir, TableName.META_TABLE_NAME);
>   if (fs.exists(tableDir) && !fs.delete(tableDir, true)) {
>     LOG.warn("Can not delete partial created meta table, continue...");
>   }
> {code}
> However, in the cloud use case, where HFiles are stored on S3, WALs are stored
> on HDFS, and ZK data is stored within the cluster, this partial-meta handling
> becomes a blocker when a cluster is recreated on existing HFiles. ZK data and
> WALs cannot be retained (HDFS was associated with the cloud instance and was
> terminated together with it), so when the cluster is recreated on the flushed
> HFiles, the existing meta is always considered partial and is deleted in
> `INIT_META_WRITE_FS_LAYOUT` during bootstrap. As a result, the recreated
> cluster starts with an empty meta table: either the cluster hangs during
> master initialization (branch-2) because the table states of the namespace
> table cannot be assigned, or it starts as a fresh cluster without any regions
> assigned or tables opened (HBCK may be needed to rebuild the meta).
> Potential solution suggested by Anoop:
> {quote}In case of HM start and the bootstrap we create the ClusterID and
> write to FS and then to zk and then create the META table FS layout. So in a
> cluster recreate, we will see clusterID is there in FS and also the META FS
> layout but no clusterID in zk. Ya seems we can use this as indication for
> cluster recreate over existing data. In HM start, this is some thing we need
> to check at 1st itself and track. If this mode is true, later when (if) we do
> INIT_META_WRITE_FS_LAYOUT , we should not delete the META dir. As part of the
> Bootstrap when we write that proc to MasterProcWal, we can include this mode
> (boolean) info also. This is a protobuf message anyways. So even if this HM
> got killed and restarted (at a point where the clusterId was written to zk
> but the Meta FS layout part was not reached) we can use the info added as
> part of the bootstrap wal entry and make sure NOT to delete the meta dir.
> {quote}
> In this JIRA, we're going to fix the `partial` definition for the case where
> the cluster ID is found stored with the HFiles but the ZK data was deleted or
> is fresh when the cluster is created.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
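The detection rule Anoop suggests can be written down as two small predicates. This is only a sketch of the idea under the assumptions in the quote; the class, method, and parameter names are hypothetical and do not correspond to the actual HBase patch.

```java
// Hypothetical sketch of the "cluster recreate" check; not the HBASE-24833 implementation.
final class MetaBootstrapSketch {

    /**
     * Recreate mode as described in the quote: the cluster ID and the META layout
     * are already on the file system, but ZooKeeper has no cluster ID.
     */
    static boolean detectClusterRecreate(boolean clusterIdInFs,
                                         boolean metaLayoutInFs,
                                         boolean clusterIdInZk) {
        return clusterIdInFs && metaLayoutInFs && !clusterIdInZk;
    }

    /**
     * Decision for the INIT_META_WRITE_FS_LAYOUT step: delete only a META dir
     * that exists and was not flagged as belonging to a recreated cluster.
     */
    static boolean shouldDeleteMetaDir(boolean metaDirExists, boolean recreateMode) {
        return metaDirExists && !recreateMode;
    }

    public static void main(String[] args) {
        // Cluster recreated on existing HFiles: ID and META layout on FS, nothing in ZK.
        boolean recreate = detectClusterRecreate(true, true, false);
        System.out.println("recreate=" + recreate
            + " deleteMeta=" + shouldDeleteMetaDir(true, recreate));

        // Interrupted ordinary bootstrap: ZK already holds the cluster ID.
        boolean partial = detectClusterRecreate(true, true, true);
        System.out.println("recreate=" + partial
            + " deleteMeta=" + shouldDeleteMetaDir(true, partial));
    }
}
```

In the quoted proposal the flag is computed once at master start and stored with the bootstrap procedure, so a master that restarts after the cluster ID has been written to ZK would reuse the stored value rather than recompute it.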
[VOTE] The first HBase 2.3.7 release candidate (RC0) is available
Please vote on this Apache hbase release candidate, hbase-2.3.7RC0

The VOTE will remain open for at least 72 hours.

[ ] +1 Release this package as Apache hbase 2.3.7
[ ] -1 Do not release this package because ...

The tag to be voted on is 2.3.7RC0:

  https://github.com/apache/hbase/tree/2.3.7RC0

This tag currently points to git reference

  8b2f5141e900c851a2b351fccd54b13bcac5e2ed

The release files, including signatures, digests, as well as CHANGES.md
and RELEASENOTES.md included in this RC can be found at:

  https://dist.apache.org/repos/dist/dev/hbase/2.3.7RC0/

Maven artifacts are available in a staging repository at:

  https://repository.apache.org/content/repositories/orgapachehbase-1466/

Artifacts were signed with the 0x1C3489BD key which can be found in:

  https://dist.apache.org/repos/dist/release/hbase/KEYS

To learn more about Apache hbase, please see

  http://hbase.apache.org/

Thanks,
Your HBase Release Manager
Re: RegionSize returning in MB - change to bytes?
Hi Norbert,

To answer your question directly: the RegionSizeCalculator class is annotated with @InterfaceAudience.Private, which means there's a good chance that its implementation can be changed without a deprecation cycle or user participation.

Curiously, I noticed that this `sizeMap` is accessed down in the method `long getRegionSize(byte[])`, and its javadoc mentions the returned unit explicitly as bytes.

So with a little investigation using git blame, I see that the switch from returning values in bytes to values in megabytes came in through HBASE-16169 -- your proposed change was the old implementation. For whatever reason, it was determined not to be scalable. So we could revert, but we'd need a new solution to what HBASE-16169 aimed to solve.

I hope this helps.

Thanks,
Nick

On Tue, Oct 12, 2021 at 10:54 AM Norbert Kalmar wrote:

> Hi All,
>
> There is a new optimization in Spark (SPARK-34809) where ignoreEmptySplits
> filters out all regions whose size is 0. They use a Hadoop library
> getSize() in TableInputFormat.
>
> Drilling down, this will return bytes, but it converts it from megabytes,
> meaning anything under 1 MB will come down as 0 bytes, meaning empty.
> I did a quick PR I thought would help:
> https://github.com/apache/hbase/pull/3737
> But it turns out it's not as easy as requesting the size in bytes instead
> of MB from Size class, as we set it in MB to begin with in
> RegionMetricsBuilder
> -> setStoreFileSize(new Size(regionLoadPB.getStorefileSizeMB(),
> Size.Unit.MEGABYTE))
>
> I did some testing, and inserting a few kilobytes of data, then
> calling list_regions
> will in fact give back size 0.
>
> My question is, is it okay to store the region size in bytes instead?
> Mainly asking because of backward compatibility reasons.
>
> Regards,
> Norbert
>
[jira] [Created] (HBASE-26353) Support loadable dictionaries in hbase-compression-zstd
Andrew Kyle Purtell created HBASE-26353:
-------------------------------------------

             Summary: Support loadable dictionaries in hbase-compression-zstd
                 Key: HBASE-26353
                 URL: https://issues.apache.org/jira/browse/HBASE-26353
             Project: HBase
          Issue Type: Sub-task
            Reporter: Andrew Kyle Purtell
            Assignee: Andrew Kyle Purtell
             Fix For: 2.5.0, 3.0.0-alpha-2


ZStandard supports initialization of compressors and decompressors with a precomputed dictionary, which can dramatically improve and speed up compression of tables with small values. For more details, please see [The Case For Small Data Compression|https://github.com/facebook/zstd#the-case-for-small-data-compression].

If a table is going to have a lot of small values, and the user can put together a representative set of files for training a dictionary to compress those values, the dictionary can be trained with the {{zstd}} command line utility, available in any zstandard package for your favorite OS:

Training:
{noformat}
$ zstd --maxdict=1126400 --train-fastcover=shrink \
    -o mytable.dict training_files/*
Trying 82 different sets of parameters
...
k=674
d=8
f=20
steps=40
split=75
accel=1
Save dictionary of size 1126400 into file mytable.dict
{noformat}

Deploy the dictionary file to HDFS.

Create the table:
{noformat}
hbase> create "mytable",
  ... ,
  CONFIGURATION => {
    'hbase.io.compress.zstd.level' => '6',
    'hbase.io.compress.zstd.dictionary' => true,
    'hbase.io.compress.zstd.dictonary.file' => \
      'hdfs://nn/zdicts/mytable.dict'
  }
{noformat}

Now start storing data. Compression results even for small values will be excellent.

Note: if the dictionary is lost, the data cannot be decompressed.
[jira] [Created] (HBASE-26354) [hbase-connectors] Added python client for HBase thrift service
Yutong Xiao created HBASE-26354:
-----------------------------------

             Summary: [hbase-connectors] Added python client for HBase thrift service
                 Key: HBASE-26354
                 URL: https://issues.apache.org/jira/browse/HBASE-26354
             Project: HBase
          Issue Type: Improvement
            Reporter: Yutong Xiao
            Assignee: Yutong Xiao


A Python client for the HBase Thrift service. It has a request retry mechanism and exception handling, and it encapsulates redundant parameters to make usage easier.
Re: RegionSize returning in MB - change to bytes?
The return value is in bytes. The problem is that we normalize the size to MB and then multiply the MB figure to get the size in bytes, so if a file is smaller than 1 MB, the returned value will be zero. This needs more investigation.

Reading the issue, the scalability problem they wanted to solve is that we go to the master to get the region size; it is not about whether the unit is MB or not.

Thanks.

On Wed, Oct 13, 2021 at 7:47 AM Nick Dimiduk wrote:

> Hi Norbert,
>
> To answer your question directly: the RegionSizeCalculator class is
> annotated with @InterfaceAudience.Private, which means there's a good
> chance that its implementation can be changed without need for a
> deprecation cycle and user participation.
>
> Curiously, I noticed that this `sizeMap` is accessed down in the method
> `long getRegionSize(byte[])`, and its javadoc mentions the returned unit
> explicitly as bytes.
>
> So with a little investigation using git blame, I see that the switch from
> returning values in bytes to values in megabytes came in through
> HBASE-16169 -- your proposed change was the old implementation. For
> whatever reasons, it was determined to not be scalable. So, we could revert
> back, but we'd need some new solution to what HBASE-16169 aimed to solve.
>
> I hope this helps.
>
> Thanks,
> Nick
>
> On Tue, Oct 12, 2021 at 10:54 AM Norbert Kalmar wrote:
>
> > Hi All,
> >
> > There is a new optimization in Spark (SPARK-34809) where ignoreEmptySplits
> > filters out all regions whose size is 0. They use a Hadoop library
> > getSize() in TableInputFormat.
> >
> > Drilling down, this will return bytes, but it converts it from megabytes,
> > meaning anything under 1 MB will come down as 0 bytes, meaning empty.
> > I did a quick PR I thought would help:
> > https://github.com/apache/hbase/pull/3737
> > But it turns out it's not as easy as requesting the size in bytes instead
> > of MB from Size class, as we set it in MB to begin with in
> > RegionMetricsBuilder
> > -> setStoreFileSize(new Size(regionLoadPB.getStorefileSizeMB(),
> > Size.Unit.MEGABYTE))
> >
> > I did some testing, and inserting a few kilobytes of data, then
> > calling list_regions
> > will in fact give back size 0.
> >
> > My question is, is it okay to store the region size in bytes instead?
> > Mainly asking because of backward compatibility reasons.
> >
> > Regards,
> > Norbert
> >
>
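To make the effect on split filtering concrete, here is a small sketch of how an "ignore empty splits" style check would treat regions whose size was normalized to whole megabytes. It does not use the Spark or HBase APIs; the region names, sizes, and helper methods are hypothetical, and the whole-megabyte rounding is assumed from the description above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Spark or HBase.
final class EmptySplitFilterSketch {
    static final long BYTES_PER_MB = 1024L * 1024L;

    /** Assumed normalization: size kept as whole MB, then multiplied back to bytes. */
    static long reportedBytes(long actualBytes) {
        return (actualBytes / BYTES_PER_MB) * BYTES_PER_MB;
    }

    /** Keeps only regions whose reported size is non-zero, as an empty-split filter would. */
    static List<String> keepNonEmpty(Map<String, Long> actualSizes) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Long> entry : actualSizes.entrySet()) {
            if (reportedBytes(entry.getValue()) > 0) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    /** Made-up sample: one truly empty region, one small region, one large region. */
    static Map<String, Long> sampleRegions() {
        Map<String, Long> regions = new LinkedHashMap<>();
        regions.put("region-a", 0L);               // really empty
        regions.put("region-b", 8L * 1024L);       // 8 KB of real data
        regions.put("region-c", 5L * BYTES_PER_MB);
        return regions;
    }

    public static void main(String[] args) {
        System.out.println("kept: " + keepNonEmpty(sampleRegions()));
    }
}
```

With these sample values the 8 KB region is dropped together with the truly empty one, which is the data-skipping behaviour the thread is concerned about.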
[jira] [Created] (HBASE-26355) Release 1.4.14
Duo Zhang created HBASE-26355:
---------------------------------

             Summary: Release 1.4.14
                 Key: HBASE-26355
                 URL: https://issues.apache.org/jira/browse/HBASE-26355
             Project: HBase
          Issue Type: Task
            Reporter: Duo Zhang


Per the end of this thread:

https://lists.apache.org/thread.html/r34d3fb86d667f8b3e58cbba78655733ac76e10f5883650f4910adc5c%40%3Cdev.hbase.apache.org%3E

Let's do a final 1.4.14 release and mark branch-1.4 as EOL.
Re: EOL branch-1 and all 1.x ?
Filed HBASE-26355 for releasing 1.4.14.

On Mon, Oct 11, 2021 at 5:31 PM 张铎(Duo Zhang) wrote:

> So I think in this thread, the only concern is about performance issues,
> so we decided to make new releases on branch-1.
>
> But at least I think we all agree to EOL other 1.x release lines,
> especially branch-1.4 right?
>
> If no other concerns, let's do a final 1.4.14 release and then mark
> branch-1.4 as EOL. There are 40 issues under 1.4.14 so I think it is worth
> having a new release.
>
> Thanks.
>
> On Tue, Jun 1, 2021 at 3:16 AM Andrew Purtell wrote:
>
>> It would be good to do the performance work at least, if you are up for
>> it. There are always going to be consequences for the kind of significant
>> evolution that 2.x represents over 1.x.
>>
>> Regarding performance, a change always has positive and negative
>> consequences. It is important to understand them both, informed by real
>> world use cases. My guess is you have real world use cases, Reid. Your
>> results will be meaningful.
>>
>> Synthetic benchmarks are less interesting unless the regression is
>> obvious and more like a bug than a consequence. Sure they will report
>> positive and negative changes, but does that actually mean anything? It
>> depends. Sometimes it will only mean something if we care about supporting
>> the synthetic benchmark as a first class use case. (Usually we don't; but
>> universal cross system bench tools like YCSB are exceptions.)
>>
>>
>> > On May 31, 2021, at 9:25 AM, Reid Chan wrote:
>> >
>> > Thanks to Andrew and Sean's help, I managed to release the first candidate
>> > of 1.7.0 (at least it is a beginning, and graduated from green hand).
>> > BTW, The [VOTE]
>> > <https://lists.apache.org/thread.html/r0b96b6596fc423e17ff648633e5ea76fd897d9afb8a03ae6e09cdb8f%40%3Cdev.hbase.apache.org%3E>
>> >
>> > The following are my thoughts:
>> > I'm willing to continue branch-1's life as a RM.
>> > And before EOL branch-1, I need to announce EOL of branch-1.4.
>> > While maintaining the branch-1, I also will do some benchmarks between 1.7+
>> > and 2.4+ (the latest). If 2.4+ is better, cool. Otherwise, I'm willing to
>> > spend some time diving in.
>> > After the performance issue is done, I need to review the upgrade from 1.x
>> > to 2.x. I remember someone wrote it. But HBASE-25902 seems to reveal some
>> > problems already.
>> > I will announce EOL of branch-1 if listed above are done.
>> >
>> > Probably more than 1 year, by estimation, if I have to do it all alone. The
>> > most time-spending should be performance diving in (if there was) and
>> > upgrade review.
>> >
>> > Any thought is appreciated.
>> >
>> >
>> > ---
>> > Best regards,
>> > R.C
>> >
>> >
>> >> On Tue, Apr 20, 2021 at 12:13 AM Reid Chan wrote:
>> >>
>> >>
>> >> FYI, a JDK issue when I was making the 1.7.0 release.
>> >>
>> >> https://lists.apache.org/thread.html/r118b08134676d9234362a28898249186fe73a1fb08535d6eec6a91d3%40%3Cdev.hbase.apache.org%3E
>> >>
>> >>
>> >> ---
>> >> Best Regards,
>> >> R.C
>> >>
>> >>> On Thu, Apr 1, 2021 at 6:03 AM Andrew Purtell wrote:
>> >>>
>> >>> Is it time to consider EOL of branch-1 and all 1.x releases ?
>> >>>
>> >>> There doesn't seem to be much developer interest in branch-1 beyond
>> >>> occasional maintenance. This is understandable. Per our compatibility
>> >>> guidelines, branch-1 commits must be compatible with Java 7, and the range
>> >>> of acceptable versions of third party dependencies is also restricted due
>> >>> to Java 7 compatibility requirements. Most developers are writing code with
>> >>> Java 8+ idioms these days. For that reason and because the branch-1 code
>> >>> base is generally aged at this point, all but trivial (or lucky!) backports
>> >>> require substantial changes in order to integrate adequately. Let me also
>> >>> observe that branch-1 artifacts are not fully compatible with Java 11 or
>> >>> later. (The shell is a good example of such issues: The version of
>> >>> jruby-complete required by branch-1 is not compatible with Java 11 and
>> >>> upgrading to the version used by branch-2 causes shell commands to error
>> >>> out due to Ruby language changes.)
>> >>>
>> >>> We can a priori determine there is insufficient motivation for production
>> >>> of release artifacts for the PMC to vote upon. Otherwise, someone would
>> >>> have done it. We had 12 releases from branch-2 derived code in 2019, 13
>> >>> releases from branch-2 derived code in 2020, and so far we have had 3
>> >>> releases from branch-2 derived code in 2021. In contrast, we had 8 releases
>> >>> from branch-1 derived code in 2019, 0 releases from branch-1 in 2020, and
>> >>> so far 0 releases from branch-1 in 2021.
>> >>>
>> >>>        2021  2020  2019
>> >>> 1.x       0     2     8
>> >>> 2.x       3    13    12
>> >>>
>> >>> If there is someone interested in continuing branch-1, now is the time to
>> >>> commit. However let me be clear that simply expressing an abstract desire
>> >>>
[jira] [Created] (HBASE-26356) Set version as 2.1.10 in branch-2.1 in prep for first RC of 2.1.10
Duo Zhang created HBASE-26356:
---------------------------------

             Summary: Set version as 2.1.10 in branch-2.1 in prep for first RC of 2.1.10
                 Key: HBASE-26356
                 URL: https://issues.apache.org/jira/browse/HBASE-26356
             Project: HBase
          Issue Type: Sub-task
            Reporter: Duo Zhang
[jira] [Created] (HBASE-26357) Generate CHANGES.txt for 1.4.14
Duo Zhang created HBASE-26357:
---------------------------------

             Summary: Generate CHANGES.txt for 1.4.14
                 Key: HBASE-26357
                 URL: https://issues.apache.org/jira/browse/HBASE-26357
             Project: HBase
          Issue Type: Sub-task
          Components: documentation
            Reporter: Duo Zhang
[jira] [Created] (HBASE-26358) Put up 1.4.14RC0
Duo Zhang created HBASE-26358:
---------------------------------

             Summary: Put up 1.4.14RC0
                 Key: HBASE-26358
                 URL: https://issues.apache.org/jira/browse/HBASE-26358
             Project: HBase
          Issue Type: Sub-task
            Reporter: Duo Zhang