RegionSize returning in MB - change to bytes?

2021-10-12 Thread Norbert Kalmar
Hi All,

There is a new optimization in Spark (SPARK-34809) where ignoreEmptySplits
filters out all regions whose size is 0. They use a Hadoop library call,
getSize(), in TableInputFormat.

Drilling down, this will return bytes, but the value is converted from
megabytes, meaning anything under 1 MB will come down as 0 bytes, i.e. empty.
I did a quick PR I thought would help:
https://github.com/apache/hbase/pull/3737
But it turns out it's not as easy as requesting the size in bytes instead
of MB from the Size class, as we set it in MB to begin with in
RegionMetricsBuilder:
setStoreFileSize(new Size(regionLoadPB.getStorefileSizeMB(), Size.Unit.MEGABYTE))

I did some testing: after inserting a few kilobytes of data, calling
list_regions will in fact give back size 0.

My question is, is it okay to store the region size in Bytes instead?
Mainly asking because of backward compatibility reasons.

Regards,
Norbert


[jira] [Resolved] (HBASE-24833) Bootstrap should not delete the META table directory if it's not partial

2021-10-12 Thread Tak-Lon (Stephen) Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tak-Lon (Stephen) Wu resolved HBASE-24833.
--
Resolution: Fixed

> Bootstrap should not delete the META table directory if it's not partial
> 
>
> Key: HBASE-24833
> URL: https://issues.apache.org/jira/browse/HBASE-24833
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 3.0.0-alpha-1, 2.3.0, 2.3.1, 2.3.3
>Reporter: Tak-Lon (Stephen) Wu
>Assignee: Tak-Lon (Stephen) Wu
>Priority: Major
> Fix For: 2.5.0
>
>
> This issue was discussed in
> [PR#2113|https://github.com/apache/hbase/pull/2113] as part of HBASE-24286,
> and it is a dependency that must be resolved before we solve HBASE-24286.
> The change was introduced in [HBASE-24471
> |https://github.com/apache/hbase/commit/4d5efec76718032a1e55024fd5133409e4be3cb8#diff-21659161b1393e6632730dcbea205fd8R70-R89],
> where partial meta was introduced and `partial` was defined as:
> InitMetaProcedure did not succeed and INIT_META_ASSIGN_META was not completed.
> {code:java}
>   private static void writeFsLayout(Path rootDir, Configuration conf) throws IOException {
>     LOG.info("BOOTSTRAP: creating hbase:meta region");
>     FileSystem fs = rootDir.getFileSystem(conf);
>     Path tableDir = CommonFSUtils.getTableDir(rootDir, TableName.META_TABLE_NAME);
>     if (fs.exists(tableDir) && !fs.delete(tableDir, true)) {
>       LOG.warn("Can not delete partial created meta table, continue...");
>     }
> {code}
> However, in the cloud use case where HFiles are stored on S3, WALs are stored
> on HDFS, and ZK data is stored within the cluster, this partial meta handling
> becomes a blocker when the cluster is recreated on existing HFiles. Here, ZK
> data and WALs cannot be retained (HDFS was associated with the cloud instance
> and was terminated together with it) when the cluster is recreated on the
> flushed HFiles, and the existing meta is always considered partial and deleted
> in `INIT_META_WRITE_FS_LAYOUT` during bootstrap. As a result, the recreated
> cluster starts with an empty meta table: either the cluster hangs during
> master initialization (branch-2) because table states of the namespace table
> cannot be assigned, or it starts as a fresh cluster without any regions
> assigned or tables opened (HBCK may be needed to rebuild the meta).
> Potential solution suggested by Anoop
> {quote}In case of HM start and the bootstrap we create the ClusterID and 
> write to FS and then to zk and then create the META table FS layout. So in a 
> cluster recreate, we will see clusterID is there in FS and also the META FS 
> layout but no clusterID in zk. Ya seems we can use this as indication for 
> cluster recreate over existing data. In HM start, this is some thing we need 
> to check at 1st itself and track. If this mode is true, later when (if) we do 
> INIT_META_WRITE_FS_LAYOUT , we should not delete the META dir. As part of the 
> Bootstrap when we write that proc to MasterProcWal, we can include this mode 
> (boolean) info also. This is a protobuf message anyways. So even if this HM 
> got killed and restarted (at a point where the clusterId was written to zk 
> but the Meta FS layout part was not reached) we can use the info added as 
> part of the bootstrap wal entry and make sure NOT to delete the meta dir.
> {quote}
> In this JIRA, we're going to fix the `partial` definition for the case where
> the cluster ID is found in the file system alongside the HFiles but the ZK
> data was deleted or is fresh because the cluster was recreated.
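
The check Anoop describes can be sketched roughly as follows. This is only an illustration of the decision logic, not the HBase implementation: the class `MetaBootstrapSketch`, both method names, and the boolean inputs are hypothetical, and the step of persisting the flag in the bootstrap procedure's WAL entry is left out.

```java
// Hypothetical sketch of the "cluster recreate" check; none of these names exist in HBase.
class MetaBootstrapSketch {

    // A recreate over existing data is assumed when the cluster ID and the META
    // layout are already on the file system but no cluster ID is in ZooKeeper.
    static boolean isClusterRecreate(boolean clusterIdInFs, boolean metaLayoutInFs,
                                     boolean clusterIdInZk) {
        return clusterIdInFs && metaLayoutInFs && !clusterIdInZk;
    }

    // During the write-FS-layout step, only delete the META directory when it
    // exists and we are NOT recreating the cluster over existing data.
    static boolean shouldDeleteMetaDir(boolean metaDirExists, boolean clusterRecreate) {
        return metaDirExists && !clusterRecreate;
    }

    public static void main(String[] args) {
        // Cloud recreate: HFiles survived on S3, ZooKeeper data did not.
        boolean recreate = isClusterRecreate(true, true, false);
        System.out.println("recreate=" + recreate
            + " deleteMeta=" + shouldDeleteMetaDir(true, recreate));
    }
}
```

In the suggestion above, the recreate flag would be computed once at master start and carried with the bootstrap procedure, so a master that is killed and restarted midway would still avoid deleting the meta directory.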



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[VOTE] The first HBase 2.3.7 release candidate (RC0) is available

2021-10-12 Thread Nick Dimiduk
Please vote on this Apache hbase release candidate,
hbase-2.3.7RC0
The VOTE will remain open for at least 72 hours.
[ ] +1 Release this package as Apache hbase 2.3.7
[ ] -1 Do not release this package because ...
The tag to be voted on is 2.3.7RC0:
  https://github.com/apache/hbase/tree/2.3.7RC0
This tag currently points to git reference
  8b2f5141e900c851a2b351fccd54b13bcac5e2ed
The release files, including signatures, digests, as well as CHANGES.md and
RELEASENOTES.md included in this RC can be found at:
  https://dist.apache.org/repos/dist/dev/hbase/2.3.7RC0/
Maven artifacts are available in a staging repository at:
  https://repository.apache.org/content/repositories/orgapachehbase-1466/
Artifacts were signed with the 0x1C3489BD key which can be found in:
  https://dist.apache.org/repos/dist/release/hbase/KEYS
To learn more about Apache hbase, please see
  http://hbase.apache.org/
Thanks,
Your HBase Release Manager


Re: RegionSize returning in MB - change to bytes?

2021-10-12 Thread Nick Dimiduk
Hi Norbert,

To answer your question directly: the RegionSizeCalculator class is
annotated with @InterfaceAudience.Private, which means there's a good
chance that its implementation can be changed without need for a
deprecation cycle and user participation.

Curiously, I noticed that this `sizeMap` is accessed down in the method
`long getRegionSize(byte[])`, and its javadoc mentions the returned unit
explicitly as bytes.

So with a little investigation using git blame, I see that the switch from
returning values in bytes to values in megabytes came in through
HBASE-16169 -- your proposed change was the old implementation. For
whatever reasons, it was determined to not be scalable. So, we could revert
back, but we'd need some new solution to what HBASE-16169 aimed to solve.

I hope this helps.

Thanks,
Nick


[jira] [Created] (HBASE-26353) Support loadable dictionaries in hbase-compression-zstd

2021-10-12 Thread Andrew Kyle Purtell (Jira)
Andrew Kyle Purtell created HBASE-26353:
---

 Summary: Support loadable dictionaries in hbase-compression-zstd
 Key: HBASE-26353
 URL: https://issues.apache.org/jira/browse/HBASE-26353
 Project: HBase
  Issue Type: Sub-task
Reporter: Andrew Kyle Purtell
Assignee: Andrew Kyle Purtell
 Fix For: 2.5.0, 3.0.0-alpha-2


ZStandard supports initialization of compressors and decompressors with a
precomputed dictionary, which can dramatically improve and speed up compression
of tables with small values. For more details, please see [The Case For Small
Data Compression|https://github.com/facebook/zstd#the-case-for-small-data-compression].

If a table is going to have a lot of small values and the user can put together 
a representative set of files that can be used to train a dictionary for 
compressing those values, a dictionary can be trained with the {{zstd}} command 
line utility, available in any zstandard package for your favorite OS:

Training:
{noformat}
$ zstd --maxdict=1126400 --train-fastcover=shrink \
-o mytable.dict training_files/*
Trying 82 different sets of parameters
...
k=674  
d=8
f=20
steps=40
split=75
accel=1
Save dictionary of size 1126400 into file mytable.dict
{noformat}

Deploy the dictionary file to HDFS.

Create the table:

{noformat}
hbase> create "mytable", 
  ... ,
  CONFIGURATION => {
'hbase.io.compress.zstd.level' => '6',
'hbase.io.compress.zstd.dictionary' => true,
'hbase.io.compress.zstd.dictionary.file' => \
  'hdfs://nn/zdicts/mytable.dict'
  }
{noformat}

Now start storing data. Compression results even for small values will be 
excellent.

Note: if the dictionary is lost, the data cannot be decompressed.
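
To illustrate why a preset dictionary helps with small values, here is a minimal sketch using the JDK's `java.util.zip.Deflater` (zlib), which also supports preset dictionaries. It is a stand-in for the idea only: it is not zstd and not the hbase-compression-zstd code, and the class name, sample value, and hand-written dictionary are made up for the example (a real zstd dictionary would be trained as shown above).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Stand-in demo with zlib preset dictionaries; zstd dictionaries follow the same idea.
class PresetDictionaryDemo {

    // Compress input, optionally priming the compressor with a dictionary.
    static byte[] compress(byte[] input, byte[] dictionary) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        if (dictionary != null) {
            deflater.setDictionary(dictionary);
        }
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress; the same dictionary must be supplied if one was used to compress.
    static byte[] decompress(byte[] compressed, byte[] dictionary) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0) {
                    if (inflater.needsDictionary()) {
                        if (dictionary == null) {
                            throw new IllegalStateException("dictionary required but missing");
                        }
                        inflater.setDictionary(dictionary);
                    } else if (inflater.needsInput()) {
                        throw new IllegalStateException("truncated input");
                    }
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Hand-written dictionary holding the boilerplate shared by many small values.
        byte[] dict = "{\"user\":\"\",\"status\":\"active\",\"region\":\"us-east-1\"}"
            .getBytes(StandardCharsets.UTF_8);
        byte[] value = "{\"user\":\"alice\",\"status\":\"active\",\"region\":\"us-east-1\"}"
            .getBytes(StandardCharsets.UTF_8);

        byte[] plain = compress(value, null);
        byte[] withDict = compress(value, dict);
        System.out.println("raw=" + value.length + " plain=" + plain.length
            + " withDict=" + withDict.length);
        System.out.println("round trip ok: "
            + Arrays.equals(value, decompress(withDict, dict)));
    }
}
```

Calling `decompress` without the dictionary fails in this sketch, which mirrors the warning above about losing the dictionary file.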





[jira] [Created] (HBASE-26354) [hbase-connectors] Added python client for HBase thrift service

2021-10-12 Thread Yutong Xiao (Jira)
Yutong Xiao created HBASE-26354:
---

 Summary: [hbase-connectors] Added python client for HBase thrift 
service
 Key: HBASE-26354
 URL: https://issues.apache.org/jira/browse/HBASE-26354
 Project: HBase
  Issue Type: Improvement
Reporter: Yutong Xiao
Assignee: Yutong Xiao


A Python client for the HBase Thrift service. It has a request retry mechanism
and exception handling, and it encapsulates redundant parameters to make usage
easier.





Re: RegionSize returning in MB - change to bytes?

2021-10-12 Thread Duo Zhang
The return value is in bytes; the problem is that we normalize the size to
MB and then multiply by the MB factor to get the size in bytes, so if a file
is smaller than 1 MB, the returned value will be zero.
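
A minimal sketch of that arithmetic, assuming the size is held as a whole number of megabytes before being scaled back to bytes. The class and method names are made up for illustration; this is not the HBase code path itself.

```java
// Illustrative only: plain arithmetic standing in for the MB-based size path.
class RegionSizeTruncation {
    static final long BYTES_PER_MB = 1024L * 1024L;

    // Mimics a size that is first reduced to whole megabytes and later scaled back to bytes.
    static long bytesViaWholeMegabytes(long actualBytes) {
        long wholeMegabytes = actualBytes / BYTES_PER_MB; // integer division drops the remainder
        return wholeMegabytes * BYTES_PER_MB;
    }

    public static void main(String[] args) {
        long fewKilobytes = 8 * 1024L;
        long justOverOneMb = BYTES_PER_MB + 1;
        System.out.println(bytesViaWholeMegabytes(fewKilobytes));  // prints 0
        System.out.println(bytesViaWholeMegabytes(justOverOneMb)); // prints 1048576
    }
}
```

A region holding a few kilobytes therefore reports 0 bytes and looks empty to any caller that filters on size 0.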

Need to investigate more here.

Reading the issue, the scalability problem they wanted to solve is that we
go to the master to get the region size; it is not about whether the unit is
MB or not.

Thanks.


[jira] [Created] (HBASE-26355) Release 1.4.14

2021-10-12 Thread Duo Zhang (Jira)
Duo Zhang created HBASE-26355:
-

 Summary: Release 1.4.14
 Key: HBASE-26355
 URL: https://issues.apache.org/jira/browse/HBASE-26355
 Project: HBase
  Issue Type: Task
Reporter: Duo Zhang


Per the end of this thread:

https://lists.apache.org/thread.html/r34d3fb86d667f8b3e58cbba78655733ac76e10f5883650f4910adc5c%40%3Cdev.hbase.apache.org%3E

Let's do a final 1.4.14 release and mark branch-1.4 as EOL.





Re: EOL branch-1 and all 1.x ?

2021-10-12 Thread Duo Zhang
Filed HBASE-26355 for releasing 1.4.14.

张铎(Duo Zhang) wrote on Mon, Oct 11, 2021 at 5:31 PM:

> So I think in this thread, the only concern is about performance issues,
> so we decided to make new releases on branch-1.
>
> But at least I think we all agree to EOL other 1.x release lines,
> especially branch-1.4 right?
>
> If no other concerns, let's do a final 1.4.14 release and then mark
> branch-1.4 as EOL. There are 40 issues under 1.4.14 so I think it is worth
> having a new release.
>
> Thanks.
>
> Andrew Purtell wrote on Tue, Jun 1, 2021 at 3:16 AM:
>
>> It would be good to do the performance work at least, if you are up for
>> it. There are always going to be consequences for the kind of significant
>> evolution that 2.x represents over 1.x.
>>
>> Regarding performance, a change always has positive and negative
>> consequences. It is important to understand them both, informed by real
>> world use cases. My guess is you have real world use cases, Reid. Your
>> results will be meaningful.
>>
>> Synthetic benchmarks are less interesting unless the regression is
>> obvious and more like a bug than a consequence. Sure they will report
>> positive and negative changes, but does that actually mean anything? It
>> depends. Sometimes it will only mean something if we care about supporting
>> the synthetic benchmark as a first class use case. (Usually we don’t; but
>> universal cross system bench tools like YCSB are exceptions.)
>>
>>
>> > On May 31, 2021, at 9:25 AM, Reid Chan  wrote:
>> >
>> > Thanks to Andrew and Sean's help, I managed to release the first candidate
>> > of 1.7.0 (at least it is a beginning, and graduated from green hand).
>> > BTW, The [VOTE]
>> > <https://lists.apache.org/thread.html/r0b96b6596fc423e17ff648633e5ea76fd897d9afb8a03ae6e09cdb8f%40%3Cdev.hbase.apache.org%3E>
>> >
>> > The following are my thoughts:
>> > I'm willing to continue branch-1's life as a RM.
>> > And before EOL branch-1, I need to announce EOL of branch-1.4.
>> > While maintaining the branch-1, I also will do some benchmarks between 1.7+
>> > and 2.4+ (the latest). If 2.4+ is better, cool. Otherwise, I'm willing to
>> > spend some time diving in.
>> > After the performance issue is done, I need to review the upgrade from 1.x
>> > to 2.x. I remember someone wrote it. But HBASE-25902 seems to reveal some
>> > problems already.
>> > I will announce EOL of branch-1 if listed above are done.
>> >
>> > Probably more than 1 year, by estimation, if I have to do it all alone. The
>> > most time-spending should be performance diving in (if there was) and
>> > upgrade review.
>> >
>> > Any thought is appreciated.
>> >
>> >
>> > ---
>> > Best regards,
>> > R.C
>> >
>> >
>> >
>> >
>> >> On Tue, Apr 20, 2021 at 12:13 AM Reid Chan wrote:
>> >>
>> >> FYI, a JDK issue when I was making the 1.7.0 release.
>> >>
>> >> https://lists.apache.org/thread.html/r118b08134676d9234362a28898249186fe73a1fb08535d6eec6a91d3%40%3Cdev.hbase.apache.org%3E
>> >>
>> >> ---
>> >> Best Regards,
>> >> R.C
>> >>
>> >>> On Thu, Apr 1, 2021 at 6:03 AM Andrew Purtell wrote:
>> >>>
>> >>> Is it time to consider EOL of branch-1 and all 1.x releases ?
>> >>>
>> >>> There doesn't seem to be much developer interest in branch-1 beyond
>> >>> occasional maintenance. This is understandable. Per our compatibility
>> >>> guidelines, branch-1 commits must be compatible with Java 7, and the range
>> >>> of acceptable versions of third party dependencies is also restricted due
>> >>> to Java 7 compatibility requirements. Most developers are writing code with
>> >>> Java 8+ idioms these days. For that reason and because the branch-1 code
>> >>> base is generally aged at this point, all but trivial (or lucky!) backports
>> >>> require substantial changes in order to integrate adequately. Let me also
>> >>> observe that branch-1 artifacts are not fully compatible with Java 11 or
>> >>> later. (The shell is a good example of such issues: The version of
>> >>> jruby-complete required by branch-1 is not compatible with Java 11 and
>> >>> upgrading to the version used by branch-2 causes shell commands to error
>> >>> out due to Ruby language changes.)
>> >>>
>> >>> We can a priori determine there is insufficient motivation for production
>> >>> of release artifacts for the PMC to vote upon. Otherwise, someone would
>> >>> have done it. We had 12 releases from branch-2 derived code in 2019, 13
>> >>> releases from branch-2 derived code in 2020, and so far we have had 3
>> >>> releases from branch-2 derived code in 2021. In contrast, we had 8 releases
>> >>> from branch-1 derived code in 2019, 0 releases from branch-1 in 2020, and
>> >>> so far 0 releases from branch-1 in 2021.
>> >>>
>> >>>        2021  2020  2019
>> >>> 1.x       0     2     8
>> >>> 2.x       3    13    12
>> >>>
>> >>> If there is someone interested in continuing branch-1, now is the time to
>> >>> commit. However let me be clear that simply expressing an abstract desire
>> >>>

[jira] [Created] (HBASE-26356) Set version as 2.1.10 in branch-2.1 in prep for first RC of 2.1.10

2021-10-12 Thread Duo Zhang (Jira)
Duo Zhang created HBASE-26356:
-

 Summary: Set version as 2.1.10 in branch-2.1 in prep for first RC 
of 2.1.10
 Key: HBASE-26356
 URL: https://issues.apache.org/jira/browse/HBASE-26356
 Project: HBase
  Issue Type: Sub-task
Reporter: Duo Zhang








[jira] [Created] (HBASE-26357) Generate CHANGES.txt for 1.4.14

2021-10-12 Thread Duo Zhang (Jira)
Duo Zhang created HBASE-26357:
-

 Summary: Generate CHANGES.txt for 1.4.14
 Key: HBASE-26357
 URL: https://issues.apache.org/jira/browse/HBASE-26357
 Project: HBase
  Issue Type: Sub-task
  Components: documentation
Reporter: Duo Zhang








[jira] [Created] (HBASE-26358) Put up 1.4.14RC0

2021-10-12 Thread Duo Zhang (Jira)
Duo Zhang created HBASE-26358:
-

 Summary: Put up 1.4.14RC0
 Key: HBASE-26358
 URL: https://issues.apache.org/jira/browse/HBASE-26358
 Project: HBase
  Issue Type: Sub-task
Reporter: Duo Zhang





