HBase Meetup April 2023

2023-03-30 Thread Tak Lon (Stephen) Wu
Hi users/developers/friends,

I’m wondering whether any users, developers, or anyone else would be
interested in a hybrid (in-person and virtual) meetup. The in-person part
would be hosted in the Bay Area (Santa Clara/San Jose, with food and
drinks), and the virtual part would be a Zoom call or Google Hangout for
people around the world. The session should run about 1~2 hours.

Topics can be discussed on the user Slack channel (we already have a
thread there) or on this email thread if anyone has emerging, interesting
ideas for the HBase community. I have a few starters in mind:
* What are we going to do for ApacheCon/Community Over Code,
https://communityovercode.org/
* A few recent discussions from the dev channel
 - Review/discussion of what's in 2.6
 - HBase on K8s
 - HBase on Ozone integration
 - MTLS integration
 - HBase backup
 - more ...

If you would like to participate, please sign up on this Google
spreadsheet [1]. Once we confirm the date, we will reply on this thread. I
will try to accommodate everyone, but if there are real time conflicts and
we have a lot to cover, we can hold multiple sessions so that people who
cannot make one session are still covered.

(Chinese version)
Hi everyone, I would like to see whether any users, developers, or anyone
else would be interested in holding an informal online meetup of about 1~2
hours.

The goal of this meetup is to encourage everyone to bring up any
suggestions for the HBase project, or any emerging, interesting ideas.
Below are some topics I have collected:
* ApacheCon/Community Over Code, https://communityovercode.org/
will be held in October 2023; does our HBase project have any ideas or
activities for participating?
* Recent discussions on the dev channel
 - Review/discussion of what's in 2.6
 - HBase on K8s
 - HBase on Ozone integration
 - MTLS integration
 - HBase backup
 - more ...

If you are interested in participating, please add your name and the dates
you can attend to this spreadsheet [1]. Once we confirm the date of the
online discussion, we will reply on this thread. I will try to arrange a
suitable time, but if there are real time conflicts or many topics to
cover, we will schedule more time and multiple sessions to cover the
needed discussions.

1. Event signup sheet,
https://docs.google.com/spreadsheets/d/1TBawOo68GSxkjahCPNjDT3nhdCA-Jv_lG9Qf7hEy96U/edit#gid=0

Thanks,
Stephen


Re: Re: Asking for help: single-region flush taking too long on an HBase cluster

2023-03-30 Thread Duo Zhang
It still feels like there are simply too many regions. With only 14
machines, 4800 regions will certainly leave the cluster in an unhealthy
state.
Does this table take real-time writes? Can they be stopped temporarily?
For example, pick a night, create a new table with 1024 regions, read the
old table and write the data into the new one, then disable and drop the
old table and rename the new table to the old name. If you are fairly
familiar with HBase, you can try disabling the old table, writing an MR
job that reads a snapshot and writes out HFiles, and then bulk loading
them into the new table; that is quite a bit faster.
For the new table I would suggest at least using
ConstantSizeRegionSplitPolicy (roughly that name, I don't remember it
exactly), or even disabling splits altogether. With only 14 machines, 1024
regions is plenty; more than that does not add much...
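
For illustration only, here is a minimal sketch of the rebuild-and-rename
path described above, assuming the HBase 2.x Java client. The new table
name, snapshot name, column family name, and split algorithm are
placeholders; step 2 (the actual data copy via CopyTable or an MR +
bulkload job) is only indicated by a comment; and since HBase has no
direct table rename, the usual snapshot + clone_snapshot trick stands in
for the "rename" step:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
  import org.apache.hadoop.hbase.util.RegionSplitter;

  public class RebuildTableSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Admin admin = conn.getAdmin()) {
        TableName oldName = TableName.valueOf("sdhz_user_info_realtime");
        TableName newName = TableName.valueOf("sdhz_user_info_realtime_new"); // placeholder

        // 1. Create the new table pre-split into 1024 regions, pinning a split
        //    policy so it does not split itself back up to thousands of regions.
        //    HexStringSplit is just one split algorithm; pick split points that
        //    match the table's real rowkey distribution.
        byte[][] splits = new RegionSplitter.HexStringSplit().split(1024);
        TableDescriptorBuilder builder = TableDescriptorBuilder.newBuilder(newName)
            .setRegionSplitPolicyClassName(
                "org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy")
            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("f")); // family name assumed
        admin.createTable(builder.build(), splits);

        // 2. Copy the data from the old table into the new one here
        //    (CopyTable, a custom job, or snapshot -> HFiles -> bulkload
        //    as suggested above).

        // 3. "Rename": snapshot the new table, drop the old table, then clone
        //    the snapshot back under the old name.
        admin.snapshot("sdhz_rebuild_snap", newName); // snapshot name is a placeholder
        admin.disableTable(oldName);
        admin.deleteTable(oldName);
        admin.cloneSnapshot("sdhz_rebuild_snap", oldName);
        admin.disableTable(newName);
        admin.deleteTable(newName);
        admin.deleteSnapshot("sdhz_rebuild_snap");
      }
    }
  }

Pinning the split policy (or using DisabledRegionSplitPolicy) is what keeps
the rebuilt table from growing back toward thousands of regions.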


Re: Re: Asking for help: single-region flush taking too long on an HBase cluster

2023-03-30 Thread 邢*
Thank you for the reply, Mr. Zhang. Since the snapshot time for this large
table has only recently grown noticeably, I still feel there is a problem
in the HBase cluster. I have reorganized the full timeline of how this
problem happened; could you please take another look? Thanks again.


1. HBase cluster information:
3 virtual machines:
serving as HBase master, HDFS namenode, journalnode, failover controller
14 physical machines: 96 cores, 376 GB memory, SSD disks
serving as HBase region server, HDFS datanode


HBase version: 2.1.9
Hadoop version: hadoop-3.0.0-cdh-6.2.0
Table: sdhz_user_info_realtime
Size: 9.5 TB
1 column family


2. Recent operations:
January 2023: the sdhz_user_info_realtime table had 1024 regions
February 2023: because sdhz_user_info_realtime contained a lot of dirty
data, we scanned it region by region and deleted the useless columns
End of February 2023: we found that the region count of
sdhz_user_info_realtime had exploded to more than 4000; we then merged
regions manually (see the API sketch after this list), which caused holes
in hbase:meta, so we gave up on merging. The table has now reached 4800
regions
Night of March 16, 2023: a snapshot of sdhz_user_info_realtime timed out
(the timeout was 5 minutes)
March 17, 2023: changed the snapshot timeout to 15 minutes
March 19, 2023: the snapshot timed out again
March 20, 2023: changed the snapshot timeout to 60 minutes; each snapshot
currently takes roughly half an hour
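
For reference, the "merged regions manually" step above would normally be
done with the merge_region shell command or the Admin API. A minimal
sketch of merging adjacent region pairs with the HBase 2.x Java client is
below; the table name comes from this thread, everything else is
illustrative, and as the timeline shows, doing this at the scale of
thousands of regions is exactly what led to the hbase:meta holes:

  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.RegionInfo;

  public class MergeAdjacentRegionsSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Admin admin = conn.getAdmin()) {
        TableName table = TableName.valueOf("sdhz_user_info_realtime");
        // Regions are normally returned ordered by start key, so neighbouring
        // entries in the list are adjacent and can be merged pairwise.
        List<RegionInfo> regions = admin.getRegions(table);
        for (int i = 0; i + 1 < regions.size(); i += 2) {
          admin.mergeRegionsAsync(
                  regions.get(i).getEncodedNameAsBytes(),
                  regions.get(i + 1).getEncodedNameAsBytes(),
                  false)     // non-forcible: only truly adjacent regions merge
               .get();       // wait for each merge instead of flooding the master
        }
      }
    }
  }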


3. Questions we would like answered:
Why does the snapshot take so long?
How can the number of regions be brought back down?




4. We tracked and investigated the problem during this period; the
information we collected is as follows:




During the snapshot, the time is spent on flushing the regions and on
verifying the snapshot.
Before the "Done waiting - online snapshot for" log line appears, flushes
run continuously; flushing all regions takes more than 20 minutes in
total, each region's flush takes about 10~20 s and writes out a few MB to
a few tens of MB of memstore, and there are roughly 340 regions per region
server.
After the "Done waiting - online snapshot for" log line appears, the
snapshot is verified, moved, and so on, which takes about 10 minutes.
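
A rough back-of-the-envelope check of those numbers (my own arithmetic;
the per-region-server snapshot concurrency is an assumption, as
hbase.snapshot.region.concurrentTasks defaults to 3 if I remember the
property correctly) lands close to the observed half hour:

  public class SnapshotFlushEstimate {
    public static void main(String[] args) {
      int totalRegions = 4800;        // from the report above
      int regionServers = 14;         // from the report above
      double secondsPerFlush = 15;    // report says roughly 10~20 s per region
      int concurrentFlushes = 3;      // assumed per-RS snapshot task concurrency

      int regionsPerServer = totalRegions / regionServers;                // ~342
      double flushPhaseMinutes =
          regionsPerServer * secondsPerFlush / concurrentFlushes / 60.0;  // ~28.5
      System.out.printf("regions per server = %d, flush phase ~ %.1f minutes%n",
          regionsPerServer, flushPhaseMinutes);
    }
  }

So even with only a few MB per memstore, flushing ~340 regions per server
adds up to a flush phase on the order of half an hour, which matches the
observation and the "too many regions" point in the reply above.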


The following DFSClient log was found on a region server:
2023-03-29 16:24:55,395 INFO org.apache.hadoop.hdfs.DFSClient: Could not 
complete 
/hbase/.hbase-snapshot/.tmp/sdhz_user_info_realtime_1680076952828/region-manifest.603b4d8028af279648af4bfaa3889fd0
 retrying...




The following logs were found on a datanode:
During incremental block reporting, the same block is reported several
times. For example, in the logs below the block blk_1109695462_35954673
appears repeatedly; we suspect that when the DN sends the IBR to the NN,
some reports fail and go back into pendingIBR for retry, and we would like
to use arthas to watch the putMissing method during IBR reporting.


2023-03-29 16:16:51,367 DEBUG 
org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call 
blockReceivedAndDeleted: 
[DatanodeStorage[DS-059fc182-f7cc-43bd-a0c3-2e1447f6650f,DISK,NORMAL][blk_1109695462_35954673,
 status: RECEIVING_BLOCK, delHint: null, blk_1109695456_35954667, status: 
RECEIVED_BLOCK, delHint: null], 
DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695459_35954670,
 status: RECEIVING_BLOCK, delHint: null]]
2023-03-29 16:16:51,367 DEBUG 
org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call 
blockReceivedAndDeleted: 
[DatanodeStorage[DS-059fc182-f7cc-43bd-a0c3-2e1447f6650f,DISK,NORMAL][blk_1109695462_35954673,
 status: RECEIVING_BLOCK, delHint: null, blk_1109695456_35954667, status: 
RECEIVED_BLOCK, delHint: null], 
DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695459_35954670,
 status: RECEIVING_BLOCK, delHint: null]]
2023-03-29 16:16:51,370 DEBUG 
org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call 
blockReceivedAndDeleted: 
[DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695452_35954663,
 status: RECEIVED_BLOCK, delHint: null]]
2023-03-29 16:16:51,372 DEBUG 
org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call 
blockReceivedAndDeleted: 
[DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695459_35954670,
 status: RECEIVED_BLOCK, delHint: null]]
2023-03-29 16:16:51,380 DEBUG 
org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call 
blockReceivedAndDeleted: 
[DatanodeStorage[DS-059fc182-f7cc-43bd-a0c3-2e1447f6650f,DISK,NORMAL][blk_1109695462_35954673,
 status: RECEIVED_BLOCK, delHint: null], 
DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695468_35954679,
 status: RECEIVING_BLOCK, delHint: null]]
2023-03-29 16:16:51,552 DEBUG 
org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call 
blockReceivedAndDeleted: 
[DatanodeStorage[DS-059fc182-f7cc-43bd-a0c3-2e1447f6650f,DISK,NORMAL][blk_1109695391_35954602,
 status: RECEIVED_BLOCK, delHint: null]]
2023-03-29 16:16:51,574 DEBUG 
org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call 
blockReceivedAndDeleted: 
[DatanodeStorage[DS-059fc182-f7cc-43bd-a0c3-2e1447f6650f,DISK,NORMAL][blk_1109695455_35954666,
 status: RECEIVED_BLOCK, delHint: null]]
2023-03-29 16:16:52,094 DEBUG 
org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call 
blockReceivedAndDeleted: 
[DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695468_35954679,
 status: RECEIVED_BLOCK, delHint: null]]
2023-03-29 16:16:53,164 DEBUG 
org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call 
blockReceivedAndDeleted: 
[DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695447_35954658,
 status: RECEIVED_BLOCK, delHint: null]]
2023-03-29 16:16:55,574 DEBUG 
org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call 
blockReceivedAndDeleted: 
[DatanodeStorage[DS-b7afb119-fef7-429d-bf95-c7df181b7785,DISK,NORMAL][blk_1109695351_35954562,
 status: RECEIVED_BLOCK, delHint: null]]
2023-03-29 16:16:56,894 DEBUG 
org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager: call 
blockReceivedAndDeleted: 
[DatanodeStorage[DS-059fc182-f7cc-43bd-a0c3-2e1447f6650f,DISK,NORMAL][blk_1109695391_35954602,
 status: RECEIVED_BLOCK, delHint: