Sun Xin created HBASE-28756:
-------------------------------
Summary: RegionSizeCalculator ignores the size of the MemStore, which
leads Spark to miss data
Key: HBASE-28756
URL: https://issues.apache.org/jira/browse/HBASE-28756
Project: HBase
Issue Type: Bug
Components: mapreduce
Affects Versions: 2.5.10, 3.0.0-beta-1, 2.6.0
Reporter: Sun Xin
Assignee: Sun Xin
RegionSizeCalculator only considers the size of StoreFiles and ignores the size
of the MemStore. For a new region whose data has only been written to the
MemStore and has not yet been flushed, the calculator reports its size as 0.
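For context, this is roughly how the size is derived today (a paraphrased sketch based on the public Admin/RegionMetrics API, not the exact RegionSizeCalculator source):
{code:java}
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.hbase.RegionMetrics;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.Size;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.util.Bytes;

// Paraphrased sketch: only the StoreFile size feeds the size map, so a region
// whose data still sits in the MemStore is reported as 0 bytes.
static Map<byte[], Long> storeFileOnlySizes(Admin admin, ServerName sn, TableName tn)
    throws IOException {
  Map<byte[], Long> sizeMap = new TreeMap<>(Bytes.BYTES_COMPARATOR);
  for (RegionMetrics metrics : admin.getRegionMetrics(sn, tn)) {
    long bytes = (long) metrics.getStoreFileSize().get(Size.Unit.MEGABYTE) * 1024L * 1024L;
    sizeMap.put(metrics.getRegionName(), bytes); // stays 0 until the first flush
  }
  return sizeMap;
}
{code}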
When we use TableInputFormat to read HBase table data in Spark:
{code:java}
spark.sparkContext.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
{code}
Spark defaults to ignoring empty InputSplits, which is controlled by the
configuration {{spark.hadoopRDD.ignoreEmptySplits}}.
{code:java}
private[spark] val HADOOP_RDD_IGNORE_EMPTY_SPLITS =
  ConfigBuilder("spark.hadoopRDD.ignoreEmptySplits")
    .internal()
    .doc("When true, HadoopRDD/NewHadoopRDD will not create partitions for " +
      "empty input splits.")
    .version("2.3.0")
    .booleanConf
    .createWithDefault(true)
{code}
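As a side note, until the calculator accounts for the MemStore, one possible mitigation on the Spark side is to turn that behavior off (this is only an assumption of mine, not something this issue depends on; the "hbase-read" app name is purely illustrative):
{code:java}
import org.apache.spark.SparkConf;

// Possible workaround (assumption): keep empty input splits so memstore-only
// regions are still scanned, at the cost of scheduling empty partitions.
SparkConf sparkConf = new SparkConf()
    .setAppName("hbase-read") // hypothetical app name, for illustration only
    .set("spark.hadoopRDD.ignoreEmptySplits", "false");
{code}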
Together, these two behaviors cause Spark to miss the data of regions that have
not been flushed yet. So RegionSizeCalculator should take both the StoreFile
size and the MemStore size into account.
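A minimal sketch of what the corrected computation could look like, assuming the org.apache.hadoop.hbase.RegionMetrics / Size API; the actual patch may differ:
{code:java}
import org.apache.hadoop.hbase.RegionMetrics;
import org.apache.hadoop.hbase.Size;

// Sketch only: derive the region size from both StoreFiles and the MemStore,
// so a freshly written, unflushed region no longer reports 0 bytes.
static long estimateRegionSizeBytes(RegionMetrics metrics) {
  long storeFileMB = (long) metrics.getStoreFileSize().get(Size.Unit.MEGABYTE);
  long memStoreMB = (long) metrics.getMemStoreSize().get(Size.Unit.MEGABYTE);
  return (storeFileMB + memStoreMB) * 1024L * 1024L;
}
{code}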