Sun Xin created HBASE-28756:
-------------------------------
Summary: RegionSizeCalculator ignores the size of the MemStore, which
leads Spark to miss data
Key: HBASE-28756
URL: https://issues.apache.org/jira/browse/HBASE-28756
Project: HBase
Issue Type: Bug
Components: mapreduce
Affects Versions: 2.5.10, 3.0.0-beta-1, 2.6.0
Reporter: Sun Xin
Assignee: Sun Xin
RegionSizeCalculator only considers the size of StoreFiles and ignores the size
of the MemStore. For a new region whose data has only been written to the
MemStore and has not yet been flushed, the calculator reports its size as 0.
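For context, this is roughly how the size is derived today (a paraphrased sketch based on the public Admin/RegionMetrics API, not the exact RegionSizeCalculator source):
{code:java}
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.hbase.RegionMetrics;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.Size;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.util.Bytes;

// Paraphrased sketch: only the StoreFile size feeds the size map, so a region
// whose data still sits in the MemStore is reported as 0 bytes.
static Map<byte[], Long> storeFileOnlySizes(Admin admin, ServerName sn, TableName tn)
    throws IOException {
  Map<byte[], Long> sizeMap = new TreeMap<>(Bytes.BYTES_COMPARATOR);
  for (RegionMetrics metrics : admin.getRegionMetrics(sn, tn)) {
    long bytes = (long) metrics.getStoreFileSize().get(Size.Unit.MEGABYTE) * 1024L * 1024L;
    sizeMap.put(metrics.getRegionName(), bytes); // stays 0 until the first flush
  }
  return sizeMap;
}
{code}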
When we use TableInputFormat to read HBase table data in Spark:
{code:java}
spark.sparkContext.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
{code}
Spark defaults to ignoring empty InputSplits, which is controlled by the
configuration {{spark.hadoopRDD.ignoreEmptySplits}}.
{code:java}
private[spark] val HADOOP_RDD_IGNORE_EMPTY_SPLITS =
  ConfigBuilder("spark.hadoopRDD.ignoreEmptySplits")
    .internal()
    .doc("When true, HadoopRDD/NewHadoopRDD will not create partitions for " +
      "empty input splits.")
    .version("2.3.0")
    .booleanConf
    .createWithDefault(true)
{code}
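As a side note, until the calculator accounts for the MemStore, one possible mitigation on the Spark side is to turn that behavior off (this is only an assumption of mine, not something this issue depends on; the "hbase-read" app name is purely illustrative):
{code:java}
import org.apache.spark.SparkConf;

// Possible workaround (assumption): keep empty input splits so memstore-only
// regions are still scanned, at the cost of scheduling empty partitions.
SparkConf sparkConf = new SparkConf()
    .setAppName("hbase-read") // hypothetical app name, for illustration only
    .set("spark.hadoopRDD.ignoreEmptySplits", "false");
{code}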
Together, these two behaviors cause Spark to miss the data of regions that have
not been flushed yet. So RegionSizeCalculator should take both the StoreFile
size and the MemStore size into account.
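A minimal sketch of what the corrected computation could look like, assuming the org.apache.hadoop.hbase.RegionMetrics / Size API; the actual patch may differ:
{code:java}
import org.apache.hadoop.hbase.RegionMetrics;
import org.apache.hadoop.hbase.Size;

// Sketch only: derive the region size from both StoreFiles and the MemStore,
// so a freshly written, unflushed region no longer reports 0 bytes.
static long estimateRegionSizeBytes(RegionMetrics metrics) {
  long storeFileMB = (long) metrics.getStoreFileSize().get(Size.Unit.MEGABYTE);
  long memStoreMB = (long) metrics.getMemStoreSize().get(Size.Unit.MEGABYTE);
  return (storeFileMB + memStoreMB) * 1024L * 1024L;
}
{code}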