Balazs Meszaros created HBASE-22711:
---------------------------------------

             Summary: Spark connector doesn't use the given mapping when inserting data
                 Key: HBASE-22711
                 URL: https://issues.apache.org/jira/browse/HBASE-22711
             Project: HBase
          Issue Type: Bug
          Components: hbase-connectors
    Affects Versions: connector-1.0.0
            Reporter: Balazs Meszaros
            Assignee: Balazs Meszaros


In some cases a Spark DataFrame cannot be read back with the same mapping it was written with. For example:

{code:scala}
// Needed for .toDS outside spark-shell; `spark` is the active SparkSession.
import spark.implicits._

val sql = spark.sqlContext

val persons =
  """[
    |{"name": "alice", "age": 20, "height": 5, "email": "al...@alice.com"},
    |{"name": "bob", "age": 23, "height": 6, "email": "b...@bob.com"},
    |{"name": "carol", "age": 12, "email": "ca...@carol.com", "height": 4.11}
    |]
  """.stripMargin

// JSON schema inference yields age: long and height: double.
val df = spark.read.json(Seq(persons).toDS)

df.write
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping", "name STRING :key, age SHORT p:age, email STRING c:email, height FLOAT p:height")
  .option("hbase.table", "person")
  .option("hbase.spark.use.hbasecontext", false)
  .save()
{code}

It cannot be read back with the same mapping:

{code:scala}
val df2 = sql.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping", "name STRING :key, age SHORT p:age, email STRING c:email, height FLOAT p:height")
  .option("hbase.table", "person")
  .option("hbase.spark.use.hbasecontext", false)
  .load()

df2.createOrReplaceTempView("tableView")
val results = sql.sql("SELECT * FROM tableView")
results.show()
{code}

The results:

{noformat}
+---+-----+---------+---------------+
|age| name|   height|          email|
+---+-----+---------+---------------+
|  0|alice|   2.3125|al...@alice.com|
|  0|  bob|    2.375|   b...@bob.com|
|  0|carol|2.2568748|ca...@carol.com|
+---+-----+---------+---------------+
{noformat}

Spark stores integer values as long and floating-point values as double, so the values are written to HBase as 8 bytes each, regardless of the SHORT and FLOAT types given in the mapping:

{noformat}
shell> scan 'person'
 alice column=p:age, timestamp=1563450714829, value=\x00\x00\x00\x00\x00\x00\x00\x14
 alice column=p:height, timestamp=1563450714829, value=@\x14\x00\x00\x00\x00\x00\x00
{noformat}
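
The byte values in the scan output are enough to reproduce the wrong numbers without a cluster. The following is a minimal sketch, not the connector's code: it assumes the values are stored big-endian and that a SHORT or FLOAT mapping decodes only the leading 2 or 4 bytes of the cell, which is what the observed results suggest. The object name `WidthMismatch` and its fields are hypothetical, chosen for illustration.

```scala
import java.nio.ByteBuffer

object WidthMismatch {
  // Cell values copied from the `scan 'person'` output for row "alice".
  // '@' in the shell output is byte 0x40.
  val ageBytes: Array[Byte]    = Array[Byte](0, 0, 0, 0, 0, 0, 0, 0x14)
  val heightBytes: Array[Byte] = Array[Byte](0x40, 0x14, 0, 0, 0, 0, 0, 0)

  // Decoded at the width that was actually written (8 bytes, big-endian).
  val ageAsLong: Long        = ByteBuffer.wrap(ageBytes).getLong
  val heightAsDouble: Double = ByteBuffer.wrap(heightBytes).getDouble

  // Decoded at the width the mapping declares (assumption: leading bytes only).
  val ageAsShort: Short    = ByteBuffer.wrap(ageBytes).getShort
  val heightAsFloat: Float = ByteBuffer.wrap(heightBytes).getFloat

  def main(args: Array[String]): Unit = {
    println(s"age:    written=$ageAsLong read=$ageAsShort")
    println(s"height: written=$heightAsDouble read=$heightAsFloat")
  }
}
```

Under these assumptions, alice's age decodes to 20 as a long but 0 as a short, and her height decodes to 5.0 as a double but 2.3125 as a float, matching the first row of the read-back table.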