You can get better performance if you read and write HBase directly. You can
also use spark-phoenix; here is an example that reads data from a CSV file and
writes it into a Phoenix table:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}

def main(args: Array[String]): Unit = {
  val sc = new SparkContext("local", "phoenix-test")
  val path = "/tmp/data"
  // comma-separated ZooKeeper quorum of the HBase cluster
  val hbaseConnectionString = "host1,host2,host3"

  // read every column as a plain string
  val customSchema = StructType(Array(
    StructField("O_ORDERKEY", StringType, true),
    StructField("O_CUSTKEY", StringType, true),
    StructField("O_ORDERSTATUS", StringType, true),
    StructField("O_TOTALPRICE", StringType, true),
    StructField("O_ORDERDATE", StringType, true),
    StructField("O_ORDERPRIORITY", StringType, true),
    StructField("O_CLERK", StringType, true),
    StructField("O_SHIPPRIORITY", StringType, true),
    StructField("O_COMMENT", StringType, true)))

  // load the pipe-delimited file with the spark-csv package
  // (com.databricks:spark-csv)
  val sqlContext = new SQLContext(sc)
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("delimiter", "|")
    .option("header", "false")
    .schema(customSchema)
    .load(path)

  // write the DataFrame into the Phoenix table DATAX
  val start = System.currentTimeMillis()
  df.write.format("org.apache.phoenix.spark")
    .mode("overwrite")
    .option("table", "DATAX")
    .option("zkUrl", hbaseConnectionString)
    .save()
  val end = System.currentTimeMillis()
  println("taken time: " + ((end - start) / 1000) + "s")
}
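
The same data source can read the table back into a DataFrame as well. A
minimal sketch, reusing the sqlContext, table name and zkUrl from the example
above:

// read the DATAX table back through phoenix-spark
val readDf = sqlContext.read
  .format("org.apache.phoenix.spark")
  .option("table", "DATAX")
  .option("zkUrl", hbaseConnectionString)
  .load()
readDf.show(10)

Note that, as far as I know, the connector saves a DataFrame with Phoenix
UPSERT statements, so rows whose primary key already exists are overwritten
rather than duplicated.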
----------------------------------------
Yun Zhang
Best regards!
2018-08-06 20:10 GMT+08:00 Brandon Geise <[email protected]>:
> Thanks for the reply Yun.
>
> I’m not quite clear on how exactly this would help on the upsert side. Are
> you suggesting deriving the types from Phoenix and then doing the
> encoding/decoding and writing to and reading from HBase directly?
>
> Thanks,
>
> Brandon
>
> From: Jaanai Zhang <[email protected]>
> Reply-To: <[email protected]>
> Date: Sunday, August 5, 2018 at 9:34 PM
> To: <[email protected]>
> Subject: Re: Spark-Phoenix Plugin
>
> You can get the data types from the Phoenix metadata, then encode/decode the
> data to write/read it directly. I think this way is effective, FYI :)
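>
> For example, a rough sketch (not tested) that pulls the column types from
> the Phoenix metadata through the standard JDBC DatabaseMetaData API:
>
> import java.sql.DriverManager
>
> // connect to Phoenix via the ZooKeeper quorum
> val conn = DriverManager.getConnection("jdbc:phoenix:host1,host2,host3")
> // list the columns of table DATAX with their SQL types
> val cols = conn.getMetaData.getColumns(null, null, "DATAX", null)
> while (cols.next()) {
>   val name = cols.getString("COLUMN_NAME")
>   val sqlType = cols.getInt("DATA_TYPE") // java.sql.Types id
>   println(name + ": " + cols.getString("TYPE_NAME") + " (" + sqlType + ")")
>   // Phoenix's PDataType.fromTypeId(sqlType) should then give the codec
>   // (toBytes/toObject) for encoding/decoding values directly against HBase.
> }
> conn.close()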
>
> ----------------------------------------
>
> Yun Zhang
>
> Best regards!
>
> 2018-08-04 21:43 GMT+08:00 Brandon Geise <[email protected]>:
>
> Good morning,
>
> I’m looking at using a combination of HBase, Phoenix and Spark for a
> project, and I read that using the Spark-Phoenix plugin directly is more
> efficient than JDBC. However, it wasn’t entirely clear from the examples
> whether an upsert is performed when writing a dataframe, and how many
> fine-grained options there are for executing the upsert. Any information
> someone can share would be greatly appreciated!
>
> Thanks,
>
> Brandon