Hi, I am trying to perform some operations on an HBase table that is being populated by Spark Streaming.
Now this is just Spark on HBase, as opposed to Spark on Hive -> view on HBase etc. I also have a Phoenix view on this HBase table.

This is the sample code:

scala> val tableName = "marketDataHbase"
scala> val conf = HBaseConfiguration.create()
conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, hbase-default.xml, hbase-site.xml
scala> conf.set(TableInputFormat.INPUT_TABLE, tableName)
scala> // create rdd
scala> val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
hBaseRDD: org.apache.spark.rdd.RDD[(org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Result)] = NewHadoopRDD[4] at newAPIHadoopRDD at <console>:64
scala> hBaseRDD.count
res11: Long = 22272
scala> // transform (ImmutableBytesWritable, Result) tuples into an RDD of Results
scala> val resultRDD = hBaseRDD.map(tuple => tuple._2)
resultRDD: org.apache.spark.rdd.RDD[org.apache.hadoop.hbase.client.Result] = MapPartitionsRDD[8] at map at <console>:41
scala> // transform into an RDD of (RowKey, ColumnValue) pairs; the RowKey has the time removed
scala> val keyValueRDD = resultRDD.map(result => (Bytes.toString(result.getRow()).split(" ")(0), Bytes.toString(result.value)))
keyValueRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[9] at map at <console>:43
scala> keyValueRDD.take(2).foreach(kv => println(kv))
(000055e2-63f1-4def-b625-e73f0ac36271,43.89760813529593664528)
(000151e9-ff27-493d-a5ca-288507d92f95,57.68882040742382868990)

OK, above I am only getting the rowkey (the UUID above) and the last attribute (price). However, I have the rowkey and 3 more columns there in the HBase table!

scan 'marketDataHbase', "LIMIT" => 1
ROW                                    COLUMN+CELL
 000055e2-63f1-4def-b625-e73f0ac36271  column=price_info:price, timestamp=1476133232864, value=43.89760813529593664528
 000055e2-63f1-4def-b625-e73f0ac36271  column=price_info:ticker, timestamp=1476133232864, value=S08
 000055e2-63f1-4def-b625-e73f0ac36271  column=price_info:timecreated, timestamp=1476133232864, value=2016-10-10T17:12:22
1 row(s) in 0.0100 seconds

So how can I get the other columns?

Thanks

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
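
One possible way to read the other columns, offered as a sketch rather than a tested answer: as far as I know, Result.value() returns only the value of the first cell in the row, which would explain why only price_info:price comes back (price sorts before ticker and timecreated). Result.getValue(family, qualifier) reads a named column instead. The sketch below assumes it is pasted into the same spark-shell session (so sc exists) and that the column family and qualifiers are exactly those shown in the scan output; the col helper and the rowsRDD name are hypothetical, introduced only for illustration.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

// Same setup as in the session above; `sc` is the SparkContext provided by spark-shell.
val tableName = "marketDataHbase"
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, tableName)

val hBaseRDD = sc.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Build (rowKey, ticker, timecreated, price) tuples from each Result.
val rowsRDD = hBaseRDD.map { case (_, result) =>
  // Column family taken from the scan output (assumption: all three columns live in it).
  val cf = Bytes.toBytes("price_info")

  // Hypothetical helper: read one column as a String.
  // getValue returns null when the cell is missing, so wrap it in Option.
  def col(qualifier: String): Option[String] =
    Option(result.getValue(cf, Bytes.toBytes(qualifier))).map(b => Bytes.toString(b))

  // Row key handling is unchanged from the original keyValueRDD.
  val rowKey = Bytes.toString(result.getRow).split(" ")(0)

  (rowKey, col("ticker"), col("timecreated"), col("price"))
}

rowsRDD.take(2).foreach(println)
```

Each column comes back as an Option, so a row that lacks one of the three cells yields None for it instead of a NullPointerException. If only these columns are needed, setting TableInputFormat.SCAN_COLUMNS on conf to a space-separated list such as "price_info:price price_info:ticker price_info:timecreated" should restrict the scan on the server side, though I have not verified that against this table.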