I do not see a rationale for having HBase in this scheme of things... maybe I am missing something? If the data is already delivered to HDFS, why not just add a partition to an existing Hive table?
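For illustration, a minimal sketch of that suggestion in Spark 2 Scala, assuming a Hive-enabled SparkSession named spark and an external table prices_csv partitioned by trade_date (both names are hypothetical):

    // Hypothetical: register the day's delivered CSV directory as a new
    // partition of an external, date-partitioned Hive table.
    spark.sql(
      """ALTER TABLE prices_csv ADD IF NOT EXISTS
        |PARTITION (trade_date = '2016-10-17')
        |LOCATION 'hdfs:///prices/2016-10-17'""".stripMargin)

Each 15-minute delivery then becomes queryable from Hive or Spark as soon as its partition is registered, with no second copy of the data.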
On Tue, Oct 18, 2016 at 8:23 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Thanks Mike,
>
> My test csv data comes as:
>
> UUID, ticker, timecreated, price
> a2c844ed-137f-4820-aa6e-c49739e46fa6, S01, 2016-10-17T22:02:09, 53.36665625650533484995
> a912b65e-b6bc-41d4-9e10-d6a44ea1a2b0, S02, 2016-10-17T22:02:09, 86.31917515824627016510
> 5f4e3a9d-05cc-41a2-98b3-40810685641e, S03, 2016-10-17T22:02:09, 95.48298277703729129559
>
> And this is my HBase table, with one column family:
>
> create 'marketDataHbase', 'price_info'
>
> It is populated every 15 minutes from test.csv files delivered via Kafka and Flume to HDFS:
>
> 1. Create one fat csv file from all the small csv files for today --> prices/2016-10-17
> 2. Populate the data into the HBase table using org.apache.hadoop.hbase.mapreduce.ImportTsv
> 3. This is pretty quick using MapReduce
>
> ImportTsv only appends new rows to the HBase table; the choice of UUID as the rowKey avoids any key collisions.
>
> So I only have a 15-minute lag in my batch HBase table.
>
> I have both Hive ORC tables and Phoenix views on top of this HBase table.
>
> 1. Phoenix offers the fastest response if used on top of HBase. Unfortunately, Spark 2 with Phoenix is broken.
> 2. Spark on Hive on HBase looks OK. This works fine with Spark 2.
> 3. Spark on the HBase tables directly, using a key-value DF for each column. Not as fast as 2, but it works. I am not sure a DataFrame is a good fit for a plain key-value pair, though.
>
> Now, if I use Zeppelin to read from HBase:
>
> 1. I can use Phoenix JDBC. That looks very fast.
> 2. I can use Spark csv directly on the HDFS csv files.
> 3. I can use Spark on the Hive tables.
>
> If I use Tableau on the HBase data, then only SQL-like access is useful: Phoenix or Hive.
>
> I don't want to change the design now. But admittedly, Hive is the best SQL layer on top of HBase. The next release of Hive is going to have an in-memory caching and execution layer (LLAP), so we can keep Hive tables in memory. That will be faster. Many people underestimate Hive, but I still believe it has a lot to offer besides serious ANSI-compliant SQL.
>
> Regards,
>
> Mich
>
> Dr Mich Talebzadeh
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> On 17 October 2016 at 21:54, Michael Segel <msegel_had...@hotmail.com> wrote:
>
>> Mitch,
>>
>> Short answer… no, it doesn't scale.
>>
>> Longer answer…
>>
>> You are using a UUID as the row key? Why? (My guess is that you want to avoid hot spotting.)
>>
>> So you're going to have to pull in all of the data, meaning a full table scan, and then perform a sort-order transformation, dropping the UUID in the process.
>>
>> You would be better off not using HBase and instead storing the data in Parquet files in a directory partitioned on date. Or rather, the rowkey would be max_ts - TS, so that your data is in LIFO order.
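A minimal sketch of that reverse-timestamp rowkey, assuming event times are epoch milliseconds (the helper name here is made up):

    import org.apache.hadoop.hbase.util.Bytes

    // Subtracting the event time from Long.MaxValue means newer events get
    // lexicographically smaller rowkeys, so a scan returns newest-first (LIFO).
    def reverseTsRowKey(eventTimeMillis: Long): Array[Byte] =
      Bytes.toBytes(Long.MaxValue - eventTimeMillis)

Note that a monotonically decreasing key concentrates writes on one region, which is why this pairs with the bulk-load suggestion below.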
>> Note: I've used the term epoch to describe the max value of a long (8 bytes of 'FF') for max_ts. This isn't a good use of the term epoch, but if anyone has a better term, please let me know.
>>
>> Having said that… if you want to use HBase, you could do the same thing. If you want to avoid hot spotting, you could load the day's transactions using a bulk loader so that you don't have to worry about splits.
>>
>> But that's just my $0.02 worth.
>>
>> HTH
>>
>> -Mike
>>
>> PS. If you wanted to capture the transactions, you could use the following schema:
>>
>> 1) Rowkey = max_ts - TS
>> 2) Rows contain the following:
>> CUSIP (Transaction ID)
>> Party 1 (Seller)
>> Party 2 (Buyer)
>> Symbol
>> Qty
>> Price
>>
>> This is a trade ticket.
>>
>> On Oct 16, 2016, at 1:37 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have trade data stored in an HBase table. Data arrives in csv format to HDFS and is then loaded into HBase via a periodic load with org.apache.hadoop.hbase.mapreduce.ImportTsv.
>>
>> The HBase table has one column family, "price_info", and three columns: ticker, timecreated, price. The RowKey is a UUID. So each row in the csv file has UUID, ticker, timecreated and price.
>>
>> Each row in HBase is a key-value map. In my case, I have one column family and three columns. Without going into semantics, I see HBase as a column-oriented database where column data stays together.
>>
>> So I thought of this way of accessing the data.
>>
>> I define an RDD for each column in the column family, as below. In this case, column price_info:ticker.
>>
>> // Create the base RDD over the HBase table
>> val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
>>   classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
>>   classOf[org.apache.hadoop.hbase.client.Result])
>> // For each row, keep the rowkey and the latest version of price_info:ticker
>> val rdd1 = hBaseRDD.map(tuple => tuple._2).map(result =>
>>   (result.getRow, result.getColumn("price_info".getBytes(), "ticker".getBytes()))).map(row => {
>>   (
>>     row._1.map(_.toChar).mkString,
>>     row._2.asScala.reduceLeft {
>>       (a, b) => if (a.getTimestamp > b.getTimestamp) a else b
>>     }.getValue.map(_.toChar).mkString
>>   )
>> })
>> case class columns(key: String, ticker: String)
>> val dfticker = rdd1.toDF.map(p => columns(p(0).toString, p(1).toString))
>>
>> Note that the end result is a DataFrame with the RowKey -> key and the column -> ticker.
>>
>> I use the same approach to create two other DataFrames, dftimecreated and dfprice, for the two other columns.
>>
>> Note that if I don't need a column, then I do not create a DF for it. So there is one DF per column I use. I am not sure how this compares with reading the full row through other methods, if any.
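As a side note, Result.getColumn is deprecated in later HBase releases. A sketch of the same latest-value extraction with the Cell API, assuming HBase 0.96+ and reusing the hBaseRDD defined above, could look like this:

    import org.apache.hadoop.hbase.CellUtil
    import org.apache.hadoop.hbase.util.Bytes

    // getColumnLatestCell already returns the newest cell version, so the
    // manual reduceLeft over timestamps is unnecessary (null check omitted).
    val latestTicker = hBaseRDD.map(_._2).map { result =>
      val cell = result.getColumnLatestCell(
        Bytes.toBytes("price_info"), Bytes.toBytes("ticker"))
      (Bytes.toString(result.getRow), Bytes.toString(CellUtil.cloneValue(cell)))
    }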
>> Anyway, all I need to do after creating a DataFrame for each column is to join them through the RowKey to slice and dice the data, like below.
>>
>> Get me the latest prices, ordered by timecreated and ticker (the ticker identifies the stock):
>>
>> val rs = dfticker.join(dftimecreated, "key").join(dfprice, "key")
>>   .orderBy('timecreated desc, 'price desc)
>>   .select('timecreated, 'ticker, 'price.cast("Float").as("Latest price"))
>> rs.show(10)
>>
>> +-------------------+------+------------+
>> |        timecreated|ticker|Latest price|
>> +-------------------+------+------------+
>> |2016-10-16T18:44:57|   S16|   97.631966|
>> |2016-10-16T18:44:57|   S13|    92.11406|
>> |2016-10-16T18:44:57|   S19|    85.93021|
>> |2016-10-16T18:44:57|   S09|   85.714645|
>> |2016-10-16T18:44:57|   S15|    82.38932|
>> |2016-10-16T18:44:57|   S17|    80.77747|
>> |2016-10-16T18:44:57|   S06|    79.81854|
>> |2016-10-16T18:44:57|   S18|    74.10128|
>> |2016-10-16T18:44:57|   S07|    66.13622|
>> |2016-10-16T18:44:57|   S20|    60.35727|
>> +-------------------+------+------------+
>> only showing top 10 rows
>>
>> Is this a workable solution?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> http://talebzadehmich.wordpress.com

--
Best Regards,
Ayan Guha