I do not see a rationale for having HBase in this scheme of things... maybe I am missing something? If the data is already delivered to HDFS, why not just add a partition to an existing Hive table?
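For illustration, a minimal sketch of that suggestion in Spark 2 Scala, assuming a Hive-enabled SparkSession named spark and an external table prices_csv partitioned by trade_date (both names are hypothetical):

    // Hypothetical: register the day's delivered CSV directory as a new
    // partition of an external, date-partitioned Hive table.
    spark.sql(
      """ALTER TABLE prices_csv ADD IF NOT EXISTS
        |PARTITION (trade_date = '2016-10-17')
        |LOCATION 'hdfs:///prices/2016-10-17'""".stripMargin)

Each 15-minute delivery then becomes queryable from Hive or Spark as soon as its partition is registered, with no second copy of the data.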
On Tue, Oct 18, 2016 at 8:23 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Thanks Mike,
>
> My test csv data comes as:
>
> UUID, ticker, timecreated, price
> a2c844ed-137f-4820-aa6e-c49739e46fa6, S01, 2016-10-17T22:02:09, 53.36665625650533484995
> a912b65e-b6bc-41d4-9e10-d6a44ea1a2b0, S02, 2016-10-17T22:02:09, 86.31917515824627016510
> 5f4e3a9d-05cc-41a2-98b3-40810685641e, S03, 2016-10-17T22:02:09, 95.48298277703729129559
>
> And this is my HBase table, with one column family:
>
> create 'marketDataHbase', 'price_info'
>
> It is populated every 15 minutes from test.csv files delivered via Kafka and Flume to HDFS:
>
> 1. Create one fat csv file from all the small csv files for today --> prices/2016-10-17
> 2. Populate the data into the HBase table using org.apache.hadoop.hbase.mapreduce.ImportTsv
> 3. This is pretty quick using MapReduce
>
> ImportTsv only appends new rows to the HBase table; the choice of UUID as the rowKey avoids any key collisions.
>
> So I only have a 15-minute lag in my batch HBase table.
>
> I have both Hive ORC tables and Phoenix views on top of this HBase table.
>
> 1. Phoenix offers the fastest response if used on top of HBase. Unfortunately, Spark 2 with Phoenix is broken.
> 2. Spark on Hive on HBase looks OK. This works fine with Spark 2.
> 3. Spark on the HBase tables directly, using a key-value DF for each column. Not as fast as 2, but it works. I am not sure a DataFrame is a good fit for a plain key-value pair, though.
>
> Now, if I use Zeppelin to read from HBase:
>
> 1. I can use Phoenix JDBC. That looks very fast.
> 2. I can use Spark csv directly on the HDFS csv files.
> 3. I can use Spark on the Hive tables.
>
> If I use Tableau on the HBase data, then only SQL-like access is useful: Phoenix or Hive.
>
> I don't want to change the design now. But admittedly, Hive is the best SQL layer on top of HBase. The next release of Hive is going to have an in-memory caching and execution layer (LLAP), so we can keep Hive tables in memory. That will be faster. Many people underestimate Hive, but I still believe it has a lot to offer besides serious ANSI-compliant SQL.
>
> Regards,
>
> Mich
>
> Dr Mich Talebzadeh
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> On 17 October 2016 at 21:54, Michael Segel <msegel_had...@hotmail.com> wrote:
>
>> Mitch,
>>
>> Short answer… no, it doesn't scale.
>>
>> Longer answer…
>>
>> You are using a UUID as the row key? Why? (My guess is that you want to avoid hot spotting.)
>>
>> So you're going to have to pull in all of the data, meaning a full table scan, and then perform a sort-order transformation, dropping the UUID in the process.
>>
>> You would be better off not using HBase and instead storing the data in Parquet files in a directory partitioned on date. Or rather, the rowkey would be max_ts - TS, so that your data is in LIFO order.
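A minimal sketch of that reverse-timestamp rowkey, assuming event times are epoch milliseconds (the helper name here is made up):

    import org.apache.hadoop.hbase.util.Bytes

    // Subtracting the event time from Long.MaxValue means newer events get
    // lexicographically smaller rowkeys, so a scan returns newest-first (LIFO).
    def reverseTsRowKey(eventTimeMillis: Long): Array[Byte] =
      Bytes.toBytes(Long.MaxValue - eventTimeMillis)

Note that a monotonically decreasing key concentrates writes on one region, which is why this pairs with the bulk-load suggestion below.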
>> Note: I've used the term epoch to describe the max value of a long (8 bytes of 'FF') for max_ts. This isn't a good use of the term epoch, but if anyone has a better term, please let me know.
>>
>> Having said that… if you want to use HBase, you could do the same thing. If you want to avoid hot spotting, you could load the day's transactions using a bulk loader so that you don't have to worry about splits.
>>
>> But that's just my $0.02 worth.
>>
>> HTH
>>
>> -Mike
>>
>> PS. If you wanted to capture the transactions, you could use the following schema:
>>
>> 1) Rowkey = max_ts - TS
>> 2) Rows contain the following:
>> CUSIP (Transaction ID)
>> Party 1 (Seller)
>> Party 2 (Buyer)
>> Symbol
>> Qty
>> Price
>>
>> This is a trade ticket.
>>
>> On Oct 16, 2016, at 1:37 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have trade data stored in an HBase table. Data arrives in csv format to HDFS and is then loaded into HBase via a periodic load with org.apache.hadoop.hbase.mapreduce.ImportTsv.
>>
>> The HBase table has one column family, "price_info", and three columns: ticker, timecreated, price. The RowKey is a UUID. So each row in the csv file has UUID, ticker, timecreated and price.
>>
>> Each row in HBase is a key-value map. In my case, I have one column family and three columns. Without going into semantics, I see HBase as a column-oriented database where column data stays together.
>>
>> So I thought of this way of accessing the data.
>>
>> I define an RDD for each column in the column family, as below. In this case, column price_info:ticker.
>>
>> // Create the base RDD over the HBase table
>> val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
>>   classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
>>   classOf[org.apache.hadoop.hbase.client.Result])
>> // For each row, keep the rowkey and the latest version of price_info:ticker
>> val rdd1 = hBaseRDD.map(tuple => tuple._2).map(result =>
>>   (result.getRow, result.getColumn("price_info".getBytes(), "ticker".getBytes()))).map(row => {
>>   (
>>     row._1.map(_.toChar).mkString,
>>     row._2.asScala.reduceLeft {
>>       (a, b) => if (a.getTimestamp > b.getTimestamp) a else b
>>     }.getValue.map(_.toChar).mkString
>>   )
>> })
>> case class columns(key: String, ticker: String)
>> val dfticker = rdd1.toDF.map(p => columns(p(0).toString, p(1).toString))
>>
>> Note that the end result is a DataFrame with the RowKey -> key and the column -> ticker.
>>
>> I use the same approach to create two other DataFrames, dftimecreated and dfprice, for the two other columns.
>>
>> Note that if I don't need a column, then I do not create a DF for it. So there is one DF per column I use. I am not sure how this compares with reading the full row through other methods, if any.
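As a side note, Result.getColumn is deprecated in later HBase releases. A sketch of the same latest-value extraction with the Cell API, assuming HBase 0.96+ and reusing the hBaseRDD defined above, could look like this:

    import org.apache.hadoop.hbase.CellUtil
    import org.apache.hadoop.hbase.util.Bytes

    // getColumnLatestCell already returns the newest cell version, so the
    // manual reduceLeft over timestamps is unnecessary (null check omitted).
    val latestTicker = hBaseRDD.map(_._2).map { result =>
      val cell = result.getColumnLatestCell(
        Bytes.toBytes("price_info"), Bytes.toBytes("ticker"))
      (Bytes.toString(result.getRow), Bytes.toString(CellUtil.cloneValue(cell)))
    }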
>> Anyway, all I need to do after creating a DataFrame for each column is to join them through the RowKey to slice and dice the data, like below.
>>
>> Get me the latest prices, ordered by timecreated and ticker (the ticker identifies the stock):
>>
>> val rs = dfticker.join(dftimecreated, "key").join(dfprice, "key")
>>   .orderBy('timecreated desc, 'price desc)
>>   .select('timecreated, 'ticker, 'price.cast("Float").as("Latest price"))
>> rs.show(10)
>>
>> +-------------------+------+------------+
>> |        timecreated|ticker|Latest price|
>> +-------------------+------+------------+
>> |2016-10-16T18:44:57|   S16|   97.631966|
>> |2016-10-16T18:44:57|   S13|    92.11406|
>> |2016-10-16T18:44:57|   S19|    85.93021|
>> |2016-10-16T18:44:57|   S09|   85.714645|
>> |2016-10-16T18:44:57|   S15|    82.38932|
>> |2016-10-16T18:44:57|   S17|    80.77747|
>> |2016-10-16T18:44:57|   S06|    79.81854|
>> |2016-10-16T18:44:57|   S18|    74.10128|
>> |2016-10-16T18:44:57|   S07|    66.13622|
>> |2016-10-16T18:44:57|   S20|    60.35727|
>> +-------------------+------+------------+
>> only showing top 10 rows
>>
>> Is this a workable solution?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> http://talebzadehmich.wordpress.com

--
Best Regards,
Ayan Guha