An OLTP use case scenario does not necessarily mean traditional OLTP. See also Apache HAWQ etc.; they can indeed fit some use cases well, and others less so.
> On 17 Oct 2016, at 23:02, Michael Segel <msegel_had...@hotmail.com> wrote:
>
> You really don’t want to do OLTP on a distributed NoSQL engine. Remember Big Data isn’t relational; it’s more of a hierarchical or record model. Think IMS or Pick (Dick Pick’s Revelation, U2, Universe, etc.).
>
>> On Oct 17, 2016, at 3:45 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> It has some implications because it imposes the SQL model on HBase. Internally it translates the SQL queries into custom HBase coprocessors. Keep in mind also why HBase needs a proper key design and how Phoenix designs those keys to get the best performance out of it. I think for OLTP it is a workable model, and I think they plan to offer Phoenix as a default interface as part of HBase anyway.
>> For OLAP it depends.
>>
>> On 17 Oct 2016, at 22:34, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Any reason not to recommend Phoenix? I haven't used it myself, so I am curious about the pros and cons of using it.
>>>
>>>> On 18 Oct 2016 03:17, "Michael Segel" <msegel_had...@hotmail.com> wrote:
>>>> Guys,
>>>> Sorry for jumping in late to the game…
>>>>
>>>> If memory serves (which may not be a good thing…):
>>>>
>>>> You can use HiveServer2 as a connection point to HBase. While this doesn’t perform well, it’s probably the cleanest solution. I’m not keen on Phoenix… wouldn’t recommend it…
>>>>
>>>> The issue is that you’re trying to make HBase, a key/value object store, a relational engine… it’s not.
>>>>
>>>> There are some considerations which make HBase not ideal for all use cases, and you may find better performance with Parquet files.
>>>>
>>>> One thing missing is the secondary indexing and query optimizations that you have in RDBMSs and that are lacking in HBase / MapR-DB / etc., so your performance will vary.
>>>>
>>>> With respect to Tableau… their entire interface into the big data world revolves around the JDBC/ODBC interface. So if you don’t have that piece as part of your solution, you’re DOA with respect to Tableau.
>>>>
>>>> Have you considered Drill as your JDBC connection point? (YAAP: Yet Another Apache Project)
>>>>
>>>>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>
>>>>> Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don’t need to be real-time, and they won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.
>>>>>
>>>>> I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use case and architecture, the ID generation needs to happen local to the ad servers, where we generate a unique ID and store it in an ID linking table. Even better, many of the 3rd-party services supply this ID. So, data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. The same goes for the REST endpoints: 3rd-party services will hit ours to update our data, with no need to read from our data. And when we want to update their data, we will hit theirs using a triggered job.
>>>>>
>>>>> This all boils down to just integrating with Kafka.
>>>>>
>>>>> Once again, thanks for all the help.
>>>>>
>>>>> Cheers,
>>>>> Ben
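[To make Jörn's point about key design concrete, here is a minimal Phoenix DDL sketch; the table and column names are made up. The composite PRIMARY KEY becomes the HBase row key, so the leading column should match the dominant access pattern, and salting pre-splits the table to avoid write hotspotting.]

-- Hypothetical table: the PRIMARY KEY columns are concatenated into the
-- HBase row key, so lead with the column you filter on most often.
CREATE TABLE IF NOT EXISTS web_stat (
    host        VARCHAR NOT NULL,   -- leading key column: point lookups by host
    event_date  DATE    NOT NULL,   -- second key column: range scans by date
    usage_core  BIGINT,
    usage_db    BIGINT
    CONSTRAINT pk PRIMARY KEY (host, event_date)
) SALT_BUCKETS = 8;  -- spreads monotonically increasing keys across regions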
>>>>>> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>
>>>>>> Please also keep in mind that Tableau Server has the capability to store data in-memory and refresh the in-memory data only when needed. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.
>>>>>>
>>>>>>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich already provided a good alternative. However, you should check whether it contains a recent version of HBase and Phoenix. That being said, I just wonder what your dataflow, data model and planned analysis look like. Maybe completely different solutions are possible. In particular, these single inserts, upserts etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well.
>>>>>>>
>>>>>>> Hive with LLAP will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using the Ignite HDFS in-memory solution. All of this only makes sense if you do not use MR as the engine, and you use the right input format (ORC, Parquet) and a recent Hive version.
>>>>>>>
>>>>>>> On 8 Oct 2016, at 21:55, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Mich,
>>>>>>>>
>>>>>>>> Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 as our distro. And Tableau released a Spark ODBC/JDBC driver too. I will either try the Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.
>>>>>>>>
>>>>>>>> Thanks for the advice.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Ben
>>>>>>>>
>>>>>>>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Sure. But essentially you are looking at batch data for analytics for your Tableau users, so Hive may be a better choice with its rich SQL and existing ODBC/JDBC connection to Tableau.
>>>>>>>>>
>>>>>>>>> I would go for Hive, especially as the new release will have an in-memory offering as well for frequently accessed data :)
>>>>>>>>>
>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>
>>>>>>>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>
>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>
>>>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
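[A hedged illustration of the batch-oriented pattern Jörn describes above, i.e. bulk loads into an ORC table instead of many single-row inserts or upserts; all table and column names here are illustrative.]

-- Illustrative HiveQL: land data in bulk, one big write per load cycle.
CREATE TABLE IF NOT EXISTS events_orc (
    user_id   STRING,
    event_ts  TIMESTAMP,
    payload   STRING
)
STORED AS ORC;

-- One large batch instead of many single-row INSERTs/UPSERTs.
INSERT OVERWRITE TABLE events_orc
SELECT user_id, event_ts, payload
FROM staging_events;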
>>>>>>>>>> On 8 October 2016 at 20:15, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>> Mich,
>>>>>>>>>>
>>>>>>>>>> First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data center. When not colocated in the same data center, they will connect to a database server located there using JDBC. Either way, using JDBC everywhere simplifies and unifies the code on the JDBC industry standard.
>>>>>>>>>>
>>>>>>>>>> Does this make sense?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ben
>>>>>>>>>>
>>>>>>>>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Like any other design: what are your presentation layer and end users?
>>>>>>>>>>>
>>>>>>>>>>> Are they SQL-centric users from a Tableau background, or might they use Spark functional programming?
>>>>>>>>>>>
>>>>>>>>>>> It is best to describe the use case.
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>>>
>>>>>>>>>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>>>
>>>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>>>
>>>>>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>>>>>>>
>>>>>>>>>>>> On 8 October 2016 at 19:40, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>>>>>>> I wouldn't be too surprised if Spark SQL -> JDBC data source -> Phoenix JDBC server -> HBase would work better.
>>>>>>>>>>>>
>>>>>>>>>>>> Without naming specifics, there are at least 4 or 5 different implementations of HBase sources, each at a varying level of development and with different requirements (HBase release version, Kerberos support, etc.).
>>>>>>>>>>>>
>>>>>>>>>>>> _____________________________
>>>>>>>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>>>>>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>>>>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>>>>>>> To: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>>>>>>>> Cc: <user@spark.apache.org>, Felix Cheung <felixcheun...@hotmail.com>
>>>>>>>>>>>>
>>>>>>>>>>>> Mich,
>>>>>>>>>>>>
>>>>>>>>>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Ben
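[One possible way to wire up the chain Felix sketches above, Spark SQL -> JDBC data source -> Phoenix Query Server -> HBase, assuming the Phoenix thin-client jar is on the Thriftserver classpath. Host, port and table name below are placeholders, not tested settings.]

-- Sketch: reuse the generic JDBC data source, pointed at the Phoenix
-- Query Server's thin driver instead of PostgreSQL. Note that Phoenix
-- upper-cases unquoted identifiers, so dbtable must match the Phoenix
-- table name exactly (lower-case names need embedded double quotes).
CREATE TABLE tsco_via_phoenix
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:phoenix:thin:url=http://<pqs-host>:8765;serialization=PROTOBUF",
  driver "org.apache.phoenix.queryserver.client.Driver",
  dbtable "TSCO"
);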
>>>>>>>>>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think it will work.
>>>>>>>>>>>>
>>>>>>>>>>>> You can use Phoenix on top of HBase:
>>>>>>>>>>>>
>>>>>>>>>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>>>>>>>>>> ROW            COLUMN+CELL
>>>>>>>>>>>> TSCO-1-Apr-08  column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>>>>>>>>> TSCO-1-Apr-08  column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>>>>>>>>> TSCO-1-Apr-08  column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>>>>>>>>> TSCO-1-Apr-08  column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>>>>>>>>> TSCO-1-Apr-08  column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>>>>>>>>> TSCO-1-Apr-08  column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>>>>>>>>> TSCO-1-Apr-08  column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>>>>>>>>> TSCO-1-Apr-08  column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>>>>>>>>>
>>>>>>>>>>>> And the same via Phoenix on top of the HBase table:
>>>>>>>>>>>>
>>>>>>>>>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by to_date("Date",'dd-MMM-yy') limit 1;
>>>>>>>>>>>>
>>>>>>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>>>>>>> | TRADEDATE   | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  | volume    | AverageDailyPrice  |
>>>>>>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>>>>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>>>>>>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>>>>>>>
>>>>>>>>>>>> HTH
>>>>>>>>>>>>
>>>>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>>>>
>>>>>>>>>>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>>>>
>>>>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>>>>
>>>>>>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>>>>>>>>
>>>>>>>>>>>>> On 8 October 2016 at 19:05, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>>>>>>>> Great, then I think those packages as Spark data sources should allow you to do exactly that (replace org.apache.spark.sql.jdbc with the HBase one).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I do think it would be great to get more examples around this, though. Would be great if you could share your experience with this!
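[For reference, a sketch of how a Phoenix view could be mapped onto the existing 'tsco' HBase table from Mich's scan above. The VARCHAR types are an assumption: HBase stores raw bytes, which is why the query converts with to_number() and to_date().]

-- Maps the scanned row key and stock_daily column family into Phoenix.
CREATE VIEW "tsco" (
    pk VARCHAR PRIMARY KEY,        -- the HBase row key, e.g. TSCO-1-Apr-08
    "stock_daily"."Date"   VARCHAR,
    "stock_daily"."close"  VARCHAR,
    "stock_daily"."high"   VARCHAR,
    "stock_daily"."low"    VARCHAR,
    "stock_daily"."open"   VARCHAR,
    "stock_daily"."stock"  VARCHAR,
    "stock_daily"."ticker" VARCHAR,
    "stock_daily"."volume" VARCHAR
);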
>>>>>>>>>>>>> _____________________________
>>>>>>>>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>>>>>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>>>>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>>>>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>>>>>>>>>> Cc: <user@spark.apache.org>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Felix,
>>>>>>>>>>>>>
>>>>>>>>>>>>> My goal is to use the Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:
>>>>>>>>>>>>>
>>>>>>>>>>>>> CREATE TABLE <table-name>
>>>>>>>>>>>>> USING org.apache.spark.sql.jdbc
>>>>>>>>>>>>> OPTIONS (
>>>>>>>>>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>>>>>>>>>>>>   dbtable "dim.dimension_acamp"
>>>>>>>>>>>>> );
>>>>>>>>>>>>>
>>>>>>>>>>>>> After doing this, I can access the PostgreSQL table through the Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ben,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure I'm following completely. Is your goal to use Spark to create or access tables in HBase? If so, the link below and several packages out there support that by providing an HBase data source for Spark. There are some examples of what the Spark code looks like in that link as well. On that note, you should also be able to use the HBase data source from a pure SQL (Spark SQL) query, which should work in the case of the Spark SQL JDBC Thrift Server (with USING, http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).
>>>>>>>>>>>>>
>>>>>>>>>>>>> _____________________________
>>>>>>>>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>>>>>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>>>>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>>>>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>>>>>>>>>> Cc: <user@spark.apache.org>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Felix,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The only alternative way is to create a stored procedure (UDF), in database terms, that would run Spark Scala code underneath. That way, I could use the Spark SQL JDBC Thriftserver to execute it from SQL, passing the key/values I want to UPSERT. I wonder if this is possible, since I cannot CREATE a wrapper table on top of an HBase table in Spark SQL.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What do you think? Is this the right approach?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> HBase has released support for Spark: hbase.apache.org/book.html#spark
>>>>>>>>>>>>>
>>>>>>>>>>>>> And if you search, you should find several alternative approaches.
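[Following Felix's pointer, a hedged sketch of the same CREATE TABLE ... USING pattern with an HBase connector in place of JDBC. The data source class and option names vary between connectors and versions; the ones below follow the hbase-spark module's mapping style and should be treated as placeholders to verify against the connector actually deployed.]

-- Hypothetical: verify format and option keys against your connector docs.
CREATE TABLE tsco_hbase
USING org.apache.hadoop.hbase.spark
OPTIONS (
  hbase.table "tsco",
  hbase.columns.mapping "rowkey STRING :key, close STRING stock_daily:close, volume STRING stock_daily:volume"
);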
>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to set up a way to GET and POST data to and from the HBase table using the Spark SQL JDBC Thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load-balance the Thriftservers. In addition, this will benefit us by giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Ben
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
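[For completeness, the Hive-on-HBase mapping Ben refers to at the start of the thread looks roughly like this; the Hive table and column names are illustrative, and the storage handler ships with Hive.]

-- Illustrative HiveQL: an external Hive table backed by the HBase 'tsco' table.
CREATE EXTERNAL TABLE tsco_hive (
    rowkey STRING,   -- bound to the HBase row key via :key below
    close  STRING,
    volume STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
    "hbase.columns.mapping" = ":key,stock_daily:close,stock_daily:volume"
)
TBLPROPERTIES ("hbase.table.name" = "tsco");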