Hi,

Any reason not to recommend Phoenix? I haven't used it myself, so I'm curious about the pros and cons of using it.

On 18 Oct 2016 03:17, "Michael Segel" <msegel_had...@hotmail.com> wrote:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…):

You can use HiveServer2 as a connection point to HBase. While this doesn't perform well, it's probably the cleanest solution. I'm not keen on Phoenix… wouldn't recommend it….

The issue is that you're trying to make HBase, a key/value object store, into a relational engine… it's not.

There are some considerations which make HBase not ideal for all use cases, and you may find better performance with Parquet files.

One thing missing is the secondary indexing and query optimizations that you have in RDBMSs and that are lacking in HBase / MapR-DB / etc.… so your performance will vary.

With respect to Tableau… their entire interface into the big data world revolves around the JDBC/ODBC interface. So if you don't have that piece as part of your solution, you're DOA with respect to Tableau.

Have you considered Drill as your JDBC connection point? (YAAP: Yet Another Apache Project)

On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don't need to be real-time, and they won't be directly feeding off of the main DMP HBase data. Instead, it'll be batched to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use case and architecture, the ID generation needs to happen local to the ad servers, where we generate a unique ID and store it in an ID linking table. Even better, many of the 3rd-party services supply this ID. So data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. The same goes for the REST endpoints: 3rd-party services will hit ours to update our data, with no need to read from our data. And when we want to update their data, we will hit theirs using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben

On Oct 9, 2016, at 3:16 AM, Jörn Franke <jornfra...@gmail.com> wrote:

Please also keep in mind that Tableau Server has the capability to store data in-memory and refresh the in-memory data only when needed. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfra...@gmail.com> wrote:

Cloudera 5.8 has a very old version of Hive without Tez, but Mich already provided a good alternative. However, you should check whether it contains a recent version of HBase and Phoenix. That being said, I just wonder what the dataflow, the data model and the analysis you plan to do look like. Maybe completely different solutions are possible. Especially single inserts, upserts, etc. should be avoided as much as possible in the big data (analysis) world with any technology, because they do not perform well.

Hive with LLAP will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using the Ignite HDFS in-memory solution. All of this only makes sense if you do not use MR as the engine, and use the right input format (ORC, Parquet) and a recent Hive version.
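As an illustration of Jörn's point about input formats and batch loading, a minimal sketch of the kind of DDL he is describing (table and column names here are hypothetical; it assumes a recent Hive running on Tez or LLAP rather than MR):

-- Hypothetical analytics table stored as ORC so Hive (and LLAP's cache)
-- can scan it efficiently.
CREATE TABLE ad_events (
  event_time TIMESTAMP,
  user_id    STRING,
  campaign   STRING,
  clicks     BIGINT
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Load in bulk from a staging table instead of single-row inserts,
-- which is exactly the anti-pattern Jörn warns against.
INSERT INTO TABLE ad_events
SELECT event_time, user_id, campaign, clicks
FROM staging_ad_events;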
On 8 Oct 2016, at 21:55, Benjamin Kim <bbuil...@gmail.com> wrote:

Mich,

Unfortunately, we are moving away from Hive and unifying on Spark, using CDH 5.8 as our distro. And Tableau released a Spark ODBC/JDBC driver too. I will either try the Phoenix JDBC server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.

Thanks for the advice.

Cheers,
Ben

On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Sure. But essentially you are looking at batch data for analytics for your Tableau users, so Hive may be a better choice, with its rich SQL and an ODBC/JDBC connection to Tableau already.

I would go for Hive, especially as the new release will have an in-memory offering as well for frequently accessed data :)

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On 8 October 2016 at 20:15, Benjamin Kim <bbuil...@gmail.com> wrote:

Mich,

First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data center. When not colocated in the same data center, they will connect to a local database server using JDBC. Either way, using JDBC everywhere simplifies and unifies the code on the JDBC industry standard.

Does this make sense?

Thanks,
Ben

On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Like any other design: what are your presentation layer and end users?

Are they SQL-centric users from a Tableau background, or might they use Spark functional programming?

It is best to describe the use case.

HTH

Dr Mich Talebzadeh

On 8 October 2016 at 19:40, Felix Cheung <felixcheun...@hotmail.com> wrote:

I wouldn't be too surprised if Spark SQL - JDBC data source - Phoenix JDBC server - HBase worked better.

Without naming specifics, there are at least 4 or 5 different implementations of HBase sources, each at a varying level of development and with different requirements (HBase release version, Kerberos support, etc.).
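A sketch of that chain, exposing a Phoenix table through Spark SQL's generic JDBC source so the Thrift Server can serve it over plain SQL (the host and port follow Mich's Phoenix Query Server example later in the thread; the thin-client driver class, and how well Spark's generic JDBC dialect cooperates with Phoenix, are assumptions to verify, and reads are the safe bet here rather than writes):

CREATE TABLE tsco
USING org.apache.spark.sql.jdbc
OPTIONS (
  -- Phoenix Query Server (thin) endpoint, as in Mich's example below
  url "jdbc:phoenix:thin:url=http://rhes564:8765",
  -- Thin-client driver class; assumed to be on the Thrift Server classpath
  driver "org.apache.phoenix.queryserver.client.Driver",
  -- Phoenix table names are case-sensitive, hence the embedded quotes
  dbtable '"tsco"'
);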
_____________________________
From: Benjamin Kim <bbuil...@gmail.com>
Sent: Saturday, October 8, 2016 11:26 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Mich Talebzadeh <mich.talebza...@gmail.com>
Cc: <user@spark.apache.org>, Felix Cheung <felixcheun...@hotmail.com>

Mich,

Are you talking about the Phoenix JDBC server? If so, I forgot about that alternative.

Thanks,
Ben

On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

I don't think it will work.

You can use Phoenix on top of HBase:

hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
ROW             COLUMN+CELL
 TSCO-1-Apr-08  column=stock_daily:Date,   timestamp=1475866783376, value=1-Apr-08
 TSCO-1-Apr-08  column=stock_daily:close,  timestamp=1475866783376, value=405.25
 TSCO-1-Apr-08  column=stock_daily:high,   timestamp=1475866783376, value=406.75
 TSCO-1-Apr-08  column=stock_daily:low,    timestamp=1475866783376, value=379.25
 TSCO-1-Apr-08  column=stock_daily:open,   timestamp=1475866783376, value=380.00
 TSCO-1-Apr-08  column=stock_daily:stock,  timestamp=1475866783376, value=TESCO PLC
 TSCO-1-Apr-08  column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
 TSCO-1-Apr-08  column=stock_daily:volume, timestamp=1475866783376, value=49664486

And the same in Phoenix on top of the HBase table:

0: jdbc:phoenix:thin:url=http://rhes564:8765> select
substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate,
"close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low",
"open" AS "Day's Open", "ticker", "volume",
(to_number("low")+to_number("high"))/2 AS "AverageDailyPrice"
from "tsco"
where to_number("volume") > 0 and "high" != '-'
and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
order by to_date("Date",'dd-MMM-yy') limit 1;
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
|  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
| 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |

HTH

Dr Mich Talebzadeh
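For context, a Phoenix table like "tsco" is typically mapped onto an existing HBase table with a view along these lines (a sketch; the primary-key column name is hypothetical, and everything is read as VARCHAR, which is why Mich's query has to cast with to_number and to_date):

CREATE VIEW "tsco" (
  pk VARCHAR PRIMARY KEY,          -- maps to the HBase row key, e.g. TSCO-1-Apr-08
  "stock_daily"."Date"   VARCHAR,
  "stock_daily"."close"  VARCHAR,
  "stock_daily"."high"   VARCHAR,
  "stock_daily"."low"    VARCHAR,
  "stock_daily"."open"   VARCHAR,
  "stock_daily"."stock"  VARCHAR,
  "stock_daily"."ticker" VARCHAR,
  "stock_daily"."volume" VARCHAR
);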
On 8 October 2016 at 19:05, Felix Cheung <felixcheun...@hotmail.com> wrote:

Great, then I think those packages as a Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with an HBase one).

I do think it would be great to get more examples around this, though. It would be great if you could share your experience with this!

_____________________________
From: Benjamin Kim <bbuil...@gmail.com>
Sent: Saturday, October 8, 2016 11:00 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: <user@spark.apache.org>

Felix,

My goal is to use the Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:

CREATE TABLE <table-name>
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
  dbtable "dim.dimension_acamp"
);

After doing this, I can access the PostgreSQL table through the Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.

Thanks,
Ben

On Oct 8, 2016, at 10:53 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:

Ben,

I'm not sure I'm following completely.

Is your goal to use Spark to create or access tables in HBase? If so, the link below and several packages out there support that by providing an HBase data source for Spark. There are some examples of what the Spark code looks like in that link as well. On that note, you should also be able to use the HBase data source from a pure SQL (Spark SQL) query, which should work in the case of the Spark SQL JDBC Thrift Server (with USING, http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).

_____________________________
From: Benjamin Kim <bbuil...@gmail.com>
Sent: Saturday, October 8, 2016 10:40 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: <user@spark.apache.org>

Felix,

The only alternative way is to create a stored procedure (UDF), in database terms, that would run Spark Scala code underneath. That way, I could use the Spark SQL JDBC Thriftserver to execute it from SQL, passing the keys and values I want to UPSERT. I wonder if this is possible, since I cannot CREATE a wrapper table on top of an HBase table in Spark SQL?

What do you think? Is this the right approach?

Thanks,
Ben

On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:

HBase has released support for Spark: hbase.apache.org/book.html#spark

And if you search, you should find several alternative approaches.
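As a sketch of what "replace org.apache.spark.sql.jdbc with an HBase one" could look like with one of those packages (here the Hortonworks shc-core data source; the package must be on the Thrift Server's classpath, the table and columns below are hypothetical, and whether every such source works through pure SQL USING is worth verifying):

CREATE TABLE dim_acamp
USING org.apache.spark.sql.execution.datasources.hbase
OPTIONS (
  -- shc describes the HBase mapping with a JSON catalog instead of a JDBC URL:
  -- row key plus column family:qualifier for each column
  catalog '{"table":{"namespace":"default","name":"dim_acamp"},"rowkey":"key","columns":{"key":{"cf":"rowkey","col":"key","type":"string"},"name":{"cf":"d","col":"name","type":"string"}}}'
);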
On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuil...@gmail.com> wrote:

Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to set up a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load-balance the thriftservers. In addition, this will benefit us by giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.

Thanks,
Ben
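For reference, the Hive-on-HBase mapping Ben describes looks roughly like this (a sketch against Mich's 'tsco' table with the column list abbreviated and Hive types assumed; a HiveContext can then query the table, though as noted above this path proved too slow for their use case):

CREATE EXTERNAL TABLE tsco_hive (
  rowkey      STRING,
  close_price STRING,
  high_price  STRING,
  volume      STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  -- :key is the HBase row key; the rest map to column family:qualifier
  'hbase.columns.mapping' = ':key,stock_daily:close,stock_daily:high,stock_daily:volume'
)
TBLPROPERTIES ('hbase.table.name' = 'tsco');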