Is this technique similar to what Kinesis is offering or what Structured Streaming is going to have eventually?
Just curious.

Cheers,
Ben

> On Oct 17, 2016, at 10:14 AM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
>
> I would suggest coding your own Spark thriftserver, which seems to be very easy:
> http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server
>
> I am starting to test it. The big advantage is that you can implement any logic, because it's a Spark job, and then start a thrift server on a temporary table. For example, you can query a micro-batch RDD from a Kafka stream, or pre-load some tables and implement a rolling cache to periodically update the Spark in-memory tables from a persistent store...
> It's not part of the public API, and I don't yet know what issues doing this may raise, but I think the Spark community should look at this path: making the thriftserver instantiable in any Spark job.
>
> 2016-10-17 18:17 GMT+02:00 Michael Segel <msegel_had...@hotmail.com>:
> Guys,
> Sorry for jumping in late to the game…
>
> If memory serves (which may not be a good thing…):
>
> You can use HiveServer2 as a connection point to HBase.
> While this doesn't perform well, it's probably the cleanest solution.
> I'm not keen on Phoenix… wouldn't recommend it….
>
> The issue is that you're trying to make HBase, a key/value object store, into a relational engine… it's not.
>
> There are some considerations which make HBase not ideal for all use cases, and you may find better performance with Parquet files.
>
> One thing missing is the secondary indexing and query optimizations that you have in RDBMSs and that are lacking in HBase / MapR-DB / etc.… so your performance will vary.
>
> With respect to Tableau… their entire interface into the big data world revolves around the JDBC/ODBC interface.
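[Editor's note] Vincent's "thriftserver inside a Spark job" idea can be sketched as below, assuming Spark 2.x with the `spark-hive-thriftserver` module on the classpath. As he notes, `HiveThriftServer2.startWithContext` is not part of the public API, so this may break between Spark versions; the input path and table name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object EmbeddedThriftServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("embedded-thriftserver")
      .enableHiveSupport()
      .getOrCreate()

    // Any job logic can run first, e.g. pre-loading and caching a table
    // (the "rolling cache" idea). Path and table name are hypothetical.
    val df = spark.read.parquet("/data/some_table")
    df.createOrReplaceTempView("some_table")
    spark.catalog.cacheTable("some_table")

    // Expose this session's temporary tables over JDBC/ODBC.
    // Not a public API: behavior may differ across Spark versions.
    HiveThriftServer2.startWithContext(spark.sqlContext)
  }
}
```

Because the server shares the job's session, JDBC clients see exactly the temporary tables the job registered, which is what makes the micro-batch / rolling-cache pattern possible.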
> So if you don't have that piece as part of your solution, you're DOA with respect to Tableau.
>
> Have you considered Drill as your JDBC connection point? (YAAP: Yet Another Apache Project)
>
>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>> Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don't need to be real-time, and they won't be directly feeding off of the main DMP HBase data. Instead, it'll be batched to Parquet or Kudu/Impala or even PostgreSQL.
>>
>> I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use case and architecture, the ID generation needs to happen local to the ad servers, where we generate a unique ID and store it in an ID linking table. Even better, many of the 3rd-party services supply this ID. So, data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. The same goes for the REST endpoints: 3rd-party services will hit ours to update our data, with no need to read from our data. And when we want to update their data, we will hit theirs using a triggered job.
>>
>> This all boils down to just integrating with Kafka.
>>
>> Once again, thanks for all the help.
>>
>> Cheers,
>> Ben
>>
>>> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>> Please also keep in mind that Tableau Server has the capability to store data in-memory and refresh the in-memory data only when needed. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.
>>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich already provided a good alternative. However, you should check whether it contains a recent version of HBase and Phoenix. That being said, I just wonder what the dataflow, data model, and planned analysis are. Maybe completely different solutions are possible. In particular, these single inserts, upserts, etc. should be avoided as much as possible in the big data (analysis) world with any technology, because they do not perform well.
>>>
>>> Hive with LLAP will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using the Ignite HDFS in-memory solution. All this only makes sense if you do not use MR as the engine, and use the right input format (ORC, Parquet) and a recent Hive version.
>>>
>>> On 8 Oct 2016, at 21:55, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>>> Mich,
>>>>
>>>> Unfortunately, we are moving away from Hive and unifying on Spark, using CDH 5.8 as our distro. And Tableau released a Spark ODBC/JDBC driver too. I will either try the Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.
>>>>
>>>> Thanks for the advice.
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Sure. But essentially you are looking at batch data for analytics for your Tableau users, so Hive may be a better choice with its rich SQL and an ODBC/JDBC connection to Tableau already available.
>>>>> I would go for Hive, especially as the new release will have an in-memory offering as well for frequently accessed data :)
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>
>>>>> On 8 October 2016 at 20:15, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>> Mich,
>>>>>
>>>>> First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data center. When not colocated in the same data center, they will connect to a colocated database server using JDBC. Either way, using JDBC everywhere simplifies and unifies the code on the JDBC industry standard.
>>>>>
>>>>> Does this make sense?
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> Like any other design: what are your presentation layer and end users?
>>>>>>
>>>>>> Are they SQL-centric users from a Tableau background, or might they use Spark functional programming?
>>>>>>
>>>>>> It is best to describe the use case.
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>> On 8 October 2016 at 19:40, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>> I wouldn't be too surprised if Spark SQL -> JDBC data source -> Phoenix JDBC server -> HBase would work better.
>>>>>>
>>>>>> Without naming specifics, there are at least 4 or 5 different implementations of HBase sources, each at a varying level of development and with different requirements (HBase release version, Kerberos support, etc.)
>>>>>>
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>> Cc: <user@spark.apache.org>, Felix Cheung <felixcheun...@hotmail.com>
>>>>>>
>>>>>> Mich,
>>>>>>
>>>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> I don't think it will work.
>>>>>>
>>>>>> You can use Phoenix on top of HBase:
>>>>>>
>>>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>>>> ROW            COLUMN+CELL
>>>>>> TSCO-1-Apr-08  column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>>> TSCO-1-Apr-08  column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>>> TSCO-1-Apr-08  column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>>> TSCO-1-Apr-08  column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>>> TSCO-1-Apr-08  column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>>> TSCO-1-Apr-08  column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>>> TSCO-1-Apr-08  column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>>> TSCO-1-Apr-08  column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>>>
>>>>>> And the same on Phoenix on top of the HBase table:
>>>>>>
>>>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate,
>>>>>>   "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open",
>>>>>>   "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice"
>>>>>> from "tsco"
>>>>>> where to_number("volume") > 0 and "high" != '-'
>>>>>>   and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
>>>>>> order by to_date("Date",'dd-MMM-yy') limit 1;
>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>> | TRADEDATE   | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  | volume    | AverageDailyPrice  |
>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>> On 8 October 2016 at 19:05, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>> Great, then I think those packages as Spark data sources should allow you to do exactly that (replace org.apache.spark.sql.jdbc with the HBase one).
>>>>>>
>>>>>> I do think it would be great to get more examples around this, though. It would be great if you could share your experience with this!
>>>>>>
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>>> Cc: <user@spark.apache.org>
>>>>>>
>>>>>> Felix,
>>>>>>
>>>>>> My goal is to use the Spark SQL JDBC Thriftserver to access HBase tables using just SQL.
>>>>>> I have been able to CREATE tables using this statement below in the past:
>>>>>>
>>>>>> CREATE TABLE <table-name>
>>>>>> USING org.apache.spark.sql.jdbc
>>>>>> OPTIONS (
>>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>>>>>   dbtable "dim.dimension_acamp"
>>>>>> );
>>>>>>
>>>>>> After doing this, I can access the PostgreSQL table through the Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>
>>>>>> Ben,
>>>>>>
>>>>>> I'm not sure I'm following completely.
>>>>>>
>>>>>> Is your goal to use Spark to create or access tables in HBase? If so, the link below and several packages out there support that by providing an HBase data source for Spark. There are some examples of what the Spark code looks like in that link as well. On that note, you should also be able to use the HBase data source from a pure SQL (Spark SQL) query, which should work in the case of the Spark SQL JDBC Thrift Server (with USING: http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).
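[Editor's note] Swapping the JDBC data source for an HBase one in the same CREATE TABLE ... USING pattern might look like the sketch below. This assumes the hbase-spark module's data source; the class name (`org.apache.hadoop.hbase.spark`) and option keys (`hbase.table`, `hbase.columns.mapping`) follow the HBase reference guide, but other HBase connectors use entirely different names, and the table/columns here are borrowed from Mich's `tsco` example for illustration only.

```scala
// Register an HBase-backed table in Spark SQL, then query it with plain SQL
// (and, through the Thriftserver, over JDBC). Connector class and options
// are assumptions based on the hbase-spark module.
spark.sql(
  """CREATE TABLE tsco_sql
    |USING org.apache.hadoop.hbase.spark
    |OPTIONS (
    |  'hbase.table' 'tsco',
    |  'hbase.columns.mapping'
    |    'KEY_FIELD STRING :key,
    |     close STRING stock_daily:close,
    |     high STRING stock_daily:high'
    |)""".stripMargin)

spark.sql("SELECT KEY_FIELD, close FROM tsco_sql").show()
```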
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>>> Cc: <user@spark.apache.org>
>>>>>>
>>>>>> Felix,
>>>>>>
>>>>>> The only alternative way is to create a stored procedure (a UDF, in database terms) that would run Spark Scala code underneath. That way, I could use the Spark SQL JDBC Thriftserver to execute it from SQL, passing the keys and values I want to UPSERT. I wonder if this is possible, since I cannot CREATE a wrapper table on top of an HBase table in Spark SQL.
>>>>>>
>>>>>> What do you think? Is this the right approach?
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>
>>>>>> HBase has released support for Spark:
>>>>>> hbase.apache.org/book.html#spark
>>>>>>
>>>>>> And if you search, you should find several alternative approaches.
>>>>>>
>>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>>>
>>>>>> Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to set up a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load-balance the thriftservers.
>>>>>> In addition, this will benefit us by giving us a way to abstract the data storage layer away from the presentation-layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
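[Editor's note] The client side that Ben describes (REST endpoints and web farms talking to the Thriftserver) is plain JDBC against the HiveServer2 protocol, which the Spark SQL Thriftserver speaks. A minimal sketch, assuming the Hive JDBC driver is on the classpath; host, port, credentials, and table name are hypothetical:

```scala
import java.sql.DriverManager

object ThriftClient {
  def main(args: Array[String]): Unit = {
    // The Spark SQL Thriftserver listens on HiveServer2's port (10000 by
    // default), so the standard hive2 JDBC URL applies.
    val conn = DriverManager.getConnection(
      "jdbc:hive2://thrift-host:10000/default", "user", "")
    try {
      val rs = conn.createStatement()
        .executeQuery("SELECT ticker, close FROM some_table LIMIT 10")
      while (rs.next())
        println(s"${rs.getString(1)} ${rs.getString(2)}")
    } finally conn.close()
  }
}
```

Because every endpoint speaks the same JDBC interface, multiple Thriftserver instances can sit behind an ordinary TCP load balancer, which is the load-balancing setup Ben mentions.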