This will give me an opportunity to start using Structured Streaming. Then, I can try adding more functionality. If all goes well, we could transition off HBase to a more in-memory data solution that can "spill over" data for us.
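For reference, a minimal sketch of that first step: a Structured Streaming query landing in Spark's in-memory sink, which a thriftserver started in the same session could then expose over JDBC. This assumes Spark 2.x with the spark-sql-kafka-0-10 source available; the broker, topic, and table names are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("streaming-cache").getOrCreate()

// Read a Kafka topic as a streaming DataFrame (placeholder broker and topic).
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "dmp-events")
  .load()
  .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

// The memory sink keeps results in an in-memory table named after the query,
// here queryable as `dmp_events` from the same SparkSession.
val query = events.writeStream
  .format("memory")
  .queryName("dmp_events")
  .outputMode("append")
  .start()

The memory sink is meant for modest result sizes, which is where a solution that can spill over data, as mentioned above, would eventually come in.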
> On Oct 17, 2016, at 11:53 AM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
>
> Instead of (or in addition to) saving results somewhere, you just start a thriftserver that exposes the Spark tables of the SQLContext (or SparkSession now). That means you can implement any logic (and maybe use Structured Streaming) to expose your data. Today, using the thriftserver means reading data from the persistent store on every query, so if the data modeling doesn't fit the query, it can take quite long. What you generally do in a common Spark job is load the data and cache it as a Spark table in an in-memory columnar format, which is quite efficient for any kind of query; the counterpart is that the cache isn't updated, so you have to implement a reload mechanism, and that isn't available with the stock thriftserver.
>
> What I propose is to mix the two worlds: periodically/delta-load data into the Spark table cache and expose it through the thriftserver. But you have to implement the loading logic; it can be very simple to very complex depending on your needs.
>
> 2016-10-17 19:48 GMT+02:00 Benjamin Kim <bbuil...@gmail.com>:
> Is this technique similar to what Kinesis is offering, or what Structured Streaming is going to have eventually?
>
> Just curious.
>
> Cheers,
> Ben
>
>> On Oct 17, 2016, at 10:14 AM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
>>
>> I would suggest coding your own Spark thriftserver, which seems to be very easy.
>> http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server
>>
>> I am starting to test it. The big advantage is that you can implement any logic, because it's a Spark job, and then start a thrift server on the temporary tables. For example, you can query a micro-batch RDD from a Kafka stream, or pre-load some tables and implement a rolling cache that periodically updates the Spark in-memory tables from the persistent store...
>> It's not part of the public API, and I don't know yet what the issues with doing this are, but I think the Spark community should look at this path: making the thriftserver instantiable in any Spark job.
>>
>> 2016-10-17 18:17 GMT+02:00 Michael Segel <msegel_had...@hotmail.com>:
>> Guys,
>> Sorry for jumping in late to the game…
>>
>> If memory serves (which may not be a good thing…):
>>
>> You can use HiveServer2 as a connection point to HBase. While this doesn't perform well, it's probably the cleanest solution. I'm not keen on Phoenix… wouldn't recommend it….
>>
>> The issue is that you're trying to make HBase, a key/value object store, into a relational engine… it's not.
>>
>> There are some considerations which make HBase not ideal for all use cases, and you may find better performance with Parquet files.
>>
>> One thing missing is the secondary indexing and query optimizations that you have in RDBMSs and that are lacking in HBase / MapR-DB / etc.… so your performance will vary.
>>
>> With respect to Tableau… their entire interface into the big data world revolves around the JDBC/ODBC interface. So if you don't have that piece as part of your solution, you're DOA with respect to Tableau.
>>
>> Have you considered Drill as your JDBC connection point? (YAAP: yet another Apache project)
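As a concrete sketch of the pattern Vincent describes, the thriftserver can be started from inside a Spark job with HiveThriftServer2.startWithContext, serving a cached table that the job reloads on a schedule. This assumes Spark 2.x built with Hive support; the source path, table name, and refresh interval are made up.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object CachedThriftServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cached-thriftserver")
      .enableHiveSupport()
      .getOrCreate()

    // Load (or reload) the data and pin it in the in-memory columnar cache.
    def reload(): Unit = {
      spark.read.parquet("/data/events")   // hypothetical source path
        .createOrReplaceTempView("events")
      spark.catalog.cacheTable("events")
    }

    reload()
    // Expose this session's tables over JDBC/ODBC (port 10000 by default).
    HiveThriftServer2.startWithContext(spark.sqlContext)

    // Naive rolling refresh; a real job would handle errors and atomicity.
    while (true) {
      Thread.sleep(15 * 60 * 1000)
      spark.catalog.uncacheTable("events")
      reload()
    }
  }
}

Because the server shares the job's session state, JDBC clients keep hitting the cache rather than the persistent store, at the cost of staleness bounded by the refresh interval.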
>>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>> Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don't need to be real-time, and they won't be directly feeding off of the main DMP HBase data. Instead, it'll be batched to Parquet or Kudu/Impala, or even PostgreSQL.
>>>
>>> I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use case and architecture, the ID generation needs to happen local to the ad servers, where we generate a unique ID and store it in an ID linking table. Even better, many of the 3rd-party services supply this ID. So data only needs to flow in one direction, and we will use Kafka as the bus for this. No JDBC required. The same goes for the REST endpoints: 3rd-party services will hit ours to update our data, with no need to read from our data. And when we want to update their data, we will hit theirs using a triggered job.
>>>
>>> This all boils down to just integrating with Kafka.
>>>
>>> Once again, thanks for all the help.
>>>
>>> Cheers,
>>> Ben
>>>
>>>> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>> Please keep in mind also that Tableau Server can store data in-memory and refresh the in-memory data only when needed. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.
>>>>
>>>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>> Cloudera 5.8 has a very old version of Hive, without Tez, but Mich has already provided a good alternative. However, you should check whether it contains a recent version of HBase and Phoenix. That being said, I just wonder what the dataflow, the data model, and the analysis you plan to do are. Maybe completely different solutions are possible. In particular, single inserts, upserts, etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well.
>>>>
>>>> Hive with LLAP will provide an in-memory cache for interactive analytics. You can also put full tables in-memory with Hive using the Ignite HDFS in-memory solution. All of this only makes sense if you do not use MR as an engine, and use the right input format (ORC, Parquet) and a recent Hive version.
>>>>
>>>> On 8 Oct 2016, at 21:55, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>
>>>>> Mich,
>>>>>
>>>>> Unfortunately, we are moving away from Hive and unifying on Spark, using CDH 5.8 as our distro. And Tableau has released a Spark ODBC/JDBC driver too. I will either try the Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.
>>>>>
>>>>> Thanks for the advice.
>>>>>
>>>>> Cheers,
>>>>> Ben
>>>>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> Sure. But essentially you are looking at batch data for analytics for your Tableau users, so Hive may be a better choice, with its rich SQL and an ODBC/JDBC connection to Tableau already.
>>>>>>
>>>>>> I would go for Hive, especially as the new release will have an in-memory offering as well for frequently accessed data :)
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>> On 8 October 2016 at 20:15, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>> Mich,
>>>>>>
>>>>>> First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data center. When not colocated in the same data center, they will connect to a local database server using JDBC. Either way, using JDBC everywhere simplifies and unifies the code on the JDBC industry standard.
>>>>>>
>>>>>> Does this make sense?
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>> Like any other design: what are your presentation layer and end users? Are they SQL-centric users from a Tableau background, or might they use Spark functional programming?
>>>>>>>
>>>>>>> It is best to describe the use case.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>> On 8 October 2016 at 19:40, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>> I wouldn't be too surprised if Spark SQL -> JDBC data source -> Phoenix JDBC server -> HBase worked better.
>>>>>>>
>>>>>>> Without naming specifics, there are at least 4 or 5 different implementations of HBase sources, each at a varying level of development and with different requirements (HBase release version, Kerberos support, etc.).
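As a sketch of the chain Felix outlines, Spark's generic JDBC data source can read through the Phoenix query server's thin driver. The host reuses the one from Mich's example below; the serialization option follows Phoenix Query Server defaults, and note that Spark has no Phoenix-specific JDBC dialect, so identifier quoting and filter pushdown may need care.

// Spark SQL -> JDBC data source -> Phoenix query server -> HBase.
// Assumes the Phoenix thin-client jar is on the classpath.
val tsco = spark.read
  .format("jdbc")
  .option("url", "jdbc:phoenix:thin:url=http://rhes564:8765;serialization=PROTOBUF")
  .option("driver", "org.apache.phoenix.queryserver.client.Driver")
  .option("dbtable", "\"tsco\"")   // Phoenix table names are case-sensitive when quoted
  .load()

// Register it so a thriftserver started with this session can serve it.
tsco.createOrReplaceTempView("tsco")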
>>>>>>> _____________________________
>>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>> To: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>>> Cc: <user@spark.apache.org>, Felix Cheung <felixcheun...@hotmail.com>
>>>>>>>
>>>>>>> Mich,
>>>>>>>
>>>>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>>
>>>>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>> I don't think it will work.
>>>>>>>
>>>>>>> You can use Phoenix on top of HBase:
>>>>>>>
>>>>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>>>>> ROW            COLUMN+CELL
>>>>>>> TSCO-1-Apr-08  column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>>>> TSCO-1-Apr-08  column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>>>> TSCO-1-Apr-08  column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>>>> TSCO-1-Apr-08  column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>>>> TSCO-1-Apr-08  column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>>>> TSCO-1-Apr-08  column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>>>> TSCO-1-Apr-08  column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>>>> TSCO-1-Apr-08  column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>>>>
>>>>>>> And the same on Phoenix, on top of the HBase table:
>>>>>>>
>>>>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by to_date("Date",'dd-MMM-yy') limit 1;
>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>> On 8 October 2016 at 19:05, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>> Great, then I think those packages, as Spark data sources, should allow you to do exactly that (replace org.apache.spark.sql.jdbc with an HBase one).
>>>>>>>
>>>>>>> I do think it would be great to get more examples around this, though. It would be great if you could share your experience with it!
>>>>>>>
>>>>>>> _____________________________
>>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>>>> Cc: <user@spark.apache.org>
>>>>>>>
>>>>>>> Felix,
>>>>>>>
>>>>>>> My goal is to use the Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:
>>>>>>>
>>>>>>> CREATE TABLE <table-name>
>>>>>>> USING org.apache.spark.sql.jdbc
>>>>>>> OPTIONS (
>>>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>>>>>>   dbtable "dim.dimension_acamp"
>>>>>>> );
>>>>>>>
>>>>>>> After doing this, I can access the PostgreSQL table through the Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>>
>>>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>>
>>>>>>> Ben,
>>>>>>>
>>>>>>> I'm not sure I'm following completely.
>>>>>>>
>>>>>>> Is your goal to use Spark to create or access tables in HBase? If so, the link below and several packages out there support that by providing an HBase data source for Spark. There are some examples of what the Spark code looks like in that link as well. On that note, you should also be able to use the HBase data source from a pure SQL (Spark SQL) query, which should work in the case of the Spark SQL JDBC Thrift Server (with USING, http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).
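To make "replace org.apache.spark.sql.jdbc with an HBase one" concrete, here is a hedged sketch against one such package, the Hortonworks shc connector; the format string and catalog JSON follow that project's documented usage, and the table and column mapping are placeholders borrowed from Mich's example.

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Maps the HBase row key and columns to Spark SQL fields (shc catalog format).
val catalog = """{
  "table":{"namespace":"default", "name":"tsco"},
  "rowkey":"key",
  "columns":{
    "rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
    "close":{"cf":"stock_daily", "col":"close", "type":"string"},
    "volume":{"cf":"stock_daily", "col":"volume", "type":"string"}
  }
}"""

val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// Once registered, the table is reachable over the Spark SQL thriftserver.
df.createOrReplaceTempView("tsco")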
>>>>>>> _____________________________
>>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>>>> Cc: <user@spark.apache.org>
>>>>>>>
>>>>>>> Felix,
>>>>>>>
>>>>>>> The only alternative way is to create a stored procedure (a UDF, in database terms) that would run Spark Scala code underneath. In this way, I can use the Spark SQL JDBC Thriftserver to execute it using SQL code, passing the key/values I want to UPSERT. I wonder if this is possible, since I cannot CREATE a wrapper table on top of an HBase table in Spark SQL.
>>>>>>>
>>>>>>> What do you think? Is this the right approach?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>>
>>>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>>
>>>>>>> HBase has released support for Spark:
>>>>>>> hbase.apache.org/book.html#spark
>>>>>>>
>>>>>>> And if you search, you should find several alternative approaches.
>>>>>>>
>>>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>>>>
>>>>>>> Does anyone know if Spark can work with HBase tables using Spark SQL? I know that in Hive we are able to create tables on top of an underlying HBase table, which can then be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to set up a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load-balance the thriftservers. In addition, this will benefit us by giving us a way to abstract the data storage layer away from the presentation-layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben
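On the GET side of the original question, the Spark SQL thriftserver speaks the HiveServer2 protocol, so a REST endpoint or web app can reach it with the standard Hive JDBC driver. A minimal sketch; host, port, credentials, and table are placeholders.

import java.sql.DriverManager

// jdbc:hive2:// URLs work against the Spark SQL thriftserver as well as HiveServer2.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT COUNT(*) FROM tsco")
while (rs.next()) println(rs.getLong(1))
rs.close(); stmt.close(); conn.close()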