Re: Best way to present data collected by Flume through Spark

Mich Talebzadeh Fri, 16 Sep 2016 01:10:21 -0700

Hi Sean,

At the moment I am using Zeppelin with Spark SQL to get data from Hive. So
any connection here for visualisation has to be through this sort of API.


I know Tableau only uses SQL. Zeppelin can use Spark SQL directly or
through Spark Thrift Server.

The question is that a user may want to create a join or something else
involving many tables, and the preference would be to use some sort of
database.

In this case Hive is running on the Spark engine, so we are not talking about
MapReduce and the associated latency.

That Hive element can easily be swapped out. So our requirement is to
present multiple tables to the dashboard and let the user slice and dice.

The factors are not just speed but also functionality. At the moment
Zeppelin uses Spark SQL. I can get rid of Hive and replace it with something
else, but I think I still need a tabular interface to the Flume-delivered
data.

I will be happy to consider all options.

Thanks

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 16 September 2016 at 08:46, Sean Owen <so...@cloudera.com> wrote:

> Why Hive and why precompute data at 15 minute latency? There are
> several ways to query the source data directly with no extra step
> or latency. Even Spark SQL is real-time-ish for queries on the
> source data, and Impala (or heck Drill etc) are too.
>
> On Thu, Sep 15, 2016 at 10:56 PM, Mich Talebzadeh
> <mich.talebza...@gmail.com> wrote:
> > OK this seems to be working for the "Batch layer". I will try to create a
> > functional diagram for it
> >
> > Publisher sends prices every two seconds
> > Kafka receives data
> > Flume delivers data from Kafka to HDFS as time-stamped text files
> > A Hive ORC external table (source table) is created on the directory
> > where Flume writes continuously
> > All temporary Flume files are prefixed by "." (hidden files), so the Hive
> > external table does not see those
> > Every price row includes a timestamp
> > A conventional Hive table (target table) is created with all columns from
> > the external table + two additional columns, one being a timestamp
> > from Hive
> > A cron job is set up that runs every 15 minutes, as below
> > 0,15,30,45 00-23 * * 1-5 (/home/hduser/dba/bin/populate_marketData.ksh -D
> > test > /var/tmp/populate_marketData_test.err 2>&1)
> >
> > As can be seen, this cron job runs every 15 minutes and refreshes the
> > Hive target table with the new data, meaning rows whose price created
> > time > MAX(price created time) in the target table
> >
> > Target table statistics are updated at each run. It takes an average of 2
> > minutes to run the job
> > Thu Sep 15 22:45:01 BST 2016  ======= Started
> > /home/hduser/dba/bin/populate_marketData.ksh  =======
> > 15/09/2016 22:45:09.09
> > 15/09/2016 22:46:57.57
> > 2016-09-15T22:46:10
> > 2016-09-15T22:46:57
> > Thu Sep 15 22:47:21 BST 2016  ======= Completed
> > /home/hduser/dba/bin/populate_marketData.ksh  =======
> >
> >
> > So the target table is 15 minutes out of sync with the Flume data, which
> > is not bad.
> >
> > Assuming that I replace ORC tables with Parquet, Druid or whatever, that
> > can be done pretty easily. However, although I am using Zeppelin here,
> > people may decide to use Tableau, QlikView etc, so we need to think about
> > the connectivity between these tools and the underlying database. I know
> > Tableau; it is very SQL centric and works with ODBC and JDBC drivers or
> > native drivers. For example, I know that Tableau comes with Hive-supplied
> > ODBC drivers. I am not sure whether there are such drivers for Druid etc.
> >
> > Let me know your thoughts.
> >
> > Cheers
> >
> > Dr Mich Talebzadeh
> >
>
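
For anyone following the thread, the incremental refresh rule in the quoted message (load only rows whose price created time is later than MAX(price created time) in the target table) can be sketched roughly as below. This is only an illustration under assumptions: SQLite stands in for Hive so the snippet runs anywhere, and the names source_prices, target_prices, ticker, price, price_created and loaded_at are hypothetical, not taken from populate_marketData.ksh. The thread mentions two additional target columns but names only the load timestamp, so only that one is modelled.

```python
import sqlite3


def create_tables(conn):
    """Create hypothetical source and target tables (stand-ins for the Hive tables)."""
    conn.execute(
        "CREATE TABLE source_prices (ticker TEXT, price REAL, price_created TEXT)"
    )
    # The target carries the source columns plus a load timestamp.
    conn.execute(
        "CREATE TABLE target_prices "
        "(ticker TEXT, price REAL, price_created TEXT, loaded_at TEXT)"
    )


def refresh_target(conn):
    """Copy source rows newer than the latest row already in the target.

    Timestamps are assumed to be ISO-8601 strings in one consistent format,
    so plain string comparison orders them chronologically.
    Returns the number of rows inserted by this run.
    """
    cur = conn.cursor()
    # High-water mark: latest price timestamp already loaded (None if target is empty).
    (high_water,) = cur.execute(
        "SELECT MAX(price_created) FROM target_prices"
    ).fetchone()
    cur.execute(
        "INSERT INTO target_prices (ticker, price, price_created, loaded_at) "
        "SELECT ticker, price, price_created, datetime('now') "
        "FROM source_prices "
        "WHERE ? IS NULL OR price_created > ?",
        (high_water, high_water),
    )
    conn.commit()
    return cur.rowcount
```

Calling refresh_target(conn) once per scheduled run mirrors the 15 minute cron cycle: the first call loads everything, later calls load only rows past the high-water mark. One thing to watch with a strict ">" comparison is that a row arriving late with the same timestamp as the current maximum would be skipped.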
