Why Hive, and why precompute data at 15-minute latency? There are
several ways to query the source data directly, with no extra step or
latency. Even Spark SQL is near-real-time for queries on the source
data, and Impala (or, heck, Drill etc.) certainly is.
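
For example, any of those engines could be pointed straight at the
external table the Flume sink feeds (table and column names here are
made up for illustration):

  -- query the landing data in place; no refresh job in the path
  SELECT ticker, price, created_ts
  FROM   marketdata_ext
  WHERE  created_ts > '2016-09-15 22:30:00';

The same statement runs from Spark SQL or Impala against the metastore
table (and from Drill via its Hive storage plugin), so freshness is
bounded by how often Flume rolls files rather than by a cron schedule.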

On Thu, Sep 15, 2016 at 10:56 PM, Mich Talebzadeh
<mich.talebza...@gmail.com> wrote:
> OK, this seems to be working for the "Batch layer". I will try to create a
> functional diagram for it.
>
> Publisher sends prices every two seconds
> Kafka receives data
> Flume delivers the data from Kafka to HDFS as time-stamped text files
> A Hive ORC external table (the source table) is created on the directory
> where Flume writes continuously
> All temporary Flume files are prefixed with "." (hidden files), so the Hive
> external table does not see them
> Every price row includes a timestamp
> A conventional Hive table (the target table) is created with all the
> columns from the external table plus two additional columns, one of them a
> timestamp generated by Hive (both tables are sketched below)
> A cron job is set up that runs every 15 minutes, as below:
> 0,15,30,45 00-23 * * 1-5 (/home/hduser/dba/bin/populate_marketData.ksh -D
> test > /var/tmp/populate_marketData_test.err 2>&1)
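>
> For reference, the two tables might look roughly like this (names and
> columns are illustrative, not from the actual setup; since Flume is
> landing delimited text files I have sketched the source as text format,
> keeping ORC for the managed target):
>
>   CREATE EXTERNAL TABLE marketdata_ext (
>     ticker      STRING,
>     price       DOUBLE,
>     created_ts  TIMESTAMP   -- the timestamp carried in every price row
>   )
>   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
>   STORED AS TEXTFILE
>   LOCATION '/flume/marketdata';   -- the directory Flume writes into
>
>   CREATE TABLE marketdata (
>     ticker      STRING,
>     price       DOUBLE,
>     created_ts  TIMESTAMP,
>     op_type     INT,        -- illustrative extra column
>     op_ts       TIMESTAMP   -- the timestamp Hive adds at load time
>   )
>   STORED AS ORC;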
>
> As can be seen, this cron job runs every 15 minutes and refreshes the Hive
> target table with the new data, "new" meaning rows whose price created time
> is greater than MAX(price created time) in the target table (sketched below).
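>
> A minimal sketch of that refresh, reusing the illustrative names from the
> DDL above (the CROSS JOIN sidesteps a scalar subquery in WHERE, which
> older Hive versions do not accept):
>
>   INSERT INTO TABLE marketdata
>   SELECT s.ticker, s.price, s.created_ts,
>          1,                  -- illustrative op_type value
>          CURRENT_TIMESTAMP   -- the Hive-generated timestamp column
>   FROM marketdata_ext s
>   CROSS JOIN (SELECT MAX(created_ts) AS max_ts FROM marketdata) m
>   WHERE s.created_ts > m.max_ts;
>
> One caveat: MAX() over an empty table is NULL, so the very first load
> needs seeding (or a COALESCE around max_ts).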
>
> Target table statistics are updated at each run. The job takes an average
> of two minutes to run:
> Thu Sep 15 22:45:01 BST 2016  ======= Started
> /home/hduser/dba/bin/populate_marketData.ksh  =======
> 15/09/2016 22:45:09.09
> 15/09/2016 22:46:57.57
> 2016-09-15T22:46:10
> 2016-09-15T22:46:57
> Thu Sep 15 22:47:21 BST 2016  ======= Completed
> /home/hduser/dba/bin/populate_marketData.ksh  =======
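>
> If the stats refresh is the standard one, it is presumably something
> along the lines of (same illustrative table name as above):
>
>   ANALYZE TABLE marketdata COMPUTE STATISTICS;
>   ANALYZE TABLE marketdata COMPUTE STATISTICS FOR COLUMNS;  -- optional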
>
>
> So the target table is at most 15 minutes out of sync with the Flume data,
> which is not bad.
>
> Assuming that I replace the ORC tables with Parquet, Druid, whatever, that
> can be done pretty easily. However, although I am using Zeppelin here,
> people may decide to use Tableau, QlikView etc., so we need to think about
> the connectivity between these tools and the underlying database. I know
> Tableau: it is very SQL-centric and works with ODBC and JDBC drivers or
> native drivers. For example, I know that Tableau comes with Hive-supplied
> ODBC drivers. I am not sure these tools have drivers for Druid etc.?
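>
> For what it is worth, any tool that speaks JDBC should at least reach the
> Hive tables through HiveServer2, i.e. a URL of the form
>
>   jdbc:hive2://<host>:10000/default
>
> which is also how Zeppelin's Hive/JDBC interpreter connects. Whether
> anything equivalent exists for Druid is exactly my open question.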
>
> Let me know your thoughts.
>
> Cheers
>
> Dr Mich Talebzadeh
>
