BTW, we now support OLAP functionality natively in Spark, without the need
for Druid, through our Spark-native BI platform (SNAP):

- We provide SQL commands to: create star schema, create olap index, and
insert into olap index. So you can be up and running very quickly in a
Spark environment.
- Query acceleration is provided through an OLAP Index FileFormat and Query
Optimizer extensions (just like spark-druid-olap).
- We have also posted details on a BI Benchmark to quantify query
acceleration and cost.
- We haven't looked at integration with Spark Streaming yet, but since we
have a FileFormat, it should be possible to integrate. Please ping me if
this is of


On Mon, Aug 29, 2016 at 7:19 PM, Chanh Le <> wrote:

> Hi everyone,
> It seems a lot of people are using Druid for real-time dashboards.
> I’m wondering about using Druid as the main storage engine, because Druid
> can store the raw data and can also integrate with Spark (in theory).
> In that case, do we need two separate stores, Druid (with segments stored
> in HDFS) and HDFS?
> BTW did anyone try this one
> SparklineData/spark-druid-olap?
> Regards,
> Chanh
> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh <>
> wrote:
> Thanks Bhaarat and everyone.
> This is an updated version of the same diagram
> <LambdaArchitecture.png>
> The frequency of recent data is defined by the window length in Spark
> Streaming. It can vary from 0.5 seconds to an hour. (I don't think we can
> push Spark granularity below 0.5 seconds in anger.) For some applications,
> like credit card transactions and fraud detection, data is stored in real
> time by Spark in HBase tables. The HBase tables will be on HDFS as well.
> The same Spark Streaming job will write asynchronously to Hive tables on
> HDFS.
> One school of thought is to never write to Hive from Spark: write straight
> to HBase and then read the HBase tables into Hive periodically.
> The third component in this layer is the Serving Layer, which can combine
> data from the current (HBase) and the historical (Hive tables) stores to
> give the user visual analytics. Those visual analytics can be a real-time
> dashboard on top of the Serving Layer. The Serving Layer could be an
> in-memory NoSQL offering, or data from HBase (red box) combined with Hive
> tables.
> I am not aware of any industrial-strength real-time dashboard. The idea
> is that one uses such a dashboard in real time. Dashboard in this sense
> means a general-purpose API to a data store of some type, like the one on
> the Serving Layer, that provides visual analytics in real time on demand,
> combining real-time data and aggregate views. As usual, the devil is in
> the detail.
> Let me know your thoughts. Anyway, this is a first-cut pattern.
> Dr Mich Talebzadeh
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
> On 29 August 2016 at 18:53, Bhaarat Sharma <> wrote:
>> Hi Mich
>> This is really helpful. I'm trying to wrap my head around the last
>> diagram you shared (the one with Kafka). In this diagram Spark Streaming is
>> pushing data to HDFS and NoSQL. However, I'm confused by the "Real Time
>> Queries, Dashboards" annotation. Based on this diagram, will real time
>> queries be running on Spark or HBase?
>> PS: My intention was not to steer the conversation away from what Ashok
>> asked but I found the diagrams shared by Mich very insightful.
>> On Sun, Aug 28, 2016 at 7:18 PM, Mich Talebzadeh <> wrote:
>>> Hi,
>>> In terms of positioning, Spark is really the first Big Data platform to
>>> integrate batch, streaming and interactive computations in a unified
>>> framework. What this boils down to is that, whichever way one looks at it,
>>> there is somewhere that Spark can make a contribution. In general,
>>> there are a few design patterns common to Big Data:
>>>    - *ETL & Batch*
>>> The first one is the most common one, with established tools like Sqoop
>>> and Talend for ETL and HDFS for storage of some kind. Spark can be used as
>>> the execution engine for Hive at the storage level, which actually makes it
>>> a truly vendor-independent processing engine (BTW, Impala, Tez and LLAP are
>>> offered by vendors). Personally I use Spark at the ETL layer by
>>> extracting data from sources through plug-ins (JDBC and others) and storing
>>> it on HDFS in some kind
>>>    - *Batch, real time plus Analytics*
>>> In this pattern you have data coming in in real time and you want to query
>>> it in real time through a real-time dashboard. HDFS is ideal neither for
>>> updating data in real time nor for random access to data. The sources could
>>> be all sorts of web servers, which need a Flume agent. At the storage
>>> layer we are probably looking at something like HBase. The crucial point
>>> is that saved data needs to be ready for queries immediately. The
>>> dashboards require HBase APIs. The analytics can be done through Hive,
>>> again running on the Spark engine. Again, note here that we ideally should
>>> process batch and real time separately.
>>>    - *Real time / Streaming*
>>> This is most relevant to Spark as we are moving to near real time, where
>>> Spark excels. We need to capture the incoming events (logs, sensor data,
>>> pricing, emails) through interfaces like Kafka, message queues etc., and
>>> process these events with minimum latency. Again Spark is a very good
>>> candidate here with its Spark Streaming and micro-batching capabilities.
>>> There are others like Storm, Flink etc. that are event based, but you don’t
>>> hear much about them. Again, for a streaming architecture you need to sync
>>> data in real time using something like HBase, Cassandra (?) and others as
>>> the real-time store, or HDFS, Hive etc. as forever storage.
>>> In general there is also the *Lambda Architecture*, which is designed for
>>> streaming analytics. The streaming data ends up in both the batch layer
>>> and the speed layer. The batch layer is used to answer batch queries. On
>>> the other hand, the speed layer is used to handle fast/real-time queries.
>>> This model is really cool, as Spark Streaming can feed both the batch layer
>>> and the speed layer.
>>> At a high level this looks like this, from
>>> <image.png>
>>> My favourite would be something like below with Spark playing a major
>>> role
>>> <LambdaArchitecture.png>
>>> HTH
>>> Dr Mich Talebzadeh
>>> On 28 August 2016 at 19:43, Sivakumaran S <> wrote:
>>>> Spark fits best for processing. But depending on the use case, you
>>>> could expand the scope of Spark to moving data using the native connectors.
>>>> The only thing that Spark is not is storage. Connectors are available for
>>>> most storage options though.
>>>> Regards,
>>>> Sivakumaran S
>>>> On 28-Aug-2016, at 6:04 PM, Ashok Kumar <> wrote:
>>>> Hi,
>>>> There are design patterns that use Spark extensively. I am new to this
>>>> area, so I would appreciate it if someone could explain where Spark fits
>>>> in, especially within faster or streaming use cases.
>>>> What are the best practices involving Spark? Is it always best to
>>>> deploy it as the processing engine?
>>>> For example, when we have a pattern
>>>> Input Data -> Data in Motion -> Processing -> Storage
>>>> where does Spark best fit in?
>>>> Thanking you
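
The Lambda Architecture discussed in this thread (streaming data feeding both
a batch layer and a speed layer, with a serving layer combining the two) can
be illustrated with a minimal plain-Python sketch. This is not Spark, HBase
or Hive code: the names Event, micro_batches and LambdaPipeline are
hypothetical, the in-memory dict and list are only stand-ins for an
HBase-like real-time store and an HDFS/Hive-like historical store, and the
0.5-second window simply mirrors the window length mentioned above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ts: float      # event time in seconds
    key: str       # e.g. a card number or sensor id (illustrative)
    amount: float


def micro_batches(events, window_seconds):
    """Group events into fixed windows, mimicking a streaming window length."""
    windows = defaultdict(list)
    for ev in events:
        windows[int(ev.ts // window_seconds)].append(ev)
    return [windows[w] for w in sorted(windows)]


class LambdaPipeline:
    """Toy model of the batch, speed and serving layers (all hypothetical)."""

    def __init__(self):
        self.speed_store = defaultdict(float)  # stand-in for HBase: recent totals
        self.batch_log = []                    # stand-in for HDFS/Hive: raw events
        self.batch_view = {}                   # precomputed historical totals

    def ingest(self, batch):
        # The same stream feeds both layers.
        for ev in batch:
            self.speed_store[ev.key] += ev.amount
            self.batch_log.append(ev)

    def run_batch_job(self):
        # Recompute the historical view from all raw events, then reset the
        # speed layer so that nothing is counted twice.
        view = defaultdict(float)
        for ev in self.batch_log:
            view[ev.key] += ev.amount
        self.batch_view = dict(view)
        self.speed_store.clear()

    def query(self, key):
        # Serving layer: historical view plus whatever arrived since.
        return self.batch_view.get(key, 0.0) + self.speed_store.get(key, 0.0)


events = [
    Event(0.1, "card-1", 10.0),
    Event(0.4, "card-2", 5.0),
    Event(0.7, "card-1", 2.5),
    Event(1.2, "card-1", 1.0),
]
batches = micro_batches(events, window_seconds=0.5)

pipeline = LambdaPipeline()
pipeline.ingest(batches[0])
pipeline.ingest(batches[1])
pipeline.run_batch_job()     # periodic batch recomputation
pipeline.ingest(batches[2])  # arrives after the batch job

print(pipeline.query("card-1"))
print(pipeline.query("card-2"))
```

The point of the sketch is only the division of labour: a dashboard would
call something like query(), which reads the precomputed batch view and adds
the recent data held by the speed layer.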
