BTW, we now support OLAP functionality natively in Spark, without the need for Druid, through our Spark-native BI platform (SNAP): https://www.linkedin.com/pulse/integrated-business-intelligence-big-data-stacks-harish-butani
- We provide SQL commands to create a star schema, create an OLAP index, and insert into an OLAP index, so you can be up and running very quickly in a Spark environment.
- Query acceleration is provided through an OLAP Index FileFormat and Query Optimizer extensions (just like spark-druid-olap).
- We have also posted details on a BI Benchmark <https://www.linkedin.com/pulse/integrated-business-intelligence-big-data-stacks-harish-butani> to quantify query acceleration and cost.
- We haven't looked at integration with Spark Streaming yet, but since we have a FileFormat it should be possible to integrate.

Please ping me if this is of interest.

Regards,
Harish.

On Mon, Aug 29, 2016 at 7:19 PM, Chanh Le <giaosu...@gmail.com> wrote:

> Hi everyone,
>
> It seems a lot of people are using Druid for real-time dashboards.
> I'm just wondering about using Druid as the main storage engine, because
> Druid can store the raw data and can also integrate with Spark (in theory).
> In that case, do we need two separate stores, Druid (storing segments
> in HDFS) and HDFS?
> BTW, did anyone try https://github.com/SparklineData/spark-druid-olap?
>
> Regards,
> Chanh
>
>
> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Thanks Bhaarat and everyone.
>
> This is an updated version of the same diagram
>
> <LambdaArchitecture.png>
>
> The frequency of recent data is defined by the window length in Spark
> Streaming. It can vary between 0.5 seconds and an hour. (I don't think we
> can push Spark granularity below 0.5 seconds in anger.) For some
> applications, like credit card transactions and fraud detection, data is
> stored in real time by Spark in HBase tables. The HBase tables will be on
> HDFS as well. The same Spark Streaming job will write asynchronously to
> HDFS Hive tables.
> One school of thought is to never write to Hive from Spark: write straight
> to HBase and then read the HBase tables into Hive periodically?
>
> Now the third component in this layer is the Serving Layer, which can
> combine data from the current (HBase) and the historical (Hive tables)
> stores to give the user visual analytics. Those visual analytics can be a
> real-time dashboard on top of the Serving Layer. That Serving Layer could
> be an in-memory NoSQL offering, or data from HBase (red box) combined with
> Hive tables.
>
> I am not aware of any industrial-strength real-time dashboard. The idea
> is that one uses such a dashboard in real time. Dashboard in this sense
> means a general-purpose API to a data store of some type, such as the
> Serving Layer, to provide visual analytics in real time on demand,
> combining real-time data and aggregate views. As usual, the devil is in
> the detail.
>
> Let me know your thoughts. Anyway, this is a first-cut pattern.
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On 29 August 2016 at 18:53, Bhaarat Sharma <bhaara...@gmail.com> wrote:
>
>> Hi Mich
>>
>> This is really helpful. I'm trying to wrap my head around the last
>> diagram you shared (the one with Kafka). In this diagram Spark Streaming
>> is pushing data to HDFS and NoSQL. However, I'm confused by the "Real Time
>> Queries, Dashboards" annotation. Based on this diagram, will real-time
>> queries be running on Spark or HBase?
>>
>> PS: My intention was not to steer the conversation away from what Ashok
>> asked, but I found the diagrams shared by Mich very insightful.
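
A minimal sketch of the serving-layer idea described in Mich's reply above, written in plain Python rather than against Spark, HBase or Hive. The function name `serve`, the two dictionaries and the card identifiers are hypothetical stand-ins: `batch_view` plays the role of an aggregate precomputed from the Hive tables, and `speed_view` plays the role of recent rows held in HBase. How a real serving layer merges the two depends on the stores involved.

```python
from collections import Counter


def serve(batch_view, speed_view):
    """Combine a historical aggregate with recent real-time counts.

    Both arguments map a key to a count. Keys present in both views
    have their counts added; keys present in only one are passed through.
    """
    merged = Counter(batch_view)
    merged.update(speed_view)  # Counter.update adds counts key by key
    return dict(merged)


# Hypothetical data: number of transactions seen per card.
batch_view = {"card_1": 120, "card_2": 45}  # stand-in for Hive aggregates
speed_view = {"card_2": 3, "card_3": 1}     # stand-in for recent HBase rows

dashboard_view = serve(batch_view, speed_view)
print(dashboard_view)
```

In this sketch the dashboard query touches neither store directly; it reads the merged view, which is one way to read the "Real Time Queries, Dashboards" box in the diagram.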
>>
>> On Sun, Aug 28, 2016 at 7:18 PM, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> In terms of positioning, Spark is really the first Big Data platform to
>>> integrate batch, streaming and interactive computations in a unified
>>> framework. What this boils down to is that whichever way one looks at
>>> it, there is somewhere that Spark can make a contribution. In general,
>>> there are a few design patterns common to Big Data:
>>>
>>> - *ETL & Batch*
>>>
>>> The first one is the most common, with established tools like Sqoop and
>>> Talend for ETL and HDFS for storage of some kind. Spark can be used as
>>> the execution engine for Hive at the storage level, which actually makes
>>> it a true vendor-independent processing engine (BTW, Impala, Tez and
>>> LLAP are offered by vendors). Personally, I use Spark at the ETL layer
>>> by extracting data from sources through plug-ins (JDBC and others) and
>>> storing it on HDFS in some form.
>>>
>>> - *Batch, real time plus Analytics*
>>>
>>> In this pattern you have data coming in in real time and you want to
>>> query it in real time through a real-time dashboard. HDFS is not ideal
>>> for updating data in real time, nor for random access to data. The
>>> source could be all sorts of web servers, which need a Flume agent. At
>>> the storage layer we are probably looking at something like HBase. The
>>> crucial point is that saved data needs to be ready for queries
>>> immediately. The dashboards require HBase APIs. The analytics can be
>>> done through Hive, again running on the Spark engine. Again, note here
>>> that we ideally should process batch and real time separately.
>>>
>>> - *Real time / Streaming*
>>>
>>> This is most relevant to Spark as we are moving to near real time, where
>>> Spark excels. We need to capture the incoming events (logs, sensor data,
>>> pricing, emails) through interfaces like Kafka, message queues etc.
>>> We need to process these events with minimum latency. Again, Spark is a
>>> very good candidate here with its Spark Streaming and micro-batching
>>> capabilities. There are others like Storm, Flink etc. that are event
>>> based, but you don't hear much about them. Again, for a streaming
>>> architecture you need to sync data in real time using something like
>>> HBase, Cassandra (?) and others as the real-time store, or HDFS, Hive
>>> etc. as forever storage.
>>>
>>> In general there is also the *Lambda Architecture*, which is designed
>>> for streaming analytics. The streaming data ends up in both the batch
>>> layer and the speed layer. The batch layer is used to answer batch
>>> queries. On the other hand, the speed layer is used to handle
>>> fast/real-time queries. This model is really cool, as Spark Streaming
>>> can feed both the batch layer and the speed layer.
>>>
>>> At a high level this looks like this, from
>>> http://lambda-architecture.net/
>>>
>>> <image.png>
>>>
>>> My favourite would be something like below, with Spark playing a major
>>> role
>>>
>>> <LambdaArchitecture.png>
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>> On 28 August 2016 at 19:43, Sivakumaran S <siva.kuma...@me.com> wrote:
>>>
>>>> Spark is the best fit for processing. But depending on the use case, you
>>>> could expand the scope of Spark to moving data using the native connectors.
>>>> The only thing that Spark is not is storage. Connectors are available
>>>> for most storage options though.
>>>>
>>>> Regards,
>>>>
>>>> Sivakumaran S
>>>>
>>>> On 28-Aug-2016, at 6:04 PM, Ashok Kumar <ashok34...@yahoo.com.INVALID>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> There are design patterns that use Spark extensively. I am new to this
>>>> area, so I would appreciate it if someone could explain where Spark fits
>>>> in, especially within faster or streaming use cases.
>>>>
>>>> What are the best practices involving Spark? Is it always best to
>>>> deploy it as the processing engine?
>>>>
>>>> For example, when we have a pattern
>>>>
>>>> Input Data -> Data in Motion -> Processing -> Storage
>>>>
>>>> where does Spark best fit in?
>>>>
>>>> Thanking you
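
As a rough illustration of the pattern discussed in this thread (Input Data -> Data in Motion -> Processing -> Storage, with one stream feeding both a batch layer and a speed layer), here is a small plain-Python sketch. It does not use Spark, Kafka, HBase or HDFS; `micro_batches`, `run_pipeline`, the sensor names and the in-memory list and dict are all hypothetical stand-ins chosen for the example. The 0.5-second window mirrors the lower bound Mich mentions for Spark Streaming.

```python
from collections import defaultdict


def micro_batches(events, window_seconds):
    """Group (timestamp, payload) events into fixed-length time windows.

    Returns the batches in window order, which loosely imitates how a
    micro-batching engine hands over one batch per interval.
    """
    batches = defaultdict(list)
    for timestamp, payload in events:
        window_index = int(timestamp // window_seconds)
        batches[window_index].append(payload)
    return [batches[index] for index in sorted(batches)]


def run_pipeline(events, window_seconds=0.5):
    """Feed every micro-batch to both layers of a Lambda-style design."""
    batch_layer = []  # append-only history (stand-in for HDFS/Hive)
    speed_layer = {}  # latest value per key (stand-in for HBase)
    for batch in micro_batches(events, window_seconds):
        for key, value in batch:
            batch_layer.append((key, value))  # keep everything
            speed_layer[key] = value          # keep only the newest reading
    return batch_layer, speed_layer


# Hypothetical input: (timestamp in seconds, (sensor, reading)).
events = [
    (0.1, ("sensor_a", 10)),
    (0.4, ("sensor_b", 7)),
    (0.6, ("sensor_a", 12)),
    (1.2, ("sensor_b", 9)),
]

history, latest = run_pipeline(events)
print(len(history), latest)
```

The processing step is the part Spark would occupy in this picture; the two containers only mark where separate storage systems would sit.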