Thanks Mich, I will check it. Cheers
Alonso Isidoro Roman
<https://about.me/alonso.isidoro.roman>

2016-08-30 9:52 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

> You can use HBase for building real-time dashboards.
>
> Check this link:
> <https://www.sigmoid.com/integrating-spark-kafka-hbase-to-power-a-real-time-dashboard/>
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On 30 August 2016 at 08:33, Alonso Isidoro Roman <alons...@gmail.com> wrote:
>
>> HBase for real-time queries? HBase was designed with batch in mind.
>> Impala might be a better choice, but I do not know what Druid can do.
>>
>> Cheers
>>
>> Alonso Isidoro Roman
>> <https://about.me/alonso.isidoro.roman>
>>
>> 2016-08-30 8:56 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
>>
>>> Hi Chanh,
>>>
>>> Druid sounds like a good choice.
>>>
>>> But again, the point is: what does Druid bring on top of HBase?
>>>
>>> Unless one decides to use Druid for both historical data and real-time
>>> data in place of HBase!
>>>
>>> Is it easier to write an API against Druid than against HBase? You
>>> still want a UI dashboard?
>>>
>>> Cheers
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn: <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> On 30 August 2016 at 03:19, Chanh Le <giaosu...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> It seems a lot of people use Druid for real-time dashboards.
>>>> I'm just wondering about using Druid as the main storage engine,
>>>> because Druid can store the raw data and can integrate with Spark as
>>>> well (theoretically). In that case, do we need two separate stores:
>>>> Druid (storing segments in HDFS) and HDFS itself?
>>>> BTW, did anyone try https://github.com/SparklineData/spark-druid-olap?
>>>>
>>>> Regards,
>>>> Chanh
>>>>
>>>> On Aug 30, 2016, at 3:23 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>> Thanks Bhaarat and everyone.
>>>>
>>>> This is an updated version of the same diagram:
>>>>
>>>> <LambdaArchitecture.png>
>>>>
>>>> The frequency of recent data is defined by the window length in Spark
>>>> Streaming. It can vary between 0.5 seconds and an hour. (I don't think
>>>> we can push Spark's granularity below 0.5 seconds in anger, even for
>>>> applications like credit card transactions and fraud detection.) Data
>>>> is stored in real time by Spark in HBase tables. The HBase tables will
>>>> be on HDFS as well. The same Spark Streaming job will write
>>>> asynchronously to HDFS/Hive tables.
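As a rough illustration of what the window length means, here is a minimal tumbling-window aggregation in plain Python. This is a conceptual sketch only, with no Spark dependency; the event format, the per-key counting, and the 0.5-second window are assumptions chosen for the example, not a description of any Spark API.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_secs=0.5):
    """Group (timestamp, key) events into fixed windows and count per key.

    This mimics, at toy scale, what a streaming window does: each event
    falls into exactly one window of length window_secs, and each window
    yields an aggregate (here, a count per key).
    """
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        # Align the event's timestamp to the start of its window.
        window_start = int(ts // window_secs) * window_secs
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

events = [(0.1, "txn"), (0.4, "txn"), (0.6, "fraud"), (1.2, "txn")]
print(tumbling_window_counts(events))
# {0.0: {'txn': 2}, 0.5: {'fraud': 1}, 1.0: {'txn': 1}}
```

Shortening `window_secs` trades throughput for freshness, which is exactly the trade-off behind the 0.5-second lower bound discussed above.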
>>>> One school of thought is to never write to Hive from Spark: write
>>>> straight to HBase and then read the HBase tables into Hive
>>>> periodically.
>>>>
>>>> The third component in this layer is the Serving Layer, which can
>>>> combine data from the current store (HBase) and the historical store
>>>> (Hive tables) to give the user visual analytics. That visual analytics
>>>> front end can be a real-time dashboard on top of the Serving Layer.
>>>> The Serving Layer could be an in-memory NoSQL offering, or data from
>>>> HBase (the red box) combined with Hive tables.
>>>>
>>>> I am not aware of any industrial-strength real-time dashboard. The
>>>> idea is that one uses such a dashboard in real time. "Dashboard" in
>>>> this sense means a general-purpose API to a data store of some type,
>>>> such as the Serving Layer, providing visual analytics in real time on
>>>> demand by combining real-time data and aggregate views. As usual, the
>>>> devil is in the detail.
>>>>
>>>> Let me know your thoughts. Anyway, this is a first-cut pattern.
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> LinkedIn: <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> On 29 August 2016 at 18:53, Bhaarat Sharma <bhaara...@gmail.com> wrote:
>>>>
>>>>> Hi Mich,
>>>>>
>>>>> This is really helpful. I'm trying to wrap my head around the last
>>>>> diagram you shared (the one with Kafka). In this diagram, Spark
>>>>> Streaming is pushing data to HDFS and NoSQL.
>>>>> However, I'm confused by the "Real Time Queries, Dashboards"
>>>>> annotation. Based on this diagram, will real-time queries be running
>>>>> on Spark or HBase?
>>>>>
>>>>> PS: My intention was not to steer the conversation away from what
>>>>> Ashok asked, but I found the diagrams shared by Mich very insightful.
>>>>>
>>>>> On Sun, Aug 28, 2016 at 7:18 PM, Mich Talebzadeh
>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> In terms of positioning, Spark is really the first Big Data platform
>>>>>> to integrate batch, streaming and interactive computations in a
>>>>>> unified framework. What this boils down to is that whichever way one
>>>>>> looks at it, there is somewhere Spark can make a contribution. In
>>>>>> general, there are a few design patterns common to Big Data:
>>>>>>
>>>>>> - *ETL & Batch*
>>>>>>
>>>>>> The first one is the most common, with established tools like Sqoop
>>>>>> and Talend for ETL and HDFS for storage of some kind. Spark can be
>>>>>> used as the execution engine for Hive at the storage level, which
>>>>>> makes it a truly vendor-independent processing engine (by contrast,
>>>>>> Impala, Tez and LLAP are offered by vendors). Personally, I use Spark
>>>>>> at the ETL layer, extracting data from sources through plug-ins
>>>>>> (JDBC and others) and storing it on HDFS in some form.
>>>>>>
>>>>>> - *Batch, real time plus Analytics*
>>>>>>
>>>>>> In this pattern you have data coming in real time and you want to
>>>>>> query it in real time through a real-time dashboard. HDFS is not
>>>>>> ideal for updating data in real time, nor for random access of data.
>>>>>> The source could be all sorts of web servers, ingested via a Flume
>>>>>> agent. At the storage layer we are probably looking at something
>>>>>> like HBase. The crucial point is that saved data needs to be ready
>>>>>> for queries immediately. The dashboards require HBase APIs.
>>>>>> The analytics can be done through Hive, again running on the Spark
>>>>>> engine. Note again that we should ideally process batch and
>>>>>> real-time data separately.
>>>>>>
>>>>>> - *Real time / Streaming*
>>>>>>
>>>>>> This is most relevant to Spark as we move to near real time, where
>>>>>> Spark excels. We need to capture the incoming events (logs, sensor
>>>>>> data, pricing, emails) through interfaces like Kafka, message
>>>>>> queues, etc., and process these events with minimum latency. Again,
>>>>>> Spark is a very good candidate here with its Spark Streaming and
>>>>>> micro-batching capabilities. There are others like Storm, Flink,
>>>>>> etc. that are event-based, but you don't hear much about them. For a
>>>>>> streaming architecture you also need to sink data in real time,
>>>>>> using something like HBase, Cassandra (?) or others as the real-time
>>>>>> store, and HDFS or Hive, etc. as the forever storage.
>>>>>>
>>>>>> In general, there is also the *Lambda Architecture*, which is
>>>>>> designed for streaming analytics. The streaming data ends up in both
>>>>>> the batch layer and the speed layer. The batch layer is used to
>>>>>> answer batch queries. The speed layer, on the other hand, is used to
>>>>>> handle fast/real-time queries. This model is really neat, as Spark
>>>>>> Streaming can feed both the batch layer and the speed layer.
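The serving side of the Lambda pattern described above can be sketched without any Spark, HBase or Hive API at all. The following plain-Python sketch shows the classic merge of a precomputed batch view with a speed-layer view for an additive aggregate; the view shapes and the page-count example are assumptions made purely for illustration.

```python
def merge_views(batch_view, speed_view):
    """Serve a query by combining the precomputed batch view with the
    speed layer's counts for events that arrived after the last batch run.

    Both views map key -> count; the serving layer simply sums them,
    which is the standard Lambda-style merge for additive aggregates.
    """
    merged = dict(batch_view)
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

batch_view = {"page_a": 1000, "page_b": 250}   # e.g. recomputed nightly from Hive
speed_view = {"page_a": 7, "page_c": 3}        # e.g. the last few minutes from HBase
print(merge_views(batch_view, speed_view))
# {'page_a': 1007, 'page_b': 250, 'page_c': 3}
```

Note this simple sum only works for aggregates that compose additively; non-additive metrics (e.g. distinct counts) need approximate structures or a different merge.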
>>>>>> At a high level this looks like the diagram from
>>>>>> http://lambda-architecture.net/:
>>>>>>
>>>>>> <image.png>
>>>>>>
>>>>>> My favourite would be something like the diagram below, with Spark
>>>>>> playing a major role:
>>>>>>
>>>>>> <LambdaArchitecture.png>
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>> LinkedIn: <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>> On 28 August 2016 at 19:43, Sivakumaran S <siva.kuma...@me.com> wrote:
>>>>>>
>>>>>>> Spark best fits processing. Depending on the use case, though, you
>>>>>>> could expand the scope of Spark to moving data, using the native
>>>>>>> connectors. The only thing that Spark is not, is storage;
>>>>>>> connectors are available for most storage options, though.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Sivakumaran S
>>>>>>>
>>>>>>> On 28-Aug-2016, at 6:04 PM, Ashok Kumar <ashok34...@yahoo.com.INVALID> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> There are design patterns that use Spark extensively. I am new to
>>>>>>> this area, so I would appreciate it if someone could explain where
>>>>>>> Spark fits in, especially within faster or streaming use cases.
>>>>>>>
>>>>>>> What are the best practices involving Spark?
>>>>>>> Is it always best to deploy it as the processing engine?
>>>>>>>
>>>>>>> For example, when we have the pattern
>>>>>>>
>>>>>>> Input Data -> Data in Motion -> Processing -> Storage
>>>>>>>
>>>>>>> where does Spark best fit in?
>>>>>>>
>>>>>>> Thanking you