Re: Correct way to use spark streaming with apache zeppelin

trung kien Sun, 13 Mar 2016 10:54:06 -0700

Thanks all for actively sharing your experience.

@Chris: using something like Redis is something I am trying to figure out.
I have a lot of transactions, so I can't trigger an update event for
every single transaction.
I'm looking at Spark Streaming because it provides batch processing (e.g. I
can update the cache every 5 seconds). In addition, Spark scales pretty
well and I don't have to worry about losing data.


The cache now holds the following information:
             * Date
             * BranchID
             * ProductID
             TotalQty
             TotalDollar

* marks a key column. Note that I keep historical data as well (by day).
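
A minimal sketch of the batching idea with the cache layout above, in plain
Python rather than Spark Streaming. The function names, the tuple layout of a
transaction, and the in-memory dict standing in for the cache are all
hypothetical, and it assumes Price is a unit price (so dollars = Qty * Price).
The point is only that each batch is collapsed to one delta per
(Date, BranchID, ProductID) key before the cache is touched:

```python
from collections import defaultdict
from datetime import date


def aggregate_batch(transactions):
    """Collapse one micro-batch into per-key deltas [qty, dollars].

    Each transaction is a hypothetical tuple:
    (day, branch_id, product_id, qty, unit_price).
    """
    deltas = defaultdict(lambda: [0, 0.0])
    for day, branch_id, product_id, qty, unit_price in transactions:
        entry = deltas[(day, branch_id, product_id)]
        entry[0] += qty
        entry[1] += qty * unit_price  # assumption: Price is per unit
    return deltas


def merge_into_cache(cache, deltas):
    """Apply one batch of deltas: one cache write per key, not per transaction."""
    for key, (qty, dollars) in deltas.items():
        total_qty, total_dollar = cache.get(key, (0, 0.0))
        cache[key] = (total_qty + qty, total_dollar + dollars)


# A plain dict stands in for the real cache (Redis, a database, ...).
cache = {}
batch = [
    (date(2016, 3, 13), "B1", "P1", 2, 10.0),
    (date(2016, 3, 13), "B1", "P1", 1, 10.0),
    (date(2016, 3, 13), "B2", "P7", 5, 3.0),
]
merge_into_cache(cache, aggregate_batch(batch))
```

Because Date is part of the key, rows for earlier days stay in the cache
untouched, which is one way to keep the by-day history.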

Now I want to use Zeppelin for querying against the cache (while the cache is
updating).
I don't need Zeppelin to update automatically (I can hit the run button
myself :) )
I'm just curious whether Parquet is the right solution for us.



On Sun, Mar 13, 2016 at 3:25 PM, Chris Miller <cmiller11...@gmail.com>
wrote:

> Cool! Thanks for sharing.
>
>
> --
> Chris Miller
>
> On Sun, Mar 13, 2016 at 12:53 AM, Todd Nist <tsind...@gmail.com> wrote:
>
>> Below is a link to an example which Silvio Fiorito put together
>> demonstrating how to link Zeppelin with Spark Streaming for real-time charts.
>> I think the original thread was back in early November 2015, subject: Real
>> time chart in Zeppelin, if you care to try to find it.
>>
>> https://gist.github.com/granturing/a09aed4a302a7367be92
>>
>> HTH.
>>
>> -Todd
>>
>> On Sat, Mar 12, 2016 at 6:21 AM, Chris Miller <cmiller11...@gmail.com>
>> wrote:
>>
>>> I'm pretty new to all of this stuff, so bear with me.
>>>
>>> Zeppelin isn't really intended for realtime dashboards as far as I know.
>>> Its reporting features (tables, graphs, etc.) are more for displaying the
>>> results from the output of something. As far as I know, there isn't really
>>> anything to "watch" a dataset and have updates pushed to the Zeppelin UI.
>>>
>>> As for Spark, unless you're doing a lot of processing that you didn't
>>> mention here, I don't think it's a good fit just for this.
>>>
>>> If it were me (just off the top of my head), I'd just build a simple web
>>> service that uses websockets to push updates to the client which could then
>>> be used to update graphs, tables, etc. The data itself -- that is, the
>>> accumulated totals -- you could store in something like Redis. When an
>>> order comes in, just add that quantity and price to the existing value and
>>> trigger your code to push out an updated value to any clients via the
>>> websocket. You could use something like a Redis pub/sub channel to trigger
>>> the web app to notify clients of an update.
>>>
>>> There are about 5 million other ways you could design this, but I would
>>> just keep it as simple as possible. I just threw one idea out...
>>>
>>> Good luck.
>>>
>>>
>>> --
>>> Chris Miller
>>>
>>> On Sat, Mar 12, 2016 at 6:58 PM, trung kien <kient...@gmail.com> wrote:
>>>
>>>> Thanks Chris and Mich for replying.
>>>>
>>>> Sorry for not explaining my problem clearly. Yes, I am talking about a
>>>> flexible dashboard when I mention Zeppelin.
>>>>
>>>> Here is the problem I am having:
>>>>
>>>> I am running a commercial website where we sell many products, and we
>>>> have many branches in many places. We have a lot of real-time transactions
>>>> and want to analyze them in real time.
>>>>
>>>> We don't want to aggregate every single transaction each time we run
>>>> analytics (each transaction has BranchID, ProductID, Qty, Price). So we
>>>> maintain intermediate data which contains: BranchID, ProductID, totalQty,
>>>> totalDollar.
>>>>
>>>> Ideally, we have 2 tables:
>>>>    Transaction (BranchID, ProductID, Qty, Price, Timestamp)
>>>>
>>>> And an intermediate table, Stats, which is just the sum of every
>>>> transaction grouped by BranchID and ProductID (I am using Spark Streaming
>>>> to calculate this table in real time).
>>>>
>>>> My thinking is that doing statistics (a real-time dashboard) on the Stats
>>>> table is much easier, and this table is also easy enough to maintain.
>>>>
>>>> I'm just wondering, what's the best way to store the Stats table (a
>>>> database or a Parquet file)?
>>>>
>>>> What exactly are you trying to do? Zeppelin is for interactive analysis
>>>> of a dataset. What do you mean by "realtime analytics" -- do you mean build a
>>>> report or dashboard that automatically updates as new data comes in?
>>>>
>>>>
>>>> --
>>>> Chris Miller
>>>>
>>>> On Sat, Mar 12, 2016 at 3:13 PM, trung kien <kient...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I've just watched some Zeppelin videos. The integration between
>>>>> Zeppelin and Spark is really amazing and I want to use it for my
>>>>> application.
>>>>>
>>>>> In my app, I will have a Spark Streaming app to do some basic real-time
>>>>> aggregation (intermediate data). Then I want to use Zeppelin to do some
>>>>> real-time analytics on the intermediate data.
>>>>>
>>>>> My question is: what's the most efficient storage engine to store
>>>>> real-time intermediate data? Is a Parquet file suitable?
>>>>>
>>>>
>>>>
>>>
>>
>


-- 
Thanks
Kien
