Hello, Just because you receive the log files hourly does not mean that you have to use Spark Streaming. Spark Streaming is typically used when you receive new events every minute or second, potentially at an irregular frequency. Of course, your analysis window can be larger.
I think your use case justifies standard Spark or MR. If you are not restricted to them, you may check a key/value or column store, such as Redis or Apache Cassandra. They also have some nice mechanisms for storage/performance-optimal counting of unique users (HyperLogLog) etc. In any case you can join the data with historical data. Best regards

On 5 Apr 2015, 12:44, "Bahubali Jain" <bahub...@gmail.com> wrote:
> Hi,
> I have a requirement in which I plan to use the SPARK Streaming.
> I am supposed to calculate the access count to certain webpages. I receive
> the webpage access information thru log files.
> By access count I mean "how many times was the page accessed *till now*".
> I have the log files for the past 2 years, and every day we keep receiving
> almost 6 GB of access logs (on an hourly basis).
> Since we receive these logs on an hourly basis I feel that I should use
> SPARK Streaming.
> But the problem is that the access counts have to be cumulative, i.e. even
> the older access counts (past 2 years) for a webpage should also be
> considered for the final value.
>
> How to achieve this thru streaming, since streaming picks only new files?
> I don't want to use a DB to store the access counts since it would
> considerably slow down the processing.
>
> Thanks,
> Baahu
> --
> Twitter: http://twitter.com/Baahu
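
As a minimal sketch of the cumulative-count idea (plain Python with hypothetical page names and counts, not actual Spark code): the historical totals are computed once as a batch job, and each new hourly batch is then merged into them. This is also the semantics Spark Streaming's `updateStateByKey` gives you per key, if you do go the streaming route.

```python
from collections import Counter

def update_totals(cumulative: Counter, hourly_counts: Counter) -> Counter:
    """Merge one hourly batch of page-access counts into the running totals.

    Mirrors the per-key state update of Spark Streaming's updateStateByKey:
    the carried-forward state (cumulative count per page) is incremented
    by the counts from the newest batch.
    """
    cumulative.update(hourly_counts)
    return cumulative

# Hypothetical numbers: totals bootstrapped once from the 2 years of
# historical logs, then one new hourly batch merged in.
totals = Counter({"/home": 1_200_000, "/about": 45_000})
hourly = Counter({"/home": 350, "/contact": 12})
totals = update_totals(totals, hourly)
```

The same merge works whether the state lives in Spark, in Redis counters, or in Cassandra counter columns; the key point is that only the deltas (the hourly logs) need reprocessing, not the full 2-year history.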