Hello,
Just because you receive the log files hourly does not mean that you have
to use Spark Streaming. Spark Streaming is typically used when new events
arrive every minute or second, potentially at an irregular frequency; your
analysis window can of course be larger.

I think your use case calls for standard (batch) Spark or MapReduce.
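
If it helps, here is a rough sketch of such a daily batch job in Scala.
The HDFS paths and the log format (page as the first whitespace-separated
field of each line) are my assumptions, not something from your mail:

import org.apache.spark.{SparkConf, SparkContext}

object PageAccessCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PageAccessCounts"))

    // Count accesses per page for one day's worth of logs.
    val counts = sc.textFile("hdfs:///logs/2015-04-05/*")
      .map(line => (line.split("\\s+")(0), 1L))
      .reduceByKey(_ + _)

    // Write one "page<TAB>count" line per page.
    counts.map { case (page, n) => s"$page\t$n" }
      .saveAsTextFile("hdfs:///counts/2015-04-05")

    sc.stop()
  }
}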

If you are not restricted to those, you may also look at a key/value or
column store such as Redis or Apache Cassandra. They also offer
storage- and performance-efficient mechanisms for counting unique users
(HyperLogLog), etc.
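
For example, Redis exposes HyperLogLog through its PFADD/PFCOUNT commands.
A minimal sketch using the Jedis client (the key name and user IDs are
invented for illustration):

import redis.clients.jedis.Jedis

val jedis = new Jedis("localhost", 6379)

// Add (possibly duplicate) user IDs to the HyperLogLog for one page.
jedis.pfadd("unique-users:page42", "user-1", "user-2", "user-1")

// Approximate distinct count (Redis documents ~0.81% standard error).
println(jedis.pfcount("unique-users:page42"))

jedis.close()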

In any case, you can join the new data with the historical data.
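
A sketch of that merge as a Spark job, assuming the cumulative totals are
kept as "page<TAB>count" text files (the paths are again invented):

import org.apache.spark.{SparkConf, SparkContext}

object MergeCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MergeCounts"))

    def load(path: String) = sc.textFile(path).map { line =>
      val Array(page, n) = line.split("\t")
      (page, n.toLong)
    }

    val historical = load("hdfs:///counts/current")   // totals so far
    val latest = load("hdfs:///counts/2015-04-05")    // newest batch

    // Full outer join so pages seen on only one side are kept as well.
    val updated = historical.fullOuterJoin(latest).mapValues {
      case (old, recent) => old.getOrElse(0L) + recent.getOrElse(0L)
    }

    updated.map { case (page, n) => s"$page\t$n" }
      .saveAsTextFile("hdfs:///counts/updated")

    sc.stop()
  }
}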

Best regards
On Apr 5, 2015 12:44, "Bahubali Jain" <bahub...@gmail.com> wrote:

> Hi,
> I have a requirement for which I plan to use Spark Streaming.
> I am supposed to calculate the access count for certain webpages. I receive
> the webpage access information through log files.
> By access count I mean "how many times was the page accessed *till now*".
> I have the log files for the past 2 years, and every day we keep receiving
> almost 6 GB of access logs (on an hourly basis).
> Since we receive these logs on an hourly basis, I feel that I should use
> Spark Streaming.
> But the problem is that the access counts have to be cumulative, i.e. even
> the older accesses (from the past 2 years) of a webpage should be
> considered in the final value.
>
> How do I achieve this through streaming, since streaming picks up only new
> files?
> I don't want to use a DB to store the access counts, since it would
> considerably slow down the processing.
>
> Thanks,
> Baahu
> --
> Twitter:http://twitter.com/Baahu
