Hello,
Just because you receive the log files hourly does not mean you have to use
Spark Streaming. Spark Streaming is typically used when you receive new events
every minute or second, potentially at an irregular frequency. Of course, your
analysis window can be larger.
I think your use case is a good fit for standard (batch) Spark or MapReduce.
If you are not restricted to those, you might also look at a key/value or
column store, such as Redis or Apache Cassandra. They also offer some nice
mechanisms for storage- and performance-efficient counting of unique users
(HyperLogLog), etc.
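To make the HyperLogLog mention concrete: it trades a small, bounded error for
constant memory, so you can estimate distinct users without storing them all.
Below is a toy Python sketch of the idea for illustration only; in practice you
would use a store's native support (e.g. Redis's PFADD/PFCOUNT commands) rather
than roll your own.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch for approximate distinct counting."""

    def __init__(self, b=10):
        self.b = b                # number of index bits
        self.m = 1 << b           # number of registers (1024 for b=10)
        self.registers = [0] * self.m

    def add(self, item):
        # 128-bit hash of the item
        h = int(hashlib.md5(str(item).encode("utf-8")).hexdigest(), 16)
        idx = h & (self.m - 1)    # low b bits pick a register
        w = h >> self.b           # remaining 128-b bits
        # rank = 1-based position of the leftmost 1-bit in w
        rank = (128 - self.b) - w.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        if est <= 2.5 * self.m:   # small-range correction
            zeros = self.registers.count(0)
            if zeros:
                est = self.m * math.log(self.m / zeros)
        return int(est)


hll = HyperLogLog(b=10)
for user in range(5000):
    hll.add(f"user-{user}")
# re-adding the same users does not change the estimate
for user in range(5000):
    hll.add(f"user-{user}")
print(hll.count())  # close to 5000, typically within a few percent
```

The whole structure is 1024 small integers regardless of how many users you
feed it, which is why it suits long-running "unique visitors till now" counters.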
In any case, you can join the new data with the historical data.
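The cumulative-count requirement from the original question boils down to this
join/merge: keep a running total per page and fold each new hourly batch into
it. A minimal plain-Python sketch (the page paths and log-line format here are
made up for illustration; in Spark Streaming proper, `updateStateByKey` plays
the same role):

```python
from collections import Counter

# running totals accumulated so far (hypothetical historical data)
historical = Counter({"/home": 120_000, "/pricing": 45_000})


def merge_hourly_batch(totals, hourly_log_lines):
    """Fold one hour's worth of access-log lines into the running totals.

    Each log line is assumed (for this sketch) to start with the page path
    as its first whitespace-separated field.
    """
    totals.update(line.split()[0] for line in hourly_log_lines if line.strip())
    return totals


# one new hourly batch of raw log lines
batch = ["/home 200 OK", "/home 200 OK", "/signup 200 OK"]
merge_hourly_batch(historical, batch)
print(historical["/home"])    # 120002
print(historical["/signup"])  # 1
```

Because the merge is a simple additive fold, it works equally well as an hourly
batch job over the new files, with the two years of history processed once up
front.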
Best regards
On 5 Apr 2015 12:44, Bahubali Jain bahub...@gmail.com wrote:
Hi,
I have a requirement for which I plan to use Spark Streaming.
I am supposed to calculate the access count for certain webpages. I receive
the webpage access information through log files.
By access count I mean how many times the page was accessed *till now*.
I have the log files for the past 2 years, and every day we receive
almost 6 GB of access logs (delivered on an hourly basis).
Since we receive these logs hourly, I feel that I should use
Spark Streaming.
But the problem is that the access counts have to be cumulative, i.e. even
the older accesses (from the past 2 years) of a webpage should also be
counted in the final value.
How can I achieve this through streaming, since streaming picks up only new files?
I don't want to use a DB to store the access counts, since it would
considerably slow down the processing.
Thanks,
Baahu
--
Twitter:http://twitter.com/Baahu