Hello, Just because you receive the log files hourly does not mean that you have to use Spark Streaming. Spark Streaming is typically used when you receive new events every minute or second, potentially at an irregular frequency. Of course, your analysis window can be larger.
I think your use case justifies standard Spark or MR. If you are not restricted to them, you may check a key/value or column store, such as Redis or Apache Cassandra. They also have some nice mechanisms for storage/performance-optimal counting of unique users (HyperLogLog) etc. In any case you can join the data with historical data. Best regards

On 5 Apr 2015, 12:44, "Bahubali Jain" <bahub...@gmail.com> wrote:
> Hi,
> I have a requirement in which I plan to use the SPARK Streaming.
> I am supposed to calculate the access count to certain webpages. I receive
> the webpage access information thru log files.
> By access count I mean "how many times was the page accessed *till now*".
> I have the log files for the past 2 years, and every day we keep receiving
> almost 6 GB of access logs (on an hourly basis).
> Since we receive these logs on an hourly basis I feel that I should use
> SPARK Streaming.
> But the problem is that the access counts have to be cumulative, i.e. even
> the older access counts (past 2 years) for a webpage should also be
> considered for the final value.
>
> How to achieve this thru streaming, since streaming picks only new files?
> I don't want to use a DB to store the access counts since it would
> considerably slow down the processing.
>
> Thanks,
> Baahu
> --
> Twitter: http://twitter.com/Baahu
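
As a minimal sketch of the cumulative-count idea (plain Python with hypothetical page names and counts, not actual Spark code): the historical totals are computed once as a batch job, and each new hourly batch is then merged into them. This is also the semantics Spark Streaming's `updateStateByKey` gives you per key, if you do go the streaming route.

```python
from collections import Counter

def update_totals(cumulative: Counter, hourly_counts: Counter) -> Counter:
    """Merge one hourly batch of page-access counts into the running totals.

    Mirrors the per-key state update of Spark Streaming's updateStateByKey:
    the carried-forward state (cumulative count per page) is incremented
    by the counts from the newest batch.
    """
    cumulative.update(hourly_counts)
    return cumulative

# Hypothetical numbers: totals bootstrapped once from the 2 years of
# historical logs, then one new hourly batch merged in.
totals = Counter({"/home": 1_200_000, "/about": 45_000})
hourly = Counter({"/home": 350, "/contact": 12})
totals = update_totals(totals, hourly)
```

The same merge works whether the state lives in Spark, in Redis counters, or in Cassandra counter columns; the key point is that only the deltas (the hourly logs) need reprocessing, not the full 2-year history.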