Hi,

I have a requirement for which I plan to use Spark Streaming. I need to calculate the access count for certain webpages, and I receive the page access information through log files. By access count I mean "how many times has the page been accessed *till now*".

I have the log files for the past 2 years, and every day we receive almost 6 GB of new access logs, delivered on an hourly basis. Since the logs arrive hourly, I feel that I should use Spark Streaming. The problem is that the access counts have to be cumulative, i.e. the older accesses (from the past 2 years) for a webpage must also be included in the final value.
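To make the requirement concrete, here is a rough sketch in plain Python (not Spark) of the behaviour I am after: seed the counts once from the historical logs, then fold each hourly batch into the same running total. The log format (page URL as the first whitespace-separated field), the sample lines, and the name `count_pages` are all made up for illustration.

```python
from collections import Counter


def count_pages(log_lines):
    """Count accesses per page.

    Assumes (hypothetically) that the page URL is the first
    whitespace-separated field of each log line.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


# One-off pass over the historical archive (made-up sample lines).
historical = ["/home a", "/about b", "/home c"]
running_total = count_pages(historical)

# Each hourly batch is then added to the same running total,
# so older accesses are never lost.
hourly_batches = [["/home d", "/contact e"], ["/home f"]]
for batch in hourly_batches:
    running_total.update(count_pages(batch))  # Counter.update adds counts
```

What I am looking for is the streaming equivalent of `running_total`: state that starts from the historical counts and keeps growing as new files arrive.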
How can I achieve this with streaming, given that streaming picks up only new files? I don't want to use a DB to store the access counts, since that would considerably slow down the processing.

Thanks,
Baahu
--
Twitter: http://twitter.com/Baahu