Re: Pseudo Spark Streaming ?

2015-04-05 Thread Jörn Franke
Hello,
Just because you receive the log files hourly does not mean that you have to
use Spark Streaming. Spark Streaming is often used when you receive new events
every minute or second, potentially at an irregular frequency. Of course, your
analysis window can be larger.

I think your use case justifies standard Spark or MapReduce.

If you are not restricted to those, you may check out a key/value or column
store, such as Redis or Apache Cassandra. They also have some nice mechanisms
for storage- and performance-efficient counting of unique users
(HyperLogLog), etc.
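For context, HyperLogLog estimates the number of distinct elements (e.g. unique users) in a fixed, small amount of memory instead of storing every ID; Redis exposes it through the PFADD/PFCOUNT commands. A toy sketch of the idea in plain Python (not any production implementation; the hash function and register count here are arbitrary choices for illustration):

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: estimates the distinct-element count using
    m = 2^b small registers (~1.04/sqrt(m) relative error) instead
    of remembering every element."""

    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias correction for large m

    def add(self, item):
        # 64-bit hash of the item (md5 chosen arbitrarily for the sketch)
        x = int.from_bytes(hashlib.md5(item.encode()).digest()[:8], "big")
        j = x & (self.m - 1)                       # low b bits pick a register
        w = x >> self.b                            # remaining 64-b bits
        rho = (64 - self.b) - w.bit_length() + 1   # rank of the leftmost 1-bit
        self.registers[j] = max(self.registers[j], rho)

    def count(self):
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        if est <= 2.5 * self.m:                    # small-range correction
            zeros = self.registers.count(0)
            if zeros:
                est = self.m * math.log(self.m / zeros)  # linear counting
        return int(est)

hll = HyperLogLog()
for i in range(10000):
    hll.add("user%d" % i)
    hll.add("user%d" % i)   # duplicates do not change the registers
```

With 10,000 distinct users added (each twice), the estimate lands within a few percent of 10,000 while using only 1,024 single-byte registers.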

In any case, you can join the new data with the historical data.
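The batch approach can be sketched as follows in plain Python rather than the Spark API (the page names, log format, and helper function here are hypothetical): keep the cumulative per-page totals, count the hits in the new hourly log batch, and merge the two. In Spark this would be a reduceByKey on the new batch followed by a join or union with the stored totals.

```python
from collections import Counter

def merge_hourly_batch(historical, hourly_lines):
    """Merge per-page hit counts from one hourly log batch into the
    cumulative historical totals. Assumes (hypothetically) that the
    first whitespace-separated field of each log line is the page."""
    batch = Counter(line.split()[0] for line in hourly_lines if line.strip())
    historical.update(batch)   # cumulative: old counts + new counts
    return historical

# Hypothetical historical totals (e.g. loaded from a previous batch run)
totals = Counter({"/home": 120, "/about": 30})

# Hypothetical log lines for the latest hour
new_logs = [
    "/home 2015-04-05T10:01:12",
    "/home 2015-04-05T10:02:45",
    "/faq 2015-04-05T10:03:01",
]
totals = merge_hourly_batch(totals, new_logs)
# totals now holds /home -> 122, /about -> 30, /faq -> 1
```

Because the merge is a simple per-key addition, only the previous totals and the new hour's logs need to be read each run, not the full two years of raw data.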

Best regards
On 5 Apr 2015 at 12:44, Bahubali Jain bahub...@gmail.com wrote:

 Hi,
 I have a requirement for which I plan to use Spark Streaming.
 I am supposed to calculate the access count for certain webpages. I receive
 the webpage access information through log files.
 By access count I mean how many times the page was accessed *till now*.
 I have the log files for the past 2 years, and every day we keep receiving
 almost 6 GB of access logs (on an hourly basis).
 Since we receive these logs on an hourly basis, I feel that I should use
 Spark Streaming.
 But the problem is that the access counts have to be cumulative, i.e. even
 the older accesses (past 2 years) to a webpage should also be
 counted in the final value.

 How can I achieve this through streaming, since streaming picks up only new
 files? I don't want to use a DB to store the access counts since it would
 considerably slow down the processing.

 Thanks,
 Baahu
 --
 Twitter:http://twitter.com/Baahu




Pseudo Spark Streaming ?

2015-04-05 Thread Bahubali Jain
Hi,
I have a requirement for which I plan to use Spark Streaming.
I am supposed to calculate the access count for certain webpages. I receive
the webpage access information through log files.
By access count I mean how many times the page was accessed *till now*.
I have the log files for the past 2 years, and every day we keep receiving
almost 6 GB of access logs (on an hourly basis).
Since we receive these logs on an hourly basis, I feel that I should use
Spark Streaming.
But the problem is that the access counts have to be cumulative, i.e. even
the older accesses (past 2 years) to a webpage should also be
counted in the final value.

How can I achieve this through streaming, since streaming picks up only new
files? I don't want to use a DB to store the access counts since it would
considerably slow down the processing.

Thanks,
Baahu
-- 
Twitter:http://twitter.com/Baahu