2 good benefits of Streaming 1. maintains windows as you move across time, removing & adding monads as you move through the window 2. Connect with streaming systems like kafka to import data as it comes & process it
You dont seem to need any of these features, you would be better off using Spark with crontab maybe :), serializing your object in HDFS if its huge, or maintaining it in memory. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi <https://twitter.com/mayur_rustagi> On Tue, Mar 25, 2014 at 2:04 PM, Sanjay Awatramani <sanjay_a...@yahoo.com>wrote: > Hi, > > I had initially thought of a streaming approach to solve my problem, and I > am stuck at few places and want opinion if this problem is suitable for > streaming, or is it better to stick to basic spark. > > Problem: I get chunks of log files in a folder and need to do some > analysis on them on an hourly interval, eg. 11.00 to 11.59. The file chunks > may or may not come in real time and there can be breaks between subsequent > chunks. > > pseudocode: > While{ > CheckForFile(localFolder) > CopyToHDFS() > RDDfile=read(fileFromHDFS) > RDDHour=RDDHour.union.RDDfile.filter(keyHour=currentHr) > if(RDDHour.keys().contains(currentHr+1) //next Hr has come, so current > Hr should be complete > { > RDDHour.process() > deleteFileFromHDFS() > RDDHour.empty() > currentHr++ > } > } > > If I use streaming, I face the following problems: > 1) Inability to keep a Java Variable (currentHr) in the driver which can > be used across batches. > 2) The input files may come with a break, for eg. 10.00 - 10.30 comes, > then a break for 4 hours. If I use streaming, then I can't process the > 10.00 - 10.30 batch as its incomplete, and the 1 hour DStream window for > the 10.30 - 11.00 file will have previous RDD as empty as nothing was > received in the preceding 4 hours. Basically Streaming takes file time as > input and not the time inside the file content. > 3) no control on deleting file from HDFS as the program runs in a > SparkStreamingContext loop > > Any ideas on overcoming the above limitations or whether streaming is > suitable for such kind of problem or not, will be helpful. > > Regards, > Sanjay >