Hi,

I had initially thought of a streaming approach to solve my problem, but I am
stuck at a few places and want an opinion on whether this problem is suitable
for streaming, or whether it is better to stick to basic Spark.

Problem: I get chunks of log files in a folder and need to do some analysis on
them on an hourly interval, e.g. 11.00 to 11.59. The file chunks may or may not
arrive in real time, and there can be breaks between subsequent chunks.

pseudocode:
while (true) {
  CheckForFile(localFolder)
  CopyToHDFS()
  RDDfile = read(fileFromHDFS)
  // accumulate only the records belonging to the hour being assembled
  RDDHour = RDDHour.union(RDDfile.filter(keyHour == currentHr))
  // check the incoming chunk (not RDDHour, which was already filtered to currentHr)
  if (RDDfile.keys().contains(currentHr + 1)) { // next hour has arrived, so the
                                                // current hour should be complete
    RDDHour.process()
    deleteFileFromHDFS()
    RDDHour = empty()
    currentHr++
  }
}
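
To make the batch idea concrete, here is a minimal runnable sketch in Java. The
HDFS paths, the extractHour() parser, the copy/delete helper steps, and the
one-minute polling interval are all assumptions, and saveAsTextFile() merely
stands in for the real process() step:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HourlyBatchLoop {

    // Hypothetical parser: pull the hour-of-day out of a log line,
    // assuming a "yyyy-MM-dd HH:..." timestamp layout.
    static int extractHour(String line) {
        return Integer.parseInt(line.substring(11, 13));
    }

    public static void main(String[] args) throws InterruptedException {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("HourlyLogBatch"));

        int currentHr = 10;                       // hour being assembled
        JavaRDD<String> rddHour = sc.emptyRDD();  // records seen so far for currentHr

        while (true) {
            // a copy-new-chunks-to-HDFS step would go here (assumed to exist)
            JavaRDD<String> rddFile = sc.textFile("hdfs:///logs/incoming/*");

            // keep only the records that belong to the hour we are assembling
            final int hr = currentHr;
            rddHour = rddHour.union(rddFile.filter(line -> extractHour(line) == hr));

            // if the incoming chunk already contains next-hour records,
            // the current hour must be complete: process it and roll over
            final int nextHr = hr + 1;
            if (rddFile.filter(line -> extractHour(line) == nextHr).count() > 0) {
                rddHour.saveAsTextFile("hdfs:///logs/processed/hour=" + hr);
                // a delete-processed-chunks-from-HDFS step would go here
                rddHour = sc.emptyRDD();
                currentHr++;
            }
            Thread.sleep(60_000); // poll once a minute
        }
    }
}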

If I use streaming, I face the following problems:
1) Inability to keep a Java variable (currentHr) in the driver that can be used
across batches (a possible workaround is sketched after this list).
2) The input files may arrive with a break, e.g. 10.00 - 10.30 arrives, then
there is a break of 4 hours. If I use streaming, I can't process the 10.00 -
10.30 batch as it's incomplete, and the 1-hour DStream window for the 10.30 -
11.00 file will see the previous RDDs as empty because nothing was received in
the preceding 4 hours. Basically, streaming batches files by their arrival
time, not by the timestamps inside the file content.
3) No control over deleting files from HDFS, as the program runs inside a
SparkStreamingContext loop.
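
On problem 1: the function passed to foreachRDD executes on the driver once per
batch, so a driver-side variable (e.g. an AtomicInteger) can be carried across
batches, as long as it is only touched inside foreachRDD and not inside
transformations that are shipped to the executors. A minimal sketch, assuming
the same hypothetical extractHour() parser and paths as above:

import java.util.concurrent.atomic.AtomicInteger;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DriverStateAcrossBatches {
    public static void main(String[] args) throws InterruptedException {
        JavaStreamingContext ssc = new JavaStreamingContext(
                new SparkConf().setAppName("DriverState"), Durations.minutes(1));

        // lives on the driver; visible to every foreachRDD invocation
        final AtomicInteger currentHr = new AtomicInteger(10);

        JavaDStream<String> lines = ssc.textFileStream("hdfs:///logs/incoming");

        lines.foreachRDD(rdd -> {
            // this closure runs on the driver once per batch, so reading
            // and updating currentHr here is safe
            final int hr = currentHr.get();
            long nextHrRecords = rdd.filter(l -> extractHour(l) == hr + 1).count();
            if (nextHrRecords > 0) {
                currentHr.incrementAndGet(); // roll over to the next hour
            }
        });

        ssc.start();
        ssc.awaitTermination();
    }

    // hypothetical parser, same assumption as above
    static int extractHour(String line) {
        return Integer.parseInt(line.substring(11, 13));
    }
}

This only addresses point 1; points 2 and 3 (event time vs. arrival time, and
file deletion) still push towards the batch loop above.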

Any ideas on overcoming the above limitations, or on whether streaming is
suitable for this kind of problem at all, would be helpful.

Regards,
Sanjay
