
I have a Spark Streaming job that reads messages/prices from Kafka and writes
them as text files to HDFS.

This is pretty efficient. Its only function is to persist the incoming
messages to HDFS.

This is what it does:
     dstream.foreachRDD { pricesRDD =>
       val x = pricesRDD.count
       // Check whether any messages arrived in this batch
       if (x > 0)
         // Combine all partitions into a single partition (one output file)
         val cachedRDD = pricesRDD.repartition(1).cache
         cachedRDD.saveAsTextFile("/data/prices/prices_" +

So these are the entries in the HDFS directory:

drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11
drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11
drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11
drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11

Now I present these prices in Zeppelin. These files are produced every 2
seconds. However, when I come to plot them, I am only interested in, say,
one hour's data.

I cater for this by applying a filter on the prices (each has a TIMECREATED).

I don't think this is efficient, as I don't want to load all these files. I
just want to read the prices created in the past hour or so.

One thing I considered was to load all prices by converting
System.currentTimeMillis into today's date and fetching the most recent ones.
However, this looks cumbersome. I can create these files with any
timestamp extension when persisting, but System.currentTimeMillis seems to
be the most efficient.
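
To make the idea concrete, here is a rough sketch of selecting only the recent directories by their name suffix. It assumes the directories are named prices_<epoch millis>. That naming scheme, the object RecentPrices and its two helpers are illustrative assumptions, not part of the job above.

```scala
object RecentPrices {
  // Assumed (hypothetical) naming scheme: <base>/prices_<epoch millis>
  private val Suffix = """.*/prices_(\d+)$""".r

  // Extract the epoch-millis suffix from a directory path, if there is one
  def timestampOf(path: String): Option[Long] = path match {
    case Suffix(ms) => scala.util.Try(ms.toLong).toOption
    case _          => None
  }

  // Keep only the paths whose suffix falls within the last windowMs milliseconds
  def recentPaths(paths: Seq[String], nowMs: Long, windowMs: Long): Seq[String] =
    paths.filter { p =>
      timestampOf(p).exists(t => t > nowMs - windowMs && t <= nowMs)
    }
}
```

The resulting paths could then be joined with commas and passed to sc.textFile, which accepts a comma-separated list of paths, so that only the last hour's directories are read. The directory listing itself would have to come from somewhere such as the Hadoop FileSystem API; that part is not shown here.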

Any alternatives you can think of?


