Hi,

I have a Spark streaming that reads messages/prices from Kafka and writes
it as text file to HDFS.

This is pretty efficient. Its only function is to persist the incoming
messages to HDFS.

This is what it does
     dstream.foreachRDD { pricesRDD =>
       val x= pricesRDD.count
       // Check if any messages in
       if (x > 0)
       {
           // Combine each partition's results into a single RDD
         val cachedRDD = pricesRDD.repartition(1).cache
         cachedRDD.saveAsTextFile("/data/prices/prices_" +
System.currentTimeMillis.toString)
....

So these are the files on HDFS directory

drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11
/data/prices/prices_1473862284010
drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11
/data/prices/prices_1473862288010
drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11
/data/prices/prices_1473862290010
drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11
/data/prices/prices_1473862294010

Now I present these prices to Zeppelin. These files are produced every 2
seconds. However, when I get to plot them, I am only interesting in one
hours data say.

I cater for this by using filter on prices (each has a TIMECREATED).


I don't think this is efficient as I don't want to load all these files. I
just want to  to read the prices created in past hour or something.


One thing I considered was to load all prices by converting
System.currentTimeMillis into today's date and fetch the most recent ones.
However, this is looking cumbersome. I can create these files with any
timestamp extension when persisting but System.currentTimeMillis seems to
be most efficient.


Any alternatives you can think of?


Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Reply via email to