Hi, An alternative to Spark could be flume to store data from Kafka to HDFS. It provides also some reliability mechanisms and has been explicitly designed for import/export and is tested. Not sure if i would go for spark streaming if the use case is only storing, but I do not have the full picture of your use case.
Anyway, what you could do is create a directory / hour/ day etc (whatever you need) and put the corresponding files there. If there are a lot of small files you can put them into a Hadoop Archive (HAR) to reduce load on the namenode. Best regards > On 14 Sep 2016, at 17:28, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > > Hi, > > I have a Spark streaming that reads messages/prices from Kafka and writes it > as text file to HDFS. > > This is pretty efficient. Its only function is to persist the incoming > messages to HDFS. > > This is what it does > dstream.foreachRDD { pricesRDD => > val x= pricesRDD.count > // Check if any messages in > if (x > 0) > { > // Combine each partition's results into a single RDD > val cachedRDD = pricesRDD.repartition(1).cache > cachedRDD.saveAsTextFile("/data/prices/prices_" + > System.currentTimeMillis.toString) > .... > > So these are the files on HDFS directory > > drwxr-xr-x - hduser supergroup 0 2016-09-14 15:11 > /data/prices/prices_1473862284010 > drwxr-xr-x - hduser supergroup 0 2016-09-14 15:11 > /data/prices/prices_1473862288010 > drwxr-xr-x - hduser supergroup 0 2016-09-14 15:11 > /data/prices/prices_1473862290010 > drwxr-xr-x - hduser supergroup 0 2016-09-14 15:11 > /data/prices/prices_1473862294010 > > Now I present these prices to Zeppelin. These files are produced every 2 > seconds. However, when I get to plot them, I am only interesting in one hours > data say. > I cater for this by using filter on prices (each has a TIMECREATED). > > I don't think this is efficient as I don't want to load all these files. I > just want to to read the prices created in past hour or something. > > One thing I considered was to load all prices by converting > System.currentTimeMillis into today's date and fetch the most recent ones. > However, this is looking cumbersome. I can create these files with any > timestamp extension when persisting but System.currentTimeMillis seems to be > most efficient. > > Any alternatives you can think of? > > Thanks > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > http://talebzadehmich.wordpress.com > > Disclaimer: Use it at your own risk. Any and all responsibility for any loss, > damage or destruction of data or any other property which may arise from > relying on this email's technical content is explicitly disclaimed. The > author will in no case be liable for any monetary damages arising from such > loss, damage or destruction. >