Re: Reading the most recent text files created by Spark streaming

Jörn Franke Wed, 14 Sep 2016 10:36:58 -0700

Hi,

An alternative to Spark could be Flume, which can move the data from Kafka to 
HDFS. It also provides some reliability mechanisms, was explicitly designed 
for import/export, and is tested. I am not sure I would go for Spark Streaming 
if the use case is only storing data, but I do not have the full picture of 
your use case.


Anyway, what you could do is create one directory per hour, day, etc. 
(whatever you need) and put the corresponding files there. If there are a lot 
of small files, you can put them into a Hadoop Archive (HAR) to reduce the 
load on the namenode.
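
To make the directory-per-hour idea concrete, here is a minimal sketch in 
plain Scala (JDK only). The object name HourlyPaths, the helper names and the 
<base>/yyyy-MM-dd/HH layout in UTC are my own assumptions for illustration, 
not something your job has to use; pick whatever granularity fits.

```scala
import java.time.Instant
import java.time.ZoneOffset
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

object HourlyPaths {
  // Assumed (hypothetical) layout: <base>/yyyy-MM-dd/HH, always in UTC.
  private val fmt =
    DateTimeFormatter.ofPattern("yyyy-MM-dd/HH").withZone(ZoneOffset.UTC)

  /** Directory for the hour that contains the given epoch-millisecond timestamp. */
  def hourDir(base: String, epochMillis: Long): String =
    s"$base/${fmt.format(Instant.ofEpochMilli(epochMillis))}"

  /** Directories for the current hour and the `hoursBack` hours before it, oldest first. */
  def recentHourDirs(base: String, nowMillis: Long, hoursBack: Int): Seq[String] = {
    val currentHour = Instant.ofEpochMilli(nowMillis).truncatedTo(ChronoUnit.HOURS)
    (hoursBack to 0 by -1).map { h =>
      hourDir(base, currentHour.minus(h.toLong, ChronoUnit.HOURS).toEpochMilli)
    }
  }
}
```

On the writing side, the job could then save each batch under something like 
HourlyPaths.hourDir("/data/prices", now) + "/prices_" + now. On the reading 
side, sc.textFile accepts a comma-separated list of paths, so 
recentHourDirs("/data/prices", now, 1).mkString(",") would load only the 
current and the previous hour. You would still need to skip directories that 
do not exist yet, and keep the TIMECREATED filter if you need an exact 
60-minute window.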

Best regards

> On 14 Sep 2016, at 17:28, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Hi,
> 
> I have a Spark Streaming job that reads messages/prices from Kafka and 
> writes them as text files to HDFS.
> 
> This is pretty efficient. Its only function is to persist the incoming 
> messages to HDFS.
> 
> This is what it does
>      dstream.foreachRDD { pricesRDD =>
>        val x = pricesRDD.count
>        // Check if any messages in
>        if (x > 0)
>        {
>          // Combine each partition's results into a single RDD
>          val cachedRDD = pricesRDD.repartition(1).cache
>          cachedRDD.saveAsTextFile("/data/prices/prices_" + System.currentTimeMillis.toString)
> ....
> 
> So these are the files on HDFS directory
> 
> drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11 /data/prices/prices_1473862284010
> drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11 /data/prices/prices_1473862288010
> drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11 /data/prices/prices_1473862290010
> drwxr-xr-x   - hduser supergroup          0 2016-09-14 15:11 /data/prices/prices_1473862294010
> 
> Now I present these prices in Zeppelin. These files are produced every 2 
> seconds. However, when I get to plot them, I am only interested in, say, 
> one hour's data.
> I cater for this by using a filter on prices (each has a TIMECREATED).
> 
> I don't think this is efficient, as I don't want to load all these files. I 
> just want to read the prices created in the past hour or so.
> 
> One thing I considered was to load all prices by converting 
> System.currentTimeMillis into today's date and fetch the most recent ones. 
> However, this looks cumbersome. I can create these files with any 
> timestamp suffix when persisting, but System.currentTimeMillis seems to be 
> the most efficient.
> 
> Any alternatives you can think of?
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>
