Re: Reading the most recent text files created by Spark streaming

2016-09-15 Thread Mich Talebzadeh
Yes thanks. I had flume already for twitter so configured it to get data
from Kafka source and post it to HDFS.

cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 September 2016 at 18:36, Jörn Franke  wrote:

> Hi,
>
> An alternative to Spark could be flume to store data from Kafka to HDFS.
> It provides also some reliability mechanisms and has been explicitly
> designed for import/export and is tested. Not sure if i would go for spark
> streaming if the use case is only storing, but I do not have the full
> picture of your use case.
>
> Anyway, what you could do is create a directory / hour/ day etc (whatever
> you need) and put the corresponding files there. If there are a lot of
> small files you can put them into a Hadoop Archive (HAR) to reduce load on
> the namenode.
>
> Best  regards
>
> On 14 Sep 2016, at 17:28, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> I have a Spark streaming that reads messages/prices from Kafka and writes
> it as text file to HDFS.
>
> This is pretty efficient. Its only function is to persist the incoming
> messages to HDFS.
>
> This is what it does
>  dstream.foreachRDD { pricesRDD =>
>val x= pricesRDD.count
>// Check if any messages in
>if (x > 0)
>{
>// Combine each partition's results into a single RDD
>  val cachedRDD = pricesRDD.repartition(1).cache
>  cachedRDD.saveAsTextFile("/data/prices/prices_" +
> System.currentTimeMillis.toString)
> 
>
> So these are the files on HDFS directory
>
> drwxr-xr-x   - hduser supergroup  0 2016-09-14 15:11
> /data/prices/prices_1473862284010
> drwxr-xr-x   - hduser supergroup  0 2016-09-14 15:11
> /data/prices/prices_1473862288010
> drwxr-xr-x   - hduser supergroup  0 2016-09-14 15:11
> /data/prices/prices_1473862290010
> drwxr-xr-x   - hduser supergroup  0 2016-09-14 15:11
> /data/prices/prices_1473862294010
>
> Now I present these prices to Zeppelin. These files are produced every 2
> seconds. However, when I get to plot them, I am only interesting in one
> hours data say.
>
> I cater for this by using filter on prices (each has a TIMECREATED).
>
>
> I don't think this is efficient as I don't want to load all these files. I
> just want to  to read the prices created in past hour or something.
>
>
> One thing I considered was to load all prices by converting
> System.currentTimeMillis into today's date and fetch the most recent ones.
> However, this is looking cumbersome. I can create these files with any
> timestamp extension when persisting but System.currentTimeMillis seems to
> be most efficient.
>
>
> Any alternatives you can think of?
>
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>


Re: Reading the most recent text files created by Spark streaming

2016-09-14 Thread Jörn Franke
Hi,

An alternative to Spark could be flume to store data from Kafka to HDFS. It 
provides also some reliability mechanisms and has been explicitly designed for 
import/export and is tested. Not sure if i would go for spark streaming if the 
use case is only storing, but I do not have the full picture of your use case.

Anyway, what you could do is create a directory / hour/ day etc (whatever you 
need) and put the corresponding files there. If there are a lot of small files 
you can put them into a Hadoop Archive (HAR) to reduce load on the namenode.

Best  regards

> On 14 Sep 2016, at 17:28, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> I have a Spark streaming that reads messages/prices from Kafka and writes it 
> as text file to HDFS.
> 
> This is pretty efficient. Its only function is to persist the incoming 
> messages to HDFS.
> 
> This is what it does
>  dstream.foreachRDD { pricesRDD =>
>val x= pricesRDD.count
>// Check if any messages in
>if (x > 0)
>{
>// Combine each partition's results into a single RDD
>  val cachedRDD = pricesRDD.repartition(1).cache
>  cachedRDD.saveAsTextFile("/data/prices/prices_" + 
> System.currentTimeMillis.toString)
> 
> 
> So these are the files on HDFS directory
> 
> drwxr-xr-x   - hduser supergroup  0 2016-09-14 15:11 
> /data/prices/prices_1473862284010
> drwxr-xr-x   - hduser supergroup  0 2016-09-14 15:11 
> /data/prices/prices_1473862288010
> drwxr-xr-x   - hduser supergroup  0 2016-09-14 15:11 
> /data/prices/prices_1473862290010
> drwxr-xr-x   - hduser supergroup  0 2016-09-14 15:11 
> /data/prices/prices_1473862294010
> 
> Now I present these prices to Zeppelin. These files are produced every 2 
> seconds. However, when I get to plot them, I am only interesting in one hours 
> data say.
> I cater for this by using filter on prices (each has a TIMECREATED).
> 
> I don't think this is efficient as I don't want to load all these files. I 
> just want to  to read the prices created in past hour or something.
> 
> One thing I considered was to load all prices by converting 
> System.currentTimeMillis into today's date and fetch the most recent ones. 
> However, this is looking cumbersome. I can create these files with any 
> timestamp extension when persisting but System.currentTimeMillis seems to be 
> most efficient.
> 
> Any alternatives you can think of?
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>