Re: Best practise for 'Streaming' dumps?
Yeah... I have not tried it, but if you set slidingDuration == windowDuration, that should prevent overlaps.

Gino B.
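The tumbling-window idea above can be sketched in plain Python (this is an illustration of the semantics, not Spark API; the `windows` helper is made up for the example):

```python
def windows(batches, window_len, slide_len):
    """Yield successive windows over a list of micro-batches.

    window_len: how many batches each window spans.
    slide_len:  how many batches the window advances each step.
    When slide_len == window_len the windows are "tumbling"
    (no overlap); when slide_len < window_len they overlap and
    each batch is emitted multiple times.
    """
    for start in range(0, len(batches) - window_len + 1, slide_len):
        yield batches[start:start + window_len]

# Six one-second batches, numbered 0..5.
batches = list(range(6))

# Sliding: window of 3, sliding by 1 -- each batch saved three times.
sliding = list(windows(batches, window_len=3, slide_len=1))
# [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]

# Tumbling: window of 3, sliding by 3 -- each batch saved exactly once.
tumbling = list(windows(batches, window_len=3, slide_len=3))
# [[0, 1, 2], [3, 4, 5]]
```

This is why setting the slide interval equal to the window duration avoids re-saving the same data every interval.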
Re: Best practise for 'Streaming' dumps?
I read it more carefully, and window() might actually work for some other stuff like logs (assuming I can have multiple windows, with entirely different attributes, on a single stream).

Thanks for that!

--
Jeremy Lee BCompSci(Hons)
The Unorthodox Engineers
Re: Best practise for 'Streaming' dumps?
Yes... but from what I understand that's a "sliding window", so for a window of (60) over (1)-second DStreams, that would save the entire last minute of data once per second. That's more than I need.

I think what I'm after is probably updateStateByKey... I want to mutate data structures (probably even graphs) as the stream comes in, but I also want that state to be persistent across restarts of the application (or parallel versions of the app, if possible). So I'd have to save that structure occasionally and reload it as the "primer" on the next run.

I was almost going to use HBase or Hive, but they seem to have been deprecated in 1.0.0? Or just late to the party?

Also, I've been having trouble deleting Hadoop directories... the old "two line" examples don't seem to work anymore. I actually managed to fill up the worker instances (I gave them tiny EBS volumes) and I think I crashed them.

--
Jeremy Lee BCompSci(Hons)
The Unorthodox Engineers
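The updateStateByKey pattern plus the "save occasionally, reload as a primer" idea can be sketched in plain Python (again, illustrative semantics rather than Spark API; the function names and the snapshot file are made up for the example):

```python
import json
import os
import tempfile

def update_state(state, batch):
    """Fold one micro-batch of (key, value) pairs into the state dict,
    mirroring what an updateStateByKey update function does per key."""
    for key, value in batch:
        state[key] = state.get(key, 0) + value
    return state

def snapshot(state, path):
    """Persist the current state so it survives an application restart."""
    with open(path, "w") as f:
        json.dump(state, f)

def restore(path):
    """Load the last snapshot as the primer for the next run,
    or start empty if no snapshot exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

path = os.path.join(tempfile.gettempdir(), "stream_state.json")

state = {}  # on a restart this would be: state = restore(path)
state = update_state(state, [("a", 1), ("b", 2)])
state = update_state(state, [("a", 3)])
snapshot(state, path)  # state == {"a": 4, "b": 2}, now on disk
```

In real Spark the snapshot step corresponds to checkpointing (updateStateByKey requires a checkpoint directory), but the save-then-prime shape is the same.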
Re: Best practise for 'Streaming' dumps?
Have you thought of using window?

Gino B.
Best practise for 'Streaming' dumps?
It's going well enough that this is a "how should I in 1.0.0" rather than a "how do I" question.

So I've got data coming in via Streaming (twitters) and I want to archive/log it all. It seems a bit wasteful to generate a new HDFS file for each DStream, but I also want to guard against data loss from crashes.

I suppose what I want is to let things build up into "superbatches" over a few minutes, and then serialize those to Parquet files, or similar? Or do I?

Do I count down the number of DStreams, or does Spark have a preferred way of scheduling cron events?

What's the best practise for keeping persistent data for a streaming app (across restarts)? And to clean up on termination?

--
Jeremy Lee BCompSci(Hons)
The Unorthodox Engineers
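The "superbatch" idea in the original question can be sketched in plain Python (illustrative only; `SuperBatcher` and its callback are hypothetical names, not a Spark class):

```python
class SuperBatcher:
    """Buffer per-micro-batch records and flush one output per
    flush_every batches, instead of one file per DStream batch."""

    def __init__(self, flush_every, write):
        self.flush_every = flush_every   # e.g. 120 one-second batches
        self.write = write               # callback that persists a superbatch
        self.buffer = []
        self.count = 0

    def add_batch(self, records):
        """Called once per micro-batch, as a foreachRDD-style hook would be."""
        self.buffer.extend(records)
        self.count += 1
        if self.count % self.flush_every == 0:
            self.write(list(self.buffer))  # one file instead of flush_every files
            self.buffer.clear()

files = []
batcher = SuperBatcher(flush_every=3, write=files.append)
for i in range(7):                         # seven micro-batches arrive
    batcher.add_batch([i])
# files == [[0, 1, 2], [3, 4, 5]]; batch [6] is still buffered
```

The trade-off is exactly the one the question raises: the longer the buffer, the fewer files, but the more data sits unpersisted and could be lost in a crash.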