Re: Best practice for 'Streaming' dumps?

2014-06-08 Thread Gino Bustelo
Yeah... I haven't tried it, but if you set slidingDuration == windowDuration,
that should prevent overlaps.
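
Something along these lines (untested, off the top of my head; the socket
source and HDFS path below are just placeholders for your Twitter stream and
archive location):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // 1-second batches rolled up into non-overlapping 60-second "superbatches"
    // by making the slide interval equal to the window length.
    val ssc = new StreamingContext(new SparkConf().setAppName("StreamDump"), Seconds(1))
    val tweets = ssc.socketTextStream("localhost", 9999)   // stand-in for your source

    tweets
      .window(Seconds(60), Seconds(60))           // window == slide -> no overlap
      .saveAsTextFiles("hdfs:///archive/tweets")  // one output directory per 60s window

    ssc.start()
    ssc.awaitTermination()

Since the window equals the slide, each record should land in exactly one dump.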

Gino B.

> On Jun 8, 2014, at 8:25 AM, Jeremy Lee  wrote:
> 
> I read it more carefully, and window() might actually work for some other 
> stuff like logs. (assuming I can have multiple windows with entirely 
> different attributes on a single stream..) 
> 
> Thanks for that!
> 
> 
>> On Sun, Jun 8, 2014 at 11:11 PM, Jeremy Lee  
>> wrote:
>> Yes... but from what I understand that's a "sliding window", so a 60-second
>> window over 1-second batch DStreams would save the entire last minute of
>> data once per second. That's more than I need.
>> 
>> I think what I'm after is probably updateStateByKey... I want to mutate data
>> structures (probably even graphs) as the stream comes in, but I also want
>> that state to be persistent across restarts of the application (or a
>> parallel version of the app, if possible). So I'd have to save that
>> structure occasionally and reload it as the "primer" on the next run.
>> 
>> I was almost going to use HBase or Hive, but they seem to have been
>> deprecated in 1.0.0? Or just late to the party?
>> 
>> Also, I've been having trouble deleting Hadoop directories... the old
>> "two-line" examples don't seem to work anymore. I actually managed to fill
>> up the worker instances (I gave them tiny EBS volumes) and I think I
>> crashed them.
>> 
>> 
>> 
>>> On Sat, Jun 7, 2014 at 10:23 PM, Gino Bustelo  wrote:
>>> Have you thought of using window?
>>> 
>>> Gino B.
>>> 
>>> > On Jun 6, 2014, at 11:49 PM, Jeremy Lee  
>>> > wrote:
>>> >
>>> >
>>> > It's going well enough that this is a "how should I in 1.0.0" question
>>> > rather than a "how do I" question.
>>> >
>>> > So I've got data coming in via Streaming (tweets) and I want to
>>> > archive/log it all. It seems a bit wasteful to generate a new HDFS file
>>> > for each batch of the DStream, but I also want to guard against data
>>> > loss from crashes.
>>> >
>>> > I suppose what I want is to let things build up into "superbatches" over
>>> > a few minutes, and then serialize those to Parquet files, or similar?
>>> > Or do I?
>>> >
>>> > Do I count down batches myself, or does Spark have a preferred way of
>>> > scheduling cron-like events?
>>> >
>>> > What's the best practice for keeping persistent data for a streaming app
>>> > across restarts? And for cleaning up on termination?
>>> >
>>> >
>>> > --
>>> > Jeremy Lee  BCompSci(Hons)
>>> >   The Unorthodox Engineers
>> 
>> 
>> 
>> -- 
>> Jeremy Lee  BCompSci(Hons)
>>   The Unorthodox Engineers
> 
> 
> 
> -- 
> Jeremy Lee  BCompSci(Hons)
>   The Unorthodox Engineers


Re: Best practice for 'Streaming' dumps?

2014-06-08 Thread Jeremy Lee
I read it more carefully, and window() might actually work for some other
stuff like logs. (assuming I can have multiple windows with entirely
different attributes on a single stream..)
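
Something like this is what I have in mind (completely untested; the socket
source and paths are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    // Two independent windows over the same source DStream: one for bulk
    // archiving, one as a rolling view for log-style monitoring.
    val ssc = new StreamingContext(new SparkConf().setAppName("MultiWindow"), Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)    // placeholder source

    val archive = lines.window(Minutes(5), Minutes(5))       // non-overlapping dumps
    val recent  = lines.filter(_.contains("ERROR"))
                       .window(Seconds(60), Seconds(10))     // sliding last-minute view

    archive.saveAsTextFiles("hdfs:///archive/lines")
    recent.print()

    ssc.start()
    ssc.awaitTermination()

Each window is its own transformation, so the archive dump and the rolling
view shouldn't interfere with each other.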

Thanks for that!


On Sun, Jun 8, 2014 at 11:11 PM, Jeremy Lee 
wrote:

> Yes... but from what I understand that's a "sliding window", so a 60-second
> window over 1-second batch DStreams would save the entire last minute of
> data once per second. That's more than I need.
>
> I think what I'm after is probably updateStateByKey... I want to mutate
> data structures (probably even graphs) as the stream comes in, but I also
> want that state to be persistent across restarts of the application (or a
> parallel version of the app, if possible). So I'd have to save that
> structure occasionally and reload it as the "primer" on the next run.
>
> I was almost going to use HBase or Hive, but they seem to have been
> deprecated in 1.0.0? Or just late to the party?
>
> Also, I've been having trouble deleting Hadoop directories... the old
> "two-line" examples don't seem to work anymore. I actually managed to fill
> up the worker instances (I gave them tiny EBS volumes) and I think I
> crashed them.
>
>
>
> On Sat, Jun 7, 2014 at 10:23 PM, Gino Bustelo  wrote:
>
>> Have you thought of using window?
>>
>> Gino B.
>>
>> > On Jun 6, 2014, at 11:49 PM, Jeremy Lee 
>> wrote:
>> >
>> >
>> > It's going well enough that this is a "how should I in 1.0.0" question
>> > rather than a "how do I" question.
>> >
>> > So I've got data coming in via Streaming (tweets) and I want to
>> > archive/log it all. It seems a bit wasteful to generate a new HDFS file
>> > for each batch of the DStream, but I also want to guard against data loss
>> > from crashes.
>> >
>> > I suppose what I want is to let things build up into "superbatches" over
>> > a few minutes, and then serialize those to Parquet files, or similar?
>> > Or do I?
>> >
>> > Do I count down batches myself, or does Spark have a preferred way of
>> > scheduling cron-like events?
>> >
>> > What's the best practice for keeping persistent data for a streaming app
>> > across restarts? And for cleaning up on termination?
>> >
>> >
>> > --
>> > Jeremy Lee  BCompSci(Hons)
>> >   The Unorthodox Engineers
>>
>
>
>
> --
> Jeremy Lee  BCompSci(Hons)
>   The Unorthodox Engineers
>



-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Best practice for 'Streaming' dumps?

2014-06-08 Thread Jeremy Lee
Yes... but from what I understand that's a "sliding window", so a 60-second
window over 1-second batch DStreams would save the entire last minute of
data once per second. That's more than I need.

I think what I'm after is probably updateStateByKey... I want to mutate
data structures (probably even graphs) as the stream comes in, but I also
want that state to be persistent across restarts of the application (or a
parallel version of the app, if possible). So I'd have to save that
structure occasionally and reload it as the "primer" on the next run.
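
Roughly what I'm picturing, as an untested sketch (the checkpoint path,
source, and update logic are all placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations

    val checkpointDir = "hdfs:///checkpoints/tweet-state"   // placeholder path

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf().setAppName("StatefulTweets"), Seconds(1))
      ssc.checkpoint(checkpointDir)                         // required for updateStateByKey

      val tweets = ssc.socketTextStream("localhost", 9999)  // stand-in source
      val counts = tweets
        .map(word => (word, 1))
        .updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
          Some(state.getOrElse(0) + newValues.sum)          // running count per key
        }
      counts.print()
      ssc
    }

    // Rebuild from the checkpoint if one exists, otherwise start fresh --
    // this is what carries the state across restarts.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()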

I was almost going to use HBase or Hive, but they seem to have been
deprecated in 1.0.0? Or just late to the party?

Also, I've been having trouble deleting Hadoop directories... the old
"two-line" examples don't seem to work anymore. I actually managed to fill
up the worker instances (I gave them tiny EBS volumes) and I think I
crashed them.



On Sat, Jun 7, 2014 at 10:23 PM, Gino Bustelo  wrote:

> Have you thought of using window?
>
> Gino B.
>
> > On Jun 6, 2014, at 11:49 PM, Jeremy Lee 
> wrote:
> >
> >
> > It's going well enough that this is a "how should I in 1.0.0" question
> > rather than a "how do I" question.
> >
> > So I've got data coming in via Streaming (tweets) and I want to
> > archive/log it all. It seems a bit wasteful to generate a new HDFS file
> > for each batch of the DStream, but I also want to guard against data loss
> > from crashes.
> >
> > I suppose what I want is to let things build up into "superbatches" over
> > a few minutes, and then serialize those to Parquet files, or similar?
> > Or do I?
> >
> > Do I count down batches myself, or does Spark have a preferred way of
> > scheduling cron-like events?
> >
> > What's the best practice for keeping persistent data for a streaming app
> > across restarts? And for cleaning up on termination?
> >
> >
> > --
> > Jeremy Lee  BCompSci(Hons)
> >   The Unorthodox Engineers
>



-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Best practice for 'Streaming' dumps?

2014-06-07 Thread Gino Bustelo
Have you thought of using window?

Gino B.

> On Jun 6, 2014, at 11:49 PM, Jeremy Lee  
> wrote:
> 
> 
> It's going well enough that this is a "how should I in 1.0.0" question
> rather than a "how do I" question.
> 
> So I've got data coming in via Streaming (tweets) and I want to archive/log
> it all. It seems a bit wasteful to generate a new HDFS file for each batch
> of the DStream, but I also want to guard against data loss from crashes.
> 
> I suppose what I want is to let things build up into "superbatches" over a
> few minutes, and then serialize those to Parquet files, or similar? Or do I?
> 
> Do I count down batches myself, or does Spark have a preferred way of
> scheduling cron-like events?
> 
> What's the best practice for keeping persistent data for a streaming app
> across restarts? And for cleaning up on termination?
> 
> 
> -- 
> Jeremy Lee  BCompSci(Hons)
>   The Unorthodox Engineers


Best practice for 'Streaming' dumps?

2014-06-06 Thread Jeremy Lee
It's going well enough that this is a "how should I in 1.0.0" question rather
than a "how do I" question.

So I've got data coming in via Streaming (tweets) and I want to archive/log it
all. It seems a bit wasteful to generate a new HDFS file for each batch of the
DStream, but I also want to guard against data loss from crashes.

I suppose what I want is to let things build up into "superbatches" over a few
minutes, and then serialize those to Parquet files, or similar? Or do I?
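
For concreteness, something like this is what I'm imagining (untested, and
assuming the Spark 1.0 SQL/Parquet API; the case class, source, and paths are
placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext, Time}

    case class Tweet(text: String)

    val ssc = new StreamingContext(new SparkConf().setAppName("TweetArchive"), Seconds(1))
    val sqlContext = new SQLContext(ssc.sparkContext)
    import sqlContext.createSchemaRDD            // implicit RDD[Tweet] -> SchemaRDD

    val tweets = ssc.socketTextStream("localhost", 9999).map(Tweet(_))  // stand-in source

    // Let five minutes build up, then write each "superbatch" as one Parquet dump.
    tweets.window(Minutes(5), Minutes(5)).foreachRDD { (rdd, time: Time) =>
      rdd.saveAsParquetFile(s"hdfs:///archive/tweets-${time.milliseconds}.parquet")
    }

    ssc.start()
    ssc.awaitTermination()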

Do I count down batches myself, or does Spark have a preferred way of
scheduling cron-like events?

What's the best practice for keeping persistent data for a streaming app
across restarts? And for cleaning up on termination?


-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers