[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-23 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145541#comment-14145541
 ] 

Hari Shreedharan commented on SPARK-3129:
-

Sure. Thanks Matei!

> Prevent data loss in Spark Streaming
> 
>
> Key: SPARK-3129
> URL: https://issues.apache.org/jira/browse/SPARK-3129
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Hari Shreedharan
>Assignee: Hari Shreedharan
> Attachments: SecurityFix.diff, StreamingPreventDataLoss.pdf
>
>
> Spark Streaming can lose small amounts of data when the driver goes down - and the
> sending system cannot re-send the data (or the data has already expired on
> the sender side). The document attached has more details.






[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-23 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145537#comment-14145537
 ] 

Matei Zaharia commented on SPARK-3129:
--

Alright, in that case, this sounds pretty good to me. I would go ahead with 
this version. Please coordinate with [~tdas] as well since he's been looking 
into this.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-23 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145411#comment-14145411
 ] 

Hari Shreedharan commented on SPARK-3129:
-

It is per node, single threaded.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-23 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145324#comment-14145324
 ] 

Matei Zaharia commented on SPARK-3129:
--

Is that 100 MB/s per node or in total? That should be pretty good for per-node if it scales well to a cluster.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-22 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143775#comment-14143775
 ] 

Hari Shreedharan commented on SPARK-3129:
-

I did multiple rounds of testing, and it looks like the average total rate for writing and flushing is around 100 MB/s. There are a couple of outliers, but that is likely due to flaky networking on EC2. Barring the one outlier, the lowest I got was 79 MB/s and the highest was 142 MB/s, but most runs were near 100.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-19 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141382#comment-14141382
 ] 

Matei Zaharia commented on SPARK-3129:
--

So Hari, what is the maximum sustainable rate in MB/second? That's the number 
we should be looking for. I think a latency of 50-100 ms to flush is fine, but 
we can't be writing just 5 Kbytes/second.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-18 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140049#comment-14140049
 ] 

Hari Shreedharan commented on SPARK-3129:
-

Do these numbers look good enough to you, [~tdas], [~matei], [~pwendell]? If you want to experiment, you can use the app and play around with it. I don't think this number is too bad - I don't know how the current block replication code performs, but I'd estimate it to be in the few tens of millis.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-18 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140048#comment-14140048
 ] 

Hari Shreedharan commented on SPARK-3129:
-

Reducing the buffer size decreases the number of hflushes per file (total time 
taken per file is less), but each hflush takes longer as more data is buffered 
locally (I guess).




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-18 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140033#comment-14140033
 ] 

Hari Shreedharan commented on SPARK-3129:
-

So I did some benchmarking on EC2, writing to files one after another with a 200 ms gap between hflushes. Times per hflush, in millis:

Writes for stream 1: 
30,61,75,89,68,4,65,59,92,261,3,66,86,81,96,4,75,64,79,82,69,2,91,69,65,80
Writes for stream 2: 
58,65,68,75,4,79,89,110,73,76,3,66,74,70,111,3,80,132,97,72,120,2,182,91,70,62
Writes for stream 3: 
68,74,79,67,4,67,82,97,109,3,104,56,65,81,3,57,61,57,2,76,61,59,62
Writes for stream 4: 
94,88,93,82,4,116,89,74,66,3,61,79,73,70,3,68,83,106,70,3,73,70,71,76
Writes for stream 5: 
66,67,83,63,3,70,77,110,80,69,3,83,75,67,65,4,73,70,97,2,56,63,79,105
Writes for stream 6: 
62,68,62,69,3,64,61,72,3,72,62,76,72,4,58,138,77,66,1,62,93,71,107
Writes for stream 7: 
82,63,94,80,4,121,117,69,74,3,80,70,66,63,3,59,69,68,70,1,59,130,75,96
Writes for stream 8: 
93,80,269,66,4,73,106,95,143,3,90,72,65,89,3,62,75,65,82,2,76,57,68,108
Writes for stream 9: 
132,75,59,78,4,70,66,66,71,3,60,75,89,4,78,84,76,74,1,73,63,67,88
Writes for stream 10: 
80,70,95,76,3,145,146,85,101,4,157,83,70,82,4,72,73,159,121,3,92,82,69,74


Here is the code I used to benchmark: 
https://github.com/harishreedharan/hdfs-benchmark

You can run it using a command line that looks like:
{code}
mvn test -Dpath=hdfs://xy.example.com/data/op -DbufferSize=1024 -Dtotal=10 
{code}
The time between flushes defaults to 200 ms and can be set using -DflushInterval=500 (in millis).

In most cases, an hflush takes between 50 and 100 ms, which seems pretty OK (there are outliers). These numbers are a little flaky, though, since I was running on EC2 - not on physical boxes.
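
For reference, a minimal sketch of the kind of hflush timing loop described above (this is not the code from the linked repo; the output path and counts are placeholders matching the command-line defaults):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Minimal hflush timing sketch (assumes a Hadoop 2.x client where
// FSDataOutputStream supports hflush); the path and counts are illustrative.
object HflushTiming {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val out = fs.create(new Path("/data/op/stream-1")) // placeholder path
    val buffer = new Array[Byte](1024)                 // matches -DbufferSize=1024
    for (i <- 1 to 25) {
      out.write(buffer)
      val start = System.nanoTime()
      out.hflush()                                     // force data to the datanodes
      println(s"hflush $i took ${(System.nanoTime() - start) / 1000000} ms")
      Thread.sleep(200)                                // default -DflushInterval
    }
    out.close()
  }
}
{code}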

I'd prefer the WAL option, to not tightly couple Spark's reliability with 
Kafka. We should make it pluggable - so we can replace the WAL option with 
something that uses Kafka information if Kafka is being used.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-18 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138706#comment-14138706
 ] 

Saisai Shao commented on SPARK-3129:


I strongly agree with Matei's comment. I think we can follow Storm's design and roll the lost messages back up to the receivers (or elsewhere), and then replay those lost messages if the external source supports replay, like Kafka does. A WAL would be another option for unreliable sources; throughput and reliability can then be balanced by letting the user choose whether to enable the WAL.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-17 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138281#comment-14138281
 ] 

Matei Zaharia commented on SPARK-3129:
--

Great, it will be nice to see how fast this is. I also think the rate per node 
doesn't need to be enormous for this to be useful, since we can also 
parallelize receiving over multiple nodes.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-17 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138032#comment-14138032
 ] 

Hari Shreedharan commented on SPARK-3129:
-

Thanks Matei for the background. I had considered some of the factors (like executors always talking to the latest driver) - but I was not aware of the distinct RDD IDs etc.

TD and I discussed this offline and we agreed that the WAL would probably be the best way to go. I am planning to do some benchmarking of appending data to a 5-node HDFS cluster on EC2 today. Considering that HBase uses a WAL on HDFS, my expectation is that the perf should be reasonable.

I will post the application on GitHub and link it here, then run it and post the results here as well.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-17 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138014#comment-14138014
 ] 

Matei Zaharia commented on SPARK-3129:
--

Hari, have you actually benchmarked a WAL based on HDFS? Recently we've 
discovered a number of bugs with block replication in Spark, and this plus the 
complexity of making executors reconnect make the WAL a much more attractive 
design short-term. I don't know if you have a more detailed design doc, but the 
work for reconnecting executors is quite a bit more involved than what the doc 
here suggests. For example, you need to make sure that the new driver uses a 
distinct set of RDD IDs, shuffle IDs, block IDs, etc from the old one, and you 
need to make sure that executors find the newest driver at all times (e.g. if 
one restarts and then immediately fails). I actually implemented a prototype of 
it when we were working on Spark Streaming, but I never pushed it into mainline 
Spark because of these issues.

Longer-term, I hope that a lot of this issue will be handled by better 
treatment of reliable input sources, in particular Kafka. If we were able to 
replay lost data from Kafka nicely (which is hard with its current low-level 
API, but will hopefully become easy later), people would have a reliable 
real-time source to get data from, in addition to the higher-latency source 
currently available in HDFS. Then we would only need this WAL for other data 
sources, such as Twitter, where the source is not reliable, and the pressure on 
it for throughput would be much lower. 




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-16 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135841#comment-14135841
 ] 

Hari Shreedharan commented on SPARK-3129:
-

As long as at least one executor containing each block is available when the driver comes back up - that is pretty much it. Since each block is replicated, unless all 3 executors holding a block fail, data will not be lost.

A WAL would be necessary to recover data which has not been pushed as blocks yet (look at the store(Any) method). Adding a persistent WAL is going to hit performance, especially if the WAL has to be durable (you'd need to do an hflush if the WAL is on HDFS, or its equivalent on any other system). So you'd be paying to persist the data every time a block is created, whereas in this case, you are paying only at startup and driver restarts. Even the amount of data transferred is very small, since it is just metadata. If the WAL is not durable, there is no guarantee it would be recoverable. If the WAL is local to each executor somehow, you'd still have to send all the block info to the driver when it comes back up.

TD and I had discussed the WAL approach and felt it is actually more complex and might affect performance more than this one. In this case, all the building blocks are already there (since we already know how to get block info from the executors which hold on to the blocks). We just need to add Akka messages to ask the executors to re-send block metadata.
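
To illustrate, a rough sketch of what such a message exchange could look like (the class names here are hypothetical, not actual Spark internals):
{code}
import akka.actor.Actor

// Hypothetical message shapes for the "re-send block metadata" step described
// above; the names are illustrative, not real Spark classes.
case class ResendBlockInfo(appId: String)
case class BlockInfoReport(executorId: String, blockIds: Seq[String])

// Executor-side sketch: on request from the restarted driver, report the IDs of
// the received blocks this executor still holds so the driver can rebuild its
// metadata before scheduling the next batch.
class BlockReporter(executorId: String, heldBlockIds: () => Seq[String]) extends Actor {
  def receive = {
    case ResendBlockInfo(_) =>
      sender() ! BlockInfoReport(executorId, heldBlockIds())
  }
}
{code}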




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-16 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135789#comment-14135789
 ] 

Patrick Wendell commented on SPARK-3129:


I think for this it's worth considering a design that solves H/A using simpler 
mechanisms (for instance, adding a write-ahead-log for received data). Also, 
with this proposal, what happens if both a driver and an executor fail at the 
same time?




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-16 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135116#comment-14135116
 ] 

Hari Shreedharan commented on SPARK-3129:
-

It looks like Akka makes it difficult to connect back to a client (in this case a BlockManagerSlaveActor) from a new server (in this case, the BlockManagerMasterActor). Since ActorRefs are serializable, I am going to serialize the ActorRefs to the BlockManagerSlaveActors to the HDFS location, rather than just their locations - so on startup the new driver can simply deserialize them and connect to the slaves.
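
A rough sketch of one way to do this (not the actual patch; the HDFS path is a placeholder, and it persists the slaves' serialized actor paths, which is what Akka ships around when an ActorRef is serialized):
{code}
import scala.concurrent.Await
import scala.concurrent.duration._
import akka.actor.{ActorRef, ActorSystem}
import akka.serialization.Serialization
import akka.util.Timeout
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative only: persist the slave actors' full Akka paths to HDFS so a
// restarted driver can resolve them again. The file location is an assumption.
object SlaveRefStore {
  private val file = new Path("/spark/streaming-ha/slave-refs")

  def save(refs: Seq[ActorRef]): Unit = {
    val out = FileSystem.get(new Configuration()).create(file, true)
    refs.foreach(ref => out.writeUTF(Serialization.serializedActorPath(ref)))
    out.close()
  }

  def reconnect(system: ActorSystem, path: String): ActorRef = {
    implicit val timeout = Timeout(30.seconds)
    // Resolve the persisted path back into a live ActorRef on the new driver.
    Await.result(system.actorSelection(path).resolveOne(), timeout.duration)
  }
}
{code}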




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-10 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129279#comment-14129279
 ] 

Hari Shreedharan commented on SPARK-3129:
-

[~sowen] Thanks! That fixed the issue! That saved me a whole lot of time! 




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129256#comment-14129256
 ] 

Sean Owen commented on SPARK-3129:
--

[~hshreedharan] Just manually add the src dir in the parent to the module in IntelliJ. It'd be cooler if it were automatic, but it's not hard. There have been fixes proposed, but I assume this will only stop being a problem when yarn-alpha goes away.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-10 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129252#comment-14129252
 ] 

Hari Shreedharan commented on SPARK-3129:
-

FYI here is the branch where I am doing development on this: 
https://github.com/harishreedharan/spark/tree/streaming-ha

Off topic: in IntelliJ, is there a way to get the yarn/stable code to recognize its base classes in common, so autocomplete and syntax highlighting (even type awareness) work properly?




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-08 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125981#comment-14125981
 ] 

Thomas Graves commented on SPARK-3129:
--

Yes, that should be enough.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-08 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125828#comment-14125828
 ] 

Hari Shreedharan commented on SPARK-3129:
-

Correct me if I am wrong here; it looks like what I'd need to do is:
* Create the key and add it to the credentials in the client.
* These credentials then get written out in the setUpSecurityToken method.

When the AM restarts, it has access to these credentials once again (and they get shipped to the executors when they are started by the AM).

How I wish the Hadoop security model were simpler :(
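
A rough sketch of what the first step could look like (the key alias is an assumption, and wiring the credentials into setUpSecurityToken is not shown):
{code}
import java.security.SecureRandom
import org.apache.hadoop.io.Text
import org.apache.hadoop.security.UserGroupInformation

// Sketch of generating the secret on the client and stashing it in the UGI
// credentials so a restarted AM can read it back; the alias is illustrative.
object ClientSecret {
  def addAuthSecret(): Array[Byte] = {
    val secret = new Array[Byte](64)
    new SecureRandom().nextBytes(secret)
    val creds = UserGroupInformation.getCurrentUser.getCredentials
    creds.addSecretKey(new Text("spark.authenticate.secret"), secret)
    secret
  }
}
{code}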




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-08 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125810#comment-14125810
 ] 

Hari Shreedharan commented on SPARK-3129:
-

(I am not too familiar with how the UGI gets passed around, if it gets passed around at all.)




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-08 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125790#comment-14125790
 ] 

Hari Shreedharan commented on SPARK-3129:
-

[~tgraves] - It looks like the SecurityManager class already persists the key to the UGI when the AM starts up the first time. A restarted AM would be able to get the key from the UGI anyway (that is true even today - the AM has access to the key on restart) - correct? So I don't know if there is a need to change anything in the security model (unless I am missing something non-obvious).




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-08 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125784#comment-14125784
 ] 

Hari Shreedharan commented on SPARK-3129:
-

Hi Saisai,

You are correct that there would be a latency increase, but that is a cost to be paid for reliability. I want to get at least the first part (storeReliably or an equivalent) right before going into the WAL implementation.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-04 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122274#comment-14122274
 ] 

Saisai Shao commented on SPARK-3129:


Hi [~hshreedharan], thanks for your reply. Is this PR (https://github.com/apache/spark/pull/1195) the one you mentioned for storeReliably()?

As I understand it, this API aims to store a bunch of messages into the BlockManager directly to make them reliable. But for receivers like Kafka, sockets and others, data arrives one message at a time, and we can't call storeReliably() for each message because of efficiency and throughput concerns, so we have to buffer the data locally up to some amount and then flush it to the BlockManager using storeReliably(). So data can still potentially be lost while it is buffered locally. I have been thinking about the WAL these days, and IMHO a WAL would be a better solution compared to the blocked store API.
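
To illustrate the buffering concern, a minimal sketch of that pattern (storeReliably here stands in for the proposed API under discussion, not an existing Spark method):
{code}
import scala.collection.mutable.ArrayBuffer

// Sketch of the buffer-then-flush pattern described above. storeReliably is the
// proposed (hypothetical) API under discussion; anything still sitting in
// `buffer` when the receiver node dies would be lost.
class BufferedReliableStore[T](flushSize: Int, storeReliably: Seq[T] => Unit) {
  private val buffer = new ArrayBuffer[T]()

  def onMessage(msg: T): Unit = {
    buffer += msg
    if (buffer.size >= flushSize) flush()
  }

  def flush(): Unit = if (buffer.nonEmpty) {
    storeReliably(buffer.toSeq)   // only data that reaches this call is protected
    buffer.clear()
  }
}
{code}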




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-04 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122114#comment-14122114
 ] 

Hari Shreedharan commented on SPARK-3129:
-

Looks like simply moving the code that generates the secret and sets it in the UGI to the Client class should take care of that.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-04 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122103#comment-14122103
 ] 

Hari Shreedharan commented on SPARK-3129:
-

I am less worried about client mode, since most streaming applications would run in cluster mode. We can make this available only in cluster mode.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-04 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122064#comment-14122064
 ] 

Thomas Graves commented on SPARK-3129:
--

On YARN, the secret is generated automatically. In cluster mode, it is generated in the ApplicationMaster, so it goes away when the ApplicationMaster dies. If the secret were generated on the client side and populated into the credentials in the UGI, similar to how we do tokens, then a restart of the AM in cluster mode should be able to pick it back up.

This won't work for client mode, though, since the client/Spark driver wouldn't have a way to get hold of the UGI again.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-04 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122017#comment-14122017
 ] 

Hari Shreedharan commented on SPARK-3129:
-

[~tgraves] - Am I correct in assuming that using Akka automatically gives us shared-secret authentication if spark.authenticate is set to true and the AM is restarted by YARN itself (since it is the same application, it theoretically has access to the same shared secret and thus should be able to communicate via Akka)?




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-04 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122006#comment-14122006
 ] 

Hari Shreedharan commented on SPARK-3129:
-

Yes - my initial goal is to be able to recover all the blocks that have not been made into an RDD yet (at which point the data would be safe). There is also data which may not have become a block yet (data added using the += operator). For now, I am going to call it fair game to say that we are going to add storeReliably(ArrayBuffer/Iterable) methods, which are the only ones that store data such that it is guaranteed to be recovered.

At a later stage, we could use something like a WAL on HDFS to recover even the += data, though that would affect performance.






[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-04 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121160#comment-14121160
 ] 

Saisai Shao commented on SPARK-3129:


Hi [~hshreedharan], one more question:

Is your design also trying to fix data loss caused by receiver node failure? It seems data could be lost if it is only stored in the BlockGenerator, and not yet in the BlockManager, when the node fails. Your design doc mainly focuses on driver failure - what are your thoughts?




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-08-21 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105928#comment-14105928
 ] 

Hari Shreedharan commented on SPARK-3129:
-

[~tgraves] - Thanks for the pointers. Yes, using HDFS also allows us to use the same file, with some protection, to store the keys. This is something that might need some design and discussion first.

I will also update the PR with the reflection code.

[~jerryshao]:
1. Today RDDs already get checkpointed at the end of every job when the runJob method gets called. Nothing is changing here - the entire graph already gets checkpointed today.
2. No, this is something that will need to be taken care of. When the driver dies, blocks can no longer be batched into RDDs - which means generating blocks without the driver makes no sense. Also, when the driver comes back online, new receivers get created, which would start receiving the data. The only reason the executors are being kept around is to get at the data in their memory - any processing/receiving should be killed.
3. Since it is an RDD, there is nothing that stops it from being recovered, right? It is recovered by the usual method of regenerating it. Only DStream data that has not been converted into an RDD is really lost - so getting the RDD back should not be a concern at all (of course, the cache is gone, but it can get pulled back into cache once the driver comes back up).





[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-08-20 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105086#comment-14105086
 ] 

Saisai Shao commented on SPARK-3129:


Hi Hari, I have some high-level questions about this:

1. In the design doc, you mention "Once the RDD is generated, the RDD is checkpointed to HDFS - at which point it is fully recoverable". I'm not sure whether you checkpoint only the RDD's metadata or also the data itself? Checkpointing the RDD every batch seems a little expensive if the batch duration is quite short.
2. If we keep executors alive when the driver dies, do we still need the receivers to keep receiving data from the external source? If so, I think there may be some problems: firstly, memory usage will build up since no data is being consumed; secondly, when the driver comes back, how do we prioritize processing? Old data needs to be processed first, which will delay processing of newly arriving data and lead to unwanted issues if that latency is larger than the batch duration.
3. In some scenarios we need to combine a DStream with an RDD (like joining real-time data with a history log). Normally that RDD is cached in the BlockManager's memory, so I think we also need to recover that RDD's metadata, not only the streaming data, if we want to recover the processing.

Maybe there are many other details we need to think about, because driver HA is quite complex. Please correct me if I have misunderstood something. Thanks a lot.





[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-08-20 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104077#comment-14104077
 ] 

Thomas Graves commented on SPARK-3129:
--

Yes, that probably means using reflection.

I think having a file-based one makes sense so we don't pull in other dependencies when you don't need them. You can always make it more complex and use ZooKeeper for those who want to install it. For YARN you could save it in the .sparkStaging directory along with the application jars; that way it knows where to find it.

You still have the question of how authentication works. This would require either the secret key being stored somewhere in HDFS as well (and protected), or some other way for executors to allow connections and figure out that this is a restart.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-08-19 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102579#comment-14102579
 ] 

Hari Shreedharan commented on SPARK-3129:
-

The way the driver "finds" the executors would be common to all the scheduling systems (it should really be independent of the scheduling/deployment). I agree about the auth part too.

[~tdas] mentioned there is something similar already in standalone mode. I'd like to concentrate on YARN - if someone else is interested in Mesos, please feel free to take it up!

I posted an initial patch for client mode to simply keep the executors around (though it is not exposed via SparkSubmit, which we can do once we get the whole series of patches in).

For YARN mode, does that mean the method calls have to be via reflection? I'd assume so.

The reason I mentioned doing it via HDFS and then pinging the executors is to make it independent of YARN/Mesos/Standalone - we can just do it via the StreamingContext and make it completely independent of the backend on which Spark is running. (I am not even sure this should be a valid option for non-streaming cases, as it does not really add any value elsewhere.)




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-08-19 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102504#comment-14102504
 ] 

Thomas Graves commented on SPARK-3129:
--

A couple of random thoughts on this for YARN. YARN added this ability in 2.4.0, and you have to ask for it in the application submission context, so you will have to handle older versions of YARN properly where it's not supported. I believe YARN will tell you which nodes you already have containers running on, but you'll have to figure out details about ports, etc. I haven't looked at all the specifics.

You'll have to figure out how to do authentication properly. This often gets forgotten.

I think we should flesh out more of the high-level design concerns between yarn/standalone/mesos, and on YARN, the client/cluster modes.




[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-08-19 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102501#comment-14102501
 ] 

Hari Shreedharan commented on SPARK-3129:
-

This doc is an early list of fixes. I may have missed some, and/or there may be better ways to do this. Please post any feedback you have! Thanks!
