[jira] [Issue Comment Deleted] (SPARK-1647) Prevent data loss when Streaming driver goes down
[ https://issues.apache.org/jira/browse/SPARK-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giulio De Vecchi updated SPARK-1647:
Comment: was deleted

(was: Not sure if this makes sense, but it would be nice to have a flag available within the code that tells me whether I'm running in a normal situation or during a recovery. To explain, consider the following scenario: I am processing data, say from a Kafka stream, and I am updating a database based on the computations. During the recovery I don't want to update the database again (for many reasons, let's just assume that), but I want my system to be in the same state as before, thus I would like to know whether my code is running for the first time or during a recovery so I can avoid updating the database again. More generally, I want to know this whenever I'm interacting with external entities.)

Prevent data loss when Streaming driver goes down
Key: SPARK-1647
URL: https://issues.apache.org/jira/browse/SPARK-1647
Project: Spark
Issue Type: Bug
Components: Streaming
Reporter: Hari Shreedharan
Assignee: Hari Shreedharan

Currently, when the driver goes down, any uncheckpointed data is lost from within Spark. If the system from which messages are pulled can replay messages, the data may still be available, but for some systems, like Flume, this is not the case. Also, all windowing information is lost for windowing functions. We must persist raw data somehow and be able to replay it if required. We must also persist windowing information with the data itself. This will likely require quite a bit of work to complete and will probably have to be split into several sub-JIRAs.

--
This message was sent by Atlassian JIRA (v6.2#6252)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
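The issue description asks for raw data to be persisted so that it can be replayed after a driver failure, with windowing information kept alongside the data. A minimal sketch of that idea in plain Python follows. It is not Spark code and not the design that was adopted for this issue: the `WriteAheadLog` class, the `receive` helper, the `batch_time` field and the file name are all hypothetical, chosen only to show the order of operations (log first, then buffer, then replay on restart).

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Hypothetical append-only log of raw records, synced to disk on every write."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Persist the raw record before any computation sees it.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # Return every record written so far, in arrival order.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def receive(log, buffer, record):
    """Log first, then keep the record in volatile memory."""
    log.append(record)
    buffer.append(record)


log_path = os.path.join(tempfile.mkdtemp(), "received.log")
log = WriteAheadLog(log_path)
in_memory = []

# Each record carries its batch time, so window membership survives a restart.
events = [
    {"batch_time": 1000, "value": "a"},
    {"batch_time": 2000, "value": "b"},
]
for event in events:
    receive(log, in_memory, event)

# Simulate a driver crash: the in-memory buffer is gone.
in_memory = []

# On restart, a fresh log handle rebuilds the data from disk.
recovered = WriteAheadLog(log_path).replay()
```

After the simulated crash, `recovered` holds the same records that were received, including their batch times, while `in_memory` is empty. A source that cannot replay messages, such as the Flume case mentioned in the description, is the situation in which a log like this matters most.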
[jira] [Comment Edited] (SPARK-1647) Prevent data loss when Streaming driver goes down
[ https://issues.apache.org/jira/browse/SPARK-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110942#comment-14110942 ]

Giulio De Vecchi edited comment on SPARK-1647 at 8/27/14 10:33 AM:

Not sure if this makes sense, but it would be nice to have a flag available within the code that tells me whether I'm running in a normal situation or during a recovery. To explain, consider the following scenario: I am processing data, say from a Kafka stream, and I am updating a database based on the computations. During the recovery I don't want to update the database again (for many reasons, let's just assume that), but I want my system to be in the same state as before, thus I would like to know whether my code is running for the first time or during a recovery so I can avoid updating the database again. More generally, I want to know this whenever I'm interacting with external entities.

was (Author: gadv):

Not sure if this makes sense, but it would be nice to have a flag available within the code that tells me whether I'm running in a normal situation or during a recovery. To explain, consider the following scenario: I am processing data, say from a Kafka stream, and I am updating a database based on the computations. During the recovery I don't want to update the database again (for many reasons, let's just assume that), but I want my system to be in the same state as before, so I would be able to know whether my code is running for the first time or during a recovery so I can avoid updating the database again. More generally, I want to know this whenever I'm interacting with external entities.
[jira] [Comment Edited] (SPARK-1647) Prevent data loss when Streaming driver goes down
[ https://issues.apache.org/jira/browse/SPARK-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110942#comment-14110942 ]

Giulio De Vecchi edited comment on SPARK-1647 at 8/27/14 10:34 AM:

Not sure if this makes sense, but it would be nice to have a flag available within the code that tells me whether I'm running in a normal situation or during a recovery. To explain, consider the following scenario: I am processing data, say from a Kafka stream, and I am updating a database based on the computations. During the recovery I don't want to update the database again (for many reasons, let's just assume that), but I want my system to be in the same state as before, thus I would like to know whether my code is running for the first time or during a recovery so I can avoid updating the database again. More generally, I want to know this whenever I'm interacting with external entities.

was (Author: gadv):

Not sure if this makes sense, but it would be nice to have a flag available within the code that tells me whether I'm running in a normal situation or during a recovery. To explain, consider the following scenario: I am processing data, say from a Kafka stream, and I am updating a database based on the computations. During the recovery I don't want to update the database again (for many reasons, let's just assume that), but I want my system to be in the same state as before, thus I would like to know whether my code is running for the first time or during a recovery so I can avoid updating the database again. More generally, I want to know this whenever I'm interacting with external entities.
[jira] [Commented] (SPARK-1647) Prevent data loss when Streaming driver goes down
[ https://issues.apache.org/jira/browse/SPARK-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110942#comment-14110942 ]

Giulio De Vecchi commented on SPARK-1647:

Not sure if this makes sense, but it would be nice to have a flag available within the code that tells me whether I'm running in a normal situation or during a recovery. To explain, consider the following scenario: I am processing data, say from a Kafka stream, and I am updating a database based on the computations. During the recovery I don't want to update the database again (for many reasons, let's just assume that), but I want my system to be in the same state as before, so I would be able to know whether my code is running for the first time or during a recovery so I can avoid updating the database again. More generally, I want to know this whenever I'm interacting with external entities.
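The comment asks for a way to avoid updating the database a second time when batches are recomputed during recovery. One alternative to a recovery flag, sketched here under stated assumptions, is to make the external update idempotent by recording which batches have already been applied. This is plain Python, not Spark API: `IdempotentSink`, `apply` and the batch times are hypothetical names used only for illustration.

```python
class IdempotentSink:
    """Hypothetical stand-in for an external database that remembers applied batches."""

    def __init__(self):
        self.totals = {}
        self.applied_batches = set()

    def apply(self, batch_time, updates):
        # A batch that was already applied is a replay, so skip it.
        if batch_time in self.applied_batches:
            return False
        for key, delta in updates.items():
            self.totals[key] = self.totals.get(key, 0) + delta
        self.applied_batches.add(batch_time)
        return True


sink = IdempotentSink()
batches = [(1000, {"clicks": 3}), (2000, {"clicks": 5})]

# Normal run: every batch is applied exactly once.
first_run = [sink.apply(t, u) for t, u in batches]

# Recovery: the same batches are recomputed and offered again.
replayed = [sink.apply(t, u) for t, u in batches]
```

With this approach the processing code does not need to know whether it is running for the first time or during a recovery, because the replayed batches are rejected by the sink and the totals stay unchanged. In a real database the batch marker and the update would probably need to be written in the same transaction, otherwise a crash between the two writes could still cause a double update.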