[jira] [Issue Comment Deleted] (SPARK-1647) Prevent data loss when Streaming driver goes down

2014-08-28 Thread Giulio De Vecchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giulio De Vecchi updated SPARK-1647:


Comment: was deleted

(was: Not sure if this makes sense, but it might be nice to have a flag 
available in the code that tells me whether I'm running normally or during a 
recovery.
To better explain this, consider the following scenario:
I am processing data, say from a Kafka stream, and updating a database based 
on the computations. During recovery I don't want to update the database 
again (for many reasons; let's just assume that), but I want my system to end 
up in the same state as before. So I would like to know whether my code is 
running for the first time or during a recovery, so I can avoid updating the 
database again.

More generally, I want to know this whenever I'm interacting with external 
entities.

)
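The pattern the comment asks about can be achieved without a recovery flag by making the database updates idempotent per batch: tag each update with a batch identifier and skip batches that were already applied. The sketch below is illustrative only, not Spark code; the table names (`counts`, `processed_batches`) and the `apply_batch` helper are hypothetical.

```python
import sqlite3

def apply_batch(conn, batch_id, rows):
    """Apply a batch of (key, delta) updates exactly once per batch_id."""
    cur = conn.cursor()
    # If this batch id was already recorded, we are replaying it during
    # recovery: skip the update so the database is not changed twice.
    cur.execute("SELECT 1 FROM processed_batches WHERE batch_id = ?", (batch_id,))
    if cur.fetchone():
        return False
    for key, delta in rows:
        cur.execute("INSERT OR IGNORE INTO counts(key, value) VALUES(?, 0)", (key,))
        cur.execute("UPDATE counts SET value = value + ? WHERE key = ?", (delta, key))
    # Record the batch in the same transaction as the updates.
    cur.execute("INSERT INTO processed_batches(batch_id) VALUES(?)", (batch_id,))
    conn.commit()
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counts(key TEXT PRIMARY KEY, value INTEGER)")
conn.execute("CREATE TABLE processed_batches(batch_id TEXT PRIMARY KEY)")
apply_batch(conn, "batch-1", [("a", 1), ("b", 2)])
apply_batch(conn, "batch-1", [("a", 1), ("b", 2)])  # replay after recovery: skipped
```

Because the batch id and the data changes commit atomically, a replayed batch observes its own id and becomes a no-op, which sidesteps the "am I recovering?" question entirely.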

 Prevent data loss when Streaming driver goes down
 -------------------------------------------------

 Key: SPARK-1647
 URL: https://issues.apache.org/jira/browse/SPARK-1647
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Hari Shreedharan
Assignee: Hari Shreedharan

 Currently, when the driver goes down, any uncheckpointed data is lost from 
 within Spark. If the system from which messages are pulled can replay 
 messages, the data may still be available - but for some systems, like 
 Flume, this is not the case. 
 Also, all windowing information is lost for windowing functions. 
 We must persist the raw data somehow and be able to replay it if 
 required. We must also persist windowing information with the data itself.
 This will likely require quite a bit of work and will probably 
 have to be split into several sub-JIRAs.
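The persist-and-replay idea described above can be sketched as a write-ahead log: received records are appended to durable storage before being acknowledged to the source, so a restarted driver can re-read any records that were never checkpointed. This is a minimal illustrative sketch, not the Spark implementation; the class name and record format are assumptions.

```python
import json
import os
import tempfile

class WriteAheadLog:
    """Append records durably before acknowledging the source."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Flush and fsync before the ack, so a driver crash after the ack
        # cannot lose the record.
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # On restart, re-read every logged record that was not yet
        # checkpointed (here, for simplicity: all of them).
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f if line.strip()]

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")  # illustrative path
wal = WriteAheadLog(wal_path)
wal.append({"event": "click", "ts": 1})
wal.append({"event": "view", "ts": 2})
# Simulate a driver restart: a fresh object over the same file replays both.
recovered = WriteAheadLog(wal_path).replay()
```

For sources such as Flume that cannot replay on their own, the ack-after-fsync ordering is what makes the data recoverable; windowing state would similarly need to be logged alongside the records.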



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1647) Prevent data loss when Streaming driver goes down

2014-08-27 Thread Giulio De Vecchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110942#comment-14110942
 ] 

Giulio De Vecchi edited comment on SPARK-1647 at 8/27/14 10:33 AM:
---

Not sure if this makes sense, but it might be nice to have a flag 
available in the code that tells me whether I'm running normally 
or during a recovery.
To better explain this, consider the following scenario:
I am processing data, say from a Kafka stream, and updating a 
database based on the computations. During recovery I don't want to update 
the database again (for many reasons; let's just assume that), but I want my 
system to end up in the same state as before. So I would like to know whether 
my code is running for the first time or during a recovery, so I can avoid 
updating the database again.

More generally, I want to know this whenever I'm interacting with external 
entities.







[jira] [Comment Edited] (SPARK-1647) Prevent data loss when Streaming driver goes down

2014-08-27 Thread Giulio De Vecchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110942#comment-14110942
 ] 

Giulio De Vecchi edited comment on SPARK-1647 at 8/27/14 10:34 AM:
---




[jira] [Commented] (SPARK-1647) Prevent data loss when Streaming driver goes down

2014-08-26 Thread Giulio De Vecchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110942#comment-14110942
 ] 

Giulio De Vecchi commented on SPARK-1647:
-
