RE: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-14 Thread Haopu Wang
Hi, can someone help to confirm the behavior? Thank you!

-Original Message-
From: Haopu Wang 
Sent: Friday, June 12, 2015 4:57 PM
To: user
Subject: If not stop StreamingContext gracefully, will checkpoint data
be consistent?

This is a quick question about checkpointing. The question is: if the
StreamingContext is not stopped gracefully, will the checkpoint be
consistent?
Or should I always shut down the application gracefully in order to use
the checkpoint?

Thank you very much!
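For context, a graceful stop in the Spark 1.x Streaming API looks roughly
like the sketch below (the app name and checkpoint path are hypothetical
placeholders); `stopGracefully = true` lets all received data finish
processing before shutdown:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("CheckpointDemo")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///tmp/checkpoint")  // hypothetical checkpoint dir

// ... set up input DStreams and output operations here ...

ssc.start()
// Graceful stop: wait for processing of all received data to complete
// before tearing down the context.
ssc.stop(stopSparkContext = true, stopGracefully = true)
```

Recent releases also have a `spark.streaming.stopGracefullyOnShutdown`
configuration that requests a graceful stop on JVM shutdown, if it is
available in your version.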


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-15 Thread Akhil Das
I think it should be fine; that's the whole point of checkpointing (to
recover in case of driver failure, etc.).

Thanks
Best Regards



RE: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-15 Thread Haopu Wang
Akhil, thank you for the response. I want to explore more.

 

Suppose the application is just monitoring an HDFS folder and writing the
word count of each streaming batch back to HDFS.

 

When I kill the application _before_ Spark takes a checkpoint, then after
recovery Spark will resume processing from the timestamp of the latest
checkpoint. That means some files will be processed twice and duplicate
results will be generated.

 

Please correct me if the understanding is wrong, thanks again!
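For reference, the scenario described above can be sketched with the
standard checkpoint-recovery pattern from the Spark 1.x Streaming API
(all paths below are hypothetical placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/wc-checkpoint"  // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("HdfsWordCount")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // Monitor an HDFS folder and write per-batch word counts back to HDFS.
  val lines = ssc.textFileStream("hdfs:///input")
  val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  counts.saveAsTextFiles("hdfs:///output/wc")
  ssc
}

// On restart, recover from the checkpoint if one exists; otherwise
// build a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```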

 





Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-16 Thread Akhil Das
Good question. With fileStream or textFileStream, it basically only takes
in files whose timestamp is greater than the current timestamp
<https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L172>
and, when checkpointing is enabled
<https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L324>,
it restores the latest filenames from the checkpoint directory, which I
believe will reprocess some files.
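A much-simplified illustration of that behavior (this is not the actual
Spark code; names are invented for the sketch): FileInputDStream
remembers which files it selected for each batch time, and on recovery
it recomputes the batches after the last completed checkpoint from those
remembered file lists, so those files are read again.

```scala
// Simplified sketch, not the real FileInputDStream implementation.
case class BatchFiles(batchTime: Long, files: Seq[String])

// On recovery, every remembered batch newer than the last checkpoint
// is regenerated, so its files are reprocessed.
def filesToReprocess(remembered: Seq[BatchFiles],
                     lastCheckpointTime: Long): Seq[String] =
  remembered.filter(_.batchTime > lastCheckpointTime).flatMap(_.files)
```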

Thanks
Best Regards



RE: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-18 Thread Haopu Wang
Akhil,

 

From my test, I can see that the files in the last batch will always be
reprocessed upon restarting from the checkpoint, even after a graceful shutdown.

 

I think usually each file is expected to be processed only once. Maybe
this is a bug in fileStream? Or do you know any approach to work around
it?

 

Much thanks!

 






Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-19 Thread Akhil Das
One workaround would be to remove/move the files from the input directory
once you have processed them.
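A minimal sketch of that workaround, assuming hypothetical input and
processed directories and that `dstream` is the file stream. Note the
naive version below moves everything currently sitting in the input
directory, so in practice you would track exactly which files belonged
to the batch before moving them:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical directories for the sketch.
val inputDir = new Path("hdfs:///input")
val processedDir = new Path("hdfs:///processed")

dstream.foreachRDD { (rdd, batchTime) =>
  // ... write this batch's results first ...

  // Then move the batch's input files out of the monitored directory so
  // a restart from the checkpoint cannot pick them up again.
  val fs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)
  fs.listStatus(inputDir).foreach { status =>
    fs.rename(status.getPath, new Path(processedDir, status.getPath.getName))
  }
}
```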

Thanks
Best Regards
