Re: How to enable fault-tolerance?

2014-06-09 Thread Peng Cheng
I speculate that Spark will only retry on exceptions that are registered with
the TaskSetScheduler, so a definitely-will-fail task will fail quickly without
taking more resources. However, I haven't found any documentation or web page
on this.
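
For concreteness, a minimal sketch of the kind of definitely-will-fail task I
mean (the app name and numbers are arbitrary examples):

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[2]", "retry-test")
    // Every attempt on the partition containing 7 throws the same
    // exception, so the task can never succeed; the question is how
    // many times Spark retries it before failing the job.
    sc.parallelize(1 to 10).map { i =>
      if (i == 7) throw new RuntimeException("deterministic failure")
      i
    }.count()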



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-enable-fault-tolerance-tp7250p7255.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to enable fault-tolerance?

2014-06-09 Thread Aaron Davidson
Looks like your problem is local mode:
https://github.com/apache/spark/blob/640f9a0efefd42cff86aecd4878a3a57f5ae85fa/core/src/main/scala/org/apache/spark/SparkContext.scala#L1430

For some reason, someone decided not to do retries when running in local
mode. Not exactly sure why; feel free to submit a JIRA on this.
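
For context, this is roughly what the local-mode branch of
SparkContext.createTaskScheduler looked like around that line in Spark 1.0
(paraphrased from memory; check the linked commit for the exact code):

    // When running locally, don't try to re-execute tasks on failure.
    val MAX_LOCAL_TASK_FAILURES = 1

    master match {
      case "local" =>
        val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
        val backend = new LocalBackend(scheduler, 1)
        scheduler.initialize(backend)
        scheduler
      // ... other master URL formats follow
    }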





Re: How to enable fault-tolerance?

2014-06-09 Thread Peng Cheng
Thanks a lot! That's very responsive. Somebody has definitely
encountered the same problem before and added two hidden modes to the
master URL:


(from SparkContext.scala, line 1431)

    // Regular expression for local[N, maxRetries], used in tests with failing tasks
    val LOCAL_N_FAILURES_REGEX = """local\[([0-9]+)\s*,\s*([0-9]+)\]""".r
    // Regular expression for simulating a Spark cluster of [N, cores, memory] locally
    val LOCAL_CLUSTER_REGEX = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r


Unfortunately these never made it into the documentation, and the retry
settings end up scattered across two different places (the master URL and
spark.task.maxFailures).
I'm thinking of adding a new config parameter, spark.task.maxLocalFailures,
to override the hard-coded 1. What do you think?
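
For anyone reading along, a sketch of what these hidden modes allow (the
second number is maxRetries; the app name, thread count, and cluster URL are
my own examples):

    import org.apache.spark.{SparkConf, SparkContext}

    // local[N, maxRetries]: 2 worker threads, and each task may fail up
    // to 4 times before the whole job is failed.
    val sc = new SparkContext("local[2, 4]", "retry-test")

    // On a real cluster the equivalent knob is spark.task.maxFailures:
    val conf = new SparkConf()
      .setMaster("spark://some-host:7077") // hypothetical cluster URL
      .set("spark.task.maxFailures", "4")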


Thanks again buddy.

Yours Peng





Re: How to enable fault-tolerance?

2014-06-09 Thread Peng Cheng

Oh, and to make things worse, they forgot to allow '\*' in the regex, so
local[*, maxRetries] isn't accepted. Am I the first to run into this?
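
A possible fix, assuming the intent is to accept '*' for the thread count the
same way plain local[*] does (my own sketch, not what is in the source):

    // Accept either an explicit thread count or '*' before maxRetries.
    val LOCAL_N_FAILURES_REGEX = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r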





Re: How to enable fault-tolerance?

2014-06-09 Thread Matei Zaharia
If this is a useful feature for local mode, we should open a JIRA to document
the setting or improve it (I’d prefer to add a spark.local.retries property
instead of a special URL format). We initially disabled retries for everything
except unit tests because 90% of the time an exception in local mode means a
problem in the application, and we’d rather let the user debug that right away
than retry the task several times and leave them wondering why they get so
many errors.
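
A minimal sketch of how a spark.local.retries property could be wired into
SparkContext.createTaskScheduler (the property is only proposed here, not an
existing setting, and the wiring below is an assumption; conf, sc, master and
LOCAL_N_REGEX come from the surrounding class):

    // Hypothetical: consult the proposed spark.local.retries property,
    // defaulting to today's behavior of failing a task on its first error.
    val maxFailures = conf.getInt("spark.local.retries", 1)

    master match {
      case LOCAL_N_REGEX(threads) =>
        val scheduler = new TaskSchedulerImpl(sc, maxFailures, isLocal = true)
        val backend = new LocalBackend(scheduler, threads.toInt)
        scheduler.initialize(backend)
        scheduler
      // ... other master URL formats unchanged
    }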

Matei




Re: How to enable fault-tolerance?

2014-06-09 Thread Peng Cheng
Hi Matei, yeah, you are right that this is very niche (my use case is a
web crawler), but I'm glad you also like the idea of an additional
property. Let me open a JIRA.


Yours Peng
