[jira] [Updated] (SPARK-47520) Precision issues with sum of floats/doubles leads to incorrect data after repartition stage retry

2024-03-23 Thread William Montaz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-47520:
---
Summary: Precision issues with sum of floats/doubles leads to incorrect 
data after repartition stage retry  (was: Precision issues with sum of 
floats/doubles leads to incorrect data after repartition)

> Precision issues with sum of floats/doubles leads to incorrect data after 
> repartition stage retry
> -
>
> Key: SPARK-47520
> URL: https://issues.apache.org/jira/browse/SPARK-47520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.3.2, 3.5.0
>Reporter: William Montaz
>Priority: Major
>  Labels: correctness
>
> We discovered an important correctness issue directly linked to SPARK-47024.
> Even though SPARK-47024 has been considered 'Not a Problem' since it is 
> directly linked to float and double precision, it can still have drastic 
> impacts when combined with spark.sql.execution.sortBeforeRepartition set to 
> true (the default).
> We consistently reproduced the issue with a GROUP BY performing a SUM 
> aggregation of floats or doubles, followed by a repartition (a common pattern 
> to produce bigger output files, triggered either by SQL hints or by extensions 
> like Kyuubi).
> If the repartition stage fails with a FetchFailedException for only a few 
> tasks, Spark decides to recompute the partitions of the previous stage whose 
> output could not be fetched and will retry only the failed partitions 
> downstream.
> Because block fetch order is non-deterministic, the new upstream partition 
> computation can produce a slightly different value for a float/double sum 
> aggregation. We noticed a 1-bit difference in the UnsafeRow backing byte array 
> in all of our attempts. The sort performed before repartition uses 
> UnsafeRow.hashCode for the row prefix, which is completely different even with 
> such a 1-bit difference, so the sort in the new upstream partition is 
> completely different and the target downstream partitions of the shuffled rows 
> are completely different as well.
> Because the sort becomes non-deterministic and only the failed downstream 
> tasks are retried, the resulting repartition leads to duplicate rows as well 
> as missing rows. The solution brought by SPARK-23207 is broken.
> So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to 
> make the entire job fail instead of producing incorrect data. The current 
> default in Spark leads to a silent correctness issue.
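A minimal sketch of the reproduction pattern described above, assuming a 
SparkSession and hypothetical input/output paths and column names. The config 
line reflects the workaround mentioned in the report (fail the job instead of 
silently producing duplicate or missing rows):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder()
  .appName("sum-then-repartition")
  // Workaround from the report: disable the sort so a fetch failure fails the
  // job instead of silently producing duplicate/missing rows on retry.
  .config("spark.sql.execution.sortBeforeRepartition", "false")
  .getOrCreate()

// GROUP BY with a float/double SUM (order-sensitive), then a repartition that
// triggers the sort-before-repartition code path.
val aggregated = spark.read.parquet("/data/events")            // hypothetical input
  .groupBy("user_id")                                          // hypothetical column
  .agg(sum("amount").as("total_amount"))                       // double SUM
  .repartition(16)

aggregated.write.mode("overwrite").parquet("/data/events_agg") // hypothetical output
{code}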



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47520) Precision issues with sum of floats/doubles leads to incorrect data after repartition

2024-03-22 Thread William Montaz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-47520:
---
Description: 
We discovered an important correctness issue directly linked to SPARK-47024.

Even though SPARK-47024 has been considered 'Not a Problem' since it is directly 
linked to float and double precision, it can still have drastic impacts when 
combined with spark.sql.execution.sortBeforeRepartition set to true (the default).

We consistently reproduced the issue with a GROUP BY performing a SUM aggregation 
of floats or doubles, followed by a repartition (a common pattern to produce 
bigger output files, triggered either by SQL hints or by extensions like Kyuubi).

If the repartition stage fails with a FetchFailedException for only a few tasks, 
Spark decides to recompute the partitions of the previous stage whose output 
could not be fetched and will retry only the failed partitions downstream.

Because block fetch order is non-deterministic, the new upstream partition 
computation can produce a slightly different value for a float/double sum 
aggregation. We noticed a 1-bit difference in the UnsafeRow backing byte array in 
all of our attempts. The sort performed before repartition uses 
UnsafeRow.hashCode for the row prefix, which is completely different even with 
such a 1-bit difference, so the sort in the new upstream partition is completely 
different and the target downstream partitions of the shuffled rows are 
completely different as well.

Because the sort becomes non-deterministic and only the failed downstream tasks 
are retried, the resulting repartition leads to duplicate rows as well as missing 
rows. The solution brought by SPARK-23207 is broken.

So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to make 
the entire job fail instead of producing incorrect data. The current default in 
Spark leads to a silent correctness issue.

  was:
We discovered an important correctness issue directly linked to SPARK-47024

Even if SPARK-47024 has been considered 'Not a Problem' since it is linked 
directly to floats and double precision, it can still have drastic impacts 
combined to spark.sql.execution.sortBeforeRepartition set to true (the default)

We consistently reproduced the issue doing a GROUP BY with a SUM of float or 
double aggregation, followed by a repartition (common case to produce bigger 
files as output, either triggered by SQL hints or extensions like kyuubi). 

If the repartition stage fails with Fetch Failed Exception for only few tasks, 
spark decides to recompute the partitions from the previous stage for which 
output could not be fetched and will retry only the failed partitions 
downstream.

Because block fetch order is indeterministic, the new upstream partition 
computation can provide a slightly different value for a float/double sum 
aggregation. We noticed a 1 bit difference is UnsafeRow backing byte array in 
all of our attempts. The sort performed before repartition uses 
UnsafeRow.hashcode for the row prefix which will be completely different even 
with such 1 bit difference, leading to the sort being completely different in 
the new upstream partition and thus target downstream partition for the 
shuffled rows completely different as well.

Because sort becomes undeterministic and since only the failed dowstream tasks 
are retried the resulting repartition will lead to duplicate rows as well as 
missing rows. The solution brought by SPARK-23207 is broken.

So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to 
make the entire job fail.


> Precision issues with sum of floats/doubles leads to incorrect data after 
> repartition
> -
>
> Key: SPARK-47520
> URL: https://issues.apache.org/jira/browse/SPARK-47520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.3.2, 3.5.0
>Reporter: William Montaz
>Priority: Major
>  Labels: correctness
>
> We discovered an important correctness issue directly linked to SPARK-47024.
> Even though SPARK-47024 has been considered 'Not a Problem' since it is 
> directly linked to float and double precision, it can still have drastic 
> impacts when combined with spark.sql.execution.sortBeforeRepartition set to 
> true (the default).
> We consistently reproduced the issue with a GROUP BY performing a SUM 
> aggregation of floats or doubles, followed by a repartition (a common pattern 
> to produce bigger output files, triggered either by SQL hints or by extensions 
> like Kyuubi).
> If the repartition stage fails with a FetchFailedException for only a few 
> tasks, Spark decides to recompute the partitions of the previous stage whose 
> output could not be fetched and will retry only the failed partitions 
> 

[jira] [Updated] (SPARK-47520) Precision issues with sum of floats/doubles leads to incorrect data after repartition

2024-03-22 Thread William Montaz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-47520:
---
Description: 
We discovered an important correctness issue directly linked to SPARK-47024.

Even though SPARK-47024 has been considered 'Not a Problem' since it is directly 
linked to float and double precision, it can still have drastic impacts when 
combined with spark.sql.execution.sortBeforeRepartition set to true (the default).

We consistently reproduced the issue with a GROUP BY performing a SUM aggregation 
of floats or doubles, followed by a repartition (a common pattern to produce 
bigger output files, triggered either by SQL hints or by extensions like Kyuubi).

If the repartition stage fails with a FetchFailedException for only a few tasks, 
Spark decides to recompute the partitions of the previous stage whose output 
could not be fetched and will retry only the failed partitions downstream.

Because block fetch order is non-deterministic, the new upstream partition 
computation can produce a slightly different value for a float/double sum 
aggregation. We noticed a 1-bit difference in the UnsafeRow backing byte array in 
all of our attempts. The sort performed before repartition uses 
UnsafeRow.hashCode for the row prefix, which is completely different even with 
such a 1-bit difference, so the sort in the new upstream partition is completely 
different and the target downstream partitions of the shuffled rows are 
completely different as well.

Because the sort becomes non-deterministic and only the failed downstream tasks 
are retried, the resulting repartition leads to duplicate rows as well as missing 
rows. The solution brought by SPARK-23207 is broken.

So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to make 
the entire job fail.

  was:
We discovered an important correctness issue directly linked to SPARK-47024

Even if SPARK-47024 has been considered 'Not a Problem' since it is linked 
directly to floats and double rounding, it can still have drastic impacts 
combined to spark.sql.execution.sortBeforeRepartition set to true (the default)

We consistently reproduced the issue doing a GROUP BY with a SUM of float or 
double aggregation, followed by a repartition (common case to produce bigger 
files as output, either triggered by SQL hints or extensions like kyuubi). 

If the repartition stage fails with Fetch Failed Exception for only few tasks, 
spark decides to recompute the partitions from the previous stage for which 
output could not be fetched and will retry only the failed partitions 
downstream.

Because block fetch order is indeterministic, the new upstream partition 
computation can provide a slightly different value for a float/double sum 
aggregation. We noticed a 1 bit difference is UnsafeRow backing byte array in 
all of our attempts. The sort performed before repartition uses 
UnsafeRow.hashcode for the row prefix which will be completely different even 
with such 1 bit difference, leading to the sort being completely different in 
the new upstream partition and thus target downstream partition for the 
shuffled rows completely different as well.

Because sort becomes undeterministic and since only the failed dowstream tasks 
are retried the resulting repartition will lead to duplicate rows as well as 
missing rows. The solution brought by SPARK-23207 is broken.

So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to 
make the entire job fail.


> Precision issues with sum of floats/doubles leads to incorrect data after 
> repartition
> -
>
> Key: SPARK-47520
> URL: https://issues.apache.org/jira/browse/SPARK-47520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.3.2, 3.5.0
>Reporter: William Montaz
>Priority: Major
>  Labels: correctness
>
> We discovered an important correctness issue directly linked to SPARK-47024.
> Even though SPARK-47024 has been considered 'Not a Problem' since it is 
> directly linked to float and double precision, it can still have drastic 
> impacts when combined with spark.sql.execution.sortBeforeRepartition set to 
> true (the default).
> We consistently reproduced the issue with a GROUP BY performing a SUM 
> aggregation of floats or doubles, followed by a repartition (a common pattern 
> to produce bigger output files, triggered either by SQL hints or by extensions 
> like Kyuubi).
> If the repartition stage fails with a FetchFailedException for only a few 
> tasks, Spark decides to recompute the partitions of the previous stage whose 
> output could not be fetched and will retry only the failed partitions 
> downstream.
> Because block fetch order is non-deterministic, the new upstream partition 
> computation can 

[jira] [Updated] (SPARK-47520) Precision issues with sum of floats/doubles leads to incorrect data after repartition

2024-03-22 Thread William Montaz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-47520:
---
Summary: Precision issues with sum of floats/doubles leads to incorrect 
data after repartition  (was: Rounding issues with sum of floats/doubles leads 
to incorrect data after repartition)

> Precision issues with sum of floats/doubles leads to incorrect data after 
> repartition
> -
>
> Key: SPARK-47520
> URL: https://issues.apache.org/jira/browse/SPARK-47520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.3.2, 3.5.0
>Reporter: William Montaz
>Priority: Major
>  Labels: correctness
>
> We discovered an important correctness issue directly linked to SPARK-47024.
> Even though SPARK-47024 has been considered 'Not a Problem' since it is 
> directly linked to float and double rounding, it can still have drastic 
> impacts when combined with spark.sql.execution.sortBeforeRepartition set to 
> true (the default).
> We consistently reproduced the issue with a GROUP BY performing a SUM 
> aggregation of floats or doubles, followed by a repartition (a common pattern 
> to produce bigger output files, triggered either by SQL hints or by extensions 
> like Kyuubi).
> If the repartition stage fails with a FetchFailedException for only a few 
> tasks, Spark decides to recompute the partitions of the previous stage whose 
> output could not be fetched and will retry only the failed partitions 
> downstream.
> Because block fetch order is non-deterministic, the new upstream partition 
> computation can produce a slightly different value for a float/double sum 
> aggregation. We noticed a 1-bit difference in the UnsafeRow backing byte array 
> in all of our attempts. The sort performed before repartition uses 
> UnsafeRow.hashCode for the row prefix, which is completely different even with 
> such a 1-bit difference, so the sort in the new upstream partition is 
> completely different and the target downstream partitions of the shuffled rows 
> are completely different as well.
> Because the sort becomes non-deterministic and only the failed downstream 
> tasks are retried, the resulting repartition leads to duplicate rows as well 
> as missing rows. The solution brought by SPARK-23207 is broken.
> So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to 
> make the entire job fail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47520) Rounding issues with sum of floats/doubles leads to incorrect data after repartition

2024-03-22 Thread William Montaz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-47520:
---
Description: 
We discovered an important correctness issue directly linked to SPARK-47024.

Even though SPARK-47024 has been considered 'Not a Problem' since it is directly 
linked to float and double rounding, it can still have drastic impacts when 
combined with spark.sql.execution.sortBeforeRepartition set to true (the default).

We consistently reproduced the issue with a GROUP BY performing a SUM aggregation 
of floats or doubles, followed by a repartition (a common pattern to produce 
bigger output files, triggered either by SQL hints or by extensions like Kyuubi).

If the repartition stage fails with a FetchFailedException for only a few tasks, 
Spark decides to recompute the partitions of the previous stage whose output 
could not be fetched and will retry only the failed partitions downstream.

Because block fetch order is non-deterministic, the new upstream partition 
computation can produce a slightly different value for a float/double sum 
aggregation. We noticed a 1-bit difference in the UnsafeRow backing byte array in 
all of our attempts. The sort performed before repartition uses 
UnsafeRow.hashCode for the row prefix, which is completely different even with 
such a 1-bit difference, so the sort in the new upstream partition is completely 
different and the target partitions of the shuffled rows are completely different 
as well.

Because the sort becomes non-deterministic and only the failed downstream tasks 
are retried, the resulting repartition leads to duplicate rows as well as missing 
rows. The solution brought by SPARK-23207 is broken.

So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to make 
the entire job fail.

  was:
We discovered an important correctness issue directly linked to SPARK-47024

Even if SPARK-47024 has been considered 'Not a Problem' since it is linked 
directly to floats and double rounding, it can still have drastic impacts 
combined to spark.sql.execution.sortBeforeRepartition set to true (the default)

We consistently reproduced the issue doing a GROUP BY with a SUM of float or 
double aggreagtion, followed by a repartition (common case to produce bigger 
files as output, either triggered by SQL hints or extensions like kyuubi). 

If the repartition stage fails with Fetch Failed Exception for only few tasks, 
spark decides to recompute the partitions from the previous stage for which 
output could not be fetched and will retry only the failed partitions 
downstream.

Because block fetch order is indeterministic, the new before-shuffle partition 
computation can provide a slightly different value for a float/double sum 
aggregation. We noticed a 1 bit difference in all of our attempts. The sort 
performed before repartition uses UnsafeRow.hashcode for the row prefix which 
will be completely different even with such 1 bit difference, leading to the 
sort being completely different in the new before-shuffle partition and thus 
destination partition for the shuffled rows completely different as well.

Because sort becomes undeterministic and since only the failed dowstream tasks 
are retried the resulting repartition will lead to duplicate rows as well as 
missing rows. The solution brought by SPARK-23207 is broken.

So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to 
make the entire job fail.


> Rounding issues with sum of floats/doubles leads to incorrect data after 
> repartition
> 
>
> Key: SPARK-47520
> URL: https://issues.apache.org/jira/browse/SPARK-47520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.3.2, 3.5.0
>Reporter: William Montaz
>Priority: Major
>  Labels: correctness
>
> We discovered an important correctness issue directly linked to SPARK-47024.
> Even though SPARK-47024 has been considered 'Not a Problem' since it is 
> directly linked to float and double rounding, it can still have drastic 
> impacts when combined with spark.sql.execution.sortBeforeRepartition set to 
> true (the default).
> We consistently reproduced the issue with a GROUP BY performing a SUM 
> aggregation of floats or doubles, followed by a repartition (a common pattern 
> to produce bigger output files, triggered either by SQL hints or by extensions 
> like Kyuubi).
> If the repartition stage fails with a FetchFailedException for only a few 
> tasks, Spark decides to recompute the partitions of the previous stage whose 
> output could not be fetched and will retry only the failed partitions 
> downstream.
> Because block fetch order is non-deterministic, the new upstream partition 
> computation can produce a 

[jira] [Updated] (SPARK-47520) Rounding issues with sum of floats/doubles leads to incorrect data after repartition

2024-03-22 Thread William Montaz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-47520:
---
Description: 
We discovered an important correctness issue directly linked to SPARK-47024.

Even though SPARK-47024 has been considered 'Not a Problem' since it is directly 
linked to float and double rounding, it can still have drastic impacts when 
combined with spark.sql.execution.sortBeforeRepartition set to true (the default).

We consistently reproduced the issue with a GROUP BY performing a SUM aggregation 
of floats or doubles, followed by a repartition (a common pattern to produce 
bigger output files, triggered either by SQL hints or by extensions like Kyuubi).

If the repartition stage fails with a FetchFailedException for only a few tasks, 
Spark decides to recompute the partitions of the previous stage whose output 
could not be fetched and will retry only the failed partitions downstream.

Because block fetch order is non-deterministic, the new upstream partition 
computation can produce a slightly different value for a float/double sum 
aggregation. We noticed a 1-bit difference in the UnsafeRow backing byte array in 
all of our attempts. The sort performed before repartition uses 
UnsafeRow.hashCode for the row prefix, which is completely different even with 
such a 1-bit difference, so the sort in the new upstream partition is completely 
different and the target downstream partitions of the shuffled rows are 
completely different as well.

Because the sort becomes non-deterministic and only the failed downstream tasks 
are retried, the resulting repartition leads to duplicate rows as well as missing 
rows. The solution brought by SPARK-23207 is broken.

So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to make 
the entire job fail.

  was:
We discovered an important correctness issue directly linked to SPARK-47024

Even if SPARK-47024 has been considered 'Not a Problem' since it is linked 
directly to floats and double rounding, it can still have drastic impacts 
combined to spark.sql.execution.sortBeforeRepartition set to true (the default)

We consistently reproduced the issue doing a GROUP BY with a SUM of float or 
double aggregation, followed by a repartition (common case to produce bigger 
files as output, either triggered by SQL hints or extensions like kyuubi). 

If the repartition stage fails with Fetch Failed Exception for only few tasks, 
spark decides to recompute the partitions from the previous stage for which 
output could not be fetched and will retry only the failed partitions 
downstream.

Because block fetch order is indeterministic, the new upstream partition 
computation can provide a slightly different value for a float/double sum 
aggregation. We noticed a 1 bit difference is UnsafeRow backing byte array in 
all of our attempts. The sort performed before repartition uses 
UnsafeRow.hashcode for the row prefix which will be completely different even 
with such 1 bit difference, leading to the sort being completely different in 
the new upstream partition and thus target partition for the shuffled rows 
completely different as well.

Because sort becomes undeterministic and since only the failed dowstream tasks 
are retried the resulting repartition will lead to duplicate rows as well as 
missing rows. The solution brought by SPARK-23207 is broken.

So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to 
make the entire job fail.


> Rounding issues with sum of floats/doubles leads to incorrect data after 
> repartition
> 
>
> Key: SPARK-47520
> URL: https://issues.apache.org/jira/browse/SPARK-47520
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.3.2, 3.5.0
>Reporter: William Montaz
>Priority: Major
>  Labels: correctness
>
> We discovered an important correctness issue directly linked to SPARK-47024.
> Even though SPARK-47024 has been considered 'Not a Problem' since it is 
> directly linked to float and double rounding, it can still have drastic 
> impacts when combined with spark.sql.execution.sortBeforeRepartition set to 
> true (the default).
> We consistently reproduced the issue with a GROUP BY performing a SUM 
> aggregation of floats or doubles, followed by a repartition (a common pattern 
> to produce bigger output files, triggered either by SQL hints or by extensions 
> like Kyuubi).
> If the repartition stage fails with a FetchFailedException for only a few 
> tasks, Spark decides to recompute the partitions of the previous stage whose 
> output could not be fetched and will retry only the failed partitions 
> downstream.
> Because block fetch order is non-deterministic, the new upstream partition 
> computation can produce a 

[jira] [Created] (SPARK-47520) Rounding issues with sum of floats/doubles leads to incorrect data after repartition

2024-03-22 Thread William Montaz (Jira)
William Montaz created SPARK-47520:
--

 Summary: Rounding issues with sum of floats/doubles leads to 
incorrect data after repartition
 Key: SPARK-47520
 URL: https://issues.apache.org/jira/browse/SPARK-47520
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.3.2, 3.4.2
Reporter: William Montaz


We discovered an important correctness issue directly linked to SPARK-47024.

Even though SPARK-47024 has been considered 'Not a Problem' since it is directly 
linked to float and double rounding, it can still have drastic impacts when 
combined with spark.sql.execution.sortBeforeRepartition set to true (the default).

We consistently reproduced the issue with a GROUP BY performing a SUM aggregation 
of floats or doubles, followed by a repartition (a common pattern to produce 
bigger output files, triggered either by SQL hints or by extensions like Kyuubi).

If the repartition stage fails with a FetchFailedException for only a few tasks, 
Spark decides to recompute the partitions of the previous stage whose output 
could not be fetched and will retry only the failed partitions downstream.

Because block fetch order is non-deterministic, the new before-shuffle partition 
computation can produce a slightly different value for a float/double sum 
aggregation. We noticed a 1-bit difference in all of our attempts. The sort 
performed before repartition uses UnsafeRow.hashCode for the row prefix, which is 
completely different even with such a 1-bit difference, so the sort in the new 
before-shuffle partition is completely different and the destination partitions 
of the shuffled rows are completely different as well.

Because the sort becomes non-deterministic and only the failed downstream tasks 
are retried, the resulting repartition leads to duplicate rows as well as missing 
rows. The solution brought by SPARK-23207 is broken.

So far, we can only deactivate spark.sql.execution.sortBeforeRepartition to make 
the entire job fail.
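
As a rough standalone illustration of the mechanism (not Spark's actual UnsafeRow 
or prefix-sort code), the sketch below shows that summing the same doubles in a 
different order can shift the result by one ULP, and that a Murmur3-style hash 
over the resulting bytes then diverges completely:

{code:scala}
import java.lang.{Double => JDouble}
import java.nio.ByteBuffer
import scala.util.hashing.MurmurHash3

// Bit pattern and an illustrative Murmur3 hash of a double's 8 bytes.
def bitsOf(d: Double): Long = JDouble.doubleToRawLongBits(d)
def hashOf(d: Double): Int =
  MurmurHash3.bytesHash(ByteBuffer.allocate(8).putLong(bitsOf(d)).array(), 42)

val values = Seq(0.1, 0.2, 0.3)
val sumOrderA = values.foldLeft(0.0)(_ + _)         // 0.6000000000000001
val sumOrderB = values.reverse.foldLeft(0.0)(_ + _) // 0.6

// The two sums differ only in the last bits, yet their hashes are unrelated,
// which is what reorders the sort prefix after a stage retry.
println(f"order A: $sumOrderA%.17g  bits: ${bitsOf(sumOrderA)}%016x  hash: ${hashOf(sumOrderA)}")
println(f"order B: $sumOrderB%.17g  bits: ${bitsOf(sumOrderB)}%016x  hash: ${hashOf(sumOrderB)}")
{code}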



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25975) Spark History does not display necessarily the incomplete applications when requested

2018-11-08 Thread William Montaz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679788#comment-16679788
 ] 

William Montaz commented on SPARK-25975:


Associated pull request https://github.com/apache/spark/pull/22981

> Spark History does not display necessarily the incomplete applications when 
> requested
> -
>
> Key: SPARK-25975
> URL: https://issues.apache.org/jira/browse/SPARK-25975
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Minor
> Attachments: fix.patch
>
>
> Filtering of incomplete applications is done in JavaScript against the 
> response returned by the API. The problem is that if the returned result is 
> not large enough (because of spark.history.ui.maxApplications), it might not 
> contain the incomplete applications.
> To fix this issue, we can call the API with status RUNNING or COMPLETED 
> depending on the view.
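
A hedged sketch of that suggestion against the History Server REST API (the host 
and port are hypothetical): the status filter is applied server-side instead of 
in the page's JavaScript.

{code:scala}
import scala.io.Source

val historyServer = "http://history-server:18080" // hypothetical address

// Running (incomplete) applications for the "incomplete" view.
val incompleteApps =
  Source.fromURL(s"$historyServer/api/v1/applications?status=running").mkString

// Completed applications for the default view.
val completedApps =
  Source.fromURL(s"$historyServer/api/v1/applications?status=completed").mkString

println(incompleteApps) // JSON array, no client-side filtering required
{code}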



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25975) Spark History does not display necessarily the incomplete applications when requested

2018-11-08 Thread William Montaz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-25975:
---
Attachment: fix.patch

> Spark History does not display necessarily the incomplete applications when 
> requested
> -
>
> Key: SPARK-25975
> URL: https://issues.apache.org/jira/browse/SPARK-25975
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Minor
> Attachments: fix.patch
>
>
> Filtering of incomplete applications is done in JavaScript against the 
> response returned by the API. The problem is that if the returned result is 
> not large enough (because of spark.history.ui.maxApplications), it might not 
> contain the incomplete applications.
> To fix this issue, we can call the API with status RUNNING or COMPLETED 
> depending on the view.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25973) Spark History Main page performance improvement

2018-11-08 Thread William Montaz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679772#comment-16679772
 ] 

William Montaz commented on SPARK-25973:


Ok created https://github.com/apache/spark/pull/22980

> Spark History Main page performance improvement
> ---
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Minor
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on whether 
> it is displaying incomplete or complete applications) to check if it must 
> display the dataTable.
> Since it only checks whether allAppsSize > 0, we could use the exists method 
> on the iterator. This way we stop iterating at the first occurrence found.
>  
>  
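
A simplified sketch of the suggested change (illustrative types, not the actual 
HistoryPage.scala code): exists short-circuits at the first match instead of 
counting every application.

{code:scala}
case class AppInfo(id: String, completed: Boolean)

def shouldDisplayTable(apps: Iterator[AppInfo], showCompleted: Boolean): Boolean = {
  // Before: apps.count(_.completed == showCompleted) > 0  -- walks the whole iterator.
  // After: stop iterating at the first matching application.
  apps.exists(_.completed == showCompleted)
}
{code}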



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25973) Spark History Main page performance improvement

2018-11-08 Thread William Montaz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679853#comment-16679853
 ] 

William Montaz commented on SPARK-25973:


New pull request on master branch https://github.com/apache/spark/pull/22982

> Spark History Main page performance improvement
> ---
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Minor
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on whether 
> it is displaying incomplete or complete applications) to check if it must 
> display the dataTable.
> Since it only checks whether allAppsSize > 0, we could use the exists method 
> on the iterator. This way we stop iterating at the first occurrence found.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25973) Spark History Main page performance improvement

2018-11-08 Thread William Montaz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-25973:
---
Priority: Minor  (was: Major)

> Spark History Main page performance improvement
> ---
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Minor
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on whether 
> it is displaying incomplete or complete applications) to check if it must 
> display the dataTable.
> Since it only checks whether allAppsSize > 0, we could use the exists method 
> on the iterator. This way we stop iterating at the first occurrence found.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25975) Spark History does not display necessarily the incomplete applications when requested

2018-11-08 Thread William Montaz (JIRA)
William Montaz created SPARK-25975:
--

 Summary: Spark History does not display necessarily the incomplete 
applications when requested
 Key: SPARK-25975
 URL: https://issues.apache.org/jira/browse/SPARK-25975
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.3.2
Reporter: William Montaz
 Attachments: fix.patch

Filtering of incomplete applications is done in JavaScript against the response 
returned by the API. The problem is that if the returned result is not large 
enough (because of spark.history.ui.maxApplications), it might not contain the 
incomplete applications.

To fix this issue, we can call the API with status RUNNING or COMPLETED depending 
on the view.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25973) Spark History Main page performance improvment

2018-11-08 Thread William Montaz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-25973:
---
Attachment: fix.patch

> Spark History Main page performance improvment
> --
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Major
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on whether 
> it is displaying incomplete or complete applications) to check if it must 
> display the dataTable.
> Since it only checks whether allAppsSize > 0, we could use the exists method 
> on the iterator. This way we stop iterating at the first occurrence found.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25973) Spark History Main page performance improvment

2018-11-08 Thread William Montaz (JIRA)
William Montaz created SPARK-25973:
--

 Summary: Spark History Main page performance improvment
 Key: SPARK-25973
 URL: https://issues.apache.org/jira/browse/SPARK-25973
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.3.2
Reporter: William Montaz


HistoryPage.scala counts applications (with a predicate depending on whether it 
is displaying incomplete or complete applications) to check if it must display 
the dataTable.

Since it only checks whether allAppsSize > 0, we could use the exists method on 
the iterator. This way we stop iterating at the first occurrence found.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25973) Spark History Main page performance improvement

2018-11-08 Thread William Montaz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-25973:
---
Summary: Spark History Main page performance improvement  (was: Spark 
History Main page performance improvment)

> Spark History Main page performance improvement
> ---
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Major
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on whether 
> it is displaying incomplete or complete applications) to check if it must 
> display the dataTable.
> Since it only checks whether allAppsSize > 0, we could use the exists method 
> on the iterator. This way we stop iterating at the first occurrence found.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25973) Spark History Main page performance improvement

2018-11-08 Thread William Montaz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-25973:
---
Description: 
HistoryPage.scala counts applications (with a predicate depending on whether it 
is displaying incomplete or complete applications) to check if it must display 
the dataTable.

Since it only checks whether allAppsSize > 0, we could use the exists method on 
the iterator. This way we stop iterating at the first occurrence found.

 

 

  was:
HistoryPage.scala counts applications (with a predicate depending on if it is 
displaying incomplete or complete applications) to check if it must display the 
dataTable.

Since it only checks if allAppsSize > 0, we could use exists method on the 
iterator> This way we stop iterating at the first occurence found.

 

 


> Spark History Main page performance improvement
> ---
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Minor
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on whether 
> it is displaying incomplete or complete applications) to check if it must 
> display the dataTable.
> Since it only checks whether allAppsSize > 0, we could use the exists method 
> on the iterator. This way we stop iterating at the first occurrence found.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24150) Race condition in FsHistoryProvider

2018-05-02 Thread William Montaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-24150:
---
Description: 
There exists a race condition in the checkLogs method between threads of 
replayExecutor. They use the field "applications" to synchronise, but they also 
update that field.

The problem is that the threads will eventually synchronise on different monitors 
(because they synchronise on different objects whose references have been 
assigned to "applications"), breaking the initial synchronisation intent. This 
has an even greater chance of occurring when number_new_log_files > 
replayExecutor_pool_size.

If such a log disappears (it will not be present in the "applications" list), it 
will be impossible to read it from the UI (being in the "applications" list is a 
mandatory check to avoid getting a 404).

Workaround:
 * use a permanent object as a monitor on which to synchronise (or synchronise 
on `this`)
 * keep volatile field for all other read accesses

  was:
There exist a race condition in checkLogs method between threads of 
replayExecutor. They use the field "applications" to synchronise, but they also 
update that field.

The problem is that threads will eventually synchronise on different monitors 
(because they will synchronise on different objects which references have been 
assigned to "applications"), breaking the initial synchronisation intent. This 
has even greater chance to reproduce when number_new_log_files > 
replayExecutor_pool_size

Workaround:
 * use a permanent object as a monitor on which to synchronise (or synchronise 
on `this`)
 * keep volatile field for all other read accesses


> Race condition in FsHistoryProvider
> ---
>
> Key: SPARK-24150
> URL: https://issues.apache.org/jira/browse/SPARK-24150
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: William Montaz
>Priority: Major
>
> There exists a race condition in the checkLogs method between threads of 
> replayExecutor. They use the field "applications" to synchronise, but they 
> also update that field.
> The problem is that the threads will eventually synchronise on different 
> monitors (because they synchronise on different objects whose references have 
> been assigned to "applications"), breaking the initial synchronisation intent. 
> This has an even greater chance of occurring when number_new_log_files > 
> replayExecutor_pool_size.
> If such a log disappears (it will not be present in the "applications" list), 
> it will be impossible to read it from the UI (being in the "applications" list 
> is a mandatory check to avoid getting a 404).
> Workaround:
>  * use a permanent object as a monitor on which to synchronise (or 
> synchronise on `this`)
>  * keep volatile field for all other read accesses
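
A simplified sketch of the broken pattern and the proposed workaround 
(illustrative class, not the actual FsHistoryProvider code): synchronising on a 
field that is itself reassigned lets threads end up holding different monitors, 
whereas a permanent lock object restores mutual exclusion while reads stay on the 
volatile field.

{code:scala}
class HistoryIndex {
  @volatile private var applications: List[String] = Nil
  private val applicationsLock = new Object // permanent monitor, never reassigned

  // Broken pattern: applications.synchronized { applications = ... }
  // After the reassignment, later threads lock the new list object, not the old one.

  def addApplication(appId: String): Unit = applicationsLock.synchronized {
    applications = appId :: applications // replace the volatile reference under the lock
  }

  def listApplications(): List[String] = applications // lock-free volatile read
}
{code}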



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24150) Race condition in FsHistoryProvider

2018-05-02 Thread William Montaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-24150:
---
Description: 
There exists a race condition in the checkLogs method between threads of 
replayExecutor. They use the field "applications" to synchronise, but they also 
update that field.

The problem is that the threads will eventually synchronise on different monitors 
(because they synchronise on different objects whose references have been 
assigned to "applications"), breaking the initial synchronisation intent. This 
has an even greater chance of occurring when number_new_log_files > 
replayExecutor_pool_size.

Workaround:
 * use a permanent object as a monitor on which to synchronise (or synchronise 
on `this`)
 * keep volatile field for all other read accesses

  was:
There exist a race condition in checkLogs method between threads of 
replayExecutor. They use the field "applications" to synchronise, but they also 
update that field.

The problem is that threads will eventually synchronise on different monitors 
(because they will synchronise on different objects which references that have 
been assigned to "applications"), breaking the initial synchronisation intent. 
This has even greater chance to reproduce when number_new_log_files > 
replayExecutor_pool_size

Workaround:
 * use a permanent object as a monitor on which to synchronise (or synchronise 
on `this`)
 * keep volatile field for all other read accesses


> Race condition in FsHistoryProvider
> ---
>
> Key: SPARK-24150
> URL: https://issues.apache.org/jira/browse/SPARK-24150
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: William Montaz
>Priority: Major
>
> There exists a race condition in the checkLogs method between threads of 
> replayExecutor. They use the field "applications" to synchronise, but they 
> also update that field.
> The problem is that the threads will eventually synchronise on different 
> monitors (because they synchronise on different objects whose references have 
> been assigned to "applications"), breaking the initial synchronisation intent. 
> This has an even greater chance of occurring when number_new_log_files > 
> replayExecutor_pool_size.
> Workaround:
>  * use a permanent object as a monitor on which to synchronise (or 
> synchronise on `this`)
>  * keep volatile field for all other read accesses



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24150) Race condition in FsHistoryProvider

2018-05-02 Thread William Montaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-24150:
---
Priority: Major  (was: Minor)

> Race condition in FsHistoryProvider
> ---
>
> Key: SPARK-24150
> URL: https://issues.apache.org/jira/browse/SPARK-24150
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: William Montaz
>Priority: Major
>
> There exists a race condition in the checkLogs method between threads of 
> replayExecutor. They use the field "applications" to synchronise, but they 
> also update that field.
> The problem is that the threads will eventually synchronise on different 
> monitors (because they synchronise on different objects whose references have 
> been assigned to "applications"), breaking the initial synchronisation intent. 
> This has an even greater chance of occurring when number_new_log_files > 
> replayExecutor_pool_size.
> Workaround:
>  * use a permanent object as a monitor on which to synchronise (or 
> synchronise on `this`)
>  * keep volatile field for all other read accesses



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24150) Race condition in FsHistoryProvider

2018-05-02 Thread William Montaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-24150:
---
Description: 
There exists a race condition in the checkLogs method between threads of 
replayExecutor. They use the field "applications" to synchronise, but they also 
update that field.

The problem is that the threads will eventually synchronise on different monitors 
(because they synchronise on different objects whose references have been 
assigned to "applications"), breaking the initial synchronisation intent. This 
has an even greater chance of occurring when number_new_log_files > 
replayExecutor_pool_size.

Workaround:
 * use a permanent object as a monitor on which to synchronise (or synchronise 
on `this`)
 * keep volatile field for all other read accesses

  was:
There exist a race condition in checkLogs method between threads of 
replayExecutor. They use the field "applications" to synchronise, but they also 
update that field.

The problem is that if the number of tasks (the number of new log files to 
replay and add to the applications list) is greater than the number of threads 
in the pool, threads will eventually synchronise on different monitors (because 
they will synchronise on different objects which references that have been 
assigned to "applications"), breaking the initial synchronisation intent.

Workaround:
 * use a permanent object as a monitor on which to synchronise (or synchronise 
on `this`)
 * keep volatile field for all other read accesses


> Race condition in FsHistoryProvider
> ---
>
> Key: SPARK-24150
> URL: https://issues.apache.org/jira/browse/SPARK-24150
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: William Montaz
>Priority: Minor
>
> There exists a race condition in the checkLogs method between threads of 
> replayExecutor. They use the field "applications" to synchronise, but they 
> also update that field.
> The problem is that the threads will eventually synchronise on different 
> monitors (because they synchronise on different objects whose references have 
> been assigned to "applications"), breaking the initial synchronisation intent. 
> This has an even greater chance of occurring when number_new_log_files > 
> replayExecutor_pool_size.
> Workaround:
>  * use a permanent object as a monitor on which to synchronise (or 
> synchronise on `this`)
>  * keep volatile field for all other read accesses



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24150) Race condition in FsHistoryProvider

2018-05-02 Thread William Montaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-24150:
---
Description: 
There exists a race condition in the checkLogs method between threads of 
replayExecutor. They use the field "applications" to synchronise, but they also 
update that field.

The problem is that if the number of tasks (the number of new log files to replay 
and add to the applications list) is greater than the number of threads in the 
pool, the threads will eventually synchronise on different monitors (because they 
synchronise on different objects whose references have been assigned to 
"applications"), breaking the initial synchronisation intent.

Workaround:
 * use a permanent object as a monitor on which to synchronise (or synchronise 
on `this`)
 * keep volatile field for all other read accesses

  was:
There exist a race condition in checkLogs method between threads of 
replayExecutor. They use the field "applications" to synchronise, but they also 
update that field.

The problem is that if the number of tasks (the number of new log files to 
replay and add to the applications list) is greater than the number of threads 
in the pool, there is a great chance that a thread will try to synchronise on 
an updated version of applications (since it is volatile and updated) while 
some are still being synchronised on an old reference of applications. There 
the race condition happens.

Workaround:
 * use a permanent object as a monitor on which to synchronise (or synchronise 
on `this`)
 * keep volatile field for all other read accesses


> Race condition in FsHistoryProvider
> ---
>
> Key: SPARK-24150
> URL: https://issues.apache.org/jira/browse/SPARK-24150
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: William Montaz
>Priority: Minor
>
> There exists a race condition in the checkLogs method between threads of 
> replayExecutor. They use the field "applications" to synchronise, but they 
> also update that field.
> The problem is that if the number of tasks (the number of new log files to 
> replay and add to the applications list) is greater than the number of threads 
> in the pool, the threads will eventually synchronise on different monitors 
> (because they synchronise on different objects whose references have been 
> assigned to "applications"), breaking the initial synchronisation intent.
> Workaround:
>  * use a permanent object as a monitor on which to synchronise (or 
> synchronise on `this`)
>  * keep volatile field for all other read accesses



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24150) Race condition in FsHistoryProvider

2018-05-02 Thread William Montaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-24150:
---
Description: 
There exists a race condition in the checkLogs method between threads of 
replayExecutor. They use the field "applications" to synchronise, but they also 
update that field.

The problem is that if the number of tasks (the number of new log files to replay 
and add to the applications list) is greater than the number of threads in the 
pool, there is a great chance that a thread will try to synchronise on an updated 
version of "applications" (since it is volatile and gets updated) while others 
are still synchronised on an old reference of "applications". That is where the 
race condition happens.

Workaround:
 * use a permanent object as a monitor on which to synchronise (or synchronise 
on `this`)
 * keep volatile field for all other read accesses

  was:
There exist a race condition between the method checkLogs and cleanLogs.

cleanLogs can read the field applications while it is concurrently processed by 
checkLogs. It is possible that checkLogs added new fetched logs, sets 
applications and this is erased by cleanLogs having an old version of 
applications. The problem is that the fetched log won't appear in applications 
anymore and it will then be impossible to display the corresponding application 
in the History Server, since it must be in the LinkedList applications. 

Workaround:
 * use a permanent object as a monitor on which to synchronise
 * keep volatile field for all other read accesses


> Race condition in FsHistoryProvider
> ---
>
> Key: SPARK-24150
> URL: https://issues.apache.org/jira/browse/SPARK-24150
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: William Montaz
>Priority: Minor
>
> There exists a race condition in the checkLogs method between threads of 
> replayExecutor. They use the field "applications" to synchronise, but they 
> also update that field.
> The problem is that if the number of tasks (the number of new log files to 
> replay and add to the applications list) is greater than the number of threads 
> in the pool, there is a great chance that a thread will try to synchronise on 
> an updated version of "applications" (since it is volatile and gets updated) 
> while others are still synchronised on an old reference of "applications". 
> That is where the race condition happens.
> Workaround:
>  * use a permanent object as a monitor on which to synchronise (or 
> synchronise on `this`)
>  * keep volatile field for all other read accesses



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24150) Race condition in FsHistoryProvider

2018-05-02 Thread William Montaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-24150:
---
Description: 
There exists a race condition between the methods checkLogs and cleanLogs.

cleanLogs can read the field applications while it is concurrently processed by 
checkLogs. It is possible that checkLogs adds newly fetched logs and sets 
applications, and this is then erased by cleanLogs holding an old version of 
applications. The problem is that the fetched log will no longer appear in 
applications, and it will then be impossible to display the corresponding 
application in the History Server, since it must be in the LinkedList 
applications.

Workaround:
 * use a permanent object as a monitor on which to synchronise
 * keep volatile field for all other read accesses

  was:
There exist a race condition between the method checkLogs and cleanLogs.

Workaround:
 * use a permanent object as a monitor on which to synchronise
 * keep volatile field for all other read accesses


> Race condition in FsHistoryProvider
> ---
>
> Key: SPARK-24150
> URL: https://issues.apache.org/jira/browse/SPARK-24150
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: William Montaz
>Priority: Minor
>
> There exists a race condition between the methods checkLogs and cleanLogs.
> cleanLogs can read the field applications while it is concurrently processed 
> by checkLogs. It is possible that checkLogs adds newly fetched logs and sets 
> applications, and this is then erased by cleanLogs holding an old version of 
> applications. The problem is that the fetched log will no longer appear in 
> applications, and it will then be impossible to display the corresponding 
> application in the History Server, since it must be in the LinkedList 
> applications.
> Workaround:
>  * use a permanent object as a monitor on which to synchronise
>  * keep volatile field for all other read accesses



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24150) Race condition in FsHistoryProvider

2018-05-02 Thread William Montaz (JIRA)
William Montaz created SPARK-24150:
--

 Summary: Race condition in FsHistoryProvider
 Key: SPARK-24150
 URL: https://issues.apache.org/jira/browse/SPARK-24150
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: William Montaz


There exists a race condition between the methods checkLogs and cleanLogs.

Workaround:
 * use a permanent object as a monitor on which to synchronise
 * keep volatile field for all other read accesses



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org